Data Streaming Approach with Apache Flink

Dineth Kariyawasam
4 min read · Mar 21, 2021
Figure 0: Apache Flink logo

This is just the beginning of a series of articles I plan to release about Apache Flink. First, let’s try to understand what Apache Flink is and what exactly it does. Apache Flink is a framework designed to perform stateful computations over unbounded and bounded data streams. Is that too much? Let’s break it down and see what it means. Starting with the term data stream: any kind of data produced as a sequence of events over time forms a data stream. What kind of data, you might wonder? Sensor readings, computer network traffic, ATM transactions: all of these are generated as streams.

Figure 1: A data stream

What are bounded and unbounded streams?

Bounded streams have a defined head and tail: we know exactly where the data stream starts and where it ends.
Unbounded streams have a start, but no defined end. They are endless.
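To make the distinction concrete, here is a tiny framework-free Python sketch (not Flink code): a bounded stream behaves like a finite iterator that can be consumed to the end, while an unbounded stream behaves like an endless generator from which we can only ever take a finite prefix.

```python
import itertools

def bounded_stream():
    # Bounded: a defined start AND end, e.g. an already-written log file.
    return iter([1, 2, 3])

def unbounded_stream():
    # Unbounded: a defined start but no end, e.g. live sensor readings.
    n = 0
    while True:
        yield n
        n += 1

print(list(bounded_stream()))          # can be fully consumed
first_five = list(itertools.islice(unbounded_stream(), 5))
print(first_five)                      # only a finite prefix is ever available
```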

So what Flink basically does is process bounded and unbounded data streams in real time, and it does this at any scale. The other important term is stateful computation. What does that mean? The computation stores information (its state) and uses that information when processing the data stream. For example, computing the average download speed reported by a router over the last second is a stateful computation, because the recent readings must be remembered.
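To illustrate the router example, here is a minimal Python sketch of that stateful computation (plain Python, not Flink code; the class and method names are invented for illustration). The operator keeps the readings from the last second as its state and uses them to compute the average.

```python
from collections import deque

class AverageSpeedOverLastSecond:
    """Stateful operator sketch: remembers the (timestamp, speed)
    readings from the last second and computes their average."""

    def __init__(self, window_seconds=1.0):
        self.window = window_seconds
        self.readings = deque()  # this deque is the operator's "state"

    def on_event(self, timestamp, speed):
        # Update the state with the new reading.
        self.readings.append((timestamp, speed))
        # Evict readings that fell out of the one-second window.
        while self.readings and timestamp - self.readings[0][0] > self.window:
            self.readings.popleft()
        # Compute the result from the current state.
        return sum(s for _, s in self.readings) / len(self.readings)

op = AverageSpeedOverLastSecond()
op.on_event(0.0, 10.0)
op.on_event(0.5, 20.0)
avg = op.on_event(1.2, 30.0)  # the reading at t=0.0 has expired by now
print(avg)  # → 25.0
```

Without that remembered state, each event on its own would not be enough to answer the question; that is exactly what makes the computation stateful.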
Now that we have an understanding of what Flink is and what it does, I would like to highlight a few use cases of Flink.

  1. Anomaly detection
  2. Fraud detection
  3. Analysis of live data
  4. Real-time search index building

Obviously, there are many use cases I haven’t mentioned here. These are all possible because Flink can operate on thousands of cores, processing millions of events per hour without any single point of failure. In short, Flink delivers high throughput and low latency.

Building Blocks for Streaming Applications

There are some fundamentals we have to sort out before diving into Flink application development. We call them the building blocks of streaming applications: streams, state, and time.

  1. Streams — Streams are the fundamental aspect of stream processing. Earlier we discussed streams and the two stream types (bounded and unbounded).
  2. State — Earlier we discussed stateful computations using the router example. To compute the average download speed, the packet sizes observed over the last second must be stored for a while; that stored information is the state. Flink provides several kinds of state for handling this.
  3. Time — Many stream computations are based on time. Flink provides a set of time-related features such as window aggregations, watermarks, etc.

So basically, when we write a streaming application using Flink, these three ingredients are combined in our program.
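As a small taste of how the three building blocks fit together, here is a framework-free Python sketch (not Flink’s API; the function name is invented for illustration) of a tumbling-window count: the input is a stream of timestamped events, the counts dictionary is the state, and the windows are derived from time.

```python
from collections import defaultdict

def tumbling_window_counts(events, window_size):
    """Count events per (key, window) over a bounded stream.

    events: iterable of (event_time, key) pairs  -> the *stream*
    counts: dict of (key, window_start) -> count -> the *state*
    window_size: window length in event-time units -> *time*
    """
    counts = defaultdict(int)
    for event_time, key in events:
        # Assign each event to the tumbling window it falls into.
        window_start = (event_time // window_size) * window_size
        counts[(key, window_start)] += 1
    return dict(counts)

events = [(1, "a"), (2, "b"), (3, "a"), (11, "a")]
print(tumbling_window_counts(events, 10))
# window [0, 10) holds the first three events, window [10, 20) the last one
```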

Flink Architecture

We have now covered the basic concepts of data streaming, so it is time to understand Flink’s architecture as well. I am not diving deep into Flink’s architecture here; that’s for another article. The Flink architecture consists of two main components:

  1. Job Manager
  2. Task Managers

Here the cluster is the root-level component. Inside the cluster, we have one job manager and one or more task managers.

Figure 2: Basic Flink Architecture

The job manager (also known as the master) is the coordinator of the Flink cluster. It coordinates tasks and checkpoints, and it knows what to do when an execution failure happens.
The task managers (also known as workers, or slaves in older terminology) execute the tasks of a dataflow, and they buffer and exchange data streams. Each task manager holds one or more task slots, and the number of slots indicates how many tasks it can process concurrently.

Here a task is a unit of the execution code of the streaming application we write. It is deployed into one of the task slots of a task manager.

There can be more than one task manager in a Flink cluster, but only one job manager operates at a time. If the operating job manager fails, a standby job manager can take over, provided a set of standby job managers has been configured. Communication between the task managers and the job manager flows in both directions.
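For illustration, a minimal sketch of the relevant entries in Flink’s conf/flink-conf.yaml might look like this (the ZooKeeper address and storage directory below are placeholders, not values from this article):

```yaml
# Address the task managers use to reach the (leading) job manager.
jobmanager.rpc.address: localhost

# Number of task slots per task manager, i.e. how many tasks
# one task manager can process concurrently.
taskmanager.numberOfTaskSlots: 2

# Optional: enable standby job managers via ZooKeeper-based
# high availability (placeholder values).
high-availability: zookeeper
high-availability.zookeeper.quorum: localhost:2181
high-availability.storageDir: file:///tmp/flink/ha/
```

With high availability enabled, a standby job manager is elected as the new leader when the active one fails, which is what removes the job manager as a single point of failure.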

Local installation

Finally, we are ready to install Flink in your local environment. Here I am using Flink 1.12.0, which was the latest version at the time of writing. The only prerequisite is a Java 8 or Java 11 installation. Download the 1.12.0 release from the Apache Flink downloads page and untar it.

$ tar -xzf flink-1.12.0-bin-scala_2.11.tgz
$ cd flink-1.12.0-bin-scala_2.11

Now you can start the local Flink cluster and go to localhost:8081 (the default port) to load the Flink dashboard.

$ ./bin/start-cluster.sh
Starting cluster.
Starting standalonesession daemon on host.
Starting taskexecutor daemon on host.
Figure 3: Flink dashboard

When you are finished, you can stop the Flink cluster and all its running components.

$ ./bin/stop-cluster.sh
Stopping taskexecutor daemon (pid: 12368) on host.
Stopping standalonesession daemon (pid: 12095) on host.

That’s it for this article. In the next one, I hope to go into detail about writing our own Flink program. I hope you enjoyed this article. Thank you all…
