
All about Kafka Streaming

Lately, I have been hearing a lot about Kafka streaming. Even though I have worked on microservices, I haven't really tackled heavy-data applications. In a previous role we did handle heavy data for health insurance benefits, but that was a very different kind of processing.

Now, with companies like Netflix and Amazon, data streaming has become a major focus. As technology and information grow, it becomes even more important to handle the growing data. In simple terms, web applications should be able to process large data sets with good performance, and the size of a data set should not deter anyone from using an application.

What is Kafka Data Streaming?

Traditionally, we processed large data in batches, but batch processing is not continuous and often falls short in real-time scenarios; for an application like Netflix, it will never work. What's the alternative? Data streaming. Data streaming is the process of sending data sets continuously. It is the backbone of applications like Netflix and Amazon, and for the growing social network platforms, data streaming is at the heart of handling large data.

The streamed data is often used for real-time aggregation and correlation, filtering, or sampling. One of the major benefits of data streaming is that it allows us to view and analyze data in real time.

A few challenges that data streaming often faces:

  1. Scalability
  2. Durability of data
  3. Fault Tolerance

Tools for Data Streaming

There are a bunch of tools available for data streaming. Amazon offers Kinesis, and Apache has a few open-source projects like Kafka, Storm, and Flink. In future posts, I will talk more about Apache Kafka and its usage; here I am just giving a brief idea of it.

Apache Kafka Streaming

Apache Kafka is a distributed, real-time streaming platform. Basically, it allows us to publish and subscribe to streams of records.
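
To make the publish side concrete, here is a minimal sketch of a Java producer that sends a single record. The broker address `localhost:9092` and the `events` topic are assumptions for illustration, not part of any real setup.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class SimpleProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Assumed broker address; replace with your cluster's bootstrap servers.
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            // Publish one record to a hypothetical "events" topic.
            producer.send(new ProducerRecord<>("events", "user-42", "page_view"));
        }
    }
}
```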

There are two main use cases for Apache Kafka:

  1. Building data pipelines in which records are streamed continuously.
  2. Building applications that consume data pipelines and react to them (see the consumer sketch below).
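
As a sketch of the second use case, the following Java consumer subscribes to the same hypothetical `events` topic and reacts to each record as it arrives. The broker address and group id are again assumptions for illustration.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class SimpleConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker
        props.put("group.id", "demo-group");              // hypothetical consumer group
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("events"));
            while (true) {
                // Poll for new records and react to each one.
                ConsumerRecords<String, String> records =
                        consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("key=%s value=%s timestamp=%d%n",
                            record.key(), record.value(), record.timestamp());
                }
            }
        }
    }
}
```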

Above all, the basic cluster idea that Kafka adapted comes from Hadoop. Kafka runs as a cluster of one or more servers that can span multiple data centers, and these clusters store streams of records. Each record in a stream comprises a key, a value, and a timestamp. Kafka provides four main APIs: Producer, Consumer, Streams, and Connector.
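
As a small taste of the Streams API, here is a sketch of a Kafka Streams topology that filters records from the hypothetical `events` topic by value and writes the matches to a `page-views` topic. The topic names and application id are assumptions for illustration.

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class SimpleStream {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "demo-filter");        // hypothetical app id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // assumed broker
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // Read from "events", keep only page views, and write them to "page-views".
        KStream<String, String> events = builder.stream("events");
        events.filter((key, value) -> "page_view".equals(value))
              .to("page-views");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
    }
}
```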

In future posts, I will break down these APIs in detail and show their usage in a sample application.
