Setting up and Getting started with Apache Kafka

Jouneid Raza
9 min read · Aug 21, 2021


Installation and Basic concepts of Apache Kafka

Apache Kafka originated at LinkedIn, became an open-sourced Apache project in 2011, and was promoted to a first-class Apache project in 2012. Kafka is written in Scala and Java. It is a publish-subscribe-based, fault-tolerant messaging system that is fast, scalable, and distributed by design.

Apache Kafka is an event streaming platform used to collect, process, store, and integrate data at scale. It has numerous use cases including distributed logging, stream processing, data integration, and pub/sub messaging.

Let’s start with some basic concepts.

What are Events?

An event is any type of action, incident, or change that’s identified or recorded by software or applications. For example, a payment, a website click, or a temperature reading, along with a description of what happened.

Events in Kafka — Key/Value Pairs

Kafka is based on the abstraction of a distributed commit log. By splitting a log into partitions, Kafka is able to scale out systems. As such, Kafka models events as key/value pairs.

Internally, keys and values are just sequences of bytes, but externally in your programming language of choice, they are often structured objects represented in your language’s type system. Kafka famously calls the translation between language types and internal bytes serialization and deserialization. The serialized format is usually JSON, JSON Schema, Avro, or Protobuf.
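To make serialization concrete, here is a minimal sketch in plain Python, assuming JSON as the serialized format. The helper names are illustrative, not part of any Kafka client API; a real client would apply serializers like these before handing records to the broker, which only ever sees bytes.

```python
import json

# Hypothetical serializers, mimicking what a Kafka client applies
# before sending records: language objects in, raw bytes out.
def serialize(obj) -> bytes:
    return json.dumps(obj).encode("utf-8")

def deserialize(data: bytes):
    return json.loads(data.decode("utf-8"))

# An event modeled as a key/value pair: the key identifies the entity,
# the value describes what happened.
key = serialize("sensor-42")
value = serialize({"event": "temperature_reading", "celsius": 21.5})

assert isinstance(key, bytes) and isinstance(value, bytes)  # wire format
assert deserialize(value)["celsius"] == 21.5                # round trip
```

A consumer on the other side runs the reverse translation (deserialization) to get structured objects back out of the bytes it reads.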

What is Kafka Streams?

Kafka Streams is a client library for building applications and microservices, where the input and output data are stored in an Apache Kafka® cluster. It combines the simplicity of writing and deploying standard Java and Scala applications on the client-side with the benefits of Kafka’s server-side cluster technology.

What is a messaging system?

A messaging system is simply the exchange of messages between two or more parties (people, devices, applications, etc.). One party is the sender, another is the receiver, and a medium in between is responsible for carrying the message.

Another pattern is sending one message to a group of recipients (publish-subscribe). It is like multiple people following a page: they all receive a notification whenever an admin posts a new article on the page.

Messaging System in Kafka

The main task of a messaging system is to transfer data from one application to another, so that applications can focus on the data itself without worrying about how to share it.

Distributed messaging is based on reliable message queuing. Messages are queued asynchronously between the messaging system and client applications.

There are two types of messaging patterns available:

  • Point to point messaging system
  • Publish-subscribe messaging system

Point to Point Messaging System

In this messaging system, messages continue to remain in a queue. More than one consumer can consume the messages in the queue but only one consumer can consume a particular message.

After the consumer reads the message in the queue, the message disappears from that queue.

Publish-Subscribe Messaging System

In this messaging system, messages are kept in a topic. Contrary to the point-to-point system, consumers can subscribe to one or more topics and consume every message on those topics. Message producers are known as publishers and consumers are known as subscribers.

Apache Kafka — Fundamentals

Kafka is essentially a commit log with a simple data structure. The Producer API, Consumer API, Streams API, and Connect API are used to interact with the platform, and a Kafka cluster is made up of brokers, producers, consumers, and ZooKeeper. The order of messages within a partition is guaranteed.

In very simple words: a producer is a client that sends data to topics; a topic is a named data store, divided into partitions, where each message gets a unique offset. Brokers manage the partitions and tell producers which topics are available to send data to. ZooKeeper manages the brokers, and consumers read data from the topics.

Let’s look at each component in a little more detail.

Producers.

Producers publish messages to one or more Kafka topics, sending their data to the Kafka brokers. Every time a producer publishes a message to a broker, the broker simply appends the message to the partition’s last segment file.

Topic.

A bunch of messages that belong to a particular category is known as a Topic. Kafka stores data in topics. In addition, we can replicate and partition Topics.

Topics may have many partitions, so they can handle an arbitrary amount of data. Each message within a partition has a unique sequence ID called an offset. Replicas are nothing but backups of a partition: clients never read from or write to replicas directly; they exist to prevent data loss.
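The partition and offset mechanics can be sketched in a few lines. This is a toy in-memory model, not real Kafka; the crc32 hash is just an illustrative stand-in for Kafka's default murmur2-based partitioner.

```python
import zlib

class Topic:
    """A toy topic: a list of append-only partition logs."""
    def __init__(self, name: str, num_partitions: int = 3):
        self.name = name
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, key: bytes, value: bytes) -> tuple[int, int]:
        # Keyed records go to a partition chosen by hashing the key,
        # so all events for one key land in the same partition and
        # stay in order relative to each other.
        p = zlib.crc32(key) % len(self.partitions)
        self.partitions[p].append((key, value))
        offset = len(self.partitions[p]) - 1  # offset = position in that log
        return p, offset

topic = Topic("test")
p1, o1 = topic.append(b"sensor-42", b"21.5")
p2, o2 = topic.append(b"sensor-42", b"21.7")
assert p1 == p2      # same key -> same partition
assert o2 == o1 + 1  # offsets grow sequentially within a partition
```

This is why ordering is guaranteed per partition but not across a whole topic: each partition is its own independent append-only log.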

A leader is the node responsible for all reads and writes for the given partition. Every partition has one server acting as a leader.

A node that follows leader instructions is called a follower. If the leader fails, one of the followers will automatically become the new leader. A follower acts as a normal consumer, pulls messages, and updates its own data store.

Broker.

Brokers are the servers responsible for maintaining the published data. Each broker may host zero or more partitions per topic. When Kafka runs with more than one broker, we call it a Kafka cluster.

Consumers.

Consumers read data from brokers. Consumers subscribe to one or more topics and consume published messages by pulling data from the brokers.

Zookeeper.

With the help of ZooKeeper, Kafka provides the brokers with metadata about the processes running in the system, performs health checks, and handles broker leader election.

Workflow of Pub-Sub Messaging

Following is the stepwise workflow of Pub-Sub Messaging:

  • Producers send messages to a topic at regular intervals.
  • Kafka broker stores all messages in the partitions configured for that particular topic. It ensures the messages are equally shared between partitions. If the producer sends two messages and there are two partitions, Kafka will store one message in the first partition and the second message in the second partition.
  • Consumer subscribes to a specific topic.
  • Once the consumer subscribes to a topic, Kafka will provide the current offset of the topic to the consumer and also saves the offset in the Zookeeper ensemble.
  • The consumer polls Kafka at a regular interval (e.g., 100 ms) for new messages.
  • Once Kafka receives the messages from producers, it forwards these messages to the consumers.
  • Consumers will receive the message and process it.
  • Once the messages are processed, the consumer will send an acknowledgment to the Kafka broker.
  • Once Kafka receives an acknowledgment, it changes the offset to the new value and updates it in ZooKeeper. Since offsets are maintained in ZooKeeper, the consumer can read the next message correctly even after a server outage.
  • The above flow will repeat until the consumer stops the request.
  • The consumer has the option to rewind/skip to the desired offset of a topic at any time and read all the subsequent messages.
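The poll/process/acknowledge loop above can be simulated with a toy single-partition log. All names here are illustrative; in real Kafka the committed offset is stored by the broker (or ZooKeeper in older versions), not in a local dict.

```python
# A toy single-partition log and a consumer that polls, processes,
# and commits its offset, mirroring the workflow above.
log = ["msg-0", "msg-1", "msg-2", "msg-3"]
committed = {"test-topic": 0}  # next offset to read

def poll(topic: str, max_records: int = 2):
    """Return up to max_records (offset, message) pairs from the committed offset."""
    start = committed[topic]
    return list(enumerate(log[start:start + max_records], start=start))

processed = []
while True:
    records = poll("test-topic")
    if not records:
        break  # the consumer stops requesting; the flow ends
    for offset, msg in records:
        processed.append(msg)                 # process the message
        committed["test-topic"] = offset + 1  # acknowledge: advance the offset

assert processed == log

# Rewind: reset the committed offset to re-read from a desired point.
committed["test-topic"] = 2
assert poll("test-topic") == [(2, "msg-2"), (3, "msg-3")]
```

Because acknowledgment is just "advance a stored offset", rewinding or skipping is as cheap as writing a different number back.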

Workflow of Queue Messaging

In a queue messaging system instead of a single consumer, a group of consumers having the same Group ID will subscribe to a topic. In simple terms, consumers subscribing to a topic with the same Group ID are considered as a single group and the messages are shared among them. Let us check the actual workflow of this system.

  • Producers send messages to a topic at a regular interval.
  • Kafka stores all messages in the partitions configured for that particular topic similar to the earlier scenario.
  • A single consumer subscribes to a specific topic, assume Topic-01 with Group ID as Group-1.
  • Kafka interacts with the consumer in the same way as Pub-Sub Messaging until a new consumer subscribes to the same topic, Topic-01 with the same Group ID as Group-1.
  • Once the new consumer arrives, Kafka switches its operation to share mode and shares the data between the two consumers. This sharing will go on until the number of consumers reaches the number of partitions configured for that particular topic.
  • Once the number of consumers exceeds the number of partitions, the new consumer will not receive any further message until any one of the existing consumers unsubscribes. This scenario arises because each consumer in Kafka will be assigned a minimum of one partition and once all the partitions are assigned to the existing consumers, the new consumers will have to wait.
  • This feature is also called a Consumer Group. In the same way, Kafka will provide the best of both systems in a very simple and efficient manner.
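The partition-sharing rule can be sketched with a toy round-robin assignment. This is illustrative only; Kafka's actual assignors (range, round-robin, sticky) are more involved, but the key invariant is the same: each partition goes to exactly one consumer in the group, and surplus consumers sit idle.

```python
# Toy round-robin assignment of partitions to the consumers of one group.
def assign(partitions: list[int], consumers: list[str]) -> dict[str, list[int]]:
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

partitions = [0, 1, 2]
print(assign(partitions, ["c1"]))                    # one consumer gets everything
print(assign(partitions, ["c1", "c2"]))              # work is shared
print(assign(partitions, ["c1", "c2", "c3", "c4"]))  # c4 idles: no partition left
```

This is why the number of partitions caps the useful size of a consumer group.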

Setting Up and Running Apache Kafka on Windows

  1. Install JDK 8. Make sure Java 8 is installed on your system.

Run the following command at the command prompt to confirm:

java -version

2. Download and install the Apache Kafka binaries. Go to Apache Kafka's official download page (https://kafka.apache.org/downloads) and download the binaries.

3. Create a folder on your system and extract the downloaded binaries into it, for example, NewFolder\kafka_2.12-2.8.0.

Inside the Kafka binaries folder, create a data folder, and inside it create two more folders, kafka and zookeeper, to hold the Apache Kafka and ZooKeeper data respectively.

4. Change the default configuration values.

Update the ZooKeeper data directory path in the config/zookeeper.properties configuration file. You have to give an absolute path.

5. Update the Apache Kafka log file path in the config/server.properties configuration file.
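For reference, the two properties to change look roughly like this (the paths assume the binaries were extracted to D:\kafka_2.12-2.8.0; adjust them to your own layout — forward slashes avoid backslash-escaping issues in properties files):

```properties
# config/zookeeper.properties — point ZooKeeper's data dir at the new folder
dataDir=D:/kafka_2.12-2.8.0/data/zookeeper

# config/server.properties — point Kafka's log dir at the new folder
log.dirs=D:/kafka_2.12-2.8.0/data/kafka
```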

The installation and configuration part is done; now we have to start the services. Use a separate command prompt tab for launching each service.

Note: I have extracted the Apache Kafka binaries to the D: drive, so my path will be D:\kafka_2.12-2.8.0.

Adjust the path to match your own drive and Kafka version folder name.

1-Start the Zookeeper

Now it is time to start ZooKeeper from the command prompt. Change your directory to bin\windows and execute zookeeper-server-start.bat with the config/zookeeper.properties configuration file.

cd D:\kafka_2.12-2.8.0\bin\windows

and run this command to launch zookeeper

zookeeper-server-start.bat ../../config/zookeeper.properties

2-Start Apache Kafka

cd D:\kafka_2.12-2.8.0\bin\windows

and run this command to start Apache Kafka server

kafka-server-start.bat ../../config/server.properties

3-Creating Topics

Now create a topic named “test” with a replication factor of 1, as we have only one Kafka server running. Change directory and run the following commands.

cd D:\kafka_2.12-2.8.0\bin\windows

kafka-topics.bat --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test

4-Create producer

cd D:\kafka_2.12-2.8.0\bin\windows

kafka-console-producer.bat --broker-list localhost:9092 --topic test

5-Create consumer

cd D:\kafka_2.12-2.8.0\bin\windows

kafka-console-consumer.bat --bootstrap-server localhost:9092 --topic test

Now type anything in the producer command prompt and press Enter, and you should see the message appear in the consumer command prompt.

Some Other Useful Commands

  1. List topics: kafka-topics.bat --list --zookeeper localhost:2181
  2. Describe a topic: kafka-topics.bat --describe --zookeeper localhost:2181 --topic [Topic Name]
  3. Read messages from the beginning:
     • Before version 2.0: kafka-console-consumer.bat --zookeeper localhost:2181 --topic [Topic Name] --from-beginning
     • Version 2.0 and later: kafka-console-consumer.bat --bootstrap-server localhost:9092 --topic [Topic Name] --from-beginning
  4. Delete a topic: kafka-run-class.bat kafka.admin.TopicCommand --delete --topic [topic_to_delete] --zookeeper localhost:2181

So that’s all from my side. I hope you now have a basic idea of what Apache Kafka is. CHEERS, Happy Learning :)

Feel free to contact me at:
LinkedIn https://www.linkedin.com/in/junaidraza52/
Instagram https://www.instagram.com/iamjunaidrana/
Whatsapp +92–3225847078
