Apache Kafka is a distributed event streaming platform used to build real-time data pipelines and streaming applications. Here’s an overview of its architecture:

1. Producer

  • Role: Producers send records (data) to Kafka topics. Each record consists of an optional key, a value, a timestamp, and optional headers.
  • Example: A web application generating logs can send each log entry to a Kafka topic, as in the sketch after this list.
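
A minimal producer sketch, assuming a broker at localhost:9092 and a hypothetical “web-logs” topic (both illustrative), with the kafka-clients library on the classpath:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class LogProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // The key (here, a user id) determines the partition; the value is the log entry.
            producer.send(new ProducerRecord<>("web-logs", "user-42", "GET /index.html 200"));
        } // close() flushes any buffered records before returning
    }
}
```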

2. Broker

  • Role: Brokers are the Kafka servers that store and serve data. A Kafka cluster typically consists of multiple brokers.
  • Partitioning: Topics in Kafka are divided into partitions, and each partition can be replicated across multiple brokers. Partitioning spreads a topic’s load across the cluster for scalability, while replication provides fault tolerance.
  • Example: If a topic has three partitions and the cluster has three brokers, each broker could store one partition. The sketch below lists the brokers in a cluster.
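
A short sketch that asks the cluster which brokers it contains, using the AdminClient from kafka-clients (the broker address is an assumption):

```java
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.common.Node;

public class ListBrokers {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // any one reachable broker

        try (AdminClient admin = AdminClient.create(props)) {
            // describeCluster() reports the live brokers the cluster knows about.
            for (Node broker : admin.describeCluster().nodes().get()) {
                System.out.printf("broker %d at %s:%d%n", broker.id(), broker.host(), broker.port());
            }
        }
    }
}
```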

3. Topic

  • Role: Topics are the categories in which records are stored. Each topic is split into partitions, and each partition is an ordered, immutable sequence of records.
  • Retention: Kafka lets you configure how long (or how much) data a topic retains before old records are deleted, for example via the retention.ms setting.
  • Example: You might have a “user-activity” topic for tracking user actions in an application; creating such a topic with an explicit retention period is sketched below.
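
A sketch that creates the “user-activity” topic with three partitions, three replicas, and a one-week retention period (all values illustrative):

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address

        try (AdminClient admin = AdminClient.create(props)) {
            NewTopic topic = new NewTopic("user-activity", 3, (short) 3)
                    // retention.ms is in milliseconds; 604800000 ms = 7 days
                    .configs(Map.of("retention.ms", "604800000"));
            admin.createTopics(List.of(topic)).all().get();
        }
    }
}
```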

4. Partition

  • Role: Partitions are the basic unit of parallelism in Kafka. Records within a partition are stored in order, and consumers read from them sequentially.
  • Replication: Each partition is typically replicated across multiple brokers to ensure availability. One of the replicas is the leader, and the others are followers.
  • Example: A topic with four partitions can be read by up to four consumers of the same group in parallel; the key-to-partition mapping that preserves per-key ordering is sketched below.
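
A sketch of the key-to-partition mapping. It reuses the murmur2 hash from kafka-clients’ internal Utils class to mirror what the default partitioner does for keyed records; in application code the producer handles this for you:

```java
import java.nio.charset.StandardCharsets;
import org.apache.kafka.common.utils.Utils;

public class PartitionForKey {
    public static void main(String[] args) {
        int numPartitions = 4; // matches the four-partition example above
        for (String key : new String[] {"user-1", "user-2", "user-3"}) {
            byte[] keyBytes = key.getBytes(StandardCharsets.UTF_8);
            // Same hash the default partitioner applies: every record with this
            // key lands in the same partition, preserving per-key ordering.
            int partition = Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
            System.out.printf("key %s -> partition %d%n", key, partition);
        }
    }
}
```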

5. Consumer

  • Role: Consumers read records from Kafka topics. They can be part of a consumer group, in which the topic’s partitions are divided among the group’s consumers for parallel processing.
  • Consumer Offset: Each consumer tracks its position within a partition using an “offset.” Committed offsets are stored in an internal Kafka topic (__consumer_offsets), so a consumer can resume where it left off after a restart.
  • Example: An analytics system that processes logs could be a consumer of the “user-activity” topic; a minimal poll loop is sketched below.
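
A minimal poll loop, assuming the “user-activity” topic and a hypothetical “analytics” group id:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ActivityConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("group.id", "analytics");               // consumer group (next section)
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("user-activity"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    // offset() is this record's position within its partition
                    System.out.printf("partition %d, offset %d: %s%n",
                            record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}
```

With the default enable.auto.commit=true, offsets are committed periodically in the background; commitSync() gives explicit control when that matters.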

6. Consumer Group

  • Role: A consumer group is a set of consumers that share the work of consuming records from a topic. Each partition is consumed by only one consumer in the group.
  • Load Balancing: Kafka automatically rebalances partition assignments among the consumers in a group as members join or leave.
  • Example: If there are three consumers in a group and the topic has six partitions, each consumer reads from two partitions. The listener sketched below makes these assignments visible.
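
A small listener that logs each rebalance, making the partition assignment visible (a sketch; pass it to subscribe() alongside the topic list):

```java
import java.util.Collection;
import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.common.TopicPartition;

// Logs which partitions this group member owns after every rebalance.
public class LoggingRebalanceListener implements ConsumerRebalanceListener {
    @Override
    public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
        System.out.println("assigned: " + partitions);
    }

    @Override
    public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
        System.out.println("revoked: " + partitions);
    }
}
```

Used as consumer.subscribe(List.of("user-activity"), new LoggingRebalanceListener()), three such consumers in one group would each report two of the six partitions; stopping one triggers a rebalance that spreads its partitions over the remaining two.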

7. ZooKeeper

  • Role: Historically, Apache Kafka used ZooKeeper to manage and coordinate brokers: it stored cluster metadata, broker registrations, and supported leader election.
  • Transition: Kafka has replaced ZooKeeper with KRaft mode (Kafka Raft), which manages metadata natively inside Kafka; KRaft became production-ready in Kafka 3.3, and ZooKeeper support was removed entirely in Kafka 4.0.
  • Example: In a ZooKeeper-based cluster, ZooKeeper records which broker leads each partition and which brokers are alive; each live broker registers itself under the /brokers/ids path, as sketched below.
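
A sketch that lists live brokers from ZooKeeper directly, which only applies to pre-KRaft clusters. It assumes ZooKeeper at localhost:2181 and the org.apache.zookeeper client library:

```java
import org.apache.zookeeper.ZooKeeper;

public class ListBrokerIds {
    public static void main(String[] args) throws Exception {
        // Each live broker registers an ephemeral znode under /brokers/ids.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 10_000, event -> { });
        try {
            System.out.println("registered brokers: " + zk.getChildren("/brokers/ids", false));
        } finally {
            zk.close();
        }
    }
}
```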

8. Replication

  • Role: Kafka ensures data durability and availability through replication. Each partition can be replicated across multiple brokers.
  • Leader and Followers: One replica is the leader for a partition, and the others are followers. By default the leader handles all reads and writes, while followers replicate its data and stay in sync.
  • Example: If the leader’s broker fails, an in-sync follower is promoted to leader; provided an in-sync replica survives (and producers used acks=all), no committed data is lost. The sketch below shows each partition’s leader and in-sync replicas.
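
A sketch that prints each partition’s leader and in-sync replicas (ISR) for the “user-activity” topic, assuming kafka-clients 3.1 or newer:

```java
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.TopicDescription;
import org.apache.kafka.common.TopicPartitionInfo;

public class ShowReplicas {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address

        try (AdminClient admin = AdminClient.create(props)) {
            TopicDescription desc = admin.describeTopics(List.of("user-activity"))
                    .allTopicNames().get().get("user-activity");
            for (TopicPartitionInfo p : desc.partitions()) {
                // leader serves reads/writes; isr = replicas currently caught up
                System.out.printf("partition %d: leader=%s replicas=%s isr=%s%n",
                        p.partition(), p.leader(), p.replicas(), p.isr());
            }
        }
    }
}
```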

9. Kafka Streams

  • Role: Kafka Streams is a client library for building real-time applications that process and analyze data stored in Kafka topics, typically reading from one topic and writing results to another.
  • Example: A real-time monitoring system could use Kafka Streams to aggregate and analyze log data, as in the topology sketched below.
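
A sketch of a Streams topology that counts error lines per key from the hypothetical “web-logs” topic and writes the running counts to an “error-counts” topic (topic names and application id are illustrative):

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Produced;

public class LogMonitor {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "log-monitor");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> logs = builder.stream("web-logs");
        logs.filter((key, line) -> line.contains("ERROR")) // keep only error lines
            .groupByKey()
            .count()                                       // running count per key
            .toStream()
            .to("error-counts", Produced.with(Serdes.String(), Serdes.Long()));

        new KafkaStreams(builder.build(), props).start();
    }
}
```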

10. Connectors

  • Role: Kafka Connect is a framework for streaming data between Kafka and external systems (sources and sinks). A large ecosystem of pre-built connectors exists for databases, file systems, and other systems.
  • Example: You could use Kafka Connect to stream data from a MySQL database into a Kafka topic; registering such a connector through the Connect REST API is sketched below.
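
A sketch that registers such a connector through the Kafka Connect REST API (the JDBC source connector is a separately installed plugin, and all names and connection details here are illustrative):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RegisterConnector {
    public static void main(String[] args) throws Exception {
        String config = """
            {
              "name": "mysql-source",
              "config": {
                "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
                "connection.url": "jdbc:mysql://localhost:3306/shop",
                "mode": "incrementing",
                "incrementing.column.name": "id",
                "topic.prefix": "mysql-"
              }
            }""";

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8083/connectors")) // Connect REST endpoint
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(config))
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```

Once registered, the connector polls the database and writes new rows to topics named after the tables (here with the “mysql-” prefix), with no custom producer code.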