Apache Kafka — Introduction
What Problem Does Kafka Solve?
When systems must reliably handle millions of events per second, traditional messaging systems tend to run into problems such as:
- Data loss
- Poor scalability
- No easy replay of events
Kafka is built to solve these problems.
What is Kafka?
Apache Kafka is a distributed event streaming platform designed for:
- High throughput
- Fault tolerance
- Real-time data pipelines
At its core, Kafka is:
- A distributed commit log
- A publish-subscribe system
- A replayable event store
Key Characteristics
- High Throughput → Can reach millions of messages per second on a suitably sized cluster
- Scalable → Horizontally scalable across brokers
- Fault-Tolerant → Data replication across servers
- Durable → Messages persisted and replayable
How Kafka Works
- Producer sends a message
- The producer's partitioner assigns it to a partition
- Message gets an offset
- Stored in a broker
- Consumers read using offsets
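The five steps above can be sketched with a toy in-memory log. This is only an illustration, not a Kafka client: `MiniLog` and its methods are hypothetical names, and CRC32 is an assumed stand-in for whatever hash a real producer uses.

```python
import zlib


class MiniLog:
    """Toy in-memory stand-in for one topic (hypothetical, not a Kafka API)."""

    def __init__(self, num_partitions):
        # One append-only list per partition.
        self.partitions = [[] for _ in range(num_partitions)]

    def produce(self, key, value):
        # Steps 1-2: the message arrives and is assigned to a partition.
        partition = zlib.crc32(key.encode()) % len(self.partitions)
        log = self.partitions[partition]
        # Step 3: the offset is the next position in that partition.
        offset = len(log)
        # Step 4: the record is stored (here: appended in memory).
        log.append((key, value))
        return partition, offset

    def consume(self, partition, offset):
        # Step 5: consumers pull records starting from an offset they choose.
        return self.partitions[partition][offset:]


log = MiniLog(num_partitions=3)
p1, o1 = log.produce("user-1", "login")
p2, o2 = log.produce("user-1", "click")

# Same key, so both records share a partition and get offsets 0 and 1.
print(p1 == p2, o1, o2)
# Reading from offset 1 returns only the second record.
print(log.consume(p1, 1))
```

Because the consumer supplies the offset, it can re-read from offset 0 at any time, which is the replay property mentioned earlier.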
Basic Terms
1. Producer
A producer sends data to Kafka.
- Publishes messages to topics
- Can either:
  - Send to a specific partition
  - Let the partitioner decide
Partitioning logic:
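- With key → hash(key) % number of partitions
- Without key → spread across partitions (round-robin in older clients; newer clients fill a batch for one partition before moving to the next)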
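A minimal sketch of this partitioning logic, assuming CRC32 as the hash and plain round-robin for keyless messages. `SketchPartitioner` is a hypothetical class; real clients use their own hash function and keyless policy.

```python
import itertools
import zlib
from typing import Optional


class SketchPartitioner:
    """Hypothetical partitioner illustrating the keyed and keyless cases."""

    def __init__(self, num_partitions: int):
        self.num_partitions = num_partitions
        # Endless 0, 1, ..., n-1, 0, 1, ... sequence for keyless messages.
        self._round_robin = itertools.cycle(range(num_partitions))

    def partition(self, key: Optional[bytes]) -> int:
        if key is not None:
            # With key: hash(key) % partitions (CRC32 is an assumed stand-in).
            return zlib.crc32(key) % self.num_partitions
        # Without key: take the next partition in turn.
        return next(self._round_robin)


partitioner = SketchPartitioner(num_partitions=3)
keyed = [partitioner.partition(b"order-42") for _ in range(3)]
keyless = [partitioner.partition(None) for _ in range(4)]

print(len(set(keyed)))  # 1: the same key always maps to one partition
print(keyless)          # [0, 1, 2, 0]
```

Note that the keyed result depends on the partition count, so adding partitions to a topic can move a key to a different partition.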
2. Topic
A topic is a logical stream where messages are stored.
- Similar to a table or data stream
- Supports multiple consumers
- Append-only (no updates/deletes)
3. Message (Record)
A message is the basic unit of data in Kafka.
Structure:
- Key (optional) → partitioning
- Value → actual data
- Timestamp
- Headers (optional)
Messages are immutable.
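One way to picture this structure is a frozen dataclass. The `Record` class and its field names are illustrative assumptions, not the types of any particular Kafka client.

```python
import time
from dataclasses import dataclass, field
from typing import Optional, Tuple


@dataclass(frozen=True)  # frozen: fields cannot be reassigned after creation
class Record:
    value: bytes                                   # actual data
    key: Optional[bytes] = None                    # optional, used for partitioning
    timestamp_ms: int = field(
        default_factory=lambda: int(time.time() * 1000)
    )                                              # creation time in milliseconds
    headers: Tuple[Tuple[str, bytes], ...] = ()    # optional metadata pairs


rec = Record(value=b'{"event": "login"}', key=b"user-1")
print(rec.key, rec.headers)
```

Trying to assign `rec.value = b"other"` raises `dataclasses.FrozenInstanceError`, which mirrors the fact that a stored message is never modified in place.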
4. Key
The key determines how messages are distributed.
- Same key → same partition (as long as the number of partitions does not change)
- Maintains ordering per key
If no key:
- The producer spreads messages across partitions (round-robin in older clients, batch-by-batch in newer ones)
5. Partition
A partition is a subset of a topic.
- Enables parallelism and scalability
- Append-only and ordered
Important:
- Each message has an offset
- Ordering is guaranteed only within a partition
- No global ordering across topic
6. Broker
A broker is a Kafka server.
Responsibilities:
- Receives messages
- Stores partitions
- Serves consumers
7. Consumer
A consumer reads messages from topics.
- Pull-based model
- Reads using offsets
- Can replay data
8. Consumer Group
A consumer group is a set of consumers working together.
- Each partition is read by only ONE consumer in the group (one consumer may own several partitions)
- Enables parallel processing
Rebalancing:
- Happens when consumers join/leave
- Kafka redistributes partitions
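The ownership rule and the effect of a rebalance can be sketched as a simple assignment function. `assign_partitions` is a hypothetical helper that deals partitions out in turn; Kafka's actual assignment strategies are configurable and more involved.

```python
def assign_partitions(partitions, consumers):
    """Give every partition exactly one owner within the group."""
    if not consumers:
        return {}
    ordered = sorted(consumers)
    assignment = {consumer: [] for consumer in ordered}
    for i, partition in enumerate(sorted(partitions)):
        # Deal partitions out to consumers in turn.
        owner = ordered[i % len(ordered)]
        assignment[owner].append(partition)
    return assignment


# Two consumers share four partitions.
before = assign_partitions(range(4), ["c1", "c2"])
# A third consumer joins, so partitions are redistributed.
after = assign_partitions(range(4), ["c1", "c2", "c3"])

print(before)  # {'c1': [0, 2], 'c2': [1, 3]}
print(after)   # {'c1': [0, 3], 'c2': [1], 'c3': [2]}
```

With more consumers than partitions, some lists would stay empty: those consumers sit idle, which is why the partition count caps the parallelism of a group.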
9. Offset
An offset is a sequential ID that uniquely identifies a message within its partition.
- Starts from 0
- Incremental and immutable
Types:
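- Current Offset → position of the next message the consumer will read
- Committed Offset → last position the consumer saved; reading resumes from here after a restart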
Kafka stores committed offsets in an internal topic: __consumer_offsets
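The difference between the two offset types shows up when a consumer restarts. The sketch below uses a hypothetical `OffsetTracker` over a plain list; it only models the bookkeeping, not how Kafka persists offsets.

```python
class OffsetTracker:
    """Hypothetical model of one consumer's position in one partition."""

    def __init__(self):
        self.position = 0   # current offset: next record to read
        self.committed = 0  # committed offset: where a restart resumes

    def poll(self, log, max_records=2):
        batch = log[self.position:self.position + max_records]
        self.position += len(batch)  # reading advances only the current offset
        return batch

    def commit(self):
        self.committed = self.position  # save progress

    def restart(self):
        self.position = self.committed  # uncommitted progress is forgotten


log = ["a", "b", "c", "d"]
tracker = OffsetTracker()

tracker.poll(log)
tracker.commit()               # position=2, committed=2
tracker.poll(log)              # position=4, committed still 2
tracker.restart()              # back to position=2
replayed = tracker.poll(log)

print(replayed)  # ['c', 'd'] are read again because they were never committed
```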
10. Batches
A batch is a group of messages sent together.
Benefits:
- Better network usage
- Compression
- Faster I/O
Trade-off:
- Larger batch → higher throughput, but higher latency
- Smaller batch → lower latency, but lower throughput
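A rough sketch of the trade-off, counting messages rather than bytes for simplicity. `make_batches` is a hypothetical helper; in a real producer, batching is governed by settings such as `batch.size` (in bytes) and `linger.ms`.

```python
def make_batches(messages, batch_size):
    """Split messages into consecutive groups of at most batch_size."""
    return [
        messages[i:i + batch_size]
        for i in range(0, len(messages), batch_size)
    ]


messages = [f"event-{i}" for i in range(10)]

small = make_batches(messages, batch_size=2)
large = make_batches(messages, batch_size=5)

# Fewer, larger sends mean less per-request overhead, but the first message
# in each batch waits for more messages to arrive before it is sent.
print(len(small))  # 5 sends
print(len(large))  # 2 sends
```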
Brokers, Cluster, and Replication
Broker
- Single Kafka server
- Stores partitions
Cluster
- Multiple brokers working together
- Provides scalability and fault tolerance
Replication
- Partitions are replicated across brokers
- Ensures durability and availability
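To see why replication gives availability, consider a simplified placement that puts each partition's copies on consecutive brokers. `place_replicas` is an assumed, illustrative scheme and not Kafka's actual replica assignment algorithm.

```python
def place_replicas(num_partitions, brokers, replication_factor):
    """Place each partition's replicas on distinct, consecutive brokers."""
    if replication_factor > len(brokers):
        raise ValueError("replication factor cannot exceed broker count")
    placement = {}
    for partition in range(num_partitions):
        placement[partition] = [
            brokers[(partition + r) % len(brokers)]
            for r in range(replication_factor)
        ]
    return placement


placement = place_replicas(3, ["b1", "b2", "b3"], replication_factor=2)
print(placement)  # {0: ['b1', 'b2'], 1: ['b2', 'b3'], 2: ['b3', 'b1']}

# Simulate losing broker b1: every partition still has a surviving copy.
survivors = {
    partition: [b for b in replicas if b != "b1"]
    for partition, replicas in placement.items()
}
print(survivors)  # {0: ['b2'], 1: ['b2', 'b3'], 2: ['b3']}
```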
Message Delivery Semantics
Kafka supports three delivery guarantees:
1. At Most Once
- No duplicates
- Possible data loss
2. At Least Once (Default)
- No data loss
- Possible duplicates
3. Exactly Once
- No duplicates
- No data loss
- Higher overhead
- At Most Once → Fast but risky
- At Least Once → Safe but duplicates
- Exactly Once → Correct but expensive
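The first two guarantees largely come down to whether the consumer commits before or after processing. The simulation below is a sketch under that assumption: `run` is a hypothetical function that injects a single crash and then restarts from the committed offset. Exactly-once is not modeled here.

```python
def run(records, commit_first, crash_at):
    """Process records with one simulated crash at offset crash_at."""
    processed = []
    committed = 0
    position = 0
    crashed = False
    while position < len(records):
        crash_now = position == crash_at and not crashed
        if commit_first:
            # At most once: commit first, then process.
            committed = position + 1
            if crash_now:
                crashed = True
                position = committed  # restart skips the unprocessed record
                continue
            processed.append(records[position])
        else:
            # At least once: process first, then commit.
            processed.append(records[position])
            if crash_now:
                crashed = True
                position = committed  # restart re-reads the processed record
                continue
            committed = position + 1
        position += 1
    return processed


records = ["a", "b", "c"]
at_most_once = run(records, commit_first=True, crash_at=1)
at_least_once = run(records, commit_first=False, crash_at=1)

print(at_most_once)   # ['a', 'c']: "b" is lost
print(at_least_once)  # ['a', 'b', 'b', 'c']: "b" is duplicated
```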
Commit Strategies
- Auto Commit
  - Automatic at intervals
- Manual Commit
  - Controlled by consumer
  - More reliable
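A sketch of why manual commits are considered more reliable: the offset is saved only after the record has been handled successfully. `consume_with_manual_commit`, `handle`, and the `commit` callback are hypothetical stand-ins for a real client's poll loop and commit call. In the Kafka consumer configuration, auto commit is controlled by `enable.auto.commit` and `auto.commit.interval.ms`.

```python
def consume_with_manual_commit(records, handle, commit):
    """Commit each offset only after its record was handled without error."""
    for offset, record in enumerate(records):
        handle(record)      # if this raises, the commit below never runs
        commit(offset + 1)  # save the position of the next record to read


committed = []


def handle(record):
    # Hypothetical processing step that fails on a bad record.
    if record == "bad":
        raise ValueError(f"cannot process {record!r}")


try:
    consume_with_manual_commit(["ok", "bad", "ok"], handle, committed.append)
except ValueError:
    pass  # the consumer would restart from the last committed offset

print(committed)  # [1]: only the first record's progress was saved
```

With interval-based auto commit, an offset may be saved before processing has finished, so a crash at the wrong moment can skip a record.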
Real-World Use Cases
- Log aggregation
- Event-driven microservices
- Real-time analytics
- Fraud detection
- User activity tracking
Summary
Kafka is not just a message queue.
It is a:
- Distributed log
- Streaming backbone
- Real-time data platform
Use Kafka when:
- Scale matters
- Reliability matters
- Real-time processing matters