
Apache Kafka — Introduction

What Problem Does Kafka Solve?

When systems must reliably handle millions of events per second, traditional messaging systems often run into three problems:

Data loss
Poor scalability
No easy replay of events

Kafka is built to solve these problems.

What is Kafka?

Apache Kafka is a distributed event streaming platform designed for:

High throughput
Fault tolerance
Real-time data pipelines

At its core, Kafka is:

A distributed commit log
A publish-subscribe system
A replayable event store

Key Characteristics

High Throughput → Can reach millions of messages per second on a suitably sized cluster
Scalable → Horizontally scalable across brokers
Fault-Tolerant → Data replication across servers
Durable → Messages persisted and replayable

How Kafka Works

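A producer sends a message to a topic
The producer's partitioner picks a partition for it
The broker leading that partition appends the message to the partition's log
The message is assigned the next offset in that partition
Consumers read from the partition by offset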
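
The flow above can be sketched with a toy in-memory log. This is not the Kafka API: `ToyLog` and its methods are hypothetical names, and a real broker persists the log to disk and replicates it. The sketch only shows how appending assigns an offset and how reading by offset leaves the log untouched.

```python
class ToyLog:
    """In-memory stand-in for a topic: one append-only list per partition."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, partition, value):
        """Append a value and return the offset it was assigned."""
        log = self.partitions[partition]
        log.append(value)
        return len(log) - 1  # offsets start at 0 and grow by 1

    def read(self, partition, offset, max_records=10):
        """Return records starting at `offset`; nothing is removed from the log."""
        return self.partitions[partition][offset:offset + max_records]


log = ToyLog(num_partitions=2)
first = log.append(0, "order-created")
second = log.append(0, "order-paid")
print(first, second)    # 0 1
print(log.read(0, 0))   # ['order-created', 'order-paid']
print(log.read(0, 1))   # ['order-paid']
```

Because `read` never deletes anything, any consumer can go back to an earlier offset and replay.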

Basic Terms

1. Producer

A producer sends data to Kafka.

Publishes messages to topics
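A producer can:
- Send to a specific partition
- Let the partitioner decide

Default partitioning logic:

With key → hash(key) % number of partitions (the Java client hashes the key bytes with murmur2)
Without key → round-robin in older clients; since Kafka 2.4 the default is "sticky" (fill a batch for one partition, then switch)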
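
A minimal sketch of this partitioning logic. Assumptions: `ToyPartitioner` is a hypothetical class, `zlib.crc32` stands in for the hash function of the real client, and keyless messages use plain round-robin for simplicity.

```python
import zlib
from itertools import count


class ToyPartitioner:
    """Toy partitioner: hash for keyed messages, round-robin for keyless ones."""

    def __init__(self, num_partitions):
        self.num_partitions = num_partitions
        self._next = count()  # running counter used only for keyless messages

    def partition(self, key=None):
        if key is not None:
            # crc32 is a stand-in for the client's real hash function
            return zlib.crc32(key) % self.num_partitions
        return next(self._next) % self.num_partitions


partitioner = ToyPartitioner(num_partitions=3)
print(partitioner.partition(b"user-42") == partitioner.partition(b"user-42"))  # True
print([partitioner.partition() for _ in range(6)])  # [0, 1, 2, 0, 1, 2]
```

Note that the result depends on the partition count, so adding partitions to a topic can move an existing key to a different partition.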

2. Topic

A topic is a logical stream where messages are stored.

Similar to a table or data stream
Supports multiple consumers
Append-only (no in-place updates or deletes; old records are removed only by retention or log compaction)

3. Message (Record)

A message is the basic unit of data in Kafka.

Structure:

Key (optional) → partitioning
Value → actual data
Timestamp
Headers (optional)

Messages are immutable.
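
The record structure can be modelled as a frozen dataclass. This is an illustrative sketch, not the client library's record class: the `Record` name and its field names are assumptions.

```python
import time
from dataclasses import dataclass, field, FrozenInstanceError
from typing import Optional, Tuple


@dataclass(frozen=True)  # frozen=True makes instances immutable
class Record:
    value: bytes                                  # actual data
    key: Optional[bytes] = None                   # optional, used for partitioning
    timestamp_ms: int = field(
        default_factory=lambda: int(time.time() * 1000)
    )
    headers: Tuple[Tuple[str, bytes], ...] = ()   # optional metadata pairs


record = Record(value=b'{"amount": 10}', key=b"user-42")

try:
    record.value = b"changed"
except FrozenInstanceError:
    print("records are immutable")
```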

4. Key

The key determines how messages are distributed.

Same key → same partition
Maintains ordering per key

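If no key:

Kafka spreads messages across partitions (round-robin in older clients, sticky batching since Kafka 2.4), so related messages may land in different partitions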

5. Partition

A partition is a subset of a topic.

Enables parallelism and scalability
Append-only and ordered

Important:

Each message has an offset
Ordering is guaranteed only within a partition
No global ordering across the topic
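
A small sketch of what these ordering rules mean in practice. The key-to-partition mapping is hard-coded and hypothetical. A consumer that happens to read partition 1 before partition 0 sees a different overall order than the producer wrote, yet each key's events stay in order.

```python
# Events in the order the producer sent them: (key, action)
events = [
    ("alice", "login"), ("bob", "login"), ("alice", "pay"),
    ("bob", "logout"), ("alice", "logout"),
]

# Hypothetical key -> partition mapping (a real partitioner would hash the key)
partition_of = {"alice": 0, "bob": 1}

partitions = [[], []]
for key, action in events:
    partitions[partition_of[key]].append((key, action))

# One possible consumption order: all of partition 1, then all of partition 0
consumed = partitions[1] + partitions[0]

print(consumed == events)                           # False
print([a for k, a in consumed if k == "alice"])     # ['login', 'pay', 'logout']
print([a for k, a in consumed if k == "bob"])       # ['login', 'logout']
```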

6. Broker

A broker is a Kafka server.

Responsibilities:

Receives messages
Stores partitions
Serves consumers

7. Consumer

A consumer reads messages from topics.

Pull-based model
Reads using offsets
Can replay data

8. Consumer Group

A consumer group is a set of consumers working together.

Each partition → read by only ONE consumer in the group at a time
Enables parallel processing

Rebalancing:

Happens when consumers join/leave
Kafka redistributes partitions
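
A minimal sketch of partition assignment and rebalancing. The `assign` function is hypothetical and uses a simple round-robin strategy. Kafka itself offers several pluggable assignors (range, round-robin, sticky, cooperative sticky) that are more careful about minimizing movement.

```python
def assign(partitions, consumers):
    """Give every partition exactly one owner, round-robin over sorted consumer IDs."""
    if not consumers:
        raise ValueError("a group needs at least one consumer")
    members = sorted(consumers)
    assignment = {member: [] for member in members}
    for i, partition in enumerate(partitions):
        assignment[members[i % len(members)]].append(partition)
    return assignment


partitions = [0, 1, 2, 3]

before = assign(partitions, ["c1", "c2"])
print(before)   # {'c1': [0, 2], 'c2': [1, 3]}

# c2 leaves the group: a rebalance hands its partitions to the remaining member
after = assign(partitions, ["c1"])
print(after)    # {'c1': [0, 1, 2, 3]}

# More consumers than partitions: the extra consumer gets nothing
print(assign([0], ["c1", "c2"]))   # {'c1': [0], 'c2': []}
```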

9. Offset

An offset is the unique, sequential position of a message within a partition.

Starts from 0
Incremental and immutable

Types:

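Current Offset (position) → offset of the next message the consumer will read
Committed Offset → last position saved; the consumer resumes from here after a restart or rebalance

Kafka stores committed offsets in an internal topic: __consumer_offsets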
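
The difference between the two offsets can be sketched with a toy consumer. `ToyConsumer` is a hypothetical class, not the client API. Here `position` plays the role of the current offset and `committed` is the value that would survive a restart.

```python
class ToyConsumer:
    """Toy consumer over a single partition represented as a Python list."""

    def __init__(self, log, committed=0):
        self.log = log
        self.committed = committed   # saved progress, survives a restart
        self.position = committed    # next offset to read, lives in memory only

    def poll(self, max_records=2):
        batch = self.log[self.position:self.position + max_records]
        self.position += len(batch)
        return batch

    def commit(self):
        self.committed = self.position


log = ["a", "b", "c", "d"]
consumer = ToyConsumer(log)

print(consumer.poll())                          # ['a', 'b']
print(consumer.position, consumer.committed)    # 2 0
consumer.commit()
print(consumer.poll())                          # ['c', 'd']
print(consumer.position, consumer.committed)    # 4 2

# Simulated restart: a new consumer resumes from the committed offset
restarted = ToyConsumer(log, committed=consumer.committed)
print(restarted.poll())                         # ['c', 'd']
```

The last line shows why uncommitted work is read again after a restart.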

10. Batches

A batch is a group of messages sent together.

Benefits:

Better network usage
Compression
Faster I/O

Trade-off:

Larger batch → higher throughput, but higher latency (messages wait for the batch to fill)
Smaller batch → lower latency, but lower throughput
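
A rough sketch of why batching helps, using `zlib` as a stand-in for the compression codecs a real producer can use. The message payloads and the `make_batches` helper are made up for illustration. In a real producer, batching is tuned with settings such as `batch.size`, `linger.ms`, and `compression.type`.

```python
import zlib


def make_batches(messages, batch_size):
    """Split messages into consecutive groups of at most batch_size."""
    return [messages[i:i + batch_size] for i in range(0, len(messages), batch_size)]


def compressed_bytes(batches):
    """Total bytes on the wire if each batch is compressed as one unit."""
    return sum(len(zlib.compress(b"\n".join(batch))) for batch in batches)


messages = [f'{{"user": {i}, "event": "click"}}'.encode() for i in range(100)]

unbatched = make_batches(messages, 1)    # 100 requests, one message each
batched = make_batches(messages, 50)     # 2 requests, 50 messages each

print(len(unbatched), len(batched))                              # 100 2
print(compressed_bytes(batched) < compressed_bytes(unbatched))   # True
```

Similar messages compress much better together than one at a time, and fewer requests means less per-request overhead.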

Brokers, Cluster, and Replication

Broker

Single Kafka server
Stores partitions

Cluster

Multiple brokers working together
Provides scalability and fault tolerance

Replication

Partitions are replicated across brokers
Ensures durability and availability
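
A simplified sketch of how replicas might be spread over brokers. Kafka's real placement logic is more involved (for example, it can take racks into account). The function names, the broker IDs, and the simple "next broker in the ring" rule are all assumptions made for illustration.

```python
def place_replicas(num_partitions, brokers, replication_factor):
    """Place each partition's replicas on consecutive brokers, wrapping around."""
    if replication_factor > len(brokers):
        raise ValueError("replication factor cannot exceed the number of brokers")
    return {
        p: [brokers[(p + r) % len(brokers)] for r in range(replication_factor)]
        for p in range(num_partitions)
    }


def surviving_replicas(placement, failed_broker):
    """Replicas still available for each partition after one broker fails."""
    return {
        p: [b for b in replicas if b != failed_broker]
        for p, replicas in placement.items()
    }


placement = place_replicas(3, ["b1", "b2", "b3"], replication_factor=2)
print(placement)        # {0: ['b1', 'b2'], 1: ['b2', 'b3'], 2: ['b3', 'b1']}

after_failure = surviving_replicas(placement, "b1")
print(after_failure)    # {0: ['b2'], 1: ['b2', 'b3'], 2: ['b3']}
print(all(after_failure.values()))   # True: every partition still has a replica
```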

Message Delivery Semantics

Kafka supports three delivery guarantees:

1. At Most Once

No duplicates
Possible data loss

2. At Least Once (Default)

No data loss
Possible duplicates

3. Exactly Once

No duplicates
No data loss
Higher overhead (relies on idempotent producers and transactions)

At Most Once → Fast but risky
At Least Once → Safe but may produce duplicates
Exactly Once → Correct but expensive
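
The three guarantees can be illustrated with a toy simulation in which the consumer crashes once between its two steps (commit and process). This is a sketch under assumptions: `run` is a hypothetical function, and the `dedupe` flag only models an idempotent sink that ignores offsets it has already applied. Kafka's own exactly-once support works differently, through idempotent producers and transactions.

```python
def run(messages, commit_first, crash_at, dedupe=False):
    """Consume messages, crashing once between the two steps for offset crash_at."""
    processed = []      # side effects visible to the outside world
    applied = set()     # offsets the sink already applied (used when dedupe=True)
    committed = 0       # survives the simulated crash
    crashed = False

    def process(offset):
        if dedupe and offset in applied:
            return      # idempotent sink: ignore a redelivered offset
        applied.add(offset)
        processed.append(messages[offset])

    while committed < len(messages):
        offset = committed              # (re)start from the committed offset
        if commit_first:
            committed = offset + 1      # step 1: commit
        else:
            process(offset)             # step 1: process
        if offset == crash_at and not crashed:
            crashed = True
            continue                    # crash: step 2 never happens
        if commit_first:
            process(offset)             # step 2: process
        else:
            committed = offset + 1      # step 2: commit
    return processed


msgs = ["a", "b", "c"]
print(run(msgs, commit_first=True, crash_at=1))                 # ['a', 'c']
print(run(msgs, commit_first=False, crash_at=1))                # ['a', 'b', 'b', 'c']
print(run(msgs, commit_first=False, crash_at=1, dedupe=True))   # ['a', 'b', 'c']
```

Committing first loses "b" (at most once), processing first repeats "b" (at least once), and processing first with a deduplicating sink yields each message once.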

Commit Strategies

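Auto Commit
- Offsets are committed automatically at a fixed interval (enable.auto.commit, auto.commit.interval.ms)
- Simple, but gives less control over when a message counts as done
Manual Commit
- The application commits offsets itself, typically after processing succeeds
- More reliable, at the cost of extra code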

Real-World Use Cases

Log aggregation
Event-driven microservices
Real-time analytics
Fraud detection
User activity tracking

Summary

Kafka is not just a message queue.

It is a:

Distributed log
Streaming backbone
Real-time data platform

Use Kafka when:

Scale matters
Reliability matters
Real-time processing matters

#kafka #realtime

Ver 6.0.25

Big Data Tools & Techniques