[Avg. reading time: 5 minutes]
Avro
Avro is a row-based binary data serialization format designed for data exchange and streaming systems.
Unlike Parquet, which is columnar, Avro is optimized for writing and reading one record at a time.
Why Avro Exists
Many systems need to:
- Send data between producers and consumers
- Handle continuous streams of events
- Evolve data schemas safely over time
Text formats like JSON are:
- Easy to read and debug
- Slow to parse and verbose on the wire (field names are repeated in every record)
Avro solves this with:
- Compact binary encoding
- Strong schema support
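To make the "compact binary encoding" claim concrete, here is a minimal hand-rolled sketch of how Avro lays out a small record on the wire: integers use zigzag varints, strings are a length prefix plus UTF-8 bytes, and field names never appear in the payload (the schema supplies them). The record and field names are illustrative, not from any real system; this follows the Avro spec's primitive encodings but is not the official library.

```python
import json

def zigzag_varint(n: int) -> bytes:
    """Encode a signed integer as Avro's variable-length zigzag format."""
    z = (n << 1) ^ (n >> 63)  # zigzag maps small magnitudes to small unsigned values
    out = bytearray()
    while True:
        byte = z & 0x7F
        z >>= 7
        if z:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def encode_string(s: str) -> bytes:
    """Avro strings: zigzag-varint byte length, then the UTF-8 bytes."""
    raw = s.encode("utf-8")
    return zigzag_varint(len(raw)) + raw

record = {"firstName": "Ada", "lastName": "Lovelace", "age": 36}

# Avro writes only the values, in schema order -- no field names in the payload.
avro_payload = (
    encode_string(record["firstName"])
    + encode_string(record["lastName"])
    + zigzag_varint(record["age"])
)
json_payload = json.dumps(record).encode("utf-8")

print(len(avro_payload), "vs", len(json_payload))  # 14 vs 55 bytes
```

The gap widens with more records, since JSON repeats every key while Avro never sends them.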
Key Characteristics
- Row-based format
- Supports schema evolution
- Binary and compact
- Schema-driven
- Designed for interoperability
- Excellent for streaming pipelines
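One way to picture schema evolution: a reader with a newer schema resolves data against the writer's schema, decoding only the fields the writer actually wrote and filling defaults for new fields. The sketch below assumes a hypothetical v1 `Person` schema (firstName, lastName, age) and a v2 that adds an optional `email` with default `null`; the decoder hand-rolls the spec's zigzag-varint and string encodings rather than using the official library.

```python
def read_long(buf: bytes, pos: int):
    """Decode one zigzag varint, returning (value, new_position)."""
    shift, acc = 0, 0
    while True:
        b = buf[pos]
        pos += 1
        acc |= (b & 0x7F) << shift
        if not b & 0x80:  # high bit clear: last byte
            break
        shift += 7
    return (acc >> 1) ^ -(acc & 1), pos  # undo the zigzag mapping

def read_string(buf: bytes, pos: int):
    """Avro strings: varint byte length, then UTF-8 bytes."""
    n, pos = read_long(buf, pos)
    return buf[pos:pos + n].decode("utf-8"), pos + n

# Bytes written by a v1 producer: firstName, lastName, age -- no email field.
old_bytes = b"\x06Ada\x10LovelaceH"

# A v2 reader resolves against the writer's (v1) schema: it decodes the
# fields the writer wrote, then fills the new field's declared default.
pos = 0
first, pos = read_string(old_bytes, pos)
last, pos = read_string(old_bytes, pos)
age, pos = read_long(old_bytes, pos)
person = {"firstName": first, "lastName": last, "age": age, "email": None}
print(person)
```

This is backward compatibility: new readers handle old data. Forward compatibility (old readers ignoring fields added by new writers) works the same way, driven by schema resolution rather than the payload itself.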
Schema in Avro
Avro uses a JSON schema to define the data structure.
The schema:
- Describes fields and data types
- Travels with the data or is shared separately
- Enables backward and forward compatibility
Example schema:
```json
{
  "type": "record",
  "name": "Person",
  "fields": [
    {"name": "firstName", "type": "string"},
    {"name": "lastName", "type": "string"},
    {"name": "age", "type": "int"},
    {"name": "email", "type": ["null", "string"], "default": null}
  ]
}
```
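The schema above fully determines the byte layout: fields are written in declaration order, and the `["null", "string"]` union is encoded as a varint branch index followed by the chosen branch's value. A minimal sketch of that layout, hand-rolling the spec's encodings rather than using the official library:

```python
def zigzag_varint(n: int) -> bytes:
    """Encode a signed integer as Avro's variable-length zigzag format."""
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while True:
        byte = z & 0x7F
        z >>= 7
        if z:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def encode_string(s: str) -> bytes:
    """Avro strings: zigzag-varint byte length, then UTF-8 bytes."""
    raw = s.encode("utf-8")
    return zigzag_varint(len(raw)) + raw

def encode_person(p: dict) -> bytes:
    """Write fields in schema order; no names or delimiters in the payload."""
    out = (
        encode_string(p["firstName"])
        + encode_string(p["lastName"])
        + zigzag_varint(p["age"])
    )
    if p.get("email") is None:
        out += zigzag_varint(0)  # union branch 0 ("null"): index only, no value
    else:
        out += zigzag_varint(1) + encode_string(p["email"])  # branch 1 ("string")
    return out

print(encode_person(
    {"firstName": "Ada", "lastName": "Lovelace", "age": 36, "email": None}
).hex())
```

Note that a null email costs a single byte: the union index. This is why optional fields with `null` defaults are cheap in Avro.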
Where Avro Is Used
- Kafka producers and consumers
- Streaming and real-time pipelines
- Data ingestion layers
- Cross-language data exchange
When NOT to Use Avro
Because Avro stores complete rows, it is a poor fit for column-oriented workloads:
- Analytical queries that scan only a few columns
- Aggregations over large datasets
- Column-level filtering
Avro vs Parquet
| Feature | Avro | Parquet |
|---|---|---|
| Storage Style | Row-based | Columnar |
| Optimized For | Streaming, writes | Analytics, reads |
| Typical Access | One record at a time | Selected columns |
| Compression | Moderate | Very high |
| Common Use | Kafka, ingestion | Data lakes, OLAP |