[Avg. reading time: 5 minutes]
Avro
Avro is a row-based binary data serialization format designed for data exchange and streaming systems.
Unlike Parquet, which is columnar, Avro is optimized for writing and reading one record at a time.
Why Avro Exists
Many systems need to:
- Send data between producers and consumers
- Handle continuous streams of events
- Evolve data schemas safely over time
Text formats like JSON are:
- Easy to read and debug
- Slow to parse and verbose on the wire (field names are repeated in every record)
Avro solves this with:
- Compact binary encoding
- Strong schema support
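To make the "compact binary encoding" claim concrete, here is a minimal hand-rolled sketch of how Avro lays out a small record on the wire: integers use zigzag varints, strings are a length prefix plus UTF-8 bytes, and field names never appear in the payload (the schema supplies them). The record and field names are illustrative, not from any real system; this follows the Avro spec's primitive encodings but is not the official library.

```python
import json

def zigzag_varint(n: int) -> bytes:
    """Encode a signed integer as Avro's variable-length zigzag format."""
    z = (n << 1) ^ (n >> 63)  # zigzag maps small magnitudes to small unsigned values
    out = bytearray()
    while True:
        byte = z & 0x7F
        z >>= 7
        if z:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def encode_string(s: str) -> bytes:
    """Avro strings: zigzag-varint byte length, then the UTF-8 bytes."""
    raw = s.encode("utf-8")
    return zigzag_varint(len(raw)) + raw

record = {"firstName": "Ada", "lastName": "Lovelace", "age": 36}

# Avro writes only the values, in schema order -- no field names in the payload.
avro_payload = (
    encode_string(record["firstName"])
    + encode_string(record["lastName"])
    + zigzag_varint(record["age"])
)
json_payload = json.dumps(record).encode("utf-8")

print(len(avro_payload), "vs", len(json_payload))  # 14 vs 55 bytes
```

The gap widens with more records, since JSON repeats every key while Avro never sends them.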
Key Characteristics
- Row-based format
- Supports schema evolution
- Binary and compact
- Schema-driven
- Designed for interoperability
- Excellent for streaming pipelines
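One way to picture schema evolution: a reader with a newer schema resolves data against the writer's schema, decoding only the fields the writer actually wrote and filling defaults for new fields. The sketch below assumes a hypothetical v1 `Person` schema (firstName, lastName, age) and a v2 that adds an optional `email` with default `null`; the decoder hand-rolls the spec's zigzag-varint and string encodings rather than using the official library.

```python
def read_long(buf: bytes, pos: int):
    """Decode one zigzag varint, returning (value, new_position)."""
    shift, acc = 0, 0
    while True:
        b = buf[pos]
        pos += 1
        acc |= (b & 0x7F) << shift
        if not b & 0x80:  # high bit clear: last byte
            break
        shift += 7
    return (acc >> 1) ^ -(acc & 1), pos  # undo the zigzag mapping

def read_string(buf: bytes, pos: int):
    """Avro strings: varint byte length, then UTF-8 bytes."""
    n, pos = read_long(buf, pos)
    return buf[pos:pos + n].decode("utf-8"), pos + n

# Bytes written by a v1 producer: firstName, lastName, age -- no email field.
old_bytes = b"\x06Ada\x10LovelaceH"

# A v2 reader resolves against the writer's (v1) schema: it decodes the
# fields the writer wrote, then fills the new field's declared default.
pos = 0
first, pos = read_string(old_bytes, pos)
last, pos = read_string(old_bytes, pos)
age, pos = read_long(old_bytes, pos)
person = {"firstName": first, "lastName": last, "age": age, "email": None}
print(person)
```

This is backward compatibility: new readers handle old data. Forward compatibility (old readers ignoring fields added by new writers) works the same way, driven by schema resolution rather than the payload itself.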
Schema in Avro
Avro uses a JSON schema to define the data structure.
The schema:
- Describes fields and data types
- Travels with the data or is shared separately
- Enables backward and forward compatibility
Example schema:
```json
{
  "type": "record",
  "name": "Person",
  "fields": [
    {"name": "firstName", "type": "string"},
    {"name": "lastName", "type": "string"},
    {"name": "age", "type": "int"},
    {"name": "email", "type": ["null", "string"], "default": null}
  ]
}
```
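The schema above fully determines the byte layout: fields are written in declaration order, and the `["null", "string"]` union is encoded as a varint branch index followed by the chosen branch's value. A minimal sketch of that layout, hand-rolling the spec's encodings rather than using the official library:

```python
def zigzag_varint(n: int) -> bytes:
    """Encode a signed integer as Avro's variable-length zigzag format."""
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while True:
        byte = z & 0x7F
        z >>= 7
        if z:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def encode_string(s: str) -> bytes:
    """Avro strings: zigzag-varint byte length, then UTF-8 bytes."""
    raw = s.encode("utf-8")
    return zigzag_varint(len(raw)) + raw

def encode_person(p: dict) -> bytes:
    """Write fields in schema order; no names or delimiters in the payload."""
    out = (
        encode_string(p["firstName"])
        + encode_string(p["lastName"])
        + zigzag_varint(p["age"])
    )
    if p.get("email") is None:
        out += zigzag_varint(0)  # union branch 0 ("null"): index only, no value
    else:
        out += zigzag_varint(1) + encode_string(p["email"])  # branch 1 ("string")
    return out

print(encode_person(
    {"firstName": "Ada", "lastName": "Lovelace", "age": 36, "email": None}
).hex())
```

Note that a null email costs a single byte: the union index. This is why optional fields with `null` defaults are cheap in Avro.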
Where Avro Is Used
- Kafka producers and consumers
- Streaming and real-time pipelines
- Data ingestion layers
- Cross-language data exchange
When NOT to Use Avro
Because Avro stores complete rows, it is a poor fit for column-oriented workloads:
- Analytical queries that scan only a few columns
- Aggregations over large datasets
- Column-level filtering
Avro vs Parquet
| Feature | Avro | Parquet |
|---|---|---|
| Storage Style | Row-based | Columnar |
| Optimized For | Streaming, writes | Analytics, reads |
| Typical Access | One record at a time | Selected columns |
| Compression | Moderate | Very high |
| Common Use | Kafka, ingestion | Data lakes, OLAP |