
Avro

Avro is a row-based binary data serialization format designed for data exchange and streaming systems.

Unlike columnar formats such as Parquet, Avro is optimized for writing and reading one record at a time.


Why Avro Exists

Many systems need to:

  • Send data between producers and consumers
  • Handle continuous streams of events
  • Evolve data schemas safely over time

Text formats like JSON are:

  • Easy to read and debug
  • Verbose and slow to parse

Avro solves this with:

  • Compact binary encoding
  • Strong schema support
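
Avro's compactness comes from its wire encoding: ints and longs are zigzag-encoded base-128 varints, and strings are raw UTF-8 prefixed only by a length. A minimal standard-library sketch of those two rules (from the Avro spec) shows where the savings come from compared with JSON:

```python
import json

def zigzag(n: int) -> int:
    """Map signed ints to unsigned so small magnitudes encode small (Avro spec)."""
    return (n << 1) ^ (n >> 63)

def varint(n: int) -> bytes:
    """Base-128 varint: 7 data bits per byte, high bit marks continuation."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

def encode_int(n: int) -> bytes:
    return varint(zigzag(n))

def encode_string(s: str) -> bytes:
    data = s.encode("utf-8")
    return encode_int(len(data)) + data

# A small int costs 1 byte in Avro, versus 11 bytes as a JSON object,
# because the schema (not the payload) carries the field name:
assert encode_int(36) == b"\x48"                      # zigzag(36) = 72 = 0x48
assert len(json.dumps({"age": 36}).encode()) == 11

# Strings carry only a length prefix plus raw UTF-8 -- no quotes, no keys:
assert encode_string("Ada") == b"\x06Ada"             # zigzag(3) = 6
```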

Key Characteristics

  • Row-based format
  • Supports schema evolution
  • Binary and compact
  • Schema-driven
  • Designed for interoperability
  • Excellent for streaming pipelines

Schema in Avro

Avro uses a JSON schema to define data structure.

The schema:

  • Describes fields and data types
  • Travels with the data or is shared separately
  • Enables backward and forward compatibility

Example schema:

{
  "type": "record",
  "name": "Person",
  "fields": [
    {"name": "firstName", "type": "string"},
    {"name": "lastName", "type": "string"},
    {"name": "age", "type": "int"},
    {"name": "email", "type": ["null","string"], "default": null}
  ]
}
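
The nullable email field with its null default is what makes evolution safe: a reader whose schema adds such a field can still decode records written before the field existed. A minimal sketch of that resolution rule (hypothetical helper, not the avro library's API):

```python
# Avro schema resolution for added fields: when the writer's schema lacks a
# field the reader expects, the reader substitutes the field's default.
def resolve_record(writer_fields, reader_schema, record):
    out = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in writer_fields:
            out[name] = record[name]
        elif "default" in field:
            out[name] = field["default"]
        else:
            raise ValueError(f"no value and no default for {name!r}")
    return out

reader_schema = {
    "type": "record", "name": "Person",
    "fields": [
        {"name": "firstName", "type": "string"},
        {"name": "lastName", "type": "string"},
        {"name": "age", "type": "int"},
        {"name": "email", "type": ["null", "string"], "default": None},
    ],
}

# A record written with an older schema that had no email field:
old = {"firstName": "Ada", "lastName": "Lovelace", "age": 36}
resolved = resolve_record(set(old), reader_schema, old)
assert resolved["email"] is None  # the default fills the missing field
```

This is why a field added without a default breaks compatibility: the reader has nothing to substitute for old records.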

Where Avro Is Used

  • Kafka producers and consumers
  • Streaming and real-time pipelines
  • Data ingestion layers
  • Cross-language data exchange

When NOT to Use Avro

Avro's row-based layout makes it a poor fit for:

  • Analytical queries over large datasets
  • Aggregations over a few columns
  • Column-level filtering
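
A toy in-memory illustration of why such workloads favor columnar layouts: aggregating one field over row-based data touches every record (on disk, every byte of every record), while a columnar layout reads just the one array.

```python
# Row-based layout: a list of complete records (how Avro stores data).
rows = [{"firstName": "Ada", "age": 36}, {"firstName": "Alan", "age": 41}]

# Columnar layout: one contiguous array per field (how Parquet stores data).
cols = {"firstName": ["Ada", "Alan"], "age": [36, 41]}

# Averaging age over rows must visit every record; over columns it
# touches only the "age" array and can skip "firstName" entirely.
avg_rows = sum(r["age"] for r in rows) / len(rows)
avg_cols = sum(cols["age"]) / len(cols["age"])
assert avg_rows == avg_cols == 38.5
```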

Avro vs Parquet

Feature          Avro                  Parquet
Storage Style    Row-based             Columnar
Optimized For    Streaming, writes     Analytics, reads
Typical Access   One record at a time  Selected columns
Compression      Moderate              Very high
Common Use       Kafka, ingestion      Data lakes, OLAP

tags: #dataformat #avro #rowbased

Ver 6.0.18

Last change: 2026-03-03