[Avg. reading time: 6 minutes]
Introduction to Data Formats
What Are Data Formats?
- Data formats define how data is represented on disk or over the wire
- They describe:
- Structure (rows, columns, trees, blocks)
- Encoding (text, binary)
- Schema handling (strict, flexible, embedded, external)
- In Big Data, data formats are not just a storage choice; they are a performance decision
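The text-vs-binary distinction above can be sketched with the Python standard library. This is a minimal illustration, not any particular format: the record, its field names, and the chosen binary layout (`int32`, `float32`, `int64`) are all hypothetical.

```python
import json
import struct

# A hypothetical sensor reading: (sensor_id, temperature, timestamp)
record = (12345, 21.5, 1700000000)

# Text encoding: human-readable and self-describing, but verbose
text = json.dumps(
    {"sensor_id": 12345, "temp": 21.5, "ts": 1700000000}
).encode("utf-8")

# Binary encoding: fixed layout (int32, float32, int64), compact but opaque
binary = struct.pack("<ifq", *record)

print(len(text), len(binary))  # the binary form is a fraction of the size
```

The trade-off is exactly the one real formats make: JSON carries its own field names in every record, while the binary form is unreadable without an external schema describing the layout.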
Why Data Formats Matter in Big Data
- Big Data systems deal with:
- Huge volumes
- Distributed storage
- Parallel processing
- A poor format choice can:
- Waste storage
- Slow down queries by orders of magnitude
- Break downstream systems
Choosing the right format directly impacts:
- Storage efficiency
- Scan speed
- Compression ratio
- CPU usage
- Network I/O
This is why data engineers care about formats more than application developers do.
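One of the levers listed above, compression ratio, is easy to see with stdlib tools alone. The sketch below writes hypothetical, highly repetitive event rows as CSV and gzips them; the row values are made up, but the effect is typical of real log and event data.

```python
import csv
import gzip
import io

# Hypothetical event rows: repeated values compress extremely well
rows = [("click", "eu-west-1", i) for i in range(10_000)]

buf = io.StringIO()
csv.writer(buf).writerows(rows)
raw = buf.getvalue().encode("utf-8")

compressed = gzip.compress(raw)

print(f"raw: {len(raw)} bytes, gzipped: {len(compressed)} bytes")
print(f"ratio: {len(raw) / len(compressed):.1f}x")
```

Columnar formats push this further by grouping similar values together before compressing, which is part of why format choice and compression ratio are so tightly linked.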
Big Data Reality Check
- Data rarely lives in a single database
- Data moves through:
- APIs
- Message queues
- Object storage
- Data lakes
- File formats become the contract between systems
Once data is written in a format, changing it later is expensive.
Data Formats vs Traditional Database Storage
| Feature | Traditional RDBMS | Big Data Formats |
|---|---|---|
| Storage Unit | Tables | Files or streams |
| Schema | Fixed, enforced on write | Often flexible or schema-on-read |
| Access Pattern | Row-based | Row, column, or block-based |
| Optimization | Indexes, transactions | Partitioning, compression, vectorized reads |
| Scale Model | Vertical or limited horizontal | Designed for distributed systems |
| Typical Use | OLTP, dashboards | ETL, analytics, ML pipelines |
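The row-based vs columnar access patterns in the table can be sketched with plain Python lists. The table contents here are hypothetical; the point is only which data each layout forces a query to touch.

```python
# Row-based layout: each record kept together (good for whole-record reads)
row_store = [
    ("alice", 30, "NL"),
    ("bob",   25, "DE"),
    ("carol", 41, "FR"),
]

# Columnar layout: each column kept together (good for analytics scans)
col_store = {
    "name":    ["alice", "bob", "carol"],
    "age":     [30, 25, 41],
    "country": ["NL", "DE", "FR"],
}

# Average age: the row layout walks every record to pick out one field...
avg_row = sum(r[1] for r in row_store) / len(row_store)

# ...while the columnar layout reads only the column it needs
avg_col = sum(col_store["age"]) / len(col_store["age"])

assert avg_row == avg_col == 32.0
```

On disk the difference is bytes read, not loop shape: a columnar file lets the engine skip the `name` and `country` bytes entirely, which is where the order-of-magnitude scan speedups come from.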
Key Shift for Data Engineers
- Databases optimize queries
- Data formats optimize data movement and scanning
- In Big Data:
- Data is written once
- Read many times
- Often by different engines
That’s why formats like CSV, JSON, Avro, Parquet, and ORC exist: each solves a different problem.
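Even between the two simplest formats above, the trade-offs differ. A minimal sketch with hypothetical records, using only the stdlib: CSV is flat and compact but keeps its schema outside the data (here, just a header row), while JSON Lines is self-describing and can nest, at the cost of repeating every field name in every record.

```python
import csv
import io
import json

# Hypothetical flat event records
records = [
    {"user": "alice", "event": "login", "ms": 120},
    {"user": "bob", "event": "purchase", "ms": 340},
]

# CSV: compact and flat; the schema is only the header row
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["user", "event", "ms"])
writer.writeheader()
writer.writerows(records)
csv_bytes = buf.getvalue().encode("utf-8")

# JSON Lines: one self-describing JSON object per line
jsonl_bytes = "\n".join(json.dumps(r) for r in records).encode("utf-8")

print(len(csv_bytes), len(jsonl_bytes))  # JSON pays for repeating field names
```

Avro, Parquet, and ORC each attack a different corner of this trade-off space: compact binary encoding, embedded schemas, and columnar layout.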
What This Chapter Will Cover
- Text vs binary formats
- Row-based vs columnar storage
- Schema-on-write vs schema-on-read
- When formats break at scale
- Why Parquet dominates analytics workloads