[Avg. reading time: 6 minutes]
Introduction to Data Formats
What Are Data Formats?
- Data formats define how data is represented on disk or over the wire
- They describe:
- Structure (rows, columns, trees, blocks)
- Encoding (text, binary)
- Schema handling (strict, flexible, embedded, external)
- In Big Data, data formats are not just a storage choice; they are a performance decision
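The text-vs-binary distinction above can be sketched with the Python standard library. This is a minimal illustration, not any particular format: the record, its field names, and the chosen binary layout (`int32`, `float32`, `int64`) are all hypothetical.

```python
import json
import struct

# A hypothetical sensor reading: (sensor_id, temperature, timestamp)
record = (12345, 21.5, 1700000000)

# Text encoding: human-readable and self-describing, but verbose
text = json.dumps(
    {"sensor_id": 12345, "temp": 21.5, "ts": 1700000000}
).encode("utf-8")

# Binary encoding: fixed layout (int32, float32, int64), compact but opaque
binary = struct.pack("<ifq", *record)

print(len(text), len(binary))  # the binary form is a fraction of the size
```

The trade-off is exactly the one real formats make: JSON carries its own field names in every record, while the binary form is unreadable without an external schema describing the layout.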
Why Data Formats Matter in Big Data
- Big Data systems deal with:
- Huge volumes
- Distributed storage
- Parallel processing
- A poor format choice can:
- Waste storage
- Slow down queries by orders of magnitude
- Break downstream systems
Choosing the right format directly impacts:
- Storage efficiency
- Scan speed
- Compression ratio
- CPU usage
- Network I/O
This is why data engineers care about formats more than application developers do.
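One of the levers listed above, compression ratio, is easy to see with stdlib tools alone. The sketch below writes hypothetical, highly repetitive event rows as CSV and gzips them; the row values are made up, but the effect is typical of real log and event data.

```python
import csv
import gzip
import io

# Hypothetical event rows: repeated values compress extremely well
rows = [("click", "eu-west-1", i) for i in range(10_000)]

buf = io.StringIO()
csv.writer(buf).writerows(rows)
raw = buf.getvalue().encode("utf-8")

compressed = gzip.compress(raw)

print(f"raw: {len(raw)} bytes, gzipped: {len(compressed)} bytes")
print(f"ratio: {len(raw) / len(compressed):.1f}x")
```

Columnar formats push this further by grouping similar values together before compressing, which is part of why format choice and compression ratio are so tightly linked.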
Big Data Reality Check
- Data rarely lives in a single database
- Data moves through:
- APIs
- Message queues
- Object storage
- Data lakes
- File formats become the contract between systems
Once data is written in a format, changing it later is expensive.
Data Formats vs Traditional Database Storage
| Feature | Traditional RDBMS | Big Data Formats |
|---|---|---|
| Storage Unit | Tables | Files or streams |
| Schema | Fixed, enforced on write | Often flexible or schema-on-read |
| Access Pattern | Row-based | Row, column, or block-based |
| Optimization | Indexes, transactions | Partitioning, compression, vectorized reads |
| Scale Model | Vertical or limited horizontal | Designed for distributed systems |
| Typical Use | OLTP, dashboards | ETL, analytics, ML pipelines |
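The row-based vs columnar access patterns in the table can be sketched with plain Python lists. The table contents here are hypothetical; the point is only which data each layout forces a query to touch.

```python
# Row-based layout: each record kept together (good for whole-record reads)
row_store = [
    ("alice", 30, "NL"),
    ("bob",   25, "DE"),
    ("carol", 41, "FR"),
]

# Columnar layout: each column kept together (good for analytics scans)
col_store = {
    "name":    ["alice", "bob", "carol"],
    "age":     [30, 25, 41],
    "country": ["NL", "DE", "FR"],
}

# Average age: the row layout walks every record to pick out one field...
avg_row = sum(r[1] for r in row_store) / len(row_store)

# ...while the columnar layout reads only the column it needs
avg_col = sum(col_store["age"]) / len(col_store["age"])

assert avg_row == avg_col == 32.0
```

On disk the difference is bytes read, not loop shape: a columnar file lets the engine skip the `name` and `country` bytes entirely, which is where the order-of-magnitude scan speedups come from.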
Key Shift for Data Engineers
- Databases optimize queries
- Data formats optimize data movement and scanning
- In Big Data:
- Data is written once
- Read many times
- Often by different engines
That’s why formats like CSV, JSON, Avro, Parquet, and ORC exist: each solves a different problem.
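Even between the two simplest formats above, the trade-offs differ. A minimal sketch with hypothetical records, using only the stdlib: CSV is flat and compact but keeps its schema outside the data (here, just a header row), while JSON Lines is self-describing and can nest, at the cost of repeating every field name in every record.

```python
import csv
import io
import json

# Hypothetical flat event records
records = [
    {"user": "alice", "event": "login", "ms": 120},
    {"user": "bob", "event": "purchase", "ms": 340},
]

# CSV: compact and flat; the schema is only the header row
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["user", "event", "ms"])
writer.writeheader()
writer.writerows(records)
csv_bytes = buf.getvalue().encode("utf-8")

# JSON Lines: one self-describing JSON object per line
jsonl_bytes = "\n".join(json.dumps(r) for r in records).encode("utf-8")

print(len(csv_bytes), len(jsonl_bytes))  # JSON pays for repeating field names
```

Avro, Parquet, and ORC each attack a different corner of this trade-off space: compact binary encoding, embedded schemas, and columnar layout.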
What This Chapter Will Cover
- Text vs binary formats
- Row-based vs columnar storage
- Schema-on-write vs schema-on-read
- When formats break at scale
- Why Parquet dominates analytics workloads