
Introduction to Data Formats

What Are Data Formats?

  • Data formats define how data is represented on disk or over the wire
  • They describe:
    • Structure (rows, columns, trees, blocks)
    • Encoding (text, binary)
    • Schema handling (strict, flexible, embedded, external)
  • In Big Data, a data format is not just a storage choice; it is also a performance decision
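
To make the encoding and schema points concrete, here is a minimal sketch that writes the same record once as text (JSON) and once as binary (a fixed layout packed with Python's `struct` module). The record, its field names, and the `LAYOUT` string are hypothetical and chosen only for illustration.

```python
import json
import struct

# Hypothetical record, used only for illustration.
record = {"user_id": 42, "temperature": 21.5, "active": True}

# Text encoding: human-readable and self-describing,
# because the field names travel with every record.
text_bytes = json.dumps(record).encode("utf-8")

# Binary encoding: compact, but the layout (the schema) lives outside the data.
# "<id?" means little-endian int32, float64, bool with no padding (13 bytes).
LAYOUT = "<id?"
binary_bytes = struct.pack(
    LAYOUT, record["user_id"], record["temperature"], record["active"]
)

# A reader needs the same external layout to decode the bytes.
decoded = struct.unpack(LAYOUT, binary_bytes)

print(len(text_bytes), "bytes as JSON text")
print(len(binary_bytes), "bytes as packed binary")
print(decoded)
```

The text version can be read without any outside knowledge, while the binary version is smaller but meaningless without the layout. That trade-off is roughly what separates text formats with an embedded structure from binary formats with an external schema.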

Why Data Formats Matter in Big Data

  • Big Data systems deal with:
    • Huge volumes
    • Distributed storage
    • Parallel processing
  • A poor format choice can:
    • Waste storage
    • Slow down queries by orders of magnitude
    • Break downstream systems

Choosing the right format directly impacts:

  • Storage efficiency
  • Scan speed
  • Compression ratio
  • CPU usage
  • Network I/O

This is why data engineers tend to care about formats more than application developers do.
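
A rough way to see the storage and compression effect is to write the same rows in two text formats and compare sizes. The sketch below uses only the Python standard library; the synthetic `rows` and their column names are assumptions made up for this example, and real datasets will show different ratios.

```python
import csv
import gzip
import io
import json

# Synthetic rows, invented for this comparison.
rows = [
    {"id": i, "country": "US" if i % 2 else "DE", "amount": i * 10}
    for i in range(1000)
]

# JSON Lines: every record repeats its field names.
json_bytes = "\n".join(json.dumps(r) for r in rows).encode("utf-8")

# CSV: field names appear once, in the header.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["id", "country", "amount"])
writer.writeheader()
writer.writerows(rows)
csv_bytes = buf.getvalue().encode("utf-8")

sizes = {
    "json": len(json_bytes),
    "csv": len(csv_bytes),
    "json_gzip": len(gzip.compress(json_bytes)),
    "csv_gzip": len(gzip.compress(csv_bytes)),
}

for name, size in sizes.items():
    print(f"{name:10s} {size:8d} bytes")
```

The same 1,000 rows take noticeably fewer bytes as CSV than as JSON Lines, and compression shrinks both. At Big Data volumes, that difference multiplies across storage, network transfer, and scan time.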

Big Data Reality Check

  • Data rarely lives in a single database
  • Data moves through:
    • APIs
    • Message queues
    • Object storage
    • Data lakes
  • File formats become the contract between systems

Once data is written in a format, changing it later is expensive: existing files have to be rewritten and every downstream reader updated.

Data Formats vs Traditional Database Storage

Feature        | Traditional RDBMS               | Big Data Formats
Storage Unit   | Tables                          | Files or streams
Schema         | Fixed, enforced on write        | Often flexible or schema-on-read
Access Pattern | Row-based                       | Row, column, or block-based
Optimization   | Indexes, transactions           | Partitioning, compression, vectorized reads
Scale Model    | Vertical or limited horizontal  | Designed for distributed systems
Typical Use    | OLTP, dashboards                | ETL, analytics, ML pipelines
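
The Schema row is the one that tends to surprise people coming from databases. As a hedged sketch of schema-on-read, the snippet below stores raw JSON lines as-is and applies a schema only when the data is read. The `SCHEMA` mapping, the sample lines, and the `read_with_schema` helper are all hypothetical; real engines do this with far more machinery.

```python
import json

# Hypothetical schema: field name -> converter, applied at read time.
SCHEMA = {"id": int, "amount": float}

# Raw lines were accepted at write time without any validation.
raw_lines = [
    '{"id": "1", "amount": "9.99"}',
    '{"id": 2, "amount": 5, "note": "extra field"}',
    '{"id": "x", "amount": 1}',
]


def read_with_schema(lines, schema):
    """Parse each line and cast it to the schema; collect lines that fail."""
    good, bad = [], []
    for line in lines:
        obj = json.loads(line)
        try:
            good.append({name: cast(obj[name]) for name, cast in schema.items()})
        except (KeyError, ValueError, TypeError):
            bad.append(line)
    return good, bad


good, bad = read_with_schema(raw_lines, SCHEMA)
print("valid rows:", good)
print("rejected lines:", bad)
```

An RDBMS would have rejected the third record at insert time. With schema-on-read, the bad record is only discovered when someone reads it, which is flexible for writers but pushes validation cost onto every reader.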

Key Shift for Data Engineers

  • Databases optimize queries
  • Data formats optimize data movement and scanning
  • In Big Data:
    • Data is written once
    • Read many times
    • Often by different engines

That’s why formats like CSV, JSON, Avro, Parquet, and ORC exist, each solving a different problem.
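
One of those different problems is scanning. The sketch below holds the same data in a row-oriented layout and a column-oriented layout using plain Python structures. The data and names are invented for illustration, and this only mimics the idea behind row-based versus columnar formats, not their actual on-disk encoding.

```python
# Row-oriented layout: each record keeps all of its fields together.
rows = [{"id": i, "country": "US", "amount": float(i)} for i in range(5)]

# Column-oriented layout: each field is stored as one contiguous list.
columns = {name: [r[name] for r in rows] for name in rows[0]}

# Aggregating one field from rows touches every record in full.
total_from_rows = sum(r["amount"] for r in rows)

# Aggregating from columns touches only the one list that is needed.
total_from_columns = sum(columns["amount"])

print(total_from_rows, total_from_columns)
print(columns["id"])
```

Both layouts give the same answer. The difference is how much data has to be read to get it, which is why analytic engines that scan a few columns over many rows tend to favor columnar layouts.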

What This Chapter Will Cover

  • Text vs binary formats
  • Row-based vs columnar storage
  • Schema-on-write vs schema-on-read
  • When formats break at scale
  • Why Parquet dominates analytics workloads

#bigdata #dataformat #rdbms

Ver 6.0.18

Last change: 2026-03-03