
Disclaimer

I want you to know these things.

  • First, you are not behind; you are learning on schedule.
  • Second, feeling like an imposter is normal; it means you are stretching your skills.
  • Third, ignore the online noise. Learning is simple: learn something, think about it, practice it, repeat.
  • Lastly, tools will change, but your ability to learn will stay.

Certificates are good, but projects and understanding matter more. Ask questions, help each other, and don’t do this journey alone.

Required Tools

Install the following software before Week 2.

Windows

Mac

Common Tools


Big Data Overview

  1. Introduction
  2. Job Opportunities
  3. What is Data?
  4. How does it help?
  5. Types of Data
  6. The Big V’s
    1. Variety
    2. Volume
    3. Velocity
    4. Veracity
    5. Other V’s
  7. Trending Technologies
  8. Big Data Concerns
  9. Big Data Challenges
  10. Data Integration
  11. Scaling
  12. CAP Theorem
  13. Optimistic Concurrency
  14. Eventual Consistency
  15. Concurrent vs Parallel
  16. GPL
  17. DSL
  18. Big Data Tools
  19. NoSQL Databases
  20. What does Learning Big Data mean?

#introduction #bigdata #chapter1

Understanding the Big Data Landscape

Expectation in this course

The first set of questions everyone is curious about:

What is Big Data?

When does the data become Big Data?

Why collect so much Data?

How secure is Big Data?

How does it help?

Where can it be stored?

Which Tools are used to handle Big Data?


The second set of questions, to go deeper:

What should I learn?

Does certification help?

Which technology is the best?

How many tools do I need to learn?

Apart from the top 50 corporations, do other companies use Big Data?

#overview #bigdata

Job Opportunities

Roles across On-Prem, Big Data-specific, and Cloud environments:

  • Database Developer
  • Data Engineer
  • Database Administrator
  • Data Architect
  • Database Security Eng.
  • Database Manager
  • Data Analyst
  • Business Intelligence

Database Developer: Designs and writes efficient queries, procedures, and data models for structured databases.

Data Engineer: Builds and maintains scalable data pipelines and ETL processes for large-scale data movement and transformation.

Database Administrator (DBA): Manages and optimizes database systems, ensuring performance, security, and backups.

Data Architect: Defines high-level data strategy and architecture, ensuring alignment with business and technical needs.

Database Security Engineer: Implements and monitors security controls to protect data assets from unauthorized access and breaches.

Database Manager: Oversees database teams and operations, aligning database strategy with organizational goals.

Data Analyst: Interprets data using statistical tools to generate actionable insights for decision-makers.

Business Intelligence (BI) Developer: Creates dashboards, reports, and visualizations to help stakeholders understand data trends and KPIs.

Organizations of all sizes, from small businesses to large enterprises, use Big Data to grow their business.

#jobs #bigdata

What is Data?

Data is simply facts and figures. When processed and contextualized, data becomes information.

Everything is data

  • What we say
  • Where we go
  • What we do

How to measure data?

byte        - 1 letter
1 Kilobyte  - 1024 B
1 Megabyte  - 1024 KB
1 Gigabyte  - 1024 MB
1 Terabyte  - 1024 GB    
(1,099,511,627,776 Bytes)
1 Petabyte  - 1024 TB
1 Exabyte   - 1024 PB
1 Zettabyte - 1024 EB
1 Yottabyte - 1024 ZB
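As a quick sanity check, these powers of 1024 can be computed directly. A minimal Python sketch (illustrative only):

# Each unit is 1024 times the previous one.
units = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]

for power, unit in enumerate(units):
    print(f"1 {unit} = {1024 ** power:,} bytes")

# 1 TB = 1,099,511,627,776 bytes, matching the table above.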

Examples of Traditional Data

  • Banking Records
  • Student Information
  • Employee Profiles
  • Customer Details
  • Sales Transactions

When does Data become Big Data?

When data expands

  • Banking: One bank branch vs. global consolidation (e.g., CitiBank)
  • Education: One college vs. nationwide student data (e.g., US News)
  • Media: Traditional news vs. user-generated content on Social Media

When data gets granular

  • Monitoring CPU/Memory usage every second
  • Cell phone location & usage logs
  • IoT sensor telemetry (temperature, humidity, etc.)
  • Social media posts, reactions, likes
  • Live traffic data from vehicles and sensors

These fine-grained data points fuel powerful analytics and real-time insights.

Why Collect So Much Data?

  • Storage is cheap and abundant
  • Tech has advanced to process massive data efficiently
  • Businesses use data to innovate, predict trends, and grow

#data #bigdata #traditionaldata

How Big Data helps us

From raw blocks to building knowledge, Big Data drives global progress.

Data to Wisdom

Stages

  • Data → scattered observations
  • Information → contextualized
  • Knowledge → structured relationships
  • Insight → patterns emerge
  • Wisdom → actionable strategy

Raw Data to Analysis

Stages

  • Raw Data – Messy, unprocessed
  • Organized – Grouped by category
  • Arranged – Structured to show comparisons
  • Visualized – Charts or graphs
  • Analysis – Final understanding or solution

Big Data Applications: Changing the World

Here are some real-world domains where Big Data is making a difference:

  • Healthcare – Diagnose diseases earlier and personalize treatment
  • Agriculture – Predict crop yield and detect pest outbreaks
  • Space Exploration – Analyze signals from space and optimize missions
  • Disaster Management – Forecast earthquakes, floods, and storms
  • Crime Prevention – Predict and detect crime patterns
  • IoT & Smart Devices – Real-time decision making in smart homes, vehicles, and cities

#bigdata #rawdata #knowledge #analysis

Types of Data

Understanding the types of data is key to processing and analyzing it effectively. Broadly, data falls into two main categories: Quantitative and Qualitative.

Quantitative Data

Quantitative data deals with numbers and measurable forms. It can be further classified as Discrete or Continuous.

  • Measurable values (e.g., memory usage, CPU usage, number of likes, shares, retweets)
  • Collected from the real world
  • Usually closed-ended

Discrete

  • Represented by whole numbers
  • Countable and finite

Example:

  • Number of cameras in a phone
  • Memory size in GB

Qualitative Data

Qualitative data describes qualities or characteristics that can’t be easily measured numerically.

  • Descriptive or abstract
  • Can come from text, audio, or images
  • Collected via interviews, surveys, or observations
  • Usually open-ended

Examples

  • Gender: Male, Female, Non-Binary, etc.
  • Smartphones: iPhone, Pixel, Motorola, etc.

Nominal

Categorical data without any intrinsic order

Examples:

  • Red, Blue, Green
  • Types of fruits: Apple, Banana, Mango

Can you rank them logically? No — that’s what makes them nominal.


graph TD
  A[Types of Data]
  
  A --> B[Quantitative]
  A --> C[Qualitative]
  
  B --> B1[Discrete]
  B --> B2[Continuous]
  
  C --> C1[Nominal]
  C --> C2[Ordinal]

| Category | Subtype | Description | Examples |
|----------|---------|-------------|----------|
| Quantitative | Discrete | Whole numbers, countable | Number of phones, number of users |
| Quantitative | Continuous | Measurable, can take fractional values | Temperature, CPU usage |
| Qualitative | Nominal | Categorical with no natural order | Gender, Colors (Red, Blue, Green) |
| Qualitative | Ordinal | Categorical with a meaningful order | T-shirt sizes (S, M, L), Grades (A, B, C…) |
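As a small, hypothetical illustration of these four subtypes in code (pandas is an assumption here, not something this chapter requires), ordinal data can be modeled as an ordered categorical while nominal data stays unordered:

import pandas as pd

df = pd.DataFrame({
    "num_cameras": [2, 3, 4],            # Quantitative - Discrete (countable whole numbers)
    "cpu_usage": [37.5, 81.2, 12.9],     # Quantitative - Continuous (fractional measurements)
    "color": ["Red", "Blue", "Green"],   # Qualitative - Nominal (no natural order)
    "tshirt_size": ["S", "M", "L"],      # Qualitative - Ordinal (ordered categories)
})

# Nominal: plain categorical, no order implied.
df["color"] = pd.Categorical(df["color"])

# Ordinal: categorical with an explicit order, so S < M < L comparisons make sense.
df["tshirt_size"] = pd.Categorical(df["tshirt_size"], categories=["S", "M", "L"], ordered=True)

print(df.dtypes)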

Abstract Understanding

Some qualitative data comes from non-traditional sources like:

  • Conversations
  • Audio or video files
  • Observations or open-text survey responses

This type of data often requires interpretation before it’s usable in models or analysis.


#quantitative #qualitative #discrete #continuous #nominal #ordinal

The Big V’s of Big Data

#bigv #bigdata

Variety

Variety refers to the different types, formats, and sources of data collected — one of the 5 Vs of Big Data.

Types of Data: By Source

  • Social Media: YouTube, Facebook, LinkedIn, Twitter, Instagram
  • IoT Devices: Sensors, Cameras, Smart Meters, Wearables
  • Finance/Markets: Stock Market, Cryptocurrency, Financial APIs
  • Smart Systems: Smart Cars, Smart TVs, Home Automation
  • Enterprise Systems: ERP, CRM, SCM Logs
  • Public Data: Government Open Data, Weather Stations

Types of Data: By Data Format

  • Structured Data – Organized in rows and columns (e.g., CSV, Excel, RDBMS)
  • Semi-Structured Data – Self-describing but irregular (e.g., JSON, XML, Avro, YAML)
  • Unstructured Data – No fixed schema (e.g., images, audio, video, emails)
  • Binary Data – Encoded, compressed, or serialized data (e.g., Parquet, Protocol Buffers, images, MP3)

Unstructured data files are generally stored in binary format, for example images, video, and audio.

But not all binary files contain unstructured data. Parquet files and executables, for example, are binary yet structured.

Structured Data

Tabular data from databases, spreadsheets.

Example:

  • Relational Table
  • Excel

| ID  | Name           | Join Date  |
|-----|----------------|------------|
| 101 | Rachel Green   | 2020-05-01 |
| 201 | Joey Tribbiani | 1998-07-05 |
| 301 | Monica Geller  | 1999-12-14 |
| 401 | Cosmo Kramer   | 2001-06-05 |

Semi-Structured Data

Data with tags or markers but not strictly tabular.

JSON

[
   {
      "id":1,
      "name":"Rachel Green",
      "gender":"F",
      "series":"Friends"
   },
   {
      "id":"2",
      "name":"Sheldon Cooper",
      "gender":"M",
      "series":"BBT"
   }
]
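If the JSON above were saved as actors.json (a hypothetical file name), it could be loaded with Python's standard json module:

import json

# Load the semi-structured JSON shown above.
with open("actors.json") as f:
    actors = json.load(f)

for actor in actors:
    print(actor["name"], "-", actor["series"])

# Note: "id" is an integer in one record and a string in the other --
# exactly the kind of irregularity that makes this data semi-structured.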

XML

<?xml version="1.0" encoding="UTF-8"?>
<actors>
   <actor>
      <id>1</id>
      <name>Rachel Green</name>
      <gender>F</gender>
      <series>Friends</series>
   </actor>

   <actor>
      <id>2</id>
      <name>Sheldon Cooper</name>
      <gender>M</gender>
      <series>BBT</series>
   </actor>
</actors>

Unstructured Data

Media files, free text, documents, logs – no predefined structure.

Rachel Green acted in Friends series. Her role is very popular. 
Similarly Sheldon Cooper acted in BBT. He acted as nerd physicist.

Types:

  • Images (JPG, PNG)
  • Video (MP4, AVI)
  • Audio (MP3, WAV)
  • Documents (PDF, DOCX)
  • Emails
  • Logs (system logs, server logs)
  • Web scraping content (HTML, raw text)

Note: There are now many LLM-based (AI) tools that help parse unstructured data into tabular form quickly.

#structured #unstructured #semistructured #binary #json #xml #image #bigdata #bigv

Volume

Volume refers to the sheer amount of data generated every second from various sources around the world. It’s one of the core characteristics that makes data big. With the rise of the internet, smartphones, IoT devices, social media, and digital services, the amount of data being produced has reached zettabyte and will soon reach yottabyte scales.

  • YouTube users upload 500+ hours of video every minute.
  • Facebook generates 4 petabytes of data per day.
  • A single connected car can produce 25 GB of data per hour.
  • Enterprises generate terabytes to petabytes of log, transaction, and sensor data daily.

Why It Matters

With the rise of Artificial Intelligence (AI) and especially Large Language Models (LLMs) like ChatGPT, Bard, and Claude, the volume of data being generated, consumed, and required for training is skyrocketing.

  • LLMs need massive training data.

  • LLM-generated content is growing exponentially: blogs, reports, summaries, images, audio, and even code.

  • Storage systems must scale horizontally to handle petabytes or more.

  • Traditional databases can’t manage this scale efficiently.

  • Volume impacts data ingestion, processing speed, query performance, and cost.

  • It influences how data is partitioned, replicated, and compressed in distributed systems.

Data Cycle

#bigdata #volume #bigv

Velocity

Velocity refers to the speed at which data is generated, transmitted, and processed. In the era of Big Data, it’s not just about handling large volumes of data, but also about managing the continuous and rapid flow of data in real-time or near real-time.

High-velocity data comes from various sources such as:

  • Social Media Platforms: Tweets, posts, likes, and shares occurring every second.
  • Sensor Networks: IoT devices transmitting data continuously.
  • Financial Markets: Real-time transaction data and stock price updates.
  • Online Streaming Services: Continuous streaming of audio and video content.
  • E-commerce Platforms: Real-time tracking of user interactions and transactions.

Managing this velocity requires systems capable of:

  • Real-Time Data Processing: Immediate analysis and response to incoming data.
  • Scalability: Handling increasing data speeds without performance degradation.
  • Low Latency: Minimizing delays in data processing and response times.

Velocity Source1

#bigdata #velocity #bigv


1: https://keywordseverywhere.com/blog/data-generated-per-day-stats/

Veracity

Veracity refers to the trustworthiness, quality, and accuracy of data. In the world of Big Data, not all data is created equal — some may be incomplete, inconsistent, outdated, or even deliberately false. The challenge is not just collecting data, but ensuring it’s reliable enough to make sound decisions.

Why Veracity Matters

  • Poor data quality can lead to wrong insights, flawed models, and bad business decisions.

  • With increasing sources (social media, sensors, web scraping), there’s more noise than ever.

  • Real-world data often comes with missing values, duplicates, biases, or outliers.

Key Dimensions of Veracity in Big Data

| Dimension | Description | Example |
|-----------|-------------|---------|
| Trustworthiness | Confidence in the accuracy and authenticity of data. | Verifying customer feedback vs. bot reviews |
| Origin | The source of the data and its lineage or traceability. | Knowing if weather data comes from reliable source |
| Completeness | Whether the dataset has all required fields and values. | Missing values in patient health records |
| Integrity | Ensuring the data hasn’t been altered, corrupted, or tampered with during storage or transfer. | Using checksums to validate data blocks |

How to Tackle Veracity Issues

  • Data Cleaning: Remove duplicates, correct errors, fill missing values.
  • Validation & Verification: Check consistency across sources.
  • Data Provenance: Track where the data came from and how it was transformed.
  • Bias Detection: Identify and reduce systemic bias in training datasets.
  • Robust Models: Build models that can tolerate and adapt to noisy inputs.

Websites & Tools to Generate Sample Data

Highly customizable fake data generator; supports exporting as CSV, JSON, SQL. https://mockaroo.com

Easy UI to create datasets with custom fields like names, dates, numbers, etc. https://www.onlinedatagenerator.com

Apart from these, there are a few data-generating libraries.

https://faker.readthedocs.io/en/master/

https://github.com/databrickslabs/dbldatagen
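A minimal sketch with the faker library (assuming it is installed, e.g. pip install faker):

from faker import Faker

fake = Faker()

# Generate a few fake customer records for testing pipelines
# without exposing real personal data.
for _ in range(3):
    print(fake.name(), "|", fake.email(), "|", fake.city())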

Questions

Is generating fake data good or bad?

When we have real data, why generate fake data?

#bigv #veracity #bigdata

Other V’s in Big Data

| Other V’s | Meaning | Key Question / Use Case |
|-----------|---------|-------------------------|
| Value | Business/Customer Impact | What value does this data bring to the business or end users? |
| Visualization | Data Representation | Can the data be visualized clearly to aid understanding and decisions? |
| Viability | Production/Sustainability | Is it viable to operationalize and sustain this data in production systems? |
| Virality | Shareability/Impact | Will the message or insight be effective when shared across channels (e.g., social media)? |
| Version | Data Versioning | Do we need to maintain different versions? Is the cost of versioning justified? |
| Validity | Time-Sensitivity | How long is the data relevant? Will its meaning or utility change over time? |

Example

  • Validity: Zoom usage data from 2020 was valid during lockdown; can it still be used for benchmarking today?

  • Virality: A meme might go viral on Instagram but not be received well on Twitter or LinkedIn.

  • Version: For some master records, we might need versioned data. For simple web traffic counts, maybe not.

#bigdata #otherv #value #version #validity

Trending Technologies

Powered by Big Data

Big Data isn’t just about storing and processing huge volumes of information — it’s the engine that drives modern innovation. From healthcare to self-driving cars, Big Data plays a critical role in shaping the technologies we use and depend on every day.

Where Big Data Is Making an Impact

  • Robotics
    Enhances learning and adaptive behavior in robots by feeding real-time and historical data into control algorithms.

  • Artificial Intelligence (AI)
    The heart of AI — machine learning models rely on Big Data to train, fine-tune, and make accurate predictions.

  • Internet of Things (IoT)
    Millions of devices — from smart thermostats to industrial sensors — generate data every second. Big Data platforms analyze this for real-time insights.

  • Internet & Mobile Apps
    Collect user behavior data to power personalization, recommendations, and user experience optimization.

  • Autonomous Cars & VANETs (Vehicular Networks)
    Use sensor and network data for route planning, obstacle avoidance, and decision-making.

  • Wireless Networks & 5G
    Big Data helps optimize network traffic, reduce latency, and predict service outages before they occur.

  • Voice Assistants (Siri, Alexa, Google Assistant)
    Depend on Big Data and NLP models to understand speech, learn preferences, and respond intelligently.

  • Cybersecurity
    Uses pattern detection on massive datasets to identify anomalies, prevent attacks, and detect fraud in real time.

  • Bioinformatics & Genomics
    Big Data helps decode genetic sequences, enabling personalized medicine and new drug discoveries. It was also a game-changer in the development and distribution of COVID-19 vaccines.

    https://pmc.ncbi.nlm.nih.gov/articles/PMC9236915/

  • Renewable Energy
    Analyzes weather, consumption, and device data to maximize efficiency in solar, wind, and other green technologies.

  • Neural Networks & Deep Learning
    These advanced AI models require large-scale labeled data for training complex tasks like image recognition or language translation.


Broad Use Areas for Big Data

| Area | Description |
|------|-------------|
| Data Mining & Analytics | Finding patterns and insights from raw data |
| Data Visualization | Presenting data in a human-friendly, understandable format |
| Machine Learning | Training models that learn from historical data |

#bigdata #technologies #iot #ai #robotics

Big Data Concerns

Big Data brings massive potential, but it also introduces ethical, technical, and societal challenges. Below is a categorized view of key concerns and how they can be mitigated.

Privacy, Security & Governance

Concerns

  • Privacy: Risk of misuse of sensitive personal data.
  • Security: Exposure to cyberattacks and data breaches.
  • Governance: Lack of clarity on data ownership and access rights.

Mitigation

  • Use strong encryption, anonymization, and secure access controls.
  • Conduct regular security audits and staff awareness training.
  • Define and enforce data governance policies on ownership, access, and lifecycle.
  • Establish consent mechanisms and transparent data usage policies.

Data Quality, Accuracy & Interpretation

Concerns

  • Inaccurate, incomplete, or outdated data may lead to incorrect decisions.
  • Misinterpretation due to lack of context or domain understanding.

Mitigation

  • Implement data cleaning, validation, and monitoring procedures.
  • Train analysts to understand data context.
  • Use cross-functional teams for balanced analysis.
  • Maintain data lineage and proper documentation.

Ethics, Fairness & Bias

Concerns

  • Potential for discrimination or unethical use of data.
  • Over-reliance on algorithms may overlook human factors.

Mitigation

  • Develop and follow ethical guidelines for data usage.
  • Perform bias audits and impact assessments regularly.
  • Combine data-driven insights with human judgment.

Regulatory Compliance

Concerns

  • Complexity of complying with regulations like GDPR, HIPAA, etc.

Mitigation

  • Stay current with relevant data protection laws.
  • Assign a Data Protection Officer (DPO) to ensure ongoing compliance and oversight.

Environmental and Social Impact

Concerns

  • High energy usage of data centers contributes to carbon emissions.
  • Digital divide may widen gaps between those who can access Big Data and those who cannot.

Mitigation

  • Use energy-efficient infrastructure and renewable energy sources.
  • Support data literacy, open data access, and inclusive education initiatives.

#bigdata #concerns #mitigation

Big Data Challenges

As organizations adopt Big Data, they face several challenges — technical, organizational, financial, legal, and ethical. Below is a categorized overview of these challenges along with effective mitigation strategies.

1. Data Storage & Management

Challenge:

Efficiently storing and managing ever-growing volumes of structured, semi-structured, and unstructured data.

Mitigation:

  • Use scalable cloud storage and distributed file systems like HDFS or Delta Lake.
  • Establish data lifecycle policies, retention rules, and metadata catalogs for better management.

2. Data Processing & Real-Time Analytics

Challenges:

  • Processing huge datasets with speed and accuracy.
  • Delivering real-time insights for time-sensitive decisions.

Mitigation:

  • Leverage tools like Apache Spark, Flink, and Hadoop for distributed processing.
  • Use streaming platforms like Kafka or Spark Streaming.
  • Apply parallel and in-memory processing where possible.

3. Data Integration & Interoperability

Challenge:

Bringing together data from diverse sources, formats, and systems into a unified view.

Mitigation:

  • Implement ETL/ELT pipelines, data lakes, and integration frameworks.
  • Apply data transformation and standardization best practices.

4. Privacy, Security & Compliance

Challenges:

  • Preventing data breaches and unauthorized access.
  • Adhering to global and regional data regulations (e.g., GDPR, HIPAA, CCPA).

Mitigation:

  • Use encryption, role-based access controls, and audit logging.
  • Conduct regular security assessments and appoint a Data Protection Officer (DPO).
  • Stay current with evolving regulations and enforce compliance frameworks.

5. Data Quality & Trustworthiness

Challenge:

Ensuring that data is accurate, consistent, timely, and complete.

Mitigation:

  • Use data validation, cleansing tools, and automated quality checks.
  • Monitor for data drift and inconsistencies in real time.
  • Maintain data provenance for traceability.

6. Skill Gaps & Talent Shortage

Challenge:

A lack of professionals skilled in Big Data technologies, analytics, and data engineering.

Mitigation:

  • Invest in upskilling programs, certifications, and academic partnerships.
  • Foster a culture of continuous learning and data literacy across roles.

7. Cost & Resource Management

Challenge:

Managing the high costs associated with storing, processing, and analyzing large-scale data.

Mitigation:

  • Optimize workloads using cloud-native autoscaling and resource tagging.
  • Use open-source tools where possible.
  • Monitor and forecast data usage to control spending.

8. Scalability & Performance

Challenge:

Keeping up with growing data volumes and system demands without compromising performance.

Mitigation:

  • Design for horizontal scalability using microservices and cloud-native infrastructure.
  • Implement load balancing, data partitioning, and caching strategies.

9. Ethics, Governance & Transparency

Challenges:

  • Managing bias, fairness, and responsible data usage.
  • Ensuring transparency in algorithms and decisions.

Mitigation:

  • Establish data ethics policies and review boards.
  • Perform regular audits and impact assessments.
  • Clearly communicate how data is collected, stored, and used.

#bigdata #ethics #storage #realtime #interoperability #privacy #dataquality

Data Integration

Data integration in the Big Data ecosystem differs significantly from traditional Relational Database Management Systems (RDBMS). While traditional systems rely on structured, predefined workflows, Big Data emphasizes scalability, flexibility, and performance.

ETL: Extract Transform Load

ETL is a traditional data integration approach used primarily with RDBMS technologies such as MySQL, SQL Server, and Oracle.

Workflow

  • Extract data from source systems.
  • Transform it into the required format.
  • Load it into the target system (e.g., a data warehouse).

ETL Tools

  • SSIS / SSDT – SQL Server Integration Services / Data Tools
  • Pentaho Kettle – Open-source ETL platform
  • Talend – Data integration and transformation platform
  • Benetl – Lightweight ETL for MySQL and PostgreSQL

ETL tools are well-suited for batch processing and structured environments but may struggle with scale and unstructured data.
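As an illustration only (not one of the tools listed above), the same Extract → Transform → Load flow can be sketched with the Python standard library; the file name sales.csv and its columns are assumptions:

import csv
import sqlite3

# Extract: read raw rows from a source file.
with open("sales.csv", newline="") as f:
    rows = list(csv.DictReader(f))

# Transform: clean and reshape before loading.
transformed = [
    (row["order_id"], row["customer"].strip().title(), float(row["amount"]))
    for row in rows
    if row["amount"]  # drop rows with a missing amount
]

# Load: write the transformed rows into the target database.
con = sqlite3.connect("warehouse.db")
con.execute("CREATE TABLE IF NOT EXISTS sales (order_id TEXT, customer TEXT, amount REAL)")
con.executemany("INSERT INTO sales VALUES (?, ?, ?)", transformed)
con.commit()
con.close()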

ETL src 1

ETL vs ELT src 2

ELT: Extract Load Transform

ELT is the modern, Big Data-friendly approach. Instead of transforming data before loading, ELT prioritizes loading raw data first and transforming later.

Benefits

  • Immediate ingestion of all types of data (structured or unstructured)
  • Flexible transformation logic, applied post-load
  • Faster load times and higher throughput
  • Reduced operational overhead for loading processes

Challenges

  • Security blind spots may arise from loading raw data upfront
  • Compliance risks due to delayed transformation (HIPAA, GDPR, etc.)
  • High storage costs if raw data is stored unfiltered in cloud/on-prem systems

ELT is ideal for data lakes, streaming, and cloud-native architectures.

Typical Big Data Flow

Raw Data → Cleansed Data → Data Processing → Data Warehousing → ML / BI / Analytics

  • Raw Data: Initial unprocessed input (logs, JSON, CSV, APIs, sensors)
  • Cleansed Data: Cleaned and standardized
  • Processing: Performed through tools like Spark, DLT, or Flink
  • Warehousing: Data is stored in structured formats (e.g., Delta, Parquet)
  • Usage: Data is consumed by ML models, dashboards, or analysts

Each stage involves pipelines, validations, and metadata tracking.

ETL vs ELT

#etl #elt #pipeline #rawdata #datalake


1: Leanmsbitutorial.com

2: https://towardsdatascience.com/how-i-redesigned-over-100-etl-into-elt-data-pipelines-c58d3a3cb3c

Scaling & Distributed Systems

Scalability is a critical factor in Big Data and cloud computing. As workloads grow, systems must adapt.

There are two main ways to scale infrastructure: vertical scaling and horizontal scaling. These often relate to how distributed systems are designed and deployed.

Vertical Scaling (Scaling Up)

Vertical scaling means increasing the capacity of a single machine.

Like upgrading your personal computer — adding more RAM, a faster CPU, or a bigger hard drive.

Pros:

  • Simple to implement
  • No code or architecture changes needed
  • Good for monolithic or legacy applications

Cons:

  • Hardware has physical limits
  • Downtime may be required during upgrades
  • More expensive hardware = diminishing returns

Used In:

  • Traditional RDBMS
  • Standalone servers
  • Small-scale workloads

Horizontal Scaling (Scaling Out)

Horizontal scaling means adding more machines (nodes) to handle the load collectively.

Like hiring more team members instead of just working overtime yourself.

Pros:

  • More scalable: Keep adding nodes as needed
  • Fault tolerant: One machine failure doesn’t stop the system
  • Supports distributed computing

Cons:

  • More complex to configure and manage
  • Requires load balancing, data partitioning, and synchronization
  • More network overhead

Used In:

  • Distributed databases (e.g., Cassandra, MongoDB)
  • Big Data platforms (e.g., Hadoop, Spark)
  • Cloud-native applications (e.g., Kubernetes)

Distributed Systems

A distributed system is a network of computers that work together to perform tasks. The goal is to increase performance, availability, and fault tolerance by sharing resources across machines.

Analogy:

A relay team where each runner (node) has a specific part of the race, but success depends on teamwork.

Key Features of Distributed Systems

| Feature | Description |
|---------|-------------|
| Concurrency | Multiple components can operate at the same time independently |
| Scalability | Easily expand by adding more nodes |
| Fault Tolerance | If one node fails, others continue to operate with minimal disruption |
| Resource Sharing | Nodes share tasks, data, and workload efficiently |
| Decentralization | No single point of failure; avoids bottlenecks |
| Transparency | System hides its distributed nature from users (location, access, replication) |

Horizontal Scaling vs. Distributed Systems

| Aspect | Horizontal Scaling | Distributed System |
|--------|--------------------|--------------------|
| Definition | Adding more machines (nodes) to handle workload | A system where multiple nodes work together as one unit |
| Goal | To increase capacity and performance by scaling out | To coordinate tasks, ensure fault tolerance, and share resources |
| Architecture | Not necessarily distributed | Always distributed |
| Coordination | May not require nodes to communicate | Requires tight coordination between nodes |
| Fault Tolerance | Depends on implementation | Built-in as a core feature |
| Example | Load-balanced web servers | Hadoop, Spark, Cassandra, Kafka |
| Storage/Processing | Each node may handle separate workloads | Nodes often share or split workloads and data |
| Use Case | Quick capacity boost (e.g., web servers) | Large-scale data processing, distributed storage |

Vertical scaling helps improve single-node power, while horizontal scaling enables distributed systems to grow flexibly. Most modern Big Data systems rely on horizontal scaling for scalability, reliability, and performance.

#scaling #vertical #horizontal #distributed

CAP Theorem

src 1

The CAP Theorem is a fundamental concept in distributed computing. It states that in the presence of a network partition, a distributed system can guarantee only two out of the following three properties:

The Three Components

  1. Consistency (C)
    Every read receives the most recent write or an error.
    Example: If a book’s location is updated in a library system, everyone querying the catalog should see the updated location immediately.

  2. Availability (A)
    Every request receives a (non-error) response, but not necessarily the most recent data.
    Example: Like a convenience store that’s always open, even if they occasionally run out of your favorite snack.

  3. Partition Tolerance (P)
    The system continues to function despite network failures or communication breakdowns.
    Example: A distributed team in different rooms that still works, even if their intercom fails.

What the CAP Theorem Means

You can only pick two out of three:

| Guarantee Combination | Sacrificed Property | Typical Use Case |
|-----------------------|---------------------|------------------|
| CP (Consistency + Partition) | Availability | Banking Systems, RDBMS |
| AP (Availability + Partition) | Consistency | DNS, Web Caches |
| CA (Consistency + Availability) | Partition Tolerance (not realistic in distributed systems) | Only feasible in non-distributed systems |

CAP Theorem src 2

Real-World Examples

CAP Theorem trade-offs can be seen in:

  • Social Media Platforms – Favor availability and partition tolerance (AP)
  • Financial Systems – Require consistency and partition tolerance (CP)
  • IoT Networks – Often prioritize availability and partition tolerance (AP)
  • eCommerce Platforms – Mix of AP and CP depending on the service
  • Content Delivery Networks (CDNs) – Strongly AP-focused for high availability and responsiveness

src 3

graph TD
    A[Consistency]
    B[Availability]
    C[Partition Tolerance]

    A -- CP System --> C
    B -- AP System --> C
    A -- CA System --> B

    subgraph CAP Triangle
        A
        B
        C
    end

This diagram shows that you can choose only two at a time:

  • CP (Consistency + Partition Tolerance): e.g., traditional databases
  • AP (Availability + Partition Tolerance): e.g., DNS, Cassandra
  • CA is only theoretical in a distributed environment (it fails when a partition occurs)

In distributed systems, network partitions are unavoidable. The CAP Theorem helps us choose which trade-off makes the most sense for our use case.

#cap #consistency #availability #partitiontolerant


1: blog.devtrovert.com

2: Factor-bytes.com

3: blog.bytebytego.com

Optimistic concurrency

Optimistic Concurrency is a concurrency control strategy used in databases and distributed systems that allows multiple users or processes to access the same data simultaneously without locking resources.

Instead of preventing conflicts upfront by using locks, it assumes that conflicts are rare. If a conflict does occur, it’s detected after the operation, and appropriate resolution steps (like retries) are taken.


How It Works

  • Multiple users/processes read and attempt to write to the same data.
  • Instead of using locks, each update tracks the version or timestamp of the data.
  • When writing, the system checks if the data has changed since it was read.
  • If no conflict, the write proceeds.
  • If conflict detected, the system throws an exception or prompts a retry.

Let’s look at a simple example:

Sample inventory Table

| item_id | item_nm | stock |
|---------|---------|-------|
|    1    | Apple   |  10   |
|    2    | Orange  |  20   |
|    3    | Banana  |  30   |

Imagine two users, UserA and UserB, trying to update the apple stock simultaneously.

User A’s update:

UPDATE inventory SET stock = stock + 5 WHERE item_id = 1;

User B’s update:

UPDATE inventory SET stock = stock - 3 WHERE item_id = 1;

  • Both updates execute concurrently without locking the table.
  • After both operations, the system checks for version conflicts.
  • If there is no conflict, the changes are merged:
    New Apple stock = 10 + 5 - 3 = 12
  • If there was a conflicting update (e.g., both changed the same field from different base versions), one update would fail, and the user must retry the transaction.
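One common way to implement the check is to carry a version column and make the UPDATE conditional on it. A minimal sketch in Python with SQLite (the version column and helper function are illustrative assumptions, and the two "users" run sequentially here just to show the pattern):

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE inventory (item_id INTEGER PRIMARY KEY, item_nm TEXT, stock INTEGER, version INTEGER)")
con.execute("INSERT INTO inventory VALUES (1, 'Apple', 10, 1)")

def update_stock(change):
    # Read the current value together with its version.
    stock, version = con.execute(
        "SELECT stock, version FROM inventory WHERE item_id = 1").fetchone()

    # Write only if nobody else bumped the version in the meantime.
    cur = con.execute(
        "UPDATE inventory SET stock = ?, version = version + 1 "
        "WHERE item_id = 1 AND version = ?",
        (stock + change, version))

    if cur.rowcount == 0:
        raise RuntimeError("Conflict detected - retry the transaction")

update_stock(+5)   # User A
update_stock(-3)   # User B
print(con.execute("SELECT stock FROM inventory WHERE item_id = 1").fetchone())  # (12,)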

Optimistic Concurrency Is Ideal When

| Condition | Explanation |
|-----------|-------------|
| Low write contention | Most updates happen on different parts of data |
| Read-heavy, write-light systems | Updates are infrequent or less overlapping |
| High performance is critical | Avoiding locks reduces wait times |
| Distributed systems | Locking is expensive and hard to coordinate |

#optimistic #bigdata

Eventual consistency

Eventual consistency is a consistency model used in distributed systems (like NoSQL databases and distributed storage) where updates to data may not be immediately visible across all nodes. However, the system guarantees that all replicas will eventually converge to the same state — given no new updates are made.

Unlike stronger models like serializability or linearizability, eventual consistency prioritizes performance and availability, especially in the face of network latency or partitioning.

Simple Example: Distributed Key-Value Store

Imagine a distributed database with three nodes: Node A, Node B, and Node C. All store the value for a key called "item_stock":

Node A: item_stock = 10
Node B: item_stock = 10
Node C: item_stock = 10

Now, a user sends an update to change item_stock to 15, and it reaches only Node A initially:

Node A: item_stock = 15
Node B: item_stock = 10
Node C: item_stock = 10

At this point, the system is temporarily inconsistent. Over time, the update propagates:

Node A: item_stock = 15
Node B: item_stock = 15
Node C: item_stock = 10

Eventually, all nodes reach the same value:

Node A: item_stock = 15
Node B: item_stock = 15
Node C: item_stock = 15
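A toy simulation of this convergence (purely illustrative, not a real replication protocol):

# Three replicas of the same key-value pair.
replicas = {"A": {"item_stock": 10}, "B": {"item_stock": 10}, "C": {"item_stock": 10}}

def write(node, key, value):
    # The write lands on one node first...
    replicas[node][key] = value

def replicate(key):
    # ...and is propagated to the remaining replicas later.
    latest = replicas["A"][key]
    for node in ("B", "C"):
        replicas[node][key] = latest

write("A", "item_stock", 15)
print(replicas)   # temporarily inconsistent: A=15, B=10, C=10

replicate("item_stock")
print(replicas)   # converged: all replicas now show 15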

Key Characteristics

  • Temporary inconsistencies are allowed
  • Data will converge across replicas over time
  • Reads may return stale data during convergence
  • Prioritizes availability and partition tolerance over strict consistency

When to Use Eventual Consistency

Eventual consistency is ideal when:

| Situation | Why It Helps |
|-----------|--------------|
| High-throughput, low-latency systems | Avoids the overhead of strict consistency |
| Geo-distributed deployments | Tolerates network delays and partitions |
| Systems with frequent writes | Enables faster response without locking or blocking |
| Availability is more critical than accuracy | Keeps services running even during network issues |

#eventualconsistency #bigdata

Concurrent vs. Parallel

Understanding the difference between concurrent and parallel programming is key when designing efficient, scalable applications — especially in distributed and multi-core systems.

Concurrent Programming

Concurrent programming is about managing multiple tasks at once, allowing them to make progress without necessarily executing at the same time.

  • Tasks overlap in time.
  • Focuses on task coordination, not simultaneous execution.
  • Often used in systems that need to handle many events or users, like web servers or GUIs.

Key Traits

  • Enables responsive programs (non-blocking)
  • Utilizes a single core or limited resources efficiently
  • Requires mechanisms like threads, coroutines, or async/await

Parallel Programming

Parallel programming is about executing multiple tasks simultaneously, typically to speed up computation.

  • Tasks run at the same time, often on multiple cores.
  • Focuses on performance and efficiency.
  • Common in high-performance computing, such as scientific simulations or data processing.

Key Traits

  • Requires multi-core CPUs or GPUs
  • Ideal for data-heavy workloads
  • Uses multithreading, multiprocessing, or vectorization

Analogy: Cooking in a Kitchen

Concurrent Programming

One chef is working on multiple dishes. While a pot is simmering, the chef chops vegetables for the next dish. Tasks overlap, but only one is actively running at a time.

Parallel Programming

A team of chefs in a large kitchen, each cooking a different dish at the same time. Multiple dishes are actively being cooked simultaneously, speeding up the overall process.
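A compact, illustrative Python sketch of both styles: asyncio interleaves tasks on a single thread (concurrency), while multiprocessing runs work on separate cores (parallelism). The dish names and workloads are made up:

import asyncio
import multiprocessing

# Concurrency: one chef switches between dishes while each one "simmers" (awaits).
async def cook(dish, seconds):
    print(f"start {dish}")
    await asyncio.sleep(seconds)   # non-blocking wait; other dishes progress meanwhile
    print(f"done  {dish}")

async def concurrent_kitchen():
    await asyncio.gather(cook("soup", 1), cook("pasta", 1))

# Parallelism: separate processes (chefs) really do work at the same time.
def heavy_chop(n):
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    asyncio.run(concurrent_kitchen())

    with multiprocessing.Pool(processes=2) as pool:
        results = pool.map(heavy_chop, [2_000_000, 2_000_000])
    print(results)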

Summary Table

| Feature | Concurrent Programming | Parallel Programming |
|---------|------------------------|----------------------|
| Task Timing | Tasks overlap, but not necessarily at once | Tasks run simultaneously |
| Focus | Managing multiple tasks efficiently | Improving performance through parallelism |
| Execution Context | Often single-core or logical thread | Multi-core, multi-threaded or GPU-based |
| Tools/Mechanisms | Threads, coroutines, async I/O | Threads, multiprocessing, SIMD, OpenMP |
| Example Use Case | Web servers, I/O-bound systems | Scientific computing, big data, simulations |

#concurrent #parallelprogramming

General-Purpose Language (GPL)

What is a GPL?

A GPL is a programming language designed to write software in multiple problem domains. It is not limited to a particular application area.

Swiss Army Knife

Examples

  • Python – widely used in ML, web, scripting, automation.
  • Java – enterprise applications, Android, backend.
  • C++ – system programming, game engines.
  • Rust – performance + memory safety.
  • JavaScript – web front-end & server-side with Node.js.

Use Cases

  • Building web apps (backend/frontend).
  • Developing AI/ML pipelines.
  • Writing system software and operating systems.
  • Implementing data processing frameworks (e.g., Apache Spark in Scala).
  • Creating mobile and desktop applications.

Why Use GPL?

  • Flexibility to work across domains.
  • Rich standard libraries and ecosystems.
  • Ability to combine different kinds of tasks (e.g., networking + ML).

#gpl #python #rust

DSL

A DSL is a programming or specification language dedicated to a particular problem domain, a particular problem representation technique, and/or a particular solution technique.

Examples

  • SQL – querying and manipulating relational databases.
  • HTML – for structuring content on the web.
  • R – statistical computing and graphics.
  • Makefiles – for building projects.
  • Regular Expressions – for pattern matching.
  • Markdown – e.g., README.md files (try https://stackedit.io/app#)
  • Mermaid – diagrams from text (https://mermaid.live/)

Use Cases

  • Building data pipelines (e.g., dbt, Airflow DAGs).
  • Writing infrastructure-as-code (e.g., Terraform HCL).
  • Designing UI layout (e.g., QML for Qt UI design).
  • IoT rule engines (e.g., IFTTT or Node-RED flows).
  • Statistical models using R.

Why Use DSL?

  • Shorter, more expressive code in the domain.
  • Higher-level abstractions.
  • Reduced risk of bugs for domain experts.

Optional Challenge: Build Your Own DSL!

Design your own mini Domain-Specific Language (DSL)! You can keep it simple.

  • Start with a specific problem.
  • Create your own syntax that feels natural to all.
  • Try a few examples and ask your friends to try them.
  • Try implementing a parser using your favourite GPL (see the sketch below).
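As a starting point, here is one possible (entirely made-up) mini DSL for filtering CSV rows, with a tiny parser sketched in Python; the query syntax, file name people.csv, and column names are all invented for illustration:

import csv

# A made-up one-line DSL:  "show name, age where age > 30"
def run_query(query, path):
    fields_part, _, cond_part = query.removeprefix("show ").partition(" where ")
    fields = [f.strip() for f in fields_part.split(",")]
    col, op, value = cond_part.split()          # e.g. "age > 30"

    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            # Only the ">" operator is supported in this deliberately tiny sketch.
            if op == ">" and float(row[col]) > float(value):
                yield {field: row[field] for field in fields}

# for rec in run_query("show name, age where age > 30", "people.csv"):
#     print(rec)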

#domain #dsl #SQL #HTML

Popular Big Data Tools & Platforms

Big Data ecosystems rely on a wide range of tools and platforms for data processing, real-time analytics, streaming, and cloud-scale storage. Here’s a list of some widely used tools categorized by functionality:

Distributed Processing Engines

  • Apache Spark – Unified analytics engine for large-scale data processing; supports batch, streaming, and ML.
  • Apache Flink – Framework for stateful computations over data streams with real-time capabilities.

Real-Time Data Streaming

  • Apache Kafka – Distributed event streaming platform for building real-time data pipelines and streaming apps.

Log & Monitoring Stack

  • ELK Stack (Elasticsearch, Logstash, Kibana) – Searchable logging and visualization suite for real-time analytics.

Cloud-Based Platforms

  • AWS (Amazon Web Services) – Scalable cloud platform offering Big Data tools like EMR, Redshift, Kinesis, and S3.
  • Azure – Microsoft’s cloud platform with tools like Azure Synapse, Data Lake, and Event Hubs.
  • GCP (Google Cloud Platform) – Offers BigQuery, Dataflow, Pub/Sub for large-scale data analytics.
  • Databricks – Unified data platform built around Apache Spark with powerful collaboration and ML features.
  • Snowflake – Cloud-native data warehouse known for performance, elasticity, and simplicity.

#bigdata #tools #cloud #kafka #spark

NoSQL Database Types

NoSQL databases are optimized for flexibility, scalability, and performance, making them ideal for Big Data and real-time applications. They are categorized based on how they store and access data:

Key-Value Stores

Store data as simple key-value pairs. Ideal for caching, session storage, and high-speed lookups.

  • Redis
  • Amazon DynamoDB

Columnar Stores

Store data in columns rather than rows, optimized for analytical queries and large-scale batch processing.

  • Apache HBase
  • Apache Cassandra
  • Amazon Redshift

Document Stores

Store semi-structured data like JSON or BSON documents. Great for flexible schemas and content management systems.

  • MongoDB
  • Amazon DocumentDB

Graph Databases

Use nodes and edges to represent and traverse relationships between data. Ideal for social networks, recommendation engines, and fraud detection.

  • Neo4j
  • Amazon Neptune

Tip: Choose the NoSQL database type based on your data access patterns and application needs.

Not all NoSQL databases solve the same problem.

#nosql #keyvalue #documentdb #graphdb #columnar

Learning Big Data

Learning Big Data goes beyond just handling large datasets. It involves building a foundational understanding of data types, file formats, processing tools, and cloud platforms used to store, transform, and analyze data at scale.

Types of Files & Formats

  • Data File Types: CSV, JSON
  • File Formats: CSV, TSV, TXT, Parquet

Linux & File Management Skills

  • Essential Linux Commands: ls, cat, grep, awk, sort, cut, sed, etc.
  • Useful Libraries & Tools:
    • awk, jq, csvkit, grep – for filtering, transforming, and managing structured data

Data Manipulation Foundations

  • Regular Expressions: For pattern matching and advanced string operations
  • SQL / RDBMS: Understanding relational data and query languages
  • NoSQL Databases: Working with document, key-value, columnar, and graph stores

Cloud Technologies

  • Introduction to major platforms: AWS, Azure, GCP
  • Services for data storage, compute, and analytics (e.g., S3, EMR, BigQuery)

Big Data Tools & Frameworks

  • Tools like Apache Spark, Flink, Kafka, Dask
  • Workflow orchestration (e.g., Airflow, DBT, Databricks Workflows)

Miscellaneous Tools & Libraries

  • Visualization: matplotlib, seaborn, Plotly
  • Data Engineering: pandas, pyarrow, sqlalchemy
  • Streaming & Real-time: Kafka, Spark Streaming, Flume

Tip: Big Data learning is a multi-disciplinary journey. Start small — explore files and formats — then gradually move into tools, pipelines, cloud platforms, and real-time systems.

#bigdata #learning

Developer Tools

  1. Introduction
  2. UV
  3. Other Python Tools
  4. DuckDB
  5. JQ

Introduction

Before diving into Data or ML frameworks, it's important to have a clean and reproducible development setup. A good environment makes you:

  • Faster: less time fighting dependencies.
  • Consistent: same results across laptops, servers, and teammates.
  • Confident: tools catch errors before they become bugs.

A consistent developer experience saves hours of debugging. You spend more time solving problems, less time fixing environments.


Python Virtual Environment

  • A virtual environment is like a sandbox for Python.
  • It isolates your project’s dependencies from the global Python installation.
  • Makes it easy to manage different versions of libraries.
  • Dependencies are tracked in requirements.txt, which has to be managed manually.

Without it, installing one package for one project may break another project.

Open the CMD prompt (Windows)

Open the Terminal (Mac)

# Step 0: Create a project folder under your Home folder.

mkdir project

cd project


# Step 1: Create a virtual environment
python -m venv myenv

# Step 2: Activate it
# On Mac/Linux:
source myenv/bin/activate

# On Windows:
myenv\Scripts\activate.bat

# Step 3: Install packages (they go inside `myenv`, not global)
pip install faker

# Step 4: Open Python
python

# Step 5: Verify 

import sys

sys.prefix

sys.base_prefix

# Step 6: Run this sample

from faker import Faker
fake = Faker()
fake.name()

# Step 7: Exit Python (Ctrl+D on Mac/Linux, Ctrl+Z then Enter on Windows, or exit())

# Step 8: Deactivate the venv when done

deactivate

As a next step, you can either use Poetry or UV as your package manager.

#venv #python #uv #poetry #developer_tools

UV

Dependency & Environment Manager

  • Written in Rust.
  • Syntax is lightweight.
  • Automatic Virtual environment creation.

Create a new project:

# Initialize a new uv project
uv init uv_helloworld

Sample layout of the directory structure

.
├── main.py
├── pyproject.toml
├── README.md
└── uv.lock
# Change directory
cd uv_helloworld

# # Create a virtual environment myproject
# uv venv myproject

# or create a UV project with specific version of Python

# uv venv myproject --python 3.11

# # Activate the Virtual environment

# source myproject/bin/activate

# # Verify the Virtual Python version

# which python3

# add library (best practice)
uv add faker

# verify the list of libraries under virtual env
uv tree

# To find the list of libraries inside Virtual env

uv pip list

Edit main.py:

from faker import Faker
fake = Faker()
print(fake.name())

Run it:

uv run main.py

Read More on the differences between UV and Poetry

#uv #rust #venv

Python Developer Tools

PEP

A PEP (Python Enhancement Proposal) is a design document that proposes a new feature, process, or convention for Python. PEP 8, the most famous one, is the official style guide: it provides conventions and recommendations for writing readable, consistent, and maintainable Python code.

PEP Conventions

  • PEP 8 : Style guide for Python code (most famous).
  • PEP 20 : "The Zen of Python" (guiding principles).
  • PEP 484 : Type hints (basis for MyPy).
  • PEP 517/518 : Build system interfaces (basis for pyproject.toml, used by Poetry/UV).
  • PEP 572 : Assignment expressions (the := walrus operator).
  • PEP 695 : Type parameter syntax for generics (Python 3.12).

Indentation

  • Use 4 spaces per indentation level
  • Continuation lines should align with opening delimiter or be indented by 4 spaces.

Line Length

  • Limit lines to a maximum of 79 characters.
  • For docstrings and comments, limit lines to 72 characters.

Blank Lines

  • Use 2 blank lines before top-level functions and class definitions.
  • Use 1 blank line between methods inside a class.

Imports

  • Imports should be on separate lines.
  • Group imports into three sections: standard library, third-party libraries, and local application imports.
  • Use absolute imports whenever possible.
# Correct
import os
import sys

# Wrong
import sys, os

Naming Conventions

  • Use snake_case for function and variable names.
  • Use CamelCase for class names.
  • Use UPPER_SNAKE_CASE for constants.
  • Avoid single-character variable names except for counters or indices.

Whitespace

  • Don’t pad inside parentheses/brackets/braces.
  • Use one space around operators and after commas, but not before commas.
  • No extra spaces when aligning assignments.

Comments

  • Write comments that are clear, concise, and helpful.
  • Use complete sentences and capitalize the first word.
  • Use # for inline comments, but avoid them where the code is self-explanatory.

Docstrings

  • Use triple quotes (""") for multiline docstrings.
  • Describe the purpose, arguments, and return values of functions and methods.
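A short, illustrative example of the convention:

def convert_bytes(size_in_bytes: int, unit: str = "MB") -> float:
    """Convert a size in bytes to the given unit.

    Args:
        size_in_bytes: The raw size in bytes.
        unit: Target unit, one of "KB", "MB", or "GB".

    Returns:
        The converted size as a float.
    """
    factors = {"KB": 1024, "MB": 1024**2, "GB": 1024**3}
    return size_in_bytes / factors[unit]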

Code Layout

  • Keep function definitions and calls readable.
  • Avoid writing too many nested blocks.

Consistency

  • Consistency within a project outweighs strict adherence.
  • If you must diverge, be internally consistent.

Linting

Linting is the process of automatically checking your Python code for:

  • Syntax errors

  • Stylistic issues (PEP 8 violations)

  • Potential bugs or bad practices

  • Keeps your code consistent and readable.

  • Helps catch errors early before runtime.

  • Encourages team-wide coding standards.


# Incorrect
import sys, os

# Correct
import os
import sys

# Bad spacing
x= 5+3

# Good spacing
x = 5 + 3

Ruff : Linter and Code Formatter

Ruff is a fast, modern tool written in Rust that helps keep your Python code:

  • Consistent (follows PEP 8)
  • Clean (removes unused imports, fixes spacing, etc.)
  • Correct (catches potential errors)

Install

poetry add ruff
uv add ruff

Verify

ruff --version 
ruff --help

example.py

import os, sys 

def greet(name): 
  print(f"Hello, {name}")

def message(name): print(f"Hi, {name}")

def calc_sum(a, b): return a+b

greet('World')
greet('Ruff')
message('Ruff')

poetry run ruff check example.py
poetry run ruff check example.py --fix

poetry run ruff format example.py --check
poetry run ruff format example.py

OR

uv run ruff check example.py
uv run ruff check example.py --fix
uv run ruff format example.py --check
uv run ruff format example.py

MyPy : Type Checking Tool

mypy is a static type checker for Python. It checks your code against the type hints you provide, ensuring that the types are consistent throughout the codebase.

It primarily focuses on type correctness—verifying that variables, function arguments, return types, and expressions match the expected types.

Install

    poetry add mypy

    or

    uv add mypy

    or

    pip install mypy

sample.py

x = 1
x = 1.0
x = True
x = "test"
x = b"test"

print(x)

def add(a: int, b: int) -> int:
    return a + b

print(add(100, 123))      

print(add("hello", "world"))


uv run mypy sample.py

or

poetry run mypy sample.py

or

mypy sample.py

#mypy #pep #ruff #lint

DuckDB

DuckDB ships as a single file with no external dependencies.

All the great features can be read here https://duckdb.org/

Automatic Parallelism: DuckDB has improved its automatic parallelism capabilities, meaning it can more effectively utilize multiple CPU cores without requiring manual tuning. This results in faster query execution for large datasets.

Parquet File Improvements: DuckDB has improved its handling of Parquet files, both in terms of reading speed and support for more complex data types and compression codecs. This makes DuckDB an even better choice for working with large datasets stored in Parquet format.

Query Caching: Improves the performance of repeated queries by caching the results of previous executions. This can be a game-changer for analytics workloads with similar queries being run multiple times.

How to use DuckDB?

Download the CLI Client

DuckDB in Data Engineering

Download orders.parquet from

https://github.com/duckdb/duckdb-data/releases/download/v1.0/orders.parquet

More files are available here

https://github.com/cwida/duckdb-data/releases/

Open Command Prompt or Terminal

./duckdb

# Create / Open a database

.open ordersdb

DuckDB allows you to read the contents of orders.parquet as-is, without creating a table. The double quotes around the file name orders.parquet are essential.

describe table  "orders.parquet"

Not only that, it also allows you to query the file as-is. (This feature is similar to one that Databricks supports.)

select * from "orders.parquet" limit 3;

DuckDB supports CTAS (CREATE TABLE AS SELECT) syntax, which helps create tables directly from the file.

show tables;

create table orders  as select * from "orders.parquet";

select count(*) from orders;

DuckDB supports parallel query processing, and queries run fast.

This table has 1.5 million rows, and aggregation happens in less than a second.

select now(); select o_orderpriority,count(*) cnt from orders group by o_orderpriority; select now();

DuckDB also helps to convert parquet files to CSV in a snap. It also supports converting CSV to Parquet.

COPY "orders.parquet" to 'orders.csv'  (FORMAT "CSV", HEADER 1);Select * from "orders.csv" limit 3;

It also supports exporting existing Tables to Parquet files.

COPY "orders" to  'neworder.parquet' (FORMAT "PARQUET");

DuckDB has client APIs for programming languages such as Python, R, Java, Node.js, and C/C++.
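For example, from Python the same SQL can be run against the Parquet file used earlier. A minimal sketch (assuming pip install duckdb and that orders.parquet is in the working directory):

import duckdb

con = duckdb.connect()   # in-memory database

# Query the Parquet file directly, just like in the CLI examples above.
result = con.execute(
    "SELECT o_orderpriority, count(*) AS cnt "
    "FROM 'orders.parquet' GROUP BY o_orderpriority"
).fetchall()

print(result)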

DuckDB also supports higher-level SQL features such as macros, sequences, and window functions.

Get sample data from Yellow Cab

https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page

Copy yellow cabs data into yellowcabs folder

create table taxi_trips as select * from "yellowcabs/*.parquet";
SELECT
    PULocationID,
    EXTRACT(HOUR FROM tpep_pickup_datetime) AS hour_of_day,
    AVG(fare_amount) AS avg_fare
FROM
    taxi_trips
GROUP BY
    PULocationID,
    hour_of_day;

Extensions

https://duckdb.org/docs/extensions/overview

INSTALL json;
LOAD json;
select * from demo.json;
describe demo.json;

Load directly from HTTP location

select * from 'https://raw.githubusercontent.com/gchandra10/filestorage/main/sales_100.csv'

#duckdb #singlefiledatabase #parquet #tools #cli

JQ

  • jq is a lightweight and flexible command-line JSON processor.
  • Reads JSON from stdin or a file, applies filters, and writes JSON to stdout.
  • Useful when working with APIs, logs, or config files in JSON format.
  • A handy tool for automation.

  1. Download the JQ CLI (preferred) and learn JQ.

JQ Download

  2. Or use the VSCode Extension and learn JQ.

VSCode Extension

Download the sample JSON

https://raw.githubusercontent.com/gchandra10/jqtutorial/refs/heads/master/sample_nows.json

Note: As this has no root element, '.' is used.

1. View JSON file in readable format

jq '.' sample_nows.json

2. Read the First JSON element / object

jq 'first(.[])' sample_nows.json

3. Read the Last JSON element

jq 'last(.[])' sample_nows.json

4. Read top 3 JSON elements

jq 'limit(3;.[])' sample_nows.json

5. Read 2nd & 3rd element. Remember, Python has the same format. LEFT Side inclusive, RIGHT Side exclusive

jq '.[2:4]' sample_nows.json

6. Extract individual values. | Pipeline the output

jq '.[] | [.balance,.age]' sample_nows.json

7. Extract individual values and do some calculations

jq '.[] | [.age, 65 - .age]' sample_nows.json

8. Return CSV from JSON

jq '.[] | [.company, .phone, .address] | @csv ' sample_nows.json

9. Return Tab Separated Values (TSV) from JSON

jq '.[] | [.company, .phone, .address] | @tsv ' sample_nows.json

10. Return with custom pipeline delimiter ( | )

jq '.[] | [.company, .phone, .address] | join("|")' sample_nows.json

Pro tip: redirect this result > output.txt and import it into a database using bulk-import tools like bcp or LOAD DATA INFILE.

11. Convert the number to string and return | delimited result

jq '.[] | [.balance,(.age | tostring)] | join("|") ' sample_nows.json

12. Process Array return Name (returns as list / array)

jq '.[] | [.friends[].name]' sample_nows.json

or (returns line by line)

jq '.[].friends[].name' sample_nows.json

13. Parse multi level values

returns as list / array

jq '.[] | [.name.first, .name.last]' sample_nows.json 

returns line by line

jq '.[].name.first, .[].name.last' sample_nows.json 

14. Query values based on condition, say .index > 2

jq 'map(select(.index > 2))' sample_nows.json
jq 'map(select(.index > 2)) | .[] | [.index,.balance,.age]' sample_nows.json

15. Sorting Elements

# Sort by Age ASC
jq 'sort_by(.age)' sample_nows.json
# Sort by Age DESC
jq 'sort_by(-.age)' sample_nows.json
# Sort on multiple keys
jq 'sort_by(.age, .index)' sample_nows.json

Use Cases

curl -s https://www.githubstatus.com/api/v2/status.json
curl -s https://www.githubstatus.com/api/v2/status.json | jq '.'
curl -s https://www.githubstatus.com/api/v2/status.json | jq '.status'

#jq #tools #json #parser #cli #automation

Last change: 2026-01-17

[Avg. reading time: 0 minutes]

Data Formats

  1. Introduction
  2. Common Data Formats
  3. JSON
  4. Parquet
  5. Arrow
  6. Delta
  7. SerDe

[Avg. reading time: 3 minutes]

Introduction to Data Formats

What are Data Formats?

  • Data formats define how data is structured, stored, and exchanged between systems.
  • In Big Data, the choice of data format is crucial because it affects:
    • Storage efficiency
    • Processing speed
    • Interoperability
    • Compression

Why are Data Formats Important in Big Data?

  • Big Data often involves massive volumes of data from diverse sources.
  • Choosing the right format ensures:
    • Efficient data storage
    • Faster querying and processing
    • Easier integration with analytics frameworks like Spark, Flink, etc.

Data Formats vs. Traditional Database Storage

| Feature    | Traditional RDBMS            | Big Data Formats                    |
|------------|------------------------------|-------------------------------------|
| Storage    | Tables with rows and columns | Files/Streams with structured data  |
| Schema     | Fixed and enforced           | Flexible, sometimes schema-on-read  |
| Processing | Transactional, ACID          | Batch or stream, high throughput    |
| Data Model | Relational                   | Structured, semi-structured, binary |
| Use Cases  | OLTP, Reporting              | ETL, Analytics, Machine Learning    |

#bigdata #dataformat #rdbms

Last change: 2026-01-17

[Avg. reading time: 3 minutes]

Common Data Formats

CSV (Comma-Separated Values)

A simple text-based format where each row is a record and columns are separated by commas.

Example:

name,age,city
Rachel,30,New York
Phoebe,25,San Francisco

Use Cases:

  • Data exchange between systems
  • Lightweight storage

Pros:

  • Human-readable
  • Easy to generate and parse

Cons:

  • No support for nested or complex structures
  • No schema enforcement
  • Inefficient for very large data

TSV (Tab-Separated Values)

Like CSV but uses tabs instead of commas.

Example:

name    age    city
Rachel   30     New York
Phoebe     25     San Francisco

Use Cases:

Similar to CSV but avoids issues with commas in data

Pros:

  • Easy to read and parse
  • Handles data with commas

Cons:

  • Same as CSV: no schema, no nested data

#bigdata #dataformat #csv #parquet #arrow

Last change: 2026-01-17

[Avg. reading time: 8 minutes]

JSON

JavaScript Object Notation.

  • JSON is neither a row-based nor a columnar format.
  • It is a flexible way to store and share data across systems.
  • It is a plain-text format built from curly braces { } and key-value pairs.

Simplest JSON format

{"id": "1","name":"Rachel"}

Properties

  • Language Independent.
  • Self-describing and easy to understand.

Basic Rules

  • Curly braces to hold the objects.
  • Data is represented in Key Value or Name Value pairs.
  • Data is separated by a comma.
  • The use of double quotes is necessary.
  • Square brackets [ ] hold an array of data.

JSON Values

String  {"name":"Rachel"}

Number  {"id":101}

Boolean {"result":true, "status":false}  (lowercase)

Object  {
            "character":{"fname":"Rachel","lname":"Green"}
        }

Array   {
            "characters":["Rachel","Ross","Joey","Chanlder"]
        }

NULL    {"id":null}

Sample JSON Document

{
    "characters": [
        {
            "id" : 1,
            "fName":"Rachel",
            "lName":"Green",
            "status":true
        },
        {
            "id" : 2,
            "fName":"Ross",
            "lName":"Geller",
            "status":true
        },
        {
            "id" : 3,
            "fName":"Chandler",
            "lName":"Bing",
            "status":true
        },
        {
            "id" : 4,
            "fName":"Phebe",
            "lName":"Buffay",
            "status":false
        }
    ]
}

JSON Best Practices

No Hyphen in your Keys.

{"first-name":"Rachel","last-name":"Green"}  is not right. ✘

Underscores Okay

{"first_name":"Rachel","last_name":"Green"} is okay ✓

Lowercase Okay

{"firstname":"Rachel","lastname":"Green"} is okay ✓

Camelcase best

{"firstName":"Rachel","lastName":"Green"} is the best. ✓

Use Cases

  • APIs and Web Services: JSON is widely used in RESTful APIs for sending and receiving data.

  • Configuration Files: Many modern applications and development tools use JSON for configuration.

  • Data Storage: Some NoSQL databases like MongoDB use JSON or BSON (binary JSON) for storing data.

  • Serialization and Deserialization: Converting data to/from a format that can be stored or transmitted.

Python Example

Serialize : Convert Python Object to JSON (Shareable) Format.

DeSerialize : Convert JSON (Shareable) String to Python Object.


import json

def json_serialize(file_name):
    # Python dictionary with Friend's characters
    friends_characters = {
        "characters": [{
            "name": "Rachel Green",
            "job": "Fashion Executive"
        }, {
            "name": "Ross Geller",
            "job": "Paleontologist"
        }, {
            "name": "Monica Geller",
            "job": "Chef"
        }, {
            "name": "Chandler Bing",
            "job": "Statistical Analysis and Data Reconfiguration"
        }, {
            "name": "Joey Tribbiani",
            "job": "Actor"
        }, {
            "name": "Phoebe Buffay",
            "job": "Massage Therapist"
        }]
    }

    print(type(friends_characters), friends_characters)

    print("-" * 200)

    # Serializing json
    json_data = json.dumps(friends_characters, indent=4)
    print(type(json_data), json_data)

    # Saving to a file
    with open(file_name, 'w') as file:
        json.dump(friends_characters, file, indent=4)


def json_deserialize(file_name):
    #file_path = 'friends_characters.json'

    # Open the file and read the JSON content
    with open(file_name, 'r') as file:
        data = json.load(file)
    print(data, type(data))


def main():
    file_name = 'friends_characters.json'
    json_serialize(file_name)
    json_deserialize(file_name)


if __name__ == "__main__":
    print("Starting JSON Serialization...")
    main()
    print("Done!")

#bigdata #dataformat #json #hierarchical

Last change: 2026-01-17

[Avg. reading time: 20 minutes]

Parquet

Parquet is a columnar storage file format optimized for use with Apache Hadoop and related big data processing frameworks. Originally developed by Twitter and Cloudera, Parquet provides a compact and efficient way of storing large, flat datasets.

Best suited for WORM (Write Once, Read Many) workloads.


Row Storage

A query like "give me the total T-Shirts sold" or "all customers from the UK" must scan the entire dataset, because every row is stored together.

Columnar Storage

Only the columns referenced by the query are read, so the same query touches far less data.


Terms to Know

Projection: Columns that are needed by the query.

    select product, country, salesamount from sales;

Here the projections are: product, country & salesamount

Predicate: A filter condition that selects rows.

    select product, country, salesamount from sales where country='UK';

Here predicate is where country = 'UK'

Row Groups in Parquet

  • Parquet divides data into row groups, each containing column chunks for all columns.

  • Horizontal partition—each row group can be processed independently.

  • Row groups enable parallel processing and make it possible to skip unnecessary data using metadata.


Parquet - Columnar Storage + Row Groups

Parquet File Format

Parquet file format layout {{footnote: https://parquet.apache.org/docs/file-format/}}

Sample Data

| Product | Customer | Country | Date       | Sales Amount |
|---------|----------|---------|------------|--------------|
| Ball    | John Doe | USA     | 2023-01-01 | 100          |
| T-Shirt | John Doe | USA     | 2023-01-02 | 200          |
| Socks   | Jane Doe | UK      | 2023-01-03 | 150          |
| Socks   | Jane Doe | UK      | 2023-01-04 | 180          |
| T-Shirt | Alex     | USA     | 2023-01-05 | 120          |
| Socks   | Alex     | USA     | 2023-01-06 | 220          |

Data stored inside Parquet

┌──────────────────────────────────────────────┐
│                File Header                   │
│  ┌────────────────────────────────────────┐  │
│  │ Magic Number: "PAR1"                   │  │
│  └────────────────────────────────────────┘  │
├──────────────────────────────────────────────┤
│                Row Group 1                   │
│  ┌────────────────────────────────────────┐  │
│  │ Column Chunk: Product                  │  │
│  │  ├─ Page 1: Ball, T-Shirt, Socks       │  │
│  └────────────────────────────────────────┘  │
│  ┌────────────────────────────────────────┐  │
│  │ Column Chunk: Customer                 │  │
│  │  ├─ Page 1: John Doe, John Doe, Jane Doe│ │
│  └────────────────────────────────────────┘  │
│  ┌────────────────────────────────────────┐  │
│  │ Column Chunk: Country                  │  │
│  │  ├─ Page 1: USA, USA, UK               │  │
│  └────────────────────────────────────────┘  │
│  ┌────────────────────────────────────────┐  │
│  │ Column Chunk: Date                     │  │
│  │  ├─ Page 1: 2023-01-01, 2023-01-02,    │  │
│  │            2023-01-03                  │  │
│  └────────────────────────────────────────┘  │
│  ┌────────────────────────────────────────┐  │
│  │ Column Chunk: Sales Amount             │  │
│  │  ├─ Page 1: 100, 200, 150              │  │
│  └────────────────────────────────────────┘  │
│  ┌────────────────────────────────────────┐  │
│  │ Row Group Metadata                     │  │
│  │  ├─ Num Rows: 3                        │  │
│  │  ├─ Min/Max per Column:                │  │
│  │     • Product: Ball/T-Shirt/Socks      │  │
│  │     • Customer: Jane Doe/John Doe      │  │
│  │     • Country: UK/USA                  │  │
│  │     • Date: 2023-01-01 to 2023-01-03    │  │
│  │     • Sales Amount: 100 to 200         │  │
│  └────────────────────────────────────────┘  │
├──────────────────────────────────────────────┤
│                Row Group 2                   │
│  ┌────────────────────────────────────────┐  │
│  │ Column Chunk: Product                  │  │
│  │  ├─ Page 1: Socks, T-Shirt, Socks      │  │
│  └────────────────────────────────────────┘  │
│  ┌────────────────────────────────────────┐  │
│  │ Column Chunk: Customer                 │  │
│  │  ├─ Page 1: Jane Doe, Alex, Alex       │  │
│  └────────────────────────────────────────┘  │
│  ┌────────────────────────────────────────┐  │
│  │ Column Chunk: Country                  │  │
│  │  ├─ Page 1: UK, USA, USA               │  │
│  └────────────────────────────────────────┘  │
│  ┌────────────────────────────────────────┐  │
│  │ Column Chunk: Date                     │  │
│  │  ├─ Page 1: 2023-01-04, 2023-01-05,    │  │
│  │            2023-01-06                  │  │
│  └────────────────────────────────────────┘  │
│  ┌────────────────────────────────────────┐  │
│  │ Column Chunk: Sales Amount             │  │
│  │  ├─ Page 1: 180, 120, 220              │  │
│  └────────────────────────────────────────┘  │
│  ┌────────────────────────────────────────┐  │
│  │ Row Group Metadata                     │  │
│  │  ├─ Num Rows: 3                        │  │
│  │  ├─ Min/Max per Column:                │  │
│  │     • Product: Socks/T-Shirt           │  │
│  │     • Customer: Alex/Jane Doe          │  │
│  │     • Country: UK/USA                  │  │
│  │     • Date: 2023-01-04 to 2023-01-06   │  │
│  │     • Sales Amount: 120 to 220         │  │
│  └────────────────────────────────────────┘  │
├──────────────────────────────────────────────┤
│                File Metadata                 │
│  ┌────────────────────────────────────────┐  │
│  │ Schema:                                │  │
│  │  • Product: string                     │  │
│  │  • Customer: string                    │  │
│  │  • Country: string                     │  │
│  │  • Date: date                          │  │
│  │  • Sales Amount: double                │  │
│  ├────────────────────────────────────────┤  │
│  │ Compression Codec: Snappy              │  │
│  ├────────────────────────────────────────┤  │
│  │ Num Row Groups: 2                      │  │
│  ├────────────────────────────────────────┤  │
│  │ Offsets to Row Groups                  │  │
│  │  • Row Group 1: offset 128             │  │
│  │  • Row Group 2: offset 1024            │  │
│  └────────────────────────────────────────┘  │
├──────────────────────────────────────────────┤
│                File Footer                   │
│  ┌────────────────────────────────────────┐  │
│  │ Offset to File Metadata: 2048          │  │
│  │ Magic Number: "PAR1"                   │  │
│  └────────────────────────────────────────┘  │
└──────────────────────────────────────────────┘

PAR1 - A 4-byte string "PAR1" indicating this is a Parquet file.

Compression codec - the type of compression used (e.g., Snappy or GZip).

Snappy

  • Low CPU Util
  • Low Compression Rate
  • Splittable
  • Use Case: Hot Layer
  • Compute Intensive

GZip

  • High CPU Util
  • High Compression Rate
  • Splittable
  • Use Case: Cold Layer
  • Storage Intensive
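
The codec is chosen when the file is written. A minimal sketch using pandas (the same library used in the Python example later in this section); the DataFrame and file names are made-up placeholders:

import pandas as pd

# Placeholder data standing in for a real dataset
df = pd.DataFrame({"product": ["Ball", "T-Shirt", "Socks"], "sales": [100, 200, 150]})

# Snappy: faster, lighter compression (hot layer)
df.to_parquet("sales_snappy.parquet", compression="snappy")

# GZip: slower, higher compression (cold layer)
df.to_parquet("sales_gzip.parquet", compression="gzip")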

Encoding

Encoding is the process of converting data into a different format to:

  • Save space (compression)
  • Enable efficient processing
  • Support interoperability between systems

Analogy: packing clothes loosely into a suitcase vs. organizing them into separate sections for easier retrieval.

Plain Encoding

  • Stores raw values as-is (row-by-row, then column-by-column).
  • Default for columns that don't compress well or have high cardinality (too many unique values, e.g., id or email). Example: Sales Amount.

Dictionary Encoding

  • Stores a dictionary of unique values and then stores references (indexes) to those values in the data pages.

  • Great for columns with repeated values.

Example:

- 0: Ball
- 1: T-Shirt
- 2: Socks
- Data Page: [0,1,2,2,1,2]

Reduces storage for repetitive values like "Socks".
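
PyArrow exposes the same idea directly, which makes the dictionary and the index array easy to see. A minimal sketch (assumes the pyarrow package is installed):

import pyarrow as pa

products = pa.array(["Ball", "T-Shirt", "Socks", "Socks", "T-Shirt", "Socks"])

# Dictionary-encode the column: unique values + integer indexes
encoded = products.dictionary_encode()

print(encoded.dictionary)   # ["Ball", "T-Shirt", "Socks"]
print(encoded.indices)      # [0, 1, 2, 2, 1, 2]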

Run-Length Encoding (RLE)

  • Compresses consecutive repeated values into a count + value pair.

  • Ideal when the data is sorted or has runs of the same value.

Example:

If Country column was sorted: [USA, USA, USA, UK, UK, UK]

RLE: [(3, USA), (3, UK)]

  • Efficient storage for sorted or grouped data.

Delta Encoding

  • Stores the difference between consecutive values.

  • Best for numeric columns with increasing or sorted values (like dates).

Example:

Date column: [2023-01-01, 2023-01-02, 2023-01-03, ...]
Delta Encoding: [2023-01-01, +1, +1, +1, ...]

  • Very compact for sequential data.

Bit Packing

  • Packs small integers using only the bits needed rather than a full byte.

  • Often used with dictionary-encoded indexes.

Example:

Dictionary indexes for Product: [0,1,2,2,1,2]

Needs only 2 bits to represent values (00, 01, 10).

Saves space vs. storing full integers.

Key Features of Parquet

Columnar Storage

Schema Evolution

  • Supports complex nested data structures (arrays, maps, structs).
  • Allows the schema to evolve over time, making it highly flexible for changing data models.

Compression

  • Parquet allows the use of highly efficient compression algorithms like Snappy and Gzip.

  • Columnar layout improves compression by grouping similar data together—leading to significant storage savings.

Various Encodings

Language Agnostic

  • Parquet is built from the ground up for cross-language compatibility.
  • Official libraries exist for Java, C++, Python, and many other languages—making it easy to integrate with diverse tech stacks.

Seamless Integration

  • Designed to integrate smoothly with a wide range of big data frameworks, including:

    • Apache Hadoop
    • Apache Spark
    • Amazon Glue/Athena
    • Clickhouse
    • DuckDB
    • Snowflake
    • and many more.

Python Example


import pandas as pd

file_path = 'https://raw.githubusercontent.com/gchandra10/filestorage/main/sales_100.csv'

# Read the CSV file
df = pd.read_csv(file_path)

# Display the first few rows of the DataFrame
print(df.head())

# Write DataFrame to a Parquet file
df.to_parquet('sample.parquet')

Some utilities to inspect Parquet files

WIN/MAC

https://aloneguid.github.io/parquet-dotnet/parquet-floor.html#installing

MAC

https://github.com/hangxie/parquet-tools
parquet-tools row-count sample.parquet
parquet-tools schema sample.parquet
parquet-tools cat sample.parquet
parquet-tools meta sample.parquet

Remote Files

parquet-tools row-count https://github.com/gchandra10/filestorage/raw/refs/heads/main/sales_onemillion.parquet
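
The same inspection can be done in Python with pyarrow, which also exposes the row-group metadata (row counts and per-column min/max statistics) described earlier. A minimal sketch, assuming sample.parquet is the file created in the Python example above:

import pyarrow.parquet as pq

pf = pq.ParquetFile("sample.parquet")

print(pf.schema_arrow)                           # column names and types
print("rows:", pf.metadata.num_rows)
print("row groups:", pf.metadata.num_row_groups)

# Min/max statistics for the first column of the first row group
# (statistics may be None if the writer did not record them)
stats = pf.metadata.row_group(0).column(0).statistics
print(stats.min, stats.max)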

#bigdata #dataformat #parquet #columnar #compressed

Last change: 2026-01-17

[Avg. reading time: 18 minutes]

Apache Arrow

Apache Arrow is a universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics. It contains a set of technologies that enable data systems to efficiently store, process, and move data.

It enables zero-copy reads, cross-language compatibility, and fast data interchange between tools (like Pandas, Spark, R, and more).

Why another format?

Traditional formats (like CSV, JSON, or even Parquet) are often optimized for storage rather than in-memory analytics.

Arrow focuses on:

  • Speed: Using vector processing (SIMD, Single Instruction Multiple Data), analytics tasks can run up to 10x faster on modern CPUs: one CPU instruction operates on multiple data elements at the same time.

Vector here means a sequence of data elements (like an array or a column). Vector processing is a computing technique where a single instruction operates on an entire vector of data at once, rather than on one data point at a time.

Row-wise

Each element is processed one at a time.

data = [1, 2, 3, 4]
for i in range(len(data)):
    data[i] = data[i] + 10

Vectorized

The CPU applies the addition across the entire vector at once.

import numpy as np

data = np.array([1, 2, 3, 4])
result = data + 10

  • Interoperability: Share data between Python, R, C++, Java, Rust, etc. without serialization overhead.

  • Efficiency: Supports nested structures and complex types.

Arrow supports Zero-Copy.

Analogy: an English speaker addressing an audience that speaks many different languages.

Parquet -> the speaker's notes are stored in a document, read by different people, and translated at their own pace.

Arrow -> the speech is instantly available to everyone in their native language, with no additional serialization or deserialization, thanks to zero-copy.

  • NumPy = Optimized compute (fast math, but Python-only).

  • Parquet = Optimized storage (compressed, universal, but needs deserialization on read).

  • Arrow = Optimized interchange (in-memory, zero-copy, instantly usable across languages).

Demonstration (With and Without Vectorization)


import time
import numpy as np
import pyarrow as pa

N = 10_000_000
data_list = list(range(N))           # Python list
data_array = np.arange(N)            # NumPy array
arrow_arr = pa.array(data_list)      # Arrow array
np_from_arrow = arrow_arr.to_numpy() # Convert Arrow buffer to NumPy

# ---- Traditional Python list loop ----
start = time.time()
result1 = [x + 1 for x in data_list]
print(f"List processing time: {time.time() - start:.4f} seconds")

# ---- NumPy vectorized ----
start = time.time()
result2 = data_array + 1
print(f"NumPy processing time: {time.time() - start:.4f} seconds")

# ---- Arrow + NumPy ----
start = time.time()
result3 = np_from_arrow + 1
print(f"Arrow + NumPy processing time: {time.time() - start:.4f} seconds")

Read Parquet > Arrow table > NumPy view > ML model > back to Arrow > save Parquet.
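
A minimal sketch of that round trip with pyarrow; the file names and the Price column are placeholders, and the "ML model" step is reduced to a simple NumPy calculation:

import pyarrow as pa
import pyarrow.parquet as pq

# Parquet -> Arrow table (columnar, in memory)
table = pq.read_table("sample.parquet")

# Arrow column -> NumPy array for compute (stand-in for an ML step)
prices = table.column("Price").to_numpy()
discounted = prices * 0.9

# Results back into Arrow, then saved as Parquet again
result = table.append_column("discounted_price", pa.array(discounted))
pq.write_table(result, "sample_scored.parquet")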

Use Cases

Data Science & Machine Learning

  • Share data between Pandas, Spark, R, and ML libraries without copying or converting.

Streaming & Real-Time Analytics

  • Ideal for passing large datasets through streaming frameworks with low latency.

Data Exchange

  • Move data between different systems with a common representation (e.g. Pandas → Spark → R).

Big Data

  • Integrates with Parquet, Avro, and other formats for ETL and analytics.

Parquet vs Arrow

| Feature     | Apache Arrow                                           | Apache Parquet                       |
|-------------|--------------------------------------------------------|--------------------------------------|
| Purpose     | In-memory processing & interchange                     | On-disk storage & compression        |
| Storage     | Data kept in RAM (zero-copy)                           | Data stored on disk (columnar files) |
| Compression | Typically uncompressed (can compress via IPC streams)  | Built-in compression (Snappy, Gzip)  |
| Usage       | Analytics engines, data exchange                       | Data warehousing, analytics storage  |
| Query       | In-memory, real-time querying                          | Batch analytics, query engines       |

Think of Arrow as the in-memory twin of Parquet: Arrow is perfect for fast, interactive analytics; Parquet is great for long-term, compressed storage.

Terms to Know

RPC (Remote Procedure Call)

A Remote Procedure Call (RPC) is a software communication protocol that one program uses to request a service from another program located on a different computer and network, without having to understand the network's details.

Specifically, RPC is used to call other processes on remote systems as if the process were a local system. A procedure call is also sometimes known as a function call or a subroutine call.

Analogy: ordering food through a delivery app is like an RPC. You don't know who takes the request, who prepares it, how it's prepared, who delivers it, or what the traffic is like. RPC abstracts away the network communication and the details between systems.

Examples: Discord, WhatsApp. You just use your phone, but behind the scenes many remote calls do the work.

DEMO

git clone https://github.com/gchandra10/python_rpc_demo.git



Arrow Flight

Apache Arrow Flight is a high-performance RPC (Remote Procedure Call) framework built on top of Apache Arrow.

It’s designed to efficiently transfer large Arrow datasets between systems over the network — avoiding slow serialization steps common in traditional APIs.

Uses gRPC under the hood for network communication.

Arrow vs Arrow Flight

| Feature       | Apache Arrow                 | Arrow Flight                              |
|---------------|------------------------------|-------------------------------------------|
| Purpose       | In-memory, columnar format   | Efficient transport of Arrow data         |
| Storage       | Data in-memory (RAM)         | Data transfer between systems             |
| Serialization | None (data is already Arrow) | Uses Arrow IPC but optimized via Flight   |
| Communication | No network built-in          | Uses gRPC for client-server data transfer |
| Performance   | Fast in-memory reads         | Fast networked transfer of Arrow data     |

Traditional vs Arrow Flight

Arrow Flight SQL

  • Adds SQL support on top of Arrow Flight.
  • Submit SQL queries to a server and receive Arrow Flight responses.
  • Makes it easier for BI tools (e.g., Tableau, Power BI) to connect to a Flight SQL server.

ADBC

ADBC stands for Arrow Database Connectivity. It’s a set of libraries and standards that define how to connect to databases using Apache Arrow data structures.

Think of it as a modern, Arrow-based alternative to ODBC/JDBC — but built for columnar analytics and big data workloads.

#dataformat #arrow #flightsql #flightrpc #adbc

Last change: 2026-01-17

[Avg. reading time: 3 minutes]

Delta

Delta Lake is an open-source storage layer that brings ACID transactions to Apache Spark and big data workloads. It sits on top of existing cloud storage systems like S3, ADLS, or GCS and adds transactional consistency and schema enforcement to your Parquet files.

Use Cases

Data Lakes with ACID Guarantees: Perfect for real-time and batch data processing in Data Lake environments.

Streaming + Batch Workflows: Unified processing with support for incremental updates.

Time Travel: Easy rollback and audit of data versions.

Upserts (MERGE INTO): Efficient updates/deletes on Parquet data using Spark SQL.

Slowly Changing Dimensions (SCD): Managing dimension tables in a data warehouse setup.

Technical Context

Underlying Format: Parquet

Transaction Log: _delta_log folder with JSON commit files

Operations Supported:

  • MERGE
  • UPDATE / DELETE
  • OPTIMIZE / ZORDER

Integration: Supported in open source via delta-rs, Delta Kernel, and the Delta Standalone Reader.

git clone https://github.com/gchandra10/python_delta_demo
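
Beyond the demo repo above, here is a minimal sketch using the deltalake package (the Python bindings for delta-rs); the table path and data are placeholders:

import pandas as pd
from deltalake import DeltaTable, write_deltalake

df = pd.DataFrame({"id": [1, 2], "name": ["Rachel", "Ross"]})

# Every write adds a JSON commit to the _delta_log transaction log
write_deltalake("tmp/friends_delta", df)
write_deltalake("tmp/friends_delta", df, mode="append")

dt = DeltaTable("tmp/friends_delta")
print(dt.version())      # latest version number (0, 1, ...)
print(dt.to_pandas())    # read the current snapshot as a DataFrame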

#bigdata #delta #acid

Last change: 2026-01-17

[Avg. reading time: 10 minutes]

Serialization-Deserialization

(SerDe)

Serialization converts a data structure or object state into a format that can be stored or transmitted (e.g., file, message, or network).

Deserialization is the reverse process, reconstructing the original object from the serialized form.

Example: (Python/Scala/Rust) objects are serialized to JSON and then deserialized back into (Python/Scala/Rust) objects.

Analogy: translating from Spanish into English (a universal intermediate language) and then into German.

JSON


import json

# Serialization
data = {"name": "Alice", "age": 25, "city": "New York"}
json_str = json.dumps(data)
print(json_str)

# Deserialization
obj = json.loads(json_str)
print(obj["name"])

AVRO

Apache Avro is a binary serialization format designed for efficiency, compactness, and schema evolution.

  • Compact & Efficient: Binary encoding → smaller and faster than JSON.
  • Schema Evolution: Supports backward/forward compatibility.
  • Rich Data Types: Handles nested, array, map, union types.
  • Language Independent: Works across Python, Java, Scala, Rust, etc.
  • Big Data Integration: Works seamlessly with Hadoop, Kafka, Spark.
  • Self-Describing: Schema travels with the data.

Schemas

An Avro schema defines the structure of the Avro data format. It’s a JSON document that describes your data types and protocols, ensuring that even complex data structures are adequately represented. The schema is crucial for data serialization and deserialization, allowing systems to interpret the data correctly.

Example of Avro Schema

{
  "type": "record",
  "name": "Person",
  "namespace": "com.example",
  "fields": [
    {"name": "firstName", "type": "string"},
    {"name": "lastName", "type": "string"},
    {"name": "age", "type": "int"},
    {"name": "email", "type": ["null", "string"], "default": null}
  ]
}

Here is the list of Primitive & Complex Data Types which Avro supports:

  • null,boolean,int,long,float,double,bytes,string
  • records,enums,arrays,maps,unions,fixed
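
A minimal serialize/deserialize sketch with the fastavro package, reusing the Person schema above (person.avro is just a placeholder file name):

from fastavro import parse_schema, reader, writer

schema = {
    "type": "record",
    "name": "Person",
    "namespace": "com.example",
    "fields": [
        {"name": "firstName", "type": "string"},
        {"name": "lastName", "type": "string"},
        {"name": "age", "type": "int"},
        {"name": "email", "type": ["null", "string"], "default": None},
    ],
}

records = [{"firstName": "Rachel", "lastName": "Green", "age": 30, "email": None}]

# Serialize: binary Avro file with the schema embedded in it
with open("person.avro", "wb") as out:
    writer(out, parse_schema(schema), records)

# Deserialize: records come back as Python dictionaries
with open("person.avro", "rb") as f:
    for record in reader(f):
        print(record)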

JSON vs Avro

| Feature          | JSON                                        | Avro                                             |
|------------------|---------------------------------------------|--------------------------------------------------|
| Format Type      | Text-based (human-readable)                 | Binary (machine-efficient)                       |
| Size             | Larger (verbose)                            | Smaller (compact)                                |
| Speed            | Slower to serialize/deserialize             | Much faster (binary encoding)                    |
| Schema           | Optional / loosely defined                  | Mandatory and embedded with data                 |
| Schema Evolution | Not supported                               | Fully supported (backward & forward compatible)  |
| Data Types       | Basic (string, number, bool, array, object) | Rich (records, enums, arrays, maps, unions, fixed) |
| Readability      | Human-friendly                              | Not human-readable                               |
| Integration      | Common in APIs, configs                     | Common in Big Data (Kafka, Spark)                |
| Use Case         | Simple data exchange (REST APIs)            | High-performance data pipelines, streaming systems |

In short,

  • Use JSON when simplicity & readability matter.
  • Use Avro when performance, compactness, and schema evolution matter (especially in Big Data systems).

git clone https://github.com/gchandra10/python_serialization_deserialization_examples.git

Parquet vs Avro

| Feature                | Avro                                                   | Parquet                                             |
|------------------------|--------------------------------------------------------|-----------------------------------------------------|
| Format Type            | Row-based binary format                                | Columnar binary format                              |
| Best For               | Streaming, message passing, row-oriented reads/writes  | Analytics, queries, column-oriented reads           |
| Compression            | Moderate (row blocks)                                  | Very high (per column)                              |
| Read Pattern           | Reads entire rows                                      | Reads only required columns → faster for queries    |
| Write Pattern          | Fast row inserts / appends                             | Best for batch writes (not streaming-friendly)      |
| Schema                 | Embedded JSON schema, supports evolution               | Embedded schema, supports evolution (with constraints) |
| Data Evolution         | Flexible backward/forward compatibility                | Supported, but limited (column addition/removal)    |
| Use Case               | Kafka, Spark streaming, data ingestion pipelines       | Data warehouses, lakehouse tables, analytics queries |
| Integration            | Hadoop, Kafka, Spark, Hive                             | Spark, Hive, Trino, Databricks, Snowflake           |
| Readability            | Not human-readable                                     | Not human-readable                                  |
| Typical File Extension | .avro                                                  | .parquet                                            |

#serialization #deserialization #avro

Last change: 2026-01-17

[Avg. reading time: 1 minute]

Protocols

  1. Introduction
  2. HTTP
  3. Monolithic Architecture
  4. Statefulness
  5. Microservices
  6. Statelessness
  7. Idempotency
  8. REST API
  9. API Performance
  10. API in Big Data world

[Avg. reading time: 2 minutes]

Introduction

Protocols are standardized rules that govern how data is transmitted, formatted, and processed across systems.

In Big Data, protocols are essential for:

  • Data ingestion (getting data in)
  • Inter-node communication in clusters
  • Remote access to APIs/services
  • Serialization of structured data
  • Security and authorization

| Protocol       | Layer         | Use Case Example                     |
|----------------|---------------|--------------------------------------|
| HTTP/HTTPS     | Application   | REST API for ingesting external data |
| Kafka          | Messaging     | Stream processing with Spark or Flink |
| gRPC           | RPC           | Microservices in ML model serving    |
| MQTT           | Messaging     | IoT data push to cloud               |
| Avro/Proto     | Serialization | Binary data for logs and schema      |
| OAuth/Kerberos | Security      | Secure access to data lakes          |

#protocols #grpc #http #mqtt

Last change: 2026-01-17

[Avg. reading time: 2 minutes]

HTTP

Basics

HTTP (HyperText Transfer Protocol) is the foundation of data communication on the web, used to transfer data (such as HTML files and images).

GET - Navigate to a URL or click a link in real life.

POST - Submit a form on a website, like a username and password.


200 Series (Success): 200 OK, 201 Created.

300 Series (Redirection): 301 Moved Permanently, 302 Found.

400 Series (Client Error): 400 Bad Request, 401 Unauthorized, 404 Not Found.

500 Series (Server Error): 500 Internal Server Error, 503 Service Unavailable.
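
A quick way to see these status codes from Python is the requests library (used again in the REST API section later); the Zippopotam.us endpoint below also appears among the sample APIs there:

import requests

# Successful GET -> 200 OK
r = requests.get("https://api.zippopotam.us/us/08028")
print(r.status_code)

# A resource that does not exist -> a 400-series client error (typically 404)
r = requests.get("https://api.zippopotam.us/us/00000")
print(r.status_code)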

#http #get #put #post #statuscodes

Last change: 2026-01-17

[Avg. reading time: 3 minutes]

Monolithic Architecture

Definition: A monolithic architecture is a software design pattern in which an application is built as a unified unit. All application components (user interface, business logic, and data access layers) are tightly coupled and run as a single service.

Characteristics: This architecture is simple to develop, test, deploy, and scale vertically. However, it can become complex and unwieldy as the application grows.

Monolithic

Examples

  • Traditional Banking Systems.
  • Enterprise Resource Planning (SAP ERP) Systems.
  • Content Management Systems like WordPress.
  • Legacy Government Systems. (Tax filing, public records management, etc.)

Advantages and Disadvantages

Advantages: Simplicity in development and deployment, straightforward scaling by running additional copies of the entire application, and often easier debugging since all components are in one place. Reduced latency, as in the Amazon Prime Video case.

Disadvantages: Scaling challenges, difficulty implementing changes or updates (especially in large systems), and potential for more extended downtime during maintenance.

#monolithic #banking #amazonprime #tightlycoupled

Last change: 2026-01-17

[Avg. reading time: 8 minutes]

Statefulness

The server stores information about the client’s current session in a stateful system. This is common in traditional web applications. Here’s what characterizes a stateful system:

Session Memory: The server remembers past interactions and may store session data like user authentication, preferences, and other activities.

Server Dependency: Since the server holds session data, the same server usually handles subsequent requests from the same client. This is important for consistency.

Resource Intensive: Maintaining state can be resource-intensive, as the server needs to manage and store session data for each client.

Example: A web application where a user logs in, and the server keeps track of their authentication status and interactions until they log out.

Statefulness

In this diagram:

Initial Request: The client sends the initial request to the load balancer.

Load Balancer to Server 1: The load balancer forwards the request to Server 1.

Response with Session ID: Server 1 responds to the client with a session ID, establishing a sticky session.

Subsequent Requests: The client sends subsequent requests with the session ID.

Load Balancer Routes to Server 1: The load balancer forwards these requests to Server 1 based on the session ID, maintaining the sticky session.

Server 1 Processes Requests: Server 1 continues to handle requests from this client.

Server 2 Unused: Server 2 remains unused for this particular client due to the stickiness of the session with Server 1.

Stickiness (Sticky Sessions)

Stickiness or sticky sessions are used in stateful systems, particularly in load-balanced environments. It ensures that requests from a particular client are directed to the same server instance. This is important when:

Session Data: The server needs to maintain session data (like login status), and it’s stored locally on a specific server instance.

Load Balancers: In a load-balanced environment, without stickiness, a client’s requests could be routed to different servers, which might not have the client’s session data.

Trade-off: While it helps maintain session continuity, it can reduce the load balancing efficiency and might lead to uneven server load.

Methods of Implementing Stickiness

Cookie-Based Stickiness: The most common method, where the load balancer uses a special cookie to track the server assigned to a client.

IP-Based Stickiness: The load balancer routes requests based on the client’s IP address, sending requests from the same IP to the same server.

Custom Header or Parameter: Some load balancers can use custom headers or URL parameters to track and maintain session stickiness.

#stateful #stickiness #loadbalancer

Last change: 2026-01-17

[Avg. reading time: 9 minutes]

Microservices

Microservices architecture is a method of developing software applications as a suite of small, independently deployable services. Each service in a microservices architecture is focused on a specific business capability, runs in its process, and communicates with other services through well-defined APIs. This approach stands in contrast to the traditional monolithic architecture, where all components of an application are tightly coupled and run as a single service.

Characteristics:

Modularity: The application is divided into smaller, manageable pieces (services), each responsible for a specific function or business capability.

Independence: Each microservice is independently deployable, scalable, and updatable. This allows for faster development cycles and easier maintenance.

Decentralized Control: Microservices promote decentralized data management and governance. Each service manages its data and logic.

Technology Diversity: Teams can choose the best technology stack for their microservice, leading to a heterogeneous technology environment.

Resilience: Failure in one microservice doesn’t necessarily bring down the entire application, enhancing the system’s overall resilience.

Scalability: Microservices can be scaled independently, allowing for more efficient resource utilization based on demand for specific application functions.

Microservices

Data Ingestion Microservices: Collect and process data from multiple sources.

Data Storage: Stores processed weather data and other relevant information.

User Authentication Microservice: Manages user authentication and communicates with the User Database for validation.

User Database: Stores user account information and preferences.

API Gateway: Central entry point for API requests, routes requests to appropriate microservices, and handles user authentication.

User Interface Microservice: Handles the logic for the user interface, serving web and mobile applications.

Data Retrieval Microservice: Fetches weather data from the Data Storage and provides it to the frontends.

Web Frontend: The web interface for end-users, making requests through the API Gateway.

Mobile App Backend: Backend services for the mobile application, also making requests through the API Gateway.

Advantages:

Agility and Speed: Smaller codebases and independent deployment cycles lead to quicker development and faster time-to-market.

Scalability: It is easier to scale specific application parts that require more resources.

Resilience: Isolated services reduce the risk of system-wide failures.

Flexibility in Technology Choices: Microservices can use different programming languages, databases, and software environments.

Disadvantages:

Complexity: Managing a system of many different services can be complex, especially regarding network communication, data consistency, and service discovery.

Overhead: Each microservice might need its own database and transaction management, leading to duplication and increased resource usage.

Testing Challenges: Testing inter-service interactions can be more complex compared to a monolithic architecture.

Deployment Challenges: Requires robust DevOps practices, including continuous integration and continuous deployment (CI/CD) pipelines.

#microservices #RESTAPI #CICD

Last change: 2026-01-17

[Avg. reading time: 6 minutes]

Statelessness

In a stateless system, each request from the client must contain all the information the server needs to fulfill that request. The server does not store any state of the client’s session. This is a crucial principle of RESTful APIs. Characteristics include:

No Session Memory: The server remembers nothing about the user once the transaction ends. Each request is independent.

Scalability: Stateless systems are generally more scalable because the server doesn’t need to maintain session information. Any server can handle any request.

Simplicity and Reliability: The stateless nature makes the system simpler and more reliable, as there’s less information to manage and synchronize across systems.

Example: An API where each request contains an authentication token and all necessary data, allowing any server instance to handle any request.

Statelessness

In this diagram:

Request 1: The client sends a request to the load balancer.

Load Balancer to Server 1: The load balancer forwards Request 1 to Server 1.

Response from Server 1: Server 1 processes the request and sends a response back to the client.

Request 2: The client sends another request to the load balancer.

Load Balancer to Server 2: This time, the load balancer forwards Request 2 to Server 2.

Response from Server 2: Server 2 processes the request and responds to the client.

Statelessness: Each request is independent and does not rely on previous interactions. Different servers can handle other requests without needing a shared session state.

Token-Based Authentication

Common in stateless architectures, this method involves passing a token for authentication with each request instead of relying on server-stored session data. JWT (JSON Web Tokens) is a popular example.
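
A minimal sketch of token-based access with the requests library; the URL and token below are placeholders, not a real service:

import requests

token = "eyJhbGciOi..."   # placeholder JWT issued at login time

# The token travels with every request, so no server-side session is needed
response = requests.get(
    "https://api.example.com/orders",                 # hypothetical endpoint
    headers={"Authorization": f"Bearer {token}"},
)
print(response.status_code)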

#statelessness #jwt #REST

Last change: 2026-01-17

[Avg. reading time: 2 minutes]

Idempotency

In simple terms, idempotency is the property where an operation can be applied multiple times without changing the result beyond the initial application.

Think of an elevator button: whether you press it once or mash it ten times, the elevator is still only called once to your floor. The first press changed the state; the subsequent ones are “no-ops.”

In technology, this is the “secret sauce” for reliability. If a network glitch occurs and a request is retried, idempotency ensures you don’t end up with duplicate orders, double payments, or corrupted data.

Idempotency

Popular Examples

  • The MERGE (Upsert) operation (see the sketch after this list)
  • ABS(-5)
  • Using Terraform to deploy a server
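
A minimal Python sketch of the upsert example: running the same operation once or many times leaves the data in the same final state.

def upsert(table: dict, key: str, value: dict) -> dict:
    """Insert or update a record; repeating the call does not change the outcome."""
    table[key] = value
    return table

customers = {}
upsert(customers, "c1", {"name": "Rachel", "city": "New York"})
upsert(customers, "c1", {"name": "Rachel", "city": "New York"})   # retried request, same result

print(customers)                   # still exactly one record for "c1"
print(abs(abs(-5)) == abs(-5))     # True: ABS is idempotent as well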

#idempotent #merge #upsert #terraform #abs

Last change: 2026-01-17

[Avg. reading time: 9 minutes]

REST API

REpresentational State Transfer is a software architectural style developers apply to web APIs.

REST APIs provide simple, uniform interfaces because they can be used to make data, content, algorithms, media, and other digital resources available through web URLs. Essentially, REST APIs are the most common APIs used across the web today.

Use of a uniform interface

REST API

HTTP Methods

GET: This method allows the server to find the data you requested and send it back to you.

POST: This method permits the server to create a new entry in the database.

PUT: If you perform the ‘PUT’ request, the server will update an entry in the database.

DELETE: This method allows the server to delete an entry in the database.

Sample REST API URIs

https://api.zippopotam.us/us/08028

http://api.tvmaze.com/search/shows?q=friends

https://jsonplaceholder.typicode.com/posts

https://jsonplaceholder.typicode.com/posts/1

https://jsonplaceholder.typicode.com/posts/1/comments

https://reqres.in/api/users?page=2

https://reqres.in/api/users/2

http://universities.hipolabs.com/search?country=United+States

https://itunes.apple.com/search?term=michael&limit=1000

https://www.boredapi.com/api/activity

https://techcrunch.com/wp-json/wp/v2/posts?per_page=100&context=embed

Usage

curl https://api.zippopotam.us/us/08028
curl https://api.zippopotam.us/us/08028 -o zipdata.json

Browser based

https://httpie.io/app

VS Code based

Get Thunder Client

Python way

using requests library
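
A minimal sketch with requests, calling one of the sample endpoints listed above:

import requests

# GET: fetch a resource
response = requests.get("https://jsonplaceholder.typicode.com/posts/1")
print(response.status_code)     # 200
print(response.json())          # parsed JSON body

# POST: create a new entry (jsonplaceholder fakes the write and echoes it back)
new_post = {"title": "Big Data", "body": "REST in practice", "userId": 1}
created = requests.post("https://jsonplaceholder.typicode.com/posts", json=new_post)
print(created.status_code)      # 201 Created
print(created.json())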

Summary

Definition: REST (Representational State Transfer) API is a set of guidelines for building web services. A RESTful API is an API that adheres to these guidelines and allows for interaction with RESTful web services.

How It Works: REST uses standard HTTP methods like GET, POST, PUT, DELETE, etc. It is stateless, meaning each request from a client to a server must contain all the information needed to understand and complete the request.

Data Format: REST APIs typically exchange data in JSON or XML format.

Purpose: REST APIs are designed to be a simple and standardized way for systems to communicate over the web. They enable the backend services to communicate with front-end applications (like SPAs) or other services.

Use Cases: REST APIs are used in web services, mobile applications, and IoT (Internet of Things) applications for various purposes like fetching data, sending commands, and more.

#restapi #REST #curl #requests

Last change: 2026-01-17

[Avg. reading time: 7 minutes]

API Performance

API Performance

Caching

Store frequently accessed data in a cache so you can access it faster.

If there’s a cache miss, fetch the data from the database.

It’s pretty effective, but it can be challenging to invalidate and decide on the caching strategy.

Scale-out with Load Balancing

You can consider scaling your API to multiple servers if one server instance isn’t enough. Horizontal scaling is the way to achieve this.

The challenge will be to find a way to distribute requests between these multiple instances.

Load Balancing

It not only helps with performance but also makes your application more reliable.

However, load balancers work best when your application is stateless and easy to scale horizontally.

Pagination

If your API returns many records, you need to explore Pagination.

You limit the number of records per request.

This improves the response time of your API for the consumer.
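
A minimal pagination sketch against the reqres.in sample API listed in the previous section; the page and per_page parameters and the data field follow that API's response format:

import requests

page = 1
while True:
    r = requests.get("https://reqres.in/api/users", params={"page": page, "per_page": 3})
    body = r.json()
    if not body.get("data"):              # an empty page means no more records
        break
    for user in body["data"]:
        print(user["id"], user["email"])
    page += 1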

Async Processing

With async processing, you can let the clients know that their requests are registered and under process.

Then, you process the requests individually and communicate the results to the client later.

This allows your application server to take a breather and give its best performance.

But of course, async processing may not be possible for every requirement.

Connection Pooling

An API often needs to connect to the database to fetch some data.

Creating a new connection for each request can degrade performance.

It’s a good idea to use connection pooling to set up a pool of database connections that can be reused across requests.

This is a subtle aspect, but connection pooling can dramatically impact performance in highly concurrent systems.
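
Database drivers and ORMs usually provide a pool out of the box; the same principle is easy to see client-side with HTTP, where a requests.Session reuses TCP connections instead of opening a new one per call. A minimal sketch:

import requests

# One Session = one connection pool, reused across requests
session = requests.Session()

for page in range(1, 4):
    r = session.get("https://reqres.in/api/users", params={"page": page})
    print(page, r.status_code)

session.close()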

YT Visual representation

#api #performance #loadbalancing #pagination #connectionpool

Last change: 2026-01-17

[Avg. reading time: 4 minutes]

API in Big Data World

Big data and REST APIs are often used together in modern data architectures. Here’s how they interact:

Ingestion gateway

  • Applications push events through REST endpoints
  • Gateway converts to Kafka, Kinesis, or file landing zones
  • REST is the entry door, not the pipeline itself

Serving layer

  • Processed data in Hive, Elasticsearch, Druid, or Delta
  • APIs expose aggregated results to apps and dashboards
  • REST is the read interface on top of heavy compute

Control plane

  • Spark job submission via REST
  • Kafka topic management
  • cluster monitoring and scaling
  • authentication and governance

Microservices boundary

  • Each service owns a slice of data
  • APIs expose curated views
  • internal pipelines stay streaming or batch

What REST is NOT in Big Data

  • Not used for bulk petabyte transfer
  • Not used inside Spark transformations
  • Not the transport between Kafka and processors

Examples of APIs

https://docs.redis.com/latest/rs/references/rest-api/

https://rapidapi.com/search/big-data

https://www.kaggle.com/discussions/general/315241

#apiinbigdata #kafka #sparks

Last change: 2026-01-17

[Avg. reading time: 3 minutes]

Advanced Python

  1. Functional Programming Concepts
  2. Decorator
  3. Python Classes
  4. Unit Testing
  5. Data Frames
  6. Error Handling
  7. Logging


[Avg. reading time: 20 minutes]

Functional Programming Concepts

Functional programming in Python emphasizes the use of functions as first-class citizens, immutability, and declarative code that avoids changing state and mutable data.

In the example below, count is a local variable, so every call starts from zero and counter() always returns 1: the function keeps no state between calls.

def counter():
    count = 0  # Initialize the state
    count += 1
    return count

print(counter())
print(counter())
print(counter())

Internal state & mutable data: a regular function can keep state between calls, here via a function attribute.

def counter():
    # Define an internal state using an attribute
    if not hasattr(counter, "count"):
        counter.count = 0  # Initialize the state

    # Modify the internal state
    counter.count += 1
    return counter.count

print(counter())
print(counter())
print(counter())

Internal state & immutability

A pure function returns the same output for the same input, no matter how many times it is called.

increment = lambda x: x + 1

print(increment(5))  # Output: 6
print(increment(5))  # Output: 6

Using Lambda

Lambda functions are a way to write quick, one-off functions without defining a full function using def.

Example without Lambda

def square(x):
    return x ** 2

print(square(4))

Using Lambda

square = lambda x: x ** 2
print(square(4))

Without Lambda

def get_age(person):
    return person['age']

people = [
    {'name': 'Alice', 'age': 30},
    {'name': 'Bob', 'age': 25},
    {'name': 'Charlie', 'age': 35}
]

# Using a defined function to sort
sorted_people = sorted(people, key=get_age)
print(sorted_people)

Using Lambda

people = [
    {'name': 'Alice', 'age': 30},
    {'name': 'Bob', 'age': 25},
    {'name': 'Charlie', 'age': 35}
]

# Using a lambda function to sort
sorted_people = sorted(people, key=lambda person: person['age'])
print(sorted_people)

Map, Filter, Reduce Functions

Map, filter, and reduce are higher-order functions in Python that enable a functional programming style, allowing you to work with data collections in a more expressive and declarative manner.

Map

The map() function applies a given function to each item of an iterable (like a list or tuple) and returns an iterator with the results.

Map Without Functional Approach

numbers = [1, 2, 3, 4, 5]
squares = []
for num in numbers:
    squares.append(num ** 2)
print(squares)  # Output: [1, 4, 9, 16, 25]

Map With Lambda and Map

numbers = [1, 2, 3, 4, 5]
squares = list(map(lambda x: x ** 2, numbers))
print(squares)  # Output: [1, 4, 9, 16, 25]

Filter

The filter() function filters items out of an iterable based on whether they meet a condition defined by a function, returning an iterator with only those elements for which the function returns True.

Filter Without Functional Approach

numbers = [1, 2, 3, 4, 5]
evens = []
for num in numbers:
    if num % 2 == 0:
        evens.append(num)
print(evens)  # Output: [2, 4]

Filter using Functional Approach

numbers = [1, 2, 3, 4, 5]
evens = list(filter(lambda x: x % 2 == 0, numbers))
print(evens)  # Output: [2, 4]

Reduce

The reduce() function, from the functools module, applies a rolling computation to pairs of values in an iterable. It reduces the iterable to a single accumulated value.

At the same time, in many cases, simpler functions like sum() or loops may be more readable.

Reduce Without Functional Approach

  • First, 1 * 2 = 2
  • Then, 2 * 3 = 6
  • Then, 6 * 4 = 24
  • Then, 24 * 5 = 120

numbers = [1, 2, 3, 4, 5]
product = 1
for num in numbers:
    product *= num
print(product)  # Output: 120

Reduce With Lambda

from functools import reduce

numbers = [1, 2, 3, 4, 5]
product = reduce(lambda x, y: x * y, numbers)
print(product)  # Output: 120

Using an Initializer

from functools import reduce

numbers = [1, 2, 3]

# Start with an initial value of 10
result = reduce(lambda x, y: x + y, numbers, 10)
print(result)  
# Output: 16

Using sum() instead of reduce()

# So it's not necessary to use reduce() all the time :)

numbers = [1, 2, 3, 4, 5]

# Using sum to sum the list
result = sum(numbers)
print(result)  # Output: 15

String Concatenation

from functools import reduce

words = ['Hello', 'World', 'from', 'Python']

result = reduce(lambda x, y: x + ' ' + y, words)
print(result) 
# Output: "Hello World from Python"

List Comprehension and Generators

List Comprehension

List comprehension offers a shorter syntax when you want to create a new list based on the values of an existing list.

Generates the entire list in memory at once, which can consume a lot of memory for large datasets.

Uses: [ ]

Without List Comprehension

numbers = [1, 2, 3, 4, 5]
squares = []
for num in numbers:
    squares.append(num ** 2)
print(squares)

With List Comprehensions

numbers = [1, 2, 3, 4, 5]
squares = [x ** 2 for x in numbers]
print(squares)  

Generator Expressions

Generator expressions are used to create generators, which are iterators that generate values on the fly and yield one item at a time.

Generator expressions generate items lazily, meaning they yield one item at a time and only when needed. This makes them much more memory efficient for large datasets.

Uses: ( )

numbers = [1, 2, 3, 4, 5]
squares = (x ** 2 for x in numbers)
print(sum(squares))  # Output: 55

numbers = [1, 2, 3, 4, 5]
squares = (x ** 2 for x in numbers)
print(list(squares))  # Output: [1, 4, 9, 16, 25]

Best suited for:

  • Cases where only one item needs to be in memory at a time.
  • Processing large or infinite data streams.

#functional #lambda #generator #comprehension

Last change: 2026-01-17

[Avg. reading time: 14 minutes]

Decorator

Decorators in Python are a powerful way to modify or extend the behavior of functions or methods without changing their code. Decorators are often used for tasks like logging, authentication, and adding additional functionality to functions. They are denoted by the “@” symbol and are applied above the function they decorate.

def say_hello():
    print("World")

say_hello()

How do we change the output without changing the say_hello() function?

wrapper() is not a reserved word; it can be any name.

Use Decorators

# Define a decorator function
def hello_decorator(func):
    def wrapper():
        print("Hello,")
        func()  # Call the original function
    return wrapper

# Use the decorator to modify the behavior of say_hello
@hello_decorator
def say_hello():
    print("World")

# Call the decorated function
say_hello()

If you want to suppress the newline at the end of the print statement, use end=''.

# Define a decorator function
def hello_decorator(func):
    def wrapper():
        print("Hello, ", end='')
        func()  # Call the original function
    return wrapper

# Use the decorator to modify the behavior of say_hello
@hello_decorator
def say_hello():
    print("World")

# Call the decorated function
say_hello()

Multiple functions inside the Decorator

def hello_decorator(func):
    def first_wrapper():
        print("First wrapper, doing something before the second wrapper.")
        #func()
    
    def second_wrapper():
        print("Second wrapper, doing something before the actual function.")
        #func()
    
    def main_wrapper():
        first_wrapper()  # Call the first wrapper
        second_wrapper()  # Then call the second wrapper, which calls the actual function
        func()
    
    return main_wrapper

@hello_decorator
def say_hello():
    print("World")

say_hello()

Args & Kwargs

  • *args: This is used to represent positional arguments. It collects all the positional arguments passed to the decorated function as a tuple.
  • **kwargs: This is used to represent keyword arguments. It collects all the keyword arguments (arguments passed with names) as a dictionary.

from functools import wraps

def my_decorator(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        print("Positional Arguments (*args):", args)
        print("Keyword Arguments (**kwargs):", kwargs)
        result = func(*args, **kwargs)
        return result
    return wrapper

@my_decorator
def example_function(a, b, c=0, d=0):
    print("Function Body:", a, b, c, d)

# Calling the decorated function with different arguments
example_function(1, 2)
example_function(3, 4, c=5)

Popular Example

import time
from functools import wraps

def timer(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.time()
        result = func(*args, **kwargs)
        end = time.time()
        print(f"Execution time of {func.__name__}: {end - start} seconds")
        return result
    return wrapper
    
@timer
def add(x, y):
    """Returns the sum of x and y"""
    return x + y

@timer
def greet(name, message="Hello"):
    """Returns a greeting message with the name"""
    return f"{message}, {name}!"

print(add(2, 3))
print(greet("Rachel"))

The purpose of @wraps is to preserve the metadata of the original function being decorated.

Practice Item

from functools import wraps

# Decorator without @wraps
def decorator_without_wraps(func):
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs)
    return wrapper

# Decorator with @wraps
def decorator_with_wraps(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs)
    return wrapper

# Original function with a docstring
def original_function():
    """
    This is the original function's docstring.
    """
    pass

# Decorate the original function
decorated_function_without_wraps = decorator_without_wraps(original_function)
decorated_function_with_wraps = decorator_with_wraps(original_function)

# Display metadata of decorated functions
print("Without @wraps:")
print(f"Name: {decorated_function_without_wraps.__name__}")
print(f"Docstring: {decorated_function_without_wraps.__doc__}")

print("\nWith @wraps:")
print(f"Name: {decorated_function_with_wraps.__name__}")
print(f"Docstring: {decorated_function_with_wraps.__doc__}")

Memoization

Memoization is a technique used in Python to optimize the performance of functions by caching their results. When a function is called with a particular set of arguments, the result is stored. If the function is called again with the same arguments, the cached result is returned instead of recomputing it.

Benefits

Improves Performance: Reduces the number of computations by returning pre-computed results.

Efficient Resource Utilization: Saves computation time and resources, especially for recursive or computationally expensive functions.
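
A minimal sketch (not the repo's code): the cache can be a plain dictionary keyed by the function's argument, or you can rely on functools.lru_cache from the standard library.

from functools import lru_cache

# Hand-rolled memoization: cache results in a dictionary keyed by the argument
def memoize(func):
    cache = {}
    def wrapper(n):
        if n not in cache:
            cache[n] = func(n)   # compute once and store the result
        return cache[n]          # reuse the stored result on repeat calls
    return wrapper

@memoize
def fib(n):
    return n if n < 2 else fib(n - 1) + fib(n - 2)

# Standard-library alternative: lru_cache does the same bookkeeping for you
@lru_cache(maxsize=None)
def fib_cached(n):
    return n if n < 2 else fib_cached(n - 1) + fib_cached(n - 2)

print(fib(35))         # fast, because intermediate results are cached
print(fib_cached(35))  # same result using lru_cache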

git clone https://github.com/gchandra10/python_memoization

#decorator #memoizationVer 6.0.1

Last change: 2026-01-17

[Avg. reading time: 5 minutes]

Python Classes

Classes are templates used to define the properties and methods of objects in code. They can describe the kinds of data the class holds and how a programmer interacts with them.

Attributes - Properties

Methods - Actions

Img src: https://www.datacamp.com/tutorial/python-classes

class Dog:
    def __init__(self, name, age):
        self.name = name
        self.age = age

    def bark(self):
        print(f"{self.name} says woof! and its {self.age} years old")

my_dog = Dog("Buddy", 2)

my_dog.bark()

Class Definition: We start with the class keyword followed by Dog, the name of our class. This is the blueprint for creating Dog objects.

Constructor Method (__init__): This particular method is called automatically when a new Dog object is created. It initializes the object’s attributes. In this case, each Dog has a name and an age. The self parameter is a reference to the current instance of the class.

Attributes: self.name and self.age are attributes of the class. These variables are associated with each instance of the class and hold that instance's specific data.

Method: bark is a method of the class. It's a function that all Dog instances can perform; when called, it prints a message indicating that the dog is barking.

Python classes support instance methods, static methods, and class methods. The two you will use most often are listed below; a minimal sketch follows the list.

  • StaticMethod
  • InstanceMethod
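
A minimal sketch (not from the course repo) showing how they differ, with a @classmethod included for completeness:

class Calculator:
    brand = "ClassCalc"              # class attribute shared by all instances

    def __init__(self, precision):
        self.precision = precision   # instance attribute

    # Instance method: works with one specific object's data via self
    def round_value(self, value):
        return round(value, self.precision)

    # Static method: a plain utility function grouped inside the class
    @staticmethod
    def add(a, b):
        return a + b

    # Class method: receives the class (cls) instead of an instance
    @classmethod
    def describe(cls):
        return f"{cls.__name__} by {cls.brand}"

calc = Calculator(precision=2)
print(calc.round_value(3.14159))   # 3.14 (needs an instance)
print(Calculator.add(2, 3))        # 5 (no instance required)
print(Calculator.describe())       # Calculator by ClassCalc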

Fork & Clone

git clone https://github.com/gchandra10/python_classes_demo.git

#classes #dataclassVer 6.0.1

Last change: 2026-01-17

[Avg. reading time: 3 minutes]

Unit Testing

A unit test tests a small “unit” of code - usually a function or method - independently from the rest of the program.

Some key advantages of unit testing include:

  • Isolates code - This allows testing individual units in isolation from other parts of the codebase, making bugs easier to identify.
  • Early detection - Tests can catch issues early in development before code is deployed, saving time and money.
  • Regression prevention - Existing unit tests can be run whenever code is changed to prevent new bugs or regressions.
  • Facilitates changes - Unit tests give developers the confidence to refactor or update code without breaking functionality.
  • Quality assurance - High unit test coverage helps enforce quality standards and identify edge cases.

Every language has its own unit testing frameworks. In Python, some popular ones are:

  • unittest
  • pytest
  • doctest
  • testify
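
A minimal illustration (a hypothetical add function, assuming pytest is installed): a test is a plain function whose name starts with test_ and that uses assert.

# calculator.py
def add(a, b):
    return a + b

# test_calculator.py  (run with: pytest)
from calculator import add

def test_add_positive_numbers():
    assert add(2, 3) == 5

def test_add_negative_numbers():
    assert add(-1, -1) == -2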

Example:

Using Pytest & UV

git clone https://github.com/gchandra10/pytest-demo.git

Using Unittest & Poetry

git clone https://github.com/gchandra10/python_calc_unittests

#unittesting #pytestVer 6.0.1

Last change: 2026-01-17

[Avg. reading time: 19 minutes]

Data Frames

DataFrames are the core abstraction for tabular data in modern data processing — used across analytics, ML, and ETL workflows.

They provide:

  • Rows and columns like a database table or Excel sheet.
  • Rich APIs to filter, aggregate, join, and transform data.
  • Interoperability with CSV, Parquet, JSON, and Arrow.

Pandas

Pandas is a popular Python library for data manipulation and analysis. A DataFrame in Pandas is a two-dimensional, size-mutable, and heterogeneous tabular data structure with labeled axes (rows and columns).

Eager Evaluation: Pandas performs operations eagerly, meaning that each operation is executed immediately when called.

In-Memory Copy - Full DataFrame in RAM, single copy

Sequential Processing - Single-threaded, one operation at a time.

Pros

  • Easy to use and intuitive syntax.
  • Rich functionality for data manipulation, including filtering, grouping, and merging.
  • Large ecosystem and community support.

Cons

  • Performance issues with very large datasets (limited by memory).
  • Single-threaded operations, making it slower for big data tasks.

Example

import pandas as pd

# Load the CSV file using Pandas
df = pd.read_csv('data/sales_100.csv')

# Display the first few rows
print(df.head())

Polars

Polars is a fast, multi-threaded DataFrame library written in Rust with Python bindings, designed for performance and scalability. It is known for its efficient handling of larger-than-memory datasets.

Supports both eager and lazy evaluation.

Lazy Evaluation: Instead of loading the entire CSV file into memory right away, a Lazy DataFrame builds a blueprint or execution plan describing how the data should be read and processed. The actual data is loaded only when the computation is triggered (for example, when you call a collect or execute command).

Optimizations: Using scan_csv allows Polars to optimize the entire query pipeline before loading any data. This approach is beneficial for large datasets because it minimizes memory usage and improves execution efficiency.

  • pl.read_csv() or pl.read_parquet() - eager evaluation
  • pl.scan_csv() or pl.scan_parquet() - lazy evaluation

Parallel Execution: Multi-threaded compute.

Columnar efficiency: Uses Arrow columnar memory format under the hood.

Pros

  • High performance due to multi-threading and memory-efficient execution.
  • Lazy evaluation, optimizing the execution of queries.
  • Handles larger datasets effectively.

Cons

  • Smaller community and ecosystem compared to Pandas.
  • Less mature with fewer third-party integrations.

Example

import polars as pl

# Lazy scan: builds a query plan, no data is read yet
df = pl.scan_csv('data/sales_100.csv')

# Printing a LazyFrame shows the query plan, not the data
print(df.head())

# collect() triggers execution and materializes the result
print(df.collect())

# Eager read: the file is loaded into memory immediately
df1 = pl.read_csv('data/sales_100.csv')
print(df1.head())

Dask

Dask is a parallel computing library that scales Python libraries like Pandas for large, distributed datasets.

Client (Python Code)
   │
   ▼
Scheduler (builds + manages task graph)
   │
   ▼
Workers (execute tasks in parallel)
   │
   ▼
Results gathered back to client

Open Source https://docs.dask.org/en/stable/install.html

Dask Cloud: Coiled Cloud (managed Dask)

Lazy Reading: Dask builds a task graph instead of executing immediately — computations run only when triggered (similar to Polars lazy execution).

Partitioning: A Dask DataFrame is split into many smaller Pandas DataFrames (partitions) that can be processed in parallel.

Task Graph: Dask represents your workflow as a directed acyclic graph (DAG) showing the sequence and dependencies of tasks.

Distributed Compute: Dask executes tasks across multiple cores or machines, enabling scalable, parallel data processing.


import dask.dataframe as dd

# 1) Lazy read: split the CSV files into partitions of roughly 64MB each
ddf = dd.read_csv(
    "data/sales_*.csv",
    dtype={"category": "string", "value": "float64"},
    blocksize="64MB"
)

# 2) Lazy transform: per-partition groupby + sum, then global combine
agg = ddf.groupby("category")["value"].sum().sort_values(ascending=False)

# 3) Trigger execution and bring small result to driver
result = agg.compute()

print(result.head(10))

blocksize determines the partition size. If omitted, Dask chooses a blocksize automatically (based on available memory and cores, capped at 64MB).

flowchart LR
  A1[CSV part 1] --> P1[parse p1]
  A2[CSV part 2] --> P2[parse p2]
  A3[CSV part 3] --> P3[parse p3]

  P1 --> G1[local groupby-sum p1]
  P2 --> G2[local groupby-sum p2]
  P3 --> G3[local groupby-sum p3]

  G1 --> C[combine-aggregate]
  G2 --> C
  G3 --> C

  C --> S[sort values]
  S --> R[collect to Pandas]

Pros

  • Can handle datasets that don’t fit into memory by processing in parallel.
  • Scales to multiple cores and clusters, making it suitable for big data tasks.
  • Integrates well with Pandas and other Python libraries.

Cons

  • Slightly more complex API compared to Pandas.
  • Performance tuning can be more challenging.

Where to start?

  • Start with Pandas for learning and small datasets.
  • Switch to Polars when performance matters.
  • Use Dask when data exceeds single-machine memory or needs cluster execution.

git clone https://github.com/gchandra10/python_dataframe_examples.git

Pandas vs Polars vs Dask

| Feature | Pandas | Polars | Dask |
|---|---|---|---|
| Language | Python | Rust with Python bindings | Python |
| Execution Model | Single-threaded | Multi-threaded | Multi-threaded, distributed |
| Data Handling | In-memory | In-memory, Arrow-based | In-memory, out-of-core |
| Scalability | Limited by memory | Limited to single machine | Scales across clusters |
| Performance | Good for small to medium data | High performance for single machine | Good for large datasets |
| API Familiarity | Widely known, mature | Similar to Pandas | Similar to Pandas |
| Ease of Use | Very easy, large ecosystem | Easy, but smaller ecosystem | Moderate, requires understanding of parallelism |
| Fault Tolerance | None | Limited | High, with task retries and rescheduling |
| Machine Learning | Integration with Python ML libs | Preprocessing only | Integration with Dask-ML and other libs |
| Lazy Evaluation | No | Yes | Yes, with task graphs |
| Best For | Data analysis, small datasets | Fast preprocessing on single machine | Large-scale data processing |
| Cluster Management | N/A | N/A | Supports Kubernetes, YARN, etc. |
| Use Cases | Data manipulation, analysis | Fast data manipulation | Large data, ETL, scaling Python code |

#pandas #polars #daskVer 6.0.1

Last change: 2026-01-17

[Avg. reading time: 8 minutes]

Error Handling

Python uses try/except blocks for error handling.

The basic structure is:

try:
    # Code that may raise an exception
except ExceptionType:
    # Code to handle the exception
finally:
    # Code executes all the time

Uses

Improved User Experience: Instead of the program crashing, you can provide a user-friendly error message.

Debugging: Capturing exceptions can help you log errors and understand what went wrong.

Program Continuity: Allows the program to continue running or perform cleanup operations before terminating.

Guaranteed Cleanup: Ensures that certain operations, like closing files or releasing resources, are always performed.

Some key points

  • You can catch specific exception types or use a bare except to catch any exception.

  • Multiple except blocks can be used to handle different exceptions.

  • An else clause can be added to run if no exception occurs.

  • A finally clause will always execute, whether an exception occurred or not.


Without Try/Except

x = 10 / 0 

Basic Try/Except

try:
    x = 10 / 0 
except ZeroDivisionError:
    print("Error: Division by zero!")

Generic Exception

try:
    file = open("nonexistent_file.txt", "r")
except:
    print("An error occurred!")

Find the exact error

try:
    file = open("nonexistent_file.txt", "r")
except Exception as e:
    print(str(e))

Raise - Else and Finally

try:
    x = -10
    if x <= 0:
        raise ValueError("Number must be positive")
except ValueError as ve:
    print(f"Error: {ve}")
else:
    print(f"You entered: {x}")
finally:
    print("This will always execute")

try:
    x = 10
    if x <= 0:
        raise ValueError("Number must be positive")
except ValueError as ve:
    print(f"Error: {ve}")
else:
    print(f"You entered: {x}")
finally:
    print("This will always execute")

Nested Functions


def divide(a, b):
    try:
        result = a / b
        return result
    except ZeroDivisionError:
        print("Error in divide(): Cannot divide by zero!")
        raise  # Re-raise the exception

def calculate_and_print(x, y):
    try:
        result = divide(x, y)
        print(f"The result of {x} divided by {y} is: {result}")
    except ZeroDivisionError as e:
        print(str(e))
    except TypeError as e:
        print(str(e))

# Test the nested error handling
print("Example 1: Valid division")
calculate_and_print(10, 2)

print("\nExample 2: Division by zero")
calculate_and_print(10, 0)

print("\nExample 3: Invalid type")
calculate_and_print("10", 2)

#errorhandling #exception #tryVer 6.0.1

Last change: 2026-01-17

[Avg. reading time: 7 minutes]

Logging

Python’s logging module provides a flexible framework for tracking events in your applications. It’s used to log messages to various outputs (console, files, etc.) with different severity levels like DEBUG, INFO, WARNING, ERROR, and CRITICAL.

Use Cases of Logging

  • Debugging: Identify issues during development.
  • Monitoring: Track events in production to monitor behavior.
  • Audit Trails: Capture what has been executed for security or compliance.
  • Error Tracking: Store errors for post-mortem analysis.
  • Rotating Log Files: Prevent logs from growing indefinitely using size or time-based rotation.

Python Logging Levels

| Level | Usage | Numeric Value | Description |
|---|---|---|---|
| DEBUG | Detailed information for diagnosing problems. | 10 | Useful during development and debugging stages. |
| INFO | General information about program execution. | 20 | Highlights normal, expected behavior (e.g., program start, process completion). |
| WARNING | Indicates something unexpected but not critical. | 30 | Warns of potential problems or events to monitor (e.g., deprecated functions, nearing limits). |
| ERROR | An error occurred that prevented some part of the program from working. | 40 | Represents recoverable errors that might still allow the program to continue running. |
| CRITICAL | Severe errors indicating a major failure. | 50 | Marks critical issues requiring immediate attention (e.g., system crash, data corruption). |

INFO

import logging

logging.basicConfig(level=logging.INFO)  # Set the logging level to INFO

logging.debug("This is a debug message.")
logging.info("This is an info message.")
logging.warning("This is a warning message.")
logging.error("This is an error message.")
logging.critical("This is a critical message.")

Error

import logging

logging.basicConfig(level=logging.ERROR)  # Set the logging level to ERROR

logging.debug("This is a debug message.")
logging.info("This is an info message.")
logging.warning("This is a warning message.")
logging.error("This is an error message.")
logging.critical("This is a critical message.")

Custom Format

import logging

logging.basicConfig(
    level=logging.DEBUG, 
    format = '%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)

logging.debug("This is a debug message.")
logging.info("This is an info message.")
logging.warning("This is a warning message.")
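
The use cases above mention rotating log files. A minimal sketch using the standard library's RotatingFileHandler (the file name app.log and the size limits are example values):

import logging
from logging.handlers import RotatingFileHandler

logger = logging.getLogger("my_app")
logger.setLevel(logging.INFO)

# Rotate when the file reaches ~1 MB, keeping 3 old copies (app.log.1, app.log.2, ...)
handler = RotatingFileHandler("app.log", maxBytes=1_000_000, backupCount=3)
handler.setFormatter(logging.Formatter("%(asctime)s - %(name)s - %(levelname)s - %(message)s"))
logger.addHandler(handler)

logger.info("Application started.")
logger.error("Something went wrong.")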

More Examples

git clone https://github.com/gchandra10/python_logging_examples.git

#logging #infoVer 6.0.1

Last change: 2026-01-17

Tags

abs

/Protocol/Idempotency

acid

/Data Format/Delta

adbc

/Data Format/Arrow

ai

/Big Data Overview/Trending Technologies

amazonprime

/Protocol/Monolithic Architecture

analysis

/Big Data Overview/How does it help?

api

/Protocol/API Performance

apiinbigdata

/Protocol/API in Big Data world

arrow

/Data Format/Arrow

/Data Format/Common Data Formats

automation

/Developer Tools/JQ

availability

/Big Data Overview/Cap Theorem

avro

/Data Format/SerDe

banking

/Protocol/Monolithic Architecture

bigdata

/Big Data Overview

/Big Data Overview/Big Data Challenges

/Big Data Overview/Big Data Concerns

/Big Data Overview/Big Data Tools

/Big Data Overview/Eventual Consistency

/Big Data Overview/How does it help?

/Big Data Overview/Introduction

/Big Data Overview/Job Opportunities

/Big Data Overview/Learning Big Data means?

/Big Data Overview/Optimistic Concurrency

/Big Data Overview/The Big V's

/Big Data Overview/The Big V's/Other V's

/Big Data Overview/The Big V's/Variety

/Big Data Overview/The Big V's/Velocity

/Big Data Overview/The Big V's/Veracity

/Big Data Overview/The Big V's/Volume

/Big Data Overview/Trending Technologies

/Big Data Overview/What is Data?

/Data Format/Common Data Formats

/Data Format/Delta

/Data Format/Introduction

/Data Format/JSON

/Data Format/Parquet

bigv

/Big Data Overview/The Big V's

/Big Data Overview/The Big V's/Variety

/Big Data Overview/The Big V's/Velocity

/Big Data Overview/The Big V's/Veracity

/Big Data Overview/The Big V's/Volume

binary

/Big Data Overview/The Big V's/Variety

cap

/Big Data Overview/Cap Theorem

chapter1

/Big Data Overview

cicd

/Protocol/Microservices

classes

/Advanced Python/Python Classes

cli

/Developer Tools/Duck DB

/Developer Tools/JQ

cloud

/Big Data Overview/Big Data Tools

columnar

/Big Data Overview/NO Sql Databases

/Data Format/Parquet

comprehension

/Advanced Python/Functional Programming Concepts

compressed

/Data Format/Parquet

concerns

/Big Data Overview/Big Data Concerns

concurrent

/Big Data Overview/Concurrent vs Parallel

connectionpool

/Protocol/API Performance

consistency

/Big Data Overview/Cap Theorem

continuous

/Big Data Overview/Types of Data

csv

/Data Format/Common Data Formats

curl

/Protocol/REST API

dask

/Advanced Python/Data Frames

data

/Big Data Overview/What is Data?

dataclass

/Advanced Python/Python Classes

dataformat

/Data Format/Arrow

/Data Format/Common Data Formats

/Data Format/Introduction

/Data Format/JSON

/Data Format/Parquet

datalake

/Big Data Overview/Data Integration

dataquality

/Big Data Overview/Big Data Challenges

decorator

/Advanced Python/Decorator

delta

/Data Format/Delta

deserialization

/Data Format/SerDe

discrete

/Big Data Overview/Types of Data

distributed

/Big Data Overview/Scaling

documentdb

/Big Data Overview/NO Sql Databases

domain

/Big Data Overview/DSL

dsl

/Big Data Overview/DSL

duckdb

/Developer Tools/Duck DB

elt

/Big Data Overview/Data Integration

errorhandling

/Advanced Python/Error Handling

ethics

/Big Data Overview/Big Data Challenges

etl

/Big Data Overview/Data Integration

eventualconsistency

/Big Data Overview/Eventual Consistency

exception

/Advanced Python/Error Handling

flightrpc

/Data Format/Arrow

flightsql

/Data Format/Arrow

functional

/Advanced Python/Functional Programming Concepts

generator

/Advanced Python/Functional Programming Concepts

get

/Protocol/HTTP

gpl

/Big Data Overview/GPL

graphdb

/Big Data Overview/NO Sql Databases

grpc

/Protocol/Introduction

hierarchical

/Data Format/JSON

horizontal

/Big Data Overview/Scaling

html

/Big Data Overview/DSL

http

/Protocol/HTTP

/Protocol/Introduction

idempotent

/Protocol/Idempotency

image

/Big Data Overview/The Big V's/Variety

info

/Advanced Python/Logging

interoperability

/Big Data Overview/Big Data Challenges

introduction

/Big Data Overview

iot

/Big Data Overview/Trending Technologies

jobs

/Big Data Overview/Job Opportunities

jq

/Developer Tools/JQ

json

/Big Data Overview/The Big V's/Variety

/Data Format/JSON

/Developer Tools/JQ

jwt

/Protocol/Statelessness

kafka

/Big Data Overview/Big Data Tools

/Protocol/API in Big Data world

keyvalue

/Big Data Overview/NO Sql Databases

knowledge

/Big Data Overview/How does it help?

lambda

/Advanced Python/Functional Programming Concepts

learning

/Big Data Overview/Learning Big Data means?

/Big Data Overview/Learning Big Data means?

lint

/Developer Tools/Other Python Tools

loadbalancer

/Protocol/Statefulness

loadbalancing

/Protocol/API Performance

logging

/Advanced Python/Logging

memoization

/Advanced Python/Decorator

merge

/Protocol/Idempotency

microservices

/Protocol/Microservices

mitigation

/Big Data Overview/Big Data Concerns

monolithic

/Protocol/Monolithic Architecture

mqtt

/Protocol/Introduction

mypy

/Developer Tools/Other Python Tools

nominal

/Big Data Overview/Types of Data

nosql

/Big Data Overview/NO Sql Databases

optimistic

/Big Data Overview/Optimistic Concurrency

ordinal

/Big Data Overview/Types of Data

otherv

/Big Data Overview/The Big V's/Other V's

overview

/Big Data Overview/Introduction

pagination

/Protocol/API Performance

pandas

/Advanced Python/Data Frames

parallelprogramming

/Big Data Overview/Concurrent vs Parallel

parquet

/Data Format/Common Data Formats

/Data Format/Parquet

/Developer Tools/Duck DB

parser

/Developer Tools/JQ

partitiontolerant

/Big Data Overview/Cap Theorem

pep

/Developer Tools/Other Python Tools

performance

/Protocol/API Performance

pipeline

/Big Data Overview/Data Integration

poetry

/Developer Tools/Introduction

polars

/Advanced Python/Data Frames

post

/Protocol/HTTP

privacy

/Big Data Overview/Big Data Challenges

protocols

/Protocol/Introduction

put

/Protocol/HTTP

pytest

/Advanced Python/Unit Testing

python

/Big Data Overview/GPL

/Developer Tools/Introduction

qualitative

/Big Data Overview/Types of Data

quantitative

/Big Data Overview/Types of Data

rawdata

/Big Data Overview/Data Integration

/Big Data Overview/How does it help?

rdbms

/Data Format/Introduction

realtime

/Big Data Overview/Big Data Challenges

requests

/Protocol/REST API

rest

/Protocol/REST API

/Protocol/Statelessness

restapi

/Protocol/Microservices

/Protocol/REST API

robotics

/Big Data Overview/Trending Technologies

ruff

/Developer Tools/Other Python Tools

rust

/Big Data Overview/GPL

/Developer Tools/UV

scaling

/Big Data Overview/Scaling

semistructured

/Big Data Overview/The Big V's/Variety

serialization

/Data Format/SerDe

singlefiledatabase

/Developer Tools/Duck DB

spark

/Big Data Overview/Big Data Tools

/Protocol/API in Big Data world

sql

/Big Data Overview/DSL

stateful

/Protocol/Statefulness

statelessness

/Protocol/Statelessness

statuscodes

/Protocol/HTTP

stickiness

/Protocol/Statefulness

storage

/Big Data Overview/Big Data Challenges

structured

/Big Data Overview/The Big V's/Variety

technologies

/Big Data Overview/Trending Technologies

teraform

/Protocol/Idempotency

tools

/Big Data Overview/Big Data Tools

/Developer Tools/Duck DB

/Developer Tools/JQ

traditionaldata

/Big Data Overview/What is Data?

try

/Advanced Python/Error Handling

unittesting

/Advanced Python/Unit Testing

unstructured

/Big Data Overview/The Big V's/Variety

upsert

/Protocol/Idempotency

uv

/Developer Tools/Introduction

/Developer Tools/UV

validity

/Big Data Overview/The Big V's/Other V's

value

/Big Data Overview/The Big V's/Other V's

velocity

/Big Data Overview/The Big V's/Velocity

venv

/Developer Tools/Introduction

/Developer Tools/UV

veracity

/Big Data Overview/The Big V's/Veracity

version

/Big Data Overview/The Big V's/Other V's

vertical

/Big Data Overview/Scaling

volume

/Big Data Overview/The Big V's/Volume

xml

/Big Data Overview/The Big V's/Variety