[Avg. reading time: 0 minutes] Ver 6.0.25

Last change: 2026-04-21

[Avg. reading time: 2 minutes]

Disclaimer

1. Week 1 > 2. Week 8 > 3. Week 15

In this AI era, remember the following.

  • First, you are not behind; you are learning on schedule.
  • Second, feeling like an imposter is normal; it means you are stretching your skills.
  • Third, ignore the online noise. Learning is simple: learn something, think about it, practice it, repeat.
  • Lastly, tools will change, but your ability to learn will stay.

Certificates are good, but projects and understanding matter more. Ask questions, help each other, and don’t do this journey alone.

[Avg. reading time: 2 minutes]

Required Tools

Install this software before Week 2.

Windows

Mac

Common Tools (Windows & Mac)

[Avg. reading time: 13 minutes]

Setting up the Big Data Environment

This setup creates a ready-made development environment for this course.

Instead of installing the necessary software, libraries, compilers, and tools on your laptop, everything runs inside a container.

This guarantees everyone has the exact same setup, so there’s no “it works on my machine” problem.

We will learn how this works in later weeks.


Video

Step by Step

  1. Install VSCode and the Remote Development extension

  2. Install Docker Personal and make sure the Engine is running

  3. Copy the Git repo https://github.com/gchandra10/workspace-bigdata

  4. Click “Copy URL to clipboard”

  5. Open Terminal / Command Prompt and clone the repo

  6. Step after cloning the repo

  7. Click “Open Workspace from File…”

  8. Choose the Workspace file inside the folder

  9. VSCode will prompt you to Reopen in Container; click that button.

  10. After a few minutes (depending on your computer’s capability and network speed), you will see a message like this.

  11. If you see /workspaces/workspace-bigdata $, your installation is successful.

  12. Verify the Python version. It may vary depending on what is latest at that time.
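One way to check the interpreter from inside the container is a short script like the sketch below. The minimum version used here is a hypothetical value chosen only for illustration; the course does not state a required version.

```python
import sys

# Hypothetical minimum version, chosen only for illustration.
MIN_VERSION = (3, 8)


def python_version_ok(minimum=MIN_VERSION):
    """Return True if the running interpreter is at least `minimum`."""
    return sys.version_info[:2] >= minimum


# Print the exact version string, e.g. "3.x.y".
print("Python", sys.version.split()[0])
print("Meets minimum:", python_version_ok())
```

Running `python --version` in the terminal gives the same information; the script form is handy when the check needs to live inside a notebook or a setup test.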

How to close the Workspace

  1. Click “Close Remote Connection”

How to Reopen the Workspace

  1. Click “File”

  2. Click “Open Workspace from File…”

  3. Click “Documents”

  4. Click “text field”

  5. Click “text field”

  6. Click “open workspace from file”

Tip: This time it will load the Remote Workspace immediately.

  1. Click “image”

Reset and Retry

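  • Close VSCode
  • Delete the workspace-bigdata folder and all its files
  • Open a terminal (on Windows use PowerShell; the $(...) syntax below does not work in the classic Command Prompt)
  • Run the following commands to clean up the existing containers, images, and volumes (this removes all Docker containers, images, and volumes on your machine, not only the ones for this course)
# remove all containers
docker rm $(docker ps -aq)

# remove all images
docker rmi $(docker images -aq)

# remove all volumes
docker volume rm $(docker volume ls -q)
  • Go to the terminal and clone the repository again (a newer version is available)

https://github.com/gchandra10/workspace-bigdata.git

Then follow the steps mentioned above.

Note: Please make sure Docker is running and you have enough disk space.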

#setup #workspace #devcontainer

[Avg. reading time: 3 minutes]

Big Data Overview

  1. Introduction
  2. Job Opportunities
  3. What is Data?
  4. How does it help?
  5. Types of Data
  6. The Big V’s
    1. Variety
    2. Volume
    3. Velocity
    4. Veracity
    5. Other V’s
  7. Trending Technologies
  8. Big Data Concerns
  9. Big Data Challenges
  10. Data Integration
  11. Scaling
  12. CAP Theorem
  13. PACELC Theorem
  14. Optimistic Concurrency
  15. Eventual Consistency
  16. Concurrent vs Parallel
  17. GPL
  18. DSL
  19. Big Data Tools
  20. NoSQL Databases
  21. Learning Big Data means?

#introduction #bigdata #chapter1

[Avg. reading time: 2 minutes]

Understanding the Big Data Landscape

Expectations in this course

The first set of questions, which everyone is curious about:

What is Big Data?

When does the data become Big Data?

Why collect so much Data?

How secure is Big Data?

How does it help?

Where can it be stored?

Which Tools are used to handle Big Data?


The second set of questions goes deeper:

What should I learn?

Does certification help?

Which technology is the best?

How many tools do I need to learn?

Apart from the top 50 corporations, do other companies use Big Data?

#overview #bigdata

[Avg. reading time: 3 minutes]

Job Opportunities

Role | On-Prem | Big Data Specific | Cloud
Database Developer
Data Engineer
Database Administrator
Data Architect
Database Security Eng.
Database Manager
Data Analyst
Business Intelligence

Database Developer: Designs and writes efficient queries, procedures, and data models for structured databases.

Data Engineer: Builds and maintains scalable data pipelines and ETL processes for large-scale data movement and transformation.

Database Administrator (DBA): Manages and optimizes database systems, ensuring performance, security, and backups.

Data Architect: Defines high-level data strategy and architecture, ensuring alignment with business and technical needs.

Database Security Engineer: Implements and monitors security controls to protect data assets from unauthorized access and breaches.

Database Manager: Oversees database teams and operations, aligning database strategy with organizational goals.

Data Analyst: Interprets data using statistical tools to generate actionable insights for decision-makers.

Business Intelligence (BI) Developer: Creates dashboards, reports, and visualizations to help stakeholders understand data trends and KPIs.

Organizations of every size, from small businesses to large enterprises, use Big Data to grow their business.

#jobs #bigdata

[Avg. reading time: 4 minutes]

What is Data?

Data is simply facts and figures. When processed and contextualized, data becomes information.

Everything is data

  • What we say
  • Where we go
  • What we do

How to measure data?

1 Byte      - 1 letter
1 Kilobyte  - 1024 B
1 Megabyte  - 1024 KB
1 Gigabyte  - 1024 MB
1 Terabyte  - 1024 GB (1,099,511,627,776 Bytes)
1 Petabyte  - 1024 TB
1 Exabyte   - 1024 PB
1 Zettabyte - 1024 EB
1 Yottabyte - 1024 ZB
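The unit ladder above can be turned into a small conversion helper. This is a minimal sketch that assumes the 1024-based convention used in the table; the function name `human_readable` is illustrative.

```python
# Units in ascending order, each 1024 times larger than the previous one.
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]


def human_readable(num_bytes):
    """Convert a byte count to the largest unit that keeps the value below 1024."""
    value = float(num_bytes)
    for unit in UNITS:
        # Stop at the first unit where the value fits, or at the last unit.
        if value < 1024 or unit == UNITS[-1]:
            return f"{value:.2f} {unit}"
        value /= 1024


print(human_readable(1_099_511_627_776))  # 1.00 TB
print(human_readable(1536))               # 1.50 KB
```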

Examples of Traditional Data

  • Banking Records
  • Student Information
  • Employee Profiles
  • Customer Details
  • Sales Transactions

When Does Data Become Big Data?

When data expands

  • Banking: One bank branch vs. global consolidation (e.g., CitiBank)
  • Education: One college vs. nationwide student data (e.g., US News)
  • Media: Traditional news vs. user-generated content on Social Media

When data gets granular

  • Monitoring CPU/Memory usage every second
  • Cell phone location & usage logs
  • IoT sensor telemetry (temperature, humidity, etc.)
  • Social media posts, reactions, likes
  • Live traffic data from vehicles and sensors

These fine-grained data points fuel powerful analytics and real-time insights.

Why Collect So Much Data?

  • Storage is cheap and abundant
  • Tech has advanced to process massive data efficiently
  • Businesses use data to innovate, predict trends, and grow

#data #bigdata #traditionaldata

[Avg. reading time: 3 minutes]

How Big Data helps us

From raw blocks to building knowledge, Big Data drives global progress.

Data to Wisdom

Stages

  • Data → scattered observations
  • Information → contextualized
  • Knowledge → structured relationships
  • Insight → patterns emerge
  • Wisdom → actionable strategy

Raw Data to Analysis

Stages

  • Raw Data – Messy, unprocessed
  • Organized – Grouped by category
  • Arranged – Structured to show comparisons
  • Visualized – Charts or graphs
  • Analysis – Final understanding or solution

Big Data Applications: Changing the World

Here are some real-world domains where Big Data is making a difference:

  • Healthcare – Diagnose diseases earlier and personalize treatment
  • Agriculture – Predict crop yield and detect pest outbreaks
  • Space Exploration – Analyze signals from space and optimize missions
  • Disaster Management – Forecast floods and storms, and support earthquake early-warning systems
  • Crime Prevention – Predict and detect crime patterns
  • IoT & Smart Devices – Real-time decision making in smart homes, vehicles, and cities

#bigdata #rawdata #knowledge #analysis

[Avg. reading time: 7 minutes]

Types of Data

Understanding the types of data is key to processing and analyzing it effectively. Broadly, data falls into two main categories: Quantitative and Qualitative.

Quantitative Data

Quantitative data deals with numbers and measurable forms. It can be further classified as Discrete or Continuous.

  • Measurable values (e.g., memory usage, CPU usage, number of likes, shares, retweets)
  • Collected from the real world
  • Usually close-ended

Discrete

  • Represented by whole numbers
  • Countable and finite

Example:

  • Number of cameras in a phone
  • Memory size in GB

Qualitative Data

Qualitative data describes qualities or characteristics that can’t be easily measured numerically.

  • Descriptive or abstract
  • Can come from text, audio, or images
  • Collected via interviews, surveys, or observations
  • Usually open-ended

Examples

  • Gender: Male, Female, Non-Binary, etc.
  • Smartphones: iPhone, Pixel, Motorola, etc.

Nominal

Categorical data without any intrinsic order

Examples:

  • Red, Blue, Green
  • Types of fruits: Apple, Banana, Mango

Can you rank them logically? No — that’s what makes them nominal.


graph TD
  A[Types of Data]
  
  A --> B[Quantitative]
  A --> C[Qualitative]
  
  B --> B1[Discrete]
  B --> B2[Continuous]
  
  C --> C1[Nominal]
  C --> C2[Ordinal]

| Category | Subtype | Description | Examples |
|---|---|---|---|
| Quantitative | Discrete | Whole numbers, countable | Number of phones, number of users |
| Quantitative | Continuous | Measurable, can take fractional values | Temperature, CPU usage |
| Qualitative | Nominal | Categorical with no natural order | Gender, Colors (Red, Blue, Green) |
| Qualitative | Ordinal | Categorical with a meaningful order | T-shirt sizes (S, M, L), Grades (A, B, C…) |
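The practical difference between nominal and ordinal data shows up as soon as you try to sort it. The sketch below uses Python's standard `enum.IntEnum` to give T-shirt sizes an explicit order; the class name `ShirtSize` and the sample values are illustrative assumptions.

```python
from enum import IntEnum


class ShirtSize(IntEnum):
    """Ordinal data: the integer values encode a meaningful order."""
    S = 1
    M = 2
    L = 3


# Ordinal values can be ranked and compared.
orders = [ShirtSize.L, ShirtSize.S, ShirtSize.M]
sorted_sizes = [size.name for size in sorted(orders)]
print(sorted_sizes)  # ['S', 'M', 'L']

# Nominal values have no order; equality checks and counting are what make sense.
colors = ["Red", "Blue", "Green", "Blue"]
color_counts = {color: colors.count(color) for color in set(colors)}
print(color_counts["Blue"])  # 2
```

Sorting the color names would still run, but the result would only be alphabetical, not a meaningful ranking.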

Abstract Understanding

Some qualitative data comes from non-traditional sources like:

  • Conversations
  • Audio or video files
  • Observations or open-text survey responses

This type of data often requires interpretation before it’s usable in models or analysis.

#quantitative #qualitative #discrete #continuous #nominal #ordinal

[Avg. reading time: 1 minute]

The Big V’s of Big Data

#bigv #bigdata

[Avg. reading time: 7 minutes]

Variety

Variety refers to the different types, formats, and sources of data collected — one of the 5 Vs of Big Data.

Types of Data : By Source

  • Social Media: YouTube, Facebook, LinkedIn, Twitter, Instagram
  • IoT Devices: Sensors, Cameras, Smart Meters, Wearables
  • Finance/Markets: Stock Market, Cryptocurrency, Financial APIs
  • Smart Systems: Smart Cars, Smart TVs, Home Automation
  • Enterprise Systems: ERP, CRM, SCM Logs
  • Public Data: Government Open Data, Weather Stations

Types of Data : By Data format

  • Structured Data – Organized in rows and columns (e.g., CSV, Excel, RDBMS)
  • Semi-Structured Data – Self-describing but irregular (e.g., JSON, XML, Avro, YAML)
  • Unstructured Data – No fixed schema (e.g., images, audio, video, emails)
  • Binary Data – Encoded, compressed, or serialized data (e.g., Parquet, Protocol Buffers, images, MP3)

Unstructured data files are generally stored in binary format. Examples: images, video, audio.

However, not all binary files contain unstructured data. Examples: Parquet files, executables.

Structured Data

Tabular data from databases, spreadsheets.

Example:

  • Relational Table
  • Excel

| ID | Name | Join Date |
|---|---|---|
| 101 | Rachel Green | 2020-05-01 |
| 201 | Joey Tribbiani | 1998-07-05 |
| 301 | Monica Geller | 1999-12-14 |
| 401 | Cosmo Kramer | 2001-06-05 |

Semi-Structured Data

Data with tags or markers but not strictly tabular.

JSON

[
   {
      "id":1,
      "name":"Rachel Green",
      "gender":"F",
      "series":"Friends"
   },
   {
      "id":2,
      "name":"Sheldon Cooper",
      "gender":"M",
      "series":"BBT"
   }
]

XML

<?xml version="1.0" encoding="UTF-8"?>
<actors>
   <actor>
      <id>1</id>
      <name>Rachel Green</name>
      <gender>F</gender>
      <series>Friends</series>
   </actor>

   <actor>
      <id>2</id>
      <name>Sheldon Cooper</name>
      <gender>M</gender>
      <series>BBT</series>
   </actor>
</actors>
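Both formats above carry the same records, so they can be read into the same tabular shape. The following is a minimal sketch using only Python's standard library (`json` and `xml.etree.ElementTree`); the shortened sample strings and the helper names `rows_from_json` and `rows_from_xml` are assumptions made for illustration.

```python
import json
import xml.etree.ElementTree as ET

# Shortened versions of the JSON and XML samples, for illustration only.
json_text = '[{"id": 1, "name": "Rachel Green", "series": "Friends"}]'
xml_text = """<actors>
  <actor><id>2</id><name>Sheldon Cooper</name><series>BBT</series></actor>
</actors>"""


def rows_from_json(text):
    """Each JSON object already maps field names to typed values."""
    return [(item["id"], item["name"], item["series"]) for item in json.loads(text)]


def rows_from_xml(text):
    """XML element text is always a string, so the id is converted explicitly."""
    root = ET.fromstring(text)
    return [
        (int(actor.findtext("id")), actor.findtext("name"), actor.findtext("series"))
        for actor in root.findall("actor")
    ]


rows = rows_from_json(json_text) + rows_from_xml(xml_text)
for row in rows:
    print(row)
```

Note the design difference: JSON keeps numbers and strings distinct, while XML leaves type conversion to the reader.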

Unstructured Data

Media files, free text, documents, logs – no predefined structure.

Rachel Green acted in Friends series. Her role is very popular. 
Similarly Sheldon Cooper acted in BBT. He acted as nerd physicist.

Types:

  • Images (JPG, PNG)
  • Video (MP4, AVI)
  • Audio (MP3, WAV)
  • Documents (PDF, DOCX)
  • Emails
  • Logs (system logs, server logs)
  • Web scraping content (HTML, raw text)

Note: Many LLM-based AI tools can now help parse unstructured data into tabular data quickly.

#structured #unstructured #semistructured #binary #json #xml #image #bigdata #bigv

[Avg. reading time: 4 minutes]

Volume

Volume refers to the sheer amount of data generated every second from various sources around the world. It’s one of the core characteristics that makes data big.

With the rise of the internet, smartphones, IoT devices, social media, and digital services, the amount of data being produced has reached the zettabyte scale and is heading toward the yottabyte scale.

  • YouTube users upload 500+ hours of video every minute.
  • Facebook generates 4 petabytes of data per day.
  • A single connected car can produce 25 GB of data per hour.
  • Enterprises generate terabytes to petabytes of log, transaction, and sensor data daily.

Why It Matters

With the rise of Artificial Intelligence (AI), and especially Large Language Models (LLMs) like ChatGPT, Gemini (formerly Bard), and Claude, the volume of data being generated, consumed, and required for training is skyrocketing.

  • LLMs need massive amounts of training data.

  • LLM-generated content is growing rapidly: blogs, reports, summaries, images, audio, and even code.

  • Storage systems must scale horizontally to handle petabytes or more.

  • Traditional databases can’t manage this scale efficiently.

  • Volume impacts data ingestion, processing speed, query performance, and cost.

  • It influences how data is partitioned, replicated, and compressed in distributed systems.

Data Cycle

#bigdata #volume #bigv

[Avg. reading time: 4 minutes]

Velocity

Velocity refers to the speed at which data is generated, transmitted, and processed. In the era of Big Data, it’s not just about handling large volumes of data, but also about managing the continuous and rapid flow of data in real-time or near real-time.

High-velocity data comes from various sources such as:

  • Social Media Platforms: Tweets, posts, likes, and shares occurring every second.
  • Sensor Networks: IoT devices transmitting data continuously.
  • Financial Markets: Real-time transaction data and stock price updates.
  • Online Streaming Services: Continuous streaming of audio and video content.
  • E-commerce Platforms: Real-time tracking of user interactions and transactions.

Managing this velocity requires systems capable of:

  • Real-Time Data Processing: Immediate analysis and response to incoming data.
  • Scalability: Handling increasing data speeds without performance degradation.
  • Low Latency: Minimizing delays in data processing and response times.
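To make the idea of real-time processing concrete, here is a small sketch of a sliding-window counter that reports how many events arrived in the last few seconds. The class name, the 10-second window, and the event timestamps are all hypothetical, and the sketch assumes events arrive in time order; production systems would use a streaming platform rather than an in-memory deque.

```python
from collections import deque


class SlidingWindowCounter:
    """Count events seen in the last `window_seconds` seconds."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.timestamps = deque()

    def record(self, timestamp):
        """Add one event and return how many events are inside the window."""
        self.timestamps.append(timestamp)
        # Drop events that have fallen out of the window.
        while self.timestamps[0] <= timestamp - self.window_seconds:
            self.timestamps.popleft()
        return len(self.timestamps)


counter = SlidingWindowCounter(window_seconds=10)
# Hypothetical event times in seconds; a real stream would supply these.
counts = [counter.record(t) for t in [0, 1, 2, 11, 12, 25]]
print(counts)  # [1, 2, 3, 2, 2, 1]
```

Each call does a small, bounded amount of work per event, which is what keeps latency low as the arrival rate rises.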

Velocity Source1

#bigdata #velocity #bigv


1: https://keywordseverywhere.com/blog/data-generated-per-day-stats/

[Avg. reading time: 7 minutes]

Veracity

Veracity refers to the trustworthiness, quality, and accuracy of data. In the world of Big Data, not all data is created equal — some may be incomplete, inconsistent, outdated, or even deliberately false. The challenge is not just collecting data, but ensuring it’s reliable enough to make sound decisions.

Why Veracity Matters

  • Poor data quality can lead to wrong insights, flawed models, and bad business decisions.

  • With increasing sources (social media, sensors, web scraping), there’s more noise than ever.

  • Real-world data often comes with missing values, duplicates, biases, or outliers.

Key Dimensions of Veracity in Big Data

| Dimension | Description | Example |
|---|---|---|
| Trustworthiness | Confidence in the accuracy and authenticity of data. | Verifying customer feedback vs. bot reviews |
| Origin | The source of the data and its lineage or traceability. | Knowing if weather data comes from a reliable source |
| Completeness | Whether the dataset has all required fields and values. | Missing values in patient health records |
| Integrity | Ensuring the data hasn’t been altered, corrupted, or tampered with during storage or transfer. | Using checksums to validate data blocks |
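The checksum example from the Integrity dimension can be sketched with Python's standard `hashlib` module. The sample data block and the helper names `sha256_checksum` and `is_intact` are illustrative assumptions, not part of any specific storage system.

```python
import hashlib


def sha256_checksum(data: bytes) -> str:
    """Return the SHA-256 digest of a data block as a hex string."""
    return hashlib.sha256(data).hexdigest()


def is_intact(data: bytes, expected_checksum: str) -> bool:
    """Compare the recomputed checksum with the one recorded at write time."""
    return sha256_checksum(data) == expected_checksum


# Hypothetical data block; the checksum is stored alongside it when written.
block = b"patient_id=42,temperature=98.6"
recorded = sha256_checksum(block)

print(is_intact(block, recorded))                              # True
print(is_intact(b"patient_id=42,temperature=99.6", recorded))  # False
```

A single changed character produces a completely different digest, which is why checksums catch both corruption and tampering.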

How to Tackle Veracity Issues

  • Data Cleaning: Remove duplicates, correct errors, fill missing values.
  • Validation & Verification: Check consistency across sources.
  • Data Provenance: Track where the data came from and how it was transformed.
  • Bias Detection: Identify and reduce systemic bias in training datasets.
  • Robust Models: Build models that can tolerate and adapt to noisy inputs.
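The data cleaning step above can be sketched in a few lines. This example removes duplicate records and fills a missing value with the mean of the known values; the sample records and the choice of mean imputation are assumptions for illustration, and other strategies may suit real datasets better.

```python
# Hypothetical records with one duplicate and one missing value.
records = [
    {"id": 1, "name": "Rachel", "age": 30},
    {"id": 1, "name": "Rachel", "age": 30},   # exact duplicate
    {"id": 2, "name": "Joey", "age": None},   # missing value
    {"id": 3, "name": "Monica", "age": 32},
]


def clean(rows):
    """Drop duplicate ids, then fill missing ages with the mean known age."""
    seen, unique = set(), []
    for row in rows:
        if row["id"] not in seen:
            seen.add(row["id"])
            unique.append(dict(row))  # copy so the input is not modified
    known = [r["age"] for r in unique if r["age"] is not None]
    mean_age = sum(known) / len(known) if known else None
    for r in unique:
        if r["age"] is None:
            r["age"] = mean_age
    return unique


cleaned = clean(records)
print(cleaned)
```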

Websites & Tools to Generate Sample Data

Highly customizable fake data generator; supports exporting as CSV, JSON, SQL. https://mockaroo.com

Easy UI to create datasets with custom fields like names, dates, numbers, etc. https://www.onlinedatagenerator.com

Apart from these, there are a few data-generation libraries:

https://faker.readthedocs.io/en/master/

https://github.com/databrickslabs/dbldatagen

Question?

Is generating fake data good or bad?

When we have real data, why generate fake data?

#bigv #veracity #bigdata

[Avg. reading time: 3 minutes]

Other V’s in Big Data

| Other V’s | Meaning | Key Question / Use Case |
|---|---|---|
| Value | Business/Customer Impact | What value does this data bring to the business or end users? |
| Visualization | Data Representation | Can the data be visualized clearly to aid understanding and decisions? |
| Viability | Production/Sustainability | Is it viable to operationalize and sustain this data in production systems? |
| Virality | Shareability/Impact | Will the message or insight be effective when shared across channels (e.g., social media)? |
| Version | Data Versioning | Do we need to maintain different versions? Is the cost of versioning justified? |
| Validity | Time-Sensitivity | How long is the data relevant? Will its meaning or utility change over time? |

Example

  • Validity: Zoom usage data from 2020 was valid during lockdown; can it still be used for benchmarking today?

  • Virality: A meme might go viral on Instagram but not be received well on Twitter or LinkedIn.

  • Version: For some master records, we might need versioned data. For simple web traffic counts, maybe not.

#bigdata #otherv #value #version #validity

[Avg. reading time: 7 minutes]

Trending Technologies

Powered by Big Data

Big Data isn’t just about storing and processing huge volumes of information — it’s the engine that drives modern innovation. From healthcare to self-driving cars, Big Data plays a critical role in shaping the technologies we use and depend on every day.

Where Big Data Is Making an Impact

  • Robotics
    Enhances learning and adaptive behavior in robots by feeding real-time and historical data into control algorithms.

  • Artificial Intelligence (AI)
    The heart of AI — machine learning models rely on Big Data to train, fine-tune, and make accurate predictions.

  • Internet of Things (IoT)
    Millions of devices — from smart thermostats to industrial sensors — generate data every second. Big Data platforms analyze this for real-time insights.

  • Internet & Mobile Apps
    Collect user behavior data to power personalization, recommendations, and user experience optimization.

  • Autonomous Cars & VANETs (Vehicular Networks)
    Use sensor and network data for route planning, obstacle avoidance, and decision-making.

  • Wireless Networks & 5G
    Big Data helps optimize network traffic, reduce latency, and predict service outages before they occur.

  • Voice Assistants (Siri, Alexa, Google Assistant)
    Depend on Big Data and NLP models to understand speech, learn preferences, and respond intelligently.

  • Cybersecurity
    Uses pattern detection on massive datasets to identify anomalies, prevent attacks, and detect fraud in real time.

  • Bioinformatics & Genomics
    Big Data helps decode genetic sequences, enabling personalized medicine and new drug discoveries. Big Data was also a game-changer in the development and distribution of COVID-19 vaccines.

    https://pmc.ncbi.nlm.nih.gov/articles/PMC9236915/

  • Renewable Energy
    Analyzes weather, consumption, and device data to maximize efficiency in solar, wind, and other green technologies.

  • Neural Networks & Deep Learning
    These advanced AI models require large-scale labeled data for training complex tasks like image recognition or language translation.


Broad Use Areas for Big Data

| Area | Description |
|---|---|
| Data Mining & Analytics | Finding patterns and insights from raw data |
| Data Visualization | Presenting data in a human-friendly, understandable format |
| Machine Learning | Training models that learn from historical data |

#bigdata #technologies #iot #ai #robotics

[Avg. reading time: 6 minutes]

Big Data Concerns

Big Data brings massive potential, but it also introduces ethical, technical, and societal challenges. Below is a categorized view of key concerns and how they can be mitigated.

Privacy, Security & Governance

Concerns

  • Privacy: Risk of misuse of sensitive personal data.
  • Security: Exposure to cyberattacks and data breaches.
  • Governance: Lack of clarity on data ownership and access rights.

Mitigation

  • Use strong encryption, anonymization, and secure access controls.
  • Conduct regular security audits and staff awareness training.
  • Define and enforce data governance policies on ownership, access, and lifecycle.
  • Establish consent mechanisms and transparent data usage policies.

Data Quality, Accuracy & Interpretation

Concerns

  • Inaccurate, incomplete, or outdated data may lead to incorrect decisions.
  • Misinterpretation due to lack of context or domain understanding.

Mitigation

  • Implement data cleaning, validation, and monitoring procedures.
  • Train analysts to understand data context.
  • Use cross-functional teams for balanced analysis.
  • Maintain data lineage and proper documentation.

Ethics, Fairness & Bias

Concerns

  • Potential for discrimination or unethical use of data.
  • Over-reliance on algorithms may overlook human factors.

Mitigation

  • Develop and follow ethical guidelines for data usage.
  • Perform bias audits and impact assessments regularly.
  • Combine data-driven insights with human judgment.

Regulatory Compliance

Concerns

  • Complexity of complying with regulations like GDPR, HIPAA, etc.

Mitigation

  • Stay current with relevant data protection laws.
  • Assign a Data Protection Officer (DPO) to ensure ongoing compliance and oversight.

Environmental and Social Impact

Concerns

  • High energy usage of data centers contributes to carbon emissions.
  • Digital divide may widen gaps between those who can access Big Data and those who cannot.

Mitigation

  • Use energy-efficient infrastructure and renewable energy sources.
  • Support data literacy, open data access, and inclusive education initiatives.

#bigdata #concerns #mitigation

[Avg. reading time: 8 minutes]

Big Data Challenges

As organizations adopt Big Data, they face several challenges — technical, organizational, financial, legal, and ethical. Below is a categorized overview of these challenges along with effective mitigation strategies.

1. Data Storage & Management

Challenge:

Efficiently storing and managing ever-growing volumes of structured, semi-structured, and unstructured data.

Mitigation:

  • Use scalable cloud storage, distributed file systems like HDFS, and table storage layers like Delta Lake.
  • Establish data lifecycle policies, retention rules, and metadata catalogs for better management.

2. Data Processing & Real-Time Analytics

Challenges:

  • Processing huge datasets with speed and accuracy.
  • Delivering real-time insights for time-sensitive decisions.

Mitigation:

  • Leverage tools like Apache Spark, Flink, and Hadoop for distributed processing.
  • Use streaming platforms like Kafka or Spark Streaming.
  • Apply parallel and in-memory processing where possible.

3. Data Integration & Interoperability

Challenge:

Bringing together data from diverse sources, formats, and systems into a unified view.

Mitigation:

  • Implement ETL/ELT pipelines, data lakes, and integration frameworks.
  • Apply data transformation and standardization best practices.

4. Privacy, Security & Compliance

Challenges:

  • Preventing data breaches and unauthorized access.
  • Adhering to global and regional data regulations (e.g., GDPR, HIPAA, CCPA).

Mitigation:

  • Use encryption, role-based access controls, and audit logging.
  • Conduct regular security assessments and appoint a Data Protection Officer (DPO).
  • Stay current with evolving regulations and enforce compliance frameworks.

5. Data Quality & Trustworthiness

Challenge:

Ensuring that data is accurate, consistent, timely, and complete.

Mitigation:

  • Use data validation, cleansing tools, and automated quality checks.
  • Monitor for data drift and inconsistencies in real time.
  • Maintain data provenance for traceability.

6. Skill Gaps & Talent Shortage

Challenge:

A lack of professionals skilled in Big Data technologies, analytics, and data engineering.

Mitigation:

  • Invest in upskilling programs, certifications, and academic partnerships.
  • Foster a culture of continuous learning and data literacy across roles.

7. Cost & Resource Management

Challenge:

Managing the high costs associated with storing, processing, and analyzing large-scale data.

Mitigation:

  • Optimize workloads using cloud-native autoscaling and resource tagging.
  • Use open-source tools where possible.
  • Monitor and forecast data usage to control spending.

8. Scalability & Performance

Challenge:

Keeping up with growing data volumes and system demands without compromising performance.

Mitigation:

  • Design for horizontal scalability using microservices and cloud-native infrastructure.
  • Implement load balancing, data partitioning, and caching strategies.

9. Ethics, Governance & Transparency

Challenges:

  • Managing bias, fairness, and responsible data usage.
  • Ensuring transparency in algorithms and decisions.

Mitigation:

  • Establish data ethics policies and review boards.
  • Perform regular audits and impact assessments.
  • Clearly communicate how data is collected, stored, and used.

#bigdata #ethics #storage #realtime #interoperability #privacy #dataquality

[Avg. reading time: 7 minutes]

Data Integration

Data integration in the Big Data ecosystem differs significantly from traditional Relational Database Management Systems (RDBMS). While traditional systems rely on structured, predefined workflows, Big Data emphasizes scalability, flexibility, and performance.

ETL: Extract Transform Load

ETL is a traditional data integration approach used primarily with RDBMS technologies such as MySQL, SQL Server, and Oracle.

Workflow

  • Extract data from source systems.
  • Transform it into the required format.
  • Load it into the target system (e.g., a data warehouse).
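The three steps above can be sketched as three small functions. This is a toy example using only Python's standard library, with an in-memory SQLite database standing in for the data warehouse; the sample CSV text, the `employees` table, and the function names are hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical source data; a real pipeline would read from a file or API.
SOURCE_CSV = "id,name,join_date\n101, rachel green ,2020-05-01\n201, joey ,1998-07-05\n"


def extract(text):
    """Extract: read raw rows from the CSV source."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: trim whitespace, title-case names, convert ids to integers."""
    return [(int(r["id"]), r["name"].strip().title(), r["join_date"]) for r in rows]


def load(rows, conn):
    """Load: write the already-cleaned rows into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS employees (id INTEGER, name TEXT, join_date TEXT)"
    )
    conn.executemany("INSERT INTO employees VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(transform(extract(SOURCE_CSV)), conn)
result = conn.execute("SELECT id, name FROM employees ORDER BY id").fetchall()
print(result)  # [(101, 'Rachel Green'), (201, 'Joey')]
```

The key point is the order: data is cleaned before it ever reaches the target system.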

ETL Tools

  • SSIS / SSDT – SQL Server Integration Services / Data Tools
  • Pentaho Kettle – Open-source ETL platform
  • Talend – Data integration and transformation platform
  • Benetl – Lightweight ETL for MySQL and PostgreSQL

ETL tools are well-suited for batch processing and structured environments but may struggle with scale and unstructured data.

ETL

src 1

ETL vs ELT

src 2

ELT: Extract Load Transform

ELT is the modern, Big Data-friendly approach. Instead of transforming data before loading, ELT prioritizes loading raw data first and transforming later.

Benefits

  • Immediate ingestion of all types of data (structured or unstructured)
  • Flexible transformation logic, applied post-load
  • Faster load times and higher throughput
  • Reduced operational overhead for loading processes

Challenges

  • Security blind spots may arise from loading raw data upfront
  • Compliance risks due to delayed transformation (HIPAA, GDPR, etc.)
  • High storage costs if raw data is stored unfiltered in cloud/on-prem systems

ELT is ideal for data lakes, streaming, and cloud-native architectures.
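For contrast with ETL, the sketch below loads raw values untouched and transforms them afterwards inside the target system using SQL. An in-memory SQLite database stands in for a data lake or warehouse; the table names, column names, and sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Load: raw values go in untouched, all stored as text.
conn.execute("CREATE TABLE raw_events (user_name TEXT, amount TEXT)")
raw_rows = [(" alice ", "10.5"), ("BOB", "4"), (" alice ", "2.5")]
conn.executemany("INSERT INTO raw_events VALUES (?, ?)", raw_rows)

# Transform: runs later, inside the target system, using SQL.
conn.execute("""
    CREATE TABLE user_totals AS
    SELECT LOWER(TRIM(user_name)) AS user_name,
           SUM(CAST(amount AS REAL)) AS total
    FROM raw_events
    GROUP BY LOWER(TRIM(user_name))
""")

totals = conn.execute(
    "SELECT user_name, total FROM user_totals ORDER BY user_name"
).fetchall()
print(totals)  # [('alice', 13.0), ('bob', 4.0)]
```

Because the raw table is kept, a different transformation can be applied later without re-ingesting the source data.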

Typical Big Data Flow

Raw Data → Cleansed Data → Data Processing → Data Warehousing → ML / BI / Analytics

  • Raw Data: Initial unprocessed input (logs, JSON, CSV, APIs, sensors)
  • Cleansed Data: Cleaned and standardized
  • Processing: Performed through tools like Spark, DLT, or Flink
  • Warehousing: Data is stored in structured formats (e.g., Delta, Parquet)
  • Usage: Data is consumed by ML models, dashboards, or analysts

Each stage involves pipelines, validations, and metadata tracking.

ETL vs ELT

#etl #elt #pipeline #rawdata #datalake


1: Leanmsbitutorial.com

2: https://towardsdatascience.com/how-i-redesigned-over-100-etl-into-elt-data-pipelines-c58d3a3cb3c

[Avg. reading time: 9 minutes]

Scaling & Distributed Systems

Scalability is a critical factor in Big Data and cloud computing. As workloads grow, systems must adapt.

There are two main ways to scale infrastructure: vertical scaling and horizontal scaling. These often relate to how distributed systems are designed and deployed.

Vertical Scaling (Scaling Up)

Vertical scaling means increasing the capacity of a single machine.

Like upgrading your personal computer — adding more RAM, a faster CPU, or a bigger hard drive.

Pros:

  • Simple to implement
  • No code or architecture changes needed
  • Good for monolithic or legacy applications

Cons:

  • Hardware has physical limits
  • Downtime may be required during upgrades
  • More expensive hardware = diminishing returns

Used In:

  • Traditional RDBMS
  • Standalone servers
  • Small-scale workloads

Horizontal Scaling (Scaling Out)

Horizontal scaling means adding more machines (nodes) to handle the load collectively.

Like hiring more team members instead of just working overtime yourself.

Pros:

  • More scalable: Keep adding nodes as needed
  • Fault tolerant: One machine failure doesn’t stop the system
  • Supports distributed computing

Cons:

  • More complex to configure and manage
  • Requires load balancing, data partitioning, and synchronization
  • More network overhead

Used In:

  • Distributed databases (e.g., Cassandra, MongoDB)
  • Big Data platforms (e.g., Hadoop, Spark)
  • Cloud-native applications (e.g., Kubernetes)

Distributed Systems

A distributed system is a network of computers that work together to perform tasks. The goal is to increase performance, availability, and fault tolerance by sharing resources across machines.

Analogy:

A relay team where each runner (node) has a specific part of the race, but success depends on teamwork.

Key Features of Distributed Systems

| Feature | Description |
|---|---|
| Concurrency | Multiple components can operate at the same time independently |
| Scalability | Easily expand by adding more nodes |
| Fault Tolerance | If one node fails, others continue to operate with minimal disruption |
| Resource Sharing | Nodes share tasks, data, and workload efficiently |
| Decentralization | No single point of failure; avoids bottlenecks |
| Transparency | System hides its distributed nature from users (location, access, replication) |

Horizontal Scaling vs. Distributed Systems

| Aspect | Horizontal Scaling | Distributed System |
|---|---|---|
| Definition | Adding more machines (nodes) to handle workload | A system where multiple nodes work together as one unit |
| Goal | To increase capacity and performance by scaling out | To coordinate tasks, ensure fault tolerance, and share resources |
| Architecture | Not necessarily distributed | Always distributed |
| Coordination | May not require nodes to communicate | Requires tight coordination between nodes |
| Fault Tolerance | Depends on implementation | Built-in as a core feature |
| Example | Load-balanced web servers | Hadoop, Spark, Cassandra, Kafka |
| Storage/Processing | Each node may handle separate workloads | Nodes often share or split workloads and data |
| Use Case | Quick capacity boost (e.g., web servers) | Large-scale data processing, distributed storage |

Vertical scaling helps improve single-node power, while horizontal scaling enables distributed systems to grow flexibly. Most modern Big Data systems rely on horizontal scaling for scalability, reliability, and performance.

#scaling #vertical #horizontal #distributed

[Avg. reading time: 9 minutes]

CAP Theorem

src 1

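The CAP Theorem is a fundamental concept in distributed computing. It states that a distributed system can guarantee at most two of the following three properties at the same time. Because network partitions cannot be ruled out, the practical choice during a partition is between consistency and availability.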

The Three Components

  1. Consistency (C)
    Every read receives the most recent write or an error.
    Example: If a book’s location is updated in a library system, everyone querying the catalog should see the updated location immediately.

  2. Availability (A)
    Every request receives a (non-error) response, but not necessarily the most recent data.
    Example: Like a convenience store that’s always open, even if they occasionally run out of your favorite snack.

  3. Partition Tolerance (P)
    The system continues to function despite network failures or communication breakdowns.
    Example: A distributed team in different rooms that still works, even if their intercom fails.

What the CAP Theorem Means

You can only pick two out of three:

| Guarantee Combination | Sacrificed Property | Typical Use Case |
|-----------------------|---------------------|------------------|
| CP (Consistency + Partition) | Availability | Banking Systems, RDBMS |
| AP (Availability + Partition) | Consistency | DNS, Web Caches |
| CA (Consistency + Availability) | Partition Tolerance (Not realistic in distributed systems) | Only feasible in non-distributed systems |

CAP Theorem src 2

Real-World Examples

CAP Theorem trade-offs can be seen in:

  • Social Media Platforms – Favor availability and partition tolerance (AP)
  • Financial Systems – Require consistency and partition tolerance (CP)
  • IoT Networks – Often prioritize availability and partition tolerance (AP)
  • eCommerce Platforms – Mix of AP and CP depending on the service
  • Content Delivery Networks (CDNs) – Strongly AP-focused for high availability and responsiveness

src 3

graph TD
    A[Consistency]
    B[Availability]
    C[Partition Tolerance]

    A -- CP System --> C
    B -- AP System --> C
    A -- CA System --> B

    subgraph CAP Triangle
        A
        B
        C
    end

This diagram shows that you can choose only two at a time:

  • CP (Consistency + Partition Tolerance): e.g., traditional databases
  • AP (Availability + Partition Tolerance): e.g., DNS, Cassandra
  • CA is only theoretical in a distributed environment (it fails when partition occurs)

In distributed systems, network partitions are unavoidable. The CAP Theorem helps us choose which trade-off makes the most sense for our use case.
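
The CP versus AP trade-off can be sketched with a toy replica. This is an illustration only: the `Replica` class, its `mode` flag, and the `partitioned` switch are hypothetical names and do not come from any real database.

```python
class Replica:
    """Toy replica that holds one value and may be cut off from its peer."""

    def __init__(self, mode: str, value: int) -> None:
        self.mode = mode            # "CP" or "AP"
        self.value = value          # last value this replica has seen
        self.partitioned = False    # True when it cannot reach the other replica

    def read(self) -> int:
        if self.partitioned and self.mode == "CP":
            # CP: refuse to answer rather than risk returning stale data.
            raise RuntimeError("unavailable: cannot confirm latest value")
        # AP (or no partition): answer with whatever is stored locally.
        return self.value


cp_replica = Replica("CP", value=10)
ap_replica = Replica("AP", value=10)

# A partition happens, and a newer write lands on the unreachable side.
cp_replica.partitioned = True
ap_replica.partitioned = True

print(ap_replica.read())  # still answers, possibly with stale data
try:
    cp_replica.read()
except RuntimeError as err:
    print("CP replica:", err)
```

The AP replica keeps serving requests and accepts that the answer may be out of date; the CP replica gives up availability until it can confirm the latest value.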

#cap #consistency #availability #partitiontolerant


1: blog.devtrovert.com

2: Factor-bytes.com

3: blog.bytebytego.com

[Avg. reading time: 6 minutes]

PACELC

The PACELC theorem extends the CAP theorem.

If a Partition (P) exists, choose between Availability (A) and Consistency (C); Else (E), choose between Latency (L) and Consistency (C).

What “If Partition Exists (P)” Means

  • A network partition has occurred
  • Some nodes cannot communicate with others
  • Messages are dropped, not just delayed

If CAP Exists, Why PACELC?

CAP focuses exclusively on what happens during a network failure (a “partition”). PACELC addresses a major critique of CAP by also accounting for how a system behaves during normal, healthy operation.

  • Most systems run without network partitions most of the time
  • Datacenters are engineered to avoid partitions
  • Partitions are rare but catastrophic
  • So when everything works, you still trade consistency vs latency.
                Distributed System
                        |
                        v
             Is there a network partition?
                        |
            +-----------+-----------+
            |                       |
          YES (P)                NO (ELSE)
            |                       |
            v                       v
   Availability (A)         Low Latency (L)
            |                       |
   - Keep serving            - Read nearest replica
   - May return              - Async replication
     inconsistent data       - Possible staleness
            |
            |
            v
     Consistency (C)         Consistency (C)
            |                       |
   - Block / error           - Quorum / consensus
   - Wait for quorum         - Higher latency
   - Data always correct     - Strong guarantees

| Database | P: Availability vs Consistency | ELSE: Latency vs Consistency | PACELC Class | Notes |
|----------|--------------------------------|------------------------------|--------------|-------|
| Cassandra | Availability | Latency | PA / EL | Always-on design, async replication, eventual consistency |
| DynamoDB | Availability | Latency | PA / EL | Dynamo-style, low latency reads, consistency is optional |
| Riak | Availability | Latency | PA / EL | Conflict resolution after the fact |
| CouchDB | Availability | Latency | PA / EL | Multi-master replication, conflicts expected |
| MongoDB (Replica Set) | Consistency | Consistency | PC / EC | Primary-based writes, blocks during elections |
| HBase | Consistency | Consistency | PC / EC | Strong consistency via HDFS, higher coordination cost |
| Google Spanner | Consistency | Consistency | PC / EC | Global consensus, correctness over latency |
| CockroachDB | Consistency | Consistency | PC / EC | Distributed SQL, serializable isolation |
| Elasticsearch | Availability | Latency | PA / EL | Search-first, stale reads acceptable |
| Redis Cluster | Availability | Latency | PA / EL | Speed first, eventual consistency under failure |

[Avg. reading time: 6 minutes]

Optimistic concurrency

Optimistic Concurrency is a concurrency control strategy used in databases and distributed systems that allows multiple users or processes to access the same data simultaneously without locking resources.

Instead of preventing conflicts upfront by using locks, it assumes that conflicts are rare. If a conflict does occur, it’s detected after the operation, and appropriate resolution steps (like retries) are taken.


How It Works

  • Multiple users/processes read and attempt to write to the same data.
  • Instead of using locks, each update tracks the version or timestamp of the data.
  • When writing, the system checks if the data has changed since it was read.
  • If no conflict, the write proceeds.
  • If conflict detected, the system throws an exception or prompts a retry.

Let’s look at a simple example:

Sample inventory Table

| item_id | item_nm | stock |
|---------|---------|-------|
|    1    | Apple   |  10   |
|    2    | Orange  |  20   |
|    3    | Banana  |  30   |

Imagine two users, UserA and UserB, trying to update the apple stock simultaneously.

User A’s update:

UPDATE inventory SET stock = stock + 5 WHERE item_id = 1;

User B’s update:

UPDATE inventory SET stock = stock - 3 WHERE item_id = 1;

  • Both updates execute concurrently without locking the table.
  • When each write is committed, the system checks whether the row has changed since it was read (a version conflict).
  • If there’s no conflict, both changes are applied.
New Apple stock = 10 + 5 - 3 = 12
  • If there is a conflict (e.g., both updates started from the same base version of the row), one update fails, and that user must retry the transaction.
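
The version check and retry can be sketched with Python's built-in `sqlite3` module. This is a minimal illustration, not how any particular database implements it: the `version` column and the `read_item` / `write_item` helpers are assumptions added for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE inventory (item_id INTEGER PRIMARY KEY, item_nm TEXT, "
    "stock INTEGER, version INTEGER)"  # version column is an added assumption
)
conn.execute("INSERT INTO inventory VALUES (1, 'Apple', 10, 1)")
conn.commit()


def read_item(item_id: int) -> tuple[int, int]:
    """Return (stock, version) as seen by a reader."""
    return conn.execute(
        "SELECT stock, version FROM inventory WHERE item_id = ?", (item_id,)
    ).fetchone()


def write_item(item_id: int, new_stock: int, seen_version: int) -> bool:
    """Write only if nobody changed the row since it was read."""
    cur = conn.execute(
        "UPDATE inventory SET stock = ?, version = version + 1 "
        "WHERE item_id = ? AND version = ?",
        (new_stock, item_id, seen_version),
    )
    conn.commit()
    return cur.rowcount == 1  # 0 rows updated means a conflict


# UserA and UserB both read version 1 of the row.
stock_a, version_a = read_item(1)
stock_b, version_b = read_item(1)

ok_a = write_item(1, stock_a + 5, version_a)  # succeeds, version becomes 2
ok_b = write_item(1, stock_b - 3, version_b)  # conflict: version 1 is gone

if not ok_b:
    # Retry: re-read the latest row and apply the change again.
    stock_b, version_b = read_item(1)
    ok_b = write_item(1, stock_b - 3, version_b)

print(read_item(1))
```

No lock is held between the read and the write. The `WHERE ... AND version = ?` clause is what detects the conflict, and the losing writer simply re-reads and tries again.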

Optimistic Concurrency Is Ideal When

| Condition | Explanation |
|-----------|-------------|
| Low write contention | Most updates happen on different parts of data |
| Read-heavy, write-light systems | Updates are infrequent or less overlapping |
| High performance is critical | Avoiding locks reduces wait times |
| Distributed systems | Locking is expensive and hard to coordinate |

#optimistic #bigdata

[Avg. reading time: 6 minutes]

Eventual consistency

Eventual consistency is a consistency model used in distributed systems (like NoSQL databases and distributed storage) where updates to data may not be immediately visible across all nodes. However, the system guarantees that all replicas will eventually converge to the same state — given no new updates are made.

Unlike stronger models like serializability or linearizability, eventual consistency prioritizes performance and availability, especially in the face of network latency or partitioning.

Simple Example: Distributed Key-Value Store

Imagine a distributed database with three nodes: Node A, Node B, and Node C. All store the value for a key called "item_stock":

Node A: item_stock = 10
Node B: item_stock = 10
Node C: item_stock = 10

Now, a user sends an update to change item_stock to 15, and it reaches only Node A initially:

Node A: item_stock = 15
Node B: item_stock = 10
Node C: item_stock = 10

At this point, the system is temporarily inconsistent. Over time, the update propagates:

Node A: item_stock = 15
Node B: item_stock = 15
Node C: item_stock = 10

Eventually, all nodes reach the same value:

Node A: item_stock = 15
Node B: item_stock = 15
Node C: item_stock = 15
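
The same sequence can be replayed in a few lines of Python. This is a toy simulation under simplifying assumptions: the `propagate_one` helper is hypothetical and copies the value to one lagging node per round, whereas real systems use their own replication mechanisms.

```python
nodes = {"A": 10, "B": 10, "C": 10}  # item_stock on each node

# The write reaches Node A first.
nodes["A"] = 15
print("after write:", nodes)


def propagate_one(source: str, state: dict[str, int]) -> bool:
    """Copy the source's value to one lagging node; return False when none lag."""
    for name, value in state.items():
        if value != state[source]:
            state[name] = state[source]
            return True
    return False


rounds = 0
while propagate_one("A", nodes):
    rounds += 1
    print(f"after round {rounds}:", nodes)
```

A read that hits Node C between the rounds still sees the old value, which is exactly the stale read that eventual consistency permits.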

Key Characteristics

  • Temporary inconsistencies are allowed
  • Data will converge across replicas over time
  • Reads may return stale data during convergence
  • Prioritizes availability and partition tolerance over strict consistency

When to Use Eventual Consistency

Eventual consistency is ideal when:

| Situation | Why It Helps |
|-----------|--------------|
| High-throughput, low-latency systems | Avoids the overhead of strict consistency |
| Geo-distributed deployments | Tolerates network delays and partitions |
| Systems with frequent writes | Enables faster response without locking or blocking |
| Availability is more critical than accuracy | Keeps services running even during network issues |

#eventualconsistency #bigdata

[Avg. reading time: 6 minutes]

Concurrent vs. Parallel

Understanding the difference between concurrent and parallel programming is key when designing efficient, scalable applications — especially in distributed and multi-core systems.

Concurrent Programming

Concurrent programming is about managing multiple tasks at once, allowing them to make progress without necessarily executing at the same time.

  • Tasks overlap in time.
  • Focuses on task coordination, not simultaneous execution.
  • Often used in systems that need to handle many events or users, like web servers or GUIs.

Key Traits

  • Enables responsive programs (non-blocking)
  • Utilizes a single core or limited resources efficiently
  • Requires mechanisms like threads, coroutines, or async/await

Parallel Programming

Parallel programming is about executing multiple tasks simultaneously, typically to speed up computation.

  • Tasks run at the same time, often on multiple cores.
  • Focuses on performance and efficiency.
  • Common in high-performance computing, such as scientific simulations or data processing.

Key Traits

  • Requires multi-core CPUs or GPUs
  • Ideal for data-heavy workloads
  • Uses multithreading, multiprocessing, or vectorization

Analogy: Cooking in a Kitchen

Concurrent Programming

One chef is working on multiple dishes. While a pot is simmering, the chef chops vegetables for the next dish. Tasks overlap, but only one is actively running at a time.

Parallel Programming

A team of chefs in a large kitchen, each cooking a different dish at the same time. Multiple dishes are actively being cooked simultaneously, speeding up the overall process.
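
The single-chef case can be sketched with `asyncio` from the standard library. The dish names and timings are made up for illustration; the point is that both tasks are in progress at once on one thread.

```python
import asyncio


async def cook(dish: str, seconds: float, log: list[str]) -> None:
    """One chef task: start a dish, wait while it simmers, then finish."""
    log.append(f"start {dish}")
    await asyncio.sleep(seconds)  # while waiting, the chef can switch dishes
    log.append(f"done {dish}")


async def kitchen() -> list[str]:
    log: list[str] = []
    # Both dishes are in progress at once, on a single thread.
    await asyncio.gather(cook("soup", 0.2, log), cook("salad", 0.05, log))
    return log


events = asyncio.run(kitchen())
print(events)
```

The salad finishes before the soup even though the soup was started first, because the chef switches tasks while the soup simmers. For the team-of-chefs case, the standard library offers `multiprocessing` and `concurrent.futures.ProcessPoolExecutor`, which run work on several CPU cores at the same time.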

Summary Table

| Feature | Concurrent Programming | Parallel Programming |
|---------|------------------------|----------------------|
| Task Timing | Tasks overlap, but not necessarily at once | Tasks run simultaneously |
| Focus | Managing multiple tasks efficiently | Improving performance through parallelism |
| Execution Context | Often single-core or logical thread | Multi-core, multi-threaded or GPU-based |
| Tools/Mechanisms | Threads, coroutines, async I/O | Threads, multiprocessing, SIMD, OpenMP |
| Example Use Case | Web servers, I/O-bound systems | Scientific computing, big data, simulations |

#concurrent #parallelprogramming

[Avg. reading time: 3 minutes]

General-Purpose Language (GPL)

What is a GPL?

A GPL is a programming language designed to write software in multiple problem domains. It is not limited to a particular application area.

Swiss Army Knife

Examples

  • Python – widely used in ML, web, scripting, automation.
  • Java – enterprise applications, Android, backend.
  • C++ – system programming, game engines.
  • Rust – performance + memory safety.
  • JavaScript – web front-end & server-side with Node.js.

Use Cases

  • Building web apps (backend/frontend).
  • Developing AI/ML pipelines.
  • Writing system software and operating systems.
  • Implementing data processing frameworks (e.g., Apache Spark in Scala).
  • Creating mobile and desktop applications.

Why Use GPL?

  • Flexibility to work across domains.
  • Rich standard libraries and ecosystems.
  • Ability to combine different kinds of tasks (e.g., networking + ML).

#gpl #python #rust

[Avg. reading time: 4 minutes]

DSL

A DSL is a programming or specification language dedicated to a particular problem domain, a particular problem representation technique, and/or a particular solution technique.

Examples

  • SQL – querying and manipulating relational databases.
  • HTML – for structuring content on the web.
  • R – statistical computing and graphics.
  • Makefiles – for building projects.
  • Regular Expressions – for pattern matching.
  • Markdown – lightweight text formatting (e.g., README.md; try it at https://stackedit.io/app#).
  • Mermaid – text-based diagrams (try it at https://mermaid.live/).

Use Cases

  • Building data pipelines (e.g., dbt, Airflow DAGs).
  • Writing infrastructure-as-code (e.g., Terraform HCL).
  • Designing UI layout (e.g., QML for Qt UI design).
  • IoT rule engines (e.g., IFTTT or Node-RED flows).
  • Statistical models using R.

Why Use DSL?

  • Shorter, more expressive code in the domain.
  • Higher-level abstractions.
  • Reduced risk of bugs for domain experts.

Optional Challenge: Build Your Own DSL!

Design your own mini Domain-Specific Language (DSL)! You can keep it simple.

  • Start with a specific problem.
  • Create your own syntax that feels natural to the people who will use it.
  • Try a few examples and ask your friends to try them too.
  • Try implementing a parser using your favourite GPL.
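
As a starting point for the challenge, here is one possible shape for such a parser, written in Python. The two-command inventory language (`add` / `remove`) and the `run_inventory_dsl` function are invented for this sketch; your own DSL can look completely different.

```python
def run_inventory_dsl(script: str) -> dict[str, int]:
    """Interpret a tiny made-up DSL with two commands: add and remove."""
    stock: dict[str, int] = {}
    for line_no, line in enumerate(script.splitlines(), start=1):
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split()
        valid = (
            len(parts) == 3
            and parts[0] in ("add", "remove")
            and parts[1].isdigit()
        )
        if not valid:
            raise SyntaxError(f"line {line_no}: expected 'add|remove <count> <item>'")
        verb, count, item = parts[0], int(parts[1]), parts[2]
        change = count if verb == "add" else -count
        stock[item] = stock.get(item, 0) + change
    return stock


program = """
# restock, then sell
add 10 apple
add 5 apple
remove 3 apple
add 20 orange
"""
result = run_inventory_dsl(program)
print(result)
```

The language has no loops, variables, or functions, which is the point: it can express inventory changes and nothing else, so a domain expert can read and write it without learning a GPL.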

#domain #dsl #SQL #HTML

[Avg. reading time: 4 minutes]

Popular Big Data Tools & Platforms

Big Data ecosystems rely on a wide range of tools and platforms for data processing, real-time analytics, streaming, and cloud-scale storage. Here’s a list of some widely used tools categorized by functionality:

Distributed Processing Engines

  • Apache Spark – Unified analytics engine for large-scale data processing; supports batch, streaming, and ML.
  • Apache Flink – Framework for stateful computations over data streams with real-time capabilities.

Real-Time Data Streaming

  • Apache Kafka – Distributed event streaming platform for building real-time data pipelines and streaming apps.

Log & Monitoring Stack

  • ELK Stack (Elasticsearch, Logstash, Kibana) – Searchable logging and visualization suite for real-time analytics.

Cloud-Based Platforms

  • AWS (Amazon Web Services) – Scalable cloud platform offering Big Data tools like EMR, Redshift, Kinesis, and S3.
  • Azure – Microsoft’s cloud platform with tools like Azure Synapse, Data Lake, and Event Hubs.
  • GCP (Google Cloud Platform) – Offers BigQuery, Dataflow, Pub/Sub for large-scale data analytics.
  • Databricks – Unified data platform built around Apache Spark with powerful collaboration and ML features.
  • Snowflake – Cloud-native data warehouse known for performance, elasticity, and simplicity.

#bigdata #tools #cloud #kafka #spark

[Avg. reading time: 3 minutes]

NoSQL Database Types

NoSQL databases are optimized for flexibility, scalability, and performance, making them ideal for Big Data and real-time applications. They are categorized based on how they store and access data:

Key-Value Stores

Store data as simple key-value pairs. Ideal for caching, session storage, and high-speed lookups.

  • Redis
  • Amazon DynamoDB

Columnar Stores

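Store data by column (or column family) rather than by row. Wide-column stores such as HBase and Cassandra are built for large-scale, high-throughput workloads, while columnar warehouses such as Amazon Redshift are optimized for analytical queries and large-scale batch processing.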

  • Apache HBase
  • Apache Cassandra
  • Amazon Redshift

Document Stores

Store semi-structured data like JSON or BSON documents. Great for flexible schemas and content management systems.

  • MongoDB
  • Amazon DocumentDB

Graph Databases

Use nodes and edges to represent and traverse relationships between data. Ideal for social networks, recommendation engines, and fraud detection.

  • Neo4j
  • Amazon Neptune

Tip: Choose the NoSQL database type based on your data access patterns and application needs.

Not all NoSQL databases solve the same problem.

#nosql #keyvalue #documentdb #graphdb #columnar

[Avg. reading time: 4 minutes]

Learning Big Data

Learning Big Data goes beyond just handling large datasets. It involves building a foundational understanding of data types, file formats, processing tools, and cloud platforms used to store, transform, and analyze data at scale.

Types of Files & Formats

  • Data File Types: CSV, JSON
  • File Formats: CSV, TSV, TXT, Parquet

Linux & File Management Skills

  • Essential Linux Commands: ls, cat, grep, awk, sort, cut, sed, etc.
  • Useful Libraries & Tools:
    • awk, jq, csvkit, grep – for filtering, transforming, and managing structured data

Data Manipulation Foundations

  • Regular Expressions: For pattern matching and advanced string operations
  • SQL / RDBMS: Understanding relational data and query languages
  • NoSQL Databases: Working with document, key-value, columnar, and graph stores

Cloud Technologies

  • Introduction to major platforms: AWS, Azure, GCP
  • Services for data storage, compute, and analytics (e.g., S3, EMR, BigQuery)

Big Data Tools & Frameworks

  • Tools like Apache Spark, Flink, Kafka, Dask
  • Workflow orchestration (e.g., Airflow, DBT, Databricks Workflows)

Miscellaneous Tools & Libraries

  • Visualization: matplotlib, seaborn, Plotly
  • Data Engineering: pandas, pyarrow, sqlalchemy
  • Streaming & Real-time: Kafka, Spark Streaming, Flume

Tip: Big Data learning is a multi-disciplinary journey. Start small — explore files and formats — then gradually move into tools, pipelines, cloud platforms, and real-time systems.

#bigdata #learning

[Avg. reading time: 0 minutes]

Developer Tools

  1. Introduction
  2. UV
  3. Other Python Tools

[Avg. reading time: 5 minutes]

Introduction

Before diving into Data or ML frameworks, it's important to have a clean and reproducible development setup. A good environment makes you:

  • Faster: less time fighting dependencies.
  • Consistent: same results across laptops, servers, and teammates.
  • Confident: tools catch errors before they become bugs.

A consistent developer experience saves hours of debugging. You spend more time solving problems, less time fixing environments.


Python Virtual Environment

  • A virtual environment is like a sandbox for Python.
  • It isolates your project’s dependencies from the global Python installation.
  • Makes it easy to use different versions of a library in different projects.
  • With plain venv and pip, dependencies are usually tracked in a requirements.txt file, which has to be maintained manually.

Without it, installing one package for one project may break another project.
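
To see what a virtual environment actually is on disk, the standard-library `venv` module can create one from Python. This is a small sketch: the temporary folder and the `.venv` name are illustrative choices, and `with_pip=False` is used only to keep the example quick.

```python
import tempfile
import venv
from pathlib import Path

# Create a throwaway environment in a temporary folder (path is illustrative).
project_dir = Path(tempfile.mkdtemp())
env_dir = project_dir / ".venv"

# with_pip=False keeps the example fast; real projects usually want pip.
venv.create(env_dir, with_pip=False)

# Every virtual environment carries a pyvenv.cfg that points at its base Python.
config_file = env_dir / "pyvenv.cfg"
print(config_file.read_text())
```

The environment is just a folder with its own configuration and its own place for installed packages, which is why deleting the folder removes the environment without touching the global Python installation.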

#venv #python #uv #poetry developer_tools

[Avg. reading time: 3 minutes]

UV

Dependency & Environment Manager

  • Written in Rust.
  • Syntax is lightweight.
  • Automatic Virtual environment creation.

Create a new project:

# Initialize a new uv project
uv init uv_helloworld

Sample layout of the directory structure

.
├── main.py
├── pyproject.toml
├── README.md
└── uv.lock

# Change directory
cd uv_helloworld

# # Create a virtual environment myproject
# uv venv myproject

# or create a UV project with specific version of Python

# uv venv myproject --python 3.11

# # Activate the Virtual environment

# source myproject/bin/activate

# # Verify the Virtual Python version

# which python3

# add library (best practice)
uv add faker

# verify the list of libraries under virtual env
uv tree

# To find the list of libraries inside Virtual env

uv pip list

Edit main.py:

from faker import Faker
fake = Faker()
print(fake.name())

Run it:

uv run main.py

Read More on the differences between UV and Poetry

#uv #rust #venv

[Avg. reading time: 17 minutes]

Python Developer Tools

PEP

A PEP (Python Enhancement Proposal) is a design document that describes a new Python feature, process, or convention. The best known, PEP 8, is the official style guide: it provides conventions and recommendations for writing readable, consistent, and maintainable Python code.

PEP Conventions

  • PEP 8 : Style guide for Python code (most famous).
  • PEP 20 : "The Zen of Python" (guiding principles).
  • PEP 484 : Type hints (basis for MyPy).
  • PEP 517/518 : Build system interfaces (basis for pyproject.toml, used by Poetry/UV).
  • PEP 572 : Assignment expressions (the := walrus operator).
  • PEP 440 : Version identification and dependency specification (how library versions are written).

Indentation

  • Use 4 spaces per indentation level
  • Continuation lines should align with opening delimiter or be indented by 4 spaces.

Line Length

  • Limit lines to a maximum of 79 characters.
  • For docstrings and comments, limit lines to 72 characters.

Blank Lines

  • Use 2 blank lines before top-level functions and class definitions.
  • Use 1 blank line between methods inside a class.

Imports

  • Imports should be on separate lines.
  • Group imports into three sections: standard library, third-party libraries, and local application imports.
  • Use absolute imports whenever possible.

# Correct
import os
import sys

# Wrong
import sys, os

Naming Conventions

  • Use snake_case for function and variable names.
  • Use CamelCase for class names.
  • Use UPPER_SNAKE_CASE for constants.
  • Avoid single-character variable names except for counters or indices.

Whitespace

  • Don’t pad inside parentheses/brackets/braces.
  • Use one space around operators and after commas, but not before commas.
  • No extra spaces when aligning assignments.

Comments

  • Write comments that are clear, concise, and helpful.
  • Use complete sentences and capitalize the first word.
  • Use # for inline comments, but avoid them where the code is self-explanatory.

Docstrings

  • Use triple quotes (""") for multiline docstrings.
  • Describe the purpose, arguments, and return values of functions and methods.

Code Layout

  • Keep function definitions and calls readable.
  • Avoid writing too many nested blocks.

Consistency

  • Consistency within a project outweighs strict adherence.
  • If you must diverge, be internally consistent.

PEP 20 - The Zen of Python

https://peps.python.org/pep-0020/

Simple is better than complex

Complex

result = (lambda x: (x*x + 2*x + 1))(5)

Simple

x = 5
result = (x + 1) ** 2

Readability counts

No Good

a=10;b=20;c=a+b;print(c)

Good

first_value = 10
second_value = 20
sum_of_values = first_value + second_value
print(sum_of_values)

Errors should never pass silently

No Good

try:
    x = int("abc")
except:
    pass

Good

try:
    x = int("abc")
except ValueError as e:
    print("Conversion failed:", e)

PEP 572

Walrus Operator :=

Assignment within Expression Operator

Old Way

inputs = []
current = input("Write something ('quit' to stop): ")
while current != "quit":
    inputs.append(current)
    current = input("Write something ('quit' to stop): ")

Using Walrus

inputs = []
while (current := input("Write something ('quit' to stop): ")) != "quit":
    inputs.append(current)

Another Example

Old Way

import re

text = "Order 42 shipped"  # sample input
m = re.search(r"\d+", text)
if m:
    print(m.group())

New Way

import re

text = "Order 42 shipped"  # sample input
if (m := re.search(r"\d+", text)):
    print(m.group())

Linting

Linting is the process of automatically checking your Python code for:

  • Syntax errors
  • Stylistic issues (PEP 8 violations)
  • Potential bugs or bad practices

Benefits:

  • Keeps your code consistent and readable.
  • Helps catch errors early before runtime.
  • Encourages team-wide coding standards.


# Incorrect
import sys, os

# Correct
import os
import sys

# Bad spacing
x= 5+3

# Good spacing
x = 5 + 3

Ruff : Linter and Code Formatter

Ruff is a fast, modern tool written in Rust that helps keep your Python code:

  • Consistent (follows PEP 8)
  • Clean (removes unused imports, fixes spacing, etc.)
  • Correct (catches potential errors)

Install

uv add ruff

Verify

uv run ruff --version
uv run ruff --help

example.py

import os, sys 

def greet(name): 
  print(f"Hello, {name}")

def message(name): print(f"Hi, {name}")

def calc_sum(a, b): return a+b

greet('World')
greet('Ruff')
message('Ruff')

uv run ruff check example.py
uv run ruff check example.py --fix
uv run ruff format example.py --check
uv run ruff check example.py

PEP 484 - MyPy : Type Checking Tool

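Python is a dynamically typed programming language, meaning a variable can be rebound to a value of a different type:

x = 26
x = "hello"

Both assignments are valid.

mypy adds optional static type checking on top of this; it does not change how Python runs.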

mypy is a static type checker for Python. It checks your code against the type hints you provide, ensuring that the types are consistent throughout the codebase.

It primarily focuses on type correctness—verifying that variables, function arguments, return types, and expressions match the expected types.

What mypy checks:

  • Variable reassignment types
  • Function arguments
  • Return types
  • Expressions and operations
  • Control flow narrowing

What mypy does not do:

  • Runtime validation
  • Performance checks
  • Logical correctness

Install

    uv add mypy

    or

    pip install mypy

Example 1 : sample.py

x = 1
x = 1.0
x = True
x = "test"
x = b"test"

print(x)

uv run mypy sample.py

or

mypy sample.py

Example 2: Type Safety

def add(a: int, b: int) -> int:
    return a + b

print(add(100, 123))
print(add("hello", "world"))

Example 3: Return Type Violation

def divide(a: int, b: int) -> int:
    if b == 0:
        return "invalid"
    return a // b

Example 4: Optional Types

from typing import Optional

def get_username(user_id: int) -> Optional[str]:
    if user_id == 0:
        return None
    return "admin"

name = get_username(0)
print(name.upper())

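What is wrong here? get_username can return None, and None has no upper() method, so name.upper() fails at runtime with an AttributeError. mypy reports this before the code runs.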
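
One common way to resolve the Optional case is to check for None before using the value. A minimal sketch, reusing the get_username function from Example 4; the fallback message is an arbitrary choice for illustration.

```python
from typing import Optional


def get_username(user_id: int) -> Optional[str]:
    if user_id == 0:
        return None
    return "admin"


name = get_username(0)
# Narrow the type: inside the first branch, name is known to be a str.
if name is not None:
    print(name.upper())
else:
    print("no such user")
```

This is the "control flow narrowing" listed under what mypy checks: after the `is not None` test, the type checker treats `name` as `str` rather than `Optional[str]`.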

#mypy #pep #ruff #lint

[Avg. reading time: 0 minutes]

Dataformat

  1. Introduction
  2. CSV-TSV
  3. JSON
  4. Parquet
  5. Arrow
  6. Avro
  7. YAML
  8. Duck DB

[Avg. reading time: 6 minutes]

Introduction to Data Formats

What Are Data Formats?

  • Data formats define how data is represented on disk or over the wire
  • They describe:
    • Structure (rows, columns, trees, blocks)
    • Encoding (text, binary)
    • Schema handling (strict, flexible, embedded, external)
  • In Big Data, data formats are not just a storage choice, they are a performance decision

Why Data Formats Matter in Big Data

  • Big Data systems deal with:
    • Huge volumes
    • Distributed storage
    • Parallel processing
  • A poor format choice can:
    • Waste storage
    • Slow down queries by orders of magnitude
    • Break downstream systems

Choosing the right format directly impacts:

  • Storage efficiency
  • Scan speed
  • Compression ratio
  • CPU usage
  • Network I/O

This is why data engineers care about formats more than application developers do.
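
A small experiment makes the storage point concrete. The sketch below writes the same three made-up records as CSV and as JSON using only the standard library and compares the byte counts; the sample rows are illustrative, and real differences depend on the data.

```python
import csv
import io
import json

rows = [
    {"name": "Rachel", "age": 30, "city": "New York"},
    {"name": "Phoebe", "age": 25, "city": "San Francisco"},
    {"name": "Ross", "age": 32, "city": "New York"},
]

# CSV: the column names appear once, in the header.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["name", "age", "city"])
writer.writeheader()
writer.writerows(rows)
csv_size = len(buffer.getvalue().encode("utf-8"))

# JSON: every record repeats every key.
json_size = len(json.dumps(rows).encode("utf-8"))

print(f"CSV bytes:  {csv_size}")
print(f"JSON bytes: {json_size}")
```

The JSON output is larger because each record carries its own copy of the key names. On three rows this hardly matters; repeated across billions of records, it shows up in storage and network I/O.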

Big Data Reality Check

  • Data rarely lives in a single database
  • Data moves through:
    • APIs
    • Message queues
    • Object storage
    • Data lakes
  • File formats become the contract between systems

Once data is written in a format, changing it later is expensive.

Data Formats vs Traditional Database Storage

| Feature | Traditional RDBMS | Big Data Formats |
|---------|-------------------|------------------|
| Storage Unit | Tables | Files or streams |
| Schema | Fixed, enforced on write | Often flexible or schema-on-read |
| Access Pattern | Row-based | Row, column, or block-based |
| Optimization | Indexes, transactions | Partitioning, compression, vectorized reads |
| Scale Model | Vertical or limited horizontal | Designed for distributed systems |
| Typical Use | OLTP, dashboards | ETL, analytics, ML pipelines |

Key Shift for Data Engineers

  • Databases optimize queries
  • Data formats optimize data movement and scanning
  • In Big Data:
    • Data is written once
    • Read many times
    • Often by different engines

That’s why formats like CSV, JSON, Avro, Parquet, and ORC exist, each solving a different problem.

What This Chapter Will Cover

  • Text vs binary formats
  • Row-based vs columnar storage
  • Schema-on-write vs schema-on-read
  • When formats break at scale
  • Why Parquet dominates analytics workloads

#bigdata #dataformat #rdbms

[Avg. reading time: 3 minutes]

Common Data Formats

CSV (Comma-Separated Values)

A simple text-based format where each row represents a record and each column is separated by a comma.

Example

name,age,city
Rachel,30,New York
Phoebe,25,San Francisco

Use Cases

  • Data exchange between systems
  • Lightweight storage
  • Import/export from databases and spreadsheets

Pros

  • Human-readable
  • Easy to generate and parse
  • Supported by almost every tool

Cons

  • No support for nested or complex structures
  • No schema enforcement
  • No data types, everything is text
  • Inefficient for very large datasets
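
The "everything is text" limitation is easy to see with Python's built-in `csv` module. A short sketch using the example rows above; the conversion to `int` is something the reader has to do, because the file carries no type information.

```python
import csv
import io

data = "name,age,city\nRachel,30,New York\nPhoebe,25,San Francisco\n"

records = list(csv.DictReader(io.StringIO(data)))
print(records[0])
print(type(records[0]["age"]))  # the age comes back as str, not int

# The reader has to apply types itself.
ages = [int(record["age"]) for record in records]
print(sum(ages) / len(ages))
```

Every consumer of the file has to repeat this conversion and agree on what each column means, which is one reason typed formats are preferred for larger pipelines.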

TSV (Tab-Separated Values)

Similar to CSV, but uses tab characters instead of commas as delimiters.

Example

name	age	city
Rachel	30	New York
Phoebe	25	San Francisco

Use Cases

  • Same use cases as CSV
  • Useful when data contains commas frequently

Pros

  • Simple and human-readable
  • Avoids issues with commas inside values
  • Easy to parse

Cons

  • No schema enforcement
  • No nested or complex data support
  • Same scalability and performance issues as CSV

#bigdata #dataformat #csv #tsv

[Avg. reading time: 6 minutes]

JSON

JavaScript Object Notation

  • Neither row-based nor columnar
  • Flexible way to store and share data across systems
  • Text-based format using curly braces and key-value pairs

Simplest JSON Example

{"id": "1","name":"Rachel"}

Properties

  • Language independent
  • Self-describing
  • Human-readable
  • Widely supported across platforms

Basic Rules

  • Curly braces {} hold objects
  • Data is represented as key-value pairs
  • Entries are separated by commas
  • Double quotes are mandatory for keys and string values
  • Square brackets [] hold arrays
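
These rules can be checked with Python's built-in `json` module. In this sketch the sample strings are made up; the second one uses single quotes, which Python dictionaries accept but JSON does not.

```python
import json

valid = '{"id": 1, "name": "Rachel", "tags": ["friend", "waitress"]}'
invalid = "{'id': 1, 'name': 'Rachel'}"  # single quotes are not JSON

parsed = json.loads(valid)
print(parsed["name"], parsed["tags"])

try:
    json.loads(invalid)
except json.JSONDecodeError as err:
    print("rejected:", err)
```

A strict parser rejects the second string outright, so text that merely looks like JSON is not enough when it is exchanged between systems.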

JSON Values

String  {"name":"Rachel"}

Number  {"id":101}

Boolean {"result":true, "status":false}  (lowercase)

Object  {
            "character":{"fname":"Rachel","lname":"Green"}
        }

Array   {
            "characters":["Rachel","Ross","Joey","Chandler"]
        }

NULL    {"id":null}

Sample JSON Document

{
    "characters": [
        {
            "id" : 1,
            "fName":"Rachel",
            "lName":"Green",
            "status":true
        },
        {
            "id" : 2,
            "fName":"Ross",
            "lName":"Geller",
            "status":true
        },
        {
            "id" : 3,
            "fName":"Chandler",
            "lName":"Bing",
            "status":true
        },
        {
            "id" : 4,
            "fName":"Phoebe",
            "lName":"Buffay",
            "status":false
        }
    ]
}

JSON Best Practices

Avoid hyphens in your keys.

{"first-name":"Rachel","last-name":"Green"}  is valid JSON, but awkward to use. ✘

In JavaScript, dot access such as

data.first-name

is parsed as

(data.first) - (name)

Underscores Okay

{"first_name":"Rachel","last_name":"Green"} is okay ✓

Lowercase Okay

{"firstname":"Rachel","lastname":"Green"} is okay ✓

Camelcase best

{"firstName":"Rachel","lastName":"Green"} is the best. ✓

Use Cases

  • APIs and web services
  • Configuration files
  • NoSQL databases
  • Serialization and deserialization

Python Example

Serialize: Convert a Python object to a JSON (shareable) string.

Deserialize: Convert a JSON (shareable) string back to a Python object.


import json

def json_serialize(file_name):
    friends_characters={
        "characters":[
            {"name":"Rachel Green","job":"Fashion Executive"},
            {"name":"Ross Geller","job":"Paleontologist"},
            {"name":"Monica Geller","job":"Chef"},
            {"name":"Chandler Bing","job":"Statistical Analysis and Data Reconfiguration"},
            {"name":"Joey Tribbiani","job":"Actor"},
            {"name":"Phoebe Buffay","job":"Massage Therapist"}
        ]
    }
    # json.dump writes the JSON text straight to the file;
    # json.dumps(obj) would return the same text as a string instead.
    with open(file_name,"w") as f:
        json.dump(friends_characters,f,indent=4)

def json_deserialize(file_name):
    with open(file_name,"r") as f:
        data=json.load(f)
    print(data,type(data))

def main():
    file_name="friends_characters.json"
    json_serialize(file_name)
    json_deserialize(file_name)

if __name__=="__main__":
    main()

#bigdata #dataformat #json #hierarchical

Last change: 2026-04-21

[Avg. reading time: 16 minutes]

Parquet

Parquet is a columnar storage file format designed for big data analytics.

  • Optimized for reading large datasets
  • Works extremely well with engines like Spark, Hive, DuckDB, Athena
  • Best suited for WORM workloads (Write Once, Read Many)

Why Parquet Exists

Most analytics questions look like this:

  • Total sales per country
  • Total T-Shirts sold
  • Revenue for UK customers

These queries do not need all columns.

Row-based formats still scan everything.
Parquet does not.


Row-Based Storage (CSV, JSON)

Row Storage

If you ask:

Total T-Shirts sold or Customers from UK

The engine must scan every column of every row.

Row Scan

This is slow at scale.


Columnar Storage (Parquet)

Columnar Storage

  • Each column is stored separately
  • Queries read only required columns
  • Massive reduction in disk I/O
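
The difference between the two layouts can be sketched in plain Python. This is only an illustration of the idea, not how Parquet is implemented; the three records and the variable names are assumptions made for the example.

```python
# Row-oriented layout: one dict per record (like CSV or JSON lines).
rows = [
    {"product": "Ball", "country": "USA", "sales": 100},
    {"product": "T-Shirt", "country": "USA", "sales": 200},
    {"product": "Socks", "country": "UK", "sales": 150},
]

# Column-oriented layout: one list per column (like Parquet).
columns = {name: [row[name] for row in rows] for name in rows[0]}

# Row layout: every record is visited even though only "sales" is needed.
total_from_rows = sum(row["sales"] for row in rows)

# Column layout: only the "sales" list is touched.
total_from_columns = sum(columns["sales"])

print(total_from_rows, total_from_columns)
```

Both totals are the same; what changes is how much data has to be read to compute them.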

Two Important Query Terms

Projection

Columns required by the query.

    select product, country, salesamount from sales;

Projection:

  • product
  • country
  • salesamount

Predicate

Row-level filter condition.

    select product, country, salesamount from sales where country='UK';

Predicate:

country = 'UK'

Parquet uses metadata to skip unnecessary data.


Row Groups

Parquet splits data into row groups.

Each row group contains:

  • All columns
  • Metadata (min/max values)

This allows:

  • Parallel processing
  • Skipping row groups that don’t match filters.

Parquet - Columnar Storage + Row Groups

Row Groups


Sample Data

Product | Customer | Country | Date       | Sales Amount
Ball    | John Doe | USA     | 2023-01-01 | 100
T-Shirt | John Doe | USA     | 2023-01-02 | 200
Socks   | Jane Doe | UK      | 2023-01-03 | 150
Socks   | Jane Doe | UK      | 2023-01-04 | 180
T-Shirt | Alex     | USA     | 2023-01-05 | 120
Socks   | Alex     | USA     | 2023-01-06 | 220

Data stored inside Parquet

┌──────────────────────────────────────────────┐
│                File Header                   │
│  ┌────────────────────────────────────────┐  │
│  │ Magic Number: "PAR1"                   │  │
│  └────────────────────────────────────────┘  │
├──────────────────────────────────────────────┤
│                Row Group 1                   │
│  ┌────────────────────────────────────────┐  │
│  │ Column Chunk: Product                  │  │
│  │  ├─ Page 1: Ball, T-Shirt, Socks       │  │
│  └────────────────────────────────────────┘  │
│  ┌────────────────────────────────────────┐  │
│  │ Column Chunk: Customer                 │  │
│  │  ├─ Page 1: John Doe, John Doe, Jane Doe│ │
│  └────────────────────────────────────────┘  │
│  ┌────────────────────────────────────────┐  │
│  │ Column Chunk: Country                  │  │
│  │  ├─ Page 1: USA, USA, UK               │  │
│  └────────────────────────────────────────┘  │
│  ┌────────────────────────────────────────┐  │
│  │ Column Chunk: Date                     │  │
│  │  ├─ Page 1: 2023-01-01, 2023-01-02,    │  │
│  │            2023-01-03                  │  │
│  └────────────────────────────────────────┘  │
│  ┌────────────────────────────────────────┐  │
│  │ Column Chunk: Sales Amount             │  │
│  │  ├─ Page 1: 100, 200, 150              │  │
│  └────────────────────────────────────────┘  │
│  ┌────────────────────────────────────────┐  │
│  │ Row Group Metadata                     │  │
│  │  ├─ Num Rows: 3                        │  │
│  │  ├─ Min/Max per Column:                │  │
│  │     • Product: Ball/T-Shirt            │  │
│  │     • Customer: Jane Doe/John Doe      │  │
│  │     • Country: UK/USA                  │  │
│  │     • Date: 2023-01-01 to 2023-01-03   │  │
│  │     • Sales Amount: 100 to 200         │  │
│  └────────────────────────────────────────┘  │
├──────────────────────────────────────────────┤
│                Row Group 2                   │
│  ┌────────────────────────────────────────┐  │
│  │ Column Chunk: Product                  │  │
│  │  ├─ Page 1: Socks, T-Shirt, Socks      │  │
│  └────────────────────────────────────────┘  │
│  ┌────────────────────────────────────────┐  │
│  │ Column Chunk: Customer                 │  │
│  │  ├─ Page 1: Jane Doe, Alex, Alex       │  │
│  └────────────────────────────────────────┘  │
│  ┌────────────────────────────────────────┐  │
│  │ Column Chunk: Country                  │  │
│  │  ├─ Page 1: UK, USA, USA               │  │
│  └────────────────────────────────────────┘  │
│  ┌────────────────────────────────────────┐  │
│  │ Column Chunk: Date                     │  │
│  │  ├─ Page 1: 2023-01-04, 2023-01-05,    │  │
│  │            2023-01-06                  │  │
│  └────────────────────────────────────────┘  │
│  ┌────────────────────────────────────────┐  │
│  │ Column Chunk: Sales Amount             │  │
│  │  ├─ Page 1: 180, 120, 220              │  │
│  └────────────────────────────────────────┘  │
│  ┌────────────────────────────────────────┐  │
│  │ Row Group Metadata                     │  │
│  │  ├─ Num Rows: 3                        │  │
│  │  ├─ Min/Max per Column:                │  │
│  │     • Product: Socks/T-Shirt           │  │
│  │     • Customer: Alex/Jane Doe          │  │
│  │     • Country: UK/USA                  │  │
│  │     • Date: 2023-01-04 to 2023-01-06   │  │
│  │     • Sales Amount: 120 to 220         │  │
│  └────────────────────────────────────────┘  │
├──────────────────────────────────────────────┤
│                File Metadata                 │
│  ┌────────────────────────────────────────┐  │
│  │ Schema:                                │  │
│  │  • Product: string                     │  │
│  │  • Customer: string                    │  │
│  │  • Country: string                     │  │
│  │  • Date: date                          │  │
│  │  • Sales Amount: double                │  │
│  ├────────────────────────────────────────┤  │
│  │ Compression Codec: Snappy              │  │
│  ├────────────────────────────────────────┤  │
│  │ Num Row Groups: 2                      │  │
│  ├────────────────────────────────────────┤  │
│  │ Offsets to Row Groups                  │  │
│  │  • Row Group 1: offset 128             │  │
│  │  • Row Group 2: offset 1024            │  │
│  └────────────────────────────────────────┘  │
├──────────────────────────────────────────────┤
│                File Footer                   │
│  ┌────────────────────────────────────────┐  │
│  │ Offset to File Metadata: 2048          │  │
│  │ Magic Number: "PAR1"                   │  │
│  └────────────────────────────────────────┘  │
└──────────────────────────────────────────────┘

Example:

SELECT product, salesamount
FROM sales
WHERE country = 'UK';

Parquet will:

  • Read only product, salesamount, country
  • Skip row groups whose min/max metadata shows no row can have country = 'UK'
  • Ignore all other columns

This is why Parquet is fast.
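
Row-group skipping can be sketched in plain Python using the sample data. This is a simplified model, not the Parquet reader itself; `row_groups`, `min_max`, and `query_sales_above` are hypothetical names, and the filter is on the sales amount because that is where the min/max statistics of the sample actually allow a skip.

```python
# Two row groups holding the sample data, stored column by column.
row_groups = [
    {"product": ["Ball", "T-Shirt", "Socks"],
     "country": ["USA", "USA", "UK"],
     "sales": [100, 200, 150]},
    {"product": ["Socks", "T-Shirt", "Socks"],
     "country": ["UK", "USA", "USA"],
     "sales": [180, 120, 220]},
]

def min_max(values):
    """Per-column statistics, like the min/max kept in row group metadata."""
    return min(values), max(values)

def query_sales_above(threshold):
    """SELECT product, sales WHERE sales > threshold, skipping groups by max."""
    results, groups_read = [], 0
    for group in row_groups:
        _, highest = min_max(group["sales"])
        if highest <= threshold:
            continue  # no row in this group can match, so skip it entirely
        groups_read += 1
        # Projection: only the product and sales columns are touched.
        for product, sales in zip(group["product"], group["sales"]):
            if sales > threshold:
                results.append((product, sales))
    return results, groups_read

print(query_sales_above(200))
```

For `country = 'UK'` on this sample, both row groups have a minimum of UK and a maximum of USA, so neither can be skipped. Skipping pays off most when the data is sorted or clustered on the filter column.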

Compression

Parquet compresses each column separately. This works very well because values in one column share a type and often repeat.

Common codecs:

Snappy

  • Fast
  • Low CPU usage
  • Lower compression
  • Used in hot / frequently queried data

GZip

  • Slower
  • Higher compression
  • Used in cold / archival data

Encoding

Encoding reduces storage before compression.

Dictionary Encoding

  • Replaces repeated values with small integers

Dictionary: 0 = Ball, 1 = T-Shirt, 2 = Socks
Data Page: [0, 1, 2, 2, 1, 2]
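
A minimal sketch of dictionary encoding in plain Python, applied to the Product column of the sample data. The function name `dictionary_encode` is made up for this example; real Parquet writers do this internally.

```python
def dictionary_encode(values):
    """Return (dictionary, codes): each distinct value gets the next small integer."""
    dictionary = {}
    codes = []
    for value in values:
        if value not in dictionary:
            dictionary[value] = len(dictionary)
        codes.append(dictionary[value])
    return dictionary, codes

products = ["Ball", "T-Shirt", "Socks", "Socks", "T-Shirt", "Socks"]
dictionary, codes = dictionary_encode(products)

print(dictionary)
print(codes)
```

Each product name is stored once in the dictionary; the data page only holds the small integer codes.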

Run-Length Encoding

  • Compresses repeated consecutive values

If the Country column were sorted: [USA, USA, USA, USA, UK, UK]
RLE: [(4, USA), (2, UK)]
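
Run-length encoding can be sketched with `itertools.groupby` from the standard library. The function name is hypothetical, and the input is the Country column in the order it appears in the sample table.

```python
from itertools import groupby

def run_length_encode(values):
    """Collapse consecutive repeats into (count, value) pairs."""
    return [(len(list(run)), value) for value, run in groupby(values)]

country = ["USA", "USA", "UK", "UK", "USA", "USA"]  # sample table order

unsorted_runs = run_length_encode(country)
sorted_runs = run_length_encode(sorted(country))

print(unsorted_runs)
print(sorted_runs)
```

Sorting the column first leaves fewer runs, which is why sorted or clustered data compresses better with this encoding.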

Delta Encoding

  • Stores differences between values (dates, counters)

Date column: [2023-01-01, 2023-01-02, 2023-01-03, ...]
Delta Encoding: [2023-01-01, +1, +1, ...]

Together, these encodings make Parquet compact and efficient.
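
Delta encoding of a date column can be sketched in plain Python as follows. This is a simplified illustration: real Parquet delta encoding works on integers and packs the differences into blocks. The function names are assumptions made for the example.

```python
from datetime import date, timedelta

def delta_encode(days):
    """Keep the first date, then store the day differences between neighbours."""
    deltas = [(b - a).days for a, b in zip(days, days[1:])]
    return days[0], deltas

def delta_decode(first, deltas):
    """Rebuild the original dates by adding the differences back."""
    days = [first]
    for step in deltas:
        days.append(days[-1] + timedelta(days=step))
    return days

dates = [date(2023, 1, d) for d in range(1, 7)]  # 2023-01-01 .. 2023-01-06
first, deltas = delta_encode(dates)

print(first, deltas)
```

The differences are small, repetitive numbers, which compress far better than six full dates.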

Summary about Parquet

  • Columnar storage
  • Very fast analytical queries
  • Excellent compression
  • Schema support
  • Works across languages and engines
  • Industry standard for data lakes

Python Example


import pandas as pd

file_path = 'https://raw.githubusercontent.com/gchandra10/filestorage/main/sales_100.csv'

# Read the CSV file
df = pd.read_csv(file_path)

# Display the first few rows of the DataFrame
print(df.head())

# Write DataFrame to a Parquet file
df.to_parquet('sample.parquet')

Some utilities to inspect Parquet files

WIN/MAC

https://aloneguid.github.io/parquet-dotnet/parquet-floor.html#installing

MAC

https://github.com/hangxie/parquet-tools
parquet-tools row-count sample.parquet
parquet-tools schema sample.parquet
parquet-tools cat sample.parquet
parquet-tools meta sample.parquet

Remote Files

parquet-tools row-count https://github.com/gchandra10/filestorage/raw/refs/heads/main/sales_onemillion.parquet

#bigdata #dataformat #parquet #columnar #compressed

Last change: 2026-04-21

[Avg. reading time: 9 minutes]

Apache Arrow

Apache Arrow is an in-memory columnar data format designed for fast data exchange and analytics.

  • Parquet is for disk
  • Arrow is for memory

Arrow allows different systems to share data without copying or converting it.


Why Arrow Exists

Traditional formats focus on storage:

  • CSV, JSON → human-readable, slow
  • Parquet → compressed, efficient on disk

But once data is loaded into memory:

  • Engines still spend time converting data
  • Python, JVM, C++, R all use different memory layouts

Arrow solves this by providing a common in-memory columnar layout.


What Arrow Is Good At

  • Fast in-memory analytics
  • Zero-copy data sharing
  • Cross-language interoperability
  • Vectorized processing

Arrow is not a replacement for Parquet.

They work together.


Row-by-Row vs Vectorized Processing

Row-wise Processing (Slow)

Each value is processed one at a time.

data=[1,2,3,4]
for i in range(len(data)):
    data[i]=data[i]+10

Vectorized Processing (Fast)

One operation runs on the entire column at once.

import numpy as np
data=np.array([1,2,3,4])
data=data+10

Zero-Copy

Normally:

  • Data is copied when moving between tools
  • Copying costs time and memory

With Arrow:

  • Arrow enables zero-copy data sharing when both systems support it.
  • No serialization.
  • No extra copies.

Parquet → Arrow → Pandas → ML → Arrow → Parquet

  • Fast, clean, efficient.
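
The difference between copying and zero-copy can be sketched with Python's standard library alone. This is an analogy using `array` and `memoryview`, not the Arrow API: a copy gets its own buffer, while a view is just a window onto the existing one.

```python
from array import array

numbers = array("q", [1, 2, 3, 4])   # one contiguous buffer of 64-bit integers

copied = array("q", numbers)         # copy: a second, independent buffer
view = memoryview(numbers)           # zero-copy: a window onto the same buffer

numbers[0] = 99                      # change the original buffer

print(copied[0], view[0])
```

The copy still holds the old value, while the view sees the change immediately because no data was duplicated. Arrow applies the same idea across tools and languages by agreeing on one memory layout.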

Feature     | Apache Arrow           | Apache Parquet
Purpose     | In-memory analytics    | On-disk storage
Location    | RAM                    | Disk
Performance | Very fast, interactive | Optimized for scans
Compression | Minimal                | Heavy compression
Use Case    | Data exchange, compute | Data lakes, warehousing

Demonstration (With and Without Vectorization)


import time
import numpy as np
import pyarrow as pa

N = 10_000_000
data_list = list(range(N))           # Python list
data_array = np.arange(N)            # NumPy array
arrow_arr = pa.array(data_list)      # Arrow array
np_from_arrow = arrow_arr.to_numpy() # Convert Arrow buffer to NumPy

# ---- Traditional Python list loop ----
start = time.time()
result1 = [x + 1 for x in data_list]
print(f"List processing time: {time.time() - start:.4f} seconds")

# ---- NumPy vectorized ----
start = time.time()
result2 = data_array + 1
print(f"NumPy processing time: {time.time() - start:.4f} seconds")

# ---- Arrow + NumPy ----
start = time.time()
result3 = np_from_arrow + 1
print(f"Arrow + NumPy processing time: {time.time() - start:.4f} seconds")

Use Cases

Data Science & Machine Learning

  • Share data between Pandas, Spark, R, and ML libraries without copying or converting.

Streaming & Real-Time Analytics

  • Ideal for passing large datasets through streaming frameworks with low latency.

Data Exchange

  • Move data between different systems with a common representation (e.g. Pandas → Spark → R).

Big Data

  • Integrates with Parquet, Avro, and other formats for ETL and analytics.

Think of Arrow as the in-memory twin of Parquet: Arrow is perfect for fast, interactive analytics; Parquet is great for long-term, compressed storage.

#dataformat #arrow

Last change: 2026-04-21

[Avg. reading time: 5 minutes]

Avro

Avro is a row-based binary data serialization format designed for data exchange and streaming systems.

Unlike Parquet, Avro is optimized for writing and reading one record at a time.


Why Avro Exists

Many systems need to:

  • Send data between producers and consumers
  • Handle continuous streams of events
  • Evolve data schemas safely over time

Text formats like JSON are:

  • Easy to read
  • Slow and verbose

Avro solves this with:

  • Compact binary encoding
  • Strong schema support

Key Characteristics

  • Row-based format
  • Supports Schema evolution
  • Binary and compact
  • Schema-driven
  • Designed for interoperability
  • Excellent for streaming pipelines

Schema in Avro

Avro uses a JSON schema to define data structure.

The schema:

  • Describes fields and data types
  • Travels with the data or is shared separately
  • Enables backward and forward compatibility

Example schema:

{
  "type": "record",
  "name": "Person",
  "fields": [
    {"name": "firstName", "type": "string"},
    {"name": "lastName", "type": "string"},
    {"name": "age", "type": "int"},
    {"name": "email", "type": ["null","string"], "default": null}
  ]
}
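
How a default value supports schema evolution can be sketched in plain Python. This is not the Avro library; `resolve` is a hypothetical helper that mimics one rule of schema resolution: when a record is missing a field, the reader fills it from the schema's default.

```python
import json

# The example schema from above, as the reader (consumer) sees it.
reader_schema = json.loads("""
{
  "type": "record",
  "name": "Person",
  "fields": [
    {"name": "firstName", "type": "string"},
    {"name": "lastName", "type": "string"},
    {"name": "age", "type": "int"},
    {"name": "email", "type": ["null", "string"], "default": null}
  ]
}
""")

def resolve(record, schema):
    """Fill fields missing from an older record with the schema's defaults."""
    resolved = {}
    for field in schema["fields"]:
        name = field["name"]
        if name in record:
            resolved[name] = record[name]
        elif "default" in field:
            resolved[name] = field["default"]
        else:
            raise ValueError(f"missing field with no default: {name}")
    return resolved

# A record written before the email field was added to the schema.
old_record = {"firstName": "Rachel", "lastName": "Green", "age": 30}
resolved_record = resolve(old_record, reader_schema)

print(resolved_record)
```

Because `email` has a default, old records stay readable after the field is added. A new field without a default would break them.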

Where Avro Is Used

  • Kafka producers and consumers
  • Streaming and real-time pipelines
  • Data ingestion layers
  • Cross-language data exchange

When NOT to Use Avro

  • Analytical queries
  • Aggregations
  • Column-level filtering

Avro vs Parquet

Feature        | Avro                 | Parquet
Storage Style  | Row-based            | Columnar
Optimized For  | Streaming, writes    | Analytics, reads
Typical Access | One record at a time | Selected columns
Compression    | Moderate             | Very high
Common Use     | Kafka, ingestion     | Data lakes, OLAP

#dataformat #avro #rowbased

Last change: 2026-04-21

[Avg. reading time: 4 minutes]

YAML

YAML stands for YAML Ain’t Markup Language.

  • Human-readable data serialization format
  • Designed for configuration, not large datasets
  • Structure is defined by indentation
  • Whitespace matters


Core Data Structures

Key–Value (Map / Dictionary)

app: analytics
version: 1.0

List (Sequence / Array)

ports:
  - 8080
  - 9090

Nested structures

database:
  host: localhost
  port: 5432
  credentials:
    user: admin
    password: secret

Scalars

  • string, int, float, bool, null
  • true, false, null are native types

YAML vs JSON

  • YAML 1.2 is a superset of JSON, so a YAML parser can read JSON syntax.
  • No braces, no commas
  • Comments are allowed
  • Types inferred, not enforced
  • Easier diffs in git
  • Easier to break with bad indentation

Tradeoff is real. YAML is readable but fragile.

JSON:

{"id":1,"name":"event","tags":["click","mobile"]}

YAML:

id: 1
name: event
tags:
  - click
  - mobile

Real-world use cases

Here are some popular use cases in Data Engineering:

  • CI/CD
  • Terraform
  • Docker
  • Airflow

JSON is for DATA and YAML is for Config

YAML is a bad choice for data if:

  • The dataset is large
  • Write frequency is high
  • Data is streaming or continuous
  • The system is schema-critical

Because

  • YAML is slow to parse (compared to JSON)
  • Hard to validate strictly
  • No native indexing
  • YAML parsers build large memory trees

yq: a lightweight, portable command-line YAML processor

https://mikefarah.gitbook.io/yq/

#dataformat #yaml #yq

Last change: 2026-04-21

[Avg. reading time: 9 minutes]

DuckDB

DuckDB is a lightweight analytical database designed to run locally with no external dependencies.

  • Single-file database
  • Zero setup
  • Optimized for analytics
  • Excellent support for modern data formats like Parquet

DuckDB is often called the SQLite for analytics.


Why DuckDB Is Useful Here

DuckDB helps us experience the impact of data formats.

It allows us to:

  • Query CSV and Parquet directly
  • See why columnar formats are faster
  • Run analytical queries without Spark or a cluster

DuckDB is a tool for learning, not the topic itself.


Key Capabilities (High Level)

  • Automatic parallel query execution
  • Fast analytical SQL engine
  • Native Parquet support
  • Reads files directly without loading them into tables
  • Works well with Python and data science workflows

Download the CLI Client

DuckDB in Data Engineering

Download orders.parquet

Open Command Prompt or Terminal

./duckdb

or

duckdb.exe

Create / Open a database

.open ordersdb

DuckDB lets you read the contents of orders.parquet as is, without creating a table. The quotes around the file name orders.parquet are essential.

describe table "orders.parquet";
select * from "orders.parquet" limit 3;
show tables;

create table orders  as select * from "orders.parquet";

select count(*) from orders;

DuckDB supports parallel query processing, and queries run fast.

This table has 1.5 million rows, and aggregation happens in less than a second.

select now(); select o_orderpriority,count(*) cnt from orders group by o_orderpriority; select now();

DuckDB also helps to convert parquet files to CSV in a snap. It also supports converting CSV to Parquet.

COPY "orders.parquet" to 'orders.csv' (FORMAT "CSV", HEADER 1);
select * from "orders.csv" limit 3;

It also supports exporting existing Tables to Parquet files.

COPY "orders" to  'neworder.parquet' (FORMAT "PARQUET");

DuckDB has client libraries for programming languages such as Python, R, Java, Node.js, and C/C++.

DuckDB also supports higher-level SQL features such as macros, sequences, and window functions.

Get sample data from Yellow Cab

https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page

Copy yellow cabs data into yellowcabs folder

create table taxi_trips as select * from "yellowcabs/*.parquet";
SELECT
    PULocationID,
    EXTRACT(HOUR FROM tpep_pickup_datetime) AS hour_of_day,
    AVG(fare_amount) AS avg_fare
FROM
    taxi_trips
GROUP BY
    PULocationID,
    hour_of_day;

Extensions

https://duckdb.org/docs/extensions/overview

INSTALL json;
LOAD json;
select * from 'demo.json';
describe 'demo.json';

Load directly from HTTP location

select * from 'https://raw.githubusercontent.com/gchandra10/filestorage/main/sales_100.csv';

#duckdb #singlefiledatabase #parquet #tools #cli

Last change: 2026-04-21

[Avg. reading time: 1 minute]

Protocols

  1. Introduction
  2. HTTP
  3. Monolithic Architecture
  4. Statefulness
  5. Microservices
  6. Statelessness
  7. Idempotency
  8. REST API
  9. API Performance
  10. API in Big Data world

[Avg. reading time: 2 minutes]

Introduction

Protocols are standardized rules that govern how data is transmitted, formatted, and processed across systems.

In Big Data, protocols are essential for:

  • Data ingestion (getting data in)
  • Inter-node communication in clusters
  • Remote access to APIs/services
  • Serialization of structured data
  • Security and authorization

Protocol       | Layer         | Use Case Example
HTTP/HTTPS     | Application   | REST API for ingesting external data
Kafka          | Messaging     | Stream processing with Spark or Flink
gRPC           | RPC           | Microservices in ML model serving
MQTT           | Messaging     | IoT data push to cloud
Avro/Proto     | Serialization | Binary data for logs and schema
OAuth/Kerberos | Security      | Secure access to data lakes

#protocols #grpc #http #mqtt

Last change: 2026-04-21

[Avg. reading time: 2 minutes]

HTTP

Basics

HTTP (HyperText Transfer Protocol) is the foundation of data communication on the web, used to transfer data (such as HTML files and images).

GET - Retrieve data. In everyday use: navigating to a URL or clicking a link.

POST - Send data to the server. In everyday use: submitting a form on a website, such as a username and password.


200 Series (Success): 200 OK, 201 Created.

300 Series (Redirection): 301 Moved Permanently, 302 Found.

400 Series (Client Error): 400 Bad Request, 401 Unauthorized, 404 Not Found.

500 Series (Server Error): 500 Internal Server Error, 503 Service Unavailable.
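
The status code series can be explored with Python's standard `http.HTTPStatus` enum. `status_series` is a small helper written for this example; it simply groups a code by its first digit.

```python
from http import HTTPStatus

def status_series(code):
    """Return the series name for an HTTP status code, e.g. 404 -> 'Client Error'."""
    series = {2: "Success", 3: "Redirection", 4: "Client Error", 5: "Server Error"}
    return series.get(code // 100, "Other")

for code in (200, 201, 301, 404, 503):
    # HTTPStatus(code).phrase is the standard reason phrase for the code.
    print(code, HTTPStatus(code).phrase, "->", status_series(code))
```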

#http #get #put #post #statuscodes

Last change: 2026-04-21

[Avg. reading time: 3 minutes]

Monolithic Architecture

Definition: A monolithic architecture is a software design pattern in which an application is built as a unified unit. All application components (user interface, business logic, and data access layers) are tightly coupled and run as a single service.

Characteristics: This architecture is simple to develop, test, deploy, and scale vertically. However, it can become complex and unwieldy as the application grows.

Monolithic

Examples

  • Traditional Banking Systems.
  • Enterprise Resource Planning (SAP ERP) Systems.
  • Content Management Systems like WordPress.
  • Legacy Government Systems. (Tax filing, public records management, etc.)

Advantages and Disadvantages

Advantages: Simplicity in development and deployment, straightforward horizontal scaling (run identical copies behind a load balancer), and often easier debugging since all components are in one place. Latency can also be lower, as in the case of Amazon Prime.

Disadvantages: Scaling challenges (the whole application must be scaled together, even when only one component is under load), difficulty implementing changes or updates (especially in large systems), and potential for longer downtime during maintenance.

#monolithic #banking #amazonprime #tightlycoupled

Last change: 2026-04-21

[Avg. reading time: 8 minutes]

Statefulness

In a stateful system, the server stores information about the client’s current session. This is common in traditional web applications. Here’s what characterizes a stateful system:

Session Memory: The server remembers past interactions and may store session data like user authentication, preferences, and other activities.

Server Dependency: Since the server holds session data, the same server usually handles subsequent requests from the same client. This is important for consistency.

Resource Intensive: Maintaining state can be resource-intensive, as the server needs to manage and store session data for each client.

Example: A web application where a user logs in, and the server keeps track of their authentication status and interactions until they log out.

Statefulness

In this diagram:

Initial Request: The client sends the initial request to the load balancer.

Load Balancer to Server 1: The load balancer forwards the request to Server 1.

Response with Session ID: Server 1 responds to the client with a session ID, establishing a sticky session.

Subsequent Requests: The client sends subsequent requests with the session ID.

Load Balancer Routes to Server 1: The load balancer forwards these requests to Server 1 based on the session ID, maintaining the sticky session.

Server 1 Processes Requests: Server 1 continues to handle requests from this client.

Server 2 Unused: Server 2 remains unused for this particular client due to the stickiness of the session with Server 1.

Stickiness (Sticky Sessions)

Stickiness or sticky sessions are used in stateful systems, particularly in load-balanced environments. It ensures that requests from a particular client are directed to the same server instance. This is important when:

Session Data: The server needs to maintain session data (like login status), and it’s stored locally on a specific server instance.

Load Balancers: In a load-balanced environment, without stickiness, a client’s requests could be routed to different servers, which might not have the client’s session data.

Trade-off: While it helps maintain session continuity, it can reduce the load balancing efficiency and might lead to uneven server load.

Methods of Implementing Stickiness

Cookie-Based Stickiness: The most common method, where the load balancer uses a special cookie to track the server assigned to a client.

IP-Based Stickiness: The load balancer routes requests based on the client’s IP address, sending requests from the same IP to the same server.

Custom Header or Parameter: Some load balancers can use custom headers or URL parameters to track and maintain session stickiness.
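
IP-based stickiness can be sketched as a hash of the client address. This is a minimal illustration, not how any particular load balancer is implemented; the server names and the `pick_server` function are hypothetical.

```python
import hashlib

SERVERS = ["server-1", "server-2", "server-3"]  # hypothetical server pool

def pick_server(client_ip, servers=SERVERS):
    """Map a client IP to a server; the same IP always yields the same server."""
    digest = hashlib.sha256(client_ip.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(servers)
    return servers[index]

for ip in ("203.0.113.7", "203.0.113.7", "198.51.100.23"):
    print(ip, "->", pick_server(ip))
```

Repeated requests from one IP land on the same server. The trade-off mentioned above still applies: if many clients share an IP (for example behind one office network), one server can end up with more load than the others.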

#stateful #stickiness #loadbalancer

Last change: 2026-04-21

[Avg. reading time: 9 minutes]

Microservices

Microservices architecture is a method of developing software applications as a suite of small, independently deployable services. Each service in a microservices architecture is focused on a specific business capability, runs in its own process, and communicates with other services through well-defined APIs. This approach stands in contrast to the traditional monolithic architecture, where all components of an application are tightly coupled and run as a single service.

Characteristics:

Modularity: The application is divided into smaller, manageable pieces (services), each responsible for a specific function or business capability.

Independence: Each microservice is independently deployable, scalable, and updatable. This allows for faster development cycles and easier maintenance.

Decentralized Control: Microservices promote decentralized data management and governance. Each service manages its own data and logic.

Technology Diversity: Teams can choose the best technology stack for their microservice, leading to a heterogeneous technology environment.

Resilience: Failure in one microservice doesn’t necessarily bring down the entire application, enhancing the system’s overall resilience.

Scalability: Microservices can be scaled independently, allowing for more efficient resource utilization based on demand for specific application functions.

Microservices

Data Ingestion Microservices: Collect and process data from multiple sources.

Data Storage: Stores processed weather data and other relevant information.

User Authentication Microservice: Manages user authentication and communicates with the User Database for validation.

User Database: Stores user account information and preferences.

API Gateway: Central entry point for API requests, routes requests to appropriate microservices, and handles user authentication.

User Interface Microservice: Handles the logic for the user interface, serving web and mobile applications.

Data Retrieval Microservice: Fetches weather data from the Data Storage and provides it to the frontends.

Web Frontend: The web interface for end-users, making requests through the API Gateway.

Mobile App Backend: Backend services for the mobile application, also making requests through the API Gateway.

Advantages:

Agility and Speed: Smaller codebases and independent deployment cycles lead to quicker development and faster time-to-market.

Scalability: It is easier to scale specific application parts that require more resources.

Resilience: Isolated services reduce the risk of system-wide failures.

Flexibility in Technology Choices: Microservices can use different programming languages, databases, and software environments.

Disadvantages:

Complexity: Managing a system of many different services can be complex, especially regarding network communication, data consistency, and service discovery.

Overhead: Each microservice might need its own database and transaction management, leading to duplication and increased resource usage.

Testing Challenges: Testing inter-service interactions can be more complex compared to a monolithic architecture.

Deployment Challenges: Requires robust DevOps practices, including continuous integration and continuous deployment (CI/CD) pipelines.

#microservices #RESTAPI #CICD

Last change: 2026-04-21

[Avg. reading time: 6 minutes]

Statelessness

In a stateless system, each request from the client must contain all the information the server needs to fulfill that request. The server does not store any state of the client’s session. This is a crucial principle of RESTful APIs. Characteristics include:

No Session Memory: The server remembers nothing about the user once the transaction ends. Each request is independent.

Scalability: Stateless systems are generally more scalable because the server doesn’t need to maintain session information. Any server can handle any request.

Simplicity and Reliability: The stateless nature makes the system simpler and more reliable, as there’s less information to manage and synchronize across systems.

Example: An API where each request contains an authentication token and all necessary data, allowing any server instance to handle any request.

Statelessness

In this diagram:

Request 1: The client sends a request to the load balancer.

Load Balancer to Server 1: The load balancer forwards Request 1 to Server 1.

Response from Server 1: Server 1 processes the request and sends a response back to the client.

Request 2: The client sends another request to the load balancer.

Load Balancer to Server 2: This time, the load balancer forwards Request 2 to Server 2.

Response from Server 2: Server 2 processes the request and responds to the client.

Statelessness: Each request is independent and does not rely on previous interactions. Different servers can handle different requests without needing a shared session state.

Token-Based Authentication

Common in stateless architectures, this method involves passing a token for authentication with each request instead of relying on server-stored session data. JWT (JSON Web Tokens) is a popular example.
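
The idea behind token-based authentication can be sketched with the standard library. This is a simplified signed token, not a real JWT implementation; `SECRET`, `issue_token`, and `verify_token` are hypothetical names, and a production system would use a vetted JWT library and add an expiry time.

```python
import base64
import hashlib
import hmac
import json

SECRET = b"demo-secret"  # hypothetical key shared by every server instance

def issue_token(claims):
    """Encode the claims and append an HMAC signature over them."""
    payload = base64.urlsafe_b64encode(json.dumps(claims).encode("utf-8"))
    signature = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    return payload.decode("ascii") + "." + signature

def verify_token(token):
    """Return the claims if the signature matches, otherwise None."""
    payload, _, signature = token.partition(".")
    expected = hmac.new(SECRET, payload.encode("ascii"), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(signature, expected):
        return None
    return json.loads(base64.urlsafe_b64decode(payload))

token = issue_token({"user": "rachel", "role": "admin"})

print(verify_token(token))        # valid token: the claims come back
print(verify_token(token + "x"))  # tampered token: rejected
```

Any server that knows the secret can verify the token by itself, so no session has to be stored or shared between servers.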

#statelessness #jwt #REST

Last change: 2026-04-21

[Avg. reading time: 2 minutes]

Idempotency

In simple terms, idempotency is the property where an operation can be applied multiple times without changing the result beyond the initial application.

Think of an elevator button: whether you press it once or mash it ten times, the elevator is still only called once to your floor. The first press changed the state; the subsequent ones are “no-ops.”

In technology, this is the “secret sauce” for reliability. If a network glitch occurs and a request is retried, idempotency ensures you don’t end up with duplicate orders, double payments, or corrupted data.

Idempotency

Popular Examples

  • The MERGE (Upsert) Operation
  • ABS(-5)
  • Using Terraform to deploy a server
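
The difference between an idempotent and a non-idempotent operation can be sketched in a few lines of Python. The function names and the sample order are made up for illustration; the loop stands in for a request that is retried after a network glitch.

```python
def append_order(orders, order):
    """Not idempotent: every retry adds another copy."""
    orders.append(order)

def upsert_order(orders_by_id, order):
    """Idempotent: a retry overwrites the same key with the same value."""
    orders_by_id[order["id"]] = order

order = {"id": 42, "item": "T-Shirt"}

appended, upserted = [], {}
for _ in range(3):  # the same request delivered three times
    append_order(appended, order)
    upsert_order(upserted, order)

print(len(appended), len(upserted))
```

The append ends up with duplicate orders, while the upsert leaves exactly one order no matter how many times it runs.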

#idempotent #merge #upsert #terraform #abs

Last change: 2026-04-21

[Avg. reading time: 8 minutes]

REST API

REpresentational State Transfer is a software architectural style developers apply to web APIs.

REST APIs provide simple, uniform interfaces: data, content, algorithms, media, and other digital resources are made available through web URLs. REST APIs are the most common APIs used across the web today.

Use of a uniform interface (UI)

REST API

HTTP Methods

GET: This method allows the server to find the data you requested and send it back to you.

POST: This method permits the server to create a new entry in the database.

PUT: If you perform the ‘PUT’ request, the server will update an entry in the database.

DELETE: This method allows the server to delete an entry in the database.
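
How the four methods map to operations on a resource can be sketched with an in-memory store. This is a toy model, not a web server: `handle` is a hypothetical function that takes a method and returns a status code with a payload, and the `posts` dictionary stands in for the database.

```python
from itertools import count

posts = {}          # in-memory stand-in for a database table
ids = count(1)      # generates 1, 2, 3, ... for new entries

def handle(method, post_id=None, body=None):
    """Return (status_code, payload) for a request against the posts resource."""
    if method == "POST":
        post_id = next(ids)
        posts[post_id] = body
        return 201, {"id": post_id, **body}      # 201 Created
    if post_id not in posts:
        return 404, None                         # 404 Not Found
    if method == "GET":
        return 200, posts[post_id]
    if method == "PUT":
        posts[post_id] = body
        return 200, body
    if method == "DELETE":
        del posts[post_id]
        return 204, None                         # 204 No Content
    return 405, None                             # 405 Method Not Allowed

print(handle("POST", body={"title": "hello"}))
print(handle("GET", 1))
print(handle("PUT", 1, {"title": "updated"}))
print(handle("DELETE", 1))
print(handle("GET", 1))
```

The last call returns 404 because the entry was deleted by the request before it.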

Sample REST API URI

https://api.zippopotam.us/us/08028

http://api.tvmaze.com/search/shows?q=friends

https://jsonplaceholder.typicode.com/posts

https://jsonplaceholder.typicode.com/posts/1

https://jsonplaceholder.typicode.com/posts/1/comments

https://reqres.in/api/users?page=2

https://reqres.in/api/users/2

http://universities.hipolabs.com/search?country=United+States

https://itunes.apple.com/search?term=michael&limit=1000

https://www.boredapi.com/api/activity

https://techcrunch.com/wp-json/wp/v2/posts?per_page=100&context=embed

Usage

curl https://api.zippopotam.us/us/08028
curl https://api.zippopotam.us/us/08028 -o zipdata.json

Browser based

https://httpie.io/app

VS Code based

Get Thunder Client

Python way

using requests library
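
The sketch below shows the shape of a GET and a POST call from Python. It deliberately uses the standard library's `urllib` so it runs without installing anything; with the requests library the GET would be roughly `requests.get(url, timeout=10).json()`. The helper names `build_request` and `send`, and the POST payload fields, are illustrative assumptions.

```python
import json
import urllib.request


def build_request(url, method="GET", payload=None):
    """Build (but do not send) an HTTP request with an optional JSON body."""
    data = json.dumps(payload).encode("utf-8") if payload is not None else None
    headers = {"Accept": "application/json"}
    if data is not None:
        headers["Content-Type"] = "application/json"
    return urllib.request.Request(url, data=data, headers=headers, method=method)


def send(request, timeout=10):
    """Send the request and decode the JSON response (needs network access)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


get_req = build_request("https://api.zippopotam.us/us/08028")
post_req = build_request(
    "https://jsonplaceholder.typicode.com/posts",
    method="POST",
    payload={"title": "demo", "body": "hello", "userId": 1},  # illustrative fields
)

print(get_req.get_method(), get_req.full_url)
print(post_req.get_method(), post_req.full_url)
# To actually call the API: place = send(get_req)
```

Building the request separately from sending it keeps the example runnable offline. Calling `send(get_req)` performs the real network call.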

Summary

Definition: REST (Representational State Transfer) API is a set of guidelines for building web services. A RESTful API is an API that adheres to these guidelines and allows for interaction with RESTful web services.

How It Works: REST uses standard HTTP methods like GET, POST, PUT, DELETE, etc. It is stateless, meaning each request from a client to a server must contain all the information needed to understand and complete the request.

Data Format: REST APIs typically exchange data in JSON or XML format.

Purpose: REST APIs are designed to be a simple and standardized way for systems to communicate over the web. They enable the backend services to communicate with front-end applications (like SPAs) or other services.

Use Cases: REST APIs are used in web services, mobile applications, and IoT (Internet of Things) applications for various purposes like fetching data, sending commands, and more.

#restapi #REST #curl #requests

Last change: 2026-04-21

[Avg. reading time: 6 minutes]

API Performance

Caching

Store frequently accessed data in a cache so you can access it faster.

If there’s a cache miss, fetch the data from the database.

Caching is effective, but choosing a caching strategy and deciding when to invalidate stale entries can be challenging.
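
A minimal sketch of the read path described above, often called cache-aside, with a simple invalidate-on-write rule. The `DATABASE` dictionary stands in for a slow data store, and all names are hypothetical.

```python
# Hypothetical stand-in for a slow database table
DATABASE = {"user:1": "Ada", "user:2": "Linus"}

cache = {}
stats = {"hits": 0, "misses": 0}


def get_user(key):
    """Cache-aside read: try the cache first, fall back to the database on a miss."""
    if key in cache:
        stats["hits"] += 1
        return cache[key]
    stats["misses"] += 1
    value = DATABASE[key]   # the slow lookup
    cache[key] = value      # populate the cache for next time
    return value


def update_user(key, value):
    """Write to the database and invalidate the cached copy."""
    DATABASE[key] = value
    cache.pop(key, None)


get_user("user:1")               # miss
get_user("user:1")               # hit
update_user("user:1", "Grace")   # invalidation
print(get_user("user:1"))        # miss again, returns the fresh value
print(stats)                     # {'hits': 1, 'misses': 2}
```

Without the `cache.pop` in `update_user`, the last read would return the stale name. That is the invalidation problem in miniature.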

Scale-out with Load Balancing

You can consider scaling your API to multiple servers if one server instance isn’t enough. Horizontal scaling is the way to achieve this.

The challenge will be to find a way to distribute requests between these multiple instances.

Load balancing not only helps with performance but also makes your application more reliable.

However, load balancers work best when your application is stateless and easy to scale horizontally.
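
One simple way to distribute requests is round-robin, sketched below. The server names are made up, and real load balancers usually also track health and load. This only shows the rotation idea.

```python
import itertools
from collections import Counter

# Hypothetical server names
servers = ["api-1", "api-2", "api-3"]
rotation = itertools.cycle(servers)


def route(request_id):
    """Pick the next server in round-robin order for this request."""
    return next(rotation)


assignments = [route(i) for i in range(9)]
print(assignments[:4])       # ['api-1', 'api-2', 'api-3', 'api-1']
print(Counter(assignments))  # each server handles 3 requests
```

Because `route` ignores which client is asking, any server must be able to handle any request. This is why stateless applications are easier to balance.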

Pagination

If your API returns many records, you need to explore Pagination.

You limit the number of records per request.

This improves the response time of your API for the consumer.
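
A minimal sketch of page-based pagination, assuming `page` and `per_page` parameters similar to those in the sample URIs earlier. The `RECORDS` list and the response field names are hypothetical.

```python
# Hypothetical dataset of 95 records
RECORDS = [{"id": i} for i in range(1, 96)]


def get_page(page, per_page=20):
    """Return one page of records plus metadata the client needs to continue."""
    if page < 1 or per_page < 1:
        raise ValueError("page and per_page must be >= 1")
    start = (page - 1) * per_page
    items = RECORDS[start:start + per_page]
    total_pages = -(-len(RECORDS) // per_page)  # ceiling division
    return {
        "page": page,
        "per_page": per_page,
        "total_pages": total_pages,
        "items": items,
    }


first = get_page(1)
last = get_page(5)
print(len(first["items"]), first["total_pages"])  # 20 5
print(len(last["items"]))                         # 15
```

Returning `total_pages` alongside the items lets the client know when to stop asking for more.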

Async Processing

With async processing, you can let the clients know that their requests are registered and under process.

Then, you process the requests individually and communicate the results to the client later.

This allows your application server to take a breather and give its best performance.

But of course, async processing may not be possible for every requirement.
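
The flow above can be sketched with a queue and a background worker. The `submit` function, the `jobs` dictionary, and the status values are assumptions for illustration. A production system would typically use a message broker and a persistent job store.

```python
import queue
import threading
import uuid

jobs = {}                   # job_id -> {"status": ..., "result": ...}
work_queue = queue.Queue()


def submit(payload):
    """Register the request and return immediately with a job id."""
    job_id = str(uuid.uuid4())
    jobs[job_id] = {"status": "accepted", "result": None}
    work_queue.put((job_id, payload))
    return job_id


def worker():
    """Process queued jobs one at a time in the background."""
    while True:
        job_id, payload = work_queue.get()
        jobs[job_id]["result"] = sum(payload)  # stand-in for slow work
        jobs[job_id]["status"] = "done"
        work_queue.task_done()


threading.Thread(target=worker, daemon=True).start()

job_id = submit([1, 2, 3])
print("client sees:", job_id)  # returned before the work finishes

work_queue.join()              # demo only: wait so we can show the result
print(jobs[job_id])            # {'status': 'done', 'result': 6}
```

In an API, `submit` would back an endpoint that answers right away, and a second endpoint would let the client look up the job status later.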

Connection Pooling

An API often needs to connect to the database to fetch some data.

Creating a new connection for each request can degrade performance.

It’s a good idea to use connection pooling to set up a pool of database connections that can be reused across requests.

This is a subtle aspect, but connection pooling can dramatically impact performance in highly concurrent systems.
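
A minimal pool sketch using SQLite from the standard library. The `ConnectionPool` class is hypothetical. Real projects would normally rely on the pooling built into their database driver or framework rather than writing their own.

```python
import queue
import sqlite3
from contextlib import contextmanager


class ConnectionPool:
    """A fixed-size pool of reusable SQLite connections."""

    def __init__(self, database, size=3):
        self._pool = queue.Queue(maxsize=size)
        self.created = 0
        for _ in range(size):
            # check_same_thread=False lets a connection be handed to other threads
            self._pool.put(sqlite3.connect(database, check_same_thread=False))
            self.created += 1

    @contextmanager
    def connection(self):
        conn = self._pool.get()   # blocks if every connection is in use
        try:
            yield conn
        finally:
            self._pool.put(conn)  # return it instead of closing it


pool = ConnectionPool(":memory:", size=2)

# Ten "requests" share two connections instead of opening ten
for _ in range(10):
    with pool.connection() as conn:
        conn.execute("SELECT 1").fetchone()

print(pool.created)  # 2
```

The blocking `get` also acts as a natural limit: at most `size` requests talk to the database at the same time.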

YT Visual representation

#api #performance #loadbalancing #pagination #connectionpool

Last change: 2026-04-21

[Avg. reading time: 4 minutes]

API in Big Data World

Big data and REST APIs are often used together in modern data architectures. Here’s how they interact:

Ingestion gateway

  • Applications push events through REST endpoints
  • Gateway converts to Kafka, Kinesis, or file landing zones
  • REST is the entry door, not the pipeline itself

Serving layer

  • Processed data in Hive, Elasticsearch, Druid, or Delta
  • APIs expose aggregated results to apps and dashboards
  • REST is the read interface on top of heavy compute

Control plane

  • Spark job submission via REST
  • Kafka topic management
  • cluster monitoring and scaling
  • authentication and governance

Microservices boundary

  • Each service owns a slice of data
  • APIs expose curated views
  • internal pipelines stay streaming or batch

What REST is NOT in Big Data

  • Not used for bulk petabyte transfer
  • Not used inside Spark transformations
  • Not the transport between Kafka and processors

Example of API in Big Data

https://docs.redis.com/latest/rs/references/rest-api/

https://rapidapi.com/search/big-data

https://www.kaggle.com/discussions/general/315241

#apiinbigdata #kafka #spark

Last change: 2026-04-21

[Avg. reading time: 2 minutes]

Advanced Python

  1. Data Frames
  2. Decorator
  3. Unit Testing
  4. Error Handling
  5. Logging

[Avg. reading time: 21 minutes]

Data Frames

DataFrames are the core abstraction for tabular data in analytics, machine learning, and ETL systems.

Think of a DataFrame as:

  • A database table
  • An Excel sheet
  • A SQL result set
  • A structured dataset in memory

But with a programmable API.

Using Data Frames helps you to

  • Select columns
  • Filter rows
  • Aggregate data
  • Join datasets
  • Transform data efficiently
  • Read and write formats like CSV, Parquet, JSON, Arrow

A DataFrame is:

  • Column-oriented
  • Vectorized
  • Designed for batch transformations
  • Not meant for row-by-row Python loops

Wrong Idea

for _, row in df.iterrows():
    total = row["price"] * row["quantity"]

Correct Idea

Think in whole-column transformations, not row-by-row iteration.

df["total"] = df["price"] * df["quantity"]

Pandas

Pandas is a popular Python library for data manipulation and analysis. A DataFrame in Pandas is a two-dimensional, size-mutable, and heterogeneous tabular data structure with labeled axes (rows and columns).

Eager Evaluation: Pandas performs operations eagerly, meaning that each operation is executed immediately when called.

In-Memory Copy: The full DataFrame is held in RAM as a single copy.

Sequential Processing: Single-threaded, one operation at a time.

Strengths

  • Extremely intuitive API
  • Huge ecosystem
  • Excellent for exploration
  • Strong integration with ML libraries
  • Perfect for small to medium datasets

Weaknesses

  • Limited by RAM
  • Single-core execution
  • Slow for very large datasets
  • No query optimizer

Example

import pandas as pd

df = pd.read_csv("data/sales_100.csv")

# Filter
filtered = df[df["region"] == "East"]

# Group and aggregate
result = filtered.groupby("category")["sales"].sum()

print(result.head())

When not to use Pandas

  • Data exceeds available memory
  • Computations become slow
  • CPU only uses one core
  • Processing large CSV files takes too long

Google Colab - Pandas

Polars

Polars is a fast, multi-threaded DataFrame library in Rust and Python, designed for performance and scalability. It is known for its efficient handling of larger-than-memory datasets.

Supports both eager and lazy evaluation.

Lazy Evaluation: Instead of loading the entire CSV file into memory right away, a Lazy DataFrame builds a blueprint or execution plan describing how the data should be read and processed. The actual data is loaded only when the computation is triggered (for example, when you call collect()).

Optimizations: Using scan_csv allows Polars to optimize the entire query pipeline before loading any data. This approach is beneficial for large datasets because it minimizes memory usage and improves execution efficiency.

  • pl.read_csv() or pl.read_parquet() - eager evaluation
  • pl.scan_csv() or pl.scan_parquet() - lazy evaluation

Parallel Execution: Multi-threaded compute.

Columnar efficiency: Uses Arrow columnar memory format under the hood.

Pros

  • High performance due to multi-threading and memory-efficient execution.
  • Lazy evaluation, optimizing the execution of queries.
  • Handles larger datasets effectively.

Cons

  • Smaller community and ecosystem compared to Pandas.
  • Less mature with fewer third-party integrations.

Example

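import polars as pl

# Lazy: scan_csv only builds a query plan; no data is read yet
df = pl.scan_csv('data/sales_100.csv')

# head() on a LazyFrame extends the plan; printing shows the plan, not rows
print(df.head())

# collect() executes the plan and displays the first few rows
print(df.head().collect())

# Eager: read_csv loads the whole file immediately
df1 = pl.read_csv('data/sales_100.csv')
print(df1.head())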

Google Colab - Polars


Dask

Dask is a parallel computing library that scales Python libraries like Pandas for large, distributed datasets.

Client (Python Code)
   │
   ▼
Scheduler (builds + manages task graph)
   │
   ▼
Workers (execute tasks in parallel)
   │
   ▼
Results gathered back to client

Open Source https://docs.dask.org/en/stable/install.html

Dask Cloud Coiled Cloud

Lazy Reading: Dask builds a task graph instead of executing immediately — computations run only when triggered (similar to Polars lazy execution).

Partitioning: A Dask DataFrame is split into many smaller Pandas DataFrames (partitions) that can be processed in parallel.

Task Graph: Dask represents your workflow as a directed acyclic graph (DAG) showing the sequence and dependencies of tasks.

Distributed Compute: Dask executes tasks across multiple cores or machines, enabling scalable, parallel data processing.


import dask.dataframe as dd

# 1) Lazy read: matching files are split into partitions; nothing is loaded yet
ddf = dd.read_csv(
    "data/sales_*.csv",
    dtype={"category": "string", "value": "float64"},
    blocksize="64MB"
)

# 2) Lazy transform: per-partition groupby + sum, then global combine
agg = ddf.groupby("category")["value"].sum().sort_values(ascending=False)

# 3) Trigger execution and bring small result to driver
result = agg.compute()

print(result.head(10))

blocksize determines the partition size. If omitted, Dask picks a size automatically based on available memory and CPU cores, capped at 64 MB.

flowchart LR
  A1[CSV part 1] --> P1[parse p1]
  A2[CSV part 2] --> P2[parse p2]
  A3[CSV part 3] --> P3[parse p3]

  P1 --> G1[local groupby-sum p1]
  P2 --> G2[local groupby-sum p2]
  P3 --> G3[local groupby-sum p3]

  G1 --> C[combine-aggregate]
  G2 --> C
  G3 --> C

  C --> S[sort values]
  S --> R[collect to Pandas]
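
The flow in the chart can be imitated in plain Python to show what each stage does. This sketch runs the partitions one after another. Dask would execute the per-partition steps in parallel, and the sample rows and function names here are hypothetical.

```python
from collections import Counter

# Hypothetical (category, value) rows standing in for CSV data
rows = [("toys", 10.0), ("books", 5.0), ("toys", 2.5),
        ("games", 7.0), ("books", 1.0), ("toys", 4.0)]


def partition(data, n_parts):
    """Split rows into n_parts chunks, like Dask partitions."""
    return [data[i::n_parts] for i in range(n_parts)]


def local_groupby_sum(part):
    """Aggregate a single partition on its own."""
    totals = Counter()
    for category, value in part:
        totals[category] += value
    return totals


def combine(partials):
    """Merge the per-partition results into one global result."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # adds values for matching categories
    return total


partials = [local_groupby_sum(p) for p in partition(rows, 3)]
result = sorted(combine(partials).items(), key=lambda kv: kv[1], reverse=True)
print(result)  # [('toys', 16.5), ('games', 7.0), ('books', 6.0)]
```

Only the small per-partition totals need to be brought together, which is why this pattern scales to data that does not fit in memory at once.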

Pros

  • Can handle datasets that don’t fit into memory by processing in parallel.
  • Scales to multiple cores and clusters, making it suitable for big data tasks.
  • Integrates well with Pandas and other Python libraries.

Cons

  • Slightly more complex API compared to Pandas.
  • Performance tuning can be more challenging.

Where to start?

  • Start with Pandas for learning and small datasets.
  • Switch to Polars when performance matters.
  • Use Dask when data exceeds single-machine memory or needs cluster execution.

Google Colab - Dask

Pandas vs Polars vs Dask

| Feature | Pandas | Polars | Dask |
|---|---|---|---|
| Language | Python | Rust with Python bindings | Python |
| Execution Model | Single-threaded | Multi-threaded | Multi-threaded, distributed |
| Data Handling | In-memory | In-memory, Arrow-based | In-memory, out-of-core |
| Scalability | Limited by memory | Limited to single machine | Scales across clusters |
| Performance | Good for small to medium data | High performance for single machine | Good for large datasets |
| API Familiarity | Widely known, mature | Similar to Pandas | Similar to Pandas |
| Ease of Use | Very easy, large ecosystem | Easy, but smaller ecosystem | Moderate, requires understanding of parallelism |
| Fault Tolerance | None | Limited | High, with task retries and rescheduling |
| Machine Learning | Integration with Python ML libs | Preprocessing only | Integration with Dask-ML and other libs |
| Lazy Evaluation | No | Yes | Yes, with task graphs |
| Best For | Data analysis, small datasets | Fast preprocessing on single machine | Large-scale data processing |
| Cluster Management | N/A | N/A | Supports Kubernetes, YARN, etc. |
| Use Cases | Data manipulation, analysis | Fast data manipulation | Large data, ETL, scaling Python code |

#pandas #polars #dask

Last change: 2026-04-21

[Avg. reading time: 16 minutes]

Decorator

Decorators in Python are a powerful way to modify or extend the behavior of functions or methods without changing their code. Decorators are often used for tasks like logging, authentication, and adding additional functionality to functions. They are denoted by the “@” symbol and are applied above the function they decorate.

def say_hello():
    print("World")

say_hello()

How do we change the output without changing the say_hello() function?

wrapper() is not a reserved word. It can be anything.

Use Decorators

# Define a decorator function
def hello_decorator(func):
    def wrapper():
        print("Hello,")
        func()  # Call the original function
    return wrapper

# Use the decorator to modify the behavior of say_hello
@hello_decorator
def say_hello():
    print("World")

# Call the decorated function
say_hello()

If you want to replace the newline character at the end of the print statement, use end=''

# Define a decorator function
def hello_decorator(func):
    def wrapper():
        print("Hello, ", end='')
        func()  # Call the original function
    return wrapper

# Use the decorator to modify the behavior of say_hello
@hello_decorator
def say_hello():
    print("World")

# Call the decorated function
say_hello()

Multiple functions inside the Decorator

def hello_decorator(func):
    def first_wrapper():
        print("First wrapper, doing something before the second wrapper.")
        #func()
    
    def second_wrapper():
        print("Second wrapper, doing something before the actual function.")
        #func()
    
    def main_wrapper():
        first_wrapper()   # Runs first
        second_wrapper()  # Runs second
        func()            # Finally, call the original function
    
    return main_wrapper

@hello_decorator
def say_hello():
    print("World")

say_hello()

Args & Kwargs

  • *args: This is used to represent positional arguments. It collects all the positional arguments passed to the decorated function as a tuple.
  • **kwargs: This is used to represent keyword arguments. It collects all the keyword arguments (arguments passed with names) as a dictionary.
from functools import wraps

def my_decorator(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        print("Positional Arguments (*args):", args)
        print("Keyword Arguments (**kwargs):", kwargs)
        result = func(*args, **kwargs)
        return result
    return wrapper

@my_decorator
def example_function(a, b, c=0, d=0):
    print("Function Body:", a, b, c, d)

# Calling the decorated function with different arguments
example_function(1, 2)
example_function(3, 4, c=5)

Without Wraps


import time
import random
from functools import wraps

def timer(func):
    def wrapper(*args, **kwargs):
        name = wrapper.__name__
        start = time.perf_counter()
        result = func(*args, **kwargs)
        end = time.perf_counter()
        print(f"{name} took {end - start:.6f} seconds")
        return result
    return wrapper


@timer
def built_in_sort(data):
    return sorted(data)


@timer
def bubble_sort(data):
    arr = data.copy()
    n = len(arr)
    for i in range(n):
        for j in range(0, n - i - 1):
            if arr[j] > arr[j + 1]:
                arr[j], arr[j + 1] = arr[j + 1], arr[j]
    return arr


data = [random.randint(1, 100000) for _ in range(5000)]

built_in_sort(data)
bubble_sort(data)

Using Wraps

import time
import random
from functools import wraps

def timer(label=None):
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            name = label or wrapper.__name__
            start = time.perf_counter()
            result = func(*args, **kwargs)
            end = time.perf_counter()
            print(f"{name} took {end - start:.6f} seconds")
            return result
        return wrapper
    return decorator


@timer()
def built_in_sort(data):
    return sorted(data)


@timer("Custom Bubble Sort")
def bubble_sort(data):
    arr = data.copy()
    n = len(arr)
    for i in range(n):
        for j in range(0, n - i - 1):
            if arr[j] > arr[j + 1]:
                arr[j], arr[j + 1] = arr[j + 1], arr[j]
    return arr


data = [random.randint(1, 100000) for _ in range(5000)]

built_in_sort(data)
bubble_sort(data)

The purpose of @wraps is to preserve the metadata of the original function being decorated.
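
A small side-by-side sketch of that metadata effect. The decorator names `plain` and `preserving` are made up for this comparison.

```python
from functools import wraps


def plain(func):
    """Decorator without @wraps: the wrapper hides the original metadata."""
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs)
    return wrapper


def preserving(func):
    """Decorator with @wraps: name and docstring are copied to the wrapper."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs)
    return wrapper


@plain
def add(a, b):
    """Add two numbers."""
    return a + b


@preserving
def multiply(a, b):
    """Multiply two numbers."""
    return a * b


print(add.__name__, add.__doc__)            # wrapper None
print(multiply.__name__, multiply.__doc__)  # multiply Multiply two numbers.
```

This is why the timer without @wraps reports every function as "wrapper", while the version with @wraps reports the real function name.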

#decorator #memoization

Last change: 2026-04-21

[Avg. reading time: 3 minutes]

Unit Testing

A unit test tests a small “unit” of code - usually a function or method - independently from the rest of the program.

Some key advantages of unit testing include:

  • Isolates code - This allows testing individual units in isolation from other parts of the codebase, making bugs easier to identify.
  • Early detection - Tests can catch issues early in development before code is deployed, saving time and money.
  • Regression prevention - Existing unit tests can be run whenever code is changed to prevent new bugs or regressions.
  • Facilitates changes - Unit tests give developers the confidence to refactor or update code without breaking functionality.
  • Quality assurance - High unit test coverage helps enforce quality standards and identify edge cases.

Every language has its unit testing framework. In Python, some popular ones are

  • unittest
  • pytest
  • doctest
  • testify

Example:

Using Pytest & UV

git clone https://github.com/gchandra10/pytest-demo.git
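
For a feel of what a pytest-style test looks like before opening the repo, here is a minimal sketch. The `apply_discount` function and the test names are hypothetical and are not taken from the demo repository.

```python
def apply_discount(price, percent):
    """Return price reduced by percent; reject out-of-range percentages."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return round(price * (1 - percent / 100), 2)


# pytest collects functions whose names start with test_
def test_regular_discount():
    assert apply_discount(200.0, 25) == 150.0


def test_zero_discount_keeps_price():
    assert apply_discount(99.99, 0) == 99.99


def test_invalid_percent_is_rejected():
    try:
        apply_discount(100.0, 150)
    except ValueError:
        pass
    else:
        raise AssertionError("expected ValueError")


if __name__ == "__main__":
    # Allows running the file directly, without pytest installed
    test_regular_discount()
    test_zero_discount_keeps_price()
    test_invalid_percent_is_rejected()
    print("all tests passed")
```

Saved as a file whose name starts with test_, this would be picked up by running pytest in the same folder. With pytest available, the last test is more commonly written using pytest.raises(ValueError).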

#unittesting #pytest

Last change: 2026-04-21

[Avg. reading time: 8 minutes]

Error Handling

Python uses try/except blocks for error handling.

The basic structure is:

try:
    # Code that may raise an exception
except ExceptionType:
    # Code to handle the exception
finally:
    # Code that always runs, whether or not an exception occurred

Uses

Improved User Experience: Instead of the program crashing, you can provide a user-friendly error message.

Debugging: Capturing exceptions can help you log errors and understand what went wrong.

Program Continuity: Allows the program to continue running or perform cleanup operations before terminating.

Guaranteed Cleanup: Ensures that certain operations, like closing files or releasing resources, are always performed.

Some key points

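  • You can catch specific exception types or use a bare except to catch any exception. A bare except also catches KeyboardInterrupt and SystemExit, so except Exception is usually the safer catch-all.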

  • Multiple except blocks can be used to handle different exceptions.

  • An else clause can be added to run if no exception occurs.

  • A finally clause will always execute, whether an exception occurred or not.


Without Try/Except

x = 10 / 0 

Basic Try/Except

try:
    x = 10 / 0 
except ZeroDivisionError:
    print("Error: Division by zero!")

Generic Exception

try:
    file = open("nonexistent_file.txt", "r")
except:
    print("An error occurred!")

Find the exact error

try:
    file = open("nonexistent_file.txt", "r")
except Exception as e:
    print(str(e))

Raise - Else and Finally

try:
    x = -10
    if x <= 0:
        raise ValueError("Number must be positive")
except ValueError as ve:
    print(f"Error: {ve}")
else:
    print(f"You entered: {x}")
finally:
    print("This will always execute")

try:
    x = 10
    if x <= 0:
        raise ValueError("Number must be positive")
except ValueError as ve:
    print(f"Error: {ve}")
else:
    print(f"You entered: {x}")
finally:
    print("This will always execute")

Nested Functions


def divide(a, b):
    try:
        result = a / b
        return result
    except ZeroDivisionError:
        print("Error in divide(): Cannot divide by zero!")
        raise  # Re-raise the exception

def calculate_and_print(x, y):
    try:
        result = divide(x, y)
        print(f"The result of {x} divided by {y} is: {result}")
    except ZeroDivisionError as e:
        print(str(e))
    except TypeError as e:
        print(str(e))

# Test the nested error handling
print("Example 1: Valid division")
calculate_and_print(10, 2)

print("\nExample 2: Division by zero")
calculate_and_print(10, 0)

print("\nExample 3: Invalid type")
calculate_and_print("10", 2)

#errorhandling #exception #try

Last change: 2026-04-21

[Avg. reading time: 7 minutes]

Logging

Python’s logging module provides a flexible framework for tracking events in your applications. It’s used to log messages to various outputs (console, files, etc.) with different severity levels like DEBUG, INFO, WARNING, ERROR, and CRITICAL.

Use Cases of Logging

  • Debugging: Identify issues during development.
  • Monitoring: Track events in production to monitor behavior.
  • Audit Trails: Capture what has been executed for security or compliance.
  • Error Tracking: Store errors for post-mortem analysis.
  • Rotating Log Files: Prevent logs from growing indefinitely using size or time-based rotation.

Python Logging Levels

| Level | Usage | Numeric Value | Description |
|---|---|---|---|
| DEBUG | Detailed information for diagnosing problems. | 10 | Useful during development and debugging stages. |
| INFO | General information about program execution. | 20 | Highlights normal, expected behavior (e.g., program start, process completion). |
| WARNING | Indicates something unexpected but not critical. | 30 | Warns of potential problems or events to monitor (e.g., deprecated functions, nearing limits). |
| ERROR | An error occurred that prevented some part of the program from working. | 40 | Represents recoverable errors that might still allow the program to continue running. |
| CRITICAL | Severe errors indicating a major failure. | 50 | Marks critical issues requiring immediate attention (e.g., system crash, data corruption). |

INFO

import logging

logging.basicConfig(level=logging.INFO)  # Set the logging level to INFO

logging.debug("This is a debug message.")
logging.info("This is an info message.")
logging.warning("This is a warning message.")
logging.error("This is an error message.")
logging.critical("This is a critical message.")

Error

import logging

logging.basicConfig(level=logging.ERROR)  # Set the logging level to ERROR

logging.debug("This is a debug message.")
logging.info("This is an info message.")
logging.warning("This is a warning message.")
logging.error("This is an error message.")
logging.critical("This is a critical message.")

import logging

logging.basicConfig(
    level=logging.DEBUG, 
    format = '%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)

logging.debug("This is a debug message.")
logging.info("This is an info message.")
logging.warning("This is a warning message.")
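
Size-based rotation, mentioned in the use cases above, can be sketched with `RotatingFileHandler` from the standard library. The logger name, the temp-directory location, and the tiny 1 KB limit are assumptions chosen so the rotation is visible in a short demo.

```python
import logging
import os
import tempfile
from logging.handlers import RotatingFileHandler

# Hypothetical log location; a temp directory keeps the demo self-contained
log_dir = tempfile.mkdtemp()
log_path = os.path.join(log_dir, "app.log")

logger = logging.getLogger("demo_app")  # named logger instead of the root logger
logger.setLevel(logging.DEBUG)

# Roll over at about 1 KB and keep at most 2 old files (app.log.1, app.log.2)
handler = RotatingFileHandler(log_path, maxBytes=1024, backupCount=2)
handler.setFormatter(
    logging.Formatter("%(asctime)s - %(name)s - %(levelname)s - %(message)s")
)
logger.addHandler(handler)

for i in range(100):
    logger.info("Processing record %d", i)

handler.close()
print(sorted(os.listdir(log_dir)))  # ['app.log', 'app.log.1', 'app.log.2']
```

Older files beyond `backupCount` are deleted automatically, so total disk usage stays bounded no matter how long the program runs.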

More Examples

git clone https://github.com/gchandra10/python_logging_examples.git

#logging #info

Last change: 2026-04-21

[Avg. reading time: 0 minutes]

Containers

  1. CPU Architecture Fundamentals
  2. Introduction
  3. VMs or Containers
  4. What Container does
  5. Docker
  6. Docker Examples

[Avg. reading time: 8 minutes]

CPU Architecture Fundamentals

Introduction

CPU architecture defines:

  • The instruction set a processor understands
  • Register structure
  • Memory addressing model
  • Binary format

It determines what machine code can run on a processor.

If software is compiled for one architecture, it cannot run on another without translation.


Major CPU Architectures

The two architectures you are most likely to meet today:

1. amd64 (x86_64)

  • Designed by AMD, adopted by Intel
  • Dominates desktops and traditional servers
  • Common in enterprise data centers
  • Most Windows laptops
  • Intel-based Macs

Characteristics:

  • High performance
  • Higher power consumption

2. arm64 (aarch64)

  • Designed for power efficiency
  • Common in embedded systems and mobile devices
  • Raspberry Pi
  • Apple Silicon (M*)
  • Many IoT gateways

Characteristics:

  • Energy efficient
  • Dominant in IoT and edge computing

Mac/Linux

uname -m

Windows

echo %PROCESSOR_ARCHITECTURE%
systeminfo | findstr /B /C:"System Type"
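
The same check can be done from Python on any operating system. This is a small sketch: the `FAMILIES` mapping is an assumption that groups the most common spellings and is not an exhaustive list.

```python
import platform
import struct

machine = platform.machine()             # e.g. 'x86_64', 'AMD64', 'arm64', 'aarch64'
pointer_bits = struct.calcsize("P") * 8  # pointer size of this Python build

# Group the common spellings into the two families used in this chapter
FAMILIES = {
    "x86_64": "amd64", "amd64": "amd64",
    "arm64": "arm64", "aarch64": "arm64",
}
family = FAMILIES.get(machine.lower(), "other")

print(f"machine={machine} family={family} python={pointer_bits}-bit")
```

Different operating systems report the same hardware under different names, which is why the mapping is needed.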

How Programming Languages Relate to Architecture

                +----------------------+
                |     Source Code      |
                |  (C, Rust, Python)   |
                +----------+-----------+
                           |
                           v
                +----------------------+
                |     Compiler /       |
                |     Interpreter      |
                +----------+-----------+
                           |
         +-----------------+-----------------+
         |                                   |
         v                                   v
+---------------------+          +---------------------+
|  amd64 Binary       |          |  arm64 Binary       |
|  (x86_64 machine    |          |  (ARM machine       |
|   instructions)     |          |   instructions)     |
+----------+----------+          +----------+----------+
           |                                |
           v                                v
+---------------------+          +---------------------+
|  Intel / AMD CPU    |          |  ARM CPU            |
|  (Laptop, Server)   |          |  (Raspberry Pi,     |
|                     |          |   IoT Gateway)      |
+---------------------+          +---------------------+

Compiled Languages

Examples: C, C++, Rust, Go

When compiled, they produce native machine code.

Compile on a typical Intel/AMD Windows laptop - produces an amd64 binary.

Compile on Raspberry Pi or an Apple Silicon Mac - produces an arm64 binary.

That binary cannot run natively on a different architecture; it needs recompilation or a translation layer.

Interpreted Languages

Examples: Python, Node.js

Source code is architecture-neutral. Interpreter handles it.

The interpreter (Python, Node) is architecture-specific

Native extensions are architecture-specific.

Java and Bytecode

            +------------------+
            |   Java Source    |
            +--------+---------+
                     |
                     v
            +------------------+
            |    Bytecode      |
            |   (.class file)  |
            +--------+---------+
                     |
         +-----------+-----------+
         |                       |
         v                       v
+------------------+     +------------------+
| JVM (amd64)      |     | JVM (arm64)      |
+--------+---------+     +--------+---------+
         |                        |
         v                        v
   Intel CPU                ARM CPU

Java uses a different model.

Compile: javac MyApp.java

Produces: MyApp.class

This is bytecode, not native machine code.

Bytecode runs on the JVM (Java Virtual Machine).

The JVM is architecture-specific.

Same bytecode runs on amd64 JVM

Same bytecode runs on arm64 JVM

Java achieves portability through a virtual machine layer.

Cross Compilation

It is possible to cross compile for a different architecture than your current architecture.

Developer Laptop (amd64)
        |
        | build
        v
   amd64 binary
        |
        | deploy
        v
Raspberry Pi (arm64)
        |
        X  Fails (architecture mismatch)

Developer Laptop
        |
        | cross-build for arm64
        v
   arm64 binary
        |
        v
Raspberry Pi (runs successfully)

Architecture in IoT Upper Stack

| Layer | Typical Architecture |
|---|---|
| Microcontroller | ARM (32-bit or 64-bit) |
| Edge Gateway | arm64 |
| Cloud VM | amd64 or arm64 |
| Personal Machines | amd64 or arm64 |

#architecture #arm #amd

Last change: 2026-04-21

[Avg. reading time: 6 minutes]

Containers

World before containers

Physical Machines

  • 1 Physical Server
  • 1 Host Machine (say some Linux)
  • 3 Applications installed

Limitation:

  • Needs a physical server.
  • Version dependency (Host and related apps)
  • Patches “hopefully” not affecting applications.
  • All apps should work with the same Host OS.

Multiple Physical Machines

  • 3 physical servers
  • 3 Host Machines (different OSes)
  • 3 Applications installed

Limitation:

  • Needs physical server(s).
  • Version dependency (Host and related apps)
  • Patches “hopefully” not affecting applications.
  • Maintenance of 3 machines.
  • Network all three so they work together.

Virtual Machines

  • Virtual Machines emulate a real computer by virtualizing it to execute applications, running on top of a real computer.

  • To emulate a real computer, virtual machines use a Hypervisor to create a virtual computer.

  • On top of the Hypervisor sits a Guest OS: a virtualized operating system in which applications run in isolation.

  • Applications that run in Virtual Machines have access to Binaries and Libraries on top of the operating system.

( + ) Full Isolation, Full virtualization

( - ) Too many layers, Heavy-duty servers.

Containers

Containers are lightweight, portable environments that package an application with everything it needs to run—like code, runtime, libraries, and system tools—ensuring consistency across different environments. They run on the same operating system kernel and isolate applications from each other, which improves security and makes deployments easier.

  • Containers are isolated processes that share resources with their host and, unlike VMs, don’t virtualize the hardware and don’t need a Guest OS.

  • Containers share resources with other Containers in the same host.

  • This gives better performance than VMs (no separate guest OS).

  • Container Engine in place of Hypervisor.

Pros

  • Isolated Process
  • Mounted Files
  • Lightweight Process

Cons

  • Same Host OS
  • Security (weaker isolation than VMs, because the kernel is shared)

#containers #docker

Last change: 2026-04-21

[Avg. reading time: 3 minutes]

VMs or Containers

VMs are great for running multiple, isolated OS environments on a single hardware platform. They offer strong security isolation and are useful when applications need different OS versions or configurations.

Containers are lightweight and share the host OS kernel, making them faster to start and less resource-intensive. They’re perfect for microservices, CI/CD pipelines, and scalable applications.

Smart engineers focus on the right tool for the job rather than getting caught up in “better or worse” debates.

Use them in combination to make life better.

Docker: The most widely used container platform, known for its simplicity, portability, and extensive ecosystem.

Podman: A daemonless container engine that’s compatible with Docker but emphasizes security, running containers as non-root users.

We will be using Docker for this course.

#vm #container #docker

Last change: 2026-04-21

[Avg. reading time: 1 minute]

What container does

It brings to us the ability to create applications without worrying about their environment.

Works on my machine

  • Docker turns “my machine” into the machine
  • Docker is not a magic wand.
  • It only guarantees the environment is identical
  • Correctness still depends on what you build and how you run it.

#worksforme #container #docker

Last change: 2026-04-21

[Avg. reading time: 6 minutes]

Docker Basics

At a conceptual level, Docker is built around two core abstractions:

  • Images – what you build
  • Containers – what you run

Everything else in Docker exists to build, store, distribute, and execute these two artifacts.

Images

  • An image is an immutable, layered filesystem snapshot
  • Built from a Dockerfile
  • Each instruction creates a new read-only layer
  • Images are content-addressed via SHA256 digests

Image is a versioned, layered blueprint

Key properties:

  • Immutable
  • Reusable
  • Cached aggressively
  • Portable across environments

Container

A container is a running instance of an image

  • A writable layer on top of image layers
  • Namespaces for isolation (PID, USER)
  • Containers are processes, not virtual machines
  • When the main process exits, the container stops

Image vs Container

AspectImageContainer
NatureStaticDynamic
MutabilityImmutableMutable
LifecycleBuild-timeRuntime
RoleArtifactInstance

Where Do Images Come From?

Docker Hub

https://hub.docker.com/

  • Default public container registry
  • Hosts official and community images
  • Supports tags, digests, vulnerability scans
  • Docker Hub is default, not mandatory

Apart from Docker Hub, there are a few other common registries

AWS ECR

GCP Artifact Registry

Azure Container Registry

GitHub Container Registry

Private / On-Prem Registries

Harbor

JFrog Artifactory

Enterprises widely use on-prem or private registries. JFrog Artifactory is extremely common in regulated environments.

#docker #container #repositories #hub

Last change: 2026-04-21

[Avg. reading time: 17 minutes]

Docker Examples

Mac Users

Open Terminal

Windows Users

Open Git Bash

GIT Bash Menu

Is Docker Running?

docker info


  • Lists images available on the local machine
docker image ls
  • To get a specific image
docker image pull <imagename>
docker image pull python:3.12-slim
  • To inspect the downloaded image
docker image inspect python:3.12-slim

Check the architecture, exposed ports, etc.

  • Create a container
docker create \
    --name edge-http \
    -p 8000:8000 \
    python:3.12-slim \
    python -m http.server 

List the Image and container again

  • Start the container
docker start edge-http

Open a browser and check that http://localhost:8000 shows the container's internal file structure.

docker inspect edge-http
  • Shows all running containers
docker container ls
  • Shows all containers
docker container ls -a
  • Disk usage by images, containers, volumes
docker system df
  • Logs Inspection
docker logs edge-http
docker inspect edge-http
  • Stop and remove
docker stop edge-http
docker rm edge-http

docker run is a wrapper for docker pull, docker create, docker start

Deploy MySQL Database using Containers

Create the following folder

Linux / Mac

mkdir -p container/mysql
cd container/mysql

Windows

md container
cd container
md mysql
cd mysql

Note: If you already have MySQL Server installed on your machine, change the host port to 3307 as shown below.

-p 3307:3306 \

Run the container


docker run --name mysql -d \
    -p 3306:3306 \
    -e MYSQL_ROOT_PASSWORD=root-pwd \
    -e MYSQL_ROOT_HOST="%" \
    -e MYSQL_DATABASE=mydb \
    -e MYSQL_USER=remote_user \
    -e MYSQL_PASSWORD=remote_user-pwd \
    docker.io/library/mysql:8.4.4

  • -d : detached (background) mode
  • -p 3306:3306 : maps host port 3306 to MySQL's default port 3306 inside the container
  • -p 3307:3306 : maps host port 3307 to MySQL's default port 3306 inside the container
  • -e MYSQL_ROOT_HOST="%" : allows root to log in from any host, for example from MySQL Workbench on your machine

Login to MySQL Container

docker exec -it mysql bash

List all the Containers

docker container ls -a

Stop MySQL Container

docker stop mysql

Delete the container

docker rm mysql

Preserve the data for future use

Inside container/mysql

mkdir data

docker run --name mysql -d \
    -p 3306:3306 \
    -e MYSQL_ROOT_PASSWORD=root-pwd \
    -e MYSQL_ROOT_HOST="%" \
    -e MYSQL_DATABASE=mydb \
    -e MYSQL_USER=remote_user \
    -e MYSQL_PASSWORD=remote_user-pwd \
    -v ./data:/var/lib/mysql \
    docker.io/library/mysql:8.4.4

-- Create database
CREATE DATABASE IF NOT EXISTS friends_tv_show;
USE friends_tv_show;

-- Create Characters table
CREATE TABLE characters (
    character_id INT AUTO_INCREMENT PRIMARY KEY,
    first_name VARCHAR(50) NOT NULL,
    last_name VARCHAR(50) NOT NULL,
    actor_name VARCHAR(100) NOT NULL,
    date_of_birth DATE,
    occupation VARCHAR(100),
    apartment_number VARCHAR(10)
);

INSERT INTO characters (first_name, last_name, actor_name, date_of_birth, occupation, apartment_number) VALUES
('Ross', 'Geller', 'David Schwimmer', '1967-10-02', 'Paleontologist', '3B'),
('Rachel', 'Green', 'Jennifer Aniston', '1969-02-11', 'Fashion Executive', '20'),
('Chandler', 'Bing', 'Matthew Perry', '1969-08-19', 'IT Procurement Manager', '19'),
('Monica', 'Geller', 'Courteney Cox', '1964-06-15', 'Chef', '20'),
('Joey', 'Tribbiani', 'Matt LeBlanc', '1967-07-25', 'Actor', '19'),
('Phoebe', 'Buffay', 'Lisa Kudrow', '1963-07-30', 'Massage Therapist/Musician', NULL);

select * from characters;

Build your own Image


mkdir -p container
cd container

Python Example

Follow the README.md

Fork & Clone

git clone https://github.com/gchandra10/docker_mycalc_demo.git

Web App Demo

Fork & Clone

git clone https://github.com/gchandra10/docker_webapp_demo.git

Docker Compose

Docker Compose is a tool that lets you define and run multi-container Docker applications using a single YAML file.

Instead of manually running multiple docker run commands, you describe:

  • Services (containers)
  • Networks
  • Volumes
  • Environment variables
  • Dependencies between services

…all inside a docker-compose.yml file.

Sample docker-compose.yml

version: "3.9"

services:
  app:
    build: .
    ports:
      - "5000:5000"
    depends_on:
      - db

  db:
    image: postgres:15
    environment:
      POSTGRES_PASSWORD: example

Start and stop the stack:

docker compose up -d
docker compose down
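
As a rough illustration of what `depends_on` expresses, the sketch below derives a start order from the dependencies in the sample file. It is not Compose's implementation; the `depends_on` dictionary and the `start_order` helper are assumptions written for this example. Note that `depends_on` in its basic form only orders container startup; it does not wait for the database to be ready to accept connections.

```python
from graphlib import TopologicalSorter

# Service name -> services it depends on (mirrors depends_on in the sample file).
depends_on = {
    "app": {"db"},
    "db": set(),
}


def start_order(dependencies: dict[str, set[str]]) -> list[str]:
    """Return an order in which every service starts after its dependencies."""
    return list(TopologicalSorter(dependencies).static_order())


print(start_order(depends_on))  # db is started before app
```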

Use cases

  • Reproducible environments
  • Clean dev setups
  • Ideal for microservices
  • Great for IoT stacks like broker + processor + DB

Docker Compose Demo

https://github.com/gchandra10/docker-compose-mysql-python-demo

Publish Image to Docker Hub

Login to Docker Hub

  • Create a Repository “my_faker_calc”
  • Under Account Settings
    • Personal Access Token
    • Create a PAT token with Read/Write access for 1 day

Replace gchandra10 with yours.

docker login

enter userid
enter PAT token

Then build the Image with your userid

docker build -t gchandra10/my_faker_calc:1.0 .
docker image ls

Copy the ImageID of gchandra10/my_faker_calc:1.0

Tag the ImageID with necessary version and latest

docker image tag <image_id> gchandra10/my_faker_calc:latest

Push the Images to Docker Hub (version and latest)

docker push gchandra10/my_faker_calc:1.0 
docker push gchandra10/my_faker_calc:latest

Image Security

Trivy

Open Source Scanner.

https://trivy.dev/latest/getting-started/installation/

trivy image python:3.12-slim

# Focus on high risk only

trivy image --severity HIGH,CRITICAL python:3.12-slim

# Show only vulnerabilities that have a fix available
trivy image --ignore-unfixed python:3.12-slim

trivy image gchandra10/my_faker_calc

trivy image gchandra10/my_faker_calc --severity CRITICAL,HIGH --format table

trivy image gchandra10/my_faker_calc --severity CRITICAL,HIGH  --output result.txt

Grype

Open Source Scanner

grype python:3.12-slim

Common Mitigation Rules

  • Upgrade the base
    • Move to a newer version of Python if 3.12 has issues
  • Minimize OS packages
    • Check how many layers of packages are installed
  • Pin versions on libraries
    • In requirements.txt, make sure library versions are pinned so vulnerable versions are easy to detect
  • Run as non-root
    • Create a local user instead of running as root
  • Don’t share secrets
    • Don’t copy .env files or any secrets into your image, script, or application
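
The "pin versions" rule is easy to check automatically. The following is a small sketch, assuming a plain requirements.txt with one package per line; the `unpinned_requirements` helper, the regular expression, and the sample package versions are hypothetical and only meant to show the idea. Real requirement files can contain markers, hashes, and URLs that this sketch does not handle.

```python
import re

# A line counts as pinned only if it uses an exact '==' version specifier.
PINNED = re.compile(r"^[A-Za-z0-9._\[\]-]+==[^=\s]+$")


def unpinned_requirements(text: str) -> list[str]:
    """Return requirement lines that do not pin an exact version."""
    unpinned = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments and whitespace
        if not line or line.startswith("-"):  # skip blanks and pip options
            continue
        if not PINNED.match(line):
            unpinned.append(line)
    return unpinned


# Hypothetical file contents for illustration.
sample = """
faker==25.2.0
requests>=2.0
pytest
"""
print(unpinned_requirements(sample))
```

A check like this could run in CI so that an unpinned dependency fails the build before the image is scanned.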

#docker #container #dockerhub

Last change: 2026-04-21

[Avg. reading time: 0 minutes]

CICD

  1. Introduction
  2. CICD Tools
  3. CI Yaml
  4. CD Yaml

[Avg. reading time: 4 minutes]

CICD Intro

A CI/CD pipeline is a development practice focused on one core goal:

Ship high-quality features to production faster and more reliably.

Without CI/CD, every step in the software lifecycle is manual: building code, running tests, and deploying changes. This slows teams down and introduces human error.

src: https://www.freecodecamp.org/

What Happens Without CI/CD?

  • Developers manually trigger builds
  • Testing is inconsistent or delayed
  • Deployments are error-prone
  • Releases take longer and break more often

CI/CD fixes this by automating the entire flow.

Continuous Integration (CI)

  • Automatically builds and tests code whenever changes are pushed to a shared repository
  • Detects issues early before they reach production
  • Ensures new code doesn’t break existing functionality

Keep the codebase stable at all times

Continuous Delivery (CD)

  • Automatically deploys validated code to staging or testing environments
  • Production deployment is still a manual decision

Always be ready to release.

Continuous Deployment

  • Extends Continuous Delivery
  • Every successful change is automatically deployed to production

Remove manual steps and release continuously.

src 1

#cicd #ci #cd


1: www.freecodecamp.org/

Last change: 2026-04-21

[Avg. reading time: 7 minutes]

CICD Tools

There are many CI/CD tools available. They differ mainly in hosting model and integration ecosystem.

Categories of CI/CD Tools

Self-Hosted / On-Prem

  • Jenkins

  • CircleCI (can be self-hosted, though mostly SaaS now)

Use when:

  • You need full control

  • You have strict security/compliance requirements

  • You run custom infrastructure

SaaS / Web-Based

  • GitHub Actions

  • GitLab CI/CD

Use when:

  • You want quick setup

  • You want tight integration with source control

Cloud-Native Tools

  • AWS CodeBuild / CodePipeline

  • Azure DevOps

  • Google Cloud Build

Use when:

  • You’re already in that cloud

  • You need deep integration with cloud services


GitHub Actions

One of the most widely used CI/CD tools today.

  • Native integration with GitHub
  • Free tier available
  • Huge marketplace of reusable actions

Core Five Concepts

Workflows

  • Define the automation pipeline
  • Stored as YAML files in .github/workflows/
  • Think: entire pipeline definition

Jobs

  • A workflow is made up of one or more jobs
  • Jobs run independently (can be parallel)
  • Each job contains multiple steps

Steps

  • Individual tasks inside a job
  • Example: install dependencies, run tests

Events (Triggers)

Events trigger the execution of a workflow.

  • on push / pull_request
  • on schedule
  • on workflow_dispatch (Manual Trigger)

Actions

Reusable building blocks.

Example:

  • checkout repo
  • setup Python
  • deploy apps

https://github.com/features/actions

Runners

Remote computer that GitHub Actions uses to execute the jobs.

GitHub-Hosted Runners

  • ubuntu-latest
  • windows-latest
  • macos-latest

Self-Hosted Runners

  • Specific OS that GitHub does not offer.
  • Connection to a private network/environment.
  • To save costs for projects with high usage. (Enterprise plans are expensive)

YAML (YAML Ain't Markup Language; originally Yet Another Markup Language)

  • Human-readable
  • Key-value structure
  • Indentation matters

https://learnxinyminutes.com/docs/yaml/

Sample

name: CI Pipeline

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]
  workflow_dispatch:
  schedule:
    - cron: '0 0 * * *'

jobs:
  build-and-test:
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'

      - name: Install dependencies
        run: |
          pip install -r requirements.txt

      - name: Run tests
        run: pytest

DEMO

Multiple Runners Demo

https://github.com/gchandra10/github-actions-multiple-runners-demo
https://github.com/gchandra10/python_cicd_calculator

#cicd #calc #githubactions

Last change: 2026-04-21

[Avg. reading time: 4 minutes]

CI YAML

CI/CD is not just a tool or a YAML file. It is a system of interconnected components working together to ensure code quality.

checkout → install → test → notify

  • CI is not about automation alone.
  • It is about reducing the time between writing code and discovering problems.
  • Faster feedback = better code quality

name: Build and Test

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]
  workflow_dispatch:

permissions:
  contents: read

concurrency:
  group: ci-${{ github.ref }}
  cancel-in-progress: true

jobs:
  test:
    name: Test Calculator App
    runs-on: ubuntu-latest
    timeout-minutes: 10

    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Set up Python environment
        uses: actions/setup-python@v6
        with:
          python-version: "3.11"
          cache: pip

      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install -r requirements.txt

      - name: Run tests
        run: |
          python -m unittest test_calc.py -v

      - name: Send Discord failure notification
        if: failure()
        env:
          DISCORD_WEBHOOK: ${{ secrets.DISCORD_WEBHOOK }}
        uses: Ilshidur/action-discord@0.4.0
        with:
          args: >
            @here The Calculator App integration test failed for
            ${{ github.repository }}.
            Check run ${{ github.run_id }} on GitHub for details.

      - name: Send Discord success notification
        if: success()
        env:
          DISCORD_WEBHOOK: ${{ secrets.DISCORD_WEBHOOK }}
        uses: Ilshidur/action-discord@0.4.0
        with:
          args: >
            The Calculator App for ${{ github.repository }}
            passed successfully.
            Run ID: ${{ github.run_id }}
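
The "Run tests" step above executes test_calc.py with unittest. The demo repository's actual file is not shown here, so the following is only a sketch of what such a file might look like; the `add` and `divide` functions and the test names are hypothetical stand-ins for the real calculator code.

```python
import unittest


# Hypothetical calculator functions; a real project would import these from its own module.
def add(a, b):
    return a + b


def divide(a, b):
    if b == 0:
        raise ValueError("Cannot divide by zero")
    return a / b


class TestCalc(unittest.TestCase):
    def test_add(self):
        self.assertEqual(add(2, 3), 5)

    def test_divide(self):
        self.assertEqual(divide(10, 4), 2.5)

    def test_divide_by_zero(self):
        # The failure path is tested too, not only the happy path.
        with self.assertRaises(ValueError):
            divide(1, 0)
```

Saved as test_calc.py, it can be run locally with the same command the workflow uses: python -m unittest test_calc.py -v. If any test fails, the command exits with a non-zero status, which is what makes the `if: failure()` notification step fire.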

#github #githubactions #yaml

Last change: 2026-04-21

[Avg. reading time: 2 minutes]

CD Yaml

    - name: Deploy to Server
      if: success()
      uses: appleboy/ssh-action@master
      with:
        host: ${{ secrets.SERVER_HOST }}
        username: ${{ secrets.SERVER_USER }}
        key: ${{ secrets.SSH_PRIVATE_KEY }}
        port: 22 # Optional if your SSH server uses a different port
        script: |
          cd /path/to/your/project
          git pull
          # Any other deployment or restart service commands

    - name: Configure AWS Credentials
      uses: aws-actions/configure-aws-credentials@v1
      with:
        aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
        aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        aws-region: your-aws-region

    - name: Deploy to AWS Lambda
      run: |
        # Package your application
        zip -r package.zip .
        # Deploy/update your Lambda function
        aws lambda update-function-code --function-name your-lambda-function-name --zip-file fileb://package.zip

#yaml #cd

Last change: 2026-04-21

[Avg. reading time: 1 minute]

Data Engineering

  1. Introduction
  2. Batch vs Streaming
  3. Medallion Architecture
  4. Data Quality Checks
  5. Data Engineering Model
  6. Quality & Governance
  7. Data Mesh
  8. KAFKA
    1. Introduction
    2. Use Cases
    3. Kafka Software
    4. Python Scripts
    5. Different types of streaming

[Avg. reading time: 2 minutes]

Introduction to Data Engineering

Data Engineering is not about dashboards or ML hype. It’s about building systems that move and shape data reliably at scale.

At its core, data engineering answers three questions:

  • How does data enter the system
  • How does it change as it moves
  • How do we trust it when it’s used

Everything else is implementation detail.

Data comes from multiple sources:

  • APIs
  • Files (CSV, JSON, Parquet)
  • Databases
  • Streams

The real challenge is not loading data. It’s handling reality:

  • Millions of records
  • Partial failures
  • Schema changes
  • Late-arriving data
  • Duplicate data

#dataengineering #pipeline

Last change: 2026-04-21

[Avg. reading time: 5 minutes]

Batch - Streaming - Microbatch

Batch Processing

Batch means collect first, process later.

  • Works on large chunks of accumulated data
  • High throughput, cheaper, simpler
  • Results are not real-time
  • Typically minutes, hours, or days delayed

Examples:

  • Daily or weekly sales reports
  • End-of-day stock portfolio reconciliation
  • Monthly billing cycles
  • ETL pipelines that refresh a data warehouse

Use cases

  • Immediate action is not required
  • Delay is acceptable
  • Working with large historical datasets

Stream Processing

Streaming means process events the moment they arrive.

  • Low-latency (milliseconds to seconds)
  • Continuous, event-by-event processing
  • Ideal for real-time analytics and alerting
  • Stateful systems maintain event history or running context

Examples:

  • Stock price updates
  • Fraud detection for credit cards
  • Real-time gaming leaderboards
  • IoT sensor monitoring

Use cases

  • You need instant reactions
  • Delays cause risk, loss, or bad UX

Micro Batch

Micro-batching = small batches processed very frequently.

  • Latency: ~0.5 to a few seconds
  • Not true real-time, but close
  • Simpler than full streaming
  • Common in systems like Spark Structured Streaming

batch pretending to be streaming


Examples

Fraud Detection (Streaming)

  • Decision must be immediate
  • Millisecond latency required
  • Delay = financial loss

Payment Posting (Micro-Batch)

  • Small delay is acceptable
  • Updates can lag slightly
  • No immediate risk

Monthly Statements (Batch)

  • No urgency
  • Process large volumes at once
  • Cost-efficient

STREAMING     > Event > Process > Output      (ms latency)
MICRO-BATCH   > Small windows > Process      (seconds)
BATCH         > Accumulate > Process         (minutes+)
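
The three modes can be contrasted on the same input with a few lines of Python. This is a toy sketch, not how Spark or Kafka implement them: the `micro_batches` helper and the seven-event stream are assumptions for illustration, and batches are cut by count here, whereas real systems usually cut them by time window.

```python
from itertools import islice
from typing import Iterable, Iterator


def micro_batches(events: Iterable[int], batch_size: int) -> Iterator[list[int]]:
    """Group an event stream into small fixed-size batches."""
    iterator = iter(events)
    while batch := list(islice(iterator, batch_size)):
        yield batch


events = range(1, 8)  # hypothetical stream of 7 events

# Streaming: handle each event as it arrives.
streaming_results = [e * 2 for e in events]

# Micro-batch: handle small groups of events at a time.
micro_results = [sum(b) for b in micro_batches(events, batch_size=3)]

# Batch: accumulate everything, then process once.
batch_result = sum(events)

print(streaming_results)
print(micro_results)
print(batch_result)
```

The work done is the same in each case; what changes is how long an event waits before it is processed.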

#batch #streaming #kafka #realtime

Last change: 2026-04-21

[Avg. reading time: 2 minutes]

Medallion Architecture

This is also called Multi-Hop architecture.

Bronze Layer (Raw Data)

  • Append-only ingestion
  • No business logic
  • Schema minimally enforced
  • Supports replay / backfill

Silver Layer (Cleansed and Conformed Data)

  • Deduplication
  • Joins / normalization
  • Schema enforcement
  • Basic data quality checks

Gold Layer (Curated Business-level tables)

  • Business logic
  • Aggregations
  • KPI tables
  • Semantic-ready datasets

                 (Many Inputs)
     Kafka     APIs     Files     DBs     Streams
        \        |        |        |        /
         \       |        |        |       /
          \      |        |        |      /
           \     |        |        |     /
            ▼    ▼        ▼        ▼    ▼

           ╔══════════════════════╗
           ║       BRONZE         ║   ← Wide ingest funnel
           ║  (Raw / Append-only) ║
           ╚══════════════════════╝
                     │
                     │  (filter, dedupe, schema fix)
                     ▼
           ╔══════════════════════╗
           ║       SILVER         ║   ← Compression layer
           ║ (Clean / Conformed)  ║
           ╚══════════════════════╝
                     │
                     │  (business logic, joins)
                     ▼
           ╔══════════════════════╗
           ║        GOLD          ║   ← High-value core
           ║ (Aggregated / KPI)   ║
           ╚══════════════════════╝
                     │
        ┌────────────┼────────────┬────────────┬────────────┐
        ▼            ▼            ▼            ▼            ▼

   BI / SQL     ML Features     APIs      Reverse ETL    Real-time Apps
 (Dashboards)   (Feature Store) (Serving)  (Salesforce)   (Streams)
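
The flow in the diagram can be sketched in plain Python. This is only an illustration of the layer responsibilities, not a production pipeline: the sensor records, field names, and the `to_silver` / `to_gold` helpers are all made up for this example, and a real implementation would typically use Spark or SQL tables rather than in-memory lists.

```python
# Bronze: raw records appended as received (hypothetical sensor readings).
bronze = [
    {"id": 1, "city": "nyc ", "temp_f": "71.6"},
    {"id": 1, "city": "nyc ", "temp_f": "71.6"},   # duplicate delivery
    {"id": 2, "city": "BOS", "temp_f": "bad"},     # malformed value
    {"id": 3, "city": "nyc", "temp_f": "75.2"},
]


def to_silver(rows):
    """Deduplicate by id, enforce types, and normalize values."""
    seen, silver = set(), []
    for row in rows:
        if row["id"] in seen:
            continue  # deduplication
        try:
            temp = float(row["temp_f"])
        except ValueError:
            continue  # drop rows that fail schema enforcement
        seen.add(row["id"])
        silver.append({"id": row["id"], "city": row["city"].strip().upper(), "temp_f": temp})
    return silver


def to_gold(rows):
    """Aggregate to a business-level KPI: average temperature per city."""
    totals = {}
    for row in rows:
        total, count = totals.get(row["city"], (0.0, 0))
        totals[row["city"]] = (total + row["temp_f"], count + 1)
    return {city: round(total / count, 1) for city, (total, count) in totals.items()}


silver = to_silver(bronze)
gold = to_gold(silver)
print(silver)
print(gold)
```

Bronze is never modified, so the Silver and Gold steps can be re-run (replayed) at any time if the cleaning rules change.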

Different Personas involved

  • Data Engineer
  • Data Analysts
  • Data Scientists

#medallion #bronze #silver #gold

Last change: 2026-04-21

[Avg. reading time: 1 minute]

Data Quality Checks

Google Colab - Medallion Demo

#dataquality #validation #schemadrift

Last change: 2026-04-21

[Avg. reading time: 3 minutes]

Data Engineering Model

Sequence Model

Source > Process > Sink

This is the simplest and most common pattern.

  • Data flows in a straight line
  • Each step transforms the data
  • Typically implemented as Bronze → Silver → Gold

Where it fits

  • ETL pipelines
  • Batch processing
  • Data cleaning and enrichment

Example

Raw logs > cleaned logs > aggregated reports

Funnel Model

Multiple Sources > Process > Single Sink

Here, multiple inputs are combined into one destination.

  • Data from different systems is merged
  • Requires schema alignment and joins
  • Often introduces data quality challenges

Where it fits

  • Data warehouse ingestion
  • Building unified datasets
  • Customer 360 views

Example

CRM + Transactions + Web logs → Unified customer table

Fan-Out (Star) Model

Single Source > Process > Multiple Sinks

One dataset feeds multiple downstream consumers.

  • Same data used in different ways
  • Different outputs for different use cases
  • Requires careful data contracts

Where it fits

  • Serving layer
  • Analytics + ML + APIs from same data
  • Reverse ETL

Example

Gold table > BI dashboards + ML models + APIs

#funnel #starmodel #sequence

Last change: 2026-04-21

[Avg. reading time: 5 minutes]

Data Quality & Governance

Data Quality is simple

Can you trust this data to make a decision?

If not, it’s useless.

What matters

  • Accuracy : Is it correct?
  • Completeness : Is anything important missing?
  • Consistency : Does it match across systems?
  • Timeliness : Is it fresh or stale?
  • Relevance : Do we even need this data?

How you improve it (practical, not theory)

  • Profile data : find issues early
  • Validate at entry : stop bad data upfront
  • Clean regularly : fix what slipped through
  • Track metrics : monitor trends over time
  • Standardize core data (MDM) : one version of truth
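
Two of the dimensions above, completeness and timeliness, are straightforward to turn into metrics you can track. The sketch below is one possible way to do it; the `completeness` and `is_fresh` helpers, the sample rows, and the one-day freshness threshold are assumptions chosen for illustration.

```python
from datetime import datetime, timedelta, timezone


def completeness(rows, required):
    """Fraction of rows in which every required field is present and non-empty."""
    if not rows:
        return 0.0
    ok = sum(all(r.get(f) not in (None, "") for f in required) for r in rows)
    return ok / len(rows)


def is_fresh(last_updated, now, max_age):
    """Timeliness check: True when the data is no older than max_age."""
    return now - last_updated <= max_age


# Hypothetical customer records.
rows = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": ""},
    {"id": 3, "email": "c@example.com"},
    {"id": 4, "email": None},
]
now = datetime(2026, 1, 2, tzinfo=timezone.utc)

print(completeness(rows, ["id", "email"]))
print(is_fresh(datetime(2026, 1, 1, 12, tzinfo=timezone.utc), now, timedelta(days=1)))
```

Recording numbers like these on every pipeline run is what "track metrics" means in practice: a sudden drop in completeness is visible long before someone notices a wrong dashboard.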

Data Governance (Who controls the data)

Data governance is not a document.
It’s control.

Who owns data, who can use it, and how it’s protected.


What it includes

  • Policies : rules for storing and sharing data
  • Ownership : someone accountable (data stewards)
  • Security : who can access what
  • Compliance : laws you cannot ignore
  • Metadata : context (where data came from, how to use it)

Laws you can’t ignore

You don’t need to memorize all of them.
Just understand the pattern: protect user data or pay heavily.

  • GDPR (EU) : strictest, global impact
  • CCPA (California) : consumer rights
  • HIPAA (US) : healthcare data

GDPR (the one everyone cares about)

  • Consent : you must ask clearly
  • Access : users can see their data
  • Delete : users can ask to remove it
  • Portability : users can take their data
  • Breach reporting : within 72 hours
  • Fines : up to €20 million or 4% of global annual revenue, whichever is higher

Summary

  • Data Quality = Is the data good?
  • Data Governance = Are we allowed to use it?

#governance #gdpr

Last change: 2026-04-21

[Avg. reading time: 7 minutes]

Data Mesh

What it is

Data Mesh is not a tool.

It’s a way to organize data ownership in large organizations.

Instead of one central data team owning everything,
each domain owns its own data.

  • Finance owns finance data
  • Sales owns sales data
  • Marketing owns marketing data

Why it exists

Centralized data platforms don’t scale well.

Problems you see:

  • One team becomes a bottleneck
  • No clear ownership
  • Slow delivery
  • Constant dependency on data engineers

Data Mesh tries to fix this.


Before vs After

Monolithic Data Platform

  • Central team owns everything
  • Pipelines become complex and slow
  • No clear ownership
  • Everyone depends on one team

Data Mesh

  • Data is split by domain
  • Each team owns their pipelines
  • Faster development
  • Clear accountability

4 Core Principles

1

1. Domain Ownership

Each domain team owns:

  • Data
  • Pipelines
  • Quality

You build it, you own it


2. Data as a Product

Data is not just tables.

It must be:

  • Discoverable
  • Reliable
  • Documented
  • Easy to use

If nobody can use your data, it’s useless


3. Self-Service Platform

Central team still exists.

But they provide:

  • Infrastructure
  • Tools
  • Standards

Platform team builds the road, domains drive on it


4. Federated Governance

  • Global rules (security, compliance)
  • Local ownership (domains decide implementation)

Balance control and flexibility

Data Products

A data product is more than a dataset.

It includes:

  • Data
  • Metadata
  • Documentation
  • Code / pipelines
  • SLAs

Treat data like a product, not a byproduct


Where Data Mesh works well

  • Large organizations
  • Many independent teams
  • High data ownership conflicts

Where it fails

  • Small teams (overkill)
  • Weak engineering culture
  • No governance discipline
  • No platform team

Challenges

  • Cross-domain joins become painful
  • Standards drift across teams
  • Requires strong ownership mindset
  • More roles - more coordination
  • Expensive to implement

Data Mesh is rarely implemented fully.

What most companies do

Most companies use a hybrid model:

  • Central platform (Example: Fabric, Databricks, Snowflake)
  • Domain ownership (partial Mesh)

To get access to lots of external data:

AWS Data Exchange

#datamesh #domainownership #selfservice


1: https://www.dremio.com/resources/guides/what-is-a-data-mesh/

Last change: 2026-04-21

[Avg. reading time: 1 minute]

KAFKA

[Avg. reading time: 11 minutes]

Apache Kafka — Introduction

What Problem Does Kafka Solve?

When systems need to handle millions of events per second reliably, traditional messaging systems start failing.

  • Data loss
  • Poor scalability
  • No easy replay of events

Kafka is built to solve these problems.


What is Kafka?

Apache Kafka is a distributed event streaming platform designed for:

  • High throughput
  • Fault tolerance
  • Real-time data pipelines

At its core, Kafka is:

  • A distributed commit log
  • A publish-subscribe system
  • A replayable event store

Key Characteristics

  • High Throughput → Millions of messages per second
  • Scalable → Horizontally scalable across brokers
  • Fault-Tolerant → Data replication across servers
  • Durable → Messages persisted and replayable

How Kafka Works

  1. Producer sends a message
  2. Kafka assigns it to a partition
  3. Message gets an offset
  4. Stored in a broker
  5. Consumers read using offsets

Basic Terms

1. Producer

A producer sends data to Kafka.

  • Publishes messages to topics
  • Can:
    • Send to a specific partition
    • Let Kafka decide

Partitioning logic:

  • With key → hash(key) % partitions
  • Without key → spread across partitions (round-robin in older clients; sticky batching by default since Kafka 2.4)
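
The partitioning logic can be sketched in a few lines. This is a simplified model, not Kafka's code: the `SketchPartitioner` class is hypothetical, `zlib.crc32` is a stand-in for the murmur2 hash that Kafka's default partitioner uses, and keyless messages are rotated with plain round-robin.

```python
import zlib


class SketchPartitioner:
    """Toy partitioner: hash keyed messages, rotate keyless ones."""

    def __init__(self, num_partitions):
        self.num_partitions = num_partitions
        self._next = 0  # round-robin cursor for keyless messages

    def partition(self, key=None):
        if key is not None:
            # crc32 is a stand-in hash; the rule is hash(key) % partitions.
            return zlib.crc32(key.encode("utf-8")) % self.num_partitions
        chosen = self._next
        self._next = (self._next + 1) % self.num_partitions
        return chosen


p = SketchPartitioner(num_partitions=3)
print(p.partition("user-42") == p.partition("user-42"))  # same key, same partition
print([p.partition() for _ in range(4)])                   # keyless: 0, 1, 2, 0
```

One consequence worth noticing: because the result depends on the number of partitions, adding partitions to a topic later changes where existing keys land.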

2. Topic

A topic is a logical stream where messages are stored.

  • Similar to a table or data stream
  • Supports multiple consumers
  • Append-only (no updates/deletes)

3. Message (Record)

A message is the basic unit of data in Kafka.

Structure:

  • Key (optional) → partitioning
  • Value → actual data
  • Timestamp
  • Headers (optional)

Messages are immutable.


4. Key

The key determines how messages are distributed.

  • Same key → same partition
  • Maintains ordering per key

If no key:

  • Kafka spreads messages across partitions (round-robin, or sticky batching in clients since Kafka 2.4)

5. Partition

A partition is a subset of a topic.

  • Enables parallelism and scalability
  • Append-only and ordered

Important:

  • Each message has an offset
  • Ordering is guaranteed only within a partition
  • No global ordering across topic

6. Broker

A broker is a Kafka server.

Responsibilities:

  • Receives messages
  • Stores partitions
  • Serves consumers

7. Consumer

A consumer reads messages from topics.

  • Pull-based model
  • Reads using offsets
  • Can replay data

8. Consumer Group

A consumer group is a set of consumers working together.

  • Each partition → only ONE consumer in group
  • Enables parallel processing

Rebalancing:

  • Happens when consumers join/leave
  • Kafka redistributes partitions

9. Offset

An offset is a unique ID for messages in a partition.

  • Starts from 0
  • Incremental and immutable

Types:

  • Current Offset → next to read
  • Committed Offset → last saved

Kafka stores offsets in: __consumer_offsets


10. Batches

A batch is a group of messages sent together.

Benefits:

  • Better network usage
  • Compression
  • Faster I/O

Trade-off:

  • Larger batch → higher latency
  • Smaller batch → lower latency

Brokers, Cluster, and Replication

Broker

  • Single Kafka server
  • Stores partitions

Cluster

  • Multiple brokers working together
  • Provides scalability and fault tolerance

Replication

  • Partitions are replicated across brokers
  • Ensures durability and availability

Message Delivery Semantics

Kafka supports three delivery guarantees:

1. At Most Once

  • No duplicates
  • Possible data loss

2. At Least Once (Default)

  • No data loss
  • Possible duplicates

3. Exactly Once

  • No duplicates
  • No data loss
  • Higher overhead

  • At Most Once → Fast but risky
  • At Least Once → Safe but duplicates
  • Exactly Once → Correct but expensive
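
The difference between at-most-once and at-least-once comes down to when the consumer commits its offset relative to processing. The simulation below is a toy model with no Kafka involved; the `consume` function, its parameters, and the single simulated crash are assumptions for illustration. Exactly-once is not modelled here, since it additionally relies on idempotent or transactional writes.

```python
def consume(messages, commit_first, crash_at):
    """Process messages with one simulated crash, then restart from the committed offset.

    commit_first=True  -> commit before processing (at-most-once)
    commit_first=False -> commit after processing  (at-least-once)
    """
    processed = []
    committed = 0      # offset to resume from after a restart
    crashed = False
    offset = 0
    while offset < len(messages):
        if commit_first:
            committed = offset + 1                  # commit, then try to process
        if offset == crash_at and not crashed:
            if not commit_first:
                processed.append(messages[offset])  # work done, but commit is lost
            crashed = True
            offset = committed                      # restart from last committed offset
            continue
        processed.append(messages[offset])
        if not commit_first:
            committed = offset + 1                  # process, then commit
        offset += 1
    return processed


msgs = ["a", "b", "c"]
print(consume(msgs, commit_first=True, crash_at=1))   # "b" is lost
print(consume(msgs, commit_first=False, crash_at=1))  # "b" is processed twice
```

This is why at-least-once consumers are usually paired with idempotent processing: a duplicate is harmless if handling the same message twice has the same effect as handling it once.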

Commit Strategies

  • Auto Commit

    • Automatic at intervals
  • Manual Commit

    • Controlled by consumer
    • More reliable

Real-World Use Cases

  • Log aggregation
  • Event-driven microservices
  • Real-time analytics
  • Fraud detection
  • User activity tracking

Summary

Kafka is not just a message queue.

It is a:

  • Distributed log
  • Streaming backbone
  • Real-time data platform

Use Kafka when:

  • Scale matters
  • Reliability matters
  • Real-time processing matters

#kafka #realtime

Last change: 2026-04-21

[Avg. reading time: 3 minutes]

Kafka Use Cases

Data Streaming

Kafka can stream data in real time from various sources, such as sensors, applications, and databases. This data can then be processed and analyzed in real-time or stored for later analysis.

Log Aggregation

Kafka can be used to aggregate logs from various sources. This can help improve system logs’ visibility and facilitate troubleshooting.

Message Queuing

Kafka can act as a message queue that decouples applications and services. This can help to improve the scalability and performance of applications.

Web Activity Tracking

Kafka can track web activity in real-time. This data can then be used to analyze user behavior and improve the user experience.

Data replication

Kafka can be used to replicate data between different systems. This can help to ensure that data is always available and that it is consistent across systems.

#kafka #usecases

Last change: 2026-04-21

[Avg. reading time: 10 minutes]

Kafka Software

  1. Free Trial for 30 days (Cloud)

  2. Using Docker

Step 1

mkdir kafka-demo
cd kafka-demo

Step 2

Create a new file named docker-compose.yml

services:
  kafka:
    image: docker.io/apache/kafka
    container_name: kafka
    ports:
      - "9092:9092"
      - "9093:9093"
    environment:
      # The apache/kafka image maps KAFKA_<NAME> variables to broker properties
      - KAFKA_NODE_ID=1
      - KAFKA_PROCESS_ROLES=broker,controller
      - KAFKA_CONTROLLER_QUORUM_VOTERS=1@kafka:9093
      - KAFKA_CONTROLLER_LISTENER_NAMES=CONTROLLER
      - KAFKA_LISTENERS=PLAINTEXT://:9092,CONTROLLER://:9093
      - KAFKA_ADVERTISED_LISTENERS=PLAINTEXT://localhost:9092
      - KAFKA_LISTENER_SECURITY_PROTOCOL_MAP=PLAINTEXT:PLAINTEXT,CONTROLLER:PLAINTEXT
      - KAFKA_AUTO_CREATE_TOPICS_ENABLE=true
      - KAFKA_NUM_PARTITIONS=3
      - KAFKA_DEFAULT_REPLICATION_FACTOR=1
      # Single broker: the offsets topic cannot use the default replication factor of 3
      - KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR=1
    volumes:
      - kafka_data:/bitnami/kafka

volumes:
  kafka_data:
    driver: local

Step 3

docker compose up -d

Step 4

Verification

docker container ls

Check the logs

docker logs kafka

Step 5: Create a new Kafka Topic

docker exec -it kafka /opt/kafka/bin/kafka-topics.sh \
  --create \
  --topic gctopic \
  --bootstrap-server localhost:9092 \
  --partitions 3 \
  --replication-factor 1

Step 6: Producer

docker exec -it kafka /opt/kafka/bin/kafka-console-producer.sh \
--topic gctopic \
--bootstrap-server localhost:9092 \
--property "parse.key=true" \
--property "key.separator=:"

Step 7: Consumer (Terminal 1)

docker exec -it kafka /opt/kafka/bin/kafka-console-consumer.sh \
  --topic gctopic \
  --bootstrap-server localhost:9092 \
  --group 123 \
  --property print.partition=true \
  --property print.key=true \
  --property print.timestamp=true \
  --property print.offset=true

Consumer (Terminal 2)

docker exec -it kafka /opt/kafka/bin/kafka-console-consumer.sh \
  --topic gctopic \
  --bootstrap-server localhost:9092 \
  --group 123 \
  --property print.partition=true \
  --property print.key=true \
  --property print.timestamp=true \
  --property print.offset=true

Consumer (Terminal 3)

docker exec -it kafka /opt/kafka/bin/kafka-console-consumer.sh \
  --topic gctopic \
  --bootstrap-server localhost:9092 \
  --group 123 \
  --property print.partition=true \
  --property print.key=true \
  --property print.timestamp=true \
  --property print.offset=true

Consumer (Terminal 4)

This “new group” will receive all the messages published across partitions.

docker exec -it kafka /opt/kafka/bin/kafka-console-consumer.sh \
  --topic gctopic \
  --bootstrap-server localhost:9092 \
  --group 456 \
  --property print.partition=true \
  --property print.key=true \
  --property print.timestamp=true \
  --property print.offset=true

Kafka messages can be produced and consumed in many ways.

  • Java
  • Python
  • Go
  • CLI
  • REST API
  • Spark

and so on..

Similar tools

Amazon Kinesis

A cloud-based service from AWS for real-time data processing over large, distributed data streams. Kinesis is often compared to Kafka but is managed, making it easier to set up and operate at scale. It’s tightly integrated with the AWS ecosystem.

Microsoft Event Hubs

A highly scalable data streaming platform and event ingestion service, part of the Azure ecosystem. It can receive and process millions of events per second, making it suitable for big data scenarios.

Google Pub/Sub

A scalable, managed, real-time messaging service that allows messages to be exchanged between applications. Like Kinesis, it’s a cloud-native solution that offers durable message storage and real-time message delivery without the need to manage the underlying infrastructure.

RabbitMQ

A popular open-source message broker that supports multiple messaging protocols. It is designed for scenarios requiring complex routing, message queuing, and delivery confirmations. It is known for its simplicity and ease of use, but it is traditionally better suited to message queuing than to log streaming.

#kafka #softwares #kinesis #pubsub

[Avg. reading time: 0 minutes]

Python Scripts

Steps

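Fork and clone the repository, move into the project folder, and install the dependencies. If you forked the repository, clone your fork's URL instead.

git clone https://github.com/gchandra10/python_kafka_3_node_demos.git
cd python_kafka_3_node_demos
uv sync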

#python #kafka

[Avg. reading time: 5 minutes]

Types of Streaming

Stateless Streaming

  • Processes each record independently
  • No memory of previous events
  • Simple transformations and filtering
  • Highly scalable

Examples of Stateless

  • Unit conversion (Celsius to Fahrenheit) for each reading
  • Data validation (checking if temperature is within realistic range)
  • Simple transformations (rounding values)
  • Filtering (removing invalid readings)
  • Basic alerting (if current temperature exceeds threshold)
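
A minimal sketch of stateless processing, assuming temperature readings in Celsius. The valid range and alert threshold are hypothetical values chosen for illustration. Each call looks only at the reading passed in, so readings can be processed in any order and on any worker.

```python
def celsius_to_fahrenheit(celsius):
    return celsius * 9 / 5 + 32


def is_valid(celsius, low=-90.0, high=60.0):
    # Hypothetical "realistic" range used only for this example.
    return low <= celsius <= high


def process(reading_c, alert_threshold_c=40.0):
    """Handle one reading using only that reading (no stored state)."""
    if not is_valid(reading_c):
        return None  # filtering: drop invalid readings
    return {
        "fahrenheit": round(celsius_to_fahrenheit(reading_c), 1),
        "alert": reading_c > alert_threshold_c,
    }


readings = [21.5, 999.0, 41.0]
results = [r for r in map(process, readings) if r is not None]
print(results)
```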

Use when:

  • You only need to process current readings
  • Simple transformations are sufficient
  • Horizontal scaling is important
  • Memory resources are limited

Stateful Streaming

  • Maintains state across events
  • Enables complex processing like windowing and aggregations
  • Requires state management strategies
  • Good for pattern detection and trend analysis

Examples of Stateful

  • Calculating moving averages of temperature
  • Detecting temperature trends over time
  • Computing daily min/max temperatures
  • Identifying temperature patterns
  • Calculating rate of temperature change
  • Detecting anomalies based on historical patterns
  • Flagging unusual or suspicious financial activity
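
A minimal sketch of stateful processing: a moving average over the last few temperature readings. The class name and window size are illustrative. Unlike a stateless transform, the result for each reading depends on the readings kept from earlier calls.

```python
from collections import deque


class MovingAverage:
    """Keeps the last `window_size` values as state between events."""

    def __init__(self, window_size):
        self.window = deque(maxlen=window_size)  # oldest value drops out automatically

    def update(self, value):
        self.window.append(value)
        return sum(self.window) / len(self.window)


avg = MovingAverage(window_size=3)
outputs = [avg.update(t) for t in [20.0, 22.0, 24.0, 26.0]]
print(outputs)
```

In a real streaming job this state has to survive restarts and be partitioned across workers, which is why stateful streaming needs a state management strategy.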

Use when:

  • You need historical context
  • You are analyzing patterns or trends
  • You are computing moving averages
  • You need to detect anomalies
  • Time-window based analysis is required

Different Ingestion Services

Stream Processing Frameworks:

Structured Streaming (Apache Spark)

  • A processing framework for handling streaming data
  • Part of the Apache Spark ecosystem

Message Brokers/Event Streaming Platforms:

Apache Kafka (Open Source)

  • Distributed event streaming platform
  • Self-managed

Amazon MSK

  • Managed Kafka service
  • AWS managed version of Kafka

Amazon Kinesis

  • AWS native streaming service
  • Different from Kafka-based solutions

Azure Event Hubs

  • Cloud-native event streaming service
  • Azure’s equivalent to Kafka

#kinesis #stateful #stateless #eventhubs

[Avg. reading time: 1 minute]

Cloud Computing

  1. Introduction
  2. Types of Cloud Services
  3. Challenges
  4. Multi-Cloud
  5. High Availability
  6. Disaster Recovery (DR)
  7. RTO - RPO
  8. Cloud Native vs Lift-and-Shift
  9. Azure Cloud
    1. Services
    2. Azure Access
    3. IaC
    4. Idempotency
    5. Terraform

[Avg. reading time: 6 minutes]

Introduction to Cloud Computing

Definitions

  • Hardware: Physical computing components such as servers, storage devices, and networking equipment
  • Software: Programs and systems such as operating systems, Microsoft Word, and Excel
  • Website: Read-only web content (e.g., company pages, portfolios, news sites)
  • Web Application: Interactive, read-write platforms (e.g., Google Docs, email, online forms)

What is Cloud Computing?

Cloud computing plays a critical role in the Big Data ecosystem.

Modern organizations deal with continuously growing data in terms of size, speed, and complexity. Cloud enables them to handle this efficiently without owning physical infrastructure.

  • Cloud Computing: On-demand delivery of IT resources over the internet with a pay-as-you-go pricing model

Key Perspective

Cloud is often misunderstood because different teams interact with different layers:

  • Compute (VMs, containers)
  • Storage (object, block)
  • Networking
  • IAM (Identity and Access Management)
  • Managed services

Each team sees only a slice and assumes that is the cloud.

  • Cloud is not just servers or storage
  • It is an abstraction layer over distributed systems

Shared responsibility is the core operating principle of cloud computing


Big Data Characteristics (6 V’s)

  • Volume: Amount of data
  • Velocity: Speed of data generation and processing
  • Variety: Different data types (structured, semi-structured, unstructured)
  • Veracity: Data quality and reliability
  • Value: Business usefulness
  • Vulnerability: Security and privacy risks

Cloud platforms help manage all these dimensions in an integrated way.


Why Cloud for Big Data?

  • Cost savings (no upfront infrastructure)
  • Scalability and flexibility
  • High availability and reliability
  • Built-in security controls
  • Faster insights using managed analytics tools
  • Collaboration across distributed teams
  • Disaster recovery and backup
  • Automatic updates and maintenance

Types of Cloud Computing

Public Cloud

  • Owned and operated by third-party providers
  • Examples: AWS, Azure, GCP

Private Cloud

  • Dedicated infrastructure for a single organization
  • Greater control, higher cost

Hybrid Cloud

  • Combination of public and private cloud
  • Enables workload portability and better control over sensitive data

#overview #cloud #azure

[Avg. reading time: 15 minutes]

Types of Cloud Services

SaaS: Software as a Service

SaaS delivers ready-to-use software applications over the internet. Users do not manage the infrastructure, platform, or most application settings.

Examples:

  • Google Workspace
  • Dropbox
  • Slack
  • Salesforce

Key Characteristics

  • Accessed through a web browser or thin client
  • Managed centrally by the provider
  • Usually follows a multi-tenant model
  • Updates and patches are handled by the provider
  • Minimal setup and maintenance for users

When Not to Use SaaS

  • Limited or unreliable internet access
  • Mission-critical workloads with very low downtime tolerance
  • Applications requiring deep customization
  • Tight integration with specialized on-premise hardware
  • Strict data residency or regulatory constraints
  • Performance-sensitive workloads that depend on local execution

PaaS: Platform as a Service

PaaS provides a managed environment for building, deploying, and running applications without requiring users to manage the underlying infrastructure.

Examples:

  • Heroku
  • Streamlit Community Cloud
  • PythonAnywhere

Key Characteristics

  • Developers focus on application code, not infrastructure
  • Built-in support for deployment, scaling, and monitoring
  • Provider manages runtime, middleware, patches, and much of the operations work
  • Speeds up development and release cycles
  • Often integrates well with CI/CD pipelines

When Not to Use PaaS

  • Risk of vendor lock-in
  • Limited control over infrastructure and runtime configuration
  • Specialized compliance or security requirements
  • Need for unsupported languages, frameworks, or custom system dependencies
  • Performance-sensitive workloads needing low-level tuning
  • Applications tightly coupled with legacy systems or custom middleware

IaaS: Infrastructure as a Service

IaaS provides virtualized compute, storage, and networking resources over the internet. Users manage the operating systems, middleware, and applications, while the provider manages the physical hardware.

Examples:

  • Amazon EC2
  • Google Compute Engine
  • Microsoft Azure Virtual Machines

Key Characteristics

  • High flexibility and control
  • Resources can scale up or down based on demand
  • Pay-as-you-go pricing
  • Suitable for lift-and-shift migrations
  • Supports custom operating systems and software stacks

When Not to Use IaaS

  • High operational complexity
  • Teams lack infrastructure expertise
  • Ongoing maintenance overhead for OS, patches, and security
  • Predictable workloads that may be cheaper or simpler on other models
  • High availability and disaster recovery require careful design
  • Compliance and security responsibilities remain heavily on the user

DBaaS: Database as a Service

DBaaS provides a fully managed database in the cloud. The provider handles infrastructure, provisioning, patching, backups, scaling, and high availability, while users focus on storing, querying, and managing data.

Examples:

  • Neon (PostgreSQL)
  • Amazon RDS
  • Google Cloud SQL
  • Azure SQL Database
  • ClickHouse Cloud

Key Characteristics

  • Managed database infrastructure
  • Automated backups and recovery
  • Built-in scaling and replication options
  • Reduced operational overhead
  • Users focus on schema, queries, and data access

When Not to Use DBaaS

  • Need deep control over database internals or host OS
  • Strict latency requirements with on-premise systems
  • Regulatory or data residency constraints
  • Very specialized database tuning or custom extensions
  • Workloads where self-managed databases are more cost-effective at scale

Easy Way to Remember

  • PaaS: deploy your application
  • DBaaS: use a managed database for your application

Comparison between Services

Service Comparison


FaaS: Function as a Service

FaaS, often associated with serverless computing, lets developers run event-driven functions without managing servers. The cloud provider handles provisioning, scaling, and infrastructure maintenance.

Examples:

  • AWS Lambda
  • Azure Functions
  • Google Cloud Functions

Key Characteristics

  • Event-driven execution
  • Automatic scaling
  • Pay only for execution time and resources consumed
  • No server provisioning or management
  • Well suited for lightweight, modular workloads
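
The event-driven model can be sketched locally. This is a provider-neutral illustration, not the API of AWS Lambda, Azure Functions, or Google Cloud Functions: the `on` decorator, the `invoke` function, and the event shapes are all hypothetical stand-ins for what the platform does on your behalf.

```python
HANDLERS = {}  # event type -> function; the platform keeps this mapping for you


def on(event_type):
    """Register a function to run when an event of this type arrives."""
    def register(func):
        HANDLERS[event_type] = func
        return func
    return register


@on("file_uploaded")
def make_thumbnail_job(event):
    return {"action": "thumbnail", "file": event["name"]}


@on("user_signed_up")
def welcome_email(event):
    return {"action": "email", "to": event["email"]}


def invoke(event):
    """Stand-in for the platform: find the function for this event and run it once."""
    handler = HANDLERS.get(event["type"])
    if handler is None:
        return None  # no function subscribed to this event type
    return handler(event)


print(invoke({"type": "file_uploaded", "name": "cat.png"}))
print(invoke({"type": "user_signed_up", "email": "a@example.com"}))
```

Each function is small, holds no state between calls, and runs only when its event occurs, which is what makes per-invocation billing and automatic scaling possible.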

When Not to Use FaaS

  • Long-running tasks
  • Complex stateful workflows
  • Latency-sensitive applications affected by cold starts
  • Heavy compute-intensive jobs
  • Strong dependence on provider-specific services
  • Constant, predictable workloads where containers or VMs may be more efficient

Quick Comparison

| Model | What You Manage | What Provider Manages | Best For |
|---|---|---|---|
| SaaS | Minimal user settings and usage | Application, platform, infrastructure | End-user software |
| PaaS | Application code and data | Infrastructure, OS, runtime, middleware | App development and deployment |
| IaaS | OS, middleware, apps, data | Physical hardware, virtualization | Maximum control and flexibility |
| FaaS | Function code and logic | Infrastructure, scaling, execution environment | Event-driven, serverless workloads |

Easy way to remember SaaS, PaaS, IaaS, FaaS

  • SaaS: Use the software
  • PaaS: Build the software
  • IaaS: Manage the software and OS on rented infrastructure
  • FaaS: Run small functions without managing servers

Pizza Example 1

#saas #iaas #paas #faas #dbaas


1: src: http://bigcommerce.com

[Avg. reading time: 3 minutes]

Challenges of Cloud Computing

Privacy

  • Sensitive data (PII, financial, health) lives in the cloud
  • Requires strong controls: encryption, IAM, audits
  • Breaches = high impact + regulatory exposure

Compliance

  • Data replication across regions can violate data residency laws
  • Regulations may restrict where data is stored/processed
  • Example: Google Cloud Platform (GCP) lacks mainland China regions

Data Availability

  • Depends on network + provider reliability
  • Major providers (AWS, Azure, GCP) offer redundancy
  • Still vulnerable to outages and regional failures

Connectivity

  • Internet quality directly impacts performance
  • High latency or downtime affects apps and pipelines

Vendor Lock-In

  • Proprietary APIs/services make migration costly
  • Rewrites and data movement add friction

Data Transfer Costs

  • Egress (data leaving cloud) is expensive
  • Large-scale pipelines can silently drive costs up

Limited Control

  • No access to underlying infrastructure
  • Less flexibility for tuning, customization, and debugging

#challenges #cloud

[Avg. reading time: 9 minutes]

Multi-Cloud

  • Amazon Web Services (AWS)
    Market leader with the broadest range of services. Strong in compute, storage, and global infrastructure.

  • Microsoft Azure
    Widely used in enterprises due to tight integration with Microsoft products like Windows Server, Active Directory, and Office.

  • Google Cloud Platform (GCP)
    Strong in data analytics, big data processing, and machine learning (e.g., BigQuery).

  • IBM Cloud
    Focused on hybrid cloud and enterprise-grade solutions.

  • Oracle Cloud (OCI)
    Known for database services and enterprise workloads.


What is Multi-Cloud?

Multi-cloud is an approach where an organization uses multiple cloud providers instead of relying on a single one.

Example:

  • AWS for infrastructure
  • GCP for analytics
  • Azure for enterprise applications

Why Multi-Cloud is Needed

  • Avoid Vendor Lock-in
    Prevents dependency on a single provider’s pricing, tools, and limitations.

  • Best-of-Breed Services
    Different providers excel in different areas:

    • AWS: infrastructure maturity
    • GCP: analytics and AI
    • Azure: enterprise integration

  • Improved Reliability
    Reduces risk of total system failure if one provider experiences an outage.

  • Regulatory Requirements
    Some workloads must run in specific regions or environments, requiring multiple providers.


Limitations of Single Cloud

  • Vendor Lock-in
    Migration becomes difficult once deeply integrated with one provider.

  • Pricing Constraints
    No negotiation power if fully dependent on one vendor.

  • Service Gaps
    No single provider is best at everything.

  • Single Point of Failure
    Outages in one cloud can impact the entire system.


Ingress vs Egress

  • Ingress
    Data entering the cloud.
    Typically free of cost.

  • Egress
    Data leaving the cloud.
    Typically charged, and often expensive.

Why it matters:

  • Moving data between clouds incurs egress costs
  • Example: Transferring data from AWS to GCP → AWS charges egress fees
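
A rough cost sketch for the ingress/egress asymmetry. The rate of 0.09 USD per GB is a hypothetical placeholder, not a quoted price from any provider; real pricing varies by provider, region, destination, and volume tier.

```python
def transfer_cost(gigabytes, direction, egress_rate_per_gb=0.09):
    """Estimate transfer cost in USD. The default rate is a made-up example value."""
    if direction == "ingress":
        return 0.0  # data entering the cloud is typically free
    if direction == "egress":
        return gigabytes * egress_rate_per_gb
    raise ValueError("direction must be 'ingress' or 'egress'")


# Moving 1 TB out of one cloud and into another: only the source side charges.
leaving_source = transfer_cost(1000, "egress")
entering_target = transfer_cost(1000, "ingress")
print(leaving_source, entering_target)
```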

Cloud Cost Considerations

  • Compute Costs
    Charges for virtual machines, containers, and serverless functions. Usually predictable.

  • Storage Costs
    Low per unit, but grows significantly with scale.

  • Data Transfer Costs (Egress)
    Often the hidden cost driver, especially in multi-cloud setups.

  • Managed Services Premium
    Higher cost for convenience (managed databases, AI services, etc.)

  • Idle Resources
    Unused or overprovisioned resources can significantly increase costs.


Challenges of Multi-Cloud

  • Operational Complexity
    Different tools, APIs, and configurations across providers.

  • Skill Requirements
    Teams must understand multiple cloud ecosystems.

  • Data Movement Costs
    Egress charges increase when transferring data between clouds.

  • Monitoring and Management
    Observability becomes more complex across platforms.


When to Use Multi-Cloud

  • Need for high resilience across providers
  • Advanced data and AI workloads
  • Compliance or regulatory constraints
  • Organizations with mature cloud teams

Summary

Multi-cloud provides flexibility, resilience, and access to best-in-class services, but it also introduces significant complexity and cost. It should be adopted only when there is a clear architectural or business need.

#aws #azure #oracle #gcp #multicloud

[Avg. reading time: 6 minutes]

High Availability

High Availability (HA) refers to designing systems that remain operational with minimal downtime over a given period.

It is often associated with uptime, but they are not the same:

  • Uptime = observed system availability
  • High Availability = design approach used to achieve high uptime

Availability Formula

  • Availability = (Total Time - Downtime) / Total Time

This formula is used in SLAs and monitoring systems to measure system reliability.
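
The formula can be applied directly. This small sketch assumes both values use the same time unit (hours here). The 8.76 hours of downtime is an example input.

```python
def availability(total_time, downtime):
    """Availability as a fraction; both arguments must use the same unit."""
    if total_time <= 0:
        raise ValueError("total_time must be positive")
    return (total_time - downtime) / total_time


HOURS_PER_YEAR = 365 * 24  # 8760
pct = availability(HOURS_PER_YEAR, 8.76) * 100
print(f"{pct:.3f}%")
```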


Availability Levels and Downtime

Each additional “9” cuts the allowed downtime by a factor of ten.

99% Availability (Two Nines)

  • Downtime: ~3.65 days per year
  • Monthly Downtime: ~7.3 hours
  • Suitable for non-critical systems

99.9% Availability (Three Nines)

  • Downtime: ~8.76 hours per year
  • Monthly Downtime: ~43.8 minutes
  • Suitable for most business applications

99.99% Availability (Four Nines)

  • Downtime: ~52.6 minutes per year
  • Monthly Downtime: ~4.38 minutes
  • Used for critical systems

99.999% Availability (Five Nines)

  • Downtime: ~5.26 minutes per year
  • Monthly Downtime: ~26.3 seconds
  • Required for highly critical systems (finance, healthcare, telecom)

Why Each “9” Matters

  • 99% → days of downtime per year
  • 99.9% → hours of downtime per year
  • 99.99% → under an hour of downtime per year
  • 99.999% → about five minutes of downtime per year (seconds per month)

Each step requires significantly more advanced engineering and cost.
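
The downtime budgets listed above follow from simple arithmetic. This sketch assumes a 365-day year and an average month of one twelfth of a year (730 hours).

```python
HOURS_PER_YEAR = 365 * 24
HOURS_PER_MONTH = HOURS_PER_YEAR / 12  # 730 hours, an average month


def allowed_downtime_minutes(availability_pct, period_hours):
    """Minutes of downtime permitted in the period at the given availability."""
    return (1 - availability_pct / 100) * period_hours * 60


for pct in (99, 99.9, 99.99, 99.999):
    yearly = allowed_downtime_minutes(pct, HOURS_PER_YEAR)
    monthly = allowed_downtime_minutes(pct, HOURS_PER_MONTH)
    print(f"{pct}%: {yearly:.2f} min/year, {monthly:.2f} min/month")
```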


How High Availability is Achieved

  • Redundancy (multiple servers or instances)
  • Failover mechanisms (automatic switching)
  • Load balancing
  • No single point of failure
  • Multi-region deployments
  • Continuous monitoring and auto-recovery
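
Why redundancy helps can be shown with basic probability. This sketch assumes component failures are independent, which real systems only approximate. The availability figures are example inputs.

```python
import math


def serial_availability(*components):
    """All components must be up at once (e.g. app AND database)."""
    return math.prod(components)


def redundant_availability(single, copies):
    """At least one of `copies` identical, independent instances must be up."""
    return 1 - (1 - single) ** copies


print(serial_availability(0.999, 0.999))   # chaining components lowers availability
print(redundant_availability(0.99, 2))     # a second copy raises it
print(redundant_availability(0.99, 3))
```

Two instances at 99% each give roughly 99.99% together, while two 99.9% components that both must work give slightly less than 99.9%. This is why single points of failure are removed first.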

SLA (Service Level Agreement)

  • Availability is usually defined in SLAs
  • Example: cloud providers like AWS, Azure, GCP offer ~99.9% to 99.99%
  • If availability drops below SLA → customers receive service credits (not full compensation)

Cost of Downtime

  • Average downtime cost: ~$5,600 per minute (Gartner estimate)
  • Large enterprises can exceed $100,000 per minute

Higher availability reduces risk but increases infrastructure and operational costs.


Key Insight

  • Moving from 99.9% → 99.99% is difficult
  • Moving from 99.99% → 99.999% is extremely complex and expensive

High Availability is a trade-off between reliability, cost, and system complexity.

#ha #highavailability

[Avg. reading time: 8 minutes]

Disaster Recovery (DR)

What is Disaster Recovery?

Disaster Recovery (DR) refers to the process of restoring systems, applications, and data after a failure or catastrophic event.

These events can include:

  • Hardware failures
  • Data center outages
  • Cyberattacks (e.g., ransomware)
  • Natural disasters (earthquakes, floods, fires)

Disaster Recovery vs High Availability (HA)

  • High Availability (HA)
    Focuses on preventing downtime
    Systems continue running with minimal or no interruption

  • Disaster Recovery (DR)
    Focuses on recovering after failure
    Accepts downtime, but minimizes impact and recovery time

Simple way to think:

  • HA = Avoid downtime
  • DR = Recover from failure

Why Disaster Recovery is Important

  • Business Continuity
    Ensures operations can resume after unexpected failures

  • Data Protection
    Prevents permanent data loss

  • Financial Impact Reduction
    Downtime can cost thousands to millions per hour

  • Compliance Requirements
    Many industries require DR plans (finance, healthcare, etc.)


Types of Disaster Recovery Strategies

1. Backup and Restore

  • Regular backups stored in another location
  • Restore systems when failure occurs

Pros:

  • Low cost
  • Simple to implement

Cons:

  • High recovery time
  • Possible data loss

2. Pilot Light

  • Minimal version of system always running in another region
  • Scale up during disaster

Pros:

  • Faster recovery than backup
  • Lower cost than full duplication

Cons:

  • Requires scaling during recovery

3. Warm Standby

  • Fully functional but scaled-down system running in another region

Pros:

  • Faster recovery
  • Moderate cost

Cons:

  • Still not instant failover

4. Active-Active (Multi-Region)

  • Systems run simultaneously in multiple regions

Pros:

  • Near-zero downtime
  • High resilience

Cons:

  • Very expensive
  • Complex to manage

Key Concepts in Disaster Recovery

Backup Types

  • Full Backup – Entire dataset
  • Incremental Backup – Only changes since last backup
  • Differential Backup – Changes since last full backup
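
The practical difference between the backup types shows up at restore time. This sketch assumes a schedule uses either incrementals or differentials after each full backup, not a mix. The day labels are hypothetical.

```python
def restore_chain(backups):
    """Return the backups needed, in order, to restore the latest state.

    backups: ordered list of (label, kind) with kind in
    {"full", "incremental", "differential"}.
    """
    full_positions = [i for i, (_, kind) in enumerate(backups) if kind == "full"]
    if not full_positions:
        raise ValueError("at least one full backup is required")
    start = full_positions[-1]
    chain = [backups[start][0]]
    later = backups[start + 1:]
    kinds = {kind for _, kind in later}
    if kinds == {"differential"}:
        chain.append(later[-1][0])                  # only the newest differential
    elif kinds == {"incremental"}:
        chain.extend(label for label, _ in later)   # every incremental, in order
    elif kinds:
        raise ValueError("mixed schedules are outside this sketch")
    return chain


weekly_incremental = [("sun", "full"), ("mon", "incremental"), ("tue", "incremental")]
weekly_differential = [("sun", "full"), ("mon", "differential"), ("tue", "differential")]
print(restore_chain(weekly_incremental))
print(restore_chain(weekly_differential))
```

Incrementals are smaller to take but need the whole chain to restore. Differentials grow over time but need only the last full backup plus the latest differential.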

Replication

  • Synchronous Replication
    Data written to multiple locations at the same time
    (low data loss, higher latency)

  • Asynchronous Replication
    Data replicated with delay
    (faster, but risk of data loss)
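
The trade-off between the two replication modes can be simulated in a few lines. This is a toy model with hypothetical names, not how any particular database implements replication.

```python
class ReplicatedStore:
    def __init__(self, synchronous):
        self.synchronous = synchronous
        self.primary = []
        self.replica = []
        self.pending = []  # writes acknowledged but not yet copied (async only)

    def write(self, record):
        self.primary.append(record)
        if self.synchronous:
            self.replica.append(record)  # caller waits until both copies exist
        else:
            self.pending.append(record)  # caller returns immediately

    def replicate(self):
        """Background copy step used by asynchronous replication."""
        self.replica.extend(self.pending)
        self.pending.clear()

    def lost_if_primary_fails_now(self):
        return list(self.pending)


sync_store = ReplicatedStore(synchronous=True)
async_store = ReplicatedStore(synchronous=False)
for store in (sync_store, async_store):
    store.write("order-1")
    store.write("order-2")
async_store.replicate()
for store in (sync_store, async_store):
    store.write("order-3")

print(sync_store.lost_if_primary_fails_now())
print(async_store.lost_if_primary_fails_now())
```

The synchronous store never has unreplicated writes, at the price of waiting on the replica for every write. The asynchronous store can lose whatever was written since the last copy.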


Disaster Recovery in Cloud

Cloud platforms simplify DR through:

  • Multi-region deployments
  • Automated backups
  • Managed replication services
  • Infrastructure as Code (IaC) for quick recovery

Example:

  • Primary system in one region
  • Backup or standby system in another region

Common Challenges

  • Cost vs Recovery Speed Tradeoff
  • Testing DR Plans
    • Many systems fail because DR is never tested
  • Data Consistency Issues
  • Complex Architecture
  • Human Error during recovery

Best Practices

  • Define clear RTO and RPO targets
  • Automate backups and replication
  • Use multiple regions
  • Regularly test recovery plans
  • Document procedures clearly

Summary

Disaster Recovery is not about avoiding failure; it is about being prepared to recover quickly and effectively when failure happens. A strong DR strategy ensures business continuity, protects data, and reduces the impact of unexpected disruptions.

#dr #RTO #RPO

[Avg. reading time: 8 minutes]

RTO vs RPO

What are RTO and RPO?

Recovery Time Objective (RTO)

RTO is the maximum acceptable time a system can be down after a failure.

  • Focus: Time to recover
  • Question it answers:
    “How fast do we need to restore the system?”

Recovery Point Objective (RPO)

RPO is the maximum acceptable amount of data loss, measured in time.

  • Focus: Data loss tolerance
  • Question it answers:
    “How much data can we afford to lose?”

Simple Example

  • RTO = 2 hours
    → System must be back online within 2 hours

  • RPO = 15 minutes
    → You can only lose up to 15 minutes of data
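
The example targets can be checked against an incident timeline. This sketch uses made-up timestamps; the function name and the report fields are illustrative.

```python
from datetime import datetime, timedelta


def check_targets(last_backup, failure, restored, rto, rpo):
    """Compare an incident against RTO (downtime) and RPO (data-loss window)."""
    downtime = restored - failure
    data_loss_window = failure - last_backup
    return {
        "downtime": downtime,
        "data_loss_window": data_loss_window,
        "rto_met": downtime <= rto,
        "rpo_met": data_loss_window <= rpo,
    }


failure = datetime(2026, 1, 1, 12, 0)  # illustrative timestamp
report = check_targets(
    last_backup=failure - timedelta(minutes=10),
    failure=failure,
    restored=failure + timedelta(hours=1, minutes=30),
    rto=timedelta(hours=2),
    rpo=timedelta(minutes=15),
)
print(report)
```

Here the system was down for 90 minutes (within the 2-hour RTO) and the last backup was 10 minutes old (within the 15-minute RPO). A backup that was 20 minutes old would miss the RPO even if recovery were instant.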


Key Differences

| Aspect | RTO | RPO |
|---|---|---|
| Focus | Downtime | Data Loss |
| Measured In | Time (minutes/hours) | Time (minutes/hours) |
| Goal | Restore service quickly | Minimize data loss |
| Driven By | System recovery capability | Backup/replication strategy |

When to Prioritize RTO vs RPO

Prioritize a low RTO when:

  • System availability is critical
  • Downtime directly impacts revenue or operations
  • Examples:
    • Banking systems
    • E-commerce platforms
    • Real-time services

Prioritize a low RPO when:

  • Data accuracy and integrity are critical
  • Data loss has serious consequences
  • Examples:
    • Financial transactions
    • Healthcare records
    • Order processing systems

How to Define RTO and RPO

Step 1: Identify Critical Systems

  • Which systems must recover fastest?
  • Which systems can tolerate downtime?

Step 2: Analyze Business Impact

  • What is the cost of downtime?
  • What is the cost of data loss?

Step 3: Assign Targets

| System Type | RTO | RPO |
|---|---|---|
| Payment System | Minutes | Near Zero |
| Internal Tools | Hours | 1–2 Hours |
| Analytics System | Hours/Day | Several Hours |

How to Achieve RTO and RPO

Improving RTO (Faster Recovery)

  • Use failover systems
  • Deploy across multiple regions
  • Use automation (Infrastructure as Code)
  • Maintain warm or active standby systems

Improving RPO (Less Data Loss)

  • Frequent backups
  • Real-time replication
  • Use distributed databases
  • Enable continuous data protection

Trade-Off: Cost vs Recovery

  • Lower RTO → Higher cost
    (requires active systems, redundancy)

  • Lower RPO → Higher cost
    (requires frequent backups or real-time replication)

Example:

  • RPO = 0 (no data loss) → requires synchronous replication → expensive
  • RTO = near zero → requires active-active setup → very expensive

Common Mistakes

  • Setting unrealistic RTO/RPO without infrastructure support
  • Not aligning targets with business needs
  • Not testing recovery procedures
  • Assuming backups alone are enough

Key Takeaway

  • RTO = How fast you recover
  • RPO = How much data you lose

Both must be defined together to design an effective disaster recovery strategy. Optimizing them always involves a trade-off between cost, complexity, and business requirements.

#rto #rpo

[Avg. reading time: 11 minutes]

Cloud Native vs Lift-and-Shift

Introduction

Organizations moving to the cloud typically follow one of two approaches:

  • Cloud Native → Build or redesign applications specifically for the cloud
  • Lift-and-Shift (Rehosting) → Move existing applications to the cloud with minimal or no changes

These approaches differ significantly in terms of architecture, cost, scalability, and long-term value.


What is Lift-and-Shift?

Lift-and-shift is the process of migrating applications from on-premises to the cloud without modifying their architecture.

Key Characteristics

  • Minimal or no code changes
  • Same architecture as on-premises
  • Faster migration
  • Uses virtual machines (VMs)

Example

  • Moving a legacy Java application from a local data center to a cloud VM (e.g., AWS EC2)

  • Netflix began moving its on-premises monolithic application to AWS around 2008-2009 as a quick exit from failing data centers, and later redesigned it around microservices.


What is Cloud Native?

Cloud native refers to designing and building applications specifically for cloud environments using modern architectural patterns.

Key Characteristics

  • Microservices architecture
  • Containers (Docker, Kubernetes)
  • Serverless computing
  • Auto-scaling and resilience built-in

Example

  • A microservices-based application using containers, APIs, and managed cloud services

Key Differences

| Aspect | Lift-and-Shift | Cloud Native |
|---|---|---|
| Approach | Rehost existing apps | Build/redesign for cloud |
| Speed | Fast migration | Slower (requires redesign) |
| Cost (initial) | Low | Higher |
| Cost (long-term) | Higher | Optimized |
| Scalability | Limited | Highly scalable |
| Architecture | Monolithic | Microservices |
| Cloud Utilization | Low | High |
| Maintenance | High | Lower (managed services) |

Use Cases for Lift-and-Shift

  • Quick cloud migration
    • Deadlines or data center shutdowns
  • Legacy applications
    • Difficult or risky to refactor
  • Short-term strategy
    • “Move first, optimize later”
  • Cost of redesign is too high
    • When ROI of refactoring is unclear

Use Cases for Cloud Native

  • New applications
    • Built from scratch for scalability
  • High-scale systems
    • E-commerce, streaming platforms, SaaS
  • Rapid innovation
    • Frequent deployments and updates
  • Modernization initiatives
    • Breaking monoliths into microservices

Advantages and Disadvantages

Lift-and-Shift

Advantages:

  • Fast and simple migration
  • Lower upfront effort
  • Minimal risk during transition

Disadvantages:

  • Does not leverage cloud capabilities
  • Higher long-term costs
  • Limited scalability and flexibility

Cloud Native

Advantages:

  • High scalability and resilience
  • Better cost optimization over time
  • Faster development and deployment cycles

Disadvantages:

  • Requires redesign and expertise
  • Higher initial investment
  • Increased architectural complexity

When to Choose Each Approach

Choose Lift-and-Shift if:

  • You need quick migration
  • Application is stable and rarely updated
  • Refactoring is too risky or expensive

Choose Cloud Native if:

  • You need scalability and flexibility
  • Building new applications
  • Want to leverage full cloud benefits
  • Long-term cost and performance matter

Hybrid Approach (Most Common in Reality)

Most organizations use a combination of both:

  • Lift-and-shift for legacy systems
  • Gradual refactoring into cloud-native architecture

This approach is often called:

  • Lift, Shift, and Optimize

Common Mistakes

  • Treating lift-and-shift as a final solution
  • Overengineering cloud-native systems unnecessarily
  • Ignoring cost implications of poor architecture
  • Lack of skilled teams for cloud-native development

Key Takeaway

  • Lift-and-Shift = Speed and simplicity
  • Cloud Native = Scalability and long-term efficiency

The right choice depends on business goals, timelines, and technical maturity. Most organizations start with lift-and-shift and evolve toward cloud-native architectures over time.

#cloudnative #lift #shifts

[Avg. reading time: 8 minutes]

Azure Cloud

Azure is a cloud computing platform provided by Microsoft that delivers computing resources over the internet on a pay-as-you-go model.


Core Concepts

Servers
Individual physical or virtual machines that provide compute power.

Data Centers
Physical facilities that host servers along with networking, storage, and other infrastructure components.

Availability Zones (AZs)
Each Availability Zone consists of one or more data centers within a region.

  • Designed for high availability
  • Provide fault isolation
  • Connected through low-latency networking

Even if one data center fails, services in other zones continue to operate.

Azure Availability Zones

Source: https://www.unixarena.com/2020/08/what-is-the-availablity-zone-on-azure.html


Regions

Regions are geographically distinct locations. Many regions contain multiple Availability Zones, although not every region offers them.

  • Help keep applications close to users
  • Improve latency and performance
  • Support data residency and compliance requirements

Azure Regions

Source: https://learn.microsoft.com/en-us/azure/reliability/availability-zones-overview?tabs=azure-cli


Paired Regions

Azure regions are grouped into region pairs for disaster recovery.

  • Typically located at least 300 miles apart (a preferred distance, not a guarantee for every pair)
  • Ensure isolation from large-scale failures
  • Support cross-region replication

Geo-Redundant Storage (GRS)

  • Data is stored in the primary region
  • Automatically replicated to the paired secondary region
  • Ensures durability and disaster recovery

Azure Site Recovery

Azure Site Recovery enables:

  • Failover to a secondary region during outages
  • Business continuity during disasters
  • Automated recovery workflows

Azure Paired Regions

Source: https://i.stack.imgur.com/BwHct.png


How Everything Fits Together

  • A Server runs your application
  • Servers live inside a Data Center
  • One or more Data Centers form an Availability Zone
  • Multiple Availability Zones form a Region
  • Two Regions form a Region Pair

Real-World Example

Scenario: E-commerce Application

  • Your application runs on servers in East US Region
  • Deployed across 3 Availability Zones for high availability
  • Database uses Geo-Redundant Storage (GRS)
  • Backup region: West US (Paired Region)

What happens during failure?

  • If one server fails: no impact
  • If one data center fails: the other Availability Zones keep serving traffic
  • If the entire region fails: fail over to the paired region

Minimal downtime, continuous service
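
The failure scenarios above can be expressed as a small decision function. This is a simplified model, not an Azure API: the zone and VM names are hypothetical, and a data center failure is treated as the loss of one whole zone.

```python
PRIMARY_REGION = "East US"
PAIRED_REGION = "West US"

# Hypothetical layout: servers spread over three Availability Zones.
ZONES = {
    "zone-1": ["vm-a", "vm-b"],
    "zone-2": ["vm-c"],
    "zone-3": ["vm-d"],
}


def serving_region(failed_servers=(), failed_zones=(), region_down=False):
    """Return the region that serves traffic after the given failures."""
    if region_down:
        return PAIRED_REGION
    healthy = [
        server
        for zone, servers in ZONES.items()
        if zone not in failed_zones
        for server in servers
        if server not in failed_servers
    ]
    # Stay in the primary region while any server there is still healthy.
    return PRIMARY_REGION if healthy else PAIRED_REGION


print(serving_region(failed_servers=["vm-a"]))
print(serving_region(failed_zones=["zone-1"]))
print(serving_region(region_down=True))
```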

Use Availability Zones when:

  • You want high availability within a region
  • Low latency is critical

Use Paired Regions when:

  • You need disaster recovery
  • You want protection from regional outages

Use Geo-Redundant Storage when:

  • Data durability is more important than cost

Trade-offs

  • Multi-zone deployments : Higher cost but better availability
  • Multi-region deployments : Even higher cost + complexity
  • GRS : More durable but higher storage cost, and the secondary copy lags behind the primary because replication is asynchronous

#azure #az #region

[Avg. reading time: 5 minutes]

Azure Services

Azure is a cloud platform that helps you:

  • Run applications
  • Store data
  • Connect systems
  • Secure access

1. Compute (Run Applications)

Virtual Machines (VM)

  • Cloud-based servers
  • You manage OS and software

Use when:

  • Full control is needed
  • Running legacy applications

App Service

  • Managed platform for web apps
  • No server management

Use when:

  • Hosting websites or APIs

Azure Functions

  • Runs code only when triggered

Use when:

  • Automation
  • Background jobs

Containers / AKS

  • Run containerized applications

Use when:

  • Microservices
  • Scalable systems

2. Storage (Store Data)

Blob Storage

  • Stores files (images, videos, backups)

Data Lake Storage

  • Optimized for big data and analytics

Azure Files

  • Shared file storage

Use when:

  • Lift-and-shift applications

3. Networking (Connect Systems)

Virtual Network (VNet)

  • Private network in Azure

Load Balancer

  • Distributes traffic across servers

Application Gateway

  • Routes web traffic
  • Includes Web Application Firewall (WAF)

ExpressRoute

  • Private connection to Azure (no public internet)

4. Identity & Access (Security Basics)

Azure Active Directory (AAD, now Microsoft Entra ID)

  • Manages users and login

RBAC (Role-Based Access Control)

  • Controls who can access what

5. Monitoring & Management

Azure Monitor

  • Tracks logs, metrics, and alerts

Azure Resource Manager (ARM)

  • Used to deploy and manage resources

Azure Backup

  • Backup and restore data

Azure Site Recovery

  • Disaster recovery (failover to another region)

flowchart LR

User[User / Browser]

User -->|HTTPS| CDN[CDN / Front Door]
CDN --> AG[Application Gateway]

AG --> App[WebApp-App Service]

App --> API[Backend APIs]

API --> DB[(Database)]
API --> Cache[(Cache)]
API --> Storage[(Blob Storage)]

App --> Auth[Azure AD / Identity]

subgraph Azure Cloud
    CDN
    AG
    App
    API
    DB
    Cache
    Storage
    Auth
end

#azure #services

[Avg. reading time: 2 minutes]

How Do We Access Azure?

There are multiple ways to interact with Azure:

1. Azure Portal (Web UI)

  • Browser-based interface
  • Point-and-click experience

Use when:

  • Learning Azure
  • Exploring services
  • Quick setup

2. Azure CLI

  • Command-line tool (az commands)
  • Scriptable and fast

Example:

az group create --name myRG --location eastus

Use when:

  • Automation
  • Dev workflows

3. Azure PowerShell

  • PowerShell-based commands

Use when:

  • Windows admins
  • Scripting in PowerShell

4. SDKs (Python, Java, etc.)

  • Interact with Azure using code

Use when:

  • Application integration
  • Custom automation

Summary

  • Portal → Manual
  • CLI → Fast + Scriptable
  • PowerShell → Scriptable for PowerShell users
  • SDK → Programmable

#azurecli #sdk #webportal

[Avg. reading time: 6 minutes]

What is Infrastructure as Code

Infrastructure as Code (IaC) means:

Defining cloud resources using code instead of manually creating them.


Without IaC (Not best practice)

  • Click in portal
  • Create resources manually
  • Hard to repeat
  • Error-prone

With IaC (Best Practice)

  • Write code to define resources
  • Reusable and consistent
  • Version controlled

Example (Conceptual)

Instead of:

  • Manually creating VM
  • Manually creating storage

You write code that says:

  • Create 1 VM
  • Create 1 Storage Account
  • Connect them

IaC Tools in Azure

1. ARM Templates

  • Native Azure JSON-based templates

2. Terraform

  • Multi-cloud IaC tool

Why IaC Matters

  • Repeatability : same setup every time
  • Version Control : track changes
  • Automation : faster deployments
  • Reliability : fewer mistakes

Summary

  • Deploy same setup multiple times
  • Maintain consistency
  • Automate environments (Dev / Test / Prod)

Azure Login

az login

Azure Group

# List all Resource Groups
az group list --output table

# Create a new Resource Group
az group create --name resgroup_via_cli --location eastus2

# Delete the Resource Group (asks for confirmation)
az group delete --name resgroup_via_cli

# Delete the Resource Group without a prompt
az group delete --name resgroup_via_cli -y

# List all VMs
az vm list

# List the VM sizes available in a region
az vm list-sizes --location eastus

Sample Basic VM

template.json

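{
  "$schema": "https://schema.management.azure.com/schemas/2019-04-01/deploymentTemplate.json#",
  "contentVersion": "1.0.0.0",

  "parameters": {
    "vmName": { "type": "string", "defaultValue": "myVM" },
    "adminUsername": { "type": "string" },
    "adminPassword": { "type": "secureString" },
    "nicId": {
      "type": "string",
      "metadata": {
        "description": "Resource ID of an existing network interface. A VM cannot be created without one."
      }
    }
  },

  "resources": [
    {
      "type": "Microsoft.Compute/virtualMachines",
      "apiVersion": "2021-11-01",
      "name": "[parameters('vmName')]",
      "location": "[resourceGroup().location]",

      "properties": {
        "hardwareProfile": {
          "vmSize": "Standard_B1s"
        },

        "osProfile": {
          "computerName": "[parameters('vmName')]",
          "adminUsername": "[parameters('adminUsername')]",
          "adminPassword": "[parameters('adminPassword')]"
        },

        "storageProfile": {
          "imageReference": {
            "publisher": "Canonical",
            "offer": "0001-com-ubuntu-server-jammy",
            "sku": "22_04-lts",
            "version": "latest"
          },
          "osDisk": {
            "createOption": "FromImage"
          }
        },

        "networkProfile": {
          "networkInterfaces": [
            { "id": "[parameters('nicId')]" }
          ]
        }
      }
    }
  ]
}

Deploy ARM Template using Azure CLI

Step 1: Login to Azure

az login

Step 2: Create a Resource Group

az group create \
  --name myResourceGroup \
  --location eastus

Step 3: Create a Network Interface for the VM

The names myVnet, mySubnet, and myNic are placeholders; pick your own.

az network vnet create \
  --resource-group myResourceGroup \
  --name myVnet \
  --subnet-name mySubnet

az network nic create \
  --resource-group myResourceGroup \
  --name myNic \
  --vnet-name myVnet \
  --subnet mySubnet

Step 4: Deploy the ARM Template

NIC_ID=$(az network nic show \
  --resource-group myResourceGroup \
  --name myNic \
  --query id --output tsv)

az deployment group create \
  --resource-group myResourceGroup \
  --template-file template.json \
  --parameters adminUsername=azureuser adminPassword=YourPassword123 nicId="$NIC_ID"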

#iac #armtemplates #terraform

[Avg. reading time: 3 minutes]

IaC Concept: Idempotency

What is Idempotency?

Idempotency means:

Running the same code multiple times produces the same result.


Simple Example

If your code says:

  • Create 1 Virtual Machine

Then:

  • Run once : VM is created
  • Run again : No duplicate VM
  • Run again : Still only 1 VM

Why This Matters

Without idempotency:

  • You might create duplicate resources
  • Systems become inconsistent
  • Hard to manage infrastructure

With Idempotency (IaC)

  • Safe to run deployments multiple times
  • Ensures desired state is maintained
  • Prevents duplication

Real-World Analogy

Think of a light switch:

  • Turn ON : light is on
  • Turn ON again : still on (no change)

Same idea with infrastructure.


In Azure (ARM Templates)

  • You define the desired state
  • Azure checks current state
  • Only applies necessary changes

IaC is not about “running commands”

It is about:

Declaring the final state and letting the system reach it
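
The difference between "running commands" and "declaring the final state" can be sketched in plain Python. This is a toy model, not Azure or ARM code: the create_vm and ensure_vm functions and the in-memory "cloud" containers are hypothetical names used only for illustration.

```python
def create_vm(cloud, name):
    """Imperative and NOT idempotent: every call adds another VM."""
    cloud.append(name)


def ensure_vm(cloud, name, size):
    """Declarative and idempotent: move the cloud toward the desired state."""
    desired = {"size": size}
    current = cloud.get(name)
    if current is None:
        cloud[name] = desired
        return "created"
    if current != desired:
        cloud[name] = desired
        return "updated"
    return "unchanged"


# Run the same "deployment" three times with each approach.
imperative_cloud = []
for _ in range(3):
    create_vm(imperative_cloud, "demo-vm")

declarative_cloud = {}
results = [ensure_vm(declarative_cloud, "demo-vm", "Standard_B1s") for _ in range(3)]

print(len(imperative_cloud))   # 3 VMs: duplicates
print(len(declarative_cloud))  # 1 VM: no duplicates
print(results)                 # ['created', 'unchanged', 'unchanged']
```

The second approach is safe to re-run because it checks the current state first and only acts on the difference.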

#iac #idempotency

[Avg. reading time: 7 minutes]

Terraform

What is Terraform?

Terraform is a tool used to:

Define and manage cloud infrastructure using code

It works across multiple cloud providers like:

  • Azure
  • AWS
  • Google Cloud

Why Terraform?

Instead of manually creating resources:

  • Clicking in portal
  • Running multiple commands

You write code once and Terraform:

  • Creates everything
  • Updates changes
  • Keeps things consistent

Core Idea

You describe:

  • What you want

Terraform figures out:

  • How to create it

How Terraform Works

  1. Write configuration (code)
  2. Run terraform plan : see what will happen
  3. Run terraform apply : create/update resources

How Idempotency works with Terraform

Terraform State

Terraform state is:

A file that keeps track of what Terraform has created

Terraform needs to know:

  • What already exists
  • What needs to change
  • What to delete

Without state:

  • Terraform would not know current infrastructure
  • It could create duplicates or break things

Where is State Stored?

  • Local file : terraform.tfstate
  • Remote storage : Azure Storage, S3, etc.

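Terraform does NOT scan your whole Azure subscription on every run.

It relies on:

The state file as its record of which resources it manages. During plan and apply, Terraform refreshes only those recorded resources from Azure and compares them with your configuration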
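
To make the role of state concrete, here is a minimal sketch of the comparison a plan step performs. It is a simplification under stated assumptions: the plan function and the sample dictionaries are hypothetical, and real Terraform also handles dependencies, providers, and much more.

```python
def plan(desired, state):
    """Compare desired configuration with recorded state, keyed by resource name."""
    return {
        "create": sorted(desired.keys() - state.keys()),
        "update": sorted(
            name for name in desired.keys() & state.keys()
            if desired[name] != state[name]
        ),
        "delete": sorted(state.keys() - desired.keys()),
    }


# Hypothetical example: the state knows about an old VM that the code no longer declares.
state = {"demo-rg": {"location": "eastus"}, "old-vm": {"size": "Standard_B1s"}}
desired = {"demo-rg": {"location": "eastus"}, "demo-vm": {"size": "Standard_B1s"}}

print(plan(desired, state))
# {'create': ['demo-vm'], 'update': [], 'delete': ['old-vm']}

# When code and state already match, the plan is empty, which is why re-running is safe.
print(plan(desired, desired))
# {'create': [], 'update': [], 'delete': []}
```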

Terraform Example: Create a VM in Azure

Step 1: Install Terraform

Download from: https://developer.hashicorp.com/terraform/downloads

Verify:

terraform -version

Step 2: Create Project Folder

terraform-vm/
 └── main.tf

Step 3: Write Terraform Code

Create main.tf:

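terraform {
  required_providers {
    azurerm = {
      source  = "hashicorp/azurerm"
      # Assumption: stay on the 3.x provider line.
      # The 4.x line also requires a subscription_id in the provider block.
      version = "~> 3.0"
    }
  }
}

provider "azurerm" {
  features {}
}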

resource "azurerm_resource_group" "rg" {
  name     = "demo-rg"
  location = "East US"
}

resource "azurerm_virtual_network" "vnet" {
  name                = "demo-vnet"
  address_space       = ["10.0.0.0/16"]
  location            = azurerm_resource_group.rg.location
  resource_group_name = azurerm_resource_group.rg.name
}

resource "azurerm_subnet" "subnet" {
  name                 = "demo-subnet"
  resource_group_name  = azurerm_resource_group.rg.name
  virtual_network_name = azurerm_virtual_network.vnet.name
  address_prefixes     = ["10.0.1.0/24"]
}

resource "azurerm_network_interface" "nic" {
  name                = "demo-nic"
  location            = azurerm_resource_group.rg.location
  resource_group_name = azurerm_resource_group.rg.name

  ip_configuration {
    name                          = "internal"
    subnet_id                     = azurerm_subnet.subnet.id
    private_ip_address_allocation = "Dynamic"
  }
}

resource "azurerm_linux_virtual_machine" "vm" {
  name                = "demo-vm"
  resource_group_name = azurerm_resource_group.rg.name
  location            = azurerm_resource_group.rg.location
  size                = "Standard_B1s"
  admin_username      = "azureuser"

  network_interface_ids = [
    azurerm_network_interface.nic.id
  ]

  admin_password = "YourPassword123!"

  disable_password_authentication = false

  os_disk {
    caching              = "ReadWrite"
    storage_account_type = "Standard_LRS"
  }

  source_image_reference {
    publisher = "Canonical"
    offer     = "0001-com-ubuntu-server-jammy"
    sku       = "22_04-lts"
    version   = "latest"
  }
}

Step 4: Login to Azure

az login

Step 5: Initialize Terraform

terraform init

Step 6: Preview Changes

terraform plan

Step 7: Apply (Create Resources)

terraform apply

Type:

yes

Step 8: Destroy Resources

terraform destroy

#azure #devops #terraform

[Avg. reading time: 1 minute]

CLI Tools for Operational Efficiency

  1. Introduction
  2. Linux Commands 01
  3. Linux Commands 02
  4. AWK
  5. CSV SQL
  6. JQ
  7. YQ

[Avg. reading time: 1 minute]

Introduction - CLI Tools

Knowledge of these tools is a baseline skill required to function in real data engineering environments. This chapter focuses on command-line proficiency, text processing, and direct manipulation of JSON and YAML using standard tools.

These skills are not tied to any single framework and apply across cloud platforms, data pipelines, and production systems.

[Avg. reading time: 1 minute]

Linux Commands - 01

This is the first set of Linux commands. Look up what each command does on any of the many reference websites, or ask your favourite AI tool.

Mac: open Terminal

Windows: open Git Bash

hostname

whoami

uname

uname -a

ping <hostname>

pwd

echo ""

mkdir <foldername>

cd <foldername>

touch <filename>

echo "sometext" > <filename>

cd ..  (the space between cd and .. is required)

ls [-l]

cp <filename> <filename1>

#linux #commands #cli #gitbash

[Avg. reading time: 1 minute]

Linux Commands - 02

This is the next set of Linux commands. Look up what each command does on any of the many reference websites, or ask your favourite AI tool.

wget

touch

echo

variables 

|

cat

wc

more

head

tail

grep

cut

uniq

sort

#linux #cli

[Avg. reading time: 5 minutes]

AWK

AWK is a scripting language used for manipulating data and generating reports. It’s a Domain Specific Language.

Demo Using AWK

wget https://raw.githubusercontent.com/gchandra10/awk_scripts_data_science/master/sales_100.csv

Display file contents

awk '{print }' sales_100.csv

By default, AWK splits fields on whitespace (spaces and tabs). Since our file is comma-separated, specify the delimiter with -F

awk -F ',' '{print }' sales_100.csv

To get the number of columns in each row, use NF (a built-in variable)

awk -F ',' '{print NF}' sales_100.csv

AWK lets you choose specific columns.

awk -F ',' '{print $1,$2,$4}' sales_100.csv

Row Filter

AND = &&

OR = ||

Not = !

awk -F ',' '{if($4 == "Online") {print $1,$2,$4}}' sales_100.csv
awk -F ',' '{if($4 == "Online" && $5 == "L") {print $1,$2,$4,$5}}' sales_100.csv
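
If AWK's field numbering feels unfamiliar, the same row filter can be sketched in Python. Treat this as a rough equivalent only: the SAMPLE rows below are invented for the example (they are not taken from sales_100.csv), and filter_rows is a hypothetical helper name.

```python
import csv
import io

# Made-up rows in the same spirit as the sales file: column 4 is the channel, column 5 the priority.
SAMPLE = (
    "Region,Country,Item Type,Sales Channel,Order Priority\n"
    "Asia,India,Snacks,Online,L\n"
    "Europe,France,Clothes,Offline,H\n"
    "Africa,Kenya,Cereal,Online,M\n"
)


def filter_rows(text, channel, priority=None):
    """Keep rows where column 4 equals channel and, optionally, column 5 equals priority."""
    rows = []
    for fields in csv.reader(io.StringIO(text)):
        # AWK counts fields from 1, Python from 0, so $4 is fields[3] and $5 is fields[4].
        if fields[3] != channel:
            continue
        if priority is not None and fields[4] != priority:
            continue
        rows.append((fields[0], fields[1], fields[3]))
    return rows


print(filter_rows(SAMPLE, "Online"))        # like: $4 == "Online"
print(filter_rows(SAMPLE, "Online", "L"))   # like: $4 == "Online" && $5 == "L"
```

The header row drops out on its own here because its fourth field is the text "Sales Channel", not "Online".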

Variables

awk -F ',' '{sp=$9 * $10;cp=$9 * $11; {printf "%f,%f,%s,%s \n",sp,cp,$1,$2 }}' sales_100.csv

RegEx: Return all rows starting with A in Column 1

awk -F ',' '$1 ~ /^A/ {print}' sales_100.csv

Return all rows that contain a space in Column 1 (a literal space is used because \s is a gawk extension and is not portable to every AWK)

awk -F ',' '$1 ~ / / {print}' sales_100.csv

AWK can also change the output column and row delimiters

OFS: Output Field Separator

ORS: Output Record Separator

awk -F ',' 'BEGIN{OFS="|";ORS="\n\n"} $1 ~ /^A/ {print substr($1,1,4),$2,$3,$4,$5}' sales_100.csv

Built-in Functions

awk -F ',' 'BEGIN{OFS="|";ORS="\n"} $1 ~ /^A/ {print tolower(substr($1,1,4)),tolower($2),$3,$4,$5}' sales_100.csv
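
For comparison, here is a rough Python sketch of the last command: match rows whose first column starts with A, shorten and lowercase some fields, and join them with a custom separator. The SAMPLE rows and the transform function are assumptions made up for this example.

```python
import csv
import io

# Made-up rows, not the real contents of sales_100.csv.
SAMPLE = (
    "Asia,India,Snacks,Online,L\n"
    "Europe,France,Clothes,Offline,H\n"
    "Africa,Kenya,Cereal,Online,M\n"
)


def transform(text, ofs="|", ors="\n"):
    """Mimic: BEGIN{OFS;ORS} $1 ~ /^A/ {print tolower(substr($1,1,4)), tolower($2), $3, $4, $5}"""
    out = []
    for f in csv.reader(io.StringIO(text)):
        if not f[0].startswith("A"):          # $1 ~ /^A/
            continue
        # substr($1,1,4) is 1-based in AWK; the Python slice f[0][:4] selects the same characters.
        columns = [f[0][:4].lower(), f[1].lower(), f[2], f[3], f[4]]
        out.append(ofs.join(columns) + ors)   # OFS between fields, ORS after each record
    return "".join(out)


print(transform(SAMPLE), end="")
```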

#awk #library #textbased

[Avg. reading time: 3 minutes]

CSVSQL

Run SQL queries directly on a CSV file.

Download the CSV file to your local machine.

wget https://raw.githubusercontent.com/gchandra10/awk_scripts_data_science/master/sales_100.csv

Install CSVKit

Simple query

csvsql --query "select * from sales_100" ./sales_100.csv

with Limit

csvsql --query "select * from sales_100 limit 5" ./sales_100.csv

using MAX aggregate function

csvsql --query "select max(unitprice) from sales_100 limit 5" ./sales_100.csv

Wrap column names that contain spaces in double quotes

csvsql --query 'select distinct("Order Priority") from sales_100' ./sales_100.csv

Using Group By

csvsql --query "select country,region,count(*) from sales_100 group by country, region" ./sales_100.csv

using WildCards

csvsql --query "select * from sales_100 where region like 'A%' order by region desc" sales_100.csv
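
It can help to see roughly what happens behind a csvsql --query call: by default the CSV is loaded into an in-memory SQLite table named after the file, and the query runs against that table. The sketch below imitates that idea with Python's standard library. It is not csvkit's actual code, and the SAMPLE data, the table name sales, and the query_csv helper are assumptions for illustration.

```python
import csv
import io
import sqlite3

# Made-up data, not the real sales_100.csv.
SAMPLE = (
    "region,country,units\n"
    "Asia,India,10\n"
    "Asia,Japan,5\n"
    "Europe,France,7\n"
)


def query_csv(text, table, sql):
    """Load CSV text into an in-memory SQLite table and run one query against it."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)

    conn = sqlite3.connect(":memory:")
    columns = ", ".join(f'"{name}"' for name in header)
    conn.execute(f'CREATE TABLE "{table}" ({columns})')

    placeholders = ", ".join("?" for _ in header)
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', reader)

    rows = conn.execute(sql).fetchall()
    conn.close()
    return rows


print(query_csv(
    SAMPLE,
    "sales",
    "select region, count(*) from sales group by region order by region",
))
```

Note that this sketch stores every value as text; csvsql additionally infers column types.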

#csvsql #csvkit

[Avg. reading time: 8 minutes]

JQ

  • jq is a lightweight and flexible command-line JSON processor.
  • Reads JSON from stdin or a file, applies filters, and writes JSON to stdout.
  • Useful when working with APIs, logs, or config files in JSON format.
  • Handy tool in Automation.

There are two ways to follow along:

  1. Download JQ CLI (Preferred) and learn JQ.

JQ Download

  2. Use the VSCode Extension and learn JQ.

VSCode Extension

Download the sample JSON

https://raw.githubusercontent.com/gchandra10/jqtutorial/refs/heads/master/sample_nows.json

Note: The top level of this file is an array with no named root key, so '.' refers to the whole document.

1. View JSON file in readable format

jq '.' sample_nows.json

2. Read the First JSON element / object

jq 'first(.[])' sample_nows.json

3. Read the Last JSON element

jq 'last(.[])' sample_nows.json

4. Read top 3 JSON elements

jq 'limit(3;.[])' sample_nows.json

5. Read the elements at index 2 and 3 (the 3rd and 4th elements, since indexing starts at 0). Slicing works like Python: the LEFT index is inclusive, the RIGHT index is exclusive

jq '.[2:4]' sample_nows.json
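
Since jq slices follow the same rule as Python lists, a quick Python check shows which elements a slice such as [2:4] picks. The small doc list below is a made-up stand-in for sample_nows.json, and the jq filters in the comments are only rough counterparts (jq's first, last, and limit emit values rather than Python objects).

```python
import json

# Hypothetical stand-in for sample_nows.json: a top-level array of objects.
doc = json.loads('[{"index": 0}, {"index": 1}, {"index": 2}, {"index": 3}, {"index": 4}]')

first = doc[0]      # roughly jq: first(.[])
last = doc[-1]      # roughly jq: last(.[])
top3 = doc[:3]      # roughly jq: limit(3; .[])
middle = doc[2:4]   # jq: .[2:4]  -> index 2 is included, index 4 is not

print([item["index"] for item in middle])  # [2, 3]
```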

6. Extract individual values. | Pipeline the output

jq '.[] | [.balance,.age]' sample_nows.json

7. Extract individual values and do some calculations

jq '.[] | [.age, 65 - .age]' sample_nows.json

8. Return CSV from JSON

jq '.[] | [.company, .phone, .address] | @csv ' sample_nows.json

9. Return Tab Separated Values (TSV) from JSON

jq '.[] | [.company, .phone, .address] | @tsv ' sample_nows.json

10. Return with a custom pipe delimiter ( | )

jq '.[] | [.company, .phone, .address] | join("|")' sample_nows.json

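Pro TIP : Redirect this result to a file (> output.txt) and import it into a database using bulk import tools like bcp or LOAD DATA INFILE. Add the -r (raw output) option so each line is written as plain text instead of a quoted JSON string.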
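
The same JSON-to-delimited-text step can be sketched in Python when jq is not available. This is a loose equivalent under stated assumptions: the SAMPLE records and the to_delimited helper are made up for the example, and Python's csv module only quotes a field when it has to, whereas jq's @csv quotes every string.

```python
import csv
import io
import json

# Made-up records with the same keys used in the jq examples.
SAMPLE = json.loads(
    '[{"company": "ACME, Inc", "phone": "555-0100", "address": "1 Main St"},'
    ' {"company": "Globex", "phone": "555-0199", "address": "9 Elm Rd"}]'
)


def to_delimited(records, fields, delimiter=","):
    """Write the chosen fields of each record as one delimited line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=delimiter, lineterminator="\n")
    for record in records:
        writer.writerow([record[name] for name in fields])
    return buffer.getvalue()


FIELDS = ["company", "phone", "address"]
print(to_delimited(SAMPLE, FIELDS), end="")                   # comma separated
print(to_delimited(SAMPLE, FIELDS, delimiter="|"), end="")    # pipe separated
```

With the comma delimiter, the company name that itself contains a comma is quoted so the line still has three fields.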

11. Convert the number to string and return | delimited result

jq '.[] | [.balance,(.age | tostring)] | join("|") ' sample_nows.json

12. Process Array return Name (returns as list / array)

jq '.[] | [.friends[].name]' sample_nows.json

or (returns line by line)

jq '.[].friends[].name' sample_nows.json

13. Parse multi level values

returns as list / array

jq '.[] | [.name.first, .name.last]' sample_nows.json 

returns line by line

jq '.[].name.first, .[].name.last' sample_nows.json 

14. Query values based on condition, say .index > 2

jq 'map(select(.index > 2))' sample_nows.json
jq 'map(select(.index > 2)) | .[] | [.index,.balance,.age]' sample_nows.json

15. Sorting Elements

# Sort by Age ASC
jq 'sort_by(.age)' sample_nows.json
# Sort by Age DESC
jq 'sort_by(-.age)' sample_nows.json
# Sort on multiple keys
jq 'sort_by(.age, .index)' sample_nows.json
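
The filtering and sorting filters above map closely onto Python list comprehensions and sorted. The sketch below uses a few invented records (not the real sample_nows.json) to show the correspondence.

```python
import json

# Made-up records with the keys used in the jq examples.
SAMPLE = json.loads(
    '[{"index": 0, "age": 40}, {"index": 3, "age": 25},'
    ' {"index": 5, "age": 31}, {"index": 4, "age": 25}]'
)

# jq: map(select(.index > 2))
selected = [p for p in SAMPLE if p["index"] > 2]

# jq: sort_by(.age)
by_age = sorted(SAMPLE, key=lambda p: p["age"])

# jq: sort_by(-.age)
by_age_desc = sorted(SAMPLE, key=lambda p: -p["age"])

# jq: sort_by(.age, .index)
by_age_then_index = sorted(SAMPLE, key=lambda p: (p["age"], p["index"]))

print([p["index"] for p in selected])
print([(p["age"], p["index"]) for p in by_age_then_index])
```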

Use Cases

curl -s https://www.githubstatus.com/api/v2/status.json
curl -s https://www.githubstatus.com/api/v2/status.json | jq '.'
curl -s https://www.githubstatus.com/api/v2/status.json | jq '.status'

#jq #tools #json #parser #cli #automation

[Avg. reading time: 1 minute]

YQ

YQ is a command-line tool to read, query, transform, and write YAML.

It's like jq for YAML. Written in Go, single binary, fast.

YAML files are widely used by tools such as Kubernetes, Docker Compose, and GitHub Actions.

YQ helps engineers parse a YAML file and extract the values they need. The output can also be converted to JSON.

YQ Installation

YQ Documentation & Usage

[Avg. reading time: 0 minutes]

Miscellaneous

  1. Additional Reading
  2. Good Reads
  3. Roadmap Data Engineer
  4. Notebooks vs IDE

[Avg. reading time: 5 minutes]

Additional Reading

Note 1: LinkedIn Learning is Free for Rowan Students.

Rowan LinkedIn Learning


Additional Learning 1 - Python

Additional Learning 2 - Learning Git and GitHub

Additional Learning 3 - Python Classes & Functions

Additional Learning 4 - Github Codespaces

Additional Learning 5 - Cloud

Certification

AI Tools

#free #linkedinlearning #certification

[Avg. reading time: 3 minutes]

Good Reads

Videos

ByteByteGo

It’s a very, very useful YT channel.

https://www.youtube.com/@ByteByteGo/videos

Loaded with lots and lots of useful information.


Career Path

RoadMap

Example: RoadMap for Python Learning


Cloud Providers

Run and write Python code in the cloud. The free and affordable plans are good for demonstrations during interviews.

Python Anywhere


Cheap/Affordable GPUs for AI Workloads

RunPod


AI Tools

NotebookLM


Job Search Tips

Job Search Guide

Communication Skills for IT

Choose the Right Data Role


[Avg. reading time: 1 minute]

Roadmap - Data Engineer

src: https://www.linkedin.com/in/pooja-jain-898253106/

[Avg. reading time: 3 minutes]

Notebooks vs IDE

| Feature | Notebooks (.ipynb) | Python Scripts (.py) |
|---|---|---|
| Use Case - DE | Quick prototyping, visualizing intermediate steps | Production-grade ETL, orchestration scripts |
| Use Case - DS | EDA, model training, visualization | Packaging models, deployment scripts |
| Interactivity | High – ideal for step-by-step execution | Low – executed as a whole |
| Visualization | Built-in (matplotlib, seaborn, plotly support) | Needs explicit code to save/show plots |
| Version Control | Harder to diff and merge | Easy to diff/merge in Git |
| Reusability | Lower, unless modularized | High – can be organized into functions, modules |
| Execution Context | Cell-based execution | Linear, top-to-bottom |
| Production Readiness | Poor (unless using tools like Papermill, nbconvert) | High – standard for CI/CD & Airflow etc. |
| Debugging | Easy with cell-wise changes | Needs breakpoints/logging |
| Integration | Jupyter, Colab, Databricks Notebooks | Any IDE (VSCode, PyCharm), scheduler integration |
| Documentation & Teaching | Markdown + code | Docstrings and comments only |
| Unit Tests | Not practical | Easily written using pytest, unittest |
| Package Management | Ad hoc, via %pip, %conda | Managed via requirements.txt, poetry, pipenv |
| Using Libraries | Easy for experimentation, auto-reloads supported | Cleaner imports, better for dependency resolution |

Tags

abs

/Protocol/Idempotency

ai

/Big Data Overview/Trending Technologies

amazonprime

/Protocol/Monolithic Architecture

amd

/Containers/CPU Architecture Fundamentals

analysis

/Big Data Overview/How does it help?

api

/Protocol/API Performance

apiinbigdata

/Protocol/API in Big Data world

architecture

/Containers/CPU Architecture Fundamentals

arm

/Containers/CPU Architecture Fundamentals

armtemplates

/Cloud Computing/Azure Cloud/IaC

arrow

/Data Format/Arrow

automation

/CLI Tools/JQ

availability

/Big Data Overview/CAP Theorem

avro

/Data Format/Avro

awk

/CLI Tools/AWK

aws

/Cloud Computing/Multi-Cloud

az

/Cloud Computing/Azure Cloud

azure

/Cloud Computing/Azure Cloud

/Cloud Computing/Azure Cloud/Services

/Cloud Computing/Azure Cloud/Terraform

/Cloud Computing/Introduction

/Cloud Computing/Multi-Cloud

azurecli

/Cloud Computing/Azure Cloud/Azure Access

banking

/Protocol/Monolithic Architecture

batch

/Data Engineering/Batch vs Streaming

bigdata

/Big Data Overview

/Big Data Overview/Big Data Challenges

/Big Data Overview/Big Data Concerns

/Big Data Overview/Big Data Tools

/Big Data Overview/Eventual Consistency

/Big Data Overview/How does it help?

/Big Data Overview/Introduction

/Big Data Overview/Job Opportunities

/Big Data Overview/Learning Big Data means?

/Big Data Overview/Optimistic Concurrency

/Big Data Overview/The Big V's

/Big Data Overview/The Big V's/Other V's

/Big Data Overview/The Big V's/Variety

/Big Data Overview/The Big V's/Velocity

/Big Data Overview/The Big V's/Veracity

/Big Data Overview/The Big V's/Volume

/Big Data Overview/Trending Technologies

/Big Data Overview/What is Data?

/Data Format/CSV-TSV

/Data Format/Introduction

/Data Format/JSON

/Data Format/Parquet

bigv

/Big Data Overview/The Big V's

/Big Data Overview/The Big V's/Variety

/Big Data Overview/The Big V's/Velocity

/Big Data Overview/The Big V's/Veracity

/Big Data Overview/The Big V's/Volume

binary

/Big Data Overview/The Big V's/Variety

bronze

/Data Engineering/Medallion Architecture

calc

/CICD/CICD Tools

cap

/Big Data Overview/CAP Theorem

cd

/CICD/CD Yaml

/CICD/Introduction

certification

/Miscellaneous/Additional Reading

challenges

/Cloud Computing/Challenges

chapter1

/Big Data Overview

ci

/CICD/Introduction

cicd

/CICD/CICD Tools

/CICD/Introduction

/Protocol/Microservices

cli

/CLI Tools/JQ

/CLI Tools/Linux Commands 01

/CLI Tools/Linux Commands 02

/Data Format/Duck DB

cloud

/Big Data Overview/Big Data Tools

/Cloud Computing/Challenges

/Cloud Computing/Introduction

cloudnative

/Cloud Computing/Cloud Native vs Lift-and-Shift

columnar

/Big Data Overview/NO Sql Databases

/Data Format/Parquet

commands

/CLI Tools/Linux Commands 01

compressed

/Data Format/Parquet

concerns

/Big Data Overview/Big Data Concerns

concurrent

/Big Data Overview/Concurrent vs Parallel

connectionpool

/Protocol/API Performance

consistency

/Big Data Overview/CAP Theorem

container

/Containers/Docker

/Containers/Docker Examples

/Containers/VMs or Containers

/Containers/What Container does

containers

/Containers/Introduction

continuous

/Big Data Overview/Types of Data

csv

/Data Format/CSV-TSV

csvkit

/CLI Tools/CSV SQL

csvsql

/CLI Tools/CSV SQL

curl

/Protocol/REST API

dask

/Advanced Python/Data Frames

data

/Big Data Overview/What is Data?

dataengineering

/Data Engineering/Introduction

dataformat

/Data Format/Arrow

/Data Format/CSV-TSV

/Data Format/Introduction

/Data Format/JSON

/Data Format/Parquet

/Data Format/YAML

datalake

/Big Data Overview/Data Integration

datamesh

/Data Engineering/Data Mesh

dataquality

/Big Data Overview/Big Data Challenges

/Data Engineering/Data Quality Checks

dbaas

/Cloud Computing/Types of Cloud Services

decorator

/Advanced Python/Decorator

devcontainer

/Setup Workspace

devops

/Cloud Computing/Azure Cloud/Terraform

discrete

/Big Data Overview/Types of Data

distributed

/Big Data Overview/Scaling

docker

/Containers/Docker

/Containers/Docker Examples

/Containers/Introduction

/Containers/VMs or Containers

/Containers/What Container does

dockerhub

/Containers/Docker Examples

documentdb

/Big Data Overview/NO Sql Databases

domain

/Big Data Overview/DSL

domainownership

/Data Engineering/Data Mesh

dr

/Cloud Computing/Disaster Recovery (DR)

dsl

/Big Data Overview/DSL

duckdb

/Data Format/Duck DB

elt

/Big Data Overview/Data Integration

errorhandling

/Advanced Python/Error Handling

ethics

/Big Data Overview/Big Data Challenges

etl

/Big Data Overview/Data Integration

eventhubs

/Data Engineering/KAFKA/Different types of streaming

eventualconsistency

/Big Data Overview/Eventual Consistency

exception

/Advanced Python/Error Handling

faas

/Cloud Computing/Types of Cloud Services

free

/Miscellaneous/Additional Reading

funnel

/Data Engineering/Data Engineering Model

gcp

/Cloud Computing/Multi-Cloud

gdpr

/Data Engineering/Quality & Governance

get

/Protocol/HTTP

gitbash

/CLI Tools/Linux Commands 01

github

/CICD/CI Yaml

githubactions

/CICD/CI Yaml

/CICD/CICD Tools

gold

/Data Engineering/Medallion Architecture

governance

/Data Engineering/Quality & Governance

gpl

/Big Data Overview/GPL

graphdb

/Big Data Overview/NO Sql Databases

grpc

/Protocol/Introduction

ha

/Cloud Computing/High Availability

hierarchical

/Data Format/JSON

highavailability

/Cloud Computing/High Availability

horizontal

/Big Data Overview/Scaling

html

/Big Data Overview/DSL

http

/Protocol/HTTP

/Protocol/Introduction

hub

/Containers/Docker

iaas

/Cloud Computing/Types of Cloud Services

iac

/Cloud Computing/Azure Cloud/IaC

/Cloud Computing/Azure Cloud/Idempotency

idempotency

/Cloud Computing/Azure Cloud/Idempotency

idempotent

/Protocol/Idempotency

image

/Big Data Overview/The Big V's/Variety

info

/Advanced Python/Logging

interoperability

/Big Data Overview/Big Data Challenges

introduction

/Big Data Overview

iot

/Big Data Overview/Trending Technologies

jobs

/Big Data Overview/Job Opportunities

jq

/CLI Tools/JQ

json

/Big Data Overview/The Big V's/Variety

/CLI Tools/JQ

/Data Format/JSON

jwt

/Protocol/Statelessness

kafka

/Big Data Overview/Big Data Tools

/Data Engineering/Batch vs Streaming

/Data Engineering/KAFKA/Introduction

/Data Engineering/KAFKA/Kafka Software

/Data Engineering/KAFKA/Python Scripts

/Data Engineering/KAFKA/Use Cases

/Protocol/API in Big Data world

keyvalue

/Big Data Overview/NO Sql Databases

kinesis

/Data Engineering/KAFKA/Different types of streaming

/Data Engineering/KAFKA/Kafka Software

knowledge

/Big Data Overview/How does it help?

learning

/Big Data Overview/Learning Big Data means?

library

/CLI Tools/AWK

lift

/Cloud Computing/Cloud Native vs Lift-and-Shift

linkedinlearning

/Miscellaneous/Additional Reading

lint

/Developer Tools/Other Python Tools

linux

/CLI Tools/Linux Commands 01

/CLI Tools/Linux Commands 02

loadbalancer

/Protocol/Statefulness

loadbalancing

/Protocol/API Performance

logging

/Advanced Python/Logging

medallion

/Data Engineering/Medallion Architecture

memoization

/Advanced Python/Decorator

merge

/Protocol/Idempotency

microservices

/Protocol/Microservices

mitigation

/Big Data Overview/Big Data Concerns

monolithic

/Protocol/Monolithic Architecture

mqtt

/Protocol/Introduction

multicloud

/Cloud Computing/Multi-Cloud

mypy

/Developer Tools/Other Python Tools

nominal

/Big Data Overview/Types of Data

nosql

/Big Data Overview/NO Sql Databases

optimistic

/Big Data Overview/Optimistic Concurrency

oracle

/Cloud Computing/Multi-Cloud

ordinal

/Big Data Overview/Types of Data

otherv

/Big Data Overview/The Big V's/Other V's

overview

/Big Data Overview/Introduction

/Cloud Computing/Introduction

paas

/Cloud Computing/Types of Cloud Services

pagination

/Protocol/API Performance

pandas

/Advanced Python/Data Frames

parallelprogramming

/Big Data Overview/Concurrent vs Parallel

parquet

/Data Format/Duck DB

/Data Format/Parquet

parser

/CLI Tools/JQ

partitiontolerant

/Big Data Overview/CAP Theorem

pep

/Developer Tools/Other Python Tools

performance

/Protocol/API Performance

pipeline

/Big Data Overview/Data Integration

/Data Engineering/Introduction

poetry

/Developer Tools/Introduction

polars

/Advanced Python/Data Frames

post

/Protocol/HTTP

privacy

/Big Data Overview/Big Data Challenges

protocols

/Protocol/Introduction

pubsub

/Data Engineering/KAFKA/Kafka Software

put

/Protocol/HTTP

pytest

/Advanced Python/Unit Testing

python

/Big Data Overview/GPL

/Data Engineering/KAFKA/Python Scripts

/Developer Tools/Introduction

qualitative

/Big Data Overview/Types of Data

quantitative

/Big Data Overview/Types of Data

rawdata

/Big Data Overview/Data Integration

/Big Data Overview/How does it help?

rdbms

/Data Format/Introduction

realtime

/Big Data Overview/Big Data Challenges

/Data Engineering/Batch vs Streaming

/Data Engineering/KAFKA/Introduction

region

/Cloud Computing/Azure Cloud

repositories

/Containers/Docker

requests

/Protocol/REST API

rest

/Protocol/REST API

/Protocol/Statelessness

restapi

/Protocol/Microservices

/Protocol/REST API

robotics

/Big Data Overview/Trending Technologies

rowbased

/Data Format/Avro

rpo

/Cloud Computing/Disaster Recovery (DR)

/Cloud Computing/RTO - RPO

rto

/Cloud Computing/Disaster Recovery (DR)

/Cloud Computing/RTO - RPO

ruff

/Developer Tools/Other Python Tools

rust

/Big Data Overview/GPL

/Developer Tools/UV

saas

/Cloud Computing/Types of Cloud Services

scaling

/Big Data Overview/Scaling

schemadrift

/Data Engineering/Data Quality Checks

sdk

/Cloud Computing/Azure Cloud/Azure Access

selfservice

/Data Engineering/Data Mesh

semistructured

/Big Data Overview/The Big V's/Variety

sequence

/Data Engineering/Data Engineering Model

services

/Cloud Computing/Azure Cloud/Services

setup

/Setup Workspace

shift

/Cloud Computing/Cloud Native vs Lift-and-Shift

silver

/Data Engineering/Medallion Architecture

singlefiledatabase

/Data Format/Duck DB

softwares

/Data Engineering/KAFKA/Kafka Software

spark

/Big Data Overview/Big Data Tools

/Protocol/API in Big Data world

sql

/Big Data Overview/DSL

starmodel

/Data Engineering/Data Engineering Model

stateful

/Data Engineering/KAFKA/Different types of streaming

/Protocol/Statefulness

stateless

/Data Engineering/KAFKA/Different types of streaming

statelessness

/Protocol/Statelessness

statuscodes

/Protocol/HTTP

stickiness

/Protocol/Statefulness

storage

/Big Data Overview/Big Data Challenges

streaming

/Data Engineering/Batch vs Streaming

structured

/Big Data Overview/The Big V's/Variety

technologies

/Big Data Overview/Trending Technologies

teraform

/Protocol/Idempotency

terraform

/Cloud Computing/Azure Cloud/IaC

/Cloud Computing/Azure Cloud/Terraform

textbased

/CLI Tools/AWK

tools

/Big Data Overview/Big Data Tools

/CLI Tools/JQ

/Data Format/Duck DB

traditionaldata

/Big Data Overview/What is Data?

try

/Advanced Python/Error Handling

tsv

/Data Format/CSV-TSV

unittesting

/Advanced Python/Unit Testing

unstructured

/Big Data Overview/The Big V's/Variety

upsert

/Protocol/Idempotency

usecases

/Data Engineering/KAFKA/Use Cases

uv

/Developer Tools/Introduction

/Developer Tools/UV

validation

/Data Engineering/Data Quality Checks

validity

/Big Data Overview/The Big V's/Other V's

value

/Big Data Overview/The Big V's/Other V's

velocity

/Big Data Overview/The Big V's/Velocity

venv

/Developer Tools/Introduction

/Developer Tools/UV

veracity

/Big Data Overview/The Big V's/Veracity

version

/Big Data Overview/The Big V's/Other V's

vertical

/Big Data Overview/Scaling

vm

/Containers/VMs or Containers

volume

/Big Data Overview/The Big V's/Volume

webportal

/Cloud Computing/Azure Cloud/Azure Access

worksforme

/Containers/What Container does

workspace

/Setup Workspace

xml

/Big Data Overview/The Big V's/Variety

yaml

/CICD/CD Yaml

/CICD/CI Yaml

/Data Format/YAML

yq

/Data Format/YAML