Data Engineering 101: A Complete Beginner's Guide¶

Welcome to the world of data engineering. If you've never heard this term before, that's perfectly fine. This guide assumes you're starting from absolute zero. We'll explain every concept, every term, and every idea from the ground up.

By the end of this guide, you'll understand what data engineers do, why their work matters, and how tools like Odibi help solve real problems.

What is Data?¶

Before we talk about "data engineering," let's talk about "data." You interact with data every single day, even if you don't realize it.

Data is Just Information Stored Somewhere¶

When you check your bank account online, you see a list of transactions. That list—the dates, amounts, descriptions—is data. It's information that's been recorded and stored so you can look at it later.

When you shop online and receive an email receipt, that receipt contains data: what you bought, how much you paid, when the order was placed, where it's being shipped.

When you scroll through your phone's photo gallery, each photo has data attached to it: the date it was taken, the location (if GPS was on), the file size, the camera settings.

Data is simply recorded information. It can be numbers, text, dates, images, or anything else that can be stored.

Where Does Data Live?¶

Data needs a home. Just like you keep physical documents in folders and filing cabinets, digital data lives in specific places:

Files are the simplest form. A file is just a container for data sitting on a computer's hard drive: - A .csv file (Comma-Separated Values) is like a spreadsheet saved as plain text - A .json file stores data in a structured, nested format - A .txt file is just raw text - An .xlsx file is a Microsoft Excel spreadsheet

Databases are specialized software designed to store, organize, and retrieve large amounts of data efficiently. Think of a database as a super-powered filing cabinet that can find any document instantly, even if you have millions of them. Examples include PostgreSQL, MySQL, SQL Server, and Oracle.

APIs (Application Programming Interfaces) are not storage themselves, but they're how different systems share data. When a weather app on your phone shows today's forecast, it's calling a weather API to get that data. The API is like a waiter in a restaurant—you tell it what you want, and it brings it to you from the kitchen (the database).

Data Lakes and Warehouses are massive storage systems designed for analytics. A data lake stores raw data in its original format (like a lake that collects water from many streams). A data warehouse stores cleaned, organized data ready for analysis (like a warehouse with neatly arranged shelves).

Types of Data¶

Not all data is created equal. Data comes in three main flavors:

Structured Data¶

Structured data fits neatly into rows and columns, like a spreadsheet. Every row follows the same pattern. Every column has a specific type of information.

Example: A customer table

| customer_id | name          | email                | signup_date |
|-------------|---------------|----------------------|-------------|
| 1           | Alice Smith   | alice@email.com      | 2024-01-15  |
| 2           | Bob Johnson   | bob@email.com        | 2024-02-20  |
| 3           | Carol White   | carol@email.com      | 2024-03-10  |

This is structured data because: - Every customer has exactly the same fields - Each field has a consistent type (IDs are numbers, names are text, dates are dates) - You can easily search, sort, and filter this data

Most business data is structured: sales records, inventory counts, employee information, financial transactions.

Semi-Structured Data¶

Semi-structured data has some organization, but it's flexible. Not every record needs to have the same fields.

Example: JSON data from an API

{
  "customer_id": 1,
  "name": "Alice Smith",
  "email": "alice@email.com",
  "preferences": {
    "newsletter": true,
    "theme": "dark"
  },
  "tags": ["premium", "early-adopter"]
}

This is semi-structured because: - There's a clear organization (fields have names) - But the structure can vary (some customers might have a phone field, others might not) - Nested data is allowed (preferences contains its own sub-fields)

JSON (JavaScript Object Notation) and XML are common semi-structured formats. They're often used for web data and configuration files.

Unstructured Data¶

Unstructured data has no predefined format. It's raw and messy by nature.

Examples: - A paragraph of text from a customer support email - An image file - A video recording - A voice memo

Unstructured data is the hardest to work with because computers can't easily understand its meaning without special processing.

Data You Encounter Daily¶

To make this concrete, here are data examples from everyday life:

Everyday Activity	The Data Behind It
Checking your bank balance	Account numbers, transaction history, balances
Getting a grocery receipt	Product names, prices, quantities, totals, timestamps
Using GPS navigation	Location coordinates, road maps, traffic patterns
Streaming music	Song titles, artists, play counts, user preferences
Ordering food delivery	Menu items, prices, addresses, delivery times

Every digital interaction creates data. Data engineering is about managing, moving, and transforming all of this information so it can be useful.

What is Data Engineering?¶

Now that you understand what data is, let's talk about what data engineers actually do.

The "Plumbing" of the Data World¶

Imagine a city's water system. Water needs to flow from reservoirs, through treatment plants, into pipes, and eventually out of your faucet. Someone has to design those pipes, build them, maintain them, and make sure clean water arrives when you turn the handle.

Data engineering is the same idea, but for data instead of water.

Data engineers build and maintain the systems that move data from where it's created to where it's needed.

Just like you don't think about water pipes when you take a shower, most people don't think about data pipelines when they look at a dashboard. But those pipelines are essential. Without them, the data would never arrive.

What Data Engineers Actually Do¶

On a typical day, a data engineer might:

Build data pipelines: Create automated processes that move data from one place to another. For example, extracting sales data from an e-commerce system and loading it into a data warehouse.
Transform data: Clean, reshape, and combine data so it's ready for analysis. Raw data is often messy—data engineers make it usable.
Ensure data quality: Check that data is accurate, complete, and consistent. If bad data gets into a system, every report and decision based on it will be wrong.
Maintain infrastructure: Keep databases running, optimize queries that are too slow, and troubleshoot when something breaks.
Collaborate with analysts and scientists: Data engineers don't usually create the final reports or machine learning models—they provide the clean data that makes those things possible.

Data Engineering vs Data Science vs Data Analytics¶

These three roles are related but distinct:

Role	Focus	Key Question
Data Engineer	Building systems to move and prepare data	"How do we get clean data from A to B?"
Data Analyst	Creating reports and dashboards from data	"What happened? What does the data show?"
Data Scientist	Building predictive models and finding patterns	"What will happen? Can we predict X?"

Analogy: Think of a restaurant. - The Data Engineer is the chef who prepares the ingredients—washing vegetables, cutting meat, making sure everything is fresh and ready. - The Data Analyst is the cook who follows recipes to create dishes (reports) from those ingredients. - The Data Scientist is the chef who invents new recipes, experimenting with ingredients to create something new (predictions, recommendations).

You can't have great dishes without properly prepared ingredients. That's why data engineering is foundational.

A Real Job Scenario¶

Let's say you work at an online clothing retailer. The company wants to know which products are selling best in each region.

Without data engineering, someone would: 1. Log into the sales system manually 2. Export data to Excel 3. Spend hours cleaning up the data (fixing typos, removing duplicates) 4. Copy-paste data from different sources 5. Manually calculate totals 6. Repeat this process every week

With data engineering, you would: 1. Build a pipeline that automatically extracts sales data every night 2. Join it with product and region data 3. Clean and standardize everything automatically 4. Load it into a warehouse where analysts can query it instantly 5. The pipeline runs every night without human intervention

The first approach might work for a small company. But when you have millions of transactions, hundreds of products, and dozens of regions, manual work becomes impossible. That's when you need data engineering.

What is a Data Pipeline?¶

A data pipeline is an automated process that moves data from one place to another, often transforming it along the way.

The Factory Assembly Line Analogy¶

Imagine a car factory. Raw materials (steel, glass, rubber) enter at one end. They pass through various stations where workers and machines shape them, assemble them, and quality-check them. At the other end, finished cars roll off the line.

A data pipeline works the same way:

┌─────────────┐    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│   Raw Data  │───▶│  Station 1  │───▶│  Station 2  │───▶│ Final Data  │
│   (Input)   │    │  (Extract)  │    │ (Transform) │    │  (Output)   │
└─────────────┘    └─────────────┘    └─────────────┘    └─────────────┘

Input: The raw data enters the pipeline. This might be CSV files, API responses, database tables, or event streams.

Processing Stations: Each station does something specific to the data: - Clean up messy values - Convert data types (text to dates, for example) - Join data from multiple sources - Calculate new fields - Filter out irrelevant records

Output: Clean, transformed data arrives at its destination—ready to be used for reports, dashboards, or further analysis.

A Visual Explanation¶

Let's trace a specific piece of data through a pipeline:

                        DATA PIPELINE EXAMPLE
═══════════════════════════════════════════════════════════════════════

📦 INPUT: Raw sales data from a point-of-sale system
┌────────────────────────────────────────────────────────────────────┐
│ transaction_id,product,price,date,store                           │
│ T001,SHRT-BLU-M,$29.99,2024-03-15,Store #42                       │
│ T002,SHRT-BLU-M,29.99,15/03/2024,42                               │
│ T003,,29.99,2024-03-15,Store 42                                   │
└────────────────────────────────────────────────────────────────────┘
Notice the problems? Inconsistent date formats, missing product, 
dollar signs in some prices, store names formatted differently.

              │
              ▼

⚙️ STATION 1: Parse and Validate
- Convert all dates to YYYY-MM-DD format
- Remove currency symbols from prices
- Flag records with missing required fields

              │
              ▼

⚙️ STATION 2: Standardize
- Normalize store names to IDs
- Convert product codes to full names
- Ensure prices are decimal numbers

              │
              ▼

⚙️ STATION 3: Enrich
- Look up product category from product catalog
- Look up store region from store table
- Calculate sales tax based on region

              │
              ▼

✅ OUTPUT: Clean, enriched sales data
┌────────────────────────────────────────────────────────────────────┐
│ transaction_id │ product_name   │ price │ date       │ store_id │  │
│ T001           │ Blue Shirt (M) │ 29.99 │ 2024-03-15 │ 42       │  │
│ T002           │ Blue Shirt (M) │ 29.99 │ 2024-03-15 │ 42       │  │
│ T003           │ NULL           │ 29.99 │ 2024-03-15 │ 42       │  │ ← Quarantined
└────────────────────────────────────────────────────────────────────┘

The pipeline took messy, inconsistent input and produced clean, standardized output. It also flagged the problematic record (T003) so someone can investigate the missing product.

Why Manual Excel Work Doesn't Scale¶

You might wonder: "Can't I just do this in Excel?"

For small amounts of data, yes. But consider these scenarios:

Scenario	Manual Approach	Pipeline Approach
100 rows, once	✅ Totally fine	Overkill
100 rows, daily	😓 Tedious, error-prone	✅ Automated
100,000 rows	❌ Excel slows down	✅ No problem
10 million rows	❌ Excel crashes	✅ Still fine
1 billion rows	❌ Impossible	✅ Use distributed processing

Scale is one reason. But there are others:

Reproducibility: If you clean data manually in Excel, can you do it exactly the same way next time? What if you're sick and a colleague has to do it? With a pipeline, the logic is written in code—it runs the same way every time.

Audit trail: When something goes wrong with a report, you need to trace back to see what happened. With Excel, you'd have to remember every click and formula change. With a pipeline, you can review the code and logs.

Speed: A pipeline can run automatically at 3 AM while you sleep. Manual work requires a human to be present.

What Problems Do Pipelines Solve?¶

Let's get specific about the problems data pipelines address.

Problem 1: Messy Data¶

Real-world data is almost never clean. Here are common issues:

Inconsistent formatting:

# These all mean the same thing, but are written differently
"United States", "USA", "U.S.A.", "US", "united states", "UNITED STATES"

Duplicate records:

# Same customer entered twice
customer_id: 1001, name: "John Smith", email: "john@email.com"
customer_id: 1002, name: "John Smith", email: "john@email.com"

Invalid values:

# Age can't be negative
age: -5

# Birth date can't be in the future
birth_date: 2050-01-01

# Email missing @ symbol
email: "invalidemail.com"

Mixed data types:

# Price column has different formats
price: 29.99
price: "$29.99"
price: "29,99"  (European format)
price: "twenty-nine ninety-nine"

A pipeline includes validation and cleaning steps that catch and fix these issues automatically.

Problem 2: Manual Work and Human Error¶

Every time a human touches data manually, there's a chance for error:

Copy-pasting the wrong range
Forgetting to apply a filter
Typing a formula incorrectly
Saving over the wrong file
Using last month's template instead of the updated one

These mistakes are expensive. A single typo in a financial report could cost millions. A forgotten filter could lead to wrong business decisions.

💡 Automation removes human error from repetitive tasks. A pipeline runs the same logic every time, exactly as written.

Problem 3: Scale¶

Consider this: A small retail store might have 100 transactions per day. That's 36,500 transactions per year—manageable in Excel.

Now consider Amazon. They process roughly 8,500 transactions per second. That's 734 million transactions per day. 268 billion per year.

You cannot process 268 billion rows in Excel. You need distributed computing systems that can split the work across hundreds or thousands of machines. Data pipelines built with tools like Apache Spark are designed for this scale.

Even at smaller scales, performance matters. If generating a report takes 4 hours manually but 4 minutes with a pipeline, that's meaningful time savings.

Problem 4: Reproducibility¶

Imagine this scenario:

"Hey, the revenue numbers in last month's report look wrong. Can you re-run the analysis?"

If you did it manually in Excel: - Do you still have the original files? - Did you document every step? - Can you replicate exactly what you did?

If you have a pipeline: - Run the pipeline with last month's parameters - Get the same result - Compare to find the discrepancy

Reproducibility is essential for trust. If you can't reproduce a result, how do you know it was right in the first place?

Problem 5: Freshness¶

Business moves fast. Executives want to see yesterday's numbers this morning, not next week.

Manual processes have inherent delays: - Someone has to be available to run them - They can only run during business hours - If something fails, you might not notice until someone complains

Automated pipelines can: - Run on a schedule (every night at 2 AM) - Run on triggers (whenever new data arrives) - Alert you immediately if something fails - Retry automatically after transient failures

ETL vs ELT¶

You'll often hear these acronyms in data engineering. They describe two different approaches to moving and transforming data.

ETL: Extract, Transform, Load¶

ETL is the traditional approach. It works like this:

┌──────────────┐         ┌───────────────┐         ┌───────────────┐
│    SOURCE    │────────▶│   TRANSFORM   │────────▶│  DESTINATION  │
│   Systems    │ Extract │   (Outside)   │  Load   │  (Warehouse)  │
└──────────────┘         └───────────────┘         └───────────────┘

Extract: Pull data from source systems (databases, APIs, files)
Transform: Clean and reshape the data before loading it
Load: Insert the transformed data into the destination

Why ETL? In the past, storage and computing power in data warehouses were expensive. It made sense to clean and shrink the data before loading it—you'd only store what you actually needed.

ETL Example:

Source: 10 million raw transaction records
Transform: Filter to only completed orders, aggregate by day
Load: 365 daily summary records into the warehouse

You've reduced 10 million records to 365, saving storage and making queries faster.

ELT: Extract, Load, Transform¶

ELT is the modern approach, enabled by cheap cloud storage and powerful cloud warehouses.

┌──────────────┐         ┌───────────────┐         ┌───────────────┐
│    SOURCE    │────────▶│  DESTINATION  │────────▶│   TRANSFORM   │
│   Systems    │ Extract │  (Warehouse)  │  Load   │   (Inside)    │
└──────────────┘         └───────────────┘         └───────────────┘

Extract: Pull data from source systems
Load: Load the raw data into the destination first
Transform: Transform the data inside the warehouse using SQL

Why ELT? Modern cloud warehouses (Snowflake, BigQuery, Databricks) can store huge amounts of data cheaply and process it incredibly fast. Loading raw data first gives you: - A complete historical record (nothing is lost) - Flexibility to transform in different ways later - The ability to re-transform if requirements change

ELT Example:

Source: 10 million raw transaction records
Load: All 10 million records into the warehouse (raw layer)
Transform: Create views or tables for different purposes
  - Daily summaries for executives
  - Detailed records for fraud analysis
  - Customer-level aggregations for marketing

Visual Comparison¶

                    ETL (Traditional)
═══════════════════════════════════════════════════════════════

    SOURCE              ETL SERVER              WAREHOUSE
  ┌─────────┐          ┌─────────┐           ┌─────────┐
  │ Sales   │          │         │           │         │
  │ System  │────────▶ │ Clean   │─────────▶ │ Clean   │
  │         │          │ Filter  │           │ Data    │
  └─────────┘          │ Aggregate           └─────────┘
  ┌─────────┐          │         │
  │ CRM     │────────▶ │         │
  └─────────┘          └─────────┘

  ⚠️ Transformation happens OUTSIDE the warehouse
  ⚠️ Raw data is discarded after transformation


                    ELT (Modern)
═══════════════════════════════════════════════════════════════

    SOURCE              WAREHOUSE                WAREHOUSE
  ┌─────────┐          ┌─────────┐           ┌─────────┐
  │ Sales   │          │ RAW     │           │ CLEAN   │
  │ System  │────────▶ │ Layer   │─────────▶ │ Layer   │
  │         │          │ (All    │   SQL     │ (Views  │
  └─────────┘          │  Data)  │  Transforms  & Tables)
  ┌─────────┐          │         │           │         │
  │ CRM     │────────▶ │         │           │         │
  └─────────┘          └─────────┘           └─────────┘

  ✅ Raw data is preserved
  ✅ Transformations use warehouse compute power
  ✅ Can re-transform any time

When to Use Which¶

Consideration	ETL	ELT
Warehouse storage is expensive	✅ Preferred
Warehouse compute is powerful		✅ Preferred
Need to preserve raw data		✅ Preferred
Complex transformations before loading	✅ Preferred
Privacy/compliance requires filtering before loading	✅ Preferred
Want flexibility to change transformations later		✅ Preferred

How Odibi Supports Both¶

Odibi is designed to work with either approach:

For ETL workflows: You can define transformation patterns that run during the extract phase, cleaning and reshaping data before it reaches your destination.

For ELT workflows: You can use Odibi to manage the "E" and "L" (extracting from sources and loading to raw layers), then define separate transformation jobs that run inside your warehouse.

Hybrid approaches: Many real-world pipelines use both. You might: 1. Extract data from a source 2. Do light cleaning (ETL-style) to fix obvious issues 3. Load to a raw layer 4. Do heavy transformations (ELT-style) in the warehouse

Odibi's pattern-based approach lets you mix and match as needed.

What is a Schema? Why Does It Matter?¶

A schema is the structure or blueprint of your data. It defines what fields exist, what types they are, and how they relate to each other.

The Spreadsheet Headers Analogy¶

When you create a spreadsheet, the first row usually contains headers:

| Name       | Age | Email              | Salary   |
|------------|-----|--------------------|----------|
| Alice      | 30  | alice@email.com    | 75000    |
| Bob        | 25  | bob@email.com      | 65000    |

Those headers are a simple schema. They tell you: - What each column means - In what order columns appear

A database schema goes further: - Name: Text, maximum 100 characters, cannot be empty - Age: Integer, must be between 0 and 150 - Email: Text, must contain "@", must be unique - Salary: Decimal number, cannot be negative

Type Safety: Why It Matters¶

Every piece of data has a type. Common types include:

Type	What It Stores	Examples
String/Text	Letters, words, sentences	"Hello", "Product-123"
Integer	Whole numbers	1, 42, -7, 1000000
Float/Decimal	Numbers with decimals	3.14, 29.99, -0.5
Boolean	True or false	true, false
Date	Calendar dates	2024-03-15
Timestamp	Date and time	2024-03-15 10:30:00

Why does this matter?

Imagine someone puts text in a number column:

| product_id | quantity |
|------------|----------|
| P001       | 10       |
| P002       | 5        |
| P003       | "TBD"    |  ← This is text, not a number!

Now try to calculate total quantity: 10 + 5 + "TBD" = ???

The calculation fails because you can't add a number to text. This is a type mismatch, and it breaks things.

A schema with proper types prevents this. If the quantity column is defined as an integer, the system will reject "TBD" when someone tries to insert it.

What Happens When Schemas Don't Match¶

Schema mismatches are a major source of pipeline failures. Here are common scenarios:

Scenario 1: Source adds a new column

# Your schema expects:
name, email, signup_date

# Source now sends:
name, email, signup_date, phone_number  ← New column!

Your pipeline might fail because it doesn't know what to do with phone_number.

Scenario 2: Source changes a column type

# Previously:
order_id: 12345 (integer)

# Now:
order_id: "ORD-12345" (string)  ← Type changed!

Your pipeline expects an integer but receives a string. Crash.

Scenario 3: Source renames a column

# Previously:
customer_name

# Now:
cust_name  ← Different name, same data

Your pipeline looks for customer_name, finds nothing, and fails.

Schema Evolution¶

Schemas change over time. Business requirements evolve. New data becomes available. This is called schema evolution, and it's one of the trickiest parts of data engineering.

Good data engineering practices handle schema evolution gracefully:

Additive changes (new columns): Usually safe. Ignore new columns or add them as optional.
Removal changes (dropped columns): Dangerous. If your pipeline depends on that column, it will break.
Type changes: Very dangerous. Requires careful migration.
Rename changes: Dangerous. Requires mapping old names to new names.

Odibi helps with schema evolution by: - Validating incoming data against expected schemas - Providing clear error messages when schemas don't match - Supporting schema transformation patterns (renaming, type conversion)

Data Quality Basics¶

"Garbage in, garbage out" is the oldest saying in data. If the data going into your pipeline is bad, the output will be bad too—no matter how sophisticated your transformations are.

The Five Pillars of Data Quality¶

1. Completeness: Are required values present?¶

Nulls (missing values) are the most common data quality issue.

| order_id | customer_id | total   |
|----------|-------------|---------|
| 1001     | C100        | 59.99   |
| 1002     | NULL        | 29.99   |  ← Who placed this order?
| 1003     | C102        | NULL    |  ← How much was it?

Some nulls are acceptable (an optional phone number might be null). Others are critical failures (an order without a total is useless).

Questions to ask: - Which fields are required? - What should happen when a required field is null? - Should we reject the record? Fill in a default? Quarantine for review?

2. Uniqueness: Are there duplicate records?¶

Duplicates inflate counts and totals, leading to wrong conclusions.

| transaction_id | amount |
|----------------|--------|
| T001           | 100.00 |
| T001           | 100.00 |  ← Duplicate!
| T002           | 50.00  |

Total without duplicates: $150.00
Total with duplicates: $250.00  ← Wrong!

Questions to ask: - What makes a record unique? (Usually a primary key) - How did duplicates get created? (Fix the source if possible) - How do we handle duplicates? (Keep first? Keep last? Merge?)

3. Validity: Are values within acceptable ranges?¶

Invalid values don't make sense in context.

# Invalid examples:
age: -5              # Age cannot be negative
temperature: 500°C   # Unless it's a volcano, this is wrong
discount: 150%       # Can't discount more than 100%
birth_date: 2099-01-01  # This date hasn't happened yet
status: "Pneding"    # Typo in status value

Questions to ask: - What are the valid ranges for numeric fields? - What are the valid values for categorical fields? - What date ranges make sense?

4. Consistency: Does the same thing have the same representation?¶

Inconsistent data is technically valid but difficult to work with.

# Inconsistent country names:
"United States"
"USA"
"U.S.A."
"United States of America"
"US"

# Inconsistent date formats:
"2024-03-15"
"03/15/2024"
"15-Mar-2024"
"March 15, 2024"

All of these mean the same thing, but a computer sees them as different values. Reports will show separate counts for "USA" and "United States" when they should be combined.

Solution: Standardize during transformation. Pick one format and convert everything to it.

5. Referential Integrity: Do relationships make sense?¶

When records refer to other records, those references should be valid.

# Orders table:
| order_id | customer_id |
|----------|-------------|
| 1001     | C100        |
| 1002     | C999        |  ← This customer doesn't exist!

# Customers table:
| customer_id | name    |
|-------------|---------|
| C100        | Alice   |
| C101        | Bob     |

Order 1002 references customer C999, but that customer isn't in the customers table. This is called an orphan record—it has no parent.

Why it matters: - You can't look up the customer name for order 1002 - Reports will show orders without customer information - Joins between tables will lose data

"Garbage In, Garbage Out" in Practice¶

Imagine building a dashboard showing "Average Order Value by Region."

If your data has: - ❌ Missing order totals → Averages will be wrong - ❌ Duplicate orders → Totals will be inflated - ❌ Invalid region codes → Some orders won't be assigned to any region - ❌ Inconsistent region names → "West Coast" and "WEST COAST" counted separately

Your dashboard will show wrong numbers, and executives will make wrong decisions based on it.

💡 Data quality is not optional. It's the foundation that everything else depends on.

Odibi includes built-in validation patterns to catch these issues before they pollute your data.

The Medallion Architecture (Bronze/Silver/Gold)¶

The Medallion Architecture is a design pattern for organizing data in layers. Each layer has a specific purpose and quality level.

The Manufacturing Analogy¶

Think of a factory that turns raw materials into finished products:

┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│    RAW ORE      │    │  REFINED METAL  │    │ FINISHED PRODUCT│
│   (Bronze)      │───▶│    (Silver)     │───▶│     (Gold)      │
│                 │    │                 │    │                 │
│ Rocks, dirt,    │    │ Pure metal      │    │ Jewelry, tools  │
│ impurities      │    │ ingots          │    │ ready to sell   │
└─────────────────┘    └─────────────────┘    └─────────────────┘

Data flows through similar stages, becoming more refined at each step.

Visual Diagram of the Medallion Architecture¶

═══════════════════════════════════════════════════════════════════════════
                        MEDALLION ARCHITECTURE
═══════════════════════════════════════════════════════════════════════════

   SOURCES                BRONZE               SILVER               GOLD
  (External)            (Raw Data)          (Cleaned)          (Business-Ready)
┌──────────────┐      ┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│ Sales API    │─────▶│             │     │             │     │             │
├──────────────┤      │ Exact copy  │     │ Validated   │     │ Aggregated  │
│ CRM Database │─────▶│ of source   │────▶│ Cleaned     │────▶│ Joined      │
├──────────────┤      │ data        │     │ Standardized│     │ Business    │
│ CSV Files    │─────▶│             │     │             │     │ Metrics     │
├──────────────┤      │ No changes  │     │ Quality     │     │             │
│ Event Stream │─────▶│ whatsoever  │     │ checked     │     │ Ready for   │
└──────────────┘      └─────────────┘     └─────────────┘     │ dashboards  │
                                                               └─────────────┘

                      📦 STORAGE:          📦 STORAGE:        📦 STORAGE:
                      Everything           Reliable data      Curated datasets
                      (even garbage)       for analysis       for specific uses

What Belongs in Each Layer?¶

📦 Bronze Layer (Raw)¶

Purpose: Preserve the original data exactly as received.

Contents: - Exact copies of source data - No transformations - No cleaning - No filtering

Why keep raw data? - If something goes wrong, you can reprocess from scratch - If requirements change, you have the original data - For auditing and compliance, you can prove what was received

Example Bronze table:

| _raw_json                                          | _source   | _loaded_at          |
|----------------------------------------------------|-----------|---------------------|
| {"id":1,"name":"Alice","date":"2024-03-15"}       | crm_api   | 2024-03-15 10:00:00 |
| {"id":2,"name":"Bob","date":"15/03/2024"}         | crm_api   | 2024-03-15 10:00:00 |
| {"id":"X","name":null,"date":"invalid"}           | crm_api   | 2024-03-15 10:00:00 |

Even the bad data (ID "X" with null name and invalid date) is preserved.

🥈 Silver Layer (Cleaned)¶

Purpose: Validated, cleaned, standardized data ready for analysis.

Contents: - Data types enforced (dates are dates, numbers are numbers) - Invalid records filtered out or quarantined - Formats standardized (consistent date formats, normalized text) - Duplicates removed - Basic business logic applied

Example Silver table:

| id   | name   | registration_date | _is_valid | _quality_score |
|------|--------|-------------------|-----------|----------------|
| 1    | Alice  | 2024-03-15        | true      | 100            |
| 2    | Bob    | 2024-03-15        | true      | 100            |

Notice the bad record from Bronze didn't make it to Silver. It was quarantined because it had invalid values.

🥇 Gold Layer (Business-Ready)¶

Purpose: Aggregated, joined, business-specific datasets for end users.

Contents: - Pre-calculated metrics - Data from multiple Silver tables joined together - Optimized for specific use cases (dashboards, reports, ML models) - Business terminology (not technical column names)

Example Gold table (Daily Sales Summary):

| report_date | region      | total_orders | total_revenue | avg_order_value |
|-------------|-------------|--------------|---------------|-----------------|
| 2024-03-15  | Northeast   | 1,234        | $98,456.78    | $79.78          |
| 2024-03-15  | Southeast   | 987          | $76,543.21    | $77.55          |
| 2024-03-15  | Midwest     | 1,567        | $123,456.89   | $78.78          |

This is what business users see in their dashboards. They don't need to know about raw JSON or data cleaning—they just want the numbers.

Why This Pattern Exists¶

Separation of concerns: Each layer has one job. Bronze preserves. Silver cleans. Gold serves.
Reprocessing: If cleaning logic changes, reprocess Silver from Bronze. If aggregation logic changes, reprocess Gold from Silver.
Debugging: When something looks wrong in Gold, you can trace back to Silver, then Bronze, to find where the problem started.
Performance: Gold tables are pre-aggregated, so dashboards load instantly instead of calculating on the fly.
Access control: Raw data in Bronze might be sensitive. You can restrict access to Bronze while allowing wider access to sanitized Gold data.

How Odibi Fits Into the Ecosystem¶

There are many tools in the data engineering world. Here's how Odibi compares.

Odibi vs dbt¶

dbt (data build tool) is popular for transforming data inside a warehouse using SQL.

Aspect	dbt	Odibi
Primary language	SQL	Python (with YAML config)
Transform location	Inside the warehouse	Flexible (Spark, Pandas, Polars)
Focus	Transformations only	Full pipeline patterns
Approach	Write SQL, dbt manages	Declarative patterns

When to use dbt: Your team is SQL-strong, your data is already in a warehouse, and you want a SQL-centric workflow.

When to use Odibi: You need to work with data before it reaches the warehouse, want Python flexibility, or prefer pattern-based configuration over writing SQL.

Odibi vs Airflow¶

Airflow is an orchestrator—it schedules and monitors when jobs run, but it doesn't define what those jobs do.

Aspect	Airflow	Odibi
Purpose	Orchestration (when)	Transformation (what)
What it does	Schedules tasks, manages dependencies	Defines data transformations
Scope	Runs any task (data, ML, DevOps)	Data pipelines specifically

Relationship: You can use Airflow to schedule Odibi jobs. Airflow says "run this at 3 AM," Odibi says "here's what to do when you run."

Odibi vs Raw Spark/Pandas¶

Spark and Pandas are processing engines. They're extremely powerful but require you to write all the logic yourself.

Aspect	Raw Spark/Pandas	Odibi
Approach	Write code for everything	Use pre-built patterns
SCD2 logic	Write 50+ lines of code	One config option
Data quality	Build it yourself	Built-in validation
Learning curve	Steep	Gentler

Odibi adds patterns on top of these engines. Instead of writing the same boilerplate code for SCD2 (Slowly Changing Dimension Type 2) over and over, you declare what you want in YAML and Odibi generates the code.

When to use raw Spark/Pandas: You need custom logic that doesn't fit any pattern, or you're learning how these engines work.

When to use Odibi: You're building production pipelines and want to move fast with proven patterns.

When to Use Odibi¶

Odibi is a good fit when:

✅ You're building data pipelines in Python
✅ You want to use Spark, Pandas, or Polars
✅ You need SCD2, merge patterns, or data quality checks
✅ You prefer configuration over code
✅ You're a small team (or solo) and need to move fast

Odibi might not be the best fit when:

❌ Your entire workflow is SQL-based (consider dbt)
❌ You need a scheduler/orchestrator (use Airflow, Dagster, or Prefect)
❌ You're doing real-time streaming (consider Kafka, Flink)

Your First Mental Model¶

Let's put everything together with a complete example. We'll trace data from source to dashboard.

The Journey: Source → Bronze → Silver → Gold → Dashboard¶

═══════════════════════════════════════════════════════════════════════════
                    THE COMPLETE DATA JOURNEY
═══════════════════════════════════════════════════════════════════════════

    📄 SOURCE            📦 BRONZE           🥈 SILVER           🥇 GOLD
    (CSV file)          (Raw copy)         (Cleaned)          (Aggregated)
        │                   │                  │                  │
        ▼                   ▼                  ▼                  ▼
   ┌─────────┐         ┌─────────┐        ┌─────────┐       ┌─────────┐
   │ orders  │────────▶│  raw_   │───────▶│ clean_  │──────▶│ daily_  │
   │  .csv   │         │ orders  │        │ orders  │       │ summary │
   └─────────┘         └─────────┘        └─────────┘       └─────────┘
                                                                  │
                                                                  ▼
                                                            ┌──────────┐
                                                            │ Dashboard│
                                                            │  📊📈    │
                                                            └──────────┘

Step 1: Sample Source Data (CSV)¶

This is what arrives from the source system—an orders CSV file:

order_id,customer_name,product,quantity,price,order_date,region
ORD001,Alice Smith,Widget A,2,29.99,2024-03-15,Northeast
ORD002,BOB JOHNSON,Widget B,1,$49.99,03/15/2024,northeast
ORD003,Carol White,Widget A,-1,29.99,2024-03-15,Southeast
ORD004,,Widget C,3,19.99,2024-03-15,Midwest
ORD001,Alice Smith,Widget A,2,29.99,2024-03-15,Northeast
ORD005,Dave Brown,Widget B,1,49.99,2024-03-99,West

Problems in this data: - 📛 Row 2: Name in ALL CAPS, price has "$", date format different, region lowercase - ❌ Row 3: Negative quantity (invalid) - ❌ Row 4: Missing customer name - ❌ Row 5: Duplicate of row 1 - ❌ Row 6: Invalid date (March 99th doesn't exist)

Step 2: Bronze Layer (Raw Copy)¶

The Bronze layer stores this data exactly as received:

| order_id | customer_name | product  | quantity | price   | order_date | region    | _loaded_at          |
|----------|---------------|----------|----------|---------|------------|-----------|---------------------|
| ORD001   | Alice Smith   | Widget A | 2        | 29.99   | 2024-03-15 | Northeast | 2024-03-16 02:00:00 |
| ORD002   | BOB JOHNSON   | Widget B | 1        | $49.99  | 03/15/2024 | northeast | 2024-03-16 02:00:00 |
| ORD003   | Carol White   | Widget A | -1       | 29.99   | 2024-03-15 | Southeast | 2024-03-16 02:00:00 |
| ORD004   |               | Widget C | 3        | 19.99   | 2024-03-15 | Midwest   | 2024-03-16 02:00:00 |
| ORD001   | Alice Smith   | Widget A | 2        | 29.99   | 2024-03-15 | Northeast | 2024-03-16 02:00:00 |
| ORD005   | Dave Brown    | Widget B | 1        | 49.99   | 2024-03-99 | West      | 2024-03-16 02:00:00 |

Nothing changed except adding a timestamp of when we loaded it. All the messy data is preserved.

Step 3: Silver Layer (Cleaned)¶

The Silver layer applies transformations and validations:

Transformations applied: 1. Standardize customer names to Title Case 2. Remove "$" from prices, convert to decimal 3. Parse dates to YYYY-MM-DD format 4. Standardize region names to Title Case 5. Remove duplicates based on order_id

Validations applied: 1. Quantity must be positive 2. Customer name cannot be empty 3. Date must be valid

Resulting Silver table:

| order_id | customer_name | product  | quantity | price | order_date | region    | _is_valid |
|----------|---------------|----------|----------|-------|------------|-----------|-----------|
| ORD001   | Alice Smith   | Widget A | 2        | 29.99 | 2024-03-15 | Northeast | ✅        |
| ORD002   | Bob Johnson   | Widget B | 1        | 49.99 | 2024-03-15 | Northeast | ✅        |

Quarantined records (sent to a separate table for review):

| order_id | reason                              |
|----------|-------------------------------------|
| ORD003   | Quantity must be positive: -1       |
| ORD004   | Customer name is required           |
| ORD005   | Invalid date: 2024-03-99            |

Only 2 of the original 6 records (after deduplication) made it through. The bad records are quarantined so someone can investigate and fix the source.

Step 4: Gold Layer (Aggregated)¶

The Gold layer aggregates for business use:

Daily Sales Summary:

| report_date | region    | total_orders | total_quantity | total_revenue |
|-------------|-----------|--------------|----------------|---------------|
| 2024-03-15  | Northeast | 2            | 3              | $79.98        |

Product Performance:

| product  | units_sold | revenue |
|----------|------------|---------|
| Widget A | 2          | $59.98  |
| Widget B | 1          | $49.99  |

Step 5: Dashboard Visualization¶

Finally, users see clean visualizations:

╔═══════════════════════════════════════════════════════════════════════╗
║                    DAILY SALES DASHBOARD                              ║
║                       March 15, 2024                                  ║
╠═══════════════════════════════════════════════════════════════════════╣
║                                                                       ║
║   TOTAL REVENUE        TOTAL ORDERS        AVG ORDER VALUE            ║
║   ┌─────────────┐     ┌─────────────┐     ┌─────────────┐            ║
║   │  $79.98     │     │     2       │     │   $39.99    │            ║
║   └─────────────┘     └─────────────┘     └─────────────┘            ║
║                                                                       ║
║   SALES BY REGION                    TOP PRODUCTS                     ║
║   ┌───────────────────────┐         ┌─────────────────────────┐      ║
║   │ ██████████ Northeast  │         │ Widget A  ████████ $59.98│     ║
║   │            $79.98     │         │ Widget B  ████     $49.99│     ║
║   └───────────────────────┘         └─────────────────────────┘      ║
║                                                                       ║
║   ⚠️ DATA QUALITY ALERT: 3 records quarantined for review            ║
║                                                                       ║
╚═══════════════════════════════════════════════════════════════════════╝

What the user sees: Clean numbers, clear charts, actionable insights.

What they don't see: All the messy data, parsing errors, and transformations that happened behind the scenes.

That's data engineering. You built the invisible plumbing that makes this dashboard possible.

What's Next?¶

Congratulations! You now understand the fundamentals of data engineering:

✅ What data is and where it lives
✅ What data engineers do
✅ How data pipelines work
✅ ETL vs ELT approaches
✅ Schemas and data types
✅ Data quality fundamentals
✅ The Medallion Architecture
✅ How Odibi fits in the ecosystem

Continue Your Learning¶

📘 Curriculum - A structured learning path from beginner to advanced

📖 Glossary - Quick definitions for data engineering terms

🛠️ First Tutorial - Build your first Odibi pipeline

Key Concepts to Explore Next¶

Slowly Changing Dimensions (SCD2) - How to track historical changes
Data Modeling - Designing efficient schemas
Orchestration - Scheduling and monitoring pipelines
Testing - Validating that your pipelines work correctly

Remember¶

Data engineering is a journey. You don't need to understand everything at once. Start with simple pipelines, add complexity gradually, and always focus on data quality.

The best data engineers aren't those who know the most tools—they're the ones who understand the fundamentals deeply and can apply them to any situation.

Welcome to data engineering. 🎉