
🎓 Odibi Learning Curriculum

A 4-Week Journey from Zero to Data Engineer

This curriculum is designed for complete beginners with no prior data engineering experience. By the end, you'll be able to build production-ready data pipelines.


How This Course Works

  • Pace: ~1-2 hours per day, 5 days per week
  • Style: Learn by doing โ€” every concept has hands-on exercises
  • Format: Read โ†’ Try โ†’ Check โ†’ Repeat

Each week builds on the previous one, like stacking building blocks.


📅 Week 1: Bronze Layer + Basic Concepts

📚 Learning Objectives

By the end of this week, you will:

  • Understand what data is and common file formats
  • Know what a data pipeline does and why it matters
  • Install Odibi and run your first pipeline
  • Load raw data into a Bronze layer

✅ Prerequisites

Before starting, make sure you have:

  • A computer (Windows, Mac, or Linux)
  • Python 3.9+ installed (Download Python)
  • A text editor (VS Code recommended)
  • Basic comfort using a terminal/command prompt


Day 1: What is Data?

๐Ÿณ Kitchen Analogy

Think of data like ingredients in your kitchen. You have:

  • Raw ingredients (flour, eggs, sugar) = raw data files
  • Recipes = data transformations
  • Finished dishes = clean, usable reports

Data comes in many "containers":

| Format  | What it looks like                | When to use               |
|---------|-----------------------------------|---------------------------|
| CSV     | Spreadsheet-like rows and columns | Simple tabular data       |
| JSON    | Nested key-value pairs            | API responses, configs    |
| Parquet | Binary columnar format            | Large datasets, analytics |
| Delta   | Parquet + versioning + ACID       | Production data lakes     |

💻 Hands-On: Create Your First Data File

Create a folder called my_first_pipeline and inside it create customers.csv:

id,name,email,signup_date
1,Alice,alice@example.com,2024-01-15
2,Bob,bob@example.com,2024-02-20
3,Charlie,charlie@example.com,2024-03-10

This is tabular data: rows (records) and columns (fields).
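If you have Python with pandas installed, you can peek at this structure yourself. A quick sketch (pandas is an assumption here; Odibi handles reading for you later in the course):

```python
import io

import pandas as pd

csv_text = """id,name,email,signup_date
1,Alice,alice@example.com,2024-01-15
2,Bob,bob@example.com,2024-02-20
3,Charlie,charlie@example.com,2024-03-10
"""

# Each line after the header is a row (record);
# each comma-separated field is a column.
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["signup_date"])
print(df.shape)  # (3, 4): 3 rows, 4 columns
```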

🧪 Self-Check

  • [ ] Can you explain what a CSV file is?
  • [ ] What's the difference between a row and a column?

Day 2: What is a Data Pipeline?

๐Ÿญ Assembly Line Analogy

Imagine a car factory. Raw materials enter one end, go through multiple stations (welding, painting, assembly), and a finished car comes out the other end.

A data pipeline works the same way:

  1. Extract: get raw data from somewhere (files, databases, APIs)
  2. Transform: clean, reshape, and enrich the data
  3. Load: save the result somewhere useful

This is called ETL (Extract, Transform, Load).
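The three ETL steps can be sketched in a few lines of plain pandas (a toy illustration, independent of Odibi):

```python
import io

import pandas as pd

# Extract: get raw data from a source (here, an in-memory CSV "file").
raw = io.StringIO("id,name\n1, Alice \n2, Bob \n")
df = pd.read_csv(raw)

# Transform: clean it (strip stray whitespace around names).
df["name"] = df["name"].str.strip()

# Load: save the result somewhere useful (here, a cleaned CSV).
df.to_csv("clean_customers.csv", index=False)
```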

The Medallion Architecture

Odibi uses a "layered" approach to organize data:

┌─────────────────────────────────────────────────────────────┐
│                      YOUR DATA LAKE                         │
├─────────────┬─────────────────┬─────────────────────────────┤
│   BRONZE    │     SILVER      │           GOLD              │
│   (Raw)     │    (Cleaned)    │       (Business-Ready)      │
├─────────────┼─────────────────┼─────────────────────────────┤
│ • As-is     │ • Deduplicated  │ • Aggregated                │
│ • Untouched │ • Typed         │ • Joined                    │
│ • Archived  │ • Validated     │ • Ready for reporting       │
└─────────────┴─────────────────┴─────────────────────────────┘

Why layers?

  • If something breaks, you can always go back to Bronze
  • Each layer has a clear purpose
  • Teams can work on different layers independently

📖 Deep Dive: Medallion Architecture Guide

🧪 Self-Check

  • [ ] What does ETL stand for?
  • [ ] Why do we have separate Bronze, Silver, and Gold layers?

Day 3: Introduction to Odibi

What is Odibi?

Odibi is a YAML-first data pipeline framework. Instead of writing hundreds of lines of code, you describe what you want in simple configuration files.

🔧 Installation

Open your terminal and run:

# Create a virtual environment (recommended)
python -m venv .venv

# Activate it
# Windows:
.venv\Scripts\activate
# Mac/Linux:
source .venv/bin/activate

# Install Odibi
pip install odibi

Verify it works:

odibi --version

Your First Odibi Project

Let Odibi create a project structure for you:

odibi init my_first_project --template hello
cd my_first_project

This creates:

my_first_project/
├── odibi.yaml          # Your pipeline configuration
├── data/
│   ├── landing/        # Where raw files arrive
│   ├── bronze/         # Raw data preserved
│   ├── silver/         # Cleaned data
│   └── gold/           # Business-ready data
└── README.md

📖 Deep Dive: Getting Started Tutorial

🧪 Self-Check

  • [ ] What command installs Odibi?
  • [ ] What folder does raw data go into?

Day 4: The Bronze Layer

📦 Filing Cabinet Analogy

Think of Bronze as your filing cabinet where you store original documents. You never write on the originals; you make copies first.

The Bronze layer:

  • Stores data exactly as received
  • Never modifies or cleans anything
  • Acts as your "source of truth" backup

💻 Hands-On: Build a Bronze Pipeline

  1. Copy your customers.csv to data/landing/

  2. Edit odibi.yaml:

project: "my_first_project"
engine: "pandas"

connections:
  local:
    type: local
    base_path: "./data"

story:
  connection: local
  path: stories

system:
  connection: local
  path: system

pipelines:
  - pipeline: bronze_customers
    layer: bronze
    description: "Load raw customer data"
    nodes:
      - name: raw_customers
        description: "Ingest customers from landing zone"

        read:
          connection: local
          path: landing/customers.csv
          format: csv

        write:
          connection: local
          path: bronze/customers
          format: parquet
          mode: overwrite
  3. Run your pipeline:
odibi run odibi.yaml
  4. Check your output:
    # You should see a parquet file in data/bronze/customers/
    ls data/bronze/customers/


What Just Happened?

  1. Odibi read your CSV file
  2. Converted it to Parquet format (more efficient for analytics)
  3. Saved it to the Bronze layer

No data was modified, just preserved in a better format.

📖 Deep Dive: Bronze Layer Tutorial

🧪 Self-Check

  • [ ] Why don't we clean data in Bronze?
  • [ ] What format did we convert the CSV to?

Day 5: Multi-Node Pipelines

🚂 Train Cars Analogy

A pipeline is like a train. Each node is a train car: they're connected and run in sequence.

💻 Hands-On: Add More Data

  1. Create data/landing/orders.csv:
order_id,customer_id,product,amount,order_date
1001,1,Widget A,29.99,2024-01-20
1002,2,Widget B,49.99,2024-02-25
1003,1,Widget C,19.99,2024-03-15
1004,3,Widget A,29.99,2024-03-20
  2. Add a second node to your pipeline:
pipelines:
  - pipeline: bronze_ingest
    layer: bronze
    description: "Load all raw data"
    nodes:
      - name: raw_customers
        description: "Ingest customers"
        read:
          connection: local
          path: landing/customers.csv
          format: csv
        write:
          connection: local
          path: bronze/customers
          format: parquet
          mode: overwrite

      - name: raw_orders
        description: "Ingest orders"
        read:
          connection: local
          path: landing/orders.csv
          format: csv
        write:
          connection: local
          path: bronze/orders
          format: parquet
          mode: overwrite
  3. Run it:
    odibi run odibi.yaml
    

Both datasets are now in your Bronze layer.

🧪 Self-Check

  • [ ] What is a "node" in Odibi?
  • [ ] Can a pipeline have multiple nodes?

๐Ÿ“ Week 1 Summary

You learned:

  • Data comes in different formats (CSV, JSON, Parquet)
  • Pipelines move data through stages (ETL)
  • Medallion architecture organizes data into layers
  • Bronze layer stores raw, unmodified data
  • Odibi uses YAML configuration to define pipelines

Congratulations! You've built your first data pipeline. 🎉


📅 Week 2: Silver Layer + SCD2 + Data Quality

📚 Learning Objectives

By the end of this week, you will:

  • Clean and transform data in the Silver layer
  • Understand and implement SCD2 (history tracking)
  • Add data quality checks to catch bad data
  • Handle missing values and invalid data

✅ Prerequisites

Before starting, make sure you have:

  • Completed Week 1
  • A working Bronze layer with customer and order data


Day 1: Why Data Cleaning Matters

🧹 Dirty Kitchen Analogy

Imagine cooking with ingredients covered in dirt, or using expired milk. The result would be... unpleasant.

Bad data causes:

  • Wrong business decisions
  • Broken reports
  • Angry users
  • Lost revenue

Common Data Problems

| Problem        | Example                  | Impact                      |
|----------------|--------------------------|-----------------------------|
| Missing values | email: NULL              | Can't contact customer      |
| Invalid format | date: "not-a-date"       | Calculations fail           |
| Duplicates     | Same order twice         | Revenue doubled incorrectly |
| Inconsistent   | "CA", "California", "ca" | Grouping breaks             |

The Silver Layer's Job

The Silver layer is your cleaning station:

  • Fix data types (strings to dates, etc.)
  • Remove duplicates
  • Handle missing values
  • Validate data quality

🧪 Self-Check

  • [ ] Name 3 common data quality problems
  • [ ] What layer handles data cleaning?

Day 2: Building a Silver Pipeline

💻 Hands-On: Clean Your Customer Data

  1. Update odibi.yaml to add a Silver pipeline:
pipelines:
  # ... your bronze pipeline from Week 1 ...

  - pipeline: silver_customers
    layer: silver
    description: "Clean and standardize customers"
    nodes:
      - name: clean_customers
        description: "Apply cleaning transformations"

        read:
          connection: local
          path: bronze/customers
          format: parquet

        transform:
          - type: rename
            columns:
              id: customer_id

          - type: cast
            columns:
              customer_id: integer
              signup_date: date

          - type: fill_null
            columns:
              email: "unknown@example.com"

          - type: lower
            columns:
              - email

        write:
          connection: local
          path: silver/customers
          format: parquet
          mode: overwrite
  2. Run it:
    odibi run odibi.yaml --pipeline silver_customers
    

What Each Transform Does

| Transform | Purpose                | Example                           |
|-----------|------------------------|-----------------------------------|
| rename    | Change column names    | id → customer_id                  |
| cast      | Change data types      | String → Date                     |
| fill_null | Replace missing values | NULL → default value              |
| lower     | Lowercase text         | "BOB@EMAIL.COM" → "bob@email.com" |
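Under the pandas engine, these transforms map onto familiar DataFrame operations. A rough sketch of the same cleaning steps (an illustration, not Odibi's actual implementation):

```python
import pandas as pd

df = pd.DataFrame({
    "id": ["1", "2"],
    "email": [None, "BOB@EMAIL.COM"],
    "signup_date": ["2024-01-15", "2024-02-20"],
})

df = df.rename(columns={"id": "customer_id"})            # rename
df["customer_id"] = df["customer_id"].astype(int)        # cast (string -> integer)
df["signup_date"] = pd.to_datetime(df["signup_date"])    # cast (string -> date)
df["email"] = df["email"].fillna("unknown@example.com")  # fill_null
df["email"] = df["email"].str.lower()                    # lower
```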

📖 Deep Dive: Silver Layer Tutorial

🧪 Self-Check

  • [ ] What does cast do?
  • [ ] Why lowercase email addresses?

Day 3: SCD2 (Tracking History)

โฐ Time Machine Analogy

Imagine you could look at a customer's record as it was 6 months ago. Where did they live? What tier were they?

SCD2 (Slowly Changing Dimension Type 2) makes this possible by:

  • Never deleting old records
  • Adding new versions when data changes
  • Tracking when each version was valid

Visual Example

Customer moves from CA to NY on Feb 1:

| customer_id | address | valid_from | valid_to   | is_current |
|-------------|---------|------------|------------|------------|
| 101         | CA      | 2024-01-01 | 2024-02-01 | false      |
| 101         | NY      | 2024-02-01 | NULL       | true       |

Now you can answer: "Where did customer 101 live on January 15th?" → CA
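That point-in-time question is just a range filter over valid_from/valid_to. A pandas sketch with the table above (illustrative only):

```python
import pandas as pd

dim = pd.DataFrame({
    "customer_id": [101, 101],
    "address": ["CA", "NY"],
    "valid_from": pd.to_datetime(["2024-01-01", "2024-02-01"]),
    "valid_to": pd.to_datetime(["2024-02-01", None]),  # NULL = still current
})

# "Where did customer 101 live on January 15th?"
as_of = pd.Timestamp("2024-01-15")
current = dim[(dim["valid_from"] <= as_of)
              & (dim["valid_to"].isna() | (dim["valid_to"] > as_of))]
address_on_jan_15 = current["address"].item()
```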

💻 Hands-On: Add SCD2 to Customers

  1. First, update your source data. Create data/landing/customers_update.csv:
id,name,email,signup_date,tier
1,Alice,alice@example.com,2024-01-15,Gold
2,Bob,bob_new@example.com,2024-02-20,Silver
3,Charlie,charlie@example.com,2024-03-10,Bronze
4,Diana,diana@example.com,2024-04-01,Gold

(Notice: Bob has a new email, and Diana is a new customer)

  2. Add an SCD2 node:
  - pipeline: silver_customers_scd2
    layer: silver
    description: "Track customer history"
    nodes:
      - name: customers_with_history
        description: "Apply SCD2 for full history"

        read:
          connection: local
          path: landing/customers_update.csv
          format: csv

        transformer: scd2
        params:
          connection: local
          path: silver/dim_customers
          keys:
            - id
          track_cols:
            - email
            - tier
          effective_time_col: signup_date

        write:
          connection: local
          path: silver/dim_customers
          format: parquet
          mode: overwrite
  3. Run it:
    odibi run odibi.yaml --pipeline silver_customers_scd2
    

📖 Deep Dive: SCD2 Pattern

🧪 Self-Check

  • [ ] What does SCD2 stand for?
  • [ ] What column tells you if a record is the current version?

Day 4: Data Quality Validation

🚨 Security Guard Analogy

Before entering a building, security checks your ID. Data quality validation checks your data before it enters the Silver layer.

Types of Checks

| Check type  | What it does        | Example                               |
|-------------|---------------------|---------------------------------------|
| not_null    | Ensure value exists | customer_id can't be empty            |
| unique      | No duplicates       | Each email is unique                  |
| range       | Value in bounds     | age between 0 and 150                 |
| regex       | Pattern matching    | Email contains @                      |
| foreign_key | Reference exists    | customer_id exists in customers table |
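Each check type boils down to a boolean filter over the rows. A minimal pandas sketch of not_null and regex with quarantine routing (a conceptual illustration, not Odibi's internals):

```python
import pandas as pd

df = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "email": ["alice@example.com", "bad-email", None],
})

# not_null on customer_id: with on_fail: fail, any violation stops the run.
assert df["customer_id"].notna().all()

# regex on email: with on_fail: quarantine, failing rows are set aside
# (a missing email also fails the pattern check here).
ok = df["email"].str.contains("@", na=False)
good, quarantined = df[ok], df[~ok]
```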

💻 Hands-On: Add Validation

  1. Add validation to your node:
      - name: clean_customers
        description: "Clean with validation"

        read:
          connection: local
          path: bronze/customers
          format: parquet

        validation:
          rules:
            - column: customer_id
              check: not_null
              on_fail: fail

            - column: email
              check: not_null
              on_fail: warn

            - column: email
              check: regex
              pattern: ".*@.*\\..*"
              on_fail: quarantine

          # Bad rows go to quarantine via on_fail: quarantine on individual tests

        write:
          connection: local
          path: silver/customers
          format: parquet
          mode: overwrite

Severity Levels (on_fail)

| on_fail    | What happens                                                  |
|------------|---------------------------------------------------------------|
| warn       | Log a warning, continue processing                            |
| quarantine | Route bad rows to a quarantine table, continue with good rows |
| fail       | Stop the entire pipeline                                      |

📖 Deep Dive: Data Validation

🧪 Self-Check

  • [ ] What does quarantine mean?
  • [ ] What's the difference between on_fail: warn and on_fail: fail?

Day 5: Putting It Together

💻 Hands-On: Complete Silver Pipeline

Create a complete Silver pipeline that:

  1. Reads from Bronze
  2. Cleans and transforms
  3. Validates quality
  4. Tracks history with SCD2

  - pipeline: silver_complete
    layer: silver
    description: "Complete silver processing"
    nodes:
      - name: stg_customers
        description: "Stage and clean customers"

        read:
          connection: local
          path: bronze/customers
          format: parquet

        transform:
          - type: rename
            columns:
              id: customer_id
          - type: cast
            columns:
              customer_id: integer
              signup_date: date
          - type: trim
            columns:
              - name
              - email

        validation:
          rules:
            - column: customer_id
              check: not_null
              on_fail: fail
            - column: email
              check: regex
              pattern: ".*@.*"
              on_fail: warn
          # Quarantine is set via on_fail: quarantine on individual tests

        write:
          connection: local
          path: silver/stg_customers
          format: parquet
          mode: overwrite

      - name: dim_customers
        description: "Create customer dimension with history"
        depends_on:
          - stg_customers

        read:
          connection: local
          path: silver/stg_customers
          format: parquet

        transformer: scd2
        params:
          connection: local
          path: silver/dim_customers
          keys:
            - customer_id
          track_cols:
            - name
            - email
          effective_time_col: signup_date

        write:
          connection: local
          path: silver/dim_customers
          format: parquet
          mode: overwrite

🧪 Self-Check

  • [ ] What does depends_on do?
  • [ ] Why do we stage data before applying SCD2?

๐Ÿ“ Week 2 Summary

You learned:

  • Why data cleaning is critical
  • How to transform data (rename, cast, fill_null)
  • SCD2 tracks historical changes
  • Validation catches bad data before it causes problems
  • Quarantine isolates bad rows for review

Great progress! Your data is now clean and trackable. 🎉


📅 Week 3: Gold Layer + Dimensional Modeling

📚 Learning Objectives

By the end of this week, you will:

  • Understand Facts vs Dimensions
  • Build a star schema
  • Use surrogate keys
  • Create aggregations for reporting
  • Build a complete data warehouse

✅ Prerequisites

Before starting, make sure you have:

  • Completed Weeks 1 and 2
  • Working Bronze and Silver layers


Day 1: Facts vs Dimensions

🎭 Theater Analogy

Think of a theater production:

  • Facts = the events (ticket sales, performances)
  • Dimensions = the context (who, what, when, where)

Every fact answers: "What happened?" Every dimension answers: "Tell me more about..."

Examples

| Facts (Events)   | Dimensions (Context)    |
|------------------|-------------------------|
| Order placed     | Customer, Product, Date |
| Payment received | Customer, Account, Date |
| Page viewed      | User, Page, Date        |

Visual: A Sales Transaction

┌─────────────────────────────────────────────────────────────┐
│                     FACT: Order                             │
│  order_id=1001, amount=49.99, quantity=2                    │
└──────────┬──────────┬──────────┬────────────────────────────┘
           │          │          │
           ▼          ▼          ▼
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ DIM: Customer│ │ DIM: Product │ │ DIM: Date    │
│ name=Alice   │ │ name=Widget  │ │ date=2024-01 │
│ tier=Gold    │ │ category=HW  │ │ quarter=Q1   │
└──────────────┘ └──────────────┘ └──────────────┘

🧪 Self-Check

  • [ ] Is "order amount" a fact or dimension?
  • [ ] Is "customer name" a fact or dimension?

Day 2: Star Schema Basics

โญ Star Analogy

A star schema looks like a star: the fact table is in the center, with dimension tables around it like points.

                    ┌─────────────┐
                    │ dim_product │
                    └──────┬──────┘
                           │
┌─────────────┐     ┌──────┴──────┐     ┌─────────────┐
│ dim_customer│─────│  fact_sales │─────│  dim_date   │
└─────────────┘     └──────┬──────┘     └─────────────┘
                           │
                    ┌──────┴──────┐
                    │ dim_location│
                    └─────────────┘

Why stars?

  • Simple to understand
  • Fast to query (fewer joins)
  • Works with every BI tool

Dimension Table Structure

# dim_customers
customer_key: 1          # Surrogate key (system-generated)
customer_id: "C001"      # Natural key (from source)
name: "Alice"
email: "alice@example.com"
tier: "Gold"
effective_from: "2024-01-01"
effective_to: null
is_current: true

Fact Table Structure

# fact_orders
order_key: 1001
customer_key: 1          # Points to dim_customers
product_key: 42          # Points to dim_products
date_key: 20240120       # Points to dim_date
quantity: 2
amount: 49.99
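Once facts and dimensions are linked by keys, reporting queries are simple joins plus aggregations. A pandas sketch of "revenue by customer tier" (illustrative data, not from the course files):

```python
import pandas as pd

dim_customers = pd.DataFrame({
    "customer_key": [1, 2],
    "name": ["Alice", "Bob"],
    "tier": ["Gold", "Silver"],
})
fact_orders = pd.DataFrame({
    "order_key": [1001, 1002],
    "customer_key": [1, 2],
    "amount": [49.99, 19.99],
})

# Join the fact to the dimension on the surrogate key, then aggregate.
report = (fact_orders.merge(dim_customers, on="customer_key")
          .groupby("tier")["amount"].sum())
```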

📖 Deep Dive: Dimensional Modeling Guide

🧪 Self-Check

  • [ ] Why is it called a "star" schema?
  • [ ] What's in the center of the star?

Day 3: Surrogate Keys

🔑 Hotel Room Key Analogy

When you check into a hotel, they give you a room key. This key is:

  • Unique to your stay (not your name)
  • System-generated (you don't choose it)
  • Internal (the hotel manages it)

A surrogate key works the same way:

  • Unique identifier for each record
  • Generated by the system (not from source data)
  • Never changes, even if source data changes

Why Not Use Natural Keys?

| Problem    | Natural key example        | Issue                 |
|------------|----------------------------|-----------------------|
| Changes    | SSN gets corrected         | Breaks all references |
| Duplicates | "John Smith"               | Too common            |
| Missing    | New customer, no ID yet    | Can't insert          |
| Composite  | firstName + lastName + DOB | Slow to join          |

💻 Hands-On: Generate Surrogate Keys

Odibi can auto-generate surrogate keys:

  - pipeline: gold_dimensions
    layer: gold
    description: "Build dimension tables"
    nodes:
      - name: dim_customers
        description: "Customer dimension with surrogate keys"

        read:
          connection: local
          path: silver/dim_customers
          format: parquet

        transform:
          - type: generate_surrogate_key
            key_column: customer_key
            source_columns:
              - customer_id

        write:
          connection: local
          path: gold/dim_customers
          format: parquet
          mode: overwrite

The generate_surrogate_key transform creates a unique integer for each unique combination of source columns.
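The idea is easy to sketch in pandas: assign one integer per distinct natural key, in order of first appearance (this mirrors the concept described above, not Odibi's exact algorithm):

```python
import pandas as pd

df = pd.DataFrame({"customer_id": ["C003", "C001", "C001", "C002"]})

# One stable integer per distinct natural key; repeated keys get the
# same surrogate, regardless of what the source values look like.
df["customer_key"] = pd.factorize(df["customer_id"])[0] + 1
```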

📖 Deep Dive: Dimension Pattern

🧪 Self-Check

  • [ ] What's wrong with using email as a primary key?
  • [ ] Who generates surrogate keys: the source system or our data warehouse?

Day 4: Building Fact Tables

💻 Hands-On: Create a Sales Fact Table

  1. First, ensure you have a date dimension. Create data/landing/dates.csv:
date_key,full_date,year,quarter,month,day_of_week
20240115,2024-01-15,2024,Q1,January,Monday
20240120,2024-01-20,2024,Q1,January,Saturday
20240220,2024-02-20,2024,Q1,February,Tuesday
20240225,2024-02-25,2024,Q1,February,Sunday
20240310,2024-03-10,2024,Q1,March,Sunday
20240315,2024-03-15,2024,Q1,March,Friday
20240320,2024-03-20,2024,Q1,March,Wednesday
  2. Build the fact table:
      - name: fact_orders
        description: "Order fact table"
        depends_on:
          - dim_customers

        read:
          - connection: local
            path: silver/orders
            format: parquet
            alias: orders
          - connection: local
            path: gold/dim_customers
            format: parquet
            alias: customers

        transform:
          - type: join
            left: orders
            right: customers
            on:
              - left: customer_id
                right: customer_id
            how: left
            filter: "is_current = true"  # Only join to current customer version

          - type: select
            columns:
              - order_id
              - customer_key
              - product
              - amount
              - order_date

          - type: cast
            columns:
              order_date: date

          - type: add_column
            name: date_key
            expression: "date_format(order_date, 'yyyyMMdd')"

        write:
          connection: local
          path: gold/fact_orders
          format: parquet
          mode: overwrite

📖 Deep Dive: Fact Pattern

🧪 Self-Check

  • [ ] Why do we filter for is_current = true when joining?
  • [ ] What's the purpose of date_key?

Day 5: Aggregations for Reporting

📊 Summary Reports Analogy

Instead of reading every receipt, store managers want:

  • "Total sales this month"
  • "Average order size by customer tier"
  • "Top 10 products"

Aggregations pre-compute these summaries.
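Under the pandas engine, this kind of aggregation is a groupby. A sketch with toy data (metric names mirror the YAML below; this is an illustration, not Odibi's internals):

```python
import pandas as pd

fact = pd.DataFrame({
    "customer_key": [1, 1, 2],
    "amount": [29.99, 19.99, 49.99],
})

# Pre-compute one summary row per customer.
agg = fact.groupby("customer_key").agg(
    total_orders=("amount", "size"),
    total_revenue=("amount", "sum"),
    avg_order_value=("amount", "mean"),
).reset_index()
```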

💻 Hands-On: Build an Aggregation

  - pipeline: gold_aggregations
    layer: gold
    description: "Pre-computed summaries"
    nodes:
      - name: agg_sales_by_customer
        description: "Sales summary per customer"

        read:
          connection: local
          path: gold/fact_orders
          format: parquet

        pattern:
          type: aggregation
          params:
            group_by:
              - customer_key
            metrics:
              - name: total_orders
                expression: "count(*)"
              - name: total_revenue
                expression: "sum(amount)"
              - name: avg_order_value
                expression: "avg(amount)"
              - name: first_order_date
                expression: "min(order_date)"
              - name: last_order_date
                expression: "max(order_date)"

        write:
          connection: local
          path: gold/agg_sales_by_customer
          format: parquet
          mode: overwrite

📖 Deep Dive: Aggregation Pattern

🧪 Self-Check

  • [ ] Why pre-compute aggregations instead of calculating on-the-fly?
  • [ ] What does group_by do?

๐Ÿ“ Week 3 Summary

You learned:

  • Facts record events, dimensions provide context
  • Star schemas are simple and fast
  • Surrogate keys are stable, system-generated identifiers
  • Fact tables link to dimensions via keys
  • Aggregations pre-compute summaries for fast reporting

Amazing work! You've built a complete data warehouse. 🎉


📅 Week 4: Production Deployment + Best Practices

📚 Learning Objectives

By the end of this week, you will:

  • Configure connections for different environments
  • Implement error handling and retry logic
  • Add monitoring and logging
  • Tune performance for large datasets
  • Deploy to production with confidence

✅ Prerequisites

Before starting, make sure you have:

  • Completed Weeks 1-3
  • A complete Bronze → Silver → Gold pipeline


Day 1: Connections and Environments

๐Ÿ  Different Homes Analogy

Your pipeline needs to work in different "homes":

  • Development: your laptop, small test data
  • Staging: test server, realistic data
  • Production: the real deal, live data

Each environment has different connection details.

💻 Hands-On: Configure Environments

project: "my_project"
engine: "pandas"

# Global variables
vars:
  env: ${ODIBI_ENV:dev}  # Default to 'dev' if not set

# Environment-specific overrides
environments:
  dev:
    connections:
      data_lake:
        type: local
        base_path: "./data"

  staging:
    connections:
      data_lake:
        type: azure_blob
        account_name: "mystorageacct"
        container: "staging-data"
        credential: ${AZURE_STORAGE_KEY}

  prod:
    connections:
      data_lake:
        type: azure_blob
        account_name: "prodstorageacct"
        container: "prod-data"
        credential: ${AZURE_STORAGE_KEY}

connections:
  data_lake:
    type: local
    base_path: "./data"

Run for a specific environment:

ODIBI_ENV=staging odibi run odibi.yaml

📖 Deep Dive: Environments Guide

🧪 Self-Check

  • [ ] Why use environment variables for credentials?
  • [ ] What's the default environment if ODIBI_ENV isn't set?

Day 2: Error Handling and Retry Logic

🔄 Retry Analogy

If your phone call fails, you try again. Networks are unreliable and databases time out; retries handle these temporary failures.

💻 Hands-On: Configure Retries

project: "production_pipeline"
engine: "spark"

retry:
  enabled: true
  max_attempts: 3
  backoff: exponential    # Wait longer between each retry
  initial_delay: 5        # First retry after 5 seconds
  max_delay: 300          # Never wait more than 5 minutes

pipelines:
  - pipeline: bronze_ingest
    nodes:
      - name: fetch_api_data
        retry:
          max_attempts: 5  # Override for this node
        read:
          connection: external_api
          path: /customers
          format: json

Backoff Strategies

| Strategy    | Wait times (5s initial) | Best for         |
|-------------|-------------------------|------------------|
| constant    | 5s, 5s, 5s              | Simple cases     |
| linear      | 5s, 10s, 15s            | Gradual increase |
| exponential | 5s, 10s, 20s, 40s       | API rate limits  |
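Conceptually, retry with exponential backoff looks like this generic Python helper (a sketch of the technique, not Odibi's implementation):

```python
import time

def with_retry(fn, max_attempts=3, initial_delay=5.0, max_delay=300.0):
    """Call fn, retrying on failure with exponential backoff: 5s, 10s, 20s, ..."""
    delay = initial_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: let the failure propagate
            time.sleep(min(delay, max_delay))  # cap the wait at max_delay
            delay *= 2  # exponential: double the wait each time
```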

Handling Failures

        on_fail: warn  # Options: fail, warn, quarantine

| Action     | Behavior                               |
|------------|----------------------------------------|
| fail       | Stop entire pipeline (default)         |
| warn       | Log the issue, continue processing     |
| quarantine | Route failing rows to quarantine table |

🧪 Self-Check

  • [ ] What does "exponential backoff" mean?
  • [ ] When would you use on_fail: warn?

Day 3: Monitoring and Logging

📺 Dashboard Analogy

A pilot needs instruments to fly safely. You need monitoring to run pipelines safely.

💻 Hands-On: Configure Logging

logging:
  level: INFO              # DEBUG, INFO, WARNING, ERROR
  structured: true         # JSON format for log aggregators
  include_metrics: true    # Row counts, timing

alerts:
  - type: slack
    url: ${SLACK_WEBHOOK_URL}
    on_events:
      - on_failure
      - on_success

  - type: email
    to:
      - data-team@company.com
    on_events:
      - on_failure

What Gets Logged

Every pipeline run generates a Data Story with:

  • Start/end timestamps
  • Row counts (read/written/quarantined)
  • Validation results
  • Error messages

View your story:

odibi story last

🧪 Self-Check

  • [ ] What's the difference between INFO and DEBUG logging?
  • [ ] What is a "Data Story"?

Day 4: Performance Tuning

๐ŸŽ๏ธ Race Car Analogy

A race car needs tuning to go fast. Data pipelines need tuning for large datasets.

Key Performance Levers

| Lever        | When to use             | Configuration                  |
|--------------|-------------------------|--------------------------------|
| Partitioning | Large tables (>1M rows) | Split data by date/category    |
| Caching      | Reused datasets         | Keep in memory                 |
| Parallelism  | Multiple nodes          | Run independent nodes together |
| Batch size   | Memory limits           | Process in chunks              |

💻 Hands-On: Add Partitioning

        write:
          connection: data_lake
          path: gold/fact_orders
          format: delta
          mode: overwrite
          partition_by:
            - order_year
            - order_month

💻 Hands-On: Enable Caching

      - name: dim_customers
        cache: true          # Keep in memory for downstream nodes
        read:
          connection: local
          path: silver/dim_customers

💻 Hands-On: Performance Config

performance:
  max_parallel_nodes: 4    # Run up to 4 nodes simultaneously
  batch_size: 100000       # Process 100k rows at a time
  shuffle_partitions: 200  # Spark shuffle partitions

📖 Deep Dive: Performance Tuning Guide

🧪 Self-Check

  • [ ] Why partition by date?
  • [ ] What does caching do?

Day 5: Production Deployment Checklist

🚀 Launch Checklist

Before deploying to production, verify:

Configuration

  • [ ] All secrets use environment variables (never hardcoded)
  • [ ] Correct environment settings for prod
  • [ ] Retry logic enabled
  • [ ] Alerts configured

Data Quality

  • [ ] Validation rules on all critical columns
  • [ ] Quarantine configured for bad rows
  • [ ] Foreign key checks enabled

Performance

  • [ ] Partitioning on large tables
  • [ ] Appropriate parallelism
  • [ ] Tested with production-scale data

Operations

  • [ ] Logging at INFO level
  • [ ] Monitoring dashboard set up
  • [ ] Runbook for common failures
  • [ ] Backup/restore procedures documented

Complete Production Config

project: "customer360"
engine: "spark"
version: "1.0.0"
owner: "data-team@company.com"
description: "Customer analytics pipeline"

vars:
  env: ${ODIBI_ENV:prod}

retry:
  enabled: true
  max_attempts: 3
  backoff: exponential

logging:
  level: INFO
  structured: true
  include_metrics: true

alerts:
  - type: slack
    url: ${SLACK_WEBHOOK}
    on_events: [on_failure]

performance:
  max_parallel_nodes: 8
  batch_size: 500000

connections:
  data_lake:
    type: azure_blob
    account_name: ${AZURE_STORAGE_ACCOUNT}
    container: "prod-data"
    credential: ${AZURE_STORAGE_KEY}

story:
  connection: data_lake
  path: _odibi/stories

system:
  connection: data_lake
  path: _odibi/system

pipelines:
  # ... your pipelines ...

📖 Deep Dive: Production Deployment Guide

🧪 Self-Check

  • [ ] What should NEVER be hardcoded in config?
  • [ ] What logging level is recommended for production?

๐Ÿ“ Week 4 Summary

You learned:

  • Environments separate dev/staging/prod configurations
  • Retry logic handles temporary failures
  • Logging and alerts keep you informed
  • Partitioning and caching improve performance
  • A production checklist prevents common mistakes

Congratulations! You've completed the Odibi curriculum! 🎓🎉


🎯 What's Next?

Now that you've completed the basics:

  1. Build a real project: apply what you learned to actual data
  2. Explore advanced patterns: Browse all patterns
  3. Learn the CLI: CLI Master Guide
  4. Join the community: share your projects, ask questions

| Topic           | Link                        |
|-----------------|-----------------------------|
| All Patterns    | ../patterns/README.md       |
| YAML Reference  | ../reference/yaml_schema.md |
| Best Practices  | ../guides/best_practices.md |
| Troubleshooting | ../troubleshooting.md       |

Built with ❤️ for data engineers who are just getting started.