# Odibi Learning Curriculum

A 4-Week Journey from Zero to Data Engineer

This curriculum is designed for complete beginners with no prior data engineering experience. By the end, you'll be able to build production-ready data pipelines.

## How This Course Works

- Pace: ~1-2 hours per day, 5 days per week
- Style: Learn by doing; every concept has hands-on exercises
- Format: Read → Try → Check → Repeat

Each week builds on the previous one, like stacking building blocks.
## Week 1: Bronze Layer + Basic Concepts

### Learning Objectives

By the end of this week, you will:

- Understand what data is and common file formats
- Know what a data pipeline does and why it matters
- Install Odibi and run your first pipeline
- Load raw data into a Bronze layer

### Prerequisites

Before starting, make sure you have:

- A computer (Windows, Mac, or Linux)
- Python 3.9+ installed (Download Python)
- A text editor (VS Code recommended)
- Basic comfort using a terminal/command prompt
### Day 1: What is Data?

#### Kitchen Analogy

Think of data like ingredients in your kitchen. You have:

- Raw ingredients (flour, eggs, sugar) = raw data files
- Recipes = data transformations
- Finished dishes = clean, usable reports
Data comes in many "containers":
| Format | What it looks like | When to use |
|---|---|---|
| CSV | Spreadsheet-like rows and columns | Simple tabular data |
| JSON | Nested key-value pairs | API responses, configs |
| Parquet | Binary columnar format | Large datasets, analytics |
| Delta | Parquet + versioning + ACID | Production data lakes |
#### Hands-On: Create Your First Data File

Create a folder called `my_first_pipeline` and inside it create `customers.csv`:

```csv
id,name,email,signup_date
1,Alice,alice@example.com,2024-01-15
2,Bob,bob@example.com,2024-02-20
3,Charlie,charlie@example.com,2024-03-10
```
This is tabular data: rows (records) and columns (fields).
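Reading the file with pandas (the engine this curriculum configures later) makes the rows and columns concrete; a minimal, self-contained sketch:

```python
import io
import pandas as pd

# The same customers.csv content, inlined so the snippet is self-contained
csv_text = """id,name,email,signup_date
1,Alice,alice@example.com,2024-01-15
2,Bob,bob@example.com,2024-02-20
3,Charlie,charlie@example.com,2024-03-10
"""

df = pd.read_csv(io.StringIO(csv_text))
print(df.shape)          # (3, 4): 3 rows (records), 4 columns (fields)
print(list(df.columns))  # the field names come from the header row
```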
#### Self-Check
- [ ] Can you explain what a CSV file is?
- [ ] What's the difference between a row and a column?
### Day 2: What is a Data Pipeline?

#### Assembly Line Analogy

Imagine a car factory. Raw materials enter one end, go through multiple stations (welding, painting, assembly), and a finished car comes out the other end.

A data pipeline works the same way:

1. Extract: get raw data from somewhere (files, databases, APIs)
2. Transform: clean, reshape, and enrich the data
3. Load: save the result somewhere useful

This is called ETL (Extract, Transform, Load).
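The three steps can be sketched in a few lines of plain pandas. This illustrates the concept only; it is not how Odibi implements pipelines:

```python
import io
import pandas as pd

# Extract: read raw data (an inline CSV stands in for a source file)
raw = pd.read_csv(io.StringIO("id,name\n1,alice \n2,BOB\n"))

# Transform: clean it (trim whitespace, standardize capitalization)
raw["name"] = raw["name"].str.strip().str.title()

# Load: write the result somewhere useful (a CSV string here; a file in practice)
out = raw.to_csv(index=False)
print(out)
```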
#### The Medallion Architecture
Odibi uses a "layered" approach to organize data:
```
┌─────────────────────────────────────────────────────────────┐
│                       YOUR DATA LAKE                        │
├─────────────┬─────────────────┬─────────────────────────────┤
│   BRONZE    │     SILVER      │            GOLD             │
│   (Raw)     │   (Cleaned)     │      (Business-Ready)       │
├─────────────┼─────────────────┼─────────────────────────────┤
│ • As-is     │ • Deduplicated  │ • Aggregated                │
│ • Untouched │ • Typed         │ • Joined                    │
│ • Archived  │ • Validated     │ • Ready for reporting       │
└─────────────┴─────────────────┴─────────────────────────────┘
```
Why layers?

- If something breaks, you can always go back to Bronze
- Each layer has a clear purpose
- Teams can work on different layers independently

Deep Dive: Medallion Architecture Guide
#### Self-Check
- [ ] What does ETL stand for?
- [ ] Why do we have separate Bronze, Silver, and Gold layers?
### Day 3: Introduction to Odibi

#### What is Odibi?

Odibi is a YAML-first data pipeline framework. Instead of writing hundreds of lines of code, you describe what you want in simple configuration files.

#### Installation

Open your terminal and run:

```shell
# Create a virtual environment (recommended)
python -m venv .venv

# Activate it
# Windows:
.venv\Scripts\activate
# Mac/Linux:
source .venv/bin/activate

# Install Odibi
pip install odibi
```

Verify it works:
#### Your First Odibi Project

Let Odibi create a project structure for you:

This creates:

```
my_first_project/
├── odibi.yaml      # Your pipeline configuration
├── data/
│   ├── landing/    # Where raw files arrive
│   ├── bronze/     # Raw data preserved
│   ├── silver/     # Cleaned data
│   └── gold/       # Business-ready data
└── README.md
```
Deep Dive: Getting Started Tutorial
#### Self-Check
- [ ] What command installs Odibi?
- [ ] What folder does raw data go into?
### Day 4: The Bronze Layer

#### Filing Cabinet Analogy

Think of Bronze as your filing cabinet where you store original documents. You never write on the originals; you make copies first.

The Bronze layer:

- Stores data exactly as received
- Never modifies or cleans anything
- Acts as your "source of truth" backup
#### Hands-On: Build a Bronze Pipeline

1. Copy your `customers.csv` to `data/landing/`
2. Edit `odibi.yaml`:

```yaml
project: "my_first_project"
engine: "pandas"

connections:
  local:
    type: local
    base_path: "./data"

story:
  connection: local
  path: stories

system:
  connection: local
  path: system

pipelines:
  - pipeline: bronze_customers
    layer: bronze
    description: "Load raw customer data"
    nodes:
      - name: raw_customers
        description: "Ingest customers from landing zone"
        read:
          connection: local
          path: landing/customers.csv
          format: csv
        write:
          connection: local
          path: bronze/customers
          format: parquet
          mode: overwrite
```

3. Run your pipeline:
4. Check your output:
#### What Just Happened?

- Odibi read your CSV file
- Converted it to Parquet format (more efficient for analytics)
- Saved it to the Bronze layer

No data was modified; it was just preserved in a better format.

Deep Dive: Bronze Layer Tutorial
#### Self-Check
- [ ] Why don't we clean data in Bronze?
- [ ] What format did we convert the CSV to?
### Day 5: Multi-Node Pipelines

#### Train Cars Analogy

A pipeline is like a train. Each node is a train car: they're connected and run in sequence.
#### Hands-On: Add More Data

1. Create `data/landing/orders.csv`:

```csv
order_id,customer_id,product,amount,order_date
1001,1,Widget A,29.99,2024-01-20
1002,2,Widget B,49.99,2024-02-25
1003,1,Widget C,19.99,2024-03-15
1004,3,Widget A,29.99,2024-03-20
```
2. Add a second node to your pipeline:

```yaml
pipelines:
  - pipeline: bronze_ingest
    layer: bronze
    description: "Load all raw data"
    nodes:
      - name: raw_customers
        description: "Ingest customers"
        read:
          connection: local
          path: landing/customers.csv
          format: csv
        write:
          connection: local
          path: bronze/customers
          format: parquet
          mode: overwrite
      - name: raw_orders
        description: "Ingest orders"
        read:
          connection: local
          path: landing/orders.csv
          format: csv
        write:
          connection: local
          path: bronze/orders
          format: parquet
          mode: overwrite
```

3. Run it:
Both datasets are now in your Bronze layer.
#### Self-Check
- [ ] What is a "node" in Odibi?
- [ ] Can a pipeline have multiple nodes?
### Week 1 Summary

You learned:

- Data comes in different formats (CSV, JSON, Parquet)
- Pipelines move data through stages (ETL)
- Medallion architecture organizes data into layers
- Bronze layer stores raw, unmodified data
- Odibi uses YAML configuration to define pipelines

Congratulations! You've built your first data pipeline.
## Week 2: Silver Layer + SCD2 + Data Quality

### Learning Objectives

By the end of this week, you will:

- Clean and transform data in the Silver layer
- Understand and implement SCD2 (history tracking)
- Add data quality checks to catch bad data
- Handle missing values and invalid data

### Prerequisites

Before starting, make sure you have:

- Completed Week 1
- A working Bronze layer with customer and order data
### Day 1: Why Data Cleaning Matters

#### Dirty Kitchen Analogy

Imagine cooking with ingredients covered in dirt, or using expired milk. The result would be... unpleasant.

Bad data causes:

- Wrong business decisions
- Broken reports
- Angry users
- Lost revenue
#### Common Data Problems

| Problem | Example | Impact |
|---|---|---|
| Missing values | `email: NULL` | Can't contact customer |
| Invalid format | `date: "not-a-date"` | Calculations fail |
| Duplicates | Same order twice | Revenue doubled incorrectly |
| Inconsistent values | "CA", "California", "ca" | Grouping breaks |
#### The Silver Layer's Job

The Silver layer is your cleaning station:

- Fix data types (strings to dates, etc.)
- Remove duplicates
- Handle missing values
- Validate data quality
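These cleaning steps can be sketched in plain pandas. This is an illustration of what a Silver-style cleanup does, not Odibi's actual implementation:

```python
import pandas as pd

# Raw "bronze" data with typical quality problems
df = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "email": ["ALICE@EXAMPLE.COM", "bob@example.com", "bob@example.com", None],
    "signup_date": ["2024-01-15", "2024-02-20", "2024-02-20", "2024-03-10"],
})

clean = (
    df.drop_duplicates()  # remove exact duplicate rows
      .assign(
          # fix data types: string -> datetime
          signup_date=lambda d: pd.to_datetime(d["signup_date"]),
          # handle missing values, then standardize casing
          email=lambda d: d["email"].fillna("unknown@example.com").str.lower(),
      )
)
print(clean)
```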
#### Self-Check
- [ ] Name 3 common data quality problems
- [ ] What layer handles data cleaning?
### Day 2: Building a Silver Pipeline

#### Hands-On: Clean Your Customer Data

1. Update `odibi.yaml` to add a Silver pipeline:

```yaml
pipelines:
  # ... your bronze pipeline from Week 1 ...

  - pipeline: silver_customers
    layer: silver
    description: "Clean and standardize customers"
    nodes:
      - name: clean_customers
        description: "Apply cleaning transformations"
        read:
          connection: local
          path: bronze/customers
          format: parquet
        transform:
          - type: rename
            columns:
              id: customer_id
          - type: cast
            columns:
              customer_id: integer
              signup_date: date
          - type: fill_null
            columns:
              email: "unknown@example.com"
          - type: lower
            columns:
              - email
        write:
          connection: local
          path: silver/customers
          format: parquet
          mode: overwrite
```

2. Run it:
#### What Each Transform Does

| Transform | Purpose | Example |
|---|---|---|
| `rename` | Change column names | `id` → `customer_id` |
| `cast` | Change data types | String → Date |
| `fill_null` | Replace missing values | NULL → default value |
| `lower` | Lowercase text | "BOB@EMAIL.COM" → "bob@email.com" |
Deep Dive: Silver Layer Tutorial
#### Self-Check
- [ ] What does `cast` do?
- [ ] Why lowercase email addresses?
### Day 3: Tracking History with SCD2

#### Time Machine Analogy

Imagine you could look at a customer's record as it was 6 months ago. Where did they live? What tier were they?

SCD2 (Slowly Changing Dimension Type 2) makes this possible by:

- Never deleting old records
- Adding new versions when data changes
- Tracking when each version was valid
#### Visual Example

Customer moves from CA to NY on Feb 1:
| customer_id | address | valid_from | valid_to | is_current |
|---|---|---|---|---|
| 101 | CA | 2024-01-01 | 2024-02-01 | false |
| 101 | NY | 2024-02-01 | NULL | true |
Now you can answer: "Where did customer 101 live on January 15th?" → CA
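The core SCD2 move — close the open version, append the new one — can be sketched in plain pandas. Illustrative only; Odibi's `scd2` transformer handles this for you:

```python
import pandas as pd

# Current dimension table: one row per version of each customer
dim = pd.DataFrame({
    "customer_id": [101],
    "address": ["CA"],
    "valid_from": ["2024-01-01"],
    "valid_to": [None],
    "is_current": [True],
})

def scd2_update(dim, customer_id, new_address, change_date):
    """Close the open version for this key, then append the new version."""
    dim = dim.copy()
    mask = (dim["customer_id"] == customer_id) & dim["is_current"]
    dim.loc[mask, "valid_to"] = change_date   # old version ends here
    dim.loc[mask, "is_current"] = False
    new_row = pd.DataFrame([{
        "customer_id": customer_id,
        "address": new_address,
        "valid_from": change_date,
        "valid_to": None,
        "is_current": True,
    }])
    return pd.concat([dim, new_row], ignore_index=True)

# Customer 101 moves from CA to NY on Feb 1
dim = scd2_update(dim, 101, "NY", "2024-02-01")
print(dim)
```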
#### Hands-On: Add SCD2 to Customers

1. First, update your source data. Create `data/landing/customers_update.csv`:

```csv
id,name,email,signup_date,tier
1,Alice,alice@example.com,2024-01-15,Gold
2,Bob,bob_new@example.com,2024-02-20,Silver
3,Charlie,charlie@example.com,2024-03-10,Bronze
4,Diana,diana@example.com,2024-04-01,Gold
```
(Notice: Bob has a new email, and Diana is a new customer)
2. Add an SCD2 node:

```yaml
- pipeline: silver_customers_scd2
  layer: silver
  description: "Track customer history"
  nodes:
    - name: customers_with_history
      description: "Apply SCD2 for full history"
      read:
        connection: local
        path: landing/customers_update.csv
        format: csv
      transformer: scd2
      params:
        connection: local
        path: silver/dim_customers
        keys:
          - id
        track_cols:
          - email
          - tier
        effective_time_col: signup_date
      write:
        connection: local
        path: silver/dim_customers
        format: parquet
        mode: overwrite
```

3. Run it:
Deep Dive: SCD2 Pattern
#### Self-Check
- [ ] What does SCD2 stand for?
- [ ] What column tells you if a record is the current version?
### Day 4: Data Quality Validation

#### Security Guard Analogy
Before entering a building, security checks your ID. Data quality validation checks your data before it enters the Silver layer.
#### Types of Checks
| Check Type | What it does | Example |
|---|---|---|
| not_null | Ensure value exists | customer_id can't be empty |
| unique | No duplicates | Each email is unique |
| range | Value in bounds | age between 0 and 150 |
| regex | Pattern matching | Email contains @ |
| foreign_key | Reference exists | customer_id exists in customers table |
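Two of these checks, `not_null` and `regex`, can be sketched with plain pandas to show how failing rows get separated out (illustrative only, not Odibi's validation engine):

```python
import pandas as pd

df = pd.DataFrame({
    "customer_id": [1, 2, None, 4],
    "email": ["alice@example.com", "not-an-email", "carol@example.com", None],
})

# not_null check: customer_id must exist
bad_id = df["customer_id"].isna()

# regex check: email must look like something@domain
email_ok = df["email"].str.contains(r".*@.*", na=False, regex=True)

quarantine = df[bad_id | ~email_ok]  # failed at least one rule
good = df[~bad_id & email_ok]        # passed every rule
print(len(good), len(quarantine))
```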
#### Hands-On: Add Validation

1. Add validation to your node:

```yaml
- name: clean_customers
  description: "Clean with validation"
  read:
    connection: local
    path: bronze/customers
    format: parquet
  validation:
    rules:
      - column: customer_id
        check: not_null
        on_fail: fail
      - column: email
        check: not_null
        on_fail: warn
      - column: email
        check: regex
        pattern: ".*@.*\\..*"
        on_fail: quarantine
        # Bad rows go to quarantine via on_fail: quarantine on individual tests
  write:
    connection: local
    path: silver/customers
    format: parquet
    mode: overwrite
```
#### Severity Levels

| Level | What happens |
|---|---|
| `warn` | Log warning, continue processing |
| `error` | Quarantine bad rows, continue with good rows |
| `fatal` | Stop the entire pipeline |
Deep Dive: Data Validation
#### Self-Check
- [ ] What does `quarantine` mean?
- [ ] What's the difference between `warn` and `error` severity?
### Day 5: Putting It Together

#### Hands-On: Complete Silver Pipeline

Create a complete Silver pipeline that:

1. Reads from Bronze
2. Cleans and transforms
3. Validates quality
4. Tracks history with SCD2

```yaml
- pipeline: silver_complete
  layer: silver
  description: "Complete silver processing"
  nodes:
    - name: stg_customers
      description: "Stage and clean customers"
      read:
        connection: local
        path: bronze/customers
        format: parquet
      transform:
        - type: rename
          columns:
            id: customer_id
        - type: cast
          columns:
            customer_id: integer
            signup_date: date
        - type: trim
          columns:
            - name
            - email
      validation:
        rules:
          - column: customer_id
            check: not_null
            on_fail: fail
          - column: email
            check: regex
            pattern: ".*@.*"
            on_fail: warn
            # Quarantine is set via on_fail: quarantine on individual tests
      write:
        connection: local
        path: silver/stg_customers
        format: parquet
        mode: overwrite
    - name: dim_customers
      description: "Create customer dimension with history"
      depends_on:
        - stg_customers
      read:
        connection: local
        path: silver/stg_customers
        format: parquet
      transformer: scd2
      params:
        connection: local
        path: silver/dim_customers
        keys:
          - customer_id
        track_cols:
          - name
          - email
        effective_time_col: signup_date
      write:
        connection: local
        path: silver/dim_customers
        format: parquet
        mode: overwrite
```
#### Self-Check

- [ ] What does `depends_on` do?
- [ ] Why do we stage data before applying SCD2?
### Week 2 Summary

You learned:

- Why data cleaning is critical
- How to transform data (rename, cast, fill_null)
- SCD2 tracks historical changes
- Validation catches bad data before it causes problems
- Quarantine isolates bad rows for review

Great progress! Your data is now clean and trackable.
## Week 3: Gold Layer + Dimensional Modeling

### Learning Objectives

By the end of this week, you will:

- Understand Facts vs Dimensions
- Build a star schema
- Use surrogate keys
- Create aggregations for reporting
- Build a complete data warehouse

### Prerequisites

Before starting, make sure you have:

- Completed Weeks 1 and 2
- Working Bronze and Silver layers
### Day 1: Facts vs Dimensions

#### Theater Analogy

Think of a theater production:

- Facts = the events (ticket sales, performances)
- Dimensions = the context (who, what, when, where)

Every fact answers: "What happened?" Every dimension answers: "Tell me more about..."
#### Examples
| Facts (Events) | Dimensions (Context) |
|---|---|
| Order placed | Customer, Product, Date |
| Payment received | Customer, Account, Date |
| Page viewed | User, Page, Date |
#### Visual: A Sales Transaction

```
┌─────────────────────────────────────────────────────────┐
│                       FACT: Order                       │
│         order_id=1001, amount=49.99, quantity=2         │
└───────┬───────────────────┬───────────────────┬─────────┘
        │                   │                   │
        ▼                   ▼                   ▼
┌───────────────┐   ┌───────────────┐   ┌───────────────┐
│ DIM: Customer │   │ DIM: Product  │   │ DIM: Date     │
│ name=Alice    │   │ name=Widget   │   │ date=2024-01  │
│ tier=Gold     │   │ category=HW   │   │ quarter=Q1    │
└───────────────┘   └───────────────┘   └───────────────┘
```
#### Self-Check
- [ ] Is "order amount" a fact or dimension?
- [ ] Is "customer name" a fact or dimension?
### Day 2: Star Schema Basics

#### Star Analogy
A star schema looks like a star: the fact table is in the center, with dimension tables around it like points.
```
                ┌─────────────┐
                │ dim_product │
                └──────┬──────┘
                       │
┌──────────────┐ ┌─────┴──────┐ ┌──────────────┐
│ dim_customer ├─┤ fact_sales ├─┤   dim_date   │
└──────────────┘ └─────┬──────┘ └──────────────┘
                       │
                ┌──────┴───────┐
                │ dim_location │
                └──────────────┘
```

Why stars?

- Simple to understand
- Fast to query (fewer joins)
- Works with every BI tool
#### Dimension Table Structure

```yaml
# dim_customers
customer_key: 1        # Surrogate key (system-generated)
customer_id: "C001"    # Natural key (from source)
name: "Alice"
email: "alice@example.com"
tier: "Gold"
effective_from: "2024-01-01"
effective_to: null
is_current: true
```
#### Fact Table Structure

```yaml
# fact_orders
order_key: 1001
customer_key: 1      # Points to dim_customers
product_key: 42      # Points to dim_products
date_key: 20240120   # Points to dim_date
quantity: 2
amount: 49.99
```
Deep Dive: Dimensional Modeling Guide
#### Self-Check
- [ ] Why is it called a "star" schema?
- [ ] What's in the center of the star?
### Day 3: Surrogate Keys

#### Hotel Room Key Analogy

When you check into a hotel, they give you a room key. This key is:

- Unique to your stay (not your name)
- System-generated (you don't choose it)
- Internal (the hotel manages it)

A surrogate key works the same way:

- A unique identifier for each record
- Generated by the system (not from source data)
- Never changes, even if source data changes

#### Why Not Use Natural Keys?
| Problem | Natural Key Example | Issue |
|---|---|---|
| Changes | SSN gets corrected | Breaks all references |
| Duplicates | "John Smith" | Too common |
| Missing | New customer, no ID yet | Can't insert |
| Composite | firstName + lastName + DOB | Slow to join |
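Conceptually, generating a surrogate key just means assigning one stable integer per unique natural key. A pandas sketch of the idea (not Odibi's implementation):

```python
import pandas as pd

dim = pd.DataFrame({
    "customer_id": ["C001", "C002", "C001", "C003"],  # natural key from source
    "name": ["Alice", "Bob", "Alice", "Charlie"],
})

# factorize assigns one integer per unique value; +1 makes the keys 1-based
codes, _uniques = pd.factorize(dim["customer_id"])
dim["customer_key"] = codes + 1
print(dim["customer_key"].tolist())  # repeated natural keys get the same key
```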
#### Hands-On: Generate Surrogate Keys

Odibi can auto-generate surrogate keys:

```yaml
- pipeline: gold_dimensions
  layer: gold
  description: "Build dimension tables"
  nodes:
    - name: dim_customers
      description: "Customer dimension with surrogate keys"
      read:
        connection: local
        path: silver/dim_customers
        format: parquet
      transform:
        - type: generate_surrogate_key
          key_column: customer_key
          source_columns:
            - customer_id
      write:
        connection: local
        path: gold/dim_customers
        format: parquet
        mode: overwrite
```

The `generate_surrogate_key` transform creates a unique integer for each unique combination of source columns.
Deep Dive: Dimension Pattern
#### Self-Check
- [ ] What's wrong with using email as a primary key?
- [ ] Who generates surrogate keys โ the source system or our data warehouse?
### Day 4: Building Fact Tables

#### Hands-On: Create a Sales Fact Table

1. First, ensure you have a date dimension. Create `data/landing/dates.csv`:

```csv
date_key,full_date,year,quarter,month,day_of_week
20240115,2024-01-15,2024,Q1,January,Monday
20240120,2024-01-20,2024,Q1,January,Saturday
20240220,2024-02-20,2024,Q1,February,Tuesday
20240225,2024-02-25,2024,Q1,February,Sunday
20240310,2024-03-10,2024,Q1,March,Sunday
20240315,2024-03-15,2024,Q1,March,Friday
20240320,2024-03-20,2024,Q1,March,Wednesday
```

2. Build the fact table:
```yaml
- name: fact_orders
  description: "Order fact table"
  depends_on:
    - dim_customers
  read:
    - connection: local
      path: silver/orders
      format: parquet
      alias: orders
    - connection: local
      path: gold/dim_customers
      format: parquet
      alias: customers
  transform:
    - type: join
      left: orders
      right: customers
      on:
        - left: customer_id
          right: customer_id
      how: left
      filter: "is_current = true"   # Only join to current customer version
    - type: select
      columns:
        - order_id
        - customer_key
        - product
        - amount
        - order_date
    - type: cast
      columns:
        order_date: date
    - type: add_column
      name: date_key
      expression: "date_format(order_date, 'yyyyMMdd')"
  write:
    connection: local
    path: gold/fact_orders
    format: parquet
    mode: overwrite
```
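The join-to-current-version step above can be sketched in plain pandas (illustrative data and column names; not Odibi's join implementation):

```python
import pandas as pd

orders = pd.DataFrame({
    "order_id": [1001, 1002],
    "customer_id": [1, 2],
    "amount": [29.99, 49.99],
    "order_date": ["2024-01-20", "2024-02-25"],
})
dim_customers = pd.DataFrame({
    "customer_key": [10, 11, 12],
    "customer_id": [1, 1, 2],
    "is_current": [False, True, True],  # SCD2 versions of each customer
})

# Join facts only to the *current* dimension version, then derive date_key
current = dim_customers[dim_customers["is_current"]]
fact = orders.merge(current[["customer_id", "customer_key"]],
                    on="customer_id", how="left")
fact["date_key"] = pd.to_datetime(fact["order_date"]).dt.strftime("%Y%m%d")
print(fact[["order_id", "customer_key", "date_key"]])
```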
Deep Dive: Fact Pattern
#### Self-Check
- [ ] Why do we filter for `is_current = true` when joining?
- [ ] What's the purpose of `date_key`?
### Day 5: Aggregations for Reporting

#### Summary Reports Analogy

Instead of reading every receipt, store managers want:

- "Total sales this month"
- "Average order size by customer tier"
- "Top 10 products"

Aggregations pre-compute these summaries.
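The same kind of summary can be sketched with a pandas `groupby` (illustrative data; the metric names mirror the YAML below):

```python
import pandas as pd

fact_orders = pd.DataFrame({
    "customer_key": [1, 1, 2, 3],
    "amount": [29.99, 19.99, 49.99, 29.99],
})

# Pre-compute one summary row per customer
agg = (fact_orders
       .groupby("customer_key")
       .agg(total_orders=("amount", "count"),
            total_revenue=("amount", "sum"),
            avg_order_value=("amount", "mean"))
       .reset_index())
print(agg)
```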
#### Hands-On: Build an Aggregation

```yaml
- pipeline: gold_aggregations
  layer: gold
  description: "Pre-computed summaries"
  nodes:
    - name: agg_sales_by_customer
      description: "Sales summary per customer"
      read:
        connection: local
        path: gold/fact_orders
        format: parquet
      pattern:
        type: aggregation
        params:
          group_by:
            - customer_key
          metrics:
            - name: total_orders
              expression: "count(*)"
            - name: total_revenue
              expression: "sum(amount)"
            - name: avg_order_value
              expression: "avg(amount)"
            - name: first_order_date
              expression: "min(order_date)"
            - name: last_order_date
              expression: "max(order_date)"
      write:
        connection: local
        path: gold/agg_sales_by_customer
        format: parquet
        mode: overwrite
```
Deep Dive: Aggregation Pattern
#### Self-Check
- [ ] Why pre-compute aggregations instead of calculating on-the-fly?
- [ ] What does `group_by` do?
### Week 3 Summary

You learned:

- Facts record events, dimensions provide context
- Star schemas are simple and fast
- Surrogate keys are stable, system-generated identifiers
- Fact tables link to dimensions via keys
- Aggregations pre-compute summaries for fast reporting

Amazing work! You've built a complete data warehouse.
## Week 4: Production Deployment + Best Practices

### Learning Objectives

By the end of this week, you will:

- Configure connections for different environments
- Implement error handling and retry logic
- Add monitoring and logging
- Tune performance for large datasets
- Deploy to production with confidence

### Prerequisites

Before starting, make sure you have:

- Completed Weeks 1-3
- A complete Bronze → Silver → Gold pipeline
### Day 1: Connections and Environments

#### Different Homes Analogy

Your pipeline needs to work in different "homes":

- Development: your laptop, small test data
- Staging: test server, realistic data
- Production: the real deal, live data

Each environment has different connection details.
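The environment-selection idea can be sketched in a few lines of Python. The connection settings here are hypothetical; the pattern mirrors the `${ODIBI_ENV:dev}` default used in the config below:

```python
import os

# Mirrors ${ODIBI_ENV:dev}: read the variable, fall back to "dev" if unset
env = os.environ.get("ODIBI_ENV", "dev")

# Hypothetical per-environment connection settings
connections = {
    "dev":  {"type": "local", "base_path": "./data"},
    "prod": {"type": "azure_blob", "container": "prod-data"},
}

# Unknown environments fall back to dev settings
settings = connections.get(env, connections["dev"])
print(settings["type"])
```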
#### Hands-On: Configure Environments

```yaml
project: "my_project"
engine: "pandas"

# Global variables
vars:
  env: ${ODIBI_ENV:dev}   # Default to 'dev' if not set

# Environment-specific overrides
environments:
  dev:
    connections:
      data_lake:
        type: local
        base_path: "./data"
  staging:
    connections:
      data_lake:
        type: azure_blob
        account_name: "mystorageacct"
        container: "staging-data"
        credential: ${AZURE_STORAGE_KEY}
  prod:
    connections:
      data_lake:
        type: azure_blob
        account_name: "prodstorageacct"
        container: "prod-data"
        credential: ${AZURE_STORAGE_KEY}

connections:
  data_lake:
    type: local
    base_path: "./data"
```
Run for a specific environment:
Deep Dive: Environments Guide
#### Self-Check
- [ ] Why use environment variables for credentials?
- [ ] What's the default environment if `ODIBI_ENV` isn't set?
### Day 2: Error Handling and Retry Logic

#### Retry Analogy

If your phone call fails, you try again. Networks are unreliable and databases time out; retries handle temporary failures.

#### Hands-On: Configure Retries

```yaml
project: "production_pipeline"
engine: "spark"

retry:
  enabled: true
  max_attempts: 3
  backoff: exponential   # Wait longer between each retry
  initial_delay: 5       # First retry after 5 seconds
  max_delay: 300         # Never wait more than 5 minutes

pipelines:
  - pipeline: bronze_ingest
    nodes:
      - name: fetch_api_data
        retry:
          max_attempts: 5   # Override for this node
        read:
          connection: external_api
          path: /customers
          format: json
```
#### Backoff Strategies

| Strategy | Wait times (5s initial) | Best for |
|---|---|---|
| `constant` | 5s, 5s, 5s | Simple cases |
| `linear` | 5s, 10s, 15s | Gradual increase |
| `exponential` | 5s, 10s, 20s, 40s | API rate limits |
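Exponential backoff is easy to see in code. A minimal sketch of the behavior the retry config describes (illustrative; Odibi's retry handling does this for you):

```python
import time

def retry_with_backoff(fn, max_attempts=3, initial_delay=5, max_delay=300):
    """Call fn(), retrying on failure with exponentially growing waits."""
    delay = initial_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise                      # out of attempts: give up
            time.sleep(min(delay, max_delay))
            delay *= 2                     # waits grow: 5s, 10s, 20s, 40s, ...
```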
#### Handling Failures

| Action | Behavior |
|---|---|
| `fail` | Stop entire pipeline (default) |
| `warn` | Log the issue, continue processing |
| `quarantine` | Route failing rows to quarantine table |
#### Self-Check
- [ ] What does "exponential backoff" mean?
- [ ] When would you use `on_fail: warn`?
### Day 3: Monitoring and Logging

#### Dashboard Analogy

A pilot needs instruments to fly safely. You need monitoring to run pipelines safely.

#### Hands-On: Configure Logging
```yaml
logging:
  level: INFO             # DEBUG, INFO, WARNING, ERROR
  structured: true        # JSON format for log aggregators
  include_metrics: true   # Row counts, timing

alerts:
  - type: slack
    url: ${SLACK_WEBHOOK_URL}
    on_events:
      - on_failure
      - on_success
  - type: email
    to:
      - data-team@company.com
    on_events:
      - on_failure
```
#### What Gets Logged

Every pipeline run generates a Data Story with:

- Start/end timestamps
- Row counts (read/written/quarantined)
- Validation results
- Error messages

View your story:
#### Self-Check
- [ ] What's the difference between INFO and DEBUG logging?
- [ ] What is a "Data Story"?
### Day 4: Performance Tuning

#### Race Car Analogy

A race car needs tuning to go fast. Data pipelines need tuning for large datasets.

#### Key Performance Levers
| Lever | When to use | Configuration |
|---|---|---|
| Partitioning | Large tables (>1M rows) | Split data by date/category |
| Caching | Reused datasets | Keep in memory |
| Parallelism | Multiple nodes | Run independent nodes together |
| Batch size | Memory limits | Process in chunks |
#### Hands-On: Add Partitioning

```yaml
write:
  connection: data_lake
  path: gold/fact_orders
  format: delta
  mode: overwrite
  partition_by:
    - order_year
    - order_month
```
#### Hands-On: Enable Caching

```yaml
- name: dim_customers
  cache: true   # Keep in memory for downstream nodes
  read:
    connection: local
    path: silver/dim_customers
```
#### Hands-On: Performance Config

```yaml
performance:
  max_parallel_nodes: 4     # Run up to 4 nodes simultaneously
  batch_size: 100000        # Process 100k rows at a time
  shuffle_partitions: 200   # Spark shuffle partitions
```
Deep Dive: Performance Tuning Guide
#### Self-Check
- [ ] Why partition by date?
- [ ] What does caching do?
### Day 5: Production Deployment Checklist

#### Launch Checklist

Before deploying to production, verify:

##### Configuration
- [ ] All secrets use environment variables (never hardcoded)
- [ ] Correct environment settings for prod
- [ ] Retry logic enabled
- [ ] Alerts configured
##### Data Quality
- [ ] Validation rules on all critical columns
- [ ] Quarantine configured for bad rows
- [ ] Foreign key checks enabled
##### Performance
- [ ] Partitioning on large tables
- [ ] Appropriate parallelism
- [ ] Tested with production-scale data
##### Operations
- [ ] Logging at INFO level
- [ ] Monitoring dashboard set up
- [ ] Runbook for common failures
- [ ] Backup/restore procedures documented
#### Complete Production Config

```yaml
project: "customer360"
engine: "spark"
version: "1.0.0"
owner: "data-team@company.com"
description: "Customer analytics pipeline"

vars:
  env: ${ODIBI_ENV:prod}

retry:
  enabled: true
  max_attempts: 3
  backoff: exponential

logging:
  level: INFO
  structured: true
  include_metrics: true

alerts:
  - type: slack
    url: ${SLACK_WEBHOOK}
    on_events: [on_failure]

performance:
  max_parallel_nodes: 8
  batch_size: 500000

connections:
  data_lake:
    type: azure_blob
    account_name: ${AZURE_STORAGE_ACCOUNT}
    container: "prod-data"
    credential: ${AZURE_STORAGE_KEY}

story:
  connection: data_lake
  path: _odibi/stories

system:
  connection: data_lake
  path: _odibi/system

pipelines:
  # ... your pipelines ...
```
Deep Dive: Production Deployment Guide
#### Self-Check
- [ ] What should NEVER be hardcoded in config?
- [ ] What logging level is recommended for production?
### Week 4 Summary

You learned:

- Environments separate dev/staging/prod configurations
- Retry logic handles temporary failures
- Logging and alerts keep you informed
- Partitioning and caching improve performance
- A production checklist prevents common mistakes

Congratulations! You've completed the Odibi curriculum!
## What's Next?

Now that you've completed the basics:

- Build a real project: apply what you learned to actual data
- Explore advanced patterns: browse all patterns
- Learn the CLI: CLI Master Guide
- Join the community: share your projects, ask questions
## Quick Reference Links
| Topic | Link |
|---|---|
| All Patterns | ../patterns/README.md |
| YAML Reference | ../reference/yaml_schema.md |
| Best Practices | ../guides/best_practices.md |
| Troubleshooting | ../troubleshooting.md |
Built with ❤️ for data engineers who are just getting started.