Junior Data Engineer Learning Journey¶

📌 Who Is This For?¶

New grads, bootcamp graduates, and career switchers who need to: - Run and debug data pipelines - Copy and adapt working configurations - Choose the right engine (Pandas vs Spark) - Fix common errors independently

Prerequisites: Basic Python and SQL knowledge

⏱️ Time to Complete¶

4-6 hours (can be broken into 1-hour sessions)

🎯 Learning Outcomes¶

By the end of this journey, you will be able to:

✅ Run a complete pipeline from CSV to Parquet to Delta
✅ Adapt canonical examples for your own data
✅ Choose between Pandas, Polars, and Spark engines
✅ Debug pipelines using odibi doctor, validate, and graph
✅ Add data quality validation gates
✅ Understand incremental loading patterns
✅ Read and modify YAML configs confidently

📋 Prerequisites¶

Python 3.9+ installed
Terminal/command line basics (cd, ls, pip)
SQL familiarity (SELECT, WHERE, JOIN)
Git (optional but helpful)

📚 Learning Modules¶

Module 1: Your First Pipeline (45 min)¶

📖 Read¶

Installation Guide - Install Odibi
Golden Path - Read Steps 1-5

👀 Watch¶

Video: "Zero to First Story in 7 Minutes" ← Coming soon

✋ Do¶

Run the Golden Path end-to-end:

# 1. Install
pip install odibi

# 2. Verify installation
odibi --version
odibi list transformers

# 3. Create project (Option A - recommended)
odibi init my_first_project --template star-schema
cd my_first_project

# 4. Run the pipeline
odibi run odibi.yaml

# 5. View the Data Story
odibi story list
odibi story last

✅ Verify¶

[ ] Pipeline ran without errors
[ ] You see dim_customer, dim_product, dim_date, fact_sales in data/gold/
[ ] Data Story HTML opened in your browser
[ ] You can find the row counts in the Story

Troubleshooting: - If odibi init fails, see Installation Guide - If pipeline fails, run odibi doctor for diagnostics

Module 2: Engine Decision Tree (30 min)¶

📖 Read¶

Decision Guide - "Choose Your Engine" section
Engine Parity Rule

👀 Visual¶

Data size?
├─► < 1GB        → engine: pandas (default)
├─► 1-10GB       → engine: polars (fast local)
└─► > 10GB       → engine: spark (distributed)
    └─► Delta Lake → engine: spark (required)

✋ Do¶

Edit my_first_project/odibi.yaml and add this line at the top:

version: 1
engine: pandas  # Try: pandas, polars, spark
project: my_first_project

Run with different engines:

# 1. Pandas (default - already ran this)
odibi run odibi.yaml

# 2. Polars (install first: pip install "odibi[polars]")
# Edit odibi.yaml: engine: polars
odibi run odibi.yaml

# 3. Compare Stories
odibi story list

Questions: 1. Did both engines produce the same row counts? 2. Which was faster for this small dataset? 3. Which Story shows more detailed execution logs?

✅ Verify¶

[ ] You understand when to use Pandas vs Spark
[ ] You successfully changed the engine setting
[ ] You know that Pandas is best for local development

Module 3: Canonical Examples Deep Dive (60 min)¶

📖 Read¶

✋ Do¶

Exercise 1: Run Hello World

cd docs/examples/canonical/runnable
odibi run 01_hello_world.yaml
ls data/output/  # See the parquet file

Exercise 2: Modify Hello World for Your Data

Create my_data.csv:

id,product,price,category
1,Widget,10.99,Tools
2,Gadget,25.50,Electronics
3,Doohickey,5.00,Tools

Copy and modify:

cp 01_hello_world.yaml my_hello_world.yaml

Edit my_hello_world.yaml:

# Change path to your CSV
read:
  path: my_data.csv  # Instead of sample.csv

# Change output path
write:
  path: my_output  # Instead of output

Run it:
```
odibi run my_hello_world.yaml
```

Exercise 3: Add a Transformation

Add a filter step to keep only Tools:

transform:
  steps:
    - "SELECT * FROM load_data WHERE category = 'Tools'"

Re-run and verify only 2 rows in output.

✅ Verify¶

[ ] You adapted Example 1 for your own CSV
[ ] You added a SQL transformation successfully
[ ] Output parquet file has the filtered data

Module 4: Debugging Toolkit (45 min)¶

📖 Read¶

CLI Master Guide - Focus on doctor, validate, graph
Troubleshooting Guide

👀 Watch¶

Video: "Debug with doctor, validate, graph" ← Coming soon

✋ Do¶

Exercise 1: Validate a Config

# Validate before running
odibi validate my_hello_world.yaml

# Look for:
# - ✅ YAML syntax is valid
# - ✅ All connections are defined
# - ✅ No circular dependencies

Exercise 2: Visualize the DAG

odibi graph my_hello_world.yaml
# Opens a visual dependency graph

Exercise 3: Diagnose Issues

odibi doctor

# Checks:
# - Python version
# - Required packages
# - Connection accessibility
# - File permissions

Exercise 4: Intentionally Break Something

Break your config and practice debugging:

# Break 1: Typo in connection name
read:
  connection: raw_dataa  # Extra 'a'

Run odibi validate - what error do you get?

# Break 2: Circular dependency
nodes:
  - name: node_a
    depends_on: [node_b]
  - name: node_b
    depends_on: [node_a]

Run odibi validate - what error do you get?

✅ Verify¶

[ ] You can validate configs before running
[ ] You can visualize pipeline DAGs
[ ] You can diagnose errors using doctor
[ ] You understand common error messages

Module 5: Data Quality Gates (60 min)¶

📖 Read¶

👀 Visual¶

Read → [Contract Check] → Transform → [Validation Check] → Write
         ↓ fail early         ↓ verify output

✋ Do¶

Exercise 1: Add Validation Tests

Edit my_hello_world.yaml and add validation:

nodes:
  - name: load_data
    read:
      connection: local
      path: my_data.csv
      format: csv

    # Add validation BEFORE write
    validation:
      tests:
        - type: not_null
          columns: [id, product]
        - type: unique
          columns: [id]
        - type: row_count
          min: 1
      gate:
        on_fail: abort  # Stop if validation fails

    write:
      connection: local
      path: my_output
      format: parquet

Run and check the Story for validation results.

Exercise 2: Test Failure Behavior

Change on_fail: warn_and_write
Break the data (add duplicate IDs to my_data.csv)
Re-run - pipeline completes but logs warnings

Exercise 3: Add Contracts

nodes:
  - name: load_data
    # Add contracts BEFORE read
    contracts:
      - type: not_null
        columns: [id, product, price]
      - type: accepted_values
        column: category
        values: [Tools, Electronics, Home]

    read:
      ...

Add a row with category: BadValue and watch the contract fail immediately.

✅ Verify¶

[ ] You added validation tests successfully
[ ] You understand fail vs warn behavior
[ ] You know the difference between contracts (input) and validation (output)
[ ] Validation results appear in the Data Story

Module 6: Patterns Overview (30 min)¶

📖 Read¶

✋ Do¶

Pick ONE pattern that interests you and wire it into Example 5:

Option A: Add Merge Logic

transformer: merge
params:
  target:
    connection: gold
    path: existing_table
  keys: [id]
  strategy: upsert  # insert new, update existing

Option B: Add Incremental Loading

read:
  connection: source
  path: transactions.csv
  incremental:
    mode: stateful
    column: updated_at

Run the pipeline twice and verify: - First run: loads all data - Second run: skips unchanged data (check Story for "rows skipped")

✅ Verify¶

[ ] You understand what problems patterns solve
[ ] You successfully applied one pattern
[ ] You can explain merge vs append vs overwrite

🏆 Capstone Project¶

Build a Complete Pipeline from Scratch

Requirements¶

Create a pipeline that: 1. Reads data from a CSV 2. Filters bad rows 3. Adds validation gates 4. Handles failures gracefully (quarantine or warn) 5. Writes to Parquet 6. Generates a Data Story

Starter Data¶

Create sales_data.csv:

order_id,customer_id,product_id,amount,order_date
1,101,501,150.00,2025-01-01
2,102,502,25.50,2025-01-02
3,103,,99.99,2025-01-03
4,104,504,-10.00,invalid-date
5,105,505,1000.00,2025-01-05

(Notice: Row 3 has NULL product_id, Row 4 has invalid date and negative amount)

Your Pipeline Must:¶

Filter out negative amounts
Validate:
order_id is unique
customer_id is not null
amount > 0
Handle failures: Use gate on_fail: warn_and_write or test on_fail: warn and document WHY in YAML comments
Generate a Story with explanation

Verification Script¶

Run this after your pipeline:

python examples/verify/verify_capstone_jr_de.py

Expected output:

✅ Output file exists
✅ Row count correct (3 valid rows)
✅ No negative amounts
✅ All customer_ids are populated
✅ Story generated
✅ Validation section exists in Story

✅ Verify¶

[ ] Your pipeline runs without errors
[ ] Verify script passes all checks
[ ] Data Story clearly shows validation results
[ ] You added comments explaining your choices

➡️ Next Steps¶

You've completed the Junior DE journey! Here's where to go next:

Deepen Your Skills¶

CLI Master Guide - Learn advanced CLI commands and debugging
Writing Transformations - Custom Python functions
Dimensional Modeling Tutorial - Star schemas

Production Readiness¶

Spark Engine Tutorial - Scale to big data
Azure Connections - Connect to cloud storage
Performance Tuning - Optimize pipelines

Level Up¶

Sr Data Engineer Journey - Design production systems

Completed the capstone? We want to celebrate with you!

Share your Data Story in GitHub Discussions
Tag us on LinkedIn with #OdibiJourney
Add "Odibi Junior Data Engineer" to your resume!

Questions? Troubleshooting Guide | FAQ

Junior Data Engineer Learning Journey¶

📌 Who Is This For?¶

⏱️ Time to Complete¶

🎯 Learning Outcomes¶

📋 Prerequisites¶

📚 Learning Modules¶

Module 1: Your First Pipeline (45 min)¶

📖 Read¶

👀 Watch¶

✋ Do¶

✅ Verify¶

Module 2: Engine Decision Tree (30 min)¶

📖 Read¶

👀 Visual¶

✋ Do¶

✅ Verify¶

Module 3: Canonical Examples Deep Dive (60 min)¶

📖 Read¶

✋ Do¶

✅ Verify¶

Module 4: Debugging Toolkit (45 min)¶

📖 Read¶

👀 Watch¶

✋ Do¶

✅ Verify¶

Module 5: Data Quality Gates (60 min)¶

📖 Read¶

👀 Visual¶

✋ Do¶

✅ Verify¶

Module 6: Patterns Overview (30 min)¶

📖 Read¶

✋ Do¶

✅ Verify¶

🏆 Capstone Project¶

Requirements¶

Starter Data¶

Your Pipeline Must:¶

Verification Script¶

✅ Verify¶

➡️ Next Steps¶

Deepen Your Skills¶

Production Readiness¶

Level Up¶

📣 Share Your Success¶