Junior Data Engineer Learning Journey
📌 Who Is This For?
New grads, bootcamp graduates, and career switchers who need to:
- Run and debug data pipelines
- Copy and adapt working configurations
- Choose the right engine (Pandas vs Spark)
- Fix common errors independently
Prerequisites: Basic Python and SQL knowledge
⏱️ Time to Complete
4-6 hours (can be broken into 1-hour sessions)
🎯 Learning Outcomes
By the end of this journey, you will be able to:
✅ Run a complete pipeline from CSV to Parquet to Delta
✅ Adapt canonical examples for your own data
✅ Choose between Pandas, Polars, and Spark engines
✅ Debug pipelines using odibi doctor, validate, and graph
✅ Add data quality validation gates
✅ Understand incremental loading patterns
✅ Read and modify YAML configs confidently
📋 Prerequisites
- Python 3.9+ installed
- Terminal/command line basics (cd, ls, pip)
- SQL familiarity (SELECT, WHERE, JOIN)
- Git (optional but helpful)
📚 Learning Modules
Module 1: Your First Pipeline (45 min)
📖 Read
- Installation Guide - Install Odibi
- Golden Path - Read Steps 1-5
👀 Watch
- Video: "Zero to First Story in 7 Minutes" ← Coming soon
✋ Do
Run the Golden Path end-to-end:
# 1. Install
pip install odibi
# 2. Verify installation
odibi --version
odibi list transformers
# 3. Create project (Option A - recommended)
odibi init my_first_project --template star-schema
cd my_first_project
# 4. Run the pipeline
odibi run odibi.yaml
# 5. View the Data Story
odibi story list
odibi story last
✅ Verify
- [ ] Pipeline ran without errors
- [ ] You see dim_customer, dim_product, dim_date, fact_sales in data/gold/
- [ ] Data Story HTML opened in your browser
- [ ] You can find the row counts in the Story
Troubleshooting:
- If odibi init fails, see Installation Guide
- If pipeline fails, run odibi doctor for diagnostics
Module 2: Engine Decision Tree (30 min)
📖 Read
- Decision Guide - "Choose Your Engine" section
- Engine Parity Rule
👀 Visual
Data size?
├─► < 1GB → engine: pandas (default)
├─► 1-10GB → engine: polars (fast local)
└─► > 10GB → engine: spark (distributed)
└─► Delta Lake → engine: spark (required)
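The decision tree above can also be written as a small helper, which makes the thresholds easy to test. This is only a sketch of the rule of thumb: the function name choose_engine and its parameters are hypothetical and not part of Odibi, and it reads the last branch as "Delta Lake always means Spark", which is an interpretation of the diagram.

```python
def choose_engine(size_gb: float, needs_delta: bool = False) -> str:
    """Return the engine name suggested by the decision tree above.

    Hypothetical helper for illustration; not part of the Odibi API.
    """
    if needs_delta:
        # The tree marks Spark as required for Delta Lake.
        return "spark"
    if size_gb < 1:
        return "pandas"  # default engine
    if size_gb <= 10:
        return "polars"  # fast local processing
    return "spark"  # distributed processing


for size in (0.2, 5, 50):
    print(f"{size} GB -> engine: {choose_engine(size)}")
```

The value returned is what you would put after engine: in odibi.yaml.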
✋ Do
Edit my_first_project/odibi.yaml and add an engine line at the top (engine: pandas is the default):
Run with different engines:
# 1. Pandas (default - already ran this)
odibi run odibi.yaml
# 2. Polars (install first: pip install "odibi[polars]")
# Edit odibi.yaml: engine: polars
odibi run odibi.yaml
# 3. Compare Stories
odibi story list
Questions:
1. Did both engines produce the same row counts?
2. Which was faster for this small dataset?
3. Which Story shows more detailed execution logs?
✅ Verify
- [ ] You understand when to use Pandas vs Spark
- [ ] You successfully changed the engine setting
- [ ] You know that Pandas is best for local development
Module 3: Canonical Examples Deep Dive (60 min)
📖 Read
✋ Do
Exercise 1: Run Hello World
cd docs/examples/canonical/runnable
odibi run 01_hello_world.yaml
ls data/output/ # See the parquet file
Exercise 2: Modify Hello World for Your Data
- Create my_data.csv:
- Copy and modify:
- Edit my_hello_world.yaml:
- Run it:
Exercise 3: Add a Transformation
Add a filter step to keep only Tools:
Re-run and verify only 2 rows in output.
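To sanity-check the expected row count independently of Odibi, the same filter can be reproduced in plain Python. The sample rows below are invented for illustration (the original my_data.csv contents are not shown here); the column names id, product, category, and price are taken from the validation exercises later in this journey.

```python
import csv
import io

# Hypothetical contents for my_data.csv; the rows are invented for illustration.
SAMPLE_CSV = """id,product,category,price
1,Hammer,Tools,12.50
2,Desk Lamp,Home,30.00
3,Wrench,Tools,9.99
4,Headphones,Electronics,59.00
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))

# Same logic as a SQL filter such as: WHERE category = 'Tools'
tools_only = [row for row in rows if row["category"] == "Tools"]

print(f"{len(rows)} rows in, {len(tools_only)} rows out")
```

If your own file has a different number of Tools rows, expect that number in the output instead of 2.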
✅ Verify
- [ ] You adapted Example 1 for your own CSV
- [ ] You added a SQL transformation successfully
- [ ] Output parquet file has the filtered data
Module 4: Debugging Toolkit (45 min)
📖 Read
- CLI Master Guide - Focus on doctor, validate, graph
- Troubleshooting Guide
👀 Watch
- Video: "Debug with doctor, validate, graph" ← Coming soon
✋ Do
Exercise 1: Validate a Config
# Validate before running
odibi validate my_hello_world.yaml
# Look for:
# - ✅ YAML syntax is valid
# - ✅ All connections are defined
# - ✅ No circular dependencies
Exercise 2: Visualize the DAG
Exercise 3: Diagnose Issues
odibi doctor
# Checks:
# - Python version
# - Required packages
# - Connection accessibility
# - File permissions
Exercise 4: Intentionally Break Something
Break your config and practice debugging:
Run odibi validate - what error do you get?
# Break 2: Circular dependency
nodes:
  - name: node_a
    depends_on: [node_b]
  - name: node_b
    depends_on: [node_a]
Run odibi validate - what error do you get?
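To see why a config like Break 2 cannot run, it helps to try ordering the nodes yourself. The sketch below uses Python's standard graphlib module (Python 3.9+) to detect the cycle. It illustrates the general idea of dependency checking and is not Odibi's implementation; find_cycle is a hypothetical helper name.

```python
from graphlib import CycleError, TopologicalSorter

# Node name -> names it depends on, mirroring the broken config above.
depends_on = {
    "node_a": ["node_b"],
    "node_b": ["node_a"],
}


def find_cycle(graph):
    """Return the nodes forming a cycle, or None if the graph is a valid DAG."""
    try:
        # static_order() is a generator; consuming it triggers the check.
        list(TopologicalSorter(graph).static_order())
    except CycleError as err:
        return err.args[1]  # graphlib puts the cycle's nodes in args[1]
    return None


print(find_cycle(depends_on))
print(find_cycle({"node_a": [], "node_b": ["node_a"]}))
```

The first call reports the two nodes that depend on each other; the second returns None because node_a can run first and node_b second.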
✅ Verify
- [ ] You can validate configs before running
- [ ] You can visualize pipeline DAGs
- [ ] You can diagnose errors using doctor
- [ ] You understand common error messages
Module 5: Data Quality Gates (60 min)
📖 Read
👀 Visual
✋ Do
Exercise 1: Add Validation Tests
Edit my_hello_world.yaml and add validation:
nodes:
  - name: load_data
    read:
      connection: local
      path: my_data.csv
      format: csv
    # Add validation BEFORE write
    validation:
      tests:
        - type: not_null
          columns: [id, product]
        - type: unique
          columns: [id]
        - type: row_count
          min: 1
      gate:
        on_fail: abort # Stop if validation fails
    write:
      connection: local
      path: my_output
      format: parquet
Run and check the Story for validation results.
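If the three test types are new to you, the following sketch shows what each one checks, using plain Python on in-memory rows. It is an illustration of the semantics only, under the assumption that the tests behave the way their names suggest. The function names and sample rows are hypothetical, and this is not how Odibi implements them.

```python
from collections import Counter

# Hypothetical rows standing in for my_data.csv.
rows = [
    {"id": 1, "product": "Hammer"},
    {"id": 2, "product": "Wrench"},
    {"id": 2, "product": None},  # duplicate id and missing product
]


def not_null(rows, columns):
    """Pass only if every listed column has a value in every row."""
    return all(row.get(col) not in (None, "") for row in rows for col in columns)


def unique(rows, columns):
    """Pass only if each combination of the listed columns appears once."""
    counts = Counter(tuple(row.get(col) for col in columns) for row in rows)
    return all(n == 1 for n in counts.values())


def row_count(rows, min_rows):
    """Pass only if there are at least min_rows rows."""
    return len(rows) >= min_rows


results = {
    "not_null[id, product]": not_null(rows, ["id", "product"]),
    "unique[id]": unique(rows, ["id"]),
    "row_count[min=1]": row_count(rows, 1),
}
for name, passed in results.items():
    print(f"{'PASS' if passed else 'FAIL'}  {name}")
```

With the sample rows, the first two checks fail and the third passes, which is the kind of mixed result a gate has to act on.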
Exercise 2: Test Failure Behavior
- Change the gate to on_fail: warn_and_write
- Break the data (add duplicate IDs to my_data.csv)
- Re-run: the pipeline completes but logs warnings
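The difference between the two gate settings used in these exercises can be pictured as a small function that decides whether the write step runs. This is a conceptual sketch covering only abort and warn_and_write; apply_gate and GateAborted are hypothetical names, and Odibi's real gate may behave differently in its details.

```python
import logging

logging.basicConfig(level=logging.WARNING, format="%(levelname)s %(message)s")
log = logging.getLogger("gate_sketch")


class GateAborted(Exception):
    """Raised when the gate stops the pipeline (hypothetical name)."""


def apply_gate(failed_tests, on_fail):
    """Return True if the write step should proceed."""
    if not failed_tests:
        return True
    if on_fail == "abort":
        # Nothing is written; the run stops here.
        raise GateAborted(f"validation failed: {', '.join(failed_tests)}")
    if on_fail == "warn_and_write":
        for name in failed_tests:
            log.warning("validation failed: %s (writing anyway)", name)
        return True
    raise ValueError(f"unsupported on_fail value: {on_fail!r}")


print(apply_gate(["unique[id]"], on_fail="warn_and_write"))
```

Calling apply_gate with on_fail="abort" and the same failed test raises GateAborted instead of returning, which mirrors a run that stops before the write.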
Exercise 3: Add Contracts
nodes:
  - name: load_data
    # Add contracts BEFORE read
    contracts:
      - type: not_null
        columns: [id, product, price]
      - type: accepted_values
        column: category
        values: [Tools, Electronics, Home]
    read:
      ...
Add a row with category: BadValue and watch the contract fail immediately.
✅ Verify
- [ ] You added validation tests successfully
- [ ] You understand fail vs warn behavior
- [ ] You know the difference between contracts (input) and validation (output)
- [ ] Validation results appear in the Data Story
Module 6: Patterns Overview (30 min)
📖 Read
✋ Do
Pick ONE pattern that interests you and wire it into Example 5:
Option A: Add Merge Logic
transformer: merge
params:
  target:
    connection: gold
    path: existing_table
  keys: [id]
  strategy: upsert # insert new, update existing
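For intuition about what an upsert does with keys: [id], here is a minimal in-memory sketch. It shows the general upsert semantics (update rows whose key already exists, insert the rest) and is not Odibi's merge transformer; the upsert function and the sample rows are hypothetical.

```python
def upsert(target_rows, source_rows, keys):
    """Merge source into target: update rows whose keys match, insert the rest."""
    merged = {tuple(row[k] for k in keys): row for row in target_rows}
    for row in source_rows:
        key = tuple(row[k] for k in keys)
        # Existing key: overlay new values on the old row. New key: insert.
        merged[key] = {**merged.get(key, {}), **row}
    return list(merged.values())


existing_table = [
    {"id": 1, "product": "Hammer", "price": 12.50},
    {"id": 2, "product": "Wrench", "price": 9.99},
]
new_batch = [
    {"id": 2, "product": "Wrench", "price": 10.49},  # update
    {"id": 3, "product": "Drill", "price": 79.00},   # insert
]

for row in upsert(existing_table, new_batch, keys=["id"]):
    print(row)
```

The result has three rows, with id 2 carrying the new price. By contrast, appending the same batch would give four rows (id 2 twice), and overwriting would keep only the two rows from the new batch.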
Option B: Add Incremental Loading
Run the pipeline twice and verify:
- First run: loads all data
- Second run: skips unchanged data (check Story for "rows skipped")
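One common way to get this "load everything once, then skip what you have already seen" behavior is a high-water mark: remember the largest value of a timestamp column and only load rows beyond it. The sketch below assumes that approach for illustration. Odibi's actual incremental mechanism may differ, and incremental_load, the updated_at column, and the state dictionary are all hypothetical.

```python
def incremental_load(source_rows, state, column="updated_at"):
    """Return rows newer than the stored watermark and advance the watermark."""
    watermark = state.get("watermark")
    new_rows = [
        row for row in source_rows
        if watermark is None or row[column] > watermark
    ]
    if new_rows:
        state["watermark"] = max(row[column] for row in new_rows)
    return new_rows


source = [
    {"id": 1, "updated_at": "2025-01-01"},
    {"id": 2, "updated_at": "2025-01-02"},
]
state = {}  # a real pipeline would persist this between runs

first = incremental_load(source, state)
second = incremental_load(source, state)
print(f"first run loaded {len(first)} rows, second run loaded {len(second)} rows")
```

ISO-formatted date strings compare correctly as text, which is why the sketch can use plain string comparison for the watermark.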
✅ Verify
- [ ] You understand what problems patterns solve
- [ ] You successfully applied one pattern
- [ ] You can explain merge vs append vs overwrite
🏆 Capstone Project
Build a Complete Pipeline from Scratch
Requirements
Create a pipeline that:
1. Reads data from a CSV
2. Filters bad rows
3. Adds validation gates
4. Handles failures gracefully (quarantine or warn)
5. Writes to Parquet
6. Generates a Data Story
Starter Data
Create sales_data.csv:
order_id,customer_id,product_id,amount,order_date
1,101,501,150.00,2025-01-01
2,102,502,25.50,2025-01-02
3,103,,99.99,2025-01-03
4,104,504,-10.00,invalid-date
5,105,505,1000.00,2025-01-05
(Notice: Row 3 has NULL product_id, Row 4 has invalid date and negative amount)
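Before building the pipeline, it can help to work out which rows you expect to survive. The expected output later in this capstone counts 3 valid rows. One reading that reproduces that count is to drop rows with a missing product_id as well as rows with a non-positive amount. That rule is an inference from the numbers rather than something the requirements spell out, and is_valid is a hypothetical helper.

```python
import csv
import io

STARTER_CSV = """order_id,customer_id,product_id,amount,order_date
1,101,501,150.00,2025-01-01
2,102,502,25.50,2025-01-02
3,103,,99.99,2025-01-03
4,104,504,-10.00,invalid-date
5,105,505,1000.00,2025-01-05
"""


def is_valid(row):
    """Assumed rule: product_id must be present and amount must be positive."""
    return row["product_id"] != "" and float(row["amount"]) > 0


rows = list(csv.DictReader(io.StringIO(STARTER_CSV)))
valid_ids = [int(row["order_id"]) for row in rows if is_valid(row)]
print(f"valid order_ids: {valid_ids}")
```

Under that assumption, orders 1, 2, and 5 remain. Filtering on negative amounts alone would keep order 3 as well and leave four rows.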
Your Pipeline Must:
- Filter out negative amounts
- Validate:
  - order_id is unique
  - customer_id is not null
  - amount > 0
- Handle failures: Use gate on_fail: warn_and_write or test on_fail: warn, and document WHY in YAML comments
- Generate a Story with explanation
Verification Script
Run this after your pipeline:
Expected output:
✅ Output file exists
✅ Row count correct (3 valid rows)
✅ No negative amounts
✅ All customer_ids are populated
✅ Story generated
✅ Validation section exists in Story
✅ Verify
- [ ] Your pipeline runs without errors
- [ ] Verification script passes all checks
- [ ] Data Story clearly shows validation results
- [ ] You added comments explaining your choices
➡️ Next Steps
You've completed the Junior DE journey! Here's where to go next:
Deepen Your Skills
- CLI Master Guide - Learn advanced CLI commands and debugging
- Writing Transformations - Custom Python functions
- Dimensional Modeling Tutorial - Star schemas
Production Readiness
- Spark Engine Tutorial - Scale to big data
- Azure Connections - Connect to cloud storage
- Performance Tuning - Optimize pipelines
Level Up
- Sr Data Engineer Journey - Design production systems
📣 Share Your Success
Completed the capstone? We want to celebrate with you!
- Share your Data Story in GitHub Discussions
- Tag us on LinkedIn with #OdibiJourney
- Add "Odibi Junior Data Engineer" to your resume!
Questions? Troubleshooting Guide | FAQ