The Odibi Playbook
Find your problem. Get the solution.
Most Common Flows
If You Only Read 3 Pages...
- Golden Path — Zero to running in 10 minutes
- Patterns Overview — Common solutions to common problems
- YAML Schema — All configuration options
Find Your Problem
Bronze Layer: Ingestion
"Get data from sources into your lakehouse reliably."
| Problem | Pattern | Docs |
| --- | --- | --- |
| Load all files from a folder | Append-only | Pattern |
| Only process new files since last run | Rolling window | Pattern |
| Track exact high-water mark | Stateful HWM | Pattern |
| Fail if source is empty or stale | Contracts | YAML |
| Handle malformed records | Bad records path | YAML |
| Extract from SQL Server | JDBC read | Example |
Silver Layer: Transformation
"Clean, deduplicate, and model your data."
| Problem | Pattern | Docs |
| --- | --- | --- |
| Remove duplicates | Deduplicate transformer | YAML |
| Keep latest record per key | Dedupe with ordering | YAML |
| Track dimension changes over time | SCD2 | Pattern |
| Upsert into target table | Merge | Pattern |
| Validate output data quality | Validation tests | Feature |
| Route bad rows for review | Quarantine | Feature |
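To make "Keep latest record per key" concrete, here is a minimal plain-Python sketch of dedupe with ordering. It is an illustration of the idea only, not Odibi's transformer; `keep_latest` and the sample column names are hypothetical.

```python
def keep_latest(rows, key, order_by):
    """Return one row per key value: the row with the largest order_by value.

    Hypothetical helper for illustration; not part of Odibi.
    """
    latest = {}
    for row in rows:
        k = row[key]
        # Replace the stored row only when this one is strictly newer,
        # so ties keep the first row seen.
        if k not in latest or row[order_by] > latest[k][order_by]:
            latest[k] = row
    return list(latest.values())


if __name__ == "__main__":
    rows = [
        {"id": 1, "ts": 1, "status": "new"},
        {"id": 2, "ts": 5, "status": "open"},
        {"id": 1, "ts": 3, "status": "closed"},
    ]
    for row in keep_latest(rows, key="id", order_by="ts"):
        print(row)
```

Plain "Remove duplicates" is the special case where every column forms the key and no ordering is needed.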
Gold Layer: Analytics
"Build fact tables, aggregations, and semantic layers."
| Problem | Pattern | Docs |
| --- | --- | --- |
| Build fact table with SK lookups | Fact pattern | Pattern |
| Handle orphan records | Orphan handling | Pattern |
| Pre-aggregate metrics | Aggregation pattern | Pattern |
| Generate date dimension | Date dimension | Pattern |
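The first two rows of this table go together: a surrogate-key (SK) lookup has to decide what to do with fact rows whose natural key is missing from the dimension. The sketch below shows one common approach in plain Python. It is not Odibi's fact pattern; the function name, the `-1` placeholder key, and the two policy names are assumptions made for illustration.

```python
UNKNOWN_SK = -1  # assumed placeholder key for orphans; pick your own convention


def resolve_surrogate_keys(fact_rows, dim_lookup, natural_key, sk_column,
                           orphan_policy="unknown"):
    """Attach a surrogate key to each fact row and collect orphans.

    Hypothetical helper: dim_lookup maps natural key -> surrogate key.
    orphan_policy "unknown" keeps orphans with UNKNOWN_SK; "reject" drops them.
    """
    if orphan_policy not in ("unknown", "reject"):
        raise ValueError(f"unsupported orphan_policy: {orphan_policy!r}")
    resolved, orphans = [], []
    for row in fact_rows:
        sk = dim_lookup.get(row[natural_key])
        if sk is None:
            orphans.append(row)  # always reported, whatever the policy
            if orphan_policy == "reject":
                continue
            sk = UNKNOWN_SK
        resolved.append({**row, sk_column: sk})
    return resolved, orphans


if __name__ == "__main__":
    dim = {"C1": 10, "C2": 11}
    facts = [{"customer_id": "C1", "amount": 5}, {"customer_id": "C9", "amount": 7}]
    resolved, orphans = resolve_surrogate_keys(facts, dim, "customer_id", "customer_sk")
    print(resolved)
    print(orphans)
```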
Decision Trees
Choose Your Engine
Data size?
├─► < 1GB → engine: pandas
├─► 1-10GB → engine: polars
└─► > 10GB or Delta Lake → engine: spark
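The tree above can be written as a small function. This is a sketch of the rule of thumb only; `choose_engine` is a hypothetical helper, and treating exactly 1 GB and exactly 10 GB as Polars territory is an assumption about where the boundaries fall.

```python
def choose_engine(size_gb: float, uses_delta_lake: bool = False) -> str:
    """Return the engine name suggested by the data-size decision tree."""
    if size_gb < 0:
        raise ValueError("size_gb must be non-negative")
    # Delta Lake, or anything above 10 GB, goes to Spark.
    if uses_delta_lake or size_gb > 10:
        return "spark"
    # Below 1 GB stays on pandas.
    if size_gb < 1:
        return "pandas"
    # Everything in between (1 to 10 GB inclusive) uses Polars.
    return "polars"


if __name__ == "__main__":
    for size, delta in [(0.2, False), (5, False), (50, False), (0.2, True)]:
        print(f"{size} GB, delta={delta}: engine: {choose_engine(size, delta)}")
```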
Choose Your Incremental Mode
Source has timestamps?
├─► Yes → mode: stateful (exact HWM tracking)
└─► No
    ├─► Data arrives daily? → mode: rolling_window (lookback)
    └─► Unknown pattern? → write.skip_if_unchanged: true
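The stateful branch relies on a high-water mark (HWM): remember the largest timestamp already processed and read only rows beyond it. A minimal sketch of that idea in plain Python follows. It is not how Odibi stores state; the `state` dict, the `updated_at` column, and `read_since_hwm` are hypothetical.

```python
from datetime import datetime


def read_since_hwm(rows, state, ts_key="updated_at"):
    """Return rows strictly newer than the stored mark, then advance the mark.

    Hypothetical helper: state is a plain dict standing in for persisted state.
    """
    hwm = state.get("hwm")
    new_rows = [r for r in rows if hwm is None or r[ts_key] > hwm]
    if new_rows:
        # Advance the mark only when something new was actually read.
        state["hwm"] = max(r[ts_key] for r in new_rows)
    return new_rows


if __name__ == "__main__":
    state = {}
    batch = [
        {"id": 1, "updated_at": datetime(2024, 1, 1)},
        {"id": 2, "updated_at": datetime(2024, 1, 2)},
    ]
    print(len(read_since_hwm(batch, state)))  # first run: both rows are new
    print(len(read_since_hwm(batch, state)))  # second run: nothing is newer than the mark
```

The strict `>` comparison means rows that share the mark's exact timestamp but arrive late are skipped, which is the usual reason to fall back to a lookback window when arrival times are loose.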
Choose Your Validation Approach
When to check?
├─► Before processing (source quality) → contracts:
└─► After processing (output quality) → validation.tests:
    ├─► Need to stop pipeline? → gate.on_fail: abort
    └─► Soft warning OK? → gate.on_fail: warn_and_write
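The difference between the two `on_fail` values is whether a failed test stops the write. The sketch below mimics that behavior in plain Python. Only the names `abort` and `warn_and_write` come from the tree above; `apply_gate`, `GateAbort`, and the shape of `tests` are assumptions, not Odibi's API.

```python
class GateAbort(Exception):
    """Raised when a failed test should stop the pipeline."""


def apply_gate(rows, tests, on_fail="abort"):
    """Run row-level tests and decide whether the rows may be written.

    Hypothetical helper: tests maps a test name to a function(row) -> bool.
    Returns (rows_to_write, failed_test_names).
    """
    if on_fail not in ("abort", "warn_and_write"):
        raise ValueError(f"unsupported on_fail value: {on_fail!r}")
    failed = [name for name, check in tests.items()
              if not all(check(row) for row in rows)]
    if failed and on_fail == "abort":
        raise GateAbort(f"validation failed: {', '.join(failed)}")
    # warn_and_write: hand back the rows plus the failures for the caller to log.
    return rows, failed


if __name__ == "__main__":
    tests = {"amount_non_negative": lambda row: row["amount"] >= 0}
    rows = [{"amount": 5}, {"amount": -1}]
    written, failed = apply_gate(rows, tests, on_fail="warn_and_write")
    print(f"wrote {len(written)} rows; failed tests: {failed}")
```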
Choose Your SCD Type
Need historical state?
├─► No → scd_type: 1 (overwrite)
└─► Yes → scd_type: 2 (versioned)
    └─► Storage concerns? → Consider snapshots instead
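To show what the two types do to a dimension row, here is a small plain-Python sketch. It is not Odibi's SCD implementation; the tracking columns `valid_from`, `valid_to`, and `is_current` and both function names are assumptions chosen for illustration.

```python
from datetime import date

TRACKING = {"valid_from", "valid_to", "is_current"}  # assumed column names


def apply_scd1(current, incoming, key):
    """Type 1: overwrite rows by key, keeping no history."""
    merged = {row[key]: row for row in current}
    merged.update({row[key]: row for row in incoming})
    return list(merged.values())


def apply_scd2(history, incoming, key, as_of):
    """Type 2: close the current version of a changed key and append a new one."""
    result = [dict(row) for row in history]  # copy so the input is not mutated
    current = {row[key]: row for row in result if row["is_current"]}
    for new in incoming:
        old = current.get(new[key])
        if old is not None:
            old_attrs = {k: v for k, v in old.items() if k not in TRACKING}
            if old_attrs == new:
                continue  # nothing changed, keep the open version
            old["valid_to"] = as_of
            old["is_current"] = False
        version = {**new, "valid_from": as_of, "valid_to": None, "is_current": True}
        result.append(version)
        current[new[key]] = version
    return result


if __name__ == "__main__":
    history = [{"id": 1, "city": "Oslo", "valid_from": date(2024, 1, 1),
                "valid_to": None, "is_current": True}]
    incoming = [{"id": 1, "city": "Bergen"}, {"id": 2, "city": "Lima"}]
    for row in apply_scd2(history, incoming, key="id", as_of=date(2024, 6, 1)):
        print(row)
```

Type 2 adds a row for every change, which is where the storage concern in the tree comes from.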
Quick Links by Role
Data Engineer (Daily Work)
Data Engineer (Building Pipelines)
Data Engineer (Production)
CLI Quick Reference
| Task | Command |
| --- | --- |
| Run pipeline | odibi run config.yaml |
| Run specific node | odibi run config.yaml --node name |
| Dry run (no writes) | odibi run config.yaml --dry-run |
| Validate config | odibi validate config.yaml |
| View DAG | odibi graph config.yaml |
| Check state | odibi catalog state config.yaml |
| Diagnose issues | odibi doctor |
| List stories | odibi story list |
Transformer Quick Reference
| Category | Transformers |
| --- | --- |
| Filtering | filter_rows, distinct, sample, limit |
| Columns | derive_columns, select_columns, drop_columns, rename_columns, cast_columns |
| Text | clean_text, trim_whitespace, regex_replace, split_part, concat_columns |
| Dates | extract_date_parts, date_add, date_trunc, date_diff, convert_timezone |
| Nulls | fill_nulls, coalesce_columns |
| Relational | join, union, aggregate, pivot, unpivot |
| Window | window_calculation (rank, sum, lag, lead) |
| JSON | parse_json, normalize_json, explode_list_column, unpack_struct |
| Keys | generate_surrogate_key, hash_columns |
| Patterns | scd2, merge, deduplicate, dimension, fact, aggregation |
Full reference: YAML Schema - Transformers
See Also