# Odibi Cheatsheet
## CLI Commands

| Command | Description |
|---|---|
| `odibi run odibi.yaml` | Run the pipeline. |
| `odibi run odibi.yaml --dry-run` | Validate connections without moving data. |
| `odibi validate odibi.yaml` | Validate YAML schema, pipeline logic, and transformer params. |
| `odibi doctor` | Check environment and config health. |
| `odibi story last` | Open the latest run report. |
| `odibi secrets init odibi.yaml` | Create a `.env` template for secrets. |
| `odibi graph odibi.yaml` | Visualize pipeline dependencies. |
| `odibi init my_project --template hello` | Scaffold a new project from a template. |
| `odibi templates list` | List all template types (connections, patterns, configs). |
| `odibi templates show azure_blob` | Show a YAML template with all options. |
| `odibi templates transformer scd2` | Show transformer params and an example. |
| `odibi templates schema` | Generate a JSON schema for VS Code autocomplete. |
## odibi.yaml Structure

```yaml
version: "1.0.0"
project: My Project
engine: pandas  # or 'spark'

connections:
  raw_data:
    type: local
    base_path: ./data

story:
  connection: raw_data
  path: stories/

pipelines:
  - pipeline: main_etl
    nodes:
      # 1. Read
      - name: load_csv
        read:
          connection: raw_data
          path: input.csv
          format: csv

      # 2. Transform (SQL)
      - name: clean_data
        depends_on: [load_csv]
        transform:
          steps:
            - "SELECT * FROM load_csv WHERE id IS NOT NULL"

      # 3. Transform (Python)
      - name: advanced_clean
        depends_on: [clean_data]
        transform:
          steps:
            - function: my_custom_func  # registered Python function
              params:
                threshold: 10

      # 4. Write
      - name: save_parquet
        depends_on: [advanced_clean]
        write:
          connection: raw_data
          path: output.parquet
          format: parquet
          mode: overwrite
```
Registering the custom Python function referenced in step 3:

```python
from odibi import transform

@transform
def my_custom_func(df, threshold=10):
    """Docstrings are required!"""
    return df[df['val'] > threshold]
```
## Cross-Pipeline Dependencies

Reference outputs from other pipelines using `$pipeline.node` syntax:

```yaml
nodes:
  - name: enriched_data
    inputs:
      # Cross-pipeline reference
      events: $read_bronze.shift_events
      # Explicit read config
      calendar:
        connection: prod
        path: "bronze/calendar"
        format: delta
    transform:
      steps:
        - operation: join
          left: events
          right: calendar
          on: [date_id]
```
| Syntax | Example | Description |
|---|---|---|
| `$pipeline.node` | `$read_bronze.orders` | Reference a node's output from another pipeline |

Requirements:

- The referenced pipeline must have run first.
- The referenced node must have a `write` block.
- A node cannot use both `read` and `inputs`.
## Node Documentation (Explanation)

Document business logic for stakeholders directly in YAML:

```yaml
nodes:
  - name: fact_sales
    # Inline explanation (for short docs)
    explanation: |
      ## Sales Fact
      Joins orders with dimension tables.
      **Business Rule:** Orphan orders get `unknown_member_sk = -1`

  - name: dim_customer
    # External file (for longer docs, keeps YAML clean)
    explanation_file: docs/dim_customer.md  # Relative to this YAML
```

Explanations render in Data Story HTML reports with full Markdown support (tables, code blocks, headers).
| Field | Description |
|---|---|
| `explanation` | Inline Markdown documentation |
| `explanation_file` | Path to an external `.md` file (mutually exclusive with `explanation`) |

See: Explanation Feature Guide
## SQL Templates

| Variable | Value |
|---|---|
| `${source}` | The path of the source file (if reading). |
| `${SELF}` | The name of the current node. |
## Variable Substitution

| Syntax | Purpose | Example |
|---|---|---|
| `${VAR}` | Environment variable | `${DB_PASSWORD}` |
| `${env:VAR}` | Environment variable (explicit) | `${env:API_KEY}` |
| `${vars.name}` | Custom variable from the `vars:` block | `${vars.env}` |
| `${date:expr}` | Dynamic date | `${date:today}`, `${date:-7d}` |
| `${date:expr:fmt}` | Date with format | `${date:today:%Y%m%d}` |

Date expressions: `today`, `yesterday`, `now`, `start_of_month`, `end_of_month`, `-7d`, `+30d`, `-1m`, `-1y`
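As a sketch of how substitution combines in practice (the `vars:` block placement, connection name, and path here are illustrative, not taken from the Odibi docs):

```yaml
vars:
  env: prod  # referenced below as ${vars.env}

# ... inside a node:
write:
  connection: raw_data
  path: "exports/${vars.env}/sales_${date:today:%Y%m%d}.parquet"
  format: parquet
```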
## API Fetching

```yaml
read:
  connection: my_api
  format: api
  path: /v1/data
  options:
    method: GET           # GET (default), POST, PUT, PATCH, DELETE
    params:               # URL params (GET) or merged into body (POST)
      limit: 100
    request_body:         # JSON body for POST/PUT/PATCH
      filters: {...}
    pagination:
      type: offset_limit  # offset_limit | page_number | cursor | link_header
      start_offset: 0     # Use 1 for 1-indexed APIs
    response:
      items_path: results
      add_fields:
        _fetched_at: "${date:now}"
```
## SQL Core (Column & Row Operations)

| Transformer | Description | Key Params |
|---|---|---|
| `filter_rows` | Filter rows with a SQL `WHERE` clause | `condition` |
| `derive_columns` | Add calculated columns | `derivations: {col: expr}` |
| `cast_columns` | Change column types | `casts: {col: type}` |
| `clean_text` | Trim, lowercase, remove characters | `columns`, `lowercase`, `strip` |
| `select_columns` | Keep only the listed columns | `columns: [...]` |
| `drop_columns` | Remove columns | `columns: [...]` |
| `rename_columns` | Rename columns | `mapping: {old: new}` |
| `fill_nulls` | Replace nulls | `fills: {col: value}` |
| `case_when` | Conditional logic | `conditions: [{when, then}]` |
| `sort` | Order rows | `order_by: "<column> [ASC\|DESC]"` (multiple columns can be comma-separated) |
| `limit` | Take the first N rows | `n: 100` |
| `distinct` | Remove duplicates | `columns: [...]` (optional) |
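A sketch of how these transformers might appear as transform steps, following the `operation:` style used in the cross-pipeline join example (column names and expressions are illustrative):

```yaml
transform:
  steps:
    - operation: filter_rows
      condition: "amount > 0"
    - operation: derive_columns
      derivations:
        total: "price * qty"
    - operation: rename_columns
      mapping:
        qty: quantity
    - operation: sort
      order_by: "total DESC"
```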
## Date/Time Operations

| Transformer | Description | Key Params |
|---|---|---|
| `extract_date_parts` | Extract year, month, day | `column`, `parts: [year, month]` |
| `date_add` | Add an interval to a date | `column`, `interval`, `unit` (options: `day`, `hour`, `month`, `year`) |
| `date_trunc` | Truncate to a period | `column`, `unit` (options: `day`, `month`, `year`, `hour`, `minute`, `second`) |
| `date_diff` | Days between dates | `start_col`, `end_col` |
| `convert_timezone` | Convert timezones | `column`, `from_tz`, `to_tz` (IANA timezone format, e.g. `America/New_York`, `UTC`) |
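A sketch of date/time steps in the same `operation:` style (column names are illustrative):

```yaml
transform:
  steps:
    - operation: extract_date_parts
      column: order_date
      parts: [year, month]
    - operation: convert_timezone
      column: created_at
      from_tz: UTC
      to_tz: America/New_York
```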
## Relational (Joins & Aggregations)

| Transformer | Description | Key Params |
|---|---|---|
| `join` | Join datasets | `right_dataset`, `on`, `how` |
| `union` | Stack datasets | `datasets: [...]` |
| `aggregate` | Group and aggregate | `group_by`, `aggregations` |
| `pivot` | Rows to columns | `index`, `columns`, `values` |
| `unpivot` | Columns to rows | `id_vars`, `value_vars` |
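A sketch of a join followed by an aggregation, in the `operation:` style used earlier; the mapping shape under `aggregations` (output column to SQL expression) is an assumption, as are the dataset and column names:

```yaml
transform:
  steps:
    - operation: join
      right_dataset: dim_product
      on: [product_id]
      how: left
    - operation: aggregate
      group_by: [product_id]
      aggregations:
        revenue: "SUM(amount)"
```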
| Transformer | Description | Key Params |
|---|---|---|
| `deduplicate` | Keep first/last per key | `keys`, `order_by` |
| `hash_columns` | Generate a hash | `columns`, `output_column` |
| `window_calculation` | Window functions | `partition_by`, `order_by`, `expr` |
| `explode_list_column` | Flatten arrays | `column`, `outer: true` |
| `dict_based_mapping` | Value mapping | `column`, `mapping: {}` |
| `regex_replace` | Regex substitution | `column`, `pattern`, `replacement` |
| `parse_json` | Extract from JSON | `column`, `schema` |
| `generate_surrogate_key` | UUID keys | `columns`, `output_column` |
| `sessionize` | Session detection | `timestamp_col`, `gap_minutes` (integer; inactivity threshold in minutes after which a new session starts) |
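A sketch of two of these transformers as steps (same hypothetical `operation:` style; column names are illustrative):

```yaml
transform:
  steps:
    - operation: deduplicate
      keys: [customer_id]
      order_by: "updated_at DESC"
    - operation: hash_columns
      columns: [first_name, last_name, email]
      output_column: row_hash
```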
## Patterns (SCD, Merge, Delete Detection)

| Transformer | Description | Key Params |
|---|---|---|
| `scd2` | SCD Type 2 history | `target`, `keys`, `track_cols`, `effective_time_col` (required) |
| `merge` | Upsert/append/delete | `target`, `keys`, `strategy` (options: `upsert`, `append_only`, `delete_match`) |
| `detect_deletes` | Find deleted records | `target`, `keys` |
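A sketch of an `scd2` step; the shape of `target` (connection/path/format, modeled on this cheatsheet's read/write blocks) is an assumption, and all names are illustrative:

```yaml
transform:
  steps:
    - operation: scd2
      target:
        connection: prod
        path: "silver/dim_customer"
        format: delta
      keys: [customer_id]
      track_cols: [address, tier]
      effective_time_col: updated_at
```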
## Validation

| Transformer | Description | Key Params |
|---|---|---|
| `cross_check` | Compare datasets | `inputs: [a, b]`, `type` (options: `row_count_diff`, `schema_match`) |
| `validate_and_flag` | Flag bad records | `rules: [{col, condition}]` |
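A sketch of `validate_and_flag` with the `rules` shape from the table above (rule contents are illustrative):

```yaml
transform:
  steps:
    - operation: validate_and_flag
      rules:
        - col: email
          condition: "email IS NOT NULL"
        - col: amount
          condition: "amount >= 0"
```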