# Odibi Cheatsheet

## CLI Commands

| Command | Description |
|---|---|
| `odibi run odibi.yaml` | Run the pipeline. |
| `odibi run odibi.yaml --dry-run` | Validate connections without moving data. |
| `odibi validate odibi.yaml` | Validate the YAML schema, pipeline logic, and transformer params. |
| `odibi doctor` | Check environment and config health. |
| `odibi story last` | Open the latest run report. |
| `odibi secrets init odibi.yaml` | Create a `.env` template for secrets. |
| `odibi graph odibi.yaml` | Visualize pipeline dependencies. |
| `odibi init my_project --template hello` | Scaffold a new project from a template. |
| `odibi templates list` | List all template types (connections, patterns, configs). |
| `odibi templates show azure_blob` | Show a YAML template with all options. |
| `odibi templates transformer scd2` | Show transformer params and an example. |
| `odibi templates schema` | Generate a JSON schema for VS Code autocomplete. |

## odibi.yaml Structure

```yaml
version: "1.0.0"
project: My Project
engine: pandas          # or 'spark'

connections:
  raw_data:
    type: local
    base_path: ./data

story:
  connection: raw_data
  path: stories/

pipelines:
  - pipeline: main_etl
    nodes:
      # 1. Read
      - name: load_csv
        read:
          connection: raw_data
          path: input.csv
          format: csv

      # 2. Transform (SQL)
      - name: clean_data
        depends_on: [load_csv]
        transform:
          steps:
            - "SELECT * FROM load_csv WHERE id IS NOT NULL"

      # 3. Transform (Python)
      - name: advanced_clean
        depends_on: [clean_data]
        transform:
          steps:
            - function: my_custom_func  # registered Python function
              params:
                threshold: 10

      # 4. Write
      - name: save_parquet
        depends_on: [advanced_clean]
        write:
          connection: raw_data
          path: output.parquet
          format: parquet
          mode: overwrite
```

## Python Transformation

```python
from odibi import transform

@transform
def my_custom_func(df, threshold=10):
    """Docstrings are required!"""
    return df[df['val'] > threshold]
```

## Cross-Pipeline Dependencies

Reference outputs from other pipelines using the `$pipeline.node` syntax:

```yaml
nodes:
  - name: enriched_data
    inputs:
      # Cross-pipeline reference
      events: $read_bronze.shift_events

      # Explicit read config
      calendar:
        connection: prod
        path: "bronze/calendar"
        format: delta
    transform:
      steps:
        - operation: join
          left: events
          right: calendar
          on: [date_id]
```

| Syntax | Example | Description |
|---|---|---|
| `$pipeline.node` | `$read_bronze.orders` | Reference a node's output from another pipeline |

Requirements:

- The referenced pipeline must have run first.
- The referenced node must have a `write` block.
- A node cannot use both `read` and `inputs`.
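
For example, a producing node that satisfies these requirements might look like the sketch below (connection names and paths are illustrative; the node structure follows the `odibi.yaml` example above):

```yaml
pipelines:
  - pipeline: read_bronze
    nodes:
      - name: shift_events
        read:
          connection: prod
          path: "bronze/shift_events_raw"
          format: delta
        # The write block is what makes this node referenceable
        # from other pipelines as $read_bronze.shift_events.
        write:
          connection: prod
          path: "bronze/shift_events"
          format: delta
          mode: overwrite
```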


## Node Documentation (Explanation)

Document business logic for stakeholders directly in YAML:

```yaml
nodes:
  - name: fact_sales
    # Inline explanation (for short docs)
    explanation: |
      ## Sales Fact
      Joins orders with dimension tables.
      **Business Rule:** Orphan orders get `unknown_member_sk = -1`

  - name: dim_customer
    # External file (for longer docs, keeps YAML clean)
    explanation_file: docs/dim_customer.md  # Relative to this YAML
```

Explanations render in Data Story HTML reports with full Markdown support (tables, code blocks, headers).

| Field | Description |
|---|---|
| `explanation` | Inline Markdown documentation |
| `explanation_file` | Path to an external `.md` file (mutually exclusive with `explanation`) |

See: Explanation Feature Guide


## SQL Templates

| Variable | Value |
|---|---|
| `${source}` | The path of the source file (if reading). |
| `${SELF}` | The name of the current node. |
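
A minimal sketch of how these could appear in a SQL step (assuming the variables are substituted into the SQL text before execution; node and alias names are illustrative):

```yaml
- name: tag_rows
  depends_on: [load_csv]
  transform:
    steps:
      # ${source} becomes the source file path, ${SELF} the node name.
      - "SELECT *, '${source}' AS source_file, '${SELF}' AS node_name FROM load_csv"
```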

## Variable Substitution

| Syntax | Purpose | Example |
|---|---|---|
| `${VAR}` | Environment variable | `${DB_PASSWORD}` |
| `${env:VAR}` | Environment variable (explicit) | `${env:API_KEY}` |
| `${vars.name}` | Custom variable from the `vars:` block | `${vars.env}` |
| `${date:expr}` | Dynamic date | `${date:today}`, `${date:-7d}` |
| `${date:expr:fmt}` | Date with format | `${date:today:%Y%m%d}` |

Date expressions: `today`, `yesterday`, `now`, `start_of_month`, `end_of_month`, `-7d`, `+30d`, `-1m`, `-1y`
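
Combining these, a read path can mix a custom variable with a formatted date (a sketch; the `vars:` values and directory layout are illustrative):

```yaml
vars:
  env: prod

pipelines:
  - pipeline: daily_load
    nodes:
      - name: load_events
        read:
          connection: raw_data
          # Resolves to e.g. landing/prod/20250101/events.csv
          path: "landing/${vars.env}/${date:today:%Y%m%d}/events.csv"
          format: csv
```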


## API Fetching

```yaml
read:
  connection: my_api
  format: api
  path: /v1/data
  options:
    method: GET              # GET (default), POST, PUT, PATCH, DELETE
    params:                  # URL params (GET) or merged into body (POST)
      limit: 100
    request_body:            # JSON body for POST/PUT/PATCH
      filters: {...}
    pagination:
      type: offset_limit     # offset_limit | page_number | cursor | link_header
      start_offset: 0        # Use 1 for 1-indexed APIs
    response:
      items_path: results
      add_fields:
        _fetched_at: "${date:now}"
```

## Transformer Quick Reference

### SQL Core (Column & Row Operations)

| Transformer | Description | Key Params |
|---|---|---|
| `filter_rows` | Filter rows with SQL WHERE | `condition` |
| `derive_columns` | Add calculated columns | `derivations: {col: expr}` |
| `cast_columns` | Change column types | `casts: {col: type}` |
| `clean_text` | Trim, lowercase, remove chars | `columns`, `lowercase`, `strip` |
| `select_columns` | Keep only listed columns | `columns: [...]` |
| `drop_columns` | Remove columns | `columns: [...]` |
| `rename_columns` | Rename columns | `mapping: {old: new}` |
| `fill_nulls` | Replace nulls | `fills: {col: value}` |
| `case_when` | Conditional logic | `conditions: [{when, then}]` |
| `sort` | Order rows | `order_by: "<column> [ASC\|DESC]"`; multiple columns can be comma-separated |
| `limit` | Take first N rows | `n: 100` |
| `distinct` | Remove duplicates | `columns: [...]` (optional) |
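
A sketch of how several of these chain inside a `transform` block, using the `operation:` step form shown earlier (column names and values are illustrative):

```yaml
transform:
  steps:
    - operation: filter_rows
      condition: "status = 'active'"
    - operation: derive_columns
      derivations:
        total: "price * qty"
    - operation: sort
      order_by: "total DESC"
    - operation: limit
      n: 100
```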

### Date/Time Operations

| Transformer | Description | Key Params |
|---|---|---|
| `extract_date_parts` | Extract year, month, day | `column`, `parts: [year, month]` |
| `date_add` | Add an interval to a date | `column`, `interval`, `unit` (`day`, `hour`, `month`, `year`) |
| `date_trunc` | Truncate to a period | `column`, `unit` (`day`, `month`, `year`, `hour`, `minute`, `second`) |
| `date_diff` | Days between dates | `start_col`, `end_col` |
| `convert_timezone` | Convert timezones | `column`, `from_tz`, `to_tz` (IANA names, e.g. `America/New_York`, `UTC`) |
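
For example (a sketch; params follow the table above, column names and values are illustrative):

```yaml
transform:
  steps:
    - operation: date_add
      column: order_date
      interval: 7
      unit: day
    - operation: convert_timezone
      column: created_at
      from_tz: UTC
      to_tz: America/New_York
```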

### Relational (Joins & Aggregations)

| Transformer | Description | Key Params |
|---|---|---|
| `join` | Join datasets | `right_dataset`, `on`, `how` |
| `union` | Stack datasets | `datasets: [...]` |
| `aggregate` | Group and aggregate | `group_by`, `aggregations` |
| `pivot` | Rows to columns | `index`, `columns`, `values` |
| `unpivot` | Columns to rows | `id_vars`, `value_vars` |
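
A sketch of a join followed by an aggregation (the shape of the `aggregations` mapping is an assumption; dataset and column names are illustrative):

```yaml
transform:
  steps:
    - operation: join
      right_dataset: dim_customer
      on: [customer_id]
      how: left
    - operation: aggregate
      group_by: [customer_id]
      aggregations:
        # Assumed format: output column mapped to an aggregate expression.
        total_sales: "SUM(amount)"
```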

### Advanced (Complex Transformations)

| Transformer | Description | Key Params |
|---|---|---|
| `deduplicate` | Keep first/last per key | `keys`, `order_by` |
| `hash_columns` | Generate a hash column | `columns`, `output_column` |
| `window_calculation` | Window functions | `partition_by`, `order_by`, `expr` |
| `explode_list_column` | Flatten arrays | `column`, `outer: true` |
| `dict_based_mapping` | Value mapping | `column`, `mapping: {}` |
| `regex_replace` | Regex substitution | `column`, `pattern`, `replacement` |
| `parse_json` | Extract fields from JSON | `column`, `schema` |
| `generate_surrogate_key` | UUID keys | `columns`, `output_column` |
| `sessionize` | Session detection | `timestamp_col`, `gap_minutes` (integer; inactivity threshold in minutes after which a new session starts) |
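
For example, deduplicating events before sessionizing them (a sketch; the `order_by` string format mirrors `sort` above and is an assumption):

```yaml
transform:
  steps:
    - operation: deduplicate
      keys: [user_id, event_id]
      order_by: "updated_at DESC"
    - operation: sessionize
      timestamp_col: event_time
      gap_minutes: 30   # 30 minutes of inactivity starts a new session
```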

### Patterns (SCD, Merge, Delete Detection)

| Transformer | Description | Key Params |
|---|---|---|
| `scd2` | SCD Type 2 history | `target`, `keys`, `track_cols`, `effective_time_col` (required) |
| `merge` | Upsert/append/delete | `target`, `keys`, `strategy` (`upsert`, `append_only`, `delete_match`) |
| `detect_deletes` | Find deleted records | `target`, `keys` |
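
A sketch of an SCD2 step using the params from the table (table and column names are illustrative):

```yaml
transform:
  steps:
    - operation: scd2
      target: dim_customer
      keys: [customer_id]
      track_cols: [name, address]
      effective_time_col: updated_at   # required
```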

### Validation

| Transformer | Description | Key Params |
|---|---|---|
| `cross_check` | Compare datasets | `inputs: [a, b]`, `type` (`row_count_diff`, `schema_match`) |
| `validate_and_flag` | Flag bad records | `rules: [{col, condition}]` |
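
A sketch of flagging bad records (rule fields follow the table; the conditions are illustrative):

```yaml
transform:
  steps:
    - operation: validate_and_flag
      rules:
        - col: amount
          condition: "amount >= 0"
        - col: email
          condition: "email IS NOT NULL"
```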