
Odibi Best Practices Guide

Version: 3.4.3
Last Updated: 2025-12-03
Audience: Data Engineers, Analytics Engineers, Team Leads

This guide covers recommended patterns for building maintainable, scalable, and production-ready Odibi pipelines.


Table of Contents

  1. Project Organization
  2. Pipeline Design
  3. Node Design
  4. Naming Conventions
  5. Configuration Management
  6. Performance
  7. Data Quality & Validation
  8. Cross-Pipeline Dependencies
  9. Security
  10. Version Control

1. Project Organization

my-odibi-project/
├── project.yaml                    # Core config (connections, settings)
├── pipelines/
│   ├── bronze/
│   │   └── read_bronze.yaml        # Bronze layer pipeline
│   ├── silver/
│   │   └── transform_silver.yaml   # Silver layer pipeline
│   └── gold/
│       └── build_gold.yaml         # Gold layer pipeline
├── transformations/
│   ├── __init__.py
│   └── custom_transforms.py        # Custom Python transformations
├── sql/
│   └── complex_queries.sql         # Complex SQL (optional)
├── tests/
│   └── test_pipelines.py           # Pipeline tests
├── .env                            # Local secrets (git-ignored)
├── .gitignore
└── README.md

Separation of Concerns

| File | Contains | Does NOT Contain |
|------|----------|------------------|
| project.yaml | Connections, system config, story config, imports | Pipeline definitions |
| pipelines/*.yaml | Pipeline and node definitions | Connection details |
| transformations/ | Custom Python logic | YAML configuration |

Example project.yaml

project: SalesAnalytics
description: "Sales Analytics Platform"
engine: spark
version: "1.0.0"
owner: "Data Team"

# === Connections (defined once, used everywhere) ===
connections:
  source_db:
    type: sql_server
    host: ${DB_HOST}
    database: ${DB_NAME}
    auth:
      mode: sql_login
      username: ${DB_USER}
      password: ${DB_PASS}

  lakehouse:
    type: azure_blob
    account_name: ${STORAGE_ACCOUNT}
    container: datalake
    auth:
      mode: account_key
      account_key: ${STORAGE_KEY}

# === System Catalog ===
system:
  connection: lakehouse
  path: _odibi_system

# === Story Configuration ===
story:
  connection: lakehouse
  path: stories/
  retention_days: 30
  auto_generate: true

# === Global Settings ===
performance:
  use_arrow: true
  skip_null_profiling: true

retry:
  enabled: true
  max_attempts: 3
  backoff: exponential

logging:
  level: INFO
  structured: true

# === Import Pipelines ===
imports:
  - pipelines/bronze/read_bronze.yaml
  - pipelines/silver/transform_silver.yaml
  - pipelines/gold/build_gold.yaml

Example Pipeline File

pipelines/bronze/read_bronze.yaml:

pipelines:
  - pipeline: read_bronze
    description: "Ingest raw data from source systems"
    layer: bronze
    nodes:
      - name: orders
        description: "Raw orders from ERP"
        read:
          connection: source_db
          format: sql
          table: sales.orders
          incremental:
            mode: stateful
            column: updated_at
        write:
          connection: lakehouse
          format: delta
          path: "bronze/orders"
          mode: append
          add_metadata: true

      - name: customers
        description: "Customer master data"
        read:
          connection: source_db
          format: sql
          table: sales.customers
        write:
          connection: lakehouse
          format: delta
          path: "bronze/customers"
          mode: append
          add_metadata: true
          skip_if_unchanged: true
          skip_hash_columns: [customer_id]


2. Pipeline Design

One Pipeline Per Layer Per Domain

Good:

pipelines/
├── bronze/
│   └── read_bronze.yaml           # All bronze ingestion
├── silver/
│   └── transform_silver.yaml      # All silver transformations
└── gold/
    ├── gold_sales.yaml            # Sales domain aggregates
    └── gold_inventory.yaml        # Inventory domain aggregates

Avoid:

pipelines/
├── orders_bronze_silver_gold.yaml  # Too many concerns in one file
└── everything.yaml                 # Unmaintainable

Pipeline Sizing Guidelines

| Node Count | Recommendation |
|------------|----------------|
| 1-20 nodes | Single pipeline file |
| 20-50 nodes | Consider splitting by sub-domain |
| 50+ nodes | Split into multiple pipelines |
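
Splitting is mostly a matter of moving node groups into separate pipeline files and importing each one from project.yaml. A minimal sketch, following the gold-layer layout shown earlier (file names are illustrative):

# project.yaml
imports:
  - pipelines/gold/gold_sales.yaml        # sales sub-domain
  - pipelines/gold/gold_inventory.yaml    # inventory sub-domain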

Keep Nodes with Their Pipeline

Don't split nodes into separate files:

# nodes/orders.yaml - BAD: nodes scattered across files
pipelines:
  - pipeline: read_bronze
    nodes:
      - name: orders

Keep nodes together:

# read_bronze.yaml - GOOD: all nodes in one place
pipelines:
  - pipeline: read_bronze
    nodes:
      - name: orders
        # ...
      - name: customers
        # ...
      - name: products
        # ...

Why?
• depends_on relationships are visible in one file
• Easier to understand the full pipeline flow
• One file = one pipeline = one commit for changes


3. Node Design

Single Responsibility

Each node should do one thing well:

Good:

- name: load_orders
  read: ...
  write: ...

- name: clean_orders
  depends_on: [load_orders]
  transform:
    steps:
      - sql: "SELECT * FROM load_orders WHERE status IS NOT NULL"
  write: ...

- name: enrich_orders
  depends_on: [clean_orders, customers]
  transform:
    steps:
      - operation: join
        left: clean_orders
        right: customers
        on: [customer_id]
  write: ...

Avoid:

- name: do_everything
  read: ...
  transform:
    steps:
      - sql: "..."  # 500 lines of SQL doing everything
  write: ...

Use Descriptions

Always add descriptions for documentation and debugging:

- name: calculate_daily_revenue
  description: "Aggregates order amounts by day for finance reporting"
  tags: [daily, finance, critical]

Cache Strategically

Use cache: true for nodes that are:
• Read by multiple downstream nodes
• Expensive to compute
• Small enough to fit in memory

Auto-Caching (Default Enabled):

Odibi automatically caches nodes with 3+ downstream dependents (configurable):

pipeline:
  name: silver
  auto_cache_threshold: 3  # Auto-cache nodes with 3+ dependencies (default)
  nodes:
    - name: dim_calendar
      # Has 10 downstream nodes → automatically cached ✅

    - name: huge_dimension
      # Has 5 downstream nodes → would be auto-cached...
      cache: false  # ...but explicit override prevents it ✅

    - name: linear_transform
      # Has 1 downstream node → not auto-cached (below threshold)

Manual Caching:

- name: dimension_products
  description: "Product dimension - cached for multiple joins"
  read: ...
  cache: true  # Explicit cache for nodes below threshold

Disable Auto-Caching:

pipeline:
  auto_cache_threshold: null  # Disable auto-caching, all manual

4. Naming Conventions

Pipeline Names

Use snake_case with layer prefix:

| Pattern | Example |
|---------|---------|
| {action}_{layer} | read_bronze, transform_silver, build_gold |
| {layer}_{domain} | bronze_sales, silver_inventory |

Node Names

Use descriptive snake_case:

| Node Type | Examples |
|-----------|----------|
| Source nodes | orders, customers, products |
| Transformed nodes | clean_orders, enriched_customers |
| Aggregated nodes | daily_sales, monthly_revenue |
| Dimension nodes | dim_product, dim_customer |
| Fact nodes | fact_orders, fact_inventory |

Connection Names

Use environment + purpose:

connections:
  prod_source_db:    # Production source database
  prod_lakehouse:    # Production data lake
  dev_lakehouse:     # Development data lake

5. Configuration Management

Environment Variables for Secrets

Always use environment variables for sensitive data:

connections:
  database:
    host: ${DB_HOST}
    username: ${DB_USER}
    password: ${DB_PASSWORD}

Never hardcode secrets:

connections:
  database:
    password: "my_secret_password"  # NEVER DO THIS

Use .env for Local Development

# .env (git-ignored)
DB_HOST=localhost
DB_USER=dev_user
DB_PASSWORD=dev_password
STORAGE_ACCOUNT=devaccount
STORAGE_KEY=abc123...

Environment-Specific Overrides

# In project.yaml
environments:
  dev:
    connections:
      lakehouse:
        container: dev-datalake
  prod:
    logging:
      level: WARNING
    connections:
      lakehouse:
        container: prod-datalake

Run with: odibi run project.yaml --env prod


6. Performance

Enable Arrow for Pandas

performance:
  use_arrow: true  # Major speedup for Parquet I/O

Use Incremental Loading

Don't reload full tables every time:

read:
  connection: source_db
  table: orders
  incremental:
    mode: stateful
    column: updated_at
    watermark_lag: "1d"

Skip Unchanged Data

For dimension tables that rarely change:

write:
  mode: append
  skip_if_unchanged: true
  skip_hash_columns: [id]

Optimize Delta Writes (Spark)

write:
  format: delta
  options:
    optimize_write: true
    cluster_by: [date, region]

Skip Null Profiling for Large Tables

performance:
  skip_null_profiling: true  # Faster for very large DataFrames

7. Data Quality & Validation

Validation Strategy Overview

Odibi provides three validation mechanisms for different use cases:

| Mechanism | When Executed | Purpose | On Failure |
|-----------|---------------|---------|------------|
| Contracts | Before processing | Input validation | Always stops pipeline |
| Validation | After transformation | Output checks | Configurable (warn/error) |
| Gates | Before write | Critical path checks | Blocks downstream nodes |
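
A minimal sketch of how the three mechanisms can appear together on one node, using the field names from the subsections below (schema details are illustrative):

- name: load_orders
  contracts:                     # checked before processing
    - type: not_null
      columns: [order_id]
  read: ...
  transform: ...
  validation:                    # checked after transformation
    tests:
      - type: unique
        columns: [order_id]
    mode: warn
    gate:                        # blocks downstream nodes on failure
      require_pass_rate: 1.0
      on_fail: abort
  write: ...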

Use Contracts for Input Validation

Fail fast if source data is bad:

- name: process_orders
  contracts:
    - type: not_null
      columns: [order_id, customer_id]
    - type: row_count
      min: 100
    - type: freshness
      column: created_at
      max_age: "24h"
  read: ...
  transform: ...

Use Validation for Output Checks

Warn (or fail) if output doesn't meet expectations:

- name: daily_revenue
  transform: ...
  validation:
    tests:
      - type: not_null
        columns: [date, revenue]
      - type: unique
        columns: [date]
      - type: range
        column: revenue
        min: 0
    mode: warn  # or "fail" to stop the pipeline

Available Validation Types

| Type | Description | Example |
|------|-------------|---------|
| not_null | Check for null values | columns: [id, name] |
| unique | Check for duplicates | columns: [id] |
| row_count | Validate row counts | min: 100, max: 1000000 |
| freshness | Check data recency | column: updated_at, max_age: "24h" |
| range | Numeric bounds | column: amount, min: 0, max: 10000 |
| regex | Pattern matching | column: email, pattern: "^.+@.+$" |
| custom_sql | Custom SQL assertion | condition: "col > 0" |
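
The regex and custom_sql checks are not demonstrated elsewhere in this guide; a minimal sketch based on the parameters in the table above (the exact condition syntax is assumed):

validation:
  tests:
    - type: regex
      column: email
      pattern: "^.+@.+$"
    - type: custom_sql
      condition: "amount > 0"
  mode: warn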

Use Quality Gates for Critical Paths

- name: load_orders
  validation:
    tests:
      - type: row_count
        min: 1000
    gate:
      require_pass_rate: 1.0
      on_fail: abort  # Stops pipeline if validation fails

FK Validation for Fact Tables

Ensure referential integrity before loading fact tables:

- name: fact_orders
  depends_on: [dim_customer, dim_product]
  read:
    connection: staging
    path: orders
  # FK validation is handled by the fact pattern's dimension lookups
  # See: Fact Pattern → orphan_handling (unknown, reject, quarantine)
  pattern:
    type: fact
    params:
      grain: [order_id]
      dimensions:
        - source_column: customer_id
          dimension_table: dim_customer
          dimension_key: customer_id
          surrogate_key: customer_sk
        - source_column: product_id
          dimension_table: dim_product
          dimension_key: product_id
          surrogate_key: product_sk
      orphan_handling: unknown
  write:
    connection: warehouse
    path: fact_orders

Custom Validation Functions

Register custom validation logic:

import pandas as pd

from odibi import transform

@transform("validate_business_rules")
def validate_business_rules(context, current):
    """Custom business rule validation."""
    errors = []

    # Rule 1: Order amount must match line items
    mismatched = current[current['total'] != current['line_items_sum']]
    if len(mismatched) > 0:
        errors.append(f"{len(mismatched)} orders with mismatched totals")

    # Rule 2: Future dates not allowed
    future_orders = current[current['order_date'] > pd.Timestamp.now()]
    if len(future_orders) > 0:
        errors.append(f"{len(future_orders)} orders with future dates")

    if errors:
        context.log_warning(f"Validation issues: {'; '.join(errors)}")

    return current

Use in YAML:

transform:
  steps:
    - function: validate_business_rules

Quarantine Bad Records

Separate bad data for review instead of failing:

- name: process_orders
  validation:
    tests:
      - type: not_null
        columns: [order_id, amount]
        on_fail: quarantine
    quarantine:
      connection: warehouse
      path: quarantine/orders

8. Cross-Pipeline Dependencies

Use $pipeline.node References

When silver needs bronze outputs:

# pipelines/silver/transform_silver.yaml
pipelines:
  - pipeline: transform_silver
    nodes:
      - name: enriched_orders
        inputs:
          orders: $read_bronze.orders           # Cross-pipeline reference
          customers: $read_bronze.customers
        transform:
          steps:
            - operation: join
              left: orders
              right: customers
              on: [customer_id]
        write:
          connection: lakehouse
          format: delta
          path: "silver/enriched_orders"

Run Pipelines in Order

# Bronze first
odibi run project.yaml --pipeline read_bronze

# Then silver (references bronze outputs)
odibi run project.yaml --pipeline transform_silver

Best Practices for References

  1. Always use path: in write config — ensures cross-engine compatibility
  2. Run source pipeline first — references require catalog entries
  3. Use meaningful node names: $read_bronze.orders is clearer than $p1.n1

9. Security

Mask Sensitive Columns in Stories

- name: process_users
  sensitive: [email, ssn, phone]  # Masked in Data Stories

Full Node Masking for PII-Heavy Nodes

- name: medical_records
  sensitive: true  # Entire sample redacted

Use Key Vault in Production

connections:
  lakehouse:
    auth:
      mode: key_vault
      key_vault: my-key-vault
      secret: storage-account-key

Never Log Secrets

Odibi automatically redacts values that look like secrets, but be careful in custom transformations:

import logging

from odibi import transform

logger = logging.getLogger(__name__)

@transform
def my_transform(context, params):
    # ❌ NEVER do this
    print(f"Using password: {params['password']}")

    # ✅ Do this instead
    logger.info("Connecting to database...")

10. Version Control

Git Ignore List

# .gitignore
.env
*.pyc
__pycache__/
.odibi/
stories/
*.log
.venv/

Commit Guidelines

| Change Type | Commit Message |
|-------------|----------------|
| New pipeline | feat(bronze): add customer ingestion pipeline |
| New node | feat(silver): add order enrichment node |
| Bug fix | fix(gold): correct revenue calculation |
| Config change | chore: update retry settings |

Branch Strategy

main           # Production-ready pipelines
├── develop    # Integration branch
├── feature/*  # New pipelines/nodes
└── fix/*      # Bug fixes

PR Checklist

  • [ ] Pipeline runs locally without errors
  • [ ] Node descriptions added
  • [ ] Sensitive columns marked
  • [ ] Incremental config for large tables
  • [ ] Tests pass

Quick Reference

Project Organization Cheat Sheet

project.yaml          → Connections, settings, imports (NO pipelines)
pipelines/{layer}/    → One YAML per pipeline
transformations/      → Custom Python code
.env                  → Local secrets (git-ignored)

Node Checklist

  • [ ] Descriptive name (clean_orders not node_1)
  • [ ] Description explaining purpose
  • [ ] Tags for filtering (daily, critical)
  • [ ] cache: true if used by multiple nodes
  • [ ] sensitive for PII columns
  • [ ] Incremental config for large tables

Performance Checklist

  • [ ] use_arrow: true for Pandas
  • [ ] Incremental loading for large sources
  • [ ] skip_if_unchanged for dimensions
  • [ ] skip_null_profiling for very large tables
  • [ ] cluster_by for Spark/Delta