Odibi Best Practices Guide¶
Version: 3.4.3
Last Updated: 2025-12-03
Audience: Data Engineers, Analytics Engineers, Team Leads
This guide covers recommended patterns for building maintainable, scalable, and production-ready Odibi pipelines.
Table of Contents¶
- Project Organization
- Pipeline Design
- Node Design
- Naming Conventions
- Configuration Management
- Performance
- Data Quality & Validation
- Cross-Pipeline Dependencies
- Security
- Version Control
1. Project Organization¶
Recommended Folder Structure¶
my-odibi-project/
├── project.yaml # Core config (connections, settings)
├── pipelines/
│ ├── bronze/
│ │ └── read_bronze.yaml # Bronze layer pipeline
│ ├── silver/
│ │ └── transform_silver.yaml # Silver layer pipeline
│ └── gold/
│ └── build_gold.yaml # Gold layer pipeline
├── transformations/
│ ├── __init__.py
│ └── custom_transforms.py # Custom Python transformations
├── sql/
│ └── complex_queries.sql # Complex SQL (optional)
├── tests/
│ └── test_pipelines.py # Pipeline tests
├── .env # Local secrets (git-ignored)
├── .gitignore
└── README.md
Separation of Concerns¶
| File | Contains | Does NOT Contain |
|---|---|---|
| `project.yaml` | Connections, system config, story config, imports | Pipeline definitions |
| `pipelines/*.yaml` | Pipeline and node definitions | Connection details |
| `transformations/` | Custom Python logic | YAML configuration |
Example project.yaml¶
```yaml
project: SalesAnalytics
description: "Sales Analytics Platform"
engine: spark
version: "1.0.0"
owner: "Data Team"

# === Connections (defined once, used everywhere) ===
connections:
  source_db:
    type: sql_server
    host: ${DB_HOST}
    database: ${DB_NAME}
    auth:
      mode: sql_login
      username: ${DB_USER}
      password: ${DB_PASS}
  lakehouse:
    type: azure_blob
    account_name: ${STORAGE_ACCOUNT}
    container: datalake
    auth:
      mode: account_key
      account_key: ${STORAGE_KEY}

# === System Catalog ===
system:
  connection: lakehouse
  path: _odibi_system

# === Story Configuration ===
story:
  connection: lakehouse
  path: stories/
  retention_days: 30
  auto_generate: true

# === Global Settings ===
performance:
  use_arrow: true
  skip_null_profiling: true

retry:
  enabled: true
  max_attempts: 3
  backoff: exponential

logging:
  level: INFO
  structured: true

# === Import Pipelines ===
imports:
  - pipelines/bronze/read_bronze.yaml
  - pipelines/silver/transform_silver.yaml
  - pipelines/gold/build_gold.yaml
```
Example Pipeline File¶
pipelines/bronze/read_bronze.yaml:
```yaml
pipelines:
  - pipeline: read_bronze
    description: "Ingest raw data from source systems"
    layer: bronze
    nodes:
      - name: orders
        description: "Raw orders from ERP"
        read:
          connection: source_db
          format: sql
          table: sales.orders
          incremental:
            mode: stateful
            column: updated_at
        write:
          connection: lakehouse
          format: delta
          path: "bronze/orders"
          mode: append
          add_metadata: true
      - name: customers
        description: "Customer master data"
        read:
          connection: source_db
          format: sql
          table: sales.customers
        write:
          connection: lakehouse
          format: delta
          path: "bronze/customers"
          mode: append
          add_metadata: true
          skip_if_unchanged: true
          skip_hash_columns: [customer_id]
```
2. Pipeline Design¶
One Pipeline Per Layer Per Domain¶
✅ Good:
pipelines/
├── bronze/
│ └── read_bronze.yaml # All bronze ingestion
├── silver/
│ └── transform_silver.yaml # All silver transformations
└── gold/
├── gold_sales.yaml # Sales domain aggregates
└── gold_inventory.yaml # Inventory domain aggregates
❌ Avoid:
pipelines/
├── orders_bronze_silver_gold.yaml # Too many concerns in one file
└── everything.yaml # Unmaintainable
Pipeline Sizing Guidelines¶
| Node Count | Recommendation |
|---|---|
| 1-20 nodes | Single pipeline file |
| 20-50 nodes | Consider splitting by sub-domain |
| 50+ nodes | Split into multiple pipelines |
Keep Nodes with Their Pipeline¶
❌ Don't split nodes into separate files:
```yaml
# nodes/orders.yaml - BAD: nodes scattered across files
pipelines:
  - pipeline: read_bronze
    nodes:
      - name: orders
```
✅ Keep nodes together:
```yaml
# read_bronze.yaml - GOOD: all nodes in one place
pipelines:
  - pipeline: read_bronze
    nodes:
      - name: orders
        # ...
      - name: customers
        # ...
      - name: products
        # ...
```
Why?
- depends_on relationships are visible in one file
- Easier to understand the full pipeline flow
- One file = one pipeline = one commit for changes
3. Node Design¶
Single Responsibility¶
Each node should do one thing well:
✅ Good:
```yaml
- name: load_orders
  read: ...
  write: ...

- name: clean_orders
  depends_on: [load_orders]
  transform:
    steps:
      - sql: "SELECT * FROM load_orders WHERE status IS NOT NULL"
  write: ...

- name: enrich_orders
  depends_on: [clean_orders, customers]
  transform:
    steps:
      - operation: join
        left: clean_orders
        right: customers
        on: [customer_id]
  write: ...
```
❌ Avoid:
```yaml
- name: do_everything
  read: ...
  transform:
    steps:
      - sql: "..."  # 500 lines of SQL doing everything
  write: ...
```
Use Descriptions¶
Always add descriptions for documentation and debugging:
```yaml
- name: calculate_daily_revenue
  description: "Aggregates order amounts by day for finance reporting"
  tags: [daily, finance, critical]
```
Cache Strategically¶
Use cache: true for nodes that are:
- Read by multiple downstream nodes
- Expensive to compute
- Small enough to fit in memory
Auto-Caching (Default Enabled):
Odibi automatically caches nodes with 3+ downstream dependencies (configurable):
```yaml
pipeline:
  name: silver
  auto_cache_threshold: 3  # Auto-cache nodes with 3+ dependencies (default)
  nodes:
    - name: dim_calendar
      # Has 10 downstream nodes → automatically cached ✅
    - name: huge_dimension
      # Has 5 downstream nodes → would be auto-cached...
      cache: false  # ...but explicit override prevents it ✅
    - name: linear_transform
      # Has 1 downstream node → not auto-cached (below threshold)
```
Manual Caching:
```yaml
- name: dimension_products
  description: "Product dimension - cached for multiple joins"
  read: ...
  cache: true  # Explicit cache for nodes below threshold
```
Disable Auto-Caching:
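The original snippet for this heading is missing. A minimal sketch, assuming the `auto_cache_threshold` option shown above accepts `0` to turn the behavior off entirely:

```yaml
pipeline:
  name: silver
  auto_cache_threshold: 0  # assumption: 0 disables auto-caching for this pipeline
```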
4. Naming Conventions¶
Pipeline Names¶
Use snake_case with layer prefix:
| Pattern | Example |
|---|---|
| `{action}_{layer}` | `read_bronze`, `transform_silver`, `build_gold` |
| `{layer}_{domain}` | `bronze_sales`, `silver_inventory` |
Node Names¶
Use descriptive snake_case:
| Pattern | Example |
|---|---|
| Source nodes | orders, customers, products |
| Transformed nodes | clean_orders, enriched_customers |
| Aggregated nodes | daily_sales, monthly_revenue |
| Dimension nodes | dim_product, dim_customer |
| Fact nodes | fact_orders, fact_inventory |
Connection Names¶
Use environment + purpose:
```yaml
connections:
  prod_source_db:  # Production source database
  prod_lakehouse:  # Production data lake
  dev_lakehouse:   # Development data lake
```
5. Configuration Management¶
Environment Variables for Secrets¶
✅ Always use environment variables for sensitive data:
❌ Never hardcode secrets:
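Both patterns side by side. The good form mirrors the `${VAR}` interpolation used in the example project.yaml earlier in this guide; the literal credentials in the bad form are invented purely to illustrate the anti-pattern:

```yaml
# ✅ Good: resolved from the environment at run time
connections:
  source_db:
    auth:
      mode: sql_login
      username: ${DB_USER}
      password: ${DB_PASS}

# ❌ Bad: secrets committed to version control
connections:
  source_db:
    auth:
      mode: sql_login
      username: admin
      password: "s3cret!"  # never hardcode credentials
```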
Use .env for Local Development¶
# .env (git-ignored)
DB_HOST=localhost
DB_USER=dev_user
DB_PASSWORD=dev_password
STORAGE_ACCOUNT=devaccount
STORAGE_KEY=abc123...
Environment-Specific Overrides¶
```yaml
# In project.yaml
environments:
  dev:
    connections:
      lakehouse:
        container: dev-datalake
  prod:
    logging:
      level: WARNING
    connections:
      lakehouse:
        container: prod-datalake
```
Run with: odibi run project.yaml --env prod
6. Performance¶
Enable Arrow for Pandas¶
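The snippet for this heading did not survive extraction. A sketch using the `use_arrow` flag from the example project.yaml, assuming the `engine` key also accepts `pandas`:

```yaml
# project.yaml
engine: pandas  # assumption: pandas is a valid engine value alongside spark

performance:
  use_arrow: true  # Arrow-backed I/O for the Pandas engine
```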
Use Incremental Loading¶
Don't reload full tables every time:
```yaml
read:
  connection: source_db
  table: orders
  incremental:
    mode: stateful
    column: updated_at
    watermark_lag: "1d"
```
Skip Unchanged Data¶
For dimension tables that rarely change:
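A sketch mirroring the `customers` node in the bronze example earlier; the table name and hash column here are illustrative, and the `skip_hash_columns` semantics (columns hashed for change detection) are an assumption:

```yaml
write:
  connection: lakehouse
  format: delta
  path: "bronze/dim_region"        # hypothetical dimension table
  mode: append
  skip_if_unchanged: true          # skip the write when content has not changed
  skip_hash_columns: [region_id]   # assumption: columns used for the change-detection hash
```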
Optimize Delta Writes (Spark)¶
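No snippet survives under this heading. A sketch, assuming the `cluster_by` option named in the performance checklist below is a write-level key:

```yaml
write:
  connection: lakehouse
  format: delta
  path: "gold/daily_sales"
  mode: overwrite
  cluster_by: [order_date]  # hypothetical placement; co-locates rows for faster range scans
```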
Skip Null Profiling for Large Tables¶
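This is the same flag shown in the example project.yaml; the comment describing its effect is an assumption:

```yaml
performance:
  skip_null_profiling: true  # assumption: skips per-column null profiling on large tables
```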
7. Data Quality & Validation¶
Validation Strategy Overview¶
Odibi provides three validation mechanisms for different use cases:
| Mechanism | When Executed | Purpose | On Failure |
|---|---|---|---|
| Contracts | Before processing | Input validation | Always stops pipeline |
| Validation | After transformation | Output checks | Configurable (warn/error) |
| Gates | Before write | Critical path checks | Blocks downstream nodes |
Use Contracts for Input Validation¶
Fail fast if source data is bad:
```yaml
- name: process_orders
  contracts:
    - type: not_null
      columns: [order_id, customer_id]
    - type: row_count
      min: 100
    - type: freshness
      column: created_at
      max_age: "24h"
  read: ...
  transform: ...
```
Use Validation for Output Checks¶
Warn (or fail) if output doesn't meet expectations:
```yaml
- name: daily_revenue
  transform: ...
  validation:
    tests:
      - type: not_null
        columns: [date, revenue]
      - type: unique
        columns: [date]
      - type: range
        column: revenue
        min: 0
    mode: warn  # or "fail" to stop the pipeline
```
Available Validation Types¶
| Type | Description | Example |
|---|---|---|
| `not_null` | Check for null values | `columns: [id, name]` |
| `unique` | Check for duplicates | `columns: [id]` |
| `row_count` | Validate row counts | `min: 100, max: 1000000` |
| `freshness` | Check data recency | `column: updated_at, max_age: "24h"` |
| `range` | Numeric bounds | `column: amount, min: 0, max: 10000` |
| `regex` | Pattern matching | `column: email, pattern: "^.+@.+$"` |
| `custom_sql` | Custom SQL assertion | `condition: "col > 0"` |
Use Quality Gates for Critical Paths¶
```yaml
- name: load_orders
  validation:
    tests:
      - type: row_count
        min: 1000
    gate:
      require_pass_rate: 1.0
      on_fail: abort  # Stops pipeline if validation fails
```
FK Validation for Fact Tables¶
Ensure referential integrity before loading fact tables:
```yaml
- name: fact_orders
  depends_on: [dim_customer, dim_product]
  read:
    connection: staging
    path: orders
  # FK validation is handled by the fact pattern's dimension lookups
  # See: Fact Pattern → orphan_handling (unknown, reject, quarantine)
  pattern:
    type: fact
    params:
      grain: [order_id]
      dimensions:
        - source_column: customer_id
          dimension_table: dim_customer
          dimension_key: customer_id
          surrogate_key: customer_sk
        - source_column: product_id
          dimension_table: dim_product
          dimension_key: product_id
          surrogate_key: product_sk
      orphan_handling: unknown
  write:
    connection: warehouse
    path: fact_orders
```
Custom Validation Functions¶
Register custom validation logic:
```python
import pandas as pd

from odibi import transform


@transform("validate_business_rules")
def validate_business_rules(context, current):
    """Custom business rule validation."""
    errors = []

    # Rule 1: Order amount must match line items
    mismatched = current[current['total'] != current['line_items_sum']]
    if len(mismatched) > 0:
        errors.append(f"{len(mismatched)} orders with mismatched totals")

    # Rule 2: Future dates not allowed
    future_orders = current[current['order_date'] > pd.Timestamp.now()]
    if len(future_orders) > 0:
        errors.append(f"{len(future_orders)} orders with future dates")

    if errors:
        context.log_warning(f"Validation issues: {'; '.join(errors)}")
    return current
```
Use in YAML:
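The YAML snippet for this step is missing. A sketch of invoking the registered transform from a node; the `function:` step key is an assumption, since only `sql:` and `operation:` steps appear elsewhere in this guide:

```yaml
- name: validated_orders
  depends_on: [enrich_orders]
  transform:
    steps:
      - function: validate_business_rules  # hypothetical step key for registered transforms
```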
Quarantine Bad Records¶
Separate bad data for review instead of failing:
```yaml
- name: process_orders
  validation:
    tests:
      - type: not_null
        columns: [order_id, amount]
    on_fail: quarantine
    quarantine:
      connection: warehouse
      path: quarantine/orders
```
8. Cross-Pipeline Dependencies¶
Use $pipeline.node References¶
When silver needs bronze outputs:
```yaml
# pipelines/silver/transform_silver.yaml
pipelines:
  - pipeline: transform_silver
    nodes:
      - name: enriched_orders
        inputs:
          orders: $read_bronze.orders  # Cross-pipeline reference
          customers: $read_bronze.customers
        transform:
          steps:
            - operation: join
              left: orders
              right: customers
              on: [customer_id]
        write:
          connection: lakehouse
          format: delta
          path: "silver/enriched_orders"
```
Run Pipelines in Order¶
# Bronze first
odibi run project.yaml --pipeline read_bronze
# Then silver (references bronze outputs)
odibi run project.yaml --pipeline transform_silver
Best Practices for References¶
- Always use `path:` in write config — ensures cross-engine compatibility
- Run source pipeline first — references require catalog entries
- Use meaningful node names — `$read_bronze.orders` is clearer than `$p1.n1`
9. Security¶
Mask Sensitive Columns in Stories¶
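The snippet for this section is missing. A sketch, assuming the `sensitive` node key from the checklist later in this guide takes a column list that story output then masks; the column names are hypothetical:

```yaml
- name: customers
  sensitive: [email, phone, tax_id]  # hypothetical columns; assumed masked in stories
  read: ...
  write: ...
```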
Full Node Masking for PII-Heavy Nodes¶
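For nodes where nearly every column is PII, a per-column list gets noisy. One plausible shape (unverified assumption) is an all-columns value:

```yaml
- name: patient_records
  sensitive: all  # assumption: masks every column of this node in generated stories
```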
Use Key Vault in Production¶
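A sketch of a Key-Vault-backed connection; the `mode: key_vault` auth block is an assumption modeled on the `account_key` and `sql_login` modes shown earlier:

```yaml
connections:
  lakehouse:
    type: azure_blob
    account_name: ${STORAGE_ACCOUNT}
    container: datalake
    auth:
      mode: key_vault              # hypothetical auth mode
      vault_url: ${KEY_VAULT_URL}
      secret_name: storage-account-key
```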
Never Log Secrets¶
Odibi automatically redacts values that look like secrets, but be careful in custom transformations:
```python
@transform
def my_transform(context, params):
    # ❌ NEVER do this
    print(f"Using password: {params['password']}")

    # ✅ Do this instead
    logger.info("Connecting to database...")
```
10. Version Control¶
Git Ignore List¶
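The list itself is missing here. A minimal sketch consistent with the project layout above; `.env` is the only entry this guide states explicitly, and the rest are common Python additions:

```
# .gitignore
.env
__pycache__/
*.pyc
.pytest_cache/
```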
Commit Guidelines¶
| Change Type | Commit Message |
|---|---|
| New pipeline | feat(bronze): add customer ingestion pipeline |
| New node | feat(silver): add order enrichment node |
| Bug fix | fix(gold): correct revenue calculation |
| Config change | chore: update retry settings |
Branch Strategy¶
main # Production-ready pipelines
├── develop # Integration branch
├── feature/* # New pipelines/nodes
└── fix/* # Bug fixes
PR Checklist¶
- [ ] Pipeline runs locally without errors
- [ ] Node descriptions added
- [ ] Sensitive columns marked
- [ ] Incremental config for large tables
- [ ] Tests pass
Quick Reference¶
Project Organization Cheat Sheet¶
project.yaml → Connections, settings, imports (NO pipelines)
pipelines/{layer}/ → One YAML per pipeline
transformations/ → Custom Python code
.env → Local secrets (git-ignored)
Node Checklist¶
- [ ] Descriptive name (`clean_orders` not `node_1`)
- [ ] Description explaining purpose
- [ ] Tags for filtering (`daily`, `critical`)
- [ ] `cache: true` if used by multiple nodes
- [ ] `sensitive` for PII columns
- [ ] Incremental config for large tables
Performance Checklist¶
- [ ] `use_arrow: true` for Pandas
- [ ] Incremental loading for large sources
- [ ] `skip_if_unchanged` for dimensions
- [ ] `skip_null_profiling` for very large tables
- [ ] `cluster_by` for Spark/Delta
Related Documentation¶
- The Definitive Guide — Deep dive into architecture
- Performance Tuning — Optimization details
- Production Deployment — Going to production
- Cross-Pipeline Dependencies — `$pipeline.node` references
- YAML Schema Reference — Full configuration options
- Validation Overview — Data quality framework
- Glossary — Terminology reference