Decision Guide

Rules of thumb for common Odibi decisions.

Engine Choice

Scenario	Engine	Why
Local dev, files < 1GB	`pandas`	Fast startup, no extra dependencies
Local dev, files 1-10GB	`polars`	Faster than pandas, lazy evaluation
Production, Delta Lake, > 10GB	`spark`	Distributed, Delta support
Databricks	`spark`	Native integration

Rule: Start with pandas locally, switch to spark for production.
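
The table above can be read as a small decision function. The sketch below is only an illustration: `choose_engine` and its parameters are hypothetical names, not part of Odibi, and the 1 GB and 10 GB cut-offs are taken directly from the table.

```python
def choose_engine(size_gb: float, production: bool = False,
                  on_databricks: bool = False) -> str:
    """Hypothetical helper mirroring the engine table; not an Odibi API."""
    # Databricks, production workloads, and data above 10 GB all point to spark.
    if on_databricks or production or size_gb > 10:
        return "spark"
    # Mid-sized local data benefits from polars' lazy evaluation.
    if size_gb >= 1:
        return "polars"
    # Small local files: pandas starts fastest.
    return "pandas"


if __name__ == "__main__":
    print(choose_engine(0.2))                   # small local file
    print(choose_engine(4))                     # mid-sized local file
    print(choose_engine(0.2, production=True))  # production job
```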

Validation: Contracts vs Tests

Use...	When...	Behavior
`contracts:`	Checking source data	Runs before read, fail-fast
`validation.tests:`	Checking output data	Runs after transform, configurable

Rule: Use contracts for freshness/schema/volume. Use tests for row-level quality.

# Contracts: Source quality (fail-fast)
contracts:
  - type: freshness
    column: updated_at
    max_age: "24h"

# Tests: Output quality (with gates)
validation:
  tests:
    - type: not_null
      columns: [id, name]
  gate:
    on_fail: warn_and_write  # or 'abort'
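
To make the freshness contract concrete, here is a minimal sketch of what a `max_age: "24h"` check might do. It is not Odibi's implementation: `parse_max_age` and `is_fresh` are hypothetical helpers, and the supported unit suffixes (`m`, `h`, `d`) are an assumption.

```python
from datetime import datetime, timedelta, timezone

# Assumed unit suffixes; the real set accepted by Odibi may differ.
_UNITS = {"m": "minutes", "h": "hours", "d": "days"}


def parse_max_age(text: str) -> timedelta:
    """Turn a string such as "24h" into a timedelta."""
    value, unit = text[:-1], text[-1:]
    if unit not in _UNITS or not value.isdigit():
        raise ValueError(f"unsupported max_age: {text!r}")
    return timedelta(**{_UNITS[unit]: int(value)})


def is_fresh(latest: datetime, max_age: str, now: datetime) -> bool:
    """True if the newest `updated_at` value is within max_age of `now`."""
    return now - latest <= parse_max_age(max_age)


if __name__ == "__main__":
    now = datetime(2024, 1, 2, tzinfo=timezone.utc)
    latest = datetime(2024, 1, 1, 12, tzinfo=timezone.utc)
    # A fail-fast contract would stop the node here if this were False.
    print(is_fresh(latest, "24h", now))
```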

When to Use Quality Gates

Scenario	Gate Setting
Development/testing	`on_fail: warn_and_write`
Production, non-critical	`on_fail: warn_and_write` + alerting
Production, critical data	`on_fail: abort`
Regulatory/compliance	`on_fail: abort` + quarantine

Rule: Start with warn_and_write, tighten to abort as you trust the data.
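
The difference between the two gate settings can be sketched as follows. This is an illustration of the behavior described above, not Odibi code: `apply_gate` and `GateBlocked` are hypothetical names, and only the two `on_fail` values from this guide are handled.

```python
import logging

logger = logging.getLogger("gate_sketch")


class GateBlocked(Exception):
    """Raised when a gate refuses to let the write proceed (hypothetical)."""


def apply_gate(failed_tests: list[str], on_fail: str) -> bool:
    """Return True if the write should proceed, or raise GateBlocked."""
    if not failed_tests:
        return True
    if on_fail == "warn_and_write":
        # Record the failures but still let the data through.
        logger.warning("validation failed, writing anyway: %s", failed_tests)
        return True
    if on_fail == "abort":
        # Block the write entirely.
        raise GateBlocked(f"validation failed: {failed_tests}")
    raise ValueError(f"unknown on_fail setting: {on_fail!r}")


if __name__ == "__main__":
    print(apply_gate(["not_null(id)"], "warn_and_write"))
```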

When to Enable Alerting

Scenario	Alert
Local dev	No alerts
Scheduled production jobs	`on_failure`
Critical SLA pipelines	`on_failure` + `on_gate_block`
All runs (audit trail)	`on_success` + `on_failure`

Rule: Enable alerting when someone needs to act on failure.

alerts:
  - type: slack
    url: ${SLACK_WEBHOOK}
    on_events: [on_failure, on_gate_block]
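
The `${SLACK_WEBHOOK}` placeholder keeps the webhook URL out of the config file. How Odibi resolves it internally is not shown in this guide; the sketch below only illustrates the general idea with Python's `string.Template`. `resolve_env` and `should_alert` are hypothetical helpers, and the example URL is a placeholder.

```python
import os
from string import Template
from typing import Mapping, Optional


def resolve_env(value: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Substitute ${NAME} placeholders; raises KeyError if NAME is unset."""
    env = os.environ if env is None else env
    return Template(value).substitute(env)


def should_alert(event: str, on_events: list[str]) -> bool:
    """An alert fires only for events listed in its on_events."""
    return event in on_events


if __name__ == "__main__":
    fake_env = {"SLACK_WEBHOOK": "https://example.invalid/hook"}
    print(resolve_env("${SLACK_WEBHOOK}", fake_env))
    print(should_alert("on_failure", ["on_failure", "on_gate_block"]))
```

Failing loudly on a missing variable is a deliberate choice in this sketch: a silently empty webhook URL would mean alerts are never delivered.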

Bronze vs Silver vs Gold Logic

Logic Type	Layer	Why
Ingestion, format conversion	Bronze	Raw preservation
Deduplication, cleaning	Silver	Single source of truth
Business transforms, aggregation	Gold	Consumption-ready

Rule: If it changes the semantic meaning, it belongs in Gold.

Incremental Mode Selection

Has reliable timestamp column?
├─► Yes
│   ├─► Need exact row tracking? → mode: stateful
│   └─► OK with overlap? → mode: rolling_window
└─► No
    ├─► Data is immutable? → mode: append
    └─► Data can change? → skip_if_unchanged: true

Mode	State	Use Case
`stateful`	Persisted HWM	CDC, database extraction
`rolling_window`	Lookback period	Event logs, files with dates
`append`	None	Immutable streams
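
The two timestamp-based modes differ in what they remember between runs. The sketch below illustrates the idea on in-memory rows; it is a simplified model, not Odibi's implementation, and `stateful_filter`, `rolling_window_filter`, and the `updated_at` field name are assumptions made for the example.

```python
from datetime import datetime, timedelta


def stateful_filter(rows, hwm):
    """Keep rows strictly newer than the persisted high-water mark (HWM).

    Returns the new rows and the HWM to persist for the next run.
    """
    new_rows = [r for r in rows if r["updated_at"] > hwm]
    new_hwm = max((r["updated_at"] for r in new_rows), default=hwm)
    return new_rows, new_hwm


def rolling_window_filter(rows, now, lookback):
    """Keep rows inside a fixed lookback period; no state is persisted,
    so consecutive runs may re-read overlapping rows."""
    cutoff = now - lookback
    return [r for r in rows if r["updated_at"] >= cutoff]


if __name__ == "__main__":
    rows = [
        {"id": 1, "updated_at": datetime(2024, 1, 1)},
        {"id": 2, "updated_at": datetime(2024, 1, 3)},
        {"id": 3, "updated_at": datetime(2024, 1, 5)},
    ]
    new_rows, hwm = stateful_filter(rows, datetime(2024, 1, 2))
    print([r["id"] for r in new_rows], hwm)
    recent = rolling_window_filter(rows, datetime(2024, 1, 5), timedelta(days=2))
    print([r["id"] for r in recent])
```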

SCD Type Selection

Need history?	Changes often?	Recommendation
No	-	SCD Type 1 (overwrite)
Yes	Slowly (< 1/day)	SCD Type 2 (versioned)
Yes	Frequently	Daily snapshots instead

Rule: SCD2 is for slowly changing dimensions. Fast changes = snapshot approach.
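
For readers new to the terminology, the following sketch contrasts Type 1 (overwrite) with Type 2 (versioned) on plain Python structures. It is a conceptual illustration only: the function names and the `valid_from` / `valid_to` column names are assumptions, not the columns Odibi's SCD2 transformer produces.

```python
def apply_scd1(table: dict, key, attrs: dict) -> dict:
    """Type 1: overwrite in place, so no history is kept."""
    return {**table, key: attrs}


def apply_scd2(history: list, key, attrs: dict, as_of: str) -> list:
    """Type 2: close the current version and append a new one on change."""
    out, changed = [], True
    for row in history:
        if row["key"] == key and row["valid_to"] is None:
            if row["attrs"] == attrs:
                changed = False          # nothing changed, keep current row
                out.append(row)
            else:
                out.append({**row, "valid_to": as_of})  # close old version
        else:
            out.append(row)
    if changed:
        out.append({"key": key, "attrs": attrs,
                    "valid_from": as_of, "valid_to": None})
    return out


if __name__ == "__main__":
    h = apply_scd2([], 1, {"city": "Oslo"}, "2024-01-01")
    h = apply_scd2(h, 1, {"city": "Bergen"}, "2024-02-01")
    for row in h:
        print(row)
```

Every change adds a row under Type 2, which is why the table above recommends snapshots once changes become frequent.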

Merge vs Overwrite Write Mode

Scenario	Mode	Why
First load / full refresh	`overwrite`	Clean slate
Incremental updates	`merge`	Preserve existing, update keys
Append-only events	`append`	Immutable, no updates
SCD2 result	N/A (self-contained)	Transformer writes directly to target; no `write:` block needed
Aggregations	`overwrite`	Idempotent recalculation

Rule: When in doubt, use overwrite for transforms and append for raw.
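
The three write modes can be modeled on lists of keyed rows. This is a conceptual sketch of the semantics in the table, not how Odibi or Delta Lake performs the write; `write_rows` and the `id` key are hypothetical.

```python
def write_rows(target: list, incoming: list, mode: str, key: str = "id") -> list:
    """Return the target contents after a write in the given mode."""
    if mode == "overwrite":
        # Clean slate: previous contents are discarded.
        return list(incoming)
    if mode == "append":
        # Existing rows are never touched; duplicates are possible.
        return target + incoming
    if mode == "merge":
        # Upsert: matching keys are updated, new keys are added.
        by_key = {row[key]: row for row in target}
        by_key.update({row[key]: row for row in incoming})
        return list(by_key.values())
    raise ValueError(f"unknown write mode: {mode!r}")


if __name__ == "__main__":
    target = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}]
    incoming = [{"id": 2, "v": "B"}, {"id": 3, "v": "c"}]
    for mode in ("overwrite", "append", "merge"):
        print(mode, write_rows(target, incoming, mode))
```

Rerunning an `overwrite` with the same input yields the same target, which is what makes it the idempotent choice for recalculated aggregations.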

Retry Configuration

Scenario	Retry Config
Transient network issues	`max_attempts: 3`
Database locks	`backoff: exponential`
Stable local files	`enabled: false`

retry:
  enabled: true
  max_attempts: 3
  backoff: exponential
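
As a rough picture of what `max_attempts: 3` with `backoff: exponential` means, the sketch below retries a callable and doubles the wait after each failure. The `retry` function, the `base_delay` parameter, and the doubling factor are assumptions for illustration; Odibi's actual delays are not specified in this guide.

```python
import time


def retry(fn, max_attempts: int = 3, base_delay: float = 1.0,
          enabled: bool = True, sleep=time.sleep):
    """Call fn, retrying on any exception with exponential backoff."""
    attempts = max_attempts if enabled else 1
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            # Wait base_delay, then 2x, then 4x, ... before the next try.
            sleep(base_delay * 2 ** (attempt - 1))


if __name__ == "__main__":
    calls = {"n": 0}

    def flaky():
        calls["n"] += 1
        if calls["n"] < 3:
            raise ConnectionError("transient")
        return "ok"

    # Use a tiny base delay so the demo finishes quickly.
    print(retry(flaky, base_delay=0.01))
```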

When to Use Quarantine

Scenario	Quarantine?
Dev/testing	No (just fail or warn)
Data with expected bad rows	Yes (review later)
Strict quality required	Yes (fail + quarantine for audit)
High-volume, low-quality tolerance	Yes (separate good from bad)

Rule: Quarantine when you can't afford data loss but also can't accept bad data.
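
Conceptually, quarantine splits a batch into rows that pass and rows that are set aside for review, so nothing is dropped. A minimal sketch, assuming a simple not-null rule; `split_quarantine` is a hypothetical helper, not an Odibi function:

```python
def split_quarantine(rows: list, required: list) -> tuple:
    """Split rows into (good, quarantined) based on required non-null columns."""
    good, quarantined = [], []
    for row in rows:
        if any(row.get(col) is None for col in required):
            quarantined.append(row)   # kept for later review, not lost
        else:
            good.append(row)          # safe to write to the target
    return good, quarantined


if __name__ == "__main__":
    rows = [
        {"id": 1, "name": "a"},
        {"id": None, "name": "b"},
        {"id": 3},
    ]
    good, bad = split_quarantine(rows, ["id", "name"])
    print(len(good), "good;", len(bad), "quarantined")
```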

File Format Selection

Format	When to Use
`csv`	Source files, human-readable
`parquet`	Local analytics, single-machine
`delta`	Production, ACID, time travel
`json`	API responses, nested data

Rule: Use Delta for anything that needs reliability or history.

Quick Checklist: Production Ready?

[ ] Engine set to `spark` (or whichever engine fits the data scale)
[ ] Contracts on source nodes (freshness, row_count)
[ ] Validation tests on transform outputs
[ ] Quality gates configured
[ ] Alerting enabled
[ ] Retry configured
[ ] Stories enabled for audit trail
[ ] Environment variables for secrets