Decision Guide
Rules of thumb for common Odibi decisions.
Engine Choice
| Scenario |
Engine |
Why |
| Local dev, files < 1GB |
pandas |
Fast startup, no dependencies |
| Local dev, files 1-10GB |
polars |
Faster than Pandas, lazy eval |
| Production, Delta Lake, > 10GB |
spark |
Distributed, Delta support |
| Databricks |
spark |
Native integration |
Rule: Start with pandas locally, switch to spark for production.
Validation: Contracts vs Tests
| Use... |
When... |
Behavior |
contracts: |
Checking source data |
Runs before read, fail-fast |
validation.tests: |
Checking output data |
Runs after transform, configurable |
Rule: Use contracts for freshness/schema/volume. Use tests for row-level quality.
# Contracts: Source quality (fail-fast)
contracts:
- type: freshness
column: updated_at
max_age: "24h"
# Tests: Output quality (with gates)
validation:
tests:
- type: not_null
columns: [id, name]
gate:
on_fail: warn_and_write # or 'abort'
When to Use Quality Gates
| Scenario |
Gate Setting |
| Development/testing |
on_fail: warn_and_write |
| Production, non-critical |
on_fail: warn_and_write + alerting |
| Production, critical data |
on_fail: abort |
| Regulatory/compliance |
on_fail: abort + quarantine |
Rule: Start with warn_and_write, tighten to abort as you trust the data.
When to Enable Alerting
| Scenario |
Alert |
| Local dev |
No alerts |
| Scheduled production jobs |
on_failure |
| Critical SLA pipelines |
on_failure + on_gate_block |
| All runs (audit trail) |
on_success + on_failure |
Rule: Enable alerting when someone needs to act on failure.
alerts:
- type: slack
url: ${SLACK_WEBHOOK}
on_events: [on_failure, on_gate_block]
Bronze vs Silver vs Gold Logic
| Logic Type |
Layer |
Why |
| Ingestion, format conversion |
Bronze |
Raw preservation |
| Deduplication, cleaning |
Silver |
Single source of truth |
| Business transforms, aggregation |
Gold |
Consumption-ready |
Rule: If it changes the semantic meaning, it belongs in Gold.
Incremental Mode Selection
Has reliable timestamp column?
├─► Yes
│ └─► Need exact row tracking? → mode: stateful
│ └─► OK with overlap? → mode: rolling_window
└─► No
└─► Data is immutable? → mode: append
└─► Data can change? → skip_if_unchanged: true
| Mode |
State |
Use Case |
stateful |
Persisted HWM |
CDC, database extraction |
rolling_window |
Lookback period |
Event logs, files with dates |
append |
None |
Immutable streams |
SCD Type Selection
| Need history? |
Changes often? |
Recommendation |
| No |
- |
SCD Type 1 (overwrite) |
| Yes |
Slowly (< 1/day) |
SCD Type 2 (versioned) |
| Yes |
Frequently |
Daily snapshots instead |
Rule: SCD2 is for slowly changing dimensions. Fast changes = snapshot approach.
Merge vs Overwrite Write Mode
| Scenario |
Mode |
Why |
| First load / full refresh |
overwrite |
Clean slate |
| Incremental updates |
merge |
Preserve existing, update keys |
| Append-only events |
append |
Immutable, no updates |
| SCD2 result |
N/A (self-contained) |
Transformer writes directly to target — no write: block needed |
| Aggregations |
overwrite |
Idempotent recalculation |
Rule: When in doubt, use overwrite for transforms and append for raw.
Retry Configuration
| Scenario |
Retry Config |
| Transient network issues |
max_attempts: 3 |
| Database locks |
backoff: exponential |
| Stable local files |
enabled: false |
retry:
enabled: true
max_attempts: 3
backoff: exponential
When to Use Quarantine
| Scenario |
Quarantine? |
| Dev/testing |
No (just fail or warn) |
| Data with expected bad rows |
Yes (review later) |
| Strict quality required |
Yes (fail + quarantine for audit) |
| High-volume, low-quality tolerance |
Yes (separate good from bad) |
Rule: Quarantine when you can't afford data loss but also can't accept bad data.
| Format |
When to Use |
csv |
Source files, human-readable |
parquet |
Local analytics, single-machine |
delta |
Production, ACID, time travel |
json |
API responses, nested data |
Rule: Use Delta for anything that needs reliability or history.
Quick Checklist: Production Ready?
[ ] Engine set to 'spark' (or appropriate for scale)
[ ] Contracts on source nodes (freshness, row_count)
[ ] Validation tests on transform outputs
[ ] Quality gates configured
[ ] Alerting enabled
[ ] Retry configured
[ ] Stories enabled for audit trail
[ ] Environment variables for secrets
See Also