Odibi Data Patterns

This directory contains documentation for common data pipeline patterns used in Odibi. Each pattern solves a specific problem and includes step-by-step examples.

Patterns

1. Append-Only Raw Layer

Problem: How do I safely ingest data without losing audit trails?

Pattern: All data from sources is appended to the Raw layer without modification. Raw is immutable and append-only.

When to use: Always, for all source ingestion. Raw is your safety net.
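
For intuition, here is a minimal PySpark sketch of the kind of write this pattern implies (not Odibi's implementation). The JDBC source, the _ingested_at audit column, and the storage path are illustrative assumptions; a Delta-format Raw table is assumed.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical source read; any connector works the same way for this pattern.
source_df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:sqlserver://example-host;databaseName=sales")
    .option("dbtable", "dbo.orders")
    .load()
)

# Append without modification: Raw keeps every batch exactly as it arrived.
(
    source_df
    .withColumn("_ingested_at", F.current_timestamp())  # audit column; rows are never updated
    .write.format("delta")
    .mode("append")                                      # append-only: no overwrite, no delete
    .save("abfss://raw@exampleaccount.dfs.core.windows.net/sales/orders")  # hypothetical path
)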


2. High Water Mark (Smart Read)

Problem: My source table has millions of rows. How do I efficiently read only new/changed data?

Pattern: Use the incremental configuration (Smart Read) to automatically filter the source query. Odibi manages the state for you: First Run = Full Load, Subsequent Runs = Incremental Load.

When to use: When your source has timestamps (created_at, updated_at) and you want incremental reads. Essential for daily incremental loads.

See also: the Manual HWM Guide, which explains the underlying SQL pattern.
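
As a sketch of the manual pattern that Smart Read automates, the logic looks roughly like this in PySpark. The checkpoint location, the high_water_mark column, and the updated_at filter column are assumptions for illustration.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# 1. Look up the mark persisted by the previous run (None when the checkpoint is empty).
last_hwm = (
    spark.read.format("delta")
    .load("abfss://meta@exampleaccount.dfs.core.windows.net/checkpoints/orders")  # hypothetical
    .agg(F.max("high_water_mark").alias("hwm"))
    .collect()[0]["hwm"]
)

# 2. First run = full load; subsequent runs only read rows newer than the stored mark.
query = "SELECT * FROM dbo.orders"
if last_hwm is not None:
    query += f" WHERE updated_at > '{last_hwm}'"   # parameterize properly in real code

incremental_df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:sqlserver://example-host;databaseName=sales")  # hypothetical source
    .option("query", query)
    .load()
)

# 3. After a successful write, the new MAX(updated_at) is persisted as the next mark.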


3. Merge/Upsert (Silver Layer)

Problem: How do I deduplicate and keep the latest version of each record?

Pattern: Use Delta Lake's MERGE operation (or Merge Transformer) to upsert records by key, with audit columns tracking created/updated timestamps.

When to use: Refining Raw → Silver. Always use for stateful transformations.
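
The underlying mechanics, sketched with the delta-spark Python API; the table paths, the order_id key, and the audit column are illustrative assumptions.

from delta.tables import DeltaTable
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

updates = spark.read.format("delta").load("/raw/orders")       # hypothetical Raw micro-batch

# Deduplicate within the batch: keep only the latest version of each key.
latest = (
    updates
    .withColumn("_rn", F.row_number().over(
        Window.partitionBy("order_id").orderBy(F.col("updated_at").desc())))
    .filter("_rn = 1")
    .drop("_rn")
    .withColumn("_updated_at", F.current_timestamp())           # audit column
)

silver = DeltaTable.forPath(spark, "/silver/orders")             # hypothetical Silver table

# Upsert by key: existing keys are updated, new keys are inserted.
(
    silver.alias("t")
    .merge(latest.alias("s"), "t.order_id = s.order_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)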


4. SCD Type 2 (History Tracking)

Problem: "I need to know what the address was last month, not just now."

Pattern: Track full history. Old records are closed (valid_to set), new records are opened (valid_to NULL). Preserves point-in-time accuracy.

When to use: Slowly Changing Dimensions (Customer address, Product category).
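
A minimal two-step sketch of the close-then-open logic, assuming a Delta dimension table; the customer_id key, the address tracked column, and the paths are illustrative, and the insert of brand-new keys is omitted for brevity.

from delta.tables import DeltaTable
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

incoming = spark.read.format("delta").load("/raw/customers")   # hypothetical micro-batch
dim = DeltaTable.forPath(spark, "/silver/dim_customer")        # hypothetical SCD2 table

# Rows whose tracked attributes changed compared to the currently open version.
current = dim.toDF().filter("valid_to IS NULL")
changed = (
    incoming.alias("s")
    .join(current.alias("c"), "customer_id")
    .filter("s.address <> c.address")                          # compare tracked columns
    .select("s.*")
).cache()
changed.count()   # materialize before the merge below modifies the dimension table

# 1. Close the old version: stamp valid_to on the currently open row.
(
    dim.alias("t")
    .merge(changed.alias("s"), "t.customer_id = s.customer_id AND t.valid_to IS NULL")
    .whenMatchedUpdate(set={"valid_to": F.current_timestamp()})
    .execute()
)

# 2. Open a new version with valid_to = NULL, preserving point-in-time history.
(
    changed
    .withColumn("valid_from", F.current_timestamp())
    .withColumn("valid_to", F.lit(None).cast("timestamp"))
    .write.format("delta").mode("append").save("/silver/dim_customer")
)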


5. Windowed Reprocess (Gold Layer Aggregates)

Problem: Late-arriving data can break my aggregates. How do I fix them without double-counting?

Pattern: Instead of patching aggregates with updates, recalculate the entire time window and overwrite that partition.

When to use: Building Gold-layer aggregates (KPIs, star schemas). Ensures idempotency and correctness.
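
A sketch of the recalculate-and-overwrite step, assuming a Delta Gold table partitioned by order_date; the window bounds, measure, and paths are illustrative assumptions.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

window_start, window_end = "2024-06-01", "2024-06-07"

# Recompute the whole window from Silver, which now includes any late-arriving rows.
daily_revenue = (
    spark.read.format("delta").load("/silver/orders")             # hypothetical Silver table
    .filter(F.col("order_date").between(window_start, window_end))
    .groupBy("order_date")
    .agg(F.sum("line_total").alias("total_revenue"))
)

# Overwrite only the affected partitions; reruns are idempotent and never double-count.
(
    daily_revenue.write.format("delta")
    .mode("overwrite")
    .option("replaceWhere",
            f"order_date >= '{window_start}' AND order_date <= '{window_end}'")
    .save("/gold/daily_revenue")
)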


6. Skip If Unchanged (Snapshot Optimization)

Problem: My hourly pipeline appends identical data 24 times/day when the source hasn't changed.

Pattern: Compute a hash of the DataFrame content before writing. If hash matches previous write, skip the append entirely.

When to use: Snapshot tables without timestamps, reference data that changes infrequently, or when change frequency is unknown.
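
A sketch of the content-hash check, assuming a Delta snapshot source; the hashing approach and the way the previous hash is stored are assumptions, and the state lookup is stubbed out.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

snapshot_df = spark.read.format("delta").load("/staging/reference_rates")  # hypothetical snapshot

# Order-independent fingerprint: hash each row, then aggregate the row hashes.
hashed = snapshot_df.withColumn("_row_hash", F.xxhash64(*snapshot_df.columns))
content_hash = hashed.agg(F.sum("_row_hash")).collect()[0][0]

def read_previous_hash():
    # Placeholder: in practice, read the hash recorded by the last successful write
    # from a small state/checkpoint table. Returning None forces a write.
    return None

if content_hash == read_previous_hash():
    print("Snapshot unchanged; skipping append.")
else:
    snapshot_df.write.format("delta").mode("append").save("/raw/reference_rates")
    # ...then persist content_hash so the next run can compare against it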


Dimensional Modeling Patterns

These patterns are designed for building star schemas and data warehouses. Use them by setting type: <pattern_name> under pattern: in your node config, as shown in the examples below.

7. Dimension Pattern

Problem: How do I build dimension tables with surrogate keys and SCD support?

Pattern: Use pattern: type: dimension to auto-generate surrogate keys and handle SCD Type 0/1/2 with optional unknown member rows.

When to use: Building any dimension table (dim_customer, dim_product, etc.).

pattern:
  type: dimension
  params:
    natural_key: customer_id
    surrogate_key: customer_sk
    scd_type: 2
    track_cols: [name, email, address]

8. Date Dimension Pattern

Problem: How do I generate a complete date dimension with fiscal calendars?

Pattern: Use pattern: type: date_dimension to generate dates with 19 pre-calculated columns including fiscal year/quarter.

When to use: Every data warehouse needs a date dimension. Generate it once with a wide range (e.g., 2015-2035).

pattern:
  type: date_dimension
  params:
    start_date: "2020-01-01"
    end_date: "2030-12-31"
    fiscal_year_start_month: 7
    unknown_member: true
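
For intuition, a date dimension like this can be generated directly in Spark; the sketch below shows the idea with only a handful of the columns, and the fiscal-year naming convention (labelled by its ending year) is an assumption.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

dim_date = (
    spark.sql("SELECT explode(sequence(to_date('2020-01-01'), to_date('2030-12-31'))) AS date")
    .withColumn("date_sk", F.date_format("date", "yyyyMMdd").cast("int"))
    .withColumn("year", F.year("date"))
    .withColumn("month", F.month("date"))
    .withColumn("day_of_week", F.dayofweek("date"))
    # Fiscal year starting in July (fiscal_year_start_month: 7), named by its ending year.
    .withColumn("fiscal_year", F.expr("year(add_months(date, 6))"))
)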

9. Fact Pattern

Problem: How do I build fact tables with automatic surrogate key lookups?

Pattern: Use pattern: type: fact to join source data to dimensions, retrieve SKs, handle orphans, and validate grain.

When to use: Building any fact table that references dimensions.

pattern:
  type: fact
  params:
    grain: [order_id, line_item_id]
    dimensions:
      - source_column: customer_id
        dimension_table: dim_customer
        dimension_key: customer_id
        surrogate_key: customer_sk
    orphan_handling: unknown

10. Aggregation Pattern

Problem: How do I build aggregate tables with declarative GROUP BY and incremental refresh?

Pattern: Use pattern: type: aggregation with grain (GROUP BY) and measure expressions.

When to use: Building aggregate/summary tables, KPI tables, or materializing metrics.

pattern:
  type: aggregation
  params:
    grain: [date_sk, product_sk]
    measures:
      - name: total_revenue
        expr: "SUM(line_total)"
      - name: order_count
        expr: "COUNT(*)"

Design Principles

These patterns are built on the Odibi Architecture Manifesto:

  1. Robots Remember, Humans Forget → Use checkpoint bookkeeping, not manual state tracking
  2. Raw is Sacred → Append-only, immutable history. Never destroy original data.
  3. Rebuild the Bucket, Don't Patch the Hole → Reprocess entire time windows, don't patch aggregates
  4. SQL is for Humans, ADLS is for Robots → ADLS stores everything; SQL serves BI
  5. No Duplication → Test against production data; don't duplicate datasets

Quick Reference

| Pattern | Input | Output | Write Mode | Idempotent? |
| --- | --- | --- | --- | --- |
| Append-Only Raw | Source | Raw | append | Yes (duplicates OK) |
| High Water Mark | Source + Timestamp | Raw | append | Yes (filtered by timestamp) |
| Smart Read | Source + Timestamp | Raw | append | Yes (auto-managed) |
| Merge/Upsert | Raw (micro-batch) | Silver | merge | Yes (by key) |
| SCD Type 2 | Raw (micro-batch) | Silver/Gold | N/A (self-contained) | Yes (full history) |
| Windowed Reprocess | Silver (window) | Gold | overwrite (partition) | Yes (recalculated) |
| Skip If Unchanged | Snapshot Source | Raw | append (conditional) | Yes (hash-based) |
| Dimension | Staging | Gold (dim_*) | overwrite | Yes (SK-based) |
| Date Dimension | Generated | Gold (dim_date) | overwrite | Yes (no input) |
| Fact | Staging + Dims | Gold (fact_*) | overwrite | Yes (grain-based) |
| Aggregation | Fact | Gold (agg_*) | overwrite | Yes (grain-based) |


See It All Together

THE_REFERENCE.md — A complete star schema that combines dimension, date_dimension, and fact patterns in one runnable pipeline. This is the canonical example showing how all the patterns work together.


Further Reading

  • Databricks: "Incremental Processing" documentation
  • Book: Fundamentals of Data Engineering by Joe Reis & Matt Housley