SCD Type 2 (Slowly Changing Dimensions)

The SCD Type 2 pattern allows you to track the full history of changes for a record over time. Unlike a simple update (which overwrites the old value), SCD2 keeps the old version and adds a new version, managing effective dates for you.

The "Time Machine" Concept

Business Problem: "I need to know what the customer's address was last month, not just where they live now."

The Solution: Each record has an "effective window" (valid_from to valid_to) and a flag (is_current) indicating if it is the latest version.

Visual Example

Input (Source Update): Customer 101 moved to NY on Feb 1st.

| customer_id | address | tier | txn_date   |
|-------------|---------|------|------------|
| 101         | NY      | Gold | 2024-02-01 |

Target Table (Before): Customer 101 lived in CA since Jan 1st.

| customer_id | address | tier | valid_from | valid_to | is_current |
|-------------|---------|------|------------|----------|------------|
| 101         | CA      | Gold | 2024-01-01 | NULL     | true       |

Target Table (After SCD2): Old record CLOSED (valid_to set). New record OPEN (is_current=true).

| customer_id | address | tier | valid_from | valid_to   | is_current |
|-------------|---------|------|------------|------------|------------|
| 101         | CA      | Gold | 2024-01-01 | 2024-02-01 | false      |
| 101         | NY      | Gold | 2024-02-01 | NULL       | true       |

Note: The source column txn_date is automatically copied to valid_from in the target, giving each version a complete time window [valid_from, valid_to). The original source column is preserved.
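With this layout, the "time machine" question becomes a simple range filter. A sketch, assuming the target table is named silver.dim_customers (as in the configuration examples) and the window is half-open [valid_from, valid_to):

```sql
-- What was customer 101's address as of 2024-01-15?
SELECT customer_id, address, tier
FROM silver.dim_customers
WHERE customer_id = 101
  AND valid_from <= DATE '2024-01-15'
  AND (valid_to > DATE '2024-01-15' OR valid_to IS NULL);
```

The `valid_to IS NULL` branch covers the open (current) version, so the same query works for past and present dates.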


Configuration

Two Ways to Build SCD2 Dimensions

Option A: Raw scd2 Transformer (shown below)

  • Lower-level transformer for SCD2 logic only
  • You manage surrogate keys, audit columns, and unknown members separately

Option B: Dimension Pattern (recommended)

  • Higher-level pattern that includes SCD2 + surrogate keys + audit + unknown member
  • More comprehensive for building production dimensions
  • See the sections on Audit Columns, Unknown Member, and Grain Validation in this doc

This section covers the raw scd2 transformer configuration.

On Spark with Delta tables, SCD2 uses an optimized Delta MERGE by default. The MERGE only touches changed rows instead of rewriting the entire table, making it dramatically faster for large tables with small change sets.

nodes:
  - name: "dim_customers"
    # ... (read from source) ...

    transformer: "scd2"
    params:
      target: "silver.dim_customers"   # Delta table name
      keys: ["customer_id"]            # Unique ID
      track_cols: ["address", "tier"]  # Changes here trigger a new version
      effective_time_col: "txn_date"   # When the change happened
      # use_delta_merge: true          # Default - no write block needed!

No write: block needed! SCD2 writes directly to the target table on all engines. This includes the first run: the transformer creates the table automatically.

Option 2: Using Connection + Path (ADLS)

nodes:
  - name: "dim_customers"
    # ... (read from source) ...

    transformer: "scd2"
    params:
      connection: adls_prod            # READ existing history from here
      path: sales/silver/dim_customers
      keys: ["customer_id"]
      track_cols: ["address", "tier"]
      effective_time_col: "txn_date"
      register_table: "silver.dim_customers"  # Optional: register in Unity Catalog

Option 3: Legacy Full Overwrite (opt-out of Delta MERGE)

Set use_delta_merge: false to use the legacy full-overwrite approach on Spark. Still self-contained: no write: block needed.

nodes:
  - name: "dim_customers"
    # ... (read from source) ...

    transformer: "scd2"
    params:
      target: "silver.dim_customers"
      keys: ["customer_id"]
      track_cols: ["address", "tier"]
      effective_time_col: "txn_date"
      use_delta_merge: false           # Uses full read-join-overwrite instead

Full Configuration

transformer: "scd2"
params:
  target: "silver.dim_customers"       # OR use connection + path
  keys: ["customer_id"]
  track_cols: ["address", "tier", "email"]

  # Source column indicating when the change happened
  effective_time_col: "updated_at"

  # Target columns (optional - defaults shown)
  start_time_col: "valid_from"         # effective_time_col copied to this
  end_time_col: "valid_to"
  current_flag_col: "is_current"

  # Optional: soft delete support
  delete_col: "is_deleted"             # If source has soft delete flag

  # Performance optimization (default: true)
  # Uses Delta MERGE on Spark - only updates changed rows
  # Falls back to full overwrite if target is not Delta
  use_delta_merge: true

  # Optional: register as table in Unity Catalog/metastore (Spark only)
  # Useful with connection + path to make data queryable via SQL
  register_table: "silver.dim_customers"

Customizing Column Names

If your organization uses different naming conventions, you can customize the SCD2 metadata column names:

# Example: Using company-specific naming standards
transformer: "scd2"
params:
  target: "silver.dim_customers"
  keys: ["customer_id"]
  track_cols: ["address", "tier"]
  effective_time_col: "last_modified_date"

  # Custom column names
  end_time_col: "effective_end_date"      # Instead of valid_to
  current_flag_col: "active_flag"         # Instead of is_current

Result columns:

  • effective_end_date instead of valid_to
  • active_flag instead of is_current


Complete Parameter Reference

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| target | string | Conditional¹ | - | Target table name or full path |
| connection | string | Conditional¹ | - | Connection name to resolve path |
| path | string | Conditional¹ | - | Relative path within connection |
| keys | list[string] | Yes | - | Natural keys to identify unique entities |
| track_cols | list[string] | Yes | - | Columns to monitor for changes |
| effective_time_col | string | Yes | - | Source column indicating when the change occurred |
| start_time_col | string | No | "valid_from" | Name of the start timestamp column (effective_time_col copied to this) |
| end_time_col | string | No | "valid_to" | Name of the end timestamp column |
| current_flag_col | string | No | "is_current" | Name of the current record flag column |
| delete_col | string | No | None | Column indicating soft deletion (boolean) |
| use_delta_merge | bool | No | true | Use Delta MERGE on Spark (falls back to full overwrite if not Delta) |
| register_table | string | No | None | Register as Unity Catalog/metastore table after write (Spark only) |
| vacuum_hours | int | No | None | Retention hours for VACUUM after the SCD2 write (Spark only); e.g. 168 for 7 days. None disables VACUUM. |

Notes:

  1. Target specification: provide either target OR both connection + path. Providing both raises a validation error; providing neither also raises a validation error.

Additional Validation Rules:

  • keys and track_cols must exist in the source data
  • effective_time_col must exist in the source data


How It Works

The scd2 transformer performs a complex set of operations automatically:

  1. Match: Finds existing records in the target table using keys.
  2. Compare: Checks track_cols to see if any data has changed.
  3. Close: If a record changed, it updates the old record's valid_to to the new record's effective time, and sets is_current = false.
  4. Insert: It adds the new record with valid_from (copied from effective_time_col), valid_to = NULL, and is_current = true.
  5. Preserve: It keeps all unchanged history records as they are.
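The five steps above can be sketched in plain Python. This is illustrative only: the real transformer operates on DataFrames and also handles soft deletes, custom column names, and engine-specific writes.

```python
# Minimal sketch of the five SCD2 steps in pure Python (illustrative only).

def scd2_apply(target, source, keys, track_cols, effective_time_col):
    """Apply one batch of source rows to an existing version history."""
    by_key = {tuple(r[k] for k in keys): r for r in source}
    result = []

    for row in target:                       # 1. Match existing records by key
        src = by_key.get(tuple(row[k] for k in keys))
        changed = (src is not None and row["is_current"]
                   and any(row[c] != src[c] for c in track_cols))  # 2. Compare
        if changed:                          # 3. Close the old version
            result.append(dict(row, valid_to=src[effective_time_col],
                               is_current=False))
        else:
            result.append(row)               # 5. Preserve unchanged history

    for key, src in by_key.items():          # 4. Insert new/changed versions
        current = [r for r in target
                   if tuple(r[k] for k in keys) == key and r["is_current"]]
        if not current or any(current[0][c] != src[c] for c in track_cols):
            # The source time column stays; its value is copied to valid_from
            result.append(dict(src, valid_from=src[effective_time_col],
                               valid_to=None, is_current=True))
    return result


target = [{"customer_id": 101, "address": "CA", "tier": "Gold",
           "valid_from": "2024-01-01", "valid_to": None, "is_current": True}]
source = [{"customer_id": 101, "address": "NY", "tier": "Gold",
           "txn_date": "2024-02-01"}]
history = scd2_apply(target, source, keys=["customer_id"],
                     track_cols=["address", "tier"],
                     effective_time_col="txn_date")
# The CA version is closed at 2024-02-01; a new open NY version is added.
```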

Delta MERGE Optimization (Spark)

When use_delta_merge: true (default) and the target is a Delta table, SCD2 uses DeltaTable.merge() instead of the full read-join-overwrite approach:

  • whenMatchedUpdate: Only updates end_time_col and current_flag_col on changed current records (partial column update, not full row rewrite)
  • whenNotMatchedInsert: Inserts new records with SCD metadata

This means only the Parquet files containing affected rows are rewritten by Delta, rather than the entire table. For a 10M row table with 1K changes, this is orders of magnitude faster.

If the target is not a Delta table, or if delta-spark is not available, the transformer automatically falls back to the legacy full-overwrite approach.
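Conceptually, the MERGE looks something like the SQL below. This is a sketch of the semantics using the example's column names, not the exact statement the transformer generates (it uses DeltaTable.merge() via the Python API, and stages the source so that a changed key produces both the close-out update and a new insert; staged_changes is a hypothetical view name):

```sql
MERGE INTO silver.dim_customers AS t
USING staged_changes AS s
  ON t.customer_id = s.customer_id AND t.is_current = true
WHEN MATCHED AND (t.address <> s.address OR t.tier <> s.tier) THEN
  UPDATE SET valid_to = s.txn_date, is_current = false
WHEN NOT MATCHED THEN
  INSERT (customer_id, address, tier, valid_from, valid_to, is_current)
  VALUES (s.customer_id, s.address, s.tier, s.txn_date, NULL, true);
```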

Important Notes

  • Self-contained: SCD2 writes directly to the target on all engines and code paths; no write: block is needed.
  • First Run: If the target table doesn't exist, the transformer creates it automatically with the initial data.
  • Engine Support: Works on Spark (Delta MERGE or full overwrite) and Pandas (Parquet/CSV with DuckDB optimization or pure Pandas fallback).

When to Use

  • Dimension Tables: Customer dimensions, Product dimensions where attributes change slowly over time.
  • Audit Trails: When you need exact historical state reconstruction.

When NOT to Use

  • Fact Tables: Events (Transactions, Logs) are immutable; they don't change state, they just occur. Use append instead.
  • Rapidly Changing Data: If a record changes 100 times a day, SCD2 will explode your storage size. Use a snapshot or aggregate approach instead.

Audit Columns

SCD2 dimensions work seamlessly with audit columns to track data lineage and load metadata. When building SCD2 dimensions via the Dimension Pattern, you can automatically add:

  • load_timestamp: When the record was loaded into the data warehouse
  • source_system: Source system identifier (e.g., "crm", "erp", "pos")

Configuration

Use the audit parameter in the Dimension Pattern (not the raw scd2 transformer):

nodes:
  - name: "dim_customers"
    read:
      connection: staging
      path: customers

    pattern:
      type: dimension
      params:
        natural_key: customer_id
        surrogate_key: customer_sk
        scd_type: 2
        track_cols: ["address", "tier", "email"]
        target: "warehouse.dim_customers"
        audit:
          load_timestamp: true       # Adds load_timestamp column
          source_system: "crm"       # Adds source_system='crm' column

    write:
      connection: warehouse
      path: dim_customers
      format: delta
      mode: overwrite

Result

Your SCD2 dimension will include these columns:

| customer_sk | customer_id | address | tier | valid_from | valid_to   | is_current | load_timestamp   | source_system |
|-------------|-------------|---------|------|------------|------------|------------|------------------|---------------|
| 1           | 101         | CA      | Gold | 2024-01-01 | 2024-02-01 | false      | 2024-01-01 10:00 | crm           |
| 2           | 101         | NY      | Gold | 2024-02-01 | NULL       | true       | 2024-02-01 14:30 | crm           |

Use Cases:

  • Data lineage: Track when each version was loaded
  • Source identification: Know which system the data came from
  • Audit trails: Reconstruct the history of data loading operations


Unknown Member Row (SK=0)

When building star schema fact tables, you often need to handle orphan records: fact records that reference dimensions that don't exist yet (or never will).

The Problem

# Fact table has customer_id=999, but dim_customers doesn't have it yet
fact_orders:
  order_id: 12345
  customer_id: 999  # <- Orphan! Not in dim_customers
  amount: 100.00

The Solution

Add an unknown member row to your dimension with surrogate key = 0. When fact tables encounter orphans, they can safely reference SK=0 instead of failing or creating NULL foreign keys.

Configuration

Enable unknown_member: true in the Dimension Pattern:

nodes:
  - name: "dim_customers"
    pattern:
      type: dimension
      params:
        natural_key: customer_id
        surrogate_key: customer_sk
        scd_type: 2
        track_cols: ["address", "tier"]
        target: "warehouse.dim_customers"
        unknown_member: true        # <- Adds SK=0 row
        audit:
          load_timestamp: true
          source_system: "crm"

    write:
      connection: warehouse
      path: dim_customers
      format: delta
      mode: overwrite

Result

The dimension automatically includes an unknown member row:

| customer_sk | customer_id | address | tier    | valid_from | valid_to   | is_current | load_timestamp   | source_system |
|-------------|-------------|---------|---------|------------|------------|------------|------------------|---------------|
| 0           | -1          | Unknown | Unknown | 1900-01-01 | NULL       | true       | current_time     | crm           |
| 1           | 101         | CA      | Gold    | 2024-01-01 | 2024-02-01 | false      | 2024-01-01 10:00 | crm           |
| 2           | 101         | NY      | Gold    | 2024-02-01 | NULL       | true       | 2024-02-01 14:30 | crm           |

Facts can now safely reference SK=0 when the dimension doesn't exist:

fact_orders:
  order_id: 12345
  customer_sk: 0      # <- Maps to unknown member
  amount: 100.00
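The orphan-to-SK=0 mapping is essentially a dimension lookup with a fallback. A minimal sketch in plain Python (illustrative; the Fact Pattern performs this on DataFrames via joins):

```python
# Resolve natural keys to surrogate keys, falling back to the
# unknown member (SK=0) for orphans.
UNKNOWN_SK = 0
dim_customers = {101: 1, 102: 2}   # natural key -> surrogate key (current rows)

fact_rows = [
    {"order_id": 12345, "customer_id": 999, "amount": 100.00},  # orphan
    {"order_id": 12346, "customer_id": 101, "amount": 55.00},
]

resolved = [
    {"order_id": r["order_id"],
     "customer_sk": dim_customers.get(r["customer_id"], UNKNOWN_SK),
     "amount": r["amount"]}
    for r in fact_rows
]
# The orphan (customer_id=999) resolves to customer_sk=0.
```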

Integration with Fact Pattern

The Fact Pattern supports three orphan handling strategies:

  1. unknown (default): Map orphans to SK=0
  2. reject: Fail the pipeline if orphans exist
  3. quarantine: Write orphans to a quarantine table for investigation

nodes:
  - name: "fact_orders"
    depends_on: [dim_customers, dim_products]

    pattern:
      type: fact
      params:
        grain: [order_id]
        dimensions:
          - source_column: customer_id
            dimension_table: dim_customers
            dimension_key: customer_id
            surrogate_key: customer_sk
            scd2: true                    # Filter to is_current=true
        orphan_handling: unknown          # <- Uses SK=0 for orphans

See Fact Pattern - Orphan Handling for complete details.


Grain Validation

SCD2 dimensions have a specific grain (level of uniqueness) that you must maintain:

Grain Definition: (natural_key, is_current=true) must be unique.

What This Means

For each natural key, only one record can have is_current=true at any time. Historical records (closed versions) have is_current=false.

Valid SCD2 State:

| customer_sk | customer_id | address | valid_from | valid_to   | is_current |
|-------------|-------------|---------|------------|------------|------------|
| 1           | 101         | CA      | 2024-01-01 | 2024-02-01 | false      |
| 2           | 101         | NY      | 2024-02-01 | NULL       | true       |

Note: Only one record per customer_id has is_current=true.

INVALID State (Duplicate Grain):

| customer_sk | customer_id | address | valid_from | valid_to | is_current |
|-------------|-------------|---------|------------|----------|------------|
| 1           | 101         | CA      | 2024-01-01 | NULL     | true       |
| 2           | 101         | NY      | 2024-02-01 | NULL     | true       |

Problem: Two records for customer_id=101 both have is_current=true (grain violation).

How Odibi Prevents This

The scd2 transformer automatically maintains grain uniqueness by:

  1. Identifying changed records (comparing track_cols)
  2. Closing old versions (setting is_current=false, valid_to=<new_effective_time>)
  3. Opening new versions (setting is_current=true, valid_to=NULL)

Validating Grain in Fact Tables

When using the Fact Pattern, define the fact table grain to detect duplicates:

nodes:
  - name: "fact_orders"
    pattern:
      type: fact
      params:
        grain: [order_id, line_item_id]  # <- Validates uniqueness
        dimensions:
          - source_column: customer_id
            dimension_table: dim_customers
            dimension_key: customer_id
            surrogate_key: customer_sk
            scd2: true

The Fact Pattern will raise an error if duplicates exist at the grain level.

Manual Validation Query

If you need to check for grain violations in an existing SCD2 dimension:

-- Find natural keys with multiple current records (grain violation)
SELECT 
  customer_id,
  COUNT(*) as current_count
FROM dim_customers
WHERE is_current = true
GROUP BY customer_id
HAVING COUNT(*) > 1
ORDER BY current_count DESC;

Expected result: 0 rows (no violations).


Incremental Loading and Merge Pattern

The scd2 transformer is NOT an incremental merge pattern. It is designed for building complete dimension history from source data.

Key Differences

| Feature | SCD2 Transformer | Merge Pattern |
|---|---|---|
| Purpose | Track dimension history | Incrementally merge raw → silver |
| Write Mode | overwrite (returns full history)¹ | append or merge |
| Use Case | Dimension tables with history | Fact tables, raw data ingestion |
| Output | Complete historical dataset | New/changed records only |
| Idempotency | Re-runs rebuild full history | Merge by key prevents duplicates |

Notes:

  1. overwrite mode is the typical pattern. The SCD2 transformer returns the complete history (all versions), which you then write back to replace the target. Advanced users can implement incremental SCD2 using Delta merge operations, but this is not the default behavior of the scd2 transformer.

When to Use Each

Use SCD2 Transformer:

  • Dimension tables where you need full history
  • Customer, product, employee dimensions
  • Tracking attribute changes over time

Use Merge Pattern:

  • Incremental fact table loading
  • Deduplicating raw data into silver
  • Stateless transformations (no history tracking)

Example: Building dimensions with SCD2, facts with merge:

pipelines:
  # 1. Build SCD2 dimension
  - pipeline: build_dimensions
    nodes:
      - name: dim_customers
        read:
          connection: staging
          path: customers

        pattern:
          type: dimension
          params:
            natural_key: customer_id
            surrogate_key: customer_sk
            scd_type: 2
            track_cols: [name, address, tier]
            target: warehouse.dim_customers
            unknown_member: true

        write:
          connection: warehouse
          path: dim_customers
          mode: overwrite          # <- Full history

  # 2. Merge fact table incrementally
  - pipeline: build_facts
    nodes:
      - name: fact_orders
        read:
          connection: raw
          path: orders_incremental  # Only new/changed orders

        transformer: merge
        params:
          target: warehouse.fact_orders
          keys: [order_id]
          mode: insert              # Or upsert

        write:
          connection: warehouse
          path: fact_orders
          mode: append              # <- Incremental

See Merge/Upsert Pattern for complete details on incremental loading strategies.


Common Errors and Debugging

This section covers the most common SCD2 errors and how to fix them.

Error: effective_time_col Not Found

Error Message:

KeyError: 'updated_at'
# or
AnalysisException: Column 'updated_at' does not exist

What It Means (Plain English): The effective_time_col you specified doesn't exist in your source DataFrame.

Why It Happens:

  • The column name is misspelled
  • The column was renamed or dropped upstream
  • Case sensitivity mismatch (Updated_At vs updated_at)

Step-by-Step Fix:

  1. Check your source data columns:

    # Add this before the SCD2 node to debug
    df = spark.read.format("delta").load("bronze/customers")
    print(df.columns)
    # Output: ['customer_id', 'name', 'UpdatedAt', 'address']
    # Aha! It's 'UpdatedAt', not 'updated_at'
    

  2. Fix the YAML:

    # BEFORE (wrong)
    params:
      effective_time_col: "updated_at"  # ❌ Doesn't exist
    
    # AFTER (correct)
    params:
      effective_time_col: "UpdatedAt"  # ✅ Matches actual column
    

Important: The effective_time_col must exist in the SOURCE data, not the target. During SCD2 processing, its value is copied into start_time_col (valid_from) to build each version's history window.


Error: track_cols Column Mismatch

Error Message:

KeyError: 'email'
# or
Column 'Email' not found in schema

What It Means: One of the columns in track_cols doesn't exist in your source data, or there's a case mismatch.

Why It Happens:

  • Column names are case-sensitive
  • A column was renamed in the source system
  • You're tracking a column that doesn't exist yet

Step-by-Step Fix:

  1. List actual columns:

    df = spark.read.format("delta").load("bronze/customers")
    print(df.columns)
    # ['customer_id', 'Name', 'Email', 'Address']
    

  2. Match case exactly:

    # BEFORE (wrong - case doesn't match)
    params:
      track_cols: ["name", "email", "address"]
    
    # AFTER (correct - matches actual columns)
    params:
      track_cols: ["Name", "Email", "Address"]
    

💡 Pro Tip: Consider normalizing column names to lowercase in Bronze/Silver to avoid case issues:

transform:
  steps:
    - function: "rename_columns"
      params:
        lowercase: true


Error: Schema Evolution Issues

Error Message:

AnalysisException: A schema mismatch detected:
- Expected: customer_id: string, name: string, address: string, ...
- Actual:   customer_id: string, name: string, phone: string, ...

What It Means: Your target table (from a previous run) has a different schema than the new data.

Why It Happens:

  • Source added new columns (e.g., phone)
  • Source removed columns (e.g., dropped address)
  • Column types changed (e.g., int → string)

Step-by-Step Fix:

Option 1: Handle in Transform

# Add missing columns with defaults before SCD2
transform:
  steps:
    - function: "derive_columns"
      params:
        derivations:
          phone: "COALESCE(phone, 'unknown')"

Option 2: Full Schema Reset (Nuclear Option)

# Delete target table and rerun from scratch
# WARNING: Loses all history!
spark.sql("DROP TABLE IF EXISTS silver.dim_customers")


"Why Did My Row Count Explode?"

Symptom:

Before: dim_customers had 10,000 rows
After:  dim_customers has 50,000 rows

What's Happening: SCD2 is working correctly! Every time a tracked column changes, it creates a new version. If you ran it multiple times or have duplicates, you get multiple versions per record.

Common Causes:

  1. Duplicate source data:

    customer_id | name  | updated_at
    101         | Alice | 2024-01-01
    101         | Alice | 2024-01-01  <- Duplicate!
    101         | Alice | 2024-01-01  <- Another duplicate!
    
    Fix: Deduplicate before SCD2 (see Anti-Patterns)

  2. Running SCD2 on append-mode source: Each run sees ALL historical source data, creating versions for old changes again. Fix: Use incremental loading or filter source to only new records.

  3. Tracking too many columns:

    # Tracking every column = version explosion
    track_cols: ["*"]  # ❌ Don't do this!
    
    # Track only meaningful business changes
    track_cols: ["tier", "status", "region"]  # ✅ Selective
    
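The deduplication fix from cause 1 can be sketched as keeping only the latest source row per business key before running SCD2. A minimal pure-Python illustration, assuming an updated_at column holds ISO-format dates:

```python
# Keep only the most recent source row per business key before SCD2.
rows = [
    {"customer_id": 101, "name": "Alice", "updated_at": "2024-01-01"},
    {"customer_id": 101, "name": "Alice", "updated_at": "2024-01-01"},   # dup
    {"customer_id": 101, "name": "Alicia", "updated_at": "2024-03-01"},  # newer
]

latest = {}
for r in rows:
    key = r["customer_id"]
    # ISO-8601 date strings compare correctly as plain strings
    if key not in latest or r["updated_at"] > latest[key]["updated_at"]:
        latest[key] = r

deduped = list(latest.values())
# Only the 2024-03-01 version of customer 101 survives.
```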

Debugging Query:

-- Find customers with excessive versions
SELECT customer_id, COUNT(*) as version_count
FROM dim_customers
GROUP BY customer_id
HAVING COUNT(*) > 10
ORDER BY version_count DESC


Debugging Checklist

Before running your SCD2 pipeline, verify these items:

# ✅ DEBUGGING CHECKLIST
# Print this and check each box:

# [ ] 1. Source Data Check
#     - effective_time_col exists in source
#     - All track_cols exist in source
#     - Column names match case exactly

# [ ] 2. Key Column Check  
#     - keys columns exist in source
#     - keys columns have no NULLs
#     - keys columns uniquely identify records

# [ ] 3. Target Table Check
#     - Target exists (or this is first run)
#     - Target schema is compatible
#     - Target has end_time_col and current_flag_col

# [ ] 4. Deduplication Check
#     - Source has no duplicate keys
#     - If duplicates exist, deduplicate BEFORE SCD2

# [ ] 5. Write Mode Check
#     - SCD2 is self-contained - no write: block needed on any engine
#     - If you have a legacy write: block, remove it to avoid double-writes

Python Debugging Script:

# Add this before your SCD2 node to validate
def validate_scd2_input(df, config):
    """Validate data before SCD2 processing."""
    errors = []

    # Check effective_time_col exists
    if config['effective_time_col'] not in df.columns:
        errors.append(f"effective_time_col '{config['effective_time_col']}' not in columns: {df.columns}")

    # Check all track_cols exist
    for col in config['track_cols']:
        if col not in df.columns:
            errors.append(f"track_col '{col}' not in columns: {df.columns}")

    # Check for duplicate keys
    key_cols = config['keys']
    dup_count = df.groupBy(key_cols).count().filter("count > 1").count()
    if dup_count > 0:
        errors.append(f"Found {dup_count} duplicate keys! Deduplicate first.")

    # Check for NULL keys
    for key in key_cols:
        null_count = df.filter(df[key].isNull()).count()
        if null_count > 0:
            errors.append(f"Found {null_count} NULL values in key column '{key}'")

    if errors:
        for e in errors:
            print(f"❌ {e}")
        raise ValueError("SCD2 validation failed. Fix errors above.")
    else:
        print("✅ SCD2 input validation passed")


Quick Reference: SCD2 Error Cheat Sheet

| Error | Likely Cause | Quick Fix |
|---|---|---|
| effective_time_col not found | Column doesn't exist or wrong name | Check source columns, fix spelling/case |
| track_col X not found | Column name mismatch | Match exact column names including case |
| Schema mismatch | Target has different columns | Use mergeSchema: true or reset target |
| Row count explosion | Duplicates or too many runs | Deduplicate source first |
| merge_key not found | Key column missing | Verify keys exist in both source and target |
| NULL in key columns | Missing business keys | Handle NULLs before SCD2 |

Next Steps

Key Concepts Covered

  • ✅ Audit Columns: Track load timestamps and source systems
  • ✅ Unknown Members: Handle orphan FK references with SK=0
  • ✅ Grain Validation: Ensure uniqueness at the (natural_key, is_current) level
  • ✅ Incremental Strategies: Use the merge pattern for facts, SCD2 for dimensions