Dimension Pattern Tutorial¶

In this tutorial, you'll learn how to use Odibi's dimension pattern to build dimension tables with automatic surrogate key generation and SCD (Slowly Changing Dimension) support.

What You'll Learn: - How surrogate keys are generated - SCD Type 0 (static) - never update - SCD Type 1 (overwrite) - update in place - SCD Type 2 (history) - track all changes - Unknown member handling

Source Data¶

We'll start with this customer data (12 rows):

Source Data (customers.csv) - 12 rows:

customer_id	name	email	region	city	state
C001	Alice Johnson	alice@example.com	North	Chicago	IL
C002	Bob Smith	bob@example.com	South	Houston	TX
C003	Carol White	carol@example.com	North	Detroit	MI
C004	David Brown	david@example.com	East	New York	NY
C005	Emma Davis	emma@example.com	West	Seattle	WA
C006	Frank Miller	frank@example.com	South	Miami	FL
C007	Grace Lee	grace@example.com	East	Boston	MA
C008	Henry Wilson	henry@example.com	West	Portland	OR
C009	Ivy Chen	ivy@example.com	North	Minneapolis	MN
C010	Jack Taylor	jack@example.com	South	Dallas	TX
C011	Karen Martinez	karen@example.com	East	Philadelphia	PA
C012	Leo Anderson	leo@example.com	West	Denver	CO

Step 1: SCD Type 0 - Static Dimensions¶

When to use: Reference data that never changes (country codes, fixed lookups).

SCD Type 0 creates surrogate keys but never updates existing records. New records are inserted, but changes to existing records are ignored.

YAML Configuration¶

project: dimension_tutorial
engine: pandas

connections:
  source:
    type: local
    base_path: ./data
  warehouse:
    type: local
    base_path: ./warehouse

story:
  connection: warehouse
  path: stories

pipelines:
  - pipeline: build_dim_customer_scd0
    nodes:
      - name: dim_customer
        read:
          connection: source
          path: customers.csv
          format: csv

        pattern:
          type: dimension
          params:
            natural_key: customer_id
            surrogate_key: customer_sk
            scd_type: 0
            unknown_member: true

        write:
          connection: warehouse
          path: dim_customer
          format: parquet
          mode: overwrite

Output: dim_customer (13 rows)¶

After running with scd_type: 0, here's the dimension table with generated surrogate keys:

customer_sk	customer_id	name	email	region	city	state	load_timestamp
0	-1	Unknown	Unknown	Unknown	Unknown	Unknown	1900-01-01 00:00:00
1	C001	Alice Johnson	alice@example.com	North	Chicago	IL	2024-01-15 10:00:00
2	C002	Bob Smith	bob@example.com	South	Houston	TX	2024-01-15 10:00:00
3	C003	Carol White	carol@example.com	North	Detroit	MI	2024-01-15 10:00:00
4	C004	David Brown	david@example.com	East	New York	NY	2024-01-15 10:00:00
5	C005	Emma Davis	emma@example.com	West	Seattle	WA	2024-01-15 10:00:00
6	C006	Frank Miller	frank@example.com	South	Miami	FL	2024-01-15 10:00:00
7	C007	Grace Lee	grace@example.com	East	Boston	MA	2024-01-15 10:00:00
8	C008	Henry Wilson	henry@example.com	West	Portland	OR	2024-01-15 10:00:00
9	C009	Ivy Chen	ivy@example.com	North	Minneapolis	MN	2024-01-15 10:00:00
10	C010	Jack Taylor	jack@example.com	South	Dallas	TX	2024-01-15 10:00:00
11	C011	Karen Martinez	karen@example.com	East	Philadelphia	PA	2024-01-15 10:00:00
12	C012	Leo Anderson	leo@example.com	West	Denver	CO	2024-01-15 10:00:00

Key observations: - Row 0 is the unknown member (customer_sk = 0, customer_id = -1) - Surrogate keys are sequential integers starting at 1 - Each source row gets a unique SK

Step 2: SCD Type 1 - Overwrite Updates¶

When to use: Attributes where you only need the current value (email, phone, preferences).

SCD Type 1 updates existing records in place when changes are detected. No history is preserved.

The Update Scenario¶

Three customers changed their email addresses:

Updated Source Data (customers_updated.csv) - 3 changes highlighted:

customer_id	name	email	region	city	state
C001	Alice Johnson	alice.johnson@newmail.com	North	Chicago	IL
C002	Bob Smith	bob@example.com	South	Houston	TX
C003	Carol White	carol@example.com	North	Detroit	MI
C004	David Brown	david.b@corporate.com	East	New York	NY
C005	Emma Davis	emma@example.com	West	Seattle	WA
C006	Frank Miller	frank@example.com	South	Miami	FL
C007	Grace Lee	grace.lee@gmail.com	East	Boston	MA
C008	Henry Wilson	henry@example.com	West	Portland	OR
C009	Ivy Chen	ivy@example.com	North	Minneapolis	MN
C010	Jack Taylor	jack@example.com	South	Dallas	TX
C011	Karen Martinez	karen@example.com	East	Philadelphia	PA
C012	Leo Anderson	leo@example.com	West	Denver	CO

YAML Configuration¶

project: dimension_tutorial
engine: pandas

connections:
  source:
    type: local
    base_path: ./data
  warehouse:
    type: local
    base_path: ./warehouse

story:
  connection: warehouse
  path: stories

pipelines:
  - pipeline: build_dim_customer_scd1
    nodes:
      - name: dim_customer
        read:
          connection: source
          path: customers_updated.csv
          format: csv

        pattern:
          type: dimension
          params:
            natural_key: customer_id
            surrogate_key: customer_sk
            scd_type: 1
            track_cols:
              - name
              - email
              - region
              - city
              - state
            target: warehouse.dim_customer
            unknown_member: true
            audit:
              load_timestamp: true

        write:
          connection: warehouse
          path: dim_customer
          format: parquet
          mode: overwrite

Before vs After Comparison¶

BEFORE (original load):

customer_sk	customer_id	email	load_timestamp
0	-1	Unknown	1900-01-01 00:00:00
1	C001	alice@example.com	2024-01-15 10:00:00
4	C004	david@example.com	2024-01-15 10:00:00
7	C007	grace@example.com	2024-01-15 10:00:00
...	...	...	...

AFTER (SCD1 update):

customer_sk	customer_id	email	load_timestamp
0	-1	Unknown	1900-01-01 00:00:00
1	C001	alice.johnson@newmail.com	2024-01-20 14:30:00
4	C004	david.b@corporate.com	2024-01-20 14:30:00
7	C007	grace.lee@gmail.com	2024-01-20 14:30:00
...	...	...	...

Key observations: - Same surrogate keys - C001 is still customer_sk = 1 - Values updated in place - old emails are gone - Timestamp updated - shows when the record was last modified - No history preserved - we can't see the old email addresses

Complete SCD1 Output (13 rows)¶

customer_sk	customer_id	name	email	region	city	state	load_timestamp
0	-1	Unknown	Unknown	Unknown	Unknown	Unknown	1900-01-01 00:00:00
1	C001	Alice Johnson	alice.johnson@newmail.com	North	Chicago	IL	2024-01-20 14:30:00
2	C002	Bob Smith	bob@example.com	South	Houston	TX	2024-01-15 10:00:00
3	C003	Carol White	carol@example.com	North	Detroit	MI	2024-01-15 10:00:00
4	C004	David Brown	david.b@corporate.com	East	New York	NY	2024-01-20 14:30:00
5	C005	Emma Davis	emma@example.com	West	Seattle	WA	2024-01-15 10:00:00
6	C006	Frank Miller	frank@example.com	South	Miami	FL	2024-01-15 10:00:00
7	C007	Grace Lee	grace.lee@gmail.com	East	Boston	MA	2024-01-20 14:30:00
8	C008	Henry Wilson	henry@example.com	West	Portland	OR	2024-01-15 10:00:00
9	C009	Ivy Chen	ivy@example.com	North	Minneapolis	MN	2024-01-15 10:00:00
10	C010	Jack Taylor	jack@example.com	South	Dallas	TX	2024-01-15 10:00:00
11	C011	Karen Martinez	karen@example.com	East	Philadelphia	PA	2024-01-15 10:00:00
12	C012	Leo Anderson	leo@example.com	West	Denver	CO	2024-01-15 10:00:00

Step 3: SCD Type 2 - Full History Tracking¶

When to use: Attributes where historical accuracy matters (address for shipping analysis, tier for billing history).

SCD Type 2 preserves full history by creating a new row for each change. Old versions are closed with a valid_to date.

The History Scenario¶

Same three customers changed their emails. With SCD2, we keep both versions:

YAML Configuration¶

project: dimension_tutorial
engine: pandas

connections:
  source:
    type: local
    base_path: ./data
  warehouse:
    type: local
    base_path: ./warehouse

story:
  connection: warehouse
  path: stories

pipelines:
  - pipeline: build_dim_customer_scd2
    nodes:
      - name: dim_customer
        read:
          connection: source
          path: customers_updated.csv
          format: csv

        pattern:
          type: dimension
          params:
            natural_key: customer_id
            surrogate_key: customer_sk
            scd_type: 2
            track_cols:
              - name
              - email
              - region
              - city
              - state
            target: warehouse.dim_customer
            valid_from_col: valid_from
            valid_to_col: valid_to
            is_current_col: is_current
            unknown_member: true
            audit:
              load_timestamp: true
              source_system: "crm"

        write:
          connection: warehouse
          path: dim_customer
          format: parquet
          mode: overwrite

Output: Full History (16 rows)¶

customer_sk	customer_id	name	email	region	valid_from	valid_to	is_current
0	-1	Unknown	Unknown	Unknown	1900-01-01	NULL	true
1	C001	Alice Johnson	alice@example.com	North	2024-01-15	2024-01-20	false
2	C002	Bob Smith	bob@example.com	South	2024-01-15	NULL	true
3	C003	Carol White	carol@example.com	North	2024-01-15	NULL	true
4	C004	David Brown	david@example.com	East	2024-01-15	2024-01-20	false
5	C005	Emma Davis	emma@example.com	West	2024-01-15	NULL	true
6	C006	Frank Miller	frank@example.com	South	2024-01-15	NULL	true
7	C007	Grace Lee	grace@example.com	East	2024-01-15	2024-01-20	false
8	C008	Henry Wilson	henry@example.com	West	2024-01-15	NULL	true
9	C009	Ivy Chen	ivy@example.com	North	2024-01-15	NULL	true
10	C010	Jack Taylor	jack@example.com	South	2024-01-15	NULL	true
11	C011	Karen Martinez	karen@example.com	East	2024-01-15	NULL	true
12	C012	Leo Anderson	leo@example.com	West	2024-01-15	NULL	true
13	C001	Alice Johnson	alice.johnson@newmail.com	North	2024-01-20	NULL	true
14	C004	David Brown	david.b@corporate.com	East	2024-01-20	NULL	true
15	C007	Grace Lee	grace.lee@gmail.com	East	2024-01-20	NULL	true

Key observations: - New surrogate keys for new versions (13, 14, 15) - Old versions marked closed (is_current = false, valid_to = 2024-01-20) - New versions marked current (is_current = true, valid_to = NULL) - Full history preserved - we can query data as of any point in time

How to Query SCD2¶

Current view (most common):

SELECT * FROM dim_customer WHERE is_current = true;

Point-in-time query (as of January 17):

SELECT * FROM dim_customer
WHERE '2024-01-17' >= valid_from
  AND ('2024-01-17' < valid_to OR valid_to IS NULL);

Customer C001's email history:

SELECT customer_sk, email, valid_from, valid_to
FROM dim_customer
WHERE customer_id = 'C001'
ORDER BY valid_from;

customer_sk	email	valid_from	valid_to
1	alice@example.com	2024-01-15	2024-01-20
13	alice.johnson@newmail.com	2024-01-20	NULL

Understanding the Unknown Member¶

The unknown member row (customer_sk = 0) is automatically created when unknown_member: true:

customer_sk	customer_id	name	email	all other columns
0	-1	Unknown	Unknown	Unknown

Why it matters:

When building fact tables, orders might reference a customer_id that doesn't exist in the dimension (data quality issue, late-arriving data, etc.). Instead of: - Failing the pipeline (strict but inflexible) - Losing the order data (dangerous)

We assign those orphan records to customer_sk = 0. This: - Preserves all fact data - Makes orphans easily identifiable - Allows later cleanup/investigation

Complete Runnable Example¶

Here's a complete YAML file you can run:

# File: odibi_dimension_tutorial.yaml
project: dimension_tutorial
engine: pandas

connections:
  source:
    type: local
    base_path: ./examples/tutorials/dimensional_modeling/data
  warehouse:
    type: local
    base_path: ./warehouse

story:
  connection: warehouse
  path: stories

pipelines:
  # Initial load with SCD2
  - pipeline: initial_load
    description: "First load of customer dimension"
    nodes:
      - name: dim_customer
        description: "Customer dimension with SCD2 history tracking"
        read:
          connection: source
          path: customers.csv
          format: csv

        pattern:
          type: dimension
          params:
            natural_key: customer_id
            surrogate_key: customer_sk
            scd_type: 2
            track_cols:
              - name
              - email
              - region
              - city
              - state
            target: warehouse.dim_customer
            valid_from_col: valid_from
            valid_to_col: valid_to
            is_current_col: is_current
            unknown_member: true
            audit:
              load_timestamp: true
              source_system: "crm"

        write:
          connection: warehouse
          path: dim_customer
          format: parquet
          mode: overwrite

  # Incremental update with changes
  - pipeline: incremental_update
    description: "Process updates to customer dimension"
    nodes:
      - name: dim_customer
        description: "Update customer dimension with new email addresses"
        read:
          connection: source
          path: customers_updated.csv
          format: csv

        pattern:
          type: dimension
          params:
            natural_key: customer_id
            surrogate_key: customer_sk
            scd_type: 2
            track_cols:
              - name
              - email
              - region
              - city
              - state
            target: warehouse.dim_customer
            valid_from_col: valid_from
            valid_to_col: valid_to
            is_current_col: is_current
            unknown_member: true
            audit:
              load_timestamp: true
              source_system: "crm"

        write:
          connection: warehouse
          path: dim_customer
          format: parquet
          mode: overwrite

Python API Alternative¶

If you prefer Python over YAML:

from odibi.patterns.dimension import DimensionPattern
from odibi.context import EngineContext
from odibi.enums import EngineType
import pandas as pd

# Load source data
source_df = pd.read_csv("examples/tutorials/dimensional_modeling/data/customers.csv")

# Create pattern
pattern = DimensionPattern(params={
    "natural_key": "customer_id",
    "surrogate_key": "customer_sk",
    "scd_type": 2,
    "track_cols": ["name", "email", "region", "city", "state"],
    "unknown_member": True,
    "audit": {
        "load_timestamp": True,
        "source_system": "crm"
    }
})

# Validate configuration
pattern.validate()

# Execute pattern
context = EngineContext(df=source_df, engine_type=EngineType.PANDAS)
result_df = pattern.execute(context)

# View results
print(f"Generated {len(result_df)} dimension rows")
print(result_df.head(15))

What You Learned¶

In this tutorial, you learned:

SCD Type 0 creates surrogate keys but never updates existing records
SCD Type 1 updates records in place, losing history but keeping current data
SCD Type 2 creates new rows for changes, preserving full history with valid_from/valid_to dates
Surrogate keys are auto-generated integers, sequential starting from 1
Unknown member (SK=0) provides a default for orphan FK handling
track_cols defines which columns trigger a new version in SCD1/SCD2
Audit columns (load_timestamp, source_system) track when/where data came from

Next Steps¶

Now that you can build customer dimensions, let's create a date dimension that's automatically generated.

Next: Date Dimension Pattern Tutorial

Previous	Up	Next
Introduction	Tutorials	Date Dimension

Reference¶

For complete parameter documentation, see: Dimension Pattern Reference

Dimension Pattern Tutorial¶

Source Data¶

Step 1: SCD Type 0 - Static Dimensions¶

YAML Configuration¶

Output: dim_customer (13 rows)¶

Step 2: SCD Type 1 - Overwrite Updates¶

The Update Scenario¶

YAML Configuration¶

Before vs After Comparison¶

Complete SCD1 Output (13 rows)¶

Step 3: SCD Type 2 - Full History Tracking¶

The History Scenario¶

YAML Configuration¶

Output: Full History (16 rows)¶

How to Query SCD2¶

Understanding the Unknown Member¶

Complete Runnable Example¶

Python API Alternative¶

What You Learned¶

Next Steps¶

Navigation¶

Reference¶