Dimension Pattern Tutorial¶
In this tutorial, you'll learn how to use Odibi's dimension pattern to build dimension tables with automatic surrogate key generation and SCD (Slowly Changing Dimension) support.
What You'll Learn: - How surrogate keys are generated - SCD Type 0 (static) - never update - SCD Type 1 (overwrite) - update in place - SCD Type 2 (history) - track all changes - Unknown member handling
Source Data¶
We'll start with this customer data (12 rows):
Source Data (customers.csv) - 12 rows:
| customer_id | name | region | city | state | |
|---|---|---|---|---|---|
| C001 | Alice Johnson | alice@example.com | North | Chicago | IL |
| C002 | Bob Smith | bob@example.com | South | Houston | TX |
| C003 | Carol White | carol@example.com | North | Detroit | MI |
| C004 | David Brown | david@example.com | East | New York | NY |
| C005 | Emma Davis | emma@example.com | West | Seattle | WA |
| C006 | Frank Miller | frank@example.com | South | Miami | FL |
| C007 | Grace Lee | grace@example.com | East | Boston | MA |
| C008 | Henry Wilson | henry@example.com | West | Portland | OR |
| C009 | Ivy Chen | ivy@example.com | North | Minneapolis | MN |
| C010 | Jack Taylor | jack@example.com | South | Dallas | TX |
| C011 | Karen Martinez | karen@example.com | East | Philadelphia | PA |
| C012 | Leo Anderson | leo@example.com | West | Denver | CO |
Step 1: SCD Type 0 - Static Dimensions¶
When to use: Reference data that never changes (country codes, fixed lookups).
SCD Type 0 creates surrogate keys but never updates existing records. New records are inserted, but changes to existing records are ignored.
YAML Configuration¶
project: dimension_tutorial
engine: pandas
connections:
source:
type: local
base_path: ./data
warehouse:
type: local
base_path: ./warehouse
story:
connection: warehouse
path: stories
pipelines:
- pipeline: build_dim_customer_scd0
nodes:
- name: dim_customer
read:
connection: source
path: customers.csv
format: csv
pattern:
type: dimension
params:
natural_key: customer_id
surrogate_key: customer_sk
scd_type: 0
unknown_member: true
write:
connection: warehouse
path: dim_customer
format: parquet
mode: overwrite
Output: dim_customer (13 rows)¶
After running with scd_type: 0, here's the dimension table with generated surrogate keys:
| customer_sk | customer_id | name | region | city | state | load_timestamp | |
|---|---|---|---|---|---|---|---|
| 0 | -1 | Unknown | Unknown | Unknown | Unknown | Unknown | 1900-01-01 00:00:00 |
| 1 | C001 | Alice Johnson | alice@example.com | North | Chicago | IL | 2024-01-15 10:00:00 |
| 2 | C002 | Bob Smith | bob@example.com | South | Houston | TX | 2024-01-15 10:00:00 |
| 3 | C003 | Carol White | carol@example.com | North | Detroit | MI | 2024-01-15 10:00:00 |
| 4 | C004 | David Brown | david@example.com | East | New York | NY | 2024-01-15 10:00:00 |
| 5 | C005 | Emma Davis | emma@example.com | West | Seattle | WA | 2024-01-15 10:00:00 |
| 6 | C006 | Frank Miller | frank@example.com | South | Miami | FL | 2024-01-15 10:00:00 |
| 7 | C007 | Grace Lee | grace@example.com | East | Boston | MA | 2024-01-15 10:00:00 |
| 8 | C008 | Henry Wilson | henry@example.com | West | Portland | OR | 2024-01-15 10:00:00 |
| 9 | C009 | Ivy Chen | ivy@example.com | North | Minneapolis | MN | 2024-01-15 10:00:00 |
| 10 | C010 | Jack Taylor | jack@example.com | South | Dallas | TX | 2024-01-15 10:00:00 |
| 11 | C011 | Karen Martinez | karen@example.com | East | Philadelphia | PA | 2024-01-15 10:00:00 |
| 12 | C012 | Leo Anderson | leo@example.com | West | Denver | CO | 2024-01-15 10:00:00 |
Key observations: - Row 0 is the unknown member (customer_sk = 0, customer_id = -1) - Surrogate keys are sequential integers starting at 1 - Each source row gets a unique SK
Step 2: SCD Type 1 - Overwrite Updates¶
When to use: Attributes where you only need the current value (email, phone, preferences).
SCD Type 1 updates existing records in place when changes are detected. No history is preserved.
The Update Scenario¶
Three customers changed their email addresses:
Updated Source Data (customers_updated.csv) - 3 changes highlighted:
| customer_id | name | region | city | state | |
|---|---|---|---|---|---|
| C001 | Alice Johnson | alice.johnson@newmail.com | North | Chicago | IL |
| C002 | Bob Smith | bob@example.com | South | Houston | TX |
| C003 | Carol White | carol@example.com | North | Detroit | MI |
| C004 | David Brown | david.b@corporate.com | East | New York | NY |
| C005 | Emma Davis | emma@example.com | West | Seattle | WA |
| C006 | Frank Miller | frank@example.com | South | Miami | FL |
| C007 | Grace Lee | grace.lee@gmail.com | East | Boston | MA |
| C008 | Henry Wilson | henry@example.com | West | Portland | OR |
| C009 | Ivy Chen | ivy@example.com | North | Minneapolis | MN |
| C010 | Jack Taylor | jack@example.com | South | Dallas | TX |
| C011 | Karen Martinez | karen@example.com | East | Philadelphia | PA |
| C012 | Leo Anderson | leo@example.com | West | Denver | CO |
YAML Configuration¶
project: dimension_tutorial
engine: pandas
connections:
source:
type: local
base_path: ./data
warehouse:
type: local
base_path: ./warehouse
story:
connection: warehouse
path: stories
pipelines:
- pipeline: build_dim_customer_scd1
nodes:
- name: dim_customer
read:
connection: source
path: customers_updated.csv
format: csv
pattern:
type: dimension
params:
natural_key: customer_id
surrogate_key: customer_sk
scd_type: 1
track_cols:
- name
- email
- region
- city
- state
target: warehouse.dim_customer
unknown_member: true
audit:
load_timestamp: true
write:
connection: warehouse
path: dim_customer
format: parquet
mode: overwrite
Before vs After Comparison¶
BEFORE (original load):
| customer_sk | customer_id | load_timestamp | |
|---|---|---|---|
| 0 | -1 | Unknown | 1900-01-01 00:00:00 |
| 1 | C001 | alice@example.com | 2024-01-15 10:00:00 |
| 4 | C004 | david@example.com | 2024-01-15 10:00:00 |
| 7 | C007 | grace@example.com | 2024-01-15 10:00:00 |
| ... | ... | ... | ... |
AFTER (SCD1 update):
| customer_sk | customer_id | load_timestamp | |
|---|---|---|---|
| 0 | -1 | Unknown | 1900-01-01 00:00:00 |
| 1 | C001 | alice.johnson@newmail.com | 2024-01-20 14:30:00 |
| 4 | C004 | david.b@corporate.com | 2024-01-20 14:30:00 |
| 7 | C007 | grace.lee@gmail.com | 2024-01-20 14:30:00 |
| ... | ... | ... | ... |
Key observations: - Same surrogate keys - C001 is still customer_sk = 1 - Values updated in place - old emails are gone - Timestamp updated - shows when the record was last modified - No history preserved - we can't see the old email addresses
Complete SCD1 Output (13 rows)¶
| customer_sk | customer_id | name | region | city | state | load_timestamp | |
|---|---|---|---|---|---|---|---|
| 0 | -1 | Unknown | Unknown | Unknown | Unknown | Unknown | 1900-01-01 00:00:00 |
| 1 | C001 | Alice Johnson | alice.johnson@newmail.com | North | Chicago | IL | 2024-01-20 14:30:00 |
| 2 | C002 | Bob Smith | bob@example.com | South | Houston | TX | 2024-01-15 10:00:00 |
| 3 | C003 | Carol White | carol@example.com | North | Detroit | MI | 2024-01-15 10:00:00 |
| 4 | C004 | David Brown | david.b@corporate.com | East | New York | NY | 2024-01-20 14:30:00 |
| 5 | C005 | Emma Davis | emma@example.com | West | Seattle | WA | 2024-01-15 10:00:00 |
| 6 | C006 | Frank Miller | frank@example.com | South | Miami | FL | 2024-01-15 10:00:00 |
| 7 | C007 | Grace Lee | grace.lee@gmail.com | East | Boston | MA | 2024-01-20 14:30:00 |
| 8 | C008 | Henry Wilson | henry@example.com | West | Portland | OR | 2024-01-15 10:00:00 |
| 9 | C009 | Ivy Chen | ivy@example.com | North | Minneapolis | MN | 2024-01-15 10:00:00 |
| 10 | C010 | Jack Taylor | jack@example.com | South | Dallas | TX | 2024-01-15 10:00:00 |
| 11 | C011 | Karen Martinez | karen@example.com | East | Philadelphia | PA | 2024-01-15 10:00:00 |
| 12 | C012 | Leo Anderson | leo@example.com | West | Denver | CO | 2024-01-15 10:00:00 |
Step 3: SCD Type 2 - Full History Tracking¶
When to use: Attributes where historical accuracy matters (address for shipping analysis, tier for billing history).
SCD Type 2 preserves full history by creating a new row for each change. Old versions are closed with a valid_to date.
The History Scenario¶
Same three customers changed their emails. With SCD2, we keep both versions:
YAML Configuration¶
project: dimension_tutorial
engine: pandas
connections:
source:
type: local
base_path: ./data
warehouse:
type: local
base_path: ./warehouse
story:
connection: warehouse
path: stories
pipelines:
- pipeline: build_dim_customer_scd2
nodes:
- name: dim_customer
read:
connection: source
path: customers_updated.csv
format: csv
pattern:
type: dimension
params:
natural_key: customer_id
surrogate_key: customer_sk
scd_type: 2
track_cols:
- name
- email
- region
- city
- state
target: warehouse.dim_customer
valid_from_col: valid_from
valid_to_col: valid_to
is_current_col: is_current
unknown_member: true
audit:
load_timestamp: true
source_system: "crm"
write:
connection: warehouse
path: dim_customer
format: parquet
mode: overwrite
Output: Full History (16 rows)¶
| customer_sk | customer_id | name | region | valid_from | valid_to | is_current | |
|---|---|---|---|---|---|---|---|
| 0 | -1 | Unknown | Unknown | Unknown | 1900-01-01 | NULL | true |
| 1 | C001 | Alice Johnson | alice@example.com | North | 2024-01-15 | 2024-01-20 | false |
| 2 | C002 | Bob Smith | bob@example.com | South | 2024-01-15 | NULL | true |
| 3 | C003 | Carol White | carol@example.com | North | 2024-01-15 | NULL | true |
| 4 | C004 | David Brown | david@example.com | East | 2024-01-15 | 2024-01-20 | false |
| 5 | C005 | Emma Davis | emma@example.com | West | 2024-01-15 | NULL | true |
| 6 | C006 | Frank Miller | frank@example.com | South | 2024-01-15 | NULL | true |
| 7 | C007 | Grace Lee | grace@example.com | East | 2024-01-15 | 2024-01-20 | false |
| 8 | C008 | Henry Wilson | henry@example.com | West | 2024-01-15 | NULL | true |
| 9 | C009 | Ivy Chen | ivy@example.com | North | 2024-01-15 | NULL | true |
| 10 | C010 | Jack Taylor | jack@example.com | South | 2024-01-15 | NULL | true |
| 11 | C011 | Karen Martinez | karen@example.com | East | 2024-01-15 | NULL | true |
| 12 | C012 | Leo Anderson | leo@example.com | West | 2024-01-15 | NULL | true |
| 13 | C001 | Alice Johnson | alice.johnson@newmail.com | North | 2024-01-20 | NULL | true |
| 14 | C004 | David Brown | david.b@corporate.com | East | 2024-01-20 | NULL | true |
| 15 | C007 | Grace Lee | grace.lee@gmail.com | East | 2024-01-20 | NULL | true |
Key observations: - New surrogate keys for new versions (13, 14, 15) - Old versions marked closed (is_current = false, valid_to = 2024-01-20) - New versions marked current (is_current = true, valid_to = NULL) - Full history preserved - we can query data as of any point in time
How to Query SCD2¶
Current view (most common):
Point-in-time query (as of January 17):
SELECT * FROM dim_customer
WHERE '2024-01-17' >= valid_from
AND ('2024-01-17' < valid_to OR valid_to IS NULL);
Customer C001's email history:
SELECT customer_sk, email, valid_from, valid_to
FROM dim_customer
WHERE customer_id = 'C001'
ORDER BY valid_from;
| customer_sk | valid_from | valid_to | |
|---|---|---|---|
| 1 | alice@example.com | 2024-01-15 | 2024-01-20 |
| 13 | alice.johnson@newmail.com | 2024-01-20 | NULL |
Understanding the Unknown Member¶
The unknown member row (customer_sk = 0) is automatically created when unknown_member: true:
| customer_sk | customer_id | name | all other columns | |
|---|---|---|---|---|
| 0 | -1 | Unknown | Unknown | Unknown |
Why it matters:
When building fact tables, orders might reference a customer_id that doesn't exist in the dimension (data quality issue, late-arriving data, etc.). Instead of: - Failing the pipeline (strict but inflexible) - Losing the order data (dangerous)
We assign those orphan records to customer_sk = 0. This: - Preserves all fact data - Makes orphans easily identifiable - Allows later cleanup/investigation
Complete Runnable Example¶
Here's a complete YAML file you can run:
# File: odibi_dimension_tutorial.yaml
project: dimension_tutorial
engine: pandas
connections:
source:
type: local
base_path: ./examples/tutorials/dimensional_modeling/data
warehouse:
type: local
base_path: ./warehouse
story:
connection: warehouse
path: stories
pipelines:
# Initial load with SCD2
- pipeline: initial_load
description: "First load of customer dimension"
nodes:
- name: dim_customer
description: "Customer dimension with SCD2 history tracking"
read:
connection: source
path: customers.csv
format: csv
pattern:
type: dimension
params:
natural_key: customer_id
surrogate_key: customer_sk
scd_type: 2
track_cols:
- name
- email
- region
- city
- state
target: warehouse.dim_customer
valid_from_col: valid_from
valid_to_col: valid_to
is_current_col: is_current
unknown_member: true
audit:
load_timestamp: true
source_system: "crm"
write:
connection: warehouse
path: dim_customer
format: parquet
mode: overwrite
# Incremental update with changes
- pipeline: incremental_update
description: "Process updates to customer dimension"
nodes:
- name: dim_customer
description: "Update customer dimension with new email addresses"
read:
connection: source
path: customers_updated.csv
format: csv
pattern:
type: dimension
params:
natural_key: customer_id
surrogate_key: customer_sk
scd_type: 2
track_cols:
- name
- email
- region
- city
- state
target: warehouse.dim_customer
valid_from_col: valid_from
valid_to_col: valid_to
is_current_col: is_current
unknown_member: true
audit:
load_timestamp: true
source_system: "crm"
write:
connection: warehouse
path: dim_customer
format: parquet
mode: overwrite
Python API Alternative¶
If you prefer Python over YAML:
from odibi.patterns.dimension import DimensionPattern
from odibi.context import EngineContext
from odibi.enums import EngineType
import pandas as pd
# Load source data
source_df = pd.read_csv("examples/tutorials/dimensional_modeling/data/customers.csv")
# Create pattern
pattern = DimensionPattern(params={
"natural_key": "customer_id",
"surrogate_key": "customer_sk",
"scd_type": 2,
"track_cols": ["name", "email", "region", "city", "state"],
"unknown_member": True,
"audit": {
"load_timestamp": True,
"source_system": "crm"
}
})
# Validate configuration
pattern.validate()
# Execute pattern
context = EngineContext(df=source_df, engine_type=EngineType.PANDAS)
result_df = pattern.execute(context)
# View results
print(f"Generated {len(result_df)} dimension rows")
print(result_df.head(15))
What You Learned¶
In this tutorial, you learned:
- SCD Type 0 creates surrogate keys but never updates existing records
- SCD Type 1 updates records in place, losing history but keeping current data
- SCD Type 2 creates new rows for changes, preserving full history with valid_from/valid_to dates
- Surrogate keys are auto-generated integers, sequential starting from 1
- Unknown member (SK=0) provides a default for orphan FK handling
- track_cols defines which columns trigger a new version in SCD1/SCD2
- Audit columns (load_timestamp, source_system) track when/where data came from
Next Steps¶
Now that you can build customer dimensions, let's create a date dimension that's automatically generated.
Next: Date Dimension Pattern Tutorial
Navigation¶
| Previous | Up | Next |
|---|---|---|
| Introduction | Tutorials | Date Dimension |
Reference¶
For complete parameter documentation, see: Dimension Pattern Reference