SCD2 Timeline Anatomy¶
Understanding how Slowly Changing Dimension Type 2 tracks history over time.
What is SCD2?¶
SCD Type 2 creates a new row for every change to tracked attributes, preserving complete history.
Example: Customer "Alice" changes her address twice.
Timeline Visualization¶
gantt
title Customer #101 (Alice) - Address History
dateFormat YYYY-MM-DD
section Portland
Version 1 (SK=1001) :active, v1, 2024-01-01, 2024-06-15
section Seattle
Version 2 (SK=1002) :active, v2, 2024-06-15, 2025-01-10
section Boston
Version 3 (SK=1003) :active, v3, 2025-01-10, 2099-12-31
Table Evolution¶
Initial Load (2024-01-01)¶
Customer moves to Portland.
| customer_sk | customer_id | name | city | is_current | valid_from | valid_to | is_deleted |
|---|---|---|---|---|---|---|---|
| 1001 | 101 | Alice | Portland | TRUE | 2024-01-01 | 9999-12-31 | FALSE |
Explanation:
- customer_sk = 1001 → Surrogate key (unique, never reused)
- customer_id = 101 → Natural key (from source system)
- is_current = TRUE → This is the active version
- valid_to = 9999-12-31 → No end date yet (current record)
First Change (2024-06-15)¶
Customer moves to Seattle.
graph LR
A[Source: Alice<br/>city = Seattle] -->|Merge| B{Match on<br/>customer_id?}
B -->|Yes| C{City changed?}
C -->|Yes| D[Close old row<br/>Open new row]
D --> E[SK=1001<br/>is_current=FALSE<br/>valid_to=2024-06-14]
D --> F[SK=1002<br/>is_current=TRUE<br/>valid_from=2024-06-15]
Result:
| customer_sk | customer_id | name | city | is_current | valid_from | valid_to | is_deleted |
|---|---|---|---|---|---|---|---|
| 1001 | 101 | Alice | Portland | FALSE | 2024-01-01 | 2024-06-14 | FALSE |
| 1002 | 101 | Alice | Seattle | TRUE | 2024-06-15 | 9999-12-31 | FALSE |
Key Changes:
- Row 1 (SK=1001): is_current → FALSE, valid_to → 2024-06-14
- Row 2 (SK=1002): New row with Seattle, is_current → TRUE
Second Change (2025-01-10)¶
Customer moves to Boston.
| customer_sk | customer_id | name | city | is_current | valid_from | valid_to | is_deleted |
|---|---|---|---|---|---|---|---|
| 1001 | 101 | Alice | Portland | FALSE | 2024-01-01 | 2024-06-14 | FALSE |
| 1002 | 101 | Alice | Seattle | FALSE | 2024-06-15 | 2025-01-09 | FALSE |
| 1003 | 101 | Alice | Boston | TRUE | 2025-01-10 | 9999-12-31 | FALSE |
Now we have 3 versions of customer 101, each with its own surrogate key.
SCD2 State Diagram¶
stateDiagram-v2
[*] --> Current: New record inserted
Current --> Historical: Attribute changes
Historical --> [*]: Retention policy
Current: is_current = TRUE<br/>valid_to = 9999-12-31
Historical: is_current = FALSE<br/>valid_to = change_date
Tracked vs Non-Tracked Columns¶
Not all columns need history tracking. Choose wisely.
| Column Type | Track Changes? | Example |
|---|---|---|
| Tracked | ✅ Yes (creates new version) | address, email, tier, status |
| Non-Tracked | ❌ No (update in place) | phone, last_login, preferences |
| Fixed | 🔒 Never changes | customer_id, signup_date, birth_date |
Example YAML:
pattern:
type: dimension
params:
natural_key: customer_id
surrogate_key: customer_sk
scd_type: 2
track_cols: [name, email, city, tier] # Only these trigger new versions
Querying SCD2 Tables¶
Get Current State¶
Returns 1 row per customer (latest version).
Get Historical State (Point-in-Time)¶
"What was Alice's address on 2024-10-01?"
SELECT *
FROM dim_customer
WHERE customer_id = 101
AND valid_from <= '2024-10-01'
AND valid_to >= '2024-10-01';
Returns: | customer_sk | customer_id | name | city | valid_from | valid_to | |-------------|-------------|------|------|------------|----------| | 1002 | 101 | Alice | Seattle | 2024-06-15 | 2025-01-09 |
Get Full History¶
SELECT customer_id, city, valid_from, valid_to, is_current
FROM dim_customer
WHERE customer_id = 101
ORDER BY valid_from;
Returns: | customer_id | city | valid_from | valid_to | is_current | |-------------|------|------------|----------|------------| | 101 | Portland | 2024-01-01 | 2024-06-14 | FALSE | | 101 | Seattle | 2024-06-15 | 2025-01-09 | FALSE | | 101 | Boston | 2025-01-10 | 9999-12-31 | TRUE |
Common Patterns¶
Delete Detection¶
If a customer is deleted from source:
| customer_sk | customer_id | name | city | is_current | is_deleted |
|---|---|---|---|---|---|
| 1003 | 101 | Alice | Boston | FALSE | TRUE |
Unknown Member (SK=0)¶
For orphan fact records (customer not in dimension):
| customer_sk | customer_id | name | city | is_current |
|---|---|---|---|---|
| 0 | NULL | Unknown | Unknown | TRUE |
Fact records reference SK=0 when customer_id is missing.
SCD2 vs SCD1¶
| Feature | SCD1 (Overwrite) | SCD2 (Versioned) |
|---|---|---|
| History | ❌ No | ✅ Yes |
| Storage | Small | Large |
| Queries | Simple | Moderate |
| Use Case | Latest state only | Audit, compliance, trends |
When to use SCD2: - Compliance requirements (GDPR, SOX) - Trend analysis (how has customer tier changed?) - Slowly changing attributes (address, status)
When to use SCD1: - Non-critical attributes (preferences, UI settings) - High-frequency changes (no value in history) - Storage constraints
Odibi SCD2 Example¶
nodes:
- name: dim_customer
read:
connection: bronze
table: raw_customers
pattern:
type: dimension
params:
natural_key: customer_id
surrogate_key: customer_sk
scd_type: 2
track_cols: [name, email, city, tier]
effective_from_col: valid_from
effective_to_col: valid_to
current_flag_col: is_current
deleted_flag_col: is_deleted
detect_deletes: true
unknown_member: true
write:
connection: silver
format: delta
table: dim_customer
mode: overwrite
First run: Inserts all customers with is_current = TRUE
Second run: Detects changes, closes old rows, opens new rows
Third run: Detects deletes, sets is_deleted = TRUE
Pitfalls to Avoid¶
❌ Wrong: No Surrogate Keys¶
✅ Right: Use Surrogate Keys¶
❌ Wrong: Tracking Too Many Columns¶
# BAD: Tracking high-frequency changes
track_cols: [name, email, city, tier, last_login, page_views]
# Creates new version on every login!
✅ Right: Track Only Business-Critical Changes¶
❌ Wrong: No Unknown Member¶
Facts referencing SK=NULL break BI tools.
✅ Right: Unknown Member Handling¶
Related¶
- Dimension Pattern - Full implementation guide
- SCD2 Pattern - Deep dive
- Fact Pattern - How facts use surrogate keys
- Example: SCD2 Dimension - Working example
← Back to Visuals | Architecture | Incremental Decision Tree