Dimension Pattern¶
The dimension pattern builds complete dimension tables with automatic surrogate key generation and SCD (Slowly Changing Dimension) support.
Integration with Odibi YAML¶
Patterns are used via the pattern: block in a node config. The pattern type goes in pattern: type: and configuration goes in pattern: params:.
project: my_warehouse
engine: spark
connections:
staging:
type: delta
path: /mnt/staging
warehouse:
type: delta
path: /mnt/warehouse
story:
connection: warehouse
path: stories
pipelines:
- pipeline: build_dimensions
nodes:
- name: dim_customer
read:
connection: staging
path: customers
format: delta
# Use dimension pattern
pattern:
type: dimension
params:
natural_key: customer_id
surrogate_key: customer_sk
scd_type: 2
track_cols: [name, email, address]
target: warehouse.dim_customer
unknown_member: true
audit:
load_timestamp: true
source_system: "crm"
write:
connection: warehouse
path: dim_customer
format: delta
mode: overwrite
Features¶
- Auto-generate integer surrogate keys (MAX(existing) + ROW_NUMBER for new rows)
- SCD Type 0 (static - never update existing records)
- SCD Type 1 (overwrite - update in place, no history)
- SCD Type 2 (history tracking - full audit trail)
- Unknown member row (SK=0) for orphan FK handling
- Audit columns (load_timestamp, source_system)
Parameters¶
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
natural_key |
str | Yes | - | Natural/business key column name |
surrogate_key |
str | Yes | - | Surrogate key column name to generate |
scd_type |
int | No | 1 | 0=static, 1=overwrite, 2=history tracking |
track_cols |
list | For SCD1/2 | - | Columns to track for changes |
target |
str | For SCD2 | - | Target table path (required to read existing history) |
unknown_member |
bool | No | false | Insert a row with SK=0 for orphan FK handling |
audit |
dict | No | {} | Audit column configuration |
valid_from_col |
str | No | valid_from | Name of the SCD2 start date column |
valid_to_col |
str | No | valid_to | Name of the SCD2 end date column |
is_current_col |
str | No | is_current | Name of the SCD2 current flag column |
Audit Config¶
params:
# ... other params ...
audit:
load_timestamp: true # Add load_timestamp column
source_system: "pos" # Add source_system column with this value
SCD Type 0 (Static)¶
Static dimensions never update existing records. Only new records (not matching natural key) are inserted.
Use case: Reference data that never changes (ISO country codes, fixed lookup values).
nodes:
- name: dim_country
read:
connection: staging
path: countries
pattern:
type: dimension
params:
natural_key: country_code
surrogate_key: country_sk
scd_type: 0
target: warehouse.dim_country
write:
connection: warehouse
path: dim_country
mode: overwrite
SCD Type 1 (Overwrite)¶
Overwrite dimensions update existing records in place. No history is kept.
Use case: Attributes where you only care about the current value (customer email, product price).
nodes:
- name: dim_customer
read:
connection: staging
path: customers
pattern:
type: dimension
params:
natural_key: customer_id
surrogate_key: customer_sk
scd_type: 1
track_cols: [name, email, address]
target: warehouse.dim_customer
audit:
load_timestamp: true
write:
connection: warehouse
path: dim_customer
mode: overwrite
SCD Type 2 (History Tracking)¶
History-tracking dimensions preserve full audit trail. Old records are closed, new versions are opened.
Use case: Slowly changing attributes where historical accuracy matters (customer address for point-in-time reporting).
nodes:
- name: dim_customer
read:
connection: staging
path: customers
pattern:
type: dimension
params:
natural_key: customer_id
surrogate_key: customer_sk
scd_type: 2
track_cols: [name, email, address, city, state]
target: warehouse.dim_customer
valid_from_col: valid_from # Optional, default: valid_from
valid_to_col: valid_to # Optional, default: valid_to
is_current_col: is_current # Optional, default: is_current
unknown_member: true
audit:
load_timestamp: true
source_system: "crm"
write:
connection: warehouse
path: dim_customer
mode: overwrite
Generated Columns:
- valid_from: Timestamp when this version became active
- valid_to: Timestamp when this version was superseded (NULL for current)
- is_current: Boolean flag (true for current version)
Unknown Member Handling¶
Enable unknown_member: true to automatically insert a row with SK=0. This allows fact tables to reference unknown dimensions without FK violations.
Generated Unknown Member Row:
| customer_sk | customer_id | name | valid_from | is_current | |
|---|---|---|---|---|---|
| 0 | -1 | Unknown | Unknown | 1900-01-01 | true |
Full Star Schema Example¶
Complete pipeline building dimensions for a star schema:
project: sales_warehouse
engine: spark
connections:
staging:
type: delta
path: /mnt/staging
warehouse:
type: delta
path: /mnt/warehouse
story:
connection: warehouse
path: stories
system:
connection: warehouse
path: _system_catalog
pipelines:
- pipeline: build_dimensions
nodes:
# Customer dimension with SCD2
- name: dim_customer
read:
connection: staging
path: customers
format: delta
pattern:
type: dimension
params:
natural_key: customer_id
surrogate_key: customer_sk
scd_type: 2
track_cols:
- name
- email
- phone
- address_line_1
- city
- state
- postal_code
target: warehouse.dim_customer
unknown_member: true
audit:
load_timestamp: true
source_system: "salesforce"
write:
connection: warehouse
path: dim_customer
format: delta
mode: overwrite
# Product dimension with SCD1 (no history)
- name: dim_product
read:
connection: staging
path: products
format: delta
pattern:
type: dimension
params:
natural_key: product_id
surrogate_key: product_sk
scd_type: 1
track_cols: [name, category, price, status]
target: warehouse.dim_product
unknown_member: true
write:
connection: warehouse
path: dim_product
format: delta
mode: overwrite
# Date dimension (generated, no source read needed)
- name: dim_date
pattern:
type: date_dimension
params:
start_date: "2020-01-01"
end_date: "2030-12-31"
fiscal_year_start_month: 7
unknown_member: true
write:
connection: warehouse
path: dim_date
format: delta
mode: overwrite
Python API¶
from odibi.patterns.dimension import DimensionPattern
from odibi.context import EngineContext
# Create pattern instance
pattern = DimensionPattern(
engine=my_engine,
config=node_config # NodeConfig with params
)
# Pattern params come from the NodeConfig
from odibi.config import NodeConfig
node_config = NodeConfig(
name="dim_customer",
params={
"natural_key": "customer_id",
"surrogate_key": "customer_sk",
"scd_type": 2,
"track_cols": ["name", "email", "address"],
"target": "gold.dim_customer",
"unknown_member": True,
"audit": {
"load_timestamp": True,
"source_system": "crm"
}
}
)
pattern = DimensionPattern(engine=my_engine, config=node_config)
# Validate configuration
pattern.validate()
# Execute pattern
context = EngineContext(global_context, source_df, EngineType.SPARK)
result_df = pattern.execute(context)
Data Quality and Quarantine¶
Dimension patterns do not directly support quarantine functionality. However, dimensions can be validated using standard validation tests with on_fail: quarantine to route invalid source records to quarantine tables before dimension processing.
For orphan handling in fact tables (when dimension lookups fail), use the fact pattern's quarantine configuration with orphan_handling: quarantine.
See Also¶
- Date Dimension Pattern - Generate date dimensions
- Fact Pattern - Build fact tables with SK lookups
- Aggregation Pattern - Build aggregate tables
- Quarantine Feature - Route invalid data to quarantine tables
- YAML Schema Reference - Full configuration reference