Dimension Pattern¶

The dimension pattern builds complete dimension tables with automatic surrogate key generation and SCD (Slowly Changing Dimension) support.

Integration with Odibi YAML¶

Patterns are used via the pattern: block in a node config. The pattern type goes in pattern: type: and configuration goes in pattern: params:.

project: my_warehouse
engine: spark

connections:
  staging:
    type: delta
    path: /mnt/staging
  warehouse:
    type: delta
    path: /mnt/warehouse

story:
  connection: warehouse
  path: stories

pipelines:
  - pipeline: build_dimensions
    nodes:
      - name: dim_customer
        read:
          connection: staging
          path: customers
          format: delta

        # Use dimension pattern
        pattern:
          type: dimension
          params:
            natural_key: customer_id
            surrogate_key: customer_sk
            scd_type: 2
            track_cols: [name, email, address]
            target: warehouse.dim_customer
            unknown_member: true
            audit:
              load_timestamp: true
              source_system: "crm"

        write:
          connection: warehouse
          path: dim_customer
          format: delta
          mode: overwrite

Features¶

Auto-generate integer surrogate keys (MAX(existing) + ROW_NUMBER for new rows)
SCD Type 0 (static - never update existing records)
SCD Type 1 (overwrite - update in place, no history)
SCD Type 2 (history tracking - full audit trail)
Unknown member row (SK=0) for orphan FK handling
Audit columns (load_timestamp, source_system)

Parameters¶

Parameter	Type	Required	Default	Description
`natural_key`	str	Yes	-	Natural/business key column name
`surrogate_key`	str	Yes	-	Surrogate key column name to generate
`scd_type`	int	No	1	0=static, 1=overwrite, 2=history tracking
`track_cols`	list	For SCD1/2	-	Columns to track for changes
`target`	str	For SCD2	-	Target table path (required to read existing history)
`unknown_member`	bool	No	false	Insert a row with SK=0 for orphan FK handling
`audit`	dict	No	{}	Audit column configuration
`valid_from_col`	str	No	valid_from	Name of the SCD2 start date column
`valid_to_col`	str	No	valid_to	Name of the SCD2 end date column
`is_current_col`	str	No	is_current	Name of the SCD2 current flag column

Audit Config¶

params:
  # ... other params ...
  audit:
    load_timestamp: true      # Add load_timestamp column
    source_system: "pos"      # Add source_system column with this value

SCD Type 0 (Static)¶

Static dimensions never update existing records. Only new records (not matching natural key) are inserted.

Use case: Reference data that never changes (ISO country codes, fixed lookup values).

nodes:
  - name: dim_country
    read:
      connection: staging
      path: countries
    pattern:
      type: dimension
      params:
        natural_key: country_code
        surrogate_key: country_sk
        scd_type: 0
        target: warehouse.dim_country
    write:
      connection: warehouse
      path: dim_country
      mode: overwrite

SCD Type 1 (Overwrite)¶

Overwrite dimensions update existing records in place. No history is kept.

Use case: Attributes where you only care about the current value (customer email, product price).

nodes:
  - name: dim_customer
    read:
      connection: staging
      path: customers
    pattern:
      type: dimension
      params:
        natural_key: customer_id
        surrogate_key: customer_sk
        scd_type: 1
        track_cols: [name, email, address]
        target: warehouse.dim_customer
        audit:
          load_timestamp: true
    write:
      connection: warehouse
      path: dim_customer
      mode: overwrite

SCD Type 2 (History Tracking)¶

History-tracking dimensions preserve full audit trail. Old records are closed, new versions are opened.

Use case: Slowly changing attributes where historical accuracy matters (customer address for point-in-time reporting).

nodes:
  - name: dim_customer
    read:
      connection: staging
      path: customers
    pattern:
      type: dimension
      params:
        natural_key: customer_id
        surrogate_key: customer_sk
        scd_type: 2
        track_cols: [name, email, address, city, state]
        target: warehouse.dim_customer
        valid_from_col: valid_from     # Optional, default: valid_from
        valid_to_col: valid_to         # Optional, default: valid_to
        is_current_col: is_current     # Optional, default: is_current
        unknown_member: true
        audit:
          load_timestamp: true
          source_system: "crm"
    write:
      connection: warehouse
      path: dim_customer
      mode: overwrite

Generated Columns: - valid_from: Timestamp when this version became active - valid_to: Timestamp when this version was superseded (NULL for current) - is_current: Boolean flag (true for current version)

Unknown Member Handling¶

Enable unknown_member: true to automatically insert a row with SK=0. This allows fact tables to reference unknown dimensions without FK violations.

Generated Unknown Member Row:

customer_sk	customer_id	name	email	valid_from	is_current
0	-1	Unknown	Unknown	1900-01-01	true

Full Star Schema Example¶

Complete pipeline building dimensions for a star schema:

project: sales_warehouse
engine: spark

connections:
  staging:
    type: delta
    path: /mnt/staging
  warehouse:
    type: delta
    path: /mnt/warehouse

story:
  connection: warehouse
  path: stories

system:
  connection: warehouse
  path: _system_catalog

pipelines:
  - pipeline: build_dimensions
    nodes:
      # Customer dimension with SCD2
      - name: dim_customer
        read:
          connection: staging
          path: customers
          format: delta
        pattern:
          type: dimension
          params:
            natural_key: customer_id
            surrogate_key: customer_sk
            scd_type: 2
            track_cols:
              - name
              - email
              - phone
              - address_line_1
              - city
              - state
              - postal_code
            target: warehouse.dim_customer
            unknown_member: true
            audit:
              load_timestamp: true
              source_system: "salesforce"
        write:
          connection: warehouse
          path: dim_customer
          format: delta
          mode: overwrite

      # Product dimension with SCD1 (no history)
      - name: dim_product
        read:
          connection: staging
          path: products
          format: delta
        pattern:
          type: dimension
          params:
            natural_key: product_id
            surrogate_key: product_sk
            scd_type: 1
            track_cols: [name, category, price, status]
            target: warehouse.dim_product
            unknown_member: true
        write:
          connection: warehouse
          path: dim_product
          format: delta
          mode: overwrite

      # Date dimension (generated, no source read needed)
      - name: dim_date
        pattern:
          type: date_dimension
          params:
            start_date: "2020-01-01"
            end_date: "2030-12-31"
            fiscal_year_start_month: 7
            unknown_member: true
        write:
          connection: warehouse
          path: dim_date
          format: delta
          mode: overwrite

Python API¶

from odibi.patterns.dimension import DimensionPattern
from odibi.context import EngineContext

# Create pattern instance
pattern = DimensionPattern(
    engine=my_engine,
    config=node_config  # NodeConfig with params
)

# Pattern params come from the NodeConfig
from odibi.config import NodeConfig

node_config = NodeConfig(
    name="dim_customer",
    params={
        "natural_key": "customer_id",
        "surrogate_key": "customer_sk",
        "scd_type": 2,
        "track_cols": ["name", "email", "address"],
        "target": "gold.dim_customer",
        "unknown_member": True,
        "audit": {
            "load_timestamp": True,
            "source_system": "crm"
        }
    }
)

pattern = DimensionPattern(engine=my_engine, config=node_config)

# Validate configuration
pattern.validate()

# Execute pattern
context = EngineContext(global_context, source_df, EngineType.SPARK)
result_df = pattern.execute(context)

Data Quality and Quarantine¶

Dimension patterns do not directly support quarantine functionality. However, dimensions can be validated using standard validation tests with on_fail: quarantine to route invalid source records to quarantine tables before dimension processing.

For orphan handling in fact tables (when dimension lookups fail), use the fact pattern's quarantine configuration with orphan_handling: quarantine.