Dimensional Modeling Guide for ODIBI¶
A practical reference for building data warehouses with ODIBI.
Table of Contents¶
- The Problem We're Solving
- Facts and Dimensions
- The Star Schema
- Natural Keys vs Surrogate Keys
- Bronze → Silver → Gold Flow
- Where Do IDs Come From?
- Lookup Tables vs Dimension Tables
- Slowly Changing Dimensions (SCD)
- The Date Dimension
- Aggregations
- Common Mistakes to Avoid
- ODIBI Current State vs Target
The Problem We're Solving¶
Raw Data is Hard to Query¶
Your source systems store data for operations, not analysis:
orders.csv
| customer_email | product_name | quantity | price | timestamp |
|-------------------|--------------|----------|-------|---------------------|
| john@mail.com | Latte | 2 | 11.00 | 2024-01-15 09:15:00 |
To answer "What was revenue by product category for Q4 weekends?", you need: - Product category (not in orders) - Whether it's a weekend (calculated from timestamp) - Q4 filter (calculated from timestamp)
Dimensional modeling pre-calculates and organizes this context.
Facts and Dimensions¶
Fact Table = What Happened¶
Events, transactions, measurements. Numbers you can add/count/average.
| Contains | Example |
|---|---|
| Measures (numbers) | quantity, price, total_amount |
| Foreign keys (pointers to dimensions) | customer_sk, product_sk, date_sk |
| Degenerate dimensions (IDs with no table) | order_id, invoice_number |
Key insight: Facts are usually append-only. Once it happened, it happened.
Dimension Table = Context About What Happened¶
Who, what, when, where, why. Descriptive attributes.
| Contains | Example |
|---|---|
| Primary key | customer_sk |
| Natural key | customer_id (from source system) |
| Descriptive attributes | name, email, city, segment |
| Hierarchies | city → state → country |
Key insight: Dimensions change over time. A customer might move cities.
The Star Schema¶
dim_customer
│
│ customer_sk
│
dim_product ─────── fact_orders ─────── dim_date
product_sk │ date_sk
│
dim_region
│
region_sk
Why this shape?
- Storage efficiency — Store "John Smith, Premium, NYC" once, reference by ID
- Query speed — Filter small dimension tables first, then join to facts
- Flexibility — Add new attributes to dimensions, all reports get them
- Single source of truth — Customer's segment defined in ONE place
Natural Keys vs Surrogate Keys¶
| Type | What It Is | Example | Who Creates It |
|---|---|---|---|
| Natural Key | Business identifier | customer_id, email, product_sku |
Source system or you (via hash) |
| Surrogate Key | Warehouse-generated integer | customer_sk = 1001 |
Data warehouse (auto-increment) |
Why Use Surrogate Keys?¶
- Faster JOINs — Integers are faster than strings
- SCD2 support — One
customer_idcan have multiplecustomer_skvalues (history) - Survives source changes — If source system changes IDs, yours don't
- Unknown member —
customer_sk = 0for orphan records
The "Unknown" Member¶
What if an order references customer_id = 999 but that customer doesn't exist?
Solution: Create a special row in every dimension:
| customer_sk | customer_id | name | city |
|---|---|---|---|
| 0 | -1 | Unknown | Unknown |
| 1001 | 47 | John | NYC |
Orphan facts JOIN to customer_sk = 0 — you never lose data.
Bronze → Silver → Gold Flow¶
┌─────────────────────────────────────────────────────────────────────┐
│ BRONZE │
│ Raw data, as-is from source. Messy, duplicates, nulls, no IDs. │
│ Example: raw_orders.csv, raw_customers.json │
└─────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────┐
│ SILVER │
│ Cleaned, validated, deduplicated. NATURAL KEYS assigned. │
│ Example: clean_orders, clean_customers │
│ Keys: customer_id (hash or from source), product_id, order_id │
└─────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────┐
│ GOLD │
│ Dimensional model. SURROGATE KEYS, SCD history, aggregates. │
│ Example: dim_customer, dim_product, fact_orders, agg_daily_sales │
│ Keys: customer_sk, product_sk, date_sk (integers) │
└─────────────────────────────────────────────────────────────────────┘
Key Point: Silver Uses Natural Keys, Gold Uses Surrogate Keys¶
| Layer | Keys Used |
|---|---|
| Bronze | Whatever source provides (emails, names, raw IDs) |
| Silver | Natural keys — business identifiers (customer_id, product_id) |
| Gold | Surrogate keys — warehouse-generated integers (customer_sk) |
Where Do IDs Come From?¶
Scenario A: Source System Has IDs (Common)¶
Your CRM already has customer_id = 47. Just pass it through.
Scenario B: Source System Has NO IDs (Your Data)¶
You only have email and product_name. You must generate IDs.
Option 1: Hash (Recommended)
- Same email = same ID, forever
- Deterministic, stateless, simple
- Ugly IDs but who cares
Option 2: Persistent Lookup Table
- Store
email → customer_idmapping - On each run, only assign new IDs to new emails
- Nice sequential numbers (1, 2, 3...)
- More complex, requires state
⚠️ NEVER use RANK() or ROW_NUMBER() alone — IDs will shift when data changes!
Lookup Tables vs Dimension Tables¶
These are NOT the same thing.
| Lookup Table | Dimension Table |
|---|---|
| Silver layer | Gold layer |
Maps email → customer_id |
Has customer_sk, customer_id, name, city, ... |
| Just a helper for ID generation | Official, versioned, surrogate-keyed entity |
| No history | SCD2 history tracking |
| Simple key-value | Rich with attributes |
The Flow¶
Bronze: raw_customers (email, name, city)
│
▼
Silver: customer_lookup (email → customer_id via hash)
│
▼
Silver: clean_customers (customer_id, email, name, city)
│
▼
Gold: dim_customer (customer_SK, customer_id, name, city, valid_from, valid_to, is_current)
Slowly Changing Dimensions (SCD)¶
What happens when a customer moves from NYC to LA?
SCD Type 0: Keep Original¶
Never update. Always shows original value.
Use case: Birth date, original signup date
SCD Type 1: Overwrite¶
Update in place. Lose history.
| customer_sk | customer_id | city |
|---|---|---|
| 1001 | 47 | LA |
Use case: Correcting typos, current-state-only reporting
SCD Type 2: Track History (Most Common)¶
Create new row, close old row.
| customer_sk | customer_id | city | valid_from | valid_to | is_current |
|---|---|---|---|---|---|
| 1001 | 47 | NYC | 2020-01-01 | 2023-03-15 | false |
| 1002 | 47 | LA | 2023-03-15 | NULL | true |
Use case: Historical analysis, "What was their city when they ordered?"
Looking Up Surrogate Keys for Facts¶
When building fact tables, filter by is_current = true:
SELECT
o.order_id,
dc.customer_sk
FROM clean_orders o
LEFT JOIN dim_customer dc
ON o.customer_id = dc.customer_id
AND dc.is_current = true -- Only match current version!
The Date Dimension¶
Every business question involves time. Pre-calculate all date attributes.
| Column | Example | Why Useful |
|---|---|---|
| date_sk | 20240115 | Primary key |
| full_date | 2024-01-15 | Actual date |
| day_of_week | Monday | Filter by weekday |
| is_weekend | false | Weekend analysis |
| month | 1 | Monthly aggregation |
| month_name | January | Display-friendly |
| quarter | 1 | Quarterly reporting |
| year | 2024 | Annual comparison |
| is_holiday | false | Holiday impact |
Why Pre-Calculate?¶
Without date dimension:
With date dimension:
Aggregations¶
Why Pre-Aggregate?¶
Without aggregates:
With aggregates:
The Grain Concept¶
Grain = what one row represents
| Table | Grain |
|---|---|
| fact_sales | One order line item |
| agg_daily_sales | All sales for a day + product + region |
| agg_monthly_sales | All sales for a month + product + region |
Rule: You can roll UP (daily → monthly) but not drill DOWN (monthly → daily) without the source.
When to Use Natural Keys vs Surrogate Keys (Practical Decision)¶
The theory says "always use surrogate keys in gold." Reality is more nuanced.
Use Natural Keys When:¶
- Source system IDs are stable — e.g.,
entity_idfrom operational systems,location_idfrom reference tables - You're a solo DE and simplicity matters — fewer moving parts = fewer bugs
- No multi-source integration with conflicting IDs — one source per entity
- No need for unknown member (SK=0) handling — your FKs always resolve
Example: Analytics project uses entity_id, location_id, facility_id as natural keys because they're system-generated and never change.
Use Surrogate Keys When:¶
- Natural keys might change or get recycled — e.g., product codes reassigned
- Joining on composite keys (3+ columns) is unwieldy —
JOIN ON a=a AND b=b AND c=c AND d=d - Integrating same entity from multiple sources — same
customer_idmeans different things in CRM vs ERP - Need unknown member rows for orphan FK handling — SK=0 catches missing references
When to Use scd2 Function vs DimensionPattern¶
Both handle SCD Type 2, but they serve different purposes.
| Scenario | Use | Why |
|---|---|---|
| Need custom SQL transforms before SCD | scd2 function |
Pattern doesn't support mid-pipeline SQL |
| All-in-one with surrogate keys + unknown member | DimensionPattern |
Handles everything declaratively |
| Natural keys sufficient, no surrogates needed | scd2 function |
Lighter weight, less overhead |
| Building classic Kimball star schema from scratch | DimensionPattern |
Full feature set |
scd2 Function Example (Custom SQL + SCD)¶
transform:
steps:
- sql_file: "sql/cleaned_vw_dim_location.sql" # Custom transform first
- function: scd2
params:
connection: warehouse_prod
path: "analytics/silver/cleaned_vw_dim_location"
keys: [location_id]
track_cols: [LocationCode, Region, Site, Department]
effective_time_col: _extracted_at
DimensionPattern Example (All-in-One)¶
pattern:
type: dimension
params:
natural_key: customer_id
surrogate_key: customer_sk
scd_type: 2
track_cols: [name, email, address]
target: warehouse.dim_customer
unknown_member: true
audit:
load_timestamp: true
source_system: "crm"
Common Mistakes to Avoid¶
❌ Using RANK() to Generate IDs¶
-- WRONG: IDs will shift when new data arrives!
SELECT DENSE_RANK() OVER (ORDER BY email) as customer_id
Fix: Use hash or persistent lookup table.
❌ Building Lookups from Bronze (Dirty Data)¶
Fix: Build lookups from Silver (cleaned data).
❌ Joining Fact to Dimension Without is_current Filter¶
-- WRONG: May get multiple matches (historical + current)
SELECT * FROM fact_orders f
JOIN dim_customer dc ON f.customer_id = dc.customer_id
Fix: Add AND dc.is_current = true.
❌ Confusing Lookup Tables with Dimension Tables¶
- Lookup = Silver, just maps keys
- Dimension = Gold, has surrogate keys + history + attributes
❌ Not Having an Unknown Member¶
If a fact references a customer that doesn't exist, the JOIN fails and you lose data.
Fix: Every dimension should have SK = 0, name = 'Unknown'.
ODIBI Patterns¶
Implemented Patterns¶
| Pattern | What It Does |
|---|---|
DimensionPattern |
Auto SK + SCD + unknown member + audit columns |
FactPattern |
Dedup + pass through + SK lookup + orphan handling |
SCD2Pattern |
Track history with automatic surrogate keys |
MergePattern |
Upsert logic |
AggregationPattern |
Declarative GROUP BY + time rollups |
DateDimensionPattern |
Generate complete date dimension |
Target: Declarative Dimensional Modeling¶
# This should just work
- name: dim_customer
pattern:
type: dimension
params:
natural_key: customer_id
surrogate_key: customer_sk
scd_type: 2
unknown_member: true
- name: fact_orders
pattern:
type: fact
params:
grain: [order_id]
dimensions:
- source_column: customer_id
dimension_table: dim_customer
surrogate_key: customer_sk
orphan_handling: unknown
Quick Reference: The Mental Model¶
┌─────────────────────────────────────────────────────────────────────┐
│ BUSINESS QUESTION │
│ "What was revenue by product category for Q4 weekends?" │
└─────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────┐
│ GOLD LAYER │
│ fact_orders ←→ dim_product ←→ dim_date │
│ Surrogate keys, SCD2 history, pre-aggregated │
└─────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────┐
│ SILVER LAYER │
│ clean_orders, clean_customers, customer_lookup │
│ Natural keys (hash or source), validated, deduplicated │
└─────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────┐
│ BRONZE LAYER │
│ raw_orders.csv, raw_customers.json │
│ As-is from source, messy, no IDs if source doesn't have them │
└─────────────────────────────────────────────────────────────────────┘
Glossary¶
| Term | Definition |
|---|---|
| Fact | Table of events/transactions with measures |
| Dimension | Table of context (who, what, when, where) |
| Grain | What one row represents |
| Natural Key | Business identifier (customer_id) |
| Surrogate Key | Warehouse-generated integer (customer_sk) |
| SCD | Slowly Changing Dimension |
| Star Schema | Fact in center, dimensions around it |
| Lookup Table | Helper table to map values to IDs |
| Unknown Member | Row with SK=0 for orphan handling |
| Aggregate | Pre-calculated summary table |