Fact Pattern Tutorial¶

In this tutorial, you'll learn how to use Odibi's fact pattern to build fact tables with automatic surrogate key lookups, orphan handling, grain validation, and measure calculations.

What You'll Learn: - How surrogate key lookups work - Orphan handling strategies - Grain validation and deduplication - Measure calculations - Joining facts with dimensions

Source Data¶

We'll use orders data that references customers and products by their natural keys:

Source Data (orders.csv) - 30 rows:

order_id	customer_id	product_id	order_date	quantity	unit_price	status
ORD001	C001	P001	2024-01-15	1	1299.99	completed
ORD002	C001	P002	2024-01-15	2	29.99	completed
ORD003	C002	P003	2024-01-16	1	249.99	completed
ORD004	C003	P004	2024-01-16	3	49.99	completed
ORD005	C004	P005	2024-01-17	1	599.99	completed
ORD006	C005	P006	2024-01-17	1	149.99	completed
ORD007	C006	P007	2024-01-18	2	399.99	completed
ORD008	C007	P008	2024-01-18	4	45.99	completed
ORD009	C008	P009	2024-01-19	1	79.99	completed
ORD010	C009	P010	2024-01-19	1	189.99	completed
ORD011	C010	P001	2024-01-20	1	1299.99	completed
ORD012	C011	P002	2024-01-20	5	29.99	completed
ORD013	C012	P003	2024-01-21	2	249.99	completed
ORD014	C001	P004	2024-01-21	1	49.99	completed
ORD015	C002	P005	2024-01-22	1	599.99	pending
ORD016	C003	P006	2024-01-22	2	149.99	completed
ORD017	C004	P007	2024-01-23	1	399.99	completed
ORD018	C005	P008	2024-01-23	3	45.99	completed
ORD019	C006	P009	2024-01-24	2	79.99	completed
ORD020	C007	P010	2024-01-24	1	189.99	completed
ORD021	C008	P001	2024-01-25	1	1299.99	completed
ORD022	C009	P002	2024-01-25	3	29.99	completed
ORD023	C010	P003	2024-01-26	1	249.99	cancelled
ORD024	C011	P004	2024-01-26	2	49.99	completed
ORD025	C012	P005	2024-01-27	1	599.99	completed
ORD026	C001	P006	2024-01-27	1	149.99	completed
ORD027	C002	P007	2024-01-28	1	399.99	completed
ORD028	C003	P008	2024-01-28	2	45.99	completed
ORD029	C004	P009	2024-01-15	1	79.99	completed
ORD030	C005	P010	2024-01-16	1	189.99	completed

Dimension Tables (Pre-built)¶

Before building fact tables, we need dimension tables. Here are the dimensions from previous tutorials:

dim_customer (13 rows including unknown)¶

customer_sk	customer_id	name	region	is_current
0	-1	Unknown	Unknown	true
1	C001	Alice Johnson	North	true
2	C002	Bob Smith	South	true
3	C003	Carol White	North	true
4	C004	David Brown	East	true
5	C005	Emma Davis	West	true
6	C006	Frank Miller	South	true
7	C007	Grace Lee	East	true
8	C008	Henry Wilson	West	true
9	C009	Ivy Chen	North	true
10	C010	Jack Taylor	South	true
11	C011	Karen Martinez	East	true
12	C012	Leo Anderson	West	true

dim_product (11 rows including unknown)¶

product_sk	product_id	name	category
0	-1	Unknown	Unknown
1	P001	Laptop Pro 15	Electronics
2	P002	Wireless Mouse	Electronics
3	P003	Office Chair	Furniture
4	P004	USB-C Hub	Electronics
5	P005	Standing Desk	Furniture
6	P006	Mechanical Keyboard	Electronics
7	P007	Monitor 27"	Electronics
8	P008	Desk Lamp	Furniture
9	P009	Webcam HD	Electronics
10	P010	Filing Cabinet	Furniture

dim_date (15 rows for Jan 15-28 + unknown)¶

date_sk	full_date	day_of_week	month_name
0	1900-01-01	Unknown	Unknown
20240115	2024-01-15	Monday	January
20240116	2024-01-16	Tuesday	January
20240117	2024-01-17	Wednesday	January
20240118	2024-01-18	Thursday	January
20240119	2024-01-19	Friday	January
20240120	2024-01-20	Saturday	January
20240121	2024-01-21	Sunday	January
20240122	2024-01-22	Monday	January
20240123	2024-01-23	Tuesday	January
20240124	2024-01-24	Wednesday	January
20240125	2024-01-25	Thursday	January
20240126	2024-01-26	Friday	January
20240127	2024-01-27	Saturday	January
20240128	2024-01-28	Sunday	January

Step 1: Understanding SK Lookups¶

The fact pattern's main job is to replace natural keys with surrogate keys:

Source:     customer_id = "C001"
                    ↓
            Look up in dim_customer where customer_id = "C001"
                    ↓
Fact:       customer_sk = 1

This transformation happens for every dimension referenced.

YAML Configuration¶

project: fact_tutorial
engine: pandas

connections:
  source:
    type: local
    base_path: ./data
  warehouse:
    type: local
    base_path: ./warehouse

story:
  connection: warehouse
  path: stories

pipelines:
  - pipeline: build_fact_orders
    nodes:
      # First, load the dimension tables into context
      - name: dim_customer
        read:
          connection: warehouse
          path: dim_customer
          format: parquet

      - name: dim_product
        read:
          connection: warehouse
          path: dim_product
          format: parquet

      - name: dim_date
        read:
          connection: warehouse
          path: dim_date
          format: parquet

      # Then build the fact table
      - name: fact_orders
        depends_on: [dim_customer, dim_product, dim_date]
        read:
          connection: source
          path: orders.csv
          format: csv

        pattern:
          type: fact
          params:
            grain: [order_id]
            dimensions:
              - source_column: customer_id
                dimension_table: dim_customer
                dimension_key: customer_id
                surrogate_key: customer_sk
                scd2: true
              - source_column: product_id
                dimension_table: dim_product
                dimension_key: product_id
                surrogate_key: product_sk
              - source_column: order_date
                dimension_table: dim_date
                dimension_key: full_date
                surrogate_key: date_sk
            orphan_handling: unknown
            measures:
              - quantity
              - unit_price
              - line_total: "quantity * unit_price"
            audit:
              load_timestamp: true
              source_system: "pos"

        write:
          connection: warehouse
          path: fact_orders
          format: parquet
          mode: overwrite

The Transformation¶

Here's what happens to the first 10 rows:

Before (Source Data):

order_id	customer_id	product_id	order_date	quantity	unit_price
ORD001	C001	P001	2024-01-15	1	1299.99
ORD002	C001	P002	2024-01-15	2	29.99
ORD003	C002	P003	2024-01-16	1	249.99
ORD004	C003	P004	2024-01-16	3	49.99
ORD005	C004	P005	2024-01-17	1	599.99
ORD006	C005	P006	2024-01-17	1	149.99
ORD007	C006	P007	2024-01-18	2	399.99
ORD008	C007	P008	2024-01-18	4	45.99
ORD009	C008	P009	2024-01-19	1	79.99
ORD010	C009	P010	2024-01-19	1	189.99

After (Fact Table):

order_id	customer_sk	product_sk	date_sk	quantity	unit_price	line_total	status	load_timestamp
ORD001	1	1	20240115	1	1299.99	1299.99	completed	2024-01-30 10:00:00
ORD002	1	2	20240115	2	29.99	59.98	completed	2024-01-30 10:00:00
ORD003	2	3	20240116	1	249.99	249.99	completed	2024-01-30 10:00:00
ORD004	3	4	20240116	3	49.99	149.97	completed	2024-01-30 10:00:00
ORD005	4	5	20240117	1	599.99	599.99	completed	2024-01-30 10:00:00
ORD006	5	6	20240117	1	149.99	149.99	completed	2024-01-30 10:00:00
ORD007	6	7	20240118	2	399.99	799.98	completed	2024-01-30 10:00:00
ORD008	7	8	20240118	4	45.99	183.96	completed	2024-01-30 10:00:00
ORD009	8	9	20240119	1	79.99	79.99	completed	2024-01-30 10:00:00
ORD010	9	10	20240119	1	189.99	189.99	completed	2024-01-30 10:00:00

Key observations: - customer_id = "C001" → customer_sk = 1 - product_id = "P001" → product_sk = 1 - order_date = "2024-01-15" → date_sk = 20240115 - New column line_total calculated as quantity * unit_price - Original natural keys can be dropped or kept (configurable)

Step 2: Orphan Handling¶

What happens when a source record references a dimension value that doesn't exist?

Source with Orphans (orders_with_orphans.csv) - 5 orphan rows:

order_id	customer_id	product_id	order_date	quantity	unit_price
...	...	...	...	...	...
ORD031	C999	P001	2024-01-17	1	1299.99
ORD032	C888	P002	2024-01-18	2	29.99
ORD033	C777	P003	2024-01-19	1	249.99
ORD034	C666	P004	2024-01-20	1	49.99
ORD035	C555	P005	2024-01-21	1	599.99

Customer IDs C999, C888, C777, C666, C555 don't exist in dim_customer.

Strategy 1: Unknown (Default)¶

params:
  orphan_handling: unknown

Orphans are assigned to the unknown member (SK = 0):

order_id	product_sk	note
ORD031	1	C999 not found → SK = 0
ORD032	2	C888 not found → SK = 0
ORD033	3	C777 not found → SK = 0
ORD034	4	C666 not found → SK = 0
ORD035	5	C555 not found → SK = 0

Pros: Data isn't lost, can investigate later Cons: Need to ensure unknown member exists in dimension

Strategy 2: Reject¶

params:
  orphan_handling: reject

Pipeline fails with an error listing orphan values:

OrphanRecordError: Found 5 orphan records for dimension 'dim_customer':
  - customer_id='C999' (1 record)
  - customer_id='C888' (1 record)
  - customer_id='C777' (1 record)
  - customer_id='C666' (1 record)
  - customer_id='C555' (1 record)

Pros: Strict data quality enforcement Cons: Entire load fails, need to fix source data

Strategy 3: Quarantine¶

params:
  orphan_handling: quarantine

Orphan records are routed to a separate quarantine table:

fact_orders (valid records): 30 rows

fact_orders_quarantine (orphans): 5 rows

order_id	customer_id	orphan_reason
ORD031	C999	customer_id not found in dim_customer
ORD032	C888	customer_id not found in dim_customer
ORD033	C777	customer_id not found in dim_customer
ORD034	C666	customer_id not found in dim_customer
ORD035	C555	customer_id not found in dim_customer

Pros: Valid data loads, orphans are preserved for review Cons: Need to manage quarantine table

Step 3: Grain Validation¶

The grain defines what makes a fact row unique. For orders, it's typically order_id.

params:
  grain: [order_id]

What Happens with Duplicate Grain?¶

If your source has duplicate order IDs:

order_id	customer_id	quantity
ORD001	C001	1
ORD001	C001	2

The pattern detects this and raises an error:

GrainValidationError: Duplicate grain detected in fact_orders
  Grain columns: ['order_id']
  Duplicate count: 1
  Sample duplicates:
    - order_id='ORD001' (2 occurrences)

Enabling Deduplication¶

If duplicates are expected and you want to keep the latest:

params:
  grain: [order_id]
  deduplicate: true
  keys: [order_id]

This keeps only the last occurrence of each order_id.

Step 4: Measure Calculations¶

Measures are the numeric values in your fact table. You can: - Pass through existing columns - Rename columns - Calculate derived values

params:
  measures:
    # Pass through
    - quantity
    - unit_price

    # Rename (maps status to order_status)
    - order_status: status

    # Calculate
    - line_total: "quantity * unit_price"
    - discount_amount: "unit_price * 0.1"
    - net_total: "quantity * unit_price * 0.9"

Output with Calculated Measures¶

order_id	quantity	unit_price	line_total	discount_amount	net_total
ORD001	1	1299.99	1299.99	130.00	1169.99
ORD002	2	29.99	59.98	3.00	53.98
ORD003	1	249.99	249.99	25.00	224.99

Complete Fact Table Output¶

Here's the complete fact_orders table (30 rows):

order_id	customer_sk	product_sk	date_sk	quantity	unit_price	line_total	status	load_timestamp
ORD001	1	1	20240115	1	1299.99	1299.99	completed	2024-01-30 10:00:00
ORD002	1	2	20240115	2	29.99	59.98	completed	2024-01-30 10:00:00
ORD003	2	3	20240116	1	249.99	249.99	completed	2024-01-30 10:00:00
ORD004	3	4	20240116	3	49.99	149.97	completed	2024-01-30 10:00:00
ORD005	4	5	20240117	1	599.99	599.99	completed	2024-01-30 10:00:00
ORD006	5	6	20240117	1	149.99	149.99	completed	2024-01-30 10:00:00
ORD007	6	7	20240118	2	399.99	799.98	completed	2024-01-30 10:00:00
ORD008	7	8	20240118	4	45.99	183.96	completed	2024-01-30 10:00:00
ORD009	8	9	20240119	1	79.99	79.99	completed	2024-01-30 10:00:00
ORD010	9	10	20240119	1	189.99	189.99	completed	2024-01-30 10:00:00
ORD011	10	1	20240120	1	1299.99	1299.99	completed	2024-01-30 10:00:00
ORD012	11	2	20240120	5	29.99	149.95	completed	2024-01-30 10:00:00
ORD013	12	3	20240121	2	249.99	499.98	completed	2024-01-30 10:00:00
ORD014	1	4	20240121	1	49.99	49.99	completed	2024-01-30 10:00:00
ORD015	2	5	20240122	1	599.99	599.99	pending	2024-01-30 10:00:00
ORD016	3	6	20240122	2	149.99	299.98	completed	2024-01-30 10:00:00
ORD017	4	7	20240123	1	399.99	399.99	completed	2024-01-30 10:00:00
ORD018	5	8	20240123	3	45.99	137.97	completed	2024-01-30 10:00:00
ORD019	6	9	20240124	2	79.99	159.98	completed	2024-01-30 10:00:00
ORD020	7	10	20240124	1	189.99	189.99	completed	2024-01-30 10:00:00
ORD021	8	1	20240125	1	1299.99	1299.99	completed	2024-01-30 10:00:00
ORD022	9	2	20240125	3	29.99	89.97	completed	2024-01-30 10:00:00
ORD023	10	3	20240126	1	249.99	249.99	cancelled	2024-01-30 10:00:00
ORD024	11	4	20240126	2	49.99	99.98	completed	2024-01-30 10:00:00
ORD025	12	5	20240127	1	599.99	599.99	completed	2024-01-30 10:00:00
ORD026	1	6	20240127	1	149.99	149.99	completed	2024-01-30 10:00:00
ORD027	2	7	20240128	1	399.99	399.99	completed	2024-01-30 10:00:00
ORD028	3	8	20240128	2	45.99	91.98	completed	2024-01-30 10:00:00
ORD029	4	9	20240115	1	79.99	79.99	completed	2024-01-30 10:00:00
ORD030	5	10	20240116	1	189.99	189.99	completed	2024-01-30 10:00:00

Complete Runnable Example¶

# File: odibi_fact_tutorial.yaml
project: fact_tutorial
engine: pandas

connections:
  source:
    type: local
    base_path: ./examples/tutorials/dimensional_modeling/data
  warehouse:
    type: local
    base_path: ./warehouse

story:
  connection: warehouse
  path: stories

pipelines:
  - pipeline: build_fact_orders
    description: "Build fact table with SK lookups"
    nodes:
      # Load dimension tables
      - name: dim_customer
        read:
          connection: warehouse
          path: dim_customer
          format: parquet

      - name: dim_product
        read:
          connection: warehouse
          path: dim_product
          format: parquet

      - name: dim_date
        read:
          connection: warehouse
          path: dim_date
          format: parquet

      # Build fact table
      - name: fact_orders
        description: "Orders fact table with surrogate key lookups"
        depends_on: [dim_customer, dim_product, dim_date]
        read:
          connection: source
          path: orders.csv
          format: csv

        pattern:
          type: fact
          params:
            # Define the grain (uniqueness)
            grain: [order_id]

            # Define dimension lookups
            dimensions:
              - source_column: customer_id
                dimension_table: dim_customer
                dimension_key: customer_id
                surrogate_key: customer_sk
                scd2: true  # Filter to is_current = true

              - source_column: product_id
                dimension_table: dim_product
                dimension_key: product_id
                surrogate_key: product_sk

              - source_column: order_date
                dimension_table: dim_date
                dimension_key: full_date
                surrogate_key: date_sk

            # Handle missing dimension values
            orphan_handling: unknown

            # Define measures
            measures:
              - quantity
              - unit_price
              - line_total: "quantity * unit_price"

            # Add audit columns
            audit:
              load_timestamp: true
              source_system: "pos"

        write:
          connection: warehouse
          path: fact_orders
          format: parquet
          mode: overwrite

Python API Alternative¶

from odibi.patterns.fact import FactPattern
from odibi.context import EngineContext
from odibi.enums import EngineType
import pandas as pd

# Load source data
orders_df = pd.read_csv("examples/tutorials/dimensional_modeling/data/orders.csv")

# Load dimensions
dim_customer = pd.read_parquet("warehouse/dim_customer")
dim_product = pd.read_parquet("warehouse/dim_product")
dim_date = pd.read_parquet("warehouse/dim_date")

# Create context and register dimensions
context = EngineContext(df=orders_df, engine_type=EngineType.PANDAS)
context.register("dim_customer", dim_customer)
context.register("dim_product", dim_product)
context.register("dim_date", dim_date)

# Create pattern
pattern = FactPattern(params={
    "grain": ["order_id"],
    "dimensions": [
        {
            "source_column": "customer_id",
            "dimension_table": "dim_customer",
            "dimension_key": "customer_id",
            "surrogate_key": "customer_sk",
            "scd2": True
        },
        {
            "source_column": "product_id",
            "dimension_table": "dim_product",
            "dimension_key": "product_id",
            "surrogate_key": "product_sk"
        },
        {
            "source_column": "order_date",
            "dimension_table": "dim_date",
            "dimension_key": "full_date",
            "surrogate_key": "date_sk"
        }
    ],
    "orphan_handling": "unknown",
    "measures": [
        "quantity",
        "unit_price",
        {"line_total": "quantity * unit_price"}
    ],
    "audit": {
        "load_timestamp": True,
        "source_system": "pos"
    }
})

# Execute pattern
result_df = pattern.execute(context)

print(f"Generated {len(result_df)} fact rows")
print(result_df.head(10))

Querying the Star Schema¶

Now you can run powerful analytical queries:

"Sales by region"¶

SELECT
    c.region,
    COUNT(*) AS order_count,
    SUM(f.line_total) AS total_revenue
FROM fact_orders f
JOIN dim_customer c ON f.customer_sk = c.customer_sk
WHERE c.is_current = true
GROUP BY c.region
ORDER BY total_revenue DESC;

region	order_count	total_revenue
North	8	2,599.87
East	8	2,023.87
South	7	2,449.92
West	7	2,629.88

"Daily sales trend"¶

SELECT
    d.full_date,
    d.day_of_week,
    SUM(f.line_total) AS daily_revenue
FROM fact_orders f
JOIN dim_date d ON f.date_sk = d.date_sk
GROUP BY d.full_date, d.day_of_week
ORDER BY d.full_date;

"Top products by category"¶

SELECT
    p.category,
    p.name,
    SUM(f.quantity) AS units_sold,
    SUM(f.line_total) AS revenue
FROM fact_orders f
JOIN dim_product p ON f.product_sk = p.product_sk
GROUP BY p.category, p.name
ORDER BY revenue DESC;

What You Learned¶

In this tutorial, you learned:

Surrogate key lookups automatically replace natural keys with dimension SKs
scd2: true filters to current dimension rows when looking up SCD2 dimensions
Orphan handling strategies: unknown (default), reject, or quarantine
Grain validation detects duplicate records at the grain level
Measures can be passed through, renamed, or calculated
depends_on ensures dimension tables are loaded before the fact pattern runs
Audit columns track when and where data was loaded

Next Steps¶

Now that you have a complete star schema, let's build aggregated tables for faster reporting.

Next: Aggregation Pattern Tutorial

Previous	Up	Next
Date Dimension	Tutorials	Aggregation Pattern

Reference¶

For complete parameter documentation, see: Fact Pattern Reference