Skip to content

FK Validation Tutorial

In this tutorial, you'll learn how to validate foreign key relationships between fact and dimension tables to ensure data quality.

What You'll Learn: - Why validate foreign keys - Defining relationships - Detecting orphan records - Handling orphans with different strategies - Generating lineage diagrams


Why Validate Foreign Keys?

In a star schema, fact tables reference dimension tables via foreign keys. But what happens when: - A customer places an order, but the customer isn't in dim_customer? - An order references a product that was never loaded? - A date value doesn't exist in the date dimension?

These are called orphan recordsβ€”facts that reference non-existent dimensions.

Problems with orphans: - Reports show "Unknown" values - Counts and sums may be incorrect - Data quality issues go undetected - Downstream analytics are unreliable

FK validation helps you detect, report, and handle these issues.


Source Data

We'll use the fact_orders table with some intentional orphan records:

fact_orders (35 rows including 5 orphans)

order_id customer_sk product_sk date_sk line_total status
ORD001 1 1 20240115 1299.99 completed
ORD002 1 2 20240115 59.98 completed
... ... ... ... ... ...
ORD030 5 10 20240116 189.99 completed
ORD031 99 1 20240117 1299.99 completed
ORD032 88 2 20240118 29.99 completed
ORD033 77 3 20240119 249.99 completed
ORD034 66 4 20240120 49.99 completed
ORD035 55 5 20240121 599.99 completed

Orphan records: ORD031-ORD035 have customer_sk values (99, 88, 77, 66, 55) that don't exist in dim_customer.

dim_customer (13 rows)

customer_sk customer_id name
0 -1 Unknown
1 C001 Alice Johnson
2 C002 Bob Smith
... ... ...
12 C012 Leo Anderson

Note: customer_sk values 55, 66, 77, 88, 99 do NOT exist.


Step 1: Define Relationships

First, declare the FK relationships in your star schema:

Python Code

from odibi.validation.fk import RelationshipConfig, RelationshipRegistry

# Define all FK relationships
relationships = [
    RelationshipConfig(
        name="orders_to_customers",
        fact="fact_orders",
        dimension="dim_customer",
        fact_key="customer_sk",
        dimension_key="customer_sk",
        nullable=False,
        on_violation="error"
    ),
    RelationshipConfig(
        name="orders_to_products",
        fact="fact_orders",
        dimension="dim_product",
        fact_key="product_sk",
        dimension_key="product_sk",
        nullable=False,
        on_violation="error"
    ),
    RelationshipConfig(
        name="orders_to_dates",
        fact="fact_orders",
        dimension="dim_date",
        fact_key="date_sk",
        dimension_key="date_sk",
        nullable=True,  # Pending orders may not have dates
        on_violation="warn"
    )
]

# Create registry
registry = RelationshipRegistry(relationships=relationships)

YAML Alternative

# relationships.yaml
relationships:
  - name: orders_to_customers
    fact: fact_orders
    dimension: dim_customer
    fact_key: customer_sk
    dimension_key: customer_sk
    nullable: false
    on_violation: error

  - name: orders_to_products
    fact: fact_orders
    dimension: dim_product
    fact_key: product_sk
    dimension_key: product_sk
    nullable: false
    on_violation: error

  - name: orders_to_dates
    fact: fact_orders
    dimension: dim_date
    fact_key: date_sk
    dimension_key: date_sk
    nullable: true
    on_violation: warn

RelationshipConfig Fields

Field Type Required Description
name str Yes Unique identifier
fact str Yes Fact table name
dimension str Yes Dimension table name
fact_key str Yes FK column in fact
dimension_key str Yes PK/SK column in dimension
nullable bool No Whether nulls are allowed (default: false)
on_violation str No Action: "error", "warn", "filter" (default: "error")

Step 2: Validate a Clean Fact Table

Let's first validate a fact table with NO orphans:

Python Code

from odibi.validation.fk import FKValidator
from odibi.context import EngineContext
from odibi.enums import EngineType
import pandas as pd

# Load data
fact_orders = pd.read_parquet("warehouse/fact_orders")  # Clean data - 30 rows
dim_customer = pd.read_parquet("warehouse/dim_customer")
dim_product = pd.read_parquet("warehouse/dim_product")
dim_date = pd.read_parquet("warehouse/dim_date")

# Setup context
context = EngineContext(df=None, engine_type=EngineType.PANDAS)
context.register("dim_customer", dim_customer)
context.register("dim_product", dim_product)
context.register("dim_date", dim_date)

# Create validator
validator = FKValidator(registry)

# Validate
report = validator.validate_fact(fact_orders, "fact_orders", context)

# Check results
print(f"All valid: {report.all_valid}")
print(f"Total relationships: {report.total_relationships}")
print(f"Valid relationships: {report.valid_relationships}")

Validation Report (Clean Data)

All valid: True
Total relationships: 3
Valid relationships: 3

Relationship: orders_to_customers
  Status: VALID
  Total rows: 30
  Orphan count: 0

Relationship: orders_to_products
  Status: VALID
  Total rows: 30
  Orphan count: 0

Relationship: orders_to_dates
  Status: VALID
  Total rows: 30
  Orphan count: 0

Step 3: Validate with Orphan Records

Now let's validate the data with 5 orphan records:

Python Code

# Load data with orphans
fact_orders_dirty = pd.read_csv("data/orders_with_orphans.csv")

# This has 35 rows - 5 with invalid customer_sk values

# Validate
report = validator.validate_fact(fact_orders_dirty, "fact_orders", context)

print(f"All valid: {report.all_valid}")
print(f"Orphan records found: {len(report.orphan_records)}")

Validation Report (With Orphans)

All valid: False
Total relationships: 3
Valid relationships: 2
Orphan records found: 5

Relationship: orders_to_customers
  Status: INVALID
  Total rows: 35
  Orphan count: 5
  Orphan values (sample):
    - customer_sk = 99 (1 occurrence)
    - customer_sk = 88 (1 occurrence)
    - customer_sk = 77 (1 occurrence)
    - customer_sk = 66 (1 occurrence)
    - customer_sk = 55 (1 occurrence)

Relationship: orders_to_products
  Status: VALID
  Total rows: 35
  Orphan count: 0

Relationship: orders_to_dates
  Status: VALID
  Total rows: 35
  Orphan count: 0

Detailed Orphan Records

# Get the orphan records
for orphan in report.orphan_records:
    print(f"Order {orphan.order_id}: customer_sk={orphan.customer_sk} not found")
order_id customer_sk reason
ORD031 99 Not found in dim_customer
ORD032 88 Not found in dim_customer
ORD033 77 Not found in dim_customer
ORD034 66 Not found in dim_customer
ORD035 55 Not found in dim_customer

Step 4: Handle Orphans with Different Strategies

Strategy 1: Error (Default)

Raise an exception when orphans are found:

from odibi.validation.fk import validate_fk_on_load

try:
    validated_df = validate_fk_on_load(
        fact_df=fact_orders_dirty,
        relationships=relationships,
        context=context,
        on_failure="error"
    )
except ValueError as e:
    print(f"Validation failed: {e}")

Output:

Validation failed: FK validation found 5 orphan records:
  - orders_to_customers: 5 orphans (customer_sk: 99, 88, 77, 66, 55)

Use case: Strict data qualityβ€”fail the pipeline if any orphans exist.

Strategy 2: Warn

Log a warning but continue processing:

import logging

logging.basicConfig(level=logging.WARNING)

validated_df = validate_fk_on_load(
    fact_df=fact_orders_dirty,
    relationships=relationships,
    context=context,
    on_failure="warn"
)

print(f"Returned {len(validated_df)} rows (including orphans)")

Output:

WARNING:odibi.validation.fk:FK validation found 5 orphan records for orders_to_customers
  Orphan values: customer_sk in [99, 88, 77, 66, 55]
Returned 35 rows (including orphans)

Use case: Log issues for investigation but don't block processing.

Strategy 3: Filter

Remove orphan records from the result:

validated_df = validate_fk_on_load(
    fact_df=fact_orders_dirty,
    relationships=relationships,
    context=context,
    on_failure="filter"
)

print(f"Before: {len(fact_orders_dirty)} rows")
print(f"After:  {len(validated_df)} rows")
print(f"Filtered: {len(fact_orders_dirty) - len(validated_df)} orphan rows")

Output:

Before: 35 rows
After:  30 rows
Filtered: 5 orphan rows

Before (35 rows):

order_id customer_sk line_total
ORD001 1 1299.99
ORD002 1 59.98
... ... ...
ORD030 5 189.99
ORD031 99 1299.99
ORD032 88 29.99
ORD033 77 249.99
ORD034 66 49.99
ORD035 55 599.99

After (30 rows - orphans removed):

order_id customer_sk line_total
ORD001 1 1299.99
ORD002 1 59.98
... ... ...
ORD030 5 189.99

Use case: Silently exclude bad data (use with cautionβ€”you may lose legitimate records).


Step 5: Generate Lineage from Relationships

The relationship registry can generate a lineage graph:

Python Code

lineage = registry.generate_lineage()
print(lineage)

Output

{
    'fact_orders': ['dim_customer', 'dim_product', 'dim_date']
}

Mermaid Diagram

# Generate mermaid diagram code
mermaid_code = registry.to_mermaid()
print(mermaid_code)
erDiagram
    FACT_ORDERS ||--o{ DIM_CUSTOMER : "customer_sk"
    FACT_ORDERS ||--o{ DIM_PRODUCT : "product_sk"
    FACT_ORDERS ||--o{ DIM_DATE : "date_sk"

    FACT_ORDERS {
        string order_id PK
        int customer_sk FK
        int product_sk FK
        int date_sk FK
        decimal line_total
    }

    DIM_CUSTOMER {
        int customer_sk PK
        string customer_id
        string name
        string region
    }

    DIM_PRODUCT {
        int product_sk PK
        string product_id
        string name
        string category
    }

    DIM_DATE {
        int date_sk PK
        date full_date
        string day_of_week
        string month_name
    }

Complete Example

Here's a complete script for FK validation:

#!/usr/bin/env python3
"""
FK Validation Example

This script demonstrates:
1. Defining FK relationships
2. Validating a fact table
3. Handling orphan records
4. Generating lineage
"""

from odibi.validation.fk import (
    RelationshipConfig,
    RelationshipRegistry,
    FKValidator,
    validate_fk_on_load
)
from odibi.context import EngineContext
from odibi.enums import EngineType
import pandas as pd

# =============================================================================
# 1. LOAD DATA
# =============================================================================
print("=" * 60)
print("Step 1: Load Data")
print("=" * 60)

# Clean fact data
fact_orders = pd.read_parquet("warehouse/fact_orders")
print(f"fact_orders: {len(fact_orders)} rows")

# Fact data with orphans
fact_orders_dirty = pd.read_csv(
    "examples/tutorials/dimensional_modeling/data/orders_with_orphans.csv"
)
print(f"fact_orders_dirty: {len(fact_orders_dirty)} rows (5 orphans)")

# Dimensions
dim_customer = pd.read_parquet("warehouse/dim_customer")
dim_product = pd.read_parquet("warehouse/dim_product")
dim_date = pd.read_parquet("warehouse/dim_date")
print()

# =============================================================================
# 2. DEFINE RELATIONSHIPS
# =============================================================================
print("=" * 60)
print("Step 2: Define Relationships")
print("=" * 60)

relationships = [
    RelationshipConfig(
        name="orders_to_customers",
        fact="fact_orders",
        dimension="dim_customer",
        fact_key="customer_sk",
        dimension_key="customer_sk",
        on_violation="error"
    ),
    RelationshipConfig(
        name="orders_to_products",
        fact="fact_orders",
        dimension="dim_product",
        fact_key="product_sk",
        dimension_key="product_sk",
        on_violation="error"
    ),
    RelationshipConfig(
        name="orders_to_dates",
        fact="fact_orders",
        dimension="dim_date",
        fact_key="date_sk",
        dimension_key="date_sk",
        on_violation="warn"
    )
]

registry = RelationshipRegistry(relationships=relationships)
print(f"Defined {len(relationships)} relationships")
print()

# =============================================================================
# 3. SETUP CONTEXT
# =============================================================================
context = EngineContext(df=None, engine_type=EngineType.PANDAS)
context.register("dim_customer", dim_customer)
context.register("dim_product", dim_product)
context.register("dim_date", dim_date)

# =============================================================================
# 4. VALIDATE CLEAN DATA
# =============================================================================
print("=" * 60)
print("Step 3: Validate Clean Data")
print("=" * 60)

validator = FKValidator(registry)
report = validator.validate_fact(fact_orders, "fact_orders", context)

print(f"All valid: {report.all_valid}")
print(f"Relationships checked: {report.total_relationships}")
print(f"Valid: {report.valid_relationships}")
print()

# =============================================================================
# 5. VALIDATE DIRTY DATA
# =============================================================================
print("=" * 60)
print("Step 4: Validate Data with Orphans")
print("=" * 60)

report = validator.validate_fact(fact_orders_dirty, "fact_orders", context)

print(f"All valid: {report.all_valid}")
print(f"Orphan records: {len(report.orphan_records)}")

for result in report.results:
    status = "VALID" if result.valid else "INVALID"
    print(f"\n  {result.relationship_name}: {status}")
    print(f"    Total rows: {result.total_rows}")
    print(f"    Orphans: {result.orphan_count}")
    if not result.valid:
        print(f"    Orphan values: {result.orphan_values[:5]}")
print()

# =============================================================================
# 6. DEMONSTRATE HANDLING STRATEGIES
# =============================================================================
print("=" * 60)
print("Step 5: Orphan Handling Strategies")
print("=" * 60)

# Filter strategy
filtered_df = validate_fk_on_load(
    fact_df=fact_orders_dirty,
    relationships=[relationships[0]],  # Just customer relationship
    context=context,
    on_failure="filter"
)

print(f"\nFilter Strategy:")
print(f"  Before: {len(fact_orders_dirty)} rows")
print(f"  After:  {len(filtered_df)} rows")
print(f"  Removed: {len(fact_orders_dirty) - len(filtered_df)} orphan rows")
print()

# =============================================================================
# 7. GENERATE LINEAGE
# =============================================================================
print("=" * 60)
print("Step 6: Generate Lineage")
print("=" * 60)

lineage = registry.generate_lineage()
print(f"Lineage: {lineage}")

print("\nDimensions referenced by fact_orders:")
for dim in lineage.get("fact_orders", []):
    print(f"  β†’ {dim}")

print()
print("=" * 60)
print("COMPLETE!")
print("=" * 60)

What You Learned

In this tutorial, you learned:

  • Why FK validation matters: Orphan records cause data quality issues
  • Defining relationships: Specify fact, dimension, and key columns
  • Validating facts: Use FKValidator to detect orphans
  • Handling strategies:
  • error: Fail the pipeline (strict)
  • warn: Log and continue (monitoring)
  • filter: Remove orphans (permissive)
  • Generating lineage: Visualize table relationships

Congratulations!

You've completed the entire dimensional modeling tutorial series!

What you built: - Part 1: Complete star schema (dimensions, facts, aggregates) - Part 2: Semantic layer (metrics, dimensions, materializations) - Part 3: FK validation (data quality)

You now have all the tools to build production-ready dimensional data warehouses with Odibi.


Previous Up Next
Semantic Full Example Tutorials -

Reference

For complete documentation, see: FK Validation Reference