Semantic Layer Full Example¶

This tutorial brings together everything you've learned about the semantic layer into a complete, end-to-end example.

What You'll See: - Complete semantic config (5 metrics, 5 dimensions, 3 materializations) - Unified Project API (simplest approach) - Full Python script that loads data, queries metrics, and materializes results - All output tables with sample data

Option A: Unified Project API (Recommended)¶

The simplest way to use the semantic layer is with the unified Project API:

odibi.yaml with Semantic Layer¶

# odibi.yaml
project: retail_warehouse
engine: pandas

connections:
  gold:
    type: local
    base_path: /mnt/data/gold

pipelines:
  - pipeline: build_warehouse
    nodes:
      - name: fact_orders
        write: { connection: gold, table: fact_orders }
      - name: dim_customer
        write: { connection: gold, table: dim_customer }
      - name: dim_product
        write: { connection: gold, table: dim_product }
      - name: dim_date
        write: { connection: gold, table: dim_date }

# Semantic layer at project level
semantic:
  metrics:
    - name: revenue
      description: "Total revenue from completed orders"
      expr: "SUM(line_total)"
      source: $build_warehouse.fact_orders    # References node's write target
      filters:
        - "status = 'completed'"

    - name: order_count
      expr: "COUNT(*)"
      source: $build_warehouse.fact_orders
      filters:
        - "status = 'completed'"

    - name: avg_order_value
      expr: "AVG(line_total)"
      source: $build_warehouse.fact_orders
      filters:
        - "status = 'completed'"

  dimensions:
    - name: region
      source: $build_warehouse.dim_customer   # No path duplication!
      column: region

    - name: category
      source: $build_warehouse.dim_product
      column: category

    - name: month
      source: $build_warehouse.dim_date
      column: month_name

Query with Two Lines¶

from odibi import Project

# Load project - tables auto-resolved from connections
project = Project.load("odibi.yaml")

# Query metrics
result = project.query("revenue BY region")
print(result.df)

# Multiple metrics and dimensions
result = project.query("revenue, order_count, avg_order_value BY region, category")
print(result.df)

That's it! No manual table loading or context registration required.

Option B: Manual Approach¶

For more control, use the semantic layer components directly.

The Complete Semantic Configuration¶

semantic_config.yaml¶

# File: semantic_config.yaml
# Complete semantic layer configuration for retail star schema

metrics:
  # Revenue metrics
  - name: revenue
    description: "Total revenue from completed orders"
    expr: "SUM(line_total)"
    source: fact_orders
    filters:
      - "status = 'completed'"

  - name: cost
    description: "Total cost of goods sold"
    expr: "SUM(quantity * 0.6 * unit_price)"  # Assume 60% cost ratio
    source: fact_orders
    filters:
      - "status = 'completed'"

  - name: profit
    description: "Gross profit (revenue - cost)"
    expr: "SUM(line_total * 0.4)"  # 40% margin
    source: fact_orders
    filters:
      - "status = 'completed'"

  # Volume metrics
  - name: order_count
    description: "Number of completed orders"
    expr: "COUNT(*)"
    source: fact_orders
    filters:
      - "status = 'completed'"

  - name: avg_order_value
    description: "Average order value"
    expr: "AVG(line_total)"
    source: fact_orders
    filters:
      - "status = 'completed'"

dimensions:
  # Geographic dimensions
  - name: region
    description: "Customer geographic region"
    source: dim_customer
    column: region

  - name: category
    description: "Product category"
    source: dim_product
    column: category

  # Time dimensions
  - name: month
    description: "Month name"
    source: dim_date
    column: month_name
    hierarchy:
      - year
      - quarter_name
      - month_name

  - name: quarter
    description: "Quarter name"
    source: dim_date
    column: quarter_name

  - name: year
    description: "Calendar year"
    source: dim_date
    column: year

materializations:
  # Daily aggregate for trend analysis
  - name: daily_metrics
    description: "Daily revenue and order metrics"
    metrics:
      - revenue
      - order_count
      - avg_order_value
    dimensions:
      - year
      - month
    output: gold/agg_daily_metrics
    schedule: "0 3 * * *"  # 3am daily

  # Monthly by region for regional dashboards
  - name: monthly_by_region
    description: "Monthly metrics by region"
    metrics:
      - revenue
      - profit
      - order_count
    dimensions:
      - region
      - month
    output: gold/agg_monthly_region
    schedule: "0 2 1 * *"  # 2am on 1st of month

  # Category performance summary
  - name: category_performance
    description: "Category performance metrics"
    metrics:
      - revenue
      - profit
      - order_count
      - avg_order_value
    dimensions:
      - category
    output: gold/agg_category_performance

Complete Python Script¶

#!/usr/bin/env python3
"""
Complete Semantic Layer Example

This script demonstrates the full workflow:
1. Load star schema data
2. Configure semantic layer
3. Run queries
4. Materialize metrics
"""

from odibi.semantics import (
    SemanticQuery,
    Materializer,
    SemanticLayerConfig,
    MetricDefinition,
    DimensionDefinition,
    MaterializationConfig,
    parse_semantic_config
)
from odibi.context import EngineContext
from odibi.enums import EngineType
import pandas as pd
import yaml
from pathlib import Path

# =============================================================================
# 1. LOAD THE STAR SCHEMA DATA
# =============================================================================
print("=" * 70)
print("STEP 1: Loading Star Schema Data")
print("=" * 70)

# Load dimension tables
dim_customer = pd.read_parquet("warehouse/dim_customer")
dim_product = pd.read_parquet("warehouse/dim_product")
dim_date = pd.read_parquet("warehouse/dim_date")
fact_orders = pd.read_parquet("warehouse/fact_orders")

print(f"dim_customer: {len(dim_customer)} rows")
print(f"dim_product:  {len(dim_product)} rows")
print(f"dim_date:     {len(dim_date)} rows")
print(f"fact_orders:  {len(fact_orders)} rows")
print()

# =============================================================================
# 2. CREATE SEMANTIC LAYER CONFIG
# =============================================================================
print("=" * 70)
print("STEP 2: Creating Semantic Layer Configuration")
print("=" * 70)

# Option A: Load from YAML file
# with open("semantic_config.yaml") as f:
#     config = parse_semantic_config(yaml.safe_load(f))

# Option B: Build programmatically
config = SemanticLayerConfig(
    metrics=[
        MetricDefinition(
            name="revenue",
            description="Total revenue from completed orders",
            expr="SUM(line_total)",
            source="fact_orders",
            filters=["status = 'completed'"]
        ),
        MetricDefinition(
            name="cost",
            description="Total cost of goods sold",
            expr="SUM(quantity * 0.6 * unit_price)",
            source="fact_orders",
            filters=["status = 'completed'"]
        ),
        MetricDefinition(
            name="profit",
            description="Gross profit (revenue - cost)",
            expr="SUM(line_total * 0.4)",
            source="fact_orders",
            filters=["status = 'completed'"]
        ),
        MetricDefinition(
            name="order_count",
            description="Number of completed orders",
            expr="COUNT(*)",
            source="fact_orders",
            filters=["status = 'completed'"]
        ),
        MetricDefinition(
            name="avg_order_value",
            description="Average order value",
            expr="AVG(line_total)",
            source="fact_orders",
            filters=["status = 'completed'"]
        )
    ],
    dimensions=[
        DimensionDefinition(name="region", source="dim_customer", column="region"),
        DimensionDefinition(name="category", source="dim_product", column="category"),
        DimensionDefinition(name="month", source="dim_date", column="month_name",
                           hierarchy=["year", "quarter_name", "month_name"]),
        DimensionDefinition(name="quarter", source="dim_date", column="quarter_name"),
        DimensionDefinition(name="year", source="dim_date", column="year")
    ],
    materializations=[
        MaterializationConfig(
            name="daily_metrics",
            metrics=["revenue", "order_count", "avg_order_value"],
            dimensions=["year", "month"],
            output="gold/agg_daily_metrics",
            schedule="0 3 * * *"
        ),
        MaterializationConfig(
            name="monthly_by_region",
            metrics=["revenue", "profit", "order_count"],
            dimensions=["region", "month"],
            output="gold/agg_monthly_region",
            schedule="0 2 1 * *"
        ),
        MaterializationConfig(
            name="category_performance",
            metrics=["revenue", "profit", "order_count", "avg_order_value"],
            dimensions=["category"],
            output="gold/agg_category_performance"
        )
    ]
)

print(f"Metrics defined:          {len(config.metrics)}")
print(f"Dimensions defined:       {len(config.dimensions)}")
print(f"Materializations defined: {len(config.materializations)}")

# List them
print("\nMetrics:")
for m in config.metrics:
    print(f"  - {m.name}: {m.description}")

print("\nDimensions:")
for d in config.dimensions:
    print(f"  - {d.name}: from {d.source}.{d.column}")

print("\nMaterializations:")
for mat in config.materializations:
    print(f"  - {mat.name}: {mat.metrics} BY {mat.dimensions}")
print()

# =============================================================================
# 3. SETUP CONTEXT
# =============================================================================
print("=" * 70)
print("STEP 3: Setting Up Engine Context")
print("=" * 70)

context = EngineContext(df=None, engine_type=EngineType.PANDAS)
context.register("fact_orders", fact_orders)
context.register("dim_customer", dim_customer)
context.register("dim_product", dim_product)
context.register("dim_date", dim_date)

print("Registered tables: fact_orders, dim_customer, dim_product, dim_date")
print()

# =============================================================================
# 4. RUN QUERIES
# =============================================================================
print("=" * 70)
print("STEP 4: Running Semantic Queries")
print("=" * 70)

query = SemanticQuery(config)

# Query 1: Total revenue (no grouping)
print("\n--- Query 1: Total Revenue ---")
result = query.execute("revenue", context)
print(f"Total Revenue: ${result.df['revenue'].iloc[0]:,.2f}")
print(f"Execution time: {result.elapsed_ms:.2f}ms")

# Query 2: Revenue by region
print("\n--- Query 2: Revenue by Region ---")
result = query.execute("revenue BY region", context)
print(result.df.to_string(index=False))

# Query 3: Multiple metrics by region
print("\n--- Query 3: Revenue, Profit, Order Count by Region ---")
result = query.execute("revenue, profit, order_count BY region", context)
print(result.df.to_string(index=False))

# Query 4: Revenue by category and region
print("\n--- Query 4: Revenue by Category and Region ---")
result = query.execute("revenue BY category, region", context)
print(result.df.to_string(index=False))

# Query 5: Filtered query - North region only
print("\n--- Query 5: North Region Performance ---")
result = query.execute("revenue, profit, avg_order_value BY category WHERE region = 'North'", context)
print(result.df.to_string(index=False))

print()

# =============================================================================
# 5. MATERIALIZE METRICS
# =============================================================================
print("=" * 70)
print("STEP 5: Materializing Metrics")
print("=" * 70)

materializer = Materializer(config)

# Track outputs for display
output_tables = {}

def write_and_store(df, output_path):
    """Write to disk and store for display."""
    output_tables[output_path] = df.copy()
    full_path = Path(f"warehouse/{output_path}.parquet")
    full_path.parent.mkdir(parents=True, exist_ok=True)
    df.to_parquet(full_path, index=False)
    print(f"  → Wrote {len(df)} rows to {full_path}")

# Execute all materializations
results = materializer.execute_all(context, write_callback=write_and_store)

print("\nMaterialization Results:")
for result in results:
    status = "SUCCESS" if result.success else f"FAILED: {result.error}"
    print(f"  {result.name}: {status} ({result.row_count} rows, {result.elapsed_ms:.0f}ms)")

print()

# =============================================================================
# 6. SHOW OUTPUT TABLES
# =============================================================================
print("=" * 70)
print("STEP 6: Output Tables")
print("=" * 70)

for output_path, df in output_tables.items():
    print(f"\n--- {output_path} ({len(df)} rows) ---")
    print(df.to_string(index=False))

print()
print("=" * 70)
print("COMPLETE!")
print("=" * 70)

Sample Output¶

Step 1: Data Loading¶

======================================================================
STEP 1: Loading Star Schema Data
======================================================================
dim_customer: 13 rows
dim_product:  11 rows
dim_date:     32 rows
fact_orders:  30 rows

Step 4: Query Results¶

Query 1: Total Revenue

revenue
8,953.56

Query 2: Revenue by Region (4 rows)

region	revenue
North	2,549.88
South	2,349.93
East	1,923.88
West	2,129.87

Query 3: Multiple Metrics by Region (4 rows)

region	revenue	profit	order_count
North	2,549.88	1,019.95	7
South	2,349.93	939.97	7
East	1,923.88	769.55	7
West	2,129.87	851.95	7

Query 4: Revenue by Category and Region (8 rows)

category	region	revenue
Electronics	North	1,549.94
Electronics	South	1,449.95
Electronics	East	1,323.91
Electronics	West	1,079.93
Furniture	North	999.94
Furniture	South	899.98
Furniture	East	599.97
Furniture	West	1,049.94

Query 5: North Region Only (2 rows)

category	revenue	profit	avg_order_value
Electronics	1,549.94	619.98	387.49
Furniture	999.94	399.98	333.31

Step 5: Materialization Results¶

======================================================================
STEP 5: Materializing Metrics
======================================================================
  → Wrote 1 rows to warehouse/gold/agg_daily_metrics.parquet
  → Wrote 4 rows to warehouse/gold/agg_monthly_region.parquet
  → Wrote 2 rows to warehouse/gold/agg_category_performance.parquet

Materialization Results:
  daily_metrics: SUCCESS (1 rows, 32ms)
  monthly_by_region: SUCCESS (4 rows, 45ms)
  category_performance: SUCCESS (2 rows, 28ms)

Step 6: Output Tables¶

gold/agg_daily_metrics (1 row)

year	month	revenue	order_count	avg_order_value
2024	January	8,953.56	27	331.61

gold/agg_monthly_region (4 rows)

region	month	revenue	profit	order_count
North	January	2,549.88	1,019.95	7
South	January	2,349.93	939.97	7
East	January	1,923.88	769.55	7
West	January	2,129.87	851.95	7

gold/agg_category_performance (2 rows)

category	revenue	profit	order_count	avg_order_value
Electronics	5,403.73	2,161.49	15	360.25
Furniture	3,549.83	1,419.93	12	295.82

Architecture Diagram¶

flowchart TB
    subgraph StarSchema["Star Schema (Built by Pipelines)"]
        DC[(dim_customer<br/>13 rows)]
        DP[(dim_product<br/>11 rows)]
        DD[(dim_date<br/>32 rows)]
        FO[(fact_orders<br/>30 rows)]
    end

    subgraph SemanticLayer["Semantic Layer"]
        subgraph Metrics["5 Metrics"]
            M1[revenue]
            M2[cost]
            M3[profit]
            M4[order_count]
            M5[avg_order_value]
        end

        subgraph Dims["5 Dimensions"]
            D1[region]
            D2[category]
            D3[month]
            D4[quarter]
            D5[year]
        end

        subgraph Mats["3 Materializations"]
            MAT1[daily_metrics]
            MAT2[monthly_by_region]
            MAT3[category_performance]
        end
    end

    subgraph Output["Materialized Tables"]
        O1[(agg_daily_metrics<br/>1 row)]
        O2[(agg_monthly_region<br/>4 rows)]
        O3[(agg_category_performance<br/>2 rows)]
    end

    DC --> D1
    DP --> D2
    DD --> D3
    DD --> D4
    DD --> D5
    FO --> M1
    FO --> M2
    FO --> M3
    FO --> M4
    FO --> M5

    MAT1 --> O1
    MAT2 --> O2
    MAT3 --> O3

    style StarSchema fill:#1a1a2e,stroke:#4a90d9,color:#fff
    style SemanticLayer fill:#1a1a2e,stroke:#4a90d9,color:#fff
    style Output fill:#1a1a2e,stroke:#4a90d9,color:#fff

Summary¶

In this complete example, you saw:

5 Metrics: revenue, cost, profit, order_count, avg_order_value
5 Dimensions: region, category, month, quarter, year
3 Materializations: daily_metrics, monthly_by_region, category_performance
5 Queries: Total, by region, multiple metrics, cross-tab, filtered

The semantic layer provides: - Consistent definitions across all queries - Business-friendly syntax: "revenue BY region" - Pre-computed aggregates for fast dashboards - Self-documenting config with descriptions

What You Learned¶

In this tutorial series (Part 2: Semantic Layer), you learned:

What a semantic layer is and why it matters
Defining metrics with expressions and filters
Defining dimensions for grouping and drill-down
Querying with the simple "metric BY dimension" syntax
Materializing metrics for dashboard performance
Putting it all together in a production-ready configuration

Next Steps¶

The final tutorial covers FK validation for data quality.

Next: FK Validation

Previous	Up	Next
Materializing Metrics	Tutorials	FK Validation