Skip to content

Semantic Layer Introduction

In this tutorial, you'll learn what a semantic layer is, why it's valuable, and how it sits on top of your star schema to provide a business-friendly query interface.

What You'll Learn: - What is a semantic layer? - Why not just write SQL? - Metrics, dimensions, and materializations - When to use pipelines vs semantic layer


What is a Semantic Layer?

A semantic layer is a translation layer between your raw data and your business users. It defines business concepts (like "revenue" or "active customers") once, and lets everyone query them consistently.

Think of it like a glossary for your data:

Business Term Technical Definition
Revenue SUM(line_total) from fact_orders where status = 'completed'
Order Count COUNT(*) from fact_orders
Active Customer Customer with order in last 90 days
North Region region = 'North' in dim_customer

Without a semantic layer, everyone writes their own SQLβ€”and everyone might calculate "revenue" differently.


The Problem: Inconsistent Definitions

Without Semantic Layer

Marketing team:

-- "Revenue" = all orders
SELECT SUM(line_total) AS revenue FROM fact_orders;
-- Result: $9,803.54

Finance team:

-- "Revenue" = only completed orders
SELECT SUM(line_total) AS revenue 
FROM fact_orders 
WHERE status = 'completed';
-- Result: $8,953.56

Executive dashboard:

-- "Revenue" = completed orders, excluding discounts
SELECT SUM(line_total * 0.95) AS revenue 
FROM fact_orders 
WHERE status = 'completed';
-- Result: $8,505.88

Result: Three different "revenue" numbers in the same meeting. Chaos.

With Semantic Layer

# Everyone uses the same metric definition
result = query.execute("revenue", context)
# Result: $8,953.56 (always)

The semantic layer enforces a single, governed definition of "revenue."


Why Not Just Write SQL?

SQL is powerful, but it has limitations for business users:

Challenge Without Semantic Layer With Semantic Layer
Complex joins Users must know table relationships Automatic join handling
Consistent definitions Everyone writes their own Define once, use everywhere
Filter logic Repeated in every query Embedded in metric definition
Aggregation errors Easy to make mistakes Pre-validated expressions
Self-service Requires SQL expertise Business-friendly syntax
Governance No central control Single source of truth

How the Semantic Layer Fits

The semantic layer sits on top of your star schema:

flowchart TB
    subgraph Sources["Source Systems"]
        S1[CRM]
        S2[ERP]
        S3[POS]
    end

    subgraph Pipeline["Odibi Pipelines (YAML)"]
        P1[dimension pattern]
        P2[fact pattern]
        P3[aggregation pattern]
    end

    subgraph StarSchema["Star Schema"]
        D1[(dim_customer)]
        D2[(dim_product)]
        D3[(dim_date)]
        F1[(fact_orders)]
        A1[(agg_daily_sales)]
    end

    subgraph Semantic["Semantic Layer (Python API)"]
        M1[Metrics]
        M2[Dimensions]
        M3[Materializations]
    end

    subgraph Consumers["Consumers"]
        C1[Dashboards]
        C2[Reports]
        C3[Ad-hoc Queries]
        C4[Data Apps]
    end

    S1 --> P1
    S2 --> P2
    S3 --> P3

    P1 --> D1
    P1 --> D2
    P2 --> D3
    P2 --> F1
    P3 --> A1

    D1 --> M1
    D2 --> M1
    D3 --> M1
    F1 --> M1
    A1 --> M1

    M1 --> C1
    M2 --> C2
    M3 --> C3
    M1 --> C4

    style Sources fill:#1a1a2e,stroke:#4a90d9,color:#fff
    style Pipeline fill:#1a1a2e,stroke:#4a90d9,color:#fff
    style StarSchema fill:#1a1a2e,stroke:#4a90d9,color:#fff
    style Semantic fill:#1a1a2e,stroke:#4a90d9,color:#fff
    style Consumers fill:#1a1a2e,stroke:#4a90d9,color:#fff

Odibi Pipelines build the data (dimensions, facts, aggregates). Semantic Layer defines how to query the data (metrics, dimensions, materializations).


Core Concepts

Metrics

A metric is a measurable value that can be aggregated. It answers "how much?" or "how many?"

metrics:
  - name: revenue
    description: "Total revenue from completed orders"
    expr: "SUM(line_total)"
    source: fact_orders
    filters:
      - "status = 'completed'"

Examples: - Revenue (SUM(line_total)) - Order Count (COUNT(*)) - Average Order Value (AVG(line_total)) - Unique Customers (COUNT(DISTINCT customer_sk))

Dimensions

A dimension is an attribute for grouping and filtering. It answers "by what?" or "for what?"

dimensions:
  - name: region
    source: dim_customer
    column: region

Examples: - Region (North, South, East, West) - Category (Electronics, Furniture) - Month (January, February, ...) - Day of Week (Monday, Tuesday, ...)

Materializations

A materialization pre-computes metrics at a specific grain and saves them to a table. It answers "what should be pre-calculated for dashboards?"

materializations:
  - name: monthly_revenue_by_region
    metrics: [revenue, order_count]
    dimensions: [region, month]
    output: gold/agg_monthly_revenue
    schedule: "0 2 1 * *"  # Monthly

When to Use What

Task Solution
Build dimension tables from source Odibi Pipeline: dimension pattern
Build fact tables from source Odibi Pipeline: fact pattern
Build scheduled aggregates Odibi Pipeline: aggregation pattern
Ad-hoc metric queries Semantic Layer: SemanticQuery
Self-service analytics Semantic Layer with dimensions
Dashboard metrics Semantic Layer: Materializer

Pipelines vs Semantic Layer

Use Pipelines when: - Building the star schema from source data - Scheduled ETL/ELT processes - Transforming and cleaning data - Generating surrogate keys

Use Semantic Layer when: - Defining business metrics consistently - Enabling self-service analytics - Pre-computing dashboard metrics - Creating a governed metric catalog


The simplest way to use the semantic layer is through the unified Project API. This connects your pipelines and semantic layer seamlessly:

1. Add Semantic Config to odibi.yaml

# odibi.yaml
project: my_warehouse
engine: pandas

connections:
  gold:
    type: local
    base_path: /mnt/data/gold

pipelines:
  - pipeline: build_warehouse
    nodes:
      - name: fact_orders
        write: { connection: gold, table: fact_orders }
      - name: dim_customer
        write: { connection: gold, table: dim_customer }

# Semantic layer at project level
semantic:
  metrics:
    - name: revenue
      expr: "SUM(line_total)"
      source: $build_warehouse.fact_orders    # References node's write target
      filters:
        - "status = 'completed'"

  dimensions:
    - name: region
      source: $build_warehouse.dim_customer   # No path duplication!
      column: region

2. Query with Two Lines of Code

from odibi import Project

project = Project.load("odibi.yaml")
result = project.query("revenue BY region")
print(result.df)

That's it! The Project class: - Reads connections and pipelines from your YAML - Resolves $build_warehouse.fact_orders β†’ node's write path - Auto-loads Delta tables when queried - No manual context.register() calls needed


Manual Approach

If you prefer more control, you can use the semantic layer components directly.

Example: Revenue Metric

Let's see how a simple metric works:

1. Define the Metric

from odibi.semantics import MetricDefinition

revenue = MetricDefinition(
    name="revenue",
    description="Total revenue from completed orders",
    expr="SUM(line_total)",
    source="fact_orders",
    filters=["status = 'completed'"]
)

2. Create the Config

from odibi.semantics import SemanticLayerConfig, DimensionDefinition

config = SemanticLayerConfig(
    metrics=[revenue],
    dimensions=[
        DimensionDefinition(
            name="region",
            source="dim_customer",
            column="region"
        )
    ]
)

3. Query the Metric

from odibi.semantics import SemanticQuery

query = SemanticQuery(config)

# Total revenue
result = query.execute("revenue", context)
print(result.df)
# | revenue    |
# |------------|
# | 8,953.56   |

# Revenue by region
result = query.execute("revenue BY region", context)
print(result.df)
# | region | revenue  |
# |--------|----------|
# | North  | 2,549.88 |
# | South  | 2,349.93 |
# | East   | 1,923.88 |
# | West   | 2,129.87 |

The Query Syntax

Semantic queries use a simple, business-friendly syntax:

metric1, metric2 BY dimension1, dimension2 WHERE condition

Examples:

Query Meaning
"revenue" Total revenue
"revenue BY region" Revenue grouped by region
"revenue, order_count BY region" Multiple metrics by region
"revenue BY region, month" Revenue by region and month
"revenue BY region WHERE year = 2024" Filtered revenue by region

Benefits Summary

Benefit Description
Consistency One definition of "revenue" everywhere
Governance Central control over metric logic
Self-Service Business users query without SQL
Performance Pre-computed materializations for dashboards
Discoverability Metrics are documented and cataloged
Maintainability Change definition once, updates everywhere

Architecture Overview

classDiagram
    class SemanticLayerConfig {
        +metrics: List[MetricDefinition]
        +dimensions: List[DimensionDefinition]
        +materializations: List[MaterializationConfig]
        +get_metric(name)
        +get_dimension(name)
    }

    class MetricDefinition {
        +name: str
        +description: str
        +expr: str
        +source: str
        +filters: List[str]
    }

    class DimensionDefinition {
        +name: str
        +source: str
        +column: str
        +hierarchy: List[str]
    }

    class SemanticQuery {
        +config: SemanticLayerConfig
        +execute(query_string, context)
        +parse(query_string)
        +validate(parsed_query)
    }

    class Materializer {
        +config: SemanticLayerConfig
        +execute(name, context)
        +execute_all(context)
    }

    SemanticLayerConfig --> MetricDefinition
    SemanticLayerConfig --> DimensionDefinition
    SemanticQuery --> SemanticLayerConfig
    Materializer --> SemanticLayerConfig

What You Learned

In this tutorial, you learned:

  • A semantic layer translates business concepts into technical queries
  • It ensures consistent definitions across all users
  • Metrics are measurable values (SUM, COUNT, AVG)
  • Dimensions are grouping attributes (region, category, date)
  • Materializations pre-compute metrics for performance
  • Pipelines build the data; semantic layer queries it
  • The query syntax is simple: "metric BY dimension WHERE filter"

Next Steps

Now let's learn how to define metrics in detail.

Next: Defining Metrics


Previous Up Next
Full Star Schema Tutorials Defining Metrics

Reference

For complete documentation, see: Semantic Layer Overview