Aggregation Pattern Tutorial¶

In this tutorial, you'll learn how to use Odibi's aggregation pattern to build pre-aggregated tables for faster reporting and dashboards.

What You'll Learn: - Why pre-aggregate fact tables - Defining grain and measures - Using aggregate functions (SUM, COUNT, AVG) - Incremental merge strategies (replace vs sum)

Why Pre-Aggregate?¶

Consider a dashboard showing "Daily Revenue by Product Category":

Without pre-aggregation:

-- Runs against 10 million fact rows every time
SELECT
    d.full_date,
    p.category,
    SUM(f.line_total) AS revenue
FROM fact_orders f
JOIN dim_date d ON f.date_sk = d.date_sk
JOIN dim_product p ON f.product_sk = p.product_sk
GROUP BY d.full_date, p.category;
-- Takes 30 seconds

With pre-aggregation:

-- Runs against 5,000 aggregated rows
SELECT full_date, category, total_revenue
FROM agg_daily_category_sales;
-- Takes 0.1 seconds

Pre-aggregated tables trade storage for speed. For frequently-queried metrics, this is almost always worth it.

Source Data: fact_orders¶

We'll aggregate the fact table from the previous tutorial (30 rows):

order_id	customer_sk	product_sk	date_sk	quantity	unit_price	line_total	status
ORD001	1	1	20240115	1	1299.99	1299.99	completed
ORD002	1	2	20240115	2	29.99	59.98	completed
ORD003	2	3	20240116	1	249.99	249.99	completed
ORD004	3	4	20240116	3	49.99	149.97	completed
ORD005	4	5	20240117	1	599.99	599.99	completed
ORD006	5	6	20240117	1	149.99	149.99	completed
ORD007	6	7	20240118	2	399.99	799.98	completed
ORD008	7	8	20240118	4	45.99	183.96	completed
ORD009	8	9	20240119	1	79.99	79.99	completed
ORD010	9	10	20240119	1	189.99	189.99	completed
ORD011	10	1	20240120	1	1299.99	1299.99	completed
ORD012	11	2	20240120	5	29.99	149.95	completed
ORD013	12	3	20240121	2	249.99	499.98	completed
ORD014	1	4	20240121	1	49.99	49.99	completed
ORD015	2	5	20240122	1	599.99	599.99	pending
ORD016	3	6	20240122	2	149.99	299.98	completed
ORD017	4	7	20240123	1	399.99	399.99	completed
ORD018	5	8	20240123	3	45.99	137.97	completed
ORD019	6	9	20240124	2	79.99	159.98	completed
ORD020	7	10	20240124	1	189.99	189.99	completed
ORD021	8	1	20240125	1	1299.99	1299.99	completed
ORD022	9	2	20240125	3	29.99	89.97	completed
ORD023	10	3	20240126	1	249.99	249.99	cancelled
ORD024	11	4	20240126	2	49.99	99.98	completed
ORD025	12	5	20240127	1	599.99	599.99	completed
ORD026	1	6	20240127	1	149.99	149.99	completed
ORD027	2	7	20240128	1	399.99	399.99	completed
ORD028	3	8	20240128	2	45.99	91.98	completed
ORD029	4	9	20240115	1	79.99	79.99	completed
ORD030	5	10	20240116	1	189.99	189.99	completed

Step 1: Basic Aggregation by Date and Product¶

Let's aggregate orders by date and product to see daily product sales.

YAML Configuration¶

project: aggregation_tutorial
engine: pandas

connections:
  warehouse:
    type: local
    base_path: ./warehouse

story:
  connection: warehouse
  path: stories

pipelines:
  - pipeline: build_aggregates
    nodes:
      - name: agg_daily_product_sales
        read:
          connection: warehouse
          path: fact_orders
          format: parquet

        pattern:
          type: aggregation
          params:
            grain: [date_sk, product_sk]
            measures:
              - name: total_revenue
                expr: "SUM(line_total)"
              - name: order_count
                expr: "COUNT(*)"
              - name: total_quantity
                expr: "SUM(quantity)"
            audit:
              load_timestamp: true

        write:
          connection: warehouse
          path: agg_daily_product_sales
          format: parquet
          mode: overwrite

Understanding Grain and Measures¶

Grain: The columns that define uniqueness in the output. Here, each combination of date_sk + product_sk gets one row.

Measures: The aggregations to compute. Each measure needs: - name: Output column name - expr: SQL aggregation expression

Output: agg_daily_product_sales (26 rows)¶

Here are all 26 aggregated rows:

date_sk	product_sk	total_revenue	order_count	total_quantity	load_timestamp
20240115	1	1299.99	1	1	2024-01-30 10:00:00
20240115	2	59.98	1	2	2024-01-30 10:00:00
20240115	9	79.99	1	1	2024-01-30 10:00:00
20240116	3	249.99	1	1	2024-01-30 10:00:00
20240116	4	149.97	1	3	2024-01-30 10:00:00
20240116	10	189.99	1	1	2024-01-30 10:00:00
20240117	5	599.99	1	1	2024-01-30 10:00:00
20240117	6	149.99	1	1	2024-01-30 10:00:00
20240118	7	799.98	1	2	2024-01-30 10:00:00
20240118	8	183.96	1	4	2024-01-30 10:00:00
20240119	9	79.99	1	1	2024-01-30 10:00:00
20240119	10	189.99	1	1	2024-01-30 10:00:00
20240120	1	1299.99	1	1	2024-01-30 10:00:00
20240120	2	149.95	1	5	2024-01-30 10:00:00
20240121	3	499.98	1	2	2024-01-30 10:00:00
20240121	4	49.99	1	1	2024-01-30 10:00:00
20240122	5	599.99	1	1	2024-01-30 10:00:00
20240122	6	299.98	1	2	2024-01-30 10:00:00
20240123	7	399.99	1	1	2024-01-30 10:00:00
20240123	8	137.97	1	3	2024-01-30 10:00:00
20240124	9	159.98	1	2	2024-01-30 10:00:00
20240124	10	189.99	1	1	2024-01-30 10:00:00
20240125	1	1299.99	1	1	2024-01-30 10:00:00
20240125	2	89.97	1	3	2024-01-30 10:00:00
20240126	3	249.99	1	1	2024-01-30 10:00:00
20240126	4	99.98	1	2	2024-01-30 10:00:00
20240127	5	599.99	1	1	2024-01-30 10:00:00
20240127	6	149.99	1	1	2024-01-30 10:00:00
20240128	7	399.99	1	1	2024-01-30 10:00:00
20240128	8	91.98	1	2	2024-01-30 10:00:00

Result: 30 fact rows became 26 aggregate rows (some date+product combinations had multiple orders).

Step 2: Adding More Measures¶

Let's add average and distinct count measures:

params:
  grain: [date_sk, product_sk]
  measures:
    - name: total_revenue
      expr: "SUM(line_total)"
    - name: order_count
      expr: "COUNT(*)"
    - name: total_quantity
      expr: "SUM(quantity)"
    - name: avg_order_value
      expr: "AVG(line_total)"
    - name: unique_customers
      expr: "COUNT(DISTINCT customer_sk)"
    - name: max_order_value
      expr: "MAX(line_total)"
    - name: min_order_value
      expr: "MIN(line_total)"

Available Aggregation Functions¶

Function	Description	Example
`SUM(column)`	Total of values	`SUM(line_total)`
`COUNT(*)`	Number of rows	`COUNT(*)`
`COUNT(column)`	Non-null count	`COUNT(customer_sk)`
`COUNT(DISTINCT column)`	Unique values	`COUNT(DISTINCT customer_sk)`
`AVG(column)`	Average value	`AVG(line_total)`
`MAX(column)`	Maximum value	`MAX(line_total)`
`MIN(column)`	Minimum value	`MIN(line_total)`

Complex Expressions¶

You can use expressions within aggregations:

measures:
  # Total after 10% discount
  - name: discounted_revenue
    expr: "SUM(line_total * 0.9)"

  # Margin (if cost column existed)
  - name: total_margin
    expr: "SUM(line_total - cost)"

  # Discount rate
  - name: avg_discount_rate
    expr: "AVG(discount_amount / line_total)"

Step 3: Aggregating by Different Grains¶

Daily Sales (Grain: date only)¶

params:
  grain: [date_sk]
  measures:
    - name: total_revenue
      expr: "SUM(line_total)"
    - name: order_count
      expr: "COUNT(*)"
    - name: unique_products
      expr: "COUNT(DISTINCT product_sk)"
    - name: unique_customers
      expr: "COUNT(DISTINCT customer_sk)"

Output: agg_daily_sales (14 rows)

date_sk	total_revenue	order_count	unique_products	unique_customers
20240115	1439.96	3	3	2
20240116	589.95	3	3	3
20240117	749.98	2	2	2
20240118	983.94	2	2	2
20240119	269.98	2	2	2
20240120	1449.94	2	2	2
20240121	549.97	2	2	2
20240122	899.97	2	2	2
20240123	537.96	2	2	2
20240124	349.97	2	2	2
20240125	1389.96	2	2	2
20240126	349.97	2	2	2
20240127	749.98	2	2	2
20240128	491.97	2	2	2

Product Sales (Grain: product only)¶

params:
  grain: [product_sk]
  measures:
    - name: total_revenue
      expr: "SUM(line_total)"
    - name: order_count
      expr: "COUNT(*)"
    - name: total_quantity
      expr: "SUM(quantity)"

Output: agg_product_sales (10 rows)

product_sk	total_revenue	order_count	total_quantity
1	3899.97	3	3
2	299.90	3	10
3	999.96	3	4
4	299.94	3	6
5	1799.97	3	3
6	599.96	3	4
7	1599.96	3	4
8	413.91	3	9
9	319.96	3	4
10	569.97	3	3

Step 4: Incremental Merge Strategies¶

For ongoing pipelines, you don't want to rebuild the entire aggregate table. Incremental strategies allow you to update only affected rows.

Replace Strategy¶

How it works: New aggregates completely replace existing rows for matching grain keys.

params:
  grain: [date_sk, product_sk]
  measures:
    - name: total_revenue
      expr: "SUM(line_total)"
    - name: order_count
      expr: "COUNT(*)"
  incremental:
    timestamp_column: load_timestamp
    merge_strategy: replace
  target: warehouse.agg_daily_product_sales

Example Scenario:

Existing agg_daily_product_sales:

date_sk	product_sk	total_revenue	order_count
20240115	1	1299.99	1
20240115	2	59.98	1
20240116	3	249.99	1

New orders arrive for 2024-01-15:

order_id	product_sk	date_sk	line_total
ORD100	1	20240115	1299.99

New aggregate computed:

date_sk	product_sk	total_revenue	order_count
20240115	1	2599.98	2

After Replace Merge:

date_sk	product_sk	total_revenue	order_count	Note
20240115	1	2599.98	2	Replaced
20240115	2	59.98	1	Unchanged
20240116	3	249.99	1	Unchanged

Use case: Best for most scenarios. Handles late-arriving data, corrections, and restatements correctly.

Sum Strategy¶

How it works: New measure values are added to existing values.

params:
  incremental:
    timestamp_column: load_timestamp
    merge_strategy: sum

Example Scenario:

Existing:

date_sk	product_sk	total_revenue	order_count
20240115	1	1299.99	1

New aggregate:

date_sk	product_sk	total_revenue	order_count
20240115	1	1299.99	1

After Sum Merge:

date_sk	product_sk	total_revenue	order_count
20240115	1	2599.98	2

Use case: Only for purely additive metrics (counts, sums) on append-only data.

Warning: Do NOT use sum strategy for: - AVG (would become average of averages, incorrect) - COUNT DISTINCT (would overcount) - MIN/MAX (would be wrong) - Data with updates or late-arriving records

Complete YAML Example¶

Here's a complete example with multiple aggregate tables:

# File: odibi_aggregation_tutorial.yaml
project: aggregation_tutorial
engine: pandas

connections:
  warehouse:
    type: local
    base_path: ./warehouse

story:
  connection: warehouse
  path: stories

pipelines:
  - pipeline: build_aggregates
    description: "Build all aggregate tables from fact_orders"
    nodes:
      # Daily aggregate at product level
      - name: agg_daily_product_sales
        description: "Daily sales by product"
        read:
          connection: warehouse
          path: fact_orders
          format: parquet

        pattern:
          type: aggregation
          params:
            grain:
              - date_sk
              - product_sk
            measures:
              - name: total_revenue
                expr: "SUM(line_total)"
              - name: order_count
                expr: "COUNT(*)"
              - name: total_quantity
                expr: "SUM(quantity)"
              - name: avg_order_value
                expr: "AVG(line_total)"
            having: "SUM(line_total) > 0"
            audit:
              load_timestamp: true

        write:
          connection: warehouse
          path: agg_daily_product_sales
          format: parquet
          mode: overwrite

      # Daily summary (no product breakdown)
      - name: agg_daily_sales
        description: "Daily sales summary"
        read:
          connection: warehouse
          path: fact_orders
          format: parquet

        pattern:
          type: aggregation
          params:
            grain: [date_sk]
            measures:
              - name: total_revenue
                expr: "SUM(line_total)"
              - name: order_count
                expr: "COUNT(*)"
              - name: unique_products
                expr: "COUNT(DISTINCT product_sk)"
              - name: unique_customers
                expr: "COUNT(DISTINCT customer_sk)"
              - name: avg_order_value
                expr: "AVG(line_total)"
            audit:
              load_timestamp: true

        write:
          connection: warehouse
          path: agg_daily_sales
          format: parquet
          mode: overwrite

      # Product summary (no date breakdown)
      - name: agg_product_sales
        description: "Product sales summary"
        read:
          connection: warehouse
          path: fact_orders
          format: parquet

        pattern:
          type: aggregation
          params:
            grain: [product_sk]
            measures:
              - name: total_revenue
                expr: "SUM(line_total)"
              - name: order_count
                expr: "COUNT(*)"
              - name: total_quantity
                expr: "SUM(quantity)"
              - name: unique_customers
                expr: "COUNT(DISTINCT customer_sk)"
            audit:
              load_timestamp: true

        write:
          connection: warehouse
          path: agg_product_sales
          format: parquet
          mode: overwrite

Using Aggregates in Queries¶

Dashboard Query: Daily Revenue Trend¶

SELECT
    a.date_sk,
    d.full_date,
    d.day_of_week,
    a.total_revenue,
    a.order_count
FROM agg_daily_sales a
JOIN dim_date d ON a.date_sk = d.date_sk
ORDER BY a.date_sk;

full_date	day_of_week	total_revenue	order_count
2024-01-15	Monday	1439.96	3
2024-01-16	Tuesday	589.95	3
2024-01-17	Wednesday	749.98	2
2024-01-18	Thursday	983.94	2
2024-01-19	Friday	269.98	2
2024-01-20	Saturday	1449.94	2
2024-01-21	Sunday	549.97	2
2024-01-22	Monday	899.97	2
2024-01-23	Tuesday	537.96	2
2024-01-24	Wednesday	349.97	2
2024-01-25	Thursday	1389.96	2
2024-01-26	Friday	349.97	2
2024-01-27	Saturday	749.98	2
2024-01-28	Sunday	491.97	2

Dashboard Query: Top Products¶

SELECT
    p.name AS product_name,
    p.category,
    a.total_revenue,
    a.order_count,
    a.total_quantity
FROM agg_product_sales a
JOIN dim_product p ON a.product_sk = p.product_sk
ORDER BY a.total_revenue DESC
LIMIT 5;

product_name	category	total_revenue	order_count	total_quantity
Laptop Pro 15	Electronics	3899.97	3	3
Standing Desk	Furniture	1799.97	3	3
Monitor 27"	Electronics	1599.96	3	4
Office Chair	Furniture	999.96	3	4
Mechanical Keyboard	Electronics	599.96	3	4

What You Learned¶

In this tutorial, you learned:

Pre-aggregation speeds up reporting by reducing data volume
Grain defines the uniqueness (GROUP BY columns) of aggregate rows
Measures define what to compute (SUM, COUNT, AVG, etc.)
Complex expressions can be used within aggregations
Replace strategy is best for most incremental scenarios
Sum strategy works only for purely additive metrics on append-only data
Aggregate tables are joined back to dimensions for rich reporting

Next Steps¶

Now let's put it all together with a complete star schema example.

Next: Full Star Schema Tutorial

Previous	Up	Next
Fact Pattern	Tutorials	Full Star Schema

Reference¶

For complete parameter documentation, see: Aggregation Pattern Reference