Skip to content

Medallion Architecture Guide

A beginner-friendly guide to understanding Bronze, Silver, and Gold data layers.


What is Medallion Architecture?

Medallion Architecture organizes your data into three layers, like refining raw ore into polished gold:

┌─────────────────────────────────────────────────────────────────────┐
│                                                                     │
│   Source Systems          Bronze           Silver           Gold   │
│   ──────────────          ──────           ──────           ────   │
│                                                                     │
│   [SQL Server] ──────►  [Raw Copy]  ───►  [Cleaned]  ───►  [Facts] │
│   [API]        ──────►  [Raw Copy]  ───►  [Cleaned]  ───►  [Dims]  │
│   [Files]      ──────►  [Raw Copy]  ───►  [Cleaned]  ───►  [KPIs]  │
│                                                                     │
│   "Just land it"      "Fix it"         "Make it                    │
│                                          business-ready"           │
└─────────────────────────────────────────────────────────────────────┘

Think of it like cooking: - Bronze = Raw ingredients from the store (as-is, untouched) - Silver = Ingredients washed, chopped, and prepped (cleaned, standardized) - Gold = The finished dish ready to serve (combined, calculated, business-ready)


Layer 1: Bronze (Raw Data)

Purpose

Land data exactly as it comes from the source. No transformations. Just copy it.

What Happens Here

Operation Example Why
Raw ingestion Copy SQL table to Delta Preserve original data
Add metadata _extracted_at timestamp Track when data arrived
Schema preservation Keep all columns, even unused Don't lose anything

What Does NOT Happen Here

❌ No cleaning
❌ No transformations
❌ No joins
❌ No filtering (except maybe date ranges for incremental loads)

Example Bronze Node

- name: bronze_sales_orders
  read:
    connection: erp_database
    table: dbo.SalesOrders
  write:
    connection: datalake
    path: bronze/sales_orders
    format: delta

The Golden Rule of Bronze

"If the source system has garbage data, Bronze has garbage data. That's okay."


Layer 2: Silver (Cleaned & Conformed)

Purpose

Clean and standardize ONE source at a time. Make it trustworthy.

The Key Question

"Could this node run if only ONE source system existed?"

If YES → Silver ✓
If NO → Probably Gold

Reference Tables Are Allowed in Silver

The One-Source Test refers to business source systems, not reference/lookup data.

Silver CAN join with:

  • Reference/lookup tables (code mappings, static lists)
  • Dimension lookups for enrichment (product_code → product_name)
  • Self-joins within the same source

Silver should NOT join:

  • Multiple business source systems (SAP + Salesforce → use Gold)

What Happens Here

1. Data Cleaning

Fixing problems in the source data.

Operation Example Category
Deduplication ROW_NUMBER() OVER (PARTITION BY id ORDER BY updated_at DESC) Cleaning
Remove bad characters REPLACE(name, '"', '') Cleaning
Fix typos REPLACE(status, 'Actve', 'Active') Cleaning
Handle nulls COALESCE(middle_name, '') Cleaning
Trim whitespace TRIM(customer_name) Cleaning
-- Example: Cleaning product codes
CASE
    WHEN LEFT(REPLACE(product_code, '"', ''), 1) = 'X'
    THEN SUBSTRING(REPLACE(product_code, '"', ''), 2)
    ELSE REPLACE(product_code, '"', '')
END AS product_code

2. Type Casting & Standardization

Making data types consistent.

Operation Example Category
Cast types CAST(date_string AS DATE) Standardization
Parse timestamps to_timestamp(date_col) Standardization
Unit conversion hours * 60 AS duration_minutes Standardization
Standardize casing UPPER(country_code) Standardization
-- Example: Standardizing dates
to_timestamp(order_date) AS order_date,
DATEDIFF(to_date(order_date), to_date('2020-01-01')) + 1 AS date_id

3. Conforming to Standard Schema

Mapping source-specific values to enterprise-standard values.

Operation Example Category
Code mapping 'M1' → 'Machine 1' Conforming
Category standardization 'Sched%' → 'Scheduled' Conforming
Rename columns cust_id AS customer_id Conforming
Add source context 'West Region' AS region_name Conforming
-- Example: Mapping source codes to standard names
CASE
    WHEN machine_code = 'M1' THEN 'Machine 1'
    WHEN machine_code = 'M2' THEN 'Machine 2'
    WHEN machine_code = 'M3' THEN 'Machine 3'
END AS machine_name,

CASE
    WHEN category LIKE '%Sched%' THEN 'Scheduled'
    WHEN category LIKE '%Maint%' THEN 'Maintenance'
    WHEN category LIKE '%Breakdown%' THEN 'Unplanned'
    ELSE 'Other'
END AS downtime_category

4. Enrichment via Lookups

Adding dimension attributes from reference tables.

Operation Example Category
Join to calendar Get date_id from date dimension Enrichment
Join to location Get location_id from location dimension Enrichment
Join to reason codes Get reason_id from reason lookup Enrichment
Join to product master Get product_name from product dim Enrichment
-- Example: Enriching with dimension lookups
SELECT
    e.event_id,
    e.event_date,
    e.duration_minutes,
    r.reason_id,           -- From reason code lookup
    l.location_id          -- From location dimension
FROM events e
LEFT JOIN reason_codes r
    ON r.category = e.category 
LEFT JOIN dim_location l
    ON e.site_code = l.site_code

5. Soft Delete Detection

Tracking records that exist in Bronze but no longer exist in the source.

Operation Example Category
Compare snapshots Find missing keys Delete Detection
Flag deleted records _is_deleted = true Delete Detection
Filter active records WHERE _is_deleted = false Delete Detection

What Does NOT Happen Here

❌ Combining data from multiple source systems
❌ Business calculations (like KPIs, ratios)
❌ Aggregations for reporting
❌ Creating facts that span multiple sources

Example Silver Node

- name: cleaned_warehouse_events
  inputs:
    input_name: $bronze.warehouse_event_log
  depends_on:
    - cleaned_reason_codes     # Lookup table
    - cleaned_dim_location     # Dimension table
  transformer: deduplicate
  params:
    keys: [event_id]
    order_by: "_extracted_at DESC"
  transform:
    steps:
      - sql: |
          SELECT
              to_timestamp(event_date) AS event_date,
              DATEDIFF(to_date(event_date), '2020-01-01') + 1 AS date_id,
              'Warehouse A' AS location_name,
              CASE WHEN machine = 'M1' THEN 'Machine 1' ... END AS machine_name,
              duration_hours * 60 AS duration_minutes
          FROM df
      - function: detect_deletes
        params:
          mode: sql_compare
          keys: [event_id]

The Golden Rule of Silver

"One source in, one cleaned version out. The output should be the best possible version of that single source."


Layer 3: Gold (Business-Ready)

Purpose

Combine cleaned Silver data into business-meaningful outputs.

The Key Question

"Does this require data from MULTIPLE Silver sources?"

If YES → Gold ✓
If NO → Probably Silver

What Happens Here

1. Combining Multiple Sources (UNION)

Merging the same type of data from different systems.

Operation Example Category
Union facts Combine events from System A + System B + System C Combining
Reconciliation UNION (not UNION ALL) to dedupe across sources Combining
Cross-system dedup Same event recorded in multiple systems Combining
-- Example: Combining events from all sources
SELECT date_id, location_id, duration_minutes, notes
FROM cleaned_system_a_events

UNION ALL

SELECT date_id, location_id, duration_minutes, notes
FROM cleaned_system_b_events

UNION ALL

SELECT date_id, location_id, duration_minutes, notes
FROM cleaned_system_c_events

2. Business Calculations

Applying business definitions and formulas.

Operation Example Category
Define metrics total_output = COALESCE(revised_qty, original_qty) Business Rule
Calculate KPIs efficiency = actual_output / expected_output * 100 Business Rule
Apply business logic "If negative, treat as zero" Business Rule
Default values "Use default reason if null and duration < 10 min" Business Rule
-- Example: Business definition of Total Output
COALESCE(
    CASE
        WHEN COALESCE(revised_quantity, original_quantity) <= 0 THEN 0
        ELSE COALESCE(revised_quantity, original_quantity)
    END, 
0) AS total_output

This is Gold because it answers: "What does 'output' MEAN to the business?"

3. Cross-Fact Joins

Joining multiple fact tables together.

Operation Example Category
Join facts Production + Downtime + Quality → Efficiency Cross-Fact
Build wide tables Denormalized reporting tables Cross-Fact
Calculate ratios Downtime / Available Hours Cross-Fact
-- Example: Joining facts for efficiency calculation
SELECT 
    c.date_id,
    c.location_id,
    p.total_output,
    d.downtime_minutes,
    q.defect_count,
    -- Efficiency uses multiple facts
    (p.total_output / p.target_output) * 100 AS efficiency_pct
FROM calendar_scaffold c
LEFT JOIN combined_production p 
    ON c.date_id = p.date_id AND c.location_id = p.location_id
LEFT JOIN combined_downtime d 
    ON c.date_id = d.date_id AND c.location_id = d.location_id
LEFT JOIN combined_quality q 
    ON c.date_id = q.date_id AND c.location_id = q.location_id

4. Aggregations for Reporting

Pre-computing summaries for dashboards and reports.

Operation Example Category
Daily rollups SUM(production) GROUP BY date, location Aggregation
Weekly summaries AVG(efficiency) by week Aggregation
YTD calculations Running totals Aggregation

5. Derived Dimensions

Creating dimensions that don't exist in source systems.

Operation Example Category
Date spine Calendar × Locations for all combinations Derived Dim
Distinct lists All locations with any activity Derived Dim
-- Example: Create all Date × Location combinations
SELECT *
FROM dim_calendar
CROSS JOIN distinct_locations

Example Gold Node

- name: fact_daily_efficiency
  description: "Daily efficiency metrics by location"
  depends_on:
    - combined_production   # Multiple sources unioned
    - combined_downtime     # Multiple sources unioned
    - combined_quality      # Multiple sources unioned
    - calendar_scaffold     # Date × Location scaffold
  transform:
    steps:
      - sql: |
          SELECT
              c.date_id,
              c.location_id,
              COALESCE(p.total_output, 0) AS output_units,
              COALESCE(d.downtime_minutes, 0) AS downtime_min,
              COALESCE(q.defect_count, 0) AS defects,
              -- Efficiency Calculation (Business Formula)
              CASE 
                  WHEN p.target_output > 0 
                  THEN (p.total_output / p.target_output) * 100
                  ELSE 0 
              END AS efficiency_pct
          FROM calendar_scaffold c
          LEFT JOIN combined_production p 
              ON c.date_id = p.date_id AND c.location_id = p.location_id
          LEFT JOIN combined_downtime d 
              ON c.date_id = d.date_id AND c.location_id = d.location_id
          LEFT JOIN combined_quality q 
              ON c.date_id = q.date_id AND c.location_id = q.location_id
  write:
    connection: gold
    table: fact_daily_efficiency
    format: delta

The Golden Rule of Gold

"This is what the business sees. Every row and column should have business meaning."


Quick Reference: Where Does This Belong?

By Operation Type

Operation Bronze Silver Gold
Raw ingestion
Add _extracted_at
Deduplication
Remove bad characters
Fix typos
Type casting
Unit conversion
Map codes to standard names
Join to dimension/lookup tables
Soft delete detection
UNION multiple sources
Business calculations
Cross-fact joins
Aggregations for reporting
KPI definitions

By Question

Question Layer
"How do I get data from the source?" Bronze
"How do I fix this source's data quality issues?" Silver
"How do I standardize this source to our schema?" Silver
"How do I look up IDs from a dimension table?" Silver
"How do I combine data from System A + B + C?" Gold
"What does 'Total Output' mean to the business?" Gold
"How do I calculate efficiency?" Gold
"What should the dashboard show?" Gold

The One-Source Test

┌─────────────────────────────────────────────────────────────┐
│                                                             │
│   "Could this node work with only ONE source system?"       │
│                                                             │
│   YES ──────────────────────────────► SILVER                │
│    │                                                        │
│    │   Examples:                                            │
│    │   • Cleaning System A data                             │
│    │   • Joining System B data to calendar dimension        │
│    │   • Mapping System C codes to standard categories      │
│    │                                                        │
│   NO ───────────────────────────────► GOLD                  │
│    │                                                        │
│    │   Examples:                                            │
│    │   • Combining all event sources                        │
│    │   • Calculating efficiency from production + downtime  │
│    │   • Creating unified fact tables                       │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Common Mistakes

❌ Mistake 1: Business Logic in Silver

# WRONG - Business calculation in Silver
- name: cleaned_production
  transform:
    steps:
      - sql: |
          SELECT 
              *,
              (actual / target) * 100 AS efficiency  -- Business formula!
          FROM df

Why it's wrong: Efficiency is a business definition. Silver should just clean the data.

Fix: Move the efficiency calculation to Gold.

❌ Mistake 2: Raw Data in Silver

# WRONG - No Bronze layer, reading directly from source
- name: cleaned_orders
  read:
    connection: erp
    table: dbo.Orders  # Reading directly from source!
  transform:
    steps:
      - sql: SELECT * FROM df WHERE status != 'DELETED'

Why it's wrong: If the source changes, you lose the original data.

Fix: Add a Bronze layer that preserves the raw data first.

❌ Mistake 3: Combining Sources in Silver

# WRONG - UNION in Silver
- name: cleaned_all_events
  depends_on:
    - cleaned_system_a_events
    - cleaned_system_b_events
  transform:
    steps:
      - sql: |
          SELECT * FROM cleaned_system_a_events
          UNION ALL
          SELECT * FROM cleaned_system_b_events  -- Combining sources!

Why it's wrong: Silver should process one source at a time.

Fix: Move the UNION to a Gold layer node.


Project Structure Example

pipelines/
├── bronze/
│   └── bronze.yaml
│       # Nodes: bronze_system_a, bronze_system_b, bronze_system_c
├── silver/
│   └── silver.yaml
│       # Nodes: cleaned_system_a, cleaned_system_b, cleaned_system_c
│       # Each cleans ONE source
└── gold/
    └── gold.yaml
        # Nodes: combined_events, combined_production, fact_daily_efficiency
        # Combines Silver outputs, applies business logic

Summary

Layer Input Output Key Activities
Bronze Source systems Raw copy Ingest, add metadata
Silver Bronze (one source) Cleaned version Clean, standardize, enrich with lookups
Gold Silver (multiple sources) Business facts Combine, calculate, aggregate

Remember: - Bronze = "Land it as-is" - Silver = "Clean this ONE source" - Gold = "Combine and calculate for business"