Pattern: Append-Only Raw Layer (Landing → Raw)¶
Status: Core Pattern
Layer: Raw (Bronze)
Engine: Spark Structured Streaming or Batch
Write Mode: append
Problem¶
You have source data coming from multiple places (SQL databases, APIs, files). You need to: - Preserve the complete history of what arrived and when - Enable replay/reconstruction if downstream logic breaks - Avoid data loss from overwriting mistakes
Solution¶
Append every piece of data you receive to the Raw layer without modification. Raw is the immutable audit log.
Key Principles¶
- Raw is Sacred → Never delete or overwrite Raw data
- Append-Only → Each run appends new rows (possibly duplicates)
- One-to-One Copy → Raw mirrors the source schema, nothing more
- Permanent Retention → Keep forever; storage is cheap
Pattern Flow¶
Source System (SQL, API, Files)
↓
Read Data
↓
Append to Raw
(append mode)
↓
Raw Layer (history preserved)
↓
[Next: Merge into Silver for dedup/cleanup]
Example: SQL Source → Raw Layer¶
Scenario¶
You have a SQL table dbo.orders with new/updated rows arriving daily.
First Run (Day 1):
Source (dbo.orders):
order_id | product | qty | created_at
---------|--------------|-----|-------------------
1 | Widget A | 10 | 2025-11-01 10:00
2 | Widget B | 5 | 2025-11-01 11:00
3 | Widget A | 2 | 2025-11-01 12:00
After Append to Raw:
Raw.orders (Delta Table)
order_id | product | qty | created_at
---------|--------------|-----|-------------------
1 | Widget A | 10 | 2025-11-01 10:00
2 | Widget B | 5 | 2025-11-01 11:00
3 | Widget A | 2 | 2025-11-01 12:00
Second Run (Day 2):
Source now has 2 new rows + 1 duplicate:
order_id | product | qty | created_at
---------|--------------|-----|-------------------
1 | Widget A | 10 | 2025-11-01 10:00 ← Already seen
2 | Widget B | 5 | 2025-11-01 11:00 ← Already seen
3 | Widget A | 2 | 2025-11-01 12:00 ← Already seen
4 | Widget C | 8 | 2025-11-02 09:00 ← NEW
5 | Widget B | 12 | 2025-11-02 10:00 ← NEW
After Append to Raw:
Raw.orders (all 8 rows)
order_id | product | qty | created_at
---------|--------------|-----|-------------------
1 | Widget A | 10 | 2025-11-01 10:00
2 | Widget B | 5 | 2025-11-01 11:00
3 | Widget A | 2 | 2025-11-01 12:00
1 | Widget A | 10 | 2025-11-01 10:00 ← DUPLICATE (OK)
2 | Widget B | 5 | 2025-11-01 11:00 ← DUPLICATE (OK)
3 | Widget A | 2 | 2025-11-01 12:00 ← DUPLICATE (OK)
4 | Widget C | 8 | 2025-11-02 09:00
5 | Widget B | 12 | 2025-11-02 10:00
Note: Duplicates in Raw are OK. Silver's merge will deduplicate them.
Odibi YAML¶
Using High Water Mark (Incremental)¶
pipelines:
- pipeline: sql_to_raw
layer: bronze
nodes:
- id: load_orders_raw
name: "Load Orders to Raw (Bronze)"
read:
connection: sql_prod
format: sql
table: dbo.orders
options:
query: |
SELECT * FROM dbo.orders
WHERE COALESCE(updated_at, created_at) > (
SELECT COALESCE(MAX(COALESCE(updated_at, created_at)), '1900-01-01')
FROM raw.orders
)
write:
connection: adls_prod
format: delta
table: raw.orders
mode: append
Using Scheduled Batch (Full Rescan)¶
If your source doesn't have timestamps, rescan everything and append:
- id: load_orders_raw
name: "Load Orders to Raw"
read:
connection: sql_prod
format: sql
table: dbo.orders
write:
connection: adls_prod
format: delta
table: raw.orders
mode: append
Warning: This will append the entire table each run, creating duplicates. Use High Water Mark pattern when possible.
Trade-Offs¶
Advantages¶
✓ Complete audit trail (when did each row arrive?)
✓ Safe replay/reconstruction if Silver breaks
✓ Simple logic (no deduplication, no logic)
✓ Idempotent (rerun is safe; creates duplicates but that's OK)
Disadvantages¶
✗ Storage grows unbounded (mitigated by cheap cloud storage)
✗ Duplicates must be handled downstream (Silver's job)
✗ Full rescans can be slow without timestamp filtering
When to Use¶
Always use Append-Only for Raw ingestion. No exceptions.
When NOT to Use¶
Never overwrite Raw. Never delete from Raw. Never use merge in Raw.
Related Patterns¶
- High Water Mark → Efficiently filter SQL sources for incremental reads
- Merge/Upsert → Deduplicate Raw data in Silver