# Bronze Layer Tutorial
The Bronze Layer is where raw data lands. No transformations, no cleaning: just reliable ingestion with traceability.
## Layer Philosophy
"Raw is sacred. Preserve everything, trust nothing."
Bronze is your insurance policy. If downstream logic has bugs, you can always reprocess from Bronze.
| Principle | Why |
|---|---|
| Append-only | Never lose source data |
| Schema as-is | Don't transform on ingest |
| Full fidelity | Keep all columns, all rows |
| Traceable | Know when each row arrived |
## Quick Start: File Ingestion
The simplest Bronze pipeline loads files and appends them:
```yaml
# pipelines/bronze/ingest_orders.yaml
pipelines:
  - pipeline: bronze_orders
    layer: bronze
    nodes:
      - name: raw_orders
        read:
          connection: landing
          format: csv
          path: orders/*.csv
          options:
            header: true
        write:
          connection: bronze
          table: raw_orders
          mode: append
```
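Under the hood, this config amounts to an ordinary Spark read-and-append. A rough PySpark equivalent, as a sketch rather than Odibi's actual engine code (the landing path and Delta target name are illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read every CSV in the landing path, keeping the header row as column names
df = (spark.read
      .format("csv")
      .option("header", "true")
      .load("/landing/orders/*.csv"))

# Append as-is into the Bronze Delta table: no transformations on ingest
(df.write
   .format("delta")
   .mode("append")
   .saveAsTable("bronze.raw_orders"))
```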
## Common Problems & Solutions
1. "I'm reprocessing files I've already loaded"¶
Problem: Each run loads all files, creating duplicates.
Solution: Use stateful incremental tracking with a high-water mark column.
```yaml
nodes:
  - name: raw_orders
    read:
      connection: landing
      format: csv
      path: orders/*.csv
      incremental:
        mode: stateful              # Track HWM (high-water mark)
        column: file_modified_date  # Column to track
    write:
      connection: bronze
      table: raw_orders
      mode: append
```
How it works:
- Odibi records the MAX value of the configured `column` after each run
- On the next run, only rows with values greater than the stored HWM are processed
- For a time-based lookback instead, use `mode: rolling_window`
See: Incremental Loading Pattern
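For intuition, here is a minimal PySpark sketch of the HWM mechanism. Odibi persists its state internally; the JSON file below is a hypothetical stand-in, and the path and column names are illustrative:

```python
import json
import os

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical state store standing in for Odibi's internal HWM tracking
STATE = "/tmp/hwm_raw_orders.json"
hwm = json.load(open(STATE))["hwm"] if os.path.exists(STATE) else None

df = spark.read.format("csv").option("header", "true").load("/landing/orders/*.csv")

# Only rows past the stored high-water mark are processed
if hwm is not None:
    df = df.filter(F.col("file_modified_date") > F.lit(hwm))

df.write.format("delta").mode("append").saveAsTable("bronze.raw_orders")

# Record the new MAX of the tracked column after a successful write
new_hwm = df.agg(F.max("file_modified_date")).first()[0]
if new_hwm is not None:
    with open(STATE, "w") as f:
        json.dump({"hwm": str(new_hwm)}, f)
```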
2. "Files have inconsistent schemas"¶
Problem: New files have extra/missing columns, breaking the pipeline.
Solution: Enable Delta schema evolution on write.
```yaml
nodes:
  - name: raw_orders
    read:
      connection: landing
      format: csv
      path: orders/*.csv
      options:
        header: true
    write:
      connection: bronze
      table: raw_orders
      mode: append
      options:
        mergeSchema: true   # Delta: allow schema evolution
```
How it works:
- Spark infers schema from each file
- Delta's `mergeSchema` option adds new columns to the target table automatically
- Odibi tracks schema changes in the System Catalog
See: Schema Tracking
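The write-side behavior maps directly onto Delta Lake's documented `mergeSchema` option. A minimal PySpark sketch (path and table names are illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Schema is inferred per file; a new file may carry extra columns
df = spark.read.format("csv").option("header", "true").load("/landing/orders/*.csv")

(df.write
   .format("delta")
   .mode("append")
   .option("mergeSchema", "true")  # new columns are added to the target schema
   .saveAsTable("bronze.raw_orders"))
```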
3. "Malformed records crash the pipeline"¶
Problem: One bad CSV row fails the entire load.
Solution: Route bad records to an error path using Spark options.
```yaml
nodes:
  - name: raw_orders
    read:
      connection: landing
      format: csv
      path: orders/*.csv
      options:
        mode: PERMISSIVE   # Don't fail on bad rows
        columnNameOfCorruptRecord: _corrupt_record
        badRecordsPath: /landing/errors/orders/
    write:
      connection: bronze
      table: raw_orders
      mode: append
```
Result:
- Valid rows load normally
- Corrupt rows are written to `badRecordsPath` for investigation
- The pipeline doesn't fail
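A minimal PySpark sketch of the read side. The `order_id`/`amount` columns are made-up examples; note that `badRecordsPath` is a Databricks-specific option, while `PERMISSIVE` and `columnNameOfCorruptRecord` are standard Spark CSV options that require the corrupt-record column to appear in an explicit schema:

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StringType, StructField, StructType

spark = SparkSession.builder.getOrCreate()

# To capture bad rows from CSV, the read schema must include the corrupt-record column
schema = StructType([
    StructField("order_id", StringType()),
    StructField("amount", StringType()),
    StructField("_corrupt_record", StringType()),
])

df = (spark.read
      .format("csv")
      .schema(schema)
      .option("header", "true")
      .option("mode", "PERMISSIVE")
      .option("columnNameOfCorruptRecord", "_corrupt_record")
      .load("/landing/orders/*.csv")
      .cache())  # Spark requires caching before querying only the corrupt-record column

df.filter(F.col("_corrupt_record").isNotNull()).show()  # inspect the rejects
```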
4. "Empty source files break downstream"¶
Problem: Source sends empty files, causing downstream failures.
Solution: Add a row count contract.
```yaml
nodes:
  - name: raw_orders
    read:
      connection: landing
      format: csv
      path: orders/*.csv
    contracts:
      - type: row_count
        min: 1           # Fail if empty
        on_fail: fail
    write:
      connection: bronze
      table: raw_orders
      mode: append
```
`on_fail` options:
| Option | Behavior |
|--------|----------|
| fail | Fail the node |
| warn | Log warning, continue |
See: Contracts
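The contract itself amounts to a simple count check before the write. A minimal sketch of the semantics in PySpark (not Odibi's implementation; the path is illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.read.format("csv").option("header", "true").load("/landing/orders/*.csv")

ROW_COUNT_MIN = 1  # min: 1
n = df.count()
if n < ROW_COUNT_MIN:
    # on_fail: fail -> abort the node before anything is written
    raise ValueError(f"row_count contract failed: {n} < {ROW_COUNT_MIN}")
```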
5. "Source volume dropped 90%βsomething's wrong"¶
Problem: Upstream system broke, sending almost no data.
Solution: Add volume drop detection.
```yaml
nodes:
  - name: raw_orders
    read:
      connection: landing
      format: csv
      path: orders/*.csv
    contracts:
      - type: volume_drop
        threshold: 0.5      # Fail if <50% of previous run
        lookback_days: 3    # Days of history to average
        on_fail: fail
    write:
      connection: bronze
      table: raw_orders
      mode: append
```
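In spirit, the check compares the current run's row count against a baseline averaged over recent runs. A minimal sketch, with the run history stubbed out as a hypothetical list (Odibi reads it from its own run metadata):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.read.format("csv").option("header", "true").load("/landing/orders/*.csv")

# Hypothetical stand-in for run history over the lookback window (3 days)
previous_counts = [10_000, 9_800, 10_200]
baseline = sum(previous_counts) / len(previous_counts)

current = df.count()
if current < 0.5 * baseline:  # threshold: 0.5
    raise ValueError(
        f"volume_drop contract failed: {current} rows vs baseline {baseline:.0f}"
    )
```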
6. "I need to reprocess a specific date range"¶
Problem: Bug in source data, need to reload specific dates.
Solution: Use Delta's partition replacement.
```yaml
nodes:
  - name: raw_orders
    read:
      connection: landing
      format: csv
      path: orders/*.csv
    write:
      connection: bronze
      table: raw_orders
      mode: overwrite
      options:
        replaceWhere: "file_date >= '2025-01-01' AND file_date <= '2025-01-15'"
        partitionBy: [file_date]
```
How it works:
- `replaceWhere` only replaces the matching partitions
- The rest of the table remains unchanged
- Useful for targeted reloads without a full table rebuild
See: Windowed Reprocess Pattern
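The `replaceWhere` option is standard Delta Lake. A minimal PySpark sketch of the same targeted reload (paths and table names are illustrative; the table is assumed to already be partitioned by `file_date`):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Corrected files for the affected window (illustrative path)
fixed = spark.read.format("csv").option("header", "true").load("/landing/orders/*.csv")

# Overwrites ONLY the partitions matched by replaceWhere; everything else is untouched
(fixed.write
   .format("delta")
   .mode("overwrite")
   .option("replaceWhere", "file_date >= '2025-01-01' AND file_date <= '2025-01-15'")
   .saveAsTable("bronze.raw_orders"))
```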
7. "I'm loading from SQL Server, not files"¶
Problem: Source is a database table, not files.
Solution: Use SQL read with stateful incremental.
```yaml
connections:
  source_db:
    type: sql_server
    host: server.database.windows.net
    database: sales
    auth:
      mode: sql_login
      username: "${DB_USER}"
      password: "${DB_PASSWORD}"

nodes:
  - name: raw_orders
    read:
      connection: source_db
      format: jdbc
      table: dbo.orders
      incremental:
        mode: stateful      # Track HWM
        column: updated_at  # Track by this column
    write:
      connection: bronze
      table: raw_orders
      mode: append
```
How it works:
- First run: loads all data, stores `MAX(updated_at)` as the HWM
- Next runs: load only rows `WHERE updated_at > stored_hwm`
- The HWM is updated after each successful run
See: Incremental Loading Pattern
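On the JDBC side, the HWM filter typically becomes a pushdown subquery, so only new rows ever leave the database. A minimal PySpark sketch (the HWM value and connection URL are illustrative):

```python
import os

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

hwm = "2025-01-10T00:00:00"  # previous run's stored HWM (illustrative value)

df = (spark.read
      .format("jdbc")
      .option("url", "jdbc:sqlserver://server.database.windows.net;databaseName=sales")
      # Pushdown subquery: SQL Server filters before anything crosses the wire
      .option("dbtable", f"(SELECT * FROM dbo.orders WHERE updated_at > '{hwm}') AS src")
      .option("user", os.environ["DB_USER"])
      .option("password", os.environ["DB_PASSWORD"])
      .load())

df.write.format("delta").mode("append").saveAsTable("bronze.raw_orders")
```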
## Bronze Layer Checklist
Before moving to Silver, verify:
- [ ] Append-only? Raw data is never overwritten (except intentional reprocess)
- [ ] Incremental? Only new/changed data is loaded each run
- [ ] Traceable? Each row has arrival metadata (`_loaded_at`, source file, etc.)
- [ ] Contracts? Row count, schema, or volume checks in place
- [ ] Error handling? Bad records routed to error path, not failing pipeline
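If your pipeline doesn't already stamp arrival metadata, a minimal PySpark sketch of the traceability columns (the names `_loaded_at` and `_source_file` follow the checklist above; `current_timestamp()` and `input_file_name()` are standard Spark functions):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.read.format("csv").option("header", "true").load("/landing/orders/*.csv")

df = (df
      .withColumn("_loaded_at", F.current_timestamp())   # when the row arrived
      .withColumn("_source_file", F.input_file_name()))  # which file it came from
```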
## Next Steps
- Silver Layer Tutorial → Clean and transform Bronze data
- Append-Only Raw Pattern → Detailed pattern docs
- Getting Started → End-to-end first pipeline