Observability Tables¶
Auto-populating observability tables for leadership dashboards with zero manual effort.
Overview¶
Odibi automatically populates observability tables on every pipeline run, enabling Power BI dashboards for leadership without manual intervention.
What Leadership Gets¶
| Dashboard | Source Table | Auto-Updated |
|---|---|---|
| Platform Health | meta_pipeline_health |
✅ Every run |
| Cost Trends | meta_daily_stats |
✅ Every run |
| SLA Compliance | meta_sla_status |
✅ Every run |
| Failure Analysis | meta_failures |
✅ On failure |
Key Features¶
- Zero-touch: Tables auto-populate on pipeline completion
- Exactly-once: Guard table prevents duplicate updates
- Engine parity: Works on Spark, Pandas/delta-rs, and SQL Server
- Failure-resilient: Observability errors never fail pipelines
Table Taxonomy¶
Observability tables are divided into two categories:
Fact Tables (Append-Only)¶
Immutable records that capture what happened. Never modified after initial write.
| Table | Purpose | Granularity |
|---|---|---|
meta_pipeline_runs |
Pipeline execution log | One row per pipeline execution |
meta_node_runs |
Node execution log | One row per node execution |
meta_failures |
Failure details | One row per failure event |
meta_observability_errors |
Observability system failures | One row per observability failure |
meta_derived_applied_runs |
Idempotency guard | One row per (derived_table, run_id) |
Derived Tables (Incrementally Maintained)¶
Aggregated views that are upserted on each pipeline completion.
| Table | Purpose | Update Trigger |
|---|---|---|
meta_daily_stats |
Daily aggregates | Upsert on pipeline completion |
meta_pipeline_health |
Current health snapshot | Upsert on pipeline completion |
meta_sla_status |
Freshness compliance | Upsert on pipeline completion |
Configuration¶
Pipeline-Level Config¶
Enable SLA tracking by adding owner and freshness_sla to your pipeline:
pipelines:
- pipeline: orders_silver
description: "Transform orders to silver layer"
layer: silver
owner: "data-team@company.com" # Pipeline owner for SLA alerts
freshness_sla: "6h" # Expected freshness (6 hours)
freshness_anchor: run_completion # Default: when pipeline last ran
nodes:
# ...
| Field | Type | Required | Description |
|---|---|---|---|
owner |
string | No | Pipeline owner (email or name). Shown in health/SLA dashboards |
freshness_sla |
string | No | Expected data freshness SLA. Format: number + unit where unit is s (seconds), m (minutes), h (hours), or d (days). Examples: 30m, 6h, 1d |
freshness_anchor |
string | No | What defines freshness. Options: run_completion (default), table_max_timestamp, watermark_state |
Freshness SLA Required for SLA Tracking
The meta_sla_status table is only updated if freshness_sla is configured.
System-Level Config¶
Configure cost tracking and retention in the system section:
system:
connection: catalog_storage
path: _odibi_system
cost_per_compute_hour: 2.50 # Estimated cost per compute hour (USD)
retention_days:
daily_stats: 365 # Keep daily stats for 1 year
failures: 90 # Keep failure records for 90 days
observability_errors: 90 # Keep observability errors for 90 days
| Field | Type | Default | Description |
|---|---|---|---|
cost_per_compute_hour |
float | None | Estimated cost per compute hour (USD) for cost tracking |
retention_days.daily_stats |
int | 365 | Days to retain daily stats |
retention_days.failures |
int | 90 | Days to retain failure records |
retention_days.observability_errors |
int | 90 | Days to retain observability errors |
Schema Reference¶
meta_pipeline_runs¶
Pipeline execution log. One row per pipeline execution.
| Column | Type | Description |
|---|---|---|
run_id |
STRING | Primary key (UUID) |
pipeline_name |
STRING | Pipeline name |
owner |
STRING | Pipeline owner (nullable) |
layer |
STRING | Medallion layer (nullable) |
run_start_at |
TIMESTAMP | Execution start time |
run_end_at |
TIMESTAMP | Execution end time |
duration_ms |
BIGINT | Duration in milliseconds |
status |
STRING | SUCCESS or FAILURE |
nodes_total |
INT | Total nodes in pipeline |
nodes_succeeded |
INT | Nodes that succeeded |
nodes_failed |
INT | Nodes that failed |
nodes_skipped |
INT | Nodes that were skipped |
rows_processed |
BIGINT | Sum of terminal node rows (nullable) |
error_summary |
STRING | First 500 chars of error (nullable) |
terminal_nodes |
STRING | Comma-separated terminal node names (nullable) |
environment |
STRING | Environment tag (nullable) |
created_at |
TIMESTAMP | Record creation time |
meta_node_runs¶
Node execution log. One row per node execution.
| Column | Type | Description |
|---|---|---|
run_id |
STRING | FK to pipeline run |
node_id |
STRING | UUID for this node execution |
pipeline_name |
STRING | Pipeline name |
node_name |
STRING | Node name |
status |
STRING | SUCCESS, FAILURE, or SKIPPED |
run_start_at |
TIMESTAMP | Node execution start time |
run_end_at |
TIMESTAMP | Node execution end time |
duration_ms |
BIGINT | Duration in milliseconds |
rows_processed |
BIGINT | Rows processed (nullable) |
metrics_json |
STRING | Flat dict of metrics (scalars only) |
environment |
STRING | Environment tag (nullable) |
created_at |
TIMESTAMP | Record creation time |
meta_failures¶
Failure details. One row per failure event.
| Column | Type | Description |
|---|---|---|
failure_id |
STRING | Primary key (UUID) |
run_id |
STRING | FK to pipeline run |
pipeline_name |
STRING | Pipeline name |
node_name |
STRING | Node name |
error_type |
STRING | Exception class name |
error_message |
STRING | Error message (max 1000 chars) |
error_code |
STRING | Error code for taxonomy (nullable) |
stack_trace |
STRING | Stack trace (max 2000 chars, nullable) |
timestamp |
TIMESTAMP | When failure occurred |
date |
DATE | For partitioning |
meta_observability_errors¶
Observability system failures. Self-heals by logging its own errors.
| Column | Type | Description |
|---|---|---|
error_id |
STRING | Primary key (UUID) |
run_id |
STRING | Pipeline run ID (nullable) |
pipeline_name |
STRING | Pipeline name (nullable) |
component |
STRING | Component that failed (e.g., catalog_update, derived_updates) |
error_message |
STRING | Error message (max 500 chars) |
timestamp |
TIMESTAMP | When error occurred |
date |
DATE | For partitioning |
meta_derived_applied_runs (Guard Table)¶
Idempotency guard for derived table updates. Ensures exactly-once semantics.
| Column | Type | Description |
|---|---|---|
derived_table |
STRING | PK (with run_id): Derived table name |
run_id |
STRING | PK (with derived_table): Pipeline run ID |
claim_token |
STRING | UUID of claiming process |
status |
STRING | CLAIMED, APPLIED, or FAILED |
claimed_at |
TIMESTAMP | When claim was acquired |
applied_at |
TIMESTAMP | When update completed (nullable) |
error_message |
STRING | Error if failed (max 500 chars, nullable) |
meta_daily_stats¶
Daily aggregates. Primary key: (date, pipeline_name).
| Column | Type | Description |
|---|---|---|
date |
DATE | Stats date |
pipeline_name |
STRING | Pipeline name |
runs |
BIGINT | Total runs on this day |
successes |
BIGINT | Successful runs |
failures |
BIGINT | Failed runs |
total_rows |
BIGINT | Total rows processed |
total_duration_ms |
BIGINT | Total execution time |
estimated_cost_usd |
DOUBLE | Estimated cost (nullable) |
actual_cost_usd |
DOUBLE | Actual cost from billing (nullable) |
cost_source |
STRING | configured_rate, databricks_billing, none, or mixed |
cost_is_actual |
BOOLEAN | Whether cost is from billing |
meta_pipeline_health¶
Current health snapshot. Primary key: pipeline_name.
| Column | Type | Description |
|---|---|---|
pipeline_name |
STRING | Pipeline name |
owner |
STRING | Pipeline owner (nullable) |
layer |
STRING | Medallion layer (nullable) |
total_runs |
BIGINT | Lifetime total runs |
total_successes |
BIGINT | Lifetime successes |
total_failures |
BIGINT | Lifetime failures |
success_rate_7d |
DOUBLE | 7-day success rate (nullable) |
success_rate_30d |
DOUBLE | 30-day success rate (nullable) |
avg_duration_ms_7d |
DOUBLE | 7-day average duration (nullable) |
total_rows_30d |
BIGINT | 30-day total rows (nullable) |
estimated_cost_30d |
DOUBLE | 30-day estimated cost (nullable) |
last_success_at |
TIMESTAMP | Last successful run (nullable) |
last_failure_at |
TIMESTAMP | Last failed run (nullable) |
last_run_at |
TIMESTAMP | Most recent run |
updated_at |
TIMESTAMP | Record update time |
meta_sla_status¶
Freshness compliance. Primary key: pipeline_name.
| Column | Type | Description |
|---|---|---|
pipeline_name |
STRING | Pipeline name |
owner |
STRING | Pipeline owner (nullable) |
freshness_sla |
STRING | SLA string (e.g., 6h) |
freshness_anchor |
STRING | run_completion, table_max_timestamp, or watermark_state |
freshness_sla_minutes |
INT | SLA in minutes |
last_success_at |
TIMESTAMP | Last successful run (nullable) |
minutes_since_success |
INT | Minutes since last success (nullable) |
sla_met |
BOOLEAN | Whether SLA is currently met |
hours_overdue |
DOUBLE | Hours overdue if SLA breached (nullable) |
updated_at |
TIMESTAMP | Record update time |
How Auto-Population Works¶
When a pipeline completes, the following sequence occurs:
Pipeline Execution
│
▼
┌───────────────────────────────────────────┐
│ 1. Write Facts (append-only) │
│ • meta_pipeline_runs │
│ • meta_node_runs │
│ • meta_failures (if any) │
└───────────────────────────────────────────┘
│
▼
┌───────────────────────────────────────────┐
│ 2. Update Derived Tables (with guard) │
│ For each derived table: │
│ • Try to claim via guard table │
│ • If claimed, update derived table │
│ • Mark applied or failed │
└───────────────────────────────────────────┘
│
▼
┌───────────────────────────────────────────┐
│ 3. On Any Error │
│ • Log to meta_observability_errors │
│ • Continue pipeline (never fail) │
└───────────────────────────────────────────┘
Guard Semantics (Exactly-Once)¶
The guard table (meta_derived_applied_runs) ensures each derived table update happens exactly once per run.
Status Values¶
| Status | Meaning | Can Reclaim? |
|---|---|---|
CLAIMED |
Update in progress | Yes, if stale (>60 min) |
APPLIED |
Update completed successfully | No (terminal) |
FAILED |
Update failed | Yes (always) |
Claim Lifecycle¶
- Try Claim: Insert
CLAIMEDrow with unique token - On Success: Update to
APPLIED(terminal state) - On Failure: Update to
FAILED(reclaimable) - Stale Claims:
CLAIMEDentries older than 60 minutes are reclaimable
APPLIED is Terminal
Once a row reaches APPLIED status, it can never be reclaimed. This prevents double-counting in derived tables.
Concurrent Safety¶
The guard table uses atomic operations (MERGE in Spark/SQL Server, append+verify in Pandas) to handle concurrent updates safely.
CLI Commands¶
rebuild-summaries¶
Recompute derived tables from fact tables. Use after failures or when derived tables become inconsistent.
# Rebuild specific pipeline since a date
odibi system rebuild-summaries config.yaml --pipeline orders_silver --since 2024-01-01
# Rebuild all pipelines since a date
odibi system rebuild-summaries config.yaml --all --since 2024-01-01
# Custom stale claim threshold (default: 60 minutes)
odibi system rebuild-summaries config.yaml --all --since 2024-01-01 --max-age-minutes 30
| Option | Required | Description |
|---|---|---|
--pipeline |
No* | Specific pipeline to rebuild |
--all |
No* | Rebuild all pipelines |
--since |
Yes | Start date (YYYY-MM-DD) |
--max-age-minutes |
No | Max age for stale CLAIMED entries (default: 60) |
--env |
No | Environment override |
*Must specify either --pipeline or --all
When to use:
- After system outages that left updates incomplete
- After fixing bugs in derived table logic
- When derived tables show incorrect aggregates
- To backfill historical data
cleanup¶
Delete old records based on retention configuration.
# Preview what would be deleted
odibi system cleanup config.yaml --dry-run
# Actually delete old records
odibi system cleanup config.yaml
# With environment override
odibi system cleanup config.yaml --env prod
| Option | Required | Description |
|---|---|---|
--dry-run |
No | Preview without deleting |
--env |
No | Environment override |
Tables affected:
| Table | Retention | Default |
|---|---|---|
meta_daily_stats |
retention_days.daily_stats |
365 days |
meta_failures |
retention_days.failures |
90 days |
meta_observability_errors |
retention_days.observability_errors |
90 days |
Engine Parity¶
All operations work across Spark, Pandas/delta-rs, and SQL Server with semantic equivalence.
| Operation | Spark | Pandas/delta-rs | SQL Server |
|---|---|---|---|
| try_claim | Atomic MERGE | Append + verify | Atomic MERGE |
| mark_applied | UPDATE | Read-modify-write | UPDATE |
| mark_failed | UPDATE | Read-modify-write | UPDATE |
| daily_stats | MERGE + deltas | Groupby + overwrite | MERGE |
| pipeline_health | MERGE + window | Filter + overwrite | MERGE + CTE |
| sla_status | SQL CTE | Python datetime | DATEDIFF |
Pandas Mode Limitations
Pandas/delta-rs mode uses optimistic concurrency with retries. Under very high concurrency, some operations may need multiple attempts.
Troubleshooting¶
Derived Updates Failing¶
Symptoms: meta_derived_applied_runs has FAILED entries.
Check status:
-- Find failed updates
SELECT derived_table, run_id, error_message, claimed_at
FROM meta_derived_applied_runs
WHERE status = 'FAILED'
ORDER BY claimed_at DESC
LIMIT 20
Resolution:
Stale CLAIMED Entries¶
Symptoms: Updates stuck in CLAIMED status for >60 minutes.
Cause: Process crashed or was killed before completing.
Resolution:
# Rebuild with shorter stale threshold
odibi system rebuild-summaries config.yaml --all --since 2024-01-01 --max-age-minutes 30
Derived Tables Out of Sync¶
Symptoms: meta_daily_stats doesn't match meta_pipeline_runs aggregates.
Cause: Failed updates, race conditions, or bug in derived logic.
Resolution:
Guard Table Full Scan Performance¶
For large guard tables, consider partitioning by date or periodic cleanup of old APPLIED entries.
Observability Errors¶
Check meta_observability_errors for internal issues:
SELECT component, error_message, COUNT(*) as count
FROM meta_observability_errors
WHERE timestamp > current_date - 7
GROUP BY component, error_message
ORDER BY count DESC
Complete Example¶
project: SalesAnalytics
engine: spark
system:
connection: catalog_storage
path: _odibi_system
environment: prod
cost_per_compute_hour: 2.50
retention_days:
daily_stats: 365
failures: 90
observability_errors: 90
connections:
catalog_storage:
type: azure_adls
account: "${STORAGE_ACCOUNT}"
container: metadata
bronze:
type: azure_adls
account: "${STORAGE_ACCOUNT}"
container: bronze
silver:
type: azure_adls
account: "${STORAGE_ACCOUNT}"
container: silver
pipelines:
- pipeline: orders_silver
description: "Transform orders to silver layer"
layer: silver
owner: "data-team@company.com"
freshness_sla: "6h"
nodes:
- name: read_orders
type: read
connection: bronze
path: raw/orders
format: delta
- name: transform
type: transform
input: read_orders
transform: |
SELECT * FROM {input}
WHERE order_date >= '2024-01-01'
- name: write_orders
type: write
input: transform
connection: silver
path: orders
format: delta
mode: merge
merge_keys: [order_id]
After running this pipeline:
meta_pipeline_runs: New row with execution detailsmeta_node_runs: 3 rows (one per node)meta_daily_stats: Upserted with today's aggregatesmeta_pipeline_health: Upserted with lifetime statsmeta_sla_status: Upserted with freshness compliance
Executive Dashboard Views¶
When using Catalog Sync to replicate data to SQL Server, Odibi automatically creates pre-built views optimized for visualization tools.
Available Views¶
| View | Purpose | Key Metrics |
|---|---|---|
vw_pipeline_health_status |
RAG status per pipeline | health_status (RED/AMBER/GREEN), health_reason |
vw_exec_overview |
Executive summary by project | Success rates (7d/30d/90d), cost trends, reliability score |
vw_table_freshness |
Data staleness monitoring | freshness_status, hours_since_update |
vw_pipeline_sla_status |
SLA compliance dashboard | sla_met, hours_overdue, sla_rag |
vw_exec_current_issues |
What's broken now | Failed pipelines with error details, priority order |
vw_pipeline_risk |
Risk scoring | risk_score, risk_level (CRITICAL/HIGH/MEDIUM/LOW) |
vw_cost_summary |
Cost tracking | 7d/30d costs, runtime hours, cost trends |
Health Status Logic¶
The vw_pipeline_health_status view uses this RAG logic:
| Status | Condition |
|---|---|
| RED | Last run failed, success rate <90%, or no runs in 7 days |
| AMBER | Success rate <100% or no run in 48+ hours |
| GREEN | 100% success rate and recent runs |
SLA Status Logic¶
The vw_pipeline_sla_status view combines SLA tracking with business context:
| RAG | Condition |
|---|---|
| GREEN | SLA met |
| AMBER | SLA breached by ≤1 hour |
| RED | SLA breached by >1 hour or high-criticality pipeline not successful |
Risk Scoring¶
The vw_pipeline_risk view calculates risk as:
Where criticality_weight is 3 (High), 2 (Medium), or 1 (Low).
Business Context (dim_pipeline_context)¶
The dim_pipeline_context table is manually populated to add business metadata to your pipelines. This enriches the executive views with ownership and criticality information.
| Column | Type | Description |
|---|---|---|
project |
STRING | Project name (PK) |
pipeline_name |
STRING | Pipeline name (PK) |
environment |
STRING | Environment (PK) |
business_criticality |
STRING | High, Medium, or Low |
business_owner |
STRING | Business stakeholder name/email |
business_process |
STRING | Business process this pipeline supports |
notes |
STRING | Optional notes |
Example:
INSERT INTO [odibi_system].[dim_pipeline_context]
(project, pipeline_name, environment, business_criticality, business_owner, business_process)
VALUES
('SalesAnalytics', 'orders_silver', 'prod', 'High', 'Jane Smith', 'Daily Sales Reporting'),
('SalesAnalytics', 'inventory_bronze', 'prod', 'Medium', 'Bob Jones', 'Inventory Sync');
Optional but Recommended
Views work without business context (columns will be NULL), but populating this table enables priority-based alerting and risk scoring.
Best Practices¶
- Set owners - Configure
owneron all pipelines for accountability - Define SLAs - Set
freshness_slafor business-critical pipelines - Monitor health - Build dashboards on
meta_pipeline_health - Periodic cleanup - Run
odibi system cleanupweekly/monthly - Check observability errors - Review
meta_observability_errorsregularly - Use rebuild sparingly - Only when derived tables are actually inconsistent
Related¶
- System Catalog - Core catalog tables and configuration
- Alerting - Notifications for pipeline events
- CLI Reference - Full CLI command reference
- Diagnostics - Pipeline debugging tools