Execution Stories

Auto-generated pipeline execution documentation with rich metadata, sample data, and multiple output formats.

Overview

Odibi's Story system provides:

- Execution timeline: Complete record of pipeline runs with timestamps
- Node-level metrics: Duration, row counts, schema changes per node
- Sample data capture: Input/output samples with automatic redaction
- Multiple renderers: HTML, Markdown, JSON output formats
- Themes: Customizable styling for HTML reports
- Retention policies: Automatic cleanup of old stories

Configuration

Basic Story Setup

story:
  connection: "local_data"
  path: "stories/"
  max_sample_rows: 10
  retention_days: 30
  retention_count: 100
  failure_sample_size: 100
  max_failure_samples: 500
  max_sampled_validations: 5

Story Config Options

| Field | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
| connection | string | Yes | - | Connection name for story output |
| path | string | Yes | - | Path for stories (relative to connection base_path) |
| max_sample_rows | int | No | 10 | Maximum rows to include in samples |
| retention_days | int | No | 30 | Days to keep stories before cleanup |
| retention_count | int | No | 100 | Maximum number of stories to retain |
| failure_sample_size | int | No | 100 | Rows to capture per validation failure |
| max_failure_samples | int | No | 500 | Total failed rows across all validations |
| max_sampled_validations | int | No | 5 | After this many validations, show only counts |
| theme | string | No | default | Built-in options: 'default', 'corporate', 'dark', 'minimal', or path to a custom theme YAML file |
| include_samples | bool | No | true | Whether to include data samples |

Remote Storage

Stories can be written to remote storage (ADLS, S3) using fsspec:

story:
  output_path: abfss://container@account.dfs.core.windows.net/stories/
  storage_options:
    account_key: "${STORAGE_ACCOUNT_KEY}"

Story Contents

Each story captures comprehensive execution metadata:

Execution Timeline

| Metric | Description |
|--------|-------------|
| started_at | ISO timestamp when the pipeline started |
| completed_at | ISO timestamp when the pipeline finished |
| duration | Total execution time in seconds |
| run_id | Unique identifier for the run |

Node Results

For each node in the pipeline:

| Metric | Description |
|--------|-------------|
| node_name | Name of the node |
| operation | Operation type (read, transform, write) |
| status | Execution status: success, failed, skipped |
| duration | Node execution time in seconds |
| rows_in | Input row count |
| rows_out | Output row count |
| rows_change | Row count difference |
| rows_change_pct | Percentage change in row count |
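
rows_change and rows_change_pct follow directly from the input and output counts. As a quick illustration of the arithmetic (not odibi's internal code), using the numbers from the example at the end of this page:

rows_in, rows_out = 15500, 15000
rows_change = rows_out - rows_in                         # -500
rows_change_pct = round(100 * rows_change / rows_in, 1)  # -3.2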

Sample Data

Sample data is captured with automatic redaction of sensitive values:

sample_data:
  - order_id: 12345
    customer_email: "[REDACTED]"
    amount: 99.99
  - order_id: 12346
    customer_email: "[REDACTED]"
    amount: 149.99

Configure sample capture:

story:
  max_sample_rows: 5      # Limit sample size
  include_samples: true   # Enable/disable samples
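
For illustration, a minimal sketch of name-based redaction; the pattern list and helper are hypothetical, and odibi's built-in redaction rules may differ:

import re

# Hypothetical pattern list -- odibi's actual rules may differ.
SENSITIVE = re.compile(r"email|phone|password|ssn|token|secret", re.IGNORECASE)

def redact_row(row: dict) -> dict:
    """Replace values in sensitive-looking columns with a placeholder."""
    return {k: ("[REDACTED]" if SENSITIVE.search(k) else v) for k, v in row.items()}

redact_row({"order_id": 12345, "customer_email": "a@b.com", "amount": 99.99})
# {'order_id': 12345, 'customer_email': '[REDACTED]', 'amount': 99.99}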

Schema Changes

Stories track schema evolution:

| Field | Description |
|-------|-------------|
| schema_in | Input column names |
| schema_out | Output column names |
| columns_added | New columns added |
| columns_removed | Columns removed |
| columns_renamed | Renamed columns |
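
columns_added and columns_removed amount to set differences between the two schemas. A minimal sketch (rename detection is heuristic and not shown):

schema_in = ["order_id", "customer_id", "amount", "created_at"]
schema_out = ["order_id", "customer_id", "amount", "amount_usd"]

columns_added = sorted(set(schema_out) - set(schema_in))    # ['amount_usd']
columns_removed = sorted(set(schema_in) - set(schema_out))  # ['created_at']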

Validation Results

Validation warnings and errors are captured:

validation_warnings:
  - "Column 'email' has 5% null values"
  - "Date range extends beyond expected bounds"

Error details for failed nodes:

error_type: ValueError
error_message: "Column 'order_id' contains duplicate values"
error_traceback: "Full Python traceback..."
error_traceback_cleaned: "Cleaned traceback (Spark/Java noise removed)"
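
error_traceback_cleaned exists because Spark errors often bury the Python failure under JVM stack frames. A rough sketch of that kind of filtering, assuming simple substring matching (the real heuristics may be more involved):

# Hypothetical noise markers -- odibi's cleaning logic may differ.
JVM_NOISE = ("py4j.", "java.", "scala.", "at org.apache.spark")

def clean_traceback(tb: str) -> str:
    """Drop lines that look like Spark/Java stack noise."""
    return "\n".join(
        line for line in tb.splitlines()
        if not any(marker in line for marker in JVM_NOISE)
    )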

Execution Steps

For debugging, stories record the execution steps taken while processing each node:

execution_steps:
  - "Read from bronze_db"
  - "Applied pattern 'deduplicate'"
  - "Executed 2 pre-SQL statement(s)"
  - "Passed 3 contract checks"

Failed Rows Samples

When validations fail, stories capture a sample of the rows that failed each validation:

failed_rows_samples:
  not_null_customer_id:
    - { order_id: 123, customer_id: null, amount: 50.00 }
    - { order_id: 456, customer_id: null, amount: 75.00 }
  positive_amount:
    - { order_id: 789, customer_id: "C001", amount: -10.00 }

failed_rows_counts:
  not_null_customer_id: 150
  positive_amount: 25

Configure failure sample limits:

story:
  failure_sample_size: 100        # Max rows per validation
  max_failure_samples: 500        # Total rows across all validations
  max_sampled_validations: 5      # After 5 validations, show only counts
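
The three limits interact: each validation keeps at most failure_sample_size rows, the story as a whole keeps at most max_failure_samples rows, and once max_sampled_validations validations have been sampled, later ones record only a count. A minimal sketch of that budgeting logic (illustrative, not odibi's internals):

def cap_failures(failures, per_validation=100, total_budget=500, max_sampled=5):
    """failures: {validation_name: [failed rows]} -> (samples, counts)."""
    samples, counts, remaining = {}, {}, total_budget
    for i, (name, rows) in enumerate(failures.items()):
        counts[name] = len(rows)                   # counts are always kept
        if i < max_sampled and remaining > 0:      # sample only within budget
            take = min(len(rows), per_validation, remaining)
            samples[name] = rows[:take]
            remaining -= take
    return samples, counts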

Retry History

When retries occur, the full history is captured:

retry_history:
  - attempt: 1
    success: false
    error: "Connection timeout"
    error_type: "TimeoutError"
    duration: 1.2
  - attempt: 2
    success: false
    error: "Connection timeout"
    error_type: "TimeoutError"
    duration: 2.4
  - attempt: 3
    success: true
    duration: 0.8
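
For reference, a sketch of how a retry loop can record this kind of history; the function and parameter names are illustrative, not odibi's API:

import time

def run_with_retries(fn, max_attempts=3, backoff=1.0):
    """Call fn until it succeeds, recording one history entry per attempt."""
    history = []
    for attempt in range(1, max_attempts + 1):
        start = time.monotonic()
        try:
            result = fn()
            history.append({"attempt": attempt, "success": True,
                            "duration": round(time.monotonic() - start, 2)})
            return result, history
        except Exception as exc:
            history.append({"attempt": attempt, "success": False,
                            "error": str(exc), "error_type": type(exc).__name__,
                            "duration": round(time.monotonic() - start, 2)})
            if attempt < max_attempts:
                time.sleep(backoff * 2 ** (attempt - 1))  # exponential backoff
    return None, history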

Delta Lake Info

For Delta Lake writes, version and operation metrics are captured:

delta_info:
  version: 42
  operation: MERGE
  operation_metrics:
    numTargetRowsInserted: 150
    numTargetRowsUpdated: 25
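
These fields mirror what the Delta transaction log exposes. For reference, the equivalent lookup with the delta-spark package, assuming an active SparkSession named spark and an illustrative table path:

from delta.tables import DeltaTable

# Most recent commit for the table (path is illustrative).
last = DeltaTable.forPath(spark, "path/to/silver/orders").history(1).collect()[0]
delta_info = {
    "version": last["version"],
    "operation": last["operation"],
    "operation_metrics": last["operationMetrics"],
}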

Themes

Customize HTML story appearance with built-in or custom themes.

Built-in Themes

| Theme | Description |
|-------|-------------|
| default | Clean, professional blue theme |
| corporate | Traditional business styling with serif headings |
| dark | Dark mode with high-contrast colors |
| minimal | Simple black and white, compact layout |

Using Themes

story:
  theme: dark

Custom Theme File

Create a custom theme YAML file:

# my_theme.yaml
name: company_brand
primary_color: "#003366"
success_color: "#2e7d32"
error_color: "#c62828"
warning_color: "#ff9900"
bg_color: "#ffffff"
text_color: "#333333"
font_family: "Arial, sans-serif"
heading_font: "Georgia, serif"
logo_url: "https://example.com/logo.png"
company_name: "Acme Corp"
footer_text: "Confidential - Internal Use Only"

Reference in config:

story:
  theme: path/to/my_theme.yaml

Theme Options

| Option | Type | Description |
|--------|------|-------------|
| name | string | Theme identifier |
| primary_color | hex | Main accent color |
| success_color | hex | Success status color |
| error_color | hex | Error status color |
| warning_color | hex | Warning status color |
| bg_color | hex | Background color |
| text_color | hex | Primary text color |
| border_color | hex | Border color |
| code_bg | hex | Code block background |
| font_family | string | Body font stack |
| heading_font | string | Heading font stack |
| code_font | string | Monospace font stack |
| font_size | string | Base font size |
| max_width | string | Container max width |
| logo_url | string | URL to company logo |
| company_name | string | Company name for branding |
| footer_text | string | Custom footer text |
| custom_css | string | Additional CSS rules |

Renderers

Stories can be rendered in multiple formats.

HTML Renderer

Default format with interactive, responsive design:

from odibi.story.renderers import HTMLStoryRenderer, get_renderer
from odibi.story.themes import get_theme

# Using the factory
renderer = get_renderer("html")
html = renderer.render(metadata)

# With custom theme
theme = get_theme("dark")
renderer = HTMLStoryRenderer(theme=theme)
html = renderer.render(metadata)

Features:

- Collapsible node sections
- Status indicators with color coding
- Summary statistics dashboard
- Responsive layout

JSON Renderer

Machine-readable format for API integration:

from odibi.story.renderers import JSONStoryRenderer

renderer = JSONStoryRenderer()
json_str = renderer.render(metadata)

Output structure:

{
  "pipeline_name": "process_orders",
  "run_id": "20240130_101500",
  "started_at": "2024-01-30T10:15:00",
  "completed_at": "2024-01-30T10:15:45",
  "duration": 45.23,
  "total_nodes": 5,
  "completed_nodes": 4,
  "failed_nodes": 1,
  "skipped_nodes": 0,
  "success_rate": 80.0,
  "total_rows_processed": 15000,
  "nodes": [...]
}
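
Because the output is machine-readable, downstream tooling can consume it directly. For example, a small check that flags failed runs (the file path is illustrative):

import json

with open("stories/process_orders/2024-01-30/run_10-15-00.json") as f:
    story = json.load(f)

if story["failed_nodes"] > 0:
    print(f"{story['pipeline_name']} run {story['run_id']}: "
          f"{story['failed_nodes']} node(s) failed "
          f"(success rate {story['success_rate']:.0f}%)")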

Markdown Renderer

GitHub-flavored markdown for documentation:

from odibi.story.renderers import MarkdownStoryRenderer

renderer = MarkdownStoryRenderer()
md = renderer.render(metadata)

Renderer Factory

Use the factory function to get a renderer by format:

from odibi.story.renderers import get_renderer

# Supported formats: "html", "markdown", "md", "json"
renderer = get_renderer("json")
output = renderer.render(metadata)

Retention

Stories are automatically cleaned up based on retention policies.

Retention Configuration

story:
  retention_days: 30    # Delete stories older than 30 days
  retention_count: 100  # Keep maximum 100 stories per pipeline

How Retention Works

  1. Count-based: When story count exceeds retention_count, oldest stories are deleted first
  2. Time-based: Stories older than retention_days are deleted
  3. Both apply: A story is deleted if it exceeds either limit
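
A minimal sketch of how the two rules combine for one pipeline's stories on a local filesystem (illustrative only; odibi's cleanup may differ in detail):

import time
from pathlib import Path

def apply_retention(pipeline_dir, retention_days=30, retention_count=100):
    """Delete stories that exceed either the age or the count limit."""
    cutoff = time.time() - retention_days * 86400
    stories = sorted(Path(pipeline_dir).rglob("run_*.*"),
                     key=lambda p: p.stat().st_mtime, reverse=True)
    for i, story in enumerate(stories):
        if i >= retention_count or story.stat().st_mtime < cutoff:
            story.unlink()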

Storage Structure

Stories are organized by pipeline and date:

stories/
├── process_orders/
│   ├── 2024-01-30/
│   │   ├── run_10-15-00.html
│   │   ├── run_10-15-00.json
│   │   ├── run_14-30-00.html
│   │   └── run_14-30-00.json
│   └── 2024-01-31/
│       └── ...
└── process_customers/
    └── ...
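
The layout implies a deterministic path per run, split into pipeline name, run date, and run time. Illustratively:

from datetime import datetime

run_time = datetime(2024, 1, 30, 10, 15, 0)
story_path = f"stories/process_orders/{run_time:%Y-%m-%d}/run_{run_time:%H-%M-%S}.html"
# stories/process_orders/2024-01-30/run_10-15-00.html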

Remote Storage Cleanup

Note: Automatic cleanup for remote storage (ADLS, S3) is not yet implemented. Monitor storage usage manually.

Examples

Complete Story Configuration

project: DataPipeline
engine: spark

story:
  output_path: stories/
  max_sample_rows: 10
  retention_days: 30
  retention_count: 100
  theme: corporate
  include_samples: true

pipelines:
  - pipeline: process_orders
    nodes:
      - name: read_orders
        read:
          connection: bronze
          path: orders/

      - name: transform_orders
        transform:
          operation: sql
          query: |
            SELECT order_id, customer_id, amount
            FROM {read_orders}
            WHERE amount > 0

      - name: write_orders
        write:
          connection: silver
          path: orders/
          mode: merge

Generated Story Output (JSON)

{
  "pipeline_name": "process_orders",
  "pipeline_layer": "silver",
  "run_id": "20240130_101500",
  "started_at": "2024-01-30T10:15:00",
  "completed_at": "2024-01-30T10:15:45",
  "duration": 45.23,
  "total_nodes": 3,
  "completed_nodes": 3,
  "failed_nodes": 0,
  "skipped_nodes": 0,
  "success_rate": 100.0,
  "total_rows_processed": 15000,
  "project": "DataPipeline",
  "nodes": [
    {
      "node_name": "read_orders",
      "operation": "read",
      "status": "success",
      "duration": 5.12,
      "rows_out": 15500,
      "schema_out": ["order_id", "customer_id", "amount", "created_at"]
    },
    {
      "node_name": "transform_orders",
      "operation": "transform",
      "status": "success",
      "duration": 2.34,
      "rows_in": 15500,
      "rows_out": 15000,
      "rows_change": -500,
      "rows_change_pct": -3.2,
      "columns_removed": ["created_at"]
    },
    {
      "node_name": "write_orders",
      "operation": "write",
      "status": "success",
      "duration": 37.77,
      "rows_out": 15000,
      "delta_info": {
        "version": 42,
        "operation": "MERGE",
        "operation_metrics": {
          "numTargetRowsInserted": 500,
          "numTargetRowsUpdated": 14500
        }
      }
    }
  ]
}

Programmatic Story Generation

from odibi.story.generator import StoryGenerator

# Create generator
generator = StoryGenerator(
    pipeline_name="process_orders",
    max_sample_rows=10,
    output_path="stories/",
    retention_days=30,
    retention_count=100,
)

# Generate story after pipeline execution
story_path = generator.generate(
    node_results=node_results,
    completed=["read_orders", "transform_orders", "write_orders"],
    failed=[],
    skipped=[],
    duration=45.23,
    start_time="2024-01-30T10:15:00",
    end_time="2024-01-30T10:15:45",
)

# Get summary for alerts
alert_summary = generator.get_alert_summary()

Documentation Stories

Generate stakeholder-ready documentation from pipeline config:

from odibi.story.doc_story import DocStoryGenerator
from odibi.config import PipelineConfig

# Load pipeline config
pipeline_config = PipelineConfig.from_yaml("pipeline.yaml")

# Generate documentation
doc_generator = DocStoryGenerator(pipeline_config)
doc_path = doc_generator.generate(
    output_path="docs/pipeline_doc.html",
    format="html",
    include_flow_diagram=True,
)