Odibi Configuration Reference¶
This manual details the YAML configuration schema for Odibi projects. Auto-generated from Pydantic models.
Project Structure¶
ProjectConfig¶
Complete project configuration from YAML.
🏢 "Enterprise Setup" Guide¶
Business Problem: "We need a robust production environment with alerts, retries, and proper logging."
Recipe: Production Ready
```yaml
project: "Customer360"
engine: "spark"

# 1. Resilience
retry:
  enabled: true
  max_attempts: 3
  backoff: "exponential"

# 2. Observability
logging:
  level: "INFO"
  structured: true  # JSON logs for Splunk/Datadog

# 3. Alerting
alerts:
  - type: "slack"
    url: "${SLACK_WEBHOOK_URL}"
    on_events: ["on_failure"]

# ... connections and pipelines ...
```
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| project | str | Yes | - | Project name |
| engine | EngineType | No | EngineType.PANDAS | Execution engine |
| connections | Dict[str, ConnectionConfig] | Yes | - | Named connections (at least one required). Options: LocalConnectionConfig, AzureBlobConnectionConfig, DeltaConnectionConfig, SQLServerConnectionConfig, HttpConnectionConfig, CustomConnectionConfig |
| pipelines | List[PipelineConfig] | Yes | - | Pipeline definitions (at least one required) |
| story | StoryConfig | Yes | - | Story generation configuration (mandatory) |
| system | SystemConfig | Yes | - | System Catalog configuration (mandatory) |
| lineage | Optional[LineageConfig] | No | - | OpenLineage configuration |
| description | Optional[str] | No | - | Project description |
| version | str | No | 1.0.0 | Project version |
| owner | Optional[str] | No | - | Project owner/contact |
| vars | Dict[str, Any] | No | - | Global variables for substitution (e.g. ${vars.env}) |
| retry | RetryConfig | No | - | Retry configuration for transient failures. Applies to all nodes unless overridden. Default: enabled with 3 attempts, exponential backoff. |
| logging | LoggingConfig | No | - | Logging configuration for pipeline execution. Set level (DEBUG/INFO/WARNING/ERROR), enable structured JSON logs, add metadata. |
| alerts | List[AlertConfig] | No | - | Alert configurations |
| performance | PerformanceConfig | No | - | Performance tuning |
| environments | Optional[Dict[str, Dict[str, Any]]] | No | - | Structure: same as ProjectConfig but with only overridden fields. Not yet validated strictly. |
| semantic | Optional[Dict[str, Any]] | No | - | Semantic layer configuration. Can be inline or reference an external file. Contains metrics, dimensions, and materializations for self-service analytics. Example: semantic: { config: 'semantic_config.yaml' } or inline definitions. |
PipelineConfig¶
Used in: ProjectConfig
Configuration for a pipeline.
Example:
```yaml
pipelines:
  - pipeline: "user_onboarding"
    description: "Ingest and process new users"
    layer: "silver"
    owner: "data-team@example.com"
    freshness_sla: "6h"
    nodes:
      - name: "node1"
        ...
```
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| pipeline | str | Yes | - | Pipeline name |
| description | Optional[str] | No | - | Pipeline description |
| layer | Optional[str] | No | - | Logical layer (bronze/silver/gold) |
| owner | Optional[str] | No | - | Pipeline owner (email or name) |
| freshness_sla | Optional[str] | No | - | Expected data freshness SLA. Format: number followed by unit — s (seconds), m (minutes), h (hours), or d (days). Examples: '6h', '1d', '30m'. |
| freshness_anchor | Literal['run_completion', 'table_max_timestamp', 'watermark_state'] | No | run_completion | What defines freshness. Only 'run_completion' is implemented initially. |
| nodes | List[NodeConfig] | Yes | - | List of nodes in this pipeline |
| auto_cache_threshold | Optional[int] | No | 3 | Auto-cache nodes with N or more downstream dependencies. Prevents redundant ADLS re-reads when a node is used by multiple downstream nodes. Set to null to disable auto-caching. Individual nodes can override with explicit cache: true/false. |
NodeConfig¶
Used in: PipelineConfig
Configuration for a single node.
🧠 "The Smart Node" Pattern¶
Business Problem: "We need complex dependencies, caching for heavy computations, and the ability to run only specific parts of the pipeline."
The Solution:
Nodes are the building blocks. They handle dependencies (depends_on), execution control (tags, enabled), and performance (cache).
🕸️ DAG & Dependencies¶
The Glue of the Pipeline. Nodes don't run in isolation. They form a Directed Acyclic Graph (DAG).
- depends_on: Critical! If Node B reads from Node A (in memory), you MUST list ["Node A"] in Node B's depends_on.
- Implicit Data Flow: If a node has no read block, it automatically picks up the DataFrame from its first dependency.
🧠 Smart Read & Incremental Loading¶
Automated History Management.
Odibi intelligently determines whether to perform a Full Load or an Incremental Load based on the state of the target.
The "Smart Read" Logic:
1. First Run (Full Load): If the target table (defined in write) does not exist:
* Incremental filtering rules are ignored.
* The entire source dataset is read.
* Use write.first_run_query (optional) to override the read query for this initial bootstrap (e.g., to backfill only 1 year of history instead of all time).
2. Subsequent Runs (Incremental Load): If the target table exists:
    * Rolling Window: Filters source data where column >= NOW() - lookback.
    * Stateful: Filters source data where column > last_high_water_mark.
This ensures you don't need separate "init" and "update" pipelines. One config handles both lifecycle states.
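A minimal sketch of the first-run override described above, assuming a Delta target (the connection and table names are illustrative):

```yaml
write:
  connection: "silver"
  format: "delta"
  table: "fact_events"
  # Bootstrap only: replaces the read query when the target does not exist yet,
  # e.g. backfill one year of history instead of all time
  first_run_query: "SELECT * FROM src_events WHERE event_time >= DATEADD(year, -1, GETDATE())"
```

On every later run the target exists, so the incremental rules apply and first_run_query is ignored.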
🏷️ Orchestration Tags¶
Run What You Need.
Tags allow you to execute slices of your pipeline.
* odibi run --tag daily -> Runs all nodes with "daily" tag.
* odibi run --tag critical -> Runs high-priority nodes.
🤖 Choosing Your Logic: Transformer vs. Transform¶
1. The "Transformer" (Top-Level)
* What it is: A pre-packaged, heavy-duty operation that defines the entire purpose of the node.
* When to use: When applying a standard Data Engineering pattern (e.g., SCD2, Merge, Deduplicate).
* Analogy: "Run this App."
* Syntax: transformer: "scd2" + params: {...}
2. The "Transform Steps" (Process Chain)
* What it is: A sequence of smaller steps (SQL, functions, operations) executed in order.
* When to use: For custom business logic, data cleaning, or feature engineering pipelines.
* Analogy: "Run this Script."
* Syntax: transform: { steps: [...] }
Note: You can use both! The transformer runs first, then transform steps refine the result.
🔗 Chaining Operations¶
You can mix and match! The execution order is always:
1. Read (or Dependency Injection)
2. Transformer (The "App" logic, e.g., Deduplicate)
3. Transform Steps (The "Script" logic, e.g., cleanup)
4. Validation
5. Write
Constraint: You must define at least one of read, transformer, transform, or write.
⚡ Example: App vs. Script¶
Scenario 1: The Full ETL Flow (Chained) Shows explicit Read, Transform Chain, and Write.
```yaml
# 1. Ingest (The Dependency)
- name: "load_raw_users"
  read: { connection: "s3_landing", format: "json", path: "users/*.json" }
  write: { connection: "bronze", format: "parquet", path: "users_raw" }

# 2. Process (The Consumer)
- name: "clean_users"
  depends_on: ["load_raw_users"]
  # "clean_text" is a registered function from the Transformer Catalog
  transform:
    steps:
      - sql: "SELECT * FROM df WHERE email IS NOT NULL"
      - function: "clean_text"
        params: { columns: ["email"], case: "lower" }
  write: { connection: "silver", format: "delta", table: "dim_users" }
```
Scenario 2: The "App" Node (Top-Level Transformer) Shows a node that applies a pattern (Deduplicate) to incoming data.
```yaml
- name: "deduped_users"
  depends_on: ["clean_users"]
  # The "App": Deduplication (from the Transformer Catalog)
  transformer: "deduplicate"
  params:
    keys: ["user_id"]
    order_by: "updated_at DESC"
  write: { connection: "gold", format: "delta", table: "users_unique" }
```
Scenario 3: The Tagged Runner (Reporting)
Shows how tags allow running specific slices (e.g., odibi run --tag daily).
```yaml
- name: "daily_report"
  tags: ["daily", "reporting"]
  depends_on: ["deduped_users"]
  # Ad-hoc aggregation script
  transform:
    steps:
      - sql: "SELECT date_trunc('day', updated_at) as day, count(*) as total FROM df GROUP BY 1"
  write: { connection: "local_data", format: "csv", path: "reports/daily_stats.csv" }
```
Scenario 4: The "Kitchen Sink" (All Operations) Shows Read -> Transformer -> Transform -> Write execution order.
Why this works:
1. Internal Chaining (df): In every step (Transformer or SQL), df refers to the output of the previous step.
2. External Access (depends_on): If you added depends_on: ["other_node"], you could also run SELECT * FROM other_node in your SQL steps!
```yaml
- name: "complex_flow"
  # 1. Read -> Creates initial 'df'
  read: { connection: "bronze", format: "parquet", path: "users" }
  # 2. Transformer (The "App": Deduplicate first)
  #    Takes 'df' (from Read), dedups it, returns new 'df'
  transformer: "deduplicate"
  params: { keys: ["user_id"], order_by: "updated_at DESC" }
  # 3. Transform Steps (The "Script": Filter AFTER deduplication)
  #    SQL sees the deduped data as 'df'
  transform:
    steps:
      - sql: "SELECT * FROM df WHERE status = 'active'"
  # 4. Write -> Saves the final filtered 'df'
  write: { connection: "silver", format: "delta", table: "active_unique_users" }
```
📚 Transformer Catalog¶
These are the built-in functions you can use in two ways:
- As a Top-Level Transformer: transformer: "name" (defines the node's main logic)
- As a Step in a Chain: transform: { steps: [{ function: "name" }] } (part of a sequence)
Note: merge and scd2 are special "Heavy Lifters" and should generally be used as Top-Level Transformers.
Data Engineering Patterns
* merge: Upsert/Merge into target (Delta/SQL). (Params)
* scd2: Slowly Changing Dimensions Type 2. (Params)
* deduplicate: Remove duplicates using window functions. (Params)
Relational Algebra
* join: Join two datasets. (Params)
* union: Stack datasets vertically. (Params)
* pivot: Rotate rows to columns. (Params)
* unpivot: Rotate columns to rows (melt). (Params)
* aggregate: Group by and sum/count/avg. (Params)
Data Quality & Cleaning
* validate_and_flag: Check rules and flag invalid rows. (Params)
* clean_text: Trim and normalize case. (Params)
* filter_rows: SQL-based filtering. (Params)
* fill_nulls: Replace NULLs with defaults. (Params)
Feature Engineering
* derive_columns: Create new cols via SQL expressions. (Params)
* case_when: Conditional logic (if-else). (Params)
* generate_surrogate_key: Create MD5 keys from columns. (Params)
* date_diff, date_add, date_trunc: Date arithmetic.
Scenario 3: The Tagged Runner
Run only this with odibi run --tag daily
Scenario 4: Pre/Post SQL Hooks Setup and cleanup with SQL statements.
```yaml
- name: "optimize_sales"
  depends_on: ["load_sales"]
  pre_sql:
    - "SET spark.sql.shuffle.partitions = 200"
    - "CREATE TEMP VIEW staging AS SELECT * FROM bronze.raw_sales"
  transform:
    steps:
      - sql: "SELECT * FROM staging WHERE amount > 0"
  post_sql:
    - "OPTIMIZE gold.fact_sales ZORDER BY (customer_id)"
    - "VACUUM gold.fact_sales RETAIN 168 HOURS"
  write:
    connection: "gold"
    format: "delta"
    table: "fact_sales"
```
Scenario 5: Materialization Strategies Choose how output is persisted.
```yaml
# Option 1: View (no physical storage, logical model)
- name: "vw_active_customers"
  materialized: "view"  # Creates SQL view instead of table
  transform:
    steps:
      - sql: "SELECT * FROM customers WHERE status = 'active'"
  write:
    connection: "gold"
    table: "vw_active_customers"

# Option 2: Incremental (append to existing Delta table)
- name: "fact_events"
  materialized: "incremental"  # Uses APPEND mode
  read:
    connection: "bronze"
    table: "raw_events"
    incremental:
      mode: "stateful"
      column: "event_time"
  write:
    connection: "silver"
    format: "delta"
    table: "fact_events"

# Option 3: Table (default - full overwrite)
- name: "dim_products"
  materialized: "table"  # Default behavior
  # ...
```
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| name | str | Yes | - | Unique node name |
| description | Optional[str] | No | - | Human-readable description |
| explanation | Optional[str] | No | - | Markdown-formatted explanation of the node's transformation logic. Rendered in the Data Story HTML report. Supports tables, code blocks, and rich formatting. Use to document business rules, data mappings, and transformation rationale for stakeholder communication. Mutually exclusive with 'explanation_file'. |
| explanation_file | Optional[str] | No | - | Path to an external Markdown file containing the explanation, relative to the YAML file. Use for longer documentation to keep YAML files clean. Mutually exclusive with 'explanation'. |
| runbook_url | Optional[str] | No | - | URL to a troubleshooting guide or runbook. Shown as a 'Troubleshooting guide →' link on failures. |
| enabled | bool | No | True | If False, node is skipped during execution |
| tags | List[str] | No | - | Operational tags for selective execution (e.g., 'daily', 'critical'). Use with odibi run --tag. |
| depends_on | List[str] | No | - | List of parent nodes that must complete before this node runs. The output of these nodes is available for reading. |
| columns | Dict[str, ColumnMetadata] | No | - | Data Dictionary defining the output schema. Used for documentation, PII tagging, and validation. |
| read | Optional[ReadConfig] | No | - | Input operation (Load). If missing, data is taken from the first dependency. |
| inputs | Optional[Dict[str, str \| Dict[str, Any]]] | No | - | Multi-input support for cross-pipeline dependencies. Map input names to either: (a) a $pipeline.node reference (e.g., '$read_bronze.shift_events') or (b) an explicit read config dict. Cannot be used with 'read'. Example: inputs: {events: '$read_bronze.events', calendar: {connection: 'goat', path: 'cal'}} |
| transform | Optional[TransformConfig] | No | - | Chain of fine-grained transformation steps (SQL, functions). Runs after 'transformer' if both are present. |
| write | Optional[WriteConfig] | No | - | Output operation (Save to file/table). |
| streaming | bool | No | False | Enable streaming execution for this node (Spark only) |
| transformer | Optional[str] | No | - | Name of the 'App' logic to run (e.g., 'deduplicate', 'scd2'). See Transformer Catalog for options. |
| params | Dict[str, Any] | No | - | Parameters for transformer |
| pre_sql | List[str] | No | - | List of SQL statements to execute before the node runs. Use for setup: temp tables, variable initialization, grants. Example: ['SET spark.sql.shuffle.partitions=200', 'CREATE TEMP VIEW src AS SELECT * FROM raw'] |
| post_sql | List[str] | No | - | List of SQL statements to execute after the node completes. Use for cleanup, optimization, or audit logging. Example: ['OPTIMIZE gold.fact_sales', 'VACUUM gold.fact_sales RETAIN 168 HOURS'] |
| materialized | Optional[Literal['table', 'view', 'incremental']] | No | - | Materialization strategy. Options: 'table' (default physical write), 'view' (creates SQL view instead of table), 'incremental' (uses append mode for Delta tables). Views are useful for Gold layer logical models. |
| cache | bool | No | False | Cache result for reuse |
| log_level | Optional[LogLevel] | No | - | Override log level for this node |
| on_error | ErrorStrategy | No | ErrorStrategy.FAIL_LATER | Failure handling strategy |
| validation | Optional[ValidationConfig] | No | - | - |
| contracts | List[TestConfig] | No | - | Pre-condition contracts (Circuit Breakers). Run on input data before transformation. Options: NotNullTest, UniqueTest, AcceptedValuesTest, RowCountTest, CustomSQLTest, RangeTest, RegexMatchTest, VolumeDropTest, SchemaContract, DistributionContract, FreshnessContract |
| schema_policy | Optional[SchemaPolicyConfig] | No | - | Schema drift handling policy |
| privacy | Optional[PrivacyConfig] | No | - | Privacy Suite: PII anonymization settings |
| sensitive | bool \| List[str] | No | False | If true, or a list of columns, masks sample data in stories |
| source_yaml | Optional[str] | No | - | Internal: source YAML file path for sql_file resolution |
ColumnMetadata¶
Used in: NodeConfig
Metadata for a column in the data dictionary.
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| description | Optional[str] | No | - | Column description |
| pii | bool | No | False | Contains PII? |
| tags | List[str] | No | - | Tags (e.g. 'business_key', 'measure') |
SystemConfig¶
Used in: ProjectConfig
Configuration for the Odibi System Catalog (The Brain).
Stores metadata, state, and pattern configurations. The primary connection must be a storage connection (blob/local) that supports Delta tables.
Example:
```yaml
system:
  connection: adls_bronze  # Primary - must be blob/local storage
  path: _odibi_system
  environment: dev
```
With sync to SQL Server (for dashboards/queries):
```yaml
system:
  connection: adls_prod  # Primary - Delta tables
  environment: prod
  sync_to:
    connection: sql_server_prod  # Secondary - SQL for visibility
    schema_name: odibi_system
```
With sync to another blob (cross-region backup):
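The example body for this case is missing from the source docstring. A plausible sketch, assuming SyncToConfig accepts a path the same way SyncFromConfig does (connection names and the path value are illustrative assumptions):

```yaml
system:
  connection: adls_primary        # primary Delta tables
  sync_to:
    connection: adls_dr           # secondary blob connection in another region
    path: _odibi_system_replica   # assumed field, mirroring SyncFromConfig.path
```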
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| connection | str | Yes | - | Connection for primary system tables. Must be blob storage (azure_blob) or local filesystem - NOT SQL Server. Delta tables require storage backends. |
| path | str | No | _odibi_system | Path relative to connection root |
| environment | Optional[str] | No | - | Environment tag (e.g., 'dev', 'qat', 'prod'). Written to all system table records for cross-environment querying. |
| schema_name | Optional[str] | No | - | Deprecated. Use sync_to.schema_name for SQL Server targets. |
| sync_to | Optional[SyncToConfig] | No | - | Secondary destination to sync system catalog data to. Use for SQL Server dashboards or cross-region Delta replication. |
| sync_from | Optional[SyncFromConfig] | No | - | Source to sync system data from. Enables pushing local development data to centralized system tables. |
| cost_per_compute_hour | Optional[float] | No | - | Estimated cost per compute hour (USD) for cost tracking |
| databricks_billing_enabled | bool | No | False | Attempt to query Databricks billing tables for actual costs |
| retention_days | Optional[RetentionConfig] | No | - | Retention periods for system tables |
| optimize_catalog | bool | No | False | Run OPTIMIZE + VACUUM on all system catalog Delta tables after each pipeline run. Compacts small files created by frequent MERGE operations. Adds ~15-20s but prevents accumulation of small files that degrade read performance over time. Set to false to skip. |
| sync_timeout_seconds | float | No | 30.0 | Maximum time (seconds) to wait for async catalog sync to complete. Reduced from the 300s default to 30s for better performance. Sync is incremental, so incomplete syncs will catch up on the next run. |
| async_derived_updates | bool | No | True | Run derived table updates (meta_daily_stats, meta_pipeline_health, etc.) asynchronously in a background thread. Saves ~20-30s per pipeline. Updates complete eventually - safe for reporting tables. |
| async_lineage | bool | No | True | Build lineage incrementally as each pipeline completes, then merge at the end. Saves ~40s by parallelizing lineage construction with pipeline execution. Lineage is still generated, just built in the background. |
| skip_sync_wait_in_databricks | bool | No | True | Skip waiting for catalog sync to complete when running in Databricks. Databricks clusters stay alive, so background sync threads complete safely. Saves ~90s overhead. Set to false to always wait for sync completion. |
SyncFromConfig¶
Used in: SystemConfig
Configuration for syncing system data from a source location.
Used to pull system data (runs, state) from another backend into the target.
Example:
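The example body is missing from the source docstring. A minimal sketch built from the fields documented below (connection names are illustrative):

```yaml
system:
  connection: adls_prod
  sync_from:
    connection: local_dev   # pull runs/state from this backend
    path: _odibi_system
```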
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| connection | str | Yes | - | Connection name for the source system data |
| path | Optional[str] | No | - | Path to source system data (for file-based sources) |
| schema_name | Optional[str] | No | - | Schema name for SQL Server source (if applicable) |
Connections¶
LocalConnectionConfig¶
Used in: ProjectConfig
Local filesystem connection.
When to Use: Development, testing, small datasets, local processing.
See Also: AzureBlobConnectionConfig for cloud alternatives.
Example:
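The example body is missing from the source docstring. A minimal sketch using the fields below (the connection name is illustrative):

```yaml
connections:
  local_data:
    type: "local"
    base_path: "./data"
```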
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| type | Literal['local'] | No | ConnectionType.LOCAL | - |
| validation_mode | ValidationMode | No | ValidationMode.LAZY | - |
| base_path | str | No | ./data | Base directory path |
DeltaConnectionConfig¶
Used in: ProjectConfig
Delta Lake connection for ACID-compliant data lakes.
When to Use: - Production data lakes on Azure/AWS/GCP - Need time travel, ACID transactions, schema evolution - Upsert/merge operations
See Also: WriteConfig for Delta write options
Scenario 1: Delta via metastore
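The example body for this scenario is missing from the source docstring. A minimal sketch using the required fields from the table below (names are illustrative):

```yaml
delta_silver:
  type: "delta"
  catalog: "spark_catalog"
  schema_name: "silver"
  table: "fact_sales"   # optional default table for this connection
```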
Scenario 2: Direct path + Node usage
```yaml
delta_local:
  type: "local"
  base_path: "dbfs:/mnt/delta"

# In pipeline:
# read:
#   connection: "delta_local"
#   format: "delta"
#   path: "bronze/orders"
```
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| type | Literal['delta'] | No | ConnectionType.DELTA | - |
| validation_mode | ValidationMode | No | ValidationMode.LAZY | - |
| catalog | str | Yes | - | Spark catalog name (e.g. 'spark_catalog') |
| schema_name | str | Yes | - | Database/schema name |
| table | Optional[str] | No | - | Optional default table name for this connection (used by story/pipeline helpers) |
AzureBlobConnectionConfig¶
Used in: ProjectConfig
Azure Blob Storage / ADLS Gen2 connection.
When to Use: Azure-based data lakes, landing zones, raw data storage.
See Also: DeltaConnectionConfig for Delta-specific options
Scenario 1: Prod with Key Vault-managed key
```yaml
adls_bronze:
  type: "azure_blob"
  account_name: "myaccount"
  container: "bronze"
  auth:
    mode: "key_vault"
    key_vault: "kv-data"
    secret: "adls-account-key"
```
Scenario 2: Local dev with inline account key
```yaml
adls_dev:
  type: "azure_blob"
  account_name: "devaccount"
  container: "sandbox"
  auth:
    mode: "account_key"
    account_key: "${ADLS_ACCOUNT_KEY}"
```
Scenario 3: MSI (no secrets)
```yaml
adls_msi:
  type: "azure_blob"
  account_name: "myaccount"
  container: "bronze"
  auth:
    mode: "aad_msi"
    # optional: client_id for user-assigned identity
    client_id: "00000000-0000-0000-0000-000000000000"
```
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| type | Literal['azure_blob'] | No | ConnectionType.AZURE_BLOB | - |
| validation_mode | ValidationMode | No | ValidationMode.LAZY | - |
| account_name | str | Yes | - | - |
| container | str | Yes | - | - |
| auth | AzureBlobAuthConfig | No | - | Authentication configuration. Choose one mode: 'account_key' (storage key), 'sas' (SAS token), 'connection_string', 'key_vault' (Azure Key Vault), or 'aad_msi' (Managed Identity, default). For production, prefer key_vault or aad_msi to avoid storing secrets in config. Options: AzureBlobKeyVaultAuth, AzureBlobAccountKeyAuth, AzureBlobSasAuth, AzureBlobConnectionStringAuth, AzureBlobMsiAuth |
SQLServerConnectionConfig¶
Used in: ProjectConfig
SQL Server / Azure SQL Database connection.
When to Use: Reading from SQL Server sources, Azure SQL DB, Azure Synapse.
See Also: ReadConfig for query options
Scenario 1: Managed identity (AAD MSI)
```yaml
sql_dw_msi:
  type: "sql_server"
  host: "server.database.windows.net"
  database: "dw"
  auth:
    mode: "aad_msi"
```
Scenario 2: SQL login
```yaml
sql_dw_login:
  type: "sql_server"
  host: "server.database.windows.net"
  database: "dw"
  port: 1433
  driver: "ODBC Driver 17 for SQL Server"
  auth:
    mode: "sql_login"
    username: "dw_writer"
    password: "${DW_PASSWORD}"
```
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| type | Literal['sql_server'] | No | ConnectionType.SQL_SERVER | - |
| validation_mode | ValidationMode | No | ValidationMode.LAZY | - |
| host | str | Yes | - | - |
| database | str | Yes | - | - |
| port | int | No | 1433 | - |
| driver | str | No | ODBC Driver 18 for SQL Server | - |
| auth | SQLServerAuthConfig | No | - | Authentication configuration. Choose one mode: 'sql_login' (username/password), 'aad_password' (Azure AD service principal), 'aad_msi' (Managed Identity, default), or 'connection_string' (full JDBC string). For Databricks/Azure, prefer aad_msi for passwordless auth. Options: SQLLoginAuth, SQLAadPasswordAuth, SQLMsiAuth, SQLConnectionStringAuth |
HttpConnectionConfig¶
Used in: ProjectConfig
HTTP connection.
Scenario: Bearer token via env var
```yaml
api_source:
  type: "http"
  base_url: "https://api.example.com"
  headers:
    User-Agent: "odibi-pipeline"
  auth:
    mode: "bearer"
    token: "${API_TOKEN}"
```
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| type | Literal['http'] | No | ConnectionType.HTTP | - |
| validation_mode | ValidationMode | No | ValidationMode.LAZY | - |
| base_url | str | Yes | - | Base URL for all API requests (e.g., 'https://api.example.com/v1') |
| headers | Dict[str, str] | No | - | Default HTTP headers included in all requests. Example: {'User-Agent': 'odibi-pipeline', 'Accept': 'application/json'}. Auth headers are typically set via the 'auth' block instead. |
| auth | HttpAuthConfig | No | - | Authentication configuration. Choose one mode: 'none' (no auth), 'basic' (username/password), 'bearer' (token), or 'api_key' (custom header). Tokens can use env vars: '${API_TOKEN}'. Options: HttpNoAuth, HttpBasicAuth, HttpBearerAuth, HttpApiKeyAuth |
Node Operations¶
ReadConfig¶
Used in: NodeConfig
Configuration for reading data into a node.
When to Use: First node in a pipeline, or any node that reads from storage.
Key Concepts:
- connection: References a named connection from connections: section
- format: File format (csv, parquet, delta, json, sql)
- incremental: Enable incremental loading (only new data)
See Also:
- Incremental Loading (HWM-based loading)
- IncrementalConfig (incremental loading options)
📖 "Universal Reader" Guide¶
Business Problem: "I need to read from files, databases, streams, and even travel back in time to see how data looked yesterday."
Recipe 1: The Time Traveler (Delta/Iceberg) Reproduce a bug by seeing the data exactly as it was.
```yaml
read:
  connection: "silver_lake"
  format: "delta"
  table: "fact_sales"
  time_travel:
    as_of_timestamp: "2023-10-25T14:00:00Z"
```
Recipe 2: The Streamer Process data in real-time.
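The recipe body is missing from the source docstring. A minimal sketch, assuming a Spark streaming read over JSON files (connection and path are illustrative; per the field table below, schema_ddl is required for streaming reads from non-Delta file sources):

```yaml
read:
  connection: "landing"
  format: "json"
  path: "events/"
  streaming: true
  schema_ddl: "event_id STRING, event_time TIMESTAMP, payload STRING"
```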
Recipe 3: The SQL Query Push down filtering to the source database.
```yaml
read:
  connection: "enterprise_dw"
  format: "sql"
  # Use the query option to filter at source!
  query: "SELECT * FROM huge_table WHERE date >= '2024-01-01'"
```
Recipe 4: Archive Bad Records (Spark) Capture malformed records for later inspection.
```yaml
read:
  connection: "landing"
  format: "json"
  path: "events/*.json"
  archive_options:
    badRecordsPath: "/mnt/quarantine/bad_records"
```
Recipe 5: Optimize JDBC Parallelism (Spark) Control partition count for SQL sources to reduce task overhead.
```yaml
read:
  connection: "enterprise_dw"
  format: "sql"
  table: "small_lookup_table"
  options:
    numPartitions: 1  # Single partition for small tables
```
Performance Tip: For small tables (<100K rows), use numPartitions: 1 to avoid
excessive Spark task scheduling overhead. For large tables, increase partitions
to enable parallel reads (requires partitionColumn, lowerBound, upperBound).
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| connection | Optional[str] | No | - | Connection name from project.yaml (null for synthetic/simulation sources) |
| format | ReadFormat | Yes | - | Data format: csv, parquet, delta, json, sql, api, excel, avro, cloudFiles |
| table | Optional[str] | No | - | Table name for SQL/Delta |
| path | Optional[str] | No | - | Path for file-based sources |
| streaming | bool | No | False | Enable streaming read (Spark only) |
| schema_ddl | Optional[str] | No | - | Schema for streaming reads from file sources (required for Avro, JSON, CSV). Use Spark DDL format: 'col1 STRING, col2 INT, col3 TIMESTAMP'. Not required for Delta (schema is inferred from table metadata). |
| query | Optional[str] | No | - | SQL query to filter at source (pushdown). Mutually exclusive with table/path if supported by the connector. |
| sql_file | Optional[str] | No | - | Path to an external .sql file containing the query, relative to the YAML file defining the node. Mutually exclusive with 'query'. |
| filter | Optional[str] | No | - | SQL WHERE clause filter (pushed down to source for SQL formats). Example: "DAY > '2022-12-31'" |
| incremental | Optional[IncrementalConfig] | No | - | Automatic incremental loading strategy (CDC-like). If set, generates a query based on target state (HWM). |
| time_travel | Optional[TimeTravelConfig] | No | - | Time travel options (Delta only) |
| archive_options | Dict[str, Any] | No | - | Options for archiving bad records (e.g. badRecordsPath for Spark) |
| options | Dict[str, Any] | No | - | Format-specific options |
IncrementalConfig¶
Used in: ReadConfig
Configuration for automatic incremental loading.
When to Use: Load only new/changed data instead of full table scans.
See Also: ReadConfig
Modes:
1. Rolling Window (Default): Uses a time-based lookback from NOW().
   Good for: Stateless loading where you just want "recent" data.
   Args: lookback, unit
2. Stateful: Tracks the High-Water Mark (HWM) of the key column.
   Good for: Exact incremental ingestion (e.g. CDC-like).
   Args: state_key (optional), watermark_lag (optional)
Generates SQL:
- Rolling: WHERE column >= NOW() - lookback
- Stateful: WHERE column > :last_hwm
Example (Rolling Window):
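The example body is missing from the source docstring. A minimal sketch using the documented fields (the column name and window size are illustrative):

```yaml
incremental:
  mode: "rolling_window"
  column: "updated_at"
  lookback: 7
  unit: "day"
```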
Example (Stateful HWM):
```yaml
incremental:
  mode: "stateful"
  column: "id"
  # Optional: track separate column for HWM state
  state_key: "last_processed_id"
```
Example (Stateful with Watermark Lag):
```yaml
incremental:
  mode: "stateful"
  column: "updated_at"
  # Handle late-arriving data: look back 2 hours from HWM
  watermark_lag: "2h"
```
Example (Oracle Date Format):
```yaml
incremental:
  mode: "rolling_window"
  column: "EVENT_START"
  lookback: 3
  unit: "day"
  # For string columns with Oracle format (DD-MON-YY)
  date_format: "oracle"
```
Supported date_format values:
- oracle: DD-MON-YY for Oracle databases (uses TO_TIMESTAMP)
- oracle_sqlserver: DD-MON-YY format stored in SQL Server (uses TRY_CONVERT)
- sql_server: Uses CONVERT with style 120
- us: MM/DD/YYYY format
- eu: DD/MM/YYYY format
- iso: YYYY-MM-DDTHH:MM:SS format
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| mode | IncrementalMode | No | IncrementalMode.ROLLING_WINDOW |
Incremental strategy: 'rolling_window' or 'stateful' |
| column | str | Yes | - | Primary column to filter on (e.g., updated_at) |
| fallback_column | Optional[str] | No | - | Backup column if primary is NULL (e.g., created_at). Generates COALESCE(col, fallback) >= ... |
| lookback | Optional[int] | No | - | Time units to look back (Rolling Window only) |
| unit | Optional[IncrementalUnit] | No | - | Time unit for lookback (Rolling Window only). Options: 'hour', 'day', 'month', 'year' |
| state_key | Optional[str] | No | - | Unique ID for state tracking. Defaults to node name if not provided. |
| watermark_lag | Optional[str] | No | - | Safety buffer for late-arriving data in stateful mode. Subtracts this duration from the stored HWM when filtering. Format: number followed by unit — s (seconds), m (minutes), h (hours), or d (days). Examples: '2h' (2 hours), '30m' (30 minutes), '1d' (1 day). Use when source has replication lag or eventual consistency. |
| date_format | Optional[str] | No | - | Source date format when the column is stored as a string. Options: 'oracle' (DD-MON-YY for Oracle DB), 'oracle_sqlserver' (DD-MON-YY format in SQL Server), 'sql_server' (uses CONVERT with style 120), 'us' (MM/DD/YYYY), 'eu' (DD/MM/YYYY), 'iso' (YYYY-MM-DDTHH:MM:SS). When set, SQL pushdown will use appropriate CONVERT/TO_TIMESTAMP functions. |
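The watermark_lag duration format above ('2h', '30m', '1d') can be parsed with a few lines of standard Python. This is an illustrative sketch of the documented format, not odibi's internal code:

```python
import re
from datetime import timedelta

_UNITS = {"s": "seconds", "m": "minutes", "h": "hours", "d": "days"}

def parse_lag(lag: str) -> timedelta:
    """Parse a watermark_lag string like '2h' into a timedelta."""
    match = re.fullmatch(r"(\d+)([smhd])", lag)
    if not match:
        raise ValueError(f"Invalid watermark_lag: {lag!r}")
    value, unit = match.groups()
    return timedelta(**{_UNITS[unit]: int(value)})

# The effective stateful filter then becomes:
# WHERE column > stored_hwm - parse_lag("2h")
```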
TimeTravelConfig¶
Used in: ReadConfig
Configuration for time travel reading (Delta/Iceberg).
Example:
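An illustrative version-based read (the nesting of a time_travel block under read is an assumption based on "Used in: ReadConfig"):

```yaml
read:
  connection: lake
  format: delta
  table: customers
  time_travel:
    as_of_version: 42
```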
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| as_of_version | Optional[int] | No | - | Version number to time travel to |
| as_of_timestamp | Optional[str] | No | - | Timestamp string to time travel to |
TransformConfig¶
Used in: NodeConfig
Configuration for transformation steps within a node.
When to Use: Custom business logic, data cleaning, SQL transformations.
Key Concepts:
- steps: Ordered list of operations (SQL, functions, or both)
- Each step receives the DataFrame from the previous step
- Steps execute in order: step1 → step2 → step3
See Also: Transformer Catalog
Transformer vs Transform:
- transformer: Single heavy operation (scd2, merge, deduplicate)
- transform.steps: Chain of lighter operations
🔧 "Transformation Pipeline" Guide¶
Business Problem: "I have complex logic that mixes SQL for speed and Python for complex calculations."
The Solution: Chain multiple steps together. Output of Step 1 becomes input of Step 2.
Function Registry:
The function step type looks up functions registered with @transform (or @register).
This allows you to use the same registered functions as both top-level Transformers and steps in a chain.
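Conceptually, the registry is a name-to-callable mapping populated by the decorator. The sketch below illustrates the mechanism using plain lists of dicts as a DataFrame stand-in; the decorator name matches the docs, but the exact signature is an assumption:

```python
REGISTRY = {}

def transform(fn):
    """Register fn under its name so 'function' steps can look it up."""
    REGISTRY[fn.__name__] = fn
    return fn

@transform
def calculate_lifetime_value(rows, discount_rate=0.05):
    # Hypothetical step logic on a list-of-dicts stand-in for a DataFrame
    return [{**r, "ltv": r["revenue"] * (1 - discount_rate)} for r in rows]

# A step `function: "calculate_lifetime_value"` with params {discount_rate: 0.05}
# resolves roughly as: REGISTRY["calculate_lifetime_value"](df, discount_rate=0.05)
```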
Recipe: The Mix-and-Match
transform:
  steps:
    # Step 1: SQL Filter (Fast)
    - sql: "SELECT * FROM df WHERE status = 'ACTIVE'"
    # Step 2: Custom Python Function (Complex Logic)
    # Looks up 'calculate_lifetime_value' in the registry
    - function: "calculate_lifetime_value"
      params: { discount_rate: 0.05 }
    # Step 3: Built-in Operation (Standard)
    - operation: "drop_duplicates"
      params: { subset: ["user_id"] }
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| steps | List[str | TransformStep] | Yes | - | List of transformation steps (SQL strings or TransformStep configs) |
DeleteDetectionConfig¶
Configuration for delete detection in Silver layer.
🔍 "CDC Without CDC" Guide¶
Business Problem: "Records are deleted in our Azure SQL source, but our Silver tables still show them."
The Solution: Use delete detection to identify and flag records that no longer exist in the source.
Recipe 1: SQL Compare (Recommended for HWM)
transform:
  steps:
    - operation: detect_deletes
      params:
        mode: sql_compare
        keys: [customer_id]
        source_connection: azure_sql
        source_table: dbo.Customers
Recipe 2: Snapshot Diff (For Full Snapshot Sources)
Use ONLY with full snapshot ingestion, NOT with HWM incremental.
Requires connection and path to specify the target Delta table for comparison.
transform:
  steps:
    - operation: detect_deletes
      params:
        mode: snapshot_diff
        keys: [customer_id]
        connection: silver_conn # Required: connection to target Delta table
        path: "silver/customers" # Required: path to target Delta table
Recipe 3: Conservative Threshold
transform:
steps:
- operation: detect_deletes
params:
mode: sql_compare
keys: [customer_id]
source_connection: erp
source_table: dbo.Customers
max_delete_percent: 20.0
on_threshold_breach: error
Recipe 4: Hard Delete (Remove Rows)
transform:
  steps:
    - operation: detect_deletes
      params:
        mode: sql_compare
        keys: [customer_id]
        source_connection: azure_sql
        source_table: dbo.Customers
        soft_delete_col: null # removes rows instead of flagging
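Under the hood, sql_compare amounts to an anti-join of target keys against the live source keys. A minimal stand-in using plain Python lists of dicts instead of DataFrames (not odibi's implementation):

```python
def detect_deletes(rows, source_key_rows, keys, soft_delete_col="_is_deleted"):
    # Business-key tuples still present in the live source
    live = {tuple(r[k] for k in keys) for r in source_key_rows}
    # Flag any target row whose key vanished from the source
    return [{**r, soft_delete_col: tuple(r[k] for k in keys) not in live}
            for r in rows]
```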
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| mode | DeleteDetectionMode | No | DeleteDetectionMode.NONE | Delete detection strategy: none, snapshot_diff, sql_compare |
| keys | List[str] | No | PydanticUndefined | Business key columns for comparison |
| connection | Optional[str] | No | - | For snapshot_diff: connection name to target Delta table (required for snapshot_diff) |
| path | Optional[str] | No | - | For snapshot_diff: path to target Delta table (required for snapshot_diff) |
| soft_delete_col | Optional[str] | No | _is_deleted | Column to flag deletes (True = deleted). Set to null for hard-delete (removes rows). |
| source_connection | Optional[str] | No | - | For sql_compare: connection name to query live source |
| source_table | Optional[str] | No | - | For sql_compare: table to query for current keys |
| source_query | Optional[str] | No | - | For sql_compare: custom SQL query for keys (overrides source_table) |
| snapshot_column | Optional[str] | No | - | For snapshot_diff on non-Delta: column to identify snapshots. If None, uses Delta time travel (default). |
| on_first_run | FirstRunBehavior | No | FirstRunBehavior.SKIP | Behavior when no previous version exists for snapshot_diff |
| max_delete_percent | Optional[float] | No | 50.0 | Safety threshold: warn/error if more than X% of rows would be deleted |
| on_threshold_breach | ThresholdBreachAction | No | ThresholdBreachAction.WARN | Behavior when delete percentage exceeds max_delete_percent |
ValidationConfig¶
Used in: NodeConfig
Configuration for data validation (post-transform checks).
When to Use: Output data quality checks that run after transformation but before writing.
See Also: Validation Guide, Quarantine Guide, Contracts Overview (pre-transform checks)
🛡️ "The Indestructible Pipeline" Pattern¶
Business Problem: "Bad data polluted our Gold reports, causing executives to make wrong decisions. We need to stop it before it lands."
The Solution: A Quality Gate that runs after transformation but before writing.
Recipe: The Quality Gate
validation:
  mode: "fail" # fail (stop pipeline) or warn (log only)
  on_fail: "alert" # alert or ignore
  tests:
    # 1. Completeness
    - type: "not_null"
      columns: ["transaction_id", "customer_id"]
    # 2. Integrity
    - type: "unique"
      columns: ["transaction_id"]
    - type: "accepted_values"
      column: "status"
      values: ["PENDING", "COMPLETED", "FAILED"]
    # 3. Ranges & Patterns
    - type: "range"
      column: "age"
      min: 18
      max: 120
    - type: "regex_match"
      column: "email"
      pattern: '^[\w\.-]+@[\w\.-]+\.\w+$' # single-quoted: \w is not a valid escape in YAML double quotes
    # 4. Business Logic (SQL)
    - type: "custom_sql"
      name: "dates_ordered"
      condition: "created_at <= completed_at"
      threshold: 0.01 # Allow 1% failure
Recipe: Quarantine + Gate
validation:
  tests:
    - type: not_null
      columns: [customer_id]
      on_fail: quarantine
  quarantine:
    connection: silver
    path: customers_quarantine
  gate:
    require_pass_rate: 0.95
    on_fail: abort
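The gate's core check is a simple ratio against require_pass_rate. An illustrative sketch, not odibi's code:

```python
def gate_passes(rows_total: int, rows_failed: int,
                require_pass_rate: float = 0.95) -> bool:
    """Return True when the batch meets the minimum pass rate."""
    if rows_total == 0:
        return True  # nothing to evaluate on an empty batch
    return (rows_total - rows_failed) / rows_total >= require_pass_rate
```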
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| mode | ValidationAction | No | ValidationAction.FAIL | Execution mode: 'fail' (stop pipeline) or 'warn' (log only) |
| on_fail | OnFailAction | No | OnFailAction.ALERT | Action on failure: 'alert' (send notification) or 'ignore' |
| tests | List[TestConfig] | No | PydanticUndefined | List of validation tests. Options: NotNullTest, UniqueTest, AcceptedValuesTest, RowCountTest, CustomSQLTest, RangeTest, RegexMatchTest, VolumeDropTest, SchemaContract, DistributionContract, FreshnessContract |
| quarantine | Optional[QuarantineConfig] | No | - | Quarantine configuration for failed rows |
| gate | Optional[GateConfig] | No | - | Quality gate configuration for batch-level validation |
| fail_fast | bool | No | False | Stop validation on first failure. Skips remaining tests for faster feedback. |
| cache_df | bool | No | False | Cache DataFrame before validation (Spark only). Improves performance with many tests. |
QuarantineConfig¶
Used in: ValidationConfig
Configuration for quarantine table routing.
When to Use: Capture invalid records for review/reprocessing instead of failing the pipeline.
See Also: Quarantine Guide, ValidationConfig
Routes rows that fail validation tests to a quarantine table with rejection metadata for later analysis/reprocessing.
Example:
validation:
  tests:
    - type: not_null
      columns: [customer_id]
      on_fail: quarantine
  quarantine:
    connection: silver
    path: customers_quarantine
    add_columns:
      _rejection_reason: true
      _rejected_at: true
    max_rows: 10000
    sample_fraction: 0.1
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| connection | str | Yes | - | Connection for quarantine writes |
| path | Optional[str] | No | - | Path for quarantine data |
| table | Optional[str] | No | - | Table name for quarantine |
| add_columns | QuarantineColumnsConfig | No | PydanticUndefined | Metadata columns to add to quarantined rows |
| retention_days | Optional[int] | No | 90 | Days to retain quarantined data (auto-cleanup) |
| max_rows | Optional[int] | No | - | Maximum number of rows to quarantine per run. Limits storage for high-failure batches. |
| sample_fraction | Optional[float] | No | - | Sample fraction of invalid rows to quarantine (0.0-1.0). Use for sampling large invalid sets. |
QuarantineColumnsConfig¶
Used in: QuarantineConfig
Columns added to quarantined rows for debugging and reprocessing.
Example:
quarantine:
  connection: silver
  path: customers_quarantine
  add_columns:
    _rejection_reason: true
    _rejected_at: true
    _source_batch_id: true
    _failed_tests: true
    _original_node: false
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| rejection_reason | bool | No | True | Add _rejection_reason column with test failure description |
| rejected_at | bool | No | True | Add _rejected_at column with UTC timestamp |
| source_batch_id | bool | No | True | Add _source_batch_id column with run ID for traceability |
| failed_tests | bool | No | True | Add _failed_tests column with comma-separated list of failed test names |
| original_node | bool | No | False | Add _original_node column with source node name |
GateConfig¶
Used in: ValidationConfig
Quality gate configuration for batch-level validation.
When to Use: Pipeline-level pass/fail thresholds, row count limits, change detection.
See Also: Quality Gates, ValidationConfig
Gates evaluate the entire batch before writing, ensuring data quality thresholds are met.
Example:
gate:
  require_pass_rate: 0.95
  on_fail: abort
  thresholds:
    - test: not_null
      min_pass_rate: 0.99
  row_count:
    min: 100
    change_threshold: 0.5
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| require_pass_rate | float | No | 0.95 | Minimum percentage of rows passing ALL tests |
| on_fail | GateOnFail | No | GateOnFail.ABORT | Action when gate fails |
| thresholds | List[GateThreshold] | No | PydanticUndefined | Per-test thresholds (overrides global require_pass_rate) |
| row_count | Optional[RowCountGate] | No | - | Row count anomaly detection |
GateThreshold¶
Used in: GateConfig
Per-test threshold configuration for quality gates.
Allows setting different pass rate requirements for specific tests.
Example:
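An illustrative per-test override (the unique threshold value is a placeholder):

```yaml
gate:
  require_pass_rate: 0.95 # global default
  thresholds:
    - test: not_null
      min_pass_rate: 0.99 # stricter for this test
    - test: unique
      min_pass_rate: 0.95
```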
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| test | str | Yes | - | Test name or type to apply threshold to |
| min_pass_rate | float | Yes | - | Minimum pass rate required (0.0-1.0, e.g., 0.99 = 99%) |
RowCountGate¶
Used in: GateConfig
Row count anomaly detection for quality gates.
Validates that batch size falls within expected bounds and detects significant changes from previous runs.
Example:
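An illustrative row-count gate (bounds are placeholders):

```yaml
gate:
  row_count:
    min: 100
    max: 1000000
    change_threshold: 0.5 # fail if count swings more than 50% vs previous run
```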
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| min | Optional[int] | No | - | Minimum expected row count |
| max | Optional[int] | No | - | Maximum expected row count |
| change_threshold | Optional[float] | No | - | Max allowed change vs previous run (e.g., 0.5 = 50% change triggers failure) |
WriteConfig¶
Used in: NodeConfig
Configuration for writing data from a node.
When to Use: Any node that persists data to storage.
Key Concepts:
- mode: How to handle existing data (overwrite, append, upsert)
- keys: Required for upsert mode - columns that identify unique records
- partition_by: Columns to partition output by (improves query performance)
See Also: Performance Tuning (partitioning strategies)
🚀 "Big Data Performance" Guide¶
Business Problem: "My dashboards are slow because the query scans terabytes of data just to find one day's sales."
The Solution: Use Partitioning for coarse filtering (skipping huge chunks) and Z-Ordering for fine-grained skipping (colocating related data).
Recipe: Lakehouse Optimized
write:
  connection: "gold_lake"
  format: "delta"
  table: "fact_sales"
  mode: "append"
  # 1. Partitioning: Physical folders.
  # Use for low-cardinality columns often used in WHERE clauses.
  # WARNING: Do NOT partition by high-cardinality cols like ID or Timestamp!
  partition_by: ["country_code", "txn_year_month"]
  # 2. Z-Ordering: Data clustering.
  # Use for high-cardinality columns often used in JOINs or predicates.
  zorder_by: ["customer_id", "product_id"]
  # 3. Table Properties: Engine tuning.
  table_properties:
    "delta.autoOptimize.optimizeWrite": "true"
    "delta.autoOptimize.autoCompact": "true"
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| connection | str | Yes | - | Connection name from project.yaml |
| format | ReadFormat | Yes | - | Output format: csv, parquet, delta, json, sql, api, excel, avro, cloudFiles |
| table | Optional[str] | No | - | Table name for SQL/Delta |
| path | Optional[str] | No | - | Path for file-based outputs |
| register_table | Optional[str] | No | - | Register file output as external table (Spark/Delta only) |
| mode | WriteMode | No | WriteMode.OVERWRITE | Write mode. Options: 'overwrite', 'append', 'upsert', 'append_once', 'merge'. Use 'append_once' for idempotent Bronze ingestion (requires 'keys' in options). See WriteMode enum for details. |
| partition_by | List[str] | No | PydanticUndefined | List of columns to physically partition the output by (folder structure). Use for low-cardinality columns (e.g. date, country). |
| zorder_by | List[str] | No | PydanticUndefined | List of columns to Z-Order by. Improves read performance for high-cardinality columns used in filters/joins (Delta only). |
| table_properties | Dict[str, str] | No | PydanticUndefined | Delta table properties. Overrides global performance.delta_table_properties. Example: {'delta.columnMapping.mode': 'name'} to allow special characters in column names. |
| merge_schema | bool | No | False | Allow schema evolution (mergeSchema option in Delta) |
| overwrite_schema | bool | No | False | Allow schema overwrite on mode=overwrite (overwriteSchema option in Delta). Use when the incoming schema differs from the existing table schema. |
| first_run_query | Optional[str] | No | - | SQL query for full-load on first run (High Water Mark pattern). If set, uses this query when target table doesn't exist, then switches to incremental. Only applies to SQL reads. |
| options | Dict[str, Any] | No | PydanticUndefined | Format-specific options |
| auto_optimize | bool | AutoOptimizeConfig | No | - | Auto-run OPTIMIZE and VACUUM after write (Delta only) |
| add_metadata | bool | WriteMetadataConfig | No | - | Add metadata columns for Bronze layer lineage. Set to true to add all applicable columns, or provide a WriteMetadataConfig for selective columns. Columns: _extracted_at, _source_file (file sources), _source_connection, _source_table (SQL sources). |
| skip_if_unchanged | bool | No | False | Skip write if DataFrame content is identical to previous write. Computes SHA256 hash of entire DataFrame and compares to stored hash in Delta table metadata. Useful for snapshot tables without timestamps to avoid redundant appends. Only supported for Delta format. |
| skip_hash_columns | Optional[List[str]] | No | - | Columns to include in hash computation for skip_if_unchanged. If None, all columns are used. Specify a subset to ignore volatile columns like timestamps. |
| skip_hash_sort_columns | Optional[List[str]] | No | - | Columns to sort by before hashing for deterministic comparison. Required if row order may vary between runs. Typically your business key columns. |
| streaming | Optional[StreamingWriteConfig] | No | - | Streaming write configuration for Spark Structured Streaming. When set, uses writeStream instead of batch write. Requires a streaming DataFrame from a streaming read source. |
| merge_keys | Optional[List[str]] | No | - | Key columns for SQL Server MERGE operations. Required when mode='merge'. These columns form the ON clause of the MERGE statement. |
| merge_options | Optional[SqlServerMergeOptions] | No | - | Options for SQL Server MERGE operations (conditions, staging, audit cols) |
| overwrite_options | Optional[SqlServerOverwriteOptions] | No | - | Options for SQL Server overwrite operations (strategy, audit cols) |
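skip_if_unchanged relies on a deterministic content hash. A sketch of the idea using plain Python rows (odibi hashes the actual DataFrame; the column subset and sort behavior mirror skip_hash_columns and skip_hash_sort_columns):

```python
import hashlib

def content_hash(rows, hash_columns, sort_columns):
    # Sort by business keys so varying row order yields the same hash
    ordered = sorted(rows, key=lambda r: tuple(r[c] for c in sort_columns))
    payload = "\n".join(
        "|".join(str(r[c]) for c in hash_columns) for r in ordered
    )
    return hashlib.sha256(payload.encode()).hexdigest()
```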
WriteMetadataConfig¶
Used in: WriteConfig
Configuration for metadata columns added during Bronze writes.
📋 Bronze Metadata Guide¶
Business Problem: "We need lineage tracking and debugging info for our Bronze layer data."
The Solution: Add metadata columns during ingestion for traceability.
Recipe 1: Add All Metadata (Recommended)
write:
  connection: bronze
  table: customers
  mode: append
  add_metadata: true # adds all applicable columns
Recipe 2: Selective Metadata
write:
  connection: bronze
  table: customers
  mode: append
  add_metadata:
    extracted_at: true
    source_file: true
    source_connection: false
    source_table: false
Available Columns:
- _extracted_at: Pipeline execution timestamp (all sources)
- _source_file: Source filename/path (file sources only)
- _source_connection: Connection name used (all sources)
- _source_table: Table or query name (SQL sources only)
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| extracted_at | bool | No | True | Add _extracted_at column with pipeline execution timestamp |
| source_file | bool | No | True | Add _source_file column with source filename (file sources only) |
| source_connection | bool | No | False | Add _source_connection column with connection name |
| source_table | bool | No | False | Add _source_table column with table/query name (SQL sources only) |
StreamingWriteConfig¶
Used in: WriteConfig
Configuration for Spark Structured Streaming writes.
🚀 "Real-Time Pipeline" Guide¶
Business Problem: "I need to process data continuously as it arrives from Kafka/Event Hubs and write it to Delta Lake in near real-time."
The Solution: Configure streaming write with checkpoint location for fault tolerance and trigger interval for processing frequency.
Recipe: Streaming Ingestion
write:
  connection: "silver_lake"
  format: "delta"
  table: "events_stream"
  streaming:
    output_mode: append
    checkpoint_location: "/checkpoints/events_stream"
    trigger:
      processing_time: "10 seconds"
Recipe: One-Time Streaming (Batch-like)
write:
  connection: "silver_lake"
  format: "delta"
  table: "events_batch"
  streaming:
    output_mode: append
    checkpoint_location: "/checkpoints/events_batch"
    trigger:
      available_now: true
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| output_mode | Literal['append', 'update', 'complete'] | No | append | Output mode for streaming writes. 'append' - Only new rows. 'update' - Updated rows only. 'complete' - Entire result table (requires aggregation). |
| checkpoint_location | str | Yes | - | Path for streaming checkpoints. Required for fault tolerance. Must be a reliable storage location (e.g., cloud storage, DBFS). |
| trigger | Optional[TriggerConfig] | No | - | Trigger configuration. If not specified, processes data as fast as possible. Use 'processing_time' for micro-batch intervals, 'once' for single batch, 'available_now' for processing all available data then stopping. |
| query_name | Optional[str] | No | - | Name for the streaming query (useful for monitoring and debugging) |
| await_termination | Optional[bool] | No | False | Wait for the streaming query to terminate. Set to True for batch-like streaming with 'once' or 'available_now' triggers. |
| timeout_seconds | Optional[int] | No | - | Timeout in seconds when await_termination is True. If None, waits indefinitely. |
TriggerConfig¶
Used in: StreamingWriteConfig
Configuration for streaming trigger intervals.
Specify exactly one of the trigger options.
Example:
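An illustrative micro-batch trigger (the interval is a placeholder):

```yaml
trigger:
  processing_time: "10 seconds"
```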
Or for one-time processing:
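An illustrative one-time trigger that drains all available data in batches, then stops:

```yaml
trigger:
  available_now: true
```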
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| processing_time | Optional[str] | No | - | Trigger interval as duration string (e.g., '10 seconds', '1 minute') |
| once | Optional[bool] | No | - | Process all available data once and stop |
| available_now | Optional[bool] | No | - | Process all available data in multiple batches, then stop |
| continuous | Optional[str] | No | - | Continuous processing with checkpoint interval (e.g., '1 second') |
AutoOptimizeConfig¶
Used in: WriteConfig
Configuration for Delta Lake automatic optimization.
Example:
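An illustrative write with auto optimization enabled (the nesting under write follows the WriteConfig auto_optimize field):

```yaml
write:
  connection: gold_lake
  format: delta
  table: fact_sales
  auto_optimize:
    enabled: true
    vacuum_retention_hours: 168 # 7 days; 0 disables VACUUM
```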
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| enabled | bool | No | True |
Enable auto optimization |
| vacuum_retention_hours | int | No | 168 |
Hours to retain history for VACUUM (default 7 days). Set to 0 to disable VACUUM. |
ApiOptionsConfig¶
Complete options configuration for API data sources (format: api).
When to Use: Pull data from REST APIs with pagination, retry, and rate limiting.
See Also: API Data Sources Guide
Example:
nodes:
  - name: api_data
    read:
      connection: my_api
      format: api
      path: /v1/records
      options:
        pagination:
          type: offset_limit
          limit: 1000
          max_pages: 100
        response:
          items_path: data.records
          add_fields:
            _source: "my_api"
            _fetched_at: "$now"
        retry:
          max_retries: 3
        rate_limit:
          requests_per_second: 5
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| pagination | Optional[ApiPaginationConfig] | No | - | Pagination configuration |
| response | Optional[ApiResponseConfig] | No | - | Response parsing configuration |
| retry | Optional[ApiRetryConfig] | No | - | Retry configuration |
| rate_limit | Optional[ApiRateLimitConfig] | No | - | Rate limiting configuration |
| method | str | No | GET | HTTP method (GET, POST) |
| headers | Optional[Dict[str, str]] | No | - | Additional HTTP headers |
| params | Optional[Dict[str, str]] | No | - | Additional query parameters |
| json_body | Optional[Dict[str, Any]] | No | - | JSON body for POST requests |
ApiPaginationConfig¶
Used in: ApiOptionsConfig
Pagination configuration for API data sources.
When to Use: Configure how to paginate through API results.
Example (offset/limit pagination):
read:
  format: api
  options:
    pagination:
      type: offset_limit
      offset_param: skip
      limit_param: limit
      limit: 1000
      max_pages: 100
Example (cursor-based pagination):
read:
  format: api
  options:
    pagination:
      type: cursor
      cursor_path: meta.next_cursor
      cursor_param: cursor
Example (link header pagination - GitHub style):
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| type | ApiPaginationType | No | ApiPaginationType.OFFSET_LIMIT | Pagination strategy to use |
| offset_param | str | No | offset | Query param name for offset (offset_limit type) |
| limit_param | str | No | limit | Query param name for limit |
| limit | int | No | 100 | Number of records per page |
| max_pages | Optional[int] | No | - | Maximum pages to fetch (None = unlimited) |
| cursor_path | Optional[str] | No | - | Dotted path to cursor in response (cursor type) |
| cursor_param | Optional[str] | No | - | Query param name for cursor (cursor type) |
| page_param | str | No | page | Query param name for page number (page_number type) |
| start_page | int | No | 1 | Starting page number (page_number type) |
ApiRateLimitConfig¶
Used in: ApiOptionsConfig
Rate limiting configuration for API requests.
Example:
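An illustrative rate limit (the value is a placeholder):

```yaml
read:
  format: api
  options:
    rate_limit:
      requests_per_second: 5
```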
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| requests_per_second | Optional[float] | No | - | Max requests per second (None = no limit) |
ApiResponseConfig¶
Used in: ApiOptionsConfig
Response parsing configuration for API data sources.
When to Use: Configure how to extract data from API responses.
Date Variables: Use in add_fields OR params for dynamic dates:
| Variable | Format | Example Value |
|---|---|---|
| $now | ISO timestamp | 2024-01-15T10:30:00+00:00 |
| $today | YYYY-MM-DD | 2024-01-15 |
| $yesterday | YYYY-MM-DD | 2024-01-14 |
| $date | YYYY-MM-DD | 2024-01-15 |
| $7_days_ago | YYYY-MM-DD | 2024-01-08 |
| $30_days_ago | YYYY-MM-DD | 2023-12-16 |
| $90_days_ago | YYYY-MM-DD | 2023-10-17 |
| $start_of_week | YYYY-MM-DD | 2024-01-15 |
| $start_of_month | YYYY-MM-DD | 2024-01-01 |
| $start_of_year | YYYY-MM-DD | 2024-01-01 |
| $today_compact | YYYYMMDD | 20240115 |
| $yesterday_compact | YYYYMMDD | 20240114 |
| $7_days_ago_compact | YYYYMMDD | 20240108 |
| $30_days_ago_compact | YYYYMMDD | 20231216 |
| $90_days_ago_compact | YYYYMMDD | 20231017 |
Example (add_fields):
read:
  format: api
  options:
    response:
      items_path: results
      add_fields:
        _fetched_at: "$now"
        _load_date: "$today"
Example (params with compact dates for openFDA):
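An illustrative sketch: the endpoint path and search syntax below are assumptions about openFDA's public API, not part of odibi; only the compact date variables come from the table above.

```yaml
read:
  format: api
  path: /drug/event.json # assumed openFDA endpoint
  options:
    params:
      search: "receivedate:[$30_days_ago_compact TO $today_compact]"
      limit: "100"
```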
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| items_path | str | No | - | Dotted path to items array in response (e.g., 'results', 'data.items'). Empty = response is the array. |
| dict_to_list | bool | No | False | If True and items_path resolves to a dict, extract dict values as rows with keys preserved in '_key' field. Useful for APIs like ddragon that return {'Aatrox': {...}, 'Ahri': {...}}. |
| add_fields | Optional[Dict[str, Any]] | No | - | Fields to add to each record. Supports date variables: $now, $today, $yesterday, $7_days_ago, etc. |
ApiRetryConfig¶
Used in: ApiOptionsConfig
Retry configuration for API requests.
Example:
read:
  format: api
  options:
    retry:
      max_retries: 3
      backoff_factor: 2.0
      retry_codes: [429, 500, 502, 503]
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| max_retries | int | No | 3 | Maximum retry attempts |
| backoff_factor | float | No | 2.0 | Exponential backoff multiplier |
| retry_codes | List[int] | No | [429, 500, 502, 503, 504] | HTTP status codes to retry on |
PrivacyConfig¶
Used in: NodeConfig
Configuration for PII anonymization.
🔐 Privacy & PII Protection¶
How It Works:
1. Mark columns as pii: true in the columns metadata
2. Configure a privacy block with the anonymization method
3. During node execution, all columns marked as PII (and inherited from dependencies) are anonymized
4. Upstream PII markings are inherited by downstream nodes
Example:
columns:
  customer_email:
    pii: true # Mark as PII
  customer_id:
    pii: false
privacy:
  method: hash # hash, mask, or redact
  salt: "secret_key" # Optional: makes hash unique/secure
  declassify: [] # Remove columns from PII protection
Methods:
- hash: SHA256 hash (length 64). With salt, prevents pre-computed rainbow tables.
- mask: Show only the last 4 chars, replacing the rest with *. Example: john@email.com → **********.com
- redact: Replace entire value with [REDACTED]
Important:
- pii: true alone does NOTHING. You must set a privacy.method to actually mask data.
- PII inheritance: If dependency outputs PII columns, this node inherits them unless declassified.
- Salt is optional but recommended for hash to prevent attacks.
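The hash method described above is a salted SHA256. A minimal sketch (the value+salt concatenation order is an assumption):

```python
import hashlib

def hash_pii(value: str, salt: str = "") -> str:
    """Salted SHA256, producing the 64-char hex digest noted above."""
    return hashlib.sha256((value + salt).encode()).hexdigest()
```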
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| method | PrivacyMethod | Yes | - | Anonymization method: 'hash' (SHA256), 'mask' (show last 4), or 'redact' ([REDACTED]) |
| salt | Optional[str] | No | - | Salt for hashing (optional but recommended). Appended before hashing to create unique hashes. Example: 'company_secret_key_2025' |
| declassify | List[str] | No | PydanticUndefined | List of columns to remove from PII protection (stops inheritance from upstream). Example: ['customer_id'] |
SqlServerAuditColsConfig¶
Audit column configuration for SQL Server merge operations.
These columns are automatically populated with GETUTCDATE() during merge:
- created_col: Set on INSERT only
- updated_col: Set on INSERT and UPDATE
Example:
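An illustrative audit-column config (column names are placeholders):

```yaml
merge_options:
  audit_cols:
    created_col: created_ts # stamped on INSERT only
    updated_col: updated_ts # stamped on INSERT and UPDATE
```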
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| created_col | Optional[str] | No | - | Column name for creation timestamp (set on INSERT) |
| updated_col | Optional[str] | No | - | Column name for update timestamp (set on INSERT and UPDATE) |
SqlServerMergeOptions¶
Used in: WriteConfig
Options for SQL Server MERGE operations (Phase 1).
Enables incremental sync from Spark to SQL Server using T-SQL MERGE. Data is written to a staging table, then merged into the target.
Basic Usage¶
write:
  connection: azure_sql
  format: sql_server
  table: sales.fact_orders
  mode: merge
  merge_keys: [DateId, store_id]
  merge_options:
    update_condition: "source._hash_diff != target._hash_diff"
    exclude_columns: [_hash_diff]
    audit_cols:
      created_col: created_ts
      updated_col: updated_ts
Conditions¶
- update_condition: Only update rows matching this condition (e.g., hash diff)
- delete_condition: Delete rows matching this condition (soft delete pattern)
- insert_condition: Only insert rows matching this condition
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| update_condition | Optional[str] | No | - | SQL condition for WHEN MATCHED UPDATE. Use 'source.' and 'target.' prefixes. Example: 'source._hash_diff != target._hash_diff' |
| delete_condition | Optional[str] | No | - | SQL condition for WHEN MATCHED DELETE. Example: 'source._is_deleted = 1' |
| insert_condition | Optional[str] | No | - | SQL condition for WHEN NOT MATCHED INSERT. Example: 'source.is_valid = 1' |
| exclude_columns | List[str] | No | PydanticUndefined | Columns to exclude from MERGE (not written to target table) |
| staging_schema | str | No | staging | Schema for staging table. Table name: {staging_schema}.{table}_staging |
| audit_cols | Optional[SqlServerAuditColsConfig] | No | - | Audit columns for created/updated timestamps |
| validations | Optional[ForwardRef('SqlServerMergeValidationConfig')] | No | - | Validation checks before merge (null keys, duplicate keys) |
| auto_create_schema | bool | No | False | Auto-create schema if it doesn't exist (Phase 4). Runs CREATE SCHEMA IF NOT EXISTS. |
| auto_create_table | bool | No | False | Auto-create target table if it doesn't exist (Phase 4). Infers schema from DataFrame. |
| schema_evolution | Optional[ForwardRef('SqlServerSchemaEvolutionConfig')] | No | - | Schema evolution configuration (Phase 4). Controls handling of schema differences. |
| batch_size | Optional[int] | No | - | Batch size for staging table writes (Phase 4). Chunks large DataFrames for memory efficiency. |
| primary_key_on_merge_keys | bool | No | False | Create a clustered primary key on merge_keys when auto-creating table. Enforces uniqueness. |
| index_on_merge_keys | bool | No | False | Create a nonclustered index on merge_keys. Use if primary key already exists elsewhere. |
| incremental | bool | No | False | Enable incremental merge optimization. When True, reads target table's keys and hashes to determine which rows changed, then only writes changed rows to staging. Significantly faster when few rows change between runs. |
| hash_column | Optional[str] | No | - | Name of pre-computed hash column in DataFrame for change detection. Used when incremental=True. If not specified, will auto-detect '_hash_diff' column. |
| change_detection_columns | Optional[List[str]] | No | - | Columns to use for computing change detection hash. Used when incremental=True and no hash_column is specified. If None, uses all non-key columns. |
| bulk_copy | bool | No | False | Enable bulk copy mode for fast staging table loads. Writes data to ADLS as staging file, then uses BULK INSERT. 10-50x faster than JDBC for large datasets. Requires staging_connection. |
| staging_connection | Optional[str] | No | - | Connection name for staging files (ADLS/Blob storage). Required when bulk_copy=True. The connection must have write access. |
| staging_path | Optional[str] | No | - | Path prefix for staging files. Defaults to 'odibi_staging/bulk'. Files are automatically cleaned up after successful load. |
| external_data_source | Optional[str] | No | - | SQL Server external data source name for BULK INSERT. If not specified and auto_setup=True, will be auto-generated as 'odibi_{staging_connection}'. |
| keep_staging_files | bool | No | False |
Keep staging files after load (for debugging). Default deletes after success. |
| auto_setup | bool | No | False |
Auto-create SQL Server external data source and credential if they don't exist. Reads auth credentials from staging_connection and creates matching SQL objects. Requires elevated SQL permissions (ALTER ANY EXTERNAL DATA SOURCE, CONTROL). |
| force_recreate | bool | No | False |
Force recreation of external data source and credential even if they exist. Use when you've rotated SAS tokens or storage keys and need to update SQL Server. Has no effect if auto_setup=False. |
| csv_options | Optional[Dict[str, str]] | No | - | Custom CSV options for bulk copy when writing to Azure SQL Database. Passed to Spark CSV writer. Defaults: quote='"', escape='"', escapeQuotes='true', nullValue='', emptyValue='', encoding='UTF-8'. Override any option here. |
SqlServerMergeValidationConfig¶
Validation configuration for SQL Server merge/overwrite operations.
Validates source data before writing to SQL Server.
Example:
merge_options:
validations:
check_null_keys: true
check_duplicate_keys: true
fail_on_validation_error: true
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| check_null_keys | bool | No | True |
Fail if merge_keys contain NULL values |
| check_duplicate_keys | bool | No | True |
Fail if merge_keys have duplicate combinations |
| fail_on_validation_error | bool | No | True |
If False, log warning instead of failing on validation errors |
SqlServerOverwriteOptions¶
Used in: WriteConfig
Options for SQL Server overwrite operations (Phase 2).
Enhanced overwrite with multiple strategies for different use cases.
Strategies¶
- truncate_insert - TRUNCATE TABLE then INSERT (fastest, requires TRUNCATE permission)
- drop_create - DROP TABLE, CREATE TABLE, INSERT (refreshes schema)
- delete_insert - DELETE FROM then INSERT (works with limited permissions)
Example¶
write:
connection: azure_sql
format: sql_server
table: fact.combined_downtime
mode: overwrite
overwrite_options:
strategy: truncate_insert
audit_cols:
created_col: created_ts
updated_col: updated_ts
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| strategy | SqlServerOverwriteStrategy | No | SqlServerOverwriteStrategy.TRUNCATE_INSERT |
Overwrite strategy: truncate_insert, drop_create, delete_insert |
| audit_cols | Optional[SqlServerAuditColsConfig] | No | - | Audit columns for created/updated timestamps |
| validations | Optional[SqlServerMergeValidationConfig] | No | - | Validation checks before overwrite |
| auto_create_schema | bool | No | False |
Auto-create schema if it doesn't exist (Phase 4). Runs CREATE SCHEMA IF NOT EXISTS. |
| auto_create_table | bool | No | False |
Auto-create target table if it doesn't exist (Phase 4). Infers schema from DataFrame. |
| schema_evolution | Optional[SqlServerSchemaEvolutionConfig] | No | - | Schema evolution configuration (Phase 4). Controls handling of schema differences. |
| batch_size | Optional[int] | No | - | Batch size for writes (Phase 4). Chunks large DataFrames for memory efficiency. |
| bulk_copy | bool | No | False |
Enable bulk copy mode for fast writes. Writes data to ADLS as staging file, then uses BULK INSERT. 10-50x faster than JDBC for large datasets. Requires staging_connection. |
| staging_connection | Optional[str] | No | - | Connection name for staging files (ADLS/Blob storage). Required when bulk_copy=True. The connection must have write access. |
| staging_path | Optional[str] | No | - | Path prefix for staging files. Defaults to 'odibi_staging/bulk'. Files are automatically cleaned up after successful load. |
| external_data_source | Optional[str] | No | - | SQL Server external data source name for BULK INSERT. If not specified and auto_setup=True, will be auto-generated as 'odibi_{staging_connection}'. |
| keep_staging_files | bool | No | False |
Keep staging files after load (for debugging). Default deletes after success. |
| auto_setup | bool | No | False |
Auto-create SQL Server external data source and credential if they don't exist. Reads auth credentials from staging_connection and creates matching SQL objects. Requires elevated SQL permissions (ALTER ANY EXTERNAL DATA SOURCE, CONTROL). |
| force_recreate | bool | No | False |
Force recreation of external data source and credential even if they exist. Use when you've rotated SAS tokens or storage keys and need to update SQL Server. Has no effect if auto_setup=False. |
| csv_options | Optional[Dict[str, str]] | No | - | Custom CSV options for bulk copy when writing to Azure SQL Database. Passed to Spark CSV writer. Defaults: quote='"', escape='"', escapeQuotes='true', nullValue='', emptyValue='', encoding='UTF-8'. Override any option here. |
SqlServerSchemaEvolutionConfig¶
Schema evolution configuration for SQL Server operations (Phase 4).
Controls automatic schema changes when DataFrame schema differs from target table.
Example:
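A minimal illustrative snippet (field names from the table below; values are placeholders):

```yaml
merge_options:
  schema_evolution:
    mode: evolve        # strict | evolve | ignore
    add_columns: true   # add new DataFrame columns via ALTER TABLE
```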
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| mode | SqlServerSchemaEvolutionMode | No | SqlServerSchemaEvolutionMode.STRICT |
Schema evolution mode: strict (fail), evolve (add columns), ignore (skip mismatched) |
| add_columns | bool | No | False |
If mode='evolve', automatically add new columns via ALTER TABLE ADD COLUMN |
TransformStep¶
Used in: TransformConfig
Single transformation step.
Supports four step types (exactly one required):
- sql - Inline SQL query string
- sql_file - Path to external .sql file (relative to the YAML file defining the node)
- function - Registered Python function name
- operation - Built-in operation (e.g., drop_duplicates)
sql_file Example:
If your project structure is:
project.yaml                 # imports pipelines/silver/silver.yaml
pipelines/
  silver/
    silver.yaml              # defines the node
    sql/
      transform.sql          # your SQL file
In silver.yaml, use a path relative to silver.yaml:
Important: The path is resolved relative to the YAML file where the node is defined,
NOT the project.yaml that imports it. Do NOT use absolute paths like /pipelines/silver/sql/....
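For the layout above, the step in silver.yaml would look like this (a sketch; the step syntax follows the TransformStep fields below):

```yaml
transform:
  steps:
    - sql_file: "sql/transform.sql"   # resolved relative to silver.yaml
```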
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| sql | Optional[str] | No | - | Inline SQL query. Use df to reference the current DataFrame. |
| sql_file | Optional[str] | No | - | Path to external .sql file, relative to the YAML file defining the node. Example: 'sql/transform.sql' resolves relative to the node's source YAML. |
| function | Optional[str] | No | - | Name of a registered Python function (@transform or @register). |
| operation | Optional[str] | No | - | Built-in operation name (e.g., drop_duplicates, fill_na). |
| params | Dict[str, Any] | No | PydanticUndefined |
Parameters to pass to function or operation. |
Contracts (Data Quality Gates)¶
Contracts (Pre-Transform Checks)¶
Contracts are fail-fast data quality checks that run on input data before transformation. They always halt execution on failure - use them to prevent bad data from entering the pipeline.
Contracts vs Validation vs Quality Gates:
| Feature | When it Runs | On Failure | Use Case |
|---|---|---|---|
| Contracts | Before transform | Always fails | Input data quality (not-null, unique keys) |
| Validation | After transform | Configurable (fail/warn/quarantine) | Output data quality (ranges, formats) |
| Quality Gates | After validation | Configurable (abort/warn) | Pipeline-level thresholds (pass rate, row counts) |
| Quarantine | With validation | Routes bad rows | Capture invalid records for review |
See Also: - Validation Guide - Full validation configuration - Quarantine Guide - Quarantine setup and review - Getting Started: Validation
Example:
- name: "process_orders"
contracts:
- type: not_null
columns: [order_id, customer_id]
- type: row_count
min: 100
- type: freshness
column: created_at
max_age: "24h"
read:
source: raw_orders
AcceptedValuesTest¶
Used in: NodeConfig, ValidationConfig
Ensures a column only contains values from an allowed list.
When to Use: Enum-like fields, status columns, categorical data validation.
See Also: Contracts Overview
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| type | Literal['accepted_values'] | No | TestType.ACCEPTED_VALUES |
- |
| name | Optional[str] | No | - | Optional name for the check |
| on_fail | ContractSeverity | No | ContractSeverity.FAIL |
Action on failure |
| column | str | Yes | - | Column to check |
| values | List[Any] | Yes | - | Allowed values |
CustomSQLTest¶
Used in: NodeConfig, ValidationConfig
Runs a custom SQL condition and fails if too many rows violate it.
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| type | Literal['custom_sql'] | No | TestType.CUSTOM_SQL |
- |
| name | Optional[str] | No | - | Optional name for the check |
| on_fail | ContractSeverity | No | ContractSeverity.FAIL |
Action on failure |
| condition | str | Yes | - | SQL condition that should be true for valid rows |
| threshold | float | No | 0.0 |
Failure rate threshold (0.0 = strictly no failures allowed) |
DistributionContract¶
Used in: NodeConfig, ValidationConfig
Checks if a column's statistical distribution is within expected bounds.
When to Use: Detect data drift, anomaly detection, statistical monitoring.
See Also: Contracts Overview
contracts:
- type: distribution
column: price
metric: mean
threshold: ">100" # Mean must be > 100
on_fail: warn
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| type | Literal['distribution'] | No | TestType.DISTRIBUTION |
- |
| name | Optional[str] | No | - | Optional name for the check |
| on_fail | ContractSeverity | No | ContractSeverity.WARN |
- |
| column | str | Yes | - | Column to analyze |
| metric | Literal['mean', 'min', 'max', 'null_percentage'] | Yes | - | Statistical metric to check |
| threshold | str | Yes | - | Threshold expression (e.g., '>100', '<0.05') |
FreshnessContract¶
Used in: NodeConfig, ValidationConfig
Validates that data is not stale by checking a timestamp column.
When to Use: Source systems that should update regularly, SLA monitoring.
See Also: Contracts Overview
contracts:
- type: freshness
column: updated_at
max_age: "24h" # Fail if no data newer than 24 hours
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| type | Literal['freshness'] | No | TestType.FRESHNESS |
- |
| name | Optional[str] | No | - | Optional name for the check |
| on_fail | ContractSeverity | No | ContractSeverity.FAIL |
- |
| column | str | No | updated_at |
Timestamp column to check |
| max_age | str | Yes | - | Maximum allowed age (e.g., '24h', '7d') |
NotNullTest¶
Used in: NodeConfig, ValidationConfig
Ensures specified columns contain no NULL values.
When to Use: Primary keys, required fields, foreign keys that must resolve.
See Also: Contracts Overview
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| type | Literal['not_null'] | No | TestType.NOT_NULL |
- |
| name | Optional[str] | No | - | Optional name for the check |
| on_fail | ContractSeverity | No | ContractSeverity.FAIL |
Action on failure |
| columns | List[str] | Yes | - | Columns that must not contain nulls |
RangeTest¶
Used in: NodeConfig, ValidationConfig
Ensures column values fall within a specified range.
When to Use: Numeric bounds validation (ages, prices, quantities), date ranges.
See Also: Contracts Overview
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| type | Literal['range'] | No | TestType.RANGE |
- |
| name | Optional[str] | No | - | Optional name for the check |
| on_fail | ContractSeverity | No | ContractSeverity.FAIL |
Action on failure |
| column | str | Yes | - | Column to check |
| min | int \| float \| str | No | - | Minimum value (inclusive) |
| max | int \| float \| str | No | - | Maximum value (inclusive) |
RegexMatchTest¶
Used in: NodeConfig, ValidationConfig
Ensures column values match a regex pattern.
When to Use: Format validation (emails, phone numbers, IDs, codes).
See Also: Contracts Overview
contracts:
- type: regex_match
column: email
pattern: "^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$"
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| type | Literal['regex_match'] | No | TestType.REGEX_MATCH |
- |
| name | Optional[str] | No | - | Optional name for the check |
| on_fail | ContractSeverity | No | ContractSeverity.FAIL |
Action on failure |
| column | str | Yes | - | Column to check |
| pattern | str | Yes | - | Regex pattern to match |
RowCountTest¶
Used in: NodeConfig, ValidationConfig
Validates that row count falls within expected bounds.
When to Use: Ensure minimum data completeness, detect truncated loads, cap batch sizes.
See Also: Contracts Overview, GateConfig
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| type | Literal['row_count'] | No | TestType.ROW_COUNT |
- |
| name | Optional[str] | No | - | Optional name for the check |
| on_fail | ContractSeverity | No | ContractSeverity.FAIL |
Action on failure |
| min | Optional[int] | No | - | Minimum row count |
| max | Optional[int] | No | - | Maximum row count |
SchemaContract¶
Used in: NodeConfig, ValidationConfig
Validates that the DataFrame schema matches expected columns.
When to Use: Enforce schema stability, detect upstream schema drift, ensure column presence.
See Also: Contracts Overview, SchemaPolicyConfig
Uses the columns metadata from NodeConfig to verify schema.
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| type | Literal['schema'] | No | TestType.SCHEMA |
- |
| name | Optional[str] | No | - | Optional name for the check |
| on_fail | ContractSeverity | No | ContractSeverity.FAIL |
- |
| strict | bool | No | True |
If true, fail on unexpected columns |
UniqueTest¶
Used in: NodeConfig, ValidationConfig
Ensures specified columns (or combination) contain unique values.
When to Use: Primary keys, natural keys, deduplication verification.
See Also: Contracts Overview
contracts:
- type: unique
columns: [order_id] # Single column
# OR composite key:
- type: unique
columns: [customer_id, order_date] # Composite uniqueness
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| type | Literal['unique'] | No | TestType.UNIQUE |
- |
| name | Optional[str] | No | - | Optional name for the check |
| on_fail | ContractSeverity | No | ContractSeverity.FAIL |
Action on failure |
| columns | List[str] | Yes | - | Columns that must be unique (composite key if multiple) |
VolumeDropTest¶
Used in: NodeConfig, ValidationConfig
Checks if row count dropped significantly compared to history.
When to Use: Detect source outages, partial loads, or data pipeline issues.
See Also: Contracts Overview, RowCountTest
Formula: (current - avg) / avg < -threshold
contracts:
- type: volume_drop
threshold: 0.5 # Fail if > 50% drop from 7-day average
lookback_days: 7
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| type | Literal['volume_drop'] | No | TestType.VOLUME_DROP |
- |
| name | Optional[str] | No | - | Optional name for the check |
| on_fail | ContractSeverity | No | ContractSeverity.FAIL |
Action on failure |
| threshold | float | No | 0.5 |
Max allowed drop (0.5 = 50% drop) |
| lookback_days | int | No | 7 |
Days of history to average |
Global Settings¶
LineageConfig¶
Used in: ProjectConfig
Configuration for OpenLineage integration.
Example:
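A minimal illustrative snippet (the URL is a placeholder for your OpenLineage-compatible endpoint):

```yaml
lineage:
  url: "${OPENLINEAGE_URL}"
  namespace: "odibi"
  api_key: "${OPENLINEAGE_API_KEY}"
```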
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| url | Optional[str] | No | - | OpenLineage API URL |
| namespace | str | No | odibi |
Namespace for jobs |
| api_key | Optional[str] | No | - | API Key |
AlertConfig¶
Used in: ProjectConfig
Configuration for alerts with throttling support.
Supports Slack, Teams, and generic webhooks with event-specific payloads.
Available Events:
- on_start - Pipeline started
- on_success - Pipeline completed successfully
- on_failure - Pipeline failed
- on_quarantine - Rows were quarantined
- on_gate_block - Quality gate blocked the pipeline
- on_threshold_breach - A threshold was exceeded
Example:
alerts:
- type: slack
url: "${SLACK_WEBHOOK_URL}"
on_events:
- on_failure
- on_quarantine
- on_gate_block
metadata:
throttle_minutes: 15
max_per_hour: 10
channel: "#data-alerts"
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| type | AlertType | Yes | - | - |
| url | str | Yes | - | Webhook URL |
| on_events | List[AlertEvent] | No | ['on_failure'] |
Events to trigger alert: on_start, on_success, on_failure, on_quarantine, on_gate_block, on_threshold_breach |
| metadata | Dict[str, Any] | No | PydanticUndefined |
Extra metadata: throttle_minutes, max_per_hour, channel, etc. |
LoggingConfig¶
Used in: ProjectConfig
Logging configuration.
Example:
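A minimal illustrative snippet (metadata keys are placeholders):

```yaml
logging:
  level: "INFO"
  structured: true      # JSON logs for log aggregators
  metadata:
    team: "data-platform"
```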
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| level | LogLevel | No | LogLevel.INFO |
- |
| structured | bool | No | False |
Output JSON logs |
| metadata | Dict[str, Any] | No | PydanticUndefined |
Extra metadata in logs |
PerformanceConfig¶
Used in: ProjectConfig
Performance tuning configuration.
Example:
performance:
use_arrow: true
spark_config:
"spark.sql.shuffle.partitions": "200"
"spark.sql.adaptive.enabled": "true"
"spark.databricks.delta.optimizeWrite.enabled": "true"
delta_table_properties:
"delta.columnMapping.mode": "name"
Spark Config Notes:
- Configs are applied via spark.conf.set() at runtime
- For existing sessions (e.g., Databricks), only runtime-settable configs will take effect
- Session-level configs (e.g., spark.executor.memory) require session restart
- Common runtime-safe configs: shuffle partitions, adaptive query execution, Delta optimizations
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| use_arrow | bool | No | True |
Use Apache Arrow-backed DataFrames (Pandas only). Reduces memory and speeds up I/O. |
| spark_config | Dict[str, str] | No | PydanticUndefined |
Spark configuration settings applied at runtime via spark.conf.set(). Example: {'spark.sql.shuffle.partitions': '200', 'spark.sql.adaptive.enabled': 'true'}. Note: Some configs require session restart and cannot be set at runtime. |
| delta_table_properties | Dict[str, str] | No | PydanticUndefined |
Default table properties applied to all Delta writes. Example: {'delta.columnMapping.mode': 'name'} to allow special characters in column names. |
| skip_null_profiling | bool | No | False |
Skip null profiling in metadata collection phase. Reduces execution time for large DataFrames by avoiding an additional Spark job. |
| skip_catalog_writes | bool | No | False |
Skip catalog metadata writes (register_asset, track_schema, log_pattern, record_lineage) after each node write. Significantly improves performance for high-throughput pipelines like Bronze layer ingestion. Set to true when catalog tracking is not needed. |
| skip_run_logging | bool | No | False |
Skip batch catalog writes at pipeline end (log_runs_batch, register_outputs_batch). Saves 10-20s per pipeline run. Enable when you don't need run history in the catalog. Stories are still generated and contain full execution details. |
RetryConfig¶
Used in: ProjectConfig
Retry configuration for transient failures.
Automatically retries failed operations (database timeouts, network issues, rate limits) with configurable backoff strategy.
Example:
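A minimal snippet using the defaults described below:

```yaml
retry:
  enabled: true
  max_attempts: 3
  backoff: "exponential"   # exponential | linear | constant
```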
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| enabled | bool | No | True |
Enable automatic retry on transient failures (timeouts, connection errors) |
| max_attempts | int | No | 3 |
Maximum number of retry attempts before failing. Total attempts = 1 + max_attempts. |
| backoff | BackoffStrategy | No | BackoffStrategy.EXPONENTIAL |
Wait strategy between retries. 'exponential' (2^n seconds, recommended), 'linear' (n seconds), or 'constant' (fixed 1 second). |
StoryConfig¶
Used in: ProjectConfig
Story generation configuration.
Stories are ODIBI's core value - execution reports with lineage. They must use a connection for consistent, traceable output.
Example:
story:
connection: "local_data"
path: "stories/"
retention_days: 30
failure_sample_size: 100
max_failure_samples: 500
max_sampled_validations: 5
Failure Sample Settings:
- failure_sample_size: Number of failed rows to capture per validation (default: 100)
- max_failure_samples: Total failed rows across all validations (default: 500)
- max_sampled_validations: After this many validations, show only counts (default: 5)
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| connection | str | Yes | - | Connection name for story output (uses connection's path resolution) |
| path | str | Yes | - | Path for stories (relative to connection base_path) |
| max_sample_rows | int | No | 10 |
Maximum rows to include in data samples within story reports. Higher values give more debugging context but increase file size. Set to 0 to disable data sampling. |
| auto_generate | bool | No | True |
- |
| retention_days | Optional[int] | No | 30 |
Days to keep stories |
| retention_count | Optional[int] | No | 100 |
Max number of stories to keep |
| failure_sample_size | int | No | 100 |
Number of failed rows to capture per validation rule |
| max_failure_samples | int | No | 500 |
Maximum total failed rows across all validations |
| max_sampled_validations | int | No | 5 |
After this many validations, show only counts (no samples) |
| async_generation | bool | No | False |
Generate stories asynchronously (fire-and-forget). Pipeline returns immediately while story writes in background. Improves multi-pipeline performance by ~5-10s per pipeline. |
| generate_lineage | bool | No | True |
Generate combined lineage graph from all stories. Creates a unified view of data flow across pipelines. |
| docs | Optional[ForwardRef('DocsConfig')] | No | - | Documentation generation settings. Generates README.md, TECHNICAL_DETAILS.md, NODE_CARDS/*.md from Story data. |
Transformation Reference¶
How to Use Transformers¶
You can use any transformer in two ways:
1. As a Top-Level Transformer ("The App"). Use this for major operations that define the node's purpose (e.g. Merge, SCD2).
2. As a Step in a Chain ("The Script"). Use this for smaller operations within a transform block (e.g. clean_text, filter).
Available Transformers:
The models below describe the params required for each transformer.
📂 Common Operations¶
CaseWhenCase¶
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| condition | str | Yes | - | - |
| value | str | Yes | - | - |
add_prefix (AddPrefixParams)¶
Adds a prefix to column names.
Configuration for adding a prefix to column names.
Example - All columns:
Example - Specific columns:
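Illustrative snippets (column names are placeholders):

```yaml
# All columns
add_prefix:
  prefix: "src_"

# Specific columns, excluding keys
add_prefix:
  prefix: "src_"
  columns: [name, email]
  exclude: [id]
```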
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| prefix | str | Yes | - | Prefix to add to column names |
| columns | Optional[List[str]] | No | - | Columns to prefix (default: all columns) |
| exclude | Optional[List[str]] | No | - | Columns to exclude from prefixing |
add_suffix (AddSuffixParams)¶
Adds a suffix to column names.
Configuration for adding a suffix to column names.
Example - All columns:
Example - Specific columns:
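Illustrative snippets (column names are placeholders):

```yaml
# All columns
add_suffix:
  suffix: "_raw"

# Specific columns only
add_suffix:
  suffix: "_raw"
  columns: [amount, quantity]
```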
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| suffix | str | Yes | - | Suffix to add to column names |
| columns | Optional[List[str]] | No | - | Columns to suffix (default: all columns) |
| exclude | Optional[List[str]] | No | - | Columns to exclude from suffixing |
case_when (CaseWhenParams)¶
Implements structured CASE WHEN logic.
Configuration for conditional logic.
Example:
case_when:
output_col: "age_group"
default: "'Adult'"
cases:
- condition: "age < 18"
value: "'Minor'"
- condition: "age > 65"
value: "'Senior'"
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| cases | List[CaseWhenCase] | Yes | - | List of conditional branches |
| default | str | No | NULL |
Default value if no condition met |
| output_col | str | Yes | - | Name of the resulting column |
cast_columns (CastColumnsParams)¶
Casts specific columns to new types while keeping others intact.
Normalizes common type aliases (int -> INTEGER, str -> STRING, float -> DOUBLE) and passes through complex SQL types as-is (e.g., ARRAY<STRING>).
Args: context (current execution context with dataframe); params (cast configuration with column-to-type mapping).
Returns: new context with columns cast to specified types.
Example: Cast columns using simple type aliases and complex SQL types:
Configuration for column type casting.
Example:
cast_columns:
casts:
age: "int"
salary: "DOUBLE"
created_at: "TIMESTAMP"
tags: "ARRAY<STRING>" # Raw SQL types allowed
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| casts | Dict[str, SimpleType \| str] | Yes | - | Map of column to target SQL type |
clean_text (CleanTextParams)¶
Applies string cleaning operations (Trim/Case) via SQL.
Configuration for text cleaning.
Example:
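A minimal illustrative snippet (column names are placeholders):

```yaml
clean_text:
  columns: [name, city]
  trim: true
  case: "lower"   # lower | upper | preserve
```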
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| columns | List[str] | Yes | - | List of columns to clean |
| trim | bool | No | True |
Apply TRIM() |
| case | Literal['lower', 'upper', 'preserve'] | No | preserve |
Case conversion |
coalesce_columns (CoalesceColumnsParams)¶
Returns the first non-null value from a list of columns.
Useful for fallback/priority scenarios where you want to use the first available value from multiple columns (e.g., primary phone, backup phone, home phone).
Args: context (current execution context with dataframe); params (coalesce configuration: columns list, output name, drop source option).
Returns: new context with coalesced column added.
Example: Create a primary phone column from multiple fallback options:
coalesce_columns:
columns: ["mobile_phone", "work_phone", "home_phone"]
output_col: "primary_phone"
drop_source: true
Configuration for coalescing columns (first non-null value).
Example - Phone number fallback:
Example - Timestamp fallback:
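Illustrative snippets (column names are placeholders):

```yaml
# Phone number fallback
coalesce_columns:
  columns: [mobile_phone, work_phone, home_phone]
  output_col: "primary_phone"
  drop_source: true

# Timestamp fallback
coalesce_columns:
  columns: [updated_at, created_at]
  output_col: "last_modified"
```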
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| columns | List[str] | Yes | - | List of columns to coalesce (in priority order) |
| output_col | str | Yes | - | Name of the output column |
| drop_source | bool | No | False |
Drop the source columns after coalescing |
concat_columns (ConcatColumnsParams)¶
Concatenates multiple columns into one string. NULLs are skipped (treated as empty string) using CONCAT_WS behavior.
Configuration for string concatenation.
Example:
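A minimal illustrative snippet (column names are placeholders):

```yaml
concat_columns:
  columns: [first_name, last_name]
  separator: " "
  output_col: "full_name"
```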
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| columns | List[str] | Yes | - | Columns to concatenate |
| separator | str | No | - | Separator string |
| output_col | str | Yes | - | Resulting column name |
convert_timezone (ConvertTimezoneParams)¶
Converts a timestamp from one timezone to another. Assumes the input column is a naive timestamp representing time in source_tz, or a timestamp with timezone.
Configuration for timezone conversion.
Example:
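A minimal illustrative snippet (column names are placeholders; timezone names are IANA identifiers):

```yaml
convert_timezone:
  col: "event_ts"
  source_tz: "UTC"
  target_tz: "America/Los_Angeles"
  output_col: "event_ts_local"
```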
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| col | str | Yes | - | Timestamp column to convert |
| source_tz | str | No | UTC |
Source timezone (e.g., 'UTC', 'America/New_York') |
| target_tz | str | Yes | - | Target timezone (e.g., 'America/Los_Angeles') |
| output_col | Optional[str] | No | - | Name of the result column (default: {col}_{target_tz}) |
date_add (DateAddParams)¶
Adds an interval to a date/timestamp column.
Configuration for date addition.
Example:
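A minimal illustrative snippet (column name is a placeholder):

```yaml
date_add:
  col: "order_date"
  value: 30
  unit: "day"
```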
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| col | str | Yes | - | - |
| value | int | Yes | - | - |
| unit | Literal['day', 'month', 'year', 'hour', 'minute', 'second'] | Yes | - | - |
date_diff (DateDiffParams)¶
Calculates difference between two dates/timestamps. Returns the elapsed time in the specified unit (as float for sub-day units).
Configuration for date difference.
Example:
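A minimal illustrative snippet (column names are placeholders):

```yaml
date_diff:
  start_col: "order_date"
  end_col: "ship_date"
  unit: "day"
```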
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| start_col | str | Yes | - | - |
| end_col | str | Yes | - | - |
| unit | Literal['day', 'hour', 'minute', 'second'] | No | day |
- |
date_trunc (DateTruncParams)¶
Truncates a date/timestamp to the specified precision.
Configuration for date truncation.
Example:
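A minimal illustrative snippet (column name is a placeholder):

```yaml
date_trunc:
  col: "created_at"
  unit: "month"
```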
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| col | str | Yes | - | - |
| unit | Literal['year', 'month', 'day', 'hour', 'minute', 'second'] | Yes | - | - |
derive_columns (DeriveColumnsParams)¶
Appends new columns based on SQL expressions.
Design:
- Uses projection to add fields.
- Keeps all existing columns via *.
Configuration for derived columns.
Example:
derive_columns:
derivations:
total_price: "quantity * unit_price"
full_name: "concat(first_name, ' ', last_name)"
Note: Engine will fail if expressions reference non-existent columns.
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| derivations | Dict[str, str] | Yes | - | Map of column name to SQL expression |
distinct (DistinctParams)¶
Return unique rows from the dataset using SQL DISTINCT.
Parameters¶
- context : EngineContext - the engine context containing the DataFrame to deduplicate.
- params : DistinctParams - parameters specifying which columns to consider for uniqueness. If None, all columns are used.
Returns¶
EngineContext The updated engine context with duplicate rows removed.
Configuration for distinct rows.
Example:
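A minimal illustrative snippet (column names are placeholders):

```yaml
distinct:
  columns: [customer_id, order_date]
```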
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| columns | Optional[List[str]] | No | - | Columns to project (if None, keeps all columns unique) |
drop_columns (DropColumnsParams)¶
Removes the specified columns from the DataFrame.
Configuration for dropping specific columns (blacklist).
Example:
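A minimal illustrative snippet (column names are placeholders):

```yaml
drop_columns:
  columns: [_temp_col, raw_payload]
```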
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| columns | List[str] | Yes | - | List of column names to drop |
extract_date_parts (ExtractDateParams)¶
Extracts date parts using ANSI SQL extract/functions.
Configuration for extracting date parts.
Example:
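A minimal illustrative snippet (column name and prefix are placeholders):

```yaml
extract_date_parts:
  source_col: "created_at"
  prefix: "created"
  parts: [year, month, day]
```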
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| source_col | str | Yes | - | - |
| prefix | Optional[str] | No | - | - |
| parts | List[Literal['year', 'month', 'day', 'hour']] | No | ['year', 'month', 'day'] |
- |
fill_nulls (FillNullsParams)¶
Replaces null values with specified defaults using COALESCE.
Configuration for filling null values.
Example:
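A minimal illustrative snippet (column names and fill values are placeholders):

```yaml
fill_nulls:
  values:
    status: "unknown"
    quantity: 0
```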
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| values | Dict[str, str \| int \| float \| bool] | Yes | - | Map of column to fill value |
filter_rows (FilterRowsParams)¶
Filters rows using a standard SQL WHERE clause.
Design:
- SQL-First: pushes filtering to the engine's optimizer.
- Zero-Copy: no data movement to Python.
Configuration for filtering rows.
Example:
Example (Null Check):
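Illustrative snippets (column names are placeholders):

```yaml
filter_rows:
  condition: "age > 18 AND status = 'active'"

# Null check
filter_rows:
  condition: "email IS NOT NULL"
```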
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| condition | str | Yes | - | SQL WHERE clause (e.g., 'age > 18 AND status = "active"') |
limit (LimitParams)¶
Limit the number of rows returned from the dataset.
Parameters¶
- context : EngineContext - the engine context containing the DataFrame to limit.
- params : LimitParams - parameters specifying the number of rows to return and the offset.
Returns¶
EngineContext The updated engine context with the limited DataFrame.
Configuration for result limiting.
Example:
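A minimal illustrative snippet:

```yaml
limit:
  n: 100
  offset: 0
```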
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| n | int | Yes | - | Number of rows to return |
| offset | int | No | 0 |
Number of rows to skip |
normalize_column_names (NormalizeColumnNamesParams)¶
Normalizes column names to a consistent style.
Useful for cleaning up messy source data with spaces, mixed case, or special characters. Converts names like "First Name" or "firstName" to "first_name".
Args:
- context: Current execution context with dataframe
- params: Normalization configuration (style, lowercase, special char handling)
Returns: New context with standardized column names
Example: Convert all columns to lowercase snake_case:
Configuration for normalizing column names.
Example:
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| style | Literal['snake_case', 'none'] | No | snake_case |
Naming style: 'snake_case' converts spaces/special chars to underscores |
| lowercase | bool | No | True |
Convert names to lowercase |
| remove_special | bool | No | True |
Remove special characters except underscores |
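A minimal sketch converting all columns to lowercase snake_case:

```yaml
normalize_column_names:
  style: "snake_case"
  lowercase: true
  remove_special: true
```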
normalize_schema (NormalizeSchemaParams)¶
Structural transformation to rename, drop, and reorder columns.
Note: This is one of the few transforms that can sometimes perform better via the native DataFrame API, but SQL projection handles it correctly and behaves consistently across engines.
Configuration for schema normalization.
Example:
normalize_schema:
rename:
old_col: "new_col"
drop: ["unused_col"]
select_order: ["id", "new_col", "created_at"]
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| rename | Optional[Dict[str, str]] | No | PydanticUndefined |
old_name -> new_name |
| drop | Optional[List[str]] | No | PydanticUndefined |
Columns to remove; ignored if not present |
| select_order | Optional[List[str]] | No | - | Final column order; any missing columns appended after |
rename_columns (RenameColumnsParams)¶
Renames columns according to the provided mapping. Columns not in the mapping are kept unchanged.
Configuration for bulk column renaming.
Example:
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| mapping | Dict[str, str] | Yes | - | Map of old column name to new column name |
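A minimal sketch (column names are illustrative):

```yaml
rename_columns:
  mapping:
    custID: "customer_id"
    "Order Date": "order_date"
```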
replace_values (ReplaceValuesParams)¶
Replaces values in specified columns according to the mapping.
Supports replacing specific values with new values or NULL. Useful for data standardization (e.g., "N/A" -> NULL, "Unknown" -> NULL).
Args:
- context: Current execution context with dataframe
- params: Replacement configuration (columns and value mapping)
Returns: New context with replaced values
Example: Standardize missing value indicators:
replace_values:
columns: ["status", "category"]
mapping:
"N/A": null
"Unknown": null
"PENDING": "pending"
Configuration for bulk value replacement.
Example - Standardize nulls:
Example - Code replacement:
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| columns | List[str] | Yes | - | Columns to apply replacements to |
| mapping | Dict[str, Optional[str]] | Yes | - | Map of old value to new value (use null for NULL) |
sample (SampleParams)¶
Return a random sample of rows from the dataset.
Parameters¶
- context (EngineContext): The engine context containing the DataFrame to sample from.
- params (SampleParams): Parameters specifying the fraction of rows to return and the random seed.
Returns¶
EngineContext: The updated engine context with the sampled DataFrame.
Configuration for random sampling.
Example:
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| fraction | float | Yes | - | Fraction of rows to return (0.0 to 1.0) |
| seed | Optional[int] | No | - | - |
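A minimal sketch taking a reproducible 10% sample:

```yaml
sample:
  fraction: 0.1
  seed: 42
```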
select_columns (SelectColumnsParams)¶
Keeps only the specified columns, dropping all others.
Configuration for selecting specific columns (whitelist).
Example:
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| columns | List[str] | Yes | - | List of column names to keep |
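A minimal sketch (column names are illustrative):

```yaml
select_columns:
  columns: ["id", "name", "created_at"]
```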
sort (SortParams)¶
Sort the dataset by one or more columns.
Parameters¶
- context (EngineContext): The engine context containing the DataFrame to sort.
- params (SortParams): Parameters specifying columns to sort by and sort order.
Returns¶
EngineContext: The updated engine context with the sorted DataFrame.
Configuration for sorting.
Example:
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| by | str | List[str] | Yes | - | Column(s) to sort by |
| ascending | bool | No | True |
Sort order |
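A minimal sketch (column names are illustrative):

```yaml
sort:
  by: ["region", "sales"]
  ascending: false
```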
split_part (SplitPartParams)¶
Extracts the Nth part of a string after splitting by a delimiter.
Configuration for splitting strings.
Example:
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| col | str | Yes | - | Column to split |
| delimiter | str | Yes | - | Delimiter to split by |
| index | int | Yes | - | 1-based index of the token to extract |
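A sketch extracting the domain from an email address (remember the index is 1-based):

```yaml
split_part:
  col: "email"
  delimiter: "@"
  index: 2
```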
trim_whitespace (TrimWhitespaceParams)¶
Trims leading and trailing whitespace from string columns.
If no columns are specified, applies to all columns (SQL TRIM handles non-strings gracefully). Useful for cleaning up data from sources with inconsistent spacing.
Args:
- context: Current execution context with dataframe
- params: Trim configuration (optional column list)
Returns: New context with trimmed string columns
Example: Trim specific columns:
Trim all string columns:
```yaml
trim_whitespace: {}
```
Configuration for trimming whitespace from string columns.
Example - All string columns:
Example - Specific columns:
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| columns | Optional[List[str]] | No | - | Columns to trim (default: all string columns detected at runtime) |
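A sketch of the specific-columns case (column names are illustrative):

```yaml
trim_whitespace:
  columns: ["first_name", "last_name"]
```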
📂 Relational Algebra¶
aggregate (AggregateParams)¶
Performs grouping and aggregation via SQL.
Configuration for aggregation.
Example:
aggregate:
group_by: ["department", "region"]
aggregations:
salary: "sum"
employee_id: "count"
age: "avg"
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| group_by | List[str] | Yes | - | Columns to group by |
| aggregations | Dict[str, AggFunc] | Yes | - | Map of column to aggregation function (sum, avg, min, max, count) |
join (JoinParams)¶
Joins the current dataset with another dataset from the context.
Configuration for joining datasets.
Scenario 1: Simple Left Join
Scenario 2: Join with Prefix (avoid collisions)
join:
right_dataset: "orders"
on: ["user_id"]
how: "inner"
prefix: "ord" # Result cols: ord_date, ord_amount...
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| right_dataset | str | Yes | - | Name of the node/dataset to join with |
| on | str | List[str] | Yes | - | Column(s) to join on |
| how | Literal['inner', 'left', 'right', 'full', 'cross', 'anti', 'semi'] | No | left |
Join type |
| prefix | Optional[str] | No | - | Prefix for columns from right dataset to avoid collisions |
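A sketch of the simple left-join case (dataset and column names are illustrative):

```yaml
join:
  right_dataset: "customers"
  on: ["customer_id"]
  how: "left"
```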
pivot (PivotParams)¶
Pivots row values into columns.
Configuration for pivoting data.
Example:
Example (Optimized for Spark):
pivot:
group_by: ["id"]
pivot_col: "category"
values: ["A", "B", "C"] # Explicit values avoid extra pass
agg_col: "amount"
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| group_by | List[str] | Yes | - | - |
| pivot_col | str | Yes | - | - |
| agg_col | str | Yes | - | - |
| agg_func | Literal['sum', 'count', 'avg', 'max', 'min', 'first'] | No | sum |
- |
| values | Optional[List[str]] | No | - | Specific values to pivot (for Spark optimization) |
union (UnionParams)¶
Unions current dataset with others.
Configuration for unioning datasets.
Example (By Name - Default):
Example (By Position):
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| datasets | List[str] | Yes | - | List of node names to union with current |
| by_name | bool | No | True |
Match columns by name (UNION ALL BY NAME) |
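Illustrative sketches for both modes (node names are assumptions):

```yaml
union:
  datasets: ["sales_2023", "sales_2024"]
  by_name: true
```

```yaml
union:
  datasets: ["sales_2023", "sales_2024"]
  by_name: false  # match columns by position
```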
unpivot (UnpivotParams)¶
Unpivots columns into rows (Melt/Stack).
Configuration for unpivoting (melting) data.
Example:
unpivot:
id_cols: ["product_id"]
value_vars: ["jan_sales", "feb_sales", "mar_sales"]
var_name: "month"
value_name: "sales"
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| id_cols | List[str] | Yes | - | - |
| value_vars | List[str] | Yes | - | - |
| var_name | str | No | variable |
- |
| value_name | str | No | value |
- |
📂 Data Quality¶
cross_check (CrossCheckParams)¶
Perform cross-node validation checks.
Does not return a DataFrame (returns None). Raises ValidationError on failure.
Configuration for cross-node validation checks.
Example (Row Count Mismatch):
transformer: "cross_check"
params:
type: "row_count_diff"
inputs: ["node_a", "node_b"]
threshold: 0.05 # Allow 5% difference
Example (Schema Match):
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| type | str | Yes | - | Check type: 'row_count_diff', 'schema_match' |
| inputs | List[str] | Yes | - | List of node names to compare |
| threshold | float | No | 0.0 |
Threshold for diff (0.0-1.0) |
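A sketch of the schema-match case (node names are illustrative):

```yaml
transformer: "cross_check"
params:
  type: "schema_match"
  inputs: ["staging_orders", "silver_orders"]
```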
📂 Warehousing Patterns¶
AuditColumnsConfig¶
Configuration for automatic audit timestamp columns during merge operations.
Attributes:
- created_col: Column to set only on first insert (e.g., "created_at")
- updated_col: Column to update on every merge operation (e.g., "updated_at")
At least one of created_col or updated_col must be specified.
Example:
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| created_col | Optional[str] | No | - | Column to set only on first insert |
| updated_col | Optional[str] | No | - | Column to update on every merge |
merge (MergeParams)¶
Merge transformer implementation. Handles Upsert, Append-Only, and Delete-Match strategies.
Args:
- context: EngineContext (preferred) or legacy PandasContext/SparkContext
- params: MergeParams object (when called via function step) or DataFrame (legacy)
- current: DataFrame (legacy positional arg, deprecated)
- **kwargs: Parameters when not using MergeParams
Configuration for Merge transformer (Upsert/Append).
⚖️ "GDPR & Compliance" Guide¶
Business Problem: "A user exercised their 'Right to be Forgotten'. We need to remove them from our Silver tables immediately."
The Solution:
Use the delete_match strategy. The source dataframe contains the IDs to be deleted, and the transformer removes them from the target.
Recipe 1: Right to be Forgotten (Delete)
transformer: "merge"
params:
target: "silver.customers"
keys: ["customer_id"]
strategy: "delete_match"
Recipe 2: Conditional Update (SCD Type 1) "Only update if the source record is newer than the target record."
transformer: "merge"
params:
target: "silver.products"
keys: ["product_id"]
strategy: "upsert"
update_condition: "source.updated_at > target.updated_at"
Recipe 3: Safe Insert (Filter Bad Records) "Only insert records that are not marked as deleted."
transformer: "merge"
params:
target: "silver.orders"
keys: ["order_id"]
strategy: "append_only"
insert_condition: "source.is_deleted = false"
Recipe 4: Audit Columns "Track when records were created or updated."
transformer: "merge"
params:
target: "silver.users"
keys: ["user_id"]
audit_cols:
created_col: "dw_created_at"
updated_col: "dw_updated_at"
Recipe 5: Full Sync (Insert + Update + Delete) "Sync target with source: insert new, update changed, and remove soft-deleted."
transformer: "merge"
params:
target: "silver.customers"
keys: ["id"]
strategy: "upsert"
# 1. Delete if source says so
delete_condition: "source.is_deleted = true"
# 2. Update if changed (and not deleted)
update_condition: "source.hash != target.hash"
# 3. Insert new (and not deleted)
insert_condition: "source.is_deleted = false"
Recipe 6: Connection-based Path Resolution (ADLS) "Use a connection to resolve paths, just like write config."
transform:
steps:
- function: merge
params:
connection: goat_prod
path: OEE/silver/customers
register_table: silver.customers
keys: ["customer_id"]
strategy: "upsert"
audit_cols:
created_col: "_created_at"
updated_col: "_updated_at"
Strategies:
- upsert (default): Update existing records, insert new ones.
- append_only: Ignore duplicates, only insert new keys.
- delete_match: Delete records in target that match keys in source.
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| target | Optional[str] | No | - | Target table name or full path (use this OR connection+path) |
| connection | Optional[str] | No | - | Connection name to resolve path (use with 'path' param) |
| path | Optional[str] | No | - | Relative path within connection (e.g., 'OEE/silver/customers') |
| register_table | Optional[str] | No | - | Register as Unity Catalog/metastore table after merge (e.g., 'silver.customers') |
| keys | List[str] | Yes | - | List of join keys |
| strategy | MergeStrategy | No | MergeStrategy.UPSERT |
Merge behavior: 'upsert', 'append_only', 'delete_match' |
| audit_cols | Optional[AuditColumnsConfig] | No | - | {'created_col': '...', 'updated_col': '...'} |
| optimize_write | bool | No | False |
Run OPTIMIZE after write (Spark) |
| vacuum_hours | Optional[int] | No | - | Hours to retain for VACUUM after merge (Spark only). Set to 168 for 7 days. None disables VACUUM. |
| zorder_by | Optional[List[str]] | No | - | Columns to Z-Order by |
| cluster_by | Optional[List[str]] | No | - | Columns to Liquid Cluster by (Delta) |
| update_condition | Optional[str] | No | - | SQL condition for update clause (e.g. 'source.ver > target.ver') |
| insert_condition | Optional[str] | No | - | SQL condition for insert clause (e.g. 'source.status != "deleted"') |
| delete_condition | Optional[str] | No | - | SQL condition for delete clause (e.g. 'source.status = "deleted"') |
| table_properties | Optional[dict] | No | - | Delta table properties for initial table creation (e.g., column mapping) |
scd2 (SCD2Params)¶
Implements SCD Type 2 Logic.
SCD2 is self-contained: it writes directly to the target table on all engines and code paths. No separate write: block is needed.
On Spark with use_delta_merge=True (default), uses an optimized Delta MERGE that only touches changed rows. The legacy path reads the full target, computes the union, and overwrites. Both write directly.
On Pandas, writes directly to the target file (parquet or CSV).
Parameters for SCD Type 2 (Slowly Changing Dimensions) transformer.
🕰️ The "Time Machine" Pattern¶
Business Problem: "I need to know what the customer's address was last month, not just where they live now."
The Solution: SCD Type 2 tracks the full history of changes. Each record has an "effective window" (start/end dates) and a flag indicating if it is the current version.
Recipe 1: Using table name
transformer: "scd2"
params:
target: "silver.dim_customers" # Registered table name
keys: ["customer_id"]
track_cols: ["address", "tier"]
effective_time_col: "txn_date"
Recipe 2: Using connection + path (ADLS)
transformer: "scd2"
params:
connection: adls_prod # Connection name
path: OEE/silver/dim_customers # Relative path
keys: ["customer_id"]
track_cols: ["address", "tier"]
effective_time_col: "txn_date"
How it works:
1. Match: Finds existing records using keys.
2. Compare: Checks track_cols to see if data changed.
3. Close: If changed, updates the old record's end_time_col to the new effective_time_col.
4. Insert: Adds a new record with start_time_col (renamed from effective_time_col)
as the version start, open-ended end_time_col, and is_current = true.
The effective_time_col value is copied into a new start_time_col column
(default: valid_from) in the target, giving each version a complete time window:
[valid_from, valid_to). The original source column is preserved.
Note: SCD2 is self-contained — it writes directly to the target table on all
engines. No separate write: block is needed in your pipeline YAML.
On Spark with Delta targets, uses an optimized Delta MERGE by default
(use_delta_merge: true). Set use_delta_merge: false to use the legacy
full-overwrite approach (still self-contained, just slower for large tables).
On Pandas, writes directly to the target file (parquet or CSV).
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| target | Optional[str] | No | - | Target table name or full path (use this OR connection+path) |
| connection | Optional[str] | No | - | Connection name to resolve path (use with 'path' param) |
| path | Optional[str] | No | - | Relative path within connection (e.g., 'OEE/silver/dim_customers') |
| keys | List[str] | Yes | - | Natural keys to identify unique entities |
| track_cols | List[str] | Yes | - | Columns to monitor for changes |
| effective_time_col | str | Yes | - | Source column indicating when the change occurred. |
| start_time_col | str | No | valid_from |
Name of the start timestamp column in the target. The effective_time_col value is copied to this column. |
| end_time_col | str | No | valid_to |
Name of the end timestamp column |
| current_flag_col | str | No | is_current |
Name of the current record flag column |
| delete_col | Optional[str] | No | - | Column indicating soft deletion (boolean) |
| use_delta_merge | bool | No | True |
Use Delta Lake MERGE for Spark engine (faster for large tables). Falls back to full overwrite if target is not Delta format. |
| register_table | Optional[str] | No | - | Register as Unity Catalog/metastore table after write (e.g., 'silver.dim_customers'). Spark only. |
| vacuum_hours | Optional[int] | No | - | Hours to retain for VACUUM after SCD2 write (Spark only). Set to 168 for 7 days. None disables VACUUM. |
📂 Manufacturing & IoT¶
PhaseConfig¶
Configuration for a single phase.
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| timer_col | str | Yes | - | Timer column name for this phase |
| start_threshold | Optional[int] | No | - | Override default start threshold for this phase (seconds) |
detect_sequential_phases (DetectSequentialPhasesParams)¶
Detect and analyze sequential manufacturing phases.
For each group (e.g., batch), this transformer:
1. Processes phases sequentially (each starts after the previous ends)
2. Detects phase start by finding the first valid timer reading and back-calculating
3. Detects phase end by finding the first repeated (plateaued) timer value
4. Calculates time spent in each status during each phase
5. Aggregates specified metrics within each phase window
6. Outputs one summary row per group
Output columns per phase:
- {phase}_start: Phase start timestamp
- {phase}_end: Phase end timestamp
- {phase}_max_minutes: Maximum timer value converted to minutes
- {phase}_{status}_minutes: Time in each status (if status_col provided)
- {phase}_{metric}: Aggregated metrics (if phase_metrics provided)
Detect and analyze sequential manufacturing phases from timer columns.
This transformer processes raw sensor/PLC data where timer columns increment during each phase. It detects phase boundaries, calculates durations, and tracks time spent in each equipment status.
Common use cases:
- Batch reactor cycle analysis
- CIP (Clean-in-Place) phase timing
- Food processing (cook, cool, package cycles)
- Any multi-step batch process with PLC timers
Scenario: Analyze FBR cycle times
detect_sequential_phases:
group_by: BatchID
timestamp_col: ts
phases:
- timer_col: LoadTime
- timer_col: AcidTime
- timer_col: DryTime
- timer_col: CookTime
- timer_col: CoolTime
- timer_col: UnloadTime
start_threshold: 240
status_col: Status
status_mapping:
1: idle
2: active
3: hold
4: faulted
phase_metrics:
Level: max
metadata:
ProductCode: first_after_start
Weight: max
Scenario: Group by multiple columns
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| group_by | str | List[str] | Yes | - | Column(s) to group by. Can be a single column name or list of columns. E.g., 'BatchID' or ['BatchID', 'AssetID'] |
| timestamp_col | str | No | ts |
Timestamp column for ordering events |
| phases | List[str | PhaseConfig] | Yes | - | List of phase timer columns (strings) or PhaseConfig objects. Phases are processed sequentially - each phase starts after the previous ends. |
| start_threshold | int | No | 240 |
Default max timer value (seconds) to consider as valid phase start. Filters out late readings where timer already shows large elapsed time. |
| status_col | Optional[str] | No | - | Column containing equipment status codes |
| status_mapping | Optional[Dict[int, str]] | No | - | Mapping of status codes to names, e.g. {1: 'idle', 2: 'active'} |
| phase_metrics | Optional[Dict[str, str]] | No | - | Columns to aggregate within each phase window. E.g., {Level: max, Pressure: max}. Outputs {Phase}_{Column} columns. |
| metadata | Optional[Dict[str, str]] | No | - | Columns to include in output with aggregation method. Options: 'first', 'last', 'first_after_start', 'max', 'min', 'mean', 'sum'. E.g., {ProductCode: first_after_start, Weight: max} |
| output_time_format | str | No | %Y-%m-%d %H:%M:%S |
Format for output timestamp columns |
| fill_null_minutes | bool | No | False |
If True, fill null numeric columns (_max_minutes, _status_minutes, _metrics) with 0. Timestamp columns remain null for skipped phases. |
| spark_native | bool | No | False |
If True, use native Spark window functions. If False (default), use applyInPandas which is often faster for datasets with many batches. |
📂 Advanced & Feature Engineering¶
ShiftDefinition¶
Definition of a single shift.
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| name | str | Yes | - | Name of the shift (e.g., 'Day', 'Night') |
| start | str | Yes | - | Start time in HH:MM format (e.g., '06:00') |
| end | str | Yes | - | End time in HH:MM format (e.g., '14:00') |
deduplicate (DeduplicateParams)¶
Deduplicates data using Window functions.
Configuration for deduplication.
Scenario: Keep latest record
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| keys | List[str] | Yes | - | List of columns to partition by (columns that define uniqueness) |
| order_by | Optional[str] | No | - | SQL Order by clause (e.g. 'updated_at DESC') to determine which record to keep (first one is kept) |
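A sketch of the keep-latest-record scenario (column names are illustrative):

```yaml
deduplicate:
  keys: ["customer_id"]
  order_by: "updated_at DESC"
```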
dict_based_mapping (DictMappingParams)¶
Maps values in a column using a provided dictionary.
For each value in the specified column, replaces it with the mapped value. If 'default' is provided, uses it for values not found in the mapping. Supports Spark and Pandas engines.
Configuration for dictionary mapping.
Scenario: Map status codes to labels
dict_based_mapping:
column: "status_code"
mapping:
"1": "Active"
"0": "Inactive"
default: "Unknown"
output_column: "status_desc"
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| column | str | Yes | - | Column to map values from |
| mapping | Dict[str, str | int | float | bool] | Yes | - | Dictionary of source value -> target value |
| default | str | int | float | bool | No | - | Default value if source value is not found in mapping |
| output_column | Optional[str] | No | - | Name of output column. If not provided, overwrites source column. |
explode_list_column (ExplodeParams)¶
Explodes a list/array column into multiple rows.
For each element in the specified list column, creates a new row. If 'outer' is True, keeps rows with empty lists (like explode_outer). Supports Spark and Pandas engines.
Configuration for exploding lists.
Scenario: Flatten list of items per order
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| column | str | Yes | - | Column containing the list/array to explode |
| outer | bool | No | False |
If True, keep rows with empty lists (explode_outer behavior). If False, drops them. |
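A sketch flattening a list of items per order (column name is illustrative):

```yaml
explode_list_column:
  column: "items"
  outer: true  # keep rows whose list is empty
```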
generate_numeric_key (NumericKeyParams)¶
Generates a deterministic BIGINT surrogate key from a hash of columns.
This is useful when:
- Unioning data from multiple sources
- Some sources have IDs, some don't
- You need stable numeric IDs for the gold layer
The key is generated by:
1. Concatenating columns with the separator
2. Computing an MD5 hash
3. Converting the first 15 hex characters to a BIGINT
If coalesce_with is specified, keeps the existing value when not null. If output_col == coalesce_with, the original column is replaced.
Configuration for numeric surrogate key generation.
Generates a deterministic BIGINT key from a hash of specified columns. Useful when unioning data from multiple sources where some have IDs and others don't.
Example:
- function: generate_numeric_key
params:
columns: [DateID, store_id, reason_id, duration_min, notes]
output_col: ID
coalesce_with: ID # Keep existing ID if not null
The generated key is:
- Deterministic: same input data = same ID every time
- BIGINT: large numeric space to avoid collisions
- Stable: safe for gold layer / incremental loads
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| columns | List[str] | Yes | - | Columns to combine for the key |
| separator | str | No | \| |
Separator between values |
| output_col | str | No | numeric_key |
Name of the output column |
| coalesce_with | Optional[str] | No | - | Existing column to coalesce with (keep existing value if not null) |
generate_surrogate_key (SurrogateKeyParams)¶
Generates a deterministic surrogate key (MD5) from a combination of columns. Handles NULLs by treating them as empty strings to ensure consistency.
Configuration for surrogate key generation.
Example:
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| columns | List[str] | Yes | - | Columns to combine for the key |
| separator | str | No | - |
Separator between values |
| output_col | str | No | surrogate_key |
Name of the output column |
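A minimal sketch (column names are illustrative):

```yaml
generate_surrogate_key:
  columns: ["customer_id", "order_date"]
  output_col: "order_sk"
```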
geocode (GeocodeParams)¶
Geocoding Stub.
Configuration for geocoding.
Example:
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| address_col | str | Yes | - | Column containing the address to geocode |
| output_col | str | No | lat_long |
Name of the output column for coordinates |
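A minimal sketch (column names are illustrative):

```yaml
geocode:
  address_col: "shipping_address"
  output_col: "lat_long"
```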
hash_columns (HashParams)¶
Hashes columns for PII/Anonymization.
Configuration for column hashing.
Example:
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| columns | List[str] | Yes | - | List of columns to hash |
| algorithm | HashAlgorithm | No | HashAlgorithm.SHA256 |
Hashing algorithm. Options: 'sha256', 'md5' |
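A minimal sketch hashing PII columns (column names are illustrative):

```yaml
hash_columns:
  columns: ["email", "phone"]
  algorithm: "sha256"
```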
normalize_json (NormalizeJsonParams)¶
Flattens a nested JSON/Struct column.
Configuration for JSON normalization.
Example:
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| column | str | Yes | - | Column containing nested JSON/Struct |
| sep | str | No | _ |
Separator for nested fields (e.g., 'parent_child') |
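A minimal sketch (column name is illustrative):

```yaml
normalize_json:
  column: "payload"
  sep: "_"
```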
parse_json (ParseJsonParams)¶
Parses a JSON string column into a Struct/Map column.
Configuration for JSON parsing.
Example:
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| column | str | Yes | - | String column containing JSON |
| json_schema | str | Yes | - | DDL schema string (e.g. 'a INT, b STRING') or Spark StructType DDL |
| output_col | Optional[str] | No | - | - |
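A minimal sketch (column names and schema are illustrative):

```yaml
parse_json:
  column: "raw_event"
  json_schema: "id INT, name STRING"
  output_col: "event"
```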
regex_replace (RegexReplaceParams)¶
Applies a regex replacement to a column.
Uses SQL-based REGEXP_REPLACE to replace all matches of the pattern in the specified column with the given replacement string. Works on both Spark and DuckDB/Pandas engines.
Configuration for regex replacement.
Example:
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| column | str | Yes | - | Column to apply regex replacement on |
| pattern | str | Yes | - | Regex pattern to match |
| replacement | str | Yes | - | String to replace matches with |
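A sketch stripping non-digit characters from a phone column (column name is illustrative):

```yaml
regex_replace:
  column: "phone"
  pattern: "[^0-9]"
  replacement: ""
```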
sessionize (SessionizeParams)¶
Assigns session IDs based on inactivity threshold.
Configuration for sessionization.
Example:
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| timestamp_col | str | Yes | - | Timestamp column to calculate session duration from |
| user_col | str | Yes | - | User identifier to partition sessions by |
| threshold_seconds | int | No | 1800 |
Inactivity threshold in seconds (default: 30 minutes). If gap > threshold, new session starts. |
| session_col | str | No | session_id |
Output column name for the generated session ID |
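A minimal sketch using the default 30-minute inactivity window (column names are illustrative):

```yaml
sessionize:
  timestamp_col: "event_time"
  user_col: "user_id"
  threshold_seconds: 1800
  session_col: "session_id"
```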
split_events_by_period (SplitEventsByPeriodParams)¶
Splits events that span multiple time periods into individual segments.
For events spanning multiple days/hours/shifts, this creates separate rows for each period with adjusted start/end times and recalculated durations.
Configuration for splitting events that span multiple time periods.
Splits events that span multiple days, hours, or shifts into individual segments per period. Useful for OEE/downtime analysis, billing, and time-based aggregations.
Example - Split by day:
split_events_by_period:
start_col: "Shutdown_Start_Time"
end_col: "Shutdown_End_Time"
period: "day"
duration_col: "Shutdown_Duration_Min"
Example - Split by shift:
split_events_by_period:
start_col: "event_start"
end_col: "event_end"
period: "shift"
duration_col: "duration_minutes"
shifts:
- name: "Day"
start: "06:00"
end: "14:00"
- name: "Swing"
start: "14:00"
end: "22:00"
- name: "Night"
start: "22:00"
end: "06:00"
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| start_col | str | Yes | - | Column containing the event start timestamp |
| end_col | str | Yes | - | Column containing the event end timestamp |
| period | str | No | day |
Period type to split by: 'day', 'hour', or 'shift' |
| duration_col | Optional[str] | No | - | Output column name for duration in minutes. If not set, no duration column is added. |
| shifts | Optional[List[ShiftDefinition]] | No | - | List of shift definitions (required when period='shift') |
| shift_col | Optional[str] | No | shift_name |
Output column name for shift name (only used when period='shift') |
unpack_struct (UnpackStructParams)¶
Flattens a struct/dict column into top-level columns.
Configuration for unpacking structs.
Example:
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| column | str | Yes | - | Struct/Dictionary column to unpack/flatten into individual columns |
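A minimal sketch (column name is illustrative):

```yaml
unpack_struct:
  column: "address"
```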
validate_and_flag (ValidateAndFlagParams)¶
Validates rules and appends a column with a list/string of failed rule names.
Configuration for validation flagging.
Example:
validate_and_flag:
flag_col: "data_issues"
rules:
age_check: "age >= 0"
email_format: "email LIKE '%@%'"
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| rules | Dict[str, str] | Yes | - | Map of rule name to SQL condition (must be TRUE) |
| flag_col | str | No | _issues |
Name of the column to store failed rules |
window_calculation (WindowCalculationParams)¶
Generic wrapper for Window functions.
Configuration for window functions.
Example:
window_calculation:
target_col: "cumulative_sales"
function: "sum(sales)"
partition_by: ["region"]
order_by: "date ASC"
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| target_col | str | Yes | - | - |
| function | str | Yes | - | Window function e.g. 'sum(amount)', 'rank()' |
| partition_by | List[str] | No | PydanticUndefined |
- |
| order_by | Optional[str] | No | - | - |
📂 Other Transformers¶
ConversionSpec¶
Specification for a single unit conversion.
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| from_unit | str | Yes | - | Source unit (e.g., 'psig', 'degF', 'BTU/lb') |
| to | str | Yes | - | Target unit (e.g., 'bar', 'degC', 'kJ/kg') |
| output | Optional[str] | No | - | Output column name. If not specified, overwrites the source column. |
PropertyOutputConfig¶
Configuration for a single output property.
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| property | str | Yes | - | CoolProp property key: H (enthalpy), S (entropy), D (density), C (specific heat Cp), CVMASS (Cv), V (viscosity), L (conductivity), T (temperature), P (pressure), Q (quality) |
| unit | Optional[str] | No | - | Output unit for this property. If not specified, uses SI units. |
| output_column | Optional[str] | No | - | Custom output column name. Defaults to {prefix}_{property}. |
PsychrometricOutputConfig¶
Configuration for psychrometric output property.
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| property | str | Yes | - | Property: W (humidity ratio), B (wet bulb), D (dew point), H (enthalpy), V (specific volume), R (relative humidity) |
| unit | Optional[str] | No | - | Output unit |
| output_column | Optional[str] | No | - | Custom output column name |
fluid_properties (FluidPropertiesParams)¶
Calculate thermodynamic properties for any fluid using CoolProp. Supports 122+ fluids including Water, Ammonia, R134a, Air, and CO2; see the CoolProp documentation for the full list: http://www.coolprop.org/fluid_properties/PurePseudoPure.html. Uses the IAPWS-IF97 formulation for water/steam calculations.
Engine parity: Pandas, Spark (via Pandas UDF), Polars
State is defined by two independent properties. Common combinations:
- P + T (pressure + temperature): subcooled/superheated regions
- P + Q (pressure + quality): two-phase region
- P + H (pressure + enthalpy): when enthalpy is known
Scenario: Calculate steam properties from pressure and temperature
fluid_properties:
  fluid: Water
  pressure_col: steam_pressure
  temperature_col: steam_temp
  pressure_unit: psig
  temperature_unit: degF
  gauge_offset: 14.696
  outputs:
    - property: H
      unit: BTU/lb
      output_column: steam_enthalpy
    - property: S
      unit: BTU/(lb·R)
      output_column: steam_entropy
Scenario: Calculate saturated steam properties from pressure only
fluid_properties:
  fluid: Water
  pressure_col: steam_pressure
  quality: 1.0  # Saturated vapor
  pressure_unit: psia
  outputs:
    - property: H
      unit: BTU/lb
    - property: T
      unit: degF
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| fluid | str | No | Water | CoolProp fluid name (e.g., Water, Ammonia, R134a, Air, CO2) |
| pressure_col | Optional[str] | No | - | Column containing pressure values |
| temperature_col | Optional[str] | No | - | Column containing temperature values |
| enthalpy_col | Optional[str] | No | - | Column containing enthalpy values |
| quality_col | Optional[str] | No | - | Column containing quality values (0-1) |
| pressure | Optional[float] | No | - | Fixed pressure value |
| temperature | Optional[float] | No | - | Fixed temperature value |
| enthalpy | Optional[float] | No | - | Fixed enthalpy value (in input unit) |
| quality | Optional[float] | No | - | Fixed quality value (0=sat liquid, 1=sat vapor) |
| pressure_unit | str | No | Pa | Pressure unit: Pa, kPa, MPa, bar, psia, psig, atm |
| temperature_unit | str | No | K | Temperature unit: K, degC, degF, degR |
| enthalpy_unit | str | No | J/kg | Input enthalpy unit: J/kg, kJ/kg, BTU/lb |
| gauge_offset | float | No | 14.696 | Atmospheric pressure offset for psig (default 14.696 psia) |
| outputs | List[PropertyOutputConfig] | No | - | List of properties to calculate with their units |
| prefix | str | No | - | Prefix for output column names (e.g., 'steam' -> 'steam_H') |
psychrometrics (PsychrometricsParams)¶
Calculate psychrometric (humid air) properties using CoolProp HAPropsSI.
Engine parity: Pandas, Spark (via Pandas UDF), Polars
Calculates moist air properties from dry bulb temperature and one other property (typically relative humidity or humidity ratio).
CoolProp property keys for humid air:
- W: Humidity ratio (kg water / kg dry air)
- B: Wet bulb temperature
- D: Dew point temperature
- H: Enthalpy (per kg dry air)
- V: Specific volume (per kg dry air)
- R: Relative humidity (0-1)
- T: Dry bulb temperature
- P: Total pressure
Scenario: Calculate humidity ratio from T and RH
psychrometrics:
  dry_bulb_col: ambient_temp
  relative_humidity_col: rh_percent
  temperature_unit: degF
  rh_is_percent: true
  pressure_unit: psia
  elevation_ft: 875
  outputs:
    - property: W
      unit: lb/lb
      output_column: humidity_ratio
    - property: B
      unit: degF
      output_column: wet_bulb
    - property: D
      unit: degF
      output_column: dew_point
Scenario: Calculate RH from temperature and humidity ratio
psychrometrics:
  dry_bulb_col: temp_f
  humidity_ratio_col: w
  temperature_unit: degF
  pressure: 14.696
  pressure_unit: psia
  outputs:
    - property: R
      output_column: relative_humidity
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| dry_bulb_col | str | Yes | - | Column containing dry bulb temperature |
| relative_humidity_col | Optional[str] | No | - | Column containing relative humidity |
| humidity_ratio_col | Optional[str] | No | - | Column containing humidity ratio (kg/kg or lb/lb) |
| wet_bulb_col | Optional[str] | No | - | Column containing wet bulb temperature |
| dew_point_col | Optional[str] | No | - | Column containing dew point temperature |
| relative_humidity | Optional[float] | No | - | Fixed relative humidity |
| humidity_ratio | Optional[float] | No | - | Fixed humidity ratio |
| pressure_col | Optional[str] | No | - | Column containing pressure |
| pressure | Optional[float] | No | - | Fixed pressure value |
| elevation_ft | Optional[float] | No | - | Elevation in feet (used to estimate pressure if not provided) |
| elevation_m | Optional[float] | No | - | Elevation in meters (used to estimate pressure if not provided) |
| temperature_unit | str | No | K | Temperature unit for all temperature columns |
| pressure_unit | str | No | Pa | Pressure unit |
| humidity_ratio_unit | str | No | kg/kg | Humidity ratio unit (kg/kg, lb/lb, g/kg) |
| rh_is_percent | bool | No | False | If True, RH input is 0-100%; otherwise 0-1 |
| outputs | List[PsychrometricOutputConfig] | No | - | Properties to calculate |
| prefix | str | No | - | Prefix for output columns |
saturation_properties (SaturationPropertiesParams)¶
Calculate saturated liquid or vapor properties.
Convenience wrapper that automatically sets Q=0 (saturated liquid) or Q=1 (saturated vapor). This is a simplified interface for common saturation calculations.
Scenario: Get saturated steam properties at a given pressure
saturation_properties:
  fluid: Water
  pressure_col: steam_pressure
  pressure_unit: psig
  phase: vapor
  outputs:
    - property: H
      unit: BTU/lb
      output_column: hg
    - property: T
      unit: degF
      output_column: sat_temp
Scenario: Get saturated liquid enthalpy (hf)
saturation_properties:
  pressure_col: pressure_psia
  pressure_unit: psia
  phase: liquid
  outputs:
    - property: H
      unit: BTU/lb
      output_column: hf
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| fluid | str | No | Water | CoolProp fluid name |
| pressure_col | Optional[str] | No | - | Column containing pressure values |
| pressure | Optional[float] | No | - | Fixed pressure value |
| temperature_col | Optional[str] | No | - | Column containing saturation temp |
| temperature | Optional[float] | No | - | Fixed saturation temperature |
| pressure_unit | str | No | Pa | Pressure unit |
| temperature_unit | str | No | K | Temperature unit |
| gauge_offset | float | No | 14.696 | Gauge pressure offset for psig |
| phase | Literal['liquid', 'vapor'] | No | vapor | Phase: 'liquid' (Q=0) or 'vapor' (Q=1) |
| outputs | List[PropertyOutputConfig] | No | - | Properties to calculate |
| prefix | str | No | - | Prefix for output columns |
unit_convert (UnitConvertParams)¶
Convert columns from one unit to another.
Uses Pint's comprehensive unit database for conversion. Handles all SI units, imperial units, and common engineering units out of the box.
Engine parity: Pandas, Spark, Polars
Scenario: Normalize sensor data to SI units
unit_convert:
  conversions:
    pressure_psig:
      from: psig
      to: bar
      output: pressure_bar
    temperature_f:
      from: degF
      to: degC
      output: temperature_c
    flow_gpm:
      from: gpm
      to: m³/s
      output: flow_si
Scenario: Convert in-place (overwrite original columns)
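An in-place conversion can be sketched as follows; per the ConversionSpec table, omitting output overwrites the source column (column name is illustrative):

```yaml
unit_convert:
  conversions:
    temperature_f:
      from: degF
      to: degC
      # no output key: the temperature_f column is overwritten in place
```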
Scenario: Handle gauge pressure with custom atmospheric reference
unit_convert:
  gauge_pressure_offset: 14.696 psia
  conversions:
    steam_pressure:
      from: psig
      to: psia
      output: steam_pressure_abs
Scenario: Complex engineering units
unit_convert:
  conversions:
    heat_transfer_coeff:
      from: BTU/(hr * ft² * degF)
      to: W/(m² * K)
      output: htc_si
    thermal_conductivity:
      from: BTU/(hr * ft * degF)
      to: W/(m * K)
      output: k_si
    specific_heat:
      from: BTU/(lb * degF)
      to: kJ/(kg * K)
      output: cp_si
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| conversions | Dict[str, ConversionSpec] | Yes | - | Mapping of source column names to conversion specifications |
| gauge_pressure_offset | Optional[str] | No | 14.696 psia | Atmospheric pressure for gauge-to-absolute conversions. Default is sea level (14.696 psia). |
| errors | str | No | null | How to handle conversion errors: 'null' (default), 'raise', or 'ignore' |
Simulation Engine¶
Generate realistic time-series data for testing pipelines, dashboards, and analytics.
Simulation is configured as format: simulation in a node's read section.
Core building blocks:
- SimulationScope: Time boundaries, timestep, row count
- EntityConfig: Who generates data (sensors, machines, pumps)
- ColumnGeneratorConfig: What data each entity produces
- ScheduledEvent: Maintenance windows, setpoint changes, condition-based triggers
- ChaosConfig: Outliers, duplicates, downtime for realistic imperfections
Duration/interval format (used by timestep, recurrence, duration, jitter, cooldown, sustain):
number + unit, where unit is s (seconds), m (minutes), h (hours), or d (days).
Examples: 5m, 1h, 30s, 2d.
Example:
read:
  connection: null
  format: simulation
  options:
    simulation:
      scope:
        start_time: "2026-01-01T00:00:00Z"
        timestep: "5m"
        row_count: 288
        seed: 42
      entities:
        count: 3
        id_prefix: pump_
      columns:
        - name: temperature
          data_type: float
          generator:
            type: random_walk
            start: 75.0
            min: 60.0
            max: 100.0
            volatility: 1.0
            mean_reversion: 0.1
See Also: Simulation Docs, Generator Reference
SimulationConfig¶
Complete simulation configuration.
Example:
read:
  connection: null
  format: simulation
  options:
    simulation:
      scope:
        start_time: "2026-01-01T00:00:00Z"
        timestep: "5m"
        row_count: 10000
        seed: 42
      entities:
        count: 10
        id_prefix: pump_
      columns:
        - name: entity_id
          data_type: string
          generator:
            type: constant
            value: "{entity_id}"
        - name: timestamp
          data_type: timestamp
          generator:
            type: timestamp
        - name: temperature
          data_type: float
          generator:
            type: range
            min: 60.0
            max: 80.0
            distribution: normal
          null_rate: 0.02
      chaos:
        outlier_rate: 0.01
        outlier_factor: 3.0
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| scope | SimulationScope | Yes | - | Simulation scope and boundaries |
| entities | EntityConfig | Yes | - | Entity configuration |
| columns | List[ColumnGeneratorConfig] | Yes | - | Column definitions |
| chaos | Optional[ChaosConfig] | No | - | Chaos parameters |
| scheduled_events | List[ScheduledEvent] | No | - | Scheduled events that modify simulation behavior |
SimulationScope¶
Used in: SimulationConfig
Simulation scope and boundaries.
Example (row count):
Example (time range):
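The two mutually exclusive scope modes above can be sketched as follows (timestamps and counts are illustrative):

```yaml
# Row count: stop after a fixed number of rows
scope:
  start_time: "2026-01-01T00:00:00Z"
  timestep: "5m"
  row_count: 288

# Time range: generate rows until end_time (mutually exclusive with row_count)
scope:
  start_time: "2026-01-01T00:00:00Z"
  end_time: "2026-01-02T00:00:00Z"
  timestep: "1h"
```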
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| start_time | str | Yes | - | Simulation start timestamp in ISO8601 Zulu format (e.g., '2026-01-01T00:00:00Z') |
| timestep | str | Yes | - | Time between rows. Format: number followed by unit — s (seconds), m (minutes), h (hours), or d (days). Examples: '30s', '5m', '1h', '2d'. |
| row_count | Optional[int] | No | - | Total rows to generate (mutually exclusive with end_time) |
| end_time | Optional[str] | No | - | Simulation end timestamp in ISO8601 Zulu format (e.g., '2026-01-02T00:00:00Z'). Mutually exclusive with row_count |
| seed | int | No | 42 | Random seed for deterministic generation |
EntityConfig¶
Used in: SimulationConfig
Entity declaration for simulation.
Example (auto-generated):
Example (explicit names):
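The two entity declaration styles above can be sketched as follows (names and counts are illustrative):

```yaml
# Auto-generated: sequential IDs with the given prefix
entities:
  count: 3
  id_prefix: pump_

# Explicit names
entities:
  names: [Turbine_01, Turbine_02, Turbine_03]
```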
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| count | Optional[int] | No | - | Number of entities to generate |
| names | Optional[List[str]] | No | - | Explicit entity names |
| id_prefix | str | No | entity_ | Prefix for auto-generated entity IDs |
| id_format | Literal['sequential', 'uuid'] | No | sequential | ID format: sequential or uuid |
ColumnGeneratorConfig¶
Used in: SimulationConfig
Configuration for a simulated column.
Example:
- name: temperature
  data_type: float
  generator:
    type: range
    min: 60.0
    max: 100.0
    distribution: normal
  null_rate: 0.02
  entity_overrides:
    pump_01:
      type: range
      min: 80.0
      max: 120.0
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| name | str | Yes | - | Column name |
| data_type | SimulationDataType | Yes | - | Data type |
| generator | RangeGeneratorConfig | CategoricalGeneratorConfig | BooleanGeneratorConfig | TimestampGeneratorConfig | SequentialGeneratorConfig | ConstantGeneratorConfig | UUIDGeneratorConfig | EmailGeneratorConfig | IPGeneratorConfig | GeoGeneratorConfig | RandomWalkGeneratorConfig | DailyProfileGeneratorConfig | DerivedGeneratorConfig | Yes | - | Generator configuration |
| null_rate | float | No | 0.0 | Probability of NULL values |
| entity_overrides | Dict[str, RangeGeneratorConfig | CategoricalGeneratorConfig | BooleanGeneratorConfig | TimestampGeneratorConfig | SequentialGeneratorConfig | ConstantGeneratorConfig | UUIDGeneratorConfig | EmailGeneratorConfig | IPGeneratorConfig | GeoGeneratorConfig | RandomWalkGeneratorConfig | DailyProfileGeneratorConfig | DerivedGeneratorConfig] | No | - | Per-entity generator overrides |
ScheduledEvent¶
Used in: SimulationConfig
Scheduled event that modifies simulation behavior at specific times or conditions.
Enables realistic process simulation with:
- Maintenance windows (forced power=0)
- Grid curtailment (forced output reduction)
- Setpoint changes (scheduled process changes)
- Cleaning cycles (efficiency restoration)
- Recurring events (e.g., weekly maintenance)
- Condition-based triggers (e.g., degrade when efficiency drops)
Example (maintenance window):
scheduled_events:
  - type: forced_value
    entity: Turbine_01
    column: power_kw
    value: 0.0
    start_time: "2026-03-11T14:00:00Z"
    end_time: "2026-03-11T18:00:00Z"
Example (grid curtailment - all entities):
scheduled_events:
  - type: forced_value
    entity: null  # Applies to all entities
    column: max_output_pct
    value: 80.0
    start_time: "2026-03-11T16:00:00Z"
    end_time: "2026-03-11T19:00:00Z"
Example (permanent setpoint change):
scheduled_events:
  - type: setpoint_change
    entity: Reactor_01
    column: temp_setpoint_c
    value: 370.0
    start_time: "2026-03-11T12:00:00Z"
    # No end_time = permanent change
Example (recurring maintenance every 30 days):
scheduled_events:
  - type: forced_value
    entity: Turbine_01
    column: power_kw
    value: 0.0
    start_time: "2026-01-15T06:00:00Z"
    recurrence: "30d"
    duration: "4h"
    jitter: "2d"
    max_occurrences: 12
Example (condition-based event):
scheduled_events:
  - type: parameter_override
    entity: Pump_05
    column: flow_rate_lpm
    value: 50.0
    condition: "actual_efficiency_pct < 70"
    cooldown: "7d"
    sustain: "24h"
Example (ramped setpoint change):
scheduled_events:
  - type: setpoint_change
    entity: Reactor_01
    column: temp_setpoint_c
    value: 370.0
    start_time: "2026-03-11T12:00:00Z"
    duration: "2h"
    transition: ramp
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| type | ScheduledEventType | Yes | - | Event type |
| entity | Optional[str] | No | - | Entity name (must match a name from entities.names) or None to apply to all entities |
| column | str | Yes | - | Column name to modify |
| value | Any | Yes | - | Value to apply during event |
| start_time | Optional[str] | No | - | Event start timestamp in ISO8601 Zulu format (e.g., '2026-01-15T06:00:00Z'). Required for time-based events. |
| end_time | Optional[str] | No | - | Event end timestamp in ISO8601 Zulu format (e.g., '2026-01-15T18:00:00Z'). None = permanent change. |
| priority | int | No | 0 | Priority for overlapping events (higher = applied last) |
| recurrence | Optional[str] | No | - | Repeat interval. Event recurs at this interval from start_time. Format: number followed by unit — s (seconds), m (minutes), h (hours), or d (days). Examples: '30d', '7d', '4h'. |
| duration | Optional[str] | No | - | Duration of each occurrence. Alternative to specifying end_time. Format: number followed by unit — s (seconds), m (minutes), h (hours), or d (days). Examples: '4h', '30m', '2d'. |
| jitter | Optional[str] | No | - | Random offset ± applied to each recurrence start (deterministic per seed). Format: number followed by unit — s (seconds), m (minutes), h (hours), or d (days). Examples: '2d', '6h'. |
| max_occurrences | Optional[int] | No | - | Stop repeating after N occurrences. |
| condition | Optional[str] | No | - | Sandboxed Python expression evaluated against current row columns. Supports comparison operators, compound logic (and, or, not), and safe functions (abs, round, min, max). E.g., 'actual_efficiency_pct < 70 and pressure > 50'. Triggers event when true. |
| cooldown | Optional[str] | No | - | Minimum gap between condition triggers. Prevents rapid re-triggering. Format: number followed by unit — s (seconds), m (minutes), h (hours), or d (days). Examples: '7d', '12h'. |
| sustain | Optional[str] | No | - | Condition must be continuously true for this duration before triggering. Prevents spurious triggers from momentary spikes. Format: number followed by unit — s (seconds), m (minutes), h (hours), or d (days). Examples: '24h', '30m'. |
| transition | str | No | instant | How value is applied: 'instant' (default, jump to value) or 'ramp' (linear interpolation over duration). |
ChaosConfig¶
Used in: SimulationConfig
Chaos engineering parameters for realistic data imperfections.
Example:
chaos:
  outlier_rate: 0.01
  outlier_factor: 3.0
  duplicate_rate: 0.005
  downtime_events:
    - entity: pump_01
      start_time: "2026-01-01T10:00:00Z"
      end_time: "2026-01-01T12:00:00Z"
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| outlier_rate | float | No | 0.0 | Probability of outlier values |
| outlier_factor | float | No | 3.0 | Multiplier for outlier values |
| duplicate_rate | float | No | 0.0 | Probability of duplicating rows |
| downtime_events | List[DowntimeEvent] | No | - | Time periods with no data generation |
DowntimeEvent¶
Used in: ChaosConfig
Downtime period where no data is generated.
Example:
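A downtime entry can be sketched as follows, mirroring the ChaosConfig example above (entity and times are illustrative):

```yaml
downtime_events:
  - entity: pump_01  # null (or omitted) = all entities
    start_time: "2026-01-01T10:00:00Z"
    end_time: "2026-01-01T12:00:00Z"
```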
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| entity | Optional[str] | No | - | Entity affected (None = all) |
| start_time | str | Yes | - | Downtime start timestamp (ISO8601) |
| end_time | str | Yes | - | Downtime end timestamp (ISO8601) |
RangeGeneratorConfig¶
Used in: ColumnGeneratorConfig
Range generator for numeric values.
Example:
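A minimal range generator sketch (bounds are illustrative):

```yaml
generator:
  type: range
  min: 60.0
  max: 100.0
  distribution: normal  # optional; default is uniform
```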
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| type | Literal['range'] | Yes | - | - |
| min | float | Yes | - | Minimum value |
| max | float | Yes | - | Maximum value |
| distribution | DistributionType | No | uniform | Distribution type: uniform or normal |
| mean | Optional[float] | No | - | Mean for normal distribution (defaults to midpoint) |
| std_dev | Optional[float] | No | - | Standard deviation for normal (defaults to range/6) |
RandomWalkGeneratorConfig¶
Used in: ColumnGeneratorConfig
Random walk generator for realistic time-series data.
Produces values where each row depends on the previous row's value, creating smooth, realistic process data instead of independent random values.
Uses an Ornstein-Uhlenbeck process with optional trend for realistic simulation of controlled process variables (temperatures, pressures, flow rates).
Example (static setpoint):
type: random_walk
start: 350.0
min: 300.0
max: 400.0
volatility: 0.5
mean_reversion: 0.1
trend: 0.001
precision: 1
shock_rate: 0.02
shock_magnitude: 30.0
shock_bias: 1.0
Example (dynamic setpoint - temperature tracking ambient):
- name: ambient_temp_c
  generator:
    type: random_walk
    start: 25.0
    min: 15.0
    max: 35.0
    volatility: 0.3
    mean_reversion: 0.05
- name: battery_temp_c
  generator:
    type: random_walk
    start: 28.0
    min: 20.0
    max: 40.0
    volatility: 0.4
    mean_reversion: 0.1
    mean_reversion_to: ambient_temp_c  # Drifts toward ambient
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| type | Literal['random_walk'] | Yes | - | - |
| start | float | Yes | - | Initial value / setpoint |
| min | float | Yes | - | Hard lower bound (physical limit) |
| max | float | Yes | - | Hard upper bound (physical limit) |
| volatility | float | No | 1.0 | Standard deviation of step-to-step changes. Controls noise magnitude. |
| mean_reversion | float | No | 0.0 | Strength of pull back toward start value or mean_reversion_to column (0 = pure random walk, 1 = snap back immediately). Simulates PID-like control. |
| mean_reversion_to | Optional[str] | No | - | Column name to use as dynamic setpoint for mean reversion instead of static 'start' value. Enables realistic process simulation where the walk tracks a time-varying reference. Example: PV that drifts toward a changing SP column, or temperature following ambient. If specified, must reference a column defined earlier in dependency order. |
| trend | float | No | 0.0 | Drift per timestep. Positive = gradual increase, negative = gradual decrease. Simulates fouling, degradation, or slow process drift. |
| precision | Optional[int] | No | - | Round values to N decimal places. None = no rounding. |
| shock_rate | float | No | 0.0 | Probability of a sudden shock at each timestep (0.0 = never, 1.0 = every step). Simulates process upsets like valve sticks, feed disruptions, or sensor glitches. The shock perturbs the walk's internal state, so mean_reversion naturally recovers over subsequent steps — just like a real PID-controlled process. |
| shock_magnitude | float | No | 10.0 | Maximum absolute size of a shock event. The actual shock is drawn uniformly from [0, shock_magnitude]. Use values relative to your min/max range — e.g., if range is 300-400, a magnitude of 30 means shocks up to 30% of range. |
| shock_bias | float | No | 0.0 | Directional tendency for shocks. +1.0 = shocks always go UP (e.g., exothermic runaway, pressure buildup). -1.0 = shocks always go DOWN (e.g., pump cavitation, flow drop). 0.0 = shocks go either direction with equal probability. Values between give partial bias (e.g., 0.7 = mostly upward). |
DailyProfileGeneratorConfig¶
Used in: ColumnGeneratorConfig
Daily profile generator for time-of-day patterns.
Produces values that follow a repeating daily curve defined by anchor points. The engine interpolates between anchor points, adds noise, and clamps to [min, max]. Ideal for simulating occupancy, energy demand, traffic, call volume, or any metric with a predictable intraday shape.
Example (building occupancy):
type: daily_profile
min: 0
max: 25
precision: 0
noise: 1.5
interpolation: linear
profile:
  "00:00": 1
  "06:00": 3
  "08:00": 19
  "12:00": 15
  "13:00": 22
  "17:00": 14
  "22:00": 2
Example (network traffic with weekend scaling):
type: daily_profile
min: 0.0
max: 1000.0
noise: 50.0
interpolation: linear
weekend_scale: 0.3
profile:
  "00:00": 50.0
  "06:00": 100.0
  "09:00": 800.0
  "12:00": 650.0
  "13:00": 750.0
  "17:00": 900.0
  "20:00": 400.0
  "23:00": 100.0
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| type | Literal['daily_profile'] | Yes | - | - |
| profile | Dict[str, float] | Yes | - | Anchor points mapping time-of-day (HH:MM) to target values. The engine interpolates between these points to produce a smooth daily curve. |
| min | float | Yes | - | Hard lower bound (physical limit) |
| max | float | Yes | - | Hard upper bound (physical limit) |
| noise | float | No | 0.0 | Random noise amplitude (±noise added to interpolated value). |
| interpolation | InterpolationType | No | linear | Interpolation method between anchor points: linear or step. |
| precision | Optional[int] | No | - | Round values to N decimal places. None = no rounding. 0 = integers. |
| volatility | float | No | 0.0 | Day-to-day variation in anchor point targets. Each day, every anchor point is independently shifted by a random amount drawn from a normal distribution (mean = profile value, std_dev = volatility). This makes each day's curve slightly different while preserving the overall shape. 0.0 = identical curve every day. Higher values = more day-to-day variation. |
| weekend_scale | Optional[float] | No | - | Scale factor for weekends (Saturday/Sunday). 0.0 = zero on weekends, 1.0 = same as weekday. None = no weekend adjustment. |
DerivedGeneratorConfig¶
Used in: ColumnGeneratorConfig
Derived column generator (calculated from other columns).
Example:
Supported operators:
- Arithmetic: +, -, *, /, //, %, **
- Comparison: ==, !=, <, <=, >, >=
- Logical: and, or, not
- Functions: abs(), round(), min(), max()
Example (conditional):
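The example slots above can be sketched as follows (column names are illustrative):

```yaml
# Arithmetic on other columns
generator:
  type: derived
  expression: "flow_lpm * 60"

# Conditional flag built from comparison and logical operators
generator:
  type: derived
  expression: "temperature > 90 and pressure < 50"
```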
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| type | Literal['derived'] | Yes | - | - |
| expression | str | Yes | - | Sandboxed Python expression referencing column names. Supports context variables (_row_index, entity_id, _timestamp), safe math functions (abs, round, min, max, coalesce, safe_div), and stateful functions (prev, ema, pid, delay). |
CategoricalGeneratorConfig¶
Used in: ColumnGeneratorConfig
Categorical generator for discrete values.
Example:
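A sketch with hypothetical machine-status values:

```yaml
generator:
  type: categorical
  values: ["running", "idle", "fault"]
  weights: [0.7, 0.25, 0.05]  # optional; must sum to 1.0
```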
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| type | Literal['categorical'] | Yes | - | - |
| values | List[Any] | Yes | - | List of possible values |
| weights | Optional[List[float]] | No | - | Probability weights (must sum to 1.0) |
SequentialGeneratorConfig¶
Used in: ColumnGeneratorConfig
Sequential number generator.
By default, generates globally unique IDs across all entities by offsetting each entity's sequence range. Entity 0 gets IDs [start, start + rows), entity 1 gets [start + rows, start + 2*rows), etc.
Set unique_across_entities: false for per-entity sequences (all entities share the same ID range).
Example:
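A minimal sketch (start value is illustrative):

```yaml
generator:
  type: sequential
  start: 1000
  step: 1
  unique_across_entities: true  # default; each entity gets a non-overlapping range
```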
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| type | Literal['sequential'] | Yes | - | - |
| start | int | No | 1 |
Starting value |
| step | int | No | 1 |
Increment step |
| unique_across_entities | bool | No | True |
When true, each entity gets a non-overlapping ID range |
ConstantGeneratorConfig¶
Used in: ColumnGeneratorConfig
Constant value generator.
Example:
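A minimal sketch, using the {entity_id} placeholder shown in the SimulationConfig example above:

```yaml
generator:
  type: constant
  value: "{entity_id}"
```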
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| type | Literal['constant'] | Yes | - | - |
| value | Any | Yes | - | Constant value for all rows |
BooleanGeneratorConfig¶
Used in: ColumnGeneratorConfig
Boolean generator.
Example:
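A minimal sketch (probability is illustrative):

```yaml
generator:
  type: boolean
  true_probability: 0.1  # roughly 10% of rows are True
```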
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| type | Literal['boolean'] | Yes | - | - |
| true_probability | float | No | 0.5 |
Probability of True |
TimestampGeneratorConfig¶
Used in: ColumnGeneratorConfig
Timestamp generator (uses simulation scope).
Example:
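A minimal sketch; per the field table, the generator takes no parameters beyond its type and derives values from the simulation scope:

```yaml
generator:
  type: timestamp
```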
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| type | Literal['timestamp'] | Yes | - | - |
UUIDGeneratorConfig¶
Used in: ColumnGeneratorConfig
UUID/GUID generator.
Example:
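A minimal sketch (the namespace string is illustrative and only used for version 5):

```yaml
generator:
  type: uuid
  version: 5
  namespace: "orders"
```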
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| type | Literal['uuid'] | Yes | - | - |
| version | Literal[4, 5] | No | 4 | UUID version. Options: 4 (random, default) or 5 (deterministic/namespace-based). Only these values are supported. |
| namespace | Optional[str] | No | - | Namespace seed for UUID5 generation. Arbitrary string; default 'DNS' uses the standard DNS namespace UUID. |
EmailGeneratorConfig¶
Used in: ColumnGeneratorConfig
Email address generator.
Example:
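A minimal sketch using the documented defaults:

```yaml
generator:
  type: email
  domain: example.com
  pattern: "{entity}_{index}"
```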
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| type | Literal['email'] | Yes | - | - |
| domain | str | No | example.com | Email domain |
| pattern | str | No | {entity}_{index} | Username template. Available placeholders: {entity}, {index}, {row}. Default produces usernames like 'entity_01'. |
IPGeneratorConfig¶
Used in: ColumnGeneratorConfig
IPv4 address generator.
Example:
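A minimal sketch (subnet is illustrative):

```yaml
generator:
  type: ipv4
  subnet: "10.0.0.0/8"  # omit to draw from the full IPv4 range
```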
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| type | Literal['ipv4'] | Yes | - | - |
| subnet | Optional[str] | No | - | CIDR subnet (e.g., '192.168.0.0/16'). If None, uses full range. |
GeoGeneratorConfig¶
Used in: ColumnGeneratorConfig
Geographic coordinate generator.
Example:
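A minimal sketch (the bounding box is illustrative):

```yaml
generator:
  type: geo
  bbox: [40.5, -74.3, 41.0, -73.7]  # [min_lat, min_lon, max_lat, max_lon]
  format: tuple
```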
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| type | Literal['geo'] | Yes | - | - |
| bbox | List[float] | Yes | - | Bounding box [min_lat, min_lon, max_lat, max_lon] |
| format | Literal['tuple', 'lat_lon_separate'] | No | tuple | Output format: 'tuple' for (lat,lon) or 'lat_lon_separate' for separate columns |
Semantic Layer¶
The semantic layer provides a unified interface for defining and querying business metrics. Define metrics once, query them by name across dimensions.
Core Components:
- MetricDefinition: Define aggregation expressions (SUM, COUNT, AVG)
- DimensionDefinition: Define grouping attributes with hierarchies
- MaterializationConfig: Pre-compute metrics at specific grain
- SemanticQuery: Execute queries like "revenue BY region, month"
- Project: Unified API that connects pipelines and semantic layer
Unified Project API (Recommended):
from odibi import Project
project = Project.load("odibi.yaml")
result = project.query("revenue BY region")
print(result.df)
YAML Configuration:
project: my_warehouse
engine: pandas
connections:
  gold:
    type: delta
    path: /mnt/data/gold
# Semantic layer at project level
semantic:
  metrics:
    - name: revenue
      expr: "SUM(total_amount)"
      source: gold.fact_orders  # connection.table notation
      filters:
        - "status = 'completed'"
  dimensions:
    - name: region
      source: gold.dim_customer
      column: region
  materializations:
    - name: monthly_revenue
      metrics: [revenue]
      dimensions: [region, month]
      output: gold/agg_monthly_revenue
The source: gold.fact_orders notation resolves paths automatically from connections.
DimensionDefinition¶
Used in: SemanticLayerConfig
Definition of a semantic dimension.
A dimension represents an attribute for grouping and filtering metrics (e.g., date, product, region).
Attributes:
- name: Unique dimension identifier
- label: Display name for column alias in generated views. Defaults to name.
- source: Source table reference. Supports three formats: $pipeline.node (recommended, e.g., $build_warehouse.dim_customer); connection.path (e.g., gold.dim_customer or gold.dims/customer); or a bare table_name, which uses the default connection.
- column: Column name in source (defaults to name)
- expr: Custom SQL expression. If provided, overrides column and grain. Example: "YEAR(DATEADD(month, 6, Date))" for fiscal year.
- hierarchy: Optional ordered list of columns for drill-down
- description: Human-readable description
- grain: Time grain transformation (day, week, month, quarter, year). Ignored if expr is provided.
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| name | str | Yes | - | Unique dimension identifier |
| label | Optional[str] | No | - | Display name for column alias (defaults to name) |
| source | Optional[str] | No | - | Source table reference. Formats: $pipeline.node (e.g., $build_warehouse.dim_customer), connection.path (e.g., gold.dim_customer or gold.dims/customer), or bare table_name |
| column | Optional[str] | No | - | Column name (defaults to name) |
| expr | Optional[str] | No | - | Custom SQL expression. Overrides column and grain. Example: YEAR(DATEADD(month, 6, Date)) for fiscal year |
| hierarchy | List[str] | No | - | Drill-down hierarchy |
| description | Optional[str] | No | - | Human-readable description |
| grain | Optional[TimeGrain] | No | - | Time grain transformation |
MaterializationConfig¶
Used in: SemanticLayerConfig
Configuration for materializing metrics to a table.
Materialization pre-computes aggregated metrics at a specific grain and persists them for faster querying.
Attributes:
- name: Unique materialization identifier
- metrics: List of metric names to include
- dimensions: List of dimension names (determines grain)
- output: Output table path
- schedule: Optional cron schedule for refresh
- incremental: Configuration for incremental refresh
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| name | str | Yes | - | Unique materialization identifier |
| metrics | List[str] | Yes | - | Metrics to materialize |
| dimensions | List[str] | Yes | - | Dimensions for grouping |
| output | str | Yes | - | Output table path |
| schedule | Optional[str] | No | - | Cron schedule |
| incremental | Optional[Dict[str, Any]] | No | - | Incremental refresh config |
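A sketch of a materialization entry, assuming the metric and dimension names were defined elsewhere in the semantic layer (the output path and schedule are illustrative):

```yaml
materializations:
  - name: daily_revenue_by_region
    metrics: [total_revenue, order_count]
    dimensions: [order_date, region]       # determines the grain
    output: gold.daily_revenue_by_region   # hypothetical output path
    schedule: "0 6 * * *"                  # cron: refresh daily at 06:00
```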
MetricDefinition¶
Used in: SemanticLayerConfig
Definition of a semantic metric.
A metric represents a measurable value that can be aggregated across dimensions (e.g., revenue, order_count, avg_order_value).
Attributes:
name: Unique metric identifier
label: Display name for column alias in generated views. Defaults to name.
description: Human-readable description
expr: SQL aggregation expression (e.g., "SUM(total_amount)").
Optional for derived metrics.
source: Source table reference. Supports three formats:
- $pipeline.node (recommended): e.g., $build_warehouse.fact_orders
- connection.path: e.g., gold.fact_orders or gold.oee/plant_a/metrics
- table_name: Uses default connection
filters: Optional WHERE conditions to apply
type: "simple" (direct aggregation) or "derived" (references other metrics)
components: List of component metric names (required for derived metrics).
These metrics must be additive (e.g., SUM-based) for correct
recalculation at different grains.
formula: Calculation formula using component names (required for derived).
Example: "(total_revenue - total_cost) / total_revenue"
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| name | str | Yes | - | Unique metric identifier |
| label | Optional[str] | No | - | Display name for column alias (defaults to name) |
| description | Optional[str] | No | - | Human-readable description |
| expr | Optional[str] | No | - | SQL aggregation expression |
| source | Optional[str] | No | - | Source table reference. Formats: $pipeline.node (e.g., $build_warehouse.fact_orders), connection.path (e.g., gold.fact_orders or gold.oee/plant_a/table), or bare table_name |
| filters | List[str] | No | PydanticUndefined | WHERE conditions |
| type | MetricType | No | MetricType.SIMPLE | Metric type |
| components | Optional[List[str]] | No | - | Component metric names for derived metrics |
| formula | Optional[str] | No | - | Calculation formula using component names |
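A sketch showing two simple metrics and a derived metric built from them, following the formula example in the docstring above (table and column names are hypothetical):

```yaml
metrics:
  - name: total_revenue
    expr: "SUM(total_amount)"
    source: $build_warehouse.fact_orders
    filters: ["status = 'completed'"]      # optional WHERE conditions
  - name: total_cost
    expr: "SUM(cost_amount)"
    source: $build_warehouse.fact_orders
  - name: margin_pct
    type: derived
    components: [total_revenue, total_cost]  # must be additive (SUM-based)
    formula: "(total_revenue - total_cost) / total_revenue"
```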
SemanticLayerConfig¶
Complete semantic layer configuration.
Contains all metrics, dimensions, materializations, and views for a semantic layer deployment.
Attributes:
metrics: List of metric definitions
dimensions: List of dimension definitions
materializations: List of materialization configurations
views: List of view configurations
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| metrics | List[MetricDefinition] | No | PydanticUndefined | Metric definitions |
| dimensions | List[DimensionDefinition] | No | PydanticUndefined | Dimension definitions |
| materializations | List[MaterializationConfig] | No | PydanticUndefined | Materialization configs |
| views | List[ViewConfig] | No | PydanticUndefined | View configurations |
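A minimal end-to-end skeleton combining the four sections; the top-level `semantic_layer` key and all names are assumptions for illustration:

```yaml
semantic_layer:
  metrics:
    - name: order_count
      expr: "COUNT(*)"
      source: gold.fact_orders
  dimensions:
    - name: order_date
      source: gold.fact_orders
      grain: day
  materializations: []   # optional
  views: []              # optional
```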
FK Validation¶
Declare and validate referential integrity between fact and dimension tables.
Features:
- Declare relationships in YAML
- Validate FK constraints on fact load
- Detect orphan records
- Generate lineage from relationships
Example:
relationships:
- name: orders_to_customers
fact: fact_orders
dimension: dim_customer
fact_key: customer_sk
dimension_key: customer_sk
on_violation: error
RelationshipConfig¶
Used in: RelationshipRegistry
Configuration for a foreign key relationship.
Attributes:
name: Unique relationship identifier
fact: Fact table name
dimension: Dimension table name
fact_key: Foreign key column in fact table
dimension_key: Primary/surrogate key column in dimension
nullable: Whether nulls are allowed in fact_key
on_violation: Action on violation ("warn", "error", "quarantine")
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| name | str | Yes | - | Unique relationship identifier |
| fact | str | Yes | - | Fact table name |
| dimension | str | Yes | - | Dimension table name |
| fact_key | str | Yes | - | FK column in fact table |
| dimension_key | str | Yes | - | PK/SK column in dimension |
| nullable | bool | No | False | Allow nulls in fact_key |
| on_violation | str | No | error | Action on violation ("warn", "error", "quarantine") |
RelationshipRegistry¶
Registry of all declared relationships.
Attributes:
relationships: List of relationship configurations
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| relationships | List[RelationshipConfig] | No | PydanticUndefined | Relationship definitions |
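A registry can hold multiple relationships with different violation policies. A sketch with an optional (nullable) foreign key alongside a strict one; the promotion tables are hypothetical:

```yaml
relationships:
  - name: orders_to_customers
    fact: fact_orders
    dimension: dim_customer
    fact_key: customer_sk
    dimension_key: customer_sk
    on_violation: error       # fail the load on orphans
  - name: orders_to_promotions
    fact: fact_orders
    dimension: dim_promotion
    fact_key: promo_sk
    dimension_key: promo_sk
    nullable: true            # not every order has a promotion
    on_violation: warn        # log orphans, do not fail
```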
Data Patterns¶
Declarative patterns for common data warehouse building blocks. Patterns encapsulate best practices for dimensional modeling, ensuring consistent implementation across your data warehouse.
DimensionPattern¶
Build complete dimension tables with surrogate keys and SCD (Slowly Changing Dimension) support.
When to Use:
- Building dimension tables from source systems (customers, products, locations)
- Need surrogate keys for star schema joins
- Need to track historical changes (SCD Type 2)
Beginner Note: Dimensions are the "who, what, where, when" of your data warehouse. A customer dimension has customer_id (natural key) and customer_sk (surrogate key). Fact tables join to dimensions via surrogate keys.
See Also: FactPattern, DateDimensionPattern
Features:
- Auto-generate integer surrogate keys (MAX(existing) + ROW_NUMBER)
- SCD Type 0 (static), 1 (overwrite), 2 (history tracking)
- Optional unknown member row (SK=0) for orphan FK handling
- Audit columns (load_timestamp, source_system)
Params:
| Parameter | Type | Required | Description |
|---|---|---|---|
| natural_key | str | Yes | Natural/business key column name |
| surrogate_key | str | Yes | Surrogate key column name to generate |
| scd_type | int | No | 0=static, 1=overwrite, 2=history (default: 1) |
| track_cols | list | SCD1/2 | Columns to track for change detection |
| target | str | SCD2 | Target table path to read existing history |
| unknown_member | bool | No | Insert row with SK=0 for orphan handling |
| audit.load_timestamp | bool | No | Add load_timestamp column |
| audit.source_system | str | No | Add source_system column with value |
Supported Target Formats:
- Spark: catalog.table, Delta paths, .parquet, .csv, .json, .orc
- Pandas: .parquet, .csv, .json, .xlsx, .feather, .pickle
Example:
pattern:
type: dimension
params:
natural_key: customer_id
surrogate_key: customer_sk
scd_type: 2
track_cols: [name, email, address, city]
target: warehouse.dim_customer
unknown_member: true
audit:
load_timestamp: true
source_system: "crm"
DateDimensionPattern¶
Generate a complete date dimension table with pre-calculated attributes for BI/reporting.
When to Use:
- Every data warehouse needs a date dimension for time-based analytics
- Enable date filtering, grouping by week/month/quarter, and fiscal year reporting
Beginner Note: The date dimension is foundational for any BI/reporting system. It lets you query "sales by month" or "orders in fiscal Q2" without complex date calculations.
See Also: DimensionPattern
Features:
- Generates all dates in a range with rich attributes
- Calendar and fiscal year support
- ISO week numbering
- Weekend/month-end flags
Params:
| Parameter | Type | Required | Description |
|---|---|---|---|
| start_date | str | Yes | Start date (YYYY-MM-DD) |
| end_date | str | Yes | End date (YYYY-MM-DD) |
| date_key_format | str | No | Format for date_sk (default: yyyyMMdd) |
| fiscal_year_start_month | int | No | Month fiscal year starts (1-12, default: 1) |
| unknown_member | bool | No | Add unknown date row with date_sk=0 |
Generated Columns:
date_sk, full_date, day_of_week, day_of_week_num, day_of_month,
day_of_year, is_weekend, week_of_year, month, month_name, quarter,
quarter_name, year, fiscal_year, fiscal_quarter, is_month_start,
is_month_end, is_year_start, is_year_end
Example:
pattern:
type: date_dimension
params:
start_date: "2020-01-01"
end_date: "2030-12-31"
fiscal_year_start_month: 7
unknown_member: true
FactPattern¶
Build fact tables with automatic surrogate key lookups from dimensions.
When to Use:
- Building fact tables from transactional data (orders, events, transactions)
- Need to look up surrogate keys from dimension tables
- Need to handle orphan records (missing dimension matches)
Beginner Note: Facts are the "how much, how many" of your data warehouse. An orders fact has measures (quantity, revenue) and dimension keys (customer_sk, product_sk). The pattern automatically looks up SKs from dimensions.
See Also: DimensionPattern, QuarantineConfig
Features:
- Automatic SK lookups from dimension tables (with SCD2 current-record filtering)
- Orphan handling: unknown (SK=0), reject (error), quarantine (route to table)
- Grain validation (detect duplicates)
- Calculated measures and column renaming
- Audit columns
Params:
| Parameter | Type | Required | Description |
|---|---|---|---|
| grain | list | No | Columns defining uniqueness (validates no duplicates) |
| dimensions | list | No | Dimension lookup configurations (see below) |
| orphan_handling | str | No | "unknown", "reject", or "quarantine" (default: unknown) |
| quarantine | dict | If quarantine | Quarantine config (see below) |
| measures | list | No | Measure definitions (passthrough, rename, or calculated) |
| deduplicate | bool | No | Remove duplicates before processing |
| keys | list | If dedupe | Keys for deduplication |
| audit.load_timestamp | bool | No | Add load_timestamp column |
| audit.source_system | str | No | Add source_system column |
Dimension Lookup Config:
dimensions:
- source_column: customer_id # Column in source fact
dimension_table: dim_customer # Dimension in context
dimension_key: customer_id # Natural key in dimension
surrogate_key: customer_sk # SK to retrieve
scd2: true # Filter is_current=true
Quarantine Config (for orphan_handling: quarantine):
quarantine:
connection: silver # Required: connection name
path: fact_orders_orphans # OR table: quarantine_table
add_columns:
_rejection_reason: true # Add rejection reason
_rejected_at: true # Add rejection timestamp
_source_dimension: true # Add dimension name
Example:
pattern:
type: fact
params:
grain: [order_id]
dimensions:
- source_column: customer_id
dimension_table: dim_customer
dimension_key: customer_id
surrogate_key: customer_sk
scd2: true
- source_column: product_id
dimension_table: dim_product
dimension_key: product_id
surrogate_key: product_sk
orphan_handling: unknown
measures:
- quantity
- revenue: "quantity * unit_price"
audit:
load_timestamp: true
source_system: "pos"
AggregationPattern¶
Declarative aggregation with GROUP BY and optional incremental merge.
When to Use:
- Building summary/aggregate tables (daily sales, monthly metrics)
- Need incremental aggregation (update existing aggregates)
- Gold layer reporting tables
Beginner Note: Aggregations summarize facts at a higher grain. Example: daily_sales aggregates orders by date with SUM(revenue).
See Also: FactPattern
Features:
- Declare grain (GROUP BY columns)
- Define measures with SQL aggregation expressions
- Optional HAVING filter
- Audit columns
Params:
| Parameter | Type | Required | Description |
|---|---|---|---|
| grain | list | Yes | Columns to GROUP BY (defines uniqueness) |
| measures | list | Yes | Measure definitions with name and expr |
| having | str | No | HAVING clause for filtering aggregates |
| incremental.timestamp_column | str | No | Column to identify new data |
| incremental.merge_strategy | str | No | "replace", "sum", "min", or "max" |
| audit.load_timestamp | bool | No | Add load_timestamp column |
| audit.source_system | str | No | Add source_system column |
Example:
pattern:
type: aggregation
params:
grain: [date_sk, product_sk, region]
measures:
- name: total_revenue
expr: "SUM(total_amount)"
- name: order_count
expr: "COUNT(*)"
- name: avg_order_value
expr: "AVG(total_amount)"
having: "COUNT(*) > 0"
audit:
load_timestamp: true
AuditConfig¶
Configuration for audit columns.
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| load_timestamp | bool | No | True | Add load_timestamp column |
| source_system | Optional[str] | No | - | Source system name for source_system column |
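A minimal audit block as it appears inside a pattern's params; the source system name is illustrative:

```yaml
audit:
  load_timestamp: true     # default; adds load_timestamp column
  source_system: "erp"     # hypothetical value for the source_system column
```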