# Managing Environments
Odibi allows you to define pipeline configurations that adapt to different contexts (e.g., Local Development, Testing, Production). You can override infrastructure settings per environment, or even have completely different pipeline implementations per environment. This prevents configuration drift and lets you experiment freely in dev without risking production.
## How It Works

Odibi uses a **Base Configuration + Override** model:

1. **Base Configuration**: Defines your default settings (typically for local development).
2. **Environment Overrides**: Specific blocks that patch or replace values in the base configuration when that environment is active.
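Conceptually, the override step behaves like a recursive dictionary merge: nested mappings are merged key by key, while scalars and lists are replaced outright. A minimal sketch of this idea (not Odibi's actual implementation):

```python
def deep_merge(base: dict, override: dict) -> dict:
    """Merge `override` on top of `base`: nested dicts merge, other values replace."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


base = {
    "engine": "pandas",
    "connections": {"data_lake": {"type": "local", "base_path": "./data"}},
}
prod = {
    "engine": "spark",
    "connections": {"data_lake": {"type": "azure_adls", "account": "prod_acc"}},
}

effective = deep_merge(base, prod)
# engine is replaced; data_lake keeps base keys the override does not mention
```

Note that under these semantics, untouched base keys (like `base_path` above) survive into the effective config; an environment that needs a clean slate must override every key explicitly.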
## Configuration Structure

Odibi supports two ways to define environments:

1. **Inline Block**: Using an `environments` block in your main config file.
2. **External Files**: Using separate `env.{env}.yaml` files (e.g., `env.prod.yaml`).
### Method 1: Inline Block

Add an `environments` section to your `project.yaml` (shown in full in the inline example below).
### Method 2: External Files (Recommended for large configs)

Keep your main `odibi.yaml` clean by putting overrides in separate files.

**File: `odibi.yaml`** (your base configuration)

**File: `env.prod.yaml`**

```yaml
# Automatically merged when running with --env prod
engine: spark

connections:
  data_lake:
    type: azure_adls
    account: prod_acc
```
When you run `odibi run odibi.yaml --env prod`, Odibi will:

1. Load `odibi.yaml`.
2. Look for `env.prod.yaml` in the same directory.
3. Merge the prod config on top of the base config.
### Inline Example (Method 1)

```yaml
# --- 1. Base Configuration (Default / Local) ---
project: Sales Data Pipeline
engine: pandas

retry:
  enabled: false

connections:
  data_lake:
    type: local
    base_path: ./data/raw

pipelines:
  - pipeline: ingest_sales
    nodes:
      - name: read_csv
        read:
          connection: data_lake
          path: sales.csv

# --- 2. Environment Overrides ---
environments:
  # Production Environment
  prod:
    engine: spark  # Switch to Spark for scale
    retry:
      enabled: true
      max_attempts: 3
    connections:
      data_lake:
        type: azure_adls
        account: mycompanyprod
        container: sales-data
        auth:
          mode: aad_msi
    story:
      max_sample_rows: 0  # Disable data sampling for security

  # Testing Environment
  test:
    connections:
      data_lake:
        type: local
        base_path: ./data/test_fixtures
```
## Usage

### CLI

Use the `--env` flag to activate an environment.
**Run in Default (Base) Environment:**
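A sketch, assuming the config file is named `project.yaml` (the filename used in the Python API example below):

```bash
odibi run project.yaml
```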
**Run in Production:**
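A sketch, again assuming the config file is named `project.yaml`:

```bash
odibi run project.yaml --env prod
```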
### Python API

Pass the `env` parameter when initializing the `PipelineManager`.

```python
from odibi.pipeline import PipelineManager

# Load Prod Configuration
manager = PipelineManager.from_yaml("project.yaml", env="prod")

# Run Pipeline
manager.run("ingest_sales")
```
### Databricks Example

In a Databricks notebook, you can use widgets to switch environments dynamically without changing code.

```python
# 1. Create Widget
dbutils.widgets.dropdown("environment", "dev", ["dev", "test", "prod"])

# 2. Get Selection
current_env = dbutils.widgets.get("environment")

# 3. Run Pipeline
manager = PipelineManager.from_yaml("/dbfs/project.yaml", env=current_env)
manager.run()
```
## Per-Environment Pipelines
One of the most powerful patterns is having different pipeline implementations per environment. This lets you experiment freely in dev without risking production.
### Architecture: Environment-Owned Pipelines

Instead of putting pipelines in `project.yaml`, keep the base config as shared infrastructure and let each environment own its pipelines:

```
my_project/
├── project.yaml          # Shared config (no pipelines)
├── env.dev.yaml          # Dev pipelines + local connections
├── env.qat.yaml          # QAT pipelines + test connections
├── env.prod.yaml         # Prod pipelines + cloud connections
└── pipelines/
    ├── bronze_dev.yaml   # Dev: reads from local CSVs
    ├── bronze_qat.yaml   # QAT: reads from test database
    └── bronze_prod.yaml  # Prod: reads from ADLS
```
Base config — shared infrastructure only:
```yaml
# project.yaml
project: Sales Analytics
engine: pandas

connections:
  data_lake:
    type: local
    base_path: ./data

story:
  connection: data_lake
  path: stories

system:
  connection: data_lake
  path: _odibi_system
```
Each environment imports its own pipelines:
```yaml
# env.dev.yaml
imports:
  - pipelines/bronze_dev.yaml

connections:
  data_lake:
    type: local
    base_path: ./test_data
```

```yaml
# env.prod.yaml
imports:
  - pipelines/bronze_prod.yaml

engine: spark

connections:
  data_lake:
    type: azure_adls
    account_name: ${PROD_STORAGE_ACCOUNT}
    container: bronze
    credential: ${PROD_SAS_TOKEN}
```
Run with the environment flag:
```bash
odibi run project.yaml --env dev   # Loads env.dev.yaml → dev pipelines
odibi run project.yaml --env prod  # Loads env.prod.yaml → prod pipelines
```
**Note:** When using environment-owned pipelines, `--env` is required. Without it, the base config has no pipelines and validation will fail. This is intentional: it forces you to be explicit about which environment you're running.
### Shared Pipelines with Environment-Specific Overrides
If your pipeline logic is mostly the same and only connections/engine differ, keep pipelines in the base config and override infrastructure per environment:
```yaml
# project.yaml — pipelines defined here, shared across environments
pipelines:
  - pipeline: bronze_sales
    nodes:
      - name: ingest
        read:
          connection: data_lake
          path: sales/

environments:
  prod:
    engine: spark
    connections:
      data_lake:
        type: azure_adls
        account: prod_acc
```
### Duplicate Pipeline Names (Override Behavior)
If both the base config and an environment file define a pipeline with the same name, the environment version wins (the last definition takes precedence), and a warning is logged so the override is traceable. This is useful when you want the same logical pipeline name (`bronze`) but completely different implementations per environment.
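For instance, both files can define a pipeline named `bronze`; the node details below are illustrative:

```yaml
# project.yaml
pipelines:
  - pipeline: bronze
    nodes:
      - name: ingest
        read:
          connection: data_lake
          path: sales/
```

```yaml
# env.dev.yaml
pipelines:
  - pipeline: bronze        # Same name: this version wins under --env dev
    nodes:
      - name: ingest
        read:
          connection: data_lake
          path: sample_sales.csv
```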
### Environment Files Support Imports

External `env.{env}.yaml` files are full YAML configs: they support `imports`, `${VAR}` substitution, `${vars.xxx}` references, and `${date:...}` expressions, just like the main config. This means you can organize per-environment pipelines into separate files and import them:
```yaml
# env.dev.yaml
imports:
  - pipelines/bronze_dev.yaml
  - pipelines/silver_dev.yaml
  - pipelines/gold_dev.yaml
```
## Common Use Cases

### 1. Swapping Storage (Local vs. Cloud)
Develop locally with CSVs, deploy to ADLS/S3 without changing pipeline code.
```yaml
connections:
  storage: { type: local, base_path: ./data }

environments:
  prod:
    connections:
      storage: { type: azure_adls, account: prod_acc, container: data }
```
### 2. Scaling Engines (Pandas vs. Spark)
Use Pandas for fast local iteration and unit tests, but switch to Spark for distributed processing in production.
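This is the same override pattern shown in the inline example above, reduced to the engine key:

```yaml
engine: pandas     # Base: fast local iteration and unit tests

environments:
  prod:
    engine: spark  # Override: distributed processing in production
```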
### 3. Security & Privacy
Disable data sampling in stories for production to prevent PII leakage, while keeping it enabled in dev for debugging.
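Mirroring the production override from the inline example above, the sampling knob is flipped per environment:

```yaml
environments:
  prod:
    story:
      max_sample_rows: 0  # Disable data sampling in prod to prevent PII leakage
```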
### 4. Alerting
Only send Slack/Teams notifications when running in production.
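This page does not document Odibi's alerting schema, so the keys below are purely illustrative placeholders for the pattern (alerts off in the base, enabled only in prod):

```yaml
# Hypothetical keys, not Odibi's documented schema
alerts:
  enabled: false

environments:
  prod:
    alerts:
      enabled: true
      teams_webhook: ${TEAMS_WEBHOOK_URL}
```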
### 5. System Environment Tagging
Tag all system catalog records (runs, state) with the environment for cross-environment observability:
```yaml
system:
  connection: catalog_storage
  path: _odibi_system
  environment: dev  # Default environment tag

environments:
  qat:
    system:
      environment: qat
  prod:
    system:
      environment: prod
```
This enables querying run history and state across environments.
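For example, once the catalog lives in a SQL backend (see the centralized SQL Server use case below), a query like the following can compare runs per environment. The table name `meta_runs` appears later on this page, but the column names here are assumptions:

```sql
-- Column names are assumptions; meta_runs is named in the SQL Server section
SELECT environment, COUNT(*) AS run_count
FROM odibi_system.meta_runs
GROUP BY environment;
```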
### 6. Per-Environment Pipelines (Dev Sandbox)
Break things in dev without touching prod. Each environment gets its own pipeline definitions:
```yaml
# env.dev.yaml — experiment freely
imports:
  - pipelines/bronze_dev.yaml  # reads from local test CSVs

connections:
  data_lake:
    type: local
    base_path: ./test_data
```

```yaml
# env.prod.yaml — production pipelines
imports:
  - pipelines/bronze_prod.yaml  # reads from ADLS

engine: spark

connections:
  data_lake:
    type: azure_adls
    account: ${PROD_ACCOUNT}
```
### 7. Centralized SQL Server System Catalog
Store system metadata in a central SQL Server for unified observability:
```yaml
system:
  connection: local_storage
  path: .odibi/system

environments:
  prod:
    system:
      connection: sql_server
      schema_name: odibi_system
      environment: prod
      sync_from:
        connection: local_storage
        path: .odibi/system

connections:
  local_storage:
    type: local
    base_path: ./
  sql_server:
    type: sql_server
    host: central-server.database.windows.net
    database: odibi_metadata
```
In production, the SQL Server backend:

- Auto-creates the schema and tables
- Stores `meta_runs` and `meta_state`
- Enables syncing local dev data to the central location
With this in place, local development data is synced into SQL Server via the `sync_from` settings shown above.