# Managing Environments
Odibi allows you to define pipeline configurations that adapt to different contexts (e.g., Local Development, Testing, Production). You can override infrastructure settings per environment, or even have completely different pipeline implementations per environment. This prevents configuration drift and lets you experiment freely in dev without risking production.
## How It Works

Odibi uses a **Base Configuration + Override** model:

1. **Base Configuration**: Defines your default settings (typically for local development).
2. **Environment Overrides**: Specific blocks that patch or replace values in the base configuration when that environment is active.
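Conceptually, the override step behaves like a recursive dictionary merge: nested mappings are merged key by key, while scalars and lists are replaced outright. A minimal sketch of this idea (not Odibi's actual implementation):

```python
def deep_merge(base: dict, override: dict) -> dict:
    """Merge `override` on top of `base`: nested dicts merge, other values replace."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


base = {
    "engine": "pandas",
    "connections": {"data_lake": {"type": "local", "base_path": "./data"}},
}
prod = {
    "engine": "spark",
    "connections": {"data_lake": {"type": "azure_adls", "account": "prod_acc"}},
}

effective = deep_merge(base, prod)
# engine is replaced; data_lake keeps base keys the override does not mention
```

Note that under these semantics, untouched base keys (like `base_path` above) survive into the effective config; an environment that needs a clean slate must override every key explicitly.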
## Configuration Structure

Odibi supports two ways to define environments:

1. **Inline Block**: Using an `environments` block in your main config file.
2. **External Files**: Using separate `env.{env}.yaml` files (e.g., `env.prod.yaml`).
### Method 1: Inline Block

Add an `environments` section to your `project.yaml` (shown in full in the inline example below).
### Method 2: External Files (Recommended for large configs)

Keep your main `odibi.yaml` clean by putting overrides in separate files.

**File: `odibi.yaml`** (your base configuration)

**File: `env.prod.yaml`**

```yaml
# Automatically merged when running with --env prod
engine: spark

connections:
  data_lake:
    type: azure_adls
    account: prod_acc
```
When you run `odibi run odibi.yaml --env prod`, Odibi will:

1. Load `odibi.yaml`.
2. Look for `env.prod.yaml` in the same directory.
3. Merge the prod config on top of the base config.
### Inline Example (Method 1)

```yaml
# --- 1. Base Configuration (Default / Local) ---
project: Sales Data Pipeline
engine: pandas

retry:
  enabled: false

connections:
  data_lake:
    type: local
    base_path: ./data/raw

pipelines:
  - pipeline: ingest_sales
    nodes:
      - name: read_csv
        read:
          connection: data_lake
          path: sales.csv

# --- 2. Environment Overrides ---
environments:
  # Production Environment
  prod:
    engine: spark  # Switch to Spark for scale
    retry:
      enabled: true
      max_attempts: 3
    connections:
      data_lake:
        type: azure_adls
        account: mycompanyprod
        container: sales-data
        auth:
          mode: aad_msi
    story:
      max_sample_rows: 0  # Disable data sampling for security

  # Testing Environment
  test:
    connections:
      data_lake:
        type: local
        base_path: ./data/test_fixtures
```
## Usage

### CLI

Use the `--env` flag to activate an environment.
**Run in Default (Base) Environment:**
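A sketch, assuming the config file is named `project.yaml` (the filename used in the Python API example below):

```bash
odibi run project.yaml
```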
**Run in Production:**
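A sketch, again assuming the config file is named `project.yaml`:

```bash
odibi run project.yaml --env prod
```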
### Python API

Pass the `env` parameter when initializing the `PipelineManager`.

```python
from odibi.pipeline import PipelineManager

# Load Prod Configuration
manager = PipelineManager.from_yaml("project.yaml", env="prod")

# Run Pipeline
manager.run("ingest_sales")
```
### Databricks Example

In a Databricks notebook, you can use widgets to switch environments dynamically without changing code.

```python
# 1. Create Widget
dbutils.widgets.dropdown("environment", "dev", ["dev", "test", "prod"])

# 2. Get Selection
current_env = dbutils.widgets.get("environment")

# 3. Run Pipeline
manager = PipelineManager.from_yaml("/dbfs/project.yaml", env=current_env)
manager.run()
```
## Per-Environment Pipelines
One of the most powerful patterns is having different pipeline implementations per environment. This lets you experiment freely in dev without risking production.
### Architecture: Environment-Owned Pipelines

Instead of putting pipelines in `project.yaml`, keep the base config as shared infrastructure and let each environment own its pipelines:

```
my_project/
├── project.yaml          # Shared config (no pipelines)
├── env.dev.yaml          # Dev pipelines + local connections
├── env.qat.yaml          # QAT pipelines + test connections
├── env.prod.yaml         # Prod pipelines + cloud connections
└── pipelines/
    ├── bronze_dev.yaml   # Dev: reads from local CSVs
    ├── bronze_qat.yaml   # QAT: reads from test database
    └── bronze_prod.yaml  # Prod: reads from ADLS
```
Base config — shared infrastructure only:
```yaml
# project.yaml
project: Sales Analytics
engine: pandas

connections:
  data_lake:
    type: local
    base_path: ./data

story:
  connection: data_lake
  path: stories

system:
  connection: data_lake
  path: _odibi_system
```
Each environment imports its own pipelines:
```yaml
# env.dev.yaml
imports:
  - pipelines/bronze_dev.yaml

connections:
  data_lake:
    type: local
    base_path: ./test_data
```

```yaml
# env.prod.yaml
imports:
  - pipelines/bronze_prod.yaml

engine: spark

connections:
  data_lake:
    type: azure_adls
    account_name: ${PROD_STORAGE_ACCOUNT}
    container: bronze
    credential: ${PROD_SAS_TOKEN}
```
Run with the environment flag:
```bash
odibi run project.yaml --env dev   # Loads env.dev.yaml → dev pipelines
odibi run project.yaml --env prod  # Loads env.prod.yaml → prod pipelines
```
**Note:** When using environment-owned pipelines, `--env` is required. Without it, the base config has no pipelines and validation will fail. This is intentional: it forces you to be explicit about which environment you're running.
### Shared Pipelines with Environment-Specific Overrides
If your pipeline logic is mostly the same and only connections/engine differ, keep pipelines in the base config and override infrastructure per environment:
```yaml
# project.yaml — pipelines defined here, shared across environments
pipelines:
  - pipeline: bronze_sales
    nodes:
      - name: ingest
        read:
          connection: data_lake
          path: sales/

environments:
  prod:
    engine: spark
    connections:
      data_lake:
        type: azure_adls
        account: prod_acc
```
### Duplicate Pipeline Names (Override Behavior)
If both the base config and an environment file define a pipeline with the same name, the environment version wins (the last definition takes precedence), and a warning is logged so the override is traceable. This is useful when you want the same logical pipeline name (`bronze`) but completely different implementations per environment.
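For instance, both files can define a pipeline named `bronze`; the node details below are illustrative:

```yaml
# project.yaml
pipelines:
  - pipeline: bronze
    nodes:
      - name: ingest
        read:
          connection: data_lake
          path: sales/
```

```yaml
# env.dev.yaml
pipelines:
  - pipeline: bronze        # Same name: this version wins under --env dev
    nodes:
      - name: ingest
        read:
          connection: data_lake
          path: sample_sales.csv
```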
### Environment Files Support Imports

External `env.{env}.yaml` files are full YAML configs: they support `imports`, `${VAR}` substitution, `${vars.xxx}` references, and `${date:...}` expressions, just like the main config. This means you can organize per-environment pipelines into separate files and import them:
```yaml
# env.dev.yaml
imports:
  - pipelines/bronze_dev.yaml
  - pipelines/silver_dev.yaml
  - pipelines/gold_dev.yaml
```
## Common Use Cases

### 1. Swapping Storage (Local vs. Cloud)
Develop locally with CSVs, deploy to ADLS/S3 without changing pipeline code.
```yaml
connections:
  storage: { type: local, base_path: ./data }

environments:
  prod:
    connections:
      storage: { type: azure_adls, account: prod_acc, container: data }
```
### 2. Scaling Engines (Pandas vs. Spark)
Use Pandas for fast local iteration and unit tests, but switch to Spark for distributed processing in production.
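This is the same override pattern shown in the inline example above, reduced to the engine key:

```yaml
engine: pandas     # Base: fast local iteration and unit tests

environments:
  prod:
    engine: spark  # Override: distributed processing in production
```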
### 3. Security & Privacy
Disable data sampling in stories for production to prevent PII leakage, while keeping it enabled in dev for debugging.
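Mirroring the production override from the inline example above, the sampling knob is flipped per environment:

```yaml
environments:
  prod:
    story:
      max_sample_rows: 0  # Disable data sampling in prod to prevent PII leakage
```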
### 4. Alerting
Only send Slack/Teams notifications when running in production.
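This page does not document Odibi's alerting schema, so the keys below are purely illustrative placeholders for the pattern (alerts off in the base, enabled only in prod):

```yaml
# Hypothetical keys, not Odibi's documented schema
alerts:
  enabled: false

environments:
  prod:
    alerts:
      enabled: true
      teams_webhook: ${TEAMS_WEBHOOK_URL}
```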
### 5. System Environment Tagging
Tag all system catalog records (runs, state) with the environment for cross-environment observability:
```yaml
system:
  connection: catalog_storage
  path: _odibi_system
  environment: dev  # Default environment tag

environments:
  qat:
    system:
      environment: qat
  prod:
    system:
      environment: prod
```
This enables querying run history and state across environments.
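For example, once the catalog lives in a SQL backend (see the centralized SQL Server use case below), a query like the following can compare runs per environment. The table name `meta_runs` appears later on this page, but the column names here are assumptions:

```sql
-- Column names are assumptions; meta_runs is named in the SQL Server section
SELECT environment, COUNT(*) AS run_count
FROM odibi_system.meta_runs
GROUP BY environment;
```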
### 6. Per-Environment Pipelines (Dev Sandbox)
Break things in dev without touching prod. Each environment gets its own pipeline definitions:
```yaml
# env.dev.yaml — experiment freely
imports:
  - pipelines/bronze_dev.yaml  # reads from local test CSVs

connections:
  data_lake:
    type: local
    base_path: ./test_data
```

```yaml
# env.prod.yaml — production pipelines
imports:
  - pipelines/bronze_prod.yaml  # reads from ADLS

engine: spark

connections:
  data_lake:
    type: azure_adls
    account: ${PROD_ACCOUNT}
```
### 7. Centralized SQL Server System Catalog
Store system metadata in a central SQL Server for unified observability:
```yaml
system:
  connection: local_storage
  path: .odibi/system

environments:
  prod:
    system:
      connection: sql_server
      schema_name: odibi_system
      environment: prod
      sync_from:
        connection: local_storage
        path: .odibi/system

connections:
  local_storage:
    type: local
    base_path: ./
  sql_server:
    type: sql_server
    host: central-server.database.windows.net
    database: odibi_metadata
```
In production, the SQL Server backend:

- Auto-creates the schema and tables
- Stores `meta_runs` and `meta_state`
- Enables syncing local dev data to the central location
With this in place, local development data is synced into SQL Server via the `sync_from` settings shown above.