Getting Started with Odibi¶
This tutorial will guide you through creating your first data pipeline. By the end, you will have a running project that reads data, cleans it, and generates an audit report ("Data Story").
Prerequisites:
* Python 3.9 or higher installed.
* Basic familiarity with the terminal/command line.
1. Installation¶
First, install Odibi. We recommend creating a virtual environment to keep your system clean.
# 1. Create a virtual environment (optional but recommended)
python -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
# 2. Install Odibi
pip install odibi
Note: If you plan to use Spark or Azure later, you can install the extras with pip install "odibi[spark,azure]", but for this tutorial, the base package is enough.
2. Create Sample Data¶
Odibi shines when working with messy real-world data. Let's create some "bad" data to clean.
Create a folder named raw_data and a file inside it named customers.csv:
raw_data/customers.csv
id, name, email, joined_at
1, Alice, alice@example.com, 2023-01-01
2, Bob, bob@example.com, 2023-02-15
3, Charlie, NULL, 2023-03-10
4, Dave, dave@example.com, invalid-date
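If you prefer to create the folder and file in one step, you can do it from the shell with a here-document:

```shell
# create the folder and write the sample CSV in one go
mkdir -p raw_data
cat > raw_data/customers.csv <<'EOF'
id, name, email, joined_at
1, Alice, alice@example.com, 2023-01-01
2, Bob, bob@example.com, 2023-02-15
3, Charlie, NULL, 2023-03-10
4, Dave, dave@example.com, invalid-date
EOF
```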
3. Generate Your Project¶
Instead of writing configuration files from scratch, use the Odibi Initializer. It creates a project skeleton with best practices baked in.
Run this command in your terminal:
odibi init my_first_project
This creates a new folder my_first_project with a standard structure:
* odibi.yaml: The pipeline configuration.
* data/: Folders for your data layers (landing, raw, silver, etc.).
* README.md: Instructions for your project.
Move your sample data into the landing zone:
# Works the same in PowerShell (Windows) and on Mac/Linux
mv raw_data/customers.csv my_first_project/data/landing/
Note:
odibi init, odibi create, and odibi generate-project are all aliases for odibi init-pipeline.
4. Explore the Project¶
Navigate into your new project:
cd my_first_project
You will see a file structure like this:
* odibi.yaml: The brain of your project. It defines the pipeline.
* sql/: Contains SQL transformation files.
* data/: (Created automatically) Where data will be stored.
Open odibi.yaml in your text editor. You will see two "nodes" (steps):
1. Ingestion Node: Reads the customers.csv from landing/.
2. Refinement Node: Merges the data into silver/.
Since we used the template, the config is already set up to look for landing/customers.csv.
5. Run the Pipeline¶
Now, execute the pipeline:
odibi run odibi.yaml
Odibi will:
1. Read customers.csv from landing/.
2. Convert it to Parquet in raw/.
3. Merge it into a Delta/Parquet table in silver/.
4. Generate a "Data Story".
6. View the Data Story¶
Data engineering is often invisible. Odibi makes it visible. Every run generates a report.
List the generated stories:
You will see output like:
📚 Stories in .odibi/stories:
================================================================================
📄 main_documentation.html
Modified: 2025-11-21 14:30:00
Size: 15.2KB
Path: .odibi/stories/main_documentation.html
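If you just want to see the files, you can also inspect the stories directory directly with standard shell tools (this assumes the default .odibi/stories location shown above):

```shell
# list generated stories directly from the default location
ls -lh .odibi/stories 2>/dev/null || echo "no stories generated yet"
```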
Open the HTML file in your browser to view the report:
- Windows: start .odibi/stories/main_documentation.html
- Mac: open .odibi/stories/main_documentation.html
- Linux: xdg-open .odibi/stories/main_documentation.html
What to look for in the report:
* Row Counts: Did we lose any rows?
* Schema: Did the column types change?
* Execution Time: How long did it take?
7. Add Data Validation¶
Data pipelines are only as good as their data quality. Let's add validation tests to catch bad data before it corrupts your warehouse.
Inline Validation in YAML¶
Add validation tests directly to your node:
nodes:
  - name: customers
    read:
      connection: landing
      format: csv
      path: customers.csv
    validation:
      mode: warn  # or "fail" to stop the pipeline
      tests:
        - type: not_null
          columns: [id, name]
        - type: unique
          columns: [id]
        - type: row_count
          min: 1
    write:
      connection: raw
      format: parquet
      path: customers
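Against the sample data from step 2, the not_null test would flag Charlie's missing email. You can preview such rows yourself with plain shell tools before running the pipeline (this assumes you are in the project root and the file is in data/landing/; adjust the path otherwise):

```shell
# preview rows containing a literal NULL value in the landing file
grep -n 'NULL' data/landing/customers.csv || echo "no NULL values found"
```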
Using Contracts for Input Validation¶
Contracts validate data before processing:
nodes:
  - name: validate_orders
    contracts:
      - type: not_null
        columns: [order_id, customer_id, amount]
      - type: freshness
        column: created_at
        max_age: "24h"
    read:
      connection: landing
      path: orders.csv
    write:
      connection: raw
      path: orders
If contracts fail, the pipeline stops immediately with clear error messages.
Running Validation¶
Run the pipeline again and watch for validation warnings:
odibi run odibi.yaml
Validation results appear in both the console output and the Data Story.
8. Building Dimensions (SCD2)¶
Once you're comfortable with basic pipelines, you can build proper dimensional models. Here's a quick example of a Slowly Changing Dimension Type 2:
nodes:
  - name: dim_customer
    read:
      connection: bronze
      table: raw_customers
    pattern:
      type: dimension
      params:
        natural_key: customer_id      # Business key
        surrogate_key: customer_sk    # Generated integer key
        scd_type: 2                   # Track history
        track_cols: [name, email, city]
        target: silver.dim_customer   # Read existing for merge
        unknown_member: true          # Add SK=0 for orphans
    write:
      connection: silver
      table: dim_customer
What this does:
- Generates integer surrogate keys (customer_sk)
- Tracks changes to name, email, city over time
- Maintains is_current, valid_from, valid_to columns
- Creates an "unknown" row (SK=0) for handling orphan fact records
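For intuition, suppose Alice later changes her email. After the change is merged, the dimension would contain rows like these (illustrative values only; the exact date and NULL conventions depend on the engine):

```
customer_sk | customer_id | email             | valid_from | valid_to   | is_current
0           | NULL        | NULL              | NULL       | NULL       | NULL
1           | 1           | alice@example.com | 2023-01-01 | 2024-06-01 | false
2           | 1           | alice@new.example | 2024-06-01 | NULL       | true
```

The old row is closed out (valid_to set, is_current false) and a new row with a fresh surrogate key becomes current; row 0 is the "unknown" member for orphan facts.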
For a complete dimensional modeling tutorial, see Dimensional Modeling.
Troubleshooting¶
"ModuleNotFoundError: No module named 'odibi'"¶
Cause: Odibi not installed or virtual environment not activated.
Fix:
# Activate your virtual environment first
source .venv/bin/activate # Linux/Mac
.venv\Scripts\activate # Windows
# Then verify installation
pip show odibi
Pipeline runs but no output files¶
Causes:
- Write path doesn't exist
- Permission denied on output directory
- Dry-run mode enabled
Fix:
# Check if dry-run is enabled (remove --dry-run flag)
odibi run odibi.yaml
# Ensure output directory exists
mkdir -p data/silver
"No such file or directory" for input data¶
Cause: File path in config doesn't match actual location.
Fix: Verify the path relative to where you run the command:
# If config says: path: landing/customers.csv
# File should be at: ./data/landing/customers.csv (relative to base_path)
ls data/landing/customers.csv
Story not generated¶
Causes:
- Story connection not configured
- Story path doesn't exist
Fix: Ensure your config has a story section:
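As a minimal sketch, a story section might look like the following; the key names (story, connection, path) are assumptions patterned on the node config style above, and the default output location is the one shown in step 6. Check your Odibi version's reference for the exact schema:

```yaml
story:
  connection: local
  path: .odibi/stories
```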
9. What's Next?¶
You have successfully built a data pipeline with data validation!
- Incremental Loading: Learn how to efficiently process only new data using State Tracking ("Auto-Pilot").
- Write Custom Transformations: Learn how to add Python logic (like advanced validation) to your pipeline.
- Data Validation Guide: Deep dive into all validation options.
- Spark Engine Tutorial: Scale up with Apache Spark.
- Azure Connections: Connect to Azure Blob, ADLS, and SQL.
- Master the CLI: Learn about odibi doctor and advanced CLI commands.