Simulation
Generate realistic synthetic data for any domain - manufacturing, operations, IoT, process control, business - directly in your pipeline YAML. No code needed.
Simulation is Odibi's built-in data generator. Define what you need in YAML, and the framework produces time-series, categorical, relational, and process data that behaves like the real thing. Swap to real sources later - your downstream pipeline stays unchanged.
Who built this
Odibi was built by a chemical engineer turned data engineer - the only DE on an analytics team in operations, not IT. Simulation exists because I needed realistic process data and couldn't wait for IT to provision it. I know what a PID controller is, what a material balance looks like, and what happens when your test data doesn't behave like the real thing. Every pattern in this library comes from a real problem I've solved. The unique angle: this isn't just a data tool. It's an engineering tool built by someone who understands both sides.
read:
  format: simulation
  options:
    simulation:
      scope: { start_time: "2026-01-01", timestep: "5m", row_count: 288, seed: 42 }
      entities: { count: 3, id_prefix: "sensor_" }
      columns:
        - name: temperature
          data_type: float
          generator: { type: random_walk, start: 72, min: 60, max: 90, volatility: 0.5 }
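To make the `random_walk` parameters concrete, here is a minimal Python sketch of a bounded random walk using the same numbers as the YAML above. It is not Odibi's implementation: the function name `random_walk`, the reading of `volatility` as the standard deviation of each step, and clamping at `min`/`max` are all assumptions made for illustration.

```python
import random
from datetime import datetime, timedelta


def random_walk(start, lo, hi, volatility, steps, seed):
    """Bounded random walk: add Gaussian noise each step, then clamp to [lo, hi].

    Hypothetical helper; treating `volatility` as the per-step standard
    deviation is an assumption, not Odibi's documented behavior.
    """
    rng = random.Random(seed)  # private RNG, so the seed fully determines the output
    value = start
    values = []
    for _ in range(steps):
        values.append(value)
        value = min(hi, max(lo, value + rng.gauss(0, volatility)))
    return values


# Mirror the scope above: 288 rows at a 5-minute timestep cover exactly one day.
t0 = datetime(2026, 1, 1)
timestamps = [t0 + timedelta(minutes=5 * i) for i in range(288)]
temperature = random_walk(start=72, lo=60, hi=90, volatility=0.5, steps=288, seed=42)

print(timestamps[0], "to", timestamps[-1])
print(len(temperature), min(temperature) >= 60, max(temperature) <= 90)
```

Calling `random_walk` twice with the same seed returns identical lists, which is the property the `seed: 42` setting is meant to guarantee.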
Key Capabilities
| Capability | What You Get |
|---|---|
| 13 Generator Types | range · random_walk · daily_profile · categorical · boolean · timestamp · sequential · constant · derived · uuid · email · ipv4 · geo |
| Stateful Functions | prev(), ema(), pid(), delay() - values that depend on history, for dynamic process simulation |
| Cross-Entity References | One entity reacts to another: downstream sensor reads upstream output |
| Entity Overrides | Per-entity behavior variation - entity A runs hot, entity B runs cold |
| Scheduled Events | Maintenance windows, setpoint changes, recurring events, condition-based triggers, ramp transitions |
| Chaos Engineering | Outliers, duplicates, downtime gaps, null injection - realistic imperfections |
| Incremental Mode | Continuous data generation with high-water-mark (HWM) state - each run picks up where the last left off |
| Deterministic | Same seed = same output, every time |
| Multi-Engine | Same YAML works on Pandas, Spark, and Polars |
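As a rough illustration of what "values that depend on history" means, the sketch below implements an exponential moving average and a fixed-row delay in plain Python. The names `ema` and `delay` echo the stateful functions in the table, but the signatures and parameters (`alpha`, `steps`, `fill`) are assumptions for this example, not Odibi's API.

```python
from collections import deque


def ema(values, alpha):
    """Exponential moving average: out[t] = alpha * x[t] + (1 - alpha) * out[t-1]."""
    out = []
    prev = None  # no history yet, so the first output is the first input
    for x in values:
        prev = x if prev is None else alpha * x + (1 - alpha) * prev
        out.append(prev)
    return out


def delay(values, steps, fill):
    """Shift a series by `steps` rows, padding the start with `fill`."""
    buf = deque([fill] * steps)  # holds the rows still "in transit"
    out = []
    for x in values:
        buf.append(x)
        out.append(buf.popleft())
    return out


raw = [10.0, 20.0, 20.0, 20.0]
print(ema(raw, alpha=0.5))            # [10.0, 15.0, 17.5, 18.75]
print(delay(raw, steps=2, fill=0.0))  # [0.0, 0.0, 10.0, 20.0]
```

Both functions carry state from one row to the next, which is what separates them from stateless generators such as `range` or `constant`.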
Learning Path
Work through the simulation docs in order, or jump to what you need:
| Page | What You'll Learn |
|---|---|
| :material-star: Why Odibi Simulation | What makes this different and why it matters |
| :material-rocket-launch: Getting Started | Your first simulation in 5 minutes |
| :material-book-open-variant: Core Concepts | Scope, entities, and columns - the three building blocks |
| :material-format-list-bulleted-type: Generators Reference | All 13 generator types with parameters and examples |
| :material-function-variant: Stateful Functions | prev(), ema(), pid(), delay() - history-dependent values |
| :material-puzzle: Advanced Features | Cross-entity references, overrides, scheduled events (recurring, condition-based, ramp), chaos |
| :material-sync: Incremental Mode | Continuous data generation across pipeline runs |
| :material-chef-hat: Patterns & Recipes | Real-world scenarios: IoT fleets, batch reactors, order streams |
| :material-flask: Process Simulation | ChemE and process control: FOPTD, PID loops, reactor dynamics |
When to Use Simulation
- Build pipelines before source data exists - design transforms, test patterns, validate schema now
- Test with safe, reproducible data - no PII, no compliance headaches
- Stress test Delta Lake at scale - 1,000 entities × 10,000 rows = 10M rows from a single YAML node
- Demo without exposing real data - realistic enough for stakeholders, safe enough for anywhere
- Simulate manufacturing, operations, and IoT - sensors, PLCs, batch processes, alarms for a local data platform
- Prototype analytics before production data arrives - build dashboards on synthetic facts and dimensions
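The stress-test arithmetic in the list above can be checked before running anything. The helper below is a hypothetical sizing sketch, not part of Odibi; it assumes the row count applies per entity (as the 1,000 × 10,000 figure implies) and borrows the 5-minute timestep from the earlier example.

```python
from datetime import timedelta


def simulation_size(entities, rows_per_entity, timestep):
    """Return (total rows, simulated time span covered by each entity)."""
    total_rows = entities * rows_per_entity
    span = timestep * (rows_per_entity - 1)  # the first row sits at start_time
    return total_rows, span


rows, span = simulation_size(1_000, 10_000, timedelta(minutes=5))
print(rows)  # 10000000
print(span)  # 34 days, 17:15:00
```

At a 5-minute timestep, 10,000 rows per entity span roughly five weeks of simulated time, so a shorter timestep or higher entity count may be a better lever if the goal is volume rather than history.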
Installation
Simulation is built into odibi core. No extra dependencies, no plugins - if you have odibi installed, you have simulation.
Next Steps
Start with Getting Started to generate your first dataset in under 5 minutes, or browse the Generators Reference to see what's available.