Core Concepts: Scope, Entities, and Columns
Every simulation is built from three concepts. Understand these and you can generate any dataset.
How Simulation Works
┌─────────┐     ┌──────────┐     ┌─────────┐
│  Scope  │  →  │ Entities │  →  │ Columns │
│ (when)  │     │  (who)   │     │ (what)  │
└─────────┘     └──────────┘     └─────────┘
- Scope: when and how much data (start time, interval, row count)
- Entities: who generates data (sensors, users, machines, production lines)
- Columns: what data each entity produces (temperatures, IDs, statuses)
Each entity gets its own independent copy of row_count rows, so 5 entities × 100 rows = 500 total rows.
The simulation engine evaluates these in order: scope sets the time axis, entities multiply the dataset, and columns fill it with values.
Scope: When and How Much
Scope defines the time boundaries and size of your simulation. It answers: "When does the data start, how often are readings taken, and how many rows do I get?"
scope:
  start_time: "2026-01-01T00:00:00Z"  # When data starts
  timestep: "5m"                      # Interval between rows
  row_count: 288                      # Rows per entity
  seed: 42                            # Makes output reproducible
Parameters
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| start_time | string | Yes | - | ISO8601 timestamp in Zulu format (e.g., "2026-01-01T00:00:00Z") |
| timestep | string | Yes | - | Time between rows. Format: number + unit, where unit is s (seconds), m (minutes), h (hours), or d (days). E.g., 5s, 10m, 1h, 2d |
| row_count | int | Either this or end_time | - | Exact number of rows to generate per entity |
| end_time | string | Either this or row_count | - | ISO8601 end timestamp in Zulu format (e.g., "2026-01-02T00:00:00Z"); Odibi calculates the row count automatically |
| seed | int | No | 42 | Random seed for deterministic output |
start_time
The timestamp of the first row. Use ISO8601 format with a Z suffix (UTC), for example "2026-01-01T00:00:00Z".
Every subsequent row's timestamp increments by timestep from this starting point.
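To make the increment concrete, here is a small sketch that uses only the scope keys described above. The timestamps in the comments are worked out by hand from start_time and timestep; they are not copied from real Odibi output.

```yaml
scope:
  start_time: "2026-01-01T00:00:00Z"   # timestamp of the first row
  timestep: "15m"                      # added once for each subsequent row
  row_count: 4
# Expected timestamps for each entity:
#   row 1: 2026-01-01T00:00:00Z
#   row 2: 2026-01-01T00:15:00Z
#   row 3: 2026-01-01T00:30:00Z
#   row 4: 2026-01-01T00:45:00Z
```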
timestep
The time interval between consecutive rows. Supported formats:
| Format | Meaning | Example |
|---|---|---|
| Ns | N seconds | 5s = 5 seconds |
| Nm | N minutes | 10m = 10 minutes |
| Nh | N hours | 1h = 1 hour |
| Nd | N days | 2d = 2 days |
row_count vs end_time
You must specify exactly one of these; they are mutually exclusive. Odibi validates this at configuration load time and raises an error if you provide both or neither.
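As an illustration of that rule, the fragment below sets both keys and would therefore be rejected when the configuration is loaded. The exact error message is not reproduced here because this section does not document it.

```yaml
# Invalid: row_count and end_time are mutually exclusive
scope:
  start_time: "2026-01-01T00:00:00Z"
  timestep: "5m"
  row_count: 288
  end_time: "2026-01-02T00:00:00Z"   # keep either this line or row_count, not both
```

Omitting both keys fails the same check.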
Option A: Fixed count. You know how many rows you want:
scope:
  start_time: "2026-01-01T00:00:00Z"
  timestep: "5m"
  row_count: 288  # Exactly 288 rows per entity
Option B: Time range. You know the time period you want to cover:
scope:
  start_time: "2026-01-01T00:00:00Z"
  timestep: "5m"
  end_time: "2026-01-02T00:00:00Z"  # Generate until this time
Odibi calculates the row count by dividing the time range by the timestep. In this example: 24 hours ÷ 5 minutes = 288 rows.
Quick math
row_count × timestep = duration. So 288 rows × 5 minutes = 1,440 minutes = 1 day.
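The same arithmetic can be used to size a longer range. This is a sketch that assumes the row count is simply the time range divided by the timestep, as in the 24-hour example above:

```yaml
scope:
  start_time: "2026-01-01T00:00:00Z"
  timestep: "1h"
  end_time: "2026-01-08T00:00:00Z"   # 7 days after start_time
# 7 days x 24 hours = 168 hours; 168 hours / 1 hour = 168 rows per entity
```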
seed
The random seed controls all randomness in the simulation. Same seed = same output, every time, on every machine.
Why it matters:
- Reproducibility: teammates get identical data from the same config
- Debugging: regenerate the exact dataset that caused an issue
- Testing: assertions can rely on specific values
- CI/CD: pipeline tests produce deterministic results
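For test fixtures it can help to pin the seed explicitly rather than rely on the default of 42, so the intent is visible in the config. A small sketch; the value 1234 is arbitrary:

```yaml
scope:
  start_time: "2026-01-01T00:00:00Z"
  timestep: "5m"
  row_count: 288
  seed: 1234   # explicit seed: the same config reproduces the same rows
```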
Entities: Who Generates Data
Entities are the "things" producing data: sensors, users, machines, production lines, pumps, reactors, or anything else in your domain. Each entity generates its own independent stream of data.
Auto-generated entities
Use count and id_prefix to auto-generate entity names. With a count of 10 and an id_prefix of "sensor_", this produces: sensor_01, sensor_02, ... sensor_10.
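A minimal configuration consistent with that output might look like the following. The values are inferred from the IDs listed above (ten entities, a sensor_ prefix) and rely on the default sequential id_format.

```yaml
entities:
  count: 10
  id_prefix: "sensor_"   # IDs become sensor_01 ... sensor_10
```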
Explicit entity names
Use names when you need specific, meaningful identifiers instead of numbered ones.
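For instance, a plant with a few named reactors could be described as below. The names themselves are hypothetical and chosen only for illustration.

```yaml
entities:
  names:
    - reactor_north     # hypothetical entity names
    - reactor_south
    - reactor_pilot
```

Because names and count are mutually exclusive, count is left out here.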
Parameters
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| count | int | Either this or names | - | Number of entities to auto-generate |
| names | list[str] | Either this or count | - | Explicit list of entity names |
| id_prefix | string | No | "entity_" | Prefix for auto-generated IDs (used with count) |
| id_format | string | No | "sequential" | "sequential" (e.g., sensor_01) or "uuid" (e.g., sensor_a1b2c3d4) |
Validation rules
- You must specify exactly one of count or names (not both, not neither)
- names list cannot be empty
- count must be greater than 0
How entities multiply your data
Each entity generates its own independent copy of row_count rows. The total output is the number of entities × row_count:
| Entities | Rows per entity | Total rows |
|---|---|---|
| 3 sensors | 288 | 864 |
| 10 pumps | 1,000 | 10,000 |
| 50 machines | 10,000 | 500,000 |
Each entity's data is generated independently with its own random state (derived from the global seed), so entity sensor_01 always produces the same values regardless of how many other entities exist.
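The two runs below sketch what that independence means in practice. Both use the same scope and seed and differ only in the entity count; column definitions are omitted and assumed identical in both runs.

```yaml
# Run A: three sensors
scope: {start_time: "2026-01-01T00:00:00Z", timestep: "5m", row_count: 288, seed: 42}
entities: {count: 3, id_prefix: "sensor_"}    # 3 x 288 = 864 rows
---
# Run B: same scope and seed, ten sensors
scope: {start_time: "2026-01-01T00:00:00Z", timestep: "5m", row_count: 288, seed: 42}
entities: {count: 10, id_prefix: "sensor_"}   # 10 x 288 = 2,880 rows
# Per the rule above, the rows for sensor_01 should be identical in run A and run B.
```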
id_format
Controls how auto-generated IDs are formatted:
- "sequential" (default): zero-padded numbers, e.g., sensor_01, sensor_02, ...
- "uuid": UUID-based suffixes, e.g., sensor_a1b2c3d4, sensor_e5f6a7b8, ...
UUID format is useful when you need globally unique identifiers or want to simulate distributed systems where IDs aren't sequential.
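Switching formats is a one-line change in the entities block. A minimal sketch, using only the parameters listed in the table above:

```yaml
entities:
  count: 3
  id_prefix: "sensor_"
  id_format: "uuid"   # IDs look like sensor_a1b2c3d4 instead of sensor_01
```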
Columns: What Gets Generated
Columns define the actual data produced by each entity. Every column has a name, a data type, and a generator that determines how values are created.
The ColumnGeneratorConfig structure
columns:
  - name: temperature        # Column name in the output DataFrame
    data_type: float         # Python/Spark data type
    generator:               # How values are produced
      type: range
      min: 60.0
      max: 100.0
    null_rate: 0.02          # 2% of values will be NULL
    entity_overrides:        # Per-entity customization
      heavy_duty_pump:
        type: range
        min: 80.0
        max: 120.0
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| name | string | Yes | - | Column name in the output DataFrame |
| data_type | string | Yes | - | One of: string, int, float, boolean, timestamp, categorical |
| generator | object | Yes | - | Generator config with type and type-specific parameters |
| null_rate | float | No | 0.0 | Probability of NULL per value (0.0–1.0) |
| entity_overrides | dict | No | {} | Map of entity name → alternative generator config |
Data types
| Type | Python Type | Example Value | Use Case |
|---|---|---|---|
| string | str | "sensor_01" | IDs, names, labels, categories |
| int | int | 42 | Counts, sequence numbers, discrete values |
| float | float | 98.6 | Measurements, temperatures, percentages |
| boolean | bool | True | Flags, on/off states, binary conditions |
| timestamp | datetime | 2026-01-01T00:00:00Z | Event times, row timestamps |
| categorical | str | "Running" | Status codes, categories (explicit enum) |
null_rate: Simulating missing data
Real-world data has gaps. Use null_rate to inject NULLs randomly:
- name: optional_reading
  data_type: float
  generator: {type: range, min: 0, max: 100}
  null_rate: 0.1  # Each value has a 10% chance of being NULL
Typical settings:
- 0.0 (default): no NULLs
- 0.05: about 5% missing (occasional sensor dropout)
- 0.1: about 10% missing (unreliable sensor)
- 0.5: about 50% missing (intermittent connectivity)
entity_overrides: Per-entity customization
Override the generator for specific entities. The base generator applies to all entities not listed in overrides.
- name: pressure
  data_type: float
  generator:
    type: range
    min: 50
    max: 100
  entity_overrides:
    heavy_duty_pump:  # This entity gets a higher range
      type: range
      min: 100
      max: 200
    old_pump:  # This entity gets more variance
      type: range
      min: 30
      max: 150
In this example:
- heavy_duty_pump generates pressure between 100 and 200
- old_pump generates pressure between 30 and 150
- All other entities use the default range of 50 to 100
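Since entity_overrides is keyed by entity name, the override keys need matching entries in the entities block. A sketch of an entities block that would pair with the pressure column above; pump_03 is a hypothetical third entity added to show the fallback:

```yaml
entities:
  names:
    - heavy_duty_pump   # picks up the 100-200 override
    - old_pump          # picks up the 30-150 override
    - pump_03           # hypothetical name; not listed in overrides, so it uses the base 50-100 range
```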
Complete column example
Here's a column using every available option:
- name: vibration
  data_type: float
  generator:
    type: range
    min: 0.1
    max: 5.0
    distribution: normal
    mean: 1.2
    std_dev: 0.8
  null_rate: 0.03
  entity_overrides:
    failing_motor:
      type: range
      min: 5.0
      max: 25.0
    new_motor:
      type: range
      min: 0.05
      max: 0.5
This generates:
- Most entities: normal distribution centered at 1.2 mm/s, with 3% NULLs
- failing_motor: elevated vibration (5–25 mm/s) indicating wear
- new_motor: very low vibration (0.05–0.5 mm/s) indicating new equipment
Column Dependency Resolution
Derived columns can reference other columns by name in their expressions. Odibi automatically resolves the correct evaluation order using a topological sort, so you don't need to worry about listing columns in the right order.
How it works
- Odibi scans all derived column expressions for column name references
- It builds a dependency graph (column A depends on column B)
- It performs a topological sort to determine safe evaluation order
- Columns are evaluated in dependency order, regardless of their position in the YAML
Example: Temperature conversion
columns:
  # This column has no dependencies, so it is evaluated first
  - name: temp_celsius
    data_type: float
    generator:
      type: range
      min: 20.0
      max: 35.0
  # This column references temp_celsius, so Odibi evaluates it afterwards
  - name: temp_fahrenheit
    data_type: float
    generator:
      type: derived
      expression: "temp_celsius * 1.8 + 32"
Even if you reversed the order in YAML (put temp_fahrenheit first), Odibi would still evaluate temp_celsius first because it detects the dependency.
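For reference, the reversed listing described above would look like this. It is the same two columns with only their positions swapped, and by the dependency rule it should produce the same result.

```yaml
columns:
  # Listed first, but evaluated second because it references temp_celsius
  - name: temp_fahrenheit
    data_type: float
    generator:
      type: derived
      expression: "temp_celsius * 1.8 + 32"
  # Listed second, but evaluated first because it has no dependencies
  - name: temp_celsius
    data_type: float
    generator:
      type: range
      min: 20.0
      max: 35.0
```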
Chained dependencies
Dependencies can chain through multiple levels:
columns:
  - name: raw_output
    data_type: float
    generator: {type: range, min: 100, max: 500}
  - name: raw_input
    data_type: float
    generator: {type: range, min: 80, max: 400}
  - name: efficiency
    data_type: float
    generator:
      type: derived
      expression: "(raw_output / raw_input * 100) if raw_input > 0 else 0"
  - name: efficiency_grade
    data_type: string
    generator:
      type: derived
      expression: "'A' if efficiency > 90 else 'B' if efficiency > 70 else 'C'"
Evaluation order: raw_output → raw_input → efficiency → efficiency_grade
Circular dependencies
If column A depends on B and B depends on A, Odibi raises an error at config validation time. Circular dependencies are detected before any data is generated.
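A minimal configuration that would trigger this check is sketched below. The column names col_a and col_b are hypothetical, and the exact error text is not shown because this section does not document it.

```yaml
# Invalid: col_a and col_b reference each other, forming a cycle
columns:
  - name: col_a
    data_type: float
    generator:
      type: derived
      expression: "col_b + 1"   # col_a depends on col_b
  - name: col_b
    data_type: float
    generator:
      type: derived
      expression: "col_a * 2"   # col_b depends on col_a
```

Replacing either generator with a non-derived one (for example, a range) breaks the cycle.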
The YAML Structure
Here's how the three building blocks fit into a complete pipeline node:
read:
  connection: null  # No connection needed for simulation
  format: simulation
  options:
    simulation:
      scope:  # WHEN: time boundaries
        start_time: "2026-01-01T00:00:00Z"
        timestep: "5m"
        row_count: 288
        seed: 42
      entities:  # WHO: data producers
        count: 3
        id_prefix: "sensor_"
      columns:  # WHAT: data columns
        - name: sensor_id
          data_type: string
          generator: {type: constant, value: "{entity_id}"}
        - name: timestamp
          data_type: timestamp
          generator: {type: timestamp}
        - name: temperature
          data_type: float
          generator: {type: range, min: 20, max: 35}
      chaos:  # Optional: inject noise
        outlier_rate: 0.01
        outlier_factor: 3.0
      scheduled_events: []  # Optional: time/condition-based behavior changes
The scope, entities, and columns keys are required. chaos and scheduled_events are optional and covered in Advanced Features.
What's Next
Now that you understand the three building blocks, dive deeper:
- Generators: All 13 generator types in detail (range, random_walk, daily_profile, categorical, derived, and more)
- Stateful Functions: prev(), ema(), pid(), delay() for dynamic, time-dependent data
- Safe Functions Reference: Complete list of all functions available in derived expressions
- Advanced Features: Chaos engineering, scheduled events, downtime periods, and entity overrides at scale