Incremental (Continuous) Simulation¶
Generate new data on every pipeline run, picking up exactly where the last run left off - using the same High Water Mark (HWM) system as all other Odibi read sources.
Why this matters
A simulation that runs once is a demo. A simulation that runs continuously on a schedule, with PID controllers that remember their integral state and random walks that continue from their last position, is a data platform. Incremental mode turns simulation into a continuous data feed - indistinguishable from a real source to everything downstream. Backfill a year of history on the first run, then add a day at a time. Your dashboard never knows the difference.
Overview¶
Incremental simulation turns a one-shot data generator into a continuous data feed. Instead of regenerating the same dataset every time, each pipeline run produces a fresh batch of rows that starts where the previous batch ended.
This works because simulation plugs into Odibi's standard incremental loading infrastructure: the same `incremental:` block you use for SQL, Delta, CSV, and every other read source. The simulator simply respects the High Water Mark to know where to start.
```
Run 1: Generates Jan 1 00:00–23:55 → Saves HWM: Jan 1 23:55
Run 2: Generates Jan 2 00:00–23:55 → Saves HWM: Jan 2 23:55
Run 3: Generates Jan 3 00:00–23:55 → Saves HWM: Jan 3 23:55
...
```
Each run picks up exactly where the last one left off. No gaps, no overlaps, no manual bookkeeping.
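Conceptually, the lifecycle looks like this sketch (illustrative only, not Odibi's actual API; `run_batch` is a hypothetical name):

```python
from datetime import datetime, timedelta

TIMESTEP = timedelta(minutes=5)
ROW_COUNT = 288  # one day of 5-minute rows

def run_batch(hwm, start_time):
    """One pipeline run: start at HWM + timestep, or at start_time on run 1."""
    start = start_time if hwm is None else hwm + TIMESTEP
    timestamps = [start + i * TIMESTEP for i in range(ROW_COUNT)]
    return timestamps, timestamps[-1]  # (batch, new HWM)

hwm = None
for _ in range(3):
    batch, hwm = run_batch(hwm, datetime(2026, 1, 1))

print(hwm)  # 2026-01-03 23:55:00 - three contiguous days, no gaps
```

Each batch begins exactly one timestep after the previous batch's last timestamp, which is what makes the runs stitch together seamlessly.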
Same config as all other sources
The incremental: block is the exact same config used for SQL, Delta, CSV, and every other read source. See Stateful Incremental Loading for full details on the incremental system.
Configuration¶
To enable incremental mode, add an `incremental:` block to your read node. That's it – no special simulation-specific flags.
```yaml
- name: sensor_stream
  read:
    connection: null
    format: simulation
    options:
      simulation:
        scope:
          start_time: "2026-01-01T00:00:00Z"
          timestep: "5m"
          row_count: 288   # 1 day of data per run
          seed: 42
        entities:
          count: 10
          id_prefix: "sensor_"
        columns:
          - name: sensor_id
            data_type: string
            generator: {type: constant, value: "{entity_id}"}
          - name: timestamp
            data_type: timestamp
            generator: {type: timestamp}
          - name: value
            data_type: float
            generator: {type: range, min: 0, max: 100}
    incremental:
      mode: stateful
      column: timestamp   # Must match a timestamp column
  write:
    connection: my_lake
    format: delta
    table: sensor_stream
    mode: append          # Append each run's data
```
Key configuration points¶
| Field | Purpose |
|---|---|
| `incremental.mode` | Must be `stateful` for simulation |
| `incremental.column` | Must reference a timestamp column defined in your simulation |
| `scope.row_count` | Controls how many rows each run generates (per entity) |
| `scope.start_time` | Only used on the first run; subsequent runs start from the HWM |
| `write.mode` | Use `append` to accumulate data across runs |
row_count vs end_time
For incremental simulation, prefer `row_count` over `end_time`. With `row_count`, each run always produces a fixed batch regardless of where the HWM sits. With `end_time`, the run stops generating once the end is reached – see Troubleshooting for the "exhausted incremental window" error.
How It Works¶
Here's the step-by-step lifecycle of an incremental simulation:
First run (no existing HWM)¶
- Odibi checks the state backend for an existing HWM for this node and finds nothing
- Simulation starts at `scope.start_time` (`2026-01-01T00:00:00Z`)
- Generates `row_count` rows per entity (288 rows × 10 entities = 2,880 total)
- Timestamps span `00:00` to `23:55` (288 × 5 min = 1 day)
- After the write step, Odibi saves the maximum timestamp (`Jan 1 23:55`) as the HWM
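The first-run window arithmetic can be verified with a quick sketch (plain Python, independent of Odibi):

```python
from datetime import datetime, timedelta

start = datetime(2026, 1, 1)     # scope.start_time
step = timedelta(minutes=5)      # scope.timestep = "5m"
row_count = 288                  # rows per entity per run
entities = 10

last_ts = start + (row_count - 1) * step  # timestamp of the final row

assert last_ts == datetime(2026, 1, 1, 23, 55)  # saved as the HWM
assert row_count * step == timedelta(days=1)    # 288 × 5 min = 1 day
assert row_count * entities == 2880             # total rows in the batch
```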
Second run (HWM exists)¶
- Odibi reads the HWM: `2026-01-01T23:55:00Z`
- Simulation starts at HWM + one timestep = `2026-01-02T00:00:00Z`
- Generates another 288 rows per entity
- Timestamps span `Jan 2 00:00` to `Jan 2 23:55`
- Saves new HWM: `2026-01-02T23:55:00Z`
Subsequent runs¶
The pattern repeats indefinitely. Each run:
- Reads previous HWM
- Starts at HWM + timestep
- Generates `row_count` rows
- Saves new HWM
```
         ┌──────────────────┐
         │  State Backend   │
         │  (HWM storage)   │
         └────────┬─────────┘
                  │
    ┌─────────────┼─────────────┐
    ▼             ▼             ▼
┌─────────┐  ┌─────────┐  ┌─────────┐
│  Run 1  │  │  Run 2  │  │  Run 3  │
│  Jan 1  │─▶│  Jan 2  │─▶│  Jan 3  │─▶ ...
│ 288 rows│  │ 288 rows│  │ 288 rows│
└─────────┘  └─────────┘  └─────────┘
```
Stateful Function Persistence¶
When using incremental.mode: stateful, all stateful functions preserve their internal state between pipeline runs. This is the key difference between stateful mode and a hypothetical "restart-from-scratch" approach.
What gets preserved¶
| Function | What's Preserved | Why It Matters |
|---|---|---|
| `prev()` | Last value per entity | Running totals, level tracking, and lag responses continue seamlessly |
| `ema()` | Last smoothed value per entity | Exponential moving average doesn't reset to the default |
| `pid()` | Integral sum and last error per entity | PID controllers maintain their accumulated correction |
| `random_walk` | Last walk value per entity | The walk continues from its last position, no jump to start |
What this means in practice¶
Without stateful mode: Each run restarts all state from defaults. A tank level that ended at 63.2 m³ would jump back to the initial default (e.g., 50.0 m³) at the start of the next run.
With stateful mode: Run 2 picks up exactly where Run 1 left off – no discontinuities, no resets.
```
Run 1: level starts at 50.0, ends at 63.2 → state saved
Run 2: level starts at 63.2, ends at 58.7 → state saved
Run 3: level starts at 58.7, ...          → continuous
```
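A toy model makes the difference concrete (illustrative only; `run` and its ±1 step are stand-ins for a real level simulation, not Odibi's implementation):

```python
import random

DEFAULT_LEVEL = 50.0

def run(start_level, seed, steps=288):
    """One run of a toy tank level that builds on its previous value."""
    rng = random.Random(seed)
    level = start_level
    for _ in range(steps):
        level = min(100.0, max(0.0, level + rng.uniform(-1.0, 1.0)))
    return level

end1 = run(DEFAULT_LEVEL, seed=1)       # Run 1 ends wherever the walk took it
stateful = run(end1, seed=2)            # Run 2 resumes from Run 1's end
stateless = run(DEFAULT_LEVEL, seed=2)  # Run 2 snaps back to 50.0 first
```

In stateful mode the second run's trajectory is anchored to `end1`; in the stateless version it discards that history and restarts from the default.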
Example: Continuous random walk across runs¶
The `random_walk` generator's last value per entity is persisted. This means the walk is smooth and continuous across run boundaries – no jumps back to start.
```yaml
- name: continuous_pressure
  read:
    connection: null
    format: simulation
    options:
      simulation:
        scope:
          start_time: "2026-01-01T00:00:00Z"
          timestep: "5m"
          row_count: 288
          seed: 42
        entities:
          count: 5
          id_prefix: "pump_"
        columns:
          - name: pump_id
            data_type: string
            generator: {type: constant, value: "{entity_id}"}
          - name: timestamp
            data_type: timestamp
            generator: {type: timestamp}
          - name: pressure_psi
            data_type: float
            generator:
              type: random_walk
              start: 150.0
              min: 100.0
              max: 200.0
              volatility: 0.5
              mean_reversion: 0.05
    incremental:
      mode: stateful
      column: timestamp
  write:
    connection: my_lake
    format: delta
    table: pump_pressure
    mode: append
```
Run 1: `pump_01`'s `pressure_psi` ends at 147.3.
Run 2: `pump_01`'s `pressure_psi` starts at 147.3 (not 150.0) and continues its walk naturally.
Without stateful mode, Run 2 would reset `pump_01` to `start: 150.0` – creating an unrealistic jump in the time series.
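The continuity can be illustrated with a small mean-reverting walk (a sketch of the idea using the config's parameters; `walk_batch` is hypothetical, not Odibi's actual `random_walk` code):

```python
import random

def walk_batch(last_value, rng, n=288, start=150.0, lo=100.0, hi=200.0,
               volatility=0.5, mean_reversion=0.05):
    """Bounded, mean-reverting walk that resumes from a persisted value."""
    value = start if last_value is None else last_value
    out = []
    for _ in range(n):
        # Pull gently back toward the start value, then add noise and clamp.
        value += mean_reversion * (start - value) + rng.gauss(0.0, volatility)
        value = min(hi, max(lo, value))
        out.append(value)
    return out

rng = random.Random(42)
run1 = walk_batch(None, rng)      # Run 1: starts at 150.0
run2 = walk_batch(run1[-1], rng)  # Run 2: resumes from Run 1's last value
```

Because `run2` seeds its first step from `run1[-1]` rather than from `start`, the concatenated series has no discontinuity at the run boundary.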
Determinism¶
Incremental simulation is fully deterministic. Same seed + same HWM = identical output.
The RNG state for each run is derived from:
- The base seed (`scope.seed`)
- The HWM timestamp (or `start_time` for the first run)
- The entity index (each entity has its own independent RNG stream)
This means:
- Reproducibility – given the same config and the same HWM, any machine will produce identical data
- Debugging – you can replay any specific run by manually setting the HWM to the value it had before that run
- CI/CD – pipeline tests produce deterministic results even in incremental mode
- Environment parity – development, staging, and production generate the same data from the same state
Seed advancement
The seed doesn't just increment by 1 per run. It's derived deterministically from the HWM timestamp, so even if you skip a run or adjust the HWM manually, the output is still reproducible.
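One way to picture the derivation (purely illustrative; Odibi's actual scheme may differ, and `run_seed` is a hypothetical name):

```python
import hashlib

def run_seed(base_seed, hwm_iso, entity_idx):
    """Derive a per-run, per-entity seed as a pure function of state."""
    key = f"{base_seed}|{hwm_iso}|{entity_idx}".encode()
    return int.from_bytes(hashlib.sha256(key).digest()[:8], "big")

# Same inputs always give the same seed, on any machine,
# even if a scheduled run was skipped or the HWM was edited by hand.
a = run_seed(42, "2026-01-01T23:55:00Z", 0)
b = run_seed(42, "2026-01-01T23:55:00Z", 0)
assert a == b
```

Because the seed depends only on `(base_seed, HWM, entity)` and not on how many runs have happened, replaying a run with the same HWM reproduces it exactly.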
Write Modes¶
Incremental simulation should be paired with the right write mode:
mode: append – Recommended¶
Each run's data is added to the existing table. This is the natural choice for incremental simulation – it mirrors how real streaming sources accumulate data.
mode: overwrite – Development/Testing¶
Replaces all existing data with the current run's output. Useful during development when you're iterating on the simulation config and don't want old data accumulating. Not recommended for production – you lose all historical data on each run.
mode: merge – Not recommended¶
The merge write mode is designed for SQL Server upsert operations and is not a natural fit for simulation data. Simulation generates new rows with new timestamps – there's nothing to merge against.
Common Patterns¶
Daily data feed¶
Generate one day of data per run, schedule with cron (or Databricks Jobs) to run daily:
```yaml
scope:
  start_time: "2026-01-01T00:00:00Z"
  timestep: "5m"
  row_count: 288   # 288 × 5 min = 1,440 min = 1 day
  seed: 42
```
Schedule the pipeline to run once per day. Each execution adds exactly one day of simulated data.
Hourly micro-batches¶
Generate one hour of data per run for higher-frequency processing:
```yaml
scope:
  start_time: "2026-01-01T00:00:00Z"
  timestep: "5m"
  row_count: 12   # 12 × 5 min = 60 min = 1 hour
  seed: 42
```
Schedule hourly. Each run adds one hour of data – useful for near-real-time dashboards or testing hourly aggregation patterns.
Backfill then switch to increments¶
Start with a large historical dataset on the first run, then switch to smaller increments:
```yaml
scope:
  start_time: "2025-01-01T00:00:00Z"   # Start a year ago
  timestep: "5m"
  row_count: 105120                    # 365 days of 5-min data
  seed: 42
```
- First run: Generates a full year of historical data (105,120 rows per entity)
- After first run: Edit `row_count` to `288` (1 day)
- Subsequent runs: Each adds one day, continuing from the HWM
This is a common pattern for populating dashboards with historical context before switching to daily increments.
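A quick sanity check of the backfill sizing:

```python
rows_per_day = 24 * 60 // 5          # 5-minute timestep: 288 rows per day
backfill_rows = rows_per_day * 365   # one year of history

assert rows_per_day == 288
assert backfill_rows == 105120       # matches scope.row_count above
```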
High-frequency sensor simulation¶
Simulate 1-second sensor data in 1-minute batches:
```yaml
scope:
  start_time: "2026-01-01T00:00:00Z"
  timestep: "1s"
  row_count: 60   # 60 × 1 s = 1 minute
  seed: 42
```
Schedule every minute for continuous high-frequency simulation.
Troubleshooting¶
"Exhausted incremental window"¶
Symptom: The pipeline runs but produces zero rows.
Cause: You used end_time instead of row_count, and the HWM has advanced past end_time. There's no more data to generate.
Solution: Switch from end_time to row_count for ongoing incremental simulation:
```yaml
# ❌ Will exhaust after the time range is covered
scope:
  start_time: "2026-01-01T00:00:00Z"
  end_time: "2026-01-31T00:00:00Z"
  timestep: "5m"
```

```yaml
# ✅ Generates a fixed batch on every run, indefinitely
scope:
  start_time: "2026-01-01T00:00:00Z"
  row_count: 288
  timestep: "5m"
```
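The failure mode is easy to see in a sketch (plain Python, illustrative; `rows_available` is a hypothetical helper, not an Odibi function):

```python
from datetime import datetime, timedelta

step = timedelta(minutes=5)

def rows_available(hwm, end_time):
    """With end_time set, a run can only cover hwm + step .. end_time."""
    start = hwm + step
    if start > end_time:
        return 0  # "exhausted incremental window": nothing left to generate
    return int((end_time - start) / step) + 1

end = datetime(2026, 1, 31)
assert rows_available(datetime(2026, 1, 29, 23, 55), end) == 289
assert rows_available(end, end) == 0  # HWM caught up: zero rows forever
```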
Duplicate data between runs¶
Symptom: The same timestamps appear in multiple runs' output.
Possible causes:
- Write mode is `overwrite` instead of `append` – each run replaces previous data, so it looks like duplicates if you inspect after re-running
- HWM not being saved – check that the state backend is configured and writable
- Manual HWM tampering – if the HWM was reset or set to an earlier value, data will be regenerated for that period
Solution: Verify the write mode is `append` and check the state file:
Look for your node's entry and verify the HWM timestamp matches the last run's maximum timestamp.
State file location¶
Odibi stores HWM state in different backends depending on your environment:
| Environment | Backend | Location |
|---|---|---|
| Pandas / Local | Local JSON | .odibi/state.json |
| Spark / Databricks | Delta table | odibi_meta.state table |
Local JSON example:
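The exact on-disk schema may vary; illustratively, an entry might look roughly like this (field names are assumptions):

```json
{
  "sensor_stream": {
    "hwm": "2026-01-03T23:55:00Z"
  }
}
```

The node name keys the entry, and the HWM should match the last run's maximum timestamp.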
To reset a node's HWM (force a full reload from start_time):
- Local: Delete the node's entry from `.odibi/state.json`, or delete the file entirely to reset all nodes
- Databricks: Remove the row from the `odibi_meta.state` Delta table
Resetting HWM with stateful functions
If you reset the HWM, stateful function state (`prev`, `ema`, `pid`, `random_walk`) is also reset. This means Run N+1 after a reset will start from defaults – expect discontinuities in the generated data.
Stateful function state not persisting¶
Symptom: `prev()` resets to its default value on each run. Tank levels jump back to 50.0 instead of continuing.
Cause: Incremental mode is not set to stateful, or the state backend is not configured.
Solution: Ensure your config uses `mode: stateful`:
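```yaml
incremental:
  mode: stateful     # required for stateful function persistence
  column: timestamp  # a timestamp column defined in the simulation
```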
Related Documentation¶
- Core Concepts – Scope, entities, and columns – the three building blocks
- Stateful Functions – `prev()`, `ema()`, `pid()`, `delay()` – history-dependent values and their persistence
- Patterns & Recipes – Real-world simulation scenarios
- Stateful Incremental Loading – Full documentation of Odibi's incremental loading system (shared by all read sources)