Skip If Unchanged Pattern¶

Use Case: Avoid redundant writes for snapshot tables that may not change between pipeline runs.

The Problem¶

When ingesting snapshot data (full table extracts without timestamps), an hourly pipeline will append the same 192k rows 24 times per day if the source data doesn't change. This wastes:

Storage: 24× the necessary data
Compute: Unnecessary write operations
Query performance: More files to scan

The Solution¶

The skip_if_unchanged feature computes a hash of the DataFrame content before writing. If the hash matches the previous write, the write is skipped entirely.

nodes:
  - name: bronze_data
    read:
      connection: azure_sql
      format: sql
      table: dbo.MySnapshotTable
    write:
      connection: bronze
      format: delta
      table: my_snapshot_table
      mode: append
      skip_if_unchanged: true
      skip_hash_sort_columns: [store_id, date_id]  # For deterministic ordering

How It Works¶

flowchart TD
    A[Read source data] --> B[Compute SHA256 hash of DataFrame]
    B --> C{Previous hash exists?}
    C -->|No| D[Write data]
    D --> E[Store hash in Delta metadata]
    C -->|Yes| F{Hashes match?}
    F -->|Yes| G[Skip write, log 'unchanged']
    F -->|No| D

Hash Computation: Before writing, the entire DataFrame is converted to CSV bytes and hashed with SHA256
Hash Storage: The hash is stored in Delta table properties (odibi.content_hash)
Comparison: On subsequent runs, the new hash is compared to the stored hash
Skip or Write: If hashes match, the write is skipped; otherwise, data is written and hash updated

Configuration Options¶

Option	Type	Description
`skip_if_unchanged`	bool	Enable hash-based skip detection
`skip_hash_columns`	list	Subset of columns to hash (default: all)
`skip_hash_sort_columns`	list	Columns to sort by before hashing (for determinism)

When to Use¶

✅ Good fit: - Snapshot tables without updated_at timestamps - Reference/dimension data that changes infrequently - Tables where you don't know the change frequency

❌ Not recommended: - Tables with reliable updated_at (use HWM instead) - Append-only fact tables (new data every run) - Very large tables (hash computation is expensive)

Example: Reference Data Sync¶

# Pipeline runs hourly for freshness
# But this table only changes 1-2 times per day

nodes:
  - name: bronze_reference_data
    read:
      connection: azure_sql
      format: sql
      table: dbo.vw_reference_detail
    write:
      connection: bronze
      format: delta
      table: reference_detail
      mode: append
      add_metadata: true
      skip_if_unchanged: true
      skip_hash_sort_columns: [store_id, date_id]

Result: - Pipeline checks every hour (data is fresh when needed) - Only writes when data actually changes (storage efficient) - Logs show "Skipping write - content unchanged" for skipped runs

Storage Impact¶

Scenario	Daily Writes	Annual Rows (192k/snapshot)
No skip (hourly)	24	1.7 billion
With skip (2 changes/day)	2	140 million
Savings	92%	92%

Limitations¶

Delta only: Currently only supported for Delta format
Full DataFrame hash: Computes hash of entire DataFrame (not row-by-row)
Memory: DataFrame must fit in driver memory for hashing
First run: Always writes on first run (no previous hash to compare)

Append-Only Raw Layer - Bronze layer best practices
Incremental Stateful - For tables with timestamps
Smart Read - Full load pattern