Skip to content

Odibi β€” Lessons Learned System

Purpose: Structured, searchable memory that persists across all agent sessions. Rule: Every session that discovers something new MUST add an entry here. Read this file at the start of every session. It prevents repeat mistakes.


How This System Works

Previously, lessons were buried in AGENTS.md's "Testing Gotchas" section β€” a growing blob that mixed coverage numbers, mock patterns, and design decisions. This file separates persistent knowledge into four searchable categories:

Category What Goes Here Example
Decisions Why we chose X over Y "Use DuckDB SQL, not raw Pandas"
Traps Things that look right but break "caplog doesn't capture structured logging"
Patterns Working code recipes "Mock PySpark setup order"
Discoveries Behavior we verified empirically "Polars adds +00:00 to all datetimes"

AGENTS.md keeps its existing role: coverage tracker + test infrastructure notes. This file handles the why and how that agents need to make correct decisions.


Decisions Log

Design choices with rationale. Search here before proposing alternatives.

D-001: SQL-First Transformers

Date: 2026-01
Context: Building transformer system for engine parity
Decision: Use context.sql() (DuckDB for Pandas, Spark SQL for Spark) as the primary transform mechanism
Rationale: One SQL query works on all engines. Engine-specific code only when SQL cannot express the logic (UDFs, CoolProp, complex window frames)
Affected: All transformers in odibi/transformers/

D-002: Pydantic for Config, Plain Dicts for Pattern Params

Date: 2026-01
Context: Pattern validation needs flexible params without strict Pydantic at the pattern layer
Decision: YAML β†’ Pydantic (config.py) β†’ params: dict passed to patterns. Patterns validate params themselves in validate()
Rationale: NodeConfig validates structure. Pattern.validate() validates semantics. Decoupled.
Affected: All patterns in odibi/patterns/

D-003: No %run on Databricks β€” lib/ Pattern Only

Date: 2026-04
Context: %run hangs and gets stuck in Databricks notebooks
Decision: All new self-contained projects use the validation_e2e lib/ pattern: lib/ folder with setup.py (auto-installs deps, exports API), config.py, __init__.py. Each notebook does sys.path.insert(0, PROJECT_ROOT) then from lib.setup import *
Rationale: Eliminates %run dependency chains. Each notebook is self-contained.
Affected: All Databricks notebooks going forward

D-004: Coverage via coverage run, Never pytest --cov

Date: 2026-04
Context: pytest --cov changes import/execution order, triggering Rich Text vs str mismatch
Decision: Always use python -m coverage run --source=... -m pytest ... then python -m coverage report --show-missing
Rationale: Same data, no ordering issue
Affected: All coverage measurement

D-005: Test Batching Required

Date: 2026-04
Context: Full tests/unit/ suite hangs due to ThreadPoolExecutor + coverage tracing + mock contamination + Polars anonymize mask hang
Decision: Always run tests in batches. Use scripts/run_coverage.ps1 or manual batches.
Affected: All CI and local test runs

D-006: AGENTS.md as Living Coverage Tracker

Date: 2026-04
Context: Need to track per-module coverage across sessions
Decision: AGENTS.md "Current Coverage Status" section is the authoritative record. Every session that adds tests updates it.
Rationale: Single location, searchable, always current
Affected: All test-writing sessions

D-007: Standalone Functions Over Stateful Classes

Date: 2026-01
Context: Framework architecture decision
Decision: Transformers are standalone functions (context, params) -> context. Patterns are the only classes (extend Pattern ABC). No classes with mutable state.
Rationale: Simpler to test, compose, and reason about. Patterns need class hierarchy for polymorphism.
Affected: All odibi code

D-008: _safe_partition_count for Spark Connect Compatibility

Date: 2026-04
Context: SparkEngine crashed on shared clusters because .rdd.getNumPartitions() is blocked by Spark Connect
Decision: Added _safe_partition_count(df) helper that wraps df.rdd.getNumPartitions() in try/except, returning -1 on failure. All partition-count logging uses this helper.
Rationale: Partition count is logging-only metadata. The -1 sentinel is visible in logs without crashing the pipeline. Core read/write/upsert don't depend on partition count.
Affected: odibi/engine/spark_engine.py β€” 7 call sites


Traps (Known Pitfalls)

Things that look correct but break. Search here before attempting a new approach.

T-001: caplog Does Not Capture Structured Logging

Symptom: Tests using caplog pass individually but fail in batch
Root Cause: get_logging_context() singleton accumulates state. Rich highlighter receives Text object instead of str, triggering blocking TypeError
Fix: Never use caplog. Assert on return values and behavior only.
Modules: All modules using get_logging_context()

T-002: Mock PySpark Order Matters

Symptom: Spark branch tests fail silently β€” fallback type stubs undefined
Root Cause: If you mock pyspark BEFORE importing odibi, catalog.py's try/except succeeds (real pyspark exists) but the fallback stubs never get defined
Fix: Always import odibi.catalog FIRST, then install mocks in sys.modules
Modules: catalog.py, any Spark branch testing

T-003: "spark" or "delta" in Test Filenames β†’ Skipped

Symptom: Test file exists but never runs
Root Cause: conftest.py skip filter on Windows
Fix: Use names like test_catalog_mock_engine_reads.py, not test_catalog_spark_reads.py
Modules: All test files

T-004: Delta Lake Rejects Null-Type Columns

Symptom: SchemaMismatchError: Invalid data type for Delta Lake: Null
Root Cause: Pandas DataFrames with None values create Null-type columns
Fix: Use pa.table() with explicit types for Delta seeding. Always provide non-None environment values.
Modules: state/__init__.py, any Delta Lake tests

T-005: NodeResult.result_schema Field Alias

Symptom: result_schema is silently None despite being passed
Root Cause: Pydantic Field(alias="schema") β€” must construct with schema=["a","b"], not result_schema=["a","b"]
Fix: Pass as schema=[...] in constructor, read via .result_schema
Modules: node.py

T-006: Polars Freshness Tests Need UTC

Symptom: "Fresh" data appears stale in Polars tests
Root Cause: Polars adds +00:00 timezone. Validation compares with datetime.now(timezone.utc). Naive timestamps are 5h off (Central).
Fix: Use datetime.now(timezone.utc) - timedelta(...) in test data
Modules: validation/engine.py

T-007: MagicMock Dunder Methods Don't Work on Instances

Symptom: mock_obj.__getitem__ is ignored
Root Cause: Python resolves dunder methods on the TYPE, not the instance
Fix: Use plain tuples/dicts instead. mock_stats = (100.0,) for code that does stats[0]
Modules: Any mock-heavy tests

T-008: builtins.import Patching Is Dangerous

Symptom: ImportError during unrelated code when simulating missing libraries
Root Cause: Patching builtins.__import__ catches internal imports for Rich, logging, coverage
Fix: Use sys.modules manipulation: set sys.modules["pkg"] = None, restore in finally
Modules: Any tests simulating missing optional dependencies

T-009: SCD2 Float/NaN Comparison β€” Resolved (All Engines)

Date: 2026-04-30 (Spark), 2026-05-01 (Pandas/Polars/DuckDB verified)
Symptom: SCD2 change detection false-positives on float columns containing NaN.
Root Cause: NaN != NaN in IEEE 754. Without NaN-safe comparison, every SCD2 re-run creates spurious new versions for rows with NaN in tracked columns.
Resolution: All 5 comparison paths now handle NaN correctly:
- Spark legacy (L439-452): eqNullSafe() treats NaN == NaN β†’ no false positive
- Spark Delta MERGE (L654-656): <=> operator is NaN-safe in Spark SQL
- DuckDB SQL (L801-804): IS DISTINCT FROM treats NaN == NaN as False in DuckDB (unlike standard SQL)
- Pandas fallback (L933-952): _safe_isna() detects NaN via pd.isna(), math.isclose(rel_tol=1e-9, abs_tol=1e-12) handles float epsilon
- Polars (L1108-1119): ne_missing() already treats NaN == NaN natively; defensive is_nan() guard on Float32/Float64 columns
Verified: campaign/18_scd2_nan_fix β€” 5/5 tests PASS across Pandas, Polars, Spark, DuckDB
Status: Resolved for all engines. #248 closeable.
Modules: odibi/transformers/scd.py

T-010: Polars repeat_by + list.join Hangs

Symptom: Test suite hangs when anonymize "mask" method runs in Polars
Root Cause: Polars expression evaluation deadlock in certain test contexts
Fix: Isolate Polars anonymize tests in separate batch
Modules: engine/polars_engine.py anonymize method

T-011: Spark Connect Temp View Lifecycle β€” Views Dropped Before Lazy Eval

Date: 2026-04
Symptom: Spark Connect queries fail with INVALID_HANDLE.Format: OPERATION_HANDLE or return empty results
Root Cause: EngineContext.sql() used a finally block to unregister temp views after SQL execution. On classic Spark this works because DataFrames eagerly resolve. On Spark Connect, plan resolution is deferred β€” the view is dropped before the lazy DataFrame materializes.
Fix: Only unregister temp views on error (except block). Successful results keep views alive. Unique view names (_odibi_view_{uuid}) prevent conflicts; session cleanup handles the rest.
Modules: odibi/context.py

T-012: bool(LazyFrame) TypeError in Row Counting

Date: 2026-04
Symptom: TypeError: the truth value of a LazyFrame is ambiguous when running transformers with Polars
Root Cause: Row-counting logic used if rows_before and rows_after: which calls bool() on the result of .count(). Polars .count() returns a LazyFrame (not an int), and bool(LazyFrame) raises TypeError.
Fix: Replace if rows_before and rows_after: with if isinstance(rows_before, (int, float)) and isinstance(rows_after, (int, float)): β€” explicit type checks instead of truthiness.
Modules: odibi/transformers/sql_core.py, odibi/transformers/advanced.py, odibi/transformers/relational.py

T-014: Always Run ruff format Before Committing

Date: 2026-04 Symptom: CI fails on ruff format --check odibi/ tests/ after merging agent-authored code Root Cause: Agent edited transformer files but did not run the formatter. Long lines and expression formatting diverged from project style. Fix: Every commit touching .py files in odibi/ or tests/ must run ruff format <changed files> before git add. CI checks both ruff check (linting) and ruff format --check (formatting). Modules: All Python files

T-013: Polars SQL Dialect Gaps β€” EXCEPT vs EXCLUDE, No Subqueries

Date: 2026-04
Symptom: polars.exceptions.SQLInterfaceError on deduplicate: "EXCEPT syntax not supported" and "subqueries not supported"
Root Cause: Polars SQL uses EXCLUDE for column exclusion (like DuckDB/Pandas), not EXCEPT (Spark/SQL Server). Polars SQL also does not support subqueries like ORDER BY (SELECT NULL).
Fix: Added Polars to the EXCLUDE dialect alongside Pandas/DuckDB. Use ORDER BY 1 for Polars when no explicit order_by is specified.
Modules: odibi/transformers/advanced.py (deduplicate)


T-015: Databricks editAsset for Files Replaces Entire Content

Date: 2026-04 Symptom: CHANGELOG.md (61,524 chars) truncated to 5,984 chars after using editAsset with type: file Root Cause: The editAsset tool for workspace files replaces the entire file body with the provided content β€” it does NOT patch or append. When the content parameter only contained the new [Unreleased] section + a version header, all 50+ historical version entries were lost. Fix: For large files, use Python open(path, 'a') to append, or read the full file first and include ALL content in the editAsset body. Never use editAsset for partial updates to large files. Modules: Any workspace file editing via Databricks assistant tools

T-016: External Internet Blocked on Databricks Cluster β€” No GitHub Recovery

Date: 2026-04 Symptom: SSLError(SSLEOFError(8, '[SSL: UNEXPECTED_EOF_WHILE_READING]')) when calling raw.githubusercontent.com, api.github.com, or any external HTTPS endpoint from the shared-173-LTS cluster. Root Cause: Corporate proxy or firewall blocks outbound HTTPS to non-approved hosts. PyPI works (via allowed proxy), but GitHub, raw content, and API endpoints do not. Impact: Cannot recover workspace files from GitHub when accidentally overwritten. Cannot pip install from GitHub URLs. Cannot use requests or urllib for external data. Fix: Never rely on external HTTP for file recovery. Keep critical content in the conversation cache or workspace. For package installs, only use PyPI-hosted packages. Modules: Any agent workflow that assumes internet access

T-017: Spark Connect Blocks .rdd and .sparkContext on Shared Clusters

Date: 2026-04
Symptom: PySparkNotImplementedError: [NOT_IMPLEMENTED] when SparkEngine calls df.rdd.getNumPartitions() or spark.sparkContext.appName on shared/user-isolation clusters
Root Cause: Databricks shared clusters use Spark Connect (thin client), which blocks JVM-dependent APIs: df.rdd, spark.sparkContext, spark.sparkContext._gateway.jvm.*. These are all used in SparkEngine for logging/metrics only β€” the core read/write/upsert functionality is fully Spark Connect-compatible.
Affected locations (spark_engine.py):
- __init__: sparkContext.setLogLevel() (x2), sparkContext.appName (x1)
- read(): df.rdd.getNumPartitions() (x3: JDBC, table, file paths)
- sql(): result.rdd.getNumPartitions() (x1)
- write(): df.rdd.getNumPartitions() (x1, already protected)
- table_exists(): sparkContext._gateway.jvm (x1, already protected)
Fix: Added _safe_partition_count(df) helper that returns -1 on failure. Wrapped all sparkContext access in try/except. All 6 unprotected locations now safe.
Modules: odibi/engine/spark_engine.py

T-018: normalize_column_names Uses Wrong Quote Character on Spark

Date: 2026-04-30
Symptom: On Spark, normalize_column_names returns the literal string "First Name" as data instead of the actual column value. Column names are renamed correctly but all data values become the old column name string.
Root Cause: SQL generation hardcoded double-quote identifiers ("First Name") for all engines. Spark SQL treats "..." as a string literal; column identifiers require backticks. DuckDB/Pandas unaffected (ANSI SQL: double quotes = identifiers).
Fix: Added engine-aware quoting: q = '`' if SPARK else '"' β€” consistent with rename_columns and replace_values which already used this pattern.
Rule: Always use engine-aware quoting for column identifiers in generated SQL.
Modules: odibi/transformers/sql_core.py β€” normalize_column_names()


T-019: Cross Join Generates ON Clause β€” Spark Returns Inner Join Results

Date: 2026-04-30
Symptom: Cross join returns 2 rows instead of 12 (4Γ—3 cartesian product). Spark interprets CROSS JOIN ... ON condition as an inner join, silently filtering to only matching rows.
Root Cause: JoinParams.on was a required field (no way to omit for cross joins), and join() always appended ON {join_condition} to SQL regardless of how. Additionally, the Pandas path passed on= to merge() even for cross joins.
Fix: Made on optional (None default), added model_post_init to enforce on for non-cross joins, branched SQL generation to skip ON for cross joins, added Pandas merge(how="cross") path without on parameter.
Rule: Cross joins must never have an ON clause. Validate join parameters holistically β€” field-level validators can't enforce cross-field constraints.
Modules: odibi/transformers/relational.py β€” JoinParams, join()

T-023: coverage --source Triggers NumPy Double-Import on Databricks

Date: 2026-04-30
Symptom: TypeError: int() argument must be a string, a bytes-like object or a real number, not '_NoValueType' when running coverage run --source=odibi.transformers.delete_detection
Root Cause: --source activates coverage's import hooks, which cause NumPy to be imported twice. The double-import breaks numpy._core._methods._sum β€” the _NoValue sentinel from the first import is not recognized by the reloaded module. This only affects code paths using pandas.merge(..., indicator=True) which calls .where() internally.
Fix: Use --include=odibi/transformers/delete_detection.py instead of --source=odibi.transformers.delete_detection. The --include flag filters coverage reporting by file path without activating import hooks.
Modules: Any coverage measurement on Databricks clusters


Patterns (Working Recipes)

Proven code recipes. Copy these instead of reinventing.

P-005: Polars Anti-Join for Delete Detection

Date: 2026-04-30
Context: Porting Pandas merge(..., indicator=True) pattern to Polars
Pattern:

# Pandas: merge + indicator + filter
merged = prev_keys.merge(curr_keys, on=keys, how="left", indicator=True)
deleted_keys = merged[merged["_merge"] == "left_only"][keys]

# Polars: native anti-join (cleaner, faster)
deleted_keys = prev_keys.join(curr_keys, on=keys, how="anti")
Why: Polars has no indicator parameter on joins. The anti join type is purpose-built for this β€” returns rows from the left that have NO match in the right. No temporary columns to clean up.
Also applies to: hard delete (df.join(deleted_keys, on=keys, how="anti")), any "find missing" pattern
Modules: odibi/transformers/delete_detection.py

P-006: LazyFrame Guard Pattern for Polars Engine Branches

Date: 2026-04-30
Context: Polars engine may receive either eager DataFrame or LazyFrame
Pattern:

import polars as pl

df = context.df
if isinstance(df, pl.LazyFrame):
    df = df.collect()
Why: Several Polars operations (.height, column indexing, .to_list()) require an eager DataFrame. Always collect at the top of engine-specific branches.
Modules: All Polars branches in delete_detection.py, manufacturing.py

T-020: Spark Connect Lazy Evaluation β€” spark.table() and DeltaTable.forName() Don't Throw for Missing Tables

Date: 2026-04-30 Symptom: SCD2 and Merge tests fail with TABLE_OR_VIEW_NOT_FOUND / DELTA_MISSING_DELTA_TABLE on first run. Table-existence checks pass (no exception), but operations on the returned object throw later. Root Cause: On Spark Connect (shared clusters / USER_ISOLATION), spark.table(name) and DeltaTable.forName(spark, name) are lazy β€” they return client-side objects without validating table existence on the server. Errors surface only when an action is triggered (.columns, .count(), .execute()). This made try: spark.table(t); exists=True always succeed, bypassing initial-write paths. Fix: Replace lazy checks with spark.catalog.tableExists(target) as a gate. For Merge, also verify Delta format: gate on tableExists(), then DeltaTable.forName() inside a try/except (handles non-Delta tables on Hive metastore). File paths still use DeltaTable.isDeltaTable() / spark.read.format("delta").load() which are eager. Rule: Never use try: spark.table(name) or try: DeltaTable.forName(spark, name) as existence checks on Spark Connect. Use spark.catalog.tableExists() β€” it's the only reliable, eager check across classic and Connect runtimes. CI Impact: When replacing spark.table() with spark.catalog.tableExists(), existing mock tests that set spark.table.side_effect = Exception(...) must also set spark.catalog.tableExists.return_value = False. MagicMock auto-returns truthy values, so without explicit return_value, the mock thinks the table exists. Modules: odibi/transformers/scd.py β€” _scd2_spark_delta_merge(), _scd2_spark(); odibi/transformers/merge_transformer.py β€” merge_batch(); odibi/patterns/base.py β€” _load_existing_spark() (used by DimensionPattern and FactPattern)

T-021: Run Existing Unit Tests Before Pushing Spark Fix Branches

Date: 2026-04-30 Symptom: Task 8 (T-020 fix) broke CI β€” test_merge_optimization_first_run_writes_directly failed with assert None is not None. Root Cause: Changing spark.table() to spark.catalog.tableExists() in production code invalidated mock setups in existing tests. Genie only ran Databricks integration tests, not the existing CI unit tests. Rule: Before pushing any branch that modifies production code, run pytest tests/unit/ -q --tb=short (or at minimum the relevant test files) to catch mock regressions. Databricks integration tests validate real behavior; CI unit tests validate mock contracts. Both must pass.

T-022: "Unsupported Engine" Tests Break When Genie Adds Engine Support

Date: 2026-04-30 Symptom: 5 CI failures after Task 13 (Polars SCD2/Merge). Tests like test_unsupported_engine_raises used EngineType.POLARS to trigger the else branch, but Genie added _scd2_polars/_merge_polars so POLARS is now dispatched. Pandas DataFrames hit Polars code paths β†’ AttributeError: 'DataFrame' has no attribute 'schema'. Root Cause: Tests assumed a specific engine/context was "unsupported" instead of using a truly fake value. When parity work adds support for that engine, the tests break. Rule: Never use a real EngineType member or real context class (e.g., PolarsContext) as the "unsupported" test case. Use engine_type="unknown" or type("UnknownContext", (), {})() β€” these can never become supported. Also update regex patterns to match the actual error message (e.g., "does not support context type" not "Unsupported context").


Patterns (Working Recipes)

Copy-paste these. They're battle-tested.

P-001: Standard CatalogManager Test Fixture

from odibi.catalog import CatalogManager
from odibi.config import SystemConfig
from odibi.engine.pandas_engine import PandasEngine

@pytest.fixture
def catalog(tmp_path):
    return CatalogManager(
        spark=None,
        config=SystemConfig(connection="local", path="_odibi_system"),
        base_path=str(tmp_path / "_odibi_system"),
        engine=PandasEngine(config={}),
    )

P-002: Mock PySpark for Spark Branch Testing

import sys
from unittest.mock import MagicMock

# Step 1: Import odibi FIRST (triggers real pyspark + fallback stubs)
import odibi.catalog

# Step 2: Build mock hierarchy
mock_pyspark = MagicMock()
mock_sql = MagicMock()
mock_types = MagicMock()

# Step 3: Use real fallback types
mock_types.StringType = odibi.catalog.StringType
mock_types.StructField = odibi.catalog.StructField
mock_types.StructType = odibi.catalog.StructType
mock_sql.types = mock_types
mock_pyspark.sql = mock_sql

# Step 4: Override (NOT setdefault)
sys.modules["pyspark"] = mock_pyspark
sys.modules["pyspark.sql"] = mock_sql
sys.modules["pyspark.sql.types"] = mock_types

P-003: DuckDB SQL Executor for Pattern Tests

import duckdb

def sql_executor(query, context):
    conn = duckdb.connect()
    for name in context.list_names():
        conn.register(name, context.get(name))
    return conn.execute(query).fetchdf()

P-004: Delta Table Seeding (No Null Types)

import pyarrow as pa
from deltalake import write_deltalake, DeltaTable

def seed_delta(path, data_dict, schema=None):
    table = pa.table(data_dict, schema=schema)  # NOT pd.DataFrame
    write_deltalake(str(path), table, mode="overwrite")

def read_delta(path):
    return DeltaTable(str(path)).to_pandas()

P-005: Simulating Missing Libraries

import sys

original = sys.modules.get("fsspec")
sys.modules["fsspec"] = None  # Simulates ImportError
try:
    result = function_that_imports_fsspec()
finally:
    if original is not None:
        sys.modules["fsspec"] = original
    else:
        sys.modules.pop("fsspec", None)

P-006: Mock NodeConfig for Pattern Tests

from unittest.mock import MagicMock

def make_config(name="test_node", params=None):
    cfg = MagicMock()
    cfg.name = name
    cfg.params = params or {}
    return cfg

P-007: Databricks Notebook lib/ Pattern

# Every notebook starts with:
import sys, os
PROJECT_ROOT = os.path.dirname(os.path.abspath(__file__)) if "__file__" in dir() else dbutils.notebook.entry_point.getDbutils().notebook().getContext().notebookPath().get().rsplit("/", 1)[0]
sys.path.insert(0, PROJECT_ROOT)
from lib.setup import *

Discoveries (Verified Behavior)

Empirical findings that inform implementation decisions.

V-001: Polars Adds +00:00 to All Datetime Columns

Verified: 2026-04
Finding: Polars always stores timezone-aware datetimes. When comparing with Python datetime.now() (naive), the offset causes false results.
Implication: All Polars datetime tests must use timezone.utc

V-002: CoolProp and Polars Are Installed in CI

Verified: 2026-04
Finding: CoolProp 7.2.0 and Polars are available. No mocking needed for thermodynamics or manufacturing Polars tests.
Implication: Run these tests directly, don't skip them

V-003: PySpark 4.1.1 Is Installed β€” Fallback Stubs Are Not Dead Code

Verified: 2026-04
Finding: Because PySpark IS installed, catalog.py's try/except imports succeed. The fallback type stubs (lines 22-137) are NOT defined at runtime. But they ARE essential for Pandas-only deployments.
Implication: Do not remove fallback stubs. They're not dead code.

V-004: create_connection Does Not Exist in Factory

Verified: 2026-04
Finding: state/__init__.py imports create_connection from odibi.connections.factory, but only typed functions like create_sql_server_connection exist.
Implication: To test SQL Server state backend factory paths, inject mock directly: m.create_connection = MagicMock()

V-005: Module Caching on Shared Databricks Clusters

Verified: 2026-04
Finding: On shared Databricks clusters, Python caches imported modules across cell executions. If you modify odibi source files and re-import, you get the stale cached version. Clearing just odibi.* from sys.modules is insufficient if your project has its own lib/ package β€” both must be cleared.
Implication: Every campaign/test notebook must clear both lib.* and odibi.* from sys.modules before importing. Pattern: for m in [k for k in sys.modules if k.startswith('lib') or k.startswith('odibi')]: del sys.modules[m]

V-006: Spark Connect .collect() Returns List of Rows, Not DataFrame

Verified: 2026-04
Finding: hasattr(df, 'collect') is True for both Polars LazyFrames AND Spark DataFrames. But Spark's .collect() returns list[Row], while Polars' .collect() returns a pl.DataFrame. Using hasattr to detect "needs .collect()" will mis-route Spark DataFrames.
Implication: Always use isinstance(df, pl.LazyFrame) to detect Polars lazy frames. Never use hasattr(df, 'collect') as a Polars check.

V-007: Databricks Repos β€” No .git Directory, No Local Git Commands

Verified: 2026-04
Finding: Databricks Repos are managed externally β€” there is no .git directory on disk. git log, git diff, git checkout all fail with "not a git repository". Branch operations must use the Repos REST API (PATCH /api/2.0/repos/{id}). Switching branches via the API does a full checkout but does NOT restore files that were modified through the workspace API (editAsset) β€” workspace file changes appear to propagate across branches or persist beyond the checkout.
Implication: If you overwrite a file via editAsset, you cannot recover it with git checkout -- file. The only recovery options are: (1) conversation cache from a prior readAssetById call, (2) a separate backup copy, or (3) manual reconstruction. Always read and cache large files before editing them.

V-008: Polars Engine Consumer Code Has 58 Dispatch Gaps

Verified: 2026-04-30
Finding: Full audit of 91 engine-dispatched functions found 58 missing Polars branches. polars_engine.py itself is complete (read/write/SQL/anonymize/harmonize/validate all work). The gaps are in consumer code: transformers, patterns, and validation modules that dispatch on engine_type but only branch for Spark and Pandas. Three gap patterns: (1) 8 functions crash with ValueError, (2) 35 functions silently fall through to Pandas path, (3) 2 quarantine masks return pl.lit(True) for UNIQUE/CUSTOM_SQL. 24 SQL-first transformers work implicitly via context.sql().
Implication: Any Polars pipeline using SCD2, Merge, Dimension, Fact, Delete Detection, or FK validation will fail. Fix priority: stop crashes (8 fns) β†’ fix silent fallbacks (35 fns) β†’ quarantine masks (2 fns).
Reference: docs/09_polars_audit_results.md, campaign/09_polars_audit_results.md


Session Log Template

Copy-paste this at the end of your session:

## Session: [Date] β€” [Brief Description]

### Work Done
- [bullet list of changes]

### Verification Report
- Tests written: N
- Tests passing: N/N
- Command: `pytest tests/unit/path/test_file.py -v`

### New Entries Added
- [ ] Decision: D-NNN β€” [title]
- [ ] Trap: T-NNN β€” [title]
- [ ] Pattern: P-NNN β€” [title]
- [ ] Discovery: V-NNN β€” [title]
- [ ] AGENTS.md coverage updated

### No New Entries Needed
- [ ] Confirmed: no new gotchas discovered this session

Session: 2026-04-30 β€” Odibi Hardening Campaign (Spark Connect + Polars Parity)

Work Done

  • Fixed context.py temp view lifecycle for Spark Connect lazy eval
  • Fixed row-counting bool(LazyFrame) TypeError across 4 transformer files
  • Fixed Polars deduplicate SQL dialect (EXCLUDE + ORDER BY 1)
  • Updated test_context.py to match new view lifecycle behavior
  • Built campaign test harness (01_smoke_test, 02_engine_matrix)
  • Validated: 7,440 unit tests pass, 0 regressions in changed modules

Verification Report

  • Tests written: 0 new (1 updated: test_sql_keeps_view_registered_after_successful_execution)
  • Tests passing: 7,440/7,440 (excluding 49 pre-existing failures from missing deltalake + /tmp env issues)
  • Engine parity: 11/11 parity tests pass
  • Campaign: 9/9 engine matrix tests pass (3 transformers Γ— 3 engines)
  • Command: pytest tests/unit/ -q --no-header

New Entries Added

  • [x] Trap: T-011 β€” Spark Connect Temp View Lifecycle
  • [x] Trap: T-012 β€” bool(LazyFrame) TypeError in Row Counting
  • [x] Trap: T-013 β€” Polars SQL Dialect Gaps
  • [x] Discovery: V-005 β€” Module Caching on Shared Clusters
  • [x] Discovery: V-006 β€” Spark Connect .collect() Returns List of Rows

Session: 2026-04-29 β€” CHANGELOG Catch-Up (#228)

Work Done

  • Updated CHANGELOG.md [Unreleased] section with all post-3.9.0 changes
  • Documented: skills system (15 docs), CUSTOM_INSTRUCTIONS.md, LESSONS_LEARNED.md, AGENT_CAMPAIGN.md, campaign workspace, bug audit (43/46 fixed)
  • Recovered CHANGELOG.md after accidental truncation from editAsset file replacement
  • Reconstructed 67,192-char file from conversation cache in 6 append chunks

Verification Report

  • Tests written: 0 (docs-only change)
  • All 52 version entries preserved
  • 12/12 success criteria passing
  • Command: python -c "import re; content=open('CHANGELOG.md').read(); versions=re.findall(r'## \[([^\]]+)\]', content); print(f'{len(versions)} versions found')"

New Entries Added

  • [x] Trap: T-014 β€” Databricks editAsset for Files Replaces Entire Content
  • [x] Trap: T-015 β€” External Internet Blocked on Databricks Cluster
  • [x] Discovery: V-007 β€” Databricks Repos No .git Directory, No Local Git
  • [ ] AGENTS.md coverage updated (no test changes)

Session: 2026-04-30 β€” SparkEngine Shared Cluster Compatibility Fix

Work Done

  • Audited spark_engine.py: found 8 locations using .rdd or .sparkContext
  • Fixed 6 unprotected locations (2 were already protected)
  • Added _safe_partition_count() helper method
  • Built 03_spark_engine_readwrite.py with 8 integration tests
  • All 8 tests pass on shared cluster (USER_ISOLATION mode)

Verification Report

  • Tests written: 8 (CSV read, Parquet read, Delta read, overwrite, append, upsert, integrity, HWM)
  • Tests passing: 8/8
  • Concrete value assertions: row counts, column names, sample values (Alice/100.50, Bob_Updated/999.99, etc.)
  • Negative tests: future HWM cutoff β†’ 0 rows
  • Edge cases tested: empty result set, overwrite replacement, upsert update+insert+untouched
  • Bug found: SparkEngine crashes on shared clusters β€” filed as T-017

New Entries Added

  • [x] Trap: T-017 β€” Spark Connect Blocks .rdd and .sparkContext
  • [x] Decision: D-008 β€” _safe_partition_count for Spark Connect Compatibility

Session: 2026-04-30 - sql_core Transformers Spark Validation + normalize_column_names Fix

Goal: Test all 26 sql_core transformers with real Spark SQL on Databricks shared cluster.

What happened: - Created campaign/04_spark_transformers_sql_core notebook with 27 tests - 26/27 passed on first run; normalize_column_names failed (T-018) - Fixed T-018: added engine-aware quoting (+4 lines net) - Re-ran: 27/27 pass - 5 existing unit tests: 0 regressions - Ruff: check clean, format applied

Artifacts: - [x] Campaign notebook: campaign/04_spark_transformers_sql_core - [x] Bug fix: odibi/transformers/sql_core.py (normalize_column_names) - [x] Trap: T-018

Session: 2026-04-30 β€” Relational & Advanced Transformers Campaign (05)

Notebook: campaign/05_spark_transformers_relational

Scope: All relational (join, union, pivot, unpivot, aggregate) and advanced (deduplicate, sessionize, split_events_by_period) transformers tested with real Spark SQL + Pandas parity comparison.

Results: 20/20 tests PASS (after cross join fix).

Bug found & fixed: - T-019: join() generated CROSS JOIN ... ON condition β€” Spark interpreted this as INNER JOIN, returning only matching rows instead of a full cartesian product. - Root cause: JoinParams.on was a required field (no way to omit for cross joins), and the SQL template always appended ON {condition}. - Fix: Made on optional with model_post_init validation, branched SQL/Pandas paths for cross joins. 4 locations changed, +12 net lines.

Other issue: split_events_by_period day-split test hit decimal.Decimal - float type error. Fixed by casting to float before comparison.

Session: 2026-04-30 β€” SCD2 & Merge Patterns Campaign (06)

Notebook: campaign/06_spark_patterns_scd2_merge

Scope: Full lifecycle testing of SCD2 and Merge patterns against real Delta tables in Unity Catalog (eaai_dev.hardening_scratch). SCD2: initial load β†’ no-change idempotency β†’ changed records β†’ new records β†’ mixed batch β†’ float/NaN (#248). Merge: initial load β†’ upsert β†’ update-only β†’ append-only β†’ audit columns.

Results: 11/11 tests PASS.

Critical bug found & fixed β€” T-020: Spark Connect Lazy Evaluation: - Symptom: All SCD2 and Merge tests failed on first run. spark.table() and DeltaTable.forName() returned without error for non-existent tables, bypassing initial-write paths. - Root cause: On Spark Connect (shared clusters), both APIs are lazy β€” they return client-side objects without validating table existence on the server. Errors only surface when an action triggers server-side resolution. - Fix (3 locations): - scd.py _scd2_spark_delta_merge (lines 588–605): Replaced try: spark.table(target) with spark.catalog.tableExists(target). - scd.py _scd2_spark (lines 349–373): Same pattern β€” gate on tableExists() before spark.table(). - merge_transformer.py merge_batch (lines 450–475): Gate on tableExists(), then verify Delta format with DeltaTable.forName() inside try/except (handles non-Delta Hive tables).

T-009 resolved (all engines): Spark's eqNullSafe(), DuckDB's IS DISTINCT FROM, Pandas' _safe_isna+math.isclose, and Polars' ne_missing all handle NaN == NaN correctly. #248 closeable β€” verified in campaign/18_scd2_nan_fix (5/5 PASS).

Follow-up fix β€” non-Delta table safety: - Original merge fix assumed any existing table is Delta (tableExists β†’ is_delta = True). Tightened to two-step: tableExists() gate + DeltaTable.forName() verification, so non-Delta Hive tables on non-UC environments fall through correctly.

Files modified: - odibi/transformers/scd.py β€” 2 locations fixed (~15 net lines) - odibi/transformers/merge_transformer.py β€” 1 location fixed (~10 net lines) - campaign/06_spark_patterns_scd2_merge β€” new notebook (9 cells, 11 tests)

New entries added: - T-020: Spark Connect Lazy Evaluation (LESSONS_LEARNED.md) - T-009: Updated with Spark resolution

Session: 2026-04-30 β€” Dimension & Fact Patterns Campaign (07)

Notebook: campaign/07_spark_patterns_dim_fact

Scope: Full lifecycle testing of DimensionPattern (SCD 0/1/2, surrogate keys, unknown member, audit columns) and FactPattern (FK→SK lookup, orphan handling: unknown/reject/quarantine, grain validation, measures, star schema integration) with real Delta tables in Unity Catalog (eaai_dev.hardening_scratch).

Results: 12/12 tests PASS.

Bug found & fixed β€” T-020 extension: _load_existing_spark in base.py: - Same Spark Connect lazy-eval bug as campaign 06. spark.table(target) in Pattern._load_existing_spark() returned lazy DF for non-existent tables, causing DimensionPattern to build plans referencing non-existent tables. - Fix: Gate on spark.catalog.tableExists(target) for table names. Path-based targets unchanged. - This is the 4th location fixed for T-020 (added to odibi/patterns/base.py).

Test infrastructure fixes: - Pattern constructor requires (engine, config) β€” not just config. Tests now use SparkEngine(spark_session=spark) and NodeConfig(name=..., params=..., transformer=...). - NodeConfig requires at least one of read/inputs/transform/write/transformer. Fixed by adding transformer="dimension" or transformer="fact" to make_config(). - Quarantine connection is used as catalog prefix (f"{connection}.{table}"). Set connection=UC_CATALOG instead of "spark".

Files modified: - odibi/patterns/base.py β€” _load_existing_spark() fixed (~15 net lines) - campaign/07_spark_patterns_dim_fact β€” new notebook (10 cells, 12 tests)

Session: 2026-04-30 β€” Spark Validation Campaign (08)

Notebook: campaign/08_spark_validation

Scope: Full validation engine testing with real Spark DataFrames on Databricks. All 11 test types (NOT_NULL, UNIQUE, ACCEPTED_VALUES, ROW_COUNT, RANGE, REGEX_MATCH, CUSTOM_SQL, SCHEMA, FRESHNESS, FK, GATE) plus quarantine chain, quality gate, special tests (fail_fast, severity=warn, empty DataFrame, 10k performance), and Spark vs Pandas cross-engine comparison.

Results: 17/17 tests PASS on first run. No source file modifications needed.

Key findings: - All 11 validation test types work correctly on Spark Connect (USER_ISOLATION cluster). - Quarantine split (validate→quarantine→gate chain) works end-to-end on Spark. - Quality gate correctly rejects/accepts based on pass rate thresholds. - Spark and Pandas produce identical error counts for all 6 compared test types (NOT_NULL, UNIQUE, ACCEPTED_VALUES, ROW_COUNT, RANGE, CUSTOM_SQL). - 10k-row performance: 1.53s for 6 simultaneous validation tests — well under 30s target. - fail_fast correctly stops after first error; severity=warn correctly suppresses errors. - Empty DataFrame: NOT_NULL/UNIQUE pass (0 violations), ROW_COUNT min=1 correctly fails. - No Spark-specific bugs found — validation engine is fully functional on real Spark.

Files modified: - campaign/08_spark_validation β€” new notebook (11 cells, 17 tests)

No source file changes. Validation engine Spark path works correctly as-is.

Session: 2026-04-30 β€” Polars Engine Parity Audit (#212)

Work Done

  • Deep-scanned all 91 engine-dispatched functions across odibi/
  • Found 58 missing Polars branches (23 CRITICAL, 25 MEDIUM, 10 LOW)
  • Identified 3 gap patterns: explicit ValueError (8), silent Pandas fallback (35), missing mask branch (2)
  • Confirmed 33 functions already work with Polars (9 explicit + 24 implicit via SQL)
  • Produced audit report: docs/09_polars_audit_results.md + campaign/09_polars_audit_results.md

Verification Report

  • Tests written: 0 (audit-only, no source changes)
  • Files scanned: 17 with dispatch gaps, 6 with full parity
  • Concrete findings: 58 gaps classified by severity with exact function names and line numbers
  • Edge cases: Checked for both explicit NotImplementedError AND silent fallback patterns
  • Command: Automated Python scan of all .py files in odibi/

New Entries Added

  • [x] Discovery: V-008 β€” Polars Engine Consumer Code Has 58 Dispatch Gaps
  • [ ] AGENTS.md coverage updated (no test changes)

V-009: Polars Pivot/Unpivot API and Agg Function Mapping

Verified: 2026-04-30
Finding: Polars pivot() uses aggregate_function parameter with function names "mean" (not "avg") and "len" (not "count"). The unpivot() method uses variable_name/value_name parameters. Both require eager DataFrames β€” LazyFrames must be .collect()ed first. pivot() signature: df.pivot(on, index, values, aggregate_function). unpivot() signature: df.unpivot(on, index, variable_name, value_name).
Implication: Polars branches must map common agg function names: "avg"β†’"mean", "count"β†’"len". Always check for LazyFrame and collect before pivot/unpivot.

Session: 2026-04-30 β€” Polars Relational Transformers (#212, Task 12)

Work Done

  • Added Polars branches to pivot() and unpivot() in odibi/transformers/relational.py (+56 LOC)
  • Created tests/unit/transformers/test_relational_polars.py (232 LOC, 15 test methods)
  • Built campaign notebook campaign/10_polars_relational_transformers (14 tests)
  • All 14 Databricks integration tests pass (pivot, unpivot, join, union, aggregate + parity)
  • join/union/aggregate already work via context.sql() β€” no new code needed

Verification Report

  • Tests written: 15 unit tests + 14 campaign tests
  • Tests passing: 14/14 (campaign), unit tests pending formal run
  • Concrete value assertions: East Q1 pivot=250, Widget jan revenue=1000, Polars==Pandas parity
  • Edge cases tested: LazyFrame auto-collect, single-row unpivot, avgβ†’mean mapping
  • Production code: +56 LOC (under 250 limit)
  • Total code: +288 LOC (production + tests)

New Entries Added

  • [x] Discovery: V-009 β€” Polars Pivot/Unpivot API and Agg Function Mapping

Session: 2026-04-30 β€” Polars SCD2 & Merge Transformers (#212, Task 13)

Work Done

  • Added _scd2_polars() to odibi/transformers/scd.py (+152 LOC)
  • Native Polars change detection using ne_missing() + is_nan() for NaN-safe comparison
  • LazyFrame auto-collection, file-based target (parquet/csv), self-contained write
  • NaN == NaN confirmed NOT a false positive (#248 resolved for Polars SCD2)
  • Added _merge_polars() to odibi/transformers/merge_transformer.py (+113 LOC)
  • All 3 strategies: upsert (anti_join + concat), append_only (anti_join), delete_match (anti_join)
  • Audit columns: created_at preserved on update via coalesce, updated_at set to UTC now
  • PolarsContext import added to dispatcher
  • Created tests/unit/transformers/test_scd_merge_polars.py (283 LOC, 13 tests)
  • Smoke tested 10/10 on Databricks: 6 SCD2 lifecycle + 4 merge strategies

Verification Report

  • Tests written: 13 unit tests + 10 smoke tests
  • Tests passing: 10/10 (Databricks smoke), unit tests pending formal run
  • Concrete value assertions: Bob expiredβ†’Marketing, NaN==NaNβ†’3 rows, NaNβ†’60000β†’4 rows
  • Negative tests: NaN false-positive prevention
  • Edge cases: LazyFrame, empty target, audit column preservation
  • Production code: +265 LOC (152 scd + 113 merge)
  • Total code: +548 LOC (265 production + 283 tests)

New Entries Added

  • No new D/T/P/V entries (NaN behavior documented under existing V-009 and T-009)

Session: 2026-04-30 β€” Delete Detection Tutorial & E2E Campaign (Task 16, #223)

Work Done

  • Created docs/tutorials/delete_detection.md (218 lines) β€” covers snapshot_diff, sql_compare, soft/hard delete, reappearing records, safety threshold, first run behavior
  • Created examples/delete_detection/config.yaml (55 lines) β€” valid ProjectConfig with detect_deletes step
  • Created examples/delete_detection/data/ β€” 3 CSV files (day1: 100 rows, day2: 95, day3: 100)
  • Created examples/delete_detection/README.md (62 lines)
  • Created campaign notebook campaign/11_delete_detection_e2e (11 cells, 9 tests)
  • All 9 tests PASS: snapshot_diff (day2/day3), first_run_skip, sql_compare_soft, hard_delete, threshold_warn, threshold_error, mode_none, config_yaml

Verification Report

  • Tests written: 9
  • Tests passing: 9/9
  • Concrete value assertions: deleted_ids == [12,27,45,68,91], returned IDs 12/45 active, new IDs 101-103, 95 active after hard delete, DeleteThresholdExceeded raised at 3% threshold
  • Negative tests: threshold_error (DeleteThresholdExceeded), first_run_skip (0 deletes on v0-only Delta)
  • Edge cases tested: reappearing records, new records beyond original range, mode=none passthrough, config YAML parse

New Entries Added

  • [x] Trap: T-024 β€” EngineContext Constructor Requires Positional context Arg
  • [x] Discovery: V-010 β€” deltalake 0.x vs 1.x API Differences (schema_mode vs overwrite_schema)
  • [x] Trap: T-025 β€” DeleteDetectionConfig Requires source_connection for SQL_COMPARE Mode

T-024: EngineContext Constructor Requires Positional context Arg

Date: 2026-04-30 Symptom: TypeError when constructing EngineContext(df=..., engine_type=...) without context. Root Cause: EngineContext.__init__ has context as a required positional argument (PandasContext, SparkContext, or PolarsContext). Passing only df and engine_type skips it. Fix: Always instantiate the appropriate context first, then pass it:

pd_ctx = PandasContext()
ctx = EngineContext(context=pd_ctx, df=my_df, engine_type=EngineType.PANDAS, engine=my_engine)
Modules: odibi/context.py β€” any code constructing EngineContext directly (campaigns, tests)

T-025: DeleteDetectionConfig Requires source_connection for SQL_COMPARE

Date: 2026-04-30 Symptom: ValidationError: 'source_connection' is required when creating DeleteDetectionConfig(mode=SQL_COMPARE, ...) without source_connection. Root Cause: Pydantic model_post_init validation enforces source_connection is set when mode=SQL_COMPARE. Even when simulating sql_compare by calling _apply_deletes directly (which doesn't use source_connection), the config validation still runs. Fix: Pass source_connection="dummy" (or any string) to satisfy validation when simulating sql_compare without a live JDBC connection. _apply_deletes only uses keys, soft_delete_col, max_delete_percent, on_threshold_breach. Modules: odibi/config.py β€” DeleteDetectionConfig; any test/campaign simulating sql_compare mode

V-010: deltalake 0.x vs 1.x API Differences

Verified: 2026-04-30 Finding: The deltalake Python package has breaking API changes between 0.x and 1.x series. Key differences: - write_deltalake schema override: 0.x uses schema_mode="overwrite", 1.x uses overwrite_schema=True - Rust engine kwarg: 0.x supports engine='rust', 1.x removed it (Rust is the only engine) - odibi pins deltalake>=0.18.0,<0.30.0 (resolves to 0.25.5 on Databricks). Cluster pre-installs 1.5.1. Implication: Campaign notebooks that write Delta via deltalake must %pip install "deltalake>=0.18.0,<0.30.0" and %restart_python to match odibi's pinned version. Use schema_mode="overwrite" (not overwrite_schema=True).

Session: 2026-04-30 β€” Quarantine & Orphan Handling Tutorial (Task 17, #222)

Work Done

  • Created docs/tutorials/quarantine_workflow.md (309 lines) β€” covers quarantine split, FK orphan handling (reject/warn/filter), metadata columns, quality gates, sampling, re-processing
  • Created examples/quarantine_workflow/config.yaml (55 lines) β€” valid ProjectConfig with validation + quarantine + gate
  • Created examples/quarantine_workflow/data/ β€” 3 CSV files (dim_customer 20 rows, fact_orders 100 rows, dim_customer_fixed 25 rows)
  • Created examples/quarantine_workflow/README.md (71 lines)
  • Created campaign notebook campaign/12_quarantine_e2e (12 cells, 10 tests)
  • All 10 tests PASS: quarantine_split, quarantine_metadata, fk_reject, fk_warn, fk_filter, gate_abort, gate_warn, sampling, reprocessing, config_yaml

Verification Report

  • Tests written: 10
  • Tests passing: 10/10
  • Concrete value assertions: 8 quarantined / 92 valid, 10 orphans detected, 90 rows after filter, 92% pass rate, 5 metadata cols, max_rows=5 cap
  • Negative tests: gate_abort (92% < 95% threshold), fk_reject (10 orphans β†’ validation fails)
  • Edge cases tested: reappearing dimension keys, sampling cap on large invalid set

New Entries Added

  • [x] Trap: T-026 β€” TestConfig Is a Union Type, Not Directly Instantiable
  • [x] Trap: T-027 β€” NotNullTest Uses columns (Plural List), Not column
  • [x] Trap: T-028 β€” evaluate_gate Per-Row Boolean Lists Must Have Non-Overlapping Failures

T-026: TestConfig Is a Pydantic Discriminated Union β€” Cannot Instantiate Directly

Date: 2026-04-30 Symptom: TypeError: Cannot instantiate typing.Union when calling TestConfig(type=TestType.NOT_NULL, column="amount") Root Cause: TestConfig is defined as Annotated[Union[NotNullTest, RangeTest, AcceptedValuesTest, ...], Discriminator("type")]. It's a type alias for Pydantic parsing (YAML β†’ Python), not a class you can construct. Pydantic dispatches to the correct subclass based on the type discriminator during parsing, but direct constructor calls fail. Fix: Use the specific test config classes directly: NotNullTest(columns=["amount"]), RangeTest(column="amount", min=0), AcceptedValuesTest(column="status", values=[...]). Modules: odibi/config.py β€” any code creating test configs programmatically (campaigns, tests)

T-027: NotNullTest Uses columns (Plural List), Not column (Singular String)

Date: 2026-04-30 Symptom: ValidationError: columns β€” Field required when constructing NotNullTest(column="amount") Root Cause: NotNullTest and UniqueTest use columns: List[str] (plural, supports multiple columns). Other tests like RangeTest, AcceptedValuesTest, RegexMatchTest use column: str (singular). This asymmetry is intentional: not_null/unique checks are commonly applied to multiple columns at once. Fix: Use columns=["amount"] for NotNullTest/UniqueTest. Use column="amount" for RangeTest, AcceptedValuesTest, RegexMatchTest. Rule: Check the specific test class fields before constructing β€” columns (list) vs column (string) varies by test type. Modules: odibi/config.py β€” NotNullTest, UniqueTest (plural) vs all other tests (singular)

T-028: evaluate_gate Per-Row Boolean Lists Need Non-Overlapping Failures for Correct Unique Fail Count

Date: 2026-04-30 Symptom: Gate passes when it should fail. Pass rate shows 97% instead of expected 92%. Root Cause: evaluate_gate computes pass rate as: a row passes if it passes ALL tests. When validation_results lists have failures at the same positions (e.g., rows 97-99 for all tests), only 3 unique rows fail instead of 8. The positional overlap means rows are counted once, not per-test. Fix: When simulating validation_results for testing, spread failures across non-overlapping row positions to match the expected unique failure count. Example: rows 82-84 for status, 85-87 for nulls, 88-89 for negatives = 8 unique failures. Modules: Any test code constructing per-row validation_results for evaluate_gate

Session: 2026-04-30 β€” Aggregation Pattern Stress Test (Task 19, #225)

Work Done

  • Created examples/aggregation_stress/stress_test.py (250 LOC) β€” runs AggregationPattern on Pandas + Spark engines for 8 scenarios and asserts row-level parity.
  • Created examples/aggregation_stress/README.md (60 LOC).
  • Verified all 8 scenarios pass against real Spark Connect cluster 1121-215743-ak1cop0m (DBR 17.3 LTS, Spark 4.0.0). 100k β†’ 1k aggregation completes in 3.4s.
  • No bugs filed β€” Pandas (DuckDB) and Spark engines produce identical aggregates including null grain handling and having filters.

Verification Report

  • Scenarios: 8/8 pass
  • Engines: Pandas + Spark both verified
  • Concrete asserts: row count, every measure column compared via np.allclose (numeric) or == (categorical)
  • Edge cases: nulls in grain, empty result via having, audit load_timestamp recency
  • Performance: 100k rows β†’ 1000 buckets in 3.4s combined Pandas+Spark

New Entries Added

  • [x] Verified: V-011 β€” Aggregation Pattern Pandas/Spark Engines Produce Identical Results
  • [x] Pattern: P-008 β€” Audit Columns (load_timestamp) Are Generated at Runtime β€” Exclude From Engine-Parity Comparisons

V-011: Aggregation Pattern Pandas/Spark Engines Produce Identical Results

Verified: 2026-04-30 against DBR 17.3 LTS (Spark 4.0.0) via Databricks Connect Finding: AggregationPattern.execute() produces row-for-row identical aggregates on Pandas (DuckDB SQL) and Spark engines for: simple SUM/AVG/COUNT/MIN/MAX, multi-column grain, null grain values (NULL becomes its own group on both engines), having filtering to empty result. The _merge_replace and _merge_sum incremental helpers also work identically on Pandas inputs. Implication: No follow-up bug tickets. Future regressions can be caught by re-running examples/aggregation_stress/stress_test.py.

P-008: Audit Columns (load_timestamp) Are Generated at Runtime β€” Exclude From Engine-Parity Comparisons

Pattern: When asserting Pandas vs Spark output equality for patterns with audit: {load_timestamp: true}, the timestamp is set independently per engine run and will differ. Pass an ignore_cols=("load_timestamp",) filter to your parity helper, and assert recency/dtype separately on each engine. Modules: Any cross-engine parity test for AggregationPattern, DimensionPattern, FactPattern, MergePattern audit features.

Session: 2026-04-30 β€” Date Dimension Pattern Full Feature Test (Task 20, #225)

Work Done

  • Created examples/date_dimension_test/full_test.py (210 LOC) β€” 8 scenarios covering all 19 generated columns, Pandas/Spark parity, every fiscal-year start month boundary, unknown-member, single-day, leap-year, and 10-year range performance.
  • Created examples/date_dimension_test/README.md (60 LOC).
  • Verified against real Spark Connect cluster 1121-215743-ak1cop0m (DBR 17.3 LTS, Spark 4.0.0).

Verification Report

  • 8/8 scenarios pass on both engines
  • All 19 generated columns populated (date_sk, full_date, day_of_week, day_of_week_num, day_of_month, day_of_year, is_weekend, week_of_year, month, month_name, quarter, quarter_name, year, fiscal_year, fiscal_quarter, is_month_start, is_month_end, is_year_start, is_year_end)
  • 8 fiscal-year boundary cases (FY start 1/4/7/10) all return correct FY+FQ
  • 2024-02-29 generated in leap year, absent in 2023
  • 10-year range: 3653 rows in 0.19s (Pandas)

New Entries Added

  • [x] Verified: V-012 β€” DateDimensionPattern Pandas/Spark Parity Across All 19 Generated Columns

V-012: DateDimensionPattern Pandas/Spark Parity Across All 19 Generated Columns

Verified: 2026-04-30 against DBR 17.3 LTS (Spark 4.0.0) via Databricks Connect Finding: DateDimensionPattern.execute() produces row-for-row identical values for all 18 non-full_date columns on Pandas and Spark for a 91-day window. The full_date column is datetime.date on Pandas and a Spark date-typed column on Spark; both serialize to the same ISO string, so astype(str) comparison is sufficient. Day-of-week numbering matches between engines because the Spark generator explicitly remaps dayofweek (Sunday=1 default) to ISO Monday=1/Sunday=7 to align with Pandas's dayofweek + 1 convention. Implication: No engine-parity bug in date_dimension.py. The script in examples/date_dimension_test/full_test.py doubles as a regression detector if either engine path drifts.

Session: 2026-04-30 β€” Star Schema End-to-End Build (Task 21, #225)

Work Done

  • Created examples/star_schema_e2e/config.yaml (173 LOC) β€” declarative ProjectConfig-validated star schema (dim_date, dim_customer SCD1, dim_product SCD2, fact_orders) with validation gates, FK lookups, and audit columns.
  • Created examples/star_schema_e2e/star_schema_e2e.py (415 LOC) β€” executable integration test driving DateDimensionPattern, DimensionPattern, scd2 transformer, and FactPattern against real Unity Catalog tables in eaai_dev.hardening_scratch.
  • Created examples/star_schema_e2e/README.md (~110 LOC) β€” usage + bug notes.
  • Verified end-to-end against 1121-215743-ak1cop0m (DBR 17.3 LTS, Spark 4.0.0) via Databricks Connect 17.3.7.
  • Surfaced three real framework bugs (P-009).

Verification Report

  • All 12 assertions pass: row counts (367/101/61/10025), unique SKs containing 0 on every dim, FK integrity 0 orphans on all three FKs (customer_sk, product_sk, order_date_sk), 25 unknown-member usages, 10 SCD2 history rows for changed products, exactly one is_current per product_id, calculated extended_amount consistent on all 10,025 rows, audit columns populated.
  • All 7 created tables dropped after run; managed-table count = 0.
  • Pipeline elapsed β‰ˆ 37s.
  • YAML config validates against ProjectConfig.

New Entries Added

  • [x] Pattern: P-009 β€” DimensionPattern SCD2 + Surrogate Keys Has Three Distinct Failures On Spark Connect (Workarounds Documented)
  • [x] Verified: V-013 β€” FactPattern + Dimension SCD2 + Unknown Member Compose Cleanly When Workarounds Applied

P-009: DimensionPattern SCD2 + Surrogate Keys Has Three Distinct Failures On Spark Connect βœ… RESOLVED

Pattern: Three independent bugs in odibi/patterns/dimension.py and odibi/patterns/fact.py make the documented "SCD2 dim with unknown member + surrogate key" path unusable on Spark Connect:

  1. _ensure_unknown_member line 602 calls spark.createDataFrame([row_values], cols) where valid_to is None. Spark Connect raises [CANNOT_DETERMINE_TYPE] Some of types cannot be determined after inferring. Fix: build the unknown-member row using spark.range(1).select(F.lit(...).cast(dtype).alias(name) ...) so every column has an explicit type.

  2. _execute_scd2 + scd2(use_delta_merge=True) mismatch: the optimized MERGE path inside the scd2 transformer writes history directly to the target and returns only the new/changed rows. DimensionPattern adds surrogate_key to that partial DataFrame; the caller then overwrites the target with this partial result, destroying the merged history. Setting use_delta_merge=False triggers the legacy path, which then fails in unionByName(rows_to_insert) because rows_to_insert follows the source schema (no surrogate_key) while final_target has it (Cannot resolve column name "<sk>" among (<source cols>)).

  3. FactPattern._join_dimension_spark lines 471-474 aliases both the dim_key and the surrogate_key columns as _dim_<col>. When dim_key == surrogate_key (e.g., date_sk β†’ date_sk), this creates two columns with the same name and triggers [AMBIGUOUS_REFERENCE] Reference '_dim_date_sk' is ambiguous. Fix: when dim_key == surrogate_key, only project the column once and reuse it as the SK alias.

Workarounds applied in examples/star_schema_e2e/star_schema_e2e.py: - For dim_product SCD2: bypass DimensionPattern entirely. Call the scd2 transformer directly (default Delta MERGE, history written to target by MERGE), re-read the target, assign sequential product_sk to any rows missing it, inject the unknown member with explicit casts, then overwrite. - For fact_orders dim_date FK: omit dim_date from dimensions:. Source already carries order_date_sk; verify the FK directly with LEFT JOIN dim_date ON f.order_date_sk = d.date_sk.

Modules: odibi/patterns/dimension.py (lines 467-538, 546-641), odibi/patterns/fact.py (lines 454-485).

RESOLVED (Task 26a): All three bugs fixed. Workarounds removed from star_schema_e2e.py. E2E test passes with 15/15 verifications. See T-035, T-036, T-037 below.

V-013: FactPattern + Dimension SCD2 + Unknown Member Compose Cleanly When Workarounds Applied

Verified: 2026-04-30 against DBR 17.3 LTS (Spark 4.0.0) via Databricks Connect Finding: With the P-009 workarounds in place, the full star-schema flow (date dim + SCD1 dim + SCD2 dim with history + fact with FK lookups, deduplication, grain validation, calculated measures, orphan→unknown-member) executes end-to-end against real Unity Catalog tables in 37s for 100 customers, 50 products, 10,025 orders + 25 orphans. SCD2 MERGE correctly produces 10 history rows when prices change on 10 of 50 products. All FK joins yield zero orphans. The unknown-member SK=0 row is used by 25 fact rows whose customer/product source IDs were absent from the dimensions. Implication: The pattern stack is functionally complete once the three P-009 bugs are fixed. The integration test examples/star_schema_e2e/star_schema_e2e.py doubles as a regression detector — if a future change re-introduces any of the three bugs, the script's existing assertions will catch it.

Session: 2026-05-01 β€” State Management E2E (Task 23, #225)

Work Done

  • Created campaign/17_state_management notebook (10 cells, ~200 LOC) β€” tests CatalogStateBackend HWM and run history with real Delta tables on ADLS via Unity Catalog external location.
  • 6/6 tests pass: hwm_roundtrip, hwm_batch, run_history, isolation, incremental, data_types.
  • Verified HWM roundtrip (string, int, update-in-place via MERGE), batch set (4 mixed types), run history (seed + retrieve latest), pipeline isolation (two pipelines, no cross-contamination), incremental load cycle (full β†’ HWM β†’ incremental β†’ 0 dupes), and JSON type preservation (int, float, bool, None, list, dict, nested dict, string).

Verification Report

  • Tests written: 6
  • Tests passing: 6/6
  • Concrete value assertions: HWM string/int values, MERGE row count = 2 (not 3), batch 4 keys verified, run status = FAILED (latest), metadata.rows = 950, incremental = 5 new rows, target = 25 total, 0 duplicate IDs, 8 data types roundtripped
  • Negative tests: get_hwm(nonexistent) = None, get_last_run_info(nonexistent_pipeline) = None
  • Edge cases tested: None value roundtrip (JSON null), nested dict 3 levels deep, HWM update (MERGE not append), pipeline key isolation after update

New Entries Added

  • [x] Pattern: P-010 β€” CatalogStateBackend Test Setup: ADLS Scratch Paths + Direct Delta Seeding
  • [x] Verified: V-014 β€” CatalogStateBackend HWM JSON Roundtrip Preserves All Python Types
  • [x] Verified: V-015 β€” CatalogStateBackend MERGE Correctly Updates HWM In-Place (No Row Duplication)
  • [x] Trap: T-029 β€” CatalogStateBackend.save_pipeline_run Is a No-Op β€” Seed meta_runs Directly

P-010: CatalogStateBackend Test Setup: ADLS Scratch Paths + Direct Delta Seeding

Date: 2026-05-01 Context: Testing CatalogStateBackend with real Delta on Databricks shared cluster (USER_ISOLATION) Pattern:

# 1. Use ADLS external location's files/ area for scratch Delta paths
ADLS_BASE = "abfss://databricks@azncuscdbksdeveaaistrg1.dfs.core.windows.net/"
SCRATCH_DIR = f"{ADLS_BASE}files/odibi_state_test"
STATE_PATH = f"{SCRATCH_DIR}/meta_state"
RUNS_PATH = f"{SCRATCH_DIR}/meta_runs"

# 2. Init backend with Spark session
backend = CatalogStateBackend(
    meta_runs_path=RUNS_PATH,
    meta_state_path=STATE_PATH,
    spark_session=spark,
    environment="test",
)

# 3. Seed meta_runs directly (save_pipeline_run is a no-op)
schema = StructType([
    StructField("pipeline_name", StringType(), False),
    StructField("node_name", StringType(), False),
    StructField("status", StringType(), False),       # "SUCCESS" or "FAILED"
    StructField("metrics_json", StringType(), True),   # JSON-encoded dict
    StructField("timestamp", TimestampType(), True),
])
df_runs = spark.createDataFrame(rows, schema)
df_runs.write.format("delta").mode("overwrite").save(RUNS_PATH)

# 4. Clean up with dbutils.fs.rm(path, recurse=True)
Why: CatalogStateBackend uses delta.\path`in MERGE SQL, requiring paths accessible to the Spark cluster. The ADLS external location'sfiles/directory works as scratch space without conflicting with UC managed storage (unitystorage/). Thesave_pipeline_runmethod is a no-op β€” CatalogManager handles run logging β€” so tests must seed the meta_runs Delta table directly with the exact schema above. **Modules:**odibi/state/__init.py`, any integration tests for CatalogStateBackend

T-029: CatalogStateBackend.save_pipeline_run Is a No-Op β€” Seed meta_runs Directly

Date: 2026-05-01 Symptom: Calling backend.log_run() or backend.save_pipeline_run() writes nothing to the meta_runs Delta table. get_last_run_info() returns None even after logging. Root Cause: CatalogStateBackend.save_pipeline_run is intentionally a no-op. Run history is written by CatalogManager (the orchestrator), not the state backend. The backend only reads run history via get_last_run_info() and get_last_run_status(). Fix: For integration tests, seed the meta_runs Delta table directly using spark.createDataFrame().write.format("delta").save(). Use the exact schema: (pipeline_name STRING, node_name STRING, status STRING, metrics_json STRING, timestamp TIMESTAMP). The status field uses string values "SUCCESS"/"FAILED" β€” get_last_run_info maps these to success: bool. The metrics_json field is JSON-decoded into metadata: dict. Modules: odibi/state/__init__.py β€” any test or campaign exercising run history

V-014: CatalogStateBackend HWM JSON Roundtrip Preserves All Python Types

Verified: 2026-05-01 against DBR 17.3 LTS (Spark 4.0.0) on shared cluster (USER_ISOLATION) Finding: set_hwm / get_hwm correctly roundtrips all common Python types through JSON serialization in the Delta meta_state table: int (42), float (3.14159), bool (True), None (stored as JSON null, retrieved as None), list ([1, "two", 3.0]), dict ({"a": 1, "b": [2, 3]}), nested dict (3 levels deep), and string. Values are stored in a value STRING column as JSON and deserialized on read. No type coercion or precision loss observed. Implication: Safe to use HWM for any JSON-serializable watermark value. No special handling needed for complex types.

V-015: CatalogStateBackend MERGE Correctly Updates HWM In-Place (No Row Duplication)

Verified: 2026-05-01 against DBR 17.3 LTS (Spark 4.0.0) on shared cluster (USER_ISOLATION) Finding: set_hwm uses Delta MERGE INTO (matched β†’ update, not matched β†’ insert). Calling set_hwm("key", 99999) then set_hwm("key", 100500) results in exactly 1 row for that key (value=100500), not 2 rows. The meta_state table schema is (key STRING, value STRING, environment STRING, updated_at TIMESTAMP). The MERGE condition matches on key + environment. Implication: HWM updates are idempotent and safe for retry loops. No deduplication needed after repeated set_hwm calls on the same key.

Session: 2026-05-01 β€” SCD2 NaN Fix Verification (Task 24, #248)

Work Done

  • Created campaign/18_scd2_nan_fix notebook (10 cells, ~200 LOC) β€” verifies SCD2 NaN handling across all 4 engine paths end-to-end.
  • 5/5 tests PASS: nan_equality (Pandas+Polars), nan_transitions (Pandas+Polars), float_epsilon (Pandas), spark_nan (Delta MERGE), duckdb_nan (IS DISTINCT FROM).
  • Updated T-009 to mark fully resolved across all engines. #248 closeable.
  • No source code changes needed β€” all 5 comparison paths already handle NaN correctly.

Verification Report

  • Tests written: 5
  • Tests passing: 5/5
  • Concrete value assertions: NaN==NaN β†’ 0 new versions, NaNβ†’99.9 detected, 42.5β†’NaN detected, epsilon perturbation (1e-12) β†’ 0 changes, DuckDB IS DISTINCT FROM NaN==NaN β†’ False
  • Negative tests: big_change beyond epsilon detected, NaN IS DISTINCT FROM value = True
  • Edge cases tested: NULL vs NaN (distinct in DuckDB), Polars NaNβ†’None parquet roundtrip, Spark Connect lazy DeltaTable.forName

New Entries Added

  • [x] Verified: V-016 β€” SCD2 NaN Handling Verified Across All 4 Engine Paths
  • [x] Trap: T-030 β€” Spark SCD2 Delta MERGE with ADLS Paths Fails on Spark Connect
  • [x] Trap: T-031 β€” EngineContext API Requires Concrete Context Subclass

V-016: SCD2 NaN Handling Verified Across All 4 Engine Paths

Verified: 2026-05-01 against DBR 17.3 LTS (Spark 4.0.0) on shared cluster (USER_ISOLATION) Finding: SCD2 change detection handles float NaN correctly on all engine paths: - Pandas fallback (has_changed): _safe_isna(val) via pd.isna() catches NaN; math.isclose(rel_tol=1e-9, abs_tol=1e-12) prevents epsilon false positives. Both NaN β†’ continue (no change). One NaN β†’ change. Float within tolerance β†’ no change. - DuckDB SQL (_scd2_pandas DuckDB branch): IS DISTINCT FROM treats NaN == NaN as False in DuckDB β€” unlike standard SQL where NaN is a regular float and NaN != NaN. No additional NaN handling needed. - Polars (_scd2_polars): ne_missing() already returns False for NaN vs NaN (Polars treats NaN as equal in missing-aware comparison). The is_nan() guard on Float32/Float64 is defensive but not strictly necessary. - Spark legacy (_scd2_spark): eqNullSafe() handles NaN == NaN β†’ True. - Spark MERGE (_scd2_spark_delta_merge): <=> operator in MERGE condition is NaN-safe. Implication: #248 closeable. No source code changes needed β€” the fix was already in place across all engines. The campaign/18_scd2_nan_fix notebook serves as a regression test.

T-030: Spark SCD2 Delta MERGE with ADLS Paths Fails on Spark Connect

Date: 2026-05-01 Symptom: scd2(target="abfss://...") with use_delta_merge=True fails with PATH_NOT_FOUND on first run, even though the initial write path should handle non-existent targets. Root Cause: DeltaTable.forName(spark, target) is lazy on Spark Connect (T-020). When given an ADLS URI like abfss://container@account/path, it returns a lazy object without throwing. This sets is_delta = True incorrectly, skipping the initial write path. Later, when the code tries to use the delta_table object, it fails with INVALID_IDENTIFIER (ADLS URIs are not valid SQL identifiers). The fallback _scd2_spark() then tries to read the non-existent target β†’ PATH_NOT_FOUND. Fix: Use UC managed table names (e.g., catalog.schema.table) instead of ADLS paths for Spark SCD2 targets. The DeltaTable.forName() API is designed for table names, not file paths. For path-based targets, the DeltaTable.isDeltaTable() + DeltaTable.forPath() fallback works on classic Spark but not reliably on Spark Connect. Modules: odibi/transformers/scd.py β€” _scd2_spark_delta_merge(), any Spark SCD2 campaign tests

T-031: EngineContext API Requires Concrete Context Subclass

Date: 2026-05-01 Symptom: TypeError: EngineContext.__init__() missing 1 required positional argument: 'context' when constructing EngineContext(df=df, engine_type=EngineType.PANDAS). Root Cause: EngineContext.__init__ signature changed to require a context: Context as the first positional arg. Context is an ABC with abstract methods (get, register, has, etc.). The concrete subclasses are PandasContext(), PolarsContext(), SparkContext(spark_session). Fix: Always pass the appropriate concrete context:

# Pandas
ctx = EngineContext(context=PandasContext(), df=df, engine_type=EngineType.PANDAS, engine=engine)
# Polars
ctx = EngineContext(context=PolarsContext(), df=df, engine_type=EngineType.POLARS, engine=engine)
# Spark
ctx = EngineContext(context=SparkContext(spark), df=df, engine_type=EngineType.SPARK, engine=engine)
Note: EngineContext.spark is a property that delegates to self.context.spark. It returns None for Pandas/Polars contexts. Modules: Any code constructing EngineContext directly (campaigns, tests, patterns)


Session: 2026-05-01 (Task 25 β€” YAML Validation Hardening)

V-017: Pydantic Validators Catch YAML Misconfigurations at Parse Time

Problem: Invalid YAML configs (bad names, hallucinated fields, wrong connections) only caught by odibi validate CLI, not at ProjectConfig(**data) parse time. Cause: Pydantic models in config.py lacked field-level and cross-model validators. Fix: Added 4 validators (92 LOC) to config.py: 1. NodeConfig.validate_node_name_format β€” Regex ^[a-zA-Z_][a-zA-Z0-9_]*$, rejects hyphens/dots/spaces/leading digits 2. PipelineConfig.validate_pipeline_name_format β€” Same regex 3. NodeConfig.check_hallucinated_fields β€” @model_validator(mode="before") detects source:, sink:, target:, outputs:, sql: in raw dict before Pydantic drops them 4. ProjectConfig.validate_node_connection_references β€” Cross-references all node read/write connections against connections: section Key Insight: model_validator(mode="before") runs on the raw dict input, so it can detect extra keys that Pydantic would otherwise silently ignore (when extra="ignore" is the default). Verified: 137 existing YAML configs (130 project + 5 pipeline files) still parse β€” 0 new validator failures. Modules: odibi/config.py β€” NodeConfig, PipelineConfig, ProjectConfig

T-032: Line Numbers Not Available from PyYAML safe_load

Problem: Task requested YAML error messages with line numbers. Cause: yaml.safe_load() (PyYAML) discards position info during parsing. Only ruamel.yaml preserves lc (line/column) attributes. Workaround: Not implemented. Would require switching YAML parser in utils/config_loader.py from PyYAML to ruamel.yaml. All error messages include the field name and value instead, which is sufficient for debugging. Modules: odibi/utils/config_loader.py

T-033: model_validator(mode="before") for Hallucination Detection

Problem: NodeConfig uses model_config = {"populate_by_name": True} without extra="forbid", so extra fields (hallucinated names like source:, sink:) are silently dropped. Fix: @model_validator(mode="before") runs BEFORE Pydantic processes the dict, so it can see all raw keys including those that would be ignored. This is the correct hook for detecting unknown/hallucinated fields without breaking configs that legitimately pass extra data. Caveat: Cannot raise errors for fields that ARE valid Pydantic fields (e.g., inputs: is a real NodeConfig field). Only check for known-wrong names. Modules: odibi/config.py β€” NodeConfig.check_hallucinated_fields


Session: 2026-05-01 (Task 26 β€” Pre-Existing Test Failures)

T-034: deltalake write_deltalake() engine="rust" Removed in v1.0+

Problem: 127 catalog + state test failures β€” TypeError: write_deltalake() got an unexpected keyword argument 'engine'. Cause: deltalake (delta-rs) >=1.0 removed the engine parameter because Rust became the default and only engine. The codebase still passed engine="rust" in 16 call sites across 8 files. Fix: Remove engine="rust" from all write_deltalake() calls. Rust is the default β€” no behavioral change. Files: odibi/catalog.py (2), odibi/state/__init__.py (3), odibi/engine/pandas_engine.py (1), + 5 test files (10). Result: 127 failures β†’ 0 failures. All 557 catalog tests pass. Modules: odibi/catalog.py, odibi/state/__init__.py, odibi/engine/pandas_engine.py

D-007: Pre-Existing Test Failures Were Mostly Already Fixed

Decision: The three documented failure categories (caplog ~25, pandas engine ~94, ADF profiler ~70-93) were all already resolved by prior work. Only the engine="rust" incompatibility remained as a real blocker causing 127 failures. The 16 failures in test_catalog_mock_engine_writes.py are stale Spark mock expectations β€” deferred to Task 26a. Rationale: Document what was already fixed to avoid future agents re-investigating resolved issues.

T-035: Spark Connect β€” Use spark.range(1).select() for Schema-Typed Single Rows

Pattern: spark.createDataFrame([values], column_names) fails on Spark Connect when any value is None, raising [CANNOT_DETERMINE_TYPE]. Instead, build single rows with spark.range(1).select(F.lit(val).cast(dtype).alias(name) for field in df.schema.fields). For string columns, use F.lit("Unknown"); for numeric/other types, use F.lit(None).cast(dtype) to avoid CAST_INVALID_INPUT. Where: odibi/patterns/dimension.py _ensure_unknown_member(), P-009 Bug 1.

T-036: SCD2 Delta MERGE Returns Partial Result β€” Re-Read Target for SK Assignment

Pattern: The scd2 transformer's Delta MERGE path writes history directly to the target table via MERGE INTO. The returned EngineContext.df contains only new/changed rows. If the calling pattern (e.g., DimensionPattern) assigns surrogate keys to this partial result and overwrites the target, the merged history is destroyed. Fix: after scd2() on Spark, re-read the full target with _load_existing_target() before assigning surrogate keys. Where: odibi/patterns/dimension.py _execute_scd2(), P-009 Bug 2.

T-037: FactPattern dim_key == sk_col Causes AMBIGUOUS_REFERENCE

Pattern: FactPattern._join_dimension_spark aliases both dim_key and sk_col as _dim_<col>. When they're the same column (e.g., date_sk β†’ date_sk), this creates duplicate column names β†’ [AMBIGUOUS_REFERENCE]. Fix: when dim_key == sk_col, project the column once. In Pandas, handle the analogous problem where the rename eliminates sk_col. Guard .drop(columns=[f"_dim_{dim_key}"]) with if col in df.columns. Where: odibi/patterns/fact.py _join_dimension_spark() and _join_dimension_pandas(), P-009 Bug 3.


Session: 2026-05-01 (Task 27 β€” row_number Transformer)

P-011: Adding a New SQL-First Transformer (End-to-End Recipe)

Recipe: The complete checklist for adding a new transformer to odibi: 1. Params: Add Pydantic model to sql_core.py with YAML example in docstring, Field() for defaults 2. Function: (context: EngineContext, params: Params) -> EngineContext β€” use context.sql(), get_logging_context(), time with elapsed_ms 3. Register: Add registry.register(module.fn, "name", module.Params) to odibi/transformers/__init__.py 4. Test: Create tests/unit/transformers/test_<name>.py β€” use PandasContext + PandasEngine + EngineContext fixture, assert concrete values 5. Docs: Update docs/skills/02_odibi_first_lookup.md (table + count), AGENTS.md (test entry), AGENT_CAMPAIGN.md, CUSTOM_INSTRUCTIONS.md (count) 6. Lint: ruff check --fix --no-cache + ruff format --no-cache on all touched files 7. Regression: Run full tests/unit/transformers/ suite to confirm 0 regressions Timing: ~30 min for a simple SQL transformer (row_number: 70 LOC prod + 211 LOC test). Modules: odibi/transformers/sql_core.py, odibi/transformers/__init__.py

V-018: ROW_NUMBER() OVER () Works on Both DuckDB and Spark SQL

Verified: ROW_NUMBER() OVER () (empty OVER clause β€” no PARTITION BY, no ORDER BY) is valid SQL on both DuckDB and Spark SQL. DuckDB assigns deterministic row numbers based on physical row order. Spark assigns arbitrary but stable numbers within a single partition. Both return valid sequential integers starting at 1. Caveat: Without ORDER BY, the numbering is non-deterministic across runs on Spark (data may be reshuffled). Tests that assert exact ordering without ORDER BY should sort results before comparison, or only assert that values are a permutation of [1..N]. Where: odibi/transformers/sql_core.py row_number(), tested in tests/unit/transformers/test_row_number.py

V-019: Full Suite Regression Count β€” 9,207 PASS / 5 FAIL (All Pre-Existing)

Verified: After adding row_number transformer (Task 27), the full test suite shows 9,207 passed (+14 from new tests), 5 failed (same pre-existing failures documented in D-007). Zero regressions introduced. Pre-existing failures: test_delta_format (delta-rs mock), test_env_override (LocalFileSystem), test_schema_lake_format (delta schema), test_dl_format_schema (delta schema), test_spark_real_session (shared cluster). Where: All 332 test files across tests/

P-012: Adding a Multi-Engine Transformer with Recursive Logic (flatten_struct Recipe)

Pattern: For transformers that need engine-specific recursion (e.g., walking nested struct schemas), keep the dispatcher thin and push all logic into per-engine helpers. The Pandas path used pd.json_normalize(max_level=depth-1) for built-in depth control; the Spark path used a recursive _collect_struct_fields() that walks StructType to build (dot_path, alias) pairs with backtick escaping for special-char field names. Key decisions: (1) Use .drop(column) instead of SELECT * EXCEPT for drop_source β€” avoids a second SQL round-trip and temp view. (2) Backtick-escape every field name in dot-path SQL (metadata.\my field`) since struct fields commonly have spaces/hyphens. (3) Polars raisesNotImplementedErrorwith a helpful message rather than a silent fallback. **Where:**odibi/transformers/advanced.py(#15), tested intests/unit/transformers/test_flatten_struct.py`

V-020: Spark Connect Subprocess Limitation β€” SparkSession Not Reachable from pytest Child Process

Verified: On Databricks shared clusters (USER_ISOLATION / Spark Connect), SparkSession.builder.getOrCreate() fails in subprocess-spawned pytest with [INVALID_CONNECT_URL] because the Spark Connect socket is only available to the notebook driver process. SparkSession.builder.master("local[1]") also fails with [CANNOT_CONFIGURE_SPARK_CONNECT_MASTER]. Workaround: Use SparkSession.getActiveSession() as the primary path (works in notebook cells). In the pytest fixture, try getActiveSession() first, fall back to builder.getOrCreate(), and pytest.skip() if both fail. Verify Spark tests via notebook direct execution (all 8 passed), and let them skip gracefully in subprocess CI. Where: tests/unit/transformers/test_flatten_struct.py spark_fixture()

V-021: Full Suite Regression Count β€” 9,224 PASS / 5 FAIL (Post-Task 28)

Verified: After adding flatten_struct transformer (Task 28) with Spark test coverage and drop_source simplification, the full test suite shows 9,224 passed (+17 from new Pandas tests, +8 Spark skipped in subprocess but verified in notebook), 5 failed (same pre-existing failures). Zero regressions introduced. Where: All test files across tests/

Session: 2026-05-01 (Task 29 β€” apply_mapping Transformer)

P-013: Cross-Dataset SQL Transformer Pattern (apply_mapping Recipe)

Pattern: When a transformer needs data from another named dataset (not just the current df), use context.get(name) to fetch it and context.register_temp_view(view_name, df) to make it available in SQL. This mirrors the join transformer's approach. The context.sql() dispatcher auto-registers the current df as "df" and replaces references, but additional datasets must be registered explicitly with a unique view name (e.g., _mapping_{source_name}). Key decisions: (1) Wrap context.get() in try/except KeyError β€” it raises, not returns None. (2) Use ROW_NUMBER() OVER (PARTITION BY key ORDER BY value) to dedup mapping tables, preventing row multiplication. (3) For column overwrite, use explicit projection (list all columns, replacing the target) rather than SELECT * EXCLUDE β€” avoids engine-specific SQL dialects. Where: odibi/transformers/advanced.py (#16), tested in tests/unit/transformers/test_apply_mapping.py

V-022: context.get() Raises KeyError, Not Returns None

Verified: PandasContext.get(name) raises KeyError when the dataset is not registered, not returns None. Transformer code must catch KeyError (or KeyError | ValueError) rather than checking if result is None. Discovered when test_missing_mapping_source failed β€” the if mapping_df is None guard never triggered. Where: odibi/context.py PandasContext.get(), odibi/transformers/advanced.py apply_mapping()

V-023: Full Suite Regression Count β€” 9,238 PASS / 5 FAIL (Post-Task 29)

Verified: After adding apply_mapping transformer (Task 29), the full test suite shows 9,238 passed (+14 new tests), 5 failed (same pre-existing). Zero regressions. Where: All test files across tests/