Supported File Formats

ODIBI's PandasEngine supports multiple file formats with both local and cloud storage (ADLS, S3, etc.).

Format Support Matrix

| Format  | Read | Write | ADLS | S3 | Local | Dependencies            |
|---------|------|-------|------|----|-------|-------------------------|
| CSV     | ✅   | ✅    | ✅   | ✅ | ✅    | pandas (built-in)       |
| Parquet | ✅   | ✅    | ✅   | ✅ | ✅    | pyarrow or fastparquet  |
| JSON    | ✅   | ✅    | ✅   | ✅ | ✅    | pandas (built-in)       |
| Excel   | ✅   | ✅    | ✅   | ✅ | ✅    | openpyxl                |
| Avro    | ✅   | ✅    | ✅   | ✅ | ✅    | fastavro                |

Format Details

CSV (Comma-Separated Values)

Use Case: Simple tabular data, human-readable format

Read Example:

- name: load_csv
  read:
    connection: bronze
    path: data/sales.csv
    format: csv
    options:
      sep: ","
      header: true

Write Example:

- name: save_csv
  write:
    connection: bronze
    path: output/results.csv
    format: csv
    mode: overwrite


Parquet (Apache Parquet)

Use Case: Data lake storage, recommended for production

Benefits:
- Columnar format (efficient for analytics)
- Built-in compression
- Type preservation
- Fast reads for column-based queries

Read Example:

- name: load_parquet
  read:
    connection: bronze
    path: data/sales.parquet
    format: parquet

Write Example:

- name: save_parquet
  write:
    connection: silver
    path: output/sales.parquet
    format: parquet
    options:
      compression: snappy  # or gzip, brotli


JSON (JSON Lines)

Use Case: Semi-structured data, nested objects

Format: JSON lines (newline-delimited JSON objects)

Read Example:

- name: load_json
  read:
    connection: bronze
    path: data/events.json
    format: json

Write Example:

- name: save_json
  write:
    connection: bronze
    path: output/events.json
    format: json
    options:
      orient: records
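
For reference, the JSON-lines handling above corresponds roughly to the following pandas calls (a sketch, not ODIBI's actual implementation):

```python
import io

import pandas as pd

# Newline-delimited JSON: one object per line
raw = '{"user": "a", "value": 1}\n{"user": "b", "value": 2}\n'

# Read JSON lines into a DataFrame
df = pd.read_json(io.StringIO(raw), lines=True)

# Write back as records (one JSON object per row)
out = df.to_json(orient="records", lines=True)
```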


Excel (Microsoft Excel)

Use Case: Business reports, spreadsheets, multi-file analysis

Supported: .xlsx files

Dependencies: Requires openpyxl

pip install openpyxl

Read Example (Single File):

- name: load_excel
  read:
    connection: bronze
    path: reports/sales.xlsx
    format: excel
    options:
      sheet_name: "Sheet1"

Reading Multiple Excel Files (Glob Patterns)

Read all Excel files matching a pattern and combine them into a single DataFrame:

- name: load_s_curve_data
  read:
    connection: bronze
    path: "S Curve/*.xlsx"
    format: excel
    options:
      add_source_file: true  # Track which file each row came from

Recursive Pattern (All Subdirectories):

- name: load_all_reports
  read:
    connection: bronze
    path: "data/**/*.xlsx"
    format: excel
    options:
      add_source_file: true

Sheet Pattern Matching

Filter sheets by name pattern instead of reading all sheets:

- name: load_powerbi_sheets
  read:
    connection: bronze
    path: "reports/*.xlsx"
    format: excel
    options:
      sheet_pattern:
        - "*powerbi*"
        - "*power bi*"
      sheet_pattern_case_sensitive: false  # Default: case-insensitive
      add_source_file: true

Source File Tracking

When add_source_file: true, two columns are added to track data lineage:
- _source_file: The file path (e.g., S Curve/Project_Alpha.xlsx)
- _source_sheet: The sheet name (e.g., Sheet1)

This is essential for debugging and auditing when combining multiple files.
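
Downstream, these lineage columns make such audits straightforward; for example, counting rows per originating file (a pandas sketch with a hand-built frame standing in for the combined output):

```python
import pandas as pd

# A combined DataFrame as produced with add_source_file: true
# (hand-built here for illustration)
df = pd.DataFrame({
    "amount": [100, 200, 300],
    "_source_file": [
        "S Curve/Project_Alpha.xlsx",
        "S Curve/Project_Alpha.xlsx",
        "S Curve/Project_Beta.xlsx",
    ],
    "_source_sheet": ["Sheet1", "Sheet1", "Sheet1"],
})

# Audit: row count per originating file
counts = df.groupby("_source_file").size()
```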

Cloud Storage (Azure/S3)

Excel reading works seamlessly with cloud storage:

- name: load_excel_from_azure
  read:
    connection: bronze  # Uses connection's storage_options
    path: "abfss://container@account.dfs.core.windows.net/reports/*.xlsx"
    format: excel
    options:
      sheet_pattern: ["*data*"]
      add_source_file: true

Write Example:

- name: save_excel
  write:
    connection: bronze
    path: output/report.xlsx
    format: excel

Note: All three engines (Pandas, Spark, Polars) support these Excel features with full parity.


Avro (Apache Avro)

Use Case: Event streaming, schema evolution

Benefits:
- Binary format (compact)
- Schema included in file
- Supports schema evolution
- Efficient for serialization

Dependencies: Requires fastavro

pip install fastavro

Read Example:

- name: load_avro
  read:
    connection: bronze
    path: events/stream.avro
    format: avro

Write Example:

- name: save_avro
  write:
    connection: bronze
    path: output/events.avro
    format: avro

Note: Avro schema is automatically inferred from DataFrame dtypes.


Cloud Storage Support

All formats work seamlessly with cloud storage:

ADLS (Azure Data Lake Storage):

connections:
  bronze:
    type: azure_blob
    account_name: mystorageaccount
    container: bronze
    auth:
      mode: key_vault
      key_vault_name: my-vault
      secret_name: storage-key

pipelines:
  - pipeline: multi_format
    nodes:
      - name: read_csv_from_adls
        read:
          connection: bronze
          path: data/sales.csv
          format: csv

      - name: write_avro_to_adls
        depends_on: [read_csv_from_adls]
        write:
          connection: bronze
          path: output/sales.avro
          format: avro

All formats support:
- ✅ Multi-account connections
- ✅ Key Vault authentication
- ✅ Storage options pass-through
- ✅ Remote URI handling (abfss://, s3://)


Best Practices

  1. Use Parquet for production data
     - Efficient storage
     - Fast analytics
     - Type preservation

  2. Use CSV for human-readable data
     - Easy to inspect
     - Compatible with all tools
     - Good for small datasets

  3. Use Avro for event streams
     - Schema evolution support
     - Compact binary format
     - Good for append-only logs

Performance Tips

Parquet:
- Use snappy compression (good balance of speed/size)
- Enable column pruning (read only needed columns)
- Consider partitioning for large datasets (Phase 2B)

CSV:
- Use chunking for large files
- Specify dtypes explicitly to avoid inference
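
Both CSV tips combined in plain pandas (a sketch using an in-memory file):

```python
import io

import pandas as pd

raw = "id,amount\n1,10.5\n2,20.25\n3,30.0\n"

# Explicit dtypes skip type inference; chunksize streams the file
# in pieces instead of loading it all at once
chunks = pd.read_csv(
    io.StringIO(raw),
    dtype={"id": "int64", "amount": "float64"},
    chunksize=2,
)
total = sum(chunk["amount"].sum() for chunk in chunks)
```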

Avro:
- Best for write-once, read-many workloads
- Schema is embedded (no separate schema files needed)


Delta Lake (Databricks / Open Source)

Use Case: ACID transactions, time travel, data lakehouse

Benefits:
- ACID Transactions: No partial writes or corruption
- Time Travel: Query previous versions of data
- Schema Evolution: Safely evolve schema over time
- Audit History: Track all changes to the table

Dependencies: Requires delta-spark (for Spark engine) or deltalake (for Pandas engine).

Read Example:

- name: load_delta
  read:
    connection: bronze
    path: data/sales.delta
    format: delta
    options:
      version_as_of: 5  # Time travel!

Write Example:

- name: save_delta
  write:
    connection: silver
    path: output/sales.delta
    format: delta
    mode: append  # or overwrite

Delta Lake supports advanced features such as VACUUM and RESTORE.


Database Formats

In addition to file formats, Odibi supports reading from and writing to SQL databases.

| Format                                           | Read | Write | Merge | Dependencies               |
|--------------------------------------------------|------|-------|-------|----------------------------|
| Azure SQL / SQL Server (sql_server, azure_sql)   | ✅   | ✅    | ✅    | pyodbc, sqlalchemy         |
| PostgreSQL (postgres, postgresql)                | ✅   | ✅    | ❌    | psycopg2-binary, sqlalchemy |

PostgreSQL

Use Case: Relational database for local or cloud-hosted analytical workloads

Read Example:

- name: load_from_postgres
  read:
    connection: pg_warehouse
    format: postgres
    path: public.orders

Write Example:

- name: write_to_postgres
  write:
    connection: pg_warehouse
    format: postgres
    path: public.order_summary
    mode: append

Installation:

pip install 'odibi[postgres]'

Note

PostgreSQL does not support merge mode. Use append or overwrite for write operations.
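
In pandas terms, append and overwrite map naturally onto DataFrame.to_sql's if_exists parameter; a sketch using SQLite as a stand-in database (the mode-to-if_exists mapping is an assumption about ODIBI's internals, not documented behavior):

```python
import sqlite3

import pandas as pd

conn = sqlite3.connect(":memory:")
df = pd.DataFrame({"order_id": [1, 2], "total": [9.99, 19.99]})

# mode: overwrite -> drop and recreate the table
df.to_sql("order_summary", conn, if_exists="replace", index=False)

# mode: append -> add rows to the existing table
df.to_sql("order_summary", conn, if_exists="append", index=False)

count = conn.execute("SELECT COUNT(*) FROM order_summary").fetchone()[0]
```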