
Bulk Copy for SQL Server

High-performance data loading to SQL Server using ADLS staging and BULK INSERT, achieving 10-50x faster writes compared to standard JDBC.

Performance Tiers

Odibi provides two levels of SQL Server write optimization:

1. JDBC Bulk Copy Protocol (Default - All Databases)

Enabled automatically for all SQL Server writes via useBulkCopyForBatchInsert=true. No configuration needed.

| Database | Support | Speedup |
| --- | --- | --- |
| Azure SQL Database | ✅ Works | 5-10x |
| Azure SQL Managed Instance | ✅ Works | 5-10x |
| Azure Synapse Analytics | ✅ Works | 5-10x |
| SQL Server (on-prem) | ✅ Works | 5-10x |

This uses the Microsoft JDBC driver's bulk copy API for batch inserts - faster than row-by-row inserts, with no external staging required.

2. File-Based Bulk Copy (All Azure SQL Platforms)

For maximum performance with very large datasets, use bulk_copy: true to stage data in ADLS and load via BULK INSERT.

| Database | Format | Support | Speedup |
| --- | --- | --- | --- |
| Azure Synapse Analytics | Parquet | ✅ Full support | 10-50x |
| SQL Server 2022+ | Parquet | ✅ With PolyBase | 10-50x |
| Azure SQL Database | CSV | ✅ Supported | 10-50x |
| Azure SQL Managed Instance | CSV | ✅ Supported | 10-50x |

Auto-detection: Odibi automatically detects your database type and uses the optimal format:

  • Synapse/SQL Server 2022+: Parquet (faster, columnar)
  • Azure SQL DB/MI: CSV with proper escaping and 0x0a row terminators for Linux compatibility

First-run optimization: On the initial load (empty target table), odibi skips the staging table and MERGE entirely and performs a direct BULK INSERT into the target for maximum speed.

Overview

Bulk copy works by:

  1. Writing data to Azure Data Lake Storage (ADLS) as a staging file
  2. Using SQL Server BULK INSERT to load directly from blob storage
  3. Achieving 10-50x faster writes compared to JDBC

The format is auto-selected based on your database type (Parquet for Synapse, CSV for Azure SQL DB).
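
For Azure SQL targets, step 2 comes down to a BULK INSERT that reads the staged file through an external data source pointing at the staging container. A minimal sketch of that statement (table, file, and data source names mirror the examples below and are illustrative, not the exact SQL odibi emits):

-- The FROM path is relative to the container configured on the external data source
BULK INSERT dw.fact_sales
FROM 'odibi_staging/bulk/wideworldimporters/load_fact_tables/fact_sales_to_sql/a1b2c3d4.csv'
WITH (
    DATA_SOURCE = 'OdibiBulkStagingPROD',
    FORMAT = 'CSV'
);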

This approach is ideal for:

  • Loading millions of rows in seconds
  • Overwrite operations (full table refresh)
  • MERGE operations with large staging datasets
  • ETL pipelines running on Databricks with SQL Server targets

Complete Examples

Example 1: Overwrite with Bulk Copy

Full table refresh using bulk copy - replaces all data in the target table:

project: wideworldimporters

connections:
  # Source data lake
  adls_silver:
    type: azure_blob
    account_name: ${AZURE_STORAGE_ACCOUNT}
    container: datalake
    auth:
      mode: account_key
      account_key: ${AZURE_STORAGE_KEY}

  # Staging for bulk copy (can be same as source)
  adls_staging:
    type: azure_blob
    account_name: ${AZURE_STORAGE_ACCOUNT}
    container: datalake
    auth:
      mode: account_key
      account_key: ${AZURE_STORAGE_KEY}

  # SQL Server target
  sql_warehouse:
    type: sql_server
    host: ${SQL_SERVER}
    database: ${SQL_DATABASE}
    auth:
      mode: sql_login
      username: ${SQL_USER}
      password: ${SQL_PASSWORD}

pipelines:
  - pipeline: load_fact_tables
    nodes:
      - name: fact_sales_to_sql
        read:
          connection: adls_silver
          path: silver/fact_sales/
          format: parquet
        write:
          connection: sql_warehouse
          format: sql_server
          table: dw.fact_sales
          mode: overwrite
          overwrite_options:
            bulk_copy: true
            staging_connection: adls_staging
            external_data_source: OdibiBulkStagingPROD
            # Optional settings
            auto_create_schema: true
            auto_create_table: true
            audit_cols:
              created_col: created_ts
              updated_col: updated_ts
            schema_evolution:
              mode: evolve
              add_columns: true

Example 2: Merge with Bulk Copy

Incremental sync using bulk copy for staging - inserts new rows, updates changed rows:

project: wideworldimporters

connections:
  adls_silver:
    type: azure_blob
    account_name: ${AZURE_STORAGE_ACCOUNT}
    container: datalake
    auth:
      mode: account_key
      account_key: ${AZURE_STORAGE_KEY}

  adls_staging:
    type: azure_blob
    account_name: ${AZURE_STORAGE_ACCOUNT}
    container: datalake
    auth:
      mode: account_key
      account_key: ${AZURE_STORAGE_KEY}

  sql_warehouse:
    type: sql_server
    host: ${SQL_SERVER}
    database: ${SQL_DATABASE}
    auth:
      mode: sql_login
      username: ${SQL_USER}
      password: ${SQL_PASSWORD}

pipelines:
  - pipeline: sync_dimensions
    nodes:
      - name: dim_customer_merge
        read:
          connection: adls_silver
          path: silver/dim_customer/
          format: parquet
        write:
          connection: sql_warehouse
          format: sql_server
          table: dw.dim_customer
          mode: merge
          merge_keys:
            - customer_id
          merge_options:
            bulk_copy: true
            staging_connection: adls_staging
            external_data_source: OdibiBulkStagingPROD
            # MERGE behavior
            update_condition: "source._hash_diff != target._hash_diff"
            exclude_columns:
              - _hash_diff
            # Table setup
            auto_create_schema: true
            auto_create_table: true
            audit_cols:
              created_col: created_ts
              updated_col: updated_ts

Example 3: Auto-Setup Mode

Let odibi create the SQL Server objects automatically (requires elevated permissions):

nodes:
  - name: fact_orders_to_sql
    read:
      connection: adls_silver
      path: silver/fact_orders/
      format: parquet
    write:
      connection: sql_warehouse
      format: sql_server
      table: dw.fact_orders
      mode: overwrite
      overwrite_options:
        bulk_copy: true
        staging_connection: adls_staging
        auto_setup: true  # Creates external data source automatically
        auto_create_table: true

With auto_setup: true, odibi will:

  1. Create database master key (if needed)
  2. Create credential from your ADLS connection auth
  3. Create external data source named odibi_adls_staging

Staging Path Organization

Staging files are organized by project/pipeline/node for easy debugging:

datalake/odibi_staging/bulk/{project}/{pipeline}/{node}/{uuid}.parquet

Example:

datalake/odibi_staging/bulk/wideworldimporters/load_fact_tables/fact_sales_to_sql/a1b2c3d4.parquet

This prevents conflicts when multiple nodes run in parallel and makes it easy to identify orphaned files.

Configuration Reference

Overwrite Options

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| bulk_copy | boolean | No | false | Enable bulk copy mode |
| staging_connection | string | When bulk_copy=true | - | Connection name for ADLS staging |
| staging_path | string | No | odibi_staging/bulk | Path prefix for staging files |
| external_data_source | string | No | Auto-generated | SQL Server external data source name |
| auto_setup | boolean | No | false | Auto-create SQL Server objects |
| keep_staging_files | boolean | No | false | Retain staging files after load |
| auto_create_schema | boolean | No | false | Create schema if missing |
| auto_create_table | boolean | No | false | Create table if missing |
| audit_cols | object | No | - | Add created_ts/updated_ts columns |
| schema_evolution | object | No | - | Auto-add new columns |

Merge Options

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| bulk_copy | boolean | No | false | Enable bulk copy for staging |
| staging_connection | string | When bulk_copy=true | - | Connection name for ADLS staging |
| staging_path | string | No | odibi_staging/bulk | Path prefix for staging files |
| external_data_source | string | No | Auto-generated | SQL Server external data source name |
| auto_setup | boolean | No | false | Auto-create SQL Server objects |
| keep_staging_files | boolean | No | false | Retain staging files after load |
| update_condition | string | No | - | SQL condition for WHEN MATCHED UPDATE |
| delete_condition | string | No | - | SQL condition for WHEN MATCHED DELETE |
| exclude_columns | list | No | - | Columns to exclude from MERGE |
| auto_create_schema | boolean | No | false | Create schema if missing |
| auto_create_table | boolean | No | false | Create table if missing |
| audit_cols | object | No | - | Add created_ts/updated_ts columns |

Authentication Methods

When auto_setup: true, odibi reads your ADLS connection auth and creates the matching SQL Server credential:

Account Key

connections:
  adls_staging:
    type: azure_blob
    account_name: mystorageaccount
    container: datalake
    auth:
      mode: account_key
      account_key: ${STORAGE_ACCOUNT_KEY}

SAS Token

Important: For Azure SQL Database BULK INSERT, use SAS token authentication. The SAS token must have:

  • Resource types: Container AND Object (both required)
  • Permissions: Read, List

connections:
  adls_staging:
    type: azure_adls  # or azure_blob
    account_name: mystorageaccount
    container: datalake
    auth:
      mode: sas_token
      sas_token: ${STORAGE_SAS_TOKEN}  # Without leading '?' - odibi handles it

The external data source location must include the container: https://account.blob.core.windows.net/container

Managed Identity (Azure SQL Only)

connections:
  adls_staging:
    type: azure_blob
    account_name: mystorageaccount
    container: datalake
    auth:
      mode: aad_msi

Manual SQL Server Setup

If you can't use auto_setup: true, create the objects manually:

USE your_database;

-- Create master key (one-time)
IF NOT EXISTS (SELECT 1 FROM sys.symmetric_keys WHERE name = '##MS_DatabaseMasterKey##')
    CREATE MASTER KEY ENCRYPTION BY PASSWORD = 'YourStrongPassword123!';

-- Create credential with a SAS token
-- (BLOB_STORAGE data sources require a SAS token or managed identity, not the raw account key)
CREATE DATABASE SCOPED CREDENTIAL OdibiBulkCredential
WITH IDENTITY = 'SHARED ACCESS SIGNATURE',
SECRET = '<your-sas-token-without-leading-?>';

-- Create external data source
CREATE EXTERNAL DATA SOURCE OdibiBulkStagingPROD
WITH (
    TYPE = BLOB_STORAGE,
    LOCATION = 'https://<account>.blob.core.windows.net/<container>',
    CREDENTIAL = OdibiBulkCredential
);

Then reference it in your YAML:

overwrite_options:
  bulk_copy: true
  staging_connection: adls_staging
  external_data_source: OdibiBulkStagingPROD  # Must match SQL object name

Performance Comparison

| Method | 1M Rows | 10M Rows | 100M Rows |
| --- | --- | --- | --- |
| JDBC (row-by-row) | ~5 min | ~50 min | Hours |
| Bulk Copy | ~10 sec | ~60 sec | ~8 min |

Actual performance varies based on network, SQL Server resources, and data complexity.

Troubleshooting

"External data source not found"

Either:

  1. Set auto_setup: true to create it automatically
  2. Ensure the external_data_source name matches exactly what was created in SQL Server
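
To verify what exists, you can list the external data sources defined in the target database:

-- Shows each external data source and the storage location it points at
SELECT name, location, type_desc
FROM sys.external_data_sources;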

"Cannot bulk load - permission denied"

For manual setup, the SQL user needs ADMINISTER BULK OPERATIONS permission:

GRANT ADMINISTER BULK OPERATIONS TO [your_user];

For auto_setup, you need elevated permissions:

GRANT ALTER ANY EXTERNAL DATA SOURCE TO [your_user];
GRANT CONTROL ON DATABASE::your_db TO [your_user];

Staging files not cleaned up

If keep_staging_files: false but files remain:

  • Check that the staging connection has delete permissions
  • Check logs for cleanup errors

Technical Details

CSV Format for Azure SQL DB

When targeting Azure SQL Database, odibi writes CSV files with:

  • Quote character: "
  • Escape character: " (doubled quotes)
  • Row terminator: 0x0a (hex format for Linux/Databricks compatibility)
  • Encoding: UTF-8 (CODEPAGE 65001)
  • Field quote: Enabled for proper parsing
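
These settings map directly onto BULK INSERT options. A sketch of roughly what the generated load statement looks like for an Azure SQL Database target (object and file names are illustrative):

BULK INSERT dw.fact_sales
FROM 'odibi_staging/bulk/<project>/<pipeline>/<node>/<uuid>.csv'
WITH (
    DATA_SOURCE = 'OdibiBulkStagingPROD',
    FORMAT = 'CSV',            -- RFC 4180 parsing; doubled quotes act as escapes
    FIELDQUOTE = '"',
    ROWTERMINATOR = '0x0a',    -- hex LF for files written on Linux/Databricks
    CODEPAGE = '65001'         -- UTF-8
);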

Automatic Data Sanitization

Odibi automatically sanitizes data before CSV export to prevent BULK INSERT errors:

| Data Type | Issue | Automatic Fix |
| --- | --- | --- |
| String | Embedded newlines break row parsing | Replaced with spaces |
| Decimal | Scientific notation (3.479E-7) not parseable | Formatted as fixed-point (0.00000034790000) |
| Float/Double | Scientific notation | Formatted to %.15f |
| Boolean | true/false not recognized | Converted to 1/0 |
| Timestamp | Inconsistent formats | Formatted as yyyy-MM-dd HH:mm:ss.SSS |
| Date | Inconsistent formats | Formatted as yyyy-MM-dd |

This happens transparently - no configuration needed.

ODBC Driver

For Databricks, use ODBC Driver 18 (preferred over 17):

  • TLS 1.3 support
  • Better connection resilience
  • Note: May require TrustServerCertificate=yes for Azure SQL

Limitations

  • Spark engine only: Bulk copy requires PySpark for Parquet/CSV writing
  • ADLS/Blob staging required: Cannot use local filesystem for staging
  • SQL Server 2017+: Requires external data source support

See Also