Skip to content

MSSQL Integration

dpone supports SQL Server as a first-class source, sink, and state backend. Install the optional dependency with:

pip install "dpone[mssql]"

Runtime requirements for real bulk loads:

  • Microsoft ODBC Driver 18 for SQL Server
  • mssql-tools18 or another installation that provides the bcp executable
  • Network access from the runner to SQL Server port 1433
source:
  type: postgres
  connection_type: vault
  connection_id: postgres-demo
  vault_path: postgres/demo-source
  table: {schema: public, name: orders}
  options:
    export_format: mssql-delimited
    batch_commit_mode: whole
    compress_export: false

sink:
  type: mssql
  connection_type: vault
  connection_id: mssql-dwh
  vault_path: mssql/demo-dwh
  table: {schema: landing, name: orders}
  strategy:
    mode: incremental_merge
    unique_key: [id]
    merge_policy: delete_insert
    duplicate_policy: fail
  options:
    bulk:
      mode: bcp
      bcp:
        field_terminator: "\t"
        row_terminator: "\n"
        batch_size: 100000
        packet_size: 16384

state:
  type: mssql
  connection_id: mssql-dwh
  connection_type: vault
  vault_path: mssql/demo-dwh
  table: {schema: etl_state, name: etl_xmin_state}

Credentials

Vault, Airflow, and environment providers share the same logical fields:

Field Default Notes
host required SQL Server host
port 1433 SQL Server TCP port
database required Database name
username / password optional Omit for trusted auth where supported
driver ODBC Driver 18 for SQL Server pyodbc driver name
encrypt yes ODBC encryption flag
trust_server_certificate no Set yes for local Docker only
query_timeout 0 pyodbc query timeout
bcp_path bcp Absolute path if not on PATH

Loading behavior

  • FULL_REFRESH: bcp-load into staging, copy into a shadow table, then commit with a short metadata rename.
  • INCREMENTAL_APPEND: append all rows, or only_new_rows: true with unique_key for insert-not-exists.
  • INCREMENTAL_MERGE: load into staging, fail on duplicate unique_key, then finalize with merge_policy.
  • REPLACE: delete target rows matching sink.strategy.custom_predicate, then insert staged rows.
  • PARTITION_REPLACE: load a complete partition slice into staging, then replace target rows whose partition column appears in staging.

MSSQL merge_policy: auto resolves to delete_insert.

Physical design

Use Physical design for MSSQL target DDL controls:

sink:
  type: mssql
  options:
    physical_design:
      apply: online
      columns:
        amount:
          target_type:
            mssql: decimal(18,4)
      indexes:
        primary_key: [order_id]
      storage:
        mssql:
          compression: page
          clustered_columnstore: false

Planned SQL includes table creation, index DDL, and optional compression DDL. Compression rebuilds and columnstore changes are considered blocking on existing large tables, so dpone requires safe_window or manual_approval before applying them outside new-table creation.

Runbook:

  1. Use dpone schema physical-plan --manifest ... --format md.
  2. If the plan contains DATA_COMPRESSION, schedule a maintenance window.
  3. If exact decimal/string lengths matter, use schema_contract plus physical_design.columns.<column>.target_type.mssql.
  4. Keep bcp staging tables simple; apply business constraints to final tables through governed DDL only.
BEGIN TRANSACTION;

DELETE t
FROM [landing].[orders] AS t
WHERE EXISTS (
    SELECT 1
    FROM [staging].[orders_stg] AS s
    WHERE s.[id] = t.[id]
);

INSERT INTO [landing].[orders] ([id], [amount], ...)
SELECT [id], [amount], ...
FROM [staging].[orders_stg];

COMMIT;

Use merge_policy: shadow_swap when reader stability is more important than storage/IO cost:

CREATE TABLE [landing].[orders__dpone_shadow_ab12cd34] (...);

INSERT INTO [landing].[orders__dpone_shadow_ab12cd34]
SELECT *
FROM [landing].[orders] AS t
WHERE NOT EXISTS (
    SELECT 1 FROM [staging].[orders_stg] AS s WHERE s.[id] = t.[id]
);

INSERT INTO [landing].[orders__dpone_shadow_ab12cd34]
SELECT * FROM [staging].[orders_stg];

EXEC sp_rename N'landing.orders', N'orders__dpone_backup_ab12cd34';
EXEC sp_rename N'landing.orders__dpone_shadow_ab12cd34', N'orders';
DROP TABLE [landing].[orders__dpone_backup_ab12cd34];

For partition_replace, dpone first attempts native metadata switching when the table is certified for it:

sink:
  strategy:
    mode: partition_replace
    partition:
      column: business_date
      native_mode: required
      partition_function: pf_orders_business_date
      switch_out_schema: landing_switch
      switch_out_table_template: orders__switchout__{partition}
      max_partitions_per_run: 32

Native SQL shape:

ALTER TABLE [landing].[orders]
SWITCH PARTITION 42
TO [landing_switch].[orders__switchout__42] PARTITION 42;

ALTER TABLE [staging].[orders_stg]
SWITCH PARTITION 42
TO [landing].[orders] PARTITION 42;

Native SWITCH prerequisites:

  1. Target and staging are aligned to the same partition function.
  2. The target has no unsupported secondary indexes for automatic switching.
  3. Each switch_out_table_template table exists, is empty, and is aligned to the same partition metadata.
  4. max_partitions_per_run caps the number of partitions switched in one run.
  5. native_mode: required fails before fallback when any metadata check does not pass.

With native_mode: auto, dpone falls back to staging-first partition predicate replacement when SWITCH is unavailable:

DELETE t
FROM [landing].[orders] AS t
WHERE EXISTS (
    SELECT 1
    FROM [staging].[orders_stg] AS s
    WHERE s.[business_date] = t.[business_date]
);

INSERT INTO [landing].[orders]
SELECT * FROM [staging].[orders_stg];

MSSQL uses bcp character files by default. For high-throughput Postgres to MSSQL loads, set source.options.export_format: mssql-delimited, source.options.batch_commit_mode: whole, and compress_export: false. This path uses PostgreSQL COPY TO STDOUT to write a bcp-friendly UTF-8 file and avoids Python row-by-row serialization.

Production-safe bcp text encoding

bcp -c is delimiter-based and does not implement PostgreSQL COPY NULL markers or RFC-style CSV quoting. dpone therefore treats MSSQL bulk files as a native transport with an explicit reversible text codec whenever the target is MSSQL:

Source value Bulk file representation Target value after staging commit
NULL Empty bcp field NULL
Empty string "" \x1dE marker Empty string ""
Tab \x1dT Tab
LF newline \x1dN LF newline
CR newline \x1dR CR newline
Codec marker prefix \x1d \x1dP \x1d

This means NULL and "" are never allowed to collapse into the same target value on MSSQL loads. Python-generated streaming/in-memory bulk files apply the codec before writing files. Native Postgres COPY and MSSQL bcp queryout paths encode text columns in SQL before export, then MSSQL sink strategies decode them with set-based INSERT ... SELECT expressions when committing from staging to the final/shadow table.

The MSSQL-target codec is intentionally applied only when the sink is MSSQL. MSSQL source exports that feed ClickHouse direct TSV use a separate ClickHouse-compatible TabSeparated codec: NULL is emitted as \N, backslashes are escaped, and tab/newline/carriage-return characters become \t, \n, and \r before the file reaches clickhouse-client or HTTP streaming.

Do not rely on CSV quotes to protect tabs or newlines in bcp files. If a custom file producer bypasses dpone's codec, it must either produce values that do not contain the configured field/row terminators or use a compatible lossless encoding layer before bcp in.

dpone also makes the bcp in null contract explicit by passing -k (KEEPNULLS) for MSSQL imports. Target tables should still keep defaults out of staging columns; defaults belong in final-table business logic, not in transport tables.

Passwords are redacted in diagnostics. For SQL authentication, dpone avoids putting the password in bcp argv by default and sends it through the bcp prompt stdin. If your platform-specific bcp build cannot prompt on stdin, use trusted authentication or set an explicit wrapper policy in your environment after assessing process-list exposure.

Character bcp is disabled by default for MSSQL-only types that are not safe as delimiter text: binary, varbinary, image, rowversion/timestamp, sql_variant, hierarchyid, geometry, and geography. Use mssql_bcp_file_format: native for MSSQL-native transfer artifacts, or set allow_unsafe_raw_mssql_bulk_files: true only for a manually certified file producer.

For large FULL_REFRESH loads, dpone uses a staging-first shadow swap. Files are bulk-loaded into a staging table, copied into a shadow target table, and committed with a short metadata rename. Direct target bcp and target TRUNCATE are intentionally not used as the default production path.

For INCREMENTAL_MERGE, merge_policy: delete_insert is the default because it is the fastest set-based MSSQL upsert path for bounded deltas. Use merge_policy: shadow_swap when you need the old rebuild/swap behavior and can afford the extra storage and IO.

For REPLACE, dpone keeps the shadow-table finalization path for bounded predicate windows. Existing readers should point at the canonical table name.

For encrypted ODBC Driver 18 connections, keep bulk.bcp.packet_size: 16384 unless you have tested a larger value against your driver/server combination. Larger packet sizes can fail with SSL packet-size errors on macOS/Linux clients.

Schema evolution

MSSQL sink participates in automatic schema evolution. Before bcp/staging load, dpone reads INFORMATION_SCHEMA.COLUMNS, applies safe ALTER TABLE statements, and only then creates the staging table.

sink:
  type: mssql
  options:
    schema_evolution:
      enabled: true
      mode: widening
      on_type_change: new_column

See Schema evolution for schema drift behavior, Type mapping matrix for type conversion policy, and Source -> sink matrix for pair-specific runbooks.

Most-used MSSQL paths:

Physical deletes and reconciliation

dpone supports physical-delete propagation for SQL Server sources through snapshot reconciliation. On each reconciliation run, the runtime reads the configured unique_key columns from the source table, writes the key snapshot into the centralized reconciliation tech tables, compares it with the previous snapshot, and applies a soft delete in the target table for keys that disappeared from the source.

For source.type: mssql, the snapshot read uses MSSQLConnector.build_select_query(...) and streaming pyodbc reads. For sink.type: mssql, the delete application uses parameterized SQL Server UPDATE statements:

UPDATE [landing].[orders]
SET [__dpone__deleted_at] = SYSUTCDATETIME(),
    [__dpone__loaded_at] = SYSUTCDATETIME()
WHERE [__dpone__deleted_at] IS NULL
  AND [id] IN (?, ?, ?)

Composite keys are supported with parameterized OR groups. Large delete batches are chunked below SQL Server's 2100-parameter limit.

Recommended manifest shape:

source:
  type: mssql
  connection_type: vault
  connection_id: mssql-source
  vault_path: mssql/demo-source
  table: {schema: dbo, name: orders}

sink:
  type: mssql
  connection_type: vault
  connection_id: mssql-dwh
  vault_path: mssql/demo-dwh
  table: {schema: landing, name: orders}
  strategy:
    mode: incremental_merge
    unique_key: [id]
  reconciliation:
    enabled: true

state:
  type: mssql
  connection_id: mssql-dwh
  connection_type: vault
  vault_path: mssql/demo-dwh

Operational notes:

  • Reconciliation is snapshot-based, so it catches physical deletes even when the source has no CDC or tombstone column.
  • The target table must contain __dpone__deleted_at and __dpone__loaded_at; normal dpone sink strategies manage these technical columns.
  • Tech reconciliation snapshots are currently centralized in BigQuery. Use MSSQL CDC or Change Tracking readers when you need near-real-time delete events without full key snapshots.

Local live gate and stress testing

Install the local Microsoft tools on macOS:

brew tap microsoft/mssql-release https://github.com/Microsoft/homebrew-mssql-release
HOMEBREW_ACCEPT_EULA=Y brew install unixodbc msodbcsql18 mssql-tools18

Run the optional MSSQL integration test against a local SQL Server endpoint:

DPONE_IT_MSSQL_HOST=127.0.0.1 \
DPONE_IT_MSSQL_PORT=15433 \
DPONE_IT_MSSQL_DATABASE=dpone \
DPONE_IT_MSSQL_USER=sa \
DPONE_IT_MSSQL_PASSWORD='Dpone_Strong_12345!' \
DPONE_IT_MSSQL_TRUST_SERVER_CERTIFICATE=yes \
DPONE_IT_MSSQL_BCP_PATH=/opt/homebrew/bin/bcp \
uv run pytest -m integration_mssql -q

Run the reproducible local stress harness:

PYTHONUNBUFFERED=1 uv run python tools/mssql_stress.py \
  --rows 1000000 \
  --batch-size 100000 \
  --bcp-path /opt/homebrew/bin/bcp

The stress harness exercises the real runtime path:

  • Postgres synthetic source generation.
  • Postgres COPY TO STDOUT to mssql-delimited artifact.
  • MSSQL staging-table bcp in followed by shadow-table commit for full refresh.
  • MSSQL bcp queryout.
  • ClickHouse native insert batches.

Observed local benchmark on Apple Silicon Docker Desktop with official SQL Server 2022 linux/amd64 image:

Flow Rows Time Throughput
Postgres synthetic source preparation 15,000,000 26.946s 556,667 rows/s
Postgres to MSSQL full refresh 15,000,000 96.580s 155,311 rows/s
MSSQL to ClickHouse full refresh 15,000,000 124.653s 120,333 rows/s

The same workload should be benchmarked again on native x86_64 Linux before using these numbers as production capacity planning inputs.

Performance roadmap

The next major speedup should come from partitioned parallel extraction and loading, similar to Spark JDBC:

  • partition_column: monotonic numeric/date column used to split source scans.
  • lower_bound / upper_bound: explicit scan bounds for reproducibility.
  • num_partitions: number of parallel extract workers.
  • partition_stride: optional fixed-width partition size.
  • Per-partition artifacts loaded into independent staging tables, then finalized with a single set-based target operation.

Additional high-impact optimizations:

  • Use native bcp into staging tables, then commit destructive strategies with shadow-table swaps.
  • Keep direct target operations limited to short metadata swaps or append commits from staging.
  • Add native ClickHouse file ingestion for MSSQL exported TSV/CSV to reduce Python row coercion overhead.
  • Support bcp queryout partition fan-out for MSSQL sources.
  • Add resumable partition manifests so failed large loads restart from the failed partition, not from zero.

Troubleshooting

  • pyodbc is required: install dpone[mssql] and the Microsoft ODBC Driver.
  • bcp failed: check bcp_path, network access, SQL login permissions, and whether the command works outside dpone.
  • Copy direction must be either 'in', 'out' or 'format': use dpone with the fixed bcp command order; raw bcp commands must follow bcp <table-or-query> in|queryout <file> [options].
  • Text column data incomplete: verify bcp terminators are escaped as \t and \n, not literal control characters in shell arguments.
  • Empty strings become NULL: ensure the file was produced through dpone's MSSQL bulk text codec. Raw third-party TSV files cannot distinguish empty string from NULL in SQL Server bcp character mode.
  • Tabs/newlines split rows or columns: use dpone-generated mssql-delimited artifacts or enable a compatible producer-side text codec before loading.
  • bcp asks for a password and hangs: verify your installed bcp supports stdin prompt input. Prefer trusted auth where available, or use an environment-local wrapper after assessing process-list exposure.
  • MSSQL character bcp is unsafe: switch MSSQL-to-MSSQL native transfer to source.options.bulk.bcp.file_format: native, or explicitly certify and allow your raw producer with allow_unsafe_raw_mssql_bulk_files: true.
  • SSL Provider: Packet size too large: lower bulk.bcp.packet_size to 16384.
  • Decimal truncation: ensure Postgres numeric precision/scale are preserved, for example numeric(18,2).
  • Unicode issues: use ODBC Driver 18 and bcp code page 65001; for older SQL Server/client combinations, prefer NVARCHAR columns.
  • Local Docker: set trust_server_certificate: yes because the container certificate is self-signed.

Partitioned performance mode

For very large PostgreSQL -> MSSQL and MSSQL -> ClickHouse transfers, dpone supports Spark JDBC-style range partitioning:

source:
  options:
    partition_column: id
    lower_bound: 1
    upper_bound: 15000000
    num_partitions: 8
    export_workers: 4
    load_workers: 4

Each range is exported as an independent local artifact and can be loaded through native bcp. See Performance for SLO commands, 15M stress examples, and x86_64 benchmark recommendations.

MSSQL -> ClickHouse fast ingest

For MSSQL -> ClickHouse, use clickhouse_bulk.mode: http when the ClickHouse HTTP endpoint is reachable. This streams bcp queryout TSV files directly into ClickHouse with INSERT ... FORMAT TabSeparated and avoids Python row parsing.

sink:
  type: clickhouse
  options:
    clickhouse_bulk:
      mode: http
      http:
        host: clickhouse.example.com
        port: 8123

See Performance for local benchmark commands and verified 15M results.

Native queryout for MSSQL -> ClickHouse

For MSSQL -> ClickHouse, prefer source.options.extract_mode: bcp_queryout with nested source.options.partitioning. dpone resolves bounds: auto with a MSSQL MIN/MAX/COUNT_BIG query, exports every partition through bcp queryout, and hands file artifacts to the ClickHouse sink for direct INSERT ... FORMAT TabSeparated loading.

Key MSSQL options:

source:
  options:
    extract_mode: bcp_queryout
    bulk:
      mode: bcp
      bcp:
        batch_size: 100000
        packet_size: 65535
        timeout_seconds: 3600
    partitioning:
      strategy: auto
      column: order_id
      bounds: auto
      target_rows_per_partition: 1000000
      max_partitions: 64
      export_workers: 8

Runbook: if the source is overloaded, reduce export_workers; if artifacts are too small, increase target_rows_per_partition; if bounds are skewed, choose a more uniformly distributed partition column.