MSSQL Integration¶
dpone supports SQL Server as a first-class source, sink, and state backend.
Install the optional dependency with:
Runtime requirements for real bulk loads:
- Microsoft ODBC Driver 18 for SQL Server
mssql-tools18or another installation that provides thebcpexecutable- Network access from the runner to SQL Server port
1433
Recommended manifest¶
source:
type: postgres
connection_type: vault
connection_id: postgres-demo
vault_path: postgres/demo-source
table: {schema: public, name: orders}
options:
export_format: mssql-delimited
batch_commit_mode: whole
compress_export: false
sink:
type: mssql
connection_type: vault
connection_id: mssql-dwh
vault_path: mssql/demo-dwh
table: {schema: landing, name: orders}
strategy:
mode: incremental_merge
unique_key: [id]
merge_policy: delete_insert
duplicate_policy: fail
options:
bulk:
mode: bcp
bcp:
field_terminator: "\t"
row_terminator: "\n"
batch_size: 100000
packet_size: 16384
state:
type: mssql
connection_id: mssql-dwh
connection_type: vault
vault_path: mssql/demo-dwh
table: {schema: etl_state, name: etl_xmin_state}
Credentials¶
Vault, Airflow, and environment providers share the same logical fields:
| Field | Default | Notes |
|---|---|---|
host |
required | SQL Server host |
port |
1433 |
SQL Server TCP port |
database |
required | Database name |
username / password |
optional | Omit for trusted auth where supported |
driver |
ODBC Driver 18 for SQL Server |
pyodbc driver name |
encrypt |
yes |
ODBC encryption flag |
trust_server_certificate |
no |
Set yes for local Docker only |
query_timeout |
0 |
pyodbc query timeout |
bcp_path |
bcp |
Absolute path if not on PATH |
Loading behavior¶
FULL_REFRESH: bcp-load into staging, copy into a shadow table, then commit with a short metadata rename.INCREMENTAL_APPEND: append all rows, oronly_new_rows: truewithunique_keyfor insert-not-exists.INCREMENTAL_MERGE: load into staging, fail on duplicateunique_key, then finalize withmerge_policy.REPLACE: delete target rows matchingsink.strategy.custom_predicate, then insert staged rows.PARTITION_REPLACE: load a complete partition slice into staging, then replace target rows whose partition column appears in staging.
MSSQL merge_policy: auto resolves to delete_insert.
Physical design¶
Use Physical design for MSSQL target DDL controls:
sink:
type: mssql
options:
physical_design:
apply: online
columns:
amount:
target_type:
mssql: decimal(18,4)
indexes:
primary_key: [order_id]
storage:
mssql:
compression: page
clustered_columnstore: false
Planned SQL includes table creation, index DDL, and optional compression DDL.
Compression rebuilds and columnstore changes are considered blocking on existing
large tables, so dpone requires safe_window or manual_approval before
applying them outside new-table creation.
Runbook:
- Use
dpone schema physical-plan --manifest ... --format md. - If the plan contains
DATA_COMPRESSION, schedule a maintenance window. - If exact decimal/string lengths matter, use
schema_contractplusphysical_design.columns.<column>.target_type.mssql. - Keep bcp staging tables simple; apply business constraints to final tables through governed DDL only.
BEGIN TRANSACTION;
DELETE t
FROM [landing].[orders] AS t
WHERE EXISTS (
SELECT 1
FROM [staging].[orders_stg] AS s
WHERE s.[id] = t.[id]
);
INSERT INTO [landing].[orders] ([id], [amount], ...)
SELECT [id], [amount], ...
FROM [staging].[orders_stg];
COMMIT;
Use merge_policy: shadow_swap when reader stability is more important than storage/IO cost:
CREATE TABLE [landing].[orders__dpone_shadow_ab12cd34] (...);
INSERT INTO [landing].[orders__dpone_shadow_ab12cd34]
SELECT *
FROM [landing].[orders] AS t
WHERE NOT EXISTS (
SELECT 1 FROM [staging].[orders_stg] AS s WHERE s.[id] = t.[id]
);
INSERT INTO [landing].[orders__dpone_shadow_ab12cd34]
SELECT * FROM [staging].[orders_stg];
EXEC sp_rename N'landing.orders', N'orders__dpone_backup_ab12cd34';
EXEC sp_rename N'landing.orders__dpone_shadow_ab12cd34', N'orders';
DROP TABLE [landing].[orders__dpone_backup_ab12cd34];
For partition_replace, dpone first attempts native metadata switching when the table is certified for it:
sink:
strategy:
mode: partition_replace
partition:
column: business_date
native_mode: required
partition_function: pf_orders_business_date
switch_out_schema: landing_switch
switch_out_table_template: orders__switchout__{partition}
max_partitions_per_run: 32
Native SQL shape:
ALTER TABLE [landing].[orders]
SWITCH PARTITION 42
TO [landing_switch].[orders__switchout__42] PARTITION 42;
ALTER TABLE [staging].[orders_stg]
SWITCH PARTITION 42
TO [landing].[orders] PARTITION 42;
Native SWITCH prerequisites:
- Target and staging are aligned to the same partition function.
- The target has no unsupported secondary indexes for automatic switching.
- Each
switch_out_table_templatetable exists, is empty, and is aligned to the same partition metadata. max_partitions_per_runcaps the number of partitions switched in one run.native_mode: requiredfails before fallback when any metadata check does not pass.
With native_mode: auto, dpone falls back to staging-first partition predicate replacement when SWITCH is unavailable:
DELETE t
FROM [landing].[orders] AS t
WHERE EXISTS (
SELECT 1
FROM [staging].[orders_stg] AS s
WHERE s.[business_date] = t.[business_date]
);
INSERT INTO [landing].[orders]
SELECT * FROM [staging].[orders_stg];
MSSQL uses bcp character files by default. For high-throughput Postgres to MSSQL loads, set source.options.export_format: mssql-delimited, source.options.batch_commit_mode: whole, and compress_export: false. This path uses PostgreSQL COPY TO STDOUT to write a bcp-friendly UTF-8 file and avoids Python row-by-row serialization.
Production-safe bcp text encoding¶
bcp -c is delimiter-based and does not implement PostgreSQL COPY NULL markers
or RFC-style CSV quoting. dpone therefore treats MSSQL bulk files as a native
transport with an explicit reversible text codec whenever the target is MSSQL:
| Source value | Bulk file representation | Target value after staging commit |
|---|---|---|
NULL |
Empty bcp field | NULL |
Empty string "" |
\x1dE marker |
Empty string "" |
| Tab | \x1dT |
Tab |
| LF newline | \x1dN |
LF newline |
| CR newline | \x1dR |
CR newline |
Codec marker prefix \x1d |
\x1dP |
\x1d |
This means NULL and "" are never allowed to collapse into the same target
value on MSSQL loads. Python-generated streaming/in-memory bulk files apply the
codec before writing files. Native Postgres COPY and MSSQL bcp queryout
paths encode text columns in SQL before export, then MSSQL sink strategies decode
them with set-based INSERT ... SELECT expressions when committing from staging
to the final/shadow table.
The MSSQL-target codec is intentionally applied only when the sink is MSSQL.
MSSQL source exports that feed ClickHouse direct TSV use a separate
ClickHouse-compatible TabSeparated codec: NULL is emitted as \N, backslashes
are escaped, and tab/newline/carriage-return characters become \t, \n, and
\r before the file reaches clickhouse-client or HTTP streaming.
Do not rely on CSV quotes to protect tabs or newlines in bcp files. If a custom
file producer bypasses dpone's codec, it must either produce values that do not
contain the configured field/row terminators or use a compatible lossless
encoding layer before bcp in.
dpone also makes the bcp in null contract explicit by passing -k
(KEEPNULLS) for MSSQL imports. Target tables should still keep defaults out of
staging columns; defaults belong in final-table business logic, not in transport
tables.
Passwords are redacted in diagnostics. For SQL authentication, dpone avoids
putting the password in bcp argv by default and sends it through the bcp prompt
stdin. If your platform-specific bcp build cannot prompt on stdin, use trusted
authentication or set an explicit wrapper policy in your environment after
assessing process-list exposure.
Character bcp is disabled by default for MSSQL-only types that are not safe as
delimiter text: binary, varbinary, image, rowversion/timestamp,
sql_variant, hierarchyid, geometry, and geography. Use
mssql_bcp_file_format: native for MSSQL-native transfer artifacts, or set
allow_unsafe_raw_mssql_bulk_files: true only for a manually certified file
producer.
For large FULL_REFRESH loads, dpone uses a staging-first shadow swap. Files are bulk-loaded into a staging table, copied into a shadow target table, and committed with a short metadata rename. Direct target bcp and target TRUNCATE are intentionally not used as the default production path.
For INCREMENTAL_MERGE, merge_policy: delete_insert is the default because it is the fastest set-based MSSQL upsert path for bounded deltas. Use merge_policy: shadow_swap when you need the old rebuild/swap behavior and can afford the extra storage and IO.
For REPLACE, dpone keeps the shadow-table finalization path for bounded predicate windows. Existing readers should point at the canonical table name.
For encrypted ODBC Driver 18 connections, keep bulk.bcp.packet_size: 16384 unless you have tested a larger value against your driver/server combination. Larger packet sizes can fail with SSL packet-size errors on macOS/Linux clients.
Schema evolution¶
MSSQL sink participates in automatic schema evolution. Before bcp/staging load,
dpone reads INFORMATION_SCHEMA.COLUMNS, applies safe ALTER TABLE statements,
and only then creates the staging table.
sink:
type: mssql
options:
schema_evolution:
enabled: true
mode: widening
on_type_change: new_column
See Schema evolution for schema drift behavior, Type mapping matrix for type conversion policy, and Source -> sink matrix for pair-specific runbooks.
Most-used MSSQL paths:
- PostgreSQL -> MSSQL
- MSSQL -> MSSQL
- MSSQL -> ClickHouse
- MSSQL -> PostgreSQL
- MSSQL -> Kafka
- Kafka -> MSSQL
- REST API -> MSSQL
Physical deletes and reconciliation¶
dpone supports physical-delete propagation for SQL Server sources through snapshot reconciliation. On each reconciliation run, the runtime reads the configured unique_key columns from the source table, writes the key snapshot into the centralized reconciliation tech tables, compares it with the previous snapshot, and applies a soft delete in the target table for keys that disappeared from the source.
For source.type: mssql, the snapshot read uses MSSQLConnector.build_select_query(...) and streaming pyodbc reads. For sink.type: mssql, the delete application uses parameterized SQL Server UPDATE statements:
UPDATE [landing].[orders]
SET [__dpone__deleted_at] = SYSUTCDATETIME(),
[__dpone__loaded_at] = SYSUTCDATETIME()
WHERE [__dpone__deleted_at] IS NULL
AND [id] IN (?, ?, ?)
Composite keys are supported with parameterized OR groups. Large delete batches are chunked below SQL Server's 2100-parameter limit.
Recommended manifest shape:
source:
type: mssql
connection_type: vault
connection_id: mssql-source
vault_path: mssql/demo-source
table: {schema: dbo, name: orders}
sink:
type: mssql
connection_type: vault
connection_id: mssql-dwh
vault_path: mssql/demo-dwh
table: {schema: landing, name: orders}
strategy:
mode: incremental_merge
unique_key: [id]
reconciliation:
enabled: true
state:
type: mssql
connection_id: mssql-dwh
connection_type: vault
vault_path: mssql/demo-dwh
Operational notes:
- Reconciliation is snapshot-based, so it catches physical deletes even when the source has no CDC or tombstone column.
- The target table must contain
__dpone__deleted_atand__dpone__loaded_at; normal dpone sink strategies manage these technical columns. - Tech reconciliation snapshots are currently centralized in BigQuery. Use MSSQL CDC or Change Tracking readers when you need near-real-time delete events without full key snapshots.
Local live gate and stress testing¶
Install the local Microsoft tools on macOS:
brew tap microsoft/mssql-release https://github.com/Microsoft/homebrew-mssql-release
HOMEBREW_ACCEPT_EULA=Y brew install unixodbc msodbcsql18 mssql-tools18
Run the optional MSSQL integration test against a local SQL Server endpoint:
DPONE_IT_MSSQL_HOST=127.0.0.1 \
DPONE_IT_MSSQL_PORT=15433 \
DPONE_IT_MSSQL_DATABASE=dpone \
DPONE_IT_MSSQL_USER=sa \
DPONE_IT_MSSQL_PASSWORD='Dpone_Strong_12345!' \
DPONE_IT_MSSQL_TRUST_SERVER_CERTIFICATE=yes \
DPONE_IT_MSSQL_BCP_PATH=/opt/homebrew/bin/bcp \
uv run pytest -m integration_mssql -q
Run the reproducible local stress harness:
PYTHONUNBUFFERED=1 uv run python tools/mssql_stress.py \
--rows 1000000 \
--batch-size 100000 \
--bcp-path /opt/homebrew/bin/bcp
The stress harness exercises the real runtime path:
- Postgres synthetic source generation.
- Postgres
COPY TO STDOUTtomssql-delimitedartifact. - MSSQL staging-table
bcp infollowed by shadow-table commit for full refresh. - MSSQL
bcp queryout. - ClickHouse native insert batches.
Observed local benchmark on Apple Silicon Docker Desktop with official SQL Server 2022 linux/amd64 image:
| Flow | Rows | Time | Throughput |
|---|---|---|---|
| Postgres synthetic source preparation | 15,000,000 | 26.946s | 556,667 rows/s |
| Postgres to MSSQL full refresh | 15,000,000 | 96.580s | 155,311 rows/s |
| MSSQL to ClickHouse full refresh | 15,000,000 | 124.653s | 120,333 rows/s |
The same workload should be benchmarked again on native x86_64 Linux before using these numbers as production capacity planning inputs.
Performance roadmap¶
The next major speedup should come from partitioned parallel extraction and loading, similar to Spark JDBC:
partition_column: monotonic numeric/date column used to split source scans.lower_bound/upper_bound: explicit scan bounds for reproducibility.num_partitions: number of parallel extract workers.partition_stride: optional fixed-width partition size.- Per-partition artifacts loaded into independent staging tables, then finalized with a single set-based target operation.
Additional high-impact optimizations:
- Use native
bcpinto staging tables, then commit destructive strategies with shadow-table swaps. - Keep direct target operations limited to short metadata swaps or append commits from staging.
- Add native ClickHouse file ingestion for MSSQL exported TSV/CSV to reduce Python row coercion overhead.
- Support
bcp queryoutpartition fan-out for MSSQL sources. - Add resumable partition manifests so failed large loads restart from the failed partition, not from zero.
Troubleshooting¶
pyodbc is required: installdpone[mssql]and the Microsoft ODBC Driver.bcp failed: checkbcp_path, network access, SQL login permissions, and whether the command works outside dpone.Copy direction must be either 'in', 'out' or 'format': use dpone with the fixed bcp command order; raw bcp commands must followbcp <table-or-query> in|queryout <file> [options].Text column data incomplete: verify bcp terminators are escaped as\tand\n, not literal control characters in shell arguments.- Empty strings become
NULL: ensure the file was produced through dpone's MSSQL bulk text codec. Raw third-party TSV files cannot distinguish empty string fromNULLin SQL Server bcp character mode. - Tabs/newlines split rows or columns: use dpone-generated
mssql-delimitedartifacts or enable a compatible producer-side text codec before loading. bcpasks for a password and hangs: verify your installedbcpsupports stdin prompt input. Prefer trusted auth where available, or use an environment-local wrapper after assessing process-list exposure.MSSQL character bcp is unsafe: switch MSSQL-to-MSSQL native transfer tosource.options.bulk.bcp.file_format: native, or explicitly certify and allow your raw producer withallow_unsafe_raw_mssql_bulk_files: true.SSL Provider: Packet size too large: lowerbulk.bcp.packet_sizeto16384.- Decimal truncation: ensure Postgres numeric precision/scale are preserved, for example
numeric(18,2). - Unicode issues: use ODBC Driver 18 and bcp code page
65001; for older SQL Server/client combinations, prefer NVARCHAR columns. - Local Docker: set
trust_server_certificate: yesbecause the container certificate is self-signed.
Partitioned performance mode¶
For very large PostgreSQL -> MSSQL and MSSQL -> ClickHouse transfers, dpone supports Spark JDBC-style range partitioning:
source:
options:
partition_column: id
lower_bound: 1
upper_bound: 15000000
num_partitions: 8
export_workers: 4
load_workers: 4
Each range is exported as an independent local artifact and can be loaded through native bcp. See Performance for SLO commands, 15M stress examples, and x86_64 benchmark recommendations.
MSSQL -> ClickHouse fast ingest¶
For MSSQL -> ClickHouse, use clickhouse_bulk.mode: http when the ClickHouse HTTP endpoint is reachable. This streams bcp queryout TSV files directly into ClickHouse with INSERT ... FORMAT TabSeparated and avoids Python row parsing.
sink:
type: clickhouse
options:
clickhouse_bulk:
mode: http
http:
host: clickhouse.example.com
port: 8123
See Performance for local benchmark commands and verified 15M results.
Native queryout for MSSQL -> ClickHouse¶
For MSSQL -> ClickHouse, prefer source.options.extract_mode: bcp_queryout with nested source.options.partitioning. dpone resolves bounds: auto with a MSSQL MIN/MAX/COUNT_BIG query, exports every partition through bcp queryout, and hands file artifacts to the ClickHouse sink for direct INSERT ... FORMAT TabSeparated loading.
Key MSSQL options:
source:
options:
extract_mode: bcp_queryout
bulk:
mode: bcp
bcp:
batch_size: 100000
packet_size: 65535
timeout_seconds: 3600
partitioning:
strategy: auto
column: order_id
bounds: auto
target_rows_per_partition: 1000000
max_partitions: 64
export_workers: 8
Runbook: if the source is overloaded, reduce export_workers; if artifacts are too small, increase target_rows_per_partition; if bounds are skewed, choose a more uniformly distributed partition column.