ClickHouse Integration¶
dpone supports ClickHouse as an analytical source and sink. The production
focus is fast append/full-refresh ingest, MSSQL/PostgreSQL file streaming,
bounded upserts, and snapshot reconciliation that avoids mutation-heavy
ALTER TABLE ... UPDATE correction paths.
Install¶
For maximum MSSQL -> ClickHouse throughput, prefer HTTP streaming or
clickhouse-client bulk ingest instead of Python row parsing. See
Performance for local 15M benchmark commands.
MSSQL -> ClickHouse TabSeparated safety¶
The fastest MSSQL -> ClickHouse path exports with SQL Server bcp queryout and
streams the file into ClickHouse with INSERT ... FORMAT TabSeparated.
TabSeparated is also delimiter-based, so raw MSSQL text containing tabs or
newlines is unsafe.
When the MSSQL source sees a ClickHouse sink using direct client/HTTP bulk mode,
dpone wraps exported columns in SQL expressions that produce ClickHouse-safe
values:
| MSSQL source value | File value for ClickHouse |
|---|---|
NULL |
\\N |
| Backslash | \\\\ |
| Tab | \\t |
| LF newline | \\n |
| CR newline | \\r |
This avoids Python row parsing while preserving row and column boundaries for
large exports. If clickhouse_bulk.mode is set to python, driver, or
native_driver, dpone does not apply the direct TSV wrapper and the Python file
reader path is used instead.
MSSQL exact type fidelity¶
For MSSQL sources, ClickHouse table creation uses a lossless-first mapper:
decimal(p,s)andnumeric(p,s)becomeDecimal(p,s), notFloat64.moneyandsmallmoneybecome fixed-scaleDecimal.uniqueidentifierbecomesUUID.datetime2(p)becomesDateTime64(p).datetimeoffset(p)usestype_fidelity.temporal.offset_timestamp; defaultutc_instantbecomesDateTime64(p, 'UTC'), whilefixed_timezone,preserve_offsetandpreserve_textare explicit opt-ins.binary,varbinary,rowversionland asStringunless an explicit byte codec policy is configured.time(p)lands asStringby default, or asUInt32seconds since midnight when explicitly configured.
Use dpone plan to inspect the type_fidelity section before first production load. For certification, prefer the MSSQL -> ClickHouse typed_hash reconciliation profile because it compares typed values instead of connector-specific JSON formatting.
source:
type: mssql
options:
type_fidelity:
binary_encoding: hex
time_encoding: seconds_since_midnight
temporal:
offset_timestamp:
mode: utc_instant
timezone: UTC
See MSSQL -> ClickHouse for policy values, tradeoffs and the typed reconciliation runbook.
Incremental merge policies¶
ClickHouse merge_policy: auto resolves to lightweight_delete_insert.
sink:
type: clickhouse
table: {schema: analytics, name: orders}
strategy:
mode: incremental_merge
unique_key: [id]
merge_policy: lightweight_delete_insert
duplicate_policy: fail
mutations_sync: 1
Default SQL shape:
DELETE FROM analytics.orders
WHERE id IN (SELECT id FROM analytics.orders__dpone_staging_ab12cd34)
SETTINGS mutations_sync = 1;
INSERT INTO analytics.orders
SELECT * FROM analytics.orders__dpone_staging_ab12cd34;
Supported policies:
| Policy | Status | Notes |
|---|---|---|
lightweight_delete_insert |
Default | Fast bounded upsert; ClickHouse performs logical deletes and cleans data asynchronously. |
shadow_swap |
Supported | Rebuilds a full shadow table and swaps it into the canonical name; heavier but reader-friendly. |
mutation_delete_insert |
Non-recommended opt-in | Uses ALTER TABLE ... DELETE; requires allow_non_recommended_policy: true. |
Non-recommended mutation opt-in:
sink:
strategy:
mode: incremental_merge
unique_key: [id]
merge_policy: mutation_delete_insert
allow_non_recommended_policy: true
mutations_sync: 2
Partition replace¶
Use partition_replace when the source provides a complete replacement slice
for one or more ClickHouse partitions.
sink:
type: clickhouse
strategy:
mode: partition_replace
partition:
column: business_date
values_from_staging: true
max_partitions_per_run: 32
Runtime shape:
CREATE TABLE analytics.orders__dpone_staging_ab12cd34 AS analytics.orders;
INSERT INTO analytics.orders__dpone_staging_ab12cd34
SELECT ...;
ALTER TABLE analytics.orders
REPLACE PARTITION '2026-06-03'
FROM analytics.orders__dpone_staging_ab12cd34;
The staging table is created from target metadata so partition expressions and
engines remain compatible for REPLACE PARTITION ... FROM.
Physical deletes without update mutations¶
ClickHouse implements ALTER TABLE ... UPDATE and classic ALTER TABLE ... DELETE
as table mutations. Mutations are asynchronous, can accumulate in
system.mutations, and make reconciliation SLOs hard to reason about. For this
reason, dpone does not use ALTER TABLE ... UPDATE for ClickHouse snapshot
reconciliation.
The default ClickHouse reconciliation strategy is shadow_table_swap:
- Stage disappeared source keys in a temporary
Memorytable. - Create an empty shadow table with the same structure as the target table.
- Copy all target rows into the shadow table.
- For rows matching staged deleted keys, set
__dpone__deleted_at = now()and refresh__dpone__loaded_at = now(). - Rename the original table to a backup and the shadow table to the canonical target name.
- Drop temporary artifacts.
This keeps raw target-table semantics correct without ClickHouse mutations. It is heavier than an append-only tombstone strategy because it rewrites the table, but it is deterministic and avoids mutation backlog.
Staging-first load strategies¶
ClickHouse sink loads data into staging tables before touching the canonical target table:
FULL_REFRESH: load into staging, then rename staging into the target name.INCREMENTAL_APPEND: load into staging, then append from staging into target.INCREMENTAL_MERGE: load into staging, validate duplicate keys, then finalize withlightweight_delete_insert,shadow_swap, or guardedmutation_delete_insert.REPLACE: load into staging, copy non-replaced target rows into a shadow table, append staging rows into the shadow table, then rename the shadow table into the target name.PARTITION_REPLACE: create staging from target metadata, load replacement rows, then runALTER TABLE target REPLACE PARTITION ... FROM stagingfor each staged partition value.
The sink does not use target TRUNCATE. mutation_delete_insert is available
only as an explicit, non-recommended opt-in.
Schema evolution¶
ClickHouse sink applies safe schema evolution before staging:
ADD COLUMNfor new nullable/default-safe columns;MODIFY COLUMNonly for planner-approved widening;__dpone__nc__<column>generated columns for explicit incompatible type-change routing.
Schema evolution does not use ClickHouse mutations. Full refresh, replace, and reconciliation still use staging/shadow table swaps. See Schema evolution.
LowCardinality and physical design¶
Use Physical design to control ClickHouse engines,
PARTITION BY, ORDER BY, codecs, and LowCardinality behavior.
sink:
type: clickhouse
options:
physical_design:
storage:
clickhouse:
engine: MergeTree
partition_by: toYYYYMM(business_date)
order_by: [business_date, order_id]
low_cardinality:
mode: auto
max_distinct_values: 10000
max_distinct_ratio: 0.05
Modes:
| Mode | Behavior |
|---|---|
off |
Never generate LowCardinality. |
auto |
Use profiler cardinality thresholds for string-like columns. |
explicit |
Apply only to configured columns. |
force |
Apply to configured columns even when profiler disagrees, with warning. |
preserve |
Keep existing/source LowCardinality, but do not generate new ones. |
Runbook:
- If low-cardinality columns are not selected, inspect
dpone schema inferand verify sample coverage. - If high-cardinality columns were selected, switch to
explicitoroff. - If downstream DDL must be exact, use
physical_design.columns.<column>.target_type.clickhouse. - Run
dpone schema physical-plan --manifest ... --format mdbefore changing existing tables.
Requirements¶
- Reconciliation must be enabled and configured with a stable
unique_key. - The ClickHouse target table must contain
__dpone__loaded_atand__dpone__deleted_at. - The connector must be able to read
system.columnsfor target schema discovery. - The target database should support atomic
RENAME TABLE; ClickHouseAtomicdatabases do.
Example¶
source:
type: mssql
connection_id: mssql_source
table: {schema: dbo, name: orders}
sink:
type: clickhouse
connection_id: clickhouse_dwh
table: {schema: analytics, name: orders}
strategy:
mode: incremental_append
unique_key: [id]
reconciliation:
enabled: true
Operational guidance¶
- Use ClickHouse reconciliation for periodic correctness gates, not per-minute high-churn delete streams.
- For high-delete workloads, prefer CDC/Change Tracking into a versioned table design or run reconciliation by partition.
- Keep readers on the canonical table name; dpone swaps the shadow table into that name only after the full rewrite has completed.
- If a run fails before the swap, dpone drops the shadow table and leaves the canonical target unchanged.
Direct TSV ingest for MSSQL native transfers¶
MSSQL -> ClickHouse native transfer uses ClickHouse direct INSERT ... FORMAT TabSeparated whenever HTTP bulk or clickhouse-client is configured. The Python parsing fallback remains available, but it is not the preferred path for large tables.
Recommended options:
sink:
options:
clickhouse_bulk:
mode: auto
insert_settings:
async_insert: 1
wait_for_async_insert: 1
max_insert_block_size: 1000000
input_format_parallel_parsing: 1
If async_insert: 1 is set without wait_for_async_insert, dpone automatically sets wait_for_async_insert: 1 so the source state is not advanced before ClickHouse accepts the insert.