Skip to content

ClickHouse Integration

dpone supports ClickHouse as an analytical source and sink. The production focus is fast append/full-refresh ingest, MSSQL/PostgreSQL file streaming, bounded upserts, and snapshot reconciliation that avoids mutation-heavy ALTER TABLE ... UPDATE correction paths.

Install

pip install "dpone[clickhouse]"

For maximum MSSQL -> ClickHouse throughput, prefer HTTP streaming or clickhouse-client bulk ingest instead of Python row parsing. See Performance for local 15M benchmark commands.

MSSQL -> ClickHouse TabSeparated safety

The fastest MSSQL -> ClickHouse path exports with SQL Server bcp queryout and streams the file into ClickHouse with INSERT ... FORMAT TabSeparated. TabSeparated is also delimiter-based, so raw MSSQL text containing tabs or newlines is unsafe.

When the MSSQL source sees a ClickHouse sink using direct client/HTTP bulk mode, dpone wraps exported columns in SQL expressions that produce ClickHouse-safe values:

MSSQL source value File value for ClickHouse
NULL \\N
Backslash \\\\
Tab \\t
LF newline \\n
CR newline \\r

This avoids Python row parsing while preserving row and column boundaries for large exports. If clickhouse_bulk.mode is set to python, driver, or native_driver, dpone does not apply the direct TSV wrapper and the Python file reader path is used instead.

MSSQL exact type fidelity

For MSSQL sources, ClickHouse table creation uses a lossless-first mapper:

  • decimal(p,s) and numeric(p,s) become Decimal(p,s), not Float64.
  • money and smallmoney become fixed-scale Decimal.
  • uniqueidentifier becomes UUID.
  • datetime2(p) becomes DateTime64(p).
  • datetimeoffset(p) uses type_fidelity.temporal.offset_timestamp; default utc_instant becomes DateTime64(p, 'UTC'), while fixed_timezone, preserve_offset and preserve_text are explicit opt-ins.
  • binary, varbinary, rowversion land as String unless an explicit byte codec policy is configured.
  • time(p) lands as String by default, or as UInt32 seconds since midnight when explicitly configured.

Use dpone plan to inspect the type_fidelity section before first production load. For certification, prefer the MSSQL -> ClickHouse typed_hash reconciliation profile because it compares typed values instead of connector-specific JSON formatting.

source:
  type: mssql
  options:
    type_fidelity:
      binary_encoding: hex
      time_encoding: seconds_since_midnight
      temporal:
        offset_timestamp:
          mode: utc_instant
          timezone: UTC

See MSSQL -> ClickHouse for policy values, tradeoffs and the typed reconciliation runbook.

Incremental merge policies

ClickHouse merge_policy: auto resolves to lightweight_delete_insert.

sink:
  type: clickhouse
  table: {schema: analytics, name: orders}
  strategy:
    mode: incremental_merge
    unique_key: [id]
    merge_policy: lightweight_delete_insert
    duplicate_policy: fail
    mutations_sync: 1

Default SQL shape:

DELETE FROM analytics.orders
WHERE id IN (SELECT id FROM analytics.orders__dpone_staging_ab12cd34)
SETTINGS mutations_sync = 1;

INSERT INTO analytics.orders
SELECT * FROM analytics.orders__dpone_staging_ab12cd34;

Supported policies:

Policy Status Notes
lightweight_delete_insert Default Fast bounded upsert; ClickHouse performs logical deletes and cleans data asynchronously.
shadow_swap Supported Rebuilds a full shadow table and swaps it into the canonical name; heavier but reader-friendly.
mutation_delete_insert Non-recommended opt-in Uses ALTER TABLE ... DELETE; requires allow_non_recommended_policy: true.

Non-recommended mutation opt-in:

sink:
  strategy:
    mode: incremental_merge
    unique_key: [id]
    merge_policy: mutation_delete_insert
    allow_non_recommended_policy: true
    mutations_sync: 2

Partition replace

Use partition_replace when the source provides a complete replacement slice for one or more ClickHouse partitions.

sink:
  type: clickhouse
  strategy:
    mode: partition_replace
    partition:
      column: business_date
      values_from_staging: true
      max_partitions_per_run: 32

Runtime shape:

CREATE TABLE analytics.orders__dpone_staging_ab12cd34 AS analytics.orders;

INSERT INTO analytics.orders__dpone_staging_ab12cd34
SELECT ...;

ALTER TABLE analytics.orders
REPLACE PARTITION '2026-06-03'
FROM analytics.orders__dpone_staging_ab12cd34;

The staging table is created from target metadata so partition expressions and engines remain compatible for REPLACE PARTITION ... FROM.

Physical deletes without update mutations

ClickHouse implements ALTER TABLE ... UPDATE and classic ALTER TABLE ... DELETE as table mutations. Mutations are asynchronous, can accumulate in system.mutations, and make reconciliation SLOs hard to reason about. For this reason, dpone does not use ALTER TABLE ... UPDATE for ClickHouse snapshot reconciliation.

The default ClickHouse reconciliation strategy is shadow_table_swap:

  1. Stage disappeared source keys in a temporary Memory table.
  2. Create an empty shadow table with the same structure as the target table.
  3. Copy all target rows into the shadow table.
  4. For rows matching staged deleted keys, set __dpone__deleted_at = now() and refresh __dpone__loaded_at = now().
  5. Rename the original table to a backup and the shadow table to the canonical target name.
  6. Drop temporary artifacts.

This keeps raw target-table semantics correct without ClickHouse mutations. It is heavier than an append-only tombstone strategy because it rewrites the table, but it is deterministic and avoids mutation backlog.

Staging-first load strategies

ClickHouse sink loads data into staging tables before touching the canonical target table:

  • FULL_REFRESH: load into staging, then rename staging into the target name.
  • INCREMENTAL_APPEND: load into staging, then append from staging into target.
  • INCREMENTAL_MERGE: load into staging, validate duplicate keys, then finalize with lightweight_delete_insert, shadow_swap, or guarded mutation_delete_insert.
  • REPLACE: load into staging, copy non-replaced target rows into a shadow table, append staging rows into the shadow table, then rename the shadow table into the target name.
  • PARTITION_REPLACE: create staging from target metadata, load replacement rows, then run ALTER TABLE target REPLACE PARTITION ... FROM staging for each staged partition value.

The sink does not use target TRUNCATE. mutation_delete_insert is available only as an explicit, non-recommended opt-in.

Schema evolution

ClickHouse sink applies safe schema evolution before staging:

  • ADD COLUMN for new nullable/default-safe columns;
  • MODIFY COLUMN only for planner-approved widening;
  • __dpone__nc__<column> generated columns for explicit incompatible type-change routing.

Schema evolution does not use ClickHouse mutations. Full refresh, replace, and reconciliation still use staging/shadow table swaps. See Schema evolution.

LowCardinality and physical design

Use Physical design to control ClickHouse engines, PARTITION BY, ORDER BY, codecs, and LowCardinality behavior.

sink:
  type: clickhouse
  options:
    physical_design:
      storage:
        clickhouse:
          engine: MergeTree
          partition_by: toYYYYMM(business_date)
          order_by: [business_date, order_id]
          low_cardinality:
            mode: auto
            max_distinct_values: 10000
            max_distinct_ratio: 0.05

Modes:

Mode Behavior
off Never generate LowCardinality.
auto Use profiler cardinality thresholds for string-like columns.
explicit Apply only to configured columns.
force Apply to configured columns even when profiler disagrees, with warning.
preserve Keep existing/source LowCardinality, but do not generate new ones.

Runbook:

  1. If low-cardinality columns are not selected, inspect dpone schema infer and verify sample coverage.
  2. If high-cardinality columns were selected, switch to explicit or off.
  3. If downstream DDL must be exact, use physical_design.columns.<column>.target_type.clickhouse.
  4. Run dpone schema physical-plan --manifest ... --format md before changing existing tables.

Requirements

  • Reconciliation must be enabled and configured with a stable unique_key.
  • The ClickHouse target table must contain __dpone__loaded_at and __dpone__deleted_at.
  • The connector must be able to read system.columns for target schema discovery.
  • The target database should support atomic RENAME TABLE; ClickHouse Atomic databases do.

Example

source:
  type: mssql
  connection_id: mssql_source
  table: {schema: dbo, name: orders}

sink:
  type: clickhouse
  connection_id: clickhouse_dwh
  table: {schema: analytics, name: orders}
  strategy:
    mode: incremental_append
    unique_key: [id]
  reconciliation:
    enabled: true

Operational guidance

  • Use ClickHouse reconciliation for periodic correctness gates, not per-minute high-churn delete streams.
  • For high-delete workloads, prefer CDC/Change Tracking into a versioned table design or run reconciliation by partition.
  • Keep readers on the canonical table name; dpone swaps the shadow table into that name only after the full rewrite has completed.
  • If a run fails before the swap, dpone drops the shadow table and leaves the canonical target unchanged.

Direct TSV ingest for MSSQL native transfers

MSSQL -> ClickHouse native transfer uses ClickHouse direct INSERT ... FORMAT TabSeparated whenever HTTP bulk or clickhouse-client is configured. The Python parsing fallback remains available, but it is not the preferred path for large tables.

Recommended options:

sink:
  options:
    clickhouse_bulk:
      mode: auto
      insert_settings:
      async_insert: 1
      wait_for_async_insert: 1
      max_insert_block_size: 1000000
      input_format_parallel_parsing: 1

If async_insert: 1 is set without wait_for_async_insert, dpone automatically sets wait_for_async_insert: 1 so the source state is not advanced before ClickHouse accepts the insert.