ClickHouse -> ClickHouse¶

This guide is a copy/paste-ready starting point for loading data from ClickHouse into ClickHouse with dpone.

When to use this path¶

Use this path when one ClickHouse database is the system of record or ingestion boundary and another ClickHouse database, on the same or a different cluster, is the landing, warehouse, event-log, or downstream replication target.

Copy/paste manifest¶

# yaml-language-server: $schema=../../src/dpone/schema/etl-batch-manifest.schema.json
kind: dpone.batch.v1

defaults:
  name: clickhouse_to_clickhouse_example
  source:
    type: clickhouse
    connection_id: clickhouse_source
    options:
      batch_size: 50000
      export_format: csv
  sink:
    type: clickhouse
    connection_id: clickhouse_target
    table:
      schema: analytics
      name: orders
    strategy:
      mode: incremental_merge
      unique_key: order_id
      merge_policy: lightweight_delete_insert
      duplicate_policy: fail

quality:
  gates:
    - id: source_target_rows
      type: row_count_reconciliation
      severity: error
      tolerance:
        mode: pct
        value: 0.1

schemas:
  analytics:
    tables:
      - orders

Run it locally:

dpone plan examples/source-sink/clickhouse-to-clickhouse.yaml --format md
dpone run examples/source-sink/clickhouse-to-clickhouse.yaml

The checked source file is examples/source-sink/clickhouse-to-clickhouse.yaml; CI compares its parsed YAML with the manifest block above.

If you change the strategy to full_refresh and an empty result is invalid, row-count reconciliation alone is not enough: a comparison of 0 source rows against 0 target rows can pass. Add an explicit non-empty target gate:

quality:
  gates:
    - id: target_min_rows
      type: min_rows
      side: target
      threshold: 1
      severity: error
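
To see why the extra gate matters, here is a minimal Python sketch of the two checks. The function names and the tolerance formula (relative difference measured against the source count, with `value` read as a percentage) are assumptions for illustration, not dpone's implementation.

```python
def row_count_within_pct(source_rows: int, target_rows: int, tolerance_pct: float) -> bool:
    """Hypothetical reconciliation check: relative difference vs. the source count."""
    if source_rows == 0:
        # With an empty source only an empty target matches, so 0 vs 0 passes.
        return target_rows == 0
    diff_pct = abs(source_rows - target_rows) / source_rows * 100
    return diff_pct <= tolerance_pct


def min_rows(target_rows: int, threshold: int) -> bool:
    """Hypothetical non-empty check: the target must hold at least `threshold` rows."""
    return target_rows >= threshold


# An empty extract reconciles perfectly, so only the min_rows gate catches it.
print(row_count_within_pct(0, 0, 0.1))  # True
print(min_rows(0, 1))                   # False
```

Under these assumptions the reconciliation gate protects against drift between source and target, while the min_rows gate protects against both sides being empty.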

Supported load strategies¶

These rows describe public runtime contracts, not certification of this exact source, sink, transport, schema-evolution mode, and runtime combination.

| Strategy | Status | Notes |
| --- | --- | --- |
| `full_refresh` | Supported | Uses staging first, then applies the target-specific finalization plan. |
| `incremental_append` | Supported | Uses staging first, then applies the target-specific finalization plan. |
| `incremental_merge` | Supported | Default `merge_policy: lightweight_delete_insert`; `shadow_swap` supported; `mutation_delete_insert` is explicit opt-in and non-recommended. |
| `replace` | Supported | Uses staging first, then applies the target-specific finalization plan. |
| `partition_replace` | Supported | Replaces target partitions represented by staging `partition.column`; see Load strategies for native/fallback paths. |
| `snapshot_diff` | Supported | Requires a complete bounded snapshot and `unique_key`; applies the configured diff/delete policy. |

See Load strategies for the detailed algorithm for each strategy.

Runtime algorithm¶

ClickHouse implements StagedLoadPort, so this route records governance_finalization=pre_finalize. Blocking gates evaluate projected staging rows before the target is finalized; a gate failure aborts the run and cleans up staging without advancing source state. See Load governance.

flowchart TD
    A["Resolve manifest and registry entries"] --> B["Create ClickHouse source"]
    B --> C["Plan bounded extract"]
    C --> D["Read through native SELECT streaming or file export"]
    D --> E["Emit ExtractResult with schema and artifact"]
    E --> F["Plan schema evolution"]
    F --> G["Create ClickHouse staging or event batch"]
    G --> H["Load into run-scoped ClickHouse staging"]
    H --> I["Run blocking quality gates against staging"]
    I --> J["Apply ClickHouse finalization strategy"]
    J --> K["Advance state only after success"]
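
The ordering in the flowchart (gates before finalization, state last) can be sketched as a small driver. This is an illustrative model only: `run_staged_load` and its callback parameters are hypothetical names, and dropping staging after a successful run is an assumption beyond what this guide states.

```python
from typing import Callable, Iterable


def run_staged_load(
    load_staging: Callable[[], int],
    gates: Iterable[Callable[[int], bool]],
    finalize: Callable[[], None],
    cleanup_staging: Callable[[], None],
    advance_state: Callable[[], None],
) -> bool:
    """Hypothetical driver: blocking gates run against staging before finalization."""
    staged_rows = load_staging()
    try:
        if not all(gate(staged_rows) for gate in gates):
            # A blocking gate failed: the target and the source state stay untouched.
            return False
        finalize()
    finally:
        # Staging is run-scoped; cleaning it on success as well is an assumption.
        cleanup_staging()
    # Reached only when every gate passed and finalization did not raise.
    advance_state()
    return True


events: list[str] = []
ok = run_staged_load(
    load_staging=lambda: 0,                      # pretend the extract was empty
    gates=[lambda rows: rows >= 1],              # a min_rows-style blocking gate
    finalize=lambda: events.append("finalize"),
    cleanup_staging=lambda: events.append("cleanup"),
    advance_state=lambda: events.append("advance"),
)
print(ok, events)  # False ['cleanup']
```

Because state advances only after finalization returns, a failed run can be retried over the same source boundary.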

Strategy behavior¶

full_refresh: extract the selected source boundary, load into staging, and replace the target according to the target's safe finalization path.
incremental_append: extract only the incremental boundary and append rows through staging or event production.
incremental_merge: load into staging, validate duplicates, then use lightweight_delete_insert by default; shadow_swap and guarded mutation_delete_insert are explicit policies.
replace: reload a bounded predicate window through staging and then atomically replace the matching target slice.
snapshot_diff: compare a complete current source snapshot with the target by unique_key, then apply the configured insert, update, and delete policy.
partition_replace: extract a complete partition slice, load it into staging, and replace only partitions represented by partition.column.
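
As a rough illustration of the incremental_merge behavior described above, the following sketch models a delete-then-insert merge on in-memory rows. It is a simplified assumption of what lightweight_delete_insert does, not dpone's code; in ClickHouse the delete step would presumably map to a lightweight `DELETE FROM ... WHERE` statement. The function name and row shapes are hypothetical.

```python
from collections import Counter


def merge_delete_insert(target: list[dict], staging: list[dict], unique_key: str) -> list[dict]:
    """Hypothetical in-memory model of a delete-then-insert merge."""
    counts = Counter(row[unique_key] for row in staging)
    duplicates = sorted(key for key, n in counts.items() if n > 1)
    if duplicates:
        # Models duplicate_policy: fail, stopping before the target is touched.
        raise ValueError(f"duplicate {unique_key} values in staging: {duplicates}")
    staged_keys = set(counts)
    # Step 1: delete target rows whose key reappears in staging.
    kept = [row for row in target if row[unique_key] not in staged_keys]
    # Step 2: insert every staged row.
    return kept + staging


target = [{"order_id": 1, "status": "new"}, {"order_id": 2, "status": "new"}]
staging = [{"order_id": 2, "status": "paid"}, {"order_id": 3, "status": "new"}]
merged = merge_delete_insert(target, staging, "order_id")
print(merged)
```

In this model order 1 is untouched, order 2 is replaced by its staged version, and order 3 is appended; a staging batch that repeats an order_id raises before any target row changes.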

Snapshot reconciliation is separate from the load strategy. Runtime planning reports that capability as reconciliation.mode=snapshot; in the official dpone.batch.v1 authoring schema, enable it with reconciliation: true.

Schema evolution and type mapping¶

Schema evolution is enabled by default and runs before the staging/final load path:

Read source schema from ExtractResult.schema.
Introspect the ClickHouse target schema.
Apply safe additions and widening operations.
Fail breaking changes by default.
If configured, route incompatible type changes to __dpone__nc__<column>.
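
The decision order in the steps above can be sketched as follows. The widening pairs, action names, and the `route_incompatible` flag are assumptions made for illustration; the real rules live in the Schema evolution and Type mapping matrix pages.

```python
# Assumed lossless widenings (target type, source type); illustrative only.
SAFE_WIDENINGS = {("Int32", "Int64"), ("Float32", "Float64")}


def plan_schema_evolution(
    source: dict[str, str],
    target: dict[str, str],
    route_incompatible: bool = False,
) -> list[tuple[str, str]]:
    """Hypothetical planner: compare source column types with the target schema."""
    actions: list[tuple[str, str]] = []
    for column, src_type in source.items():
        tgt_type = target.get(column)
        if tgt_type is None:
            actions.append(("add_column", column))        # safe addition
        elif tgt_type == src_type:
            continue                                      # nothing to do
        elif (tgt_type, src_type) in SAFE_WIDENINGS:
            actions.append(("widen_column", column))      # safe widening
        elif route_incompatible:
            # Incompatible change routed to a side column instead of failing.
            actions.append(("route_to", f"__dpone__nc__{column}"))
        else:
            raise ValueError(f"breaking change on {column}: {tgt_type} -> {src_type}")
    return actions


source_schema = {"id": "Int64", "note": "String", "amount": "String"}
target_schema = {"id": "Int32", "amount": "Float64"}
plan = plan_schema_evolution(source_schema, target_schema, route_incompatible=True)
print(plan)
```

With `route_incompatible` left at its default, the same input raises on `amount`, which mirrors the fail-by-default behavior for breaking changes.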

Use Schema evolution and Type mapping matrix when adding columns or changing source types.

Runbook¶

Start with dpone doctor --profile local and fix missing extras or native clients.
Run dpone plan <manifest> --format md and review source boundary, staging path, schema evolution, state, and quality gates.
Run a small bounded window first.
Inspect the run artifact under .dpone/runs/clickhouse_to_clickhouse.
For incremental jobs, verify state before enabling a schedule.
For delete-aware jobs, run reconciliation in report-only mode before enabling physical deletes.
Promote the manifest through GitOps after the plan and artifact are reviewed.
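
The command-line steps of the runbook can be chained in a small wrapper so that a failing step stops the sequence. This is a sketch: the helper names are hypothetical, the commands and flags are the ones quoted in this guide, and the manual review steps (inspecting the plan, the run artifact, and state) still need a person.

```python
import subprocess


def runbook_commands(manifest: str) -> list[list[str]]:
    """Return the CLI steps from the runbook, in order, as argv lists."""
    return [
        ["dpone", "doctor", "--profile", "local"],
        ["dpone", "plan", manifest, "--format", "md"],
        ["dpone", "run", manifest],
    ]


def run_runbook(manifest: str) -> None:
    """Run each step; check=True raises on the first non-zero exit code."""
    for argv in runbook_commands(manifest):
        subprocess.run(argv, check=True)
```

Calling `run_runbook("examples/source-sink/clickhouse-to-clickhouse.yaml")` would run doctor, plan, and run in sequence, assuming dpone is installed and on the PATH.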

Cross-links¶

Type contracts and physical design¶

This flow supports the shared dpone type-governance stack:

Type inference for source metadata, sampled profiling, confidence, and empty string vs NULL behavior.
Schema contracts for explicit logical column types, enforcement modes, and __dpone__nc__* variant columns.
Physical design for target-specific DDL such as concrete SQL types, indexes, partitioning, compression, ClickHouse LowCardinality, and BigQuery clustering.

Use dpone schema infer --manifest ... and dpone schema physical-plan --manifest ... before enabling new table DDL in production.