Skip to content

Load strategies

This guide explains how dpone load strategies work, which source/sink combinations support them, and which guide to copy from when building a manifest.

Table of contents

Strategy Intelligence

Use Strategy Intelligence when you want dpone to recommend and explain a safe target-native strategy instead of choosing one manually.

dpone strategy advise orders.yaml --estimated-rows 50000000 --changed-percent 0.2 --delete-percent 0.05

The advisor explains the chosen strategy, merge policy, native fast path, adaptive batch plan, safety gates, and repair commands. It is plan-only in v1: it does not touch sources, sinks, credentials, or state.

Strategy support matrix

Strategy Best for Core requirement Source/sink examples
full_refresh Small or bounded complete reloads Safe staging or shadow-table finalization Postgres -> MSSQL, REST API -> BigQuery
incremental_append Append-only facts and events Monotonic cursor or bounded source batch Kafka -> ClickHouse, MSSQL -> Kafka
incremental_merge Upserts by business key unique_key and set-based finalization MSSQL -> MSSQL, Postgres -> Postgres
replace Rebuilding one partition or predicate window Replace predicate or bounded time window REST API -> Postgres, ClickHouse -> BigQuery
partition_replace Recomputing changed table partitions Target partition metadata and partition.column MSSQL -> ClickHouse, Postgres -> BigQuery
snapshot_diff Snapshot-to-target diff with physical delete support unique_key and __dpone__row_hash Postgres -> MSSQL, MSSQL -> Postgres
scd2 Dimension history with current/expired versions unique_key, row hash, validity columns Postgres -> BigQuery, MSSQL -> MSSQL
cdc_apply Applying normalized insert/update/delete events CDC offsets and operation semantics Postgres -> Kafka, MSSQL -> ClickHouse
backfill Resumable historical reloads split into chunks chunk config and inner mode Postgres -> ClickHouse, MSSQL -> BigQuery
xmin Postgres transaction-ID incremental loads Postgres source and XMin state Postgres -> MSSQL, Postgres -> ClickHouse
cdc Change-event replication Source-specific CDC reader and offset state Postgres -> Kafka, MSSQL -> ClickHouse
snapshot_reconciliation Physical delete detection Snapshot key set and configured delete policy Postgres -> MSSQL, MSSQL -> ClickHouse

full_refresh

Use full_refresh when the selected source boundary can be reloaded completely.

sink:
  type: mssql
  table: {schema: landing, name: orders}
  strategy:
    mode: full_refresh
  options:
    bulk:
      mode: bcp
flowchart TD
    A["Resolve source boundary"] --> B["Extract full dataset"]
    B --> C["Load into staging table"]
    C --> D["Validate schema and quality"]
    D --> E["Swap, truncate+insert, or replace target safely"]
    E --> F["Commit state and write run artifact"]

Algorithm:

  1. Resolve all manifest and registry defaults.
  2. Extract the full configured source boundary.
  3. Load into staging or shadow target.
  4. Run quality checks before finalization when possible.
  5. Replace the final target using the sink-native safe path.
  6. Advance state only after finalization succeeds.

Copy from Postgres -> MSSQL or REST API -> BigQuery.

incremental_append

Use incremental_append for append-only streams, event logs, and facts where existing target rows are never updated.

source:
  options:
    incremental_column: updated_at
sink:
  strategy:
    mode: incremental_append
flowchart TD
    A["Read previous cursor from state"] --> B["Extract rows greater than cursor"]
    B --> C["Load rows into staging or producer batch"]
    C --> D["Append into final target"]
    D --> E["Write new high watermark"]

Algorithm:

  1. Read the previous cursor or offset from state.
  2. Build a bounded incremental extract.
  3. Load the extracted rows through the target-native fast path.
  4. Append only; do not update or delete existing target rows.
  5. Commit the new cursor after sink success.

Copy from Kafka -> ClickHouse or MSSQL -> Kafka.

incremental_merge

Use incremental_merge when source rows may update previously loaded target rows.

sink:
  strategy:
    mode: incremental_merge
    unique_key: order_id
    merge_policy: auto
    duplicate_policy: fail
flowchart TD
    A["Extract changed rows"] --> B["Load changed rows into staging"]
    B --> C["Deduplicate staging by unique key"]
    C --> D["Apply set-based merge or keyed upsert events"]
    D --> E["Run reconciliation and quality checks"]
    E --> F["Advance state"]

Algorithm:

  1. Extract a bounded changed-row set.
  2. Load into staging first.
  3. Validate staging duplicates by unique_key; v1 default is duplicate_policy: fail.
  4. Resolve merge_policy: auto to the sink default.
  5. Finalize by target-native policy.
  6. Run quality checks and persist state only after success.

Policy matrix:

Sink auto resolves to Other supported policies Notes
MSSQL delete_insert shadow_swap delete_insert runs DELETE target WHERE EXISTS staging keys, then INSERT FROM staging in one transaction.
Postgres delete_insert shadow_swap shadow_swap rebuilds a full shadow table, then renames target/shadow.
BigQuery delete_insert shadow_swap shadow_swap uses CTAS + table rename and fails if BigQuery rename restrictions apply.
ClickHouse lightweight_delete_insert shadow_swap, mutation_delete_insert mutation_delete_insert requires allow_non_recommended_policy: true.
Kafka event_upsert none Kafka emits keyed upsert/delete events and never mutates a topic.

SQL shape examples:

-- MSSQL/Postgres/BigQuery delete_insert shape
DELETE FROM target
WHERE EXISTS (SELECT 1 FROM staging WHERE staging.id = target.id);

INSERT INTO target (...)
SELECT ... FROM staging;
-- ClickHouse default lightweight_delete_insert shape
DELETE FROM target
WHERE id IN (SELECT id FROM staging)
SETTINGS mutations_sync = 1;

INSERT INTO target
SELECT * FROM staging;
-- Shadow swap shape
CREATE TABLE shadow AS target;
INSERT INTO shadow SELECT * FROM target WHERE NOT EXISTS (...staging keys...);
INSERT INTO shadow SELECT * FROM staging;
RENAME target TO backup, shadow TO target;
DROP TABLE backup;

Copy from MSSQL -> MSSQL or Postgres -> Postgres.

replace

Use replace to rebuild a bounded target slice such as a date partition or API window.

sink:
  strategy:
    mode: replace
    custom_predicate: "business_date between '2026-01-01' and '2026-01-02'"
flowchart TD
    A["Resolve replace predicate"] --> B["Extract replacement window"]
    B --> C["Load replacement rows into staging"]
    C --> D["Validate replacement set"]
    D --> E["Delete or swap only the matching target slice"]
    E --> F["Insert replacement rows and commit state"]

Algorithm:

  1. Resolve the bounded replacement predicate.
  2. Extract the replacement window.
  3. Load into staging.
  4. Validate row counts and key uniqueness.
  5. Replace only the matching slice through sink-native finalization.
  6. Commit state and run artifact.

Copy from REST API -> Postgres or ClickHouse -> BigQuery.

partition_replace

Use partition_replace when the source can produce a complete bounded partition slice and the target should replace exactly those partitions.

sink:
  strategy:
    mode: partition_replace
    partition:
      column: business_date
      values_from_staging: true
      max_partitions_per_run: 64
      native_mode: auto
flowchart TD
    A["Extract complete partition slice"] --> B["Load slice into staging"]
    B --> C["Collect distinct partition values from staging"]
    C --> D["Validate max_partitions_per_run and metadata compatibility"]
    D --> E["Replace only matching target partitions"]
    E --> F["Run quality checks and write run artifact"]

Algorithm:

  1. Extract a complete partition slice from the source.
  2. Load rows into staging first.
  3. Read distinct partition.column values from staging.
  4. Validate max_partitions_per_run and target partition compatibility.
  5. Apply sink-native replacement where capability checks pass.
  6. ClickHouse uses ALTER TABLE ... REPLACE PARTITION ... FROM staging when staging metadata is compatible.
  7. BigQuery uses query job destination partition decorators with WRITE_TRUNCATE for time partitions.
  8. Postgres uses declarative partition DETACH PARTITION / ATTACH PARTITION after resolving existing partition bounds.
  9. MSSQL uses ALTER TABLE ... SWITCH PARTITION when target, staging, indexes, partition function, and pre-created switch-out tables are aligned.
  10. Commit state and run artifacts only after success.

Support matrix:

Sink Status Finalization
ClickHouse Supported Native ALTER TABLE target REPLACE PARTITION ... FROM staging when staging is created from target metadata.
BigQuery Supported Native time-partition overwrite through destination partition decorator + WRITE_TRUNCATE; fallback is partition-scoped delete/insert.
Postgres Supported Native declarative partition detach/attach when existing partition bounds are resolvable; fallback is partition-scoped delete/insert.
MSSQL Supported Native ALTER TABLE ... SWITCH PARTITION when metadata is aligned and switch-out tables are pre-created; fallback is partition-scoped delete/insert.
Kafka Not supported Kafka topics are append-only logs; use keyed events instead.

Operational warnings:

  • partition_replace is not an incremental cursor. The source must emit a complete replacement slice.
  • The strategy should be used only for partitioned targets or targets with a strong partition column convention.
  • Keep max_partitions_per_run conservative to avoid accidental large rewrites.
  • Use native_mode: required for certified production tables where fallback would be too blocking.
  • Use native_mode: fallback when you intentionally want predicate delete/insert and no metadata switch.
  • For Kafka sinks, this strategy fails fast with a clear diagnostic.

snapshot_diff

Use snapshot_diff when the source can emit a bounded current snapshot and the target must converge to that snapshot, including physical deletes.

sink:
  strategy:
    mode: snapshot_diff
    unique_key: [order_id]
    diff:
      compare: row_hash
      delete_policy: hard_delete
flowchart TD
    A["Extract current source snapshot"] --> B["Enrich rows with __dpone__row_hash"]
    B --> C["Load snapshot into staging"]
    C --> D["Compare staging and current target by unique_key"]
    D --> E["Insert missing target rows"]
    D --> F["Replace rows with changed row_hash"]
    D --> G["Apply delete_policy for target-only keys"]
    E --> H["Quality and reconciliation checks"]
    F --> H
    G --> H
    H --> I["Commit load and state"]

Algorithm:

  1. Extract a complete bounded source snapshot.
  2. Compute or pass through __dpone__row_hash for business columns.
  3. Load the snapshot into staging.
  4. Compare target and staging by unique_key.
  5. Insert new keys and update changed keys through the sink-native staged finalizer.
  6. Apply delete_policy for keys that exist in target but not in staging.
  7. Write load audit, run artifact, and state after finalization succeeds.

DB-native finalizers:

Sink snapshot_diff finalizer Delete policies
Postgres staging duplicate check, target-only key cleanup, delete+insert changed keys in one transaction hard_delete, soft_delete, ignore
MSSQL staging duplicate check, target-only key cleanup, DELETE t ... WHERE EXISTS plus staged insert hard_delete, soft_delete, ignore
BigQuery staging duplicate check, target-only key cleanup, partition-safe DML delete+insert hard_delete, soft_delete, ignore
ClickHouse use staged incremental_merge or partition_replace finalizers; prefer snapshot reconciliation docs for physical deletes target-specific
Kafka emits keyed upsert/delete events; it does not mutate a target table envelope/delete config

Use Load lineage to understand __dpone__row_hash and row identity. Use snapshot_diff only when the source snapshot boundary is complete; otherwise prefer incremental_merge plus reconciliation or CDC.

scd2

Use scd2 when the target must preserve a full history of dimensional changes.

sink:
  strategy:
    mode: scd2
    unique_key: [customer_id]
    scd2:
      valid_from_column: "__dpone__valid_from_at"
      valid_to_column: "__dpone__valid_to_at"
      current_flag_column: "__dpone__is_current"
      row_hash_column: "__dpone__row_hash"
      delete_policy: expire
flowchart TD
    A["Extract current dimension snapshot or CDC batch"] --> B["Compute __dpone__row_hash"]
    B --> C["Load changed/current rows into staging"]
    C --> D["Join current target rows by unique_key"]
    D --> E{"row_hash changed?"}
    E -->|yes| F["Expire current target version"]
    F --> G["Insert new current version"]
    E -->|no| H["Keep current version"]
    D --> I{"delete detected?"}
    I -->|expire| J["Set valid_to and is_current=false"]
    G --> K["Commit audit/state"]
    H --> K
    J --> K

Algorithm:

  1. Load the source snapshot or CDC batch into staging.
  2. Deduplicate by unique_key according to the strategy duplicate policy.
  3. Compare staging rows with target current rows by __dpone__row_hash.
  4. Expire changed current rows by setting __dpone__valid_to_at.
  5. Insert a new current row with __dpone__valid_from_at and __dpone__is_current=true.
  6. Apply delete policy. The default expire keeps history and closes the current record.
  7. Commit state only after SCD2 finalization succeeds.

DB-native finalizers:

Sink scd2 finalizer Delete policy
Postgres expires changed current rows with UPDATE ... FROM staging, inserts new current versions with NOT EXISTS expire, ignore
MSSQL expires changed current rows with set-based UPDATE ... FROM, inserts new current versions from staging expire, ignore
BigQuery expires changed current rows with DML, inserts current versions from staging expire, ignore
ClickHouse use the ClickHouse history-table path only when mutation/lightweight-delete tradeoffs are accepted target-specific
Kafka not a table-history sink; use keyed change events instead not supported

SCD2 is for database sinks: MSSQL, Postgres, ClickHouse, and BigQuery. Kafka sinks should use keyed change events instead.

cdc_apply

Use cdc_apply when the source emits normalized insert/update/delete events with an offset that must be committed only after sink success.

sink:
  strategy:
    mode: cdc_apply
    unique_key: [order_id]
    cdc:
      delete_policy: apply
      operation_column: "__dpone__op"
sequenceDiagram
    participant Source
    participant State
    participant Staging
    participant Sink
    Source->>State: read last CDC offset
    Source->>Staging: extract bounded CDC batch
    Staging->>Sink: stage normalized events
    Sink->>Sink: apply insert/update/delete semantics
    Sink->>State: commit next offset after success

Algorithm:

  1. Read source-specific CDC offset from state.
  2. Extract a bounded CDC batch.
  3. Normalize operations to insert/update/delete semantics.
  4. Load events into staging first.
  5. Apply events through the sink-native staged finalizer.
  6. Persist the next offset only after the sink finalization succeeds.

See Reconciliation and CDC for Postgres logical replication, MSSQL CDC/Change Tracking, Kafka CDC envelopes, offset state, and runbooks.

backfill

Use backfill for large historical reloads that must be resumable and split into deterministic chunks.

sink:
  strategy:
    mode: backfill
    backfill:
      inner_mode: partition_replace
      chunk:
        column: business_date
        from: "2025-01-01"
        to: "2025-12-31"
        step: 1d
      parallel_workers: 4
flowchart TD
    A["Create one run_id"] --> B["Plan deterministic chunks"]
    B --> C["Create one load_id per chunk"]
    C --> D["Execute inner strategy"]
    D --> E{"chunk succeeded?"}
    E -->|yes| F["Mark chunk committed and advance chunk state"]
    E -->|no| G["Mark chunk failed and keep state unchanged"]
    F --> H{"more chunks?"}
    H -->|yes| C
    H -->|no| I["Finalize run artifact"]

Algorithm:

  1. Create one run_id for the backfill execution.
  2. Build deterministic chunks from the configured column and range.
  3. Create one load_id per chunk.
  4. Execute inner_mode, usually partition_replace, replace, or incremental_merge.
  5. Mark each chunk committed or failed independently.
  6. Resume from the first non-committed chunk on retry.

Backfill is a wrapper strategy. It does not change source/sink semantics; it adds chunking, auditability, and resumability on top of an inner strategy.

Supported inner modes today:

Inner mode Use case Notes
partition_replace preferred for partitioned historical reloads fails fast for Kafka and non-partitioned targets
replace chunk predicate replacement useful when partition metadata is unavailable
incremental_merge keyed chunk upserts uses the sink default merge_policy unless overridden

xmin

Use xmin for Postgres sources when you want transaction-ID based incremental extraction without a business timestamp column.

source:
  type: postgres
  options:
    incremental_strategy: xmin
sink:
  strategy:
    mode: incremental_merge
    unique_key: order_id
state:
  type: mssql
  table: {schema: etl_state, name: postgres_xmin_state}
flowchart TD
    A["Read previous XMin watermark"] --> B["Capture current safe XMin upper bound"]
    B --> C["Extract rows with xmin in the bounded range"]
    C --> D["Load into target staging"]
    D --> E["Merge by unique key"]
    E --> F["Persist new XMin watermark after success"]

Algorithm:

  1. Validate that the source type is Postgres.
  2. Read the previous XMin watermark from state.
  3. Capture a safe upper bound before extraction.
  4. Extract rows whose xmin falls within the bounded range.
  5. Load and finalize with the configured sink strategy.
  6. Persist the upper bound only after sink success.

Copy from Postgres -> MSSQL, Postgres -> ClickHouse, or see the deep dive in Postgres XMin.

cdc

Use cdc when the source emits a change stream and you need insert/update/delete events rather than snapshot polling.

source:
  type: postgres
  options:
    cdc:
      enabled: true
      slot: dpone_orders
      publication: dpone_publication
sink:
  strategy:
    mode: incremental_merge
    unique_key: order_id
flowchart TD
    A["Read CDC offset from state"] --> B["Consume bounded change batch"]
    B --> C["Normalize insert, update, and delete events"]
    C --> D["Load events into staging or Kafka"]
    D --> E["Apply sink-specific event semantics"]
    E --> F["Persist CDC offset after success"]

Algorithm:

  1. Validate CDC capability for the source connector.
  2. Read typed CDC offset state.
  3. Consume a bounded batch of changes.
  4. Normalize operations into insert, update, and delete semantics.
  5. Apply sink-specific finalization or produce keyed events.
  6. Persist the CDC offset after sink success.

Copy from Postgres -> Kafka or MSSQL -> ClickHouse.

snapshot_reconciliation

Use snapshot_reconciliation when the source does not emit delete events but the target must reflect physical deletes.

reconciliation:
  enabled: true
  mode: snapshot
  key: order_id
  apply_deletes: true
  delete_mode: soft_delete
flowchart TD
    A["Extract current source key snapshot"] --> B["Load keys into reconciliation staging"]
    B --> C["Compare target keys to source keys"]
    C --> D["Create delete candidate set"]
    D --> E["Apply soft delete, replacing-table marker, or target-native delete plan"]
    E --> F["Write reconciliation metrics"]

Algorithm:

  1. Extract the current source key snapshot.
  2. Load keys into reconciliation staging.
  3. Compare staged source keys with target keys.
  4. Produce a delete candidate set.
  5. Apply the configured delete behavior through staging-first plans.
  6. Write reconciliation metrics to the run artifact.

Copy from Postgres -> MSSQL, MSSQL -> ClickHouse, or Postgres -> ClickHouse.

Copy/paste source/sink guide index

Use this index when you already know your source and target and want a ready manifest plus strategy runbook.

Source Sink Guide
Postgres MSSQL Postgres -> MSSQL
Postgres Postgres Postgres -> Postgres
Postgres ClickHouse Postgres -> ClickHouse
Postgres BigQuery Postgres -> BigQuery
Postgres Kafka Postgres -> Kafka
MSSQL MSSQL MSSQL -> MSSQL
MSSQL Postgres MSSQL -> Postgres
MSSQL ClickHouse MSSQL -> ClickHouse
MSSQL BigQuery MSSQL -> BigQuery
MSSQL Kafka MSSQL -> Kafka
ClickHouse MSSQL ClickHouse -> MSSQL
ClickHouse Postgres ClickHouse -> Postgres
ClickHouse ClickHouse ClickHouse -> ClickHouse
ClickHouse BigQuery ClickHouse -> BigQuery
ClickHouse Kafka ClickHouse -> Kafka
REST API MSSQL REST API -> MSSQL
REST API Postgres REST API -> Postgres
REST API ClickHouse REST API -> ClickHouse
REST API BigQuery REST API -> BigQuery
REST API Kafka REST API -> Kafka
Kafka MSSQL Kafka -> MSSQL
Kafka Postgres Kafka -> Postgres
Kafka ClickHouse Kafka -> ClickHouse
Kafka BigQuery Kafka -> BigQuery
Kafka Kafka Kafka -> Kafka