Load strategies¶
This guide explains how dpone load strategies work, which source/sink combinations support them, and which guide to copy from when building a manifest.
Table of contents¶
full_refreshincremental_appendincremental_mergereplacepartition_replacesnapshot_diffscd2cdc_applybackfillxmincdcsnapshot_reconciliation- Strategy Intelligence
- Strategy support matrix
- Cross-links
Strategy Intelligence¶
Use Strategy Intelligence when you want dpone to recommend and explain a safe target-native strategy instead of choosing one manually.
dpone strategy advise orders.yaml --estimated-rows 50000000 --changed-percent 0.2 --delete-percent 0.05
The advisor explains the chosen strategy, merge policy, native fast path, adaptive batch plan, safety gates, and repair commands. It is plan-only in v1: it does not touch sources, sinks, credentials, or state.
Strategy support matrix¶
| Strategy | Best for | Core requirement | Source/sink examples |
|---|---|---|---|
full_refresh |
Small or bounded complete reloads | Safe staging or shadow-table finalization | Postgres -> MSSQL, REST API -> BigQuery |
incremental_append |
Append-only facts and events | Monotonic cursor or bounded source batch | Kafka -> ClickHouse, MSSQL -> Kafka |
incremental_merge |
Upserts by business key | unique_key and set-based finalization |
MSSQL -> MSSQL, Postgres -> Postgres |
replace |
Rebuilding one partition or predicate window | Replace predicate or bounded time window | REST API -> Postgres, ClickHouse -> BigQuery |
partition_replace |
Recomputing changed table partitions | Target partition metadata and partition.column |
MSSQL -> ClickHouse, Postgres -> BigQuery |
snapshot_diff |
Snapshot-to-target diff with physical delete support | unique_key and __dpone__row_hash |
Postgres -> MSSQL, MSSQL -> Postgres |
scd2 |
Dimension history with current/expired versions | unique_key, row hash, validity columns |
Postgres -> BigQuery, MSSQL -> MSSQL |
cdc_apply |
Applying normalized insert/update/delete events | CDC offsets and operation semantics | Postgres -> Kafka, MSSQL -> ClickHouse |
backfill |
Resumable historical reloads split into chunks | chunk config and inner mode | Postgres -> ClickHouse, MSSQL -> BigQuery |
xmin |
Postgres transaction-ID incremental loads | Postgres source and XMin state | Postgres -> MSSQL, Postgres -> ClickHouse |
cdc |
Change-event replication | Source-specific CDC reader and offset state | Postgres -> Kafka, MSSQL -> ClickHouse |
snapshot_reconciliation |
Physical delete detection | Snapshot key set and configured delete policy | Postgres -> MSSQL, MSSQL -> ClickHouse |
full_refresh¶
Use full_refresh when the selected source boundary can be reloaded completely.
sink:
type: mssql
table: {schema: landing, name: orders}
strategy:
mode: full_refresh
options:
bulk:
mode: bcp
flowchart TD
A["Resolve source boundary"] --> B["Extract full dataset"]
B --> C["Load into staging table"]
C --> D["Validate schema and quality"]
D --> E["Swap, truncate+insert, or replace target safely"]
E --> F["Commit state and write run artifact"]
Algorithm:
- Resolve all manifest and registry defaults.
- Extract the full configured source boundary.
- Load into staging or shadow target.
- Run quality checks before finalization when possible.
- Replace the final target using the sink-native safe path.
- Advance state only after finalization succeeds.
Copy from Postgres -> MSSQL or REST API -> BigQuery.
incremental_append¶
Use incremental_append for append-only streams, event logs, and facts where existing target rows are never updated.
flowchart TD
A["Read previous cursor from state"] --> B["Extract rows greater than cursor"]
B --> C["Load rows into staging or producer batch"]
C --> D["Append into final target"]
D --> E["Write new high watermark"]
Algorithm:
- Read the previous cursor or offset from state.
- Build a bounded incremental extract.
- Load the extracted rows through the target-native fast path.
- Append only; do not update or delete existing target rows.
- Commit the new cursor after sink success.
Copy from Kafka -> ClickHouse or MSSQL -> Kafka.
incremental_merge¶
Use incremental_merge when source rows may update previously loaded target rows.
sink:
strategy:
mode: incremental_merge
unique_key: order_id
merge_policy: auto
duplicate_policy: fail
flowchart TD
A["Extract changed rows"] --> B["Load changed rows into staging"]
B --> C["Deduplicate staging by unique key"]
C --> D["Apply set-based merge or keyed upsert events"]
D --> E["Run reconciliation and quality checks"]
E --> F["Advance state"]
Algorithm:
- Extract a bounded changed-row set.
- Load into staging first.
- Validate staging duplicates by
unique_key; v1 default isduplicate_policy: fail. - Resolve
merge_policy: autoto the sink default. - Finalize by target-native policy.
- Run quality checks and persist state only after success.
Policy matrix:
| Sink | auto resolves to |
Other supported policies | Notes |
|---|---|---|---|
| MSSQL | delete_insert |
shadow_swap |
delete_insert runs DELETE target WHERE EXISTS staging keys, then INSERT FROM staging in one transaction. |
| Postgres | delete_insert |
shadow_swap |
shadow_swap rebuilds a full shadow table, then renames target/shadow. |
| BigQuery | delete_insert |
shadow_swap |
shadow_swap uses CTAS + table rename and fails if BigQuery rename restrictions apply. |
| ClickHouse | lightweight_delete_insert |
shadow_swap, mutation_delete_insert |
mutation_delete_insert requires allow_non_recommended_policy: true. |
| Kafka | event_upsert |
none | Kafka emits keyed upsert/delete events and never mutates a topic. |
SQL shape examples:
-- MSSQL/Postgres/BigQuery delete_insert shape
DELETE FROM target
WHERE EXISTS (SELECT 1 FROM staging WHERE staging.id = target.id);
INSERT INTO target (...)
SELECT ... FROM staging;
-- ClickHouse default lightweight_delete_insert shape
DELETE FROM target
WHERE id IN (SELECT id FROM staging)
SETTINGS mutations_sync = 1;
INSERT INTO target
SELECT * FROM staging;
-- Shadow swap shape
CREATE TABLE shadow AS target;
INSERT INTO shadow SELECT * FROM target WHERE NOT EXISTS (...staging keys...);
INSERT INTO shadow SELECT * FROM staging;
RENAME target TO backup, shadow TO target;
DROP TABLE backup;
Copy from MSSQL -> MSSQL or Postgres -> Postgres.
replace¶
Use replace to rebuild a bounded target slice such as a date partition or API window.
sink:
strategy:
mode: replace
custom_predicate: "business_date between '2026-01-01' and '2026-01-02'"
flowchart TD
A["Resolve replace predicate"] --> B["Extract replacement window"]
B --> C["Load replacement rows into staging"]
C --> D["Validate replacement set"]
D --> E["Delete or swap only the matching target slice"]
E --> F["Insert replacement rows and commit state"]
Algorithm:
- Resolve the bounded replacement predicate.
- Extract the replacement window.
- Load into staging.
- Validate row counts and key uniqueness.
- Replace only the matching slice through sink-native finalization.
- Commit state and run artifact.
Copy from REST API -> Postgres or ClickHouse -> BigQuery.
partition_replace¶
Use partition_replace when the source can produce a complete bounded partition slice and the target should replace exactly those partitions.
sink:
strategy:
mode: partition_replace
partition:
column: business_date
values_from_staging: true
max_partitions_per_run: 64
native_mode: auto
flowchart TD
A["Extract complete partition slice"] --> B["Load slice into staging"]
B --> C["Collect distinct partition values from staging"]
C --> D["Validate max_partitions_per_run and metadata compatibility"]
D --> E["Replace only matching target partitions"]
E --> F["Run quality checks and write run artifact"]
Algorithm:
- Extract a complete partition slice from the source.
- Load rows into staging first.
- Read distinct
partition.columnvalues from staging. - Validate
max_partitions_per_runand target partition compatibility. - Apply sink-native replacement where capability checks pass.
- ClickHouse uses
ALTER TABLE ... REPLACE PARTITION ... FROM stagingwhen staging metadata is compatible. - BigQuery uses query job destination partition decorators with
WRITE_TRUNCATEfor time partitions. - Postgres uses declarative partition
DETACH PARTITION/ATTACH PARTITIONafter resolving existing partition bounds. - MSSQL uses
ALTER TABLE ... SWITCH PARTITIONwhen target, staging, indexes, partition function, and pre-created switch-out tables are aligned. - Commit state and run artifacts only after success.
Support matrix:
| Sink | Status | Finalization |
|---|---|---|
| ClickHouse | Supported | Native ALTER TABLE target REPLACE PARTITION ... FROM staging when staging is created from target metadata. |
| BigQuery | Supported | Native time-partition overwrite through destination partition decorator + WRITE_TRUNCATE; fallback is partition-scoped delete/insert. |
| Postgres | Supported | Native declarative partition detach/attach when existing partition bounds are resolvable; fallback is partition-scoped delete/insert. |
| MSSQL | Supported | Native ALTER TABLE ... SWITCH PARTITION when metadata is aligned and switch-out tables are pre-created; fallback is partition-scoped delete/insert. |
| Kafka | Not supported | Kafka topics are append-only logs; use keyed events instead. |
Operational warnings:
partition_replaceis not an incremental cursor. The source must emit a complete replacement slice.- The strategy should be used only for partitioned targets or targets with a strong partition column convention.
- Keep
max_partitions_per_runconservative to avoid accidental large rewrites. - Use
native_mode: requiredfor certified production tables where fallback would be too blocking. - Use
native_mode: fallbackwhen you intentionally want predicate delete/insert and no metadata switch. - For Kafka sinks, this strategy fails fast with a clear diagnostic.
snapshot_diff¶
Use snapshot_diff when the source can emit a bounded current snapshot and the target must converge to that snapshot, including physical deletes.
sink:
strategy:
mode: snapshot_diff
unique_key: [order_id]
diff:
compare: row_hash
delete_policy: hard_delete
flowchart TD
A["Extract current source snapshot"] --> B["Enrich rows with __dpone__row_hash"]
B --> C["Load snapshot into staging"]
C --> D["Compare staging and current target by unique_key"]
D --> E["Insert missing target rows"]
D --> F["Replace rows with changed row_hash"]
D --> G["Apply delete_policy for target-only keys"]
E --> H["Quality and reconciliation checks"]
F --> H
G --> H
H --> I["Commit load and state"]
Algorithm:
- Extract a complete bounded source snapshot.
- Compute or pass through
__dpone__row_hashfor business columns. - Load the snapshot into staging.
- Compare target and staging by
unique_key. - Insert new keys and update changed keys through the sink-native staged finalizer.
- Apply
delete_policyfor keys that exist in target but not in staging. - Write load audit, run artifact, and state after finalization succeeds.
DB-native finalizers:
| Sink | snapshot_diff finalizer |
Delete policies |
|---|---|---|
| Postgres | staging duplicate check, target-only key cleanup, delete+insert changed keys in one transaction | hard_delete, soft_delete, ignore |
| MSSQL | staging duplicate check, target-only key cleanup, DELETE t ... WHERE EXISTS plus staged insert |
hard_delete, soft_delete, ignore |
| BigQuery | staging duplicate check, target-only key cleanup, partition-safe DML delete+insert | hard_delete, soft_delete, ignore |
| ClickHouse | use staged incremental_merge or partition_replace finalizers; prefer snapshot reconciliation docs for physical deletes |
target-specific |
| Kafka | emits keyed upsert/delete events; it does not mutate a target table | envelope/delete config |
Use Load lineage to understand __dpone__row_hash and row identity. Use snapshot_diff only when the source snapshot boundary is complete; otherwise prefer incremental_merge plus reconciliation or CDC.
scd2¶
Use scd2 when the target must preserve a full history of dimensional changes.
sink:
strategy:
mode: scd2
unique_key: [customer_id]
scd2:
valid_from_column: "__dpone__valid_from_at"
valid_to_column: "__dpone__valid_to_at"
current_flag_column: "__dpone__is_current"
row_hash_column: "__dpone__row_hash"
delete_policy: expire
flowchart TD
A["Extract current dimension snapshot or CDC batch"] --> B["Compute __dpone__row_hash"]
B --> C["Load changed/current rows into staging"]
C --> D["Join current target rows by unique_key"]
D --> E{"row_hash changed?"}
E -->|yes| F["Expire current target version"]
F --> G["Insert new current version"]
E -->|no| H["Keep current version"]
D --> I{"delete detected?"}
I -->|expire| J["Set valid_to and is_current=false"]
G --> K["Commit audit/state"]
H --> K
J --> K
Algorithm:
- Load the source snapshot or CDC batch into staging.
- Deduplicate by
unique_keyaccording to the strategy duplicate policy. - Compare staging rows with target current rows by
__dpone__row_hash. - Expire changed current rows by setting
__dpone__valid_to_at. - Insert a new current row with
__dpone__valid_from_atand__dpone__is_current=true. - Apply delete policy. The default
expirekeeps history and closes the current record. - Commit state only after SCD2 finalization succeeds.
DB-native finalizers:
| Sink | scd2 finalizer |
Delete policy |
|---|---|---|
| Postgres | expires changed current rows with UPDATE ... FROM staging, inserts new current versions with NOT EXISTS |
expire, ignore |
| MSSQL | expires changed current rows with set-based UPDATE ... FROM, inserts new current versions from staging |
expire, ignore |
| BigQuery | expires changed current rows with DML, inserts current versions from staging | expire, ignore |
| ClickHouse | use the ClickHouse history-table path only when mutation/lightweight-delete tradeoffs are accepted | target-specific |
| Kafka | not a table-history sink; use keyed change events instead | not supported |
SCD2 is for database sinks: MSSQL, Postgres, ClickHouse, and BigQuery. Kafka sinks should use keyed change events instead.
cdc_apply¶
Use cdc_apply when the source emits normalized insert/update/delete events with an offset that must be committed only after sink success.
sink:
strategy:
mode: cdc_apply
unique_key: [order_id]
cdc:
delete_policy: apply
operation_column: "__dpone__op"
sequenceDiagram
participant Source
participant State
participant Staging
participant Sink
Source->>State: read last CDC offset
Source->>Staging: extract bounded CDC batch
Staging->>Sink: stage normalized events
Sink->>Sink: apply insert/update/delete semantics
Sink->>State: commit next offset after success
Algorithm:
- Read source-specific CDC offset from state.
- Extract a bounded CDC batch.
- Normalize operations to insert/update/delete semantics.
- Load events into staging first.
- Apply events through the sink-native staged finalizer.
- Persist the next offset only after the sink finalization succeeds.
See Reconciliation and CDC for Postgres logical replication, MSSQL CDC/Change Tracking, Kafka CDC envelopes, offset state, and runbooks.
backfill¶
Use backfill for large historical reloads that must be resumable and split into deterministic chunks.
sink:
strategy:
mode: backfill
backfill:
inner_mode: partition_replace
chunk:
column: business_date
from: "2025-01-01"
to: "2025-12-31"
step: 1d
parallel_workers: 4
flowchart TD
A["Create one run_id"] --> B["Plan deterministic chunks"]
B --> C["Create one load_id per chunk"]
C --> D["Execute inner strategy"]
D --> E{"chunk succeeded?"}
E -->|yes| F["Mark chunk committed and advance chunk state"]
E -->|no| G["Mark chunk failed and keep state unchanged"]
F --> H{"more chunks?"}
H -->|yes| C
H -->|no| I["Finalize run artifact"]
Algorithm:
- Create one
run_idfor the backfill execution. - Build deterministic chunks from the configured column and range.
- Create one
load_idper chunk. - Execute
inner_mode, usuallypartition_replace,replace, orincremental_merge. - Mark each chunk committed or failed independently.
- Resume from the first non-committed chunk on retry.
Backfill is a wrapper strategy. It does not change source/sink semantics; it adds chunking, auditability, and resumability on top of an inner strategy.
Supported inner modes today:
| Inner mode | Use case | Notes |
|---|---|---|
partition_replace |
preferred for partitioned historical reloads | fails fast for Kafka and non-partitioned targets |
replace |
chunk predicate replacement | useful when partition metadata is unavailable |
incremental_merge |
keyed chunk upserts | uses the sink default merge_policy unless overridden |
xmin¶
Use xmin for Postgres sources when you want transaction-ID based incremental extraction without a business timestamp column.
source:
type: postgres
options:
incremental_strategy: xmin
sink:
strategy:
mode: incremental_merge
unique_key: order_id
state:
type: mssql
table: {schema: etl_state, name: postgres_xmin_state}
flowchart TD
A["Read previous XMin watermark"] --> B["Capture current safe XMin upper bound"]
B --> C["Extract rows with xmin in the bounded range"]
C --> D["Load into target staging"]
D --> E["Merge by unique key"]
E --> F["Persist new XMin watermark after success"]
Algorithm:
- Validate that the source type is Postgres.
- Read the previous XMin watermark from state.
- Capture a safe upper bound before extraction.
- Extract rows whose
xminfalls within the bounded range. - Load and finalize with the configured sink strategy.
- Persist the upper bound only after sink success.
Copy from Postgres -> MSSQL, Postgres -> ClickHouse, or see the deep dive in Postgres XMin.
cdc¶
Use cdc when the source emits a change stream and you need insert/update/delete events rather than snapshot polling.
source:
type: postgres
options:
cdc:
enabled: true
slot: dpone_orders
publication: dpone_publication
sink:
strategy:
mode: incremental_merge
unique_key: order_id
flowchart TD
A["Read CDC offset from state"] --> B["Consume bounded change batch"]
B --> C["Normalize insert, update, and delete events"]
C --> D["Load events into staging or Kafka"]
D --> E["Apply sink-specific event semantics"]
E --> F["Persist CDC offset after success"]
Algorithm:
- Validate CDC capability for the source connector.
- Read typed CDC offset state.
- Consume a bounded batch of changes.
- Normalize operations into
insert,update, anddeletesemantics. - Apply sink-specific finalization or produce keyed events.
- Persist the CDC offset after sink success.
Copy from Postgres -> Kafka or MSSQL -> ClickHouse.
snapshot_reconciliation¶
Use snapshot_reconciliation when the source does not emit delete events but the target must reflect physical deletes.
reconciliation:
enabled: true
mode: snapshot
key: order_id
apply_deletes: true
delete_mode: soft_delete
flowchart TD
A["Extract current source key snapshot"] --> B["Load keys into reconciliation staging"]
B --> C["Compare target keys to source keys"]
C --> D["Create delete candidate set"]
D --> E["Apply soft delete, replacing-table marker, or target-native delete plan"]
E --> F["Write reconciliation metrics"]
Algorithm:
- Extract the current source key snapshot.
- Load keys into reconciliation staging.
- Compare staged source keys with target keys.
- Produce a delete candidate set.
- Apply the configured delete behavior through staging-first plans.
- Write reconciliation metrics to the run artifact.
Copy from Postgres -> MSSQL, MSSQL -> ClickHouse, or Postgres -> ClickHouse.
Copy/paste source/sink guide index¶
Use this index when you already know your source and target and want a ready manifest plus strategy runbook.
| Source | Sink | Guide |
|---|---|---|
| Postgres | MSSQL | Postgres -> MSSQL |
| Postgres | Postgres | Postgres -> Postgres |
| Postgres | ClickHouse | Postgres -> ClickHouse |
| Postgres | BigQuery | Postgres -> BigQuery |
| Postgres | Kafka | Postgres -> Kafka |
| MSSQL | MSSQL | MSSQL -> MSSQL |
| MSSQL | Postgres | MSSQL -> Postgres |
| MSSQL | ClickHouse | MSSQL -> ClickHouse |
| MSSQL | BigQuery | MSSQL -> BigQuery |
| MSSQL | Kafka | MSSQL -> Kafka |
| ClickHouse | MSSQL | ClickHouse -> MSSQL |
| ClickHouse | Postgres | ClickHouse -> Postgres |
| ClickHouse | ClickHouse | ClickHouse -> ClickHouse |
| ClickHouse | BigQuery | ClickHouse -> BigQuery |
| ClickHouse | Kafka | ClickHouse -> Kafka |
| REST API | MSSQL | REST API -> MSSQL |
| REST API | Postgres | REST API -> Postgres |
| REST API | ClickHouse | REST API -> ClickHouse |
| REST API | BigQuery | REST API -> BigQuery |
| REST API | Kafka | REST API -> Kafka |
| Kafka | MSSQL | Kafka -> MSSQL |
| Kafka | Postgres | Kafka -> Postgres |
| Kafka | ClickHouse | Kafka -> ClickHouse |
| Kafka | BigQuery | Kafka -> BigQuery |
| Kafka | Kafka | Kafka -> Kafka |