Skip to content

Load lineage and technical columns

dpone uses canonical framework metadata columns to make loads auditable, retry-safe, and debuggable across database, API, and Kafka pipelines.

The rule is simple: framework-owned columns always use the __dpone__* namespace. User data should not create columns with this prefix.

Quick start

Default lineage is enabled and uses the standard preset plus quarantine support:

sink:
  options:
    lineage:
      enabled: true
      preset: standard
      features:
        quarantine: true

Disable target lineage columns explicitly when you need a strict legacy target shape:

sink:
  options:
    lineage: false

Set defaults once in batch manifests:

defaults:
  sink:
    options:
      lineage:
        enabled: true
        preset: standard
        features:
          operations: true
          quarantine: true

Presets and features

Presets are not mutually exclusive modes. They are starting points. Features can be enabled or disabled on top of a preset.

Preset Target columns
off none
minimal __dpone__load_id, __dpone__loaded_at
standard minimal plus __dpone__row_id, __dpone__extracted_at
debug standard plus operation and diagnostics columns
hierarchical standard plus parent/root/list index columns
Feature Adds
identity __dpone__load_id, __dpone__loaded_at
row_identity __dpone__row_id, __dpone__extracted_at
hierarchy __dpone__parent_row_id, __dpone__root_row_id, __dpone__list_index
operations __dpone__op
diagnostics __dpone__meta
quarantine rejected rows go to state quarantine storage, not to the target table

For dlt-like splitting of nested JSON into root and child tables, see Nested normalization. The hierarchical preset is the lineage contract used by that runtime path.

Example: standard lineage with operation and diagnostics columns:

sink:
  options:
    lineage:
      enabled: true
      preset: standard
      features:
        operations: true
        diagnostics: true
        quarantine: true

Canonical technical columns

Column Purpose Type contract
__dpone__loaded_at UTC time when the load package wrote the row timestamp
__dpone__deleted_at UTC logical delete time nullable timestamp
__dpone__xmin PostgreSQL XMin checkpoint value when materialized bigint/string by sink
__dpone__deleted_marker Physical delete reconciliation marker boolean/integer by sink
__dpone__run_id One process execution identifier ULID string, 26 chars
__dpone__load_id One target load package identifier ULID string, 26 chars
__dpone__row_id Deterministic row lineage identity SHA-256 hex, 64 chars
__dpone__parent_row_id Parent row identity for normalized nested data SHA-256 hex, 64 chars
__dpone__root_row_id Root row identity for normalized nested data SHA-256 hex, 64 chars
__dpone__list_index Stable list position for normalized nested arrays integer
__dpone__op Normalized operation marker such as insert/update/delete string
__dpone__meta Optional row diagnostics JSON/object
__dpone__row_hash Business row hash for diff/SCD2 strategies SHA-256 hex, 64 chars
__dpone__valid_from_at SCD2 validity start timestamp timestamp
__dpone__valid_to_at SCD2 validity end timestamp nullable timestamp
__dpone__is_current SCD2 current row flag boolean
__dpone__extracted_at UTC time when the source row was extracted timestamp

Legacy names such as meta__load_dtm, meta__update_dtm, meta__delete_dtm, meta__xmin, and __dpone_deleted_marker are accepted only through compatibility resolution. New tables should use canonical names only.

Load audit table

State backends can store load package lifecycle records in etl_state.__dpone__loads.

stateDiagram-v2
    [*] --> started
    started --> staged: staging loaded
    staged --> committed: target finalized
    started --> failed: extract/load error
    staged --> failed: finalization error
    committed --> [*]
    failed --> [*]

State advancement happens only after the load audit status reaches committed.

Status Meaning
started Run/load IDs were created and extraction is about to execute
staged Source payload was materialized into staging or producer batch
committed Sink finalization succeeded and state can advance
failed The load package failed; state must not advance
cancelled Reserved for cooperative cancellation flows

Row identity algorithm

__dpone__row_id is deterministic across retries. It does not include retry-specific values such as __dpone__load_id or __dpone__loaded_at.

flowchart TD
    A["Input source row"] --> B{"unique_key configured?"}
    B -->|yes| C["Normalize source family, source table, unique_key values"]
    B -->|no| D["Normalize source family, source table, source row content, stable row path"]
    C --> E["SHA-256 hex"]
    D --> E
    E --> F["__dpone__row_id"]

Preferred input is:

source_type + source_schema + source_table + normalized unique_key values

Fallback input is:

source_type + source_schema + source_table + normalized source row content + stable row path

Use a unique_key whenever possible. Fallback identity is deterministic for stable extract ordering, but a business key is more robust for retries, backfills, and source-side ordering changes.

Nested hierarchy relationship

Nested normalization uses the same row identity service and adds parent/root relationships:

flowchart LR
    Root["orders.__dpone__row_id"] --> Customer["orders__customer.__dpone__parent_row_id"]
    Root --> Items["orders__items.__dpone__parent_row_id"]
    Items --> Discounts["orders__items__discounts.__dpone__parent_row_id"]
    Root -. root lineage .-> DiscountsRoot["orders__items__discounts.__dpone__root_row_id"]

Join child rows back to parent rows with:

child.__dpone__parent_row_id = parent.__dpone__row_id

See Nested normalization for table naming, scalar arrays, configuration, processor behavior, and runbooks.

Strategy interaction

Strategy Lineage usage
full_refresh Writes one load_id for the load package and row IDs when enabled
incremental_append Uses row IDs for audit/debug and source cursor state for checkpointing
incremental_merge Uses unique_key for target finalization; row IDs remain lineage IDs, not business keys
snapshot_diff Uses __dpone__row_hash to detect changed rows
scd2 Uses __dpone__valid_from_at, __dpone__valid_to_at, __dpone__is_current, and __dpone__row_hash
cdc_apply Uses __dpone__op when operation tracking is enabled
backfill Uses one run_id and multiple load_id values for chunks

See Load strategies for the execution algorithms.

Implementation map

Concept Code
Technical column catalog src/dpone/contracts/technical_columns.py
Lineage option resolution src/dpone/runtime/lineage/options.py
Row ID and ULID generation src/dpone/runtime/lineage/identity.py
Row enrichment src/dpone/runtime/lineage/enrichment.py
Nested normalization src/dpone/runtime/normalization/normalizer.py
Load lifecycle service src/dpone/runtime/lineage/audit.py
Processor lifecycle integration src/dpone/runtime/etl/processor.py

Runbooks

Target rejects __dpone__* columns

  1. Confirm the target table allows schema evolution or pre-create the lineage columns manually.
  2. If this is a strict legacy target, set sink.options.lineage: false.
  3. Keep run artifacts and __dpone__loads enabled through state storage when possible, even if target row lineage is disabled.

Row identity changes unexpectedly

  1. Prefer an explicit sink.strategy.unique_key for merge/diff/SCD2 flows.
  2. Check whether the source extract ordering changed when no unique_key is configured.
  3. Compare source row content before enrichment. Retry-specific lineage fields are intentionally excluded from __dpone__row_id.

Legacy target already has meta__* columns

  1. Use technical_columns_naming: legacy_compat for one deprecation window.
  2. Plan a migration to canonical __dpone__* names.
  3. Do not create new meta__*, *_dtm, or *_dttm framework columns.