MSSQL -> Kafka

This guide is a copy/paste-ready starting point for loading data from MSSQL into Kafka with dpone.

Status: Batch/event-log supported

Vendor-live evidence uses a wide typed fixture (dpone_src CSV → Kafka JSON envelope). It covers the Docker strategies full_refresh, incremental_append, incremental_merge, replace, snapshot_diff, and backfill (inner incremental_merge) via tests/integration/mssql/test_mssql_to_kafka_vendor_live_integration.py. partition_replace, scd2, and non-merge backfill inner modes are N/A for KafkaSink. The extract uses CSV (not mssql-delimited). See Route live wide certification.

Vendor-live IT (manual)

docker compose -f docker/docker-compose.integration.yml up -d mssql kafka
export DPONE_RUN_INTEGRATION=1 DPONE_RUN_INTEGRATION_LIVE=1
uv sync --extra kafka --extra mssql
uv run pytest tests/integration/mssql/test_mssql_to_kafka_vendor_live_integration.py -q

When to use this path

Use this path when MSSQL is the system of record or ingestion boundary and Kafka is the landing, warehouse, event-log, or downstream replication target.

Copy/paste manifest

# yaml-language-server: $schema=../../src/dpone/schema/etl-batch-manifest.schema.json
kind: dpone.batch.v1

defaults:
  name: mssql_to_kafka_example
  source:
    type: mssql
    connection_id: mssql_source
    options:
      batch_size: 50000
      export_format: csv
  sink:
    type: kafka
    connection_id: kafka_target
    topic: dwh.orders
    strategy:
      mode: incremental_append
    options:
      message_format: json
      envelope: dpone
      key:
        mode: hash_row
      delivery:
        mode: at_least_once

quality:
  gates:
    - id: source_target_rows
      type: row_count_reconciliation
      severity: error
      tolerance:
        mode: pct
        value: 0.1

schemas:
  dbo:
    tables:
      - orders

Run it locally:

dpone plan examples/source-sink/mssql-to-kafka.yaml --format md
dpone run examples/source-sink/mssql-to-kafka.yaml

The checked-in source file is examples/source-sink/mssql-to-kafka.yaml; CI compares its parsed YAML with this block.
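
The manifest above sets key.mode to hash_row, but this page does not say how the key is derived. As a rough illustration only, assuming the key is a digest over a canonical serialization of the row, the idea could be sketched as follows. The helper name `hash_row_key` and the choice of SHA-256 over sorted JSON are assumptions, not dpone's documented behavior.

```python
import hashlib
import json


def hash_row_key(row):
    """Return a deterministic hex digest for a row (hypothetical helper, not dpone API)."""
    # Sort keys and fix separators so equal rows always serialize identically.
    canonical = json.dumps(row, sort_keys=True, separators=(",", ":"), default=str)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Same content in a different column order yields the same key.
key_a = hash_row_key({"order_id": 1, "status": "paid"})
key_b = hash_row_key({"status": "paid", "order_id": 1})
# Changed content yields a different key.
key_c = hash_row_key({"order_id": 1, "status": "refunded"})
```

If keys are content-derived like this, a row that is redelivered under at_least_once carries the same key as the first delivery, which gives consumers something to deduplicate on.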

If you switch the strategy to full_refresh and an empty result is not valid for your dataset, row-count reconciliation alone is not enough: a 0 source / 0 target comparison passes. Add an explicit non-empty target gate:

quality:
  gates:
    - id: target_min_rows
      type: min_rows
      side: target
      threshold: 1
      severity: error
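
To see why the extra gate matters, here is a minimal sketch of the two checks. It assumes that tolerance mode pct means "allowed difference as a percentage of the source count"; the function names are hypothetical and do not reflect dpone's internal implementation.

```python
def row_count_reconciliation(source_rows, target_rows, tolerance_pct):
    """Pass when the absolute difference is within tolerance_pct percent of the source count."""
    allowed = source_rows * tolerance_pct / 100.0
    return abs(source_rows - target_rows) <= allowed


def min_rows(target_rows, threshold):
    """Pass when the target holds at least `threshold` rows."""
    return target_rows >= threshold


# An empty extract published to an empty target.
source_rows, target_rows = 0, 0
reconciled = row_count_reconciliation(source_rows, target_rows, 0.1)  # difference is 0
non_empty = min_rows(target_rows, 1)  # target has fewer rows than the threshold
```

With both counts at zero the reconciliation check passes, because the difference is zero, while the min_rows check fails. Only the second gate flags the empty publication.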

Supported load strategies

Kafka is append-only message publication. Strategy names describe event semantics; they do not mutate existing topic records or swap topic storage. The wide vendor-live test covers the rows marked Event publication below; N/A rows are documented but not claimed as certified.

| Strategy | Status | Notes |
| --- | --- | --- |
| `full_refresh` | Event publication | Publishes the bounded snapshot as new messages; existing messages remain. |
| `incremental_append` | Event publication | Publishes each bounded source row as a new message. |
| `incremental_merge` | Event publication | Publishes keyed upsert/delete events with `merge_policy: event_upsert`. |
| `replace` | Event publication | Publishes replacement-intent events; consumers decide how to materialize them. |
| `partition_replace` | N/A | Kafka topics are append-only logs; use `incremental_merge` with `merge_policy: event_upsert`. |
| `snapshot_diff` | Event publication | Publishes keyed upsert/delete events derived from a complete source snapshot. |
| `scd2` | N/A | KafkaSink has no SCD2 validity columns; use keyed upsert events. |
| `backfill` (inner merge) | Event publication | Replays a bounded window as keyed upsert events. |

See Load strategies for the detailed algorithm for each strategy. MSSQL CDC is a source capability, not a load strategy. It uses typed CDC offsets and advances source state only after Kafka delivery succeeds; certify the exact route and environment before enabling it.
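
Because the topic is never rewritten, consumers are the ones that turn keyed upsert/delete events into current state. The sketch below shows one way a consumer might fold such events into a table. The event fields `op`, `key`, and `row` are hypothetical placeholders, not the actual dpone envelope layout.

```python
def materialize(events):
    """Fold an ordered stream of keyed events into the latest row per key."""
    state = {}
    for event in events:
        if event["op"] == "upsert":
            state[event["key"]] = event["row"]
        elif event["op"] == "delete":
            # Deleting an unknown key is a no-op, so redelivered deletes are harmless.
            state.pop(event["key"], None)
        else:
            raise ValueError(f"unknown op: {event['op']!r}")
    return state


events = [
    {"op": "upsert", "key": "1", "row": {"order_id": 1, "status": "new"}},
    {"op": "upsert", "key": "2", "row": {"order_id": 2, "status": "new"}},
    {"op": "upsert", "key": "1", "row": {"order_id": 1, "status": "paid"}},
    {"op": "delete", "key": "2", "row": None},
]
table = materialize(events)
# Replaying the whole sequence in order leaves the result unchanged.
table_after_replay = materialize(events + events)
```

An in-order replay of the full sequence produces the same table, which is the kind of idempotency a consumer needs when messages can be delivered more than once.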

Runtime algorithm

This sink does not currently implement StagedLoadPort, so this route records governance_finalization=legacy_post_finalize. Blocking gates run only after messages have been published. A failure prevents source-state advancement but cannot retract delivered messages; inspect delivery evidence and consumer idempotency before retrying. See Load governance.

flowchart TD
    A["Resolve manifest and registry entries"] --> B["Create MSSQL source"]
    B --> C["Plan bounded extract"]
    C --> D["Read through CSV file export (Kafka-safe wire)"]
    D --> E["Emit ExtractResult with schema and artifact"]
    E --> F["Plan schema evolution"]
    F --> G["Create bounded Kafka producer batch"]
    G --> H["Publish messages and wait for delivery aggregation"]
    H --> J["Run quality and reconciliation checks (legacy_post_finalize)"]
    J --> K["Advance state only after success"]
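
The ordering in the last three steps of the flowchart can be sketched as below: publish first, run gates second, advance state last. `run_batch` and its callable arguments are hypothetical stand-ins for illustration, not dpone functions.

```python
def run_batch(publish, run_gates, advance_state):
    """Publish first, gate second, advance state last (post-finalize ordering)."""
    delivered = publish()  # messages are already on the topic once this returns
    if not run_gates(delivered):
        # Gate failure: messages stay published, source state is left untouched.
        return {"delivered": delivered, "state_advanced": False}
    advance_state()
    return {"delivered": delivered, "state_advanced": True}


state = {"watermark": 0}


def advance():
    state["watermark"] = 100


def at_least_one(count):
    return count >= 1


# First run publishes nothing, so the gate fails and the watermark stays put.
failed = run_batch(lambda: 0, at_least_one, advance)
watermark_after_failure = state["watermark"]
# Second run publishes five messages, passes the gate, and advances state.
passed = run_batch(lambda: 5, at_least_one, advance)
```

The sketch makes the trade-off visible: a failed gate leaves the watermark where it was, so the next run re-reads the same window, but whatever `publish` delivered is still on the topic.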

Strategy behavior

full_refresh: publish every row in the bounded snapshot as a new message; existing topic records remain.
incremental_append: publish only the bounded incremental rows as new messages.
incremental_merge: publish keyed upsert/delete events with merge_policy: event_upsert; requires unique_key.
replace: publish replacement-intent events for a bounded window; no existing message is rewritten.
snapshot_diff: publish keyed upsert/delete events derived from a complete source snapshot.
partition_replace: not supported for Kafka sinks because a topic is an append-only event log.
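
For snapshot_diff, the core idea of deriving keyed events from a complete snapshot can be sketched as a comparison against the previously seen snapshot. How dpone stores or compares the earlier snapshot is not described on this page, so the in-memory dictionaries and the `op`/`key`/`row` event fields below are assumptions.

```python
def snapshot_diff(previous, current):
    """Compare two {key: row} snapshots and return upsert/delete events."""
    events = []
    for key, row in current.items():
        # New keys and changed rows both become upserts.
        if previous.get(key) != row:
            events.append({"op": "upsert", "key": key, "row": row})
    for key in previous:
        # Keys missing from the complete current snapshot become deletes.
        if key not in current:
            events.append({"op": "delete", "key": key, "row": None})
    return events


previous = {
    "1": {"status": "new"},
    "2": {"status": "new"},
    "3": {"status": "new"},
}
current = {
    "1": {"status": "new"},
    "2": {"status": "paid"},
    "4": {"status": "new"},
}
diff_events = snapshot_diff(previous, current)
```

Deletes can only be inferred because the current snapshot is complete. With a partial extract, a missing key would be indistinguishable from a row that was simply out of the window.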

Snapshot reconciliation is separate from the load strategy. Runtime planning reports that capability as reconciliation.mode=snapshot; in the official dpone.batch.v1 authoring schema, enable it with reconciliation: true.

Schema evolution and type mapping

Schema evolution is enabled by default and runs before bounded message publication:

Read source schema from ExtractResult.schema.
Introspect the Kafka target schema.
Apply safe additions and widening operations.
Fail breaking changes by default.
If configured, route incompatible type changes to __dpone__nc__<column>.
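
The decision steps above can be sketched as a small planner. The type names and the `SAFE_WIDENINGS` set are hypothetical; the real rules are in the Type mapping matrix, and `plan_schema_evolution` is not a dpone function.

```python
# Hypothetical (target_type, source_type) pairs treated as safe widenings.
SAFE_WIDENINGS = {("int32", "int64"), ("float32", "float64")}


def plan_schema_evolution(source, target, route_incompatible=False):
    """Return (action, column) decisions for source columns against the target schema."""
    plan = []
    for column, src_type in source.items():
        tgt_type = target.get(column)
        if tgt_type is None:
            plan.append(("add_column", column))  # safe addition
        elif tgt_type == src_type:
            continue  # nothing to do
        elif (tgt_type, src_type) in SAFE_WIDENINGS:
            plan.append(("widen", column))  # safe widening
        elif route_incompatible:
            # Incompatible change routed to a side column instead of failing.
            plan.append(("route", f"__dpone__nc__{column}"))
        else:
            raise ValueError(f"breaking change on {column}: {tgt_type} -> {src_type}")
    return plan


safe_plan = plan_schema_evolution(
    {"id": "int64", "amount": "float64", "note": "string"},
    {"id": "int32", "amount": "float64"},
)
routed_plan = plan_schema_evolution(
    {"id": "string"}, {"id": "int64"}, route_incompatible=True
)
```

Without `route_incompatible`, the same incompatible change raises an error, which mirrors the fail-by-default behavior for breaking changes.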

Use Schema evolution and Type mapping matrix when adding columns or changing source types.

Self-service golden path

Copy-paste CJM for the checked-in example (wide vendor-live certified route):

dpone doctor --profile local
pip install "dpone[mssql,kafka]"
dpone plan examples/source-sink/mssql-to-kafka.yaml --format md
dpone schema type-matrix --source mssql --sink kafka --format md
dpone run examples/source-sink/mssql-to-kafka.yaml

Landing convention (vault/GitOps-oriented): examples/batch/landing_mssql_to_kafka.batch.yaml.

See Route live wide certification for the maintainer vendor-live IT evidence path (a skipped test does not count as a pass).

Runbook

Start with dpone doctor --profile local and fix missing extras or native clients.
Run dpone plan examples/source-sink/mssql-to-kafka.yaml --format md and review source boundary, publication and delivery path, schema evolution, state, and quality gates.
Run a small bounded window first.
Inspect the run artifact under .dpone/runs/mssql_to_kafka.
For incremental jobs, verify state before enabling a schedule.
For delete-aware event semantics, review keyed tombstone/upsert behavior with consumers before publication.
Promote the manifest through GitOps after the plan and artifact are reviewed.

Cross-links

Type contracts and physical design

This flow supports the shared dpone type-governance stack:

Type inference for source metadata, sampled profiling, confidence, and empty string vs NULL behavior.
Schema contracts for explicit logical column types, enforcement modes, and __dpone__nc__* variant columns.
Physical design for target-specific DDL such as concrete SQL types, indexes, partitioning, compression, ClickHouse LowCardinality, and BigQuery clustering.

Use dpone schema infer --manifest ... and dpone schema physical-plan --manifest ... before enabling new table DDL in production.