Skip to content

Architecture

This document describes the target OOP / clean architecture structure for dpone and the current migration status.

Current architecture snapshot

dpone is now organized as a production batch ELT runtime with explicit contracts for connectors, sources, sinks, artifacts, state, schema evolution, reconciliation, quality checks and operational UX.

The main runtime path is:

  1. Parse and normalize manifests in dpone.manifest.* and dpone.dag.*.
  2. Hydrate runtime objects through dpone.runtime.bootstrap.DefaultRuntimeHydrator.
  3. Extract source data into explicit artifacts.
  4. Enforce runtime data contracts and quarantine bad rows before staging.
  5. Plan schema evolution, target type compatibility and physical DDL before target writes.
  6. Load through sink strategies that use staging or shadow tables for database targets.
  7. Write data-contract evidence and commit load audit before source/Kafka/XMin state.
  8. Emit run reports, quality results and diagnostic artifacts.

First-class production connector families now include Postgres, MSSQL, ClickHouse, BigQuery, REST APIs and Kafka. See ADR 0005 for the connector architecture decision.

The operational maturity layer is split into focused packages rather than one large "production" module:

  • dpone.orchestration.*: run locks, retry handoff, scheduler snippets and orchestration artifacts.
  • dpone.ops.run_registry: auditable run registry entries.
  • dpone.ops.openlineage_export: dpone run lineage event export.
  • dpone.ops.dbt_artifacts, dpone.ops.dbt_openlineage, dpone.ops.dbt_lineage: dbt artifact parsing and transformation lineage.
  • dpone.ops.benchmark_baseline: performance regression gate.
  • dpone.ops.certification_artifacts, dpone.ops.certification_suite: full certification evidence gate.
  • dpone.observability.*: runtime metric extraction plus Prometheus and OpenTelemetry-compatible artifact export.
  • dpone.storage.* and dpone.staging.object_storage: provider-neutral S3/GCS/Azure staging manifests and adapters.
  • dpone.supply_chain.*: SBOM, provenance, signing envelope and release attestation evidence.
  • dpone.connector_sdk.*: community connector package scaffolding and generated certification templates.
  • dpone.strategy_intelligence.*: plan-only load strategy selection, repair planning, certification matrix, and native fast-path recommendations.
  • dpone.runtime.etl.lifecycle: runtime data contract enforcement, target compatibility, optional physical DDL apply and evidence handoff around ETLProcessor.

Visual map

Layer flow

flowchart LR
    CLI["dpone.cli"]
    Commands["dpone.commands"]
    Render["dpone.cli_render"]
    Services["dpone.services"]
    ManifestDag["dpone.manifest / dpone.dag"]
    Ports["dpone.ports"]
    Adapters["dpone.adapters"]
    App["dpone.app.context"]
    Bootstrap["dpone.runtime.bootstrap"]
    Runtime["dpone.runtime.*"]
    Shims["Compatibility shims\n(dpone.source / dpone.sink /\ndpone.lib.* / dpone.core.*)"]

    App --> Commands
    App --> Services
    App --> Adapters
    CLI --> Commands
    Commands --> Render
    Commands --> Services
    Services --> ManifestDag
    Services --> Ports
    Ports --> Adapters
    ManifestDag -. execution config .-> Bootstrap
    Bootstrap --> Runtime
    Runtime -. consumes process inputs .-> ManifestDag
    Shims -. re-export only .-> Runtime

Runtime extension points

classDiagram
    class APISourceDefaults {
        +api_type
        +credentials_mode
        +default_connection_type
    }
    class APIProviderRuntimeSpec {
        +defaults
        +connector_target
        +source_target
        +connector_kwargs_factory()
    }
    class DefaultRuntimeHydrator {
        +build(config, load_config)
        -_build_api_source(...)
        -_build_source(...)
        -_build_sink(...)
    }
    class SourceFactory {
        +create(...)
    }
    class SinkFactory {
        +create(...)
    }
    class AppsflyerConnector {
        +from_vault(...)
    }
    class AppsflyerSource
    class PostgresConnector
    class PostgresSource
    class BigQuerySink

    APIProviderRuntimeSpec --> APISourceDefaults
    DefaultRuntimeHydrator --> APIProviderRuntimeSpec : type=api
    DefaultRuntimeHydrator --> SourceFactory : db sources
    DefaultRuntimeHydrator --> SinkFactory : sinks
    APIProviderRuntimeSpec --> AppsflyerConnector
    APIProviderRuntimeSpec --> AppsflyerSource
    SourceFactory --> PostgresConnector
    SourceFactory --> PostgresSource
    SinkFactory --> BigQuerySink

Execution sequence

sequenceDiagram
    participant Config as manifest / dag config
    participant Builder as ETLProcessConfig / LoadConfigBuilder
    participant Hydrator as DefaultRuntimeHydrator
    participant APIReg as build_api_runtime_source()
    participant SourceFactory as SourceFactory.create()
    participant SinkFactory as SinkFactory.create()
    participant Runner as DefaultProcessRunner
    participant Processor as ETLProcessor

    Config->>Builder: parse / normalize config
    Builder->>Hydrator: build(config, load_config)
    alt source.type == api
        Hydrator->>APIReg: build API source from registry
        APIReg-->>Hydrator: connector + source
    else source.type == postgres / mssql / clickhouse / kafka
        Hydrator->>SourceFactory: create(...)
        SourceFactory-->>Hydrator: runtime source
    end
    Hydrator->>SinkFactory: create(...)
    SinkFactory-->>Hydrator: runtime sink
    Hydrator-->>Runner: RuntimeBindings
    Runner->>Processor: execute ETLProcess
    Processor-->>Runner: ProcessResult

Runtime execution responsibilities

flowchart TD
    Manifest["Manifest / batch config"]
    Hydrator["DefaultRuntimeHydrator"]
    Source["Source runtime"]
    Artifact["Rows / stream / file / partitioned artifact"]
    Contract["RuntimeLifecycleService\ncontract enforcement"]
    Evolution["SchemaEvolutionService"]
    Compatibility["Type compatibility / physical DDL"]
    Sink["Sink strategy"]
    Staging["Staging or shadow table"]
    Reconcile["Reconciliation / deletes"]
    Evidence["Data contract evidence"]
    State["State backend"]
    Report["Run artifact / quality report"]

    Manifest --> Hydrator
    Hydrator --> Source
    Source --> Artifact
    Artifact --> Contract
    Contract --> Evolution
    Evolution --> Compatibility
    Compatibility --> Sink
    Sink --> Staging
    Staging --> Reconcile
    Reconcile --> Evidence
    Evidence --> State
    State --> Report

For Kafka sinks, the Staging node is replaced by bounded producer buffering and delivery acknowledgements. Kafka is treated as an append/event-log target, not as a mutable table.

Operational evidence lane

flowchart LR
    Orchestrate["dpone orchestrate run"]
    Run["dpone run"]
    JobState["LocalJobStateStore"]
    Registry["dpone ops run-registry"]
    Observability["dpone observability metrics-export"]
    Lineage["dpone ops lineage-export"]
    Dbt["dpone ops dbt-lineage"]
    Benchmark["dpone ops benchmark-baseline"]
    Matrix["dpone ops certification-run"]
    Suite["dpone ops certification-suite"]
    Release["release / go-live evidence"]

    Orchestrate --> JobState
    JobState --> Run
    Run --> Registry
    Run --> Observability
    Registry --> Lineage
    Observability --> Suite
    Matrix --> Suite
    Benchmark --> Suite
    Lineage --> Suite
    Dbt --> Suite
    Registry --> Suite
    Suite --> Release

The evidence lane has the same architectural rule as runtime code: commands are thin adapters, services own business rules, and reusable parsing/checksum logic is isolated in focused helper classes.

Phase 2 taxonomy cleanup status

The current cleanup direction is protocol-first, adapter-specific implementation second. Public modules may remain as compatibility facades, but new business logic should live in focused implementation modules.

Phase 2 cleanup status:

  • dpone.manifest.validation, dpone.manifest.migrate and dpone.manifest.batch_compiler are thin compatibility facades over focused validation, migration and batch-compilation modules.
  • dpone.runtime.artifacts is a vendor-neutral facade over artifact models, row/file/cloud artifacts and staging helpers.
  • dpone.commands.registry is split into focused registry sections for core, ops, schema/CDC, manifest, DAG and docs commands.
  • API providers, BigQuery, Postgres CDC, ClickHouse sink, MSSQL strategies and Postgres base strategies keep public class names but now delegate through implementation modules that can be split further without breaking imports.
  • Module-size hard debt is closed: docs/module_size_baseline.json has zero allowlisted entries above the hard 600 LOC gate. Warning-level modules above 450 LOC remain visible in quality metrics and should be split before gaining substantial new behavior.

New code rules:

  • Manifest facades should re-export only; do not add validation or migration logic back into the facade modules.
  • Artifact core must not import vendor connectors, sources or sinks.
  • Generic reconciliation and API runtime core must depend on protocols/facades, not concrete BigQuery/Postgres/MSSQL/ClickHouse/Kafka/provider adapters.
  • Sink strategy packages must not import unrelated concrete sink packages.
  • Command registries should stay grouped by UX area; new command families get a focused registry module instead of expanding dpone.commands.registry.

Stable APIs and extension seams

  • Canonical imports for new code live under dpone.manifest.*, dpone.dag.*, dpone.runtime.*, dpone.contracts.*, dpone.ports.*, dpone.adapters.*.
  • Compatibility shims such as dpone.source, dpone.sink, dpone.lib.*, dpone.core.*, dpone.yaml_config_handler stay as backward-compatible re-export layers only.
  • New Pull API providers plug in through dpone.contracts.api_sources, dpone.runtime.api_registry, a runtime connector, a runtime source and provider-specific strategies/resources.
  • New database-backed integrations plug in through dpone.runtime.credentials.*, connector creation in factory/bootstrap, source/sink runtime classes, state backends and schema-evolution adapters.
  • Kafka integrations plug in as bounded batch sources/sinks through dpone.runtime.kafka.*, KafkaSource, KafkaSink, KafkaOffsetState and optional Schema Registry codecs.
  • CDC replay and idempotency plug in through dpone.runtime.cdc.identity and dpone.readiness.cdc_replay; readers stay focused on bounded extraction, while replay planning and offset commit safety stay credential-free and testable.
  • Object storage staging plugs in through dpone.storage.* and dpone.staging.object_storage; source exports, sink loaders, and certification jobs share one manifest and checksum contract instead of provider-specific dictionaries.
  • Community connector packages are generated through dpone.connector_sdk.*; the SDK is control-plane only and must not be imported by runtime source/sink execution paths.
  • Strategy intelligence plugs in through dpone.strategy_intelligence.*; it is credential-free and side-effect free, so CLI, docs, Studio, and future runtime optimizers can share one explainable decision contract.
  • Runtime artifacts are the boundary between extraction and loading. Sources should produce InMemoryRowsArtifact, StreamingRowsArtifact, FileExportArtifact, PartitionedFileExportArtifact or InternalQueryArtifact rather than calling sinks directly.

For an implementation-oriented checklist, see Developer integrations runbook.

Layers

  • dpone/cli/ – thin CLI entrypoint + argparse wiring
  • dpone/commands/ – CLI commands as OOP objects (no business logic)
  • dpone/cli_render/ – presentation layer (text/Markdown rendering for CLI outputs)
  • dpone/services/ – use-cases / application services (orchestrate domain + adapters)
  • dpone/services/dag/views/ – typed DAG CLI view-models / JSON payload DTOs between commands and renderers
  • dpone/app/ – composition root (DI) + settings/logging
  • dpone/ports/ – Protocols / abstract interfaces (filesystem, yaml codec, etc.)
  • dpone/adapters/ – concrete implementations (local fs, pyyaml, etc.)

Feature slices:

  • dpone/manifest/ – Variant C manifests (compile/validate/explain)
  • dpone/dag/ – DAG dependency model (graph builder, edge semantics, explain, report)
  • dpone/runtime/ – execution runtime (ETL/sources/sinks/connectors/state)

Current migration status

  • ✅ CLI refactoring started: commands + thin entrypoint.
  • ✅ DAG explain/report commands now use typed view-models (dpone.services.dag.views) instead of building JSON payloads ad-hoc in commands.
  • ✅ DAG subsystem renamed to dpone.dag.
  • dpone.yaml_config_handler is a deprecated shim kept for backward compatibility.
  • ✅ Dependency graph loading/building split into smaller DAG modules:
  • dpone.dag.node_registry – node indexing and uniqueness
  • dpone.dag.graph_relationships – dependency edge construction
  • dpone.dag.graph_algorithms – topo sort / cycle detection
  • dpone.dag.task_group_index – lazy task-group to manifest index
  • dpone.dag.manifest_chain_loader – recursive manifest loading
  • ✅ Unified dependency semantics extracted into dpone.dag.edge_resolver:
  • one source of truth for depends_on / group / selector / file-all / stem fallback
  • reused by Airflow DAG building, dependency graph building and explain/report tools
  • ✅ Manifest explain/provenance split into focused modules:
  • dpone.manifest.explain_models – result/why dataclasses
  • dpone.manifest.explain_trace – batch trace + provenance construction
  • dpone.manifest.explain_merge – diff/patch/origin-aware merge helpers
  • dpone.manifest.explain_why--why reasoning + patch suggestions
  • dpone.manifest.explain remains a thin public facade
  • ✅ Manifest CLI commands no longer go through dpone.cli.legacy:
  • shared manifest loading helpers live in dpone.services.manifest.load_context
  • typed manifest view-models live in dpone.services.manifest.views
  • text rendering moved to dpone.cli_render.manifest.*
  • dpone.cli.legacy is now only a deprecated compatibility shim:
  • canonical entrypoint: dpone.cli.main / dpone
  • canonical parser wiring: dpone.cli.parser + dpone.commands.*
  • CLI help/parsing happens before AppContext construction, so --help stays lightweight
  • ✅ Runtime packages moved under dpone.runtime.*.
  • Legacy import paths are kept as deprecated shims:
    • dpone.sourcedpone.runtime.sources
    • dpone.sinkdpone.runtime.sinks
    • dpone.lib.connectorsdpone.runtime.connectors
    • dpone.statedpone.runtime.state
    • dpone.credentialsdpone.runtime.credentials
    • etc.
  • ✅ Runtime ETL orchestration split into focused collaborators:
  • dpone.runtime.etl.processor now keeps orchestration only
  • dpone.runtime.etl.load_config_runtime isolates runtime LoadConfig mutation/enrichment
  • dpone.runtime.etl.run_state_tracker owns run-state persistence lifecycle
  • dpone.runtime.etl.reconciliation_service owns reconciliation + tech-connector lookup
  • dpone.runtime.state.models exposes RunState/RunStateStatus without BigQuery imports
  • ✅ BigQuery sink runtime split into focused collaborators:
  • dpone.runtime.sinks.strategies.bigquery.bigquery_base is now an orchestration-only facade
  • target_table_manager owns target table creation, labels/description and technical columns
  • dml_helper owns JSON-aware SELECT generation, DML execution and lookback cleanup SQL
  • exchange_logger owns progress/evidence logging for exchange/full-refresh flows
  • partition_validation_service owns ClickHouse partition validation after BigQuery loads
  • ✅ PostgreSQL sink runtime split into focused collaborators:
  • dpone.runtime.sinks.strategies.postgres.postgres_base is now an orchestration-only facade
  • target_table_manager owns target table creation + technical columns
  • file_export_loader owns COPY/exchange/truncate flows for FileExportArtifact
  • internal_query_loader owns CTAS/exchange workflow for InternalQueryArtifact
  • staging_sql_helper owns staging->target SQL helpers, typed select and column-type cache
  • ✅ Staging managers split into focused collaborators:
  • dpone.runtime.sinks.staging is now a thin facade
  • dpone.runtime.sinks.staging_managers.postgres owns PostgreSQL staging table lifecycle and COPY helpers
  • dpone.runtime.sinks.staging_managers.bigquery owns BigQuery staging lifecycle and row/query insertion
  • legacy postgres_staging_manager / bigquery_staging_manager modules are compatibility shims only
  • bigquery_staging_file_loader owns local-file → BigQuery staging routing (direct load vs GCS)
  • bigquery_staging_gcs_loader owns native GCS → BigQuery staging flows and partition row tracking
  • ✅ ClickHouse connector split into focused collaborators:
  • dpone.runtime.connectors.clickhouse is now a thin facade
  • clickhouse_query_ops owns low-level query execution, streaming and CSV import
  • clickhouse_gcs_export owns native GCS export SQL and HMAC-based export flow
  • clickhouse_partitioning owns date parsing, partition generation/discovery and row counting
  • clickhouse_incremental_export owns cleanup, partition query shaping and incremental GCS export orchestration
  • ✅ MSSQL is a first-class source, sink and state backend:
  • dpone.runtime.connectors.mssql owns SQL Server connectivity and metadata helpers
  • dpone.runtime.connectors.mssql_bulk owns bcp command generation and bulk file flows
  • dpone.runtime.sources.mssql and dpone.runtime.sinks.mssql expose runtime source/sink contracts
  • dpone.runtime.state.mssql supports run, XMin, Kafka and CDC state persistence
  • ✅ Kafka is a first-class bounded batch source/sink:
  • dpone.runtime.connectors.kafka owns producers, consumers, admin clients and Schema Registry clients
  • dpone.runtime.kafka.* owns codecs, envelopes, keys, offset planning and delivery aggregation
  • dpone.runtime.sources.kafka and dpone.runtime.sinks.kafka expose source/sink runtime contracts
  • ✅ Generic REST APIs are first-class Pull API sources:
  • dpone.runtime.connectors.api.rest owns HTTP/auth/pagination mechanics
  • dpone.runtime.sources.api.rest emits streaming row artifacts for database and Kafka sinks
  • ✅ Schema evolution is centralized:
  • dpone.runtime.schema_evolution compares source/target schemas, renders dialect DDL and maps generated __dpone__nc__* columns
  • database sinks apply safe evolution before staging/final load
  • ✅ Reconciliation and physical deletes are staging-first:
  • MSSQL/Postgres/ClickHouse soft-delete handlers use staging/shadow flows and avoid row-by-row target mutations
  • ClickHouse delete semantics avoid direct mutation-heavy update paths where table-engine alternatives are safer
  • ✅ CDC and offset state are explicit:
  • Postgres logical decoding, MSSQL CDC/Change Tracking contracts and Kafka offsets share typed state models
  • state backends are selected explicitly through runtime configuration
  • ✅ Managed-like UX commands are layered over services:
  • doctor, init, plan, run-report, state, connectors, perf and studio commands call reusable services rather than duplicating runtime logic
  • ✅ Documentation is now published as a strict GitHub Pages site:
  • mkdocs.yml defines curated navigation
  • .github/workflows/pages.yml builds with mkdocs build --strict
  • source -> sink guides, type mappings, load strategies and Postgres XMin runbooks are first-class docs

Principles

  • KISS: prefer small modules with explicit dependencies.
  • DI: construct dependencies only in dpone.app.context.AppContext.
  • SOLID:
  • SRP: each module does one thing
  • OCP: add new sources/sinks/commands without modifying core
  • ISP: minimal ports (Protocol) with small surface area
  • DIP: services depend on ports, not adapters

Import rules (to keep coupling under control)

These are the rules we follow during refactoring, and they are now checked automatically via dpone docs check-import-rules and pytest. Vendor-specific implementations are allowed, but they must live behind adapter packages and protocols/facades; generic core modules must not import BigQuery, ClickHouse, MSSQL, Postgres, Kafka, or API-provider adapters directly. We also track coarse layer/slice trends via dpone docs check-layer-metrics against docs/layer_metrics_baseline.json. Architecture-fitness drift is checked via dpone docs check-architecture-fitness, which flags high fan-out modules, broad facade dependencies and high-responsibility classes before they become god modules. Module LOC/SLOC debt is tracked via dpone docs check-module-size against docs/module_size_baseline.json. Docs/readme links are checked via dpone docs check-docs, compatibility/deprecation policy is validated via dpone docs check-compatibility against docs/compatibility_registry.yaml, and auto-generated docs are kept in sync via dpone docs update-cli-reference --check and dpone docs update-deprecation-roadmap --check.

  • CLI/manifest/DAG analysis utilities must remain usable without optional runtime deps.
  • Heavy symbols must be imported lazily.
  • dpone.commands.* must not contain business logic.
  • dpone.runtime.* may depend on dpone.manifest.* and dpone.dag.* as inputs.
  • Deprecated shim imports (dpone.source, dpone.sink, dpone.etl, dpone.yaml_config_handler, etc.) are forbidden outside shim packages themselves.

See also: Import rules.

Notes about dpone.dag.config

dpone.dag.config is now a thin public facade. The former monolithic module was split into focused parts:

  • dpone.dag.config_models – public ETLProcessConfig dataclass
  • dpone.dag.process_config_parser – compiled dict -> process model
  • dpone.dag.load_config_buildersource/sink -> LoadConfig
  • dpone.dag.dependency_parserdepends_on normalization
  • dpone.dag.config_refs<file>#<selector> helpers + manifest-loader bridge

dpone.dag.config.ETLProcessConfig remains the canonical import path for backward compatibility and re-exports these smaller building blocks.

ETLProcessConfig is now a pure parsed config model for the DAG/manifest layers:

  • It parses YAML / compiled manifests into normalized configs.
  • In metadata-only mode it stays fully pure (no runtime objects).
  • In execution mode it requests runtime bindings through the dpone.ports.runtime_hydrator port.

Runtime creation moved to dpone.runtime.bootstrap:

  • DefaultRuntimeHydrator creates sources/sinks/logger/state storages.
  • DefaultProcessRunner executes ETLProcess through ETLProcessor.

This means dpone.dag/* and dpone.manifest/* contain no direct imports from dpone.runtime.*, while runtime execution remains available through lazy port registration.

Notes about dpone.manifest.explain

dpone.manifest.explain is now a thin facade over smaller modules. This keeps the public API stable (explain_manifest, explain_why, compute_deep_patch) while making it easier to evolve provenance, patches and why-debugging independently.

A small but important compiler safety fix was made alongside this split: each compiled table config is now isolated via deepcopy, preventing cross-table mutation when source/sink shells and naming templates are injected during batch compilation.

Architecture metrics

Developer metrics now include a coarse-grained layer / slice architecture view in addition to module-level LOC and import coupling. This helps track whether refactoring is actually reducing cross-layer traffic (commands -> services -> manifest/dag, manifest/dag -> ports, etc.) rather than merely moving lines around.

See also: Quality metrics.

Runtime coupling cleanup

The canonical runtime code now depends on dpone.contracts.*, dpone.runtime.artifacts and dpone.runtime.support.* instead of importing legacy dpone.core.* / dpone.lib.* helpers directly. Legacy core/* and lib/* modules remain as deprecated shims for backward compatibility.

API runtime registry

dpone.runtime.api_registry is the canonical registration point for runtime API source api_type values. It stores metadata, defaults, and lazy wiring for providers such as omnidesk, appsflyer, mindbox, and cbr. The DAG and manifest layer uses dpone.contracts.api_sources for compatible default derivation without importing runtime modules.

Rollout bundles

Concrete provider-specific rollout bundles live under deploy/argo/<provider>/ and are documented in matching docs/*_ROLLOUT.md pages.

Type contracts and physical design layer

dpone keeps type detection and physical DDL planning split into small, testable layers:

  • dpone.type_system profiles source metadata and sampled rows, then produces portable logical type decisions with confidence and provenance.
  • dpone.readiness.schema_contracts owns user-declared logical column contracts and enforcement modes.
  • dpone.readiness.target_type_resolvers maps logical columns to concrete MSSQL, PostgreSQL, ClickHouse, BigQuery, or Kafka types.
  • dpone.readiness.physical_design renders target-specific DDL for new tables and governed physical changes.
  • dpone.commands.schema_plan_cmd exposes the layer through dpone schema infer and dpone schema physical-plan.

The runtime rule is the same as online schema evolution: new target table design may be fully planned from configuration, while existing-table DDL must go through governance, risk classification, and approval for blocking changes.