Architecture¶
This document describes the target OOP / clean architecture structure for dpone and the current migration status.
Current architecture snapshot¶
dpone is now organized as a production batch ELT runtime with explicit contracts for connectors, sources, sinks, artifacts, state, schema evolution, reconciliation, quality checks and operational UX.
The main runtime path is:
- Parse and normalize manifests in
dpone.manifest.*anddpone.dag.*. - Hydrate runtime objects through
dpone.runtime.bootstrap.DefaultRuntimeHydrator. - Extract source data into explicit artifacts.
- Enforce runtime data contracts and quarantine bad rows before staging.
- Plan schema evolution, target type compatibility and physical DDL before target writes.
- Load through sink strategies that use staging or shadow tables for database targets.
- Write data-contract evidence and commit load audit before source/Kafka/XMin state.
- Emit run reports, quality results and diagnostic artifacts.
First-class production connector families now include Postgres, MSSQL, ClickHouse, BigQuery, REST APIs and Kafka. See ADR 0005 for the connector architecture decision.
The operational maturity layer is split into focused packages rather than one large "production" module:
dpone.orchestration.*: run locks, retry handoff, scheduler snippets and orchestration artifacts.dpone.ops.run_registry: auditable run registry entries.dpone.ops.openlineage_export: dpone run lineage event export.dpone.ops.dbt_artifacts,dpone.ops.dbt_openlineage,dpone.ops.dbt_lineage: dbt artifact parsing and transformation lineage.dpone.ops.benchmark_baseline: performance regression gate.dpone.ops.certification_artifacts,dpone.ops.certification_suite: full certification evidence gate.dpone.observability.*: runtime metric extraction plus Prometheus and OpenTelemetry-compatible artifact export.dpone.storage.*anddpone.staging.object_storage: provider-neutral S3/GCS/Azure staging manifests and adapters.dpone.supply_chain.*: SBOM, provenance, signing envelope and release attestation evidence.dpone.connector_sdk.*: community connector package scaffolding and generated certification templates.dpone.strategy_intelligence.*: plan-only load strategy selection, repair planning, certification matrix, and native fast-path recommendations.dpone.runtime.etl.lifecycle: runtime data contract enforcement, target compatibility, optional physical DDL apply and evidence handoff aroundETLProcessor.
Visual map¶
Layer flow¶
flowchart LR
CLI["dpone.cli"]
Commands["dpone.commands"]
Render["dpone.cli_render"]
Services["dpone.services"]
ManifestDag["dpone.manifest / dpone.dag"]
Ports["dpone.ports"]
Adapters["dpone.adapters"]
App["dpone.app.context"]
Bootstrap["dpone.runtime.bootstrap"]
Runtime["dpone.runtime.*"]
Shims["Compatibility shims\n(dpone.source / dpone.sink /\ndpone.lib.* / dpone.core.*)"]
App --> Commands
App --> Services
App --> Adapters
CLI --> Commands
Commands --> Render
Commands --> Services
Services --> ManifestDag
Services --> Ports
Ports --> Adapters
ManifestDag -. execution config .-> Bootstrap
Bootstrap --> Runtime
Runtime -. consumes process inputs .-> ManifestDag
Shims -. re-export only .-> Runtime
Runtime extension points¶
classDiagram
class APISourceDefaults {
+api_type
+credentials_mode
+default_connection_type
}
class APIProviderRuntimeSpec {
+defaults
+connector_target
+source_target
+connector_kwargs_factory()
}
class DefaultRuntimeHydrator {
+build(config, load_config)
-_build_api_source(...)
-_build_source(...)
-_build_sink(...)
}
class SourceFactory {
+create(...)
}
class SinkFactory {
+create(...)
}
class AppsflyerConnector {
+from_vault(...)
}
class AppsflyerSource
class PostgresConnector
class PostgresSource
class BigQuerySink
APIProviderRuntimeSpec --> APISourceDefaults
DefaultRuntimeHydrator --> APIProviderRuntimeSpec : type=api
DefaultRuntimeHydrator --> SourceFactory : db sources
DefaultRuntimeHydrator --> SinkFactory : sinks
APIProviderRuntimeSpec --> AppsflyerConnector
APIProviderRuntimeSpec --> AppsflyerSource
SourceFactory --> PostgresConnector
SourceFactory --> PostgresSource
SinkFactory --> BigQuerySink
Execution sequence¶
sequenceDiagram
participant Config as manifest / dag config
participant Builder as ETLProcessConfig / LoadConfigBuilder
participant Hydrator as DefaultRuntimeHydrator
participant APIReg as build_api_runtime_source()
participant SourceFactory as SourceFactory.create()
participant SinkFactory as SinkFactory.create()
participant Runner as DefaultProcessRunner
participant Processor as ETLProcessor
Config->>Builder: parse / normalize config
Builder->>Hydrator: build(config, load_config)
alt source.type == api
Hydrator->>APIReg: build API source from registry
APIReg-->>Hydrator: connector + source
else source.type == postgres / mssql / clickhouse / kafka
Hydrator->>SourceFactory: create(...)
SourceFactory-->>Hydrator: runtime source
end
Hydrator->>SinkFactory: create(...)
SinkFactory-->>Hydrator: runtime sink
Hydrator-->>Runner: RuntimeBindings
Runner->>Processor: execute ETLProcess
Processor-->>Runner: ProcessResult
Runtime execution responsibilities¶
flowchart TD
Manifest["Manifest / batch config"]
Hydrator["DefaultRuntimeHydrator"]
Source["Source runtime"]
Artifact["Rows / stream / file / partitioned artifact"]
Contract["RuntimeLifecycleService\ncontract enforcement"]
Evolution["SchemaEvolutionService"]
Compatibility["Type compatibility / physical DDL"]
Sink["Sink strategy"]
Staging["Staging or shadow table"]
Reconcile["Reconciliation / deletes"]
Evidence["Data contract evidence"]
State["State backend"]
Report["Run artifact / quality report"]
Manifest --> Hydrator
Hydrator --> Source
Source --> Artifact
Artifact --> Contract
Contract --> Evolution
Evolution --> Compatibility
Compatibility --> Sink
Sink --> Staging
Staging --> Reconcile
Reconcile --> Evidence
Evidence --> State
State --> Report
For Kafka sinks, the Staging node is replaced by bounded producer buffering and delivery acknowledgements. Kafka is treated as an append/event-log target, not as a mutable table.
Operational evidence lane¶
flowchart LR
Orchestrate["dpone orchestrate run"]
Run["dpone run"]
JobState["LocalJobStateStore"]
Registry["dpone ops run-registry"]
Observability["dpone observability metrics-export"]
Lineage["dpone ops lineage-export"]
Dbt["dpone ops dbt-lineage"]
Benchmark["dpone ops benchmark-baseline"]
Matrix["dpone ops certification-run"]
Suite["dpone ops certification-suite"]
Release["release / go-live evidence"]
Orchestrate --> JobState
JobState --> Run
Run --> Registry
Run --> Observability
Registry --> Lineage
Observability --> Suite
Matrix --> Suite
Benchmark --> Suite
Lineage --> Suite
Dbt --> Suite
Registry --> Suite
Suite --> Release
The evidence lane has the same architectural rule as runtime code: commands are thin adapters, services own business rules, and reusable parsing/checksum logic is isolated in focused helper classes.
Phase 2 taxonomy cleanup status¶
The current cleanup direction is protocol-first, adapter-specific implementation second. Public modules may remain as compatibility facades, but new business logic should live in focused implementation modules.
Phase 2 cleanup status:
dpone.manifest.validation,dpone.manifest.migrateanddpone.manifest.batch_compilerare thin compatibility facades over focused validation, migration and batch-compilation modules.dpone.runtime.artifactsis a vendor-neutral facade over artifact models, row/file/cloud artifacts and staging helpers.dpone.commands.registryis split into focused registry sections for core, ops, schema/CDC, manifest, DAG and docs commands.- API providers, BigQuery, Postgres CDC, ClickHouse sink, MSSQL strategies and Postgres base strategies keep public class names but now delegate through implementation modules that can be split further without breaking imports.
- Module-size hard debt is closed:
docs/module_size_baseline.jsonhas zero allowlisted entries above the hard 600 LOC gate. Warning-level modules above 450 LOC remain visible in quality metrics and should be split before gaining substantial new behavior.
New code rules:
- Manifest facades should re-export only; do not add validation or migration logic back into the facade modules.
- Artifact core must not import vendor connectors, sources or sinks.
- Generic reconciliation and API runtime core must depend on protocols/facades, not concrete BigQuery/Postgres/MSSQL/ClickHouse/Kafka/provider adapters.
- Sink strategy packages must not import unrelated concrete sink packages.
- Command registries should stay grouped by UX area; new command families get a
focused registry module instead of expanding
dpone.commands.registry.
Stable APIs and extension seams¶
- Canonical imports for new code live under
dpone.manifest.*,dpone.dag.*,dpone.runtime.*,dpone.contracts.*,dpone.ports.*,dpone.adapters.*. - Compatibility shims such as
dpone.source,dpone.sink,dpone.lib.*,dpone.core.*,dpone.yaml_config_handlerstay as backward-compatible re-export layers only. - New Pull API providers plug in through
dpone.contracts.api_sources,dpone.runtime.api_registry, a runtime connector, a runtime source and provider-specific strategies/resources. - New database-backed integrations plug in through
dpone.runtime.credentials.*, connector creation in factory/bootstrap, source/sink runtime classes, state backends and schema-evolution adapters. - Kafka integrations plug in as bounded batch sources/sinks through
dpone.runtime.kafka.*,KafkaSource,KafkaSink,KafkaOffsetStateand optional Schema Registry codecs. - CDC replay and idempotency plug in through
dpone.runtime.cdc.identityanddpone.readiness.cdc_replay; readers stay focused on bounded extraction, while replay planning and offset commit safety stay credential-free and testable. - Object storage staging plugs in through
dpone.storage.*anddpone.staging.object_storage; source exports, sink loaders, and certification jobs share one manifest and checksum contract instead of provider-specific dictionaries. - Community connector packages are generated through
dpone.connector_sdk.*; the SDK is control-plane only and must not be imported by runtime source/sink execution paths. - Strategy intelligence plugs in through
dpone.strategy_intelligence.*; it is credential-free and side-effect free, so CLI, docs, Studio, and future runtime optimizers can share one explainable decision contract. - Runtime artifacts are the boundary between extraction and loading. Sources should produce
InMemoryRowsArtifact,StreamingRowsArtifact,FileExportArtifact,PartitionedFileExportArtifactorInternalQueryArtifactrather than calling sinks directly.
For an implementation-oriented checklist, see Developer integrations runbook.
Layers¶
- dpone/cli/ – thin CLI entrypoint + argparse wiring
- dpone/commands/ – CLI commands as OOP objects (no business logic)
- dpone/cli_render/ – presentation layer (text/Markdown rendering for CLI outputs)
- dpone/services/ – use-cases / application services (orchestrate domain + adapters)
- dpone/services/dag/views/ – typed DAG CLI view-models / JSON payload DTOs between commands and renderers
- dpone/app/ – composition root (DI) + settings/logging
- dpone/ports/ – Protocols / abstract interfaces (filesystem, yaml codec, etc.)
- dpone/adapters/ – concrete implementations (local fs, pyyaml, etc.)
Feature slices:
- dpone/manifest/ – Variant C manifests (compile/validate/explain)
- dpone/dag/ – DAG dependency model (graph builder, edge semantics, explain, report)
- dpone/runtime/ – execution runtime (ETL/sources/sinks/connectors/state)
Current migration status¶
- ✅ CLI refactoring started: commands + thin entrypoint.
- ✅ DAG explain/report commands now use typed view-models (
dpone.services.dag.views) instead of building JSON payloads ad-hoc in commands. - ✅ DAG subsystem renamed to
dpone.dag. dpone.yaml_config_handleris a deprecated shim kept for backward compatibility.- ✅ Dependency graph loading/building split into smaller DAG modules:
dpone.dag.node_registry– node indexing and uniquenessdpone.dag.graph_relationships– dependency edge constructiondpone.dag.graph_algorithms– topo sort / cycle detectiondpone.dag.task_group_index– lazy task-group to manifest indexdpone.dag.manifest_chain_loader– recursive manifest loading- ✅ Unified dependency semantics extracted into
dpone.dag.edge_resolver: - one source of truth for
depends_on/group/selector/file-all/ stem fallback - reused by Airflow DAG building, dependency graph building and explain/report tools
- ✅ Manifest explain/provenance split into focused modules:
dpone.manifest.explain_models– result/why dataclassesdpone.manifest.explain_trace– batch trace + provenance constructiondpone.manifest.explain_merge– diff/patch/origin-aware merge helpersdpone.manifest.explain_why–--whyreasoning + patch suggestionsdpone.manifest.explainremains a thin public facade- ✅ Manifest CLI commands no longer go through
dpone.cli.legacy: - shared manifest loading helpers live in
dpone.services.manifest.load_context - typed manifest view-models live in
dpone.services.manifest.views - text rendering moved to
dpone.cli_render.manifest.* - ✅
dpone.cli.legacyis now only a deprecated compatibility shim: - canonical entrypoint:
dpone.cli.main/dpone - canonical parser wiring:
dpone.cli.parser+dpone.commands.* - CLI help/parsing happens before
AppContextconstruction, so--helpstays lightweight - ✅ Runtime packages moved under
dpone.runtime.*. - Legacy import paths are kept as deprecated shims:
dpone.source→dpone.runtime.sourcesdpone.sink→dpone.runtime.sinksdpone.lib.connectors→dpone.runtime.connectorsdpone.state→dpone.runtime.statedpone.credentials→dpone.runtime.credentials- etc.
- ✅ Runtime ETL orchestration split into focused collaborators:
dpone.runtime.etl.processornow keeps orchestration onlydpone.runtime.etl.load_config_runtimeisolates runtimeLoadConfigmutation/enrichmentdpone.runtime.etl.run_state_trackerowns run-state persistence lifecycledpone.runtime.etl.reconciliation_serviceowns reconciliation + tech-connector lookupdpone.runtime.state.modelsexposesRunState/RunStateStatuswithout BigQuery imports- ✅ BigQuery sink runtime split into focused collaborators:
dpone.runtime.sinks.strategies.bigquery.bigquery_baseis now an orchestration-only facadetarget_table_managerowns target table creation, labels/description and technical columnsdml_helperowns JSON-aware SELECT generation, DML execution and lookback cleanup SQLexchange_loggerowns progress/evidence logging for exchange/full-refresh flowspartition_validation_serviceowns ClickHouse partition validation after BigQuery loads- ✅ PostgreSQL sink runtime split into focused collaborators:
dpone.runtime.sinks.strategies.postgres.postgres_baseis now an orchestration-only facadetarget_table_managerowns target table creation + technical columnsfile_export_loaderowns COPY/exchange/truncate flows forFileExportArtifactinternal_query_loaderowns CTAS/exchange workflow forInternalQueryArtifactstaging_sql_helperowns staging->target SQL helpers, typed select and column-type cache- ✅ Staging managers split into focused collaborators:
dpone.runtime.sinks.stagingis now a thin facadedpone.runtime.sinks.staging_managers.postgresowns PostgreSQL staging table lifecycle and COPY helpersdpone.runtime.sinks.staging_managers.bigqueryowns BigQuery staging lifecycle and row/query insertion- legacy
postgres_staging_manager/bigquery_staging_managermodules are compatibility shims only bigquery_staging_file_loaderowns local-file → BigQuery staging routing (direct load vs GCS)bigquery_staging_gcs_loaderowns native GCS → BigQuery staging flows and partition row tracking- ✅ ClickHouse connector split into focused collaborators:
dpone.runtime.connectors.clickhouseis now a thin facadeclickhouse_query_opsowns low-level query execution, streaming and CSV importclickhouse_gcs_exportowns native GCS export SQL and HMAC-based export flowclickhouse_partitioningowns date parsing, partition generation/discovery and row countingclickhouse_incremental_exportowns cleanup, partition query shaping and incremental GCS export orchestration- ✅ MSSQL is a first-class source, sink and state backend:
dpone.runtime.connectors.mssqlowns SQL Server connectivity and metadata helpersdpone.runtime.connectors.mssql_bulkownsbcpcommand generation and bulk file flowsdpone.runtime.sources.mssqlanddpone.runtime.sinks.mssqlexpose runtime source/sink contractsdpone.runtime.state.mssqlsupports run, XMin, Kafka and CDC state persistence- ✅ Kafka is a first-class bounded batch source/sink:
dpone.runtime.connectors.kafkaowns producers, consumers, admin clients and Schema Registry clientsdpone.runtime.kafka.*owns codecs, envelopes, keys, offset planning and delivery aggregationdpone.runtime.sources.kafkaanddpone.runtime.sinks.kafkaexpose source/sink runtime contracts- ✅ Generic REST APIs are first-class Pull API sources:
dpone.runtime.connectors.api.restowns HTTP/auth/pagination mechanicsdpone.runtime.sources.api.restemits streaming row artifacts for database and Kafka sinks- ✅ Schema evolution is centralized:
dpone.runtime.schema_evolutioncompares source/target schemas, renders dialect DDL and maps generated__dpone__nc__*columns- database sinks apply safe evolution before staging/final load
- ✅ Reconciliation and physical deletes are staging-first:
- MSSQL/Postgres/ClickHouse soft-delete handlers use staging/shadow flows and avoid row-by-row target mutations
- ClickHouse delete semantics avoid direct mutation-heavy update paths where table-engine alternatives are safer
- ✅ CDC and offset state are explicit:
- Postgres logical decoding, MSSQL CDC/Change Tracking contracts and Kafka offsets share typed state models
- state backends are selected explicitly through runtime configuration
- ✅ Managed-like UX commands are layered over services:
doctor,init,plan,run-report,state,connectors,perfandstudiocommands call reusable services rather than duplicating runtime logic- ✅ Documentation is now published as a strict GitHub Pages site:
mkdocs.ymldefines curated navigation.github/workflows/pages.ymlbuilds withmkdocs build --strict- source -> sink guides, type mappings, load strategies and Postgres XMin runbooks are first-class docs
Principles¶
- KISS: prefer small modules with explicit dependencies.
- DI: construct dependencies only in
dpone.app.context.AppContext. - SOLID:
- SRP: each module does one thing
- OCP: add new sources/sinks/commands without modifying core
- ISP: minimal ports (Protocol) with small surface area
- DIP: services depend on ports, not adapters
Import rules (to keep coupling under control)¶
These are the rules we follow during refactoring, and they are now checked automatically via dpone docs check-import-rules and pytest.
Vendor-specific implementations are allowed, but they must live behind adapter packages and protocols/facades; generic core modules must not import BigQuery, ClickHouse, MSSQL, Postgres, Kafka, or API-provider adapters directly.
We also track coarse layer/slice trends via dpone docs check-layer-metrics against docs/layer_metrics_baseline.json. Architecture-fitness drift is checked via dpone docs check-architecture-fitness, which flags high fan-out modules, broad facade dependencies and high-responsibility classes before they become god modules.
Module LOC/SLOC debt is tracked via dpone docs check-module-size against docs/module_size_baseline.json.
Docs/readme links are checked via dpone docs check-docs, compatibility/deprecation policy is validated via dpone docs check-compatibility against docs/compatibility_registry.yaml, and auto-generated docs are kept in sync via dpone docs update-cli-reference --check and dpone docs update-deprecation-roadmap --check.
- CLI/manifest/DAG analysis utilities must remain usable without optional runtime deps.
- Heavy symbols must be imported lazily.
dpone.commands.*must not contain business logic.dpone.runtime.*may depend ondpone.manifest.*anddpone.dag.*as inputs.- Deprecated shim imports (
dpone.source,dpone.sink,dpone.etl,dpone.yaml_config_handler, etc.) are forbidden outside shim packages themselves.
See also: Import rules.
Notes about dpone.dag.config¶
dpone.dag.config is now a thin public facade. The former monolithic module was split into focused parts:
dpone.dag.config_models– publicETLProcessConfigdataclassdpone.dag.process_config_parser– compiled dict -> process modeldpone.dag.load_config_builder–source/sink->LoadConfigdpone.dag.dependency_parser–depends_onnormalizationdpone.dag.config_refs–<file>#<selector>helpers + manifest-loader bridge
dpone.dag.config.ETLProcessConfig remains the canonical import path for backward compatibility and re-exports these smaller building blocks.
ETLProcessConfig is now a pure parsed config model for the DAG/manifest layers:
- It parses YAML / compiled manifests into normalized configs.
- In metadata-only mode it stays fully pure (no runtime objects).
- In execution mode it requests runtime bindings through the
dpone.ports.runtime_hydratorport.
Runtime creation moved to dpone.runtime.bootstrap:
DefaultRuntimeHydratorcreates sources/sinks/logger/state storages.DefaultProcessRunnerexecutesETLProcessthroughETLProcessor.
This means dpone.dag/* and dpone.manifest/* contain no direct imports from dpone.runtime.*, while runtime execution remains available through lazy port registration.
Notes about dpone.manifest.explain¶
dpone.manifest.explain is now a thin facade over smaller modules. This keeps the public API stable (explain_manifest, explain_why, compute_deep_patch) while making it easier to evolve provenance, patches and why-debugging independently.
A small but important compiler safety fix was made alongside this split: each compiled table config is now isolated via deepcopy, preventing cross-table mutation when source/sink shells and naming templates are injected during batch compilation.
Architecture metrics¶
Developer metrics now include a coarse-grained layer / slice architecture view in addition to module-level LOC and import coupling. This helps track whether refactoring is actually reducing cross-layer traffic (commands -> services -> manifest/dag, manifest/dag -> ports, etc.) rather than merely moving lines around.
See also: Quality metrics.
Runtime coupling cleanup¶
The canonical runtime code now depends on dpone.contracts.*, dpone.runtime.artifacts and dpone.runtime.support.* instead of importing legacy dpone.core.* / dpone.lib.* helpers directly. Legacy core/* and lib/* modules remain as deprecated shims for backward compatibility.
API runtime registry¶
dpone.runtime.api_registry is the canonical registration point for runtime API source api_type values.
It stores metadata, defaults, and lazy wiring for providers such as omnidesk, appsflyer, mindbox, and cbr.
The DAG and manifest layer uses dpone.contracts.api_sources for compatible default derivation without importing runtime modules.
Rollout bundles¶
Concrete provider-specific rollout bundles live under deploy/argo/<provider>/ and are documented in matching docs/*_ROLLOUT.md pages.
Type contracts and physical design layer¶
dpone keeps type detection and physical DDL planning split into small, testable layers:
dpone.type_systemprofiles source metadata and sampled rows, then produces portable logical type decisions with confidence and provenance.dpone.readiness.schema_contractsowns user-declared logical column contracts and enforcement modes.dpone.readiness.target_type_resolversmaps logical columns to concrete MSSQL, PostgreSQL, ClickHouse, BigQuery, or Kafka types.dpone.readiness.physical_designrenders target-specific DDL for new tables and governed physical changes.dpone.commands.schema_plan_cmdexposes the layer throughdpone schema inferanddpone schema physical-plan.
The runtime rule is the same as online schema evolution: new target table design may be fully planned from configuration, while existing-table DDL must go through governance, risk classification, and approval for blocking changes.