CI/CD failure runbooks

Use this page when a GitHub Actions run is red or stuck. Start by identifying the workflow, job, and exact failing step. Do not treat all red checks the same: default PR CI, docs deploy, release publishing, manual matrix, and scheduled certification have different recovery paths.

Triage flow

flowchart TD
    A["A check is red"] --> B["Identify workflow and job"]
    B --> C{"Default PR gate?"}
    C -- yes --> D["Reproduce local command"]
    C -- no --> E{"Docs, release, security, or manual gate?"}
    E -- docs --> F["Run mkdocs build --strict"]
    E -- release --> G["Run uv build and twine check"]
    E -- security --> H["Inspect finding; rotate secrets if needed"]
    E -- manual --> I["Re-run focused marker/case with artifacts"]
    D --> J["Fix code/test/docs"]
    F --> J
    G --> J
    H --> J
    I --> J
    J --> K["Update runbook if failure pattern is new"]
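
The first branch of the flow can be sketched as a lookup from workflow file to local reproduction commands. This is a minimal illustration, not repository code: the `FIRST_COMMANDS` table and the `first_commands` helper are hypothetical names, and the commands are copied from the sections of this page.

```python
from pathlib import PurePosixPath

# Workflow file name -> first local commands to try, copied from this runbook.
# The table itself is an illustrative helper, not part of the repository.
FIRST_COMMANDS = {
    "ci.yml": [
        "uv run ruff check .",
        "uv run ruff format --check .",
        "uv run mypy --config-file mypy.ini",
        'uv run pytest -m "not integration_live" --cov=src/dpone --cov-report=xml',
        "uv build",
    ],
    "pages.yml": ["mkdocs build --strict"],
    "release.yml": ["uv build", "uv run twine check dist/*"],
}


def first_commands(workflow_path: str) -> list[str]:
    """Return local reproduction commands for a workflow file, or an empty list."""
    name = PurePosixPath(workflow_path).name
    return list(FIRST_COMMANDS.get(name, []))
```

An empty result means the workflow has its own section on this page and no single first command.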

CI quality failures

Applies to:

.github/workflows/ci.yml
Jobs named Quality checks (3.11), Quality checks (3.12)

Ruff lint fails

Reproduce:

uv run ruff check .

Fix:

Prefer small explicit code fixes over broad suppressions.
If a rule is noisy for a whole category, document the rule change in the PR.
Do not hide connector optional-import failures behind # noqa unless the lazy import contract is still covered by tests.

Ruff format fails

Reproduce:

uv run ruff format --check .

Fix:

uv run ruff format <changed-python-files>

Then re-run the format check.

Mypy fails

Reproduce:

uv run mypy --config-file mypy.ini

Fix:

Keep public models typed at boundaries.
Prefer Protocols and small adapters over Any spreading through runtime code.
If a third-party library lacks types, isolate the import in the connector adapter.
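
One way to apply the guidance above is to wrap the untyped client in a small adapter that satisfies a Protocol, so that `Any` stays inside the adapter. This is a sketch under assumptions: `RowSource`, `UntypedClientAdapter`, `count_rows`, and the client's `execute` method are hypothetical names, not the project's API.

```python
from typing import Any, Iterable, Protocol


class RowSource(Protocol):
    """Typed boundary that runtime code depends on (hypothetical name)."""

    def fetch_rows(self, query: str) -> list[dict[str, object]]: ...


class UntypedClientAdapter:
    """Wraps an untyped third-party client so Any stays inside this class."""

    def __init__(self, client: Any) -> None:
        self._client = client  # the only place where Any is accepted

    def fetch_rows(self, query: str) -> list[dict[str, object]]:
        # `execute` is an assumed method on the wrapped client.
        raw: Iterable[Any] = self._client.execute(query)
        return [dict(row) for row in raw]


def count_rows(source: RowSource, query: str) -> int:
    """Runtime code sees only the typed Protocol, never the vendor client."""
    return len(source.fetch_rows(query))
```

Because `RowSource` is structural, tests can pass any object with a matching `fetch_rows` method without importing the vendor library.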

Pytest fails

Reproduce the full default test step:

uv run pytest -m "not integration_live" --cov=src/dpone --cov-report=xml

Focused reproduction:

uv run pytest path/to/test_file.py::test_name -q

Fix:

If the failure is a docs contract, update docs and code together.
If the failure is optional import safety, keep dependency imports lazy.
If the failure is integration marker skip behavior, check marker/env docs before changing runtime behavior.
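
For the optional import safety case, a lazy import can be sketched as below. The helper name `require_optional` and the wording of the error message are assumptions for illustration; the point is only that the dependency is imported when a connector is used, not when the package is imported.

```python
import importlib
from types import ModuleType


def require_optional(module_name: str, extra: str) -> ModuleType:
    """Import an optional dependency only when a connector actually needs it."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        # Re-raise with an actionable hint; the original error stays chained.
        raise ImportError(
            f"{module_name!r} is required for this connector; "
            f"install the {extra!r} extra to enable it."
        ) from exc
```

Calling this inside the connector's constructor or first method keeps the package importable on installations that lack the extra.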

Coverage fails

The default gate fails when repository coverage drops below the configured minimum.

Fix:

Add focused tests for new branches or public contracts.
Avoid deleting coverage expectations just to pass CI.
For broad generated docs changes, coverage should not change; investigate accidental runtime edits.

Package build fails

Reproduce:

uv build

Fix:

Check pyproject.toml metadata and package include rules.
Check optional extras for invalid dependency names or private indexes.
Run uv run twine check dist/* after build metadata changes.

PostgreSQL XMin integration failures

Applies to the postgres-xmin job in .github/workflows/ci.yml.

Reproduce against a local Postgres service:

DPONE_RUN_INTEGRATION=1 \
DPONE_IT_PG_HOST=127.0.0.1 \
DPONE_IT_PG_PORT=5432 \
DPONE_IT_PG_DATABASE=dpone_it \
DPONE_IT_PG_USER=dpone \
DPONE_IT_PG_PASSWORD=dpone \
uv run pytest -m integration_postgres_xmin tests/integration/postgres -q

Common causes:

Postgres service is not healthy yet.
XMin strategy selector changed without updating tests/docs.
State persistence changed and the test can no longer resume from the expected XMin state.
Physical delete expectations were added without reconciliation or CDC behavior.

Fix:

Keep XMin Postgres-only; non-Postgres sources must fail fast when XMin is explicitly selected.
Preserve state transition order: extract, load, quality/reconciliation, then state commit.
Update the Postgres XMin documentation page when algorithm behavior changes.
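
The two invariants above, fail fast for non-Postgres sources and commit state last, can be sketched as follows. This is an illustration under assumptions: the function names, the `"xmin"` and `"postgres"` identifiers, and the callable-based step signatures are hypothetical and do not describe the project's runtime.

```python
from typing import Callable


def validate_strategy(source_type: str, strategy: str) -> None:
    """Fail fast when XMin is explicitly selected for a non-Postgres source."""
    if strategy == "xmin" and source_type != "postgres":
        raise ValueError(f"xmin strategy is Postgres-only, got source {source_type!r}")


def run_sync(
    extract: Callable[[], list],
    load: Callable[[list], None],
    reconcile: Callable[[list], bool],
    commit_state: Callable[[], None],
) -> bool:
    """Run the steps in order; commit state only after every earlier step passes."""
    rows = extract()
    load(rows)
    if not reconcile(rows):
        return False  # state is left untouched, so the next run can retry
    commit_state()
    return True
```

If state were committed before reconciliation, a failed run would advance the stored XMin and a resume could skip rows.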

Docs and GitHub Pages failures

Applies to .github/workflows/pages.yml.

Reproduce:

python -m pip install -r docs/requirements.txt
mkdocs build --strict

Common causes:

Broken relative link.
File added but not linked from nav or documentation index.
Mermaid fence config broken.
Markdown heading/link mismatch.
MkDocs dependency drift.

Fix:

Add new public pages to the Documentation index, mkdocs.yml, or an existing section index.
Keep Mermaid as fenced blocks using ```mermaid.
Do not use unsafe YAML tags in mkdocs.yml; pre-commit check-yaml must pass.
If GitHub Pages deploy succeeds but site content is old, check that the docs workflow completed on master and Pages source is GitHub Actions.

Release and PyPI failures

Applies to .github/workflows/release.yml.

PyPI upload succeeds but installers cannot see the version

Symptoms:

https://pypi.org/pypi/dpone/X.Y.Z/json returns the uploaded wheel/sdist.
https://pypi.org/simple/dpone/ or https://pypi.org/pypi/dpone/json still lists an older version.
uv pip install dpone==X.Y.Z or pip install dpone==X.Y.Z says no matching version exists.

Actions:

Run the strict visibility check:

uv run python tools/pypi_release_smoke.py \
  --package dpone \
  --version X.Y.Z \
  --install-smoke \
  --timeout-seconds 900

If only version_json passes, treat it as PyPI Simple Repository API/index lag or a Warehouse health issue. Do not publish a GitHub Release announcement that claims the package is installable.
If a hotfix container is required while Simple API is stale, build the runtime image from the direct wheel URL shown in the version-specific PyPI JSON or from the GitHub Release asset.
Re-run the release workflow with skip-existing after the Simple API exposes the version. It will skip already uploaded files and create/refresh release evidence once resolver install passes.
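
The distinction between "uploaded" and "installable" can be sketched as a pure function over responses that have already been fetched. This is not the logic of tools/pypi_release_smoke.py; `classify_visibility` and its return labels are hypothetical. It assumes the project JSON exposes a `releases` mapping and that the Simple API was requested in its JSON form (Accept: application/vnd.pypi.simple.v1+json), which includes a `versions` list.

```python
def classify_visibility(
    version: str,
    version_json_found: bool,
    project_json: dict,
    simple_json: dict,
) -> str:
    """Classify release visibility from already-fetched PyPI responses."""
    # https://pypi.org/pypi/<package>/json lists versions under "releases".
    in_project = version in project_json.get("releases", {})
    # The JSON Simple API response carries a project-level "versions" list.
    in_simple = version in simple_json.get("versions", [])
    if version_json_found and in_project and in_simple:
        return "visible"
    if version_json_found:
        return "index_lag"  # uploaded, but resolvers may not see it yet
    return "not_uploaded"
```

An `index_lag` result corresponds to the case above where only the version-specific JSON check passes.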

Build and metadata failures

Reproduce build and metadata checks:

uv sync --all-extras
uv build
uv run twine check dist/*

Common causes:

Tag does not match vX.Y.Z.
Version in pyproject.toml was not updated before tagging.
Trusted Publishing pending publisher is misconfigured.
Manual token fallback requested but PYPI_API_TOKEN is missing or expired.
PyPI project already has the same version.

Fix:

Prefer Trusted Publishing with environment pypi.
Check PyPI project owner/repository/workflow filename/environment settings.
Never reuse a compromised token.
If a publish partially succeeded, do not delete/reuse the version; cut a new patch version.

Secret scan failures

Applies to .github/workflows/secret-scan.yml.

If TruffleHog reports a verified secret:

Treat it as compromised.
Revoke and rotate the credential before further public release work.
Remove the secret from the file, test artifact, or docs page.
If the secret is in public git history, coordinate history rewrite separately and document the incident.
Add a regression check if the pattern can recur.

Do not silence verified secret findings to make CI green.

CodeQL failures

Applies to .github/workflows/codeql.yml.

Fix:

Open the CodeQL alert and identify the data/control flow.
Prefer input validation, safe APIs, and explicit escaping over suppressions.
Add a regression test for the risky behavior when practical.
If the alert is a false positive, document why and keep the suppression narrow.

OSSF Scorecard failures

Applies to .github/workflows/scorecard.yml.

Common causes:

Branch protection was changed.
Token permissions are too broad.
Security policy or dependency update posture regressed.

Fix:

Keep workflow permissions least-privilege.
Keep SECURITY.md and dependency automation current.
Treat Scorecard as supply-chain posture evidence; do not block emergency hotfixes solely on advisory score drift, but create a follow-up issue.

Source sink integration matrix failures

Applies to .github/workflows/integration-matrix.yml and tests/integration/matrix/.

Reproduce all credential-free contracts:

DPONE_RUN_INTEGRATION=1 \
DPONE_RUN_INTEGRATION_MATRIX=1 \
DPONE_MATRIX_RUN_MODE=mock_contract \
DPONE_MATRIX_ARTIFACT_DIR=test_artifacts/integration_matrix/mock_contract_latest \
uv run pytest -m integration_matrix tests/integration/matrix -q

Reproduce local/mock layer:

DPONE_RUN_INTEGRATION=1 \
DPONE_RUN_INTEGRATION_MATRIX=1 \
DPONE_MATRIX_RUN_MODE=mock_local \
DPONE_MATRIX_ARTIFACT_DIR=test_artifacts/integration_matrix/mock_local_latest \
uv run pytest -m integration_matrix_mock tests/integration/matrix -q

Focused case (set this variable alongside the ones in either command above):

DPONE_MATRIX_CASE_ID=postgres_to_mssql__incremental_merge

Common causes:

A source -> sink guide is missing.
A new strategy is not registered in dpone.integration_matrix.
The behavior artifact count/checksum changed without docs updates.
mock_local expectations changed for BigQuery documented-contract skips.

Fix:

Update the canonical matrix and docs in the same PR.
Keep default mock volume documented: 10,000 rows, 20% changed, 5% deletes, 120 wide columns.
Use artifacts under test_artifacts/integration_matrix/ to compare expected/actual behavior.
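
The documented default volume implies concrete row counts that are useful when comparing artifact counts. The sketch below works through that arithmetic, assuming both percentages apply to the base row count; the `MockVolume` class is illustrative and not part of the matrix code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MockVolume:
    """Default mock volume from this runbook (class name is hypothetical)."""

    rows: int = 10_000
    changed_pct: int = 20
    deleted_pct: int = 5
    wide_columns: int = 120

    @property
    def changed_rows(self) -> int:
        # 20% of 10,000 rows
        return self.rows * self.changed_pct // 100

    @property
    def deleted_rows(self) -> int:
        # 5% of 10,000 rows
        return self.rows * self.deleted_pct // 100
```

With the defaults, `MockVolume().changed_rows` is 2000 and `MockVolume().deleted_rows` is 500.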

Connector certification failures

Applies to .github/workflows/connector-certification.yml.

Offline certification reproduction:

uv run pytest \
  tests/test_mssql_manifest_examples.py \
  tests/test_runtime_mssql_contracts.py \
  tests/test_runtime_kafka_contracts.py \
  tests/test_runtime_rest_and_clickhouse_contracts.py \
  tests/test_runtime_schema_evolution_contracts.py \
  tests/test_runtime_state_and_reconciliation_contracts.py \
  tests/test_runtime_cdc_readers.py \
  tests/test_runtime_parallel_partitioning.py \
  tests/test_managed_ux_contracts.py \
  -q

Fix:

If capability metadata changed, update the Connector certification page.
If a local service fails, inspect docker compose -f docker/docker-compose.integration.yml ps and service logs.
If local MSSQL tests fail before connecting, verify ODBC Driver 18, bcp -v, and /opt/mssql-tools18/bin on the runner PATH.
If local MSSQL login succeeds but dpone_it cannot open, verify the Prepare local MSSQL database step and the sqlcmd database materialization log.
If Kafka tests fail before connecting, verify that both kafka and schema-registry services were started by the workflow.
If vendor-live fails, first verify that the job only ran provider/API directories, then verify credentials and provider availability before changing runtime code.
Upload or update certification artifacts for release-impacting connector changes.
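
The native tooling checks above can be scripted before rerunning the job. This sketch only resolves `bcp` and `sqlcmd` on a search path and checks whether /opt/mssql-tools18/bin is listed; it does not verify ODBC Driver 18. The function names are hypothetical.

```python
import os
import shutil
from typing import Optional

REQUIRED_TOOLS = ("bcp", "sqlcmd")
MSSQL_TOOLS_DIR = "/opt/mssql-tools18/bin"


def check_native_tools(path: Optional[str] = None) -> dict:
    """Map each required tool to its resolved location, or None when missing."""
    search_path = os.environ.get("PATH", "") if path is None else path
    return {tool: shutil.which(tool, path=search_path) for tool in REQUIRED_TOOLS}


def tools_dir_on_path(path: Optional[str] = None) -> bool:
    """Report whether the MSSQL tools directory is an entry of the search path."""
    search_path = os.environ.get("PATH", "") if path is None else path
    return MSSQL_TOOLS_DIR in search_path.split(os.pathsep)
```

A `None` value for a tool points at a runner image or PATH problem rather than a connector bug.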

Live certification failures

Use this runbook when .github/workflows/live-certification.yml is red.

Open the first failing step. Later evidence-pack steps often fail only because an upstream artifact is missing.
If service startup fails, run docker compose -f docker/docker-compose.integration.yml ps and inspect dpone-it-postgres, dpone-it-mssql, dpone-it-clickhouse, dpone-it-kafka, and dpone-it-schema-registry logs.
If native tooling fails, verify bcp -v, sqlcmd -?, ODBC Driver 18, and /opt/mssql-tools18/bin on the runner PATH.
If mssql_stress.py fails during Postgres -> MSSQL export, inspect postgres_to_mssql.source_export and partition bounds in the benchmark JSON.
If mssql_stress.py fails during MSSQL load/finalize, inspect postgres_to_mssql.target_load_finalize, SQL Server error files, and bulk.bcp.* settings.
If the optional native benchmark suite is red, open postgres_mssql_native_benchmark_summary.md first, then the specific scenario JSON under native_benchmark_suite/.
For 1M/10M local failures, distinguish infrastructure pressure from runtime bugs: check Docker memory, temp disk, SQL Server transaction log growth, and ClickHouse part pressure before changing code.
Keep release promotion blocked until release-evidence, evidence-chain, and pre-release artifacts are present and passed.

Stuck or queued workflows

Common causes:

GitHub Actions runner capacity.
Environment protection waiting for approval.
Pages deployment concurrency.
Long local service startup.

Fix:

Check workflow concurrency groups before canceling.
Cancel superseded runs only when a newer commit contains the same changes.
Do not cancel release publishing after upload has started unless you have verified PyPI state.

Orchestration maturity failures

Use this runbook when .github/workflows/orchestration-maturity.yml is red.

Open the failing step first: orchestration tests, docs link checks, or strict MkDocs build.
For test failures, run uv run pytest tests/test_orchestration.py -q locally and inspect the specific blocker code.
For lock failures, inspect .dpone/locks/<key>.lock.json and confirm no active scheduler job owns it.
For resume policy failures, inspect .dpone/orchestration-state/<run_id>.job_state.json before changing --resume-policy.
For scheduler snippet failures, confirm snippets call dpone orchestrate run, not bare dpone run.
Upload orchestration-maturity-report with the PR or release evidence after the gate is green.

Observability maturity failures

Use this runbook when .github/workflows/observability-maturity.yml is red.

Open observability-maturity-report and identify whether tests, metrics export, SLO smoke, or artifact indexing failed.
Reproduce focused tests with uv run pytest tests/test_observability.py -q.
If metrics.empty, run_report.missing, or run_report.invalid_json appears, inspect test_artifacts/observability/maturity/run_report.json.
If Prometheus output is malformed, inspect label names and values in the metrics export command; labels are sanitized but empty keys are invalid input.
If OpenTelemetry resource attributes are missing, confirm --resource-attr key=value flags are passed after metrics-export.
If SLO smoke is red, inspect slo_report.json and tune the synthetic objective only when the runtime metric contract is still correct.
If metrics_index.json or artifact_index.json checksum evidence is missing, re-run the export and index commands in order.
Upload the whole test_artifacts/observability/maturity/ directory after remediation.
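
When inspecting run_report.json by hand, the three blocker codes named above can be told apart with a few checks. This is a sketch, not the gate's implementation: `classify_run_report` is a hypothetical helper, and the top-level `metrics` key is an assumption about the report layout.

```python
import json
from pathlib import Path
from typing import Optional


def classify_run_report(path: Path) -> Optional[str]:
    """Return a blocker code for the run report, or None when it looks usable."""
    if not path.is_file():
        return "run_report.missing"
    try:
        report = json.loads(path.read_text(encoding="utf-8"))
    except json.JSONDecodeError:
        return "run_report.invalid_json"
    # Assumed layout: a JSON object with a non-empty "metrics" entry.
    if not isinstance(report, dict) or not report.get("metrics"):
        return "metrics.empty"
    return None
```

For example, `classify_run_report(Path("test_artifacts/observability/maturity/run_report.json"))` distinguishes an export that never ran from one that ran but wrote no metrics.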

Full certification automation failures

Use this runbook when .github/workflows/full-certification.yml is red.

Open the failing step in order; downstream steps may be red only because an upstream artifact is missing.
If source_sink_matrix fails, re-run the focused case from test_artifacts/full_certification/matrix/certification_report.json.
If benchmark_baseline.not_passed appears, re-run the same profile before updating a baseline.
If lineage_report.missing appears, verify that run-registry produced a *__run_registry.json entry first.
If evidence_bundle.not_passed appears, inspect data contract rows and required evidence in ops_evidence_bundle.json.
If certification_suite is red, inspect blockers before changing workflow steps.
If evidence-chain-verify fails, treat it as release-blocking audit evidence and rebuild from the artifact index only after reviewing checksum drift.
Attach full-certification-report to release review or connector badge promotion evidence.

Production maturity failures

Workflow: .github/workflows/production-maturity.yml

Command to reproduce locally:

uv run dpone ops production-maturity \
  --release local-readiness \
  --output-dir test_artifacts/production_maturity/report \
  --artifact certification=PATH_TO_CERTIFICATION_JSON \
  --artifact cdc=PATH_TO_CDC_JSON \
  --artifact performance=PATH_TO_PERFORMANCE_JSON \
  --artifact security=PATH_TO_SECURITY_JSON \
  --artifact supply_chain=PATH_TO_SUPPLY_CHAIN_JSON \
  --artifact governance=PATH_TO_GOVERNANCE_JSON \
  --artifact docs=PATH_TO_DOCS_JSON

Recovery:

Open test_artifacts/production_maturity/report/production_maturity.md.
For *.missing, rerun or download the missing specialized workflow artifact.
For *.not_passed, fix the specialized workflow that produced the artifact; do not patch the aggregator to ignore the failure.
Rerun dpone ops production-maturity with the corrected artifact paths.
Keep release publishing blocked until all required blockers are gone.
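
The recovery rules above depend only on the blocker suffix, which can be sketched as a small mapping. The helper name `recovery_action` is hypothetical, and the sketch assumes blocker codes have the form `<artifact>.<status>` as in the `*.missing` and `*.not_passed` patterns.

```python
def recovery_action(blocker: str) -> str:
    """Map a blocker code such as 'cdc.missing' to the recovery step to take."""
    artifact, _, status = blocker.rpartition(".")
    if status == "missing":
        return f"rerun or download the {artifact} artifact"
    if status == "not_passed":
        return f"fix the workflow that produced the {artifact} artifact"
    return "inspect the report entry manually"
```

Neither branch edits the aggregator itself, which matches the rule that failures are fixed at the workflow that produced the artifact.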

Expected output is level: ga_ready for a releasable build. A release_candidate level is reviewable but not publishable without explicit acceptance of the remaining blockers.

Industrial readiness failures

Workflow: .github/workflows/industrial-readiness.yml

Command to reproduce locally:

uv run dpone ops industrial-readiness \
  --release local-industrial-readiness \
  --output-dir test_artifacts/industrial_readiness/report \
  --artifact local_matrix=PATH_TO_LOCAL_MATRIX_JSON \
  --artifact correctness=PATH_TO_CORRECTNESS_JSON \
  --artifact reliability=PATH_TO_RELIABILITY_JSON \
  --artifact performance_lab=PATH_TO_PERFORMANCE_LAB_JSON \
  --artifact ux=PATH_TO_UX_JSON \
  --artifact governance=PATH_TO_GOVERNANCE_JSON

Recovery:

Open test_artifacts/industrial_readiness/report/industrial_readiness.md.
Fix missing or failed specialized evidence first.
For local_matrix.case_missing:*, run the exact source -> sink -> strategy case named by the blocker.
For correctness blockers, inspect reconciliation, type fidelity, NULL/empty-string handling, and quarantine artifacts.
For reliability blockers, inspect locks, retries, resumability, idempotency, and state commit order.
Keep public release promotion blocked until the industrial readiness report is industrial_ready.
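
A local_matrix.case_missing blocker can be split into the source, sink, and strategy to rerun. This sketch assumes the text after the colon is a case ID shaped like the `postgres_to_mssql__incremental_merge` example used elsewhere on this page; `parse_case_missing` and `MatrixCase` are hypothetical names.

```python
from typing import NamedTuple, Optional

PREFIX = "local_matrix.case_missing:"


class MatrixCase(NamedTuple):
    source: str
    sink: str
    strategy: str


def parse_case_missing(blocker: str) -> Optional[MatrixCase]:
    """Split '<prefix><source>_to_<sink>__<strategy>' into its three parts."""
    if not blocker.startswith(PREFIX):
        return None
    case_id = blocker[len(PREFIX):]
    route, sep, strategy = case_id.partition("__")
    source, to, sink = route.partition("_to_")
    if not (sep and to and source and sink and strategy):
        return None  # unexpected shape; read the blocker by hand
    return MatrixCase(source, sink, strategy)
```

The parsed case ID is the same string that the integration matrix section passes through DPONE_MATRIX_CASE_ID.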

Route certification release failures

Workflow: .github/workflows/route-certification-release.yml

Command to reproduce locally:

uv run dpone ops route-certify-release \
  --release local-route-release \
  --profile oss_ci \
  --output-dir test_artifacts/route_certification_release \
  --route-bundle postgres_to_mssql__incremental_merge=PATH_TO_POSTGRES_MSSQL_BUNDLE \
  --route-bundle mssql_to_clickhouse__incremental_merge=PATH_TO_MSSQL_CLICKHOUSE_BUNDLE \
  --format json

Recovery:

Open test_artifacts/route_certification_release/route_certification_release.md.
For <route>.missing, regenerate the missing route_certification_bundle.json with dpone ops route-certify.
For <route>.not_certified, open that route bundle and fix upstream blockers before rerunning the release gate.
For <route>.profile_mismatch, regenerate the route bundle with the same profile requested by the release gate.
For <route>.route_mismatch, attach the bundle whose embedded route.case_id matches the CLI route key.
Keep release publishing blocked until route_certification_release.json has level: release_ready.
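
Before rerunning the gate, a bundle can be checked against the route key and profile it will be attached under. This is a sketch under assumptions: the `route.case_id` field comes from the text above, while the top-level `profile` field, the helper names, and the order in which blockers are reported are hypothetical.

```python
from typing import Optional


def parse_route_bundle_arg(arg: str) -> tuple:
    """Split a '--route-bundle' value of the form '<route key>=<path>'."""
    key, sep, path = arg.partition("=")
    if not (sep and key and path):
        raise ValueError(f"expected ROUTE=PATH, got {arg!r}")
    return key, path


def route_blocker(route_key: str, bundle: Optional[dict], requested_profile: str) -> Optional[str]:
    """Return the blocker code a bundle would likely trigger, or None."""
    if bundle is None:
        return f"{route_key}.missing"
    if bundle.get("route", {}).get("case_id") != route_key:
        return f"{route_key}.route_mismatch"
    if bundle.get("profile") != requested_profile:  # assumed field name
        return f"{route_key}.profile_mismatch"
    return None
```

A `None` result does not imply the route is certified; `<route>.not_certified` still depends on the upstream blockers inside the bundle.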

Route release finalize failures

Workflow: .github/workflows/route-release-finalize.yml

Command to reproduce locally:

uv run dpone ops route-release-finalize \
  --release local-route-release \
  --profile oss_ci \
  --bundle-root test_artifacts/route_certify \
  --history-dir test_artifacts/route_release_finalize/history \
  --output-dir test_artifacts/route_release_finalize \
  --format json

Recovery:

Open test_artifacts/route_release_finalize/route_release_finalizer.md.
For <route>.missing, regenerate the missing route_certification_bundle.json.
For <route>.stale, regenerate route evidence or explicitly approve a larger --max-age-hours.
For <route>.release_mismatch, regenerate the bundle with the finalizer release id.
For <route>.score_regression, compare the baseline route score and fix degraded upstream evidence.
Keep release publishing blocked until route_release_finalizer.json has level: final_ready.
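
The `<route>.stale` condition amounts to comparing a bundle's age with the allowed maximum. The sketch below shows that comparison only; how the finalizer reads the bundle timestamp and what its default `--max-age-hours` is are not stated here, so `is_stale` and its `generated_at` parameter are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def is_stale(
    generated_at: datetime,
    max_age_hours: float,
    now: Optional[datetime] = None,
) -> bool:
    """Return True when the evidence is older than the allowed age."""
    current = now if now is not None else datetime.now(timezone.utc)
    # Evidence exactly at the limit is treated as fresh in this sketch.
    return current - generated_at > timedelta(hours=max_age_hours)
```

Use timezone-aware timestamps on both sides; mixing naive and aware datetimes raises a TypeError on subtraction.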