CI/CD failure runbooks

Use this page when a GitHub Actions run is red or stuck. Start by identifying the workflow, job, and exact failing step. Do not treat all red checks the same: default PR CI, docs deploy, release publishing, manual matrix, and scheduled certification have different recovery paths.

Triage flow

flowchart TD
    A["A check is red"] --> B["Identify workflow and job"]
    B --> C{"Default PR gate?"}
    C -- yes --> D["Reproduce local command"]
    C -- no --> E{"Docs, release, security, or manual gate?"}
    E -- docs --> F["Run mkdocs build --strict"]
    E -- release --> G["Run uv build and twine check"]
    E -- security --> H["Inspect finding; rotate secrets if needed"]
    E -- manual --> I["Re-run focused marker/case with artifacts"]
    D --> J["Fix code/test/docs"]
    F --> J
    G --> J
    H --> J
    I --> J
    J --> K["Update runbook if failure pattern is new"]
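
The first branch of the flow can be sketched as a lookup from workflow file to the first local action. This is a minimal sketch: the function name first_repro_command is hypothetical, and the mapping only restates the workflow files and commands listed on this page.

```shell
# Hypothetical helper: print the first local action for a failing workflow file.
# The workflow names and commands are the ones listed in this runbook.
first_repro_command() {
  case "$1" in
    ci.yml)
      echo 'reproduce the failing step locally: ruff, mypy, pytest, or uv build' ;;
    pages.yml)
      echo 'mkdocs build --strict' ;;
    release.yml)
      echo 'uv build && uv tool run twine check dist/*' ;;
    secret-scan.yml|codeql.yml|scorecard.yml)
      echo 'inspect the finding; rotate secrets if needed' ;;
    *)
      echo 're-run the focused marker/case with artifacts' ;;
  esac
}
```

For example, first_repro_command pages.yml prints the strict MkDocs build command.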

CI quality failures

Applies to:

  • .github/workflows/ci.yml
  • Jobs named Quality checks (3.11), Quality checks (3.12)

Ruff lint fails

Reproduce:

uv run ruff check .

Fix:

  • Prefer small explicit code fixes over broad suppressions.
  • If a rule is noisy for a whole category, document the rule change in the PR.
  • Do not hide connector optional-import failures behind # noqa unless the lazy import contract is still covered by tests.

Ruff format fails

Reproduce:

uv run ruff format --check .

Fix:

uv run ruff format <changed-python-files>

Then re-run the format check.
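
One possible way to build the list of changed Python files is sketched below. It assumes the PR targets origin/master; the helper names only_python_files and format_changed_python are hypothetical and not part of the repository.

```shell
# Hypothetical helper: keep only paths ending in .py from a newline-separated list on stdin.
only_python_files() {
  grep -E '\.py$' || true
}

# Hypothetical wrapper: format the Python files changed relative to a base ref.
# Assumption: origin/master is the PR base; pass another ref as the first argument if not.
format_changed_python() {
  base="${1:-origin/master}"
  files=$(git diff --name-only --diff-filter=ACMR "$base"... | only_python_files)
  if [ -n "$files" ]; then
    # Word splitting is intentional; paths containing spaces are not handled in this sketch.
    uv run ruff format $files
  else
    echo "no changed Python files"
  fi
}
```

Calling format_changed_python with no arguments formats only the files that differ from origin/master, which keeps the diff limited to the PR's own changes.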

Mypy fails

Reproduce:

uv run mypy --config-file mypy.ini

Fix:

  • Keep public models typed at boundaries.
  • Prefer Protocols and small adapters over Any spreading through runtime code.
  • If a third-party library lacks types, isolate the import in the connector adapter.

Pytest fails

Reproduce the full default test step:

uv run pytest -m "not integration_live" --cov=src/dpone --cov-report=xml

Focused reproduction:

uv run pytest path/to/test_file.py::test_name -q

Fix:

  • If the failure is a docs contract, update docs and code together.
  • If the failure is optional import safety, keep dependency imports lazy.
  • If the failure is integration marker skip behavior, check marker/env docs before changing runtime behavior.

Coverage fails

The default gate requires repository coverage above the configured minimum.

Fix:

  • Add focused tests for new branches or public contracts.
  • Avoid deleting coverage expectations just to pass CI.
  • For broad generated docs changes, coverage should not change; investigate accidental runtime edits.
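
To see how far the run is from the gate, the coverage.xml written by --cov-report=xml can be read directly. The sketch below is hedged: the helper name coverage_meets_minimum is hypothetical, the minimum is passed in by the caller because the configured value is not stated on this page, and treating "equal to the minimum" as passing is an assumption.

```shell
# Hypothetical helper: compare the overall line rate in a Cobertura-style coverage.xml
# with a minimum percentage supplied by the caller.
coverage_meets_minimum() {
  report="$1"   # path to coverage.xml
  minimum="$2"  # required percentage, e.g. 85 (assumed; the real value lives in repo config)
  # The first line-rate attribute in the file belongs to the root <coverage> element.
  rate=$(grep -o 'line-rate="[^"]*"' "$report" | head -n 1 | cut -d'"' -f2)
  [ -n "$rate" ] || { echo "no line-rate found in $report" >&2; return 2; }
  # Exit status 0 when rate * 100 >= minimum, 1 otherwise.
  awk -v r="$rate" -v m="$minimum" 'BEGIN { exit !(r * 100 >= m) }'
}
```

Usage: coverage_meets_minimum coverage.xml 85 && echo "coverage ok".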

Package build fails

Reproduce:

uv build

Fix:

  • Check pyproject.toml metadata and package include rules.
  • Check optional extras for invalid dependency names or private indexes.
  • Run uv tool run twine check dist/* after build metadata changes.
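
Before running twine, it can help to confirm that the build produced both artifact types, since uv build creates a source distribution and a wheel by default. A minimal sketch follows; the helper name dist_has_sdist_and_wheel is hypothetical.

```shell
# Hypothetical helper: confirm a build output directory holds an sdist and a wheel.
dist_has_sdist_and_wheel() {
  dir="${1:-dist}"
  ls "$dir"/*.tar.gz >/dev/null 2>&1 || { echo "missing sdist (.tar.gz) in $dir" >&2; return 1; }
  ls "$dir"/*.whl >/dev/null 2>&1 || { echo "missing wheel (.whl) in $dir" >&2; return 1; }
}
```

Usage: uv build && dist_has_sdist_and_wheel dist && uv tool run twine check dist/*.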

PostgreSQL XMin integration failures

Applies to the postgres-xmin job in .github/workflows/ci.yml.

Reproduce against a local Postgres service:

DPONE_RUN_INTEGRATION=1 \
DPONE_IT_PG_HOST=127.0.0.1 \
DPONE_IT_PG_PORT=5432 \
DPONE_IT_PG_DATABASE=dpone_it \
DPONE_IT_PG_USER=dpone \
DPONE_IT_PG_PASSWORD=dpone \
uv run pytest -m integration_postgres_xmin tests/integration/postgres -q

Common causes:

  • Postgres service is not healthy yet.
  • XMin strategy selector changed without updating tests/docs.
  • State persistence changed and the test can no longer resume from the expected XMin state.
  • Physical delete expectations were added without reconciliation or CDC behavior.

Fix:

  • Keep XMin Postgres-only; non-Postgres sources must fail fast when XMin is explicitly selected.
  • Preserve state transition order: extract, load, quality/reconciliation, then state commit.
  • Update the Postgres XMin documentation page when algorithm behavior changes.
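
When the cause is a Postgres service that is not healthy yet, waiting for readiness before running the tests rules it out quickly. The sketch below uses a generic retry loop; the helper name retry_until_ok is hypothetical, and the attempt count and delay in the example are arbitrary choices, not values taken from the CI job.

```shell
# Hypothetical helper: run a command until it succeeds or the attempts run out.
# Usage: retry_until_ok <attempts> <delay-seconds> <command> [args...]
retry_until_ok() {
  attempts="$1"; delay="$2"; shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    "$@" && return 0
    i=$((i + 1))
    # Sleep only when another attempt will follow.
    [ "$i" -le "$attempts" ] && sleep "$delay"
  done
  return 1
}

# Example (connection values match the reproduction command above):
#   retry_until_ok 30 2 pg_isready -h 127.0.0.1 -p 5432 -U dpone -d dpone_it
```

If the loop gives up, inspect the service logs before touching XMin code.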

Docs and GitHub Pages failures

Applies to .github/workflows/pages.yml.

Reproduce:

python -m pip install -r docs/requirements.txt
mkdocs build --strict

Common causes:

  • Broken relative link.
  • File added but not linked from nav or documentation index.
  • Mermaid fence config broken.
  • Markdown heading/link mismatch.
  • MkDocs dependency drift.

Fix:

  • Add new public pages to Documentation index, mkdocs.yml, or an existing section index.
  • Keep Mermaid as fenced blocks using ```mermaid.
  • Do not use unsafe YAML tags in mkdocs.yml; pre-commit check-yaml must pass.
  • If GitHub Pages deploy succeeds but site content is old, check that the docs workflow completed on master and that the Pages source is set to GitHub Actions.
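
A quick way to spot a page that was added but never wired into the nav is to compare the files on disk with the paths mentioned in mkdocs.yml. This is a rough sketch: the helper name pages_missing_from_config is hypothetical, the docs/ directory layout is an assumption, and pages that are linked only from a section index will show up as false positives.

```shell
# Hypothetical helper: list Markdown pages under a docs directory whose relative path
# does not appear anywhere in the MkDocs config file.
pages_missing_from_config() {
  docs_dir="$1"; config="$2"
  find "$docs_dir" -type f -name '*.md' | sort | while IFS= read -r page; do
    rel="${page#"$docs_dir"/}"
    # Fixed-string match: the page counts as listed if its relative path occurs in the config.
    grep -qF -- "$rel" "$config" || echo "$rel"
  done
}
```

Usage: pages_missing_from_config docs mkdocs.yml. Treat the output as a list of candidates to review, not as a hard failure.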

Release and PyPI failures

Applies to .github/workflows/release.yml.

Reproduce build and metadata checks:

uv sync --all-extras
uv build
uv tool run twine check dist/*

Common causes:

  • Tag does not match vX.Y.Z.
  • Version in pyproject.toml was not updated before tagging.
  • Trusted Publishing pending publisher is misconfigured.
  • Manual token fallback requested but PYPI_API_TOKEN is missing or expired.
  • PyPI project already has the same version.

Fix:

  • Prefer Trusted Publishing with environment pypi.
  • Check PyPI project owner/repository/workflow filename/environment settings.
  • Never reuse a compromised token.
  • If a publish partially succeeded, do not delete or reuse the version; cut a new patch version.
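
The first two causes can be caught before pushing a tag. The check below is a sketch under assumptions: the helper name check_release_tag is hypothetical, and the caller supplies the version string rather than the script parsing pyproject.toml.

```shell
# Hypothetical pre-tag check: the tag must look like vX.Y.Z and match the package version.
check_release_tag() {
  tag="$1"; version="$2"
  if ! printf '%s\n' "$tag" | grep -Eq '^v[0-9]+\.[0-9]+\.[0-9]+$'; then
    echo "tag $tag does not match vX.Y.Z" >&2
    return 1
  fi
  if [ "${tag#v}" != "$version" ]; then
    echo "tag $tag does not match version $version" >&2
    return 1
  fi
}

# Example, assuming a static version = "..." line in pyproject.toml:
#   version=$(sed -n 's/^version *= *"\(.*\)"/\1/p' pyproject.toml | head -n 1)
#   check_release_tag v1.2.3 "$version"
```

A non-zero exit status means the tag should not be pushed yet.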

Secret scan failures

Applies to .github/workflows/secret-scan.yml.

If TruffleHog reports a verified secret:

  1. Treat it as compromised.
  2. Revoke and rotate the credential before further public release work.
  3. Remove the secret from the file, test artifact, or docs page.
  4. If the secret is in public git history, coordinate history rewrite separately and document the incident.
  5. Add a regression check if the pattern can recur.

Do not silence verified secret findings to make CI green.

CodeQL failures

Applies to .github/workflows/codeql.yml.

Fix:

  • Open the CodeQL alert and identify the data/control flow.
  • Prefer input validation, safe APIs, and explicit escaping over suppressions.
  • Add a regression test for the risky behavior when practical.
  • If the alert is a false positive, document why and keep the suppression narrow.

OSSF Scorecard failures

Applies to .github/workflows/scorecard.yml.

Common causes:

  • Branch protection was changed.
  • Token permissions are too broad.
  • Security policy or dependency update posture regressed.

Fix:

  • Keep workflow permissions least-privilege.
  • Keep SECURITY.md and dependency automation current.
  • Treat Scorecard as supply-chain posture evidence; do not block emergency hotfixes solely on advisory score drift, but create a follow-up issue.

Source sink integration matrix failures

Applies to .github/workflows/integration-matrix.yml and tests/integration/matrix/.

Reproduce all credential-free contracts:

DPONE_RUN_INTEGRATION=1 \
DPONE_RUN_INTEGRATION_MATRIX=1 \
DPONE_MATRIX_RUN_MODE=mock_contract \
DPONE_MATRIX_ARTIFACT_DIR=test_artifacts/integration_matrix/mock_contract_latest \
uv run pytest -m integration_matrix tests/integration/matrix -q

Reproduce local/mock layer:

DPONE_RUN_INTEGRATION=1 \
DPONE_RUN_INTEGRATION_MATRIX=1 \
DPONE_MATRIX_RUN_MODE=mock_local \
DPONE_MATRIX_ARTIFACT_DIR=test_artifacts/integration_matrix/mock_local_latest \
uv run pytest -m integration_matrix_mock tests/integration/matrix -q

Focused case (set this variable alongside those in either command above):

DPONE_MATRIX_CASE_ID=postgres_to_mssql__incremental_merge

Common causes:

  • A source -> sink guide is missing.
  • A new strategy is not registered in dpone.integration_matrix.
  • The behavior artifact count/checksum changed without docs updates.
  • mock_local expectations changed for BigQuery documented-contract skips.

Fix:

  • Update the canonical matrix and docs in the same PR.
  • Keep default mock volume documented: 10,000 rows, 20% changed, 5% deletes, 120 wide columns.
  • Use artifacts under test_artifacts/integration_matrix/ to compare expected/actual behavior.
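
The environment for a focused run can be assembled in one place so that the run mode, artifact directory, and case id stay consistent. This is a sketch: the helper name matrix_env is hypothetical, and combining DPONE_MATRIX_CASE_ID with the other variables in a single invocation is an assumption based on the commands above.

```shell
# Hypothetical helper: print the environment assignments for one matrix run mode,
# optionally narrowed to a single case id.
matrix_env() {
  mode="$1"; case_id="${2:-}"
  echo "DPONE_RUN_INTEGRATION=1"
  echo "DPONE_RUN_INTEGRATION_MATRIX=1"
  echo "DPONE_MATRIX_RUN_MODE=$mode"
  echo "DPONE_MATRIX_ARTIFACT_DIR=test_artifacts/integration_matrix/${mode}_latest"
  [ -z "$case_id" ] || echo "DPONE_MATRIX_CASE_ID=$case_id"
}

# Example:
#   env $(matrix_env mock_contract postgres_to_mssql__incremental_merge) \
#     uv run pytest -m integration_matrix tests/integration/matrix -q
```

Deriving the artifact directory from the mode avoids overwriting mock_contract artifacts with a mock_local run.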

Connector certification failures

Applies to .github/workflows/connector-certification.yml.

Offline certification reproduction:

uv run pytest \
  tests/test_mssql_manifest_examples.py \
  tests/test_runtime_mssql_contracts.py \
  tests/test_runtime_kafka_contracts.py \
  tests/test_runtime_rest_and_clickhouse_contracts.py \
  tests/test_runtime_schema_evolution_contracts.py \
  tests/test_runtime_state_and_reconciliation_contracts.py \
  tests/test_runtime_cdc_readers.py \
  tests/test_runtime_parallel_partitioning.py \
  tests/test_managed_ux_contracts.py \
  -q

Fix:

  • If capability metadata changed, update the Connector certification documentation page.
  • If a local service fails, inspect docker compose -f docker/docker-compose.integration.yml ps and service logs.
  • If vendor-live fails, verify credentials and provider availability before changing runtime code.
  • Upload or update certification artifacts for release-impacting connector changes.

Live certification failures

Use this runbook when .github/workflows/live-certification.yml is red.

  1. Open the first failing step. Later evidence-pack steps often fail only because an upstream artifact is missing.
  2. If service startup fails, run docker compose -f docker/docker-compose.integration.yml ps and inspect dpone-it-postgres, dpone-it-mssql, dpone-it-clickhouse, dpone-it-kafka, and dpone-it-schema-registry logs.
  3. If native tooling fails, verify that bcp -v and sqlcmd -? run, that ODBC Driver 18 is installed, and that /opt/mssql-tools18/bin is on the runner PATH.
  4. If mssql_stress.py fails during Postgres -> MSSQL export, inspect postgres_to_mssql.source_export and partition bounds in the benchmark JSON.
  5. If mssql_stress.py fails during MSSQL load/finalize, inspect postgres_to_mssql.target_load_finalize, SQL Server error files, and bulk.bcp.* settings.
  6. If the optional native benchmark suite is red, open postgres_mssql_native_benchmark_summary.md first, then the specific scenario JSON under native_benchmark_suite/.
  7. For 1M/10M local failures, distinguish infrastructure pressure from runtime bugs: check Docker memory, temp disk, SQL Server transaction log growth, and ClickHouse part pressure before changing code.
  8. Keep release promotion blocked until release-evidence, evidence-chain, and pre-release artifacts are present and passed.
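
The native tooling check in step 3 can start with a PATH lookup before any tool is invoked. A minimal sketch, with the hypothetical helper name missing_tools:

```shell
# Hypothetical helper: print each named command that is not found on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Example: missing_tools bcp sqlcmd docker
```

Empty output means every listed tool resolved; any printed name points at a PATH or installation problem rather than a runtime bug.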

Stuck or queued workflows

Common causes:

  • GitHub Actions runner capacity.
  • Environment protection waiting for approval.
  • Pages deployment concurrency.
  • Long local service startup.

Fix:

  • Check workflow concurrency groups before canceling.
  • Cancel superseded runs only when a newer commit contains the same changes.
  • Do not cancel release publishing after upload has started unless you have verified PyPI state.

Orchestration maturity failures

Use this runbook when .github/workflows/orchestration-maturity.yml is red.

  1. Open the failing step first: orchestration tests, docs link checks, or strict MkDocs build.
  2. For test failures, run uv run pytest tests/test_orchestration.py -q locally and inspect the specific blocker code.
  3. For lock failures, inspect .dpone/locks/<key>.lock.json and confirm no active scheduler job owns it.
  4. For resume policy failures, inspect .dpone/orchestration-state/<run_id>.job_state.json before changing --resume-policy.
  5. For scheduler snippet failures, confirm snippets call dpone orchestrate run, not bare dpone run.
  6. Upload orchestration-maturity-report with the PR or release evidence after the gate is green.
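
The scheduler snippet check in step 5 can be approximated with a text search. The sketch below is an assumption about how snippets are stored (plain files passed as arguments); the helper name find_bare_dpone_run is hypothetical.

```shell
# Hypothetical helper: print "line:text" for lines that invoke a bare `dpone run`
# instead of `dpone orchestrate run`. Pass one or more snippet files as arguments.
find_bare_dpone_run() {
  grep -nE '(^|[^[:alnum:]_-])dpone[[:space:]]+run([[:space:]]|$)' "$@" || true
}
```

Lines that call dpone orchestrate run are not matched, so any output is a candidate for correction.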

Observability maturity failures

Use this runbook when .github/workflows/observability-maturity.yml is red.

  1. Open observability-maturity-report and identify whether tests, metrics export, SLO smoke, or artifact indexing failed.
  2. Reproduce focused tests with uv run pytest tests/test_observability.py -q.
  3. If metrics.empty, run_report.missing, or run_report.invalid_json appears, inspect test_artifacts/observability/maturity/run_report.json.
  4. If Prometheus output is malformed, inspect label names and values in the metrics export command; labels are sanitized but empty keys are invalid input.
  5. If OpenTelemetry resource attributes are missing, confirm --resource-attr key=value flags are passed after metrics-export.
  6. If SLO smoke is red, inspect slo_report.json and tune the synthetic objective only when the runtime metric contract is still correct.
  7. If metrics_index.json or artifact_index.json checksum evidence is missing, re-run the export and index commands in order.
  8. Upload the whole test_artifacts/observability/maturity/ directory after remediation.

Full certification automation failures

Use this runbook when .github/workflows/full-certification.yml is red.

  1. Open the failing steps in order, starting with the first; downstream steps may be red only because an upstream artifact is missing.
  2. If source_sink_matrix fails, re-run the focused case from test_artifacts/full_certification/matrix/certification_report.json.
  3. If the blocker is benchmark_baseline.not_passed, re-run the same profile before updating a baseline.
  4. If the blocker is lineage_report.missing, verify that run-registry produced a *__run_registry.json entry first.
  5. If the blocker is evidence_bundle.not_passed, inspect data contract rows and required evidence in ops_evidence_bundle.json.
  6. If certification_suite is red, inspect blockers before changing workflow steps.
  7. If evidence-chain-verify fails, treat it as release-blocking audit evidence and rebuild from the artifact index only after reviewing checksum drift.
  8. Attach full-certification-report to release review or connector badge promotion evidence.

Production maturity failures

Workflow: .github/workflows/production-maturity.yml

Command to reproduce locally:

uv run dpone ops production-maturity \
  --release local-readiness \
  --output-dir test_artifacts/production_maturity/report \
  --artifact certification=PATH_TO_CERTIFICATION_JSON \
  --artifact cdc=PATH_TO_CDC_JSON \
  --artifact performance=PATH_TO_PERFORMANCE_JSON \
  --artifact security=PATH_TO_SECURITY_JSON \
  --artifact supply_chain=PATH_TO_SUPPLY_CHAIN_JSON \
  --artifact governance=PATH_TO_GOVERNANCE_JSON \
  --artifact docs=PATH_TO_DOCS_JSON

Recovery:

  1. Open test_artifacts/production_maturity/report/production_maturity.md.
  2. For *.missing, rerun or download the missing specialized workflow artifact.
  3. For *.not_passed, fix the specialized workflow that produced the artifact; do not patch the aggregator to ignore the failure.
  4. Rerun dpone ops production-maturity with the corrected artifact paths.
  5. Keep release publishing blocked until all required blockers are gone.

The expected output for a releasable build is level: ga_ready. A level of release_candidate is reviewable but not publishable without explicit acceptance of the remaining blockers.
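
The recovery rules for blocker suffixes can be summarized as a small lookup. This is a sketch only: the helper name blocker_action is hypothetical, and it restates the steps above rather than reading the real report format.

```shell
# Hypothetical helper: map a blocker code to the recovery step from this runbook.
blocker_action() {
  case "$1" in
    *.missing)    echo "rerun or download the missing specialized workflow artifact" ;;
    *.not_passed) echo "fix the specialized workflow that produced the artifact" ;;
    *)            echo "inspect the report entry manually" ;;
  esac
}
```

For example, blocker_action security.not_passed points at the specialized workflow, never at the aggregator.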

Industrial readiness failures

Workflow: .github/workflows/industrial-readiness.yml

Command to reproduce locally:

uv run dpone ops industrial-readiness \
  --release local-industrial-readiness \
  --output-dir test_artifacts/industrial_readiness/report \
  --artifact local_matrix=PATH_TO_LOCAL_MATRIX_JSON \
  --artifact correctness=PATH_TO_CORRECTNESS_JSON \
  --artifact reliability=PATH_TO_RELIABILITY_JSON \
  --artifact performance_lab=PATH_TO_PERFORMANCE_LAB_JSON \
  --artifact ux=PATH_TO_UX_JSON \
  --artifact governance=PATH_TO_GOVERNANCE_JSON

Recovery:

  1. Open test_artifacts/industrial_readiness/report/industrial_readiness.md.
  2. Fix missing or failed specialized evidence first.
  3. For local_matrix.case_missing:*, run the exact source -> sink -> strategy case named by the blocker.
  4. For correctness blockers, inspect reconciliation, type fidelity, NULL/empty-string handling, and quarantine artifacts.
  5. For reliability blockers, inspect locks, retries, resumability, idempotency, and state commit order.
  6. Keep public release promotion blocked until the industrial readiness report is industrial_ready.