CI/CD failure runbooks¶
Use this page when a GitHub Actions run is red or stuck. Start by identifying the workflow, job, and exact failing step. Do not treat all red checks the same: default PR CI, docs deploy, release publishing, manual matrix, and scheduled certification have different recovery paths.
Triage flow¶
flowchart TD
A["A check is red"] --> B["Identify workflow and job"]
B --> C{"Default PR gate?"}
C -- yes --> D["Reproduce local command"]
C -- no --> E{"Docs, release, security, or manual gate?"}
E -- docs --> F["Run mkdocs build --strict"]
E -- release --> G["Run uv build and twine check"]
E -- security --> H["Inspect finding; rotate secrets if needed"]
E -- manual --> I["Re-run focused marker/case with artifacts"]
D --> J["Fix code/test/docs"]
F --> J
G --> J
H --> J
I --> J
J --> K["Update runbook if failure pattern is new"]
CI quality failures¶
Applies to:
- .github/workflows/ci.yml
- Jobs named Quality checks (3.11), Quality checks (3.12)
Ruff lint fails¶
Reproduce:
Fix:
- Prefer small explicit code fixes over broad suppressions.
- If a rule is noisy for a whole category, document the rule change in the PR.
- Do not hide connector optional-import failures behind # noqa unless the lazy import contract is still covered by tests.
Ruff format fails¶
Reproduce:
Fix:
Then re-run the format check.
Mypy fails¶
Reproduce:
Fix:
- Keep public models typed at boundaries.
- Prefer Protocols and small adapters over Any spreading through runtime code.
- If a third-party library lacks types, isolate the import in the connector adapter.
Pytest fails¶
Reproduce the full default test step:
Focused reproduction:
Fix:
- If the failure is a docs contract, update docs and code together.
- If the failure is optional import safety, keep dependency imports lazy.
- If the failure is integration marker skip behavior, check marker/env docs before changing runtime behavior.
Coverage fails¶
The default gate requires repository coverage above the configured minimum.
Fix:
- Add focused tests for new branches or public contracts.
- Avoid deleting coverage expectations just to pass CI.
- For broad generated docs changes, coverage should not change; investigate accidental runtime edits.
Package build fails¶
Reproduce:
Fix:
- Check pyproject.toml metadata and package include rules.
- Check optional extras for invalid dependency names or private indexes.
- Run uv tool run twine check dist/* after build metadata changes.
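The build-then-verify sequence can be sketched as a small helper. The commands uv build and uv tool run twine check dist/* are the ones this runbook names; the function name check_dist and the default dist directory are illustrative assumptions, not part of the repository.

```shell
# Hypothetical helper: confirm a build left both a wheel and an sdist in DIST_DIR.
check_dist() {
  dist_dir=${1:-dist}
  if ! ls "$dist_dir"/*.whl >/dev/null 2>&1; then
    echo "no wheel found in $dist_dir" >&2
    return 1
  fi
  if ! ls "$dist_dir"/*.tar.gz >/dev/null 2>&1; then
    echo "no sdist found in $dist_dir" >&2
    return 1
  fi
}

# Typical sequence from the repository root:
#   uv build && check_dist dist && uv tool run twine check dist/*
```

A missing wheel or sdist usually points at package include rules rather than at metadata, so checking for both files first narrows where to look.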
PostgreSQL XMin integration failures¶
Applies to the postgres-xmin job in .github/workflows/ci.yml.
Reproduce against a local Postgres service:
DPONE_RUN_INTEGRATION=1 \
DPONE_IT_PG_HOST=127.0.0.1 \
DPONE_IT_PG_PORT=5432 \
DPONE_IT_PG_DATABASE=dpone_it \
DPONE_IT_PG_USER=dpone \
DPONE_IT_PG_PASSWORD=dpone \
uv run pytest -m integration_postgres_xmin tests/integration/postgres -q
Common causes:
- Postgres service is not healthy yet.
- XMin strategy selector changed without updating tests/docs.
- State persistence changed and the test can no longer resume from the expected XMin state.
- Physical delete expectations were added without reconciliation or CDC behavior.
Fix:
- Keep XMin Postgres-only; non-Postgres sources must fail fast when XMin is explicitly selected.
- Preserve state transition order: extract, load, quality/reconciliation, then state commit.
- Update the Postgres XMin documentation page when algorithm behavior changes.
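When the cause is a Postgres service that is not healthy yet, a bounded retry before the pytest command avoids chasing a false failure. This is a minimal sketch: the retry function is a hypothetical helper, and the example assumes the PostgreSQL client tool pg_isready is installed locally. The host and port match the DPONE_IT_PG_HOST and DPONE_IT_PG_PORT values used above.

```shell
# Hypothetical helper: run a command up to ATTEMPTS times, sleeping DELAY seconds between tries.
retry() {
  retry_attempts=$1
  retry_delay=$2
  shift 2
  retry_i=1
  while :; do
    # Stop as soon as the wrapped command succeeds.
    "$@" && return 0
    # Give up once the attempt budget is used.
    [ "$retry_i" -ge "$retry_attempts" ] && return 1
    retry_i=$((retry_i + 1))
    sleep "$retry_delay"
  done
}

# Example: wait up to about 30 seconds for the local service, then run the tests.
#   retry 30 1 pg_isready -h 127.0.0.1 -p 5432
```

If the retry budget runs out, inspect the service logs before touching the XMin tests; the test failure is then a symptom, not the cause.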
Docs and GitHub Pages failures¶
Applies to .github/workflows/pages.yml.
Reproduce:
Common causes:
- Broken relative link.
- File added but not linked from nav or documentation index.
- Mermaid fence config broken.
- Markdown heading/link mismatch.
- MkDocs dependency drift.
Fix:
- Add new public pages to Documentation index, mkdocs.yml, or an existing section index.
- Keep Mermaid as fenced blocks using ```mermaid.
- Do not use unsafe YAML tags in mkdocs.yml; pre-commit check-yaml must pass.
- If GitHub Pages deploy succeeds but site content is old, check that the docs workflow completed on master and Pages source is GitHub Actions.
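For the "file added but not linked" cause, a rough first pass is to list Markdown files whose relative path never appears in the MkDocs config. This is a sketch under assumptions: the function name unlisted_docs is hypothetical, and the check is a plain substring match. A page linked only from the documentation index or a section index will still show up, so treat the output as candidates to review, not as errors.

```shell
# Hypothetical helper: print Markdown files under DOCS_DIR whose relative path
# does not occur anywhere in CONFIG (plain substring match, not a YAML parse).
unlisted_docs() {
  docs_dir=$1
  config=$2
  find "$docs_dir" -type f -name '*.md' | sort | while IFS= read -r path; do
    rel=${path#"$docs_dir"/}
    grep -qF -- "$rel" "$config" || printf '%s\n' "$rel"
  done
}

# Example from the repository root (the docs directory name is an assumption):
#   unlisted_docs docs mkdocs.yml
```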
Release and PyPI failures¶
Applies to .github/workflows/release.yml.
Reproduce build and metadata checks:
Common causes:
- Tag does not match vX.Y.Z.
- Version in pyproject.toml was not updated before tagging.
- Trusted Publishing pending publisher is misconfigured.
- Manual token fallback requested but PYPI_API_TOKEN is missing or expired.
- PyPI project already has the same version.
Fix:
- Prefer Trusted Publishing with environment pypi.
- Check PyPI project owner/repository/workflow filename/environment settings.
- Never reuse a compromised token.
- If a publish partially succeeded, do not delete/reuse the version; cut a new patch version.
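The first two causes, a malformed tag and a stale version, can be caught before pushing a tag. The sketch below assumes the version is declared as a static version = "X.Y.Z" line in pyproject.toml; if the project derives its version dynamically, this check does not apply. The function name check_release_tag is hypothetical.

```shell
# Hypothetical pre-tag check: TAG must look like vX.Y.Z and equal "v" + the version in PYPROJECT.
# Assumes a static line of the form: version = "X.Y.Z" (the first such line is used).
check_release_tag() {
  tag=$1
  pyproject=${2:-pyproject.toml}
  if ! printf '%s\n' "$tag" | grep -Eq '^v[0-9]+\.[0-9]+\.[0-9]+$'; then
    echo "tag '$tag' does not match vX.Y.Z" >&2
    return 1
  fi
  version=$(sed -n 's/^version[[:space:]]*=[[:space:]]*"\([^"]*\)".*/\1/p' "$pyproject" | head -n 1)
  if [ "v$version" != "$tag" ]; then
    echo "tag '$tag' does not match pyproject version '$version'" >&2
    return 1
  fi
}

# Example: check_release_tag v1.2.3 && git tag v1.2.3
```

Running the check before tagging matters because a published version cannot be reused; a mismatch found after upload costs a new patch release.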
Secret scan failures¶
Applies to .github/workflows/secret-scan.yml.
If TruffleHog reports a verified secret:
- Treat it as compromised.
- Revoke and rotate the credential before further public release work.
- Remove the secret from the file, test artifact, or docs page.
- If the secret is in public git history, coordinate history rewrite separately and document the incident.
- Add a regression check if the pattern can recur.
Do not silence verified secret findings to make CI green.
CodeQL failures¶
Applies to .github/workflows/codeql.yml.
Fix:
- Open the CodeQL alert and identify the data/control flow.
- Prefer input validation, safe APIs, and explicit escaping over suppressions.
- Add a regression test for the risky behavior when practical.
- If the alert is a false positive, document why and keep the suppression narrow.
OSSF Scorecard failures¶
Applies to .github/workflows/scorecard.yml.
Common causes:
- Branch protection was changed.
- Token permissions are too broad.
- Security policy or dependency update posture regressed.
Fix:
- Keep workflow permissions least-privilege.
- Keep
SECURITY.mdand dependency automation current. - Treat Scorecard as supply-chain posture evidence; do not block emergency hotfixes solely on advisory score drift, but create a follow-up issue.
Source sink integration matrix failures¶
Applies to .github/workflows/integration-matrix.yml and tests/integration/matrix/.
Reproduce all credential-free contracts:
DPONE_RUN_INTEGRATION=1 \
DPONE_RUN_INTEGRATION_MATRIX=1 \
DPONE_MATRIX_RUN_MODE=mock_contract \
DPONE_MATRIX_ARTIFACT_DIR=test_artifacts/integration_matrix/mock_contract_latest \
uv run pytest -m integration_matrix tests/integration/matrix -q
Reproduce local/mock layer:
DPONE_RUN_INTEGRATION=1 \
DPONE_RUN_INTEGRATION_MATRIX=1 \
DPONE_MATRIX_RUN_MODE=mock_local \
DPONE_MATRIX_ARTIFACT_DIR=test_artifacts/integration_matrix/mock_local_latest \
uv run pytest -m integration_matrix_mock tests/integration/matrix -q
Focused case:
Common causes:
- A source -> sink guide is missing.
- A new strategy is not registered in dpone.integration_matrix.
- The behavior artifact count/checksum changed without docs updates.
- mock_local expectations changed for BigQuery documented-contract skips.
Fix:
- Update the canonical matrix and docs in the same PR.
- Keep default mock volume documented: 10,000 rows, 20% changed, 5% deletes, 120 wide columns.
- Use artifacts under test_artifacts/integration_matrix/ to compare expected/actual behavior.
Connector certification failures¶
Applies to .github/workflows/connector-certification.yml.
Offline certification reproduction:
uv run pytest \
tests/test_mssql_manifest_examples.py \
tests/test_runtime_mssql_contracts.py \
tests/test_runtime_kafka_contracts.py \
tests/test_runtime_rest_and_clickhouse_contracts.py \
tests/test_runtime_schema_evolution_contracts.py \
tests/test_runtime_state_and_reconciliation_contracts.py \
tests/test_runtime_cdc_readers.py \
tests/test_runtime_parallel_partitioning.py \
tests/test_managed_ux_contracts.py \
-q
Fix:
- If capability metadata changed, update Connector certification.
- If a local service fails, inspect docker compose -f docker/docker-compose.integration.yml ps and service logs.
- If vendor-live fails, verify credentials and provider availability before changing runtime code.
- Upload or update certification artifacts for release-impacting connector changes.
Live certification failures¶
Use this runbook when .github/workflows/live-certification.yml is red.
- Open the first failing step. Later evidence-pack steps often fail only because an upstream artifact is missing.
- If service startup fails, run docker compose -f docker/docker-compose.integration.yml ps and inspect dpone-it-postgres, dpone-it-mssql, dpone-it-clickhouse, dpone-it-kafka, and dpone-it-schema-registry logs.
- If native tooling fails, verify bcp -v, sqlcmd -?, ODBC Driver 18, and /opt/mssql-tools18/bin on the runner PATH.
- If mssql_stress.py fails during Postgres -> MSSQL export, inspect postgres_to_mssql.source_export and partition bounds in the benchmark JSON.
- If mssql_stress.py fails during MSSQL load/finalize, inspect postgres_to_mssql.target_load_finalize, SQL Server error files, and bulk.bcp.* settings.
- If the optional native benchmark suite is red, open postgres_mssql_native_benchmark_summary.md first, then the specific scenario JSON under native_benchmark_suite/.
- For 1M/10M local failures, distinguish infrastructure pressure from runtime bugs: check Docker memory, temp disk, SQL Server transaction log growth, and ClickHouse part pressure before changing code.
- Keep release promotion blocked until release-evidence, evidence-chain, and pre-release artifacts are present and passed.
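The native tooling check can be sketched as a quick PATH probe before running bcp -v or sqlcmd -? by hand. The function name require_tools is a hypothetical helper; it only reports whether each command resolves on PATH and does not verify the ODBC driver.

```shell
# Hypothetical helper: report every named tool that does not resolve on PATH.
# Returns non-zero if at least one tool is missing.
require_tools() {
  missing=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing on PATH: $tool" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example: require_tools bcp sqlcmd
```

If both tools are reported missing, check whether /opt/mssql-tools18/bin is on the runner PATH before reinstalling anything.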
Stuck or queued workflows¶
Common causes:
- GitHub Actions runner capacity.
- Environment protection waiting for approval.
- Pages deployment concurrency.
- Long local service startup.
Fix:
- Check workflow concurrency groups before canceling.
- Cancel superseded runs only when a newer commit contains the same changes.
- Do not cancel release publishing after upload has started unless you have verified PyPI state.
Orchestration maturity failures¶
Use this runbook when .github/workflows/orchestration-maturity.yml is red.
- Open the failing step first: orchestration tests, docs link checks, or strict MkDocs build.
- For test failures, run uv run pytest tests/test_orchestration.py -q locally and inspect the specific blocker code.
- For lock failures, inspect .dpone/locks/<key>.lock.json and confirm no active scheduler job owns it.
- For resume policy failures, inspect .dpone/orchestration-state/<run_id>.job_state.json before changing --resume-policy.
- For scheduler snippet failures, confirm snippets call dpone orchestrate run, not bare dpone run.
- Upload orchestration-maturity-report with the PR or release evidence after the gate is green.
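For lock failures, it helps to see which lock keys exist before opening individual files. The sketch below only lists keys derived from the .dpone/locks/<key>.lock.json naming shown above; it does not read the lock contents, since their schema is not described on this page. The function name list_lock_keys is hypothetical.

```shell
# Hypothetical helper: print the <key> part of every <key>.lock.json file in LOCK_DIR.
list_lock_keys() {
  lock_dir=${1:-.dpone/locks}
  # No lock directory means no locks; print nothing and succeed.
  [ -d "$lock_dir" ] || return 0
  find "$lock_dir" -type f -name '*.lock.json' | sort | while IFS= read -r path; do
    name=${path##*/}
    printf '%s\n' "${name%.lock.json}"
  done
}

# Example from the project root: list_lock_keys
```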
Observability maturity failures¶
Use this runbook when .github/workflows/observability-maturity.yml is red.
- Open observability-maturity-report and identify whether tests, metrics export, SLO smoke, or artifact indexing failed.
- Reproduce focused tests with uv run pytest tests/test_observability.py -q.
- If metrics.empty, run_report.missing, or run_report.invalid_json appears, inspect test_artifacts/observability/maturity/run_report.json.
- If Prometheus output is malformed, inspect label names and values in the metrics export command; labels are sanitized but empty keys are invalid input.
- If OpenTelemetry resource attributes are missing, confirm --resource-attr key=value flags are passed after metrics-export.
- If SLO smoke is red, inspect slo_report.json and tune the synthetic objective only when the runtime metric contract is still correct.
- If metrics_index.json or artifact_index.json checksum evidence is missing, re-run the export and index commands in order.
- Upload the whole test_artifacts/observability/maturity/ directory after remediation.
Full certification automation failures¶
Use this runbook when .github/workflows/full-certification.yml is red.
- Open the failing step in order; downstream steps may be red only because an upstream artifact is missing.
- If source_sink_matrix fails, re-run the focused case from test_artifacts/full_certification/matrix/certification_report.json.
- If benchmark_baseline.not_passed, re-run the same profile before updating a baseline.
- If lineage_report.missing, verify run-registry produced a *__run_registry.json entry first.
- If evidence_bundle.not_passed, inspect data contract rows and required evidence in ops_evidence_bundle.json.
- If certification_suite is red, inspect blockers before changing workflow steps.
- If evidence-chain-verify fails, treat it as release-blocking audit evidence and rebuild from the artifact index only after reviewing checksum drift.
- Attach full-certification-report to release review or connector badge promotion evidence.
Production maturity failures¶
Workflow: .github/workflows/production-maturity.yml
Command to reproduce locally:
uv run dpone ops production-maturity \
--release local-readiness \
--output-dir test_artifacts/production_maturity/report \
--artifact certification=PATH_TO_CERTIFICATION_JSON \
--artifact cdc=PATH_TO_CDC_JSON \
--artifact performance=PATH_TO_PERFORMANCE_JSON \
--artifact security=PATH_TO_SECURITY_JSON \
--artifact supply_chain=PATH_TO_SUPPLY_CHAIN_JSON \
--artifact governance=PATH_TO_GOVERNANCE_JSON \
--artifact docs=PATH_TO_DOCS_JSON
Recovery:
- Open test_artifacts/production_maturity/report/production_maturity.md.
- For *.missing, rerun or download the missing specialized workflow artifact.
- For *.not_passed, fix the specialized workflow that produced the artifact; do not patch the aggregator to ignore the failure.
- Rerun dpone ops production-maturity with the corrected artifact paths.
- Keep release publishing blocked until all required blockers are gone.
Expected output is level: ga_ready for a releasable build. release_candidate is reviewable but not publishable without explicit acceptance of remaining blockers.
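The blocker suffix decides the recovery path, which can be expressed as a small lookup. This is an illustrative sketch: the function name blocker_action is hypothetical, the example codes are composed from the artifact names and suffixes on this page, and the wording of each action paraphrases the recovery steps.

```shell
# Hypothetical helper: map a blocker code to the recovery step, based on its suffix.
blocker_action() {
  case $1 in
    *.missing)    echo "rerun or download the missing specialized workflow artifact" ;;
    *.not_passed) echo "fix the specialized workflow that produced the artifact" ;;
    *)            echo "open the report and inspect the blocker" ;;
  esac
}

# Examples (illustrative codes):
#   blocker_action security.missing
#   blocker_action performance.not_passed
```

Neither branch edits the aggregator: a missing artifact is a delivery problem and a failed artifact is an upstream problem, so both are resolved in the specialized workflow.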
Industrial readiness failures¶
Workflow: .github/workflows/industrial-readiness.yml
Command to reproduce locally:
uv run dpone ops industrial-readiness \
--release local-industrial-readiness \
--output-dir test_artifacts/industrial_readiness/report \
--artifact local_matrix=PATH_TO_LOCAL_MATRIX_JSON \
--artifact correctness=PATH_TO_CORRECTNESS_JSON \
--artifact reliability=PATH_TO_RELIABILITY_JSON \
--artifact performance_lab=PATH_TO_PERFORMANCE_LAB_JSON \
--artifact ux=PATH_TO_UX_JSON \
--artifact governance=PATH_TO_GOVERNANCE_JSON
Recovery:
- Open test_artifacts/industrial_readiness/report/industrial_readiness.md.
- Fix missing or failed specialized evidence first.
- For local_matrix.case_missing:*, run the exact source -> sink -> strategy case named by the blocker.
- For correctness blockers, inspect reconciliation, type fidelity, NULL/empty-string handling, and quarantine artifacts.
- For reliability blockers, inspect locks, retries, resumability, idempotency, and state commit order.
- Keep public release promotion blocked until the industrial readiness report is industrial_ready.