openova/.github
e3mrah 96674b71c9
fix(ci+catalyst-api): hold deploy-bot bumps when any prov is in-flight (was rolling catalyst-api Pod mid-tofu-apply, abandoning provs) (#1688)
Context — t13/t17/t21 incident, 2026-05-17. catalyst-api is single-replica
with strategy: Recreate; the OpenTofu workdir lives on a /tmp emptyDir that
dies with the Pod. When this workflow bumped the image SHA mid-prov, Flux
rolled the Pod and killed `tofu apply` mid-resource. The on-disk record was
rewritten to status=failed by restoreFromStore on the new Pod, but Hetzner
resources tagged with the abandoned deployment-id stayed orphaned and
required manual `hcloud` cleanup. Three consecutive provs died this way in
one afternoon.

Option C (smallest blast radius): gate the deploy-bot at the workflow level.

  1. New public endpoint GET /api/v1/deployments/in-flight-count on
     catalyst-api. Returns {count, ids} of deployments in Phase-0 in-flight
     status (pending / provisioning / tofu-applying / flux-bootstrapping).
     Phase-1 (phase1-watching) is observational and resumes across Pod
     restarts via resumePhase1Watch, so it does NOT block. Adopted
     deployments are excluded. No FQDNs / owner emails in the response —
     same information-disclosure posture as /api/v1/subdomains/check.
     Unauthenticated; the deploy-bot has no session cookie.

  2. .github/workflows/catalyst-build.yaml `deploy` job polls this endpoint
     before bumping values.yaml. count==0 → green light. count>0 → sleep
     20s and retry. Hard cap 30 min (a stuck prov must not block all
     future deploys — that would be the worst possible failure mode for a
     CI gate). Fail-open on any non-200 / network error so the gate
     cannot itself become an outage.
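
The gate logic in step 2 can be sketched as below. This is a minimal Go
rendering under stated assumptions, not the workflow step itself (which
presumably is shell inside catalyst-build.yaml). The names probeInFlight,
waitForIdle and outcome are hypothetical, as are the 10s HTTP timeout and
the choice to treat an undecodable body as fail-open. Only the path, the
{count, ids} shape, and the proceed rules come from the description above.

```go
package main

import (
	"encoding/json"
	"errors"
	"net/http"
	"time"
)

// errNon200 marks any response other than 200 OK; the gate fails open on it.
var errNon200 = errors.New("non-200 response")

// outcome explains why the deploy was allowed to proceed. Every outcome
// proceeds; the gate only ever delays a bump, it never rejects one.
type outcome string

const (
	idle       outcome = "no Phase-0 deployment in flight"
	failOpen   outcome = "probe failed, failing open"
	capReached outcome = "hard cap reached, proceeding anyway"
)

// probeInFlight asks catalyst-api for the Phase-0 in-flight count.
// The 10s timeout is an assumption, not taken from the commit.
func probeInFlight(baseURL string) (int, error) {
	client := &http.Client{Timeout: 10 * time.Second}
	resp, err := client.Get(baseURL + "/api/v1/deployments/in-flight-count")
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return 0, errNon200 // includes the first-rollout 404
	}
	var body struct {
		Count int      `json:"count"`
		IDs   []string `json:"ids"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&body); err != nil {
		return 0, err
	}
	return body.Count, nil
}

// waitForIdle probes up to maxAttempts times, calling pause between busy
// probes. It returns as soon as the count is zero or a probe errors.
func waitForIdle(probe func() (int, error), pause func(), maxAttempts int) outcome {
	for attempt := 0; attempt < maxAttempts; attempt++ {
		count, err := probe()
		if err != nil {
			return failOpen
		}
		if count == 0 {
			return idle
		}
		pause()
	}
	return capReached
}
```

With a 20s pause, maxAttempts=90 approximates the 30 min cap (probe latency
is ignored). A caller would pass a probe closure wrapping
probeInFlight("https://console.openova.io") and a pause closure wrapping
time.Sleep(20 * time.Second). Injecting both keeps the loop testable
without a network or a real clock.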

Notes:
  - Mothership URL configurable via vars.CATALYST_API_URL (defaults to
    https://console.openova.io). Sovereign chroot self-deploys can point
    to their local catalyst-api.
  - First-rollout safe: the endpoint does not exist on the LIVE
    mothership until THIS PR's image lands, so the first run after merge
    gets a 404, takes the fail-open branch, and proceeds. Subsequent
    runs benefit from the gate.
  - NOT a Chart.yaml bump. The deploy-bot itself bumps the literal image
    refs in chart templates (existing behaviour), so the new endpoint
    reaches Sovereigns through the normal chart-rebake path.

Tests: handler/deployments_in_flight_count_test.go covers Phase-0 vs
Phase-1 vs terminal vs adopted classification + empty-store green light.
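
The classification those tests exercise might look roughly like the sketch
below. The deployment struct, its Adopted flag, and countInFlight are
hypothetical stand-ins, since the real store type and handler are not shown
here. The four Phase-0 status strings and the {count, ids} response shape
are taken from the endpoint description above.

```go
package main

// deployment is a hypothetical, pared-down record; the real store type
// is not shown in this commit message.
type deployment struct {
	ID      string
	Status  string
	Adopted bool
}

// phase0InFlight lists the statuses that block a deploy-bot bump.
// phase1-watching and terminal statuses are deliberately absent.
var phase0InFlight = map[string]bool{
	"pending":            true,
	"provisioning":       true,
	"tofu-applying":      true,
	"flux-bootstrapping": true,
}

// inFlightResponse carries only a count and opaque ids, with no FQDNs
// or owner emails.
type inFlightResponse struct {
	Count int      `json:"count"`
	IDs   []string `json:"ids"`
}

// countInFlight keeps non-adopted deployments whose status is Phase-0
// in-flight and reports their ids.
func countInFlight(all []deployment) inFlightResponse {
	ids := []string{} // non-nil so JSON encodes [] rather than null
	for _, d := range all {
		if d.Adopted || !phase0InFlight[d.Status] {
			continue
		}
		ids = append(ids, d.ID)
	}
	return inFlightResponse{Count: len(ids), IDs: ids}
}
```

An empty store yields Count 0 and an empty ids list, which is the green
light the deploy job waits for.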

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 15:54:54 +04:00
workflows fix(ci+catalyst-api): hold deploy-bot bumps when any prov is in-flight (was rolling catalyst-api Pod mid-tofu-apply, abandoning provs) (#1688) 2026-05-18 15:54:54 +04:00
dependabot.yml chore(ci): add Dependabot for npm and GitHub Actions dependency updates 2026-03-19 13:42:02 +01:00