Context — t13/t17/t21 incident, 2026-05-17. catalyst-api is single-replica
with strategy: Recreate; the OpenTofu workdir lives on a /tmp emptyDir that
dies with the Pod. When this workflow bumped the image SHA while a
provision was running, Flux rolled the Pod and killed `tofu apply`
mid-resource. On the new Pod, restoreFromStore rewrote the on-disk record
to status=failed, but the Hetzner resources tagged with the abandoned
deployment-id stayed orphaned and required manual `hcloud` cleanup. Three
consecutive provisioning runs died this way in one afternoon.

Option C (smallest blast radius): gate the deploy-bot at the workflow level.

1. New public endpoint GET /api/v1/deployments/in-flight-count on
catalyst-api. Returns {count, ids} of deployments in Phase-0 in-flight
status (pending / provisioning / tofu-applying / flux-bootstrapping).
Phase-1 (phase1-watching) is observational and resumes across Pod
restarts via resumePhase1Watch, so it does NOT block. Adopted
deployments are excluded. No FQDNs / owner emails in the response —
same information-disclosure posture as /api/v1/subdomains/check.
Unauthenticated; the deploy-bot has no session cookie.
2. .github/workflows/catalyst-build.yaml `deploy` job polls this endpoint
before bumping values.yaml. count==0 → green light. count>0 → sleep
20s and retry. Hard cap 30 min, after which the deploy proceeds anyway:
a stuck prov must not block all future deploys, which would be the worst
possible failure mode for a CI gate. Fail-open on any non-200 / network
error so the gate cannot itself become an outage.

Notes:
- Mothership URL configurable via vars.CATALYST_API_URL (defaults to
https://console.openova.io). Sovereign chroot self-deploys can point
to their local catalyst-api.
- First-rollout safe: the endpoint does not exist on the LIVE
mothership until THIS PR's image lands, so the first run after merge
falls through the 404 branch and proceeds. Subsequent runs benefit
from the gate.
- NOT a Chart.yaml bump. The deploy-bot itself bumps the literal image
refs in chart templates (existing behaviour), so the new endpoint
reaches Sovereigns through the normal chart-rebake path.

Tests: handler/deployments_in_flight_count_test.go covers Phase-0 vs
Phase-1 vs terminal vs adopted classification + empty-store green light.
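
The classification those tests exercise can be sketched roughly as below.
This is a minimal illustration, not the handler's actual code: the
deployment struct, the Adopted flag, and summarizeInFlight are hypothetical
stand-ins for whatever the store really exposes. Only the status strings
are taken from the description above.

```go
package main

import "fmt"

// deployment is a hypothetical, trimmed-down record. The real store type
// in catalyst-api may look different.
type deployment struct {
	ID      string
	Status  string
	Adopted bool // assumption: adoption is a flag on the record
}

// phase0InFlight lists the statuses that block a deploy. Anything else,
// including phase1-watching and terminal statuses, does not block.
var phase0InFlight = map[string]bool{
	"pending":            true,
	"provisioning":       true,
	"tofu-applying":      true,
	"flux-bootstrapping": true,
}

// inFlightSummary is the {count, ids} shape returned by the endpoint.
// It deliberately carries no FQDNs or owner emails.
type inFlightSummary struct {
	Count int      `json:"count"`
	IDs   []string `json:"ids"`
}

// summarizeInFlight counts non-adopted deployments in a Phase-0 status.
func summarizeInFlight(all []deployment) inFlightSummary {
	ids := []string{} // non-nil so an empty store encodes as "ids": []
	for _, d := range all {
		if d.Adopted {
			continue // adopted deployments are excluded
		}
		if phase0InFlight[d.Status] {
			ids = append(ids, d.ID)
		}
	}
	return inFlightSummary{Count: len(ids), IDs: ids}
}

func main() {
	store := []deployment{
		{ID: "a", Status: "tofu-applying"},                // Phase-0: blocks
		{ID: "b", Status: "phase1-watching"},              // Phase-1: ignored
		{ID: "c", Status: "failed"},                       // terminal: ignored
		{ID: "d", Status: "provisioning", Adopted: true},  // adopted: ignored
	}
	s := summarizeInFlight(store)
	fmt.Println(s.Count, s.IDs) // prints: 1 [a]
}
```

An allow-list of blocking statuses means any status added later defaults
to non-blocking, which matches the gate's fail-open posture.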

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>