fix(bp-powerdns): root-cause Job DeadlineExceeded recurrence (post Fix #144) (#1425)

Fix #144 raised zoneBootstrap.activeDeadlineSeconds 300s → 840s after
prov #22 hit a 5m DeadlineExceeded on the bp-powerdns post-install hook.
That fix was insufficient: prov #37 + #38 (chroot omantel.biz, 2026-05-12)
both wedged on the SAME chart slot with `BackoffLimitExceeded`, NOT
`DeadlineExceeded`. The deadline never got a chance to fire.

Trace from prov #38 chroot (`KUBECONFIG=/tmp/prov38.kubeconfig kubectl
get hr bp-powerdns -o yaml`):

  status:
    Helm install failed for release powerdns/powerdns with chart
    bp-powerdns@1.2.2: failed post-install: 1 error occurred:
      * job powerdns-zone-bootstrap failed: BackoffLimitExceeded

Pod events for powerdns-zone-bootstrap-tq7qq:
  59m Started container zone-bootstrap
  56m Back-off restarting failed container zone-bootstrap
  55m Job has reached the specified backoff limit

Root cause walked end-to-end (per CLAUDE.md TRACE rule):

  TEST: bp-powerdns HR Ready=True
    ↑
  HR: Helm install succeeds (post-install Job exits 0)
    ↑
  Zone-bootstrap Job: curl POST succeeds
    ↑
  powerdns:8081 Service: reachable (has Ready endpoints)
    ↑
  powerdns Deployment: Pods Ready (3 replicas)  ← Pending, blocked here
    ↑
  CNPG cluster: pdns-pg-app Secret exists
    ↑
  pdns-pg-1-initdb Pod: scheduled, Running, Completed  ← Pending too
    ↑
  Worker node has capacity                              ← 99% CPU requested

The zone-bootstrap container curl'd `http://powerdns:8081`, hit
"connection refused" (empty Service endpoints), exited 7, container
restarted under `restartPolicy: OnFailure`. After 6 Kubernetes-level
backoffs (≈10min wall-time with exponential delay), the Job declared
`BackoffLimitExceeded` — well before activeDeadlineSeconds=840s
(14min) could even consider firing.

Fix #144 was directionally right (the upstream IS slow on cold k3s) but
operated on the wrong knob. The container's outer-loop retry budget is
bounded by backoffLimit × backoff-delay, not by activeDeadlineSeconds.
Bumping only the deadline left the BackoffLimit ceiling unchanged.

Architectural fix (this commit):

1. Move the wait-for-API loop INSIDE the container (one Pod, one inner
   poll loop, restartPolicy=Never). The inner loop polls
   GET /api/v1/servers every 10s until HTTP 200, bounded by new
   `apiReadyTimeoutSeconds` (default 600s = 10min). Now ONE container
   run owns the full wait budget instead of N short-lived containers
   racing the backoff timer.

2. restartPolicy: OnFailure → Never. The container script handles its
   own retry; Kubernetes-level backoff is reserved for genuinely
   transient pod failures (image-pull, OS eviction) where the Job-level
   backoffLimit=6 still triggers a fresh Pod.

3. Surface POWERDNS_API_READY_TIMEOUT_S env var so operators on slower
   clusters can raise the inner deadline without forking the chart
   (per docs/INVIOLABLE-PRINCIPLES.md #4).

4. New value `zoneBootstrap.apiReadyTimeoutSeconds` (default 600s).
   Sits below activeDeadlineSeconds (840s) so the zone-creation phase
   keeps ≥240s of headroom AFTER the API comes Ready.

Curl status handling in the wait loop:
  200          → API up, proceed to bootstrap
  401|403      → auth failure, FATAL (no retry — operator misconfig)
  000|5xx|...  → transient, sleep & retry until inner deadline

Files changed:
- platform/powerdns/chart/Chart.yaml         1.2.2 → 1.2.3 + history
- platform/powerdns/chart/values.yaml        + apiReadyTimeoutSeconds knob
- platform/powerdns/chart/templates/
    zone-bootstrap-job.yaml                  inner wait-for-API loop;
                                              restartPolicy: Never
- clusters/_template/bootstrap-kit/
    11-powerdns.yaml                         pin to 1.2.3 + HR comment

Why this is sufficient where Fix #144 was not:

Fix #144 worked the chart-level deadline. This commit works the
inner-loop ownership — the wait budget is now owned by the script
inside the container, not by the Job spec arithmetic
(backoffLimit × backoff-delay). The Job's outer activeDeadlineSeconds
still caps the worst-case runtime (no runaway poll), but the script
now actually GETS to use it.

Verification:
- helm template renders cleanly (deps build OK, empty-zones short-
  circuit preserved, non-empty zones render Job + RBAC + Audit CM)
- kubectl create --dry-run=client --validate=false: 5/5 resources
  created (sa, role, rb, cm, job)
- chart 1.2.3 pinned in clusters/_template/bootstrap-kit/11-powerdns.yaml

Companion infrastructure note (NOT addressed by this commit, flagged
for Coordinator):

The DEEPER bottom of the trace stack is worker capacity. Prov #38's
single cpx32 worker (8 vCPU / 16 GB) is at 99% CPU requested. The
cluster-autoscaler attempted 2→3 scale-up but is in backoff because
two unscheduled pods (gitea/gitea-* PV affinity conflict from a
previous wedged install; trivy-system/node-collector NodeAffinity)
poison the autoscaler's "can the template node fit" check. Even with
this chart fix in place, the powerdns Deployment cannot become Ready
until either:
  (a) the worker autoscales successfully (gitea PV migrated / trivy
      taints relaxed), or
  (b) worker_count is bumped from 2 to 3 in the provisioning body, or
  (c) qa_worker_size is bumped to cpx42.

This chart fix ensures bp-powerdns survives a slow CNPG cold-start.
It does NOT fix a fundamentally undersized cluster. Coordinator next
step: reprov with worker_count=3 OR qa_worker_size=cpx42 + this chart
landed. Either should converge.

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
e3mrah 2026-05-12 02:13:34 +04:00 committed by GitHub
parent 569d780b86
commit ce76a7b7ab
No known key found for this signature in database
GPG Key ID: B5690EEEBB952194
4 changed files with 147 additions and 6 deletions

View File

@ -110,7 +110,21 @@ spec:
# Helm post-install hook failed, HR FAILED 4× → terminal.
# New deadline (14m) sits below the HR install.timeout cap of
# 15m so Flux's remediation can still reclaim a true failure.
version: 1.2.2
# 1.2.3 (Fix #144-followup, prov #37+#38 recurrence 2026-05-12):
# bumping activeDeadlineSeconds alone was insufficient — the Job
# hit BackoffLimitExceeded (NOT DeadlineExceeded) at ~10min
# because each container invocation curl'd a Service with empty
# Ready endpoints (powerdns Pods Pending behind a worker-capacity
# wedge that kept bp-cnpg's pdns-pg-1-initdb itself Pending).
# Container restartPolicy=OnFailure + backoffLimit=6 killed the
# Job long before activeDeadlineSeconds had any effect. Fix moves
# the wait-for-API loop INSIDE the container (restartPolicy=Never,
# bounded by new apiReadyTimeoutSeconds=600s) so one Pod owns
# the full 14m budget. Trace: in chroot prov #38, HR status
# message read "Helm install failed for release powerdns/powerdns
# with chart bp-powerdns@1.2.2: failed post-install: 1 error
# occurred: * job powerdns-zone-bootstrap failed: BackoffLimitExceeded".
version: 1.2.3
sourceRef:
kind: HelmRepository
name: bp-powerdns

View File

@ -1,6 +1,6 @@
apiVersion: v2
name: bp-powerdns
version: 1.2.2
version: 1.2.3
description: |
Catalyst-curated Blueprint wrapper for PowerDNS Authoritative.
Carries Catalyst-specific values.yaml + templates (CNPG cluster, dnsdist
@ -15,6 +15,18 @@ description: |
`helm dependency build` resolves it; values.yaml carries both the
catalystBlueprint metadata block and the upstream subchart values.
1.2.3 — Fix #144 recurrence (prov #37+#38 InstallFailed
BackoffLimitExceeded, 2026-05-12). Bumping activeDeadlineSeconds alone
(Fix #144) was insufficient: the post-install hook Job's container
exited within backoffLimit=6 (~10min wall-time) because curl against
http://powerdns:8081 hit a Service with empty Ready endpoints (powerdns
Deployment Pods Pending behind a CNPG initdb that itself waited for
worker capacity). The 14m deadline never got a chance — the Job died
at BackoffLimit. Fix moves the wait-for-API loop INSIDE the container
(restartPolicy: Never) so one Pod owns the full activeDeadlineSeconds
budget. New value apiReadyTimeoutSeconds (default 600s) bounds the
inner poll; surfaced via values.yaml per INVIOLABLE-PRINCIPLES #4.
Bumped to 1.2.0 — multi-zone bootstrap (issue #827, parent epic #825).
A franchised Sovereign now supports N parent zones, NOT one. New
values key `zones: []` declares the parent domains the operator

View File

@ -145,13 +145,40 @@ metadata:
spec:
backoffLimit: {{ .Values.zoneBootstrap.backoffLimit | default 6 }}
activeDeadlineSeconds: {{ .Values.zoneBootstrap.activeDeadlineSeconds | default 300 }}
# Fix #144-followup (recurrence on prov #37 + #38, 2026-05-12):
# The 14m activeDeadlineSeconds bumped by Fix #144 is harmless if the
# Job's container is allowed to RUN to that deadline. But this Job
# exited within ~6 backoffs (~10 minutes wall-time) and tripped
# `BackoffLimitExceeded` — NOT `DeadlineExceeded`. Each container
# invocation curl'd `http://powerdns:8081`, the in-cluster PowerDNS
# Service whose endpoints were empty because the powerdns Deployment
# Pods were Pending (worker CPU 99% requested; bp-cnpg `pdns-pg-1-initdb`
# also Pending behind same scheduling wedge; cluster-autoscaler in
# backoff after gitea/gitea-* PV affinity conflict). curl exited 7
# (connection refused) → container restarted → 6 backoffs → Job dead
# at ~10 minutes, well before activeDeadlineSeconds=840s.
#
# The Fix: move the readiness wait INTO the container — one pod, one
# long inner loop. See `command:` below. Combined with
# `restartPolicy: Never` (one chance, no Kubernetes-level backoff
# against backoffLimit) so the inner loop owns the wait budget,
# bounded by activeDeadlineSeconds. backoffLimit kept at 6 to recover
# from genuinely transient pod failures (image-pull retry, node
# eviction) without papering over a stuck Deployment.
template:
metadata:
labels:
{{- include "bp-powerdns.labels" . | nindent 8 }}
catalyst.openova.io/component: zone-bootstrap
spec:
restartPolicy: OnFailure
# Fix #144-followup: switched OnFailure → Never. The inner
# poll-for-API-ready loop now owns the wait budget (bounded by
# activeDeadlineSeconds + POWERDNS_API_READY_TIMEOUT_S env var).
# If the container itself crashes (image-pull retry, OS-level
# error), the Job's backoffLimit still triggers a new Pod —
# so transient failures recover. Steady-state retries for an
# un-Ready upstream stay inside the single Pod.
restartPolicy: Never
serviceAccountName: powerdns-zone-bootstrap
volumes:
- name: tmp
@ -199,12 +226,84 @@ spec:
secretKeyRef:
name: powerdns-api-credentials
key: api-key
# Fix #144-followup: inner-loop deadline for the wait-for-API
# poll. Default 600s = 10 min; surfaced via values.yaml so
# operators on slower clusters can raise it without forking
# the chart. Sits below activeDeadlineSeconds (840s) so the
# zone-creation phase has ≥240s of headroom after the API
# comes Ready.
- name: POWERDNS_API_READY_TIMEOUT_S
value: {{ .Values.zoneBootstrap.apiReadyTimeoutSeconds | default 600 | quote }}
command:
- /bin/sh
- -c
- |
set -eu
api="${POWERDNS_API}"
key="${POWERDNS_API_KEY}"
# ─── Wait for PowerDNS API to come Ready ──────────────────────
#
# Fix #144-followup (recurrence on prov #37+#38, 2026-05-12):
# The Helm post-install hook fires the moment the `powerdns`
# Deployment + Service manifests apply — but the Service has
# no Ready endpoints until the powerdns Pods themselves come
# up, and THOSE wait on bp-cnpg's `pdns-pg-app` Secret which
# only materialises after `pdns-pg-1-initdb` Pod schedules,
# runs, and completes. On a fresh Sovereign with the worker
# under capacity pressure (gitea+harbor+keycloak racing for
# CPU) that whole chain can take >10 minutes — longer than
# the Job's 6-backoff retry budget.
#
# Approach: one container run, one inner poll loop. We retry
# every 10s for up to .Values.zoneBootstrap.apiReadyTimeoutSeconds
# (default 600s = 10min, well within activeDeadlineSeconds=840s
# budget so there's still headroom for the zone-creation loop
# below). Authenticated GET /api/v1/servers returns:
# 200 — server up; proceed to bootstrap
# 401 — server up but key mismatch (FATAL, no retry)
# anything else — keep waiting
# curl's exit codes:
# 7 — connection refused (Service endpoints empty)
# 28 — operation timeout (network blip / pod warming)
# Both are retryable. Anything terminal (DNS resolution fail,
# which would mean Service object missing) we surface after
# the deadline.
ready_deadline_s="${POWERDNS_API_READY_TIMEOUT_S:-600}"
poll_interval_s=10
waited=0
echo "Waiting up to ${ready_deadline_s}s for PowerDNS API at ${api}/api/v1/servers"
while :; do
status=$(curl --silent --output /tmp/api-probe \
--write-out '%{http_code}' \
--max-time 5 \
-H "X-API-Key: ${key}" \
"${api}/api/v1/servers" 2>/dev/null || echo "000")
case "${status}" in
200)
echo " PowerDNS API ready (HTTP 200) after ${waited}s"
break
;;
401|403)
echo " PowerDNS API reachable but auth rejected (HTTP ${status}) — FATAL"
cat /tmp/api-probe 2>/dev/null || true
exit 1
;;
*)
# 000 = curl exit (refused/timeout); 5xx = upstream not ready
if [ "${waited}" -ge "${ready_deadline_s}" ]; then
echo " PowerDNS API did not become ready within ${ready_deadline_s}s (last status=${status}) — FATAL"
exit 1
fi
;;
esac
sleep "${poll_interval_s}"
waited=$((waited + poll_interval_s))
done
# Idempotent zone-creation loop.
#
# For each entry in .Values.zones (rendered into the
@ -232,9 +331,6 @@ spec:
# one round trip; GET-then-POST is two. Same outcome,
# half the cost.
api="${POWERDNS_API}"
key="${POWERDNS_API_KEY}"
create_zone() {
local name="$1"
local kind="$2"

View File

@ -540,3 +540,22 @@ zoneBootstrap:
# leave Flux waiting forever; a Job that fits inside the HR cap lets
# Flux's own remediation cycle reclaim the failure path).
activeDeadlineSeconds: 840
# apiReadyTimeoutSeconds — inner poll budget the container spends
# waiting for http://powerdns:8081/api/v1/servers to return HTTP 200
# before giving up.
#
# Recurrence of #144 on prov #37+#38 (2026-05-12) traced to
# BackoffLimitExceeded, NOT DeadlineExceeded: each container invocation
# curl'd the in-cluster PowerDNS Service whose endpoints stayed empty
# (powerdns Pods Pending behind a worker-capacity wedge + slow CNPG
# initdb). curl exited 7 (connection refused), the container restarted,
# 6 backoffs (~10min wall-time) tripped the Job's backoffLimit well
# before activeDeadlineSeconds=840s.
#
# New posture: one container run, one long inner loop. This timeout
# bounds the inner poll; restartPolicy is `Never` (the container owns
# the wait budget). 600s default sits below activeDeadlineSeconds=840s
# so the zone-creation phase retains >=240s of headroom AFTER the API
# comes Ready. Operators on slower clusters can raise this without
# forking the chart per docs/INVIOLABLE-PRINCIPLES.md #4.
apiReadyTimeoutSeconds: 600