Fix #144 raised zoneBootstrap.activeDeadlineSeconds 300s → 840s after
prov #22 hit a 5m DeadlineExceeded on the bp-powerdns post-install
hook. That fix was insufficient: prov #37 + #38 (chroot omantel.biz,
2026-05-12) both wedged on the SAME chart slot with
`BackoffLimitExceeded`, NOT `DeadlineExceeded`. The deadline never got
a chance to fire.

Trace from prov #38 chroot
(`KUBECONFIG=/tmp/prov38.kubeconfig kubectl get hr bp-powerdns -o yaml`):

  status: Helm install failed for release powerdns/powerdns with chart
    bp-powerdns@1.2.2: failed post-install: 1 error occurred:
    * job powerdns-zone-bootstrap failed: BackoffLimitExceeded

  Pod events for powerdns-zone-bootstrap-tq7qq:
    59m  Started container zone-bootstrap
    56m  Back-off restarting failed container zone-bootstrap
    55m  Job has reached the specified backoff limit

Root cause walked end-to-end (per CLAUDE.md TRACE rule):

  TEST: bp-powerdns HR Ready=True
    ↑ HR: Helm install succeeds (post-install Job exits 0)
    ↑ Zone-bootstrap Job: curl POST succeeds
    ↑ powerdns:8081 Service: reachable (has Ready endpoints)
    ↑ powerdns Deployment: Pods Ready (3 replicas)  ← Pending, blocked here
    ↑ CNPG cluster: pdns-pg-app Secret exists
    ↑ pdns-pg-1-initdb Pod: scheduled, Running, Completed  ← Pending too
    ↑ Worker node has capacity  ← 99% CPU requested

The zone-bootstrap container curl'd `http://powerdns:8081`, hit
"connection refused" (empty Service endpoints), exited 7, container
restarted under `restartPolicy: OnFailure`. After 6 Kubernetes-level
backoffs (≈10min wall-time with exponential delay), the Job declared
`BackoffLimitExceeded`, well before activeDeadlineSeconds=840s (14min)
could even consider firing.

Fix #144 was directionally right (the upstream IS slow on cold k3s) but
operated on the wrong knob. The container's outer-loop retry budget is
bounded by backoffLimit × backoff-delay, not by activeDeadlineSeconds.
Bumping only the deadline left the BackoffLimit ceiling unchanged.

Architectural fix (this commit):

  1. Move the wait-for-API loop INSIDE the container (one Pod, one
     inner poll loop, restartPolicy=Never). The inner loop polls
     GET /api/v1/servers every 10s until HTTP 200, bounded by new
     `apiReadyTimeoutSeconds` (default 600s = 10min). Now ONE container
     run owns the full wait budget instead of N short-lived containers
     racing the backoff timer.

  2. restartPolicy: OnFailure → Never. The container script handles its
     own retry; Kubernetes-level backoff is reserved for genuinely
     transient pod failures (image-pull, OS eviction) where the
     Job-level backoffLimit=6 still triggers a fresh Pod.

  3. Surface POWERDNS_API_READY_TIMEOUT_S env var so operators on
     slower clusters can raise the inner deadline without forking the
     chart (per docs/INVIOLABLE-PRINCIPLES.md #4).

  4. New value `zoneBootstrap.apiReadyTimeoutSeconds` (default 600s).
     Sits below activeDeadlineSeconds (840s) so the zone-creation phase
     keeps ≥240s of headroom AFTER the API comes Ready.

Curl status handling in the wait loop:

  200          → API up, proceed to bootstrap
  401|403      → auth failure, FATAL (no retry; operator misconfig)
  000|5xx|...  → transient, sleep & retry until inner deadline

Files changed:

  - platform/powerdns/chart/Chart.yaml
      1.2.2 → 1.2.3 + history
  - platform/powerdns/chart/values.yaml
      + apiReadyTimeoutSeconds knob
  - platform/powerdns/chart/templates/zone-bootstrap-job.yaml
      inner wait-for-API loop; restartPolicy: Never
  - clusters/_template/bootstrap-kit/11-powerdns.yaml
      pin to 1.2.3 + HR comment

Why this is sufficient where Fix #144 was not:

Fix #144 worked the chart-level deadline. This commit works the
inner-loop ownership: the wait budget is now owned by the script inside
the container, not by the Job spec arithmetic (backoffLimit ×
backoff-delay). The Job's outer activeDeadlineSeconds still caps the
worst-case runtime (no runaway poll), but the script now actually GETS
to use it.
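A minimal sketch of the inner wait loop described above, assuming a POSIX shell with curl available in the container image. Only POWERDNS_API_READY_TIMEOUT_S, the 10s poll interval, the GET /api/v1/servers probe, and the status-code handling come from this commit message. The function names, POWERDNS_API_URL, POWERDNS_API_KEY, POLL_INTERVAL_S, the X-API-Key header, and the return codes are illustrative assumptions, not the chart's actual script.

```shell
#!/bin/sh
# Hypothetical names throughout; see the assumptions listed above.
API_URL="${POWERDNS_API_URL:-http://powerdns:8081}"
TIMEOUT_S="${POWERDNS_API_READY_TIMEOUT_S:-600}"
POLL_INTERVAL_S="${POLL_INTERVAL_S:-10}"

# Map an HTTP status code to the action listed in the commit message.
classify_status() {
  case "$1" in
    200)     echo ready ;;   # API up, proceed to bootstrap
    401|403) echo fatal ;;   # auth failure, do not retry
    *)       echo retry ;;   # 000, 5xx, anything else: transient
  esac
}

# Probe the API once and print the HTTP status code. curl prints 000 when
# no HTTP response was received (for example, connection refused).
probe_api() {
  curl -s -o /dev/null --max-time 5 -w '%{http_code}' \
    -H "X-API-Key: ${POWERDNS_API_KEY:-}" \
    "${API_URL}/api/v1/servers" || true
}

# Poll until the API is ready, an auth failure occurs, or the budget runs
# out. Elapsed time is approximated by summing the sleep intervals, not
# measured from the wall clock.
# Returns 0 when ready, 1 on timeout, 2 on auth failure.
wait_for_api() {
  elapsed=0
  while :; do
    status="$(probe_api)"
    case "$(classify_status "$status")" in
      ready) echo "API ready after ${elapsed}s"; return 0 ;;
      fatal) echo "auth failure (HTTP ${status}), not retrying" >&2; return 2 ;;
    esac
    if [ "$elapsed" -ge "$TIMEOUT_S" ]; then
      echo "API not ready after ${TIMEOUT_S}s (last status ${status})" >&2
      return 1
    fi
    sleep "$POLL_INTERVAL_S"
    elapsed=$((elapsed + POLL_INTERVAL_S))
  done
}
```

A caller would run `wait_for_api || exit $?` before the zone-creation requests. Because the whole wait happens inside one container run, a non-zero exit under restartPolicy: Never fails the Pod once instead of consuming one backoff slot per connection-refused attempt.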
Verification:

  - helm template renders cleanly (deps build OK, empty-zones
    short-circuit preserved, non-empty zones render Job + RBAC +
    Audit CM)
  - kubectl create --dry-run=client --validate=false: 5/5 resources
    created (sa, role, rb, cm, job)
  - chart 1.2.3 pinned in
    clusters/_template/bootstrap-kit/11-powerdns.yaml

Companion infrastructure note (NOT addressed by this commit, flagged
for Coordinator):

The DEEPER bottom of the trace stack is worker capacity. Prov #38's
single cpx32 worker (8 vCPU / 16 GB) is at 99% CPU requested. The
cluster-autoscaler attempted 2→3 scale-up but is in backoff because two
unscheduled pods (gitea/gitea-* PV affinity conflict from a previous
wedged install; trivy-system/node-collector NodeAffinity) poison the
autoscaler's "can the template node fit" check. Even with this chart
fix in place, the powerdns Deployment cannot become Ready until either:

  (a) the worker autoscales successfully (gitea PV migrated / trivy
      taints relaxed), or
  (b) worker_count is bumped from 2 to 3 in the provisioning body, or
  (c) qa_worker_size is bumped to cpx42.

This chart fix ensures bp-powerdns survives a slow CNPG cold-start. It
does NOT fix a fundamentally undersized cluster.

Coordinator next step: reprov with worker_count=3 OR
qa_worker_size=cpx42 + this chart landed. Either should converge.

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
54 lines
2.7 KiB
YAML
apiVersion: v2
name: bp-powerdns
version: 1.2.3
description: |
  Catalyst-curated Blueprint wrapper for PowerDNS Authoritative.
  Carries Catalyst-specific values.yaml + templates (CNPG cluster, dnsdist
  companion, Traefik api-ingress, placeholder anycast endpoint, multi-zone
  bootstrap Job) on top of the upstream Helm chart. The bootstrap-kit
  installer (products/catalyst/bootstrap/api/internal/bootstrap/) reads
  the upstream chart reference from values.yaml's catalystBlueprint
  metadata block and applies the values overlay at helm install time.

  Mirrors the bp-cilium / bp-keycloak / bp-cert-manager wrapper shape:
  Chart.yaml lists the upstream chart as a Helm dependency so
  `helm dependency build` resolves it; values.yaml carries both the
  catalystBlueprint metadata block and the upstream subchart values.

  1.2.3: Fix #144 recurrence (prov #37+#38 InstallFailed
  BackoffLimitExceeded, 2026-05-12). Bumping activeDeadlineSeconds alone
  (Fix #144) was insufficient: the post-install hook Job's container
  exited within backoffLimit=6 (~10min wall-time) because curl against
  http://powerdns:8081 hit a Service with empty Ready endpoints (powerdns
  Deployment Pods Pending behind a CNPG initdb that itself waited for
  worker capacity). The 14m deadline never got a chance; the Job died
  at BackoffLimit. Fix moves the wait-for-API loop INSIDE the container
  (restartPolicy: Never) so one Pod owns the full activeDeadlineSeconds
  budget. New value apiReadyTimeoutSeconds (default 600s) bounds the
  inner poll; surfaced via values.yaml per INVIOLABLE-PRINCIPLES #4.

  Bumped to 1.2.0: multi-zone bootstrap (issue #827, parent epic #825).
  A franchised Sovereign now supports N parent zones, NOT one. New
  values key `zones: []` declares the parent domains the operator
  brings to the Sovereign at signup; a Helm post-install/post-upgrade
  hook Job (templates/zone-bootstrap-job.yaml) creates each one in
  PowerDNS via the REST API. Idempotent on HTTP 409. Pairs with the
  per-zone wildcard Certificate templates in bp-catalyst-platform 1.4.0+
  and the post-provision /api/v1/sovereign/parent-domains/<name>/zone
  endpoint in catalyst-api (issue #829).
type: application
keywords: [catalyst, blueprint, powerdns, dns, dnssec, lua-records, dnsdist]
maintainers:
  - name: OpenOva Catalyst
    email: catalyst@openova.io

# Upstream chart pulled in as a Helm subchart so `helm dependency build`
# bundles it into the OCI artifact. Pinned to pschichtel/powerdns 0.10.0
# (verified publisher on Artifact Hub, tracks docker.io/powerdns/pdns-auth-50
# at appVersion 5.0.3; see values.yaml `catalystBlueprint.upstream` for
# the rationale).
dependencies:
  - name: powerdns
    version: "0.10.0"
    repository: "https://schich.tel/helm-charts"