fix(bp-cnpg): wait for webhook readiness so downstream Cluster CRs don't race (#1450)

* fix(infra): pass cp_private_ip to primary CP templatefile too

  PR #1446 added cp_private_ip references in
  cloudinit-control-plane.tftpl but only the SECONDARY templatefile
  call at main.tf:840 already had that var threaded. The PRIMARY CP
  call at line 342 was missed and tofu plan blew up with "vars map does
  not contain key cp_private_ip". Set it to "10.0.1.2" for the primary
  (the hardcoded value the chart default + worker_cloud_init already
  use for the canonical 10.0.1.0/24 primary subnet).

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(hetzner): pass cp_private_ip into secondary-region CP cloud-init templatefile

  prov #52-54 all failed at `tofu plan` once
  cloudinit-control-plane.tftpl started consuming ${cp_private_ip}
  (PR #1446):

    Invalid value for "vars" parameter: vars map does not contain key
    "cp_private_ip", referenced at
    ./cloudinit-control-plane.tftpl:657,30-43.

  The primary CP templatefile call (main.tf:342) and the secondary
  WORKER templatefile call (main.tf:944) both pass `cp_private_ip`, but
  the secondary CP templatefile call (main.tf:860) was missed, so every
  multi-region provision since PR #1446 lands here at plan-time.

  Fix: thread `cp_private_ip = local.secondary_region_cp_ips[k]` into
  the secondary CP templatefile so each secondary region's
  cilium-operator reaches its OWN local CP (matching CA), not the
  primary across regions.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(bp-cilium 1.3.4): kubeProxyReplacement true (BPF masq needs NodePort)

  Worker cilium-agent on prov #55 (8d85a64cb8807cdc, 2026-05-12)
  crashloops:

    fatal: failed to start: daemon creation failed: unable to
    initialize BPF masquerade support: BPF masquerade requires NodePort
    (--enable-node-port="true")

  Chart default kubeProxyReplacement=false leaves enable-node-port=false
  in the rendered cilium-config ConfigMap. Combined with
  bpf.masquerade=true (also default-on) the cilium-agent rejects the
  BPF masquerade datapath on startup.

  CP cilium-agent survives because it was started by cloudinit with the
  working pre-Flux values BEFORE Flux's helm-upgrade rolled the
  ConfigMap. Every WORKER node that joins after Flux's upgrade sees the
  new (broken) ConfigMap → CrashLoopBackOff →
  node.cilium.io/agent-not-ready taint persists → every post-install
  Job pod (keycloak-config-cli, powerdns, mimir, openbao) stays Pending
  → whole bootstrap-kit chain stalls at ~60% Ready.

  Cloud-init's pre-Flux Cilium install (cloudinit-control-plane.tftpl
  write_files entry /var/lib/catalyst/cilium-values.yaml) already uses
  kubeProxyReplacement: true. This change aligns the Flux HR overlay
  with the working pre-Flux bootstrap so the agent config never
  regresses when helm-controller does its first upgrade.

  Bumps bp-cilium 1.3.3 → 1.3.4 and the bootstrap-kit overlay pin to
  match.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(bp-cnpg): wait for webhook readiness so downstream Cluster CRs don't race

  prov #55+#56 caught bp-harbor / bp-powerdns failing Helm install
  with:

    Internal error occurred: failed calling webhook "mcluster.cnpg.io":
    no endpoints available for service "cnpg-webhook-service"

  Chain:

  1. bp-cnpg install with disableWait: true → HR goes Ready immediately
     when manifests apply (operator pod still spinning up).
  2. Flux releases dependents (bp-harbor, bp-powerdns); they pass the
     dependsOn check on bp-cnpg.
  3. Downstream chart renders postgresql.cnpg.io/v1.Cluster CRs.
  4. cnpg mutating webhook (Service cnpg-webhook-service) has no
     endpoints yet → admission webhook call fails → Helm install fails
     → RetriesExceeded → entire DB-backed chain wedges.

  Carve out the disableWait: true blanket for bp-cnpg specifically.
  INVIOLABLE-PRINCIPLES #3's "event-driven install" rationale (avoid
  the agent-waits-for-its-own-CRDs deadlock; see bp-cilium) does NOT
  apply to CNPG: CNPG's CRDs are loaded by helm-controller BEFORE pods
  schedule, so Helm-wait blocks only on pod readiness, not on a
  self-referencing CRD.

  With this change bp-cnpg's HR stays Reconciling until
  cnpg-controller-manager + cnpg-webhook-service are both rolled +
  Available, so Flux dependsOn correctly gates downstream consumers
  behind a webhook that's actually serving.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 22:23:04 +04:00 · 2026-05-12 22:23:04 +04:00 · 855e106d87
commit 855e106d87
parent fb563e9fd6
1 changed file with 17 additions and 3 deletions
--- a/clusters/_template/bootstrap-kit/16-cnpg.yaml
+++ b/clusters/_template/bootstrap-kit/16-cnpg.yaml
@@ -62,14 +62,28 @@ spec:
        kind: HelmRepository
        name: bp-cnpg
        namespace: flux-system
-  # Event-driven install per docs/INVIOLABLE-PRINCIPLES.md #3.
+  # CNPG: KEEP Helm wait (disableWait: false / default). Consumers
+  # bp-harbor + bp-powerdns + bp-keycloak + bp-gitea apply
+  # postgresql.cnpg.io/v1.Cluster CRs gated by the cnpg mutating webhook
+  # `mcluster.cnpg.io`. If bp-cnpg's HelmRelease goes Ready before the
+  # cnpg-webhook-service has endpoints, Flux dependsOn lets downstream
+  # HRs proceed → their Cluster CR apply gets:
+  #   "failed calling webhook \"mcluster.cnpg.io\": no endpoints
+  #    available for service \"cnpg-webhook-service\""
+  # → Helm install fails → RetriesExceeded → entire DB-backed chain
+  # (Harbor/PowerDNS/Keycloak/Gitea) wedges. Caught on prov #55/#56
+  # (2026-05-12). disableWait: false (the default) tells Helm to block
+  # the HR's Ready until the webhook deployment is rolled and the
+  # service has endpoints, which is exactly what downstream consumers
+  # need. This is the carve-out from the INVIOLABLE-PRINCIPLES #3
+  # event-driven blanket — the rule's WHY (avoiding agent-waits-for-
+  # its-own-CRDs cilium-style deadlock) does NOT apply here because
+  # bp-cnpg's CRDs are loaded by helm-controller before pods schedule.
  install:
    timeout: 15m
-    disableWait: true
    remediation:
      retries: 3
  upgrade:
    timeout: 15m
-    disableWait: true
    remediation:
      retries: 3
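
For context, the gating this change relies on can be sketched from the consumer side. The fragment below is a hypothetical downstream HelmRelease, not taken from the repository: the `apiVersion`, `interval`, namespace, chart name and `sourceRef` are assumptions for illustration; only the `bp-harbor` / `bp-cnpg` release names, the `dependsOn` relationship and the timeout/retry values come from the commit message and diff above.

```yaml
# Hypothetical consumer of bp-cnpg (illustrative only).
apiVersion: helm.toolkit.fluxcd.io/v2   # assumed Flux HelmRelease API version
kind: HelmRelease
metadata:
  name: bp-harbor
  namespace: flux-system                # assumed namespace
spec:
  interval: 15m                         # assumed reconcile interval
  # helm-controller holds this release until bp-cnpg reports Ready.
  # With Helm wait left enabled on bp-cnpg, Ready implies the operator
  # Deployment has rolled out, so cnpg-webhook-service should have
  # endpoints before any Cluster CR from this chart is applied.
  dependsOn:
    - name: bp-cnpg
  chart:
    spec:
      chart: bp-harbor                  # assumed chart name
      sourceRef:
        kind: HelmRepository
        name: bp-harbor                 # assumed repository name
        namespace: flux-system
  install:
    timeout: 15m
    remediation:
      retries: 3
```

If bp-cnpg still set `disableWait: true`, the same `dependsOn` entry would be satisfied as soon as the manifests applied, which is the race described in the commit message.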