fix(bp-cnpg): wait for webhook readiness so downstream Cluster CRs don't race (#1450)
* fix(infra): pass cp_private_ip to primary CP templatefile too

PR #1446 added cp_private_ip references in cloudinit-control-plane.tftpl but only the SECONDARY templatefile call at main.tf:840 already had that var threaded. The PRIMARY CP call at line 342 was missed and tofu plan blew up with "vars map does not contain key cp_private_ip". Set it to "10.0.1.2" for the primary (the hardcoded value the chart default + worker_cloud_init already use for the canonical 10.0.1.0/24 primary subnet).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(hetzner): pass cp_private_ip into secondary-region CP cloud-init templatefile

prov #52-54 all failed at `tofu plan` once cloudinit-control-plane.tftpl started consuming ${cp_private_ip} (PR #1446):

    Invalid value for "vars" parameter: vars map does not contain key
    "cp_private_ip", referenced at ./cloudinit-control-plane.tftpl:657,30-43.

The primary CP templatefile call (main.tf:342) and the secondary WORKER templatefile call (main.tf:944) both pass `cp_private_ip`, but the secondary CP templatefile call (main.tf:860) was missed, so every multi-region provision since PR #1446 lands here at plan-time.

Fix: thread `cp_private_ip = local.secondary_region_cp_ips[k]` into the secondary CP templatefile so each secondary region's cilium-operator reaches its OWN local CP (matching CA), not the primary across regions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(bp-cilium 1.3.4): kubeProxyReplacement true (BPF masq needs NodePort)

Worker cilium-agent on prov #55 (8d85a64cb8807cdc, 2026-05-12) crashloops:

    fatal: failed to start: daemon creation failed: unable to initialize
    BPF masquerade support: BPF masquerade requires NodePort
    (--enable-node-port="true")

Chart default kubeProxyReplacement=false leaves enable-node-port=false in the rendered cilium-config ConfigMap. Combined with bpf.masquerade=true (also default-on) the cilium-agent rejects the BPF masquerade datapath on startup.

CP cilium-agent survives because it was started by cloudinit with the working pre-Flux values BEFORE Flux's helm-upgrade rolled the ConfigMap. Every WORKER node that joins after Flux's upgrade sees the new (broken) ConfigMap → CrashLoopBackOff → node.cilium.io/agent-not-ready taint persists → every post-install Job pod (keycloak-config-cli, powerdns, mimir, openbao) stays Pending → whole bootstrap-kit chain stalls at ~60% Ready.

Cloud-init's pre-Flux Cilium install (cloudinit-control-plane.tftpl write_files entry /var/lib/catalyst/cilium-values.yaml) already uses kubeProxyReplacement: true. This change aligns the Flux HR overlay with the working pre-Flux bootstrap so the agent config never regresses when helm-controller does its first upgrade.

Bumps bp-cilium 1.3.3 → 1.3.4 and the bootstrap-kit overlay pin to match.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(bp-cnpg): wait for webhook readiness so downstream Cluster CRs don't race

prov #55+#56 caught bp-harbor / bp-powerdns failing Helm install with:

    Internal error occurred: failed calling webhook "mcluster.cnpg.io":
    no endpoints available for service "cnpg-webhook-service"

Chain:
1. bp-cnpg install with disableWait: true → HR goes Ready immediately when manifests apply (operator pod still spinning up).
2. Flux releases dependents (bp-harbor, bp-powerdns); they pass the dependsOn check on bp-cnpg.
3. Downstream chart renders postgresql.cnpg.io/v1.Cluster CRs.
4. cnpg mutating webhook (Service cnpg-webhook-service) has no endpoints yet → admission webhook call fails → Helm install fails → RetriesExceeded → entire DB-backed chain wedges.

Carve out the disableWait: true blanket for bp-cnpg specifically. INVIOLABLE-PRINCIPLES #3's "event-driven install" rationale (avoid the agent-waits-for-its-own-CRDs deadlock; see bp-cilium) does NOT apply to CNPG: CNPG's CRDs are loaded by helm-controller BEFORE pods schedule, so Helm-wait blocks only on pod readiness, not on a self-referencing CRD.

With this change bp-cnpg's HR stays Reconciling until cnpg-controller-manager + cnpg-webhook-service are both rolled + Available, so Flux dependsOn correctly gates downstream consumers behind a webhook that's actually serving.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
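
For the bp-cilium 1.3.4 change described in the message, the relevant Helm values would look roughly like the sketch below. The two keys are the upstream Cilium chart's value names. Where they sit inside the bp-cilium overlay (for example nested under a subchart key) is not shown in this commit, so the top-level placement here is an assumption.

```yaml
# Sketch only: placement within the bp-cilium overlay is assumed.
# kubeProxyReplacement: true is what the message says turns on NodePort
# handling, which the BPF masquerade datapath requires at agent startup.
kubeProxyReplacement: true
bpf:
  # Already default-on per the commit message; shown for context.
  masquerade: true
```

This mirrors what the message says cloud-init already writes to /var/lib/catalyst/cilium-values.yaml, so the agent config stays the same before and after helm-controller's first upgrade.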
parent fb563e9fd6
commit 855e106d87
@@ -62,14 +62,28 @@ spec:
         kind: HelmRepository
         name: bp-cnpg
         namespace: flux-system
-  # Event-driven install per docs/INVIOLABLE-PRINCIPLES.md #3.
+  # CNPG: KEEP Helm wait (disableWait: false / default). Consumers
+  # bp-harbor + bp-powerdns + bp-keycloak + bp-gitea apply
+  # postgresql.cnpg.io/v1.Cluster CRs gated by the cnpg mutating webhook
+  # `mcluster.cnpg.io`. If bp-cnpg's HelmRelease goes Ready before the
+  # cnpg-webhook-service has endpoints, Flux dependsOn lets downstream
+  # HRs proceed → their Cluster CR apply gets:
+  # "failed calling webhook \"mcluster.cnpg.io\": no endpoints
+  # available for service \"cnpg-webhook-service\""
+  # → Helm install fails → RetriesExceeded → entire DB-backed chain
+  # (Harbor/PowerDNS/Keycloak/Gitea) wedges. Caught on prov #55/#56
+  # (2026-05-12). disableWait: false (the default) tells Helm to block
+  # the HR's Ready until the webhook deployment is rolled and the
+  # service has endpoints, which is exactly what downstream consumers
+  # need. This is the carve-out from the INVIOLABLE-PRINCIPLES #3
+  # event-driven blanket — the rule's WHY (avoiding agent-waits-for-
+  # its-own-CRDs cilium-style deadlock) does NOT apply here because
+  # bp-cnpg's CRDs are loaded by helm-controller before pods schedule.
   install:
     timeout: 15m
-    disableWait: true
     remediation:
       retries: 3
   upgrade:
     timeout: 15m
-    disableWait: true
     remediation:
       retries: 3
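
The gating this change relies on lives on the consumer side: a downstream HelmRelease names bp-cnpg in `dependsOn`, and Flux holds it back until bp-cnpg reports Ready. A minimal sketch of such a consumer follows. The apiVersion, interval, and the chart and repository names are assumptions for illustration, since the consumer manifests are not part of this commit.

```yaml
# Hypothetical consumer HelmRelease; only the dependsOn relationship
# is taken from the commit message, everything else is illustrative.
apiVersion: helm.toolkit.fluxcd.io/v2   # assumed API version
kind: HelmRelease
metadata:
  name: bp-harbor
  namespace: flux-system
spec:
  interval: 10m                         # assumed value
  dependsOn:
    # With Helm wait kept on bp-cnpg, Ready should now imply the
    # cnpg webhook Deployment is rolled out and its Service has endpoints.
    - name: bp-cnpg
  chart:
    spec:
      chart: bp-harbor                  # assumed chart name
      sourceRef:
        kind: HelmRepository
        name: bp-harbor                 # assumed repository name
        namespace: flux-system
```

Because `dependsOn` only checks the dependency's Ready condition, the fix has to be on bp-cnpg itself: with `disableWait: true`, Ready fires as soon as the manifests apply, so the consumer's Cluster CR can still hit a webhook with no endpoints.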