fix(bp-cnpg): wait for webhook readiness so downstream Cluster CRs don't race (#1450)

* fix(infra): pass cp_private_ip to primary CP templatefile too

PR #1446 added cp_private_ip references in cloudinit-control-plane.tftpl
but only the SECONDARY templatefile call at main.tf:840 already had
that var threaded. The PRIMARY CP call at line 342 was missed and
tofu plan blew up with "vars map does not contain key cp_private_ip".

Set it to "10.0.1.2" for the primary (the hardcoded value the chart
default + worker_cloud_init already use for the canonical 10.0.1.0/24
primary subnet).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(hetzner): pass cp_private_ip into secondary-region CP cloud-init templatefile

prov #52-54 all failed at `tofu plan` once cloudinit-control-plane.tftpl
started consuming ${cp_private_ip} (PR #1446):

    Invalid value for "vars" parameter: vars map does not contain key
    "cp_private_ip", referenced at ./cloudinit-control-plane.tftpl:657,30-43.

The primary CP templatefile call (main.tf:342) and the secondary WORKER
templatefile call (main.tf:944) both pass `cp_private_ip`, but the
secondary CP templatefile call (main.tf:860) was missed, so every
multi-region provision since PR #1446 fails here at plan time.

Fix: thread `cp_private_ip = local.secondary_region_cp_ips[k]` into the
secondary CP templatefile so each secondary region's cilium-operator
reaches its OWN local CP (matching CA), not the primary across regions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(bp-cilium 1.3.4): kubeProxyReplacement true (BPF masq needs NodePort)

Worker cilium-agent on prov #55 (8d85a64cb8807cdc, 2026-05-12) crashloops:

  fatal: failed to start: daemon creation failed: unable to initialize
  BPF masquerade support: BPF masquerade requires NodePort
  (--enable-node-port="true")

Chart default kubeProxyReplacement=false leaves enable-node-port=false in
the rendered cilium-config ConfigMap. Combined with bpf.masquerade=true
(also default-on) the cilium-agent rejects the BPF masquerade datapath
on startup. CP cilium-agent survives because it was started by cloudinit
with the working pre-Flux values BEFORE Flux's helm-upgrade rolled the
ConfigMap. Every WORKER node that joins after Flux's upgrade sees the
new (broken) ConfigMap → CrashLoopBackOff → node.cilium.io/agent-not-
ready taint persists → every post-install Job pod (keycloak-config-cli,
powerdns, mimir, openbao) stays Pending → whole bootstrap-kit chain
stalls at ~60% Ready.

Cloud-init's pre-Flux Cilium install (cloudinit-control-plane.tftpl
write_files entry /var/lib/catalyst/cilium-values.yaml) already uses
kubeProxyReplacement: true. This change aligns the Flux HR overlay with
the working pre-Flux bootstrap so the agent config never regresses when
helm-controller does its first upgrade.

Bumps bp-cilium 1.3.3 → 1.3.4 and the bootstrap-kit overlay pin to match.
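
A minimal sketch of the values this fix aligns, for illustration only. The
top-level `cilium:` key is an assumption (it presumes bp-cilium wraps the
upstream chart as a subchart); the two settings themselves are the ones named
above.

```yaml
# Hypothetical excerpt of the bp-cilium values overlay in the Flux HelmRelease.
# The `cilium:` nesting is an assumption; drop it if values are passed to the
# upstream chart directly.
cilium:
  # Turns on kube-proxy replacement, and with it the NodePort support that
  # the BPF masquerade datapath requires.
  kubeProxyReplacement: true
  bpf:
    # eBPF-based masquerading. The agent refuses to start with this enabled
    # while NodePort support is off, which is the crash shown above.
    masquerade: true
```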

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(bp-cnpg): wait for webhook readiness so downstream Cluster CRs don't race

prov #55+#56 caught bp-harbor / bp-powerdns failing Helm install with:

  Internal error occurred: failed calling webhook "mcluster.cnpg.io":
  no endpoints available for service "cnpg-webhook-service"

Chain:
1. bp-cnpg installs with disableWait: true → HR goes Ready as soon as
   the manifests apply (operator pod still starting up).
2. Flux releases dependents (bp-harbor, bp-powerdns) — they pass the
   dependsOn check on bp-cnpg.
3. Downstream chart renders postgresql.cnpg.io/v1.Cluster CRs.
4. cnpg mutating webhook (Service cnpg-webhook-service) has no endpoints
   yet → admission webhook call fails → Helm install fails →
   RetriesExceeded → entire DB-backed chain wedges.
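
The dependency edge in step 2 can be sketched as a consumer HelmRelease. Only
the `dependsOn` relationship comes from this message; the metadata, interval
and chart/source names are illustrative assumptions.

```yaml
# Hypothetical consumer HelmRelease (names and interval are assumptions).
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: bp-harbor
  namespace: flux-system
spec:
  interval: 15m
  # Flux holds this release until bp-cnpg reports Ready. With
  # disableWait: true on bp-cnpg, Ready can arrive before the webhook
  # Service has any endpoints, so this gate opens too early.
  dependsOn:
    - name: bp-cnpg
  chart:
    spec:
      chart: bp-harbor
      sourceRef:
        kind: HelmRepository
        name: bp-harbor
        namespace: flux-system
```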

Carve bp-cnpg out of the disableWait: true blanket.
INVIOLABLE-PRINCIPLES #3's "event-driven install" rationale (avoiding the
agent-waits-for-its-own-CRDs deadlock, see bp-cilium) does NOT apply
to CNPG: CNPG's CRDs are loaded by helm-controller BEFORE pods schedule,
so Helm-wait blocks only on pod readiness, not on a self-referencing CRD.

With this change bp-cnpg's HR stays Reconciling until the
cnpg-controller-manager deployment is rolled out and Available and
cnpg-webhook-service has endpoints, so Flux dependsOn gates downstream
consumers behind a webhook that is actually serving.
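
For reference, a sketch of the resulting install/upgrade stanzas in the
bp-cnpg HelmRelease, assuming nothing else in the spec changes. Leaving
`disableWait` unset relies on its default of false.

```yaml
# Fragment of the bp-cnpg HelmRelease spec after this change (sketch).
spec:
  install:
    timeout: 15m
    # disableWait is omitted, so it defaults to false and Helm waits for
    # the released resources to become ready before the HR goes Ready.
    remediation:
      retries: 3
  upgrade:
    timeout: 15m
    # Same carve-out on upgrade: wait for readiness.
    remediation:
      retries: 3
```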

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
e3mrah 2026-05-12 22:23:04 +04:00 committed by GitHub
parent fb563e9fd6
commit 855e106d87
No known key found for this signature in database
GPG Key ID: B5690EEEBB952194

@@ -62,14 +62,28 @@ spec:
         kind: HelmRepository
         name: bp-cnpg
         namespace: flux-system
-  # Event-driven install per docs/INVIOLABLE-PRINCIPLES.md #3.
+  # CNPG: KEEP Helm wait (disableWait: false / default). Consumers
+  # bp-harbor + bp-powerdns + bp-keycloak + bp-gitea apply
+  # postgresql.cnpg.io/v1.Cluster CRs gated by the cnpg mutating webhook
+  # `mcluster.cnpg.io`. If bp-cnpg's HelmRelease goes Ready before the
+  # cnpg-webhook-service has endpoints, Flux dependsOn lets downstream
+  # HRs proceed → their Cluster CR apply gets:
+  # "failed calling webhook \"mcluster.cnpg.io\": no endpoints
+  # available for service \"cnpg-webhook-service\""
+  # → Helm install fails → RetriesExceeded → entire DB-backed chain
+  # (Harbor/PowerDNS/Keycloak/Gitea) wedges. Caught on prov #55/#56
+  # (2026-05-12). disableWait: false (the default) tells Helm to block
+  # the HR's Ready until the webhook deployment is rolled and the
+  # service has endpoints, which is exactly what downstream consumers
+  # need. This is the carve-out from the INVIOLABLE-PRINCIPLES #3
+  # event-driven blanket — the rule's WHY (avoiding agent-waits-for-
+  # its-own-CRDs cilium-style deadlock) does NOT apply here because
+  # bp-cnpg's CRDs are loaded by helm-controller before pods schedule.
   install:
     timeout: 15m
-    disableWait: true
     remediation:
       retries: 3
   upgrade:
     timeout: 15m
-    disableWait: true
     remediation:
       retries: 3