openova

Author	SHA1	Message	Date
e3mrah	c148ec6a34	fix(cloudinit): escape $${ORG_EMAIL:-}/$${ORG_NAME:-} in comment (D22) (#1575) PR #1571 added a comment mentioning the ${ORG_EMAIL:-}/${ORG_NAME:-} slot-file placeholders WITHOUT the $$ escape. tofu's templatefile() parses comments and tried to interpolate ${ORG_EMAIL:-} as a tofu expression, failing with "Extra characters after interpolation expression; Template interpolation doesn't expect a colon". Caught live on t133 (fad01d84f5655004): tofu plan failed in 30s. The escape pattern is documented at main.tf:1029 (the same warning that caught t127 the day before). The $$ prefix tells tofu's templatefile() to emit a literal ${...} to cloud-init for Flux envsubst. Co-authored-by: hatiyildiz <hatice.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-17 02:31:26 +04:00
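The escaping rule from c148ec6a34 can be sketched as a small lint-and-fix helper. This is an illustration in Python, not code from the repository: `find_unescaped` and `escape_for_templatefile` are hypothetical names, and the sketch only looks at `${` sequences (it does not handle `%{` template directives).

```python
import re

# Matches "${" that is not already preceded by "$" (i.e. not yet escaped).
_UNESCAPED = re.compile(r"(?<!\$)\$\{")


def find_unescaped(text: str) -> list[int]:
    """Return the offsets of every unescaped "${" in a template line."""
    return [m.start() for m in _UNESCAPED.finditer(text)]


def escape_for_templatefile(text: str) -> str:
    """Rewrite "${" as "$${" so the template renders a literal "${"."""
    return _UNESCAPED.sub(lambda m: "$${", text)
```

A check like `find_unescaped` could run over comment lines in a `.tftpl` file before `tofu plan`, which is where both the t127 and t133 failures surfaced.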
e3mrah	57939585c0	feat(cloudinit): wire ORG_EMAIL/ORG_NAME/GITOPS_REPO_URL substitutes (D22) (#1571 ) * feat(chart): wire OPERATOR_EMAIL/CONTROL_PLANE_IP/GITOPS_REPO_URL/ORG_NAME (D22) Companion to PR #1567 + #1568 — wire the env vars chrootEnsureDeployment reads to populate the deployment record so Sovereign Console Settings page renders real values for ownerEmail, controlPlaneIP, gitopsRepoURL, orgName (instead of `—` placeholders). Adds 4 new keys to the sovereign-fqdn ConfigMap (orgEmail, orgName, controlPlaneIP, gitopsRepoURL) sourced from .Values.sovereign.* with empty defaults. Per-Sovereign overlays wire actual values from cloud- init substitute placeholders (mirrors regionsJson pattern). Catalyst-api Pod now reads them via valueFrom configMapKeyRef + optional=true (Catalyst-Zero/contabo emits no sovereign-fqdn ConfigMap so env stays empty there — correct, mothership is signer not validator). Validated: t132 already serves region=hel1, consoleURL, loadBalancerIP post-#1568. This PR fills the remaining 3 D22 fields when operator wires the values. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore(slot-13): add D22 sovereign-side identity placeholders Add ${ORG_EMAIL:-} + ${ORG_NAME:-} + ${SOVEREIGN_CONTROL_PLANE_IP:-} + ${GITOPS_REPO_URL:-} envsubst placeholders so when cloud-init wires them, the chart picks them up via sovereign-fqdn ConfigMap (PR #1569) → catalyst-api env → chrootEnsureDeployment populates the deployment record → Settings page renders real values instead of `—`. This PR alone is a no-op (placeholders default to empty, same as today). The cloud-init substitute lines + provisioner.go tfvars need to land in a companion PR to actually populate the values on next-prov. 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(cloudinit): wire ORG_EMAIL/ORG_NAME/GITOPS_REPO_URL substitutes (D22) Companion to #1567+#1568+#1569+#1570 — the cloud-init substitute block now emits ORG_EMAIL/ORG_NAME/GITOPS_REPO_URL into the bootstrap-kit Kustomization's postBuild.substitute env, which the slot-13 placeholders (#1570) consume via ${ORG_EMAIL:-}/${ORG_NAME:-}/${GITOPS_REPO_URL:-}. Chain: provisioner.go writeTfvars → tofu vars → cloudinit templatefile substitute → Flux Kustomization postBuild → sovereign-fqdn ConfigMap keys (#1569) → catalyst-api env (#1569) → chrootEnsureDeployment populates the deployment record (#1567 + #1568 fallback). SOVEREIGN_CONTROL_PLANE_IP omitted intentionally — main.tf:691 notes the dependency cycle (hcloud_server.cp doesn't exist at cloudinit render time). Separate PR will source it via metadata-service or post-create ConfigMap patch. Next-prov (t133+) Sovereign Console Settings page now renders real ownerEmail/orgName/gitopsRepoURL instead of `—` placeholders. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: hatiyildiz <hatice.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-17 01:47:04 +04:00
e3mrah	1c988b9a4b	fix(firewall): open NodePort range 30000-32767 for clustermesh LB (D11) (#1538 ) PR #1537's use-private-ip approach was not viable: the per-region Hetzner LB has no private-network attachment by default (LB private_net is empty) and our DoD A2 architecture pins one private /24 per region that does NOT span across regions. The LB->backend hop has to transit the public path. The actual blocker is the Sovereign firewall: it permits 80/443/6443/53 and blocks the NodePort range. Hetzner LB TCP health-check probes `<node-public-ip>:<NodePort>` and gets dropped → all targets marked unhealthy → external clients see "unexpected eof while reading" at TLS handshake → cilium clustermesh agent stays `0/N remote clusters ready, Waiting for initial connection`. Security: clustermesh-apiserver requires mTLS. Peer agents must present a client cert signed by the peer cluster's cilium-ca (PR #1530). Anonymous connections rejected at handshake. mTLS is the security boundary, NOT the firewall — opening NodePorts is safe here. Caught on t129 (6cddff7ef4432bdc, 2026-05-16) — completes the D11 incident chain (#1525 → #1528 → #1530 → #1536 → this). Co-authored-by: hatiyildiz <hatice.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-16 18:44:02 +04:00
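The firewall reasoning in 1c988b9a4b can be modelled as a port-range check. This is a simplified sketch, not the Hetzner firewall API: `PortRule`, `BASE_RULES`, and `NODEPORT_RULE` are hypothetical names, and only the port numbers come from the commit message.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PortRule:
    """Inclusive port range permitted by the firewall model."""
    low: int
    high: int

    def allows(self, port: int) -> bool:
        return self.low <= port <= self.high


# Ports the commit says were already open, as single-port ranges.
BASE_RULES = [PortRule(p, p) for p in (80, 443, 6443, 53)]
# The range opened by the fix.
NODEPORT_RULE = PortRule(30000, 32767)


def is_allowed(port: int, rules: list[PortRule]) -> bool:
    """True if any rule admits the port; otherwise the probe is dropped."""
    return any(rule.allows(port) for rule in rules)
```

With only `BASE_RULES`, a health-check probe to any NodePort is dropped, which matches the "all targets marked unhealthy" symptom described above.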
e3mrah	1f30a08ae3	fix(chroot): seed Request.Regions[] from SOVEREIGN_REGIONS_JSON env (D5) (#1534 ) The Sovereign-side catalyst-api runs in "chroot" mode — it has no parent prov record, so chrootEnsureDeployment synthesises a minimal in-memory Deployment with only SovereignFQDN set. The /infrastructure/topology loader then sees empty Request.Regions[] and falls into the live-Nodes enumeration path (buildRegionFromLiveNodes) which only sees THIS cluster's Node(s) → emits exactly 1 Region even on a 3-region Sovereign. /cloud?view=graph renders as "1 cluster 1 region" — DoD D5 failure. Caught on t126 (84c0848406dd6fdd, 2026-05-16): operator reported `console.t126.omani.works/cloud?view=graph` showed 1 region despite mothership openova-flow snapshot holding all 3 regions correctly. This PR threads the canonical multi-region RegionSpec[] from the mothership prov body all the way to the Sovereign-side catalyst-api: tofu var.regions → jsonencode → sovereign_regions_json tftpl var → cloud-init postBuild.substitute SOVEREIGN_REGIONS_JSON → bp-catalyst-platform slot 13 sovereign.regionsJson value → sovereign-fqdn ConfigMap key `regionsJson` → catalyst-api Pod env SOVEREIGN_REGIONS_JSON (valueFrom) → chrootEnsureDeployment parses JSON, populates Request.Regions[] → topology loader emits one Region per spec entry Single-region Sovereigns: var.regions has length 1; chart writes the array literal; chroot synth still produces 1 Region — no regression. Empty env: chroot falls back to live-Nodes path (legacy behavior preserved). Refs DoD D5. Co-authored-by: hatiyildiz <hatice.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-16 17:45:24 +04:00
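The last hop of the chain in 1f30a08ae3 (parse the env value, fall back to the live-Nodes view when it is empty) might look like the sketch below. The real `chrootEnsureDeployment` is Go code that is not shown here; `resolve_regions` and the `key` field are hypothetical, and treating malformed JSON as "empty" is an assumption of this sketch rather than documented behavior.

```python
import json


def resolve_regions(regions_json: str, live_region: dict) -> list[dict]:
    """Prefer regions from the env value; fall back to the single live region."""
    if not regions_json.strip():
        return [live_region]  # empty env: legacy live-Nodes path
    try:
        parsed = json.loads(regions_json)
    except json.JSONDecodeError:
        return [live_region]  # assumption: malformed input behaves like empty
    if not isinstance(parsed, list) or not parsed:
        return [live_region]
    return parsed
```

A single-region Sovereign passes a one-element array and still yields one region, so the fallback and the populated path agree in that case.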
e3mrah	357feb0843	fix(tofu): escape ${...} in comment that broke templatefile() (t127) (#1533 ) Unescaped `${DMZ_VCLUSTER_ENABLED:=true}` Flux envsubst expression inside a tftpl comment was being parsed by tofu's templatefile() as a tftpl interpolation. tofu's `:=` is not a valid tftpl operator, so tofu plan failed with: ./cloudinit-control-plane.tftpl:1021,71-72: Extra characters after interpolation expression; Template interpolation doesn't expect a colon at this location. Every other `${...}` reference in tftpl comments in this file is properly escaped as `$${...}` (e.g. lines 12, 850, 893, 971, 996, 1039, 1138). Mine slipped through PR #1531. Fix: rewrite the comment to NOT include any `${...}` expression (since the expression was just illustrative), avoiding the escape gymnastics entirely. Caught on t127 (b7942a70f7516e9e, 2026-05-16) — first prov after PR #1531 landed FAILED in tofu plan stage within 60s. Co-authored-by: hatiyildiz <hatice.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-16 17:39:43 +04:00
e3mrah	904686ff0d	fix(vcluster): canonical region label substitute + per-role enable flags (#1531 ) Caught on t126 (84c0848406dd6fdd, 2026-05-16): bp-{dmz,mgmt,rtz}-vcluster charts installed but DMZ Pods Pending on every region with FailedScheduling. Pod nodeSelector was `openova.io/region=hel1` (from `${SOVEREIGN_REGION_KEY}` substitute = Hetzner region key "hel1"/"nbg1-1"/"sin-2"), but the k3s node-label is `openova.io/region=hz-hel-rtz-prod` (canonical 4-segment label written by cloud-init from `region_canonical_label` per PR #1512). Mismatch meant every vCluster Pod across every region sat Pending. MGMT + RTZ slot 58/59 charts also default-OFF with no substitute flipping them on per the DoD A4 topology (primary=MGMT+DMZ; secondary=DMZ+RTZ). This PR: 1. Adds `SOVEREIGN_REGION_CANONICAL_LABEL` substitute to tofu cloud-init `bootstrap-kit` postBuild block, sourced from per-region `region_canonical_label` tftpl var. 2. Adds `MGMT_VCLUSTER_ENABLED` + `RTZ_VCLUSTER_ENABLED` substitutes — primary CP renders true/false, secondary CP renders false/true. 3. Updates bootstrap-kit slots 54/58/59 to use the canonical label substitute. Slots 58/59 also read the per-role enable flag. Expected post-deploy state on a fresh 3-region prov: primary: DMZ + MGMT vCluster Pods Running (RTZ rendered zero) secondary: DMZ + RTZ vCluster Pods Running (MGMT rendered zero) Refs DoD A4 (vCluster topology). Co-authored-by: hatiyildiz <hatice.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-16 17:28:06 +04:00
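The per-role enable flags from 904686ff0d reduce to a two-row truth table. The sketch below is illustrative only: `vcluster_flags` is a hypothetical helper, and the string values "true"/"false" are assumed because the substitutes are rendered into text.

```python
def vcluster_flags(role: str) -> dict[str, str]:
    """Return the substitute values a control plane of the given role renders."""
    if role not in ("primary", "secondary"):
        raise ValueError(f"unknown role: {role!r}")
    is_primary = role == "primary"
    return {
        # primary = MGMT + DMZ, secondary = DMZ + RTZ (DoD A4)
        "MGMT_VCLUSTER_ENABLED": "true" if is_primary else "false",
        "RTZ_VCLUSTER_ENABLED": "false" if is_primary else "true",
    }
```

DMZ is left out because, per the commit, it runs on every region regardless of role.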
e3mrah	ed19bb3f8d	fix(k3s): --disable-cloud-controller so providerID stays empty for our patch (#1524 ) Caught on t123 (a3bfa56adbcfb049, 2026-05-16): Gap A v3.1's patch loop hit k8s validation error: The Node "catalyst-t123-omani-works-cp1" is invalid: spec.providerID: Forbidden: node updates may not change providerID except from "" to valid k8s allows setting providerID from empty → valid, but NOT changing it. k3s's embedded cloud controller sets providerID=k3s://<hostname> BEFORE our cloud-init runcmd patch fires (race window). Once set, the patch is rejected. Fix: --disable-cloud-controller (alone, NOT with the cloud-provider= external kubelet arg that caused the chicken-and-egg taint in reverted PR #1513). This disables the k3s embedded cloud controller so it never sets providerID; the kubelet leaves providerID empty; our runcmd patch successfully sets hcloud://<id>. hcloud-ccm (installed later via Flux) sees the correct providerID and allocates per-region LBs. Co-authored-by: claude <claude@anthropic.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-16 15:25:54 +04:00
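The validation rule quoted in ed19bb3f8d explains why the race matters: once any value is written, a different one is rejected. A minimal model of that rule follows; `provider_id_update_allowed` is a hypothetical name and this is not the Kubernetes source.

```python
def provider_id_update_allowed(current: str, new: str) -> bool:
    """Model of the quoted rule: providerID may only go from "" to a value."""
    if current == new:
        return True  # no change is always acceptable
    return current == "" and new != ""
```

Under this model the patch from `k3s://<hostname>` to `hcloud://<id>` always fails, while the same patch against an empty providerID succeeds, which is what disabling the embedded cloud controller achieves.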
e3mrah	0ebd137547	fix(cloud-init): retry providerID patch up to 30× when Node not yet registered (#1523 ) Caught on t122 (7e519eb997af236c, 2026-05-16): primary + sin patched fine, but nbg1's kubectl patch failed because the Node object hadn't yet appeared in the apiserver between healthz OK and Node registration. Result: nbg1 stuck at providerID=k3s://... → CCM rejected its LB allocation → clustermesh-apiserver external_ip stayed <pending> on nbg1 → AutoEstablishClusterMesh couldn't fully mesh. Add a 30-iter loop (150s budget): get node first; if found, patch; else sleep 5. Hetzner apiserver registers Nodes within ~10-30s of k3s install on healthy clusters. Co-authored-by: claude <claude@anthropic.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-16 14:58:59 +04:00
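The retry loop from 0ebd137547 (30 iterations, 5 seconds apart, 150 s budget) can be sketched as below. The real loop is shell inside cloud-init; this Python version only illustrates the control flow, and `patch_with_retry`, `node_exists`, and `patch` are hypothetical names. The `sleep` parameter is injectable so the loop can be exercised without waiting.

```python
import time
from typing import Callable


def patch_with_retry(
    node_exists: Callable[[], bool],
    patch: Callable[[], None],
    attempts: int = 30,
    delay_seconds: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Get the node first; if found, patch once and stop; else sleep and retry."""
    for _ in range(attempts):
        if node_exists():
            patch()
            return True
        sleep(delay_seconds)
    return False  # budget exhausted; the caller decides whether that is fatal
```

Returning `False` instead of raising mirrors the non-fatal stance of the original providerID patch, where a failed patch is logged and cold-start continues.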
e3mrah	ef93a2cdbe	feat(cloud-init): patch node providerID after k3s healthz (unblocks Gap A) (#1520 ) Architecturally-clean replacement for the reverted PRs #1513 (k3s flag) and #1516 (pre-install hcloud-ccm). Both prior approaches broke cold-start (chicken-and-egg with the uninitialized taint). This patch instead lets k3s boot normally with its default embedded cloud controller (which sets `providerID=k3s://<hostname>` — the problem), then immediately patches the local Node's `spec.providerID` to `hcloud://<id>` using the Hetzner instance metadata endpoint (169.254.169.254). The patch runs ONCE per CP node, right after k3s apiserver healthz becomes reachable, BEFORE flux-bootstrap.yaml applies the bootstrap-kit Kustomization. Once providerID has the canonical `hcloud://` prefix, bp-hcloud-ccm (installed by Flux later in the bootstrap-kit chain) accepts the node as a Hetzner-managed instance and allocates LBs for Service type=LoadBalancer normally. That unblocks: - D12: clustermesh-apiserver Service gets a real external IP instead of <pending> - D10: AutoEstablishClusterMesh (PR #1508) can read each region's LB IP and write peer entries into cilium-clustermesh Secret - D11: inter-region pod-to-pod traffic flows via Cilium WG over the per-region LB IPs - D5: child catalyst-api can reach secondary regions via mesh, so /cloud view aggregates all 3 regions instead of 1/1 Failure is non-fatal: if metadata lookup or patch fails, we log and continue (bp-hcloud-ccm has a chance to set providerID later via its own node-list-and-match logic). Cold-start is never blocked. Canonical topology (1 cpx52 per region, workerCount=0) means every node is a CP — covered by this patch. Operator-added workers (workerCount>0) would also need providerID patched; a follow-up Job in bp-providerid-patcher can iterate all nodes post-Flux. Co-authored-by: claude <claude@anthropic.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-16 14:12:26 +04:00
e3mrah	766890510b	Revert PR #1516 + #1517 — Gap A hcloud-ccm pre-install hangs cloud-init (#1518 ) * Revert "fix(cloudinit): bump size guardrail 30720 → 32000 bytes (#1517)" This reverts commit `05c6edb4fe`. * Revert "fix(cloud-init): pre-install hcloud-ccm before Flux (unblocks per-region LB allocation) (#1516)" This reverts commit `b7140b9069`. --------- Co-authored-by: claude <claude@anthropic.com>	2026-05-16 13:32:18 +04:00
e3mrah	05c6edb4fe	fix(cloudinit): bump size guardrail 30720 → 32000 bytes (#1517) PR #1516 added ~3KB of hcloud-ccm bootstrap manifests inline (Secret + ServiceAccount + ClusterRoleBinding + Deployment with full toleration list + container args). Rendered cloud-init now exceeds the 30720 precondition on every primary + secondary CP: Error: Resource precondition failed on main.tf line 716: length(local.control_plane_cloud_init) <= 30720 Caught on t118 prov (0619287065fb58c8, 2026-05-16): apply failed on the primary AND on nbg1-1 + sin-2 simultaneously. Hetzner hard cap is 32768 bytes. Bump the guardrail to 32000 (97.7% of the hard cap), which leaves a 768-byte safety margin while admitting the bytes the hcloud-ccm pre-install legitimately needs. Co-authored-by: claude <claude@anthropic.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-16 13:15:21 +04:00
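The guardrail arithmetic in 05c6edb4fe can be checked directly. The sketch below uses the two byte limits from the commit message; `check_cloud_init_size` is a hypothetical helper, and it measures UTF-8 bytes, which is an assumption (the tofu precondition uses `length()` on the rendered string).

```python
HETZNER_HARD_CAP = 32768  # bytes, as stated in the commit message
GUARDRAIL = 32000         # bytes, the bumped precondition


def check_cloud_init_size(rendered: str, limit: int = GUARDRAIL) -> int:
    """Return the remaining headroom in bytes, or raise if over the limit."""
    size = len(rendered.encode("utf-8"))
    if size > limit:
        raise ValueError(f"cloud-init is {size} bytes, limit is {limit}")
    return limit - size


# Distance between the guardrail and the provider's hard cap.
SAFETY_MARGIN = HETZNER_HARD_CAP - GUARDRAIL
# Fraction of the hard cap that the guardrail admits.
GUARDRAIL_RATIO = GUARDRAIL / HETZNER_HARD_CAP
```

Failing at plan time with an explicit size is cheaper than letting the provider reject the user data during apply.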
e3mrah	b7140b9069	fix(cloud-init): pre-install hcloud-ccm before Flux (unblocks per-region LB allocation) (#1516 ) DoD multi-region gates D5/D10/D11/D12-LB-pending all trace to one root cause: k3s sets node.spec.providerID=k3s://<hostname>. hcloud-ccm rejects every LoadBalancer-Service allocation because the prefix isn't hcloud://, so clustermesh-apiserver Service stays <pending> → AutoEstablishClusterMesh (PR #1508) hard-fails → no peer entries → no inter-region pod traffic → openova-flow-emitter on secondaries can't reach openova-flow-server on primary → /cloud view sees only 1 region. PR #1513 attempted the kubelet-flag-only fix (--cloud-provider=external + --disable-cloud-controller) banking on Flux's bp-hcloud-ccm slot 55 to install the CCM. Reverted in PR #1514 because Flux pods themselves cannot land on a node tainted node.cloudprovider.kubernetes.io/ uninitialized=NoSchedule — chicken-and-egg, 0 HRs after 30 min. Architecturally correct fix: pre-install hcloud-ccm via raw manifests in cloud-init, BEFORE flux-bootstrap.yaml apply. Once the Deployment runs (with uninitialized-taint toleration), CCM matches the node to its Hetzner server, writes providerID=hcloud://<id>, kubelet lifts the taint, Flux proceeds normally. Flux later "adopts" this Deployment via bp-hcloud-ccm HelmRelease (release name collides cleanly with `helm upgrade --install`). Changes: - cloudinit-control-plane.tftpl: - Re-add k3s install flags --disable-cloud-controller + --kubelet-arg=cloud-provider=external (same flags as reverted #1513). - New write_files entry /var/lib/catalyst/hcloud-ccm-bootstrap.yaml containing Secret kube-system/hcloud (token + network keys), ServiceAccount, ClusterRoleBinding, and Deployment with full toleration set (uninitialized + CriticalAddonsOnly + control-plane + master + not-ready). 
Image pulled via harbor.openova.io proxy- cache of hetznercloud/hcloud-cloud-controller-manager:v1.20.0 (mirrors platform/hcloud-ccm/chart/Chart.yaml appVersion pin, per MIRROR-EVERYTHING rule). - New runcmd steps inserted AFTER the local-path StorageClass setup and BEFORE the kubeconfig postback: kubectl apply the manifest, then poll node.spec.providerID for up to 300s waiting for hcloud:// prefix. On timeout, dump CCM pod + logs and exit 1. - cloudinit-worker.tftpl: - Add --kubelet-arg=cloud-provider=external to agent install. Workers join the cluster after the primary CP's CCM is up; worker kubelet will wait for the same external CCM to set its providerID. Secondary regions (local.secondary_region_cloud_init in main.tf) call the SAME cloudinit-control-plane.tftpl, so the fix inherits to every secondary CP automatically. No main.tf changes needed — hcloud_token and hcloud_network_name were already threaded into both primary and secondary templatefile() calls. DoD impact: unblocks D5 (/cloud 3-regions), D10 (Cilium peer entries), D11 (inter-region pod-to-pod via WG), D12 (LB external IPs no longer <pending>). After this lands plus a fresh prov, those four DoD gates flip green; expected 13-14/14 on next t118 cycle. Refs: docs/SOVEREIGN-MULTI-REGION-DOD.md, session_2026_05_16_t117_dod_partial.md Reverts: tail of PR #1513 left the worker tftpl untouched, but #1514's revert restored it to no-flag state. This PR re-applies the flag intent correctly because the CCM is now present at the moment kubelet starts. Co-authored-by: hatiyildiz <hatice.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-16 13:06:49 +04:00
e3mrah	f30a49fba5	Revert "fix(k3s): set cloud-provider=external + disable embedded CCM for hcloud-ccm (#1513)" (#1514) This reverts commit `7f0de7fa82`. Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>	2026-05-16 12:12:38 +04:00
e3mrah	7f0de7fa82	fix(k3s): set cloud-provider=external + disable embedded CCM for hcloud-ccm (#1513 ) DoD gate D12-LB-allocation root cause: k3s registers nodes with providerID=k3s://<hostname> instead of hcloud://<server-id>. hcloud-ccm rejects every LB allocation: hcops/LoadBalancerOps.ReconcileHCLBTargets: providerID does not have one of the expected prefixes (hcloud://, hrobot://, hcloud://bm-): k3s://catalyst-t115-omani-works-nbg1-1-cp1 This blocked clustermesh-apiserver Service from getting an external IP on every secondary region → AutoEstablishClusterMesh (PR #1508) couldn't write peer entries → D10/D11 fail. Caught on t115.omani.works (577be15281be2587, 2026-05-16) after PR #1509 flipped clustermesh-apiserver Service to LoadBalancer. The NodePort default in the old chart masked this k3s-vs-hcloud-ccm incompatibility until the LoadBalancer flip exposed it. Fix (k3s server install line in cloudinit-control-plane.tftpl): + --disable-cloud-controller + --kubelet-arg=cloud-provider=external Fix (k3s agent install line in cloudinit-worker.tftpl): + --kubelet-arg=cloud-provider=external The k3s server flag tells the embedded cloud controller to stay out. The kubelet flag tells kubelet to wait for an external CCM to set providerID. hcloud-ccm (bootstrap-kit slot 36) then matches each node to its Hetzner server by name and sets providerID=hcloud://<id>, unblocking LB allocation, Volume CSI, and node-external-ip. The node is briefly tainted node.cloudprovider.kubernetes.io/ uninitialized=NoSchedule until the CCM removes it — Flux's bootstrap-kit Kustomization tolerates this taint via SOPs. Co-authored-by: claude <claude@anthropic.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-16 11:34:25 +04:00
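The rejection quoted in 7f0de7fa82 is a prefix test on the node's providerID. A minimal sketch, using only the prefixes listed in the error message; `has_expected_prefix` is a hypothetical name, not the hcloud-ccm function.

```python
# Prefixes copied from the error message in the commit.
EXPECTED_PREFIXES = ("hcloud://", "hrobot://", "hcloud://bm-")


def has_expected_prefix(provider_id: str) -> bool:
    """True if the providerID starts with one of the accepted prefixes."""
    return provider_id.startswith(EXPECTED_PREFIXES)
```

A node registered as `k3s://<hostname>` fails this test, so every LoadBalancer reconcile that targets it is refused.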
e3mrah	dc590855a1	fix(tofu): per-region cloud-init renders with secondary's own values, not primary's (#1512 ) * fix(tofu): per-region cloud-init renders with secondary's own values, not primary's Root cause: cloudinit-control-plane.tftpl hardcoded the literal `openova.io/region=hz-fsn-rtz-prod` on the k3s install line. Every CP node — primary AND every secondary — labeled itself with that fixed string regardless of the cluster's real region. The template variables `region` and `sovereign_region_key` were already wired per-region in main.tf, but this one node-label flag was written as a constant. Concrete impact on prov t114.omani.works (a1448e0b9e471f5d, 2026-05-16): - Primary cluster (hel1) k3s nodes carried `hz-fsn-rtz-prod` even though Sovereign primary = hel1. qa-fixtures Pods targeted `openova.io/region in [hz-fsn-rtz-prod]` and silently landed on the wrong-named nodes — the scheduler accepted but the cluster name didn't match the label, breaking the OpenovaFlow canvas's per-region grouping and any downstream selector reading the label. - Secondary clusters (nbg1, sin) carried the same hardcoded label so their k3s nodes never reported their own region, again breaking the canvas (D13) and the Continuum DR region awareness. - clusters/_template/bootstrap-kit/01-cilium.yaml further masked the bug with a `${HCLOUD_LB_LOCATION:=hel1}` default fallback on the clustermesh-apiserver Service annotation — for a Sovereign with primary=hel1 the fallback APPEARED correct but silently masked any rendering failure path where the substitute might be missing. Fix shape: 1. Introduce locals.region_canonical_label in main.tf, keyed by region key ("primary" + every secondary key). Each value is computed as `hz-<region-prefix-no-digits>-rtz-prod` per NAMING-CONVENTION §2.1. 2. Thread `region_canonical_label` into BOTH the primary CP templatefile() call (from locals.region_canonical_label["primary"]) and the secondary CP templatefile() call (from locals.region_canonical_label[k]). 
3. Replace the hardcoded literal in cloudinit-control-plane.tftpl line 1364 with `${region_canonical_label}` — each CP now labels its k3s node with ITS OWN canonical region tag. 4. Thread `QA_PRIMARY_REGION` substitute into the bootstrap-kit Kustomization's postBuild.substitute block so the chart's qaFixtures.primaryRegion seam (`${QA_PRIMARY_REGION:-hz-fsn-rtz-prod}`) is set to the Sovereign-wide primary region label, never the hardcoded `hz-fsn-rtz-prod` chart default. Identical value on every cluster's bootstrap-kit because qaFixtures.primaryRegion is Sovereign-wide singular. 5. Remove the `${HCLOUD_LB_LOCATION:=hel1}` fallback default in 01-cilium.yaml — the cloud-init substitute ALWAYS provides a value, so a missing substitute is a tofu rendering bug that should surface at chart admission, not silently render hel1. Provider-agnostic per DoD A6: the `hz` prefix is correct only because this file lives under infra/hetzner/; future infra/aws/ and infra/huawei/ modules will derive `aw` / `hw` in their own per-module locals using the same pattern. DoD impact unblocked: - D10 (cilium clustermesh peer entries): clustermesh-apiserver Service now annotates the correct region for hcloud-ccm LB allocation on every peer, not just primary=hel1. - D12 (clustermesh LB external IP allocated): no longer pending on non-hel1 primary or any secondary because the location annotation now reflects each peer's real region. - D13 (canvas per-region bubble grouping): k3s nodes report their actual region label so FlowNode.region values differentiate across clusters. Tests added (infra/hetzner/tests/multi_region.tftest.hcl, run "per_region_cloud_init_carries_secondarys_own_region"): - SOVEREIGN_REGION_KEY / HCLOUD_LB_LOCATION render per-region (regression test for the templatefile contract). - openova.io/region= node-label is the per-region canonical label (`hz-nbg-rtz-prod` on nbg1-1, `hz-sin-rtz-prod` on sin-2, `hz-hel-rtz-prod` on primary hel1). 
- QA_PRIMARY_REGION substitute carries the Sovereign's primary region label on every cluster's bootstrap-kit substitute. - Negative assertions catch any regression that re-introduces `hz-fsn-rtz-prod` on a non-fsn1 Sovereign. Test result: 7 passed, 2 pre-existing failures (qa_mode SKU override tests — unrelated, present on origin/main, separate contract from Fix #183 body-first coalesce). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(tofu): align qa_mode SKU tests with Fix #183 body-first coalesce contract Pre-existing test failures on origin/main since Fix #183 (PR #1386, 2026-05-11) inverted the coalesce direction in `local.effective_cp_size = local.qa_mode ? coalesce(var.control_plane_size, var.qa_control_plane_size) : var.control_plane_size`. The pre-Fix-#183 tests asserted that qa_control_plane_size wins when qa_fixtures_enabled='true', but the new contract is the OPPOSITE: body wins (variables.tf default `cpx22` for control_plane_size is non-empty so coalesce always picks it first; qa-default only activates when the body is empty, which provisioner.go achieves by CONDITIONALLY omitting the var in writeTfvars when the operator's body has no override — see provisioner.go:1280-1289). Inside tofu test we can't conditionally omit a variable, so the variables.tf default ALWAYS wins. Updated assertions: - qa_mode_on_flips_to_bigger_skus → asserts variables.tf default `cpx22` wins (the auto-flip is exercised at the provisioner-side boundary, not tofu-side). - qa_mode_on_respects_explicit_overrides → asserts the body-first behavior when only qa_control_plane_size is set (no control_plane_size override). - NEW qa_mode_on_body_overrides_win → asserts the operator's explicit control_plane_size/worker_size wins verbatim — the canonical "body wins" lane Fix #183 codified. Tests result: 10 passed, 0 failed (was 7 passed, 2 failed on origin/main since Fix #183). 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: hatiyildiz <269457768+hatiyildiz@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-16 10:57:48 +04:00
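The label shape `hz-<region-prefix-no-digits>-rtz-prod` from dc590855a1 can be sketched as a small derivation. The real computation lives in `locals.region_canonical_label` in main.tf and is not shown in this log; `region_canonical_label` below is a hypothetical Python stand-in that takes the cloud region key (for example `nbg1-1`) and keeps its leading letters.

```python
import re


def region_canonical_label(region_key: str, provider_prefix: str = "hz") -> str:
    """Derive the canonical label from the leading letters of a region key."""
    match = re.match(r"[a-z]+", region_key)
    if match is None:
        raise ValueError(f"region key has no alphabetic prefix: {region_key!r}")
    return f"{provider_prefix}-{match.group(0)}-rtz-prod"
```

The `provider_prefix` parameter reflects the note that other provider modules would derive their own prefix using the same pattern.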
e3mrah	0c9e391d59	fix(tofu): pass sovereign_fqdn_slug into secondary regions templatefile (#1511 ) * fix(clustermesh): default clustermesh-apiserver to LoadBalancer (DoD A3) DoD A3 from docs/SOVEREIGN-MULTI-REGION-DOD.md: Cilium ClusterMesh apiserver Service MUST be LoadBalancer (NEVER NodePort). Pre-this-change: bootstrap-kit/01-cilium.yaml defaulted ${CLUSTERMESH_SERVICE_TYPE:=NodePort}. Every multi-region Sovereign landed with clustermesh-apiserver as NodePort, in direct violation of A3 and breaking AutoEstablishClusterMesh (handler/clustermesh.go, PR #1508) which hard-fails on Service.type != LoadBalancer. Caught on prov t112.omani.works (f2e7f02e6ffb6a18, 2026-05-15): - 3 cpx52 region cluster (hel1+nbg1+sin) converged HRs Ready=True - clustermesh-apiserver Service = NodePort on all 3 regions - cilium-clustermesh peer Secret empty (0 peers) — orchestrator never wrote them because of the type-check - D10 + D12 both failed silently Fix flips the chart default to LoadBalancer and threads Hetzner CCM LB annotations (location, type, name) from the bootstrap-kit substitute env. provisioner now emits CLUSTERMESH_SERVICE_TYPE + HCLOUD_LB_LOCATION + SOVEREIGN_FQDN_SLUG into the cloud-init postBuild substitute block alongside the existing CLUSTER_MESH_NAME + CLUSTER_MESH_ID. Operator escape hatch preserved: bare-metal / non-cloud Sovereigns override CLUSTERMESH_SERVICE_TYPE=NodePort in their per-Sovereign bootstrap-kit overlay. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(tofu): pass sovereign_fqdn_slug into secondary regions templatefile PR #1509 added ${sovereign_fqdn_slug} reference to cloudinit-control-plane.tftpl (for the Hetzner CCM LB name annotation on clustermesh-apiserver) and wired it into the PRIMARY templatefile() invocation in main.tf, but missed the SECONDARY-regions templatefile() at line ~990. 
Every multi-region prov now fails at `tofu plan`: Invalid value for "vars" parameter: vars map does not contain key "sovereign_fqdn_slug", referenced at ./cloudinit-control-plane.tftpl:991,37-56. Caught on prov t113.omani.works (82c3587b97156a08, 2026-05-15) — first multi-region prov against #1509's chart fix. Phase-0 failed at plan before any servers spun up. Fix is trivial: thread the same replace(var.sovereign_fqdn, ".", "-") through the for_each secondary block. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: hatiyildiz <hatice.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-16 00:00:19 +04:00
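The failure in 0c9e391d59 is a template that references a key missing from one call site's vars map. The sketch below shows the slug derivation (`replace(var.sovereign_fqdn, ".", "-")` from the commit) and a pre-flight check for missing keys. `fqdn_slug` and `missing_vars` are hypothetical helpers, and the regex only recognises bare `${name}` references, which is a simplification of the real template grammar.

```python
import re

# A bare ${name} reference that is not escaped as $${name}.
_REF = re.compile(r"(?<!\$)\$\{\s*([A-Za-z_][A-Za-z0-9_]*)\s*\}")


def fqdn_slug(fqdn: str) -> str:
    """Mirror replace(var.sovereign_fqdn, ".", "-")."""
    return fqdn.replace(".", "-")


def missing_vars(template: str, vars_map: dict) -> list[str]:
    """List template references that the vars map does not provide."""
    referenced = {m.group(1) for m in _REF.finditer(template)}
    return sorted(referenced - set(vars_map))
```

Running such a check against every call site that shares a template would have flagged the secondary-region block before the first multi-region plan.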
e3mrah	5f8ba85dc5	fix(clustermesh): default clustermesh-apiserver to LoadBalancer (DoD A3) (#1509 ) DoD A3 from docs/SOVEREIGN-MULTI-REGION-DOD.md: Cilium ClusterMesh apiserver Service MUST be LoadBalancer (NEVER NodePort). Pre-this-change: bootstrap-kit/01-cilium.yaml defaulted ${CLUSTERMESH_SERVICE_TYPE:=NodePort}. Every multi-region Sovereign landed with clustermesh-apiserver as NodePort, in direct violation of A3 and breaking AutoEstablishClusterMesh (handler/clustermesh.go, PR #1508) which hard-fails on Service.type != LoadBalancer. Caught on prov t112.omani.works (f2e7f02e6ffb6a18, 2026-05-15): - 3 cpx52 region cluster (hel1+nbg1+sin) converged HRs Ready=True - clustermesh-apiserver Service = NodePort on all 3 regions - cilium-clustermesh peer Secret empty (0 peers) — orchestrator never wrote them because of the type-check - D10 + D12 both failed silently Fix flips the chart default to LoadBalancer and threads Hetzner CCM LB annotations (location, type, name) from the bootstrap-kit substitute env. provisioner now emits CLUSTERMESH_SERVICE_TYPE + HCLOUD_LB_LOCATION + SOVEREIGN_FQDN_SLUG into the cloud-init postBuild substitute block alongside the existing CLUSTER_MESH_NAME + CLUSTER_MESH_ID. Operator escape hatch preserved: bare-metal / non-cloud Sovereigns override CLUSTERMESH_SERVICE_TYPE=NodePort in their per-Sovereign bootstrap-kit overlay. Co-authored-by: hatiyildiz <hatice.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-15 23:40:04 +04:00
e3mrah	93f699326a	infra(hetzner): per-region hcloud_network — DMZ-WG, no shared private net (#1507 ) * docs(sovereign): pin multi-region DoD contract — never divert from D1-D14 Founder ruling 2026-05-15: every silent compromise from the multi-region target-state architecture is a quality violation. This file locks the convergence contract so future Claude sessions cannot drift. Architecture invariants A1-A6: - 3 regions minimum (never drop to 2 to dodge provider capacity) - Inter-region link = DMZ WireGuard over PUBLIC IPs, ALWAYS (no hcloud_network cross-region, no VPC peering, no Huawei VPC) - Cilium ClusterMesh apiserver = LoadBalancer (NEVER NodePort) - vCluster topology: primary = MGMT+DMZ, secondary = DMZ+RTZ - Zero public exposure of K8s control-plane endpoints - Provider-mix is canonical (assume 1 Hetzner + 1 AWS + 1 Huawei) DoD gates D1-D14 enforced via Playwright MCP + kubectl + cilium CLI on every fresh prov. No partial credit, no "deferred", no "matrix-drift". Mirrored to auto-memory at ~/.claude/projects/-home-openova-repos-openova-private/memory/sovereign_multiregion_dod.md so it loads at every session start. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * infra(hetzner): per-region hcloud_network — DMZ-WG, no shared private net Implements A1+A2+A6 from docs/SOVEREIGN-MULTI-REGION-DOD.md. Each region gets its own hcloud_network (10.0.0.0/16 INSIDE each, not shared across). Inter-region link is exclusively Cilium WireGuard over PUBLIC IPs through the DMZ — no provider's internal network ever spans regions. - Replaces hcloud_network.main + hcloud_network_subnet.{main,secondary} with hcloud_network.region[] + hcloud_network_subnet.region[] (for_each over toset(local.all_region_keys); primary key = "primary", secondary keys = slice-G1 "{cloudRegion}-{index}" shape). 
- Per-region cluster-cidr (10.42+i.0/16) + service-cidr (10.96+i.0/16) threaded through cloud-init so ClusterMesh peers don't collide on pod/service CIDRs (DoD gate D11). - Firewall: open UDP 51871 from 0.0.0.0/0 (Cilium WG inter-region encryption) — without this the WG mesh between regions cannot form. - Each CP's local private IP is now uniformly 10.0.1.2 per region (every region has its own /24 inside its own /16 — no cross-region IP collision class possible by construction). - Hetzner resource names threaded to cluster-autoscaler now use hcloud_network.region["primary"\|<k>].name so autoscaler-spawned workers land in the same isolated /16 as their region's CP. - Pre-2026-05-15 state will plan a network-recreate on next apply; per DoD cycle protocol this is consciously accepted (no tofu state mv runbook, every wipe-and-create is a fresh provision). - tofu tests cover: per-region network count + uniform 10.0.0.0/16 + uniform 10.0.1.0/24 subnet + per-region cluster/service CIDRs + Cilium WG firewall rule existence. - README "Network" section adds the 3-region DMZ-WG ASCII topology. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore(tofu): apply tofu fmt — fixes CI fmt-check on PR #1507 Apply OpenTofu's canonical formatting to main.tf. No semantic changes; only whitespace alignment under template substitute blocks where my refactor added 2-char fields (`cluster_cidr` and `service_cidr`) that perturbed the prior column alignment. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: hatiyildiz <hatice.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-authored-by: claude <claude@anthropic.com>	2026-05-15 22:04:32 +04:00
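The per-region CIDR scheme in this commit (cluster-cidr `10.42+i.0/16`, service-cidr `10.96+i.0/16`) can be sketched as below. This is an illustration only, not the module's code: it assumes the shorthand means `10.(42+i).0.0/16` and `10.(96+i).0.0/16` for region index `i`, and the function name `regionCIDRs` and its bounds check are hypothetical.

```go
package main

import "fmt"

// regionCIDRs returns the pod (cluster) CIDR and service CIDR for the region
// at index i. It reads the commit's "10.42+i.0/16" shorthand as
// 10.(42+i).0.0/16, which is an assumption made for this sketch.
func regionCIDRs(i int) (clusterCIDR, serviceCIDR string, err error) {
	// Hypothetical guard: keep pod ranges (second octet 42..95) below the
	// service ranges (second octet 96 and up) so the two families never overlap.
	if i < 0 || 42+i >= 96 {
		return "", "", fmt.Errorf("region index %d out of range", i)
	}
	return fmt.Sprintf("10.%d.0.0/16", 42+i), fmt.Sprintf("10.%d.0.0/16", 96+i), nil
}

func main() {
	// Three regions, matching the 3-region minimum in the DoD.
	for i := 0; i < 3; i++ {
		pods, svcs, err := regionCIDRs(i)
		if err != nil {
			panic(err)
		}
		fmt.Printf("region %d: cluster-cidr=%s service-cidr=%s\n", i, pods, svcs)
	}
}
```

Because every region gets a distinct second octet, ClusterMesh peers cannot collide on pod or service ranges even though each region reuses the same 10.0.0.0/16 node network.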
e3mrah	3a19bb161f	fix(cloudinit): thread cluster.name + cluster.id into pre-Flux cilium-values.yaml (#1503 ) * fix(canvas): canonicalise Job.DependsOn entries with install- prefix — fix invisible edges PR #1499 plumbed spec.dependsOn end-to-end and verified deps populate on first event (no /refresh-watch needed). But the openova-flow snapshot composer (flow_snapshot_local.go) emits finish-to-start relationships where fromId = jobs.JobID(deploymentID, dep). Without the "install-" prefix on each dep entry, fromId came out as: <dep>:hel1-2:seaweedfs (secondary, missing "install-") <dep>:gitea (primary, missing "install-") But the FlowNode ids in the snapshot are: <dep>:install-hel1-2:seaweedfs <dep>:install-gitea The FE canvas adapter matches by exact id → every finish-to-start rel points at a non-existent node → 224 rels emitted, 0 edges rendered. Caught on prov t103.omani.works (005080699326a7ac, 2026-05-15): curl /v1/flows/.../snapshot → 376 rels total: 152 contains, 224 finish-to-start every finish-to-start fromId malformed canvas: sibling edges invisible across all 135 install Jobs Fix in two places: internal/handler/phase1_watch.go (spawnSecondaryRegionWatchers emit): Region-prefix each dep AND inject the "install-" prefix so ev.DependsOn = ["install-<region>:<chart>"] before the bridge receives the event. Symmetric with how ev.Component is constructed. internal/jobs/helmwatch_bridge.go (OnHelmReleaseEvent): Canonicalise every dep entry: if it doesn't already start with JobNamePrefix ("install-"), prepend it. Idempotent on entries that already are canonical (set by the phase1_watch.go path). Covers the primary-region path (bare chart names like "gitea") too — Job.DependsOn now stores "install-gitea", which matches the composer's emitted FromId exactly. Tests: go build ./... + go test on internal/jobs + helmwatch + provisioner all green. (Pre-existing TestHandleWhoami_* flake in handler is unrelated.) 
* fix(canvas): canonicalise resolved DependsOn too — kill malformed prior values Follow-up to PR #1500. The canon block ran on the event-carried dependsOn arg, but the 3-tier resolve preferred existing-store value when non-empty — which for any Job written BEFORE PR #1500 rolled out was malformed (no "install-" prefix). t103.omani.works snapshot kept emitting 224 finish-to-start rels with malformed fromIds because the existing Job rows held "hel1-2:gitea" entries that the resolve preserved verbatim. Fix: after the 3-tier resolve, run a final canonicalisation pass on resolvedDeps so every persisted entry is canonical regardless of whether it came from event-carried (already canon by my prior block) or from existing-store (potentially malformed legacy). Note: this fix only takes effect on the NEXT HR state transition for a given Job. HRs already in terminal state (e.g. t103's 135 succeeded HRs) will keep their malformed deps until a new event fires. The loop's next cycle (t104+) writes canonical from event 1. * fix(provisioner): auto-derive cluster_mesh_name + cluster_mesh_id for multi-region provs Caught on prov t104.omani.works (98395b3d9bd9c1aa, 2026-05-15): operator submitted a multi-region body (3 regions cpx52) but omitted ClusterMeshName/ClusterMeshID. catalyst-api defaulted them to "" and 0. Tofu wrote cluster_mesh_name="" + cluster_mesh_id=0 to tfvars. Flux postBuild.substitute rendered cilium-config with cluster.name=default + cluster.id=0. Cilium kvstoremesh refused to start: "ClusterID 0 is reserved" clustermesh-apiserver CrashLoopBackOff 16 restarts. No mesh ever formed. Cross-region observability + east-west routing permanently broken. Auto-derivation: ClusterMeshName: <first-fqdn-label>-mesh e.g. t105.omani.works → "t105-mesh" ClusterMeshID: (sha256(deploymentID)[:4] as uint32) mod 252 + 1 Range [1, 252]; main.tf increments for secondaries so the max id any region sees is primary + (regions - 1) ≤ 254. ID 255 is intentionally avoided (Cilium sentinel). 
Operator override still respected — auto-derive only kicks in when both fields are zero/empty AND len(Regions) > 1. Single-region provs stay at "" / 0 (no mesh needed). Tested derive helpers against the last 4 prov IDs — all land in valid range: 98395b3d9bd9c1aa → 74 (secondaries 75, 76) 005080699326a7ac → 29 (secondaries 30, 31) 22af2b1120158239 → 139 c9df5eed1c1ba6cf → 180 Build + provisioner unit tests green. * fix(cloudinit): thread cluster.name + cluster.id into pre-Flux cilium-values.yaml t105.omani.works (a6c0f5dfebd63bd0, 2026-05-15) found that PR #1502's catalyst-api auto-derive (cluster_mesh_name=t105-mesh, cluster_mesh_id=99) correctly reached cilium-config — but only AFTER Flux helm-upgraded the release. The pre-Flux Cilium install (cloud-init line 1473) used /var/lib/catalyst/cilium-values.yaml which DIDN'T carry cluster.name or cluster.id, so cilium-agent started with the chart defaults ("default", 0). The Flux upgrade then changed cilium-config but the already-running cilium-agent kept its in-memory cluster.name="default" because it reads ConfigMap once at startup. Downstream consequences observed live on t105: hubble-relay CrashLoopBackOff: "tls: failed to verify certificate: x509: certificate is valid for *.t105-mesh.hubble-grpc.cilium.io, not catalyst-t105-omani-works-cp1 .default.hubble-grpc.cilium.io" clustermesh peer announcements use stale "default" identity → cross-region mesh handshakes x509-fail. Fix: include cluster.name + cluster.id in the pre-Flux helm install's values file, sourced from the templatefile() vars cluster_mesh_name + cluster_mesh_id (already threaded per-region by main.tf:381-382 and :900-901). Now the first cilium-agent process announces with the correct identity, no helm-upgrade race. --------- Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>	2026-05-15 19:48:58 +04:00
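The auto-derivation rule from this commit (mesh name from the first FQDN label, mesh ID from `sha256(deploymentID)[:4] mod 252 + 1`) might look roughly like the sketch below. The function names are hypothetical, and two details are assumptions the commit message does not state: the deployment ID is hashed as its string bytes, and the four bytes are read big-endian. The IDs it prints may therefore differ from the values listed in the commit.

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
	"strings"
)

// deriveMeshName returns "<first-fqdn-label>-mesh".
func deriveMeshName(fqdn string) string {
	label, _, _ := strings.Cut(fqdn, ".")
	return label + "-mesh"
}

// deriveMeshID maps a deployment ID into [1, 252].
// Assumptions: the ID is hashed as its string bytes and the first four
// digest bytes are interpreted big-endian; neither is confirmed by the source.
func deriveMeshID(deploymentID string) uint32 {
	sum := sha256.Sum256([]byte(deploymentID))
	return binary.BigEndian.Uint32(sum[:4])%252 + 1
}

func main() {
	fmt.Println(deriveMeshName("t105.omani.works"))
	id := deriveMeshID("98395b3d9bd9c1aa")
	// Secondaries increment from the primary's ID, so with up to three
	// regions the highest ID stays at or below 254.
	fmt.Println(id, id+1, id+2)
}
```

Keeping the base ID at or below 252 leaves room for two secondary increments without reaching 255, and the `+ 1` keeps it away from 0, which the commit reports Cilium rejects as reserved.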
e3mrah	1dc21bfd51	fix(cloud-init): accept Hetzner DHCP routes on private NIC (use-routes: true) (#1489 ) The netplan stanza for the hot-attached private NIC had `dhcp4-overrides.use-routes: false`, which discards Hetzner DHCP's classless static routes. Result: the interface gets `10.0.1.2/32` (host route only) with NO route for the 10.0.0.0/8 private network. The kernel routes all return traffic (including SYN-ACK to the Hetzner LB at 10.0.1.254) via eth0's default route — the public NIC. Hetzner LB's health check on private network gets the SYN forwarded, but the SYN-ACK arrives via the wrong NIC; Hetzner drops it as asymmetric. Target stays `unhealthy` forever on every service port. Caught live on prov 6dfade27 (omani.works, 2026-05-14): all 3 region LBs marked unhealthy on 53/80/443 — public surface blackholed despite 3-region × 45/45 HRs Ready + valid PROD cert + envoy listening on 0.0.0.0:30443. Confirmed via tcpdump on the host: enp7s0 In 10.0.1.254.X > 10.0.1.2:30443 [S] ← SYN arrives on private eth0 Out 10.0.1.2:30443 > 10.0.1.254.X [S.] ← SYN-ACK on wrong NIC Fix: change to `use-routes: true`. Hetzner DHCP-provided routes have higher metric than eth0's default (metric 100), so the public default stays intact; we only gain the per-subnet 10.0.0.0/N route needed for symmetric routing on the private NIC. Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-14 22:52:01 +04:00
e3mrah	cebc9542d7	fix(cloudinit): escape ${WILDCARD_CERT_ISSUER} reference in comment so templatefile() doesn't try to interpolate it (#1485 ) OpenTofu's `templatefile()` parses `${...}` expressions everywhere in the template body — including comments. A comment on line 1072 of cloudinit-control-plane.tftpl referenced the Kustomization-time variable `${WILDCARD_CERT_ISSUER}` as documentation, but tofu reads it as a template var lookup → fails with `vars map does not contain key "WILDCARD_CERT_ISSUER"` → `tofu plan` exit 1. Fix: escape the documentation reference with `$${WILDCARD_CERT_ISSUER}` so it survives as literal text in the rendered file. The actual variable binding `WILDCARD_CERT_ISSUER: "${wildcard_cert_issuer}"` two lines below is unchanged (it correctly maps the lowercase tofu local to the uppercase Kustomization postBuild key). Caught live on prov #81 (omani.works), the first provision after #1481 landed the WILDCARD_CERT_ISSUER threading. omantel.biz had been provisioned BEFORE #1481 merged so it never exercised the new tftpl path. Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-14 20:20:51 +04:00
e3mrah	a88e132be9	fix(tls): cilium-gateway-cert STAGING/PROD issuer selectable via tofu (#1481 ) clusters/_template/sovereign-tls/cilium-gateway-cert.yaml hardcoded letsencrypt-dns01-prod-powerdns regardless of qa_test_session_enabled. On high-cadence QA reprov cycles this hits the LE PROD 5/168h rate limit (caught on prov #76 at 13:45 UTC, retry-after 16:49 UTC) and the wildcard Certificate sticks Ready=False — Cilium Gateway has no valid TLS secret → envoy listener never binds → public TLS handshake to console.<fqdn> dies with SSL_ERROR_SYSCALL. Add tofu local.wildcard_cert_issuer = qa_test_session_enabled ? staging : prod. Thread WILDCARD_CERT_ISSUER through the sovereign- tls Kustomization postBuild.substitute. cilium-gateway-cert.yaml references it as ${WILDCARD_CERT_ISSUER}. Default behaviour unchanged for non-QA (production) Sovereigns — they still resolve to letsencrypt-dns01-prod-powerdns. Co-authored-by: e3mrah <catalyst@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-14 18:25:45 +04:00
e3mrah	a75463f76a	fix(cloud-init): wait for private NIC before k3s install (prov #71 ) (#1464 ) * fix(flow_snapshot): region-scope dep edges (no cross-region wiring) Founder caught on prov #66 (3dc9249ea73a6840, 2026-05-13): hel1-2's install-* nodes all rendered dep arrows pointing at PRIMARY's install nodes — cross-region edges where NAMING-CONVENTION §1.3 demands independent fault domains (no cross-region wiring). Root cause: helmwatch.Bridge persists secondary-region Jobs with bare dep names ("install-cilium") because HR.spec.dependsOn carries chart names without region context. The snapshot composer's normaliser turned `install-cilium` → `<depID>:install-cilium` which IS the primary's cilium JobID, not hel1-2's `<depID>:install-hel1-2/cilium`. Every secondary install therefore drew a phantom cross-region edge. Fix: in flow_snapshot_local.go, region-scope dep names when the source Job is regional: jobRegion=="hel1-2" + dep="install-cilium" → "install-hel1-2/cilium" → "<depID>:install-hel1-2/cilium" Same fix applied to the Layer-2 hrDeps derivation path (per-AppID lookup also gets bare chart names from the primary watcher). hrDeps lookup is now done with the unprefixed AppID so it actually hits. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(cloud-init): wait for private NIC before k3s install (prov #71) Hetzner Cloud hot-attaches the private-network NIC ~10-20s AFTER server create. cloud-init init-local fetches /hetzner/v1/metadata/private-networks BEFORE the NIC is ready, renders netplan with only eth0, and the private NIC (kernel-renamed eth1 → enp7s0 by udev) stays DOWN. Effect on secondary CPs: k3s server starts with --node-ip=10.0.<10+idx>.2 --advertise-address=10.0.<10+idx>.2 and fatals on "listen tcp 10.0.11.2:2380: bind: cannot assign requested address" then crashloops. 
Caught on prov #71/omantel.biz/nbg1-1-cp1: k3s.service restart counter reached 5394, kubeconfig never PUT back to mothership, canvas showed secondary region as a permanent black hole. Diagnosed via Hetzner rescue mode SSH 2026-05-14. Primary CP works by luck of faster fsn1 zone NIC attach. Fix: in cloud-init runcmd, BEFORE the k3s install, poll up to 120s for the expected private IP (control plane) or a route to it (worker). If the NIC appears DOWN with no netplan stanza, generate one with dhcp4:true and `netplan apply`. Bail loudly if the IP/route never appears — failures surface in cloud-init.log instead of disguising as a slow boot. Symmetric fix in worker template covers autoscaler-spawned secondary workers when worker_count > 0. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-14 07:39:25 +04:00
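The region-scoping rule from the flow_snapshot fix in this commit (`jobRegion=="hel1-2"` + `dep="install-cilium"` → `"install-hel1-2/cilium"` → `"<depID>:install-hel1-2/cilium"`) can be illustrated with a small sketch. The names `regionScopeDep` and `jobID` are hypothetical stand-ins, not the functions in flow_snapshot_local.go, and the `:` join in `jobID` is inferred from the ID shapes quoted in the commit.

```go
package main

import (
	"fmt"
	"strings"
)

const jobNamePrefix = "install-"

// regionScopeDep rewrites a dep name so it points at the same region as the
// source Job. An empty jobRegion means the primary region, whose dep names
// are left untouched.
func regionScopeDep(jobRegion, dep string) string {
	if jobRegion == "" {
		return dep
	}
	chart := strings.TrimPrefix(dep, jobNamePrefix)
	if strings.HasPrefix(chart, jobRegion+"/") {
		return jobNamePrefix + chart // already region-scoped; keep idempotent
	}
	return jobNamePrefix + jobRegion + "/" + chart
}

// jobID joins a deployment ID and a job name (assumed "<depID>:<name>" shape).
func jobID(deploymentID, name string) string {
	return deploymentID + ":" + name
}

func main() {
	scoped := regionScopeDep("hel1-2", "install-cilium")
	fmt.Println(scoped)
	fmt.Println(jobID("3dc9249ea73a6840", scoped))
}
```

Scoping the dep before building the job ID is what keeps every edge inside one region; an unscoped name resolves to the primary's Job and draws the cross-region arrow described above.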
e3mrah	32e0b408bf	fix(k3s): add public IP --tls-san + openova.io/region node label (#1459 ) Two related fixes for multi-region + qa-fixtures DoD on prov #64: 1. k3s TLS cert needs the public IPv4 in SAN. Mothership helmwatch.Bridge connects to secondary CPs via PUBLIC IP (cloud-init rewrites kubeconfig 127.0.0.1 → CP_PUBLIC_IPV4). k3s auto-generates the server cert with SANs from --tls-san flags. We only had [sovereign_fqdn, cp_private_ip] → cert valid for 10.0.10.2 + cluster-ip + 127.0.0.1 only. Bridge connection from contabo rejected with: "x509: certificate is valid for 10.0.10.2, 10.43.0.1, 127.0.0.1, ::1, not 204.168.212.113" → silent watcher failure → 0 secondary HRs observed → canvas missing region sub-groups. Fix: pre-fetch the CP's public IPv4 from Hetzner metadata before k3s install, add it as --tls-san=$CP_PUBLIC_IPV4. 2. openova.io/region=hz-fsn-rtz-prod node label. qa-fixtures Pods (CNPGPair primary/replica, status seeder Jobs, qa-wp Application) carry hard nodeAffinity for `openova.io/region in [hz-fsn-rtz-prod]` (per qaFixtures.primaryRegion default in products/catalyst/chart/templates/qa-fixtures/*.yaml). Without the label every fixture pod FailedScheduling → bp-catalyst- platform post-install hook waits forever → bootstrap-kit chain hangs at 44/45 with bp-catalyst-platform Running. Fix: --node-label openova.io/region=hz-fsn-rtz-prod on primary CP (qa-fixtures pin to primary by design). Both shipped in same commit since both are inside the same k3s server install line. Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-13 19:38:25 +04:00
e3mrah	44913d8a6a	fix(k3s): --kubelet-arg=max-pods=220 (CP + worker) for qa-fixtures load (#1458 ) prov #63 (cpx52 × 3, all PRs live): bp-catalyst-platform install hook timed out because the catalyst-api Helm-released pod stayed Pending with "Too many pods. 0/1 nodes are available". k3s kubelet default max-pods is 110. Full bootstrap-kit (~45 HR-managed deployments, each with 1-3 pods) + qa-fixtures stack (qa-omantel ns Application + Continuum + CNPGPair + PDM CRs + seeder Jobs) + Cilium/ flux/cnpg sidecars saturate the slot cleanly. With workers NotReady on prov #63 the CP carried everything alone and dropped scheduling at 110. Bump to 220 on both CP and worker so the saturation point doesn't gate the bootstrap chain. Safe ceiling: each Hetzner cpx52 node has 16 vCPU + 32GB RAM, plenty of headroom for 220 pods of typical bootstrap-kit weight. Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-13 18:37:42 +04:00
e3mrah	5f4f9f2cb5	fix(k3s): pin --node-ip + --advertise-address to cp_private_ip (#1457 ) prov #62 (cpx52, kernel 6.8.0-111): primary CP cilium init CrashLoop with "dial tcp 10.0.1.2:6443: i/o timeout". k3s server auto-detects its node IP from the primary interface, which on Hetzner cpx52 binds to the public IPv4 (49.x.x.x) instead of the private network IP (10.0.1.2). kube-apiserver advertises 49.x.x.x and binds there; nothing answers on 10.0.1.2:6443. Cilium agent's k8s-client wants the private IP from cilium-config k8sServiceHost — times out, CrashLoop. Worked by luck on cpx42 (earlier kernel + Hetzner network attach timing). cpx52 reproduces 100%. Fix: pass --node-ip=${cp_private_ip} + --advertise-address=${cp_private_ip} in INSTALL_K3S_EXEC. k3s then binds kube-apiserver on the private IP AND advertises it as the node's INTERNAL-IP. Pods reaching ${cp_private_ip}:6443 (cilium-config substitute) find the API server every time. Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-13 17:34:30 +04:00
e3mrah	68372d700b	fix(hetzner): pass cp_private_ip into secondary CP templatefile (multi-region prov #52-54 unblock) (#1448 ) * fix(infra): pass cp_private_ip to primary CP templatefile too PR #1446 added cp_private_ip references in cloudinit-control-plane.tftpl but only the SECONDARY templatefile call at main.tf:840 already had that var threaded. The PRIMARY CP call at line 342 was missed and tofu plan blew up with "vars map does not contain key cp_private_ip". Set it to "10.0.1.2" for the primary (the hardcoded value the chart default + worker_cloud_init already use for the canonical 10.0.1.0/24 primary subnet). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(hetzner): pass cp_private_ip into secondary-region CP cloud-init templatefile prov #52-54 all failed at `tofu plan` once cloudinit-control-plane.tftpl started consuming ${cp_private_ip} (PR #1446): Invalid value for "vars" parameter: vars map does not contain key "cp_private_ip", referenced at ./cloudinit-control-plane.tftpl:657,30-43. The primary CP templatefile call (main.tf:342) and the secondary WORKER templatefile call (main.tf:944) both pass `cp_private_ip`, but the secondary CP templatefile call (main.tf:860) was missed — every multi-region provision since PR #1446 lands here at plan-time. Fix: thread `cp_private_ip = local.secondary_region_cp_ips[k]` into the secondary CP templatefile so each secondary region's cilium-operator reaches its OWN local CP (matching CA), not the primary across regions. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 20:11:23 +04:00
e3mrah	be47815ddf	fix(infra): pass cp_private_ip to primary CP templatefile too (#1447 ) PR #1446 added cp_private_ip references in cloudinit-control-plane.tftpl but only the SECONDARY templatefile call at main.tf:840 already had that var threaded. The PRIMARY CP call at line 342 was missed and tofu plan blew up with "vars map does not contain key cp_private_ip". Set it to "10.0.1.2" for the primary (the hardcoded value the chart default + worker_cloud_init already use for the canonical 10.0.1.0/24 primary subnet). Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 20:01:43 +04:00
e3mrah	cdcc50a213	fix(multi-region): cilium k8sServiceHost uses LOCAL CP private IP per region (#1446 ) Each region's k3s is an INDEPENDENT cluster per NAMING-CONVENTION §1.3 "no stretched fault domain". Cilium on each region MUST talk to its OWN local CP's k3s API server, not the primary's 10.0.1.2. Three sites hardcoded the primary's IP: 1) Pre-Flux cilium helm install (cloudinit-control-plane.tftpl:665): `k8sServiceHost: 10.0.1.2` → `${cp_private_ip}` (rendered per-region by main.tf — primary 10.0.1.2, nbg1-1 10.0.11.2, hel1-2 10.0.12.2). 2) k3s install --tls-san=10.0.1.2 (line 1206): same `${cp_private_ip}` so each region's k3s API cert validates against the LOCAL CP's IP. 3) bp-cilium HelmRelease (clusters/_template/bootstrap-kit/01-cilium.yaml): add `k8sServiceHost: ${CILIUM_K8S_SERVICE_HOST:=10.0.1.2}` to the HR values so Flux postBuild.substitute can override per region. The cloud-init Kustomization renders the substitute var to `${cp_private_ip}`. Single-region (primary-only) provisions fall back to the default `10.0.1.2` and stay byte-identical to today. Live evidence of the bug — prov #52 (3-region) on 2026-05-12: cilium-operator on nbg1 secondary: "Establishing connection to apiserver" host="https://10.0.1.2:6443" "failed to start: ... tls: failed to verify certificate: x509: certificate signed by unknown authority" Each region's k3s has its OWN self-signed CA (cluster-init per CP). The primary's API cert isn't signed by the secondary's CA → cilium crash- loops → no CNI → flux controllers Pending → no HRs → canvas shows only primary's HRs. This fix points each region's cilium at the LOCAL CP, whose API server presents the matching CA from this cluster. Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 19:56:18 +04:00
e3mrah	19a847e514	fix(infra): restore \n escape in secondary CP templatefile regex (#1445 ) The conflict-resolution Python script in PR #1444 wrote a literal newline where the regex string needed the two-char "\n" escape. tofu init rejected with "Invalid multi-line string / Unterminated template string" on main.tf:925. Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 16:27:10 +04:00
e3mrah	4923938c2b	feat(multi-region-canvas): per-region kubeconfig PUT-back + per-region helmwatch (#1444 ) Operator mandate (2026-05-12): the mothership canvas must surface install-* HRs from EVERY region of a multi-region provision, not just the primary CP's. Today catalyst-api stores ONE kubeconfig per deployment (the primary CP's) and spawns ONE helmwatch.Bridge against it. Result: secondary regions are invisible on the canvas even though their k3s clusters are fully reconciling. End-to-end change across infra + handler: 1) cloud-init (cloudinit-control-plane.tftpl): the kubeconfig PUT URL appends `?region=<kubeconfig_postback_region>` when the var is set. main.tf templatefile call passes empty for primary CP, `each.key` (e.g. "nbg1-1", "hel1-2") for each secondary region. 2) PutKubeconfig handler: reads ?region= query param. Empty → primary path (unchanged: stores at <dir>/<id>.yaml, sets Result.KubeconfigPath, fires Phase-1 watch + SMTP seed). Non-empty → secondary path: stores at <dir>/<id>-<region>.yaml, populates Deployment.secondaryKubeconfigPaths[region]. Single-use guard is per-region (the same bearer secures every CP's PUT; secondaries reuse it for their own slot). NO Phase-1 watch re-launch from a secondary PUT. 3) phase1_watch.spawnSecondaryRegionWatchers: runs alongside the primary's watcher. Scans <kubeconfigsDir>/<id>-*.yaml every 15s, spawns one helmwatch.NewWatcher per kubeconfig discovered, stores the Watcher on Deployment.secondaryWatchers[region]. Per-region watchers emit ordinary helmwatch events with region-prefixed Component names so the wizard's per-component view doesn't collide primary vs secondary bp-cilium events. They do NOT contribute to markPhase1Done; outcome remains the primary's classification. 4) flow_snapshot_local.flowSnapshotFromJobs: composes per-region group bubbles + install-* nodes from each secondary watcher's SnapshotComponents. Node id: <depID>:<region>:install-<chart>. 
FlowNode.region set so the canvas can colour-group. Intra-region finish-to-start deps emitted from cs.DependsOn: same-region only, never cross-region (per NAMING-CONVENTION §1.3 independent fault domains, no stretched cluster). 5) wipe.go: removes both <id>.yaml AND every <id>-*.yaml secondary kubeconfig file on Sovereign wipe. Storage model is uniform across SME and corporate Sovereigns. No hardcoding of provider, region count, or building block. Caught after operator pointed out that 3-region prov #50 was showing only 52 install-* nodes (all from fsn1) on the canvas: the architectural gap. Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 16:12:38 +04:00
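The storage rule in step 2 of this commit (empty `?region=` → `<dir>/<id>.yaml`, non-empty → `<dir>/<id>-<region>.yaml`) is sketched below. `kubeconfigPath` is a hypothetical helper, not the PutKubeconfig handler itself, and the rejection of regions containing path separators is an extra safeguard assumed for this sketch rather than something the commit describes.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// kubeconfigPath picks the on-disk location for a PUT-back kubeconfig.
// An empty region selects the primary slot; anything else selects the
// per-region secondary slot.
func kubeconfigPath(dir, id, region string) (string, error) {
	if region == "" {
		return filepath.Join(dir, id+".yaml"), nil
	}
	// Assumed safeguard: region comes from a query parameter, so refuse
	// values that could escape the kubeconfigs directory.
	if strings.ContainsAny(region, `/\`) || strings.Contains(region, "..") {
		return "", fmt.Errorf("invalid region %q", region)
	}
	return filepath.Join(dir, id+"-"+region+".yaml"), nil
}

func main() {
	for _, region := range []string{"", "nbg1-1", "hel1-2"} {
		p, err := kubeconfigPath("/var/lib/catalyst/kubeconfigs", "ce25c31fff15c30c", region)
		if err != nil {
			panic(err)
		}
		fmt.Println(p)
	}
}
```

Deriving the path purely from the deployment ID and region key means the watcher scan and the wipe step can find secondary files by name alone, with no separate index to keep in sync.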
e3mrah	c5d891ad0b	fix(infra): forward hcloud_*_name to secondary regions' CP cloud-init (#1443 ) The F7 fix (Issue #1778) added hcloud_network_name / hcloud_firewall_name / hcloud_ssh_key_name to cloudinit-control-plane.tftpl so the cluster autoscaler could attach scale-up VMs to the private network. The primary CP's templatefile call at main.tf:483-485 was updated, but the matching call for secondary regions at main.tf:899 was missed. Result: any provision with regions[] of length > 1 fails at tofu plan with "vars map does not contain key hcloud_network_name" referenced in cloudinit-control-plane.tftpl:478. Hit live on prov #47 (ce25c31fff15c30c, 4-region: fsn1/nbg1/hel1/ash) at T+0:47. Forward the same three resource refs to every secondary region's templatefile call. Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 15:23:53 +04:00
e3mrah	b743b646ac	fix(autoscaler): attach scale-up VMs to private network so they k3s-join (#1427 ) Root cause (autoscaler pod log, prov #43 chroot): W orchestrator.go:626 Node group workers is not ready for scaleup - backoff with status: Scale-up timed out for node group workers after 15m2.273255226s Hetzner API confirms autoscaler-spawned workers come up PUBLIC-ONLY: workers-77439321e2047e3e public_net.ipv4=178.105.102.237 private_net=[] workers-a6410e81b24cced public_net.ipv4=178.105.73.210 private_net=[] The worker cloud-init (identical to Phase-0 user_data) issues curl -sfL https://get.k3s.io \| K3S_URL=https://10.0.1.2:6443 ... sh - against the CP's PRIVATE 10.0.1.2 IP. Without the 10.0.0.0/16 attachment that URL is unreachable → k3s agent install silent-fails → node never registers with apiserver → autoscaler 15m timeout → backoff → bp-catalyst- platform Pending Pods never schedulable → chroot canvas tests blocked. Fix: wire HCLOUD_NETWORK / HCLOUD_FIREWALL / HCLOUD_SSH_KEY env vars on the cluster-autoscaler deployment so the Hetzner provider attaches every scale-up VM to the SAME private network + firewall + ssh-key the Phase-0 Tofu module created (resource names: catalyst-<sov-fqdn-with-dashes>-net / -fw / catalyst-<sov-fqdn-with-dashes>). Names flow: Tofu (hcloud_network.main.name + hcloud_firewall.main.name + hcloud_ssh_key.main.name) → cloudinit-control-plane.tftpl (3 new template vars) → /var/lib/catalyst/cloud-credentials-secret.yaml (3 new keys) → flux-system/cloud-credentials Secret → bp-cluster-autoscaler-hcloud HelmRelease valuesFrom (3 optional entries with targetPath: cluster-autoscaler.extraEnv.HCLOUD_*) → upstream chart's deployment env Chart bumped 1.2.0 → 1.3.0. New smoke-test gates (Cases 5+6) prevent regression of the three env-var slots in chart values.yaml. Reaffirms canonical seam: values flow through Tofu → cloud-init → flux-system Secret → Flux valuesFrom → chart values → upstream env. 
Never via kubectl patch, never via bespoke Go API calls. Refs: prov #38/#39/#41/#43 omantel.biz scale-up backoff. Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 06:11:30 +04:00
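The resource-name pattern quoted in this commit (`catalyst-<sov-fqdn-with-dashes>-net` / `-fw` / `catalyst-<sov-fqdn-with-dashes>`) can be sketched as a pure function. In the repo these names come from the Tofu resources, not from Go, so `hcloudResourceNames` is a hypothetical illustration, and it assumes "with dashes" means every dot in the FQDN is replaced by a dash.

```go
package main

import (
	"fmt"
	"strings"
)

// hcloudResourceNames returns the network, firewall, and SSH key names that
// the autoscaler env vars (HCLOUD_NETWORK, HCLOUD_FIREWALL, HCLOUD_SSH_KEY)
// are expected to carry for a given Sovereign FQDN.
func hcloudResourceNames(sovereignFQDN string) (network, firewall, sshKey string) {
	base := "catalyst-" + strings.ReplaceAll(sovereignFQDN, ".", "-")
	return base + "-net", base + "-fw", base
}

func main() {
	network, firewall, sshKey := hcloudResourceNames("omantel.biz")
	fmt.Println("HCLOUD_NETWORK=" + network)
	fmt.Println("HCLOUD_FIREWALL=" + firewall)
	fmt.Println("HCLOUD_SSH_KEY=" + sshKey)
}
```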
e3mrah	22855e62d8	feat(openova-flow): catalyst-api proxy + cloud-init thread (Agent #3 — integrator, infra-side) (#1396 ) Final integration piece for OpenovaFlow infrastructure path — catalyst-api proxy + cloud-init substitution for SOVEREIGN_DEPLOYMENT_ID + SOVEREIGN_REGION_KEY, so bp-openova-flow-emitter (slot 57) emits distinct region tags on every FlowNode and the snapshot returns 2× per HR on a multi-region Sovereign. Builds on PR #1389 (TS core + canvas packages on disk), PR #1390 (Go server + flux adapter + bootstrap-kit slots 56/57), PR #1394 (catalyst- ui temporary revert until npm workspaces land), PR #1395 (chart no-op). ## Scope vs original Agent #3 brief The brief planned a 4-section PR (proxy + cloud-init + FlowPage rewire + runbook). Section 3 (catalyst-ui rewire of @openova/flow-*) is deferred: PR #1394 reverted Agent #1's UI wiring because the Docker UI build has no node_modules for the cross-workspace canvas source. Founder note on #1394: "Agent #3 (or a follow-up) will re-wire them properly once npm workspaces are configured at repo root." This PR ships the infrastructure half (proxy + cloud-init + runbook). The canvas-side rewire is a separate follow-up PR that needs npm workspaces, not surgical edits to FlowPage. ## What ships ### 1. catalyst-api proxy /api/v1/flows/{deploymentId}/{snapshot,stream,events} products/catalyst/bootstrap/api/internal/handler/openova_flow_proxy.go: - GET /snapshot — JSON pass-through, headers + status forwarded - GET /stream — unbuffered SSE pass-through using http.Flusher (NOT httputil.ReverseProxy; that buffers and breaks text/event-stream) - POST /events — body forwarded byte-for-byte - Upstream URL from env OPENOVA_FLOW_SERVER_URL (default Sovereign in-cluster Service DNS) Routes registered in cmd/api/main.go inside the auth-gated chi.Group. 
11 table-driven tests cover snapshot/events/stream pass-through, upstream 404/400/unreachable propagation, empty-deploymentId guard, SSE frames arrive AS EMITTED, and env-default fallback. ### 2. Cloud-init threads SOVEREIGN_DEPLOYMENT_ID + SOVEREIGN_REGION_KEY - infra/hetzner/cloudinit-control-plane.tftpl — two new postBuild. substitute keys alongside SOVEREIGN_FQDN/SOVEREIGN_LB_IP - infra/hetzner/main.tf — primary CP renders var.region as region key; secondary CP renders each.key (e.g. "hel1-1") from for_each over local.secondary_regions - infra/hetzner/variables.tf — new sovereign_deployment_id var (string, default "" for tofu mocks) - provisioner.go writeTfvars — writes vars["sovereign_deployment_id"] = req.DeploymentID - bootstrap-kit slot 57 — swap placeholder ${SOVEREIGN_FQDN} / literal "primary" for the new ${SOVEREIGN_DEPLOYMENT_ID} / ${SOVEREIGN_REGION_KEY} envsubst keys ### 3. Deployment record flag handler/deployments.go State() — emits `openovaFlowEnabled: true` on every deployment. The catalyst-ui rewire (follow-up PR) will read this to enable the openova-flow-server adapter; legacy provisions without the flag will keep the bridge once the rewire lands. ### 4. Verification runbook docs/runbooks/openova-flow-multi-region-verify.md — prov #34 POST body (multi-region cpx42 fsn1+hel1, qaTestEnabled=true, sovereignFQDN=omantel.biz), step-by-step kubectl/curl gates, visual canvas checks (gated on the follow-up UI rewire), and a failure-class triage table. ## Canonical-seam citations 1. SSE pattern — products/catalyst/bootstrap/api/internal/handler/ deployments.go:1244-1287 (StreamLogs): identical Content-Type + Cache-Control + X-Accel-Buffering header set; identical http.Flusher.Flush() after each write; identical r.Context().Done() cancel path. 2. 
postBuild.substitute pattern — infra/hetzner/cloudinit-control-plane.tftpl:884-893 (SOVEREIGN_FQDN + SOVEREIGN_LB_IP): same indentation, same KEY: ${var} form, dual emission at primary + secondary CP for_each in main.tf. ## Verification ``` $ go build ./... (clean) $ go vet ./... (clean) $ go test ./internal/handler/ -run TestFlowProxy -count=1 -race ok github.com/openova-io/openova/products/catalyst/bootstrap/api/internal/handler 1.410s $ go test ./internal/provisioner/... -count=1 ok github.com/openova-io/openova/products/catalyst/bootstrap/api/internal/provisioner 0.025s ``` 3 pre-existing test failures (TestHandleWhoami_NoRBACOmitsFields, TestHandleWhoami_PinSessionRBACClaims, TestUnstructuredToUserAccess_NilApplicationsBecomesEmpty) reproduce on main HEAD without this PR — unrelated baseline state. Co-authored-by: hatiyildiz <269457768+hatiyildiz@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-11 16:01:09 +04:00
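The unbuffered SSE pass-through described for GET /stream can be sketched as below. This is a minimal illustration, not the contents of openova_flow_proxy.go: the names `streamSSE` and `relayOnce` are hypothetical, and the header values are assumptions (the message names the header set but not its values).

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// streamSSE relays an upstream text/event-stream body to the client and
// flushes after every line, so frames arrive as emitted instead of being
// held in a buffer. Hypothetical helper; header values are assumed.
func streamSSE(w http.ResponseWriter, r *http.Request, upstreamURL string) {
	flusher, ok := w.(http.Flusher)
	if !ok {
		http.Error(w, "streaming unsupported", http.StatusInternalServerError)
		return
	}
	// Tie the upstream request to the client context so a disconnect cancels it.
	req, err := http.NewRequestWithContext(r.Context(), http.MethodGet, upstreamURL, nil)
	if err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		http.Error(w, "upstream unreachable", http.StatusBadGateway)
		return
	}
	defer resp.Body.Close()

	w.Header().Set("Content-Type", "text/event-stream")
	w.Header().Set("Cache-Control", "no-cache")
	w.Header().Set("X-Accel-Buffering", "no")
	w.WriteHeader(resp.StatusCode)

	reader := bufio.NewReader(resp.Body)
	for {
		line, readErr := reader.ReadBytes('\n')
		if len(line) > 0 {
			if _, writeErr := w.Write(line); writeErr != nil {
				return // client went away
			}
			flusher.Flush()
		}
		if readErr != nil {
			return // io.EOF or cancelled context
		}
	}
}

// relayOnce wires a fake upstream and the proxy together and returns
// what a client reads through the proxy.
func relayOnce(payload string) string {
	upstream := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "text/event-stream")
		io.WriteString(w, payload)
	}))
	defer upstream.Close()

	proxy := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		streamSSE(w, r, upstream.URL)
	}))
	defer proxy.Close()

	resp, err := http.Get(proxy.URL)
	if err != nil {
		return ""
	}
	defer resp.Body.Close()
	body, _ := io.ReadAll(resp.Body)
	return string(body)
}

func main() {
	fmt.Printf("%q\n", relayOnce("data: one\n\ndata: two\n\n"))
}
```

Flushing per line keeps the relay byte-for-byte while still pushing each frame out as soon as it is read, which is the property the message says httputil.ReverseProxy does not give.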
e3mrah	4e6bec7022	fix(infra): body-supplied SKUs win over QA defaults (Fix #183 ) (#1386 ) * fix(catalyst-ui): delete malformed `import type from react` line (Fix #181) Fix #180 PR #1383 merged with sed -i error: produced `import type from 'react'` (empty import binding) which is a syntax error. Main build broken. This PR removes the malformed line entirely. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(infra): pin LB private IPs + revert hel1 zone (Fix #182) Root cause of prov #32 FATAL "hcloud/inlineAttachServerToNetwork: attach server to network: IP not available" on hcloud_server.control_plane[0]: hcloud_load_balancer_network.{main,secondary} both attached to the shared network WITHOUT an explicit `ip` argument. Hetzner auto-allocates the first free IP from the first matching-zone subnet. In the multi-region prov #32 the secondary LB-network (hel1) completed first at t+16s and took 10.0.1.2 from the only eu-central subnet existing at that moment (`main` = 10.0.1.0/24) — stealing the IP the primary CP claims explicitly via `ip = "10.0.1.${count.index + 2}"`. Fix: pin LB anchors to top-of-subnet (.254) so they live outside the CP/worker IP range (.2..N for CPs, .10+ for workers). Also revert Fix #179 (`hel1 = "eu-north"`). Hetzner /v1/locations API on 2026-05-11 returns network_zone=eu-central for hel1. Fix #179 caused prov #32's secondary subnet to fail with `invalid input in field 'network_zone' [network zone does not exist]`. The original prov #29/#30 "IP not available on secondary[hel1-1]" was the same LB-IP collision — this PR resolves both. 
Multi-region apply now lands cleanly: 10.0.1.2 -> primary CP (cp1) 10.0.1.254 -> primary LB anchor 10.0.10.2 -> secondary CP (hel1-1) 10.0.10.254 -> secondary LB anchor (hel1-1) Refs: openova-private prov-loop session 2026-05-11 Wave 26 * fix(infra): body-supplied SKUs win over QA defaults (Fix #183) Fix #157 introduced `effective_cp_size = coalesce(var.qa_control_plane_size, var.control_plane_size)` when qa_fixtures_enabled='true'. Because qa_control_plane_size has a non-empty default (cpx32), coalesce always returned the QA default and silently overrode whatever the body supplied in `controlPlaneSize`. Founder-supplied body for prov #32 specified `controlPlaneSize: "cpx42"` explicitly (cheapest viable for the founder's collapsed-CP+worker single-node-per-region topology with workerCount=0). The QA-default override downgraded that to cpx32 at plan time — the explicit choice never made it onto the hardware. Fix #183 — invert the coalesce so body wins: effective_cp_size = local.qa_mode ? coalesce(var.control_plane_size, var.qa_control_plane_size) : var.control_plane_size `provisioner.go` writeTfvars already emits control_plane_size / worker_size only when the body's field is non-empty (so `var.control_plane_size` inherits variables.tf's cost-optimised default when the body left it blank). That means `coalesce(var.control_plane_size, var.qa_*)` always has a non-empty first arg in normal flow; the QA-default fallback only fires on a zero-override QA call that intentionally leaves the SKU empty. No change to customer-Sovereign behaviour (qa_fixtures_enabled='false' branch already used `var.control_plane_size` verbatim). Refs: openova-private prov-loop session 2026-05-11 Wave 26 --------- Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-11 13:04:41 +04:00
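The coalesce inversion in Fix #183 can be modelled in a few lines of Go. This is a sketch of the selection logic only; the function names are hypothetical, and unlike tofu's coalesce (which errors when every argument is empty) this helper returns an empty string.

```go
package main

import "fmt"

// coalesce returns the first non-empty string, or "" if all are empty.
// Assumption: tofu's coalesce would raise an error in the all-empty case.
func coalesce(values ...string) string {
	for _, v := range values {
		if v != "" {
			return v
		}
	}
	return ""
}

// legacyEffectiveCPSize mirrors the Fix #157 ordering: the QA default comes
// first, so a non-empty QA default always wins in QA mode.
func legacyEffectiveCPSize(qaMode bool, bodySize, qaDefault string) string {
	if qaMode {
		return coalesce(qaDefault, bodySize)
	}
	return bodySize
}

// effectiveCPSize mirrors the Fix #183 ordering: the body-supplied size
// comes first, and the QA default is only a fallback.
func effectiveCPSize(qaMode bool, bodySize, qaDefault string) string {
	if qaMode {
		return coalesce(bodySize, qaDefault)
	}
	return bodySize
}

func main() {
	fmt.Println(legacyEffectiveCPSize(true, "cpx42", "cpx32")) // QA default overrides the body
	fmt.Println(effectiveCPSize(true, "cpx42", "cpx32"))       // body wins
}
```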
e3mrah	515c3cf38d	fix(infra): pin LB private IPs + revert hel1 zone (Fix #182 ) (#1385 ) * fix(catalyst-ui): delete malformed `import type from react` line (Fix #181) Fix #180 PR #1383 merged with sed -i error: produced `import type from 'react'` (empty import binding) which is a syntax error. Main build broken. This PR removes the malformed line entirely. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(infra): pin LB private IPs + revert hel1 zone (Fix #182) Root cause of prov #32 FATAL "hcloud/inlineAttachServerToNetwork: attach server to network: IP not available" on hcloud_server.control_plane[0]: hcloud_load_balancer_network.{main,secondary} both attached to the shared network WITHOUT an explicit `ip` argument. Hetzner auto-allocates the first free IP from the first matching-zone subnet. In the multi-region prov #32 the secondary LB-network (hel1) completed first at t+16s and took 10.0.1.2 from the only eu-central subnet existing at that moment (`main` = 10.0.1.0/24) — stealing the IP the primary CP claims explicitly via `ip = "10.0.1.${count.index + 2}"`. Fix: pin LB anchors to top-of-subnet (.254) so they live outside the CP/worker IP range (.2..N for CPs, .10+ for workers). Also revert Fix #179 (`hel1 = "eu-north"`). Hetzner /v1/locations API on 2026-05-11 returns network_zone=eu-central for hel1. Fix #179 caused prov #32's secondary subnet to fail with `invalid input in field 'network_zone' [network zone does not exist]`. The original prov #29/#30 "IP not available on secondary[hel1-1]" was the same LB-IP collision — this PR resolves both. 
Multi-region apply now lands cleanly: 10.0.1.2 -> primary CP (cp1) 10.0.1.254 -> primary LB anchor 10.0.10.2 -> secondary CP (hel1-1) 10.0.10.254 -> secondary LB anchor (hel1-1) Refs: openova-private prov-loop session 2026-05-11 Wave 26 --------- Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-11 13:00:50 +04:00
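The address plan above is plain arithmetic per /24 subnet, sketched here in Go. The helper names are hypothetical; the offsets (.2 + index for control planes, .254 for the load-balancer anchor) are the ones stated in the message.

```go
package main

import "fmt"

// controlPlaneIP returns the pinned private IP for the CP with the given
// zero-based index, matching ip = "10.0.1.${count.index + 2}".
func controlPlaneIP(subnetPrefix string, index int) string {
	return fmt.Sprintf("%s.%d", subnetPrefix, index+2)
}

// lbAnchorIP returns the top-of-subnet address reserved for the load
// balancer, outside the CP (.2..N) and worker (.10+) ranges.
func lbAnchorIP(subnetPrefix string) string {
	return fmt.Sprintf("%s.254", subnetPrefix)
}

func main() {
	fmt.Println(controlPlaneIP("10.0.1", 0), lbAnchorIP("10.0.1"))
	fmt.Println(controlPlaneIP("10.0.10", 0), lbAnchorIP("10.0.10"))
}
```

Because the anchor is pinned, the order in which the LB-network attachments complete no longer decides which resource gets 10.0.1.2.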
e3mrah	7aa1b24c0d	fix(infra/hetzner): hel1 network_zone is eu-north not eu-central (#179 ) (#1381 ) prov #29 + prov #30 both failed at +90s with: Error: hcloud/inlineAttachServerToNetwork: attach server to network: IP not available (ip_not_available, ...) with hcloud_server.secondary_control_plane["hel1-1"] Root cause: `local.hetzner_network_zones` hardcoded `hel1 = "eu-central"`. Helsinki is physically in Hetzner's eu-north zone (Finland), not eu-central (Falkenstein/Nuremberg). Hetzner subnets are zone-bound: when the secondary hel1 subnet is created with network_zone=eu-central, the subnet exists but attaching a server in location=hel1 (physical eu-north) returns ip_not_available because cross-zone attach isn't supported. Fix: hel1 -> eu-north. Caught live on prov #29 + #30 (omantel.biz 2-region fsn1+hel1 reprov, both failed at the same line 872 secondary CP attach). Per CLAUDE.md ARCHITECT-FIRST: Hetzner publishes zone-region mapping at https://docs.hetzner.com/cloud/general/locations/; hel1 is unambiguously listed under eu-north. Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-11 12:26:18 +04:00
e3mrah	8308f53e32	fix(infra/hetzner): auto-flip QA Sovereigns to cpx32/cpx42 nodes (Fix #157 ) (#1360 ) 12 of 12 fresh Sovereign provisions in the 2026-05-10 bounded-cycle session wedged on the production cpx22 CP / cpx32 worker defaults (memory entry: "provision #5 cpx22 OOM" + handover doc). Root cause: the CP's documented ~3.5GB k3s+cilium+flux+cert-manager+sealed-secrets working set leaves zero RAM headroom for Flux source-controller's ~700MB burst during the 44-slot bootstrap-kit apply, while two cpx32 workers (8GB each) cannot satisfy the simultaneous request set from bp-keycloak (2Gi JVM) + bp-harbor (~2.5Gi across 6 sub-components) + bp-cnpg primary + bp-openbao 3-replica Raft once the qaFixtures Continuum + CNPGPair + status-seeder Jobs queue. Mirrors the Fix #123 pattern (wildcard_cert_use_staging) — auto-flips ONLY when qa_fixtures_enabled='true'. Customer-facing Sovereigns (SME / marketplace / admin / console) provision with qa_fixtures_ enabled='false' so coalesce() in main.tf falls back to the existing cpx22/cpx32 defaults; the production code path is untouched. - variables.tf: qa_control_plane_size (default cpx32), qa_worker_size (default cpx42) with the same Hetzner SKU regex validation as the production size variables. - main.tf: locals.qa_mode + locals.effective_cp_size + locals. effective_worker_size; hcloud_server.control_plane and .worker read the effective locals so QA Sovereigns auto-flip and customer Sovereigns plan-clean unchanged. - tests/multi_region.tftest.hcl: three new run blocks pin the contract — qa_mode=false keeps cpx22/cpx32, qa_mode=true flips to cpx32/cpx42 defaults, qa_mode=true respects explicit operator overrides (no hardcoded SKU per docs/INVIOLABLE-PRINCIPLES.md #4). Per principle 17 (isolated worktree) shipped from .claude/worktrees/ qa-node-sizing-157. Per principle 4 (target-state) attacks the systemic OOM-cascade root cause rather than another per-blueprint timeout bandaid. 
Per principle 16 (canonical seam) the SKU choice lives in variables.tf defaults + per-resource selection in main.tf; no other path mutates server_type. Per principle 18 no SKU is hardcoded — every value is operator-overridable. Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-11 10:04:44 +04:00
e3mrah	901afa2a95	fix(infra/hetzner): add skip_region_validation=true to aws provider for Hetzner regions (#135 ) (#1344 ) Fix #133 (PR #1343) swapped aminueza/minio for hashicorp/aws to bypass DeleteBucketPolicy AccessDenied. Worked for the bucket creation API, but the aws provider's region validator runs at provider-init time and rejects Hetzner regions (fsn1/nbg1/hel1) before any S3 call: Error: invalid AWS Region: fsn1 provider["registry.opentofu.org/hashicorp/aws"] Reproduced on prov #19 (02c23fc20df90629) — failed at `tofu plan` in 96s. Companion to the existing skip_credentials_validation + skip_metadata_api_check + skip_requesting_account_id flags that already disable the other AWS-specific preflight checks the Hetzner endpoint can't satisfy. skip_region_validation=true tells the provider not to compare the region string against AWS's hardcoded region list; the region is still passed through to the S3 SDK (used as the SigV4 signing region) which is what Hetzner expects. Per CLAUDE.md principle 16: same canonical seam as the other skip_* flags in the same provider block — this is the missing fourth flag in the standard "non-AWS S3-compatible backend" pattern. Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-11 04:12:50 +04:00
e3mrah	5d43cf7b53	fix(infra/hetzner): swap aminueza/minio for hashicorp/aws to escape AccessDenied wedge (#133 ) (#1343 ) Root cause of provisions #13 / #17 failing in <2 min at `tofu apply` with: [FATAL] [ACL] Unable to create bucket (catalyst-omantel-biz-<id>): unable to remove bucket policy: Access Denied. `aminueza/minio v3.34.0`'s `minio_s3_bucket` Create handler calls `DeleteBucketPolicy` post-create as part of state normalization (the provider treats "no policy" as the canonical zero state and forcibly clears any inherited policy). Hetzner Object Storage's standard read/write credentials don't grant `s3:DeleteBucketPolicy`, so the call fails AccessDenied EVERY TIME -- the bucket IS created on Hetzner's side but tofu marks the resource as failed and rolls back the apply, blocking every fresh Sovereign provision from reaching Phase 1. The wedge is deterministic, not flaky. Provider swap rationale -- `hashicorp/aws` configured against Hetzner's S3 endpoint speaks vanilla S3 and does NOT do any post-create policy normalization. A successful CreateBucket is the terminal state for `aws_s3_bucket` Create. Hetzner officially documents AWS CLI / SDK as a supported S3 client (see https://docs.hetzner.com/storage/object-storage/getting-started/using-s3-api-tools/), so this is the canonical-vendor path, not a workaround. Changes: * `versions.tf` -- drop `aminueza/minio`, add `hashicorp/aws ~> 5.0` pointed at `https://<region>.your-objectstorage.com` with `s3_use_path_style = true` and the four `skip_` flags that disable AWS-specific preflight calls (STS, IMDS) Hetzner doesn't implement. `main.tf` -- `minio_s3_bucket.main` -> `aws_s3_bucket.main` (no force_destroy preserved). Add `aws_s3_bucket_acl.main` for `private` (the bucket-level acl arg was removed in aws-provider 5.x). Updated comment block explains the AccessDenied root cause inline so future readers don't repeat the journey. * `outputs.tf` -- `minio_s3_bucket.main.bucket` -> `aws_s3_bucket.main.bucket`. 
* `variables.tf` -- prose-only updates pointing at the new provider + the fix-#133 root-cause note. * `tests/multi_region.tftest.hcl` -- override_resource swap from `minio_s3_bucket.main` to `aws_s3_bucket.main` + `aws_s3_bucket_acl.main` so the offline tftest mock path still bypasses provider validation. * `cloudinit-control-plane.tftpl` -- two comment lines updated to reference the new resource name (no behavioural change). * `.terraform.lock.hcl` -- removed (regenerated by `tofu init` against the new provider set; CI's `tofu init -backend=false` step relocks deterministically). Idempotency / state migration: * Fresh-provision-only path -- existing prov state lives in PDM and is recycled per provision. New provs: `tofu init` pulls the aws provider, `tofu apply` creates `aws_s3_bucket` with the same name Hetzner already owns and gets BucketAlreadyOwnedByYou (200, no-op in the AWS SDK). Idempotent. * Long-lived Sovereigns (sme/marketplace/admin/console -- protected per ADR-0001 §9.4) are NOT re-applied; their tofu state is stable. No `state mv` runbook is required. Test plan: * `tofu fmt -check -recursive` -- expected pass (manual indent matches fmt output). * `tofu validate` (CI's infra-hetzner-tofu workflow) -- expected pass. * `tofu test` against `tests/multi_region.tftest.hcl` -- expected pass on all 5 scenarios (mock_provider for hcloud + override_resource for the two new aws resources). * `tofu apply` is NOT runnable from this env (no Hetzner creds); CI's test-hetzner-e2e workflow exercises the live path on PR merge. Refs #133. Co-authored-by: Claude (e3mrah) <noreply@anthropic.com>	2026-05-11 03:59:15 +04:00
e3mrah	90aa2767da	fix(bp-cert-manager-powerdns-webhook,bp-catalyst-platform): staging ClusterIssuer for QA Sovereigns (Fix #123 , LE rate-limit bypass) (#1339 ) Root cause (qa-loop iter-1 wedge, 2026-05-10): Let's Encrypt production hit the 5-certs/168h rate limit on *.omantel.biz (retry after 2026-05-11 22:08 UTC). Cilium-envoy could not get a wildcard cert -> console.omantel.biz TLS handshake failed -> iter-1 Test Executor could not run. Customer Sovereigns are unaffected (one cert per registered domain in their lifetime), but QA Sovereigns wipe + re-provision dozens of times in a session and exhaust the production ceiling within hours. Fix (target-state, NOT workaround): - bp-cert-manager-powerdns-webhook 1.1.0 ships a SECOND ClusterIssuer (letsencrypt-dns01-staging-powerdns) alongside the existing production one. Same DNS-01 webhook config (same PowerDNS endpoint, same API key) -> only the ACME directory URL + account key differ. Both ClusterIssuers are real cert-manager resources; LE treats them as wholly independent issuers so a rate-limit hit on production does NOT block staging issuance. - bp-catalyst-platform 1.4.136 adds wildcardCert.useStaging (bool, default false). When true, sovereign-wildcard-certs.yaml renders Certificate(s) with issuerRef.name pointing at the staging issuer instead of production. - bootstrap-kit slot 13 wires WILDCARD_CERT_USE_STAGING via envsubst, same passthrough pattern as QA_FIXTURES_ENABLED. - catalyst-api auto-stamps wildcard_cert_use_staging="true" on QA Sovereigns (Request.QATestEnabled=true) so the per-Sovereign overlay flips both QA fixtures + staging certs from one wizard toggle. - tofu var wildcard_cert_use_staging propagates through main.tf into the cloudinit postBuild.substitute block on both primary + secondary regions. Result: cilium-envoy on a fresh QA Sovereign gets a staging-signed wildcard cert in <2min (no production rate limit). 
curl -sk + Playwright (ignoreHTTPSErrors:true) accept the cert; iter-1 Executor can run within minutes of provision. Customer Sovereigns (QATestEnabled= false) keep getting real-trusted production certs. Per docs/INVIOLABLE-PRINCIPLES.md #4 (never hardcode): every ACME URL + issuer name is values-overridable. Operators wiring a private staging ACME (e.g. internal Smallstep CA) override via per-Sovereign overlay without rebuilding any Blueprint. Staging is the documented LE pattern (https://letsencrypt.org/docs/staging-environment/), not a band-aid. _None directly -- infrastructure fix; bypasses Let's Encrypt 5/168h rate limit on QA Sovereigns by using staging ACME endpoint, enabling iter-1 to run within minutes of fresh provision_ Co-authored-by: alierenbaysal <159913086+alierenbaysal@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-11 01:08:07 +04:00
e3mrah	3a5d9fc102	fix(infra,catalyst-api provisioner): tftpl CI guard + bucket-name suffix (Fix #101 followup, Fix #111 ) (#1331 ) Two infrastructure-hardening fixes that together eliminate ~30 min of provision-cycle waste per regression event documented in Fix #101. ## Fix A — CI guard against unescaped tftpl shell expansion Adds a grep-based step to .github/workflows/infra-hetzner-tofu.yaml that scans every infra/hetzner/*.tftpl for unescaped \${VAR:-default} inside YAML comment lines. Uses PCRE negative-lookbehind so correctly escaped \$\${VAR:-default} (templatefile() literal-dollar) does not trip the guard. Background: PR #1311 (Fix #73) added a YAML comment with bare \${QA_FIXTURES_ENABLED:-false}. tofu's templatefile() parses ALL \${...} sequences regardless of YAML/HCL/shell context; the colon in the interpolation hits HCL's reserved conditional grammar and crashes 'tofu plan' with "Template interpolation doesn't expect a colon at this location". Prov #9 (4204f0b0c5e37a80) wasted ~30 min before PR #1328 fixed the one offender. Without the guard, the next operator who adds a similar comment repeats the incident. Documented in infra/hetzner/README.md so editors learn the \$\$ escape pattern before they trip the CI gate. ## Fix B — bucket-name suffix to escape global Hetzner namespace Hetzner Object Storage bucket names share a GLOBAL namespace across every tenant. The previous BucketNameForSovereign(fqdn) derivation 'catalyst-<fqdn-with-dashes>' would collide on the second CreateDeployment for the same FQDN (re-provision after wipe, two operators on adjacent pools, race conditions) and the second 'tofu apply' would fail with BucketAlreadyExists. Change BucketNameForSovereign signature to (fqdn, deploymentID) and append the first 8 chars of the deployment-id as a suffix: catalyst-omantel-omani-works-b3b837a2 newID() already returns 16-hex random — the leading 8 chars are 32 bits of fresh entropy, enough to make collisions cryptographically negligible. 
Backward-compat: empty deploymentID (legacy on-disk records) falls back to first-8-hex of sha256(fqdn) so wipes of pre-Fix-111 Sovereigns remain deterministic. Call-sites updated: - handler/deployments.go: id := newID() moved before bucket-name derivation; uses hetzner.BucketNameForSovereign - handler/wipe.go: passes dep.ID to PurgeBuckets and to BucketNameForSovereign in the report - hetzner/buckets.go: PurgeBuckets signature now takes deploymentID; bucketSuffix() handles the fallback Tests: - hetzner/buckets_test.go: 6-case TestBucketNameForSovereign table covers canonical newID() shape, collision avoidance, uppercase normalisation, empty + non-hex fallback paths. New TestBucketNameForSovereign_CollisionAvoidance asserts the Fix #111 invariant directly. - handler/deployments_test.go: TestCreateDeployment_DerivesObjectStorageBucketFromFQDN now asserts the suffixed shape against the actual dep.ID. - All produced names re-validated against the S3 bucket-naming RFC (mirrored regex from provisioner.s3BucketNamePattern). ## Claimed TCs _None directly — infrastructure hardening; eliminates 30+ min wasted per cycle from regressions like PR #1311 + bucket-collision_ ## Verification - go test ./internal/hetzner/... -run "Bucket" → 9/9 PASS - go test ./internal/handler/ -run "DerivesObjectStorageBucket" → PASS - go vet ./... → clean - go build ./... → clean - yaml.safe_load on workflow → clean - pre-existing handler-package fails (whoami, continuum-switchover) are unrelated and present on origin/main Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-10 23:31:56 +04:00
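A minimal sketch of the suffixed bucket-name derivation from Fix B follows. It is not the code in hetzner/buckets.go: the lower-cased function names are hypothetical, and hashing the lower-cased FQDN in the fallback is an assumption.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// bucketSuffix returns the first 8 hex chars of the deployment ID, or, when
// the ID is empty or not hex, the first 8 hex chars of sha256(fqdn).
func bucketSuffix(fqdn, deploymentID string) string {
	id := strings.ToLower(deploymentID)
	if len(id) >= 8 {
		if _, err := hex.DecodeString(id[:8]); err == nil {
			return id[:8]
		}
	}
	sum := sha256.Sum256([]byte(strings.ToLower(fqdn))) // assumption: FQDN is normalised first
	return hex.EncodeToString(sum[:])[:8]
}

// bucketNameForSovereign builds catalyst-<fqdn-with-dashes>-<suffix>.
func bucketNameForSovereign(fqdn, deploymentID string) string {
	base := strings.ReplaceAll(strings.ToLower(fqdn), ".", "-")
	return "catalyst-" + base + "-" + bucketSuffix(fqdn, deploymentID)
}

func main() {
	fmt.Println(bucketNameForSovereign("omantel.omani.works", "b3b837a2deadbeef"))
	fmt.Println(bucketNameForSovereign("omantel.omani.works", "")) // legacy fallback
}
```

Two deployments of the same FQDN get different names because their IDs differ, while a legacy record with no ID always maps to the same name.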
e3mrah	0843f02269	fix(infra/hetzner): escape ${VAR:-default} in tftpl comment (PROV-9 BLOCKER) (#1328 ) PR #1311 (Fix #73) added a YAML comment in cloudinit-control-plane.tftpl line 933 that referenced the envsubst placeholder \${QA_FIXTURES_ENABLED:-false}. tofu's templatefile() parses ALL \${...} sequences regardless of YAML/HCL/shell context, and the colon inside the interpolation makes it choke with: Extra characters after interpolation expression; Template interpolation doesn't expect a colon at this location. Result: every prov-* attempt since #1311 merged tofu-plans EXIT 1 in ~2 seconds. Prov #9 (4204f0b0c5e37a80) failed at 18:51 UTC with this error before any Hetzner resource was created. Fix: change \${QA_FIXTURES_ENABLED:-false} to \$\${QA_FIXTURES_ENABLED:-false} (HCL escape — \$\$ renders as a literal \$ in the cloud-init output, which envsubst then interprets at apply time). Same precedent: commit `7e5c4375` "escape \$ in tftpl comments referencing envsubst placeholders". This is a 1-char fix on a comment. No runtime behavior change. Unblocks the qa-loop bounded-provision-cycle. Refs Fix #98, Fix #95, Fix #73 (regression). Co-authored-by: e3mrah <alierenbaysal@gmail.com>	2026-05-10 22:53:49 +04:00
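The check that catches this class of regression can be sketched in Go. The CI guard described in #1331 is grep-based with a PCRE negative lookbehind; Go's regexp package has no lookbehind, so this sketch inspects the preceding byte instead. The function name and the exact placeholder pattern are assumptions.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// placeholder matches envsubst-style ${VAR:-default} sequences.
var placeholder = regexp.MustCompile(`\$\{[A-Za-z_][A-Za-z0-9_]*:-[^}]*\}`)

// unescapedPlaceholders returns every ${VAR:-default} on a YAML comment line
// that is not preceded by a second '$' (the templatefile() escape).
func unescapedPlaceholders(line string) []string {
	if !strings.HasPrefix(strings.TrimSpace(line), "#") {
		return nil // only comment lines are scanned in this sketch
	}
	var hits []string
	for _, loc := range placeholder.FindAllStringIndex(line, -1) {
		start := loc[0]
		if start > 0 && line[start-1] == '$' {
			continue // $${...} renders as a literal ${...}
		}
		hits = append(hits, line[start:loc[1]])
	}
	return hits
}

func main() {
	fmt.Println(unescapedPlaceholders("  # defaults to ${QA_FIXTURES_ENABLED:-false}"))
	fmt.Println(unescapedPlaceholders("  # defaults to $${QA_FIXTURES_ENABLED:-false}"))
}
```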
e3mrah	b22975cb4b	fix(catalyst-api provisioner): qaTestEnabled flag auto-sets QA_FIXTURES_ENABLED for QA Sovereigns (qa-loop bounded-cycle Fix #73 ) (#1311 ) Provision #7 came up zero-touch but the bp-catalyst-platform qaFixtures stack stayed off because the chart template defaults to ${QA_FIXTURES_ENABLED:-false} and the catalyst-api provisioner never threaded the toggle. Result: ~140 of the qa-loop matrix's TCs were inherently fixture-blocked on every QA Sovereign. Canonical seam: provisioner.Request struct. New fields: - QATestEnabled bool `json:"qaTestEnabled"` (default false) - QAFixturesNamespace string `json:"qaFixturesNamespace,...` (default derived) - QAOrganization string `json:"qaOrganization,...` (default derived) When QATestEnabled=true, writeTfvars emits qa_fixtures_enabled="true" + qa_test_session_enabled="true" plus qa_fixtures_namespace + qa_organization derived from SovereignFQDN's first label per docs/INVIOLABLE-PRINCIPLES.md #4 (never hardcode): omantel.biz -> qa-omantel / omantel-platform qa.example.com -> qa-qa / qa-platform demo.openova.io -> qa-demo / demo-platform Customer Sovereigns provision with QATestEnabled=false (default) -> no qa-fixture artifacts on production tenants. Wiring: 1. internal/provisioner/provisioner.go Request struct + writeTfvars() + deriveQAFixturesNamespace + deriveQAOrganization + firstFQDNLabel 2. infra/hetzner/variables.tf 4 new tofu vars (string, true\|false validated) 3. infra/hetzner/cloudinit-control-plane.tftpl QA_FIXTURES_ENABLED / QA_TEST_SESSION_ENABLED / QA_FIXTURES_NAMESPACE / QA_ORGANIZATION substitute envvars on bootstrap-kit Kustomization 4. infra/hetzner/main.tf pass new vars into both templatefile invocations (primary + per-secondary-region) 5. 
internal/provisioner/provisioner_test.go 3 new tests: - default-disabled invariant - enabled derivation matrix - operator-override-wins QA Sovereign provision command (catalyst-api): POST /api/v1/deployments { "sovereignFQDN": "omantel.biz", "qaTestEnabled": true, ... } Verified: go test ./products/catalyst/bootstrap/api/internal/provisioner/... ok (0.019s) Co-authored-by: hatiyildiz <hatiyildiz@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-10 21:08:35 +04:00
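The first-label derivation can be sketched as follows. The function names echo the ones listed in the wiring section, but the bodies are assumptions written to reproduce the three examples in the message, with an operator-supplied value taking precedence.

```go
package main

import (
	"fmt"
	"strings"
)

// firstFQDNLabel returns the left-most DNS label, lower-cased.
func firstFQDNLabel(fqdn string) string {
	return strings.ToLower(strings.SplitN(fqdn, ".", 2)[0])
}

// deriveQAFixturesNamespace returns the operator override when set,
// otherwise "qa-<first label>".
func deriveQAFixturesNamespace(fqdn, override string) string {
	if override != "" {
		return override
	}
	return "qa-" + firstFQDNLabel(fqdn)
}

// deriveQAOrganization returns the operator override when set,
// otherwise "<first label>-platform".
func deriveQAOrganization(fqdn, override string) string {
	if override != "" {
		return override
	}
	return firstFQDNLabel(fqdn) + "-platform"
}

func main() {
	for _, fqdn := range []string{"omantel.biz", "qa.example.com", "demo.openova.io"} {
		fmt.Println(fqdn, "->", deriveQAFixturesNamespace(fqdn, ""), "/", deriveQAOrganization(fqdn, ""))
	}
}
```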
e3mrah	fcfed6408c	feat(infra,cilium): wire Cilium ClusterMesh anchors via tofu→cloudinit→envsubst (#1101 ) (#1226 ) * feat(infra,cilium): wire Cilium ClusterMesh anchors via tofu→cloudinit→envsubst (#1101) Follow-up to #1223. The Flux Kustomization on every Sovereign points at clusters/_template/bootstrap-kit/ and post-build-substitutes per- Sovereign vars (SOVEREIGN_FQDN, MARKETPLACE_ENABLED, ...). The per-Sovereign overlay file at clusters/<sov>/bootstrap-kit/01-cilium.yaml that #1223 added is therefore dead code (Flux doesn't read that path). The canonical mechanism is to extend the template with envsubst placeholders + thread the values through tofu vars. Wires four layers end-to-end: 1. clusters/_template/bootstrap-kit/01-cilium.yaml — adds `cluster.name: ${CLUSTER_MESH_NAME:=}` and `cluster.id: ${CLUSTER_MESH_ID:=0}` plus `clustermesh.useAPIServer: true` + NodePort 32379. Empty defaults = single-cluster Sovereign (no peer connects); the cilium subchart accepts empty cluster.name when id=0. 2. infra/hetzner/cloudinit-control-plane.tftpl — adds CLUSTER_MESH_NAME / CLUSTER_MESH_ID to the bootstrap-kit Kustomization's postBuild.substitute block (alongside SOVEREIGN_FQDN, MARKETPLACE_ENABLED, PARENT_DOMAINS_YAML). 3. infra/hetzner/variables.tf — declares cluster_mesh_name (string, default "") and cluster_mesh_id (number, default 0, validated 0-255). 4. infra/hetzner/main.tf — primary cloud-init passes var.cluster_mesh_{name,id} verbatim. Secondary regions (when var.regions[i>0] is non-empty per slice G3) auto-derive each peer's name as `<sovereign-stem>-<region-code-no-digits>` and increment id from var.cluster_mesh_id+1. Per-region override via the new RegionSpec.ClusterMeshName field. 5. products/catalyst/bootstrap/api/internal/provisioner/provisioner.go — adds ClusterMeshName + ClusterMeshID to Request and threads them into writeTfvars(); RegionSpec gains ClusterMeshName for per-peer override. 
Per docs/INVIOLABLE-PRINCIPLES.md #4 (never hardcode), the chart-side default is intentionally empty — operator request OR per-Sovereign overlay must supply the values when ClusterMesh is enabled. The allocation registry lives at docs/CLUSTERMESH-CLUSTER-IDS.md (introduced in #1223). Refs: #1101 (EPIC-6), qa-loop iter-6 fix-33 follow-up to #1223 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(infra): escape $ in tftpl comments referencing envsubst placeholders `tofu validate` reads `${CLUSTER_MESH_NAME}` inside YAML comments as a template variable reference; the comment was meant to refer to the Flux envsubst placeholder consumed downstream by the bootstrap-kit cilium HelmRelease. Escaped both refs with `$$` per Terraform's templatefile escape syntax so the comment renders verbatim. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(infra): replace coalesce with conditional in secondary_region_cluster_mesh_name coalesce errors when every arg is empty (the not-in-mesh path). Switch to a conditional that yields '' when both the per-region override AND var.cluster_mesh_name are empty. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-10 00:19:53 +04:00
e3mrah	7ca4abddd2	feat(continuum): K-Cont-4 — Cloudflare Worker source + tofu wiring for lease witness (#1101 ) (#1159 ) * feat(continuum): K-Cont-4 — Cloudflare Worker source + tofu wiring for lease witness (#1101) Implements the server side of the Cloudflare KV lease-witness pattern that K-Cont-3's CFKVClient (in core/controllers/continuum/internal/ witness/cloudflarekv/) speaks to. The Worker fronts a Cloudflare Workers KV namespace with read-then-CAS-write semantics enforced via the If-Match header — exact contract per K-Cont-3 #1158 report (item d) and the canonical-seams "Cloudflare KV Worker contract" entry. Routes: GET /lease/<slot-url-encoded> → 200 + LeaseState \| 404 \| 401 PUT /lease/<slot> → 200 + LeaseState \| 412 + state \| 401 DELETE /lease/<slot> → 204 \| 412 \| 401 All 7 K-Cont-3 trap behaviors verified by 46 vitest tests: 1. If-Match: 0 = first-acquire-on-empty-slot 2. Generation increments unconditionally (incl. Release) 3. 412 includes current state body 4. TTL eviction is server-authoritative in stamping (Worker doesn't auto-evict — controller's IsHeldBy decides) 5. X-Holder mismatch on DELETE returns 412 (stale region can't evict new primary) 6. Bearer token validation against env-bound allow-list 7. Optional X-Lease-Slot header logged for KV granularity Files: products/continuum/cloudflare-worker/{package.json, tsconfig.json, wrangler.toml, vitest.config.ts, .eslintrc.cjs, .gitignore, DESIGN.md, src/{index,auth,kv,types}.ts, src/handlers/{get,put,delete}.ts, test/{handlers,contract,env.d}.ts} infra/cloudflare-worker-leases/{versions,variables,main,outputs}.tf + README.md .github/workflows/cloudflare-worker-leases-build.yaml (event-driven, NO cron — push-on-paths + PR + workflow_dispatch) Tests: 46/46 vitest pass (handlers 37 + contract 9). ESLint clean. tsc --noEmit clean. wrangler deploy --dry-run produces 9.47 KiB bundle. Per the brief: tofu module ships ready for operator action — no auto-deploy. 
Operator runbook in DESIGN.md §"Operator runbook — deploy a new Sovereign". Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(continuum/cf-worker-tofu): K-Cont-4 — adopt CF v5 inline secret_text binding (was v4 separate resource) `tofu validate` failed on `cloudflare_workers_secret` — that resource was REMOVED in cloudflare/cloudflare v5 (it consolidated into the inline `bindings = [...]` array on `cloudflare_workers_script` with `type = "secret_text"`). Same security guarantee — encrypted at rest in CF, never visible via dashboard read API once written. `tofu fmt` also wanted versions.tf alignment + the .terraform.lock.hcl pinning the resolved cloudflare/cloudflare v5.19.1 (mirrors infra/hetzner/ which commits its lock file). Per Inviolable Principle #5 the bearer token value still flows from TF_VAR_bearer_tokens_csv extracted at apply time from a K8s SealedSecret — never inlined here. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: hatiyildiz <hati.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-09 08:01:44 +04:00
e3mrah	8988cd9e4f	feat(infra-hetzner): wire all var.regions[] entries end-to-end (slice G1, #1095 ) (#1131 ) Slice G1 of EPIC-0 (#1095, Group G "Multi-cluster substrate"). Today infra/hetzner/main.tf only realises regions[0] end-to-end — every wizard payload's regions[1..N] entries silently no-op. EPIC-6 (#1101) Continuum DR demo needs 3 regions (mgmt + fsn + hel per docs/EPICS-1-6-unified-design.md §3.8 + §11), so this slice closes the gap. Architecture: hybrid singular-path + secondary-region overlay. - The legacy singular path (var.region + count = local.control_plane_count) STAYS untouched — every existing Sovereign state (omantel, otech) keeps its resource addresses (hcloud_server.control_plane[0], hcloud_load_balancer.main, etc) and produces a no-op plan diff. - New regions (regions[1+]) are realised via a parallel for_each set keyed by "{cloudRegion}-{index}" (e.g. fsn1-1, hel1-2). Each secondary region gets its own /24 subnet inside the shared /16 hcloud_network, its own CP server, its own workers, and its own lb11 load balancer. The shared hcloud_firewall + hcloud_ssh_key (one tenant boundary per Sovereign). Why hybrid not full for_each: a wholesale refactor would change every existing resource address (hcloud_server.control_plane[0] → hcloud_server.control_plane["mgmt"]), forcing every running Sovereign to run `tofu state mv` for ~12 resources or face destructive recreates. The brief explicitly bans that. Hybrid is purely additive — secondary resources are NEW addresses no existing state carries. No `tofu state mv` runbook required. Existing Sovereigns provisioned with var.regions = [] or len(var.regions) == 1 produce identical plans before and after this PR. Slice G3 (out of scope here) wires Cilium ClusterMesh between secondary regions and adds per-cluster GitOps path differentiation; today every secondary CP renders an identical Flux Kustomization pointed at clusters/<sovereign_fqdn>/. 
Tests: tests/multi_region.tftest.hcl exercises 5 scenarios offline via mock_provider + override_resource (no real Hetzner): - legacy_no_regions_payload (var.regions=[]) - single_region_entry_does_not_double_provision (len==1) - three_region_mgmt_fsn_hel (EPIC-6 shape) - same_region_duplicates_produce_distinct_keys - non_hetzner_regions_are_filtered_out (oci entries skipped) All 5 pass. CI workflow infra-hetzner-tofu.yaml runs validate + fmt -check + test on every PR touching infra/hetzner/*. Per CLAUDE.md "every workflow MUST be event-driven, NEVER scheduled": push-on-merge + pull-request-on-touch + workflow_dispatch only. No cron. Validation: $ tofu validate Success! The configuration is valid. $ tofu fmt -check -recursive exit=0 $ tofu test tests/multi_region.tftest.hcl... pass run "legacy_no_regions_payload"... pass run "single_region_entry_does_not_double_provision"... pass run "three_region_mgmt_fsn_hel"... pass run "same_region_duplicates_produce_distinct_keys"... pass run "non_hetzner_regions_are_filtered_out"... pass Success! 5 passed, 0 failed. Co-authored-by: hatiyildiz <hatiyildiz@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-09 00:29:44 +04:00
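The keying scheme described in this commit can be illustrated with a minimal Python sketch. It is not the HCL from infra/hetzner/main.tf: the field names `provider` and `cloudRegion`, the function name `secondary_region_keys`, and the choice to take the index from the position in the original `regions` list are all assumptions made for illustration.

```python
from typing import Dict, List


def secondary_region_keys(regions: List[dict]) -> Dict[str, dict]:
    """Map secondary regions to "{cloudRegion}-{index}" keys.

    Assumptions (hypothetical, not taken from the repository):
    - each entry carries "provider" and "cloudRegion" fields;
    - the index is the entry's position in the original list;
    - regions[0] is owned by the legacy singular path and is skipped.
    """
    keyed: Dict[str, dict] = {}
    for index, region in enumerate(regions):
        if index == 0:
            # regions[0] stays on the legacy count-based resources.
            continue
        if region.get("provider") != "hetzner":
            # Non-Hetzner entries (e.g. oci) are filtered out.
            continue
        keyed[f"{region['cloudRegion']}-{index}"] = region
    return keyed
```

Because the index is part of the key, two entries in the same cloud region still receive distinct keys, which is the property the `same_region_duplicates_produce_distinct_keys` scenario checks.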
e3mrah	8e312cd244	fix(infra/hetzner): strip any-indent comments, gate user_data ≤ 30 KiB at plan-time (#966) (#967) Live blocker. Provisioning otech114 (deployment 5c3eea37d3aacda6, fsn1) failed at `tofu apply` with: Error: invalid input in field 'user_data' (invalid_input): [user_data => [Length must be between 0 and 32768.]] with hcloud_server.control_plane[0] on main.tf line 309 Hetzner Cloud's HARD 32 KiB cap on user_data was breached after #921 inlined a base64-encoded worker cloud-init (~4.8 KB) into the CP cloud-init for cluster-autoscaler's HCLOUD_CLOUD_INIT key, on top of #827's multi-domain substitutions. Rendered size: ~37 KB. Root cause: the prior strip regex `(?m)^[ ]{0,2}# .*\n` was scoped to indent-0/2 comments only — leaving ~14 KB of indent-6+ comments INSIDE write_files content blocks (e.g. flux-bootstrap.yaml's triplicate Kustomization documentation). Those comments are inert: every write_files entry is YAML / JSON / key=value config (no shell scripts), and parsers ignore `#`-prefixed lines entirely. Changes: 1. New strip regex `(?m)^[ ]*#( |$).*\n` strips ANY-indent comment lines that start with `#` followed by space or EOL. Preserves: - `#cloud-config` line 1 (no space after `#`) - `#!`-shebangs (no space after `#`) - `#pragma`-style directives (`#` followed by non-space non-EOL) Applied to both `local.control_plane_cloud_init` and `local.worker_cloud_init`. 2. Plan-time guardrail via `lifecycle.precondition` on `hcloud_server.control_plane` and `hcloud_server.worker`. Fails plan (not apply) when `length(local.<>_cloud_init) > 30720` bytes (30 KiB = 32 KiB hard cap minus 10% future-additions buffer). Future bloat-creep that silently re-eats the headroom now fails fast at plan-time BEFORE the network/LB/firewall/SSH-key resources get created. 
Verified rendered sizes (Python simulation of templatefile + strip, substitutions match real otech114 inputs): CP cloud-init: 79404 bytes raw → 21144 bytes stripped (margin: 11624 under hard cap, 9576 under guardrail) Worker cloud-init: 3254 bytes raw → 2410 bytes stripped (b64-encoded for HCLOUD_CLOUD_INIT: 3216 bytes) `#cloud-config` first-line preserved. All 18 write_files entries and 43 runcmd entries parse intact. YAML/JSON/conf contents valid post-strip (comments are documentation only at the file-format level). Closes #966 Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-05 17:58:44 +04:00
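The strip-then-gate behaviour described in this commit can be sketched in Python, the language the commit says was used for its own simulation. This is a sketch under stated assumptions: the pattern below assumes the asterisks were lost from the regex when the log was rendered, and the names `strip_comments` and `check_user_data` are hypothetical. The real implementation lives in tofu locals and a `lifecycle.precondition`.

```python
import re

# Assumed form of the any-indent strip pattern from the commit message:
# optional spaces, "#", then either a space or end of line.
COMMENT_LINE = re.compile(r"(?m)^[ ]*#( |$).*\n")

HARD_CAP_BYTES = 32 * 1024   # Hetzner user_data limit (32768)
GUARDRAIL_BYTES = 30 * 1024  # plan-time guardrail (30720)


def strip_comments(rendered: str) -> str:
    """Drop whole-line comments at any indent, keeping directive lines.

    Lines such as "#cloud-config", "#!/bin/sh" and "#pragma" survive
    because "#" is not followed by a space or end of line.
    """
    return COMMENT_LINE.sub("", rendered)


def check_user_data(rendered: str) -> int:
    """Return the UTF-8 size in bytes, or raise if it exceeds the guardrail."""
    size = len(rendered.encode("utf-8"))
    if size > GUARDRAIL_BYTES:
        raise ValueError(
            f"user_data is {size} bytes; guardrail is {GUARDRAIL_BYTES} "
            f"(hard cap {HARD_CAP_BYTES})"
        )
    return size
```

Applied to the figures quoted in the commit, a 21144-byte control-plane payload leaves 32768 - 21144 = 11624 bytes under the hard cap and 30720 - 21144 = 9576 bytes under the guardrail.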
e3mrah	d1431bed09	fix(autoscaler+wizard): wire HCLOUD_CLOUD_INIT, validate SKU/region in catalyst-api (#965 ) Closes #921 — bp-cluster-autoscaler-hcloud chart shipped without HCLOUD_CLUSTER_CONFIG / HCLOUD_CLOUD_INIT, so cluster-autoscaler 1.32.x FATALs at startup with "HCLOUD_CLUSTER_CONFIG or HCLOUD_CLOUD_INIT is not specified" on every Sovereign (otech112 evidence). HelmRelease reports Ready=True (Helm install succeeded) but the Pod CrashLoopBackOffs invisibly behind the False-positive condition. Closes #916 — wizard let operators dispatch unbuildable topologies (otech109: cpx32 worker in `ash`) because PROVIDER_NODE_SIZES did not encode regional orderability. Hetzner rejected the worker creation 41s into `tofu apply` after Phase-0 had already created the CP + network + LB + firewall. Chart fix (issue #921): - Add `clusterAutoscalerHcloud.{clusterConfig,cloudInit}` values to the umbrella chart (base64-encoded per upstream contract). - Render `hetzner-node-config` Secret unconditionally with both keys so the upstream Deployment's secretKeyRef references resolve cleanly during `helm template` AND in the live cluster regardless of overlay state. - Wire HCLOUD_CLUSTER_CONFIG + HCLOUD_CLOUD_INIT extraEnvSecrets onto the upstream chart's deployment. - Tofu Phase 0 base64-encodes the Phase-0 worker cloud-init and stamps it under `flux-system/cloud-credentials.hcloud-cloud-init`; the bootstrap-kit overlay lifts that key via Flux `valuesFrom` into `clusterAutoscalerHcloud.cloudInit`. Autoscaler-spawned workers thus receive the IDENTICAL bootstrap as the Phase-0 worker fleet. - Bump bp-cluster-autoscaler-hcloud chart 1.0.0 → 1.1.0. - Chart-test smoke gate (chart/tests/hetzner-node-config.sh) verifies Secret + env var wiring + no-regression of HCLOUD_TOKEN — runs in CI's blueprint-release "Run chart integration tests" step. 
Wizard fix (issue #916): - Add `availableRegions?: string[]` to NodeSize interface; encode cpx32 = ['fsn1','nbg1','hel1'], cpx21/cpx31 = [] (orderable nowhere new) per Hetzner /v1/server_types vs POST /v1/servers gap. - Add `isSkuAvailableInRegion()` + `suggestAlternativeSkus()` helpers. - StepProvider filters SKU dropdowns by selected region; auto-swaps current SKU to recommended default when region change drops it out of orderability. - Mirror the matrix Go-side in sku_availability.go; gate `provisioner.Request.Validate()` with same predicate so a stale wizard build OR direct API caller bypassing the UI cannot dispatch otech109's failure mode. - Two-sided enforcement covers both r.Regions[] (multi-region) and the legacy singular path. Tests: 13 vitest cases on the wizard side + 38 Go subtests on the API side. Chart smoke renders + helm template gates the env wiring at publish time. Co-authored-by: hatiyildiz <hati.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-05 16:21:59 +04:00
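The SKU and region predicate from this commit can be illustrated with a short Python sketch. The actual code is TypeScript in the wizard and Go in sku_availability.go. The table holds only the three SKUs named in the commit message. Treating a SKU with no entry as unrestricted is an assumption, inferred from `availableRegions` being an optional field, and the snake_case function names and the `candidates` parameter are hypothetical.

```python
from typing import Dict, List

# Only the entries stated in the commit message; an empty list means
# the SKU is orderable nowhere for new servers.
AVAILABLE_REGIONS: Dict[str, List[str]] = {
    "cpx32": ["fsn1", "nbg1", "hel1"],
    "cpx21": [],
    "cpx31": [],
}


def is_sku_available_in_region(sku: str, region: str) -> bool:
    """True when the SKU can be ordered in the region.

    Assumption: a SKU with no entry carries no regional restriction.
    """
    regions = AVAILABLE_REGIONS.get(sku)
    if regions is None:
        return True
    return region in regions


def suggest_alternative_skus(region: str, candidates: List[str]) -> List[str]:
    """Filter a caller-supplied candidate list down to orderable SKUs."""
    return [sku for sku in candidates if is_sku_available_in_region(sku, region)]
```

With this table, the otech109 topology (a cpx32 worker in `ash`) is rejected before any `tofu apply` is dispatched. Running the same predicate in the UI and in the API is what lets the server-side gate catch stale wizard builds and direct API callers.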
e3mrah	2ff50f0591	fix(bp-newapi+services-build): imagePullSecrets on Pod, sed bumps values.yaml smeTag (#955 ) Two SME-blocker bugs caught live on otech113 (alice signup gate 5 fails on fresh Sovereign): #952 — bp-newapi 1.4.0 Pod has no imagePullSecrets, so kubelet pulls PRIVATE ghcr.io/openova-io/openova/{newapi-mirror,services-metering-sidecar} anonymously and gets 403 Forbidden. Fix: - Templatize spec.imagePullSecrets on Deployment + channel-seed Job. - Default values.yaml `imagePullSecrets: [{name: ghcr-pull}]`. - Add `newapi` to flux-system/ghcr-pull's reflector reflection-{allowed,auto}-namespaces in cloudinit-control-plane.tftpl so bp-reflector mirrors the source Secret into the namespace automatically on every fresh Sovereign. - Bump bp-newapi 1.4.0 -> 1.4.1, update _template overlay. #953 — services-build.yaml's image-rewrite loop only matched the hardcoded `image: ghcr.io/.../services-<svc>:<sha>` form. 7 of 8 sme-services templates use `image: "{{ ... }}/services-<svc>:{{ .Values.images.smeTag }}"`. Each services-build run bumped only auth.yaml while reporting "update sme service images to ${SHA}", leaving the live Pod on stale bytes (PR #951's #941 fix never reached services-catalog despite the merge + chart bump chain). Fix: - After the hardcoded loop, also bump `images.smeTag` in products/catalyst/chart/values.yaml with a strict regex match (`^ smeTag: "<sha>"$`); refuse to auto-bump if the line shape changes (defends against silent drift if a contributor renames the field). - Mirror the change into the retry-path `rewrite()` function so a reset-to-origin/main retry does not recreate the original bug. Tests: - platform/newapi/chart/tests/imagepullsecrets-render.sh — 4 cases asserting the Deployment and channel-seed Job carry the default ghcr-pull reference, that an empty override suppresses the block, and that custom secret names propagate (Inviolable Principle #4). 
- tests/integration/services-build-rewrite.sh — 3 cases reproducing the workflow's rewrite logic on a sandboxed copy of the live chart, asserting both auth.yaml's hardcoded line AND values.yaml's smeTag get bumped, that helm-render of the catalyst chart with the bumped values produces all 8 SME-service Deployments at the new SHA, and that an idempotent re-bump to a second SHA also lands cleanly. Refs: #952 #953 (umbrella #915 — alice signup gate 5). Co-authored-by: hatiyildiz <143030955+hatiyildiz@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-05 15:47:37 +04:00
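The strict-match-or-refuse rule for bumping `smeTag` can be sketched in Python. The workflow itself uses sed, so this is a swapped-in illustration and not the workflow's code. The indentation width, the hex-only SHA shape, and the function name `bump_sme_tag` are assumptions.

```python
import re

# Assumed line shape: leading spaces, then smeTag: "<hex sha>" and nothing else.
SME_TAG_LINE = re.compile(r'^( +)smeTag: "([0-9a-f]+)"$', re.MULTILINE)
HEX_SHA = re.compile(r"[0-9a-f]+")


def bump_sme_tag(values_yaml: str, new_sha: str) -> str:
    """Rewrite the single smeTag line to new_sha, or refuse.

    Raises ValueError when the SHA is not lowercase hex or when the file
    does not contain exactly one line of the expected shape, so a renamed
    or reformatted field is never silently skipped.
    """
    if not HEX_SHA.fullmatch(new_sha):
        raise ValueError(f"not a hex SHA: {new_sha!r}")
    matches = list(SME_TAG_LINE.finditer(values_yaml))
    if len(matches) != 1:
        raise ValueError(
            f"expected exactly one smeTag line, found {len(matches)}"
        )
    return SME_TAG_LINE.sub(
        lambda m: f'{m.group(1)}smeTag: "{new_sha}"', values_yaml
    )
```

Failing loudly on a shape mismatch is the design point: the earlier bug came from a rewrite loop that reported success while leaving most templates untouched.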

108 Commits