Commit Graph

108 Commits

Author SHA1 Message Date
e3mrah
c148ec6a34
fix(cloudinit): escape $${ORG_EMAIL:-}/$${ORG_NAME:-} in comment (D22) (#1575)
PR #1571 added a comment mentioning the $${ORG_EMAIL:-}/$${ORG_NAME:-}
slot-file placeholders WITHOUT the $$ escape. tofu's templatefile()
parses comments and tried to interpolate ${ORG_EMAIL:-} as a tofu
expression — failing with "Extra characters after interpolation
expression; Template interpolation doesn't expect a colon".

Caught live on t133 fad01d84f5655004 — tofu plan failed in 30s.

The escape pattern is documented at main.tf:1029 (the same warning
that caught t127 last week). $$ prefix tells tofu's templatefile to
emit literal ${...} to cloud-init for Flux envsubst.
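
The failure mode above (an envsubst default such as `:-` or `:=` left unescaped inside a tftpl file) could be caught before `tofu plan` with a small lint. This is a minimal sketch, not part of the repository: the function name `find_unescaped_envsubst` and the regular expression are assumptions made for illustration, and plain tofu references such as `${region_canonical_label}` are deliberately left alone.

```python
import re

# Matches an unescaped "${NAME:-" or "${NAME:=" (envsubst default syntax).
# The negative lookbehind skips the escaped form "$${...}".
UNESCAPED_ENVSUBST = re.compile(r"(?<!\$)\$\{[A-Za-z_][A-Za-z0-9_]*:[-=]")


def find_unescaped_envsubst(template_text):
    """Return (line_number, matched_text) for each unescaped envsubst default."""
    hits = []
    for line_number, line in enumerate(template_text.splitlines(), start=1):
        for match in UNESCAPED_ENVSUBST.finditer(line):
            hits.append((line_number, match.group(0)))
    return hits
```

Running it over the rendered template source in CI would report the line of any comment that still carries a bare envsubst expression.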

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-17 02:31:26 +04:00
e3mrah
57939585c0
feat(cloudinit): wire ORG_EMAIL/ORG_NAME/GITOPS_REPO_URL substitutes (D22) (#1571)
* feat(chart): wire OPERATOR_EMAIL/CONTROL_PLANE_IP/GITOPS_REPO_URL/ORG_NAME (D22)

Companion to PR #1567 + #1568 — wire the env vars chrootEnsureDeployment
reads to populate the deployment record so Sovereign Console Settings
page renders real values for ownerEmail, controlPlaneIP, gitopsRepoURL,
orgName (instead of `—` placeholders).

Adds 4 new keys to the sovereign-fqdn ConfigMap (orgEmail, orgName,
controlPlaneIP, gitopsRepoURL) sourced from .Values.sovereign.* with
empty defaults. Per-Sovereign overlays wire actual values from cloud-
init substitute placeholders (mirrors regionsJson pattern).

Catalyst-api Pod now reads them via valueFrom configMapKeyRef +
optional=true (Catalyst-Zero/contabo emits no sovereign-fqdn ConfigMap
so env stays empty there — correct, mothership is signer not validator).

Validated: t132 already serves region=hel1, consoleURL, loadBalancerIP
post-#1568. This PR fills the remaining 3 D22 fields when operator wires
the values.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(slot-13): add D22 sovereign-side identity placeholders

Add ${ORG_EMAIL:-} + ${ORG_NAME:-} + ${SOVEREIGN_CONTROL_PLANE_IP:-} +
${GITOPS_REPO_URL:-} envsubst placeholders so when cloud-init wires
them, the chart picks them up via sovereign-fqdn ConfigMap (PR #1569)
→ catalyst-api env → chrootEnsureDeployment populates the deployment
record → Settings page renders real values instead of `—`.

This PR alone is a no-op (placeholders default to empty, same as today).
The cloud-init substitute lines + provisioner.go tfvars need to land in
a companion PR to actually populate the values on next-prov.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(cloudinit): wire ORG_EMAIL/ORG_NAME/GITOPS_REPO_URL substitutes (D22)

Companion to #1567+#1568+#1569+#1570 — the cloud-init substitute block
now emits ORG_EMAIL/ORG_NAME/GITOPS_REPO_URL into the bootstrap-kit
Kustomization's postBuild.substitute env, which the slot-13 placeholders
(#1570) consume via ${ORG_EMAIL:-}/${ORG_NAME:-}/${GITOPS_REPO_URL:-}.

Chain: provisioner.go writeTfvars → tofu vars → cloudinit templatefile
substitute → Flux Kustomization postBuild → sovereign-fqdn ConfigMap
keys (#1569) → catalyst-api env (#1569) → chrootEnsureDeployment
populates the deployment record (#1567 + #1568 fallback).

SOVEREIGN_CONTROL_PLANE_IP omitted intentionally — main.tf:691 notes
the dependency cycle (hcloud_server.cp doesn't exist at cloudinit
render time). Separate PR will source it via metadata-service or
post-create ConfigMap patch.

Next-prov (t133+) Sovereign Console Settings page now renders real
ownerEmail/orgName/gitopsRepoURL instead of `—` placeholders.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-17 01:47:04 +04:00
e3mrah
1c988b9a4b
fix(firewall): open NodePort range 30000-32767 for clustermesh LB (D11) (#1538)
PR #1537's use-private-ip approach was not viable: the per-region
Hetzner LB has no private-network attachment by default (LB private_net
is empty) and our DoD A2 architecture pins one private /24 per region
that does NOT span across regions. The LB->backend hop has to transit
the public path.

The actual blocker is the Sovereign firewall: it permits 80/443/6443/53
and blocks the NodePort range. Hetzner LB TCP health-check probes
`<node-public-ip>:<NodePort>` and gets dropped → all targets marked
unhealthy → external clients see "unexpected eof while reading" at
TLS handshake → cilium clustermesh agent stays `0/N remote clusters
ready, Waiting for initial connection`.
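
The blocking behaviour described above can be modelled in a few lines. This is an illustrative sketch only: the port set comes from this message, while the function name `firewall_admits` and the boolean toggle are hypothetical and do not mirror the actual tofu firewall resource.

```python
# Ports the Sovereign firewall already permitted, per the message above.
ALLOWED_PORTS = {53, 80, 443, 6443}
# Kubernetes NodePort range 30000-32767 (inclusive).
NODEPORT_RANGE = range(30000, 32768)


def firewall_admits(port, nodeports_open):
    """Return True if an inbound TCP probe to `port` would pass the firewall."""
    if port in ALLOWED_PORTS:
        return True
    return nodeports_open and port in NODEPORT_RANGE
```

With `nodeports_open=False` every LB health-check probe to a NodePort is dropped, which is the state that marked all targets unhealthy.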

Security: clustermesh-apiserver requires mTLS. Peer agents must present
a client cert signed by the peer cluster's cilium-ca (PR #1530).
Anonymous connections rejected at handshake. mTLS is the security
boundary, NOT the firewall — opening NodePorts is safe here.

Caught on t129 (6cddff7ef4432bdc, 2026-05-16) — completes the D11
incident chain (#1525 → #1528 → #1530 → #1536 → this).

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 18:44:02 +04:00
e3mrah
1f30a08ae3
fix(chroot): seed Request.Regions[] from SOVEREIGN_REGIONS_JSON env (D5) (#1534)
The Sovereign-side catalyst-api runs in "chroot" mode — it has no
parent prov record, so chrootEnsureDeployment synthesises a minimal
in-memory Deployment with only SovereignFQDN set. The
/infrastructure/topology loader then sees empty Request.Regions[]
and falls into the live-Nodes enumeration path (buildRegionFromLiveNodes)
which only sees THIS cluster's Node(s) → emits exactly 1 Region
even on a 3-region Sovereign. /cloud?view=graph renders as
"1 cluster 1 region" — DoD D5 failure.

Caught on t126 (84c0848406dd6fdd, 2026-05-16): operator reported
`console.t126.omani.works/cloud?view=graph` showed 1 region despite
mothership openova-flow snapshot holding all 3 regions correctly.

This PR threads the canonical multi-region RegionSpec[] from the
mothership prov body all the way to the Sovereign-side catalyst-api:

  tofu var.regions
    → jsonencode → sovereign_regions_json tftpl var
    → cloud-init postBuild.substitute SOVEREIGN_REGIONS_JSON
    → bp-catalyst-platform slot 13 sovereign.regionsJson value
    → sovereign-fqdn ConfigMap key `regionsJson`
    → catalyst-api Pod env SOVEREIGN_REGIONS_JSON (valueFrom)
    → chrootEnsureDeployment parses JSON, populates Request.Regions[]
    → topology loader emits one Region per spec entry

Single-region Sovereigns: var.regions has length 1; chart writes
the array literal; chroot synth still produces 1 Region — no
regression. Empty env: chroot falls back to live-Nodes path
(legacy behavior preserved).
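
The parse-or-fall-back behaviour described above can be sketched as follows. This is an assumption-laden illustration in Python rather than the real Go handler: the name `regions_from_env` is hypothetical, the shape of each region entry is not specified here, and treating malformed JSON the same as an empty env is a guess about the desired behaviour.

```python
import json


def regions_from_env(env_value):
    """Return the list of region specs, or None to signal the live-Nodes fallback."""
    if not env_value or not env_value.strip():
        return None  # empty env: legacy live-Nodes path
    try:
        parsed = json.loads(env_value)
    except json.JSONDecodeError:
        return None  # assumption: malformed JSON also falls back
    if not isinstance(parsed, list) or not parsed:
        return None
    return parsed
```

A caller would populate the regions list when a value is returned and enumerate live Nodes when it gets `None`.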

Refs DoD D5.

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 17:45:24 +04:00
e3mrah
357feb0843
fix(tofu): escape ${...} in comment that broke templatefile() (t127) (#1533)
Unescaped `${DMZ_VCLUSTER_ENABLED:=true}` Flux envsubst expression
inside a tftpl comment was being parsed by tofu's templatefile() as
a tftpl interpolation. `:=` is not a valid tftpl operator,
so tofu plan failed with:

  ./cloudinit-control-plane.tftpl:1021,71-72: Extra characters after
  interpolation expression; Template interpolation doesn't expect a
  colon at this location.

Every other `${...}` reference in tftpl comments in this file is
properly escaped as `$${...}` (e.g. lines 12, 850, 893, 971, 996,
1039, 1138). Mine slipped through PR #1531.

Fix: rewrite the comment to NOT include any `${...}` expression
(since the expression was just illustrative), avoiding the escape
gymnastics entirely.

Caught on t127 (b7942a70f7516e9e, 2026-05-16) — first prov after
PR #1531 landed FAILED in tofu plan stage within 60s.

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 17:39:43 +04:00
e3mrah
904686ff0d
fix(vcluster): canonical region label substitute + per-role enable flags (#1531)
Caught on t126 (84c0848406dd6fdd, 2026-05-16): bp-{dmz,mgmt,rtz}-vcluster
charts installed but DMZ Pods Pending on every region with
FailedScheduling. Pod nodeSelector was `openova.io/region=hel1`
(from `${SOVEREIGN_REGION_KEY}` substitute = Hetzner region key
"hel1"/"nbg1-1"/"sin-2"), but the k3s node-label is
`openova.io/region=hz-hel-rtz-prod` (canonical 4-segment label written
by cloud-init from `region_canonical_label` per PR #1512). Mismatch
meant every vCluster Pod across every region sat Pending.
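
The mismatch is easy to reproduce with the two label values quoted above. The helper below is a simplified sketch of equality-based nodeSelector matching; the name `node_matches` is hypothetical and this is not the scheduler's actual code.

```python
def node_matches(node_labels, node_selector):
    """Return True if every selector key/value pair is present on the node."""
    return all(node_labels.get(key) == value for key, value in node_selector.items())


node_labels = {"openova.io/region": "hz-hel-rtz-prod"}   # written by cloud-init
old_selector = {"openova.io/region": "hel1"}             # Hetzner region key
new_selector = {"openova.io/region": "hz-hel-rtz-prod"}  # canonical label
```

`node_matches(node_labels, old_selector)` is false, so the Pod has no eligible node and stays Pending; the canonical selector matches.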

MGMT + RTZ slot 58/59 charts also default-OFF with no substitute
flipping them on per the DoD A4 topology (primary=MGMT+DMZ;
secondary=DMZ+RTZ).

This PR:
1. Adds `SOVEREIGN_REGION_CANONICAL_LABEL` substitute to tofu cloud-init
   `bootstrap-kit` postBuild block, sourced from per-region
   `region_canonical_label` tftpl var.
2. Adds `MGMT_VCLUSTER_ENABLED` + `RTZ_VCLUSTER_ENABLED` substitutes —
   primary CP renders true/false, secondary CP renders false/true.
3. Updates bootstrap-kit slots 54/58/59 to use the canonical label
   substitute. Slots 58/59 also read the per-role enable flag.

Expected post-deploy state on a fresh 3-region prov:
  primary:    DMZ + MGMT vCluster Pods Running (RTZ rendered zero)
  secondary:  DMZ + RTZ vCluster Pods Running (MGMT rendered zero)
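
The per-role flag rendering can be summarised as a lookup. This is a sketch under the DoD A4 topology stated above (primary = MGMT+DMZ, secondary = DMZ+RTZ); the function name `vcluster_flags` and the dictionary shape are hypothetical, not the tofu locals.

```python
def vcluster_flags(role):
    """Return which vCluster charts are enabled for a control-plane role."""
    if role == "primary":
        return {"DMZ": True, "MGMT": True, "RTZ": False}
    if role == "secondary":
        return {"DMZ": True, "MGMT": False, "RTZ": True}
    raise ValueError(f"unknown role: {role!r}")
```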

Refs DoD A4 (vCluster topology).

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 17:28:06 +04:00
e3mrah
ed19bb3f8d
fix(k3s): --disable-cloud-controller so providerID stays empty for our patch (#1524)
Caught on t123 (a3bfa56adbcfb049, 2026-05-16): Gap A v3.1's patch loop
hit k8s validation error:

  The Node "catalyst-t123-omani-works-cp1" is invalid:
  spec.providerID: Forbidden: node updates may not change providerID
  except from "" to valid

k8s allows setting providerID from empty → valid, but NOT changing it.
k3s's embedded cloud controller sets providerID=k3s://<hostname>
BEFORE our cloud-init runcmd patch fires (race window). Once set,
the patch is rejected.
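
The validation rule quoted in the error can be modelled as a small predicate. This is a simplified sketch of the rule as stated in the message ("may not change providerID except from "" to valid"), with "valid" approximated as non-empty; the function name is hypothetical.

```python
def provider_id_update_allowed(current, proposed):
    """Model the rule: providerID may only go from empty to a set value."""
    if current == proposed:
        return True  # no change
    return current == "" and proposed != ""
```

Once the embedded controller has written `k3s://<hostname>`, the predicate is false for any `hcloud://<id>` value, which is why the race matters.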

Fix: --disable-cloud-controller (alone, NOT with the cloud-provider=
external kubelet arg that caused the chicken-and-egg taint in
reverted PR #1513). This disables the k3s embedded cloud controller
so it never sets providerID; the kubelet leaves providerID empty;
our runcmd patch successfully sets hcloud://<id>.

hcloud-ccm (installed later via Flux) sees the correct providerID
and allocates per-region LBs.

Co-authored-by: claude <claude@anthropic.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 15:25:54 +04:00
e3mrah
0ebd137547
fix(cloud-init): retry providerID patch up to 30× when Node not yet registered (#1523)
Caught on t122 (7e519eb997af236c, 2026-05-16): primary + sin patched
fine, but nbg1's kubectl patch failed because the Node object hadn't
yet appeared in the apiserver between healthz OK and Node registration.
Result: nbg1 stuck at providerID=k3s://... → CCM rejected its LB
allocation → clustermesh-apiserver external_ip stayed <pending> on
nbg1 → AutoEstablishClusterMesh couldn't fully mesh.

Add a 30-iter loop (150s budget): get node first; if found, patch; else
sleep 5. On Hetzner, the apiserver registers Nodes within ~10-30s of k3s
install on healthy clusters.
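
The loop shape described above can be sketched like this. The real implementation is a shell loop in cloud-init runcmd; the Python version below is only an illustration, and `get_node`, `patch_node` and the injectable `sleep` are hypothetical callables standing in for the kubectl calls.

```python
import time


def patch_when_registered(get_node, patch_node, attempts=30, delay_seconds=5,
                          sleep=time.sleep):
    """Poll for the Node; patch it once it exists. Returns True if patched."""
    for _ in range(attempts):
        if get_node():
            patch_node()
            return True
        sleep(delay_seconds)  # Node not registered yet; wait and retry
    return False  # budget exhausted (30 x 5s = 150s by default)
```

Passing `sleep` as a parameter keeps the sketch testable without real waiting.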

Co-authored-by: claude <claude@anthropic.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 14:58:59 +04:00
e3mrah
ef93a2cdbe
feat(cloud-init): patch node providerID after k3s healthz (unblocks Gap A) (#1520)
Architecturally-clean replacement for the reverted PRs #1513 (k3s flag)
and #1516 (pre-install hcloud-ccm). Both prior approaches broke
cold-start (chicken-and-egg with the uninitialized taint).

This patch instead lets k3s boot normally with its default embedded
cloud controller (which sets `providerID=k3s://<hostname>` — the
problem), then immediately patches the local Node's `spec.providerID`
to `hcloud://<id>` using the Hetzner instance metadata endpoint
(169.254.169.254). The patch runs ONCE per CP node, right after k3s
apiserver healthz becomes reachable, BEFORE flux-bootstrap.yaml applies
the bootstrap-kit Kustomization.

Once providerID has the canonical `hcloud://` prefix, bp-hcloud-ccm
(installed by Flux later in the bootstrap-kit chain) accepts the node
as a Hetzner-managed instance and allocates LBs for Service
type=LoadBalancer normally. That unblocks:

- D12: clustermesh-apiserver Service gets a real external IP
        instead of <pending>
- D10: AutoEstablishClusterMesh (PR #1508) can read each region's
        LB IP and write peer entries into cilium-clustermesh Secret
- D11: inter-region pod-to-pod traffic flows via Cilium WG over the
        per-region LB IPs
- D5: child catalyst-api can reach secondary regions via mesh, so
       /cloud view aggregates all 3 regions instead of 1/1

Failure is non-fatal: if metadata lookup or patch fails, we log and
continue (bp-hcloud-ccm has a chance to set providerID later via its
own node-list-and-match logic). Cold-start is never blocked.

Canonical topology (1 cpx52 per region, workerCount=0) means every
node is a CP — covered by this patch. Operator-added workers
(workerCount>0) would also need providerID patched; a follow-up Job
in bp-providerid-patcher can iterate all nodes post-Flux.

Co-authored-by: claude <claude@anthropic.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 14:12:26 +04:00
e3mrah
766890510b
Revert PR #1516 + #1517 — Gap A hcloud-ccm pre-install hangs cloud-init (#1518)
* Revert "fix(cloudinit): bump size guardrail 30720 → 32000 bytes (#1517)"

This reverts commit 05c6edb4fe.

* Revert "fix(cloud-init): pre-install hcloud-ccm before Flux (unblocks per-region LB allocation) (#1516)"

This reverts commit b7140b9069.

---------

Co-authored-by: claude <claude@anthropic.com>
2026-05-16 13:32:18 +04:00
e3mrah
05c6edb4fe
fix(cloudinit): bump size guardrail 30720 → 32000 bytes (#1517)
PR #1516 added ~3KB of hcloud-ccm bootstrap manifests inline (Secret +
ServiceAccount + ClusterRoleBinding + Deployment with full toleration
list + container args). Rendered cloud-init now exceeds the 30720
precondition on every primary + secondary CP:

  Error: Resource precondition failed
  on main.tf line 716: length(local.control_plane_cloud_init) <= 30720

Caught on t118 prov (0619287065fb58c8, 2026-05-16): apply failed at
both primary AND nbg1-1 + sin-2 simultaneously.

Hetzner hard cap is 32768 bytes. Bump guardrail to 32000 (97.7% of
hard cap), which leaves a 768-byte safety margin while admitting the
bytes the hcloud-ccm pre-install legitimately needs.
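
The headroom arithmetic can be checked directly. This is a small sketch using the two figures from this message; the constant and function names are hypothetical, and measuring the rendered text in UTF-8 bytes is an assumption (the tofu precondition uses `length()` on the rendered string).

```python
HETZNER_USER_DATA_CAP = 32768  # hard cap in bytes, per the message above
GUARDRAIL = 32000              # precondition limit after this change


def guardrail_headroom(limit=GUARDRAIL, hard_cap=HETZNER_USER_DATA_CAP):
    """Return (bytes of margin, limit as a percentage of the hard cap)."""
    return hard_cap - limit, round(100 * limit / hard_cap, 1)


def within_guardrail(rendered_cloud_init, limit=GUARDRAIL):
    """Return True if the rendered cloud-init fits under the guardrail."""
    return len(rendered_cloud_init.encode("utf-8")) <= limit
```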

Co-authored-by: claude <claude@anthropic.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 13:15:21 +04:00
e3mrah
b7140b9069
fix(cloud-init): pre-install hcloud-ccm before Flux (unblocks per-region LB allocation) (#1516)
DoD multi-region gates D5/D10/D11/D12-LB-pending all trace to one root
cause: k3s sets node.spec.providerID=k3s://<hostname>. hcloud-ccm
rejects every LoadBalancer-Service allocation because the prefix isn't
hcloud://, so clustermesh-apiserver Service stays <pending> →
AutoEstablishClusterMesh (PR #1508) hard-fails → no peer entries → no
inter-region pod traffic → openova-flow-emitter on secondaries can't
reach openova-flow-server on primary → /cloud view sees only 1 region.

PR #1513 attempted the kubelet-flag-only fix (--cloud-provider=external
+ --disable-cloud-controller) banking on Flux's bp-hcloud-ccm slot 55 to
install the CCM. Reverted in PR #1514 because Flux pods themselves
cannot land on a node tainted node.cloudprovider.kubernetes.io/
uninitialized=NoSchedule — chicken-and-egg, 0 HRs after 30 min.

Architecturally correct fix: pre-install hcloud-ccm via raw manifests in
cloud-init, BEFORE flux-bootstrap.yaml apply. Once the Deployment runs
(with uninitialized-taint toleration), CCM matches the node to its
Hetzner server, writes providerID=hcloud://<id>, kubelet lifts the
taint, Flux proceeds normally. Flux later "adopts" this Deployment via
bp-hcloud-ccm HelmRelease (release name collides cleanly with `helm
upgrade --install`).

Changes:
- cloudinit-control-plane.tftpl:
  - Re-add k3s install flags --disable-cloud-controller +
    --kubelet-arg=cloud-provider=external (same flags as reverted #1513).
  - New write_files entry /var/lib/catalyst/hcloud-ccm-bootstrap.yaml
    containing Secret kube-system/hcloud (token + network keys),
    ServiceAccount, ClusterRoleBinding, and Deployment with full
    toleration set (uninitialized + CriticalAddonsOnly + control-plane
    + master + not-ready). Image pulled via harbor.openova.io proxy-
    cache of hetznercloud/hcloud-cloud-controller-manager:v1.20.0
    (mirrors platform/hcloud-ccm/chart/Chart.yaml appVersion pin, per
    MIRROR-EVERYTHING rule).
  - New runcmd steps inserted AFTER the local-path StorageClass setup
    and BEFORE the kubeconfig postback: kubectl apply the manifest, then
    poll node.spec.providerID for up to 300s waiting for hcloud:// prefix.
    On timeout, dump CCM pod + logs and exit 1.

- cloudinit-worker.tftpl:
  - Add --kubelet-arg=cloud-provider=external to agent install.
    Workers join the cluster after the primary CP's CCM is up; worker
    kubelet will wait for the same external CCM to set its providerID.

Secondary regions (local.secondary_region_cloud_init in main.tf) call
the SAME cloudinit-control-plane.tftpl, so the fix inherits to every
secondary CP automatically. No main.tf changes needed — hcloud_token
and hcloud_network_name were already threaded into both primary and
secondary templatefile() calls.

DoD impact: unblocks D5 (/cloud 3-regions), D10 (Cilium peer entries),
D11 (inter-region pod-to-pod via WG), D12 (LB external IPs no longer
<pending>). After this lands plus a fresh prov, those four DoD gates
flip green; expected 13-14/14 on next t118 cycle.

Refs: docs/SOVEREIGN-MULTI-REGION-DOD.md, session_2026_05_16_t117_dod_partial.md
Reverts: tail of PR #1513 left the worker tftpl untouched, but #1514's
revert restored it to no-flag state. This PR re-applies the flag intent
correctly because the CCM is now present at the moment kubelet starts.

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 13:06:49 +04:00
e3mrah
f30a49fba5
Revert "fix(k3s): set cloud-provider=external + disable embedded CCM for hcloud-ccm (#1513)" (#1514)
This reverts commit 7f0de7fa82.

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
2026-05-16 12:12:38 +04:00
e3mrah
7f0de7fa82
fix(k3s): set cloud-provider=external + disable embedded CCM for hcloud-ccm (#1513)
DoD gate D12-LB-allocation root cause: k3s registers nodes with
providerID=k3s://<hostname> instead of hcloud://<server-id>. hcloud-ccm
rejects every LB allocation:

  hcops/LoadBalancerOps.ReconcileHCLBTargets: providerID does not have
  one of the expected prefixes (hcloud://, hrobot://, hcloud://bm-):
  k3s://catalyst-t115-omani-works-nbg1-1-cp1

This blocked clustermesh-apiserver Service from getting an external
IP on every secondary region → AutoEstablishClusterMesh (PR #1508)
couldn't write peer entries → D10/D11 fail.

Caught on t115.omani.works (577be15281be2587, 2026-05-16) after PR
#1509 flipped clustermesh-apiserver Service to LoadBalancer. The
NodePort default in the old chart masked this k3s-vs-hcloud-ccm
incompatibility until the LoadBalancer flip exposed it.

Fix (k3s server install line in cloudinit-control-plane.tftpl):
  + --disable-cloud-controller
  + --kubelet-arg=cloud-provider=external

Fix (k3s agent install line in cloudinit-worker.tftpl):
  + --kubelet-arg=cloud-provider=external

The k3s server flag tells the embedded cloud controller to stay out.
The kubelet flag tells kubelet to wait for an external CCM to set
providerID. hcloud-ccm (bootstrap-kit slot 36) then matches each
node to its Hetzner server by name and sets providerID=hcloud://<id>,
unblocking LB allocation, Volume CSI, and node-external-ip.

The node is briefly tainted node.cloudprovider.kubernetes.io/
uninitialized=NoSchedule until the CCM removes it — Flux's
bootstrap-kit Kustomization tolerates this taint via SOPs.

Co-authored-by: claude <claude@anthropic.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 11:34:25 +04:00
e3mrah
dc590855a1
fix(tofu): per-region cloud-init renders with secondary's own values, not primary's (#1512)
* fix(tofu): per-region cloud-init renders with secondary's own values, not primary's

Root cause: cloudinit-control-plane.tftpl hardcoded the literal
`openova.io/region=hz-fsn-rtz-prod` on the k3s install line.
Every CP node — primary AND every secondary — labeled itself with
that fixed string regardless of the cluster's real region. The
template variables `region` and `sovereign_region_key` were already
wired per-region in main.tf, but this one node-label flag was
written as a constant.

Concrete impact on prov t114.omani.works (a1448e0b9e471f5d, 2026-05-16):
  - Primary cluster (hel1) k3s nodes carried `hz-fsn-rtz-prod`
    even though Sovereign primary = hel1. qa-fixtures Pods
    targeted `openova.io/region in [hz-fsn-rtz-prod]` and silently
    landed on the wrong-named nodes — the scheduler accepted but
    the cluster name didn't match the label, breaking the
    OpenovaFlow canvas's per-region grouping and any downstream
    selector reading the label.
  - Secondary clusters (nbg1, sin) carried the same hardcoded label
    so their k3s nodes never reported their own region, again
    breaking the canvas (D13) and the Continuum DR region awareness.
  - clusters/_template/bootstrap-kit/01-cilium.yaml further masked
    the bug with a `${HCLOUD_LB_LOCATION:=hel1}` default fallback
    on the clustermesh-apiserver Service annotation — for a
    Sovereign with primary=hel1 the fallback APPEARED correct but
    silently masked any rendering failure path where the substitute
    might be missing.

Fix shape:
  1. Introduce locals.region_canonical_label in main.tf, keyed by
     region key ("primary" + every secondary key). Each value is
     computed as `hz-<region-prefix-no-digits>-rtz-prod` per
     NAMING-CONVENTION §2.1.
  2. Thread `region_canonical_label` into BOTH the primary CP
     templatefile() call (from locals.region_canonical_label["primary"])
     and the secondary CP templatefile() call (from
     locals.region_canonical_label[k]).
  3. Replace the hardcoded literal in cloudinit-control-plane.tftpl
     line 1364 with `${region_canonical_label}` — each CP now
     labels its k3s node with ITS OWN canonical region tag.
  4. Thread `QA_PRIMARY_REGION` substitute into the bootstrap-kit
     Kustomization's postBuild.substitute block so the chart's
     qaFixtures.primaryRegion seam (`${QA_PRIMARY_REGION:-hz-fsn-rtz-prod}`)
     is set to the Sovereign-wide primary region label, never the
     hardcoded `hz-fsn-rtz-prod` chart default. Identical value on
     every cluster's bootstrap-kit because qaFixtures.primaryRegion
     is Sovereign-wide singular.
  5. Remove the `${HCLOUD_LB_LOCATION:=hel1}` fallback default in
     01-cilium.yaml — the cloud-init substitute ALWAYS provides a
     value, so a missing substitute is a tofu rendering bug that
     should surface at chart admission, not silently render hel1.
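
Step 1's label derivation (`hz-<region-prefix-no-digits>-rtz-prod`) can be sketched as below. The real computation lives in tofu locals; this Python version is an illustration only, and taking the leading letters of the region string is an assumption about how "no digits" is applied to keys such as `nbg1-1`.

```python
import re


def region_canonical_label(region, provider_prefix="hz"):
    """Derive the canonical region label from a region name or region key."""
    match = re.match(r"[a-z]+", region)
    if not match:
        raise ValueError(f"cannot derive region prefix from {region!r}")
    return f"{provider_prefix}-{match.group(0)}-rtz-prod"
```

The `provider_prefix` parameter reflects the note that other provider modules would derive their own prefix with the same pattern.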

Provider-agnostic per DoD A6: the `hz` prefix is correct only
because this file lives under infra/hetzner/; future infra/aws/
and infra/huawei/ modules will derive `aw` / `hw` in their own
per-module locals using the same pattern.

DoD impact unblocked:
  - D10 (cilium clustermesh peer entries): clustermesh-apiserver
    Service now annotates the correct region for hcloud-ccm LB
    allocation on every peer, not just primary=hel1.
  - D12 (clustermesh LB external IP allocated): no longer pending
    on non-hel1 primary or any secondary because the location
    annotation now reflects each peer's real region.
  - D13 (canvas per-region bubble grouping): k3s nodes report
    their actual region label so FlowNode.region values
    differentiate across clusters.

Tests added (infra/hetzner/tests/multi_region.tftest.hcl,
run "per_region_cloud_init_carries_secondarys_own_region"):
  - SOVEREIGN_REGION_KEY / HCLOUD_LB_LOCATION render per-region
    (regression test for the templatefile contract).
  - openova.io/region= node-label is the per-region canonical
    label (`hz-nbg-rtz-prod` on nbg1-1, `hz-sin-rtz-prod` on sin-2,
    `hz-hel-rtz-prod` on primary hel1).
  - QA_PRIMARY_REGION substitute carries the Sovereign's primary
    region label on every cluster's bootstrap-kit substitute.
  - Negative assertions catch any regression that re-introduces
    `hz-fsn-rtz-prod` on a non-fsn1 Sovereign.

Test result: 7 passed, 2 pre-existing failures (qa_mode SKU
override tests — unrelated, present on origin/main, separate
contract from Fix #183 body-first coalesce).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(tofu): align qa_mode SKU tests with Fix #183 body-first coalesce contract

Pre-existing test failures on origin/main since Fix #183 (PR #1386,
2026-05-11) inverted the coalesce direction in
`local.effective_cp_size = local.qa_mode ?
coalesce(var.control_plane_size, var.qa_control_plane_size) :
var.control_plane_size`. The pre-Fix-#183 tests asserted that
qa_control_plane_size wins when qa_fixtures_enabled='true', but the
new contract is the OPPOSITE: body wins (variables.tf default
`cpx22` for control_plane_size is non-empty so coalesce always picks
it first; qa-default only activates when the body is empty, which
provisioner.go achieves by CONDITIONALLY omitting the var in
writeTfvars when the operator's body has no override — see
provisioner.go:1280-1289).

Inside tofu test we can't conditionally omit a variable, so the
variables.tf default ALWAYS wins. Updated assertions:

  - qa_mode_on_flips_to_bigger_skus → asserts variables.tf default
    `cpx22` wins (the auto-flip is exercised at the provisioner-side
    boundary, not tofu-side).
  - qa_mode_on_respects_explicit_overrides → asserts the body-first
    behavior when only qa_control_plane_size is set (no
    control_plane_size override).
  - NEW qa_mode_on_body_overrides_win → asserts the operator's
    explicit control_plane_size/worker_size wins verbatim — the
    canonical "body wins" lane Fix #183 codified.
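
The body-first contract can be modelled with a stand-in for tofu's `coalesce`, which returns the first argument that is neither null nor an empty string. This is a sketch for illustration; the Python function names are hypothetical and the SKU strings are examples.

```python
def coalesce(*values):
    """Return the first value that is neither None nor an empty string."""
    for value in values:
        if value is not None and value != "":
            return value
    raise ValueError("no non-empty value supplied")


def effective_cp_size(qa_mode, control_plane_size, qa_control_plane_size):
    """Body wins: the QA size applies only when the body value is empty."""
    if qa_mode:
        return coalesce(control_plane_size, qa_control_plane_size)
    return control_plane_size
```

Because the `variables.tf` default is non-empty, the first branch always picks the body value inside `tofu test`, which is what the updated assertions check.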

Tests result: 10 passed, 0 failed (was 7 passed, 2 failed on
origin/main since Fix #183).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: hatiyildiz <269457768+hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 10:57:48 +04:00
e3mrah
0c9e391d59
fix(tofu): pass sovereign_fqdn_slug into secondary regions templatefile (#1511)
* fix(clustermesh): default clustermesh-apiserver to LoadBalancer (DoD A3)

DoD A3 from docs/SOVEREIGN-MULTI-REGION-DOD.md: Cilium ClusterMesh
apiserver Service MUST be LoadBalancer (NEVER NodePort).

Pre-this-change: bootstrap-kit/01-cilium.yaml defaulted
${CLUSTERMESH_SERVICE_TYPE:=NodePort}. Every multi-region Sovereign
landed with clustermesh-apiserver as NodePort, in direct violation of
A3 and breaking AutoEstablishClusterMesh (handler/clustermesh.go,
PR #1508) which hard-fails on Service.type != LoadBalancer.

Caught on prov t112.omani.works (f2e7f02e6ffb6a18, 2026-05-15):
- 3 cpx52 region cluster (hel1+nbg1+sin) converged HRs Ready=True
- clustermesh-apiserver Service = NodePort on all 3 regions
- cilium-clustermesh peer Secret empty (0 peers) — orchestrator
  never wrote them because of the type-check
- D10 + D12 both failed silently

Fix flips the chart default to LoadBalancer and threads Hetzner CCM
LB annotations (location, type, name) from the bootstrap-kit
substitute env. provisioner now emits CLUSTERMESH_SERVICE_TYPE +
HCLOUD_LB_LOCATION + SOVEREIGN_FQDN_SLUG into the cloud-init
postBuild substitute block alongside the existing CLUSTER_MESH_NAME
+ CLUSTER_MESH_ID.

Operator escape hatch preserved: bare-metal / non-cloud Sovereigns
override CLUSTERMESH_SERVICE_TYPE=NodePort in their per-Sovereign
bootstrap-kit overlay.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(tofu): pass sovereign_fqdn_slug into secondary regions templatefile

PR #1509 added ${sovereign_fqdn_slug} reference to cloudinit-control-plane.tftpl
(for the Hetzner CCM LB name annotation on clustermesh-apiserver) and wired
it into the PRIMARY templatefile() invocation in main.tf, but missed the
SECONDARY-regions templatefile() at line ~990. Every multi-region prov
now fails at `tofu plan`:

  Invalid value for "vars" parameter: vars map does not contain key
  "sovereign_fqdn_slug", referenced at ./cloudinit-control-plane.tftpl:991,37-56.

Caught on prov t113.omani.works (82c3587b97156a08, 2026-05-15) — first
multi-region prov against #1509's chart fix. Phase-0 failed at plan
before any servers spun up.

Fix is trivial: thread the same replace(var.sovereign_fqdn, ".", "-")
through the for_each secondary block.
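
Both halves of this fix can be illustrated briefly: the slug derivation, and a check that every plain `${name}` reference in a template has a key in the vars map. This is a sketch, not repository code; `fqdn_slug` and `missing_template_vars` are hypothetical names, and the regular expression only covers simple identifier references.

```python
import re

# Plain "${name}" references; the lookbehind skips the escaped "$${...}" form.
PLAIN_REFERENCE = re.compile(r"(?<!\$)\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def fqdn_slug(fqdn):
    """Mirror replace(var.sovereign_fqdn, ".", "-")."""
    return fqdn.replace(".", "-")


def missing_template_vars(template_text, vars_map):
    """Return referenced variable names that are absent from the vars map."""
    referenced = set(PLAIN_REFERENCE.findall(template_text))
    return sorted(referenced - set(vars_map))
```

Running the check against each templatefile() call's vars map would flag a call site that was missed when a new reference is added.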

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 00:00:19 +04:00
e3mrah
5f8ba85dc5
fix(clustermesh): default clustermesh-apiserver to LoadBalancer (DoD A3) (#1509)
DoD A3 from docs/SOVEREIGN-MULTI-REGION-DOD.md: Cilium ClusterMesh
apiserver Service MUST be LoadBalancer (NEVER NodePort).

Pre-this-change: bootstrap-kit/01-cilium.yaml defaulted
${CLUSTERMESH_SERVICE_TYPE:=NodePort}. Every multi-region Sovereign
landed with clustermesh-apiserver as NodePort, in direct violation of
A3 and breaking AutoEstablishClusterMesh (handler/clustermesh.go,
PR #1508) which hard-fails on Service.type != LoadBalancer.

Caught on prov t112.omani.works (f2e7f02e6ffb6a18, 2026-05-15):
- 3 cpx52 region cluster (hel1+nbg1+sin) converged HRs Ready=True
- clustermesh-apiserver Service = NodePort on all 3 regions
- cilium-clustermesh peer Secret empty (0 peers) — orchestrator
  never wrote them because of the type-check
- D10 + D12 both failed silently

Fix flips the chart default to LoadBalancer and threads Hetzner CCM
LB annotations (location, type, name) from the bootstrap-kit
substitute env. provisioner now emits CLUSTERMESH_SERVICE_TYPE +
HCLOUD_LB_LOCATION + SOVEREIGN_FQDN_SLUG into the cloud-init
postBuild substitute block alongside the existing CLUSTER_MESH_NAME
+ CLUSTER_MESH_ID.

Operator escape hatch preserved: bare-metal / non-cloud Sovereigns
override CLUSTERMESH_SERVICE_TYPE=NodePort in their per-Sovereign
bootstrap-kit overlay.

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-15 23:40:04 +04:00
e3mrah
93f699326a
infra(hetzner): per-region hcloud_network — DMZ-WG, no shared private net (#1507)
* docs(sovereign): pin multi-region DoD contract — never divert from D1-D14

Founder ruling 2026-05-15: every silent compromise from the multi-region
target-state architecture is a quality violation. This file locks the
convergence contract so future Claude sessions cannot drift.

Architecture invariants A1-A6:
- 3 regions minimum (never drop to 2 to dodge provider capacity)
- Inter-region link = DMZ WireGuard over PUBLIC IPs, ALWAYS
  (no hcloud_network cross-region, no VPC peering, no Huawei VPC)
- Cilium ClusterMesh apiserver = LoadBalancer (NEVER NodePort)
- vCluster topology: primary = MGMT+DMZ, secondary = DMZ+RTZ
- Zero public exposure of K8s control-plane endpoints
- Provider-mix is canonical (assume 1 Hetzner + 1 AWS + 1 Huawei)

DoD gates D1-D14 enforced via Playwright MCP + kubectl + cilium CLI on
every fresh prov. No partial credit, no "deferred", no "matrix-drift".

Mirrored to auto-memory at
~/.claude/projects/-home-openova-repos-openova-private/memory/sovereign_multiregion_dod.md
so it loads at every session start.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* infra(hetzner): per-region hcloud_network — DMZ-WG, no shared private net

Implements A1+A2+A6 from docs/SOVEREIGN-MULTI-REGION-DOD.md. Each region
gets its own hcloud_network (10.0.0.0/16 INSIDE each, not shared across).
Inter-region link is exclusively Cilium WireGuard over PUBLIC IPs through
the DMZ — no provider's internal network ever spans regions.

- Replaces hcloud_network.main + hcloud_network_subnet.{main,secondary}
  with hcloud_network.region[*] + hcloud_network_subnet.region[*]
  (for_each over toset(local.all_region_keys); primary key = "primary",
  secondary keys = slice-G1 "{cloudRegion}-{index}" shape).
- Per-region cluster-cidr (10.<42+i>.0.0/16) + service-cidr (10.<96+i>.0.0/16)
  threaded through cloud-init so ClusterMesh peers don't collide on
  pod/service CIDRs (DoD gate D11).
- Firewall: open UDP 51871 from 0.0.0.0/0 (Cilium WG inter-region
  encryption) — without this the WG mesh between regions cannot form.
- Each CP's local private IP is now uniformly 10.0.1.2 per region
  (every region has its own /24 inside its own /16 — no cross-region
  IP collision class possible by construction).
- Hetzner resource names threaded to cluster-autoscaler now use
  hcloud_network.region["primary"|<k>].name so autoscaler-spawned
  workers land in the same isolated /16 as their region's CP.
- Pre-2026-05-15 state will plan a network-recreate on next apply;
  per DoD cycle protocol this is consciously accepted (no tofu state
  mv runbook, every wipe-and-create is a fresh provision).
- tofu tests cover: per-region network count + uniform 10.0.0.0/16 +
  uniform 10.0.1.0/24 subnet + per-region cluster/service CIDRs +
  Cilium WG firewall rule existence.
- README "Network" section adds the 3-region DMZ-WG ASCII topology.
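
A minimal Go sketch of the per-region CIDR scheme from the list above. It
assumes the `10.42+i` / `10.96+i` shorthand means the second octet is offset
by a zero-based region index `i`. `regionCIDRs` is a hypothetical helper for
illustration only; the real derivation lives in main.tf.

```go
package main

import (
	"fmt"
	"net/netip"
)

// regionCIDRs returns the pod (cluster) and service CIDRs for region index i.
// Assumption: i is zero-based and offsets the second octet of each range.
func regionCIDRs(i int) (cluster, service netip.Prefix, err error) {
	// Keep 10.<42+i> below 10.96, where the service ranges start.
	if i < 0 || i >= 54 {
		return netip.Prefix{}, netip.Prefix{}, fmt.Errorf("region index %d out of range", i)
	}
	cluster = netip.PrefixFrom(netip.AddrFrom4([4]byte{10, byte(42 + i), 0, 0}), 16)
	service = netip.PrefixFrom(netip.AddrFrom4([4]byte{10, byte(96 + i), 0, 0}), 16)
	return cluster, service, nil
}

func main() {
	for i := 0; i < 3; i++ {
		c, s, err := regionCIDRs(i)
		if err != nil {
			panic(err)
		}
		fmt.Printf("region %d: cluster-cidr=%s service-cidr=%s\n", i, c, s)
	}
}
```

Because every region gets a distinct second octet, no two ClusterMesh peers
can share a pod or service range.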

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(tofu): apply tofu fmt — fixes CI fmt-check on PR #1507

Apply OpenTofu's canonical formatting to main.tf. No semantic
changes; only whitespace alignment under template substitute blocks
where my refactor added two fields (`cluster_cidr` and
`service_cidr`) that perturbed the prior column alignment.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: claude <claude@anthropic.com>
2026-05-15 22:04:32 +04:00
e3mrah
3a19bb161f
fix(cloudinit): thread cluster.name + cluster.id into pre-Flux cilium-values.yaml (#1503)
* fix(canvas): canonicalise Job.DependsOn entries with install- prefix — fix invisible edges

PR #1499 plumbed spec.dependsOn end-to-end and verified deps populate
on first event (no /refresh-watch needed). But the openova-flow snapshot
composer (flow_snapshot_local.go) emits finish-to-start relationships
where fromId = jobs.JobID(deploymentID, dep). Without the "install-"
prefix on each dep entry, fromId came out as:

  <dep>:hel1-2:seaweedfs                 (secondary, missing "install-")
  <dep>:gitea                            (primary, missing "install-")

But the FlowNode ids in the snapshot are:

  <dep>:install-hel1-2:seaweedfs
  <dep>:install-gitea

The FE canvas adapter matches by exact id → every finish-to-start rel
points at a non-existent node → 224 rels emitted, 0 edges rendered.

Caught on prov t103.omani.works (005080699326a7ac, 2026-05-15):

  curl /v1/flows/.../snapshot → 376 rels total: 152 contains, 224 finish-to-start
  every finish-to-start fromId malformed
  canvas: sibling edges invisible across all 135 install Jobs

Fix in two places:

  internal/handler/phase1_watch.go (spawnSecondaryRegionWatchers emit):
    Region-prefix each dep AND inject the "install-" prefix so
    ev.DependsOn = ["install-<region>:<chart>"] before the bridge
    receives the event. Symmetric with how ev.Component is constructed.

  internal/jobs/helmwatch_bridge.go (OnHelmReleaseEvent):
    Canonicalise every dep entry: if it doesn't already start with
    JobNamePrefix ("install-"), prepend it. Idempotent on entries
    that already are canonical (set by the phase1_watch.go path).
    Covers the primary-region path (bare chart names like "gitea")
    too — Job.DependsOn now stores "install-gitea", which matches
    the composer's emitted FromId exactly.
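
The canonicalisation step described above can be sketched in Go as follows.
This is an illustration under assumptions, not the code in
helmwatch_bridge.go: `canonicaliseDeps` and `jobNamePrefix` are hypothetical
names, and only the prefix value "install-" comes from this message.

```go
package main

import (
	"fmt"
	"strings"
)

// jobNamePrefix mirrors the "install-" JobNamePrefix named in this message.
const jobNamePrefix = "install-"

// canonicaliseDeps returns a copy of deps in which every entry carries the
// "install-" prefix. Entries that already have it are left unchanged, so
// calling it twice gives the same result as calling it once.
func canonicaliseDeps(deps []string) []string {
	out := make([]string, 0, len(deps))
	for _, d := range deps {
		if !strings.HasPrefix(d, jobNamePrefix) {
			d = jobNamePrefix + d
		}
		out = append(out, d)
	}
	return out
}

func main() {
	deps := []string{"gitea", "hel1-2:seaweedfs", "install-cilium"}
	fmt.Println(canonicaliseDeps(deps))
}
```

Idempotence is what lets the same pass run on both the event-carried path
and the primary-region path without double-prefixing.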

Tests: go build ./... + go test on internal/jobs + helmwatch + provisioner
all green. (Pre-existing TestHandleWhoami_* flake in handler is unrelated.)

* fix(canvas): canonicalise resolved DependsOn too — kill malformed prior values

Follow-up to PR #1500. The canon block ran on the event-carried dependsOn
arg, but the 3-tier resolve preferred existing-store value when non-empty
— which for any Job written BEFORE PR #1500 rolled out was malformed
(no "install-" prefix). t103.omani.works snapshot kept emitting 224
finish-to-start rels with malformed fromIds because the existing Job
rows held "hel1-2:gitea" entries that the resolve preserved verbatim.

Fix: after the 3-tier resolve, run a final canonicalisation pass on
resolvedDeps so every persisted entry is canonical regardless of
whether it came from event-carried (already canon by my prior block)
or from existing-store (potentially malformed legacy).

Note: this fix only takes effect on the NEXT HR state transition for a
given Job. HRs already in terminal state (e.g. t103's 135 succeeded HRs)
will keep their malformed deps until a new event fires. The loop's next
cycle (t104+) writes canonical from event 1.

* fix(provisioner): auto-derive cluster_mesh_name + cluster_mesh_id for multi-region provs

Caught on prov t104.omani.works (98395b3d9bd9c1aa, 2026-05-15): operator
submitted a multi-region body (3 regions cpx52) but omitted
ClusterMeshName/ClusterMeshID. catalyst-api defaulted them to "" and 0.
Tofu wrote cluster_mesh_name="" + cluster_mesh_id=0 to tfvars. Flux
postBuild.substitute rendered cilium-config with cluster.name=default +
cluster.id=0. Cilium kvstoremesh refused to start:
  "ClusterID 0 is reserved"
clustermesh-apiserver CrashLoopBackOff 16 restarts. No mesh ever formed.
Cross-region observability + east-west routing permanently broken.

Auto-derivation:

  ClusterMeshName: <first-fqdn-label>-mesh
    e.g. t105.omani.works → "t105-mesh"

  ClusterMeshID:  (sha256(deploymentID)[:4] as uint32) mod 252 + 1
    Range [1, 252]; main.tf increments for secondaries so the max id
    any region sees is primary + (regions - 1) ≤ 254 for up to 3
    regions. ID 255 is intentionally avoided (Cilium sentinel).

Operator override still respected — auto-derive only kicks in when
both fields are zero/empty AND len(Regions) > 1. Single-region provs
stay at "" / 0 (no mesh needed).

Tested derive helpers against the last 4 prov IDs — all land in valid
range:
  98395b3d9bd9c1aa → 74 (secondaries 75, 76)
  005080699326a7ac → 29 (secondaries 30, 31)
  22af2b1120158239 → 139
  c9df5eed1c1ba6cf → 180

Build + provisioner unit tests green.
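
A minimal Go sketch of the two derivations described above. The helper names
are hypothetical, and the byte order used to read the first four digest
bytes is an assumption (big-endian here), because this message does not
state it.

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
	"strings"
)

// deriveMeshName returns "<first-fqdn-label>-mesh".
func deriveMeshName(fqdn string) string {
	label, _, _ := strings.Cut(fqdn, ".")
	return label + "-mesh"
}

// deriveMeshID maps a deployment ID into [1, 252].
// Assumption: the first 4 bytes of the SHA-256 digest are read big-endian.
func deriveMeshID(deploymentID string) uint32 {
	sum := sha256.Sum256([]byte(deploymentID))
	return binary.BigEndian.Uint32(sum[:4])%252 + 1
}

func main() {
	fmt.Println(deriveMeshName("t105.omani.works"))
	fmt.Println(deriveMeshID("98395b3d9bd9c1aa"))
}
```

If the real helper reads the bytes in a different order, or hashes a
different encoding of the ID, the sketch will produce different IDs than the
ones listed above. The range guarantee holds either way.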

* fix(cloudinit): thread cluster.name + cluster.id into pre-Flux cilium-values.yaml

t105.omani.works (a6c0f5dfebd63bd0, 2026-05-15) found that PR #1502's
catalyst-api auto-derive (cluster_mesh_name=t105-mesh, cluster_mesh_id=99)
correctly reached cilium-config — but only AFTER Flux helm-upgraded the
release. The pre-Flux Cilium install (cloud-init line 1473) used
/var/lib/catalyst/cilium-values.yaml which DIDN'T carry cluster.name or
cluster.id, so cilium-agent started with the chart defaults
("default", 0). The Flux upgrade then changed cilium-config but the
already-running cilium-agent kept its in-memory cluster.name="default"
because it reads ConfigMap once at startup.

Downstream consequences observed live on t105:
  hubble-relay CrashLoopBackOff:
    "tls: failed to verify certificate: x509: certificate is valid for
     *.t105-mesh.hubble-grpc.cilium.io, not catalyst-t105-omani-works-cp1
     .default.hubble-grpc.cilium.io"
  clustermesh peer announcements use stale "default" identity →
  cross-region mesh handshakes x509-fail.

Fix: include cluster.name + cluster.id in the pre-Flux helm install's
values file, sourced from the templatefile() vars cluster_mesh_name +
cluster_mesh_id (already threaded per-region by main.tf:381-382 and
:900-901). Now the first cilium-agent process announces with the
correct identity, no helm-upgrade race.

---------

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
2026-05-15 19:48:58 +04:00
e3mrah
1dc21bfd51
fix(cloud-init): accept Hetzner DHCP routes on private NIC (use-routes: true) (#1489)
The netplan stanza for the hot-attached private NIC had
`dhcp4-overrides.use-routes: false`, which discards Hetzner DHCP's
classless static routes. Result: the interface gets `10.0.1.2/32` (host
route only) with NO route for the 10.0.0.0/8 private network. The
kernel routes all return traffic (including SYN-ACK to the Hetzner LB
at 10.0.1.254) via eth0's default route — the public NIC.

The Hetzner LB's health-check SYN reaches the node over the private
network, but the SYN-ACK leaves via the public NIC; Hetzner drops it as
asymmetric. Target stays `unhealthy` forever on every service port.
Caught live on prov 6dfade27 (omani.works, 2026-05-14): all 3 region
LBs marked unhealthy on 53/80/443 — public surface blackholed despite
3-region × 45/45 HRs Ready + valid PROD cert + envoy listening on
0.0.0.0:30443.

Confirmed via tcpdump on the host:
  enp7s0 In  10.0.1.254.X > 10.0.1.2:30443 [S]   ← SYN arrives on private
  eth0   Out 10.0.1.2:30443 > 10.0.1.254.X [S.] ← SYN-ACK on wrong NIC

Fix: change to `use-routes: true`. Hetzner DHCP-provided routes have
higher metric than eth0's default (metric 100), so the public default
stays intact; we only gain the per-subnet 10.0.0.0/N route needed for
symmetric routing on the private NIC.

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 22:52:01 +04:00
e3mrah
cebc9542d7
fix(cloudinit): escape ${WILDCARD_CERT_ISSUER} reference in comment so templatefile() doesn't try to interpolate it (#1485)
OpenTofu's `templatefile()` parses `${...}` expressions everywhere in the
template body — including comments. A comment on line 1072 of
cloudinit-control-plane.tftpl referenced the Kustomization-time variable
`${WILDCARD_CERT_ISSUER}` as documentation, but tofu reads it as a
template var lookup → fails with `vars map does not contain key
"WILDCARD_CERT_ISSUER"` → `tofu plan` exit 1.

Fix: escape the documentation reference with `$${WILDCARD_CERT_ISSUER}`
so it survives as literal text in the rendered file. The actual variable
binding `WILDCARD_CERT_ISSUER: "${wildcard_cert_issuer}"` two lines below
is unchanged (it correctly maps the lowercase tofu local to the
uppercase Kustomization postBuild key).

Caught live on prov #81 (omani.works), the first provision after #1481
landed the WILDCARD_CERT_ISSUER threading. omantel.biz had been
provisioned BEFORE #1481 merged so it never exercised the new tftpl
path.

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 20:20:51 +04:00
e3mrah
a88e132be9
fix(tls): cilium-gateway-cert STAGING/PROD issuer selectable via tofu (#1481)
clusters/_template/sovereign-tls/cilium-gateway-cert.yaml hardcoded
letsencrypt-dns01-prod-powerdns regardless of qa_test_session_enabled.
On high-cadence QA reprov cycles this hits the LE PROD 5/168h rate
limit (caught on prov #76 at 13:45 UTC, retry-after 16:49 UTC) and
the wildcard Certificate sticks Ready=False — Cilium Gateway has no
valid TLS secret → envoy listener never binds → public TLS handshake
to console.<fqdn> dies with SSL_ERROR_SYSCALL.

Add tofu local.wildcard_cert_issuer = qa_test_session_enabled ?
staging : prod. Thread WILDCARD_CERT_ISSUER through the sovereign-
tls Kustomization postBuild.substitute. cilium-gateway-cert.yaml
references it as ${WILDCARD_CERT_ISSUER}.

Default behaviour unchanged for non-QA (production) Sovereigns —
they still resolve to letsencrypt-dns01-prod-powerdns.

Co-authored-by: e3mrah <catalyst@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 18:25:45 +04:00
e3mrah
a75463f76a
fix(cloud-init): wait for private NIC before k3s install (prov #71) (#1464)
* fix(flow_snapshot): region-scope dep edges (no cross-region wiring)

Founder caught on prov #66 (3dc9249ea73a6840, 2026-05-13): hel1-2's
install-* nodes all rendered dep arrows pointing at PRIMARY's install
nodes — cross-region edges where NAMING-CONVENTION §1.3 demands
independent fault domains (no cross-region wiring).

Root cause: helmwatch.Bridge persists secondary-region Jobs with bare
dep names ("install-cilium") because HR.spec.dependsOn carries chart
names without region context. The snapshot composer's normaliser
turned `install-cilium` → `<depID>:install-cilium` which IS the
primary's cilium JobID, not hel1-2's `<depID>:install-hel1-2/cilium`.
Every secondary install therefore drew a phantom cross-region edge.

Fix: in flow_snapshot_local.go, region-scope dep names when the source
Job is regional:

  jobRegion=="hel1-2" + dep="install-cilium"
    → "install-hel1-2/cilium" → "<depID>:install-hel1-2/cilium"

Same fix applied to the Layer-2 hrDeps derivation path (per-AppID
lookup also gets bare chart names from the primary watcher). hrDeps
lookup is now done with the unprefixed AppID so it actually hits.
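
The region-scoping rule above can be sketched in Go like this.
`regionScopeDep` is a hypothetical helper, and treating an empty `jobRegion`
as "primary, leave the name bare" is an assumption; the rewrite itself
follows the `install-cilium` → `install-hel1-2/cilium` example.

```go
package main

import (
	"fmt"
	"strings"
)

// regionScopeDep rewrites a bare dep name so it points at the same region
// as the job that declares it.
// Assumption: an empty jobRegion denotes the primary region.
func regionScopeDep(jobRegion, dep string) string {
	if jobRegion == "" {
		return dep
	}
	chart := strings.TrimPrefix(dep, "install-")
	if strings.HasPrefix(chart, jobRegion+"/") {
		return "install-" + chart // already region-scoped
	}
	return "install-" + jobRegion + "/" + chart
}

func main() {
	dep := regionScopeDep("hel1-2", "install-cilium")
	fmt.Println(dep)
	fmt.Println("<depID>:" + dep)
}
```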

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(cloud-init): wait for private NIC before k3s install (prov #71)

Hetzner Cloud hot-attaches the private-network NIC ~10-20s AFTER server
create. cloud-init init-local fetches /hetzner/v1/metadata/private-networks
BEFORE the NIC is ready, renders netplan with only eth0, and the
private NIC (kernel-renamed eth1 → enp7s0 by udev) stays DOWN.

Effect on secondary CPs: k3s server starts with
  --node-ip=10.0.<10+idx>.2 --advertise-address=10.0.<10+idx>.2
and fatals on
  "listen tcp 10.0.11.2:2380: bind: cannot assign requested address"
then crashloops. Caught on prov #71/omantel.biz/nbg1-1-cp1: k3s.service
restart counter reached 5394, kubeconfig never PUT back to mothership,
canvas showed secondary region as a permanent black hole. Diagnosed via
Hetzner rescue mode SSH 2026-05-14. The primary CP works only by luck:
the NIC attaches faster in the fsn1 zone.

Fix: in cloud-init runcmd, BEFORE the k3s install, poll up to 120s for
the expected private IP (control plane) or a route to it (worker). If
the NIC appears DOWN with no netplan stanza, generate one with dhcp4:true
and `netplan apply`. Bail loudly if the IP/route never appears — failures
surface in cloud-init.log instead of disguising as a slow boot.

Symmetric fix in worker template covers autoscaler-spawned secondary
workers when worker_count > 0.
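
The actual fix is shell inside cloud-init runcmd. Purely as an illustration
of the poll-then-bail logic, here is a Go sketch. `waitForIP`, `hasIP`, and
`fakeLister` are hypothetical names, and the demo uses a fake address lister
with a short deadline so it finishes quickly; a real caller would pass
`net.InterfaceAddrs` and a 120s deadline.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

// hasIP reports whether any address in addrs equals want.
func hasIP(addrs []net.Addr, want net.IP) bool {
	for _, a := range addrs {
		if ipn, ok := a.(*net.IPNet); ok && ipn.IP.Equal(want) {
			return true
		}
	}
	return false
}

// waitForIP polls list until want is assigned or ctx expires.
// On expiry it returns an error so the failure is loud, not silent.
func waitForIP(ctx context.Context, want net.IP, interval time.Duration,
	list func() ([]net.Addr, error)) error {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		if addrs, err := list(); err == nil && hasIP(addrs, want) {
			return nil
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("private IP %s never appeared: %w", want, ctx.Err())
		case <-ticker.C:
		}
	}
}

// fakeLister simulates a NIC whose address shows up on the n-th poll.
func fakeLister(ip net.IP, n int) func() ([]net.Addr, error) {
	calls := 0
	return func() ([]net.Addr, error) {
		calls++
		if calls < n {
			return nil, nil
		}
		return []net.Addr{&net.IPNet{IP: ip, Mask: net.CIDRMask(32, 32)}}, nil
	}
}

// demo waits for a simulated hot-attached NIC to receive 10.0.1.2.
func demo() error {
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()
	want := net.ParseIP("10.0.1.2")
	return waitForIP(ctx, want, 10*time.Millisecond, fakeLister(want, 3))
}

func main() {
	if err := demo(); err != nil {
		fmt.Println("bail:", err)
		return
	}
	fmt.Println("private IP present, safe to install k3s")
}
```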

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 07:39:25 +04:00
e3mrah
32e0b408bf
fix(k3s): add public IP --tls-san + openova.io/region node label (#1459)
Two related fixes for multi-region + qa-fixtures DoD on prov #64:

1. **k3s TLS cert needs the public IPv4 in SAN.**
   Mothership helmwatch.Bridge connects to secondary CPs via PUBLIC IP
   (cloud-init rewrites kubeconfig 127.0.0.1 → CP_PUBLIC_IPV4). k3s
   auto-generates the server cert with SANs from --tls-san flags. We
   only had [sovereign_fqdn, cp_private_ip] → cert valid for 10.0.10.2
   + cluster-ip + 127.0.0.1 only. Bridge connection from contabo
   rejected with:
     "x509: certificate is valid for 10.0.10.2, 10.43.0.1, 127.0.0.1,
      ::1, not 204.168.212.113"
   → silent watcher failure → 0 secondary HRs observed → canvas missing
   region sub-groups.
   Fix: pre-fetch the CP's public IPv4 from Hetzner metadata before
   k3s install, add it as --tls-san=$CP_PUBLIC_IPV4.

2. **openova.io/region=hz-fsn-rtz-prod node label.**
   qa-fixtures Pods (CNPGPair primary/replica, status seeder Jobs,
   qa-wp Application) carry hard nodeAffinity for
   `openova.io/region in [hz-fsn-rtz-prod]` (per qaFixtures.primaryRegion
   default in products/catalyst/chart/templates/qa-fixtures/*.yaml).
   Without the label every fixture pod FailedScheduling → bp-catalyst-
   platform post-install hook waits forever → bootstrap-kit chain hangs
   at 44/45 with bp-catalyst-platform Running.
   Fix: --node-label openova.io/region=hz-fsn-rtz-prod on primary CP
   (qa-fixtures pin to primary by design).

Both shipped in same commit since both are inside the same k3s server
install line.
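
To illustrate why the missing SAN in fix 1 breaks the Bridge connection,
the Go sketch below builds a throwaway self-signed certificate and checks
which hosts it covers. `selfSignedWithIPs` and `covers` are hypothetical
helpers; the IPs are the ones quoted in the error above. This is not how k3s
issues its certificate, only a demonstration of the SAN check.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"fmt"
	"math/big"
	"net"
	"time"
)

// selfSignedWithIPs returns a parsed self-signed certificate whose SAN list
// contains exactly the given IP addresses.
func selfSignedWithIPs(ips ...string) (*x509.Certificate, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, err
	}
	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(1),
		Subject:      pkix.Name{CommonName: "san-demo"},
		NotBefore:    time.Now().Add(-time.Minute),
		NotAfter:     time.Now().Add(time.Hour),
	}
	for _, s := range ips {
		ip := net.ParseIP(s)
		if ip == nil {
			return nil, fmt.Errorf("invalid IP %q", s)
		}
		tmpl.IPAddresses = append(tmpl.IPAddresses, ip)
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		return nil, err
	}
	return x509.ParseCertificate(der)
}

// covers reports whether a certificate with the given IP SANs is valid for host.
func covers(sans []string, host string) (bool, error) {
	cert, err := selfSignedWithIPs(sans...)
	if err != nil {
		return false, err
	}
	return cert.VerifyHostname(host) == nil, nil
}

func main() {
	public := "204.168.212.113"
	before, _ := covers([]string{"10.0.10.2", "127.0.0.1"}, public)
	after, _ := covers([]string{"10.0.10.2", "127.0.0.1", public}, public)
	fmt.Println("public IP covered before fix:", before)
	fmt.Println("public IP covered after fix: ", after)
}
```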

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 19:38:25 +04:00
e3mrah
44913d8a6a
fix(k3s): --kubelet-arg=max-pods=220 (CP + worker) for qa-fixtures load (#1458)
prov #63 (cpx52 × 3, all PRs live): bp-catalyst-platform install hook
timed out because the catalyst-api Helm-released pod stayed Pending
with "Too many pods. 0/1 nodes are available".

k3s kubelet default max-pods is 110. Full bootstrap-kit (~45 HR-managed
deployments, each with 1-3 pods) + qa-fixtures stack (qa-omantel ns
Application + Continuum + CNPGPair + PDM CRs + seeder Jobs) + Cilium/
flux/cnpg sidecars fill all 110 slots. With workers NotReady on
prov #63 the CP carried everything alone and stopped scheduling at 110.

Bump to 220 on both CP and worker so the saturation point doesn't gate
the bootstrap chain. Safe ceiling: each Hetzner cpx52 node has 16 vCPU
+ 32GB RAM, plenty of headroom for 220 pods of typical bootstrap-kit
weight.

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 18:37:42 +04:00
e3mrah
5f4f9f2cb5
fix(k3s): pin --node-ip + --advertise-address to cp_private_ip (#1457)
prov #62 (cpx52, kernel 6.8.0-111): primary CP cilium init CrashLoop
with "dial tcp 10.0.1.2:6443: i/o timeout". k3s server auto-detects
its node IP from the primary interface, which on Hetzner cpx52 binds
to the public IPv4 (49.x.x.x) instead of the private network IP
(10.0.1.2). kube-apiserver advertises 49.x.x.x and binds there;
nothing answers on 10.0.1.2:6443. Cilium agent's k8s-client wants the
private IP from cilium-config k8sServiceHost — times out, CrashLoop.

Worked by luck on cpx42 (earlier kernel + Hetzner network attach
timing). cpx52 reproduces 100%.

Fix: pass --node-ip=${cp_private_ip} + --advertise-address=${cp_private_ip}
in INSTALL_K3S_EXEC. k3s then binds kube-apiserver on the private IP
AND advertises it as the node's INTERNAL-IP. Pods reaching ${cp_private_ip}:6443
(cilium-config substitute) find the API server every time.

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 17:34:30 +04:00
e3mrah
68372d700b
fix(hetzner): pass cp_private_ip into secondary CP templatefile (multi-region prov #52-54 unblock) (#1448)
* fix(infra): pass cp_private_ip to primary CP templatefile too

PR #1446 added cp_private_ip references in cloudinit-control-plane.tftpl
but only the SECONDARY templatefile call at main.tf:840 already had
that var threaded. The PRIMARY CP call at line 342 was missed and
tofu plan blew up with "vars map does not contain key cp_private_ip".

Set it to "10.0.1.2" for the primary (the hardcoded value the chart
default + worker_cloud_init already use for the canonical 10.0.1.0/24
primary subnet).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(hetzner): pass cp_private_ip into secondary-region CP cloud-init templatefile

prov #52-54 all failed at `tofu plan` once cloudinit-control-plane.tftpl
started consuming ${cp_private_ip} (PR #1446):

    Invalid value for "vars" parameter: vars map does not contain key
    "cp_private_ip", referenced at ./cloudinit-control-plane.tftpl:657,30-43.

The primary CP templatefile call (main.tf:342) and the secondary WORKER
templatefile call (main.tf:944) both pass `cp_private_ip`, but the
secondary CP templatefile call (main.tf:860) was missed — every
multi-region provision since PR #1446 lands here at plan-time.

Fix: thread `cp_private_ip = local.secondary_region_cp_ips[k]` into the
secondary CP templatefile so each secondary region's cilium-operator
reaches its OWN local CP (matching CA), not the primary across regions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 20:11:23 +04:00
e3mrah
be47815ddf
fix(infra): pass cp_private_ip to primary CP templatefile too (#1447)
PR #1446 added cp_private_ip references in cloudinit-control-plane.tftpl
but only the SECONDARY templatefile call at main.tf:840 already had
that var threaded. The PRIMARY CP call at line 342 was missed and
tofu plan blew up with "vars map does not contain key cp_private_ip".

Set it to "10.0.1.2" for the primary (the hardcoded value the chart
default + worker_cloud_init already use for the canonical 10.0.1.0/24
primary subnet).

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 20:01:43 +04:00
e3mrah
cdcc50a213
fix(multi-region): cilium k8sServiceHost uses LOCAL CP private IP per region (#1446)
Each region's k3s is an INDEPENDENT cluster per NAMING-CONVENTION §1.3
"no stretched fault domain". Cilium on each region MUST talk to its
OWN local CP's k3s API server, not the primary's 10.0.1.2. Three sites
hardcoded the primary's IP:

1) Pre-Flux cilium helm install (cloudinit-control-plane.tftpl:665):
   `k8sServiceHost: 10.0.1.2` → `${cp_private_ip}` (rendered per-region
   by main.tf — primary 10.0.1.2, nbg1-1 10.0.11.2, hel1-2 10.0.12.2).

2) k3s install --tls-san=10.0.1.2 (line 1206): same `${cp_private_ip}`
   so each region's k3s API cert validates against the LOCAL CP's IP.

3) bp-cilium HelmRelease (clusters/_template/bootstrap-kit/01-cilium.yaml):
   add `k8sServiceHost: ${CILIUM_K8S_SERVICE_HOST:=10.0.1.2}` to the HR
   values so Flux postBuild.substitute can override per region. The
   cloud-init Kustomization renders the substitute var to `${cp_private_ip}`.
   Single-region (primary-only) provisions fall back to the
   default `10.0.1.2` and stay byte-identical to today.

Live evidence of the bug — prov #52 (3-region) on 2026-05-12:

  cilium-operator on nbg1 secondary:
  "Establishing connection to apiserver" host="https://10.0.1.2:6443"
  "failed to start: ... tls: failed to verify certificate:
   x509: certificate signed by unknown authority"

Each region's k3s has its OWN self-signed CA (cluster-init per CP). The
primary's API cert isn't signed by the secondary's CA → cilium crash-
loops → no CNI → flux controllers Pending → no HRs → canvas shows only
primary's HRs. This fix points each region's cilium at the LOCAL CP,
whose API server presents the matching CA from this cluster.

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 19:56:18 +04:00
e3mrah
19a847e514
fix(infra): restore \n escape in secondary CP templatefile regex (#1445)
The conflict-resolution Python script in PR #1444 wrote a literal
newline where the regex string needed the two-char "\n" escape. tofu
init rejected with "Invalid multi-line string / Unterminated template
string" on main.tf:925.

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 16:27:10 +04:00
e3mrah
4923938c2b
feat(multi-region-canvas): per-region kubeconfig PUT-back + per-region helmwatch (#1444)
Operator mandate (2026-05-12): the mothership canvas must surface
install-* HRs from EVERY region of a multi-region provision, not just
the primary CP's. Today catalyst-api stores ONE kubeconfig per
deployment (the primary CP's) and spawns ONE helmwatch.Bridge against
it. Result: secondary regions are invisible on the canvas even though
their k3s clusters are fully reconciling.

End-to-end change across infra + handler:

1) cloud-init (cloudinit-control-plane.tftpl): the kubeconfig PUT URL
   appends `?region=<kubeconfig_postback_region>` when the var is set.
   main.tf templatefile call passes empty for primary CP, `each.key`
   (e.g. "nbg1-1", "hel1-2") for each secondary region.

2) PutKubeconfig handler: reads ?region= query param. Empty → primary
   path (unchanged: stores at <dir>/<id>.yaml, sets
   Result.KubeconfigPath, fires Phase-1 watch + SMTP seed). Non-empty
   → secondary path: stores at <dir>/<id>-<region>.yaml, populates
   Deployment.secondaryKubeconfigPaths[region]. Single-use guard is
   per-region (the same bearer secures every CP's PUT — secondaries
   reuse it for their own slot). NO Phase-1 watch re-launch from a
   secondary PUT.

3) phase1_watch.spawnSecondaryRegionWatchers: runs alongside the
   primary's watcher. Scans <kubeconfigsDir>/<id>-*.yaml every 15s,
   spawns one helmwatch.NewWatcher per kubeconfig discovered, stores
   the Watcher on Deployment.secondaryWatchers[region]. Per-region
   watchers emit ordinary helmwatch events with region-prefixed
   Component names so the wizard's per-component view doesn't collide
   primary vs secondary bp-cilium events. They do NOT contribute to
   markPhase1Done — outcome remains the primary's classification.

4) flow_snapshot_local.flowSnapshotFromJobs: composes per-region group
   bubbles + install-* nodes from each secondary watcher's
   SnapshotComponents. Node id: <depID>:<region>:install-<chart>.
   FlowNode.region set so the canvas can colour-group. Intra-region
   finish-to-start deps emitted from cs.DependsOn — same-region only,
   never cross-region (per NAMING-CONVENTION §1.3 independent fault
   domains, no stretched cluster).

5) wipe.go: removes both <id>.yaml AND every <id>-*.yaml secondary
   kubeconfig file on Sovereign wipe.
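
The storage layout in steps 2, 3 and 5 can be sketched in Go as below.
`kubeconfigPath` and `regionFromPath` are hypothetical helpers; only the
`<id>.yaml` / `<id>-<region>.yaml` naming comes from this message.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// kubeconfigPath returns where a PUT-back kubeconfig is stored.
// An empty region selects the primary slot.
func kubeconfigPath(dir, id, region string) string {
	if region == "" {
		return filepath.Join(dir, id+".yaml")
	}
	return filepath.Join(dir, id+"-"+region+".yaml")
}

// regionFromPath recovers the region key from a secondary kubeconfig path,
// as a scanner over <dir>/<id>-*.yaml would need to. It returns false for
// the primary file and for files that belong to another deployment.
func regionFromPath(path, id string) (string, bool) {
	base := filepath.Base(path)
	prefix, suffix := id+"-", ".yaml"
	if !strings.HasPrefix(base, prefix) || !strings.HasSuffix(base, suffix) {
		return "", false
	}
	region := strings.TrimSuffix(strings.TrimPrefix(base, prefix), suffix)
	return region, region != ""
}

func main() {
	dir, id := "kubeconfigs", "ce25c31fff15c30c"
	for _, region := range []string{"", "nbg1-1", "hel1-2"} {
		p := kubeconfigPath(dir, id, region)
		r, secondary := regionFromPath(p, id)
		fmt.Printf("%s secondary=%v region=%q\n", p, secondary, r)
	}
}
```

A real handler would presumably also validate the `region` value before
using it in a file name; that check is omitted here.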

Storage model is uniform across SME and corporate Sovereigns. No
hardcoding of provider, region count, or building block.

Caught after operator pointed out that 3-region prov #50 was showing
only 52 install-* nodes (all from fsn1) on the canvas — the
architectural gap.

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 16:12:38 +04:00
e3mrah
c5d891ad0b
fix(infra): forward hcloud_*_name to secondary regions' CP cloud-init (#1443)
The F7 fix (Issue #1778) added hcloud_network_name / hcloud_firewall_name /
hcloud_ssh_key_name to cloudinit-control-plane.tftpl so the cluster
autoscaler could attach scale-up VMs to the private network. The
primary CP's templatefile call at main.tf:483-485 was updated, but the
matching call for secondary regions at main.tf:899 was missed.

Result: any provision with regions[] of length > 1 fails at tofu plan
with "vars map does not contain key hcloud_network_name" referenced in
cloudinit-control-plane.tftpl:478.

Hit live on prov #47 (ce25c31fff15c30c, 4-region: fsn1/nbg1/hel1/ash)
at T+0:47. Forward the same three resource refs to every secondary
region's templatefile call.

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 15:23:53 +04:00
e3mrah
b743b646ac
fix(autoscaler): attach scale-up VMs to private network so they k3s-join (#1427)
Root cause (autoscaler pod log, prov #43 chroot):
  W orchestrator.go:626 Node group workers is not ready for scaleup -
  backoff with status: Scale-up timed out for node group workers after
  15m2.273255226s

Hetzner API confirms autoscaler-spawned workers come up PUBLIC-ONLY:
  workers-77439321e2047e3e public_net.ipv4=178.105.102.237 private_net=[]
  workers-a6410e81b24cced  public_net.ipv4=178.105.73.210  private_net=[]

The worker cloud-init (identical to Phase-0 user_data) issues
  curl -sfL https://get.k3s.io | K3S_URL=https://10.0.1.2:6443 ... sh -
against the CP's PRIVATE 10.0.1.2 IP. Without the 10.0.0.0/16 attachment
that URL is unreachable → k3s agent install silent-fails → node never
registers with apiserver → autoscaler 15m timeout → backoff → bp-catalyst-
platform Pending Pods never schedulable → chroot canvas tests blocked.

Fix: wire HCLOUD_NETWORK / HCLOUD_FIREWALL / HCLOUD_SSH_KEY env vars on
the cluster-autoscaler deployment so the Hetzner provider attaches every
scale-up VM to the SAME private network + firewall + ssh-key the Phase-0
Tofu module created (resource names: catalyst-<sov-fqdn-with-dashes>-net /
-fw / catalyst-<sov-fqdn-with-dashes>). Names flow:

  Tofu (hcloud_network.main.name + hcloud_firewall.main.name +
        hcloud_ssh_key.main.name)
   → cloudinit-control-plane.tftpl (3 new template vars)
   → /var/lib/catalyst/cloud-credentials-secret.yaml (3 new keys)
   → flux-system/cloud-credentials Secret
   → bp-cluster-autoscaler-hcloud HelmRelease valuesFrom (3 optional entries
     with targetPath: cluster-autoscaler.extraEnv.HCLOUD_*)
   → upstream chart's deployment env

Chart bumped 1.2.0 → 1.3.0. New smoke-test gates (Cases 5+6) prevent
regression of the three env-var slots in chart values.yaml.

Reaffirms canonical seam: values flow through Tofu → cloud-init →
flux-system Secret → Flux valuesFrom → chart values → upstream env.
Never via kubectl patch, never via bespoke Go API calls.

Refs: prov #38/#39/#41/#43 omantel.biz scale-up backoff.

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 06:11:30 +04:00
e3mrah
22855e62d8
feat(openova-flow): catalyst-api proxy + cloud-init thread (Agent #3 — integrator, infra-side) (#1396)
Final integration piece for OpenovaFlow infrastructure path —
catalyst-api proxy + cloud-init substitution for SOVEREIGN_DEPLOYMENT_ID
+ SOVEREIGN_REGION_KEY, so bp-openova-flow-emitter (slot 57) emits
distinct region tags on every FlowNode and the snapshot returns 2× per
HR on a multi-region Sovereign.

Builds on PR #1389 (TS core + canvas packages on disk), PR #1390 (Go
server + flux adapter + bootstrap-kit slots 56/57), PR #1394 (catalyst-
ui temporary revert until npm workspaces land), PR #1395 (chart no-op).

## Scope vs original Agent #3 brief

The brief planned a 4-section PR (proxy + cloud-init + FlowPage rewire +
runbook). Section 3 (catalyst-ui rewire of @openova/flow-*) is deferred:
PR #1394 reverted Agent #1's UI wiring because the Docker UI build has
no node_modules for the cross-workspace canvas source. Founder note on
#1394: "Agent #3 (or a follow-up) will re-wire them properly once npm
workspaces are configured at repo root."

This PR ships the infrastructure half (proxy + cloud-init + runbook).
The canvas-side rewire is a separate follow-up PR that needs npm
workspaces, not surgical edits to FlowPage.

## What ships

### 1. catalyst-api proxy /api/v1/flows/{deploymentId}/{snapshot,stream,events}

products/catalyst/bootstrap/api/internal/handler/openova_flow_proxy.go:
- GET /snapshot — JSON pass-through, headers + status forwarded
- GET /stream — unbuffered SSE pass-through using http.Flusher (NOT
  httputil.ReverseProxy; that buffers and breaks text/event-stream)
- POST /events — body forwarded byte-for-byte
- Upstream URL from env OPENOVA_FLOW_SERVER_URL (default Sovereign
  in-cluster Service DNS)

Routes registered in cmd/api/main.go inside the auth-gated chi.Group.

11 table-driven tests cover snapshot/events/stream pass-through, upstream
404/400/unreachable propagation, empty-deploymentId guard, SSE frames
arrive AS EMITTED, and env-default fallback.
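
A minimal Go sketch of the flush-per-write pass-through used for /stream.
It is not the code in openova_flow_proxy.go: `streamSSE` and `demo` are
hypothetical names, and the header values ("no-cache", "no") are assumed,
since this message names the headers but not their values.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
)

// streamSSE copies upstream to w and flushes after every write, so each
// SSE frame reaches the client as soon as it has been read.
func streamSSE(w http.ResponseWriter, upstream io.Reader) error {
	flusher, ok := w.(http.Flusher)
	if !ok {
		return fmt.Errorf("response writer does not support flushing")
	}
	w.Header().Set("Content-Type", "text/event-stream")
	w.Header().Set("Cache-Control", "no-cache") // assumed value
	w.Header().Set("X-Accel-Buffering", "no")   // assumed value

	buf := make([]byte, 4096)
	for {
		n, err := upstream.Read(buf)
		if n > 0 {
			if _, werr := w.Write(buf[:n]); werr != nil {
				return werr
			}
			flusher.Flush()
		}
		if err == io.EOF {
			return nil
		}
		if err != nil {
			return err
		}
	}
}

// demo runs streamSSE against an in-memory recorder and returns the body.
func demo() (string, error) {
	rec := httptest.NewRecorder()
	err := streamSSE(rec, strings.NewReader("data: a\n\ndata: b\n\n"))
	return rec.Body.String(), err
}

func main() {
	body, err := demo()
	if err != nil {
		panic(err)
	}
	fmt.Printf("%q\n", body)
}
```

In a real handler the `upstream` reader would be the upstream response body,
and the loop would also stop when `r.Context().Done()` fires.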

### 2. Cloud-init threads SOVEREIGN_DEPLOYMENT_ID + SOVEREIGN_REGION_KEY

- infra/hetzner/cloudinit-control-plane.tftpl: two new
  postBuild.substitute keys alongside SOVEREIGN_FQDN/SOVEREIGN_LB_IP
- infra/hetzner/main.tf — primary CP renders var.region as region key;
  secondary CP renders each.key (e.g. "hel1-1") from for_each over
  local.secondary_regions
- infra/hetzner/variables.tf — new sovereign_deployment_id var (string,
  default "" for tofu mocks)
- provisioner.go writeTfvars — writes vars["sovereign_deployment_id"]
  = req.DeploymentID
- bootstrap-kit slot 57 — swap placeholder ${SOVEREIGN_FQDN} / literal
  "primary" for the new ${SOVEREIGN_DEPLOYMENT_ID} / ${SOVEREIGN_REGION_KEY}
  envsubst keys

### 3. Deployment record flag

handler/deployments.go State() — emits `openovaFlowEnabled: true` on
every deployment. The catalyst-ui rewire (follow-up PR) will read this
to enable the openova-flow-server adapter; legacy provisions without
the flag will keep the bridge once the rewire lands.

### 4. Verification runbook

docs/runbooks/openova-flow-multi-region-verify.md — prov #34 POST body
(multi-region cpx42 fsn1+hel1, qaTestEnabled=true,
sovereignFQDN=omantel.biz), step-by-step kubectl/curl gates, visual
canvas checks (gated on the follow-up UI rewire), and a failure-class
triage table.

## Canonical-seam citations

1. SSE pattern — products/catalyst/bootstrap/api/internal/handler/
   deployments.go:1244-1287 (StreamLogs): identical Content-Type +
   Cache-Control + X-Accel-Buffering header set; identical
   http.Flusher.Flush() after each write; identical r.Context().Done()
   cancel path.

2. postBuild.substitute pattern — infra/hetzner/cloudinit-control-plane.tftpl:884-893
   (SOVEREIGN_FQDN + SOVEREIGN_LB_IP): same indentation, same KEY: ${var}
   form, dual emission at primary + secondary CP for_each in main.tf.

## Verification

```
$ go build ./...
(clean)

$ go vet ./...
(clean)

$ go test ./internal/handler/ -run TestFlowProxy -count=1 -race
ok    github.com/openova-io/openova/products/catalyst/bootstrap/api/internal/handler   1.410s

$ go test ./internal/provisioner/... -count=1
ok    github.com/openova-io/openova/products/catalyst/bootstrap/api/internal/provisioner  0.025s
```

3 pre-existing test failures (TestHandleWhoami_NoRBACOmitsFields,
TestHandleWhoami_PinSessionRBACClaims,
TestUnstructuredToUserAccess_NilApplicationsBecomesEmpty) reproduce on
main HEAD without this PR — unrelated baseline state.

Co-authored-by: hatiyildiz <269457768+hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-11 16:01:09 +04:00
e3mrah
4e6bec7022
fix(infra): body-supplied SKUs win over QA defaults (Fix #183) (#1386)
* fix(catalyst-ui): delete malformed `import type from react` line (Fix #181)

Fix #180 (PR #1383) merged with a `sed -i` error that produced
`import type  from 'react'` (empty import binding), which is a syntax
error and broke the main build. This PR removes the malformed line
entirely.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(infra): pin LB private IPs + revert hel1 zone (Fix #182)

Root cause of prov #32 FATAL "hcloud/inlineAttachServerToNetwork:
attach server to network: IP not available" on hcloud_server.control_plane[0]:

  hcloud_load_balancer_network.{main,secondary} both attached to the
  shared network WITHOUT an explicit `ip` argument. Hetzner auto-allocates
  the first free IP from the first matching-zone subnet. In the
  multi-region prov #32 the secondary LB-network (hel1) completed first
  at t+16s and took 10.0.1.2 from the only eu-central subnet existing
  at that moment (`main` = 10.0.1.0/24) — stealing the IP the primary
  CP claims explicitly via `ip = "10.0.1.${count.index + 2}"`.

  Fix: pin LB anchors to top-of-subnet (.254) so they live outside the
  CP/worker IP range (.2..N for CPs, .10+ for workers).

Also revert Fix #179 (`hel1 = "eu-north"`). Hetzner /v1/locations API
on 2026-05-11 returns network_zone=eu-central for hel1. Fix #179 caused
prov #32's secondary subnet to fail with `invalid input in field
'network_zone' [network zone does not exist]`. The original prov #29/#30
"IP not available on secondary[hel1-1]" was the same LB-IP collision —
this PR resolves both.

Multi-region apply now lands cleanly:
  10.0.1.2     -> primary CP (cp1)
  10.0.1.254   -> primary LB anchor
  10.0.10.2    -> secondary CP (hel1-1)
  10.0.10.254  -> secondary LB anchor (hel1-1)

Refs: openova-private prov-loop session 2026-05-11 Wave 26

* fix(infra): body-supplied SKUs win over QA defaults (Fix #183)

Fix #157 introduced `effective_cp_size = coalesce(var.qa_control_plane_size,
var.control_plane_size)` when qa_fixtures_enabled='true'. Because
qa_control_plane_size has a non-empty default (cpx32), coalesce always
returned the QA default and silently overrode whatever the body supplied
in `controlPlaneSize`.

Founder-supplied body for prov #32 specified `controlPlaneSize: "cpx42"`
explicitly (cheapest viable for the founder's collapsed-CP+worker
single-node-per-region topology with workerCount=0). The QA-default
override downgraded that to cpx32 at plan time — the explicit choice
never made it onto the hardware.

Fix #183 — invert the coalesce so body wins:

  effective_cp_size = local.qa_mode
    ? coalesce(var.control_plane_size, var.qa_control_plane_size)
    : var.control_plane_size

`provisioner.go` writeTfvars already emits control_plane_size / worker_size
only when the body's field is non-empty (so `var.control_plane_size`
inherits variables.tf's cost-optimised default when the body left it
blank). That means `coalesce(var.control_plane_size, var.qa_*)` always
has a non-empty first arg in normal flow; the QA-default fallback only
fires on a zero-override QA call that intentionally leaves the SKU empty.

No change to customer-Sovereign behaviour (qa_fixtures_enabled='false'
branch already used `var.control_plane_size` verbatim).
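
The argument-order change can be modelled in a few lines of Go. This is only a sketch of the selection logic, assuming a `coalesce` helper that returns the first non-empty string (tofu's own coalesce errors when every argument is empty; this helper returns "" instead). The function names are illustrative.

```go
package main

import "fmt"

// coalesce returns the first non-empty argument, mirroring how tofu's
// coalesce skips empty strings. Unlike tofu, it returns "" instead of
// erroring when every argument is empty.
func coalesce(vals ...string) string {
	for _, v := range vals {
		if v != "" {
			return v
		}
	}
	return ""
}

// effectiveCPSizeBefore models Fix #157: the QA default is listed first,
// so it always wins in QA mode because it is never empty.
func effectiveCPSizeBefore(qaMode bool, bodySize, qaDefault string) string {
	if qaMode {
		return coalesce(qaDefault, bodySize)
	}
	return bodySize
}

// effectiveCPSizeAfter models Fix #183: the body-supplied size is listed
// first, and the QA default is only a fallback.
func effectiveCPSizeAfter(qaMode bool, bodySize, qaDefault string) string {
	if qaMode {
		return coalesce(bodySize, qaDefault)
	}
	return bodySize
}

func main() {
	fmt.Println(effectiveCPSizeBefore(true, "cpx42", "cpx32")) // QA default overrides the body
	fmt.Println(effectiveCPSizeAfter(true, "cpx42", "cpx32"))  // body wins
	fmt.Println(effectiveCPSizeAfter(true, "", "cpx32"))       // fallback to QA default
}
```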

Refs: openova-private prov-loop session 2026-05-11 Wave 26

---------

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-11 13:04:41 +04:00
e3mrah
515c3cf38d
fix(infra): pin LB private IPs + revert hel1 zone (Fix #182) (#1385)
* fix(catalyst-ui): delete malformed `import type from react` line (Fix #181)

Fix #180 (PR #1383) merged with a `sed -i` error that produced
`import type  from 'react'` (empty import binding), which is a syntax
error and broke the main build. This PR removes the malformed line
entirely.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(infra): pin LB private IPs + revert hel1 zone (Fix #182)

Root cause of prov #32 FATAL "hcloud/inlineAttachServerToNetwork:
attach server to network: IP not available" on hcloud_server.control_plane[0]:

  hcloud_load_balancer_network.{main,secondary} both attached to the
  shared network WITHOUT an explicit `ip` argument. Hetzner auto-allocates
  the first free IP from the first matching-zone subnet. In the
  multi-region prov #32 the secondary LB-network (hel1) completed first
  at t+16s and took 10.0.1.2 from the only eu-central subnet existing
  at that moment (`main` = 10.0.1.0/24) — stealing the IP the primary
  CP claims explicitly via `ip = "10.0.1.${count.index + 2}"`.

  Fix: pin LB anchors to top-of-subnet (.254) so they live outside the
  CP/worker IP range (.2..N for CPs, .10+ for workers).

Also revert Fix #179 (`hel1 = "eu-north"`). Hetzner /v1/locations API
on 2026-05-11 returns network_zone=eu-central for hel1. Fix #179 caused
prov #32's secondary subnet to fail with `invalid input in field
'network_zone' [network zone does not exist]`. The original prov #29/#30
"IP not available on secondary[hel1-1]" was the same LB-IP collision —
this PR resolves both.

Multi-region apply now lands cleanly:
  10.0.1.2     -> primary CP (cp1)
  10.0.1.254   -> primary LB anchor
  10.0.10.2    -> secondary CP (hel1-1)
  10.0.10.254  -> secondary LB anchor (hel1-1)
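
The host-octet layout behind that table can be expressed as a small sketch. The constants follow the commit text (.2..N for CPs, .10+ for workers, .254 for the LB anchor); the function names and the `layoutFits` overlap check are hypothetical additions for illustration, not part of main.tf.

```go
package main

import "fmt"

// Host-octet layout within each /24 subnet, per the commit message.
const (
	cpBase     = 2   // control planes start at .2
	workerBase = 10  // workers start at .10
	lbHost     = 254 // LB anchor pinned to top of subnet
)

// cpIP returns the private IP of the index-th control plane in a subnet
// given as a three-octet prefix such as "10.0.1".
func cpIP(prefix string, index int) string {
	return fmt.Sprintf("%s.%d", prefix, cpBase+index)
}

// workerIP returns the private IP of the index-th worker.
func workerIP(prefix string, index int) string {
	return fmt.Sprintf("%s.%d", prefix, workerBase+index)
}

// lbIP returns the pinned LB anchor address.
func lbIP(prefix string) string {
	return fmt.Sprintf("%s.%d", prefix, lbHost)
}

// layoutFits reports whether the three ranges stay disjoint for the
// given node counts (an illustrative check, not in the original).
func layoutFits(cpCount, workerCount int) bool {
	return cpBase+cpCount <= workerBase && workerBase+workerCount <= lbHost
}

func main() {
	fmt.Println(cpIP("10.0.1", 0), lbIP("10.0.1"))
	fmt.Println(cpIP("10.0.10", 0), lbIP("10.0.10"))
	fmt.Println(layoutFits(3, 20))
}
```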

Refs: openova-private prov-loop session 2026-05-11 Wave 26

---------

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-11 13:00:50 +04:00
e3mrah
7aa1b24c0d
fix(infra/hetzner): hel1 network_zone is eu-north not eu-central (#179) (#1381)
prov #29 + prov #30 both failed at +90s with:
  Error: hcloud/inlineAttachServerToNetwork: attach server to network:
  IP not available (ip_not_available, ...)
  with hcloud_server.secondary_control_plane["hel1-1"]

Root cause: `local.hetzner_network_zones` hardcoded `hel1 = "eu-central"`.
Helsinki is physically in Hetzner's eu-north zone (Finland), not eu-central
(Falkenstein/Nuremberg). Hetzner subnets are zone-bound: when the secondary
hel1 subnet is created with network_zone=eu-central, the subnet exists but
attaching a server in location=hel1 (physical eu-north) returns
ip_not_available because cross-zone attach isn't supported.

Fix: hel1 -> eu-north. Caught live on prov #29 + #30 (omantel.biz 2-region
fsn1+hel1 reprov, both failed at the same line 872 secondary CP attach).

Per CLAUDE.md ARCHITECT-FIRST: Hetzner publishes zone-region mapping at
https://docs.hetzner.com/cloud/general/locations/; hel1 is unambiguously
listed under eu-north.

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-11 12:26:18 +04:00
e3mrah
8308f53e32
fix(infra/hetzner): auto-flip QA Sovereigns to cpx32/cpx42 nodes (Fix #157) (#1360)
12 of 12 fresh Sovereign provisions in the 2026-05-10 bounded-cycle
session wedged on the production cpx22 CP / cpx32 worker defaults
(memory entry: "provision #5 cpx22 OOM" + handover doc). Root cause:
the CP's documented ~3.5GB k3s+cilium+flux+cert-manager+sealed-secrets
working set leaves zero RAM headroom for Flux source-controller's
~700MB burst during the 44-slot bootstrap-kit apply, while two cpx32
workers (8GB each) cannot satisfy the simultaneous request set from
bp-keycloak (2Gi JVM) + bp-harbor (~2.5Gi across 6 sub-components) +
bp-cnpg primary + bp-openbao 3-replica Raft once the qaFixtures
Continuum + CNPGPair + status-seeder Jobs queue.

Mirrors the Fix #123 pattern (wildcard_cert_use_staging) — auto-flips
ONLY when qa_fixtures_enabled='true'. Customer-facing Sovereigns
(SME / marketplace / admin / console) provision with
qa_fixtures_enabled='false' so coalesce() in main.tf falls back to the
existing cpx22/cpx32 defaults; the production code path is untouched.

  - variables.tf: qa_control_plane_size (default cpx32), qa_worker_size
    (default cpx42) with the same Hetzner SKU regex validation as the
    production size variables.
  - main.tf: locals.qa_mode + locals.effective_cp_size +
    locals.effective_worker_size; hcloud_server.control_plane and .worker
    read the effective locals so QA Sovereigns auto-flip and customer
    Sovereigns plan-clean unchanged.
  - tests/multi_region.tftest.hcl: three new run blocks pin the
    contract — qa_mode=false keeps cpx22/cpx32, qa_mode=true flips
    to cpx32/cpx42 defaults, qa_mode=true respects explicit operator
    overrides (no hardcoded SKU per docs/INVIOLABLE-PRINCIPLES.md #4).

Per principle 17 (isolated worktree) shipped from .claude/worktrees/
qa-node-sizing-157. Per principle 4 (target-state) attacks the
systemic OOM-cascade root cause rather than another per-blueprint
timeout bandaid. Per principle 16 (canonical seam) the SKU choice
lives in variables.tf defaults + per-resource selection in main.tf;
no other path mutates server_type. Per principle 18 no SKU is
hardcoded — every value is operator-overridable.

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-11 10:04:44 +04:00
e3mrah
901afa2a95
fix(infra/hetzner): add skip_region_validation=true to aws provider for Hetzner regions (#135) (#1344)
Fix #133 (PR #1343) swapped aminueza/minio for hashicorp/aws to bypass
DeleteBucketPolicy AccessDenied. Worked for the bucket creation API,
but the aws provider's region validator runs at provider-init time and
rejects Hetzner regions (fsn1/nbg1/hel1) before any S3 call:

    Error: invalid AWS Region: fsn1
    provider["registry.opentofu.org/hashicorp/aws"]

Reproduced on prov #19 (02c23fc20df90629) — failed at `tofu plan`
in 96s. Companion to the existing skip_credentials_validation +
skip_metadata_api_check + skip_requesting_account_id flags that
already disable the other AWS-specific preflight checks the Hetzner
endpoint can't satisfy.

skip_region_validation=true tells the provider not to compare the
region string against AWS's hardcoded region list; the region is
still passed through to the S3 SDK (used as the SigV4 signing region)
which is what Hetzner expects.

Per CLAUDE.md principle 16: same canonical seam as the other skip_*
flags in the same provider block — this is the missing fourth flag in
the standard "non-AWS S3-compatible backend" pattern.

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-11 04:12:50 +04:00
e3mrah
5d43cf7b53
fix(infra/hetzner): swap aminueza/minio for hashicorp/aws to escape AccessDenied wedge (#133) (#1343)
Root cause of provisions #13 / #17 failing in <2 min at `tofu apply`
with:

    [FATAL] [ACL] Unable to create bucket (catalyst-omantel-biz-<id>):
    unable to remove bucket policy: Access Denied.

`aminueza/minio v3.34.0`'s `minio_s3_bucket` Create handler calls
`DeleteBucketPolicy` post-create as part of state normalization (the
provider treats "no policy" as the canonical zero state and forcibly
clears any inherited policy). Hetzner Object Storage's standard
read/write credentials don't grant `s3:DeleteBucketPolicy`, so the
call fails AccessDenied EVERY TIME -- the bucket IS created on
Hetzner's side but tofu marks the resource as failed and rolls back
the apply, blocking every fresh Sovereign provision from reaching
Phase 1. The wedge is deterministic, not flaky.

Provider swap rationale -- `hashicorp/aws` configured against
Hetzner's S3 endpoint speaks vanilla S3 and does NOT do any
post-create policy normalization. A successful CreateBucket is the
terminal state for `aws_s3_bucket` Create. Hetzner officially
documents AWS CLI / SDK as a supported S3 client (see
https://docs.hetzner.com/storage/object-storage/getting-started/using-s3-api-tools/),
so this is the canonical-vendor path, not a workaround.

Changes:
  * `versions.tf` -- drop `aminueza/minio`, add `hashicorp/aws ~> 5.0`
    pointed at `https://<region>.your-objectstorage.com` with
    `s3_use_path_style = true` and the four `skip_*` flags that
    disable AWS-specific preflight calls (STS, IMDS) Hetzner doesn't
    implement.
  * `main.tf` -- `minio_s3_bucket.main` -> `aws_s3_bucket.main`
    (no force_destroy preserved). Add `aws_s3_bucket_acl.main` for
    `private` (the bucket-level acl arg was removed in aws-provider
    5.x). Updated comment block explains the AccessDenied root cause
    inline so future readers don't repeat the journey.
  * `outputs.tf` -- `minio_s3_bucket.main.bucket` ->
    `aws_s3_bucket.main.bucket`.
  * `variables.tf` -- prose-only updates pointing at the new provider
    + the fix-#133 root-cause note.
  * `tests/multi_region.tftest.hcl` -- override_resource swap from
    `minio_s3_bucket.main` to `aws_s3_bucket.main` +
    `aws_s3_bucket_acl.main` so the offline tftest mock path still
    bypasses provider validation.
  * `cloudinit-control-plane.tftpl` -- two comment lines updated to
    reference the new resource name (no behavioural change).
  * `.terraform.lock.hcl` -- removed (regenerated by `tofu init`
    against the new provider set; CI's `tofu init -backend=false`
    step relocks deterministically).

Idempotency / state migration:
  * Fresh-provision-only path -- existing prov state lives in PDM and
    is recycled per provision. New provs: `tofu init` pulls the aws
    provider, `tofu apply` creates `aws_s3_bucket` with the same name
    Hetzner already owns and gets BucketAlreadyOwnedByYou (200, no-op
    in the AWS SDK). Idempotent.
  * Long-lived Sovereigns (sme/marketplace/admin/console -- protected
    per ADR-0001 §9.4) are NOT re-applied; their tofu state is stable.
    No `state mv` runbook is required.

Test plan:
  * `tofu fmt -check -recursive` -- expected pass (manual indent matches
    fmt output).
  * `tofu validate` (CI's infra-hetzner-tofu workflow) -- expected pass.
  * `tofu test` against `tests/multi_region.tftest.hcl` -- expected pass
    on all 5 scenarios (mock_provider for hcloud + override_resource
    for the two new aws resources).
  * `tofu apply` is NOT runnable from this env (no Hetzner creds); CI's
    test-hetzner-e2e workflow exercises the live path on PR merge.

Refs #133.

Co-authored-by: Claude (e3mrah) <noreply@anthropic.com>
2026-05-11 03:59:15 +04:00
e3mrah
90aa2767da
fix(bp-cert-manager-powerdns-webhook,bp-catalyst-platform): staging ClusterIssuer for QA Sovereigns (Fix #123, LE rate-limit bypass) (#1339)
Root cause (qa-loop iter-1 wedge, 2026-05-10):
  Let's Encrypt production hit the 5-certs/168h rate limit on
  *.omantel.biz (retry after 2026-05-11 22:08 UTC). Cilium-envoy
  could not get a wildcard cert -> console.omantel.biz TLS handshake
  failed -> iter-1 Test Executor could not run. Customer Sovereigns
  are unaffected (one cert per registered domain in their lifetime),
  but QA Sovereigns wipe + re-provision dozens of times in a session
  and exhaust the production ceiling within hours.

Fix (target-state, NOT workaround):
  - bp-cert-manager-powerdns-webhook 1.1.0 ships a SECOND ClusterIssuer
    (letsencrypt-dns01-staging-powerdns) alongside the existing
    production one. Same DNS-01 webhook config (same PowerDNS endpoint,
    same API key) -> only the ACME directory URL + account key differ.
    Both ClusterIssuers are real cert-manager resources; LE treats them
    as wholly independent issuers so a rate-limit hit on production
    does NOT block staging issuance.
  - bp-catalyst-platform 1.4.136 adds wildcardCert.useStaging (bool,
    default false). When true, sovereign-wildcard-certs.yaml renders
    Certificate(s) with issuerRef.name pointing at the staging issuer
    instead of production.
  - bootstrap-kit slot 13 wires WILDCARD_CERT_USE_STAGING via envsubst,
    same passthrough pattern as QA_FIXTURES_ENABLED.
  - catalyst-api auto-stamps wildcard_cert_use_staging="true" on QA
    Sovereigns (Request.QATestEnabled=true) so the per-Sovereign
    overlay flips both QA fixtures + staging certs from one wizard
    toggle.
  - tofu var wildcard_cert_use_staging propagates through main.tf
    into the cloudinit postBuild.substitute block on both primary +
    secondary regions.

Result:
  cilium-envoy on a fresh QA Sovereign gets a staging-signed wildcard
  cert in <2min (no production rate limit). curl -sk + Playwright
  (ignoreHTTPSErrors:true) accept the cert; iter-1 Executor can run
  within minutes of provision. Customer Sovereigns
  (QATestEnabled=false) keep getting real-trusted production certs.

Per docs/INVIOLABLE-PRINCIPLES.md #4 (never hardcode): every ACME URL
+ issuer name is values-overridable. Operators wiring a private
staging ACME (e.g. internal Smallstep CA) override via per-Sovereign
overlay without rebuilding any Blueprint. Staging is the documented
LE pattern (https://letsencrypt.org/docs/staging-environment/), not a
band-aid.

_None directly -- infrastructure fix; bypasses Let's Encrypt 5/168h rate limit on QA Sovereigns by using staging ACME endpoint, enabling iter-1 to run within minutes of fresh provision_

Co-authored-by: alierenbaysal <159913086+alierenbaysal@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-11 01:08:07 +04:00
e3mrah
3a5d9fc102
fix(infra,catalyst-api provisioner): tftpl CI guard + bucket-name suffix (Fix #101 followup, Fix #111) (#1331)
Two infrastructure-hardening fixes that together eliminate ~30 min
of provision-cycle waste per regression event documented in Fix #101.

## Fix A — CI guard against unescaped tftpl shell expansion

Adds a grep-based step to .github/workflows/infra-hetzner-tofu.yaml
that scans every infra/hetzner/*.tftpl for unescaped ${VAR:-default}
inside YAML comment lines. Uses PCRE negative-lookbehind so correctly
escaped $${VAR:-default} (templatefile() literal-dollar) does not
trip the guard.

Background: PR #1311 (Fix #73) added a YAML comment with bare
${QA_FIXTURES_ENABLED:-false}. tofu's templatefile() parses ALL
${...} sequences regardless of YAML/HCL/shell context; the colon
in the interpolation hits HCL's reserved conditional grammar and
crashes 'tofu plan' with "Template interpolation doesn't expect
a colon at this location". Prov #9 (4204f0b0c5e37a80) wasted
~30 min before PR #1328 fixed the one offender. Without the guard,
the next operator who adds a similar comment repeats the incident.

Documented in infra/hetzner/README.md so editors learn the $$
escape pattern before they trip the CI gate.

## Fix B — bucket-name suffix to escape global Hetzner namespace

Hetzner Object Storage bucket names share a GLOBAL namespace
across every tenant. The previous BucketNameForSovereign(fqdn)
derivation 'catalyst-<fqdn-with-dashes>' would collide on the
second CreateDeployment for the same FQDN (re-provision after
wipe, two operators on adjacent pools, race conditions) and the
second 'tofu apply' would fail with BucketAlreadyExists.

Change BucketNameForSovereign signature to (fqdn, deploymentID)
and append the first 8 chars of the deployment-id as a suffix:

  catalyst-omantel-omani-works-b3b837a2

newID() already returns 16 random hex chars; the leading 8 chars are
32 bits of fresh entropy, enough to make a collision between
provisions of the same FQDN negligible in practice. Backward-compat:
empty deploymentID (legacy on-disk
records) falls back to first-8-hex of sha256(fqdn) so wipes of
pre-Fix-111 Sovereigns remain deterministic.
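
A rough sketch of the derivation described above. It is not the code in hetzner/buckets.go: the lowercase normalisation, the hex check, and the exact fallback input are assumptions, and the deployment ID in `main` is made up to reproduce the example name.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// isHex reports whether s consists only of lowercase hex digits.
func isHex(s string) bool {
	for _, r := range s {
		if !(r >= '0' && r <= '9') && !(r >= 'a' && r <= 'f') {
			return false
		}
	}
	return s != ""
}

// bucketSuffix returns the first 8 chars of the deployment ID when they
// are hex; otherwise it falls back to the first 8 hex chars of
// sha256(fqdn), so legacy records still map to a deterministic name.
func bucketSuffix(fqdn, deploymentID string) string {
	id := strings.ToLower(deploymentID)
	if len(id) >= 8 && isHex(id[:8]) {
		return id[:8]
	}
	sum := sha256.Sum256([]byte(strings.ToLower(fqdn)))
	return hex.EncodeToString(sum[:])[:8]
}

// bucketNameForSovereign builds catalyst-<fqdn-with-dashes>-<suffix>.
func bucketNameForSovereign(fqdn, deploymentID string) string {
	slug := strings.ReplaceAll(strings.ToLower(fqdn), ".", "-")
	return "catalyst-" + slug + "-" + bucketSuffix(fqdn, deploymentID)
}

func main() {
	// Hypothetical 16-hex deployment ID chosen to match the example name.
	fmt.Println(bucketNameForSovereign("omantel.omani.works", "b3b837a2c0ffee00"))
	// Legacy record without a deployment ID uses the sha256 fallback.
	fmt.Println(bucketNameForSovereign("omantel.omani.works", ""))
}
```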

Call-sites updated:
  - handler/deployments.go: id := newID() moved before
    bucket-name derivation; uses hetzner.BucketNameForSovereign
  - handler/wipe.go: passes dep.ID to PurgeBuckets and to
    BucketNameForSovereign in the report
  - hetzner/buckets.go: PurgeBuckets signature now takes
    deploymentID; bucketSuffix() handles the fallback

Tests:
  - hetzner/buckets_test.go: 6-case TestBucketNameForSovereign
    table covers canonical newID() shape, collision avoidance,
    uppercase normalisation, empty + non-hex fallback paths.
    New TestBucketNameForSovereign_CollisionAvoidance asserts
    the Fix #111 invariant directly.
  - handler/deployments_test.go:
    TestCreateDeployment_DerivesObjectStorageBucketFromFQDN
    now asserts the suffixed shape against the actual dep.ID.
  - All produced names re-validated against the S3 bucket-naming
    rules (mirrored regex from provisioner.s3BucketNamePattern).

## Claimed TCs

_None directly — infrastructure hardening; eliminates 30+ min
wasted per cycle from regressions like PR #1311 + bucket-collision_

## Verification

- go test ./internal/hetzner/... -run "Bucket" → 9/9 PASS
- go test ./internal/handler/ -run "DerivesObjectStorageBucket" → PASS
- go vet ./... → clean
- go build ./... → clean
- yaml.safe_load on workflow → clean
- pre-existing handler-package fails (whoami, continuum-switchover)
  are unrelated and present on origin/main

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 23:31:56 +04:00
e3mrah
0843f02269
fix(infra/hetzner): escape ${VAR:-default} in tftpl comment (PROV-9 BLOCKER) (#1328)
PR #1311 (Fix #73) added a YAML comment in cloudinit-control-plane.tftpl
line 933 that referenced the envsubst placeholder
${QA_FIXTURES_ENABLED:-false}. tofu's templatefile() parses ALL ${...}
sequences regardless of YAML/HCL/shell context, and the colon inside
the interpolation makes it choke with:

  Extra characters after interpolation expression; Template
  interpolation doesn't expect a colon at this location.

Result: every prov-* attempt since #1311 merged has failed `tofu plan`
with exit code 1 in ~2 seconds. Prov #9 (4204f0b0c5e37a80) failed at
18:51 UTC with this error before any Hetzner resource was created.

Fix: change ${QA_FIXTURES_ENABLED:-false} to $${QA_FIXTURES_ENABLED:-false}
(HCL escape: $$ renders as a literal $ in the cloud-init output, which
envsubst then interprets at apply time). Same precedent: commit 7e5c4375
"escape $ in tftpl comments referencing envsubst placeholders".

This is a 1-char fix on a comment. No runtime behavior change. Unblocks
the qa-loop bounded-provision-cycle.

Refs Fix #98, Fix #95, Fix #73 (regression).

Co-authored-by: e3mrah <alierenbaysal@gmail.com>
2026-05-10 22:53:49 +04:00
e3mrah
b22975cb4b
fix(catalyst-api provisioner): qaTestEnabled flag auto-sets QA_FIXTURES_ENABLED for QA Sovereigns (qa-loop bounded-cycle Fix #73) (#1311)
Provision #7 came up zero-touch but the bp-catalyst-platform qaFixtures
stack stayed off because the chart template defaults to
${QA_FIXTURES_ENABLED:-false} and the catalyst-api provisioner never
threaded the toggle. Result: ~140 of the qa-loop matrix's TCs were
inherently fixture-blocked on every QA Sovereign.

Canonical seam: provisioner.Request struct. New fields:

  - QATestEnabled       bool   `json:"qaTestEnabled"`            (default false)
  - QAFixturesNamespace string `json:"qaFixturesNamespace,...`   (default derived)
  - QAOrganization      string `json:"qaOrganization,...`        (default derived)

When QATestEnabled=true, writeTfvars emits
qa_fixtures_enabled="true" + qa_test_session_enabled="true" plus
qa_fixtures_namespace + qa_organization derived from
SovereignFQDN's first label per docs/INVIOLABLE-PRINCIPLES.md #4
(never hardcode):

  omantel.biz       -> qa-omantel       / omantel-platform
  qa.example.com    -> qa-qa            / qa-platform
  demo.openova.io   -> qa-demo          / demo-platform
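
The mapping above can be reproduced with a short sketch. The helper names echo those listed under Wiring, but the bodies here are assumptions (lower-casing, trimming, and the `orDefault` override helper are illustrative), not the code in provisioner.go.

```go
package main

import (
	"fmt"
	"strings"
)

// firstFQDNLabel returns the left-most DNS label of an FQDN.
func firstFQDNLabel(fqdn string) string {
	label, _, _ := strings.Cut(strings.ToLower(strings.TrimSpace(fqdn)), ".")
	return label
}

// deriveQAFixturesNamespace builds qa-<first-label>.
func deriveQAFixturesNamespace(fqdn string) string {
	return "qa-" + firstFQDNLabel(fqdn)
}

// deriveQAOrganization builds <first-label>-platform.
func deriveQAOrganization(fqdn string) string {
	return firstFQDNLabel(fqdn) + "-platform"
}

// orDefault lets an operator-supplied value win over the derived one
// (hypothetical helper mirroring the operator-override-wins test).
func orDefault(override, derived string) string {
	if override != "" {
		return override
	}
	return derived
}

func main() {
	for _, fqdn := range []string{"omantel.biz", "qa.example.com", "demo.openova.io"} {
		fmt.Printf("%-16s -> %s / %s\n", fqdn,
			deriveQAFixturesNamespace(fqdn), deriveQAOrganization(fqdn))
	}
	fmt.Println(orDefault("custom-ns", deriveQAFixturesNamespace("omantel.biz")))
}
```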

Customer Sovereigns provision with QATestEnabled=false (default) -> no
qa-fixture artifacts on production tenants.

Wiring:
  1. internal/provisioner/provisioner.go  Request struct + writeTfvars()
     + deriveQAFixturesNamespace + deriveQAOrganization + firstFQDNLabel
  2. infra/hetzner/variables.tf           4 new tofu vars (string,
                                          true|false validated)
  3. infra/hetzner/cloudinit-control-plane.tftpl
                                          QA_FIXTURES_ENABLED /
                                          QA_TEST_SESSION_ENABLED /
                                          QA_FIXTURES_NAMESPACE /
                                          QA_ORGANIZATION substitute
                                          envvars on bootstrap-kit
                                          Kustomization
  4. infra/hetzner/main.tf                pass new vars into both
                                          templatefile invocations
                                          (primary + per-secondary-region)
  5. internal/provisioner/provisioner_test.go
                                          3 new tests:
                                          - default-disabled invariant
                                          - enabled derivation matrix
                                          - operator-override-wins

QA Sovereign provision command (catalyst-api):

  POST /api/v1/deployments
  {
    "sovereignFQDN": "omantel.biz",
    "qaTestEnabled": true,
    ...
  }

Verified:
  go test ./products/catalyst/bootstrap/api/internal/provisioner/...
  ok  (0.019s)

Co-authored-by: hatiyildiz <hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 21:08:35 +04:00
e3mrah
fcfed6408c
feat(infra,cilium): wire Cilium ClusterMesh anchors via tofu→cloudinit→envsubst (#1101) (#1226)
* feat(infra,cilium): wire Cilium ClusterMesh anchors via tofu→cloudinit→envsubst (#1101)

Follow-up to #1223. The Flux Kustomization on every Sovereign points
at clusters/_template/bootstrap-kit/ and post-build-substitutes per-
Sovereign vars (SOVEREIGN_FQDN, MARKETPLACE_ENABLED, ...). The
per-Sovereign overlay file at clusters/<sov>/bootstrap-kit/01-cilium.yaml
that #1223 added is therefore dead code (Flux doesn't read that
path). The canonical mechanism is to extend the template with
envsubst placeholders + thread the values through tofu vars.

Wires five layers end-to-end:

1. clusters/_template/bootstrap-kit/01-cilium.yaml — adds
   `cluster.name: ${CLUSTER_MESH_NAME:=}` and
   `cluster.id: ${CLUSTER_MESH_ID:=0}` plus
   `clustermesh.useAPIServer: true` + NodePort 32379. Empty defaults
   = single-cluster Sovereign (no peer connects); the cilium subchart
   accepts empty cluster.name when id=0.

2. infra/hetzner/cloudinit-control-plane.tftpl — adds
   CLUSTER_MESH_NAME / CLUSTER_MESH_ID to the bootstrap-kit
   Kustomization's postBuild.substitute block (alongside
   SOVEREIGN_FQDN, MARKETPLACE_ENABLED, PARENT_DOMAINS_YAML).

3. infra/hetzner/variables.tf — declares cluster_mesh_name (string,
   default "") and cluster_mesh_id (number, default 0, validated 0-255).

4. infra/hetzner/main.tf — primary cloud-init passes
   var.cluster_mesh_{name,id} verbatim. Secondary regions (when
   var.regions[i>0] is non-empty per slice G3) auto-derive each
   peer's name as `<sovereign-stem>-<region-code-no-digits>` and
   increment id from var.cluster_mesh_id+1. Per-region override via
   the new RegionSpec.ClusterMeshName field.

5. products/catalyst/bootstrap/api/internal/provisioner/provisioner.go
   — adds ClusterMeshName + ClusterMeshID to Request and threads them
   into writeTfvars(); RegionSpec gains ClusterMeshName for per-peer
   override.
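
The secondary-region derivation in item 4 might look like the following in Go terms. This is a loose model of the HCL locals, not a transcription: what counts as the "sovereign stem" is not spelled out in this message, so it is passed in as a plain argument, and the type and function names are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// regionSpec is a cut-down stand-in for RegionSpec.
type regionSpec struct {
	Code            string // e.g. "hel1"
	ClusterMeshName string // optional per-region override
}

// stripDigits removes decimal digits, so "hel1" becomes "hel".
func stripDigits(s string) string {
	return strings.Map(func(r rune) rune {
		if unicode.IsDigit(r) {
			return -1
		}
		return r
	}, s)
}

// secondaryMeshName applies the per-region override first, yields ""
// when the Sovereign is not in a mesh (primaryName empty), and otherwise
// derives <stem>-<region-code-no-digits>. stem is an assumed input.
func secondaryMeshName(stem, primaryName string, r regionSpec) string {
	if r.ClusterMeshName != "" {
		return r.ClusterMeshName
	}
	if primaryName == "" {
		return ""
	}
	return stem + "-" + stripDigits(r.Code)
}

// secondaryMeshID numbers secondaries upward from primaryID+1;
// i is the zero-based index among secondary regions.
func secondaryMeshID(primaryID, i int) int {
	return primaryID + 1 + i
}

func main() {
	hel := regionSpec{Code: "hel1"}
	fmt.Println(secondaryMeshName("omantel", "omantel-fsn", hel), secondaryMeshID(1, 0))
	fmt.Printf("%q\n", secondaryMeshName("omantel", "", hel)) // not in a mesh
}
```

The 0-255 range validation on cluster_mesh_id stays in variables.tf and is not repeated here.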

Per docs/INVIOLABLE-PRINCIPLES.md #4 (never hardcode), the chart-side
default is intentionally empty — operator request OR per-Sovereign
overlay must supply the values when ClusterMesh is enabled. The
allocation registry lives at docs/CLUSTERMESH-CLUSTER-IDS.md
(introduced in #1223).

Refs: #1101 (EPIC-6), qa-loop iter-6 fix-33 follow-up to #1223

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(infra): escape $ in tftpl comments referencing envsubst placeholders

`tofu validate` reads `${CLUSTER_MESH_NAME}` inside YAML comments as a
template variable reference; the comment was meant to refer to the Flux
envsubst placeholder consumed downstream by the bootstrap-kit cilium
HelmRelease. Escaped both refs with `$$` per Terraform's templatefile
escape syntax so the comment renders verbatim.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(infra): replace coalesce with conditional in secondary_region_cluster_mesh_name

coalesce errors when every arg is empty (the not-in-mesh path). Switch
to a conditional that yields '' when both the per-region override AND
var.cluster_mesh_name are empty.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 00:19:53 +04:00
e3mrah
7ca4abddd2
feat(continuum): K-Cont-4 — Cloudflare Worker source + tofu wiring for lease witness (#1101) (#1159)
* feat(continuum): K-Cont-4 — Cloudflare Worker source + tofu wiring for lease witness (#1101)

Implements the server side of the Cloudflare KV lease-witness pattern
that K-Cont-3's CFKVClient (in core/controllers/continuum/internal/
witness/cloudflarekv/) speaks to. The Worker fronts a Cloudflare
Workers KV namespace with read-then-CAS-write semantics enforced via
the If-Match header — exact contract per K-Cont-3 #1158 report (item d)
and the canonical-seams "Cloudflare KV Worker contract" entry.

Routes:
  GET    /lease/<slot-url-encoded>  → 200 + LeaseState | 404 | 401
  PUT    /lease/<slot>              → 200 + LeaseState | 412 + state | 401
  DELETE /lease/<slot>              → 204 | 412 | 401

All 7 K-Cont-3 trap behaviors verified by 46 vitest tests:
  1. If-Match: 0 = first-acquire-on-empty-slot
  2. Generation increments unconditionally (incl. Release)
  3. 412 includes current state body
  4. TTL eviction is server-authoritative in stamping (Worker doesn't
     auto-evict — controller's IsHeldBy decides)
  5. X-Holder mismatch on DELETE returns 412 (stale region can't
     evict new primary)
  6. Bearer token validation against env-bound allow-list
  7. Optional X-Lease-Slot header logged for KV granularity
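
For readers who want the compare-and-swap rules in one place, here is an in-memory Python model of traps 1, 2, 3 and 5. It is a sketch under assumptions, not the Worker source: the class and field names are hypothetical, auth (401) and TTL stamping are omitted, and keeping a holder-less tombstone after Release is only one possible reading of "Generation increments unconditionally".

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class LeaseState:
    holder: str       # "" marks a released slot (assumed tombstone shape)
    generation: int


class LeaseStore:
    """In-memory stand-in for the KV namespace (hypothetical)."""

    def __init__(self) -> None:
        self._slots: Dict[str, LeaseState] = {}

    def put(self, slot: str, holder: str, if_match: int) -> Tuple[int, Optional[LeaseState]]:
        current = self._slots.get(slot)
        current_gen = current.generation if current else 0  # 0 = empty slot
        if if_match != current_gen:
            return 412, current  # 412 carries the current state
        new_state = LeaseState(holder, current_gen + 1)
        self._slots[slot] = new_state
        return 200, new_state

    def delete(self, slot: str, holder: str, if_match: int) -> Tuple[int, Optional[LeaseState]]:
        current = self._slots.get(slot)
        if current is None or current.generation != if_match:
            return 412, current
        if current.holder != holder:
            return 412, current  # a stale region cannot evict the new primary
        # Assumption: Release bumps the generation and leaves a tombstone.
        self._slots[slot] = LeaseState("", current.generation + 1)
        return 204, None


store = LeaseStore()
print(store.put("primary", "fsn", if_match=0))
print(store.put("primary", "hel", if_match=0))
```

The second call loses the race and gets 412 together with the state written by the first call.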

Files:
  products/continuum/cloudflare-worker/{package.json, tsconfig.json,
    wrangler.toml, vitest.config.ts, .eslintrc.cjs, .gitignore,
    DESIGN.md, src/{index,auth,kv,types}.ts,
    src/handlers/{get,put,delete}.ts,
    test/{handlers,contract,env.d}.ts}
  infra/cloudflare-worker-leases/{versions,variables,main,outputs}.tf
    + README.md
  .github/workflows/cloudflare-worker-leases-build.yaml
    (event-driven, NO cron — push-on-paths + PR + workflow_dispatch)

Tests: 46/46 vitest pass (handlers 37 + contract 9). ESLint clean.
tsc --noEmit clean. wrangler deploy --dry-run produces 9.47 KiB
bundle.

Per the brief: tofu module ships ready for operator action — no
auto-deploy. Operator runbook in DESIGN.md §"Operator runbook —
deploy a new Sovereign".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(continuum/cf-worker-tofu): K-Cont-4 — adopt CF v5 inline secret_text binding (was v4 separate resource)

`tofu validate` failed on `cloudflare_workers_secret` — that resource
was REMOVED in cloudflare/cloudflare v5 (it consolidated into the
inline `bindings = [...]` array on `cloudflare_workers_script` with
`type = "secret_text"`). Same security guarantee — encrypted at rest
in CF, never visible via dashboard read API once written. `tofu fmt`
also wanted versions.tf alignment + the .terraform.lock.hcl pinning
the resolved cloudflare/cloudflare v5.19.1 (mirrors infra/hetzner/
which commits its lock file).

Per Inviolable Principle #5 the bearer token value still flows from
TF_VAR_bearer_tokens_csv extracted at apply time from a K8s
SealedSecret — never inlined here.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: hatiyildiz <hati.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 08:01:44 +04:00
e3mrah
8988cd9e4f
feat(infra-hetzner): wire all var.regions[] entries end-to-end (slice G1, #1095) (#1131)
Slice G1 of EPIC-0 (#1095, Group G "Multi-cluster substrate"). Today
infra/hetzner/main.tf only realises regions[0] end-to-end — every wizard
payload's regions[1..N] entries silently no-op. EPIC-6 (#1101) Continuum
DR demo needs 3 regions (mgmt + fsn + hel per docs/EPICS-1-6-unified-design.md
§3.8 + §11), so this slice closes the gap.

Architecture: hybrid singular-path + secondary-region overlay.
- The legacy singular path (var.region + count = local.control_plane_count)
  STAYS untouched — every existing Sovereign state (omantel, otech*) keeps
  its resource addresses (hcloud_server.control_plane[0],
  hcloud_load_balancer.main, etc) and produces a no-op plan diff.
- New regions (regions[1+]) are realised via a parallel for_each set keyed
  by "{cloudRegion}-{index}" (e.g. fsn1-1, hel1-2). Each secondary region
  gets its own /24 subnet inside the shared /16 hcloud_network, its own
  CP server, its own workers, and its own lb11 load balancer. The
  hcloud_firewall + hcloud_ssh_key stay shared (one tenant boundary per
  Sovereign).
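
The key derivation can be sketched in Python as below. This is an illustration, not the HCL in main.tf: the `provider` field name and the function name are assumptions, and using the original list position as the index is inferred from the fsn1-1 / hel1-2 examples.

```python
from typing import Dict, List


def secondary_region_keys(regions: List[dict]) -> Dict[str, dict]:
    """Map regions[1+] Hetzner entries to "{cloudRegion}-{index}" keys."""
    keyed: Dict[str, dict] = {}
    for index, region in enumerate(regions):
        if index == 0:
            continue  # regions[0] stays on the legacy singular path
        if region.get("provider") != "hetzner":
            continue  # non-Hetzner entries (e.g. oci) are skipped
        keyed[f"{region['cloudRegion']}-{index}"] = region
    return keyed


payload = [
    {"provider": "hetzner", "cloudRegion": "nbg1"},
    {"provider": "hetzner", "cloudRegion": "fsn1"},
    {"provider": "hetzner", "cloudRegion": "hel1"},
]
print(list(secondary_region_keys(payload)))
```

Because the index is part of the key, two entries in the same cloud region still get distinct addresses.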

Why hybrid not full for_each: a wholesale refactor would change every
existing resource address (hcloud_server.control_plane[0] →
hcloud_server.control_plane["mgmt"]), forcing every running Sovereign
to run `tofu state mv` for ~12 resources or face destructive recreates.
The brief explicitly bans that. Hybrid is purely additive — secondary
resources are NEW addresses no existing state carries.

No `tofu state mv` runbook required. Existing Sovereigns provisioned
with var.regions = [] or len(var.regions) == 1 produce identical plans
before and after this PR.

Slice G3 (out of scope here) wires Cilium ClusterMesh between secondary
regions and adds per-cluster GitOps path differentiation; today every
secondary CP renders an identical Flux Kustomization pointed at
clusters/<sovereign_fqdn>/.

Tests: tests/multi_region.tftest.hcl exercises 5 scenarios offline via
mock_provider + override_resource (no real Hetzner):
  - legacy_no_regions_payload (var.regions=[])
  - single_region_entry_does_not_double_provision (len==1)
  - three_region_mgmt_fsn_hel (EPIC-6 shape)
  - same_region_duplicates_produce_distinct_keys
  - non_hetzner_regions_are_filtered_out (oci entries skipped)
All 5 pass. CI workflow infra-hetzner-tofu.yaml runs validate + fmt -check
+ test on every PR touching infra/hetzner/**.

Per CLAUDE.md "every workflow MUST be event-driven, NEVER scheduled":
push-on-merge + pull-request-on-touch + workflow_dispatch only. No cron.

Validation:
  $ tofu validate
  Success! The configuration is valid.
  $ tofu fmt -check -recursive
  exit=0
  $ tofu test
  tests/multi_region.tftest.hcl... pass
    run "legacy_no_regions_payload"... pass
    run "single_region_entry_does_not_double_provision"... pass
    run "three_region_mgmt_fsn_hel"... pass
    run "same_region_duplicates_produce_distinct_keys"... pass
    run "non_hetzner_regions_are_filtered_out"... pass
  Success! 5 passed, 0 failed.

Co-authored-by: hatiyildiz <hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 00:29:44 +04:00
e3mrah
8e312cd244
fix(infra/hetzner): strip any-indent comments, gate user_data ≤ 30 KiB at plan-time (#966) (#967)
Live blocker. Provisioning otech114 (deployment 5c3eea37d3aacda6, fsn1)
failed at `tofu apply` with:

  Error: invalid input in field 'user_data' (invalid_input):
  [user_data => [Length must be between 0 and 32768.]]
  with hcloud_server.control_plane[0]
  on main.tf line 309

Hetzner Cloud's HARD 32 KiB cap on user_data was breached after #921
inlined a base64-encoded worker cloud-init (~4.8 KB) into the CP cloud-
init for cluster-autoscaler's HCLOUD_CLOUD_INIT key, on top of #827's
multi-domain substitutions. Rendered size: ~37 KB.

Root cause: the prior strip regex `(?m)^[ ]{0,2}# .*\n` was scoped to
indent-0/2 comments only — leaving ~14 KB of indent-6+ comments INSIDE
write_files content blocks (e.g. flux-bootstrap.yaml's triplicate
Kustomization documentation). Those comments are inert: every write_files
entry is YAML / JSON / key=value config (no shell scripts), and parsers
ignore `#`-prefixed lines entirely.

Changes:

1. New strip regex `(?m)^[ ]*#( |$).*\n` strips ANY-indent comment lines
   that start with `#` followed by space or EOL. Preserves:
   - `#cloud-config` line 1 (no space after `#`)
   - `#!`-shebangs (no space after `#`)
   - `#pragma`-style directives (`#` followed by non-space non-EOL)
   Applied to both `local.control_plane_cloud_init` and
   `local.worker_cloud_init`.

2. Plan-time guardrail via `lifecycle.precondition` on
   `hcloud_server.control_plane` and `hcloud_server.worker`. Fails plan
   (not apply) when `length(local.<*>_cloud_init) > 30720` bytes (30 KiB
   = 32 KiB hard cap minus a 2 KiB future-additions buffer). Future
   bloat-creep that silently re-eats the headroom now fails fast at
   plan-time BEFORE the network/LB/firewall/SSH-key resources get created.
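
Both changes can be reproduced offline. The sketch below applies the same regex with Python's `re` module and mirrors the 30720-byte guardrail; the function names and the sample cloud-init are illustrative assumptions, not the rendered otech114 payload.

```python
import re

# Same pattern as the new strip regex: any-indent "#" followed by space or EOL.
COMMENT_LINE = re.compile(r"(?m)^[ ]*#( |$).*\n")
GUARDRAIL_BYTES = 30 * 1024  # 30720, below the 32768 hard cap


def strip_comments(rendered: str) -> str:
    """Drop whole-line comments; keep #cloud-config, shebangs, #pragma-style lines."""
    return COMMENT_LINE.sub("", rendered)


def check_user_data(rendered: str) -> int:
    """Raise when the payload exceeds the guardrail, else return its size."""
    size = len(rendered.encode("utf-8"))
    if size > GUARDRAIL_BYTES:
        raise ValueError(f"user_data is {size} bytes, guardrail is {GUARDRAIL_BYTES}")
    return size


sample = (
    "#cloud-config\n"
    "# top-level note\n"
    "write_files:\n"
    "  - path: /etc/demo.conf\n"
    "    content: |\n"
    "      # indent-6 comment inside a content block\n"
    "      key=value\n"
    "      #\n"
    "      #pragma keep\n"
    "runcmd:\n"
    "  - echo done  # trailing comment stays\n"
)
stripped = strip_comments(sample)
print(stripped)
print(check_user_data(stripped))
```

Only lines that consist of a comment are removed; a `#` that follows other content on the same line is untouched.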

Verified rendered sizes (Python simulation of templatefile + strip,
substitutions match real otech114 inputs):

  CP cloud-init:     79404 bytes raw → 21144 bytes stripped
                     (margin: 11624 under hard cap, 9576 under guardrail)
  Worker cloud-init:  3254 bytes raw →  2410 bytes stripped
                     (b64-encoded for HCLOUD_CLOUD_INIT: 3216 bytes)

`#cloud-config` first-line preserved. All 18 write_files entries and
43 runcmd entries parse intact. YAML/JSON/conf contents valid post-strip
(comments are documentation only at the file-format level).

Closes #966

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 17:58:44 +04:00
e3mrah
d1431bed09
fix(autoscaler+wizard): wire HCLOUD_CLOUD_INIT, validate SKU/region in catalyst-api (#965)
Closes #921 — bp-cluster-autoscaler-hcloud chart shipped without
HCLOUD_CLUSTER_CONFIG / HCLOUD_CLOUD_INIT, so cluster-autoscaler 1.32.x
FATALs at startup with "HCLOUD_CLUSTER_CONFIG or HCLOUD_CLOUD_INIT is
not specified" on every Sovereign (otech112 evidence). HelmRelease
reports Ready=True (Helm install succeeded) but the Pod
CrashLoopBackOffs invisibly behind the false-positive condition.

Closes #916 — wizard let operators dispatch unbuildable topologies
(otech109: cpx32 worker in `ash`) because PROVIDER_NODE_SIZES did not
encode regional orderability. Hetzner rejected the worker creation 41s
into `tofu apply` after Phase-0 had already created the CP + network +
LB + firewall.

Chart fix (issue #921):
- Add `clusterAutoscalerHcloud.{clusterConfig,cloudInit}` values to the
  umbrella chart (base64-encoded per upstream contract).
- Render `hetzner-node-config` Secret unconditionally with both keys so
  the upstream Deployment's secretKeyRef references resolve cleanly
  during `helm template` AND in the live cluster regardless of overlay
  state.
- Wire HCLOUD_CLUSTER_CONFIG + HCLOUD_CLOUD_INIT extraEnvSecrets onto
  the upstream chart's deployment.
- Tofu Phase 0 base64-encodes the Phase-0 worker cloud-init and stamps
  it under `flux-system/cloud-credentials.hcloud-cloud-init`; the
  bootstrap-kit overlay lifts that key via Flux `valuesFrom` into
  `clusterAutoscalerHcloud.cloudInit`. Autoscaler-spawned workers thus
  receive the IDENTICAL bootstrap as the Phase-0 worker fleet.
- Bump bp-cluster-autoscaler-hcloud chart 1.0.0 → 1.1.0.
- Chart-test smoke gate (chart/tests/hetzner-node-config.sh) verifies
  Secret + env var wiring + no-regression of HCLOUD_TOKEN — runs in CI's
  blueprint-release "Run chart integration tests" step.

Wizard fix (issue #916):
- Add `availableRegions?: string[]` to NodeSize interface; encode
  cpx32 = ['fsn1','nbg1','hel1'], cpx21/cpx31 = [] (orderable nowhere
  new) per Hetzner /v1/server_types vs POST /v1/servers gap.
- Add `isSkuAvailableInRegion()` + `suggestAlternativeSkus()` helpers.
- StepProvider filters SKU dropdowns by selected region; auto-swaps
  current SKU to recommended default when region change drops it out
  of orderability.
- Mirror the matrix Go-side in sku_availability.go; gate
  `provisioner.Request.Validate()` with same predicate so a stale
  wizard build OR direct API caller bypassing the UI cannot dispatch
  otech109's failure mode.
- Two-sided enforcement covers both r.Regions[] (multi-region) and the
  legacy singular path.
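
A Python rendering of the availability predicate, for illustration only (the real helpers live in the wizard's TypeScript and in sku_availability.go). The matrix values come from this message; treating a SKU with no entry as unrestricted is an assumption based on `availableRegions` being optional, and `hypothetical-sku` is a made-up name.

```python
from typing import Dict, List

AVAILABLE_REGIONS: Dict[str, List[str]] = {
    "cpx32": ["fsn1", "nbg1", "hel1"],
    "cpx21": [],  # orderable nowhere new
    "cpx31": [],
}


def is_sku_available_in_region(sku: str, region: str) -> bool:
    regions = AVAILABLE_REGIONS.get(sku)
    if regions is None:
        return True  # assumption: no entry means no recorded restriction
    return region in regions


def suggest_alternative_skus(region: str, candidates: List[str]) -> List[str]:
    """Keep only the candidates that can be ordered in the given region."""
    return [sku for sku in candidates if is_sku_available_in_region(sku, region)]


# otech109's failure mode: a cpx32 worker in `ash`.
print(is_sku_available_in_region("cpx32", "ash"))
print(suggest_alternative_skus("ash", ["cpx32", "cpx21", "hypothetical-sku"]))
```

Running the same predicate in the UI and in the API is what makes the enforcement two-sided.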

Tests: 13 vitest cases on the wizard side + 38 Go subtests on the API
side. Chart smoke renders + helm template gates the env wiring at
publish time.

Co-authored-by: hatiyildiz <hati.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 16:21:59 +04:00
e3mrah
2ff50f0591
fix(bp-newapi+services-build): imagePullSecrets on Pod, sed bumps values.yaml smeTag (#955)
Two SME-blocker bugs caught live on otech113 (alice signup gate 5 fails on
fresh Sovereign):

#952 — bp-newapi 1.4.0 Pod has no imagePullSecrets, so kubelet pulls
PRIVATE ghcr.io/openova-io/openova/{newapi-mirror,services-metering-sidecar}
anonymously and gets 403 Forbidden. Fix:

- Templatize spec.imagePullSecrets on Deployment + channel-seed Job.
- Default values.yaml `imagePullSecrets: [{name: ghcr-pull}]`.
- Add `newapi` to flux-system/ghcr-pull's reflector
  reflection-{allowed,auto}-namespaces in cloudinit-control-plane.tftpl
  so bp-reflector mirrors the source Secret into the namespace
  automatically on every fresh Sovereign.
- Bump bp-newapi 1.4.0 -> 1.4.1, update _template overlay.

#953 — services-build.yaml's image-rewrite loop only matched the
hardcoded `image: ghcr.io/.../services-<svc>:<sha>` form. 7 of 8
sme-services templates use `image: "{{ ... }}/services-<svc>:{{
.Values.images.smeTag }}"`. Each services-build run bumped only
auth.yaml while reporting "update sme service images to ${SHA}",
leaving the live Pod on stale bytes (PR #951's #941 fix never reached
services-catalog despite the merge + chart bump chain). Fix:

- After the hardcoded loop, also bump `images.smeTag` in
  products/catalyst/chart/values.yaml with a strict regex match
  (`^  smeTag: "<sha>"$`); refuse to auto-bump if the line shape
  changes (defends against silent drift if a contributor renames the
  field).
- Mirror the change into the retry-path `rewrite()` function so a
  reset-to-origin/main retry does not recreate the original bug.
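
The strict-match bump can be sketched as follows. This is a Python illustration rather than the sed call in services-build.yaml; the function name is hypothetical, and matching the tag as lowercase hex is an assumption about what `<sha>` looks like.

```python
import re

# Exactly two spaces of indent, the key, and a double-quoted hex tag.
SME_TAG_LINE = re.compile(r'^  smeTag: "[0-9a-f]+"$', re.MULTILINE)


def bump_sme_tag(values_yaml: str, new_sha: str) -> str:
    """Rewrite images.smeTag, refusing when the line shape has drifted."""
    matches = SME_TAG_LINE.findall(values_yaml)
    if len(matches) != 1:
        raise ValueError(f"expected exactly one smeTag line, found {len(matches)}")
    return SME_TAG_LINE.sub(lambda _m: f'  smeTag: "{new_sha}"', values_yaml)


values = 'images:\n  smeTag: "abc1234"\n'
print(bump_sme_tag(values, "def5678"))
```

Refusing on zero or multiple matches is what turns a renamed field into a loud failure instead of a silent no-op.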

Tests:

- platform/newapi/chart/tests/imagepullsecrets-render.sh — 4 cases
  asserting the Deployment and channel-seed Job carry the default
  ghcr-pull reference, that an empty override suppresses the block,
  and that custom secret names propagate (Inviolable Principle #4).
- tests/integration/services-build-rewrite.sh — 3 cases reproducing
  the workflow's rewrite logic on a sandboxed copy of the live
  chart, asserting both auth.yaml's hardcoded line AND values.yaml's
  smeTag get bumped, that helm-render of the catalyst chart with
  the bumped values produces all 8 SME-service Deployments at the
  new SHA, and that an idempotent re-bump to a second SHA also lands
  cleanly.

Refs: #952 #953 (umbrella #915 — alice signup gate 5).

Co-authored-by: hatiyildiz <143030955+hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 15:47:37 +04:00