fix(provisioner): auto-derive cluster_mesh_name + cluster_mesh_id for multi-region provs (#1502)

* fix(canvas): canonicalise Job.DependsOn entries with install- prefix — fix invisible edges

PR #1499 plumbed spec.dependsOn end-to-end and verified deps populate
on first event (no /refresh-watch needed). But the openova-flow snapshot
composer (flow_snapshot_local.go) emits finish-to-start relationships
where fromId = jobs.JobID(deploymentID, dep). Without the "install-"
prefix on each dep entry, fromId came out as:

  <dep>:hel1-2:seaweedfs                 (secondary, missing "install-")
  <dep>:gitea                            (primary, missing "install-")

But the FlowNode ids in the snapshot are:

  <dep>:install-hel1-2:seaweedfs
  <dep>:install-gitea

The FE canvas adapter matches by exact id → every finish-to-start rel
points at a non-existent node → 224 rels emitted, 0 edges rendered.

Caught on prov t103.omani.works (005080699326a7ac, 2026-05-15):

  curl /v1/flows/.../snapshot → 376 rels total: 152 contains, 224 finish-to-start
  every finish-to-start fromId malformed
  canvas: sibling edges invisible across all 135 install Jobs

Fix in two places:

  internal/handler/phase1_watch.go (spawnSecondaryRegionWatchers emit):
    Region-prefix each dep AND inject the "install-" prefix so
    ev.DependsOn = ["install-<region>:<chart>"] before the bridge
    receives the event. Symmetric with how ev.Component is constructed.

  internal/jobs/helmwatch_bridge.go (OnHelmReleaseEvent):
    Canonicalise every dep entry: if it doesn't already start with
    JobNamePrefix ("install-"), prepend it. Idempotent on entries
    that already are canonical (set by the phase1_watch.go path).
    Covers the primary-region path (bare chart names like "gitea")
    too — Job.DependsOn now stores "install-gitea", which matches
    the composer's emitted FromId exactly.
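
A minimal sketch of the canonicalisation described above, for
illustration only. canonicaliseDep, regionDep and jobID are hypothetical
names; jobID assumes jobs.JobID joins its arguments as
"<deploymentID>:<jobName>", which is inferred from the ids quoted
earlier and not from the actual implementation.

```go
package main

import (
	"fmt"
	"strings"
)

// jobNamePrefix mirrors the "install-" prefix (JobNamePrefix) named in the fix.
const jobNamePrefix = "install-"

// canonicaliseDep prepends the prefix unless the entry already carries it,
// so calling it twice yields the same result as calling it once.
func canonicaliseDep(dep string) string {
	if strings.HasPrefix(dep, jobNamePrefix) {
		return dep
	}
	return jobNamePrefix + dep
}

// regionDep builds the secondary-region form "install-<region>:<chart>".
func regionDep(region, chart string) string {
	return canonicaliseDep(region + ":" + chart)
}

// jobID is an assumed stand-in for jobs.JobID.
func jobID(deploymentID, jobName string) string {
	return deploymentID + ":" + jobName
}

func main() {
	fmt.Println(jobID("dep", canonicaliseDep("gitea")))         // primary path
	fmt.Println(jobID("dep", regionDep("hel1-2", "seaweedfs"))) // secondary path
	fmt.Println(canonicaliseDep("install-gitea"))               // already canonical
}
```

With both paths canonicalised, the fromId built from a dep entry has the
same shape as the FlowNode id, so the exact-id match in the canvas
adapter can succeed.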

Tests: go build ./... + go test on internal/jobs + helmwatch + provisioner
all green. (Pre-existing TestHandleWhoami_* flake in handler is unrelated.)

* fix(canvas): canonicalise resolved DependsOn too — kill malformed prior values

Follow-up to PR #1500. The canon block ran on the event-carried dependsOn
arg, but the 3-tier resolve preferred existing-store value when non-empty
— which for any Job written BEFORE PR #1500 rolled out was malformed
(no "install-" prefix). t103.omani.works snapshot kept emitting 224
finish-to-start rels with malformed fromIds because the existing Job
rows held "hel1-2:gitea" entries that the resolve preserved verbatim.

Fix: after the 3-tier resolve, run a final canonicalisation pass on
resolvedDeps so every persisted entry is canonical regardless of
whether it came from the event-carried value (already canonical from
the earlier block) or from the existing store (potentially malformed
legacy).

Note: this fix only takes effect on the NEXT HR state transition for a
given Job. HRs already in terminal state (e.g. t103's 135 succeeded HRs)
will keep their malformed deps until a new event fires. The loop's next
cycle (t104+) writes canonical from event 1.
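
The ordering matters more than the helper itself, so here is a rough
sketch of it. resolveDeps is a simplified, hypothetical two-source
stand-in for the 3-tier resolve (the tiers are not spelled out here);
it only models "stored value wins when non-empty".

```go
package main

import (
	"fmt"
	"strings"
)

const jobNamePrefix = "install-"

// canonicaliseAll returns a copy of deps with every entry prefixed.
func canonicaliseAll(deps []string) []string {
	out := make([]string, 0, len(deps))
	for _, d := range deps {
		if !strings.HasPrefix(d, jobNamePrefix) {
			d = jobNamePrefix + d
		}
		out = append(out, d)
	}
	return out
}

// resolveDeps is an assumed simplification: the stored value is preferred
// whenever it is non-empty, otherwise the event-carried value is used.
func resolveDeps(stored, fromEvent []string) []string {
	if len(stored) > 0 {
		return stored
	}
	return fromEvent
}

func main() {
	stored := []string{"hel1-2:gitea"}                     // legacy row, no prefix
	fromEvent := canonicaliseAll([]string{"hel1-2:gitea"}) // canonical before resolve

	// Canonicalising only the event value is not enough: the stored value wins.
	fmt.Println(resolveDeps(stored, fromEvent))

	// Canonicalising the resolved value repairs the legacy entry as well.
	fmt.Println(canonicaliseAll(resolveDeps(stored, fromEvent)))
}
```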

* fix(provisioner): auto-derive cluster_mesh_name + cluster_mesh_id for multi-region provs

Caught on prov t104.omani.works (98395b3d9bd9c1aa, 2026-05-15): operator
submitted a multi-region body (3 regions cpx52) but omitted
ClusterMeshName/ClusterMeshID. catalyst-api defaulted them to "" and 0.
Tofu wrote cluster_mesh_name="" + cluster_mesh_id=0 to tfvars. Flux
postBuild.substitute rendered cilium-config with cluster.name=default +
cluster.id=0. Cilium kvstoremesh refused to start:
  "ClusterID 0 is reserved"
clustermesh-apiserver CrashLoopBackOff 16 restarts. No mesh ever formed.
Cross-region observability + east-west routing permanently broken.

Auto-derivation:

  ClusterMeshName: <first-fqdn-label>-mesh
    e.g. t105.omani.works → "t105-mesh"

  ClusterMeshID:  (sha256(deploymentID)[:4] as uint32) mod 252 + 1
    Range [1, 252]; main.tf increments for secondaries, so with up to
    3 regions the max id any region sees is primary + (regions - 1)
    ≤ 252 + 2 = 254. ID 255 is intentionally avoided (Cilium sentinel).

Operator override still respected: each field is auto-derived only when
it is zero/empty AND len(Regions) > 1. Single-region provs stay at
"" / 0 (no mesh needed).
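
The name rule and its override can be sketched as follows. meshName
and firstLabel are hypothetical names; firstLabel stands in for the
existing firstFQDNLabel helper, whose body is not shown in this commit,
and is assumed to return everything before the first dot.

```go
package main

import (
	"fmt"
	"strings"
)

// firstLabel is an assumed stand-in for firstFQDNLabel.
func firstLabel(fqdn string) string {
	label, _, _ := strings.Cut(strings.TrimSpace(fqdn), ".")
	return label
}

// meshName applies the rules in order: explicit override, then the
// single-region short-circuit, then "<first-fqdn-label>-mesh".
func meshName(override, fqdn string, regions int) string {
	if s := strings.TrimSpace(override); s != "" {
		return s
	}
	if regions <= 1 {
		return ""
	}
	label := firstLabel(fqdn)
	if label == "" {
		return ""
	}
	return label + "-mesh"
}

func main() {
	fmt.Printf("%q\n", meshName("", "t105.omani.works", 3))       // derived
	fmt.Printf("%q\n", meshName("custom", "t105.omani.works", 3)) // override kept
	fmt.Printf("%q\n", meshName("", "t105.omani.works", 1))       // single region
}
```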

Tested derive helpers against the last 4 prov IDs — all land in valid
range:
  98395b3d9bd9c1aa → 74 (secondaries 75, 76)
  005080699326a7ac → 29 (secondaries 30, 31)
  22af2b1120158239 → 139
  c9df5eed1c1ba6cf → 180
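
A range check along these lines can be reproduced with the sketch
below. primaryMeshID follows the formula above; regionMeshIDs is a
hypothetical stand-in for the main.tf increment (primary, primary+1,
primary+2, ...), and 3 regions is assumed for every prov ID. The sketch
only asserts the bounds and makes no claim about the exact values.

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
)

// primaryMeshID hashes the deployment ID and reduces the first four
// bytes (big-endian uint32) to the range [1, 252].
func primaryMeshID(deploymentID string) int {
	sum := sha256.Sum256([]byte(deploymentID))
	return int(binary.BigEndian.Uint32(sum[:4])%252) + 1
}

// regionMeshIDs is an assumed model of the main.tf increment: the
// primary keeps its id and each secondary gets the next integer.
func regionMeshIDs(primary, regions int) []int {
	ids := make([]int, regions)
	for i := range ids {
		ids[i] = primary + i
	}
	return ids
}

func main() {
	provs := []string{
		"98395b3d9bd9c1aa",
		"005080699326a7ac",
		"22af2b1120158239",
		"c9df5eed1c1ba6cf",
	}
	for _, id := range provs {
		ids := regionMeshIDs(primaryMeshID(id), 3)
		fmt.Printf("%s -> %v (max %d)\n", id, ids, ids[len(ids)-1])
	}
}
```

Because the reduction happens before the increment, the 254 ceiling only
holds for up to three regions; a larger region count would need a
smaller modulus.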

Build + provisioner unit tests green.

---------

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
e3mrah 2026-05-15 19:13:35 +04:00 committed by GitHub
parent aa8c6dc391
commit 4465cd0d27

@ -31,6 +31,8 @@ package provisioner
import (
"bufio"
"context"
"crypto/sha256"
"encoding/binary"
"encoding/json"
"errors"
"fmt"
@ -1247,8 +1249,20 @@ func writeTfvars(deployDir string, req Request) error {
// Cilium ClusterMesh per-Sovereign peer anchors (#1101 EPIC-6).
// Empty + 0 = not in a mesh. Tofu validates id ∈ [0, 255].
"cluster_mesh_name": req.ClusterMeshName,
"cluster_mesh_id": req.ClusterMeshID,
//
// Auto-derivation for zero-touch multi-region provs: when the
// operator omits ClusterMeshName/ClusterMeshID AND len(Regions)>1,
// derive both deterministically so the mesh comes up by default.
// Without this, every multi-region prov lands with cluster.id=0
// and Cilium kvstoremesh refuses to start: "ClusterID 0 is
// reserved". Operator may still override; each field is auto-derived
// only when it is zero/empty (the omit-from-POST default).
// Caught on prov t104.omani.works (98395b3d9bd9c1aa, 2026-05-15):
// all 3 regions had cluster.id=0, clustermesh-apiserver
// CrashLoopBackOff 16 restarts, no inter-region mesh ever
// formed.
"cluster_mesh_name": deriveClusterMeshName(req),
"cluster_mesh_id": deriveClusterMeshID(req),
// Hetzner — token gets baked into the state file unless the operator
// configures a remote backend with encryption-at-rest. Per Catalyst
@ -1865,3 +1879,69 @@ func firstFQDNLabel(fqdn string) string {
// operation; kept as a non-panicking fallback for unit tests.
return s
}
// deriveClusterMeshName returns the canonical Cilium ClusterMesh name
// for this Sovereign. Operator may override via Request.ClusterMeshName;
// otherwise auto-derived from the FQDN's first label suffixed with
// "-mesh". For single-region provs (len(Regions) <= 1) returns empty
// string — single-cluster Sovereigns don't need a mesh.
//
// Caught on prov t104.omani.works (2026-05-15): operator submitted the
// canonical multi-region request without ClusterMeshName, defaulted to
// "" → cilium-config rendered cluster.name="default" on all 3 regions
// → kvstoremesh refused to start. Auto-derive closes the gap.
func deriveClusterMeshName(req Request) string {
if s := strings.TrimSpace(req.ClusterMeshName); s != "" {
return s
}
if len(req.Regions) <= 1 {
return ""
}
label := firstFQDNLabel(req.SovereignFQDN)
if label == "" {
return ""
}
return label + "-mesh"
}
// deriveClusterMeshID returns the canonical Cilium ClusterMesh peer ID
// for this Sovereign's PRIMARY region. Operator may override via
// Request.ClusterMeshID; otherwise auto-derived deterministically from
// the deployment ID hash, modulo 252, plus 1 (range 1..252; leaves
// 253-254 as headroom for up to two secondaries, which main.tf computes
// as primary+1, primary+2; 255 stays unused).
//
// For single-region provs (len(Regions) <= 1) returns 0 (the
// "not-in-mesh" sentinel that variables.tf documents). The tofu module
// at infra/hetzner/main.tf has matching logic that emits secondaries
// at 0 when the primary is 0.
//
// Caught on prov t104.omani.works (2026-05-15): operator submitted
// multi-region request without ClusterMeshID, defaulted to 0 →
// cilium-config rendered cluster.id=0 on all 3 regions → Cilium
// reserves 0 → kvstoremesh CrashLoopBackOff with "ClusterID 0 is
// reserved" → no mesh ever formed → cross-region observability
// permanently broken.
func deriveClusterMeshID(req Request) int {
if req.ClusterMeshID != 0 {
return req.ClusterMeshID
}
if len(req.Regions) <= 1 {
return 0
}
src := strings.TrimSpace(req.DeploymentID)
if src == "" {
src = strings.TrimSpace(req.SovereignFQDN)
}
if src == "" {
return 0
}
sum := sha256.Sum256([]byte(src))
// Take a 32-bit window of the hash and reduce to [1, 252]. The
// primary uses this value; main.tf increments for secondaries so
// the max id any region sees is primary+(N-1) ≤ 252+2 = 254.
// 255 is intentionally avoided — Cilium uses it as a sentinel
// in some configs.
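// Reduce in uint32 space: converting to int first could go negative
// where int is 32 bits, pushing the result outside [1, 252].
v := binary.BigEndian.Uint32(sum[:4])
return int(v%252) + 1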
}