fix(provisioner): auto-derive cluster_mesh_name + cluster_mesh_id for multi-region provs (#1502)
* fix(canvas): canonicalise Job.DependsOn entries with install- prefix - fix invisible edges

PR #1499 plumbed spec.dependsOn end-to-end and verified that deps populate on the first event (no /refresh-watch needed). But the openova-flow snapshot composer (flow_snapshot_local.go) emits finish-to-start relationships where fromId = jobs.JobID(deploymentID, dep). Without the "install-" prefix on each dep entry, fromId came out as:

    <dep>:hel1-2:seaweedfs   (secondary, missing "install-")
    <dep>:gitea              (primary, missing "install-")

But the FlowNode ids in the snapshot are:

    <dep>:install-hel1-2:seaweedfs
    <dep>:install-gitea

The FE canvas adapter matches by exact id → every finish-to-start rel points at a non-existent node → 224 rels emitted, 0 edges rendered.

Caught on prov t103.omani.works (005080699326a7ac, 2026-05-15):

    curl /v1/flows/.../snapshot → 376 rels total: 152 contains, 224 finish-to-start
    every finish-to-start fromId malformed
    canvas: sibling edges invisible across all 135 install Jobs

Fix in two places:

internal/handler/phase1_watch.go (spawnSecondaryRegionWatchers emit):
    Region-prefix each dep AND inject the "install-" prefix so ev.DependsOn = ["install-<region>:<chart>"] before the bridge receives the event. Symmetric with how ev.Component is constructed.

internal/jobs/helmwatch_bridge.go (OnHelmReleaseEvent):
    Canonicalise every dep entry: if it doesn't already start with JobNamePrefix ("install-"), prepend it. Idempotent on entries that are already canonical (set by the phase1_watch.go path). Covers the primary-region path (bare chart names like "gitea") too: Job.DependsOn now stores "install-gitea", which matches the composer's emitted FromId exactly.

Tests: go build ./... + go test on internal/jobs + helmwatch + provisioner all green. (Pre-existing TestHandleWhoami_* flake in handler is unrelated.)

* fix(canvas): canonicalise resolved DependsOn too - kill malformed prior values

Follow-up to PR #1500. The canon block ran on the event-carried dependsOn arg, but the 3-tier resolve preferred the existing-store value when non-empty, which for any Job written BEFORE PR #1500 rolled out was malformed (no "install-" prefix). The t103.omani.works snapshot kept emitting 224 finish-to-start rels with malformed fromIds because the existing Job rows held "hel1-2:gitea" entries that the resolve preserved verbatim.

Fix: after the 3-tier resolve, run a final canonicalisation pass on resolvedDeps so every persisted entry is canonical regardless of whether it came from the event (already canonical via my prior block) or from the existing store (potentially malformed legacy).

Note: this fix only takes effect on the NEXT HR state transition for a given Job. HRs already in a terminal state (e.g. t103's 135 succeeded HRs) will keep their malformed deps until a new event fires. The loop's next cycle (t104+) writes canonical values from event 1.

* fix(provisioner): auto-derive cluster_mesh_name + cluster_mesh_id for multi-region provs

Caught on prov t104.omani.works (98395b3d9bd9c1aa, 2026-05-15): operator submitted a multi-region body (3 regions cpx52) but omitted ClusterMeshName/ClusterMeshID.

    catalyst-api defaulted them to "" and 0.
    Tofu wrote cluster_mesh_name="" + cluster_mesh_id=0 to tfvars.
    Flux postBuild.substitute rendered cilium-config with cluster.name=default + cluster.id=0.
    Cilium kvstoremesh refused to start: "ClusterID 0 is reserved".
    clustermesh-apiserver CrashLoopBackOff, 16 restarts.
    No mesh ever formed. Cross-region observability + east-west routing permanently broken.

Auto-derivation:

    ClusterMeshName: <first-fqdn-label>-mesh
        e.g. t105.omani.works → "t105-mesh"
    ClusterMeshID: (sha256(deploymentID)[:4] as uint32) mod 252 + 1
        Range [1, 252]; main.tf increments for secondaries so the max id any region sees is primary + (regions - 1) ≤ 254. ID 255 is intentionally avoided (Cilium sentinel).

Operator override is still respected: auto-derive only kicks in when both fields are zero/empty AND len(Regions) > 1. Single-region provs stay at "" / 0 (no mesh needed).

Tested derive helpers against the last 4 prov IDs; all land in the valid range:

    98395b3d9bd9c1aa → 74  (secondaries 75, 76)
    005080699326a7ac → 29  (secondaries 30, 31)
    22af2b1120158239 → 139
    c9df5eed1c1ba6cf → 180

Build + provisioner unit tests green.

---------

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
parent aa8c6dc391
commit 4465cd0d27
@@ -31,6 +31,8 @@ package provisioner
 import (
 	"bufio"
 	"context"
+	"crypto/sha256"
+	"encoding/binary"
 	"encoding/json"
 	"errors"
 	"fmt"
@@ -1247,8 +1249,20 @@ func writeTfvars(deployDir string, req Request) error {
 
 		// Cilium ClusterMesh per-Sovereign peer anchors (#1101 EPIC-6).
 		// Empty + 0 = not in a mesh. Tofu validates id ∈ [0, 255].
-		"cluster_mesh_name": req.ClusterMeshName,
-		"cluster_mesh_id":   req.ClusterMeshID,
+		//
+		// Auto-derivation for zero-touch multi-region provs: when the
+		// operator omits ClusterMeshName/ClusterMeshID AND len(Regions)>1,
+		// derive both deterministically so the mesh comes up by default.
+		// Without this, every multi-region prov lands with cluster.id=0
+		// and Cilium kvstoremesh refuses to start: "ClusterID 0 is
+		// reserved". Operator may still override; auto-derive only kicks
+		// in when both fields are zero/empty (the omit-from-POST default).
+		// Caught on prov t104.omani.works (98395b3d9bd9c1aa, 2026-05-15):
+		// all 3 regions had cluster.id=0, clustermesh-apiserver
+		// CrashLoopBackOff 16 restarts, no inter-region mesh ever
+		// formed.
+		"cluster_mesh_name": deriveClusterMeshName(req),
+		"cluster_mesh_id":   deriveClusterMeshID(req),
 
 		// Hetzner — token gets baked into the state file unless the operator
 		// configures a remote backend with encryption-at-rest. Per Catalyst
@@ -1865,3 +1879,69 @@ func firstFQDNLabel(fqdn string) string {
 	// operation; kept as a non-panicking fallback for unit tests.
 	return s
 }
+
+// deriveClusterMeshName returns the canonical Cilium ClusterMesh name
+// for this Sovereign. Operator may override via Request.ClusterMeshName;
+// otherwise auto-derived from the FQDN's first label suffixed with
+// "-mesh". For single-region provs (len(Regions) <= 1) returns empty
+// string — single-cluster Sovereigns don't need a mesh.
+//
+// Caught on prov t104.omani.works (2026-05-15): operator submitted the
+// canonical multi-region request without ClusterMeshName, defaulted to
+// "" → cilium-config rendered cluster.name="default" on all 3 regions
+// → kvstoremesh refused to start. Auto-derive closes the gap.
+func deriveClusterMeshName(req Request) string {
+	if s := strings.TrimSpace(req.ClusterMeshName); s != "" {
+		return s
+	}
+	if len(req.Regions) <= 1 {
+		return ""
+	}
+	label := firstFQDNLabel(req.SovereignFQDN)
+	if label == "" {
+		return ""
+	}
+	return label + "-mesh"
+}
+
+// deriveClusterMeshID returns the canonical Cilium ClusterMesh peer ID
+// for this Sovereign's PRIMARY region. Operator may override via
+// Request.ClusterMeshID; otherwise auto-derived deterministically from
+// the deployment ID hash, modulo 252, plus 1 (range 1..252; leaves
+// 253-255 as a 3-slot pad for secondaries which main.tf computes as
+// primary+1, primary+2, etc.).
+//
+// For single-region provs (len(Regions) <= 1) returns 0 (the
+// "not-in-mesh" sentinel that variables.tf documents). The tofu module
+// at infra/hetzner/main.tf has matching logic that emits secondaries
+// at 0 when the primary is 0.
+//
+// Caught on prov t104.omani.works (2026-05-15): operator submitted
+// multi-region request without ClusterMeshID, defaulted to 0 →
+// cilium-config rendered cluster.id=0 on all 3 regions → Cilium
+// reserves 0 → kvstoremesh CrashLoopBackOff with "ClusterID 0 is
+// reserved" → no mesh ever formed → cross-region observability
+// permanently broken.
+func deriveClusterMeshID(req Request) int {
+	if req.ClusterMeshID != 0 {
+		return req.ClusterMeshID
+	}
+	if len(req.Regions) <= 1 {
+		return 0
+	}
+	src := strings.TrimSpace(req.DeploymentID)
+	if src == "" {
+		src = strings.TrimSpace(req.SovereignFQDN)
+	}
+	if src == "" {
+		return 0
+	}
+	sum := sha256.Sum256([]byte(src))
+	// Take a 32-bit window of the hash and reduce to [1, 252]. The
+	// primary uses this value; main.tf increments for secondaries so
+	// the max id any region sees is primary+(N-1) ≤ 252+2 = 254.
+	// 255 is intentionally avoided — Cilium uses it as a sentinel
+	// in some configs.
+	v := int(binary.BigEndian.Uint32(sum[:4]))
+	return (v % 252) + 1
+}
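The derivation in the hunk above can be exercised in isolation with a small standalone program. This is a sketch under stated assumptions: meshName, primaryMeshID and regionMeshIDs are hypothetical stand-ins (the real helpers take a Request and reuse firstFQDNLabel), and the primary+1, primary+2 numbering of secondaries is taken from the comments about main.tf, not from the tofu module itself. Note that the ≤ 254 bound follows from 252 + 2, so it holds for at most three regions; a fourth region could reach 255.

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
	"strings"
)

// meshName sketches the auto-derived name: first FQDN label plus "-mesh".
// An empty label yields an empty name, matching the helper in the diff.
func meshName(fqdn string) string {
	label, _, _ := strings.Cut(strings.TrimSpace(fqdn), ".")
	if label == "" {
		return ""
	}
	return label + "-mesh"
}

// primaryMeshID reduces the first 4 bytes of sha256(src) to [1, 252].
// The modulo is applied to the uint32 before converting to int, so the
// result stays non-negative even on platforms where int is 32 bits wide.
func primaryMeshID(src string) int {
	sum := sha256.Sum256([]byte(src))
	return int(binary.BigEndian.Uint32(sum[:4])%252) + 1
}

// regionMeshIDs assumes secondaries are numbered primary+1, primary+2, ...
// as the diff's comments describe for main.tf.
func regionMeshIDs(primary, regions int) []int {
	ids := make([]int, regions)
	for i := range ids {
		ids[i] = primary + i
	}
	return ids
}

func main() {
	fmt.Println(meshName("t105.omani.works")) // t105-mesh

	primary := primaryMeshID("98395b3d9bd9c1aa")
	ids := regionMeshIDs(primary, 3)
	fmt.Println(primary >= 1 && primary <= 252) // true
	fmt.Println(ids[len(ids)-1] <= 254)         // true for three regions
}
```

Reducing in uint32 is a deliberate difference from the diff, which converts to int first; on 64-bit builds both forms give the same value.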