* fix(canvas): canonicalise Job.DependsOn entries with install- prefix — fix invisible edges
PR #1499 plumbed spec.dependsOn end-to-end and verified deps populate
on first event (no /refresh-watch needed). But the openova-flow snapshot
composer (flow_snapshot_local.go) emits finish-to-start relationships
where fromId = jobs.JobID(deploymentID, dep). Without the "install-"
prefix on each dep entry, fromId came out as:
<dep>:hel1-2:seaweedfs (secondary, missing "install-")
<dep>:gitea (primary, missing "install-")
But the FlowNode ids in the snapshot are:
<dep>:install-hel1-2:seaweedfs
<dep>:install-gitea
The FE canvas adapter matches by exact id → every finish-to-start rel
points at a non-existent node → 224 rels emitted, 0 edges rendered.
Caught on prov t103.omani.works (005080699326a7ac, 2026-05-15):
curl /v1/flows/.../snapshot → 376 rels total: 152 contains, 224 finish-to-start
every finish-to-start fromId malformed
canvas: sibling edges invisible across all 135 install Jobs
Fix in two places:
internal/handler/phase1_watch.go (spawnSecondaryRegionWatchers emit):
Region-prefix each dep AND inject the "install-" prefix so
ev.DependsOn = ["install-<region>:<chart>"] before the bridge
receives the event. Symmetric with how ev.Component is constructed.
internal/jobs/helmwatch_bridge.go (OnHelmReleaseEvent):
Canonicalise every dep entry: if it doesn't already start with
JobNamePrefix ("install-"), prepend it. Idempotent on entries
that already are canonical (set by the phase1_watch.go path).
Covers the primary-region path (bare chart names like "gitea")
too — Job.DependsOn now stores "install-gitea", which matches
the composer's emitted FromId exactly.
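A minimal sketch of the canonicalisation described above, not the actual helmwatch_bridge.go code. The names canonicaliseDeps and regionPrefixDeps are hypothetical; only the "install-" prefix and the <region>:<chart> shape come from this message.

```go
package main

import (
	"fmt"
	"strings"
)

// jobNamePrefix mirrors the "install-" JobNamePrefix named in this message.
const jobNamePrefix = "install-"

// canonicaliseDeps (hypothetical name) returns a copy of deps in which every
// entry carries the "install-" prefix. Entries that already have it are left
// untouched, so applying it twice gives the same result as applying it once.
func canonicaliseDeps(deps []string) []string {
	out := make([]string, 0, len(deps))
	for _, d := range deps {
		if !strings.HasPrefix(d, jobNamePrefix) {
			d = jobNamePrefix + d
		}
		out = append(out, d)
	}
	return out
}

// regionPrefixDeps (hypothetical name) stands in for the secondary-region
// emit path: it qualifies each bare chart name with its region, then
// canonicalises the result.
func regionPrefixDeps(region string, charts []string) []string {
	out := make([]string, 0, len(charts))
	for _, c := range charts {
		out = append(out, region+":"+c)
	}
	return canonicaliseDeps(out)
}

func main() {
	// Primary-region bare name plus an entry that is already canonical.
	fmt.Println(canonicaliseDeps([]string{"gitea", "install-hel1-2:seaweedfs"}))
	// Secondary-region path.
	fmt.Println(regionPrefixDeps("hel1-2", []string{"seaweedfs"}))
}
```

Keeping the bridge-side pass idempotent is what lets both call sites coexist: entries already prefixed by the phase1_watch.go path go through unchanged.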
Tests: go build ./... + go test on internal/jobs + helmwatch + provisioner
all green. (Pre-existing TestHandleWhoami_* flake in handler is unrelated.)
* fix(canvas): canonicalise resolved DependsOn too — kill malformed prior values
Follow-up to PR #1500. The canon block ran on the event-carried dependsOn
arg, but the 3-tier resolve preferred existing-store value when non-empty
— which for any Job written BEFORE PR #1500 rolled out was malformed
(no "install-" prefix). t103.omani.works snapshot kept emitting 224
finish-to-start rels with malformed fromIds because the existing Job
rows held "hel1-2:gitea" entries that the resolve preserved verbatim.
Fix: after the 3-tier resolve, run a final canonicalisation pass on
resolvedDeps so every persisted entry is canonical regardless of
whether it came from the event-carried arg (already canonicalised by the
PR #1500 block) or from the existing store (potentially malformed
legacy values).
Note: this fix only takes effect on the NEXT HR state transition for a
given Job. HRs already in terminal state (e.g. t103's 135 succeeded HRs)
will keep their malformed deps until a new event fires. Provs from the
loop's next cycle onward (t104+) write canonical deps from the first event.
* fix(provisioner): auto-derive cluster_mesh_name + cluster_mesh_id for multi-region provs
Caught on prov t104.omani.works (98395b3d9bd9c1aa, 2026-05-15): operator
submitted a multi-region body (3 regions cpx52) but omitted
ClusterMeshName/ClusterMeshID. catalyst-api defaulted them to "" and 0.
Tofu wrote cluster_mesh_name="" + cluster_mesh_id=0 to tfvars. Flux
postBuild.substitute rendered cilium-config with cluster.name=default +
cluster.id=0. Cilium kvstoremesh refused to start:
"ClusterID 0 is reserved"
clustermesh-apiserver CrashLoopBackOff 16 restarts. No mesh ever formed.
Cross-region observability + east-west routing permanently broken.
Auto-derivation:
ClusterMeshName: <first-fqdn-label>-mesh
e.g. t105.omani.works → "t105-mesh"
ClusterMeshID: (sha256(deploymentID)[:4] as uint32) mod 252 + 1
Range [1, 252]; main.tf increments for secondaries so the max id
any region sees is primary + (regions - 1) ≤ 254 for up to 3 regions.
ID 255 is intentionally avoided (Cilium sentinel).
Operator override still respected — auto-derive only kicks in when
both fields are zero/empty AND len(Regions) > 1. Single-region provs
stay at "" / 0 (no mesh needed).
Tested derive helpers against the last 4 prov IDs — all land in valid
range:
98395b3d9bd9c1aa → 74 (secondaries 75, 76)
005080699326a7ac → 29 (secondaries 30, 31)
22af2b1120158239 → 139
c9df5eed1c1ba6cf → 180
Build + provisioner unit tests green.
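The derivation above could be sketched as follows. This is an illustration under stated assumptions, not the provisioner code: the helper names are hypothetical, and both the byte order (big-endian) and hashing the ID as its hex string are guesses, so the values it prints need not match the table above.

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
	"strings"
)

// deriveMeshName (hypothetical name) takes the first label of the Sovereign
// FQDN and appends "-mesh", e.g. t105.omani.works -> t105-mesh.
func deriveMeshName(fqdn string) string {
	label, _, _ := strings.Cut(fqdn, ".")
	return label + "-mesh"
}

// deriveMeshID (hypothetical name) maps a deployment ID into [1, 252].
// ASSUMPTIONS: the ID is hashed as its string form and the first four
// digest bytes are read big-endian; the real helper may differ on both.
func deriveMeshID(deploymentID string) uint32 {
	sum := sha256.Sum256([]byte(deploymentID))
	n := binary.BigEndian.Uint32(sum[:4])
	return n%252 + 1 // never 0 (reserved), leaves headroom below 255
}

func main() {
	primary := deriveMeshID("98395b3d9bd9c1aa")
	fmt.Println(deriveMeshName("t105.omani.works"), primary)
	// Secondaries take consecutive ids after the primary, as main.tf does.
	for i := uint32(1); i <= 2; i++ {
		fmt.Println("secondary", i, "id", primary+i)
	}
}
```

Because the id is a pure function of the deployment ID, re-running the provisioner for the same deployment yields the same mesh identity.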
* fix(cloudinit): thread cluster.name + cluster.id into pre-Flux cilium-values.yaml
Prov t105.omani.works (a6c0f5dfebd63bd0, 2026-05-15) showed that PR #1502's
catalyst-api auto-derive (cluster_mesh_name=t105-mesh, cluster_mesh_id=99)
correctly reached cilium-config — but only AFTER Flux helm-upgraded the
release. The pre-Flux Cilium install (cloud-init line 1473) used
/var/lib/catalyst/cilium-values.yaml which DIDN'T carry cluster.name or
cluster.id, so cilium-agent started with the chart defaults
("default", 0). The Flux upgrade then changed cilium-config but the
already-running cilium-agent kept its in-memory cluster.name="default"
because it reads the ConfigMap only once, at startup.
Downstream consequences observed live on t105:
hubble-relay CrashLoopBackOff:
"tls: failed to verify certificate: x509: certificate is valid for
*.t105-mesh.hubble-grpc.cilium.io, not catalyst-t105-omani-works-cp1
.default.hubble-grpc.cilium.io"
clustermesh peer announcements use stale "default" identity →
cross-region mesh handshakes x509-fail.
Fix: include cluster.name + cluster.id in the pre-Flux helm install's
values file, sourced from the templatefile() vars cluster_mesh_name +
cluster_mesh_id (already threaded per-region by main.tf:381-382 and
:900-901). Now the first cilium-agent process announces with the
correct identity, no helm-upgrade race.
* docs(sandbox): design docs for the Sandbox product
Captures the agreed product shape, end-user journeys (developer +
Sovereign admin), technical architecture (native agent TUI via
xterm.js + WebSocket + PTY, card protocol for mobile, MCP catalogue,
four knowledge layers, JetStream/SSE integration), and the
conversational-provisioning surface that reuses the same shell with a
narrow MCP toolbox as an alternative to the catalyst-ui wizard.
Status: design only — no implementation. Identifies one prerequisite
(long-lived API token carrying org_id claim) with the exact files to
extend in core/services/auth and platform/keycloak.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(sovereign-tls): tls-restart Job needs list+watch on deployments/daemonsets
Caught on prov t110.omani.works (fe09897a1b6b3c1d, 2026-05-15) — the
cilium-envoy-tls-restart Job stuck Running 10m+ with:
W reflector.go:561] failed to list *unstructured.Unstructured:
deployments.apps "cilium-operator" is forbidden: User
"system:serviceaccount:kube-system:cilium-envoy-tls-restart"
cannot list resource "deployments" in API group "apps" in the
namespace "kube-system"
The Role grants `get` + `patch` but `kubectl rollout status` (which the
Job runs after `rollout restart`) does NOT just GET — internally it
uses client-go informerwatcher to LIST+WATCH the resource. Without
those verbs the informer fails and `rollout status` hangs until
activeDeadlineSeconds (900s). The Job never restarts cilium-envoy,
console.<fqdn> never serves.
Fix: add `list` + `watch` to both rules (cilium-operator Deployment
+ cilium-envoy DaemonSet). Scoped by resourceNames, so the SA still
can't enumerate or watch other workloads.
* fix(dns): auto-write per-Sovereign A records into parent zone after Phase-0
Caught on prov t110.omani.works (fe09897a1b6b3c1d, 2026-05-15):
dig +short A console.t110.omani.works @ns1.openova.io
→ 49.12.16.160 ← ORPHAN IP — Hetzner reassigned to a 3rd party
The mothership PowerDNS had ZERO records for t110's hostnames. A stale
wildcard `*.omani.works` (manual leftover from earlier provs) was
returning a wrong IP that no longer belonged to the openova project at
Hetzner — sending operator traffic to an unrelated tenant. The deeper
gap: catalyst-api never auto-wrote the per-Sovereign A records that
browsers need to resolve.
The existing parent-domain flow has:
pdmCreatePowerDNSZone — stub at parent_domains.go:1096
certManagerStep — stub at parent_domains.go:1141
commitPDMWithRetry — runs ONLY for pool-allocated FQDNs
(otech<N>.<pool>), NOT BYO
So BYO-style (operator-owned parent like omani.works + arbitrary
Sovereign FQDN like t111.omani.works) left the parent zone untouched.
Fix:
internal/powerdns/client.go
+ PatchRRSets(ctx, zone, rrsets) — PATCH REPLACE on
/api/v1/servers/{id}/zones/{zone} with idempotent re-runs
internal/handler/handler.go
+ powerdnsZoneClient interface gains PatchRRSets — wired
automatically by SetPowerDNSZoneClient
internal/handler/sovereign_dns_records.go (new)
+ CanonicalSovereignSubdomains: console / auth / gitea / harbor /
registry / bao / grafana / hubble / pdns / openova-flow /
marketplace / api / guacamole
+ upsertSovereignParentZoneRecords: PATCH the parent zone with one
A record per subdomain → primary LB IP
+ upsertSovereignParentZoneRecordsFromResult: deployment-flow
wrapper that iterates every parentDomain in the request body
internal/handler/deployments.go
+ Call upsertSovereignParentZoneRecordsFromResult right after
commitPDMWithRetry on Phase-0 success. Best-effort (log +
continue), so a PowerDNS hiccup doesn't fail the Sovereign deployment
Operator override via CATALYST_SOVEREIGN_SUBDOMAINS not yet wired —
filed as follow-up. Today the canonical list is the chart-side HTTPRoute
list, kept aligned via the comment in sovereign_dns_records.go.
---------
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>