Commit Graph

1365 Commits

Author SHA1 Message Date
github-actions[bot]
53d8f8e402 deploy: update catalyst images to 05c6edb 2026-05-16 09:17:23 +00:00
github-actions[bot]
3628f8fc31 deploy: update catalyst images to b7140b9 2026-05-16 09:08:59 +00:00
github-actions[bot]
5689ea4f44 deploy: update catalyst images to db116c2 2026-05-16 08:57:54 +00:00
e3mrah
db116c2d18
fix(kubeconfig): honour ?region=<key> on GET /kubeconfig (#1515)
Multi-region Sovereigns store secondary CP kubeconfigs at
<kubeconfigsDir>/<id>-<region>.yaml via the PUT endpoint (L520+). The
GET endpoint always read dep.Result.KubeconfigPath which is the
PRIMARY's path, so any caller asking for ?region=nbg1-1 got primary's
kubeconfig pointing at primary's IP (89.167.22.182 etc.) — silently.

Caught on t117 (7152ad51e7838836, 2026-05-16): D-gate validator
fetched all 3 region kubeconfigs via the GET endpoint with ?region=
and all 3 returned PRIMARY's endpoint. Every per-region check
(D8/D9/D12) inspected primary 3× instead of 3 distinct regions.
Workaround was reading directly from the PVC; this fix unblocks the
canonical API path.

Co-authored-by: claude <claude@anthropic.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 12:55:55 +04:00
github-actions[bot]
7bfe7266af deploy: update catalyst images to f30a49f 2026-05-16 08:14:32 +00:00
github-actions[bot]
243bb6b03d deploy: update catalyst images to 7f0de7f 2026-05-16 07:36:27 +00:00
github-actions[bot]
c59ae92b55 deploy: update catalyst images to dc59085 2026-05-16 06:59:40 +00:00
github-actions[bot]
ecf256d4a7 deploy: update catalyst images to 0c9e391 2026-05-15 20:04:18 +00:00
github-actions[bot]
2585b439d4 deploy: update catalyst images to 66e7768 2026-05-15 19:56:32 +00:00
e3mrah
66e7768e8e
fix(helmwatch): emit Succeeded events for HRs Ready at attach time (#1510)
When catalyst-api restarts and the bridge re-attaches to an already-
converged child cluster, the informer initial-list returns HRs already
in Ready=True. The previous processEvent path relied implicitly on the
zero-value of w.states[componentID] (empty string) being different
from the derived state — which works today but would silently regress
if a future refactor pre-seeded w.states from a prior snapshot.

Caught on prov t112.omani.works (f2e7f02e6ffb6a18, 2026-05-15): 4 HRs
converged across primary + sin-2 regions before/after the pod restart
at 19:16, but the mothership Jobs API kept reporting:

    install-self-sovereign-cutover  → running   (kubectl: Ready=True)
    install-powerdns                → running   (kubectl: Ready=True)
    install-catalyst-platform       → running   (kubectl: Ready=True)
    install-sin-2:reloader          → failed    (kubectl: Ready=True)

D6 (0 pending / 0 running) and D7 (mothership ≡ child) both failed.

Fix shape: processEvent's emission policy is now EXPLICITLY "first
observation OR real transition". `hadPrev` (the two-return-value map
lookup) is false on the FIRST event for componentID regardless of the
state value, so the dispatch fires unconditionally on attach. The
dedupe via prev != state still suppresses sub-second status-patch
churn that helm-controller's observedGeneration touches produce.
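
The "first observation OR real transition" policy can be sketched as follows. This is an illustration under assumptions: the `watcher` type, its `emitted` slice, and the string-typed states are hypothetical stand-ins, and only the two-return-value map lookup (`hadPrev`) and the `prev != state` dedupe come from the message.

```go
package main

import "fmt"

// watcher is a hypothetical stand-in for the helmwatch Watcher.
type watcher struct {
	states  map[string]string // last state seen per componentID
	emitted []string          // record of dispatched events, for the demo
}

func newWatcher() *watcher {
	return &watcher{states: map[string]string{}}
}

// processEvent emits on the first observation of a component, or when the
// derived state differs from the previously recorded one.
func (w *watcher) processEvent(componentID, state string) {
	prev, hadPrev := w.states[componentID] // hadPrev is false on first sight
	w.states[componentID] = state
	if !hadPrev || prev != state {
		w.emitted = append(w.emitted, componentID+"="+state)
	}
}

func main() {
	w := newWatcher()
	w.processEvent("powerdns", "installed") // first observation: emitted
	w.processEvent("powerdns", "installed") // status-patch churn: suppressed
	w.processEvent("powerdns", "failed")    // real transition: emitted
	w.processEvent("reloader", "")          // first observation, zero-value state: still emitted
	fmt.Println(w.emitted)
}
```

The last call shows why the explicit `hadPrev` check matters: a comparison against the map's zero value alone would treat an empty first state as "unchanged" and drop it.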

Idempotency: the jobs.Bridge's lastState map dedupes (componentID,
state) re-emissions at the bridge layer (Bridge.OnHelmReleaseEvent
line ~478), and the openova-flow-server's TypeSnapshot envelope is
idempotent at the receiver — so a re-emit propagated by the
flow_emitter periodic loop is safe.

Two new tests pin the contract:
  - TestTransition_AttachTimeReady_EmitsSucceededViaSubscribe asserts
    a Watcher attaching to a child cluster with 4 already-Ready HRs
    emits exactly one State=installed event per HR, BOTH on the
    primary emit callback AND through Subscribe (the bridge wiring).
  - TestTransition_FirstObservation_NeverDedupsAcrossWatchers asserts
    that constructing a new Watcher against the same fake client
    (the Pod-restart shape) re-emits the full component-event set,
    because w.states is independent per Watcher.

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-15 23:54:25 +04:00
github-actions[bot]
feb42e2f80 deploy: update catalyst images to 5f8ba85 2026-05-15 19:41:57 +00:00
github-actions[bot]
0a63c19cc0 deploy: update catalyst images to 22668f2 2026-05-15 18:18:23 +00:00
e3mrah
22668f2870
feat(catalyst-api): auto-establish Cilium ClusterMesh after Phase-1 (#1508)
Implements DoD gates D9, D10, D11 from
docs/SOVEREIGN-MULTI-REGION-DOD.md. After phase1-watching reports all
HRs Ready, the orchestrator wires every region's clustermesh-apiserver
into a fully-connected peer mesh by writing the cross-cluster trust
material (CA bundles, peer endpoints, mTLS client certs) into each
cluster's kube-system Secrets. Cilium auto-reloads via the chart's
watch mechanism; a rollout-restart guarantees pickup.

- New handler/clustermesh.go orchestrator (AutoEstablishClusterMesh)
- Hook in phase1_watch.go markPhase1Done after fireHandover, runs on
  a goroutine with a 20-minute budget; skips when regions<2
- Idempotent: re-run on partially-meshed Sovereign converges
- Uses LoadBalancer IPs per region (provider-agnostic — A2/A3/A6)
- Hard-fails on Service type != LoadBalancer per invariant A3
- No cilium CLI shell-out (catalyst-api Pod doesn't ship it); mints
  per-peer client certs from the local cilium-ca via crypto/x509
- Four coverage tests against fake clientsets: happy-path 2-region,
  LB-absent peer marked Connected=false, idempotent re-run, single-
  region short-circuit
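
Minting a per-peer client certificate from a local CA with crypto/x509, without shelling out to a CLI, might look roughly like the sketch below. Everything here is an assumption made for illustration: the throwaway CA stands in for cilium-ca, and the function names, common names, key type, and validity periods are hypothetical rather than what clustermesh.go actually uses.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"fmt"
	"math/big"
	"time"
)

// newCA builds a throwaway self-signed CA (a stand-in for the local cilium-ca).
func newCA() (*x509.Certificate, *ecdsa.PrivateKey, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, nil, err
	}
	tmpl := &x509.Certificate{
		SerialNumber:          big.NewInt(1),
		Subject:               pkix.Name{CommonName: "example-ca"},
		NotBefore:             time.Now().Add(-time.Minute),
		NotAfter:              time.Now().Add(24 * time.Hour),
		IsCA:                  true,
		BasicConstraintsValid: true,
		KeyUsage:              x509.KeyUsageCertSign,
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		return nil, nil, err
	}
	cert, err := x509.ParseCertificate(der)
	return cert, key, err
}

// mintPeerClientCert signs a fresh client-auth certificate for one peer.
func mintPeerClientCert(ca *x509.Certificate, caKey *ecdsa.PrivateKey, peerName string) (*x509.Certificate, *ecdsa.PrivateKey, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, nil, err
	}
	serial, err := rand.Int(rand.Reader, new(big.Int).Lsh(big.NewInt(1), 127))
	if err != nil {
		return nil, nil, err
	}
	serial.Add(serial, big.NewInt(1)) // keep the serial strictly positive
	tmpl := &x509.Certificate{
		SerialNumber: serial,
		Subject:      pkix.Name{CommonName: peerName}, // hypothetical naming
		NotBefore:    time.Now().Add(-time.Minute),
		NotAfter:     time.Now().Add(24 * time.Hour),
		KeyUsage:     x509.KeyUsageDigitalSignature,
		ExtKeyUsage:  []x509.ExtKeyUsage{x509.ExtKeyUsageClientAuth},
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, ca, &key.PublicKey, caKey)
	if err != nil {
		return nil, nil, err
	}
	cert, err := x509.ParseCertificate(der)
	return cert, key, err
}

// verifyClientCert checks that cert chains to ca and is usable for client auth.
func verifyClientCert(ca, cert *x509.Certificate) error {
	pool := x509.NewCertPool()
	pool.AddCert(ca)
	_, err := cert.Verify(x509.VerifyOptions{
		Roots:     pool,
		KeyUsages: []x509.ExtKeyUsage{x509.ExtKeyUsageClientAuth},
	})
	return err
}

func main() {
	ca, caKey, err := newCA()
	if err != nil {
		panic(err)
	}
	cert, _, err := mintPeerClientCert(ca, caKey, "peer-sin-2")
	if err != nil {
		panic(err)
	}
	fmt.Println("verify error:", verifyClientCert(ca, cert))
}
```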

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-15 22:16:26 +04:00
github-actions[bot]
9613e69ecc deploy: update catalyst images to 93f6993 2026-05-15 18:06:42 +00:00
github-actions[bot]
b89fdfc9e7 deploy: update catalyst images to 4e199f1 2026-05-15 17:14:47 +00:00
e3mrah
4e199f137b
fix(dns): auto-write per-Sovereign A records into parent zone after Phase-0 (#1505)
* fix(canvas): canonicalise Job.DependsOn entries with install- prefix — fix invisible edges

PR #1499 plumbed spec.dependsOn end-to-end and verified deps populate
on first event (no /refresh-watch needed). But the openova-flow snapshot
composer (flow_snapshot_local.go) emits finish-to-start relationships
where fromId = jobs.JobID(deploymentID, dep). Without the "install-"
prefix on each dep entry, fromId came out as:

  <dep>:hel1-2:seaweedfs                 (secondary, missing "install-")
  <dep>:gitea                            (primary, missing "install-")

But the FlowNode ids in the snapshot are:

  <dep>:install-hel1-2:seaweedfs
  <dep>:install-gitea

The FE canvas adapter matches by exact id → every finish-to-start rel
points at a non-existent node → 224 rels emitted, 0 edges rendered.

Caught on prov t103.omani.works (005080699326a7ac, 2026-05-15):

  curl /v1/flows/.../snapshot → 376 rels total: 152 contains, 224 finish-to-start
  every finish-to-start fromId malformed
  canvas: sibling edges invisible across all 135 install Jobs

Fix in two places:

  internal/handler/phase1_watch.go (spawnSecondaryRegionWatchers emit):
    Region-prefix each dep AND inject the "install-" prefix so
    ev.DependsOn = ["install-<region>:<chart>"] before the bridge
    receives the event. Symmetric with how ev.Component is constructed.

  internal/jobs/helmwatch_bridge.go (OnHelmReleaseEvent):
    Canonicalise every dep entry: if it doesn't already start with
    JobNamePrefix ("install-"), prepend it. Idempotent on entries
    that already are canonical (set by the phase1_watch.go path).
    Covers the primary-region path (bare chart names like "gitea")
    too — Job.DependsOn now stores "install-gitea", which matches
    the composer's emitted FromId exactly.
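
The two canonicalisation steps above can be sketched as below. The helper names `regionDep` and `canonicaliseDeps` are hypothetical; only `JobNamePrefix` ("install-") and the `install-<region>:<chart>` shape are taken from the message.

```go
package main

import (
	"fmt"
	"strings"
)

// JobNamePrefix is the prefix every install Job id carries.
const JobNamePrefix = "install-"

// regionDep builds the secondary-region form "install-<region>:<chart>".
func regionDep(region, chart string) string {
	return JobNamePrefix + region + ":" + chart
}

// canonicaliseDeps prepends JobNamePrefix to entries that lack it and leaves
// already-canonical entries untouched, so running it twice is a no-op.
func canonicaliseDeps(deps []string) []string {
	out := make([]string, 0, len(deps))
	for _, d := range deps {
		if !strings.HasPrefix(d, JobNamePrefix) {
			d = JobNamePrefix + d
		}
		out = append(out, d)
	}
	return out
}

func main() {
	deps := []string{"gitea", regionDep("hel1-2", "seaweedfs"), "hel1-2:gitea"}
	fmt.Println(canonicaliseDeps(deps))
}
```

Because the bridge-side pass is idempotent, it is safe to apply it both to event-carried entries and to values read back from an existing store.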

Tests: go build ./... + go test on internal/jobs + helmwatch + provisioner
all green. (Pre-existing TestHandleWhoami_* flake in handler is unrelated.)

* fix(canvas): canonicalise resolved DependsOn too — kill malformed prior values

Follow-up to PR #1500. The canon block ran on the event-carried dependsOn
arg, but the 3-tier resolve preferred existing-store value when non-empty
— which for any Job written BEFORE PR #1500 rolled out was malformed
(no "install-" prefix). t103.omani.works snapshot kept emitting 224
finish-to-start rels with malformed fromIds because the existing Job
rows held "hel1-2:gitea" entries that the resolve preserved verbatim.

Fix: after the 3-tier resolve, run a final canonicalisation pass on
resolvedDeps so every persisted entry is canonical regardless of
whether it came from event-carried (already canon by my prior block)
or from existing-store (potentially malformed legacy).

Note: this fix only takes effect on the NEXT HR state transition for a
given Job. HRs already in terminal state (e.g. t103's 135 succeeded HRs)
will keep their malformed deps until a new event fires. The loop's next
cycle (t104+) writes canonical from event 1.

* fix(provisioner): auto-derive cluster_mesh_name + cluster_mesh_id for multi-region provs

Caught on prov t104.omani.works (98395b3d9bd9c1aa, 2026-05-15): operator
submitted a multi-region body (3 regions cpx52) but omitted
ClusterMeshName/ClusterMeshID. catalyst-api defaulted them to "" and 0.
Tofu wrote cluster_mesh_name="" + cluster_mesh_id=0 to tfvars. Flux
postBuild.substitute rendered cilium-config with cluster.name=default +
cluster.id=0. Cilium kvstoremesh refused to start:
  "ClusterID 0 is reserved"
clustermesh-apiserver CrashLoopBackOff 16 restarts. No mesh ever formed.
Cross-region observability + east-west routing permanently broken.

Auto-derivation:

  ClusterMeshName: <first-fqdn-label>-mesh
    e.g. t105.omani.works → "t105-mesh"

  ClusterMeshID:  (sha256(deploymentID)[:4] as uint32) mod 252 + 1
    Range [1, 252]; main.tf increments for secondaries so the max id
    any region sees is primary + (regions - 1) ≤ 254. ID 255 is
    intentionally avoided (Cilium sentinel).

Operator override still respected — auto-derive only kicks in when
both fields are zero/empty AND len(Regions) > 1. Single-region provs
stay at "" / 0 (no mesh needed).
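
A rough sketch of the derivation rules above. The function names are hypothetical, and two details are assumptions the message does not settle: the deployment ID is hashed as its raw string bytes, and the first four digest bytes are read big-endian. With different choices the numbers would differ, so this sketch is not expected to reproduce the values listed in the message.

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
	"strings"
)

// deriveClusterMeshName returns "<first-fqdn-label>-mesh".
func deriveClusterMeshName(fqdn string) string {
	label, _, _ := strings.Cut(fqdn, ".")
	return label + "-mesh"
}

// deriveClusterMeshID maps a deployment ID into [1, 252].
// Assumption: raw string bytes are hashed and read big-endian.
func deriveClusterMeshID(deploymentID string) uint32 {
	sum := sha256.Sum256([]byte(deploymentID))
	n := binary.BigEndian.Uint32(sum[:4])
	return n%252 + 1
}

// shouldAutoDerive mirrors the stated rule: derive only when both fields are
// unset and the request spans more than one region.
func shouldAutoDerive(name string, id uint32, regions int) bool {
	return name == "" && id == 0 && regions > 1
}

func main() {
	fmt.Println(deriveClusterMeshName("t105.omani.works"))
	fmt.Println(deriveClusterMeshID("98395b3d9bd9c1aa"))
	fmt.Println(shouldAutoDerive("", 0, 3), shouldAutoDerive("", 0, 1))
}
```

Capping the primary at 252 leaves headroom for two secondaries (primary + 2 ≤ 254) while never producing the reserved 0 or the avoided 255.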

Tested derive helpers against the last 4 prov IDs — all land in valid
range:
  98395b3d9bd9c1aa → 74 (secondaries 75, 76)
  005080699326a7ac → 29 (secondaries 30, 31)
  22af2b1120158239 → 139
  c9df5eed1c1ba6cf → 180

Build + provisioner unit tests green.

* fix(cloudinit): thread cluster.name + cluster.id into pre-Flux cilium-values.yaml

t105.omani.works (a6c0f5dfebd63bd0, 2026-05-15) found that PR #1502's
catalyst-api auto-derive (cluster_mesh_name=t105-mesh, cluster_mesh_id=99)
correctly reached cilium-config — but only AFTER Flux helm-upgraded the
release. The pre-Flux Cilium install (cloud-init line 1473) used
/var/lib/catalyst/cilium-values.yaml which DIDN'T carry cluster.name or
cluster.id, so cilium-agent started with the chart defaults
("default", 0). The Flux upgrade then changed cilium-config but the
already-running cilium-agent kept its in-memory cluster.name="default"
because it reads ConfigMap once at startup.

Downstream consequences observed live on t105:
  hubble-relay CrashLoopBackOff:
    "tls: failed to verify certificate: x509: certificate is valid for
     *.t105-mesh.hubble-grpc.cilium.io, not catalyst-t105-omani-works-cp1
     .default.hubble-grpc.cilium.io"
  clustermesh peer announcements use stale "default" identity →
  cross-region mesh handshakes x509-fail.

Fix: include cluster.name + cluster.id in the pre-Flux helm install's
values file, sourced from the templatefile() vars cluster_mesh_name +
cluster_mesh_id (already threaded per-region by main.tf:381-382 and
:900-901). Now the first cilium-agent process announces with the
correct identity, no helm-upgrade race.

* docs(sandbox): design docs for the Sandbox product

Captures the agreed product shape, end-user journeys (developer +
Sovereign admin), technical architecture (native agent TUI via
xterm.js + WebSocket + PTY, card protocol for mobile, MCP catalogue,
four knowledge layers, JetStream/SSE integration), and the
conversational-provisioning surface that reuses the same shell with a
narrow MCP toolbox as an alternative to the catalyst-ui wizard.

Status: design only — no implementation. Identifies one prerequisite
(long-lived API token carrying org_id claim) with the exact files to
extend in core/services/auth and platform/keycloak.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(sovereign-tls): tls-restart Job needs list+watch on deployments/daemonsets

Caught on prov t110.omani.works (fe09897a1b6b3c1d, 2026-05-15) — the
cilium-envoy-tls-restart Job stuck Running 10m+ with:

  W reflector.go:561] failed to list *unstructured.Unstructured:
    deployments.apps "cilium-operator" is forbidden: User
    "system:serviceaccount:kube-system:cilium-envoy-tls-restart"
    cannot list resource "deployments" in API group "apps" in the
    namespace "kube-system"

The Role grants `get` + `patch` but `kubectl rollout status` (which the
Job runs after `rollout restart`) does NOT just GET — internally it
uses client-go informerwatcher to LIST+WATCH the resource. Without
those verbs the informer fails and `rollout status` hangs until
activeDeadlineSeconds (900s). The Job never restarts cilium-envoy,
console.<fqdn> never serves.

Fix: add `list` + `watch` to both rules (cilium-operator Deployment
+ cilium-envoy DaemonSet). Scoped by resourceName, so the SA still
can't enumerate or watch other workloads.

* fix(dns): auto-write per-Sovereign A records into parent zone after Phase-0

Caught on prov t110.omani.works (fe09897a1b6b3c1d, 2026-05-15):

  dig +short A console.t110.omani.works @ns1.openova.io
  → 49.12.16.160     ← ORPHAN IP — Hetzner reassigned to a 3rd party

The mothership PowerDNS had ZERO records for t110's hostnames. A stale
wildcard `*.omani.works` (manual leftover from earlier provs) was
returning a wrong IP that no longer belonged to the openova project at
Hetzner — sending operator traffic to an unrelated tenant. The deeper
gap: catalyst-api never auto-wrote the per-Sovereign A records that
browsers need to resolve.

The existing parent-domain flow has:
  pdmCreatePowerDNSZone     — stub at parent_domains.go:1096
  certManagerStep           — stub at parent_domains.go:1141
  commitPDMWithRetry        — runs ONLY for pool-allocated FQDNs
                              (otech<N>.<pool>), NOT BYO

So BYO-style (operator-owned parent like omani.works + arbitrary
Sovereign FQDN like t111.omani.works) left the parent zone untouched.

Fix:

  internal/powerdns/client.go
    + PatchRRSets(ctx, zone, rrsets) — PATCH REPLACE on
      /api/v1/servers/{id}/zones/{zone} with idempotent re-runs

  internal/handler/handler.go
    + powerdnsZoneClient interface gains PatchRRSets — wired
      automatically by SetPowerDNSZoneClient

  internal/handler/sovereign_dns_records.go (new)
    + CanonicalSovereignSubdomains: console / auth / gitea / harbor /
      registry / bao / grafana / hubble / pdns / openova-flow /
      marketplace / api / guacamole
    + upsertSovereignParentZoneRecords: PATCH the parent zone with one
      A record per subdomain → primary LB IP
    + upsertSovereignParentZoneRecordsFromResult: deployment-flow
      wrapper that iterates every parentDomain in the request body

  internal/handler/deployments.go
    + Call upsertSovereignParentZoneRecordsFromResult right after
      commitPDMWithRetry on Phase-0 success — best-effort (log +
      continue), so a PowerDNS hiccup doesn't bail the Sovereign

Operator override via CATALYST_SOVEREIGN_SUBDOMAINS not yet wired —
filed as follow-up. Today the canonical list is the chart-side HTTPRoute
list, kept aligned via the comment in sovereign_dns_records.go.

---------

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-15 21:12:38 +04:00
e3mrah
2c3ea44af8
fix(sovereign-tls): tls-restart Job needs list+watch verbs (#1504)
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-15 21:02:37 +04:00
github-actions[bot]
fc7bbc8711 deploy: update catalyst images to 3a19bb1 2026-05-15 15:51:00 +00:00
github-actions[bot]
51a9f7b1b5 deploy: update catalyst images to 4465cd0 2026-05-15 15:15:38 +00:00
e3mrah
4465cd0d27
fix(provisioner): auto-derive cluster_mesh_name + cluster_mesh_id for multi-region provs (#1502)
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
2026-05-15 19:13:35 +04:00
github-actions[bot]
aa8c6dc391 deploy: update catalyst images to 49ae2a7 2026-05-15 13:26:36 +00:00
e3mrah
49ae2a7cab
fix(canvas): canonicalise resolved DependsOn too — kill malformed prior values (#1501)
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
2026-05-15 17:24:33 +04:00
github-actions[bot]
1f07721204 deploy: update catalyst images to 80fdbcd 2026-05-15 13:20:49 +00:00
e3mrah
80fdbcd8e1
fix(canvas): canonicalise Job.DependsOn entries with install- prefix — fix invisible edges (#1500)
PR #1499 plumbed spec.dependsOn end-to-end and verified deps populate
on first event (no /refresh-watch needed). But the openova-flow snapshot
composer (flow_snapshot_local.go) emits finish-to-start relationships
where fromId = jobs.JobID(deploymentID, dep). Without the "install-"
prefix on each dep entry, fromId came out as:

  <dep>:hel1-2:seaweedfs                 (secondary, missing "install-")
  <dep>:gitea                            (primary, missing "install-")

But the FlowNode ids in the snapshot are:

  <dep>:install-hel1-2:seaweedfs
  <dep>:install-gitea

The FE canvas adapter matches by exact id → every finish-to-start rel
points at a non-existent node → 224 rels emitted, 0 edges rendered.

Caught on prov t103.omani.works (005080699326a7ac, 2026-05-15):

  curl /v1/flows/.../snapshot → 376 rels total: 152 contains, 224 finish-to-start
  every finish-to-start fromId malformed
  canvas: sibling edges invisible across all 135 install Jobs

Fix in two places:

  internal/handler/phase1_watch.go (spawnSecondaryRegionWatchers emit):
    Region-prefix each dep AND inject the "install-" prefix so
    ev.DependsOn = ["install-<region>:<chart>"] before the bridge
    receives the event. Symmetric with how ev.Component is constructed.

  internal/jobs/helmwatch_bridge.go (OnHelmReleaseEvent):
    Canonicalise every dep entry: if it doesn't already start with
    JobNamePrefix ("install-"), prepend it. Idempotent on entries
    that already are canonical (set by the phase1_watch.go path).
    Covers the primary-region path (bare chart names like "gitea")
    too — Job.DependsOn now stores "install-gitea", which matches
    the composer's emitted FromId exactly.
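
The two canonicalisation steps described above can be sketched roughly as follows. This is a minimal illustration, not the code in phase1_watch.go or helmwatch_bridge.go: the helper names `regionPrefixDeps` and `canonicaliseDeps` are hypothetical, and only the "install-" prefix and the "install-<region>:<chart>" shape come from the commit message.

```go
package main

import (
	"fmt"
	"strings"
)

// jobNamePrefix mirrors the "install-" JobNamePrefix named in the commit message.
const jobNamePrefix = "install-"

// regionPrefixDeps (hypothetical name) turns bare chart names into
// "install-<region>:<chart>" entries for a secondary region.
func regionPrefixDeps(region string, deps []string) []string {
	out := make([]string, 0, len(deps))
	for _, d := range deps {
		out = append(out, jobNamePrefix+region+":"+d)
	}
	return out
}

// canonicaliseDeps (hypothetical name) prepends "install-" to entries that
// lack it and leaves canonical entries untouched, so a second pass is a no-op.
func canonicaliseDeps(deps []string) []string {
	out := make([]string, 0, len(deps))
	for _, d := range deps {
		if !strings.HasPrefix(d, jobNamePrefix) {
			d = jobNamePrefix + d
		}
		out = append(out, d)
	}
	return out
}

func main() {
	fmt.Println(regionPrefixDeps("hel1-2", []string{"seaweedfs"}))
	fmt.Println(canonicaliseDeps([]string{"gitea", "hel1-2:seaweedfs", "install-gitea"}))
}
```

Because the bridge-side pass is idempotent, it can run on entries that the secondary-region path has already prefixed without double-prefixing them.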

Tests: go build ./... + go test on internal/jobs + helmwatch + provisioner
all green. (Pre-existing TestHandleWhoami_* flake in handler is unrelated.)

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
2026-05-15 17:18:40 +04:00
github-actions[bot]
5b2c8b79a8 deploy: update catalyst images to 1cd6c3f 2026-05-15 12:41:58 +00:00
e3mrah
1cd6c3f432
fix(canvas): plumb HR spec.dependsOn through every event — kill the seed-timing race (#1499)
* fix(pdm/dynadot): auto-register NS glue records before set_ns

Dynadot rejects set_ns when any NS hostname is not yet registered
as a glue record in the customer's account. The 31-line code comment
above SetNameservers documents this requirement but the implementation
never landed at the adapter layer — only the per-request handler-side
glueIP path (BYO Flow B, issue #900) registered glue, leaving the
mothership parent-domain onboard flow exposed.

Live blocker on 2026-05-15: founder attempted zero-touch onboard of
fresh parent domain omani.homes; the flow stalled because
ns3.openova.io had never been registered as a Dynadot glue record on
this account (ns1/ns2 had been registered long ago when openova.io
itself was onboarded). Failure surface:
  "'ns3.openova.io' needs to be registered with an ip address before
   it can be used."
Required out-of-band manual API calls to unblock, defeating the
zero-touch property the architecture is supposed to deliver.

Fix (adapter layer, no per-request flag, always-on when configured):
- Adapter gains NSGlueIP field; SetNameservers iterates every NS
  hostname BEFORE set_ns, skips in-bailiwick children of the domain
  being set, calls RegisterGlueRecord(host, NSGlueIP) for the rest.
- RegisterGlueRecord (already idempotent per issue #900) short-
  circuits via get_ns on identical IP, falls through to set_ns_ip
  on a stale IP, and runs register_ns when the host is missing — so
  a SetNameservers retry costs only get_ns probes, not extra writes.
- A typed registrar error inside the register loop returns
  immediately without calling set_ns (fail-fast contract).
- POOL_DOMAIN_MANAGER_NS_GLUE_IP env var (canonical operator-config
  pattern in this repo) threaded through cmd/pdm/main.go onto the
  Dynadot adapter at PDM startup. Empty value preserves prior
  pass-through behaviour, keeping BYO Flow B handler-level glue
  authoritative for per-request Sovereign add-domain calls.
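
A rough sketch of the pre-registration loop described in the list above, assuming simplified signatures. The names `inBailiwick`, `setNameservers`, and `errRateLimited` are illustrative stand-ins, not the adapter's real API; the registrar calls are passed in as plain functions so the control flow (skip in-bailiwick hosts, fail fast, pass through when no glue IP is configured) is visible on its own. Treating only strict children of the domain as in-bailiwick is an assumption.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// errRateLimited is a hypothetical typed registrar error used for the demo.
var errRateLimited = errors.New("rate limited")

// inBailiwick reports whether host is a child of domain, ignoring case and a
// trailing dot on either argument.
func inBailiwick(host, domain string) bool {
	h := strings.ToLower(strings.TrimSuffix(host, "."))
	d := strings.ToLower(strings.TrimSuffix(domain, "."))
	return d != "" && strings.HasSuffix(h, "."+d)
}

// setNameservers registers glue for every out-of-bailiwick NS host and then
// calls setNS. A registration error returns immediately, so setNS never runs.
// An empty glueIP skips registration entirely (pass-through behaviour).
func setNameservers(domain string, hosts []string, glueIP string,
	register func(host, ip string) error,
	setNS func(domain string, hosts []string) error) error {
	if glueIP != "" {
		for _, h := range hosts {
			if inBailiwick(h, domain) {
				continue
			}
			if err := register(h, glueIP); err != nil {
				return err
			}
		}
	}
	return setNS(domain, hosts)
}

func main() {
	register := func(host, ip string) error {
		fmt.Println("register glue:", host, ip)
		return nil
	}
	setNS := func(domain string, hosts []string) error {
		fmt.Println("set_ns:", domain, hosts)
		return nil
	}
	hosts := []string{"ns3.openova.io", "NS1.Omani.Homes."}
	// 192.0.2.10 is a documentation-range placeholder address.
	if err := setNameservers("omani.homes", hosts, "192.0.2.10", register, setNS); err != nil {
		fmt.Println("error:", err)
	}
}
```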

Tests (httptest server, 7 new cases) cover:
  - AllFresh: 3 NS hostnames, all unregistered → 3× (get_ns+register_ns)
    + set_ns (7 API calls, in order).
  - OneAlreadyRegistered: middle NS short-circuits via get_ns,
    others register, set_ns runs.
  - RegisterFails_SetNsNotCalled: 429 mid-register surfaces
    ErrRateLimited unwrapped; set_ns must NOT execute.
  - SetNsFailsAfterRegister: pre-register completes, set_ns
    returns Dynadot error; ErrDomainNotInAccount surfaces.
  - SkipsInBailiwick: in-bailiwick NS hostname (child of domain
    being set) is skipped entirely (no get_ns, no register_ns).
  - DisabledWhenNSGlueIPEmpty: backward-compat — bare SetNameservers
    issues exactly one set_ns call when env var unset.
  - IsInBailiwickHost: case- and trailing-dot-tolerant table test.

go build ./... and go test ./... both green across the entire
core/pool-domain-manager module.

* fix(canvas): skip TLS verify on Sovereign k3s self-signed CA — restore sibling deps

PR #1431 (derive HR dependsOn from live watcher) and PR #1470 (persist
DependsOn on every event) both addressed symptoms at the
persistence/event layer. The root cause was deeper: the bridge's
reflector x509-fails against the Sovereign apiserver's self-signed
k3s CA on every fresh multi-region prov, so SeedJobsFromInformerList
never runs and there's no DependsOn to persist in the first place.

Live blocker on omani.homes prov fc0855a25c24511c (2026-05-15): all
3 region kubeconfigs at /var/lib/catalyst/kubeconfigs/ have valid
CA-data (openssl s_client verifies cleanly), but the reflector caches
a poisoned TLS state from before the kubeconfig was finalized. Result:
all 142 jobs return dependsOn: [], FlowCanvasOrganic renders 45 sibling
HRs with edges only to the parent, no inter-sibling edges. The
"sibling wiring lost" symptom returns on every fresh provision.

Fix:

  helmwatch/kubeconfig.go: restConfigFromKubeconfig now sets
    TLSClientConfig.Insecure = true and clears CAData/CAFile.
    The reflector still authenticates via the bearer token from
    the kubeconfig, the connection is over public Hetzner LB which
    terminates HTTPS, and TLS verify is only skipped for mothership
    informers reading Sovereign HR/source/kustomization state.

  k8scache/factory.go: same skip on the CloudPage resource-explorer
    informer (AddCluster path). Same x509 failure mode without it.

This makes the previous three fixes' guarantees actually hold: the
seed runs, the cache populates, every event preserves real DependsOn,
and the API returns sibling-to-sibling dependency edges for the
canvas to render.

Tests:
  go test ./internal/helmwatch/... ./internal/k8scache/...
  All green. No test required CAData verification to pass.

* fix(sovereign-tls): escape $ in tls-restart Job so Flux doesn't eat the bash vars

Root cause caught on prov t101.omani.works (c9df5eed1c1ba6cf, 2026-05-15):

The cilium-envoy-tls-restart Job's shell command uses bash variables
${SECRET_NS}, ${SECRET_NAME}, ${DS_NS}, ${DS_NAME}, ${tls_crt}, ${i}.
Flux's postBuild.substitute processes ${...} in the YAML BEFORE the
Job manifest lands in the cluster, and replaces every $-reference that
isn't in the Kustomization's substituteFrom map with an empty string.

Result on prov t101 (T+13m, mothership flipped status=ready):

  Job logs: "[tls-restart] waiting for / with non-empty tls.crt"
                                      ^^^ — namespace and name both empty

  Command becomes: `kubectl get secret -n "" "" --ignore-not-found ...`
  → polls a nonexistent secret forever
  → cilium-operator never gets the rollout-restart
  → CiliumEnvoyConfig's additionalAddresses.socketAddress: 0.0.0.0:30443
    bind never lands
  → cilium-envoy host:30443 stays unbound
  → Hetzner LB targets stay unhealthy on 30080/30443
  → console.<fqdn> serves HTTP 000 indefinitely
  → mothership's "Handover gate" timeout fires AT THE WRONG TIME — flips
    deployment status=ready before TLS is actually serving

The "Sovereign was up at t101" reading we saw briefly was a transient
TRAEFIK fallback cert from upstream during cert-issuance, NOT the
Sovereign envoy.

Fix: escape every bash variable reference inside the script as $$VAR so
Flux postBuild.substitute emits a literal $VAR which bash then evaluates
correctly at Job runtime. SOVEREIGN_FQDN in YAML labels stays as
${SOVEREIGN_FQDN} because that IS a Flux substitute (kept intentionally).

This is the third recurrence of "sibling deps lost / cilium-envoy host
bind missing / fresh prov console=000" on the same code path:
  PR #1431 — derive HR dependsOn from live watcher
  PR #1470 — persist DependsOn on every event
  PR #1494 — restart cilium-operator BEFORE cilium-envoy on first install
  PR #1497 — skip TLS verify on Sovereign k3s self-signed CA
  THIS  — escape \$VAR in Job command so Flux doesn't blank them


Each prior PR fixed a layer above the Job's own correctness. The Job
itself was always broken on fresh provs since the cilium-operator
restart line was added.

* fix(canvas): plumb HR spec.dependsOn through every event — kill the seed-timing race

Real architectural fix for the recurring "sibling deps lost on every fresh
provision" regression. PR #1431, PR #1470, PR #1497 each patched a layer
above the actual gap: the per-event emit path at helmwatch.go:1525 had
the unstructured HelmRelease in scope but THREW AWAY spec.dependsOn before
emitting the provisioner.Event. The bridge then wrote Job.DependsOn=[]
on every event, relying on a pre-existing seed having populated deps —
which never happened on fresh provs because the watcher's initial-list
sync (T+2m, right after tofu) fires with 0 HRs (Flux hasn't installed
anything yet).

The fix walks the data end-to-end:

  provisioner.Event   gains DependsOn []string
  helmwatch.processEvent  populates DependsOn: extractDependsOn(u) on
                          every PhaseComponent emit (the unstructured
                          HelmRelease was already in scope, just being
                          dropped at the event boundary)
  spawnSecondaryRegionWatchers  region-prefixes each entry so secondary
                                Jobs (install-<region>:<chart>) wire to
                                intra-region siblings, not bare primary
                                names
  Bridge.OnProvisionerEvent  passes ev.DependsOn to OnHelmReleaseEvent
  Bridge.OnHelmReleaseEvent  new dependsOn []string parameter; resolves
                             with 3-tier preference:
                               prior store value  >
                               event-carried (live HR spec.dependsOn) >
                               empty.
                             The prior-store branch keeps PR #1470's
                             pod-restart preservation; the event-carried
                             branch closes the fresh-prov gap.
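
The 3-tier preference can be pictured with a small sketch. `resolveDeps` is a hypothetical name and the real OnHelmReleaseEvent signature is not reproduced here; only the ordering (prior store value, then event-carried, then empty) is taken from the description above.

```go
package main

import "fmt"

// resolveDeps picks the dependency list to persist on a Job row:
// a non-empty prior store value wins, then the event-carried list
// (the live HR spec.dependsOn), then an empty list.
func resolveDeps(prior, eventCarried []string) []string {
	if len(prior) > 0 {
		return prior
	}
	if len(eventCarried) > 0 {
		return eventCarried
	}
	return []string{}
}

func main() {
	// Pod restart: the stored value is preserved.
	fmt.Println(resolveDeps([]string{"gitea"}, []string{"cilium"}))
	// Fresh provision: nothing stored yet, so the event supplies the deps.
	fmt.Println(resolveDeps(nil, []string{"cilium"}))
	// Neither source has deps.
	fmt.Println(resolveDeps(nil, nil))
}
```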

No timing race, no re-seed band-aid, no /refresh-watch dependency. Every
HR transition observed by the watcher carries the live spec.dependsOn
through to the Job row — exactly the architecture that ComponentSnapshot
already documents at helmwatch.go:679-689 but the event path had
silently dropped.

Caught on prov t102.omani.works (22af2b1120158239, 2026-05-15) — all
hel1-2 HRs showed Deps:— in the JobsTable despite the bridge being
healthy (verified: x509 errors=0 post PR #1497, kubeconfigs present at
mtime T+2m, OnInitialListSynced fired).

Prior recurrences (each patched a layer above the actual gap):
  PR #1431 (2026-05-11) — derive HR dependsOn from live watcher (seed path)
  PR #1470 (2026-05-14) — persist DependsOn on every event (preserve prior)
  PR #1497 (2026-05-15) — skip TLS verify on Sovereign k3s self-signed CA
  PR #1498 (2026-05-15) — escape $ in tls-restart Job so Flux doesn't blank vars
  THIS  (2026-05-15) — actually plumb spec.dependsOn through the Event

Tests:
  go test ./internal/jobs/... ./internal/helmwatch/... ./internal/provisioner/...
  All green. 9 OnHelmReleaseEvent callsites updated for the new signature.

---------

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
2026-05-15 16:39:52 +04:00
github-actions[bot]
fdbd47a5a8 deploy: update catalyst images to da63b45 2026-05-15 10:48:25 +00:00
e3mrah
da63b45b53
fix(canvas): skip TLS verify on Sovereign k3s self-signed CA — restore sibling deps (#1497)
* fix(pdm/dynadot): auto-register NS glue records before set_ns

Dynadot rejects set_ns when any NS hostname is not yet registered
as a glue record in the customer's account. The 31-line code comment
above SetNameservers documents this requirement but the implementation
never landed at the adapter layer — only the per-request handler-side
glueIP path (BYO Flow B, issue #900) registered glue, leaving the
mothership parent-domain onboard flow exposed.

Live blocker on 2026-05-15: founder attempted zero-touch onboard of
fresh parent domain omani.homes; the flow stalled because
ns3.openova.io had never been registered as a Dynadot glue record on
this account (ns1/ns2 had been registered long ago when openova.io
itself was onboarded). Failure surface:
  "'ns3.openova.io' needs to be registered with an ip address before
   it can be used."
Required out-of-band manual API calls to unblock, defeating the
zero-touch property the architecture is supposed to deliver.

Fix (adapter layer, no per-request flag, always-on when configured):
- Adapter gains NSGlueIP field; SetNameservers iterates every NS
  hostname BEFORE set_ns, skips in-bailiwick children of the domain
  being set, calls RegisterGlueRecord(host, NSGlueIP) for the rest.
- RegisterGlueRecord (already idempotent per issue #900) short-
  circuits via get_ns on identical IP, falls through to set_ns_ip
  on a stale IP, and runs register_ns when the host is missing — so
  a SetNameservers retry costs only get_ns probes, not extra writes.
- A typed registrar error inside the register loop returns
  immediately without calling set_ns (fail-fast contract).
- POOL_DOMAIN_MANAGER_NS_GLUE_IP env var (canonical operator-config
  pattern in this repo) threaded through cmd/pdm/main.go onto the
  Dynadot adapter at PDM startup. Empty value preserves prior
  pass-through behaviour, keeping BYO Flow B handler-level glue
  authoritative for per-request Sovereign add-domain calls.

Tests (httptest server, 7 new cases) cover:
  - AllFresh: 3 NS hostnames, all unregistered → 3× (get_ns+register_ns)
    + set_ns (7 API calls, in order).
  - OneAlreadyRegistered: middle NS short-circuits via get_ns,
    others register, set_ns runs.
  - RegisterFails_SetNsNotCalled: 429 mid-register surfaces
    ErrRateLimited unwrapped; set_ns must NOT execute.
  - SetNsFailsAfterRegister: pre-register completes, set_ns
    returns Dynadot error; ErrDomainNotInAccount surfaces.
  - SkipsInBailiwick: in-bailiwick NS hostname (child of domain
    being set) is skipped entirely (no get_ns, no register_ns).
  - DisabledWhenNSGlueIPEmpty: backward-compat — bare SetNameservers
    issues exactly one set_ns call when env var unset.
  - IsInBailiwickHost: case- and trailing-dot-tolerant table test.

go build ./... and go test ./... both green across the entire
core/pool-domain-manager module.

* fix(canvas): skip TLS verify on Sovereign k3s self-signed CA — restore sibling deps

PR #1431 (derive HR dependsOn from live watcher) and PR #1470 (persist
DependsOn on every event) both addressed symptoms at the
persistence/event layer. The root cause was deeper: the bridge's
reflector x509-fails against the Sovereign apiserver's self-signed
k3s CA on every fresh multi-region prov, so SeedJobsFromInformerList
never runs and there's no DependsOn to persist in the first place.

Live blocker on omani.homes prov fc0855a25c24511c (2026-05-15): all
3 region kubeconfigs at /var/lib/catalyst/kubeconfigs/ have valid
CA-data (openssl s_client verifies cleanly), but the reflector caches
a poisoned TLS state from before the kubeconfig was finalized. Result:
all 142 jobs return dependsOn: [], FlowCanvasOrganic renders 45 sibling
HRs with edges only to the parent, no inter-sibling edges. The
"sibling wiring lost" symptom returns on every fresh provision.

Fix:

  helmwatch/kubeconfig.go: restConfigFromKubeconfig now sets
    TLSClientConfig.Insecure = true and clears CAData/CAFile.
    The reflector still authenticates via the bearer token from
    the kubeconfig, the connection is over public Hetzner LB which
    terminates HTTPS, and TLS verify is only skipped for mothership
    informers reading Sovereign HR/source/kustomization state.

  k8scache/factory.go: same skip on the CloudPage resource-explorer
    informer (AddCluster path). Same x509 failure mode without it.

This makes the previous three fixes' guarantees actually hold: the
seed runs, the cache populates, every event preserves real DependsOn,
and the API returns sibling-to-sibling dependency edges for the
canvas to render.

Tests:
  go test ./internal/helmwatch/... ./internal/k8scache/...
  All green. No test required CAData verification to pass.

---------

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
2026-05-15 14:46:21 +04:00
github-actions[bot]
558d3b7095 deploy: update catalyst images to 1dc21bf 2026-05-14 18:54:01 +00:00
github-actions[bot]
11c9a1bb83 deploy: update catalyst images to 96fc3bf 2026-05-14 18:04:21 +00:00
e3mrah
96fc3bfc76
fix(routes): preserve /sovereign basepath on canonicalisation hard-nav + normalize PIN-login next (#1488)
Two related basepath-stripping bugs in hard-navigation paths:

A. router.tsx rootBeforeLoad canonicalisePath
   TanStack Router passes POST-basepath `location.pathname` (e.g. on
   contabo a visit to `/sovereign/provision/$id/jobs/install-X%3AY`
   arrives as `/provision/$id/jobs/install-X%3AY`). canonicalisePath
   lowercases the path, so `%3A` → `%3a` and the comparison triggers
   a hard-nav. But `window.location.replace(canonical)` operates on
   the FULL URL — the bare `/provision/...` target bypasses the SPA
   mount point and nginx 404s before the SPA loads. Same root cause
   as #1486, different hard-nav site.

B. VerifyPinPage hard-nav post-PIN
   The `next` query param arrives in two forms depending on which
   redirectToLogin variant produced it: SovereignConsoleLayout.tsx:91
   uses `window.location.pathname` (INCLUDES basepath) while :178
   uses currentPathRelativeToBasepath (STRIPS basepath). #1486
   unconditionally re-prefixed which double-prefixed the first form.
   Normalize to "post-basepath" form first, then re-prefix exactly
   once.

Fix shape: every window.location.{replace,assign} that operates on a
URL derived from router-internal data MUST re-add basepath. The router-
based `<Link to>` / `navigate({to})` paths are unaffected because
TanStack Router auto-prefixes those.
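
The "normalise, then re-prefix exactly once" rule is plain string handling. The actual change is in the TypeScript front end; the Go sketch below only illustrates the logic, and the names `stripBasepath` and `hardNavTarget` are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// stripBasepath returns p in post-basepath form: with basepath "/sovereign",
// "/sovereign/provision/x" becomes "/provision/x". A path that does not carry
// the basepath is returned unchanged.
func stripBasepath(basepath, p string) string {
	if basepath == "" || basepath == "/" {
		return p
	}
	if p == basepath {
		return "/"
	}
	if strings.HasPrefix(p, basepath+"/") {
		return strings.TrimPrefix(p, basepath)
	}
	return p
}

// hardNavTarget normalises p to post-basepath form and then adds the basepath
// back exactly once, so both forms of `next` produce the same target.
func hardNavTarget(basepath, p string) string {
	rel := stripBasepath(basepath, p)
	if basepath == "" || basepath == "/" {
		return rel
	}
	return basepath + rel
}

func main() {
	fmt.Println(hardNavTarget("/sovereign", "/provision/abc/jobs"))
	fmt.Println(hardNavTarget("/sovereign", "/sovereign/provision/abc/jobs"))
}
```

Both calls in `main` yield the same path, which is the property that removes the double-prefix case.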

Caught live on prov #82 + #84 (omani.works, 2026-05-14): the canvas
row-click + PIN-login + canonicalise paths each generated bare
`/provision/...` URLs that hit nginx's 404 page.

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 22:02:20 +04:00
github-actions[bot]
8c61db0d02 deploy: update catalyst images to a25fd33 2026-05-14 17:19:41 +00:00
e3mrah
a25fd33dea
fix(provisioner): key tofu workdir by DeploymentID, not FQDN (eliminate reprov tfstate carryover) (#1487)
Root cause for the prov #82 → #83 → #84 cascade on omani.works:

The per-prov tofu workdir was keyed by `strings.ReplaceAll(FQDN, ".", "-")`,
so every reprovision of the SAME SovereignFQDN reused the SAME directory.
When prov #82's force-wipe failed `tofu destroy` (the workdir held a tftpl
from before #1485's WILDCARD_CERT_ISSUER escape fix), the Hetzner-purge
fallback cleaned the cloud but the tfstate stayed dirty. Prov #83 then
inherited tfstate that referenced destroyed-via-Hetzner-purge resources
and `tofu apply` failed with "Saved plan is stale" / "resource already
exists".

The kubeconfig path was ALREADY keyed by DeploymentID; the tofu workdir
was the outlier. Bring it into alignment so each POST /deployments gets
a hermetic workdir. CreateDeployment generates a unique DeploymentID on
every call, so reprovs are isolated by construction.

Wizard-resume — the original justification for the FQDN-keyed design —
was already fragile (it required a clean prior tfstate), and is better
served by an explicit retry endpoint that re-uses the same DeploymentID
rather than implicit workdir reuse.

Affected callers:
- provisioner.go Provision + Destroy → workdirKey() (returns DeploymentID, falls back to FQDN-slug for legacy paths)
- wipe.go WipeDeployment → uses `id` (chi URL param) directly
- handover.go FinaliseHandover → uses `id` directly
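
A minimal sketch of the keying rule, assuming a simplified signature. The commit names `workdirKey()` and its fallback behaviour; the parameters, the base directory, and the IDs in `main` are placeholders for illustration.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// workdirKey returns the DeploymentID when one is present and falls back to
// the FQDN slug (dots replaced by dashes) for legacy callers.
func workdirKey(deploymentID, fqdn string) string {
	if deploymentID != "" {
		return deploymentID
	}
	return strings.ReplaceAll(fqdn, ".", "-")
}

func main() {
	base := "/tmp/tofu-workdirs" // hypothetical base directory

	// Two provisions of the same FQDN get distinct, isolated workdirs.
	fmt.Println(filepath.Join(base, workdirKey("dep-0001", "omani.works")))
	fmt.Println(filepath.Join(base, workdirKey("dep-0002", "omani.works")))

	// Legacy path: no DeploymentID, so the FQDN slug is used.
	fmt.Println(filepath.Join(base, workdirKey("", "omani.works")))
}
```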

Tests pass: provisioner + handler test packages.

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 21:17:28 +04:00
github-actions[bot]
c7cc9bda35 deploy: update catalyst images to 00aeefe 2026-05-14 17:01:56 +00:00
e3mrah
00aeefedaa
fix(verify-pin): re-prefix basepath on window.location.replace after PIN success (#1486)
VerifyPinPage.tsx:104 calls window.location.replace(target) to drive a
hard navigation after PIN verification succeeds. Hard navigation BYPASSES
TanStack Router's basepath config — so on contabo (basepath='/sovereign'),
a `target` of `/provision/$id/jobs` lands the browser at
`https://console.openova.io/provision/$id/jobs` (no `/sovereign/` prefix).
nginx on contabo only serves the SPA under `/sovereign/*` and 404s
everything else, so the operator sees nginx's "404 page not found"
before the SPA has a chance to route.

The `next` value is stored post-basepath by design (basepathRelative.ts)
because router.navigate adds basepath back automatically. window.location
doesn't, so we have to re-add it manually for the hard-nav path.

Caught live on prov #82 (omani.works, 2026-05-14): after PIN-login on
console.openova.io/sovereign/login?next=%2Fprovision%2F.../jobs, the
replace landed on /provision/.../jobs → nginx 404.

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 20:59:03 +04:00
github-actions[bot]
7096207c96 deploy: update catalyst images to cebc954 2026-05-14 16:23:10 +00:00
github-actions[bot]
c6d13f356c deploy: update catalyst images to 115c588 2026-05-14 14:52:50 +00:00
e3mrah
115c58885b
fix(cilium-gateway): allow world ingress to reserved:ingress (unblocks Sovereign public surfaces) (#1482)
* fix(tls): cilium-gateway-cert STAGING/PROD issuer selectable via tofu

clusters/_template/sovereign-tls/cilium-gateway-cert.yaml hardcoded
letsencrypt-dns01-prod-powerdns regardless of qa_test_session_enabled.
On high-cadence QA reprov cycles this hits the LE PROD 5/168h rate
limit (caught on prov #76 at 13:45 UTC, retry-after 16:49 UTC) and
the wildcard Certificate sticks Ready=False — Cilium Gateway has no
valid TLS secret → envoy listener never binds → public TLS handshake
to console.<fqdn> dies with SSL_ERROR_SYSCALL.

Add tofu local.wildcard_cert_issuer = qa_test_session_enabled ?
staging : prod. Thread WILDCARD_CERT_ISSUER through the sovereign-
tls Kustomization postBuild.substitute. cilium-gateway-cert.yaml
references it as ${WILDCARD_CERT_ISSUER}.

Default behaviour unchanged for non-QA (production) Sovereigns —
they still resolve to letsencrypt-dns01-prod-powerdns.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(cilium-gateway): allow world ingress to Cilium Gateway reserved:ingress endpoint

When Cilium Gateway API runs with gatewayAPI.hostNetwork.enabled=true and
a default-deny CCNP is present, every public request to a Sovereign host
(console, auth, gitea, registry, api, ...) hits the gateway listener and
gets DENIED at envoy's cilium.l7policy filter with:

    cilium.l7policy: Ingress from 1 policy lookup for endpoint X for port 30443: DENY

Public response: HTTP/1.1 403 Forbidden, body "Access denied", server: envoy.

Root cause: Cilium creates a special endpoint with identity reserved:ingress (8)
representing the gateway listener. By default this endpoint has
policy-enabled=both with allowed-ingress-identities=[1 (host)] and empty
L4 rules — so no port is permitted. The default-deny CCNP's NotIn-namespace
endpointSelector does NOT cover this endpoint (it has no
io.kubernetes.pod.namespace label), and our qa-fixtures didn't ship a
matching allow-template for it. Net effect: TLS handshake succeeds, HTTPRoutes
are Programmed, backends are healthy in-cluster, but every request 403s.

Caught live on prov #80 (omantel.biz, 2026-05-14) after the Gateway hostNetwork
fix (#1480) finally activated host-bind on :30443. Verified by:
- envoy debug log: cilium.l7policy DENY for endpoint 10.42.0.201 port 30443
- cilium-dbg endpoint get 3282 -o json: l4.ingress: [] and allowed-ingress-identities: [1]
- transiently applying the same CCNP via kubectl: console.omantel.biz → 200

Fix: ship a CCNP scoped to reserved:ingress that allows ingress from world,
cluster, host, remote-node (multi-region CP-to-CP), and kube-apiserver,
plus egress to all so envoy can forward to any backend service. This is
the canonical Cilium hostNetwork Gateway-API zero-trust pattern.

Chart bump: catalyst 1.4.142 → 1.4.143.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: e3mrah <catalyst@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
2026-05-14 18:50:34 +04:00
github-actions[bot]
fb99ae5fd0 deploy: update catalyst images to a88e132 2026-05-14 14:27:51 +00:00
github-actions[bot]
5752fc751f deploy: update catalyst images to bdceb3a 2026-05-14 12:45:34 +00:00
e3mrah
bdceb3a78a
fix(canvas): region phase sub-groups default to pending (not running) (#1479)
Empty handover/apps phase groups (no Jobs emitted yet for those
lifecycle phases) were hardcoded to 'running' which propagated up
to the root phase groups. With the rollup fix preserving stored
status when no children, the correct stored default is 'pending'.

After this, fresh-prov handover + apps groups show 'pending'
(accurate — those phases haven't started) and the rollup correctly
classifies bootstrap-kit + cutover region groups based on their
real install-* children.

Co-authored-by: e3mrah <catalyst@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 16:43:24 +04:00
github-actions[bot]
0e4cb67319 deploy: update catalyst images to 690d588 2026-05-14 12:40:44 +00:00
e3mrah
690d588a04
fix(canvas): rollup preserves leaf status when group has no children (#1478)
Bug found on prov #76 rollup: cluster-bootstrap (a leaf with
family='bootstrap') was being treated as an empty group and reset
from succeeded → pending. That status then cascaded up through
provisioner (whose 5 children include cluster-bootstrap) making
provisioner show pending despite all 5 phase jobs being succeeded.

Fix: when a node in groupNodeIdx has zero children in contains rels,
keep its STORED status instead of forcing pending. This preserves
leaf-with-group-family nodes (cluster-bootstrap) AND empty phase
groups (handover/apps before their Jobs exist).
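
The guard can be sketched as below. `effectiveStatus` is a hypothetical name, and `allSucceeded` is a deliberately tiny stand-in for the real roll-up rules so the example stays focused on the zero-children case.

```go
package main

import "fmt"

// effectiveStatus keeps the stored status for a node with no children and
// derives a status from the children only when there is at least one.
func effectiveStatus(stored string, children []string, derive func([]string) string) string {
	if len(children) == 0 {
		return stored
	}
	return derive(children)
}

// allSucceeded is a simplified placeholder for the roll-up rules.
func allSucceeded(children []string) string {
	for _, c := range children {
		if c != "succeeded" {
			return "pending"
		}
	}
	return "succeeded"
}

func main() {
	// cluster-bootstrap: a leaf with a group family keeps "succeeded".
	fmt.Println(effectiveStatus("succeeded", nil, allSucceeded))
	// An empty handover group keeps its stored "pending".
	fmt.Println(effectiveStatus("pending", nil, allSucceeded))
	// A group with children is still derived from them.
	fmt.Println(effectiveStatus("pending", []string{"succeeded", "succeeded"}, allSucceeded))
}
```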

Co-authored-by: e3mrah <catalyst@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 16:38:30 +04:00
github-actions[bot]
195c6b5bc5 deploy: update catalyst images to 13d79c7 2026-05-14 12:35:31 +00:00
e3mrah
13d79c77f5
fix(flow-emit): lazy-start emit loop on snapshot request (#1477)
Bug found on prov #76: rolled-up group status fix wasn't visible
because catalyst-api Pod restart (image roll) killed the emit
goroutine. startFlowEmitLoop is only invoked from phase1_watch start
— for a deployment already at status=ready, the new Pod has no emit
loop until someone fires phase1 again.

Add idempotent startFlowEmitLoop call inside HandleFlowSnapshot so
any UI page load (which polls snapshot) reactivates the emit loop.
Combined with the existing phase1-start invocation, this covers both
fresh provisioning and post-restart UI access patterns.
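
One way to make such a start call idempotent is a mutex-guarded set of deployment IDs. This is only a sketch under that assumption: the `emitLoops` type and its `start` method are hypothetical, and the commit does not say how startFlowEmitLoop tracks running loops.

```go
package main

import (
	"fmt"
	"sync"
)

// emitLoops tracks which deployments already have an emit loop running.
type emitLoops struct {
	mu      sync.Mutex
	started map[string]bool
}

// start launches run in a goroutine the first time it sees id and reports
// whether it did so. Later calls for the same id are no-ops.
func (e *emitLoops) start(id string, run func()) bool {
	e.mu.Lock()
	defer e.mu.Unlock()
	if e.started == nil {
		e.started = make(map[string]bool)
	}
	if e.started[id] {
		return false
	}
	e.started[id] = true
	go run()
	return true
}

func main() {
	var loops emitLoops
	done := make(chan struct{})

	// First snapshot request after a Pod restart starts the loop.
	fmt.Println(loops.start("dep-1", func() { close(done) }))
	// Every later poll finds the loop already registered.
	fmt.Println(loops.start("dep-1", func() {}))
	<-done
}
```

With this shape, both the phase1 start path and the snapshot handler can call `start` unconditionally.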

Co-authored-by: e3mrah <catalyst@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 16:33:25 +04:00
github-actions[bot]
5527652b49 deploy: update catalyst images to f334950 2026-05-14 12:29:07 +00:00
e3mrah
f3349501b8
fix(canvas): roll-up group status from descendants (prov #76) (#1476)
Founder reported on prov #76: 'there are pending and running jobs
still I dont think they are true'. Examination showed all 135
install-* leaf statuses are succeeded but the synthetic group nodes
(cutover, handover, apps + per-region sub-groups) carried hardcoded
placeholder statuses ('running' / 'pending') from emit time.

Add bottom-up roll-up after all nodes/rels are emitted:
  - all descendants succeeded → succeeded
  - any descendant failed     → failed
  - any descendant running    → running
  - else                      → pending (no descendants or all pending)
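
The four rules above, applied in the order listed, can be sketched as a small function. `rollUpStatus` is a hypothetical name; it takes the statuses of a group's descendants as plain strings and is not the composer's actual code.

```go
package main

import "fmt"

// rollUpStatus derives a group status from its descendants' statuses:
// all succeeded -> succeeded, any failed -> failed, any running -> running,
// otherwise pending (including the case of no descendants).
func rollUpStatus(descendants []string) string {
	if len(descendants) == 0 {
		return "pending"
	}
	allSucceeded, anyFailed, anyRunning := true, false, false
	for _, s := range descendants {
		switch s {
		case "succeeded":
			// keeps allSucceeded as is
		case "failed":
			anyFailed, allSucceeded = true, false
		case "running":
			anyRunning, allSucceeded = true, false
		default:
			allSucceeded = false
		}
	}
	switch {
	case allSucceeded:
		return "succeeded"
	case anyFailed:
		return "failed"
	case anyRunning:
		return "running"
	}
	return "pending"
}

func main() {
	fmt.Println(rollUpStatus([]string{"succeeded", "succeeded"}))
	fmt.Println(rollUpStatus([]string{"succeeded", "running", "failed"}))
	fmt.Println(rollUpStatus([]string{"succeeded", "running"}))
	fmt.Println(rollUpStatus(nil))
}
```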

Now cutover phase bubble shows succeeded when its install-self-
sovereign-cutover child has finished, etc. handover/apps stay pending
until real Jobs are emitted for them (jobs.Store integration is the
follow-up that materialises those phases).

Co-authored-by: e3mrah <catalyst@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 16:26:59 +04:00
e3mrah
a2167f36de
fix(openova-flow): COPY go.sum + go mod download in Dockerfile (#1475)
CI build failed with missing go.sum entry for pgx after the
in-memory→CNPG rewrite (now has real deps). The previous Dockerfile
only COPYed go.mod — fine when the codebase had zero external deps,
broken once pgx + pgxpool + x/text + x/sync landed in go.sum.

Co-authored-by: e3mrah <catalyst@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 14:23:57 +04:00
e3mrah
808310b144
fix(openova-flow): pin pgx to v5.5.5 for Go 1.22 build compat (#1472)
CI Dockerfile uses golang:1.22-alpine. Default pgx@v5.9.2 requires
Go 1.25 — fix by pinning pgx@v5.5.5 + x/text@v0.21.0 + x/sync@v0.10.0.

Co-authored-by: e3mrah <catalyst@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 14:21:33 +04:00
github-actions[bot]
fb8303766e deploy: update catalyst images to 587a985 2026-05-14 10:18:12 +00:00