e3mrah 2c3ea44af8 fix(sovereign-tls): tls-restart Job needs list+watch verbs (#1504)	2026-05-15 21:02:37 +04:00
docs	fix(sovereign-tls): tls-restart Job needs list+watch verbs (#1504 )	2026-05-15 21:02:37 +04:00
README.md	fix(sovereign-tls): tls-restart Job needs list+watch verbs (#1504 )	2026-05-15 21:02:37 +04:00

fix(sovereign-tls): tls-restart Job needs list+watch verbs (#1504)

* fix(canvas): canonicalise Job.DependsOn entries with install- prefix — fix invisible edges

PR #1499 plumbed spec.dependsOn end-to-end and verified deps populate
on first event (no /refresh-watch needed). But the openova-flow snapshot
composer (flow_snapshot_local.go) emits finish-to-start relationships
where fromId = jobs.JobID(deploymentID, dep). Without the "install-"
prefix on each dep entry, fromId came out as:

  <dep>:hel1-2:seaweedfs                 (secondary, missing "install-")
  <dep>:gitea                            (primary, missing "install-")

But the FlowNode ids in the snapshot are:

  <dep>:install-hel1-2:seaweedfs
  <dep>:install-gitea

The FE canvas adapter matches by exact id → every finish-to-start rel
points at a non-existent node → 224 rels emitted, 0 edges rendered.

Caught on prov t103.omani.works (005080699326a7ac, 2026-05-15):

  curl /v1/flows/.../snapshot → 376 rels total: 152 contains, 224 finish-to-start
  every finish-to-start fromId malformed
  canvas: sibling edges invisible across all 135 install Jobs

Fix in two places:

  internal/handler/phase1_watch.go (spawnSecondaryRegionWatchers emit):
    Region-prefix each dep AND inject the "install-" prefix so
    ev.DependsOn = ["install-<region>:<chart>"] before the bridge
    receives the event. Symmetric with how ev.Component is constructed.

  internal/jobs/helmwatch_bridge.go (OnHelmReleaseEvent):
    Canonicalise every dep entry: if it doesn't already start with
    JobNamePrefix ("install-"), prepend it. Idempotent on entries
    that already are canonical (set by the phase1_watch.go path).
    Covers the primary-region path (bare chart names like "gitea")
    too — Job.DependsOn now stores "install-gitea", which matches
    the composer's emitted FromId exactly.
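
The canonicalisation described above can be sketched as follows. This is a minimal illustration, not the code in helmwatch_bridge.go: the function names `canonicaliseDeps` and `regionDep` are hypothetical, and only the "install-" prefix and the `install-<region>:<chart>` shape come from this message.

```go
package main

import (
	"fmt"
	"strings"
)

// JobNamePrefix mirrors the "install-" prefix named in the commit message.
const JobNamePrefix = "install-"

// canonicaliseDeps (hypothetical name) returns a copy of deps in which every
// entry starts with JobNamePrefix. Entries that already carry the prefix are
// left unchanged, so applying the function twice gives the same result.
func canonicaliseDeps(deps []string) []string {
	out := make([]string, 0, len(deps))
	for _, d := range deps {
		if !strings.HasPrefix(d, JobNamePrefix) {
			d = JobNamePrefix + d
		}
		out = append(out, d)
	}
	return out
}

// regionDep (hypothetical name) builds the secondary-region form
// "install-<region>:<chart>".
func regionDep(region, chart string) string {
	return JobNamePrefix + region + ":" + chart
}

func main() {
	// Bare primary dep, bare secondary dep, and an already-canonical entry.
	deps := []string{"gitea", "hel1-2:seaweedfs", "install-gitea"}
	fmt.Println(canonicaliseDeps(deps))
	fmt.Println(regionDep("hel1-2", "seaweedfs"))
}
```

Because the prefix check makes the pass idempotent, the same helper could in principle run on both the event-carried and the stored values without double-prefixing.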

Tests: go build ./... + go test on internal/jobs + helmwatch + provisioner
all green. (Pre-existing TestHandleWhoami_* flake in handler is unrelated.)

* fix(canvas): canonicalise resolved DependsOn too — kill malformed prior values

Follow-up to PR #1500. The canon block ran on the event-carried dependsOn
arg, but the 3-tier resolve preferred existing-store value when non-empty
— which for any Job written BEFORE PR #1500 rolled out was malformed
(no "install-" prefix). t103.omani.works snapshot kept emitting 224
finish-to-start rels with malformed fromIds because the existing Job
rows held "hel1-2:gitea" entries that the resolve preserved verbatim.

Fix: after the 3-tier resolve, run a final canonicalisation pass on
resolvedDeps so every persisted entry is canonical regardless of
whether it came from event-carried (already canon by my prior block)
or from existing-store (potentially malformed legacy).

Note: this fix only takes effect on the NEXT HR state transition for a
given Job. HRs already in terminal state (e.g. t103's 135 succeeded HRs)
will keep their malformed deps until a new event fires. The loop's next
cycle (t104+) writes canonical from event 1.

* fix(provisioner): auto-derive cluster_mesh_name + cluster_mesh_id for multi-region provs

Caught on prov t104.omani.works (98395b3d9bd9c1aa, 2026-05-15): operator
submitted a multi-region body (3 regions cpx52) but omitted
ClusterMeshName/ClusterMeshID. catalyst-api defaulted them to "" and 0.
Tofu wrote cluster_mesh_name="" + cluster_mesh_id=0 to tfvars. Flux
postBuild.substitute rendered cilium-config with cluster.name=default +
cluster.id=0. Cilium kvstoremesh refused to start:
  "ClusterID 0 is reserved"
clustermesh-apiserver CrashLoopBackOff 16 restarts. No mesh ever formed.
Cross-region observability + east-west routing permanently broken.

Auto-derivation:

  ClusterMeshName: <first-fqdn-label>-mesh
    e.g. t105.omani.works → "t105-mesh"

  ClusterMeshID:  (sha256(deploymentID)[:4] as uint32) mod 252 + 1
    Range [1, 252]; main.tf increments for secondaries so the max id
    any region sees is primary + (regions - 1) ≤ 254. ID 255 is
    intentionally avoided (Cilium sentinel).

Operator override still respected — auto-derive only kicks in when
both fields are zero/empty AND len(Regions) > 1. Single-region provs
stay at "" / 0 (no mesh needed).

Tested derive helpers against the last 4 prov IDs — all land in valid
range:
  98395b3d9bd9c1aa → 74 (secondaries 75, 76)
  005080699326a7ac → 29 (secondaries 30, 31)
  22af2b1120158239 → 139
  c9df5eed1c1ba6cf → 180
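
A rough sketch of the derivation rule, under stated assumptions: the helper names are hypothetical, the deployment ID is hashed as its string bytes, and the first four digest bytes are read big-endian. The message does not specify either detail, so this sketch is not guaranteed to reproduce the IDs listed above.

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
	"strings"
)

// deriveMeshName returns "<first-fqdn-label>-mesh".
func deriveMeshName(fqdn string) string {
	label, _, _ := strings.Cut(fqdn, ".")
	return label + "-mesh"
}

// deriveMeshID maps a deployment ID into [1, 252].
// Assumption: the ID string is hashed as-is and the first 4 digest bytes
// are interpreted as a big-endian uint32.
func deriveMeshID(deploymentID string) uint32 {
	sum := sha256.Sum256([]byte(deploymentID))
	return binary.BigEndian.Uint32(sum[:4])%252 + 1
}

// resolveMesh applies the override rule: derive only when both fields are
// empty/zero AND there is more than one region.
func resolveMesh(name string, id uint32, regions int, fqdn, deploymentID string) (string, uint32) {
	if name == "" && id == 0 && regions > 1 {
		return deriveMeshName(fqdn), deriveMeshID(deploymentID)
	}
	return name, id
}

func main() {
	name, id := resolveMesh("", 0, 3, "t105.omani.works", "a6c0f5dfebd63bd0")
	fmt.Println(name, id >= 1 && id <= 252)
}
```

With the primary ID capped at 252, adding one per secondary keeps a three-region deployment at or below 254.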

Build + provisioner unit tests green.

* fix(cloudinit): thread cluster.name + cluster.id into pre-Flux cilium-values.yaml

Prov t105.omani.works (a6c0f5dfebd63bd0, 2026-05-15) showed that PR #1502's
catalyst-api auto-derive (cluster_mesh_name=t105-mesh, cluster_mesh_id=99)
correctly reached cilium-config, but only AFTER Flux helm-upgraded the
release. The pre-Flux Cilium install (cloud-init line 1473) used
/var/lib/catalyst/cilium-values.yaml, which DIDN'T carry cluster.name or
cluster.id, so cilium-agent started with the chart defaults
("default", 0). The Flux upgrade then changed cilium-config, but the
already-running cilium-agent kept its in-memory cluster.name="default"
because it reads the ConfigMap only once, at startup.

Downstream consequences observed live on t105:
  hubble-relay CrashLoopBackOff:
    "tls: failed to verify certificate: x509: certificate is valid for
     *.t105-mesh.hubble-grpc.cilium.io, not catalyst-t105-omani-works-cp1
     .default.hubble-grpc.cilium.io"
  clustermesh peer announcements use stale "default" identity →
  cross-region mesh handshakes x509-fail.

Fix: include cluster.name + cluster.id in the pre-Flux helm install's
values file, sourced from the templatefile() vars cluster_mesh_name +
cluster_mesh_id (already threaded per-region by main.tf:381-382 and
:900-901). Now the first cilium-agent process announces with the
correct identity, no helm-upgrade race.
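
For illustration, the fragment that the pre-Flux values file needs to carry looks roughly like the output of the sketch below. The function name is hypothetical and the real file is rendered by templatefile(), not Go; only the `cluster.name` / `cluster.id` keys and the two variable names come from this message.

```go
package main

import "fmt"

// ciliumClusterValues (hypothetical helper) renders the "cluster" block of a
// Cilium Helm values file from the per-region mesh name and mesh ID.
func ciliumClusterValues(meshName string, meshID int) string {
	// %q emits a double-quoted string, which is valid YAML for plain ASCII names.
	return fmt.Sprintf("cluster:\n  name: %q\n  id: %d\n", meshName, meshID)
}

func main() {
	// Values taken from the t105 example in this message.
	fmt.Print(ciliumClusterValues("t105-mesh", 99))
}
```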

* docs(sandbox): design docs for the Sandbox product

Captures the agreed product shape, end-user journeys (developer +
Sovereign admin), technical architecture (native agent TUI via
xterm.js + WebSocket + PTY, card protocol for mobile, MCP catalogue,
four knowledge layers, JetStream/SSE integration), and the
conversational-provisioning surface that reuses the same shell with a
narrow MCP toolbox as an alternative to the catalyst-ui wizard.

Status: design only — no implementation. Identifies one prerequisite
(long-lived API token carrying org_id claim) with the exact files to
extend in core/services/auth and platform/keycloak.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(sovereign-tls): tls-restart Job needs list+watch on deployments/daemonsets

Caught on prov t110.omani.works (fe09897a1b6b3c1d, 2026-05-15) — the
cilium-envoy-tls-restart Job stuck Running 10m+ with:

  W reflector.go:561] failed to list *unstructured.Unstructured:
    deployments.apps "cilium-operator" is forbidden: User
    "system:serviceaccount:kube-system:cilium-envoy-tls-restart"
    cannot list resource "deployments" in API group "apps" in the
    namespace "kube-system"

The Role grants `get` + `patch`, but `kubectl rollout status` (which the
Job runs after `rollout restart`) does NOT just GET: internally it
uses client-go's informerwatcher to LIST+WATCH the resource. Without
those verbs the informer fails and `rollout status` hangs until
activeDeadlineSeconds (900s). The Job never restarts cilium-envoy, so
console.<fqdn> never serves.

Fix: add `list` + `watch` to both rules (cilium-operator Deployment
+ cilium-envoy DaemonSet). Both rules stay scoped by resourceNames, so
the SA still can't enumerate or watch other workloads.
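
The verb gap can be illustrated with a deliberately simplified model of a Role rule. This is a sketch only: `policyRule` and `missingVerbs` are hypothetical stand-ins, not the Kubernetes RBAC authorizer, and the model ignores details such as the `metadata.name` field selector that real list/watch requests need when a rule is restricted by resourceNames.

```go
package main

import "fmt"

// policyRule is a simplified stand-in for one rule of a Kubernetes Role.
type policyRule struct {
	APIGroups     []string
	Resources     []string
	ResourceNames []string
	Verbs         []string
}

func contains(list []string, s string) bool {
	for _, v := range list {
		if v == s {
			return true
		}
	}
	return false
}

// allows reports whether the rule permits verb on the named resource.
// An empty ResourceNames list means "any name" in this model.
func allows(r policyRule, group, resource, name, verb string) bool {
	if !contains(r.APIGroups, group) || !contains(r.Resources, resource) || !contains(r.Verbs, verb) {
		return false
	}
	return len(r.ResourceNames) == 0 || contains(r.ResourceNames, name)
}

// missingVerbs lists the needed verbs that the rule does not grant.
func missingVerbs(r policyRule, group, resource, name string, needed []string) []string {
	var missing []string
	for _, v := range needed {
		if !allows(r, group, resource, name, v) {
			missing = append(missing, v)
		}
	}
	return missing
}

func main() {
	// rollout restart patches the object; rollout status lists and watches it.
	needed := []string{"get", "patch", "list", "watch"}
	before := policyRule{
		APIGroups:     []string{"apps"},
		Resources:     []string{"deployments"},
		ResourceNames: []string{"cilium-operator"},
		Verbs:         []string{"get", "patch"},
	}
	after := before
	after.Verbs = []string{"get", "list", "watch", "patch"}

	fmt.Println(missingVerbs(before, "apps", "deployments", "cilium-operator", needed))
	fmt.Println(missingVerbs(after, "apps", "deployments", "cilium-operator", needed))
}
```

In this model the pre-fix rule is missing `list` and `watch`, while the post-fix rule grants all four verbs yet still denies any other workload name.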

---------

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

README.md

OpenOva Sandbox (design)

Status: Design. Not yet implemented. Created: 2026-05-15.

OpenOva Sandbox is the per-user, per-Organization coding-agent plane that runs inside every OpenOva Sovereign. It hosts long-lived sessions of the agents developers already use (Claude Code, Cursor, Qwen Code, Aider, Opencode) — server-side, cluster-aware, identity-scoped — and surfaces them through a native terminal in the browser plus a card-stream view on mobile, both backed by the same persistent process.

Sandbox is also the conversational front-door to provisioning a brand-new Sovereign: the same shell, scoped to a narrower MCP tool surface, lets a non-technical user talk (text or voice) through standing up their cloud instead of filling in a wizard.

Naming

The chosen name is OpenOva Sandbox. Alternatives we considered:

| Name | Positioning | Why we did not pick it |
| --- | --- | --- |
| Sandbox (chosen) | "The cloud sandbox where your agents do real work" | Plain noun, matches `Sovereign` / `Catalyst` style. Inherits the moat directly: a real cloud sandbox per user, not a browser tab. |
| Forge | Active and agentic ("forge production code") | "Forge" is taken by smaller dev tools; trademark friction. |
| Studio | Lineage with Android/Visual Studio / Codespaces | Generic: nothing about the cluster-aware moat is implied in the name. |

Contents

docs/business-requirements.md: what we are solving, who for, the moat, success criteria.
docs/user-journey.md: end-to-end wireframe storyboard for the developer (Nova user) and the Sovereign admin, including multi-device handoff and the EventForge build walkthrough.
docs/architecture.md: technical architecture, covering the native TUI in the browser via xterm.js + WebSocket + PTY, the card protocol for mobile, the MCP server tool catalogue, the four knowledge layers (static / procedural / live / corpus), and exact integration points with the existing OpenOva primitives (vcluster per Org, Keycloak modes, Gitea, marketplace BYOD, JetStream, SSE).
docs/provisioning-chat.md: the conversational alternative to the catalyst-ui wizard; text + voice; same shell, narrower MCP surface.

What is already there, what we still need

Confirmed against the codebase (2026-05-15):

| Foundation primitive | State | Reference |
| --- | --- | --- |
| `Organization` CRD (`orgs.openova.io/v1`) | Shipped | `products/catalyst/chart/crds/organization.yaml` |
| vcluster per Org | Shipped | `core/controllers/organization/internal/gitops/manifests.go` |
| Keycloak realm (sovereign-shared vs per-Org SME mode) | Shipped | `platform/keycloak/chart/values.yaml`, `chart/templates/configmap-{sovereign,tenant}-realm.yaml` |
| Gitea Org + `catalyst-tenant` repo auto-provisioned per Org | Shipped | `core/controllers/organization/internal/controller/organization_controller.go` |
| UserAccess CR → RoleBindings (RBAC fan-out) | Shipped | same controller |
| Marketplace: subdomain + BYO custom domain | Shipped | `core/services/domain/handlers/handlers.go` (`POST /domain/byod`), `core/marketplace-api/handlers/handlers.go` |
| JetStream subject convention `catalyst.<domain>.<event>` | Shipped (ADR-0001 §6) | `core/services/shared/events/nats.go` |
| SSE feeds for deployments / cutover / flow / RBAC / K8s / continuum / openova-flow | Shipped (7+ endpoints) | `products/catalyst/bootstrap/api/internal/handler/*.go`, `products/openova-flow/server/internal/api/stream.go` |
| Harbor / SeaweedFS at host-cluster scope (multi-Org via projects/buckets) | Shipped (by design, not per-Org instances) | `platform/harbor/README.md`, `platform/seaweedfs/README.md` |

The one prerequisite Sandbox needs that does not exist today:

| Gap | What we need | Where to wire |
| --- | --- | --- |
| Long-lived API token carrying `org_id` claim | A persistent token issued by Keycloak (or `core/services/auth`) that includes `org`, `groups`, and a Sandbox capability set. Today only a 15-minute JWT with `{sub, email, role}` exists; the tenant-realm Keycloak import has a `groups` mapper but not an `org` mapper. | `core/services/auth/handlers/handlers.go` (token issuance) + `platform/keycloak/chart/templates/configmap-tenant-realm.yaml` (add `org` protocolMapper) |

Everything else Sandbox needs is greenfield product work and is described in the linked docs.
