* fix(canvas): canonicalise Job.DependsOn entries with install- prefix; fix invisible edges

  PR #1499 plumbed spec.dependsOn end-to-end and verified deps populate on first event (no /refresh-watch needed). But the openova-flow snapshot composer (flow_snapshot_local.go) emits finish-to-start relationships where fromId = jobs.JobID(deploymentID, dep). Without the "install-" prefix on each dep entry, fromId came out as:

  - `<dep>:hel1-2:seaweedfs` (secondary, missing "install-")
  - `<dep>:gitea` (primary, missing "install-")

  But the FlowNode ids in the snapshot are:

  - `<dep>:install-hel1-2:seaweedfs`
  - `<dep>:install-gitea`

  The FE canvas adapter matches by exact id → every finish-to-start rel points at a non-existent node → 224 rels emitted, 0 edges rendered.

  Caught on prov t103.omani.works (005080699326a7ac, 2026-05-15):

  - curl /v1/flows/.../snapshot → 376 rels total: 152 contains, 224 finish-to-start
  - every finish-to-start fromId malformed
  - canvas: sibling edges invisible across all 135 install Jobs

  Fix in two places:

  - internal/handler/phase1_watch.go (spawnSecondaryRegionWatchers emit): Region-prefix each dep AND inject the "install-" prefix so ev.DependsOn = ["install-<region>:<chart>"] before the bridge receives the event. Symmetric with how ev.Component is constructed.
  - internal/jobs/helmwatch_bridge.go (OnHelmReleaseEvent): Canonicalise every dep entry: if it doesn't already start with JobNamePrefix ("install-"), prepend it. Idempotent on entries that are already canonical (set by the phase1_watch.go path). Covers the primary-region path (bare chart names like "gitea") too: Job.DependsOn now stores "install-gitea", which matches the composer's emitted FromId exactly.

  Tests: go build ./... + go test on internal/jobs + helmwatch + provisioner all green. (Pre-existing TestHandleWhoami_* flake in handler is unrelated.)

* fix(canvas): canonicalise resolved DependsOn too; kill malformed prior values

  Follow-up to PR #1500.
  The canon block ran on the event-carried dependsOn arg, but the 3-tier resolve preferred the existing-store value when non-empty, which for any Job written BEFORE PR #1500 rolled out was malformed (no "install-" prefix). The t103.omani.works snapshot kept emitting 224 finish-to-start rels with malformed fromIds because the existing Job rows held "hel1-2:gitea" entries that the resolve preserved verbatim.

  Fix: after the 3-tier resolve, run a final canonicalisation pass on resolvedDeps so every persisted entry is canonical regardless of whether it came from event-carried (already canon by my prior block) or from existing-store (potentially malformed legacy).

  Note: this fix only takes effect on the NEXT HR state transition for a given Job. HRs already in terminal state (e.g. t103's 135 succeeded HRs) will keep their malformed deps until a new event fires. The loop's next cycle (t104+) writes canonical from event 1.

* fix(provisioner): auto-derive cluster_mesh_name + cluster_mesh_id for multi-region provs

  Caught on prov t104.omani.works (98395b3d9bd9c1aa, 2026-05-15): operator submitted a multi-region body (3 regions cpx52) but omitted ClusterMeshName/ClusterMeshID.

  - catalyst-api defaulted them to "" and 0.
  - Tofu wrote cluster_mesh_name="" + cluster_mesh_id=0 to tfvars.
  - Flux postBuild.substitute rendered cilium-config with cluster.name=default + cluster.id=0.
  - Cilium kvstoremesh refused to start: "ClusterID 0 is reserved"
  - clustermesh-apiserver CrashLoopBackOff 16 restarts.

  No mesh ever formed. Cross-region observability + east-west routing permanently broken.

  Auto-derivation:

  - ClusterMeshName: `<first-fqdn-label>-mesh`, e.g. t105.omani.works → "t105-mesh"
  - ClusterMeshID: (sha256(deploymentID)[:4] as uint32) mod 252 + 1

  Range [1, 252]; main.tf increments for secondaries so the max id any region sees is primary + (regions - 1) ≤ 254. ID 255 is intentionally avoided (Cilium sentinel).
  Operator override still respected: auto-derive only kicks in when both fields are zero/empty AND len(Regions) > 1. Single-region provs stay at "" / 0 (no mesh needed).

  Tested derive helpers against the last 4 prov IDs; all land in valid range:

  - 98395b3d9bd9c1aa → 74 (secondaries 75, 76)
  - 005080699326a7ac → 29 (secondaries 30, 31)
  - 22af2b1120158239 → 139
  - c9df5eed1c1ba6cf → 180

  Build + provisioner unit tests green.

* fix(cloudinit): thread cluster.name + cluster.id into pre-Flux cilium-values.yaml

  t105.omani.works (a6c0f5dfebd63bd0, 2026-05-15) found that PR #1502's catalyst-api auto-derive (cluster_mesh_name=t105-mesh, cluster_mesh_id=99) correctly reached cilium-config, but only AFTER Flux helm-upgraded the release. The pre-Flux Cilium install (cloud-init line 1473) used /var/lib/catalyst/cilium-values.yaml which DIDN'T carry cluster.name or cluster.id, so cilium-agent started with the chart defaults ("default", 0). The Flux upgrade then changed cilium-config but the already-running cilium-agent kept its in-memory cluster.name="default" because it reads the ConfigMap once at startup.

  Downstream consequences observed live on t105:

  - hubble-relay CrashLoopBackOff: "tls: failed to verify certificate: x509: certificate is valid for *.t105-mesh.hubble-grpc.cilium.io, not catalyst-t105-omani-works-cp1.default.hubble-grpc.cilium.io"
  - clustermesh peer announcements use stale "default" identity → cross-region mesh handshakes x509-fail.

  Fix: include cluster.name + cluster.id in the pre-Flux helm install's values file, sourced from the templatefile() vars cluster_mesh_name + cluster_mesh_id (already threaded per-region by main.tf:381-382 and :900-901). Now the first cilium-agent process announces with the correct identity, no helm-upgrade race.
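
The ClusterMeshName / ClusterMeshID auto-derivation described in the fix(provisioner) commit above can be sketched roughly as follows. This is an illustration under stated assumptions, not the provisioner's actual code: the function names `deriveMeshName`, `deriveMeshID` and `meshIdentity` are hypothetical, the sketch hashes the raw bytes of the deployment ID string, and it reads the first four digest bytes as a big-endian uint32. The commit message does not state the byte order or the input encoding, so the IDs this sketch produces may differ from the values listed in the commit.

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
	"strings"
)

// deriveMeshName returns "<first-fqdn-label>-mesh" (hypothetical helper name).
func deriveMeshName(fqdn string) string {
	label, _, _ := strings.Cut(fqdn, ".")
	return label + "-mesh"
}

// deriveMeshID maps a deployment ID into [1, 252].
// ASSUMPTION: the ID string's raw bytes are hashed and the first 4 digest
// bytes are read big-endian; the commit message specifies neither.
func deriveMeshID(deploymentID string) uint32 {
	sum := sha256.Sum256([]byte(deploymentID))
	return binary.BigEndian.Uint32(sum[:4])%252 + 1
}

// meshIdentity derives values only when both fields are unset AND the
// provision spans more than one region; otherwise the operator's values
// (or the single-region "" / 0) are returned unchanged.
func meshIdentity(name string, id uint32, fqdn, deploymentID string, regions int) (string, uint32) {
	if name != "" || id != 0 || regions <= 1 {
		return name, id
	}
	return deriveMeshName(fqdn), deriveMeshID(deploymentID)
}

func main() {
	name, id := meshIdentity("", 0, "t105.omani.works", "a6c0f5dfebd63bd0", 3)
	fmt.Println(name, id >= 1 && id <= 252)
}
```

Capping the primary at 252 leaves headroom for two secondaries (primary + 2 ≤ 254), which keeps every region clear of both the reserved 0 and the avoided 255.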
* docs(sandbox): design docs for the Sandbox product

  Captures the agreed product shape, end-user journeys (developer + Sovereign admin), technical architecture (native agent TUI via xterm.js + WebSocket + PTY, card protocol for mobile, MCP catalogue, four knowledge layers, JetStream/SSE integration), and the conversational-provisioning surface that reuses the same shell with a narrow MCP toolbox as an alternative to the catalyst-ui wizard.

  Status: design only, no implementation. Identifies one prerequisite (long-lived API token carrying org_id claim) with the exact files to extend in core/services/auth and platform/keycloak.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(sovereign-tls): tls-restart Job needs list+watch on deployments/daemonsets

  Caught on prov t110.omani.works (fe09897a1b6b3c1d, 2026-05-15): the cilium-envoy-tls-restart Job stuck Running 10m+ with:

  `W reflector.go:561] failed to list *unstructured.Unstructured: deployments.apps "cilium-operator" is forbidden: User "system:serviceaccount:kube-system:cilium-envoy-tls-restart" cannot list resource "deployments" in API group "apps" in the namespace "kube-system"`

  The Role grants `get` + `patch` but `kubectl rollout status` (which the Job runs after `rollout restart`) does NOT just GET; internally it uses client-go informerwatcher to LIST+WATCH the resource. Without those verbs the informer fails and `rollout status` hangs until activeDeadlineSeconds (900s). The Job never restarts cilium-envoy, console.<fqdn> never serves.

  Fix: add `list` + `watch` to both rules (cilium-operator Deployment + cilium-envoy DaemonSet). Scoped by resourceName, so the SA still can't enumerate or watch other workloads.

---------

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
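
The dependency canonicalisation described in the two fix(canvas) commits above can be illustrated with a minimal Go sketch. The constant name `JobNamePrefix` and its value come from the commit message; the helper names `canonicaliseDeps` and `regionDep` are hypothetical, and the real code in helmwatch_bridge.go and phase1_watch.go may be structured differently.

```go
package main

import (
	"fmt"
	"strings"
)

// JobNamePrefix mirrors the "install-" prefix named in the commit message.
const JobNamePrefix = "install-"

// canonicaliseDeps (hypothetical name) returns a copy of deps in which every
// entry starts with JobNamePrefix. Entries that are already canonical are
// left unchanged, so running it twice gives the same result (idempotent).
func canonicaliseDeps(deps []string) []string {
	out := make([]string, 0, len(deps))
	for _, d := range deps {
		if !strings.HasPrefix(d, JobNamePrefix) {
			d = JobNamePrefix + d
		}
		out = append(out, d)
	}
	return out
}

// regionDep (hypothetical name) builds the secondary-region form
// "install-<region>:<chart>" described for the phase1_watch.go path.
func regionDep(region, chart string) string {
	return JobNamePrefix + region + ":" + chart
}

func main() {
	// A bare primary dep, a legacy region-prefixed dep, and an already-canonical dep.
	deps := []string{"gitea", "hel1-2:seaweedfs", "install-gitea"}
	fmt.Println(canonicaliseDeps(deps))
	fmt.Println(regionDep("hel1-2", "seaweedfs"))
}
```

Because the helper is idempotent, it can run both on the event-carried deps and again on the resolved deps after the 3-tier resolve, which is the shape of the follow-up fix.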
OpenOva Sandbox (design)
Status: Design. Not yet implemented. Created: 2026-05-15.
OpenOva Sandbox is the per-user, per-Organization coding-agent plane that runs inside every OpenOva Sovereign. It hosts long-lived sessions of the agents developers already use (Claude Code, Cursor, Qwen Code, Aider, Opencode) — server-side, cluster-aware, identity-scoped — and surfaces them through a native terminal in the browser plus a card-stream view on mobile, both backed by the same persistent process.
Sandbox is also the conversational front-door to provisioning a brand-new Sovereign: the same shell, scoped to a narrower MCP tool surface, lets a non-technical user talk (text or voice) through standing up their cloud instead of filling in a wizard.
Naming
The chosen name is OpenOva Sandbox. Alternatives we considered:
| Name | Positioning | Why we did not pick it |
|---|---|---|
| Sandbox (chosen) | "The cloud sandbox where your agents do real work" | Plain noun, matches Sovereign / Catalyst style. Inherits the moat directly: a real cloud sandbox per user, not a browser tab. |
| Forge | Active and agentic ("forge production code") | "Forge" is taken by smaller dev tools; trademark friction. |
| Studio | Lineage with Android Studio / Visual Studio / Codespaces | Generic: nothing about the cluster-aware moat is implied in the name. |
Contents
- docs/business-requirements.md: what we are solving, who for, the moat, success criteria.
- docs/user-journey.md: end-to-end wireframe storyboard for the developer (Nova user) and the Sovereign admin, including multi-device handoff and the EventForge build walkthrough.
- docs/architecture.md: technical architecture: native TUI in the browser via xterm.js + WebSocket + PTY, the card protocol for mobile, the MCP server tool catalogue, the four knowledge layers (static / procedural / live / corpus), and exact integration points with the existing OpenOva primitives (vcluster per Org, Keycloak modes, Gitea, marketplace BYOD, JetStream, SSE).
- docs/provisioning-chat.md: the conversational alternative to the catalyst-ui wizard; text + voice; same shell, narrower MCP surface.
What is already there, what we still need
Confirmed against the codebase (2026-05-15):
| Foundation primitive | State | Reference |
|---|---|---|
| Organization CRD (orgs.openova.io/v1) | Shipped | products/catalyst/chart/crds/organization.yaml |
| vcluster per Org | Shipped | core/controllers/organization/internal/gitops/manifests.go |
| Keycloak realm (sovereign-shared vs per-Org SME mode) | Shipped | platform/keycloak/chart/values.yaml, chart/templates/configmap-{sovereign,tenant}-realm.yaml |
| Gitea Org + catalyst-tenant repo auto-provisioned per Org | Shipped | core/controllers/organization/internal/controller/organization_controller.go |
| UserAccess CR → RoleBindings (RBAC fan-out) | Shipped | same controller |
| Marketplace: subdomain + BYO custom domain | Shipped | core/services/domain/handlers/handlers.go (POST /domain/byod), core/marketplace-api/handlers/handlers.go |
| JetStream subject convention catalyst.<domain>.<event> | Shipped (ADR-0001 §6) | core/services/shared/events/nats.go |
| SSE feeds for deployments / cutover / flow / RBAC / K8s / continuum / openova-flow | Shipped (7+ endpoints) | products/catalyst/bootstrap/api/internal/handler/*.go, products/openova-flow/server/internal/api/stream.go |
| Harbor / SeaweedFS at host-cluster scope (multi-Org via projects/buckets) | Shipped (by design — not per-Org instances) | platform/harbor/README.md, platform/seaweedfs/README.md |
The one prerequisite Sandbox needs that does not exist today:
| Gap | What we need | Where to wire |
|---|---|---|
| Long-lived API token carrying org_id claim | A persistent token issued by Keycloak (or core/services/auth) that includes org, groups, and a Sandbox capability set. Today only a 15-minute JWT with {sub, email, role} exists; the tenant-realm Keycloak import has a groups mapper but not an org mapper. | core/services/auth/handlers/handlers.go (token issuance) + platform/keycloak/chart/templates/configmap-tenant-realm.yaml (add org protocolMapper) |
Everything else Sandbox needs is greenfield product work and is described in the linked docs.