openova/products/sandbox
e3mrah 2c3ea44af8
fix(sovereign-tls): tls-restart Job needs list+watch verbs (#1504)
* fix(canvas): canonicalise Job.DependsOn entries with install- prefix — fix invisible edges

PR #1499 plumbed spec.dependsOn end-to-end and verified deps populate
on first event (no /refresh-watch needed). But the openova-flow snapshot
composer (flow_snapshot_local.go) emits finish-to-start relationships
where fromId = jobs.JobID(deploymentID, dep). Without the "install-"
prefix on each dep entry, fromId came out as:

  <dep>:hel1-2:seaweedfs                 (secondary, missing "install-")
  <dep>:gitea                            (primary, missing "install-")

But the FlowNode ids in the snapshot are:

  <dep>:install-hel1-2:seaweedfs
  <dep>:install-gitea

The FE canvas adapter matches by exact id → every finish-to-start rel
points at a non-existent node → 224 rels emitted, 0 edges rendered.

Caught on prov t103.omani.works (005080699326a7ac, 2026-05-15):

  curl /v1/flows/.../snapshot → 376 rels total: 152 contains, 224 finish-to-start
  every finish-to-start fromId malformed
  canvas: sibling edges invisible across all 135 install Jobs

Fix in two places:

  internal/handler/phase1_watch.go (spawnSecondaryRegionWatchers emit):
    Region-prefix each dep AND inject the "install-" prefix so
    ev.DependsOn = ["install-<region>:<chart>"] before the bridge
    receives the event. Symmetric with how ev.Component is constructed.

  internal/jobs/helmwatch_bridge.go (OnHelmReleaseEvent):
    Canonicalise every dep entry: if it doesn't already start with
    JobNamePrefix ("install-"), prepend it. Idempotent on entries
    that already are canonical (set by the phase1_watch.go path).
    Covers the primary-region path (bare chart names like "gitea")
    too — Job.DependsOn now stores "install-gitea", which matches
    the composer's emitted FromId exactly.
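
A minimal sketch of the canonicalisation described above, assuming a hypothetical helper named canonicaliseDeps and a stand-in jobID whose "<deployment>:<name>" shape is inferred from the ids quoted earlier (the real jobs.JobID is not shown here):

```go
package main

import (
	"fmt"
	"strings"
)

// JobNamePrefix mirrors the "install-" prefix named in the commit message.
const JobNamePrefix = "install-"

// canonicaliseDeps is a hypothetical helper, not the actual bridge code.
// It returns a copy of deps in which every entry starts with JobNamePrefix.
// Entries that already carry the prefix are left untouched, so running it
// twice gives the same result as running it once.
func canonicaliseDeps(deps []string) []string {
	out := make([]string, 0, len(deps))
	for _, d := range deps {
		if !strings.HasPrefix(d, JobNamePrefix) {
			d = JobNamePrefix + d
		}
		out = append(out, d)
	}
	return out
}

// jobID is a stand-in for jobs.JobID; the id shape is an assumption.
func jobID(deploymentID, name string) string {
	return deploymentID + ":" + name
}

func main() {
	// One bare primary-region dep and one already-canonical secondary dep.
	deps := canonicaliseDeps([]string{"gitea", "install-hel1-2:seaweedfs"})
	for _, d := range deps {
		fmt.Println(jobID("<dep>", d))
	}
}
```

Because the helper is idempotent, it can run on both the event-carried value and any value read back from the store without double-prefixing.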

Tests: go build ./... + go test on internal/jobs + helmwatch + provisioner
all green. (Pre-existing TestHandleWhoami_* flake in handler is unrelated.)

* fix(canvas): canonicalise resolved DependsOn too — kill malformed prior values

Follow-up to PR #1500. The canon block ran on the event-carried dependsOn
arg, but the 3-tier resolve preferred existing-store value when non-empty
— which for any Job written BEFORE PR #1500 rolled out was malformed
(no "install-" prefix). t103.omani.works snapshot kept emitting 224
finish-to-start rels with malformed fromIds because the existing Job
rows held "hel1-2:gitea" entries that the resolve preserved verbatim.

Fix: after the 3-tier resolve, run a final canonicalisation pass on
resolvedDeps so every persisted entry is canonical regardless of
whether it came from event-carried (already canon by my prior block)
or from existing-store (potentially malformed legacy).

Note: this fix only takes effect on the NEXT HR state transition for a
given Job. HRs already in terminal state (e.g. t103's 135 succeeded HRs)
will keep their malformed deps until a new event fires. The loop's next
cycle (t104+) writes canonical from event 1.

* fix(provisioner): auto-derive cluster_mesh_name + cluster_mesh_id for multi-region provs

Caught on prov t104.omani.works (98395b3d9bd9c1aa, 2026-05-15): operator
submitted a multi-region body (3 regions cpx52) but omitted
ClusterMeshName/ClusterMeshID. catalyst-api defaulted them to "" and 0.
Tofu wrote cluster_mesh_name="" + cluster_mesh_id=0 to tfvars. Flux
postBuild.substitute rendered cilium-config with cluster.name=default +
cluster.id=0. Cilium kvstoremesh refused to start:
  "ClusterID 0 is reserved"
clustermesh-apiserver CrashLoopBackOff 16 restarts. No mesh ever formed.
Cross-region observability + east-west routing permanently broken.

Auto-derivation:

  ClusterMeshName: <first-fqdn-label>-mesh
    e.g. t105.omani.works → "t105-mesh"

  ClusterMeshID:  (sha256(deploymentID)[:4] as uint32) mod 252 + 1
    Range [1, 252]; main.tf increments for secondaries, so the max id
    any region sees is primary + (regions - 1) ≤ 254 for up to 3
    regions. ID 255 is intentionally avoided (Cilium sentinel).

Operator override still respected — auto-derive only kicks in when
both fields are zero/empty AND len(Regions) > 1. Single-region provs
stay at "" / 0 (no mesh needed).
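
A sketch of the derivation and the override rule, under stated assumptions: the function names are hypothetical, the deployment ID is hashed as its string form, and the first four digest bytes are read big-endian. The actual helpers may differ on either point, so the IDs this sketch prints are not expected to match the values listed for real provs.

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
	"strings"
)

// deriveMeshName returns "<first-fqdn-label>-mesh".
func deriveMeshName(fqdn string) string {
	label, _, _ := strings.Cut(fqdn, ".")
	return label + "-mesh"
}

// deriveMeshID maps a deployment ID into [1, 252].
// Assumption: big-endian read of the first 4 bytes of sha256(deploymentID).
func deriveMeshID(deploymentID string) uint32 {
	sum := sha256.Sum256([]byte(deploymentID))
	return binary.BigEndian.Uint32(sum[:4])%252 + 1
}

// applyMeshDefaults derives values only when both fields are unset AND the
// prov is multi-region; otherwise the operator-supplied values pass through.
func applyMeshDefaults(name string, id uint32, fqdn, deploymentID string, regions int) (string, uint32) {
	if name != "" || id != 0 || regions <= 1 {
		return name, id
	}
	return deriveMeshName(fqdn), deriveMeshID(deploymentID)
}

func main() {
	name, id := applyMeshDefaults("", 0, "t105.omani.works", "a6c0f5dfebd63bd0", 3)
	fmt.Println(name, id)

	// Single-region prov: nothing is derived.
	name, id = applyMeshDefaults("", 0, "t105.omani.works", "a6c0f5dfebd63bd0", 1)
	fmt.Printf("%q %d\n", name, id)
}
```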

Tested derive helpers against the last 4 prov IDs — all land in valid
range:
  98395b3d9bd9c1aa → 74 (secondaries 75, 76)
  005080699326a7ac → 29 (secondaries 30, 31)
  22af2b1120158239 → 139
  c9df5eed1c1ba6cf → 180

Build + provisioner unit tests green.

* fix(cloudinit): thread cluster.name + cluster.id into pre-Flux cilium-values.yaml

t105.omani.works (a6c0f5dfebd63bd0, 2026-05-15) found that PR #1502's
catalyst-api auto-derive (cluster_mesh_name=t105-mesh, cluster_mesh_id=99)
correctly reached cilium-config — but only AFTER Flux helm-upgraded the
release. The pre-Flux Cilium install (cloud-init line 1473) used
/var/lib/catalyst/cilium-values.yaml which DIDN'T carry cluster.name or
cluster.id, so cilium-agent started with the chart defaults
("default", 0). The Flux upgrade then changed cilium-config but the
already-running cilium-agent kept its in-memory cluster.name="default"
because it reads ConfigMap once at startup.

Downstream consequences observed live on t105:
  hubble-relay CrashLoopBackOff:
    "tls: failed to verify certificate: x509: certificate is valid for
     *.t105-mesh.hubble-grpc.cilium.io, not catalyst-t105-omani-works-cp1
     .default.hubble-grpc.cilium.io"
  clustermesh peer announcements use stale "default" identity →
  cross-region mesh handshakes x509-fail.

Fix: include cluster.name + cluster.id in the pre-Flux helm install's
values file, sourced from the templatefile() vars cluster_mesh_name +
cluster_mesh_id (already threaded per-region by main.tf:381-382 and
:900-901). Now the first cilium-agent process announces with the
correct identity, no helm-upgrade race.

* docs(sandbox): design docs for the Sandbox product

Captures the agreed product shape, end-user journeys (developer +
Sovereign admin), technical architecture (native agent TUI via
xterm.js + WebSocket + PTY, card protocol for mobile, MCP catalogue,
four knowledge layers, JetStream/SSE integration), and the
conversational-provisioning surface that reuses the same shell with a
narrow MCP toolbox as an alternative to the catalyst-ui wizard.

Status: design only — no implementation. Identifies one prerequisite
(long-lived API token carrying org_id claim) with the exact files to
extend in core/services/auth and platform/keycloak.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(sovereign-tls): tls-restart Job needs list+watch on deployments/daemonsets

Caught on prov t110.omani.works (fe09897a1b6b3c1d, 2026-05-15) — the
cilium-envoy-tls-restart Job stuck Running 10m+ with:

  W reflector.go:561] failed to list *unstructured.Unstructured:
    deployments.apps "cilium-operator" is forbidden: User
    "system:serviceaccount:kube-system:cilium-envoy-tls-restart"
    cannot list resource "deployments" in API group "apps" in the
    namespace "kube-system"

The Role grants `get` + `patch` but `kubectl rollout status` (which the
Job runs after `rollout restart`) does NOT just GET — internally it
uses client-go informerwatcher to LIST+WATCH the resource. Without
those verbs the informer fails and `rollout status` hangs until
activeDeadlineSeconds (900s). The Job never restarts cilium-envoy,
console.<fqdn> never serves.

Fix: add `list` + `watch` to both rules (cilium-operator Deployment
+ cilium-envoy DaemonSet). Scoped by resourceName, so the SA still
can't enumerate or watch other workloads.
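
The effect of the verb change can be illustrated with a toy model of a resourceName-scoped Role. This is a simplified sketch, not the Kubernetes authorizer: the rule type and the allowed and roleRules helpers are hypothetical, and the model assumes each request names a single object (for list and watch, real clients have to send a metadata.name field selector for a resourceName-scoped rule to match).

```go
package main

import "fmt"

// rule is a simplified stand-in for one RBAC PolicyRule.
type rule struct {
	verbs         []string
	resource      string
	resourceNames []string
}

func contains(xs []string, x string) bool {
	for _, v := range xs {
		if v == x {
			return true
		}
	}
	return false
}

// allowed reports whether any rule grants verb on resource/name.
func allowed(rules []rule, verb, resource, name string) bool {
	for _, r := range rules {
		if r.resource == resource && contains(r.verbs, verb) && contains(r.resourceNames, name) {
			return true
		}
	}
	return false
}

// roleRules builds the two rules described above with the given verbs.
func roleRules(verbs ...string) []rule {
	return []rule{
		{verbs: verbs, resource: "deployments", resourceNames: []string{"cilium-operator"}},
		{verbs: verbs, resource: "daemonsets", resourceNames: []string{"cilium-envoy"}},
	}
}

func main() {
	before := roleRules("get", "patch")
	after := roleRules("get", "patch", "list", "watch")

	// rollout restart needs patch; rollout status needs list and watch.
	for _, verb := range []string{"patch", "list", "watch"} {
		fmt.Printf("%-5s deployments/cilium-operator before=%t after=%t\n",
			verb,
			allowed(before, verb, "deployments", "cilium-operator"),
			allowed(after, verb, "deployments", "cilium-operator"))
	}

	// Other workloads stay out of reach even with the wider verb set.
	fmt.Println("list deployments/coredns after:", allowed(after, "list", "deployments", "coredns"))
}
```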

---------

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-15 21:02:37 +04:00

OpenOva Sandbox (design)

Status: Design. Not yet implemented. Created: 2026-05-15.

OpenOva Sandbox is the per-user, per-Organization coding-agent plane that runs inside every OpenOva Sovereign. It hosts long-lived sessions of the agents developers already use (Claude Code, Cursor, Qwen Code, Aider, Opencode) — server-side, cluster-aware, identity-scoped — and surfaces them through a native terminal in the browser plus a card-stream view on mobile, both backed by the same persistent process.

Sandbox is also the conversational front-door to provisioning a brand-new Sovereign: the same shell, scoped to a narrower MCP tool surface, lets a non-technical user talk (text or voice) through standing up their cloud instead of filling in a wizard.

Naming

The chosen name is OpenOva Sandbox. Alternatives we considered:

| Name | Positioning | Why we did not pick it |
| --- | --- | --- |
| Sandbox (chosen) | "The cloud sandbox where your agents do real work" | Plain noun, matches Sovereign / Catalyst style. Inherits the moat directly: a real cloud sandbox per user, not a browser tab. |
| Forge | Active and agentic ("forge production code") | "Forge" is taken by smaller dev tools; trademark friction. |
| Studio | Lineage with Android/Visual Studio / Codespaces | Generic: nothing about the cluster-aware moat is implied in the name. |

Contents

  • docs/business-requirements.md — what we are solving, who for, the moat, success criteria.
  • docs/user-journey.md — end-to-end wireframe storyboard for the developer (Nova user) and the Sovereign admin, including multi-device handoff and the EventForge build walkthrough.
  • docs/architecture.md — technical architecture: native TUI in the browser via xterm.js + WebSocket + PTY, the card protocol for mobile, the MCP server tool catalogue, the four knowledge layers (static / procedural / live / corpus), and exact integration points with the existing OpenOva primitives (vcluster per Org, Keycloak modes, Gitea, marketplace BYOD, JetStream, SSE).
  • docs/provisioning-chat.md — the conversational alternative to the catalyst-ui wizard; text + voice; same shell, narrower MCP surface.

What is already there, what we still need

Confirmed against the codebase (2026-05-15):

| Foundation primitive | State | Reference |
| --- | --- | --- |
| Organization CRD (`orgs.openova.io/v1`) | Shipped | `products/catalyst/chart/crds/organization.yaml` |
| vcluster per Org | Shipped | `core/controllers/organization/internal/gitops/manifests.go` |
| Keycloak realm (sovereign-shared vs per-Org SME mode) | Shipped | `platform/keycloak/chart/values.yaml`, `chart/templates/configmap-{sovereign,tenant}-realm.yaml` |
| Gitea Org + catalyst-tenant repo auto-provisioned per Org | Shipped | `core/controllers/organization/internal/controller/organization_controller.go` |
| UserAccess CR → RoleBindings (RBAC fan-out) | Shipped | same controller |
| Marketplace: subdomain + BYO custom domain | Shipped | `core/services/domain/handlers/handlers.go` (`POST /domain/byod`), `core/marketplace-api/handlers/handlers.go` |
| JetStream subject convention `catalyst.<domain>.<event>` | Shipped (ADR-0001 §6) | `core/services/shared/events/nats.go` |
| SSE feeds for deployments / cutover / flow / RBAC / K8s / continuum / openova-flow | Shipped (7+ endpoints) | `products/catalyst/bootstrap/api/internal/handler/*.go`, `products/openova-flow/server/internal/api/stream.go` |
| Harbor / SeaweedFS at host-cluster scope (multi-Org via projects/buckets) | Shipped (by design, not per-Org instances) | `platform/harbor/README.md`, `platform/seaweedfs/README.md` |

The one prerequisite Sandbox needs that does not exist today:

| Gap | What we need | Where to wire |
| --- | --- | --- |
| Long-lived API token carrying `org_id` claim | A persistent token issued by Keycloak (or `core/services/auth`) that includes org, groups, and a Sandbox capability set. Today only a 15-minute JWT with `{sub, email, role}` exists; the tenant-realm Keycloak import has a groups mapper but not an org mapper. | `core/services/auth/handlers/handlers.go` (token issuance) + `platform/keycloak/chart/templates/configmap-tenant-realm.yaml` (add org protocolMapper) |
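
To make the gap concrete, the sketch below contrasts the claim set that exists today with one possible shape for the long-lived token. Only `sub`, `email`, `role`, `org_id`, and the presence of groups come from the description above; the struct names, the `sandbox_capabilities` claim name, and all sample values are assumptions for illustration, not a decided schema.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// currentClaims models the 15-minute JWT payload described above.
type currentClaims struct {
	Sub   string `json:"sub"`
	Email string `json:"email"`
	Role  string `json:"role"`
}

// sandboxClaims is a hypothetical payload for the long-lived token.
// The "sandbox_capabilities" claim name is an assumption.
type sandboxClaims struct {
	Sub          string   `json:"sub"`
	Email        string   `json:"email"`
	OrgID        string   `json:"org_id"`
	Groups       []string `json:"groups"`
	Capabilities []string `json:"sandbox_capabilities"`
}

// encode renders a claim set as compact JSON.
func encode(v any) string {
	b, err := json.Marshal(v)
	if err != nil {
		panic(err)
	}
	return string(b)
}

func main() {
	fmt.Println(encode(currentClaims{Sub: "u-1", Email: "dev@example.com", Role: "member"}))
	fmt.Println(encode(sandboxClaims{
		Sub:          "u-1",
		Email:        "dev@example.com",
		OrgID:        "acme",
		Groups:       []string{"developers"},
		Capabilities: []string{"session:create"},
	}))
}
```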

Everything else Sandbox needs is greenfield product work and is described in the linked docs.