PR #1637 shipped GET /api/v1/sandbox/sessions returning only the spec
the handler authored — Status was projected from `.status.phase` but
every other controller-managed field (sessions count, storage usage,
30-day spend, preview URLs, failure conditions) was ignored. The
catalyst-ui Sandbox surface had nothing to render past the FE-side
"queued" pill.
This wires the projection to read `.status.{phase,sessions,storageUsed,
spend30d,previews,conditions}` and surfaces them on the
list/get wire shape:
- `phase` (raw, Pending|Provisioning|Ready|Failed) alongside the
existing FE-projected `status` (pending|running|stopped|failed)
- `sessions`, `storageUsed`, `spend30d` — operator-visible quotas
- `previews[]` — one preview URL per PR/branch (skips rows missing
URL; coerces float64↔int64 from apiserver JSON round-trips)
- `conditions[]` — Type/Status/Reason/Message tuples verbatim, so
the FE can render TokenMintFailed / GitopsWriteFailed /
ManifestRenderFailed / OwnerEmailInvalid / NoAllowedChannels
inline instead of a generic red pill
Phase→FE-status mapping unchanged (matches sandbox.api.ts:
normalizeStatus). `mapSandboxStatus` refactored from
`(u *Unstructured)` to `(rawPhase string)` so the new
`readSandboxPhase` helper reads the field exactly once per item.
Added handler-package tests pinning the projection contract:
- StatusReflection — happy path, dropped-malformed rows
- PhaseProjection — every CRD phase → FE status
- FailedSurfacesConditions — Failed + TokenMintFailed visible
- NilInputDoesNotPanic — empty-slice defaults
- PreviewFloat64Coercion — apiserver JSON round-trip safety
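A minimal sketch of the preview-row projection described above, not the handler's actual code. The `Preview` type, the `prNumber` / `url` field names and both helper names are assumptions made for illustration; only the behaviour (skip rows without a URL, accept float64 or int64 numbers, never return a nil slice) comes from this message.

```go
package main

// Preview is a hypothetical wire shape for one `.status.previews` entry.
type Preview struct {
	PRNumber int64  `json:"prNumber"`
	URL      string `json:"url"`
}

// asInt64 accepts int64/int (typed clients) and float64 (numbers decoded by
// encoding/json into interface{}) and reports whether the coercion worked.
func asInt64(v interface{}) (int64, bool) {
	switch n := v.(type) {
	case int64:
		return n, true
	case int:
		return int64(n), true
	case float64:
		// Reject fractional values instead of silently truncating them.
		if n != float64(int64(n)) {
			return 0, false
		}
		return int64(n), true
	default:
		return 0, false
	}
}

// readPreviews projects raw status rows, dropping rows that are malformed
// or have no URL. It always returns a non-nil slice.
func readPreviews(raw []interface{}) []Preview {
	out := make([]Preview, 0, len(raw))
	for _, item := range raw {
		row, ok := item.(map[string]interface{})
		if !ok {
			continue
		}
		url, _ := row["url"].(string)
		if url == "" {
			continue
		}
		pr, _ := asInt64(row["prNumber"]) // missing or invalid number stays 0
		out = append(out, Preview{PRNumber: pr, URL: url})
	}
	return out
}
```

Returning an empty slice rather than nil keeps the JSON output as `[]`, which is the kind of default the NilInputDoesNotPanic test pins.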
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wave 15 #1668 added the annotation but used default-on, which trips the
empty-render guard because the chart's resources are all gated on
.Values.enabled (default false). Flip to default-off so the smoke render
skips the chart per design.
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR #1621 shipped the SandboxSession xterm.js host with an "API pending"
placeholder banner. PR #1641 + #1657 wired the BE (sandbox-controller
renders the HTTPRoute on sandbox.<sov-fqdn>; pty-server exposes
WS /sessions/{id}/attach). This PR replaces the placeholder with a real
adapter:
- stdin : term.onData -> ws.send (TextEncoder binary frame)
- stdout : ws.onmessage -> term.write (ArrayBuffer / Uint8Array / Blob / string)
- resize : window resize -> fit.fit() -> POST sandbox.<sov-fqdn>/sessions/{id}/resize
- replay : pty-server ships the ring buffer as the first binary frame; the
generic onmessage path writes it verbatim, no special case
- reconnect: on close / error, schedule a retry with exponential backoff
(1s, 2s, 4s, 8s, 16s, 30s ceiling — same shape as
useComplianceStream). Connection banner reflects
connecting / connected / reconnecting / closed / idle.
Design-system inheritance: PortalShell wrapper unchanged, CSS-variable
colours throughout, amber for connecting/reconnecting and rose for
disconnected (the same shades the rest of the Sovereign Console uses).
The back-to-landing affordance the e2e suite asserts on is preserved.
Test seams kept: disableTerminal still skips xterm.js mount under
jsdom, plus new websocketFactory / resizeFetcher / reconnectBackoffMs /
disableReconnect props so unit tests can exercise the WS pump without a
real socket or wall-clock backoff.
npx tsc --noEmit clean on the full UI project.
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wires four real Stripe handlers in openova-sandbox-mcp, completing the
final unwired namespace from architecture.md §3 (sandbox.stripe.*):
- sandbox.stripe.bindAccount {api_key} — validates the key prefix
(sk_live_ / sk_test_ / rk_live_ / rk_test_), stores it in the
per-Sandbox Secret (`sandbox-<owner-uid>-secrets`, data-key
`stripe_api_key`) via the same write-path sandbox.secrets.write
uses, returns a masked confirmation (`sk_test_…xY12`).
- sandbox.stripe.listProducts — reads the bound key implicitly,
GET /v1/products with limit (1-100, default 20), active, and
starting_after cursor passthrough.
- sandbox.stripe.listPrices {product_id?} — same pagination shape;
optional product_id filter.
- sandbox.stripe.createCheckoutSession {price_id, success_url,
cancel_url} — validates absolute http(s) URLs, POSTs the
form-encoded line_items[0][price/quantity] body to
/v1/checkout/sessions, returns the hosted Checkout URL + session id.
Implementation:
- No new module dep — inline HTTPS calls to api.stripe.com via the
stdlib net/http client. stripe-go v82 would have pulled ~80
transitive packages for four endpoints; the surface we need is
tiny enough that a 100-line stripeDo helper covers it. Matches
the task's "stripe-go v82 if not already in deps; else inline
HTTPS" guidance.
- The key never round-trips on the wire after first bind. Agent
pastes once via bindAccount; every subsequent call reads it from
the Secret store. Stripe-Version header pinned to 2024-06-20 so
a future API revision can't silently break the wire format.
- Auth: RequiredCapability="sandbox.stripe" on every tool.
claims.OrgID match enforced by the registry's existing gate.
- Read-only cluster invariant: the only writes are to the
per-Sandbox Secret. assertManagedBy() enforced on bind so we
cannot mutate the controller-injected `sandbox-tokens` Secret.
Tests cover key validation (prefix + length), masking format, limit
clamping, the httptest.Server-backed happy-path + error-envelope
unwrap, form-urlencoded body shape for createCheckoutSession,
catalogue wiring (all four handlers non-nil, RequiredCapability
matches), and the registry capability gate (missing sandbox.stripe
cap → forbidden).
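For illustration, the prefix check and masked confirmation could look roughly like this. It is a sketch under stated assumptions: `validateStripeKey`, `maskStripeKey` and the `minKeyBody` length floor are hypothetical (the message only says the tests cover "prefix + length"); the four prefixes and the `sk_test_…xY12` output shape come from the description above.

```go
package main

import (
	"errors"
	"strings"
)

var stripeKeyPrefixes = []string{"sk_live_", "sk_test_", "rk_live_", "rk_test_"}

// minKeyBody is an assumed lower bound on characters after the prefix.
const minKeyBody = 8

// validateStripeKey returns the matched prefix or an error.
func validateStripeKey(key string) (string, error) {
	for _, p := range stripeKeyPrefixes {
		if strings.HasPrefix(key, p) {
			if len(key)-len(p) < minKeyBody {
				return "", errors.New("stripe key too short")
			}
			return p, nil
		}
	}
	return "", errors.New("stripe key has an unrecognised prefix")
}

// maskStripeKey keeps the prefix and the last four characters only,
// so the confirmation never echoes the usable secret.
func maskStripeKey(key string) (string, error) {
	prefix, err := validateStripeKey(key)
	if err != nil {
		return "", err
	}
	return prefix + "…" + key[len(key)-4:], nil
}
```

Validating before masking means a malformed key never reaches the Secret write path or the response body.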
Closes the Wave 13 "last MCP namespace" gap; no chart bump.
Co-authored-by: Claude <noreply@anthropic.com>
Append a Wave 12-14 addendum to the convergence report capturing:
- t-prov cycle log (t13 FAIL, t14 FAIL, t15 PASS, t16-t19 STUCK on stale chart, t20 in flight on 1.4.162)
- Three silent-failure traps: Wave 8 CloudPage TS error stalled UI builds 3h; Wave 13 mcp-server Dockerfile context broke sandbox-mcp builds for 3 days since #1658; Wave 14 bootstrap-kit pin lag stalled all chart propagation for 6h of provs
- Wave 12-14 PR roster (#1656/#1658/#1659/#1660/#1661/#1662/#1663/#1664/#1666/#1667) plus session total now 51 PRs
- Lesson 6: deploy-bot does NOT auto-bump the bootstrap-kit slot 13 pin; manual collector PR required per cycle
Companion memos (out-of-tree, not in this PR):
- session_2026_05_18_overnight_22prs.md gets a Wave 12-14 outcomes section
- new feedback_bootstrap_kit_pin_lag.md pins the pattern + detection one-liner
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* docs: session 2026-05-17/18 Wave 12-14 addendum + bootstrap-kit pin lag feedback
* fix(sandbox-chart): add no-upstream annotation (unblock Blueprint Release pipeline)
Blueprint Release CI was failing on every push that touched
platform/sandbox/chart/* since PR #1622 because the chart didn't declare
either dependencies: OR the catalyst.openova.io/no-upstream: "true"
annotation. Per docs/BLUEPRINT-AUTHORING.md §11.1 every umbrella chart at
platform/<name>/chart/ MUST do one of those two.
Sandbox is Catalyst-authored (sandbox-controller built in-house), so the
no-upstream annotation is correct. Matches existing pattern in:
- platform/bp-vcluster-helmrepo/chart/Chart.yaml
- platform/cnpg-pair/chart/Chart.yaml
- platform/external-secrets-stores/chart/Chart.yaml
Without this, Blueprint Release fails → bp-catalyst-platform chart
artifact at 1.4.162 never republishes with the latest sandbox image
refs (cadc7b5 from PR #1667 auto-bump) → fresh provs keep getting
stale sandbox runtime images.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sandbox chart was un-deployable end-to-end because three CI-side gaps
compounded after PR #1658 wired the mcp-server module to depend on
core/controllers + core/services/shared via `replace` directives:
1. **mcp-server Dockerfile built against a too-narrow context**. The
workflow passed `context: products/sandbox/mcp-server` and the
Dockerfile assumed `COPY . .` could see everything it needed, but
the `replace ../../../core/controllers` line in the module's go.mod
only resolves when the build can actually reach those paths. Result:
every push after #1658 failed at `go build` with `module not found`.
Fix mirrors core/controllers/sandbox/Dockerfile (Slice-CC1 layout):
COPY the replace targets' module roots + sources, then build with
WORKDIR set to the dependent module. Static binary still produced
into a distroless/static-debian12:nonroot final stage.
2. **mcp-server workflow had no chart auto-bump step**. Even after a
green build, `runtime.mcpImage` in platform/sandbox/chart/values.yaml
stayed empty so the chart's `required` guard
(deployment.yaml line 72) refused to render. Added the same
yq-bump + bot-commit pattern build-sandbox-controller.yaml already
uses, targeting `.runtime.mcpImage` and writing a fully-qualified
`<repo>:<sha>` string (consumer reads it as one image reference,
not a {repository,tag} pair). Also widened paths-filter to include
core/controllers/** + core/services/shared/** so changes to the
replace targets re-trigger the build.
3. **pty-server workflow had no auto-bump either**. Same surgery:
yq-bump `.runtime.ptyServerImage` + commit-and-push. Context stays
narrow (pty-server has no cross-tree `replace` directives).
4. **Stop-gap pin values for runtime.{ptyServerImage,mcpImage}** so the
next chart rollout doesn't fail fast before the rebuilt workflows
land their first bumps:
- ptyServerImage → ad5163e6 (current latest pty-server)
- mcpImage → 1b0e86c (last pre-#1658 green build; the rebuilt
workflow will land the next real SHA on the next push to main).
Verified locally:
- `go build ./products/sandbox/mcp-server/...` clean (43.8 MB static
binary at /tmp/openova-sandbox-mcp; `file` confirms statically
linked ELF).
- `helm template test platform/sandbox/chart --set enabled=true …`
renders cleanly; both env vars carry the SHA-pinned image refs.
No Chart.yaml bump. Read-only clusters.
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces the Wave 2 stubs for the sandbox.storage.* namespace with real
handlers backed by the host-cluster's unified SeaweedFS S3 API
(`seaweedfs.storage.svc:8333` per platform/seaweedfs/README.md). Every
handler is scoped to buckets prefixed `sandbox-<owner-uid>-` so the
agent cannot touch any other consumer's bucket (loki-data / cnpg-wal /
harbor-data, etc.).
Tools shipped:
- sandbox.storage.bindBucket {bucket_name?}
- sandbox.storage.signedUploadURL {bucket, key, expires_in_seconds?}
- sandbox.storage.signedDownloadURL {bucket, key, expires_in_seconds?}
- sandbox.storage.listBuckets
- sandbox.storage.deleteBucket {name}
Wire model: minio-go v7 (already canonical across the OpenOva tree —
catalyst-bootstrap/hetzner objectstorage depend on it) speaking S3 v4
to SeaweedFS. Presigned URLs default to 15 min and clamp to 7 days
(the S3 v4 signature ceiling).
Defence-in-depth: prefix-mismatch + 63-char S3 cap + alnum-only object
key regex all enforced BEFORE any S3 dial; arg-validation errors
surface clearly without first hitting a misleading creds error.
New env vars (sandbox-controller fills these at MCP Deployment
spec time):
SANDBOX_STORAGE_S3_ENDPOINT = "seaweedfs.storage.svc:8333"
SANDBOX_STORAGE_S3_ACCESS_KEY = "<per-Sandbox IAM access key>"
SANDBOX_STORAGE_S3_SECRET_KEY = "<per-Sandbox IAM secret>"
SANDBOX_STORAGE_S3_USE_TLS = "true|false" (default: false)
SANDBOX_STORAGE_S3_REGION = "us-east-1" (default; opaque to SeaweedFS)
Auth: same shape as PR #1658 (sandbox.auth.* + sandbox.secrets.*) —
claims.OrgID must match env.OrgID, RequiredCapability=sandbox.storage.
Tests: 9 new test functions covering prefix format, scope-gate,
bucket-name normalization (including cross-Sandbox refusal + length
guard), object-key validation, expiry clamp, region default, per-tool
arg validation, capability gate, and catalogue wiring. `go build` +
`go test` clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wave 13 of the openova-sandbox-mcp catalogue replaces the deploy
stubs with real handlers:
- sandbox.deploy.staging {image, app?} upserts HR YAML in the
Org's catalyst-tenant
Gitea repo at
sandbox/<owner-uid>/deploy/
<app>/staging/helmrelease.yaml
- sandbox.deploy.production {image, app?} same shape, second
capability gate
(sandbox.deploy.production)
- sandbox.deploy.status {env, app?} reads requested-image
from Gitea + observed
image / Ready condition
from the live HR in the
Org vcluster
- sandbox.deploy.rollback {env, app?} reverts HR to the
openova.io/last-deployed-
image annotation
All writes go through the canonical pkg/gitea client (PutFile's
byte-equal short-circuit handles idempotency). Cluster reads use the
Sandbox's in-cluster dynamic client; the deploy tools never apply
manifests directly to the cluster — Flux on the host reconciles the
Gitea write per the architecture.md §3 contract.
Tests: 21 new in sandbox_deploy_test.go covering the Gitea write
fake, capability gates on production, rollback round-trip, image
splitter, HR ready condition parsing, registry advertisement.
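As an illustration of the Gitea path layout and the image splitter mentioned in the tests, a sketch follows. Both function names are hypothetical; the production path is assumed to differ from the staging one only in the environment segment ("same shape"), and the splitter handles tag-style references only (digest references are out of scope here).

```go
package main

import (
	"errors"
	"path"
	"strings"
)

// helmReleasePath builds the repo-relative path of the HelmRelease file.
func helmReleasePath(ownerUID, app, env string) (string, error) {
	if env != "staging" && env != "production" {
		return "", errors.New("env must be staging or production")
	}
	if ownerUID == "" || app == "" {
		return "", errors.New("owner uid and app are required")
	}
	// Refuse separators and dot-dot so the path cannot leave the Sandbox's subtree.
	if strings.ContainsAny(ownerUID+app, "/\\") || strings.Contains(ownerUID+app, "..") {
		return "", errors.New("owner uid and app must be plain path segments")
	}
	return path.Join("sandbox", ownerUID, "deploy", app, env, "helmrelease.yaml"), nil
}

// splitImage separates "repo:tag". A colon only counts as the tag separator
// when it appears after the last slash, so registry ports survive intact.
func splitImage(image string) (repo, tag string) {
	slash := strings.LastIndex(image, "/")
	colon := strings.LastIndex(image, ":")
	if colon > slash {
		return image[:colon], image[colon+1:]
	}
	return image, ""
}
```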
Co-authored-by: hatiyildiz <269457768+hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces the Wave-2 not_implemented stubs for sandbox.preview.* with
real handlers in products/sandbox/mcp-server/internal/tools/sandbox_preview.go.
Five tools shipped:
- sandbox.preview.create {pr_number, image, repo?}
- sandbox.preview.list
- sandbox.preview.status {pr_number}
- sandbox.preview.teardown {pr_number}
- sandbox.preview.rebuild {pr_number, image}
All five address env.SandboxNamespace exclusively (the agent cannot pass
a namespace argument), and every mutation is gated by an
openova.io/managed-by=openova-sandbox-mcp + openova.io/preview-pr=<num>
label pair so teardown/rebuild can never touch pty-server or the
openova-sandbox-mcp Deployment in the same ns.
Hostname pattern: pr-<num>.<app>.sandbox.<sov-fqdn> — attaches to the
canonical Cilium Gateway (cilium-gateway/kube-system) the
sandbox-controller already routes for pty-server, so previews surface
under the same wildcard listener via deeper subdomains.
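The hostname pattern and the label-pair gate can be sketched like this. `previewHost` and `ownsPreview` are hypothetical helper names; the label keys, the managed-by value and the `pr-<num>.<app>.sandbox.<sov-fqdn>` pattern are the ones quoted above.

```go
package main

import (
	"fmt"
	"strconv"
)

const (
	labelManagedBy = "openova.io/managed-by"
	labelPreviewPR = "openova.io/preview-pr"
	managedByValue = "openova-sandbox-mcp"
)

// previewHost renders pr-<num>.<app>.sandbox.<sov-fqdn>.
func previewHost(pr int, app, sovFQDN string) string {
	return fmt.Sprintf("pr-%d.%s.sandbox.%s", pr, app, sovFQDN)
}

// ownsPreview reports whether an object carries BOTH labels for this PR.
// Objects without the pair (pty-server, the MCP Deployment) never match.
func ownsPreview(labels map[string]string, pr int) bool {
	return labels[labelManagedBy] == managedByValue &&
		labels[labelPreviewPR] == strconv.Itoa(pr)
}
```

Requiring both labels means a teardown for PR 42 cannot match another PR's preview, nor anything the controller created in the same namespace.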
Auth: same HS256 + claims.OrgID + RequiredCapability=sandbox.preview
pattern PRs #1645 / #1656 / #1658 / #1659 established.
go build + go test clean.
Sovereign DoD D31 — CNPG-backed apps must replicate across the
Sovereign's regions when the operator opts in. PR #1562 wired this
into bp-wordpress-tenant chart-level. This change extends the same
toggle across BOTH user-facing paths:
1. Marketplace tenant flow (sme_tenant_gitops.go)
- smeTenantTemplateData gains EnableHotStandby/PrimaryRegion/
ReplicaRegion. renderSMETenantOverlay reads them from the
catalyst-api Pod env (SOVEREIGN_ENABLE_HOT_STANDBY +
SOVEREIGN_PRIMARY_REGION + SOVEREIGN_REPLICA_REGION).
- bp-wordpress-tenant HelmRelease emits pg.activeHotStandby.*
when the trio is valid; bp-wordpress-tenant chart 0.2.0+
(PR #1562) renders the primary + replica Cluster CR pair.
- Defence-in-depth: degenerate inputs (empty/identical regions)
fall back to single-Cluster shape rather than emitting a
HelmRelease the chart's validateActiveHotStandbyRegions helper
would fail at template time.
2. Sandbox plane (sandbox.db.provision)
- Env struct + NewEnvFromOS read the same Sovereign-level trio.
- sandbox.db.provision emits a primary + replica Cluster CR pair
when hotStandbyActive() — same shape bp-cnpg-pair renders for
marketplace apps + bp-wordpress-tenant cnpg-cluster.yaml: WAL
streaming via spec.managed.services.additional annotated
service.cilium.io/global=true, nodeAffinity pinning each side
to its declared region, replica.enabled=true with externalCluster
resolving the primary through the ClusterMesh-global Service alias.
- Best-effort rollback if the replica Create fails so the operator
never sees an orphan primary.
3. Plumbing (one knob, both paths)
- catalyst chart: values.sovereign.{enableHotStandby,primaryRegion,
replicaRegion} -> sovereign-fqdn ConfigMap keys -> catalyst-api env.
- sandbox chart: cnpg.activeHotStandby.{enabled,primaryRegion,
replicaRegion} -> controller env -> per-Sandbox MCP Pod env.
- Bootstrap-kit slot 13 + slot 19a wire SOVEREIGN_ENABLE_HOT_STANDBY/
SOVEREIGN_PRIMARY_REGION/SOVEREIGN_REPLICA_REGION envsubst
placeholders to BOTH chart paths so the operator flips one knob
on the per-Sovereign overlay and gets HA across the marketplace
tenant install AND the sandbox.db plane.
Default empty/false: every Sovereign that has not opted in keeps
rendering single-Cluster CNPG (zero regression).
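The "trio is valid" rule shared by both paths can be sketched as below. The `HotStandby` type and `hotStandbyFromEnv` are hypothetical, and parsing the toggle with strconv.ParseBool is an assumption; the env var names and the degenerate cases (toggle off, empty primary, empty replica, identical regions) are the ones listed in this message.

```go
package main

import (
	"strconv"
	"strings"
)

// HotStandby is the Sovereign-level trio read from the Pod environment.
type HotStandby struct {
	Enabled       bool
	PrimaryRegion string
	ReplicaRegion string
}

// hotStandbyFromEnv takes a getenv function so tests can inject values.
func hotStandbyFromEnv(getenv func(string) string) HotStandby {
	// An unparsable or empty toggle counts as false (default off).
	enabled, _ := strconv.ParseBool(strings.TrimSpace(getenv("SOVEREIGN_ENABLE_HOT_STANDBY")))
	return HotStandby{
		Enabled:       enabled,
		PrimaryRegion: strings.TrimSpace(getenv("SOVEREIGN_PRIMARY_REGION")),
		ReplicaRegion: strings.TrimSpace(getenv("SOVEREIGN_REPLICA_REGION")),
	}
}

// Active is true only when the toggle is on and both regions are set and
// distinct; anything else falls back to the single-Cluster shape.
func (h HotStandby) Active() bool {
	return h.Enabled &&
		h.PrimaryRegion != "" &&
		h.ReplicaRegion != "" &&
		h.PrimaryRegion != h.ReplicaRegion
}
```

In production code the getenv argument would simply be os.Getenv.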
gitlab-tenant + nextcloud-tenant charts: NOT shipped in this repo
today, so they are out of scope. When they land they can copy the
same value contract (pg.activeHotStandby.*) and the gitops writer
wiring already handles them — no chart-bump or controller change
required.
Tests
- sme_tenant_active_hot_standby_test.go: 8 cases (off, on-happy-path,
degenerate matrix incl. empty primary, empty replica, identical
regions, toggle off with regions).
- sandbox_db_hot_standby_test.go: 11 cases covering hotStandbyActive
matrix + replicaClusterName/replicationServiceName suffix rules +
full primary + replica CR shapes (nodeAffinity, switchover, managed
service, externalClusters).
- platform/wordpress-tenant/chart/tests/active-hot-standby-render.sh
still passes (5/5 gates green).
- catalyst-api SMETenant suite GREEN.
- sandbox-controller suite GREEN.
- helm template clean for sandbox chart (HA + default-off) and
catalyst chart (sovereign-fqdn-configmap + api-deployment).
Hard rules respected: READ-ONLY clusters, no Chart.yaml bump on
bp-catalyst-platform (envsubst-only wiring change in slot 13), no
host-cluster touch outside the chart-level seam.
Refs DoD D31.
Co-authored-by: hatiyildiz <269457768+hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wave 12 thin first-cut for the per-Org RAG + skill-pack surfaces
advertised in products/sandbox/docs/architecture.md §3:
- rag.search {query, repo?, limit?} — case-insensitive substring grep
over the Sandbox PVC mounted at /repo (filepath.WalkDir with sensible
binary/large-file skip + matches capped at 50, snippets trimmed to
80 chars). When the PVC isn't mounted yet the response returns
matches=[] + pendingApi=true so the agent can branch on "index not
ready" vs "no hits".
- skills.list / skills.get — hardcoded 3-entry catalogue
(openova-platform-basics, k8s-debugging, fluxcd-workflows) with
pendingOCI=true stub manifests. Wire envelope matches the eventual
OCI-backed catalogue so the agent's parsing layer carries forward.
Auth identical to sandbox.secrets.* (PR #1658): Registry.Call enforces
Claims.OrgID==env.OrgID and RequiredCapability ("rag" / "skills") via
the existing gate; both tools are read-only.
Build + vet + tests clean.
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wave 11 follow-up to PR #1653 (sandbox.db.*). Replaces the stubbed
sandbox.auth.* and sandbox.secrets.* tool handlers with real
implementations so agents can manage per-Sandbox Keycloak realms /
OIDC clients and a per-Sandbox Secret store.
sandbox.auth.* (Keycloak Admin REST via the sandbox-controller-
injected admin bearer):
- sandbox.auth.provisionRealm {realm_name, display_name?}
POST /admin/realms — idempotent on 409 Conflict.
- sandbox.auth.listClients
GET /admin/realms/<sandbox-realm>/clients — friendly empty
list on 404 (realm not yet provisioned).
- sandbox.auth.registerClient {client_id, redirect_uris,
public_client?, name?}
POST /admin/realms/<sandbox-realm>/clients — idempotent on
409 Conflict, typed error on 404 (realm missing).
The Sandbox's "own" realm name is deterministic (`sandbox-<org>-
<id>`); the agent CANNOT pass a `realm` argument to list /
register, only provisionRealm accepts a free-form name.
sandbox.secrets.* (per-Sandbox K8s Secret store, base64-encoded
data, encrypted at rest by kube-apiserver encryption-provider):
- sandbox.secrets.read {key} — returns Found / KeyNotFound
/ NotFound (Secret missing)
- sandbox.secrets.write {key, value} — auto-creates the Secret on
first write (Added /
Updated / Created)
The Secret is named `sandbox-<owner-uid>-secrets` in env.Sandbox-
Namespace and gated by openova.io/managed-by=openova-sandbox-mcp
so sandbox.secrets.write CANNOT mutate the controller-injected
`sandbox-tokens` Secret or any other unmanaged Secret in the ns.
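The managed-by gate amounts to a label check before any write. A minimal sketch follows; the signature of `assertManagedBy` and the `secretStoreName` helper are hypothetical, while the Secret name pattern and the label key/value are quoted from this message.

```go
package main

import "fmt"

const (
	managedByLabel = "openova.io/managed-by"
	managedByValue = "openova-sandbox-mcp"
)

// secretStoreName renders sandbox-<owner-uid>-secrets.
func secretStoreName(ownerUID string) string {
	return "sandbox-" + ownerUID + "-secrets"
}

// assertManagedBy refuses to touch a Secret that does not carry the
// MCP server's managed-by label, e.g. the controller-injected
// `sandbox-tokens` Secret.
func assertManagedBy(name string, labels map[string]string) error {
	if labels[managedByLabel] != managedByValue {
		return fmt.Errorf("secret %q is not managed by %s", name, managedByValue)
	}
	return nil
}
```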
Auth: claims.OrgID == env.OrgID required (same as sandbox.db.*),
RequiredCapability = "sandbox.auth" / "sandbox.secrets".
New env vars (sandbox-controller injects on MCP Deployment):
- SANDBOX_OWNER_UID — `sandbox-<owner-uid>-secrets` suffix
- KEYCLOAK_ADMIN_URL — root of the Keycloak Admin REST API
- KEYCLOAK_ADMIN_TOKEN — pre-minted admin bearer
- KEYCLOAK_PARENT_REALM — default "master"
No chart bump; mcp-server-only change. go build + go test clean.
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The Sandbox public-URL flow (sandbox.<sov-fqdn>/sessions/<owner-uid>/*) had
three independent gaps that prevented PR #1641's HTTPRoute from resolving
end-to-end:
1. HTTPRoute parentRefs pointed at "catalyst-public/catalyst-system/https",
a Gateway that does not exist on a Sovereign. The canonical public
Gateway is "cilium-gateway/kube-system" (clusters/_template/
sovereign-tls/cilium-gateway.yaml), the same parent that organization-
controller's tenant_route.go and the chart's httproute.yaml attach to.
sectionName is omitted so the HTTPRoute auto-attaches to every listener
whose hostname matches sandbox.<sov-fqdn> — the wildcard
*.${SOVEREIGN_FQDN} HTTPS listener already in place per infra/hetzner/
main.tf locals.parent_domains_listeners_yaml fallback path.
2. The per-name Cilium Gateway cert (clusters/_template/sovereign-tls/
cilium-gateway-cert.yaml) is a SAN list, not a wildcard. Without
"sandbox.<sov-fqdn>" in its dnsNames cilium-envoy serves the default
fallback cert and browsers see NET::ERR_CERT_COMMON_NAME_INVALID.
This file is the source of the per-zone Secret
sovereign-wildcard-tls-<sov-fqdn-dashed> the Gateway listener
references — adding the SAN is the only TLS-side change needed; the
Gateway listener wildcard is already a hostname match.
3. The parent zone's A-record set is built from CanonicalSovereignSubdomains
in products/catalyst/bootstrap/api/internal/handler/
sovereign_dns_records.go. Without "sandbox" the PowerDNS PATCH never
writes sandbox.<sov-fqdn> A-record → primary LB IP, and the URL
resolves NXDOMAIN even when the listener + cert are healthy.
End-to-end resolution chain after this PR:
Browser → sandbox.<sov-fqdn>/sessions/<owner-uid>/ (PowerDNS A record
points at primary LB IPv4)
→ Hetzner LB :443 → cp-node :30443 (cilium-envoy)
→ Gateway listener https-<sov-fqdn-dashed> on *.<sov-fqdn> matches
hostname; cert SAN includes sandbox.<sov-fqdn> so TLS terminates
→ HTTPRoute pty-server in sandbox-<owner-uid> namespace matches
hostname + /sessions/<owner-uid>/ path prefix; URLRewrite strips
/sessions/<owner-uid>/ → /sessions/
→ backendRef pty-server:7681 in sandbox-<owner-uid> namespace
→ pty-server StatefulSet (PR #1641) serves the session
Hard rules respected: READ-ONLY clusters, no Chart.yaml bump (only
template content + Go renderer + Go handler list), helm template +
kubectl kustomize clean (verified locally), tests updated to assert the
new parentRefs shape and pass under Go 1.23.
Codifies the 17-step marketplace customer journey (storefront → catalog →
product detail → voucher → signup → subdomain pick → PIN → checkout →
provisioning chain → console redirect) as a hermetic Playwright suite.
Previously the journey was only walked manually by ad-hoc fix-author
agents (see PR #1635 / docs/SESSION-2026-05-17-CONVERGENCE.md). This adds
a regression gate so future PRs catch breakage in any of the 14 spec
tests (17 step labels grouped into 14 Playwright tests — steps 12-15 are
asserted as one API-chain contract since CheckoutStep redirects to
console before the panel-poll UI would render).
Highlights
----------
- core/marketplace/playwright.config.ts — testDir=./playwright,
workers=1, baseURL from MARKETPLACE_BASE_URL (default
http://localhost:4321), same posture as
tests/e2e/playwright/playwright.config.ts.
- core/marketplace/playwright/customer-journey.spec.ts — every backend
call (/api/catalog/*, /api/auth/*, /api/tenant/*, /api/billing/*,
/api/provisioning/*) intercepted via page.route() so the run is
hermetic (npm run build && npm run preview is enough — no real
catalyst-api / billing / provisioning service required).
- Asserts the PR #1627 fix (deriveConsoleURL host-driven) — Sovereign
hosts redirect to console.<sov-fqdn> (no /nova), mothership stays on
console.openova.io/nova.
Verification
------------
npx playwright test customer-journey → 14 passed (2.5m).
Convergence wave 11 blocker on t16: bp-newapi HR install fails with
Error: template: bp-newapi/templates/configmap.yaml:1:4: executing
"bp-newapi/templates/configmap.yaml" at <include "bp-newapi.assertChannelAttestation" .>:
channel[0] (qwen3.6-bankdhofar): commercial-contract attestation
requires accountId
PR #1631 wired the bootstrap-kit overlay so franchised Sovereigns can
opt in to marketplace via `MARKETPLACE_ENABLED=true` — flipping
`defaultChannels.qwenBankDhofar.enabled` to true with envsubst
placeholders for the attestation:
attestation:
kind: commercial-contract
accountId: ${LLM_BANK_DHOFAR_ACCOUNT_ID:-}
contractRef: ${LLM_BANK_DHOFAR_CONTRACT_REF:-}
On a Sovereign that has not yet signed the commercial contract those
variables expand to empty strings, and the chart's
`assertChannelAttestation` helper hard-fails the helm template before
any manifest is rendered — newapi install crashes at slot 80 and the
whole bootstrap-kit reconciliation stalls.
Fix (Option A — smallest change, makes the chart actually install):
SKIP composing the qwenBankDhofar channel when
attestation.kind=commercial-contract AND either accountId or contractRef
is empty. NewAPI installs with zero default channels (operator-supplied
`.Values.channels` still compose). Once the operator overlay supplies
the attestation values the channel composes on the next reconcile.
Touches two templates that gate on the same effective channel list:
- templates/_helpers.tpl `bp-newapi.effectiveChannels` — adds a
pre-check ($qbdAttReady) that short-circuits the channel composition
block when attestation is incomplete. The downstream
`assertChannelAttestation` helper then sees an empty channel list
for the qwenBankDhofar slot and emits no error.
- templates/channel-seed-job.yaml — mirrors the same gate so the
post-install Helm hook Job + RBAC + audit ConfigMap also skip when
the channel itself was skipped (otherwise the Job would POST a row
whose ConfigMap entry was omitted from /etc/newapi/channels.yaml).
`helm template platform/newapi/chart` renders cleanly in all three
states:
- default (qbd.enabled=false) → no channel, no seed Job
- qbd.enabled=true + empty accountId/contractRef → no channel, no
seed Job (NEW: pre-1.4.10 this hard-failed)
- qbd.enabled=true + accountId + contractRef present → channel
composed normally, seed Job emitted
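The real gate lives in Helm templates ($qbdAttReady in _helpers.tpl); the predicate it encodes can be restated in Go for clarity. The `Attestation` type and `channelComposes` name are hypothetical; the three cases mirror the three render states listed above.

```go
package main

// Attestation mirrors the three values under the channel's attestation key.
type Attestation struct {
	Kind        string
	AccountID   string
	ContractRef string
}

// channelComposes reports whether the qwenBankDhofar channel (and its seed
// Job) should be rendered.
func channelComposes(enabled bool, a Attestation) bool {
	if !enabled {
		return false
	}
	// A commercial-contract attestation with a missing accountId or
	// contractRef skips the channel instead of failing the whole render.
	if a.Kind == "commercial-contract" && (a.AccountID == "" || a.ContractRef == "") {
		return false
	}
	return true
}
```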
Chart bumped 1.4.9 → 1.4.10; bootstrap-kit overlay pin bumped
1.4.6 → 1.4.10 so franchised Sovereigns immediately pick up the fix.
READ-ONLY clusters preserved. NO Chart.yaml bump on
bp-catalyst-platform.
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR #1645 (Wave 8) wired gitea.* + k8s.read.* + session.* in the MCP
server but left sandbox.db.* as not_implemented stubs. This commit
ships the real handlers using the same dynamic-client pattern.
Tools shipped (all gated on `RequiredCapability=sandbox.db` + claim
OrgID==env.OrgID, all scoped to env.SandboxNamespace):
- sandbox.db.provision {name, plan?} — POSTs a CNPG Cluster CR
(default plan: 1 instance, 5Gi PVC, postgres 16, db=app). Returns
{host:<name>-rw.<ns>.svc.cluster.local, port:5432, dbname, user,
secretName:<name>-app, secretKey:password}.
- sandbox.db.list — labels-filtered LIST scoped to the Sandbox ns,
returns the same connection envelope per item plus a distilled
status summary (phase, readyInstances, Ready condition).
- sandbox.db.get {name} — GET one Cluster; refuses to surface a
Cluster lacking openova.io/managed-by=openova-sandbox-mcp
(defence-in-depth against an agent fishing for per-Org pair DBs).
- sandbox.db.drop {name} — DELETE with foreground propagation so the
operator cascades PVC/Service/Secret cleanup before returning.
Same managed-by guard as get.
- sandbox.db.dump {name} — POSTs a one-shot Backup CR
(`<cluster>-dump-<UTC>`). Returns the Backup name + the Cluster's
configured barmanObjectStore.destinationPath so the agent can find
the resulting S3 prefix without polling Backup.status.
Why CNPG Cluster CRs (not a per-Sandbox shared DB): a per-app DB keeps the
tenancy / backup / restart blast radius per-app and matches architecture
§3 + §7. Cluster CRs live in the Sandbox's OWN namespace
(sandbox-<owner-uid>); the agent cannot pass `namespace` — it's read
from env. The MCP server never mutates the resulting Pods/PVCs/
Services — the upstream CNPG operator (bp-cnpg) owns those.
Tests (sandbox_db_test.go, 9 cases incl. 5 capability-gate sub-tests):
- validation (name regex, missing name, unknown plan)
- default-plan CR shape (apiVersion, kind, labels, spec.instances,
storage.size, bootstrap.initdb.database, enableSuperuserAccess)
- connectionFor envelope matches CNPG service-name defaults
- on-demand Backup CR shape + managed-by label
- requireSandboxNS guard rails (no env / empty ns / populated)
- capability gate rejects bearers w/o sandbox.db
- status summary surfaces phase + Ready condition only
Hard rules respected: NO chart bump, no host-cluster touch — every
mutation lands inside the Sandbox's own namespace via the SA the
sandbox-controller already gives the MCP pod. go build + go vet +
go test clean. Catalogue test updated for new `sandbox.db.get`.
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Caught live on t16.omantel.biz convergence: bp-sandbox HR stuck
Reconciling because its chart pull goes through harbor.<sov-fqdn>
(post-handover cutover slot 06a Step-06 phase-1 rewrites every
HelmRepository URL `oci://ghcr.io/openova-io` →
`oci://harbor.<sov-fqdn>/openova-io`), but harbor.<sov-fqdn> is not
reachable yet because bp-harbor itself has not reached Ready —
chicken-and-egg.
Same failure shape as Wave 7 #1610 with bp-hcloud-csi (REMOVED). This
PR takes the cleaner long-term alternative: rather than remove the
slot, sequence it AFTER bp-harbor (slot 19) by renumbering to 19a
+ adding `bp-harbor` to the HR's dependsOn graph. The Sandbox MVP
Wave 11 slot stays available with no manual Day-2 add-app
re-introduction needed.
bp-harbor itself does not hit the cycle because its chart pull goes
through harbor.openova.io (the mothership-warmed proxy-cache wired
into k3s registries.yaml at cloud-init time) — NOT through
harbor.<sov-fqdn>.
Diff:
- clusters/_template/bootstrap-kit/61-bp-sandbox.yaml renamed →
19a-bp-sandbox.yaml; slot label "61" → "19a"; dependsOn adds
bp-harbor; header documents the move + chicken-and-egg context.
- clusters/_template/bootstrap-kit/kustomization.yaml: 19a slot
inserted right after 19-harbor.yaml with the post-cutover URL
rewrite rationale inline; old slot-61 entry replaced with a
back-pointer comment.
Verified `kubectl kustomize clusters/_template/bootstrap-kit/`
renders clean: bp-sandbox HR keeps slot label, gains
`- name: bp-harbor` in dependsOn, all other fields unchanged.
No Chart.yaml bump (this is a bootstrap-kit Kustomization-only fix,
not a chart change). READ-ONLY clusters.
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR #1641 shipped the `openova.io/sandbox-idle-timeout-minutes` annotation on
every pty-server StatefulSet but no controller was reading it. This closes
the loop:
pty-server (products/sandbox/pty-server/):
- session.Manager tracks lastActivity; Touch() called on session
create/stop, WS attach/detach, every WS message in/out, resize/signal.
- New GET /idle endpoint returns {lastActivityAt, activeSessions}.
- Unit tests cover the endpoint shape + Touch() bump.
sandbox-controller (core/controllers/sandbox/internal/idlescaler/):
- New IdleScaler runnable, registered with mgr.Add() in main.go.
- NeedLeaderElection=true (singleton across HA replicas).
- Every 60s lists pty-server StatefulSets by label selector
(app.kubernetes.io/component=pty-server + openova.io/managed-by=catalyst),
constrained to `sandbox-*` namespaces in code for defence-in-depth.
- For each: probes the in-cluster Service /idle endpoint, stamps the
`openova.io/sandbox-last-activity-at` annotation, and patches
spec.replicas=0 once now-lastActivity exceeds the per-SS
`openova.io/sandbox-idle-timeout-minutes` annotation (falling back to
SANDBOX_IDLE_TIMEOUT_MINUTES env, default 30).
- Probe failure with no prior annotation → skip (next tick); probe
failure WITH prior annotation → still decide on stale data so a
degraded probe path doesn't keep a forgotten Pod alive forever.
- activeSessions > 0 keeps the Pod alive regardless of idle window.
- Already-zero replicas → idempotent no-op.
Chart RBAC:
- ClusterRole gains apps/statefulsets get/list/watch/patch — the ONLY
cluster-wide write on a non-CR resource, scoped to the controller's
own managed StatefulSets via the label selector + namespace prefix.
Tests: 9 unit tests covering active-not-idle, idle-scales-zero,
active-sessions-never-scales, probe-fail-no-annotation-skips,
per-SS-annotation-override, namespace-prefix-defence, already-zero-no-op,
default-URL-builder, leader-election-singleton.
Approach: controller polls pty-server's /idle endpoint via cluster-DNS
(smaller diff than embedding a k8s client in pty-server — pty-server
keeps its ~80-line go.mod, no new RBAC inside the per-Sandbox namespace).
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR #1644 added Organization.spec.tenantPublic + per-tenant HTTPRoute
reconciler, but nothing set the field — every Org CR's TenantPublic
stayed zero-value, the reconciler short-circuited at the empty
ParentDomain guard, and `<slug>.omani.homes` 404'd at the Cilium
Gateway.
Wire the patch at the only point that knows a tenant's product is
actually Ready: the provisioning service. Both the initial workflow
(`provision.completed`) and the day-2 install path
(`provision.app_ready`) now patch the Organization CR's
spec.tenantPublic with parentDomain (from TENANT_PARENT_DOMAIN env),
subdomain (= slug), backendService (canonical vcluster-synced name),
port 80, and the picked product slug. Last-write-wins on subsequent
installs.
Per docs/INVIOLABLE-PRINCIPLES.md #4 the parent zone flows through
env, never hardcoded — every Sovereign picks its own pool zone.
Empty env disables the patch entirely (legacy tenants keep working
through the Sovereign-wide tenant-wildcard route). Best-effort:
failures don't fail the provision. 404 on the CR is benign (legacy
tenant without an Organization counterpart).
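
As a rough illustration of the patch described above, the sketch below builds a JSON merge-patch body for spec.tenantPublic. The field names follow the CRD fields listed for PR #1644; the Go types and the `tenantPublicPatch` helper are hypothetical, and the real provisioning service may construct the patch differently.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// tenantPublic mirrors the CRD fields named in the commit message.
type tenantPublic struct {
	ParentDomain   string `json:"parentDomain"`
	Subdomain      string `json:"subdomain"`
	BackendService string `json:"backendService"`
	BackendPort    int    `json:"backendPort"`
	Product        string `json:"product"`
}

type orgPatch struct {
	Spec struct {
		TenantPublic tenantPublic `json:"tenantPublic"`
	} `json:"spec"`
}

// tenantPublicPatch returns a merge-patch body, or nil when parentDomain
// is empty (the patch is disabled, legacy tenants keep the wildcard route).
func tenantPublicPatch(parentDomain, slug, backendService, product string) ([]byte, error) {
	if parentDomain == "" {
		return nil, nil
	}
	var p orgPatch
	p.Spec.TenantPublic = tenantPublic{
		ParentDomain:   parentDomain,
		Subdomain:      slug, // subdomain = slug
		BackendService: backendService,
		BackendPort:    80,
		Product:        product,
	}
	return json.Marshal(p)
}

func main() {
	// Example values are made up for the demo.
	body, err := tenantPublicPatch("omani.homes", "acme", "acme-web", "wordpress")
	fmt.Println(string(body), err)
}
```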
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wires the sandbox-controller (PR #1622) to actually mint per-Sandbox
LLM-gateway tokens via the catalyst-api bridge handler shipped in
PR #1638, replacing the Wave 1 placeholder Secret with a real
LLM_GATEWAY_TOKEN-bearing manifest pushed to the per-Org Gitea repo.
Changes:
- New newapi.Client (core/controllers/sandbox/internal/newapi/) —
thin HTTP client for POST /admin/tokens/sandbox with the bridge's
{org_id, user_id, sandbox_id, allowed_channels} body + Bearer
ADMIN_SECRET auth. Interface so tests can stub.
- Reconciler extended:
* NewAPIClient + DefaultChannels + TokenRotationLeadTime fields
* On every reconcile: decide mint-or-skip from annotation
openova.io/sandbox-token-expires-at vs. now + lead-time
* On mint: POST to bridge, stamp expires-at + rotated-at
annotations on the CR, render token bytes into a new
gitops manifest secret-newapi-token.yaml committed to the
per-Org catalyst-tenant repo at sandbox/<owner-uid>/
* Bridge failure → Failed/TokenMintFailed condition + 30s
requeue + no gitops writes (fail-loud)
* Empty DefaultChannels → NoAllowedChannels condition (fail
earlier than the bridge's 400)
- gitops.Render:
* New Inputs.NewAPIToken/NewAPITokenSecretName/NewAPITokenExpiresAt
/NewAPITokenRotatedAt fields
* New secret-newapi-token.yaml template — Secret with
stringData.LLM_GATEWAY_TOKEN + expires-at annotation +
optional kubectl.kubernetes.io/restartedAt rotation marker
so Wave 2's pty-server StatefulSet picks up rolling
restarts on token rotation
* kustomization.yaml appends the new manifest when token
present
- Chart wiring (platform/sandbox/chart):
* Deployment env: NEWAPI_BASE_URL, NEWAPI_ADMIN_SECRET
(secretKeyRef from newapi-bp-newapi-token-signing-key,
optional: true), NEWAPI_DEFAULT_CHANNELS
* ClusterRole bumped to allow update/patch on the
sandboxes/ resource (the controller now stamps annotations
on the CR)
- platform/newapi/chart/templates/sandbox-token-signing-key-secret.yaml:
* Added emberstack/reflector annotations so the chart-emitted
Secret (newapi namespace) mirrors into the sandbox-controller
namespace by default; reflectorNamespaces is overrideable.
Tests:
- newapi client: happy-path round-trip, 401 surfaces, input
validation, request validation. 4 cases.
- sandbox-controller: existing Wave 1 cases (happy/idempotent/
drift/missing) still pass; 5 new cases for the token path:
fresh mint + Secret render, rotation on near-expiry, steady-
state no-mint, bridge failure surfaces condition, no-channels
misconfig fails early. 9 cases total, all green.
Hard rules honored:
- No Chart.yaml bump (chart pinning is a release-driver concern)
- go build + go test ./core/controllers/sandbox/... clean
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wave 9 regression gate for the Sandbox UI scaffold shipped in PR #1621.
Covers four happy-path surfaces:
- Sidebar Sandbox entry exists + accent-active class on /sandbox
- Landing renders 6 agent cards (aider / claude-code / cursor-agent /
little-coder / opencode / qwen-code) with Connect Claude Max CTA
- /sandbox/settings BYOS Connect button when disconnected
- /sandbox/$id route resolves + create POST sends agent=aider
Auth gate, deployment self-discovery, SSE events, and sandbox API are
all mocked via page.route so the spec runs against `npm run dev` (Vite
on :5173) with no catalyst-api required. Per-test timeout bumped to 90s
to absorb Vite's cold-cache xterm/tanstack-router module load.
Sovereign-mode env vars required for SovereignSidebar to render:
VITE_CATALYST_MODE=sovereign \
VITE_SOVEREIGN_FQDN=sandbox.example.test \
npm run dev
Local result: 4/4 passed in 2.1m (warm cache).
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(pkg/gitea): add ListPullRequests + GetPullRequest read API
Wave 8 prerequisite for openova-sandbox-mcp's gitea.pr.list +
gitea.pr.get tools. Mirrors the existing client surface
(CreatePullRequest, ListOrgRepos) with state-filtered pagination and
a get-by-number fetch.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(sandbox): real impls for gitea.* + k8s.read.* MCP tools (was not_implemented stubs)
Wave 8 swaps the openova-sandbox-mcp Wave-2 not_implemented stubs for
production-ready handlers on:
- gitea.repo.list / gitea.repo.get (delegates to core/controllers/pkg/gitea)
- gitea.pr.list / gitea.pr.get (delegates to new ListPullRequests +
GetPullRequest helpers in pkg/gitea; org-scope check rejects cross-tenant
owner overrides at tool dispatch time)
- k8s.read.get / k8s.read.list / k8s.read.watch (dynamic.Interface against
the Sandbox pod's in-cluster SA or SANDBOX_KUBECONFIG; watch is a
bounded short-watch — long-lived subs land Wave 9 via MCP
resources/subscribe)
- sandbox.session.whoami / sandbox.session.info (echo per-call Claims +
Sandbox metadata so the agent can self-discover its scope)
Auth: every tools/call carries a bearer (via _auth.token arg OR
SANDBOX_TOKEN env). main.go validates HS256 against SANDBOX_JWT_SECRET
using the canonical core/services/shared/auth.Claims shape (PR #1619),
strips _auth from the args, installs Claims on ctx, then Registry.Call
gates on capability + org_id-match before reaching the handler.
sandbox.session.* skips the org-scope check (the operator's session
is the operator's regardless of which Org slug their claim carries).
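
The bearer handling described above (token from the `_auth.token` argument or the SANDBOX_TOKEN env value, with `_auth` stripped before dispatch) can be sketched as follows. The helper name, its signature, and the argument-over-env precedence are assumptions; JWT validation and the capability gate are left out.

```go
package main

import (
	"errors"
	"fmt"
)

// extractBearer returns the bearer token and a copy of args without the
// _auth key. envToken stands in for the SANDBOX_TOKEN env value.
// Assumption: a per-call token takes precedence over the env token.
func extractBearer(args map[string]any, envToken string) (string, map[string]any, error) {
	cleaned := make(map[string]any, len(args))
	for k, v := range args {
		if k != "_auth" {
			cleaned[k] = v
		}
	}
	token := envToken
	if auth, ok := args["_auth"].(map[string]any); ok {
		if t, ok := auth["token"].(string); ok && t != "" {
			token = t
		}
	}
	if token == "" {
		return "", cleaned, errors.New("no bearer token supplied")
	}
	return token, cleaned, nil
}

func main() {
	args := map[string]any{
		"owner": "acme",
		"_auth": map[string]any{"token": "per-call-token"},
	}
	token, cleaned, err := extractBearer(args, "env-token")
	fmt.Println(token, cleaned, err)
}
```

Copying the arguments rather than deleting in place keeps the caller's map untouched, so the handler never sees the credential.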
Stubs retained (Wave 8+):
- sandbox.db.* (CNPG Cluster CR provisioning)
- sandbox.auth.* (Keycloak realm/client management)
- gitea.pr.create / gitea.pr.merge / gitea.issue.* / gitea.release.*
- k8s.read.logs
Hard rule preserved: k8s.write.* never lands in the MCP surface.
24 new tests (registry catalogue completeness, auth gate, gitea via
httptest stub, JWT round-trip, env-var parsing).
Builds clean against go 1.23 + k8s.io/client-go v0.31.1; module wires
core/controllers + core/services/shared via the same replace pattern
catalyst-bootstrap and every sme-service already use.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PowerDNS now resolves <slug>.<parentDomain> for every Org mapped onto a
Sovereign's role=sme-pool parent domain (PR #1629), but no HTTPRoute was
attaching that hostname to the tenant's installed product Service. The
Cilium Gateway terminated TLS on the wildcard cert and fell through to
the marketplace tenant-wildcard route — serving the storefront landing
page instead of the tenant's WordPress / Nextcloud / GitLab install.
Fix:
1. Extend Organization CRD with optional spec.tenantPublic
(parentDomain, subdomain, backendService, backendPort, product).
2. organization-controller renders a Gateway-API HTTPRoute in the Org
namespace (= slug) attached to cilium-gateway/kube-system when
parentDomain is set. Skipped silently when unset so existing Orgs
keep working.
3. Chart-side templates/sme-services/tenant-public-routes.yaml renders
the same HTTPRoute shape from .Values.tenantRoutes[] for operators
that prefer static fixtures over the controller's reconcile loop.
4. Tests: TestReconcile_TenantPublic_RendersHTTPRoute and
TestReconcile_TenantPublic_DisabledByDefault cover both paths.
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR #1633 added the Sandbox app to seedApps but never wired the matching plan
rows. The marketplace checkout hit "plan_id not found" the moment a customer
picked Sandbox, and PR #1639's sandbox-orchestrator could only mint CRs with
the Wave 1 baseline quota regardless of the picked tier.
This PR closes both gaps in lockstep:
Catalog:
- Plan struct gets ProductSlug + IncludedQuotas fields (back-compat:
omitempty BSON tags so legacy rows decode fine).
- expectedSandboxPlans() helper canonical-defines the three tiers:
sandbox-free 0 OMR 1 session, 1 agent, 5 GB, BYOS
sandbox-pro 9 OMR 3 sessions, 6 agents, 50 GB, BYOS (Popular)
sandbox-ent 49 OMR unlimited, 6 agents, 500 GB, BYOS
- seedAllData appends them on fresh seed; seedMissingSandboxPlans
backfills them on already-populated Sovereigns (idempotent GET-then-
create, patches missing ProductSlug/IncludedQuotas on legacy rows).
- UpdatePlan persists the two new fields.
Sandbox orchestrator wiring:
- SandboxRequestedPayload.PlanID added; CreateOrg forwards body.PlanID.
- buildSandbox stamps openova.io/plan-id annotation + spec.planId when
PlanID is non-empty.
- quotaForPlan() maps sandbox-{free,pro,ent} → SandboxQuota; empty or
unknown plan_id falls through to DefaultQuota (Wave 1 baseline =
Sandbox Free shape). Hard-coded map mirrors catalog IncludedQuotas so
tenant-service avoids a compile-time dep on the catalog mongo stack.
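
A minimal sketch of the plan-to-quota lookup described above. The `SandboxQuota` fields, the `-1` marker for "unlimited", and passing the default as a parameter are assumptions; the session and storage figures are the ones in the tier table.

```go
package main

import "fmt"

// SandboxQuota is a hypothetical, trimmed-down quota shape.
type SandboxQuota struct {
	ConcurrentSessions int // unlimitedSessions means no cap (assumption)
	StorageGB          int
}

const unlimitedSessions = -1

// planQuotas hard-codes the tier ladder so the caller needs no catalog
// lookup at runtime.
var planQuotas = map[string]SandboxQuota{
	"sandbox-free": {ConcurrentSessions: 1, StorageGB: 5},
	"sandbox-pro":  {ConcurrentSessions: 3, StorageGB: 50},
	"sandbox-ent":  {ConcurrentSessions: unlimitedSessions, StorageGB: 500},
}

// quotaForPlan returns the tier quota, or def for an empty or unknown ID.
func quotaForPlan(planID string, def SandboxQuota) SandboxQuota {
	if q, ok := planQuotas[planID]; ok {
		return q
	}
	return def
}

func main() {
	def := SandboxQuota{ConcurrentSessions: 1, StorageGB: 5}
	for _, id := range []string{"sandbox-pro", "sandbox-ent", "", "sandbox-typo"} {
		fmt.Printf("%q -> %+v\n", id, quotaForPlan(id, def))
	}
}
```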
Tests:
- TestExpectedSandboxPlans_Shape locks slugs, prices, quota keys, the
Popular flag (sandbox-pro), and the quota ladder.
- TestSandboxHandle_PlanIDStampsAnnotationAndQuota table-test exercises
all three tiers end-to-end (annotation + spec.planId + spec.quota).
- TestSandboxHandle_PlanIDEmptyKeepsDefaultQuota guards back-compat
with pre-PR publishers.
- TestSandboxHandle_PlanIDUnknownFallsBackToDefault guards typo'd /
retired plan IDs.
go build + go test clean for catalog, tenant, billing, provisioning,
shared, marketplace-api.
No Chart.yaml bump, no cluster touch.
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wave 8 extension to PR #1622 (Wave-1 sandbox-controller). The previous
slice reconciled a Sandbox CR into namespace + ResourceQuota + RBAC +
PVCs + placeholder Secret — but NO pty-server, NO MCP server. A freshly-
created Sandbox sat there with empty plumbing and no way for the user
to actually run a coding session.
This PR completes the per-Sandbox runtime by extending
core/controllers/sandbox/internal/gitops/manifests.go to render the
four manifests architecture.md §7 enumerates:
- StatefulSet pty-server (replicas = spec.quota.concurrentSessions,
one Pod per in-flight session per architecture.md §1/§2). Env wired
per newapi-proxy-contract.md §1: SANDBOX_OWNER_UID, ORG_ID,
SOVEREIGN_FQDN, NEWAPI_URL, LLM_GATEWAY_URL / OPENAI_BASE_URL,
LLM_GATEWAY_TOKEN / OPENAI_API_KEY from per-sandbox Secret
(key llm-gateway-token, optional). When claude-code is in
spec.agentCatalogue, ANTHROPIC_API_KEY is ALSO wired from the
per-user BYOS Secret `sandbox-byos-claude-code-<owner-uid>` (key
access_token, optional) per claude-code-byos.md §3. Repo PVCs mount
at /workspace/<repo-slug>.
- Deployment openova-sandbox-mcp (architecture.md §3). Companion MCP
server, talks to pty-server via the in-namespace ClusterIP Service.
- Service pty-server (ClusterIP :7681) — backend for both the MCP
Deployment and the HTTPRoute.
- HTTPRoute pty-server — publishes
sandbox.<sov-fqdn>/sessions/<owner-uid>/* → pty-server :7681 via
the existing catalyst-public Cilium Gateway in catalyst-system.
PathPrefix rewrite strips /sessions/<owner-uid> so pty-server sees
its own /sessions/<id> surface.
Knobs are env-plumbed from the chart per Inviolable Principle #4:
- SANDBOX_PTY_SERVER_IMAGE / SANDBOX_MCP_IMAGE — SHA-pinned image
refs from values.runtime.{ptyServerImage,mcpImage} (fails Helm
render fast on empty, no silent :latest).
- SANDBOX_NEWAPI_URL — from values.runtime.newapiURL (bootstrap-kit
overlay derives it from ${SOVEREIGN_FQDN}).
- SANDBOX_LLM_GATEWAY_TOKEN_SECRET / SANDBOX_BYOS_SECRET_PREFIX /
SANDBOX_IDLE_TIMEOUT_MINUTES — optional with architecture-doc
defaults.
Idle timeout (architecture.md §7) lands as a StatefulSet annotation
openova.io/sandbox-idle-timeout-minutes — the poll-loop that actually
scales the StatefulSet down on idle ships in a sibling PR (out of
scope for "spawn the Pods"; this PR makes the Pods exist).
Tests cover the full Wave-8 manifest shape: replicas count, identity
env keys, BYOS gating on spec.agentCatalogue, HTTPRoute hostname
binding, kustomization stitching, idempotency. go test
./core/controllers/sandbox/... green; helm template renders cleanly +
required guard fires on missing runtime values.
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Issue #831 follow-on to #827. Previously the Cilium Gateway declared a
single listener pair on `*.${SOVEREIGN_FQDN}` only — tenant URLs under
non-primary parent zones (e.g. wp-foo.omani.homes when the operator
brings omani.homes as the SME pool) hit cilium-envoy's default fallback
cert and failed the TLS handshake on a hostname mismatch. The per-zone
wildcard Secret rendered by
products/catalyst/chart/templates/sovereign-wildcard-certs.yaml
(PR #827) existed but had no Gateway listener claiming its hostname.
Fix: render one listener pair (HTTPS:30443 + HTTP:30080) per parent
zone. Materialised at Terraform plan time as a JSON-flow array
(infra/hetzner/main.tf locals.parent_domains_listeners_yaml — jsonencode
of the listener objects iterating decoded parent_domains_yaml), threaded
through Flux postBuild.substitute as PARENT_DOMAINS_LISTENERS_YAML, and
consumed as a scalar value at `listeners: ${PARENT_DOMAINS_LISTENERS_YAML}`
in cilium-gateway.yaml. Each pair's certificateRefs target the per-zone
Secret `sovereign-wildcard-tls-<sanitised-zone>` so listener + cert stay
in lockstep.
Scalar placeholder (not multi-line block) because kustomize-build parses
the YAML before Flux runs envsubst — a placeholder on its own line at
column 0 fails YAML parse. Scalar `${VAR}` parses cleanly; envsubst then
swaps it for the JSON-flow array string, which the apiserver parses as
the real listener list.
Single-zone fallback preserved (var.parent_domains_yaml empty →
[{name: <sovereign_fqdn>, role: primary}]) so legacy single-zone
provisions render 2 listeners (1 HTTPS + 1 HTTP). Multi-zone provisions
(e.g. primary omani.works + sme-pool omani.homes) render 4 listeners.
Verification:
- kubectl kustomize clusters/_template/sovereign-tls/ → clean
- End-to-end simulation (single-zone, two-zone) renders correct
listener counts (2 / 4) with correct certificateRefs per zone.
- Listener naming `https-<sanitised>` / `http-<sanitised>` is unique
per listener so Gateway controller programs them all (duplicate
names produce Conflicting status condition).
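
The real rendering happens in Terraform (jsonencode in infra/hetzner/main.tf). Purely to illustrate the per-zone listener pairing and the JSON-flow output, here is a Go sketch. The flattened `listener` struct, the dots-to-dashes sanitisation, and the `*.<zone>` hostname pattern are assumptions; the ports, name prefixes, and Secret name pattern come from the commit message.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// listener is a simplified stand-in for a Gateway API listener; the real
// object nests the certificate under tls.certificateRefs.
type listener struct {
	Name       string `json:"name"`
	Hostname   string `json:"hostname"`
	Port       int    `json:"port"`
	Protocol   string `json:"protocol"`
	CertSecret string `json:"certSecret,omitempty"`
}

// sanitise turns a zone into a name-safe suffix (assumed rule: dots to dashes).
func sanitise(zone string) string {
	return strings.ReplaceAll(strings.ToLower(zone), ".", "-")
}

// listenersFor emits one HTTPS + one HTTP listener per parent zone.
func listenersFor(zones []string) []listener {
	out := make([]listener, 0, 2*len(zones))
	for _, z := range zones {
		s := sanitise(z)
		out = append(out,
			listener{Name: "https-" + s, Hostname: "*." + z, Port: 30443,
				Protocol: "HTTPS", CertSecret: "sovereign-wildcard-tls-" + s},
			listener{Name: "http-" + s, Hostname: "*." + z, Port: 30080,
				Protocol: "HTTP"},
		)
	}
	return out
}

func main() {
	encoded, err := json.Marshal(listenersFor([]string{"omani.works", "omani.homes"}))
	if err != nil {
		panic(err)
	}
	// A single-line JSON array is also a valid YAML flow sequence.
	fmt.Println(string(encoded))
}
```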
Files:
- clusters/_template/sovereign-tls/cilium-gateway.yaml (scalar
listeners placeholder + comment block explaining the why)
- infra/hetzner/main.tf (locals.parent_domains_decoded +
locals.parent_domains_listeners_yaml; threaded into primary CP and
secondary regions' templatefile() calls)
- infra/hetzner/cloudinit-control-plane.tftpl (PARENT_DOMAINS_LISTENERS_YAML
substitute var in sovereign-tls Kustomization block)
Co-authored-by: hatiyildiz <269457768+hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR #1633 wired CreateOrg to publish `tenant.sandbox_requested` when the
marketplace cart includes the sandbox product. Nobody was subscribing —
the event landed in NATS `catalyst.tenant.sandbox_requested` and aged
out unread, so no Sandbox CR (PR #1622) was ever minted and the
customer sat on a "Provisioning…" spinner forever.
This slice closes the loop. A new SandboxOrchestrator in tenant-service:
- Subscribes via events.MultiSubscriber (PR #1636) to the canonical
NATS subject + legacy Kafka topic.
- Parses {tenant_id, org_slug, owner_id, owner_email, agents,
sovereign, requested_at} and resolves the owner email (event field
→ store.GetMemberEmail → owner_id fallback).
- Materialises a Sandbox CR in catalyst-system (SANDBOX_NAMESPACE
override) via a dynamic client, with spec per architecture §7:
owner.email + owner.orgRef.slug, default quota (4 CPU / 8 Gi /
50 Gi / 3 sessions), spec.agentCatalogue from the cart.
- Idempotent: Get-then-Create with AlreadyExists swallowed so NATS
redeliveries + duplicate marketplace submits stay no-ops; the
sandbox-controller remains SoR for spec mutations.
Wiring in main.go is best-effort: when neither in-cluster config nor
KUBECONFIG is available (CI / dev loops) the orchestrator is skipped
with a Warn; the rest of the tenant service still boots.
Hard rules: no chart bump, no cluster writes outside of the Sandbox
Create call (sandbox-controller reconciles the rest), `go build ./...`
clean, `go test ./...` clean.
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces the Wave 1b stub that echoed the inbound PAT verbatim with a
real HS256 mint flow the sandbox-controller can call when it rolls out
a fresh Sandbox Pod.
Handler (platform/newapi/internal/handler/sandbox_token.go):
- Caller auth: shared admin-secret bearer (env NEWAPI_ADMIN_SECRET),
constant-time compared. 401 on mismatch / missing bearer.
- Request body: {org_id, user_id, sandbox_id, allowed_channels[]}.
De-duplicates + scrubs empty channel names so a controller bug
sending [""] can't mint a token that NewAPI silently treats as
"no restriction".
- Mints HS256 JWT signed with NEWAPI_TOKEN_SIGNING_KEY. Claim shape:
{sub: sandbox_id, org: org_id, user: user_id, channels: [...],
iat, exp: iat+7d, typ: "sandbox"}.
- Returns {token, expires_at}.
- Refuses with 503 when SigningKey or AdminSecret is unset
(visible chart-wiring gap, not a forgeable-token leak).
- Removes the previous Claims/jwt.Parse PAT-validation path that
came with the stub — caller is the controller, not an operator.
- NewHandlerFromEnv() factory loads + validates env at process
start so catalyst-api can fail loudly instead of shipping the
endpoint silently.
Unit tests (sandbox_token_test.go) — 11 cases:
- happy path (mint + claim shape + signature round-trip)
- de-dup + empty-channel scrub
- admin-secret mismatch / missing bearer → 401
- missing org_id / user_id / sandbox_id / empty channels → 400
- non-POST → 405
- unset env → 503
- mintSandboxToken empty-secret guard + round-trip
- response does not echo admin secret or signing key
Chart wiring (platform/newapi/chart):
- New Secret template sandbox-token-signing-key-secret.yaml
auto-renders with Helm `lookup` + helm.sh/resource-policy: keep
(same load-bearing pattern as credentials-secret.yaml #943 and
gitea admin-secret.yaml #830 Bug 2). 64-char alphanumeric values
for both SIGNING_KEY and ADMIN_SECRET; persistence across
reconciles is required because a reconcile-time rotation would
silently invalidate every per-Sandbox token across the Sovereign
AND break the sandbox-controller's auth path until its Pod
restarts.
- values.yaml block sandboxTokenSigningKey.{existingSecret,
autoProvision, autoSecretName} matching the `credentials`
convention (operator override > auto-provision > skip-render).
- No Chart.yaml bump — chart value addition only.
Verification:
- go build ./platform/newapi/internal/handler/... — clean
- go test ./platform/newapi/internal/handler/... — 11/11 PASS
- helm template platform/newapi/chart — Secret renders
How sandbox-controller will use it:
1. Read NEWAPI_ADMIN_SECRET from mounted Secret newapi-token-signing-key.
2. POST /admin/tokens/sandbox with bearer + body
{org_id: <Sandbox.spec.owner.orgRef.slug>,
user_id: <Sandbox.spec.owner.email>,
sandbox_id: <Sandbox.metadata.uid>,
allowed_channels: ["qwen3.6-bankdhofar"]}.
3. Write returned token into Secret/sandbox-<uid>-newapi-token.
4. Mount that Secret into the Sandbox Pod as LLM_GATEWAY_TOKEN.
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wires the catalyst-api backend the Sandbox FE (PR #1621 — getSandboxes /
createSandbox / getByosStatus in sandbox.api.ts) has been calling into.
Without this handler the /sandbox surface on the Sovereign Console rendered
its empty state forever — every getSandboxes() 404'd at the catalyst-api
ingress and every "Start a session" click hit the same wall.
Handler — products/catalyst/bootstrap/api/internal/handler/sandbox_sessions.go
- GET /api/v1/sandbox/sessions — list Sandbox CRs in the
operator's Org namespace
- POST /api/v1/sandbox/sessions — create Sandbox CR with agent
validated against the 6-agent
catalogue (aider / claude-code /
cursor-agent / little-coder /
opencode / qwen-code)
- GET /api/v1/sandbox/sessions/{id} — fetch single Sandbox detail
- DELETE /api/v1/sandbox/sessions/{id} — graceful delete (the controller
fires finalizers + cleans up
the per-Sandbox vcluster
namespace + PVCs + RBAC)
Client resolution mirrors the Family E compliance + k8s_resource_actions.go
seam: k8sCache.Factory.DynamicClientFor(resolveChrootClusterID("")) is the
primary path; sovereignDepsFor() — rest.InClusterConfig() — is the chroot
in-cluster fallback per feedback_chroot_in_cluster_fallback.md. Both 503
when unavailable so the FE renders its "API pending" pill rather than a
spinner.
Org-scoping uses claims.Org (the org_id Keycloak claim PR #1619 lit up)
for the CR namespace + spec.owner.orgRef.slug. Single-tenant chroots
without an org_id fall back through CATALYST_SANDBOX_DEFAULT_NAMESPACE
to a sensible default per docs/INVIOLABLE-PRINCIPLES.md #4. Wave-1 quota
defaults (4 CPU / 8Gi memory / 50Gi storage / 3 concurrent sessions)
mirror products/sandbox/docs/architecture.md §7 — the FE doesn't yet
expose a quota picker.
Status projection: CRD vocabulary (Pending|Provisioning|Ready|Failed)
maps to FE vocabulary (pending|running|stopped|failed|unknown) in
mapSandboxStatus so a fresh Sandbox shows the spinner rather than
"unknown" until the controller catches up.
k8sCache.DefaultKinds — products/catalyst/bootstrap/api/internal/k8scache/kinds.go
- Adds sandbox.openova.io/v1 Sandbox so the generic /k8s/{kind} surface
enumerates Sandboxes the same way it does Applications + UserAccess.
Per feedback_chroot_in_cluster_fallback.md every new GVR here needs a
matching rule on the cutover-driver SA.
Cutover-driver RBAC — products/catalyst/chart/templates/clusterrole-cutover-driver.yaml
- Adds sandboxes.sandbox.openova.io with verbs split per
feedback_rbac_create_no_resourcenames.md:
rule 1: ["create"]
rule 2: ["get","list","watch","delete"]
- Read-only on status (the controller owns status); write is spec-only
on POST + the apiserver delete on DELETE.
Routes — products/catalyst/bootstrap/api/cmd/api/main.go
- Registered inside the RequireSession group alongside the existing
/api/v1/sandbox/byos/claude-code/* surface; same auth gate, same
patternless leading "/api/v1/sandbox/...".
Verified: go build clean, go vet clean, k8scache test suite green
(2.7s), helm template renders the new RBAC block.
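
The status projection described above might be sketched as a plain string mapping. The commit names the two vocabularies but not the individual pairs, so the pairs below (Provisioning and empty phase to "pending", Ready to "running") are assumptions, and the string parameter is a simplification of the handler's signature.

```go
package main

import "fmt"

// mapSandboxStatus projects a CRD phase onto the FE status vocabulary.
// Assumed pairs: an empty phase (controller has not reported yet) and
// Provisioning both render as "pending"; Ready renders as "running".
func mapSandboxStatus(rawPhase string) string {
	switch rawPhase {
	case "", "Pending", "Provisioning":
		return "pending"
	case "Ready":
		return "running"
	case "Failed":
		return "failed"
	default:
		return "unknown"
	}
}

func main() {
	for _, phase := range []string{"", "Pending", "Provisioning", "Ready", "Failed", "Bogus"} {
		fmt.Printf("%q -> %s\n", phase, mapSandboxStatus(phase))
	}
}
```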
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>