On t22 (omantel.biz fresh Sovereign) 2 of 15 HTTPRoutes went
Accepted=False because their parentRef pointed at a gateway that
does not exist on any Sovereign:
catalyst-system/guacamole-server -> gateway-system/cilium-gateway
catalyst-system/openova-flow-server -> kube-system/catalyst-gateway
The canonical Sovereign Gateway is kube-system/cilium-gateway,
installed by bootstrap-kit/01-cilium.yaml and used by every other
HTTPRoute (catalyst-api, catalyst-ui, marketplace, gitea, harbor,
keycloak, grafana, hubble-ui, openbao, powerdns, tenant-wildcard).
gateway-system does not exist; catalyst-gateway does not exist.
Fixes:
- platform/guacamole/chart/values.yaml — default
guacamole.httproute.parentRef.namespace: gateway-system -> kube-system
- clusters/_template/bootstrap-kit/56-bp-openova-flow-server.yaml —
flowServer.httproute.gatewayRef.name: catalyst-gateway -> cilium-gateway
(namespace already kube-system, untouched)
Verified on t22: all 15 HTTPRoutes now Accepted=True after chart bump
+ Flux reconcile.
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Every services-build run since 2026-05-18 06:32 UTC failed with
"go: go.mod requires go >= 1.26.0 (running go 1.22.12; GOTOOLCHAIN=local)"
because a recent go.mod bump to `go 1.26.0` was not paired with a
Containerfile base-image bump.
5 stranded fixes never produced new image SHAs:
- PR #1683 fix(billing): consume catalyst.usage.recorded from
CATALYST_SME stream (was creating overlapping CATALYST_USAGE)
- PR #1684 fix(provisioning): set Organization.spec.tenantPublic
- PR #1685 fix(catalog+billing): Sandbox Free/Pro/Ent plans + quota
- PR #1686 feat(sandbox): orchestrator listens tenant.sandbox_requested
- test(sandbox): integration tests for orchestrator + sessions API
The stranded billing image is the root cause of every voucher 502 on
t22 and blocks the full marketplace customer journey (steps 9, 10, 15
all fail). t22 billing Pod is in CrashLoopBackOff with the exact NATS
subject-overlap signature PR #1683 fixes.
Bumps all 10 service Containerfiles (auth/billing/catalog/catalyst-
catalog/domain/gateway/metering-sidecar/notification/provisioning/
tenant) to golang:1.26-alpine, matching the toolchain in go.mod.
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Stop handing every Sandbox session the full MCP surface. Each per-Sandbox
NewAPI token now carries a plan-derived capability allowlist that the MCP
server enforces against per-tool RequiredCapability via Claims.HasCapability:
- Free: read-only k8s + gitea read + session/rag/skills
- Pro: + sandbox.db.* + sandbox.storage.* + sandbox.preview.* +
sandbox.auth.* + sandbox.secrets.* + marketplace.* + flux.status
- Ent: + sandbox.deploy.{staging,production,...} + sandbox.stripe.* +
flux.{reconcile,suspend,resume} + gitea.pr.{create,merge} +
gitea.issue.*
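The ladder above is cumulative: each tier inherits the grants of the tier below it. A minimal Go sketch of that shape follows. The function names, the "free"/"pro"/"ent" plan IDs, and the Free capability strings are hypothetical placeholders, the elided `sandbox.deploy` targets are left out, and returning nil for an unknown plan is an assumption of the sketch rather than the behaviour of sandboxapi.CapabilitiesForPlan.

```go
package main

import "fmt"

// ladder lists plan tiers from lowest to highest.
var ladder = []string{"free", "pro", "ent"}

// tierGrants holds what each tier adds on top of the tier below it.
// The Free names are hypothetical placeholders; the Pro and Ent entries are
// taken from the ladder above, with the elided sandbox.deploy targets omitted.
var tierGrants = map[string][]string{
	"free": {"k8s.read", "gitea.read", "session.*", "rag.*", "skills.*"},
	"pro": {
		"sandbox.db.*", "sandbox.storage.*", "sandbox.preview.*",
		"sandbox.auth.*", "sandbox.secrets.*", "marketplace.*", "flux.status",
	},
	"ent": {
		"sandbox.deploy.staging", "sandbox.deploy.production", "sandbox.stripe.*",
		"flux.reconcile", "flux.suspend", "flux.resume",
		"gitea.pr.create", "gitea.pr.merge", "gitea.issue.*",
	},
}

// capabilitiesForPlan accumulates grants up to and including the given tier.
// Returning nil for an unknown plan is an assumption of this sketch.
func capabilitiesForPlan(plan string) []string {
	var caps []string
	for _, tier := range ladder {
		caps = append(caps, tierGrants[tier]...)
		if tier == plan {
			return caps
		}
	}
	return nil
}

// resolveCapabilities prefers an explicit spec override and otherwise
// falls back to the plan-derived list.
func resolveCapabilities(specCaps []string, plan string) []string {
	if len(specCaps) > 0 {
		return specCaps
	}
	return capabilitiesForPlan(plan)
}

func main() {
	for _, plan := range ladder {
		fmt.Printf("%s: %d grants\n", plan, len(capabilitiesForPlan(plan)))
	}
	fmt.Println(resolveCapabilities([]string{"flux.status"}, "ent"))
}
```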
Wiring:
- Sandbox CRD spec gains planId + capabilities[] (operator overlay).
- Sandbox sandboxapi.{CapabilitiesForPlan,ResolveCapabilities} is the
SoT; tenant orchestrator carries an exact-mirror capabilitiesForPlan
(no controllers-module dep — same isolation pattern quotaForPlan
uses).
- sandbox-controller threads spec.capabilities (falling back to plan)
into newapi.MintRequest.
- catalyst-api bridge handler accepts capabilities[] on the wire and
encodes it as the JWT `capabilities` claim (omitted when empty).
- Claims.HasCapability gains wildcard prefix matching (`sandbox.db.*`
satisfies `sandbox.db.provision`, `sandbox.db`, etc.) so plan grants
stay coarse. Plain stem matches WITHOUT a wildcard are intentionally
rejected — the production second-gate in sandbox_deploy.go stays
honest.
- MCP registry: every gated tool now carries its granular dotted
RequiredCapability (`sandbox.db.provision`, `gitea.pr.list`, …).
Read-only / session tools previously ungated also get granular
grants so Free tokens can browse without inheriting the write
surface.
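The wildcard rule described above can be sketched as a small matcher. This is an illustration only, assuming grants are plain strings: `hasCapability` is a hypothetical free function, not the actual Claims.HasCapability method, and treating an exact string match as sufficient is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// hasCapability reports whether any grant satisfies the required capability.
// A grant ending in ".*" matches its stem and anything below the stem.
// A grant without a wildcard matches only by exact equality, so a plain stem
// such as "sandbox.deploy" does not unlock "sandbox.deploy.production".
func hasCapability(grants []string, required string) bool {
	for _, g := range grants {
		if g == required {
			return true
		}
		if strings.HasSuffix(g, ".*") {
			stem := strings.TrimSuffix(g, ".*")
			if required == stem || strings.HasPrefix(required, stem+".") {
				return true
			}
		}
	}
	return false
}

func main() {
	grants := []string{"sandbox.db.*", "sandbox.deploy"}
	for _, req := range []string{
		"sandbox.db.provision",
		"sandbox.db",
		"sandbox.dbx.read",
		"sandbox.deploy.production",
	} {
		fmt.Printf("%-28s %v\n", req, hasCapability(grants, req))
	}
}
```

Matching on `stem + "."` rather than on the bare stem keeps a sibling namespace such as `sandbox.dbx` from being swept in by a `sandbox.db.*` grant.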
No Chart.yaml bump — CRD additions are additive; existing Sandbox CRs
parse fine. An empty token capabilities list downgrades the token to
introspection only, matching pre-PR-#1671 callers.
Tests: shared/auth/claims_test.go (wildcard matrix),
sandboxapi/capabilities_test.go (plan ladder + spec override),
sandbox_token_test.go (capabilities round-trip + omit-on-empty),
sandbox_controller_test.go (plan-derived + spec-override mint),
sandbox_consumer_test.go (orchestrator stamps spec.capabilities), plus
updates to every per-namespace registry test asserting new granular
RequiredCapability values.
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds Sandbox (sandbox.openova.io/v1) + sandbox-component Pod informers
to the openova-flow-adapter-flux DaemonSet, plus a kind-aware glyph in
the FlowCanvas. Each Sandbox CR renders as a bubble pulled under its
tenant-org (`contains` edge to <region>:org:<slug>). Each pty-server
/ openova-sandbox-mcp Pod renders under its parent Sandbox.
Adapter (adapter-flux):
- hr_informer.go: second factory watches all namespaces for the
new SandboxGVR + PodGVR (Sandboxes live in Org vclusters, not
flux-system). Mirrors the HR handler — dedupe on (id, status),
upsert-nodes + upsert-rels POST, delete-nodes on removal.
- sandbox_mapper.go: pure transform.
BuildFromSandbox: `.status.phase` → palette
(Pending/Provisioning/Ready/Failed → pending/running/succeeded/failed),
label = owner email, meta.kind = "Sandbox",
contains edge to <region>:org:<slug>.
BuildFromSandboxPod: only emits for pty-server / openova-sandbox-mcp
components (filter at mapper boundary so dedupe map stays clean);
meta.kind = "SandboxPod", contains edge to parent Sandbox (resolved
via `sandbox.openova.io/name` label or `sandbox-<name>` namespace
stem). CrashLoopBackOff / ImagePullBackOff → failed.
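The mapper rules above amount to a few pure lookups. A hedged sketch, with hypothetical function names (not the real BuildFromSandbox / BuildFromSandboxPod signatures); the fallback to "pending" for an unrecognised phase is an assumption the source does not state.

```go
package main

import "fmt"

// paletteForPhase maps a Sandbox .status.phase to the canvas palette.
// Falling back to "pending" for an unknown phase is an assumption.
func paletteForPhase(phase string) string {
	switch phase {
	case "Pending":
		return "pending"
	case "Provisioning":
		return "running"
	case "Ready":
		return "succeeded"
	case "Failed":
		return "failed"
	}
	return "pending"
}

// isSandboxComponent is the filter applied at the mapper boundary:
// only these two components produce SandboxPod nodes.
func isSandboxComponent(name string) bool {
	return name == "pty-server" || name == "openova-sandbox-mcp"
}

// podFailed reports whether a container waiting reason maps to "failed".
func podFailed(waitingReason string) bool {
	return waitingReason == "CrashLoopBackOff" || waitingReason == "ImagePullBackOff"
}

func main() {
	for _, p := range []string{"Pending", "Provisioning", "Ready", "Failed"} {
		fmt.Printf("%s -> %s\n", p, paletteForPhase(p))
	}
	fmt.Println(isSandboxComponent("pty-server"), podFailed("CrashLoopBackOff"))
}
```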
Canvas (FlowCanvas.tsx):
- NodeGlyph component swaps the bubble glyph by `node.meta.kind`:
Sandbox → terminal/monitor SVG matching the Sovereign sidebar's
Sandbox nav icon; SandboxPod → compact "›_" prompt; otherwise the
legacy ◇/✓/✗/◐/○ text glyph.
- data-meta-kind attribute on the bubble group for e2e selectors.
Test coverage:
- 11 new adapter mapper tests (Sandbox phase mappings, family-label
override, region fallback, missing-org fallback, Pod ready/not-ready
/CrashLoop/non-sandbox-skip/namespace-stem-parent).
- 3 new canvas tests (Sandbox glyph, SandboxPod glyph, legacy
fallback when meta.kind absent).
- Full suite GREEN: adapter-flux 11→22 tests, canvas 22→25 tests,
server tests unchanged 14 GREEN. go vet clean, tsc --noEmit clean.
No Chart.yaml bump — the adapter is shipped as part of the existing
openova-flow-adapter-flux DaemonSet image; the new GVRs are reconciled
in-process at next Pod roll.
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Context — t13/t17/t21 incident, 2026-05-17. catalyst-api is single-replica
with strategy: Recreate; the OpenTofu workdir lives on a /tmp emptyDir that
dies with the Pod. When this workflow bumped the image SHA mid-prov, Flux
rolled the Pod and killed `tofu apply` mid-resource. The on-disk record was
rewritten to status=failed by restoreFromStore on the new Pod, but Hetzner
resources tagged with the abandoned deployment-id stayed orphaned and
required manual `hcloud` cleanup. Three consecutive provs died this way in
one afternoon.
Option C (smallest blast radius): gate the deploy-bot at the workflow level.
1. New public endpoint GET /api/v1/deployments/in-flight-count on
catalyst-api. Returns {count, ids} of deployments in Phase-0 in-flight
status (pending / provisioning / tofu-applying / flux-bootstrapping).
Phase-1 (phase1-watching) is observational and resumes across Pod
restarts via resumePhase1Watch, so it does NOT block. Adopted
deployments are excluded. No FQDNs / owner emails in the response —
same information-disclosure posture as /api/v1/subdomains/check.
Unauthenticated; the deploy-bot has no session cookie.
2. .github/workflows/catalyst-build.yaml `deploy` job polls this endpoint
before bumping values.yaml. count==0 → green light. count>0 → sleep
20s and retry. Hard cap 30 min (a stuck prov must not block all
future deploys — that would be the worst possible failure mode for a
CI gate). Fail-open on any non-200 / network error so the gate
cannot itself become an outage.
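The gate's decision loop can be sketched in Go as follows. The real gate lives in the workflow's `deploy` job, so this is only an illustration of the logic: `fetchFunc`, `waitForGreenLight`, and the three outcome strings are hypothetical names, and all three outcomes mean "proceed with the deploy".

```go
package main

import (
	"fmt"
	"time"
)

const (
	pollInterval = 20 * time.Second // sleep between polls while count > 0
	hardCap      = 30 * time.Minute // a stuck prov must not block deploys forever
)

// fetchFunc is a hypothetical stand-in for the HTTP GET against
// /api/v1/deployments/in-flight-count; it returns the status code and count.
type fetchFunc func() (status, count int, err error)

// waitForGreenLight polls until no Phase-0 deployment is in flight.
// It fails open on any error or non-200 response and gives up waiting
// once the accumulated sleep time reaches hardCap.
func waitForGreenLight(fetch fetchFunc, sleep func()) string {
	for waited := time.Duration(0); ; waited += pollInterval {
		status, count, err := fetch()
		switch {
		case err != nil || status != 200:
			return "fail-open"
		case count == 0:
			return "green"
		case waited >= hardCap:
			return "cap-reached"
		}
		sleep()
	}
}

func main() {
	counts := []int{2, 1, 0} // two polls see in-flight deployments, then none
	i := 0
	fetch := func() (int, int, error) {
		c := counts[i]
		i++
		return 200, c, nil
	}
	// A real caller would pass func() { time.Sleep(pollInterval) }.
	fmt.Println(waitForGreenLight(fetch, func() {}))
}
```

Injecting `sleep` keeps the loop testable without waiting; the 404 returned before the endpoint exists takes the same fail-open branch as a network error.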
Notes:
- Mothership URL configurable via vars.CATALYST_API_URL (defaults to
https://console.openova.io). Sovereign chroot self-deploys can point
to their local catalyst-api.
- First-rollout safe: the endpoint does not exist on the LIVE
mothership until THIS PR's image lands, so the first run after merge
falls through the 404 branch and proceeds. Subsequent runs benefit
from the gate.
- NOT a Chart.yaml bump. The deploy-bot itself bumps the literal image
refs in chart templates (existing behaviour), so the new endpoint
reaches Sovereigns through the normal chart-rebake path.
Tests: handler/deployments_in_flight_count_test.go covers Phase-0 vs
Phase-1 vs terminal vs adopted classification + empty-store green light.
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(catalyst-api): COPY core/services/shared/ in Containerfile
PR #1658 added a second replace directive to
products/catalyst/bootstrap/api/go.mod pointing
github.com/openova-io/openova/core/services/shared at
../../../../core/services/shared (in-tree consume, same pattern as
core/controllers from #1152). The Containerfile was updated to wire the
controllers tree with COPY core/controllers/ /core/controllers/ but the
matching COPY for core/services/shared/ was missed.
Every catalyst-api build since #1658 fails at `go mod download` with:
go: github.com/openova-io/openova/core/services/shared@v0.0.0-... \
(replaced by ../../../../core/services/shared): reading \
/core/services/shared/go.mod: open /core/services/shared/go.mod: \
no such file or directory
Effect: deploy-bot stalled on catalyst-api bumps, every fresh Sovereign
provision ships the stale pre-#1658 image (e7b2062), all post-#1658
fixes (PARENT_DOMAINS_LISTENERS_YAML wiring, SME bridge token mints,
etc.) silently absent from the runtime.
Fix is one line: COPY core/services/shared/ /core/services/shared/
placed immediately after the controllers COPY, mirroring the same
relative-path math: from WORKDIR=/app, ../../../../core/services/shared
resolves to /core/services/shared, because the first `..` already reaches
the filesystem root and every further `..` at the root stays at the root.
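The path math can be checked with Go's `path.Join`, which cleans `..` segments the same way. A small sketch, assuming go.mod sits directly in /app inside the build container; the `/repo` checkout prefix is hypothetical.

```go
package main

import (
	"fmt"
	"path"
)

// resolveReplace joins a module directory with the relative target of a
// replace directive and cleans the result.
func resolveReplace(moduleDir, target string) string {
	return path.Join(moduleDir, target)
}

func main() {
	const target = "../../../../core/services/shared"

	// In the repository the module sits four levels below the root,
	// so the target lands in the in-tree shared module.
	fmt.Println(resolveReplace("/repo/products/catalyst/bootstrap/api", target))

	// In the container the module sits one level below the filesystem
	// root, so the extra ".." segments collapse at "/".
	fmt.Println(resolveReplace("/app", target))
}
```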
Verified the failure mode on run 26028993151 (commit 4cf670b6, latest
main): exact log line matches. No Chart.yaml bump, no cluster changes.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* deploy: update catalyst images to 2350e42
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Before this fix, the bootstrap-kit overlay at
clusters/_template/bootstrap-kit/80-newapi.yaml:171 set
`auth.adminUI.keycloak.existingSecret: newapi-oidc` but nothing in the
chart or the operator overlay materialised that Secret. The Pod
stayed in `CreateContainerConfigError: secret "newapi-oidc" not
found`, blocking the entire bp-newapi HR from reaching Ready (t20
debug matrix Fix#6).
New template templates/keycloak-client-secret.yaml uses Helm `lookup`
to retrieve the existing Secret bytes on every reconcile (idempotent —
preserves the OIDC client secret across upgrades), falling back to
`randAlphaNum 32` on first install. Mirrors the existing canonical
seam in the same chart (templates/credentials-secret.yaml issue #943,
templates/sandbox-token-signing-key-secret.yaml PR #1638) — both use
the same lookup-or-generate pattern with helm.sh/resource-policy: keep.
The sister chart platform/guacamole/chart/templates/keycloak-client-
secret.yaml uses a SealedSecret placeholder + a bootstrap Job hook;
THIS chart already standardised on the simpler `randAlphaNum + Helm
lookup` pattern, so this Secret follows the same seam to keep the
chart's credential-materialisation strategy consistent.
Gated on `auth.adminUI.mode=keycloak` AND non-empty
`auth.adminUI.keycloak.existingSecret`, so non-keycloak installs
render nothing extra.
No Chart.yaml bump — pure template addition.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The official Apache Guacamole image entrypoint runs `rm -rf
$GUACAMOLE_HOME` (== `/home/guacamole/.guacamole`) before re-populating
the directory on every start. When the chart mounted an emptyDir
directly at `/home/guacamole/.guacamole`, that path was a mount point
from the kernel's perspective, so `rm` failed with:
rm: cannot remove '/home/guacamole/.guacamole':
Read-only file system
— the entrypoint exited non-zero and the Pod CrashLoopBackOff'd before
the webapp ever started. (t20 debug matrix — Fix #5.)
Mount the PARENT directory (`/home/guacamole`) instead. `.guacamole`
becomes a regular subdirectory inside the emptyDir, which the
entrypoint can freely `rm -rf` and recreate. The webapp's first-start
writes still land in a writable location under readOnlyRootFilesystem.
No Chart.yaml version bump per the t20 hard-rules contract — chart
release will roll in the next blueprint-release wave.
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
t20 (2026-05-18) caught the bug: billing crashed at startup with NATS
error code 10065 "subjects overlap with an existing stream" because
CATALYST_SME (subjects `catalyst.>`, created by the tenant /
provisioning MultiSubscribers) had already claimed `catalyst.usage.recorded`
by the time billing tried to create CATALYST_USAGE
(subject `catalyst.usage.recorded`). JetStream forbids two Streams from
owning overlapping subject filters.
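Why `catalyst.>` collides with `catalyst.usage.recorded` can be illustrated with a simplified overlap check. This is a sketch of NATS subject-wildcard semantics (`*` matches one token, `>` matches one or more trailing tokens), not the server's implementation; `subjectsOverlap` and the `billing.usage.recorded` subject are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// subjectsOverlap reports whether some concrete subject could match both
// filters. It walks the dot-separated tokens in parallel.
func subjectsOverlap(a, b string) bool {
	ta, tb := strings.Split(a, "."), strings.Split(b, ".")
	for i := 0; i < len(ta) && i < len(tb); i++ {
		// ">" swallows every remaining token, and the other side has
		// at least one token left at this position.
		if ta[i] == ">" || tb[i] == ">" {
			return true
		}
		// Two literal tokens must be equal; "*" matches any single token.
		if ta[i] != "*" && tb[i] != "*" && ta[i] != tb[i] {
			return false
		}
	}
	// Without ">" both filters must have the same number of tokens.
	return len(ta) == len(tb)
}

func main() {
	fmt.Println(subjectsOverlap("catalyst.>", "catalyst.usage.recorded"))
	fmt.Println(subjectsOverlap("catalyst.>", "billing.usage.recorded"))
}
```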
Option B per the matrix: have billing share CATALYST_SME and scope its
metering reads via a consumer-side FilterSubject instead of owning a
separate Stream. This matches the architecture every other SME service
(tenant, notification, provisioning) already uses for catalyst.* events.
Changes:
- core/services/shared/events/nats.go: add EnsureCatalystSMEStream
(public wrapper around the existing package-private ensureSMEStream
helper used by NewMultiSubscriber) + SubscribeUsageRecordedOnSME
(durable consumer on CATALYST_SME with FilterSubject scoped to
catalyst.usage.recorded). The original EnsureUsageStream and
SubscribeUsageRecorded are retained but marked Deprecated for
back-compat with any Catalyst-Zero / dev loop wired before t20.
- core/services/billing/main.go: replace the EnsureUsageStream call
with EnsureCatalystSMEStream and the SubscribeUsageRecorded call
with SubscribeUsageRecordedOnSME. Comment captures the t20 root
cause + the bootstrap-order rationale so the next reader doesn't
re-introduce the dedicated Stream.
The consumer-side FilterSubject (`catalyst.usage.recorded`) lives in
core/services/shared/events/nats.go inside SubscribeUsageRecordedOnSME.
go build + go test clean for core/services/billing and
core/services/shared.
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR #1640 renamed Cilium Gateway listeners to `https-<sanitised-zone>` /
`http-<sanitised-zone>` to support multi-zone Sovereigns (primary +
SME pool). That broke single-zone Sovereigns because every platform
chart's HTTPRoute (harbor, keycloak, grafana, gitea, openbao, powerdns,
stalwart-tenant) hardcodes `parentRefs[0].sectionName: https`. Result:
every HTTPRoute reports `Accepted=False NoMatchingListener`, Sovereign
Console / Harbor / Keycloak etc. unreachable through the Gateway.
Fix: when `len(parent_domains_decoded) == 1` (the common case), render
listener names as the bare strings `https` / `http`. When > 1 (SME pool
present), keep the unique `https-<zone>` / `http-<zone>` naming so the
Gateway controller doesn't hit a duplicate-name Conflicting condition.
Multi-zone tenants whose HTTPRoutes must attach under a non-primary
zone override `sectionName` via values.yaml — out of scope here.
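The naming rule reads as a single branch on the number of parent zones. A Go sketch of that rule, for illustration only (the actual rendering happens in the Gateway template): `listenerNames` and `listenerPair` are hypothetical, and `sanitiseZone` replacing dots with dashes is an assumption, since the source does not spell out the sanitisation.

```go
package main

import (
	"fmt"
	"strings"
)

// listenerPair holds the HTTPS and HTTP listener names for one zone.
type listenerPair struct{ HTTPS, HTTP string }

// sanitiseZone is an assumed sanitisation: dots become dashes.
func sanitiseZone(zone string) string {
	return strings.ReplaceAll(zone, ".", "-")
}

// listenerNames returns bare names for a single zone, so charts that
// hardcode sectionName "https" keep attaching, and unique per-zone names
// when more than one zone is present.
func listenerNames(zones []string) []listenerPair {
	if len(zones) == 1 {
		return []listenerPair{{HTTPS: "https", HTTP: "http"}}
	}
	pairs := make([]listenerPair, 0, len(zones))
	for _, z := range zones {
		s := sanitiseZone(z)
		pairs = append(pairs, listenerPair{HTTPS: "https-" + s, HTTP: "http-" + s})
	}
	return pairs
}

func main() {
	fmt.Println(listenerNames([]string{"omani.works"}))
	fmt.Println(listenerNames([]string{"omani.works", "example.com"}))
}
```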
The per-zone certificateRefs.name (`sovereign-wildcard-tls-<sanitised-zone>`)
is unchanged — independent of the listener name.
Verified: kubectl kustomize clusters/_template/sovereign-tls/ clean.
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per t20 debug matrix:
* `bp-self-sovereign-cutover` step-06 phase-1 rewrites every HelmRepository
URL from `oci://ghcr.io/openova-io` to `oci://${harbor_host}/openova-io`,
where `harbor_host` is derived from `sovereign.harborPublicURL`.
* Pre-fix: `harborPublicURL: https://harbor.${SOVEREIGN_FQDN}`.
* But the bp-harbor HTTPRoute publishes at `registry.${SOVEREIGN_FQDN}` —
see `clusters/_template/bootstrap-kit/19-harbor.yaml` line 167
(`gateway.host: registry.\${SOVEREIGN_FQDN}`). No HTTPRoute matches
`harbor.<sov>`, so post-pivot every OCI chart pull EOFs.
* Effect: bp-sandbox HR never Ready → bootstrap-kit Kustomization stuck
waiting on bp-sandbox health → t20 convergence blocks indefinitely.
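The failure chain above hinges on one derivation: the registry host is taken from `harborPublicURL` and substituted into every HelmRepository URL. A Go sketch of that derivation, for illustration only (the cutover step itself is a script); `rewriteRepoURL` is a hypothetical name.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// rewriteRepoURL swaps the ghcr.io registry host in an OCI repository URL
// for the host component of the Harbor public URL.
func rewriteRepoURL(repoURL, harborPublicURL string) (string, error) {
	u, err := url.Parse(harborPublicURL)
	if err != nil {
		return "", err
	}
	if u.Host == "" {
		return "", fmt.Errorf("no host in %q", harborPublicURL)
	}
	return strings.Replace(repoURL, "oci://ghcr.io/", "oci://"+u.Host+"/", 1), nil
}

func main() {
	out, err := rewriteRepoURL("oci://ghcr.io/openova-io", "https://registry.example.local")
	fmt.Println(out, err)
}
```

Whatever host the value carries is the host every later chart pull targets, which is why a `harbor.` prefix with no matching HTTPRoute breaks all pulls at once.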
Fix (chart-level, no Chart.yaml bump for bp-catalyst-platform):
* `clusters/_template/bootstrap-kit/06a-bp-self-sovereign-cutover.yaml`
overlay value flipped `harbor.${SOVEREIGN_FQDN}` → `registry.${SOVEREIGN_FQDN}`.
* `platform/self-sovereign-cutover/chart/values.yaml` default placeholder
flipped `harbor.example.local` → `registry.example.local` so smoke
renders + docs line up.
* README + smoke command updated.
Smoke tests:
* `helm template smoke platform/self-sovereign-cutover/chart` — clean,
1851 lines, `HARBOR_PUBLIC_URL=https://registry.example.local`.
* `helm template smoke ... --set sovereign.harborPublicURL=https://registry.otechN.omani.works`
— clean, all step env vars carry the new host.
* `kubectl kustomize clusters/_template/bootstrap-kit/` — clean, 2926 lines,
overlay shows `harborPublicURL: https://registry.${SOVEREIGN_FQDN}`.
* `bash platform/self-sovereign-cutover/chart/tests/cutover-contract.sh`
— all gates green (Phase-0 ghcr-pull auth merge still works because
`harbor_host` is derived from `HARBOR_PUBLIC_URL` env at runtime, so
the script now correctly merges auth for `registry.<sov-fqdn>` instead
of `harbor.<sov-fqdn>`).
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds regression coverage so the Sandbox event flow + REST surface can
be exercised without a live Sovereign — the convergence loop the
qa-loop's last 5 iterations relied on.
Tenant orchestrator (5 cases / 8 runs):
* full event flow — tenant.sandbox_requested envelope → in-process
BrokerSubscriber → SandboxOrchestrator.Start → recordingSandboxClient
materialises a CR shaped per architecture.md §7 (labels, annotations,
spec.owner/quota/agentCatalogue/planId)
* NATS-style redelivery is idempotent — second Emit() goes Get(found)
→ no-op, Create count stays at 1
* plan tiers fan out — free/pro/ent each stamp the right quota
(catches the PR #1633 regression)
* non-sandbox event types ignored at the dispatcher seam
* agentCatalogue strips empty / whitespace entries before persist
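The redelivery case above reduces to a get-before-create guard checked against a recording fake. A minimal sketch with hypothetical names (`fakeClient`, `handleSandboxRequested`); the real test wires SandboxOrchestrator to recordingSandboxClient, whose API is not shown in the source.

```go
package main

import "fmt"

// fakeClient is a hypothetical in-memory stand-in for a recording client.
type fakeClient struct {
	store   map[string]bool
	creates int
}

func newFakeClient() *fakeClient {
	return &fakeClient{store: map[string]bool{}}
}

// Get reports whether a Sandbox with this name already exists.
func (c *fakeClient) Get(name string) bool { return c.store[name] }

// Create stores the Sandbox and counts the call.
func (c *fakeClient) Create(name string) {
	c.store[name] = true
	c.creates++
}

// handleSandboxRequested is idempotent: a redelivered event finds the
// existing object and becomes a no-op.
func handleSandboxRequested(c *fakeClient, name string) {
	if c.Get(name) {
		return
	}
	c.Create(name)
}

func main() {
	c := newFakeClient()
	handleSandboxRequested(c, "sbx-1")
	handleSandboxRequested(c, "sbx-1") // simulated redelivery
	fmt.Println("creates:", c.creates)
}
```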
Catalyst sessions API (7 cases / 10 runs):
* POST → GET round-trip through a dynamic/fake apiserver via
SetSovereignDepsFactory (mirrors chroot Sovereign "Path 2")
* GET reflects controller status (sessions / storage / spend /
previews / conditions) into the FE wire shape
* Failed condition taxonomy — TokenMintFailed, GitopsWriteFailed,
ManifestRenderFailed each preserved verbatim so the FE renders
actionable error states instead of a generic red pill
* POST invalid-agent returns 400 before any apiserver call
* GET unknown sandbox returns 404 sandbox-not-found
* LIST → DELETE → LIST round-trip
* Org-scope isolation — claims.Org-scoped namespace boundary blocks
cross-Org leak
Hard rules followed: READ-ONLY fake clients (no apiserver write), no
chart bump, no production code changes — only new _test.go files.
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR #1674 shipped the Sandbox Runtime Grafana dashboard with three
panels whose metrics did not yet exist anywhere in the fleet:
- "WebSocket Connections" → pty_server_websocket_connections (Gauge)
- "Idle-Timeout Scale-Down Events / hour" →
sandbox_controller_idle_timeout_events_total (Counter)
- "newapi Token Mint Requests / hour" →
newapi_admin_token_mint_requests_total{tool,status} (Counter)
Per Inviolable Principle #11 the panels render "No data" until the
emitter sides roll out. This PR closes that loop.
pty-server (products/sandbox/pty-server)
- New metrics.go: registers Gauge pty_server_websocket_connections
via promauto + exposes promhttp.Handler.
- routes.go: serves /metrics on GET; increments and decrements the
gauge around every successful WS upgrade in attach() and cards()
(Dec is deferred so abnormal returns still decrement).
- go.mod: + github.com/prometheus/client_golang v1.19.1 (matches the
version core/controllers already pulls in transitively).
- New unit test asserts GET /metrics carries the gauge name.
sandbox-controller (core/controllers/sandbox/internal/idlescaler)
- New metrics.go: Counter sandbox_controller_idle_timeout_events_total
with label {namespace} registered on controller-runtime's shared
registry (so the manager's existing :8080 /metrics endpoint surfaces
it — no new listener).
- idlescaler.go: bumps the counter inside scaleToZero() so every Pod
scaled to 0 ticks once. Namespace label matches the dashboard panel's
`sum by (namespace) (rate(...))` aggregation.
- New unit test verifies the counter delta is 1 on a successful
scale-to-zero pass.
newapi bridge handler (platform/newapi/internal/handler)
- New metrics.go: CounterVec newapi_admin_token_mint_requests_total
with labels {tool, status}; helper classifyStatus() maps HTTP codes
to a finite cardinality of 7 status values (ok / unauthorized /
bad_request / unavailable / server_error / method_not_allowed /
other). Exported MetricsHandler() so the catalyst-api wiring code
can mount /metrics on the same listener as the bridge.
- sandbox_token.go: recordMint(r, status) at every return path so the
counter ticks regardless of which branch the request hits.
- go.mod: + github.com/prometheus/client_golang v1.19.1.
- 5 new test cases assert counter delta == 1 for the documented
status transitions and that the X-Catalyst-Tool header surfaces
as the `tool` label.
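Bounding the `status` label to a fixed set keeps the counter's cardinality finite. The sketch below shows one way such a classifier could look; the seven label values come from the list above, but which HTTP code maps to which value is an assumption, and `classifyStatusSketch` is a hypothetical name rather than the real helper.

```go
package main

import (
	"fmt"
	"net/http"
)

// classifyStatusSketch maps an HTTP status code onto one of seven label
// values. The exact code-to-label mapping here is assumed.
func classifyStatusSketch(code int) string {
	switch {
	case code >= 200 && code < 300:
		return "ok"
	case code == http.StatusBadRequest:
		return "bad_request"
	case code == http.StatusUnauthorized:
		return "unauthorized"
	case code == http.StatusMethodNotAllowed:
		return "method_not_allowed"
	case code == http.StatusServiceUnavailable:
		return "unavailable"
	case code >= 500:
		return "server_error"
	default:
		return "other"
	}
}

func main() {
	for _, code := range []int{200, 400, 401, 404, 405, 500, 503} {
		fmt.Printf("%d -> %s\n", code, classifyStatusSketch(code))
	}
}
```

The 503 case is listed before the generic `code >= 500` case so that it is not absorbed into `server_error`.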
sandbox-controller → newapi client (core/controllers/sandbox/internal/newapi)
- client.go: stamp `X-Catalyst-Tool: sandbox-controller` on every
outbound POST /admin/tokens/sandbox so the bridge counter's `tool`
label has the canonical value the dashboard panel filters on.
Helm charts
- platform/sandbox/chart/templates/service.yaml (new): ClusterIP
Service exposing the controller's :8080 metrics port. Required so a
ServiceMonitor selector has something to attach to.
- platform/sandbox/chart/templates/servicemonitor.yaml (new):
monitoring.coreos.com/v1 ServiceMonitor scoped to the metrics
Service. Default-off + double-guarded with
`.Capabilities.APIVersions.Has "monitoring.coreos.com/v1"` (matches
platform/harbor/chart/templates/servicemonitor.yaml pattern). values
block + per-Sovereign overrides (interval / scrapeTimeout / path /
labels / namespace) per Inviolable Principle #4.
- platform/newapi/chart/templates/servicemonitor.yaml (new): mirror
template targeting the existing bp-newapi Service / `http` port for
when the catalyst-api binary that mounts the bridge handler rolls
out behind the same Service. Default-off, capability-guarded.
No Chart.yaml bump. Validation: helm template + helm lint clean
on both charts; go build + go test clean across all three modules
(pty-server, core/controllers/sandbox, platform/newapi/internal/handler).
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR #1670 wired the xterm.js WebSocket against `wss://sandbox.<sov-fqdn>/
sessions/<id>/attach`, but a freshly-created Sandbox CR takes 8-15s for
the sandbox-controller to reconcile (ns + RBAC + 3 PVCs + pty Pod + MCP
Pod + newapi Token). During that window the public ingress doesn't
exist yet, so the FE looped through ~6 reconnect attempts against a
non-existent host before giving up; the operator saw a blank "Connecting…"
banner with no insight into provisioning progress.
PR #1673 made the missing data available by surfacing `.status.phase` +
`.status.conditions[]` on `/api/v1/sandbox/sessions/{id}`. This PR
consumes it:
- New SandboxProvisioningPanel — phase badge + spinner +
per-stage waterfall (Namespace → RBAC → Storage → pty Pod →
MCP Pod → newapi Token) inferred from conditions[]. Renders
target-state on first paint (per INVIOLABLE-PRINCIPLES.md #1);
flips chips Done as each per-stage True condition lands.
- SandboxSession polls GET /sessions/{id} every 2s via TanStack
Query while phase !== 'Ready' && phase !== 'Failed', then stops
polling and renders the xterm.js panel.
- On phase==='Failed' the panel surfaces every False condition's
{Type, Reason, Message} verbatim (TokenMintFailed,
ManifestRenderFailed, GitopsWriteFailed, OwnerEmailInvalid,
NoAllowedChannels) plus a "Delete + retry" CTA that fires
DELETE /sessions/{id} then bounces the operator to /sandbox.
- On 404 (session deleted in another tab) the panel shows a
"Session no longer exists" state + back-to-landing CTA.
- SandboxLanding's create-session flow now navigates straight to
/sandbox/$id on POST success, seeds the per-session query cache
so the panel paints the canonical 6-stage waterfall instantly
(no 2s gap waiting for the first poll tick).
sandbox.api.ts extends the Sandbox shape with `phase: SandboxPhase`
+ `conditions: SandboxCondition[]` (always materialised arrays —
the FE never has to `?? []`). Adds `getSandbox(id)` + `deleteSandbox(id)`
endpoints. Three normalisers (`normalizeSandbox`, `normalizePhase`,
`normalizeConditions`) keep getSandboxes/getSandbox/createSandbox
projections from drifting.
Design-system inheritance per feedback_subagents_inherit_design_system:
- PortalShell wrapper (no bespoke chrome)
- Emerald/amber/rose ramps mirror AppDetail's phase chip + the
existing ConnectionBadge — every colour comes from documented
Tailwind design tokens, no new hex values
- Card chrome mirrors sandbox-session-card verbatim (rounded-xl,
var(--color-bg-2) on var(--color-border), 5-unit padding)
WebSocket effect now gated on `showTerminal` so we don't reconnect-
storm a non-existent ingress while provisioning. xterm host stays
mounted (display:none) across the gate flip so hostRef + resize
listeners survive.
Co-authored-by: hatiyildiz <269457768+hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wave 15 — closes the verification gap between PR #1637 (catalyst-api
/api/v1/sandbox/sessions), PR #1638 (real /admin/tokens/sandbox mint
in newapi), and PR #1643 (sandbox-controller calls the bridge):
1. core/controllers/sandbox/internal/newapi/integration_test.go (new)
End-to-end test wiring the REAL newapi.Client through net/http
against a httptest.Server that mirrors the bridge handler contract
(platform/newapi/internal/handler/sandbox_token.go) verbatim:
- happy-path: 200 with {token, expires_at} → controller renders
per-Sandbox Secret + stamps lifecycle annotations on the CR
(verifies wire path: Authorization: Bearer + body fields
org_id/user_id/sandbox_id/allowed_channels)
- 401 path: wrong admin bearer → 401 JSON envelope → controller
stamps TokenMintFailed False condition, NO gitops writes,
requeues, NO lifecycle annotations stamped
- transport-unreachable: closed server URL → TokenMintFailed
(verifies the error wrapping path)
Gap filled: client_test.go covers the HTTP client in isolation,
sandbox_controller_test.go uses an in-process stub. Neither
exercises the controller-runtime reconciler against a real HTTP
transport — this file is the only place where a regression in the
client's bearer/header/path/status-code handling would surface in
the context of the reconciler's state machine.
2. platform/newapi/chart — reflector wiring fix
Default reflectorNamespaces changed from "sandbox" to
"catalyst-system,sandbox". Root cause: clusters/_template/
bootstrap-kit/19a-bp-sandbox.yaml sets `targetNamespace:
catalyst-system` (the canonical install namespace of the
bp-sandbox HelmRelease) but the chart-emitted Secret was being
mirrored into a `sandbox` namespace that does not exist on a
stock Sovereign. Result: sandbox-controller Pod's
`NEWAPI_ADMIN_SECRET` env var landed empty (secretKeyRef
`optional: true` swallowed the missing-Secret error) → controller
started in gitops-only mode, never minted tokens, silently
degraded. Operator-visible only via a startup log line.
`sandbox` is retained in the default for legacy overlays that
install the controller into a dedicated namespace + for sister
tooling (catalyst-api PATs routed through the bridge) that wants
the admin bearer locally.
Verification:
- go build ./sandbox/... clean
- go test ./sandbox/internal/newapi/... — 7 tests pass (4 unit + 3
new integration)
- go test ./sandbox/... — all sandbox packages pass
- go vet ./sandbox/... clean
- helm template platform/newapi/chart/ -s sandbox-token-signing-
key-secret.yaml renders with
`reflection-{allowed,auto}-namespaces: "catalyst-system,sandbox"`
- helm template platform/sandbox/chart/ renders Deployment env
block `NEWAPI_ADMIN_SECRET` valueFrom Secret
`newapi-bp-newapi-token-signing-key` key `ADMIN_SECRET`
(optional: true) — unchanged
No Chart.yaml bump (chart pinning is a release-driver concern).
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The 1.4.162 chart OCI artifact (published at commit 0ad78790 on 2026-05-18
06:33 UTC) pre-dates the template additions merged shortly afterwards.
Sovereigns sourcing bootstrap-kit pin 1.4.162 download that stale artifact
and never get the new templates → tenant HTTPRoute / Cilium per-zone
listeners / bp-newapi attestation gate / sandbox-controller D31 wiring
never reach the chroot.
This collector PR bumps chart version 1.4.162→1.4.163 (chart-version-only
republish — same OCI pipeline race documented in 1.4.6, 1.4.14, 1.4.16,
1.4.116, 1.4.118 changelog entries) AND bumps the bootstrap-kit pin to
match. The next fresh prov pulls 1.4.163's OCI bytes which contain every
post-1.4.162 chart change.
Baked in 1.4.163:
- #1644 organization-controller renders per-tenant HTTPRoute so
<slug>.<parentDomain> serves the tenant's installed product
(crds/organization.yaml tenantPublic +
templates/sme-services/tenant-public-routes.yaml +
values.yaml tenantRoutes[] static-fixture path).
- #1650 provisioning service patches Organization.spec.tenantPublic on
product-install so the #1644 HTTPRoute reconciler has a non-empty
parentDomain to render.
- #1640 Cilium Gateway: one listener pair (HTTPS:30443 + HTTP:30080) per
parent zone (clusters/_template/sovereign-tls/cilium-gateway.yaml).
- #1654 bp-newapi attestation gate — channel CR render now skipped when
ATTESTATION_ACCOUNT_ID env is empty (was blocking install with bogus
channel pointing at empty account).
- Sandbox-controller post-handover refinements:
* api-deployment.yaml SOVEREIGN_ENABLE_HOT_STANDBY +
SOVEREIGN_PRIMARY_REGION + SOVEREIGN_REPLICA_REGION env vars
(D31 wiring; SME tenant gitops writer reads at render time so
every freshly-rendered bp-wordpress-tenant HR carries
pg.activeHotStandby block when toggle is true).
* sovereign-fqdn-configmap.yaml exposes enableHotStandby /
primaryRegion / replicaRegion keys.
* clusterrole-cutover-driver.yaml grants sandbox.openova.io/sandboxes
verbs (create split into its own Rule per
feedback_rbac_create_no_resourcenames.md).
* values.yaml sovereign.{enableHotStandby,primaryRegion,replicaRegion}
defaults (all empty — zero regression on Sovereigns that have not
opted into active-hot-standby).
Hard rules: clusters/ READ-ONLY for non-template paths (only the
_template bootstrap-kit pin bumps), helm template clean, helm lint clean.
Verification:
- helm template bp-catalyst-platform products/catalyst/chart/ → exit 0,
3009 lines rendered.
- helm lint products/catalyst/chart/ → 0 chart(s) failed.
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds GET /api/v1/sandbox/byos/claude-code/config returning
{clientIdConfigured, oauthAuthorizeURL?} so the SandboxSettings card can
pre-flight whether the chart's SANDBOX_BYOS_CLAUDE_CODE_CLIENT_ID is
still the PLACEHOLDER-AWAITING-FOUNDER-REGISTRATION sentinel (set by
PR #1619). When it is, the FE now renders the "Connect Claude Max"
button as DISABLED with an amber "Operator setup pending" pill + tooltip
"Anthropic OAuth client_id not yet registered — contact your Sovereign
operator" — rather than letting the user click a button whose OAuth URL
would 400 at Anthropic and confuse them about whether BYOS is broken vs
awaiting the founder action documented in claude-code-byos.md §8.
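The pre-flight decision above can be sketched roughly as follows. This is an illustrative Python sketch, not the handler itself: `byos_config` and its `authorize_url` parameter are hypothetical names, and treating an empty client ID as "not configured" is an assumption. Only the sentinel string and the two response keys come from the description above.

```python
from typing import Optional

# Sentinel the chart ships until a real client_id is registered (per PR #1619).
PLACEHOLDER = "PLACEHOLDER-AWAITING-FOUNDER-REGISTRATION"


def byos_config(client_id: str, authorize_url: Optional[str] = None) -> dict:
    """Build the config payload; oauthAuthorizeURL appears only when configured."""
    # Assumption: an empty client_id is treated the same as the sentinel.
    configured = bool(client_id) and client_id != PLACEHOLDER
    payload = {"clientIdConfigured": configured}
    if configured and authorize_url:
        payload["oauthAuthorizeURL"] = authorize_url
    return payload
```

With the sentinel in place the payload carries only `clientIdConfigured: false`, which is the signal the FE uses to render the disabled button.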
Design-system inheritance preserved: amber-500/40 + amber-500/10 +
amber-300 + text-[10px] uppercase pill matches SettingsPage verbatim,
disabled:cursor-not-allowed + disabled:opacity-50 is the same button
disabled treatment used elsewhere on the page, and the inline operator-
pending paragraph uses the existing text-[var(--color-text-dim)] token.
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add a default-off Grafana dashboard ConfigMap (label `grafana_dashboard: "1"`)
that the upstream grafana/grafana sidecar (kiwigrid/k8s-sidecar) auto-discovers
across all namespaces and loads on startup. Renders only when
`.Values.grafanaDashboard.enabled=true` — zero regression for every existing
Sovereign overlay.
Panels (per products/sandbox/docs/architecture.md §7):
1. Active Sandboxes — sum(kube_customresource_sandbox_info)
2. pty-server Pods Ready % — kube_pod_status_ready{condition=true}
joined to kube_pod_labels{label_app_kubernetes_io_name="pty-server"}
3. MCP Pods Ready % — same shape, label_app_kubernetes_io_name="openova-sandbox-mcp"
4. WebSocket Connections — pty_server_websocket_connections (Gauge)
5. PVC Usage % per Sandbox — kubelet_volume_stats_used_bytes / capacity_bytes
6. Idle-Timeout Scale-Down Events / hour —
rate(sandbox_controller_idle_timeout_events_total[5m]) × 3600
7. newapi Token Mint Requests / hour —
rate(newapi_admin_token_mint_requests_total{tool="sandbox-controller"}[5m]) × 3600
Pattern mirrors platform/seaweedfs/chart/.../seaweedfs-grafana-dashboard.yaml.
Per Inviolable Principle #11 (never fabricate metrics) every panel description
names the metric it depends on so panels whose emitter has not yet rolled out
across the fleet render as "No data" instead of a synthetic number.
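The "× 3600" in panels 6 and 7 converts a per-second rate into events per hour. A simplified Python sketch of that arithmetic is below; `per_hour_rate` is a hypothetical helper, and unlike Prometheus's `rate()` it ignores counter resets and window extrapolation.

```python
def per_hour_rate(first: float, last: float, window_seconds: float) -> float:
    """Approximate rate(counter[window]) * 3600 from two counter samples."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    # (last - first) / window is events per second; scale to events per hour.
    return (last - first) * 3600 / window_seconds
```

For example, a counter that moves from 100 to 105 across a 5-minute (300 s) window corresponds to 60 events per hour.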
Validated:
- `helm template` clean both modes (default-off → zero output; enabled → 1 CM)
- `helm lint` passes (1 INFO about icon — pre-existing)
- Dashboard JSON parses (json.loads), 7 panels enumerated
- No Chart.yaml bump
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR #1637 shipped GET /api/v1/sandbox/sessions returning only the spec
the handler authored — Status was projected from `.status.phase` but
every other controller-managed field (sessions count, storage usage,
30-day spend, preview URLs, failure conditions) was ignored. The
catalyst-ui Sandbox surface had nothing to render past the FE-side
"queued" pill.
This wires the projection to read `.status.{phase,sessions,storageUsed,
spend30d,previews,conditions}` and surfaces them on the
list/get wire shape:
- `phase` (raw, Pending|Provisioning|Ready|Failed) alongside the
existing FE-projected `status` (pending|running|stopped|failed)
- `sessions`, `storageUsed`, `spend30d` — operator-visible quotas
- `previews[]` — one preview URL per PR/branch (skips rows missing
URL; coerces float64↔int64 from apiserver JSON round-trips)
- `conditions[]` — Type/Status/Reason/Message tuples verbatim, so
the FE can render TokenMintFailed / GitopsWriteFailed /
ManifestRenderFailed / OwnerEmailInvalid / NoAllowedChannels
inline instead of a generic red pill
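The `previews[]` rule above (skip rows with no URL, tolerate numbers that came back as floats) can be sketched as below. This is a Python illustration rather than the Go handler; `project_previews`, `coerce_int`, and the `pr` field name are hypothetical, since the real field names are not listed here.

```python
from typing import Any, Optional


def coerce_int(value: Any) -> Optional[int]:
    """Accept ints and whole-number floats (a JSON round-trip may yield 12.0)."""
    if isinstance(value, bool):  # bool is an int subclass; reject it explicitly
        return None
    if isinstance(value, int):
        return value
    if isinstance(value, float) and value.is_integer():
        return int(value)
    return None


def project_previews(raw: Any) -> list:
    """Project status previews to wire rows, dropping rows without a URL."""
    rows = []
    for item in raw or []:  # None or missing status yields an empty list
        if not isinstance(item, dict):
            continue
        url = item.get("url")
        if not isinstance(url, str) or not url:
            continue
        row = {"url": url}
        pr = coerce_int(item.get("pr"))  # "pr" is a hypothetical field name
        if pr is not None:
            row["pr"] = pr
        rows.append(row)
    return rows
```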
Phase→FE-status mapping unchanged (matches sandbox.api.ts:
normalizeStatus). `mapSandboxStatus` refactored from
`(u *Unstructured)` to `(rawPhase string)` so the new
`readSandboxPhase` helper reads the field exactly once per item.
Added handler-package tests pinning the projection contract:
- StatusReflection — happy path, dropped-malformed rows
- PhaseProjection — every CRD phase → FE status
- FailedSurfacesConditions — Failed + TokenMintFailed visible
- NilInputDoesNotPanic — empty-slice defaults
- PreviewFloat64Coercion — apiserver JSON round-trip safety
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wave 15 #1668 added the annotation but used default-on, which trips the
empty-render guard because the chart's resources are all gated on
.Values.enabled (default false). Flip to default-off so the smoke render
skips the chart per design.
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR #1621 shipped the SandboxSession xterm.js host with an "API pending"
placeholder banner. PR #1641 + #1657 wired the BE (sandbox-controller
renders the HTTPRoute on sandbox.<sov-fqdn>; pty-server exposes
WS /sessions/{id}/attach). This PR replaces the placeholder with a real
adapter:
- stdin : term.onData -> ws.send (TextEncoder binary frame)
- stdout : ws.onmessage -> term.write (ArrayBuffer / Uint8Array / Blob / string)
- resize : window resize -> fit.fit() -> POST sandbox.<sov-fqdn>/sessions/{id}/resize
- replay : pty-server ships the ring buffer as the first binary frame; the
generic onmessage path writes it verbatim, no special case
- reconnect: on close / error, schedule a retry with exponential backoff
(1s, 2s, 4s, 8s, 16s, 30s ceiling — same shape as
useComplianceStream). Connection banner reflects
connecting / connected / reconnecting / closed / idle.
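The reconnect schedule above (1s, 2s, 4s, 8s, 16s, then a 30s ceiling) is a capped exponential backoff. A minimal Python sketch of the delay calculation follows; the component itself is TypeScript, and `reconnect_delay_ms` with its parameters is a hypothetical name used only for illustration.

```python
def reconnect_delay_ms(attempt: int, base_ms: int = 1000, ceiling_ms: int = 30000) -> int:
    """Delay before retry number `attempt` (0-based): doubles each time, capped."""
    if attempt < 0:
        raise ValueError("attempt must be non-negative")
    return min(base_ms * (2 ** attempt), ceiling_ms)
```

The sixth retry would be 32s uncapped, so the ceiling clamps it and every later retry to 30s.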
Design-system inheritance: PortalShell wrapper unchanged, CSS-variable
colours throughout, amber for connecting/reconnecting and rose for
disconnected (the same shades the rest of the Sovereign Console uses).
The back-to-landing affordance the e2e suite asserts on is preserved.
Test seams kept: disableTerminal still skips xterm.js mount under
jsdom, plus new websocketFactory / resizeFetcher / reconnectBackoffMs /
disableReconnect props so unit tests can exercise the WS pump without a
real socket or wall-clock backoff.
npx tsc --noEmit clean on the full UI project.
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wires four real Stripe handlers in openova-sandbox-mcp, completing the
final unwired namespace from architecture.md §3 (sandbox.stripe.*):
- sandbox.stripe.bindAccount {api_key} — validates the key prefix
(sk_live_ / sk_test_ / rk_live_ / rk_test_), stores it in the
per-Sandbox Secret (`sandbox-<owner-uid>-secrets`, data-key
`stripe_api_key`) via the same write-path sandbox.secrets.write
uses, returns a masked confirmation (`sk_test_…xY12`).
- sandbox.stripe.listProducts — reads the bound key implicitly,
GET /v1/products with limit (1-100, default 20), active, and
starting_after cursor passthrough.
- sandbox.stripe.listPrices {product_id?} — same pagination shape;
optional product_id filter.
- sandbox.stripe.createCheckoutSession {price_id, success_url,
cancel_url} — validates absolute http(s) URLs, POSTs the
form-encoded line_items[0][price/quantity] body to
/v1/checkout/sessions, returns the hosted Checkout URL + session id.
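The key checks and the masked confirmation described for bindAccount, plus the list-limit clamp, might look roughly like this in Python. All function names are hypothetical, and `MIN_SUFFIX_LEN` is an assumed value: the tests mention a length check but the actual threshold is not stated here.

```python
ALLOWED_PREFIXES = ("sk_live_", "sk_test_", "rk_live_", "rk_test_")
MIN_SUFFIX_LEN = 8  # assumption: the real minimum length is not given above


def key_prefix(api_key: str) -> str:
    """Return the matching prefix, or raise if the key shape is unsupported."""
    for prefix in ALLOWED_PREFIXES:
        if api_key.startswith(prefix) and len(api_key) - len(prefix) >= MIN_SUFFIX_LEN:
            return prefix
    raise ValueError("unsupported Stripe key prefix or length")


def mask_key(api_key: str) -> str:
    """Masked confirmation: prefix, an ellipsis, and the last four characters."""
    return f"{key_prefix(api_key)}…{api_key[-4:]}"


def clamp_limit(limit=None) -> int:
    """Pagination limit: default 20, clamped into the 1-100 range."""
    if limit is None:
        return 20
    return max(1, min(100, int(limit)))
```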
Implementation:
- No new module dep — inline HTTPS calls to api.stripe.com via the
stdlib net/http client. stripe-go v82 would have pulled ~80
transitive packages for four endpoints; the surface we need is
tiny enough that a 100-line stripeDo helper covers it. Matches
the task's "stripe-go v82 if not already in deps; else inline
HTTPS" guidance.
- The key never round-trips on the wire after first bind. Agent
pastes once via bindAccount; every subsequent call reads it from
the Secret store. Stripe-Version header pinned to 2024-06-20 so
a future API revision can't silently break the wire format.
- Auth: RequiredCapability="sandbox.stripe" on every tool.
claims.OrgID match enforced by the registry's existing gate.
- Read-only cluster invariant: the only writes are to the
per-Sandbox Secret. assertManagedBy() enforced on bind so we
cannot mutate the controller-injected `sandbox-tokens` Secret.
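As a rough illustration of the inline-HTTPS approach, the form-encoded createCheckoutSession body and the pinned headers could be assembled as below. This is a Python sketch, not the Go `stripeDo` helper; the function names and the default quantity of 1 are assumptions, and any other fields the real handler sends are not shown.

```python
from urllib.parse import urlencode, urlparse


def require_absolute_http_url(url: str) -> str:
    """Reject anything that is not an absolute http(s) URL."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"not an absolute http(s) URL: {url!r}")
    return url


def checkout_form_body(price_id: str, success_url: str, cancel_url: str,
                       quantity: int = 1) -> str:
    """Form-encode the line_items[0][price/quantity] body for /v1/checkout/sessions."""
    fields = [
        ("line_items[0][price]", price_id),
        ("line_items[0][quantity]", str(quantity)),  # quantity=1 is an assumption
        ("success_url", require_absolute_http_url(success_url)),
        ("cancel_url", require_absolute_http_url(cancel_url)),
    ]
    return urlencode(fields)


def stripe_headers(api_key: str) -> dict:
    """Headers for a call to api.stripe.com with the pinned API version."""
    return {
        "Authorization": f"Bearer {api_key}",
        "Stripe-Version": "2024-06-20",
        "Content-Type": "application/x-www-form-urlencoded",
    }
```

Validating the URLs before encoding keeps a malformed redirect from ever reaching Stripe, matching the validate-then-POST order described for the tool.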
Tests cover key validation (prefix + length), masking format, limit
clamping, the httptest.Server-backed happy-path + error-envelope
unwrap, form-urlencoded body shape for createCheckoutSession,
catalogue wiring (all four handlers non-nil, RequiredCapability
matches), and the registry capability gate (missing sandbox.stripe
cap → forbidden).
Closes the Wave 13 "last MCP namespace" gap; no chart bump.
Co-authored-by: Claude <noreply@anthropic.com>
Append a Wave 12-14 addendum to the convergence report capturing:
- t-prov cycle log (t13 FAIL, t14 FAIL, t15 PASS, t16-t19 STUCK on stale chart, t20 in flight on 1.4.162)
- Three silent-failure traps: Wave 8 CloudPage TS error stalled UI builds 3h; Wave 13 mcp-server Dockerfile context broke sandbox-mcp builds for 3 days since #1658; Wave 14 bootstrap-kit pin lag stalled all chart propagation for 6h of provs
- Wave 12-14 PR roster (#1656/#1658/#1659/#1660/#1661/#1662/#1663/#1664/#1666/#1667) plus session total now 51 PRs
- Lesson 6: deploy-bot does NOT auto-bump the bootstrap-kit slot 13 pin; manual collector PR required per cycle
Companion memos (out-of-tree, not in this PR):
- session_2026_05_18_overnight_22prs.md gets a Wave 12-14 outcomes section
- new feedback_bootstrap_kit_pin_lag.md pins the pattern + detection one-liner
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* docs: session 2026-05-17/18 Wave 12-14 addendum + bootstrap-kit pin lag feedback
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(sandbox-chart): add no-upstream annotation (unblock Blueprint Release pipeline)
Blueprint Release CI was failing on every push that touched
platform/sandbox/chart/* since PR #1622 because the chart declared
neither `dependencies:` nor the `catalyst.openova.io/no-upstream: "true"`
annotation. Per docs/BLUEPRINT-AUTHORING.md §11.1 every umbrella chart at
platform/<name>/chart/ MUST do one of those two.
Sandbox is Catalyst-authored (sandbox-controller built in-house), so the
no-upstream annotation is correct. Matches existing pattern in:
- platform/bp-vcluster-helmrepo/chart/Chart.yaml
- platform/cnpg-pair/chart/Chart.yaml
- platform/external-secrets-stores/chart/Chart.yaml
Without this, Blueprint Release fails → bp-catalyst-platform chart
artifact at 1.4.162 never republishes with the latest sandbox image
refs (cadc7b5 from PR #1667 auto-bump) → fresh provs keep getting
stale sandbox runtime images.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sandbox chart was un-deployable end-to-end because three CI-side gaps
(items 1-3 below; item 4 is a stop-gap pin) compounded after PR #1658
wired the mcp-server module to depend on core/controllers +
core/services/shared via `replace` directives:
1. **mcp-server Dockerfile built against a too-narrow context**. The
workflow passed `context: products/sandbox/mcp-server` and the
Dockerfile assumed `COPY . .` could see everything it needed, but
the `replace ../../../core/controllers` line in the module's go.mod
only resolves when the build can actually reach those paths. Result:
every push after #1658 failed at `go build` with `module not found`.
Fix mirrors core/controllers/sandbox/Dockerfile (Slice-CC1 layout):
COPY the replace targets' module roots + sources, then build with
WORKDIR set to the dependent module. Static binary still produced
into a distroless/static-debian12:nonroot final stage.
2. **mcp-server workflow had no chart auto-bump step**. Even after a
green build, `runtime.mcpImage` in platform/sandbox/chart/values.yaml
stayed empty so the chart's `required` guard
(deployment.yaml line 72) refused to render. Added the same
yq-bump + bot-commit pattern build-sandbox-controller.yaml already
uses, targeting `.runtime.mcpImage` and writing a fully-qualified
`<repo>:<sha>` string (consumer reads it as one image reference,
not a {repository,tag} pair). Also widened paths-filter to include
core/controllers/** + core/services/shared/** so changes to the
replace targets re-trigger the build.
3. **pty-server workflow had no auto-bump either**. Same surgery:
yq-bump `.runtime.ptyServerImage` + commit-and-push. Context stays
narrow (pty-server has no cross-tree `replace` directives).
4. **Stop-gap pin values for runtime.{ptyServerImage,mcpImage}** so the
next chart rollout doesn't fail fast before the rebuilt workflows
land their first bumps:
- ptyServerImage → ad5163e6 (current latest pty-server)
- mcpImage → 1b0e86c (last pre-#1658 green build; the rebuilt
workflow will land the next real SHA on the next push to main).
Verified locally:
- `go build ./products/sandbox/mcp-server/...` clean (43.8 MB static
binary at /tmp/openova-sandbox-mcp; `file` confirms statically
linked ELF).
- `helm template test platform/sandbox/chart --set enabled=true …`
renders cleanly; both env vars carry the SHA-pinned image refs.
No Chart.yaml bump. Read-only clusters.
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces the Wave 2 stubs for the sandbox.storage.* namespace with real
handlers backed by the host-cluster's unified SeaweedFS S3 API
(`seaweedfs.storage.svc:8333` per platform/seaweedfs/README.md). Every
handler is scoped to buckets prefixed `sandbox-<owner-uid>-` so the
agent cannot touch any other consumer's bucket (loki-data / cnpg-wal /
harbor-data, etc.).
Tools shipped:
- sandbox.storage.bindBucket {bucket_name?}
- sandbox.storage.signedUploadURL {bucket, key, expires_in_seconds?}
- sandbox.storage.signedDownloadURL {bucket, key, expires_in_seconds?}
- sandbox.storage.listBuckets
- sandbox.storage.deleteBucket {name}
Wire model: minio-go v7 (already canonical across the OpenOva tree —
catalyst-bootstrap/hetzner objectstorage depend on it) speaking S3 v4
to SeaweedFS. Presigned URLs default to 15 min and clamp to 7 days
(the S3 v4 signature ceiling).
Defence-in-depth: prefix-mismatch + 63-char S3 cap + alnum-only object
key regex all enforced BEFORE any S3 dial; arg-validation errors
surface clearly without first hitting a misleading creds error.
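The pre-dial validation order can be sketched as follows. This is a Python illustration under stated assumptions, not the Go handler: the function names are hypothetical, the object-key pattern is a guess at what "alnum-only" means, the 1-second lower bound on expiry is assumed, and owner UIDs are assumed to be fixed-length so one owner's prefix cannot be a prefix of another's.

```python
import re

MAX_BUCKET_LEN = 63               # S3 bucket-name cap cited above
DEFAULT_EXPIRY_S = 15 * 60        # 15-minute default
MAX_EXPIRY_S = 7 * 24 * 60 * 60   # 7-day presigned-URL ceiling
OBJECT_KEY_RE = re.compile(r"[A-Za-z0-9]+")  # assumed pattern, not the real regex


def scoped_bucket(owner_uid: str, name: str) -> str:
    """Refuse any bucket outside sandbox-<owner-uid>- or over the length cap."""
    prefix = f"sandbox-{owner_uid}-"
    if not name.startswith(prefix) or name == prefix:
        raise ValueError("bucket is outside this Sandbox's prefix")
    if len(name) > MAX_BUCKET_LEN:
        raise ValueError("bucket name exceeds 63 characters")
    return name


def validate_object_key(key: str) -> str:
    """Reject object keys that do not match the alphanumeric pattern."""
    if not OBJECT_KEY_RE.fullmatch(key):
        raise ValueError("object key must be alphanumeric")
    return key


def clamp_expiry(seconds=None) -> int:
    """Presign expiry: default 15 minutes, clamped to at most 7 days."""
    if seconds is None:
        return DEFAULT_EXPIRY_S
    return max(1, min(MAX_EXPIRY_S, int(seconds)))
```

Because all three checks are pure functions, they run before any S3 client is constructed, so a bad argument cannot be masked by a credentials error.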
New env vars (sandbox-controller fills these at MCP Deployment
spec time):
SANDBOX_STORAGE_S3_ENDPOINT = "seaweedfs.storage.svc:8333"
SANDBOX_STORAGE_S3_ACCESS_KEY = "<per-Sandbox IAM access key>"
SANDBOX_STORAGE_S3_SECRET_KEY = "<per-Sandbox IAM secret>"
SANDBOX_STORAGE_S3_USE_TLS = "true|false" (default: false)
SANDBOX_STORAGE_S3_REGION = "us-east-1" (default; opaque to SeaweedFS)
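A minimal Python sketch of reading these variables with the stated defaults is below. `StorageConfig` and `load_storage_config` are hypothetical names, and treating the endpoint and both keys as required is an assumption; only the variable names and the two defaults come from the list above.

```python
from dataclasses import dataclass
from typing import Mapping

REQUIRED = (
    "SANDBOX_STORAGE_S3_ENDPOINT",
    "SANDBOX_STORAGE_S3_ACCESS_KEY",
    "SANDBOX_STORAGE_S3_SECRET_KEY",
)


@dataclass(frozen=True)
class StorageConfig:
    endpoint: str
    access_key: str
    secret_key: str
    use_tls: bool
    region: str


def load_storage_config(env: Mapping[str, str]) -> StorageConfig:
    """Read the S3 settings from an environment mapping, applying defaults."""
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:  # assumption: these three are mandatory
        raise ValueError("missing env vars: " + ", ".join(missing))
    return StorageConfig(
        endpoint=env["SANDBOX_STORAGE_S3_ENDPOINT"],
        access_key=env["SANDBOX_STORAGE_S3_ACCESS_KEY"],
        secret_key=env["SANDBOX_STORAGE_S3_SECRET_KEY"],
        # Default false: only the literal "true" (case-insensitive) enables TLS.
        use_tls=env.get("SANDBOX_STORAGE_S3_USE_TLS", "false").strip().lower() == "true",
        region=env.get("SANDBOX_STORAGE_S3_REGION") or "us-east-1",
    )
```

Passing the environment as a mapping (for example `os.environ`) keeps the loader testable without mutating process state.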
Auth: same shape as PR #1658 (sandbox.auth.* + sandbox.secrets.*) —
claims.OrgID must match env.OrgID, RequiredCapability=sandbox.storage.
Tests: 9 new test functions covering prefix format, scope-gate,
bucket-name normalization (including cross-Sandbox refusal + length
guard), object-key validation, expiry clamp, region default, per-tool
arg validation, capability gate, and catalogue wiring. `go build` +
`go test` clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>