The catalyst-platform chart's templates/sme-services/provisioning-github-token.yaml
mirrors gitea-admin-secret.password verbatim into
sme/provisioning-github-token.GITHUB_TOKEN. The SME provisioning service
then sends `Authorization: token <PWD>` to Gitea. Gitea resolves the
Bearer/token credential as an API access token (sha1 lookup); the admin
password is not an access token, so Gitea returns 401 "user does not
exist [uid: 0, name: ]".
End result on t22: voucher checkout returns 200, /jobs redirect fires,
but no Organization CR is ever created (every Gitea API call from
provisioning 401s). Journey step 16 stalls indefinitely.
Verified on t22 (2026-05-18):
- sme/provisioning-github-token.GITHUB_TOKEN.last8 == gitea-admin-secret.password.last8 == ChxCejmH
- curl -H "Authorization: token <pwd>" /api/v1/user → 401 user does not exist
- curl -u gitea_admin:<pwd> /api/v1/user → 200 OK (Basic works, token doesn't)
- 0 organizations.orgs.openova.io cluster-wide
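Illustrative sketch of why the two curl probes above behave differently: the same password string yields two different Authorization header shapes, and only the Basic form is a username/password credential. The function names and sample values are assumptions made for this example, not code from the chart.

```python
import base64


def basic_auth_header(username: str, password: str) -> str:
    """Build an HTTP Basic Authorization value (what `curl -u user:pwd` sends)."""
    raw = f"{username}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")


def token_auth_header(access_token: str) -> str:
    """Build the `Authorization: token <X>` value, which is looked up as an API access token."""
    return f"token {access_token}"


if __name__ == "__main__":
    # Placeholder credentials only; never print real secrets.
    print(basic_auth_header("gitea_admin", "example-password"))
    print(token_auth_header("example-password"))
```

Passing the admin password to `token_auth_header` reproduces the failing request shape: the header is well-formed, but the value is not a minted access token.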
Fix: new cutover step 09 (gitea-token-mint) runs alongside the existing
01..08 chain at handover. The step:
1. DELETEs any stale catalyst-platform-bootstrap token (idempotent —
404 swallowed on first run).
2. POSTs /api/v1/users/gitea_admin/tokens with scope "all".
3. Captures the returned .sha1 (raw token bytes appear there exactly
once — Gitea hashes server-side after creation).
4. Validates by calling GET /api/v1/user with `Authorization: token <X>`
and asserts 200 + non-empty login field.
5. kubectl-patches Secret sme/provisioning-github-token.GITHUB_TOKEN
to the new token via strategic-merge stringData (kubectl base64s).
6. Rolls the provisioning Deployment so the new token takes effect
immediately (best-effort — skipped if marketplace disabled).
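Steps 1-4 can be sketched as follows. This is a minimal Python illustration, not the shell step shipped in the chart: the `send` transport callable, the accepted status codes, and the request body field names (`name`, `scopes`) are assumptions made for the example. Steps 5 and 6 (the kubectl patch and the rollout) are left out.

```python
from typing import Callable, Optional, Tuple

# Hypothetical transport: (method, path, headers, json_body) -> (status, json)
Transport = Callable[[str, str, dict, Optional[dict]], Tuple[int, dict]]

TOKEN_NAME = "catalyst-platform-bootstrap"
TOKENS_PATH = "/api/v1/users/gitea_admin/tokens"


def mint_token(send: Transport, admin_headers: dict) -> str:
    """Delete any stale token, mint a fresh one, validate it, and return it."""
    # Step 1: idempotent delete; a 404 on first run is swallowed.
    status, _ = send("DELETE", f"{TOKENS_PATH}/{TOKEN_NAME}", admin_headers, None)
    if status >= 300 and status != 404:
        raise RuntimeError(f"stale-token delete failed: HTTP {status}")

    # Step 2: create the token with scope "all" (body shape is an assumption).
    status, body = send("POST", TOKENS_PATH, admin_headers,
                        {"name": TOKEN_NAME, "scopes": ["all"]})
    if status not in (200, 201):
        raise RuntimeError(f"token create failed: HTTP {status}")

    # Step 3: the raw token is only returned once, in the .sha1 field.
    token = body.get("sha1", "")
    if not token:
        raise RuntimeError("token create response carried no sha1 field")

    # Step 4: validate with the same header shape the consumer will send.
    status, user = send("GET", "/api/v1/user",
                        {"Authorization": f"token {token}"}, None)
    if status != 200 or not user.get("login"):
        raise RuntimeError(f"minted token failed validation: HTTP {status}")
    return token
```

Injecting the transport keeps the sequence testable without a live Gitea; a real caller would wrap an HTTP client in the same signature.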
Order=9 (last) is functionally fine — none of steps 02-08 read the
provisioning-github-token Secret, and the SME provisioning service first
consumes the token at voucher checkout time (always postdates cutover).
Using slot 9 rather than 1b avoids renumbering 01..08, which would
invalidate operator history in the cutover-status ConfigMap audit trail.
Token credentials never appear in process argv (passed via stdin / env
to kubectl), and validate-failure paths sed-redact the new token from
stderr before surfacing the response body.
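A minimal sketch of the redaction idea, in Python rather than the sed call the step uses. The function name and the mask string are assumptions for illustration.

```python
def redact(text: str, secret: str, mask: str = "***REDACTED***") -> str:
    """Replace every occurrence of `secret` in `text` with `mask`."""
    # Guard: replacing an empty string would insert the mask between every character.
    if not secret:
        return text
    return text.replace(secret, mask)


if __name__ == "__main__":
    body = 'validation failed for token "tok123"'
    print(redact(body, "tok123"))
```

The empty-secret guard matters on failure paths where the token was never captured.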
Contract-test guard added (Case 19): step ConfigMap rendered with
order=9, the POST /api/v1/users/.../tokens call present, sha1 capture
present, Authorization: token validation present, kubectl patch present.
Existing step-count gates updated 8 → 9 and 7 job-mode → 8.
chart bp-self-sovereign-cutover: 0.1.29 → 0.1.30
Refs TBD-C18
Co-authored-by: hatiyildiz <hati.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The Apache Guacamole webapp deploys under Tomcat's context path
`/guacamole/` (the WAR is `guacamole.war` so Tomcat exposes it at
`/<warname>/`). Tomcat's ROOT context at `/` returns 404. Probing
`/` previously caused both liveness AND readiness probes to fail
with HTTP 404 → kubelet restarted the Pod every ~60s → kube-system
Cilium gateway returned HTTP 503 to `https://guacamole.<sov>/`
because no Endpoint was ever Ready (observed on t22, 5 restarts in
8m of uptime).
Probing `/guacamole/` matches the actual servlet context the
webapp registers at boot.
Chart bump 0.1.22 -> 0.1.23. Bootstrap-kit pin follow-up in a
separate PR (pattern matches #1693 + #1694).
Co-authored-by: hatiyildiz <hati.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
On t22 (omantel.biz fresh Sovereign) 2 of 15 HTTPRoutes went
Accepted=False because their parentRef pointed at a gateway that
does not exist on any Sovereign:
catalyst-system/guacamole-server -> gateway-system/cilium-gateway
catalyst-system/openova-flow-server -> kube-system/catalyst-gateway
The canonical Sovereign Gateway is kube-system/cilium-gateway,
installed by bootstrap-kit/01-cilium.yaml and used by every other
HTTPRoute (catalyst-api, catalyst-ui, marketplace, gitea, harbor,
keycloak, grafana, hubble-ui, openbao, powerdns, tenant-wildcard).
gateway-system does not exist; catalyst-gateway does not exist.
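The failure mode above (a parentRef naming a Gateway that is not installed) can be checked mechanically. The sketch below is an assumption-laden illustration: routes and gateways are plain tuples rather than Kubernetes objects, and `dangling_routes` is a hypothetical helper, not part of the repo.

```python
from typing import Iterable, List, Set, Tuple

GatewayKey = Tuple[str, str]  # (namespace, name)


def dangling_routes(routes: Iterable[Tuple[str, GatewayKey]],
                    gateways: Set[GatewayKey]) -> List[str]:
    """Return the routes whose parentRef does not match any installed Gateway."""
    return [route for route, parent in routes if parent not in gateways]


if __name__ == "__main__":
    installed = {("kube-system", "cilium-gateway")}
    routes = [
        ("catalyst-system/guacamole-server", ("gateway-system", "cilium-gateway")),
        ("catalyst-system/openova-flow-server", ("kube-system", "catalyst-gateway")),
    ]
    for name in dangling_routes(routes, installed):
        print("dangling parentRef:", name)
```

Both namespace and name have to match; each of the two broken routes got exactly one of them right.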
Fixes:
- platform/guacamole/chart/values.yaml — default
guacamole.httproute.parentRef.namespace: gateway-system -> kube-system
- clusters/_template/bootstrap-kit/56-bp-openova-flow-server.yaml —
flowServer.httproute.gatewayRef.name: catalyst-gateway -> cilium-gateway
(namespace already kube-system, untouched)
Verified on t22: all 15 HTTPRoutes now Accepted=True after chart bump
+ Flux reconcile.
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Stop handing every Sandbox session the full MCP surface. Each per-Sandbox
NewAPI token now carries a plan-derived capability allowlist that the MCP
server enforces against per-tool RequiredCapability via Claims.HasCapability:
- Free: read-only k8s + gitea read + session/rag/skills
- Pro: + sandbox.db.* + sandbox.storage.* + sandbox.preview.* +
sandbox.auth.* + sandbox.secrets.* + marketplace.* + flux.status
- Ent: + sandbox.deploy.{staging,production,...} + sandbox.stripe.* +
flux.{reconcile,suspend,resume} + gitea.pr.{create,merge} +
gitea.issue.*
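The ladder is cumulative: each plan carries everything from the plan below it. A rough Python sketch of that shape follows; it is not the Go `CapabilitiesForPlan`. The plan ids (`free`, `pro`, `ent`) and the Free-tier capability names are hypothetical placeholders, and the elided members of the `sandbox.deploy.{...}` set are deliberately not filled in.

```python
from typing import List, Sequence

# Hypothetical names: the commit message only describes the Free tier in prose.
FREE = ["k8s.read", "gitea.read", "session.*", "rag.*", "skills.*"]

PRO_EXTRA = [
    "sandbox.db.*", "sandbox.storage.*", "sandbox.preview.*",
    "sandbox.auth.*", "sandbox.secrets.*", "marketplace.*", "flux.status",
]

# Only the two deploy targets named in the message; the rest are elided there.
ENT_EXTRA = [
    "sandbox.deploy.staging", "sandbox.deploy.production", "sandbox.stripe.*",
    "flux.reconcile", "flux.suspend", "flux.resume",
    "gitea.pr.create", "gitea.pr.merge", "gitea.issue.*",
]

_LADDER = {
    "free": FREE,
    "pro": FREE + PRO_EXTRA,
    "ent": FREE + PRO_EXTRA + ENT_EXTRA,
}


def capabilities_for_plan(plan_id: str) -> List[str]:
    """Return the cumulative grant list for a plan; unknown plans get nothing."""
    return list(_LADDER.get(plan_id.lower(), []))


def resolve_capabilities(plan_id: str, spec_capabilities: Sequence[str]) -> List[str]:
    """spec.capabilities wins when set, otherwise fall back to the plan."""
    if spec_capabilities:
        return list(spec_capabilities)
    return capabilities_for_plan(plan_id)
```

Returning an empty list for an unknown plan is an assumption chosen so that a typo in the plan id fails closed.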
Wiring:
- Sandbox CRD spec gains planId + capabilities[] (operator overlay).
- Sandbox sandboxapi.{CapabilitiesForPlan,ResolveCapabilities} is the
SoT; tenant orchestrator carries an exact-mirror capabilitiesForPlan
(no controllers-module dep — same isolation pattern quotaForPlan
uses).
- sandbox-controller threads spec.capabilities (falling back to plan)
into newapi.MintRequest.
- catalyst-api bridge handler accepts capabilities[] on the wire and
encodes it as the JWT `capabilities` claim (omitted when empty).
- Claims.HasCapability gains wildcard prefix matching (`sandbox.db.*`
satisfies `sandbox.db.provision`, `sandbox.db`, etc.) so plan grants
stay coarse. Plain stem matches WITHOUT a wildcard are intentionally
rejected — the production second-gate in sandbox_deploy.go stays
honest.
- MCP registry: every gated tool now carries its granular dotted
RequiredCapability (`sandbox.db.provision`, `gitea.pr.list`, …).
Read-only / session tools previously ungated also get granular
grants so Free tokens can browse without inheriting the write
surface.
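The wildcard rule described for Claims.HasCapability can be sketched as below. This is a Python illustration of the stated behaviour, not the Go implementation; treating an exact string match as a grant is an assumption.

```python
from typing import Iterable


def has_capability(granted: Iterable[str], required: str) -> bool:
    """Return True when any grant satisfies the required dotted capability."""
    for grant in granted:
        # Exact match (assumed to count as a grant).
        if grant == required:
            return True
        # Wildcard grant: `sandbox.db.*` covers `sandbox.db` and anything below it.
        if grant.endswith(".*"):
            stem = grant[:-2]
            if required == stem or required.startswith(stem + "."):
                return True
        # A plain stem without `.*` never widens to its children.
    return False


if __name__ == "__main__":
    print(has_capability(["sandbox.db.*"], "sandbox.db.provision"))
    print(has_capability(["sandbox.db"], "sandbox.db.provision"))
```

Matching on `stem + "."` rather than on the bare stem keeps `sandbox.db.*` from accidentally covering a sibling such as `sandbox.dbx.provision`.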
No Chart.yaml bump: CRD additions are additive, so existing Sandbox CRs
parse fine. A token with empty capabilities downgrades to introspection
only, matching pre-PR-#1671 callers.
Tests: shared/auth/claims_test.go (wildcard matrix),
sandboxapi/capabilities_test.go (plan ladder + spec override),
sandbox_token_test.go (capabilities round-trip + omit-on-empty),
sandbox_controller_test.go (plan-derived + spec-override mint),
sandbox_consumer_test.go (orchestrator stamps spec.capabilities), plus
updates to every per-namespace registry test asserting new granular
RequiredCapability values.
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Before this fix, the bootstrap-kit overlay at
clusters/_template/bootstrap-kit/80-newapi.yaml:171 set
`auth.adminUI.keycloak.existingSecret: newapi-oidc`, but nothing in the
chart or the operator overlay materialised that Secret. The Pod
stayed in `CreateContainerConfigError: secret "newapi-oidc" not
found`, blocking the entire bp-newapi HR from reaching Ready (t20
debug matrix Fix #6).
New template templates/keycloak-client-secret.yaml uses Helm `lookup`
to retrieve the existing Secret bytes on every reconcile (idempotent —
preserves the OIDC client secret across upgrades), falling back to
`randAlphaNum 32` on first install. Mirrors the existing canonical
seam in the same chart (templates/credentials-secret.yaml issue #943,
templates/sandbox-token-signing-key-secret.yaml PR #1638) — both use
the same lookup-or-generate pattern with helm.sh/resource-policy: keep.
The sister chart
platform/guacamole/chart/templates/keycloak-client-secret.yaml uses a
SealedSecret placeholder + a bootstrap Job hook; THIS chart already
standardised on the simpler `randAlphaNum + Helm lookup` pattern, so this
Secret follows the same seam to keep the chart's
credential-materialisation strategy consistent.
Gated on `auth.adminUI.mode=keycloak` AND non-empty
`auth.adminUI.keycloak.existingSecret`, so non-keycloak installs
render nothing extra.
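The lookup-or-generate behaviour and its gate can be modelled in a few lines. The sketch below is Python standing in for the Helm template logic; the `client-secret` data key and the function names are hypothetical, and the existing Secret is represented as a plain dict.

```python
import secrets
import string
from typing import Optional

_ALNUM = string.ascii_letters + string.digits


def rand_alnum(length: int) -> str:
    """Generate a random alphanumeric string, comparable to Helm's randAlphaNum."""
    return "".join(secrets.choice(_ALNUM) for _ in range(length))


def client_secret_value(existing: Optional[dict], key: str = "client-secret") -> str:
    """Reuse the stored value when the Secret already exists, else generate one."""
    if existing and existing.get(key):
        return existing[key]
    return rand_alnum(32)


def should_render(mode: str, existing_secret_name: str) -> bool:
    """Render only for keycloak mode with a non-empty existingSecret name."""
    return mode == "keycloak" and bool(existing_secret_name)
```

Reusing the stored value on every reconcile is what keeps the OIDC client secret stable across upgrades; generation only happens when nothing is found.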
No Chart.yaml bump — pure template addition.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The official Apache Guacamole image entrypoint runs `rm -rf
$GUACAMOLE_HOME` (== `/home/guacamole/.guacamole`) before re-populating
the directory on every start. When the chart mounted an emptyDir
directly at `/home/guacamole/.guacamole`, that path was a mount point
from the kernel's perspective, so `rm` failed with:
rm: cannot remove '/home/guacamole/.guacamole':
Read-only file system
— the entrypoint exited non-zero and the Pod CrashLoopBackOff'd before
the webapp ever started. (t20 debug matrix — Fix #5.)
Mount the PARENT directory (`/home/guacamole`) instead. `.guacamole`
becomes a regular subdirectory inside the emptyDir, which the
entrypoint can freely `rm -rf` and recreate. The webapp's first-start
writes still land in a writable location under readOnlyRootFilesystem.
No Chart.yaml version bump per the t20 hard-rules contract — chart
release will roll in the next blueprint-release wave.
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per t20 debug matrix:
* `bp-self-sovereign-cutover` step-06 phase-1 rewrites every HelmRepository
URL from `oci://ghcr.io/openova-io` to `oci://${harbor_host}/openova-io`,
where `harbor_host` is derived from `sovereign.harborPublicURL`.
* Pre-fix: `harborPublicURL: https://harbor.${SOVEREIGN_FQDN}`.
* But the bp-harbor HTTPRoute publishes at `registry.${SOVEREIGN_FQDN}` —
see `clusters/_template/bootstrap-kit/19-harbor.yaml` line 167
(`gateway.host: registry.${SOVEREIGN_FQDN}`). No HTTPRoute matches
`harbor.<sov>`, so post-pivot every OCI chart pull EOFs.
* Effect: bp-sandbox HR never Ready → bootstrap-kit Kustomization stuck
waiting on bp-sandbox health → t20 convergence blocks indefinitely.
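The host derivation and URL rewrite described above can be sketched as follows. This is a Python illustration of the step-06 logic, not the cutover script itself; the function names are assumptions.

```python
from urllib.parse import urlparse

UPSTREAM_PREFIX = "oci://ghcr.io/openova-io"


def harbor_host(harbor_public_url: str) -> str:
    """Derive the registry host from sovereign.harborPublicURL."""
    host = urlparse(harbor_public_url).netloc
    if not host:
        raise ValueError(f"no host in harborPublicURL: {harbor_public_url!r}")
    return host


def rewrite_repo_url(url: str, harbor_public_url: str) -> str:
    """Point an upstream HelmRepository URL at the Sovereign's own registry."""
    if url != UPSTREAM_PREFIX and not url.startswith(UPSTREAM_PREFIX + "/"):
        return url  # not an upstream openova-io URL; leave untouched
    suffix = url[len(UPSTREAM_PREFIX):]
    return f"oci://{harbor_host(harbor_public_url)}/openova-io{suffix}"


if __name__ == "__main__":
    print(rewrite_repo_url(UPSTREAM_PREFIX, "https://registry.example.local"))
```

Because the host comes straight from the configured URL, a value that names a hostname with no matching HTTPRoute produces rewritten URLs that look valid but cannot be pulled.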
Fix (chart-level, no Chart.yaml bump for bp-catalyst-platform):
* `clusters/_template/bootstrap-kit/06a-bp-self-sovereign-cutover.yaml`
overlay value flipped `harbor.${SOVEREIGN_FQDN}` → `registry.${SOVEREIGN_FQDN}`.
* `platform/self-sovereign-cutover/chart/values.yaml` default placeholder
flipped `harbor.example.local` → `registry.example.local` so smoke
renders + docs line up.
* README + smoke command updated.
Smoke tests:
* `helm template smoke platform/self-sovereign-cutover/chart` — clean,
1851 lines, `HARBOR_PUBLIC_URL=https://registry.example.local`.
* `helm template smoke ... --set sovereign.harborPublicURL=https://registry.otechN.omani.works`
— clean, all step env vars carry the new host.
* `kubectl kustomize clusters/_template/bootstrap-kit/` — clean, 2926 lines,
overlay shows `harborPublicURL: https://registry.${SOVEREIGN_FQDN}`.
* `bash platform/self-sovereign-cutover/chart/tests/cutover-contract.sh`
— all gates green (Phase-0 ghcr-pull auth merge still works because
`harbor_host` is derived from `HARBOR_PUBLIC_URL` env at runtime, so
the script now correctly merges auth for `registry.<sov-fqdn>` instead
of `harbor.<sov-fqdn>`).
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR #1674 shipped the Sandbox Runtime Grafana dashboard with three
panels whose metrics did not yet exist anywhere in the fleet:
- "WebSocket Connections" → pty_server_websocket_connections (Gauge)
- "Idle-Timeout Scale-Down Events / hour" →
sandbox_controller_idle_timeout_events_total (Counter)
- "newapi Token Mint Requests / hour" →
newapi_admin_token_mint_requests_total{tool,status} (Counter)
Per Inviolable Principle #11 the panels render "No data" until the
emitter sides roll out. This PR closes that loop.
pty-server (products/sandbox/pty-server)
- New metrics.go: registers Gauge pty_server_websocket_connections
via promauto + exposes promhttp.Handler.
- routes.go: serves /metrics on GET; Inc/Dec the gauge around every
successful WS upgrade in attach() and cards() (Defer Dec so abnormal
returns still decrement).
- go.mod: + github.com/prometheus/client_golang v1.19.1 (matches the
version core/controllers already pulls in transitively).
- New unit test asserts GET /metrics carries the gauge name.
sandbox-controller (core/controllers/sandbox/internal/idlescaler)
- New metrics.go: Counter sandbox_controller_idle_timeout_events_total
with label {namespace} registered on controller-runtime's shared
registry (so the manager's existing :8080 /metrics endpoint surfaces
it — no new listener).
- idlescaler.go: bumps the counter inside scaleToZero() so every Pod
scaled to 0 ticks once. Namespace label matches the dashboard panel's
`sum by (namespace) (rate(...))` aggregation.
- New unit test verifies the counter delta is 1 on a successful
scale-to-zero pass.
newapi bridge handler (platform/newapi/internal/handler)
- New metrics.go: CounterVec newapi_admin_token_mint_requests_total
with labels {tool, status}; helper classifyStatus() maps HTTP codes
to a finite cardinality of 7 status values (ok / unauthorized /
bad_request / unavailable / server_error / method_not_allowed /
other). Exported MetricsHandler() so the catalyst-api wiring code
can mount /metrics on the same listener as the bridge.
- sandbox_token.go: recordMint(r, status) at every return path so the
counter ticks regardless of which branch the request hits.
- go.mod: + github.com/prometheus/client_golang v1.19.1.
- 5 new test cases assert counter delta == 1 for the documented
status transitions and that the X-Catalyst-Tool header surfaces
as the `tool` label.
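A bounded status label keeps the counter's cardinality fixed no matter which HTTP codes appear. The sketch below shows one plausible mapping onto the seven values listed above; the specific code-to-label assignments are assumptions, since the commit message names the labels but not the codes behind them.

```python
def classify_status(code: int) -> str:
    """Map an HTTP status code onto one of seven fixed label values."""
    if 200 <= code < 300:
        return "ok"
    if code == 400:
        return "bad_request"
    if code == 401:
        return "unauthorized"
    if code == 405:
        return "method_not_allowed"
    if code == 503:
        return "unavailable"
    if 500 <= code < 600:
        return "server_error"
    return "other"


if __name__ == "__main__":
    for code in (200, 400, 401, 405, 500, 503, 418):
        print(code, classify_status(code))
```

The 503 check has to come before the generic 5xx branch, otherwise `unavailable` would never be emitted.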
sandbox-controller → newapi client (core/controllers/sandbox/internal/newapi)
- client.go: stamp `X-Catalyst-Tool: sandbox-controller` on every
outbound POST /admin/tokens/sandbox so the bridge counter's `tool`
label has the canonical value the dashboard panel filters on.
Helm charts
- platform/sandbox/chart/templates/service.yaml (new): ClusterIP
Service exposing the controller's :8080 metrics port. Required so a
ServiceMonitor selector has something to attach to.
- platform/sandbox/chart/templates/servicemonitor.yaml (new):
monitoring.coreos.com/v1 ServiceMonitor scoped to the metrics
Service. Default-off + double-guarded with
`.Capabilities.APIVersions.Has "monitoring.coreos.com/v1"` (matches
platform/harbor/chart/templates/servicemonitor.yaml pattern). values
block + per-Sovereign overrides (interval / scrapeTimeout / path /
labels / namespace) per Inviolable Principle #4.
- platform/newapi/chart/templates/servicemonitor.yaml (new): mirror
template targeting the existing bp-newapi Service / `http` port for
when the catalyst-api binary that mounts the bridge handler rolls
out behind the same Service. Default-off, capability-guarded.
No Chart.yaml bump. Validated: helm template + helm lint clean
on both charts; go build + go test clean across all three modules
(pty-server, core/controllers/sandbox, platform/newapi/internal/handler).
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wave 15 — closes the verification gap between PR #1637 (catalyst-api
/api/v1/sandbox/sessions), PR #1638 (real /admin/tokens/sandbox mint
in newapi), and PR #1643 (sandbox-controller calls the bridge):
1. core/controllers/sandbox/internal/newapi/integration_test.go (new)
End-to-end test wiring the REAL newapi.Client through net/http
against a httptest.Server that mirrors the bridge handler contract
(platform/newapi/internal/handler/sandbox_token.go) verbatim:
- happy-path: 200 with {token, expires_at} → controller renders
per-Sandbox Secret + stamps lifecycle annotations on the CR
(verifies wire path: Authorization: Bearer + body fields
org_id/user_id/sandbox_id/allowed_channels)
- 401 path: wrong admin bearer → 401 JSON envelope → controller
stamps TokenMintFailed False condition, NO gitops writes,
requeues, NO lifecycle annotations stamped
- transport-unreachable: closed server URL → TokenMintFailed
(verifies the error wrapping path)
Gap filled: client_test.go covers the HTTP client in isolation,
sandbox_controller_test.go uses an in-process stub. Neither
exercises the controller-runtime reconciler against a real HTTP
transport — this file is the only place where a regression in the
client's bearer/header/path/status-code handling would surface in
the context of the reconciler's state machine.
2. platform/newapi/chart — reflector wiring fix
Default reflectorNamespaces changed from "sandbox" to
"catalyst-system,sandbox". Root cause:
clusters/_template/bootstrap-kit/19a-bp-sandbox.yaml sets
`targetNamespace: catalyst-system` (the canonical install namespace of the
bp-sandbox HelmRelease) but the chart-emitted Secret was being
mirrored into a `sandbox` namespace that does not exist on a
stock Sovereign. Result: sandbox-controller Pod's
`NEWAPI_ADMIN_SECRET` env var landed empty (secretKeyRef
`optional: true` swallowed the missing-Secret error) → controller
started in gitops-only mode, never minted tokens, silently
degraded. Operator-visible only via a startup log line.
`sandbox` is retained in the default for legacy overlays that
install the controller into a dedicated namespace + for sister
tooling (catalyst-api PATs routed through the bridge) that wants
the admin bearer locally.
Verification:
- go build ./sandbox/... clean
- go test ./sandbox/internal/newapi/... — 7 tests pass (4 unit + 3
new integration)
- go test ./sandbox/... — all sandbox packages pass
- go vet ./sandbox/... clean
- helm template platform/newapi/chart/ -s
sandbox-token-signing-key-secret.yaml renders with
`reflection-{allowed,auto}-namespaces: "catalyst-system,sandbox"`
- helm template platform/sandbox/chart/ renders Deployment env
block `NEWAPI_ADMIN_SECRET` valueFrom Secret
`newapi-bp-newapi-token-signing-key` key `ADMIN_SECRET`
(optional: true) — unchanged
No Chart.yaml bump (chart pinning is a release-driver concern).
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add a default-off Grafana dashboard ConfigMap (label `grafana_dashboard: "1"`)
that the upstream grafana/grafana sidecar (kiwigrid/k8s-sidecar) auto-discovers
across all namespaces and loads on startup. Renders only when
`.Values.grafanaDashboard.enabled=true` — zero regression for every existing
Sovereign overlay.
Panels (per products/sandbox/docs/architecture.md §7):
1. Active Sandboxes — sum(kube_customresource_sandbox_info)
2. pty-server Pods Ready % — kube_pod_status_ready{condition=true}
joined to kube_pod_labels{label_app_kubernetes_io_name="pty-server"}
3. MCP Pods Ready % — same shape, label_app_kubernetes_io_name="openova-sandbox-mcp"
4. WebSocket Connections — pty_server_websocket_connections (Gauge)
5. PVC Usage % per Sandbox — kubelet_volume_stats_used_bytes / capacity_bytes
6. Idle-Timeout Scale-Down Events / hour —
rate(sandbox_controller_idle_timeout_events_total[5m]) × 3600
7. newapi Token Mint Requests / hour —
rate(newapi_admin_token_mint_requests_total{tool="sandbox-controller"}[5m]) × 3600
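The `× 3600` in panels 6 and 7 converts a per-second rate into a per-hour figure. A small worked sketch, assuming a counter that never resets inside the window (real PromQL `rate()` also handles resets and extrapolation, which this ignores); the function name is hypothetical.

```python
def events_per_hour(counter_increase: float, window_seconds: float) -> float:
    """Scale a counter increase observed over a window to an hourly figure."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    # increase / window is the per-second rate; 3600 seconds make an hour.
    return counter_increase * 3600 / window_seconds


if __name__ == "__main__":
    # 5 events inside a 5-minute window.
    print(events_per_hour(5, 300))
```

With a `[5m]` window, each event inside the window therefore shows up as 12 events per hour on the panel.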
Pattern mirrors platform/seaweedfs/chart/.../seaweedfs-grafana-dashboard.yaml.
Per Inviolable Principle #11 (never fabricate metrics) every panel description
names the metric it depends on so panels whose emitter has not yet rolled out
across the fleet render as "No data" instead of a synthetic number.
Validated:
- `helm template` clean both modes (default-off → zero output; enabled → 1 CM)
- `helm lint` passes (1 INFO about icon — pre-existing)
- Dashboard JSON parses (json.loads), 7 panels enumerated
- No Chart.yaml bump
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wave 15 #1668 added the annotation but used default-on, which trips the
empty-render guard because the chart's resources are all gated on
.Values.enabled (default false). Flip to default-off so the smoke render
skips the chart, per design.
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* docs: session 2026-05-17/18 Wave 12-14 addendum + bootstrap-kit pin lag feedback
Append a Wave 12-14 addendum to the convergence report capturing:
- t-prov cycle log (t13 FAIL, t14 FAIL, t15 PASS, t16-t19 STUCK on stale chart, t20 in flight on 1.4.162)
- Three silent-failure traps: Wave 8 CloudPage TS error stalled UI builds 3h; Wave 13 mcp-server Dockerfile context broke sandbox-mcp builds for 3 days since #1658; Wave 14 bootstrap-kit pin lag stalled all chart propagation for 6h of provs
- Wave 12-14 PR roster (#1656/#1658/#1659/#1660/#1661/#1662/#1663/#1664/#1666/#1667) plus session total now 51 PRs
- Lesson 6: deploy-bot does NOT auto-bump the bootstrap-kit slot 13 pin; manual collector PR required per cycle
Companion memos (out-of-tree, not in this PR):
- session_2026_05_18_overnight_22prs.md gets a Wave 12-14 outcomes section
- new feedback_bootstrap_kit_pin_lag.md pins the pattern + detection one-liner
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(sandbox-chart): add no-upstream annotation (unblock Blueprint Release pipeline)
Blueprint Release CI had been failing on every push that touched
platform/sandbox/chart/* since PR #1622 because the chart declared
neither `dependencies:` nor the `catalyst.openova.io/no-upstream: "true"`
annotation. Per docs/BLUEPRINT-AUTHORING.md §11.1 every umbrella chart at
platform/<name>/chart/ MUST do one of those two.
Sandbox is Catalyst-authored (sandbox-controller built in-house), so the
no-upstream annotation is correct. Matches existing pattern in:
- platform/bp-vcluster-helmrepo/chart/Chart.yaml
- platform/cnpg-pair/chart/Chart.yaml
- platform/external-secrets-stores/chart/Chart.yaml
Without this, Blueprint Release fails → bp-catalyst-platform chart
artifact at 1.4.162 never republishes with the latest sandbox image
refs (cadc7b5 from PR #1667 auto-bump) → fresh provs keep getting
stale sandbox runtime images.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sandbox chart was un-deployable end-to-end because three CI-side gaps
(items 1-3 below; item 4 is the accompanying stop-gap pin) compounded
after PR #1658 wired the mcp-server module to depend on
core/controllers + core/services/shared via `replace` directives:
1. **mcp-server Dockerfile built against a too-narrow context**. The
workflow passed `context: products/sandbox/mcp-server` and the
Dockerfile assumed `COPY . .` could see everything it needed, but
the `replace ../../../core/controllers` line in the module's go.mod
only resolves when the build can actually reach those paths. Result:
every push after #1658 failed at `go build` with `module not found`.
Fix mirrors core/controllers/sandbox/Dockerfile (Slice-CC1 layout):
COPY the replace targets' module roots + sources, then build with
WORKDIR set to the dependent module. Static binary still produced
into a distroless/static-debian12:nonroot final stage.
2. **mcp-server workflow had no chart auto-bump step**. Even after a
green build, `runtime.mcpImage` in platform/sandbox/chart/values.yaml
stayed empty so the chart's `required` guard
(deployment.yaml line 72) refused to render. Added the same
yq-bump + bot-commit pattern build-sandbox-controller.yaml already
uses, targeting `.runtime.mcpImage` and writing a fully-qualified
`<repo>:<sha>` string (consumer reads it as one image reference,
not a {repository,tag} pair). Also widened paths-filter to include
core/controllers/** + core/services/shared/** so changes to the
replace targets re-trigger the build.
3. **pty-server workflow had no auto-bump either**. Same surgery:
yq-bump `.runtime.ptyServerImage` + commit-and-push. Context stays
narrow (pty-server has no cross-tree `replace` directives).
4. **Stop-gap pin values for runtime.{ptyServerImage,mcpImage}** so the
next chart rollout doesn't fail fast before the rebuilt workflows
land their first bumps:
- ptyServerImage → ad5163e6 (current latest pty-server)
- mcpImage → 1b0e86c (last pre-#1658 green build; the rebuilt
workflow will land the next real SHA on the next push to main).
Verified locally:
- `go build ./products/sandbox/mcp-server/...` clean (43.8 MB static
binary at /tmp/openova-sandbox-mcp; `file` confirms statically
linked ELF).
- `helm template test platform/sandbox/chart --set enabled=true …`
renders cleanly; both env vars carry the SHA-pinned image refs.
No Chart.yaml bump. Read-only clusters.
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sovereign DoD D31 — CNPG-backed apps must replicate across the
Sovereign's regions when the operator opts in. PR #1562 wired this
into bp-wordpress-tenant chart-level. This change extends the same
toggle across BOTH user-facing paths:
1. Marketplace tenant flow (sme_tenant_gitops.go)
- smeTenantTemplateData gains EnableHotStandby/PrimaryRegion/
ReplicaRegion. renderSMETenantOverlay reads them from the
catalyst-api Pod env (SOVEREIGN_ENABLE_HOT_STANDBY +
SOVEREIGN_PRIMARY_REGION + SOVEREIGN_REPLICA_REGION).
- bp-wordpress-tenant HelmRelease emits pg.activeHotStandby.*
when the trio is valid; bp-wordpress-tenant chart 0.2.0+
(PR #1562) renders the primary + replica Cluster CR pair.
- Defence-in-depth: degenerate inputs (empty/identical regions)
fall back to single-Cluster shape rather than emitting a
HelmRelease the chart's validateActiveHotStandbyRegions helper
would fail at template time.
2. Sandbox plane (sandbox.db.provision)
- Env struct + NewEnvFromOS read the same Sovereign-level trio.
- sandbox.db.provision emits a primary + replica Cluster CR pair
when hotStandbyActive() — same shape bp-cnpg-pair renders for
marketplace apps + bp-wordpress-tenant cnpg-cluster.yaml: WAL
streaming via spec.managed.services.additional annotated
service.cilium.io/global=true, nodeAffinity pinning each side
to its declared region, replica.enabled=true with externalCluster
resolving the primary through the ClusterMesh-global Service alias.
- Best-effort rollback if the replica Create fails so the operator
never sees an orphan primary.
3. Plumbing (one knob, both paths)
- catalyst chart: values.sovereign.{enableHotStandby,primaryRegion,
replicaRegion} -> sovereign-fqdn ConfigMap keys -> catalyst-api env.
- sandbox chart: cnpg.activeHotStandby.{enabled,primaryRegion,
replicaRegion} -> controller env -> per-Sandbox MCP Pod env.
- Bootstrap-kit slot 13 + slot 19a wire SOVEREIGN_ENABLE_HOT_STANDBY/
SOVEREIGN_PRIMARY_REGION/SOVEREIGN_REPLICA_REGION envsubst
placeholders to BOTH chart paths so the operator flips one knob
on the per-Sovereign overlay and gets HA across the marketplace
tenant install AND the sandbox.db plane.
Default empty/false: every Sovereign that has not opted in keeps
rendering single-Cluster CNPG (zero regression).
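The validity check on the trio (toggle, primary region, replica region) is the piece both paths share. A minimal Python sketch of that gate follows; it mirrors the described behaviour of hotStandbyActive() but is not the Go code, and the region names in the usage line are placeholders.

```python
def hot_standby_active(enabled: bool, primary_region: str, replica_region: str) -> bool:
    """True only when the toggle is on and the two regions are set and distinct."""
    if not enabled:
        return False
    if not primary_region or not replica_region:
        return False  # degenerate: a region is missing
    if primary_region == replica_region:
        return False  # degenerate: replica would land next to the primary
    return True


if __name__ == "__main__":
    # Placeholder region names for illustration.
    print(hot_standby_active(True, "region-a", "region-b"))
    print(hot_standby_active(True, "region-a", "region-a"))
```

Every False result corresponds to the single-Cluster fallback, so an incomplete overlay degrades to the default shape instead of failing at template time.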
gitlab-tenant + nextcloud-tenant charts: NOT shipped in this repo
today, so they are out of scope. When they land they can copy the
same value contract (pg.activeHotStandby.*) and the gitops writer
wiring already handles them — no chart-bump or controller change
required.
Tests
- sme_tenant_active_hot_standby_test.go: 8 cases (off, on-happy-path,
degenerate matrix incl. empty primary, empty replica, identical
regions, toggle off with regions).
- sandbox_db_hot_standby_test.go: 11 cases covering hotStandbyActive
matrix + replicaClusterName/replicationServiceName suffix rules +
full primary + replica CR shapes (nodeAffinity, switchover, managed
service, externalClusters).
- platform/wordpress-tenant/chart/tests/active-hot-standby-render.sh
still passes (5/5 gates green).
- catalyst-api SMETenant suite GREEN.
- sandbox-controller suite GREEN.
- helm template clean for sandbox chart (HA + default-off) and
catalyst chart (sovereign-fqdn-configmap + api-deployment).
Hard rules respected: READ-ONLY clusters, no Chart.yaml bump on
bp-catalyst-platform (envsubst-only wiring change in slot 13), no
host-cluster touch outside the chart-level seam.
Refs DoD D31.
Co-authored-by: hatiyildiz <269457768+hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Convergence wave 11 blocker on t16: bp-newapi HR install fails with
Error: template: bp-newapi/templates/configmap.yaml:1:4: executing
"bp-newapi/templates/configmap.yaml" at <include "bp-newapi.assertChannelAttestation" .>:
channel[0] (qwen3.6-bankdhofar): commercial-contract attestation
requires accountId
PR #1631 wired the bootstrap-kit overlay so franchised Sovereigns can
opt in to marketplace via `MARKETPLACE_ENABLED=true` — flipping
`defaultChannels.qwenBankDhofar.enabled` to true with envsubst
placeholders for the attestation:
attestation:
kind: commercial-contract
accountId: ${LLM_BANK_DHOFAR_ACCOUNT_ID:-}
contractRef: ${LLM_BANK_DHOFAR_CONTRACT_REF:-}
On a Sovereign that has not yet signed the commercial contract those
variables expand to empty strings, and the chart's
`assertChannelAttestation` helper hard-fails the helm template before
any manifest is rendered — newapi install crashes at slot 80 and the
whole bootstrap-kit reconciliation stalls.
Fix (Option A — smallest change, makes the chart actually install):
SKIP composing the qwenBankDhofar channel when
attestation.kind=commercial-contract AND either accountId or contractRef
is empty. NewAPI installs with zero default channels (operator-supplied
`.Values.channels` still compose). Once the operator overlay supplies
the attestation values the channel composes on the next reconcile.
Touches two templates that gate on the same effective channel list:
- templates/_helpers.tpl `bp-newapi.effectiveChannels` — adds a
pre-check ($qbdAttReady) that short-circuits the channel composition
block when attestation is incomplete. The downstream
`assertChannelAttestation` helper then sees an empty channel list
for the qwenBankDhofar slot and emits no error.
- templates/channel-seed-job.yaml — mirrors the same gate so the
post-install Helm hook Job + RBAC + audit ConfigMap also skip when
the channel itself was skipped (otherwise the Job would POST a row
whose ConfigMap entry was omitted from /etc/newapi/channels.yaml).
`helm template platform/newapi/chart` renders cleanly in all three
states:
- default (qbd.enabled=false) → no channel, no seed Job
- qbd.enabled=true + empty accountId/contractRef → no channel, no
seed Job (NEW: pre-1.4.10 this hard-failed)
- qbd.enabled=true + accountId + contractRef present → channel
composed normally, seed Job emitted
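The three render states reduce to one predicate. The sketch below models it in Python over a plain values dict; it illustrates the gate and is not the Helm helper. Nesting `attestation` under `defaultChannels.qwenBankDhofar` is an assumption based on the overlay snippet, and the function names are hypothetical.

```python
def attestation_ready(attestation: dict) -> bool:
    """A commercial-contract attestation needs both accountId and contractRef."""
    if attestation.get("kind") != "commercial-contract":
        return True  # other attestation kinds are outside this gate
    return bool(attestation.get("accountId")) and bool(attestation.get("contractRef"))


def compose_qbd_channel(values: dict) -> bool:
    """Decide whether the qwenBankDhofar channel (and its seed Job) is emitted."""
    qbd = values.get("defaultChannels", {}).get("qwenBankDhofar", {})
    if not qbd.get("enabled"):
        return False
    return attestation_ready(qbd.get("attestation", {}))


if __name__ == "__main__":
    unsigned = {"defaultChannels": {"qwenBankDhofar": {
        "enabled": True,
        "attestation": {"kind": "commercial-contract", "accountId": "", "contractRef": ""},
    }}}
    print(compose_qbd_channel(unsigned))
```

Using the same predicate for the ConfigMap entry and the seed Job is what keeps the two from drifting apart.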
Chart bumped 1.4.9 → 1.4.10; bootstrap-kit overlay pin bumped
1.4.6 → 1.4.10 so franchised Sovereigns immediately pick up the fix.
READ-ONLY clusters preserved. NO Chart.yaml bump on
bp-catalyst-platform.
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR #1641 shipped the `openova.io/sandbox-idle-timeout-minutes` annotation on
every pty-server StatefulSet but no controller was reading it. This closes
the loop:
pty-server (products/sandbox/pty-server/):
- session.Manager tracks lastActivity; Touch() called on session
create/stop, WS attach/detach, every WS message in/out, resize/signal.
- New GET /idle endpoint returns {lastActivityAt, activeSessions}.
- Unit tests cover the endpoint shape + Touch() bump.
sandbox-controller (core/controllers/sandbox/internal/idlescaler/):
- New IdleScaler runnable, registered with mgr.Add() in main.go.
- NeedLeaderElection=true (singleton across HA replicas).
- Every 60s lists pty-server StatefulSets by label selector
(app.kubernetes.io/component=pty-server + openova.io/managed-by=catalyst),
constrained to `sandbox-*` namespaces in code for defence-in-depth.
- For each: probes the in-cluster Service /idle endpoint, stamps the
`openova.io/sandbox-last-activity-at` annotation, and patches
spec.replicas=0 once now-lastActivity exceeds the per-SS
`openova.io/sandbox-idle-timeout-minutes` annotation (falling back to
SANDBOX_IDLE_TIMEOUT_MINUTES env, default 30).
- Probe failure with no prior annotation → skip (next tick); probe
failure WITH prior annotation → still decide on stale data so a
degraded probe path doesn't keep a forgotten Pod alive forever.
- activeSessions > 0 keeps the Pod alive regardless of idle window.
- Already-zero replicas → idempotent no-op.
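The per-StatefulSet decision can be sketched as a pure function. This is a Python illustration of the rules listed above, not the Go IdleScaler; the function name and argument shape are assumptions. `last_activity` is None when the probe failed and no prior annotation exists; when the probe failed but an annotation exists, the caller is assumed to pass the annotated timestamp with `active_sessions=0`.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def should_scale_to_zero(now: datetime,
                         last_activity: Optional[datetime],
                         active_sessions: int,
                         replicas: int,
                         timeout_minutes: int = 30) -> bool:
    """Decide whether a pty-server StatefulSet should be patched to replicas=0."""
    if replicas == 0:
        return False  # already scaled down: idempotent no-op
    if last_activity is None:
        return False  # no data at all: skip until the next tick
    if active_sessions > 0:
        return False  # live sessions keep the Pod alive regardless of idle time
    return now - last_activity > timedelta(minutes=timeout_minutes)


if __name__ == "__main__":
    now = datetime(2026, 5, 18, 12, 0, tzinfo=timezone.utc)
    print(should_scale_to_zero(now, now - timedelta(minutes=45), 0, 1))
```

The `timeout_minutes` argument stands in for the per-StatefulSet annotation with its env-derived default of 30.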
Chart RBAC:
- ClusterRole gains apps/statefulsets get/list/watch/patch. This is the
ONLY cluster-wide write on a non-CR resource. RBAC cannot filter by
label or namespace prefix, so the controller narrows it in code to
its own managed StatefulSets (label selector + `sandbox-*` namespaces).
Tests: 9 unit tests covering active-not-idle, idle-scales-zero,
active-sessions-never-scales, probe-fail-no-annotation-skips,
per-SS-annotation-override, namespace-prefix-defence, already-zero-no-op,
default-URL-builder, leader-election-singleton.
Approach: controller polls pty-server's /idle endpoint via cluster-DNS
(smaller diff than embedding a k8s client in pty-server — pty-server
keeps its ~80-line go.mod, no new RBAC inside the per-Sandbox namespace).
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wires the sandbox-controller (PR #1622) to actually mint per-Sandbox
LLM-gateway tokens via the catalyst-api bridge handler shipped in
PR #1638, replacing the Wave 1 placeholder Secret with a real
LLM_GATEWAY_TOKEN-bearing manifest pushed to the per-Org Gitea repo.
Changes:
- New newapi.Client (core/controllers/sandbox/internal/newapi/) —
thin HTTP client for POST /admin/tokens/sandbox with the bridge's
{org_id, user_id, sandbox_id, allowed_channels} body + Bearer
ADMIN_SECRET auth. Interface so tests can stub.
- Reconciler extended:
* NewAPIClient + DefaultChannels + TokenRotationLeadTime fields
* On every reconcile: decide mint-or-skip from annotation
openova.io/sandbox-token-expires-at vs. now + lead-time
* On mint: POST to bridge, stamp expires-at + rotated-at
annotations on the CR, render token bytes into a new
gitops manifest secret-newapi-token.yaml committed to the
per-Org catalyst-tenant repo at sandbox/<owner-uid>/
* Bridge failure → Failed/TokenMintFailed condition + 30s
requeue + no gitops writes (fail-loud)
* Empty DefaultChannels → NoAllowedChannels condition (fail
earlier than the bridge's 400)
- gitops.Render:
* New Inputs.NewAPIToken/NewAPITokenSecretName/NewAPITokenExpiresAt
/NewAPITokenRotatedAt fields
* New secret-newapi-token.yaml template — Secret with
stringData.LLM_GATEWAY_TOKEN + expires-at annotation +
optional kubectl.kubernetes.io/restartedAt rotation marker
so Wave 2's pty-server StatefulSet picks up rolling
restarts on token rotation
* kustomization.yaml appends the new manifest when token
present
- Chart wiring (platform/sandbox/chart):
* Deployment env: NEWAPI_BASE_URL, NEWAPI_ADMIN_SECRET
(secretKeyRef from newapi-bp-newapi-token-signing-key,
optional: true), NEWAPI_DEFAULT_CHANNELS
* ClusterRole bumped to allow update/patch on the
sandboxes/ resource (the controller now stamps annotations
on the CR)
- platform/newapi/chart/templates/sandbox-token-signing-key-secret.yaml:
* Added emberstack/reflector annotations so the chart-emitted
Secret (newapi namespace) mirrors into the sandbox-controller
namespace by default; reflectorNamespaces is overrideable.
Tests:
- newapi client: happy-path round-trip, 401 surfaces, input
validation, request validation. 4 cases.
- sandbox-controller: existing Wave 1 cases (happy/idempotent/
drift/missing) still pass; 5 new cases for the token path:
fresh mint + Secret render, rotation on near-expiry, steady-
state no-mint, bridge failure surfaces condition, no-channels
misconfig fails early. 9 cases total, all green.
Hard rules honored:
- No Chart.yaml bump (chart pinning is a release-driver concern)
- go build + go test ./core/controllers/sandbox/... clean
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wave 8 extension to PR #1622 (Wave-1 sandbox-controller). The previous
slice reconciled a Sandbox CR into namespace + ResourceQuota + RBAC +
PVCs + placeholder Secret, but NO pty-server and NO MCP server. A
freshly created Sandbox sat there with empty plumbing and no way for
the user to actually run a coding session.
This PR completes the per-Sandbox runtime by extending
core/controllers/sandbox/internal/gitops/manifests.go to render the
four manifests architecture.md §7 enumerates:
- StatefulSet pty-server (replicas = spec.quota.concurrentSessions,
one Pod per in-flight session per architecture.md §1/§2). Env wired
per newapi-proxy-contract.md §1: SANDBOX_OWNER_UID, ORG_ID,
SOVEREIGN_FQDN, NEWAPI_URL, LLM_GATEWAY_URL / OPENAI_BASE_URL,
LLM_GATEWAY_TOKEN / OPENAI_API_KEY from per-sandbox Secret
(key llm-gateway-token, optional). When claude-code is in
spec.agentCatalogue, ANTHROPIC_API_KEY is ALSO wired from the
per-user BYOS Secret `sandbox-byos-claude-code-<owner-uid>` (key
access_token, optional) per claude-code-byos.md §3. Repo PVCs mount
at /workspace/<repo-slug>.
- Deployment openova-sandbox-mcp (architecture.md §3). Companion MCP
server, talks to pty-server via the in-namespace ClusterIP Service.
- Service pty-server (ClusterIP :7681) — backend for both the MCP
Deployment and the HTTPRoute.
- HTTPRoute pty-server — publishes
sandbox.<sov-fqdn>/sessions/<owner-uid>/* → pty-server :7681 via
the existing catalyst-public Cilium Gateway in catalyst-system.
PathPrefix rewrite strips /sessions/<owner-uid> so pty-server sees
its own /sessions/<id> surface.
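To illustrate the effect of the PathPrefix rewrite on the HTTPRoute, the
sketch below models it as a plain string function. It only mimics the
matching and stripping behaviour; the real rewrite happens in the
Gateway, and the function name RewriteSessionPath is hypothetical.

```go
package main

import "strings"

// RewriteSessionPath models the HTTPRoute rule: match the
// /sessions/<owner-uid> prefix on whole path segments and strip it, so the
// backend sees its own /sessions/<id> surface. The second return value is
// false when the path does not belong to this owner's route.
func RewriteSessionPath(ownerUID, path string) (string, bool) {
	prefix := "/sessions/" + ownerUID
	// Prefix matching is per path segment: /sessions/u1 must not match
	// /sessions/u12.
	if path != prefix && !strings.HasPrefix(path, prefix+"/") {
		return "", false
	}
	rest := strings.TrimPrefix(path, prefix)
	if rest == "" {
		rest = "/"
	}
	return rest, true
}
```

For example, a request to /sessions/u1/sessions/42 for owner u1 would
reach pty-server as /sessions/42.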
Knobs are env-plumbed from the chart per Inviolable Principle #4:
- SANDBOX_PTY_SERVER_IMAGE / SANDBOX_MCP_IMAGE — SHA-pinned image
refs from values.runtime.{ptyServerImage,mcpImage} (fails Helm
render fast on empty, no silent :latest).
- SANDBOX_NEWAPI_URL — from values.runtime.newapiURL (bootstrap-kit
overlay derives it from ${SOVEREIGN_FQDN}).
- SANDBOX_LLM_GATEWAY_TOKEN_SECRET / SANDBOX_BYOS_SECRET_PREFIX /
SANDBOX_IDLE_TIMEOUT_MINUTES — optional with architecture-doc
defaults.
Idle timeout (architecture.md §7) lands as a StatefulSet annotation
openova.io/sandbox-idle-timeout-minutes — the poll-loop that actually
scales the StatefulSet down on idle ships in a sibling PR (out of
scope for "spawn the Pods"; this PR makes the Pods exist).
Tests cover the full Wave-8 manifest shape: replicas count, identity
env keys, BYOS gating on spec.agentCatalogue, HTTPRoute hostname
binding, kustomization stitching, idempotency. go test
./core/controllers/sandbox/... green; helm template renders cleanly +
required guard fires on missing runtime values.
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces the Wave 1b stub that echoed the inbound PAT verbatim with a
real HS256 mint flow the sandbox-controller can call when it rolls out
a fresh Sandbox Pod.
Handler (platform/newapi/internal/handler/sandbox_token.go):
- Caller auth: shared admin-secret bearer (env NEWAPI_ADMIN_SECRET),
constant-time compared. 401 on mismatch / missing bearer.
- Request body: {org_id, user_id, sandbox_id, allowed_channels[]}.
De-duplicates + scrubs empty channel names so a controller bug
sending [""] can't mint a token that NewAPI silently treats as
"no restriction".
- Mints HS256 JWT signed with NEWAPI_TOKEN_SIGNING_KEY. Claim shape:
{sub: sandbox_id, org: org_id, user: user_id, channels: [...],
iat, exp: iat+7d, typ: "sandbox"}.
- Returns {token, expires_at}.
- Refuses with 503 when SigningKey or AdminSecret is unset, so a
chart-wiring gap shows up as a visible failure instead of tokens
signed with an empty (forgeable) key.
- Removes the previous Claims/jwt.Parse PAT-validation path that
came with the stub — caller is the controller, not an operator.
- NewHandlerFromEnv() factory loads + validates env at process
start so catalyst-api can fail loudly instead of shipping the
endpoint silently.
Unit tests (sandbox_token_test.go) — 11 cases:
- happy path (mint + claim shape + signature round-trip)
- de-dup + empty-channel scrub
- admin-secret mismatch / missing bearer → 401
- missing org_id / user_id / sandbox_id / empty channels → 400
- non-POST → 405
- unset env → 503
- mintSandboxToken empty-secret guard + round-trip
- response does not echo admin secret or signing key
Chart wiring (platform/newapi/chart):
- New Secret template sandbox-token-signing-key-secret.yaml
auto-renders with Helm `lookup` + helm.sh/resource-policy: keep
(same load-bearing pattern as credentials-secret.yaml #943 and
gitea admin-secret.yaml #830 Bug 2). 64-char alphanumeric values
for both SIGNING_KEY and ADMIN_SECRET; persistence across
reconciles is required because a reconcile-time rotation would
silently invalidate every per-Sandbox token across the Sovereign
AND break the sandbox-controller's auth path until its Pod
restarts.
- values.yaml block sandboxTokenSigningKey.{existingSecret,
autoProvision, autoSecretName} matching the `credentials`
convention (operator override > auto-provision > skip-render).
- No Chart.yaml bump — chart value addition only.
Verification:
- go build ./platform/newapi/internal/handler/... — clean
- go test ./platform/newapi/internal/handler/... — 11/11 PASS
- helm template platform/newapi/chart — Secret renders
How sandbox-controller will use it:
1. Read NEWAPI_ADMIN_SECRET from mounted Secret newapi-token-signing-key.
2. POST /admin/tokens/sandbox with bearer + body
{org_id: <Sandbox.spec.owner.orgRef.slug>,
user_id: <Sandbox.spec.owner.email>,
sandbox_id: <Sandbox.metadata.uid>,
allowed_channels: ["qwen3.6-bankdhofar"]}.
3. Write returned token into Secret/sandbox-<uid>-newapi-token.
4. Mount that Secret into the Sandbox Pod as LLM_GATEWAY_TOKEN.
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>