Commit Graph

385 Commits

Author SHA1 Message Date
e3mrah
f686e30823
fix(cutover): mint Gitea API token + populate provisioning-github-token at handover (#1704)
The catalyst-platform chart's templates/sme-services/provisioning-github-token.yaml
mirrors gitea-admin-secret.password verbatim into
sme/provisioning-github-token.GITHUB_TOKEN. The SME provisioning service
then sends `Authorization: token <PWD>` to Gitea. Gitea resolves the
Bearer/token credential as an API access token (sha1 lookup); because the
admin password is not an access token, Gitea returns 401 "user does not
exist [uid: 0, name: ]".
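
The difference between the two credential forms can be sketched in Go. This is a minimal illustration, not the provisioning service's code: the function names are hypothetical, and only the `token <value>` and HTTP Basic header shapes come from the description above.

```go
package main

import (
	"encoding/base64"
	"fmt"
)

// tokenAuthHeader builds the header value used for a Gitea API access token.
// Passing a password here is the failure mode described above.
func tokenAuthHeader(accessToken string) string {
	return "token " + accessToken
}

// basicAuthHeader builds a standard HTTP Basic header value, which is the
// form that does accept a username and password.
func basicAuthHeader(user, password string) string {
	creds := base64.StdEncoding.EncodeToString([]byte(user + ":" + password))
	return "Basic " + creds
}

func main() {
	// Placeholder credentials, not real values.
	fmt.Println(tokenAuthHeader("example-token"))
	fmt.Println(basicAuthHeader("gitea_admin", "example-password"))
}
```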

End result on t22: voucher checkout returns 200, /jobs redirect fires,
but no Organization CR is ever created (every Gitea API call from
provisioning 401s). Journey step 16 stalls indefinitely.

Verified on t22 (2026-05-18):
  - sme/provisioning-github-token.GITHUB_TOKEN.last8 == gitea-admin-secret.password.last8 == ChxCejmH
  - curl -H "Authorization: token <pwd>" /api/v1/user → 401 user does not exist
  - curl -u gitea_admin:<pwd> /api/v1/user → 200 OK (Basic works, token doesn't)
  - 0 organizations.orgs.openova.io cluster-wide

Fix: new cutover step 09 (gitea-token-mint) runs alongside the existing
01..08 chain at handover. The step:

  1. DELETEs any stale catalyst-platform-bootstrap token (idempotent —
     404 swallowed on first run).
  2. POSTs /api/v1/users/gitea_admin/tokens with scope "all".
  3. Captures the returned .sha1 (raw token bytes appear there exactly
     once — Gitea hashes server-side after creation).
  4. Validates by calling GET /api/v1/user with `Authorization: token <X>`
     and asserts 200 + non-empty login field.
  5. kubectl-patches Secret sme/provisioning-github-token.GITHUB_TOKEN
     to the new token via strategic-merge stringData (kubectl base64s).
  6. Rolls the provisioning Deployment so the new token takes effect
     immediately (best-effort — skipped if marketplace disabled).
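
Step 3 above (capturing `.sha1` from the token-creation response) might look like the following Go sketch. The real step is a shell script per the commit description; `extractToken` and the sample payload are hypothetical, and only the `sha1` field name is taken from the text.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// tokenResponse models only the fields this sketch needs from the
// token-creation response body.
type tokenResponse struct {
	Name string `json:"name"`
	SHA1 string `json:"sha1"`
}

// extractToken returns the raw token from the response body. The value is
// only available in this one response, so a missing field is an error.
func extractToken(body []byte) (string, error) {
	var resp tokenResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return "", fmt.Errorf("decode token response: %w", err)
	}
	if resp.SHA1 == "" {
		return "", errors.New("token response has no sha1 field")
	}
	return resp.SHA1, nil
}

func main() {
	// Illustrative payload with a placeholder token value.
	body := []byte(`{"id":1,"name":"catalyst-platform-bootstrap","sha1":"example"}`)
	tok, err := extractToken(body)
	fmt.Println(len(tok), err)
}
```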

Order=9 (last) is functionally fine — none of steps 02-08 read the
provisioning-github-token Secret, and the SME provisioning service first
consumes the token at voucher checkout time (always postdates cutover).
Using slot 9 rather than 1b avoids renumbering 01..08, which would
invalidate operator history in the cutover-status ConfigMap audit trail.

Token credentials never appear in process argv (passed via stdin / env
to kubectl), and validate-failure paths sed-redact the new token from
stderr before surfacing the response body.
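
The redaction idea can be sketched in Go as a stand-in for the `sed` call the step uses. `redact` and the `[REDACTED]` marker are hypothetical names chosen for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// redact replaces every occurrence of secret in text with a fixed marker.
// An empty secret is returned unchanged: strings.ReplaceAll with an empty
// search string would otherwise insert the marker between every character.
func redact(text, secret string) string {
	if secret == "" {
		return text
	}
	return strings.ReplaceAll(text, secret, "[REDACTED]")
}

func main() {
	body := "validation failed for token example-token: 401"
	fmt.Println(redact(body, "example-token"))
}
```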

Contract-test guard added (Case 19): step ConfigMap rendered with
order=9, the POST /api/v1/users/.../tokens call present, sha1 capture
present, Authorization: token validation present, kubectl patch present.
Existing step-count gates updated 8 → 9 and 7 job-mode → 8.

chart bp-self-sovereign-cutover: 0.1.29 → 0.1.30

Refs TBD-C18

Co-authored-by: hatiyildiz <hati.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 17:30:56 +04:00
github-actions[bot]
6e741e42f7 chore(deploy): bump openova-flow-server image to fab091f [skip ci] 2026-05-18 13:23:07 +00:00
github-actions[bot]
33903a118b deploy: bump bp-guacamole upstream 1.5.5 chart 0.1.24 2026-05-18 13:10:06 +00:00
e3mrah
a632ed50e2
fix(guacamole): readinessProbe path /guacamole/ matches webapp deploy root (Refs TBD-G4) (#1699)
The Apache Guacamole webapp deploys under Tomcat's context path
`/guacamole/` (the WAR is `guacamole.war` so Tomcat exposes it at
`/<warname>/`). Tomcat's ROOT context at `/` returns 404. Probing
`/` previously caused both liveness AND readiness probes to fail
with HTTP 404 → kubelet restarted the Pod every ~60s → kube-system
Cilium gateway returned HTTP 503 to `https://guacamole.<sov>/`
because no Endpoint was ever Ready (observed on t22, 5 restarts in
8m of uptime).

Probing `/guacamole/` matches the actual servlet context the
webapp registers at boot.
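
As a rough illustration of the naming rule described above, the probe path can be derived from the WAR file name. This is a simplified sketch: `probePath` is a hypothetical helper, and it only covers the plain `<name>.war` and `ROOT.war` cases.

```go
package main

import (
	"fmt"
	"strings"
)

// probePath maps a WAR file name to the path a probe should request.
// ROOT.war is the special case Tomcat serves at "/".
func probePath(warFile string) string {
	name := strings.TrimSuffix(warFile, ".war")
	if name == "ROOT" {
		return "/"
	}
	return "/" + name + "/"
}

func main() {
	fmt.Println(probePath("guacamole.war"))
}
```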

Chart bump 0.1.22 -> 0.1.23. Bootstrap-kit pin follow-up in a
separate PR (pattern matches #1693 + #1694).

Co-authored-by: hatiyildiz <hati.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 17:08:54 +04:00
github-actions[bot]
9966e5fcad deploy: bump bp-guacamole upstream 1.5.5 chart 0.1.22 2026-05-18 12:38:54 +00:00
e3mrah
c1a364b631
fix(httproutes): retarget guacamole-server + openova-flow-server to cilium-gateway in kube-system (Refs TBD-G6, C12-004) (#1692)
On t22 (omantel.biz fresh Sovereign) 2 of 15 HTTPRoutes went
Accepted=False because their parentRef pointed at a gateway that
does not exist on any Sovereign:

  catalyst-system/guacamole-server     -> gateway-system/cilium-gateway
  catalyst-system/openova-flow-server  -> kube-system/catalyst-gateway

The canonical Sovereign Gateway is kube-system/cilium-gateway,
installed by bootstrap-kit/01-cilium.yaml and used by every other
HTTPRoute (catalyst-api, catalyst-ui, marketplace, gitea, harbor,
keycloak, grafana, hubble-ui, openbao, powerdns, tenant-wildcard).
gateway-system does not exist; catalyst-gateway does not exist.

Fixes:

  - platform/guacamole/chart/values.yaml — default
    guacamole.httproute.parentRef.namespace: gateway-system -> kube-system

  - clusters/_template/bootstrap-kit/56-bp-openova-flow-server.yaml —
    flowServer.httproute.gatewayRef.name: catalyst-gateway -> cilium-gateway
    (namespace already kube-system, untouched)

Verified on t22: all 15 HTTPRoutes now Accepted=True after chart bump
+ Flux reconcile.
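
A check for this class of error could be sketched as follows. The types and `danglingRoutes` are hypothetical and much simpler than the Gateway API objects; the route and gateway names are the ones listed above.

```go
package main

import "fmt"

// gatewayRef identifies a Gateway by namespace and name.
type gatewayRef struct {
	Namespace string
	Name      string
}

// httpRoute is a reduced view of an HTTPRoute with a single parentRef.
type httpRoute struct {
	Namespace string
	Name      string
	Parent    gatewayRef
}

// danglingRoutes returns "namespace/name" for every route whose parentRef
// does not point at a known Gateway.
func danglingRoutes(routes []httpRoute, gateways map[gatewayRef]bool) []string {
	var out []string
	for _, r := range routes {
		if !gateways[r.Parent] {
			out = append(out, r.Namespace+"/"+r.Name)
		}
	}
	return out
}

func main() {
	gateways := map[gatewayRef]bool{
		{Namespace: "kube-system", Name: "cilium-gateway"}: true,
	}
	routes := []httpRoute{
		{"catalyst-system", "guacamole-server", gatewayRef{"gateway-system", "cilium-gateway"}},
		{"catalyst-system", "openova-flow-server", gatewayRef{"kube-system", "catalyst-gateway"}},
		{"catalyst-system", "catalyst-api", gatewayRef{"kube-system", "cilium-gateway"}},
	}
	fmt.Println(danglingRoutes(routes, gateways))
}
```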

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 16:38:17 +04:00
github-actions[bot]
5407102d6b deploy: bump sandbox-controller image to 8017700 2026-05-18 12:33:11 +00:00
github-actions[bot]
5ee02f7d36 deploy: bump sandbox-mcp-server image to 8017700 2026-05-18 12:31:53 +00:00
e3mrah
8017700ad4
feat(sandbox): tier-bound MCP capabilities (Free/Pro/Ent plans gate tool access) (#1690)
Stop handing every Sandbox session the full MCP surface. Each per-Sandbox
NewAPI token now carries a plan-derived capability allowlist that the MCP
server enforces against per-tool RequiredCapability via Claims.HasCapability:

  - Free: read-only k8s + gitea read + session/rag/skills
  - Pro:  + sandbox.db.* + sandbox.storage.* + sandbox.preview.* +
          sandbox.auth.* + sandbox.secrets.* + marketplace.* + flux.status
  - Ent:  + sandbox.deploy.{staging,production,...} + sandbox.stripe.* +
          flux.{reconcile,suspend,resume} + gitea.pr.{create,merge} +
          gitea.issue.*

Wiring:
  - Sandbox CRD spec gains planId + capabilities[] (operator overlay).
  - Sandbox sandboxapi.{CapabilitiesForPlan,ResolveCapabilities} is the
    SoT; tenant orchestrator carries an exact-mirror capabilitiesForPlan
    (no controllers-module dep — same isolation pattern quotaForPlan
    uses).
  - sandbox-controller threads spec.capabilities (falling back to plan)
    into newapi.MintRequest.
  - catalyst-api bridge handler accepts capabilities[] on the wire and
    encodes it as the JWT `capabilities` claim (omitted when empty).
  - Claims.HasCapability gains wildcard prefix matching (`sandbox.db.*`
    satisfies `sandbox.db.provision`, `sandbox.db`, etc.) so plan grants
    stay coarse. Plain stem matches WITHOUT a wildcard are intentionally
    rejected — the production second-gate in sandbox_deploy.go stays
    honest.
  - MCP registry: every gated tool now carries its granular dotted
    RequiredCapability (`sandbox.db.provision`, `gitea.pr.list`, …).
    Read-only / session tools previously ungated also get granular
    grants so Free tokens can browse without inheriting the write
    surface.
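
The wildcard rule described for Claims.HasCapability can be sketched as below. This is an illustration of the stated behaviour, not the repository's implementation; `hasCapability` is a hypothetical free function.

```go
package main

import (
	"fmt"
	"strings"
)

// hasCapability reports whether any grant satisfies the required capability.
// A grant matches when it is identical to the requirement, or when it ends
// in ".*" and the requirement is the stem itself or lies below the stem.
// A plain stem without ".*" never matches a longer requirement.
func hasCapability(grants []string, required string) bool {
	for _, g := range grants {
		if g == required {
			return true
		}
		if strings.HasSuffix(g, ".*") {
			stem := strings.TrimSuffix(g, ".*")
			if required == stem || strings.HasPrefix(required, stem+".") {
				return true
			}
		}
	}
	return false
}

func main() {
	grants := []string{"sandbox.db.*", "flux.status"}
	fmt.Println(hasCapability(grants, "sandbox.db.provision"))
	fmt.Println(hasCapability(grants, "sandbox.deploy.production"))
}
```

Requiring the dot after the stem keeps `sandbox.db.*` from matching an unrelated name such as `sandbox.dbx.provision`.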

No Chart.yaml bump: CRD additions are additive, so existing Sandbox CRs
parse fine. An empty capabilities list on a token downgrades it to
introspection only, matching pre-PR-#1671 callers.

Tests: shared/auth/claims_test.go (wildcard matrix),
sandboxapi/capabilities_test.go (plan ladder + spec override),
sandbox_token_test.go (capabilities round-trip + omit-on-empty),
sandbox_controller_test.go (plan-derived + spec-override mint),
sandbox_consumer_test.go (orchestrator stamps spec.capabilities), plus
updates to every per-namespace registry test asserting new granular
RequiredCapability values.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 16:30:00 +04:00
github-actions[bot]
3349690728 chore(deploy): bump openova-flow-adapter-flux image to 00eeff2 [skip ci] 2026-05-18 12:17:03 +00:00
github-actions[bot]
4cf670b6df deploy: bump sandbox-mcp-server image to ffb79aa 2026-05-18 10:55:22 +00:00
github-actions[bot]
5e06bf843a deploy: bump bp-newapi upstream v0.13.2 chart 1.4.14 2026-05-18 10:54:28 +00:00
github-actions[bot]
645d5282e2 deploy: bump bp-guacamole upstream 1.5.5 chart 0.1.21 2026-05-18 10:54:17 +00:00
e3mrah
798220413d
Merge pull request #1685 from openova-io/fix-t20-newapi-oidc-secret-materialization
fix(infra): Crossplane provider-hcloud package URL (xpkg.upbound.io → xpkg.crossplane.io after 2025 contrib migration)
2026-05-18 14:53:54 +04:00
Emrah Baysal
a01df29993 fix(newapi): create newapi-oidc Secret via Helm lookup from keycloak (was dangling reference)
Before this fix, the bootstrap-kit overlay at
clusters/_template/bootstrap-kit/80-newapi.yaml:171 set
`auth.adminUI.keycloak.existingSecret: newapi-oidc`, but nothing in the
chart or in the operator overlay materialised that Secret. The Pod
stayed in `CreateContainerConfigError: secret "newapi-oidc" not
found`, blocking the entire bp-newapi HR from reaching Ready (t20
debug matrix Fix #6).

New template templates/keycloak-client-secret.yaml uses Helm `lookup`
to retrieve the existing Secret bytes on every reconcile (idempotent —
preserves the OIDC client secret across upgrades), falling back to
`randAlphaNum 32` on first install. Mirrors the existing canonical
seam in the same chart (templates/credentials-secret.yaml issue #943,
templates/sandbox-token-signing-key-secret.yaml PR #1638) — both use
the same lookup-or-generate pattern with helm.sh/resource-policy: keep.
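
The lookup-or-generate pattern itself lives in a Helm template; the logic can be sketched in Go as follows. The map stands in for the Secret data that Helm `lookup` would return, and `lookupOrGenerate`, `randAlphaNum`, and the `client-secret` key are hypothetical names.

```go
package main

import (
	"crypto/rand"
	"fmt"
	"math/big"
)

const alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789"

// randAlphaNum returns n random alphanumeric characters drawn uniformly
// from alphabet using the operating system's secure random source.
func randAlphaNum(n int) (string, error) {
	out := make([]byte, n)
	limit := big.NewInt(int64(len(alphabet)))
	for i := range out {
		idx, err := rand.Int(rand.Reader, limit)
		if err != nil {
			return "", err
		}
		out[i] = alphabet[idx.Int64()]
	}
	return string(out), nil
}

// lookupOrGenerate keeps an existing non-empty value so upgrades do not
// rotate the secret, and generates a fresh one only on first install.
func lookupOrGenerate(existing map[string]string, key string, n int) (string, error) {
	if v, ok := existing[key]; ok && v != "" {
		return v, nil
	}
	return randAlphaNum(n)
}

func main() {
	secret, err := lookupOrGenerate(map[string]string{}, "client-secret", 32)
	fmt.Println(len(secret), err)
}
```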

The sister chart platform/guacamole/chart/templates/keycloak-client-
secret.yaml uses a SealedSecret placeholder + a bootstrap Job hook;
THIS chart already standardised on the simpler `randAlphaNum + Helm
lookup` pattern, so this Secret follows the same seam to keep the
chart's credential-materialisation strategy consistent.

Gated on `auth.adminUI.mode=keycloak` AND non-empty
`auth.adminUI.keycloak.existingSecret`, so non-keycloak installs
render nothing extra.

No Chart.yaml bump — pure template addition.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 12:53:40 +02:00
e3mrah
7bfb65402e
fix(guacamole): mount /home/guacamole instead of /home/guacamole/.guacamole (entrypoint rm fails on mount point) (#1684)
The official Apache Guacamole image entrypoint runs `rm -rf
$GUACAMOLE_HOME` (== `/home/guacamole/.guacamole`) before re-populating
the directory on every start. When the chart mounted an emptyDir
directly at `/home/guacamole/.guacamole`, that path was a mount point
from the kernel's perspective, so `rm` failed with:

    rm: cannot remove '/home/guacamole/.guacamole':
        Read-only file system

— the entrypoint exited non-zero and the Pod CrashLoopBackOff'd before
the webapp ever started. (t20 debug matrix — Fix #5.)

Mount the PARENT directory (`/home/guacamole`) instead. `.guacamole`
becomes a regular subdirectory inside the emptyDir, which the
entrypoint can freely `rm -rf` and recreate. The webapp's first-start
writes still land in a writable location under readOnlyRootFilesystem.

No Chart.yaml version bump per the t20 hard-rules contract — chart
release will roll in the next blueprint-release wave.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 14:53:33 +04:00
e3mrah
b01281a70c
fix(self-sovereign-cutover): harborPublicURL → registry.<sov> (was harbor.<sov> — chicken-and-egg unblock) (#1681)
Per t20 debug matrix:

* `bp-self-sovereign-cutover` step-06 phase-1 rewrites every HelmRepository
  URL from `oci://ghcr.io/openova-io` to `oci://${harbor_host}/openova-io`,
  where `harbor_host` is derived from `sovereign.harborPublicURL`.
* Pre-fix: `harborPublicURL: https://harbor.${SOVEREIGN_FQDN}`.
* But the bp-harbor HTTPRoute publishes at `registry.${SOVEREIGN_FQDN}` —
  see `clusters/_template/bootstrap-kit/19-harbor.yaml` line 167
  (`gateway.host: registry.\${SOVEREIGN_FQDN}`). No HTTPRoute matches
  `harbor.<sov>`, so post-pivot every OCI chart pull EOFs.
* Effect: bp-sandbox HR never Ready → bootstrap-kit Kustomization stuck
  waiting on bp-sandbox health → t20 convergence blocks indefinitely.
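
The host derivation and URL rewrite described above can be sketched in Go. The real step is part of the cutover script; `rewriteRepoURL` is a hypothetical helper that only shows how the registry host follows from `harborPublicURL`.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// rewriteRepoURL swaps the ghcr.io host of an OCI repository URL for the
// host taken from harborPublicURL. URLs that do not point at ghcr.io are
// returned unchanged.
func rewriteRepoURL(repoURL, harborPublicURL string) (string, error) {
	u, err := url.Parse(harborPublicURL)
	if err != nil {
		return "", fmt.Errorf("parse harborPublicURL: %w", err)
	}
	if u.Host == "" {
		return "", fmt.Errorf("harborPublicURL %q has no host", harborPublicURL)
	}
	const ghcrPrefix = "oci://ghcr.io/"
	if !strings.HasPrefix(repoURL, ghcrPrefix) {
		return repoURL, nil
	}
	return "oci://" + u.Host + "/" + strings.TrimPrefix(repoURL, ghcrPrefix), nil
}

func main() {
	out, err := rewriteRepoURL("oci://ghcr.io/openova-io", "https://registry.example.local")
	fmt.Println(out, err)
}
```

Because the host comes straight from the configured URL, a wrong `harborPublicURL` silently produces repository URLs that no route serves, which is the failure this commit fixes.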

Fix (chart-level, no Chart.yaml bump for bp-catalyst-platform):

* `clusters/_template/bootstrap-kit/06a-bp-self-sovereign-cutover.yaml`
  overlay value flipped `harbor.${SOVEREIGN_FQDN}` → `registry.${SOVEREIGN_FQDN}`.
* `platform/self-sovereign-cutover/chart/values.yaml` default placeholder
  flipped `harbor.example.local` → `registry.example.local` so smoke
  renders + docs line up.
* README + smoke command updated.

Smoke tests:

* `helm template smoke platform/self-sovereign-cutover/chart` — clean,
  1851 lines, `HARBOR_PUBLIC_URL=https://registry.example.local`.
* `helm template smoke ... --set sovereign.harborPublicURL=https://registry.otechN.omani.works`
  — clean, all step env vars carry the new host.
* `kubectl kustomize clusters/_template/bootstrap-kit/` — clean, 2926 lines,
  overlay shows `harborPublicURL: https://registry.${SOVEREIGN_FQDN}`.
* `bash platform/self-sovereign-cutover/chart/tests/cutover-contract.sh`
  — all gates green (Phase-0 ghcr-pull auth merge still works because
  `harbor_host` is derived from `HARBOR_PUBLIC_URL` env at runtime, so
  the script now correctly merges auth for `registry.<sov-fqdn>` instead
  of `harbor.<sov-fqdn>`).

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 14:51:29 +04:00
github-actions[bot]
42410c6e75 deploy: bump sandbox-controller image to 042b444 2026-05-18 10:38:03 +00:00
github-actions[bot]
35b9c77923 deploy: bump sandbox-mcp-server image to 042b444 2026-05-18 10:36:43 +00:00
github-actions[bot]
1595f3a867 deploy: bump sandbox-pty-server image to 042b444 2026-05-18 10:36:20 +00:00
github-actions[bot]
c9506020c3 deploy: bump bp-newapi upstream v0.13.2 chart 1.4.13 2026-05-18 10:35:13 +00:00
e3mrah
042b444c5c
feat(sandbox): Prometheus emitters for Wave 14 Grafana panels (#1674) (#1679)
PR #1674 shipped the Sandbox Runtime Grafana dashboard with three
panels whose metrics did not yet exist anywhere in the fleet:

  - "WebSocket Connections" → pty_server_websocket_connections (Gauge)
  - "Idle-Timeout Scale-Down Events / hour" →
        sandbox_controller_idle_timeout_events_total (Counter)
  - "newapi Token Mint Requests / hour" →
        newapi_admin_token_mint_requests_total{tool,status} (Counter)

Per Inviolable Principle #11 the panels render "No data" until the
emitter sides roll out. This PR closes that loop.

pty-server (products/sandbox/pty-server)
  - New metrics.go: registers Gauge pty_server_websocket_connections
    via promauto + exposes promhttp.Handler.
  - routes.go: serves /metrics on GET; Inc/Dec the gauge around every
    successful WS upgrade in attach() and cards() (Defer Dec so abnormal
    returns still decrement).
  - go.mod: + github.com/prometheus/client_golang v1.19.1 (matches the
    version core/controllers already pulls in transitively).
  - New unit test asserts GET /metrics carries the gauge name.
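
The Inc / deferred Dec pattern around a connection can be sketched with the standard library alone. Here an `atomic.Int64` stands in for the Prometheus gauge, and `serve` is a hypothetical wrapper rather than the pty-server's handler.

```go
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
)

// wsConnections stands in for the pty_server_websocket_connections gauge.
var wsConnections atomic.Int64

// serve counts a connection for the lifetime of handle. The deferred
// decrement runs on normal returns, error returns, and panics alike.
func serve(handle func() error) error {
	wsConnections.Add(1)
	defer wsConnections.Add(-1)
	return handle()
}

func main() {
	err := serve(func() error {
		fmt.Println("active during handler:", wsConnections.Load())
		return errors.New("connection closed abnormally")
	})
	fmt.Println("error:", err)
	fmt.Println("active after handler:", wsConnections.Load())
}
```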

sandbox-controller (core/controllers/sandbox/internal/idlescaler)
  - New metrics.go: Counter sandbox_controller_idle_timeout_events_total
    with label {namespace} registered on controller-runtime's shared
    registry (so the manager's existing :8080 /metrics endpoint surfaces
    it — no new listener).
  - idlescaler.go: bumps the counter inside scaleToZero() so every Pod
    scaled to 0 ticks once. Namespace label matches the dashboard panel's
    `sum by (namespace) (rate(...))` aggregation.
  - New unit test verifies the counter delta is 1 on a successful
    scale-to-zero pass.

newapi bridge handler (platform/newapi/internal/handler)
  - New metrics.go: CounterVec newapi_admin_token_mint_requests_total
    with labels {tool, status}; helper classifyStatus() maps HTTP codes
    to a finite cardinality of 7 status values (ok / unauthorized /
    bad_request / unavailable / server_error / method_not_allowed /
    other). Exported MetricsHandler() so the catalyst-api wiring code
    can mount /metrics on the same listener as the bridge.
  - sandbox_token.go: recordMint(r, status) at every return path so the
    counter ticks regardless of which branch the request hits.
  - go.mod: + github.com/prometheus/client_golang v1.19.1.
  - 5 new test cases assert counter delta == 1 for the documented
    status transitions and that the X-Catalyst-Tool header surfaces
    as the `tool` label.
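
A bounded status classifier in the spirit of classifyStatus() might look like this. The seven label values come from the list above; the exact code-to-label mapping is an assumption, since the commit message does not spell it out.

```go
package main

import (
	"fmt"
	"net/http"
)

// classifyStatus maps an HTTP status code to one of seven label values so
// the counter's cardinality stays fixed. The mapping is an assumed one.
func classifyStatus(code int) string {
	switch {
	case code >= 200 && code < 300:
		return "ok"
	case code == http.StatusBadRequest:
		return "bad_request"
	case code == http.StatusUnauthorized:
		return "unauthorized"
	case code == http.StatusMethodNotAllowed:
		return "method_not_allowed"
	case code == http.StatusServiceUnavailable:
		return "unavailable"
	case code >= 500 && code < 600:
		return "server_error"
	default:
		return "other"
	}
}

func main() {
	for _, code := range []int{200, 401, 405, 503, 500, 418} {
		fmt.Println(code, classifyStatus(code))
	}
}
```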

sandbox-controller → newapi client (core/controllers/sandbox/internal/newapi)
  - client.go: stamp `X-Catalyst-Tool: sandbox-controller` on every
    outbound POST /admin/tokens/sandbox so the bridge counter's `tool`
    label has the canonical value the dashboard panel filters on.

Helm charts
  - platform/sandbox/chart/templates/service.yaml (new): ClusterIP
    Service exposing the controller's :8080 metrics port. Required so a
    ServiceMonitor selector has something to attach to.
  - platform/sandbox/chart/templates/servicemonitor.yaml (new):
    monitoring.coreos.com/v1 ServiceMonitor scoped to the metrics
    Service. Default-off + double-guarded with
    `.Capabilities.APIVersions.Has "monitoring.coreos.com/v1"` (matches
    platform/harbor/chart/templates/servicemonitor.yaml pattern). values
    block + per-Sovereign overrides (interval / scrapeTimeout / path /
    labels / namespace) per Inviolable Principle #4.
  - platform/newapi/chart/templates/servicemonitor.yaml (new): mirror
    template targeting the existing bp-newapi Service / `http` port for
    when the catalyst-api binary that mounts the bridge handler rolls
    out behind the same Service. Default-off, capability-guarded.

No chart Chart.yaml bump. Validates: helm template + helm lint clean
on both charts; go build + go test clean across all three modules
(pty-server, core/controllers/sandbox, platform/newapi/internal/handler).

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 14:34:50 +04:00
github-actions[bot]
3869070336 deploy: bump sandbox-controller image to 6f24ea2 2026-05-18 10:31:45 +00:00
github-actions[bot]
74ecd5bd4a deploy: bump sandbox-mcp-server image to 6f24ea2 2026-05-18 10:30:35 +00:00
github-actions[bot]
991130e684 deploy: bump bp-newapi upstream v0.13.2 chart 1.4.12 2026-05-18 10:29:09 +00:00
e3mrah
6f24ea20b0
test(sandbox+newapi): integration test newapi token mint round-trip + verify reflector wiring (#1677)
Wave 15 — closes the verification gap between PR #1637 (catalyst-api
/api/v1/sandbox/sessions), PR #1638 (real /admin/tokens/sandbox mint
in newapi), and PR #1643 (sandbox-controller calls the bridge):

1. core/controllers/sandbox/internal/newapi/integration_test.go (new)
   End-to-end test wiring the REAL newapi.Client through net/http
   against a httptest.Server that mirrors the bridge handler contract
   (platform/newapi/internal/handler/sandbox_token.go) verbatim:
     - happy-path: 200 with {token, expires_at} → controller renders
       per-Sandbox Secret + stamps lifecycle annotations on the CR
       (verifies wire path: Authorization: Bearer + body fields
       org_id/user_id/sandbox_id/allowed_channels)
     - 401 path: wrong admin bearer → 401 JSON envelope → controller
       stamps TokenMintFailed False condition, NO gitops writes,
       requeues, NO lifecycle annotations stamped
     - transport-unreachable: closed server URL → TokenMintFailed
       (verifies the error wrapping path)

   Gap filled: client_test.go covers the HTTP client in isolation,
   sandbox_controller_test.go uses an in-process stub. Neither
   exercises the controller-runtime reconciler against a real HTTP
   transport — this file is the only place where a regression in the
   client's bearer/header/path/status-code handling would surface in
   the context of the reconciler's state machine.

2. platform/newapi/chart — reflector wiring fix
   Default reflectorNamespaces changed from "sandbox" to
   "catalyst-system,sandbox". Root cause: clusters/_template/
   bootstrap-kit/19a-bp-sandbox.yaml sets `targetNamespace:
   catalyst-system` (the canonical install namespace of the
   bp-sandbox HelmRelease) but the chart-emitted Secret was being
   mirrored into a `sandbox` namespace that does not exist on a
   stock Sovereign. Result: sandbox-controller Pod's
   `NEWAPI_ADMIN_SECRET` env var landed empty (secretKeyRef
   `optional: true` swallowed the missing-Secret error) → controller
   started in gitops-only mode, never minted tokens, silently
   degraded. Operator-visible only via a startup log line.

   `sandbox` is retained in the default for legacy overlays that
   install the controller into a dedicated namespace + for sister
   tooling (catalyst-api PATs routed through the bridge) that wants
   the admin bearer locally.
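
A pre-flight check for this mismatch could be as small as the sketch below. `reflectsTo` is a hypothetical helper that tests whether a comma-separated reflector namespace list covers the namespace the controller is installed into.

```go
package main

import (
	"fmt"
	"strings"
)

// reflectsTo reports whether target appears in a comma-separated list of
// namespaces, ignoring surrounding whitespace.
func reflectsTo(namespaces, target string) bool {
	for _, ns := range strings.Split(namespaces, ",") {
		if strings.TrimSpace(ns) == target {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(reflectsTo("sandbox", "catalyst-system"))
	fmt.Println(reflectsTo("catalyst-system,sandbox", "catalyst-system"))
}
```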

Verification:

  - go build ./sandbox/... clean
  - go test ./sandbox/internal/newapi/... — 7 tests pass (4 unit + 3
    new integration)
  - go test ./sandbox/... — all sandbox packages pass
  - go vet ./sandbox/... clean
  - helm template platform/newapi/chart/ -s sandbox-token-signing-
    key-secret.yaml renders with
    `reflection-{allowed,auto}-namespaces: "catalyst-system,sandbox"`
  - helm template platform/sandbox/chart/ renders Deployment env
    block `NEWAPI_ADMIN_SECRET` valueFrom Secret
    `newapi-bp-newapi-token-signing-key` key `ADMIN_SECRET`
    (optional: true) — unchanged

No Chart.yaml bump (chart pinning is a release-driver concern).

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 14:28:38 +04:00
e3mrah
1075f6bbf2
feat(sandbox-chart): Grafana dashboard for Sandbox runtime observability (#1674)
Add a default-off Grafana dashboard ConfigMap (label `grafana_dashboard: "1"`)
that the upstream grafana/grafana sidecar (kiwigrid/k8s-sidecar) auto-discovers
across all namespaces and loads on startup. Renders only when
`.Values.grafanaDashboard.enabled=true` — zero regression for every existing
Sovereign overlay.

Panels (per products/sandbox/docs/architecture.md §7):
  1. Active Sandboxes — sum(kube_customresource_sandbox_info)
  2. pty-server Pods Ready % — kube_pod_status_ready{condition=true}
     joined to kube_pod_labels{label_app_kubernetes_io_name="pty-server"}
  3. MCP Pods Ready % — same shape, label_app_kubernetes_io_name="openova-sandbox-mcp"
  4. WebSocket Connections — pty_server_websocket_connections (Gauge)
  5. PVC Usage % per Sandbox — kubelet_volume_stats_used_bytes / capacity_bytes
  6. Idle-Timeout Scale-Down Events / hour —
     rate(sandbox_controller_idle_timeout_events_total[5m]) × 3600
  7. newapi Token Mint Requests / hour —
     rate(newapi_admin_token_mint_requests_total{tool="sandbox-controller"}[5m]) × 3600
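
The `rate(...) × 3600` arithmetic in panels 6 and 7 converts a per-second rate into events per hour. The sketch below computes the same figure from two counter samples; it is a simplification that ignores the counter resets and window extrapolation Prometheus handles, and `eventsPerHour` is a hypothetical name.

```go
package main

import "fmt"

// eventsPerHour converts the increase of a counter between two samples,
// taken `seconds` apart, into an hourly figure.
func eventsPerHour(first, last, seconds float64) float64 {
	if seconds <= 0 {
		return 0
	}
	return (last - first) * 3600 / seconds
}

func main() {
	// 3 events over a 5-minute (300 s) window.
	fmt.Println(eventsPerHour(10, 13, 300))
}
```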

Pattern mirrors platform/seaweedfs/chart/.../seaweedfs-grafana-dashboard.yaml.
Per Inviolable Principle #11 (never fabricate metrics) every panel description
names the metric it depends on so panels whose emitter has not yet rolled out
across the fleet render as "No data" instead of a synthetic number.

Validated:
- `helm template` clean both modes (default-off → zero output; enabled → 1 CM)
- `helm lint` passes (1 INFO about icon — pre-existing)
- Dashboard JSON parses (json.loads), 7 panels enumerated
- No Chart.yaml bump

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 14:04:58 +04:00
e3mrah
94e98052a6
fix(sandbox-chart): smoke-render-mode default-off (was default-on; chart is .Values.enabled-gated, default-on renders empty → Blueprint Release fails 'empty render') (#1672)
Wave 15 #1668 added the annotation but used default-on, which trips the
empty-render guard because the chart's resources are all gated on
.Values.enabled (default false). Flip to default-off so the smoke render
skips the chart, as designed.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 13:46:26 +04:00
github-actions[bot]
fdc2b3340b deploy: bump sandbox-mcp-server image to de19be6 2026-05-18 09:38:38 +00:00
e3mrah
fcf86a6392
fix(sandbox-chart): no-upstream annotation (unblock Blueprint Release pipeline) (#1668)
* docs: session 2026-05-17/18 Wave 12-14 addendum + bootstrap-kit pin lag feedback

Append a Wave 12-14 addendum to the convergence report capturing:

- t-prov cycle log (t13 FAIL, t14 FAIL, t15 PASS, t16-t19 STUCK on stale chart, t20 in flight on 1.4.162)
- Three silent-failure traps:
  - Wave 8: CloudPage TS error stalled UI builds for 3h
  - Wave 13: mcp-server Dockerfile context broke sandbox-mcp builds for 3 days since #1658
  - Wave 14: bootstrap-kit pin lag stalled all chart propagation for 6h of provs
- Wave 12-14 PR roster (#1656/#1658/#1659/#1660/#1661/#1662/#1663/#1664/#1666/#1667) plus session total now 51 PRs
- Lesson 6: deploy-bot does NOT auto-bump the bootstrap-kit slot 13 pin; manual collector PR required per cycle

Companion memos (out-of-tree, not in this PR):

- session_2026_05_18_overnight_22prs.md gets a Wave 12-14 outcomes section
- new feedback_bootstrap_kit_pin_lag.md pins the pattern + detection one-liner

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(sandbox-chart): add no-upstream annotation (unblock Blueprint Release pipeline)

Blueprint Release CI has been failing on every push that touched
platform/sandbox/chart/* since PR #1622 because the chart declared
neither `dependencies:` nor the catalyst.openova.io/no-upstream: "true"
annotation. Per docs/BLUEPRINT-AUTHORING.md §11.1, every umbrella chart at
platform/<name>/chart/ MUST declare one of the two.

Sandbox is Catalyst-authored (sandbox-controller built in-house), so the
no-upstream annotation is correct. Matches existing pattern in:
- platform/bp-vcluster-helmrepo/chart/Chart.yaml
- platform/cnpg-pair/chart/Chart.yaml
- platform/external-secrets-stores/chart/Chart.yaml

Without this, Blueprint Release fails → bp-catalyst-platform chart
artifact at 1.4.162 never republishes with the latest sandbox image
refs (cadc7b5 from PR #1667 auto-bump) → fresh provs keep getting
stale sandbox runtime images.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 13:30:00 +04:00
github-actions[bot]
e5c2797ce6 deploy: bump sandbox-mcp-server image to cadc7b5 2026-05-18 09:25:43 +00:00
github-actions[bot]
87cf177a02 deploy: bump sandbox-pty-server image to cadc7b5 2026-05-18 09:23:28 +00:00
e3mrah
cadc7b5cea
fix(sandbox-ci): mcp-server Dockerfile repo-root context + pty/mcp auto-bump wiring (chart was half-deployable) (#1667)
Sandbox chart was un-deployable end-to-end because three CI-side gaps
compounded after PR #1658 wired the mcp-server module to depend on
core/controllers + core/services/shared via `replace` directives:

1. **mcp-server Dockerfile built against a too-narrow context**. The
   workflow passed `context: products/sandbox/mcp-server` and the
   Dockerfile assumed `COPY . .` could see everything it needed, but
   the `replace ../../../core/controllers` line in the module's go.mod
   only resolves when the build can actually reach those paths. Result:
   every push after #1658 failed at `go build` with `module not found`.
   Fix mirrors core/controllers/sandbox/Dockerfile (Slice-CC1 layout):
   COPY the replace targets' module roots + sources, then build with
   WORKDIR set to the dependent module. Static binary still produced
   into a distroless/static-debian12:nonroot final stage.

2. **mcp-server workflow had no chart auto-bump step**. Even after a
   green build, `runtime.mcpImage` in platform/sandbox/chart/values.yaml
   stayed empty so the chart's `required` guard
   (deployment.yaml line 72) refused to render. Added the same
   yq-bump + bot-commit pattern build-sandbox-controller.yaml already
   uses, targeting `.runtime.mcpImage` and writing a fully-qualified
   `<repo>:<sha>` string (consumer reads it as one image reference,
   not a {repository,tag} pair). Also widened paths-filter to include
   core/controllers/** + core/services/shared/** so changes to the
   replace targets re-trigger the build.

3. **pty-server workflow had no auto-bump either**. Same surgery:
   yq-bump `.runtime.ptyServerImage` + commit-and-push. Context stays
   narrow (pty-server has no cross-tree `replace` directives).

4. **Stop-gap pin values for runtime.{ptyServerImage,mcpImage}** so the
   next chart rollout doesn't fail fast before the rebuilt workflows
   land their first bumps:
   - ptyServerImage → ad5163e6 (current latest pty-server)
   - mcpImage → 1b0e86c (last pre-#1658 green build; the rebuilt
     workflow will land the next real SHA on the next push to main).
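
The single-string `<repo>:<sha>` contract from item 2 can be illustrated with a small Go helper. The workflow itself uses yq; `imageRef` and the repository name in `main` are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// imageRef joins a repository and a commit SHA into one image reference
// string, rejecting empty parts and SHAs containing separator characters.
func imageRef(repo, sha string) (string, error) {
	if repo == "" || sha == "" {
		return "", errors.New("repo and sha must both be set")
	}
	if strings.ContainsAny(sha, ":/ ") {
		return "", fmt.Errorf("sha %q contains an invalid character", sha)
	}
	return repo + ":" + sha, nil
}

func main() {
	// The repository path is a placeholder, not the project's real one.
	ref, err := imageRef("registry.example/sandbox-mcp-server", "1b0e86c")
	fmt.Println(ref, err)
}
```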

Verified locally:
- `go build ./products/sandbox/mcp-server/...` clean (43.8 MB static
  binary at /tmp/openova-sandbox-mcp; `file` confirms statically
  linked ELF).
- `helm template test platform/sandbox/chart --set enabled=true …`
  renders cleanly; both env vars carry the SHA-pinned image refs.

No Chart.yaml bump. Read-only clusters.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 13:22:17 +04:00
github-actions[bot]
9ef6e30ee4 deploy: bump sandbox-controller image to e83d08e 2026-05-18 08:57:14 +00:00
e3mrah
e83d08ea4e
feat(sandbox+tenant): CNPG active-hot-standby (ReplicaCluster) default for marketplace tenants when SOVEREIGN_ENABLE_HOT_STANDBY=true (#1661)
Sovereign DoD D31 — CNPG-backed apps must replicate across the
Sovereign's regions when the operator opts in. PR #1562 wired this
into bp-wordpress-tenant chart-level. This change extends the same
toggle across BOTH user-facing paths:

1. Marketplace tenant flow (sme_tenant_gitops.go)
   - smeTenantTemplateData gains EnableHotStandby/PrimaryRegion/
     ReplicaRegion. renderSMETenantOverlay reads them from the
     catalyst-api Pod env (SOVEREIGN_ENABLE_HOT_STANDBY +
     SOVEREIGN_PRIMARY_REGION + SOVEREIGN_REPLICA_REGION).
   - The bp-wordpress-tenant HelmRelease emits pg.activeHotStandby.*
     when the trio is valid; bp-wordpress-tenant chart 0.2.0+
     (PR #1562) renders the primary + replica Cluster CR pair.
   - Defence-in-depth: degenerate inputs (empty/identical regions)
     fall back to single-Cluster shape rather than emitting a
     HelmRelease the chart's validateActiveHotStandbyRegions helper
     would fail at template time.

2. Sandbox plane (sandbox.db.provision)
   - Env struct + NewEnvFromOS read the same Sovereign-level trio.
   - sandbox.db.provision emits a primary + replica Cluster CR pair
     when hotStandbyActive() — same shape bp-cnpg-pair renders for
     marketplace apps + bp-wordpress-tenant cnpg-cluster.yaml: WAL
     streaming via spec.managed.services.additional annotated
     service.cilium.io/global=true, nodeAffinity pinning each side
     to its declared region, replica.enabled=true with externalCluster
     resolving the primary through the ClusterMesh-global Service alias.
   - Best-effort rollback if the replica Create fails so the operator
     never sees an orphan primary.

3. Plumbing (one knob, both paths)
   - catalyst chart: values.sovereign.{enableHotStandby,primaryRegion,
     replicaRegion} -> sovereign-fqdn ConfigMap keys -> catalyst-api env.
   - sandbox chart: cnpg.activeHotStandby.{enabled,primaryRegion,
     replicaRegion} -> controller env -> per-Sandbox MCP Pod env.
   - Bootstrap-kit slot 13 + slot 19a wire SOVEREIGN_ENABLE_HOT_STANDBY/
     SOVEREIGN_PRIMARY_REGION/SOVEREIGN_REPLICA_REGION envsubst
     placeholders to BOTH chart paths so the operator flips one knob
     on the per-Sovereign overlay and gets HA across the marketplace
     tenant install AND the sandbox.db plane.

Default empty/false: every Sovereign that has not opted in keeps
rendering single-Cluster CNPG (zero regression).
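
A minimal sketch of the opt-in check described above, assuming the trio is
read from the environment and that only the literal "true" enables the
toggle. The type and function names (hotStandbyConfig, configFromEnv,
active) and the region values are hypothetical and are not taken from the
repo; trimming whitespace is likewise an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// hotStandbyConfig is a hypothetical holder for the Sovereign-level trio.
type hotStandbyConfig struct {
	Enabled       bool
	PrimaryRegion string
	ReplicaRegion string
}

// configFromEnv reads the trio through a lookup function so the logic can be
// exercised without touching the process environment. Accepting only the
// literal "true" and trimming whitespace are assumptions.
func configFromEnv(getenv func(string) string) hotStandbyConfig {
	return hotStandbyConfig{
		Enabled:       getenv("SOVEREIGN_ENABLE_HOT_STANDBY") == "true",
		PrimaryRegion: strings.TrimSpace(getenv("SOVEREIGN_PRIMARY_REGION")),
		ReplicaRegion: strings.TrimSpace(getenv("SOVEREIGN_REPLICA_REGION")),
	}
}

// active reports whether a primary + replica Cluster pair should be rendered.
// Degenerate inputs (toggle off, empty region, identical regions) return
// false so the caller falls back to the single-Cluster shape.
func (c hotStandbyConfig) active() bool {
	if !c.Enabled || c.PrimaryRegion == "" || c.ReplicaRegion == "" {
		return false
	}
	return c.PrimaryRegion != c.ReplicaRegion
}

func main() {
	env := map[string]string{
		"SOVEREIGN_ENABLE_HOT_STANDBY": "true",
		"SOVEREIGN_PRIMARY_REGION":     "region-a",
		"SOVEREIGN_REPLICA_REGION":     "region-b",
	}
	cfg := configFromEnv(func(key string) string { return env[key] })
	fmt.Println("render primary + replica pair:", cfg.active())
}
```

In a real process the lookup function would simply be os.Getenv; passing it
in keeps the degenerate-input matrix easy to test.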

gitlab-tenant + nextcloud-tenant charts: NOT shipped in this repo
today, so they are out of scope. When they land they can copy the
same value contract (pg.activeHotStandby.*) and the gitops writer
wiring already handles them — no chart-bump or controller change
required.

Tests
- sme_tenant_active_hot_standby_test.go: 8 cases (off, on-happy-path,
  degenerate matrix incl. empty primary, empty replica, identical
  regions, toggle off with regions).
- sandbox_db_hot_standby_test.go: 11 cases covering hotStandbyActive
  matrix + replicaClusterName/replicationServiceName suffix rules +
  full primary + replica CR shapes (nodeAffinity, switchover, managed
  service, externalClusters).
- platform/wordpress-tenant/chart/tests/active-hot-standby-render.sh
  still passes (5/5 gates green).
- catalyst-api SMETenant suite GREEN.
- sandbox-controller suite GREEN.
- helm template clean for sandbox chart (HA + default-off) and
  catalyst chart (sovereign-fqdn-configmap + api-deployment).

Hard rules respected: READ-ONLY clusters, no Chart.yaml bump on
bp-catalyst-platform (envsubst-only wiring change in slot 13), no
host-cluster touch outside the chart-level seam.

Refs DoD D31.

Co-authored-by: hatiyildiz <269457768+hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 12:53:57 +04:00
github-actions[bot]
d82471ced1 deploy: bump sandbox-controller image to d5ea7d9 2026-05-18 08:19:53 +00:00
github-actions[bot]
5309bb8c39 deploy: bump sandbox-controller image to 63255bf 2026-05-18 08:15:56 +00:00
github-actions[bot]
18df061895 deploy: bump bp-newapi upstream v0.13.2 chart 1.4.11 2026-05-18 08:00:46 +00:00
e3mrah
0604c5e057
fix(newapi): gate channel render on attestation present (was blocking install when accountId env empty) (#1654)
Convergence wave 11 blocker on t16: bp-newapi HR install fails with

  Error: template: bp-newapi/templates/configmap.yaml:1:4: executing
  "bp-newapi/templates/configmap.yaml" at <include "bp-newapi.assertChannelAttestation" .>:
  channel[0] (qwen3.6-bankdhofar): commercial-contract attestation
  requires accountId

PR #1631 wired the bootstrap-kit overlay so franchised Sovereigns can
opt in to marketplace via `MARKETPLACE_ENABLED=true` — flipping
`defaultChannels.qwenBankDhofar.enabled` to true with envsubst
placeholders for the attestation:

  attestation:
    kind: commercial-contract
    accountId:   ${LLM_BANK_DHOFAR_ACCOUNT_ID:-}
    contractRef: ${LLM_BANK_DHOFAR_CONTRACT_REF:-}

On a Sovereign that has not yet signed the commercial contract those
variables expand to empty strings, and the chart's
`assertChannelAttestation` helper hard-fails the helm template before
any manifest is rendered — newapi install crashes at slot 80 and the
whole bootstrap-kit reconciliation stalls.

Fix (Option A — smallest change, makes the chart actually install):
SKIP composing the qwenBankDhofar channel when
attestation.kind=commercial-contract AND either accountId or contractRef
is empty. NewAPI installs with zero default channels (operator-supplied
`.Values.channels` still compose). Once the operator overlay supplies
the attestation values the channel composes on the next reconcile.

Touches two templates that gate on the same effective channel list:

  - templates/_helpers.tpl `bp-newapi.effectiveChannels` — adds a
    pre-check ($qbdAttReady) that short-circuits the channel composition
    block when attestation is incomplete. The downstream
    `assertChannelAttestation` helper then sees an empty channel list
    for the qwenBankDhofar slot and emits no error.
  - templates/channel-seed-job.yaml — mirrors the same gate so the
    post-install Helm hook Job + RBAC + audit ConfigMap also skip when
    the channel itself was skipped (otherwise the Job would POST a row
    whose ConfigMap entry was omitted from /etc/newapi/channels.yaml).

`helm template platform/newapi/chart` renders cleanly in all three
states:
  - default (qbd.enabled=false) → no channel, no seed Job
  - qbd.enabled=true + empty accountId/contractRef → no channel, no
    seed Job (NEW: pre-1.4.10 this hard-failed)
  - qbd.enabled=true + accountId + contractRef present → channel
    composed normally, seed Job emitted
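
The real gate lives in Helm templates; the three states above can still be
illustrated in Go. This is only a sketch of the decision logic: the types
and function names are hypothetical, and treating every other attestation
kind as ready is an assumption, since the commit describes the
commercial-contract case only.

```go
package main

import "fmt"

// attestation and channel are hypothetical stand-ins for the chart values.
type attestation struct {
	Kind        string
	AccountID   string
	ContractRef string
}

type channel struct {
	Name        string
	Enabled     bool
	Attestation attestation
}

// attestationReady mirrors the described pre-check: a commercial-contract
// attestation needs both accountId and contractRef to be non-empty.
// Other kinds are passed through (assumption).
func attestationReady(a attestation) bool {
	if a.Kind != "commercial-contract" {
		return true
	}
	return a.AccountID != "" && a.ContractRef != ""
}

// effectiveChannels keeps only the default channels that are enabled and
// whose attestation is complete; skipped channels produce no error.
func effectiveChannels(defaults []channel) []channel {
	var out []channel
	for _, ch := range defaults {
		if ch.Enabled && attestationReady(ch.Attestation) {
			out = append(out, ch)
		}
	}
	return out
}

func main() {
	pending := channel{
		Name:        "qwen3.6-bankdhofar",
		Enabled:     true,
		Attestation: attestation{Kind: "commercial-contract"},
	}
	fmt.Println("channels composed:", len(effectiveChannels([]channel{pending})))
}
```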

Chart bumped 1.4.9 → 1.4.10; bootstrap-kit overlay pin bumped
1.4.6 → 1.4.10 so franchised Sovereigns immediately pick up the fix.

READ-ONLY clusters preserved. NO Chart.yaml bump on
bp-catalyst-platform.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 12:00:06 +04:00
github-actions[bot]
51913fe380 deploy: bump sandbox-controller image to ad5163e 2026-05-18 07:54:45 +00:00
e3mrah
ad5163e69a
feat(sandbox-controller): IdleScaler scales pty-server replicas to 0 after configured idle window (#1651)
PR #1641 shipped the `openova.io/sandbox-idle-timeout-minutes` annotation on
every pty-server StatefulSet but no controller was reading it. This closes
the loop:

pty-server (products/sandbox/pty-server/):
  - session.Manager tracks lastActivity; Touch() called on session
    create/stop, WS attach/detach, every WS message in/out, resize/signal.
  - New GET /idle endpoint returns {lastActivityAt, activeSessions}.
  - Unit tests cover the endpoint shape + Touch() bump.
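
A rough sketch of the activity tracking and the /idle endpoint, under
stated assumptions: the manager type, its fields, and the probe helper are
hypothetical, and encoding lastActivityAt as an RFC 3339 timestamp is an
assumption (the commit only names the two JSON keys).

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync"
	"time"
)

// manager is a hypothetical, trimmed-down session manager that only tracks
// what the /idle endpoint needs.
type manager struct {
	mu             sync.Mutex
	lastActivity   time.Time
	activeSessions int
}

// Touch records activity; callers invoke it on every session or WS event.
func (m *manager) Touch() {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.lastActivity = time.Now().UTC()
}

// idleStatus is the response body of GET /idle.
type idleStatus struct {
	LastActivityAt time.Time `json:"lastActivityAt"`
	ActiveSessions int       `json:"activeSessions"`
}

// handleIdle serves GET /idle and rejects every other method with 405.
func (m *manager) handleIdle(w http.ResponseWriter, r *http.Request) {
	if r.Method != http.MethodGet {
		http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
		return
	}
	m.mu.Lock()
	status := idleStatus{LastActivityAt: m.lastActivity, ActiveSessions: m.activeSessions}
	m.mu.Unlock()
	w.Header().Set("Content-Type", "application/json")
	_ = json.NewEncoder(w).Encode(status)
}

// probe calls the handler in-process and decodes the body (test helper).
func probe(m *manager, method string) (int, idleStatus) {
	rec := httptest.NewRecorder()
	m.handleIdle(rec, httptest.NewRequest(method, "/idle", nil))
	var status idleStatus
	_ = json.Unmarshal(rec.Body.Bytes(), &status)
	return rec.Code, status
}

func main() {
	m := &manager{}
	m.Touch()
	code, status := probe(m, http.MethodGet)
	fmt.Println(code, status.ActiveSessions, status.LastActivityAt.Format(time.RFC3339))
}
```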

sandbox-controller (core/controllers/sandbox/internal/idlescaler/):
  - New IdleScaler runnable, registered with mgr.Add() in main.go.
  - NeedLeaderElection=true (singleton across HA replicas).
  - Every 60s lists pty-server StatefulSets by label selector
    (app.kubernetes.io/component=pty-server + openova.io/managed-by=catalyst),
    constrained to `sandbox-*` namespaces in code for defence-in-depth.
  - For each: probes the in-cluster Service /idle endpoint, stamps the
    `openova.io/sandbox-last-activity-at` annotation, and patches
    spec.replicas=0 once now-lastActivity exceeds the per-SS
    `openova.io/sandbox-idle-timeout-minutes` annotation (falling back to
    SANDBOX_IDLE_TIMEOUT_MINUTES env, default 30).
  - Probe failure with no prior annotation → skip (next tick); probe
    failure WITH prior annotation → still decide on stale data so a
    degraded probe path doesn't keep a forgotten Pod alive forever.
  - activeSessions > 0 keeps the Pod alive regardless of idle window.
  - Already-zero replicas → idempotent no-op.
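
The per-StatefulSet decision above can be sketched as a pure function. This
is an illustration, not the controller's code: observation, action, decide,
and resolveTimeout are hypothetical names, and treating an unparsable or
non-positive timeout value as "fall through to the next source" is an
assumption.

```go
package main

import (
	"fmt"
	"strconv"
	"time"
)

type action string

const (
	actionNone        action = "none"          // leave the StatefulSet as-is
	actionSkip        action = "skip"          // no usable data, retry next tick
	actionScaleToZero action = "scale-to-zero" // patch spec.replicas=0
)

// observation is what one tick knows about a pty-server StatefulSet.
type observation struct {
	Replicas        int32
	ProbeOK         bool          // did the /idle probe succeed?
	ActiveSessions  int           // meaningful only when ProbeOK
	HasLastActivity bool          // from the probe, else from the prior annotation
	Idle            time.Duration // now minus last activity
	Timeout         time.Duration
}

// decide applies the rules in order: already-zero is a no-op, no data means
// skip, live sessions keep the Pod, otherwise compare idle time to timeout.
// A failed probe with a prior annotation still decides on the stale value.
func decide(o observation) action {
	switch {
	case o.Replicas == 0:
		return actionNone
	case !o.HasLastActivity:
		return actionSkip
	case o.ProbeOK && o.ActiveSessions > 0:
		return actionNone
	case o.Idle > o.Timeout:
		return actionScaleToZero
	default:
		return actionNone
	}
}

// resolveTimeout prefers the per-StatefulSet annotation, then the env value,
// then the 30-minute default.
func resolveTimeout(annotation, env string) time.Duration {
	for _, raw := range []string{annotation, env} {
		if n, err := strconv.Atoi(raw); err == nil && n > 0 {
			return time.Duration(n) * time.Minute
		}
	}
	return 30 * time.Minute
}

func main() {
	timeout := resolveTimeout("", "45")
	obs := observation{Replicas: 1, ProbeOK: true, HasLastActivity: true, Idle: 50 * time.Minute, Timeout: timeout}
	fmt.Println(timeout, decide(obs))
}
```

Keeping the decision free of Kubernetes types makes each of the listed
cases a one-line table test.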

Chart RBAC:
  - ClusterRole gains apps/statefulsets get/list/watch/patch — the ONLY
    cluster-wide write on a non-CR resource, scoped to the controller's
    own managed StatefulSets via the label selector + namespace prefix.

Tests: 9 unit tests covering active-not-idle, idle-scales-zero,
active-sessions-never-scales, probe-fail-no-annotation-skips,
per-SS-annotation-override, namespace-prefix-defence, already-zero-no-op,
default-URL-builder, leader-election-singleton.

Approach: controller polls pty-server's /idle endpoint via cluster-DNS
(smaller diff than embedding a k8s client in pty-server — pty-server
keeps its ~80-line go.mod, no new RBAC inside the per-Sandbox namespace).

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 11:51:36 +04:00
github-actions[bot]
c4fa06a9f4 deploy: bump sandbox-controller image to 3a3ee74 2026-05-18 07:46:53 +00:00
github-actions[bot]
c9fe39a20f deploy: bump bp-newapi upstream v0.13.2 chart 1.4.9 2026-05-18 07:44:23 +00:00
e3mrah
3a3ee742ec
feat(sandbox-controller): call newapi /admin/tokens/sandbox + write Secret + rotation (was placeholder) (#1643)
Wires the sandbox-controller (PR #1622) to actually mint per-Sandbox
LLM-gateway tokens via the catalyst-api bridge handler shipped in
PR #1638, replacing the Wave 1 placeholder Secret with a real
LLM_GATEWAY_TOKEN-bearing manifest pushed to the per-Org Gitea repo.

Changes:

  - New newapi.Client (core/controllers/sandbox/internal/newapi/) —
    thin HTTP client for POST /admin/tokens/sandbox with the bridge's
    {org_id, user_id, sandbox_id, allowed_channels} body + Bearer
    ADMIN_SECRET auth. Interface so tests can stub.

  - Reconciler extended:
      * NewAPIClient + DefaultChannels + TokenRotationLeadTime fields
      * On every reconcile: decide mint-or-skip from annotation
        openova.io/sandbox-token-expires-at vs. now + lead-time
      * On mint: POST to bridge, stamp expires-at + rotated-at
        annotations on the CR, render token bytes into a new
        gitops manifest secret-newapi-token.yaml committed to the
        per-Org catalyst-tenant repo at sandbox/<owner-uid>/
      * Bridge failure → Failed/TokenMintFailed condition + 30s
        requeue + no gitops writes (fail-loud)
      * Empty DefaultChannels → NoAllowedChannels condition (fail
        earlier than the bridge's 400)

  - gitops.Render:
      * New Inputs.NewAPIToken/NewAPITokenSecretName/NewAPITokenExpiresAt
        /NewAPITokenRotatedAt fields
      * New secret-newapi-token.yaml template — Secret with
        stringData.LLM_GATEWAY_TOKEN + expires-at annotation +
        optional kubectl.kubernetes.io/restartedAt rotation marker
        so Wave 2's pty-server StatefulSet picks up rolling
        restarts on token rotation
      * kustomization.yaml appends the new manifest when token
        present

  - Chart wiring (platform/sandbox/chart):
      * Deployment env: NEWAPI_BASE_URL, NEWAPI_ADMIN_SECRET
        (secretKeyRef from newapi-bp-newapi-token-signing-key,
         optional: true), NEWAPI_DEFAULT_CHANNELS
      * ClusterRole bumped to allow update/patch on the
        sandboxes/ resource (the controller now stamps annotations
        on the CR)

  - platform/newapi/chart/templates/sandbox-token-signing-key-secret.yaml:
      * Added emberstack/reflector annotations so the chart-emitted
        Secret (newapi namespace) mirrors into the sandbox-controller
        namespace by default; reflectorNamespaces is overrideable.
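
The mint-or-skip decision in the Reconciler bullet can be sketched as
follows. Assumptions: the openova.io/sandbox-token-expires-at annotation
holds an RFC 3339 timestamp, and a missing or unparsable value triggers a
fresh mint. The names needsMint and mustParse are hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// needsMint reports whether a new token should be minted: either no usable
// expiry is recorded, or the token expires within the rotation lead time.
func needsMint(expiresAtAnnotation string, now time.Time, lead time.Duration) bool {
	if expiresAtAnnotation == "" {
		return true // never minted
	}
	expiresAt, err := time.Parse(time.RFC3339, expiresAtAnnotation)
	if err != nil {
		return true // unreadable annotation: mint rather than trust it (assumption)
	}
	// Rotate once now + lead reaches or passes the expiry.
	return !now.Add(lead).Before(expiresAt)
}

// mustParse is a small helper for the example values below.
func mustParse(s string) time.Time {
	t, err := time.Parse(time.RFC3339, s)
	if err != nil {
		panic(err)
	}
	return t
}

func main() {
	now := mustParse("2026-05-18T00:00:00Z")
	fmt.Println(needsMint("2026-05-25T00:00:00Z", now, 24*time.Hour))
	fmt.Println(needsMint("2026-05-18T12:00:00Z", now, 24*time.Hour))
}
```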

Tests:

  - newapi client: happy-path round-trip, 401 surfaces, input
    validation, request validation. 4 cases.
  - sandbox-controller: existing Wave 1 cases (happy/idempotent/
    drift/missing) still pass; 5 new cases for the token path:
    fresh mint + Secret render, rotation on near-expiry, steady-
    state no-mint, bridge failure surfaces condition, no-channels
    misconfig fails early. 9 cases total, all green.

Hard rules honored:
  - No Chart.yaml bump (chart pinning is a release-driver concern)
  - go build + go test ./core/controllers/sandbox/... clean

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 11:43:50 +04:00
github-actions[bot]
2fee03f7d2 deploy: bump sandbox-controller image to c0020d9 2026-05-18 07:40:02 +00:00
github-actions[bot]
c6820e3d4a deploy: bump sandbox-controller image to 9f6354f 2026-05-18 07:33:12 +00:00
e3mrah
9f6354f1e1
feat(sandbox): controller spawns pty-server + MCP Pods (was just namespace+RBAC+PVCs) (#1641)
Wave 8 extension to PR #1622 (Wave-1 sandbox-controller). The previous
slice reconciled a Sandbox CR into namespace + ResourceQuota + RBAC +
PVCs + placeholder Secret — but NO pty-server, NO MCP server. A freshly-
created Sandbox sat there with empty plumbing and no way for the user
to actually run a coding session.

This PR completes the per-Sandbox runtime by extending
core/controllers/sandbox/internal/gitops/manifests.go to render the
four manifests architecture.md §7 enumerates:

- StatefulSet pty-server (replicas = spec.quota.concurrentSessions,
  one Pod per in-flight session per architecture.md §1/§2). Env wired
  per newapi-proxy-contract.md §1: SANDBOX_OWNER_UID, ORG_ID,
  SOVEREIGN_FQDN, NEWAPI_URL, LLM_GATEWAY_URL / OPENAI_BASE_URL,
  LLM_GATEWAY_TOKEN / OPENAI_API_KEY from per-sandbox Secret
  (key llm-gateway-token, optional). When claude-code is in
  spec.agentCatalogue, ANTHROPIC_API_KEY is ALSO wired from the
  per-user BYOS Secret `sandbox-byos-claude-code-<owner-uid>` (key
  access_token, optional) per claude-code-byos.md §3. Repo PVCs mount
  at /workspace/<repo-slug>.
- Deployment openova-sandbox-mcp (architecture.md §3). Companion MCP
  server, talks to pty-server via the in-namespace ClusterIP Service.
- Service pty-server (ClusterIP :7681) — backend for both the MCP
  Deployment and the HTTPRoute.
- HTTPRoute pty-server — publishes
  sandbox.<sov-fqdn>/sessions/<owner-uid>/* → pty-server :7681 via
  the existing catalyst-public Cilium Gateway in catalyst-system.
  PathPrefix rewrite strips /sessions/<owner-uid> so pty-server sees
  its own /sessions/<id> surface.
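
The effect of the PathPrefix rewrite can be illustrated in Go. This is a
sketch of the path arithmetic only, assuming the route replaces the matched
prefix with "/" and matches on whole path segments; stripSessionPrefix is a
hypothetical name and the real rewrite is performed by the Gateway, not by
application code.

```go
package main

import (
	"fmt"
	"strings"
)

// stripSessionPrefix removes /sessions/<owner-uid> from an inbound path and
// reports whether the path matched the prefix on a segment boundary.
func stripSessionPrefix(path, ownerUID string) (string, bool) {
	prefix := "/sessions/" + ownerUID
	switch {
	case path == prefix:
		return "/", true
	case strings.HasPrefix(path, prefix+"/"):
		return strings.TrimPrefix(path, prefix), true
	default:
		return "", false
	}
}

func main() {
	rewritten, ok := stripSessionPrefix("/sessions/owner-1/sessions/42", "owner-1")
	fmt.Println(rewritten, ok)
}
```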

Knobs are env-plumbed from the chart per Inviolable Principle #4:
- SANDBOX_PTY_SERVER_IMAGE / SANDBOX_MCP_IMAGE — SHA-pinned image
  refs from values.runtime.{ptyServerImage,mcpImage} (fails Helm
  render fast on empty, no silent :latest).
- SANDBOX_NEWAPI_URL — from values.runtime.newapiURL (bootstrap-kit
  overlay derives it from ${SOVEREIGN_FQDN}).
- SANDBOX_LLM_GATEWAY_TOKEN_SECRET / SANDBOX_BYOS_SECRET_PREFIX /
  SANDBOX_IDLE_TIMEOUT_MINUTES — optional with architecture-doc
  defaults.

Idle timeout (architecture.md §7) lands as a StatefulSet annotation
openova.io/sandbox-idle-timeout-minutes — the poll-loop that actually
scales the StatefulSet down on idle ships in a sibling PR (out of
scope for "spawn the Pods"; this PR makes the Pods exist).

Tests cover the full Wave-8 manifest shape: replicas count, identity
env keys, BYOS gating on spec.agentCatalogue, HTTPRoute hostname
binding, kustomization stitching, idempotency. go test
./core/controllers/sandbox/... is green; helm template renders cleanly
and the `required` guard fires on missing runtime values.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 11:30:00 +04:00
github-actions[bot]
22851d980d deploy: bump bp-newapi upstream v0.13.2 chart 1.4.8 2026-05-18 07:03:09 +00:00
e3mrah
4abd156fee
feat(newapi): real /admin/tokens/sandbox mint impl (was stub from #1619) (#1638)
Replaces the Wave 1b stub that echoed the inbound PAT verbatim with a
real HS256 mint flow the sandbox-controller can call when it rolls out
a fresh Sandbox Pod.

Handler (platform/newapi/internal/handler/sandbox_token.go):
  - Caller auth: shared admin-secret bearer (env NEWAPI_ADMIN_SECRET),
    constant-time compared. 401 on mismatch / missing bearer.
  - Request body: {org_id, user_id, sandbox_id, allowed_channels[]}.
    De-duplicates + scrubs empty channel names so a controller bug
    sending [""] can't mint a token that NewAPI silently treats as
    "no restriction".
  - Mints HS256 JWT signed with NEWAPI_TOKEN_SIGNING_KEY. Claim shape:
    {sub: sandbox_id, org: org_id, user: user_id, channels: [...],
     iat, exp: iat+7d, typ: "sandbox"}.
  - Returns {token, expires_at}.
  - Refuses with 503 when SigningKey or AdminSecret is unset
    (visible chart-wiring gap, not a forgeable-token leak).
  - Removes the previous Claims/jwt.Parse PAT-validation path that
    came with the stub — caller is the controller, not an operator.
  - NewHandlerFromEnv() factory loads + validates env at process
    start so catalyst-api can fail loudly instead of shipping the
    endpoint silently.
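
A minimal, stdlib-only sketch of an HS256 mint with the claim shape listed
above. The handler itself may well use a JWT library; this version signs by
hand purely for illustration. mintSandboxToken is named in the test list
below, but its signature here is an assumption, and sandboxClaims,
newClaims, sign, and verify are hypothetical helpers.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/base64"
	"encoding/json"
	"errors"
	"fmt"
	"strings"
	"time"
)

const sandboxTokenTTL = 7 * 24 * time.Hour

// sandboxClaims follows the claim shape in the commit message.
type sandboxClaims struct {
	Sub      string   `json:"sub"`
	Org      string   `json:"org"`
	User     string   `json:"user"`
	Channels []string `json:"channels"`
	Iat      int64    `json:"iat"`
	Exp      int64    `json:"exp"`
	Typ      string   `json:"typ"`
}

// newClaims fills iat and exp (iat + 7 days) from a Unix timestamp.
func newClaims(orgID, userID, sandboxID string, channels []string, issuedAt int64) sandboxClaims {
	return sandboxClaims{
		Sub: sandboxID, Org: orgID, User: userID, Channels: channels,
		Iat: issuedAt, Exp: issuedAt + int64(sandboxTokenTTL/time.Second), Typ: "sandbox",
	}
}

// sign returns the base64url HMAC-SHA256 of the signing input.
func sign(signingInput string, key []byte) string {
	mac := hmac.New(sha256.New, key)
	mac.Write([]byte(signingInput))
	return base64.RawURLEncoding.EncodeToString(mac.Sum(nil))
}

// mintSandboxToken builds header.payload.signature and refuses an empty key.
func mintSandboxToken(key []byte, claims sandboxClaims) (string, error) {
	if len(key) == 0 {
		return "", errors.New("signing key is empty")
	}
	payload, err := json.Marshal(claims)
	if err != nil {
		return "", err
	}
	header := base64.RawURLEncoding.EncodeToString([]byte(`{"alg":"HS256","typ":"JWT"}`))
	signingInput := header + "." + base64.RawURLEncoding.EncodeToString(payload)
	return signingInput + "." + sign(signingInput, key), nil
}

// verify recomputes the signature and compares it in constant time.
func verify(token string, key []byte) bool {
	i := strings.LastIndex(token, ".")
	if i < 0 {
		return false
	}
	return hmac.Equal([]byte(sign(token[:i], key)), []byte(token[i+1:]))
}

func main() {
	key := []byte("example-signing-key")
	claims := newClaims("example-org", "user@example.com", "sandbox-uid", []string{"qwen3.6-bankdhofar"}, time.Now().Unix())
	token, err := mintSandboxToken(key, claims)
	fmt.Println(verify(token, key), err)
}
```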

Unit tests (sandbox_token_test.go) — 11 cases:
  - happy path (mint + claim shape + signature round-trip)
  - de-dup + empty-channel scrub
  - admin-secret mismatch / missing bearer → 401
  - missing org_id / user_id / sandbox_id / empty channels → 400
  - non-POST → 405
  - unset env → 503
  - mintSandboxToken empty-secret guard + round-trip
  - response does not echo admin secret or signing key
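
Two of the input guards exercised by these cases, the bearer check and the
channel scrub, might look like the sketch below. The function names are
hypothetical, trimming whitespace from channel names is an assumption, and
the mapping to status codes (401, 400, 503) is left to the caller.

```go
package main

import (
	"crypto/subtle"
	"fmt"
	"strings"
)

// bearerOK compares the presented bearer against the admin secret in
// constant time. An unset secret never authenticates; the handler described
// above answers 503 in that situation before reaching this check.
func bearerOK(authHeader, adminSecret string) bool {
	const prefix = "Bearer "
	if adminSecret == "" || !strings.HasPrefix(authHeader, prefix) {
		return false
	}
	presented := authHeader[len(prefix):]
	return subtle.ConstantTimeCompare([]byte(presented), []byte(adminSecret)) == 1
}

// scrubChannels drops empty names and duplicates while preserving order, so
// a request carrying [""] ends up with no channels and can be rejected.
func scrubChannels(in []string) []string {
	seen := make(map[string]struct{}, len(in))
	out := make([]string, 0, len(in))
	for _, raw := range in {
		name := strings.TrimSpace(raw)
		if name == "" {
			continue
		}
		if _, dup := seen[name]; dup {
			continue
		}
		seen[name] = struct{}{}
		out = append(out, name)
	}
	return out
}

func main() {
	fmt.Println(bearerOK("Bearer example-secret", "example-secret"))
	fmt.Println(scrubChannels([]string{"", "qwen3.6-bankdhofar", "qwen3.6-bankdhofar"}))
}
```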

Chart wiring (platform/newapi/chart):
  - New Secret template sandbox-token-signing-key-secret.yaml
    auto-renders with Helm `lookup` + helm.sh/resource-policy: keep
    (same load-bearing pattern as credentials-secret.yaml #943 and
    gitea admin-secret.yaml #830 Bug 2). 64-char alphanumeric values
    for both SIGNING_KEY and ADMIN_SECRET; persistence across
    reconciles is required because a reconcile-time rotation would
    silently invalidate every per-Sandbox token across the Sovereign
    AND break the sandbox-controller's auth path until its Pod
    restarts.
  - values.yaml block sandboxTokenSigningKey.{existingSecret,
    autoProvision, autoSecretName} matching the `credentials`
    convention (operator override > auto-provision > skip-render).
  - No Chart.yaml bump — chart value addition only.

Verification:
  - go build ./platform/newapi/internal/handler/... — clean
  - go test ./platform/newapi/internal/handler/... — 11/11 PASS
  - helm template platform/newapi/chart — Secret renders

How sandbox-controller will use it:
  1. Read NEWAPI_ADMIN_SECRET from mounted Secret newapi-token-signing-key.
  2. POST /admin/tokens/sandbox with bearer + body
     {org_id: <Sandbox.spec.owner.orgRef.slug>,
      user_id: <Sandbox.spec.owner.email>,
      sandbox_id: <Sandbox.metadata.uid>,
      allowed_channels: ["qwen3.6-bankdhofar"]}.
  3. Write returned token into Secret/sandbox-<uid>-newapi-token.
  4. Mount that Secret into the Sandbox Pod as LLM_GATEWAY_TOKEN.
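
Steps 1 and 2 can be sketched as a small HTTP client, exercised here
against an in-process fake of the bridge. Assumptions: the bridge answers
200 on success and returns expires_at as a string; mintToken, fakeBridge,
and roundTrip are hypothetical names, and the identifiers in the request
body are placeholders.

```go
package main

import (
	"bytes"
	"context"
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

type mintRequest struct {
	OrgID           string   `json:"org_id"`
	UserID          string   `json:"user_id"`
	SandboxID       string   `json:"sandbox_id"`
	AllowedChannels []string `json:"allowed_channels"`
}

type mintResponse struct {
	Token     string `json:"token"`
	ExpiresAt string `json:"expires_at"` // wire type is an assumption
}

// mintToken POSTs the request with the admin secret as a bearer and decodes
// the response; any non-200 status is surfaced as an error.
func mintToken(ctx context.Context, hc *http.Client, baseURL, adminSecret string, in mintRequest) (mintResponse, error) {
	var out mintResponse
	body, err := json.Marshal(in)
	if err != nil {
		return out, err
	}
	req, err := http.NewRequestWithContext(ctx, http.MethodPost, baseURL+"/admin/tokens/sandbox", bytes.NewReader(body))
	if err != nil {
		return out, err
	}
	req.Header.Set("Authorization", "Bearer "+adminSecret)
	req.Header.Set("Content-Type", "application/json")
	resp, err := hc.Do(req)
	if err != nil {
		return out, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return out, fmt.Errorf("token mint failed: status %d", resp.StatusCode)
	}
	err = json.NewDecoder(resp.Body).Decode(&out)
	return out, err
}

// fakeBridge stands in for the real endpoint; it only checks the bearer.
func fakeBridge(adminSecret string) *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.Method != http.MethodPost || r.URL.Path != "/admin/tokens/sandbox" {
			http.NotFound(w, r)
			return
		}
		if r.Header.Get("Authorization") != "Bearer "+adminSecret {
			http.Error(w, "unauthorized", http.StatusUnauthorized)
			return
		}
		_ = json.NewEncoder(w).Encode(mintResponse{Token: "fake-token", ExpiresAt: "2026-05-25T00:00:00Z"})
	}))
}

// roundTrip starts the fake, calls it once, and returns the result.
func roundTrip(serverSecret, clientSecret string) (mintResponse, error) {
	srv := fakeBridge(serverSecret)
	defer srv.Close()
	return mintToken(context.Background(), srv.Client(), srv.URL, clientSecret, mintRequest{
		OrgID: "example-org", UserID: "user@example.com", SandboxID: "sandbox-uid",
		AllowedChannels: []string{"qwen3.6-bankdhofar"},
	})
}

func main() {
	out, err := roundTrip("example-secret", "example-secret")
	fmt.Println(out.Token, out.ExpiresAt, err)
}
```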

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 11:02:40 +04:00
github-actions[bot]
41eba2d436 deploy: bump sandbox-controller image to 1b0e86c 2026-05-18 06:14:36 +00:00