Commit Graph

2312 Commits

Author SHA1 Message Date
e3mrah
284cb62c94
feat(catalyst-api): /api/v1/sandbox/sessions exposes live controller status (was spec-only) (#1673)
PR #1637 shipped GET /api/v1/sandbox/sessions returning only the spec
the handler authored. Status was projected from `.status.phase`, but
every other controller-managed field (session count, storage usage,
30-day spend, preview URLs, failure conditions) was ignored, so the
catalyst-ui Sandbox surface had nothing to render past the FE-side
"queued" pill.

This wires the projection to read `.status.{phase,sessions,storageUsed,
spend30d,previews,conditions}` and surfaces them on the
list/get wire shape:

  - `phase` (raw, Pending|Provisioning|Ready|Failed) alongside the
    existing FE-projected `status` (pending|running|stopped|failed)
  - `sessions`, `storageUsed`, `spend30d` — operator-visible quotas
  - `previews[]` — one preview URL per PR/branch (skips rows missing
    URL; coerces float64↔int64 from apiserver JSON round-trips)
  - `conditions[]` — Type/Status/Reason/Message tuples verbatim, so
    the FE can render TokenMintFailed / GitopsWriteFailed /
    ManifestRenderFailed / OwnerEmailInvalid / NoAllowedChannels
    inline instead of a generic red pill
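
The float64↔int64 coercion noted for `previews[]` could look roughly like
the sketch below. It assumes numeric fields arrive as `interface{}` values
out of an unstructured object; `toInt64` is a hypothetical helper name, not
the handler's actual function.

```go
package main

import "fmt"

// toInt64 is a hypothetical helper: it accepts the numeric types that can
// show up in an unstructured object after a JSON round-trip and returns an
// int64. ok is false for non-numeric values and for floats with a fraction.
func toInt64(v interface{}) (n int64, ok bool) {
	switch x := v.(type) {
	case int64:
		return x, true
	case int32:
		return int64(x), true
	case int:
		return int64(x), true
	case float64:
		// Reject values that would lose information (1.5, NaN, out of range).
		if x != float64(int64(x)) {
			return 0, false
		}
		return int64(x), true
	default:
		return 0, false
	}
}

func main() {
	for _, v := range []interface{}{float64(42), int64(7), 1.5, "42"} {
		n, ok := toInt64(v)
		fmt.Println(n, ok)
	}
}
```

Rows whose numeric fields fail the coercion can then be dropped the same way
rows without a URL are skipped.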

Phase→FE-status mapping unchanged (matches sandbox.api.ts:
normalizeStatus). `mapSandboxStatus` refactored from
`(u *Unstructured)` to `(rawPhase string)` so the new
`readSandboxPhase` helper reads the field exactly once per item.

Added handler-package tests pinning the projection contract:
  - StatusReflection — happy path, dropped-malformed rows
  - PhaseProjection — every CRD phase → FE status
  - FailedSurfacesConditions — Failed + TokenMintFailed visible
  - NilInputDoesNotPanic — empty-slice defaults
  - PreviewFloat64Coercion — apiserver JSON round-trip safety

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 13:59:20 +04:00
e3mrah
94e98052a6
fix(sandbox-chart): smoke-render-mode default-off (was default-on; chart is .Values.enabled-gated, default-on renders empty → Blueprint Release fails 'empty render') (#1672)
Wave 15 #1668 added the annotation but used default-on, which trips the
empty-render guard because the chart's resources are all gated on
.Values.enabled (default false). Flip to default-off so the smoke render
skips the chart, as designed.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 13:46:26 +04:00
github-actions[bot]
fdc2b3340b deploy: bump sandbox-mcp-server image to de19be6 2026-05-18 09:38:38 +00:00
e3mrah
2f10c2e85a
feat(sandbox-ui): SandboxSession real WebSocket connect + reconnect (was placeholder) (#1670)
PR #1621 shipped the SandboxSession xterm.js host with an "API pending"
placeholder banner. PR #1641 + #1657 wired the BE (sandbox-controller
renders the HTTPRoute on sandbox.<sov-fqdn>; pty-server exposes
WS /sessions/{id}/attach). This PR replaces the placeholder with a real
adapter:

- stdin   : term.onData -> ws.send (TextEncoder binary frame)
- stdout  : ws.onmessage -> term.write (ArrayBuffer / Uint8Array / Blob / string)
- resize  : window resize -> fit.fit() -> POST sandbox.<sov-fqdn>/sessions/{id}/resize
- replay  : pty-server ships the ring buffer as the first binary frame; the
            generic onmessage path writes it verbatim, no special case
- reconnect: on close / error, schedule a retry with exponential backoff
             (1s, 2s, 4s, 8s, 16s, 30s ceiling — same shape as
             useComplianceStream). Connection banner reflects
             connecting / connected / reconnecting / closed / idle.

Design-system inheritance: PortalShell wrapper unchanged, CSS-variable
colours throughout, amber for connecting/reconnecting and rose for
disconnected (the same shades the rest of the Sovereign Console uses).
The back-to-landing affordance the e2e suite asserts on is preserved.

Test seams kept: disableTerminal still skips xterm.js mount under
jsdom, plus new websocketFactory / resizeFetcher / reconnectBackoffMs /
disableReconnect props so unit tests can exercise the WS pump without a
real socket or wall-clock backoff.

npx tsc --noEmit clean on the full UI project.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 13:36:22 +04:00
e3mrah
de19be6b35
feat(sandbox-mcp): sandbox.stripe.* real impls (last MCP namespace) (#1671)
Wires four real Stripe handlers in openova-sandbox-mcp, completing the
final unwired namespace from architecture.md §3 (sandbox.stripe.*):

  - sandbox.stripe.bindAccount {api_key} — validates the key prefix
    (sk_live_ / sk_test_ / rk_live_ / rk_test_), stores it in the
    per-Sandbox Secret (`sandbox-<owner-uid>-secrets`, data-key
    `stripe_api_key`) via the same write-path sandbox.secrets.write
    uses, returns a masked confirmation (`sk_test_…xY12`).

  - sandbox.stripe.listProducts — reads the bound key implicitly,
    GET /v1/products with limit (1-100, default 20), active, and
    starting_after cursor passthrough.

  - sandbox.stripe.listPrices {product_id?} — same pagination shape;
    optional product_id filter.

  - sandbox.stripe.createCheckoutSession {price_id, success_url,
    cancel_url} — validates absolute http(s) URLs, POSTs the
    form-encoded line_items[0][price/quantity] body to
    /v1/checkout/sessions, returns the hosted Checkout URL + session id.
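
A minimal sketch of the prefix check and masked confirmation described for
bindAccount. The minimum length (`minSecretLen`) and the function name are
assumptions for illustration; only the four prefixes and the
`sk_test_…xY12` shape come from the text above.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// Accepted key prefixes, as listed for sandbox.stripe.bindAccount.
var stripeKeyPrefixes = []string{"sk_live_", "sk_test_", "rk_live_", "rk_test_"}

// minSecretLen is an assumed lower bound on the part after the prefix.
const minSecretLen = 8

// maskStripeKey validates the prefix and length, then returns the prefix,
// an ellipsis, and the last four characters of the key.
func maskStripeKey(key string) (string, error) {
	for _, p := range stripeKeyPrefixes {
		if !strings.HasPrefix(key, p) {
			continue
		}
		secret := key[len(p):]
		if len(secret) < minSecretLen {
			return "", errors.New("stripe key too short")
		}
		return p + "…" + secret[len(secret)-4:], nil
	}
	return "", errors.New("stripe key has an unrecognised prefix")
}

func main() {
	masked, err := maskStripeKey("sk_test_abcdefghxY12")
	fmt.Println(masked, err)
}
```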

Implementation:

  - No new module dep — inline HTTPS calls to api.stripe.com via the
    stdlib net/http client. stripe-go v82 would have pulled ~80
    transitive packages for four endpoints; the surface we need is
    tiny enough that a 100-line stripeDo helper covers it. Matches
    the task's "stripe-go v82 if not already in deps; else inline
    HTTPS" guidance.

  - The key never round-trips on the wire after first bind. Agent
    pastes once via bindAccount; every subsequent call reads it from
    the Secret store. Stripe-Version header pinned to 2024-06-20 so
    a future API revision can't silently break the wire format.

  - Auth: RequiredCapability="sandbox.stripe" on every tool.
    claims.OrgID match enforced by the registry's existing gate.

  - Read-only cluster invariant: the only writes are to the
    per-Sandbox Secret. assertManagedBy() enforced on bind so we
    cannot mutate the controller-injected `sandbox-tokens` Secret.

Tests cover key validation (prefix + length), masking format, limit
clamping, the httptest.Server-backed happy-path + error-envelope
unwrap, form-urlencoded body shape for createCheckoutSession,
catalogue wiring (all four handlers non-nil, RequiredCapability
matches), and the registry capability gate (missing sandbox.stripe
cap → forbidden).

Closes the Wave 13 "last MCP namespace" gap; no chart bump.

Co-authored-by: Claude <noreply@anthropic.com>
2026-05-18 13:36:15 +04:00
e3mrah
6b3317f185
docs: session 2026-05-17/18 Wave 12-14 addendum + bootstrap-kit pin lag feedback (#1669)
Append a Wave 12-14 addendum to the convergence report capturing:

- t-prov cycle log (t13 FAIL, t14 FAIL, t15 PASS, t16-t19 STUCK on stale chart, t20 in flight on 1.4.162)
- Three silent-failure traps: Wave 8 CloudPage TS error stalled UI builds 3h; Wave 13 mcp-server Dockerfile context broke sandbox-mcp builds for 3 days since #1658; Wave 14 bootstrap-kit pin lag stalled all chart propagation for 6h of provs
- Wave 12-14 PR roster (#1656/#1658/#1659/#1660/#1661/#1662/#1663/#1664/#1666/#1667) plus session total now 51 PRs
- Lesson 6: deploy-bot does NOT auto-bump the bootstrap-kit slot 13 pin; manual collector PR required per cycle

Companion memos (out-of-tree, not in this PR):

- session_2026_05_18_overnight_22prs.md gets a Wave 12-14 outcomes section
- new feedback_bootstrap_kit_pin_lag.md pins the pattern + detection one-liner

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 13:31:17 +04:00
e3mrah
fcf86a6392
fix(sandbox-chart): no-upstream annotation (unblock Blueprint Release pipeline) (#1668)
* fix(sandbox-chart): add no-upstream annotation (unblock Blueprint Release pipeline)

Blueprint Release CI was failing on every push that touched
platform/sandbox/chart/* since PR #1622 because the chart declared
neither `dependencies:` nor the catalyst.openova.io/no-upstream: "true"
annotation. Per docs/BLUEPRINT-AUTHORING.md §11.1, every umbrella chart at
platform/<name>/chart/ MUST do one of those two.

Sandbox is Catalyst-authored (sandbox-controller built in-house), so the
no-upstream annotation is correct. Matches existing pattern in:
- platform/bp-vcluster-helmrepo/chart/Chart.yaml
- platform/cnpg-pair/chart/Chart.yaml
- platform/external-secrets-stores/chart/Chart.yaml

Without this, Blueprint Release fails → bp-catalyst-platform chart
artifact at 1.4.162 never republishes with the latest sandbox image
refs (cadc7b5 from PR #1667 auto-bump) → fresh provs keep getting
stale sandbox runtime images.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 13:30:00 +04:00
github-actions[bot]
e5c2797ce6 deploy: bump sandbox-mcp-server image to cadc7b5 2026-05-18 09:25:43 +00:00
github-actions[bot]
87cf177a02 deploy: bump sandbox-pty-server image to cadc7b5 2026-05-18 09:23:28 +00:00
e3mrah
cadc7b5cea
fix(sandbox-ci): mcp-server Dockerfile repo-root context + pty/mcp auto-bump wiring (chart was half-deployable) (#1667)
Sandbox chart was un-deployable end-to-end because three CI-side gaps
compounded after PR #1658 wired the mcp-server module to depend on
core/controllers + core/services/shared via `replace` directives:

1. **mcp-server Dockerfile built against a too-narrow context**. The
   workflow passed `context: products/sandbox/mcp-server` and the
   Dockerfile assumed `COPY . .` could see everything it needed, but
   the `replace ../../../core/controllers` line in the module's go.mod
   only resolves when the build can actually reach those paths. Result:
   every push after #1658 failed at `go build` with `module not found`.
   Fix mirrors core/controllers/sandbox/Dockerfile (Slice-CC1 layout):
   COPY the replace targets' module roots + sources, then build with
   WORKDIR set to the dependent module. Static binary still produced
   into a distroless/static-debian12:nonroot final stage.

2. **mcp-server workflow had no chart auto-bump step**. Even after a
   green build, `runtime.mcpImage` in platform/sandbox/chart/values.yaml
   stayed empty so the chart's `required` guard
   (deployment.yaml line 72) refused to render. Added the same
   yq-bump + bot-commit pattern build-sandbox-controller.yaml already
   uses, targeting `.runtime.mcpImage` and writing a fully-qualified
   `<repo>:<sha>` string (consumer reads it as one image reference,
   not a {repository,tag} pair). Also widened paths-filter to include
   core/controllers/** + core/services/shared/** so changes to the
   replace targets re-trigger the build.

3. **pty-server workflow had no auto-bump either**. Same surgery:
   yq-bump `.runtime.ptyServerImage` + commit-and-push. Context stays
   narrow (pty-server has no cross-tree `replace` directives).

4. **Stop-gap pin values for runtime.{ptyServerImage,mcpImage}** so the
   next chart rollout doesn't fail fast before the rebuilt workflows
   land their first bumps:
   - ptyServerImage → ad5163e6 (current latest pty-server)
   - mcpImage → 1b0e86c (last pre-#1658 green build; the rebuilt
     workflow will land the next real SHA on the next push to main).

Verified locally:
- `go build ./products/sandbox/mcp-server/...` clean (43.8 MB static
  binary at /tmp/openova-sandbox-mcp; `file` confirms statically
  linked ELF).
- `helm template test platform/sandbox/chart --set enabled=true …`
  renders cleanly; both env vars carry the SHA-pinned image refs.

No Chart.yaml bump. Read-only clusters.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 13:22:17 +04:00
e3mrah
04079522ee
chore(release): bootstrap-kit pin 1.4.156→1.4.162 — Wave 14 collector (#1666)
Deploy-bot auto-bumped Chart.yaml + values.yaml image SHAs from 1.4.156
through 1.4.162 via 4 image-rev cycles, but the bootstrap-kit pin
sticks at the last manual collector bump (Wave 9 #1617). All provs
since t13 baked chart 1.4.156, which lacked the Wave 10-13 fixes:
- #1622 sandbox controller chart
- #1640 Cilium Gateway per-zone listeners
- #1641 sandbox controller spawns pty+MCP Pods
- #1643 sandbox-controller calls newapi token mint
- #1644 organization-controller tenantPublic HTTPRoute
- #1650 provisioning sets Org.spec.tenantPublic
- #1652 bp-sandbox slot 61→19a (chicken-and-egg)
- #1654 bp-newapi attestation gate
- #1661 CNPG cross-region default

Bumping bootstrap-kit pin to 1.4.162 so the next fresh prov bakes the
correct chart artifact with all post-1.4.156 fixes included.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 13:15:50 +04:00
e3mrah
0b30ddca7f
Merge pull request #1664 from openova-io/sandbox-wave12-mcp-storage
feat(sandbox-mcp): sandbox.storage.* real impls (SeaweedFS bucket + signed URLs)
2026-05-18 13:07:54 +04:00
Emrah Baysal
2cebee57dd feat(sandbox-mcp): sandbox.storage.* real impls (SeaweedFS bucket + signed URLs)
Replaces the Wave 2 stubs for the sandbox.storage.* namespace with real
handlers backed by the host-cluster's unified SeaweedFS S3 API
(`seaweedfs.storage.svc:8333` per platform/seaweedfs/README.md). Every
handler is scoped to buckets prefixed `sandbox-<owner-uid>-` so the
agent cannot touch any other consumer's bucket (loki-data / cnpg-wal /
harbor-data, etc.).

Tools shipped:
  - sandbox.storage.bindBucket        {bucket_name?}
  - sandbox.storage.signedUploadURL   {bucket, key, expires_in_seconds?}
  - sandbox.storage.signedDownloadURL {bucket, key, expires_in_seconds?}
  - sandbox.storage.listBuckets
  - sandbox.storage.deleteBucket      {name}

Wire model: minio-go v7 (already canonical across the OpenOva tree —
catalyst-bootstrap/hetzner objectstorage depend on it) speaking S3 v4
to SeaweedFS. Presigned URLs default to 15 min and clamp to 7 days
(the S3 v4 signature ceiling).

Defence-in-depth: prefix-mismatch + 63-char S3 cap + alnum-only object
key regex all enforced BEFORE any S3 dial; arg-validation errors
surface clearly without first hitting a misleading creds error.

New env vars (sandbox-controller fills these at MCP Deployment
spec time):
  SANDBOX_STORAGE_S3_ENDPOINT   = "seaweedfs.storage.svc:8333"
  SANDBOX_STORAGE_S3_ACCESS_KEY = "<per-Sandbox IAM access key>"
  SANDBOX_STORAGE_S3_SECRET_KEY = "<per-Sandbox IAM secret>"
  SANDBOX_STORAGE_S3_USE_TLS    = "true|false" (default: false)
  SANDBOX_STORAGE_S3_REGION     = "us-east-1"  (default; opaque to SeaweedFS)

Auth: same shape as PR #1658 (sandbox.auth.* + sandbox.secrets.*) —
claims.OrgID must match env.OrgID, RequiredCapability=sandbox.storage.

Tests: 9 new test functions covering prefix format, scope-gate,
bucket-name normalization (including cross-Sandbox refusal + length
guard), object-key validation, expiry clamp, region default, per-tool
arg validation, capability gate, and catalogue wiring. `go build` +
`go test` clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 11:06:49 +02:00
e3mrah
67518d4ecb
feat(sandbox-mcp): sandbox.deploy.staging/production/status/rollback (#1663)
Wave 13 of the openova-sandbox-mcp catalogue replaces the deploy
stubs with real handlers:

  - sandbox.deploy.staging    {image, app?}   upserts HR YAML in the
                                              Org's catalyst-tenant
                                              Gitea repo at
                                              sandbox/<owner-uid>/deploy/
                                              <app>/staging/helmrelease.yaml
  - sandbox.deploy.production {image, app?}   same shape, second
                                              capability gate
                                              (sandbox.deploy.production)
  - sandbox.deploy.status     {env, app?}     reads requested-image
                                              from Gitea + observed
                                              image / Ready condition
                                              from the live HR in the
                                              Org vcluster
  - sandbox.deploy.rollback   {env, app?}     reverts HR to the
                                              openova.io/last-deployed-
                                              image annotation

All writes go through the canonical pkg/gitea client (PutFile's
byte-equal short-circuit handles idempotency). Cluster reads use the
Sandbox's in-cluster dynamic client; the deploy tools never apply
manifests directly to the cluster — Flux on the host reconciles the
Gitea write per the architecture.md §3 contract.
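
As a sketch of the repo layout the deploy tools write to, the helper below
builds the HelmRelease path for one environment. The name
`helmReleasePath` is hypothetical, and the production path is assumed to
mirror the staging one ("same shape" above); the default used when `app`
is omitted is not stated, so the sketch requires it.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// helmReleasePath returns
// sandbox/<owner-uid>/deploy/<app>/<env>/helmrelease.yaml.
// It rejects unknown environments and segments that could escape the
// Sandbox's own directory.
func helmReleasePath(ownerUID, app, env string) (string, error) {
	if env != "staging" && env != "production" {
		return "", fmt.Errorf("unknown environment %q", env)
	}
	for _, seg := range []string{ownerUID, app} {
		if seg == "" || seg == "." || seg == ".." || strings.ContainsAny(seg, "/\\") {
			return "", fmt.Errorf("invalid path segment %q", seg)
		}
	}
	return path.Join("sandbox", ownerUID, "deploy", app, env, "helmrelease.yaml"), nil
}

func main() {
	p, err := helmReleasePath("u1", "web", "staging")
	fmt.Println(p, err)
}
```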

Tests: 21 new in sandbox_deploy_test.go covering the Gitea write
fake, capability gates on production, rollback round-trip, image
splitter, HR ready condition parsing, registry advertisement.

Co-authored-by: hatiyildiz <269457768+hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 13:02:04 +04:00
github-actions[bot]
9ef6e30ee4 deploy: bump sandbox-controller image to e83d08e 2026-05-18 08:57:14 +00:00
e3mrah
c7b550f8f4
feat(sandbox-mcp): sandbox.preview.* real impls (per-PR preview Deployments) (#1662)
Replaces the Wave-2 not_implemented stubs for sandbox.preview.* with
real handlers in products/sandbox/mcp-server/internal/tools/sandbox_preview.go.
Five tools shipped:

- sandbox.preview.create   {pr_number, image, repo?}
- sandbox.preview.list
- sandbox.preview.status   {pr_number}
- sandbox.preview.teardown {pr_number}
- sandbox.preview.rebuild  {pr_number, image}

All five address env.SandboxNamespace exclusively (the agent cannot pass
a namespace argument), and every mutation is gated by an
openova.io/managed-by=openova-sandbox-mcp + openova.io/preview-pr=<num>
label pair so teardown/rebuild can never touch pty-server or the
openova-sandbox-mcp Deployment in the same ns.

Hostname pattern: pr-<num>.<app>.sandbox.<sov-fqdn> — attaches to the
canonical Cilium Gateway (cilium-gateway/kube-system) the
sandbox-controller already routes for pty-server, so previews surface
under the same wildcard listener via deeper subdomains.
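
The hostname pattern and the label pair that scopes every mutation can be
expressed as two formatters. A minimal sketch: the function names are
hypothetical, and how the real handlers build their selectors is not shown
in this commit message.

```go
package main

import "fmt"

// previewHostname renders pr-<num>.<app>.sandbox.<sov-fqdn>.
func previewHostname(pr int, app, sovFQDN string) string {
	return fmt.Sprintf("pr-%d.%s.sandbox.%s", pr, app, sovFQDN)
}

// previewSelector renders a label selector matching only objects this MCP
// server created for the given PR, so pty-server and the MCP Deployment
// in the same namespace never match.
func previewSelector(pr int) string {
	return fmt.Sprintf(
		"openova.io/managed-by=openova-sandbox-mcp,openova.io/preview-pr=%d", pr)
}

func main() {
	fmt.Println(previewHostname(42, "web", "sov.example.com"))
	fmt.Println(previewSelector(42))
}
```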

Auth: same HS256 + claims.OrgID + RequiredCapability=sandbox.preview
pattern PRs #1645 / #1656 / #1658 / #1659 established.

go build + go test clean.
2026-05-18 12:55:20 +04:00
e3mrah
e83d08ea4e
feat(sandbox+tenant): CNPG active-hot-standby (ReplicaCluster) default for marketplace tenants when SOVEREIGN_ENABLE_HOT_STANDBY=true (#1661)
Sovereign DoD D31 — CNPG-backed apps must replicate across the
Sovereign's regions when the operator opts in. PR #1562 wired this
into bp-wordpress-tenant at chart level. This change extends the same
toggle across BOTH user-facing paths:

1. Marketplace tenant flow (sme_tenant_gitops.go)
   - smeTenantTemplateData gains EnableHotStandby/PrimaryRegion/
     ReplicaRegion. renderSMETenantOverlay reads them from the
     catalyst-api Pod env (SOVEREIGN_ENABLE_HOT_STANDBY +
     SOVEREIGN_PRIMARY_REGION + SOVEREIGN_REPLICA_REGION).
   - bp-wordpress-tenant HelmRelease emits pg.activeHotStandby.*
     when the trio is valid; bp-wordpress-tenant chart 0.2.0+
     (PR #1562) renders the primary + replica Cluster CR pair.
   - Defence-in-depth: degenerate inputs (empty/identical regions)
     fall back to single-Cluster shape rather than emitting a
     HelmRelease the chart's validateActiveHotStandbyRegions helper
     would fail at template time.

2. Sandbox plane (sandbox.db.provision)
   - Env struct + NewEnvFromOS read the same Sovereign-level trio.
   - sandbox.db.provision emits a primary + replica Cluster CR pair
     when hotStandbyActive() — same shape bp-cnpg-pair renders for
     marketplace apps + bp-wordpress-tenant cnpg-cluster.yaml: WAL
     streaming via spec.managed.services.additional annotated
     service.cilium.io/global=true, nodeAffinity pinning each side
     to its declared region, replica.enabled=true with externalCluster
     resolving the primary through the ClusterMesh-global Service alias.
   - Best-effort rollback if the replica Create fails so the operator
     never sees an orphan primary.

3. Plumbing (one knob, both paths)
   - catalyst chart: values.sovereign.{enableHotStandby,primaryRegion,
     replicaRegion} -> sovereign-fqdn ConfigMap keys -> catalyst-api env.
   - sandbox chart: cnpg.activeHotStandby.{enabled,primaryRegion,
     replicaRegion} -> controller env -> per-Sandbox MCP Pod env.
   - Bootstrap-kit slot 13 + slot 19a wire SOVEREIGN_ENABLE_HOT_STANDBY/
     SOVEREIGN_PRIMARY_REGION/SOVEREIGN_REPLICA_REGION envsubst
     placeholders to BOTH chart paths so the operator flips one knob
     on the per-Sovereign overlay and gets HA across the marketplace
     tenant install AND the sandbox.db plane.

Default empty/false: every Sovereign that has not opted in keeps
rendering single-Cluster CNPG (zero regression).
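
The "valid trio" rule described above (toggle on, both regions set, regions
distinct) can be sketched as follows. The struct and function names are
assumptions for illustration, as is the exact-match parse of "true"; only
the three env var names come from this change.

```go
package main

import "fmt"

// hotStandbyConfig mirrors the Sovereign-level trio.
type hotStandbyConfig struct {
	Enabled       bool
	PrimaryRegion string
	ReplicaRegion string
}

// configFromEnv reads the trio through a lookup function so tests can
// pass a map instead of the process environment.
func configFromEnv(getenv func(string) string) hotStandbyConfig {
	return hotStandbyConfig{
		Enabled:       getenv("SOVEREIGN_ENABLE_HOT_STANDBY") == "true",
		PrimaryRegion: getenv("SOVEREIGN_PRIMARY_REGION"),
		ReplicaRegion: getenv("SOVEREIGN_REPLICA_REGION"),
	}
}

// active reports whether a primary + replica pair should be rendered.
// Degenerate inputs fall back to the single-Cluster shape.
func (c hotStandbyConfig) active() bool {
	return c.Enabled &&
		c.PrimaryRegion != "" &&
		c.ReplicaRegion != "" &&
		c.PrimaryRegion != c.ReplicaRegion
}

func main() {
	env := map[string]string{
		"SOVEREIGN_ENABLE_HOT_STANDBY": "true",
		"SOVEREIGN_PRIMARY_REGION":     "region-a",
		"SOVEREIGN_REPLICA_REGION":     "region-b",
	}
	cfg := configFromEnv(func(k string) string { return env[k] })
	fmt.Println(cfg.active())
}
```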

gitlab-tenant + nextcloud-tenant charts: NOT shipped in this repo
today, so they are out of scope. When they land they can copy the
same value contract (pg.activeHotStandby.*) and the gitops writer
wiring already handles them — no chart-bump or controller change
required.

Tests
- sme_tenant_active_hot_standby_test.go: 8 cases (off, on-happy-path,
  degenerate matrix incl. empty primary, empty replica, identical
  regions, toggle off with regions).
- sandbox_db_hot_standby_test.go: 11 cases covering hotStandbyActive
  matrix + replicaClusterName/replicationServiceName suffix rules +
  full primary + replica CR shapes (nodeAffinity, switchover, managed
  service, externalClusters).
- platform/wordpress-tenant/chart/tests/active-hot-standby-render.sh
  still passes (5/5 gates green).
- catalyst-api SMETenant suite GREEN.
- sandbox-controller suite GREEN.
- helm template clean for sandbox chart (HA + default-off) and
  catalyst chart (sovereign-fqdn-configmap + api-deployment).

Hard rules respected: READ-ONLY clusters, no Chart.yaml bump on
bp-catalyst-platform (envsubst-only wiring change in slot 13), no
host-cluster touch outside the chart-level seam.

Refs DoD D31.

Co-authored-by: hatiyildiz <269457768+hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 12:53:57 +04:00
e3mrah
128299397a
Merge pull request #1660 from openova-io/sandbox-wave12-mcp-marketplace-flux
feat(sandbox-mcp): marketplace.domain.* + flux.* real impls (proxy + dynamic client)
2026-05-18 12:50:27 +04:00
openova-bot
0e7b3232c2 Merge remote-tracking branch 'origin/main' into sandbox-wave12-mcp-marketplace-flux
# Conflicts:
#	products/sandbox/mcp-server/internal/tools/registry.go
2026-05-18 10:49:54 +02:00
openova-bot
c4065020de feat(sandbox-mcp): marketplace.domain.* + flux.* real impls (proxy + dynamic client)
Ships six previously-not_implemented MCP tools as real handlers:

  - marketplace.domain.byod {fqdn}
  - marketplace.domain.subdomain {subdomain, parent_zone?}
  - flux.status [{namespaces?}]
  - flux.reconcile {kind, name, namespace?}
  - flux.suspend {kind, name, namespace?}
  - flux.resume {kind, name, namespace?}

marketplace.domain.* proxies POST /domain/byod + POST /domain/subdomains
on the canonical domain service (core/services/domain/handlers), reusing
the Sandbox PAT as the bearer and the controller-injected
SANDBOX_DOMAIN_API_URL + SANDBOX_TENANT_ID env vars. The proxy returns a
stable snake_case envelope (status / fqdn / cname_target / registrar /
instructions) decoupled from any future domain-service wire refactor.

flux.* uses the same dynamic client every k8s.read.* tool already
builds (k8s_read.go::dynamicClient). flux.status enumerates HRs +
Kustomizations across the Sandbox namespace + caller-supplied extras
and distils the Ready condition + lastAppliedRevision into a flat
table. reconcile/suspend/resume issue strategic-merge patches against
the canonical reconcile.fluxcd.io/requestedAt annotation + spec.suspend
field — these are the only mutations the broader k8s.write.* READ-ONLY
hard-rule explicitly carves out (every other write surface stays
stubbed).

Wire env additions (NewEnvFromOS):

  - SANDBOX_DOMAIN_API_URL       — proxy target
  - SANDBOX_MARKETPLACE_API_URL  — reserved for future marketplace tools
  - SANDBOX_TENANT_ID            — scopes domain/byod calls

Auth: HS256 bearer (claims.OrgID == env.OrgID) + RequiredCapability
= "marketplace" for marketplace.* and "flux" for flux.* — same gate
shape as sandbox.db.* / sandbox.auth.* / sandbox.secrets.*.

Tests added:

  - marketplace_test.go: requireMarketplaceProxy gate, FQDN/subdomain
    regex, validation surface, httptest-backed proxy round-trip
    (verifies bearer + body + URL path), 409 idempotent branch,
    capability-gate matrix.
  - flux_test.go: kind → GVR routing, namespace defaulting,
    Ready-condition extraction over canonical Flux status, summary
    distillation, mutation arg validation, capability-gate matrix.

go build + go test clean (Go 1.23.4).
2026-05-18 10:46:13 +02:00
e3mrah
643d68c29d
feat(sandbox-mcp): rag.search + skills.list/get (lean stubs) (#1659)
Wave 12 thin first-cut for the per-Org RAG + skill-pack surfaces
advertised in products/sandbox/docs/architecture.md §3:

- rag.search {query, repo?, limit?} — case-insensitive substring grep
  over the Sandbox PVC mounted at /repo (filepath.WalkDir with sensible
  binary/large-file skip + matches capped at 50, snippets trimmed to
  80 chars). When the PVC isn't mounted yet the response returns
  matches=[] + pendingApi=true so the agent can branch on "index not
  ready" vs "no hits".
- skills.list / skills.get — hardcoded 3-entry catalogue
  (openova-platform-basics, k8s-debugging, fluxcd-workflows) with
  pendingOCI=true stub manifests. Wire envelope matches the eventual
  OCI-backed catalogue so the agent's parsing layer carries forward.

Auth identical to sandbox.secrets.* (PR #1658): Registry.Call enforces
Claims.OrgID==env.OrgID and RequiredCapability ("rag" / "skills") via
the existing gate; both tools are read-only.

Build + vet + tests clean.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 12:41:28 +04:00
github-actions[bot]
d82471ced1 deploy: bump sandbox-controller image to d5ea7d9 2026-05-18 08:19:53 +00:00
e3mrah
a2cbe3baa0
feat(sandbox-mcp): sandbox.auth.* + sandbox.secrets.* real impls (#1658)
Wave 11 follow-up to PR #1653 (sandbox.db.*). Replaces the stubbed
sandbox.auth.* and sandbox.secrets.* tool handlers with real
implementations so agents can manage per-Sandbox Keycloak realms /
OIDC clients and a per-Sandbox Secret store.

sandbox.auth.* (Keycloak Admin REST via the sandbox-controller-
injected admin bearer):

  - sandbox.auth.provisionRealm {realm_name, display_name?}
      POST /admin/realms — idempotent on 409 Conflict.
  - sandbox.auth.listClients
      GET /admin/realms/<sandbox-realm>/clients — friendly empty
      list on 404 (realm not yet provisioned).
  - sandbox.auth.registerClient {client_id, redirect_uris,
                                 public_client?, name?}
      POST /admin/realms/<sandbox-realm>/clients — idempotent on
      409 Conflict, typed error on 404 (realm missing).

  The Sandbox's "own" realm name is deterministic (`sandbox-<org>-
  <id>`); the agent CANNOT pass a `realm` argument to list /
  register, only provisionRealm accepts a free-form name.
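
  The "idempotent on 409 Conflict" behaviour amounts to treating two status
  codes as success. A minimal sketch, assuming Keycloak answers 201 on
  create; the function name and outcome strings are hypothetical, not the
  handler's wire values.

```go
package main

import (
	"fmt"
	"net/http"
)

// realmCreateOutcome maps the Admin REST response status for
// POST /admin/realms to a result. A 409 means the realm already exists,
// which an idempotent create reports as success.
func realmCreateOutcome(status int) (string, error) {
	switch status {
	case http.StatusCreated:
		return "created", nil
	case http.StatusConflict:
		return "exists", nil
	default:
		return "", fmt.Errorf("keycloak: unexpected status %d", status)
	}
}

func main() {
	for _, s := range []int{201, 409, 500} {
		outcome, err := realmCreateOutcome(s)
		fmt.Println(s, outcome, err)
	}
}
```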

sandbox.secrets.* (per-Sandbox K8s Secret store, base64-encoded
data, encrypted at rest by kube-apiserver encryption-provider):

  - sandbox.secrets.read  {key}        — returns Found / KeyNotFound
                                          / NotFound (Secret missing)
  - sandbox.secrets.write {key, value} — auto-creates the Secret on
                                          first write (Added /
                                          Updated / Created)

  The Secret is named `sandbox-<owner-uid>-secrets` in env.Sandbox-
  Namespace and gated by openova.io/managed-by=openova-sandbox-mcp
  so sandbox.secrets.write CANNOT mutate the controller-injected
  `sandbox-tokens` Secret or any other unmanaged Secret in the ns.

Auth: claims.OrgID == env.OrgID required (same as sandbox.db.*),
RequiredCapability = "sandbox.auth" / "sandbox.secrets".

New env vars (sandbox-controller injects on MCP Deployment):

  - SANDBOX_OWNER_UID      — `sandbox-<owner-uid>-secrets` suffix
  - KEYCLOAK_ADMIN_URL     — root of the Keycloak Admin REST API
  - KEYCLOAK_ADMIN_TOKEN   — pre-minted admin bearer
  - KEYCLOAK_PARENT_REALM  — default "master"

No chart bump; mcp-server-only change. go build + go test clean.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 12:19:46 +04:00
e3mrah
d5ea7d9de6
feat(sandbox): sandbox.<sov-fqdn> public URL — DNS + cert SAN + correct parentRefs (#1657)
The Sandbox public-URL flow (sandbox.<sov-fqdn>/sessions/<owner-uid>/*) had
three independent gaps that prevented PR #1641's HTTPRoute from resolving
end-to-end:

1. HTTPRoute parentRefs pointed at "catalyst-public/catalyst-system/https",
   a Gateway that does not exist on a Sovereign. The canonical public
   Gateway is "cilium-gateway/kube-system" (clusters/_template/
   sovereign-tls/cilium-gateway.yaml), the same parent that organization-
   controller's tenant_route.go and the chart's httproute.yaml attach to.
   sectionName is omitted so the HTTPRoute auto-attaches to every listener
   whose hostname matches sandbox.<sov-fqdn> — the wildcard
   *.${SOVEREIGN_FQDN} HTTPS listener already in place per infra/hetzner/
   main.tf locals.parent_domains_listeners_yaml fallback path.

2. The per-name Cilium Gateway cert (clusters/_template/sovereign-tls/
   cilium-gateway-cert.yaml) is a SAN list, not a wildcard. Without
   "sandbox.<sov-fqdn>" in its dnsNames cilium-envoy serves the default
   fallback cert and browsers see NET::ERR_CERT_COMMON_NAME_INVALID.
   This file is the source of the per-zone Secret
   sovereign-wildcard-tls-<sov-fqdn-dashed> the Gateway listener
   references — adding the SAN is the only TLS-side change needed; the
   Gateway listener wildcard is already a hostname match.

3. The parent zone's A-record set is built from CanonicalSovereignSubdomains
   in products/catalyst/bootstrap/api/internal/handler/
   sovereign_dns_records.go. Without "sandbox" the PowerDNS PATCH never
   writes sandbox.<sov-fqdn> A-record → primary LB IP, and the URL
   resolves NXDOMAIN even when the listener + cert are healthy.

End-to-end resolution chain after this PR:

  Browser → sandbox.<sov-fqdn>/sessions/<owner-uid>/  (PowerDNS A record
    points at primary LB IPv4)
  → Hetzner LB :443 → cp-node :30443 (cilium-envoy)
  → Gateway listener https-<sov-fqdn-dashed> on *.<sov-fqdn> matches
    hostname; cert SAN includes sandbox.<sov-fqdn> so TLS terminates
  → HTTPRoute pty-server in sandbox-<owner-uid> namespace matches
    hostname + /sessions/<owner-uid>/ path prefix; URLRewrite strips
    /sessions/<owner-uid>/ → /sessions/
  → backendRef pty-server:7681 in sandbox-<owner-uid> namespace
  → pty-server StatefulSet (PR #1641) serves the session
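
The path-rewrite step in the chain above can be sketched in Go. This is only an illustration of the prefix replacement the HTTPRoute's URLRewrite filter is described as doing; `rewriteSessionPath` is a hypothetical helper, not code from this PR, and it only handles the trailing-slash form of the prefix.

```go
package main

import "strings"

// rewriteSessionPath mimics the described URLRewrite: the matched prefix
// /sessions/<owner-uid>/ is replaced by /sessions/ before the request
// reaches pty-server. ok is false when the path is not under this owner's
// prefix (the route would not match at all).
// Hypothetical helper for illustration only.
func rewriteSessionPath(ownerUID, path string) (rewritten string, ok bool) {
	prefix := "/sessions/" + ownerUID + "/"
	if !strings.HasPrefix(path, prefix) {
		return "", false
	}
	return "/sessions/" + strings.TrimPrefix(path, prefix), true
}
```

For example, with owner `u1` the path `/sessions/u1/abc` would reach pty-server as `/sessions/abc`, while `/sessions/u2/abc` would not match this route.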

Hard rules respected: READ-ONLY clusters, no Chart.yaml bump (only
template content + Go renderer + Go handler list), helm template +
kubectl kustomize clean (verified locally), tests updated to assert the
new parentRefs shape and pass under go 1.23.
2026-05-18 12:15:59 +04:00
github-actions[bot]
5309bb8c39 deploy: bump sandbox-controller image to 63255bf 2026-05-18 08:15:56 +00:00
e3mrah
63255bf172
feat(sandbox-mcp): gitea.pr.create/merge + issue.* + k8s.read.logs (was stubs) (#1656)
Wave 11 promotes the remaining write-surface tools from #1645's stubs
to real handlers, so an agent inside a Sandbox can end-to-end open PRs,
file issues, comment, merge, and pull container logs without leaving the
MCP transport:

  - pkg/gitea: +MergePullRequest, +Issue + IssueComment types, +List/Get/
    Create/CommentOnIssue methods (new issues.go; pulls.go grows the
    merge helper). Same client envelope, same ErrRepoNotFound mapping.
  - mcp-server gitea.go: gitea.pr.create / gitea.pr.merge /
    gitea.issue.list / get / create / comment handlers + JSON Schemas.
    Same HS256 bearer + claims.OrgID match as #1645.
  - mcp-server k8s_read.go: k8s.read.logs via client-go's typed
    kubernetes.Interface (dynamic client doesn't expose Pods/log).
    Bounded fetch — follow=false, tail_lines default 200 capped at 5000,
    1 MiB byte cap, 30s deadline. Long-lived streams stay on the
    catalyst-api WebSocket surface.
  - tests: +merge_issues_test.go (pkg/gitea, 11 cases) + gitea_wave11_test.go
    (mcp-server, 14 cases) covering happy paths, missing-arg validation,
    explicit merge styles, list-after-create idempotency, and the two
    pre-cluster guard rails on k8s.read.logs.
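
The k8s.read.logs bounds listed above (tail_lines default 200 capped at 5000, 1 MiB byte cap) can be sketched as two small pure functions. The names are hypothetical and the real handler may enforce the byte cap on the stream rather than on a buffer; treating a non-positive tail_lines as "use the default" is an assumption.

```go
package main

const (
	defaultTailLines = 200     // applied when the caller passes nothing usable
	maxTailLines     = 5000    // hard upper bound on tail_lines
	maxLogBytes      = 1 << 20 // 1 MiB cap on the returned payload
)

// clampTailLines applies the default and the cap to a caller-supplied value.
// Assumption: zero or negative means "not set".
func clampTailLines(requested int64) int64 {
	if requested <= 0 {
		return defaultTailLines
	}
	if requested > maxTailLines {
		return maxTailLines
	}
	return requested
}

// capLogBytes truncates data to maxLogBytes and reports whether it did so.
func capLogBytes(data []byte) (capped []byte, truncated bool) {
	if len(data) > maxLogBytes {
		return data[:maxLogBytes], true
	}
	return data, false
}
```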

Hard rules honoured: READ-ONLY clusters (k8s.write.* still stubbed),
no chart bump, go build + go test clean. Kept stubbed: sandbox.db.*,
sandbox.auth.*, gitea.release.list (Wave 12+).

Co-authored-by: hatiyildiz <269457768+hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 12:12:41 +04:00
github-actions[bot]
e2e8132b00 deploy: update Catalyst marketplace image to f3915c0 2026-05-18 08:03:45 +00:00
e3mrah
f3915c01fa
test(marketplace): codified customer-journey regression (17 steps) (#1655)
Codifies the 17-step marketplace customer journey (storefront → catalog →
product detail → voucher → signup → subdomain pick → PIN → checkout →
provisioning chain → console redirect) as a hermetic Playwright suite.

Previously the journey was only walked manually by ad-hoc fix-author
agents (see PR #1635 / docs/SESSION-2026-05-17-CONVERGENCE.md). This adds
a regression gate so future PRs catch breakage in any of the 14 spec
tests (17 step labels grouped into 14 Playwright tests — steps 12-15 are
asserted as one API-chain contract since CheckoutStep redirects to
console before the panel-poll UI would render).

Highlights
----------
- core/marketplace/playwright.config.ts — testDir=./playwright,
  workers=1, baseURL from MARKETPLACE_BASE_URL (default
  http://localhost:4321), same posture as
  tests/e2e/playwright/playwright.config.ts.
- core/marketplace/playwright/customer-journey.spec.ts — every backend
  call (/api/catalog/*, /api/auth/*, /api/tenant/*, /api/billing/*,
  /api/provisioning/*) intercepted via page.route() so the run is
  hermetic (npm run build && npm run preview is enough — no real
  catalyst-api / billing / provisioning service required).
- Asserts the PR #1627 fix (deriveConsoleURL host-driven) — Sovereign
  hosts redirect to console.<sov-fqdn> (no /nova), mothership stays on
  console.openova.io/nova.

Verification
------------
npx playwright test customer-journey → 14 passed (2.5m).
2026-05-18 12:02:39 +04:00
github-actions[bot]
18df061895 deploy: bump bp-newapi upstream v0.13.2 chart 1.4.11 2026-05-18 08:00:46 +00:00
e3mrah
0604c5e057
fix(newapi): gate channel render on attestation present (was blocking install when accountId env empty) (#1654)
Convergence wave 11 blocker on t16: bp-newapi HR install fails with

  Error: template: bp-newapi/templates/configmap.yaml:1:4: executing
  "bp-newapi/templates/configmap.yaml" at <include "bp-newapi.assertChannelAttestation" .>:
  channel[0] (qwen3.6-bankdhofar): commercial-contract attestation
  requires accountId

PR #1631 wired the bootstrap-kit overlay so franchised Sovereigns can
opt in to marketplace via `MARKETPLACE_ENABLED=true` — flipping
`defaultChannels.qwenBankDhofar.enabled` to true with envsubst
placeholders for the attestation:

  attestation:
    kind: commercial-contract
    accountId:   ${LLM_BANK_DHOFAR_ACCOUNT_ID:-}
    contractRef: ${LLM_BANK_DHOFAR_CONTRACT_REF:-}

On a Sovereign that has not yet signed the commercial contract those
variables expand to empty strings, and the chart's
`assertChannelAttestation` helper hard-fails the helm template before
any manifest is rendered — newapi install crashes at slot 80 and the
whole bootstrap-kit reconciliation stalls.

Fix (Option A — smallest change, makes the chart actually install):
SKIP composing the qwenBankDhofar channel when
attestation.kind=commercial-contract AND either accountId or contractRef
is empty. NewAPI installs with zero default channels (operator-supplied
`.Values.channels` still compose). Once the operator overlay supplies
the attestation values the channel composes on the next reconcile.

Touches two templates that gate on the same effective channel list:

  - templates/_helpers.tpl `bp-newapi.effectiveChannels` — adds a
    pre-check ($qbdAttReady) that short-circuits the channel composition
    block when attestation is incomplete. The downstream
    `assertChannelAttestation` helper then sees an empty channel list
    for the qwenBankDhofar slot and emits no error.
  - templates/channel-seed-job.yaml — mirrors the same gate so the
    post-install Helm hook Job + RBAC + audit ConfigMap also skip when
    the channel itself was skipped (otherwise the Job would POST a row
    whose ConfigMap entry was omitted from /etc/newapi/channels.yaml).

`helm template platform/newapi/chart` renders cleanly in all three
states:
  - default (qbd.enabled=false) → no channel, no seed Job
  - qbd.enabled=true + empty accountId/contractRef → no channel, no
    seed Job (NEW: pre-1.4.10 this hard-failed)
  - qbd.enabled=true + accountId + contractRef present → channel
    composed normally, seed Job emitted

Chart bumped 1.4.9 → 1.4.10; bootstrap-kit overlay pin bumped
1.4.6 → 1.4.10 so franchised Sovereigns immediately pick up the fix.

READ-ONLY clusters preserved. NO Chart.yaml bump on
bp-catalyst-platform.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 12:00:06 +04:00
e3mrah
d080207c32
feat(sandbox-mcp): sandbox.db.* real impl (CNPG provision/list/get/drop/dump) (#1653)
PR #1645 (Wave 8) wired gitea.* + k8s.read.* + session.* in the MCP
server but left sandbox.db.* as not_implemented stubs. This commit
ships the real handlers using the same dynamic-client pattern.

Tools shipped (all gated on `RequiredCapability=sandbox.db` + claim
OrgID==env.OrgID, all scoped to env.SandboxNamespace):

  - sandbox.db.provision {name, plan?} — POSTs a CNPG Cluster CR
    (default plan: 1 instance, 5Gi PVC, postgres 16, db=app). Returns
    {host:<name>-rw.<ns>.svc.cluster.local, port:5432, dbname, user,
    secretName:<name>-app, secretKey:password}.
  - sandbox.db.list — labels-filtered LIST scoped to the Sandbox ns,
    returns the same connection envelope per item plus a distilled
    status summary (phase, readyInstances, Ready condition).
  - sandbox.db.get {name} — GET one Cluster; refuses to surface a
    Cluster lacking openova.io/managed-by=openova-sandbox-mcp
    (defence-in-depth against an agent fishing for per-Org pair DBs).
  - sandbox.db.drop {name} — DELETE with foreground propagation so the
    operator cascades PVC/Service/Secret cleanup before returning.
    Same managed-by guard as get.
  - sandbox.db.dump {name} — POSTs a one-shot Backup CR
    (`<cluster>-dump-<UTC>`). Returns the Backup name + the Cluster's
    configured barmanObjectStore.destinationPath so the agent can find
    the resulting S3 prefix without polling Backup.status.
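
The connection envelope returned by provision/list can be sketched as follows. Host, port, secret name, and secret key follow the shapes quoted above; the struct itself is hypothetical, and setting the user equal to the database name (`app`) is an assumption based on CNPG's defaults, not something this message states.

```go
package main

import "fmt"

// dbConnection is a hypothetical shape for the envelope described above.
type dbConnection struct {
	Host       string
	Port       int
	DBName     string
	User       string
	SecretName string
	SecretKey  string
}

// connectionFor derives the envelope from the Cluster name and the Sandbox
// namespace, using the CNPG read-write Service and app Secret naming.
func connectionFor(name, namespace string) dbConnection {
	return dbConnection{
		Host:       fmt.Sprintf("%s-rw.%s.svc.cluster.local", name, namespace),
		Port:       5432,
		DBName:     "app",
		User:       "app", // assumption: CNPG default owner matches the db name
		SecretName: name + "-app",
		SecretKey:  "password",
	}
}
```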

Why CNPG Cluster CRs (not a per-Sandbox shared DB): a per-app DB keeps
the tenancy, backup, and restart blast radius scoped to one app, which
matches architecture §3 + §7. Cluster CRs live in the Sandbox's OWN namespace
(sandbox-<owner-uid>); the agent cannot pass `namespace` — it's read
from env. The MCP server never mutates the resulting Pods/PVCs/
Services — the upstream CNPG operator (bp-cnpg) owns those.

Tests (sandbox_db_test.go, 9 cases incl. 5 capability-gate sub-tests):
  - validation (name regex, missing name, unknown plan)
  - default-plan CR shape (apiVersion, kind, labels, spec.instances,
    storage.size, bootstrap.initdb.database, enableSuperuserAccess)
  - connectionFor envelope matches CNPG service-name defaults
  - on-demand Backup CR shape + managed-by label
  - requireSandboxNS guard rails (no env / empty ns / populated)
  - capability gate rejects bearers w/o sandbox.db
  - status summary surfaces phase + Ready condition only

Hard rules respected: NO chart bump, no host-cluster touch — every
mutation lands inside the Sandbox's own namespace via the SA the
sandbox-controller already gives the MCP pod. go build + go vet +
go test clean. Catalogue test updated for new `sandbox.db.get`.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 11:59:56 +04:00
e3mrah
7b77ebe99c
fix(bootstrap-kit): bp-sandbox slot move 61 → 19a to break harbor chicken-and-egg (#1652)
Caught live on t16.omantel.biz convergence: bp-sandbox HR stuck
Reconciling because its chart pull goes through harbor.<sov-fqdn>
(post-handover cutover slot 06a Step-06 phase-1 rewrites every
HelmRepository URL `oci://ghcr.io/openova-io` →
`oci://harbor.<sov-fqdn>/openova-io`), but harbor.<sov-fqdn> is not
reachable yet because bp-harbor itself has not reached Ready —
chicken-and-egg.

Same failure shape as Wave 7 #1610 with bp-hcloud-csi (which was
REMOVED). This PR takes the cleaner long-term alternative: rather than
remove the slot, sequence it AFTER bp-harbor (slot 19) by renumbering
to 19a and adding `bp-harbor` to the HR's dependsOn graph. The Sandbox
MVP Wave 11 slot stays available, with no manual Day-2 add-app
re-introduction needed.

bp-harbor itself does not hit the cycle because its chart pull goes
through harbor.openova.io (the mothership-warmed proxy-cache wired
into k3s registries.yaml at cloud-init time) — NOT through
harbor.<sov-fqdn>.

Diff:
- clusters/_template/bootstrap-kit/61-bp-sandbox.yaml renamed →
  19a-bp-sandbox.yaml; slot label "61" → "19a"; dependsOn adds
  bp-harbor; header documents the move + chicken-and-egg context.
- clusters/_template/bootstrap-kit/kustomization.yaml: 19a slot
  inserted right after 19-harbor.yaml with the post-cutover URL
  rewrite rationale inline; old slot-61 entry replaced with a
  back-pointer comment.

Verified `kubectl kustomize clusters/_template/bootstrap-kit/`
renders clean: bp-sandbox HR keeps slot label, gains
- name: bp-harbor in dependsOn, all other fields unchanged.

No Chart.yaml bump (this is a bootstrap-kit Kustomization-only fix,
not a chart change). READ-ONLY clusters.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 11:56:52 +04:00
github-actions[bot]
51913fe380 deploy: bump sandbox-controller image to ad5163e 2026-05-18 07:54:45 +00:00
e3mrah
ad5163e69a
feat(sandbox-controller): IdleScaler scales pty-server replicas to 0 after configured idle window (#1651)
PR #1641 shipped the `openova.io/sandbox-idle-timeout-minutes` annotation on
every pty-server StatefulSet but no controller was reading it. This closes
the loop:

pty-server (products/sandbox/pty-server/):
  - session.Manager tracks lastActivity; Touch() called on session
    create/stop, WS attach/detach, every WS message in/out, resize/signal.
  - New GET /idle endpoint returns {lastActivityAt, activeSessions}.
  - Unit tests cover the endpoint shape + Touch() bump.

sandbox-controller (core/controllers/sandbox/internal/idlescaler/):
  - New IdleScaler runnable, registered with mgr.Add() in main.go.
  - NeedLeaderElection=true (singleton across HA replicas).
  - Every 60s lists pty-server StatefulSets by label selector
    (app.kubernetes.io/component=pty-server + openova.io/managed-by=catalyst),
    constrained to `sandbox-*` namespaces in code for defence-in-depth.
  - For each: probes the in-cluster Service /idle endpoint, stamps the
    `openova.io/sandbox-last-activity-at` annotation, and patches
    spec.replicas=0 once now-lastActivity exceeds the per-SS
    `openova.io/sandbox-idle-timeout-minutes` annotation (falling back to
    SANDBOX_IDLE_TIMEOUT_MINUTES env, default 30).
  - Probe failure with no prior annotation → skip (next tick); probe
    failure WITH prior annotation → still decide on stale data so a
    degraded probe path doesn't keep a forgotten Pod alive forever.
  - activeSessions > 0 keeps the Pod alive regardless of idle window.
  - Already-zero replicas → idempotent no-op.
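
The per-StatefulSet decision rules above can be condensed into one function. This is a sketch under stated assumptions, not the IdleScaler source: `idleObservation` and `shouldScaleToZero` are hypothetical names, and the caller is assumed to have already resolved the timeout (annotation, then env, then 30 minutes).

```go
package main

import "time"

// idleObservation is a hypothetical snapshot of one StatefulSet on one tick.
type idleObservation struct {
	Replicas        int32
	ProbeOK         bool          // did the /idle probe succeed this tick?
	ActiveSessions  int           // only meaningful when ProbeOK is true
	HasLastActivity bool          // a probe result or a prior annotation exists
	LastActivity    time.Time     // from the probe, else from the annotation
	Timeout         time.Duration // resolved idle window
}

// shouldScaleToZero applies the rules in order: already-zero is a no-op,
// no data at all means skip, live sessions keep the Pod alive, and
// otherwise the idle window decides (also on stale annotation data).
func shouldScaleToZero(obs idleObservation, now time.Time) bool {
	if obs.Replicas == 0 {
		return false
	}
	if !obs.HasLastActivity {
		return false // probe failed and nothing was stamped before
	}
	if obs.ProbeOK && obs.ActiveSessions > 0 {
		return false
	}
	return now.Sub(obs.LastActivity) > obs.Timeout
}
```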

Chart RBAC:
  - ClusterRole gains apps/statefulsets get/list/watch/patch — the ONLY
    cluster-wide write on a non-CR resource, scoped to the controller's
    own managed StatefulSets via the label selector + namespace prefix.

Tests: 9 unit tests covering active-not-idle, idle-scales-zero,
active-sessions-never-scales, probe-fail-no-annotation-skips,
per-SS-annotation-override, namespace-prefix-defence, already-zero-no-op,
default-URL-builder, leader-election-singleton.

Approach: controller polls pty-server's /idle endpoint via cluster-DNS
(smaller diff than embedding a k8s client in pty-server — pty-server
keeps its ~80-line go.mod, no new RBAC inside the per-Sandbox namespace).

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 11:51:36 +04:00
github-actions[bot]
c4fa06a9f4 deploy: bump sandbox-controller image to 3a3ee74 2026-05-18 07:46:53 +00:00
github-actions[bot]
c9fe39a20f deploy: bump bp-newapi upstream v0.13.2 chart 1.4.9 2026-05-18 07:44:23 +00:00
e3mrah
96d2d9bce7
fix(provisioning): set Organization.spec.tenantPublic on product-install (was empty; HTTPRoute reconciler had nothing to render) (#1650)
PR #1644 added Organization.spec.tenantPublic + per-tenant HTTPRoute
reconciler, but nothing set the field — every Org CR's TenantPublic
stayed zero-value, the reconciler short-circuited at the empty
ParentDomain guard, and `<slug>.omani.homes` 404'd at the Cilium
Gateway.

Wire the patch at the only point that knows a tenant's product is
actually Ready: the provisioning service. Both the initial workflow
(`provision.completed`) and the day-2 install path
(`provision.app_ready`) now patch the Organization CR's
spec.tenantPublic with parentDomain (from TENANT_PARENT_DOMAIN env),
subdomain (= slug), backendService (canonical vcluster-synced name),
port 80, and the picked product slug. Last-write-wins on subsequent
installs.

Per docs/INVIOLABLE-PRINCIPLES.md #4 the parent zone flows through
env, never hardcoded — every Sovereign picks its own pool zone.
Empty env disables the patch entirely (legacy tenants keep working
through the Sovereign-wide tenant-wildcard route). Best-effort:
failures don't fail the provision. 404 on the CR is benign (legacy
tenant without an Organization counterpart).

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 11:44:00 +04:00
e3mrah
3a3ee742ec
feat(sandbox-controller): call newapi /admin/tokens/sandbox + write Secret + rotation (was placeholder) (#1643)
Wires the sandbox-controller (PR #1622) to actually mint per-Sandbox
LLM-gateway tokens via the catalyst-api bridge handler shipped in
PR #1638, replacing the Wave 1 placeholder Secret with a real
LLM_GATEWAY_TOKEN-bearing manifest pushed to the per-Org Gitea repo.

Changes:

  - New newapi.Client (core/controllers/sandbox/internal/newapi/) —
    thin HTTP client for POST /admin/tokens/sandbox with the bridge's
    {org_id, user_id, sandbox_id, allowed_channels} body + Bearer
    ADMIN_SECRET auth. Interface so tests can stub.

  - Reconciler extended:
      * NewAPIClient + DefaultChannels + TokenRotationLeadTime fields
      * On every reconcile: decide mint-or-skip from annotation
        openova.io/sandbox-token-expires-at vs. now + lead-time
      * On mint: POST to bridge, stamp expires-at + rotated-at
        annotations on the CR, render token bytes into a new
        gitops manifest secret-newapi-token.yaml committed to the
        per-Org catalyst-tenant repo at sandbox/<owner-uid>/
      * Bridge failure → Failed/TokenMintFailed condition + 30s
        requeue + no gitops writes (fail-loud)
      * Empty DefaultChannels → NoAllowedChannels condition (fail
        earlier than the bridge's 400)

  - gitops.Render:
      * New Inputs.NewAPIToken/NewAPITokenSecretName/NewAPITokenExpiresAt
        /NewAPITokenRotatedAt fields
      * New secret-newapi-token.yaml template — Secret with
        stringData.LLM_GATEWAY_TOKEN + expires-at annotation +
        optional kubectl.kubernetes.io/restartedAt rotation marker
        so Wave 2's pty-server StatefulSet picks up rolling
        restarts on token rotation
      * kustomization.yaml appends the new manifest when token
        present

  - Chart wiring (platform/sandbox/chart):
      * Deployment env: NEWAPI_BASE_URL, NEWAPI_ADMIN_SECRET
        (secretKeyRef from newapi-bp-newapi-token-signing-key,
         optional: true), NEWAPI_DEFAULT_CHANNELS
      * ClusterRole bumped to allow update/patch on the
        sandboxes/ resource (the controller now stamps annotations
        on the CR)

  - platform/newapi/chart/templates/sandbox-token-signing-key-secret.yaml:
      * Added emberstack/reflector annotations so the chart-emitted
        Secret (newapi namespace) mirrors into the sandbox-controller
        namespace by default; reflectorNamespaces is overrideable.
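
The reconciler's mint-or-skip decision (expires-at annotation vs. now + lead time) can be sketched as below. `tokenState` and `needsMint` are hypothetical names; the sketch assumes the annotation has already been parsed into a time and that a token expiring exactly at now + lead time is rotated.

```go
package main

import "time"

// tokenState is a hypothetical view of the CR's token annotations.
type tokenState struct {
	Minted    bool      // false when openova.io/sandbox-token-expires-at is absent
	ExpiresAt time.Time // parsed value of that annotation
}

// needsMint returns true when no token was ever minted or when the current
// token expires within the rotation lead time.
func needsMint(s tokenState, now time.Time, lead time.Duration) bool {
	if !s.Minted {
		return true
	}
	return !now.Add(lead).Before(s.ExpiresAt)
}
```

With a one-hour lead time, a token expiring in 24 hours is left alone and one expiring in 30 minutes is re-minted on the next reconcile.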

Tests:

  - newapi client: happy-path round-trip, 401 surfaces, input
    validation, request validation. 4 cases.
  - sandbox-controller: existing Wave 1 cases (happy/idempotent/
    drift/missing) still pass; 5 new cases for the token path:
    fresh mint + Secret render, rotation on near-expiry, steady-
    state no-mint, bridge failure surfaces condition, no-channels
    misconfig fails early. 9 cases total, all green.

Hard rules honored:
  - No Chart.yaml bump (chart pinning is a release-driver concern)
  - go build + go test ./core/controllers/sandbox/... clean

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 11:43:50 +04:00
e3mrah
8f4b34edd3
test(sandbox-ui): Playwright e2e for landing + settings + session nav (#1649)
Wave 9 regression gate for the Sandbox UI scaffold shipped in PR #1621.
Covers four happy-path surfaces:

- Sidebar Sandbox entry exists + accent-active class on /sandbox
- Landing renders 6 agent cards (aider / claude-code / cursor-agent /
  little-coder / opencode / qwen-code) with Connect Claude Max CTA
- /sandbox/settings BYOS Connect button when disconnected
- /sandbox/$id route resolves + create POST sends agent=aider

Auth gate, deployment self-discovery, SSE events, and sandbox API are
all mocked via page.route so the spec runs against `npm run dev` (Vite
on :5173) with no catalyst-api required. Per-test timeout bumped to 90s
to absorb Vite's cold-cache xterm/tanstack-router module load.

Sovereign-mode env vars required for SovereignSidebar to render:
  VITE_CATALYST_MODE=sovereign \
  VITE_SOVEREIGN_FQDN=sandbox.example.test \
  npm run dev

Local result: 4/4 passed in 2.1m (warm cache).

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 11:41:25 +04:00
github-actions[bot]
2fee03f7d2 deploy: bump sandbox-controller image to c0020d9 2026-05-18 07:40:02 +00:00
e3mrah
c0020d9c33
feat(sandbox): real impls for gitea.* + k8s.read.* MCP tools (was not_implemented stubs) (#1645)
* feat(pkg/gitea): add ListPullRequests + GetPullRequest read API

Wave 8 prerequisite for openova-sandbox-mcp's gitea.pr.list +
gitea.pr.get tools. Mirrors the existing client surface
(CreatePullRequest, ListOrgRepos) with state-filtered pagination and
a get-by-number fetch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(sandbox): real impls for gitea.* + k8s.read.* MCP tools (was not_implemented stubs)

Wave 8 swaps the openova-sandbox-mcp Wave-2 not_implemented stubs for
production-ready handlers on:

- gitea.repo.list / gitea.repo.get (delegates to core/controllers/pkg/gitea)
- gitea.pr.list / gitea.pr.get     (delegates to new ListPullRequests +
  GetPullRequest helpers in pkg/gitea; org-scope check rejects cross-tenant
  owner overrides at tool dispatch time)
- k8s.read.get / k8s.read.list / k8s.read.watch (dynamic.Interface against
  the Sandbox pod's in-cluster SA or SANDBOX_KUBECONFIG; watch is a
  bounded short-watch — long-lived subs land Wave 9 via MCP
  resources/subscribe)
- sandbox.session.whoami / sandbox.session.info (echo per-call Claims +
  Sandbox metadata so the agent can self-discover its scope)

Auth: every tools/call carries a bearer (via _auth.token arg OR
SANDBOX_TOKEN env). main.go validates HS256 against SANDBOX_JWT_SECRET
using the canonical core/services/shared/auth.Claims shape (PR #1619),
strips _auth from the args, installs Claims on ctx, then Registry.Call
gates on capability + org_id-match before reaching the handler.
sandbox.session.* skips the org-scope check (the operator's session
is the operator's regardless of which Org slug their claim carries).
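
The gate order described above (capability first, then org match, with sandbox.session.* exempt from the org check) can be sketched as follows. The `claims` struct and its `Capabilities` field are assumptions for illustration; only `OrgID` is named in these commit messages, and the real Registry.Call may differ.

```go
package main

import "strings"

// claims is a hypothetical subset of the validated JWT claims.
type claims struct {
	OrgID        string
	Capabilities []string // assumed field; the real Claims shape may differ
}

// authorize gates a tool call on capability, then on org_id match.
// sandbox.session.* tools skip the org-scope check.
func authorize(tool, requiredCap, envOrgID string, c claims) bool {
	hasCap := false
	for _, have := range c.Capabilities {
		if have == requiredCap {
			hasCap = true
			break
		}
	}
	if !hasCap {
		return false
	}
	if strings.HasPrefix(tool, "sandbox.session.") {
		return true
	}
	return c.OrgID == envOrgID
}
```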

Stubs retained (Wave 8+):
- sandbox.db.*   (CNPG Cluster CR provisioning)
- sandbox.auth.* (Keycloak realm/client management)
- gitea.pr.create / gitea.pr.merge / gitea.issue.* / gitea.release.*
- k8s.read.logs

Hard rule preserved: k8s.write.* never lands in the MCP surface.

24 new tests (registry catalogue completeness, auth gate, gitea via
httptest stub, JWT round-trip, env-var parsing).

Builds clean against go 1.23 + k8s.io/client-go v0.31.1; module wires
core/controllers + core/services/shared via the same replace pattern
catalyst-bootstrap and every sme-service already use.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 11:36:53 +04:00
github-actions[bot]
c6820e3d4a deploy: bump sandbox-controller image to 9f6354f 2026-05-18 07:33:12 +00:00
e3mrah
a8a56a25f6
fix(org-controller): render per-tenant HTTPRoute so <slug>.omani.homes serves traffic (#1644)
PowerDNS now resolves <slug>.<parentDomain> for every Org mapped onto a
Sovereign's role=sme-pool parent domain (PR #1629), but no HTTPRoute was
attaching that hostname to the tenant's installed product Service. The
Cilium Gateway terminated TLS on the wildcard cert and fell through to
the marketplace tenant-wildcard route — serving the storefront landing
page instead of the tenant's WordPress / Nextcloud / GitLab install.

Fix:

1. Extend Organization CRD with optional spec.tenantPublic
   (parentDomain, subdomain, backendService, backendPort, product).
2. organization-controller renders a Gateway-API HTTPRoute in the Org
   namespace (= slug) attached to cilium-gateway/kube-system when
   parentDomain is set. Skipped silently when unset so existing Orgs
   keep working.
3. Chart-side templates/sme-services/tenant-public-routes.yaml renders
   the same HTTPRoute shape from .Values.tenantRoutes[] for operators
   that prefer static fixtures over the controller's reconcile loop.
4. Tests: TestReconcile_TenantPublic_RendersHTTPRoute and
   TestReconcile_TenantPublic_DisabledByDefault cover both paths.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 11:32:54 +04:00
e3mrah
8888d9edd1
feat(catalog+billing): Sandbox Free/Pro/Ent plans + quota wire (was no plans = broken checkout) (#1642)
PR #1633 added the Sandbox app to seedApps but never wired the matching plan
rows. The marketplace checkout hit "plan_id not found" the moment a customer
picked Sandbox, and PR #1639's sandbox-orchestrator could only mint CRs with
the Wave 1 baseline quota regardless of the picked tier.

This PR closes both gaps in lockstep:

Catalog:
- Plan struct gets ProductSlug + IncludedQuotas fields (back-compat:
  omitempty BSON tags so legacy rows decode fine).
- expectedSandboxPlans() helper canonical-defines the three tiers:
    sandbox-free  0 OMR  1 session, 1 agent,    5 GB, BYOS
    sandbox-pro   9 OMR  3 sessions, 6 agents, 50 GB, BYOS (Popular)
    sandbox-ent  49 OMR  unlimited,  6 agents, 500 GB, BYOS
- seedAllData appends them on fresh seed; seedMissingSandboxPlans
  backfills them on already-populated Sovereigns (idempotent GET-then-
  create, patches missing ProductSlug/IncludedQuotas on legacy rows).
- UpdatePlan persists the two new fields.

Sandbox orchestrator wiring:
- SandboxRequestedPayload.PlanID added; CreateOrg forwards body.PlanID.
- buildSandbox stamps openova.io/plan-id annotation + spec.planId when
  PlanID is non-empty.
- quotaForPlan() maps sandbox-{free,pro,ent} → SandboxQuota; empty or
  unknown plan_id falls through to DefaultQuota (Wave 1 baseline =
  Sandbox Free shape). Hard-coded map mirrors catalog IncludedQuotas so
  tenant-service avoids a compile-time dep on the catalog mongo stack.
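
A hard-coded plan map with a default fallback, as described for quotaForPlan(), might look like the sketch below. The struct fields are a simplification of the tier table above (the real SandboxQuota likely carries more fields), and using -1 for "unlimited" is an assumption.

```go
package main

// unlimited is an assumed sentinel for the Enterprise session count.
const unlimited = -1

// sandboxQuota is a simplified, hypothetical stand-in for SandboxQuota.
type sandboxQuota struct {
	Sessions  int
	Agents    int
	StorageGB int
}

// defaultQuota follows the stated "Wave 1 baseline = Sandbox Free shape".
var defaultQuota = sandboxQuota{Sessions: 1, Agents: 1, StorageGB: 5}

// planQuotas mirrors the three tiers listed under Catalog.
var planQuotas = map[string]sandboxQuota{
	"sandbox-free": {Sessions: 1, Agents: 1, StorageGB: 5},
	"sandbox-pro":  {Sessions: 3, Agents: 6, StorageGB: 50},
	"sandbox-ent":  {Sessions: unlimited, Agents: 6, StorageGB: 500},
}

// quotaForPlan falls back to defaultQuota for empty or unknown plan IDs.
func quotaForPlan(planID string) sandboxQuota {
	if q, ok := planQuotas[planID]; ok {
		return q
	}
	return defaultQuota
}
```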

Tests:
- TestExpectedSandboxPlans_Shape locks slugs, prices, quota keys, the
  Popular flag (sandbox-pro), and the quota ladder.
- TestSandboxHandle_PlanIDStampsAnnotationAndQuota table-test exercises
  all three tiers end-to-end (annotation + spec.planId + spec.quota).
- TestSandboxHandle_PlanIDEmptyKeepsDefaultQuota guards back-compat
  with pre-PR publishers.
- TestSandboxHandle_PlanIDUnknownFallsBackToDefault guards typo'd /
  retired plan IDs.

go build + go test clean for catalog, tenant, billing, provisioning,
shared, marketplace-api.

No Chart.yaml bump, no cluster touch.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 11:31:25 +04:00
e3mrah
9f6354f1e1
feat(sandbox): controller spawns pty-server + MCP Pods (was just namespace+RBAC+PVCs) (#1641)
Wave 8 extension to PR #1622 (Wave-1 sandbox-controller). The previous
slice reconciled a Sandbox CR into namespace + ResourceQuota + RBAC +
PVCs + placeholder Secret — but NO pty-server, NO MCP server. A freshly-
created Sandbox sat there with empty plumbing and no way for the user
to actually run a coding session.

This PR completes the per-Sandbox runtime by extending
core/controllers/sandbox/internal/gitops/manifests.go to render the
four manifests architecture.md §7 enumerates:

- StatefulSet pty-server (replicas = spec.quota.concurrentSessions,
  one Pod per in-flight session per architecture.md §1/§2). Env wired
  per newapi-proxy-contract.md §1: SANDBOX_OWNER_UID, ORG_ID,
  SOVEREIGN_FQDN, NEWAPI_URL, LLM_GATEWAY_URL / OPENAI_BASE_URL,
  LLM_GATEWAY_TOKEN / OPENAI_API_KEY from per-sandbox Secret
  (key llm-gateway-token, optional). When claude-code is in
  spec.agentCatalogue, ANTHROPIC_API_KEY is ALSO wired from the
  per-user BYOS Secret `sandbox-byos-claude-code-<owner-uid>` (key
  access_token, optional) per claude-code-byos.md §3. Repo PVCs mount
  at /workspace/<repo-slug>.
- Deployment openova-sandbox-mcp (architecture.md §3). Companion MCP
  server, talks to pty-server via the in-namespace ClusterIP Service.
- Service pty-server (ClusterIP :7681) — backend for both the MCP
  Deployment and the HTTPRoute.
- HTTPRoute pty-server — publishes
  sandbox.<sov-fqdn>/sessions/<owner-uid>/* → pty-server :7681 via
  the existing catalyst-public Cilium Gateway in catalyst-system.
  PathPrefix rewrite strips /sessions/<owner-uid> so pty-server sees
  its own /sessions/<id> surface.

Knobs are env-plumbed from the chart per Inviolable Principle #4:
- SANDBOX_PTY_SERVER_IMAGE / SANDBOX_MCP_IMAGE — SHA-pinned image
  refs from values.runtime.{ptyServerImage,mcpImage} (fails Helm
  render fast on empty, no silent :latest).
- SANDBOX_NEWAPI_URL — from values.runtime.newapiURL (bootstrap-kit
  overlay derives it from ${SOVEREIGN_FQDN}).
- SANDBOX_LLM_GATEWAY_TOKEN_SECRET / SANDBOX_BYOS_SECRET_PREFIX /
  SANDBOX_IDLE_TIMEOUT_MINUTES — optional with architecture-doc
  defaults.

Idle timeout (architecture.md §7) lands as a StatefulSet annotation
openova.io/sandbox-idle-timeout-minutes — the poll-loop that actually
scales the StatefulSet down on idle ships in a sibling PR (out of
scope for "spawn the Pods"; this PR makes the Pods exist).

Tests cover the full Wave-8 manifest shape: replicas count, identity
env keys, BYOS gating on spec.agentCatalogue, HTTPRoute hostname
binding, kustomization stitching, idempotency. go test
./core/controllers/sandbox/... green; helm template renders cleanly +
required guard fires on missing runtime values.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 11:30:00 +04:00
e3mrah
422da46360
fix(sovereign-tls): cilium-gateway listeners per parentZone (#1640)
Issue #831 follow-on to #827. Previously the Cilium Gateway declared a
single listener pair on `*.${SOVEREIGN_FQDN}` only — tenant URLs under
non-primary parent zones (e.g. wp-foo.omani.homes when the operator
brings omani.homes as the SME pool) hit cilium-envoy's default fallback
cert and TLS-handshake-mismatched. The per-zone wildcard Secret rendered
by products/catalyst/chart/templates/sovereign-wildcard-certs.yaml (PR
#827) existed but had no Gateway listener claiming its hostname.

Fix: render one listener pair (HTTPS:30443 + HTTP:30080) per parent
zone. Materialised at Terraform plan time as a JSON-flow array
(infra/hetzner/main.tf locals.parent_domains_listeners_yaml — jsonencode
of the listener objects iterating decoded parent_domains_yaml), threaded
through Flux postBuild.substitute as PARENT_DOMAINS_LISTENERS_YAML, and
consumed as a scalar value at `listeners: ${PARENT_DOMAINS_LISTENERS_YAML}`
in cilium-gateway.yaml. Each pair's certificateRefs target the per-zone
Secret `sovereign-wildcard-tls-<sanitised-zone>` so listener + cert stay
in lockstep.

Scalar placeholder (not multi-line block) because kustomize-build parses
the YAML before Flux runs envsubst — a placeholder on its own line at
column 0 fails YAML parse. Scalar `${VAR}` parses cleanly; envsubst then
swaps it for the JSON-flow array string, which the apiserver parses as
the real listener list.

Single-zone fallback preserved (var.parent_domains_yaml empty →
[{name: <sovereign_fqdn>, role: primary}]) so legacy single-zone
provisions render 2 listeners (1 HTTPS + 1 HTTP). Multi-zone provisions
(e.g. primary omani.works + sme-pool omani.homes) render 4 listeners.
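
The per-zone listener expansion is done in Terraform; the Go below is only a sketch of the same shape so the 2 / 4 listener counts are easy to check. The type and function names are hypothetical, and sanitising a zone by replacing dots with dashes is an assumption inferred from the `<sov-fqdn-dashed>` naming.

```go
package main

import "strings"

// listener is a hypothetical, trimmed-down Gateway listener.
type listener struct {
	Name       string
	Hostname   string
	Protocol   string
	Port       int
	CertSecret string // empty for plain HTTP
}

// sanitiseZone assumes dots become dashes (omani.homes -> omani-homes).
func sanitiseZone(zone string) string {
	return strings.ReplaceAll(zone, ".", "-")
}

// listenersFor renders one HTTPS + HTTP pair per parent zone.
func listenersFor(zones []string) []listener {
	out := make([]listener, 0, 2*len(zones))
	for _, z := range zones {
		s := sanitiseZone(z)
		out = append(out,
			listener{
				Name:       "https-" + s,
				Hostname:   "*." + z,
				Protocol:   "HTTPS",
				Port:       30443,
				CertSecret: "sovereign-wildcard-tls-" + s,
			},
			listener{Name: "http-" + s, Hostname: "*." + z, Protocol: "HTTP", Port: 30080},
		)
	}
	return out
}
```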

Verification:
  - kubectl kustomize clusters/_template/sovereign-tls/ → clean
  - End-to-end simulation (single-zone, two-zone) renders correct
    listener counts (2 / 4) with correct certificateRefs per zone.
  - Listener naming `https-<sanitised>` / `http-<sanitised>` is unique
    per listener so Gateway controller programs them all (duplicate
    names produce Conflicting status condition).

Files:
  - clusters/_template/sovereign-tls/cilium-gateway.yaml (scalar
    listeners placeholder + comment block explaining the why)
  - infra/hetzner/main.tf (locals.parent_domains_decoded +
    locals.parent_domains_listeners_yaml; threaded into primary CP and
    secondary regions' templatefile() calls)
  - infra/hetzner/cloudinit-control-plane.tftpl (PARENT_DOMAINS_LISTENERS_YAML
    substitute var in sovereign-tls Kustomization block)

Co-authored-by: hatiyildiz <269457768+hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 11:09:26 +04:00
e3mrah
4c83d98765
feat(sandbox): orchestrator listens tenant.sandbox_requested → Sandbox CR materialisation (#1639)
PR #1633 wired CreateOrg to publish `tenant.sandbox_requested` when the
marketplace cart includes the sandbox product. Nobody was subscribing —
the event landed in NATS `catalyst.tenant.sandbox_requested` and aged
out unread, so no Sandbox CR (PR #1622) was ever minted and the
customer sat on a "Provisioning…" spinner forever.

This slice closes the loop. A new SandboxOrchestrator in tenant-service:

- Subscribes via events.MultiSubscriber (PR #1636) to the canonical
  NATS subject + legacy Kafka topic.
- Parses {tenant_id, org_slug, owner_id, owner_email, agents,
  sovereign, requested_at} and resolves the owner email (event field
  → store.GetMemberEmail → owner_id fallback).
- Materialises a Sandbox CR in catalyst-system (SANDBOX_NAMESPACE
  override) via a dynamic client, with spec per architecture §7:
  owner.email + owner.orgRef.slug, default quota (4 CPU / 8 Gi /
  50 Gi / 3 sessions), spec.agentCatalogue from the cart.
- Idempotent: Get-then-Create with AlreadyExists swallowed so NATS
  redeliveries + duplicate marketplace submits stay no-ops; the
  sandbox-controller remains SoR for spec mutations.
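
The idempotency bullet above can be sketched as follows. This is a
stdlib-only illustration, not the orchestrator's code: SandboxClient,
memClient, ensureSandbox and the two sentinel errors are hypothetical
stand-ins for the dynamic client and the apimachinery NotFound /
AlreadyExists errors.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical sentinels standing in for the apiserver's NotFound / AlreadyExists.
var (
	ErrNotFound      = errors.New("not found")
	ErrAlreadyExists = errors.New("already exists")
)

// Sandbox is a trimmed-down stand-in for the Sandbox CR.
type Sandbox struct {
	Namespace  string
	Name       string
	OwnerEmail string
}

// SandboxClient is the minimal surface the sketch needs (illustrative only).
type SandboxClient interface {
	Get(namespace, name string) (*Sandbox, error)
	Create(sb *Sandbox) error
}

// ensureSandbox implements Get-then-Create. It never updates an existing
// object, so redelivered events and duplicate submits are no-ops.
func ensureSandbox(c SandboxClient, want *Sandbox) (created bool, err error) {
	_, err = c.Get(want.Namespace, want.Name)
	if err == nil {
		return false, nil // already materialised
	}
	if !errors.Is(err, ErrNotFound) {
		return false, err // a real lookup failure
	}
	err = c.Create(want)
	if errors.Is(err, ErrAlreadyExists) {
		return false, nil // lost a race with a concurrent delivery: swallow
	}
	if err != nil {
		return false, err
	}
	return true, nil
}

// memClient is an in-memory fake used to exercise ensureSandbox.
type memClient struct{ items map[string]*Sandbox }

func (m *memClient) Get(ns, name string) (*Sandbox, error) {
	if sb, ok := m.items[ns+"/"+name]; ok {
		return sb, nil
	}
	return nil, ErrNotFound
}

func (m *memClient) Create(sb *Sandbox) error {
	key := sb.Namespace + "/" + sb.Name
	if _, ok := m.items[key]; ok {
		return ErrAlreadyExists
	}
	m.items[key] = sb
	return nil
}

func main() {
	c := &memClient{items: map[string]*Sandbox{}}
	sb := &Sandbox{Namespace: "catalyst-system", Name: "acme", OwnerEmail: "owner@example.test"}
	first, _ := ensureSandbox(c, sb)
	second, _ := ensureSandbox(c, sb)
	fmt.Println(first, second)
}
```

Because the function only ever creates, a redelivery carrying a different
payload leaves the stored object untouched, which matches the controller
staying the system of record for spec mutations.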

Wiring in main.go is best-effort: when neither in-cluster config nor
KUBECONFIG is available (CI / dev loops) the orchestrator is skipped
with a Warn; the rest of the tenant service still boots.

Hard rules: no chart bump, no cluster writes outside of the Sandbox
Create call (sandbox-controller reconciles the rest), `go build ./...`
clean, `go test ./...` clean.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 11:09:22 +04:00
github-actions[bot]
22851d980d
deploy: bump bp-newapi upstream v0.13.2 chart 1.4.8
2026-05-18 07:03:09 +00:00
e3mrah
4abd156fee
feat(newapi): real /admin/tokens/sandbox mint impl (was stub from #1619) (#1638)
Replaces the Wave 1b stub that echoed the inbound PAT verbatim with a
real HS256 mint flow the sandbox-controller can call when it rolls out
a fresh Sandbox Pod.

Handler (platform/newapi/internal/handler/sandbox_token.go):
  - Caller auth: shared admin-secret bearer (env NEWAPI_ADMIN_SECRET),
    constant-time compared. 401 on mismatch / missing bearer.
  - Request body: {org_id, user_id, sandbox_id, allowed_channels[]}.
    De-duplicates + scrubs empty channel names so a controller bug
    sending [""] can't mint a token that NewAPI silently treats as
    "no restriction".
  - Mints HS256 JWT signed with NEWAPI_TOKEN_SIGNING_KEY. Claim shape:
    {sub: sandbox_id, org: org_id, user: user_id, channels: [...],
     iat, exp: iat+7d, typ: "sandbox"}.
  - Returns {token, expires_at}.
  - Refuses with 503 when SigningKey or AdminSecret is unset
    (visible chart-wiring gap, not a forgeable-token leak).
  - Removes the previous Claims/jwt.Parse PAT-validation path that
    came with the stub — caller is the controller, not an operator.
  - NewHandlerFromEnv() factory loads + validates env at process
    start so catalyst-api can fail loudly instead of shipping the
    endpoint silently.

Unit tests (sandbox_token_test.go) — 11 cases:
  - happy path (mint + claim shape + signature round-trip)
  - de-dup + empty-channel scrub
  - admin-secret mismatch / missing bearer → 401
  - missing org_id / user_id / sandbox_id / empty channels → 400
  - non-POST → 405
  - unset env → 503
  - mintSandboxToken empty-secret guard + round-trip
  - response does not echo admin secret or signing key

Chart wiring (platform/newapi/chart):
  - New Secret template sandbox-token-signing-key-secret.yaml
    auto-renders with Helm `lookup` + helm.sh/resource-policy: keep
    (same load-bearing pattern as credentials-secret.yaml #943 and
    gitea admin-secret.yaml #830 Bug 2). 64-char alphanumeric values
    for both SIGNING_KEY and ADMIN_SECRET; persistence across
    reconciles is required because a reconcile-time rotation would
    silently invalidate every per-Sandbox token across the Sovereign
    AND break the sandbox-controller's auth path until its Pod
    restarts.
  - values.yaml block sandboxTokenSigningKey.{existingSecret,
    autoProvision, autoSecretName} matching the `credentials`
    convention (operator override > auto-provision > skip-render).
  - No Chart.yaml bump — chart value addition only.

Verification:
  - go build ./platform/newapi/internal/handler/... — clean
  - go test ./platform/newapi/internal/handler/... — 11/11 PASS
  - helm template platform/newapi/chart — Secret renders

How sandbox-controller will use it:
  1. Read NEWAPI_ADMIN_SECRET from mounted Secret newapi-token-signing-key.
  2. POST /admin/tokens/sandbox with bearer + body
     {org_id: <Sandbox.spec.owner.orgRef.slug>,
      user_id: <Sandbox.spec.owner.email>,
      sandbox_id: <Sandbox.metadata.uid>,
      allowed_channels: ["qwen3.6-bankdhofar"]}.
  3. Write returned token into Secret/sandbox-<uid>-newapi-token.
  4. Mount that Secret into the Sandbox Pod as LLM_GATEWAY_TOKEN.
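
Step 2 of the flow above, sketched from the caller's side. This is not the
sandbox-controller's code: requestSandboxToken and newFakeNewAPI are
hypothetical helpers, the fake server returns canned placeholder values, and
expires_at is kept as raw JSON because the commit does not state its
encoding. Any 2xx status is treated as success since the exact success code
is not given either.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// mintRequest matches the request body keys named in the commit message.
type mintRequest struct {
	OrgID           string   `json:"org_id"`
	UserID          string   `json:"user_id"`
	SandboxID       string   `json:"sandbox_id"`
	AllowedChannels []string `json:"allowed_channels"`
}

// mintResponse keeps expires_at raw: its wire format is not specified in the source.
type mintResponse struct {
	Token     string          `json:"token"`
	ExpiresAt json.RawMessage `json:"expires_at"`
}

// requestSandboxToken POSTs the mint request with the admin secret as a bearer token.
func requestSandboxToken(client *http.Client, baseURL, adminSecret string, req mintRequest) (*mintResponse, error) {
	body, err := json.Marshal(req)
	if err != nil {
		return nil, err
	}
	httpReq, err := http.NewRequest(http.MethodPost, baseURL+"/admin/tokens/sandbox", bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	httpReq.Header.Set("Authorization", "Bearer "+adminSecret)
	httpReq.Header.Set("Content-Type", "application/json")

	resp, err := client.Do(httpReq)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode/100 != 2 {
		return nil, fmt.Errorf("mint failed: %s", resp.Status)
	}
	var out mintResponse
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return nil, err
	}
	return &out, nil
}

// newFakeNewAPI is a test double only; it uses a plain string compare and
// returns placeholder values, unlike the real handler.
func newFakeNewAPI(adminSecret string) *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.Method != http.MethodPost || r.URL.Path != "/admin/tokens/sandbox" {
			http.NotFound(w, r)
			return
		}
		if r.Header.Get("Authorization") != "Bearer "+adminSecret {
			http.Error(w, "unauthorized", http.StatusUnauthorized)
			return
		}
		w.Header().Set("Content-Type", "application/json")
		fmt.Fprint(w, `{"token":"fake-token","expires_at":"2026-05-25T00:00:00Z"}`)
	}))
}

func main() {
	srv := newFakeNewAPI("demo-admin-secret")
	defer srv.Close()
	resp, err := requestSandboxToken(srv.Client(), srv.URL, "demo-admin-secret", mintRequest{
		OrgID: "acme", UserID: "owner@example.test", SandboxID: "uid-123",
		AllowedChannels: []string{"qwen3.6-bankdhofar"},
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(resp.Token)
}
```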

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 11:02:40 +04:00
e3mrah
401ab6713a
feat(catalyst-api): /api/v1/sandbox/sessions CRUD + sandboxes GVR in k8sCache + cutover-driver RBAC (#1637)
Wires the catalyst-api backend the Sandbox FE (PR #1621 — getSandboxes /
createSandbox / getByosStatus in sandbox.api.ts) has been calling into.
Without this handler the /sandbox surface on the Sovereign Console rendered
its empty state forever — every getSandboxes() 404'd at the catalyst-api
ingress and every "Start a session" click hit the same wall.

Handler — products/catalyst/bootstrap/api/internal/handler/sandbox_sessions.go
- GET    /api/v1/sandbox/sessions          — list Sandbox CRs in the
                                              operator's Org namespace
- POST   /api/v1/sandbox/sessions          — create Sandbox CR with agent
                                              validated against the 6-agent
                                              catalogue (aider / claude-code /
                                              cursor-agent / little-coder /
                                              opencode / qwen-code)
- GET    /api/v1/sandbox/sessions/{id}     — fetch single Sandbox detail
- DELETE /api/v1/sandbox/sessions/{id}     — graceful delete (the controller
                                              fires finalizers + cleans up
                                              the per-Sandbox vcluster
                                              namespace + PVCs + RBAC)

Client resolution mirrors the Family E compliance + k8s_resource_actions.go
seam: k8sCache.Factory.DynamicClientFor(resolveChrootClusterID("")) is the
primary path; sovereignDepsFor() — rest.InClusterConfig() — is the chroot
in-cluster fallback per feedback_chroot_in_cluster_fallback.md. Both paths
return 503 when unavailable so the FE renders its "API pending" pill rather
than a spinner.

Org-scoping uses claims.Org (the org_id Keycloak claim PR #1619 lit up)
for the CR namespace + spec.owner.orgRef.slug. Single-tenant chroots
without an org_id fall back through CATALYST_SANDBOX_DEFAULT_NAMESPACE
to a sensible default per docs/INVIOLABLE-PRINCIPLES.md #4. Wave-1 quota
defaults (4 CPU / 8Gi memory / 50Gi storage / 3 concurrent sessions)
mirror products/sandbox/docs/architecture.md §7 — the FE doesn't yet
expose a quota picker.
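
The namespace fallback chain and Wave-1 quota defaults above, as a small Go
sketch. Assumptions: the org claim is used verbatim as the namespace, the
final default is passed in as a parameter because the commit does not name
it, and the Quota field names are illustrative rather than the CRD's.

```go
package main

import (
	"fmt"
	"os"
)

// Env var named in the commit message.
const defaultNamespaceEnv = "CATALYST_SANDBOX_DEFAULT_NAMESPACE"

// Quota field names are illustrative; values follow the Wave-1 defaults.
type Quota struct {
	CPU                int
	Memory             string
	Storage            string
	ConcurrentSessions int
}

func defaultQuota() Quota {
	return Quota{CPU: 4, Memory: "8Gi", Storage: "50Gi", ConcurrentSessions: 3}
}

// resolveNamespace walks the chain: org claim, then env override, then the
// caller-supplied builtin default (hypothetical parameter). getenv is
// injected so the chain is testable without touching the process env.
func resolveNamespace(claimOrg string, getenv func(string) string, builtin string) string {
	if claimOrg != "" {
		return claimOrg
	}
	if ns := getenv(defaultNamespaceEnv); ns != "" {
		return ns
	}
	return builtin
}

func main() {
	ns := resolveNamespace("", os.Getenv, "placeholder-namespace")
	fmt.Println(ns, defaultQuota())
}
```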

Status projection: CRD vocabulary (Pending|Provisioning|Ready|Failed)
maps to FE vocabulary (pending|running|stopped|failed|unknown) in
mapSandboxStatus so a fresh Sandbox shows the spinner rather than
"unknown" until the controller catches up.

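A sketch of the status projection just described. The commit names both
vocabularies but not the pairing, so the mapping below is an assumption:
Provisioning is folded into "pending", Ready becomes "running", and an empty
phase (controller has not written status yet) also shows "pending". The
string-based signature is used for brevity.

```go
package main

import "fmt"

// mapSandboxStatus projects a raw CRD phase onto the FE status vocabulary.
// The exact pairing is ASSUMED; only the two vocabularies come from the source.
func mapSandboxStatus(rawPhase string) string {
	switch rawPhase {
	case "", "Pending", "Provisioning":
		// A fresh CR with no status yet should show the spinner, not "unknown".
		return "pending"
	case "Ready":
		return "running"
	case "Failed":
		return "failed"
	default:
		// Any phase this projection does not recognise.
		return "unknown"
	}
}

func main() {
	for _, p := range []string{"", "Pending", "Provisioning", "Ready", "Failed", "Other"} {
		fmt.Printf("%q -> %s\n", p, mapSandboxStatus(p))
	}
}
```
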
k8sCache.DefaultKinds — products/catalyst/bootstrap/api/internal/k8scache/kinds.go
- Adds sandbox.openova.io/v1 Sandbox so the generic /k8s/{kind} surface
  enumerates Sandboxes the same way it does Applications + UserAccess.
  Per feedback_chroot_in_cluster_fallback.md every new GVR here needs a
  matching rule on the cutover-driver SA.

Cutover-driver RBAC — products/catalyst/chart/templates/clusterrole-cutover-driver.yaml
- Adds sandboxes.sandbox.openova.io with verbs split per
  feedback_rbac_create_no_resourcenames.md:
    rule 1: ["create"]
    rule 2: ["get","list","watch","delete"]
- Read-only on status (the controller owns status); write is spec-only
  on POST + the apiserver delete on DELETE.

Routes — products/catalyst/bootstrap/api/cmd/api/main.go
- Registered inside the RequireSession group alongside the existing
  /api/v1/sandbox/byos/claude-code/* surface; same auth gate, same
  patternless leading "/api/v1/sandbox/...".

Verified: go build clean, go vet clean, k8scache test suite green
(2.7s), helm template renders the new RBAC block.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 10:45:05 +04:00