Commit Graph

2349 Commits

Author SHA1 Message Date
e3mrah
c1a364b631
fix(httproutes): retarget guacamole-server + openova-flow-server to cilium-gateway in kube-system (Refs TBD-G6, C12-004) (#1692)
On t22 (omantel.biz fresh Sovereign) 2 of 15 HTTPRoutes went
Accepted=False because their parentRef pointed at a gateway that
does not exist on any Sovereign:

  catalyst-system/guacamole-server     -> gateway-system/cilium-gateway
  catalyst-system/openova-flow-server  -> kube-system/catalyst-gateway

The canonical Sovereign Gateway is kube-system/cilium-gateway,
installed by bootstrap-kit/01-cilium.yaml and used by every other
HTTPRoute (catalyst-api, catalyst-ui, marketplace, gitea, harbor,
keycloak, grafana, hubble-ui, openbao, powerdns, tenant-wildcard).
gateway-system does not exist; catalyst-gateway does not exist.

Fixes:

  - platform/guacamole/chart/values.yaml — default
    guacamole.httproute.parentRef.namespace: gateway-system -> kube-system

  - clusters/_template/bootstrap-kit/56-bp-openova-flow-server.yaml —
    flowServer.httproute.gatewayRef.name: catalyst-gateway -> cilium-gateway
    (namespace already kube-system, untouched)

Verified on t22: all 15 HTTPRoutes now Accepted=True after chart bump
+ Flux reconcile.

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 16:38:17 +04:00
github-actions[bot]
93fa6c53ed deploy: update sme service images to 8878938 + bump chart to 1.4.165 2026-05-18 12:37:54 +00:00
e3mrah
8878938a43
fix(ci): bump sme-services Containerfiles golang 1.22 → 1.26 (unblock 5 stranded fixes) (#1691)
Every services-build run since 2026-05-18 06:32 UTC failed with
"go: go.mod requires go >= 1.26.0 (running go 1.22.12; GOTOOLCHAIN=local)"
because a recent go.mod bump to `go 1.26.0` was not paired with a
Containerfile base-image bump.

5 stranded fixes that never produced new image SHAs:
- PR #1683 fix(billing): consume catalyst.usage.recorded from
  CATALYST_SME stream (was creating overlapping CATALYST_USAGE)
- PR #1684 fix(provisioning): set Organization.spec.tenantPublic
- PR #1685 fix(catalog+billing): Sandbox Free/Pro/Ent plans + quota
- PR #1686 feat(sandbox): orchestrator listens tenant.sandbox_requested
- test(sandbox): integration tests for orchestrator + sessions API

The stranded billing image is the root cause of every voucher 502 on
t22 and blocks the full marketplace customer journey (steps 9, 10, 15
all fail). t22 billing Pod is in CrashLoopBackOff with the exact NATS
subject-overlap signature PR #1683 fixes.

Bumps all 10 service Containerfiles (auth/billing/catalog/catalyst-
catalog/domain/gateway/metering-sidecar/notification/provisioning/
tenant) to golang:1.26-alpine, matching the toolchain in go.mod.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 16:36:39 +04:00
github-actions[bot]
5407102d6b deploy: bump sandbox-controller image to 8017700 2026-05-18 12:33:11 +00:00
github-actions[bot]
5ee02f7d36 deploy: bump sandbox-mcp-server image to 8017700 2026-05-18 12:31:53 +00:00
e3mrah
8017700ad4
feat(sandbox): tier-bound MCP capabilities (Free/Pro/Ent plans gate tool access) (#1690)
Stop handing every Sandbox session the full MCP surface. Each per-Sandbox
NewAPI token now carries a plan-derived capability allowlist that the MCP
server enforces against per-tool RequiredCapability via Claims.HasCapability:

  - Free: read-only k8s + gitea read + session/rag/skills
  - Pro:  + sandbox.db.* + sandbox.storage.* + sandbox.preview.* +
          sandbox.auth.* + sandbox.secrets.* + marketplace.* + flux.status
  - Ent:  + sandbox.deploy.{staging,production,...} + sandbox.stripe.* +
          flux.{reconcile,suspend,resume} + gitea.pr.{create,merge} +
          gitea.issue.*

Wiring:
  - Sandbox CRD spec gains planId + capabilities[] (operator overlay).
  - Sandbox sandboxapi.{CapabilitiesForPlan,ResolveCapabilities} is the
    SoT; tenant orchestrator carries an exact-mirror capabilitiesForPlan
    (no controllers-module dep — same isolation pattern quotaForPlan
    uses).
  - sandbox-controller threads spec.capabilities (falling back to plan)
    into newapi.MintRequest.
  - catalyst-api bridge handler accepts capabilities[] on the wire and
    encodes it as the JWT `capabilities` claim (omitted when empty).
  - Claims.HasCapability gains wildcard prefix matching (`sandbox.db.*`
    satisfies `sandbox.db.provision`, `sandbox.db`, etc.) so plan grants
    stay coarse. Plain stem matches WITHOUT a wildcard are intentionally
    rejected — the production second-gate in sandbox_deploy.go stays
    honest.
  - MCP registry: every gated tool now carries its granular dotted
    RequiredCapability (`sandbox.db.provision`, `gitea.pr.list`, …).
    Read-only / session tools previously ungated also get granular
    grants so Free tokens can browse without inheriting the write
    surface.
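
The wildcard rule described for Claims.HasCapability can be sketched as below. This is a minimal illustration, not the code in shared/auth: the function name `hasCapability` and its signature are hypothetical, and treating an exact string match as a valid grant is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// hasCapability is a hypothetical stand-in for Claims.HasCapability.
// A grant ending in ".*" satisfies its stem and anything below the stem.
// A plain stem without a wildcard only satisfies itself (assumed exact match).
func hasCapability(granted []string, required string) bool {
	for _, g := range granted {
		if g == required {
			return true
		}
		stem, isWildcard := strings.CutSuffix(g, ".*")
		if !isWildcard {
			continue // plain stems never act as prefixes
		}
		if required == stem || strings.HasPrefix(required, stem+".") {
			return true
		}
	}
	return false
}

func main() {
	pro := []string{"sandbox.db.*", "flux.status"}
	fmt.Println(hasCapability(pro, "sandbox.db.provision"))                              // wildcard covers the child
	fmt.Println(hasCapability(pro, "sandbox.db"))                                        // wildcard covers the stem
	fmt.Println(hasCapability([]string{"sandbox.deploy"}, "sandbox.deploy.production")) // plain stem is rejected
}
```

Rejecting plain stems keeps a coarse grant such as `sandbox.deploy` from silently widening into `sandbox.deploy.production`.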

No Chart.yaml bump: the CRD changes are purely additive, so existing
Sandbox CRs still parse. A token with an empty capabilities list is
downgraded to introspection only, matching callers from before PR #1671.

Tests: shared/auth/claims_test.go (wildcard matrix),
sandboxapi/capabilities_test.go (plan ladder + spec override),
sandbox_token_test.go (capabilities round-trip + omit-on-empty),
sandbox_controller_test.go (plan-derived + spec-override mint),
sandbox_consumer_test.go (orchestrator stamps spec.capabilities), plus
updates to every per-namespace registry test asserting new granular
RequiredCapability values.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 16:30:00 +04:00
github-actions[bot]
3349690728 chore(deploy): bump openova-flow-adapter-flux image to 00eeff2 [skip ci] 2026-05-18 12:17:03 +00:00
e3mrah
00eeff241a
feat(openova-flow): Sandbox node type on flow canvas (CR + pty/MCP Pod lifecycle visible) (#1689)
Adds Sandbox (sandbox.openova.io/v1) + sandbox-component Pod informers
to the openova-flow-adapter-flux DaemonSet, plus a kind-aware glyph in
the FlowCanvas. Each Sandbox CR renders as a bubble pulled under its
tenant-org (`contains` edge to <region>:org:<slug>). Each pty-server
/ openova-sandbox-mcp Pod renders under its parent Sandbox.

Adapter (adapter-flux):
  - hr_informer.go: second factory watches all namespaces for the
    new SandboxGVR + PodGVR (Sandboxes live in Org vclusters, not
    flux-system). Mirrors the HR handler — dedupe on (id, status),
    upsert-nodes + upsert-rels POST, delete-nodes on removal.
  - sandbox_mapper.go: pure transform.
    BuildFromSandbox: `.status.phase` → palette
    (Pending/Provisioning/Ready/Failed → pending/running/succeeded/failed),
    label = owner email, meta.kind = "Sandbox",
    contains edge to <region>:org:<slug>.
    BuildFromSandboxPod: only emits for pty-server / openova-sandbox-mcp
    components (filter at mapper boundary so dedupe map stays clean);
    meta.kind = "SandboxPod", contains edge to parent Sandbox (resolved
    via `sandbox.openova.io/name` label or `sandbox-<name>` namespace
    stem). CrashLoopBackOff / ImagePullBackOff → failed.
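
The phase-to-palette transform listed above amounts to a small lookup. The sketch below is illustrative only: the function names are hypothetical, and falling back to "pending" for an unknown or empty phase is an assumption the commit does not state.

```go
package main

import "fmt"

// phasePalette mirrors the mapping in the commit message:
// Pending/Provisioning/Ready/Failed -> pending/running/succeeded/failed.
var phasePalette = map[string]string{
	"Pending":      "pending",
	"Provisioning": "running",
	"Ready":        "succeeded",
	"Failed":       "failed",
}

// paletteForPhase is a hypothetical helper. Unknown phases fall back to
// "pending" (an assumption, not taken from the source).
func paletteForPhase(phase string) string {
	if p, ok := phasePalette[phase]; ok {
		return p
	}
	return "pending"
}

// isFailedWaitingReason reports whether a container waiting reason should
// render the Pod as failed, per the commit message.
func isFailedWaitingReason(reason string) bool {
	return reason == "CrashLoopBackOff" || reason == "ImagePullBackOff"
}

func main() {
	for _, phase := range []string{"Pending", "Provisioning", "Ready", "Failed"} {
		fmt.Printf("%s -> %s\n", phase, paletteForPhase(phase))
	}
	fmt.Println(isFailedWaitingReason("CrashLoopBackOff"))
}
```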

Canvas (FlowCanvas.tsx):
  - NodeGlyph component swaps the bubble glyph by `node.meta.kind`:
    Sandbox → terminal/monitor SVG matching the Sovereign sidebar's
    Sandbox nav icon; SandboxPod → compact "›_" prompt; otherwise the
    legacy ◇/✓/✗/◐/○ text glyph.
  - data-meta-kind attribute on the bubble group for e2e selectors.

Test coverage:
  - 11 new adapter mapper tests (Sandbox phase mappings, family-label
    override, region fallback, missing-org fallback, Pod ready/not-ready
    /CrashLoop/non-sandbox-skip/namespace-stem-parent).
  - 3 new canvas tests (Sandbox glyph, SandboxPod glyph, legacy
    fallback when meta.kind absent).
  - Full suite GREEN: adapter-flux 11→22 tests, canvas 22→25 tests,
    server tests unchanged 14 GREEN. go vet clean, tsc --noEmit clean.

No Chart.yaml bump — the adapter is shipped as part of the existing
openova-flow-adapter-flux DaemonSet image; the new GVRs are reconciled
in-process at next Pod roll.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 16:14:03 +04:00
github-actions[bot]
9d2f7f1626 deploy: update catalyst images to 96674b7 2026-05-18 11:56:58 +00:00
e3mrah
96674b71c9
fix(ci+catalyst-api): hold deploy-bot bumps when any prov is in-flight (was rolling catalyst-api Pod mid-tofu-apply, abandoning provs) (#1688)
Context — t13/t17/t21 incident, 2026-05-17. catalyst-api is single-replica
with strategy: Recreate; the OpenTofu workdir lives on a /tmp emptyDir that
dies with the Pod. When this workflow bumped the image SHA mid-prov, Flux
rolled the Pod and killed `tofu apply` mid-resource. The on-disk record was
rewritten to status=failed by restoreFromStore on the new Pod, but Hetzner
resources tagged with the abandoned deployment-id stayed orphaned and
required manual `hcloud` cleanup. Three consecutive provs died this way in
one afternoon.

Option C (smallest blast radius): gate the deploy-bot at the workflow level.

  1. New public endpoint GET /api/v1/deployments/in-flight-count on
     catalyst-api. Returns {count, ids} of deployments in Phase-0 in-flight
     status (pending / provisioning / tofu-applying / flux-bootstrapping).
     Phase-1 (phase1-watching) is observational and resumes across Pod
     restarts via resumePhase1Watch, so it does NOT block. Adopted
     deployments are excluded. No FQDNs / owner emails in the response —
     same information-disclosure posture as /api/v1/subdomains/check.
     Unauthenticated; the deploy-bot has no session cookie.

  2. .github/workflows/catalyst-build.yaml `deploy` job polls this endpoint
     before bumping values.yaml. count==0 → green light. count>0 → sleep
     20s and retry. Hard cap 30 min (a stuck prov must not block all
     future deploys — that would be the worst possible failure mode for a
     CI gate). Fail-open on any non-200 / network error so the gate
     cannot itself become an outage.
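
The gate's decision logic (poll, retry every 20s, 30 min hard cap, fail-open) can be sketched as follows. The real gate lives in a GitHub Actions workflow step, so this Go version is only an illustration: `fetchCount`, `waitForQuiet`, and the returned reason strings are hypothetical names, and the HTTP call is abstracted behind a function so the loop can be exercised without a network.

```go
package main

import (
	"fmt"
	"time"
)

const (
	pollInterval = 20 * time.Second
	hardCap      = 30 * time.Minute
)

// fetchCount is a hypothetical abstraction over
// GET /api/v1/deployments/in-flight-count. It returns the HTTP status,
// the parsed count, and any network error.
type fetchCount func() (status int, count int, err error)

// noSleep lets the loop run instantly in simulations.
var noSleep = func(time.Duration) {}

// waitForQuiet returns why the deploy may proceed:
// "clear" (count==0), "fail-open" (non-200 or network error),
// or "timeout" (hard cap reached while provs were still in flight).
func waitForQuiet(fetch fetchCount, interval, limit time.Duration, sleep func(time.Duration)) string {
	var waited time.Duration
	for {
		status, count, err := fetch()
		switch {
		case err != nil || status != 200:
			return "fail-open"
		case count == 0:
			return "clear"
		case waited >= limit:
			return "timeout"
		}
		sleep(interval)
		waited += interval
	}
}

func main() {
	// Simulated endpoint: two provs in flight, then one, then none.
	counts := []int{2, 1, 0}
	i := 0
	fetch := func() (int, int, error) {
		c := counts[i]
		if i < len(counts)-1 {
			i++
		}
		return 200, c, nil
	}
	fmt.Println(waitForQuiet(fetch, pollInterval, hardCap, noSleep))
}
```

In every branch the deploy eventually proceeds; the gate only delays it, which is what keeps a stuck prov or a broken endpoint from blocking releases.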

Notes:
  - Mothership URL configurable via vars.CATALYST_API_URL (defaults to
    https://console.openova.io). Sovereign chroot self-deploys can point
    to their local catalyst-api.
  - First-rollout safe: the endpoint does not exist on the LIVE
    mothership until THIS PR's image lands, so the first run after merge
    falls through the 404 branch and proceeds. Subsequent runs benefit
    from the gate.
  - NOT a Chart.yaml bump. The deploy-bot itself bumps the literal image
    refs in chart templates (existing behaviour), so the new endpoint
    reaches Sovereigns through the normal chart-rebake path.

Tests: handler/deployments_in_flight_count_test.go covers Phase-0 vs
Phase-1 vs terminal vs adopted classification + empty-store green light.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 15:54:54 +04:00
github-actions[bot]
6a7bd14784 deploy: update catalyst images to f6543dd
2026-05-18 11:14:49 +00:00
e3mrah
f6543dd488
chore(release): chart 1.4.163→1.4.164 — Wave 17 collector (all 7 t20 fixes baked) (#1687)
The t20 diagnostic identified 7 root causes blocking fresh prov convergence;
all 7 fix PRs are merged. This collector bumps the chart pin to republish the
1.4.164 artifact with everything baked in:

- #1681 harborPublicURL → registry.<sov> (cutover URL chicken-and-egg)
- #1682 single-zone bare listener names (sectionName collision)
- #1683 NATS billing consumer on CATALYST_SME (stream overlap)
- #1684 guacamole mount /home/guacamole (entrypoint rm)
- #1685 newapi-oidc Secret via Helm lookup
- #1686 catalyst-api Containerfile COPY core/services/shared (build pipeline unblocked since #1658)
- Wave 11 hcloud-csi cycle removal (#1610)
- Wave 14 chart 1.4.163 (#1666)
- Wave 15+15b sandbox no-upstream + default-off

Plus deploy-bot auto-bumps: catalyst-api/ui 22f30ce, sandbox-mcp-server.

Per session_2026_05_18_overnight_22prs.md feedback_bootstrap_kit_pin_lag.md:
ALWAYS bump BOTH Chart.yaml + bootstrap-kit pin together.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 15:12:40 +04:00
github-actions[bot]
c508a3cff4 deploy: update catalyst images to 22f30ce 2026-05-18 11:05:09 +00:00
e3mrah
22f30ce246
fix(catalyst-api): COPY core/services/shared/ in Containerfile (build broken since #1658; deploy-bot stalled — chicken-and-egg unblock for all post-#1658 fixes reaching Sovereigns) (#1686)
* fix(catalyst-api): COPY core/services/shared/ in Containerfile

PR #1658 added a second replace directive to
products/catalyst/bootstrap/api/go.mod pointing
github.com/openova-io/openova/core/services/shared at
../../../../core/services/shared (in-tree consume, same pattern as
core/controllers from #1152). The Containerfile was updated to wire the
controllers tree with COPY core/controllers/ /core/controllers/ but the
matching COPY for core/services/shared/ was missed.

Every catalyst-api build since #1658 fails at `go mod download` with:

  go: github.com/openova-io/openova/core/services/shared@v0.0.0-... \
  (replaced by ../../../../core/services/shared): reading \
  /core/services/shared/go.mod: open /core/services/shared/go.mod: \
  no such file or directory

Effect: the deploy-bot stalled on catalyst-api bumps, so every fresh
Sovereign provision shipped the stale pre-#1658 image (e7b2062) and all
post-#1658 fixes (PARENT_DOMAINS_LISTENERS_YAML wiring, SME bridge token
mints, etc.) were silently absent from the runtime.

Fix is one line: COPY core/services/shared/ /core/services/shared/
placed immediately after the controllers COPY, mirroring the same
relative-path math: from WORKDIR=/app the four `..` segments in
../../../../core/services/shared climb past the filesystem root, where
`..` resolves to the root itself, so the path lands at
/core/services/shared.
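
The path resolution can be checked with the standard library. This sketch assumes the go.mod that carries the replace directive sits directly in /app inside the build container; `resolveReplace` is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"path"
)

// resolveReplace joins the module directory with a relative replace target.
// path.Join cleans the result, and cleaning drops any ".." that would climb
// above the root of a rooted path.
func resolveReplace(moduleDir, replaceTarget string) string {
	return path.Join(moduleDir, replaceTarget)
}

func main() {
	fmt.Println(resolveReplace("/app", "../../../../core/services/shared"))
	// prints "/core/services/shared"
}
```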

Verified the failure mode on run 26028993151 (commit 4cf670b6, latest
main): exact log line matches. No Chart.yaml bump, no cluster changes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* deploy: update catalyst images to 2350e42

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
2026-05-18 15:01:58 +04:00
github-actions[bot]
4cf670b6df deploy: bump sandbox-mcp-server image to ffb79aa 2026-05-18 10:55:22 +00:00
github-actions[bot]
5e06bf843a deploy: bump bp-newapi upstream v0.13.2 chart 1.4.14 2026-05-18 10:54:28 +00:00
github-actions[bot]
645d5282e2 deploy: bump bp-guacamole upstream 1.5.5 chart 0.1.21 2026-05-18 10:54:17 +00:00
e3mrah
798220413d
Merge pull request #1685 from openova-io/fix-t20-newapi-oidc-secret-materialization
fix(infra): Crossplane provider-hcloud package URL (xpkg.upbound.io → xpkg.crossplane.io after 2025 contrib migration)
2026-05-18 14:53:54 +04:00
Emrah Baysal
a01df29993 fix(newapi): create newapi-oidc Secret via Helm lookup from keycloak (was dangling reference)
Before this fix, the bootstrap-kit overlay at
clusters/_template/bootstrap-kit/80-newapi.yaml:171 set
`auth.adminUI.keycloak.existingSecret: newapi-oidc`, but nothing in the
chart or in the operator overlay materialised that Secret. The Pod
stayed in `CreateContainerConfigError: secret "newapi-oidc" not
found`, blocking the entire bp-newapi HR from reaching Ready (t20
debug matrix Fix #6).

New template templates/keycloak-client-secret.yaml uses Helm `lookup`
to retrieve the existing Secret bytes on every reconcile (idempotent —
preserves the OIDC client secret across upgrades), falling back to
`randAlphaNum 32` on first install. Mirrors the existing canonical
seam in the same chart (templates/credentials-secret.yaml issue #943,
templates/sandbox-token-signing-key-secret.yaml PR #1638) — both use
the same lookup-or-generate pattern with helm.sh/resource-policy: keep.

The sister chart platform/guacamole/chart/templates/keycloak-client-
secret.yaml uses a SealedSecret placeholder + a bootstrap Job hook;
THIS chart already standardised on the simpler `randAlphaNum + Helm
lookup` pattern, so this Secret follows the same seam to keep the
chart's credential-materialisation strategy consistent.

Gated on `auth.adminUI.mode=keycloak` AND non-empty
`auth.adminUI.keycloak.existingSecret`, so non-keycloak installs
render nothing extra.

No Chart.yaml bump — pure template addition.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 12:53:40 +02:00
e3mrah
7bfb65402e
fix(guacamole): mount /home/guacamole instead of /home/guacamole/.guacamole (entrypoint rm fails on mount point) (#1684)
The official Apache Guacamole image entrypoint runs `rm -rf
$GUACAMOLE_HOME` (== `/home/guacamole/.guacamole`) before re-populating
the directory on every start. When the chart mounted an emptyDir
directly at `/home/guacamole/.guacamole`, that path was a mount point
from the kernel's perspective, so `rm` failed with:

    rm: cannot remove '/home/guacamole/.guacamole':
        Read-only file system

— the entrypoint exited non-zero and the Pod CrashLoopBackOff'd before
the webapp ever started. (t20 debug matrix — Fix #5.)

Mount the PARENT directory (`/home/guacamole`) instead. `.guacamole`
becomes a regular subdirectory inside the emptyDir, which the
entrypoint can freely `rm -rf` and recreate. The webapp's first-start
writes still land in a writable location under readOnlyRootFilesystem.

No Chart.yaml version bump per the t20 hard-rules contract — chart
release will roll in the next blueprint-release wave.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 14:53:33 +04:00
e3mrah
ffb79aab12
fix(billing): consume catalyst.usage.recorded from CATALYST_SME stream (was creating overlapping CATALYST_USAGE) (#1683)
t20 (2026-05-18) caught the bug: billing crashed at startup with NATS
error code 10065 "subjects overlap with an existing stream" because
CATALYST_SME (subjects `catalyst.>`, created by the tenant /
provisioning MultiSubscribers) had already claimed `catalyst.usage.recorded`
by the time billing tried to create CATALYST_USAGE
(subject `catalyst.usage.recorded`). JetStream forbids two Streams from
owning overlapping subject filters.
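
To see why the two Streams collide, it helps to check the literal subject against the CATALYST_SME filter using NATS wildcard rules (`*` matches one token, `>` matches one or more trailing tokens). The sketch below is a simplified matcher written for illustration; `filterMatches` is a hypothetical name and this is not the JetStream server's overlap check.

```go
package main

import (
	"fmt"
	"strings"
)

// filterMatches reports whether a literal subject falls under a filter,
// following NATS token rules: "*" matches exactly one token and ">"
// matches one or more tokens and must be the last filter token.
func filterMatches(filter, subject string) bool {
	f := strings.Split(filter, ".")
	s := strings.Split(subject, ".")
	for i, tok := range f {
		if tok == ">" {
			return i == len(f)-1 && len(s) > i
		}
		if i >= len(s) {
			return false
		}
		if tok != "*" && tok != s[i] {
			return false
		}
	}
	return len(f) == len(s)
}

func main() {
	// CATALYST_SME already owns "catalyst.>", so a second Stream on
	// "catalyst.usage.recorded" would claim a subject it already covers.
	fmt.Println(filterMatches("catalyst.>", "catalyst.usage.recorded"))
	fmt.Println(filterMatches("catalyst.usage.recorded", "catalyst.usage.recorded"))
}
```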

Option B per the matrix: have billing share CATALYST_SME and scope its
metering reads via a consumer-side FilterSubject instead of owning a
separate Stream. This matches the architecture every other SME service
(tenant, notification, provisioning) already uses for catalyst.* events.

Changes:
- core/services/shared/events/nats.go: add EnsureCatalystSMEStream
  (public wrapper around the existing package-private ensureSMEStream
  helper used by NewMultiSubscriber) + SubscribeUsageRecordedOnSME
  (durable consumer on CATALYST_SME with FilterSubject scoped to
  catalyst.usage.recorded). The original EnsureUsageStream and
  SubscribeUsageRecorded are retained but marked Deprecated for
  back-compat with any Catalyst-Zero / dev loop wired before t20.
- core/services/billing/main.go: replace the EnsureUsageStream call
  with EnsureCatalystSMEStream and the SubscribeUsageRecorded call
  with SubscribeUsageRecordedOnSME. Comment captures the t20 root
  cause + the bootstrap-order rationale so the next reader doesn't
  re-introduce the dedicated Stream.

The consumer-side FilterSubject (`catalyst.usage.recorded`) lives in
core/services/shared/events/nats.go inside SubscribeUsageRecordedOnSME.

go build + go test clean for core/services/billing and
core/services/shared.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 14:53:25 +04:00
e3mrah
cc13aec980
fix(sovereign-tls): bare https/http listener names when single parent zone (collision with chart HTTPRoutes sectionName) (#1682)
PR #1640 renamed Cilium Gateway listeners to `https-<sanitised-zone>` /
`http-<sanitised-zone>` to support multi-zone Sovereigns (primary +
SME pool). That broke single-zone Sovereigns because every platform
chart's HTTPRoute (harbor, keycloak, grafana, gitea, openbao, powerdns,
stalwart-tenant) hardcodes `parentRefs[0].sectionName: https`. Result:
every HTTPRoute reports `Accepted=False NoMatchingListener`, Sovereign
Console / Harbor / Keycloak etc. unreachable through the Gateway.

Fix: when `len(parent_domains_decoded) == 1` (the common case), render
listener names as the bare strings `https` / `http`. When > 1 (SME pool
present), keep the unique `https-<zone>` / `http-<zone>` naming so the
Gateway controller doesn't hit a duplicate-name Conflicting condition.
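
The naming rule can be sketched as below. The real logic lives in the sovereign-tls templates, so this Go version is illustrative only: `listenerNames` and `sanitiseZone` are hypothetical names, replacing dots with dashes is an assumed sanitisation, and `sme.example` is a made-up second zone.

```go
package main

import (
	"fmt"
	"strings"
)

type listener struct {
	Zone, HTTPS, HTTP string
}

// sanitiseZone is an assumed sanitisation (dots to dashes).
func sanitiseZone(zone string) string {
	return strings.ReplaceAll(zone, ".", "-")
}

// listenerNames returns bare "https"/"http" names for a single parent zone
// and unique per-zone names when more than one zone is present.
func listenerNames(zones []string) []listener {
	out := make([]listener, 0, len(zones))
	for _, z := range zones {
		l := listener{Zone: z, HTTPS: "https", HTTP: "http"}
		if len(zones) > 1 {
			s := sanitiseZone(z)
			l.HTTPS, l.HTTP = "https-"+s, "http-"+s
		}
		out = append(out, l)
	}
	return out
}

func main() {
	fmt.Println(listenerNames([]string{"omantel.biz"}))
	fmt.Println(listenerNames([]string{"omantel.biz", "sme.example"}))
}
```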

Multi-zone tenants whose HTTPRoutes must attach under a non-primary
zone override `sectionName` via values.yaml — out of scope here.

The per-zone certificateRefs.name (`sovereign-wildcard-tls-<sanitised-zone>`)
is unchanged — independent of the listener name.

Verified: kubectl kustomize clusters/_template/sovereign-tls/ clean.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 14:51:42 +04:00
e3mrah
b01281a70c
fix(self-sovereign-cutover): harborPublicURL → registry.<sov> (was harbor.<sov> — chicken-and-egg unblock) (#1681)
Per t20 debug matrix:

* `bp-self-sovereign-cutover` step-06 phase-1 rewrites every HelmRepository
  URL from `oci://ghcr.io/openova-io` to `oci://${harbor_host}/openova-io`,
  where `harbor_host` is derived from `sovereign.harborPublicURL`.
* Pre-fix: `harborPublicURL: https://harbor.${SOVEREIGN_FQDN}`.
* But the bp-harbor HTTPRoute publishes at `registry.${SOVEREIGN_FQDN}` —
  see `clusters/_template/bootstrap-kit/19-harbor.yaml` line 167
  (`gateway.host: registry.\${SOVEREIGN_FQDN}`). No HTTPRoute matches
  `harbor.<sov>`, so post-pivot every OCI chart pull EOFs.
* Effect: bp-sandbox HR never Ready → bootstrap-kit Kustomization stuck
  waiting on bp-sandbox health → t20 convergence blocks indefinitely.
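
The host derivation and URL rewrite described above can be sketched as follows. The cutover step itself is part of the chart, so this Go version is only an illustration; `harborHost` and `rewriteRepoURL` are hypothetical names.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// harborHost extracts the host part of sovereign.harborPublicURL.
func harborHost(publicURL string) (string, error) {
	u, err := url.Parse(publicURL)
	if err != nil {
		return "", err
	}
	if u.Host == "" {
		return "", fmt.Errorf("no host in %q", publicURL)
	}
	return u.Host, nil
}

// rewriteRepoURL swaps the upstream ghcr.io registry for the Sovereign's
// own registry host and leaves unrelated URLs untouched.
func rewriteRepoURL(repoURL, host string) string {
	const upstream = "oci://ghcr.io/openova-io"
	rest, ok := strings.CutPrefix(repoURL, upstream)
	if !ok || (rest != "" && !strings.HasPrefix(rest, "/")) {
		return repoURL
	}
	return "oci://" + host + "/openova-io" + rest
}

func main() {
	host, err := harborHost("https://registry.example.local")
	if err != nil {
		panic(err)
	}
	fmt.Println(rewriteRepoURL("oci://ghcr.io/openova-io", host))
}
```

Because every rewritten URL inherits the host from one value, a wrong `harborPublicURL` breaks all post-pivot chart pulls at once.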

Fix (chart-level, no Chart.yaml bump for bp-catalyst-platform):

* `clusters/_template/bootstrap-kit/06a-bp-self-sovereign-cutover.yaml`
  overlay value flipped `harbor.${SOVEREIGN_FQDN}` → `registry.${SOVEREIGN_FQDN}`.
* `platform/self-sovereign-cutover/chart/values.yaml` default placeholder
  flipped `harbor.example.local` → `registry.example.local` so smoke
  renders + docs line up.
* README + smoke command updated.

Smoke tests:

* `helm template smoke platform/self-sovereign-cutover/chart` — clean,
  1851 lines, `HARBOR_PUBLIC_URL=https://registry.example.local`.
* `helm template smoke ... --set sovereign.harborPublicURL=https://registry.otechN.omani.works`
  — clean, all step env vars carry the new host.
* `kubectl kustomize clusters/_template/bootstrap-kit/` — clean, 2926 lines,
  overlay shows `harborPublicURL: https://registry.${SOVEREIGN_FQDN}`.
* `bash platform/self-sovereign-cutover/chart/tests/cutover-contract.sh`
  — all gates green (Phase-0 ghcr-pull auth merge still works because
  `harbor_host` is derived from `HARBOR_PUBLIC_URL` env at runtime, so
  the script now correctly merges auth for `registry.<sov-fqdn>` instead
  of `harbor.<sov-fqdn>`).

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 14:51:29 +04:00
e3mrah
3acb340b36
test(sandbox): integration tests for orchestrator + sessions API status reflection (#1680)
Adds regression coverage so the Sandbox event flow + REST surface can
be exercised without a live Sovereign — the convergence loop the
qa-loop's last 5 iterations relied on.

Tenant orchestrator (5 cases / 8 runs):
  * full event flow — tenant.sandbox_requested envelope → in-process
    BrokerSubscriber → SandboxOrchestrator.Start → recordingSandboxClient
    materialises a CR shaped per architecture.md §7 (labels, annotations,
    spec.owner/quota/agentCatalogue/planId)
  * NATS-style redelivery is idempotent — second Emit() goes Get(found)
    → no-op, Create count stays at 1
  * plan tiers fan out — free/pro/ent each stamp the right quota
    (catches the PR #1633 regression)
  * non-sandbox event types ignored at the dispatcher seam
  * agentCatalogue strips empty / whitespace entries before persist

Catalyst sessions API (7 cases / 10 runs):
  * POST → GET round-trip through a dynamic/fake apiserver via
    SetSovereignDepsFactory (mirrors chroot Sovereign "Path 2")
  * GET reflects controller status (sessions / storage / spend /
    previews / conditions) into the FE wire shape
  * Failed condition taxonomy — TokenMintFailed, GitopsWriteFailed,
    ManifestRenderFailed each preserved verbatim so the FE renders
    actionable error states instead of a generic red pill
  * POST invalid-agent returns 400 before any apiserver call
  * GET unknown sandbox returns 404 sandbox-not-found
  * LIST → DELETE → LIST round-trip
  * Org-scope isolation — claims.Org-scoped namespace boundary blocks
    cross-Org leak

Hard rules followed: READ-ONLY fake clients (no apiserver write), no
chart bump, no production code changes — only new _test.go files.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 14:41:41 +04:00
github-actions[bot]
42410c6e75 deploy: bump sandbox-controller image to 042b444 2026-05-18 10:38:03 +00:00
github-actions[bot]
35b9c77923 deploy: bump sandbox-mcp-server image to 042b444 2026-05-18 10:36:43 +00:00
github-actions[bot]
1595f3a867 deploy: bump sandbox-pty-server image to 042b444 2026-05-18 10:36:20 +00:00
github-actions[bot]
c9506020c3 deploy: bump bp-newapi upstream v0.13.2 chart 1.4.13 2026-05-18 10:35:13 +00:00
e3mrah
042b444c5c
feat(sandbox): Prometheus emitters for Wave 14 Grafana panels (#1674) (#1679)
PR #1674 shipped the Sandbox Runtime Grafana dashboard with three
panels whose metrics did not yet exist anywhere in the fleet:

  - "WebSocket Connections" → pty_server_websocket_connections (Gauge)
  - "Idle-Timeout Scale-Down Events / hour" →
        sandbox_controller_idle_timeout_events_total (Counter)
  - "newapi Token Mint Requests / hour" →
        newapi_admin_token_mint_requests_total{tool,status} (Counter)

Per Inviolable Principle #11 the panels render "No data" until the
emitter sides roll out. This PR closes that loop.

pty-server (products/sandbox/pty-server)
  - New metrics.go: registers Gauge pty_server_websocket_connections
    via promauto + exposes promhttp.Handler.
  - routes.go: serves /metrics on GET; Inc/Dec the gauge around every
    successful WS upgrade in attach() and cards() (Dec is deferred so
    abnormal returns still decrement).
  - go.mod: + github.com/prometheus/client_golang v1.19.1 (matches the
    version core/controllers already pulls in transitively).
  - New unit test asserts GET /metrics carries the gauge name.

sandbox-controller (core/controllers/sandbox/internal/idlescaler)
  - New metrics.go: Counter sandbox_controller_idle_timeout_events_total
    with label {namespace} registered on controller-runtime's shared
    registry (so the manager's existing :8080 /metrics endpoint surfaces
    it — no new listener).
  - idlescaler.go: bumps the counter inside scaleToZero() so every Pod
    scaled to 0 ticks once. Namespace label matches the dashboard panel's
    `sum by (namespace) (rate(...))` aggregation.
  - New unit test verifies the counter delta is 1 on a successful
    scale-to-zero pass.

newapi bridge handler (platform/newapi/internal/handler)
  - New metrics.go: CounterVec newapi_admin_token_mint_requests_total
    with labels {tool, status}; helper classifyStatus() maps HTTP codes
    to a finite cardinality of 7 status values (ok / unauthorized /
    bad_request / unavailable / server_error / method_not_allowed /
    other). Exported MetricsHandler() so the catalyst-api wiring code
    can mount /metrics on the same listener as the bridge.
  - sandbox_token.go: recordMint(r, status) at every return path so the
    counter ticks regardless of which branch the request hits.
  - go.mod: + github.com/prometheus/client_golang v1.19.1.
  - 5 new test cases assert counter delta == 1 for the documented
    status transitions and that the X-Catalyst-Tool header surfaces
    as the `tool` label.
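
A bounded status label of the kind described for classifyStatus() might look like the sketch below. The seven label values come from the commit message, but which HTTP codes map to which label is an assumption here, and the plain map stands in for the Prometheus CounterVec so the example stays stdlib-only.

```go
package main

import (
	"fmt"
	"net/http"
)

// classifyStatus folds HTTP codes into seven label values so the
// {tool,status} counter keeps a finite cardinality.
// The exact code-to-label mapping is assumed, not taken from the source.
func classifyStatus(code int) string {
	switch {
	case code >= 200 && code < 300:
		return "ok"
	case code == http.StatusUnauthorized:
		return "unauthorized"
	case code == http.StatusBadRequest:
		return "bad_request"
	case code == http.StatusMethodNotAllowed:
		return "method_not_allowed"
	case code == http.StatusServiceUnavailable:
		return "unavailable"
	case code >= 500:
		return "server_error"
	default:
		return "other"
	}
}

type mintKey struct{ tool, status string }

func main() {
	// Stand-in for newapi_admin_token_mint_requests_total{tool,status}.
	mints := map[mintKey]int{}
	for _, code := range []int{200, 200, 401, 503} {
		mints[mintKey{"sandbox-controller", classifyStatus(code)}]++
	}
	fmt.Println(mints[mintKey{"sandbox-controller", "ok"}])
}
```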

sandbox-controller → newapi client (core/controllers/sandbox/internal/newapi)
  - client.go: stamp `X-Catalyst-Tool: sandbox-controller` on every
    outbound POST /admin/tokens/sandbox so the bridge counter's `tool`
    label has the canonical value the dashboard panel filters on.

Helm charts
  - platform/sandbox/chart/templates/service.yaml (new): ClusterIP
    Service exposing the controller's :8080 metrics port. Required so a
    ServiceMonitor selector has something to attach to.
  - platform/sandbox/chart/templates/servicemonitor.yaml (new):
    monitoring.coreos.com/v1 ServiceMonitor scoped to the metrics
    Service. Default-off + double-guarded with
    `.Capabilities.APIVersions.Has "monitoring.coreos.com/v1"` (matches
    platform/harbor/chart/templates/servicemonitor.yaml pattern). values
    block + per-Sovereign overrides (interval / scrapeTimeout / path /
    labels / namespace) per Inviolable Principle #4.
  - platform/newapi/chart/templates/servicemonitor.yaml (new): mirror
    template targeting the existing bp-newapi Service / `http` port for
    when the catalyst-api binary that mounts the bridge handler rolls
    out behind the same Service. Default-off, capability-guarded.

No Chart.yaml bump. Validated: helm template + helm lint clean
on both charts; go build + go test clean across all three modules
(pty-server, core/controllers/sandbox, platform/newapi/internal/handler).

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 14:34:50 +04:00
e3mrah
9f75c196a6
feat(sandbox-ui): provisioning progress panel + conditions display (#1678)
PR #1670 wired the xterm.js WebSocket against `wss://sandbox.<sov-fqdn>/
sessions/<id>/attach`, but a freshly-created Sandbox CR takes 8-15s for
the sandbox-controller to reconcile (ns + RBAC + 3 PVCs + pty Pod + MCP
Pod + newapi Token). During that window the public ingress doesn't
exist yet, so the FE looped through ~6 reconnect attempts against a
non-existent host before giving up — operator saw a blank "Connecting…"
banner with no insight.

PR #1673 made the missing data available by surfacing `.status.phase` +
`.status.conditions[]` on `/api/v1/sandbox/sessions/{id}`. This PR
consumes it:

  - New SandboxProvisioningPanel — phase badge + spinner +
    per-stage waterfall (Namespace → RBAC → Storage → pty Pod →
    MCP Pod → newapi Token) inferred from conditions[]. Renders
    target-state on first paint (per INVIOLABLE-PRINCIPLES.md #1);
    flips chips Done as each per-stage True condition lands.
  - SandboxSession polls GET /sessions/{id} every 2s via TanStack
    Query while phase !== 'Ready' && phase !== 'Failed', then stops
    polling and renders the xterm.js panel.
  - On phase==='Failed' the panel surfaces every False condition's
    {Type, Reason, Message} verbatim (TokenMintFailed,
    ManifestRenderFailed, GitopsWriteFailed, OwnerEmailInvalid,
    NoAllowedChannels) plus a "Delete + retry" CTA that fires
    DELETE /sessions/{id} then bounces the operator to /sandbox.
  - On 404 (session deleted in another tab) the panel shows a
    "Session no longer exists" state + back-to-landing CTA.
  - SandboxLanding's create-session flow now navigates straight to
    /sandbox/$id on POST success, seeds the per-session query cache
    so the panel paints the canonical 6-stage waterfall instantly
    (no 2s gap waiting for the first poll tick).

sandbox.api.ts extends the Sandbox shape with `phase: SandboxPhase`
+ `conditions: SandboxCondition[]` (always materialised arrays —
the FE never has to `?? []`). Adds `getSandbox(id)` + `deleteSandbox(id)`
endpoints. Three normalisers (`normalizeSandbox`, `normalizePhase`,
`normalizeConditions`) keep getSandboxes/getSandbox/createSandbox
projections from drifting.

Design-system inheritance per feedback_subagents_inherit_design_system:
  - PortalShell wrapper (no bespoke chrome)
  - Emerald/amber/rose ramps mirror AppDetail's phase chip + the
    existing ConnectionBadge — every colour comes from documented
    Tailwind design tokens, no new hex values
  - Card chrome mirrors sandbox-session-card verbatim (rounded-xl,
    var(--color-bg-2) on var(--color-border), 5-unit padding)

WebSocket effect now gated on `showTerminal` so we don't reconnect-
storm a non-existent ingress while provisioning. xterm host stays
mounted (display:none) across the gate flip so hostRef + resize
listeners survive.

Co-authored-by: hatiyildiz <269457768+hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 14:34:47 +04:00
github-actions[bot]
3869070336 deploy: bump sandbox-controller image to 6f24ea2 2026-05-18 10:31:45 +00:00
github-actions[bot]
74ecd5bd4a deploy: bump sandbox-mcp-server image to 6f24ea2 2026-05-18 10:30:35 +00:00
github-actions[bot]
991130e684 deploy: bump bp-newapi upstream v0.13.2 chart 1.4.12 2026-05-18 10:29:09 +00:00
e3mrah
6f24ea20b0
test(sandbox+newapi): integration test newapi token mint round-trip + verify reflector wiring (#1677)
Wave 15 — closes the verification gap between PR #1637 (catalyst-api
/api/v1/sandbox/sessions), PR #1638 (real /admin/tokens/sandbox mint
in newapi), and PR #1643 (sandbox-controller calls the bridge):

1. core/controllers/sandbox/internal/newapi/integration_test.go (new)
   End-to-end test wiring the REAL newapi.Client through net/http
   against a httptest.Server that mirrors the bridge handler contract
   (platform/newapi/internal/handler/sandbox_token.go) verbatim:
     - happy-path: 200 with {token, expires_at} → controller renders
       per-Sandbox Secret + stamps lifecycle annotations on the CR
       (verifies wire path: Authorization: Bearer + body fields
       org_id/user_id/sandbox_id/allowed_channels)
     - 401 path: wrong admin bearer → 401 JSON envelope → controller
       stamps TokenMintFailed False condition, NO gitops writes,
       requeues, NO lifecycle annotations stamped
     - transport-unreachable: closed server URL → TokenMintFailed
       (verifies the error wrapping path)

   Gap filled: client_test.go covers the HTTP client in isolation,
   sandbox_controller_test.go uses an in-process stub. Neither
   exercises the controller-runtime reconciler against a real HTTP
   transport — this file is the only place where a regression in the
   client's bearer/header/path/status-code handling would surface in
   the context of the reconciler's state machine.

2. platform/newapi/chart — reflector wiring fix
   Default reflectorNamespaces changed from "sandbox" to
   "catalyst-system,sandbox". Root cause: clusters/_template/
   bootstrap-kit/19a-bp-sandbox.yaml sets `targetNamespace:
   catalyst-system` (the canonical install namespace of the
   bp-sandbox HelmRelease) but the chart-emitted Secret was being
   mirrored into a `sandbox` namespace that does not exist on a
   stock Sovereign. Result: sandbox-controller Pod's
   `NEWAPI_ADMIN_SECRET` env var landed empty (secretKeyRef
   `optional: true` swallowed the missing-Secret error) → controller
   started in gitops-only mode, never minted tokens, silently
   degraded. Operator-visible only via a startup log line.

   `sandbox` is retained in the default for legacy overlays that
   install the controller into a dedicated namespace + for sister
   tooling (catalyst-api PATs routed through the bridge) that wants
   the admin bearer locally.
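
The wire contract exercised by the integration test in item 1 can be
illustrated with a minimal sketch. It assumes a hypothetical `mintToken`
helper and a hypothetical `newFakeBridge` stand-in; the JSON field names and
the /admin/tokens/sandbox path come from this log, but the real newapi.Client
and bridge handler are not reproduced here and the 401 envelope shape is an
assumption.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// mintRequest mirrors the body fields named in the commit message.
type mintRequest struct {
	OrgID           string   `json:"org_id"`
	UserID          string   `json:"user_id"`
	SandboxID       string   `json:"sandbox_id"`
	AllowedChannels []string `json:"allowed_channels"`
}

type mintResponse struct {
	Token     string `json:"token"`
	ExpiresAt string `json:"expires_at"`
}

// newFakeBridge is a hypothetical stand-in for the bridge handler: it accepts
// exactly one admin bearer and answers 401 (assumed envelope) otherwise.
func newFakeBridge(adminSecret string) *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "application/json")
		if r.Header.Get("Authorization") != "Bearer "+adminSecret {
			w.WriteHeader(http.StatusUnauthorized)
			json.NewEncoder(w).Encode(map[string]string{"error": "unauthorized"})
			return
		}
		var req mintRequest
		if err := json.NewDecoder(r.Body).Decode(&req); err != nil || req.SandboxID == "" {
			w.WriteHeader(http.StatusBadRequest)
			json.NewEncoder(w).Encode(map[string]string{"error": "bad request"})
			return
		}
		json.NewEncoder(w).Encode(mintResponse{Token: "tok-" + req.SandboxID, ExpiresAt: "2026-06-18T00:00:00Z"})
	}))
}

// mintToken is a hypothetical client call, not the real newapi.Client API.
func mintToken(baseURL, bearer string, body mintRequest) (mintResponse, error) {
	var out mintResponse
	payload, err := json.Marshal(body)
	if err != nil {
		return out, err
	}
	req, err := http.NewRequest(http.MethodPost, baseURL+"/admin/tokens/sandbox", bytes.NewReader(payload))
	if err != nil {
		return out, err
	}
	req.Header.Set("Authorization", "Bearer "+bearer)
	req.Header.Set("Content-Type", "application/json")
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return out, fmt.Errorf("mint transport: %w", err)
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return out, fmt.Errorf("mint failed: status %d", resp.StatusCode)
	}
	err = json.NewDecoder(resp.Body).Decode(&out)
	return out, err
}

func main() {
	srv := newFakeBridge("s3cret")
	body := mintRequest{OrgID: "org", UserID: "user", SandboxID: "sb1", AllowedChannels: []string{"default"}}

	ok, err := mintToken(srv.URL, "s3cret", body) // happy path
	fmt.Println(ok.Token, err)
	_, err = mintToken(srv.URL, "wrong", body) // 401 path
	fmt.Println(err)
	srv.Close()
	_, err = mintToken(srv.URL, "s3cret", body) // transport-unreachable path
	fmt.Println(err != nil)
}
```

In the real test the three outcomes feed the reconciler's state machine; this
sketch only shows the HTTP side of each branch.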

Verification:

  - go build ./sandbox/... clean
  - go test ./sandbox/internal/newapi/... — 7 tests pass (4 unit + 3
    new integration)
  - go test ./sandbox/... — all sandbox packages pass
  - go vet ./sandbox/... clean
  - helm template platform/newapi/chart/ -s sandbox-token-signing-
    key-secret.yaml renders with
    `reflection-{allowed,auto}-namespaces: "catalyst-system,sandbox"`
  - helm template platform/sandbox/chart/ renders Deployment env
    block `NEWAPI_ADMIN_SECRET` valueFrom Secret
    `newapi-bp-newapi-token-signing-key` key `ADMIN_SECRET`
    (optional: true) — unchanged

No Chart.yaml bump (chart pinning is a release-driver concern).

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 14:28:38 +04:00
e3mrah
6f41bd1d51
chore(release): chart 1.4.162→1.4.163 — Wave 16 collector (all post-Wave 14 template changes baked) (#1676)
The 1.4.162 chart OCI artifact (published at commit 0ad78790 on 2026-05-18
06:33 UTC) pre-dates the template additions merged shortly afterwards.
Sovereigns sourcing bootstrap-kit pin 1.4.162 download that stale artifact
and never get the new templates → tenant HTTPRoute / Cilium per-zone
listeners / bp-newapi attestation gate / sandbox-controller D31 wiring
never reach the chroot.

This collector PR bumps chart version 1.4.162→1.4.163 (chart-version-only
republish — same OCI pipeline race documented in 1.4.6, 1.4.14, 1.4.16,
1.4.116, 1.4.118 changelog entries) AND bumps the bootstrap-kit pin to
match. The next fresh prov pulls 1.4.163's OCI bytes which contain every
post-1.4.162 chart change.

Baked in 1.4.163:
- #1644 organization-controller renders per-tenant HTTPRoute so
  <slug>.<parentDomain> serves the tenant's installed product
  (crds/organization.yaml tenantPublic +
  templates/sme-services/tenant-public-routes.yaml +
  values.yaml tenantRoutes[] static-fixture path).
- #1650 provisioning service patches Organization.spec.tenantPublic on
  product-install so the #1644 HTTPRoute reconciler has a non-empty
  parentDomain to render.
- #1640 Cilium Gateway one listener pair (HTTPS:30443 + HTTP:30080) per
  parent zone (clusters/_template/sovereign-tls/cilium-gateway.yaml).
- #1654 bp-newapi attestation gate — channel CR render now skipped when
  ATTESTATION_ACCOUNT_ID env is empty (was blocking install with bogus
  channel pointing at empty account).
- Sandbox-controller post-handover refinements:
  * api-deployment.yaml SOVEREIGN_ENABLE_HOT_STANDBY +
    SOVEREIGN_PRIMARY_REGION + SOVEREIGN_REPLICA_REGION env vars
    (D31 wiring; SME tenant gitops writer reads at render time so
    every freshly-rendered bp-wordpress-tenant HR carries
    pg.activeHotStandby block when toggle is true).
  * sovereign-fqdn-configmap.yaml exposes enableHotStandby /
    primaryRegion / replicaRegion keys.
  * clusterrole-cutover-driver.yaml grants sandbox.openova.io/sandboxes
    verbs (create split into its own Rule per
    feedback_rbac_create_no_resourcenames.md).
  * values.yaml sovereign.{enableHotStandby,primaryRegion,replicaRegion}
    defaults (all empty — zero regression on Sovereigns that have not
    opted into active-hot-standby).

Hard rules: clusters/ READ-ONLY for non-template paths (only the
_template bootstrap-kit pin bumps), helm template clean, helm lint clean.

Verification:
- helm template bp-catalyst-platform products/catalyst/chart/ → exit 0,
  3009 lines rendered.
- helm lint products/catalyst/chart/ → 0 chart(s) failed.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 14:23:20 +04:00
e3mrah
1e14656dbe
feat(sandbox-ui): disable Connect Claude Max button when OAuth client_id is placeholder (clear founder-action signal) (#1675)
Adds GET /api/v1/sandbox/byos/claude-code/config returning
{clientIdConfigured, oauthAuthorizeURL?} so the SandboxSettings card can
pre-flight whether the chart's SANDBOX_BYOS_CLAUDE_CODE_CLIENT_ID is
still the PLACEHOLDER-AWAITING-FOUNDER-REGISTRATION sentinel (set by
PR #1619). When it is, the FE now renders the "Connect Claude Max"
button as DISABLED with an amber "Operator setup pending" pill + tooltip
"Anthropic OAuth client_id not yet registered — contact your Sovereign
operator" — rather than letting the user click a button whose OAuth URL
would 400 at Anthropic and confuse them about whether BYOS is broken vs
awaiting the founder action documented in claude-code-byos.md §8.
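
A minimal sketch of the pre-flight response described above, assuming a
hypothetical `buildBYOSConfig` helper and a made-up authorize URL. Only the
sentinel string and the two JSON field names are taken from this message;
treating an empty client_id as "not configured" is an assumption.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// placeholderClientID is the sentinel named in the commit message.
const placeholderClientID = "PLACEHOLDER-AWAITING-FOUNDER-REGISTRATION"

// byosConfig mirrors the documented wire shape; the authorize URL is left
// out of the JSON when no real client_id is configured.
type byosConfig struct {
	ClientIDConfigured bool   `json:"clientIdConfigured"`
	OAuthAuthorizeURL  string `json:"oauthAuthorizeURL,omitempty"`
}

// buildBYOSConfig is a hypothetical helper, not the handler's real code.
func buildBYOSConfig(clientID, authorizeURL string) byosConfig {
	if clientID == "" || clientID == placeholderClientID {
		return byosConfig{ClientIDConfigured: false}
	}
	return byosConfig{ClientIDConfigured: true, OAuthAuthorizeURL: authorizeURL}
}

func main() {
	for _, id := range []string{placeholderClientID, "real-client-id"} {
		out, _ := json.Marshal(buildBYOSConfig(id, "https://auth.example.invalid/authorize"))
		fmt.Println(string(out))
	}
}
```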

Design-system inheritance preserved: amber-500/40 + amber-500/10 +
amber-300 + text-[10px] uppercase pill matches SettingsPage verbatim,
disabled:cursor-not-allowed + disabled:opacity-50 is the same button
disabled treatment used elsewhere on the page, and the inline operator-
pending paragraph uses the existing text-[var(--color-text-dim)] token.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 14:07:22 +04:00
e3mrah
1075f6bbf2
feat(sandbox-chart): Grafana dashboard for Sandbox runtime observability (#1674)
Add a default-off Grafana dashboard ConfigMap (label `grafana_dashboard: "1"`)
that the upstream grafana/grafana sidecar (kiwigrid/k8s-sidecar) auto-discovers
across all namespaces and loads on startup. Renders only when
`.Values.grafanaDashboard.enabled=true` — zero regression for every existing
Sovereign overlay.

Panels (per products/sandbox/docs/architecture.md §7):
  1. Active Sandboxes — sum(kube_customresource_sandbox_info)
  2. pty-server Pods Ready % — kube_pod_status_ready{condition="true"}
     joined to kube_pod_labels{label_app_kubernetes_io_name="pty-server"}
  3. MCP Pods Ready % — same shape, label_app_kubernetes_io_name="openova-sandbox-mcp"
  4. WebSocket Connections — pty_server_websocket_connections (Gauge)
  5. PVC Usage % per Sandbox — kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes
  6. Idle-Timeout Scale-Down Events / hour —
     rate(sandbox_controller_idle_timeout_events_total[5m]) × 3600
  7. newapi Token Mint Requests / hour —
     rate(newapi_admin_token_mint_requests_total{tool="sandbox-controller"}[5m]) × 3600

Pattern mirrors platform/seaweedfs/chart/.../seaweedfs-grafana-dashboard.yaml.
Per Inviolable Principle #11 (never fabricate metrics) every panel description
names the metric it depends on so panels whose emitter has not yet rolled out
across the fleet render as "No data" instead of a synthetic number.

Validated:
- `helm template` clean both modes (default-off → zero output; enabled → 1 CM)
- `helm lint` passes (1 INFO about icon — pre-existing)
- Dashboard JSON parses (json.loads), 7 panels enumerated
- No Chart.yaml bump

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 14:04:58 +04:00
e3mrah
284cb62c94
feat(catalyst-api): /api/v1/sandbox/sessions exposes live controller status (was spec-only) (#1673)
PR #1637 shipped GET /api/v1/sandbox/sessions returning only the spec
the handler authored — Status was projected from `.status.phase` but
every other controller-managed field (sessions count, storage usage,
30-day spend, preview URLs, failure conditions) was ignored. The
catalyst-ui Sandbox surface had nothing to render past the FE-side
"queued" pill.

This wires the projection to read `.status.{phase,sessions,storageUsed,
spend30d,previews,conditions}` and surfaces them on the
list/get wire shape:

  - `phase` (raw, Pending|Provisioning|Ready|Failed) alongside the
    existing FE-projected `status` (pending|running|stopped|failed)
  - `sessions`, `storageUsed`, `spend30d` — operator-visible quotas
  - `previews[]` — one preview URL per PR/branch (skips rows missing
    URL; coerces float64↔int64 from apiserver JSON round-trips)
  - `conditions[]` — Type/Status/Reason/Message tuples verbatim, so
    the FE can render TokenMintFailed / GitopsWriteFailed /
    ManifestRenderFailed / OwnerEmailInvalid / NoAllowedChannels
    inline instead of a generic red pill
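
The `previews[]` handling above (skip rows without a URL, tolerate float64
or int64 numbers) could look roughly like the sketch below. The helper names
and the `url` / `pr` field names are hypothetical; the real projection code
is not shown in this log.

```go
package main

import (
	"fmt"
	"math"
)

// two63 is 2^63, the first float64 value that no longer fits in an int64.
const two63 = float64(1 << 63)

// asInt64 accepts the numeric types an unstructured object may carry after a
// JSON round-trip and rejects fractional or out-of-range floats.
func asInt64(v interface{}) (int64, bool) {
	switch n := v.(type) {
	case int64:
		return n, true
	case int:
		return int64(n), true
	case float64:
		if n != math.Trunc(n) || n < -two63 || n >= two63 {
			return 0, false
		}
		return int64(n), true
	default:
		return 0, false
	}
}

// preview is an assumed row shape for illustration only.
type preview struct {
	URL string
	PR  int64
}

// projectPreviews drops malformed rows and rows without a URL, and always
// returns a non-nil slice.
func projectPreviews(raw []interface{}) []preview {
	out := []preview{}
	for _, item := range raw {
		row, ok := item.(map[string]interface{})
		if !ok {
			continue
		}
		url, _ := row["url"].(string)
		if url == "" {
			continue
		}
		pr, _ := asInt64(row["pr"])
		out = append(out, preview{URL: url, PR: pr})
	}
	return out
}

func main() {
	raw := []interface{}{
		map[string]interface{}{"url": "https://pr-42.example.invalid", "pr": float64(42)},
		map[string]interface{}{"url": "https://pr-7.example.invalid", "pr": int64(7)},
		map[string]interface{}{"pr": int64(9)}, // no URL: skipped
		"not-a-map",                            // malformed: skipped
	}
	for _, p := range projectPreviews(raw) {
		fmt.Printf("%s pr=%d\n", p.URL, p.PR)
	}
	fmt.Println(len(projectPreviews(nil)))
}
```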

Phase→FE-status mapping unchanged (matches sandbox.api.ts:
normalizeStatus). `mapSandboxStatus` refactored from
`(u *Unstructured)` to `(rawPhase string)` so the new
`readSandboxPhase` helper reads the field exactly once per item.

Added handler-package tests pinning the projection contract:
  - StatusReflection — happy path, dropped-malformed rows
  - PhaseProjection — every CRD phase → FE status
  - FailedSurfacesConditions — Failed + TokenMintFailed visible
  - NilInputDoesNotPanic — empty-slice defaults
  - PreviewFloat64Coercion — apiserver JSON round-trip safety

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 13:59:20 +04:00
e3mrah
94e98052a6
fix(sandbox-chart): smoke-render-mode default-off (was default-on; chart is .Values.enabled-gated, default-on renders empty → Blueprint Release fails 'empty render') (#1672)
Wave 15 #1668 added the annotation but used default-on which trips the
empty-render guard because the chart's resources are all gated on
.Values.enabled (default false). Flip to default-off so the smoke render
skips the chart per design.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 13:46:26 +04:00
github-actions[bot]
fdc2b3340b deploy: bump sandbox-mcp-server image to de19be6 2026-05-18 09:38:38 +00:00
e3mrah
2f10c2e85a
feat(sandbox-ui): SandboxSession real WebSocket connect + reconnect (was placeholder) (#1670)
PR #1621 shipped the SandboxSession xterm.js host with an "API pending"
placeholder banner. PR #1641 + #1657 wired the BE (sandbox-controller
renders the HTTPRoute on sandbox.<sov-fqdn>; pty-server exposes
WS /sessions/{id}/attach). This PR replaces the placeholder with a real
adapter:

- stdin   : term.onData -> ws.send (TextEncoder binary frame)
- stdout  : ws.onmessage -> term.write (ArrayBuffer / Uint8Array / Blob / string)
- resize  : window resize -> fit.fit() -> POST sandbox.<sov-fqdn>/sessions/{id}/resize
- replay  : pty-server ships the ring buffer as the first binary frame; the
            generic onmessage path writes it verbatim, no special case
- reconnect: on close / error, schedule a retry with exponential backoff
             (1s, 2s, 4s, 8s, 16s, 30s ceiling — same shape as
             useComplianceStream). Connection banner reflects
             connecting / connected / reconnecting / closed / idle.
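
The reconnect schedule above can be sketched as a pure function. The FE code
itself is TypeScript; this Go version is for illustration only, and the name
`backoffDelay` is hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// backoffDelay returns the wait before reconnect attempt n (0-based):
// 1s doubling on each attempt, capped at a 30s ceiling.
func backoffDelay(attempt int) time.Duration {
	const base, ceiling = time.Second, 30 * time.Second
	if attempt < 0 {
		attempt = 0
	}
	// From the sixth attempt on, doubling would give 32s or more, so return
	// the ceiling directly; this also avoids overflowing the shift.
	if attempt >= 5 {
		return ceiling
	}
	return base << uint(attempt)
}

func main() {
	for attempt := 0; attempt < 7; attempt++ {
		fmt.Println(attempt, backoffDelay(attempt))
	}
}
```

Keeping the schedule in a pure function is what makes a prop such as
reconnectBackoffMs practical: a unit test can swap in a zero-delay schedule
without touching the socket code.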

Design-system inheritance: PortalShell wrapper unchanged, CSS-variable
colours throughout, amber for connecting/reconnecting and rose for
disconnected (the same shades the rest of the Sovereign Console uses).
The back-to-landing affordance the e2e suite asserts on is preserved.

Test seams kept: disableTerminal still skips xterm.js mount under
jsdom, plus new websocketFactory / resizeFetcher / reconnectBackoffMs /
disableReconnect props so unit tests can exercise the WS pump without a
real socket or wall-clock backoff.

npx tsc --noEmit clean on the full UI project.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 13:36:22 +04:00
e3mrah
de19be6b35
feat(sandbox-mcp): sandbox.stripe.* real impls (last MCP namespace) (#1671)
Wires four real Stripe handlers in openova-sandbox-mcp, completing the
final unwired namespace from architecture.md §3 (sandbox.stripe.*):

  - sandbox.stripe.bindAccount {api_key} — validates the key prefix
    (sk_live_ / sk_test_ / rk_live_ / rk_test_), stores it in the
    per-Sandbox Secret (`sandbox-<owner-uid>-secrets`, data-key
    `stripe_api_key`) via the same write-path sandbox.secrets.write
    uses, returns a masked confirmation (`sk_test_…xY12`).

  - sandbox.stripe.listProducts — reads the bound key implicitly,
    GET /v1/products with limit (1-100, default 20), active, and
    starting_after cursor passthrough.

  - sandbox.stripe.listPrices {product_id?} — same pagination shape;
    optional product_id filter.

  - sandbox.stripe.createCheckoutSession {price_id, success_url,
    cancel_url} — validates absolute http(s) URLs, POSTs the
    form-encoded line_items[0][price/quantity] body to
    /v1/checkout/sessions, returns the hosted Checkout URL + session id.
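
As a rough illustration of the createCheckoutSession request described
above, the sketch below validates the two URLs and builds the form-encoded
body. The helper names are hypothetical, and `mode=payment` and a quantity
of 1 are assumptions; the commit message does not state which values the
handler sends.

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
)

// requireAbsoluteHTTPURL rejects anything that is not an absolute http(s) URL.
func requireAbsoluteHTTPURL(raw string) error {
	u, err := url.Parse(raw)
	if err != nil {
		return err
	}
	if (u.Scheme != "http" && u.Scheme != "https") || u.Host == "" {
		return errors.New("must be an absolute http(s) URL: " + raw)
	}
	return nil
}

// checkoutForm builds the form body for POST /v1/checkout/sessions.
// mode=payment and quantity=1 are assumed values for this sketch.
func checkoutForm(priceID, successURL, cancelURL string) (url.Values, error) {
	if priceID == "" {
		return nil, errors.New("price_id is required")
	}
	for _, raw := range []string{successURL, cancelURL} {
		if err := requireAbsoluteHTTPURL(raw); err != nil {
			return nil, err
		}
	}
	form := url.Values{}
	form.Set("mode", "payment")
	form.Set("line_items[0][price]", priceID)
	form.Set("line_items[0][quantity]", "1")
	form.Set("success_url", successURL)
	form.Set("cancel_url", cancelURL)
	return form, nil
}

func main() {
	form, err := checkoutForm("price_123", "https://example.com/ok", "https://example.com/cancel")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	// Encode percent-escapes the brackets and sorts the keys.
	fmt.Println(form.Encode())
}
```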

Implementation:

  - No new module dep — inline HTTPS calls to api.stripe.com via the
    stdlib net/http client. stripe-go v82 would have pulled ~80
    transitive packages for four endpoints; the surface we need is
    tiny enough that a 100-line stripeDo helper covers it. Matches
    the task's "stripe-go v82 if not already in deps; else inline
    HTTPS" guidance.

  - The key never round-trips on the wire after first bind. Agent
    pastes once via bindAccount; every subsequent call reads it from
    the Secret store. Stripe-Version header pinned to 2024-06-20 so
    a future API revision can't silently break the wire format.

  - Auth: RequiredCapability="sandbox.stripe" on every tool.
    claims.OrgID match enforced by the registry's existing gate.

  - Read-only cluster invariant: the only writes are to the
    per-Sandbox Secret. assertManagedBy() enforced on bind so we
    cannot mutate the controller-injected `sandbox-tokens` Secret.

Tests cover key validation (prefix + length), masking format, limit
clamping, the httptest.Server-backed happy-path + error-envelope
unwrap, form-urlencoded body shape for createCheckoutSession,
catalogue wiring (all four handlers non-nil, RequiredCapability
matches), and the registry capability gate (missing sandbox.stripe
cap → forbidden).
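
The key validation, masking format, and limit clamping that the tests pin
can be sketched as follows. All function names are hypothetical, the minimum
secret length is an assumed value, and treating a limit of 0 as "unset" is
an assumption about how the handler detects a missing argument.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// allowedPrefixes are the four key prefixes listed in the commit message.
var allowedPrefixes = []string{"sk_live_", "sk_test_", "rk_live_", "rk_test_"}

// minSecretLen is an assumed lower bound on the part after the prefix.
const minSecretLen = 8

// keyPrefix returns the matching prefix or an error for unknown or short keys.
func keyPrefix(key string) (string, error) {
	for _, p := range allowedPrefixes {
		if strings.HasPrefix(key, p) {
			if len(key)-len(p) < minSecretLen {
				return "", errors.New("stripe key too short")
			}
			return p, nil
		}
	}
	return "", errors.New("unrecognised stripe key prefix")
}

// maskKey keeps the prefix and the last four characters, e.g. sk_test_…xY12.
func maskKey(key string) (string, error) {
	p, err := keyPrefix(key)
	if err != nil {
		return "", err
	}
	return p + "…" + key[len(key)-4:], nil
}

// clampLimit applies the default of 20 and clamps to the 1-100 range.
func clampLimit(limit int) int {
	switch {
	case limit == 0:
		return 20
	case limit < 1:
		return 1
	case limit > 100:
		return 100
	}
	return limit
}

func main() {
	masked, err := maskKey("sk_test_abcdefghxY12")
	fmt.Println(masked, err)
	_, err = maskKey("pk_test_abcdefghxY12")
	fmt.Println(err)
	fmt.Println(clampLimit(0), clampLimit(-5), clampLimit(50), clampLimit(500))
}
```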

Closes the Wave 13 "last MCP namespace" gap; no chart bump.

Co-authored-by: Claude <noreply@anthropic.com>
2026-05-18 13:36:15 +04:00
e3mrah
6b3317f185
docs: session 2026-05-17/18 Wave 12-14 addendum + bootstrap-kit pin lag feedback (#1669)
Append a Wave 12-14 addendum to the convergence report capturing:

- t-prov cycle log (t13 FAIL, t14 FAIL, t15 PASS, t16-t19 STUCK on stale chart, t20 in flight on 1.4.162)
- Three silent-failure traps: Wave 8 CloudPage TS error stalled UI builds 3h; Wave 13 mcp-server Dockerfile context broke sandbox-mcp builds for 3 days since #1658; Wave 14 bootstrap-kit pin lag stalled all chart propagation for 6h of provs
- Wave 12-14 PR roster (#1656/#1658/#1659/#1660/#1661/#1662/#1663/#1664/#1666/#1667) plus session total now 51 PRs
- Lesson 6: deploy-bot does NOT auto-bump the bootstrap-kit slot 13 pin; manual collector PR required per cycle

Companion memos (out-of-tree, not in this PR):

- session_2026_05_18_overnight_22prs.md gets a Wave 12-14 outcomes section
- new feedback_bootstrap_kit_pin_lag.md pins the pattern + detection one-liner

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 13:31:17 +04:00
e3mrah
fcf86a6392
fix(sandbox-chart): no-upstream annotation (unblock Blueprint Release pipeline) (#1668)
* docs: session 2026-05-17/18 Wave 12-14 addendum + bootstrap-kit pin lag feedback

Append a Wave 12-14 addendum to the convergence report capturing:

- t-prov cycle log (t13 FAIL, t14 FAIL, t15 PASS, t16-t19 STUCK on stale chart, t20 in flight on 1.4.162)
- Three silent-failure traps: Wave 8 CloudPage TS error stalled UI builds 3h; Wave 13 mcp-server Dockerfile context broke sandbox-mcp builds for 3 days since #1658; Wave 14 bootstrap-kit pin lag stalled all chart propagation for 6h of provs
- Wave 12-14 PR roster (#1656/#1658/#1659/#1660/#1661/#1662/#1663/#1664/#1666/#1667) plus session total now 51 PRs
- Lesson 6: deploy-bot does NOT auto-bump the bootstrap-kit slot 13 pin; manual collector PR required per cycle

Companion memos (out-of-tree, not in this PR):

- session_2026_05_18_overnight_22prs.md gets a Wave 12-14 outcomes section
- new feedback_bootstrap_kit_pin_lag.md pins the pattern + detection one-liner

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(sandbox-chart): add no-upstream annotation (unblock Blueprint Release pipeline)

Blueprint Release CI was failing on every push that touched
platform/sandbox/chart/* since PR #1622 because the chart didn't declare
either dependencies: OR the catalyst.openova.io/no-upstream: "true"
annotation. Per docs/BLUEPRINT-AUTHORING.md §11.1 every umbrella chart at
platform/<name>/chart/ MUST do one of those two.

Sandbox is Catalyst-authored (sandbox-controller built in-house), so the
no-upstream annotation is correct. Matches existing pattern in:
- platform/bp-vcluster-helmrepo/chart/Chart.yaml
- platform/cnpg-pair/chart/Chart.yaml
- platform/external-secrets-stores/chart/Chart.yaml

Without this, Blueprint Release fails → bp-catalyst-platform chart
artifact at 1.4.162 never republishes with the latest sandbox image
refs (cadc7b5 from PR #1667 auto-bump) → fresh provs keep getting
stale sandbox runtime images.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 13:30:00 +04:00
github-actions[bot]
e5c2797ce6 deploy: bump sandbox-mcp-server image to cadc7b5 2026-05-18 09:25:43 +00:00
github-actions[bot]
87cf177a02 deploy: bump sandbox-pty-server image to cadc7b5 2026-05-18 09:23:28 +00:00
e3mrah
cadc7b5cea
fix(sandbox-ci): mcp-server Dockerfile repo-root context + pty/mcp auto-bump wiring (chart was half-deployable) (#1667)
Sandbox chart was un-deployable end-to-end because three CI-side gaps
(items 1-3 below; item 4 is a stop-gap pin) compounded after PR #1658
wired the mcp-server module to depend on core/controllers +
core/services/shared via `replace` directives:

1. **mcp-server Dockerfile built against a too-narrow context**. The
   workflow passed `context: products/sandbox/mcp-server` and the
   Dockerfile assumed `COPY . .` could see everything it needed, but
   the `replace ../../../core/controllers` line in the module's go.mod
   only resolves when the build can actually reach those paths. Result:
   every push after #1658 failed at `go build` with `module not found`.
   Fix mirrors core/controllers/sandbox/Dockerfile (Slice-CC1 layout):
   COPY the replace targets' module roots + sources, then build with
   WORKDIR set to the dependent module. Static binary still produced
   into a distroless/static-debian12:nonroot final stage.

2. **mcp-server workflow had no chart auto-bump step**. Even after a
   green build, `runtime.mcpImage` in platform/sandbox/chart/values.yaml
   stayed empty so the chart's `required` guard
   (deployment.yaml line 72) refused to render. Added the same
   yq-bump + bot-commit pattern build-sandbox-controller.yaml already
   uses, targeting `.runtime.mcpImage` and writing a fully-qualified
   `<repo>:<sha>` string (consumer reads it as one image reference,
   not a {repository,tag} pair). Also widened paths-filter to include
   core/controllers/** + core/services/shared/** so changes to the
   replace targets re-trigger the build.

3. **pty-server workflow had no auto-bump either**. Same surgery:
   yq-bump `.runtime.ptyServerImage` + commit-and-push. Context stays
   narrow (pty-server has no cross-tree `replace` directives).

4. **Stop-gap pin values for runtime.{ptyServerImage,mcpImage}** so the
   next chart rollout doesn't fail-fast before the rebuilt workflows
   land their first bumps:
   - ptyServerImage → ad5163e6 (current latest pty-server)
   - mcpImage → 1b0e86c (last pre-#1658 green build; the rebuilt
     workflow will land the next real SHA on the next push to main).

Verified locally:
- `go build ./products/sandbox/mcp-server/...` clean (43.8 MB static
  binary at /tmp/openova-sandbox-mcp; `file` confirms statically
  linked ELF).
- `helm template test platform/sandbox/chart --set enabled=true …`
  renders cleanly; both env vars carry the SHA-pinned image refs.

No Chart.yaml bump. Read-only clusters.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 13:22:17 +04:00
e3mrah
04079522ee
chore(release): bootstrap-kit pin 1.4.156→1.4.162 — Wave 14 collector (#1666)
Deploy-bot auto-bumped Chart.yaml + values.yaml image SHAs from 1.4.156
through 1.4.162 via 4 image-rev cycles, but the bootstrap-kit pin
sticks at the last manual collector bump (Wave 9 #1617). All provs
since t13 baked chart 1.4.156 which lacked Wave 10-13 fixes:
- #1622 sandbox controller chart
- #1640 Cilium Gateway per-zone listeners
- #1641 sandbox controller spawns pty+MCP Pods
- #1643 sandbox-controller calls newapi token mint
- #1644 organization-controller tenantPublic HTTPRoute
- #1650 provisioning sets Org.spec.tenantPublic
- #1652 bp-sandbox slot 61→19a (chicken-and-egg)
- #1654 bp-newapi attestation gate
- #1661 CNPG cross-region default

Bumping bootstrap-kit pin to 1.4.162 so the next fresh prov bakes the
correct chart artifact with all post-1.4.156 fixes included.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 13:15:50 +04:00
e3mrah
0b30ddca7f
Merge pull request #1664 from openova-io/sandbox-wave12-mcp-storage
feat(sandbox-mcp): sandbox.storage.* real impls (SeaweedFS bucket + signed URLs)
2026-05-18 13:07:54 +04:00
Emrah Baysal
2cebee57dd feat(sandbox-mcp): sandbox.storage.* real impls (SeaweedFS bucket + signed URLs)
Replaces the Wave 2 stubs for the sandbox.storage.* namespace with real
handlers backed by the host-cluster's unified SeaweedFS S3 API
(`seaweedfs.storage.svc:8333` per platform/seaweedfs/README.md). Every
handler is scoped to buckets prefixed `sandbox-<owner-uid>-` so the
agent cannot touch any other consumer's bucket (loki-data / cnpg-wal /
harbor-data, etc.).

Tools shipped:
  - sandbox.storage.bindBucket        {bucket_name?}
  - sandbox.storage.signedUploadURL   {bucket, key, expires_in_seconds?}
  - sandbox.storage.signedDownloadURL {bucket, key, expires_in_seconds?}
  - sandbox.storage.listBuckets
  - sandbox.storage.deleteBucket      {name}

Wire model: minio-go v7 (already canonical across the OpenOva tree —
catalyst-bootstrap/hetzner objectstorage depend on it) speaking S3 v4
to SeaweedFS. Presigned URLs default to 15 min and clamp to 7 days
(the S3 v4 signature ceiling).
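
The expiry handling could be sketched as below. `clampExpiry` is a
hypothetical name, and falling back to the default for zero or negative
input is an assumption; only the 15-minute default and the 7-day ceiling
come from this message.

```go
package main

import (
	"fmt"
	"time"
)

const (
	defaultExpiry = 15 * time.Minute   // default from the commit message
	maxExpiry     = 7 * 24 * time.Hour // S3 v4 presign ceiling
)

// clampExpiry converts expires_in_seconds into a Duration: unset or
// non-positive values fall back to the default (assumed behaviour) and
// anything above seven days is clamped to the ceiling.
func clampExpiry(seconds int64) time.Duration {
	if seconds <= 0 {
		return defaultExpiry
	}
	// Compare in seconds before multiplying so huge inputs cannot overflow.
	if seconds > int64(maxExpiry/time.Second) {
		return maxExpiry
	}
	return time.Duration(seconds) * time.Second
}

func main() {
	for _, s := range []int64{0, 60, 3600, 10_000_000} {
		fmt.Println(s, clampExpiry(s))
	}
}
```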

Defence-in-depth: prefix-mismatch + 63-char S3 cap + alnum-only object
key regex all enforced BEFORE any S3 dial; arg-validation errors
surface clearly without first hitting a misleading creds error.

New env vars (sandbox-controller fills these at MCP Deployment
spec time):
  SANDBOX_STORAGE_S3_ENDPOINT   = "seaweedfs.storage.svc:8333"
  SANDBOX_STORAGE_S3_ACCESS_KEY = "<per-Sandbox IAM access key>"
  SANDBOX_STORAGE_S3_SECRET_KEY = "<per-Sandbox IAM secret>"
  SANDBOX_STORAGE_S3_USE_TLS    = "true|false" (default: false)
  SANDBOX_STORAGE_S3_REGION     = "us-east-1"  (default; opaque to SeaweedFS)

Auth: same shape as PR #1658 (sandbox.auth.* + sandbox.secrets.*) —
claims.OrgID must match env.OrgID, RequiredCapability=sandbox.storage.

Tests: 9 new test functions covering prefix format, scope-gate,
bucket-name normalization (including cross-Sandbox refusal + length
guard), object-key validation, expiry clamp, region default, per-tool
arg validation, capability gate, and catalogue wiring. `go build` +
`go test` clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 11:06:49 +02:00