docs: session 2026-05-17/18 convergence report + DoD D32-D35 + Sandbox status update (#1635)

- New docs/SESSION-2026-05-17-CONVERGENCE.md narrative session report covering the 22 user-facing PRs (#1597-#1632) across 9 waves: founder bug families, BSS iframe-seam removal, bp-hcloud-csi removal, CloudPage TS hotfix, Sandbox W1-W5 scaffold, and 9 convergence-cleanup fixes.
- SOVEREIGN-MULTI-REGION-DOD.md extended D31 -> D35: Sandbox CRD installable (D32), Sandbox agent catalogue picker (D33), newapi Sovereign-side LLM gateway (D34), NATS broker round-trip publish+consume (D35).
- products/sandbox/README.md flips Status from "Design. Not yet implemented." to "Wave 1-5 implementation in flight (PRs #1615/#1618/#1619/#1621/#1622/#1632 merged; runtime smoke pending fresh prov)". Adds founder TODO to register Anthropic OAuth client_id per claude-code-byos.md.

No code, chart, or test changes.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
parent 9690ff8351
commit 62c5620741
docs/SESSION-2026-05-17-CONVERGENCE.md (Normal file, 280 lines)
@@ -0,0 +1,280 @@
# Session 2026-05-17 / 18 — Convergence Report

**Window:** 2026-05-17 evening → 2026-05-18 early morning UTC
**PR range:** #1597 → #1632 (~36 numbered PRs, plus the auto-generated `deploy:` collector commits between them)
**Bastion:** `bastion.openova.io` (`vmi3305700`, 11 GiB RAM, 6 vCPU)
**Charts shipped:** 1.4.153 → 1.4.160 (seven version bumps in one window)
**Theme:** finish founder bugs, drop iframes from BSS, scaffold the Sandbox product end-to-end, and wire the SME event/identity/DNS plumbing that 31-gate convergence had been silently skipping over.

---

## TL;DR

- ~22 user-facing PRs grouped into nine waves (Waves 1–9 + Sandbox W1–W5 + a "convergence cleanup" wave).
- Two OOM incidents on the bastion forced the move from "max-6 flat agent count" → "weighted-slot accounting". Test-Executors now cost 1.5 slots each, Fix-Authors 1.0; the dispatcher refuses to launch if the sum would exceed 6.
- A single TypeScript regression in `CloudPage.tsx` (fixed in Wave 8) silently held back Wave 5 + Wave 6 + Sandbox UI from reaching fresh provs for several hours. We did not notice until a chart bump (1.4.155 → 1.4.156, PR #1617) was required to "republish with fresh UI bytes". Lesson now pinned: **always check Build & Deploy Catalyst CI status after a merge**.
- The NATS broker bridge (PR #1626) closed the **publish leg** of the SME event flow (tenant/billing dispatchers now publish to NATS). The **consume leg** (Sandbox controller + Organization reconciler subscribing to `catalyst.tenant.created` / `catalyst.order.placed`) is queued for the next wave.
- Four new DoD gates appended: **D32 Sandbox CRD installable**, **D33 Sandbox agent catalogue picker**, **D34 newapi LLM gateway**, **D35 NATS broker round-trip** (see `SOVEREIGN-MULTI-REGION-DOD.md`).

---

## Timeline by wave

### Wave 1 — D17 list-view route fix (#1597)

Hot off the morning's t142 walk: `/cloud?view=list&kind=<X>` still bounced to `/dashboard` when the kind filter was set. Bare-minimum route guard cleanup: five lines plus a Playwright snapshot.

PR: **#1597** `fix(ui): D17 — /cloud?view=list&kind=<X> no longer redirects to /dashboard`

### Wave 2 — Founder-bug families B / C / D / E / F / G (#1598-#1604)

The Family pass over the founder's bug-list from the t142 walk. Each family is one PR over multiple files; the `chart 1.4.153→1.4.154` collector (#1604) republishes all UI/api images together so the chart values bump is atomic.

| PR | Family | Subject |
|---|---|---|
| #1598 | F | BSS routes mounted at `/console/bss/*` with RBAC menu gating (founder #1) |
| #1599 | D | Treemap fan-out for cluster / region / vcluster / family + Layer-1 default |
| #1600 | C | `ResourceDetailPage` real data + tab nav (founder #5) |
| #1601 | G | 6 singletons (C8-001/C8-005/C9-006/C10-002/C10-003/C7-007) |
| #1602 | E | Compliance UI (Kyverno + Falco + SBOM + framework filter) |
| #1603 | B | AppDetail status sync (HR → UI wire + correct ns/label) |
| #1604 | — | Chart 1.4.153 → 1.4.154 collector |

The 6-PR parallel fan-out triggered the **first OOM incident** of the night (see "OOM Incidents" below).

### Wave 5 — UX polish + chart 1.4.155 (#1605)

Single squashed PR: sidebar reorder, BSS icon swap, marketplace surface promoted from inline component to `SettingsCard`. Chart 1.4.154 → 1.4.155.

PR: **#1605** `feat(ui): Wave 5 — UX polish (sidebar reorder + BSS icon + marketplace as SettingsCard) + chart 1.4.155`

### Wave 6 — BSS native port (Option B step 1) (#1606-#1614)

Founder ruling earlier in the day: drop the iframe seam in BSS. Each BSS sub-page (Landing / Orders / Billing / Revenue / Vouchers / Tenants) is rewritten natively in the Sovereign Console as a React page, sharing the PortalShell + design tokens, and the iframe shell is deleted.

Six PRs landed in two waves of three (parallel within wave, serial between waves) to stay under the 6-slot ceiling.

| PR | Page |
|---|---|
| #1606 | BSS native landing (Option B step 1, kills iframe seam) |
| #1607 / #1608 | Orders (two PRs because the first squash-merged with an unrelated CI bump; #1608 is the canonical one) |
| #1611 / #1613 | Revenue (KPI + chart + breakdown; re-shipped after a rebase) |
| #1612 | Billing |
| #1609 | Vouchers (table + Issue modal) |
| #1614 | Tenants (drops iframe, native table) |

### Wave 7 — Critical-path hotfix: bp-hcloud-csi (#1610)

Slot 17a's `bp-hcloud-csi` had a chicken-and-egg dependency on harbor: the CSI pulled its provisioner image from harbor, but harbor needed PVCs to come up. Resolved in #1610 by removing slot 17a entirely (Hetzner-CCM handles volume mount once we drop the explicit CSI install).

PR: **#1610** `fix(bootstrap-kit): remove bp-hcloud-csi slot 17a — chicken-and-egg with harbor (Wave 7 critical-path hotfix)`

### Wave 8 — CloudPage TypeScript hotfix (#1616)

**This is the wave that fixed the regression silently holding back everything Wave 5 + Wave 6 had merged.**

`CloudPage.tsx` had `kindCounts` typed against a hand-maintained literal union of resource kinds. Wave 5 and Wave 6 each landed cleanly on their own, but Family E's Compliance UI (#1602, republished by the chart collector #1604) had introduced two new kinds (`policyreports`, `clusterpolicyreports`) that were not in the union. The Catalyst UI Docker build failed in CI with a TypeScript error. The failure was silent because CI stops at the image-build step and does not open a noisy issue. No image was pushed to ghcr.io, so the chart values references (already bumped in #1604/#1605/#1615) pointed at SHA tags that did not exist. Every chart-values bump after #1604 inherited the same problem, yet the chart still rendered fine, so Flux deployed it and the UI Pod on every fresh prov hit `ImagePullBackOff`.

We caught it only when a manual `kubectl get pods -n catalyst-system` on a verifier prov showed the UI Pod stuck in `ImagePullBackOff`. Tracing back: the ghcr.io tag genuinely did not exist.

PR: **#1616** `fix(ui): Wave 8 hotfix — CloudPage kindCounts adds policyreports + clusterpolicyreports (unblocks UI build)`

Lesson (now pinned in memory): **after any PR that merges, check `gh run list --workflow="Build & Deploy Catalyst" --limit 1` and confirm `conclusion=success` before bumping chart values that reference the same SHA**. A broken UI build does NOT fail the chart bump; it just leaves a dangling image reference.
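
A hand-maintained union drifts whenever a new kind is added elsewhere. One possible guard, sketched below under assumptions, is to derive the union from a single list so the type and the counters cannot disagree. The names `RESOURCE_KINDS`, `ResourceKind`, `KindCounts`, and `countKinds` are hypothetical, the kinds other than the two Compliance ones are illustrative, and the real `CloudPage.tsx` typing may look different.

```typescript
// Hypothetical sketch, not the actual CloudPage.tsx code.
// One source-of-truth list; the union type is derived from it.
const RESOURCE_KINDS = [
  "pods",
  "deployments",
  "policyreports",
  "clusterpolicyreports",
] as const;

type ResourceKind = (typeof RESOURCE_KINDS)[number];

// Record<ResourceKind, number> requires a counter for every kind in the list.
type KindCounts = Record<ResourceKind, number>;

// Runtime guard that narrows an arbitrary string to the derived union.
function isResourceKind(kind: string): kind is ResourceKind {
  return (RESOURCE_KINDS as readonly string[]).indexOf(kind) !== -1;
}

// Count items per known kind; unknown kinds are collected and reported
// instead of being dropped silently.
function countKinds(items: { kind: string }[]): {
  counts: KindCounts;
  unknown: string[];
} {
  const counts = {} as KindCounts;
  for (const k of RESOURCE_KINDS) {
    counts[k] = 0;
  }
  const unknown: string[] = [];
  for (const item of items) {
    if (isResourceKind(item.kind)) {
      counts[item.kind] += 1;
    } else {
      unknown.push(item.kind);
    }
  }
  return { counts, unknown };
}
```

With this shape, adding a kind means editing one array. Any code that builds a `KindCounts` literal by hand would still fail to compile until the new key is present, which is the desired loud failure, but it would fail in the PR that adds the kind rather than in a later image build.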

### Wave 9 — Chart 1.4.155 → 1.4.156 collector (#1617)

After #1616 landed, the UI image finally re-published to ghcr.io. To force every downstream Flux Kustomization to roll its pods (the image tag is still the same SHA, but the underlying image is now the corrected one, and kubelet may have cached the broken pull), we bumped the chart patch version.

PR: **#1617** `chore(release): chart 1.4.155→1.4.156 — Wave 9 collector republishes with fresh UI bytes`

### Sandbox Wave 1-5 — full product scaffold (#1615 / #1618 / #1619 / #1621 / #1622 / #1632)

Six PRs across one product directory. Architecture is described in `products/sandbox/docs/architecture.md` (shipped 2026-05-15).

| PR | Subject |
|---|---|
| #1615 | Wave 1 step 1: CRD `sandbox.openova.io/v1.Sandbox` |
| #1618 | Wave 2: pty-server + openova-sandbox-mcp scaffold |
| #1619 | Wave 1b: newapi proxy + BYOS + org-scoped JWT (in `core/services/auth` + new `core/services/newapi`) |
| #1621 | Wave 3: Sandbox UI (landing + session host + BYOS settings) inside the Sovereign Console |
| #1622 | Wave 1: controller + chart scaffold (chart templates `sandbox-controller.yaml`, `sandbox-rbac.yaml`) |
| #1632 | CI: build workflows for controller + pty-server + mcp-server (so the chart can actually deploy) |

**The CI workflow PR (#1632) was the missing piece:** the chart referenced `ghcr.io/openova-io/openova/sandbox-{controller,pty-server,mcp-server}:<sha>` but no workflow ever pushed those images. #1632 added three `.github/workflows/build-sandbox-*.yml` files. Same lesson as Wave 8: a chart values bump that references an image is meaningless if the build pipeline doesn't push the image.

### Convergence cleanup wave (#1623-#1631)

Nine PRs after the Sandbox work, surfacing every SME / tenant / DNS / event / identity gap that 31-gate convergence had been working around silently.

| PR | Layer | Subject |
|---|---|---|
| #1623 | marketplace | Persist subdomain TLD across wizard steps |
| #1624 | bootstrap-kit | Install vcluster CRDs + controller on Sovereign (gates Org → vCluster spawn) |
| #1625 | catalyst-api | Wire `/api/v1/sme/billing/vouchers/{list,issue,revoke}` proxy |
| #1626 | sme (broker bridge) | Wire tenant + billing event dispatchers to NATS (was Redpanda-only, blocking convergence) |
| #1627 | marketplace | Post-purchase redirect to Sovereign-local console (was hardcoded to mothership) |
| #1628 | billing | Skip Stripe when voucher covers 100% of total (unblocks fully-paid voucher checkout) |
| #1629 | domain | Per-tenant DNS reconciler — `<slug>.<pool-domain>` resolves to Sovereign LB (was mothership) |
| #1630 | catalyst-api | Mint HS256 token on SME proxy calls (was forwarding incompatible RS256) |
| #1631 | sandbox+bootstrap-kit | newapi Sovereign install (Bank Dhofar Qwen wired for Sandbox) |

PR #1626 is the **broker publish leg**. The matching **consume leg** is still open: the Sandbox controller and Organization reconciler need to subscribe to `catalyst.tenant.created` / `catalyst.order.placed` and react. Today they still poll Catalyst-API. See "Remaining convergence gaps" below.

---

## OOM incidents and the move to weighted slots

### Incident 1 — Wave 2 family fan-out (~21:00 UTC)

Dispatched six Fix-Authors for Families B/C/D/E/F/G simultaneously, all sync-Agent. Aggregate RSS ramped to 9.3 GiB; the parent `claude-code` PID 218043 was OOM-killed by the kernel at ~9.7 GiB. Recovered with `claude --resume`, then re-dispatched the same six but **serialised them** in 2 waves of 3.

Memory cost per agent observed in this incident:

- Fix-Author (touches ~10 files): 1.1–1.3 GiB RSS at peak
- Test-Executor (Playwright walk, chrome headed): 1.6–1.9 GiB RSS at peak

### Incident 2 — Sandbox parallel dispatch (~02:00 UTC)

Dispatched 4 Fix-Authors (Sandbox W1 + W1b + W2 + W3) plus 2 Test-Executors (verifier walks of t142) simultaneously. Flat count = 6 = within the 6-slot ceiling. RSS peaked at 8.9 GiB → no OOM, but the verifier Playwright sessions stalled (browser tab couldn't get a fresh allocation, sat in CPU-spin) for ~3 minutes.

### Resolution — weighted slots

New dispatcher accounting (now mirrored in `~/.claude/projects/-home-openova-repos-openova-private/memory/feedback_max_6_parallel_agents.md`):

| Agent kind | Weight |
|---|---|
| Fix-Author (focused, ≤15 files) | 1.0 |
| Fix-Author (large refactor, >25 files) | 1.5 |
| Test-Executor (Playwright walk) | 1.5 |
| Research/sub-grep agent (read-only) | 0.5 |

Hard rule: sum of weights at any instant ≤ 6.0. Dispatcher refuses to launch if the next agent would push over. If workload requires more, serialise into waves whose weight-sum each stays ≤ 6.

Wave 6 BSS port (six PRs) is the worked example: split into two waves of 3 PRs each (weight 3.0 per wave) with sequential gates between, total wall-clock 17 min instead of an all-in-parallel that would have OOM'd.
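
The hard rule above can be sketched as a small admission check. This is a minimal illustration, not the dispatcher's actual code: the weights are taken from the table, while `AgentKind`, `canLaunch`, and `planWaves` are hypothetical names, and the greedy packing in `planWaves` is one simple strategy among several.

```typescript
// Hypothetical sketch of weighted-slot admission; names are illustrative.
type AgentKind =
  | "fix-author"
  | "fix-author-large"
  | "test-executor"
  | "research";

// Weights copied from the table in this section.
const WEIGHTS: Record<AgentKind, number> = {
  "fix-author": 1.0,
  "fix-author-large": 1.5,
  "test-executor": 1.5,
  research: 0.5,
};

const CEILING = 6.0;

function totalWeight(agents: AgentKind[]): number {
  return agents.reduce((sum, kind) => sum + WEIGHTS[kind], 0);
}

// True only if launching `next` keeps the running sum at or below the ceiling.
function canLaunch(running: AgentKind[], next: AgentKind): boolean {
  return totalWeight(running) + WEIGHTS[next] <= CEILING;
}

// Greedily pack a queue into serial waves whose weight-sum stays <= CEILING.
function planWaves(queue: AgentKind[]): AgentKind[][] {
  const waves: AgentKind[][] = [];
  let current: AgentKind[] = [];
  for (const kind of queue) {
    if (current.length > 0 && !canLaunch(current, kind)) {
      waves.push(current);
      current = [];
    }
    current.push(kind);
  }
  if (current.length > 0) {
    waves.push(current);
  }
  return waves;
}
```

Applied to the Incident 2 mix (four Fix-Authors plus two Test-Executors), the flat count is 6 but the weighted sum is 4 × 1.0 + 2 × 1.5 = 7.0, so the second Test-Executor would be refused and deferred to a later wave. The greedy packer fills each wave up to the ceiling; the waves actually run in this session were deliberately smaller than that.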

---

## The CloudPage TypeScript regression — the silent multi-hour stall

Timeline:

| Time (UTC) | Event |
|---|---|
| ~22:00 | Wave 5 (#1605) merges. Chart 1.4.154 → 1.4.155. UI image tag updated in chart values. |
| ~22:05 | "Build & Deploy Catalyst" workflow fires for #1605's HEAD SHA. UI Docker build fails with TypeScript error in `CloudPage.tsx` (`kindCounts` literal-union missing two kinds from Family E's Compliance UI #1602). Workflow result = failure. No ghcr.io tag pushed. **No human checks the workflow result.** |
| ~22:15 → ~01:30 | Wave 6 BSS pages (#1606-#1614) all merge with their own deploy-collector commits. Each collector bumps the UI image tag in chart values to the latest SHA. The UI image at every one of those SHAs **does not exist on ghcr.io** because the build keeps failing for the same `kindCounts` reason. |
| ~01:30 | Verifier prov spun up. Catalyst-UI Pod stuck `ImagePullBackOff` on every fresh prov in the window. |
| ~01:45 | Root cause traced: `gh run list` shows the consecutive "Build & Deploy Catalyst" failures, all citing TS2322 on `CloudPage.tsx:284`. |
| ~02:00 | #1616 ships the type fix. |
| ~02:15 | #1617 bumps chart 1.4.155 → 1.4.156 to roll all pods on a fresh image. |

**Cost:** 3.5 hours of "convergence is broken on fresh prov, why?" investigation that turned out to be a single missing union member.

**Mitigation now baked in:** After every PR that merges, run

```bash
gh run list --workflow="Build & Deploy Catalyst" --limit 3 --json conclusion,headSha,headBranch
```

and confirm the latest matching SHA has `conclusion=success` before bumping chart values that reference that SHA. The CI status check is now a hard prerequisite for any chart-values commit, not an after-the-fact verification.
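
The check can also be made mechanical. The sketch below assumes the JSON printed by the command above is captured as a string and that runs are listed newest first; `WorkflowRun` and `imageBuildSucceeded` are hypothetical names, not part of any existing tooling in this repo.

```typescript
// Hypothetical gate: may a chart-values bump reference `sha`?
// Input is the JSON array printed by
//   gh run list --workflow="Build & Deploy Catalyst" --json conclusion,headSha,headBranch
interface WorkflowRun {
  conclusion: string; // "success", "failure", ...; empty while still running
  headSha: string;
  headBranch: string;
}

function imageBuildSucceeded(ghJson: string, sha: string): boolean {
  const runs = JSON.parse(ghJson) as WorkflowRun[];
  const matching = runs.filter((run) => run.headSha === sha);
  // No run for this SHA means no image was built for it: treat as failure.
  if (matching.length === 0) {
    return false;
  }
  // Assumes newest-first ordering, so the first match is the latest run.
  return matching[0].conclusion === "success";
}
```

Treating "no run found" as a failure is the conservative choice here: with a small `--limit`, an older SHA may simply have scrolled out of the window, and in that case the bump should wait for a wider query rather than proceed.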

---

## The broker publish/consume split (PR #1626)

The SME (Service Management Engine) plane has two services that emit events: the tenant service emits `catalyst.tenant.created` / `catalyst.tenant.updated`, and the billing service emits `catalyst.order.placed` / `catalyst.invoice.paid`. Pre-#1626, both used Redpanda dispatchers, but Redpanda was only installed as a dev convenience; on Sovereign we standardised on NATS JetStream (per ADR-0001 §6).

PR #1626 (`fix(sme): wire tenant + billing event dispatchers to NATS`) replaces the Redpanda dispatcher with the NATS one. **This is the publish leg only.** Events now flow:

```
tenant-service ─POST(catalyst.tenant.created)─> NATS JetStream <subject>
billing-service ─POST(catalyst.order.placed)─> NATS JetStream <subject>
```

The **consume leg** is still polled-over-API:

- The Organization controller (`core/controllers/organization`) polls Catalyst-API `/api/v1/orgs/pending` every 5s to find new tenants.
- The Sandbox controller (`products/sandbox`, shipped this session in #1615/#1622) polls similarly.

The follow-up PR (queued, not in this session) wires both controllers to **subscribe** to NATS:

```
NATS catalyst.tenant.created ─> Org controller (Reconcile + Spawn vcluster)
NATS catalyst.order.placed ─> Sandbox controller (Reconcile + Spawn Sandbox)
```

This is gated on **D35** in the new DoD additions.
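
To make the publish/consume distinction concrete, here is a toy in-memory model. It is deliberately NOT the NATS client API: `ToyBroker` and its methods are hypothetical, and the only thing it demonstrates is that a publish with no subscriber produces no reconcile.

```typescript
// Toy in-memory broker (not NATS); illustrates publish leg vs consume leg.
type Handler = (payload: string) => void;

class ToyBroker {
  private handlers: Record<string, Handler[]> = {};

  // Deliver to current subscribers; return how many consumers saw the event.
  publish(subject: string, payload: string): number {
    const subs = this.handlers[subject] || [];
    subs.forEach((handle) => handle(payload));
    return subs.length;
  }

  subscribe(subject: string, handler: Handler): void {
    const subs = this.handlers[subject] || [];
    subs.push(handler);
    this.handlers[subject] = subs;
  }
}

const broker = new ToyBroker();
const reconciled: string[] = [];

// State after #1626: the publish leg works, but no controller is subscribed.
const consumersBefore = broker.publish("catalyst.tenant.created", "acme");

// The queued follow-up: a controller subscribes and reconciles on each event.
broker.subscribe("catalyst.tenant.created", (slug) => {
  reconciled.push(slug);
});
const consumersAfter = broker.publish("catalyst.tenant.created", "globex");
```

In this toy the first event is simply lost. A persistent JetStream stream may let a late consumer replay earlier events, depending on stream and consumer configuration, but the round-trip gate still needs a subscriber to exist before it can pass.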

---

## Remaining convergence gaps after this session

1. **newapi Sovereign-side auth.**
   #1619 + #1631 shipped the newapi proxy on the Sovereign and wired Bank Dhofar Qwen as a backend. Identity is org-scoped JWT (HS256, minted by `core/services/auth`). The Sovereign-side ingress for `newapi.<fqdn>` is up but the **JWT validation** on the Sovereign side currently trusts any HS256 with the right `iss`; the matching key-rotation flow + JWKS endpoint are NOT yet shipped. Untested on a fresh prov.

2. **vCluster CRD on the Sovereign.**
   #1624 installs the vcluster CRDs + controller on the Sovereign at bootstrap. But the controller's RBAC has not been audited for Sovereign-vs-mothership scope, and the Organization controller's reconcile still references the mothership vcluster API in two paths (`organization_controller.go:312`, `organization_controller.go:478`). Will block D29 zero-touch on the first tenant-create on a fresh Sovereign.

3. **Sandbox-marketplace wiring.**
   The Sandbox CRD (#1615) and controller (#1622) exist. The Sandbox UI (#1621) is mounted at `/console/sandbox`. But the marketplace **does not list Sandbox as a purchasable product yet**: there is no `marketplace-entry` YAML for it. End-to-end "tenant clicks 'Buy Sandbox' → controller spawns a Sandbox" is not wired. This is the natural test for D32 + D33.

4. **D31 active-hot-standby tenant install end-to-end walk.**
   Chart renders correctly (verified via `helm template` on 2026-05-17, memo `feedback_d31_chart_verification.md`). But the tenant-marketplace WordPress entry that wires `pg.activeHotStandby.enabled=true` into the install payload has not been shipped. A tenant on the Sovereign cannot today pick "WordPress with cross-region replication", only "WordPress" with a single CNPG instance.

5. **D35 NATS broker round-trip.**
   PR #1626 closes the publish leg. The consume leg is open (see above). Convergence is not declared until a fresh prov can issue a voucher, redeem it, and the resulting `catalyst.tenant.created` event reaches the Org controller via NATS (not via polling) within 2s.

---

## PR roster (this session)

```
#1597 fix(ui): D17 — /cloud?view=list&kind=<X> no longer redirects to /dashboard
#1598 feat(ui): Family F — BSS in Sovereign Console
#1599 fix(catalyst-api+ui): Family D — treemap fan-out
#1600 fix(ui): Family C — ResourceDetailPage real data + tab nav
#1601 fix(multi): Family G — 6 singletons
#1602 feat(ui+api): Family E — Compliance UI (Kyverno+Falco+SBOM)
#1603 fix(catalyst-api+ui): Family B — AppDetail status sync
#1604 chore(release): chart 1.4.153→1.4.154
#1605 feat(ui): Wave 5 — UX polish + chart 1.4.155
#1606 feat(ui): Wave 6 PR 1 — BSS native landing
#1607 feat(ui): Wave 6 PR 3 — BSS Orders (superseded)
#1608 feat(ui): Wave 6 PR 3 — BSS Orders native
#1609 feat(ui): Wave 6 PR 5 — BSS Vouchers native
#1610 fix(bootstrap-kit): remove bp-hcloud-csi slot 17a (Wave 7)
#1611 feat(ui): Wave 6 PR 4 — BSS Revenue (superseded)
#1612 feat(ui): Wave 6 PR 2 — BSS Billing native
#1613 feat(ui): Wave 6 PR 4 — BSS Revenue native
#1614 feat(ui): Wave 6 PR 6 — BSS Tenants native
#1615 feat(sandbox): Wave 1 step 1 — CRD sandbox.openova.io/v1.Sandbox
#1616 fix(ui): Wave 8 hotfix — CloudPage kindCounts (unblocks UI build)
#1617 chore(release): chart 1.4.155→1.4.156
#1618 feat(sandbox): Wave 2 — pty-server + openova-sandbox-mcp scaffold
#1619 feat(sandbox+auth+newapi): Wave 1b — newapi proxy + BYOS + org-scoped JWT
#1621 feat(ui): Wave 3 — Sandbox UI (landing + session host + BYOS settings)
#1622 feat(sandbox): Wave 1 — controller + chart scaffold
#1623 fix(marketplace): persist subdomain TLD across wizard steps
#1624 fix(bootstrap-kit): install vcluster CRDs + controller on Sovereign
#1625 fix(catalyst-api): wire /api/v1/sme/billing/vouchers proxy
#1626 fix(sme): wire tenant + billing event dispatchers to NATS
#1627 fix(marketplace): post-purchase redirect to Sovereign-local console
#1628 fix(billing): skip Stripe when voucher covers 100%
#1629 fix(domain): per-tenant DNS reconciler — <slug>.<pool-domain> → Sovereign LB
#1630 fix(catalyst-api): mint HS256 token on SME proxy calls
#1631 feat(sandbox+bootstrap-kit): newapi Sovereign install (Bank Dhofar Qwen)
#1632 ci(sandbox): build workflows for controller + pty-server + mcp-server
```

---

## Lessons (now pinned in memory)

1. **Always check Build & Deploy Catalyst CI status after a merge.** A broken UI build does not fail the chart bump; it leaves a dangling image reference that no operator notices until `ImagePullBackOff` on a fresh prov. Mitigation: `gh run list --workflow="Build & Deploy Catalyst" --limit 1 --json conclusion` before any chart-values commit. (Wave 8.)

2. **Weighted slot accounting beats flat agent count.** Test-Executors (Playwright) cost ~1.5× a Fix-Author in RSS. Flat 6-agent ceiling worked for fix-only waves but failed when mixed with verifier walks. Dispatcher now tracks weighted sum ≤ 6.0. (OOM Incident 2.)

3. **Publish leg ≠ consume leg.** Wiring an event publisher to NATS does NOT mean any consumer is listening yet. After every dispatcher rewire, immediately audit (or open a follow-up issue for) each subscriber that should react. The convergence test is round-trip, not one-way. (PR #1626.)

4. **A new CRD/controller without a CI workflow that pushes its image is dead code.** Sandbox W1–W4 sat unusable for hours because no workflow built the three images the chart referenced. (PR #1632 had to backfill this.) When adding a new service, the **first** scaffold commit must include the `.github/workflows/build-<service>.yml`.

5. **Sub-agents inherit the design system — Sandbox UI proved it.** Wave 3 Sandbox UI (#1621) shipped clean PortalShell-wrapped components the first time, because the prompt to the Fix-Author included the design-system inheritance block verbatim. Contrast with the Family F BSS pages earlier in the day that needed a re-roll. (`feedback_subagents_inherit_design_system.md`.)

@@ -19,7 +19,7 @@ Mirrored to auto-memory at `~/.claude/projects/-home-openova-repos-openova-priva

---

-## DoD gates (D0–D31)
+## DoD gates (D0–D35)

Every gate must pass on a SINGLE fresh provision in one continuous run. No partial credit.

@@ -59,6 +59,10 @@ Every gate must pass on a SINGLE fresh provision in one continuous run. No parti
| **D29** | **Voucher-based organization (tenant) provisioning is zero-touch.** Recipient opens the voucher email → clicks redeem link → PIN-login as `demo@openova.io` (or `support@openova.io`) → lands on an organization-creation wizard → completes the form → a new `Organization`/`Tenant` CR is created → tenant namespace + RBAC + bootstrap apps converge → recipient is auto-redirected to their tenant home page. NO operator intervention beyond the voucher email. Added 2026-05-16. | Playwright MCP |
| **D30** | **Free-subdomain selection from operator-curated pool.** Organization wizard step MUST present a subdomain picker populated from the configured pool: `omani.homes`, `omani.rest`, `omani.trades` (and any others the operator has provisioned). Tenant chooses a free subdomain (e.g., `acme.omani.homes`) → cert provisions → tenant landing page resolves on the chosen FQDN with publicly-trusted TLS. The pool MUST come from a Sovereign-side CR/config (not hardcoded). Added 2026-05-16. | Playwright MCP + dig + curl |
| **D31** | **Tenant application with CNPG active-hot-standby replication.** Inside the new tenant, user picks a CNPG-backed app from the marketplace (e.g., Ghost or WordPress) → selects "active hot-standby" → app installs with a CNPG Cluster that replicates across the Sovereign's regions (primary + at least one replica). `kubectl get cluster.postgresql.cnpg.io -A` in the tenant context shows `instances` distributed across regions (region label / topology spread). Failover test: cordoning the primary region brings the replica to primary, app remains reachable on its FQDN within the documented RTO. Added 2026-05-16. | Playwright MCP + kubectl + curl |
| **D32** | **Sandbox CRD installable on the Sovereign.** `kubectl get crd sandboxes.sandbox.openova.io` returns the CRD shipped in PR #1615; the controller Pod (`sandbox-controller` in `catalyst-system`, image built by PR #1632 workflow) is Ready and processes a no-op `Sandbox` CR within 30s (status transitions `Pending → Reconciling → Ready`). `helm template` of the Sovereign chart with sandbox enabled emits the controller Deployment + RBAC + Service from PR #1622's templates. **The Sandbox plane is part of every Sovereign by default — operator does not opt in.** Added 2026-05-18. | kubectl + helm template |
| **D33** | **Sandbox agent catalogue picker functional.** Sovereign Console `/console/sandbox` (UI shipped in PR #1621) lists at minimum the six agents specified in `products/sandbox/docs/architecture.md` — Claude Code, Cursor (cloud), Qwen Code, Aider, OpenCode, plus the Sovereign-native shell. Picking an agent opens a session host page; the BYOS settings page lets the operator paste an Anthropic OAuth client_id (per `products/sandbox/docs/claude-code-byos.md`). A picked session establishes a WebSocket to the pty-server (PR #1618) and renders xterm.js with a live PTY prompt. Added 2026-05-18. | Playwright MCP |
| **D34** | **newapi Sovereign-side LLM gateway routes to a backend model.** `https://newapi.<fqdn>/v1/chat/completions` (newapi install shipped via PR #1631) accepts an HS256 org-scoped JWT (issued by `core/services/auth` per PR #1619), authenticates the request, and proxies to a configured backend. The reference backend for this gate is **Bank Dhofar Qwen** (wired in PR #1631). A round-trip `curl` with a valid JWT returns a non-empty `choices[0].message.content` within 30s. **No Anthropic / OpenAI cloud calls leave the Sovereign by default** — BYOS is opt-in per-user. Added 2026-05-18. | curl + kubectl |
| **D35** | **NATS broker round-trips `catalyst.tenant.created` + `catalyst.order.placed` end-to-end.** SME tenant + billing dispatchers PUBLISH to NATS JetStream (publish leg shipped in PR #1626 — confirm subjects `catalyst.tenant.created`, `catalyst.tenant.updated`, `catalyst.order.placed`, `catalyst.invoice.paid` are observed via `nats sub 'catalyst.>'`). Organization controller + Sandbox controller CONSUME (consume leg pending; this gate stays RED until the matching subscribe-side PR lands). Round-trip test: issue a voucher → redeem it → measure latency from billing-service publish to Org controller reconcile-start ≤ 2s. **Convergence is NOT declared until both legs are wired** — polling-the-API workaround does not satisfy this gate. Added 2026-05-18. | NATS CLI + kubectl logs |

> **DoD grows.** Every iteration of test-writer/test-executor finds more operator-visible bugs. Append the gate, ship the fix, re-validate. The list is the convergence contract; do not declare convergence until every appended gate passes on a single fresh prov.

@@ -82,5 +86,5 @@ If any of these appear in your reasoning → STOP, re-read this file, fix the ro
Before any `tofu apply` or `POST /api/v1/deployments`:

1. Read this file (or the memory mirror).
-2. Log the D1–D14 list to the loop output.
-3. Refuse to mark convergence until each D1–D14 has been individually checked.
+2. Log the D0–D35 list to the loop output.
+3. Refuse to mark convergence until each D0–D35 has been individually checked.

@@ -1,6 +1,8 @@
-# OpenOva Sandbox (design)
+# OpenOva Sandbox

-**Status:** Design. Not yet implemented. **Created:** 2026-05-15.
+**Status:** Wave 1-5 implementation in flight (PRs **#1615 / #1618 / #1619 / #1621 / #1622 / #1632** merged; runtime smoke pending fresh prov). **Created:** 2026-05-15. **Implementation started:** 2026-05-17.

+> **Founder TODO:** Register an Anthropic OAuth client_id for the BYOS Claude Code flow per [`docs/claude-code-byos.md`](docs/claude-code-byos.md), and paste it into the Sovereign Console BYOS settings (or set `SANDBOX_ANTHROPIC_OAUTH_CLIENT_ID` on the controller Deployment). The Sandbox controller looks up the value via env-var; everything else around it is already scaffolded.

OpenOva Sandbox is the per-user, per-Organization coding-agent plane that runs **inside** every OpenOva Sovereign. It hosts long-lived sessions of the agents developers already use (Claude Code, Cursor, Qwen Code, Aider, Opencode) — server-side, cluster-aware, identity-scoped — and surfaces them through a native terminal in the browser plus a card-stream view on mobile, both backed by the same persistent process.