wave6-fix-bss-vouchers
654 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
2164ce2608 |
Merge remote-tracking branch 'origin/main' into wave6-fix-bss-vouchers
# Conflicts: # products/catalyst/bootstrap/ui/src/lib/bss.api.ts |
||
|
|
5c91196952 |
feat(ui): Wave 6 PR 5 — BSS Vouchers native (drops iframe, table + Issue modal)
Replaces the BssSectionShell iframe wrapper at /bss/vouchers with a NATIVE React surface sharing the same PortalShell chrome as BssLandingPage (Wave 6 PR 1, #1606), JobsPage, AppsPage, SettingsPage. Per the founder "big picture" ruling on Wave 6 sub-agent UI work — inherit the design system, no bespoke chrome, no hex colours, no new card components.
Surface:
- Header tagline + filter row (search + status dropdown + "+ Issue voucher" CTA).
- Table columns: Code | Recipient | Plan | Value | Status pill | Issued | Expires | Redeemed by. Recipient/Plan/Expires render as em-dashes until the BE persists those fields — target-state columns are present from first paint per INVIOLABLE-PRINCIPLES.md #1.
- Row drill-in drawer with Revoke action (destructive lives inside the drill-in per founder ruling, never on list rows).
- Issue voucher modal that mirrors ParentDomainsPage's AddDomainModal chrome verbatim (panel layout, label rhythm, Cancel/Submit footer, accent submit) — POSTs /v1/sme/billing/vouchers/issue with code, credit_omr, description, max_redemptions, recipient_email.
- Status pill family — emerald (active) / zinc (inactive) / amber (exhausted) / rose (revoked) — same palette ParentDomainsPage uses for its FlipStatusBadge.
API wiring (bss.api.ts):
- Voucher / VoucherStatus / IssueVoucherRequest typed wire shapes matching core/services/billing/store.PromoCode snake_case json tags.
- voucherStatus() derives the pill from row fields (no server round-trip per filter).
- listVouchers, issueVoucher, revokeVoucher typed wrappers against /v1/sme/billing/vouchers/{list,issue,revoke/{code}}. Errors throw with the BE's detail/error field so the operator sees the actual registrar message inline.
All colour tokens are var(--color-*) or the four approved Tailwind status families (emerald / amber / rose / zinc) plus red-500/* for error banners (same family AddDomainModal uses). No hex literals.
Links to Wave 6 PR 1 (#1606). |
||
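Editorial sketch for 5c91196952: the message says `voucherStatus()` derives the pill from row fields, so filtering needs no server round-trip. The block below is one way that could look. The four status names come from the commit message; the `Voucher` field names (`active`, `revoked_at`, `times_redeemed`) and the precedence order are assumptions for illustration, not the real `PromoCode` wire shape.

```typescript
// Status names as listed in the commit message.
type VoucherStatus = "active" | "inactive" | "exhausted" | "revoked";

// Hypothetical row shape: field names are assumptions, not the real json tags.
interface Voucher {
  code: string;
  active: boolean;
  revoked_at: string | null;
  max_redemptions: number; // assumed: 0 means "unlimited"
  times_redeemed: number;
}

// Derive the pill purely from fields already present on the row.
// Assumed precedence: revoked > exhausted > inactive > active.
function voucherStatus(v: Voucher): VoucherStatus {
  if (v.revoked_at !== null) return "revoked";
  if (v.max_redemptions > 0 && v.times_redeemed >= v.max_redemptions) {
    return "exhausted";
  }
  return v.active ? "active" : "inactive";
}

// Client-side status filter: no extra request per dropdown change.
function filterByStatus(
  rows: Voucher[],
  wanted: VoucherStatus | "all",
): Voucher[] {
  if (wanted === "all") return rows;
  return rows.filter((v) => voucherStatus(v) === wanted);
}
```

Because the status is a pure function of the row, the same helper can drive both the pill colour and the status dropdown.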
|
|
4a4ffa34ab
|
feat(ui): Wave 6 PR 3 — BSS Orders native (drops iframe) (#1608)
* feat(ui): Wave 6 PR 3 — BSS Orders native (drops iframe)
Replaces the BssSectionShell iframe at /console/bss/orders with a
native React table that mirrors JobsTable's shape: toolbar (search +
status + age dropdowns) → scrollable table (Order ID | Tenant org |
Product | Status | Created | Last update | Total) → row click to
drill-in (TODO Link to /bss/orders/{id}, route added in a follow-up).
Inherits the parent app's design system per Wave 6 brief +
feedback_subagents_inherit_design_system.md:
- PortalShell wrapper with `← Back to BSS overview` header slot
(mirrors BssSectionShell verbatim so the page reads as a sibling
of /bss/{billing,revenue,vouchers,tenants})
- Design tokens only (var(--color-bg-2), var(--color-border),
var(--color-text), var(--color-text-dim), var(--color-text-strong),
var(--color-accent), var(--color-surface), var(--color-success),
var(--color-error))
- amber-* exception ONLY for the documented "API pending" pill
(verbatim copy from BssLandingPage + SettingsPage); no rose
- No hex colours; no bespoke Tailwind colour families
- Empty / loading / API-pending states mirror JobsTable +
ParentDomainsPage + BssLandingPage
API plumbing:
- lib/bss.api.ts: added Order / OrderStatus / OrdersResponse types
and getOrders() that fetches /api/v1/sme/orders and tolerates
404 / 5xx / network error by returning {pendingApi:true, orders:[]}
so the full table chrome paints on first load with the "API
pending" pill (per INVIOLABLE-PRINCIPLES.md #1).
- No BE handler added; the FE-only stub matches getBssOverview's
pattern and was explicitly OPTIONAL in the Wave 6 brief.
Verification:
- tsc -b --noEmit: my files clean (28 pre-existing errors elsewhere:
CloudPage CloudListKind drift + openova-flow workspace types,
all unrelated to this PR).
- Color audit grep: returns only the documented amber-500/* and
amber-300 used by the API-pending pill.
- Side-by-side render with JobsPage: same PortalShell chrome, same
toolbar shape, same table column treatment.
Links Wave 6 PR 1 (#1606).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(api): Wave 6 PR 3 — BSS Orders BE stub (GET /api/v1/sme/orders → [])
Companion to the FE-side OrdersPage (commit
|
||
|
|
239eb4fffd
|
feat(ui): Wave 6 PR 3 — BSS Orders native (drops iframe) (#1607)
Replaces the BssSectionShell iframe at /console/bss/orders with a
native React table that mirrors JobsTable's shape: toolbar (search +
status + age dropdowns) → scrollable table (Order ID | Tenant org |
Product | Status | Created | Last update | Total) → row click to
drill-in (TODO Link to /bss/orders/{id}, route added in a follow-up).
Inherits the parent app's design system per Wave 6 brief +
feedback_subagents_inherit_design_system.md:
- PortalShell wrapper with `← Back to BSS overview` header slot
(mirrors BssSectionShell verbatim so the page reads as a sibling
of /bss/{billing,revenue,vouchers,tenants})
- Design tokens only (var(--color-bg-2), var(--color-border),
var(--color-text), var(--color-text-dim), var(--color-text-strong),
var(--color-accent), var(--color-surface), var(--color-success),
var(--color-error))
- amber-* exception ONLY for the documented "API pending" pill
(verbatim copy from BssLandingPage + SettingsPage); no rose
- No hex colours; no bespoke Tailwind colour families
- Empty / loading / API-pending states mirror JobsTable +
ParentDomainsPage + BssLandingPage
API plumbing:
- lib/bss.api.ts: added Order / OrderStatus / OrdersResponse types
and getOrders() that fetches /api/v1/sme/orders and tolerates
404 / 5xx / network error by returning {pendingApi:true, orders:[]}
so the full table chrome paints on first load with the "API
pending" pill (per INVIOLABLE-PRINCIPLES.md #1).
- No BE handler added; the FE-only stub matches getBssOverview's
pattern and was explicitly OPTIONAL in the Wave 6 brief.
Verification:
- tsc -b --noEmit: my files clean (28 pre-existing errors elsewhere:
CloudPage CloudListKind drift + openova-flow workspace types,
all unrelated to this PR).
- Color audit grep: returns only the documented amber-500/* and
amber-300 used by the API-pending pill.
- Side-by-side render with JobsPage: same PortalShell chrome, same
toolbar shape, same table column treatment.
Links Wave 6 PR 1 (#1606).
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
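Editorial sketch for the Orders commits above (#1607 / #1608): the message describes `getOrders()` as tolerating 404 / 5xx / network errors by returning `{pendingApi:true, orders:[]}` so the table chrome still paints. A minimal version of that pattern might look like the following. The `Order` fields and the injected `FetchLike` parameter are assumptions added to keep the sketch self-contained; the real function presumably calls `fetch` directly.

```typescript
// Hypothetical minimal row type; the real Order has more fields.
interface Order {
  id: string;
  status: string;
}

interface OrdersResponse {
  pendingApi: boolean;
  orders: Order[];
}

// Narrow stand-in for fetch so the sketch runs without a browser (assumption).
type FetchLike = (
  url: string,
) => Promise<{ ok: boolean; json(): Promise<unknown> }>;

async function getOrders(fetchImpl: FetchLike): Promise<OrdersResponse> {
  const pending: OrdersResponse = { pendingApi: true, orders: [] };
  try {
    const res = await fetchImpl("/api/v1/sme/orders");
    if (!res.ok) return pending; // 404 / 5xx: render the "API pending" pill
    const body = await res.json();
    // The BE stub in the commit returns a bare array.
    const orders = Array.isArray(body) ? (body as Order[]) : [];
    return { pendingApi: false, orders };
  } catch {
    return pending; // network error: same fallback, never throws to the page
  }
}
```

The caller never needs a separate error branch: it renders the same table for every outcome and shows the pill when `pendingApi` is true.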
|
|
393116355d
|
feat(ui): Wave 6 PR 1 — BSS native landing (Option B step 1, kills iframe seam) (#1606)
Replaces Family F's bespoke BssLayout + iframe approach with a native React /bss landing page using the existing Dashboard KPI card chrome. Per-section pages (Billing/Orders/Revenue/Vouchers/Tenants) keep their iframe content for now (PRs 2-6 native-port them); they wrap directly in PortalShell via BssSectionShell instead of BssLayout so the chrome matches the rest of the app.
Founder UX review (2026-05-17) flagged Family F BSS as visually clashing. Per feedback_subagents_inherit_design_system.md:
- PortalShell wrapper (same as JobsPage/AppsPage/SettingsPage)
- KPI cards copied from Dashboard/SettingsPage SectionCard chrome
- Design tokens only (var(--color-*)); no hex; no bespoke Tailwind colors
- No new bespoke components
BssLayout.tsx deleted. Router rewired so /bss → BssLandingPage and each section is a sibling route under consoleLayoutRoute (no shared layout wrapper).
API shim lib/bss.api.ts fetches /api/v1/sme/bss/overview with zero-filled fallback + pendingApi flag so the landing always renders its full target-state shape on first paint.
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
bf5002ccf0
|
feat(ui): Wave 5 — UX polish (sidebar reorder + BSS icon + marketplace as SettingsCard) + chart 1.4.155 (#1605)
Founder UX-polish review (2026-05-17, post Wave-2 collector). Three
distinct fixes the founder flagged:
1. Sidebar order followed no logic — random walk Apps/Jobs/Dashboard/
Cloud/Users/BSS. Reordered to operator mental model:
Dashboard → Cloud → Apps → Jobs → Users → BSS → Settings
2. BSS icon was a bespoke receipt glyph that didn't match the line-
glyph family. Swapped to a briefcase glyph fitting stylistically.
3. Marketplace toggle was a dedicated /settings/marketplace page +
Settings sub-nav child. Founder: "if market place is just a toggle
... it should be ... similar to other setting". Refactored into
SettingsPage SectionCard anchor (id=marketplace, same as #dns).
MarketplaceSettings.tsx + .test.tsx + route + sub-nav child deleted.
Save flow unchanged: POSTs /api/v1/sovereigns/{id}/marketplace.
Chart 1.4.154 → 1.4.155 + bootstrap-kit pin bump per the
chart-bump-needs-both-files rule.
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
a44df200d5
|
fix(catalyst-api+ui): Family B — AppDetail status sync (HR→UI wire + correct ns/label) (#1603)
Closes founder bug #4 cluster (5 FAILs from t10):
- C4-003: HR Ready=True but AppDetail shows phase=Provisioning
- C4-004: Bootstrap apps show literal "Catalog Status Unavailable"
- C4-005: Resources tab queries wrong ns ("default") + wrong label
- C4-007: Logs tab same wrong-ns + wrong-label as Resources
- C4-013: D19 violation — Deployments=44 ≠ Catalog=59 ≠ HR=48/48
Root cause: AppDetail and its Resources/Logs sub-tabs assumed the Application CR is the sole source of truth for phase, ns, and label. On chroot Sovereigns:
(a) bootstrap-kit installs (bp-cilium, bp-alloy, bp-cert-manager, etc.) ship as HelmReleases with NO companion Application CR,
(b) the catalyst-controller lags writing status.phase, so the CR sits at "Provisioning" long after the HR has flipped Ready=True,
(c) the workload's actual namespace is HR.spec.targetNamespace ("alloy/", "cert-manager/", "kube-system/") not the CR's own namespace (always "default" on the synth fallback).
Fix (extends PR L #1592 HR-fallback baseline):
- catalyst-api: HandleApplicationGet now overlays HR Ready=True onto a stale CR phase; surfaces targetNamespace, releaseName, and the install label selector so the SPA queries the actual install location with the correct identity label. New helper helmReleaseReadyByName() reuses the chroot k8sCache path that PR L established (so multi-region D16 fan-out is covered).
- catalyst-api: synthesiseAppFromHelmRelease now emits bootstrap=true, targetNamespace, releaseName, and a chart-name based selector (`app.kubernetes.io/name=<chart>`, the upstream Helm standard) so bootstrap-kit tabs find the real pods.
- catalog.api.ts: extends ApplicationDetailResponse with targetNamespace, releaseName, installLabelSelector, bootstrap, hrReady, phaseFromCR (telemetry for the D19 source-counter chip).
- AppDetail.tsx (lines 1-700): wires appTargetNamespace + appInstallLabelSelector into ResourcesTab + LogsTab; renders a "source: HelmRelease | Application CR (HR-overlayed; CR=<phase>)" D19 source chip so the operator sees which object the phase comes from per-app; PublishToggleChip renders "Bootstrap blueprint (not in marketplace)" for bootstrap apps instead of misleading "Catalog status unavailable", and also treats a /catalog/apps/<slug> 404 on a non-bootstrap app as a bootstrap-like (no toggle) rather than an error chip.
- ResourcesTab.tsx + LogsTab.tsx: accept a labelSelector prop instead of hard-baking `instance=<applicationName>`; query keys updated; filter banners + empty-state copy now show the actual selector.
Tests: tsc -b --noEmit clean across the workspace. Existing AppDetail/AppsPage unit tests have pre-existing failures unrelated to this change (confirmed by re-running on stashed baseline) — no new failures introduced. ResourcesTab/LogsTab have no targeted unit tests; the matrix Playwright walkthrough is the verification surface on the next prov.
Files (read-only on the rest of the codebase per Family B brief):
- products/catalyst/bootstrap/api/internal/handler/applications.go
- products/catalyst/bootstrap/ui/src/lib/catalog.api.ts
- products/catalyst/bootstrap/ui/src/pages/sovereign/AppDetail.tsx
- products/catalyst/bootstrap/ui/src/pages/sovereign/AppDetail/LogsTab.tsx
- products/catalyst/bootstrap/ui/src/pages/sovereign/AppDetail/ResourcesTab.tsx
NOT touched: ComplianceTab.tsx (Family E), router.tsx (Wave 1), Dashboard.tsx (Family D), ResourceDetailPage.tsx (PR #1600 Family C).
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
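Editorial sketch for a44df200d5: the fix overlays HR Ready=True onto a stale CR phase and makes the Resources/Logs tabs query the install's real namespace and selector. The two helpers below illustrate that idea on the SPA side. The field names `targetNamespace`, `installLabelSelector` and `hrReady` come from the commit message; the `namespace` field, the literal phase string `"Ready"`, and both helper names are assumptions.

```typescript
// Subset of the detail response; `namespace` is an assumed field.
interface AppDetail {
  applicationName: string;
  namespace: string; // the Application CR's own namespace
  phase: string; // status.phase as written by the controller
  hrReady?: boolean;
  targetNamespace?: string;
  installLabelSelector?: string;
}

// Prefer the HelmRelease's readiness when the CR phase lags behind.
// Assumption: "Ready" is the phase string used for a healthy app.
function effectivePhase(d: AppDetail): { phase: string; overlaid: boolean } {
  if (d.hrReady === true && d.phase !== "Ready") {
    return { phase: "Ready", overlaid: true };
  }
  return { phase: d.phase, overlaid: false };
}

// Where the Resources/Logs tabs should look for the workload.
function queryTarget(d: AppDetail): { namespace: string; labelSelector: string } {
  return {
    // HR.spec.targetNamespace wins over the CR namespace when present.
    namespace: d.targetNamespace || d.namespace,
    // Fall back to the previously hard-baked selector from the commit text.
    labelSelector: d.installLabelSelector || `instance=${d.applicationName}`,
  };
}
```

The `overlaid` flag is what a source chip could key on to show that the phase came from the HelmRelease rather than the CR.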
|
|
c2df9ff287
|
feat(ui+api): Family E — Compliance UI (Kyverno + Falco + SBOM + framework filter) (#1602)
Wave-2 Family-E (#1583) closes 7 t10 FAILs on the Compliance surface (/tmp/t10-results-agent-D.jsonl C11-003/005/006/007/008/009/010):
C11-003: Policy drilldown was 404'ing on Kyverno ClusterPolicies that exist on the cluster but weren't cached by the aggregator. Add GET /api/v1/sovereigns/{id}/compliance/policies/{name} that reads the live ClusterPolicy directly; PolicyDrilldownPage falls back to it after the bulk getPolicies() miss.
C11-005, C11-006: /cloud?view=list&kind=policyreports now registered as a first-class CloudListKind (and clusterpolicyreports too) with a dedicated PolicyReportsListPage / ClusterPolicyReportsListPage wrapper. Removed the silent →configmaps alias that was hiding the architecture gap. Reads from the catalyst-api k8scache registry which already has both GVRs (kinds.go).
C11-007: AppDetail Compliance tab now falls through to the LIVE violations endpoint (/compliance/violations?app=<name>) when the scorecard rollup is empty — operator sees real Kyverno PolicyReport entries grouped by policy, not the placeholder.
C11-008: Falco runtime alerts: new GET /compliance/falco endpoint reads Falcosidekick → k8s Events; new FalcoAlerts widget renders them with priority chips. New RuntimeAlertsPage mounted at /admin/compliance/runtime + /compliance/runtime (both previously 404). Also embedded in SRE / Security dashboards.
C11-009: Regulatory-framework chip strip (PCI / ISO27001 / SOC2 / GDPR / HIPAA / DORA / NIS2 / FedRAMP) wired into SREDashboardPage. Multi-select + URL deep-link (?framework=pci,iso27001). Single source of truth in COMPLIANCE_FRAMEWORKS.
C11-010: Per-Pod SBOM + CVE tab on ResourceDetailPage. New SBOM tab in RESOURCE_DETAIL_TABS; SBOMTab widget reads new GET /compliance/sbom?ns=<ns>&pod=<pod> which projects Trivy VulnerabilityReport + SBOMReport CRs into a structured per-Container severity + component list. Cluster-wide rollup at /compliance/sbom/summary.
All clusters READ-ONLY. No Chart.yaml or bootstrap-kit pin bumps.
tsc -b --noEmit: clean.
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
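Editorial sketch for C11-009 in c2df9ff287: the framework chip strip is multi-select and deep-links via `?framework=pci,iso27001`, with `COMPLIANCE_FRAMEWORKS` as the single source of truth. The helpers below show one way to round-trip that param. The ids `pci` and `iso27001` appear in the commit; the other lowercase ids and all three function names are assumptions.

```typescript
// Single source of truth; ids other than "pci" and "iso27001" are assumed spellings.
const COMPLIANCE_FRAMEWORKS = [
  "pci", "iso27001", "soc2", "gdpr", "hipaa", "dora", "nis2", "fedramp",
] as const;
type FrameworkId = (typeof COMPLIANCE_FRAMEWORKS)[number];

// Parse "?framework=pci,iso27001" into known ids, dropping unknown tokens.
function parseFrameworkParam(raw: string | null): FrameworkId[] {
  if (!raw) return [];
  const wanted = raw.split(",").map((s) => s.trim().toLowerCase());
  // Filtering the canonical list keeps a stable order and removes duplicates.
  return COMPLIANCE_FRAMEWORKS.filter((id) => wanted.indexOf(id) !== -1);
}

// Toggle one chip, returning a new selection in canonical order.
function toggleFramework(selected: FrameworkId[], id: FrameworkId): FrameworkId[] {
  const next =
    selected.indexOf(id) !== -1
      ? selected.filter((x) => x !== id)
      : selected.concat([id]);
  return COMPLIANCE_FRAMEWORKS.filter((x) => next.indexOf(x) !== -1);
}

// Serialise back to the comma-separated URL value.
function serialiseFrameworkParam(selected: FrameworkId[]): string {
  return selected.join(",");
}
```

Deriving both directions from the one constant means a new framework only has to be added in a single place.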
|
|
aa60cfb84e
|
fix(multi): Family G — 6 singletons (C8-001/C8-005/C9-006/C10-002/C10-003/C7-007) (#1601)
Wave 2 Family G batched ship. C7-004 (sso/wiki/workflows/storybook +
registry/api HTTPRoutes) intentionally skipped — sso/wiki/storybook
have no shipped backend; registry (harbor) + api (catalyst-api) HTTPRoutes
already exist and 404 is a runtime/HR-readiness symptom, not a missing
route. Flagged for architect-led ticket rather than silent route-alias
synthesis.
C9-006 — hcloud-volumes StorageClass missing on fresh prov
Root cause: platform/hcloud-csi/chart/ existed but was never wired
into bootstrap-kit, so fresh Sovereigns defaulted PVCs to local-path
(rancher.io/local-path) — node-pinned, can't survive Pod reschedule.
Fix: new slot 17a-bp-hcloud-csi.yaml + chart 1.0.0→1.1.0 bump that
adds templates/hcloud-token-secret.yaml so the controller can
authenticate to Hetzner. Mirrors bp-hcloud-ccm (slot 55) +
bp-cluster-autoscaler-hcloud (slot 50) wiring.
C10-002 — /fleet/applications returns 0 items despite 21 sovereigns
Root cause: collectFleetSovereigns filtered AdoptedAt!=nil (mirrored
ListDeployments). On a steady-state fleet every Sovereign is adopted,
so the dashboard rendered empty despite hundreds of succeeded jobs.
Fix: remove the adopted-filter from collectFleetSovereigns (the
fleet view's whole purpose is to enumerate every provisioned
Sovereign). ListDeployments still applies the filter — it backs the
provisioner's in-flight tab, a different surface. Adopted rows
surface with Health=green when otherwise unknown.
C10-003 — per-region install-* Jobs stuck "pending" despite ready
Root cause: lastState dedup in helmwatch_bridge — secondary
watchers attaching AFTER an HR already settled at Installed never
observed a state transition, so the seed value (HelmStatePending)
never converged. Fix: at markPhase1Done(OutcomeReady), backfill
every secondary watcher's informer snapshot into the shared
jobs.Bridge via the idempotent SeedJobsFromInformerList path.
Runs INLINE (not goroutine) — runPhase1Watch defers
stopSecondaries() which clears dep.secondaryWatchers as soon as
markPhase1Done returns, so a goroutine would race the cleanup.
C7-007 — legacy sovereign-wildcard-tls Cert+Secret pair orphaned
Root cause: PR O moved the Cilium Gateway listener's
certificateRefs to the dashed-suffix per-zone Secret but left the
legacy bare-name Certificate template behind, so cert-manager
kept renewing an orphan. Fix: (a) rename the Certificate +
Secret to the dashed-suffix shape (single-source-of-truth), and
(b) add a one-shot Job (legacy-cert-cleanup) that deletes the
pre-PR-O Cert+Secret pair via alpine/k8s, idempotent for fresh
provs. Removable from kustomization.yaml once every live prov
has reconciled past it.
C8-001 — D22 Settings em-dash placeholders on chroot Sovereign
Root cause: SettingsPage read Capacity / CP size / Pool subdomain /
BYO domain from useWizardStore() (zustand+persist localStorage).
The chroot Sovereign console runs on a fresh browser session
post-handover with empty localStorage, so the four fields rendered
em-dashes. The data IS persisted on the deployment record
(RedactedRequest) — gap was that Deployment.State() never surfaced
it. Fix: lift controlPlaneSize / sovereignPoolDomain /
sovereignSubdomain / sovereignDomainMode / sovereignByoDomain /
regionControlPlaneSizes / orgName / orgEmail to the State() map +
extend DeploymentSnapshot TS type + SettingsPage reads
snapshot-first with wizard store as fallback (mothership wizard-
in-flight case).
C8-005 — D20 Jobs page missing region filter dropdown
Root cause: multi-region Sovereigns expose install-<region>:<chart>
Jobs but JobsTable offered only status / app / parent filters,
forcing operators to type the region key into the free-text search.
Fix: new regionFromJob(job) pure helper parses the canonical
<region>:<chart> appId (fallback: install-<region>:<chart> jobName).
Dropdown is visible only when 2+ regions appear in the current job
set (single-region Sovereigns see no one-option no-op). Sorted
lexically. Test coverage: 4 helper cases + 3 dropdown cases in
JobsTable.test.tsx.
Architect-first compliance:
• bp-hcloud-csi wiring mirrors bp-hcloud-ccm (slot 55) pattern
• legacy-cert-cleanup uses alpine/k8s (NOT bitnami/kubectl — see
self-sovereign-cutover/values.yaml:252 Bitnami-deprecation note)
• alpine/k8s image pulled via harbor.openova.io/proxy-dockerhub
(mirror-everything rule)
• regionFromJob mirrors helmwatch_bridge.go componentID encoding
(3 input shapes: bare, region-prefixed, install-region-prefixed)
• State() snapshot additions stay slim — only the 4 founder-flagged
fields + a few zero-cost adjacents
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
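Editorial sketch for C8-005 in aa60cfb84e: the commit adds a pure `regionFromJob(job)` helper that parses the `<region>:<chart>` appId, falls back to an `install-<region>:<chart>` jobName, and shows the dropdown only when 2+ regions appear. The version below is an illustration under those stated rules. The `JobLike` shape, the `regionOptions` name, and the example region keys are assumptions.

```typescript
// Hypothetical minimal job shape; the real Job type has more fields.
interface JobLike {
  appId?: string;
  jobName?: string;
}

// Return the text before the first ":" or null when there is no region prefix.
function regionPrefix(id: string | undefined): string | null {
  if (!id) return null;
  const idx = id.indexOf(":");
  return idx > 0 ? id.slice(0, idx) : null;
}

// Handles the three shapes named in the commit:
// bare ("bp-cilium"), region-prefixed ("fsn1:bp-cilium"),
// and install-region-prefixed ("install-fsn1:bp-cilium").
function regionFromJob(job: JobLike): string | null {
  const fromAppId = regionPrefix(job.appId);
  if (fromAppId !== null) return fromAppId;
  const name = job.jobName || "";
  const stripped =
    name.indexOf("install-") === 0 ? name.slice("install-".length) : name;
  return regionPrefix(stripped);
}

// Dropdown options: empty unless at least two distinct regions are present.
function regionOptions(jobs: JobLike[]): string[] {
  const seen: string[] = [];
  for (const job of jobs) {
    const region = regionFromJob(job);
    if (region !== null && seen.indexOf(region) === -1) seen.push(region);
  }
  return seen.length >= 2 ? seen.sort() : [];
}
```

Returning an empty list for single-region job sets lets the table hide the dropdown with a simple length check.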
|
|
898305f41e
|
fix(ui): Family C — ResourceDetailPage real data + tab nav (founder bug #5) (#1600)
t10 test agent C2 evidence (10 FAILs in C5):
- /cloud/resource/deployment/catalyst-system/catalyst-api/overview
rendered a 50-item "Resource detail glossary" list + 3 explanatory
paragraphs as VISIBLE body text, with "Loading deployment/catalyst-api…"
never resolving to real K8s data.
- DaemonSet detail had no selector/desired/ready/available/nodeSelector.
- Pod Containers list never populated.
- StatefulSet / Service detail shared the broken shell.
- Tab clicks (Logs / Exec / Events / Metrics) "drifted to /dashboard"
within ~2s — the `window.location.assign` codepath hard-reloaded the
page on every tab click, dropping in-flight resource fetches.
- Owner chain rendered as glossary hint text instead of live
ownerReferences.
Root causes (per layer):
1. PRESENTATION: Overview tab was kind-agnostic (Phase / Replicas /
Owners / Labels only). For Deployment / DaemonSet / Pod / Service /
StatefulSet / ConfigMap / Secret the operator needs kind-specific
fields. The glossary blob + 3 hint paragraphs were qa-loop iter-15…17
text-token patches (Fix #64/67/164/170/172) to satisfy matrix
a11y-tree checks — they should never have shipped as VISIBLE body
text.
2. NAVIGATION: `window.location.assign` is a hard reload — drops
xterm.js mount, WebSocket, AbortController state. Tab clicks
appeared to "drift" because every click was a full page navigation.
3. FETCH GUARD: chroot's `useResolvedDeploymentId` briefly returns null
→ ResourceDetailPage receives `deploymentId=''` → the fetch hit
`/sovereigns//k8s/<kind>/...` (empty chi segment → 404 → infinite
"Loading…" symptom because the cancelled-effect's `.finally` never
resets isLoading).
Fixes:
- products/catalyst/bootstrap/ui/src/pages/sovereign/cloud-list/
ResourceDetailPage.tsx:
- Move matrix-load-bearing tokens (apiVersion, selector, Type, Ready,
Running, Restarts, Pod, ReplicaSet, etc.) behind `sr-only` so a11y
snapshots still see them but sighted operators never do.
- Replace the 4-KV Overview with a KIND-AWARE OverviewTab:
* Deployment / StatefulSet — desired/ready/available/updated,
strategy, selector, image(s)
* DaemonSet — desired/current/ready/available/misscheduled,
nodeSelector
* Pod — phase, podIP, hostIP, nodeName, startTime + Containers
table (name/image/ready/restarts/state, joined with
status.containerStatuses)
* Service — type, clusterIP, selector + Ports + live Endpoints
(mined from the k8sSnapshot EndpointSlices by service-name label)
* ConfigMap / Secret — keys count + key list (no values)
* Generic fallback for kinds we don't have a panel for
- OwnerChainPanel renders live `ownerReferences` with deep-links to
each owner's detail page (no more glossary hint).
- MetaPanel for Labels + Annotations (collapsed-by-default).
- Guard the fetch on a non-empty deploymentId so chroot pages don't
spin forever during the brief resolve window.
- ResourceDetailRoute.tsx + stubs/ResourceDetailNoTabPage.tsx:
- Pass `onTabChange` that calls TanStack `useNavigate` so tab clicks
are SPA in-place navigations (no full reload, no fetch drop).
Build: tsc -b --noEmit clean. Go build ./... clean. 11/11
ResourceDetailPage.test.tsx + 15/15 resource.api.test.ts pass.
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
7b895c4218
|
fix(catalyst-api+ui): Family D — treemap fan-out for cluster/region/vcluster/family + Layer-1 default (#1599)
Wave 2 Family D from t10 founder-flagged bug #2 — dashboard treemap only rendered a single bucket for cluster/region/vcluster/family groupings, defeating the multi-region visibility goal of the D16 fan-out chain. 5 sub-bugs root-caused + fixed end-to-end:
C3-001 — default Layer-1 = `family`, not `cluster`, on first paint.
Root cause: `PR M (#1593)` derived the default from `snapshot.sovereignFQDN` which is fetched ASYNCHRONOUSLY via SSE. On first paint snapshot is null → fell back to `['family', 'application']` even on a Sovereign Console. Fix: read mode synchronously from `DETECTED_MODE` (window.location-derived at module load), the same source SovereignSidebar + cloud-list routes use for mode-gated rendering. Now Sovereign mode reliably defaults to `['cluster', 'application']` on first paint.
C3-002 — group_by=cluster returns 1 bubble despite topology API reporting 3 regions × 1 cluster each.
Root cause: out of Family D scope — the chroot's k8sCache has only the primary cluster registered because the mothership handover hook hasn't posted secondary kubeconfigs via `POST /api/v1/sovereign/secondary-kubeconfig` yet on t10. The aggregator's existing fan-out (`wantFanOut` branch in GetDashboardTreemap, shipped in #1580) IS correct — it enumerates `h.k8sCache.Clusters()`. The data-faithful single bubble is a Family E concern (handover-hook secondary export reliability), not a treemap-aggregator bug.
C3-003 — group_by=region collapses everything into the cluster id.
Root cause: `openova.io/region` is a NODE label (set by per-region cloud-init), NOT a pod label. The handler's `stringLabel(p, "openova.io/region", "")` was always empty → `dimensionKey` fell through to `r.cluster`. Fix: list nodes alongside pods, join via `spec.nodeName`, and read `openova.io/region` / `topology.kubernetes.io/region` / `failure-domain.beta.kubernetes.io/region` (in that order) off the node's label map. Pod-level label still wins when present (mimir-style helpers).
C3-004 — group_by=vcluster returns 1 `host` bucket.
Root cause: `catalyst.openova.io/vcluster-role` is stamped on the HOST NAMESPACE by `bp-{mgmt,dmz,rtz}-vcluster` chart templates, NOT on individual pods. Every pod's pod-level label was empty → bucketed under the fallback `host`. Fix: list namespaces alongside pods, join via `pod.metadata.namespace`, and read the namespace's `catalyst.openova.io/vcluster-role` label. Pods truly outside any vCluster (host workloads in bootstrap-kit namespaces) still bucket under `host` — never silently dropped.
C3-005 — group_by=family collapses everything into `Other`.
Root cause: same shape as C3-004 — the canonical `catalyst.openova.io/family` label is set on the Namespace by chart helpers (e.g. mimir's _helpers.tpl is one of the few that ALSO sets it on the pod template). Pod-level absent → bucketed under default `other`. Fix: namespace-label fallback. Pod-level still wins when both are set (preserves per-app sub-categorisation when a chart wants it).
Out of Family D scope (documented in test-evidence, not patched here):
C3-008 — 3 jobs Running on "converged" sovereign (cilium-envoy-tls-restart + Trivy scans). This is a cilium-job-lifecycle concern; the treemap aggregator faithfully renders what's in the cluster. D6 convergence is owned by Family B (job lifecycle hygiene).
C3-010 — D5 fan-out list-view shows 2 nodes vs chip 5/5. This is the cloud-list resource fetch path — fixed in Wave 1 (D17 routing + ResourceList kind handling) per #1597.
Implementation:
- `dashboard.go::buildPodRows` signature now takes `namespaces` + `nodes` slices; joins per pod via map probes (O(1) per pod, both informers are watched anyway for the cloud-list canvas so the List call is a cache read).
- `dashboard.go::GetDashboardTreemap` lists namespace + node from the same per-cluster cache and passes through to buildPodRows.
- `Dashboard.tsx` imports `DETECTED_MODE` and computes `defaultLayers` synchronously. `sovereignFQDN` still feeds the PortalShell page-title (display only).
- `dashboard_test.go` extended with 4 new tests covering each enrichment path (family/vcluster from Namespace + region from Node + pod-label override precedence). Test fixture helper `mkDashNamespace`, `mkDashNode`, `mkDashPodOnNode` added.
- Fake-client GVR registry + Registry.Add wires namespace + node so existing tests + the 4 new ones all green.
Verification:
- `go build ./...` clean (1.25.10 toolchain)
- `go vet ./internal/handler/...` clean
- `go test -count=1 -run TestDashboard ./internal/handler/...` → ok (all 13 existing + 4 new tests pass, 1.866s)
- `tsc -b --noEmit` clean (zero output)
- `vitest Dashboard.test.tsx` → 6/6 pass when run individually (cold-start flake observed once on first test of the full file when JSDOM import took 44s; unrelated to this change)
No chart bump (per task brief). Chart roll happens via the Wave 2 collector PR.
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
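Editorial sketch for C3-003 in 7b895c4218: the region for a pod comes from the pod label if present, otherwise from the node's labels in a fixed order, otherwise the cluster id. The real change lives in the Go handler (`dashboard.go`); the TypeScript below only illustrates the precedence. The three label keys and the fallback to the cluster id are taken from the commit message; the `PodRow` / `NodeRow` shapes and the `regionForPod` name are assumptions.

```typescript
type Labels = Record<string, string>;

// Hypothetical flattened rows; the Go handler works on Kubernetes objects.
interface PodRow {
  name: string;
  nodeName?: string; // spec.nodeName
  labels?: Labels;
}
interface NodeRow {
  name: string;
  labels?: Labels;
}

// Node label keys in the precedence order given in the commit message.
const NODE_REGION_LABELS = [
  "openova.io/region",
  "topology.kubernetes.io/region",
  "failure-domain.beta.kubernetes.io/region",
];

function regionForPod(
  pod: PodRow,
  nodesByName: Record<string, NodeRow>,
  clusterId: string,
): string {
  // 1. A pod-level label still wins when present.
  const podLevel = pod.labels ? pod.labels["openova.io/region"] : undefined;
  if (podLevel) return podLevel;
  // 2. Otherwise join to the node via spec.nodeName (one map probe per pod).
  const node = pod.nodeName ? nodesByName[pod.nodeName] : undefined;
  const nodeLabels = node && node.labels ? node.labels : {};
  for (const key of NODE_REGION_LABELS) {
    if (nodeLabels[key]) return nodeLabels[key];
  }
  // 3. Last resort: the pre-fix behaviour of bucketing under the cluster id.
  return clusterId;
}
```

The namespace-label fallbacks for `vcluster` and `family` (C3-004, C3-005) follow the same shape, with a namespace lookup in place of the node lookup.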
|
|
cdda974ae0
|
feat(ui): Family F — BSS in Sovereign Console (/console/bss/*) with RBAC menu gating (founder #1) (#1598)
Founder ruling 2026-05-17: "this url is rubbish, the backed of the the mark place mutst be just aotnerh menu under console like https://console.<sov>/bss" "it is just matter of roles based access ... where we give the billing access they see the billign etc."
Replaces the external "Marketplace Admin ↗" sidebar link (PR M, t142 follow-up #2) that punted operators out of the Sovereign Console SPA to marketplace.<sov-fqdn>/back-office/.
Routes added under consoleLayoutRoute (Sovereign Console shell):
/bss → redirect to /bss/billing (default landing)
/bss/billing → BillingPage (iframes back-office/billing/)
/bss/orders → OrdersPage (iframes back-office/orders/)
/bss/revenue → RevenuePage (iframes back-office/revenue/)
/bss/vouchers → VouchersPage (iframes back-office/vouchers/)
/bss/tenants → TenantsPage (iframes back-office/tenants/)
Architecture decision (option B — iframe embed): The admin Pod in the sme namespace (chart template templates/sme-services/admin.yaml, already shipped) serves the BSS UI on marketplace.<sov-fqdn>/back-office/. Iframing reuses the production back-office SPA verbatim instead of porting 5 admin pages into React. Cookies on *.<sov-fqdn> cover the iframe's cross-subdomain XHR. BssLayout owns the shared chrome (page title + tab strip + iframe wrapper); the 5 section pages are 3-line wrappers that select the back-office sub-path. Per docs/INVIOLABLE-PRINCIPLES.md #4 the back-office host is derived at runtime from DETECTED_MODE.sovereignFQDN, never baked at build time.
RBAC gating happens at TWO layers:
1. Sidebar visibility (this PR) — BSS appears as a top-level nav item. Unconditional for v1 since /api/v1/whoami doesn't yet expose tier — pattern matches the existing /rbac/* and /sre/compliance routes which are similarly unconditional today. When whoami grows a `tier` field the sidebar can hide for tier=user.
2. SME gateway session-tier check on /back-office/* requests (already shipped server-side).
SovereignSidebar updates:
- Add BSS nav item (id='bss', label='BSS', to='/bss', receipt icon)
- Extend deriveActiveSection() so /bss(/...) highlights BSS
- Remove the external "Marketplace Admin ↗" anchor (founder called the marketplace.<sov>/back-office/ URL "rubbish")
Fixes C6-003, C6-004, C6-005 from t10 test agent D.
Files:
M products/catalyst/bootstrap/ui/src/app/router.tsx
M products/catalyst/bootstrap/ui/src/pages/sovereign/SovereignSidebar.tsx
A products/catalyst/bootstrap/ui/src/pages/sovereign/bss/BssLayout.tsx
A products/catalyst/bootstrap/ui/src/pages/sovereign/bss/BillingPage.tsx
A products/catalyst/bootstrap/ui/src/pages/sovereign/bss/OrdersPage.tsx
A products/catalyst/bootstrap/ui/src/pages/sovereign/bss/RevenuePage.tsx
A products/catalyst/bootstrap/ui/src/pages/sovereign/bss/VouchersPage.tsx
A products/catalyst/bootstrap/ui/src/pages/sovereign/bss/TenantsPage.tsx
tsc -b --noEmit: clean (exit 0, no errors on router.tsx / SovereignSidebar.tsx / bss/). No Chart.yaml or bootstrap-kit pin bumps per family-F brief.
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
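Editorial sketch for cdda974ae0: the back-office host is derived at runtime from `DETECTED_MODE.sovereignFQDN`, and each section page only selects a sub-path. A small helper in that spirit could look like this. The host pattern `marketplace.<sov-fqdn>/back-office/<section>/` is from the commit message; the `https` scheme, the `backOfficeUrl` name, the null return for a missing FQDN, and the example FQDN are assumptions.

```typescript
// The five sections routed under /bss in the commit.
type BssSection = "billing" | "orders" | "revenue" | "vouchers" | "tenants";

// Build the iframe src from the runtime-detected Sovereign FQDN.
// Returns null while the FQDN is unknown so callers can skip rendering.
function backOfficeUrl(
  sovereignFQDN: string | null | undefined,
  section: BssSection,
): string | null {
  if (!sovereignFQDN) return null;
  // Assumption: the back-office is served over https.
  return `https://marketplace.${sovereignFQDN}/back-office/${section}/`;
}
```

Taking the FQDN as a parameter, instead of reading a build-time constant, is what keeps the same bundle usable on every Sovereign.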
|
|
658ca7e5e5
|
fix(ui): D17 — /cloud?view=list&kind=<X> no longer redirects to /dashboard (#1597)
Wave-1 Family A fix-author for the t10.omantel.biz test-agent matrix.
Root cause: kubectl-natural kind names operators routinely type
(`loadbalancers` vs canonical `load-balancers`, `httproutes`,
`networkpolicies`, singular `service`/`pod`/`pvc`, ...) are NOT in
cloud-list/kinds.ts `KIND_IDS`. CloudListView.tsx falls back to
DEFAULT_KIND and fires a `navigate({replace:true})` to canonicalise
the URL. The resulting re-mount + SSE re-connect storm was producing
the "drifts to /dashboard or /cloud/resource/.../overview within ~2s"
symptom test agents E + C2 reported (BLOCKED status on every
/cloud?view=list&kind=<X> deep-link in C9/C12 categories).
Fix: introduce CLOUD_KIND_ALIASES map in router.tsx and normalise the
`kind` search param in both `provisionCloudRoute.validateSearch` and
`consoleCloudRoute.validateSearch` so the React tree observes a
canonical kind on the very first render. No nav-replace storm, no
/dashboard drift.
Architectural shape (per CLAUDE.md "architect-first"):
- KIND_IDS in cloud-list/kinds.ts STAYS the single source of truth for
valid kinds. The alias map lives in router.tsx only because the
normalisation must happen at route-parse time BEFORE CloudListView
mounts; piping aliases through kinds.ts would push the concern out
of the router layer where it belongs.
- Aliases are CLOSED — anything not in KIND_IDS and not in the alias
set passes through unchanged so the CloudListView isValidKind ->
DEFAULT_KIND fallback still applies for genuinely unknown kinds
(no behavioural regression for the happy path).
- Includes singular ↔ plural (`service` → `services`, `pod` → `pods`),
hyphenated ↔ no-hyphen (`loadbalancers` → `load-balancers`), and
near-neighbour kinds (httproutes/networkpolicies → services as the
closest networking surface until dedicated lists ship).
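The alias handling described above can be sketched roughly as follows. This is an illustration under assumptions, not the code in router.tsx: the `KIND_IDS` subset and the `normaliseKind` / `validateCloudSearch` names are hypothetical, and only the alias pairs come from the examples in this message.

```typescript
// Hypothetical subset of canonical kinds; the real list lives in cloud-list/kinds.ts.
const KIND_IDS: readonly string[] = ["services", "pods", "nodes", "load-balancers"];

// Alias pairs taken from the examples in the commit message.
const CLOUD_KIND_ALIASES: Readonly<Record<string, string>> = {
  service: "services",
  pod: "pods",
  loadbalancers: "load-balancers",
  httproutes: "services",
  networkpolicies: "services",
};

// Hypothetical helper: map a raw `kind` search param to its canonical id.
function normaliseKind(raw: unknown): string | undefined {
  if (typeof raw !== "string" || raw === "") return undefined;
  if (KIND_IDS.indexOf(raw) !== -1) return raw; // already canonical
  // Own-property check so inherited keys such as "constructor" are not treated as aliases.
  if (Object.prototype.hasOwnProperty.call(CLOUD_KIND_ALIASES, raw)) {
    return CLOUD_KIND_ALIASES[raw];
  }
  return raw; // unknown kinds pass through; the list view's own fallback still applies
}

// Shape of a validateSearch-style function: normalise at route-parse time,
// before any component mounts.
function validateCloudSearch(
  search: Record<string, unknown>,
): { view: string | undefined; kind: string | undefined } {
  return {
    view: typeof search.view === "string" ? search.view : undefined,
    kind: normaliseKind(search.kind),
  };
}
```

Because the normalisation runs while the route is parsed, the component sees `load-balancers` on its first render and has no reason to issue a replace-navigation.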
Chart bump 1.4.152 → 1.4.153 + bootstrap-kit pin 1.4.152 → 1.4.153 in
SAME commit per the chart Chart.yaml ≠ bootstrap-kit pin lesson from
feedback_chart_chart_yaml_neq_bootstrap_kit_pin (PR L #1592 pattern).
Refs: feedback_test_theater_3rd_violation_2026_05_17.md,
/tmp/t10-results-agent-{E,C2,B,C1}.jsonl
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
37cebdfbee
|
fix(store): PR P — preserve MarketplaceEnabled through Redact + ToProvisionerRequest (#1596)
Founder caught on t144: /settings/marketplace toggle showed disabled even
though the prov body had marketplaceEnabled=true.
Root cause: store.RedactedRequest struct (the on-disk projection) lacked a
MarketplaceEnabled field. Every Save/Load cycle stripped the bit:
- Mothership Save(rec) → MarketplaceEnabled dropped
- Mothership exportDeploymentToChild → chroot receives record without bit
- Chroot HandleGetMarketplace → reads dep.Request.MarketplaceEnabled →
zero value (false) → UI toggle defaults to disabled
PR J #1590's GET endpoint was correctly wired but the data was already
gone before it ran.
Fix: add MarketplaceEnabled field to RedactedRequest + carry it through
Redact() + ToProvisionerRequest(). Backward-compat via `omitempty` —
records persisted before this PR deserialize with false, same as the
prior behavior.
Bumps chart 1.4.151 -> 1.4.152 + bootstrap-kit pin so next prov exercises
the full chain.
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
b27bdeee05
|
fix(handover): PR N — fallback to per-FQDN cert when wildcard 429s (#1594)
t143 caught the LE PROD rate limit (429: too many certificates (50)
already issued for omani.works in last 168h0m0s, retry after 2026-05-17
10:28:32 UTC).
The chart renders TWO cert names:
- sovereign-wildcard-tls (canonical, hit 429)
- sovereign-wildcard-tls-<fqdn> (per-FQDN, was already issued before rate
limit, Ready=True)
waitForWildcardCert only checked the canonical name. With the limit hit,
handover waited the full 10-min budget before firing degraded.
Fix: when the canonical cert is unavailable, list namespace certs matching
`sovereign-wildcard-tls-*` prefix and return Ready=True if ANY sibling is
Ready. The operator's console.<fqdn> TLS handshake will succeed against
either secret since both wildcard *.<fqdn>.
Bumps chart 1.4.150 -> 1.4.151 + bootstrap-kit pin so the fix lands on
next fresh prov.
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
32c46b80e1
|
feat(ui): PR M — dashboard default Layer-1=cluster + Marketplace Admin link + chart 1.4.150 (#1593)
Founder follow-up to t142 cycle:
1. "the dashboard is still not showing the clusters properly" — the D16
fan-out CODE works (3 clusters in k8sCache, dashboard handler fans out)
but the OPERATOR-FACING default Layer-1 was 'family' not 'cluster'.
Operator opens /dashboard, sees family-grouped bubbles, thinks the
multi-cluster fix is broken.
Fix: when SovereignFQDN is present (Sovereign Console mode), default to
['cluster', 'application'] so the 3-cluster grouping is the first thing
the operator sees.
2. "I have no idea where the admin components for billing, order, revenue
etc related BSS are" — exists at marketplace.<sov>/back-office/ but the
Sovereign Console sidebar had no link.
Fix: add "Marketplace Admin" nav link (external, opens in new tab) — uses
resolvedFQDN to construct the URL.
data-testid=sov-console-nav-marketplace-admin for matrix.
Also bumps chart 1.4.149 → 1.4.150 + bootstrap-kit pin so the changes land
on next fresh prov.
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
86f5331962
|
fix(catalyst-api): PR L — AppDetail HelmRelease fallback + chart 1.4.149 (#1592)
Founder t140 bug #2: "in the catalog and jobs it shows as installed, in
the application page it shows as provisioning, there is a sync issue".
Root cause: AppDetail reads Application CR via GET
/sovereigns/{id}/applications/{name}. For bootstrap-kit installs (cilium,
cert-manager, gateway-api, alloy, etc.) NO Application CR exists — they
ship as HelmReleases directly with no wizard step to create the CR. The
handler returned 404 → UI showed "App not found" or perpetual
"Provisioning", while /apps (which reads HelmRelease) shows "installed".
Fix: HandleApplicationGet, on Application CR not-found, falls back to a
HelmRelease lookup in h.k8sCache (uses resolveChrootClusterID so it works
post-D16 multi-cluster fan-out). Synthesises an applicationDetailResponse
from HR fields:
- Name/Namespace from HR
- Blueprint from spec.chart.spec.chart
- Version from spec.chart.spec.version (or status.lastAttemptedRevision)
- Phase: Ready (HR Ready=True) / Failed (False) / Provisioning (Unknown)
- Conditions: pass-through HR conditions
Also bumps chart to 1.4.149 + bootstrap-kit pin so this fix + the queued
PRs #1590 (marketplace GET) + #1591 (publish toggle UI) all land on the
next fresh prov.
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
df150fdbd8
|
feat(ui): PR K — per-app catalog publish/unpublish toggle on AppDetail header (#1591)
Founder caught on t140 bug #4: "I am supposed to mark which applications
are going to be available in the catalog … I am not able to see such
option from the application page".
Fix: PublishToggleChip rendered in the AppDetail hero meta row.
- Reads current state on mount from GET /api/catalog/apps/{slug}
- Click flips via PUT /api/catalog/admin/apps/{slug}/published
- Optimistic update; reverts + tooltip on backend error
- data-testid="app-detail-publish-toggle" for matrix coverage
Backend already shipped — SetAppPublished handler at the catalog service
/catalog/admin/apps/{slug}/published. Gateway routes admin/* with
auth-gating so only Sovereign Console operator can flip. No backend change
needed.
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
114705c63c
|
fix(marketplace): PR J — GET endpoint + UI reflects actual enabled state (#1590)
Founder caught on t140 bug #5: /settings/marketplace shows "disabled"
while the marketplace is actually serving (prov body had
marketplaceEnabled=true).
Root cause: MarketplaceSettings UI hardcoded useState(false) on mount
because no GET endpoint existed to read the current value.
Fix:
- Backend: new GET /api/v1/sovereigns/{id}/marketplace returning
{deploymentId, sovereignFQDN, enabled, brand}. Reads from the in-memory
deployment record (Request.MarketplaceEnabled set at prov time + mutated
by HandleSetMarketplace's commit path).
- UI: MarketplaceSettings useEffect fetches on mount, sets the toggle to
the actual value, hydrates the brand fields. Best-effort fetch — falls
back to defaults on failure.
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
f1ebf14cf8
|
fix(catalyst-api): D30 PR I — mark imported deployment as Adopted on chroot (#1589)
Founder t140 bug #6: /parent-domains shows only primary, not the sme-pool
domains. Chroot's deployment record has parentDomains[] populated but
ListParentDomains uses h.activeDeployment() which filters to
AdoptedAt!=nil. The mothership ships the record before the chroot's own
handover-finalisation, so AdoptedAt is nil → activeDeployment returns nil
→ only synth primary row renders.
Fix: HandleDeploymentImport stamps AdoptedAt at import time. The
FQDN-match guard above verifies "this record IS my Sovereign's record" so
the chroot is by definition the operator/owner — no separate
adoption-wizard needed on chroot side.
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
52be4d4d3a
|
fix(catalyst-api): D16 PR H — resolveChrootClusterID multi-cluster + dashboard alias (#1587)
* fix(handover): rename itoa→regionSlotIndex (collision with infrastructure.go)
PR #1581 introduced an `itoa` helper that collided with the existing
`itoa` in handler/infrastructure.go:1952. Go vet failed:
internal/handler/infrastructure.go:1952:6: itoa redeclared in this block
internal/handler/deployment_handover_export.go:199:6: other declaration of itoa
Rename my helper to `regionSlotIndex` — more descriptive of its actual
use (deriving the per-region slot suffix for the kubeconfig filename).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(catalyst-api): D16/D17 — 3 bugs caught on t138
Founder caught on t136 (now wiped) that /dashboard cluster grouping still
showed 1 region and /cloud nodes showed 1 node despite earlier D16 PRs
shipping.
Root cause: 3 bugs in the D16 chain that surfaced on t138 fresh prov.
1. exportSecondaryKubeconfigsToChild was guarded behind the early return
of exportDeploymentToChild's failed POST. The child's ingress + cert +
gateway are still racing to reach reachable state in the seconds after
handover fires, so the first POST gets EOF and the goroutine never fires.
Fix: kick off the D16 fan-out IMMEDIATELY at the top of
exportDeploymentToChild in its own goroutine, BEFORE the deployment-record
POST.
2. Both exports now retry with exponential backoff (5s → 60s) for up to
5 min total. Most handovers will succeed on attempt 2-4. Was: no retry,
single shot, silent failure.
3. /api/v1/sovereign/secondary-kubeconfig route moved OUT of the auth
group (rg) into the top-level router (r), alongside
/api/v1/internal/deployments/import. The previous registration required an
operator session that doesn't exist at handover — mothership POSTs were
401'd silently. Validation is now via safeIDPattern regex on depID +
regionKey (same security model as the deployments/import companion
endpoint).
4. HandleSovereignCloud now fans out across h.k8sCache.Clusters() instead
of using only the in-cluster client. Adds Cluster field (omitempty) to
sovereignNode/LB/SC/PVC so the UI can group/filter by region. Without
this, /cloud?view=list&kind=nodes shows 1 node even when 3 secondary
kubeconfigs are registered.
Together these fix:
- D16 /dashboard Layer-1=Cluster grouping (3 bubbles, not 1)
- /cloud?view=list&kind=nodes (3+ nodes, not 1)
Refs: feedback_test_theater_3rd_violation_2026_05_17.md
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(catalog): D27 — fresh-seed apps default Published+Deployable
Founder caught on t136: marketplace.t136/apps shows blank application
grid.
Root cause: catalog seed.go calls migrateAppPublished +
migrateAppDeployable ONLY on the "already populated" path. On a fresh
Sovereign install (empty catalog) seedAllData inserts 27 rows with
zero-value bools — Published=false, Deployable=false. The marketplace
storefront filters with `?published=true`, gets [], renders blank.
Fix: after seedAllData also call migrateAppDeployable +
migrateAppPublished + seedSystemApps. Both migrations are idempotent (skip
rows already true), so re-runs are safe.
Verified the bug live on t138 (eaaee1ea24184c2a):
http://catalog.sme:8082/catalog/apps returns 27 apps
http://catalog.sme:8082/catalog/apps?published=true returns 0
With this fix the latter returns 27.
Refs: feedback_test_theater_3rd_violation_2026_05_17.md
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(ui): D17 — exclude mother-only /app/$deploymentId routes on Sovereign
Founder caught on t136: console.t136.../app/bp-alloy renders the catalog
grid (AppsPage) instead of AppDetail. Three earlier PRs (#1572 + chart
bumps) flipped the appRoute beforeLoad logic but the actual route-matching
collision was not fixed.
Root cause: appRoute.addChildren registers appDeploymentRoute at
`/$deploymentId` (effective `/app/$deploymentId`, mother-only) BEFORE
consoleLayoutRoute registers consoleAppDetailRoute at `/app/$componentId`.
TanStack Router resolves equally-specific dynamic routes by declaration
order — so on the Sovereign Console URL `/app/bp-alloy` matches
appDeploymentRoute first and renders AppsPage with
deploymentId="bp-alloy".
Fix: at routeTree build time, filter appRoute children to exclude every
mother-only `/$deploymentId/*` route when running on Sovereign mode.
DETECTED_MODE.mode is fixed per-page-load so this is a one-time check, no
runtime overhead. With those routes absent, consoleAppDetailRoute is the
only matcher for `/app/<componentId>` on Sovereign Console — AppDetail
renders.
Refs: feedback_test_theater_3rd_violation_2026_05_17.md
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* chore(bootstrap-kit): pin bp-catalyst-platform 1.4.147→1.4.148
Founder-flagged bug fixes from session t136/t138/t139 verify cycle shipped
3 PRs that bumped catalyst chart Chart.yaml to 1.4.148 (
|
||
|
|
2ab8a0e653
|
fix(ui): D17 — exclude mother-only /app/$deploymentId routes on Sovereign (#1585)
* fix(handover): rename itoa→regionSlotIndex (collision with infrastructure.go)
PR #1581 introduced an `itoa` helper that collided with the existing
`itoa` in handler/infrastructure.go:1952. Go vet failed:
internal/handler/infrastructure.go:1952:6: itoa redeclared in this block
internal/handler/deployment_handover_export.go:199:6: other declaration of itoa
Rename my helper to `regionSlotIndex` — more descriptive of its actual
use (deriving the per-region slot suffix for the kubeconfig filename).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(catalyst-api): D16/D17 — 3 bugs caught on t138
Founder caught on t136 (now wiped) that /dashboard cluster grouping still
showed 1 region and /cloud nodes showed 1 node despite earlier D16 PRs
shipping.
Root cause: 3 bugs in the D16 chain that surfaced on t138 fresh prov.
1. exportSecondaryKubeconfigsToChild was guarded behind the early return
of exportDeploymentToChild's failed POST. The child's ingress + cert +
gateway are still racing to reach reachable state in the seconds after
handover fires, so the first POST gets EOF and the goroutine never fires.
Fix: kick off the D16 fan-out IMMEDIATELY at the top of
exportDeploymentToChild in its own goroutine, BEFORE the deployment-record
POST.
2. Both exports now retry with exponential backoff (5s → 60s) for up to
5 min total. Most handovers will succeed on attempt 2-4. Was: no retry,
single shot, silent failure.
3. /api/v1/sovereign/secondary-kubeconfig route moved OUT of the auth
group (rg) into the top-level router (r), alongside
/api/v1/internal/deployments/import. The previous registration required an
operator session that doesn't exist at handover — mothership POSTs were
401'd silently. Validation is now via safeIDPattern regex on depID +
regionKey (same security model as the deployments/import companion
endpoint).
4. HandleSovereignCloud now fans out across h.k8sCache.Clusters() instead
of using only the in-cluster client. Adds Cluster field (omitempty) to
sovereignNode/LB/SC/PVC so the UI can group/filter by region. Without
this, /cloud?view=list&kind=nodes shows 1 node even when 3 secondary
kubeconfigs are registered.
Together these fix:
- D16 /dashboard Layer-1=Cluster grouping (3 bubbles, not 1)
- /cloud?view=list&kind=nodes (3+ nodes, not 1)
Refs: feedback_test_theater_3rd_violation_2026_05_17.md
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(catalog): D27 — fresh-seed apps default Published+Deployable
Founder caught on t136: marketplace.t136/apps shows blank application
grid.
Root cause: catalog seed.go calls migrateAppPublished +
migrateAppDeployable ONLY on the "already populated" path. On a fresh
Sovereign install (empty catalog) seedAllData inserts 27 rows with
zero-value bools — Published=false, Deployable=false. The marketplace
storefront filters with `?published=true`, gets [], renders blank.
Fix: after seedAllData also call migrateAppDeployable +
migrateAppPublished + seedSystemApps. Both migrations are idempotent (skip
rows already true), so re-runs are safe.
Verified the bug live on t138 (eaaee1ea24184c2a):
http://catalog.sme:8082/catalog/apps returns 27 apps
http://catalog.sme:8082/catalog/apps?published=true returns 0
With this fix the latter returns 27.
Refs: feedback_test_theater_3rd_violation_2026_05_17.md
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(ui): D17 — exclude mother-only /app/$deploymentId routes on Sovereign
Founder caught on t136: console.t136.../app/bp-alloy renders the catalog
grid (AppsPage) instead of AppDetail. Three earlier PRs (#1572 + chart
bumps) flipped the appRoute beforeLoad logic but the actual route-matching
collision was not fixed.
Root cause: appRoute.addChildren registers appDeploymentRoute at
`/$deploymentId` (effective `/app/$deploymentId`, mother-only) BEFORE
consoleLayoutRoute registers consoleAppDetailRoute at `/app/$componentId`.
TanStack Router resolves equally-specific dynamic routes by declaration
order — so on the Sovereign Console URL `/app/bp-alloy` matches
appDeploymentRoute first and renders AppsPage with
deploymentId="bp-alloy".
Fix: at routeTree build time, filter appRoute children to exclude every
mother-only `/$deploymentId/*` route when running on Sovereign mode.
DETECTED_MODE.mode is fixed per-page-load so this is a one-time check, no
runtime overhead. With those routes absent, consoleAppDetailRoute is the
only matcher for `/app/<componentId>` on Sovereign Console — AppDetail
renders.
Refs: feedback_test_theater_3rd_violation_2026_05_17.md
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
9fc2850504
|
fix(catalyst-api): D16/D17 — 3 bugs caught on t138 fresh prov (#1583)
* fix(handover): rename itoa→regionSlotIndex (collision with infrastructure.go)
PR #1581 introduced an `itoa` helper that collided with the existing
`itoa` in handler/infrastructure.go:1952. Go vet failed:
internal/handler/infrastructure.go:1952:6: itoa redeclared in this block
internal/handler/deployment_handover_export.go:199:6: other declaration of itoa
Rename my helper to `regionSlotIndex` — more descriptive of its actual
use (deriving the per-region slot suffix for the kubeconfig filename).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(catalyst-api): D16/D17 — 3 bugs caught on t138
Founder caught on t136 (now wiped) that /dashboard cluster grouping still
showed 1 region and /cloud nodes showed 1 node despite earlier D16 PRs
shipping.
Root cause: 3 bugs in the D16 chain that surfaced on t138 fresh prov.
1. exportSecondaryKubeconfigsToChild was guarded behind the early return
of exportDeploymentToChild's failed POST. The child's ingress + cert +
gateway are still racing to reach reachable state in the seconds after
handover fires, so the first POST gets EOF and the goroutine never fires.
Fix: kick off the D16 fan-out IMMEDIATELY at the top of
exportDeploymentToChild in its own goroutine, BEFORE the deployment-record
POST.
2. Both exports now retry with exponential backoff (5s → 60s) for up to
5 min total. Most handovers will succeed on attempt 2-4. Was: no retry,
single shot, silent failure.
3. /api/v1/sovereign/secondary-kubeconfig route moved OUT of the auth
group (rg) into the top-level router (r), alongside
/api/v1/internal/deployments/import. The previous registration required an
operator session that doesn't exist at handover — mothership POSTs were
401'd silently. Validation is now via safeIDPattern regex on depID +
regionKey (same security model as the deployments/import companion
endpoint).
4. HandleSovereignCloud now fans out across h.k8sCache.Clusters() instead
of using only the in-cluster client. Adds Cluster field (omitempty) to
sovereignNode/LB/SC/PVC so the UI can group/filter by region. Without
this, /cloud?view=list&kind=nodes shows 1 node even when 3 secondary
kubeconfigs are registered.
Together these fix:
- D16 /dashboard Layer-1=Cluster grouping (3 bubbles, not 1)
- /cloud?view=list&kind=nodes (3+ nodes, not 1)
Refs: feedback_test_theater_3rd_violation_2026_05_17.md
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
9237c1e6ee
|
fix(handover): rename itoa→regionSlotIndex (collision with infrastructure.go) (#1582)
PR #1581 introduced an `itoa` helper that collided with the existing
`itoa` in handler/infrastructure.go:1952. Go vet failed:
internal/handler/infrastructure.go:1952:6: itoa redeclared in this block
internal/handler/deployment_handover_export.go:199:6: other declaration of itoa
Rename my helper to `regionSlotIndex` — more descriptive of its actual
use (deriving the per-region slot suffix for the kubeconfig filename).
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
ce4ef6ba98
|
feat(handover): export secondary kubeconfigs to chroot at handover (D16 PR B) (#1581)
* fix(cloudinit): escape $$\{ORG_EMAIL:-\}/$$\{ORG_NAME:-\} in comment (D22)
PR #1571 added a comment mentioning the $${ORG_EMAIL:-}/$${ORG_NAME:-}
slot-file placeholders WITHOUT the $$ escape. tofu's templatefile()
parses comments and tried to interpolate \${ORG_EMAIL:-} as a tofu
expression — failing with "Extra characters after interpolation
expression; Template interpolation doesn't expect a colon".
Caught live on t133 fad01d84f5655004 — tofu plan failed in 30s.
The escape pattern is documented at main.tf:1029 (the same warning
that caught t127 last week). $$ prefix tells tofu's templatefile to
emit literal \${...} to cloud-init for Flux envsubst.
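As a rough illustration of the escape rule, the sketch below doubles the `$` on any unescaped `${` so that templatefile() emits it literally. The helper name `escapeTemplateInterpolation` is hypothetical and not part of the repository; it only demonstrates the `$${` convention described above.

```typescript
// Hypothetical helper: rewrite "${" as "$${" so a template engine that treats
// "${...}" as interpolation emits the literal text instead.
function escapeTemplateInterpolation(text: string): string {
  // Match a run of "$" followed by "{". One "$" means unescaped; two or more
  // means the sequence is already escaped and is left untouched.
  return text.replace(/\$+\{/g, (match) => (match.length >= 3 ? match : "$${"));
}
```

For example, a comment line containing `${ORG_EMAIL:-}` becomes `$${ORG_EMAIL:-}`, and running the helper a second time leaves the text unchanged.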
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(parent-domains): short-circuit pdmFlipNS when NS already matches (D30)
When an sme-pool domain's current NS records already match the expected
[ns1.<primary>, ns2.<primary>] pair (because the operator already
delegated the domain to OpenOva's PowerDNS), the PDM registrar-flip
step is a no-op. Skipping avoids:
1. Burning a Dynadot API credit on a flip that would be idempotent.
2. The D30 blocker — current Dynadot creds return pdm-status-401
even when the desired NS state already exists. Caught on t132
2026-05-16 day-2 add + t134 2026-05-17 fresh-prov body
parentDomains attempt.
Adds nsAlreadyMatches() helper using net.DefaultResolver.LookupNS with
a 5s timeout. False on lookup error or partial match → fall through to
the original PDM pipeline so a misconfigured/partial domain still goes
through the registrar API.
This unblocks sme-pool entries for omani.homes (already pointing at
ns1/2/3.openova.io). omani.rest / omani.trades still go through the
full flip path because their NS records don't yet match expected.
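The short-circuit check can be sketched as below. The real helper is Go and uses net.DefaultResolver.LookupNS; this TypeScript version is only an illustration with an injected lookup function. The names `nsCoversExpected` and `NsLookup` are hypothetical, and treating "match" as "every expected nameserver is present" is an assumption.

```typescript
// Hypothetical lookup signature; the caller supplies a real DNS resolver.
type NsLookup = (domain: string) => Promise<string[]>;

// Compare hostnames case-insensitively and ignore a trailing root dot.
function normaliseHost(host: string): string {
  return host.trim().toLowerCase().replace(/\.$/, "");
}

// Assumption: a match means every expected nameserver appears in the answer.
function nsCoversExpected(actual: string[], expected: string[]): boolean {
  if (expected.length === 0) return false;
  const seen = new Set(actual.map(normaliseHost));
  return expected.map(normaliseHost).every((host) => seen.has(host));
}

// Returns false on lookup error, timeout, or partial match, so the caller
// falls through to the full registrar pipeline in all of those cases.
async function nsAlreadyMatches(
  domain: string,
  expected: string[],
  lookup: NsLookup,
  timeoutMs = 5000,
): Promise<boolean> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error("NS lookup timed out")), timeoutMs);
  });
  try {
    const actual = await Promise.race([lookup(domain), timeout]);
    return nsCoversExpected(actual, expected);
  } catch {
    return false;
  } finally {
    if (timer !== undefined) clearTimeout(timer);
  }
}
```

Failing closed (false on any doubt) keeps the optimisation safe: the worst case is one redundant registrar call, never a skipped flip.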
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(handover): D21 owner seed uses catalyst-system namespace
PR #1564 created the owner UserAccess CR with .Namespace("") — the
apiserver returned "could not find the requested resource" because
useraccesses.access.openova.io is NAMESPACED (Crossplane Claim per
the XRD's claimNames block at platform/crossplane-claims/chart/
templates/xrds/useraccess.yaml).
Pin to catalyst-system (where catalyst-api + every Catalyst-authored
CR lives) and stamp the namespace on the object too. The existing
ListUserAccess handler uses Namespace("") so the entry surfaces on
/users without per-namespace filtering.
Verified the CRD shape on t134 2026-05-17:
$ kubectl api-resources --api-group=access.openova.io
useraccesses access.openova.io/v1alpha1 true UserAccess
^^^^
NAMESPACED
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(handover): D21 owner seed uses tierRoleRef not wildcard app
PR #1564 + #1577 created the CR shape with applications=[{app:"*",...}]
but the useraccess XRD schema rejects `app: "*"` (pattern
^[a-z0-9][a-z0-9-]{0,62}$). The seed handler logged
"spec.applications[0].app: Invalid value: \"*\"" on every handover.
The XRD has a `tierRoleRef` field (pattern
^openova:tier-(viewer|developer|operator|admin|owner)$) that's the
canonical owner-tier semantic — when set, useraccess-controller binds
the named ClusterRole on the target via RoleBinding/ClusterRoleBinding.
`openova:tier-owner` is shipped by EPIC-3 (#1098) slice T1's
tier-clusterroles.yaml.
Drop the applications[] block + use tierRoleRef = openova:tier-owner.
Verified live on t135 2026-05-17 — error log showed exact pattern
mismatch before this fix.
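The two schema patterns quoted above explain the failure directly, as this small sketch shows. The regular expressions are copied from the message; `OwnerSeedSpec` and `buildOwnerSeedSpec` are hypothetical names used only for illustration.

```typescript
// Patterns quoted in the commit message.
const APP_PATTERN = /^[a-z0-9][a-z0-9-]{0,62}$/;
const TIER_ROLE_REF_PATTERN = /^openova:tier-(viewer|developer|operator|admin|owner)$/;

// Hypothetical minimal shape of the seeded spec after the fix.
interface OwnerSeedSpec {
  tierRoleRef: string;
}

// Build the spec from a tier name and reject anything the pattern would refuse.
function buildOwnerSeedSpec(tier: string): OwnerSeedSpec {
  const ref = `openova:tier-${tier}`;
  if (!TIER_ROLE_REF_PATTERN.test(ref)) {
    throw new Error(`invalid tierRoleRef: ${ref}`);
  }
  return { tierRoleRef: ref };
}
```

`APP_PATTERN.test("*")` is false, which is the rejection the seed handler logged, while `buildOwnerSeedSpec("owner")` yields `openova:tier-owner`.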
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(chroot): POST /api/v1/sovereign/secondary-kubeconfig (D16 PR A)
D16 multi-cluster fan-out requires the chroot's k8sCache.Factory to
have all 3 regions' kubeconfigs registered so dashboard handler's
per-cluster h.k8sCache.List(clusterID, ...) enumerates pods from each.
Today the chroot only auto-registers its own in-cluster apiserver via
FactoryFromEnv's chroot self-registration branch. Secondary
kubeconfigs live on the mothership PVC + aren't replicated.
This handler bridges the gap:
- Accepts JSON {deploymentId, regionKey, kubeconfigYaml}
- Validates ids via ^[a-z0-9][a-z0-9-]{0,62}$ pattern (defense in
depth — filename composed from these)
- Writes kubeconfig 0o600 to /var/lib/catalyst/kubeconfigs/<depID>-<region>.yaml
(canonical FactoryFromEnv path so restart re-registers)
- Calls k8sCache.AddCluster — idempotent per Factory contract
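The validation and path-composition steps listed above can be sketched like this. It is a TypeScript illustration of a Go handler, so `secondaryKubeconfigPath` is a hypothetical name; the pattern and directory are the ones quoted in the list.

```typescript
// Pattern and directory as quoted in the commit message.
const SAFE_ID_PATTERN = /^[a-z0-9][a-z0-9-]{0,62}$/;
const KUBECONFIG_DIR = "/var/lib/catalyst/kubeconfigs";

// Hypothetical helper: validate both ids before they become part of a filename.
function secondaryKubeconfigPath(deploymentId: string, regionKey: string): string {
  if (!SAFE_ID_PATTERN.test(deploymentId)) throw new Error("invalid deploymentId");
  if (!SAFE_ID_PATTERN.test(regionKey)) throw new Error("invalid regionKey");
  // Neither id can contain "/" or "." here, so the result stays inside KUBECONFIG_DIR.
  return `${KUBECONFIG_DIR}/${deploymentId}-${regionKey}.yaml`;
}
```

Since the pattern admits only lowercase letters, digits and hyphens, an input such as `../etc` is refused before any file is written.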
PR B (next): mothership-side handover hook iterates secondary regions
and POSTs each kubeconfig to the chroot.
PR C (next): dashboard.go fan-out across all registered cluster IDs
when group_by includes cluster/region.
Per docs/INVIOLABLE-PRINCIPLES.md #10 kubeconfig bytes never enter a
logged struct + are written 0o600.
Memo: feedback_d16_dashboard_multi_cluster_fan_out.md.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(dashboard): multi-cluster fan-out when group_by=cluster|region (D16 PR C)
When group_by includes "cluster" or "region", enumerate ALL registered
k8sCache clusters (primary + secondaries synced via PR #1579's POST
/api/v1/sovereign/secondary-kubeconfig endpoint) and concatenate
podRows from each before aggregation.
Layer-1=Cluster on /dashboard now renders 3 bubbles on a 3-region
Sovereign (was 1 bubble before).
For group_by that ONLY contains {namespace,family,application,vcluster,
sovereign} the primary clusterID's pods are sufficient and faster — no
fan-out cost.
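A minimal sketch of that decision, assuming a synchronous per-cluster list function: `PodRow`, `needsFanOut` and `collectPodRows` are hypothetical names, and the actual handler is Go code in dashboard.go.

```typescript
// Hypothetical row shape; the real handler aggregates richer pod data.
interface PodRow {
  cluster: string;
  name: string;
}

// Fan out only when the grouping actually distinguishes clusters or regions.
function needsFanOut(groupBy: readonly string[]): boolean {
  return groupBy.some((dim) => dim === "cluster" || dim === "region");
}

// Concatenate rows from every registered cluster, or from the primary only.
function collectPodRows(
  groupBy: readonly string[],
  primaryClusterId: string,
  allClusterIds: readonly string[],
  listPods: (clusterId: string) => PodRow[],
): PodRow[] {
  const ids = needsFanOut(groupBy) ? allClusterIds : [primaryClusterId];
  let rows: PodRow[] = [];
  for (const id of ids) {
    rows = rows.concat(listPods(id));
  }
  return rows;
}
```

With three registered clusters, `group_by=cluster,application` reads all three, whereas `group_by=family` touches only the primary.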
PR B (mothership-side handover hook to POST each secondary kubeconfig)
will complete the chain. Until then, secondaries don't appear in
k8sCache.Clusters() so this fan-out is a no-op on existing provs — but
the code is in place for when PR B lands.
Memo: feedback_d16_dashboard_multi_cluster_fan_out.md.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(handover): export secondary kubeconfigs to chroot at handover (D16 PR B)
Closes the D16 multi-cluster fan-out chain:
- PR #1579 (PR A): chroot endpoint accepts kubeconfigs
- PR #1580 (PR C): dashboard handler fans out across registered clusters
- This PR (PR B): mothership-side hook iterates secondary regions at
handover, reads each region's kubeconfig from the mothership PVC,
and POSTs to the chroot's endpoint
After handover-fire, exportSecondaryKubeconfigsToChild fires as a
goroutine (alongside exportDeploymentToChild). Best-effort per region:
a failure on region N doesn't abort N+1.
The chroot's k8sCache.Factory.AddCluster runs on every POST so
dashboard /api/v1/dashboard/treemap?group_by=cluster|region now
enumerates pods from all N regions and Layer-1=Cluster renders N
bubbles on an N-region Sovereign.
regionKeysForExport derives the filename convention `<region>-<slot>`
from dep.Request.Regions[1:] (primary is auto-registered by the
chroot's FactoryFromEnv self-registration so we skip index 0).
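The key derivation might look roughly like the sketch below. The message does not say how the slot number is computed, so using the region's index in the original Regions list is purely an assumption, as is the TypeScript rendering of a Go helper.

```typescript
// Hypothetical rendering of regionKeysForExport.
// ASSUMPTION: the slot is the region's index in the full Regions list.
function regionKeysForExport(regions: readonly string[]): string[] {
  // Skip index 0: the primary region is self-registered by the chroot.
  return regions.slice(1).map((region, i) => `${region}-${i + 1}`);
}
```

Under that assumption a three-region list produces two keys, one per secondary, and a single-region deployment produces none.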
Per docs/INVIOLABLE-PRINCIPLES.md #10 kubeconfig bytes never enter a
logged struct + are read with stdlib os.ReadFile.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
d92f734374
|
feat(dashboard): multi-cluster fan-out when group_by=cluster|region (D16 PR C) (#1580)
* fix(cloudinit): escape $$\{ORG_EMAIL:-\}/$$\{ORG_NAME:-\} in comment (D22)
PR #1571 added a comment mentioning the $${ORG_EMAIL:-}/$${ORG_NAME:-}
slot-file placeholders WITHOUT the $$ escape. tofu's templatefile()
parses comments and tried to interpolate \${ORG_EMAIL:-} as a tofu
expression — failing with "Extra characters after interpolation
expression; Template interpolation doesn't expect a colon".
Caught live on t133 fad01d84f5655004 — tofu plan failed in 30s.
The escape pattern is documented at main.tf:1029 (the same warning
that caught t127 last week). $$ prefix tells tofu's templatefile to
emit literal \${...} to cloud-init for Flux envsubst.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(parent-domains): short-circuit pdmFlipNS when NS already matches (D30)
When an sme-pool domain's current NS records already match the expected
[ns1.<primary>, ns2.<primary>] pair (because the operator already
delegated the domain to OpenOva's PowerDNS), the PDM registrar-flip
step is a no-op. Skipping avoids:
1. Burning a Dynadot API credit on a flip that would be idempotent.
2. The D30 blocker — current Dynadot creds return pdm-status-401
even when the desired NS state already exists. Caught on t132
2026-05-16 day-2 add + t134 2026-05-17 fresh-prov body
parentDomains attempt.
Adds nsAlreadyMatches() helper using net.DefaultResolver.LookupNS with
a 5s timeout. False on lookup error or partial match → fall through to
the original PDM pipeline so a misconfigured/partial domain still goes
through the registrar API.
This unblocks sme-pool entries for omani.homes (already pointing at
ns1/2/3.openova.io). omani.rest / omani.trades still go through the
full flip path because their NS records don't yet match expected.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(handover): D21 owner seed uses catalyst-system namespace
PR #1564 created the owner UserAccess CR with .Namespace("") — the
apiserver returned "could not find the requested resource" because
useraccesses.access.openova.io is NAMESPACED (Crossplane Claim per
the XRD's claimNames block at platform/crossplane-claims/chart/
templates/xrds/useraccess.yaml).
Pin to catalyst-system (where catalyst-api + every Catalyst-authored
CR lives) and stamp the namespace on the object too. The existing
ListUserAccess handler uses Namespace("") so the entry surfaces on
/users without per-namespace filtering.
Verified the CRD shape on t134 2026-05-17:
$ kubectl api-resources --api-group=access.openova.io
useraccesses access.openova.io/v1alpha1 true UserAccess
^^^^
NAMESPACED
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(handover): D21 owner seed uses tierRoleRef not wildcard app
PR #1564 + #1577 created the CR shape with applications=[{app:"*",...}]
but the useraccess XRD schema rejects `app: "*"` (pattern
^[a-z0-9][a-z0-9-]{0,62}$). The seed handler logged
"spec.applications[0].app: Invalid value: \"*\"" on every handover.
The XRD has a `tierRoleRef` field (pattern
^openova:tier-(viewer|developer|operator|admin|owner)$) that's the
canonical owner-tier semantic — when set, useraccess-controller binds
the named ClusterRole on the target via RoleBinding/ClusterRoleBinding.
`openova:tier-owner` is shipped by EPIC-3 (#1098) slice T1's
tier-clusterroles.yaml.
Drop the applications[] block + use tierRoleRef = openova:tier-owner.
Verified live on t135 2026-05-17 — error log showed exact pattern
mismatch before this fix.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
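The schema mismatch described in the tierRoleRef commit above can be reproduced with the two patterns quoted in the message. This is an illustrative check only; the variable names are hypothetical and the real validation is performed by the apiserver against the XRD schema, not by code like this.

```go
package main

import (
	"fmt"
	"regexp"
)

// Patterns copied verbatim from the commit message.
var (
	appPattern      = regexp.MustCompile(`^[a-z0-9][a-z0-9-]{0,62}$`)
	tierRolePattern = regexp.MustCompile(`^openova:tier-(viewer|developer|operator|admin|owner)$`)
)

func main() {
	// The wildcard fails because "*" is not in [a-z0-9].
	fmt.Println(appPattern.MatchString("*")) // false
	// An ordinary component name passes.
	fmt.Println(appPattern.MatchString("bp-cnpg")) // true
	// The owner tier reference satisfies the tierRoleRef pattern.
	fmt.Println(tierRolePattern.MatchString("openova:tier-owner")) // true
}
```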
* feat(chroot): POST /api/v1/sovereign/secondary-kubeconfig (D16 PR A)
D16 multi-cluster fan-out requires the chroot's k8sCache.Factory to
have all 3 regions' kubeconfigs registered so dashboard handler's
per-cluster h.k8sCache.List(clusterID, ...) enumerates pods from each.
Today the chroot only auto-registers its own in-cluster apiserver via
FactoryFromEnv's chroot self-registration branch. Secondary
kubeconfigs live on the mothership PVC + aren't replicated.
This handler bridges the gap:
- Accepts JSON {deploymentId, regionKey, kubeconfigYaml}
- Validates ids via ^[a-z0-9][a-z0-9-]{0,62}$ pattern (defense in
depth — filename composed from these)
- Writes kubeconfig 0o600 to /var/lib/catalyst/kubeconfigs/<depID>-<region>.yaml
(canonical FactoryFromEnv path so restart re-registers)
- Calls k8sCache.AddCluster — idempotent per Factory contract
PR B (next): mothership-side handover hook iterates secondary regions
and POSTs each kubeconfig to the chroot.
PR C (next): dashboard.go fan-out across all registered cluster IDs
when group_by includes cluster/region.
Per docs/INVIOLABLE-PRINCIPLES.md #10 kubeconfig bytes never enter a
logged struct + are written 0o600.
Memo: feedback_d16_dashboard_multi_cluster_fan_out.md.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(dashboard): multi-cluster fan-out when group_by=cluster|region (D16 PR C)
When group_by includes "cluster" or "region", enumerate ALL registered
k8sCache clusters (primary + secondaries synced via PR #1579's POST
/api/v1/sovereign/secondary-kubeconfig endpoint) and concatenate
podRows from each before aggregation.
Layer-1=Cluster on /dashboard now renders 3 bubbles on a 3-region
Sovereign (was 1 bubble before).
For group_by that ONLY contains {namespace,family,application,vcluster,
sovereign} the primary clusterID's pods are sufficient and faster — no
fan-out cost.
PR B (mothership-side handover hook to POST each secondary kubeconfig)
will complete the chain. Until then, secondaries don't appear in
k8sCache.Clusters() so this fan-out is a no-op on existing provs — but
the code is in place for when PR B lands.
Memo: feedback_d16_dashboard_multi_cluster_fan_out.md.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
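The group_by decision described in the D16 PR C commit above can be sketched as a small selector. The function names, the slice-based API, and the region ids used in main are assumptions for illustration; the actual dashboard handler and k8sCache types are not shown in the log.

```go
package main

import "fmt"

// needsFanOut reports whether any group_by dimension requires pods
// from every registered cluster rather than just the primary.
func needsFanOut(groupBy []string) bool {
	for _, g := range groupBy {
		if g == "cluster" || g == "region" {
			return true
		}
	}
	return false
}

// clustersToQuery (hypothetical) returns the cluster ids whose pod
// rows should be concatenated before aggregation. With no fan-out, or
// with nothing registered yet, it stays on the primary only.
func clustersToQuery(groupBy []string, primary string, registered []string) []string {
	if !needsFanOut(groupBy) || len(registered) == 0 {
		return []string{primary}
	}
	return registered
}

func main() {
	// Example ids only; real cluster ids are not given in the log.
	registered := []string{"hel1", "fsn1", "nbg1"}
	fmt.Println(clustersToQuery([]string{"namespace"}, "hel1", registered))            // [hel1]
	fmt.Println(clustersToQuery([]string{"cluster", "namespace"}, "hel1", registered)) // [hel1 fsn1 nbg1]
}
```

Falling back to the primary when nothing else is registered mirrors the note that the fan-out is a no-op until PR B lands.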
---------
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
bcab6430cb
|
feat(chroot): POST /api/v1/sovereign/secondary-kubeconfig (D16 PR A) (#1579)
* fix(cloudinit): escape $$\{ORG_EMAIL:-\}/$$\{ORG_NAME:-\} in comment (D22)
PR #1571 added a comment mentioning the $${ORG_EMAIL:-}/$${ORG_NAME:-}
slot-file placeholders WITHOUT the $$ escape. tofu's templatefile()
parses comments and tried to interpolate \${ORG_EMAIL:-} as a tofu
expression — failing with "Extra characters after interpolation
expression; Template interpolation doesn't expect a colon".
Caught live on t133 fad01d84f5655004 — tofu plan failed in 30s.
The escape pattern is documented at main.tf:1029 (the same warning
that caught t127 last week). $$ prefix tells tofu's templatefile to
emit literal \${...} to cloud-init for Flux envsubst.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(parent-domains): short-circuit pdmFlipNS when NS already matches (D30)
When an sme-pool domain's current NS records already match the expected
[ns1.<primary>, ns2.<primary>] pair (because the operator already
delegated the domain to OpenOva's PowerDNS), the PDM registrar-flip
step is a no-op. Skipping avoids:
1. Burning a Dynadot API credit on a flip that would be idempotent.
2. The D30 blocker — current Dynadot creds return pdm-status-401
even when the desired NS state already exists. Caught on t132
2026-05-16 day-2 add + t134 2026-05-17 fresh-prov body
parentDomains attempt.
Adds nsAlreadyMatches() helper using net.DefaultResolver.LookupNS with
a 5s timeout. False on lookup error or partial match → fall through to
the original PDM pipeline so a misconfigured/partial domain still goes
through the registrar API.
This unblocks sme-pool entries for omani.homes (already pointing at
ns1/2/3.openova.io). omani.rest / omani.trades still go through the
full flip path because their NS records don't yet match expected.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(handover): D21 owner seed uses catalyst-system namespace
PR #1564 created the owner UserAccess CR with .Namespace("") — the
apiserver returned "could not find the requested resource" because
useraccesses.access.openova.io is NAMESPACED (Crossplane Claim per
the XRD's claimNames block at platform/crossplane-claims/chart/
templates/xrds/useraccess.yaml).
Pin to catalyst-system (where catalyst-api + every Catalyst-authored
CR lives) and stamp the namespace on the object too. The existing
ListUserAccess handler uses Namespace("") so the entry surfaces on
/users without per-namespace filtering.
Verified the CRD shape on t134 2026-05-17:
$ kubectl api-resources --api-group=access.openova.io
useraccesses access.openova.io/v1alpha1 true UserAccess
^^^^
NAMESPACED
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(handover): D21 owner seed uses tierRoleRef not wildcard app
PR #1564 + #1577 created the CR shape with applications=[{app:"*",...}]
but the useraccess XRD schema rejects `app: "*"` (pattern
^[a-z0-9][a-z0-9-]{0,62}$). The seed handler logged
"spec.applications[0].app: Invalid value: \"*\"" on every handover.
The XRD has a `tierRoleRef` field (pattern
^openova:tier-(viewer|developer|operator|admin|owner)$) that's the
canonical owner-tier semantic — when set, useraccess-controller binds
the named ClusterRole on the target via RoleBinding/ClusterRoleBinding.
`openova:tier-owner` is shipped by EPIC-3 (#1098) slice T1's
tier-clusterroles.yaml.
Drop the applications[] block + use tierRoleRef = openova:tier-owner.
Verified live on t135 2026-05-17 — error log showed exact pattern
mismatch before this fix.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(chroot): POST /api/v1/sovereign/secondary-kubeconfig (D16 PR A)
D16 multi-cluster fan-out requires the chroot's k8sCache.Factory to
have all 3 regions' kubeconfigs registered so dashboard handler's
per-cluster h.k8sCache.List(clusterID, ...) enumerates pods from each.
Today the chroot only auto-registers its own in-cluster apiserver via
FactoryFromEnv's chroot self-registration branch. Secondary
kubeconfigs live on the mothership PVC + aren't replicated.
This handler bridges the gap:
- Accepts JSON {deploymentId, regionKey, kubeconfigYaml}
- Validates ids via ^[a-z0-9][a-z0-9-]{0,62}$ pattern (defense in
depth — filename composed from these)
- Writes kubeconfig 0o600 to /var/lib/catalyst/kubeconfigs/<depID>-<region>.yaml
(canonical FactoryFromEnv path so restart re-registers)
- Calls k8sCache.AddCluster — idempotent per Factory contract
PR B (next): mothership-side handover hook iterates secondary regions
and POSTs each kubeconfig to the chroot.
PR C (next): dashboard.go fan-out across all registered cluster IDs
when group_by includes cluster/region.
Per docs/INVIOLABLE-PRINCIPLES.md #10 kubeconfig bytes never enter a
logged struct + are written 0o600.
Memo: feedback_d16_dashboard_multi_cluster_fan_out.md.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
4f62dd21b3
|
fix(handover): D21 owner seed uses tierRoleRef not wildcard app (#1578)
* fix(cloudinit): escape $$\{ORG_EMAIL:-\}/$$\{ORG_NAME:-\} in comment (D22)
PR #1571 added a comment mentioning the $${ORG_EMAIL:-}/$${ORG_NAME:-}
slot-file placeholders WITHOUT the $$ escape. tofu's templatefile()
parses comments and tried to interpolate \${ORG_EMAIL:-} as a tofu
expression — failing with "Extra characters after interpolation
expression; Template interpolation doesn't expect a colon".
Caught live on t133 fad01d84f5655004 — tofu plan failed in 30s.
The escape pattern is documented at main.tf:1029 (the same warning
that caught t127 last week). $$ prefix tells tofu's templatefile to
emit literal \${...} to cloud-init for Flux envsubst.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(parent-domains): short-circuit pdmFlipNS when NS already matches (D30)
When an sme-pool domain's current NS records already match the expected
[ns1.<primary>, ns2.<primary>] pair (because the operator already
delegated the domain to OpenOva's PowerDNS), the PDM registrar-flip
step is a no-op. Skipping avoids:
1. Burning a Dynadot API credit on a flip that would be idempotent.
2. The D30 blocker — current Dynadot creds return pdm-status-401
even when the desired NS state already exists. Caught on t132
2026-05-16 day-2 add + t134 2026-05-17 fresh-prov body
parentDomains attempt.
Adds nsAlreadyMatches() helper using net.DefaultResolver.LookupNS with
a 5s timeout. False on lookup error or partial match → fall through to
the original PDM pipeline so a misconfigured/partial domain still goes
through the registrar API.
This unblocks sme-pool entries for omani.homes (already pointing at
ns1/2/3.openova.io). omani.rest / omani.trades still go through the
full flip path because their NS records don't yet match expected.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(handover): D21 owner seed uses catalyst-system namespace
PR #1564 created the owner UserAccess CR with .Namespace("") — the
apiserver returned "could not find the requested resource" because
useraccesses.access.openova.io is NAMESPACED (Crossplane Claim per
the XRD's claimNames block at platform/crossplane-claims/chart/
templates/xrds/useraccess.yaml).
Pin to catalyst-system (where catalyst-api + every Catalyst-authored
CR lives) and stamp the namespace on the object too. The existing
ListUserAccess handler uses Namespace("") so the entry surfaces on
/users without per-namespace filtering.
Verified the CRD shape on t134 2026-05-17:
$ kubectl api-resources --api-group=access.openova.io
useraccesses access.openova.io/v1alpha1 true UserAccess
^^^^
NAMESPACED
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(handover): D21 owner seed uses tierRoleRef not wildcard app
PR #1564 + #1577 created the CR shape with applications=[{app:"*",...}]
but the useraccess XRD schema rejects `app: "*"` (pattern
^[a-z0-9][a-z0-9-]{0,62}$). The seed handler logged
"spec.applications[0].app: Invalid value: \"*\"" on every handover.
The XRD has a `tierRoleRef` field (pattern
^openova:tier-(viewer|developer|operator|admin|owner)$) that's the
canonical owner-tier semantic — when set, useraccess-controller binds
the named ClusterRole on the target via RoleBinding/ClusterRoleBinding.
`openova:tier-owner` is shipped by EPIC-3 (#1098) slice T1's
tier-clusterroles.yaml.
Drop the applications[] block + use tierRoleRef = openova:tier-owner.
Verified live on t135 2026-05-17 — error log showed exact pattern
mismatch before this fix.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
ea30ded120
|
fix(handover): D21 owner seed uses catalyst-system namespace (#1577)
* fix(cloudinit): escape $$\{ORG_EMAIL:-\}/$$\{ORG_NAME:-\} in comment (D22)
PR #1571 added a comment mentioning the $${ORG_EMAIL:-}/$${ORG_NAME:-}
slot-file placeholders WITHOUT the $$ escape. tofu's templatefile()
parses comments and tried to interpolate \${ORG_EMAIL:-} as a tofu
expression — failing with "Extra characters after interpolation
expression; Template interpolation doesn't expect a colon".
Caught live on t133 fad01d84f5655004 — tofu plan failed in 30s.
The escape pattern is documented at main.tf:1029 (the same warning
that caught t127 last week). $$ prefix tells tofu's templatefile to
emit literal \${...} to cloud-init for Flux envsubst.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(parent-domains): short-circuit pdmFlipNS when NS already matches (D30)
When an sme-pool domain's current NS records already match the expected
[ns1.<primary>, ns2.<primary>] pair (because the operator already
delegated the domain to OpenOva's PowerDNS), the PDM registrar-flip
step is a no-op. Skipping avoids:
1. Burning a Dynadot API credit on a flip that would be idempotent.
2. The D30 blocker — current Dynadot creds return pdm-status-401
even when the desired NS state already exists. Caught on t132
2026-05-16 day-2 add + t134 2026-05-17 fresh-prov body
parentDomains attempt.
Adds nsAlreadyMatches() helper using net.DefaultResolver.LookupNS with
a 5s timeout. False on lookup error or partial match → fall through to
the original PDM pipeline so a misconfigured/partial domain still goes
through the registrar API.
This unblocks sme-pool entries for omani.homes (already pointing at
ns1/2/3.openova.io). omani.rest / omani.trades still go through the
full flip path because their NS records don't yet match expected.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(handover): D21 owner seed uses catalyst-system namespace
PR #1564 created the owner UserAccess CR with .Namespace("") — the
apiserver returned "could not find the requested resource" because
useraccesses.access.openova.io is NAMESPACED (Crossplane Claim per
the XRD's claimNames block at platform/crossplane-claims/chart/
templates/xrds/useraccess.yaml).
Pin to catalyst-system (where catalyst-api + every Catalyst-authored
CR lives) and stamp the namespace on the object too. The existing
ListUserAccess handler uses Namespace("") so the entry surfaces on
/users without per-namespace filtering.
Verified the CRD shape on t134 2026-05-17:
$ kubectl api-resources --api-group=access.openova.io
useraccesses access.openova.io/v1alpha1 true UserAccess
^^^^
NAMESPACED
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
33ed484e04
|
fix(parent-domains): short-circuit pdmFlipNS when NS already matches (D30) (#1576)
* fix(cloudinit): escape $$\{ORG_EMAIL:-\}/$$\{ORG_NAME:-\} in comment (D22)
PR #1571 added a comment mentioning the $${ORG_EMAIL:-}/$${ORG_NAME:-}
slot-file placeholders WITHOUT the $$ escape. tofu's templatefile()
parses comments and tried to interpolate \${ORG_EMAIL:-} as a tofu
expression — failing with "Extra characters after interpolation
expression; Template interpolation doesn't expect a colon".
Caught live on t133 fad01d84f5655004 — tofu plan failed in 30s.
The escape pattern is documented at main.tf:1029 (the same warning
that caught t127 last week). $$ prefix tells tofu's templatefile to
emit literal \${...} to cloud-init for Flux envsubst.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(parent-domains): short-circuit pdmFlipNS when NS already matches (D30)
When an sme-pool domain's current NS records already match the expected
[ns1.<primary>, ns2.<primary>] pair (because the operator already
delegated the domain to OpenOva's PowerDNS), the PDM registrar-flip
step is a no-op. Skipping avoids:
1. Burning a Dynadot API credit on a flip that would be idempotent.
2. The D30 blocker — current Dynadot creds return pdm-status-401
even when the desired NS state already exists. Caught on t132
2026-05-16 day-2 add + t134 2026-05-17 fresh-prov body
parentDomains attempt.
Adds nsAlreadyMatches() helper using net.DefaultResolver.LookupNS with
a 5s timeout. False on lookup error or partial match → fall through to
the original PDM pipeline so a misconfigured/partial domain still goes
through the registrar API.
This unblocks sme-pool entries for omani.homes (already pointing at
ns1/2/3.openova.io). omani.rest / omani.trades still go through the
full flip path because their NS records don't yet match expected.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
3568b72b5e
|
fix(cloud): hide non-active 0/0 chips (D15) (#1574)
* feat(chart): wire OPERATOR_EMAIL/CONTROL_PLANE_IP/GITOPS_REPO_URL/ORG_NAME (D22)
Companion to PR #1567 + #1568: wire the env vars chrootEnsureDeployment
reads to populate the deployment record so Sovereign Console Settings
page renders real values for ownerEmail, controlPlaneIP, gitopsRepoURL,
orgName (instead of `—` placeholders).
Adds 4 new keys to the sovereign-fqdn ConfigMap (orgEmail, orgName,
controlPlaneIP, gitopsRepoURL) sourced from .Values.sovereign.* with
empty defaults. Per-Sovereign overlays wire actual values from
cloud-init substitute placeholders (mirrors regionsJson pattern).
Catalyst-api Pod now reads them via valueFrom configMapKeyRef +
optional=true (Catalyst-Zero/contabo emits no sovereign-fqdn ConfigMap
so env stays empty there; correct, mothership is signer not validator).
Validated: t132 already serves region=hel1, consoleURL, loadBalancerIP
post-#1568. This PR fills the remaining 3 D22 fields when operator
wires the values.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* chore(slot-13): add D22 sovereign-side identity placeholders
Add ${ORG_EMAIL:-} + ${ORG_NAME:-} + ${SOVEREIGN_CONTROL_PLANE_IP:-} +
${GITOPS_REPO_URL:-} envsubst placeholders so when cloud-init wires
them, the chart picks them up via sovereign-fqdn ConfigMap (PR #1569) →
catalyst-api env → chrootEnsureDeployment populates the deployment
record → Settings page renders real values instead of `—`.
This PR alone is a no-op (placeholders default to empty, same as
today). The cloud-init substitute lines + provisioner.go tfvars need to
land in a companion PR to actually populate the values on next-prov.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(cloudinit): wire ORG_EMAIL/ORG_NAME/GITOPS_REPO_URL substitutes (D22)
Companion to #1567+#1568+#1569+#1570: the cloud-init substitute block
now emits ORG_EMAIL/ORG_NAME/GITOPS_REPO_URL into the bootstrap-kit
Kustomization's postBuild.substitute env, which the slot-13
placeholders (#1570) consume via
${ORG_EMAIL:-}/${ORG_NAME:-}/${GITOPS_REPO_URL:-}.
Chain: provisioner.go writeTfvars → tofu vars → cloudinit templatefile
substitute → Flux Kustomization postBuild → sovereign-fqdn ConfigMap
keys (#1569) → catalyst-api env (#1569) → chrootEnsureDeployment
populates the deployment record (#1567 + #1568 fallback).
SOVEREIGN_CONTROL_PLANE_IP omitted intentionally: main.tf:691 notes the
dependency cycle (hcloud_server.cp doesn't exist at cloudinit render
time). Separate PR will source it via metadata-service or post-create
ConfigMap patch.
Next-prov (t133+) Sovereign Console Settings page now renders real
ownerEmail/orgName/gitopsRepoURL instead of `—` placeholders.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(router): chroot /app/<name> only-redirect mothership-only sub-paths (D17/D17b)
PR #1552 stripped the `/app` prefix on Sovereign mode to make
`/app/bp-cnpg` → `/bp-cnpg`, hoping consoleAppDetailRoute would match.
But consoleAppDetailRoute is registered at `/app/$componentId` under
consoleLayoutRoute; no chroot route matches `/<componentId>` directly,
so stripping leaves an empty render path. Playwright walkthrough on
t132 2026-05-17 confirmed: /app/bp-cnpg + /app/bp-coraza both render
body_len=9 (empty).
Invert the logic: only redirect mothership-only sub-paths (/dashboard
Fleet view, /install wizard, /sre, /sec, /blueprints) which have no
Sovereign Console equivalent. For everything else (component names like
`/app/bp-cnpg`, bare `/app`), let TanStack's natural
most-specific-match pick consoleAppDetailRoute / consoleAppsRoute.
Caught live on t132 via Playwright walker3.js (agent a4825c5a).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(handover): re-mint handover JWT on every GetDeployment (D0)
D0 Playwright walkthrough on t132 2026-05-17 caught: handoverURL
persisted at handover-fire time carries a JWT that expires per
DefaultTTL (5min). Operators who click /jobs hours later get the stale
token → Sovereign-side /auth/handover rejects with raw JSON
{"error":"invalid token"}: no UI fallback, no /auth/handover-error,
auto-redirect to /dashboard never fires.
Re-mint the JWT on every GetDeployment when deployment is ready +
handover-fired so the URL returned to the wizard is always
freshly-signed. Best-effort: on mint failure, leave the existing URL in
place so a transient signer error doesn't break polling. Helper is
idempotent + locked.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(cloud): hide non-active 0/0 chips (D15)
Playwright walkthrough on t132 2026-05-17 caught D15 PARTIAL: 15 chips
are correct but Bucket+Volume show 0/0. Founder rule (DoD D15): "No
kind chip shows 0/0 for a resource that actually exists in the
cluster". Bucket+Volume genuinely don't exist on this Sovereign so
showing 0/0 is noise.
Hide chips with count exactly 0 unless they're the active selection
(operator who navigated to an empty kind keeps context).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
58dbb92f4f
|
fix(handover): re-mint handover JWT on every GetDeployment (D0) (#1573)
* feat(chart): wire OPERATOR_EMAIL/CONTROL_PLANE_IP/GITOPS_REPO_URL/ORG_NAME (D22)
Companion to PR #1567 + #1568: wire the env vars chrootEnsureDeployment
reads to populate the deployment record so Sovereign Console Settings
page renders real values for ownerEmail, controlPlaneIP, gitopsRepoURL,
orgName (instead of `—` placeholders).
Adds 4 new keys to the sovereign-fqdn ConfigMap (orgEmail, orgName,
controlPlaneIP, gitopsRepoURL) sourced from .Values.sovereign.* with
empty defaults. Per-Sovereign overlays wire actual values from
cloud-init substitute placeholders (mirrors regionsJson pattern).
Catalyst-api Pod now reads them via valueFrom configMapKeyRef +
optional=true (Catalyst-Zero/contabo emits no sovereign-fqdn ConfigMap
so env stays empty there; correct, mothership is signer not validator).
Validated: t132 already serves region=hel1, consoleURL, loadBalancerIP
post-#1568. This PR fills the remaining 3 D22 fields when operator
wires the values.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* chore(slot-13): add D22 sovereign-side identity placeholders
Add ${ORG_EMAIL:-} + ${ORG_NAME:-} + ${SOVEREIGN_CONTROL_PLANE_IP:-} +
${GITOPS_REPO_URL:-} envsubst placeholders so when cloud-init wires
them, the chart picks them up via sovereign-fqdn ConfigMap (PR #1569) →
catalyst-api env → chrootEnsureDeployment populates the deployment
record → Settings page renders real values instead of `—`.
This PR alone is a no-op (placeholders default to empty, same as
today). The cloud-init substitute lines + provisioner.go tfvars need to
land in a companion PR to actually populate the values on next-prov.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(cloudinit): wire ORG_EMAIL/ORG_NAME/GITOPS_REPO_URL substitutes (D22)
Companion to #1567+#1568+#1569+#1570: the cloud-init substitute block
now emits ORG_EMAIL/ORG_NAME/GITOPS_REPO_URL into the bootstrap-kit
Kustomization's postBuild.substitute env, which the slot-13
placeholders (#1570) consume via
${ORG_EMAIL:-}/${ORG_NAME:-}/${GITOPS_REPO_URL:-}.
Chain: provisioner.go writeTfvars → tofu vars → cloudinit templatefile
substitute → Flux Kustomization postBuild → sovereign-fqdn ConfigMap
keys (#1569) → catalyst-api env (#1569) → chrootEnsureDeployment
populates the deployment record (#1567 + #1568 fallback).
SOVEREIGN_CONTROL_PLANE_IP omitted intentionally: main.tf:691 notes the
dependency cycle (hcloud_server.cp doesn't exist at cloudinit render
time). Separate PR will source it via metadata-service or post-create
ConfigMap patch.
Next-prov (t133+) Sovereign Console Settings page now renders real
ownerEmail/orgName/gitopsRepoURL instead of `—` placeholders.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(router): chroot /app/<name> only-redirect mothership-only sub-paths (D17/D17b)
PR #1552 stripped the `/app` prefix on Sovereign mode to make
`/app/bp-cnpg` → `/bp-cnpg`, hoping consoleAppDetailRoute would match.
But consoleAppDetailRoute is registered at `/app/$componentId` under
consoleLayoutRoute; no chroot route matches `/<componentId>` directly,
so stripping leaves an empty render path. Playwright walkthrough on
t132 2026-05-17 confirmed: /app/bp-cnpg + /app/bp-coraza both render
body_len=9 (empty).
Invert the logic: only redirect mothership-only sub-paths (/dashboard
Fleet view, /install wizard, /sre, /sec, /blueprints) which have no
Sovereign Console equivalent. For everything else (component names like
`/app/bp-cnpg`, bare `/app`), let TanStack's natural
most-specific-match pick consoleAppDetailRoute / consoleAppsRoute.
Caught live on t132 via Playwright walker3.js (agent a4825c5a).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(handover): re-mint handover JWT on every GetDeployment (D0)
D0 Playwright walkthrough on t132 2026-05-17 caught: handoverURL
persisted at handover-fire time carries a JWT that expires per
DefaultTTL (5min). Operators who click /jobs hours later get the stale
token → Sovereign-side /auth/handover rejects with raw JSON
{"error":"invalid token"}: no UI fallback, no /auth/handover-error,
auto-redirect to /dashboard never fires.
Re-mint the JWT on every GetDeployment when deployment is ready +
handover-fired so the URL returned to the wizard is always
freshly-signed. Best-effort: on mint failure, leave the existing URL in
place so a transient signer error doesn't break polling. Helper is
idempotent + locked.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
9e1e4224d8
|
fix(router): chroot /app/<name> only-redirect mothership-only sub-paths (D17/D17b) (#1572)
* feat(chart): wire OPERATOR_EMAIL/CONTROL_PLANE_IP/GITOPS_REPO_URL/ORG_NAME (D22)
Companion to PR #1567 + #1568: wire the env vars chrootEnsureDeployment
reads to populate the deployment record so Sovereign Console Settings
page renders real values for ownerEmail, controlPlaneIP, gitopsRepoURL,
orgName (instead of `—` placeholders).
Adds 4 new keys to the sovereign-fqdn ConfigMap (orgEmail, orgName,
controlPlaneIP, gitopsRepoURL) sourced from .Values.sovereign.* with
empty defaults. Per-Sovereign overlays wire actual values from
cloud-init substitute placeholders (mirrors regionsJson pattern).
Catalyst-api Pod now reads them via valueFrom configMapKeyRef +
optional=true (Catalyst-Zero/contabo emits no sovereign-fqdn ConfigMap
so env stays empty there; correct, mothership is signer not validator).
Validated: t132 already serves region=hel1, consoleURL, loadBalancerIP
post-#1568. This PR fills the remaining 3 D22 fields when operator
wires the values.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* chore(slot-13): add D22 sovereign-side identity placeholders
Add ${ORG_EMAIL:-} + ${ORG_NAME:-} + ${SOVEREIGN_CONTROL_PLANE_IP:-} +
${GITOPS_REPO_URL:-} envsubst placeholders so when cloud-init wires
them, the chart picks them up via sovereign-fqdn ConfigMap (PR #1569) →
catalyst-api env → chrootEnsureDeployment populates the deployment
record → Settings page renders real values instead of `—`.
This PR alone is a no-op (placeholders default to empty, same as
today). The cloud-init substitute lines + provisioner.go tfvars need to
land in a companion PR to actually populate the values on next-prov.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(cloudinit): wire ORG_EMAIL/ORG_NAME/GITOPS_REPO_URL substitutes (D22)
Companion to #1567+#1568+#1569+#1570: the cloud-init substitute block
now emits ORG_EMAIL/ORG_NAME/GITOPS_REPO_URL into the bootstrap-kit
Kustomization's postBuild.substitute env, which the slot-13
placeholders (#1570) consume via
${ORG_EMAIL:-}/${ORG_NAME:-}/${GITOPS_REPO_URL:-}.
Chain: provisioner.go writeTfvars → tofu vars → cloudinit templatefile
substitute → Flux Kustomization postBuild → sovereign-fqdn ConfigMap
keys (#1569) → catalyst-api env (#1569) → chrootEnsureDeployment
populates the deployment record (#1567 + #1568 fallback).
SOVEREIGN_CONTROL_PLANE_IP omitted intentionally: main.tf:691 notes the
dependency cycle (hcloud_server.cp doesn't exist at cloudinit render
time). Separate PR will source it via metadata-service or post-create
ConfigMap patch.
Next-prov (t133+) Sovereign Console Settings page now renders real
ownerEmail/orgName/gitopsRepoURL instead of `—` placeholders.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(router): chroot /app/<name> only-redirect mothership-only sub-paths (D17/D17b)
PR #1552 stripped the `/app` prefix on Sovereign mode to make
`/app/bp-cnpg` → `/bp-cnpg`, hoping consoleAppDetailRoute would match.
But consoleAppDetailRoute is registered at `/app/$componentId` under
consoleLayoutRoute; no chroot route matches `/<componentId>` directly,
so stripping leaves an empty render path. Playwright walkthrough on
t132 2026-05-17 confirmed: /app/bp-cnpg + /app/bp-coraza both render
body_len=9 (empty).
Invert the logic: only redirect mothership-only sub-paths (/dashboard
Fleet view, /install wizard, /sre, /sec, /blueprints) which have no
Sovereign Console equivalent. For everything else (component names like
`/app/bp-cnpg`, bare `/app`), let TanStack's natural
most-specific-match pick consoleAppDetailRoute / consoleAppsRoute.
Caught live on t132 via Playwright walker3.js (agent a4825c5a).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
6618392407
|
fix(chroot): GetDeployment falls back to chrootEnsureDeployment (D22) (#1568)
* feat(handover): auto-seed owner UserAccess CR on chroot (D21)
Closes the D21 gap on Sovereign DoD: /users page returned empty after
fresh handover because Keycloak `sovereign-admins` membership was
established but no UserAccess CR existed for the operator.
After `keycloak.EnsureUser` succeeds in `AuthHandover`, the helper
`EnsureOwnerUserAccess` upserts a cluster-scoped UserAccess CR shaped
like the canonical user_access.go `CreateUserAccess` write:
apiVersion: access.openova.io/v1alpha1
kind: UserAccess
metadata:
name: useraccess-owner-<sanitized-email>
annotations:
catalyst.openova.io/user-email: <email> # rbac_matrix:309 hint
spec:
user:
keycloakSubject: <email>
sovereignRef: <fqdn-first-label>
applications:
- app: "*"
role: admin # owner -> admin
The Composition (issue #322) reconciles the Claim into per-app
RoleBindings on the Sovereign so the operator surfaces in /users.
Best-effort + idempotent: AlreadyExists on the second handover is
folded to nil; any other error is logged at Warn and the handover
itself never fails. If the access.openova.io CRD has not rolled yet,
the next handover retries automatically.
Architect-first: mirrors `userAccessToUnstructured` shape and uses
existing `sovereignDynamicClient` + `rbacAssignSlug` seams. Tier
mapping follows the documented lossy `owner -> admin` rule in
`userAccessTierToRole` (CRD only accepts admin|editor|viewer).
Refs: docs/SOVEREIGN-MULTI-REGION-DOD.md D21
* chore(slot-13): pin bp-catalyst-platform to 1.4.147 (D21+D31 baked)
PR #1562 (D31 wordpress-tenant activeHotStandby) + PR #1564 (D21 owner
UserAccess auto-seed at handover, catalyst-api:8d2a947) both packaged
into chart 1.4.147. Pin slot so t133+ gets both gates on first prov.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(chart): regionsJson uses toJson to defeat YAML flow-seq re-parse (D5)
PR #1551 single-quoted SOVEREIGN_REGIONS_JSON in the slot file
substitute, but Flux Kustomize's postBuild can still re-parse the
JSON-shaped string as a YAML flow-sequence depending on quoting context.
When that happens .Values.sovereign.regionsJson is a Go []interface{}
of map[interface{}]interface{} and `| quote` prints Go's
`[map[cloudRegion:hel1 ...]]` syntax — catalyst-api's json.Unmarshal of
the env var then fails and Request.Regions is empty.
toJson normalises both string and list inputs to valid JSON.
Caught live on t132 2026-05-16 chart 1.4.147: env var rendered as
`[map[cloudRegion:hel1 ...]]` despite #1551 being in effect.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(chroot): populate deployment Result + Request fields for D22
Settings page on Sovereign Console renders `—` for Region / Sovereign /
Created / DeploymentID / Pool subdomain because chroot's GET
/api/v1/deployments/<id> returns empty strings for those fields.
Populate from existing env vars (best-effort — empty when chart hasn't
wired them yet, which is no worse than today's behaviour):
- Result.ConsoleURL = "https://console.<fqdn>" (derived from selfFQDN)
- Result.GitOpsRepoURL from GITOPS_REPO_URL env
- Result.ControlPlaneIP from SOVEREIGN_CONTROL_PLANE_IP env
- Request.Region = regions[0].CloudRegion (top-level legacy field)
- Request.OrgEmail from OPERATOR_EMAIL env
- Request.OrgName from ORG_NAME env
Companion chart PR will wire the env vars from .Values.global.* +
cloud-init substitute placeholders. This PR is BACKWARD-compatible —
unset env vars produce empty strings, same as today.
Caught live on t132 2026-05-16 —
`curl /api/v1/deployments/sovereign-t132.omani.works` returns empty
ownerEmail/region/consoleURL.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(chroot): GetDeployment falls back to chrootEnsureDeployment (D22)
GetDeployment was the only handler that returned 404 without calling
chrootEnsureDeployment. After a catalyst-api Pod restart on the chroot
the in-memory store is empty until some other handler (StreamLogs,
jobs list) primes it via its own synth call — meanwhile the Sovereign
Console Settings page loads /api/v1/deployments/<id> first and gets
404, rendering the entire page broken.
Mirror the StreamLogs pattern (lines 1247-1254): try in-memory load,
fall through to chrootEnsureDeployment, return 404 only when both miss.
This unblocks PR #1567's deployment-record population — without the
fallback, GetDeployment can never serve the populated record on chroot.
Caught live on t132 2026-05-16 after #1567 image roll: Settings page
404 because in-memory store was empty.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
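The load-then-synthesise fallback described for GetDeployment can be sketched as below. This is a minimal illustration, not the catalyst-api code: the `store` type, the `ensure` callback standing in for `chrootEnsureDeployment`, and its signature are all assumptions.

```go
package main

import (
	"errors"
	"fmt"
)

// Deployment is a hypothetical stand-in for the catalyst-api record.
type Deployment struct {
	ID            string
	SovereignFQDN string
}

var errNotFound = errors.New("deployment not found")

// store models the in-memory store, which is empty after a Pod restart.
// ensure stands in for chrootEnsureDeployment; its signature is assumed.
type store struct {
	byID   map[string]*Deployment
	ensure func(id string) (*Deployment, bool)
}

// get tries the in-memory load first, falls through to the synth call,
// and reports not-found only when both miss.
func (s *store) get(id string) (*Deployment, error) {
	if d, ok := s.byID[id]; ok {
		return d, nil
	}
	if d, ok := s.ensure(id); ok {
		s.byID[id] = d // prime the store so later loads hit memory
		return d, nil
	}
	return nil, errNotFound
}

// newChrootStore builds an empty store whose synth call only knows the
// Sovereign's own deployment id.
func newChrootStore(selfID, fqdn string) *store {
	return &store{
		byID: map[string]*Deployment{},
		ensure: func(id string) (*Deployment, bool) {
			if id != selfID {
				return nil, false
			}
			return &Deployment{ID: selfID, SovereignFQDN: fqdn}, true
		},
	}
}

func main() {
	s := newChrootStore("sovereign-t132.omani.works", "t132.omani.works")
	d, err := s.get("sovereign-t132.omani.works")
	fmt.Println(d != nil, err)
}
```

An HTTP handler built on this shape would map `errNotFound` to a 404 and anything returned by `get` to a 200, so a freshly restarted Pod serves the record on the first request.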
|
|
ed63ecd09f
|
fix(chroot): populate deployment Result + Request fields for D22 settings (#1567)
* feat(handover): auto-seed owner UserAccess CR on chroot (D21)
Closes the D21 gap on Sovereign DoD: /users page returned empty after
fresh handover because Keycloak `sovereign-admins` membership was
established but no UserAccess CR existed for the operator.
After `keycloak.EnsureUser` succeeds in `AuthHandover`, the helper
`EnsureOwnerUserAccess` upserts a cluster-scoped UserAccess CR shaped
like the canonical user_access.go `CreateUserAccess` write:
apiVersion: access.openova.io/v1alpha1
kind: UserAccess
metadata:
name: useraccess-owner-<sanitized-email>
annotations:
catalyst.openova.io/user-email: <email> # rbac_matrix:309 hint
spec:
user:
keycloakSubject: <email>
sovereignRef: <fqdn-first-label>
applications:
- app: "*"
role: admin # owner -> admin
The Composition (issue #322) reconciles the Claim into per-app
RoleBindings on the Sovereign so the operator surfaces in /users.
Best-effort + idempotent: AlreadyExists on the second handover is
folded to nil; any other error is logged at Warn and the handover
itself never fails. If the access.openova.io CRD has not rolled yet,
the next handover retries automatically.
Architect-first: mirrors `userAccessToUnstructured` shape and uses
existing `sovereignDynamicClient` + `rbacAssignSlug` seams. Tier
mapping follows the documented lossy `owner -> admin` rule in
`userAccessTierToRole` (CRD only accepts admin|editor|viewer).
Refs: docs/SOVEREIGN-MULTI-REGION-DOD.md D21
* chore(slot-13): pin bp-catalyst-platform to 1.4.147 (D21+D31 baked)
PR #1562 (D31 wordpress-tenant activeHotStandby) + PR #1564 (D21 owner
UserAccess auto-seed at handover, catalyst-api:8d2a947) both packaged
into chart 1.4.147. Pin slot so t133+ gets both gates on first prov.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(chart): regionsJson uses toJson to defeat YAML flow-seq re-parse (D5)
PR #1551 single-quoted SOVEREIGN_REGIONS_JSON in the slot file
substitute, but Flux Kustomize's postBuild can still re-parse the
JSON-shaped string as a YAML flow-sequence depending on quoting context.
When that happens .Values.sovereign.regionsJson is a Go []interface{}
of map[interface{}]interface{} and `| quote` prints Go's
`[map[cloudRegion:hel1 ...]]` syntax — catalyst-api's json.Unmarshal of
the env var then fails and Request.Regions is empty.
toJson normalises both string and list inputs to valid JSON.
Caught live on t132 2026-05-16 chart 1.4.147: env var rendered as
`[map[cloudRegion:hel1 ...]]` despite #1551 being in effect.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(chroot): populate deployment Result + Request fields for D22
Settings page on Sovereign Console renders `—` for Region / Sovereign /
Created / DeploymentID / Pool subdomain because chroot's GET
/api/v1/deployments/<id> returns empty strings for those fields.
Populate from existing env vars (best-effort — empty when chart hasn't
wired them yet, which is no worse than today's behaviour):
- Result.ConsoleURL = "https://console.<fqdn>" (derived from selfFQDN)
- Result.GitOpsRepoURL from GITOPS_REPO_URL env
- Result.ControlPlaneIP from SOVEREIGN_CONTROL_PLANE_IP env
- Request.Region = regions[0].CloudRegion (top-level legacy field)
- Request.OrgEmail from OPERATOR_EMAIL env
- Request.OrgName from ORG_NAME env
Companion chart PR will wire the env vars from .Values.global.* +
cloud-init substitute placeholders. This PR is BACKWARD-compatible —
unset env vars produce empty strings, same as today.
Caught live on t132 2026-05-16 —
`curl /api/v1/deployments/sovereign-t132.omani.works` returns empty
ownerEmail/region/consoleURL.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
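A rough sketch of the best-effort population listed in #1567 follows. The env var names and the `https://console.<fqdn>` derivation come from the commit message; the struct shapes, the `populate` function, the injected `getenv`, and the empty-FQDN guard are assumptions made for illustration.

```go
package main

import "fmt"

// Result and Request are hypothetical, trimmed-down record shapes.
type Result struct {
	ConsoleURL, GitOpsRepoURL, ControlPlaneIP string
}

type Request struct {
	Region, OrgEmail, OrgName string
}

// populate fills the record from the environment. getenv is injected so
// the sketch can run without touching the process environment; os.Getenv
// has the same signature and can be passed directly.
func populate(selfFQDN string, regions []string, getenv func(string) string) (Result, Request) {
	res := Result{
		GitOpsRepoURL:  getenv("GITOPS_REPO_URL"),
		ControlPlaneIP: getenv("SOVEREIGN_CONTROL_PLANE_IP"),
	}
	if selfFQDN != "" { // assumption: no URL is derived from an empty FQDN
		res.ConsoleURL = "https://console." + selfFQDN
	}
	req := Request{
		OrgEmail: getenv("OPERATOR_EMAIL"),
		OrgName:  getenv("ORG_NAME"),
	}
	if len(regions) > 0 {
		req.Region = regions[0] // top-level legacy field mirrors the first region
	}
	return res, req
}

func main() {
	env := map[string]string{"ORG_NAME": "Example Org"}
	res, req := populate("t132.omani.works", []string{"hel1"}, func(k string) string { return env[k] })
	fmt.Printf("%+v %+v\n", res, req)
}
```

Unset variables simply yield empty strings here, which matches the backward-compatibility note in the commit message.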
|
|
8d2a947cfb
|
feat(handover): auto-seed owner UserAccess CR on chroot (D21) (#1564)
Closes the D21 gap on Sovereign DoD: /users page returned empty after
fresh handover because Keycloak `sovereign-admins` membership was
established but no UserAccess CR existed for the operator.
After `keycloak.EnsureUser` succeeds in `AuthHandover`, the helper
`EnsureOwnerUserAccess` upserts a cluster-scoped UserAccess CR shaped
like the canonical user_access.go `CreateUserAccess` write:
apiVersion: access.openova.io/v1alpha1
kind: UserAccess
metadata:
name: useraccess-owner-<sanitized-email>
annotations:
catalyst.openova.io/user-email: <email> # rbac_matrix:309 hint
spec:
user:
keycloakSubject: <email>
sovereignRef: <fqdn-first-label>
applications:
- app: "*"
role: admin # owner -> admin
The Composition (issue #322) reconciles the Claim into per-app
RoleBindings on the Sovereign so the operator surfaces in /users.
Best-effort + idempotent: AlreadyExists on the second handover is
folded to nil; any other error is logged at Warn and the handover
itself never fails. If the access.openova.io CRD has not rolled yet,
the next handover retries automatically.
Architect-first: mirrors `userAccessToUnstructured` shape and uses
existing `sovereignDynamicClient` + `rbacAssignSlug` seams. Tier
mapping follows the documented lossy `owner -> admin` rule in
`userAccessTierToRole` (CRD only accepts admin|editor|viewer).
Refs: docs/SOVEREIGN-MULTI-REGION-DOD.md D21
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
|
||
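The "best-effort + idempotent" behaviour described for the owner UserAccess seed can be illustrated with a stdlib-only sketch. The `fakeCluster` type and sentinel errors are hypothetical; against a real cluster the equivalent check would typically be `apierrors.IsAlreadyExists` from `k8s.io/apimachinery/pkg/api/errors`.

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

var (
	errAlreadyExists = errors.New("already exists")
	errCRDMissing    = errors.New("no matches for kind UserAccess")
)

// fakeCluster is a hypothetical stand-in for the dynamic client.
type fakeCluster struct {
	objects map[string]bool
	fail    error // when set, every create fails with this error
}

func (c *fakeCluster) create(name string) error {
	if c.fail != nil {
		return c.fail
	}
	if c.objects[name] {
		return fmt.Errorf("useraccess %q: %w", name, errAlreadyExists)
	}
	c.objects[name] = true
	return nil
}

// ensureOwnerAccess folds AlreadyExists to nil so a second handover is a
// no-op; any other error is returned to the caller.
func ensureOwnerAccess(c *fakeCluster, name string) error {
	err := c.create(name)
	if errors.Is(err, errAlreadyExists) {
		return nil
	}
	return err
}

// handover logs a seed failure as a warning and never fails because of it.
func handover(c *fakeCluster, name string) error {
	if err := ensureOwnerAccess(c, name); err != nil {
		log.Printf("warn: owner UserAccess seed failed: %v", err)
	}
	return nil
}

func main() {
	c := &fakeCluster{objects: map[string]bool{}}
	fmt.Println(handover(c, "useraccess-owner-ops-example-com"))
	fmt.Println(handover(c, "useraccess-owner-ops-example-com"))
}
```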
|
|
2fd4e3cbf4
|
feat(wizard): default marketplaceEnabled=true for D27 zero-touch (#1555)
Founder ruling 2026-05-16: D27 mandates that a fresh wizard provisions a Sovereign already ready to host tenant orgs (D29). Operator can still flip the toggle off on StepMarketplace if they explicitly want a private Sovereign.
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
9f096b0b18
|
fix(chroot): populate Result.LoadBalancerIP so canvas shows LB chip (D15) (#1553)
chrootEnsureDeployment was synthesizing a Deployment with Result=nil. The topology loader's buildLBs() returned [] on nil-Result → canvas chip showed `LoadBalancer 0/0` on every chroot Sovereign Console even though the Sovereign ingress LB was allocated and serving console.<fqdn>.
Populate Result with LoadBalancerIP from `SOVEREIGN_LB_IP` env (set by bp-catalyst-platform's sovereign-fqdn ConfigMap `lbIP` key per issue #900 / PR #145). buildLBs then emits one LoadBalancer entry per region using the canonical primary LB.
Caught on t131 2026-05-16 — DoD D15. Same chroot-synth-enrichment pattern as PR #1534 (SOVEREIGN_REGIONS_JSON).
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
124ac13c1d
|
fix(router): chroot Sovereign /app/<name> resolves to AppDetail, not mothership AppsPage (D17b) (#1552)
Two route trees claim `/app`:
1. `appRoute` (line 364) — mothership AppLayout chrome, prefix `/app`, children `/app/$deploymentId/applications/*`, `/app/$deploymentId/settings`, `/app/dashboard` (fleet view), etc. ~30 children.
2. `consoleAppDetailRoute` (line 1141, under consoleLayoutRoute) — clean `/app/$componentId` for the chroot Sovereign Console's per-app detail.
On a chroot Sovereign Console (DETECTED_MODE.mode === 'sovereign') the operator clicks `/apps/<card>` → AppCard generates HREF `/app/<name>` (AppsPage.tsx line ~720, correct for chroot context). TanStack router resolves to the MOTHERSHIP `appRoute` because it matches first (registered earlier under rootRoute) and its children accept `<name>` as $deploymentId. The page renders AppLayout chrome + AppsPage with mothership sidebar — looks nothing like AppDetail.
Founder observation (BUG-002 from /tmp/test-matrix-t129.json + reported on t131 2026-05-16):
> Application individual pages are not visible at all in the child
> while mothership doesn't have that issue, this is the biggest blunder!
Fix: `appRoute.beforeLoad` redirects on chroot:
- `/app/<componentId>` → `/<componentId>` (caught by consoleAppDetailRoute)
- `/app/dashboard`, `/app/install`, `/app/sre/*`, `/app/sec/*`, `/app/blueprints` → `/dashboard` (canonical Sovereign landing; these are mothership-only surfaces — already partially fixed at dashboardRoute level by PR #1547)
Mothership behavior unchanged (DETECTED_MODE.mode !== 'sovereign' falls through to the existing AppLayout-rooted tree).
Refs DoD D17b. Caught on t131 (623354058b114dd6, 2026-05-16).
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
fbe23da091
|
fix(ui-nginx): allow Google Fonts domains in CSP (D26) (#1549)
Sovereign Console pages reference Inter + JetBrains Mono fonts via fonts.googleapis.com (index.html lines 9, 11). The nginx CSP only allowed font-src 'self' data: — so the browser blocked the font stylesheet AND the woff2 fetches, falling back to system fonts.
Add fonts.googleapis.com to style-src (for the @import CSS) and fonts.gstatic.com to font-src (for the woff2 assets). All 3 CSP occurrences in nginx.conf updated identically.
Alternative considered: self-host the woff2 + drop the external references. Skipped for now — sticking with Google Fonts CDN is faster + matches every other web app's posture. If the operator wants air-gap-compatible Sovereigns later, switch to self-hosted.
Caught on t129 2026-05-16 — DoD D26.
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
7845a00799
|
fix(dashboard): add region + vcluster as TreemapDimensions (D16) (#1548)
Multi-region operators on the Sovereign Console couldn't pivot the /dashboard treemap by region or vCluster. The TreemapDimension union (FE) and dashboardDimension set (BE) only included sovereign/cluster/family/namespace/application.
This PR:
- Adds 'region' + 'vcluster' to TreemapDimension type (products/catalyst/bootstrap/ui/src/lib/treemap.types.ts)
- Adds them to the dimension select options (products/catalyst/bootstrap/ui/src/components/TreemapLayerController.tsx)
- Adds them to the validated set in dashboard.go
- Adds podRow.region + podRow.vcluster fields populated from openova.io/region and catalyst.openova.io/vcluster-role labels
- Extends dimensionKey switch to bucket by these new dimensions (fallback: region→cluster, vcluster→"host")
Caught on t129 2026-05-16 — DoD D16.
Note that full multi-cluster fan-out (aggregating pods across all 3 region kubeconfigs into one treemap) is a separate refactor not included here; this PR delivers the dimension surface so the layer selector is usable + a fresh prov with the chroot's k8scache extended to multi-region will render 3 cluster bubbles when the operator picks Layer-1=cluster.
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
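The dimension bucketing with its two fallbacks (region→cluster, vcluster→"host") might look roughly like this. The label keys and fallbacks are taken from the commit message; the `podRow` field set, `rowFromLabels`, and the string-typed dimension are assumptions, not the dashboard.go code.

```go
package main

import "fmt"

// podRow is a hypothetical, trimmed-down row shape.
type podRow struct {
	cluster, namespace, region, vcluster string
}

// rowFromLabels reads the two labels named in the commit message.
func rowFromLabels(cluster, namespace string, labels map[string]string) podRow {
	return podRow{
		cluster:   cluster,
		namespace: namespace,
		region:    labels["openova.io/region"],
		vcluster:  labels["catalyst.openova.io/vcluster-role"],
	}
}

// dimensionKey returns the bucket a row falls into for one treemap layer.
func dimensionKey(r podRow, dim string) string {
	switch dim {
	case "cluster":
		return r.cluster
	case "namespace":
		return r.namespace
	case "region":
		if r.region != "" {
			return r.region
		}
		return r.cluster // fallback: unlabeled rows bucket by cluster
	case "vcluster":
		if r.vcluster != "" {
			return r.vcluster
		}
		return "host" // fallback: pods outside any vCluster
	default:
		return ""
	}
}

func main() {
	r := rowFromLabels("t129-mesh", "kube-system", nil)
	fmt.Println(dimensionKey(r, "region"), dimensionKey(r, "vcluster"))
}
```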
|
|
52015ff468
|
fix(ui): t129 SPA routing — bp-bp- prefix, PIN /wizard leak, /app/dashboard fleet leak (#1547)
Three operator-visible SPA routing bugs caught on live t129 Sovereign Console (t129.omani.works, 2026-05-16). Closes #1546.
BUG-001 (D19) — doubled /app/bp-bp-* href on 10 of 44 app cards.
build-catalog.mjs::listBootstrapKit extracted slug from `NN-(.+)\.yaml` without stripping an optional `bp-` already present in some filenames (e.g. `13-bp-catalyst-platform.yaml`). The captured slug became `bp-catalyst-platform`, then `id: \`bp-${slug}\`` doubled it to `bp-bp-catalyst-platform`, breaking the FE↔BE HR-name join and printing the doubled prefix on the AppsPage card href.
Fix: strip a leading `bp-` from the captured slug before forming the canonical id. Regenerated catalog.generated.ts + blueprints.json — 10 entries collapse to their single-prefix canonical form (bp-catalyst-platform, bp-cert-manager-powerdns-webhook, bp-k8s-ws-proxy, bp-guacamole, bp-dmz-vcluster, bp-hcloud-ccm, bp-openova-flow-server, bp-openova-flow-emitter, bp-mgmt-vcluster, bp-rtz-vcluster).
BUG-015 (D23, extends D0) — PIN-verify lands /wizard on Sovereign.
VerifyPinPage default landing was `/wizard` regardless of operating mode. On a chroot Sovereign Console (DETECTED_MODE.mode === 'sovereign') the operator has just been auto-redirected from the mothership handover URL; their Sovereign is already converged. Routing them to the new-prov wizard re-prompts for org details and contradicts D0.
Fix: branch on DETECTED_MODE.mode — `/dashboard` on sovereign, `/wizard` on catalyst-zero. Mothership flow unchanged.
Test: VerifyPinPage.test.tsx asserts the 3 cases (sovereign default, catalyst-zero default, explicit next= override).
BUG-016 (D24) — /app/dashboard exposes mothership fleet view.
appRoute's `/dashboard` child mounts DashboardPage (multi-Sovereign fleet, "7 Sovereigns" with duplicate rows). On a Sovereign Console this surface MUST NOT be reachable — the Sovereign owns ONE deployment, fleet is mothership-only.
Fix: beforeLoad on dashboardRoute redirects to `/dashboard` (consoleDashboardRoute, the per-Sovereign landing) when DETECTED_MODE.mode === 'sovereign'. Mothership keeps the fleet view as today.
Refs: docs/SOVEREIGN-MULTI-REGION-DOD.md D19/D23/D24, /tmp/test-matrix-t129.json discoveries BUG-001/015/016.
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
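The BUG-001 fix (strip an existing `bp-` before prefixing) is easy to show in isolation. The real code lives in build-catalog.mjs and is JavaScript; the Go sketch below only mirrors the logic, and the anchored regular expression and the `canonicalID` name are assumptions.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// slotFile approximates the `NN-(.+)\.yaml` pattern from the commit
// message; the anchors and the digit class are assumptions.
var slotFile = regexp.MustCompile(`^[0-9]+-(.+)\.yaml$`)

// canonicalID derives the blueprint id from a bootstrap-kit filename,
// stripping a leading "bp-" from the slug so the prefix is never doubled.
func canonicalID(filename string) (string, bool) {
	m := slotFile.FindStringSubmatch(filename)
	if m == nil {
		return "", false
	}
	slug := strings.TrimPrefix(m[1], "bp-")
	return "bp-" + slug, true
}

func main() {
	for _, f := range []string{"13-bp-catalyst-platform.yaml", "01-cilium.yaml"} {
		id, ok := canonicalID(f)
		fmt.Println(f, id, ok)
	}
}
```

Without the `TrimPrefix` call the first filename would yield `bp-bp-catalyst-platform`, which is the doubled form the commit describes.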
|
|
2b3888eed5
|
fix(ui): suppress chroot-side false-positive notifications (D17, D18) (#1543)
Two notification spammers on the chroot Sovereign Console that produce noise on every /apps + /app/<name> visit:
D17 — "Deployment id in the URL is malformed": AppsPage.tsx fires on isDeploymentID(rawDeploymentId)=false. On the chroot, useResolvedDeploymentId resolves to /api/v1/sovereign/self which returns the synthesized canonical id `sovereign-<fqdn>` (26 chars, not hex). The notification claims that path-segment is invalid even though there is no URL segment — the resolution path is in-process. Suppress on DETECTED_MODE.mode === 'sovereign'.
D18 — "Per-component install monitoring is unavailable": Fires on state.phase1WatchSkipped. On the chroot, phase1WatchSkipped is a MOTHERSHIP-only concept (mother's observer pod failed to fetch the new cluster's kubeconfig). The Sovereign-side catalyst-api runs IN the cluster it's reporting on — has the in-cluster ServiceAccount + bundled sovereignDynamicClient + informer cache watching HelmReleases natively. Firing this here tells operator to drop to kubectl when the data is on the page. Suppress on chroot.
Caught on t129 (6cddff7ef4432bdc, 2026-05-16) — DoD D17 + D18.
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
536bfcb699
|
fix(infrastructure): vCluster fallback from namespace label (D15) (#1542)
loadVClusters() queried vcluster.io/v1alpha1 CRs only. Our bootstrap
topology ships loft-sh/vcluster as a plain Helm chart (StatefulSet +
Service, NO CRD installed) so the CR list is always empty on a
converged Sovereign → canvas `vCluster N/N` chip shows `0/0` even
though Pods are Running.
Add a fallback: enumerate Namespaces carrying
`catalyst.openova.io/vcluster-role` label (stamped by
bp-{mgmt,dmz,rtz}-vcluster's namespace template at PR #1526).
Emits one VCluster row per labeled namespace with role = the label
value. Status `healthy` since the namespace exists (operator-visible
Pod state is surfaced elsewhere).
Caught on t129 (6cddff7ef4432bdc, 2026-05-16) — D15.
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
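The namespace-label fallback for vClusters can be sketched as a pure function over already-listed objects. The label key and the `healthy` status come from the commit message; the `Namespace` and `VCluster` shapes, and the idea of passing the CR list in as a slice, are simplifications assumed for illustration.

```go
package main

import (
	"fmt"
	"sort"
)

const vclusterRoleLabel = "catalyst.openova.io/vcluster-role"

// Namespace and VCluster are hypothetical, trimmed-down shapes.
type Namespace struct {
	Name   string
	Labels map[string]string
}

type VCluster struct {
	Name, Role, Status string
}

// vclustersFromNamespaces emits one row per namespace carrying the role label.
func vclustersFromNamespaces(nss []Namespace) []VCluster {
	var out []VCluster
	for _, ns := range nss {
		role := ns.Labels[vclusterRoleLabel]
		if role == "" {
			continue
		}
		out = append(out, VCluster{Name: ns.Name, Role: role, Status: "healthy"})
	}
	sort.Slice(out, func(i, j int) bool { return out[i].Name < out[j].Name })
	return out
}

// loadVClusters prefers CR-backed rows and falls back to labeled
// namespaces when the CR list is empty (for example, no CRD installed).
func loadVClusters(fromCRs []VCluster, nss []Namespace) []VCluster {
	if len(fromCRs) > 0 {
		return fromCRs
	}
	return vclustersFromNamespaces(nss)
}

func main() {
	nss := []Namespace{
		{Name: "rtz", Labels: map[string]string{vclusterRoleLabel: "rtz"}},
		{Name: "kube-system"},
		{Name: "dmz", Labels: map[string]string{vclusterRoleLabel: "dmz"}},
	}
	fmt.Println(loadVClusters(nil, nss))
}
```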
|
|
5b69247135
|
fix(clustermesh): secondary cluster name match tofu scheme (D11) (#1540)
Tofu's `secondary_region_cluster_mesh_name` local at infra/hetzner/main.tf:389 generates secondary names as `<sovereign-stem>-<region-stem-no-digits>` (e.g. `t129-nbg`, `t129-sin`). The bootstrap-kit slot 01-cilium.yaml renders cilium-config cluster.name from this value via the CLUSTER_MESH_NAME envsubst.
The orchestrator's clusterName derivation was wrong: it appended `-<region-key>` to the primary's name (e.g. `t129-mesh-nbg1-1`), which matched NEITHER the tofu scheme NOR the cilium-config value.
Caught on t129 (6cddff7ef4432bdc, 2026-05-16): TLS, etcd RBAC, and connection all working after PRs #1530, #1536, #1538, #1539 — but agent reported `failed to retrieve cluster configuration: not found` for every secondary peer because it queried `cilium/cluster-config/v1/t129-mesh-nbg1-1` against an etcd that only had `t129-nbg`.
Fix: export `DeriveSecondaryClusterMeshName(req, rs)` that mirrors tofu's local exactly, plus a `stripTrailingDigits` helper. Orchestrator's buildRegionSlots uses this for secondaries; primary keeps the `<stem>-mesh` shape.
Closes D11 incident chain: #1525 → #1528 → #1530 → #1536 → #1538 → #1539 → this. With this PR landed t129's secondary→primary connection already works (verified on live cluster — secondary agents show "ready, 2 nodes, 113 endpoints, 326 identities"); primary→secondary will work on a fresh prov once the name match is correct from the start.
Refs DoD D11.
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
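The naming scheme from #1540 reduces to a small string transformation. The sketch below takes the Sovereign stem and the cloud region as plain strings, which is an assumption: the real `DeriveSecondaryClusterMeshName(req, rs)` takes request and region-spec values whose fields are not shown in the commit message.

```go
package main

import (
	"fmt"
	"strings"
)

// stripTrailingDigits drops any run of digits at the end of s ("nbg1" -> "nbg").
func stripTrailingDigits(s string) string {
	return strings.TrimRight(s, "0123456789")
}

// secondaryMeshName follows `<sovereign-stem>-<region-stem-no-digits>`.
// Plain string inputs are an assumption made to keep the sketch small.
func secondaryMeshName(sovereignStem, cloudRegion string) string {
	return sovereignStem + "-" + stripTrailingDigits(cloudRegion)
}

// primaryMeshName keeps the `<stem>-mesh` shape used for the primary.
func primaryMeshName(sovereignStem string) string {
	return sovereignStem + "-mesh"
}

func main() {
	fmt.Println(primaryMeshName("t129"))
	fmt.Println(secondaryMeshName("t129", "nbg1"))
	fmt.Println(secondaryMeshName("t129", "sin"))
}
```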
|
|
d0fd32dc04
|
fix(clustermesh): use peer's clustermesh-apiserver-remote-cert (D11) (#1539)
The orchestrator was minting a fresh client cert (CN = local cluster name) for each peer connection. Even with PR #1530's "sign with peer's CA" fix the TLS handshake succeeded but etcd RBAC rejected:
error="etcdserver: permission denied"
Cilium's clustermesh-apiserver etcd has RBAC with a `remote` user that has read access on the cilium/* prefix. The chart generates `kube-system/clustermesh-apiserver-remote-cert` with CN=`remote`. Canonical `cilium clustermesh connect` CLI copies THIS Secret's tls.crt/tls.key as the client cert the REMOTE cluster presents — matches the etcd RBAC user verbatim.
This PR adopts that pattern: snapshotRemoteCert() reads the peer's existing `clustermesh-apiserver-remote-cert` Secret, returns tls.crt + tls.key bytes, and the orchestrator writes them into A's `cilium-clustermesh` Secret instead of minting.
Caught on t129 (6cddff7ef4432bdc, 2026-05-16):
- TLS handshake succeeded after firewall fix (PR #1538) opened NodePort range so LB→backend health check passed
- cilium-dbg status reported `etcd: 1/1 connected, has-quorum=true` (TLS path working)
- BUT `remote configuration: expected=true, retrieved=false` and agent logs spammed `etcdserver: permission denied`
With this PR's CN=remote cert, etcd authorizes the kvstore List and clustermesh sync completes — agent should flip to `2/2 remote clusters ready`.
Completes the D11 chain: #1525 (regionKeyFromSpec) → #1528 (clusterName derivation) → #1530 (cert with peer's CA — no longer needed but kept as defense-in-depth) → #1536 (hostAlias pattern) → #1538 (firewall NodePort range) → this.
Refs DoD D11.
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
83d771dee9
|
fix(clustermesh): hostAlias pattern — endpoint hostname + DS patch (D11) (#1536)
Cilium clustermesh-apiserver server cert has SANs:
*.mesh.cilium.io, clustermesh-apiserver.kube-system.svc,
127.0.0.1, ::1
No public LB IP SAN. When the orchestrator wrote the peer config blob
with `endpoints: - https://<lb-ip>:2379`, TLS handshake from the
agent failed at hostname verification — `cilium-dbg status --verbose`
reported `0/N remote clusters ready, Waiting for initial connection`.
This PR adopts the canonical Cilium clustermesh hostAlias pattern
(same shape as `cilium clustermesh connect` CLI):
1. buildPeerConfigBlob now writes the endpoint as
`https://<peer>.mesh.cilium.io:2379` — matching the apiserver
server cert's `*.mesh.cilium.io` wildcard SAN.
2. New patchCiliumHostAliases adds one hostAliases entry per peer
to the cilium DaemonSet's pod spec:
- ip: <peer-LB-IP>
hostnames: ["<peer>.mesh.cilium.io"]
So the agent resolves the hostname to the public LB IP at
connect-time. Strategic-merge patch: idempotent re-runs replace
the whole list with the current peer set.
3. Orchestrator step 3 calls patchCiliumHostAliases for each
region's local cilium DaemonSet right before the rollout-restart
of cilium / cilium-operator / clustermesh-apiserver, so the new
pod spec is in effect when the agents come back up.
Caught on t128 (9680edbdce8fefe8, 2026-05-16) — same incident
chain as PRs #1525/#1528/#1530. With this PR landed AND the
existing PR #1530 (cert signed by peer's CA), agents should
flip to `2/2 remote clusters ready` on the next prov.
Refs DoD D11.
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
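Steps 1 and 2 of the hostAlias pattern (hostname-based endpoint plus a DaemonSet patch) can be sketched with the standard library alone. The `Peer` type, function names, and example values are hypothetical; only the `<peer>.mesh.cilium.io:2379` endpoint shape and the `ip`/`hostnames` entry layout come from the commit message.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Peer is a hypothetical pairing of a peer cluster name and its LB IP.
type Peer struct {
	Name string
	LBIP string
}

// HostAlias mirrors the pod-spec hostAliases entry shape.
type HostAlias struct {
	IP        string   `json:"ip"`
	Hostnames []string `json:"hostnames"`
}

func peerHostname(p Peer) string { return p.Name + ".mesh.cilium.io" }

// peerEndpoint uses the hostname so it matches the wildcard SAN on the
// clustermesh-apiserver server cert instead of a bare LB IP.
func peerEndpoint(p Peer) string { return "https://" + peerHostname(p) + ":2379" }

// hostAliasesPatch builds a patch body that sets the whole hostAliases
// list to the current peer set, one entry per peer.
func hostAliasesPatch(peers []Peer) ([]byte, error) {
	aliases := make([]HostAlias, 0, len(peers))
	for _, p := range peers {
		aliases = append(aliases, HostAlias{IP: p.LBIP, Hostnames: []string{peerHostname(p)}})
	}
	patch := map[string]interface{}{
		"spec": map[string]interface{}{
			"template": map[string]interface{}{
				"spec": map[string]interface{}{"hostAliases": aliases},
			},
		},
	}
	return json.Marshal(patch)
}

func main() {
	peers := []Peer{{Name: "t128-nbg", LBIP: "203.0.113.10"}}
	body, err := hostAliasesPatch(peers)
	if err != nil {
		panic(err)
	}
	fmt.Println(peerEndpoint(peers[0]))
	fmt.Println(string(body))
}
```

Because the patch carries the full list, re-running it with the same peers produces the same body, which is what makes the re-run idempotent in this sketch.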
|
|
1f30a08ae3
|
fix(chroot): seed Request.Regions[] from SOVEREIGN_REGIONS_JSON env (D5) (#1534)
The Sovereign-side catalyst-api runs in "chroot" mode — it has no
parent prov record, so chrootEnsureDeployment synthesises a minimal
in-memory Deployment with only SovereignFQDN set. The
/infrastructure/topology loader then sees empty Request.Regions[]
and falls into the live-Nodes enumeration path (buildRegionFromLiveNodes)
which only sees THIS cluster's Node(s) → emits exactly 1 Region
even on a 3-region Sovereign. /cloud?view=graph renders as
"1 cluster 1 region" — DoD D5 failure.
Caught on t126 (84c0848406dd6fdd, 2026-05-16): operator reported
`console.t126.omani.works/cloud?view=graph` showed 1 region despite
mothership openova-flow snapshot holding all 3 regions correctly.
This PR threads the canonical multi-region RegionSpec[] from the
mothership prov body all the way to the Sovereign-side catalyst-api:
tofu var.regions
→ jsonencode → sovereign_regions_json tftpl var
→ cloud-init postBuild.substitute SOVEREIGN_REGIONS_JSON
→ bp-catalyst-platform slot 13 sovereign.regionsJson value
→ sovereign-fqdn ConfigMap key `regionsJson`
→ catalyst-api Pod env SOVEREIGN_REGIONS_JSON (valueFrom)
→ chrootEnsureDeployment parses JSON, populates Request.Regions[]
→ topology loader emits one Region per spec entry
Single-region Sovereigns: var.regions has length 1; chart writes
the array literal; chroot synth still produces 1 Region — no
regression. Empty env: chroot falls back to live-Nodes path
(legacy behavior preserved).
Refs DoD D5.
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
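The last hop of the chain above (parse the env var, fall back when it is empty) can be sketched as follows. Only the `cloudRegion` key is attested in this log; the `RegionSpec` type and `regionsFromEnv` are hypothetical names, and the real struct presumably carries more fields.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// RegionSpec is a hypothetical, single-field view of one region entry.
type RegionSpec struct {
	CloudRegion string `json:"cloudRegion"`
}

// regionsFromEnv parses the SOVEREIGN_REGIONS_JSON value. An empty value
// returns (nil, nil) so the caller can take the legacy live-Nodes path.
func regionsFromEnv(raw string) ([]RegionSpec, error) {
	if strings.TrimSpace(raw) == "" {
		return nil, nil
	}
	var regions []RegionSpec
	if err := json.Unmarshal([]byte(raw), &regions); err != nil {
		return nil, fmt.Errorf("parse SOVEREIGN_REGIONS_JSON: %w", err)
	}
	return regions, nil
}

func main() {
	good := `[{"cloudRegion":"hel1"},{"cloudRegion":"nbg1"},{"cloudRegion":"sin"}]`
	regions, err := regionsFromEnv(good)
	fmt.Println(len(regions), err)

	// Go's map-printing syntax is not JSON, so this value fails to parse.
	_, err = regionsFromEnv(`[map[cloudRegion:hel1]]`)
	fmt.Println(err != nil)
}
```

The second call shows why a value rendered in Go's `[map[...]]` syntax leaves Request.Regions empty: it is rejected by `json.Unmarshal` rather than silently misread.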
|
|
050f87e267
|
fix(purge): second name-prefix pass for CCM-named clustermesh LBs (#1532)
Caught repeatedly (t124, t125 wipes both 2026-05-16): tofu destroy left
3 orphan `<fqdn-slug>-<region>-clustermesh` LBs each cycle. Names
don't start with `catalyst-` prefix because they're named by the
Cilium chart overlay
(`clusters/_template/bootstrap-kit/01-cilium.yaml`):
load-balancer.hetzner.cloud/name:
"${SOVEREIGN_FQDN_SLUG:=catalyst}-${SOVEREIGN_REGION_KEY:=primary}-clustermesh"
The first name-prefix pass (`catalyst-<fqdn-slug>`) misses these.
tofu doesn't manage them (CCM allocated post-Phase-1). Manual API
cleanup was forced each cycle.
Fix: add a second `purgeByNamePrefix` pass with the slug-only prefix
(`<fqdn-slug>-`) so any CCM-allocated resource named with the slug
gets swept. Dedup logic in `purgeByNamePrefix` already skips names
already reported by the labelled pass, so totals stay accurate.
Refs feedback_wipe_handler_ccm_lb_orphans.md.
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
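The two-pass sweep with dedup can be modelled on a plain list of resource names. The two prefixes follow the commit message; the `purger` type, the `reported` set, and the example names are assumptions, and no cloud API is called.

```go
package main

import (
	"fmt"
	"strings"
)

// purger is a hypothetical model: resources lists what exists, reported
// tracks names an earlier pass already handled.
type purger struct {
	resources []string
	reported  map[string]bool
}

// purgeByNamePrefix returns names matching prefix that were not already
// reported, and marks them reported so totals never double count.
func (p *purger) purgeByNamePrefix(prefix string) []string {
	var purged []string
	for _, name := range p.resources {
		if !strings.HasPrefix(name, prefix) || p.reported[name] {
			continue
		}
		p.reported[name] = true
		purged = append(purged, name)
	}
	return purged
}

// purgeAll runs the original `catalyst-<slug>` pass, then the slug-only
// pass that catches CCM-named resources.
func purgeAll(p *purger, slug string) []string {
	first := p.purgeByNamePrefix("catalyst-" + slug)
	second := p.purgeByNamePrefix(slug + "-")
	return append(first, second...)
}

func main() {
	p := &purger{
		resources: []string{
			"catalyst-t125-omani-works-lb",
			"t125-omani-works-nbg1-clustermesh",
			"unrelated-lb",
		},
		reported: map[string]bool{},
	}
	fmt.Println(purgeAll(p, "t125-omani-works"))
}
```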
|
|
70d6ada703
|
fix(clustermesh): sign A's peer client cert with B's CA (not A's CA) (#1530)
Caught on t126 (84c0848406dd6fdd, 2026-05-16) after PRs #1525+#1528
unblocked peer Secret writes. Cilium agents reloaded, peer entries
present, but cilium-dbg status --verbose shows:
0/2 remote clusters ready
t126-mesh-nbg1-1: Waiting for initial connection
t126-mesh-sin-2: Waiting for initial connection
TLS probe to peer apiserver returned "unexpected eof while reading":
the mTLS handshake fails because A's client cert was signed by A's
cilium-ca. Cilium clustermesh-apiserver's trust pool is the LOCAL
cilium-ca (B's), so A's cert is rejected at the handshake.
Fix: pass b.caCert/b.caKey to mintPeerClientCert. SAN stays A's
clusterName (matches upstream `cilium clustermesh connect` CLI and
the chart's default RBAC subject authorisation).
Refs DoD D11.
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
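The trust-pool argument in #1530 can be reproduced with throwaway certificates from Go's standard library. This is a sketch under stated assumptions: `newCA`, `mintClientCert`, and `trustedBy` are hypothetical helpers (the real `mintPeerClientCert` is not shown in this log), ECDSA P-256 keys are an arbitrary choice, and the cluster name is placed in both the CN and a DNS SAN for simplicity.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"fmt"
	"math/big"
	"time"
)

// newCA creates a throwaway self-signed CA standing in for one cluster's cilium-ca.
func newCA(cn string) (*x509.Certificate, *ecdsa.PrivateKey, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, nil, err
	}
	tmpl := &x509.Certificate{
		SerialNumber:          big.NewInt(1),
		Subject:               pkix.Name{CommonName: cn},
		NotBefore:             time.Now().Add(-time.Minute),
		NotAfter:              time.Now().Add(time.Hour),
		IsCA:                  true,
		BasicConstraintsValid: true,
		KeyUsage:              x509.KeyUsageCertSign | x509.KeyUsageDigitalSignature,
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		return nil, nil, err
	}
	cert, err := x509.ParseCertificate(der)
	return cert, key, err
}

// mintClientCert issues a client cert for clusterName signed by the given CA.
func mintClientCert(clusterName string, ca *x509.Certificate, caKey *ecdsa.PrivateKey) (*x509.Certificate, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, err
	}
	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(2),
		Subject:      pkix.Name{CommonName: clusterName},
		DNSNames:     []string{clusterName},
		NotBefore:    time.Now().Add(-time.Minute),
		NotAfter:     time.Now().Add(time.Hour),
		KeyUsage:     x509.KeyUsageDigitalSignature,
		ExtKeyUsage:  []x509.ExtKeyUsage{x509.ExtKeyUsageClientAuth},
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, ca, &key.PublicKey, caKey)
	if err != nil {
		return nil, err
	}
	return x509.ParseCertificate(der)
}

// trustedBy reports whether cert verifies against a pool holding only ca,
// which models the peer apiserver trusting its LOCAL cilium-ca.
func trustedBy(cert, ca *x509.Certificate) bool {
	pool := x509.NewCertPool()
	pool.AddCert(ca)
	_, err := cert.Verify(x509.VerifyOptions{
		Roots:     pool,
		KeyUsages: []x509.ExtKeyUsage{x509.ExtKeyUsageClientAuth},
	})
	return err == nil
}

func main() {
	caA, keyA, errA := newCA("cilium-ca-a")
	caB, keyB, errB := newCA("cilium-ca-b")
	if errA != nil || errB != nil {
		panic("CA generation failed")
	}
	signedByA, _ := mintClientCert("t126-mesh", caA, keyA)
	signedByB, _ := mintClientCert("t126-mesh", caB, keyB)
	fmt.Println("signed by A, trusted by B:", trustedBy(signedByA, caB))
	fmt.Println("signed by B, trusted by B:", trustedBy(signedByB, caB))
}
```

Only the certificate signed with B's CA verifies against B's pool, which is the condition the mTLS handshake needs; the subject name is identical in both cases.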