1482 Commits

Author SHA1 Message Date
e3mrah
33ed484e04
fix(parent-domains): short-circuit pdmFlipNS when NS already matches (D30) (#1576)
* fix(cloudinit): escape ${ORG_EMAIL:-}/${ORG_NAME:-} in comment (D22)

PR #1571 added a comment mentioning the ${ORG_EMAIL:-}/${ORG_NAME:-}
slot-file placeholders WITHOUT the $$ escape. tofu's templatefile()
parses comments and tried to interpolate ${ORG_EMAIL:-} as a tofu
expression, failing with "Extra characters after interpolation
expression; Template interpolation doesn't expect a colon".

Caught live on t133 fad01d84f5655004 — tofu plan failed in 30s.

The escape pattern is documented at main.tf:1029 (the same warning
that caught t127 last week). The $$ prefix tells tofu's templatefile to
emit a literal ${...} to cloud-init for Flux envsubst.
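
The escape rule can be illustrated with a small Go sketch. It assumes a
hypothetical helper, escapeForTemplatefile, which is not part of the
repository; it only shows how doubling the sigil turns an interpolation
sequence into a literal one before the text reaches templatefile().

```go
package main

import (
	"fmt"
	"strings"
)

// escapeForTemplatefile is a hypothetical helper (not from the repository).
// It doubles the sigil of every "${" and "%{" sequence so that a renderer
// using the Terraform/OpenTofu template syntax emits them literally instead
// of trying to interpolate them.
// Note: it is not idempotent; running it twice would double-escape.
func escapeForTemplatefile(s string) string {
	r := strings.NewReplacer("${", "$${", "%{", "%%{")
	return r.Replace(s)
}

func main() {
	comment := "# slot placeholders: ${ORG_EMAIL:-}/${ORG_NAME:-}"
	fmt.Println(escapeForTemplatefile(comment))
}
```

Escaping applies to comments too, since the template parser does not
distinguish them from the rest of the file.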

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(parent-domains): short-circuit pdmFlipNS when NS already matches (D30)

When an sme-pool domain's current NS records already match the expected
[ns1.<primary>, ns2.<primary>] pair (because the operator already
delegated the domain to OpenOva's PowerDNS), the PDM registrar-flip
step is a no-op. Skipping avoids:

  1. Burning a Dynadot API credit on a flip that would be idempotent.
  2. The D30 blocker — current Dynadot creds return pdm-status-401
     even when the desired NS state already exists. Caught on t132
     2026-05-16 day-2 add + t134 2026-05-17 fresh-prov body
     parentDomains attempt.

Adds nsAlreadyMatches() helper using net.DefaultResolver.LookupNS with
a 5s timeout. False on lookup error or partial match → fall through to
the original PDM pipeline so a misconfigured/partial domain still goes
through the registrar API.
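
A minimal sketch of such a helper, under stated assumptions: the real
nsAlreadyMatches signature is not shown in this log, so the parameters
below are hypothetical, and sameNSSet is an illustrative name. The
sketch requires exact set equality; whether the real helper accepts
extra nameservers is not stated here.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"strings"
	"time"
)

// normalise lower-cases a host and drops the trailing dot that DNS
// answers usually carry.
func normalise(host string) string {
	return strings.ToLower(strings.TrimSuffix(host, "."))
}

// sameNSSet reports whether got and want name exactly the same hosts,
// ignoring case, order and trailing dots. An empty answer never matches.
func sameNSSet(got, want []string) bool {
	if len(got) == 0 || len(got) != len(want) {
		return false
	}
	pending := make(map[string]bool, len(want))
	for _, h := range want {
		pending[normalise(h)] = true
	}
	for _, h := range got {
		k := normalise(h)
		if !pending[k] {
			return false
		}
		delete(pending, k)
	}
	return len(pending) == 0
}

// nsAlreadyMatches (hypothetical signature) looks up the NS records with
// a 5s timeout and returns false on lookup error or partial match, so
// the caller falls through to the full registrar pipeline.
func nsAlreadyMatches(ctx context.Context, domain string, want []string) bool {
	ctx, cancel := context.WithTimeout(ctx, 5*time.Second)
	defer cancel()
	records, err := net.DefaultResolver.LookupNS(ctx, domain)
	if err != nil {
		return false
	}
	got := make([]string, 0, len(records))
	for _, r := range records {
		got = append(got, r.Host)
	}
	return sameNSSet(got, want)
}

func main() {
	want := []string{"ns1.example.net", "ns2.example.net"}
	// Offline demonstration of the comparison step only.
	fmt.Println(sameNSSet([]string{"NS2.example.net.", "ns1.example.net."}, want))
	fmt.Println(sameNSSet([]string{"ns1.example.net."}, want))
}
```

Keeping the comparison separate from the lookup makes the match rule
testable without network access.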

This unblocks sme-pool entries for omani.homes (already pointing at
ns1/2/3.openova.io). omani.rest / omani.trades still go through the
full flip path because their NS records don't yet match expected.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-17 03:21:42 +04:00
github-actions[bot]
a65a024114 deploy: update catalyst images to c148ec6 2026-05-16 22:33:19 +00:00
github-actions[bot]
c5f777056f deploy: update catalyst images to 3568b72 2026-05-16 22:20:19 +00:00
e3mrah
3568b72b5e
fix(cloud): hide non-active 0/0 chips (D15) (#1574)
* feat(chart): wire OPERATOR_EMAIL/CONTROL_PLANE_IP/GITOPS_REPO_URL/ORG_NAME (D22)

Companion to PR #1567 + #1568 — wire the env vars chrootEnsureDeployment
reads to populate the deployment record so Sovereign Console Settings
page renders real values for ownerEmail, controlPlaneIP, gitopsRepoURL,
orgName (instead of `—` placeholders).

Adds 4 new keys to the sovereign-fqdn ConfigMap (orgEmail, orgName,
controlPlaneIP, gitopsRepoURL) sourced from .Values.sovereign.* with
empty defaults. Per-Sovereign overlays wire actual values from cloud-
init substitute placeholders (mirrors regionsJson pattern).

Catalyst-api Pod now reads them via valueFrom configMapKeyRef +
optional=true (Catalyst-Zero/contabo emits no sovereign-fqdn ConfigMap
so env stays empty there — correct, mothership is signer not validator).

Validated: t132 already serves region=hel1, consoleURL, loadBalancerIP
post-#1568. This PR fills the remaining 3 D22 fields when operator wires
the values.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(slot-13): add D22 sovereign-side identity placeholders

Add ${ORG_EMAIL:-} + ${ORG_NAME:-} + ${SOVEREIGN_CONTROL_PLANE_IP:-} +
${GITOPS_REPO_URL:-} envsubst placeholders so when cloud-init wires
them, the chart picks them up via sovereign-fqdn ConfigMap (PR #1569)
→ catalyst-api env → chrootEnsureDeployment populates the deployment
record → Settings page renders real values instead of `—`.

This PR alone is a no-op (placeholders default to empty, same as today).
The cloud-init substitute lines + provisioner.go tfvars need to land in
a companion PR to actually populate the values on next-prov.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(cloudinit): wire ORG_EMAIL/ORG_NAME/GITOPS_REPO_URL substitutes (D22)

Companion to #1567+#1568+#1569+#1570 — the cloud-init substitute block
now emits ORG_EMAIL/ORG_NAME/GITOPS_REPO_URL into the bootstrap-kit
Kustomization's postBuild.substitute env, which the slot-13 placeholders
(#1570) consume via ${ORG_EMAIL:-}/${ORG_NAME:-}/${GITOPS_REPO_URL:-}.

Chain: provisioner.go writeTfvars → tofu vars → cloudinit templatefile
substitute → Flux Kustomization postBuild → sovereign-fqdn ConfigMap
keys (#1569) → catalyst-api env (#1569) → chrootEnsureDeployment
populates the deployment record (#1567 + #1568 fallback).

SOVEREIGN_CONTROL_PLANE_IP omitted intentionally — main.tf:691 notes
the dependency cycle (hcloud_server.cp doesn't exist at cloudinit
render time). Separate PR will source it via metadata-service or
post-create ConfigMap patch.

Next-prov (t133+) Sovereign Console Settings page now renders real
ownerEmail/orgName/gitopsRepoURL instead of `—` placeholders.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(router): chroot /app/<name> only-redirect mothership-only sub-paths (D17/D17b)

PR #1552 stripped the `/app` prefix on Sovereign mode to make
`/app/bp-cnpg` → `/bp-cnpg`, hoping consoleAppDetailRoute would match.
But consoleAppDetailRoute is registered at `/app/$componentId` under
consoleLayoutRoute — no chroot route matches `/<componentId>` directly,
so stripping leaves an empty render path. Playwright walkthrough on
t132 2026-05-17 confirmed: /app/bp-cnpg + /app/bp-coraza both render
body_len=9 (empty).

Invert the logic: only redirect mothership-only sub-paths (/dashboard
Fleet view, /install wizard, /sre, /sec, /blueprints) which have no
Sovereign Console equivalent. For everything else (component names like
`/app/bp-cnpg`, bare `/app`), let TanStack's natural most-specific-match
pick consoleAppDetailRoute / consoleAppsRoute.
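
The inverted rule can be sketched in Go, although the real router is
TanStack-based front-end code. Everything here is an assumption for
illustration: the function name isMothershipOnly is hypothetical, and
the sketch assumes the mothership-only sub-paths live under /app.

```go
package main

import (
	"fmt"
	"strings"
)

// Sub-paths listed in the commit message as having no Sovereign Console
// equivalent. Assumed here to sit directly under "/app".
var mothershipOnly = []string{"/dashboard", "/install", "/sre", "/sec", "/blueprints"}

// isMothershipOnly (hypothetical) reports whether path should be
// redirected on a Sovereign. Every other path is left to the router's
// normal most-specific-match resolution.
func isMothershipOnly(path string) bool {
	if !strings.HasPrefix(path, "/app") {
		return false
	}
	sub := strings.TrimPrefix(path, "/app")
	for _, p := range mothershipOnly {
		// Match the sub-path itself or anything nested below it, but not
		// a longer name that merely shares the prefix (e.g. "/secrets").
		if sub == p || strings.HasPrefix(sub, p+"/") {
			return true
		}
	}
	return false
}

func main() {
	for _, p := range []string{"/app/bp-cnpg", "/app", "/app/sre", "/app/install/step-2"} {
		fmt.Println(p, isMothershipOnly(p))
	}
}
```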

Caught live on t132 via Playwright walker3.js — agent a4825c5a.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(handover): re-mint handover JWT on every GetDeployment (D0)

D0 Playwright walkthrough on t132 2026-05-17 caught: handoverURL
persisted at handover-fire time carries a JWT that expires per
DefaultTTL (5min). Operators who click /jobs hours later get the stale
token → Sovereign-side /auth/handover rejects with raw JSON
{"error":"invalid token"} — no UI fallback, no /auth/handover-error,
auto-redirect to /dashboard never fires.

Re-mint the JWT on every GetDeployment when deployment is ready +
handover-fired so the URL returned to the wizard is always
freshly-signed.

Best-effort: on mint failure, leave the existing URL in place so a
transient signer error doesn't break polling. Helper is idempotent +
locked.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(cloud): hide non-active 0/0 chips (D15)

Playwright walkthrough on t132 2026-05-17 caught D15 PARTIAL: 15 chips
are correct but Bucket+Volume show 0/0. Founder rule (DoD D15):
"No kind chip shows 0/0 for a resource that actually exists in the
cluster". Bucket+Volume genuinely don't exist on this Sovereign so
showing 0/0 is noise.

Hide chips with count exactly 0 unless they're the active selection
(operator who navigated to an empty kind keeps context).
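
The filter rule can be sketched as follows. The Chip type and the
visibleChips name are hypothetical, and the sketch is in Go for
illustration only; the actual change is in the console UI.

```go
package main

import "fmt"

// Chip is a hypothetical view model for one resource-kind chip.
type Chip struct {
	Kind  string
	Ready int
	Total int
}

// visibleChips drops chips whose count is exactly 0 unless the chip is
// the active selection, so an operator viewing an empty kind keeps
// context.
func visibleChips(chips []Chip, active string) []Chip {
	out := make([]Chip, 0, len(chips))
	for _, c := range chips {
		if c.Total == 0 && c.Kind != active {
			continue
		}
		out = append(out, c)
	}
	return out
}

func main() {
	chips := []Chip{{"Pod", 12, 12}, {"Bucket", 0, 0}, {"Volume", 0, 0}}
	fmt.Println(len(visibleChips(chips, "")))       // Bucket and Volume hidden
	fmt.Println(len(visibleChips(chips, "Bucket"))) // active Bucket kept
}
```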

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-17 02:18:24 +04:00
github-actions[bot]
44e612f39d deploy: update catalyst images to 58dbb92 2026-05-16 22:18:16 +00:00
e3mrah
58dbb92f4f
fix(handover): re-mint handover JWT on every GetDeployment (D0) (#1573)
* feat(chart): wire OPERATOR_EMAIL/CONTROL_PLANE_IP/GITOPS_REPO_URL/ORG_NAME (D22)

Companion to PR #1567 + #1568 — wire the env vars chrootEnsureDeployment
reads to populate the deployment record so Sovereign Console Settings
page renders real values for ownerEmail, controlPlaneIP, gitopsRepoURL,
orgName (instead of `—` placeholders).

Adds 4 new keys to the sovereign-fqdn ConfigMap (orgEmail, orgName,
controlPlaneIP, gitopsRepoURL) sourced from .Values.sovereign.* with
empty defaults. Per-Sovereign overlays wire actual values from cloud-
init substitute placeholders (mirrors regionsJson pattern).

Catalyst-api Pod now reads them via valueFrom configMapKeyRef +
optional=true (Catalyst-Zero/contabo emits no sovereign-fqdn ConfigMap
so env stays empty there — correct, mothership is signer not validator).

Validated: t132 already serves region=hel1, consoleURL, loadBalancerIP
post-#1568. This PR fills the remaining 3 D22 fields when operator wires
the values.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(slot-13): add D22 sovereign-side identity placeholders

Add ${ORG_EMAIL:-} + ${ORG_NAME:-} + ${SOVEREIGN_CONTROL_PLANE_IP:-} +
${GITOPS_REPO_URL:-} envsubst placeholders so when cloud-init wires
them, the chart picks them up via sovereign-fqdn ConfigMap (PR #1569)
→ catalyst-api env → chrootEnsureDeployment populates the deployment
record → Settings page renders real values instead of `—`.

This PR alone is a no-op (placeholders default to empty, same as today).
The cloud-init substitute lines + provisioner.go tfvars need to land in
a companion PR to actually populate the values on next-prov.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(cloudinit): wire ORG_EMAIL/ORG_NAME/GITOPS_REPO_URL substitutes (D22)

Companion to #1567+#1568+#1569+#1570 — the cloud-init substitute block
now emits ORG_EMAIL/ORG_NAME/GITOPS_REPO_URL into the bootstrap-kit
Kustomization's postBuild.substitute env, which the slot-13 placeholders
(#1570) consume via ${ORG_EMAIL:-}/${ORG_NAME:-}/${GITOPS_REPO_URL:-}.

Chain: provisioner.go writeTfvars → tofu vars → cloudinit templatefile
substitute → Flux Kustomization postBuild → sovereign-fqdn ConfigMap
keys (#1569) → catalyst-api env (#1569) → chrootEnsureDeployment
populates the deployment record (#1567 + #1568 fallback).

SOVEREIGN_CONTROL_PLANE_IP omitted intentionally — main.tf:691 notes
the dependency cycle (hcloud_server.cp doesn't exist at cloudinit
render time). Separate PR will source it via metadata-service or
post-create ConfigMap patch.

Next-prov (t133+) Sovereign Console Settings page now renders real
ownerEmail/orgName/gitopsRepoURL instead of `—` placeholders.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(router): chroot /app/<name> only-redirect mothership-only sub-paths (D17/D17b)

PR #1552 stripped the `/app` prefix on Sovereign mode to make
`/app/bp-cnpg` → `/bp-cnpg`, hoping consoleAppDetailRoute would match.
But consoleAppDetailRoute is registered at `/app/$componentId` under
consoleLayoutRoute — no chroot route matches `/<componentId>` directly,
so stripping leaves an empty render path. Playwright walkthrough on
t132 2026-05-17 confirmed: /app/bp-cnpg + /app/bp-coraza both render
body_len=9 (empty).

Invert the logic: only redirect mothership-only sub-paths (/dashboard
Fleet view, /install wizard, /sre, /sec, /blueprints) which have no
Sovereign Console equivalent. For everything else (component names like
`/app/bp-cnpg`, bare `/app`), let TanStack's natural most-specific-match
pick consoleAppDetailRoute / consoleAppsRoute.

Caught live on t132 via Playwright walker3.js — agent a4825c5a.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(handover): re-mint handover JWT on every GetDeployment (D0)

D0 Playwright walkthrough on t132 2026-05-17 caught: handoverURL
persisted at handover-fire time carries a JWT that expires per
DefaultTTL (5min). Operators who click /jobs hours later get the stale
token → Sovereign-side /auth/handover rejects with raw JSON
{"error":"invalid token"} — no UI fallback, no /auth/handover-error,
auto-redirect to /dashboard never fires.

Re-mint the JWT on every GetDeployment when deployment is ready +
handover-fired so the URL returned to the wizard is always
freshly-signed.

Best-effort: on mint failure, leave the existing URL in place so a
transient signer error doesn't break polling. Helper is idempotent +
locked.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-17 02:16:26 +04:00
github-actions[bot]
dea683f5e4 deploy: update catalyst images to 9e1e422 2026-05-16 22:08:01 +00:00
e3mrah
9e1e4224d8
fix(router): chroot /app/<name> only-redirect mothership-only sub-paths (D17/D17b) (#1572)
* feat(chart): wire OPERATOR_EMAIL/CONTROL_PLANE_IP/GITOPS_REPO_URL/ORG_NAME (D22)

Companion to PR #1567 + #1568 — wire the env vars chrootEnsureDeployment
reads to populate the deployment record so Sovereign Console Settings
page renders real values for ownerEmail, controlPlaneIP, gitopsRepoURL,
orgName (instead of `—` placeholders).

Adds 4 new keys to the sovereign-fqdn ConfigMap (orgEmail, orgName,
controlPlaneIP, gitopsRepoURL) sourced from .Values.sovereign.* with
empty defaults. Per-Sovereign overlays wire actual values from cloud-
init substitute placeholders (mirrors regionsJson pattern).

Catalyst-api Pod now reads them via valueFrom configMapKeyRef +
optional=true (Catalyst-Zero/contabo emits no sovereign-fqdn ConfigMap
so env stays empty there — correct, mothership is signer not validator).

Validated: t132 already serves region=hel1, consoleURL, loadBalancerIP
post-#1568. This PR fills the remaining 3 D22 fields when operator wires
the values.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(slot-13): add D22 sovereign-side identity placeholders

Add ${ORG_EMAIL:-} + ${ORG_NAME:-} + ${SOVEREIGN_CONTROL_PLANE_IP:-} +
${GITOPS_REPO_URL:-} envsubst placeholders so when cloud-init wires
them, the chart picks them up via sovereign-fqdn ConfigMap (PR #1569)
→ catalyst-api env → chrootEnsureDeployment populates the deployment
record → Settings page renders real values instead of `—`.

This PR alone is a no-op (placeholders default to empty, same as today).
The cloud-init substitute lines + provisioner.go tfvars need to land in
a companion PR to actually populate the values on next-prov.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(cloudinit): wire ORG_EMAIL/ORG_NAME/GITOPS_REPO_URL substitutes (D22)

Companion to #1567+#1568+#1569+#1570 — the cloud-init substitute block
now emits ORG_EMAIL/ORG_NAME/GITOPS_REPO_URL into the bootstrap-kit
Kustomization's postBuild.substitute env, which the slot-13 placeholders
(#1570) consume via ${ORG_EMAIL:-}/${ORG_NAME:-}/${GITOPS_REPO_URL:-}.

Chain: provisioner.go writeTfvars → tofu vars → cloudinit templatefile
substitute → Flux Kustomization postBuild → sovereign-fqdn ConfigMap
keys (#1569) → catalyst-api env (#1569) → chrootEnsureDeployment
populates the deployment record (#1567 + #1568 fallback).

SOVEREIGN_CONTROL_PLANE_IP omitted intentionally — main.tf:691 notes
the dependency cycle (hcloud_server.cp doesn't exist at cloudinit
render time). Separate PR will source it via metadata-service or
post-create ConfigMap patch.

Next-prov (t133+) Sovereign Console Settings page now renders real
ownerEmail/orgName/gitopsRepoURL instead of `—` placeholders.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(router): chroot /app/<name> only-redirect mothership-only sub-paths (D17/D17b)

PR #1552 stripped the `/app` prefix on Sovereign mode to make
`/app/bp-cnpg` → `/bp-cnpg`, hoping consoleAppDetailRoute would match.
But consoleAppDetailRoute is registered at `/app/$componentId` under
consoleLayoutRoute — no chroot route matches `/<componentId>` directly,
so stripping leaves an empty render path. Playwright walkthrough on
t132 2026-05-17 confirmed: /app/bp-cnpg + /app/bp-coraza both render
body_len=9 (empty).

Invert the logic: only redirect mothership-only sub-paths (/dashboard
Fleet view, /install wizard, /sre, /sec, /blueprints) which have no
Sovereign Console equivalent. For everything else (component names like
`/app/bp-cnpg`, bare `/app`), let TanStack's natural most-specific-match
pick consoleAppDetailRoute / consoleAppsRoute.

Caught live on t132 via Playwright walker3.js — agent a4825c5a.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-17 02:05:54 +04:00
github-actions[bot]
4cc880cafd deploy: update catalyst images to 5793958 2026-05-16 21:48:54 +00:00
github-actions[bot]
df193d340e deploy: update catalyst images to 9cbcd23 2026-05-16 21:03:01 +00:00
e3mrah
9cbcd230da
feat(chart): wire OPERATOR_EMAIL/CONTROL_PLANE_IP/GITOPS_REPO_URL/ORG_NAME (D22) (#1569)
Companion to PR #1567 + #1568 — wire the env vars chrootEnsureDeployment
reads to populate the deployment record so Sovereign Console Settings
page renders real values for ownerEmail, controlPlaneIP, gitopsRepoURL,
orgName (instead of `—` placeholders).

Adds 4 new keys to the sovereign-fqdn ConfigMap (orgEmail, orgName,
controlPlaneIP, gitopsRepoURL) sourced from .Values.sovereign.* with
empty defaults. Per-Sovereign overlays wire actual values from cloud-
init substitute placeholders (mirrors regionsJson pattern).

Catalyst-api Pod now reads them via valueFrom configMapKeyRef +
optional=true (Catalyst-Zero/contabo emits no sovereign-fqdn ConfigMap
so env stays empty there — correct, mothership is signer not validator).

Validated: t132 already serves region=hel1, consoleURL, loadBalancerIP
post-#1568. This PR fills the remaining 3 D22 fields when operator wires
the values.

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-17 01:01:00 +04:00
github-actions[bot]
0e0280bbe0 deploy: update catalyst images to 6618392 2026-05-16 20:56:10 +00:00
e3mrah
6618392407
fix(chroot): GetDeployment falls back to chrootEnsureDeployment (D22) (#1568)
* feat(handover): auto-seed owner UserAccess CR on chroot (D21)

Closes the D21 gap on Sovereign DoD: /users page returned empty after
fresh handover because Keycloak `sovereign-admins` membership was
established but no UserAccess CR existed for the operator.

After `keycloak.EnsureUser` succeeds in `AuthHandover`, the helper
`EnsureOwnerUserAccess` upserts a cluster-scoped UserAccess CR shaped
like the canonical user_access.go `CreateUserAccess` write:

  apiVersion: access.openova.io/v1alpha1
  kind: UserAccess
  metadata:
    name: useraccess-owner-<sanitized-email>
    annotations:
      catalyst.openova.io/user-email: <email>   # rbac_matrix:309 hint
  spec:
    user:
      keycloakSubject: <email>
    sovereignRef: <fqdn-first-label>
    applications:
      - app: "*"
        role: admin                              # owner -> admin

The Composition (issue #322) reconciles the Claim into per-app
RoleBindings on the Sovereign so the operator surfaces in /users.

Best-effort + idempotent: AlreadyExists on the second handover is
folded to nil; any other error is logged at Warn and the handover
itself never fails. If the access.openova.io CRD has not rolled yet,
the next handover retries automatically.

Architect-first: mirrors `userAccessToUnstructured` shape and uses
existing `sovereignDynamicClient` + `rbacAssignSlug` seams. Tier
mapping follows the documented lossy `owner -> admin` rule in
`userAccessTierToRole` (CRD only accepts admin|editor|viewer).

Refs: docs/SOVEREIGN-MULTI-REGION-DOD.md D21

* chore(slot-13): pin bp-catalyst-platform to 1.4.147 (D21+D31 baked)

PR #1562 (D31 wordpress-tenant activeHotStandby) + PR #1564 (D21 owner
UserAccess auto-seed at handover, catalyst-api:8d2a947) both packaged
into chart 1.4.147. Pin slot so t133+ gets both gates on first prov.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(chart): regionsJson uses toJson to defeat YAML flow-seq re-parse (D5)

PR #1551 single-quoted SOVEREIGN_REGIONS_JSON in the slot file
substitute, but Flux Kustomize's postBuild can still re-parse the
JSON-shaped string as a YAML flow-sequence depending on quoting context.
When that happens .Values.sovereign.regionsJson is a Go []interface{}
of map[interface{}]interface{} and `| quote` prints Go's
`[map[cloudRegion:hel1 ...]]` syntax — catalyst-api's json.Unmarshal of
the env var then fails and Request.Regions is empty.

toJson normalises both string and list inputs to valid JSON.
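
To illustrate the failure mode and the idea of normalising both input
shapes, here is a Go sketch. It is not Helm's toJson implementation;
toJSONArray is a hypothetical function, and the list input uses
map[string]interface{} because encoding/json cannot marshal
map[interface{}]interface{}.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// toJSONArray (hypothetical) accepts either a string that already holds
// JSON or a decoded list, and returns valid JSON text in both cases.
func toJSONArray(v interface{}) (string, error) {
	if s, ok := v.(string); ok {
		if !json.Valid([]byte(s)) {
			return "", fmt.Errorf("not valid JSON: %q", s)
		}
		return s, nil
	}
	b, err := json.Marshal(v)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	// Shape after the value has been re-parsed into a list of maps.
	parsed := []interface{}{map[string]interface{}{"cloudRegion": "hel1"}}

	// Default Go formatting is what breaks json.Unmarshal downstream.
	fmt.Println(fmt.Sprint(parsed)) // [map[cloudRegion:hel1]]

	fromList, _ := toJSONArray(parsed)
	fromString, _ := toJSONArray(`[{"cloudRegion":"hel1"}]`)
	fmt.Println(fromList)
	fmt.Println(fromString)
}
```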

Caught live on t132 2026-05-16 chart 1.4.147: env var rendered as
`[map[cloudRegion:hel1 ...]]` despite #1551 being in effect.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(chroot): populate deployment Result + Request fields for D22

Settings page on Sovereign Console renders `—` for Region / Sovereign /
Created / DeploymentID / Pool subdomain because chroot's GET
/api/v1/deployments/<id> returns empty strings for those fields.

Populate from existing env vars (best-effort — empty when chart hasn't
wired them yet, which is no worse than today's behaviour):
- Result.ConsoleURL = "https://console.<fqdn>" (derived from selfFQDN)
- Result.GitOpsRepoURL from GITOPS_REPO_URL env
- Result.ControlPlaneIP from SOVEREIGN_CONTROL_PLANE_IP env
- Request.Region = regions[0].CloudRegion (top-level legacy field)
- Request.OrgEmail from OPERATOR_EMAIL env
- Request.OrgName from ORG_NAME env
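
A sketch of the mapping listed above, assuming trimmed-down hypothetical
structs. The env var names come from the list; the populate function,
its signature and the injected getenv are illustrative only.

```go
package main

import (
	"fmt"
	"os"
)

// Hypothetical, trimmed-down shapes of the deployment record.
type Result struct{ ConsoleURL, GitOpsRepoURL, ControlPlaneIP string }
type Request struct{ Region, OrgEmail, OrgName string }
type Region struct{ CloudRegion string }

// populate fills the fields best-effort. Unset variables yield empty
// strings, which matches the previous behaviour. getenv is injected so
// the mapping can be exercised without touching the process environment.
func populate(selfFQDN string, regions []Region, getenv func(string) string) (Result, Request) {
	var res Result
	var req Request
	if selfFQDN != "" {
		res.ConsoleURL = "https://console." + selfFQDN
	}
	res.GitOpsRepoURL = getenv("GITOPS_REPO_URL")
	res.ControlPlaneIP = getenv("SOVEREIGN_CONTROL_PLANE_IP")
	if len(regions) > 0 {
		req.Region = regions[0].CloudRegion // top-level legacy field
	}
	req.OrgEmail = getenv("OPERATOR_EMAIL")
	req.OrgName = getenv("ORG_NAME")
	return res, req
}

func main() {
	res, req := populate("t132.example.test", []Region{{CloudRegion: "hel1"}}, os.Getenv)
	fmt.Printf("%+v\n%+v\n", res, req)
}
```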

Companion chart PR will wire the env vars from .Values.global.* +
cloud-init substitute placeholders. This PR is BACKWARD-compatible —
unset env vars produce empty strings, same as today.

Caught live on t132 2026-05-16 — `curl /api/v1/deployments/sovereign-
t132.omani.works` returns empty ownerEmail/region/consoleURL.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(chroot): GetDeployment falls back to chrootEnsureDeployment (D22)

GetDeployment was the only handler that returned 404 without calling
chrootEnsureDeployment. After a catalyst-api Pod restart on the chroot
the in-memory store is empty until some other handler (StreamLogs,
jobs list) primes it via its own synth call — meanwhile the Sovereign
Console Settings page loads /api/v1/deployments/<id> first and gets
404, rendering the entire page broken.

Mirror the StreamLogs pattern (lines 1247-1254): try in-memory load,
fall through to chrootEnsureDeployment, return 404 only when both miss.
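
The lookup order can be sketched as below. The handler type, its
fields and the ensure callback (standing in for
chrootEnsureDeployment) are hypothetical simplifications.

```go
package main

import (
	"fmt"
	"net/http"
	"sync"
)

// Deployment is a hypothetical, trimmed-down record.
type Deployment struct{ ID string }

type handler struct {
	mu    sync.Mutex
	store map[string]*Deployment
	// ensure stands in for chrootEnsureDeployment: it synthesises the
	// record when the in-memory store has been emptied by a restart.
	ensure func(id string) (*Deployment, bool)
}

// getDeployment tries the in-memory store first, then the fallback, and
// reports 404 only when both miss.
func (h *handler) getDeployment(id string) (*Deployment, int) {
	h.mu.Lock()
	defer h.mu.Unlock()
	if d, ok := h.store[id]; ok {
		return d, http.StatusOK
	}
	if d, ok := h.ensure(id); ok {
		h.store[id] = d // prime the store for later requests
		return d, http.StatusOK
	}
	return nil, http.StatusNotFound
}

func main() {
	h := &handler{
		store: map[string]*Deployment{},
		ensure: func(id string) (*Deployment, bool) {
			if id == "sovereign-demo" {
				return &Deployment{ID: id}, true
			}
			return nil, false
		},
	}
	_, code := h.getDeployment("sovereign-demo")
	fmt.Println(code)
	_, code = h.getDeployment("unknown")
	fmt.Println(code)
}
```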

This unblocks PR #1567's deployment-record population — without the
fallback, GetDeployment can never serve the populated record on chroot.

Caught live on t132 2026-05-16 after #1567 image roll: Settings page
404 because in-memory store was empty.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-17 00:54:20 +04:00
github-actions[bot]
b094a354b7 deploy: update catalyst images to ed63ecd 2026-05-16 20:31:39 +00:00
e3mrah
ed63ecd09f
fix(chroot): populate deployment Result + Request fields for D22 settings (#1567)
* feat(handover): auto-seed owner UserAccess CR on chroot (D21)

Closes the D21 gap on Sovereign DoD: /users page returned empty after
fresh handover because Keycloak `sovereign-admins` membership was
established but no UserAccess CR existed for the operator.

After `keycloak.EnsureUser` succeeds in `AuthHandover`, the helper
`EnsureOwnerUserAccess` upserts a cluster-scoped UserAccess CR shaped
like the canonical user_access.go `CreateUserAccess` write:

  apiVersion: access.openova.io/v1alpha1
  kind: UserAccess
  metadata:
    name: useraccess-owner-<sanitized-email>
    annotations:
      catalyst.openova.io/user-email: <email>   # rbac_matrix:309 hint
  spec:
    user:
      keycloakSubject: <email>
    sovereignRef: <fqdn-first-label>
    applications:
      - app: "*"
        role: admin                              # owner -> admin

The Composition (issue #322) reconciles the Claim into per-app
RoleBindings on the Sovereign so the operator surfaces in /users.

Best-effort + idempotent: AlreadyExists on the second handover is
folded to nil; any other error is logged at Warn and the handover
itself never fails. If the access.openova.io CRD has not rolled yet,
the next handover retries automatically.

Architect-first: mirrors `userAccessToUnstructured` shape and uses
existing `sovereignDynamicClient` + `rbacAssignSlug` seams. Tier
mapping follows the documented lossy `owner -> admin` rule in
`userAccessTierToRole` (CRD only accepts admin|editor|viewer).

Refs: docs/SOVEREIGN-MULTI-REGION-DOD.md D21

* chore(slot-13): pin bp-catalyst-platform to 1.4.147 (D21+D31 baked)

PR #1562 (D31 wordpress-tenant activeHotStandby) + PR #1564 (D21 owner
UserAccess auto-seed at handover, catalyst-api:8d2a947) both packaged
into chart 1.4.147. Pin slot so t133+ gets both gates on first prov.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(chart): regionsJson uses toJson to defeat YAML flow-seq re-parse (D5)

PR #1551 single-quoted SOVEREIGN_REGIONS_JSON in the slot file
substitute, but Flux Kustomize's postBuild can still re-parse the
JSON-shaped string as a YAML flow-sequence depending on quoting context.
When that happens .Values.sovereign.regionsJson is a Go []interface{}
of map[interface{}]interface{} and `| quote` prints Go's
`[map[cloudRegion:hel1 ...]]` syntax — catalyst-api's json.Unmarshal of
the env var then fails and Request.Regions is empty.

toJson normalises both string and list inputs to valid JSON.

Caught live on t132 2026-05-16 chart 1.4.147: env var rendered as
`[map[cloudRegion:hel1 ...]]` despite #1551 being in effect.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(chroot): populate deployment Result + Request fields for D22

Settings page on Sovereign Console renders `—` for Region / Sovereign /
Created / DeploymentID / Pool subdomain because chroot's GET
/api/v1/deployments/<id> returns empty strings for those fields.

Populate from existing env vars (best-effort — empty when chart hasn't
wired them yet, which is no worse than today's behaviour):
- Result.ConsoleURL = "https://console.<fqdn>" (derived from selfFQDN)
- Result.GitOpsRepoURL from GITOPS_REPO_URL env
- Result.ControlPlaneIP from SOVEREIGN_CONTROL_PLANE_IP env
- Request.Region = regions[0].CloudRegion (top-level legacy field)
- Request.OrgEmail from OPERATOR_EMAIL env
- Request.OrgName from ORG_NAME env

Companion chart PR will wire the env vars from .Values.global.* +
cloud-init substitute placeholders. This PR is BACKWARD-compatible —
unset env vars produce empty strings, same as today.

Caught live on t132 2026-05-16 — `curl /api/v1/deployments/sovereign-
t132.omani.works` returns empty ownerEmail/region/consoleURL.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-17 00:29:44 +04:00
github-actions[bot]
d82e06bfe9 deploy: update catalyst images to 0a45fb0 2026-05-16 20:03:41 +00:00
e3mrah
0a45fb0449
fix(chart): regionsJson uses toJson to defeat YAML flow-seq re-parse (D5) (#1566)
* feat(handover): auto-seed owner UserAccess CR on chroot (D21)

Closes the D21 gap on Sovereign DoD: /users page returned empty after
fresh handover because Keycloak `sovereign-admins` membership was
established but no UserAccess CR existed for the operator.

After `keycloak.EnsureUser` succeeds in `AuthHandover`, the helper
`EnsureOwnerUserAccess` upserts a cluster-scoped UserAccess CR shaped
like the canonical user_access.go `CreateUserAccess` write:

  apiVersion: access.openova.io/v1alpha1
  kind: UserAccess
  metadata:
    name: useraccess-owner-<sanitized-email>
    annotations:
      catalyst.openova.io/user-email: <email>   # rbac_matrix:309 hint
  spec:
    user:
      keycloakSubject: <email>
    sovereignRef: <fqdn-first-label>
    applications:
      - app: "*"
        role: admin                              # owner -> admin

The Composition (issue #322) reconciles the Claim into per-app
RoleBindings on the Sovereign so the operator surfaces in /users.

Best-effort + idempotent: AlreadyExists on the second handover is
folded to nil; any other error is logged at Warn and the handover
itself never fails. If the access.openova.io CRD has not rolled yet,
the next handover retries automatically.

Architect-first: mirrors `userAccessToUnstructured` shape and uses
existing `sovereignDynamicClient` + `rbacAssignSlug` seams. Tier
mapping follows the documented lossy `owner -> admin` rule in
`userAccessTierToRole` (CRD only accepts admin|editor|viewer).

Refs: docs/SOVEREIGN-MULTI-REGION-DOD.md D21

* chore(slot-13): pin bp-catalyst-platform to 1.4.147 (D21+D31 baked)

PR #1562 (D31 wordpress-tenant activeHotStandby) + PR #1564 (D21 owner
UserAccess auto-seed at handover, catalyst-api:8d2a947) both packaged
into chart 1.4.147. Pin slot so t133+ gets both gates on first prov.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(chart): regionsJson uses toJson to defeat YAML flow-seq re-parse (D5)

PR #1551 single-quoted SOVEREIGN_REGIONS_JSON in the slot file
substitute, but Flux Kustomize's postBuild can still re-parse the
JSON-shaped string as a YAML flow-sequence depending on quoting context.
When that happens .Values.sovereign.regionsJson is a Go []interface{}
of map[interface{}]interface{} and `| quote` prints Go's
`[map[cloudRegion:hel1 ...]]` syntax — catalyst-api's json.Unmarshal of
the env var then fails and Request.Regions is empty.

toJson normalises both string and list inputs to valid JSON.

Caught live on t132 2026-05-16 chart 1.4.147: env var rendered as
`[map[cloudRegion:hel1 ...]]` despite #1551 being in effect.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-17 00:01:43 +04:00
github-actions[bot]
f8c8a87151 deploy: update catalyst images to 8d2a947 2026-05-16 19:51:40 +00:00
e3mrah
8d2a947cfb
feat(handover): auto-seed owner UserAccess CR on chroot (D21) (#1564)
Closes the D21 gap on Sovereign DoD: /users page returned empty after
fresh handover because Keycloak `sovereign-admins` membership was
established but no UserAccess CR existed for the operator.

After `keycloak.EnsureUser` succeeds in `AuthHandover`, the helper
`EnsureOwnerUserAccess` upserts a cluster-scoped UserAccess CR shaped
like the canonical user_access.go `CreateUserAccess` write:

  apiVersion: access.openova.io/v1alpha1
  kind: UserAccess
  metadata:
    name: useraccess-owner-<sanitized-email>
    annotations:
      catalyst.openova.io/user-email: <email>   # rbac_matrix:309 hint
  spec:
    user:
      keycloakSubject: <email>
    sovereignRef: <fqdn-first-label>
    applications:
      - app: "*"
        role: admin                              # owner -> admin

The Composition (issue #322) reconciles the Claim into per-app
RoleBindings on the Sovereign so the operator surfaces in /users.

Best-effort + idempotent: AlreadyExists on the second handover is
folded to nil; any other error is logged at Warn and the handover
itself never fails. If the access.openova.io CRD has not rolled yet,
the next handover retries automatically.

Architect-first: mirrors `userAccessToUnstructured` shape and uses
existing `sovereignDynamicClient` + `rbacAssignSlug` seams. Tier
mapping follows the documented lossy `owner -> admin` rule in
`userAccessTierToRole` (CRD only accepts admin|editor|viewer).

Refs: docs/SOVEREIGN-MULTI-REGION-DOD.md D21

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
2026-05-16 23:49:32 +04:00
github-actions[bot]
d6b6aca581 deploy: update sme service images to c04b2ec + bump chart to 1.4.147 2026-05-16 19:41:18 +00:00
github-actions[bot]
af4d9b1b87 deploy: update sme service images to f9ed292 + bump chart to 1.4.146 2026-05-16 19:29:50 +00:00
github-actions[bot]
696aa26f83 deploy: update sme service images to a11067d + bump chart to 1.4.145 2026-05-16 19:18:09 +00:00
github-actions[bot]
48eb653f79 deploy: update sme service images to 1fe7067 + bump chart to 1.4.144 2026-05-16 19:05:51 +00:00
e3mrah
7c3724591c
fix(chart): admin pod uses dedicated image tag (D27 SME stack) (#1557)
* feat(billing+notification): wire voucher-issued email (D28)

D28 of the Sovereign DoD requires that issuing a voucher emails it to
the recipient zero-touch. Today POST /billing/vouchers/issue persists
the PromoCode row but never notifies anyone — so a gifted voucher only
reaches its recipient if the operator manually sends the code over a
side channel. This wires sme-billing -> sme-notification so the email
fires automatically on every successful upsert that carries a
recipient_email field.

Architecture follows the existing notification-service seam:
sme-billing POSTs to http://notification.sme.svc.cluster.local:8087/
notification/send with template=voucher-issued; sme-notification renders
the HTML and dispatches via Stalwart over SMTP. No direct SMTP code is
added to billing, no stalwart-mail calls bypass notification.

Server-side only — the owner-UI for issuing vouchers (D28b) is a
separate PR.

Changes:

  notification/templates/templates.go
    + VoucherIssuedEmail(code, creditOMR, description, sovereignFQDN,
      validityHint) — renders code prominently, redeem button to
      https://marketplace.<sovereignFQDN>/redeem/?code=<CODE>; FQDN
      always supplied by caller, NEVER hardcoded.

  notification/handlers/handlers.go
    + renderTemplate("voucher-issued") case parsing
      {code, credit_omr, description, sovereign_fqdn, validity_hint}.
    + Default subject "You've been gifted a voucher for OpenOva SME".

  billing/handlers/handlers.go
    + Handler fields: NotificationURL, SovereignFQDN, NotificationClient.

  billing/handlers/vouchers.go
    + issueVoucherRequest = store.PromoCode + RecipientEmail (request-
      only; never persisted).
    + sendVoucherIssuedEmail() — POSTs to NotificationURL with a 5s
      timeout. Best-effort: a non-2xx or transport error logs but does
      NOT fail the IssueVoucher response, because the row is already
      persisted and re-issuing the same code re-fires the email.
    + Re-issue semantics (#91 resurrects soft-deleted rows) extend to
      the email path — documented in the handler comment.

  billing/main.go
    + Reads NOTIFICATION_SERVICE_URL (default
      http://notification.sme.svc.cluster.local:8087/notification/send)
      and SOVEREIGN_FQDN env vars. Wires a 5s default http.Client.

  products/catalyst/chart/templates/sme-services/billing.yaml
    + Pipes NOTIFICATION_SERVICE_URL (cluster-DNS constant) and
      SOVEREIGN_FQDN (from .Values.global.sovereignFQDN, NEVER
      hardcoded) into the billing Deployment.

Tests:

  notification/handlers/handlers_test.go (new)
    + TestRenderTemplate_VoucherIssued: rendered HTML contains code +
      credit + a redeem URL built from the supplied FQDN; never falls
      back to marketplace.openova.io.
    + TestRenderTemplate_VoucherIssued_CustomSubject + _NoDescription
      + TestRenderTemplate_UnknownTemplate as guard rails.

  billing/handlers/vouchers_test.go
    + TestIssueVoucher_SendsEmail_WhenRecipientPresent: a fake round-
      tripper sees the POST to notification with the right URL +
      template + data (code upper-cased, credit_omr, sovereign_fqdn,
      description) when recipient_email is set.
    + TestIssueVoucher_NoEmail_WhenRecipientAbsent: no notification
      call when recipient is empty.
    + TestIssueVoucher_NotificationFailure_DoesNotFailUpsert:
      operator gets 200 even when notification returns 500.
    + TestIssueVoucher_403WithoutVoucherIssuerRole: role gate preserved.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(chart): admin pod uses dedicated image tag (D27 SME stack)

t132 caught admin pod stuck in ImagePullBackOff on `admin:b0ed216` —
the SME services CI run for that mono-repo SHA published 10 services
but admin's image was missing from GHCR. Decouple admin's tag from
smeTag so a missing-build for one service doesn't wedge the SME stack.

Default to `3c2f7e4` (matches marketplaceApi + console, known-published).
When admin's UI changes, bump in lockstep with those.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 23:05:09 +04:00
e3mrah
1fe706769f
feat(billing+notification): wire voucher-issued email (D28) (#1556)
D28 of the Sovereign DoD requires that issuing a voucher emails it to
the recipient zero-touch. Today POST /billing/vouchers/issue persists
the PromoCode row but never notifies anyone — so a gifted voucher only
reaches its recipient if the operator manually sends the code over a
side channel. This wires sme-billing -> sme-notification so the email
fires automatically on every successful upsert that carries a
recipient_email field.

Architecture follows the existing notification-service seam:
sme-billing POSTs to http://notification.sme.svc.cluster.local:8087/
notification/send with template=voucher-issued; sme-notification renders
the HTML and dispatches via Stalwart over SMTP. No direct SMTP code is
added to billing, no stalwart-mail calls bypass notification.

Server-side only — the owner-UI for issuing vouchers (D28b) is a
separate PR.

Changes:

  notification/templates/templates.go
    + VoucherIssuedEmail(code, creditOMR, description, sovereignFQDN,
      validityHint) — renders code prominently, redeem button to
      https://marketplace.<sovereignFQDN>/redeem/?code=<CODE>; FQDN
      always supplied by caller, NEVER hardcoded.

  notification/handlers/handlers.go
    + renderTemplate("voucher-issued") case parsing
      {code, credit_omr, description, sovereign_fqdn, validity_hint}.
    + Default subject "You've been gifted a voucher for OpenOva SME".

  billing/handlers/handlers.go
    + Handler fields: NotificationURL, SovereignFQDN, NotificationClient.

  billing/handlers/vouchers.go
    + issueVoucherRequest = store.PromoCode + RecipientEmail (request-
      only; never persisted).
    + sendVoucherIssuedEmail() — POSTs to NotificationURL with a 5s
      timeout. Best-effort: a non-2xx or transport error logs but does
      NOT fail the IssueVoucher response, because the row is already
      persisted and re-issuing the same code re-fires the email.
    + Re-issue semantics (#91 resurrects soft-deleted rows) extend to
      the email path — documented in the handler comment.

  billing/main.go
    + Reads NOTIFICATION_SERVICE_URL (default
      http://notification.sme.svc.cluster.local:8087/notification/send)
      and SOVEREIGN_FQDN env vars. Wires a 5s default http.Client.

  products/catalyst/chart/templates/sme-services/billing.yaml
    + Pipes NOTIFICATION_SERVICE_URL (cluster-DNS constant) and
      SOVEREIGN_FQDN (from .Values.global.sovereignFQDN, NEVER
      hardcoded) into the billing Deployment.

Tests:

  notification/handlers/handlers_test.go (new)
    + TestRenderTemplate_VoucherIssued: rendered HTML contains code +
      credit + a redeem URL built from the supplied FQDN; never falls
      back to marketplace.openova.io.
    + TestRenderTemplate_VoucherIssued_CustomSubject + _NoDescription
      + TestRenderTemplate_UnknownTemplate as guard rails.

  billing/handlers/vouchers_test.go
    + TestIssueVoucher_SendsEmail_WhenRecipientPresent: a fake round-
      tripper sees the POST to notification with the right URL +
      template + data (code upper-cased, credit_omr, sovereign_fqdn,
      description) when recipient_email is set.
    + TestIssueVoucher_NoEmail_WhenRecipientAbsent: no notification
      call when recipient is empty.
    + TestIssueVoucher_NotificationFailure_DoesNotFailUpsert:
      operator gets 200 even when notification returns 500.
    + TestIssueVoucher_403WithoutVoucherIssuerRole: role gate preserved.

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 23:04:46 +04:00
github-actions[bot]
9718ba2924 deploy: update catalyst images to 2fd4e3c 2026-05-16 18:26:16 +00:00
e3mrah
2fd4e3cbf4
feat(wizard): default marketplaceEnabled=true for D27 zero-touch (#1555)
Founder ruling 2026-05-16: D27 mandates that a fresh wizard provisions a
Sovereign already ready to host tenant orgs (D29). Operator can still
flip the toggle off on StepMarketplace if they explicitly want a
private Sovereign.

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 22:24:09 +04:00
github-actions[bot]
564fe4f4e5 deploy: update catalyst images to 9f096b0 2026-05-16 18:01:02 +00:00
e3mrah
9f096b0b18
fix(chroot): populate Result.LoadBalancerIP so canvas shows LB chip (D15) (#1553)
chrootEnsureDeployment was synthesizing a Deployment with Result=nil.
The topology loader's buildLBs() returned [] on nil-Result → canvas
chip showed `LoadBalancer 0/0` on every chroot Sovereign Console
even though the Sovereign ingress LB was allocated and serving
console.<fqdn>.

Populate Result with LoadBalancerIP from `SOVEREIGN_LB_IP` env (set
by bp-catalyst-platform's sovereign-fqdn ConfigMap `lbIP` key per
issue #900 / PR #145). buildLBs then emits one LoadBalancer entry
per region using the canonical primary LB.
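
A rough Go sketch of the nil-Result guard and the env-driven
enrichment described above. The struct shapes are simplified and
hypothetical, and returning a nil Result when `SOVEREIGN_LB_IP` is
unset is an assumption, not something this message states.

```go
package main

import "os"

// Simplified, hypothetical shapes; the real types carry more fields.
type Result struct{ LoadBalancerIP string }

type Deployment struct {
	Regions []string
	Result  *Result
}

type LoadBalancer struct{ Region, IP string }

// chrootResult builds the synthesized Result from an env lookup.
// Assumption: an unset variable keeps the previous nil-Result behaviour.
func chrootResult(getenv func(string) string) *Result {
	ip := getenv("SOVEREIGN_LB_IP")
	if ip == "" {
		return nil
	}
	return &Result{LoadBalancerIP: ip}
}

// chrootResultFromEnv is the production wiring around os.Getenv.
func chrootResultFromEnv() *Result { return chrootResult(os.Getenv) }

// buildLBs emits one LoadBalancer per region using the primary LB IP,
// and an empty slice when no Result is available.
func buildLBs(d Deployment) []LoadBalancer {
	if d.Result == nil || d.Result.LoadBalancerIP == "" {
		return []LoadBalancer{}
	}
	lbs := make([]LoadBalancer, 0, len(d.Regions))
	for _, r := range d.Regions {
		lbs = append(lbs, LoadBalancer{Region: r, IP: d.Result.LoadBalancerIP})
	}
	return lbs
}
```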

Caught on t131 2026-05-16 — DoD D15. Same chroot-synth-enrichment
pattern as PR #1534 (SOVEREIGN_REGIONS_JSON).

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 21:58:53 +04:00
github-actions[bot]
dd9b631740 deploy: update catalyst images to 124ac13 2026-05-16 17:58:31 +00:00
e3mrah
124ac13c1d
fix(router): chroot Sovereign /app/<name> resolves to AppDetail, not mothership AppsPage (D17b) (#1552)
Two route trees claim `/app`:

1. `appRoute` (line 364) — mothership AppLayout chrome, prefix `/app`,
   children `/app/$deploymentId/applications/*`, `/app/$deploymentId/
   settings`, `/app/dashboard` (fleet view), etc. ~30 children.
2. `consoleAppDetailRoute` (line 1141, under consoleLayoutRoute) —
   clean `/app/$componentId` for the chroot Sovereign Console's
   per-app detail.

On a chroot Sovereign Console (DETECTED_MODE.mode === 'sovereign')
the operator clicks `/apps/<card>` → AppCard generates HREF
`/app/<name>` (AppsPage.tsx line ~720, correct for chroot context).
TanStack router resolves to the MOTHERSHIP `appRoute` because it
matches first (registered earlier under rootRoute) and its
children accept `<name>` as $deploymentId. The page renders
AppLayout chrome + AppsPage with mothership sidebar — looks
nothing like AppDetail.

Founder observation (BUG-002 from /tmp/test-matrix-t129.json + reported
on t131 2026-05-16):
> Application individual pages are not visible at all in the child
> while mothership doesn't have that issue, this is the biggest blunder!

Fix: `appRoute.beforeLoad` redirects on chroot:
- `/app/<componentId>` → `/<componentId>` (caught by consoleAppDetailRoute)
- `/app/dashboard`, `/app/install`, `/app/sre/*`, `/app/sec/*`, `/app/blueprints`
  → `/dashboard` (canonical Sovereign landing; these are mothership-only
  surfaces — already partially fixed at dashboardRoute level by PR #1547)

Mothership behavior unchanged (DETECTED_MODE.mode !== 'sovereign'
falls through to the existing AppLayout-rooted tree).

Refs DoD D17b. Caught on t131 (623354058b114dd6, 2026-05-16).

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 21:56:31 +04:00
github-actions[bot]
8980b727fb deploy: update catalyst images to fbe23da 2026-05-16 16:34:04 +00:00
e3mrah
fbe23da091
fix(ui-nginx): allow Google Fonts domains in CSP (D26) (#1549)
Sovereign Console pages reference Inter + JetBrains Mono fonts via
fonts.googleapis.com (index.html lines 9, 11). The nginx CSP only
allowed font-src 'self' data: — so the browser blocked the font
stylesheet AND the woff2 fetches, falling back to system fonts.

Add fonts.googleapis.com to style-src (for the @import CSS) and
fonts.gstatic.com to font-src (for the woff2 assets). All 3 CSP
occurrences in nginx.conf updated identically.

Alternative considered: self-host the woff2 + drop the external
references. Skipped for now — sticking with Google Fonts CDN is
faster + matches every other web app's posture. If the operator
wants air-gap-compatible Sovereigns later, switch to self-hosted.

Caught on t129 2026-05-16 — DoD D26.

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 20:31:51 +04:00
github-actions[bot]
27556577f7 deploy: update catalyst images to 7845a00 2026-05-16 16:30:19 +00:00
e3mrah
7845a00799
fix(dashboard): add region + vcluster as TreemapDimensions (D16) (#1548)
Multi-region operators on the Sovereign Console couldn't pivot the
/dashboard treemap by region or vCluster. The TreemapDimension
union (FE) and dashboardDimension set (BE) only included
sovereign/cluster/family/namespace/application.

This PR:
- Adds 'region' + 'vcluster' to TreemapDimension type
  (products/catalyst/bootstrap/ui/src/lib/treemap.types.ts)
- Adds them to the dimension select options
  (products/catalyst/bootstrap/ui/src/components/TreemapLayerController.tsx)
- Adds them to the validated set in dashboard.go
- Adds podRow.region + podRow.vcluster fields populated from
  openova.io/region and catalyst.openova.io/vcluster-role labels
- Extends dimensionKey switch to bucket by these new dimensions
  (fallback: region→cluster, vcluster→"host")
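
The bucketing with its fallbacks can be sketched in Go as follows. The
`podRow` fields and the function signature are simplified assumptions;
only the dimension names and the two fallbacks come from this PR.

```go
package main

// podRow is a simplified, hypothetical view of the real row type.
type podRow struct {
	sovereign, cluster, family, namespace, application string
	region, vcluster                                   string
}

// dimensionKey returns the bucket a pod falls into for one dimension.
// region falls back to the cluster name and vcluster falls back to
// "host" when the corresponding label was absent.
func dimensionKey(p podRow, dim string) string {
	switch dim {
	case "sovereign":
		return p.sovereign
	case "cluster":
		return p.cluster
	case "family":
		return p.family
	case "namespace":
		return p.namespace
	case "application":
		return p.application
	case "region":
		if p.region != "" {
			return p.region
		}
		return p.cluster
	case "vcluster":
		if p.vcluster != "" {
			return p.vcluster
		}
		return "host"
	}
	return ""
}
```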

Caught on t129 2026-05-16 — DoD D16. Note that full multi-cluster
fan-out (aggregating pods across all 3 region kubeconfigs into one
treemap) is a separate refactor not included here; this PR delivers
the dimension surface so the layer selector is usable + a fresh prov
with the chroot's k8scache extended to multi-region will render
3 cluster bubbles when the operator picks Layer-1=cluster.

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 20:24:34 +04:00
github-actions[bot]
477bd0966f deploy: update catalyst images to 52015ff 2026-05-16 16:15:32 +00:00
e3mrah
52015ff468
fix(ui): t129 SPA routing — bp-bp- prefix, PIN /wizard leak, /app/dashboard fleet leak (#1547)
Three operator-visible SPA routing bugs caught on live t129 Sovereign
Console (t129.omani.works, 2026-05-16). Closes #1546.

BUG-001 (D19) — doubled /app/bp-bp-* href on 10 of 44 app cards.
  build-catalog.mjs::listBootstrapKit extracted slug from `NN-(.+)\.yaml`
  without stripping an optional `bp-` already present in some filenames
  (e.g. `13-bp-catalyst-platform.yaml`). The captured slug became
  `bp-catalyst-platform`, then `id: \`bp-${slug}\`` doubled it to
  `bp-bp-catalyst-platform`, breaking the FE↔BE HR-name join and
  printing the doubled prefix on the AppsPage card href. Fix: strip a
  leading `bp-` from the captured slug before forming the canonical id.
  Regenerated catalog.generated.ts + blueprints.json — 10 entries
  collapse to their single-prefix canonical form (bp-catalyst-platform,
  bp-cert-manager-powerdns-webhook, bp-k8s-ws-proxy, bp-guacamole,
  bp-dmz-vcluster, bp-hcloud-ccm, bp-openova-flow-server,
  bp-openova-flow-emitter, bp-mgmt-vcluster, bp-rtz-vcluster).

BUG-015 (D23, extends D0) — PIN-verify lands /wizard on Sovereign.
  VerifyPinPage default landing was `/wizard` regardless of operating
  mode. On a chroot Sovereign Console (DETECTED_MODE.mode === 'sovereign')
  the operator has just been auto-redirected from the mothership
  handover URL; their Sovereign is already converged. Routing them to
  the new-prov wizard re-prompts for org details and contradicts D0.
  Fix: branch on DETECTED_MODE.mode — `/dashboard` on sovereign,
  `/wizard` on catalyst-zero. Mothership flow unchanged. Test:
  VerifyPinPage.test.tsx asserts the 3 cases (sovereign default,
  catalyst-zero default, explicit next= override).

BUG-016 (D24) — /app/dashboard exposes mothership fleet view.
  appRoute's `/dashboard` child mounts DashboardPage (multi-Sovereign
  fleet, "7 Sovereigns" with duplicate rows). On a Sovereign Console
  this surface MUST NOT be reachable — the Sovereign owns ONE deployment,
  fleet is mothership-only. Fix: beforeLoad on dashboardRoute redirects
  to `/dashboard` (consoleDashboardRoute, the per-Sovereign landing)
  when DETECTED_MODE.mode === 'sovereign'. Mothership keeps the fleet
  view as today.

Refs: docs/SOVEREIGN-MULTI-REGION-DOD.md D19/D23/D24,
      /tmp/test-matrix-t129.json discoveries BUG-001/015/016.

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 20:13:26 +04:00
github-actions[bot]
1405275af9 deploy: update catalyst images to 2b3888e 2026-05-16 15:48:21 +00:00
e3mrah
2b3888eed5
fix(ui): suppress chroot-side false-positive notifications (D17, D18) (#1543)
Two notification spammers on the chroot Sovereign Console that produce
noise on every /apps + /app/<name> visit:

D17 — "Deployment id in the URL is malformed":
  AppsPage.tsx fires on isDeploymentID(rawDeploymentId)=false. On the
  chroot, useResolvedDeploymentId resolves to /api/v1/sovereign/self
  which returns the synthesized canonical id `sovereign-<fqdn>` (26
  chars, not hex). The notification claims that path-segment is
  invalid even though there is no URL segment — the resolution path
  is in-process. Suppress on DETECTED_MODE.mode === 'sovereign'.

D18 — "Per-component install monitoring is unavailable":
  Fires on state.phase1WatchSkipped. On the chroot, phase1WatchSkipped
  is a MOTHERSHIP-only concept (mother's observer pod failed to fetch
  the new cluster's kubeconfig). The Sovereign-side catalyst-api runs
  IN the cluster it's reporting on — has the in-cluster ServiceAccount
  + bundled sovereignDynamicClient + informer cache watching HelmReleases
  natively. Firing this here tells operator to drop to kubectl when
  the data is on the page. Suppress on chroot.

Caught on t129 (6cddff7ef4432bdc, 2026-05-16) — DoD D17 + D18.

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 19:46:25 +04:00
github-actions[bot]
f8c20137a5 deploy: update catalyst images to 536bfcb 2026-05-16 15:42:43 +00:00
e3mrah
536bfcb699
fix(infrastructure): vCluster fallback from namespace label (D15) (#1542)
loadVClusters() queried vcluster.io/v1alpha1 CRs only. Our bootstrap
topology ships loft-sh/vcluster as a plain Helm chart (StatefulSet +
Service, NO CRD installed) so the CR list is always empty on a
converged Sovereign → canvas `vCluster N/N` chip shows `0/0` even
though Pods are Running.

Add a fallback: enumerate Namespaces carrying
`catalyst.openova.io/vcluster-role` label (stamped by
bp-{mgmt,dmz,rtz}-vcluster's namespace template at PR #1526).
Emits one VCluster row per labeled namespace with role = the label
value. Status `healthy` since the namespace exists (operator-visible
Pod state is surfaced elsewhere).
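
A minimal Go sketch of the namespace-label fallback, using plain
structs instead of client-go objects. The `Namespace` and `VCluster`
shapes are hypothetical; the label key and the `healthy` status are the
ones named above.

```go
package main

import "sort"

const vclusterRoleLabel = "catalyst.openova.io/vcluster-role"

// Namespace and VCluster are simplified, hypothetical shapes.
type Namespace struct {
	Name   string
	Labels map[string]string
}

type VCluster struct{ Name, Role, Status string }

// vclustersFromNamespaces emits one VCluster row per namespace carrying
// the role label. Unlabelled namespaces are ignored.
func vclustersFromNamespaces(nss []Namespace) []VCluster {
	out := []VCluster{}
	for _, ns := range nss {
		role := ns.Labels[vclusterRoleLabel]
		if role == "" {
			continue
		}
		out = append(out, VCluster{Name: ns.Name, Role: role, Status: "healthy"})
	}
	// Sort for a stable chip count and row order across calls.
	sort.Slice(out, func(i, j int) bool { return out[i].Name < out[j].Name })
	return out
}
```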

Caught on t129 (6cddff7ef4432bdc, 2026-05-16) — D15.

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 19:40:50 +04:00
github-actions[bot]
f010ca16a7 deploy: update catalyst images to 5b69247 2026-05-16 15:11:00 +00:00
e3mrah
5b69247135
fix(clustermesh): secondary cluster name match tofu scheme (D11) (#1540)
Tofu's `secondary_region_cluster_mesh_name` local at
infra/hetzner/main.tf:389 generates secondary names as
`<sovereign-stem>-<region-stem-no-digits>` (e.g. `t129-nbg`,
`t129-sin`). The bootstrap-kit slot 01-cilium.yaml renders
cilium-config cluster.name from this value via the
CLUSTER_MESH_NAME envsubst.

The orchestrator's clusterName derivation was wrong: it appended
`-<region-key>` to the primary's name (e.g. `t129-mesh-nbg1-1`),
which matched NEITHER the tofu scheme NOR the cilium-config value.

Caught on t129 (6cddff7ef4432bdc, 2026-05-16): TLS, etcd RBAC,
and connection all working after PRs #1530, #1536, #1538, #1539 —
but agent reported `failed to retrieve cluster configuration:
not found` for every secondary peer because it queried
`cilium/cluster-config/v1/t129-mesh-nbg1-1` against an etcd that
only had `t129-nbg`.

Fix: export `DeriveSecondaryClusterMeshName(req, rs)` that
mirrors tofu's local exactly, plus a `stripTrailingDigits` helper.
Orchestrator's buildRegionSlots uses this for secondaries; primary
keeps the `<stem>-mesh` shape.
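
The naming scheme can be sketched in Go as below. The two-string
signature is a simplification for illustration; the exported helper
takes `(req, rs)` and its exact field access is not shown here.

```go
package main

import "strings"

// stripTrailingDigits removes any run of digits at the end of s,
// e.g. "nbg1" becomes "nbg" and "sin" is returned unchanged.
func stripTrailingDigits(s string) string {
	return strings.TrimRight(s, "0123456789")
}

// deriveSecondaryClusterMeshName mirrors the tofu scheme
// <sovereign-stem>-<region-stem-no-digits>.
func deriveSecondaryClusterMeshName(sovereignStem, regionStem string) string {
	return sovereignStem + "-" + stripTrailingDigits(regionStem)
}

// derivePrimaryClusterMeshName keeps the <stem>-mesh shape.
func derivePrimaryClusterMeshName(sovereignStem string) string {
	return sovereignStem + "-mesh"
}
```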

Closes D11 incident chain: #1525 → #1528 → #1530 → #1536 → #1538 →
#1539 → this. With this PR landed t129's secondary→primary
connection already works (verified on live cluster — secondary
agents show "ready, 2 nodes, 113 endpoints, 326 identities");
primary→secondary will work on a fresh prov once the name match
is correct from the start.

Refs DoD D11.

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 19:08:55 +04:00
github-actions[bot]
6b519b5573 deploy: update catalyst images to d0fd32d 2026-05-16 15:01:32 +00:00
e3mrah
d0fd32dc04
fix(clustermesh): use peer's clustermesh-apiserver-remote-cert (D11) (#1539)
The orchestrator was minting a fresh client cert (CN = local cluster
name) for each peer connection. Even with PR #1530's "sign with
peer's CA" fix the TLS handshake succeeded but etcd RBAC rejected:

    error="etcdserver: permission denied"

Cilium's clustermesh-apiserver etcd has RBAC with a `remote` user
that has read access on the cilium/* prefix. The chart generates
`kube-system/clustermesh-apiserver-remote-cert` with CN=`remote`.

Canonical `cilium clustermesh connect` CLI copies THIS Secret's
tls.crt/tls.key as the client cert the REMOTE cluster presents —
matches the etcd RBAC user verbatim.

This PR adopts that pattern: snapshotRemoteCert() reads the peer's
existing `clustermesh-apiserver-remote-cert` Secret, returns
tls.crt + tls.key bytes, and the orchestrator writes them into
A's `cilium-clustermesh` Secret instead of minting.
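
A small Go sketch of the snapshot step, operating on a Secret's data
map rather than a live client. The `peerClientCert` type and the
signature are hypothetical; only the `tls.crt` / `tls.key` keys come
from the description above.

```go
package main

import "errors"

// peerClientCert is a hypothetical carrier for the copied key pair.
type peerClientCert struct{ Cert, Key []byte }

// snapshotRemoteCert copies tls.crt and tls.key out of the peer's
// clustermesh-apiserver-remote-cert Secret data. It fails when either
// key is missing so the caller never writes a half-populated Secret.
func snapshotRemoteCert(secretData map[string][]byte) (peerClientCert, error) {
	crt, key := secretData["tls.crt"], secretData["tls.key"]
	if len(crt) == 0 || len(key) == 0 {
		return peerClientCert{}, errors.New("remote-cert Secret is missing tls.crt or tls.key")
	}
	return peerClientCert{Cert: crt, Key: key}, nil
}
```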

Caught on t129 (6cddff7ef4432bdc, 2026-05-16):
- TLS handshake succeeded after firewall fix (PR #1538) opened
  NodePort range so LB→backend health check passed
- cilium-dbg status reported `etcd: 1/1 connected, has-quorum=true`
  (TLS path working)
- BUT `remote configuration: expected=true, retrieved=false` and
  agent logs spammed `etcdserver: permission denied`

With this PR's CN=remote cert, etcd authorizes the kvstore List
and clustermesh sync completes — agent should flip to
`2/2 remote clusters ready`.

Completes the D11 chain: #1525 (regionKeyFromSpec) → #1528
(clusterName derivation) → #1530 (cert with peer's CA — no longer
needed but kept as defense-in-depth) → #1536 (hostAlias pattern)
→ #1538 (firewall NodePort range) → this.

Refs DoD D11.

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 18:58:22 +04:00
github-actions[bot]
1cfe0d758f deploy: update catalyst images to 1c988b9 2026-05-16 14:45:56 +00:00
github-actions[bot]
bfc5a6143f deploy: update catalyst images to 83d771d 2026-05-16 14:13:28 +00:00
e3mrah
83d771dee9
fix(clustermesh): hostAlias pattern — endpoint hostname + DS patch (D11) (#1536)
Cilium clustermesh-apiserver server cert has SANs:
  *.mesh.cilium.io, clustermesh-apiserver.kube-system.svc,
  127.0.0.1, ::1
No public LB IP SAN. When the orchestrator wrote the peer config blob
with `endpoints: - https://<lb-ip>:2379`, TLS handshake from the
agent failed at hostname verification — `cilium-dbg status --verbose`
reported `0/N remote clusters ready, Waiting for initial connection`.

This PR adopts the canonical Cilium clustermesh hostAlias pattern
(same shape as `cilium clustermesh connect` CLI):

1. buildPeerConfigBlob now writes the endpoint as
   `https://<peer>.mesh.cilium.io:2379` — matching the apiserver
   server cert's `*.mesh.cilium.io` wildcard SAN.

2. New patchCiliumHostAliases adds one hostAliases entry per peer
   to the cilium DaemonSet's pod spec:
     - ip: <peer-LB-IP>
       hostnames: ["<peer>.mesh.cilium.io"]
   So the agent resolves the hostname to the public LB IP at
   connect-time. Strategic-merge patch: idempotent re-runs replace
   the whole list with the current peer set.

3. Orchestrator step 3 calls patchCiliumHostAliases for each
   region's local cilium DaemonSet right before the rollout-restart
   of cilium / cilium-operator / clustermesh-apiserver, so the new
   pod spec is in effect when the agents come back up.
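
Steps 1 and 2 can be sketched in Go as follows. The `Peer` and
`HostAlias` types are simplified stand-ins, and the patch itself is
omitted; the sketch only shows how endpoint hostnames and hostAliases
entries line up.

```go
package main

import (
	"fmt"
	"sort"
)

// Peer and HostAlias are simplified, hypothetical shapes.
type Peer struct{ Name, LBIP string }

type HostAlias struct {
	IP        string   `json:"ip"`
	Hostnames []string `json:"hostnames"`
}

// peerHostname matches the server cert's *.mesh.cilium.io wildcard SAN.
func peerHostname(name string) string { return name + ".mesh.cilium.io" }

// peerEndpoint is the endpoint written into the peer config blob.
func peerEndpoint(name string) string {
	return fmt.Sprintf("https://%s:2379", peerHostname(name))
}

// hostAliasesFor builds the full hostAliases list for the current peer
// set, sorted by hostname so repeated runs produce the same patch body.
func hostAliasesFor(peers []Peer) []HostAlias {
	out := make([]HostAlias, 0, len(peers))
	for _, p := range peers {
		out = append(out, HostAlias{IP: p.LBIP, Hostnames: []string{peerHostname(p.Name)}})
	}
	sort.Slice(out, func(i, j int) bool { return out[i].Hostnames[0] < out[j].Hostnames[0] })
	return out
}
```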

Caught on t128 (9680edbdce8fefe8, 2026-05-16) — same incident
chain as PRs #1525/#1528/#1530. With this PR landed AND the
existing PR #1530 (cert signed by peer's CA), agents should
flip to `2/2 remote clusters ready` on the next prov.

Refs DoD D11.

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 18:10:21 +04:00
github-actions[bot]
70bb2f2517 deploy: update catalyst images to 1f30a08 2026-05-16 13:48:18 +00:00
e3mrah
1f30a08ae3
fix(chroot): seed Request.Regions[] from SOVEREIGN_REGIONS_JSON env (D5) (#1534)
The Sovereign-side catalyst-api runs in "chroot" mode — it has no
parent prov record, so chrootEnsureDeployment synthesises a minimal
in-memory Deployment with only SovereignFQDN set. The
/infrastructure/topology loader then sees empty Request.Regions[]
and falls into the live-Nodes enumeration path (buildRegionFromLiveNodes)
which only sees THIS cluster's Node(s) → emits exactly 1 Region
even on a 3-region Sovereign. /cloud?view=graph renders as
"1 cluster 1 region" — DoD D5 failure.

Caught on t126 (84c0848406dd6fdd, 2026-05-16): operator reported
`console.t126.omani.works/cloud?view=graph` showed 1 region despite
mothership openova-flow snapshot holding all 3 regions correctly.

This PR threads the canonical multi-region RegionSpec[] from the
mothership prov body all the way to the Sovereign-side catalyst-api:

  tofu var.regions
    → jsonencode → sovereign_regions_json tftpl var
    → cloud-init postBuild.substitute SOVEREIGN_REGIONS_JSON
    → bp-catalyst-platform slot 13 sovereign.regionsJson value
    → sovereign-fqdn ConfigMap key `regionsJson`
    → catalyst-api Pod env SOVEREIGN_REGIONS_JSON (valueFrom)
    → chrootEnsureDeployment parses JSON, populates Request.Regions[]
    → topology loader emits one Region per spec entry

Single-region Sovereigns: var.regions has length 1; chart writes
the array literal; chroot synth still produces 1 Region — no
regression. Empty env: chroot falls back to live-Nodes path
(legacy behavior preserved).
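
The last two hops of the chain (env var to Request.Regions[]) might be
sketched in Go like this. The `RegionSpec` shape is an assumption with
a single `cloudRegion` field; the real spec carries more.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// RegionSpec is a reduced, hypothetical view of the real spec.
type RegionSpec struct {
	CloudRegion string `json:"cloudRegion"`
}

// regionsFromEnv parses the SOVEREIGN_REGIONS_JSON value. An empty
// value returns (nil, nil) so the caller can fall back to the
// live-Nodes path; malformed JSON is reported as an error.
func regionsFromEnv(raw string) ([]RegionSpec, error) {
	if strings.TrimSpace(raw) == "" {
		return nil, nil
	}
	var regions []RegionSpec
	if err := json.Unmarshal([]byte(raw), &regions); err != nil {
		return nil, fmt.Errorf("parse SOVEREIGN_REGIONS_JSON: %w", err)
	}
	return regions, nil
}
```

Surfacing the parse error separately from the empty case makes a
mis-rendered value visible instead of silently degrading to 1 Region.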

Refs DoD D5.

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 17:45:24 +04:00
github-actions[bot]
929083942b deploy: update catalyst images to 357feb0 2026-05-16 13:41:41 +00:00
github-actions[bot]
3b67fc1614 deploy: update catalyst images to 050f87e 2026-05-16 13:32:20 +00:00
e3mrah
050f87e267
fix(purge): second name-prefix pass for CCM-named clustermesh LBs (#1532)
Caught repeatedly (t124, t125 wipes both 2026-05-16): tofu destroy left
3 orphan `<fqdn-slug>-<region>-clustermesh` LBs each cycle. Names
don't start with `catalyst-` prefix because they're named by the
Cilium chart overlay
(`clusters/_template/bootstrap-kit/01-cilium.yaml`):

    load-balancer.hetzner.cloud/name:
      "${SOVEREIGN_FQDN_SLUG:=catalyst}-${SOVEREIGN_REGION_KEY:=primary}-clustermesh"

The first name-prefix pass (`catalyst-<fqdn-slug>`) misses these.
tofu doesn't manage them (CCM allocated post-Phase-1). Manual API
cleanup was forced each cycle.

Fix: add a second `purgeByNamePrefix` pass with the slug-only prefix
(`<fqdn-slug>-`) so any CCM-allocated resource named with the slug
gets swept. Dedup logic in `purgeByNamePrefix` already skips names
already reported by the labelled pass, so totals stay accurate.
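
A Go sketch of the two prefix passes with dedup, working on a plain
list of names. The function shapes and the example slug are
illustrative assumptions; the real handler calls the cloud API and
tracks more state.

```go
package main

import "strings"

// purgeByNamePrefix returns the names matching prefix that were not
// already reported, and marks them as seen so later passes skip them.
func purgeByNamePrefix(names []string, prefix string, seen map[string]bool) []string {
	var purged []string
	for _, n := range names {
		if !strings.HasPrefix(n, prefix) || seen[n] {
			continue
		}
		seen[n] = true
		purged = append(purged, n)
	}
	return purged
}

// purgeLoadBalancers runs the catalyst-<slug> pass first, then the
// slug-only pass that catches CCM-named clustermesh LBs.
func purgeLoadBalancers(names []string, fqdnSlug string, seen map[string]bool) []string {
	purged := purgeByNamePrefix(names, "catalyst-"+fqdnSlug, seen)
	return append(purged, purgeByNamePrefix(names, fqdnSlug+"-", seen)...)
}
```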

Refs feedback_wipe_handler_ccm_lb_orphans.md.

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 17:29:26 +04:00
github-actions[bot]
ec74a6cc1d deploy: update catalyst images to 70d6ada 2026-05-16 13:26:31 +00:00
e3mrah
70d6ada703
fix(clustermesh): sign A's peer client cert with B's CA (not A's CA) (#1530)
Caught on t126 (84c0848406dd6fdd, 2026-05-16) after PRs #1525+#1528
unblocked peer Secret writes. Cilium agents reloaded, peer entries
present, but cilium-dbg status --verbose shows:

    0/2 remote clusters ready
    t126-mesh-nbg1-1: Waiting for initial connection
    t126-mesh-sin-2:  Waiting for initial connection

TLS probe to peer apiserver returned "unexpected eof while reading":
the mTLS handshake fails because A's client cert was signed by A's
cilium-ca. Cilium clustermesh-apiserver's trust pool is the LOCAL
cilium-ca (B's), so A's cert is rejected at the handshake.

Fix: pass b.caCert/b.caKey to mintPeerClientCert. SAN stays A's
clusterName (matches upstream `cilium clustermesh connect` CLI and
the chart's default RBAC subject authorisation).
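
The trust relationship can be illustrated with a self-contained Go
sketch using throwaway CAs. All function names here are hypothetical
and the key type, lifetimes and subject fields are assumptions; the
point is only that a cert signed by B's CA verifies against B's trust
pool and not against A's.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"math/big"
	"time"
)

// newCA creates a throwaway self-signed CA standing in for a cilium-ca.
func newCA(cn string) (*x509.Certificate, *ecdsa.PrivateKey, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, nil, err
	}
	tmpl := &x509.Certificate{
		SerialNumber:          big.NewInt(1),
		Subject:               pkix.Name{CommonName: cn},
		NotBefore:             time.Now().Add(-time.Minute),
		NotAfter:              time.Now().Add(time.Hour),
		IsCA:                  true,
		BasicConstraintsValid: true,
		KeyUsage:              x509.KeyUsageCertSign,
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		return nil, nil, err
	}
	cert, err := x509.ParseCertificate(der)
	return cert, key, err
}

// mintPeerClientCert issues a client cert named after the local cluster
// but signed by the CA passed in (the peer's CA in the fixed flow).
func mintPeerClientCert(clusterName string, caCert *x509.Certificate, caKey *ecdsa.PrivateKey) (*x509.Certificate, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, err
	}
	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(2),
		Subject:      pkix.Name{CommonName: clusterName},
		DNSNames:     []string{clusterName},
		NotBefore:    time.Now().Add(-time.Minute),
		NotAfter:     time.Now().Add(time.Hour),
		KeyUsage:     x509.KeyUsageDigitalSignature,
		ExtKeyUsage:  []x509.ExtKeyUsage{x509.ExtKeyUsageClientAuth},
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, caCert, &key.PublicKey, caKey)
	if err != nil {
		return nil, err
	}
	return x509.ParseCertificate(der)
}

// trustedBy reports whether cert chains to ca for client authentication.
func trustedBy(cert, ca *x509.Certificate) bool {
	pool := x509.NewCertPool()
	pool.AddCert(ca)
	_, err := cert.Verify(x509.VerifyOptions{
		Roots:     pool,
		KeyUsages: []x509.ExtKeyUsage{x509.ExtKeyUsageClientAuth},
	})
	return err == nil
}
```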

Refs DoD D11.

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 17:23:18 +04:00
github-actions[bot]
a80bccad2c deploy: update catalyst images to 38f1f83 2026-05-16 13:15:24 +00:00
e3mrah
38f1f83971
fix(sovereign-dns-records): 404 fallback to FQDN-minus-first-label parent (#1529)
When operator submits sovereignFQDN like "t126.omani.works" without
parentDomains[] AND without sovereignPoolDomain, Validate()'s back-compat
synthesis stamps ParentDomain.Name = SovereignFQDN itself ("t126.omani.works").
The post-Phase-0 upsertSovereignParentZoneRecordsFromResult then PATCHes
zone "t126.omani.works." → PowerDNS 404 (the authoritative zone is
"omani.works") → no A records written → every console.* / auth.* /
gitea.* hostname resolves NXDOMAIN even after handoverFired.

Caught on t126 (84c0848406dd6fdd, 2026-05-16): clustermesh fully meshed
(D10 green after PRs #1525+#1528), handover JWT minted, wildcard cert
Ready=True, LB external IP assigned — but DoD D1/D2 stayed red because
the sovereign-dns-records PATCH 404'd silently with only a WARN log.

This PR adds a 404-fallback in upsertSovereignParentZoneRecordsFromResult:
when the synthesized parent equals SovereignFQDN AND the PATCH returns
status 404, retry once with parent-of-FQDN (`SovereignFQDN[i+1:]` where
i is the first `.`). Two-label FQDNs ("customer.com") skip the retry
since there is no parent to derive — preserves BYO-mode behavior.
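
The retry rule can be sketched in Go as below. Both function names are
hypothetical; the slicing follows the `SovereignFQDN[i+1:]` rule and
the two-label skip described above.

```go
package main

import "strings"

// fallbackParentZone returns the FQDN minus its first label. FQDNs with
// fewer than three labels have no parent to derive and return false.
func fallbackParentZone(fqdn string) (string, bool) {
	fqdn = strings.TrimSuffix(fqdn, ".")
	if strings.Count(fqdn, ".") < 2 {
		return "", false
	}
	i := strings.Index(fqdn, ".")
	return fqdn[i+1:], true
}

// shouldRetryWithParent is true only when the synthesized parent equals
// the Sovereign FQDN, the PATCH returned 404, and a parent exists.
func shouldRetryWithParent(parent, sovereignFQDN string, status int) bool {
	_, ok := fallbackParentZone(sovereignFQDN)
	return ok && parent == sovereignFQDN && status == 404
}
```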

The provisioner Validate() back-compat synthesis stays untouched
because TestValidate_SynthesisesPrimaryFromSovereignFQDN asserts the
exact "BYO mode keeps SovereignFQDN as parent" semantics for 3-label
apexes like "acme.openova.io" — that's a legitimate case (operator
registered the 3-label apex). The 404-fallback handles the pool-mode
case at the PATCH boundary where we actually know whether the zone
exists.

Refs DoD D1/D2. Same incident chain as PRs #1525 + #1528.

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 17:13:26 +04:00
github-actions[bot]
53175b0d52 deploy: update catalyst images to 48f64a4 2026-05-16 12:38:19 +00:00
e3mrah
48f64a4992
fix(clustermesh): derive cluster name + ID at orchestrator if request unset (#1528)
When operator submits the canonical multi-region body without
ClusterMeshName / ClusterMeshID, the in-memory dep.Request fields stay
empty. tofu's writeTfvars internally calls deriveClusterMeshName /
deriveClusterMeshID and the cilium-config rendered on each region gets
the right cluster.name + cluster.id — but the catalyst-api orchestrator
was reading from dep.Request directly, so:

  - slot.clusterID stayed 0 → cilium reserves 0 → kvstoremesh
    CrashLoopBackOff would happen if any deployment escaped a previous
    coalesce shim (we don't trip this today because cluster.id is set
    by chart values, but slot.clusterID=0 misreports in PeerStatus).
  - slot.clusterName stayed "" → peerEntries dict got "" keys →
    `Create Secret kube-system/cilium-clustermesh: ... a valid config
    key must consist of alphanumeric characters, '-', '_' or '.'`
    rejection → orchestrator wrote zero peers in every region.

Caught on t125 (590ab1490d00c452, 2026-05-16): all 3 regions had
clustermesh-apiserver Pod 3/3 Ready, LB IPs assigned, cilium-ca
present — but cilium-clustermesh Secret stayed absent after PR #1525
unblocked the kubeconfig-path resolution. Orchestrator logged 3x
"clustermesh: Secret apply failed ... data[]: Invalid value: """
with empty region/cluster fields.

This PR:

1. Exports DeriveClusterMeshName + DeriveClusterMeshID from the
   provisioner package so the orchestrator + tofu agree byte-identically
   on derivation (canonical seam — no duplicate logic).
2. buildRegionSlots now calls these exported helpers when dep.Request
   fields are empty. Lifts primary-mesh-name derivation out of the
   per-region loop.
3. Adds a defensive guard in the per-peer inner loop: a peer whose
   clusterName is empty fails with PeerStatus.Error and DOES NOT add
   empty-keyed entries to peerEntries (so even if a future regression
   bypasses the derivation, the Secret-Create error is no longer a
   blast-radius bug killing the whole region's write).
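
The guard in item 3 might look roughly like this. It is a sketch under
assumptions: the `peer` struct, its `Err` field (standing in for
PeerStatus.Error) and `buildPeerEntries` are illustrative names, not the
orchestrator's actual types.

```go
package main

import "fmt"

// peer is a pared-down stand-in for the orchestrator's per-peer slot.
type peer struct {
	ClusterName string
	Config      []byte
	Err         string // stands in for PeerStatus.Error
}

// buildPeerEntries collects Secret data keyed by cluster name. A peer with an
// empty name is marked failed and skipped, so it can never contribute an
// empty key that would make the whole Secret write fail.
func buildPeerEntries(peers []peer) map[string][]byte {
	entries := make(map[string][]byte, len(peers))
	for i := range peers {
		p := &peers[i]
		if p.ClusterName == "" {
			p.Err = "clustermesh: peer has empty cluster name"
			continue
		}
		entries[p.ClusterName] = p.Config
	}
	return entries
}

func main() {
	peers := []peer{
		{ClusterName: "t125-mesh", Config: []byte("cfg")},
		{ClusterName: ""}, // simulates a derivation regression
	}
	entries := buildPeerEntries(peers)
	fmt.Println(len(entries), peers[1].Err)
}
```

The valid peer is still written; only the malformed one is reported.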

Refs DoD D10/D11. Same incident chain as PR #1525.

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 16:36:25 +04:00
github-actions[bot]
e17920519e deploy: update catalyst images to 56f5917 2026-05-16 11:58:52 +00:00
e3mrah
56f59173af
fix(clustermesh): regionKeyFromSpec off-by-one — use idx not idx+1 (#1525)
Tofu's secondary_regions map keys with the ORIGINAL spec index `i`:
  for i, r in var.regions : "${r.cloudRegion}-${i}" => r if i > 0

cloud-init then PUTs each region's kubeconfig as `?region=<k>` so
catalyst-api stores it at `<kubeconfigsDir>/<id>-<k>.yaml`. With 3
regions (idx 0=primary, idx 1, idx 2) the on-disk files are:

  <id>.yaml               (primary)
  <id>-nbg1-1.yaml        (secondary, idx=1)
  <id>-sin-2.yaml         (secondary, idx=2)

regionKeyFromSpec previously returned `<region>-<idx+1>` giving
`nbg1-2` / `sin-3` — keys that match NEITHER the in-memory
secondaryKubeconfigPaths entries nor the filesystem fallback at
`<dir>/<id>-nbg1-2.yaml`. Every secondary slot ended up with
`slot.err = "kubeconfig path empty"`. The orchestrator's step-3
inner loop then hit `b.err != nil` for every peer pair and built
zero peerEntries. applyClusterMeshSecret silently returned nil on
empty entries (line 743) and the only stdout line was the misleading
`clustermesh: orchestrator completed regions=3 fullyMeshed=0`.
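
A minimal sketch of the corrected key derivation, assuming a simplified
`regionSpec` type and a hypothetical primary region name ("fsn1"); the
helper names are illustrative and only the `<region>-<idx>` shape comes
from the description above.

```go
package main

import "fmt"

// regionSpec is a pared-down stand-in for the real region entry.
type regionSpec struct{ CloudRegion string }

// regionKey mirrors the tofu map key "${r.cloudRegion}-${i}": it uses the
// ORIGINAL spec index, not idx+1.
func regionKey(r regionSpec, idx int) string {
	return fmt.Sprintf("%s-%d", r.CloudRegion, idx)
}

// kubeconfigFile returns the on-disk file name for the region at idx.
// The primary (idx 0) has no region suffix.
func kubeconfigFile(id string, r regionSpec, idx int) string {
	if idx == 0 {
		return id + ".yaml"
	}
	return id + "-" + regionKey(r, idx) + ".yaml"
}

func main() {
	regions := []regionSpec{{"fsn1"}, {"nbg1"}, {"sin"}}
	for i, r := range regions {
		fmt.Println(kubeconfigFile("<id>", r, i))
	}
	// Prints:
	// <id>.yaml
	// <id>-nbg1-1.yaml
	// <id>-sin-2.yaml
}
```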

Caught on t124 (1359e4479cbca98d, 2026-05-16) where all 3 regions
showed clustermesh-apiserver Pod 3/3 Ready, LBs assigned with
external IPs (Gap A v3.2 fix), but cilium-clustermesh Secret absent
in every region.

Also adds a `clustermesh: zero peer entries built for region` Warn
log surfacing the per-peer reasons before the silent
applyClusterMeshSecret no-op — so the next regression of this class
is debuggable from logs alone.

Refs DoD D10/D11 per docs/SOVEREIGN-MULTI-REGION-DOD.md.

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 15:56:36 +04:00
github-actions[bot]
2825dccf55 deploy: update catalyst images to ed19bb3 2026-05-16 11:28:53 +00:00
github-actions[bot]
d9cd363210 deploy: update catalyst images to 0ebd137 2026-05-16 11:00:59 +00:00
github-actions[bot]
9aecdf8782 deploy: update catalyst images to 9240930 2026-05-16 10:58:20 +00:00
e3mrah
9240930b70
fix(sovereign-ui): derive synthetic Apps/Handover stage status from deployment record + auto-redirect after handover (#1522)
Fixes Gaps C + D from session_2026_05_16_t117_dod_partial.md, which
broke DoD gates D6 (0 pending) + D7 (mothership ≡ child) on every
multi-region Sovereign post-handover.

Gap C — UI synthesizes Apps / Handover / Cutover stage rows (and per-
region variants) that catalyst-api's openova-flow snapshot emits at
depth=1 so the canvas surfaces the full five-phase lifecycle. When
those groups have NO descendants — the common case for Apps (no
operator apps installed yet) and Handover (a once-per-Sovereign event
with no per-region job rows) — the API emits Status="pending" and
the bottom-up rollup leaves it there. Result on JobsPage: 8 phantom
"Pending" rows per multi-region prov contradicting the deployment
record's status=ready + handoverFiredAt truth.

  Fix: new `handoverStageOverride.ts` re-derives these stages' status
  from the deployment snapshot. When handover has fired (status=ready
  OR handoverFiredAt non-null), pending/running Apps/Handover/Cutover
  synthetic stages get coerced to "succeeded". Terminal statuses and
  non-lifecycle jobs (bootstrap-kit, provisioner, install-*) are
  passed through untouched — backend signal always wins over UI
  inference. Scoped strictly to the three lifecycle slugs via id-
  suffix match so install-* jobs are never affected.

Gap D — No auto-redirect to the Sovereign Console from JobsPage. The
operator typically watches convergence from the Jobs table; without
an in-page redirect they get stranded on the mothership even after
the Sovereign is ready. AppsPage has the redirect but operators on
/jobs miss it.

  Fix: new `HandoverRedirectBanner.tsx` renders a 3-2-1 countdown +
  CTA + "Stay on mothership" Cancel button when `handoverReady` from
  useDeploymentEvents is set AND not in chroot mode. Auto-fires
  `window.location.assign(handoverURL)` once when countdown reaches 0
  (idempotent guard via redirectFiredRef). Cancel suppresses the
  banner + timer for the rest of the page lifetime.

Per the brief: do NOT touch catalyst-api
(`internal/handler/flow_snapshot_local.go` is the canonical group
emitter and its contract is stable). UI-layer fix only.

Tests:
  - handoverStageOverride.test.ts — 18 unit cases covering the slug
    matcher, the handover gate, and every override branch (terminal
    pass-through, non-lifecycle pass-through, per-region coercion,
    mixed-mode array stability).
  - JobsPage.handover.test.tsx — 5 integration cases proving the
    JobsPage wires both fixes correctly (synthetic stages render as
    Succeeded when ready; banner renders + Cancel suppresses; auto-
    redirect fires `window.location.assign` exactly once when the
    countdown drains; still-installing snapshot keeps stages Pending
    and banner hidden).

All 26 new tests pass. Project lint + typecheck error counts are
unchanged from main baseline (27 typecheck errors + 67 lint errors,
all in unrelated files — see project drift in JobsTable.tsx /
openova-flow/canvas etc.). The new test file inherits the same
pre-existing `import/first` rule-not-found error already present in
JobsPage.flow-merge.test.tsx — same lint-config drift, not new.

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 14:56:16 +04:00
github-actions[bot]
f0051e55d6 deploy: update catalyst images to ef93a2c 2026-05-16 10:14:29 +00:00
github-actions[bot]
babab5c31a deploy: update catalyst images to 7668905 2026-05-16 09:34:15 +00:00
github-actions[bot]
53d8f8e402 deploy: update catalyst images to 05c6edb 2026-05-16 09:17:23 +00:00
github-actions[bot]
3628f8fc31 deploy: update catalyst images to b7140b9 2026-05-16 09:08:59 +00:00
github-actions[bot]
5689ea4f44 deploy: update catalyst images to db116c2 2026-05-16 08:57:54 +00:00
e3mrah
db116c2d18
fix(kubeconfig): honour ?region=<key> on GET /kubeconfig (#1515)
Multi-region Sovereigns store secondary CP kubeconfigs at
<kubeconfigsDir>/<id>-<region>.yaml via the PUT endpoint (L520+). The
GET endpoint always read dep.Result.KubeconfigPath which is the
PRIMARY's path, so any caller asking for ?region=nbg1-1 got primary's
kubeconfig pointing at primary's IP (89.167.22.182 etc.) — silently.

Caught on t117 (7152ad51e7838836, 2026-05-16): D-gate validator
fetched all 3 region kubeconfigs via the GET endpoint with ?region=
and all 3 returned PRIMARY's endpoint. Every per-region check
(D8/D9/D12) inspected primary 3× instead of 3 distinct regions.
Workaround was reading directly from the PVC; this fix unblocks the
canonical API path.
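
The lookup the GET endpoint needs can be sketched like this. The function
name and the region-key validation are assumptions added for illustration;
only the `<kubeconfigsDir>/<id>-<region>.yaml` layout and the
primary-path default come from the description above.

```go
package main

import (
	"errors"
	"fmt"
	"path"
	"strings"
)

// resolveKubeconfigPath picks the file to serve for GET /kubeconfig.
// An empty region keeps the old behaviour (primary path); a non-empty region
// maps to <dir>/<id>-<region>.yaml. Rejecting separators and ".." is an
// extra precaution assumed here, not something the commit describes.
func resolveKubeconfigPath(dir, id, primaryPath, region string) (string, error) {
	if region == "" {
		return primaryPath, nil
	}
	if strings.ContainsAny(region, `/\`) || strings.Contains(region, "..") {
		return "", errors.New("invalid region key")
	}
	return path.Join(dir, id+"-"+region+".yaml"), nil
}

func main() {
	p, err := resolveKubeconfigPath("/kc", "7152ad51e7838836", "/kc/7152ad51e7838836.yaml", "nbg1-1")
	fmt.Println(p, err) // /kc/7152ad51e7838836-nbg1-1.yaml <nil>
}
```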

Co-authored-by: claude <claude@anthropic.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 12:55:55 +04:00
github-actions[bot]
7bfe7266af deploy: update catalyst images to f30a49f 2026-05-16 08:14:32 +00:00
github-actions[bot]
243bb6b03d deploy: update catalyst images to 7f0de7f 2026-05-16 07:36:27 +00:00
github-actions[bot]
c59ae92b55 deploy: update catalyst images to dc59085 2026-05-16 06:59:40 +00:00
github-actions[bot]
ecf256d4a7 deploy: update catalyst images to 0c9e391 2026-05-15 20:04:18 +00:00
github-actions[bot]
2585b439d4 deploy: update catalyst images to 66e7768 2026-05-15 19:56:32 +00:00
e3mrah
66e7768e8e
fix(helmwatch): emit Succeeded events for HRs Ready at attach time (#1510)
When catalyst-api restarts and the bridge re-attaches to an already-
converged child cluster, the informer initial-list returns HRs already
in Ready=True. The previous processEvent path relied implicitly on the
zero-value of w.states[componentID] (empty string) being different
from the derived state — which works today but would silently regress
if a future refactor pre-seeded w.states from a prior snapshot.

Caught on prov t112.omani.works (f2e7f02e6ffb6a18, 2026-05-15): 4 HRs
converged across primary + sin-2 regions before/after the pod restart
at 19:16, but the mothership Jobs API kept reporting:

    install-self-sovereign-cutover  → running   (kubectl: Ready=True)
    install-powerdns                → running   (kubectl: Ready=True)
    install-catalyst-platform       → running   (kubectl: Ready=True)
    install-sin-2:reloader          → failed    (kubectl: Ready=True)

D6 (0 pending / 0 running) and D7 (mothership ≡ child) both failed.

Fix shape: processEvent's emission policy is now EXPLICITLY "first
observation OR real transition". `hadPrev` (the two-return-value map
lookup) is false on the FIRST event for componentID regardless of the
state value, so the dispatch fires unconditionally on attach. The
dedupe via prev != state still suppresses sub-second status-patch
churn that helm-controller's observedGeneration touches produce.
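
The emission policy reduces to a two-value map lookup. A minimal sketch,
with a hypothetical `watcher` type and `shouldEmit` method standing in for
the real Watcher and processEvent:

```go
package main

import "fmt"

// watcher keeps the last observed state per component, like w.states.
type watcher struct {
	states map[string]string
}

func newWatcher() *watcher {
	return &watcher{states: make(map[string]string)}
}

// shouldEmit records state and reports whether an event should be dispatched:
// on the first observation of componentID (hadPrev == false), or on a real
// transition (prev != state).
func (w *watcher) shouldEmit(componentID, state string) bool {
	prev, hadPrev := w.states[componentID]
	w.states[componentID] = state
	return !hadPrev || prev != state
}

func main() {
	w := newWatcher()
	fmt.Println(w.shouldEmit("install-powerdns", "installed")) // true: first observation
	fmt.Println(w.shouldEmit("install-powerdns", "installed")) // false: status-patch churn
	fmt.Println(w.shouldEmit("install-powerdns", "failed"))    // true: real transition
}
```

Because the check uses `hadPrev` rather than comparing against the zero
value, a first event still fires even if its state happens to be the
empty string.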

Idempotency: the jobs.Bridge's lastState map dedupes (componentID,
state) re-emissions at the bridge layer (Bridge.OnHelmReleaseEvent
line ~478), and the openova-flow-server's TypeSnapshot envelope is
idempotent at the receiver — so a re-emit propagated by the
flow_emitter periodic loop is safe.

Two new tests pin the contract:
  - TestTransition_AttachTimeReady_EmitsSucceededViaSubscribe asserts
    a Watcher attaching to a child cluster with 4 already-Ready HRs
    emits exactly one State=installed event per HR, BOTH on the
    primary emit callback AND through Subscribe (the bridge wiring).
  - TestTransition_FirstObservation_NeverDedupsAcrossWatchers asserts
    that constructing a new Watcher against the same fake client
    (the Pod-restart shape) re-emits the full component-event set,
    because w.states is independent per Watcher.

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-15 23:54:25 +04:00
github-actions[bot]
feb42e2f80 deploy: update catalyst images to 5f8ba85 2026-05-15 19:41:57 +00:00
github-actions[bot]
0a63c19cc0 deploy: update catalyst images to 22668f2 2026-05-15 18:18:23 +00:00
e3mrah
22668f2870
feat(catalyst-api): auto-establish Cilium ClusterMesh after Phase-1 (#1508)
Implements DoD gates D9, D10, D11 from
docs/SOVEREIGN-MULTI-REGION-DOD.md. After phase1-watching reports all
HRs Ready, the orchestrator wires every region's clustermesh-apiserver
into a fully-connected peer mesh by writing the cross-cluster trust
material (CA bundles, peer endpoints, mTLS client certs) into each
cluster's kube-system Secrets. Cilium auto-reloads via the chart's
watch mechanism; a rollout-restart guarantees pickup.

- New handler/clustermesh.go orchestrator (AutoEstablishClusterMesh)
- Hook in phase1_watch.go markPhase1Done after fireHandover, runs on
  a goroutine with a 20-minute budget; skips when regions<2
- Idempotent: re-run on partially-meshed Sovereign converges
- Uses LoadBalancer IPs per region (provider-agnostic — A2/A3/A6)
- Hard-fails on Service type != LoadBalancer per invariant A3
- No cilium CLI shell-out (catalyst-api Pod doesn't ship it); mints
  per-peer client certs from the local cilium-ca via crypto/x509
- Four coverage tests against fake clientsets: happy-path 2-region,
  LB-absent peer marked Connected=false, idempotent re-run, single-
  region short-circuit

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-15 22:16:26 +04:00
github-actions[bot]
9613e69ecc deploy: update catalyst images to 93f6993 2026-05-15 18:06:42 +00:00
github-actions[bot]
b89fdfc9e7 deploy: update catalyst images to 4e199f1 2026-05-15 17:14:47 +00:00
e3mrah
4e199f137b
fix(dns): auto-write per-Sovereign A records into parent zone after Phase-0 (#1505)
* fix(canvas): canonicalise Job.DependsOn entries with install- prefix — fix invisible edges

PR #1499 plumbed spec.dependsOn end-to-end and verified deps populate
on first event (no /refresh-watch needed). But the openova-flow snapshot
composer (flow_snapshot_local.go) emits finish-to-start relationships
where fromId = jobs.JobID(deploymentID, dep). Without the "install-"
prefix on each dep entry, fromId came out as:

  <dep>:hel1-2:seaweedfs                 (secondary, missing "install-")
  <dep>:gitea                            (primary, missing "install-")

But the FlowNode ids in the snapshot are:

  <dep>:install-hel1-2:seaweedfs
  <dep>:install-gitea

The FE canvas adapter matches by exact id → every finish-to-start rel
points at a non-existent node → 224 rels emitted, 0 edges rendered.

Caught on prov t103.omani.works (005080699326a7ac, 2026-05-15):

  curl /v1/flows/.../snapshot → 376 rels total: 152 contains, 224 finish-to-start
  every finish-to-start fromId malformed
  canvas: sibling edges invisible across all 135 install Jobs

Fix in two places:

  internal/handler/phase1_watch.go (spawnSecondaryRegionWatchers emit):
    Region-prefix each dep AND inject the "install-" prefix so
    ev.DependsOn = ["install-<region>:<chart>"] before the bridge
    receives the event. Symmetric with how ev.Component is constructed.

  internal/jobs/helmwatch_bridge.go (OnHelmReleaseEvent):
    Canonicalise every dep entry: if it doesn't already start with
    JobNamePrefix ("install-"), prepend it. Idempotent on entries
    that already are canonical (set by the phase1_watch.go path).
    Covers the primary-region path (bare chart names like "gitea")
    too — Job.DependsOn now stores "install-gitea", which matches
    the composer's emitted FromId exactly.
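
The bridge-side canonicalisation can be sketched as below. The function
name is hypothetical and the constant is lower-cased for a standalone
example; the rule itself (prepend "install-" unless already present) is
the one described above.

```go
package main

import (
	"fmt"
	"strings"
)

// jobNamePrefix mirrors JobNamePrefix ("install-") from the description.
const jobNamePrefix = "install-"

// canonicaliseDeps returns a copy of deps in which every entry carries the
// "install-" prefix. Entries that already have it are left unchanged, so
// running it twice gives the same result.
func canonicaliseDeps(deps []string) []string {
	out := make([]string, 0, len(deps))
	for _, d := range deps {
		if !strings.HasPrefix(d, jobNamePrefix) {
			d = jobNamePrefix + d
		}
		out = append(out, d)
	}
	return out
}

func main() {
	fmt.Println(canonicaliseDeps([]string{"gitea", "install-hel1-2:seaweedfs"}))
	// [install-gitea install-hel1-2:seaweedfs]
}
```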

Tests: go build ./... + go test on internal/jobs + helmwatch + provisioner
all green. (Pre-existing TestHandleWhoami_* flake in handler is unrelated.)

* fix(canvas): canonicalise resolved DependsOn too — kill malformed prior values

Follow-up to PR #1500. The canon block ran on the event-carried dependsOn
arg, but the 3-tier resolve preferred existing-store value when non-empty
— which for any Job written BEFORE PR #1500 rolled out was malformed
(no "install-" prefix). t103.omani.works snapshot kept emitting 224
finish-to-start rels with malformed fromIds because the existing Job
rows held "hel1-2:gitea" entries that the resolve preserved verbatim.

Fix: after the 3-tier resolve, run a final canonicalisation pass on
resolvedDeps so every persisted entry is canonical regardless of
whether it came from event-carried (already canon by my prior block)
or from existing-store (potentially malformed legacy).

Note: this fix only takes effect on the NEXT HR state transition for a
given Job. HRs already in terminal state (e.g. t103's 135 succeeded HRs)
will keep their malformed deps until a new event fires. The loop's next
cycle (t104+) writes canonical from event 1.

* fix(provisioner): auto-derive cluster_mesh_name + cluster_mesh_id for multi-region provs

Caught on prov t104.omani.works (98395b3d9bd9c1aa, 2026-05-15): operator
submitted a multi-region body (3 regions cpx52) but omitted
ClusterMeshName/ClusterMeshID. catalyst-api defaulted them to "" and 0.
Tofu wrote cluster_mesh_name="" + cluster_mesh_id=0 to tfvars. Flux
postBuild.substitute rendered cilium-config with cluster.name=default +
cluster.id=0. Cilium kvstoremesh refused to start:
  "ClusterID 0 is reserved"
clustermesh-apiserver CrashLoopBackOff 16 restarts. No mesh ever formed.
Cross-region observability + east-west routing permanently broken.

Auto-derivation:

  ClusterMeshName: <first-fqdn-label>-mesh
    e.g. t105.omani.works → "t105-mesh"

  ClusterMeshID:  (sha256(deploymentID)[:4] as uint32) mod 252 + 1
    Range [1, 252]; main.tf increments for secondaries so the max id
    any region sees is primary + (regions - 1) ≤ 254. ID 255 is
    intentionally avoided (Cilium sentinel).

Operator override still respected — auto-derive only kicks in when
both fields are zero/empty AND len(Regions) > 1. Single-region provs
stay at "" / 0 (no mesh needed).
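
A sketch of the two derivations, under stated assumptions: the deployment
ID is hashed as its string bytes and the first four digest bytes are read
big-endian. Neither detail is given above, so this code is not guaranteed
to reproduce the IDs listed below; the function names are illustrative.

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
	"strings"
)

// deriveMeshName returns "<first-fqdn-label>-mesh".
func deriveMeshName(fqdn string) string {
	label, _, _ := strings.Cut(fqdn, ".")
	return label + "-mesh"
}

// deriveMeshID maps a deployment ID into [1, 252]:
// (sha256(id)[:4] as uint32) mod 252 + 1.
// ASSUMPTION: big-endian byte order and hashing the ID's string bytes.
func deriveMeshID(deploymentID string) uint32 {
	sum := sha256.Sum256([]byte(deploymentID))
	return binary.BigEndian.Uint32(sum[:4])%252 + 1
}

func main() {
	fmt.Println(deriveMeshName("t105.omani.works")) // t105-mesh
	id := deriveMeshID("98395b3d9bd9c1aa")
	fmt.Println(id >= 1 && id <= 252) // true
}
```

Capping the primary at 252 leaves headroom for two secondaries
(primary+1, primary+2) without exceeding 254.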

Tested derive helpers against the last 4 prov IDs — all land in valid
range:
  98395b3d9bd9c1aa → 74 (secondaries 75, 76)
  005080699326a7ac → 29 (secondaries 30, 31)
  22af2b1120158239 → 139
  c9df5eed1c1ba6cf → 180

Build + provisioner unit tests green.

* fix(cloudinit): thread cluster.name + cluster.id into pre-Flux cilium-values.yaml

t105.omani.works (a6c0f5dfebd63bd0, 2026-05-15) found that PR #1502's
catalyst-api auto-derive (cluster_mesh_name=t105-mesh, cluster_mesh_id=99)
correctly reached cilium-config — but only AFTER Flux helm-upgraded the
release. The pre-Flux Cilium install (cloud-init line 1473) used
/var/lib/catalyst/cilium-values.yaml which DIDN'T carry cluster.name or
cluster.id, so cilium-agent started with the chart defaults
("default", 0). The Flux upgrade then changed cilium-config but the
already-running cilium-agent kept its in-memory cluster.name="default"
because it reads ConfigMap once at startup.

Downstream consequences observed live on t105:
  hubble-relay CrashLoopBackOff:
    "tls: failed to verify certificate: x509: certificate is valid for
     *.t105-mesh.hubble-grpc.cilium.io, not catalyst-t105-omani-works-cp1
     .default.hubble-grpc.cilium.io"
  clustermesh peer announcements use stale "default" identity →
  cross-region mesh handshakes x509-fail.

Fix: include cluster.name + cluster.id in the pre-Flux helm install's
values file, sourced from the templatefile() vars cluster_mesh_name +
cluster_mesh_id (already threaded per-region by main.tf:381-382 and
:900-901). Now the first cilium-agent process announces with the
correct identity, no helm-upgrade race.

* docs(sandbox): design docs for the Sandbox product

Captures the agreed product shape, end-user journeys (developer +
Sovereign admin), technical architecture (native agent TUI via
xterm.js + WebSocket + PTY, card protocol for mobile, MCP catalogue,
four knowledge layers, JetStream/SSE integration), and the
conversational-provisioning surface that reuses the same shell with a
narrow MCP toolbox as an alternative to the catalyst-ui wizard.

Status: design only — no implementation. Identifies one prerequisite
(long-lived API token carrying org_id claim) with the exact files to
extend in core/services/auth and platform/keycloak.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(sovereign-tls): tls-restart Job needs list+watch on deployments/daemonsets

Caught on prov t110.omani.works (fe09897a1b6b3c1d, 2026-05-15) — the
cilium-envoy-tls-restart Job stuck Running 10m+ with:

  W reflector.go:561] failed to list *unstructured.Unstructured:
    deployments.apps "cilium-operator" is forbidden: User
    "system:serviceaccount:kube-system:cilium-envoy-tls-restart"
    cannot list resource "deployments" in API group "apps" in the
    namespace "kube-system"

The Role grants `get` + `patch` but `kubectl rollout status` (which the
Job runs after `rollout restart`) does NOT just GET — internally it
uses client-go informerwatcher to LIST+WATCH the resource. Without
those verbs the informer fails and `rollout status` hangs until
activeDeadlineSeconds (900s). The Job never restarts cilium-envoy,
console.<fqdn> never serves.

Fix: add `list` + `watch` to both rules (cilium-operator Deployment
+ cilium-envoy DaemonSet). Scoped by resourceNames, so the SA still
can't enumerate or watch other workloads.

* fix(dns): auto-write per-Sovereign A records into parent zone after Phase-0

Caught on prov t110.omani.works (fe09897a1b6b3c1d, 2026-05-15):

  dig +short A console.t110.omani.works @ns1.openova.io
  → 49.12.16.160     ← ORPHAN IP — Hetzner reassigned to a 3rd party

The mothership PowerDNS had ZERO records for t110's hostnames. A stale
wildcard `*.omani.works` (manual leftover from earlier provs) was
returning a wrong IP that no longer belonged to the openova project at
Hetzner — sending operator traffic to an unrelated tenant. The deeper
gap: catalyst-api never auto-wrote the per-Sovereign A records that
browsers need to resolve.

The existing parent-domain flow has:
  pdmCreatePowerDNSZone     — stub at parent_domains.go:1096
  certManagerStep           — stub at parent_domains.go:1141
  commitPDMWithRetry        — runs ONLY for pool-allocated FQDNs
                              (otech<N>.<pool>), NOT BYO

So BYO-style (operator-owned parent like omani.works + arbitrary
Sovereign FQDN like t111.omani.works) left the parent zone untouched.

Fix:

  internal/powerdns/client.go
    + PatchRRSets(ctx, zone, rrsets) — PATCH REPLACE on
      /api/v1/servers/{id}/zones/{zone} with idempotent re-runs

  internal/handler/handler.go
    + powerdnsZoneClient interface gains PatchRRSets — wired
      automatically by SetPowerDNSZoneClient

  internal/handler/sovereign_dns_records.go (new)
    + CanonicalSovereignSubdomains: console / auth / gitea / harbor /
      registry / bao / grafana / hubble / pdns / openova-flow /
      marketplace / api / guacamole
    + upsertSovereignParentZoneRecords: PATCH the parent zone with one
      A record per subdomain → primary LB IP
    + upsertSovereignParentZoneRecordsFromResult: deployment-flow
      wrapper that iterates every parentDomain in the request body

  internal/handler/deployments.go
    + Call upsertSovereignParentZoneRecordsFromResult right after
      commitPDMWithRetry on Phase-0 success — best-effort (log +
      continue), so a PowerDNS hiccup doesn't bail the Sovereign
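
For illustration, the PATCH body for one A record per subdomain could be
built roughly as follows. The struct and function names, the TTL of 300
and the example IP are assumptions; the JSON field names follow the
PowerDNS HTTP API's rrset shape with changetype REPLACE.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// record and rrset model the subset of the PowerDNS rrset JSON used here.
type record struct {
	Content  string `json:"content"`
	Disabled bool   `json:"disabled"`
}

type rrset struct {
	Name       string   `json:"name"`
	Type       string   `json:"type"`
	TTL        int      `json:"ttl"`
	ChangeType string   `json:"changetype"`
	Records    []record `json:"records"`
}

// buildARRSets returns one REPLACE rrset per subdomain, each pointing
// <sub>.<sovereignFQDN>. at lbIP. REPLACE makes re-runs idempotent.
func buildARRSets(sovereignFQDN, lbIP string, subdomains []string, ttl int) []rrset {
	out := make([]rrset, 0, len(subdomains))
	for _, sub := range subdomains {
		out = append(out, rrset{
			Name:       sub + "." + sovereignFQDN + ".", // canonical name, trailing dot
			Type:       "A",
			TTL:        ttl,
			ChangeType: "REPLACE",
			Records:    []record{{Content: lbIP}},
		})
	}
	return out
}

func main() {
	sets := buildARRSets("t110.omani.works", "203.0.113.10", []string{"console", "auth"}, 300)
	body, err := json.Marshal(map[string][]rrset{"rrsets": sets})
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body))
}
```

The resulting body would be sent with PATCH to
/api/v1/servers/{id}/zones/{zone}, where {zone} is the parent zone
(e.g. "omani.works."), not the Sovereign FQDN.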

Operator override via CATALYST_SOVEREIGN_SUBDOMAINS not yet wired —
filed as follow-up. Today the canonical list is the chart-side HTTPRoute
list, kept aligned via the comment in sovereign_dns_records.go.

---------

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-15 21:12:38 +04:00
e3mrah
2c3ea44af8
fix(sovereign-tls): tls-restart Job needs list+watch verbs (#1504)
* fix(canvas): canonicalise Job.DependsOn entries with install- prefix — fix invisible edges

PR #1499 plumbed spec.dependsOn end-to-end and verified deps populate
on first event (no /refresh-watch needed). But the openova-flow snapshot
composer (flow_snapshot_local.go) emits finish-to-start relationships
where fromId = jobs.JobID(deploymentID, dep). Without the "install-"
prefix on each dep entry, fromId came out as:

  <dep>:hel1-2:seaweedfs                 (secondary, missing "install-")
  <dep>:gitea                            (primary, missing "install-")

But the FlowNode ids in the snapshot are:

  <dep>:install-hel1-2:seaweedfs
  <dep>:install-gitea

The FE canvas adapter matches by exact id → every finish-to-start rel
points at a non-existent node → 224 rels emitted, 0 edges rendered.

Caught on prov t103.omani.works (005080699326a7ac, 2026-05-15):

  curl /v1/flows/.../snapshot → 376 rels total: 152 contains, 224 finish-to-start
  every finish-to-start fromId malformed
  canvas: sibling edges invisible across all 135 install Jobs

Fix in two places:

  internal/handler/phase1_watch.go (spawnSecondaryRegionWatchers emit):
    Region-prefix each dep AND inject the "install-" prefix so
    ev.DependsOn = ["install-<region>:<chart>"] before the bridge
    receives the event. Symmetric with how ev.Component is constructed.

  internal/jobs/helmwatch_bridge.go (OnHelmReleaseEvent):
    Canonicalise every dep entry: if it doesn't already start with
    JobNamePrefix ("install-"), prepend it. Idempotent on entries
    that already are canonical (set by the phase1_watch.go path).
    Covers the primary-region path (bare chart names like "gitea")
    too — Job.DependsOn now stores "install-gitea", which matches
    the composer's emitted FromId exactly.

Tests: go build ./... + go test on internal/jobs + helmwatch + provisioner
all green. (Pre-existing TestHandleWhoami_* flake in handler is unrelated.)

* fix(canvas): canonicalise resolved DependsOn too — kill malformed prior values

Follow-up to PR #1500. The canon block ran on the event-carried dependsOn
arg, but the 3-tier resolve preferred existing-store value when non-empty
— which for any Job written BEFORE PR #1500 rolled out was malformed
(no "install-" prefix). t103.omani.works snapshot kept emitting 224
finish-to-start rels with malformed fromIds because the existing Job
rows held "hel1-2:gitea" entries that the resolve preserved verbatim.

Fix: after the 3-tier resolve, run a final canonicalisation pass on
resolvedDeps so every persisted entry is canonical regardless of
whether it came from event-carried (already canon by my prior block)
or from existing-store (potentially malformed legacy).

Note: this fix only takes effect on the NEXT HR state transition for a
given Job. HRs already in terminal state (e.g. t103's 135 succeeded HRs)
will keep their malformed deps until a new event fires. The loop's next
cycle (t104+) writes canonical from event 1.

* fix(provisioner): auto-derive cluster_mesh_name + cluster_mesh_id for multi-region provs

Caught on prov t104.omani.works (98395b3d9bd9c1aa, 2026-05-15): operator
submitted a multi-region body (3 regions cpx52) but omitted
ClusterMeshName/ClusterMeshID. catalyst-api defaulted them to "" and 0.
Tofu wrote cluster_mesh_name="" + cluster_mesh_id=0 to tfvars. Flux
postBuild.substitute rendered cilium-config with cluster.name=default +
cluster.id=0. Cilium kvstoremesh refused to start:
  "ClusterID 0 is reserved"
clustermesh-apiserver CrashLoopBackOff 16 restarts. No mesh ever formed.
Cross-region observability + east-west routing permanently broken.

Auto-derivation:

  ClusterMeshName: <first-fqdn-label>-mesh
    e.g. t105.omani.works → "t105-mesh"

  ClusterMeshID:  (sha256(deploymentID)[:4] as uint32) mod 252 + 1
    Range [1, 252]; main.tf increments for secondaries so the max id
    any region sees is primary + (regions - 1) ≤ 254. ID 255 is
    intentionally avoided (Cilium sentinel).

Operator override still respected — auto-derive only kicks in when
both fields are zero/empty AND len(Regions) > 1. Single-region provs
stay at "" / 0 (no mesh needed).

Tested derive helpers against the last 4 prov IDs — all land in valid
range:
  98395b3d9bd9c1aa → 74 (secondaries 75, 76)
  005080699326a7ac → 29 (secondaries 30, 31)
  22af2b1120158239 → 139
  c9df5eed1c1ba6cf → 180

Build + provisioner unit tests green.

* fix(cloudinit): thread cluster.name + cluster.id into pre-Flux cilium-values.yaml

t105.omani.works (a6c0f5dfebd63bd0, 2026-05-15) found that PR #1502's
catalyst-api auto-derive (cluster_mesh_name=t105-mesh, cluster_mesh_id=99)
correctly reached cilium-config — but only AFTER Flux helm-upgraded the
release. The pre-Flux Cilium install (cloud-init line 1473) used
/var/lib/catalyst/cilium-values.yaml which DIDN'T carry cluster.name or
cluster.id, so cilium-agent started with the chart defaults
("default", 0). The Flux upgrade then changed cilium-config but the
already-running cilium-agent kept its in-memory cluster.name="default"
because it reads ConfigMap once at startup.

Downstream consequences observed live on t105:
  hubble-relay CrashLoopBackOff:
    "tls: failed to verify certificate: x509: certificate is valid for
     *.t105-mesh.hubble-grpc.cilium.io, not catalyst-t105-omani-works-cp1
     .default.hubble-grpc.cilium.io"
  clustermesh peer announcements use stale "default" identity →
  cross-region mesh handshakes x509-fail.

Fix: include cluster.name + cluster.id in the pre-Flux helm install's
values file, sourced from the templatefile() vars cluster_mesh_name +
cluster_mesh_id (already threaded per-region by main.tf:381-382 and
:900-901). Now the first cilium-agent process announces with the
correct identity, no helm-upgrade race.

* docs(sandbox): design docs for the Sandbox product

Captures the agreed product shape, end-user journeys (developer +
Sovereign admin), technical architecture (native agent TUI via
xterm.js + WebSocket + PTY, card protocol for mobile, MCP catalogue,
four knowledge layers, JetStream/SSE integration), and the
conversational-provisioning surface that reuses the same shell with a
narrow MCP toolbox as an alternative to the catalyst-ui wizard.

Status: design only — no implementation. Identifies one prerequisite
(long-lived API token carrying org_id claim) with the exact files to
extend in core/services/auth and platform/keycloak.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(sovereign-tls): tls-restart Job needs list+watch on deployments/daemonsets

Caught on prov t110.omani.works (fe09897a1b6b3c1d, 2026-05-15) — the
cilium-envoy-tls-restart Job stuck Running 10m+ with:

  W reflector.go:561] failed to list *unstructured.Unstructured:
    deployments.apps "cilium-operator" is forbidden: User
    "system:serviceaccount:kube-system:cilium-envoy-tls-restart"
    cannot list resource "deployments" in API group "apps" in the
    namespace "kube-system"

The Role grants `get` + `patch` but `kubectl rollout status` (which the
Job runs after `rollout restart`) does NOT just GET — internally it
uses client-go informerwatcher to LIST+WATCH the resource. Without
those verbs the informer fails and `rollout status` hangs until
activeDeadlineSeconds (900s). The Job never restarts cilium-envoy,
console.<fqdn> never serves.

Fix: add `list` + `watch` to both rules (cilium-operator Deployment
+ cilium-envoy DaemonSet). Scoped by resourceNames, so the SA still
can't enumerate or watch other workloads.

---------

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-15 21:02:37 +04:00
github-actions[bot]
fc7bbc8711 deploy: update catalyst images to 3a19bb1 2026-05-15 15:51:00 +00:00
github-actions[bot]
51a9f7b1b5 deploy: update catalyst images to 4465cd0 2026-05-15 15:15:38 +00:00
e3mrah
4465cd0d27
fix(provisioner): auto-derive cluster_mesh_name + cluster_mesh_id for multi-region provs (#1502)
* fix(canvas): canonicalise Job.DependsOn entries with install- prefix — fix invisible edges

PR #1499 plumbed spec.dependsOn end-to-end and verified deps populate
on first event (no /refresh-watch needed). But the openova-flow snapshot
composer (flow_snapshot_local.go) emits finish-to-start relationships
where fromId = jobs.JobID(deploymentID, dep). Without the "install-"
prefix on each dep entry, fromId came out as:

  <dep>:hel1-2:seaweedfs                 (secondary, missing "install-")
  <dep>:gitea                            (primary, missing "install-")

But the FlowNode ids in the snapshot are:

  <dep>:install-hel1-2:seaweedfs
  <dep>:install-gitea

The FE canvas adapter matches by exact id → every finish-to-start rel
points at a non-existent node → 224 rels emitted, 0 edges rendered.

Caught on prov t103.omani.works (005080699326a7ac, 2026-05-15):

  curl /v1/flows/.../snapshot → 376 rels total: 152 contains, 224 finish-to-start
  every finish-to-start fromId malformed
  canvas: sibling edges invisible across all 135 install Jobs

Fix in two places:

  internal/handler/phase1_watch.go (spawnSecondaryRegionWatchers emit):
    Region-prefix each dep AND inject the "install-" prefix so
    ev.DependsOn = ["install-<region>:<chart>"] before the bridge
    receives the event. Symmetric with how ev.Component is constructed.

  internal/jobs/helmwatch_bridge.go (OnHelmReleaseEvent):
    Canonicalise every dep entry: if it doesn't already start with
    JobNamePrefix ("install-"), prepend it. Idempotent on entries
    that already are canonical (set by the phase1_watch.go path).
    Covers the primary-region path (bare chart names like "gitea")
    too — Job.DependsOn now stores "install-gitea", which matches
    the composer's emitted FromId exactly.

Tests: go build ./... + go test on internal/jobs + helmwatch + provisioner
all green. (Pre-existing TestHandleWhoami_* flake in handler is unrelated.)

* fix(canvas): canonicalise resolved DependsOn too — kill malformed prior values

Follow-up to PR #1500. The canon block ran on the event-carried dependsOn
arg, but the 3-tier resolve preferred existing-store value when non-empty
— which for any Job written BEFORE PR #1500 rolled out was malformed
(no "install-" prefix). t103.omani.works snapshot kept emitting 224
finish-to-start rels with malformed fromIds because the existing Job
rows held "hel1-2:gitea" entries that the resolve preserved verbatim.

Fix: after the 3-tier resolve, run a final canonicalisation pass on
resolvedDeps so every persisted entry is canonical regardless of
whether it came from the event-carried value (already canonical via the
earlier block) or from the existing store (potentially malformed legacy values).

Note: this fix only takes effect on the NEXT HR state transition for a
given Job. HRs already in terminal state (e.g. t103's 135 succeeded HRs)
will keep their malformed deps until a new event fires. The loop's next
cycle (t104+) writes canonical from event 1.

* fix(provisioner): auto-derive cluster_mesh_name + cluster_mesh_id for multi-region provs

Caught on prov t104.omani.works (98395b3d9bd9c1aa, 2026-05-15): operator
submitted a multi-region body (3 regions cpx52) but omitted
ClusterMeshName/ClusterMeshID. catalyst-api defaulted them to "" and 0.
Tofu wrote cluster_mesh_name="" + cluster_mesh_id=0 to tfvars. Flux
postBuild.substitute rendered cilium-config with cluster.name=default +
cluster.id=0. Cilium kvstoremesh refused to start:
  "ClusterID 0 is reserved"
clustermesh-apiserver CrashLoopBackOff 16 restarts. No mesh ever formed.
Cross-region observability + east-west routing permanently broken.

Auto-derivation:

  ClusterMeshName: <first-fqdn-label>-mesh
    e.g. t105.omani.works → "t105-mesh"

  ClusterMeshID:  (sha256(deploymentID)[:4] as uint32) mod 252 + 1
    Range [1, 252]; main.tf increments for secondaries so the max id
    any region sees is primary + (regions - 1) ≤ 254. ID 255 is
    intentionally avoided (Cilium sentinel).

Operator override still respected — auto-derive only kicks in when
both fields are zero/empty AND len(Regions) > 1. Single-region provs
stay at "" / 0 (no mesh needed).
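
A minimal sketch of the two derive helpers described above. The function
names, the big-endian read of the first four digest bytes, and hashing the
deployment ID as a plain string are assumptions made for illustration; the
commit message does not state them.

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
	"strings"
)

// deriveMeshName returns "<first-fqdn-label>-mesh" (hypothetical helper name).
func deriveMeshName(fqdn string) string {
	label, _, _ := strings.Cut(fqdn, ".")
	return label + "-mesh"
}

// deriveMeshID maps a deployment ID into the range [1, 252].
// Assumption: the first four SHA-256 bytes are read as a big-endian uint32.
func deriveMeshID(deploymentID string) uint32 {
	sum := sha256.Sum256([]byte(deploymentID))
	return binary.BigEndian.Uint32(sum[:4])%252 + 1
}

func main() {
	fmt.Println(deriveMeshName("t105.omani.works")) // t105-mesh
	fmt.Println(deriveMeshID("98395b3d9bd9c1aa"))   // some value in [1, 252]
}
```

Because the result is capped at 252, a primary plus two secondaries can never
exceed 254, which keeps every region clear of both 0 and 255.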

Tested derive helpers against the last 4 prov IDs — all land in valid
range:
  98395b3d9bd9c1aa → 74 (secondaries 75, 76)
  005080699326a7ac → 29 (secondaries 30, 31)
  22af2b1120158239 → 139
  c9df5eed1c1ba6cf → 180

Build + provisioner unit tests green.

---------

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
2026-05-15 19:13:35 +04:00
github-actions[bot]
aa8c6dc391 deploy: update catalyst images to 49ae2a7 2026-05-15 13:26:36 +00:00
e3mrah
49ae2a7cab
fix(canvas): canonicalise resolved DependsOn too — kill malformed prior values (#1501)
* fix(canvas): canonicalise Job.DependsOn entries with install- prefix — fix invisible edges

PR #1499 plumbed spec.dependsOn end-to-end and verified deps populate
on first event (no /refresh-watch needed). But the openova-flow snapshot
composer (flow_snapshot_local.go) emits finish-to-start relationships
where fromId = jobs.JobID(deploymentID, dep). Without the "install-"
prefix on each dep entry, fromId came out as:

  <dep>:hel1-2:seaweedfs                 (secondary, missing "install-")
  <dep>:gitea                            (primary, missing "install-")

But the FlowNode ids in the snapshot are:

  <dep>:install-hel1-2:seaweedfs
  <dep>:install-gitea

The FE canvas adapter matches by exact id → every finish-to-start rel
points at a non-existent node → 224 rels emitted, 0 edges rendered.

Caught on prov t103.omani.works (005080699326a7ac, 2026-05-15):

  curl /v1/flows/.../snapshot → 376 rels total: 152 contains, 224 finish-to-start
  every finish-to-start fromId malformed
  canvas: sibling edges invisible across all 135 install Jobs

Fix in two places:

  internal/handler/phase1_watch.go (spawnSecondaryRegionWatchers emit):
    Region-prefix each dep AND inject the "install-" prefix so
    ev.DependsOn = ["install-<region>:<chart>"] before the bridge
    receives the event. Symmetric with how ev.Component is constructed.

  internal/jobs/helmwatch_bridge.go (OnHelmReleaseEvent):
    Canonicalise every dep entry: if it doesn't already start with
    JobNamePrefix ("install-"), prepend it. Idempotent on entries
    that already are canonical (set by the phase1_watch.go path).
    Covers the primary-region path (bare chart names like "gitea")
    too — Job.DependsOn now stores "install-gitea", which matches
    the composer's emitted FromId exactly.

Tests: go build ./... + go test on internal/jobs + helmwatch + provisioner
all green. (Pre-existing TestHandleWhoami_* flake in handler is unrelated.)

* fix(canvas): canonicalise resolved DependsOn too — kill malformed prior values

Follow-up to PR #1500. The canon block ran on the event-carried dependsOn
arg, but the 3-tier resolve preferred existing-store value when non-empty
— which for any Job written BEFORE PR #1500 rolled out was malformed
(no "install-" prefix). t103.omani.works snapshot kept emitting 224
finish-to-start rels with malformed fromIds because the existing Job
rows held "hel1-2:gitea" entries that the resolve preserved verbatim.

Fix: after the 3-tier resolve, run a final canonicalisation pass on
resolvedDeps so every persisted entry is canonical regardless of
whether it came from the event-carried value (already canonical via the
earlier block) or from the existing store (potentially malformed legacy values).

Note: this fix only takes effect on the NEXT HR state transition for a
given Job. HRs already in terminal state (e.g. t103's 135 succeeded HRs)
will keep their malformed deps until a new event fires. The loop's next
cycle (t104+) writes canonical from event 1.

---------

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
2026-05-15 17:24:33 +04:00
github-actions[bot]
1f07721204 deploy: update catalyst images to 80fdbcd 2026-05-15 13:20:49 +00:00
e3mrah
80fdbcd8e1
fix(canvas): canonicalise Job.DependsOn entries with install- prefix — fix invisible edges (#1500)
PR #1499 plumbed spec.dependsOn end-to-end and verified deps populate
on first event (no /refresh-watch needed). But the openova-flow snapshot
composer (flow_snapshot_local.go) emits finish-to-start relationships
where fromId = jobs.JobID(deploymentID, dep). Without the "install-"
prefix on each dep entry, fromId came out as:

  <dep>:hel1-2:seaweedfs                 (secondary, missing "install-")
  <dep>:gitea                            (primary, missing "install-")

But the FlowNode ids in the snapshot are:

  <dep>:install-hel1-2:seaweedfs
  <dep>:install-gitea

The FE canvas adapter matches by exact id → every finish-to-start rel
points at a non-existent node → 224 rels emitted, 0 edges rendered.

Caught on prov t103.omani.works (005080699326a7ac, 2026-05-15):

  curl /v1/flows/.../snapshot → 376 rels total: 152 contains, 224 finish-to-start
  every finish-to-start fromId malformed
  canvas: sibling edges invisible across all 135 install Jobs

Fix in two places:

  internal/handler/phase1_watch.go (spawnSecondaryRegionWatchers emit):
    Region-prefix each dep AND inject the "install-" prefix so
    ev.DependsOn = ["install-<region>:<chart>"] before the bridge
    receives the event. Symmetric with how ev.Component is constructed.

  internal/jobs/helmwatch_bridge.go (OnHelmReleaseEvent):
    Canonicalise every dep entry: if it doesn't already start with
    JobNamePrefix ("install-"), prepend it. Idempotent on entries
    that already are canonical (set by the phase1_watch.go path).
    Covers the primary-region path (bare chart names like "gitea")
    too — Job.DependsOn now stores "install-gitea", which matches
    the composer's emitted FromId exactly.
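
The canonicalisation step can be sketched roughly as follows. The constant
and function names here are illustrative stand-ins, not the actual
identifiers in helmwatch_bridge.go.

```go
package main

import (
	"fmt"
	"strings"
)

// jobNamePrefix mirrors the "install-" prefix named in the commit message.
const jobNamePrefix = "install-"

// canonicaliseDeps (hypothetical name) prepends the prefix to every entry
// that lacks it. Entries that are already canonical pass through unchanged,
// so running it twice gives the same result as running it once.
func canonicaliseDeps(deps []string) []string {
	out := make([]string, 0, len(deps))
	for _, d := range deps {
		if !strings.HasPrefix(d, jobNamePrefix) {
			d = jobNamePrefix + d
		}
		out = append(out, d)
	}
	return out
}

func main() {
	deps := []string{"gitea", "hel1-2:seaweedfs", "install-gitea"}
	fmt.Println(canonicaliseDeps(deps))
	// [install-gitea install-hel1-2:seaweedfs install-gitea]
}
```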

Tests: go build ./... + go test on internal/jobs + helmwatch + provisioner
all green. (Pre-existing TestHandleWhoami_* flake in handler is unrelated.)

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
2026-05-15 17:18:40 +04:00
github-actions[bot]
5b2c8b79a8 deploy: update catalyst images to 1cd6c3f 2026-05-15 12:41:58 +00:00
e3mrah
1cd6c3f432
fix(canvas): plumb HR spec.dependsOn through every event — kill the seed-timing race (#1499)
* fix(pdm/dynadot): auto-register NS glue records before set_ns

Dynadot rejects set_ns when any NS hostname is not yet registered
as a glue record in the customer's account. The 31-line code comment
above SetNameservers documents this requirement but the implementation
never landed at the adapter layer — only the per-request handler-side
glueIP path (BYO Flow B, issue #900) registered glue, leaving the
mothership parent-domain onboard flow exposed.

Live blocker on 2026-05-15: founder attempted zero-touch onboard of
fresh parent domain omani.homes; the flow stalled because
ns3.openova.io had never been registered as a Dynadot glue record on
this account (ns1/ns2 had been registered long ago when openova.io
itself was onboarded). Failure surface:
  "'ns3.openova.io' needs to be registered with an ip address before
   it can be used."
Required out-of-band manual API calls to unblock, defeating the
zero-touch property the architecture is supposed to deliver.

Fix (adapter layer, no per-request flag, always-on when configured):
- Adapter gains NSGlueIP field; SetNameservers iterates every NS
  hostname BEFORE set_ns, skips in-bailiwick children of the domain
  being set, calls RegisterGlueRecord(host, NSGlueIP) for the rest.
- RegisterGlueRecord (already idempotent per issue #900) short-
  circuits via get_ns on identical IP, falls through to set_ns_ip
  on a stale IP, and runs register_ns when the host is missing — so
  a SetNameservers retry costs only get_ns probes, not extra writes.
- A typed registrar error inside the register loop returns
  immediately without calling set_ns (fail-fast contract).
- POOL_DOMAIN_MANAGER_NS_GLUE_IP env var (canonical operator-config
  pattern in this repo) threaded through cmd/pdm/main.go onto the
  Dynadot adapter at PDM startup. Empty value preserves prior
  pass-through behaviour, keeping BYO Flow B handler-level glue
  authoritative for per-request Sovereign add-domain calls.
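
A rough sketch of the in-bailiwick filter that decides which NS hostnames
need a glue registration before set_ns. All names are hypothetical, and
treating only strict children of the domain as in-bailiwick is an assumption
based on the wording above.

```go
package main

import (
	"fmt"
	"strings"
)

// normaliseHost lowercases the name and drops one trailing dot.
func normaliseHost(h string) string {
	return strings.TrimSuffix(strings.ToLower(strings.TrimSpace(h)), ".")
}

// isInBailiwick reports whether host is a child of domain.
// Assumption: the domain itself does not count as its own child.
func isInBailiwick(host, domain string) bool {
	h, d := normaliseHost(host), normaliseHost(domain)
	return d != "" && strings.HasSuffix(h, "."+d)
}

// hostsNeedingGlue returns the NS hostnames that would be passed to the
// glue-registration step, skipping in-bailiwick children.
func hostsNeedingGlue(domain string, nameservers []string) []string {
	var out []string
	for _, ns := range nameservers {
		if !isInBailiwick(ns, domain) {
			out = append(out, normaliseHost(ns))
		}
	}
	return out
}

func main() {
	ns := []string{"ns1.openova.io", "NS1.Omani.Homes.", "ns3.openova.io"}
	fmt.Println(hostsNeedingGlue("omani.homes", ns))
	// [ns1.openova.io ns3.openova.io]
}
```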

Tests (httptest server, 7 new cases) cover:
  - AllFresh: 3 NS hostnames, all unregistered → 3× (get_ns+register_ns)
    + set_ns (7 API calls, in order).
  - OneAlreadyRegistered: middle NS short-circuits via get_ns,
    others register, set_ns runs.
  - RegisterFails_SetNsNotCalled: 429 mid-register surfaces
    ErrRateLimited unwrapped; set_ns must NOT execute.
  - SetNsFailsAfterRegister: pre-register completes, set_ns
    returns Dynadot error; ErrDomainNotInAccount surfaces.
  - SkipsInBailiwick: in-bailiwick NS hostname (child of domain
    being set) is skipped entirely (no get_ns, no register_ns).
  - DisabledWhenNSGlueIPEmpty: backward-compat — bare SetNameservers
    issues exactly one set_ns call when env var unset.
  - IsInBailiwickHost: case- and trailing-dot-tolerant table test.

go build ./... and go test ./... both green across the entire
core/pool-domain-manager module.

* fix(canvas): skip TLS verify on Sovereign k3s self-signed CA — restore sibling deps

PR #1431 (derive HR dependsOn from live watcher) and PR #1470 (persist
DependsOn on every event) both addressed symptoms at the
persistence/event layer. The root cause was deeper: the bridge's
reflector x509-fails against the Sovereign apiserver's self-signed
k3s CA on every fresh multi-region prov, so SeedJobsFromInformerList
never runs and there's no DependsOn to persist in the first place.

Live blocker on omani.homes prov fc0855a25c24511c (2026-05-15): all
3 region kubeconfigs at /var/lib/catalyst/kubeconfigs/ have valid
CA-data (openssl s_client verifies cleanly), but the reflector caches
a poisoned TLS state from before the kubeconfig was finalized. Result:
all 142 jobs return dependsOn: [], FlowCanvasOrganic renders 45 sibling
HRs with edges only to the parent, no inter-sibling edges. The
"sibling wiring lost" symptom returns on every fresh provision.

Fix:

  helmwatch/kubeconfig.go: restConfigFromKubeconfig now sets
    TLSClientConfig.Insecure = true and clears CAData/CAFile.
    The reflector still authenticates via the bearer token from
    the kubeconfig, the connection is over public Hetzner LB which
    terminates HTTPS, and TLS verify is only skipped for mothership
    informers reading Sovereign HR/source/kustomization state.

  k8scache/factory.go: same skip on the CloudPage resource-explorer
    informer (AddCluster path). Same x509 failure mode without it.

This makes the previous three fixes' guarantees actually hold: the
seed runs, the cache populates, every event preserves real DependsOn,
and the API returns sibling-to-sibling dependency edges for the
canvas to render.

Tests:
  go test ./internal/helmwatch/... ./internal/k8scache/...
  All green. No test required CAData verification to pass.

* fix(sovereign-tls): escape $ in tls-restart Job so Flux doesn't eat the bash vars

Root cause caught on prov t101.omani.works (c9df5eed1c1ba6cf, 2026-05-15):

The cilium-envoy-tls-restart Job's shell command uses bash variables
${SECRET_NS}, ${SECRET_NAME}, ${DS_NS}, ${DS_NAME}, ${tls_crt}, ${i}.
Flux's postBuild.substitute processes ${...} in the YAML BEFORE the
Job manifest lands in the cluster, and replaces every $-reference that
isn't in the Kustomization's substituteFrom map with an empty string.

Result on prov t101 (T+13m, mothership flipped status=ready):

  Job logs: "[tls-restart] waiting for / with non-empty tls.crt"
                                      ^^^ — namespace and name both empty

  Command becomes: `kubectl get secret -n "" "" --ignore-not-found ...`
  → polls a nonexistent secret forever
  → cilium-operator never gets the rollout-restart
  → CiliumEnvoyConfig's additionalAddresses.socketAddress: 0.0.0.0:30443
    bind never lands
  → cilium-envoy host:30443 stays unbound
  → Hetzner LB targets stay unhealthy on 30080/30443
  → console.<fqdn> serves HTTP 000 indefinitely
  → mothership's "Handover gate" timeout fires AT THE WRONG TIME — flips
    deployment status=ready before TLS is actually serving

The "Sovereign was up at t101" reading we saw briefly was a transient
TRAEFIK fallback cert from upstream during cert-issuance, NOT the
Sovereign envoy.

Fix: escape every bash variable reference inside the script as $$VAR so
Flux postBuild.substitute emits a literal $VAR which bash then evaluates
correctly at Job runtime. SOVEREIGN_FQDN in YAML labels stays as
${SOVEREIGN_FQDN} because that IS a Flux substitute (kept intentionally).

This is the third recurrence of "sibling deps lost / cilium-envoy host
bind missing / fresh prov console=000" on the same code path:
  PR #1431 — derive HR dependsOn from live watcher
  PR #1470 — persist DependsOn on every event
  PR #1494 — restart cilium-operator BEFORE cilium-envoy on first install
  PR #1497 — skip TLS verify on Sovereign k3s self-signed CA
  THIS  — escape $VAR in Job command so Flux doesn't blank them

Each prior PR fixed a layer above the Job's own correctness. The Job
itself was always broken on fresh provs since the cilium-operator
restart line was added.

* fix(canvas): plumb HR spec.dependsOn through every event — kill the seed-timing race

Real architectural fix for the recurring "sibling deps lost on every fresh
provision" regression. PR #1431, PR #1470, PR #1497 each patched a layer
above the actual gap: the per-event emit path at helmwatch.go:1525 had
the unstructured HelmRelease in scope but THREW AWAY spec.dependsOn before
emitting the provisioner.Event. The bridge then wrote Job.DependsOn=[]
on every event, relying on a pre-existing seed having populated deps —
which never happened on fresh provs because the watcher's initial-list
sync (T+2m, right after tofu) fires with 0 HRs (Flux hasn't installed
anything yet).

The fix walks the data end-to-end:

  provisioner.Event   gains DependsOn []string
  helmwatch.processEvent  populates DependsOn: extractDependsOn(u) on
                          every PhaseComponent emit (the unstructured
                          HelmRelease was already in scope, just being
                          dropped at the event boundary)
  spawnSecondaryRegionWatchers  region-prefixes each entry so secondary
                                Jobs (install-<region>:<chart>) wire to
                                intra-region siblings, not bare primary
                                names
  Bridge.OnProvisionerEvent  passes ev.DependsOn to OnHelmReleaseEvent
  Bridge.OnHelmReleaseEvent  new dependsOn []string parameter; resolves
                             with 3-tier preference:
                               prior store value  >
                               event-carried (live HR spec.dependsOn) >
                               empty.
                             The prior-store branch keeps PR #1470's
                             pod-restart preservation; the event-carried
                             branch closes the fresh-prov gap.
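
The 3-tier preference can be sketched as below. The function name and the
plain slice parameters are assumptions for illustration; the real
OnHelmReleaseEvent signature carries more than this.

```go
package main

import "fmt"

// resolveDeps (hypothetical name) applies the preference order from the
// commit message: prior store value, then event-carried value, then empty.
func resolveDeps(prior, eventCarried []string) []string {
	if len(prior) > 0 {
		return prior // keeps deps across pod restarts
	}
	if len(eventCarried) > 0 {
		return eventCarried // fresh prov: taken from the live HR spec.dependsOn
	}
	return []string{}
}

func main() {
	fmt.Println(resolveDeps(nil, []string{"gitea"}))                 // [gitea]
	fmt.Println(resolveDeps([]string{"cilium"}, []string{"gitea"})) // [cilium]
	fmt.Println(resolveDeps(nil, nil))                               // []
}
```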

No timing race, no re-seed band-aid, no /refresh-watch dependency. Every
HR transition observed by the watcher carries the live spec.dependsOn
through to the Job row — exactly the architecture that ComponentSnapshot
already documents at helmwatch.go:679-689 but the event path had
silently dropped.

Caught on prov t102.omani.works (22af2b1120158239, 2026-05-15) — all
hel1-2 HRs showed Deps:— in the JobsTable despite the bridge being
healthy (verified: x509 errors=0 post PR #1497, kubeconfigs present at
mtime T+2m, OnInitialListSynced fired).

Prior recurrences (each patched a layer above the actual gap):
  PR #1431 (2026-05-11) — derive HR dependsOn from live watcher (seed path)
  PR #1470 (2026-05-14) — persist DependsOn on every event (preserve prior)
  PR #1497 (2026-05-15) — skip TLS verify on Sovereign k3s self-signed CA
  PR #1498 (2026-05-15) — escape $ in tls-restart Job so Flux doesn't blank vars
  THIS  (2026-05-15) — actually plumb spec.dependsOn through the Event

Tests:
  go test ./internal/jobs/... ./internal/helmwatch/... ./internal/provisioner/...
  All green. 9 OnHelmReleaseEvent callsites updated for the new signature.

---------

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
2026-05-15 16:39:52 +04:00
github-actions[bot]
fdbd47a5a8 deploy: update catalyst images to da63b45 2026-05-15 10:48:25 +00:00
e3mrah
da63b45b53
fix(canvas): skip TLS verify on Sovereign k3s self-signed CA — restore sibling deps (#1497)
* fix(pdm/dynadot): auto-register NS glue records before set_ns

Dynadot rejects set_ns when any NS hostname is not yet registered
as a glue record in the customer's account. The 31-line code comment
above SetNameservers documents this requirement but the implementation
never landed at the adapter layer — only the per-request handler-side
glueIP path (BYO Flow B, issue #900) registered glue, leaving the
mothership parent-domain onboard flow exposed.

Live blocker on 2026-05-15: founder attempted zero-touch onboard of
fresh parent domain omani.homes; the flow stalled because
ns3.openova.io had never been registered as a Dynadot glue record on
this account (ns1/ns2 had been registered long ago when openova.io
itself was onboarded). Failure surface:
  "'ns3.openova.io' needs to be registered with an ip address before
   it can be used."
Required out-of-band manual API calls to unblock, defeating the
zero-touch property the architecture is supposed to deliver.

Fix (adapter layer, no per-request flag, always-on when configured):
- Adapter gains NSGlueIP field; SetNameservers iterates every NS
  hostname BEFORE set_ns, skips in-bailiwick children of the domain
  being set, calls RegisterGlueRecord(host, NSGlueIP) for the rest.
- RegisterGlueRecord (already idempotent per issue #900) short-
  circuits via get_ns on identical IP, falls through to set_ns_ip
  on a stale IP, and runs register_ns when the host is missing — so
  a SetNameservers retry costs only get_ns probes, not extra writes.
- A typed registrar error inside the register loop returns
  immediately without calling set_ns (fail-fast contract).
- POOL_DOMAIN_MANAGER_NS_GLUE_IP env var (canonical operator-config
  pattern in this repo) threaded through cmd/pdm/main.go onto the
  Dynadot adapter at PDM startup. Empty value preserves prior
  pass-through behaviour, keeping BYO Flow B handler-level glue
  authoritative for per-request Sovereign add-domain calls.

Tests (httptest server, 7 new cases) cover:
  - AllFresh: 3 NS hostnames, all unregistered → 3× (get_ns+register_ns)
    + set_ns (7 API calls, in order).
  - OneAlreadyRegistered: middle NS short-circuits via get_ns,
    others register, set_ns runs.
  - RegisterFails_SetNsNotCalled: 429 mid-register surfaces
    ErrRateLimited unwrapped; set_ns must NOT execute.
  - SetNsFailsAfterRegister: pre-register completes, set_ns
    returns Dynadot error; ErrDomainNotInAccount surfaces.
  - SkipsInBailiwick: in-bailiwick NS hostname (child of domain
    being set) is skipped entirely (no get_ns, no register_ns).
  - DisabledWhenNSGlueIPEmpty: backward-compat — bare SetNameservers
    issues exactly one set_ns call when env var unset.
  - IsInBailiwickHost: case- and trailing-dot-tolerant table test.

go build ./... and go test ./... both green across the entire
core/pool-domain-manager module.

* fix(canvas): skip TLS verify on Sovereign k3s self-signed CA — restore sibling deps

PR #1431 (derive HR dependsOn from live watcher) and PR #1470 (persist
DependsOn on every event) both addressed symptoms at the
persistence/event layer. The root cause was deeper: the bridge's
reflector x509-fails against the Sovereign apiserver's self-signed
k3s CA on every fresh multi-region prov, so SeedJobsFromInformerList
never runs and there's no DependsOn to persist in the first place.

Live blocker on omani.homes prov fc0855a25c24511c (2026-05-15): all
3 region kubeconfigs at /var/lib/catalyst/kubeconfigs/ have valid
CA-data (openssl s_client verifies cleanly), but the reflector caches
a poisoned TLS state from before the kubeconfig was finalized. Result:
all 142 jobs return dependsOn: [], FlowCanvasOrganic renders 45 sibling
HRs with edges only to the parent, no inter-sibling edges. The
"sibling wiring lost" symptom returns on every fresh provision.

Fix:

  helmwatch/kubeconfig.go: restConfigFromKubeconfig now sets
    TLSClientConfig.Insecure = true and clears CAData/CAFile.
    The reflector still authenticates via the bearer token from
    the kubeconfig, the connection is over public Hetzner LB which
    terminates HTTPS, and TLS verify is only skipped for mothership
    informers reading Sovereign HR/source/kustomization state.

  k8scache/factory.go: same skip on the CloudPage resource-explorer
    informer (AddCluster path). Same x509 failure mode without it.

This makes the previous three fixes' guarantees actually hold: the
seed runs, the cache populates, every event preserves real DependsOn,
and the API returns sibling-to-sibling dependency edges for the
canvas to render.

Tests:
  go test ./internal/helmwatch/... ./internal/k8scache/...
  All green. No test required CAData verification to pass.

---------

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
2026-05-15 14:46:21 +04:00
github-actions[bot]
558d3b7095 deploy: update catalyst images to 1dc21bf 2026-05-14 18:54:01 +00:00
github-actions[bot]
11c9a1bb83 deploy: update catalyst images to 96fc3bf 2026-05-14 18:04:21 +00:00
e3mrah
96fc3bfc76
fix(routes): preserve /sovereign basepath on canonicalisation hard-nav + normalize PIN-login next (#1488)
Two related basepath-stripping bugs in hard-navigation paths:

A. router.tsx rootBeforeLoad canonicalisePath
   TanStack Router passes POST-basepath `location.pathname` (e.g. on
   contabo a visit to `/sovereign/provision/$id/jobs/install-X%3AY`
   arrives as `/provision/$id/jobs/install-X%3AY`). canonicalisePath
   lowercases the path, so `%3A` → `%3a` and the comparison triggers
   a hard-nav. But `window.location.replace(canonical)` operates on
   the FULL URL — the bare `/provision/...` target bypasses the SPA
   mount point and nginx 404s before the SPA loads. Same root cause
   as #1486, different hard-nav site.

B. VerifyPinPage hard-nav post-PIN
   The `next` query param arrives in two forms depending on which
   redirectToLogin variant produced it: SovereignConsoleLayout.tsx:91
   uses `window.location.pathname` (INCLUDES basepath) while :178
   uses currentPathRelativeToBasepath (STRIPS basepath). #1486
   unconditionally re-prefixed which double-prefixed the first form.
   Normalize to "post-basepath" form first, then re-prefix exactly
   once.

Fix shape: every window.location.{replace,assign} that operates on a
URL derived from router-internal data MUST re-add basepath. The router-
based `<Link to>` / `navigate({to})` paths are unaffected because
TanStack Router auto-prefixes those.

Caught live on prov #82 + #84 (omani.works, 2026-05-14): the canvas
row-click + PIN-login + canonicalise paths each generated bare
`/provision/...` URLs that hit nginx's 404 page.

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 22:02:20 +04:00
github-actions[bot]
8c61db0d02 deploy: update catalyst images to a25fd33 2026-05-14 17:19:41 +00:00
e3mrah
a25fd33dea
fix(provisioner): key tofu workdir by DeploymentID, not FQDN (eliminate reprov tfstate carryover) (#1487)
Root cause for the prov #82 → #83 → #84 cascade on omani.works:

The per-prov tofu workdir was keyed by `strings.ReplaceAll(FQDN, ".", "-")`,
so every reprovision of the SAME SovereignFQDN reused the SAME directory.
When prov #82's force-wipe failed `tofu destroy` (the workdir held a tftpl
from before #1485's WILDCARD_CERT_ISSUER escape fix), the Hetzner-purge
fallback cleaned the cloud but the tfstate stayed dirty. Prov #83 then
inherited tfstate that referenced destroyed-via-Hetzner-purge resources
and `tofu apply` failed with "Saved plan is stale" / "resource already
exists".

The kubeconfig path was ALREADY keyed by DeploymentID; the tofu workdir
was the outlier. Bring it into alignment so each POST /deployments gets
a hermetic workdir. CreateDeployment generates a unique DeploymentID on
every call, so reprovs are isolated by construction.

Wizard-resume — the original justification for the FQDN-keyed design —
was already fragile (it required a clean prior tfstate), and is better
served by an explicit retry endpoint that re-uses the same DeploymentID
rather than implicit workdir reuse.

Affected callers:
- provisioner.go Provision + Destroy → workdirKey() (returns DeploymentID, falls back to FQDN-slug for legacy paths)
- wipe.go WipeDeployment → uses `id` (chi URL param) directly
- handover.go FinaliseHandover → uses `id` directly
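
A minimal sketch of the keying rule, assuming a free function that takes both
values as strings. The real workdirKey() lives in provisioner.go and its
signature is not shown in this message.

```go
package main

import (
	"fmt"
	"strings"
)

// workdirKey returns the DeploymentID when present. For legacy paths with no
// DeploymentID it falls back to the FQDN slug, built the same way the old
// code did: dots replaced with dashes.
func workdirKey(deploymentID, fqdn string) string {
	if deploymentID != "" {
		return deploymentID
	}
	return strings.ReplaceAll(fqdn, ".", "-")
}

func main() {
	fmt.Println(workdirKey("98395b3d9bd9c1aa", "t104.omani.works")) // 98395b3d9bd9c1aa
	fmt.Println(workdirKey("", "t104.omani.works"))                 // t104-omani-works
}
```

Two provisions of the same FQDN then land in different directories as long as
each carries its own DeploymentID, which is the isolation property described
above.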

Tests pass: provisioner + handler test packages.

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 21:17:28 +04:00