30 Commits

Author SHA1 Message Date
e3mrah
c96d7b5089
fix(bp-keycloak): retune install retries to fit HR envelope (#146) (#1352)
Diagnostic finding: prov #21 + #22 hung 30+ min on bp-keycloak install.
Three coupled root causes from Fix #140 retune:

1. availabilityCheck.timeout=900s meant a single failed availability
   attempt busted the 30m HR window before Job-level backoff retried.
2. HR install/upgrade.remediation.retries=3 triggered up to 3 full
   Helm uninstall+reinstall cycles, losing Liquibase state each time.
   Worst case: 90m+ wall-clock before Flux gave up.
3. Liquibase + JVM cold-start legitimately took 5-10 min before
   Keycloak's Service had Endpoints, but bitnami's livenessProbe
   (initialDelaySeconds 300) killed Pods mid-migration.

Three target-state changes (chart 1.4.4 -> 1.4.5):

- platform/keycloak/chart/values.yaml: availabilityCheck.timeout
  900s -> 300s. ~5 attempts fit in 30m HR envelope vs. ~1.5 at 900s.
- platform/keycloak/chart/values.yaml: keycloak.startupProbe.enabled
  with failureThreshold 360 x periodSeconds 5 = 30m budget. Suspends
  livenessProbe until first /realms/master 200. livenessProbe.
  initialDelaySeconds 300 -> 60.
- clusters/_template/bootstrap-kit/09-keycloak.yaml:
  install/upgrade.remediation.retries 3 -> 1 + chart pin 1.4.5.
  Job's own backoffLimit=5 handles retries without losing state.
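
The envelope arithmetic behind these numbers can be sketched roughly as
follows. This is an illustrative calculation, not code from the chart.
It assumes every attempt waits out its full availabilityCheck.timeout,
and that Job backoff starts at 10s and doubles up to a 6m cap (the range
quoted in the #140 commit). The function names are hypothetical.

```python
def attempts_within(envelope_s, attempt_timeout_s, base_backoff_s=10, cap_s=360):
    """Count attempts that run to their full timeout inside the envelope."""
    elapsed = 0
    attempts = 0
    backoff = base_backoff_s
    while elapsed + attempt_timeout_s <= envelope_s:
        elapsed += attempt_timeout_s   # attempt waits out its whole timeout
        attempts += 1
        elapsed += backoff             # assumed Job-level backoff before the next Pod
        backoff = min(backoff * 2, cap_s)
    return attempts


def startup_probe_budget_s(failure_threshold, period_seconds):
    """Approximate time a startupProbe tolerates before the container restarts."""
    return failure_threshold * period_seconds


HR_ENVELOPE_S = 30 * 60  # 30m HR install/upgrade timeout

print(attempts_within(HR_ENVELOPE_S, 300))   # timeout after this commit
print(attempts_within(HR_ENVELOPE_S, 900))   # timeout before this commit
print(startup_probe_budget_s(360, 5) // 60)  # probe budget in minutes
```

Under these assumptions the sketch counts 5 full attempts at 300s and
only 1 at 900s (a second attempt would start but could not finish), and
the startupProbe budget comes to 30 minutes.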

All knobs remain operator-overridable via per-Sovereign overlay
valuesFrom (Inviolable Principle #4: no hardcoding).

TODO follow-up (out of scope per diagnostic "ship knob bumps first
to validate hypothesis"): move realm-import out of the bitnami
post-install Helm hook into a Catalyst-owned Job that runs after
Keycloak Service has Endpoints. Decouples HR-Ready from realm-
imported and lets the orchestrator wait on the Job CR directly.

Refs #146.

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-11 07:37:14 +04:00
e3mrah
6662d672d3
fix(bp-keycloak): post-upgrade hook regression on prov #21 (#140) (#1349)
Fix #129 (1.4.3) set availabilityCheck.timeout=600s + backoffLimit=5,
which works for fresh-install but races on UPGRADE. On prov #21
(f84f6c3ff2b60296, 2026-05-11) chart-roll on contabo (PR #1346/#1347
stack) triggered an HR upgrade; the keycloak StatefulSet rolled, Helm
fired the post-upgrade hook before the new Pod's admin endpoint
recovered (Liquibase re-validation + JVM cold start on freshly-
provisioned node), and the inner 600s window expired before the first
attempt found Keycloak Ready. With backoffLimit=5 + 10s..6m
exponential backoff the worst-case wall clock exceeds the parent HR's
15m upgrade.timeout -> Helm aborts the hook -> "post-upgrade hooks
failed".
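
A rough sketch of the worst-case arithmetic described above. It assumes
every attempt runs into the full 600s window, that backoffLimit counts
retries (so backoffLimit=5 allows 6 attempts), and that the backoff gaps
double from 10s up to a 6m cap. The helper name is hypothetical.

```python
def worst_case_wall_clock_s(attempt_timeout_s, backoff_limit,
                            base_backoff_s=10, cap_s=360):
    """Total seconds if every attempt times out and every retry is used."""
    attempts = backoff_limit + 1          # backoffLimit counts retries
    total = attempts * attempt_timeout_s
    backoff = base_backoff_s
    for _ in range(attempts - 1):         # one backoff gap between attempts
        total += backoff
        backoff = min(backoff * 2, cap_s)
    return total


HR_UPGRADE_TIMEOUT_S = 15 * 60  # parent HR upgrade.timeout before this fix

worst = worst_case_wall_clock_s(600, 5)
print(worst, worst > HR_UPGRADE_TIMEOUT_S)
```

With these inputs the worst case is 3910s against a 900s HR timeout.
Even a single retry (600s + 10s + 600s = 1210s) already overshoots it.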

Target-state fix (Principle #3: no half-fix; both chart and HR move
together):

- platform/keycloak/chart/values.yaml: availabilityCheck.timeout
  600s -> 900s (15m inner wait covers a single rolling-restart +
  Liquibase cycle without retry, eliminating most backoff time);
  cleanupAfterFinished.enabled true with 1h TTL so stale hook Pods
  don't race the before-hook-creation delete on subsequent upgrades.
- platform/keycloak/chart/Chart.yaml: 1.4.3 -> 1.4.4 + 1.4.4
  changelog block.
- clusters/_template/bootstrap-kit/09-keycloak.yaml: HR install +
  upgrade timeout 15m -> 30m so Helm's outer hook-wait gracefully
  accommodates the inner 15m availability window plus normal backoff.
  Pin chart version 1.4.3 -> 1.4.4.

All knobs remain operator-overridable via per-Sovereign valuesFrom
(Principle #4: no hardcoding). Hook semantics stay intact (Principle
#3: workaround would be disabling annotations, which breaks
downstream bp-gitea + catalyst-api contract that the realm exists
before HR Ready).

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-11 06:33:57 +04:00
e3mrah
a3005ead67
fix(bp-keycloak): bump keycloak-config-cli hook timeouts (#129) (#1341)
Fresh-Sovereign provision #15 (otech 0ad3687ddd72deb7) wedged at
phase1-watching for 30+ min: bp-keycloak HelmRelease failed with
`post-upgrade hooks failed: timed out waiting for the condition` →
bp-gitea (dependsOn keycloak OIDC) blocked → bp-self-sovereign-cutover
never converged.

Root cause
──────────
The bitnami keycloak subchart's `keycloak-config-cli-job.yaml` is
rendered as a Helm post-install/post-upgrade/post-rollback hook
(default annotations on the Job, weight 5). On a fresh k3s the
realm-import Job fires before Postgres+Liquibase finish bootstrapping
Keycloak (legitimately 3-10 min), and the bitnami subchart defaults
are too tight to absorb that race:

  - keycloakConfigCli.availabilityCheck.timeout="" → keycloak-config-cli
    falls back to its internal ~120s wait for Keycloak's /admin endpoint
  - keycloakConfigCli.backoffLimit: 1 → only 2 Pod attempts total
    before the Job is marked Failed

Both attempts hit the 120s window, Job goes Failed, Helm reports the
post-upgrade hook timed out, HR install/upgrade retries (×3) all hit
the same race, HR remains Failed → downstream blueprints never install.

Fix
───
Tune the hook's internal timing to fit comfortably inside the parent
HR's 15m install/upgrade timeout while leaving headroom for cold image
pull + Pod scheduling:

  keycloak.keycloakConfigCli.availabilityCheck.timeout: "600s"   (was "")
  keycloak.keycloakConfigCli.backoffLimit:               5        (was 1)

Both knobs remain operator-overridable via per-Sovereign
`valuesFrom` (Inviolable Principle #4: no hardcoding). Per
Inviolable Principle #3 (no workarounds), this does NOT disable the
hook semantics — disabling the hook would break the documented
contract that the realm exists before the HR reaches Ready
(downstream bp-gitea + catalyst-api consume the realm).

Files
─────
  platform/keycloak/chart/values.yaml           (+59  inline rationale)
  platform/keycloak/chart/Chart.yaml            (1.4.2 → 1.4.3 + changelog)
  clusters/_template/bootstrap-kit/09-keycloak.yaml (HR pin → 1.4.3)

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-11 02:51:21 +04:00
e3mrah
2ef01849bf
fix(bp-keycloak): truncate catalyst-api-server desc <255 chars (1.4.2 backport) (#1285)
* fix(bp-keycloak): truncate catalyst-api-server description <255 chars (Postgres limit)

Keycloak DB column CLIENT.DESCRIPTION = varchar(255). Previous value was
458 chars, causing the keycloak-config-cli post-install hook to fail with
PSQLException value too long. Caught on omantel provision #6 iter-13
chart roll — keycloak-config-cli Job CrashLoop, bp-keycloak HR False,
downstream HRs blocked.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(bp-keycloak): truncate catalyst-api-server desc <255 chars (Postgres limit)

Keycloak DB column CLIENT.DESCRIPTION = varchar(255). Previous value was
458 chars (since Fix #23 / commit febd5fef), causing the keycloak-config-cli
post-install hook to fail with PSQLException 'value too long for type
character varying(255)' on every fresh Sovereign provision.
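
A pre-flight check for this class of failure could look roughly like the
sketch below. It is not part of the chart: the function name is
hypothetical, and the realm dict is a stand-in for the parsed realm
import JSON, using the `clients[].clientId` and `description` fields.

```python
MAX_DESCRIPTION_LEN = 255  # CLIENT.DESCRIPTION is varchar(255) per this commit


def overlong_descriptions(realm, limit=MAX_DESCRIPTION_LEN):
    """Return {clientId: length} for clients whose description exceeds the limit."""
    return {
        client["clientId"]: len(client.get("description", ""))
        for client in realm.get("clients", [])
        if len(client.get("description", "")) > limit
    }


# Stand-in data: a 458-char description like the one that broke the import.
realm = {
    "realm": "sovereign",
    "clients": [
        {"clientId": "catalyst-api-server", "description": "x" * 458},
        {"clientId": "kubectl", "description": "kubectl OIDC client"},
    ],
}

print(overlong_descriptions(realm))
```

A description of exactly 255 characters passes this check, since
varchar(255) accepts up to 255 characters.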

Caught on omantel provision #6 — keycloak-config-cli Job CrashLoop,
bp-keycloak HR False, all downstream HRs blocked from converging.

Backport to 1.4.x (1.5.0 had a separate breaking realm-rename change
reverted via PR #1282). Bootstrap-kit pin updated to 1.4.2.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: alierenbaysal <alierenbaysal@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 15:48:37 +04:00
e3mrah
0a11107630
fix(keycloak): parameterize realm name (target-state realm-per-Sovereign) — qa-loop iter-12 Fix #53A (#1271)
* fix(keycloak): parameterize realm name (target-state realm-per-Sovereign) — qa-loop iter-12 Fix #53A

Per `feedback_no_mvp_no_workarounds.md` target-state rule + matrix assertion drift on TC-124, TC-125, TC-159, TC-160, TC-161, TC-176, TC-190, TC-285 (8 TCs in iter-12 audit Phase 4 cluster A): each Sovereign owns its KC realm named after the tenant short-name, not a hardcoded literal `sovereign`.

bp-keycloak chart 1.4.1 → 1.5.0:
- New value `sovereignRealm.name` (default `sovereign` for backward compat with overlays not yet migrated)
- New value `sovereignRealm.displayName` (default `Sovereign`)
- Realm import JSON `"realm"` field + catalyst-kc-sa-credentials Secret `realm` key both flow from `$realmName` so Keycloak realm name and catalyst-api `CATALYST_KC_REALM` env stay in sync (no auth-mismatch risk)
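
The single-source idea in the last bullet can be illustrated with a
small sketch. It only models the data flow; the real chart does this in
a Helm template, and the function name here is hypothetical.

```python
import json


def render_realm_artifacts(realm_name="sovereign", display_name="Sovereign"):
    """Derive the realm JSON and the Secret's realm key from one value."""
    realm_json = json.dumps({"realm": realm_name, "displayName": display_name})
    secret_data = {"realm": realm_name}  # other Secret keys omitted in this sketch
    return realm_json, secret_data


realm_json, secret_data = render_realm_artifacts("omantel", "Omantel Sovereign")
print(json.loads(realm_json)["realm"] == secret_data["realm"])
```

Because both outputs read the same variable, the realm Keycloak imports
and the realm catalyst-api is told to use cannot diverge.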

omantel chroot overlay:
- bp-keycloak HelmRelease pinned to chart 1.5.0
- `sovereignRealm.name: omantel` + `displayName: "Omantel Sovereign"` per matrix tenant convention

bp-catalyst-platform 1.4.120 → 1.4.121: chart bump triggers catalyst-api StatefulSet restart so it picks up the new mirrored Secret with realm=omantel. The cutover step-06 patches HR.spec.chart.spec.version dynamically per `incidents.md`.

Backward compat: charts not setting sovereignRealm.name (otech, _template) keep realm `sovereign` (no behaviour change). The contabo Catalyst-Zero realm `openova` is a separate KC instance untouched by this change.

* fix(blueprint): bump bp-keycloak blueprint.yaml to 1.5.0 to match Chart.yaml — qa-loop iter-12 Fix #53A follow-up
2026-05-10 10:48:09 +04:00
e3mrah
142d42e725
fix(cilium): clustermesh-apiserver NodePort → LoadBalancer (path-1) — qa-loop iter-12 Fix #53D (#1274)
* fix(cilium): clustermesh-apiserver Service NodePort → LoadBalancer (path-1) — qa-loop iter-12 Fix #53D

Per qa-loop-state/incidents.md remediation table path-1 + feedback_no_mvp_no_workarounds.md "no operational hacks": the existing NodePort 32379 was the workaround that triggered Hetzner's stateful firewall to silently drop cross-region SYN packets to BPF-only NodePorts (no LISTEN socket on the host). The canonical multi-region transport is a per-peer Hetzner LoadBalancer via the cloud-controller-manager.

Affects: omantel-fsn chroot Sovereign (this PR). Other Sovereigns (otech, _template) keep their existing setting.

PRECONDITION (separate bootstrap-kit slot, follow-up): Hetzner cloud-controller-manager (hcloud-ccm) must be installed AND each k3s node's spec.providerID rewritten from `k3s://...` to `hcloud://<server-id>` so the LB Service materializes. Without CCM the LB sits in `<pending>` but does not break in-cluster operation (ClusterIP still works for the local cilium-agent).

Test matrix coverage when CCM is also live: TC-260, TC-261, TC-241, TC-050, TC-308, TC-310, TC-311, TC-314, TC-298, TC-297, TC-340, TC-349 (multi-region tests blocked by NodePort filtering).

* fix(blueprint): bump bp-gitea blueprint.yaml to 1.2.5 to match Chart.yaml — pre-existing main drift

* fix(blueprint): bump bp-keycloak blueprint.yaml to 1.4.1 to match Chart.yaml — pre-existing main drift
2026-05-10 10:45:11 +04:00
e3mrah
febd5fef22
fix(bp-keycloak): grant catalyst-api SA manage-realm + view-realm + view-clients (qa-loop iter-4 Fix #23) (#1213)
Root cause of TC-248: the catalyst-api-server service-account in the
sovereign realm was created (PR #604, Phase-8b) with only
impersonation+manage-users+view-users+query-users on realm-management.
Those four roles let the SA mint tokens and provision users, but they
do NOT include manage-realm or view-realm, which are required to
read or write realm-roles via the Keycloak Admin REST API.

When EPIC-3 T2 added the tier-role bootstrap goroutine
(KEYCLOAK_BOOTSTRAP_TIER_ROLES=true,
products/catalyst/bootstrap/api/internal/keycloak/realm_bootstrap.go)
its very first call — GetRealmRole(catalyst-viewer) — returned 403
Forbidden, EnsureRealmRole gave up after 5 retries and the catalog-tier
realm-roles were never materialized. The access-matrix UI (TC-248) then
showed an empty role list.

Fix: extend clientScopeMappings.realm-management AND
users[serviceAccountClientId=catalyst-api-server].clientRoles.realm-management
in the sovereign realm import to include manage-realm + view-realm +
view-clients. After this change a clean Sovereign install converges the
tier-role bootstrap on the FIRST attempt at catalyst-api startup.
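
The permission gap can be expressed as a set difference. The sketch
below is illustrative only: the role names come from this commit
message, while the helper and constant names are hypothetical.

```python
# realm-management roles the SA needs after this fix (from the commit message)
REQUIRED_ROLES = {
    "impersonation", "manage-users", "view-users", "query-users",
    "manage-realm", "view-realm", "view-clients",
}


def missing_roles(granted, required=REQUIRED_ROLES):
    """Return the required roles the service account does not hold, sorted."""
    return sorted(required - set(granted))


# The four roles granted by the Phase-8b realm import (PR #604).
phase_8b_grant = ["impersonation", "manage-users", "view-users", "query-users"]

print(missing_roles(phase_8b_grant))
```

For the Phase-8b grant this reports manage-realm, view-clients and
view-realm as missing, which is the set this commit adds.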

Verification on omantel (chart 1.4.0 → 1.4.1, runtime fix applied
manually first then catalyst-api restarted):

  kc-bootstrap: tier-role bootstrap converged (attempt 1, realm=sovereign)

  $ curl /admin/realms/sovereign/roles | jq '.[].name'
    catalyst-admin       (composite=true,  tier-level=40)
    catalyst-developer   (composite=true,  tier-level=20)
    catalyst-operator    (composite=true,  tier-level=30)
    catalyst-owner       (composite=true,  tier-level=50)
    catalyst-viewer      (composite=false, tier-level=10)

  $ catalyst-owner.composites    → catalyst-admin
  $ catalyst-admin.composites    → catalyst-operator
  $ catalyst-operator.composites → catalyst-developer
  $ catalyst-developer.composites → catalyst-viewer

Adds TestEnsureTierRealmRoles_GetRole403_SurfacesPermissionError to
realm_bootstrap_test.go so future regressions of the SA permission
contract surface a debuggable error chain
("ensure realm role \"catalyst-viewer\": ... GET role 403: ...")
rather than a generic "create failed".

Refs: TC-248, EPIC-3 T2 (#1098), bp-keycloak Phase-8b (#604)

Co-authored-by: hatiyildiz <hati.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 19:14:30 +04:00
e3mrah
7f859dbb4b
feat(bp-keycloak): tenant-mode realm with wordpress/openclaw/stalwart OIDC clients (1.4.0, #915) (#918)
PR #911 wired the SME tenant orchestrator to emit
realmConfig.tenant.enabled=true on the per-tenant bp-keycloak
HelmRelease — but the chart had no template that consumed those values,
so the WordPress / OpenClaw / Stalwart OIDC integrations had no client
registered in the tenant realm and SSO failed end-to-end.

This change adds the chart-side template the orchestrator was already
emitting for. When realmConfig.tenant.enabled=true:

  * configmap-sovereign-realm.yaml SKIPS (mutual-exclusion guard added
    on the existing template) so only one realm CM is rendered.
  * NEW templates/configmap-tenant-realm.yaml renders a realm import
    ConfigMap (same name `<release>-sovereign-realm-config` so the
    upstream keycloak-config-cli existingConfigmap reference still
    resolves) carrying the tenant realm + 3 OIDC clients:
      - wordpress  (confidential, auth-code; redirect URIs cover the
                    openid-connect-generic plugin's admin-ajax.php
                    callback + /wp-login.php fallback)
      - openclaw   (confidential, auth-code; redirect URI /oauth/callback
                    per #915 spec)
      - stalwart   (confidential, serviceAccountsEnabled=true so the
                    directory.keycloak type=oidc backend can use
                    client_credentials to introspect IMAP/SMTP tokens;
                    standardFlowEnabled=true for webmail UI auth-code)
  * NEW per-app Secrets emitted in the same template scope as the realm
    ConfigMap so the realm JSON's `secret` field and the K8s Secret
    bytes never drift:
      - wordpress-oidc-client-secret
      - openclaw-oidc-client-secret
      - stalwart-oidc-client-secret  (carries BOTH client-secret AND
                                      OIDC_CLIENT_SECRET keys for the
                                      two consumer paths)
  * Each per-app secret persists across helm upgrade via
    lookup-or-generate (mirrors marketplace-api/secret.yaml pattern from
    issue #887 and the existing catalyst-api-server secret in
    configmap-sovereign-realm.yaml). helm.sh/resource-policy: keep so
    bytes outlive uninstall.
  * Fail-closed validation when realmConfig.tenant.enabled=true and
    any of realmName / parentDomain / subdomain is unset (Inviolable
    Principle #4).

NEW tests/tenant-realm-oidc-clients.sh covers 6 cases:
  1. Sovereign-mode default render unchanged (kubectl + catalyst-ui +
     catalyst-api-server clients present, no tenant artefacts leak).
  2. Tenant-mode render produces exactly ONE realm CM under the
     expected name + zero leaked Sovereign-only resources.
  3. Tenant realm JSON parses + 3 OIDC clients present with the
     redirect-URI / publicClient / serviceAccountsEnabled shape per
     #915 spec; Secret bytes match realm JSON's `secret` fields.
  4. Fail-closed validation when tenant fields missing.
  5. keycloak-config-cli post-install Job projects the realm CM by
     SAME name in BOTH modes.
  6. Operator-supplied per-app clientSecret overrides the
     lookup-or-generate path.

Existing tests/observability-toggle.sh + tests/oidc-kubectl-client.sh
still pass.

Sovereign-mode unchanged. The chart now consumes the values the
orchestrator (PR #911) was already emitting; no orchestrator change
needed.

Closes #915 (C1 sub-task) and unblocks #899 (per-tenant Keycloak
realm-config materialisation).

Co-authored-by: hatiyildiz <hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 13:29:40 +04:00
e3mrah
93c4b700de
fix(bp-keycloak): templatize existingConfigmap reference for per-tenant installs (#899) (#902)
bp-keycloak 1.3.2 hardcoded `keycloak.keycloakConfigCli.existingConfigmap` to
the literal "keycloak-sovereign-realm-config". This worked for the Sovereign-
mothership bootstrap-kit (releaseName=keycloak emits matching ConfigMap) but
broke for every per-tenant install where releaseName=bp-keycloak emits
"bp-keycloak-sovereign-realm-config" — the post-install keycloak-config-cli
Job stuck in ContainerCreating with `MountVolume.SetUp failed for volume
"config-volume" : configmap "keycloak-sovereign-realm-config" not found`,
HelmRelease InstallFailed after 15m timeout, cascading to bp-openclaw and
bp-wordpress-tenant which dependsOn it.

The bitnami/keycloak subchart's `keycloak.keycloakConfigCli.configmapName`
helper (charts/keycloak/templates/_helpers.tpl) applies `tpl` to the
existingConfigmap value, so embedding `{{ .Release.Name }}` inside the
string resolves at chart-render time. With this single-line change:

  - Sovereign-mothership (releaseName=keycloak) → keycloak-sovereign-realm-config (unchanged)
  - Per-tenant (releaseName=bp-keycloak)        → bp-keycloak-sovereign-realm-config (matches actual emitted ConfigMap)
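
The effect of the change can be mimicked with plain string substitution.
This is a stand-in for Helm's `tpl`, not real template rendering, and
the function name is hypothetical.

```python
HARDCODED = "keycloak-sovereign-realm-config"                  # value in 1.3.2
TEMPLATED = "{{ .Release.Name }}-sovereign-realm-config"       # value in 1.3.3


def resolve_existing_configmap(value, release_name):
    """Stand-in for tpl: substitute the release name into the value."""
    return value.replace("{{ .Release.Name }}", release_name)


for release in ("keycloak", "bp-keycloak"):
    emitted = f"{release}-sovereign-realm-config"  # ConfigMap the chart emits
    print(release,
          resolve_existing_configmap(HARDCODED, release) == emitted,
          resolve_existing_configmap(TEMPLATED, release) == emitted)
```

The hardcoded value only matches when the release is named keycloak; the
templated value matches for both release names.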

Verified via `helm template` in both modes: backendRef and config-volume
configMap.name match the actual ConfigMap emitted by
templates/configmap-sovereign-realm.yaml.

Chart bumped 1.3.2 → 1.3.3 + bootstrap-kit slot 09 + blueprint.yaml.

Co-authored-by: hatiyildiz <hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 10:49:39 +04:00
e3mrah
ab67a48fe7
fix(blueprints): align blueprint.yaml spec.version with Chart.yaml version (#817) (#819)
TestBootstrapKit_BlueprintCardsHaveRequiredFields was failing on main for
9 blueprints because their platform/<name>/chart/Chart.yaml version had
been bumped without a matching update to platform/<name>/blueprint.yaml
spec.version. The pre-existing failure forced 7 recent PRs to self-merge
with --admin, masking real CI failures.

Aligned spec.version to match Chart.yaml version on:

  cert-manager   1.1.1 -> 1.1.2
  flux           1.1.3 -> 1.1.4
  crossplane     1.1.3 -> 1.1.4
  sealed-secrets 1.1.1 -> 1.1.2
  spire          1.1.4 -> 1.1.7
  nats-jetstream 1.1.1 -> 1.1.2
  openbao        1.2.0 -> 1.2.14
  keycloak       1.3.1 -> 1.3.2
  gitea          1.2.1 -> 1.2.3

Verified locally:

  $ go test ./... -run TestBootstrapKit_BlueprintCardsHaveRequiredFields -count=1
  --- PASS: TestBootstrapKit_BlueprintCardsHaveRequiredFields (0.01s)
      ... all 10 sub-tests pass (cilium + the 9 above)

The existing test (tests/e2e/bootstrap-kit/main_test.go:145) is itself
the drift guardrail: it fails CI whenever Chart.yaml is bumped without a
matching blueprint.yaml bump. No additional script needed.
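
The invariant the guardrail enforces is simple enough to sketch. The
real check is the Go test named above; this Python stand-in assumes the
two versions have already been parsed out of Chart.yaml and
blueprint.yaml, and the function name is hypothetical.

```python
def find_drift(versions):
    """Return blueprints whose Chart.yaml and blueprint.yaml versions differ.

    versions maps name -> (chart_version, blueprint_spec_version).
    """
    return {name: pair for name, pair in versions.items() if pair[0] != pair[1]}


# Two rows from the table above, before and after this commit.
before = {"keycloak": ("1.3.2", "1.3.1"), "gitea": ("1.2.3", "1.2.1")}
after = {"keycloak": ("1.3.2", "1.3.2"), "gitea": ("1.2.3", "1.2.3")}

print(sorted(find_drift(before)))
print(find_drift(after))
```

Any non-empty result corresponds to a failing sub-test in CI.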

Closes #817 once verified on main.

Co-authored-by: Hatice Yildiz <hatice.yildiz@openova.io>
2026-05-04 22:32:49 +04:00
e3mrah
2e981f36a5
fix(bp-keycloak): catalyst-kc-sa-credentials addr → in-cluster Service URL (closes #781) (#788)
Sovereign-side catalyst-api Pod's intra-cluster Keycloak calls (token
mint, EnsureUser) were failing with `dial tcp: lookup
auth.<sov-fqdn> on 10.43.0.10:53: no such host`. The Sovereign's
CoreDNS resolves *.<sov-fqdn> via upstream resolvers — it does NOT
forward to the in-cluster PowerDNS that holds those records. Public
DNS works (PowerDNS authoritative), but Pod-side lookups of
auth.<sov-fqdn> return NXDOMAIN.

Live evidence — otech94 2026-05-04: handover URL returned
`{"error":"keycloak error: ensure user"}` from a DNS lookup failure
inside the catalyst-api Pod.

Fix: bp-keycloak chart now writes the in-cluster Service URL
(http://<release>.<namespace>.svc.cluster.local) into the
catalyst-kc-sa-credentials Secret's `addr` key instead of the public
gateway host (https://auth.<sov-fqdn>). This Secret is consumed
EXCLUSIVELY by the in-cluster catalyst-api Pod via reflector mirror
into catalyst-system; it is NEVER exposed to browsers.

The HTTPRoute hostname (.Values.gateway.host) stays at auth.<sov-fqdn>
for operator browsers — only the Pod's intra-cluster OAuth
client_credentials calls switch to the Service URL.
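
The two addresses involved can be sketched as follows. The URL shapes
are taken from this commit message; the function names and the example
release, namespace and FQDN values are assumptions for illustration.

```python
def in_cluster_addr(release, namespace):
    """Service URL written to the Secret's addr key after this fix."""
    return f"http://{release}.{namespace}.svc.cluster.local"


def public_gateway_addr(sov_fqdn):
    """Public host that stays on the HTTPRoute for operator browsers."""
    return f"https://auth.{sov_fqdn}"


# Example values only; real ones come from the Helm release and overlay.
print(in_cluster_addr("keycloak", "keycloak"))
print(public_gateway_addr("otech.omani.works"))
```

Only the first form resolves through cluster DNS without depending on
how CoreDNS forwards the Sovereign's own zone.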

Catalyst-Zero (contabo) is unaffected: it runs `keycloak-zero`
(separate chart in openova-private), not bp-keycloak.

Changes:
- platform/keycloak/chart/templates/configmap-sovereign-realm.yaml:
  Secret's $kcAddr unconditionally uses
  http://<release>.<namespace>.svc.cluster.local
- platform/keycloak/chart/Chart.yaml: 1.3.1 → 1.3.2
- clusters/_template/bootstrap-kit/09-keycloak.yaml: chart version 1.3.1 → 1.3.2
- products/catalyst/chart/Chart.yaml: 1.3.0 → 1.3.1 (changelog entry only)
- clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml: 1.3.0 → 1.3.1

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 20:34:22 +04:00
e3mrah
e96e31a781
fix(catalyst-api,bp-keycloak): handover 401 root-causes — Reloader annot + realm SA users array (#713) (#714)
Closes #713

Two distinct chart bugs surfaced live on otech62 (2026-05-03), both producing
401 on /auth/handover:

1. SOVEREIGN_FQDN race
   api-deployment.yaml reads SOVEREIGN_FQDN from ConfigMap "sovereign-fqdn"
   with optional:true. On Sovereigns, that ConfigMap is rendered by the
   sovereign-tls Flux Kustomization concurrently with bp-catalyst-platform
   HelmRelease. When the Pod starts first, valueFrom collapses to "" and
   stays empty — audience check rejects every valid token as "invalid
   audience". Fix: add Reloader annotations so the Pod rolls when the
   ConfigMap (and the handover-jwt-public Secret) appears.

2. catalyst-api-server SA missing user-level realm-management role mappings
   bp-keycloak realm import granted roles via clientScopeMappings — wrong
   level. The actual service-account user had no clientRoles entry, so KC
   rejected GET /users with 403 when catalyst-api tried to ensure the
   operator user during handover. Fix: add explicit "users" array binding
   service-account-catalyst-api-server to realm-management.{impersonation,
   manage-users, view-users, query-users}.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
2026-05-04 01:37:36 +04:00
e3mrah
7ca9541ef9
fix(handover): provision Keycloak service-account credentials zero-touch (Phase-8b followup) (#691)
* fix(handover): provision Keycloak service-account credentials zero-touch (Phase-8b followup)

Sovereign-side catalyst-api needs Keycloak service-account credentials
to provision the operator's user during /auth/handover. Today the chart
references K8s Secret `catalyst-kc-sa-credentials` with keys addr/realm/
client-id/client-secret in the catalyst-system namespace — but no
zero-touch path materialised it. The dead SealedSecret template at
09a-keycloak-catalyst-api-secret.yaml had a different name AND different
keys (CATALYST_KC_*), used PLACEHOLDER_SEALED_VALUE markers no
provisioner replaced, and wasn't even listed in the bootstrap-kit
kustomization.

Symptom on otech48: GET /auth/handover?token=<valid-jwt> returns
"server misconfiguration: keycloak not configured"
(auth_handover.go:169).

Fix: bp-keycloak chart's configmap-sovereign-realm.yaml template now
emits the realm-import ConfigMap AND the catalyst-kc-sa-credentials
Secret in a single template scope so they share the same generated
client secret. Pattern mirrors platform/powerdns/chart/templates/
api-credentials-secret.yaml (canonical seam, ADR-0001 §11.3
anti-duplication).

Secret-value resolution order (first match wins):
  1. operator-supplied .Values.catalystApiServerClientSecret
  2. helm `lookup` of existing Secret in keycloak ns (idempotent)
  3. fresh randAlphaNum 32 (zero-touch on first install)
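
The resolution order can be sketched in a few lines. This models the
logic only: in the chart it is Helm template code using `lookup` and
`randAlphaNum`, and the Python function name and arguments here are
hypothetical.

```python
import secrets
import string


def resolve_client_secret(operator_value, existing_value, length=32):
    """First match wins: operator value, then existing Secret, then random."""
    if operator_value:
        return operator_value      # 1. operator-supplied value
    if existing_value:
        return existing_value      # 2. value already in the cluster (idempotent)
    alphabet = string.ascii_letters + string.digits
    return "".join(secrets.choice(alphabet) for _ in range(length))  # 3. fresh


print(resolve_client_secret("from-overlay", "from-cluster"))
print(resolve_client_secret("", "from-cluster"))
print(len(resolve_client_secret("", None)))
```

Step 2 is what keeps the secret stable across upgrades: once a value
exists, later renders reuse it instead of generating a new one.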

The Secret carries the four keys exactly as the catalyst-api Pod's
secretKeyRef expects — addr / realm / client-id / client-secret —
with addr derived from gateway.host (https://auth.<sovereignFQDN>).
Reflector annotations auto-mirror the Secret to catalyst-system as
soon as that namespace materialises (bootstrap-kit slot 13).

The realm import already creates the catalyst-api-server client with
serviceAccountsEnabled + impersonation/manage-users/view-users/
query-users role mappings — so once Keycloak is Ready and the realm
imports, the SA is fully provisioned and the K8s Secret carries a
matching client secret. No post-install Job, no Admin-API script,
no out-of-band SealedSecret ceremony.

Cleanup: removes the dead 09a SealedSecret template (not in
kustomization, never produced a working Secret).

Bumps:
  - bp-keycloak chart 1.3.0 -> 1.3.1
  - clusters/_template/bootstrap-kit/09-keycloak.yaml HelmRelease
    pin 1.3.0 -> 1.3.1

Existing per-Sovereign overlays (clusters/otech.omani.works/,
clusters/omantel.omani.works/) intentionally remain on 1.3.0 — fresh
otechN provisioning consumes _template at provision time.

Will be verified live on otech49 — handover end-to-end without ANY
manual Secret creation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(keycloak): bump blueprint.yaml spec.version to match chart 1.3.1

TestBootstrapKit_BlueprintCardsHaveRequiredFields/keycloak asserts
Chart.yaml.version == blueprint.yaml.spec.version. Forgot to bump
blueprint.yaml in the previous commit.

Note: 8 other blueprints (cert-manager, flux, crossplane, sealed-secrets,
spire, nats-jetstream, openbao, gitea) carry the same pre-existing
mismatch and the test fails on main too. Out of scope for this PR;
fixing the keycloak case to keep the new chart version internally
consistent.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: hatiyildiz <hatice@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 19:50:06 +04:00
e3mrah
737574b19a
feat(bp-keycloak): Phase-8b sovereign realm — token-exchange, catalyst-ui/api-server OIDC clients, SMTP, bump 1.2.2 → 1.3.0 (#604) (#609)
Adds the full Phase-8b identity surface required by the seamless handover flow:

- Token exchange enabled on sovereign realm (attributes.token-exchange: true)
- catalyst-ui public PKCE client: redirectUris + webOrigins keyed on
  console.<sovereignFQDN>, groups + requiredActions in ID token
- catalyst-api-server confidential service-account client: impersonation +
  manage-users + view-users + query-users roles on realm-management; client
  secret injected at provisioning time via .Values.catalystApiServerClientSecret
- WebAuthn (webauthn-register + webauthn-register-passwordless) registered as
  Required Action options on the realm
- UPDATE_PASSWORD set as defaultAction: true for new users
- smtpServer block: pre-handover default = contabo Stalwart relay; fully
  operator-configurable via .Values.smtp.* (Phase-8c-acceptable)
- required-actions client scope + oidc-usermodel-attribute-mapper for
  requiredActions claim in ID token (catalyst-ui first-login UX)

Architectural change: realm JSON moved from inline values.yaml (keycloak:
subchart key — no parent scope access) to a parent-chart template
platform/keycloak/chart/templates/configmap-sovereign-realm.yaml, which can
read .Values.sovereignFQDN and .Values.smtp.* for per-Sovereign interpolation.
The upstream bitnami chart's keycloakConfigCli.existingConfigmap is pointed at
this ConfigMap. Anti-duplication seam: configmap-sovereign-realm.yaml.

New values.yaml keys:
  sovereignFQDN: "" (REQUIRED — per-Sovereign overlay supplies it)
  sovereignRealm.enabled: true
  catalystApiServerClientSecret: "" (REQUIRED — provisioner seals and injects)
  smtp.host/port/from/user/password/ssl/starttls/auth

New bootstrap-kit file:
  09a-keycloak-catalyst-api-secret.yaml — SealedSecret template for
  keycloak-catalyst-api-server-credentials in catalyst-system namespace;
  provisioner fills encryptedData fields at deploy time

Bootstrap-kit refs bumped 1.2.x → 1.3.0 in _template, otech, omantel.
helm template clean with sovereignFQDN=otech.omani.works.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 17:05:27 +04:00
e3mrah
b1a25c4235
fix(bp-keycloak,bp-openbao): HTTPRoute backend wrong name + RBAC hook lifecycle bug (#598) (#600)
Bug A — bp-keycloak@1.2.2: HTTPRoute backendService default was
`<release>-keycloak` (gave `keycloak-keycloak` with releaseName=keycloak)
but bitnami's fullname helper trims the chart-name suffix when Release.Name
already contains it, so the Service is just `keycloak`. Changed default to
`.Release.Name`. Sovereign realm was already imported (config-cli ran
successfully) — only the Gateway routing was broken, returning HTTP 500.
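
The naming rule behind Bug A can be sketched as below. This follows the
common Helm fullname convention and assumes bitnami's helper behaves the
same way, as the description above implies; fullnameOverride handling is
left out and the function name is hypothetical.

```python
def chart_fullname(release_name, chart_name):
    """Service name under the usual Helm fullname convention."""
    if chart_name in release_name:
        name = release_name                     # chart name already present
    else:
        name = f"{release_name}-{chart_name}"
    name = name[:63]                            # Kubernetes name length limit
    return name[:-1] if name.endswith("-") else name


release = "keycloak"
old_default = f"{release}-keycloak"             # backendService default in the bug
print(old_default, chart_fullname(release, "keycloak"))
```

With releaseName=keycloak the old default points at keycloak-keycloak
while the Service is named keycloak, which is the mismatch fixed here.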

Bug B — bp-openbao@1.2.6: auto-unseal-rbac SA/Role/RoleBinding had
`helm.sh/hook-delete-policy: before-hook-creation,hook-succeeded`. The
`hook-succeeded` clause caused Helm to delete the SA immediately after the
weight-0 RBAC hook completed, before the weight-5 init Job pod could mount
its SA token and start. Removed all hook annotations from the RBAC resources
so they are managed by regular Helm release lifecycle (created before hooks,
never deleted mid-install).

Bootstrap-kit refs bumped: bp-keycloak 1.2.0→1.2.2, bp-openbao 1.2.4→1.2.6.

Verified on otech22 (manual remediation): Keycloak sovereign realm
OIDC endpoint returns valid JSON, openbao-0 Initialized=true Sealed=false.

Co-authored-by: alierenbaysal <alierenbaysal@openova.io>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-02 15:43:32 +04:00
e3mrah
83ec889f06
feat(platform): add global.imageRegistry to remaining bp-* charts + bp-catalyst-platform (PR 3/3, #560) (#580)
Charts bumped:
- bp-keycloak 1.2.0 -> 1.2.1 (subchart stub; per-component image.registry knobs documented)
- bp-crossplane 1.1.3 -> 1.1.4 (subchart stub)
- bp-crossplane-claims 1.1.0 -> 1.1.1 (global.kubectlImage added; kubectl Job image templated; Hetzner ubuntu-24.04 server images intentionally untouched)
- bp-velero 1.2.0 -> 1.2.1 (subchart stub)
- bp-kyverno 1.0.0 -> 1.0.1 (subchart stub; per-controller image.registry knobs documented)
- bp-trivy 1.0.0 -> 1.0.1 (subchart stub; both operator + scanner image.registry knobs documented)
- bp-grafana 1.0.0 -> 1.0.1 (subchart stub)
- bp-flux 1.1.3 -> 1.1.4 (subchart stub; per-controller image.repository knobs documented)
- bp-catalyst-platform 1.1.13 -> 1.1.14 (global.imageRegistry + images.{catalystApi,catalystUi,marketplaceApi,console,smeTag} added; all 14 Catalyst-authored image refs templated: catalyst-api, catalyst-ui, marketplace-api, console + 10 SME services)

Post-handover per-Sovereign overlays set global.imageRegistry to harbor.<sovereign-fqdn> so every container image pull routes through the Sovereign's own Harbor proxy_cache.

Closes (partial): issue #560 — all 23 bp-* charts now carry global.imageRegistry

Co-authored-by: alierenbaysal <alierenbaysal@openova.io>
2026-05-02 13:21:53 +04:00
e3mrah
20b896070f
feat(bp-keycloak + infra): Sovereign K8s OIDC config for kubectl via per-Sovereign Keycloak realm (closes #326) (#448)
Wires the per-Sovereign K8s api-server's --oidc-* validator to the
per-Sovereign Keycloak realm so customer admins can authenticate
kubectl directly against their Sovereign — no static admin-kubeconfig
handoff, no rotated bearer-token exchange.

infra (cloud-init):
  - Add 6 --kube-apiserver-arg=oidc-* flags to the k3s install line in
    infra/hetzner/cloudinit-control-plane.tftpl. Issuer URL composed
    from sovereign_fqdn (https://auth.\${sovereign_fqdn}/realms/sovereign)
    per INVIOLABLE-PRINCIPLES #4 — never hardcoded. Username/groups
    prefixes scope OIDC subjects under "oidc:" so RoleBindings reference
    e.g. subjects[0].name=oidc:alice@org, distinct from local SAs/x509.
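
As a rough illustration of the flag set described above, the following sketch composes six OIDC arguments from a Sovereign FQDN. The issuer URL shape, the "oidc:" prefixes, the "kubectl" client and the "groups" claim are taken from this commit message; the username claim value and the function name are assumptions, and the real cloud-init template may differ.

```shell
#!/usr/bin/env bash
# Sketch only: prints six k3s --kube-apiserver-arg flags for one Sovereign.
# The username claim ("email") and the function name are assumptions.
oidc_apiserver_args() {
  local sovereign_fqdn="$1"
  local issuer="https://auth.${sovereign_fqdn}/realms/sovereign"
  # printf reuses the format string once per remaining argument.
  printf -- '--kube-apiserver-arg=%s\n' \
    "oidc-issuer-url=${issuer}" \
    "oidc-client-id=kubectl" \
    "oidc-username-claim=email" \
    "oidc-username-prefix=oidc:" \
    "oidc-groups-claim=groups" \
    "oidc-groups-prefix=oidc:"
}

# Hypothetical FQDN, for illustration only.
oidc_apiserver_args "sovereign.example.test"
```

Keeping the FQDN as the only input means the issuer URL is always derived, never hardcoded.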

Canonical seam (anti-duplication rule, ADR-0001 §11.3):
  - The bp-keycloak chart already bundles bitnami/keycloak's
    keycloakConfigCli post-install Helm hook Job, which imports realms
    declared under values.keycloak.keycloakConfigCli.configuration. We
    enable the existing seam — no bespoke kubectl-exec realm-creation
    script, no custom Admin-API call from catalyst-api.

bp-keycloak chart (1.1.2 → 1.2.0):
  - Enable keycloakConfigCli + ship inline sovereign-realm.json with:
    realm "sovereign" (invariant per Sovereign — Keycloak resolves the
    issuer claim from the request hostname, so no per-FQDN realm
    rename), default groups sovereign-admins/-ops/-viewers,
    oidc-group-membership-mapper emitting "groups" claim, public OIDC
    client "kubectl" with localhost:8000 + OOB redirect URIs
    (kubectl-oidc-login defaults), publicClient=true (kubectl runs
    locally and cannot safely hold a secret), PKCE S256 enforced.
  - Bump version 1.1.2 → 1.2.0 (semver MINOR, additive shape).
  - Bump bootstrap-kit slot 09 in _template/, omantel.omani.works/,
    otech.omani.works/ to version: 1.2.0.
  - New chart test tests/oidc-kubectl-client.sh (4 cases) — all green.
  - Existing tests/observability-toggle.sh — still green.

Documentation:
  - Add §11 "kubectl OIDC for customer admins" runbook to
    docs/omantel-handover-wbs.md with one-time workstation setup
    (kubectl krew install oidc-login + config set-credentials),
    sovereign-admin RBAC binding (oidc:sovereign-admins →
    cluster-admin), and 401-debugging table mapping common symptoms
    to root causes.
  - Carve #326 out of §7 "Out of scope" — it is shipped.
  - Add §9 status row.
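
The sovereign-admin RBAC binding mentioned in the runbook item could look roughly like the manifest rendered below. The group name follows from the "oidc:" groups prefix and the sovereign-admins default group described in this commit; the binding name is hypothetical, and the runbook's actual manifest is not shown here.

```shell
#!/usr/bin/env bash
# Sketch only: renders a ClusterRoleBinding for the OIDC admin group.
# The metadata.name value is an assumption made for illustration.
render_admin_binding() {
  cat <<'EOF'
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: oidc-sovereign-admins
subjects:
  - kind: Group
    name: "oidc:sovereign-admins"
    apiGroup: rbac.authorization.k8s.io
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-admin
EOF
}

render_admin_binding
```

On a live cluster the output could be piped to `kubectl apply -f -`.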

Validation:
  - grep -c 'oidc-issuer-url' infra/hetzner/cloudinit-control-plane.tftpl
    → 2 (comment + the actual flag in the curl line)
  - grep -c 'oidc-username-claim' → 2
  - helm template platform/keycloak/chart → renders post-install
    keycloak-config-cli Job + ConfigMap with kubectl client (3 hits
    on grep "kubectl"; 1 hit on "clientId": "kubectl")
  - bash scripts/check-vendor-coupling.sh → exit 0 (HARD-FAIL mode)
  - 4/4 oidc-kubectl-client gates green; 3/3 observability-toggle
    gates green

Out of scope (deferred to follow-up tickets):
  - Per-Sovereign user provisioning UI (#322, #323)
  - Refresh-token revocation on RoleBinding deletion (#324)
  - provider-kubernetes Crossplane ProviderConfig per Sovereign (#321)
  - omantel migration / Phase 8 live execution

NO catalyst-api or UI source files touched (those are #319/#322/#323
agents' territories per agent brief).

Co-authored-by: hatiyildiz <hatiyildiz@noreply.github.com>
2026-05-01 19:07:52 +04:00
e3mrah
a1bd550208
fix(charts): HTTPRoute templates skip-render on missing host (was failing default-values render) (#402)
Blueprint-release for #401 failed because HTTPRoute templates use
{{- fail }} when gateway.host is not set, which trips the chart default-values
render gate in CI. Switched 6 templates from 'fail loud' to 'skip render':

  if .Values.gateway.host  →  emit HTTPRoute
  else                     →  emit nothing

The Gateway API admission already rejects HTTPRoute with empty hostnames,
so the loud-fail wasn't buying anything an operator wouldn't see at apply
time. Default-values render now produces zero HTTPRoute resources, which
is the correct shape for the upstream chart consumers that don't set
the Sovereign-only gateway block.
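
A minimal sketch of the skip-render shape described above, written to a scratch directory so it can be inspected. The resource names, the parentRef, and the backend port are assumptions; the sketch also assumes the chart's values.yaml defines a `gateway:` map (possibly with an empty host), since dereferencing `.Values.gateway.host` on a missing map fails at render time.

```shell
#!/usr/bin/env bash
# Sketch only: a skip-render HTTPRoute template, not the repository's file.
# Names, parentRef, and port 80 are assumptions for illustration.
workdir="$(mktemp -d)"
mkdir -p "${workdir}/templates"
cat > "${workdir}/templates/httproute.yaml" <<'EOF'
{{- if .Values.gateway.host }}
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: {{ .Release.Name }}
spec:
  parentRefs:
    - name: cilium-gateway
      namespace: kube-system
  hostnames:
    - {{ .Values.gateway.host | quote }}
  rules:
    - backendRefs:
        - name: {{ .Release.Name }}
          port: 80
{{- end }}
EOF
cat "${workdir}/templates/httproute.yaml"
```

With an empty host the whole document is guarded out, so a default-values render emits nothing for this template.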

Files: keycloak, gitea, openbao, grafana, harbor, catalyst-platform.

Verified:
  helm template t products/catalyst/chart/ → 0 HTTPRoutes (clean)
  helm template t products/catalyst/chart/ --set ingress.gateway.enabled=true --set ingress.hosts.console.host=console.test --set ingress.hosts.api.host=api.test → 2 HTTPRoutes

Closes the blueprint-release failure on commit abf01b6f.

Co-authored-by: hatiyildiz <hati@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 16:23:58 +04:00
e3mrah
abf01b6f21
feat(platform): Gateway API migration audit (#387) (#401)
Migrates every minimal-Sovereign-set blueprint chart from
networking.k8s.io/v1.Ingress to gateway.networking.k8s.io/v1.HTTPRoute,
replacing the legacy Traefik-on-Sovereigns assumption with the canonical
Cilium + Envoy + Gateway API path per ADR-0001 §9.4 and the WBS §2
correction note (#388).

The single per-Sovereign Gateway is added as additional documents in
the existing bootstrap-kit slot clusters/_template/bootstrap-kit/01-cilium.yaml
(NOT a new top-level slot), since Cilium owns the GatewayClass. It
includes:

  - Certificate `sovereign-wildcard-tls` requesting `*.${SOVEREIGN_FQDN}`
    from `letsencrypt-dns01-prod` (cert-manager + #373 webhook)
  - Gateway `cilium-gateway` in `kube-system` with HTTPS (443, TLS
    terminate) + HTTP (80) listeners, allowedRoutes.namespaces.from=All
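
The two resources listed above might be rendered along these lines. Resource names, the issuer name, the ports, and `from: All` come from this commit message; the Certificate namespace, the ClusterIssuer kind, the Secret name, the "cilium" GatewayClass name, and the listener names are assumptions.

```shell
#!/usr/bin/env bash
# Sketch only: renders the wildcard Certificate and the Gateway for one FQDN.
# Assumed: Certificate lives in kube-system, issuer is a ClusterIssuer, the TLS
# Secret reuses the Certificate name, and the GatewayClass is named "cilium".
render_gateway() {
  local fqdn="$1"
  cat <<EOF
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: sovereign-wildcard-tls
  namespace: kube-system
spec:
  secretName: sovereign-wildcard-tls
  dnsNames:
    - "*.${fqdn}"
  issuerRef:
    kind: ClusterIssuer
    name: letsencrypt-dns01-prod
---
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: cilium-gateway
  namespace: kube-system
spec:
  gatewayClassName: cilium
  listeners:
    - name: https
      protocol: HTTPS
      port: 443
      hostname: "*.${fqdn}"
      tls:
        mode: Terminate
        certificateRefs:
          - kind: Secret
            name: sovereign-wildcard-tls
      allowedRoutes:
        namespaces:
          from: All
    - name: http
      protocol: HTTP
      port: 80
      allowedRoutes:
        namespaces:
          from: All
EOF
}

# Hypothetical FQDN, for illustration only.
render_gateway "sovereign.example.test"
```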

Per-blueprint HTTPRoute templates (canonical seam: each wrapper chart's
existing `templates/` directory):

  | Blueprint           | Host pattern                    | Backend port |
  |---------------------|---------------------------------|--------------|
  | bp-keycloak         | auth.<sov>                      | 80           |
  | bp-gitea            | git.<sov>                       | 3000         |
  | bp-openbao          | bao.<sov>                       | 8200         |
  | bp-grafana          | grafana.<sov>                   | 80           |
  | bp-harbor           | registry.<sov>                  | 80           |
  | bp-powerdns         | pdns.<sov>/api  (dual-mode)     | 8081         |
  | bp-catalyst-platform| console.<sov>, api.<sov>        | 80, 8080     |

bp-powerdns supports both Ingress (contabo legacy) and HTTPRoute
(Sovereign) simultaneously — the per-Sovereign overlay sets
`api.gateway.enabled=true` while leaving `api.enabled=true`. The
Ingress object is harmless on Cilium clusters with no Traefik. This
preserves contabo's existing pdns.openova.io flow per ADR-0001 §9.4.

bp-harbor flips `expose.type` from `ingress` to `clusterIP` in
platform/harbor/chart/values.yaml so the upstream chart no longer
emits its own Ingress; the HTTPRoute is the sole HTTP exposure.
TLS terminates at the Gateway (wildcard cert) rather than per-host
Certificates inside the chart.

bp-catalyst-platform's `templates/httproute.yaml` is NOT excluded by
.helmignore (unlike templates/ingress.yaml + templates/ingress-console-tls.yaml,
which remain contabo-only legacy demo infra). The contabo path keeps
serving console.openova.io/sovereign via Traefik unchanged.

Bootstrap-kit slot updates (per-Sovereign hostname interpolation):

  - 08-openbao.yaml      → gateway.host: bao.${SOVEREIGN_FQDN}
  - 09-keycloak.yaml     → gateway.host: auth.${SOVEREIGN_FQDN}
  - 10-gitea.yaml        → gateway.host: gitea.${SOVEREIGN_FQDN}
  - 11-powerdns.yaml     → api.host: pdns.${SOVEREIGN_FQDN}, api.gateway.enabled: true
  - 19-harbor.yaml       → gateway.host: registry.${SOVEREIGN_FQDN}
  - 25-grafana.yaml      → gateway.host: grafana.${SOVEREIGN_FQDN}

Server-side dry-run validation against the live Cilium Gateway API
CRDs on contabo: every HTTPRoute and the per-Sovereign Gateway
+ Certificate apply cleanly via `kubectl apply --dry-run=server`.

Contabo unaffected: clusters/contabo-mkt/* not modified. The legacy
SME ingresses (console-nova, marketplace, admin, axon, talentmesh,
stalwart, ...) continue to serve via Traefik as before. powerdns
on contabo remains on the Ingress path (api.gateway.enabled defaults
to false at the chart level).

Closes #387.

Co-authored-by: hatiyildiz <269457768+hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 16:19:30 +04:00
e3mrah
fa0e3a494b
fix(bp-keycloak): pin to current Bitnami tag (closes #191) (#198)
* fix(bp-keycloak): pin to current Bitnami Keycloak tag (closes #191)

Bitnami consolidated their tag scheme around 2025-09 (see
https://github.com/bitnami/charts/issues/30852). The chart was pinned to
upstream bitnami/keycloak Helm chart 24.7.1, whose default image tag
`bitnami/keycloak:26.2.4-debian-12-r0` now returns 404 in the Docker Hub
registry — installs hit ImagePullBackOff (verified on omantel).

Changes:
- Upstream Bitnami chart: 24.7.1 -> 25.2.0 (latest, appVersion 26.3.3)
- Override image.registry/image.repository for every Bitnami image used
  by the chart (keycloak app, keycloak-config-cli, postgresql,
  postgres-exporter, os-shell) to point at `bitnamilegacy/*`, where the
  historic debian-12 tags are preserved
- Replace deprecated `proxy: edge` with `proxyHeaders: "xforwarded"`
  (chart 25.x renamed the field; Catalyst fronts Keycloak with Cilium
  Gateway which sets X-Forwarded-* headers)
- bp-keycloak chart version: 1.1.1 -> 1.1.2

Verification (registry HEAD via Bearer token):
  bitnami/keycloak:26.2.4-debian-12-r0          -> 404 (broken pin)
  bitnami/keycloak:26.3.3-debian-12-r0          -> 404 (registry move)
  bitnamilegacy/keycloak:26.3.3-debian-12-r0    -> 200
  bitnamilegacy/keycloak-config-cli:6.4.0-...   -> 200
  bitnamilegacy/postgresql:17.6.0-debian-12-r0  -> 200
  bitnamilegacy/postgres-exporter:0.17.1-...    -> 200
  bitnamilegacy/os-shell:12-debian-12-r50       -> 200
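
A registry HEAD check of the kind summarized above could be scripted roughly as follows. The endpoints follow the public Docker Hub token service and registry v2 API; the function names are hypothetical, and this is not necessarily the procedure the author used. It needs curl and network access, so only the URL helper runs by default.

```shell
#!/usr/bin/env bash
# Sketch only: checks whether a Docker Hub tag exists via an HTTP HEAD request.
# Function names are assumptions made for illustration.
manifest_url() {
  printf 'https://registry-1.docker.io/v2/%s/manifests/%s\n' "$1" "$2"
}

pull_token() {
  # Anonymous pull token scoped to one repository.
  curl -fsS "https://auth.docker.io/token?service=registry.docker.io&scope=repository:$1:pull" \
    | sed -n 's/.*"token":"\([^"]*\)".*/\1/p'
}

tag_status() {
  local repo="$1" tag="$2" token
  token="$(pull_token "$repo")"
  # -I sends HEAD; -w prints only the HTTP status code.
  curl -s -o /dev/null -I -w '%{http_code}\n' \
    -H "Authorization: Bearer ${token}" \
    -H 'Accept: application/vnd.oci.image.index.v1+json, application/vnd.docker.distribution.manifest.list.v2+json, application/vnd.docker.distribution.manifest.v2+json' \
    "$(manifest_url "$repo" "$tag")"
}

# No network call happens until tag_status is invoked, e.g.:
#   tag_status bitnamilegacy/keycloak 26.3.3-debian-12-r0
manifest_url bitnamilegacy/keycloak 26.3.3-debian-12-r0
```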

`helm template platform/keycloak/chart` renders cleanly; rendered images
all resolve to bitnamilegacy/* tags listed above.

Long-term follow-up (not blocking): bitnamilegacy is explicitly marked
"no longer updated, may be removed in the future" — Catalyst should
either build its own Keycloak image or migrate to the Bitnami Secure
Image (BSI/Photon) catalog when chart support catches up. Tracked in
the bp-keycloak description block.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(bp-keycloak): bump blueprint.yaml version to match Chart.yaml 1.1.2

---------

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 20:10:17 +02:00
e3mrah
1f5c76def1
fix(platform): sync blueprint.yaml versions with Chart.yaml (#199)
* feat(ui): Playwright cosmetic + step-flow regression guards

15 regression guards in
products/catalyst/bootstrap/ui/e2e/cosmetic-guards.spec.ts that fail
HARD when each user-flagged defect class returns:

  1.  card height drift from canonical 108px
  2.  reserved right padding eating description width
  3.  logo tile drift from per-brand LOGO_SURFACE
  4.  invisible glyph (white-on-white) via luminance proxy
  5.  wizard step order Org/Topology/Provider/Credentials/Components/
      Domain/Review
  6.  legacy "Choose Your Stack" / "Always Included" tab labels
  7.  Domain step reachable before Components
  8.  CPX32 not the recommended Hetzner SKU
  9.  per-region SKU dropdown shows wrong provider catalog
  10. provision page is .html (static) not SPA route
  11. legacy bubble/edge DAG SVG markup on provision page
  12. admin sidebar drift from canonical core/console (w-56 + 7 labels)
  13. AppDetail uses tablist instead of sectioned layout
  14. job rows navigate to /job/<id> instead of expand-in-place
  15. Phase 0 banners (Hetzner infra / Cluster bootstrap) on AdminPage

Each test prints a failure message naming the canonical reference,
the source-of-truth file, and the data-testid PR needed (if any) so
the implementing agent has a precise target. No .skip() — per
INVIOLABLE-PRINCIPLES #2, missing components fail loud.

CI: .github/workflows/cosmetic-guards.yaml runs the suite on every
PR that touches products/catalyst/bootstrap/ui/** or core/console/**.

Docs: docs/UI-REGRESSION-GUARDS.md maps each test to the user's
original complaint, the canonical reference, and the green/red
semantics (5 tests intentionally RED on main today — they stay red
until the companion-agent's UI work lands).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(platform): sync blueprint.yaml versions with Chart.yaml so manifest-validation passes

---------

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 22:07:55 +04:00
hatiyildiz
1ddd569789 fix(bp-*): observability toggles default false — break circular CRD dependency
Extends the v1.1.1 hardening that started with cilium / cert-manager /
crossplane to the remaining 8 bootstrap-kit + per-Sovereign Blueprints.
Every observability toggle in every Catalyst-curated Blueprint now ships
`false`/`null` by default; the operator opts in via a per-cluster values
overlay at clusters/<sovereign>/bootstrap-kit/* once
bp-kube-prometheus-stack reconciles.

Live failure mode that prompted this (omantel.omani.works 2026-04-29):
bp-cilium @ 1.1.0 defaulted hubble.relay/ui + prometheus.serviceMonitor
to true. The upstream Cilium 1.16.5 chart renders a
monitoring.coreos.com/v1 ServiceMonitor whose CRD ships with
kube-prometheus-stack — a tier-2 Application Blueprint that depends on
the bootstrap-kit (cilium first). Helm install fails on a fresh
Sovereign with "no matches for kind ServiceMonitor in version
monitoring.coreos.com/v1 — ensure CRDs are installed first" and every
downstream HelmRelease reports `dep is not ready`. The earlier
trustCRDsExist=true mitigation only suppresses Helm's render-time gate;
the apiserver still rejects the resource at install-time.

Per-Blueprint changes:
- bp-cilium: hubble.relay.enabled, hubble.ui.enabled → false;
  hubble.metrics.enabled → null (this is the exact value that disables
  the upstream metrics ServiceMonitor template branch — verified by
  reading cilium 1.16.5's _hubble.tpl);
  hubble.metrics.serviceMonitor.enabled → false.
  tests/observability-toggle.sh extended with Case 4 (default render
  produces no hubble-relay / hubble-ui Deployments).
- bp-flux: flux2.prometheus.podMonitor.create → false.
- bp-sealed-secrets: sealed-secrets.metrics.serviceMonitor.enabled
  → false (explicit lock; upstream already defaults false).
- bp-spire: spire.global.spire.recommendations.enabled +
  recommendations.prometheus → false.
- bp-nats-jetstream: nats.promExporter.enabled +
  promExporter.podMonitor.enabled → false.
- bp-openbao: openbao.injector.metrics.enabled +
  openbao.serviceMonitor.enabled → false.
- bp-keycloak: keycloak.metrics.enabled + metrics.serviceMonitor.enabled
  + metrics.prometheusRule.enabled → false.
- bp-gitea: gitea.gitea.metrics.* and gitea.postgresql.metrics.*
  serviceMonitor + prometheusRule → false.
- bp-powerdns: powerdns.serviceMonitor.enabled + powerdns.metrics.enabled
  → false (forward-compatibility guard; current upstream
  pschichtel/powerdns 0.10.0 has no ServiceMonitor template, but a future
  upstream bump cannot silently regress).

Each chart ships a tests/observability-toggle.sh that asserts the rule
in three cases (default off / explicit on opt-in / explicit off) — runs
under blueprint-release.yaml's chart-test gate (added bdeb0f54 + the
existing wiring) before helm push. A regression that re-introduces a
hardcoded enabled: true in any chart fails CI before the OCI artifact
is published.
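
The three-case rule could be expressed as a small reusable check like the one below. This is a sketch, not the contents of the real tests/observability-toggle.sh: the function name, the case labels, and the choice of the bp-keycloak value keys are assumptions, and some upstream charts may need extra flags to render a ServiceMonitor offline.

```shell
#!/usr/bin/env bash
# Sketch only: asserts presence or absence of monitoring.coreos.com resources
# in a rendered manifest. Names and labels are hypothetical.
# Usage: check_case <label> <present|absent> <rendered-manifest>
check_case() {
  local label="$1" expect="$2" manifest="$3" actual="absent"
  if printf '%s\n' "$manifest" | grep -q 'monitoring.coreos.com'; then
    actual="present"
  fi
  if [ "$actual" = "$expect" ]; then
    echo "PASS: ${label}"
  else
    echo "FAIL: ${label} (expected ${expect}, got ${actual})"
    return 1
  fi
}

# Runs only inside a chart directory with helm installed.
if command -v helm >/dev/null 2>&1 && [ -f Chart.yaml ]; then
  on="--set keycloak.metrics.enabled=true --set keycloak.metrics.serviceMonitor.enabled=true"
  off="--set keycloak.metrics.enabled=false --set keycloak.metrics.serviceMonitor.enabled=false"
  check_case "default off"  absent  "$(helm template t .)"
  # $on and $off are left unquoted on purpose so they split into separate flags.
  check_case "explicit on"  present "$(helm template t . $on)"
  check_case "explicit off" absent  "$(helm template t . $off)"
fi
```

Taking the rendered manifest as an argument keeps the assertion logic testable without a Helm installation.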

Versioning:
- All 11 leaf charts bumped 1.1.0 → 1.1.1.
- products/catalyst/chart (bp-catalyst-platform umbrella) deps updated
  to 1.1.1 across the board.
- clusters/_template/bootstrap-kit/03-flux through 10-gitea bumped to
  1.1.1; clusters/omantel.omani.works/bootstrap-kit/* mirror.

docs/BLUEPRINT-AUTHORING.md §11.2 table extended to enumerate every
toggle disabled across all 11 Blueprints. References
docs/INVIOLABLE-PRINCIPLES.md #4.

GATES (all green):
- helm dep build resolves cleanly post-change for every chart whose
  upstream is published (umbrella waits on per-leaf publish).
- helm lint clean on all 11 leaves.
- helm template . default render produces zero monitoring.coreos.com
  references on every leaf (verified locally).
- tests/observability-toggle.sh PASS on all 11 leaves.

Live verification: with v1.1.1 published the omantel.omani.works
HelmRelease can roll forward without a manual values patch — Flux picks
up the new chart digest automatically (semver: 1.x in OCIRepository).

Refs: issue #182.
2026-04-29 19:23:52 +02:00
hatiyildiz
43aff20254 feat(bp-*): convert all 11 bootstrap-kit charts to umbrella charts depending on upstream
Each platform/<name>/chart/Chart.yaml now declares the canonical upstream
chart as a dependencies: entry. helm dependency build pulls the upstream
payload into the OCI artifact at publish time, so Flux helm install of
bp-<name>:1.1.0 actually installs the upstream Helm release alongside the
Catalyst-curated overlays (NetworkPolicy, ServiceMonitor, ClusterIssuer,
ExternalSecret) under templates/.

Pinned upstream chart versions per platform/<name>/blueprint.yaml:
- cilium                 1.16.5  https://helm.cilium.io
- cert-manager           v1.16.2 https://charts.jetstack.io
- flux                   2.4.0   https://fluxcd-community.github.io/helm-charts
- crossplane             1.17.x  https://charts.crossplane.io/stable
- sealed-secrets         2.16.x  https://bitnami-labs.github.io/sealed-secrets
- spire                  ...     https://spiffe.github.io/helm-charts-hardened
- nats-jetstream         ...     https://nats-io.github.io/k8s/helm/charts
- openbao                ...     https://openbao.github.io/openbao-helm
- keycloak               ...     https://charts.bitnami.com/bitnami
- gitea                  ...     https://dl.gitea.com/charts
- catalyst-platform      umbrella over the 10 leaf bp-* charts via
                         helm dependency

values.yaml in each chart adopts the umbrella convention: catalystBlueprint
metadata block (provenance + version) at top level, upstream subchart
values namespaced under the dependency name.

cert-manager specifically: clusterissuer-letsencrypt-dns01.yaml gets the
helm.sh/hook: post-install,post-upgrade annotation so it applies AFTER
cert-manager controllers are running and CRDs registered (the previous
hollow-chart shape ran the ClusterIssuer at install time when CRDs
didn't exist yet, which was the omantel cluster's exact failure mode).

Wrapper chart version bumped 1.0.0 → 1.1.0 across the board (umbrella
conversion is a meaningful structural revision). Cluster manifests in
clusters/_template/bootstrap-kit/ AND
clusters/omantel.omani.works/bootstrap-kit/ updated to reference 1.1.0.

The blueprint-release.yaml workflow's helm package step needs an
explicit helm dependency build before push so the upstream subchart
bytes ship inside the OCI artifact. That CI change is a follow-up
commit on this same branch (separate file scope).
2026-04-29 17:21:36 +02:00
hatiyildiz
62d9c7d936 fix(charts): drop dependencies block — wrappers carry values overlay only
The first 2 blueprint-release CI runs failed on `helm package` with containerd permission errors because the wrapper Chart.yaml's `dependencies:` block triggered helm to pull the upstream charts via OCI/containerd at package time, which the GitHub Actions runner blocks.

Architectural fix: each Catalyst Blueprint wrapper carries the values overlay + metadata only. The bootstrap installer reads the upstream chart reference from the wrapper's values.yaml `catalystBlueprint.upstream.{chart,version,repo}` metadata block, points `helm install` at the upstream chart's repo, and overlays our values.

This keeps:
- blueprint-release CI lightweight (no upstream pulls during package; helm package now works without containerd)
- the "bp-<name> wrapper does NOT drift from upstream" property (we ship the overlay, not a fork)
- the single Blueprint contract from BLUEPRINT-AUTHORING §1 (a wrapper is still a Catalyst-curated Helm chart published as bp-<name>:<semver>)

Changes:
- 11 platform/<name>/chart/Chart.yaml: removed dependencies block. Each is now a plain Helm chart with no remote pulls during package.
- 11 platform/<name>/chart/values.yaml: prepended catalystBlueprint.upstream.{chart,version,repo} metadata block at the top. Bootstrap installer parses it to know which upstream chart to install with these values.
- products/catalyst/bootstrap/api/internal/bootstrap/bootstrap.go: installCilium now does `helm repo add cilium https://helm.cilium.io --force-update` then `helm install cilium cilium/cilium --version 1.16.5 --values -` (the cilium/cilium upstream chart, with our overlay values piped from values.yaml). Same pattern needs propagating to the other 10 install functions in a follow-up.
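
To illustrate how the `catalystBlueprint.upstream.{chart,version,repo}` fields map onto the install pattern above, the sketch below prints the two helm commands for one wrapper rather than running them. The function name and argument order are hypothetical; the real installer is Go code in bootstrap.go and is not reproduced here.

```shell
#!/usr/bin/env bash
# Sketch only: dry-run printer for the repo-add + install pattern.
# Arguments: <release> <repo-url> <chart> <version> (hypothetical order).
upstream_install_cmds() {
  local release="$1" repo_url="$2" chart="$3" version="$4"
  printf 'helm repo add %s %s --force-update\n' "$release" "$repo_url"
  printf 'helm install %s %s/%s --version %s --values -\n' \
    "$release" "$release" "$chart" "$version"
}

# Values taken from the installCilium example in this commit message.
upstream_install_cmds cilium https://helm.cilium.io cilium 1.16.5
```

For the cilium values shown, the printed lines match the two commands quoted in the list item above.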

After this commit, blueprint-release CI should green-build all 11 wrappers (helm package now works without containerd access since there's nothing to pull). The bootstrap installer's actual `helm install` calls in production reach upstream chart repos via the runtime k3s cluster's pod network, which has full network access.
2026-04-28 12:57:29 +02:00
hatiyildiz
441ebaebb8 fix(charts): pin upstream chart versions/names to ones that exist in their repos
The first Blueprint Release CI run (commit 8c0f766) failed because four chart wrappers referenced upstream chart versions/names that don't exist in their published repositories:

- platform/flux/chart: name was "flux", repo was OCI; actual is name "flux2" in plain helm repo at https://fluxcd-community.github.io/helm-charts. Pinned to 2.13.0.
- platform/openbao/chart: version 2.1.0 was the binary appVersion, not the chart version. Pinned to 0.16.0 chart (which packages openbao 2.1.0 internally).
- platform/keycloak/chart (Bitnami): chart version 25.0.6 was the appVersion of upstream; Bitnami's chart is at 24.7.1 packaging Keycloak 26.0.x. Pinned to 24.7.1.
- platform/nats-jetstream/chart: name was "nats-jetstream"; the upstream chart is named "nats" (it always was — JetStream is a feature of NATS, not a separate chart). Renamed.

Cilium, cert-manager, crossplane, sealed-secrets, spire wrappers were unaffected; their version pins matched upstream availability.

Containerd permission-denied errors from `helm package` on cilium/cert-manager/crossplane/gitea/sealed-secrets are a separate CI plumbing issue (helm tries to pull OCI base images during package build via containerd, but the GitHub Actions runner blocks containerd socket access). Tracked as a follow-up: switch to `helm package --skip-refresh` or use a runner with containerd permissions.

After this commit lands, the next blueprint-release CI run should green-build at minimum the 4 fixed charts. Successful builds publish bp-{flux,openbao,keycloak,nats-jetstream}:1.0.0 OCI artifacts to ghcr.io/openova-io/.
2026-04-28 12:55:21 +02:00
hatiyildiz
8c0f76640c feat(charts): G2 wrapper Helm charts for 11 bootstrap-kit components + blueprint-release CI
Per docs/PROVISIONING-PLAN.md and tickets [F] chart. Adds Catalyst-curated wrapper Helm charts at platform/<name>/chart/ for every component the bootstrap-kit installer (introduced in commit 07b4bcf) needs. Each chart is the canonical bp-<name> source per BLUEPRINT-AUTHORING.md §1's source-location rule.

11 charts created with Chart.yaml + values.yaml + blueprint.yaml each:

Network + GitOps:
- platform/cilium/chart — wraps cilium 1.16.5; kubeProxyReplacement, WireGuard mTLS, Hubble, Gateway API
- platform/flux/chart — wraps flux 2.4.0
- platform/crossplane/chart — wraps crossplane 1.18.0 + provider-hcloud manifest

Security:
- platform/cert-manager/chart — wraps cert-manager 1.16.2 with CRDs+ServiceMonitor
- platform/sealed-secrets/chart — wraps sealed-secrets 2.16.1 (transient bootstrap-only)
- platform/spire/chart — wraps spiffe/spire 1.10.4 (5-min SVID rotation)

Catalyst control-plane services:
- platform/nats-jetstream/chart — wraps nats 2.10.22 (3-node cluster, JetStream + KV)
- platform/openbao/chart — wraps openbao 2.1.0 (3-node Raft, region-local per SECURITY §5)
- platform/keycloak/chart — wraps keycloak 25.0.6 (Bitnami flavor, edge proxy mode)
- platform/gitea/chart — wraps gitea 10.5.0 (CNPG Postgres backend, no chart-bundled valkey/redis since Catalyst control plane uses JetStream)

New platform/ folders (added per AUDIT-PROCEDURE component-count anchor — was 53, now 55):
- platform/spire/README.md — workload identity Catalyst control plane component
- platform/nats-jetstream/README.md — control-plane event spine
- platform/sealed-secrets/README.md — transient bootstrap-only

Each blueprint.yaml declares:
- catalyst.openova.io/v1alpha1 Blueprint kind (canonical CRD per BLUEPRINT-AUTHORING §3)
- visibility: unlisted (mandatory infra, auto-installed by bootstrap kit, not a marketplace card)
- manifests.chart: ./chart pointer
- depends: [] (foundational components have no Blueprint dependencies; control-plane services depend on each other implicitly via bootstrap order, not via Blueprint depends)

.github/workflows/blueprint-release.yaml:
- New CI workflow per BLUEPRINT-AUTHORING §11 (path-matrix per Blueprint folder)
- Triggers on push to main touching platform/*/chart/** or products/*/chart/**
- detect job: emits matrix of changed Blueprint folders via git diff
- build job (per chart): helm dependency build → helm package → helm push to GHCR → cosign keyless sign (GitHub OIDC) → Syft SBOM attestation
- Output: ghcr.io/openova-io/bp-<name>:<semver> with SLSA-3-style supply-chain provenance
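
The detect job's reduction from changed paths to Blueprint folders might look something like this. It is a sketch under assumptions: the function name is hypothetical, the path patterns mirror the workflow triggers listed above, and the real workflow's diff range and matrix encoding are not shown in this commit message.

```shell
#!/usr/bin/env bash
# Sketch only: reduces changed paths (one per line on stdin) to the unique
# Blueprint folders that contain a chart/ change.
changed_blueprints() {
  grep -E '^(platform|products)/[^/]+/chart/' | cut -d/ -f1-2 | sort -u
}

# Runs only inside a git work tree with at least two commits.
if git rev-parse --verify -q HEAD~1 >/dev/null 2>&1; then
  git diff --name-only HEAD~1 HEAD | changed_blueprints || true
fi
```

Changes outside a `chart/` directory, such as a README edit, are filtered out, so they would not start a build job.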

Closes [F] tickets: 11 G2 charts (cilium, cert-manager, flux, crossplane, sealed-secrets, spire, nats-jetstream, openbao, keycloak, gitea, plus the umbrella products/catalyst/chart already exists from Pass 105). blueprint.yaml CRDs added across 11 entries. CI fan-out workflow live.

After this commit lands, the bootstrap-kit installer in commit 07b4bcf has real OCI artifacts to install. The first push to main will trigger 10 build matrix jobs (cilium was created in a separate commit earlier in this session) which produce 10 cosigned bp-<name>:<semver> artifacts on GHCR.

Component-count anchor update follows: 53 → 55 (added spire + nats-jetstream + sealed-secrets — but sealed-secrets was already conceptually counted under "supporting services"). Per AUDIT-PROCEDURE the count needs updating in CLAUDE.md, BUSINESS-STRATEGY, TECHNOLOGY-FORECAST L11. Tracked as separate ticket [K] docs.
2026-04-28 12:51:06 +02:00
hatiyildiz
70fea3ab8f docs(pass-34): banned-term TENANT sweep + keycloak hostname drift
GLOSSARY's banned term "tenant" survived in Configuration tables and Flux
postBuild substitutions across product READMEs as ${TENANT} (uppercase
ENV var). Prior banned-term greps searched lowercase `tenant` so the
ALL-CAPS form slipped through.

Product README fixes:
- products/cortex: TENANT/DOMAIN → ORGANIZATION/SOVEREIGN_DOMAIN, plus
  two DNS placeholder fixes for llm-gateway and chat URLs (same shape
  Pass 25/31 fixed elsewhere).
- products/fingate: 6 instances (Flux substitution, Configuration table,
  4 URL templates) renamed. URL shape api.openbanking.<org>.<sov-dom>
  flagged as 4-segment FQDN that doesn't match NAMING §5.1 or §5.2 —
  deferred to a deeper architectural pass.
- products/fabric: Configuration table row renamed.

Component README:
- platform/keycloak: shared-sovereign hostname auth.<sovereign-domain>
  and per-organization auth.<org>.<sovereign-domain> both missing
  <location-code> per NAMING §5.1. Fixed.

platform/librechat ${TENANT_ID} preserved — that's Microsoft Azure AD
tenant-ID (external technology, exempted by GLOSSARY).

Validation log Pass 34 entry includes meta-note: always run a global
grep for the surfaced drift category before closing a pass, to avoid
the asymmetric-drift problem Pass 25 warned against.
2026-04-27 22:42:50 +02:00
hatiyildiz
b467dc3f3b docs(pass-18): NAMING DR-as-env_type misexample + Keycloak deployment topology
Pass 18 — drift-detection on NAMING-CONVENTION + platform/keycloak.
Two real findings.

NAMING-CONVENTION §11.1:
- The example list of Catalyst Environments included `bankdhofar-dr`
  — but `dr` is NOT a valid env_type. Canonical values per §2.4 are
  prod / stg / uat / dev / poc. DR is a Placement mode
  (active-active / active-hotstandby across regions inside the
  *-prod Environment), not a separate Environment.
- Replaced `bankdhofar-dr` with `bankdhofar-uat` and added an
  explicit "DR is a Placement, not an Env Type" note.

platform/keycloak/README.md:
- Keycloak Deployment YAML example used `namespace: open-banking`
  with 2 replicas — Fingate-specific narrative that contradicted
  the per-Org / per-Sovereign topology stated in the banner.
  Rewrote with two side-by-side examples:
  * shared-sovereign (3 HA replicas, catalyst-keycloak namespace,
    CNPG-backed)
  * per-organization (1 replica in <org> namespace, optional
    embedded DB for smallest SME tier)
- HA section was a single set of claims (2+ replicas, CNPG, Infinispan)
  that only matched corporate. Now branches on topology — corporate
  gets HA + Infinispan, SME gets single replica with
  restart-on-deploy as acceptable for tier SLAs.

Same kind of drift Pass 17 caught in Harbor: banner says one thing,
body still describes the older model. Both fixed.

VALIDATION-LOG: Pass 18 entry added.

Refs #37
2026-04-27 22:00:42 +02:00
hatiyildiz
14ed84de41 docs(pass-8): role-in-Catalyst banners + dead-link fix in component READMEs
Pass 8 — line-by-line read of platform/cnpg, platform/strimzi,
platform/k8gb, platform/keycloak, platform/cert-manager, platform/cilium.

CNPG and Strimzi: read in full and confirmed clean — they correctly
position themselves as Application Blueprints and don't drift from
the canonical model. CNPG's `<org>-postgres-dr` cluster name
(Application-tier database role) is acceptable per NAMING-CONVENTION
§1.3 (which only forbids primary/dr in K8s host-cluster names, not
in Application-internal CRD names).

Four READMEs updated:

k8gb:
- Header reframed: per-host-cluster infrastructure pointer to
  PLATFORM-TECH-STACK §3.1 and SRE §2.4 split-brain protection.
- Removed dead link to
  ../failover-controller/docs/ADR-FAILOVER-CONTROLLER.md (the
  failover-controller folder has no docs/); replaced with link to
  that component's README + SRE §2.4.

keycloak:
- Header reframed from "FAPI Authorization Server for Open Banking"
  (narrow) to "User identity for Catalyst Sovereigns" (broad).
  Keycloak handles ALL user identity in Catalyst, not just FAPI.
- Added per-Org / per-Sovereign topology callout matching SECURITY
  §6. Clarified that "Multi-tenant TPP" refers to PSD2 Third Party
  Providers, not Catalyst's Organization-level multi-tenancy.
- FAPI features kept since Keycloak still serves Fingate as the
  FAPI Authorization Server.

cert-manager:
- Header reframed as per-host-cluster infrastructure with pointer
  to PLATFORM-TECH-STACK §3.3.

cilium:
- Header reframed as per-host-cluster infrastructure with pointer
  to PLATFORM-TECH-STACK §3.1, including the install-first note
  (CNI must come before any other workload during Phase 0).

VALIDATION-LOG: Pass 8 entry added.

Refs #37
2026-04-27 21:39:03 +02:00
talent-mesh
c9d04a53b4 refactor: flatten platform/ structure (41 components)
Remove hierarchical grouping (networking/, security/, etc.) and use flat
structure for all 41 platform components.

Changes:
- All components now directly under platform/ (no subfolders)
- AI Hub components moved from meta-platforms/ai-hub/components/ to platform/
- Open Banking components (lago, openmeter) moved to platform/
- meta-platforms/ now only contains README files that reference platform/
- Open Banking custom services remain in meta-platforms/open-banking/services/

Structure:
- platform/ (41 components, flat)
- meta-platforms/ai-hub/ (README only, references platform/)
- meta-platforms/open-banking/ (README + 6 custom services)

All documentation links updated.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-08 15:19:48 +00:00