openova

Author	SHA1	Message	Date
e3mrah	5690867be8	fix(openbao): make auth-bootstrap Job idempotent on post-upgrade (token already revoked) (#1484 ) bp-openbao 1.2.15 (the HTTPRoute backend-name collapse fix) replayed the `auth-bootstrap` post-install,post-upgrade hook against an already-bootstrapped OpenBao. The hook hit `Error enabling kubernetes auth: 403 permission denied` on `bao auth enable -path=kubernetes kubernetes`, the upgrade failed, and Flux auto-rolled the release back to 1.2.14. Net effect: every chart bump that touches bp-openbao is unrecoverable without manual intervention. Root cause is in the hook itself: at the end of the FIRST run it `bao token revoke -self` + deletes the openbao-root-token Secret content (acceptance criterion #6: no root token persists past install). On any post-upgrade replay, the Secret still mounts via valueFrom but the token value is REVOKED, so every privileged call (`auth enable`, `secrets enable`, `policy write`, `write role`) returns 403. The existing idempotency check (`bao auth list \| grep kubernetes/`) doesn't help because `bao auth list` itself silently 403s and the `\|\| echo "{}"` mask makes the script think the auth method is missing. Fix: add a token-validity gate immediately after the `initialized=true sealed=false` wait. Call `bao token lookup` (zero-cost, strictly read-only on the caller's token). If it 403s, BAO_TOKEN was revoked by a prior successful run — exit 0. The auth method, role, kv backend, and ESO policy are all already configured; nothing for this Job to do on a re-run. Chart bump: bp-openbao 1.2.15 → 1.2.16. Caught live on prov #80 (omantel.biz, 2026-05-14) when bp-openbao 1.2.14 → 1.2.15 was rolled by Flux and immediately failed + rolled back in a loop, blocking bp-newapi's dependsOn and stalling the bootstrap-kit Kustomization. Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-14 19:13:34 +04:00
e3mrah	3d929e69d7	fix(httproute): collapse double-prefix when releaseName contains chart name (gitea/harbor/openbao 500/404) (#1483 ) * fix(tls): cilium-gateway-cert STAGING/PROD issuer selectable via tofu clusters/_template/sovereign-tls/cilium-gateway-cert.yaml hardcoded letsencrypt-dns01-prod-powerdns regardless of qa_test_session_enabled. On high-cadence QA reprov cycles this hits the LE PROD 5/168h rate limit (caught on prov #76 at 13:45 UTC, retry-after 16:49 UTC) and the wildcard Certificate sticks Ready=False — Cilium Gateway has no valid TLS secret → envoy listener never binds → public TLS handshake to console.<fqdn> dies with SSL_ERROR_SYSCALL. Add tofu local.wildcard_cert_issuer = qa_test_session_enabled ? staging : prod. Thread WILDCARD_CERT_ISSUER through the sovereign- tls Kustomization postBuild.substitute. cilium-gateway-cert.yaml references it as ${WILDCARD_CERT_ISSUER}. Default behaviour unchanged for non-QA (production) Sovereigns — they still resolve to letsencrypt-dns01-prod-powerdns. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(cilium-gateway): allow world ingress to Cilium Gateway reserved:ingress endpoint When Cilium Gateway API runs with gatewayAPI.hostNetwork.enabled=true and a default-deny CCNP is present, every public request to a Sovereign host (console, auth, gitea, registry, api, ...) hits the gateway listener and gets DENIED at envoy's cilium.l7policy filter with: cilium.l7policy: Ingress from 1 policy lookup for endpoint X for port 30443: DENY Public response: HTTP/1.1 403 Forbidden, body "Access denied", server: envoy. Root cause: Cilium creates a special endpoint with identity reserved:ingress (8) representing the gateway listener. By default this endpoint has policy-enabled=both with allowed-ingress-identities=[1 (host)] and empty L4 rules — so no port is permitted. The default-deny CCNP's NotIn-namespace endpointSelector does NOT cover this endpoint (it has no io.kubernetes.pod.namespace label), and our qa-fixtures didn't ship a matching allow-template for it. Net effect: TLS handshake succeeds, HTTPRoutes are Programmed, backends are healthy in-cluster, but every request 403s. Caught live on prov #80 (omantel.biz, 2026-05-14) after the Gateway hostNetwork fix (#1480) finally activated host-bind on :30443. Verified by: - envoy debug log: cilium.l7policy DENY for endpoint 10.42.0.201 port 30443 - cilium-dbg endpoint get 3282 -o json: l4.ingress: [] and allowed-ingress-identities: [1] - transiently applying the same CCNP via kubectl: console.omantel.biz → 200 Fix: ship a CCNP scoped to reserved:ingress that allows ingress from world, cluster, host, remote-node (multi-region CP-to-CP), and kube-apiserver, plus egress to all so envoy can forward to any backend service. This is the canonical Cilium hostNetwork Gateway-API zero-trust pattern. Chart bump: catalyst 1.4.142 → 1.4.143. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(httproute): match upstream chart fullname-collapse when releaseName contains chart name Three Sovereign-facing HTTPRoute templates (gitea, harbor, openbao) had backend defaults hardcoded as `<release>-<chart>-<resource>` (e.g. `gitea-gitea-http`, `harbor-harbor-core`, `openbao-openbao`). The upstream subcharts use a `<chart>.fullname` helper that COLLAPSES the prefix when `.Release.Name` already contains the chart name — i.e. when the bootstrap-kit releaseName is the chart name (the convention), the live Service is `<release>-<resource>` (or just `<release>` for openbao), not `<release>-<chart>-<resource>`. Effect on prov #80 (omantel.biz): - gitea/gitea HTTPRoute → backendRef `gitea-gitea-http` (does not exist; live is `gitea-http`) → BackendNotFound → gitea.omantel.biz returns HTTP 500 - harbor/harbor HTTPRoute → `harbor-harbor-core` (live is `harbor-core`) → registry.omantel.biz returns HTTP 500 - openbao/openbao HTTPRoute → `openbao-openbao` (live is `openbao`) → bao.omantel.biz dead Fix: replicate the upstream chart's `.fullname` collapse logic via `(ternary .Release.Name (printf "%s-<chart>" .Release.Name) (contains "<chart>" .Release.Name))` so the default backend always matches the live Service name regardless of releaseName choice. Operators retain the `gateway.backendService` override for non-standard release names. Chart bumps: bp-gitea 1.2.6 → 1.2.7, bp-harbor 1.2.16 → 1.2.17, bp-openbao 1.2.14 → 1.2.15. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: e3mrah <catalyst@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>	2026-05-14 19:00:07 +04:00
e3mrah	ab67a48fe7	fix(blueprints): align blueprint.yaml spec.version with Chart.yaml version (#817 ) (#819 ) TestBootstrapKit_BlueprintCardsHaveRequiredFields was failing on main for 9 blueprints because their platform/<name>/chart/Chart.yaml version had been bumped without a matching update to platform/<name>/blueprint.yaml spec.version. The pre-existing failure forced 7 recent PRs to self-merge with --admin, masking real CI failures. Aligned spec.version to match Chart.yaml version on: cert-manager 1.1.1 -> 1.1.2 flux 1.1.3 -> 1.1.4 crossplane 1.1.3 -> 1.1.4 sealed-secrets 1.1.1 -> 1.1.2 spire 1.1.4 -> 1.1.7 nats-jetstream 1.1.1 -> 1.1.2 openbao 1.2.0 -> 1.2.14 keycloak 1.3.1 -> 1.3.2 gitea 1.2.1 -> 1.2.3 Verified locally: $ go test ./... -run TestBootstrapKit_BlueprintCardsHaveRequiredFields -count=1 --- PASS: TestBootstrapKit_BlueprintCardsHaveRequiredFields (0.01s) ... all 10 sub-tests pass (cilium + the 9 above) The existing test (tests/e2e/bootstrap-kit/main_test.go:145) is itself the drift guardrail: it fails CI whenever Chart.yaml is bumped without a matching blueprint.yaml bump. No additional script needed. Closes #817 once verified on main. Co-authored-by: Hatice Yildiz <hatice.yildiz@openova.io>	2026-05-04 22:32:49 +04:00
e3mrah	a8bcb773c9	fix(bp-openbao): add BAO_TOKEN+NAMESPACE env to auth-bootstrap (chart 1.2.14) (#666 ) PR #663 added the revoke logic at the bottom of the script but the companion env-block additions (BAO_TOKEN sourced from openbao-root-token Secret, NAMESPACE from fieldRef) somehow never landed in the merged diff — only the trailing revoke + DELETE block did. Result on otech44: openbao-root-token Secret IS being created by init-job (PR #663's other half worked), but auth-bootstrap pod env ends at TOKEN_MAX_TTL with no BAO_TOKEN, so 'bao auth enable kubernetes' hits 403 Forbidden again — the exact same failure that PR #663 was supposed to fix. This PR adds the missing env declarations. Co-authored-by: hatiyildiz <hatiyildiz@openova.io>	2026-05-03 14:02:34 +04:00
e3mrah	561439b6c2	fix(bp-openbao): wire root_token init→auth-bootstrap (chart 1.2.13) (#663 ) Caught live on otech43 after chart 1.2.12 fixed the persist gap and auth-bootstrap finally ran: 'Error enabling kubernetes auth ... Code: 403 permission denied'. The auth-bootstrap Job had no BAO_TOKEN and was making unauthenticated bao API calls. Three coordinated changes: 1. init-job.yaml: after bao operator init succeeds and ROOT_TOKEN is extracted, POST a transient Secret openbao-root-token with the token in data.token. Already-exists (409) is treated as idempotent-re-run, anything else fails the Job loud (was silent before, hid the bug). 2. auth-bootstrap-job.yaml: BAO_TOKEN env sourced via secretKeyRef from openbao-root-token. After running auth enable / secrets enable / policy write / role bind, revoke the token via 'bao token revoke -self' AND attempt DELETE on the Secret. (busybox wget --method=DELETE may silently no-op; the bao-side revoke is the load-bearing acceptance-criterion-6 mechanism.) 3. auto-unseal-rbac.yaml: openbao-root-token added to the mutation rule's resourceNames so the SA can GET/PATCH/UPDATE/DELETE it. Create is already unrestricted from chart 1.2.10's RBAC split. Co-authored-by: hatiyildiz <hatiyildiz@openova.io>	2026-05-03 12:55:13 +04:00
e3mrah	be9b5ca5bf	fix(bp-openbao): wc -l counts 0 for single-key without trailing newline (1.2.12) — TRUE root cause (#662 ) Caught live on otech42 with chart 1.2.11's per-pod logs: + bao operator init -key-shares=1 -key-threshold=1 -format=json [openbao-init] FATAL: extracted 0 unseal key(s) but threshold=1 key-shares=1 → no comma → tr ',' '\n' is no-op → final sed produces single line WITHOUT trailing newline → wc -l counts 0. Every prior loop attributed to RBAC/wget was a downstream symptom. Fix: append 'awk 1' for trailing newline, swap wc -l for grep -c . Co-authored-by: hatiyildiz <hatiyildiz@openova.io>	2026-05-03 12:28:50 +04:00
e3mrah	7bd9aae89b	diag(bp-openbao): restartPolicy: Never (chart 1.2.11) — preserve fresh-init pod logs (#661 ) OnFailure restarts the SAME container in the SAME pod, and only the MOST RECENT failed container's logs are kubectl-loggable. The first attempt's logs (where the FRESH path runs and the persist gap lives) are reaped before later restarts can be inspected. Switching to Never makes each retry a separate Pod via Job's backoffLimit replay. Every failed pod is independently inspectable with kubectl logs <pod> until ttlSecondsAfterFinished tears it down. Combined with chart 1.2.9's openbao-init-trace Secret upload (POST now succeeds with 1.2.10's RBAC split), the fresh-path failure point becomes definitively observable. Co-authored-by: hatiyildiz <hatiyildiz@openova.io>	2026-05-03 12:13:23 +04:00
e3mrah	b5fee168b5	fix(bp-openbao): split RBAC for create verb (chart 1.2.10) — root cause of unseal-keys never persisted (#660 ) The openbao-auto-unseal Role granted 'create' on Secrets with resourceNames set. Kubernetes RBAC doesn't enforce resourceNames on the create verb (the resource has no name at admission time, so there's nothing to filter), but the kube-apiserver still REJECTS the request because the rule's effective verbs[create]+resourceNames combo doesn't match the bare 'create secrets' permission check. Result: every init Job POST returned 403 Forbidden. The script then fell through to the PUT branch, which silently failed because BusyBox wget (the openbao image's only HTTP client) has no --method flag. Both calls non-zero → script exited 1 with FATAL 'cannot persist'. The first init's logs got reaped before later restarts could be inspected, so the FATAL was never visible — the retries all hit the idempotent FATAL ('vault is sealed but the unseal-keys Secret is missing') with no record of why. Caught live on otech40 with chart 1.2.9's trace upload + a wget auth-can-i probe: kubectl auth can-i create secrets --as=...openbao-auto-unseal → no kubectl auth can-i create secret/openbao-unseal-keys ... → yes Fix: split into two rules per the k8s RBAC pattern. rule 1: verbs[create] WITHOUT resourceNames (allows POST) rule 2: verbs[get,patch,update,delete] WITH resourceNames (mutation stays scoped to known names) This unblocks every fresh Sovereign provisioning. Each subsequent run hits the idempotent path (GET on openbao-unseal-keys → 200) and unseals automatically — no operator intervention. Co-authored-by: hatiyildiz <hatiyildiz@openova.io>	2026-05-03 11:55:05 +04:00
e3mrah	09e56f1e47	diag(bp-openbao): persist init script trace to Secret across restarts (1.2.9) (#659 ) otech38/39 confirmed: openbao reaches Initialized=true on the first init pod attempt but the unseal-keys Secret is never persisted. The fresh-init container's logs are reaped before subsequent restarts' idempotent FATAL allows them to be inspected, so we keep flying blind on the actual failure point. This change tees every line of the init script (set -x trace + every echo) into /tmp/.script.trace and uploads it to a per-namespace Secret 'openbao-init-trace' on EXIT (success OR failure). The Secret survives Pod recreation and any Job retry; the operator can read it with kubectl after the next provision and see exactly where the fresh-path script exited. Adds 'openbao-init-trace' to the openbao-auto-unseal Role's resourceNames so the Job SA can PUT/POST it. Co-authored-by: hatiyildiz <hatiyildiz@openova.io>	2026-05-03 11:38:54 +04:00
e3mrah	5f6d1c7d86	diag(bp-openbao): add set -x to init script (chart 1.2.8) (#658 ) otech37/38 hit the same wall: server reaches Initialized=true but openbao-unseal-keys Secret is never persisted; the FIRST init pod's logs that ran fresh init are reaped by container restart before we can capture what happened. Add 'set -x' to shell-trace every command. Now even if the script crashes mid-run, pod logs show the last command attempted. The captured diagnostic on the next provision will tell us whether the failure is in /tmp/init-output.json parsing, the persist wget, or elsewhere. Co-authored-by: hatiyildiz <hatiyildiz@openova.io>	2026-05-03 11:09:05 +04:00
e3mrah	8447930bf7	fix(bp-openbao): fail-fast on unseal-keys persist (chart 1.2.7) (#657 ) * fix(bp-harbor): grep-oE for password (multi-line tolerant) (chart 1.2.13) * fix(wizard): blueprint deps from Flux HelmRelease.dependsOn (single source of truth) The wizard's componentGroups.ts carried hand-maintained `dependencies: [...]` arrays that deviated from the real Flux install graph in clusters/_template/bootstrap-kit/.yaml. Examples (otech34 surfaced this): componentGroups.ts Flux HelmRelease.dependsOn ---------------------- --------------------------- keycloak: [cnpg] keycloak: [cert-manager, gateway-api] openbao: [] openbao: [spire, gateway-api, cnpg] harbor: [cnpg, seaweedfs, harbor: [cnpg, cert-manager, valkey] gateway-api] Founder's directive: "all the real dependencies are related to real flux related dependencies, if you are hosting irrelevant hardcoded baseless wizard catalog dependencies, I dont know where they are coming from. The single source of truth for the dependencies is flux!!!" — 2026-05-03 This commit: 1. Adds scripts/generate-blueprint-deps.sh that parses every bootstrap-kit HelmRelease and emits blueprint-deps.generated.json keyed by bare component id (bp- prefix stripped on both source and target side). 2. Commits the generated JSON. 3. Adds products/catalyst/bootstrap/ui/src/data/blueprintDeps.ts thin TS wrapper exporting BLUEPRINT_DEPS + depsFor(id). 4. Patches componentGroups.ts so every RAW_COMPONENT's `dependencies` field is OVERRIDDEN at module load with the Flux-canonical list (the inline `dependencies: [...]` literals are now ignored — Flux is canonical). Follow-ups (not in this PR): - CI drift check that re-runs the script and diffs the JSON. - Strip the inline `dependencies: [...]` arrays entirely once the drift check is green. - Wire the FlowPage edge-rendering to match. Co-authored-by: hatiyildiz <hatiyildiz@openova.io> fix(flowpage): replace second hardcoded BOOTSTRAP_KIT_DEPS table with Flux SoT PR #652 fixed the wizard catalog. FlowPage.tsx had a SECOND independent hardcoded dep map at lines 105-155 that the founder caught — most visibly: keycloak: ['cert-manager', 'openbao'] ← FALSE; Flux says no openbao The reason the founder kept seeing the spurious arrow on the Flow page. Replace the local table with an import of BLUEPRINT_DEPS from data/blueprintDeps.ts (single source of truth — generated from clusters/_template/bootstrap-kit/.yaml by scripts/generate-blueprint-deps.sh). Co-authored-by: hatiyildiz <hatiyildiz@openova.io> fix(jobs): don't regress status to pending after exec started helmwatch_bridge.go's OnHelmReleaseEvent unconditionally overwrote the Job's Status with jobStatusFromHelmState(state) on every event. Flux oscillates HelmReleases between Reconciling and DependencyNotReady while a dependency (e.g. bp-openbao waiting on bp-spire) isn't Ready — helmwatch maps both back to HelmStatePending. The bridge then flips the row to status='pending' even though an active Execution is streaming exec log lines (startedAt + latestExecutionId already set). Founder caught this on otech34's install-external-secrets job: status='pending' on the Jobs page while Exec Log was actively tailing. Fix: monotonic guard — once activeExecID[component] != "" (Execution allocated), refuse to regress nextStatus to StatusPending. Treat ongoing-after-start as Running so the row reflects the live stream. Co-authored-by: hatiyildiz <hatiyildiz@openova.io> * fix(jobs): cascade Failed status through dependsOn (fail-fast) Founder caught on otech34: install-openbao=failed but install-external-secrets stayed pending forever ('masking it and waiting unnecessarily'). Flux's HelmRelease for external-secrets is in DependencyNotReady, helmwatch maps that to StatePending, bridge writes Status=pending — no signal that the upstream FAILED rather than 'still installing'. Add a post-rollup sweep in deriveTreeView that propagates Failed through the dependsOn graph. Up to 8 sweeps cover the deepest bootstrap-kit chain. Idempotent on read; reverses if openbao recovers because it operates on the live snapshot. Co-authored-by: hatiyildiz <hatiyildiz@openova.io> * fix(infra): bump kernel inotify limits — bp-openbao init was crashing 'too many open files' Diagnosed live during otech35: openbao-init pod crash-looped 4× on 'bao operator init' with: failed to create fsnotify watcher: too many open files Flux mapped to InstallFailed → RetriesExceeded → cascading through external-secrets and external-secrets-stores. The wizard masked the OS-level root cause behind a generic InstallFailed. Hetzner Ubuntu 24.04 ships fs.inotify.max_user_instances=128 — far too low for a 35-component bootstrap-kit (k3s kubelet + Flux helm- controller + 11 CNPG operators + Reflector + Cert-Manager + bao + keycloak-config-cli + ... each grabs instance slots). The instance count exhausts within minutes; the next process to ask for an inotify slot gets EMFILE. Bump well above k8s/k3s production guidance so future blueprints don't tickle the same wall: fs.inotify.max_user_instances = 8192 fs.inotify.max_user_watches = 1048576 fs.inotify.max_queued_events = 16384 Applied via /etc/sysctl.d/99-catalyst-inotify.conf + 'sysctl --system' in runcmd. Permanent across reboots. Co-authored-by: hatiyildiz <hatiyildiz@openova.io> * fix(bp-openbao): fail-fast when unseal-keys persist fails (chart 1.2.7) otech37 caught: bao operator init succeeded server-side (Initialized=true), but the script's wget POST to persist openbao-unseal-keys Secret silently failed (\|\| true), and the PUT fallback also silenced. Subsequent Job retries hit Initialized=true on the idempotent path, found no openbao-unseal-keys Secret, and FATAL'd with 'manual recovery: wipe data-openbao-0 PVC' — every retry forever. Hardening: 1. Capture POST + PUT stdout/stderr to /tmp files instead of /dev/null so the FATAL path can echo them. 2. PUT no longer \|\| true — if both POST and PUT fail, exit 1. 3. Add read-back verification: GET the persisted Secret and assert 'unseal-keys-b64' field is present. Catches partial-write / eventual-consistency cases. Bumps chart 1.2.6 -> 1.2.7 and bootstrap-kit reference. Co-authored-by: hatiyildiz <hatiyildiz@openova.io> --------- Co-authored-by: hatiyildiz <hatiyildiz@openova.io>	2026-05-03 10:51:21 +04:00
e3mrah	da61ecdc79	test(bp-openbao): align test expectation with #600 RBAC-hook removal (#647 ) * fix(wizard): SOLO default CPX42 → CPX52 (8→12 vCPU / 16→24 GB) CPX42 fit 30/40 HRs on otech29 but keycloak-keycloak-config-cli post-upgrade Job sat Pending 8h with 'Insufficient cpu' — 35-component bootstrap-kit + post-install hooks at peak exceed 8 vCPU. CPX52 (12 vCPU / 24 GB / €36/mo) is the smallest SKU that schedules every default Pod on one node. Co-authored-by: hatiyildiz <hatiyildiz@openova.io> * test(bp-openbao): align Case-4 expectation with #600 RBAC-hook removal Commit `b1a25c42` (#600) removed the helm.sh/hook-delete-policy from the auto-unseal SA/Role/RoleBinding so Helm does NOT reap them mid-install (the old hook-succeeded clause caused the SA to disappear before the init Job could mount its token). The chart-test still expected ≥5 before-hook-creation,hook-succeeded annotations (3 RBAC + 2 Jobs). Result: Blueprint Release for #600 (run 25251129679) failed at the test gate — bp-openbao 1.2.6 was NEVER published to GHCR, even though main already references it. otech30 caught this live: bp-openbao HR stuck with 'oci://ghcr.io/openova-io/bp-openbao:1.2.6: not found'. Update the test to expect ≥2 (Jobs only). Re-publish gets bp-openbao 1.2.6 onto GHCR. Co-authored-by: hatiyildiz <hatiyildiz@openova.io> --------- Co-authored-by: hatiyildiz <hatiyildiz@openova.io>	2026-05-03 08:46:31 +04:00
e3mrah	b1a25c4235	fix(bp-keycloak,bp-openbao): HTTPRoute backend wrong name + RBAC hook lifecycle bug (#598 ) (#600 ) Bug A — bp-keycloak@1.2.2: HTTPRoute backendService default was `<release>-keycloak` (gave `keycloak-keycloak` with releaseName=keycloak) but bitnami's fullname helper trims the chart-name suffix when Release.Name already contains it, so the Service is just `keycloak`. Changed default to `.Release.Name`. Sovereign realm was already imported (config-cli ran successfully) — only the Gateway routing was broken, returning HTTP 500. Bug B — bp-openbao@1.2.6: auto-unseal-rbac SA/Role/RoleBinding had `helm.sh/hook-delete-policy: before-hook-creation,hook-succeeded`. The `hook-succeeded` clause caused Helm to delete the SA immediately after the weight-0 RBAC hook completed, before the weight-5 init Job pod could mount its SA token and start. Removed all hook annotations from the RBAC resources so they are managed by regular Helm release lifecycle (created before hooks, never deleted mid-install). Bootstrap-kit refs bumped: bp-keycloak 1.2.0→1.2.2, bp-openbao 1.2.4→1.2.6. Verified on otech22 (manual remediation): Keycloak sovereign realm OIDC endpoint returns valid JSON, openbao-0 Initialized=true Sealed=false. Co-authored-by: alierenbaysal <alierenbaysal@openova.io> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-02 15:43:32 +04:00
e3mrah	ad9cfc0f23	feat(platform): add global.imageRegistry to bp-openbao/external-secrets/cnpg/valkey/nats-jetstream/powerdns/gitea (PR 2/3, #560 ) (#565 ) Charts with template image refs (fully rewritten when registry set): - bp-openbao 1.2.4→1.2.5: init-job.yaml + auth-bootstrap-job.yaml — Catalyst job images now prefixed with global.imageRegistry when non-empty. Default (empty) renders identical manifests. - bp-powerdns 1.1.5→1.1.6: dnsdist.yaml Catalyst companion image prefixed with global.imageRegistry when non-empty. Verified: dnsdist image rewrites to harbor.openova.io/docker.io/powerdns/dnsdist-19:1.9.14. Subchart-only charts (global.imageRegistry stub added; threading via per-component subchart values.yaml keys documented in comments): - bp-external-secrets 1.1.0→1.1.1 - bp-cnpg 1.0.0→1.0.1 (charts/ missing = pre-existing state, not this PR) - bp-valkey 1.0.0→1.0.1 (charts/ missing = pre-existing state, not this PR) - bp-nats-jetstream 1.1.1→1.1.2 - bp-gitea 1.1.2→1.1.3: upstream chart exposes gitea.image.registry for wiring vcluster: N/A — no chart directory under platform/vcluster/chart/ Co-authored-by: alierenbaysal <alierenbaysal@openova.io> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-02 12:52:43 +04:00
e3mrah	8cde771c0f	fix(bp-openbao): unseal on idempotent path + persist keys (Closes #539 ) (#540 ) PR #528 added unseal logic but only on the FRESH-init branch. When a previous Job pod completed `bao operator init` but exited before the unseal block (or when openbao-0 simply restarts under shamir seal), the next reconcile takes the "already initialized" branch and exits without ever running `bao operator unseal`. Symptom on otech21: init-job logs end with `auto-unseal init complete`, but `bao status` reports Initialized=true Sealed=true forever, the bp-openbao HR stays Unknown/Running for the full 15m install timeout, and bp-external-secrets/bp-external-secrets-stores block on the dep. Fix has two parts: 1. Persist `unseal_keys_b64` on fresh init to a new K8s Secret `openbao-unseal-keys` (BEFORE applying the keys, so a unseal crash mid-step is recoverable on next retry). 2. Add a Step 2a "idempotent-path unseal" branch: when bao reports Initialized=true Sealed=true, fetch the persisted keys Secret and apply unseal exactly the same way Step 3a does on fresh init. Verify Sealed=false and exit; otherwise FATAL with the manual-recovery pointer. RBAC: extend the openbao-auto-unseal Role to allow create/get/ patch/update on openbao-unseal-keys (alongside openbao-init-marker). Chart bump 1.2.3 → 1.2.4. HR ref in clusters/_template/bootstrap-kit/08-openbao.yaml updated to match so cloud-init-templated Sovereigns pick up the new chart. Co-authored-by: e3mrah <emrah.baysal@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-02 10:44:46 +04:00
e3mrah	d90abb1e85	fix(bp-openbao): unseal vault after init in chart Job (Closes #527 ) (#528 ) The init Job ran `bao operator init -key-shares=1 -key-threshold=1` which leaves the cluster Initialized=true but Sealed=true. Without an explicit `bao operator unseal <key>` call the StatefulSet pod stays sealed forever, the bp-openbao HelmRelease never reports Ready=True, and every dependent blueprint (bp-external-secrets, bp-external-secrets-stores) blocks on this dep. This was the 5th and final latent bug in the chart's auto-unseal flow (after PRs #518 #520 #523 #524 #525). On otech17 (6b17518f12d529ea, 2026-05-02) the init Job completed cleanly but `bao status` reported Sealed=true forever. Fix: parse `unseal_threshold` and `unseal_keys_b64` from the init JSON, call `bao operator unseal <key>` $threshold times (1 with the current key-shares=1 / key-threshold=1 config), then assert `bao status -format=json \| grep '"sealed":false'` before the Job exits success. Bumps chart 1.2.2 -> 1.2.3 and HR ref in clusters/_template/bootstrap-kit/08-openbao.yaml. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-02 09:24:57 +04:00
e3mrah	ba5a1929f1	fix(bp-openbao): use shamir-compatible init flags + bump 1.2.1→1.2.2 (refs #517 ) (#525 ) The chart's init Job called `bao operator init -recovery-shares=1 -recovery-threshold=1` which only works with auto-unseal seal types (gcpckms/awskms/transit). The upstream openbao chart's default config uses `seal "shamir"` (no auto-unseal stanza in values.standalone.config / values.ha.config), so the OpenBao API returns 400: "parameters recovery_shares,recovery_threshold not applicable to seal type shamir". Switch to -key-shares=1 -key-threshold=1 which is the correct shamir- seal init flags. Operators wiring auto-unseal seals later will need to flip back via a chart-values toggle. Bumps chart 1.2.1→1.2.2 + matches HR ref so Sovereigns pull the new artifact on next reconcile. Refs #517 Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-02 09:14:05 +04:00
e3mrah	6e3d3d281e	fix(bp-openbao): bump chart 1.2.0→1.2.1 + HR ref for busybox-wget fix (refs #517 ) (#524 ) Bumps platform/openbao/chart/Chart.yaml version to 1.2.1 carrying the busybox-compatible wget flag fix (PR #523). Also bumps the HR's chart.spec.version in clusters/_template/bootstrap-kit/08-openbao.yaml so Sovereigns pull the new bytes once blueprint-release publishes ghcr.io/openova-io/bp-openbao:1.2.1. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-02 09:09:06 +04:00
e3mrah	5c0618d920	fix(bp-openbao): use busybox-compatible wget flag in init Job (refs #517 ) (#523 ) The chart's init Job runs inside the openbao image (quay.io/openbao/ openbao:2.1.0) which uses busybox wget. The script's wget calls used `--ca-certificate=$CACERT` which busybox wget does not support, causing wget to print its usage page and fail with "seed Secret has no key recovery-seed" (false negative — the parsing pipeline saw the usage text instead of JSON). Replace with `--no-check-certificate`. The Secret still requires the Bearer token for auth — the lack of CA verification only affects TLS handshake validation against an in-cluster API server reached via the well-known kubernetes.default.svc DNS name (out-of-band attack surface is negligible inside the pod network). The `--method=DELETE` line for cleaning up the seed Secret remains — busybox wget doesn't support method override either, but that line is wrapped in `\|\| true` so the seed deletion failure doesn't block the init Job from succeeding. Seed is single-use anyway and harmless post-init (the recovery key is the OUTPUT of bao operator init, not this seed). Refs #517 Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-02 09:07:52 +04:00
e3mrah	d2ada908c9	feat(bp-openbao): auto-unseal flow — cloud-init seed + post-install init Job (closes #316 ) (#408 ) Catalyst-curated auto-unseal pipeline for OpenBao on Hetzner Sovereigns (no managed-KMS available). Selected Option A — Shamir + cloud-init seed because: - Hetzner has no managed-KMS service → Cloud-KMS auto-unseal (Option C) is structurally unavailable. - Transit-seal (Option B) requires a peer OpenBao cluster, only applicable to multi-region tier-1; out of scope for single-region omantel. - Manual unseal (Option D) violates the "first sovereign-admin lands on console.<sovereign-fqdn> ready to use" goal in SOVEREIGN-PROVISIONING.md §5. Architecture (per issue #316 spec + acceptance criteria 1-6): 1. Cloud-init on the control-plane node generates a 32-byte recovery seed from /dev/urandom and writes it to a single-use K8s Secret `openbao-recovery-seed` in the openbao namespace, with annotation `openbao.openova.io/single-use: "true"`. Pre-creates the openbao namespace to eliminate the race with Flux's HelmRelease apply. 2. bp-openbao chart v1.2.0 ships two new Helm post-install hooks: - `templates/init-job.yaml` (hook weight 5): consumes the seed, calls `bao operator init -recovery-shares=1 -recovery-threshold=1`, persists the recovery key inside OpenBao's auto-unseal config, deletes the seed Secret on success. Idempotent — re-runs detect Initialized=true and exit 0. - `templates/auth-bootstrap-job.yaml` (hook weight 10): enables the Kubernetes auth method, mounts kv-v2 at `secret/`, writes the `external-secrets-read` policy, binds the `external-secrets` role to the ESO ServiceAccount in `external-secrets-system`. 3. `templates/auto-unseal-rbac.yaml` declares the least-privilege SA + Role + RoleBinding the Jobs need (Secret get/list/delete in the openbao namespace; create/get/patch on the openbao-init-marker). Also emits the permanent `system:auth-delegator` ClusterRoleBinding bound to the OpenBao ServiceAccount so the Kubernetes auth method can call tokenreviews.authentication.k8s.io. 4. Cluster overlay `clusters/_template/bootstrap-kit/08-openbao.yaml` bumps version 1.1.1 → 1.2.0 and flips `autoUnseal.enabled: true` per-Sovereign. Per #402 lesson: skip-render pattern (`{{- if .Values.X }}{{ emit }} {{- end }}`) used throughout — never `{{ fail }}`. Default `helm template` render emits NOTHING new; opt-in via autoUnseal.enabled=true. Acceptance criteria coverage: 1. Provision fresh Sovereign — cloud-init writes seed, Flux installs bp-openbao 1.2.0, post-install Jobs run automatically. ✅ 2. bp-openbao HR Ready=True without manual intervention — install keeps `disableWait: true` (Helm Ready ≠ OpenBao initialised; the init Job drives initialisation out-of-band on the same install). ✅ 3. `bao status` shows Sealed=false, Initialized=true within 5 minutes — init Job polls + retries up to 60×5s. ✅ 4. ESO ClusterSecretStore vault-region1 reaches Status: Valid — the auth-bootstrap Job binds the `external-secrets` role to ESO's SA before the Job exits. ✅ 5. Seed Secret deleted post-init — init Job deletes it via K8s API after consuming. ✅ 6. No openbao-root-token Secret in K8s — root token captured to /tmp/.root-token in the Job pod's tmpfs only; never written to a K8s Secret. The recovery key persists ONLY inside OpenBao's Raft state (auto-unseal config). ✅ Tests: - tests/auto-unseal-toggle.sh — 4 cases: * default render → no auto-unseal artefacts (skip-render works) * autoUnseal.enabled=true → both Jobs + correct hook weights * kubernetesAuth.enabled=false → init Job only, no auth-bootstrap * idempotency annotations present on all 5 hook objects - tests/observability-toggle.sh — unchanged, all 3 cases green. - helm lint . — clean. Files: - platform/openbao/chart/Chart.yaml — version 1.1.1 → 1.2.0 - platform/openbao/blueprint.yaml — version 1.1.1 → 1.2.0 - platform/openbao/chart/values.yaml — `autoUnseal.*` block - platform/openbao/chart/templates/auto-unseal-rbac.yaml — new - platform/openbao/chart/templates/init-job.yaml — new - platform/openbao/chart/templates/auth-bootstrap-job.yaml — new - platform/openbao/chart/tests/auto-unseal-toggle.sh — new - platform/openbao/README.md — bootstrap procedure §2-3 expanded; auto-unseal alternatives table added. - clusters/_template/bootstrap-kit/08-openbao.yaml — chart 1.1.1 → 1.2.0, autoUnseal.enabled=true. - infra/hetzner/cloudinit-control-plane.tftpl — seed-token block inserted between ghcr-pull-secret apply and flux-bootstrap apply. - docs/omantel-handover-wbs.md §9 — #316 ticked chart-released. Canonical seam used: extended existing `platform/openbao/chart/` per the anti-duplication rule. NO standalone scripts. NO bespoke Go cloud calls. NO `{{ fail }}`. All knobs configurable via values.yaml per INVIOLABLE-PRINCIPLES.md #4 (never hardcode). Co-authored-by: hatiyildiz <hat.yil@gmail.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 16:45:44 +04:00
e3mrah	a1bd550208	fix(charts): HTTPRoute templates skip-render on missing host (was failing default-values render) (#402 ) Blueprint-release for #401 failed because HTTPRoute templates use {{- fail }} when gateway.host is not set, which trips the chart default-values render gate in CI. Switched 6 templates from 'fail loud' to 'skip render': if .Values.gateway.host → emit HTTPRoute else → emit nothing The Gateway API admission already rejects HTTPRoute with empty hostnames, so the loud-fail wasn't buying anything an operator wouldn't see at apply time. Default-values render now produces zero HTTPRoute resources, which is the correct shape for the upstream chart consumers that don't set the Sovereign-only gateway block. Files: keycloak, gitea, openbao, grafana, harbor, catalyst-platform. Verified: helm template t products/catalyst/chart/ → 0 HTTPRoutes (clean) helm template t products/catalyst/chart/ --set ingress.gateway.enabled=true --set ingress.hosts.console.host=console.test --set ingress.hosts.api.host=api.test → 2 HTTPRoutes Closes the blueprint-release failure on commit `abf01b6f`. Co-authored-by: hatiyildiz <hati@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 16:23:58 +04:00
e3mrah	abf01b6f21	feat(platform): Gateway API migration audit (#387 ) (#401 ) Migrates every minimal-Sovereign-set blueprint chart from networking.k8s.io/v1.Ingress to gateway.networking.k8s.io/v1.HTTPRoute, replacing the legacy Traefik-on-Sovereigns assumption with the canonical Cilium + Envoy + Gateway API path per ADR-0001 §9.4 and the WBS §2 correction note (#388). The single per-Sovereign Gateway is added as additional documents in the existing bootstrap-kit slot clusters/_template/bootstrap-kit/01-cilium.yaml (NOT a new top-level slot), since Cilium owns the GatewayClass. It includes: - Certificate `sovereign-wildcard-tls` requesting `.${SOVEREIGN_FQDN}` from `letsencrypt-dns01-prod` (cert-manager + #373 webhook) - Gateway `cilium-gateway` in `kube-system` with HTTPS (443, TLS terminate) + HTTP (80) listeners, allowedRoutes.namespaces.from=All Per-blueprint HTTPRoute templates (canonical seam: each wrapper chart's existing `templates/` directory): \| Blueprint \| Host pattern \| Backend port \| \|---------------------\|---------------------------------\|--------------\| \| bp-keycloak \| auth.<sov> \| 80 \| \| bp-gitea \| git.<sov> \| 3000 \| \| bp-openbao \| bao.<sov> \| 8200 \| \| bp-grafana \| grafana.<sov> \| 80 \| \| bp-harbor \| registry.<sov> \| 80 \| \| bp-powerdns \| pdns.<sov>/api (dual-mode) \| 8081 \| \| bp-catalyst-platform\| console.<sov>, api.<sov> \| 80, 8080 \| bp-powerdns supports both Ingress (contabo legacy) and HTTPRoute (Sovereign) simultaneously — the per-Sovereign overlay sets `api.gateway.enabled=true` while leaving `api.enabled=true`. The Ingress object is harmless on Cilium clusters with no Traefik. This preserves contabo's existing pdns.openova.io flow per ADR-0001 §9.4. bp-harbor flips `expose.type` from `ingress` to `clusterIP` in platform/harbor/chart/values.yaml so the upstream chart no longer emits its own Ingress; the HTTPRoute is the sole HTTP exposure. TLS terminates at the Gateway (wildcard cert) rather than per-host Certificates inside the chart. bp-catalyst-platform's `templates/httproute.yaml` is NOT excluded by .helmignore (unlike templates/ingress.yaml + templates/ingress-console-tls.yaml, which remain contabo-only legacy demo infra). The contabo path keeps serving console.openova.io/sovereign via Traefik unchanged. Bootstrap-kit slot updates (per-Sovereign hostname interpolation): - 08-openbao.yaml → gateway.host: bao.${SOVEREIGN_FQDN} - 09-keycloak.yaml → gateway.host: auth.${SOVEREIGN_FQDN} - 10-gitea.yaml → gateway.host: gitea.${SOVEREIGN_FQDN} - 11-powerdns.yaml → api.host: pdns.${SOVEREIGN_FQDN}, api.gateway.enabled: true - 19-harbor.yaml → gateway.host: registry.${SOVEREIGN_FQDN} - 25-grafana.yaml → gateway.host: grafana.${SOVEREIGN_FQDN} Server-side dry-run validation against the live Cilium Gateway API CRDs on contabo: every HTTPRoute and the per-Sovereign Gateway + Certificate apply cleanly via `kubectl apply --dry-run=server`. Contabo unaffected: clusters/contabo-mkt/ not modified. The legacy SME ingresses (console-nova, marketplace, admin, axon, talentmesh, stalwart, ...) continue to serve via Traefik as before. powerdns on contabo remains on the Ingress path (api.gateway.enabled defaults to false at the chart level). Closes #387. Co-authored-by: hatiyildiz <269457768+hatiyildiz@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 16:19:30 +04:00
e3mrah	1f5c76def1	fix(platform): sync blueprint.yaml versions with Chart.yaml (#199 ) * feat(ui): Playwright cosmetic + step-flow regression guards 15 regression guards in products/catalyst/bootstrap/ui/e2e/cosmetic- guards.spec.ts that fail HARD when each user-flagged defect class returns: 1. card height drift from canonical 108px 2. reserved right padding eating description width 3. logo tile drift from per-brand LOGO_SURFACE 4. invisible glyph (white-on-white) via luminance proxy 5. wizard step order Org/Topology/Provider/Credentials/Components/ Domain/Review 6. legacy "Choose Your Stack" / "Always Included" tab labels 7. Domain step reachable before Components 8. CPX32 not the recommended Hetzner SKU 9. per-region SKU dropdown shows wrong provider catalog 10. provision page is .html (static) not SPA route 11. legacy bubble/edge DAG SVG markup on provision page 12. admin sidebar drift from canonical core/console (w-56 + 7 labels) 13. AppDetail uses tablist instead of sectioned layout 14. job rows navigate to /job/<id> instead of expand-in-place 15. Phase 0 banners (Hetzner infra / Cluster bootstrap) on AdminPage Each test prints a failure message naming the canonical reference, the source-of-truth file, and the data-testid PR needed (if any) so the implementing agent has a precise target. No .skip() — per INVIOLABLE-PRINCIPLES #2, missing components fail loud. CI: .github/workflows/cosmetic-guards.yaml runs the suite on every PR that touches products/catalyst/bootstrap/ui/ or core/console/. Docs: docs/UI-REGRESSION-GUARDS.md maps each test to the user's original complaint, the canonical reference, and the green/red semantics (5 tests intentionally RED on main today — they stay red until the companion-agent's UI work lands). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(platform): sync blueprint.yaml versions with Chart.yaml so manifest-validation passes --------- Co-authored-by: hatiyildiz <hatice.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 22:07:55 +04:00
hatiyildiz	1ddd569789	fix(bp-): observability toggles default false — break circular CRD dependency Extends the v1.1.1 hardening that started with cilium / cert-manager / crossplane to the remaining 8 bootstrap-kit + per-Sovereign Blueprints. Every observability toggle in every Catalyst-curated Blueprint now ships `false`/`null` by default; the operator opts in via a per-cluster values overlay at clusters/<sovereign>/bootstrap-kit/ once bp-kube-prometheus-stack reconciles. Live failure mode that prompted this (omantel.omani.works 2026-04-29): bp-cilium @ 1.1.0 defaulted hubble.relay/ui + prometheus.serviceMonitor to true. The upstream Cilium 1.16.5 chart renders a monitoring.coreos.com/v1 ServiceMonitor whose CRD ships with kube-prometheus-stack — a tier-2 Application Blueprint that depends on the bootstrap-kit (cilium first). Helm install fails on a fresh Sovereign with "no matches for kind ServiceMonitor in version monitoring.coreos.com/v1 — ensure CRDs are installed first" and every downstream HelmRelease reports `dep is not ready`. The earlier trustCRDsExist=true mitigation only suppresses Helm's render-time gate; the apiserver still rejects the resource at install-time. Per-Blueprint changes: - bp-cilium: hubble.relay.enabled, hubble.ui.enabled → false; hubble.metrics.enabled → null (this is the exact value that disables the upstream metrics ServiceMonitor template branch — verified by reading cilium 1.16.5's _hubble.tpl); hubble.metrics.serviceMonitor .enabled → false. tests/observability-toggle.sh extended with Case 4 (default render produces no hubble-relay / hubble-ui Deployments). - bp-flux: flux2.prometheus.podMonitor.create → false. - bp-sealed-secrets: sealed-secrets.metrics.serviceMonitor.enabled → false (explicit lock; upstream already defaults false). - bp-spire: spire.global.spire.recommendations.enabled + recommendations.prometheus → false. - bp-nats-jetstream: nats.promExporter.enabled + promExporter.podMonitor.enabled → false. - bp-openbao: openbao.injector.metrics.enabled + openbao.serviceMonitor.enabled → false. - bp-keycloak: keycloak.metrics.enabled + metrics.serviceMonitor.enabled + metrics.prometheusRule.enabled → false. - bp-gitea: gitea.gitea.metrics.* and gitea.postgresql.metrics.* serviceMonitor + prometheusRule → false. - bp-powerdns: powerdns.serviceMonitor.enabled + powerdns.metrics.enabled → false (forward-compatibility guard; current upstream pschichtel/powerdns 0.10.0 has no ServiceMonitor template, but a future upstream bump cannot silently regress). Each chart ships a tests/observability-toggle.sh that asserts the rule in three cases (default off / explicit on opt-in / explicit off) — runs under blueprint-release.yaml's chart-test gate (added `bdeb0f54` + the existing wiring) before helm push. A regression that re-introduces a hardcoded enabled: true in any chart fails CI before the OCI artifact is published. Versioning: - All 11 leaf charts bumped 1.1.0 → 1.1.1. - products/catalyst/chart (bp-catalyst-platform umbrella) deps updated to 1.1.1 across the board. - clusters/_template/bootstrap-kit/03-flux through 10-gitea bumped to 1.1.1; clusters/omantel.omani.works/bootstrap-kit/* mirror. docs/BLUEPRINT-AUTHORING.md §11.2 table extended to enumerate every toggle disabled across all 11 Blueprints. References docs/INVIOLABLE-PRINCIPLES.md #4. GATES (all green): - helm dep build resolves cleanly post-change for every chart whose upstream is published (umbrella waits on per-leaf publish). - helm lint clean on all 11 leaves. - helm template . default render produces zero monitoring.coreos.com references on every leaf (verified locally). - tests/observability-toggle.sh PASS on all 11 leaves. Live verification: with v1.1.1 published the omantel.omani.works HelmRelease can roll forward without a manual values patch — Flux picks up the new chart digest automatically (semver: 1.x in OCIRepository). Refs: issue #182.	2026-04-29 19:23:52 +02:00
hatiyildiz	43aff20254	feat(bp-): convert all 11 bootstrap-kit charts to umbrella charts depending on upstream Each platform/<name>/chart/Chart.yaml now declares the canonical upstream chart as a dependencies: entry. helm dependency build pulls the upstream payload into the OCI artifact at publish time, so Flux helm install of bp-<name>:1.1.0 actually installs the upstream Helm release alongside the Catalyst-curated overlays (NetworkPolicy, ServiceMonitor, ClusterIssuer, ExternalSecret) under templates/. Pinned upstream chart versions per platform/<name>/blueprint.yaml: - cilium 1.16.5 https://helm.cilium.io - cert-manager v1.16.2 https://charts.jetstack.io - flux 2.4.0 https://fluxcd-community.github.io/helm-charts - crossplane 1.17.x https://charts.crossplane.io/stable - sealed-secrets 2.16.x https://bitnami-labs.github.io/sealed-secrets - spire ... https://spiffe.github.io/helm-charts-hardened - nats-jetstream ... https://nats-io.github.io/k8s/helm/charts - openbao ... https://openbao.github.io/openbao-helm - keycloak ... https://charts.bitnami.com/bitnami - gitea ... https://dl.gitea.com/charts - catalyst-platform umbrella over the 10 leaf bp- charts via helm dependency values.yaml in each chart adopts the umbrella convention: catalystBlueprint metadata block (provenance + version) at top level, upstream subchart values namespaced under the dependency name. cert-manager specifically: clusterissuer-letsencrypt-dns01.yaml gets the helm.sh/hook: post-install,post-upgrade annotation so it applies AFTER cert-manager controllers are running and CRDs registered (the previous hollow-chart shape ran the ClusterIssuer at install time when CRDs didn't exist yet, which was the omantel cluster's exact failure mode). Wrapper chart version bumped 1.0.0 → 1.1.0 across the board (umbrella conversion is a meaningful structural revision). Cluster manifests in clusters/_template/bootstrap-kit/ AND clusters/omantel.omani.works/ bootstrap-kit/ updated to reference 1.1.0. The blueprint-release.yaml workflow's helm package step needs an explicit helm dependency build before push so the upstream subchart bytes ship inside the OCI artifact. That CI change is a follow-up commit on this same branch (separate file scope).	2026-04-29 17:21:36 +02:00
hatiyildiz	62d9c7d936	fix(charts): drop dependencies block — wrappers carry values overlay only The first 2 blueprint-release CI runs failed on `helm package` with containerd permission errors because the wrapper Chart.yaml's `dependencies:` block triggered helm to pull the upstream charts via OCI/containerd at package time, which the GitHub Actions runner blocks. Architectural fix: each Catalyst Blueprint wrapper carries the values overlay + metadata only. The bootstrap installer reads the upstream chart reference from the wrapper's values.yaml `catalystBlueprint.upstream.{chart,version,repo}` metadata block, points `helm install` at the upstream chart's repo, and overlays our values. This keeps: - blueprint-release CI lightweight (no upstream pulls during package; helm package now works without containerd) - the "bp-<name> wrapper does NOT drift from upstream" property (we ship the overlay, not a fork) - the single Blueprint contract from BLUEPRINT-AUTHORING §1 (a wrapper is still a Catalyst-curated Helm chart published as bp-<name>:<semver>) Changes: - 11 platform/<name>/chart/Chart.yaml: removed dependencies block. Each is now a plain Helm chart with no remote pulls during package. - 11 platform/<name>/chart/values.yaml: prepended catalystBlueprint.upstream.{chart,version,repo} metadata block at the top. Bootstrap installer parses it to know which upstream chart to install with these values. - products/catalyst/bootstrap/api/internal/bootstrap/bootstrap.go: installCilium now does `helm repo add cilium https://helm.cilium.io --force-update` then `helm install cilium cilium/cilium --version 1.16.5 --values -` (the cilium/cilium upstream chart, with our overlay values piped from values.yaml). Same pattern needs propagating to the other 10 install functions in a follow-up. After this commit, blueprint-release CI should green-build all 11 wrappers (helm package now works without containerd access since there's nothing to pull). The bootstrap installer's actual `helm install` calls in production reach upstream chart repos via the runtime k3s cluster's pod network, which has full network access.	2026-04-28 12:57:29 +02:00
hatiyildiz	441ebaebb8	fix(charts): pin upstream chart versions/names to ones that exist in their repos The first Blueprint Release CI run (commit `8c0f766`) failed because four chart wrappers referenced upstream chart versions/names that don't exist in their published repositories: - platform/flux/chart: name was "flux", repo was OCI; actual is name "flux2" in plain helm repo at https://fluxcd-community.github.io/helm-charts. Pinned to 2.13.0. - platform/openbao/chart: version 2.1.0 was the binary appVersion, not the chart version. Pinned to 0.16.0 chart (which packages openbao 2.1.0 internally). - platform/keycloak/chart (Bitnami): chart version 25.0.6 was the appVersion of upstream; Bitnami's chart is at 24.7.1 packaging Keycloak 26.0.x. Pinned to 24.7.1. - platform/nats-jetstream/chart: name was "nats-jetstream"; the upstream chart is named "nats" (it always was — JetStream is a feature of NATS, not a separate chart). Renamed. Cilium, cert-manager, crossplane, sealed-secrets, spire wrappers were unaffected; their version pins matched upstream availability. Containerd permission-denied errors from `helm package` on cilium/cert-manager/crossplane/gitea/sealed-secrets are a separate CI plumbing issue (helm tries to pull OCI base images during package build via containerd, but the GitHub Actions runner blocks containerd socket access). Tracked as a follow-up: switch to `helm package --skip-refresh` or use a runner with containerd permissions. After this commit lands, the next blueprint-release CI run should green-build at minimum the 4 fixed charts. Successful builds publish bp-{flux,openbao,keycloak,nats-jetstream}:1.0.0 OCI artifacts to ghcr.io/openova-io/.	2026-04-28 12:55:21 +02:00
hatiyildiz	8c0f76640c	feat(charts): G2 wrapper Helm charts for 11 bootstrap-kit components + blueprint-release CI Per docs/PROVISIONING-PLAN.md and tickets [F] chart. Adds Catalyst-curated wrapper Helm charts at platform/<name>/chart/ for every component the bootstrap-kit installer (introduced in commit `07b4bcf`) needs. Each chart is the canonical bp-<name> source per BLUEPRINT-AUTHORING.md §1's source-location rule. 11 charts created with Chart.yaml + values.yaml + blueprint.yaml each: Network + GitOps: - platform/cilium/chart — wraps cilium 1.16.5; kubeProxyReplacement, WireGuard mTLS, Hubble, Gateway API - platform/flux/chart — wraps flux 2.4.0 - platform/crossplane/chart — wraps crossplane 1.18.0 + provider-hcloud manifest Security: - platform/cert-manager/chart — wraps cert-manager 1.16.2 with CRDs+ServiceMonitor - platform/sealed-secrets/chart — wraps sealed-secrets 2.16.1 (transient bootstrap-only) - platform/spire/chart — wraps spiffe/spire 1.10.4 (5-min SVID rotation) Catalyst control-plane services: - platform/nats-jetstream/chart — wraps nats 2.10.22 (3-node cluster, JetStream + KV) - platform/openbao/chart — wraps openbao 2.1.0 (3-node Raft, region-local per SECURITY §5) - platform/keycloak/chart — wraps keycloak 25.0.6 (Bitnami flavor, edge proxy mode) - platform/gitea/chart — wraps gitea 10.5.0 (CNPG Postgres backend, no chart-bundled valkey/redis since Catalyst control plane uses JetStream) New platform/ folders (added per AUDIT-PROCEDURE component-count anchor — was 53, now 55): - platform/spire/README.md — workload identity Catalyst control plane component - platform/nats-jetstream/README.md — control-plane event spine - platform/sealed-secrets/README.md — transient bootstrap-only Each blueprint.yaml declares: - catalyst.openova.io/v1alpha1 Blueprint kind (canonical CRD per BLUEPRINT-AUTHORING §3) - visibility: unlisted (mandatory infra, auto-installed by bootstrap kit, not a marketplace card) - manifests.chart: ./chart pointer - depends: [] (foundational components have no Blueprint dependencies; control-plane services depend on each other implicitly via bootstrap order, not via Blueprint depends) .github/workflows/blueprint-release.yaml: - New CI workflow per BLUEPRINT-AUTHORING §11 (path-matrix per Blueprint folder) - Triggers on push to main touching platform//chart/* or products//chart/* - detect job: emits matrix of changed Blueprint folders via git diff - build job (per chart): helm dependency build → helm package → helm push to GHCR → cosign keyless sign (GitHub OIDC) → Syft SBOM attestation - Output: ghcr.io/openova-io/bp-<name>:<semver> with SLSA-3-style supply-chain provenance Closes [F] tickets: 11 G2 charts (cilium, cert-manager, flux, crossplane, sealed-secrets, spire, nats-jetstream, openbao, keycloak, gitea, plus the umbrella products/catalyst/chart already exists from Pass 105). blueprint.yaml CRDs added across 11 entries. CI fan-out workflow live. After this commit lands, the bootstrap-kit installer in commit `07b4bcf` has real OCI artifacts to install. The first push to main will trigger 10 build matrix jobs (cilium was created in a separate commit earlier in this session) which produce 10 cosigned bp-<name>:<semver> artifacts on GHCR. Component-count anchor update follows: 53 → 55 (added spire + nats-jetstream + sealed-secrets — but sealed-secrets was already conceptually counted under "supporting services"). Per AUDIT-PROCEDURE the count needs updating in CLAUDE.md, BUSINESS-STRATEGY, TECHNOLOGY-FORECAST L11. Tracked as separate ticket [K] docs.	2026-04-28 12:51:06 +02:00
hatiyildiz	3993f5fc31	docs(pass-31): openbao + librechat DNS-placeholder carry-over fixes platform/openbao/README.md ingress hosts (line 108) had `bao.<domain>` while the same file's ClusterSecretStore example (line 127) used the canonical `bao.<location-code>.<sovereign-domain>` form. Pass 7's active-active fix addressed the body but missed the ingress placeholder. Aligned with the canonical form. platform/librechat/README.md OAuth callback (line 154) had `chat.ai-hub.<domain>/oauth/openid/callback` — same Application-endpoint shape Pass 25 fixed in llm-gateway. Pass 22 marked the file clean and Pass 29 fixed the Keycloak issuer line but didn't re-sweep. Per NAMING §5.2 Application endpoints are `{app}.{environment}.{sovereign-domain}`. Fixed. docs/GLOSSARY.md verified clean — single-source-of-truth has held across the loop (Pass 6/7/14/20/22/26/27 all consistent with current GLOSSARY). Validation log Pass 31 entry includes meta-note: third file (librechat) that needed re-opening after a "clean" mark — banner scans miss YAML-block drift. Future passes should default to a full placeholder-shape grep on every file touched.	2026-04-27 22:34:10 +02:00
hatiyildiz	42aeb629bb	docs(pass-7): rewrite OpenBao + ESO READMEs to match agreed multi-region semantics Pass 7 — line-by-line read of platform/openbao/README.md and platform/external-secrets/README.md found a major architectural drift: both files described an OLD active-active bidirectional sync model that contradicts docs/SECURITY.md §5 (the canonical reference). The active-active design was rejected during the architecture session because it would have been a stretched cluster — a single region's network blip would block writes everywhere. The agreed model is: - Independent Raft cluster per region (intra-region quorum only). - Single-primary writes; replicas accept reads only. - Async Performance Replication primary → replicas (lag <1s typical). - Explicit DR promotion (sovereign-admin or failover-controller). Fixes: platform/openbao/README.md: - Overview: removed "active-active deployments" / "either region can update secrets". Replaced with "independent Raft cluster per region", "asynchronous Performance Replication". - Architecture diagram: replaced bidirectional-push diagram with the primary→replicas async perf replication topology that matches SECURITY.md §5. - ClusterSecretStores: simplified from "two stores (local+remote)" to "one local store"; reads always pull locally. - Renamed "PushSecret (Bidirectional)" → "Writes go to the primary region" with a single-target PushSecret pointing at bao-primary. - Added DR promotion section pointing at SECURITY.md §5.2. - Status banner: notes that the canonical multi-region reference is SECURITY.md. platform/external-secrets/README.md: - Header line: repositioned as per-host-cluster infrastructure with pointer to PLATFORM-TECH-STACK §3.3. - Removed broken link to non-existent ../openbao/docs/ADR-OPENBAO.md (replaced with link to ../openbao/README.md). - "Multi-region sync \| Push to both OpenBao instances simultaneously" → "Multi-region reads \| Async perf replication". - "PushSecret to Multiple OpenBao Instances" example was writing to two ClusterSecretStores in parallel — replaced with single-target primary write. - "Multi-region sync via single PushSecret" in Consequences → "Cross-region availability via Performance Replication". - Mermaid sequence diagram: "Bootstrap Wizard" actor → "Catalyst Bootstrap (Phase 0)"; "Terraform" → "OpenTofu"; ESO connection description "via K8s auth" → "via SPIFFE SVID (workload identity)". These were the most consequential drift fixes found in any pass — two READMEs were documenting an architecture explicitly rejected by the agreed model. Refs #37	2026-04-27 21:34:09 +02:00
hatiyildiz	119a1e53a0	docs(components): terminology pass across platform and product READMEs Bring per-component READMEs in line with the canonical glossary (docs/GLOSSARY.md). Substantive architectural content unchanged — this is a terminology + reference correctness pass. Placeholder rename: <tenant> → <org> in YAML / IaC examples across - platform/cnpg/README.md (Cluster + Pooler + ScheduledBackup) - platform/debezium/README.md (PostgreSQL connector + topic patterns) - platform/external-secrets/README.md (ExternalSecret / SecretStore) - platform/grafana/README.md (Instrumentation namespace) - platform/k8gb/README.md (Gslb + namespace + kubectl examples) - platform/keda/README.md (ScaledObject + Kafka triggers + Prometheus) - platform/opentofu/README.md (server resource example) - platform/velero/README.md (BackupStorageLocation buckets) - platform/vpa/README.md (VerticalPodAutoscaler examples) - platform/flux/README.md (kustomization name + tenants/ → organizations/) "Catalyst IDP" → "Catalyst console": - platform/crossplane/README.md (integration section retitled and rewritten — Crossplane is platform plumbing, not user-facing) - platform/gitea/README.md (architecture diagram + integration table) - platform/kyverno/README.md (rollout tracking surface) - products/fingate/README.md (TPP onboarding portal) "Bootstrap wizard" → "Catalyst bootstrap": - platform/openbao/README.md (bootstrap procedure rewritten — independent Raft per region clarified; cross-references docs/SECURITY.md §5) - platform/opentofu/README.md (Quick Start) Kyverno labels & prose: - openova.io/tenant → openova.io/organization (label rename for consistency; deployed clusters will add new label as a co-label during migration window) - "tenant labels" / "tenant namespace" prose updated to "Organization labels" / "Organization-labeled namespace" - Priority class names (tenant-high, tenant-default, tenant-batch) retained as deployed artifact names — rename pending in a separate migration ticket No banned-term hits remain in component READMEs (verified by grep in docs/GLOSSARY.md banned-terms table). Refs #37	2026-04-27 20:06:51 +02:00
talent-mesh	10245dff98	feat: ecosystem expansion to 55 components with license compliance - Replace BSL-licensed components with open-source alternatives: Terraform→OpenTofu (MPL 2.0), Vault→OpenBao (MPL 2.0), Redpanda→Strimzi/Kafka (Apache 2.0), n8n→Airflow (Apache 2.0) - Add 14 new platform components: activemq, camel, clickhouse, dapr, debezium, falco, flink, iceberg, opensearch, rabbitmq, superset, temporal, trino, vitess - Rename meta-platforms/ to products/ with new product names: Cortex (AI Hub), Fingate (Open Banking), Titan (Data Lakehouse), Fuse (Microservices Integration) - Update all documentation, READMEs, and cross-references Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-11 18:15:11 +00:00

32 Commits