Commit Graph

32 Commits

Author SHA1 Message Date
e3mrah
5690867be8
fix(openbao): make auth-bootstrap Job idempotent on post-upgrade (token already revoked) (#1484)
bp-openbao 1.2.15 (the HTTPRoute backend-name collapse fix) replayed the
`auth-bootstrap` post-install,post-upgrade hook against an already-bootstrapped
OpenBao. The hook hit `Error enabling kubernetes auth: 403 permission denied`
on `bao auth enable -path=kubernetes kubernetes`, the upgrade failed, and Flux
auto-rolled the release back to 1.2.14. Net effect: every chart bump that
touches bp-openbao is unrecoverable without manual intervention.

Root cause is in the hook itself: at the end of the FIRST run it
`bao token revoke -self` + deletes the openbao-root-token Secret content
(acceptance criterion #6: no root token persists past install). On any
post-upgrade replay, the Secret still mounts via valueFrom but the token
value is REVOKED, so every privileged call (`auth enable`, `secrets enable`,
`policy write`, `write role`) returns 403. The existing idempotency check
(`bao auth list | grep kubernetes/`) doesn't help because `bao auth list`
itself silently 403s and the `|| echo "{}"` mask makes the script think the
auth method is missing.

Fix: add a token-validity gate immediately after the
`initialized=true sealed=false` wait. Call `bao token lookup` (zero-cost,
strictly read-only on the caller's token). If it 403s, BAO_TOKEN was
revoked by a prior successful run — exit 0. The auth method, role, kv
backend, and ESO policy are all already configured; nothing for this Job
to do on a re-run.

Chart bump: bp-openbao 1.2.15 → 1.2.16.

Caught live on prov #80 (omantel.biz, 2026-05-14) when bp-openbao
1.2.14 → 1.2.15 was rolled by Flux and immediately failed + rolled back
in a loop, blocking bp-newapi's dependsOn and stalling the bootstrap-kit
Kustomization.

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 19:13:34 +04:00
e3mrah
3d929e69d7
fix(httproute): collapse double-prefix when releaseName contains chart name (gitea/harbor/openbao 500/404) (#1483)
* fix(tls): cilium-gateway-cert STAGING/PROD issuer selectable via tofu

clusters/_template/sovereign-tls/cilium-gateway-cert.yaml hardcoded
letsencrypt-dns01-prod-powerdns regardless of qa_test_session_enabled.
On high-cadence QA reprov cycles this hits the LE PROD 5/168h rate
limit (caught on prov #76 at 13:45 UTC, retry-after 16:49 UTC) and
the wildcard Certificate sticks Ready=False — Cilium Gateway has no
valid TLS secret → envoy listener never binds → public TLS handshake
to console.<fqdn> dies with SSL_ERROR_SYSCALL.

Add tofu local.wildcard_cert_issuer = qa_test_session_enabled ?
staging : prod. Thread WILDCARD_CERT_ISSUER through the sovereign-
tls Kustomization postBuild.substitute. cilium-gateway-cert.yaml
references it as ${WILDCARD_CERT_ISSUER}.

Default behaviour unchanged for non-QA (production) Sovereigns —
they still resolve to letsencrypt-dns01-prod-powerdns.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(cilium-gateway): allow world ingress to Cilium Gateway reserved:ingress endpoint

When Cilium Gateway API runs with gatewayAPI.hostNetwork.enabled=true and
a default-deny CCNP is present, every public request to a Sovereign host
(console, auth, gitea, registry, api, ...) hits the gateway listener and
gets DENIED at envoy's cilium.l7policy filter with:

    cilium.l7policy: Ingress from 1 policy lookup for endpoint X for port 30443: DENY

Public response: HTTP/1.1 403 Forbidden, body "Access denied", server: envoy.

Root cause: Cilium creates a special endpoint with identity reserved:ingress (8)
representing the gateway listener. By default this endpoint has
policy-enabled=both with allowed-ingress-identities=[1 (host)] and empty
L4 rules — so no port is permitted. The default-deny CCNP's NotIn-namespace
endpointSelector does NOT cover this endpoint (it has no
io.kubernetes.pod.namespace label), and our qa-fixtures didn't ship a
matching allow-template for it. Net effect: TLS handshake succeeds, HTTPRoutes
are Programmed, backends are healthy in-cluster, but every request 403s.

Caught live on prov #80 (omantel.biz, 2026-05-14) after the Gateway hostNetwork
fix (#1480) finally activated host-bind on :30443. Verified by:
- envoy debug log: cilium.l7policy DENY for endpoint 10.42.0.201 port 30443
- cilium-dbg endpoint get 3282 -o json: l4.ingress: [] and allowed-ingress-identities: [1]
- transiently applying the same CCNP via kubectl: console.omantel.biz → 200

Fix: ship a CCNP scoped to reserved:ingress that allows ingress from world,
cluster, host, remote-node (multi-region CP-to-CP), and kube-apiserver,
plus egress to all so envoy can forward to any backend service. This is
the canonical Cilium hostNetwork Gateway-API zero-trust pattern.

Chart bump: catalyst 1.4.142 → 1.4.143.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(httproute): match upstream chart fullname-collapse when releaseName contains chart name

Three Sovereign-facing HTTPRoute templates (gitea, harbor, openbao) had
backend defaults hardcoded as `<release>-<chart>-<resource>` (e.g.
`gitea-gitea-http`, `harbor-harbor-core`, `openbao-openbao`). The
upstream subcharts use a `<chart>.fullname` helper that COLLAPSES the
prefix when `.Release.Name` already contains the chart name — i.e. when
the bootstrap-kit releaseName is the chart name (the convention), the
live Service is `<release>-<resource>` (or just `<release>` for openbao),
not `<release>-<chart>-<resource>`.

Effect on prov #80 (omantel.biz):
- gitea/gitea HTTPRoute → backendRef `gitea-gitea-http` (does not exist; live is `gitea-http`) → BackendNotFound → gitea.omantel.biz returns HTTP 500
- harbor/harbor HTTPRoute → `harbor-harbor-core` (live is `harbor-core`) → registry.omantel.biz returns HTTP 500
- openbao/openbao HTTPRoute → `openbao-openbao` (live is `openbao`) → bao.omantel.biz dead

Fix: replicate the upstream chart's `.fullname` collapse logic via
`(ternary .Release.Name (printf "%s-<chart>" .Release.Name) (contains "<chart>" .Release.Name))` so the default backend always matches
the live Service name regardless of releaseName choice. Operators retain
the `gateway.backendService` override for non-standard release names.

Chart bumps: bp-gitea 1.2.6 → 1.2.7, bp-harbor 1.2.16 → 1.2.17, bp-openbao 1.2.14 → 1.2.15.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: e3mrah <catalyst@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
2026-05-14 19:00:07 +04:00
e3mrah
ab67a48fe7
fix(blueprints): align blueprint.yaml spec.version with Chart.yaml version (#817) (#819)
TestBootstrapKit_BlueprintCardsHaveRequiredFields was failing on main for
9 blueprints because their platform/<name>/chart/Chart.yaml version had
been bumped without a matching update to platform/<name>/blueprint.yaml
spec.version. The pre-existing failure forced 7 recent PRs to self-merge
with --admin, masking real CI failures.

Aligned spec.version to match Chart.yaml version on:

  cert-manager   1.1.1 -> 1.1.2
  flux           1.1.3 -> 1.1.4
  crossplane     1.1.3 -> 1.1.4
  sealed-secrets 1.1.1 -> 1.1.2
  spire          1.1.4 -> 1.1.7
  nats-jetstream 1.1.1 -> 1.1.2
  openbao        1.2.0  -> 1.2.14
  keycloak       1.3.1 -> 1.3.2
  gitea          1.2.1 -> 1.2.3

Verified locally:

  $ go test ./... -run TestBootstrapKit_BlueprintCardsHaveRequiredFields -count=1
  --- PASS: TestBootstrapKit_BlueprintCardsHaveRequiredFields (0.01s)
      ... all 10 sub-tests pass (cilium + the 9 above)

The existing test (tests/e2e/bootstrap-kit/main_test.go:145) is itself
the drift guardrail: it fails CI whenever Chart.yaml is bumped without a
matching blueprint.yaml bump. No additional script needed.

Closes #817 once verified on main.

Co-authored-by: Hatice Yildiz <hatice.yildiz@openova.io>
2026-05-04 22:32:49 +04:00
e3mrah
a8bcb773c9
fix(bp-openbao): add BAO_TOKEN+NAMESPACE env to auth-bootstrap (chart 1.2.14) (#666)
PR #663 added the revoke logic at the bottom of the script but the
companion env-block additions (BAO_TOKEN sourced from openbao-root-token
Secret, NAMESPACE from fieldRef) somehow never landed in the merged
diff — only the trailing revoke + DELETE block did.

Result on otech44: openbao-root-token Secret IS being created by
init-job (PR #663's other half worked), but auth-bootstrap pod env
ends at TOKEN_MAX_TTL with no BAO_TOKEN, so 'bao auth enable kubernetes'
hits 403 Forbidden again — the exact same failure that PR #663 was
supposed to fix.

This PR adds the missing env declarations.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
2026-05-03 14:02:34 +04:00
e3mrah
561439b6c2
fix(bp-openbao): wire root_token init→auth-bootstrap (chart 1.2.13) (#663)
Caught live on otech43 after chart 1.2.12 fixed the persist gap and
auth-bootstrap finally ran: 'Error enabling kubernetes auth ... Code: 403
permission denied'. The auth-bootstrap Job had no BAO_TOKEN and was
making unauthenticated bao API calls.

Three coordinated changes:

1. init-job.yaml: after bao operator init succeeds and ROOT_TOKEN is
   extracted, POST a transient Secret openbao-root-token with the
   token in data.token. Already-exists (409) is treated as
   idempotent-re-run, anything else fails the Job loud (was silent
   before, hid the bug).

2. auth-bootstrap-job.yaml: BAO_TOKEN env sourced via secretKeyRef
   from openbao-root-token. After running auth enable / secrets enable
   / policy write / role bind, revoke the token via 'bao token revoke
   -self' AND attempt DELETE on the Secret. (busybox wget --method=DELETE
   may silently no-op; the bao-side revoke is the load-bearing
   acceptance-criterion-6 mechanism.)

3. auto-unseal-rbac.yaml: openbao-root-token added to the mutation
   rule's resourceNames so the SA can GET/PATCH/UPDATE/DELETE it.
   Create is already unrestricted from chart 1.2.10's RBAC split.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
2026-05-03 12:55:13 +04:00
e3mrah
be9b5ca5bf
fix(bp-openbao): wc -l counts 0 for single-key without trailing newline (1.2.12) — TRUE root cause (#662)
Caught live on otech42 with chart 1.2.11's per-pod logs:
  + bao operator init -key-shares=1 -key-threshold=1 -format=json
  [openbao-init] FATAL: extracted 0 unseal key(s) but threshold=1

key-shares=1 → no comma → tr ',' '\n' is no-op → final sed produces
single line WITHOUT trailing newline → wc -l counts 0. Every prior
loop attributed to RBAC/wget was a downstream symptom.

Fix: append 'awk 1' for trailing newline, swap wc -l for grep -c .

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
2026-05-03 12:28:50 +04:00
e3mrah
7bd9aae89b
diag(bp-openbao): restartPolicy: Never (chart 1.2.11) — preserve fresh-init pod logs (#661)
OnFailure restarts the SAME container in the SAME pod, and only the
MOST RECENT failed container's logs are kubectl-loggable. The first
attempt's logs (where the FRESH path runs and the persist gap lives)
are reaped before later restarts can be inspected.

Switching to Never makes each retry a separate Pod via Job's
backoffLimit replay. Every failed pod is independently inspectable
with kubectl logs <pod> until ttlSecondsAfterFinished tears it down.
Combined with chart 1.2.9's openbao-init-trace Secret upload (POST
now succeeds with 1.2.10's RBAC split), the fresh-path failure point
becomes definitively observable.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
2026-05-03 12:13:23 +04:00
e3mrah
b5fee168b5
fix(bp-openbao): split RBAC for create verb (chart 1.2.10) — root cause of unseal-keys never persisted (#660)
The openbao-auto-unseal Role granted 'create' on Secrets with
resourceNames set. Kubernetes RBAC doesn't enforce resourceNames on
the create verb (the resource has no name at admission time, so
there's nothing to filter), but the kube-apiserver still REJECTS the
request because the rule's effective verbs[create]+resourceNames combo
doesn't match the bare 'create secrets' permission check. Result:
every init Job POST returned 403 Forbidden.

The script then fell through to the PUT branch, which silently failed
because BusyBox wget (the openbao image's only HTTP client) has no
--method flag. Both calls non-zero → script exited 1 with FATAL
'cannot persist'. The first init's logs got reaped before later
restarts could be inspected, so the FATAL was never visible — the
retries all hit the idempotent FATAL ('vault is sealed but the
unseal-keys Secret is missing') with no record of why.

Caught live on otech40 with chart 1.2.9's trace upload + a wget
auth-can-i probe:
  kubectl auth can-i create secrets --as=...openbao-auto-unseal → no
  kubectl auth can-i create secret/openbao-unseal-keys ... → yes

Fix: split into two rules per the k8s RBAC pattern.
  rule 1: verbs[create] WITHOUT resourceNames (allows POST)
  rule 2: verbs[get,patch,update,delete] WITH resourceNames
          (mutation stays scoped to known names)

This unblocks every fresh Sovereign provisioning. Each subsequent run
hits the idempotent path (GET on openbao-unseal-keys → 200) and
unseals automatically — no operator intervention.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
2026-05-03 11:55:05 +04:00
e3mrah
09e56f1e47
diag(bp-openbao): persist init script trace to Secret across restarts (1.2.9) (#659)
otech38/39 confirmed: openbao reaches Initialized=true on the first
init pod attempt but the unseal-keys Secret is never persisted. The
fresh-init container's logs are reaped before subsequent restarts'
idempotent FATAL allows them to be inspected, so we keep flying blind
on the actual failure point.

This change tees every line of the init script (set -x trace + every
echo) into /tmp/.script.trace and uploads it to a per-namespace
Secret 'openbao-init-trace' on EXIT (success OR failure). The Secret
survives Pod recreation and any Job retry; the operator can read it
with kubectl after the next provision and see exactly where the
fresh-path script exited.

Adds 'openbao-init-trace' to the openbao-auto-unseal Role's
resourceNames so the Job SA can PUT/POST it.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
2026-05-03 11:38:54 +04:00
e3mrah
5f6d1c7d86
diag(bp-openbao): add set -x to init script (chart 1.2.8) (#658)
otech37/38 hit the same wall: server reaches Initialized=true but
openbao-unseal-keys Secret is never persisted; the FIRST init pod's
logs that ran fresh init are reaped by container restart before we
can capture what happened.

Add 'set -x' to shell-trace every command. Now even if the script
crashes mid-run, pod logs show the last command attempted. The
captured diagnostic on the next provision will tell us whether the
failure is in /tmp/init-output.json parsing, the persist wget, or
elsewhere.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
2026-05-03 11:09:05 +04:00
e3mrah
8447930bf7
fix(bp-openbao): fail-fast on unseal-keys persist (chart 1.2.7) (#657)
* fix(bp-harbor): grep-oE for password (multi-line tolerant) (chart 1.2.13)

* fix(wizard): blueprint deps from Flux HelmRelease.dependsOn (single source of truth)

The wizard's componentGroups.ts carried hand-maintained `dependencies:
[...]` arrays that deviated from the real Flux install graph in
clusters/_template/bootstrap-kit/*.yaml. Examples (otech34 surfaced
this):

  componentGroups.ts          Flux HelmRelease.dependsOn
  ----------------------      ---------------------------
  keycloak: [cnpg]            keycloak: [cert-manager, gateway-api]
  openbao:  []                openbao:  [spire, gateway-api, cnpg]
  harbor:   [cnpg, seaweedfs, harbor:   [cnpg, cert-manager,
              valkey]                    gateway-api]

Founder's directive: "all the real dependencies are related to real
flux related dependencies, if you are hosting irrelevant hardcoded
baseless wizard catalog dependencies, I dont know where they are
coming from. The single source of truth for the dependencies is
flux!!!" — 2026-05-03

This commit:
  1. Adds scripts/generate-blueprint-deps.sh that parses every
     bootstrap-kit HelmRelease and emits blueprint-deps.generated.json
     keyed by bare component id (bp- prefix stripped on both source
     and target side).
  2. Commits the generated JSON.
  3. Adds products/catalyst/bootstrap/ui/src/data/blueprintDeps.ts
     thin TS wrapper exporting BLUEPRINT_DEPS + depsFor(id).
  4. Patches componentGroups.ts so every RAW_COMPONENT's
     `dependencies` field is OVERRIDDEN at module load with the
     Flux-canonical list (the inline `dependencies: [...]` literals
     are now ignored — Flux is canonical).

Follow-ups (not in this PR):
  - CI drift check that re-runs the script and diffs the JSON.
  - Strip the inline `dependencies: [...]` arrays entirely once the
    drift check is green.
  - Wire the FlowPage edge-rendering to match.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(flowpage): replace second hardcoded BOOTSTRAP_KIT_DEPS table with Flux SoT

PR #652 fixed the wizard catalog. FlowPage.tsx had a SECOND independent
hardcoded dep map at lines 105-155 that the founder caught — most
visibly:
  keycloak: ['cert-manager', 'openbao']  ← FALSE; Flux says no openbao
The reason the founder kept seeing the spurious arrow on the Flow page.

Replace the local table with an import of BLUEPRINT_DEPS from
data/blueprintDeps.ts (single source of truth — generated from
clusters/_template/bootstrap-kit/*.yaml by
scripts/generate-blueprint-deps.sh).

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(jobs): don't regress status to pending after exec started

helmwatch_bridge.go's OnHelmReleaseEvent unconditionally overwrote the
Job's Status with jobStatusFromHelmState(state) on every event. Flux
oscillates HelmReleases between Reconciling and DependencyNotReady
while a dependency (e.g. bp-openbao waiting on bp-spire) isn't Ready
— helmwatch maps both back to HelmStatePending. The bridge then flips
the row to status='pending' even though an active Execution is
streaming exec log lines (startedAt + latestExecutionId already set).

Founder caught this on otech34's install-external-secrets job:
status='pending' on the Jobs page while Exec Log was actively
tailing.

Fix: monotonic guard — once activeExecID[component] != "" (Execution
allocated), refuse to regress nextStatus to StatusPending. Treat
ongoing-after-start as Running so the row reflects the live stream.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(jobs): cascade Failed status through dependsOn (fail-fast)

Founder caught on otech34: install-openbao=failed but
install-external-secrets stayed pending forever ('masking it and
waiting unnecessarily'). Flux's HelmRelease for external-secrets is
in DependencyNotReady, helmwatch maps that to StatePending,
bridge writes Status=pending — no signal that the upstream FAILED
rather than 'still installing'.

Add a post-rollup sweep in deriveTreeView that propagates Failed
through the dependsOn graph. Up to 8 sweeps cover the deepest
bootstrap-kit chain. Idempotent on read; reverses if openbao recovers
because it operates on the live snapshot.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(infra): bump kernel inotify limits — bp-openbao init was crashing 'too many open files'

Diagnosed live during otech35: openbao-init pod crash-looped 4×
on 'bao operator init' with:
  failed to create fsnotify watcher: too many open files
Flux mapped to InstallFailed → RetriesExceeded → cascading through
external-secrets and external-secrets-stores. The wizard masked the
OS-level root cause behind a generic InstallFailed.

Hetzner Ubuntu 24.04 ships fs.inotify.max_user_instances=128 — far
too low for a 35-component bootstrap-kit (k3s kubelet + Flux helm-
controller + 11 CNPG operators + Reflector + Cert-Manager + bao +
keycloak-config-cli + ... each grabs instance slots). The instance
count exhausts within minutes; the next process to ask for an
inotify slot gets EMFILE.

Bump well above k8s/k3s production guidance so future blueprints
don't tickle the same wall:
  fs.inotify.max_user_instances = 8192
  fs.inotify.max_user_watches   = 1048576
  fs.inotify.max_queued_events  = 16384

Applied via /etc/sysctl.d/99-catalyst-inotify.conf + 'sysctl --system'
in runcmd. Permanent across reboots.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(bp-openbao): fail-fast when unseal-keys persist fails (chart 1.2.7)

otech37 caught: bao operator init succeeded server-side
(Initialized=true), but the script's wget POST to persist
openbao-unseal-keys Secret silently failed (|| true), and the PUT
fallback also silenced. Subsequent Job retries hit Initialized=true
on the idempotent path, found no openbao-unseal-keys Secret, and
FATAL'd with 'manual recovery: wipe data-openbao-0 PVC' — every
retry forever.

Hardening:
  1. Capture POST + PUT stdout/stderr to /tmp files instead of
     /dev/null so the FATAL path can echo them.
  2. PUT no longer || true — if both POST and PUT fail, exit 1.
  3. Add read-back verification: GET the persisted Secret and
     assert 'unseal-keys-b64' field is present. Catches
     partial-write / eventual-consistency cases.

Bumps chart 1.2.6 -> 1.2.7 and bootstrap-kit reference.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

---------

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
2026-05-03 10:51:21 +04:00
e3mrah
da61ecdc79
test(bp-openbao): align test expectation with #600 RBAC-hook removal (#647)
* fix(wizard): SOLO default CPX42 → CPX52 (8→12 vCPU / 16→24 GB)

CPX42 fit 30/40 HRs on otech29 but keycloak-keycloak-config-cli
post-upgrade Job sat Pending 8h with 'Insufficient cpu' — 35-component
bootstrap-kit + post-install hooks at peak exceed 8 vCPU. CPX52 (12
vCPU / 24 GB / €36/mo) is the smallest SKU that schedules every default
Pod on one node.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* test(bp-openbao): align Case-4 expectation with #600 RBAC-hook removal

Commit b1a25c42 (#600) removed the helm.sh/hook-delete-policy from the
auto-unseal SA/Role/RoleBinding so Helm does NOT reap them mid-install
(the old hook-succeeded clause caused the SA to disappear before the
init Job could mount its token). The chart-test still expected ≥5
before-hook-creation,hook-succeeded annotations (3 RBAC + 2 Jobs).

Result: Blueprint Release for #600 (run 25251129679) failed at the test
gate — bp-openbao 1.2.6 was NEVER published to GHCR, even though main
already references it. otech30 caught this live: bp-openbao HR stuck
with 'oci://ghcr.io/openova-io/bp-openbao:1.2.6: not found'.

Update the test to expect ≥2 (Jobs only). Re-publish gets bp-openbao
1.2.6 onto GHCR.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

---------

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
2026-05-03 08:46:31 +04:00
e3mrah
b1a25c4235
fix(bp-keycloak,bp-openbao): HTTPRoute backend wrong name + RBAC hook lifecycle bug (#598) (#600)
Bug A — bp-keycloak@1.2.2: HTTPRoute backendService default was
`<release>-keycloak` (gave `keycloak-keycloak` with releaseName=keycloak)
but bitnami's fullname helper trims the chart-name suffix when Release.Name
already contains it, so the Service is just `keycloak`. Changed default to
`.Release.Name`. Sovereign realm was already imported (config-cli ran
successfully) — only the Gateway routing was broken, returning HTTP 500.

Bug B — bp-openbao@1.2.6: auto-unseal-rbac SA/Role/RoleBinding had
`helm.sh/hook-delete-policy: before-hook-creation,hook-succeeded`. The
`hook-succeeded` clause caused Helm to delete the SA immediately after the
weight-0 RBAC hook completed, before the weight-5 init Job pod could mount
its SA token and start. Removed all hook annotations from the RBAC resources
so they are managed by regular Helm release lifecycle (created before hooks,
never deleted mid-install).

Bootstrap-kit refs bumped: bp-keycloak 1.2.0→1.2.2, bp-openbao 1.2.4→1.2.6.

Verified on otech22 (manual remediation): Keycloak sovereign realm
OIDC endpoint returns valid JSON, openbao-0 Initialized=true Sealed=false.

Co-authored-by: alierenbaysal <alierenbaysal@openova.io>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-02 15:43:32 +04:00
e3mrah
ad9cfc0f23
feat(platform): add global.imageRegistry to bp-openbao/external-secrets/cnpg/valkey/nats-jetstream/powerdns/gitea (PR 2/3, #560) (#565)
Charts with template image refs (fully rewritten when registry set):
- bp-openbao 1.2.4→1.2.5: init-job.yaml + auth-bootstrap-job.yaml — Catalyst
  job images now prefixed with global.imageRegistry when non-empty. Default
  (empty) renders identical manifests.
- bp-powerdns 1.1.5→1.1.6: dnsdist.yaml Catalyst companion image prefixed
  with global.imageRegistry when non-empty. Verified: dnsdist image rewrites
  to harbor.openova.io/docker.io/powerdns/dnsdist-19:1.9.14.

Subchart-only charts (global.imageRegistry stub added; threading via per-component
subchart values.yaml keys documented in comments):
- bp-external-secrets 1.1.0→1.1.1
- bp-cnpg 1.0.0→1.0.1  (charts/ missing = pre-existing state, not this PR)
- bp-valkey 1.0.0→1.0.1 (charts/ missing = pre-existing state, not this PR)
- bp-nats-jetstream 1.1.1→1.1.2
- bp-gitea 1.1.2→1.1.3: upstream chart exposes gitea.image.registry for wiring

vcluster: N/A — no chart directory under platform/vcluster/chart/

Co-authored-by: alierenbaysal <alierenbaysal@openova.io>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-02 12:52:43 +04:00
e3mrah
8cde771c0f
fix(bp-openbao): unseal on idempotent path + persist keys (Closes #539) (#540)
PR #528 added unseal logic but only on the FRESH-init branch. When a
previous Job pod completed `bao operator init` but exited before the
unseal block (or when openbao-0 simply restarts under shamir seal),
the next reconcile takes the "already initialized" branch and exits
without ever running `bao operator unseal`. Symptom on otech21:
init-job logs end with `auto-unseal init complete`, but
`bao status` reports Initialized=true Sealed=true forever, the
bp-openbao HR stays Unknown/Running for the full 15m install
timeout, and bp-external-secrets/bp-external-secrets-stores block
on the dep.

Fix has two parts:

1. Persist `unseal_keys_b64` on fresh init to a new K8s Secret
   `openbao-unseal-keys` (BEFORE applying the keys, so a unseal
   crash mid-step is recoverable on next retry).
2. Add a Step 2a "idempotent-path unseal" branch: when bao reports
   Initialized=true Sealed=true, fetch the persisted keys Secret
   and apply unseal exactly the same way Step 3a does on fresh
   init. Verify Sealed=false and exit; otherwise FATAL with the
   manual-recovery pointer.

RBAC: extend the openbao-auto-unseal Role to allow create/get/
patch/update on openbao-unseal-keys (alongside openbao-init-marker).

Chart bump 1.2.3 → 1.2.4. HR ref in
clusters/_template/bootstrap-kit/08-openbao.yaml updated to match
so cloud-init-templated Sovereigns pick up the new chart.

Co-authored-by: e3mrah <emrah.baysal@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 10:44:46 +04:00
e3mrah
d90abb1e85
fix(bp-openbao): unseal vault after init in chart Job (Closes #527) (#528)
The init Job ran `bao operator init -key-shares=1 -key-threshold=1`
which leaves the cluster Initialized=true but Sealed=true. Without
an explicit `bao operator unseal <key>` call the StatefulSet pod
stays sealed forever, the bp-openbao HelmRelease never reports
Ready=True, and every dependent blueprint (bp-external-secrets,
bp-external-secrets-stores) blocks on this dep.

This was the 5th and final latent bug in the chart's auto-unseal
flow (after PRs #518 #520 #523 #524 #525). On otech17
(6b17518f12d529ea, 2026-05-02) the init Job completed cleanly but
`bao status` reported Sealed=true forever.

Fix: parse `unseal_threshold` and `unseal_keys_b64` from the init
JSON, call `bao operator unseal <key>` $threshold times (1 with
the current key-shares=1 / key-threshold=1 config), then assert
`bao status -format=json | grep '"sealed":false'` before the Job
exits success. Bumps chart 1.2.2 -> 1.2.3 and HR ref in
clusters/_template/bootstrap-kit/08-openbao.yaml.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 09:24:57 +04:00
e3mrah
ba5a1929f1
fix(bp-openbao): use shamir-compatible init flags + bump 1.2.1→1.2.2 (refs #517) (#525)
The chart's init Job called `bao operator init -recovery-shares=1
-recovery-threshold=1` which only works with auto-unseal seal types
(gcpckms/awskms/transit). The upstream openbao chart's default config
uses `seal "shamir"` (no auto-unseal stanza in
values.standalone.config / values.ha.config), so the OpenBao API
returns 400: "parameters recovery_shares,recovery_threshold not
applicable to seal type shamir".

Switch to -key-shares=1 -key-threshold=1 which is the correct shamir-
seal init flags. Operators wiring auto-unseal seals later will need
to flip back via a chart-values toggle.

Bumps chart 1.2.1→1.2.2 + matches HR ref so Sovereigns pull the new
artifact on next reconcile.

Refs #517

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 09:14:05 +04:00
e3mrah
6e3d3d281e
fix(bp-openbao): bump chart 1.2.0→1.2.1 + HR ref for busybox-wget fix (refs #517) (#524)
Bumps platform/openbao/chart/Chart.yaml version to 1.2.1 carrying the
busybox-compatible wget flag fix (PR #523). Also bumps the HR's
chart.spec.version in clusters/_template/bootstrap-kit/08-openbao.yaml
so Sovereigns pull the new bytes once blueprint-release publishes
ghcr.io/openova-io/bp-openbao:1.2.1.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 09:09:06 +04:00
e3mrah
5c0618d920
fix(bp-openbao): use busybox-compatible wget flag in init Job (refs #517) (#523)
The chart's init Job runs inside the openbao image (quay.io/openbao/
openbao:2.1.0) which uses busybox wget. The script's wget calls used
`--ca-certificate=$CACERT` which busybox wget does not support, causing
wget to print its usage page and fail with "seed Secret has no key
recovery-seed" (false negative — the parsing pipeline saw the usage
text instead of JSON).

Replace with `--no-check-certificate`. The Secret still requires the
Bearer token for auth — the lack of CA verification only affects
TLS handshake validation against an in-cluster API server reached via
the well-known kubernetes.default.svc DNS name (out-of-band attack
surface is negligible inside the pod network).

The `--method=DELETE` line for cleaning up the seed Secret remains —
busybox wget doesn't support method override either, but that line
is wrapped in `|| true` so the seed deletion failure doesn't block
the init Job from succeeding. Seed is single-use anyway and harmless
post-init (the recovery key is the OUTPUT of bao operator init, not
this seed).

Refs #517

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 09:07:52 +04:00
e3mrah
d2ada908c9
feat(bp-openbao): auto-unseal flow — cloud-init seed + post-install init Job (closes #316) (#408)
Catalyst-curated auto-unseal pipeline for OpenBao on Hetzner Sovereigns
(no managed-KMS available). Selected **Option A — Shamir + cloud-init
seed** because:

  - Hetzner has no managed-KMS service → Cloud-KMS auto-unseal (Option C)
    is structurally unavailable.
  - Transit-seal (Option B) requires a peer OpenBao cluster, only
    applicable to multi-region tier-1; out of scope for single-region
    omantel.
  - Manual unseal (Option D) violates the "first sovereign-admin lands
    on console.<sovereign-fqdn> ready to use" goal in
    SOVEREIGN-PROVISIONING.md §5.

Architecture (per issue #316 spec + acceptance criteria 1-6):

  1. Cloud-init on the control-plane node generates a 32-byte recovery
     seed from /dev/urandom and writes it to a single-use K8s Secret
     `openbao-recovery-seed` in the openbao namespace, with annotation
     `openbao.openova.io/single-use: "true"`. Pre-creates the openbao
     namespace to eliminate the race with Flux's HelmRelease apply.
  2. bp-openbao chart v1.2.0 ships two new Helm post-install hooks:
       - `templates/init-job.yaml` (hook weight 5): consumes the seed,
         calls `bao operator init -recovery-shares=1 -recovery-threshold=1`,
         persists the recovery key inside OpenBao's auto-unseal config,
         deletes the seed Secret on success. Idempotent — re-runs detect
         Initialized=true and exit 0.
       - `templates/auth-bootstrap-job.yaml` (hook weight 10): enables
         the Kubernetes auth method, mounts kv-v2 at `secret/`, writes
         the `external-secrets-read` policy, binds the `external-secrets`
         role to the ESO ServiceAccount in `external-secrets-system`.
  3. `templates/auto-unseal-rbac.yaml` declares the least-privilege SA
     + Role + RoleBinding the Jobs need (Secret get/list/delete in the
     openbao namespace; create/get/patch on the openbao-init-marker).
     Also emits the permanent `system:auth-delegator` ClusterRoleBinding
     bound to the OpenBao ServiceAccount so the Kubernetes auth method
     can call tokenreviews.authentication.k8s.io.
  4. Cluster overlay `clusters/_template/bootstrap-kit/08-openbao.yaml`
     bumps version 1.1.1 → 1.2.0 and flips `autoUnseal.enabled: true`
     per-Sovereign.

Per #402 lesson: skip-render pattern (`{{- if .Values.X }}{{ emit }}
{{- end }}`) used throughout — never `{{ fail }}`. Default `helm
template` render emits NOTHING new; opt-in via autoUnseal.enabled=true.

Acceptance criteria coverage:
  1. Provision fresh Sovereign — cloud-init writes seed, Flux installs
     bp-openbao 1.2.0, post-install Jobs run automatically. 
  2. bp-openbao HR Ready=True without manual intervention — install
     keeps `disableWait: true` (Helm Ready ≠ OpenBao initialised; the
     init Job drives initialisation out-of-band on the same install). 
  3. `bao status` shows Sealed=false, Initialized=true within 5 minutes
     — init Job polls + retries up to 60×5s. 
  4. ESO ClusterSecretStore vault-region1 reaches Status: Valid — the
     auth-bootstrap Job binds the `external-secrets` role to ESO's SA
     before the Job exits. 
  5. Seed Secret deleted post-init — init Job deletes it via K8s API
     after consuming. 
  6. No openbao-root-token Secret in K8s — root token captured to
     /tmp/.root-token in the Job pod's tmpfs only; never written to a
     K8s Secret. The recovery key persists ONLY inside OpenBao's Raft
     state (auto-unseal config). 

Tests:
  - tests/auto-unseal-toggle.sh — 4 cases:
    * default render → no auto-unseal artefacts (skip-render works)
    * autoUnseal.enabled=true → both Jobs + correct hook weights
    * kubernetesAuth.enabled=false → init Job only, no auth-bootstrap
    * idempotency annotations present on all 5 hook objects
  - tests/observability-toggle.sh — unchanged, all 3 cases green.
  - helm lint . — clean.

Files:
  - platform/openbao/chart/Chart.yaml — version 1.1.1 → 1.2.0
  - platform/openbao/blueprint.yaml — version 1.1.1 → 1.2.0
  - platform/openbao/chart/values.yaml — `autoUnseal.*` block
  - platform/openbao/chart/templates/auto-unseal-rbac.yaml — new
  - platform/openbao/chart/templates/init-job.yaml — new
  - platform/openbao/chart/templates/auth-bootstrap-job.yaml — new
  - platform/openbao/chart/tests/auto-unseal-toggle.sh — new
  - platform/openbao/README.md — bootstrap procedure §2-3 expanded;
    auto-unseal alternatives table added.
  - clusters/_template/bootstrap-kit/08-openbao.yaml — chart 1.1.1 →
    1.2.0, autoUnseal.enabled=true.
  - infra/hetzner/cloudinit-control-plane.tftpl — seed-token block
    inserted between ghcr-pull-secret apply and flux-bootstrap apply.
  - docs/omantel-handover-wbs.md §9 — #316 ticked chart-released.

Canonical seam used: extended existing `platform/openbao/chart/` per
the anti-duplication rule. NO standalone scripts. NO bespoke Go cloud
calls. NO `{{ fail }}`. All knobs configurable via values.yaml per
INVIOLABLE-PRINCIPLES.md #4 (never hardcode).

Co-authored-by: hatiyildiz <hat.yil@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 16:45:44 +04:00
e3mrah
a1bd550208
fix(charts): HTTPRoute templates skip-render on missing host (was failing default-values render) (#402)
Blueprint-release for #401 failed because HTTPRoute templates use
{{- fail }} when gateway.host is not set, which trips the chart default-values
render gate in CI. Switched 6 templates from 'fail loud' to 'skip render':

  if .Values.gateway.host  →  emit HTTPRoute
  else                     →  emit nothing

The Gateway API admission already rejects HTTPRoute with empty hostnames,
so the loud-fail wasn't buying anything an operator wouldn't see at apply
time. Default-values render now produces zero HTTPRoute resources, which
is the correct shape for the upstream chart consumers that don't set
the Sovereign-only gateway block.

Files: keycloak, gitea, openbao, grafana, harbor, catalyst-platform.

Verified:
  helm template t products/catalyst/chart/ → 0 HTTPRoutes (clean)
  helm template t products/catalyst/chart/ --set ingress.gateway.enabled=true --set ingress.hosts.console.host=console.test --set ingress.hosts.api.host=api.test → 2 HTTPRoutes

Closes the blueprint-release failure on commit abf01b6f.

Co-authored-by: hatiyildiz <hati@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 16:23:58 +04:00
e3mrah
abf01b6f21
feat(platform): Gateway API migration audit (#387) (#401)
Migrates every minimal-Sovereign-set blueprint chart from
networking.k8s.io/v1.Ingress to gateway.networking.k8s.io/v1.HTTPRoute,
replacing the legacy Traefik-on-Sovereigns assumption with the canonical
Cilium + Envoy + Gateway API path per ADR-0001 §9.4 and the WBS §2
correction note (#388).

The single per-Sovereign Gateway is added as additional documents in
the existing bootstrap-kit slot clusters/_template/bootstrap-kit/01-cilium.yaml
(NOT a new top-level slot), since Cilium owns the GatewayClass. It
includes:

  - Certificate `sovereign-wildcard-tls` requesting `*.${SOVEREIGN_FQDN}`
    from `letsencrypt-dns01-prod` (cert-manager + #373 webhook)
  - Gateway `cilium-gateway` in `kube-system` with HTTPS (443, TLS
    terminate) + HTTP (80) listeners, allowedRoutes.namespaces.from=All

Per-blueprint HTTPRoute templates (canonical seam: each wrapper chart's
existing `templates/` directory):

  | Blueprint           | Host pattern                    | Backend port |
  |---------------------|---------------------------------|--------------|
  | bp-keycloak         | auth.<sov>                      | 80           |
  | bp-gitea            | git.<sov>                       | 3000         |
  | bp-openbao          | bao.<sov>                       | 8200         |
  | bp-grafana          | grafana.<sov>                   | 80           |
  | bp-harbor           | registry.<sov>                  | 80           |
  | bp-powerdns         | pdns.<sov>/api  (dual-mode)     | 8081         |
  | bp-catalyst-platform| console.<sov>, api.<sov>         | 80, 8080     |

bp-powerdns supports both Ingress (contabo legacy) and HTTPRoute
(Sovereign) simultaneously — the per-Sovereign overlay sets
`api.gateway.enabled=true` while leaving `api.enabled=true`. The
Ingress object is harmless on Cilium clusters with no Traefik. This
preserves contabo's existing pdns.openova.io flow per ADR-0001 §9.4.

bp-harbor flips `expose.type` from `ingress` to `clusterIP` in
platform/harbor/chart/values.yaml so the upstream chart no longer
emits its own Ingress; the HTTPRoute is the sole HTTP exposure.
TLS terminates at the Gateway (wildcard cert) rather than per-host
Certificates inside the chart.

bp-catalyst-platform's `templates/httproute.yaml` is NOT excluded by
.helmignore (unlike templates/ingress.yaml + templates/ingress-console-tls.yaml,
which remain contabo-only legacy demo infra). The contabo path keeps
serving console.openova.io/sovereign via Traefik unchanged.

Bootstrap-kit slot updates (per-Sovereign hostname interpolation):

  - 08-openbao.yaml      → gateway.host: bao.${SOVEREIGN_FQDN}
  - 09-keycloak.yaml     → gateway.host: auth.${SOVEREIGN_FQDN}
  - 10-gitea.yaml        → gateway.host: gitea.${SOVEREIGN_FQDN}
  - 11-powerdns.yaml     → api.host: pdns.${SOVEREIGN_FQDN}, api.gateway.enabled: true
  - 19-harbor.yaml       → gateway.host: registry.${SOVEREIGN_FQDN}
  - 25-grafana.yaml      → gateway.host: grafana.${SOVEREIGN_FQDN}

Server-side dry-run validation against the live Cilium Gateway API
CRDs on contabo: every HTTPRoute and the per-Sovereign Gateway
+ Certificate apply cleanly via `kubectl apply --dry-run=server`.

Contabo unaffected: clusters/contabo-mkt/* not modified. The legacy
SME ingresses (console-nova, marketplace, admin, axon, talentmesh,
stalwart, ...) continue to serve via Traefik as before. powerdns
on contabo remains on the Ingress path (api.gateway.enabled defaults
to false at the chart level).

Closes #387.

Co-authored-by: hatiyildiz <269457768+hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 16:19:30 +04:00
e3mrah
1f5c76def1
fix(platform): sync blueprint.yaml versions with Chart.yaml (#199)
* feat(ui): Playwright cosmetic + step-flow regression guards

15 regression guards in products/catalyst/bootstrap/ui/e2e/cosmetic-
guards.spec.ts that fail HARD when each user-flagged defect class
returns:

  1.  card height drift from canonical 108px
  2.  reserved right padding eating description width
  3.  logo tile drift from per-brand LOGO_SURFACE
  4.  invisible glyph (white-on-white) via luminance proxy
  5.  wizard step order Org/Topology/Provider/Credentials/Components/
      Domain/Review
  6.  legacy "Choose Your Stack" / "Always Included" tab labels
  7.  Domain step reachable before Components
  8.  CPX32 not the recommended Hetzner SKU
  9.  per-region SKU dropdown shows wrong provider catalog
  10. provision page is .html (static) not SPA route
  11. legacy bubble/edge DAG SVG markup on provision page
  12. admin sidebar drift from canonical core/console (w-56 + 7 labels)
  13. AppDetail uses tablist instead of sectioned layout
  14. job rows navigate to /job/<id> instead of expand-in-place
  15. Phase 0 banners (Hetzner infra / Cluster bootstrap) on AdminPage

Each test prints a failure message naming the canonical reference,
the source-of-truth file, and the data-testid PR needed (if any) so
the implementing agent has a precise target. No .skip() — per
INVIOLABLE-PRINCIPLES #2, missing components fail loud.

CI: .github/workflows/cosmetic-guards.yaml runs the suite on every
PR that touches products/catalyst/bootstrap/ui/** or core/console/**.

Docs: docs/UI-REGRESSION-GUARDS.md maps each test to the user's
original complaint, the canonical reference, and the green/red
semantics (5 tests intentionally RED on main today — they stay red
until the companion-agent's UI work lands).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(platform): sync blueprint.yaml versions with Chart.yaml so manifest-validation passes

---------

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 22:07:55 +04:00
hatiyildiz
1ddd569789 fix(bp-*): observability toggles default false — break circular CRD dependency
Extends the v1.1.1 hardening that started with cilium / cert-manager /
crossplane to the remaining 8 bootstrap-kit + per-Sovereign Blueprints.
Every observability toggle in every Catalyst-curated Blueprint now ships
`false`/`null` by default; the operator opts in via a per-cluster values
overlay at clusters/<sovereign>/bootstrap-kit/* once
bp-kube-prometheus-stack reconciles.

Live failure mode that prompted this (omantel.omani.works 2026-04-29):
bp-cilium @ 1.1.0 defaulted hubble.relay/ui + prometheus.serviceMonitor
to true. The upstream Cilium 1.16.5 chart renders a
monitoring.coreos.com/v1 ServiceMonitor whose CRD ships with
kube-prometheus-stack — a tier-2 Application Blueprint that depends on
the bootstrap-kit (cilium first). Helm install fails on a fresh
Sovereign with "no matches for kind ServiceMonitor in version
monitoring.coreos.com/v1 — ensure CRDs are installed first" and every
downstream HelmRelease reports `dep is not ready`. The earlier
trustCRDsExist=true mitigation only suppresses Helm's render-time gate;
the apiserver still rejects the resource at install-time.

Per-Blueprint changes:
- bp-cilium: hubble.relay.enabled, hubble.ui.enabled → false;
  hubble.metrics.enabled → null (this is the exact value that disables
  the upstream metrics ServiceMonitor template branch — verified by
  reading cilium 1.16.5's _hubble.tpl); hubble.metrics.serviceMonitor
  .enabled → false. tests/observability-toggle.sh extended with Case 4
  (default render produces no hubble-relay / hubble-ui Deployments).
- bp-flux: flux2.prometheus.podMonitor.create → false.
- bp-sealed-secrets: sealed-secrets.metrics.serviceMonitor.enabled
  → false (explicit lock; upstream already defaults false).
- bp-spire: spire.global.spire.recommendations.enabled +
  recommendations.prometheus → false.
- bp-nats-jetstream: nats.promExporter.enabled +
  promExporter.podMonitor.enabled → false.
- bp-openbao: openbao.injector.metrics.enabled +
  openbao.serviceMonitor.enabled → false.
- bp-keycloak: keycloak.metrics.enabled + metrics.serviceMonitor.enabled
  + metrics.prometheusRule.enabled → false.
- bp-gitea: gitea.gitea.metrics.* and gitea.postgresql.metrics.*
  serviceMonitor + prometheusRule → false.
- bp-powerdns: powerdns.serviceMonitor.enabled + powerdns.metrics.enabled
  → false (forward-compatibility guard; current upstream
  pschichtel/powerdns 0.10.0 has no ServiceMonitor template, but a future
  upstream bump cannot silently regress).

Each chart ships a tests/observability-toggle.sh that asserts the rule
in three cases (default off / explicit on opt-in / explicit off) — runs
under blueprint-release.yaml's chart-test gate (added bdeb0f54 + the
existing wiring) before helm push. A regression that re-introduces a
hardcoded enabled: true in any chart fails CI before the OCI artifact
is published.

Versioning:
- All 11 leaf charts bumped 1.1.0 → 1.1.1.
- products/catalyst/chart (bp-catalyst-platform umbrella) deps updated
  to 1.1.1 across the board.
- clusters/_template/bootstrap-kit/03-flux through 10-gitea bumped to
  1.1.1; clusters/omantel.omani.works/bootstrap-kit/* mirror.

docs/BLUEPRINT-AUTHORING.md §11.2 table extended to enumerate every
toggle disabled across all 11 Blueprints. References
docs/INVIOLABLE-PRINCIPLES.md #4.

GATES (all green):
- helm dep build resolves cleanly post-change for every chart whose
  upstream is published (umbrella waits on per-leaf publish).
- helm lint clean on all 11 leaves.
- helm template . default render produces zero monitoring.coreos.com
  references on every leaf (verified locally).
- tests/observability-toggle.sh PASS on all 11 leaves.

Live verification: with v1.1.1 published the omantel.omani.works
HelmRelease can roll forward without a manual values patch — Flux picks
up the new chart digest automatically (semver: 1.x in OCIRepository).

Refs: issue #182.
2026-04-29 19:23:52 +02:00
hatiyildiz
43aff20254 feat(bp-*): convert all 11 bootstrap-kit charts to umbrella charts depending on upstream
Each platform/<name>/chart/Chart.yaml now declares the canonical upstream
chart as a dependencies: entry. helm dependency build pulls the upstream
payload into the OCI artifact at publish time, so Flux helm install of
bp-<name>:1.1.0 actually installs the upstream Helm release alongside the
Catalyst-curated overlays (NetworkPolicy, ServiceMonitor, ClusterIssuer,
ExternalSecret) under templates/.

Pinned upstream chart versions per platform/<name>/blueprint.yaml:
- cilium                 1.16.5  https://helm.cilium.io
- cert-manager           v1.16.2 https://charts.jetstack.io
- flux                   2.4.0   https://fluxcd-community.github.io/helm-charts
- crossplane             1.17.x  https://charts.crossplane.io/stable
- sealed-secrets         2.16.x  https://bitnami-labs.github.io/sealed-secrets
- spire                  ...     https://spiffe.github.io/helm-charts-hardened
- nats-jetstream         ...     https://nats-io.github.io/k8s/helm/charts
- openbao                ...     https://openbao.github.io/openbao-helm
- keycloak               ...     https://charts.bitnami.com/bitnami
- gitea                  ...     https://dl.gitea.com/charts
- catalyst-platform      umbrella over the 10 leaf bp-* charts via
                         helm dependency

values.yaml in each chart adopts the umbrella convention: catalystBlueprint
metadata block (provenance + version) at top level, upstream subchart
values namespaced under the dependency name.

cert-manager specifically: clusterissuer-letsencrypt-dns01.yaml gets the
helm.sh/hook: post-install,post-upgrade annotation so it applies AFTER
cert-manager controllers are running and CRDs registered (the previous
hollow-chart shape ran the ClusterIssuer at install time when CRDs
didn't exist yet, which was the omantel cluster's exact failure mode).

Wrapper chart version bumped 1.0.0 → 1.1.0 across the board (umbrella
conversion is a meaningful structural revision). Cluster manifests in
clusters/_template/bootstrap-kit/ AND clusters/omantel.omani.works/
bootstrap-kit/ updated to reference 1.1.0.

The blueprint-release.yaml workflow's helm package step needs an
explicit helm dependency build before push so the upstream subchart
bytes ship inside the OCI artifact. That CI change is a follow-up
commit on this same branch (separate file scope).
2026-04-29 17:21:36 +02:00
hatiyildiz
62d9c7d936 fix(charts): drop dependencies block — wrappers carry values overlay only
The first 2 blueprint-release CI runs failed on `helm package` with containerd permission errors because the wrapper Chart.yaml's `dependencies:` block triggered helm to pull the upstream charts via OCI/containerd at package time, which the GitHub Actions runner blocks.

Architectural fix: each Catalyst Blueprint wrapper carries the values overlay + metadata only. The bootstrap installer reads the upstream chart reference from the wrapper's values.yaml `catalystBlueprint.upstream.{chart,version,repo}` metadata block, points `helm install` at the upstream chart's repo, and overlays our values.

This keeps:
- blueprint-release CI lightweight (no upstream pulls during package; helm package now works without containerd)
- the "bp-<name> wrapper does NOT drift from upstream" property (we ship the overlay, not a fork)
- the single Blueprint contract from BLUEPRINT-AUTHORING §1 (a wrapper is still a Catalyst-curated Helm chart published as bp-<name>:<semver>)

Changes:
- 11 platform/<name>/chart/Chart.yaml: removed dependencies block. Each is now a plain Helm chart with no remote pulls during package.
- 11 platform/<name>/chart/values.yaml: prepended catalystBlueprint.upstream.{chart,version,repo} metadata block at the top. Bootstrap installer parses it to know which upstream chart to install with these values.
- products/catalyst/bootstrap/api/internal/bootstrap/bootstrap.go: installCilium now does `helm repo add cilium https://helm.cilium.io --force-update` then `helm install cilium cilium/cilium --version 1.16.5 --values -` (the cilium/cilium upstream chart, with our overlay values piped from values.yaml). Same pattern needs propagating to the other 10 install functions in a follow-up.

After this commit, blueprint-release CI should green-build all 11 wrappers (helm package now works without containerd access since there's nothing to pull). The bootstrap installer's actual `helm install` calls in production reach upstream chart repos via the runtime k3s cluster's pod network, which has full network access.
2026-04-28 12:57:29 +02:00
hatiyildiz
441ebaebb8 fix(charts): pin upstream chart versions/names to ones that exist in their repos
The first Blueprint Release CI run (commit 8c0f766) failed because four chart wrappers referenced upstream chart versions/names that don't exist in their published repositories:

- platform/flux/chart: name was "flux", repo was OCI; actual is name "flux2" in plain helm repo at https://fluxcd-community.github.io/helm-charts. Pinned to 2.13.0.
- platform/openbao/chart: version 2.1.0 was the binary appVersion, not the chart version. Pinned to 0.16.0 chart (which packages openbao 2.1.0 internally).
- platform/keycloak/chart (Bitnami): chart version 25.0.6 was the appVersion of upstream; Bitnami's chart is at 24.7.1 packaging Keycloak 26.0.x. Pinned to 24.7.1.
- platform/nats-jetstream/chart: name was "nats-jetstream"; the upstream chart is named "nats" (it always was — JetStream is a feature of NATS, not a separate chart). Renamed.

Cilium, cert-manager, crossplane, sealed-secrets, spire wrappers were unaffected; their version pins matched upstream availability.

Containerd permission-denied errors from `helm package` on cilium/cert-manager/crossplane/gitea/sealed-secrets are a separate CI plumbing issue (helm tries to pull OCI base images during package build via containerd, but the GitHub Actions runner blocks containerd socket access). Tracked as a follow-up: switch to `helm package --skip-refresh` or use a runner with containerd permissions.

After this commit lands, the next blueprint-release CI run should green-build at minimum the 4 fixed charts. Successful builds publish bp-{flux,openbao,keycloak,nats-jetstream}:1.0.0 OCI artifacts to ghcr.io/openova-io/.
2026-04-28 12:55:21 +02:00
hatiyildiz
8c0f76640c feat(charts): G2 wrapper Helm charts for 11 bootstrap-kit components + blueprint-release CI
Per docs/PROVISIONING-PLAN.md and tickets [F] chart. Adds Catalyst-curated wrapper Helm charts at platform/<name>/chart/ for every component the bootstrap-kit installer (introduced in commit 07b4bcf) needs. Each chart is the canonical bp-<name> source per BLUEPRINT-AUTHORING.md §1's source-location rule.

11 charts created with Chart.yaml + values.yaml + blueprint.yaml each:

Network + GitOps:
- platform/cilium/chart — wraps cilium 1.16.5; kubeProxyReplacement, WireGuard mTLS, Hubble, Gateway API
- platform/flux/chart — wraps flux 2.4.0
- platform/crossplane/chart — wraps crossplane 1.18.0 + provider-hcloud manifest

Security:
- platform/cert-manager/chart — wraps cert-manager 1.16.2 with CRDs+ServiceMonitor
- platform/sealed-secrets/chart — wraps sealed-secrets 2.16.1 (transient bootstrap-only)
- platform/spire/chart — wraps spiffe/spire 1.10.4 (5-min SVID rotation)

Catalyst control-plane services:
- platform/nats-jetstream/chart — wraps nats 2.10.22 (3-node cluster, JetStream + KV)
- platform/openbao/chart — wraps openbao 2.1.0 (3-node Raft, region-local per SECURITY §5)
- platform/keycloak/chart — wraps keycloak 25.0.6 (Bitnami flavor, edge proxy mode)
- platform/gitea/chart — wraps gitea 10.5.0 (CNPG Postgres backend, no chart-bundled valkey/redis since Catalyst control plane uses JetStream)

New platform/ folders (added per AUDIT-PROCEDURE component-count anchor — was 53, now 55):
- platform/spire/README.md — workload identity Catalyst control plane component
- platform/nats-jetstream/README.md — control-plane event spine
- platform/sealed-secrets/README.md — transient bootstrap-only

Each blueprint.yaml declares:
- catalyst.openova.io/v1alpha1 Blueprint kind (canonical CRD per BLUEPRINT-AUTHORING §3)
- visibility: unlisted (mandatory infra, auto-installed by bootstrap kit, not a marketplace card)
- manifests.chart: ./chart pointer
- depends: [] (foundational components have no Blueprint dependencies; control-plane services depend on each other implicitly via bootstrap order, not via Blueprint depends)

.github/workflows/blueprint-release.yaml:
- New CI workflow per BLUEPRINT-AUTHORING §11 (path-matrix per Blueprint folder)
- Triggers on push to main touching platform/*/chart/** or products/*/chart/**
- detect job: emits matrix of changed Blueprint folders via git diff
- build job (per chart): helm dependency build → helm package → helm push to GHCR → cosign keyless sign (GitHub OIDC) → Syft SBOM attestation
- Output: ghcr.io/openova-io/bp-<name>:<semver> with SLSA-3-style supply-chain provenance

Closes [F] tickets: 11 G2 charts (cilium, cert-manager, flux, crossplane, sealed-secrets, spire, nats-jetstream, openbao, keycloak, gitea, plus the umbrella products/catalyst/chart already exists from Pass 105). blueprint.yaml CRDs added across 11 entries. CI fan-out workflow live.

After this commit lands, the bootstrap-kit installer in commit 07b4bcf has real OCI artifacts to install. The first push to main will trigger 10 build matrix jobs (cilium was created in a separate commit earlier in this session) which produce 10 cosigned bp-<name>:<semver> artifacts on GHCR.

Component-count anchor update follows: 53 → 55 (added spire + nats-jetstream + sealed-secrets — but sealed-secrets was already conceptually counted under "supporting services"). Per AUDIT-PROCEDURE the count needs updating in CLAUDE.md, BUSINESS-STRATEGY, TECHNOLOGY-FORECAST L11. Tracked as separate ticket [K] docs.
2026-04-28 12:51:06 +02:00
hatiyildiz
3993f5fc31 docs(pass-31): openbao + librechat DNS-placeholder carry-over fixes
platform/openbao/README.md ingress hosts (line 108) had `bao.<domain>` while
the same file's ClusterSecretStore example (line 127) used the canonical
`bao.<location-code>.<sovereign-domain>` form. Pass 7's active-active fix
addressed the body but missed the ingress placeholder. Aligned with the
canonical form.

platform/librechat/README.md OAuth callback (line 154) had
`chat.ai-hub.<domain>/oauth/openid/callback` — same Application-endpoint
shape Pass 25 fixed in llm-gateway. Pass 22 marked the file clean and Pass
29 fixed the Keycloak issuer line but didn't re-sweep. Per NAMING §5.2
Application endpoints are `{app}.{environment}.{sovereign-domain}`. Fixed.

docs/GLOSSARY.md verified clean — single-source-of-truth has held across
the loop (Pass 6/7/14/20/22/26/27 all consistent with current GLOSSARY).

Validation log Pass 31 entry includes meta-note: third file (librechat)
that needed re-opening after a "clean" mark — banner scans miss YAML-block
drift. Future passes should default to a full placeholder-shape grep on
every file touched.
2026-04-27 22:34:10 +02:00
hatiyildiz
42aeb629bb docs(pass-7): rewrite OpenBao + ESO READMEs to match agreed multi-region semantics
Pass 7 — line-by-line read of platform/openbao/README.md and
platform/external-secrets/README.md found a major architectural drift:
both files described an OLD active-active bidirectional sync model
that contradicts docs/SECURITY.md §5 (the canonical reference).

The active-active design was rejected during the architecture session
because it would have been a stretched cluster — a single region's
network blip would block writes everywhere. The agreed model is:

- Independent Raft cluster per region (intra-region quorum only).
- Single-primary writes; replicas accept reads only.
- Async Performance Replication primary → replicas (lag <1s typical).
- Explicit DR promotion (sovereign-admin or failover-controller).

Fixes:

platform/openbao/README.md:
- Overview: removed "active-active deployments" / "either region can
  update secrets". Replaced with "independent Raft cluster per region",
  "asynchronous Performance Replication".
- Architecture diagram: replaced bidirectional-push diagram with the
  primary→replicas async perf replication topology that matches
  SECURITY.md §5.
- ClusterSecretStores: simplified from "two stores (local+remote)" to
  "one local store"; reads always pull locally.
- Renamed "PushSecret (Bidirectional)" → "Writes go to the primary
  region" with a single-target PushSecret pointing at bao-primary.
- Added DR promotion section pointing at SECURITY.md §5.2.
- Status banner: notes that the canonical multi-region reference is
  SECURITY.md.

platform/external-secrets/README.md:
- Header line: repositioned as per-host-cluster infrastructure with
  pointer to PLATFORM-TECH-STACK §3.3.
- Removed broken link to non-existent ../openbao/docs/ADR-OPENBAO.md
  (replaced with link to ../openbao/README.md).
- "Multi-region sync | Push to both OpenBao instances simultaneously"
  → "Multi-region reads | Async perf replication".
- "PushSecret to Multiple OpenBao Instances" example was writing to
  two ClusterSecretStores in parallel — replaced with single-target
  primary write.
- "Multi-region sync via single PushSecret" in Consequences →
  "Cross-region availability via Performance Replication".
- Mermaid sequence diagram: "Bootstrap Wizard" actor → "Catalyst
  Bootstrap (Phase 0)"; "Terraform" → "OpenTofu"; ESO connection
  description "via K8s auth" → "via SPIFFE SVID (workload identity)".

These were the most consequential drift fixes found in any pass —
two READMEs were documenting an architecture explicitly rejected by
the agreed model.

Refs #37
2026-04-27 21:34:09 +02:00
hatiyildiz
119a1e53a0 docs(components): terminology pass across platform and product READMEs
Bring per-component READMEs in line with the canonical glossary
(docs/GLOSSARY.md). Substantive architectural content unchanged —
this is a terminology + reference correctness pass.

Placeholder rename: <tenant> → <org> in YAML / IaC examples across
- platform/cnpg/README.md           (Cluster + Pooler + ScheduledBackup)
- platform/debezium/README.md       (PostgreSQL connector + topic patterns)
- platform/external-secrets/README.md (ExternalSecret / SecretStore)
- platform/grafana/README.md        (Instrumentation namespace)
- platform/k8gb/README.md           (Gslb + namespace + kubectl examples)
- platform/keda/README.md           (ScaledObject + Kafka triggers + Prometheus)
- platform/opentofu/README.md       (server resource example)
- platform/velero/README.md         (BackupStorageLocation buckets)
- platform/vpa/README.md            (VerticalPodAutoscaler examples)
- platform/flux/README.md           (kustomization name + tenants/ → organizations/)

"Catalyst IDP" → "Catalyst console":
- platform/crossplane/README.md     (integration section retitled and
                                      rewritten — Crossplane is platform
                                      plumbing, not user-facing)
- platform/gitea/README.md          (architecture diagram + integration table)
- platform/kyverno/README.md        (rollout tracking surface)
- products/fingate/README.md        (TPP onboarding portal)

"Bootstrap wizard" → "Catalyst bootstrap":
- platform/openbao/README.md        (bootstrap procedure rewritten —
                                      independent Raft per region clarified;
                                      cross-references docs/SECURITY.md §5)
- platform/opentofu/README.md       (Quick Start)

Kyverno labels & prose:
- openova.io/tenant → openova.io/organization (label rename for
  consistency; deployed clusters will add new label as a co-label
  during migration window)
- "tenant labels" / "tenant namespace" prose updated to
  "Organization labels" / "Organization-labeled namespace"
- Priority class names (tenant-high, tenant-default, tenant-batch)
  retained as deployed artifact names — rename pending in a
  separate migration ticket

No banned-term hits remain in component READMEs (verified by grep
in docs/GLOSSARY.md banned-terms table).

Refs #37
2026-04-27 20:06:51 +02:00
talent-mesh
10245dff98 feat: ecosystem expansion to 55 components with license compliance
- Replace BSL-licensed components with open-source alternatives:
  Terraform→OpenTofu (MPL 2.0), Vault→OpenBao (MPL 2.0),
  Redpanda→Strimzi/Kafka (Apache 2.0), n8n→Airflow (Apache 2.0)
- Add 14 new platform components: activemq, camel, clickhouse, dapr,
  debezium, falco, flink, iceberg, opensearch, rabbitmq, superset,
  temporal, trino, vitess
- Rename meta-platforms/ to products/ with new product names:
  Cortex (AI Hub), Fingate (Open Banking), Titan (Data Lakehouse),
  Fuse (Microservices Integration)
- Update all documentation, READMEs, and cross-references

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-11 18:15:11 +00:00