openova

Author	SHA1	Message	Date
e3mrah	0ac12970d8	ci(openova-flow): build openova-flow-server + adapter-flux images + sed chart tags (#1398 ) Add the two missing GitHub Actions build pipelines for the OpenovaFlow Go binaries so prov #34 has real images to install. Both auto-bump their chart's values.yaml `image.tag` on every main-branch push and dispatch blueprint-release for chart re-publish. Workflows shipped: - .github/workflows/build-openova-flow-server.yaml · Triggers on push to products/openova-flow/server/** or the chart · `go vet` + `go test -race` + Buildx push to ghcr.io/openova-io/openova/openova-flow-server:<sha> + :latest · cosign keyless sign + SBOM attest · awk-bumps platform/openova-flow-server/chart/values.yaml flowServer.image.tag, commits to main with [skip ci] · Dispatches blueprint-release.yaml for chart re-publish - .github/workflows/build-openova-flow-adapter-flux.yaml · Same shape; bumps platform/openova-flow-emitter/chart/values.yaml flowEmitter.image.tag Chart defaults (`tag: "latest"`) already shipped in PR #1397 — no values.yaml changes needed in this PR. Canonical patterns cited (ARCHITECT-FIRST): - Build shape mirrors .github/workflows/build-application-controller.yaml (Go vet + test + Buildx + cosign + SBOM + values.yaml awk-bump + blueprint-release dispatch). - awk-sed bump pattern mirrors catalystApi/catalystUi tag-bump in .github/workflows/catalyst-build.yaml `deploy` job (with the `[skip ci]` + explicit blueprint-release dispatch fix from #712). Per docs/INVIOLABLE-PRINCIPLES.md: - #4a (GitHub Actions is the only build path) - event-driven (no cron triggers, only push/PR/workflow_dispatch) MIRROR-EVERYTHING: image refs in chart values point at harbor.openova.io/proxy-ghcr/...; CI pushes to ghcr.io directly and Harbor proxy-pulls. No direct push to harbor. Co-authored-by: hatiyildiz <269457768+hatiyildiz@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-11 16:03:31 +04:00
e3mrah	3a5d9fc102	fix(infra,catalyst-api provisioner): tftpl CI guard + bucket-name suffix (Fix #101 followup, Fix #111 ) (#1331 ) Two infrastructure-hardening fixes that together eliminate ~30 min of provision-cycle waste per regression event documented in Fix #101. ## Fix A — CI guard against unescaped tftpl shell expansion Adds a grep-based step to .github/workflows/infra-hetzner-tofu.yaml that scans every infra/hetzner/*.tftpl for unescaped \${VAR:-default} inside YAML comment lines. Uses PCRE negative-lookbehind so correctly escaped \$\${VAR:-default} (templatefile() literal-dollar) does not trip the guard. Background: PR #1311 (Fix #73) added a YAML comment with bare \${QA_FIXTURES_ENABLED:-false}. tofu's templatefile() parses ALL \${...} sequences regardless of YAML/HCL/shell context; the colon in the interpolation hits HCL's reserved conditional grammar and crashes 'tofu plan' with "Template interpolation doesn't expect a colon at this location". Prov #9 (4204f0b0c5e37a80) wasted ~30 min before PR #1328 fixed the one offender. Without the guard, the next operator who adds a similar comment repeats the incident. Documented in infra/hetzner/README.md so editors learn the \$\$ escape pattern before they trip the CI gate. ## Fix B — bucket-name suffix to escape global Hetzner namespace Hetzner Object Storage bucket names share a GLOBAL namespace across every tenant. The previous BucketNameForSovereign(fqdn) derivation 'catalyst-<fqdn-with-dashes>' would collide on the second CreateDeployment for the same FQDN (re-provision after wipe, two operators on adjacent pools, race conditions) and the second 'tofu apply' would fail with BucketAlreadyExists. Change BucketNameForSovereign signature to (fqdn, deploymentID) and append the first 8 chars of the deployment-id as a suffix: catalyst-omantel-omani-works-b3b837a2 newID() already returns 16-hex random — the leading 8 chars are 32 bits of fresh entropy, enough to make collisions cryptographically negligible. 
Backward-compat: empty deploymentID (legacy on-disk records) falls back to first-8-hex of sha256(fqdn) so wipes of pre-Fix-111 Sovereigns remain deterministic. Call-sites updated: - handler/deployments.go: id := newID() moved before bucket-name derivation; uses hetzner.BucketNameForSovereign - handler/wipe.go: passes dep.ID to PurgeBuckets and to BucketNameForSovereign in the report - hetzner/buckets.go: PurgeBuckets signature now takes deploymentID; bucketSuffix() handles the fallback Tests: - hetzner/buckets_test.go: 6-case TestBucketNameForSovereign table covers canonical newID() shape, collision avoidance, uppercase normalisation, empty + non-hex fallback paths. New TestBucketNameForSovereign_CollisionAvoidance asserts the Fix #111 invariant directly. - handler/deployments_test.go: TestCreateDeployment_DerivesObjectStorageBucketFromFQDN now asserts the suffixed shape against the actual dep.ID. - All produced names re-validated against the S3 bucket-naming RFC (mirrored regex from provisioner.s3BucketNamePattern). ## Claimed TCs _None directly — infrastructure hardening; eliminates 30+ min wasted per cycle from regressions like PR #1311 + bucket-collision_ ## Verification - go test ./internal/hetzner/... -run "Bucket" → 9/9 PASS - go test ./internal/handler/ -run "DerivesObjectStorageBucket" → PASS - go vet ./... → clean - go build ./... → clean - yaml.safe_load on workflow → clean - pre-existing handler-package fails (whoami, continuum-switchover) are unrelated and present on origin/main Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-10 23:31:56 +04:00
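The bucket-name derivation described in Fix B can be sketched roughly as follows. This is an illustrative re-implementation, not the code in hetzner/buckets.go: the helper names, the exact hex check, and the choice to hash the lower-cased FQDN in the fallback path are assumptions made for the example.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// isLowerHex reports whether s is non-empty and contains only [0-9a-f].
func isLowerHex(s string) bool {
	if s == "" {
		return false
	}
	for _, r := range s {
		if !(r >= '0' && r <= '9') && !(r >= 'a' && r <= 'f') {
			return false
		}
	}
	return true
}

// bucketSuffix returns the first 8 hex chars of the deployment ID, or,
// when the ID is empty or not hex (assumed to mean a legacy record),
// the first 8 hex chars of sha256(fqdn) so the name stays deterministic.
func bucketSuffix(fqdn, deploymentID string) string {
	id := strings.ToLower(strings.TrimSpace(deploymentID))
	if len(id) >= 8 && isLowerHex(id[:8]) {
		return id[:8]
	}
	sum := sha256.Sum256([]byte(fqdn))
	return hex.EncodeToString(sum[:])[:8]
}

// bucketNameForSovereign builds catalyst-<fqdn-with-dashes>-<suffix>.
func bucketNameForSovereign(fqdn, deploymentID string) string {
	host := strings.ToLower(strings.TrimSpace(fqdn))
	return "catalyst-" + strings.ReplaceAll(host, ".", "-") + "-" + bucketSuffix(host, deploymentID)
}

func main() {
	// Fresh deployment: suffix comes from the deployment ID.
	fmt.Println(bucketNameForSovereign("omantel.omani.works", "b3b837a2c4d5e6f7"))
	// Legacy record with no ID: suffix comes from the FQDN hash.
	fmt.Println(bucketNameForSovereign("omantel.omani.works", ""))
}
```

Two deployments of the same FQDN get different names because their IDs differ, while a wipe of a legacy record can still recompute the name from the FQDN alone.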
e3mrah	f668d791ab	fix(bp-newapi): publish newapi-mirror image + repoint chart to existing tag (qa-loop bounded-cycle audit prov #7 Gap F) (#1315 ) Root cause from live diagnosis (omantel.biz prov #7, kubectl --context=omantel): The bp-newapi chart at platform/newapi/chart/values.yaml referenced `ghcr.io/openova-io/openova/newapi-mirror:v0.4.5` since its first commit (44d0200a, 2026-05-01). However: 1. NO CI workflow ever built that image. There is no `build-bp-newapi.yaml` (or similar) under .github/workflows/. The GHCR package `ghcr.io/openova-io/openova/newapi-mirror` does not exist (404 from /orgs/openova-io/packages/container/...). 2. The tag `v0.4.5` is fictitious — neither upstream Calcium-Ion/new-api (`docker.io/calciumion/new-api`) nor the alternate ancestor (`justsong/one-api`) ever published a `v0.4.5`. The lowest stable Calcium-Ion tag is `v0.6.0.9`; the highest stable v0.x is `v0.13.2` (upstream publish 2026-04-27). Result: every fresh Sovereign's NewAPI Pod ImagePullBackOff'd 403 Forbidden on the never-existed image, blocking alice signup gate 5 (LLM) and surfacing in the bounded-cycle audit as Gap F. Fix (mirrors bp-guacamole CI pattern in .github/workflows/build-bp-guacamole.yaml): - NEW .github/workflows/build-bp-newapi.yaml — push to platform/newapi/chart/* triggers a Job that pulls `docker.io/calciumion/new-api:<UPSTREAM_VER>`, captures the upstream repo digest, re-tags as `ghcr.io/openova-io/openova/newapi-mirror: <UPSTREAM_VER>` + `:latest`, pushes both, then bumps values.yaml + Chart.yaml + dispatches blueprint-release. - platform/newapi/chart/values.yaml — newapi.image.tag bumped from `v0.4.5` (fictitious) to `v0.13.2` (latest stable Calcium-Ion/new-api on Docker Hub). Comment block expanded with full rationale + link to the new build workflow + bump-in-lockstep instructions. - platform/newapi/chart/Chart.yaml — version 1.4.1 → 1.4.2, appVersion `0.4.5` → `0.13.2` (Helm convention: appVersion = upstream version without the `v` prefix). 
Inline changelog records the audit-prov-7 Gap F lineage. - clusters/_template/bootstrap-kit/80-newapi.yaml — pinned chart version 1.4.1 → 1.4.2 with the same changelog inline. Verified locally: - `helm template smoke platform/newapi/chart --set database.existingSecret=fake --set credentials.existingSecret=fake --set auth.adminUI.mode=masterKey` renders `image: "ghcr.io/openova-io/openova/newapi-mirror:v0.13.2"` and `app.kubernetes.io/version: "0.13.2"`. The v1.0.0-rc.x upstream line is gated on schema migration stabilisation; the channel-seed Job uses the legacy admin-API request shape, so do NOT auto-roll past v0.13.x without re-running the channel-seed integration smoke against NewAPI's `/api/channel/`. Pairs with the Gap C re-investigation memo (no chart fix needed; PR #1309 only gated `defaultCompositionRef`, not the XRD itself; the useraccesses.access.openova.io CRD is present on omantel prov #7). DO NOT MERGE — this PR is for qa-loop bounded-cycle Wave 5 Fix #80 (Gap F) review. Co-authored-by: hatiyildiz <269457768+hatiyildiz@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-10 21:20:49 +04:00
e3mrah	9780e8d72d	fix(chart): bp-catalyst-platform 1.4.116 — chart re-publish + dispatch (qa-loop iter-10 Fix #44 follow-up) (#1264) Chart 1.4.115 was published from the merge commit which still had the OLD application-controller image tag (`a3ba200`) in values.yaml — the auto-bump commit landed seconds later but GitHub Actions does NOT trigger workflows from bot pushes by default (anti-recursion safeguard), so blueprint-release was never re-run and the published chart shipped with the wrong image. Sovereigns installing chart 1.4.115 still ran the buggy application-controller without the targetNamespace fix. Fix: - Bump bp-catalyst-platform 1.4.115 → 1.4.116 (this commit is human-authored so blueprint-release fires via the path filter). - Bump clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml pin to 1.4.116. - Extend build-application-controller.yaml to dispatch blueprint-release.yaml after the bot bumps values.yaml, so the same race never blocks any future controller image roll-out. Per docs/INVIOLABLE-PRINCIPLES.md #1 (target-state) — operator must never have to manually re-trigger a chart publish after a controller image rebuild. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-10 06:17:13 +04:00
e3mrah	24aab61207	fix(application-controller): HelmRelease targetNamespace = App's namespace, not Org slug (qa-loop iter-10 Fix #44 ) (#1262 ) Root cause: the application-controller rendered the per-Application HelmRelease with `metadata.namespace = Org` and `spec.targetNamespace = Org` where Org is the parent Organization slug. On omantel the Application(qa-wp) lives in ns `qa-omantel` while the Org is named `omantel-platform` — so the workload Pod landed in the wrong namespace, breaking matrix rows TC-068 / TC-100 / TC-204 / TC-262 / TC-263 (all asserting Pod in qa-omantel). Symmetric Kustomization wrapper had the same bug. Existing render unit test only covered the org==namespace case (`acme/acme`) which masked the bug. Fix: - render.Inputs gains AppNamespace field. helmRelease + kustomization templates resolve `metadata.namespace` and `spec.targetNamespace` to AppNamespace (back-compat default = Org). - application_controller.go passes app.GetNamespace() as AppNamespace on every render.Render call. - HelmRelease spec.install.createNamespace = true so a missing workload namespace is provisioned by helm-controller (per docs/INVIOLABLE-PRINCIPLES.md #1 target-state — controller must work without an operator pre-creating the namespace). - Org slug is still stamped on the catalyst.openova.io/organization label for traceability. - 3 new Go tests: TestRender_NamespaceIsAppNamespace (omantel scenario via render pkg) TestRender_CreateNamespaceTrue TestReconcile_HelmReleaseTargetNamespaceIsAppNamespace (drives the omantel scenario end-to-end through the controller fake) - build-application-controller.yaml extended with auto-bump of controllers.application.image.tag in values.yaml on push-to-main, so the chart picks up the rebuilt image without a manual operator edit (per feedback_no_mvp_no_workarounds.md rule 1). - bp-catalyst-platform chart 1.4.114 → 1.4.115. 
Verification (post-roll on omantel): - delete omantel-platform/qa-wp Pod - annotate qa-omantel/qa-wp HR for reconcile - expect: Pod in qa-omantel ns + HR.spec.targetNamespace == qa-omantel Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-10 05:17:48 +04:00
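The namespace resolution described in this fix (AppNamespace wins, with the Org slug kept as the back-compat default) can be sketched as below. The `Inputs` type here is a hypothetical, trimmed-down stand-in for render.Inputs, and the `TargetNamespace` method name is invented for the example.

```go
package main

import "fmt"

// Inputs is a hypothetical stand-in for render.Inputs with only the
// two fields that matter for namespace resolution.
type Inputs struct {
	Org          string // parent Organization slug
	AppNamespace string // namespace the Application CR lives in
}

// TargetNamespace returns the namespace to stamp on metadata.namespace
// and spec.targetNamespace: AppNamespace when set, otherwise Org.
func (in Inputs) TargetNamespace() string {
	if in.AppNamespace != "" {
		return in.AppNamespace
	}
	return in.Org
}

func main() {
	// The omantel scenario: Org and Application namespace differ.
	fmt.Println(Inputs{Org: "omantel-platform", AppNamespace: "qa-omantel"}.TargetNamespace())
	// The older org==namespace case that the original test covered.
	fmt.Println(Inputs{Org: "acme"}.TargetNamespace())
}
```

The second call shows why the earlier `acme/acme` test could not catch the bug: when both values coincide, either resolution order gives the same answer.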
e3mrah	5ca0a7d178	fix(ci,charts,api): qa-loop iter-7 Fix #39 — bp-guacamole + bp-k8s-ws-proxy bootstrap-kit slots (#1236 ) * fix(ci,charts,api): qa-loop iter-7 Fix #39 — bp-guacamole + bp-k8s-ws-proxy bootstrap-kit slots Closes the scope-narrow confessed by Fix #36: bp-guacamole + bp-k8s-ws-proxy chart skeletons existed at platform/* but lacked CI image-build workflows + bootstrap-kit slots, so TC-228 / TC-230 / TC-236 / TC-237 / TC-245 / TC-246 stayed FAIL with "deployment NotFound". CI workflows ------------ - .github/workflows/build-k8s-ws-proxy.yaml: Buildx + cosign keyless sign + SBOM attestation flow on core/cmd/k8s-ws-proxy/*, then bumps platform/k8s-ws-proxy/chart/values.yaml image.tag + Chart.yaml patch version + dispatches blueprint-release. - .github/workflows/build-bp-guacamole.yaml: mirrors upstream Apache Guacamole 1.5.5 to GHCR (so every Sovereign pulls from a registry we own — no Docker Hub rate limits, no upstream availability risk), bumps values.yaml.image.{repository,tag} + Chart.yaml + dispatches blueprint-release. Charts (target-state) --------------------- - bp-k8s-ws-proxy v0.1.1: canonical workload name `k8s-ws-proxy` regardless of release name (DaemonSet + Service + ClusterRole + ClusterRoleBinding + ServiceAccount all named `k8s-ws-proxy` so matrix can address them by canonical short name). - bp-guacamole v0.1.1: canonical short resource names (`guacd`, `guacamole-server`, `guacamole-recordings`); GHCR-mirrored upstream images; realm-patch ConfigMap correctly lands in `keycloak` namespace (was: realm-name, which would have failed silently on every Sovereign); `realmConfig.namespace` override surface added. - Both charts: `catalyst.openova.io/smoke-render-mode: default-off` annotation so blueprint-release smoke-render gate honors the default-OFF render shape. 
Bootstrap-kit slots ------------------- - clusters/_template/bootstrap-kit/36-bp-k8s-ws-proxy.yaml + 37-bp-guacamole.yaml: dependsOn-ordered (proxy → gateway), pinned to 0.1.1, default-OFF gate flipped via slot values, install/upgrade disableWait per session-2026-04-30 architectural decision. - clusters/omantel.omani.works/bootstrap-kit/ slots mirror the same shape with omantel.biz hostnames matching the live HTTPRoutes on console.omantel.biz / auth.omantel.biz. API: shells/issue handler (matrix-canonical URL surface) -------------------------------------------------------- - POST /api/v1/sovereigns/{id}/shells/issue?namespace=&pod=&container= alias for the existing POST /api/v1/sovereigns/{id}/k8s/exec/{ns}/{pod}/{container}/session with matrix-canonical response fields (`sessionId`, `guacamoleUrl`, `recordingPath`). Same business logic, same audit surface (`guacamole-session-opened`), same RBAC gate (tier-developer or higher). 6 test cases, all PASS under -race. TCs that flip PASS in iter-8 ----------------------------- - TC-228: POST /shells/issue → sessionId + guacamoleUrl + recordingPath - TC-230: kubectl get deploy guacd guacamole-server -n catalyst-system - TC-236: kubectl get ds k8s-ws-proxy -n catalyst-system - TC-237: kubectl logs ds/k8s-ws-proxy → "listening" - TC-245: viewer-cookie POST /shells/issue → 403 - TC-246: operator-cookie POST /shells/issue → 200 sessionId Per feedback_no_mvp_no_workarounds.md: NO follow-up slices — every gap Fix #36 confessed is closed in this PR. Per feedback_machine_saturation_3rd_violation.md: CI-only build path, no local docker. 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(bootstrap-kit): move bp-k8s-ws-proxy + bp-guacamole to slots 51/52 (Fix #39 follow-up) CI dependency-graph-audit caught a slot-number collision: slots 36-48 are reserved for the W2.K4 AI-runtime cohort (bp-stunner, bp-knative, bp-kserve, bp-vllm, bp-llm-gateway, bp-anthropic-adapter, bp-bge, bp-nemo-guardrails, bp-temporal, bp-openmeter, bp-livekit, bp-matrix, bp-librechat) per scripts/expected-bootstrap-deps.yaml. Move the exec-fan-out blueprints to slots 51/52 (post-W2.K4, pre-Phase-2 80+ slot range) and add their entries to the expected DAG. - clusters/_template/bootstrap-kit/{36,37}-* → {51,52}-* - clusters/omantel.omani.works/bootstrap-kit/{36,37}-* → {51,52}-* - kustomization.yaml updates (both _template + omantel) - scripts/expected-bootstrap-deps.yaml: declare slots 51/52 with full dependsOn lists (bp-k8s-ws-proxy on cilium+sealed-secrets, bp-guacamole on cilium+cert-manager+keycloak+sealed-secrets+seaweedfs+k8s-ws-proxy) scripts/check-bootstrap-deps.sh re-run: 0 drift, 0 cycles, 55 declared HRs, 42 present on disk, 13 deferred (W2.K1-K4). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-10 01:48:25 +04:00
e3mrah	b24475e2c2	fix(api+chart): clusterroles GVR + CATALYST_BUILD_SHA env injection (qa-loop iter-3) (#1206 ) Two coupled fixes for QA-loop iter-3 cluster `clusterroles-gvr-and-sha-injection`: Sub-A — clusterroles GVR (TC-122/196/199/248): - Add rbac.authorization.k8s.io/v1 ClusterRole + ClusterRoleBinding to k8scache.DefaultKinds. Both cluster-scoped. - Add matching get/list/watch verbs on catalyst-api-cutover-driver ClusterRole. Per feedback_chroot_in_cluster_fallback.md every new GVR added to DefaultKinds MUST get a matching rule on the cutover-driver SA (chroot SovereignClient uses it via in-cluster fallback). - Pin both kinds in TestDefaultKinds_GraphAndDashboardSurface so a regression that drops them from the registry fails the unit test. Sub-B — CATALYST_BUILD_SHA env injection (TC-261): - api-deployment.yaml: inject CATALYST_BUILD_SHA + CATALYST_CHART_VERSION env vars with LITERAL values (not Helm directives) per the dual-mode contract — Kustomize on contabo can't render `{{ .Values... }}` in `value:` fields. - .github/workflows/catalyst-build.yaml: extend the "bump literal image refs" sed pass to also bump the CATALYST_BUILD_SHA env literal so /api/v1/version returns the SHA the Pod is actually running (no drift between image tag and reported SHA). - The handler (version.go) already reads CATALYST_BUILD_SHA via envOrTrim with `dev`/`0.0.0` ldflag fallbacks — no Go change needed; the version_test.go env-override test already covers it. Chart bumped 1.4.94 -> 1.4.95. Co-authored-by: hatiyildiz <hati.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-09 17:56:21 +04:00
e3mrah	c1b92404ee	fix(chart): enable 5 Group C controllers + KC realm-role bootstrap (qa-loop iter-1) (#1194 ) EPIC-3 RBAC reconciliation loop was dormant on every Sovereign because the 5 Group C controllers (organization, environment, blueprint, application, useraccess) shipped with `enabled: false` and the KEYCLOAK_BOOTSTRAP_TIER_ROLES env var was hardcoded to "false". Result: UserAccess CRs created by /api/v1/sovereigns/{id}/rbac/assign never materialised into RoleBindings + composite realm-roles. Cluster: controllers-and-kc-bootstrap-gates (qa-loop iter-1). Changes: - values.yaml: organization/environment/application/useraccess controllers flipped to `enabled: true` and `image.tag` SHA-pinned to the latest GHCR-published push-on-main builds (organization/environment/application :1b29c71, useraccess :ff2172f) per Inviolable Principle #4a. - values.yaml: blueprint stays `enabled: false` until first push-on-main build of build-blueprint-controller.yaml lands an image in GHCR (never reference an image not built by CI). - values.yaml: new top-level `keycloak.bootstrap.ensureTierRoles: true`. - api-deployment.yaml: KEYCLOAK_BOOTSTRAP_TIER_ROLES now sources its default from `.Values.keycloak.bootstrap.ensureTierRoles` (per slice T2 brief #1098/#1146) instead of hardcoded "false". - .github/workflows/build-blueprint-controller.yaml: new workflow scaffolded (mirror of build-application-controller shape) so the first commit touching core/controllers/blueprint/** ships a CI-built, SHA-pinned, cosign-signed image to GHCR. - Chart.yaml: bumped 1.4.89 → 1.4.90. Verified via `helm template`: - 4 controller Deployments + 4 controller ClusterRoles render (blueprint pending image build). - KEYCLOAK_BOOTSTRAP_TIER_ROLES renders as "true" by default. - 5 tier ClusterRoles `openova:tier-{viewer,developer,operator,admin,owner}` render from platform/crossplane-claims/chart/. 
Co-authored-by: hatiyildiz <hati.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-09 15:41:58 +04:00
e3mrah	7ca4abddd2	feat(continuum): K-Cont-4 — Cloudflare Worker source + tofu wiring for lease witness (#1101) (#1159) * feat(continuum): K-Cont-4 — Cloudflare Worker source + tofu wiring for lease witness (#1101) Implements the server side of the Cloudflare KV lease-witness pattern that K-Cont-3's CFKVClient (in core/controllers/continuum/internal/witness/cloudflarekv/) speaks to. The Worker fronts a Cloudflare Workers KV namespace with read-then-CAS-write semantics enforced via the If-Match header — exact contract per K-Cont-3 #1158 report (item d) and the canonical-seams "Cloudflare KV Worker contract" entry. Routes: GET /lease/<slot-url-encoded> → 200 + LeaseState | 404 | 401 PUT /lease/<slot> → 200 + LeaseState | 412 + state | 401 DELETE /lease/<slot> → 204 | 412 | 401 All 7 K-Cont-3 trap behaviors verified by 46 vitest tests: 1. If-Match: 0 = first-acquire-on-empty-slot 2. Generation increments unconditionally (incl. Release) 3. 412 includes current state body 4. TTL eviction is server-authoritative in stamping (Worker doesn't auto-evict — controller's IsHeldBy decides) 5. X-Holder mismatch on DELETE returns 412 (stale region can't evict new primary) 6. Bearer token validation against env-bound allow-list 7. Optional X-Lease-Slot header logged for KV granularity Files: products/continuum/cloudflare-worker/{package.json, tsconfig.json, wrangler.toml, vitest.config.ts, .eslintrc.cjs, .gitignore, DESIGN.md, src/{index,auth,kv,types}.ts, src/handlers/{get,put,delete}.ts, test/{handlers,contract,env.d}.ts} infra/cloudflare-worker-leases/{versions,variables,main,outputs}.tf + README.md .github/workflows/cloudflare-worker-leases-build.yaml (event-driven, NO cron — push-on-paths + PR + workflow_dispatch) Tests: 46/46 vitest pass (handlers 37 + contract 9). ESLint clean. tsc --noEmit clean. wrangler deploy --dry-run produces 9.47 KiB bundle. Per the brief: tofu module ships ready for operator action — no auto-deploy. 
Operator runbook in DESIGN.md §"Operator runbook — deploy a new Sovereign". Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(continuum/cf-worker-tofu): K-Cont-4 — adopt CF v5 inline secret_text binding (was v4 separate resource) `tofu validate` failed on `cloudflare_workers_secret` — that resource was REMOVED in cloudflare/cloudflare v5 (it consolidated into the inline `bindings = [...]` array on `cloudflare_workers_script` with `type = "secret_text"`). Same security guarantee — encrypted at rest in CF, never visible via dashboard read API once written. `tofu fmt` also wanted versions.tf alignment + the .terraform.lock.hcl pinning the resolved cloudflare/cloudflare v5.19.1 (mirrors infra/hetzner/ which commits its lock file). Per Inviolable Principle #5 the bearer token value still flows from TF_VAR_bearer_tokens_csv extracted at apply time from a K8s SealedSecret — never inlined here. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: hatiyildiz <hati.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-09 08:01:44 +04:00
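The read-then-CAS-write contract listed in the K-Cont-4 commit can be modelled in a few lines. The sketch below is an in-memory Go model, not the Worker's TypeScript source: the `Store` type, its method names, and the reading of "generation increments on release" as keeping an unheld record with a bumped generation are all assumptions for illustration. Bearer-token checks and TTL stamping are left out.

```go
package main

import (
	"fmt"
	"net/http"
)

// LeaseState is a hypothetical, minimal view of the stored lease.
type LeaseState struct {
	Holder     string
	Generation int64
}

// Store models one KV namespace keyed by slot name.
type Store struct {
	slots map[string]LeaseState
}

func NewStore() *Store { return &Store{slots: make(map[string]LeaseState)} }

// Put applies a compare-and-swap: ifMatch must equal the current
// generation (0 for an empty slot). On mismatch it returns 412 plus
// the current state so the caller can see who holds the lease.
func (s *Store) Put(slot, holder string, ifMatch int64) (int, LeaseState) {
	cur := s.slots[slot] // zero value means empty slot, generation 0
	if ifMatch != cur.Generation {
		return http.StatusPreconditionFailed, cur
	}
	next := LeaseState{Holder: holder, Generation: cur.Generation + 1}
	s.slots[slot] = next
	return http.StatusOK, next
}

// Delete releases the lease only when the caller is the current holder;
// a stale region gets 412 and cannot evict the new primary.
func (s *Store) Delete(slot, holder string) (int, LeaseState) {
	cur := s.slots[slot]
	if cur.Holder != holder {
		return http.StatusPreconditionFailed, cur
	}
	released := LeaseState{Generation: cur.Generation + 1}
	s.slots[slot] = released
	return http.StatusNoContent, released
}

func main() {
	s := NewStore()
	fmt.Println(s.Put("db", "region-a", 0)) // first acquire on empty slot
	fmt.Println(s.Put("db", "region-b", 0)) // stale If-Match is rejected
	fmt.Println(s.Delete("db", "region-b")) // wrong holder is rejected
	fmt.Println(s.Delete("db", "region-a")) // holder releases
}
```

Returning the current state with every 412 lets a losing client retry with the right generation without an extra GET.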
e3mrah	746901b671	feat(cnpg-pair): C-DB-1 — bp-cnpg-pair Blueprint (active-hotstandby CNPG cluster-pair across regions) (#1101 ) (#1153 ) EPIC-6 Slice C-DB-1+C-DB-2. Active-hotstandby CNPG cluster-pair as a companion to bp-cnpg: primary CNPG Cluster CR in region A, replica Cluster CR in region B configured as a CNPG replica cluster (replica.enabled=true + externalCluster), WAL streaming over a Cilium ClusterMesh-shared Service. Per ADR-0001 §9 ClusterMesh is the only canonical inter-region transport — never public TLS. What ships: platform/cnpg-pair/ ├── chart/ │ ├── Chart.yaml # bp-cnpg-pair 0.1.0; no-upstream + smoke-render-mode=default-off │ ├── values.yaml # default-OFF gate; placement schema constrains active-hotstandby ONLY │ ├── templates/ │ │ ├── _helpers.tpl # fail-fast on empty image.tag; region pair validation │ │ ├── primary-cluster.yaml # CNPG Cluster CR (region-pinned via openova.io/region affinity) │ │ ├── replica-cluster.yaml # CNPG Cluster CR (replica.enabled=true; externalClusters[]) │ │ ├── service-replication.yaml # Cilium ClusterMesh global Service │ │ ├── failover-readiness.yaml # probe Pod flips Ready when WAL lag < threshold │ │ ├── networkpolicy.yaml # default-deny carve-outs for replication + probe │ │ └── audit-config.yaml # NATS audit subjects + types this Blueprint emits │ ├── blueprint.yaml # configSchema + placementSchema (active-hotstandby ONLY) │ ├── README.md # 80-line deployment + failover semantics │ └── tests/cnpg-pair-render.sh # 5-case render gate └── DESIGN.md # topology, lag-threshold rationale, deferred C-DB-3 plan Default-OFF gate per the brief: helm template with default values renders ZERO resources; helm template with cnpgPair.enabled=true + both regions + image.tag renders 8 resources (2 Cluster CRs, 1 Service, 1 Deployment, 3 NetworkPolicies, 1 audit-config ConfigMap). Empty image.tag fails fast at template-render per Inviolable Principle #4a; same primary/replica region fails fast (degenerate pair). 
All 5 render gates pass locally; helm lint + YAML parse clean. CI smoke-render gate fix (single-line behavior change in blueprint-release.yaml): adds a `catalyst.openova.io/smoke-render-mode: default-off` annotation opt-in so charts that legitimately render zero at default values (this chart + future bp-*-pair Blueprints) skip the `<5 lines` empty-render check. The chart's own tests/cnpg-pair-render.sh covers the enabled-render path; without the annotation the empty-render check still fires unchanged. Seam-map additions (return diff for 01-canonical-seams.md Platform table): - service.cilium.io/global=true ClusterMesh global Service annotation (first chart in the repo to use it; pattern reused by Continuum K-Cont-2 for HTTPRoute weight=0 cross-region drains) - bp-*-pair active-hotstandby cluster-pair pattern (primary+replica Cluster CRs colocated in one Blueprint, region-pinned via openova.io/region node-affinity) - audit-config ConfigMap co-located with the emitting Blueprint (label-selector discovery for K-Cont-2 + U-DR-1; future bp-*-pair Blueprints follow this convention) - smoke-render-mode=default-off Chart.yaml annotation opt-in for the blueprint-release smoke gate C-DB-2 (publish): existing blueprint-release.yaml workflow auto-detects `platform/*/chart/**` paths — no allowlist edit required. First push triggers `ghcr.io/openova-io/bp-cnpg-pair:0.1.0` build. C-DB-3 (1M-row acceptance test) DEFERRED — full plan documented in DESIGN.md "Deferred — C-DB-3 acceptance test plan" section so the future implementer's brief is self-contained. Tests: - bash platform/cnpg-pair/chart/tests/cnpg-pair-render.sh ✓ 5/5 PASS - helm lint platform/cnpg-pair/chart ✓ clean - helm template ... 
| python3 yaml.safe_load_all ✓ 8 docs parse clean - smoke-gate logic simulated locally ✓ default-off annotation honored Pre-existing CI failures untouched: - TestPinIssue rate-limit flake — not affected by chart-only slice - TestBootstrapKit/gitea version drift — only iterates over a fixed 10-chart bootstrap list (no cnpg-pair entry) Out of scope per brief (all deferred to dedicated slices): - K-Cont-2 reconciler logic - K-Cont-3 lease witness - K-Cont-4 Cloudflare Worker - C-DB-3 1M-row acceptance test - Application controller changes - U-DR-1 UI Co-authored-by: hatiyildiz <hati.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-09 05:16:55 +04:00
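The smoke-render gate change in this commit (charts annotated default-off skip the empty-render check, everything else is still held to the five-line minimum) can be expressed as a small predicate. The real gate lives in the blueprint-release.yaml workflow; the Go function below is only a sketch, and both its name and the way it counts lines are assumptions.

```go
package main

import (
	"fmt"
	"strings"
)

// Annotation key and value quoted in the commit message.
const smokeModeAnnotation = "catalyst.openova.io/smoke-render-mode"

// smokeRenderPasses reports whether a chart clears the empty-render
// check. Charts that opt in with default-off always pass; all others
// must render at least 5 lines at default values.
func smokeRenderPasses(annotations map[string]string, rendered string) bool {
	if annotations[smokeModeAnnotation] == "default-off" {
		return true
	}
	trimmed := strings.TrimRight(rendered, "\n")
	if trimmed == "" {
		return false
	}
	lines := strings.Count(trimmed, "\n") + 1
	return lines >= 5
}

func main() {
	optIn := map[string]string{smokeModeAnnotation: "default-off"}
	fmt.Println(smokeRenderPasses(optIn, "")) // opt-in chart, empty render
	fmt.Println(smokeRenderPasses(nil, ""))   // no annotation, empty render
}
```

Keeping the opt-in on the chart itself means the workflow needs no per-chart allowlist.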
e3mrah	ddbe44918f	feat(continuum): K-Cont-1 — Continuum product skeleton (chart + binary + GHA workflow, no reconcile yet) (#1101 ) (#1151 ) Slice K-Cont-1 of EPIC-6 (#1101) ships the Continuum product skeleton: - core/controllers/continuum/{cmd,internal/{controller,events}} - cmd/main.go — controller-runtime Manager bootstrap; leader election; /healthz, /readyz, /metrics endpoints; env-only config per INVIOLABLE-PRINCIPLES #4 - internal/controller — ContinuumReconciler with no-op Reconcile() (K-Cont-2 fills the body); SetupWithManager() watches Continuum CRs via unstructured.Unstructured per ADR-0001 §2.7 (no controller-gen) - internal/events — placeholder package documenting K-Cont-2's NATS audit-event-type list - Containerfile — multi-stage Go build → alpine:3.20 runtime, UID 65534 - products/continuum/chart/ — full Helm chart shape (default-OFF): - Chart.yaml + values.yaml (continuum.enabled: false; image.tag empty; fail-fast on empty tag at render time) - templates/{_helpers.tpl, deployment, service, serviceaccount, rbac, networkpolicy}.yaml - blueprint.yaml — OpenOva Blueprint manifest with configSchema + placementSchema (single-region: management cluster) + depends: bp-cnpg-pair + bp-powerdns - crds/README.md — pointer to the canonical Continuum CRD shipped in products/catalyst/chart/crds/continuum.yaml (B8 #1110); not duplicated - products/continuum/DESIGN.md — chart-vs-binary split decision (Option A: binary in shared core/controllers/ module per CC1 #1135), K-Cont-2 fill list, K-Cont-3 lease witness API contract sketch - .github/workflows/build-continuum-controller.yaml — event-driven CI (NO cron) with go vet + go test -race + helm template ON/OFF resource count gates + fail-fast verification + GHCR build & push (cosign keyless signed) + repository_dispatch for chart-bump fan-out helm template verification: - continuum.enabled=false → 0 resources (default OFF) - continuum.enabled=true + image.tag=ci-test → 6 resources (ServiceAccount, ClusterRole, 
ClusterRoleBinding, Deployment, Service, NetworkPolicy) - continuum.enabled=true + empty image.tag → render fails per #4a go vet ./continuum/... → clean. go test -count=1 -race → all green. Out of scope (per the K-Cont-1 brief): - Reconcile body — K-Cont-2 - Lease witness implementations — K-Cont-3 - Cloudflare Worker source — K-Cont-4 - bp-cnpg-pair Blueprint — C-DB-1 Co-authored-by: hatiyildiz <hati.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-09 04:45:00 +04:00
e3mrah	b0ed216e81	feat(catalog): catalog-svc HTTP REST service + chart wiring (slice L1+L2, #1097 ) (#1148 ) EPIC-2 Slice L of #1097. Multi-source Blueprint catalog HTTP REST service backed by Gitea (3 sources: public mirror, sovereign-curated, per-Org private). Replaces the per-Org SME catalog per ADR-0001 §4.3 (different scope: SME's was Org-bound; catalyst-catalog is Sovereign- wide multi-source). L1 — core/services/catalyst-catalog/ Go service: - Separate go.mod (services group is for HTTP services, controllers group is for CRD reconcilers — documented in DESIGN.md). - Imports the unified Gitea client via Go module replace directive. - Promoted core/controllers/internal/gitea → pkg/gitea so the catalog (a sibling Go module) can import it (Go internal/ rule). 5 Group C controllers updated atomically. - HTTP REST endpoints: /api/v1/catalog{,/{name},/{name}/versions, /{name}/versions/{version}} + /healthz. - Source resolution priority on collision: private > sovereign > public. - Per-Org access filter: caller's Claims.Groups[] determines visible private blueprints; Org A user does NOT see Org B's private set. - 30s TTL LRU cache on blueprint.yaml reads (capacity 1024 default). - Session-cookie / Bearer / ?access_token= claim extraction matching catalyst-api's seam; expired-token rejection in-process. - Containerfile: distroless-static, non-root UID 65532. L2 — products/catalyst/chart/templates/services/catalog/ wiring: - 5 templates (deployment, service, serviceaccount, rbac, httproute) + _helpers.tpl. Default-OFF gate via .Values.services.catalog.enabled. - helm template: 0 catalog resources when OFF, 6 when ON. - Empty image.tag fail-fasts at render per Inviolable Principle #4a. - HTTPRoute exposes /api/v1/catalog on api.<sovereign> hostname. - Chart bumped 1.4.85 → 1.4.86. Gitea client extension (canonical seam, NOT per-service variant): - +ListOrgRepos(ctx, org) []Repo — paginated repo listing. 
- +ListContents(ctx, org, repo, branch, path) []ContentEntry — directory listing for per-Org shared-blueprints fan-out. GitHub Actions workflow: - .github/workflows/catalyst-catalog-build.yaml — push-on-paths + pull_request + workflow_dispatch (NO cron). go vet + go test (race + count=1) + image build → GHCR :<sha>. repository_dispatch fan-out to chart-bump matches the Group C controllers' pattern. Tests (3-tier gate): unit (config, cache, auth, source, handler) + integration (httptest-backed Gitea fixtures across all 3 sources + priority + per-Org access). All green; race detector on. L3 (SME catalog retirement) is deferred per the EPIC-2 master brief. GraphQL deferred (REST first; gqlgen would pull ~80MB of indirect deps for a feature no UI consumer has asked for yet). Co-authored-by: hatiyildiz <hati.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-09 04:04:52 +04:00
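Note on b0ed216e81: the commit states that name collisions across the three catalog sources resolve as private > sovereign > public. A minimal sketch of that rule, assuming hypothetical `Source` and `Blueprint` types (neither name is taken from the repository, and the real service may structure this differently):

```go
package main

import "fmt"

// Source identifies where a blueprint came from (hypothetical type).
// Higher values win on a name collision.
type Source int

const (
	Public    Source = iota // public mirror, lowest priority
	Sovereign               // sovereign-curated
	Private                 // per-Org private, highest priority
)

// Blueprint is a hypothetical catalog entry.
type Blueprint struct {
	Name   string
	Source Source
}

// Resolve keeps one entry per name, preferring the higher-priority source.
func Resolve(entries []Blueprint) map[string]Blueprint {
	out := make(map[string]Blueprint, len(entries))
	for _, e := range entries {
		cur, seen := out[e.Name]
		if !seen || e.Source > cur.Source {
			out[e.Name] = e
		}
	}
	return out
}

func main() {
	got := Resolve([]Blueprint{
		{"postgres", Public},
		{"postgres", Private},
		{"redis", Sovereign},
		{"redis", Public},
	})
	fmt.Println(got["postgres"].Source == Private, got["redis"].Source == Sovereign)
}
```

The per-Org access filter described in the commit would run before this step, so an Org A caller never has Org B's private entries in `entries` to begin with.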
e3mrah	66fd0bbae3	refactor(controllers): promote duplicated internal/ packages to shared core/controllers/internal/ (CC1, #1095 ) (#1135 ) Slice CC1 of EPIC-0 (#1095) — Coordinator-led consolidation. The 5 Group C controllers (slices C1-C5: organization, environment, blueprint, application, useraccess) all merged with their own per-controller go.mod + per-controller internal/ tree. This PR canonicalizes the shared layout per `02-implementer-canon.md` §1+§2: * One go.mod at core/controllers/go.mod (Path A — single shared module) * Shared helpers under core/controllers/internal/: - semver/ (was: blueprint/internal/semver + application/internal/semver, now exposes blueprint's IsValidRange + app's IsExact, with the union of both test corpora) - placement/ (was: application/internal/placement; promoted per seam map) - render/ (was: application/internal/render; promoted per seam map) - labels/ (was: useraccess/internal/labels; promoted per seam map — Manara-style scope matcher, owner-of-record C5) Module-discipline decision (Path A vs Path B): Path A. The 5 controllers' go.mod files use the same controller-runtime v0.19.0, k8s.io/* @ 0.31.x, sigs.k8s.io/yaml v1.4.0, etc. The only drift was organization-controller on k8s.io/api 0.31.0 vs the others on 0.31.1 — a trivial bump. Independent dep-version pinning would only be valuable if a controller needed a hostile dep the others shouldn't pull; nothing in the current tree is hostile. Containerfiles + workflows updated: * 5 Containerfiles now COPY core/controllers/{go.mod,go.sum,internal/} plus the per-controller tree from a repo-root build context. * 4 per-controller workflows (application/environment/organization/ useraccess; blueprint-controller has no dedicated workflow yet) now trigger on core/controllers/{<name>/, internal/, go.mod, go.sum} and run go vet + go test scoped to their own tree + shared internal. * useraccess workflow context flipped from core/controllers/useraccess to . 
(repo root) so the Containerfile can reach the shared go.mod. Subpackages NOT promoted in this PR (compromise — flagged for follow-up): * gitea/ — 4 of 5 controllers each ship a Gitea HTTP client. The APIs DIVERGE (organization has Org+Repo CRUD with Repo struct return values; application/blueprint/environment have File CRUD with Org-not-found sentinel). A SUPERSET package would require renaming methods (e.g. EnsureRepo collides on signature) which crosses the brief's "no API redesign" line. CC2 follow-up slice should design the unified surface before promoting. * validate/ — application's package validates Application.spec.parameters against a JSON Schema (santhosh-tekuri lib); blueprint's validates Blueprint CR business rules (semver-backed). Same dir name, completely different functions — not actually duplicates. * gitops/ — environment's renders Flux GitRepository for an Environment; organization's renders HelmRelease+Namespace for an Org. Same dir name, different inputs and outputs. Test-coverage delta: pre-consolidation 134 root-level tests (sum across 5 modules); post-consolidation 133 tests. Net delta -1: blueprint and application each had their own TestIsValidRange in their semver pkg; the shared semver pkg's TestIsValidRange now exercises the union of both controllers' valid+invalid input corpora — coverage strictly improved even though one redundant test name disappeared. Verified locally: go build + go vet + `go test -count=1 -race ./...` all clean; all 5 controller binaries (cmd/) link successfully. Co-authored-by: hatiyildiz <hatiyildiz@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-09 00:54:42 +04:00
e3mrah	dbf585744c	feat(controllers): land application-controller (slice C4, #1095 ) (#1133 ) Watches Application.apps.openova.io/v1 CRs and reconciles each Application to per-region kustomization + helmrelease manifests in the per-Org Gitea repo (gitea.<location-code>.<sovereign-domain>/<org>/<app>). Reconcile flow per slice C4 brief: 1. Resolve parents: spec.environmentRef → Environment CR, then Environment.spec.organizationRef → Organization CR. Pending-on-miss. 2. Fetch Blueprint at spec.blueprintRef.{name,version} (v1 with v1alpha1 fallback). Pending-on-miss. 3. Validate spec.parameters against Blueprint.spec.configSchema via github.com/santhosh-tekuri/jsonschema/v5. On invalid → status.phase= Failed + Condition reason=Invalid listing every failing JSON pointer. 4. Validate placement against Blueprint.spec.placementSchema.modes. 5. Resolve placement → per-region work plan: - single-region: regions[0] only, role=primary - active-active: every region rendered identically (sorted for byte-stability), role=active, no primaryRegion - active-hotstandby: regions[0] primary, regions[1..] standby (replicas: 0 + _openova_standby: true overlay; Continuum #1101 flips on switchover) 6. Render kustomization.yaml + helmrelease.yaml per region under clusters/<region>/applications/<app>/{...}.yaml on the env-type- mapped branch (develop\|staging\|main per NAMING §11.2). 7. Idempotent commit via gitea.PutFile's byte-equality short-circuit — re-reconcile on steady state = 0 Gitea writes (slice C4 brief test #7). 8. Status update: phase / primaryRegion / regions[] / giteaRepo / installedBlueprint{name,version,digest} / conditions[]. 9. Finalizer + cascade delete: on metadata.deletionTimestamp, removes every manifest the controller wrote and releases the finalizer. Architecture compliance per docs/INVIOLABLE-PRINCIPLES.md: - Flux is the only reconciler. Controller writes to Gitea; Flux applies. NO direct K8s create of HelmRelease/Kustomization/Service. 
- Dynamic client + unstructured.Unstructured (no controller-gen, no zz_generated_deepcopy.go). - Every value is environment-configurable (GITEA_API_URL, GITEA_TOKEN, GITEA_PUBLIC_URL, SOURCE_NAMESPACE, HELMRELEASE_INTERVAL, CATALOG_SOURCE_REF, REQUEUE_AFTER_SECONDS, METRICS_ADDR, HEALTH_ADDR, LEADER_ELECT, LEADER_ELECT_NS, LOG_LEVEL). - SHA-pinned images via the focused build-application-controller.yaml workflow (push-on-paths + PR + workflow_dispatch — no cron). Tests cover the full 9-test matrix from the brief plus 3 bonus paths: T1 Pending on missing Environment (no Gitea writes). T2 Pending on missing Blueprint (no Gitea writes). T3 Invalid on parameters schema mismatch — Condition message names the failing path 'replicas'; no Gitea writes. T4 single-region happy path → expected manifests written under clusters/<region>/applications/<app>/ on branch=main, finalizer added, status.phase=Provisioning, status.primaryRegion populated, status.giteaRepo populated. T5 active-active fan-out → 2 regions, 2 manifest sets byte-equal after region-name canonicalisation. status.primaryRegion empty. T6 active-hotstandby → primary renders replicas:3 (user param); standby renders replicas:0 + _openova_standby:true marker. T7 Idempotency → re-reconcile after success = 0 Gitea writes (PutFile byte-equality short-circuit). T8 Deletion cascade → manifests removed from Gitea, finalizer released after delete pass. T9 Drift detection → Gitea-side manifest hand-edited; controller restores byte-identical original on next pass. + Pending on Gitea Org missing (org doesn't exist in Gitea even though Organization CR exists — slice C1 hasn't run yet). + Invalid placement-vs-blueprint-allowed-modes (placement-active-active rejected on a Blueprint declaring only single-region). Module path: github.com/openova-io/openova/core/controllers/application (per-controller go.mod, matching siblings C1/C2/C3/C5; CC1 promotes shared internals to core/controllers/internal/ in a follow-up slice). 
`go vet ./...` clean. `go test -count=1 -race ./...` all green. Co-authored-by: hatiyildiz <hatiyildiz@noreply.openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-09 00:34:22 +04:00
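Note on dbf585744c: step 5 of the reconcile flow maps a placement mode to a per-region work plan. The sketch below restates those three rules in Go. The `RegionPlan` type and `Plan` function are hypothetical names for illustration, not the controller's actual API:

```go
package main

import (
	"fmt"
	"sort"
)

// RegionPlan is a hypothetical per-region work item.
type RegionPlan struct {
	Region string
	Role   string // "primary", "active", or "standby"
}

// Plan maps a placement mode and region list to per-region roles,
// following the three modes listed in the commit message.
func Plan(mode string, regions []string) ([]RegionPlan, error) {
	if len(regions) == 0 {
		return nil, fmt.Errorf("placement %q needs at least one region", mode)
	}
	switch mode {
	case "single-region":
		// Only regions[0] is rendered.
		return []RegionPlan{{regions[0], "primary"}}, nil
	case "active-active":
		// Sort a copy so the output is stable regardless of input order.
		sorted := append([]string(nil), regions...)
		sort.Strings(sorted)
		plans := make([]RegionPlan, 0, len(sorted))
		for _, r := range sorted {
			plans = append(plans, RegionPlan{r, "active"})
		}
		return plans, nil
	case "active-hotstandby":
		// regions[0] is primary; every later region is a standby.
		plans := []RegionPlan{{regions[0], "primary"}}
		for _, r := range regions[1:] {
			plans = append(plans, RegionPlan{r, "standby"})
		}
		return plans, nil
	default:
		return nil, fmt.Errorf("unknown placement mode %q", mode)
	}
}

func main() {
	plans, err := Plan("active-hotstandby", []string{"fsn1", "hel1"})
	fmt.Println(plans, err)
}
```

In the commit's description a standby region additionally renders with `replicas: 0` and the `_openova_standby: true` overlay; that rendering step is left out of this sketch.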
e3mrah	8988cd9e4f	feat(infra-hetzner): wire all var.regions[] entries end-to-end (slice G1, #1095 ) (#1131 ) Slice G1 of EPIC-0 (#1095, Group G "Multi-cluster substrate"). Today infra/hetzner/main.tf only realises regions[0] end-to-end — every wizard payload's regions[1..N] entries silently no-op. EPIC-6 (#1101) Continuum DR demo needs 3 regions (mgmt + fsn + hel per docs/EPICS-1-6-unified-design.md §3.8 + §11), so this slice closes the gap. Architecture: hybrid singular-path + secondary-region overlay. - The legacy singular path (var.region + count = local.control_plane_count) STAYS untouched — every existing Sovereign state (omantel, otech) keeps its resource addresses (hcloud_server.control_plane[0], hcloud_load_balancer.main, etc) and produces a no-op plan diff. - New regions (regions[1+]) are realised via a parallel for_each set keyed by "{cloudRegion}-{index}" (e.g. fsn1-1, hel1-2). Each secondary region gets its own /24 subnet inside the shared /16 hcloud_network, its own CP server, its own workers, and its own lb11 load balancer. The shared hcloud_firewall + hcloud_ssh_key (one tenant boundary per Sovereign). Why hybrid not full for_each: a wholesale refactor would change every existing resource address (hcloud_server.control_plane[0] → hcloud_server.control_plane["mgmt"]), forcing every running Sovereign to run `tofu state mv` for ~12 resources or face destructive recreates. The brief explicitly bans that. Hybrid is purely additive — secondary resources are NEW addresses no existing state carries. No `tofu state mv` runbook required. Existing Sovereigns provisioned with var.regions = [] or len(var.regions) == 1 produce identical plans before and after this PR. Slice G3 (out of scope here) wires Cilium ClusterMesh between secondary regions and adds per-cluster GitOps path differentiation; today every secondary CP renders an identical Flux Kustomization pointed at clusters/<sovereign_fqdn>/. 
Tests: tests/multi_region.tftest.hcl exercises 5 scenarios offline via mock_provider + override_resource (no real Hetzner): - legacy_no_regions_payload (var.regions=[]) - single_region_entry_does_not_double_provision (len==1) - three_region_mgmt_fsn_hel (EPIC-6 shape) - same_region_duplicates_produce_distinct_keys - non_hetzner_regions_are_filtered_out (oci entries skipped) All 5 pass. CI workflow infra-hetzner-tofu.yaml runs validate + fmt -check + test on every PR touching infra/hetzner/*. Per CLAUDE.md "every workflow MUST be event-driven, NEVER scheduled": push-on-merge + pull-request-on-touch + workflow_dispatch only. No cron. Validation: $ tofu validate Success! The configuration is valid. $ tofu fmt -check -recursive exit=0 $ tofu test tests/multi_region.tftest.hcl... pass run "legacy_no_regions_payload"... pass run "single_region_entry_does_not_double_provision"... pass run "three_region_mgmt_fsn_hel"... pass run "same_region_duplicates_produce_distinct_keys"... pass run "non_hetzner_regions_are_filtered_out"... pass Success! 5 passed, 0 failed. Co-authored-by: hatiyildiz <hatiyildiz@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-09 00:29:44 +04:00
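Note on 8988cd9e4f: the secondary-region set is keyed by "{cloudRegion}-{index}" (for example fsn1-1, hel1-2), regions[0] stays on the legacy singular path, and non-Hetzner entries are filtered out. The real logic lives in HCL; the Go sketch below only illustrates the key derivation. The `Region` struct, its `Provider` field, and the "nbg1" value for the first region are assumptions made for the example:

```go
package main

import "fmt"

// Region is a hypothetical mirror of one var.regions[] entry.
type Region struct {
	Provider    string // assumed field, e.g. "hetzner" or "oci"
	CloudRegion string // e.g. "fsn1", "hel1"
}

// SecondaryKeys returns the keys for regions[1+], skipping non-Hetzner
// entries. The index in the key is the position in the original list,
// so two entries in the same cloud region still get distinct keys.
func SecondaryKeys(regions []Region) []string {
	var keys []string
	for i, r := range regions {
		if i == 0 || r.Provider != "hetzner" {
			continue // regions[0] is handled by the legacy singular path
		}
		keys = append(keys, fmt.Sprintf("%s-%d", r.CloudRegion, i))
	}
	return keys
}

func main() {
	fmt.Println(SecondaryKeys([]Region{
		{"hetzner", "nbg1"},
		{"hetzner", "fsn1"},
		{"hetzner", "hel1"},
	}))
}
```

With an empty list or a single entry the function returns no keys, which matches the commit's claim that such Sovereigns produce identical plans before and after the change.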
e3mrah	2ab442544e	feat(controllers): land environment-controller (slice C2, #1095 ) (#1127 ) Implements slice C2 of EPIC-0 #1095 — the environment-controller Go binary. Watches Environment.catalyst.openova.io/v1 CRs (cluster-scoped) and reconciles each Environment to: 1. Verify the per-Org Gitea Org exists (parent Organization gate). Missing org surfaces GiteaOrgReady=False + Pending phase, never panics or crashloops. 2. Track the canonical branch name for this Environment in status.giteaRepoRef.{org,branch} per NAMING-CONVENTION.md §11.2 item 1 (develop/staging/main ↔ dev/stg/prod; uat/poc map to their own branch name). 3. Idempotently write per-vCluster Flux GitRepository manifests into the Org's Gitea repo at the canonical path `clusters/<host-cluster>/environments/<env-name>/gitrepository.yaml` per NAMING §11.2 item 3. Multi-region Environments fan out one commit per spec.regions[]. Identical bytes short-circuit (zero spurious commits in repo history); drift triggers an overwrite with the existing blob SHA. 4. Surface the canonical JetStream subject prefix `ws.{organizationRef}-{envType}.>` on status.jetstreamSubjectPrefix per NAMING §11.2 item 4 + ARCHITECTURE.md §5. Per-Environment NATS Stream CR creation is OUT OF SCOPE here — NACK isn't installed yet (future slice). 5. Set status.phase, status.regionCount (printer column), status.vclusters[], status.observedGeneration, and the Ready/GiteaOrgReady/GitRepositoryWritten conditions. Architecture rules honored (per docs/INVIOLABLE-PRINCIPLES.md + docs/adr/0001-catalyst-control-plane-architecture.md): - Flux is the only reconciler in production. The controller writes manifests to Gitea; Flux applies them. NO kubectl apply, NO helm install, NO exec.Command in the codebase. - Crossplane is cloud-only. This controller is K8s-to-K8s native via controller-runtime + client-go. - DR is a Placement, not an Env Type. 
The controller treats spec.envType as the schema-validated enum {prod\|stg\|uat\|dev\|poc} with no special-case for DR (per NAMING §11.1). - Sovereign-independent. The Gitea base URL, secret ref, branch suffix, commit author, and Flux interval are ALL runtime config (per Inviolable Principle #4 — never hardcode). Files: - core/controllers/environment/api/v1/types.go — Environment Go types matching the CRD; hand-written DeepCopy to avoid build-time codegen tool dependency. - core/controllers/environment/internal/gitea/client.go — minimal GitHub-compatible REST client targeting Gitea's /api/v1 (GET /orgs/{org}, GET/POST/PUT /repos/{org}/{repo}/contents/{path}). Idempotent UpsertFile with byte-equality short-circuit + blob-SHA conflict refusal. - core/controllers/environment/internal/gitops/render.go — pure template rendering of the Flux GitRepository CR. Deterministic field ordering for byte-equality idempotency. - core/controllers/environment/internal/controller/environment_controller.go — reconciler: validate spec, gate on Gitea Org, fan out per-region manifest writes, set status + conditions. - core/controllers/environment/cmd/main.go — controller-runtime manager entry point with leader election. - core/controllers/environment/Containerfile — two-stage build, alpine:3.20 runtime, non-root UID 65534, ENTRYPOINT. - core/controllers/environment/deploy/rbac.yaml — ClusterRole watching Environments + status subresource + leader election lease. - .github/workflows/build-environment-controller.yaml — CI mirrors build-cert-manager-dynadot-webhook.yaml: vet + race tests, docker buildx + cosign keyless sign + SBOM attest, push to ghcr.io/openova-io/openova/environment-controller. 
Tests (35 total, all GREEN, race-detector enabled): - internal/controller (T1–T11): T1 happy-path single-region reconcile T2 idempotent re-reconcile (zero spurious commits) T3 parent Org missing → Pending + GiteaOrgReady=False (no panic) T4 multi-region fan-out (3 commits, 3 regions) T5 drift detection — operator hand-edit gets overwritten T6 placement-vs-regions cardinality violations → Failed T7 env_type→branch mapping table T8 Gitea repo missing → Pending + GiteaRepoMissing reason T9 partial-failure one region → Degraded with that region Failed T10 Config.Defaults applies the documented defaults T11 NotFound between dequeue and Get is benign - internal/gitea: GET /orgs OK + 404 + 500; UpsertFile create / idempotent / update with SHA / repo-not-found; pathEscape preserves slashes; arg-validation. - internal/gitops: BranchForEnvType / JetStreamSubjectPrefix / HostClusterName (with override) / GitRepositoryPath / RenderGitRepository (deterministic + complete + anonymous + default interval + required-field validation) / EnvironmentName. go vet ./... clean. go test -count=1 -race ./... GREEN. Out of scope per slice brief: organization-controller (C1), blueprint-controller (C3), application-controller (C4), useraccess-controller (C5), catalyst-api codebase changes, NACK install, per-Environment NATS Stream CRs. Co-authored-by: hatiyildiz <hati@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-09 00:05:53 +04:00
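Note on 2ab442544e: the commit describes two small naming rules, the env-type to branch mapping (dev/stg/prod map to develop/staging/main, uat/poc map to their own name) and the JetStream subject prefix `ws.{organizationRef}-{envType}.>`. The function names below appear in the commit's test list, but their signatures are not shown there, so the signatures and error handling here are assumptions:

```go
package main

import "fmt"

// BranchForEnvType maps an Environment type to its Git branch.
// Signature assumed; only the mapping itself comes from the commit message.
func BranchForEnvType(envType string) (string, error) {
	switch envType {
	case "dev":
		return "develop", nil
	case "stg":
		return "staging", nil
	case "prod":
		return "main", nil
	case "uat", "poc":
		return envType, nil // these map to a branch of the same name
	}
	return "", fmt.Errorf("unknown env type %q", envType)
}

// JetStreamSubjectPrefix builds the ws.{organizationRef}-{envType}.> prefix.
// Signature assumed.
func JetStreamSubjectPrefix(organizationRef, envType string) string {
	return fmt.Sprintf("ws.%s-%s.>", organizationRef, envType)
}

func main() {
	branch, err := BranchForEnvType("stg")
	fmt.Println(branch, err, JetStreamSubjectPrefix("acme", "stg"))
}
```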
e3mrah	84167a768e	feat(controllers): land organization-controller (slice C1, #1095) (#1129) A thin in-cluster Go controller that watches Organization CRs (orgs.openova.io/v1) and reconciles four downstream artifacts per the EPICS-1-6 unified design §3.3 + §3.7 and ADR-0001 §2.7: 1. vCluster HelmRelease — written into the per-Org Gitea repo (NOT direct apply; Flux reconciles per ADR-0001 §2.1). 2. Keycloak group — at path /<slug> with attributes {org=[<slug>], tier=[<sme|corporate>]}. 3. Gitea Org — auto-created if absent; one repo per Org seeds the vCluster + tenant manifests. 4. UserAccess CR — one per spec.owners[] entry; slice C5's useraccess-controller materializes the RoleBindings. Per ADR-0001 §2.2 (Crossplane is cloud-only) this is K8s-to-K8s reconciliation NOT a Crossplane Composition. Per §2.1 the controller writes manifests via the Gitea HTTP contents API — never kubectl apply, never helm install, never exec.Command("helm", ...). Idempotent: re-running on a steady-state CR is a no-op (every "ensure" is find-or-create with byte-equal short-circuit on PutFile). What ships: - core/controllers/organization/cmd/main.go — entry point with envconfig, leader election, signal handling - core/controllers/organization/internal/controller/ — reconciler + KeycloakClient interface + LiveKeycloak impl - core/controllers/organization/internal/gitea/ — minimal Gitea Admin REST client (Org/Repo + contents-API). Self-contained — extractable to core/pkg/gitea-client/ when slice C2 needs it.
- core/controllers/organization/internal/gitops/ — manifest renderer (namespace + vcluster HelmRelease + kustomization) - core/controllers/organization/internal/orgapi/ — Organization Go types mirroring the CRD schema (no deepcopy-gen — inlined) - core/controllers/organization/Containerfile — multi-stage build (alpine-based, runs as UID 65534) - core/controllers/organization/config/{rbac,manager}/ — ClusterRole + Deployment scaffolding for chart consumption (slice F1) - .github/workflows/build-organization-controller.yaml — push/PR/ manual triggers, no cron Tests: 9 unit tests across 3 packages cover happy-path reconcile, idempotency (zero net writes on second reconcile), Keycloak group already exists, Gitea Org already exists, slug/metadata drift, missing CR no-op, byte-equal PutFile no-op, 422-race re-find, template structural-YAML validity, and label-vocabulary compliance. go test -count=1 -race ./... and go vet ./... both clean. Out of scope: environment-controller (C2), application-controller (C4), useraccess-controller (C5 — this controller only WRITES UserAccess CRs). Co-authored-by: hatiyildiz <hatiyildiz@noreply.openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-09 00:04:29 +04:00
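Note on 84167a768e: the idempotency claim rests on PutFile skipping the write when the stored bytes already equal the desired bytes. The sketch below shows that short-circuit against an in-memory map. `MemRepo` is a hypothetical stand-in for the Gitea contents API, and the real client's PutFile signature is not shown in the commit, so this one is assumed:

```go
package main

import (
	"bytes"
	"fmt"
)

// MemRepo is a hypothetical in-memory stand-in for a Gitea repository.
type MemRepo struct {
	files  map[string][]byte
	Writes int // number of writes that actually happened
}

// NewMemRepo returns an empty repository.
func NewMemRepo() *MemRepo {
	return &MemRepo{files: map[string][]byte{}}
}

// PutFile stores content at path unless identical bytes are already there.
// It reports whether a write happened.
func (r *MemRepo) PutFile(path string, content []byte) bool {
	if cur, ok := r.files[path]; ok && bytes.Equal(cur, content) {
		return false // steady state: no commit needed
	}
	r.files[path] = append([]byte(nil), content...) // store a private copy
	r.Writes++
	return true
}

func main() {
	repo := NewMemRepo()
	manifest := []byte("kind: HelmRelease\n")
	fmt.Println(repo.PutFile("clusters/a/helmrelease.yaml", manifest)) // first reconcile writes
	fmt.Println(repo.PutFile("clusters/a/helmrelease.yaml", manifest)) // second reconcile is a no-op
	fmt.Println(repo.Writes)
}
```

The same comparison also gives drift restoration for free: a hand-edited file no longer matches the rendered bytes, so the next reconcile overwrites it.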
e3mrah	dd1699afe3	feat(controllers): land useraccess-controller — fix silently broken Crossplane path (slice C5, #1095, P0) (#1128) Per docs/EPICS-1-6-unified-design.md §3.5 and ADR-0001 §2.3 amendment, K8s-to-K8s reconciliation belongs to thin in-cluster controllers, not Crossplane Compositions. The existing useraccess.compose.openova.io Composition writes RoleBindings via provider-kubernetes — but provider-kubernetes is NOT installed on any production Sovereign (caught in the EPIC-0 audit). Every UserAccess CR has been silently no-op'd. This controller fixes that. What lands: - core/controllers/useraccess/cmd/main.go — controller-runtime Manager with leader election + signal handling, environment-only config - internal/controller/{reconciler,desired,spec,status,types}.go — the reconciler. Watches UserAccess.access.openova.io/v1alpha1 (cluster- scoped, unstructured client) and owns RoleBinding + ClusterRoleBinding via Owns() so drift triggers reconcile via ownerRef indexing - internal/labels/scope.go — Manara DNA scope matcher: AND-within / OR-across, wildcard scopes, EnforcedScopes() per catalog tier (the developer auto-injection of openova.io/env-type=dev) - internal/controller/_test.go + internal/labels/scope_test.go — 26 unit tests with the controller-runtime fake client.
Covers happy-path, multi-app/multi-ns fan-out, namespaces:["*"]→CRB, group subjects, drift detection+restore, orphan deletion on spec shrink, idempotency, invalid spec, ownerRef shape, NotFound no-op, and the 5-catalog-tier matrix - deploy/{rbac,deployment}.yaml — ClusterRole/SA/Deployment with non-root, read-only-rootfs, drop-ALL caps, leader-election Role - Containerfile — Alpine 3.20 final stage, CGO_ENABLED=0, UID 65534 - .github/workflows/useraccess-controller-build.yaml — event-driven build (push-on-main + PR test job), SHA-pinned image tags Behaviour: - Per UserAccess CR, materialises RoleBindings (per namespace) or ClusterRoleBindings (when namespaces:["*"]) referencing the canonical openova:application-{admin,editor,viewer} ClusterRoles - ownerRef back to the UserAccess CR with controller=true + blockOwnerDeletion=true so K8s GC cascades deletes - Drift detection: hand-mutated bindings are restored on next pass + Condition Drift=True surfaced for the UI - Idempotent: steady-state reconcile = 0 K8s writes - Status: phase (Pending|Active|Failed), rolebindingsCreated, observedGeneration, conditions[] Out of scope per the brief: - Crossplane Composition deletion (operator retires post-verify) - 5-catalog-tier role inheritance (lands with EPIC-3 #1098) - Keycloak realm-role sync (slice D1b, this controller is consumer) Tests: go vet ./... # clean go test -count=1 -race ./... # 26/26 pass go test ./internal/labels/... -run TestScope # full 5-tier matrix Co-authored-by: Hatice Yildiz <hatiyildiz@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-09 00:04:07 +04:00
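Note on dd1699afe3: the scope matcher is described as AND-within / OR-across with wildcard scopes. One plausible reading is sketched below: every key inside a single scope must match, any one scope in a list is enough, and a `*` value matches any value as long as the key is present. The `Scope` type, the function names, and the exact wildcard semantics are assumptions, since the commit message does not spell them out:

```go
package main

import "fmt"

// Scope is a hypothetical label selector: key -> required value,
// where "*" accepts any value for that key.
type Scope map[string]string

// Matches reports whether every key in the scope is satisfied (AND-within).
func (s Scope) Matches(labels map[string]string) bool {
	for key, want := range s {
		got, ok := labels[key]
		if !ok {
			return false // key must be present, even for a wildcard
		}
		if want != "*" && want != got {
			return false
		}
	}
	return true
}

// AnyMatches reports whether at least one scope matches (OR-across).
func AnyMatches(scopes []Scope, labels map[string]string) bool {
	for _, s := range scopes {
		if s.Matches(labels) {
			return true
		}
	}
	return false
}

func main() {
	scopes := []Scope{
		{"openova.io/env-type": "dev", "team": "payments"},
		{"openova.io/env-type": "stg", "team": "*"},
	}
	labels := map[string]string{"openova.io/env-type": "stg", "team": "search"}
	fmt.Println(AnyMatches(scopes, labels))
}
```

Note that under this reading an empty scope matches every label set, so a real implementation would likely reject or special-case it.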
e3mrah	358c32c032	ci: add cluster bootstrap-kit drift guardrail (slice H2 scope-reduced, #1095) (#1122) Adds .github/workflows/cluster-template-drift.yaml — a warn-only workflow that reports drift between each clusters/<sovereign>/bootstrap-kit/ tree and the canonical clusters/_template/bootstrap-kit/. Why warn-only, not enforce: - Every existing Sovereign carries some legitimate drift (per-Sovereign image SHAs, region-specific values overlay) — blocking PRs on diff count would prevent ALL cluster work. - The right place to enforce the boundary is Catalyst's organization- controller (slice C1 of #1095), not CI. Once C1 ships, every new Sovereign bootstrap-kit is generated from _template and the attestation lives at apply-time, not at CI-time. - Retroactively reconciling the existing omantel.omani.works/ and otech.omani.works/ trees (which have 20+ differing files plus structural changes — extra files on each side) is a high-blast-radius maintenance-window operation, NOT a CI scoped slice. What this workflow does: - Triggers on push to main + PR + workflow_dispatch when clusters/** changes. - For each clusters/<sovereign>/ directory, runs `diff -rq` against clusters/_template/bootstrap-kit/ and writes a Markdown report to the run summary AND a sticky PR comment. - Counts differing files + only-in-template + only-in-Sovereign per Sovereign so reviewers can quickly see whether new drift was introduced. Per docs/EPICS-1-6-unified-design.md §3.9 row 2 + §11 row 6 (decision amended from "reconcile + CI gate" to "warn-only CI gate"; structural reconcile deferred to slice C1 organization-controller). Per docs/INVIOLABLE-PRINCIPLES.md #4a — workflow only inspects YAML; no images built, no cloud calls. Refs: #1094, #1095, slice C1 (organization-controller). Co-authored-by: hatiyildiz <hatiyildiz@noreply.openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 23:09:50 +04:00
e3mrah	eb6a3c1812	fix(chart,ci): auto-bump literal catalyst-{api,ui} SHAs — Sovereigns + contabo were frozen at :2122fb8 (#1060 ) * fix(catalyst-api): rip out dangling sovereign_* route registrations + chart 1.4.56 PR #1050 deleted sovereign_more.go (which defined HandleSovereignUsers, HandleSovereignCatalog, HandleSovereignSettings, HandleSovereignTopology) but left four route registrations in cmd/api/main.go that still referenced those handler methods. The catalyst-api build for the merged revert (run 25439549879) failed with: cmd/api/main.go:690:39: h.HandleSovereignUsers undefined cmd/api/main.go:691:41: h.HandleSovereignCatalog undefined cmd/api/main.go:692:42: h.HandleSovereignSettings undefined cmd/api/main.go:693:42: h.HandleSovereignTopology undefined That's why ghcr.io/openova-io/openova/catalyst-api:fdd3354 was never published — only the UI image rolled. Result: omantel.biz catalyst-api pod stuck in ImagePullBackOff. Drop the four route registrations. Same baby, new address — the chroot Sovereign uses the existing /api/v1/deployments/{depId}/* handlers via the JWT-resolved deploymentId, not parallel-baby /api/v1/sovereign/* endpoints. Also revert two more parallel-baby fragments still on main: - getHierarchicalInfrastructure mode-aware fetcher → single mother URL (the chroot resolves deploymentId from the cookie and the mother-side topology handler serves byte-identical data once cutover-import has persisted the deployment record on the Sovereign's local store) - CatalogAdminPage.fetchApps mode-aware → /catalog/apps everywhere Bump bp-catalyst-platform chart 1.4.55 → 1.4.56 and the cluster Kustomization version pin to match. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(sovereignDynamicClient): in-cluster fallback when running ON the Sovereign The chroot Sovereign Console at console.<sov-fqdn> is the SAME catalyst-api binary as the mother. 
When that binary runs ON the Sovereign cluster (catalyst-system namespace on the Sovereign itself), there is no posted-back kubeconfig — the catalyst-api IS in the cluster it needs to talk to, and rest.InClusterConfig() returns the right credentials. Without this, every endpoint that needs the Sovereign-side dynamic client returned 503 with "sovereign cluster kubeconfig not yet posted back" — including ListUserAccess (/users page), CreateUserAccess, infrastructure CRUD, etc. Caught on omantel.biz 2026-05-06: /users rendered "list user-access: HTTP 503" because the Sovereign-side catalyst-api was looking for a kubeconfig that doesn't exist on the chroot side of the cutover boundary. Detection: SOVEREIGN_FQDN env (set on every Sovereign-side catalyst-api deployment by the chart) matches dep.Request.SovereignFQDN. On the mother, SOVEREIGN_FQDN is unset → unchanged behavior. On the chroot, SOVEREIGN_FQDN matches the only deployment served (its own) → use in-cluster. Same fallback applied to tryDynamicClientLocked (loaderInputFor's best-effort live-source client) so /infrastructure/topology and the /cloud graph render with live data on the chroot too. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(user-access): empty list when CRD absent + RBAC for chroot Two coupled fixes for the /users page on chroot Sovereign Console: 1. catalyst-api-cutover-driver ClusterRole: grant read/write on useraccesses.access.openova.io. The Sovereign chroot's catalyst-api uses the in-cluster ServiceAccount (per PR #1052). The list call was returning 403 from the apiserver because the SA had no rule covering this CRD. 2. ListUserAccess: return 200 with empty items when the CRD itself is not installed (apierrors.IsNotFound). The access.openova.io CRD ships via a separate blueprint that may not yet be installed on a fresh Sovereign — the page should render its empty state, not a 500 toast. 
Caught live on omantel.biz 2026-05-06 after PR #1052 unblocked the in-cluster client path: list call surfaced first as 403 (RBAC), then as 500 "server could not find the requested resource" (CRD absent). Both now resolve to a 200 + []. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(chroot): byte-identical /jobs + /cloud — kill fixture fallback, lazy-seed jobs.Store from live cluster, single endpoint Two parallel-baby paths still made the chroot diverge from the mother on /cloud and /jobs/{jobId}. Both now ship one path that serves byte-identical data on both surfaces. 1. CloudPage rendered fictional topology (Frankfurt, Helsinki, omantel-primary, omantel-secondary, edge-lb, vpc-net-eu, …) when the topology query errored — because it fell back to `infrastructureTopologyFixture` from `src/test/fixtures/`. That is a test-only file leaking into production via the production import tree, in direct violation of INVIOLABLE-PRINCIPLES #1 (no placeholder data — empty state when you don't know). Fix: drop the fixture fallback. On error → null → empty-state render. The mother shows the same empty state when its loader returns nothing; byte-identical. 2. JobsTable + JobDetail rendered a flat green-grid because the chroot was hitting `/api/v1/sovereign/jobs` which returns a minimal shape (no dependsOn, no parentId, no exec records). Mother's `/api/v1/deployments/{depId}/jobs` returns the rich shape from a per-deployment jobs.Store, which on the chroot starts empty (the mother's exportDeploymentToChild only ships the deployment record, not the jobs.Store contents). Fix: ship one URL on both surfaces — `/api/v1/deployments/{id}/jobs`. 
Add `chrootSeedJobsStoreIfEmpty` that runs at handler-time when SOVEREIGN_FQDN matches dep.Request.SovereignFQDN AND the per- deployment jobs.Store has 0 records: do a one-shot HelmRelease list via the in-cluster client (helmwatch.ListAndSnapshotHelmReleases — exported here, mirrors Watcher.SnapshotComponents without spinning up an informer), pass through snapshotsToSeeds + Bridge.SeedJobsFromInformerList. Subsequent calls read directly from the now-populated store and return rich Job records with dependsOn / parentId / status — exactly like the mother. useLiveJobsBackfill loses its mode-aware fetcher; the chroot UI uses the same `/api/v1/deployments/{id}/jobs` URL as the mother. 3. HandleDeploymentImport now also loads the imported record into the in-memory deployments map immediately, so `/deployments/{id}/` handlers don't need a pod restart's restoreFromStore to see the chroot-imported deployment. Bump bp-catalyst-platform 1.4.56 → 1.4.57 (chart + Kustomization). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> fix(jobdetail): bare-jobName URL — Traefik strips %3A so canonical id 404s JobDetail navigation was 404ing on the chroot because the link builder URL-encoded the canonical Job id ("69e73b3abe673840:install-keycloak") and Traefik (or any upstream proxy that's RFC 3986 §3.3-strict) does not decode `%3A` inside path segments. The catalyst-api router saw the literal "%3A" and Store.GetJob's exact-match path missed. Two coupled fixes: 1. useJobLinkBuilder strips the "<deploymentId>:" prefix before encoding, producing /jobs/install-keycloak (Traefik-safe) instead of /jobs/69e73b3abe673840%3Ainstall-keycloak. Store.GetJob already accepts both bare jobName and canonical id (see store.go:781-789). 2. JobDetail.jobsById indexes by BOTH canonical id AND bare jobName so the URL param resolves regardless of which format the link emitted. Bump chart 1.4.58 → 1.4.59. 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(cloud): resolve deploymentId from cookie on chroot — was firing topology against undefined CloudPage's topology query fired against /deployments/undefined/... on the chroot (URL is /cloud, no deploymentId path segment), so the page showed "Couldn't load architecture" with all node counts at 0/0. Fix: same pattern as JobDetail — useResolvedDeploymentId() reads the JWT cookie's deployment_id claim via /api/v1/sovereign/self, falling back from URL params. Topology query also gates on `!!deploymentId` so it doesn't waste a 404 round-trip during cookie resolution. Bump chart 1.4.60 → 1.4.61. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(chroot): single chrome — no frame in frame, no mother handover banner Two visible bleed-throughs from the mother's wizard UX onto the chroot Sovereign Console at console.<sov-fqdn>: 1. Two stacked headers + sidebar inside sidebar ("frame in frame"). SovereignConsoleLayout rendered its own sidebar+header AND the page inside rendered PortalShell which rendered ANOTHER header (its sidebar was already skipped for chroot per a prior fix). User saw two horizontal title bars stacked. Resolution: SovereignConsoleLayout becomes auth-only on the chroot. It runs the cookie/OIDC auth gate + RequiredActionsModal, then renders <Outlet/> with NO chrome. PortalShell is now the single chrome owner on both surfaces: - Mother (/sovereign/provision/$id): renders Sidebar with /provision/$id/X URLs + its header. - Chroot (console.<sov-fqdn>): renders SovereignSidebar with clean /X URLs + the same header. One sidebar, one header, byte-identical to mother layout. 2. "✓ Sovereign is ready — Redirecting to your Sovereign console" banner on /apps. This is the mother's wizard celebration that tells the operator "you can now jump to your new Sovereign". 
On the chroot the operator IS already on the Sovereign Console; the banner bleeds through because the imported deployment record carries the mother's handover-ready event in its history. Resolution: AppsPage gates the banner, the toast, and the auto-redirect timer on `!isSovereignMode`. Chroot stays clean. Bump chart 1.4.62 → 1.4.63. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(chroot): wrap chroot-only pages in PortalShell + drop /catalog page Three chroot-only pages bypassed PortalShell entirely. After SovereignConsoleLayout went auth-only in #1057, they rendered full-bleed with no sidebar / no header — visible look-and-feel break. /settings/marketplace → MarketplaceSettings (wrapped in PortalShell) /parent-domains → ParentDomainsPage (wrapped in PortalShell) /catalog → CatalogAdminPage (deleted) Drop /catalog entirely per founder direction: a separate page just to flip a "publish to marketplace" boolean per app is the wrong shape. The natural place for that toggle is on each /apps card (future PR — needs HandleSovereignApps to join publish state from the SME catalog microservice). Removed: - /catalog route registration in router.tsx - 'Catalog' entry in SovereignSidebar's FLAT_NAV - CatalogAdminPage.tsx (525 lines) - 'catalog' from ActiveSection union + deriveActiveSection regex The publish-state PATCH endpoint at /catalog/admin/apps/{slug}/publish on the SME catalog service is unaffected; it's exposed at marketplace.<sov-fqdn>, not console.<sov-fqdn>, and the future apps-card toggle will call it via the same path. Bump chart 1.4.64 → 1.4.65. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(apps): publish chip on each card — replaces deleted /catalog page Per founder direction: "if the catalog is just labeling an app to be shown in marketplace, why don't we do it through the apps?" — drop the standalone /catalog page (#1058), put the publish toggle on each /apps card. 
Backend (catalyst-api): - New file sme_catalog_client.go — best-effort client for the in-cluster SME catalog microservice at http://catalog.sme.svc.cluster.local:8082. 30s response cache, 1.5s probe budget, returns nil on DNS NXDOMAIN (SME services tier not deployed on this Sovereign — common when marketplace.enabled is false). - HandleSovereignApps decorates each app with `marketplacePublished` bool joined by slug from the SME catalog. nil ⇒ slug not in SME catalog (bootstrap component, or marketplace not deployed) ⇒ FE suppresses the chip. - New handler HandleSovereignAppPublish at PATCH /api/v1/sovereign/apps/{slug}/publish. Body {"published": bool}. Proxies to PATCH /catalog/admin/apps/{slug}/publish on the SME catalog. Surfaces upstream status verbatim. Invalidates the cache so the next /apps poll reflects the change immediately. Frontend (AppsPage): - liveAppsQuery returns { statusById, publishedBySlug } instead of the bare status map. - Each AppCard with a non-null marketplacePublished renders a PUBLISHED / UNPUBLISHED chip alongside the status chip. Click → PATCH → optimistic refetch via React Query. - Bootstrap components and apps not in the SME catalog have nil → no chip (correct: nothing to toggle). - Cards with marketplace.enabled=false render no chips at all (SME catalog unreachable → nil for every slug). Bump chart 1.4.66 → 1.4.67. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> fix(chart,ci): auto-bump literal catalyst-{api,ui} SHAs so all Sovereigns + contabo get fresh code Audit triggered by founder asking if PRs #1051..#1059 reach NEW Sovereigns or just my manual `kubectl set image` patches on omantel. Answer was: nothing reached anyone except omantel via manual patches. Both contabo AND every fresh Sovereign would install :2122fb8 — the SHA frozen at PR #1040's last manual chart-touch on May 6 morning. 
Root cause: - chart/templates/api-deployment.yaml + ui-deployment.yaml carry LITERAL image refs ("ghcr.io/openova-io/openova/catalyst-api:2122fb8"), not Helm-templated `{{ .Values.images.catalystApi.tag }}`. - catalyst-build CI's deploy step bumped values.yaml's catalystApi.tag on every push — but no template reads from it. Dead code. - contabo's catalyst-platform Flux Kustomization at ./products/catalyst/chart/templates applies these as raw manifests. - Sovereigns Helm-install the same chart; Helm passes the literal through unchanged. - Both ended up frozen at whatever literal was committed at the last manual chart-touching PR. Fix: 1. CI's deploy step now bumps both the literal SHAs in the two template files AND the unused-but-kept-for-SME-services values.yaml. Sed-patches the literal directly so contabo's Kustomize path keeps working. 2. The commit step adds the two templates to the staged set alongside values.yaml, so every "deploy: update catalyst images to <sha>" commit propagates to contabo (10-min reconcile) AND Sovereigns (next OCI chart publish via blueprint-release). 3. Bump bp-catalyst-platform 1.4.68 → 1.4.69 so the new chart with the latest literal (currently :8361df4) gets republished and pinned in clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml. Why drop the "freeze contabo" intent of the previous comment: The previous comment said contabo auto-roll on every PR was bad because PR #975's image broke contabo (k8scache startup loop). Solution there is: fix the bug in the code, not freeze contabo. Freezing masked real divergence — the reason the founder caught this is that manual omantel patches were the only thing keeping omantel current while contabo + every other fresh Sovereign quietly ran 9 PRs behind. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-06 21:10:31 +04:00
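A minimal sketch of the bare-jobName lookup described in the jobdetail fix above, assuming a canonical id has the form "<deploymentId>:<jobName>". The names `bareJobName` and `indexJobs` are hypothetical and are not the actual `useJobLinkBuilder` or `store.go` code; they only illustrate stripping the prefix and indexing by both forms.

```go
package main

import (
	"fmt"
	"strings"
)

// bareJobName strips a leading "<deploymentId>:" prefix from a canonical
// job id. An id without a colon is returned unchanged.
// Hypothetical helper, for illustration only.
func bareJobName(id string) string {
	if _, name, found := strings.Cut(id, ":"); found {
		return name
	}
	return id
}

// indexJobs keys every job under BOTH its canonical id and its bare name,
// so a URL parameter resolves regardless of which form the link emitted.
// The map value is the canonical id.
func indexJobs(canonicalIDs []string) map[string]string {
	idx := make(map[string]string, 2*len(canonicalIDs))
	for _, id := range canonicalIDs {
		idx[id] = id
		idx[bareJobName(id)] = id
	}
	return idx
}

func main() {
	idx := indexJobs([]string{"69e73b3abe673840:install-keycloak"})
	fmt.Println(bareJobName("69e73b3abe673840:install-keycloak"))
	fmt.Println(idx["install-keycloak"])
}
```

Because the bare name contains no colon, a link built from it needs no percent-encoding of `:`, which sidesteps the proxy behaviour the commit describes.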
e3mrah	953ef8290f	fix(catalyst-build): stop auto-bumping contabo Kustomize-path image refs (#980 ) * fix(catalyst-ui): drop stale params={{ deploymentId }} from clean-root Links (#975) #976 collapsed `to="/provision/$deploymentId/<page>"` to clean root paths (`to="/<page>"`) but left the `params={{ deploymentId }}` prop on every callsite, breaking the Vite tsc build with TS2353. Fixes: - Drop `params={{ deploymentId }}` from Links whose target is now a parameterless clean root path (StatusStrip, AppDetail, AppsPage, DecommissionPage, FlowPage, JobDetail, JobsPage, JobsTimeline, SettingsPage, DeploymentsList). - For Links whose `to` still uses `$componentId`/`$jobId`, cast `params` with `as never` to match the existing pattern in cloud-compute/cloud-network/cloud-storage/Sidebar/UserAccess (the dual-mount under provisionRoute + consoleLayoutRoute defeats TS's strict params inference; the runtime path is correct). - Drop `deploymentId` prop + interface field from JobCard / JobRow / JobsTable / AppCard now that the Links don't need it; update test fixtures + the JobsTable row-link assertion to match the new clean `/jobs/$jobId` href. - Drop the unused ArchEdgeType import in k8sAdapter (TS6196). - Dashboard navigateToApp uses `as never` casts to align with the same pattern. * fix(catalyst-build): stop auto-bumping contabo Kustomize-path image refs Two paths consume the catalyst-api / catalyst-ui images: 1. bp-catalyst-platform OCI chart (Sovereigns) — values.yaml driven, tag in values.yaml is rendered at helm install time by Sovereign Flux. 2. contabo Kustomize-path — literal image refs in templates/api-deployment.yaml and templates/ui-deployment.yaml. Flux kustomize-controller on contabo reconciles those files directly. The CI deploy step was bumping BOTH on every PR, which auto-rolled contabo every time anyone merged a catalyst-api code change. 
On 2026-05-05 PR #975's k8scache feature broke contabo startup on the auto-roll because contabo has 27 dead-Sovereign kubeconfigs that the new code iterates synchronously at startup, blocking readiness. Fix: keep the values.yaml bump (Sovereigns auto-pick-up via OCI chart which is the right behaviour for fresh provisions). Drop the templates/*-deployment.yaml bump so contabo only rolls when an operator manually commits a validated SHA into those files. Closes the auto-deploy-to-contabo blast radius on every PR. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: hatiyildiz <hatiyildiz@users.noreply.github.com> Co-authored-by: hatiyildiz <hatice.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-05 21:24:57 +04:00
e3mrah	2ff50f0591	fix(bp-newapi+services-build): imagePullSecrets on Pod, sed bumps values.yaml smeTag (#955 ) Two SME-blocker bugs caught live on otech113 (alice signup gate 5 fails on fresh Sovereign): #952 — bp-newapi 1.4.0 Pod has no imagePullSecrets, so kubelet pulls PRIVATE ghcr.io/openova-io/openova/{newapi-mirror,services-metering-sidecar} anonymously and gets 403 Forbidden. Fix: - Templatize spec.imagePullSecrets on Deployment + channel-seed Job. - Default values.yaml `imagePullSecrets: [{name: ghcr-pull}]`. - Add `newapi` to flux-system/ghcr-pull's reflector reflection-{allowed,auto}-namespaces in cloudinit-control-plane.tftpl so bp-reflector mirrors the source Secret into the namespace automatically on every fresh Sovereign. - Bump bp-newapi 1.4.0 -> 1.4.1, update _template overlay. #953 — services-build.yaml's image-rewrite loop only matched the hardcoded `image: ghcr.io/.../services-<svc>:<sha>` form. 7 of 8 sme-services templates use `image: "{{ ... }}/services-<svc>:{{ .Values.images.smeTag }}"`. Each services-build run bumped only auth.yaml while reporting "update sme service images to ${SHA}", leaving the live Pod on stale bytes (PR #951's #941 fix never reached services-catalog despite the merge + chart bump chain). Fix: - After the hardcoded loop, also bump `images.smeTag` in products/catalyst/chart/values.yaml with a strict regex match (`^ smeTag: "<sha>"$`); refuse to auto-bump if the line shape changes (defends against silent drift if a contributor renames the field). - Mirror the change into the retry-path `rewrite()` function so a reset-to-origin/main retry does not recreate the original bug. Tests: - platform/newapi/chart/tests/imagepullsecrets-render.sh — 4 cases asserting the Deployment and channel-seed Job carry the default ghcr-pull reference, that an empty override suppresses the block, and that custom secret names propagate (Inviolable Principle #4). 
- tests/integration/services-build-rewrite.sh — 3 cases reproducing the workflow's rewrite logic on a sandboxed copy of the live chart, asserting both auth.yaml's hardcoded line AND values.yaml's smeTag get bumped, that helm-render of the catalyst chart with the bumped values produces all 8 SME-service Deployments at the new SHA, and that an idempotent re-bump to a second SHA also lands cleanly. Refs: #952 #953 (umbrella #915 — alice signup gate 5). Co-authored-by: hatiyildiz <143030955+hatiyildiz@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-05 15:47:37 +04:00
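The strict `smeTag` bump from #953 can be sketched as follows. This is a Go illustration rather than the workflow's actual sed logic; the two-space indentation, the lowercase-hex tag shape, and the function name `bumpSmeTag` are assumptions made for the example.

```go
package main

import (
	"errors"
	"fmt"
	"regexp"
)

// smeTagLine matches exactly one line shape: a two-space indent (assumed),
// the key, and a double-quoted lowercase-hex tag.
var smeTagLine = regexp.MustCompile(`(?m)^  smeTag: "[0-9a-f]+"$`)

var errShapeChanged = errors.New("smeTag line not found in the expected shape; refusing to auto-bump")

// bumpSmeTag rewrites the tag only when exactly one line matches the strict
// pattern. Any other count is treated as drift and returns an error instead
// of silently rewriting nothing.
func bumpSmeTag(values, sha string) (string, error) {
	if len(smeTagLine.FindAllStringIndex(values, -1)) != 1 {
		return "", errShapeChanged
	}
	return smeTagLine.ReplaceAllLiteralString(values, fmt.Sprintf(`  smeTag: "%s"`, sha)), nil
}

func main() {
	out, err := bumpSmeTag("images:\n  smeTag: \"fa4395f\"\n", "95a06f5")
	fmt.Printf("%q %v\n", out, err)

	// A renamed field no longer matches, so the bump is refused.
	_, err = bumpSmeTag("images:\n  smeServicesTag: \"fa4395f\"\n", "95a06f5")
	fmt.Println(err)
}
```

Failing loudly on a non-match is the point of the strict regex: a bump that reports success while changing nothing is the failure mode the commit fixes.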
e3mrah	db332f6767	fix(ci): services-build auto-bumps chart patch + dispatches blueprint-release (#874 ) * fix(bp-catalyst-platform): bump 1.4.8 -> 1.4.9 to republish with current services-auth image (#871) Chart 1.4.8 was published from commit `95a06f56` BEFORE the deploy-bot updated templates/sme-services/auth.yaml's image pin from services-auth:fa4395f -> services-auth:95a06f5 (which has the /auth/send-pin alias from PR #869). The blueprint-release workflow fired on `95a06f56` only, so the OCI artifact for 1.4.8 was published with the OLD image SHA in chart bytes. otech103 reconciled 1.4.8 and rendered the auth Deployment with the OLD image -> /auth/send-pin returns 404 -> SME marketplace signup blocked. Same deploy-step race documented in feedback_idempotent_iac_purge.md and the overnight DoD bookmark. Long-term fix is a double-bump sequencing PR (file separately); short-term fix is bumping the chart version so blueprint-release republishes the artifact with the current image pin. No template change. Lockstep slot 13 pin in clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml bumps from 1.4.8 -> 1.4.9. Closes #871 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(ci): services-build deploy auto-bumps chart patch + dispatches blueprint-release (#872) Eliminate the recurring race between services-build's deploy commit and blueprint-release's path-trigger on chart-version-bumping PRs. Before: a PR bumping `products/catalyst/chart/Chart.yaml` AND touching `core/services/*` triggered both workflows on the same merge SHA in parallel. blueprint-release packaged the chart at the merge commit (which still held the OLD image SHAs) and published the bumped chart version with stale image refs. services-build's deploy commit landed AFTER, but per GitHub Actions design GITHUB_TOKEN-authored pushes do NOT re-trigger workflows, so blueprint-release never fired again on the corrected chart. 
A manual no-op chart bump PR was the only way to republish (PR #865 chasing PR #864 was the live incident). After: services-build's deploy step 1. sed-rewrites image: lines under products/catalyst/chart/templates/sme-services/.yaml (unchanged) 2. Pure-bash semver patch-bumps Chart.yaml `version:` and `appVersion:` atomically 3. Single commit captures both rewrites 4. Explicit `gh workflow run blueprint-release.yaml -f blueprint=catalyst -f tree=products` dispatches the chart publish (matches catalyst-build's PR #720 pattern) 5. Idempotent push retry re-reads origin/main and bumps from THAT version on conflict, so concurrent CI runs produce strictly increasing patch versions instead of clobbering each other Adds `actions: write` to the deploy job permissions so the gh workflow run dispatch doesn't return HTTP 403. The manual chart-version field in author PRs becomes a floor; CI auto-bumps from there. PR authors should NOT bump the patch themselves any more — the deploy step does it. Major/minor bumps remain the author's call. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: hatiyildiz <hatice.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-05 08:32:34 +04:00
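A rough Go sketch of the patch-bump step (the workflow itself uses pure bash, per the commit above). The function names and the exact `Chart.yaml` line shapes are assumptions; `version:` and `appVersion:` are each bumped from their own current value.

```go
package main

import (
	"errors"
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

var semver = regexp.MustCompile(`^(\d+)\.(\d+)\.(\d+)$`)

// bumpPatch increments the PATCH component of a MAJOR.MINOR.PATCH version.
func bumpPatch(v string) (string, error) {
	m := semver.FindStringSubmatch(v)
	if m == nil {
		return "", fmt.Errorf("not a MAJOR.MINOR.PATCH version: %q", v)
	}
	patch, err := strconv.Atoi(m[3])
	if err != nil {
		return "", err
	}
	return fmt.Sprintf("%s.%s.%d", m[1], m[2], patch+1), nil
}

// chartVersionLine matches top-level version/appVersion lines, with or
// without double quotes around the value (assumed shapes).
var chartVersionLine = regexp.MustCompile(`(?m)^(version|appVersion): "?(\d+\.\d+\.\d+)"?$`)

// bumpChart patch-bumps every matching line in one pass, so both fields
// change in the same rewrite.
func bumpChart(chart string) (string, error) {
	if !chartVersionLine.MatchString(chart) {
		return "", errors.New("no version line found")
	}
	var bumpErr error
	out := chartVersionLine.ReplaceAllStringFunc(chart, func(line string) string {
		m := chartVersionLine.FindStringSubmatch(line)
		next, err := bumpPatch(m[2])
		if err != nil {
			bumpErr = err
			return line
		}
		return strings.Replace(line, m[2], next, 1)
	})
	return out, bumpErr
}

func main() {
	out, err := bumpChart("name: bp-catalyst-platform\nversion: 1.4.8\nappVersion: \"1.4.8\"\n")
	fmt.Print(out)
	fmt.Println(err)
}
```

On a push conflict, the same function would be re-run against the `Chart.yaml` read from origin/main, which is what keeps concurrent runs strictly increasing.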
e3mrah	1d93b6c5af	feat(e2e): SME demo Playwright spec — full 6-step happy path (#805 ) (#823 ) Authors the load-bearing investor-demo proof artefact for the SME-tenant turnkey experience epic (#795). The spec walks the FULL happy path against the catalyst-ui SPA and emits 1440×900 screenshots at every assertion so the DoD checklist is satisfied with visual evidence rather than narrative. What landed: - products/catalyst/bootstrap/ui/e2e/sme-demo.spec.ts — single linear spec covering Step 1 (marketplace signup) → Step 2 (provisioning) → Step 3 (SME admin first login + dashboard) → Step 4 (create alice via unified-rbac with 3-step ADR-0003 hook progress) → Step 5a (alice on WordPress) → Steps 5b/5c/5d/6 fixme'd with TODO links to unblocking issues. - products/catalyst/bootstrap/ui/e2e/lib/config.ts — central registry of every URL, hostname, fixture user, and UUID the spec uses. Per feedback_never_hardcode_urls.md, no test inlines a hostname; every asserted host derives from OTECH_FQDN + SME_SLUG. - products/catalyst/bootstrap/ui/e2e/lib/sme-fixtures.ts — wire-shape- faithful page.route mocks for tenant discovery, /api/v1/whoami, /api/v1/sme/tenants, /api/v1/sme/users (CRUD), the deployment endpoints, app placeholders for WordPress/OpenClaw/webmail, and the /api/v1/sme/billing/ledger surface. Each helper is the seam between mock-mode (today) and live-mode (post-#804) so the spec opts out of any single mock by simply not calling that helper. - .github/workflows/sme-demo-e2e.yaml — push + PR + dispatch trigger that runs the spec against a freshly-installed dev tree with VITE_CATALYST_MODE=sovereign + VITE_SOVEREIGN_FQDN set so the SovereignConsoleLayout's auth gate has a non-null sovereignFQDN. Uploads the 805-* screenshot evidence as a 30-day artefact. 
Run today on a fresh checkout: cd products/catalyst/bootstrap/ui VITE_CATALYST_MODE=sovereign \ VITE_SOVEREIGN_FQDN=acme.otech.example \ npm run dev & PLAYWRIGHT_HOST=http://localhost:5173 \ npx playwright test e2e/sme-demo.spec.ts Result: 6 passed, 4 fixme (5b/5c/5d/6, all with TODO links to #804 / #798 / #802-followup). Live-mode follow-up (after #804 lands a fresh otech with the SME tenant pipeline wired): drop the mock installers from beforeEach and flip OTECH_FQDN/SME_SLUG via env. The spec stays — only the helper calls change. Per docs/INVIOLABLE-PRINCIPLES.md: #1 (waterfall): the canonical 6-step contract from #805 is asserted in this first cut, not staged across cycles. #2 (never compromise): every step that's deferred is fixme'd with a blocker link, never silently skipped. #4 (never hardcode): every URL routes through e2e/lib/config.ts. Refs: openova-io/openova#795, openova-io/openova#804, ADR-0003 Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>	2026-05-04 22:52:07 +04:00
e3mrah	9645a9044a	feat(metering): NewAPI NATS publisher + sme-billing subscriber + POST /metering/record (#798 ) (#818 ) * feat(metering): NewAPI NATS publisher + sme-billing subscriber + POST /metering/record (#798) Per #795 [Q-mine-3] (NATS not RedPanda) + [Q-mine-4] (one ledger), add the SME-2 metering integration end-to-end. NewAPI is consumed as the upstream image `ghcr.io/openova-io/openova/newapi-mirror` (a pinned mirror, not a fork) — the metering envelope is produced by a Go sidecar that observes the OpenAI-style `usage.total_tokens` field on every 2xx /v1/* response. This avoids forking the upstream binary while still producing the canonical envelope shape on `catalyst.usage.recorded`. A) NewAPI metering sidecar — core/services/metering-sidecar/ - Transparent reverse proxy in front of NewAPI on its own port; the bp-newapi Service routes the cluster-fronting port to the sidecar, which forwards to NewAPI on the pod's loopback. - Observes successful /v1/* JSON responses, parses `usage.{prompt_tokens,completion_tokens,total_tokens}`, computes amount_micro_omr = -tokens * priceMicroOMRPerToken, and publishes one envelope on `catalyst.usage.recorded` per completed request. - Failed (non-2xx), non-JSON, and admin-path requests are NOT billed. - Customer-facing latency is NEVER blocked on metering: the response body is restored before publish; on NATS unreachable the envelope is persisted to disk and retried by a background drain loop. - 14 unit tests (proxy + publisher + safeFilename guards). B) sme-billing NATS subscriber — core/services/billing/handlers/ metering_consumer.go - JetStream durable consumer `sme-billing-metering` on stream `CATALYST_USAGE` (provisioned by sme-billing on startup). - Idempotent on metadata.request_id via a UNIQUE partial index on credit_ledger.external_ref; redelivery from the broker collapses to a single ledger row. 
- Customer auto-create on cold start (the rbac sme.user.created envelope may land AFTER the first metered request; we don't strand usage waiting for it). - 11 unit tests covering happy-path, idempotency, malformed-payload poison-pill, missing-request-id, non-negative amount guard, resolver error → Nak, derive-micro-OMR-from-OMR, DB-error → Nak. C) HTTP handler POST /billing/metering/record — handlers/metering.go - Synchronous validate → INSERT credit_ledger → return {ledger_entry_id, balance_after_omr, balance_after_micro_omr, duplicate}. Same payload + idempotency guard as the NATS path. - Auth: superadmin OR sovereign-admin (operator-admin model; end-user LLM traffic flows through the sidecar, never this URL). - 8 unit tests covering happy-path, idempotency, role gating, malformed-JSON, positive-amount rejection, customer-not-found. D) Schema — core/services/billing/store/store.go - ALTER TABLE credit_ledger ADD COLUMN amount_micro_omr BIGINT (1 OMR = 1,000,000 micro-OMR; -0.000234 OMR = -234 micro-OMR exact integer — preserves precision at metering rates). - ADD COLUMN external_ref TEXT + UNIQUE partial index for idempotency dedup. - ADD COLUMN metadata JSONB for the raw envelope. - GetCreditBalance projects both amount_omr (legacy) and amount_micro_omr (new) into the integer-OMR view. - GetCreditBalanceMicroOMR returns canonical precision. - RecordUsage method: ON CONFLICT DO UPDATE … RETURNING (xmax<>0) distinguishes fresh insert from duplicate without a follow-up SELECT. E) Wiring - core/services/shared/events/nats.go — minimal NATS JetStream publisher + subscriber surface; legacy RedPanda producer/consumer in events.go untouched per [Q-mine-3]. - core/services/billing/main.go — NATS_URL env; subscriber wired in parallel with the existing RedPanda tenant-events consumer. - middleware/jwt.go — exported test helper WithClaims so handler tests can construct an authenticated context without minting a real signed token. 
- .github/workflows/services-build.yaml — metering-sidecar added to the build matrix; deploy job skips it (image consumed by the bp-newapi chart, not products/catalyst sme-services). F) bp-newapi chart (1.0.0 → 1.1.0) - meteringSidecar block in values.yaml: image, port, NATS URL, priceMicroOMRPerToken (default 156 = 0.000156 OMR/token), spool dir, header names, resources, securityContext (read-only-rootfs). - deployment.yaml renders the sidecar container + emptyDir spool volume when meteringSidecar.enabled (default true). - service.yaml routes the cluster-fronting :3000 to the sidecar when enabled, exposes a separate :3001 → NewAPI direct port for bp-catalyst-platform admin-API traffic (ADR-0003 §3.2). - networkpolicy.yaml allows the sidecar's port + nats-system egress for JetStream publish. Tests: 33 new (14 sidecar + 11 subscriber + 8 HTTP handler), all green. Helm template renders cleanly with sidecar enabled and disabled. Closes #798 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(billing/store): cast SUM to BIGINT so lib/pq scans into int64 (#798) Postgres returns `SUM(int) + SUM(bigint)/integer` as `numeric`, which lib/pq presents as a `[]uint8` decimal string ("50.000000000000000000000000") that does NOT scan directly into Go int64 — the integration test TestVoucherLifecycle_IssueRedeemAndCreditApplied caught this in CI on the post-redeem balance read. Wrap the SUM expressions in CAST(... AS BIGINT) so the column type is unambiguously bigint and Scan target stays uniform across pre-#798 rows (amount_omr only) and post-#798 rows (amount_micro_omr present). Affects: - GetCreditBalance - GetCreditBalanceMicroOMR - RecordUsage's running-balance read Test mocks updated to match the new SQL prefix. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: hatiyildiz <hatice.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-04 22:32:42 +04:00
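The metering arithmetic and the request_id idempotency from #798 can be sketched in a few lines. This is an in-memory illustration only: the `ledger` type stands in for the `credit_ledger` table and its unique index, and all type and function names are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// completion captures only the usage field the sidecar is described as
// observing on OpenAI-style responses.
type completion struct {
	Usage struct {
		TotalTokens int64 `json:"total_tokens"`
	} `json:"usage"`
}

// totalTokens extracts usage.total_tokens from a JSON response body.
func totalTokens(body []byte) (int64, error) {
	var c completion
	if err := json.Unmarshal(body, &c); err != nil {
		return 0, err
	}
	return c.Usage.TotalTokens, nil
}

// amountMicroOMR applies the formula from the commit message:
// amount_micro_omr = -tokens * priceMicroOMRPerToken.
// Integer micro-OMR keeps the value exact (1 OMR = 1,000,000 micro-OMR).
func amountMicroOMR(tokens, priceMicroOMRPerToken int64) int64 {
	return -tokens * priceMicroOMRPerToken
}

// ledger is an in-memory stand-in for the table plus its unique index on
// the external reference; it is NOT the real store.
type ledger struct {
	byRef   map[string]int64
	balance int64
}

func newLedger() *ledger { return &ledger{byRef: map[string]int64{}} }

// record applies an entry once per request id and reports whether the call
// was a duplicate (for example a broker redelivery).
func (l *ledger) record(requestID string, amount int64) (duplicate bool) {
	if _, seen := l.byRef[requestID]; seen {
		return true
	}
	l.byRef[requestID] = amount
	l.balance += amount
	return false
}

func main() {
	body := []byte(`{"usage":{"prompt_tokens":40,"completion_tokens":60,"total_tokens":100}}`)
	tokens, err := totalTokens(body)
	if err != nil {
		panic(err)
	}
	amount := amountMicroOMR(tokens, 156) // 156 is the chart's stated default price
	l := newLedger()
	fmt.Println(l.record("req-1", amount), l.record("req-1", amount), l.balance)
}
```

With the default price of 156 micro-OMR per token, 100 tokens debit 15,600 micro-OMR, and the redelivered envelope leaves the balance unchanged.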
e3mrah	93bd3ace5b	feat(bp-openclaw): workspace controller + per-user pod chart (#803 ) (#810 ) Implements locked decision [A] of epic #795: per-SME-tenant workspace controller deployment + per-user runtime pod, identity-blind by construction. Consumes the per-user newapi-key-{uuid} Secrets rendered by the unified-rbac user-create hook (ADR-0003 §3.3). What this delivers: - platform/openclaw/chart/ bp-openclaw v0.1.0 (no-upstream) - platform/openclaw/runtime/ Go reference runtime (NEWAPI_BASE_URL + NEWAPI_KEY env contract only) - .github/workflows/openclaw-runtime.yaml Event-driven build for the runtime image (paths-on-push + manual rerun; NO schedule:cron per CLAUDE.md). - platform/openclaw/blueprint.yaml Catalyst registration + configSchema. Chart highlights: - Required values guarded by _helpers.tpl :: assertRequired so missing realmURL/clientSecretName/tenant.namespace/baseURL/host fail render with helpful messages. - RBAC: namespaced Role in tenant ns; create verbs split into separate rules WITHOUT resourceNames per feedback_rbac_create_no_resourcenames.md. Label-based ownership (catalyst.openova.io/openclaw-user) enforced at the controller, not in RBAC. - ingress: cert-manager.io/cluster-issuer annotation triggers ACME auto-issuance for openclaw.<sme-domain>. - per-user pod template ConfigMap holds the pod-spec the controller renders per session, with ${USER_UUID}/${SECRET_NAME} placeholders filled at session-start. - networkPolicy covers controller pod only; per-user pod NetworkPolicy is rendered by the controller at session-start (target hostname is read from the per-user Secret which doesn't exist at chart-render time — documented in README.md). Tests: chart/tests/render-toggles.sh (7 cases) covers required-value enforcement, RBAC create+resourceNames violation guard, ServiceMonitor default-off, networkPolicy toggle, pod-template placeholder presence, cert-manager annotation. All seven gates pass locally. Closes part of #795 (epic still open). 
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-04 22:10:24 +04:00
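One way the `${USER_UUID}` / `${SECRET_NAME}` placeholders in the per-user pod template could be filled at session start is sketched below. This is not the controller's code: `renderPodTemplate` and the template fields are hypothetical, and only the `newapi-key-{uuid}` Secret naming comes from the commit message.

```go
package main

import (
	"fmt"
	"os"
)

// renderPodTemplate fills ${USER_UUID} and ${SECRET_NAME} in a pod-spec
// template. Any other placeholder is reported as an error rather than being
// silently replaced with an empty string.
func renderPodTemplate(tmpl, userUUID string) (string, error) {
	vars := map[string]string{
		"USER_UUID":   userUUID,
		"SECRET_NAME": "newapi-key-" + userUUID,
	}
	var unknown []string
	out := os.Expand(tmpl, func(name string) string {
		v, ok := vars[name]
		if !ok {
			unknown = append(unknown, name)
		}
		return v
	})
	if len(unknown) > 0 {
		return "", fmt.Errorf("unknown placeholders: %v", unknown)
	}
	return out, nil
}

func main() {
	// Field names in this template are illustrative only.
	tmpl := "name: openclaw-${USER_UUID}\nsecretName: ${SECRET_NAME}\n"
	out, err := renderPodTemplate(tmpl, "1234")
	fmt.Print(out)
	fmt.Println(err)
}
```

`os.Expand` also treats a bare `$NAME` as a placeholder, so a template that needs literal dollar signs would call for a stricter substitution.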
e3mrah	9adca8442a	fix: ci actions:write + auth-layout overflow scroll (#712 followup, #721 followup) (#728 ) Two unrelated production-bug fixes squashed because they came out of the same live verification pass on console.openova.io 2026-05-04. 1. catalyst-build.yaml deploy job permissions PR #720 added a `gh workflow run blueprint-release.yaml` dispatch step at the end of the deploy job to close the bot-deploy-doesn't- trigger-workflows gap from #712. Step has been failing on every run since with HTTP 403 "Resource not accessible by integration" because GITHUB_TOKEN lacks `actions: write` by default. Result: blueprint-release was never dispatched after PR #722–727 merged; the bp-catalyst-platform OCI artifact stayed on the pre-fix chart and any Sovereign provisioned afterwards picked up the buggy chart. Add the missing permission so dispatch succeeds. 2. AuthLayout.tsx vertical centering at small viewport heights The sign-in / verify cards were mathematically centered at 1440×900 (Δ=0.008px verified via getBoundingClientRect in Playwright) but founder reports the card sitting at the top of the screen on real-world viewports. Root cause: the right panel had `flex flex-1 items-center justify-center` which centers ONLY if the inner content fits within the viewport — at smaller heights the form's natural content flow pushed the card off-screen with no scroll fallback. Fix: add `items-stretch` to the outer flex (so the right panel fills full viewport height), `overflow-y-auto` on the right column (so the card can scroll inside its column when too tall), and `py-8` padding on the card wrapper (breathing room when scrolling kicks in). Result: card is vertically centered when content fits, and stays visible (column-scrollable) when it doesn't, on every viewport height from 1024×600 up. Co-authored-by: hatiyildiz <hatiyildiz@openova.io>	2026-05-04 12:44:44 +04:00
e3mrah	35183af5be	fix(ci): catalyst-build dispatches blueprint-release after deploy commit (closes #712) (#720) * feat(bp-catalyst-platform): expose marketplace + tenant wildcard, bump 1.3.0 (closes #710) Marketplace exposure for franchised Sovereigns. Otech becomes a SaaS operator with a single overlay toggle. Changes ======= products/catalyst/chart: - Chart.yaml 1.2.7 → 1.3.0 - values.yaml: ingress.marketplace.enabled toggle (default false) + marketplace.{brand,currency,paymentProvider,signupPolicy} surface - templates/sme-services/marketplace-routes.yaml: HTTPRoute marketplace.<sov> with /api/ → marketplace-api, /back-office/ → admin, / → marketplace; HTTPRoute *.<sov> → console (per-tenant wildcard) - templates/sme-services/marketplace-reference-grant.yaml: cross-namespace ReferenceGrant from catalyst-system HTTPRoute → sme Services - .helmignore: stop excluding sme-services/* and marketplace-api/* (only .kustomization.yaml + .ingress.yaml remain Kustomize-only) - All sme-services/* + marketplace-api/* manifests wrapped with {{ if .Values.ingress.marketplace.enabled }} so non-marketplace Sovereigns render the chart unchanged clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml: - chart version 1.2.7 → 1.3.0 - ingress.hosts.marketplace.host: marketplace.${SOVEREIGN_FQDN} - ingress.marketplace.enabled: ${MARKETPLACE_ENABLED:-false} infra/hetzner: - variables.tf: marketplace_enabled var (string "true"/"false", default "false") - main.tf: thread var into cloudinit-control-plane.tftpl - cloudinit-control-plane.tftpl: postBuild.substitute.MARKETPLACE_ENABLED on bootstrap-kit, sovereign-tls, infrastructure-config Kustomizations products/catalyst/bootstrap/api/internal/provisioner/provisioner.go: - Request.MarketplaceEnabled bool (json:"marketplaceEnabled") - writeTfvars: marketplace_enabled = "true"|"false" core/pool-domain-manager/internal/allocator/allocator.go: - canonicalRecordSet adds "marketplace" prefix → marketplace.<sov> resolves via PDM at
zone-commit time (PR #710 explicit record so caches don't depend on the .<sov> wildcard alone) DoD ready ========= - helm template with ingress.marketplace.enabled=false → identical manifest set to 1.2.7 (verified locally) - helm template with ingress.marketplace.enabled=true → emits 17 extra resources: 13 sme-services workloads + 2 marketplace-api + 1 HTTPRoute pair + 1 ReferenceGrant - pdm tests: TestCanonicalRecordSet, TestCommitDNSShape green - catalyst-api builds, provisioner cloudinit_path_test green fix(ci): catalyst-build dispatches blueprint-release after deploy commit (closes #712) The deploy job's `git push` is made under GITHUB_TOKEN; per GitHub Actions design, commits authored by GITHUB_TOKEN don't re-trigger workflows. blueprint-release.yaml's `on.push.paths: products//chart/*` filter matches the deploy commit's diff (chart/values.yaml + chart/templates/{api,ui}-deployment.yaml), so the workflow SHOULD fire, but doesn't — leaving the bp-catalyst-platform:1.2.7 OCI artifact stuck on whatever catalyst-api SHA was current at the last manual chart- touching PR. Today (2026-05-03) this stranded otech62-otech66 on catalyst-api:74d08eb six PRs after the SHA was superseded — every fresh Sovereign installed the buggy pre-#701 image and rejected handover with 401 unauthenticated. Fix: after `git push` succeeds in the deploy job, dispatch blueprint-release explicitly via `gh workflow run`. The dispatched run re-renders + re-publishes the chart with the just-pushed values.yaml. Closes #712. --------- Co-authored-by: hatiyildiz <hatiyildiz@openova.io>	2026-05-04 07:49:03 +04:00
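The `marketplaceEnabled` threading from the request JSON into the tfvars string can be sketched as below. Only the JSON tag and the `marketplace_enabled = "true"|"false"` output shape come from the commit message; the `request` type and the `marketplaceTfvar` function are simplified, hypothetical stand-ins for the provisioner code.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
)

// request mirrors only the one field named in the commit message.
type request struct {
	MarketplaceEnabled bool `json:"marketplaceEnabled"`
}

// marketplaceTfvar renders the flag as a quoted string, because the Tofu
// variable is described as a string "true"/"false" rather than a bool.
func marketplaceTfvar(enabled bool) string {
	return fmt.Sprintf("marketplace_enabled = %q\n", strconv.FormatBool(enabled))
}

// tfvarFromJSON decodes a request body and returns the tfvars line. A
// missing field decodes to false, matching the stated default.
func tfvarFromJSON(body []byte) (string, error) {
	var r request
	if err := json.Unmarshal(body, &r); err != nil {
		return "", err
	}
	return marketplaceTfvar(r.MarketplaceEnabled), nil
}

func main() {
	line, err := tfvarFromJSON([]byte(`{"marketplaceEnabled":true}`))
	fmt.Print(line)
	fmt.Println(err)
}
```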
e3mrah	b5c9839da7	feat(phase-8b): sovereign wizard auth-gate + handover JWT minting + Playwright CI fixes (#611 ) Squash of PR #611 (feat/607) + PR #615 (feat/605) Phase-8b deliverables: UI: - AuthCallbackPage: mode-aware dispatch (catalyst-zero → magic-link server callback; sovereign → client-side OIDC token exchange via oidc.ts) - Router: sovereign console routes (/console/), DETECTED_MODE index redirect, authCallbackRoute dedup fix, authHandoverRoute safety net - StepSuccess: mints RS256 handover JWT via POST /deployments/{id}/mint-handover-token before redirecting operator to Sovereign console (falls back to plain URL on error) API: - main.go: wires handoverjwt.LoadOrGenerate signer from CATALYST_HANDOVER_KEY_PATH env - deployments.go: stamps HandoverJWTPublicKey from signer.PublicJWK() at create time - provisioner.go: injects HandoverJWTPublicKey into Tofu vars JSON - auth.go: /auth/handover endpoint for seamless single-identity flow Infra: - cloudinit-control-plane.tftpl: writes handover JWT public JWK to /var/lib/catalyst/ - variables.tf: handover_jwt_public_key variable (sensitive, default empty) Chart: - api-deployment.yaml / ui-deployment.yaml / values.yaml: expose handover JWT env vars Playwright CI fixes: - playwright-smoke.yaml / cosmetic-guards.yaml: health-check URL /sovereign/wizard → /wizard - playwright.config.ts: BASEPATH default /sovereign → / + baseURL construction fix - cosmetic-guards.spec.ts: provision URL /sovereign/provision/ → /provision/* - sovereign-wizard.spec.ts: WIZARD_URL /sovereign/wizard → /wizard Closes #605, #606, #607. Fixes Playwright CI (#142 sovereign wizard smoke tests). Co-authored-by: e3mrah <e3mrah@openova.io>	2026-05-02 19:17:56 +04:00
e3mrah	10c8e997c4	fix(catalyst): restore literal image refs in Kustomize-path deployment YAMLs (#614 ) The feat/global-imageRegistry (#580) PR converted the literal image refs in api-deployment.yaml and ui-deployment.yaml to Helm template expressions ({{ .Values.global.imageRegistry }}...) without updating the CI deploy step to also patch those files. Since the catalyst-platform Flux Kustomization reads these files as raw manifests (not via helm-controller), the Helm template syntax was never rendered, leaving a literal '{{ if ... }}' string as the image reference → InvalidImageName on every Pod start. Root cause: two consumers of the same file — Helm chart path (Sovereign clusters) and Kustomize path (contabo-mkt) — but only the Helm path was handled by the deploy job. Fix: - Restore literal `ghcr.io/openova-io/openova/catalyst-{api,ui}:b50a600` image refs in the Kustomize-path deployment YAMLs (immediate unblock). - Update CI deploy step to sed-patch those literal refs on every deploy commit so future image rolls keep both paths in sync (durable fix). Closes: the InvalidImageName regression introduced in #580. Unblocks: issue #608 (Phase-8b Agent A magic-link auth) — catalyst-api was stuck at InvalidImageName since commit `83ec889f`, preventing the CATALYST_KC_ADDR / session-cookie auth gate from loading. Co-authored-by: alierenbaysal <alierenbaysal@openova.io> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-02 18:29:09 +04:00
hatiyildiz	59fb2b742c	fix(ci): use awk instead of python heredoc in deploy — fixes YAML parse error	2026-05-02 13:48:17 +02:00
hatiyildiz	885e032dc5	fix(ci): deploy job updates values.yaml SHA tags, not Helm template files The previous sed targeted ui-deployment.yaml + api-deployment.yaml for `image: ghcr.io/.../catalyst-ui:.*` but those files use Helm template expressions (`{{ .Values.images.catalystUi.tag }}`), so sed silently no-ops. Result: every catalyst build committed "No changes" and the deployed image was never updated. Fix: switch deploy job to update images.catalystUi.tag and images.catalystApi.tag in products/catalyst/chart/values.yaml via python3 regex (handles multiline YAML reliably). Also bump catalystUi + catalystApi tags to `32c5e43` (the build from #596 / PR #599 — Vite base: '/' fix). Fixes #596 deploy path.	2026-05-02 13:46:03 +02:00
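The values.yaml tag bump described in the commit above can be sketched roughly as follows. This is a minimal illustration, not the workflow's actual script: the function name `bump_tag` and the assumed YAML layout (a `catalystUi:` key with an indented `tag:` line beneath it) are hypothetical.

```python
import re


def bump_tag(values_text: str, key: str, new_tag: str) -> str:
    """Rewrite the `tag:` line nested under `key:` (assumed layout, see lead-in)."""
    lines = values_text.splitlines(keepends=True)
    key_indent = None
    for i, line in enumerate(lines):
        stripped = line.strip()
        indent = len(line) - len(line.lstrip(" "))
        if key_indent is None:
            # Still searching for the key line itself.
            if stripped == f"{key}:":
                key_indent = indent
            continue
        if stripped and indent <= key_indent:
            break  # left the key's block without seeing a tag line
        match = re.match(r"^(\s*tag:\s*).*$", line.rstrip("\n"))
        if match:
            lines[i] = f'{match.group(1)}"{new_tag}"\n'
            return "".join(lines)
    # Fail loudly instead of silently committing "No changes".
    raise ValueError(f"no tag found under {key!r}")


sample = 'images:\n  catalystUi:\n    tag: "old"\n  catalystApi:\n    tag: "keep"\n'
print(bump_tag(sample, "catalystUi", "32c5e43"))
```

Raising when nothing matches is the point of the sketch: the failure mode the commit describes was a substitution that matched nothing and reported success.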
e3mrah	942be6f58d	fix(ci): disable buildx provenance+sbom attestation in dynadot-webhook build (#583) containerd 1.7.x on k3s cannot pull multi-arch images whose OCI index includes an attestation manifest (the unknown/unknown platform entry added by docker/build-push-action when provenance=true). Containerd resolves the manifest index, encounters the attestation entry, fetches its descriptor from GHCR which returns an HTML 404 page, and then caches that HTML page as a blob SHA — every subsequent pull of ANY tag for that image returns the same HTML SHA instead of the real layer. Fix: set provenance=false + sbom=false on the build-push-action step. SBOM attestation is handled separately by cosign attest, which does not embed its manifest into the OCI index. Co-authored-by: hatiyildiz <hatiyildiz@openova.io> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-02 14:29:58 +04:00
e3mrah	52c6938e02	ci(catalyst-build): watch infra/hetzner/ so cloudinit changes rebuild catalyst-api (#472) Phase-8a-preflight bug #2 (after #471's tftpl escape fix): catalyst-api Docker image bakes /infra/hetzner/cloudinit-control-plane.tftpl. Without this path in the build trigger, fixes to that file do NOT rebuild the image — the running pod keeps using the stale tftpl and provisioning keeps failing with the same Tofu error. Per CLAUDE.md Rule 4a (GitHub Actions is the only build path), the path filter MUST cover every directory the image depends on. Missing infra/hetzner/ was a long-standing latent CI bug — surfaced by Phase-8a #454 first live provision attempt. Co-authored-by: hatiyildiz <hatiyildiz@noreply.github.com>	2026-05-01 20:34:13 +04:00
e3mrah	1628a1b3aa	ci(preflight): GHCR auth for A+E + WBS tick — all 4 preflights done (#470 ) First runs of preflight A (bootstrap-kit) and E (Keycloak) failed with the same error: helm OCI pull from ghcr.io/openova-io/bp-* returning 401 'unauthorized: authentication required'. bp-* are PRIVATE GHCR packages. #460's agent fixed it for B in c26fbcaf. #461's already had GHCR login. This commit applies the same helm-registry-login pattern to A and E. WBS state on main after this commit: - done (35): all chart-level + #317 + #319 + #453 + 4 preflights - wip (0) - blocked (3): 454, 455, 456 (Phase-8 live runs, operator-driven) The preflights' first runs ALREADY surfaced a real CI bug pattern that would have hit Phase 8a — exactly what they're for. Co-authored-by: hatiyildiz <hatiyildiz@noreply.github.com>	2026-05-01 20:06:36 +04:00
e3mrah	4a7eb42d26	feat(ci): Phase-8a preflight E — Keycloak realm-import + kubectl OIDC client (closes #462 ) (#468 ) Surfaces Risk R6 (docs/omantel-handover-wbs.md §9a — Keycloak realm-import config-CLI bootstrap timing untested). bp-keycloak 1.2.0 ships a sovereign realm + a public kubectl OIDC client via the upstream bitnami/keycloak chart's keycloakConfigCli post-install Helm hook (issue #326); this workflow proves it actually wires up on a clean cluster before we run it on a real Sovereign. Workflow installs bp-keycloak 1.2.0 on a kind cluster (helm/kind-action v1, kindest/node:v1.30.6 — same versions as test-bootstrap-kit), waits for the keycloak StatefulSet to roll out, polls for the keycloakConfigCli post-install Job by label (app.kubernetes.io/component=keycloak-config-cli), waits for it to Complete, port-forwards svc/keycloak and asserts: 1. /realms/sovereign returns 200 (realm exists in Keycloak's DB). 2. The kubectl OIDC client is provisioned with publicClient=true, redirectUris contains http://localhost:8000 (kubectl-oidc-login default), and the groups client scope is wired with the oidc-group-membership-mapper (the per-Sovereign k3s api-server's --oidc-groups-claim flag depends on this). Acceptance per ticket: if the post-install Job fails, the workflow summary captures Job logs + StatefulSet logs + cluster state via GITHUB_STEP_SUMMARY so a failed run is debuggable without re-running. Triggers are event-driven only per CLAUDE.md "every workflow MUST be event-driven, NEVER scheduled" rule — push on the workflow file itself plus workflow_dispatch for ad-hoc re-runs. Closes #462. Co-authored-by: hatiyildiz <hatiyildiz@noreply.github.com>	2026-05-01 20:01:30 +04:00
e3mrah	abac00d8b3	feat(ci): Phase-8a preflight A — bootstrap-kit reconcile dry-run on kind (closes #459 ) (#467 ) Surfaces Risk-register R4 (docs/omantel-handover-wbs.md §9a — bootstrap-kit reconcile-chain order untested under load) before Phase 8a (#454) burns Hetzner credit on test.omani.works. New workflow .github/workflows/preflight-bootstrap-kit.yaml: - kind v0.25.0 + kindest/node:v1.30.6 - Gateway API CRDs v1.2.0 standard channel - Full Flux controller set (fluxcd/flux2/action@main + flux install) - Mock Secrets: flux-system/object-storage, flux-system/cloud-credentials, flux-system/ghcr-pull - Renders clusters/_template/bootstrap-kit/ with SOVEREIGN_FQDN_PLACEHOLDER + ${SOVEREIGN_FQDN} -> test-sov.example.com (matches test harness pattern in tests/e2e/bootstrap-kit/main_test.go:247) - 30 x 30s HR poll loop, never-fail-fast (goal: surface ALL bugs, not stop at first) - $GITHUB_STEP_SUMMARY emits Markdown table of every HR's terminal Ready condition + per-HR describe blocks for non-Ready + recent flux-system events + raw hrs.json artefact (14d retention) - Event-driven only: push on self-edit + workflow_dispatch; no schedule: cron (per CLAUDE.md "every workflow MUST be event-driven") Canonical seam reused (no duplication): - kind setup + flux install pattern from .github/workflows/test-bootstrap-kit.yaml - bootstrap-kit kustomization at clusters/_template/bootstrap-kit/ (the same overlay production Sovereigns consume; substitution shape mirrors tests/e2e/bootstrap-kit/main_test.go:247) - event-driven shape per .github/workflows/check-vendor-coupling.yaml (#428) Out of scope (sibling preflights): - #460 Crossplane provider-hcloud Healthy probe - #461 Cilium Gateway HTTPRoute admission - #462 Keycloak realm-import Validated: actionlint clean, YAML parses cleanly. WBS row #459 in §9 updated: 🟡 in flight -> 🟢 done (workflow shipped). Co-authored-by: hatiyildiz <hatiyildiz@noreply.github.com>	2026-05-01 20:01:26 +04:00
e3mrah	6f9ee43a9d	fix(ci): GHCR auth for bp-crossplane OCI pull in preflight (#460) (#466) Run 25221515110 surfaced the exact blocking error the workflow was designed to surface — but for the install step, not the Healthy probe: Error: INSTALLATION FAILED: failed to perform "FetchReference" on source: GET "https://ghcr.io/v2/openova-io/bp-crossplane/manifests/1.1.3": ... 401: unauthorized: authentication required bp-crossplane is a PRIVATE GHCR package (verified via `gh api /orgs/openova-io/packages/container/bp-crossplane`). The fix mirrors the canonical seam in .github/workflows/blueprint-release.yaml: add `packages: read` to the job permissions and run `helm registry login ghcr.io` against GITHUB_TOKEN before the `helm install oci://...` step. No new pattern; just reuse. This unblocks the actual goal of #460 — observing provider-hcloud Healthy=True (or surfacing whatever blocks it) on a kind cluster. Co-authored-by: hatiyildiz <hatiyildiz@noreply.github.com>	2026-05-01 20:01:15 +04:00
e3mrah	48b73af6ae	feat(ci): Phase-8a preflight C — Cilium Gateway HTTPRoute admission on kind (closes #461 ) (#465 ) Surfaces Risk-register R3 (docs/omantel-handover-wbs.md §9a) — Cilium Gateway HTTPRoute admission was untested on contabo because contabo runs Traefik (no `cilium-gateway` Gateway present per ADR-0001 §9.4). This workflow boots a kind cluster, installs upstream Cilium 1.16.5 with `gatewayAPI.enabled=true`, applies the per-Sovereign Gateway shape from `clusters/_template/bootstrap-kit/01-cilium.yaml` (HTTP listener only — TLS is Phase 8a), pulls bp-catalyst-platform:1.1.8 from GHCR, renders its httproute.yaml template with sovereign overlay values, and asserts that `catalyst-ui` and `catalyst-api` HTTPRoutes both reach Accepted=True against the Cilium Gateway. Anti-duplication: GHCR helm-registry-login mirrors blueprint-release .yaml (lines 173-177); kind+Cilium pattern matches playwright-smoke shape; per-Sovereign Gateway is a 1:1 mirror of the canonical bootstrap-kit slot 01 (HTTP listener), no new shape invented. Trigger pattern is event-driven per CLAUDE.md: push on this file or the chart templates it validates, plus workflow_dispatch for re-runs. No cron. Out of scope (Phase 8a/8b): TLS termination, real DNS resolution, backend Deployment health, the 10 leaf bp-* dependencies (which have their own chart-verify smoke runs). Co-authored-by: hatiyildiz <hatiyildiz@noreply.github.com>	2026-05-01 20:01:01 +04:00
e3mrah	48a1623b28	feat(ci): Phase-8a preflight B — Crossplane provider-hcloud Healthy on kind (closes #460 ) (#463 ) Surfaces Risk-register R2 (docs/omantel-handover-wbs.md §9a — provider-hcloud Healthy=True never observed). New workflow spins up kind, installs bp-crossplane 1.1.3 from GHCR, applies the EXACT Provider + ProviderConfig shape from infra/hetzner/cloudinit-control-plane.tftpl (#425), waits up to 5 min for Healthy=True, plants a fake hcloud-token Secret in flux-system to match the canonical secretRef, and asserts the ProviderConfig is accepted by the API. Reuses existing seams: - helm/kind-action@v1 pattern from .github/workflows/test-bootstrap-kit.yaml - event-driven trigger shape from .github/workflows/check-vendor-coupling.yaml - canonical Provider/ProviderConfig YAML from infra/hetzner/cloudinit-control-plane.tftpl No schedule: cron (per CLAUDE.md "every workflow MUST be event-driven"). No live Hetzner calls — fake-readonly-token only; real-credential validation is Phase 8a, not this preflight. Co-authored-by: hatiyildiz <hatiyildiz@noreply.github.com>	2026-05-01 19:58:32 +04:00
e3mrah	1e7d1e67c9	test(e2e): omantel handover Playwright scaffold for Phase 8 (closes #429 ) (#432 ) Phase 8 of the omantel handover (#369) needs an automated E2E that proves DoD: omantel.omani.works runs as a fully self-sufficient Sovereign with zero contabo dependency post-handover. Today this is a SCAFFOLD — when Phase 4/6/7 land, dispatching the new workflow against a live omantel is the entire Phase 8. Canonical seam (anti-duplication, per memory/feedback_anti_duplication_seam_first.md): - tests/e2e/playwright/tests/ ← mirror of sovereign-wizard.spec.ts shape (NOT specs/ as the issue body said — actual repo path is tests/) - tests/e2e/playwright/playwright.config.ts (BASE_URL handling, retries, workers=1, reporter=list) — reused as-is - tests/e2e/playwright/tests/_helpers.ts:reachable() — reused for the pre-flight skip-when-unreachable pattern - .github/workflows/playwright-smoke.yaml — workflow shape (checkout v4, setup-node v4, npm install, playwright install --with-deps chromium, upload-artifact on failure) — mirrored, NOT duplicated What ships: - tests/e2e/playwright/tests/omantel-handover.spec.ts (NEW, 6 tests): 1. sovereign Ready + 23/23 blueprints 2. all bp-* HelmReleases Ready=True 3. catalyst-platform self-hosts (healthz + dashboard "23 / 23 ready") 4. vendor-agnostic Object Storage (post-#425 canonical secret name flux-system/object-storage — NOT hetzner-object-storage) 5. dig +trace omantel.omani.works ends at omantel NS, not contabo 6. zero contabo dependency (omantel /api/healthz keeps returning 200) Self-skips when OMANTEL_BASE_URL/OMANTEL_API_BASE/OPERATOR_BEARER unset. - .github/workflows/omantel-e2e-handover.yaml (NEW): workflow_dispatch ONLY (no schedule cron — per CLAUDE.md "every workflow MUST be event-driven, NEVER scheduled"). Inputs let the operator override base URLs at dispatch time. 
- docs/omantel-handover-wbs.md: new §10 "Phase 8 acceptance criteria (executable DoD)" — 6 bullets 1:1 with the spec test() blocks; §9 status row added for #429 (🟢 scaffold-shipped). Local verification: cd tests/e2e/playwright && npm install && \ npx playwright test --list tests/omantel-handover.spec.ts → 6 tests listed cleanly npx playwright test tests/omantel-handover.spec.ts → 6 skipped (env vars unset, expected) Out of scope (per #425 / #428 territory split): - internal/hetzner/, infra/hetzner/, platform/velero/chart/, clusters/.../34-velero.yaml — #425's vendor-agnostic sweep - .github/workflows/check-vendor-coupling.yaml — #428's coupling guard Co-authored-by: hatiyildiz <hatiyildiz@noreply.github.com>	2026-05-01 17:52:18 +04:00
e3mrah	0fdd411e79	ci(guardrail): vendor-coupling check - fail CI if chart values use vendor name (closes #428 ) (#431 ) Adds scripts/check-vendor-coupling.sh + .github/workflows/check-vendor-coupling.yaml that scan platform/, clusters/, products/catalyst/bootstrap/{api,ui} for vendor names (hetzner\|aws\|gcp\|azure\|oci) appearing in capability-named slots: 1. <vendor>-object-storage (sealed-secret / overlay-secret name) 2. <chart>Overlay\.<vendor>\. (chart values block keyed to vendor) 3. <vendor>ObjectStorage (camelCase payload field) Excludes legitimately-per-provider paths (infra/<provider>/, internal/<provider>/, internal/objectstorage/<provider>/, core/pkg/<provider>/), Crossplane Provider CR refs (lines containing "crossplane-contrib/provider-"), and *.md files (docs may discuss the rule). Mode gate: warn-only while internal/objectstorage/ does not exist (pre-#425 work-in-progress); hard-fail once that directory lands. Locally on this branch the script emits 49 warnings to stderr and exits 0 against the existing hetzner-coupled references in platform/velero, platform/seaweedfs, and clusters/.../bootstrap-kit/34-velero.yaml; once #425's rename lands those warnings disappear and any future re-introduction fails CI. Workflow trigger surface: push-to-main + pull_request on the scanned paths + workflow_dispatch. No schedule: cron per CLAUDE.md "every workflow MUST be event-driven, NEVER scheduled". Canonical seam used: scripts/ + .github/workflows/ (mirrors scripts/check-bootstrap-deps.sh + .github/workflows/blueprint-release.yaml shape). NOT a duplicate - no prior vendor-coupling guard existed. Refs: docs/omantel-handover-wbs.md §3a (canonical-seam map) docs/INVIOLABLE-PRINCIPLES.md #4 (never hardcode) Co-authored-by: hatiyildiz <hatiyildiz@noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 17:49:49 +04:00
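The three capability-slot patterns listed in the commit above could be expressed roughly like this. It is a hedged sketch and not the contents of scripts/check-vendor-coupling.sh: the names `VENDORS`, `PATTERNS`, and `scan` are hypothetical, and the path-based exclusions and the warn-only mode gate are left out.

```python
import re

# Vendor list as given in the commit message.
VENDORS = r"(?:hetzner|aws|gcp|azure|oci)"

PATTERNS = [
    re.compile(rf"\b{VENDORS}-object-storage\b"),  # 1. secret name
    re.compile(rf"Overlay\.{VENDORS}\."),          # 2. chart values block keyed to vendor
    re.compile(rf"\b{VENDORS}ObjectStorage\b"),    # 3. camelCase payload field
]

# Lines referencing Crossplane Provider packages are exempt per the commit.
EXEMPT_MARKER = "crossplane-contrib/provider-"


def scan(lines):
    """Return (line_number, text) for every line that trips a pattern."""
    hits = []
    for number, text in enumerate(lines, start=1):
        if EXEMPT_MARKER in text:
            continue
        if any(pattern.search(text) for pattern in PATTERNS):
            hits.append((number, text))
    return hits


example = [
    "secretName: hetzner-object-storage",
    "secretName: object-storage",
    "package: xpkg.upbound.io/crossplane-contrib/provider-hetzner-object-storage",
    "field: awsObjectStorage",
    "veleroOverlay.hetzner.bucket: x",
]
for number, text in scan(example):
    print(f"{number}: {text}")
```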
e3mrah	956b976558	fix(ci): playwright-smoke port 4321→5173 for Vite 8 default (#335 ) (#418 ) The catalyst-ui dev-server bind moved from 4321 to 5173 when Vite default changed (Vite 8). The smoke workflow's curl-wait + BASE_URL env still pointed at 4321, so: Vite 8 starts fine on 5173 → workflow polls 4321 for 60s → never returns 200 → step exits 1 before Playwright ever runs. Effect across last ~30 main commits: every push generated a 'Playwright UI smoke failed' email despite the UI itself being healthy. We've been shipping with --admin bypass + post-deploy verification against console.openova.io. This restores actual smoke coverage on every PR. Three substitutions on .github/workflows/playwright-smoke.yaml: - line 80 curl wait URL: localhost:4321 → localhost:5173 - line 93 BASE_URL env: 4321 → 5173 - line 72-73 comment: stale 'Vite binds 4321 by default' → 5173 Closes #335. Co-authored-by: hatiyildiz <hati@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 17:04:11 +04:00
e3mrah	4d24914ae4	feat(wipe): deployment-level Cancel & Wipe — backend endpoint + Cloud-Architecture + wizard banner entry-points (closes #318 ) (#346 ) * feat(wipe): deployment-level Cancel & Wipe — backend endpoint + Cloud-Architecture + wizard banner entry-points (closes #318) Adds a first-class Phase-0 recovery surface so an operator can purge a failed pre-handover deployment from the wizard UI without dropping to hcloud CLI runbooks. Two entry-points, one canonical implementation. ## Backend NEW: products/catalyst/bootstrap/api/internal/handler/wipe.go POST /api/v1/deployments/{id}/wipe — single-flight destructive op: 1. tofu destroy against the per-deployment workdir (idempotent). 2. Hetzner orphan force-purge by label-selector `catalyst-deployment-id=<id>` (servers, load balancers, networks, firewalls, ssh-keys). Belt-and-braces — catches resources tofu didn't track (half-failed cloud-init, manual experiments). Per docs/INVIOLABLE-PRINCIPLES.md #3 this direct API path is fallback ONLY for orphan cleanup, never new resource creation. 3. PDM /v1/release for pool-subdomain Sovereigns (best-effort). 4. Local cleanup: kubeconfig file (mode 0600), tofu workdir, on-disk deployment record JSON. 5. SSE events stream throughout on the same channel as the original provisioning + Phase-1 watch. 6. Marks Status="wiped"; sync.Map entry reaped after a 60s TTL. NEW: products/catalyst/bootstrap/api/internal/hetzner/purge.go Hetzner Cloud API enumeration + force-delete by label selector. Uses a 60s timeout (vs the 10s ValidateToken default) because async server-delete jobs can queue. 404s treated as success (already gone). NEW: products/catalyst/bootstrap/api/internal/provisioner/provisioner.go Provisioner.Destroy() — runs `tofu destroy -auto-approve` against the per-deployment workdir, then removes the workdir on success so re-provisioning starts fresh. Re-stages module + tfvars first so a partially-cleaned workdir still has what tofu needs. 
TOUCHED: products/catalyst/bootstrap/api/cmd/api/main.go Registers POST /api/v1/deployments/{id}/wipe. ## Frontend (aligned with existing CrudModals conventions per founder ## directive — no ad-hoc surface) NEW: products/catalyst/bootstrap/ui/src/components/CrudModals/WipeDeploymentModal.tsx Two-stage modal built on the canonical ModalShell. Pre-wipe confirm view requires the operator to: - Type the sovereign FQDN to confirm scope. - Re-paste their Hetzner Cloud API token (catalyst-api intentionally GCs the original after writeTfvars per credential hygiene). Post-wipe success view shows the PurgeReport (servers, lbs, networks, firewalls, ssh-keys removed; tofu/PDM/local-state ✓/✗) and a "Start fresh deployment" CTA that nav's to /sovereign. TOUCHED: products/catalyst/bootstrap/ui/src/components/CrudModals/index.ts Re-exports WipeDeploymentModal + WipeReport. TOUCHED: products/catalyst/bootstrap/ui/src/pages/sovereign/AppsPage.tsx FailureCard now exposes a "Cancel & Wipe" red button next to "Retry stream" / "Back to wizard" — opens WipeDeploymentModal. TOUCHED: products/catalyst/bootstrap/ui/src/pages/sovereign/InfrastructureTopology.tsx Cloud → Architecture canvas: the `cloud` (root) node action menu gains "Cancel & Wipe deployment" as a `danger:true` action, alongside the existing "+ Add region". Distinct from the per-resource DeleteCascadeConfirm on region/cluster/vCluster — this is deployment-scope (Phase-0 orphan purge), the others are Crossplane-XRC scope (day-2). The two paths coexist; operators choose by what state the deployment is in. ## Why two entry-points Wizard banner (failed state on AppsPage) — recovery from a known failure. Already a red-banner page; the button is right there. Cloud → Architecture cloud-node action — proactive cancel from the canvas, mirrors how the existing per-resource deletes are reachable. Same modal, same backend. 
## Constraints honoured - Per docs/INVIOLABLE-PRINCIPLES.md #3 (Crossplane is the ONLY day-2 IaC): the per-resource DELETE handler at infrastructure.go is unchanged and continues to flip XRC deletionPolicy. Wipe operates ONLY in Phase-0 scope where Crossplane never adopted resources. - Per #4 (never hardcode): every endpoint lives behind API_BASE; the Hetzner purge enumerates by deterministic label selector built from var.sovereign_fqdn (the OpenTofu module's existing tagging convention). - Per credential hygiene: the Hetzner token is re-prompted at wipe time rather than persisted; the modal uses an <input type="password">. ## Refs #318 — pre-handover wipe spec (this PR closes it) #317 — handover finalisation (sibling; this PR is the failure-path complement) feedback_idempotent_iac_purge.md — operator runbook this implements PR #313 — sealed-secrets cleanup (independent; safe to land in any order) PR #334 — bp-external-secrets split (independent) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(ci): catalyst-build event-driven only — drop cron, push-on-main with path filter Per docs/INVIOLABLE-PRINCIPLES.md (event-driven end to end — Flux dependsOn, NATS JetStream, SSE, Helm hooks), GitHub Actions must follow the same model. The previous `schedule: cron 0 3 * * ` daily build was the only canonical deploy path, which created a 24h roll latency on every change to the catalyst surface and incentivised "wait for cron" stalls in operator workflows. Replaces with: on: push: branches: [main] paths: - 'core/console/' - 'core/admin/' - 'core/marketplace/' - 'core/marketplace-api/' - 'products/catalyst/bootstrap/' - 'products/catalyst/chart/*' - '.github/workflows/catalyst-build.yaml' workflow_dispatch: `workflow_dispatch` retained for ad-hoc re-runs (config-only changes that bypass the path filter, e.g. a secret rotation that doesn't touch code). Path filter mirrors the actual surface this workflow rebuilds. 
After this lands, every merge to main that touches the catalyst surface auto-deploys. No cron lag. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Hatice Yildiz <hatice.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 09:24:40 +04:00
e3mrah	2de8bb68b9	fix(ci): bump helm 3.16.3 → 3.18.4 in blueprint-release — fixes seaweedfs smoke-render (#336) 'function fromToml not defined' error on bp-seaweedfs publish. Upstream seaweedfs/seaweedfs 4.22.0 (templates/shared/security-configmap.yaml:21) uses fromToml which exists in 3.13+ but the rendered context in the smoke step needs newer Sprig functions present in 3.18+. Bump unblocks the chain of HRs (bp-loki, bp-mimir, bp-tempo, bp-velero, bp-harbor, bp-grafana) all blocked on bp-seaweedfs publish. Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>	2026-04-30 23:27:45 +04:00
e3mrah	5502d9aa48	feat(dns): cert-manager-dynadot-webhook for DNS-01 wildcard TLS (closes #159 ) (#291 ) Activates the previously-templated `letsencrypt-dns01-prod` ClusterIssuer in bp-cert-manager by shipping the missing piece — a Go binary that satisfies cert-manager's external webhook contract (`webhook.acme.cert-manager.io/v1alpha1`) against the Dynadot api3.json. Architecture ============ * `core/pkg/dynadot-client/` — canonical Dynadot HTTP client (shared with pool-domain-manager and catalyst-dns). Encapsulates the api3.json transport, command builders, response decoding, and the safe read-modify-write semantics required to never accidentally wipe a zone (memory: feedback_dynadot_dns.md). Destructive `set_dns2` variant is unexported. * `core/cmd/cert-manager-dynadot-webhook/` — the cert-manager webhook binary. Implements `Solver.Present` via the client's append-only `AddRecord` path and `Solver.CleanUp` via the read-modify-write `RemoveSubRecord` path. Domain allowlist (`DYNADOT_MANAGED_DOMAINS`) rejects challenges for unmanaged apexes BEFORE any Dynadot call. * `platform/cert-manager-dynadot-webhook/` — Catalyst-authored Helm wrapper. Templates Deployment + Service + APIService + serving Certificate (CA chain via cert-manager Issuer self-signing) + RBAC + ServiceAccount. Mirrors the standard cert-manager external- webhook deployment shape. * `platform/cert-manager/chart/` — flips `dns01.enabled: true` so the paired ClusterIssuer activates. The interim http01 issuer remains templated as the rollback path. Test results ============ core/pkg/dynadot-client — 7 tests PASS (race-clean) core/cmd/cert-manager-dynadot-... 
— 9 tests PASS (race-clean) Test coverage includes a Present/CleanUp round-trip against an httptest fixture that models Dynadot's zone state, an explicit unmanaged-domain rejection, a regression preserving a pre-existing CNAME across the DNS-01 round-trip (the zone-wipe defence), and a typed-error propagation test that surfaces `ErrInvalidToken` to cert-manager so the controller will retry. Helm template smoke render ========================== `helm template` against the new chart with default values yields 12 resources / 424 lines (APIService, Certificate, ClusterRoleBinding, Deployment, Issuer, Role, RoleBinding, Service, ServiceAccount). The modified bp-cert-manager chart still renders both ClusterIssuers (`letsencrypt-dns01-prod` + `letsencrypt-http01-prod`) with default values; flipping `certManager.issuers.dns01.enabled=false` is the clean rollback. Smoke command (post-deploy) =========================== kubectl get apiservices.apiregistration.k8s.io \ v1alpha1.acme.dynadot.openova.io # Issue a *.<sovereign>.<pool> wildcard cert and watch the # Order/Challenge progress through cert-manager. CI == `.github/workflows/build-cert-manager-dynadot-webhook.yaml` mirrors the pool-domain-manager-build pattern (cosign keyless signing, SBOM attestation, GHCR push at `ghcr.io/openova-io/openova/cert-manager- dynadot-webhook:<sha>`). Triggered by changes to either the binary or the shared dynadot-client package. Closes #159 Co-authored-by: hatiyildiz <hatice.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 19:37:47 +04:00
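The domain allowlist check described in the commit above (reject challenges for unmanaged apexes before any Dynadot call) might look roughly like the following. The webhook itself is a Go binary; this Python sketch only illustrates the matching rule. The function name `is_managed` is hypothetical, and the example assumes the allowlist has already been parsed into a list of apex domains.

```python
def is_managed(fqdn: str, managed_domains: list[str]) -> bool:
    """True when fqdn equals a managed apex or is a subdomain of one."""
    name = fqdn.rstrip(".").lower()  # challenge names may carry a trailing dot
    for apex in managed_domains:
        apex = apex.strip().rstrip(".").lower()
        # Require a label boundary so "evil-omani.works" does not match "omani.works".
        if apex and (name == apex or name.endswith("." + apex)):
            return True
    return False


managed = ["omani.works"]
print(is_managed("_acme-challenge.otech.omani.works.", managed))
print(is_managed("evil-omani.works", managed))
```

Matching on a leading dot rather than a bare suffix is the design choice that matters here: a plain `endswith(apex)` would accept look-alike domains.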
e3mrah	0289f0388d	feat(scripts): bootstrap-kit dependency-graph audit script (W2.K0) (#259 ) Adds scripts/check-bootstrap-deps.sh + scripts/expected-bootstrap-deps.yaml, the W2.K0 deliverable from docs/BOOTSTRAP-KIT-EXPANSION-PLAN.md §2 + §3. The script parses every clusters/_template/bootstrap-kit/.yaml, extracts metadata.name + spec.dependsOn for the HelmRelease document(s), and mechanically verifies the actual graph against the expected DAG declared in scripts/expected-bootstrap-deps.yaml. It detects cycles via Kahn's algorithm and prints the rendered DAG as ASCII grouped by Wave 2 batch (W2.K1-K4) on success. Behaviour against the in-flight expansion: HRs declared expected but not yet on disk are reported as "deferred" (informational, not an error), so that this script can be the static authoritative list while W2.K1-K4 PRs land their HR files in series. After all four W2 PRs merge, the "deferred" count drops to 0 and the audit goes 100% green. Wired into the existing .github/workflows/test-bootstrap-kit.yaml as a new dependency-graph-audit job that runs on every PR touching: - clusters/* (any HR file edit) - scripts/check-bootstrap-deps.sh - scripts/expected-bootstrap-deps.yaml - .github/workflows/test-bootstrap-kit.yaml Co-authored-by: hatiyildiz <hatice.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 17:16:16 +04:00
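The cycle detection mentioned in the commit above uses Kahn's algorithm. A minimal sketch of that technique over a `dependsOn` mapping follows; the function name `unorderable` and the dict input shape are assumptions for illustration and are not taken from scripts/check-bootstrap-deps.sh.

```python
from collections import deque


def unorderable(depends_on: dict[str, list[str]]) -> set[str]:
    """Return nodes Kahn's algorithm cannot order; an empty set means no cycle.

    Nodes that sit on a cycle, or that depend on one, are returned.
    """
    nodes = set(depends_on)
    for deps in depends_on.values():
        nodes.update(deps)

    unmet = {n: 0 for n in nodes}        # count of unresolved dependencies
    dependents = {n: [] for n in nodes}  # reverse edges
    for name, deps in depends_on.items():
        for dep in deps:
            unmet[name] += 1
            dependents[dep].append(name)

    # Start with everything that depends on nothing.
    queue = deque(n for n in nodes if unmet[n] == 0)
    while queue:
        ready = queue.popleft()
        for child in dependents[ready]:
            unmet[child] -= 1
            if unmet[child] == 0:
                queue.append(child)

    return {n for n in nodes if unmet[n] > 0}


graph = {"bp-crossplane-claims": ["bp-crossplane"], "bp-crossplane": []}
print(unorderable(graph))
```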
e3mrah	2d1799d738	fix(bp-crossplane): split XRDs+Compositions into bp-crossplane-claims (#247 ) Resolves install ordering on fresh clusters where the apiserver rejects CompositeResourceDefinition CRs because the apiextensions.crossplane.io CRDs registered by the crossplane subchart aren't live yet at apply time. - bp-crossplane bumped 1.1.2 -> 1.1.3 (controller-only payload) - NEW bp-crossplane-claims@1.0.0 carries XRDs + Compositions - Flux HelmRelease for crossplane-claims uses dependsOn: [bp-crossplane] - composition-validate.sh + fixtures relocate to the new chart - blueprint-release CI: opt-out annotation catalyst.openova.io/no-upstream=true permits zero-deps charts that legitimately ship only Catalyst-authored CRs (the original hollow-chart rule remains in force for every other umbrella chart) Live error this fixes (from otech.omani.works): no matches for kind "CompositeResourceDefinition" in version "apiextensions.crossplane.io/v1" -- ensure CRDs are installed first Pattern: intra-chart CRD-ordering breaks -> split charts + Flux dependsOn. Apply universally to similar cases going forward. Co-authored-by: hatiyildiz <hatice.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 16:55:05 +04:00
e3mrah	fad36836ed	fix(ci): tempo + ntfy logos are now .svg (logo-fix-batch-2) (#213) Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>	2026-04-29 21:41:29 +02:00
e3mrah	1f5c76def1	fix(platform): sync blueprint.yaml versions with Chart.yaml (#199 ) * feat(ui): Playwright cosmetic + step-flow regression guards 15 regression guards in products/catalyst/bootstrap/ui/e2e/cosmetic- guards.spec.ts that fail HARD when each user-flagged defect class returns: 1. card height drift from canonical 108px 2. reserved right padding eating description width 3. logo tile drift from per-brand LOGO_SURFACE 4. invisible glyph (white-on-white) via luminance proxy 5. wizard step order Org/Topology/Provider/Credentials/Components/ Domain/Review 6. legacy "Choose Your Stack" / "Always Included" tab labels 7. Domain step reachable before Components 8. CPX32 not the recommended Hetzner SKU 9. per-region SKU dropdown shows wrong provider catalog 10. provision page is .html (static) not SPA route 11. legacy bubble/edge DAG SVG markup on provision page 12. admin sidebar drift from canonical core/console (w-56 + 7 labels) 13. AppDetail uses tablist instead of sectioned layout 14. job rows navigate to /job/<id> instead of expand-in-place 15. Phase 0 banners (Hetzner infra / Cluster bootstrap) on AdminPage Each test prints a failure message naming the canonical reference, the source-of-truth file, and the data-testid PR needed (if any) so the implementing agent has a precise target. No .skip() — per INVIOLABLE-PRINCIPLES #2, missing components fail loud. CI: .github/workflows/cosmetic-guards.yaml runs the suite on every PR that touches products/catalyst/bootstrap/ui/ or core/console/. Docs: docs/UI-REGRESSION-GUARDS.md maps each test to the user's original complaint, the canonical reference, and the green/red semantics (5 tests intentionally RED on main today — they stay red until the companion-agent's UI work lands). 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(platform): sync blueprint.yaml versions with Chart.yaml so manifest-validation passes --------- Co-authored-by: hatiyildiz <hatice.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 22:07:55 +04:00
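
Note on 0289f0388d (W2.K0): the commit message describes comparing the actual HelmRelease dependsOn graph against an expected DAG, reporting expected-but-missing HRs as "deferred", and detecting cycles via Kahn's algorithm. A minimal Python sketch of that idea follows. It is not the contents of scripts/check-bootstrap-deps.sh; the function names `audit` and `leftover_after_kahn` and the dict-of-lists input shape are assumptions made for illustration, and YAML parsing is left out.

```python
from collections import deque


def audit(actual, expected):
    """Compare two graphs given as {name: [dependsOn names]} (assumed shape).

    Returns (deferred, mismatched, unexpected):
      deferred   - expected HRs not yet on disk (informational, not an error)
      mismatched - HRs whose dependsOn set differs from the expected one
      unexpected - HRs on disk that the expected DAG does not declare
    """
    deferred = sorted(n for n in expected if n not in actual)
    mismatched = sorted(
        n for n in actual
        if n in expected and set(actual[n]) != set(expected[n])
    )
    unexpected = sorted(n for n in actual if n not in expected)
    return deferred, mismatched, unexpected


def leftover_after_kahn(graph):
    """Run Kahn's algorithm; return nodes that could not be ordered.

    An empty result means the graph is a DAG. Any leftover node sits on a
    cycle or depends (directly or transitively) on one.
    """
    nodes = set(graph) | {d for deps in graph.values() for d in deps}
    indegree = {n: 0 for n in nodes}      # number of unmet dependencies
    dependents = {n: [] for n in nodes}   # reverse edges: dep -> dependents
    for name, deps in graph.items():
        for dep in set(deps):
            indegree[name] += 1
            dependents[dep].append(name)

    # Start from nodes with no dependencies and peel the graph layer by layer.
    queue = deque(sorted(n for n in nodes if indegree[n] == 0))
    while queue:
        done = queue.popleft()
        for name in dependents[done]:
            indegree[name] -= 1
            if indegree[name] == 0:
                queue.append(name)

    return sorted(n for n in nodes if indegree[n] > 0)


if __name__ == "__main__":
    # Chart names taken from the log above; the graphs themselves are made up.
    expected = {"bp-crossplane": [], "bp-crossplane-claims": ["bp-crossplane"]}
    actual = {"bp-crossplane": []}
    print(audit(actual, expected))
    print(leftover_after_kahn(expected))
```

Treating "deferred" as a separate bucket rather than an error matches the behaviour the commit describes: the expected list can stay authoritative while the HR files land in later PRs.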

84 Commits