Commit Graph

84 Commits

Author SHA1 Message Date
e3mrah
0ac12970d8
ci(openova-flow): build openova-flow-server + adapter-flux images + sed chart tags (#1398)
Add the two missing GitHub Actions build pipelines for the OpenovaFlow Go
binaries so prov #34 has real images to install. Both auto-bump their
chart's values.yaml `image.tag` on every main-branch push and dispatch
blueprint-release for chart re-publish.

Workflows shipped:
 - .github/workflows/build-openova-flow-server.yaml
     · Triggers on push to products/openova-flow/server/** or the chart
     · `go vet` + `go test -race` + Buildx push to
       ghcr.io/openova-io/openova/openova-flow-server:<sha> + :latest
     · cosign keyless sign + SBOM attest
     · awk-bumps platform/openova-flow-server/chart/values.yaml
       flowServer.image.tag, commits to main with [skip ci]
     · Dispatches blueprint-release.yaml for chart re-publish
 - .github/workflows/build-openova-flow-adapter-flux.yaml
     · Same shape; bumps platform/openova-flow-emitter/chart/values.yaml
       flowEmitter.image.tag

Chart defaults (`tag: "latest"`) already shipped in PR #1397 — no
values.yaml changes needed in this PR.

Canonical patterns cited (ARCHITECT-FIRST):
 - Build shape mirrors .github/workflows/build-application-controller.yaml
   (Go vet + test + Buildx + cosign + SBOM + values.yaml awk-bump +
   blueprint-release dispatch).
 - awk-sed bump pattern mirrors catalystApi/catalystUi tag-bump in
   .github/workflows/catalyst-build.yaml `deploy` job (with the
   `[skip ci]` + explicit blueprint-release dispatch fix from #712).

Per docs/INVIOLABLE-PRINCIPLES.md:
 - #4a (GitHub Actions is the only build path)
 - event-driven (no cron triggers, only push/PR/workflow_dispatch)

MIRROR-EVERYTHING: image refs in chart values point at
harbor.openova.io/proxy-ghcr/...; CI pushes to ghcr.io directly and
Harbor proxy-pulls. No direct push to harbor.

Co-authored-by: hatiyildiz <269457768+hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-11 16:03:31 +04:00
e3mrah
3a5d9fc102
fix(infra,catalyst-api provisioner): tftpl CI guard + bucket-name suffix (Fix #101 followup, Fix #111) (#1331)
Two infrastructure-hardening fixes that together eliminate ~30 min
of provision-cycle waste per regression event documented in Fix #101.

## Fix A — CI guard against unescaped tftpl shell expansion

Adds a grep-based step to .github/workflows/infra-hetzner-tofu.yaml
that scans every infra/hetzner/*.tftpl for unescaped ${VAR:-default}
inside YAML comment lines. Uses PCRE negative-lookbehind so correctly
escaped $${VAR:-default} (templatefile() literal-dollar) does not
trip the guard.
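
A minimal sketch of the same check in Go, for illustration only (the
actual guard is a grep step in the workflow). Go's regexp package is
RE2-based and has no lookbehind, so this version consumes the preceding
character explicitly instead; the names `FindUnescaped`, `commentLine`
and `bareExpansion` are hypothetical:

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// commentLine matches YAML comment lines (optional indentation, then '#').
var commentLine = regexp.MustCompile(`^\s*#`)

// bareExpansion matches "${NAME:-" when the '$' is NOT preceded by another
// '$'. RE2 has no lookbehind, so the preceding character (or start of
// line) is matched explicitly.
var bareExpansion = regexp.MustCompile(`(^|[^$])\$\{[A-Za-z_][A-Za-z0-9_]*:-`)

// FindUnescaped returns the 1-based numbers of comment lines that contain
// an unescaped ${VAR:-default} sequence.
func FindUnescaped(content string) []int {
	var hits []int
	for i, line := range strings.Split(content, "\n") {
		if commentLine.MatchString(line) && bareExpansion.MatchString(line) {
			hits = append(hits, i+1)
		}
	}
	return hits
}

func main() {
	sample := "# flag: ${QA_FIXTURES_ENABLED:-false}\n" +
		"# flag: $${QA_FIXTURES_ENABLED:-false}\n" +
		"key: value\n"
	// Only the first line carries a bare expansion inside a comment.
	fmt.Println(FindUnescaped(sample))
}
```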

Background: PR #1311 (Fix #73) added a YAML comment with bare
${QA_FIXTURES_ENABLED:-false}. tofu's templatefile() parses ALL
${...} sequences regardless of YAML/HCL/shell context; the colon
in the interpolation hits HCL's reserved conditional grammar and
crashes 'tofu plan' with "Template interpolation doesn't expect
a colon at this location". Prov #9 (4204f0b0c5e37a80) wasted
~30 min before PR #1328 fixed the one offender. Without the guard,
the next operator who adds a similar comment repeats the incident.

Documented in infra/hetzner/README.md so editors learn the $$
escape pattern before they trip the CI gate.

## Fix B — bucket-name suffix to escape global Hetzner namespace

Hetzner Object Storage bucket names share a GLOBAL namespace
across every tenant. The previous BucketNameForSovereign(fqdn)
derivation 'catalyst-<fqdn-with-dashes>' would collide on the
second CreateDeployment for the same FQDN (re-provision after
wipe, two operators on adjacent pools, race conditions) and the
second 'tofu apply' would fail with BucketAlreadyExists.

Change BucketNameForSovereign signature to (fqdn, deploymentID)
and append the first 8 chars of the deployment-id as a suffix:

  catalyst-omantel-omani-works-b3b837a2

newID() already returns 16 random hex chars, so the leading 8 chars
carry 32 bits of fresh entropy, enough to make a collision between
two deployments of the same FQDN unlikely in practice (not a
cryptographic guarantee). Backward-compat: empty deploymentID (legacy
on-disk records) falls back to first-8-hex of sha256(fqdn) so wipes
of pre-Fix-111 Sovereigns remain deterministic.
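
The derivation above can be sketched roughly as follows. This is an
illustration under assumptions, not the code in hetzner/buckets.go: the
lower-casing of both inputs, the exact hex check, and the `isHex` helper
are guesses at details the description leaves open.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// isHex reports whether s consists only of lower-case hex digits.
func isHex(s string) bool {
	for _, r := range s {
		if !strings.ContainsRune("0123456789abcdef", r) {
			return false
		}
	}
	return true
}

// bucketSuffix returns the first 8 hex chars of deploymentID (lower-cased).
// If deploymentID is empty or does not start with 8 hex chars, it falls
// back to the first 8 hex chars of sha256(fqdn), which is deterministic.
func bucketSuffix(fqdn, deploymentID string) string {
	id := strings.ToLower(deploymentID)
	if len(id) >= 8 && isHex(id[:8]) {
		return id[:8]
	}
	sum := sha256.Sum256([]byte(fqdn))
	return hex.EncodeToString(sum[:])[:8]
}

// BucketNameForSovereign builds catalyst-<fqdn-with-dashes>-<suffix>.
func BucketNameForSovereign(fqdn, deploymentID string) string {
	base := "catalyst-" + strings.ReplaceAll(strings.ToLower(fqdn), ".", "-")
	return base + "-" + bucketSuffix(fqdn, deploymentID)
}

func main() {
	// Prints: catalyst-omantel-omani-works-b3b837a2
	fmt.Println(BucketNameForSovereign("omantel.omani.works", "b3b837a2c4d5e6f7"))
}
```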

Call-sites updated:
  - handler/deployments.go: id := newID() moved before
    bucket-name derivation; uses hetzner.BucketNameForSovereign
  - handler/wipe.go: passes dep.ID to PurgeBuckets and to
    BucketNameForSovereign in the report
  - hetzner/buckets.go: PurgeBuckets signature now takes
    deploymentID; bucketSuffix() handles the fallback

Tests:
  - hetzner/buckets_test.go: 6-case TestBucketNameForSovereign
    table covers canonical newID() shape, collision avoidance,
    uppercase normalisation, empty + non-hex fallback paths.
    New TestBucketNameForSovereign_CollisionAvoidance asserts
    the Fix #111 invariant directly.
  - handler/deployments_test.go:
    TestCreateDeployment_DerivesObjectStorageBucketFromFQDN
    now asserts the suffixed shape against the actual dep.ID.
  - All produced names re-validated against the S3 bucket-naming
    rules (mirrored regex from provisioner.s3BucketNamePattern).

## Claimed TCs

_None directly — infrastructure hardening; eliminates 30+ min
wasted per cycle from regressions like PR #1311 + bucket-collision_

## Verification

- go test ./internal/hetzner/... -run "Bucket" → 9/9 PASS
- go test ./internal/handler/ -run "DerivesObjectStorageBucket" → PASS
- go vet ./... → clean
- go build ./... → clean
- yaml.safe_load on workflow → clean
- pre-existing handler-package fails (whoami, continuum-switchover)
  are unrelated and present on origin/main

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 23:31:56 +04:00
e3mrah
f668d791ab
fix(bp-newapi): publish newapi-mirror image + repoint chart to existing tag (qa-loop bounded-cycle audit prov #7 Gap F) (#1315)
Root cause from live diagnosis (omantel.biz prov #7, kubectl --context=omantel):

The bp-newapi chart at platform/newapi/chart/values.yaml referenced
`ghcr.io/openova-io/openova/newapi-mirror:v0.4.5` since its first commit
(44d0200a, 2026-05-01). However:

1. NO CI workflow ever built that image. There is no
   `build-bp-newapi*.yaml` (or similar) under .github/workflows/. The
   GHCR package `ghcr.io/openova-io/openova/newapi-mirror` does not
   exist (404 from /orgs/openova-io/packages/container/...).

2. The tag `v0.4.5` is fictitious — neither upstream Calcium-Ion/new-api
   (`docker.io/calciumion/new-api`) nor the alternate ancestor
   (`justsong/one-api`) ever published a `v0.4.5`. The lowest stable
   Calcium-Ion tag is `v0.6.0.9`; the highest stable v0.x is `v0.13.2`
   (upstream publish 2026-04-27).

Result: on every fresh Sovereign the NewAPI Pod went into
ImagePullBackOff (403 Forbidden) on the nonexistent image, blocking
alice signup gate 5 (LLM) and surfacing in the bounded-cycle audit as
Gap F.

Fix (mirrors bp-guacamole CI pattern in
.github/workflows/build-bp-guacamole.yaml):

- NEW .github/workflows/build-bp-newapi.yaml — push to
  platform/newapi/chart/** triggers a Job that pulls
  `docker.io/calciumion/new-api:<UPSTREAM_VER>`, captures the upstream
  repo digest, re-tags as `ghcr.io/openova-io/openova/newapi-mirror:
  <UPSTREAM_VER>` + `:latest`, pushes both, then bumps values.yaml +
  Chart.yaml + dispatches blueprint-release.

- platform/newapi/chart/values.yaml — newapi.image.tag bumped from
  `v0.4.5` (fictitious) to `v0.13.2` (latest stable Calcium-Ion/new-api
  on Docker Hub). Comment block expanded with full rationale + link to
  the new build workflow + bump-in-lockstep instructions.

- platform/newapi/chart/Chart.yaml — version 1.4.1 → 1.4.2, appVersion
  `0.4.5` → `0.13.2` (Helm convention: appVersion = upstream version
  without the `v` prefix). Inline changelog records the audit-prov-7
  Gap F lineage.

- clusters/_template/bootstrap-kit/80-newapi.yaml — pinned chart
  version 1.4.1 → 1.4.2 with the same changelog inline.

Verified locally:
- `helm template smoke platform/newapi/chart --set
  database.existingSecret=fake --set credentials.existingSecret=fake
  --set auth.adminUI.mode=masterKey` renders
  `image: "ghcr.io/openova-io/openova/newapi-mirror:v0.13.2"` and
  `app.kubernetes.io/version: "0.13.2"`.

The v1.0.0-rc.x upstream line is gated on schema migration
stabilisation; the channel-seed Job uses the legacy admin-API request
shape, so do NOT auto-roll past v0.13.x without re-running the
channel-seed integration smoke against NewAPI's `/api/channel/`.

Pairs with the Gap C re-investigation memo (no chart fix needed; PR
#1309 only gated `defaultCompositionRef`, not the XRD itself; the
useraccesses.access.openova.io CRD is present on omantel prov #7).

DO NOT MERGE — this PR is for qa-loop bounded-cycle Wave 5 Fix #80
(Gap F) review.

Co-authored-by: hatiyildiz <269457768+hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 21:20:49 +04:00
e3mrah
9780e8d72d
fix(chart): bp-catalyst-platform 1.4.116 — chart re-publish + dispatch (qa-loop iter-10 Fix #44 follow-up) (#1264)
Chart 1.4.115 was published from the merge commit which still had the
OLD application-controller image tag (a3ba200) in values.yaml — the
auto-bump commit landed seconds later but GitHub Actions does NOT
trigger workflows from bot pushes by default (anti-recursion safeguard),
so blueprint-release was never re-run and the published chart shipped
with the wrong image. Sovereigns installing chart 1.4.115 still ran
the buggy application-controller without the targetNamespace fix.

Fix:
- Bump bp-catalyst-platform 1.4.115 → 1.4.116 (this commit is human-
  authored so blueprint-release fires via the path filter).
- Bump clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml
  pin to 1.4.116.
- Extend build-application-controller.yaml to dispatch
  blueprint-release.yaml after the bot bumps values.yaml, so the same
  race never blocks any future controller image roll-out.

Per docs/INVIOLABLE-PRINCIPLES.md #1 (target-state) — operator must
never have to manually re-trigger a chart publish after a controller
image rebuild.
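
For illustration, an explicit dispatch can be expressed as a call to
GitHub's "create a workflow dispatch event" REST endpoint. The sketch
below only builds the request; how the workflow actually performs the
dispatch is not stated here, and the owner/repo values in `main` and the
`NewDispatchRequest` helper are assumptions:

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// NewDispatchRequest builds (but does not send) a workflow_dispatch call:
// POST /repos/{owner}/{repo}/actions/workflows/{workflow}/dispatches
func NewDispatchRequest(owner, repo, workflow, ref, token string) (*http.Request, error) {
	body, err := json.Marshal(map[string]string{"ref": ref})
	if err != nil {
		return nil, err
	}
	url := fmt.Sprintf(
		"https://api.github.com/repos/%s/%s/actions/workflows/%s/dispatches",
		owner, repo, workflow,
	)
	req, err := http.NewRequest(http.MethodPost, url, bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Accept", "application/vnd.github+json")
	req.Header.Set("Authorization", "Bearer "+token)
	return req, nil
}

func main() {
	// Owner and repo are placeholders for illustration.
	req, err := NewDispatchRequest("openova-io", "openova", "blueprint-release.yaml", "main", "TOKEN")
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Method, req.URL.String())
}
```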

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 06:17:13 +04:00
e3mrah
24aab61207
fix(application-controller): HelmRelease targetNamespace = App's namespace, not Org slug (qa-loop iter-10 Fix #44) (#1262)
Root cause: the application-controller rendered the per-Application
HelmRelease with `metadata.namespace = Org` and `spec.targetNamespace
= Org` where Org is the parent Organization slug. On omantel the
Application(qa-wp) lives in ns `qa-omantel` while the Org is named
`omantel-platform` — so the workload Pod landed in the wrong namespace,
breaking matrix rows TC-068 / TC-100 / TC-204 / TC-262 / TC-263 (all
asserting Pod in qa-omantel). Symmetric Kustomization wrapper had the
same bug. Existing render unit test only covered the org==namespace
case (`acme/acme`) which masked the bug.

Fix:
- render.Inputs gains AppNamespace field. helmRelease + kustomization
  templates resolve `metadata.namespace` and `spec.targetNamespace` to
  AppNamespace (back-compat default = Org).
- application_controller.go passes app.GetNamespace() as AppNamespace
  on every render.Render call.
- HelmRelease spec.install.createNamespace = true so a missing workload
  namespace is provisioned by helm-controller (per
  docs/INVIOLABLE-PRINCIPLES.md #1 target-state — controller must work
  without an operator pre-creating the namespace).
- Org slug is still stamped on the catalyst.openova.io/organization
  label for traceability.
- 3 new Go tests:
    TestRender_NamespaceIsAppNamespace (omantel scenario via render pkg)
    TestRender_CreateNamespaceTrue
    TestReconcile_HelmReleaseTargetNamespaceIsAppNamespace (drives the
    omantel scenario end-to-end through the controller fake)
- build-application-controller.yaml extended with auto-bump of
  controllers.application.image.tag in values.yaml on push-to-main, so
  the chart picks up the rebuilt image without a manual operator edit
  (per feedback_no_mvp_no_workarounds.md rule 1).
- bp-catalyst-platform chart 1.4.114 → 1.4.115.
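
The namespace resolution described above (AppNamespace wins, Org is the
back-compat default) can be sketched as below. `Inputs` here is a
reduced, hypothetical stand-in for render.Inputs, and `TargetNamespace`
is an illustrative helper name, not necessarily what the render package
exposes:

```go
package main

import "fmt"

// Inputs is a reduced stand-in for render.Inputs (assumed shape).
type Inputs struct {
	Org          string // parent Organization slug
	AppNamespace string // namespace the Application CR lives in
}

// TargetNamespace returns AppNamespace, falling back to Org when
// AppNamespace is empty (the back-compat default).
func (in Inputs) TargetNamespace() string {
	if in.AppNamespace != "" {
		return in.AppNamespace
	}
	return in.Org
}

func main() {
	omantel := Inputs{Org: "omantel-platform", AppNamespace: "qa-omantel"}
	legacy := Inputs{Org: "acme"}
	fmt.Println(omantel.TargetNamespace()) // qa-omantel
	fmt.Println(legacy.TargetNamespace())  // acme
}
```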

Verification (post-roll on omantel):
- delete omantel-platform/qa-wp Pod
- annotate qa-omantel/qa-wp HR for reconcile
- expect: Pod in qa-omantel ns + HR.spec.targetNamespace == qa-omantel

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 05:17:48 +04:00
e3mrah
5ca0a7d178
fix(ci,charts,api): qa-loop iter-7 Fix #39 — bp-guacamole + bp-k8s-ws-proxy bootstrap-kit slots (#1236)
* fix(ci,charts,api): qa-loop iter-7 Fix #39 — bp-guacamole + bp-k8s-ws-proxy bootstrap-kit slots

Closes the scope gap acknowledged by Fix #36: bp-guacamole +
bp-k8s-ws-proxy chart skeletons existed at platform/* but lacked CI
image-build workflows + bootstrap-kit slots, so TC-228 / TC-230 /
TC-236 / TC-237 / TC-245 / TC-246 stayed FAIL with "deployment
NotFound".

CI workflows
------------
- .github/workflows/build-k8s-ws-proxy.yaml: Buildx + cosign keyless
  sign + SBOM attestation flow on core/cmd/k8s-ws-proxy/**, then bumps
  platform/k8s-ws-proxy/chart/values.yaml image.tag + Chart.yaml
  patch version + dispatches blueprint-release.
- .github/workflows/build-bp-guacamole.yaml: mirrors upstream Apache
  Guacamole 1.5.5 to GHCR (so every Sovereign pulls from a registry
  we own — no Docker Hub rate limits, no upstream availability risk),
  bumps values.yaml.image.{repository,tag} + Chart.yaml + dispatches
  blueprint-release.

Charts (target-state)
---------------------
- bp-k8s-ws-proxy v0.1.1: canonical workload name `k8s-ws-proxy`
  regardless of release name (DaemonSet + Service + ClusterRole +
  ClusterRoleBinding + ServiceAccount all named `k8s-ws-proxy` so
  matrix can address them by canonical short name).
- bp-guacamole v0.1.1: canonical short resource names (`guacd`,
  `guacamole-server`, `guacamole-recordings`); GHCR-mirrored upstream
  images; realm-patch ConfigMap correctly lands in `keycloak`
  namespace (was: realm-name, which would have failed silently on
  every Sovereign); `realmConfig.namespace` override surface added.
- Both charts: `catalyst.openova.io/smoke-render-mode: default-off`
  annotation so blueprint-release smoke-render gate honors the
  default-OFF render shape.

Bootstrap-kit slots
-------------------
- clusters/_template/bootstrap-kit/36-bp-k8s-ws-proxy.yaml +
  37-bp-guacamole.yaml: dependsOn-ordered (proxy → gateway), pinned
  to 0.1.1, default-OFF gate flipped via slot values, install/upgrade
  disableWait per session-2026-04-30 architectural decision.
- clusters/omantel.omani.works/bootstrap-kit/* slots mirror the same
  shape with omantel.biz hostnames matching the live HTTPRoutes on
  console.omantel.biz / auth.omantel.biz.

API: shells/issue handler (matrix-canonical URL surface)
--------------------------------------------------------
- POST /api/v1/sovereigns/{id}/shells/issue?namespace=&pod=&container=
  alias for the existing
  POST /api/v1/sovereigns/{id}/k8s/exec/{ns}/{pod}/{container}/session
  with matrix-canonical response fields (`sessionId`, `guacamoleUrl`,
  `recordingPath`). Same business logic, same audit surface
  (`guacamole-session-opened`), same RBAC gate (tier-developer or
  higher). 6 test cases, all PASS under -race.

TCs that flip PASS in iter-8
-----------------------------
- TC-228: POST /shells/issue → sessionId + guacamoleUrl + recordingPath
- TC-230: kubectl get deploy guacd guacamole-server -n catalyst-system
- TC-236: kubectl get ds k8s-ws-proxy -n catalyst-system
- TC-237: kubectl logs ds/k8s-ws-proxy → "listening"
- TC-245: viewer-cookie POST /shells/issue → 403
- TC-246: operator-cookie POST /shells/issue → 200 sessionId

Per feedback_no_mvp_no_workarounds.md: NO follow-up slices — every
gap Fix #36 confessed is closed in this PR. Per
feedback_machine_saturation_3rd_violation.md: CI-only build path,
no local docker.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(bootstrap-kit): move bp-k8s-ws-proxy + bp-guacamole to slots 51/52 (Fix #39 follow-up)

CI dependency-graph-audit caught a slot-number collision: slots 36-48
are reserved for the W2.K4 AI-runtime cohort (bp-stunner, bp-knative,
bp-kserve, bp-vllm, bp-llm-gateway, bp-anthropic-adapter, bp-bge,
bp-nemo-guardrails, bp-temporal, bp-openmeter, bp-livekit, bp-matrix,
bp-librechat) per scripts/expected-bootstrap-deps.yaml. Move the
exec-fan-out blueprints to slots 51/52 (post-W2.K4, pre-Phase-2 80+
slot range) and add their entries to the expected DAG.

- clusters/_template/bootstrap-kit/{36,37}-* → {51,52}-*
- clusters/omantel.omani.works/bootstrap-kit/{36,37}-* → {51,52}-*
- kustomization.yaml updates (both _template + omantel)
- scripts/expected-bootstrap-deps.yaml: declare slots 51/52 with full
  dependsOn lists (bp-k8s-ws-proxy on cilium+sealed-secrets,
  bp-guacamole on cilium+cert-manager+keycloak+sealed-secrets+
  seaweedfs+k8s-ws-proxy)

scripts/check-bootstrap-deps.sh re-run: 0 drift, 0 cycles, 55
declared HRs, 42 present on disk, 13 deferred (W2.K1-K4).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 01:48:25 +04:00
e3mrah
b24475e2c2
fix(api+chart): clusterroles GVR + CATALYST_BUILD_SHA env injection (qa-loop iter-3) (#1206)
Two coupled fixes for QA-loop iter-3 cluster
`clusterroles-gvr-and-sha-injection`:

Sub-A — clusterroles GVR (TC-122/196/199/248):
  - Add rbac.authorization.k8s.io/v1 ClusterRole + ClusterRoleBinding
    to k8scache.DefaultKinds. Both cluster-scoped.
  - Add matching get/list/watch verbs on
    catalyst-api-cutover-driver ClusterRole. Per
    feedback_chroot_in_cluster_fallback.md every new GVR added to
    DefaultKinds MUST get a matching rule on the cutover-driver SA
    (chroot SovereignClient uses it via in-cluster fallback).
  - Pin both kinds in TestDefaultKinds_GraphAndDashboardSurface so a
    regression that drops them from the registry fails the unit test.

Sub-B — CATALYST_BUILD_SHA env injection (TC-261):
  - api-deployment.yaml: inject CATALYST_BUILD_SHA + CATALYST_CHART_VERSION
    env vars with LITERAL values (not Helm directives) per the
    dual-mode contract — Kustomize on contabo can't render
    `{{ .Values... }}` in `value:` fields.
  - .github/workflows/catalyst-build.yaml: extend the "bump literal
    image refs" sed pass to also bump the CATALYST_BUILD_SHA env
    literal so /api/v1/version returns the SHA the Pod is actually
    running (no drift between image tag and reported SHA).
  - The handler (version.go) already reads CATALYST_BUILD_SHA via
    envOrTrim with `dev`/`0.0.0` ldflag fallbacks — no Go change
    needed; the version_test.go env-override test already covers it.
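
A minimal sketch of the env-with-fallback read, assuming envOrTrim takes
a key and a fallback and trims surrounding whitespace (its real
signature in version.go is not shown here). `lookupOrTrim` and the
`buildSHA` variable are hypothetical, added so the logic can be
exercised without touching the process environment:

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// buildSHA stands in for a value normally injected via -ldflags.
var buildSHA = "dev"

// lookupOrTrim returns the trimmed value from lookup(key), or fallback
// when that value is empty after trimming.
func lookupOrTrim(lookup func(string) string, key, fallback string) string {
	if v := strings.TrimSpace(lookup(key)); v != "" {
		return v
	}
	return fallback
}

// envOrTrim reads from the process environment.
func envOrTrim(key, fallback string) string {
	return lookupOrTrim(os.Getenv, key, fallback)
}

func main() {
	// Prints the injected SHA when set, otherwise the ldflag fallback.
	fmt.Println(envOrTrim("CATALYST_BUILD_SHA", buildSHA))
}
```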

Chart bumped 1.4.94 -> 1.4.95.

Co-authored-by: hatiyildiz <hati.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 17:56:21 +04:00
e3mrah
c1b92404ee
fix(chart): enable 5 Group C controllers + KC realm-role bootstrap (qa-loop iter-1) (#1194)
EPIC-3 RBAC reconciliation loop was dormant on every Sovereign because
the 5 Group C controllers (organization, environment, blueprint,
application, useraccess) shipped with `enabled: false` and the
KEYCLOAK_BOOTSTRAP_TIER_ROLES env var was hardcoded to "false". Result:
UserAccess CRs created by /api/v1/sovereigns/{id}/rbac/assign never
materialised into RoleBindings + composite realm-roles.

Cluster: controllers-and-kc-bootstrap-gates (qa-loop iter-1).

Changes:
- values.yaml: organization/environment/application/useraccess controllers
  flipped to `enabled: true` and `image.tag` SHA-pinned to the latest
  GHCR-published push-on-main builds (organization/environment/application
  :1b29c71, useraccess :ff2172f) per Inviolable Principle #4a.
- values.yaml: blueprint stays `enabled: false` until first
  push-on-main build of build-blueprint-controller.yaml lands an image
  in GHCR (never reference an image not built by CI).
- values.yaml: new top-level `keycloak.bootstrap.ensureTierRoles: true`.
- api-deployment.yaml: KEYCLOAK_BOOTSTRAP_TIER_ROLES now sources its
  default from `.Values.keycloak.bootstrap.ensureTierRoles` (per slice
  T2 brief #1098/#1146) instead of hardcoded "false".
- .github/workflows/build-blueprint-controller.yaml: new workflow
  scaffolded (mirror of build-application-controller shape) so the
  first commit touching core/controllers/blueprint/** ships a
  CI-built, SHA-pinned, cosign-signed image to GHCR.
- Chart.yaml: bumped 1.4.89 → 1.4.90.

Verified via `helm template`:
- 4 controller Deployments + 4 controller ClusterRoles render (blueprint
  pending image build).
- KEYCLOAK_BOOTSTRAP_TIER_ROLES renders as "true" by default.
- 5 tier ClusterRoles `openova:tier-{viewer,developer,operator,admin,owner}`
  render from platform/crossplane-claims/chart/.

Co-authored-by: hatiyildiz <hati.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 15:41:58 +04:00
e3mrah
7ca4abddd2
feat(continuum): K-Cont-4 — Cloudflare Worker source + tofu wiring for lease witness (#1101) (#1159)
* feat(continuum): K-Cont-4 — Cloudflare Worker source + tofu wiring for lease witness (#1101)

Implements the server side of the Cloudflare KV lease-witness pattern
that K-Cont-3's CFKVClient (in core/controllers/continuum/internal/
witness/cloudflarekv/) speaks to. The Worker fronts a Cloudflare
Workers KV namespace with read-then-CAS-write semantics enforced via
the If-Match header — exact contract per K-Cont-3 #1158 report (item d)
and the canonical-seams "Cloudflare KV Worker contract" entry.

Routes:
  GET    /lease/<slot-url-encoded>  → 200 + LeaseState | 404 | 401
  PUT    /lease/<slot>              → 200 + LeaseState | 412 + state | 401
  DELETE /lease/<slot>              → 204 | 412 | 401

All 7 K-Cont-3 trap behaviors verified by 46 vitest tests:
  1. If-Match: 0 = first-acquire-on-empty-slot
  2. Generation increments unconditionally (incl. Release)
  3. 412 includes current state body
  4. TTL eviction is server-authoritative in stamping (Worker doesn't
     auto-evict — controller's IsHeldBy decides)
  5. X-Holder mismatch on DELETE returns 412 (stale region can't
     evict new primary)
  6. Bearer token validation against env-bound allow-list
  7. Optional X-Lease-Slot header logged for KV granularity
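
Behaviors 1, 2, 3 and 5 above can be modelled in a few lines. The
sketch below is an in-memory Go model for illustration only; the real
Worker is TypeScript on Workers KV, and the type and function names here
are hypothetical. Release bookkeeping, TTL stamping and auth are
deliberately left out:

```go
package main

import "fmt"

// LeaseState is a reduced stand-in for the Worker's lease record.
type LeaseState struct {
	Holder     string
	Generation int
}

// Store is an in-memory stand-in for the KV namespace.
type Store struct {
	slots map[string]LeaseState
}

func NewStore() *Store { return &Store{slots: map[string]LeaseState{}} }

// Put applies compare-and-swap. ifMatch must equal the current
// generation, with 0 meaning "the slot is empty". On mismatch it returns
// 412 together with the current state so the caller can retry.
func (s *Store) Put(slot, holder string, ifMatch int) (int, LeaseState) {
	cur, ok := s.slots[slot]
	if (!ok && ifMatch != 0) || (ok && ifMatch != cur.Generation) {
		return 412, cur
	}
	next := LeaseState{Holder: holder, Generation: cur.Generation + 1}
	s.slots[slot] = next
	return 200, next
}

// CanRelease mirrors the X-Holder check on DELETE: only the current
// holder may release, so a stale region cannot evict a new primary.
func CanRelease(cur LeaseState, holder string) bool {
	return cur.Holder == holder
}

func main() {
	s := NewStore()
	fmt.Println(s.Put("slot-a", "region-a", 0)) // first acquire succeeds
	fmt.Println(s.Put("slot-a", "region-b", 0)) // stale If-Match is rejected
	fmt.Println(s.Put("slot-a", "region-b", 1)) // matching generation wins
}
```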

Files:
  products/continuum/cloudflare-worker/{package.json, tsconfig.json,
    wrangler.toml, vitest.config.ts, .eslintrc.cjs, .gitignore,
    DESIGN.md, src/{index,auth,kv,types}.ts,
    src/handlers/{get,put,delete}.ts,
    test/{handlers,contract,env.d}.ts}
  infra/cloudflare-worker-leases/{versions,variables,main,outputs}.tf
    + README.md
  .github/workflows/cloudflare-worker-leases-build.yaml
    (event-driven, NO cron — push-on-paths + PR + workflow_dispatch)

Tests: 46/46 vitest pass (handlers 37 + contract 9). ESLint clean.
tsc --noEmit clean. wrangler deploy --dry-run produces 9.47 KiB
bundle.

Per the brief: tofu module ships ready for operator action — no
auto-deploy. Operator runbook in DESIGN.md §"Operator runbook —
deploy a new Sovereign".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(continuum/cf-worker-tofu): K-Cont-4 — adopt CF v5 inline secret_text binding (was v4 separate resource)

`tofu validate` failed on `cloudflare_workers_secret` — that resource
was REMOVED in cloudflare/cloudflare v5 (it consolidated into the
inline `bindings = [...]` array on `cloudflare_workers_script` with
`type = "secret_text"`). Same security guarantee — encrypted at rest
in CF, never visible via dashboard read API once written. `tofu fmt`
also wanted versions.tf alignment + the .terraform.lock.hcl pinning
the resolved cloudflare/cloudflare v5.19.1 (mirrors infra/hetzner/
which commits its lock file).

Per Inviolable Principle #5 the bearer token value still flows from
TF_VAR_bearer_tokens_csv extracted at apply time from a K8s
SealedSecret — never inlined here.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: hatiyildiz <hati.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 08:01:44 +04:00
e3mrah
746901b671
feat(cnpg-pair): C-DB-1 — bp-cnpg-pair Blueprint (active-hotstandby CNPG cluster-pair across regions) (#1101) (#1153)
EPIC-6 Slice C-DB-1+C-DB-2. Active-hotstandby CNPG cluster-pair as a
companion to bp-cnpg: primary CNPG Cluster CR in region A, replica
Cluster CR in region B configured as a CNPG replica cluster
(replica.enabled=true + externalCluster), WAL streaming over a
Cilium ClusterMesh-shared Service. Per ADR-0001 §9 ClusterMesh is the
only canonical inter-region transport — never public TLS.

What ships:
  platform/cnpg-pair/
  ├── chart/
  │   ├── Chart.yaml             # bp-cnpg-pair 0.1.0; no-upstream + smoke-render-mode=default-off
  │   ├── values.yaml            # default-OFF gate; placement schema constrains active-hotstandby ONLY
  │   ├── templates/
  │   │   ├── _helpers.tpl              # fail-fast on empty image.tag; region pair validation
  │   │   ├── primary-cluster.yaml      # CNPG Cluster CR (region-pinned via openova.io/region affinity)
  │   │   ├── replica-cluster.yaml      # CNPG Cluster CR (replica.enabled=true; externalClusters[])
  │   │   ├── service-replication.yaml  # Cilium ClusterMesh global Service
  │   │   ├── failover-readiness.yaml   # probe Pod flips Ready when WAL lag < threshold
  │   │   ├── networkpolicy.yaml        # default-deny carve-outs for replication + probe
  │   │   └── audit-config.yaml         # NATS audit subjects + types this Blueprint emits
  │   ├── blueprint.yaml          # configSchema + placementSchema (active-hotstandby ONLY)
  │   ├── README.md               # 80-line deployment + failover semantics
  │   └── tests/cnpg-pair-render.sh  # 5-case render gate
  └── DESIGN.md                   # topology, lag-threshold rationale, deferred C-DB-3 plan

Default-OFF gate per the brief: helm template with default values
renders ZERO resources; helm template with cnpgPair.enabled=true +
both regions + image.tag renders 8 resources (2 Cluster CRs, 1
Service, 1 Deployment, 3 NetworkPolicies, 1 audit-config ConfigMap).
Empty image.tag fails fast at template-render per Inviolable
Principle #4a; same primary/replica region fails fast (degenerate
pair). All 5 render gates pass locally; helm lint + YAML parse clean.

CI smoke-render gate fix (single-line behavior change in
blueprint-release.yaml): adds a `catalyst.openova.io/smoke-render-
mode: default-off` annotation opt-in so charts that legitimately
render zero at default values (this chart + future bp-*-pair
Blueprints) skip the `<5 lines` empty-render check. The chart's own
tests/cnpg-pair-render.sh covers the enabled-render path; without
the annotation the empty-render check still fires unchanged.

Seam-map additions (return diff for 01-canonical-seams.md Platform
table):
  - service.cilium.io/global=true ClusterMesh global Service annotation
    (first chart in the repo to use it; pattern reused by Continuum
    K-Cont-2 for HTTPRoute weight=0 cross-region drains)
  - bp-*-pair active-hotstandby cluster-pair pattern (primary+replica
    Cluster CRs colocated in one Blueprint, region-pinned via
    openova.io/region node-affinity)
  - audit-config ConfigMap co-located with the emitting Blueprint
    (label-selector discovery for K-Cont-2 + U-DR-1; future
    bp-*-pair Blueprints follow this convention)
  - smoke-render-mode=default-off Chart.yaml annotation opt-in for
    the blueprint-release smoke gate

C-DB-2 (publish): existing blueprint-release.yaml workflow auto-
detects `platform/*/chart/**` paths — no allowlist edit required.
First push triggers `ghcr.io/openova-io/bp-cnpg-pair:0.1.0` build.

C-DB-3 (1M-row acceptance test) DEFERRED — full plan documented in
DESIGN.md "Deferred — C-DB-3 acceptance test plan" section so the
future implementer's brief is self-contained.

Tests:
  - bash platform/cnpg-pair/chart/tests/cnpg-pair-render.sh ✓ 5/5 PASS
  - helm lint platform/cnpg-pair/chart ✓ clean
  - helm template ... | python3 yaml.safe_load_all ✓ 8 docs parse clean
  - smoke-gate logic simulated locally ✓ default-off annotation honored

Pre-existing CI failures untouched:
  - TestPinIssue rate-limit flake — not affected by chart-only slice
  - TestBootstrapKit/gitea version drift — only iterates over a fixed
    10-chart bootstrap list (no cnpg-pair entry)

Out of scope per brief (all deferred to dedicated slices):
  - K-Cont-2 reconciler logic
  - K-Cont-3 lease witness
  - K-Cont-4 Cloudflare Worker
  - C-DB-3 1M-row acceptance test
  - Application controller changes
  - U-DR-1 UI

Co-authored-by: hatiyildiz <hati.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 05:16:55 +04:00
e3mrah
ddbe44918f
feat(continuum): K-Cont-1 — Continuum product skeleton (chart + binary + GHA workflow, no reconcile yet) (#1101) (#1151)
Slice K-Cont-1 of EPIC-6 (#1101) ships the Continuum product skeleton:

- core/controllers/continuum/{cmd,internal/{controller,events}}
  - cmd/main.go — controller-runtime Manager bootstrap; leader election;
    /healthz, /readyz, /metrics endpoints; env-only config per
    INVIOLABLE-PRINCIPLES #4
  - internal/controller — ContinuumReconciler with no-op Reconcile()
    (K-Cont-2 fills the body); SetupWithManager() watches Continuum CRs
    via unstructured.Unstructured per ADR-0001 §2.7 (no controller-gen)
  - internal/events — placeholder package documenting K-Cont-2's NATS
    audit-event-type list
  - Containerfile — multi-stage Go build → alpine:3.20 runtime, UID 65534
- products/continuum/chart/ — full Helm chart shape (default-OFF):
  - Chart.yaml + values.yaml (continuum.enabled: false; image.tag empty;
    fail-fast on empty tag at render time)
  - templates/{_helpers.tpl, deployment, service, serviceaccount, rbac,
    networkpolicy}.yaml
  - blueprint.yaml — OpenOva Blueprint manifest with configSchema +
    placementSchema (single-region: management cluster) + depends:
    bp-cnpg-pair + bp-powerdns
  - crds/README.md — pointer to the canonical Continuum CRD shipped in
    products/catalyst/chart/crds/continuum.yaml (B8 #1110); not duplicated
- products/continuum/DESIGN.md — chart-vs-binary split decision (Option A:
  binary in shared core/controllers/ module per CC1 #1135), K-Cont-2 fill
  list, K-Cont-3 lease witness API contract sketch
- .github/workflows/build-continuum-controller.yaml — event-driven CI
  (NO cron) with go vet + go test -race + helm template ON/OFF resource
  count gates + fail-fast verification + GHCR build & push (cosign
  keyless signed) + repository_dispatch for chart-bump fan-out

helm template verification:
- continuum.enabled=false → 0 resources (default OFF)
- continuum.enabled=true + image.tag=ci-test → 6 resources
  (ServiceAccount, ClusterRole, ClusterRoleBinding, Deployment, Service,
  NetworkPolicy)
- continuum.enabled=true + empty image.tag → render fails per #4a

go vet ./continuum/... → clean. go test -count=1 -race → all green.

Out of scope (per the K-Cont-1 brief):
- Reconcile body — K-Cont-2
- Lease witness implementations — K-Cont-3
- Cloudflare Worker source — K-Cont-4
- bp-cnpg-pair Blueprint — C-DB-1

Co-authored-by: hatiyildiz <hati.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 04:45:00 +04:00
e3mrah
b0ed216e81
feat(catalog): catalog-svc HTTP REST service + chart wiring (slice L1+L2, #1097) (#1148)
EPIC-2 Slice L of #1097. Multi-source Blueprint catalog HTTP REST
service backed by Gitea (3 sources: public mirror, sovereign-curated,
per-Org private). Replaces the per-Org SME catalog per ADR-0001 §4.3
(different scope: SME's was Org-bound; catalyst-catalog is Sovereign-
wide multi-source).

L1 — core/services/catalyst-catalog/ Go service:

  - Separate go.mod (services group is for HTTP services, controllers
    group is for CRD reconcilers — documented in DESIGN.md).
  - Imports the unified Gitea client via Go module replace directive.
  - Promoted core/controllers/internal/gitea → pkg/gitea so the catalog
    (a sibling Go module) can import it (Go internal/ rule). 5 Group C
    controllers updated atomically.
  - HTTP REST endpoints: /api/v1/catalog{,/{name},/{name}/versions,
    /{name}/versions/{version}} + /healthz.
  - Source resolution priority on collision: private > sovereign > public.
  - Per-Org access filter: caller's Claims.Groups[] determines visible
    private blueprints; Org A user does NOT see Org B's private set.
  - 30s TTL LRU cache on blueprint.yaml reads (capacity 1024 default).
  - Session-cookie / Bearer / ?access_token= claim extraction matching
    catalyst-api's seam; expired-token rejection in-process.
  - Containerfile: distroless-static, non-root UID 65532.
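
The collision rule above (private > sovereign > public) can be sketched as follows. This is a minimal illustration, not the service's code: the `Source`, `Entry`, and `resolve` names are hypothetical and the numeric ranks are an assumption.

```go
package main

import "fmt"

// Source identifies which Gitea source a blueprint came from.
// Hypothetical type: a higher value means higher priority on collision.
type Source int

const (
	Public Source = iota
	Sovereign
	Private
)

// Entry is a hypothetical catalog entry.
type Entry struct {
	Name   string
	Source Source
}

// resolve keeps one entry per blueprint name, preferring the
// higher-priority source when two sources publish the same name.
func resolve(entries []Entry) map[string]Entry {
	out := make(map[string]Entry, len(entries))
	for _, e := range entries {
		if cur, ok := out[e.Name]; !ok || e.Source > cur.Source {
			out[e.Name] = e
		}
	}
	return out
}

func main() {
	got := resolve([]Entry{
		{Name: "bp-postgres", Source: Public},
		{Name: "bp-postgres", Source: Private},
		{Name: "bp-redis", Source: Sovereign},
	})
	fmt.Println(got["bp-postgres"].Source == Private)
	fmt.Println(got["bp-redis"].Source == Sovereign)
}
```

The per-Org access filter would run before this step, so a caller never competes against private entries it is not allowed to see.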

L2 — products/catalyst/chart/templates/services/catalog/ wiring:

  - 5 templates (deployment, service, serviceaccount, rbac, httproute)
    + _helpers.tpl. Default-OFF gate via .Values.services.catalog.enabled.
  - helm template: 0 catalog resources when OFF, 6 when ON.
  - Empty image.tag fail-fasts at render per Inviolable Principle #4a.
  - HTTPRoute exposes /api/v1/catalog on api.<sovereign> hostname.
  - Chart bumped 1.4.85 → 1.4.86.

Gitea client extension (canonical seam, NOT per-service variant):

  - +ListOrgRepos(ctx, org) []Repo — paginated repo listing.
  - +ListContents(ctx, org, repo, branch, path) []ContentEntry —
    directory listing for per-Org shared-blueprints fan-out.

GitHub Actions workflow:

  - .github/workflows/catalyst-catalog-build.yaml — push-on-paths +
    pull_request + workflow_dispatch (NO cron). go vet + go test (race +
    count=1) + image build → GHCR :<sha>. repository_dispatch fan-out
    to chart-bump matches the Group C controllers' pattern.

Tests (3-tier gate): unit (config, cache, auth, source, handler) +
integration (httptest-backed Gitea fixtures across all 3 sources +
priority + per-Org access). All green; race detector on.

L3 (SME catalog retirement) is deferred per the EPIC-2 master brief.
GraphQL deferred (REST first; gqlgen would pull ~80MB of indirect deps
for a feature no UI consumer has asked for yet).

Co-authored-by: hatiyildiz <hati.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 04:04:52 +04:00
e3mrah
66fd0bbae3
refactor(controllers): promote duplicated internal/ packages to shared core/controllers/internal/ (CC1, #1095) (#1135)
Slice CC1 of EPIC-0 (#1095) — Coordinator-led consolidation. The 5 Group C
controllers (slices C1-C5: organization, environment, blueprint, application,
useraccess) all merged with their own per-controller go.mod + per-controller
internal/ tree. This PR canonicalizes the shared layout per
`02-implementer-canon.md` §1+§2:

  * One go.mod at core/controllers/go.mod (Path A — single shared module)
  * Shared helpers under core/controllers/internal/:
      - semver/    (was: blueprint/internal/semver + application/internal/semver,
                    now exposes blueprint's IsValidRange + app's IsExact, with
                    the union of both test corpora)
      - placement/ (was: application/internal/placement; promoted per seam map)
      - render/    (was: application/internal/render; promoted per seam map)
      - labels/    (was: useraccess/internal/labels; promoted per seam map —
                    Manara-style scope matcher, owner-of-record C5)

Module-discipline decision (Path A vs Path B): Path A. The 5 controllers'
go.mod files use the same controller-runtime v0.19.0, k8s.io/* @ 0.31.x,
sigs.k8s.io/yaml v1.4.0, etc. The only drift was organization-controller
on k8s.io/api 0.31.0 vs the others on 0.31.1 — a trivial bump.
Independent dep-version pinning would only be valuable if a controller
needed a hostile dep the others shouldn't pull; nothing in the current
tree is hostile.

Containerfiles + workflows updated:
  * 5 Containerfiles now COPY core/controllers/{go.mod,go.sum,internal/}
    plus the per-controller tree from a repo-root build context.
  * 4 per-controller workflows (application/environment/organization/
    useraccess; blueprint-controller has no dedicated workflow yet) now
    trigger on core/controllers/{<name>/**, internal/**, go.mod, go.sum}
    and run go vet + go test scoped to their own tree + shared internal.
  * useraccess workflow context flipped from core/controllers/useraccess
    to . (repo root) so the Containerfile can reach the shared go.mod.

Subpackages NOT promoted in this PR (compromise — flagged for follow-up):
  * gitea/ — 4 of 5 controllers each ship a Gitea HTTP client. The APIs
    DIVERGE (organization has Org+Repo CRUD with Repo struct return values;
    application/blueprint/environment have File CRUD with Org-not-found
    sentinel). A SUPERSET package would require renaming methods (e.g.
    EnsureRepo collides on signature) which crosses the brief's "no API
    redesign" line. CC2 follow-up slice should design the unified surface
    before promoting.
  * validate/ — application's package validates Application.spec.parameters
    against a JSON Schema (santhosh-tekuri lib); blueprint's validates
    Blueprint CR business rules (semver-backed). Same dir name, completely
    different functions — not actually duplicates.
  * gitops/ — environment's renders Flux GitRepository for an Environment;
    organization's renders HelmRelease+Namespace for an Org. Same dir name,
    different inputs and outputs.

Test-coverage delta: pre-consolidation 134 root-level tests (sum across
5 modules); post-consolidation 133 tests. Net delta -1: blueprint and
application each had their own TestIsValidRange in their semver pkg; the
shared semver pkg's TestIsValidRange now exercises the union of both
controllers' valid+invalid input corpora — coverage strictly improved
even though one redundant test name disappeared.

Verified locally: go build + go vet + `go test -count=1 -race ./...`
all clean; all 5 controller binaries (cmd/) link successfully.

Co-authored-by: hatiyildiz <hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 00:54:42 +04:00
e3mrah
dbf585744c
feat(controllers): land application-controller (slice C4, #1095) (#1133)
Watches Application.apps.openova.io/v1 CRs and reconciles each
Application to per-region kustomization + helmrelease manifests in
the per-Org Gitea repo (gitea.<location-code>.<sovereign-domain>/<org>/<app>).

Reconcile flow per slice C4 brief:

  1. Resolve parents: spec.environmentRef → Environment CR, then
     Environment.spec.organizationRef → Organization CR. Pending-on-miss.
  2. Fetch Blueprint at spec.blueprintRef.{name,version} (v1 with
     v1alpha1 fallback). Pending-on-miss.
  3. Validate spec.parameters against Blueprint.spec.configSchema via
     github.com/santhosh-tekuri/jsonschema/v5. On invalid → status.phase=
     Failed + Condition reason=Invalid listing every failing JSON pointer.
  4. Validate placement against Blueprint.spec.placementSchema.modes.
  5. Resolve placement → per-region work plan:
       - single-region:      regions[0] only, role=primary
       - active-active:      every region rendered identically (sorted
         for byte-stability), role=active, no primaryRegion
       - active-hotstandby:  regions[0] primary, regions[1..] standby
         (replicas: 0 + _openova_standby: true overlay; Continuum
         #1101 flips on switchover)
  6. Render kustomization.yaml + helmrelease.yaml per region under
     clusters/<region>/applications/<app>/{...}.yaml on the env-type-
     mapped branch (develop|staging|main per NAMING §11.2).
  7. Idempotent commit via gitea.PutFile's byte-equality short-circuit
     — re-reconcile on steady state = 0 Gitea writes (slice C4 brief
     test #7).
  8. Status update: phase / primaryRegion / regions[] / giteaRepo /
     installedBlueprint{name,version,digest} / conditions[].
  9. Finalizer + cascade delete: on metadata.deletionTimestamp, removes
     every manifest the controller wrote and releases the finalizer.
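
Step 5 of the flow above can be sketched as a pure function. This is an illustrative sketch under stated assumptions: `RegionPlan` and `plan` are hypothetical names, and the standby overlay (replicas: 0 + _openova_standby: true) is reduced to a boolean flag.

```go
package main

import (
	"fmt"
	"sort"
)

// RegionPlan is a hypothetical per-region work item.
type RegionPlan struct {
	Region  string
	Role    string
	Standby bool // stands in for the replicas: 0 + _openova_standby overlay
}

// plan maps a placement mode and region list to per-region work items.
func plan(mode string, regions []string) ([]RegionPlan, error) {
	if len(regions) == 0 {
		return nil, fmt.Errorf("placement %q needs at least one region", mode)
	}
	switch mode {
	case "single-region":
		return []RegionPlan{{Region: regions[0], Role: "primary"}}, nil
	case "active-active":
		// Sort a copy so the rendered output is byte-stable.
		sorted := append([]string(nil), regions...)
		sort.Strings(sorted)
		out := make([]RegionPlan, 0, len(sorted))
		for _, r := range sorted {
			out = append(out, RegionPlan{Region: r, Role: "active"})
		}
		return out, nil
	case "active-hotstandby":
		out := []RegionPlan{{Region: regions[0], Role: "primary"}}
		for _, r := range regions[1:] {
			out = append(out, RegionPlan{Region: r, Role: "standby", Standby: true})
		}
		return out, nil
	default:
		return nil, fmt.Errorf("unknown placement mode %q", mode)
	}
}

func main() {
	p, err := plan("active-hotstandby", []string{"fsn1", "hel1"})
	fmt.Println(p, err)
}
```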

Architecture compliance per docs/INVIOLABLE-PRINCIPLES.md:

  - Flux is the only reconciler. Controller writes to Gitea; Flux
    applies. NO direct K8s create of HelmRelease/Kustomization/Service.
  - Dynamic client + unstructured.Unstructured (no controller-gen, no
    zz_generated_deepcopy.go).
  - Every value is environment-configurable (GITEA_API_URL, GITEA_TOKEN,
    GITEA_PUBLIC_URL, SOURCE_NAMESPACE, HELMRELEASE_INTERVAL,
    CATALOG_SOURCE_REF, REQUEUE_AFTER_SECONDS, METRICS_ADDR, HEALTH_ADDR,
    LEADER_ELECT, LEADER_ELECT_NS, LOG_LEVEL).
  - SHA-pinned images via the focused build-application-controller.yaml
    workflow (push-on-paths + PR + workflow_dispatch — no cron).

Tests cover the full 9-test matrix from the brief plus 3 bonus paths:

  T1 Pending on missing Environment (no Gitea writes).
  T2 Pending on missing Blueprint (no Gitea writes).
  T3 Invalid on parameters schema mismatch — Condition message names
     the failing path 'replicas'; no Gitea writes.
  T4 single-region happy path → expected manifests written under
     clusters/<region>/applications/<app>/ on branch=main, finalizer
     added, status.phase=Provisioning, status.primaryRegion populated,
     status.giteaRepo populated.
  T5 active-active fan-out → 2 regions, 2 manifest sets byte-equal
     after region-name canonicalisation. status.primaryRegion empty.
  T6 active-hotstandby → primary renders replicas:3 (user param);
     standby renders replicas:0 + _openova_standby:true marker.
  T7 Idempotency → re-reconcile after success = 0 Gitea writes
     (PutFile byte-equality short-circuit).
  T8 Deletion cascade → manifests removed from Gitea, finalizer
     released after delete pass.
  T9 Drift detection → Gitea-side manifest hand-edited; controller
     restores byte-identical original on next pass.
  + Pending on Gitea Org missing (org doesn't exist in Gitea even
    though Organization CR exists — slice C1 hasn't run yet).
  + Invalid placement-vs-blueprint-allowed-modes (placement-active-active
    rejected on a Blueprint declaring only single-region).
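
The idempotency and drift paths (T7 and T9) both follow from one byte-equality check. The sketch below uses a hypothetical in-memory `fakeRepo` in place of the Gitea contents API; the real gitea.PutFile signature is not shown in this commit and is not reproduced here.

```go
package main

import (
	"bytes"
	"fmt"
)

// fakeRepo is a hypothetical in-memory stand-in for a Gitea repo.
type fakeRepo struct {
	files  map[string][]byte
	writes int
}

// putFile writes only when the stored bytes differ from the desired bytes
// and reports whether a write happened.
func (r *fakeRepo) putFile(path string, want []byte) bool {
	if cur, ok := r.files[path]; ok && bytes.Equal(cur, want) {
		return false // steady state: no commit
	}
	r.files[path] = append([]byte(nil), want...)
	r.writes++
	return true
}

func main() {
	repo := &fakeRepo{files: map[string][]byte{}}
	path := "clusters/fsn1/applications/demo/helmrelease.yaml"
	manifest := []byte("replicas: 3\n")

	fmt.Println(repo.putFile(path, manifest)) // first reconcile writes
	fmt.Println(repo.putFile(path, manifest)) // re-reconcile is a no-op

	repo.files[path] = []byte("replicas: 9\n") // simulated hand edit
	fmt.Println(repo.putFile(path, manifest))  // drift is overwritten
	fmt.Println(repo.writes)
}
```

Because the comparison is on raw bytes, the renderer has to be deterministic (stable field and region ordering), otherwise every pass would look like drift.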

Module path: github.com/openova-io/openova/core/controllers/application
(per-controller go.mod, matching siblings C1/C2/C3/C5; CC1 promotes
shared internals to core/controllers/internal/ in a follow-up slice).

`go vet ./...` clean. `go test -count=1 -race ./...` all green.

Co-authored-by: hatiyildiz <hatiyildiz@noreply.openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 00:34:22 +04:00
e3mrah
8988cd9e4f
feat(infra-hetzner): wire all var.regions[] entries end-to-end (slice G1, #1095) (#1131)
Slice G1 of EPIC-0 (#1095, Group G "Multi-cluster substrate"). Today
infra/hetzner/main.tf only realises regions[0] end-to-end — every wizard
payload's regions[1..N] entries silently no-op. EPIC-6 (#1101) Continuum
DR demo needs 3 regions (mgmt + fsn + hel per docs/EPICS-1-6-unified-design.md
§3.8 + §11), so this slice closes the gap.

Architecture: hybrid singular-path + secondary-region overlay.
- The legacy singular path (var.region + count = local.control_plane_count)
  STAYS untouched — every existing Sovereign state (omantel, otech*) keeps
  its resource addresses (hcloud_server.control_plane[0],
  hcloud_load_balancer.main, etc) and produces a no-op plan diff.
- New regions (regions[1+]) are realised via a parallel for_each set keyed
  by "{cloudRegion}-{index}" (e.g. fsn1-1, hel1-2). Each secondary region
  gets its own /24 subnet inside the shared /16 hcloud_network, its own
  CP server, its own workers, and its own lb11 load balancer. The
  hcloud_firewall + hcloud_ssh_key stay shared (one tenant boundary per
  Sovereign).
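
The "{cloudRegion}-{index}" keying can be modelled outside HCL as below. This Go sketch is only an illustration: the `Region` type and its `Provider` field are hypothetical, and it assumes the index is the entry's position in var.regions (the commit's examples fsn1-1 and hel1-2 are consistent with that, but the module itself is not shown here).

```go
package main

import "fmt"

// Region is a hypothetical mirror of one var.regions[] entry.
type Region struct {
	Provider    string
	CloudRegion string
}

// secondaryKeys returns the for_each keys for regions[1+].
// regions[0] is skipped because the legacy singular path realises it,
// and non-Hetzner entries are filtered out.
func secondaryKeys(regions []Region) []string {
	var keys []string
	for i, r := range regions {
		if i == 0 || r.Provider != "hetzner" {
			continue
		}
		keys = append(keys, fmt.Sprintf("%s-%d", r.CloudRegion, i))
	}
	return keys
}

func main() {
	fmt.Println(secondaryKeys([]Region{
		{Provider: "hetzner", CloudRegion: "nbg1"},
		{Provider: "hetzner", CloudRegion: "fsn1"},
		{Provider: "hetzner", CloudRegion: "hel1"},
	}))
}
```

Including the index in the key is what keeps two entries in the same cloud region distinct, which is the case the same_region_duplicates_produce_distinct_keys test exercises.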

Why hybrid not full for_each: a wholesale refactor would change every
existing resource address (hcloud_server.control_plane[0] →
hcloud_server.control_plane["mgmt"]), forcing every running Sovereign
to run `tofu state mv` for ~12 resources or face destructive recreates.
The brief explicitly bans that. Hybrid is purely additive — secondary
resources are NEW addresses no existing state carries.

No `tofu state mv` runbook required. Existing Sovereigns provisioned
with var.regions = [] or len(var.regions) == 1 produce identical plans
before and after this PR.

Slice G3 (out of scope here) wires Cilium ClusterMesh between secondary
regions and adds per-cluster GitOps path differentiation; today every
secondary CP renders an identical Flux Kustomization pointed at
clusters/<sovereign_fqdn>/.

Tests: tests/multi_region.tftest.hcl exercises 5 scenarios offline via
mock_provider + override_resource (no real Hetzner):
  - legacy_no_regions_payload (var.regions=[])
  - single_region_entry_does_not_double_provision (len==1)
  - three_region_mgmt_fsn_hel (EPIC-6 shape)
  - same_region_duplicates_produce_distinct_keys
  - non_hetzner_regions_are_filtered_out (oci entries skipped)
All 5 pass. CI workflow infra-hetzner-tofu.yaml runs validate + fmt -check
+ test on every PR touching infra/hetzner/**.

Per CLAUDE.md "every workflow MUST be event-driven, NEVER scheduled":
push-on-merge + pull-request-on-touch + workflow_dispatch only. No cron.

Validation:
  $ tofu validate
  Success! The configuration is valid.
  $ tofu fmt -check -recursive
  exit=0
  $ tofu test
  tests/multi_region.tftest.hcl... pass
    run "legacy_no_regions_payload"... pass
    run "single_region_entry_does_not_double_provision"... pass
    run "three_region_mgmt_fsn_hel"... pass
    run "same_region_duplicates_produce_distinct_keys"... pass
    run "non_hetzner_regions_are_filtered_out"... pass
  Success! 5 passed, 0 failed.

Co-authored-by: hatiyildiz <hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 00:29:44 +04:00
e3mrah
2ab442544e
feat(controllers): land environment-controller (slice C2, #1095) (#1127)
Implements slice C2 of EPIC-0 #1095 — the environment-controller Go
binary. Watches Environment.catalyst.openova.io/v1 CRs (cluster-scoped)
and reconciles each Environment to:

1. Verify the per-Org Gitea Org exists (parent Organization gate).
   Missing org surfaces GiteaOrgReady=False + Pending phase, never
   panics or crashloops.

2. Track the canonical branch name for this Environment in
   status.giteaRepoRef.{org,branch} per NAMING-CONVENTION.md §11.2
   item 1 (develop/staging/main ↔ dev/stg/prod; uat/poc map to their
   own branch name).

3. Idempotently write per-vCluster Flux GitRepository manifests into
   the Org's Gitea repo at the canonical path
   `clusters/<host-cluster>/environments/<env-name>/gitrepository.yaml`
   per NAMING §11.2 item 3. Multi-region Environments fan out one
   commit per spec.regions[]. Identical bytes short-circuit (zero
   spurious commits in repo history); drift triggers an overwrite
   with the existing blob SHA.

4. Surface the canonical JetStream subject prefix
   `ws.{organizationRef}-{envType}.>` on
   status.jetstreamSubjectPrefix per NAMING §11.2 item 4 +
   ARCHITECTURE.md §5. Per-Environment NATS Stream CR creation is
   OUT OF SCOPE here — NACK isn't installed yet (future slice).

5. Set status.phase, status.regionCount (printer column),
   status.vclusters[], status.observedGeneration, and the
   Ready/GiteaOrgReady/GitRepositoryWritten conditions.
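
Items 2 and 4 above are small pure mappings. The sketch below is an illustration only: the lowercase helper names are hypothetical and are not the signatures of the BranchForEnvType / JetStreamSubjectPrefix functions in internal/gitops.

```go
package main

import "fmt"

// branchForEnvType maps an env type to its canonical branch:
// dev/stg/prod map to develop/staging/main; uat and poc use their own name.
func branchForEnvType(envType string) (string, error) {
	switch envType {
	case "dev":
		return "develop", nil
	case "stg":
		return "staging", nil
	case "prod":
		return "main", nil
	case "uat", "poc":
		return envType, nil
	}
	return "", fmt.Errorf("unsupported envType %q", envType)
}

// jetStreamSubjectPrefix renders ws.{organizationRef}-{envType}.>
func jetStreamSubjectPrefix(organizationRef, envType string) string {
	return "ws." + organizationRef + "-" + envType + ".>"
}

func main() {
	branch, err := branchForEnvType("prod")
	fmt.Println(branch, err)
	fmt.Println(jetStreamSubjectPrefix("acme", "prod"))
}
```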

Architecture rules honored (per docs/INVIOLABLE-PRINCIPLES.md +
docs/adr/0001-catalyst-control-plane-architecture.md):

- Flux is the only reconciler in production. The controller writes
  manifests to Gitea; Flux applies them. NO kubectl apply, NO
  helm install, NO exec.Command in the codebase.
- Crossplane is cloud-only. This controller is K8s-to-K8s native
  via controller-runtime + client-go.
- DR is a Placement, not an Env Type. The controller treats
  spec.envType as the schema-validated enum {prod|stg|uat|dev|poc}
  with no special-case for DR (per NAMING §11.1).
- Sovereign-independent. The Gitea base URL, secret ref, branch
  suffix, commit author, and Flux interval are ALL runtime config
  (per Inviolable Principle #4 — never hardcode).

Files:
- core/controllers/environment/api/v1/types.go — Environment
  Go types matching the CRD; hand-written DeepCopy to avoid
  build-time codegen tool dependency.
- core/controllers/environment/internal/gitea/client.go — minimal
  GitHub-compatible REST client targeting Gitea's /api/v1
  (GET /orgs/{org}, GET/POST/PUT /repos/{org}/{repo}/contents/{path}).
  Idempotent UpsertFile with byte-equality short-circuit + blob-SHA
  conflict refusal.
- core/controllers/environment/internal/gitops/render.go — pure
  template rendering of the Flux GitRepository CR. Deterministic
  field ordering for byte-equality idempotency.
- core/controllers/environment/internal/controller/environment_controller.go
  — reconciler: validate spec, gate on Gitea Org, fan out per-region
  manifest writes, set status + conditions.
- core/controllers/environment/cmd/main.go — controller-runtime
  manager entry point with leader election.
- core/controllers/environment/Containerfile — two-stage build,
  alpine:3.20 runtime, non-root UID 65534, ENTRYPOINT.
- core/controllers/environment/deploy/rbac.yaml — ClusterRole
  watching Environments + status subresource + leader election lease.
- .github/workflows/build-environment-controller.yaml — CI mirrors
  build-cert-manager-dynadot-webhook.yaml: vet + race tests,
  docker buildx + cosign keyless sign + SBOM attest, push to
  ghcr.io/openova-io/openova/environment-controller.

Tests (35 total, all GREEN, race-detector enabled):

- internal/controller (T1–T11):
  T1 happy-path single-region reconcile
  T2 idempotent re-reconcile (zero spurious commits)
  T3 parent Org missing → Pending + GiteaOrgReady=False (no panic)
  T4 multi-region fan-out (3 commits, 3 regions)
  T5 drift detection — operator hand-edit gets overwritten
  T6 placement-vs-regions cardinality violations → Failed
  T7 env_type→branch mapping table
  T8 Gitea repo missing → Pending + GiteaRepoMissing reason
  T9 partial-failure one region → Degraded with that region Failed
  T10 Config.Defaults applies the documented defaults
  T11 NotFound between dequeue and Get is benign

- internal/gitea: GET /orgs OK + 404 + 500; UpsertFile create / idempotent /
  update with SHA / repo-not-found; pathEscape preserves slashes;
  arg-validation.

- internal/gitops: BranchForEnvType / JetStreamSubjectPrefix /
  HostClusterName (with override) / GitRepositoryPath /
  RenderGitRepository (deterministic + complete + anonymous +
  default interval + required-field validation) / EnvironmentName.

go vet ./... clean. go test -count=1 -race ./... GREEN.

Out of scope per slice brief: organization-controller (C1),
blueprint-controller (C3), application-controller (C4),
useraccess-controller (C5), catalyst-api codebase changes, NACK
install, per-Environment NATS Stream CRs.

Co-authored-by: hatiyildiz <hati@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 00:05:53 +04:00
e3mrah
84167a768e
feat(controllers): land organization-controller (slice C1, #1095) (#1129)
A thin in-cluster Go controller that watches Organization CRs
(orgs.openova.io/v1) and reconciles four downstream artifacts per
the EPICS-1-6 unified design §3.3 + §3.7 and ADR-0001 §2.7:

  1. vCluster HelmRelease — written into the per-Org Gitea repo
     (NOT direct apply; Flux reconciles per ADR-0001 §2.1).
  2. Keycloak group — at path /<slug> with attributes
     {org=[<slug>], tier=[<sme|corporate>]}.
  3. Gitea Org — auto-created if absent; one repo per Org seeds
     the vCluster + tenant manifests.
  4. UserAccess CR — one per spec.owners[] entry; slice C5's
     useraccess-controller materializes the RoleBindings.

Per ADR-0001 §2.2 (Crossplane is cloud-only), this is K8s-to-K8s
reconciliation, NOT a Crossplane Composition. Per §2.1 the controller
writes manifests via the Gitea HTTP contents API — never kubectl
apply, never helm install, never exec.Command("helm", ...).

Idempotent: re-running on a steady-state CR is a no-op (every
"ensure" is find-or-create with byte-equal short-circuit on PutFile).

What ships:
- core/controllers/organization/cmd/main.go — entry point with
  envconfig, leader election, signal handling
- core/controllers/organization/internal/controller/ — reconciler +
  KeycloakClient interface + LiveKeycloak impl
- core/controllers/organization/internal/gitea/ — minimal Gitea Admin
  REST client (Org/Repo + contents-API). Self-contained — extractable
  to core/pkg/gitea-client/ when slice C2 needs it.
- core/controllers/organization/internal/gitops/ — manifest renderer
  (namespace + vcluster HelmRelease + kustomization)
- core/controllers/organization/internal/orgapi/ — Organization Go
  types mirroring the CRD schema (no deepcopy-gen — inlined)
- core/controllers/organization/Containerfile — multi-stage build
  (alpine-based, runs as UID 65534)
- core/controllers/organization/config/{rbac,manager}/ — ClusterRole
  + Deployment scaffolding for chart consumption (slice F1)
- .github/workflows/build-organization-controller.yaml — push/PR/
  manual triggers, no cron

Tests: 9 unit tests across 3 packages cover happy-path reconcile,
idempotency (zero net writes on second reconcile), Keycloak group
already exists, Gitea Org already exists, slug/metadata drift,
missing CR no-op, byte-equal PutFile no-op, 422-race re-find,
template structural-YAML validity, and label-vocabulary compliance.
go test -count=1 -race ./... and go vet ./... both clean.

Out of scope: environment-controller (C2), application-controller
(C4), useraccess-controller (C5 — this controller only WRITES
UserAccess CRs).

Co-authored-by: hatiyildiz <hatiyildiz@noreply.openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 00:04:29 +04:00
e3mrah
dd1699afe3
feat(controllers): land useraccess-controller — fix silently broken Crossplane path (slice C5, #1095, P0) (#1128)
Per docs/EPICS-1-6-unified-design.md §3.5 and ADR-0001 §2.3 amendment,
K8s-to-K8s reconciliation belongs to thin in-cluster controllers, not
Crossplane Compositions. The existing useraccess.compose.openova.io
Composition writes RoleBindings via provider-kubernetes — but
provider-kubernetes is NOT installed on any production Sovereign
(caught in the EPIC-0 audit). Every UserAccess CR has been silently
no-op'd. This controller fixes that.

What lands:
- core/controllers/useraccess/cmd/main.go — controller-runtime Manager
  with leader election + signal handling, environment-only config
- internal/controller/{reconciler,desired,spec,status,types}.go — the
  reconciler. Watches UserAccess.access.openova.io/v1alpha1 (cluster-
  scoped, unstructured client) and owns RoleBinding +
  ClusterRoleBinding via Owns() so drift triggers reconcile via
  ownerRef indexing
- internal/labels/scope.go — Manara DNA scope matcher: AND-within /
  OR-across, wildcard scopes, EnforcedScopes() per catalog tier (the
  developer auto-injection of openova.io/env-type=dev)
- internal/controller/*_test.go + internal/labels/scope_test.go —
  26 unit tests with the controller-runtime fake client. Covers
  happy-path, multi-app/multi-ns fan-out, namespaces:["*"]→CRB,
  group subjects, drift detection+restore, orphan deletion on spec
  shrink, idempotency, invalid spec, ownerRef shape, NotFound no-op,
  and the 5-catalog-tier matrix
- deploy/{rbac,deployment}.yaml — ClusterRole/SA/Deployment with
  non-root, read-only-rootfs, drop-ALL caps, leader-election Role
- Containerfile — Alpine 3.20 final stage, CGO_ENABLED=0, UID 65534
- .github/workflows/useraccess-controller-build.yaml — event-driven
  build (push-on-main + PR test job), SHA-pinned image tags
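
One reading of the "AND-within / OR-across" scope matcher is sketched below. This is an assumption-heavy illustration, not the code in internal/labels/scope.go: the `Scope` type, the helper names, and the choice of "*" as a value wildcard are all hypothetical.

```go
package main

import "fmt"

// Scope is a hypothetical set of label requirements.
// A value of "*" is assumed to match any value for that key.
type Scope map[string]string

// matchesScope is the AND-within rule: every key in the scope must be
// present on the resource with a matching (or wildcarded) value.
func matchesScope(s Scope, labels map[string]string) bool {
	for key, want := range s {
		got, ok := labels[key]
		if !ok {
			return false
		}
		if want != "*" && got != want {
			return false
		}
	}
	return true
}

// matchesAny is the OR-across rule: one matching scope is enough.
func matchesAny(scopes []Scope, labels map[string]string) bool {
	for _, s := range scopes {
		if matchesScope(s, labels) {
			return true
		}
	}
	return false
}

func main() {
	scopes := []Scope{
		{"openova.io/env-type": "dev", "team": "payments"},
		{"openova.io/org": "*"},
	}
	fmt.Println(matchesAny(scopes, map[string]string{"openova.io/env-type": "dev"}))
	fmt.Println(matchesAny(scopes, map[string]string{"openova.io/org": "acme"}))
}
```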

Behaviour:
- Per UserAccess CR, materialises RoleBindings (per namespace) or
  ClusterRoleBindings (when namespaces:["*"]) referencing the
  canonical openova:application-{admin,editor,viewer} ClusterRoles
- ownerRef back to the UserAccess CR with controller=true +
  blockOwnerDeletion=true so K8s GC cascades deletes
- Drift detection: hand-mutated bindings are restored on next pass +
  Condition Drift=True surfaced for the UI
- Idempotent: steady-state reconcile = 0 K8s writes
- Status: phase (Pending|Active|Failed), rolebindingsCreated,
  observedGeneration, conditions[]
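
The namespace fan-out rule above can be sketched as a pure function. This is a minimal illustration: `Binding` and `desiredBindings` are hypothetical names, and the real controller builds typed RoleBinding / ClusterRoleBinding objects with ownerRefs rather than this simplified struct.

```go
package main

import "fmt"

// Binding is a hypothetical, simplified view of a desired binding.
type Binding struct {
	Kind        string // "RoleBinding" or "ClusterRoleBinding"
	Namespace   string // empty for ClusterRoleBinding
	ClusterRole string
}

// desiredBindings returns one ClusterRoleBinding when namespaces contains
// "*", otherwise one RoleBinding per namespace.
func desiredBindings(role string, namespaces []string) ([]Binding, error) {
	switch role {
	case "admin", "editor", "viewer":
	default:
		return nil, fmt.Errorf("unknown role %q", role)
	}
	clusterRole := "openova:application-" + role

	for _, ns := range namespaces {
		if ns == "*" {
			return []Binding{{Kind: "ClusterRoleBinding", ClusterRole: clusterRole}}, nil
		}
	}
	out := make([]Binding, 0, len(namespaces))
	for _, ns := range namespaces {
		out = append(out, Binding{Kind: "RoleBinding", Namespace: ns, ClusterRole: clusterRole})
	}
	return out, nil
}

func main() {
	b, err := desiredBindings("editor", []string{"shop", "billing"})
	fmt.Println(b, err)
}
```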

Out of scope per the brief:
- Crossplane Composition deletion (operator retires post-verify)
- 5-catalog-tier role inheritance (lands with EPIC-3 #1098)
- Keycloak realm-role sync (slice D1b, this controller is consumer)

Tests:
  go vet ./...                                # clean
  go test -count=1 -race ./...                # 26/26 pass
  go test ./internal/labels/... -run TestScope # full 5-tier matrix

Co-authored-by: Hatice Yildiz <hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 00:04:07 +04:00
e3mrah
358c32c032
ci: add cluster bootstrap-kit drift guardrail (slice H2 scope-reduced, #1095) (#1122)
Adds .github/workflows/cluster-template-drift.yaml — a warn-only workflow
that reports drift between each clusters/<sovereign>/bootstrap-kit/ tree
and the canonical clusters/_template/bootstrap-kit/.

Why warn-only, not enforce:
- Every existing Sovereign carries some legitimate drift (per-Sovereign
  image SHAs, region-specific values overlay) — blocking PRs on diff
  count would prevent ALL cluster work.
- The right place to enforce the boundary is Catalyst's organization-
  controller (slice C1 of #1095), not CI. Once C1 ships, every new
  Sovereign bootstrap-kit is generated from _template and the
  attestation lives at apply-time, not at CI-time.
- Retroactively reconciling the existing omantel.omani.works/ and
  otech.omani.works/ trees (which have 20+ differing files plus
  structural changes — extra files on each side) is a high-blast-radius
  maintenance-window operation, NOT a CI-scoped slice.

What this workflow does:
- Triggers on push to main + PR + workflow_dispatch when clusters/**
  changes.
- For each clusters/<sovereign>/ directory, runs `diff -rq` against
  clusters/_template/bootstrap-kit/ and writes a Markdown report to
  the run summary AND a sticky PR comment.
- Counts differing files + only-in-template + only-in-Sovereign per
  Sovereign so reviewers can quickly see whether new drift was
  introduced.

Per docs/EPICS-1-6-unified-design.md §3.9 row 2 + §11 row 6 (decision
amended from "reconcile + CI gate" to "warn-only CI gate"; structural
reconcile deferred to slice C1 organization-controller).

Per docs/INVIOLABLE-PRINCIPLES.md #4a — workflow only inspects YAML;
no images built, no cloud calls.

Refs: #1094, #1095, slice C1 (organization-controller).

Co-authored-by: hatiyildiz <hatiyildiz@noreply.openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 23:09:50 +04:00
e3mrah
eb6a3c1812
fix(chart,ci): auto-bump literal catalyst-{api,ui} SHAs — Sovereigns + contabo were frozen at :2122fb8 (#1060)
* fix(catalyst-api): rip out dangling sovereign_* route registrations + chart 1.4.56

PR #1050 deleted sovereign_more.go (which defined HandleSovereignUsers,
HandleSovereignCatalog, HandleSovereignSettings, HandleSovereignTopology)
but left four route registrations in cmd/api/main.go that still
referenced those handler methods. The catalyst-api build for the merged
revert (run 25439549879) failed with:

  cmd/api/main.go:690:39: h.HandleSovereignUsers undefined
  cmd/api/main.go:691:41: h.HandleSovereignCatalog undefined
  cmd/api/main.go:692:42: h.HandleSovereignSettings undefined
  cmd/api/main.go:693:42: h.HandleSovereignTopology undefined

That's why ghcr.io/openova-io/openova/catalyst-api:fdd3354 was never
published — only the UI image rolled. Result: omantel.biz catalyst-api
pod stuck in ImagePullBackOff.

Drop the four route registrations. Same baby, new address — the chroot
Sovereign uses the existing /api/v1/deployments/{depId}/* handlers via
the JWT-resolved deploymentId, not parallel-baby /api/v1/sovereign/*
endpoints.

Also revert two more parallel-baby fragments still on main:
  - getHierarchicalInfrastructure mode-aware fetcher → single mother
    URL (the chroot resolves deploymentId from the cookie and the
    mother-side topology handler serves byte-identical data once
    cutover-import has persisted the deployment record on the
    Sovereign's local store)
  - CatalogAdminPage.fetchApps mode-aware → /catalog/apps everywhere

Bump bp-catalyst-platform chart 1.4.55 → 1.4.56 and the cluster
Kustomization version pin to match.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(sovereignDynamicClient): in-cluster fallback when running ON the Sovereign

The chroot Sovereign Console at console.<sov-fqdn> is the SAME catalyst-api
binary as the mother. When that binary runs ON the Sovereign cluster
(catalyst-system namespace on the Sovereign itself), there is no
posted-back kubeconfig — the catalyst-api IS in the cluster it needs
to talk to, and rest.InClusterConfig() returns the right credentials.

Without this, every endpoint that needs the Sovereign-side dynamic
client returned 503 with "sovereign cluster kubeconfig not yet posted
back" — including ListUserAccess (/users page), CreateUserAccess,
infrastructure CRUD, etc. Caught on omantel.biz 2026-05-06: /users
rendered "list user-access: HTTP 503" because the Sovereign-side
catalyst-api was looking for a kubeconfig that doesn't exist on the
chroot side of the cutover boundary.

Detection: SOVEREIGN_FQDN env (set on every Sovereign-side catalyst-api
deployment by the chart) matches dep.Request.SovereignFQDN. On the
mother, SOVEREIGN_FQDN is unset → unchanged behavior. On the chroot,
SOVEREIGN_FQDN matches the only deployment served (its own) → use
in-cluster.
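
The detection rule reduces to one comparison. A minimal sketch, assuming a hypothetical helper name; in the real binary a true result would lead to rest.InClusterConfig() and a false result to the posted-back kubeconfig path.

```go
package main

import (
	"fmt"
	"os"
)

// useInClusterConfig reports whether the binary is running ON the Sovereign
// that owns the deployment. On the mother SOVEREIGN_FQDN is unset, so the
// result is always false and behavior is unchanged.
func useInClusterConfig(envFQDN, deploymentFQDN string) bool {
	return envFQDN != "" && envFQDN == deploymentFQDN
}

func main() {
	envFQDN := os.Getenv("SOVEREIGN_FQDN")
	// "sovereign.example.com" is a placeholder for dep.Request.SovereignFQDN.
	fmt.Println(useInClusterConfig(envFQDN, "sovereign.example.com"))
}
```

Requiring a non-empty match (rather than just a set variable) keeps a Sovereign-side binary from using in-cluster credentials for a deployment it does not own.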

Same fallback applied to tryDynamicClientLocked (loaderInputFor's
best-effort live-source client) so /infrastructure/topology and the
/cloud graph render with live data on the chroot too.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(user-access): empty list when CRD absent + RBAC for chroot

Two coupled fixes for the /users page on chroot Sovereign Console:

1. catalyst-api-cutover-driver ClusterRole: grant read/write on
   useraccesses.access.openova.io. The Sovereign chroot's catalyst-api
   uses the in-cluster ServiceAccount (per PR #1052). The list call
   was returning 403 from the apiserver because the SA had no rule
   covering this CRD.

2. ListUserAccess: return 200 with empty items when the CRD itself
   is not installed (apierrors.IsNotFound). The access.openova.io
   CRD ships via a separate blueprint that may not yet be installed
   on a fresh Sovereign — the page should render its empty state,
   not a 500 toast.

Caught live on omantel.biz 2026-05-06 after PR #1052 unblocked the
in-cluster client path: list call surfaced first as 403 (RBAC), then
as 500 "server could not find the requested resource" (CRD absent).
Both now resolve to a 200 + [].

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(chroot): byte-identical /jobs + /cloud — kill fixture fallback, lazy-seed jobs.Store from live cluster, single endpoint

Two parallel-baby paths still made the chroot diverge from the mother
on /cloud and /jobs/{jobId}. Both now ship one path that serves
byte-identical data on both surfaces.

1. CloudPage rendered fictional topology (Frankfurt, Helsinki,
   omantel-primary, omantel-secondary, edge-lb, vpc-net-eu, …) when
   the topology query errored — because it fell back to
   `infrastructureTopologyFixture` from `src/test/fixtures/`. That is
   a test-only file leaking into production via the production import
   tree, in direct violation of INVIOLABLE-PRINCIPLES #1 (no
   placeholder data — empty state when you don't know).

   Fix: drop the fixture fallback. On error → null → empty-state
   render. The mother shows the same empty state when its loader
   returns nothing; byte-identical.

2. JobsTable + JobDetail rendered a flat green-grid because the chroot
   was hitting `/api/v1/sovereign/jobs` which returns a minimal shape
   (no dependsOn, no parentId, no exec records). Mother's
   `/api/v1/deployments/{depId}/jobs` returns the rich shape from a
   per-deployment jobs.Store, which on the chroot starts empty (the
   mother's exportDeploymentToChild only ships the deployment record,
   not the jobs.Store contents).

   Fix: ship one URL on both surfaces — `/api/v1/deployments/{id}/jobs`.
   Add `chrootSeedJobsStoreIfEmpty` that runs at handler-time when
   SOVEREIGN_FQDN matches dep.Request.SovereignFQDN AND the per-
   deployment jobs.Store has 0 records: do a one-shot HelmRelease
   list via the in-cluster client (helmwatch.ListAndSnapshotHelmReleases
   — exported here, mirrors Watcher.SnapshotComponents without
   spinning up an informer), pass through snapshotsToSeeds +
   Bridge.SeedJobsFromInformerList. Subsequent calls read directly
   from the now-populated store and return rich Job records with
   dependsOn / parentId / status — exactly like the mother.

   useLiveJobsBackfill loses its mode-aware fetcher; the chroot UI
   uses the same `/api/v1/deployments/{id}/jobs` URL as the mother.

3. HandleDeploymentImport now also loads the imported record into the
   in-memory deployments map immediately, so `/deployments/{id}/*`
   handlers don't need a pod restart's restoreFromStore to see the
   chroot-imported deployment.
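
The lazy-seed step in item 2 might look something like the sketch below. It is a simplified illustration under stated assumptions: `jobsStore`, `seedIfEmpty`, and the job names are hypothetical, the store is reduced to a map of job names, and the `list` callback stands in for the one-shot HelmRelease list.

```go
package main

import (
	"fmt"
	"sync"
)

// jobsStore is a hypothetical in-memory stand-in for the per-deployment
// jobs.Store, reduced to deployment id -> job names.
type jobsStore struct {
	mu   sync.Mutex
	jobs map[string][]string
}

// seedIfEmpty calls list once for a deployment whose store has no records.
// It is a no-op when this instance is not the chroot for the deployment
// (FQDN mismatch) or when the store is already populated. It reports
// whether seeding happened.
func (s *jobsStore) seedIfEmpty(depID, selfFQDN, depFQDN string, list func() ([]string, error)) (bool, error) {
	if selfFQDN == "" || selfFQDN != depFQDN {
		return false, nil
	}
	s.mu.Lock()
	defer s.mu.Unlock()
	if len(s.jobs[depID]) > 0 {
		return false, nil
	}
	// The lock is held across the list call so concurrent requests
	// cannot seed the same deployment twice.
	names, err := list()
	if err != nil {
		return false, err
	}
	if s.jobs == nil {
		s.jobs = make(map[string][]string)
	}
	s.jobs[depID] = names
	return true, nil
}

func main() {
	store := &jobsStore{}
	calls := 0
	list := func() ([]string, error) {
		calls++
		return []string{"install-keycloak", "install-example"}, nil
	}
	for i := 0; i < 3; i++ {
		seeded, err := store.seedIfEmpty("dep-1", "sov.example", "sov.example", list)
		fmt.Println(seeded, err)
	}
	fmt.Println("list calls:", calls)
}
```

Only the first call pays for the cluster list; later calls read from the populated store.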

Bump bp-catalyst-platform 1.4.56 → 1.4.57 (chart + Kustomization).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(jobdetail): bare-jobName URL — Traefik strips %3A so canonical id 404s

JobDetail navigation was 404ing on the chroot because the link builder
URL-encoded the canonical Job id ("69e73b3abe673840:install-keycloak")
and Traefik (or any upstream proxy that's RFC 3986 §3.3-strict) does
not decode `%3A` inside path segments. The catalyst-api router saw
the literal "%3A" and Store.GetJob's exact-match path missed.

Two coupled fixes:

1. useJobLinkBuilder strips the "<deploymentId>:" prefix before encoding,
   producing /jobs/install-keycloak (Traefik-safe) instead of
   /jobs/69e73b3abe673840%3Ainstall-keycloak. Store.GetJob already
   accepts both bare jobName and canonical id (see store.go:781-789).

2. JobDetail.jobsById indexes by BOTH canonical id AND bare jobName so
   the URL param resolves regardless of which format the link emitted.
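
The two fixes above live in the TypeScript UI; the same idea is sketched here in Go purely for illustration. The `job` type and both helper names are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// job is a hypothetical, trimmed-down record.
type job struct {
	ID     string // canonical id: "<deploymentId>:<jobName>"
	Status string
}

// bareJobName strips the "<deploymentId>:" prefix when present and returns
// the input unchanged when it is already a bare name.
func bareJobName(id string) string {
	if i := strings.Index(id, ":"); i >= 0 {
		return id[i+1:]
	}
	return id
}

// indexJobs keys every job by both its canonical id and its bare name so a
// URL parameter in either format resolves to the same record.
func indexJobs(jobs []job) map[string]job {
	byID := make(map[string]job, 2*len(jobs))
	for _, j := range jobs {
		byID[j.ID] = j
		byID[bareJobName(j.ID)] = j
	}
	return byID
}

func main() {
	canonical := "69e73b3abe673840:install-keycloak"
	idx := indexJobs([]job{{ID: canonical, Status: "succeeded"}})
	fmt.Println("/jobs/" + bareJobName(canonical))
	fmt.Println(idx["install-keycloak"].ID == idx[canonical].ID)
}
```

Because the bare name contains no colon, the link never needs `%3A` in the path at all.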

Bump chart 1.4.58 → 1.4.59.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(cloud): resolve deploymentId from cookie on chroot — was firing topology against undefined

CloudPage's topology query fired against /deployments/undefined/...
on the chroot (URL is /cloud, no deploymentId path segment), so the
page showed "Couldn't load architecture" with all node counts at 0/0.

Fix: same pattern as JobDetail — useResolvedDeploymentId() reads the
JWT cookie's deployment_id claim via /api/v1/sovereign/self, falling
back from URL params. Topology query also gates on `!!deploymentId`
so it doesn't waste a 404 round-trip during cookie resolution.

Bump chart 1.4.60 → 1.4.61.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(chroot): single chrome — no frame in frame, no mother handover banner

Two visible bleed-throughs from the mother's wizard UX onto the
chroot Sovereign Console at console.<sov-fqdn>:

1. **Two stacked headers + sidebar inside sidebar** ("frame in frame").
   SovereignConsoleLayout rendered its own sidebar+header AND the page
   inside rendered PortalShell which rendered ANOTHER header (its
   sidebar was already skipped for chroot per a prior fix). User saw
   two horizontal title bars stacked.

   Resolution: SovereignConsoleLayout becomes auth-only on the chroot.
   It runs the cookie/OIDC auth gate + RequiredActionsModal, then
   renders <Outlet/> with NO chrome. PortalShell is now the single
   chrome owner on both surfaces:
     - Mother (/sovereign/provision/$id): renders Sidebar with
       /provision/$id/X URLs + its header.
     - Chroot (console.<sov-fqdn>):       renders SovereignSidebar
       with clean /X URLs + the same header.
   One sidebar, one header, byte-identical to mother layout.

2. **"✓ Sovereign is ready — Redirecting to your Sovereign console"
   banner on /apps.** This is the mother's wizard celebration that
   tells the operator "you can now jump to your new Sovereign". On
   the chroot the operator IS already on the Sovereign Console; the
   banner bleeds through because the imported deployment record
   carries the mother's handover-ready event in its history.

   Resolution: AppsPage gates the banner, the toast, and the
   auto-redirect timer on `!isSovereignMode`. Chroot stays clean.

Bump chart 1.4.62 → 1.4.63.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(chroot): wrap chroot-only pages in PortalShell + drop /catalog page

Three chroot-only pages bypassed PortalShell entirely. After
SovereignConsoleLayout went auth-only in #1057, they rendered
full-bleed with no sidebar / no header — visible look-and-feel break.

  /settings/marketplace   → MarketplaceSettings  (wrapped in PortalShell)
  /parent-domains         → ParentDomainsPage    (wrapped in PortalShell)
  /catalog                → CatalogAdminPage     (deleted)

Drop /catalog entirely per founder direction: a separate page just
to flip a "publish to marketplace" boolean per app is the wrong
shape. The natural place for that toggle is on each /apps card
(future PR — needs HandleSovereignApps to join publish state from
the SME catalog microservice). Removed:
  - /catalog route registration in router.tsx
  - 'Catalog' entry in SovereignSidebar's FLAT_NAV
  - CatalogAdminPage.tsx (525 lines)
  - 'catalog' from ActiveSection union + deriveActiveSection regex

The publish-state PATCH endpoint at /catalog/admin/apps/{slug}/publish
on the SME catalog service is unaffected; it's exposed at
marketplace.<sov-fqdn>, not console.<sov-fqdn>, and the future
apps-card toggle will call it via the same path.

Bump chart 1.4.64 → 1.4.65.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(apps): publish chip on each card — replaces deleted /catalog page

Per founder direction: "if the catalog is just labeling an app to be
shown in marketplace, why don't we do it through the apps?" — drop
the standalone /catalog page (#1058), put the publish toggle on each
/apps card.

Backend (catalyst-api):
- New file sme_catalog_client.go — best-effort client for the
  in-cluster SME catalog microservice at
  http://catalog.sme.svc.cluster.local:8082. 30s response cache,
  1.5s probe budget, returns nil on DNS NXDOMAIN (SME services tier
  not deployed on this Sovereign — common when marketplace.enabled
  is false).
- HandleSovereignApps decorates each app with `marketplacePublished`
  *bool joined by slug from the SME catalog. nil ⇒ slug not in SME
  catalog (bootstrap component, or marketplace not deployed) ⇒ FE
  suppresses the chip.
- New handler HandleSovereignAppPublish at PATCH
  /api/v1/sovereign/apps/{slug}/publish. Body {"published": bool}.
  Proxies to PATCH /catalog/admin/apps/{slug}/publish on the SME
  catalog. Surfaces upstream status verbatim. Invalidates the cache
  so the next /apps poll reflects the change immediately.

Frontend (AppsPage):
- liveAppsQuery returns { statusById, publishedBySlug } instead of
  the bare status map.
- Each AppCard with a non-null marketplacePublished renders a
  PUBLISHED / UNPUBLISHED chip alongside the status chip. Click →
  PATCH → optimistic refetch via React Query.
- Bootstrap components and apps not in the SME catalog have nil →
  no chip (correct: nothing to toggle).
- Cards with marketplace.enabled=false render no chips at all (SME
  catalog unreachable → nil for every slug).

Bump chart 1.4.66 → 1.4.67.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(chart,ci): auto-bump literal catalyst-{api,ui} SHAs so all Sovereigns + contabo get fresh code

Audit triggered by the founder asking whether PRs #1051..#1059 reach NEW
Sovereigns or only my manual `kubectl set image` patches on omantel.
Answer: nothing reached any cluster except omantel, via manual patches.
Both contabo AND every fresh Sovereign would install :2122fb8 — the
SHA frozen at PR #1040's last manual chart-touch on May 6 morning.

Root cause:
- chart/templates/api-deployment.yaml + ui-deployment.yaml carry
  LITERAL image refs ("ghcr.io/openova-io/openova/catalyst-api:2122fb8"),
  not Helm-templated `{{ .Values.images.catalystApi.tag }}`.
- catalyst-build CI's deploy step bumped values.yaml's catalystApi.tag
  on every push — but no template reads from it. Dead code.
- contabo's catalyst-platform Flux Kustomization at
  ./products/catalyst/chart/templates applies these as raw manifests.
- Sovereigns Helm-install the same chart; Helm passes the literal
  through unchanged.
- Both ended up frozen at whatever literal was committed at the last
  manual chart-touching PR.

Fix:
1. CI's deploy step now bumps both the literal SHAs in the two
   template files AND the unused-but-kept-for-SME-services
   values.yaml. Sed-patches the literal directly so contabo's Kustomize
   path keeps working.
2. The commit step adds the two templates to the staged set alongside
   values.yaml, so every "deploy: update catalyst images to <sha>"
   commit propagates to contabo (10-min reconcile) AND Sovereigns
   (next OCI chart publish via blueprint-release).
3. Bump bp-catalyst-platform 1.4.68 → 1.4.69 so the new chart with
   the latest literal (currently :8361df4) gets republished and
   pinned in clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml.
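
The workflow does the literal rewrite with sed; the Go sketch below only illustrates the same substitution. `bumpLiteralTag` is a hypothetical helper, and the assumption that tags are 7 to 40 hex characters is mine.

```go
package main

import (
	"fmt"
	"regexp"
)

// bumpLiteralTag rewrites every "<image>:<hex tag>" reference for the given
// image name to the new SHA and reports how many references it matched.
func bumpLiteralTag(manifest, image, sha string) (string, int) {
	re := regexp.MustCompile(regexp.QuoteMeta(image) + `:[0-9a-f]{7,40}`)
	matched := len(re.FindAllStringIndex(manifest, -1))
	// The literal variant avoids any "$" expansion in the replacement.
	return re.ReplaceAllLiteralString(manifest, image+":"+sha), matched
}

func main() {
	manifest := `        image: "ghcr.io/openova-io/openova/catalyst-api:2122fb8"` + "\n"
	out, n := bumpLiteralTag(manifest, "ghcr.io/openova-io/openova/catalyst-api", "8361df4")
	fmt.Print(out)
	fmt.Println("references matched:", n)
}
```

A match count of zero is worth treating as a failure in CI, since it means the template no longer carries the literal the bump expects.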

Why drop the "freeze contabo" intent of the previous comment:
The previous comment said auto-rolling contabo on every PR was bad
because PR #975's image broke contabo (k8scache startup loop).
The right solution there is to fix the bug in the code, not to freeze contabo.
Freezing masked real divergence — the reason the founder caught
this is that manual omantel patches were the only thing keeping
omantel current while contabo + every other fresh Sovereign quietly
ran 9 PRs behind.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 21:10:31 +04:00
e3mrah
953ef8290f
fix(catalyst-build): stop auto-bumping contabo Kustomize-path image refs (#980)
* fix(catalyst-ui): drop stale params={{ deploymentId }} from clean-root Links (#975)

#976 collapsed `to="/provision/$deploymentId/<page>"` to clean root
paths (`to="/<page>"`) but left the `params={{ deploymentId }}` prop
on every callsite, breaking the Vite tsc build with TS2353. Fixes:

- Drop `params={{ deploymentId }}` from Links whose target is now a
  parameterless clean root path (StatusStrip, AppDetail, AppsPage,
  DecommissionPage, FlowPage, JobDetail, JobsPage, JobsTimeline,
  SettingsPage, DeploymentsList).
- For Links whose `to` still uses `$componentId`/`$jobId`, cast
  `params` with `as never` to match the existing pattern in
  cloud-compute/cloud-network/cloud-storage/Sidebar/UserAccess
  (the dual-mount under provisionRoute + consoleLayoutRoute defeats
  TS's strict params inference; the runtime path is correct).
- Drop `deploymentId` prop + interface field from JobCard / JobRow /
  JobsTable / AppCard now that the Links don't need it; update test
  fixtures + the JobsTable row-link assertion to match the new
  clean `/jobs/$jobId` href.
- Drop the unused ArchEdgeType import in k8sAdapter (TS6196).
- Dashboard navigateToApp uses `as never` casts to align with the
  same pattern.

* fix(catalyst-build): stop auto-bumping contabo Kustomize-path image refs

Two paths consume the catalyst-api / catalyst-ui images:
1. bp-catalyst-platform OCI chart (Sovereigns) — values.yaml driven, tag
   in values.yaml is rendered at helm install time by Sovereign Flux.
2. contabo Kustomize-path — literal image refs in templates/api-deployment.yaml
   and templates/ui-deployment.yaml. Flux kustomize-controller on contabo
   reconciles those files directly.

The CI deploy step was bumping BOTH on every PR, which auto-rolled
contabo every time anyone merged a catalyst-api code change. On
2026-05-05 PR #975's k8scache feature broke contabo startup on the
auto-roll because contabo has 27 dead-Sovereign kubeconfigs that the
new code iterates synchronously at startup, blocking readiness.

Fix: keep the values.yaml bump (Sovereigns auto-pick-up via OCI chart
which is the right behaviour for fresh provisions). Drop the
templates/*-deployment.yaml bump so contabo only rolls when an
operator manually commits a validated SHA into those files.

Closes the auto-deploy-to-contabo blast radius on every PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: hatiyildiz <hatiyildiz@users.noreply.github.com>
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 21:24:57 +04:00
e3mrah
2ff50f0591
fix(bp-newapi+services-build): imagePullSecrets on Pod, sed bumps values.yaml smeTag (#955)
Two SME-blocker bugs caught live on otech113 (alice signup gate 5 fails on
fresh Sovereign):

#952 — bp-newapi 1.4.0 Pod has no imagePullSecrets, so kubelet pulls
PRIVATE ghcr.io/openova-io/openova/{newapi-mirror,services-metering-sidecar}
anonymously and gets 403 Forbidden. Fix:

- Templatize spec.imagePullSecrets on Deployment + channel-seed Job.
- Default values.yaml `imagePullSecrets: [{name: ghcr-pull}]`.
- Add `newapi` to flux-system/ghcr-pull's reflector
  reflection-{allowed,auto}-namespaces in cloudinit-control-plane.tftpl
  so bp-reflector mirrors the source Secret into the namespace
  automatically on every fresh Sovereign.
- Bump bp-newapi 1.4.0 -> 1.4.1, update _template overlay.

#953 — services-build.yaml's image-rewrite loop only matched the
hardcoded `image: ghcr.io/.../services-<svc>:<sha>` form. 7 of 8
sme-services templates use `image: "{{ ... }}/services-<svc>:{{
.Values.images.smeTag }}"`. Each services-build run bumped only
auth.yaml while reporting "update sme service images to ${SHA}",
leaving the live Pod on stale bytes (PR #951's #941 fix never reached
services-catalog despite the merge + chart bump chain). Fix:

- After the hardcoded loop, also bump `images.smeTag` in
  products/catalyst/chart/values.yaml with a strict regex match
  (`^  smeTag: "<sha>"$`); refuse to auto-bump if the line shape
  changes (defends against silent drift if a contributor renames the
  field).
- Mirror the change into the retry-path `rewrite()` function so a
  reset-to-origin/main retry does not recreate the original bug.
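
The "strict match or refuse" behaviour for `smeTag` could be sketched as below. The real step is shell inside the workflow; `bumpSmeTag` is a hypothetical Go equivalent, and requiring exactly one matching line is my assumption about what "strict" means here.

```go
package main

import (
	"fmt"
	"regexp"
)

// smeTagLine matches only the exact line shape `  smeTag: "<hex>"`.
var smeTagLine = regexp.MustCompile(`(?m)^  smeTag: "[0-9a-f]+"$`)

// bumpSmeTag replaces the tag when exactly one line has the expected shape
// and refuses otherwise, so a renamed or reformatted field fails loudly
// instead of being skipped silently.
func bumpSmeTag(values, sha string) (string, error) {
	matches := smeTagLine.FindAllStringIndex(values, -1)
	if len(matches) != 1 {
		return "", fmt.Errorf("expected exactly one smeTag line, found %d", len(matches))
	}
	return smeTagLine.ReplaceAllLiteralString(values, `  smeTag: "`+sha+`"`), nil
}

func main() {
	values := "images:\n  smeTag: \"fa4395f\"\n"
	out, err := bumpSmeTag(values, "95a06f5")
	fmt.Print(out)
	fmt.Println("error:", err)

	_, err = bumpSmeTag("images:\n  smeImageTag: \"fa4395f\"\n", "95a06f5")
	fmt.Println("renamed field refused:", err != nil)
}
```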

Tests:

- platform/newapi/chart/tests/imagepullsecrets-render.sh — 4 cases
  asserting the Deployment and channel-seed Job carry the default
  ghcr-pull reference, that an empty override suppresses the block,
  and that custom secret names propagate (Inviolable Principle #4).
- tests/integration/services-build-rewrite.sh — 3 cases reproducing
  the workflow's rewrite logic on a sandboxed copy of the live
  chart, asserting both auth.yaml's hardcoded line AND values.yaml's
  smeTag get bumped, that helm-render of the catalyst chart with
  the bumped values produces all 8 SME-service Deployments at the
  new SHA, and that an idempotent re-bump to a second SHA also lands
  cleanly.

Refs: #952 #953 (umbrella #915 — alice signup gate 5).

Co-authored-by: hatiyildiz <143030955+hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 15:47:37 +04:00
e3mrah
db332f6767
fix(ci): services-build auto-bumps chart patch + dispatches blueprint-release (#874)
* fix(bp-catalyst-platform): bump 1.4.8 -> 1.4.9 to republish with current services-auth image (#871)

Chart 1.4.8 was published from commit 95a06f56 BEFORE the deploy-bot
updated templates/sme-services/auth.yaml's image pin from
services-auth:fa4395f -> services-auth:95a06f5 (which has the
/auth/send-pin alias from PR #869). The blueprint-release workflow
fired on 95a06f56 only, so the OCI artifact for 1.4.8 was published
with the OLD image SHA in chart bytes. otech103 reconciled 1.4.8 and
rendered the auth Deployment with the OLD image -> /auth/send-pin
returns 404 -> SME marketplace signup blocked.

Same deploy-step race documented in feedback_idempotent_iac_purge.md
and the overnight DoD bookmark. Long-term fix is a double-bump
sequencing PR (file separately); short-term fix is bumping the chart
version so blueprint-release republishes the artifact with the
current image pin.

No template change. Lockstep slot 13 pin in
clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml bumps
from 1.4.8 -> 1.4.9.

Closes #871

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(ci): services-build deploy auto-bumps chart patch + dispatches blueprint-release (#872)

Eliminate the recurring race between services-build's deploy commit
and blueprint-release's path-trigger on chart-version-bumping PRs.

Before: a PR bumping `products/catalyst/chart/Chart.yaml` AND touching
`core/services/**` triggered both workflows on the same merge SHA in
parallel. blueprint-release packaged the chart at the merge commit
(which still held the OLD image SHAs) and published the bumped
chart version with stale image refs. services-build's deploy commit
landed AFTER, but per GitHub Actions design GITHUB_TOKEN-authored
pushes do NOT re-trigger workflows, so blueprint-release never fired
again on the corrected chart. A manual no-op chart bump PR was the
only way to republish (PR #865 chasing PR #864 was the live incident).

After: services-build's deploy step
  1. sed-rewrites image: lines under products/catalyst/chart/templates/sme-services/*.yaml (unchanged)
  2. Pure-bash semver patch-bumps Chart.yaml `version:` and `appVersion:` atomically
  3. Single commit captures both rewrites
  4. Explicit `gh workflow run blueprint-release.yaml -f blueprint=catalyst -f tree=products` dispatches the chart publish (matches catalyst-build's PR #720 pattern)
  5. Idempotent push retry re-reads origin/main and bumps from THAT version on conflict, so concurrent CI runs produce strictly increasing patch versions instead of clobbering each other
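
Step 2's patch bump is pure bash in the workflow; the following Go sketch shows the same arithmetic for illustration only. `bumpPatch` is a hypothetical helper and it assumes a plain `MAJOR.MINOR.PATCH` version with no pre-release suffix.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// bumpPatch increments the PATCH component of a MAJOR.MINOR.PATCH version
// and leaves MAJOR and MINOR untouched.
func bumpPatch(version string) (string, error) {
	parts := strings.Split(version, ".")
	if len(parts) != 3 {
		return "", fmt.Errorf("not a MAJOR.MINOR.PATCH version: %q", version)
	}
	nums := make([]uint64, 3)
	for i, p := range parts {
		n, err := strconv.ParseUint(p, 10, 32)
		if err != nil {
			return "", fmt.Errorf("invalid component %q in %q", p, version)
		}
		nums[i] = n
	}
	return fmt.Sprintf("%d.%d.%d", nums[0], nums[1], nums[2]+1), nil
}

func main() {
	next, err := bumpPatch("1.4.8")
	fmt.Println(next, err)
}
```

On a push conflict, the retry in step 5 would re-read the version from origin/main and call the same bump again, which is why concurrent runs end up with strictly increasing patch numbers.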

Adds `actions: write` to the deploy job permissions so the
gh workflow run dispatch doesn't return HTTP 403.

The manual chart-version field in author PRs becomes a floor; CI
auto-bumps from there. PR authors should NOT bump the patch
themselves any more — the deploy step does it. Major/minor bumps
remain the author's call.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 08:32:34 +04:00
e3mrah
1d93b6c5af
feat(e2e): SME demo Playwright spec — full 6-step happy path (#805) (#823)
Authors the load-bearing investor-demo proof artefact for the
SME-tenant turnkey experience epic (#795). The spec walks the FULL
happy path against the catalyst-ui SPA and emits 1440×900 screenshots
at every assertion so the DoD checklist is satisfied with visual
evidence rather than narrative.

What landed:

- products/catalyst/bootstrap/ui/e2e/sme-demo.spec.ts — single linear
  spec covering Step 1 (marketplace signup) → Step 2 (provisioning) →
  Step 3 (SME admin first login + dashboard) → Step 4 (create alice
  via unified-rbac with 3-step ADR-0003 hook progress) → Step 5a
  (alice on WordPress) → Steps 5b/5c/5d/6 fixme'd with TODO links to
  unblocking issues.

- products/catalyst/bootstrap/ui/e2e/lib/config.ts — central registry
  of every URL, hostname, fixture user, and UUID the spec uses. Per
  feedback_never_hardcode_urls.md, no test inlines a hostname; every
  asserted host derives from OTECH_FQDN + SME_SLUG.

- products/catalyst/bootstrap/ui/e2e/lib/sme-fixtures.ts — wire-shape-
  faithful page.route mocks for tenant discovery, /api/v1/whoami,
  /api/v1/sme/tenants, /api/v1/sme/users (CRUD), the deployment
  endpoints, app placeholders for WordPress/OpenClaw/webmail, and the
  /api/v1/sme/billing/ledger surface. Each helper is the seam between
  mock-mode (today) and live-mode (post-#804) so the spec opts out of
  any single mock by simply not calling that helper.

- .github/workflows/sme-demo-e2e.yaml — push + PR + dispatch trigger
  that runs the spec against a freshly-installed dev tree with
  VITE_CATALYST_MODE=sovereign + VITE_SOVEREIGN_FQDN set so the
  SovereignConsoleLayout's auth gate has a non-null sovereignFQDN.
  Uploads the 805-* screenshot evidence as a 30-day artefact.

Run today on a fresh checkout:

    cd products/catalyst/bootstrap/ui
    VITE_CATALYST_MODE=sovereign \
      VITE_SOVEREIGN_FQDN=acme.otech.example \
      npm run dev &
    PLAYWRIGHT_HOST=http://localhost:5173 \
      npx playwright test e2e/sme-demo.spec.ts

Result: 6 passed, 4 fixme (5b/5c/5d/6, all with TODO links to #804 /
#798 / #802-followup).

Live-mode follow-up (after #804 lands a fresh otech with the SME
tenant pipeline wired): drop the mock installers from beforeEach and
flip OTECH_FQDN/SME_SLUG via env. The spec stays — only the helper
calls change.

Per docs/INVIOLABLE-PRINCIPLES.md:
  #1 (waterfall): the canonical 6-step contract from #805 is asserted
     in this first cut, not staged across cycles.
  #2 (never compromise): every step that's deferred is fixme'd with a
     blocker link, never silently skipped.
  #4 (never hardcode): every URL routes through e2e/lib/config.ts.

Refs: openova-io/openova#795, openova-io/openova#804, ADR-0003

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
2026-05-04 22:52:07 +04:00
e3mrah
9645a9044a
feat(metering): NewAPI NATS publisher + sme-billing subscriber + POST /metering/record (#798) (#818)
* feat(metering): NewAPI NATS publisher + sme-billing subscriber + POST /metering/record (#798)

Per #795 [Q-mine-3] (NATS not RedPanda) + [Q-mine-4] (one ledger), add
the SME-2 metering integration end-to-end. NewAPI is consumed as the
upstream image `ghcr.io/openova-io/openova/newapi-mirror` (a pinned
mirror, not a fork) — the metering envelope is produced by a Go sidecar
that observes the OpenAI-style `usage.total_tokens` field on every
2xx /v1/* response. This avoids forking the upstream binary while still
producing the canonical envelope shape on `catalyst.usage.recorded`.

A) NewAPI metering sidecar — core/services/metering-sidecar/
   - Transparent reverse proxy in front of NewAPI on its own port; the
     bp-newapi Service routes the cluster-fronting port to the sidecar,
     which forwards to NewAPI on the pod's loopback.
   - Observes successful /v1/* JSON responses, parses
     `usage.{prompt_tokens,completion_tokens,total_tokens}`, computes
     amount_micro_omr = -tokens * priceMicroOMRPerToken, and publishes
     one envelope on `catalyst.usage.recorded` per completed request.
   - Failed (non-2xx), non-JSON, and admin-path requests are NOT billed.
   - Customer-facing latency is NEVER blocked on metering: the response
     body is restored before publish; on NATS unreachable the envelope
     is persisted to disk and retried by a background drain loop.
   - 14 unit tests (proxy + publisher + safeFilename guards).
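
The billing rule above can be sketched as a pure function. This is an illustration, not the sidecar's code: `usageBody` and `amountMicroOMR` are hypothetical names, the admin-path exclusion is omitted, and skipping responses with zero tokens is my assumption.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// usageBody captures only the OpenAI-style usage block of a response.
type usageBody struct {
	Usage *struct {
		PromptTokens     int64 `json:"prompt_tokens"`
		CompletionTokens int64 `json:"completion_tokens"`
		TotalTokens      int64 `json:"total_tokens"`
	} `json:"usage"`
}

// amountMicroOMR returns the negative ledger amount for a response, and
// false when the request must not be billed (non-2xx, non-JSON, no usage).
func amountMicroOMR(status int, body []byte, pricePerToken int64) (int64, bool) {
	if status < 200 || status > 299 {
		return 0, false
	}
	var parsed usageBody
	if err := json.Unmarshal(body, &parsed); err != nil || parsed.Usage == nil {
		return 0, false
	}
	if parsed.Usage.TotalTokens <= 0 {
		return 0, false // assumption: nothing to bill
	}
	return -parsed.Usage.TotalTokens * pricePerToken, true
}

func main() {
	body := []byte(`{"usage":{"prompt_tokens":1000,"completion_tokens":500,"total_tokens":1500}}`)
	amount, billed := amountMicroOMR(200, body, 156)
	fmt.Println(amount, billed)
}
```

With the chart's default price of 156 micro-OMR per token, 1500 tokens come to -234000 micro-OMR, that is -0.234 OMR, with no floating-point arithmetic involved.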

B) sme-billing NATS subscriber — core/services/billing/handlers/
   metering_consumer.go
   - JetStream durable consumer `sme-billing-metering` on stream
     `CATALYST_USAGE` (provisioned by sme-billing on startup).
   - Idempotent on metadata.request_id via a UNIQUE partial index on
     credit_ledger.external_ref; redelivery from the broker collapses
     to a single ledger row.
   - Customer auto-create on cold start (the rbac sme.user.created
     envelope may land AFTER the first metered request; we don't strand
     usage waiting for it).
   - 11 unit tests covering happy-path, idempotency, malformed-payload
     poison-pill, missing-request-id, non-negative amount guard,
     resolver error → Nak, derive-micro-OMR-from-OMR, DB-error → Nak.

C) HTTP handler POST /billing/metering/record — handlers/metering.go
   - Synchronous validate → INSERT credit_ledger → return
     {ledger_entry_id, balance_after_omr, balance_after_micro_omr,
     duplicate}. Same payload + idempotency guard as the NATS path.
   - Auth: superadmin OR sovereign-admin (operator-admin model;
     end-user LLM traffic flows through the sidecar, never this URL).
   - 8 unit tests covering happy-path, idempotency, role gating,
     malformed-JSON, positive-amount rejection, customer-not-found.

D) Schema — core/services/billing/store/store.go
   - ALTER TABLE credit_ledger ADD COLUMN amount_micro_omr BIGINT
     (1 OMR = 1,000,000 micro-OMR; -0.000234 OMR = -234 micro-OMR
     exact integer — preserves precision at metering rates).
   - ADD COLUMN external_ref TEXT + UNIQUE partial index for
     idempotency dedup.
   - ADD COLUMN metadata JSONB for the raw envelope.
   - GetCreditBalance projects both amount_omr (legacy) and
     amount_micro_omr (new) into the integer-OMR view.
   - GetCreditBalanceMicroOMR returns canonical precision.
   - RecordUsage method: ON CONFLICT DO UPDATE … RETURNING (xmax<>0)
     distinguishes fresh insert from duplicate without a follow-up
     SELECT.
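
The idempotency contract can be modelled in memory as below. This is not the SQL: the `seen` map stands in for the UNIQUE partial index on `external_ref`, the `duplicate` return stands in for the `xmax<>0` check, and `ledger` and `recordUsage` are hypothetical names. Rejecting a zero amount is my reading of the "non-negative amount guard".

```go
package main

import (
	"errors"
	"fmt"
)

// ledger is a hypothetical in-memory model of credit_ledger.
type ledger struct {
	seen    map[string]bool // external_ref values already recorded
	balance int64           // running balance in micro-OMR
}

// recordUsage applies a usage entry at most once per externalRef and
// returns the balance after the call plus whether it was a duplicate.
func (l *ledger) recordUsage(externalRef string, amountMicroOMR int64) (int64, bool, error) {
	if externalRef == "" {
		return 0, false, errors.New("missing request id")
	}
	if amountMicroOMR >= 0 {
		return 0, false, errors.New("usage amount must be negative")
	}
	if l.seen[externalRef] {
		return l.balance, true, nil // redelivery collapses to the first row
	}
	if l.seen == nil {
		l.seen = make(map[string]bool)
	}
	l.seen[externalRef] = true
	l.balance += amountMicroOMR
	return l.balance, false, nil
}

func main() {
	l := &ledger{balance: 1_000_000}
	for i := 0; i < 2; i++ {
		bal, dup, err := l.recordUsage("req-123", -234)
		fmt.Println(bal, dup, err)
	}
}
```

Both the NATS consumer and the HTTP handler can share one path like this, so a redelivered envelope and a retried POST behave identically.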

E) Wiring
   - core/services/shared/events/nats.go — minimal NATS JetStream
     publisher + subscriber surface; legacy RedPanda producer/consumer
     in events.go untouched per [Q-mine-3].
   - core/services/billing/main.go — NATS_URL env; subscriber wired
     in parallel with the existing RedPanda tenant-events consumer.
   - middleware/jwt.go — exported test helper WithClaims so handler
     tests can construct an authenticated context without minting a
     real signed token.
   - .github/workflows/services-build.yaml — metering-sidecar added
     to the build matrix; deploy job skips it (image consumed by the
     bp-newapi chart, not products/catalyst sme-services).

F) bp-newapi chart (1.0.0 → 1.1.0)
   - meteringSidecar block in values.yaml: image, port, NATS URL,
     priceMicroOMRPerToken (default 156 = 0.000156 OMR/token), spool
     dir, header names, resources, securityContext (read-only-rootfs).
   - deployment.yaml renders the sidecar container + emptyDir spool
     volume when meteringSidecar.enabled (default true).
   - service.yaml routes the cluster-fronting :3000 to the sidecar
     when enabled, exposes a separate :3001 → NewAPI direct port for
     bp-catalyst-platform admin-API traffic (ADR-0003 §3.2).
   - networkpolicy.yaml allows the sidecar's port + nats-system
     egress for JetStream publish.

Tests: 33 new (14 sidecar + 11 subscriber + 8 HTTP handler), all green.
Helm template renders cleanly with sidecar enabled and disabled.

Closes #798

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(billing/store): cast SUM to BIGINT so lib/pq scans into int64 (#798)

Postgres returns `SUM(int) + SUM(bigint)/integer` as `numeric`, which
lib/pq presents as a `[]uint8` decimal string ("50.000000000000000000000000")
that does NOT scan directly into Go int64 — the integration test
TestVoucherLifecycle_IssueRedeemAndCreditApplied caught this in CI on
the post-redeem balance read.

Wrap the SUM expressions in CAST(... AS BIGINT) so the column type is
unambiguously bigint and Scan target stays uniform across pre-#798 rows
(amount_omr only) and post-#798 rows (amount_micro_omr present).
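
The failure mode is easy to reproduce without a database. To my understanding, `database/sql` converts a textual driver value into an integer destination with `strconv.ParseInt`, so the sketch below applies that same parse to the two renderings; `scansAsInt64` is a hypothetical helper, not part of the store.

```go
package main

import (
	"fmt"
	"strconv"
)

// scansAsInt64 reports whether a textual column value parses as a base-10
// int64, which approximates scanning that text into a Go int64.
func scansAsInt64(text string) bool {
	_, err := strconv.ParseInt(text, 10, 64)
	return err == nil
}

func main() {
	// The numeric rendering quoted in the commit message.
	fmt.Println(scansAsInt64("50.000000000000000000000000"))
	// The rendering once the expression is CAST(... AS BIGINT).
	fmt.Println(scansAsInt64("50"))
}
```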

Affects:
  - GetCreditBalance
  - GetCreditBalanceMicroOMR
  - RecordUsage's running-balance read

Test mocks updated to match the new SQL prefix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 22:32:42 +04:00
e3mrah
93bd3ace5b
feat(bp-openclaw): workspace controller + per-user pod chart (#803) (#810)
Implements locked decision [A] of epic #795: per-SME-tenant workspace
controller deployment + per-user runtime pod, identity-blind by
construction. Consumes the per-user newapi-key-{uuid} Secrets rendered
by the unified-rbac user-create hook (ADR-0003 §3.3).

What this delivers:
- platform/openclaw/chart/        bp-openclaw v0.1.0 (no-upstream)
- platform/openclaw/runtime/      Go reference runtime (NEWAPI_BASE_URL
                                  + NEWAPI_KEY env contract only)
- .github/workflows/openclaw-runtime.yaml
                                  Event-driven build for the runtime
                                  image (paths-on-push + manual rerun;
                                  NO schedule:cron per CLAUDE.md).
- platform/openclaw/blueprint.yaml
                                  Catalyst registration + configSchema.

Chart highlights:
- Required values guarded by _helpers.tpl :: assertRequired so missing
  realmURL/clientSecretName/tenant.namespace/baseURL/host fail render
  with helpful messages.
- RBAC: namespaced Role in tenant ns; create verbs split into separate
  rules WITHOUT resourceNames per feedback_rbac_create_no_resourcenames.md.
  Label-based ownership (catalyst.openova.io/openclaw-user) enforced at
  the controller, not in RBAC.
- ingress: cert-manager.io/cluster-issuer annotation triggers ACME
  auto-issuance for openclaw.<sme-domain>.
- per-user pod template ConfigMap holds the pod-spec the controller
  renders per session, with ${USER_UUID}/${SECRET_NAME} placeholders
  filled at session-start.
- networkPolicy covers controller pod only; per-user pod NetworkPolicy
  is rendered by the controller at session-start (target hostname is
  read from the per-user Secret which doesn't exist at chart-render
  time — documented in README.md).
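
Filling the `${USER_UUID}` and `${SECRET_NAME}` placeholders at session-start might be done along these lines. This is only a sketch using the standard library's `os.Expand`; `renderPodTemplate`, the template text, and the example values are hypothetical, and the controller may well substitute differently.

```go
package main

import (
	"fmt"
	"os"
)

// renderPodTemplate replaces ${NAME} placeholders from vars and fails if
// the template references a name that vars does not provide.
// Note that os.Expand also expands the bare $NAME form.
func renderPodTemplate(tmpl string, vars map[string]string) (string, error) {
	var missing []string
	out := os.Expand(tmpl, func(name string) string {
		v, ok := vars[name]
		if !ok {
			missing = append(missing, name)
		}
		return v
	})
	if len(missing) > 0 {
		return "", fmt.Errorf("unfilled placeholders: %v", missing)
	}
	return out, nil
}

func main() {
	tmpl := "name: openclaw-${USER_UUID}\nsecretName: ${SECRET_NAME}\n"
	out, err := renderPodTemplate(tmpl, map[string]string{
		"USER_UUID":   "1234",
		"SECRET_NAME": "newapi-key-1234",
	})
	fmt.Print(out)
	fmt.Println("error:", err)
}
```

Failing on an unfilled placeholder keeps a half-rendered pod spec from ever reaching the apiserver.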

Tests: chart/tests/render-toggles.sh (7 cases) covers required-value
enforcement, RBAC create+resourceNames violation guard, ServiceMonitor
default-off, networkPolicy toggle, pod-template placeholder presence,
cert-manager annotation. All seven gates pass locally.

Closes part of #795 (epic still open).

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 22:10:24 +04:00
e3mrah
9adca8442a
fix: ci actions:write + auth-layout overflow scroll (#712 followup, #721 followup) (#728)
Two unrelated production-bug fixes squashed because they came out of
the same live verification pass on console.openova.io 2026-05-04.

1. catalyst-build.yaml deploy job permissions
   PR #720 added a `gh workflow run blueprint-release.yaml` dispatch
   step at the end of the deploy job to close the bot-deploy-doesn't-
   trigger-workflows gap from #712. Step has been failing on every run
   since with HTTP 403 "Resource not accessible by integration"
   because GITHUB_TOKEN lacks `actions: write` by default.
   Result: blueprint-release was never dispatched after PR #722–727
   merged; the bp-catalyst-platform OCI artifact stayed on the
   pre-fix chart and any Sovereign provisioned afterwards picked up
   the buggy chart. Add the missing permission so dispatch succeeds.

2. AuthLayout.tsx vertical centering at small viewport heights
   The sign-in / verify cards were mathematically centered at
   1440×900 (Δ=0.008px verified via getBoundingClientRect in
   Playwright), but the founder reports the card sitting at the top of
   the screen on real-world viewports. Root cause: the right panel
   had `flex flex-1 items-center justify-center` which centers ONLY
   if the inner content fits within the viewport — at smaller heights
   the form's natural content flow pushed the card off-screen with
   no scroll fallback.
   Fix: add `items-stretch` to the outer flex (so the right panel
   fills full viewport height), `overflow-y-auto` on the right
   column (so the card can scroll inside its column when too tall),
   and `py-8` padding on the card wrapper (breathing room when
   scrolling kicks in). Result: card is vertically centered when
   content fits, and stays visible (column-scrollable) when it
   doesn't, on every viewport height from 1024×600 up.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
2026-05-04 12:44:44 +04:00
e3mrah
35183af5be
fix(ci): catalyst-build dispatches blueprint-release after deploy commit (closes #712) (#720)
* feat(bp-catalyst-platform): expose marketplace + tenant wildcard, bump 1.3.0 (closes #710)

Marketplace exposure for franchised Sovereigns. Otech becomes a SaaS
operator with a single overlay toggle.

Changes
=======

products/catalyst/chart:
- Chart.yaml 1.2.7 → 1.3.0
- values.yaml: ingress.marketplace.enabled toggle (default false) +
  marketplace.{brand,currency,paymentProvider,signupPolicy} surface
- templates/sme-services/marketplace-routes.yaml: HTTPRoute
  marketplace.<sov> with /api/ → marketplace-api, /back-office/ → admin,
  / → marketplace; HTTPRoute *.<sov> → console (per-tenant wildcard)
- templates/sme-services/marketplace-reference-grant.yaml: cross-
  namespace ReferenceGrant from catalyst-system HTTPRoute → sme Services
- .helmignore: stop excluding sme-services/* and marketplace-api/* (only
  *.kustomization.yaml + *.ingress.yaml remain Kustomize-only)
- All sme-services/* + marketplace-api/* manifests wrapped with
  {{ if .Values.ingress.marketplace.enabled }} so non-marketplace
  Sovereigns render the chart unchanged

clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml:
- chart version 1.2.7 → 1.3.0
- ingress.hosts.marketplace.host: marketplace.${SOVEREIGN_FQDN}
- ingress.marketplace.enabled: ${MARKETPLACE_ENABLED:-false}

infra/hetzner:
- variables.tf: marketplace_enabled var (string "true"/"false", default "false")
- main.tf: thread var into cloudinit-control-plane.tftpl
- cloudinit-control-plane.tftpl: postBuild.substitute.MARKETPLACE_ENABLED
  on bootstrap-kit, sovereign-tls, infrastructure-config Kustomizations

products/catalyst/bootstrap/api/internal/provisioner/provisioner.go:
- Request.MarketplaceEnabled bool (json:"marketplaceEnabled")
- writeTfvars: marketplace_enabled = "true"|"false"

core/pool-domain-manager/internal/allocator/allocator.go:
- canonicalRecordSet adds "marketplace" prefix → marketplace.<sov>
  resolves via PDM at zone-commit time (PR #710 explicit record so
  caches don't depend on the *.<sov> wildcard alone)

DoD ready
=========
- helm template with ingress.marketplace.enabled=false → identical
  manifest set to 1.2.7 (verified locally)
- helm template with ingress.marketplace.enabled=true → emits 17 extra
  resources: 13 sme-services workloads + 2 marketplace-api + 1
  HTTPRoute pair + 1 ReferenceGrant
- pdm tests: TestCanonicalRecordSet, TestCommitDNSShape green
- catalyst-api builds, provisioner cloudinit_path_test green

* fix(ci): catalyst-build dispatches blueprint-release after deploy commit (closes #712)

The deploy job's `git push` is made under GITHUB_TOKEN; per GitHub
Actions design, commits authored by GITHUB_TOKEN don't re-trigger
workflows. blueprint-release.yaml's `on.push.paths: products/*/chart/**`
filter matches the deploy commit's diff (chart/values.yaml +
chart/templates/{api,ui}-deployment.yaml), so the workflow SHOULD fire,
but doesn't — leaving the bp-catalyst-platform:1.2.7 OCI artifact stuck
on whatever catalyst-api SHA was current at the last manual chart-
touching PR.

Today (2026-05-03) this stranded otech62-otech66 on catalyst-api:74d08eb
six PRs after the SHA was superseded — every fresh Sovereign installed
the buggy pre-#701 image and rejected handover with 401 unauthenticated.

Fix: after `git push` succeeds in the deploy job, dispatch
blueprint-release explicitly via `gh workflow run`. The dispatched run
re-renders + re-publishes the chart with the just-pushed values.yaml.
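
A minimal sketch of the explicit dispatch described above, assuming the `gh` CLI is on PATH and authenticated through `GH_TOKEN`. The helper names `dispatch_command` and `dispatch` are hypothetical and only illustrate the call shape; the real step lives in the workflow YAML.

```python
import subprocess


def dispatch_command(workflow: str, ref: str = "main") -> list[str]:
    """Build the gh CLI invocation that dispatches a workflow on a ref.

    The target workflow must declare an `on: workflow_dispatch` trigger,
    otherwise gh refuses to run it.
    """
    return ["gh", "workflow", "run", workflow, "--ref", ref]


def dispatch(workflow: str, ref: str = "main") -> None:
    """Run the dispatch; raises CalledProcessError on a non-zero exit.

    Hypothetical wrapper: call this only after `git push` has succeeded,
    so the dispatched run sees the just-pushed values.yaml.
    """
    subprocess.run(dispatch_command(workflow, ref), check=True)
```

Raising on a non-zero exit keeps a failed dispatch visible in the job log instead of letting the deploy job report success.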

Closes #712.

---------

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
2026-05-04 07:49:03 +04:00
e3mrah
b5c9839da7
feat(phase-8b): sovereign wizard auth-gate + handover JWT minting + Playwright CI fixes (#611)
Squash of PR #611 (feat/607) + PR #615 (feat/605) Phase-8b deliverables:

UI:
- AuthCallbackPage: mode-aware dispatch (catalyst-zero → magic-link server
  callback; sovereign → client-side OIDC token exchange via oidc.ts)
- Router: sovereign console routes (/console/*), DETECTED_MODE index redirect,
  authCallbackRoute dedup fix, authHandoverRoute safety net
- StepSuccess: mints RS256 handover JWT via POST /deployments/{id}/mint-handover-token
  before redirecting operator to Sovereign console (falls back to plain URL on error)

API:
- main.go: wires handoverjwt.LoadOrGenerate signer from CATALYST_HANDOVER_KEY_PATH env
- deployments.go: stamps HandoverJWTPublicKey from signer.PublicJWK() at create time
- provisioner.go: injects HandoverJWTPublicKey into Tofu vars JSON
- auth.go: /auth/handover endpoint for seamless single-identity flow

Infra:
- cloudinit-control-plane.tftpl: writes handover JWT public JWK to /var/lib/catalyst/
- variables.tf: handover_jwt_public_key variable (sensitive, default empty)

Chart:
- api-deployment.yaml / ui-deployment.yaml / values.yaml: expose handover JWT env vars

Playwright CI fixes:
- playwright-smoke.yaml / cosmetic-guards.yaml: health-check URL /sovereign/wizard → /wizard
- playwright.config.ts: BASEPATH default /sovereign → / + baseURL construction fix
- cosmetic-guards.spec.ts: provision URL /sovereign/provision/* → /provision/*
- sovereign-wizard.spec.ts: WIZARD_URL /sovereign/wizard → /wizard

Closes #605, #606, #607. Fixes Playwright CI (#142 sovereign wizard smoke tests).

Co-authored-by: e3mrah <e3mrah@openova.io>
2026-05-02 19:17:56 +04:00
e3mrah
10c8e997c4
fix(catalyst): restore literal image refs in Kustomize-path deployment YAMLs (#614)
The feat/global-imageRegistry (#580) PR converted the literal image refs
in api-deployment.yaml and ui-deployment.yaml to Helm template expressions
({{ .Values.global.imageRegistry }}...) without updating the CI deploy step
to also patch those files. Since the catalyst-platform Flux Kustomization
reads these files as raw manifests (not via helm-controller), the Helm
template syntax was never rendered, leaving a literal '{{ if ... }}'
string as the image reference → InvalidImageName on every Pod start.
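
The failure mode above can be caught before a manifest reaches the cluster. A minimal sketch, assuming the raw manifests are available as text; `unrendered_image_refs` is a hypothetical helper, not something in the repo:

```python
import re

# Matches "image: <ref>" lines, with or without a leading list dash.
IMAGE_LINE = re.compile(r"^\s*(?:-\s*)?image:\s*(?P<ref>.+?)\s*$")


def unrendered_image_refs(manifest_text: str) -> list[tuple[int, str]]:
    """Return (line number, ref) for image refs that still contain Helm syntax.

    A manifest applied as raw YAML (the Kustomize path) never passes through
    Helm, so any "{{" left in an image ref reaches the kubelet verbatim.
    """
    hits = []
    for lineno, line in enumerate(manifest_text.splitlines(), start=1):
        match = IMAGE_LINE.match(line)
        if match and "{{" in match.group("ref"):
            hits.append((lineno, match.group("ref")))
    return hits
```

A guard like this only makes sense for files the Kustomize path consumes; chart templates rendered by helm-controller legitimately contain `{{ ... }}`.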

Root cause: two consumers of the same file — Helm chart path (Sovereign
clusters) and Kustomize path (contabo-mkt) — but only the Helm path was
handled by the deploy job.

Fix:
- Restore literal `ghcr.io/openova-io/openova/catalyst-{api,ui}:b50a600`
  image refs in the Kustomize-path deployment YAMLs (immediate unblock).
- Update CI deploy step to sed-patch those literal refs on every deploy
  commit so future image rolls keep both paths in sync (durable fix).

Closes: the InvalidImageName regression introduced in #580.
Unblocks: issue #608 (Phase-8b Agent A magic-link auth) — catalyst-api
was stuck at InvalidImageName since commit 83ec889f, preventing the
CATALYST_KC_ADDR / session-cookie auth gate from loading.

Co-authored-by: alierenbaysal <alierenbaysal@openova.io>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-02 18:29:09 +04:00
hatiyildiz
59fb2b742c fix(ci): use awk instead of python heredoc in deploy — fixes YAML parse error 2026-05-02 13:48:17 +02:00
hatiyildiz
885e032dc5 fix(ci): deploy job updates values.yaml SHA tags, not Helm template files
The previous sed targeted ui-deployment.yaml + api-deployment.yaml for
`image: ghcr.io/.../catalyst-ui:.*` but those files use Helm template
expressions (`{{ .Values.images.catalystUi.tag }}`), so sed silently
no-ops. Result: every catalyst build committed "No changes" and the
deployed image was never updated.

Fix: switch deploy job to update images.catalystUi.tag and
images.catalystApi.tag in products/catalyst/chart/values.yaml via
python3 regex (handles multiline YAML reliably).
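
A rough sketch of that kind of tag bump, assuming values.yaml nests each component as `<component>:` with an indented `tag:` key beneath it. `bump_tag` is a hypothetical helper and not the script the deploy job runs:

```python
import re


def bump_tag(values_text: str, component: str, new_tag: str) -> str:
    """Rewrite the first `tag:` found inside the `<component>:` block."""
    out = []
    block_indent = None  # indent of the "<component>:" line while inside its block
    for line in values_text.splitlines(keepends=True):
        stripped = line.strip()
        indent = len(line) - len(line.lstrip(" "))
        if block_indent is not None and stripped and indent <= block_indent:
            block_indent = None  # dedent: we have left the component block
        if stripped == f"{component}:":
            block_indent = indent
        elif block_indent is not None and re.match(r"tag:\s", stripped + " "):
            eol = "\n" if line.endswith("\n") else ""
            # Note: a trailing comment on the tag line is dropped by this rewrite.
            line = re.sub(r"(tag:\s*).*", rf'\g<1>"{new_tag}"', line.rstrip("\n")) + eol
            block_indent = None  # only touch the first tag in the block
        out.append(line)
    return "".join(out)
```

Because the earlier sed silently no-oped, a caller would probably want to fail the job when the returned text equals the input rather than commit "No changes".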

Also bump catalystUi + catalystApi tags to 32c5e43 (the build from
#596 / PR #599 — Vite base: '/' fix).

Fixes #596 deploy path.
2026-05-02 13:46:03 +02:00
e3mrah
942be6f58d
fix(ci): disable buildx provenance+sbom attestation in dynadot-webhook build (#583)
containerd 1.7.x on k3s cannot pull multi-arch images whose OCI index
includes an attestation manifest (the unknown/unknown platform entry added
by docker/build-push-action when provenance=true).  Containerd resolves
the manifest index, encounters the attestation entry, fetches its descriptor
from GHCR which returns an HTML 404 page, and then caches that HTML page as
a blob SHA — every subsequent pull of ANY tag for that image returns the same
HTML SHA instead of the real layer.

Fix: set provenance=false + sbom=false on the build-push-action step.
SBOM attestation is handled separately by cosign attest, which does not
embed its manifest into the OCI index.
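
To check whether an already-pushed tag carries such an entry, the OCI index can be inspected directly. A minimal sketch, assuming the index JSON has been fetched already (for example with `docker buildx imagetools inspect --raw <ref>`); `attestation_entries` is a hypothetical helper:

```python
def attestation_entries(index: dict) -> list[dict]:
    """Return index entries that look like buildx attestation manifests.

    Buildx records attestations as extra manifests whose platform is
    unknown/unknown and which carry the annotation
    vnd.docker.reference.type=attestation-manifest.
    """
    hits = []
    for manifest in index.get("manifests", []):
        platform = manifest.get("platform", {})
        annotations = manifest.get("annotations", {})
        unknown_platform = (
            platform.get("os") == "unknown"
            and platform.get("architecture") == "unknown"
        )
        tagged = annotations.get("vnd.docker.reference.type") == "attestation-manifest"
        if unknown_platform or tagged:
            hits.append(manifest)
    return hits
```

An empty result for a freshly built tag is a reasonable signal that provenance=false + sbom=false took effect.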

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-02 14:29:58 +04:00
e3mrah
52c6938e02
ci(catalyst-build): watch infra/hetzner/** so cloudinit changes rebuild catalyst-api (#472)
Phase-8a-preflight bug #2 (after #471's tftpl escape fix): catalyst-api
Docker image bakes /infra/hetzner/cloudinit-control-plane.tftpl. Without
this path in the build trigger, fixes to that file do NOT rebuild the
image — the running pod keeps using the stale tftpl and provisioning
keeps failing with the same Tofu error.

Per CLAUDE.md Rule 4a (GitHub Actions is the only build path), the path
filter MUST cover every directory the image depends on. Missing
infra/hetzner/** was a long-standing latent CI bug — surfaced by
Phase-8a #454 first live provision attempt.

Co-authored-by: hatiyildiz <hatiyildiz@noreply.github.com>
2026-05-01 20:34:13 +04:00
e3mrah
1628a1b3aa
ci(preflight): GHCR auth for A+E + WBS tick — all 4 preflights done (#470)
First runs of preflight A (bootstrap-kit) and E (Keycloak) failed with the
same error: helm OCI pull from ghcr.io/openova-io/bp-* returning 401
'unauthorized: authentication required'. bp-* are PRIVATE GHCR packages.

#460's agent fixed it for B in c26fbcaf, and #461's workflow (preflight C)
already had GHCR login. This commit applies the same helm-registry-login
pattern to A and E.

WBS state on main after this commit:
- done (35): all chart-level + #317 + #319 + #453 + 4 preflights
- wip (0)
- blocked (3): 454, 455, 456 (Phase-8 live runs, operator-driven)

The preflights' first runs ALREADY surfaced a real CI bug pattern that
would have hit Phase 8a — exactly what they're for.

Co-authored-by: hatiyildiz <hatiyildiz@noreply.github.com>
2026-05-01 20:06:36 +04:00
e3mrah
4a7eb42d26
feat(ci): Phase-8a preflight E — Keycloak realm-import + kubectl OIDC client (closes #462) (#468)
Surfaces Risk R6 (docs/omantel-handover-wbs.md §9a — Keycloak
realm-import config-CLI bootstrap timing untested). bp-keycloak 1.2.0
ships a sovereign realm + a public kubectl OIDC client via the
upstream bitnami/keycloak chart's keycloakConfigCli post-install Helm
hook (issue #326); this workflow proves it actually wires up on a
clean cluster before we run it on a real Sovereign.

Workflow installs bp-keycloak 1.2.0 on a kind cluster (helm/kind-action
v1, kindest/node:v1.30.6 — same versions as test-bootstrap-kit), waits
for the keycloak StatefulSet to roll out, polls for the
keycloakConfigCli post-install Job by label
(app.kubernetes.io/component=keycloak-config-cli), waits for it to
Complete, port-forwards svc/keycloak and asserts:

  1. /realms/sovereign returns 200 (realm exists in Keycloak's DB).
  2. The kubectl OIDC client is provisioned with publicClient=true,
     redirectUris contains http://localhost:8000 (kubectl-oidc-login
     default), and the groups client scope is wired with the
     oidc-group-membership-mapper (the per-Sovereign k3s api-server's
     --oidc-groups-claim flag depends on this).

Acceptance per ticket: if the post-install Job fails, the workflow
summary captures Job logs + StatefulSet logs + cluster state via
GITHUB_STEP_SUMMARY so a failed run is debuggable without re-running.

Triggers are event-driven only per CLAUDE.md "every workflow MUST be
event-driven, NEVER scheduled" rule — push on the workflow file itself
plus workflow_dispatch for ad-hoc re-runs.

Closes #462.

Co-authored-by: hatiyildiz <hatiyildiz@noreply.github.com>
2026-05-01 20:01:30 +04:00
e3mrah
abac00d8b3
feat(ci): Phase-8a preflight A — bootstrap-kit reconcile dry-run on kind (closes #459) (#467)
Surfaces Risk-register R4 (docs/omantel-handover-wbs.md §9a — bootstrap-kit
reconcile-chain order untested under load) before Phase 8a (#454) burns
Hetzner credit on test.omani.works.

New workflow .github/workflows/preflight-bootstrap-kit.yaml:
- kind v0.25.0 + kindest/node:v1.30.6
- Gateway API CRDs v1.2.0 standard channel
- Full Flux controller set (fluxcd/flux2/action@main + flux install)
- Mock Secrets: flux-system/object-storage, flux-system/cloud-credentials,
  flux-system/ghcr-pull
- Renders clusters/_template/bootstrap-kit/ with SOVEREIGN_FQDN_PLACEHOLDER
  + ${SOVEREIGN_FQDN} -> test-sov.example.com (matches test harness pattern
  in tests/e2e/bootstrap-kit/main_test.go:247)
- 30 x 30s HR poll loop, never-fail-fast (goal: surface ALL bugs, not stop
  at first)
- $GITHUB_STEP_SUMMARY emits Markdown table of every HR's terminal Ready
  condition + per-HR describe blocks for non-Ready + recent flux-system
  events + raw hrs.json artefact (14d retention)
- Event-driven only: push on self-edit + workflow_dispatch; no schedule:
  cron (per CLAUDE.md "every workflow MUST be event-driven")
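
The never-fail-fast poll loop in the list above could look roughly like this. It is a sketch under stated assumptions: `fetch` stands in for whatever reads HelmRelease Ready conditions from the cluster, and both function names are hypothetical.

```python
import time
from typing import Callable


def poll_until_ready(
    fetch: Callable[[], dict[str, bool]],
    attempts: int = 30,
    interval_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict[str, bool]:
    """Poll every HelmRelease's Ready state; never abort on the first failure.

    Returns the last observed {name: ready} map so the caller can report
    ALL non-Ready releases, not just the first one encountered.
    """
    latest: dict[str, bool] = {}
    for attempt in range(attempts):
        latest = fetch()
        if latest and all(latest.values()):
            break
        if attempt < attempts - 1:
            sleep(interval_s)
    return latest


def summary_table(statuses: dict[str, bool]) -> str:
    """Render the terminal states as a Markdown table for the step summary."""
    rows = ["| HelmRelease | Ready |", "| --- | --- |"]
    rows += [f"| {name} | {ok} |" for name, ok in sorted(statuses.items())]
    return "\n".join(rows)
```

Injecting `sleep` keeps the loop testable without waiting the full 30 x 30s budget.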

Canonical seam reused (no duplication):
- kind setup + flux install pattern from .github/workflows/test-bootstrap-kit.yaml
- bootstrap-kit kustomization at clusters/_template/bootstrap-kit/ (the
  same overlay production Sovereigns consume; substitution shape mirrors
  tests/e2e/bootstrap-kit/main_test.go:247)
- event-driven shape per .github/workflows/check-vendor-coupling.yaml (#428)

Out of scope (sibling preflights):
- #460 Crossplane provider-hcloud Healthy probe
- #461 Cilium Gateway HTTPRoute admission
- #462 Keycloak realm-import

Validated: actionlint clean, YAML parses cleanly.

WBS row #459 in §9 updated: 🟡 in flight -> 🟢 done (workflow shipped).

Co-authored-by: hatiyildiz <hatiyildiz@noreply.github.com>
2026-05-01 20:01:26 +04:00
e3mrah
6f9ee43a9d
fix(ci): GHCR auth for bp-crossplane OCI pull in preflight (#460) (#466)
Run 25221515110 surfaced the exact blocking error the workflow was
designed to surface — but for the install step, not the Healthy probe:

  Error: INSTALLATION FAILED: failed to perform "FetchReference" on
  source: GET "https://ghcr.io/v2/openova-io/bp-crossplane/manifests/1.1.3":
  ... 401: unauthorized: authentication required

bp-crossplane is a PRIVATE GHCR package (verified via
`gh api /orgs/openova-io/packages/container/bp-crossplane`). The fix
mirrors the canonical seam in .github/workflows/blueprint-release.yaml:
add `packages: read` to the job permissions and run
`helm registry login ghcr.io` against GITHUB_TOKEN before the
`helm install oci://...` step. No new pattern; just reuse.

This unblocks the actual goal of #460 — observing provider-hcloud
Healthy=True (or surfacing whatever blocks it) on a kind cluster.

Co-authored-by: hatiyildiz <hatiyildiz@noreply.github.com>
2026-05-01 20:01:15 +04:00
e3mrah
48b73af6ae
feat(ci): Phase-8a preflight C — Cilium Gateway HTTPRoute admission on kind (closes #461) (#465)
Surfaces Risk-register R3 (docs/omantel-handover-wbs.md §9a) — Cilium
Gateway HTTPRoute admission was untested on contabo because contabo
runs Traefik (no `cilium-gateway` Gateway present per ADR-0001 §9.4).

This workflow boots a kind cluster, installs upstream Cilium 1.16.5
with `gatewayAPI.enabled=true`, applies the per-Sovereign Gateway
shape from `clusters/_template/bootstrap-kit/01-cilium.yaml` (HTTP
listener only — TLS is Phase 8a), pulls bp-catalyst-platform:1.1.8
from GHCR, renders its httproute.yaml template with sovereign overlay
values, and asserts that `catalyst-ui` and `catalyst-api` HTTPRoutes
both reach Accepted=True against the Cilium Gateway.

Anti-duplication: GHCR helm-registry-login mirrors
blueprint-release.yaml (lines 173-177); kind+Cilium pattern matches
playwright-smoke shape; per-Sovereign Gateway is a 1:1 mirror of the
canonical bootstrap-kit slot 01 (HTTP listener), no new shape invented.

Trigger pattern is event-driven per CLAUDE.md: push on this file or
the chart templates it validates, plus workflow_dispatch for re-runs.
No cron.

Out of scope (Phase 8a/8b): TLS termination, real DNS resolution,
backend Deployment health, the 10 leaf bp-* dependencies (which have
their own chart-verify smoke runs).

Co-authored-by: hatiyildiz <hatiyildiz@noreply.github.com>
2026-05-01 20:01:01 +04:00
e3mrah
48a1623b28
feat(ci): Phase-8a preflight B — Crossplane provider-hcloud Healthy on kind (closes #460) (#463)
Surfaces Risk-register R2 (docs/omantel-handover-wbs.md §9a — provider-hcloud
Healthy=True never observed). New workflow spins up kind, installs bp-crossplane
1.1.3 from GHCR, applies the EXACT Provider + ProviderConfig shape from
infra/hetzner/cloudinit-control-plane.tftpl (#425), waits up to 5 min for
Healthy=True, plants a fake hcloud-token Secret in flux-system to match the
canonical secretRef, and asserts the ProviderConfig is accepted by the API.

Reuses existing seams:
- helm/kind-action@v1 pattern from .github/workflows/test-bootstrap-kit.yaml
- event-driven trigger shape from .github/workflows/check-vendor-coupling.yaml
- canonical Provider/ProviderConfig YAML from infra/hetzner/cloudinit-control-plane.tftpl

No schedule: cron (per CLAUDE.md "every workflow MUST be event-driven").
No live Hetzner calls — fake-readonly-token only; real-credential validation
is Phase 8a, not this preflight.

Co-authored-by: hatiyildiz <hatiyildiz@noreply.github.com>
2026-05-01 19:58:32 +04:00
e3mrah
1e7d1e67c9
test(e2e): omantel handover Playwright scaffold for Phase 8 (closes #429) (#432)
Phase 8 of the omantel handover (#369) needs an automated E2E that proves
DoD: omantel.omani.works runs as a fully self-sufficient Sovereign with
zero contabo dependency post-handover. Today this is a SCAFFOLD — when
Phase 4/6/7 land, dispatching the new workflow against a live omantel is
the entire Phase 8.

Canonical seam (anti-duplication, per memory/feedback_anti_duplication_seam_first.md):
  - tests/e2e/playwright/tests/  ← mirror of sovereign-wizard.spec.ts shape
    (NOT specs/ as the issue body said — actual repo path is tests/)
  - tests/e2e/playwright/playwright.config.ts (BASE_URL handling, retries,
    workers=1, reporter=list) — reused as-is
  - tests/e2e/playwright/tests/_helpers.ts:reachable() — reused for the
    pre-flight skip-when-unreachable pattern
  - .github/workflows/playwright-smoke.yaml — workflow shape (checkout v4,
    setup-node v4, npm install, playwright install --with-deps chromium,
    upload-artifact on failure) — mirrored, NOT duplicated

What ships:
  - tests/e2e/playwright/tests/omantel-handover.spec.ts (NEW, 6 tests):
      1. sovereign Ready + 23/23 blueprints
      2. all bp-* HelmReleases Ready=True
      3. catalyst-platform self-hosts (healthz + dashboard "23 / 23 ready")
      4. vendor-agnostic Object Storage (post-#425 canonical secret name
         flux-system/object-storage — NOT hetzner-object-storage)
      5. dig +trace omantel.omani.works ends at omantel NS, not contabo
      6. zero contabo dependency (omantel /api/healthz keeps returning 200)
    Self-skips when OMANTEL_BASE_URL/OMANTEL_API_BASE/OPERATOR_BEARER unset.

  - .github/workflows/omantel-e2e-handover.yaml (NEW):
    workflow_dispatch ONLY (no schedule cron — per CLAUDE.md "every workflow
    MUST be event-driven, NEVER scheduled"). Inputs let the operator override
    base URLs at dispatch time.

  - docs/omantel-handover-wbs.md:
    new §10 "Phase 8 acceptance criteria (executable DoD)" — 6 bullets 1:1
    with the spec test() blocks; §9 status row added for #429
    (🟢 scaffold-shipped).

Local verification:
  cd tests/e2e/playwright && npm install && \
    npx playwright test --list tests/omantel-handover.spec.ts
  → 6 tests listed cleanly
  npx playwright test tests/omantel-handover.spec.ts
  → 6 skipped (env vars unset, expected)

Out of scope (per #425 / #428 territory split):
  - internal/hetzner/, infra/hetzner/, platform/velero/chart/,
    clusters/.../34-velero.yaml — #425's vendor-agnostic sweep
  - .github/workflows/check-vendor-coupling.yaml — #428's coupling guard

Co-authored-by: hatiyildiz <hatiyildiz@noreply.github.com>
2026-05-01 17:52:18 +04:00
e3mrah
0fdd411e79
ci(guardrail): vendor-coupling check - fail CI if chart values use vendor name (closes #428) (#431)
Adds scripts/check-vendor-coupling.sh + .github/workflows/check-vendor-coupling.yaml
that scan platform/, clusters/, products/catalyst/bootstrap/{api,ui} for vendor names
(hetzner|aws|gcp|azure|oci) appearing in capability-named slots:

  1. <vendor>-object-storage          (sealed-secret / overlay-secret name)
  2. <chart>Overlay\.<vendor>\.       (chart values block keyed to vendor)
  3. <vendor>ObjectStorage            (camelCase payload field)

Excludes legitimately-per-provider paths (infra/<provider>/, internal/<provider>/,
internal/objectstorage/<provider>/, core/pkg/<provider>/), Crossplane Provider CR
refs (lines containing "crossplane-contrib/provider-"), and *.md files (docs may
discuss the rule).
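
The three slots and the exclusions above translate fairly directly into regular expressions. The sketch below is a Python approximation of the rule, not the contents of scripts/check-vendor-coupling.sh; the pattern labels and the `find_coupling` helper are hypothetical.

```python
import re

VENDORS = ("hetzner", "aws", "gcp", "azure", "oci")
_V = "|".join(VENDORS)

# One pattern per capability-named slot listed in the commit message.
PATTERNS = {
    "secret-name": re.compile(rf"\b(?:{_V})-object-storage\b"),
    "overlay-block": re.compile(rf"\w+Overlay\.(?:{_V})\."),
    "camel-field": re.compile(rf"\b(?:{_V})ObjectStorage\b"),
}

# Paths that are legitimately per-provider.
EXCLUDED_PATH = re.compile(
    rf"(?:^|/)(?:infra|internal(?:/objectstorage)?|core/pkg)/(?:{_V})/"
)


def find_coupling(path: str, text: str) -> list[tuple[int, str]]:
    """Return (line number, pattern label) for each vendor-coupled line."""
    if path.endswith(".md") or EXCLUDED_PATH.search(path):
        return []
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if "crossplane-contrib/provider-" in line:
            continue  # Crossplane Provider CR refs are allowed to name a vendor
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                hits.append((lineno, label))
    return hits
```

The warn-only versus hard-fail mode gate would sit in the caller, which decides the exit code from whether any hits were returned.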

Mode gate: warn-only while internal/objectstorage/ does not exist (pre-#425
work-in-progress); hard-fail once that directory lands. Locally on this branch
the script emits 49 warnings to stderr and exits 0 against the existing
hetzner-coupled references in platform/velero, platform/seaweedfs, and
clusters/.../bootstrap-kit/34-velero.yaml; once #425's rename lands those
warnings disappear and any future re-introduction fails CI.

Workflow trigger surface: push-to-main + pull_request on the scanned paths +
workflow_dispatch. No schedule: cron per CLAUDE.md "every workflow MUST be
event-driven, NEVER scheduled".

Canonical seam used: scripts/ + .github/workflows/ (mirrors
scripts/check-bootstrap-deps.sh + .github/workflows/blueprint-release.yaml
shape). NOT a duplicate - no prior vendor-coupling guard existed.

Refs: docs/omantel-handover-wbs.md §3a (canonical-seam map)
      docs/INVIOLABLE-PRINCIPLES.md #4 (never hardcode)

Co-authored-by: hatiyildiz <hatiyildiz@noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 17:49:49 +04:00
e3mrah
956b976558
fix(ci): playwright-smoke port 4321→5173 for Vite 8 default (#335) (#418)
The catalyst-ui dev-server bind moved from 4321 to 5173 when the Vite
default changed (Vite 8). The smoke workflow's curl-wait and BASE_URL env
still pointed at 4321, so:

  Vite 8 starts fine on 5173 →
    workflow polls 4321 for 60s → never returns 200 →
      step exits 1 before Playwright ever runs.

Effect across last ~30 main commits: every push generated a 'Playwright UI
smoke failed' email despite the UI itself being healthy. We've been
shipping with --admin bypass + post-deploy verification against
console.openova.io. This restores actual smoke coverage on every PR.

Three substitutions on .github/workflows/playwright-smoke.yaml:
  - line 80 curl wait URL: localhost:4321 → localhost:5173
  - line 93 BASE_URL env: 4321 → 5173
  - line 72-73 comment: stale 'Vite binds 4321 by default' → 5173

Closes #335.

Co-authored-by: hatiyildiz <hati@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 17:04:11 +04:00
e3mrah
4d24914ae4
feat(wipe): deployment-level Cancel & Wipe — backend endpoint + Cloud-Architecture + wizard banner entry-points (closes #318) (#346)
* feat(wipe): deployment-level Cancel & Wipe — backend endpoint + Cloud-Architecture + wizard banner entry-points (closes #318)

Adds a first-class Phase-0 recovery surface so an operator can purge a
failed pre-handover deployment from the wizard UI without dropping to
hcloud CLI runbooks. Two entry-points, one canonical implementation.

## Backend

NEW: products/catalyst/bootstrap/api/internal/handler/wipe.go
  POST /api/v1/deployments/{id}/wipe — single-flight destructive op:
    1. tofu destroy against the per-deployment workdir (idempotent).
    2. Hetzner orphan force-purge by label-selector
       `catalyst-deployment-id=<id>` (servers, load balancers,
       networks, firewalls, ssh-keys). Belt-and-braces — catches
       resources tofu didn't track (half-failed cloud-init, manual
       experiments). Per docs/INVIOLABLE-PRINCIPLES.md #3 this direct
       API path is fallback ONLY for orphan cleanup, never new
       resource creation.
    3. PDM /v1/release for pool-subdomain Sovereigns (best-effort).
    4. Local cleanup: kubeconfig file (mode 0600), tofu workdir,
       on-disk deployment record JSON.
    5. SSE events stream throughout on the same channel as the
       original provisioning + Phase-1 watch.
    6. Marks Status="wiped"; sync.Map entry reaped after a 60s TTL.

NEW: products/catalyst/bootstrap/api/internal/hetzner/purge.go
  Hetzner Cloud API enumeration + force-delete by label selector.
  Uses a 60s timeout (vs the 10s ValidateToken default) because async
  server-delete jobs can queue. 404s treated as success (already gone).
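
  The "404s treated as success" rule is what makes the purge safe to
  re-run. A minimal sketch of that classification, assuming `delete`
  returns the HTTP status code of one delete call; `purge` and the
  report keys are hypothetical and do not mirror purge.go:

```python
from typing import Callable, Iterable


def purge(ids: Iterable[int], delete: Callable[[int], int]) -> dict[str, list[int]]:
    """Delete each resource and classify the outcome by HTTP status.

    A 404 means the resource is already gone, which is the desired end
    state, so it counts as success and a repeated wipe stays idempotent.
    """
    report: dict[str, list[int]] = {"removed": [], "already_gone": [], "failed": []}
    for resource_id in ids:
        status = delete(resource_id)
        if 200 <= status < 300:
            report["removed"].append(resource_id)
        elif status == 404:
            report["already_gone"].append(resource_id)
        else:
            report["failed"].append(resource_id)
    return report
```

  Only the `failed` bucket should make a wipe report an error; running
  the same purge twice moves ids from `removed` to `already_gone`.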

NEW: products/catalyst/bootstrap/api/internal/provisioner/provisioner.go
  Provisioner.Destroy() — runs `tofu destroy -auto-approve` against
  the per-deployment workdir, then removes the workdir on success so
  re-provisioning starts fresh. Re-stages module + tfvars first so a
  partially-cleaned workdir still has what tofu needs.

TOUCHED: products/catalyst/bootstrap/api/cmd/api/main.go
  Registers POST /api/v1/deployments/{id}/wipe.

## Frontend (aligned with existing CrudModals conventions per founder
##           directive — no ad-hoc surface)

NEW: products/catalyst/bootstrap/ui/src/components/CrudModals/WipeDeploymentModal.tsx
  Two-stage modal built on the canonical ModalShell. Pre-wipe confirm
  view requires the operator to:
    - Type the sovereign FQDN to confirm scope.
    - Re-paste their Hetzner Cloud API token (catalyst-api intentionally
      GCs the original after writeTfvars per credential hygiene).
  Post-wipe success view shows the PurgeReport (servers, lbs, networks,
  firewalls, ssh-keys removed; tofu/PDM/local-state ✓/✗) and a
  "Start fresh deployment" CTA that navigates to /sovereign.

TOUCHED: products/catalyst/bootstrap/ui/src/components/CrudModals/index.ts
  Re-exports WipeDeploymentModal + WipeReport.

TOUCHED: products/catalyst/bootstrap/ui/src/pages/sovereign/AppsPage.tsx
  FailureCard now exposes a "Cancel & Wipe" red button next to
  "Retry stream" / "Back to wizard" — opens WipeDeploymentModal.

TOUCHED: products/catalyst/bootstrap/ui/src/pages/sovereign/InfrastructureTopology.tsx
  Cloud → Architecture canvas: the `cloud` (root) node action menu
  gains "Cancel & Wipe deployment" as a `danger:true` action,
  alongside the existing "+ Add region". Distinct from the
  per-resource DeleteCascadeConfirm on region/cluster/vCluster — this
  is deployment-scope (Phase-0 orphan purge), the others are
  Crossplane-XRC scope (day-2). The two paths coexist; operators
  choose by what state the deployment is in.

## Why two entry-points

Wizard banner (failed state on AppsPage) — recovery from a known
failure. Already a red-banner page; the button is right there.

Cloud → Architecture cloud-node action — proactive cancel from the
canvas, mirrors how the existing per-resource deletes are reachable.
Same modal, same backend.

## Constraints honoured

- Per docs/INVIOLABLE-PRINCIPLES.md #3 (Crossplane is the ONLY day-2
  IaC): the per-resource DELETE handler at infrastructure.go is
  unchanged and continues to flip XRC deletionPolicy. Wipe operates
  ONLY in Phase-0 scope where Crossplane never adopted resources.
- Per #4 (never hardcode): every endpoint lives behind API_BASE; the
  Hetzner purge enumerates by deterministic label selector built from
  var.sovereign_fqdn (the OpenTofu module's existing tagging convention).
- Per credential hygiene: the Hetzner token is re-prompted at wipe time
  rather than persisted; the modal uses an <input type="password">.

## Refs

#318 — pre-handover wipe spec (this PR closes it)
#317 — handover finalisation (sibling; this PR is the failure-path
       complement)
feedback_idempotent_iac_purge.md — operator runbook this implements
PR #313 — sealed-secrets cleanup (independent; safe to land in any order)
PR #334 — bp-external-secrets split (independent)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(ci): catalyst-build event-driven only — drop cron, push-on-main with path filter

Per docs/INVIOLABLE-PRINCIPLES.md (event-driven end to end — Flux
dependsOn, NATS JetStream, SSE, Helm hooks), GitHub Actions must follow
the same model. The previous `schedule: cron 0 3 * * *` daily build was
the only canonical deploy path, which created a 24h roll latency on
every change to the catalyst surface and incentivised "wait for cron"
stalls in operator workflows.

Replaces with:
  on:
    push:
      branches: [main]
      paths:
        - 'core/console/**'
        - 'core/admin/**'
        - 'core/marketplace/**'
        - 'core/marketplace-api/**'
        - 'products/catalyst/bootstrap/**'
        - 'products/catalyst/chart/**'
        - '.github/workflows/catalyst-build.yaml'
    workflow_dispatch:

`workflow_dispatch` retained for ad-hoc re-runs (config-only changes
that bypass the path filter, e.g. a secret rotation that doesn't touch
code). Path filter mirrors the actual surface this workflow rebuilds.

After this lands, every merge to main that touches the catalyst surface
auto-deploys. No cron lag.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Hatice Yildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 09:24:40 +04:00
e3mrah
2de8bb68b9
fix(ci): bump helm 3.16.3 → 3.18.4 in blueprint-release — fixes seaweedfs smoke-render (#336)
Fixes the 'function fromToml not defined' error on bp-seaweedfs publish.
Upstream seaweedfs/seaweedfs 4.22.0 (templates/shared/security-configmap.yaml:21)
uses fromToml, which exists in 3.13+, but the rendered context in the smoke
step needs newer Sprig functions present in 3.18+. The bump unblocks the
chain of HRs (bp-loki, bp-mimir, bp-tempo, bp-velero, bp-harbor, bp-grafana)
that are all blocked on the bp-seaweedfs publish.

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
2026-04-30 23:27:45 +04:00
e3mrah
5502d9aa48
feat(dns): cert-manager-dynadot-webhook for DNS-01 wildcard TLS (closes #159) (#291)
Activates the previously-templated `letsencrypt-dns01-prod` ClusterIssuer
in bp-cert-manager by shipping the missing piece — a Go binary that
satisfies cert-manager's external webhook contract
(`webhook.acme.cert-manager.io/v1alpha1`) against the Dynadot api3.json.

Architecture
============

* `core/pkg/dynadot-client/` — canonical Dynadot HTTP client (shared with
  pool-domain-manager and catalyst-dns). Encapsulates the api3.json
  transport, command builders, response decoding, and the safe
  read-modify-write semantics required to never accidentally wipe a
  zone (memory: feedback_dynadot_dns.md). Destructive `set_dns2`
  variant is unexported.
* `core/cmd/cert-manager-dynadot-webhook/` — the cert-manager webhook
  binary. Implements `Solver.Present` via the client's append-only
  `AddRecord` path and `Solver.CleanUp` via the read-modify-write
  `RemoveSubRecord` path. Domain allowlist (`DYNADOT_MANAGED_DOMAINS`)
  rejects challenges for unmanaged apexes BEFORE any Dynadot call.
* `platform/cert-manager-dynadot-webhook/` — Catalyst-authored Helm
  wrapper. Templates Deployment + Service + APIService + serving
  Certificate (CA chain via cert-manager Issuer self-signing) +
  RBAC + ServiceAccount. Mirrors the standard cert-manager external-
  webhook deployment shape.
* `platform/cert-manager/chart/` — flips `dns01.enabled: true` so the
  paired ClusterIssuer activates. The interim http01 issuer remains
  templated as the rollback path.
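
The allowlist check that runs before any Dynadot call can be sketched as a suffix match on DNS labels. This assumes `DYNADOT_MANAGED_DOMAINS` is a comma-separated list of apex domains, which the commit message does not state; `parse_allowlist` and `is_managed` are hypothetical names, and the real webhook is a Go binary.

```python
def parse_allowlist(raw: str) -> list[str]:
    """Split a comma-separated allowlist, dropping blanks (format is assumed)."""
    return [item.strip() for item in raw.split(",") if item.strip()]


def is_managed(fqdn: str, managed_domains: list[str]) -> bool:
    """True if fqdn is one of the managed apexes or a name beneath one.

    Matching on ".<apex>" rather than a bare suffix keeps look-alike
    domains (e.g. "evilomani.works" vs "omani.works") out.
    """
    name = fqdn.rstrip(".").lower()
    for apex in managed_domains:
        apex = apex.strip().rstrip(".").lower()
        if apex and (name == apex or name.endswith("." + apex)):
            return True
    return False
```

Rejecting before the first API call means an unmanaged challenge can never trigger a read-modify-write against a zone the webhook does not own.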

Test results
============

  core/pkg/dynadot-client          — 7 tests PASS  (race-clean)
  core/cmd/cert-manager-dynadot-... — 9 tests PASS  (race-clean)

Test coverage includes a Present/CleanUp round-trip against an
httptest fixture that models Dynadot's zone state, an explicit
unmanaged-domain rejection, a regression preserving a pre-existing
CNAME across the DNS-01 round-trip (the zone-wipe defence), and a
typed-error propagation test that surfaces `ErrInvalidToken` to
cert-manager so the controller will retry.

Helm template smoke render
==========================

`helm template` against the new chart with default values yields 12
resources / 424 lines (APIService, Certificate, ClusterRoleBinding,
Deployment, Issuer, Role, RoleBinding, Service, ServiceAccount). The
modified bp-cert-manager chart still renders both ClusterIssuers
(`letsencrypt-dns01-prod` + `letsencrypt-http01-prod`) with default
values; flipping `certManager.issuers.dns01.enabled=false` is the
clean rollback.

Smoke command (post-deploy)
===========================

  kubectl get apiservices.apiregistration.k8s.io \
    v1alpha1.acme.dynadot.openova.io
  # Issue a *.<sovereign>.<pool> wildcard cert and watch the
  # Order/Challenge progress through cert-manager.

CI
==

`.github/workflows/build-cert-manager-dynadot-webhook.yaml` mirrors the
pool-domain-manager-build pattern (cosign keyless signing, SBOM
attestation, GHCR push at
`ghcr.io/openova-io/openova/cert-manager-dynadot-webhook:<sha>`).
Triggered by changes to either the binary or the shared dynadot-client
package.

Closes #159

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 19:37:47 +04:00
e3mrah
0289f0388d
feat(scripts): bootstrap-kit dependency-graph audit script (W2.K0) (#259)
Adds scripts/check-bootstrap-deps.sh + scripts/expected-bootstrap-deps.yaml,
the W2.K0 deliverable from docs/BOOTSTRAP-KIT-EXPANSION-PLAN.md §2 + §3.

The script parses every clusters/_template/bootstrap-kit/*.yaml, extracts
metadata.name + spec.dependsOn for the HelmRelease document(s), and
mechanically verifies the actual graph against the expected DAG declared
in scripts/expected-bootstrap-deps.yaml. It detects cycles via Kahn's
algorithm and prints the rendered DAG as ASCII grouped by Wave 2 batch
(W2.K1-K4) on success.
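
For readers unfamiliar with Kahn's algorithm, the cycle check can be sketched
as below. This is an illustrative Python version under stated assumptions,
not the shell script added here: `kahn_order` is a hypothetical name, and the
graph is assumed to be already parsed into a mapping from each HelmRelease
name to its `dependsOn` names.

```python
from collections import deque


def kahn_order(depends_on: dict[str, list[str]]) -> list[str]:
    """Return a topological order, or raise ValueError if a cycle exists."""
    nodes = set(depends_on) | {d for deps in depends_on.values() for d in deps}
    indegree = {n: 0 for n in nodes}
    dependents = {n: [] for n in nodes}
    for name, deps in depends_on.items():
        for dep in deps:
            indegree[name] += 1
            dependents[dep].append(name)

    # Start from nodes with no unmet dependencies; sort for stable output.
    ready = deque(sorted(n for n in nodes if indegree[n] == 0))
    order = []
    while ready:
        node = ready.popleft()
        order.append(node)
        for child in sorted(dependents[node]):
            indegree[child] -= 1
            if indegree[child] == 0:
                ready.append(child)

    # Any node never emitted still has an unmet dependency: it is on a cycle
    # or downstream of one.
    if len(order) != len(nodes):
        stuck = sorted(n for n in nodes if indegree[n] > 0)
        raise ValueError(f"dependency cycle among: {stuck}")
    return order


graph = {"bp-crossplane": [], "bp-crossplane-claims": ["bp-crossplane"]}
print(kahn_order(graph))
```

A graph such as `{"a": ["b"], "b": ["a"]}` never produces a zero-indegree
node, so the function raises instead of returning an order.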

Behaviour against the in-flight expansion: HRs declared expected but not
yet on disk are reported as "deferred" (informational, not an error), so
that this script can be the static authoritative list while W2.K1-K4
PRs land their HR files in series. After all four W2 PRs merge, the
"deferred" count drops to 0 and the audit goes 100% green.

Wired into the existing .github/workflows/test-bootstrap-kit.yaml as a
new dependency-graph-audit job that runs on every PR touching:
  - clusters/** (any HR file edit)
  - scripts/check-bootstrap-deps.sh
  - scripts/expected-bootstrap-deps.yaml
  - .github/workflows/test-bootstrap-kit.yaml
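
The "deferred" behaviour described above amounts to a set comparison between
the expected DAG and what is on disk. A rough Python sketch, with a
hypothetical `audit` function and made-up data (`bp-example` is not a real
HR). Treating an on-disk HR that is absent from the expected file as an error
is an assumption here, not something this commit message states.

```python
def audit(expected: dict[str, list[str]], on_disk: dict[str, list[str]]):
    """Compare the actual dependsOn graph with the expected one.

    Returns (deferred, errors): HRs expected but not yet on disk are
    informational; mismatched edges are errors.
    """
    deferred = sorted(set(expected) - set(on_disk))
    errors = []
    for name in sorted(on_disk):
        if name not in expected:
            # Assumption: undeclared HRs are flagged rather than ignored.
            errors.append(f"{name}: not declared in expected DAG")
        elif sorted(on_disk[name]) != sorted(expected[name]):
            errors.append(
                f"{name}: dependsOn {sorted(on_disk[name])} "
                f"!= expected {sorted(expected[name])}"
            )
    return deferred, errors


expected = {
    "bp-crossplane": [],
    "bp-crossplane-claims": ["bp-crossplane"],
    "bp-example": ["bp-crossplane"],  # hypothetical HR not yet landed
}
on_disk = {
    "bp-crossplane": [],
    "bp-crossplane-claims": ["bp-crossplane"],
}

deferred, errors = audit(expected, on_disk)
print("deferred:", deferred)
print("errors:", errors)
```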

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 17:16:16 +04:00
e3mrah
2d1799d738
fix(bp-crossplane): split XRDs+Compositions into bp-crossplane-claims (#247)
Resolves install ordering on fresh clusters where the apiserver rejects
CompositeResourceDefinition CRs because the apiextensions.crossplane.io
CRDs registered by the crossplane subchart aren't live yet at apply time.

- bp-crossplane bumped 1.1.2 -> 1.1.3 (controller-only payload)
- NEW bp-crossplane-claims@1.0.0 carries XRDs + Compositions
- Flux HelmRelease for crossplane-claims uses dependsOn: [bp-crossplane]
- composition-validate.sh + fixtures relocate to the new chart
- blueprint-release CI: opt-out annotation
  catalyst.openova.io/no-upstream=true permits zero-deps charts that
  legitimately ship only Catalyst-authored CRs (the original hollow-chart
  rule remains in force for every other umbrella chart)

Live error this fixes (from otech.omani.works):
  no matches for kind "CompositeResourceDefinition" in version
  "apiextensions.crossplane.io/v1" -- ensure CRDs are installed first

Pattern: intra-chart CRD-ordering breaks -> split charts + Flux dependsOn.
Apply universally to similar cases going forward.
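
The hollow-chart rule with its opt-out can be pictured as a small predicate.
This Python sketch is an illustration only, not the blueprint-release CI
code: `hollow_chart_violation` is a hypothetical name, and it assumes the
annotation lives under `annotations` in an already-parsed Chart.yaml.

```python
NO_UPSTREAM = "catalyst.openova.io/no-upstream"


def hollow_chart_violation(chart: dict) -> bool:
    """True if a chart has zero dependencies and has not opted out."""
    has_deps = bool(chart.get("dependencies"))
    # Assumption: the opt-out is a Chart.yaml annotation with value "true".
    opted_out = (chart.get("annotations") or {}).get(NO_UPSTREAM) == "true"
    return not has_deps and not opted_out


claims = {
    "name": "bp-crossplane-claims",
    "version": "1.0.0",
    "annotations": {NO_UPSTREAM: "true"},
}
hollow = {"name": "bp-hollow-example"}  # hypothetical chart with no deps

print(hollow_chart_violation(claims))
print(hollow_chart_violation(hollow))
```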

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 16:55:05 +04:00
e3mrah
fad36836ed
fix(ci): tempo + ntfy logos are now .svg (logo-fix-batch-2) (#213)
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
2026-04-29 21:41:29 +02:00
e3mrah
1f5c76def1
fix(platform): sync blueprint.yaml versions with Chart.yaml (#199)
* feat(ui): Playwright cosmetic + step-flow regression guards

15 regression guards in
products/catalyst/bootstrap/ui/e2e/cosmetic-guards.spec.ts that fail
HARD when each user-flagged defect class returns:

  1.  card height drift from canonical 108px
  2.  reserved right padding eating description width
  3.  logo tile drift from per-brand LOGO_SURFACE
  4.  invisible glyph (white-on-white) via luminance proxy
  5.  wizard step order Org/Topology/Provider/Credentials/Components/
      Domain/Review
  6.  legacy "Choose Your Stack" / "Always Included" tab labels
  7.  Domain step reachable before Components
  8.  CPX32 not the recommended Hetzner SKU
  9.  per-region SKU dropdown shows wrong provider catalog
  10. provision page is .html (static) not SPA route
  11. legacy bubble/edge DAG SVG markup on provision page
  12. admin sidebar drift from canonical core/console (w-56 + 7 labels)
  13. AppDetail uses tablist instead of sectioned layout
  14. job rows navigate to /job/<id> instead of expand-in-place
  15. Phase 0 banners (Hetzner infra / Cluster bootstrap) on AdminPage

Each test prints a failure message naming the canonical reference,
the source-of-truth file, and the data-testid PR needed (if any) so
the implementing agent has a precise target. No .skip() — per
INVIOLABLE-PRINCIPLES #2, missing components fail loud.

CI: .github/workflows/cosmetic-guards.yaml runs the suite on every
PR that touches products/catalyst/bootstrap/ui/** or core/console/**.

Docs: docs/UI-REGRESSION-GUARDS.md maps each test to the user's
original complaint, the canonical reference, and the green/red
semantics (5 tests intentionally RED on main today — they stay red
until the companion-agent's UI work lands).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(platform): sync blueprint.yaml versions with Chart.yaml so manifest-validation passes

---------

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 22:07:55 +04:00