sandbox-wave1-controller-chart
84 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
0ac12970d8
|
ci(openova-flow): build openova-flow-server + adapter-flux images + sed chart tags (#1398)
Add the two missing GitHub Actions build pipelines for the OpenovaFlow
Go binaries so prov #34 has real images to install. Both auto-bump their
chart's values.yaml `image.tag` on every main-branch push and dispatch
blueprint-release for chart re-publish.
Workflows shipped:
- .github/workflows/build-openova-flow-server.yaml
  · Triggers on push to products/openova-flow/server/** or the chart
  · `go vet` + `go test -race` + Buildx push to
    ghcr.io/openova-io/openova/openova-flow-server:<sha> + :latest
  · cosign keyless sign + SBOM attest
  · awk-bumps platform/openova-flow-server/chart/values.yaml
    flowServer.image.tag, commits to main with [skip ci]
  · Dispatches blueprint-release.yaml for chart re-publish
- .github/workflows/build-openova-flow-adapter-flux.yaml
  · Same shape; bumps platform/openova-flow-emitter/chart/values.yaml
    flowEmitter.image.tag
Chart defaults (`tag: "latest"`) already shipped in PR #1397; no
values.yaml changes needed in this PR.
Canonical patterns cited (ARCHITECT-FIRST):
- Build shape mirrors .github/workflows/build-application-controller.yaml
  (Go vet + test + Buildx + cosign + SBOM + values.yaml awk-bump +
  blueprint-release dispatch).
- awk-sed bump pattern mirrors catalystApi/catalystUi tag-bump in
  .github/workflows/catalyst-build.yaml `deploy` job (with the
  `[skip ci]` + explicit blueprint-release dispatch fix from #712).
Per docs/INVIOLABLE-PRINCIPLES.md:
- #4a (GitHub Actions is the only build path)
- event-driven (no cron triggers, only push/PR/workflow_dispatch)
MIRROR-EVERYTHING: image refs in chart values point at
harbor.openova.io/proxy-ghcr/...; CI pushes to ghcr.io directly and
Harbor proxy-pulls. No direct push to harbor.
Co-authored-by: hatiyildiz <269457768+hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
3a5d9fc102
|
fix(infra,catalyst-api provisioner): tftpl CI guard + bucket-name suffix (Fix #101 followup, Fix #111) (#1331)
Two infrastructure-hardening fixes that together eliminate ~30 min of
provision-cycle waste per regression event documented in Fix #101.
## Fix A: CI guard against unescaped tftpl shell expansion
Adds a grep-based step to .github/workflows/infra-hetzner-tofu.yaml that
scans every infra/hetzner/*.tftpl for unescaped \${VAR:-default} inside
YAML comment lines. Uses PCRE negative-lookbehind so correctly escaped
\$\${VAR:-default} (templatefile() literal-dollar) does not trip the
guard.
Background: PR #1311 (Fix #73) added a YAML comment with bare
\${QA_FIXTURES_ENABLED:-false}. tofu's templatefile() parses ALL \${...}
sequences regardless of YAML/HCL/shell context; the colon in the
interpolation hits HCL's reserved conditional grammar and crashes
'tofu plan' with "Template interpolation doesn't expect a colon at this
location". Prov #9 (4204f0b0c5e37a80) wasted ~30 min before PR #1328
fixed the one offender. Without the guard, the next operator who adds a
similar comment repeats the incident.
Documented in infra/hetzner/README.md so editors learn the \$\$ escape
pattern before they trip the CI gate.
## Fix B: bucket-name suffix to escape global Hetzner namespace
Hetzner Object Storage bucket names share a GLOBAL namespace across
every tenant. The previous BucketNameForSovereign(fqdn) derivation
'catalyst-<fqdn-with-dashes>' would collide on the second
CreateDeployment for the same FQDN (re-provision after wipe, two
operators on adjacent pools, race conditions) and the second
'tofu apply' would fail with BucketAlreadyExists.
Change BucketNameForSovereign signature to (fqdn, deploymentID) and
append the first 8 chars of the deployment-id as a suffix:
  catalyst-omantel-omani-works-b3b837a2
newID() already returns 16-hex random; the leading 8 chars are 32 bits
of fresh entropy, enough to make collisions cryptographically
negligible.
Backward-compat: empty deploymentID (legacy on-disk records) falls back
to first-8-hex of sha256(fqdn) so wipes of pre-Fix-111 Sovereigns remain
deterministic.
Call-sites updated:
- handler/deployments.go: id := newID() moved before bucket-name
  derivation; uses hetzner.BucketNameForSovereign
- handler/wipe.go: passes dep.ID to PurgeBuckets and to
  BucketNameForSovereign in the report
- hetzner/buckets.go: PurgeBuckets signature now takes deploymentID;
  bucketSuffix() handles the fallback
Tests:
- hetzner/buckets_test.go: 6-case TestBucketNameForSovereign table
  covers canonical newID() shape, collision avoidance, uppercase
  normalisation, empty + non-hex fallback paths. New
  TestBucketNameForSovereign_CollisionAvoidance asserts the Fix #111
  invariant directly.
- handler/deployments_test.go:
  TestCreateDeployment_DerivesObjectStorageBucketFromFQDN now asserts
  the suffixed shape against the actual dep.ID.
- All produced names re-validated against the S3 bucket-naming RFC
  (mirrored regex from provisioner.s3BucketNamePattern).
## Claimed TCs
_None directly; infrastructure hardening; eliminates 30+ min wasted per
cycle from regressions like PR #1311 + bucket-collision_
## Verification
- go test ./internal/hetzner/... -run "Bucket" → 9/9 PASS
- go test ./internal/handler/ -run "DerivesObjectStorageBucket" → PASS
- go vet ./... → clean
- go build ./... → clean
- yaml.safe_load on workflow → clean
- pre-existing handler-package fails (whoami, continuum-switchover) are
  unrelated and present on origin/main
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
f668d791ab
|
fix(bp-newapi): publish newapi-mirror image + repoint chart to existing tag (qa-loop bounded-cycle audit prov #7 Gap F) (#1315)
Root cause from live diagnosis (omantel.biz prov #7, kubectl
--context=omantel): The bp-newapi chart at
platform/newapi/chart/values.yaml referenced
`ghcr.io/openova-io/openova/newapi-mirror:v0.4.5` since its first commit
(44d0200a, 2026-05-01). However:
1. NO CI workflow ever built that image. There is no
   `build-bp-newapi*.yaml` (or similar) under .github/workflows/. The
   GHCR package `ghcr.io/openova-io/openova/newapi-mirror` does not
   exist (404 from /orgs/openova-io/packages/container/...).
2. The tag `v0.4.5` is fictitious: neither upstream Calcium-Ion/new-api
   (`docker.io/calciumion/new-api`) nor the alternate ancestor
   (`justsong/one-api`) ever published a `v0.4.5`. The lowest stable
   Calcium-Ion tag is `v0.6.0.9`; the highest stable v0.x is `v0.13.2`
   (upstream publish 2026-04-27).
Result: every fresh Sovereign's NewAPI Pod ImagePullBackOff'd 403
Forbidden on the never-existed image, blocking alice signup gate 5 (LLM)
and surfacing in the bounded-cycle audit as Gap F.
Fix (mirrors bp-guacamole CI pattern in
.github/workflows/build-bp-guacamole.yaml):
- NEW .github/workflows/build-bp-newapi.yaml: push to
  platform/newapi/chart/** triggers a Job that pulls
  `docker.io/calciumion/new-api:<UPSTREAM_VER>`, captures the upstream
  repo digest, re-tags as
  `ghcr.io/openova-io/openova/newapi-mirror:<UPSTREAM_VER>` + `:latest`,
  pushes both, then bumps values.yaml + Chart.yaml + dispatches
  blueprint-release.
- platform/newapi/chart/values.yaml: newapi.image.tag bumped from
  `v0.4.5` (fictitious) to `v0.13.2` (latest stable Calcium-Ion/new-api
  on Docker Hub). Comment block expanded with full rationale + link to
  the new build workflow + bump-in-lockstep instructions.
- platform/newapi/chart/Chart.yaml: version 1.4.1 → 1.4.2, appVersion
  `0.4.5` → `0.13.2` (Helm convention: appVersion = upstream version
  without the `v` prefix). Inline changelog records the audit-prov-7
  Gap F lineage.
- clusters/_template/bootstrap-kit/80-newapi.yaml: pinned chart version
  1.4.1 → 1.4.2 with the same changelog inline.
Verified locally:
- `helm template smoke platform/newapi/chart --set
  database.existingSecret=fake --set credentials.existingSecret=fake
  --set auth.adminUI.mode=masterKey` renders
  `image: "ghcr.io/openova-io/openova/newapi-mirror:v0.13.2"` and
  `app.kubernetes.io/version: "0.13.2"`.
The v1.0.0-rc.x upstream line is gated on schema migration
stabilisation; the channel-seed Job uses the legacy admin-API request
shape, so do NOT auto-roll past v0.13.x without re-running the
channel-seed integration smoke against NewAPI's `/api/channel/`.
Pairs with the Gap C re-investigation memo (no chart fix needed; PR
#1309 only gated `defaultCompositionRef`, not the XRD itself; the
useraccesses.access.openova.io CRD is present on omantel prov #7).
DO NOT MERGE: this PR is for qa-loop bounded-cycle Wave 5 Fix #80
(Gap F) review.
Co-authored-by: hatiyildiz <269457768+hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
9780e8d72d
|
fix(chart): bp-catalyst-platform 1.4.116 — chart re-publish + dispatch (qa-loop iter-10 Fix #44 follow-up) (#1264)
Chart 1.4.115 was published from the merge commit which still had the
OLD application-controller image tag (
|
||
|
|
24aab61207
|
fix(application-controller): HelmRelease targetNamespace = App's namespace, not Org slug (qa-loop iter-10 Fix #44) (#1262)
Root cause: the application-controller rendered the per-Application
HelmRelease with `metadata.namespace = Org` and
`spec.targetNamespace = Org` where Org is the parent Organization slug.
On omantel the Application(qa-wp) lives in ns `qa-omantel` while the Org
is named `omantel-platform`, so the workload Pod landed in the wrong
namespace, breaking matrix rows TC-068 / TC-100 / TC-204 / TC-262 /
TC-263 (all asserting Pod in qa-omantel). Symmetric Kustomization
wrapper had the same bug. Existing render unit test only covered the
org==namespace case (`acme/acme`) which masked the bug.
Fix:
- render.Inputs gains AppNamespace field. helmRelease + kustomization
  templates resolve `metadata.namespace` and `spec.targetNamespace` to
  AppNamespace (back-compat default = Org).
- application_controller.go passes app.GetNamespace() as AppNamespace
  on every render.Render call.
- HelmRelease spec.install.createNamespace = true so a missing workload
  namespace is provisioned by helm-controller (per
  docs/INVIOLABLE-PRINCIPLES.md #1 target-state: controller must work
  without an operator pre-creating the namespace).
- Org slug is still stamped on the catalyst.openova.io/organization
  label for traceability.
- 3 new Go tests:
  TestRender_NamespaceIsAppNamespace (omantel scenario via render pkg)
  TestRender_CreateNamespaceTrue
  TestReconcile_HelmReleaseTargetNamespaceIsAppNamespace (drives the
  omantel scenario end-to-end through the controller fake)
- build-application-controller.yaml extended with auto-bump of
  controllers.application.image.tag in values.yaml on push-to-main, so
  the chart picks up the rebuilt image without a manual operator edit
  (per feedback_no_mvp_no_workarounds.md rule 1).
- bp-catalyst-platform chart 1.4.114 → 1.4.115.
Verification (post-roll on omantel):
- delete omantel-platform/qa-wp Pod
- annotate qa-omantel/qa-wp HR for reconcile
- expect: Pod in qa-omantel ns + HR.spec.targetNamespace == qa-omantel
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
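The namespace resolution in the Fix list can be illustrated with a small Go sketch. The Inputs type here is a trimmed-down, hypothetical stand-in for render.Inputs, and HelmReleaseMeta and renderMeta are invented for illustration; only the field name AppNamespace, the fallback to Org, createNamespace = true, and the catalyst.openova.io/organization label come from the commit message.

```go
package main

// Inputs is a hypothetical, reduced version of render.Inputs.
type Inputs struct {
	Org          string // parent Organization slug
	AppNamespace string // namespace the Application lives in; may be empty
}

// HelmReleaseMeta holds only the fields this commit is concerned with.
// It is an illustrative type, not the controller's real output.
type HelmReleaseMeta struct {
	Namespace       string
	TargetNamespace string
	CreateNamespace bool
	Labels          map[string]string
}

// renderMeta resolves both namespaces to AppNamespace, falling back to Org
// when AppNamespace is empty (the back-compat default), and keeps the Org
// slug on a label for traceability.
func renderMeta(in Inputs) HelmReleaseMeta {
	ns := in.AppNamespace
	if ns == "" {
		ns = in.Org
	}
	return HelmReleaseMeta{
		Namespace:       ns,
		TargetNamespace: ns,
		CreateNamespace: true,
		Labels: map[string]string{
			"catalyst.openova.io/organization": in.Org,
		},
	}
}
```

With the omantel values from the root-cause paragraph (Org omantel-platform, AppNamespace qa-omantel), both namespaces resolve to qa-omantel while the label still carries omantel-platform.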
||
|
|
5ca0a7d178
|
fix(ci,charts,api): qa-loop iter-7 Fix #39 — bp-guacamole + bp-k8s-ws-proxy bootstrap-kit slots (#1236)
* fix(ci,charts,api): qa-loop iter-7 Fix #39 — bp-guacamole + bp-k8s-ws-proxy bootstrap-kit slots
Closes the scope-narrow confessed by Fix #36: bp-guacamole +
bp-k8s-ws-proxy chart skeletons existed at platform/* but lacked CI
image-build workflows + bootstrap-kit slots, so TC-228 / TC-230 / TC-236
/ TC-237 / TC-245 / TC-246 stayed FAIL with "deployment NotFound".
CI workflows
------------
- .github/workflows/build-k8s-ws-proxy.yaml: Buildx + cosign keyless
  sign + SBOM attestation flow on core/cmd/k8s-ws-proxy/**, then bumps
  platform/k8s-ws-proxy/chart/values.yaml image.tag + Chart.yaml patch
  version + dispatches blueprint-release.
- .github/workflows/build-bp-guacamole.yaml: mirrors upstream Apache
  Guacamole 1.5.5 to GHCR (so every Sovereign pulls from a registry we
  own: no Docker Hub rate limits, no upstream availability risk), bumps
  values.yaml.image.{repository,tag} + Chart.yaml + dispatches
  blueprint-release.
Charts (target-state)
---------------------
- bp-k8s-ws-proxy v0.1.1: canonical workload name `k8s-ws-proxy`
  regardless of release name (DaemonSet + Service + ClusterRole +
  ClusterRoleBinding + ServiceAccount all named `k8s-ws-proxy` so matrix
  can address them by canonical short name).
- bp-guacamole v0.1.1: canonical short resource names (`guacd`,
  `guacamole-server`, `guacamole-recordings`); GHCR-mirrored upstream
  images; realm-patch ConfigMap correctly lands in `keycloak` namespace
  (was: realm-name, which would have failed silently on every
  Sovereign); `realmConfig.namespace` override surface added.
- Both charts: `catalyst.openova.io/smoke-render-mode: default-off`
  annotation so blueprint-release smoke-render gate honors the
  default-OFF render shape.
Bootstrap-kit slots
-------------------
- clusters/_template/bootstrap-kit/36-bp-k8s-ws-proxy.yaml +
  37-bp-guacamole.yaml: dependsOn-ordered (proxy → gateway), pinned to
  0.1.1, default-OFF gate flipped via slot values, install/upgrade
  disableWait per session-2026-04-30 architectural decision.
- clusters/omantel.omani.works/bootstrap-kit/* slots mirror the same
  shape with omantel.biz hostnames matching the live HTTPRoutes on
  console.omantel.biz / auth.omantel.biz.
API: shells/issue handler (matrix-canonical URL surface)
--------------------------------------------------------
- POST /api/v1/sovereigns/{id}/shells/issue?namespace=&pod=&container=
  alias for the existing POST
  /api/v1/sovereigns/{id}/k8s/exec/{ns}/{pod}/{container}/session with
  matrix-canonical response fields (`sessionId`, `guacamoleUrl`,
  `recordingPath`). Same business logic, same audit surface
  (`guacamole-session-opened`), same RBAC gate (tier-developer or
  higher). 6 test cases, all PASS under -race.
TCs that flip PASS in iter-8
-----------------------------
- TC-228: POST /shells/issue → sessionId + guacamoleUrl + recordingPath
- TC-230: kubectl get deploy guacd guacamole-server -n catalyst-system
- TC-236: kubectl get ds k8s-ws-proxy -n catalyst-system
- TC-237: kubectl logs ds/k8s-ws-proxy → "listening"
- TC-245: viewer-cookie POST /shells/issue → 403
- TC-246: operator-cookie POST /shells/issue → 200 sessionId
Per feedback_no_mvp_no_workarounds.md: NO follow-up slices; every gap
Fix #36 confessed is closed in this PR.
Per feedback_machine_saturation_3rd_violation.md: CI-only build path, no
local docker.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(bootstrap-kit): move bp-k8s-ws-proxy + bp-guacamole to slots 51/52 (Fix #39 follow-up)
CI dependency-graph-audit caught a slot-number collision: slots 36-48
are reserved for the W2.K4 AI-runtime cohort (bp-stunner, bp-knative,
bp-kserve, bp-vllm, bp-llm-gateway, bp-anthropic-adapter, bp-bge,
bp-nemo-guardrails, bp-temporal, bp-openmeter, bp-livekit, bp-matrix,
bp-librechat) per scripts/expected-bootstrap-deps.yaml. Move the
exec-fan-out blueprints to slots 51/52 (post-W2.K4, pre-Phase-2 80+ slot
range) and add their entries to the expected DAG.
- clusters/_template/bootstrap-kit/{36,37}-* → {51,52}-*
- clusters/omantel.omani.works/bootstrap-kit/{36,37}-* → {51,52}-*
- kustomization.yaml updates (both _template + omantel)
- scripts/expected-bootstrap-deps.yaml: declare slots 51/52 with full
  dependsOn lists (bp-k8s-ws-proxy on cilium+sealed-secrets,
  bp-guacamole on
  cilium+cert-manager+keycloak+sealed-secrets+seaweedfs+k8s-ws-proxy)
scripts/check-bootstrap-deps.sh re-run: 0 drift, 0 cycles, 55 declared
HRs, 42 present on disk, 13 deferred (W2.K1-K4).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
b24475e2c2
|
fix(api+chart): clusterroles GVR + CATALYST_BUILD_SHA env injection (qa-loop iter-3) (#1206)
Two coupled fixes for QA-loop iter-3 cluster
`clusterroles-gvr-and-sha-injection`:
Sub-A — clusterroles GVR (TC-122/196/199/248):
- Add rbac.authorization.k8s.io/v1 ClusterRole + ClusterRoleBinding
to k8scache.DefaultKinds. Both cluster-scoped.
- Add matching get/list/watch verbs on
catalyst-api-cutover-driver ClusterRole. Per
feedback_chroot_in_cluster_fallback.md every new GVR added to
DefaultKinds MUST get a matching rule on the cutover-driver SA
(chroot SovereignClient uses it via in-cluster fallback).
- Pin both kinds in TestDefaultKinds_GraphAndDashboardSurface so a
regression that drops them from the registry fails the unit test.
Sub-B — CATALYST_BUILD_SHA env injection (TC-261):
- api-deployment.yaml: inject CATALYST_BUILD_SHA + CATALYST_CHART_VERSION
env vars with LITERAL values (not Helm directives) per the
dual-mode contract — Kustomize on contabo can't render
`{{ .Values... }}` in `value:` fields.
- .github/workflows/catalyst-build.yaml: extend the "bump literal
image refs" sed pass to also bump the CATALYST_BUILD_SHA env
literal so /api/v1/version returns the SHA the Pod is actually
running (no drift between image tag and reported SHA).
- The handler (version.go) already reads CATALYST_BUILD_SHA via
envOrTrim with `dev`/`0.0.0` ldflag fallbacks — no Go change
needed; the version_test.go env-override test already covers it.
Chart bumped 1.4.94 -> 1.4.95.
Co-authored-by: hatiyildiz <hati.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
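The env-with-fallback read described in Sub-B might look something like the Go sketch below. The name envOrTrim and the `dev` / `0.0.0` fallbacks appear in the commit message, but the signature shown, the envOrTrimFrom helper (added so the lookup can be swapped out in tests), and the VersionInfo shape are assumptions.

```go
package main

import (
	"os"
	"strings"
)

// envOrTrimFrom returns the whitespace-trimmed value that lookup yields for
// key, or fallback when that value is empty. The injectable lookup is a
// hypothetical convenience, not something the commit message describes.
func envOrTrimFrom(lookup func(string) string, key, fallback string) string {
	if v := strings.TrimSpace(lookup(key)); v != "" {
		return v
	}
	return fallback
}

// envOrTrim reads from the process environment. Signature assumed.
func envOrTrim(key, fallback string) string {
	return envOrTrimFrom(os.Getenv, key, fallback)
}

// VersionInfo is a hypothetical response shape for /api/v1/version.
type VersionInfo struct {
	BuildSHA     string
	ChartVersion string
}

// currentVersion reports what the Pod was started with, falling back to the
// placeholder values when the chart did not inject the env vars.
func currentVersion() VersionInfo {
	return VersionInfo{
		BuildSHA:     envOrTrim("CATALYST_BUILD_SHA", "dev"),
		ChartVersion: envOrTrim("CATALYST_CHART_VERSION", "0.0.0"),
	}
}
```

Because the Deployment carries literal values that CI rewrites on each build, a handler written this way would report the SHA of the image it is running rather than a compile-time default.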
||
|
|
c1b92404ee
|
fix(chart): enable 5 Group C controllers + KC realm-role bootstrap (qa-loop iter-1) (#1194)
EPIC-3 RBAC reconciliation loop was dormant on every Sovereign because
the 5 Group C controllers (organization, environment, blueprint,
application, useraccess) shipped with `enabled: false` and the
KEYCLOAK_BOOTSTRAP_TIER_ROLES env var was hardcoded to "false". Result:
UserAccess CRs created by /api/v1/sovereigns/{id}/rbac/assign never
materialised into RoleBindings + composite realm-roles.
Cluster: controllers-and-kc-bootstrap-gates (qa-loop iter-1).
Changes:
- values.yaml: organization/environment/application/useraccess controllers
flipped to `enabled: true` and `image.tag` SHA-pinned to the latest
GHCR-published push-on-main builds (organization/environment/application
:1b29c71, useraccess :ff2172f) per Inviolable Principle #4a.
- values.yaml: blueprint stays `enabled: false` until first
push-on-main build of build-blueprint-controller.yaml lands an image
in GHCR (never reference an image not built by CI).
- values.yaml: new top-level `keycloak.bootstrap.ensureTierRoles: true`.
- api-deployment.yaml: KEYCLOAK_BOOTSTRAP_TIER_ROLES now sources its
default from `.Values.keycloak.bootstrap.ensureTierRoles` (per slice
T2 brief #1098/#1146) instead of hardcoded "false".
- .github/workflows/build-blueprint-controller.yaml: new workflow
scaffolded (mirror of build-application-controller shape) so the
first commit touching core/controllers/blueprint/** ships a
CI-built, SHA-pinned, cosign-signed image to GHCR.
- Chart.yaml: bumped 1.4.89 → 1.4.90.
Verified via `helm template`:
- 4 controller Deployments + 4 controller ClusterRoles render (blueprint
pending image build).
- KEYCLOAK_BOOTSTRAP_TIER_ROLES renders as "true" by default.
- 5 tier ClusterRoles `openova:tier-{viewer,developer,operator,admin,owner}`
render from platform/crossplane-claims/chart/.
Co-authored-by: hatiyildiz <hati.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
7ca4abddd2
|
feat(continuum): K-Cont-4 — Cloudflare Worker source + tofu wiring for lease witness (#1101) (#1159)
* feat(continuum): K-Cont-4 — Cloudflare Worker source + tofu wiring for lease witness (#1101)
Implements the server side of the Cloudflare KV lease-witness pattern
that K-Cont-3's CFKVClient (in
core/controllers/continuum/internal/witness/cloudflarekv/) speaks to.
The Worker fronts a Cloudflare Workers KV namespace with
read-then-CAS-write semantics enforced via the If-Match header; exact
contract per K-Cont-3 #1158 report (item d) and the canonical-seams
"Cloudflare KV Worker contract" entry.
Routes:
  GET /lease/<slot-url-encoded> → 200 + LeaseState | 404 | 401
  PUT /lease/<slot> → 200 + LeaseState | 412 + state | 401
  DELETE /lease/<slot> → 204 | 412 | 401
All 7 K-Cont-3 trap behaviors verified by 46 vitest tests:
1. If-Match: 0 = first-acquire-on-empty-slot
2. Generation increments unconditionally (incl. Release)
3. 412 includes current state body
4. TTL eviction is server-authoritative in stamping (Worker doesn't
   auto-evict; controller's IsHeldBy decides)
5. X-Holder mismatch on DELETE returns 412 (stale region can't evict
   new primary)
6. Bearer token validation against env-bound allow-list
7. Optional X-Lease-Slot header logged for KV granularity
Files:
  products/continuum/cloudflare-worker/{package.json, tsconfig.json,
    wrangler.toml, vitest.config.ts, .eslintrc.cjs, .gitignore,
    DESIGN.md, src/{index,auth,kv,types}.ts,
    src/handlers/{get,put,delete}.ts, test/{handlers,contract,env.d}.ts}
  infra/cloudflare-worker-leases/{versions,variables,main,outputs}.tf +
    README.md
  .github/workflows/cloudflare-worker-leases-build.yaml (event-driven,
    NO cron: push-on-paths + PR + workflow_dispatch)
Tests: 46/46 vitest pass (handlers 37 + contract 9). ESLint clean.
tsc --noEmit clean. wrangler deploy --dry-run produces 9.47 KiB bundle.
Per the brief: tofu module ships ready for operator action; no
auto-deploy. Operator runbook in DESIGN.md §"Operator runbook — deploy a
new Sovereign".
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(continuum/cf-worker-tofu): K-Cont-4 — adopt CF v5 inline secret_text binding (was v4 separate resource)
`tofu validate` failed on `cloudflare_workers_secret`: that resource was
REMOVED in cloudflare/cloudflare v5 (it consolidated into the inline
`bindings = [...]` array on `cloudflare_workers_script` with
`type = "secret_text"`). Same security guarantee: encrypted at rest in
CF, never visible via dashboard read API once written.
`tofu fmt` also wanted versions.tf alignment + the .terraform.lock.hcl
pinning the resolved cloudflare/cloudflare v5.19.1 (mirrors
infra/hetzner/ which commits its lock file).
Per Inviolable Principle #5 the bearer token value still flows from
TF_VAR_bearer_tokens_csv extracted at apply time from a K8s
SealedSecret; never inlined here.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: hatiyildiz <hati.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
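The compare-and-swap behaviour listed in the trap items can be modelled in a few lines of Go. This is an in-memory sketch for reasoning about the contract, not the Worker's TypeScript source: the LeaseState fields, the Store type, and the decision to keep a holder-less record after release (so the generation can keep increasing) are all assumptions, and the model is not safe for concurrent use.

```go
package main

import "errors"

// LeaseState is an assumed shape; the commit message names the type only.
type LeaseState struct {
	Holder     string
	Generation int
}

// ErrPrecondition stands in for an HTTP 412 response.
var ErrPrecondition = errors.New("precondition failed (HTTP 412)")

// Store is a hypothetical in-memory replacement for the KV namespace.
type Store struct {
	slots map[string]LeaseState
}

func NewStore() *Store { return &Store{slots: map[string]LeaseState{}} }

// Put models PUT /lease/<slot>. ifMatch is the generation the caller last
// saw; 0 means "I expect the slot to be empty". On mismatch the current
// state is returned with the error, mirroring "412 includes current state".
func (s *Store) Put(slot, holder string, ifMatch int) (LeaseState, error) {
	cur := s.slots[slot] // zero value has Generation 0 for an empty slot
	if cur.Generation != ifMatch {
		return cur, ErrPrecondition
	}
	next := LeaseState{Holder: holder, Generation: cur.Generation + 1}
	s.slots[slot] = next
	return next, nil
}

// Delete models DELETE /lease/<slot>. A stale generation or a holder that
// does not own the lease is rejected, so an old region cannot evict a new
// primary. Release also bumps the generation.
func (s *Store) Delete(slot, holder string, ifMatch int) (LeaseState, error) {
	cur, ok := s.slots[slot]
	if !ok || cur.Generation != ifMatch || cur.Holder != holder {
		return cur, ErrPrecondition
	}
	next := LeaseState{Generation: cur.Generation + 1}
	s.slots[slot] = next
	return next, nil
}
```

In this model a first acquire with ifMatch 0 returns generation 1, a second acquire with ifMatch 0 fails and hands back the current holder, and only the current holder presenting the current generation can release.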
||
|
|
746901b671
|
feat(cnpg-pair): C-DB-1 — bp-cnpg-pair Blueprint (active-hotstandby CNPG cluster-pair across regions) (#1101) (#1153)
EPIC-6 Slice C-DB-1+C-DB-2. Active-hotstandby CNPG cluster-pair as a
companion to bp-cnpg: primary CNPG Cluster CR in region A, replica
Cluster CR in region B configured as a CNPG replica cluster
(replica.enabled=true + externalCluster), WAL streaming over a
Cilium ClusterMesh-shared Service. Per ADR-0001 §9 ClusterMesh is the
only canonical inter-region transport — never public TLS.
What ships:
platform/cnpg-pair/
├── chart/
│ ├── Chart.yaml # bp-cnpg-pair 0.1.0; no-upstream + smoke-render-mode=default-off
│ ├── values.yaml # default-OFF gate; placement schema constrains active-hotstandby ONLY
│ ├── templates/
│ │ ├── _helpers.tpl # fail-fast on empty image.tag; region pair validation
│ │ ├── primary-cluster.yaml # CNPG Cluster CR (region-pinned via openova.io/region affinity)
│ │ ├── replica-cluster.yaml # CNPG Cluster CR (replica.enabled=true; externalClusters[])
│ │ ├── service-replication.yaml # Cilium ClusterMesh global Service
│ │ ├── failover-readiness.yaml # probe Pod flips Ready when WAL lag < threshold
│ │ ├── networkpolicy.yaml # default-deny carve-outs for replication + probe
│ │ └── audit-config.yaml # NATS audit subjects + types this Blueprint emits
│ ├── blueprint.yaml # configSchema + placementSchema (active-hotstandby ONLY)
│ ├── README.md # 80-line deployment + failover semantics
│ └── tests/cnpg-pair-render.sh # 5-case render gate
└── DESIGN.md # topology, lag-threshold rationale, deferred C-DB-3 plan
Default-OFF gate per the brief: helm template with default values
renders ZERO resources; helm template with cnpgPair.enabled=true +
both regions + image.tag renders 8 resources (2 Cluster CRs, 1
Service, 1 Deployment, 3 NetworkPolicies, 1 audit-config ConfigMap).
Empty image.tag fails fast at template-render per Inviolable
Principle #4a; same primary/replica region fails fast (degenerate
pair). All 5 render gates pass locally; helm lint + YAML parse clean.
CI smoke-render gate fix (single-line behavior change in
blueprint-release.yaml): adds a
`catalyst.openova.io/smoke-render-mode: default-off` annotation opt-in
so charts that legitimately
render zero at default values (this chart + future bp-*-pair
Blueprints) skip the `<5 lines` empty-render check. The chart's own
tests/cnpg-pair-render.sh covers the enabled-render path; without
the annotation the empty-render check still fires unchanged.
Seam-map additions (return diff for 01-canonical-seams.md Platform
table):
- service.cilium.io/global=true ClusterMesh global Service annotation
(first chart in the repo to use it; pattern reused by Continuum
K-Cont-2 for HTTPRoute weight=0 cross-region drains)
- bp-*-pair active-hotstandby cluster-pair pattern (primary+replica
Cluster CRs colocated in one Blueprint, region-pinned via
openova.io/region node-affinity)
- audit-config ConfigMap co-located with the emitting Blueprint
(label-selector discovery for K-Cont-2 + U-DR-1; future
bp-*-pair Blueprints follow this convention)
- smoke-render-mode=default-off Chart.yaml annotation opt-in for
the blueprint-release smoke gate
C-DB-2 (publish): existing blueprint-release.yaml workflow auto-detects
`platform/*/chart/**` paths; no allowlist edit required.
First push triggers `ghcr.io/openova-io/bp-cnpg-pair:0.1.0` build.
C-DB-3 (1M-row acceptance test) DEFERRED — full plan documented in
DESIGN.md "Deferred — C-DB-3 acceptance test plan" section so the
future implementer's brief is self-contained.
Tests:
- bash platform/cnpg-pair/chart/tests/cnpg-pair-render.sh ✓ 5/5 PASS
- helm lint platform/cnpg-pair/chart ✓ clean
- helm template ... | python3 yaml.safe_load_all ✓ 8 docs parse clean
- smoke-gate logic simulated locally ✓ default-off annotation honored
Pre-existing CI failures untouched:
- TestPinIssue rate-limit flake — not affected by chart-only slice
- TestBootstrapKit/gitea version drift — only iterates over a fixed
10-chart bootstrap list (no cnpg-pair entry)
Out of scope per brief (all deferred to dedicated slices):
- K-Cont-2 reconciler logic
- K-Cont-3 lease witness
- K-Cont-4 Cloudflare Worker
- C-DB-3 1M-row acceptance test
- Application controller changes
- U-DR-1 UI
Co-authored-by: hatiyildiz <hati.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
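The smoke-render gate change described above can be expressed as a small decision function. The real gate lives in the blueprint-release.yaml workflow, so this Go version is only a model of the logic: the function names are invented, and counting non-blank lines is an assumption about how the `<5 lines` check measures an empty render.

```go
package main

import "strings"

// smokeModeAnnotation is the Chart.yaml annotation key from the commit
// message.
const smokeModeAnnotation = "catalyst.openova.io/smoke-render-mode"

// countNonBlankLines counts rendered lines that contain something other than
// whitespace. How the real workflow counts lines is not stated.
func countNonBlankLines(rendered string) int {
	n := 0
	for _, line := range strings.Split(rendered, "\n") {
		if strings.TrimSpace(line) != "" {
			n++
		}
	}
	return n
}

// smokeRenderPasses models the gate: charts that opt in with the default-off
// annotation skip the empty-render check, while every other chart must
// render at least 5 lines at default values.
func smokeRenderPasses(annotations map[string]string, rendered string) bool {
	if annotations[smokeModeAnnotation] == "default-off" {
		return true
	}
	return countNonBlankLines(rendered) >= 5
}
```

A chart without the annotation that renders nothing still fails under this model, which matches the statement that the empty-render check fires unchanged when the annotation is absent.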
||
|
|
ddbe44918f
|
feat(continuum): K-Cont-1 — Continuum product skeleton (chart + binary + GHA workflow, no reconcile yet) (#1101) (#1151)
Slice K-Cont-1 of EPIC-6 (#1101) ships the Continuum product skeleton:
- core/controllers/continuum/{cmd,internal/{controller,events}}
  - cmd/main.go: controller-runtime Manager bootstrap; leader election;
    /healthz, /readyz, /metrics endpoints; env-only config per
    INVIOLABLE-PRINCIPLES #4
  - internal/controller: ContinuumReconciler with no-op Reconcile()
    (K-Cont-2 fills the body); SetupWithManager() watches Continuum CRs
    via unstructured.Unstructured per ADR-0001 §2.7 (no controller-gen)
  - internal/events: placeholder package documenting K-Cont-2's NATS
    audit-event-type list
  - Containerfile: multi-stage Go build → alpine:3.20 runtime, UID 65534
- products/continuum/chart/: full Helm chart shape (default-OFF):
  - Chart.yaml + values.yaml (continuum.enabled: false; image.tag empty;
    fail-fast on empty tag at render time)
  - templates/{_helpers.tpl, deployment, service, serviceaccount, rbac,
    networkpolicy}.yaml
  - blueprint.yaml: OpenOva Blueprint manifest with configSchema +
    placementSchema (single-region: management cluster) + depends:
    bp-cnpg-pair + bp-powerdns
  - crds/README.md: pointer to the canonical Continuum CRD shipped in
    products/catalyst/chart/crds/continuum.yaml (B8 #1110); not
    duplicated
- products/continuum/DESIGN.md: chart-vs-binary split decision (Option
  A: binary in shared core/controllers/ module per CC1 #1135), K-Cont-2
  fill list, K-Cont-3 lease witness API contract sketch
- .github/workflows/build-continuum-controller.yaml: event-driven CI (NO
  cron) with go vet + go test -race + helm template ON/OFF resource
  count gates + fail-fast verification + GHCR build & push (cosign
  keyless signed) + repository_dispatch for chart-bump fan-out
helm template verification:
- continuum.enabled=false → 0 resources (default OFF)
- continuum.enabled=true + image.tag=ci-test → 6 resources
  (ServiceAccount, ClusterRole, ClusterRoleBinding, Deployment, Service,
  NetworkPolicy)
- continuum.enabled=true + empty image.tag → render fails per #4a
go vet ./continuum/... → clean. go test -count=1 -race → all green.
Out of scope (per the K-Cont-1 brief):
- Reconcile body: K-Cont-2
- Lease witness implementations: K-Cont-3
- Cloudflare Worker source: K-Cont-4
- bp-cnpg-pair Blueprint: C-DB-1
Co-authored-by: hatiyildiz <hati.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
b0ed216e81
|
feat(catalog): catalog-svc HTTP REST service + chart wiring (slice L1+L2, #1097) (#1148)
EPIC-2 Slice L of #1097. Multi-source Blueprint catalog HTTP REST
service backed by Gitea (3 sources: public mirror, sovereign-curated,
per-Org private). Replaces the per-Org SME catalog per ADR-0001 §4.3
(different scope: SME's was Org-bound; catalyst-catalog is
Sovereign-wide multi-source).
L1: core/services/catalyst-catalog/ Go service:
- Separate go.mod (services group is for HTTP services, controllers
  group is for CRD reconcilers; documented in DESIGN.md).
- Imports the unified Gitea client via Go module replace directive.
- Promoted core/controllers/internal/gitea → pkg/gitea so the catalog (a
  sibling Go module) can import it (Go internal/ rule). 5 Group C
  controllers updated atomically.
- HTTP REST endpoints: /api/v1/catalog{,/{name},/{name}/versions,
  /{name}/versions/{version}} + /healthz.
- Source resolution priority on collision: private > sovereign > public.
- Per-Org access filter: caller's Claims.Groups[] determines visible
  private blueprints; Org A user does NOT see Org B's private set.
- 30s TTL LRU cache on blueprint.yaml reads (capacity 1024 default).
- Session-cookie / Bearer / ?access_token= claim extraction matching
  catalyst-api's seam; expired-token rejection in-process.
- Containerfile: distroless-static, non-root UID 65532.
L2: products/catalyst/chart/templates/services/catalog/ wiring:
- 5 templates (deployment, service, serviceaccount, rbac, httproute) +
  _helpers.tpl. Default-OFF gate via .Values.services.catalog.enabled.
- helm template: 0 catalog resources when OFF, 6 when ON.
- Empty image.tag fail-fasts at render per Inviolable Principle #4a.
- HTTPRoute exposes /api/v1/catalog on api.<sovereign> hostname.
- Chart bumped 1.4.85 → 1.4.86.
Gitea client extension (canonical seam, NOT per-service variant):
- +ListOrgRepos(ctx, org) []Repo: paginated repo listing.
- +ListContents(ctx, org, repo, branch, path) []ContentEntry: directory
  listing for per-Org shared-blueprints fan-out.
GitHub Actions workflow:
- .github/workflows/catalyst-catalog-build.yaml: push-on-paths +
  pull_request + workflow_dispatch (NO cron). go vet + go test (race +
  count=1) + image build → GHCR :<sha>. repository_dispatch fan-out to
  chart-bump matches the Group C controllers' pattern.
Tests (3-tier gate): unit (config, cache, auth, source, handler) +
integration (httptest-backed Gitea fixtures across all 3 sources +
priority + per-Org access). All green; race detector on.
L3 (SME catalog retirement) is deferred per the EPIC-2 master brief.
GraphQL deferred (REST first; gqlgen would pull ~80MB of indirect deps
for a feature no UI consumer has asked for yet).
Co-authored-by: hatiyildiz <hati.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
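The collision priority and per-Org filter from the L1 list can be sketched as a single pass over candidate blueprints. Everything below is hypothetical scaffolding: the Source, Blueprint, and Resolve names are invented, and treating each entry of the caller's groups as an Org slug is an assumption about how Claims.Groups[] maps to Orgs.

```go
package main

// Source ranks catalog sources; a higher value wins on a name collision
// (private > sovereign > public).
type Source int

const (
	Public Source = iota
	Sovereign
	Private
)

// Blueprint is a minimal illustrative record, not the service's real type.
type Blueprint struct {
	Name   string
	Source Source
	Org    string // owning Org for private blueprints; empty otherwise
}

// Resolve returns the blueprints visible to a caller, keyed by name.
// Private entries are dropped unless the caller belongs to the owning Org,
// and among the remaining entries the highest-priority source wins.
func Resolve(candidates []Blueprint, groups []string) map[string]Blueprint {
	member := make(map[string]bool, len(groups))
	for _, g := range groups {
		member[g] = true
	}
	visible := make(map[string]Blueprint)
	for _, bp := range candidates {
		if bp.Source == Private && !member[bp.Org] {
			continue // another Org's private set stays hidden
		}
		if cur, seen := visible[bp.Name]; !seen || bp.Source > cur.Source {
			visible[bp.Name] = bp
		}
	}
	return visible
}
```

Given the same blueprint name in all three sources, a member of the owning Org sees the private copy, while a caller from a different Org falls through to the sovereign-curated one.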
||
|
|
66fd0bbae3
|
refactor(controllers): promote duplicated internal/ packages to shared core/controllers/internal/ (CC1, #1095) (#1135)
Slice CC1 of EPIC-0 (#1095) — Coordinator-led consolidation. The 5 Group C controllers (slices C1-C5: organization, environment, blueprint, application, useraccess) all merged with their own per-controller go.mod + per-controller internal/ tree. This PR canonicalizes the shared layout per `02-implementer-canon.md` §1+§2:
* One go.mod at core/controllers/go.mod (Path A — single shared module)
* Shared helpers under core/controllers/internal/:
- semver/ (was: blueprint/internal/semver + application/internal/semver, now exposes blueprint's IsValidRange + app's IsExact, with the union of both test corpora)
- placement/ (was: application/internal/placement; promoted per seam map)
- render/ (was: application/internal/render; promoted per seam map)
- labels/ (was: useraccess/internal/labels; promoted per seam map — Manara-style scope matcher, owner-of-record C5)
Module-discipline decision (Path A vs Path B): Path A. The 5 controllers' go.mod files use the same controller-runtime v0.19.0, k8s.io/* @ 0.31.x, sigs.k8s.io/yaml v1.4.0, etc. The only drift was organization-controller on k8s.io/api 0.31.0 vs the others on 0.31.1 — a trivial bump. Independent dep-version pinning would only be valuable if a controller needed a hostile dep the others shouldn't pull; nothing in the current tree is hostile.
Containerfiles + workflows updated:
* 5 Containerfiles now COPY core/controllers/{go.mod,go.sum,internal/} plus the per-controller tree from a repo-root build context.
* 4 per-controller workflows (application/environment/organization/useraccess; blueprint-controller has no dedicated workflow yet) now trigger on core/controllers/{<name>/**, internal/**, go.mod, go.sum} and run go vet + go test scoped to their own tree + shared internal.
* useraccess workflow context flipped from core/controllers/useraccess to . (repo root) so the Containerfile can reach the shared go.mod.
Subpackages NOT promoted in this PR (compromise — flagged for follow-up):
* gitea/ — 4 of 5 controllers each ship a Gitea HTTP client. The APIs DIVERGE (organization has Org+Repo CRUD with Repo struct return values; application/blueprint/environment have File CRUD with Org-not-found sentinel). A SUPERSET package would require renaming methods (e.g. EnsureRepo collides on signature) which crosses the brief's "no API redesign" line. CC2 follow-up slice should design the unified surface before promoting.
* validate/ — application's package validates Application.spec.parameters against a JSON Schema (santhosh-tekuri lib); blueprint's validates Blueprint CR business rules (semver-backed). Same dir name, completely different functions — not actually duplicates.
* gitops/ — environment's renders Flux GitRepository for an Environment; organization's renders HelmRelease+Namespace for an Org. Same dir name, different inputs and outputs.
Test-coverage delta: pre-consolidation 134 root-level tests (sum across 5 modules); post-consolidation 133 tests. Net delta -1: blueprint and application each had their own TestIsValidRange in their semver pkg; the shared semver pkg's TestIsValidRange now exercises the union of both controllers' valid+invalid input corpora — coverage strictly improved even though one redundant test name disappeared.
Verified locally: go build + go vet + `go test -count=1 -race ./...` all clean; all 5 controller binaries (cmd/) link successfully.
Co-authored-by: hatiyildiz <hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
dbf585744c
|
feat(controllers): land application-controller (slice C4, #1095) (#1133)
Watches Application.apps.openova.io/v1 CRs and reconciles each
Application to per-region kustomization + helmrelease manifests in
the per-Org Gitea repo (gitea.<location-code>.<sovereign-domain>/<org>/<app>).
Reconcile flow per slice C4 brief:
1. Resolve parents: spec.environmentRef → Environment CR, then
Environment.spec.organizationRef → Organization CR. Pending-on-miss.
2. Fetch Blueprint at spec.blueprintRef.{name,version} (v1 with
v1alpha1 fallback). Pending-on-miss.
3. Validate spec.parameters against Blueprint.spec.configSchema via
github.com/santhosh-tekuri/jsonschema/v5. On invalid → status.phase=
Failed + Condition reason=Invalid listing every failing JSON pointer.
4. Validate placement against Blueprint.spec.placementSchema.modes.
5. Resolve placement → per-region work plan:
- single-region: regions[0] only, role=primary
- active-active: every region rendered identically (sorted
for byte-stability), role=active, no primaryRegion
- active-hotstandby: regions[0] primary, regions[1..] standby
(replicas: 0 + _openova_standby: true overlay; Continuum
#1101 flips on switchover)
6. Render kustomization.yaml + helmrelease.yaml per region under
clusters/<region>/applications/<app>/{...}.yaml on the env-type-
mapped branch (develop|staging|main per NAMING §11.2).
7. Idempotent commit via gitea.PutFile's byte-equality short-circuit
— re-reconcile on steady state = 0 Gitea writes (slice C4 brief
test #7).
8. Status update: phase / primaryRegion / regions[] / giteaRepo /
installedBlueprint{name,version,digest} / conditions[].
9. Finalizer + cascade delete: on metadata.deletionTimestamp, removes
every manifest the controller wrote and releases the finalizer.
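Step 5 of the flow above (placement to per-region work plan) can be sketched as below. This is an illustration under stated assumptions, not the controller's code: `RegionPlan` and `Plan` are hypothetical names, and "sorted for byte-stability" is read here as sorting region names, which the commit does not spell out.

```go
package main

import (
	"fmt"
	"sort"
)

// RegionPlan is a hypothetical per-region work item.
type RegionPlan struct {
	Region  string
	Role    string // "primary", "active" or "standby"
	Standby bool   // true means: render replicas: 0 + _openova_standby: true
}

// Plan maps a placement mode and an ordered region list to work items.
func Plan(mode string, regions []string) ([]RegionPlan, error) {
	if len(regions) == 0 {
		return nil, fmt.Errorf("placement %q: no regions given", mode)
	}
	switch mode {
	case "single-region":
		// regions[0] only, role=primary
		return []RegionPlan{{Region: regions[0], Role: "primary"}}, nil
	case "active-active":
		// Every region rendered identically; sort a copy so the output
		// order does not depend on the order in the spec.
		sorted := append([]string(nil), regions...)
		sort.Strings(sorted)
		plans := make([]RegionPlan, 0, len(sorted))
		for _, r := range sorted {
			plans = append(plans, RegionPlan{Region: r, Role: "active"})
		}
		return plans, nil
	case "active-hotstandby":
		// regions[0] primary, regions[1..] standby
		plans := []RegionPlan{{Region: regions[0], Role: "primary"}}
		for _, r := range regions[1:] {
			plans = append(plans, RegionPlan{Region: r, Role: "standby", Standby: true})
		}
		return plans, nil
	default:
		return nil, fmt.Errorf("unknown placement mode %q", mode)
	}
}

func main() {
	plans, err := Plan("active-hotstandby", []string{"fsn1", "hel1"})
	if err != nil {
		panic(err)
	}
	for _, p := range plans {
		fmt.Printf("%s role=%s standby=%v\n", p.Region, p.Role, p.Standby)
	}
}
```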
Architecture compliance per docs/INVIOLABLE-PRINCIPLES.md:
- Flux is the only reconciler. Controller writes to Gitea; Flux
applies. NO direct K8s create of HelmRelease/Kustomization/Service.
- Dynamic client + unstructured.Unstructured (no controller-gen, no
zz_generated_deepcopy.go).
- Every value is environment-configurable (GITEA_API_URL, GITEA_TOKEN,
GITEA_PUBLIC_URL, SOURCE_NAMESPACE, HELMRELEASE_INTERVAL,
CATALOG_SOURCE_REF, REQUEUE_AFTER_SECONDS, METRICS_ADDR, HEALTH_ADDR,
LEADER_ELECT, LEADER_ELECT_NS, LOG_LEVEL).
- SHA-pinned images via the focused build-application-controller.yaml
workflow (push-on-paths + PR + workflow_dispatch — no cron).
Tests cover the full 9-test matrix from the brief plus 3 bonus paths:
T1 Pending on missing Environment (no Gitea writes).
T2 Pending on missing Blueprint (no Gitea writes).
T3 Invalid on parameters schema mismatch — Condition message names
the failing path 'replicas'; no Gitea writes.
T4 single-region happy path → expected manifests written under
clusters/<region>/applications/<app>/ on branch=main, finalizer
added, status.phase=Provisioning, status.primaryRegion populated,
status.giteaRepo populated.
T5 active-active fan-out → 2 regions, 2 manifest sets byte-equal
after region-name canonicalisation. status.primaryRegion empty.
T6 active-hotstandby → primary renders replicas:3 (user param);
standby renders replicas:0 + _openova_standby:true marker.
T7 Idempotency → re-reconcile after success = 0 Gitea writes
(PutFile byte-equality short-circuit).
T8 Deletion cascade → manifests removed from Gitea, finalizer
released after delete pass.
T9 Drift detection → Gitea-side manifest hand-edited; controller
restores byte-identical original on next pass.
+ Pending on Gitea Org missing (org doesn't exist in Gitea even
though Organization CR exists — slice C1 hasn't run yet).
+ Invalid placement-vs-blueprint-allowed-modes (placement-active-active
rejected on a Blueprint declaring only single-region).
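Tests T7 (idempotency) and T9 (drift detection) both depend on the byte-equality short-circuit in PutFile. The sketch below shows that idea against an in-memory stand-in; `fakeRepo` and its method signature are hypothetical and do not reproduce the real Gitea client.

```go
package main

import (
	"bytes"
	"fmt"
)

// fakeRepo is an in-memory stand-in for a Gitea repo's contents API.
type fakeRepo struct {
	files  map[string][]byte
	writes int // number of writes actually performed
}

// PutFile stores content at path only when it differs from what is already
// there, and reports whether a write happened.
func (r *fakeRepo) PutFile(path string, content []byte) bool {
	if cur, ok := r.files[path]; ok && bytes.Equal(cur, content) {
		return false // steady state: nothing to commit
	}
	r.files[path] = append([]byte(nil), content...)
	r.writes++
	return true
}

func main() {
	repo := &fakeRepo{files: map[string][]byte{}}
	path := "clusters/fsn1/applications/demo/helmrelease.yaml"
	desired := []byte("kind: HelmRelease\n")

	repo.PutFile(path, desired) // first reconcile writes
	repo.PutFile(path, desired) // re-reconcile is a no-op (T7)

	repo.files[path] = []byte("kind: HelmRelease # hand-edited\n")
	repo.PutFile(path, desired) // drift is overwritten (T9)

	fmt.Println("writes:", repo.writes)
}
```

Because the desired bytes are compared rather than a timestamp or generation counter, the same call covers both the idempotent case and the drift-restore case.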
Module path: github.com/openova-io/openova/core/controllers/application
(per-controller go.mod, matching siblings C1/C2/C3/C5; CC1 promotes
shared internals to core/controllers/internal/ in a follow-up slice).
`go vet ./...` clean. `go test -count=1 -race ./...` all green.
Co-authored-by: hatiyildiz <hatiyildiz@noreply.openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
8988cd9e4f
|
feat(infra-hetzner): wire all var.regions[] entries end-to-end (slice G1, #1095) (#1131)
Slice G1 of EPIC-0 (#1095, Group G "Multi-cluster substrate"). Today infra/hetzner/main.tf only realises regions[0] end-to-end — every wizard payload's regions[1..N] entries silently no-op. EPIC-6 (#1101) Continuum DR demo needs 3 regions (mgmt + fsn + hel per docs/EPICS-1-6-unified-design.md §3.8 + §11), so this slice closes the gap.
Architecture: hybrid singular-path + secondary-region overlay.
- The legacy singular path (var.region + count = local.control_plane_count) STAYS untouched — every existing Sovereign state (omantel, otech*) keeps its resource addresses (hcloud_server.control_plane[0], hcloud_load_balancer.main, etc) and produces a no-op plan diff.
- New regions (regions[1+]) are realised via a parallel for_each set keyed by "{cloudRegion}-{index}" (e.g. fsn1-1, hel1-2). Each secondary region gets its own /24 subnet inside the shared /16 hcloud_network, its own CP server, its own workers, and its own lb11 load balancer. The hcloud_firewall + hcloud_ssh_key are shared (one tenant boundary per Sovereign).
Why hybrid not full for_each: a wholesale refactor would change every existing resource address (hcloud_server.control_plane[0] → hcloud_server.control_plane["mgmt"]), forcing every running Sovereign to run `tofu state mv` for ~12 resources or face destructive recreates. The brief explicitly bans that. Hybrid is purely additive — secondary resources are NEW addresses no existing state carries. No `tofu state mv` runbook required.
Existing Sovereigns provisioned with var.regions = [] or len(var.regions) == 1 produce identical plans before and after this PR.
Slice G3 (out of scope here) wires Cilium ClusterMesh between secondary regions and adds per-cluster GitOps path differentiation; today every secondary CP renders an identical Flux Kustomization pointed at clusters/<sovereign_fqdn>/.
Tests: tests/multi_region.tftest.hcl exercises 5 scenarios offline via mock_provider + override_resource (no real Hetzner):
- legacy_no_regions_payload (var.regions=[])
- single_region_entry_does_not_double_provision (len==1)
- three_region_mgmt_fsn_hel (EPIC-6 shape)
- same_region_duplicates_produce_distinct_keys
- non_hetzner_regions_are_filtered_out (oci entries skipped)
All 5 pass.
CI workflow infra-hetzner-tofu.yaml runs validate + fmt -check + test on every PR touching infra/hetzner/**. Per CLAUDE.md "every workflow MUST be event-driven, NEVER scheduled": push-on-merge + pull-request-on-touch + workflow_dispatch only. No cron.
Validation:
$ tofu validate
Success! The configuration is valid.
$ tofu fmt -check -recursive
exit=0
$ tofu test
tests/multi_region.tftest.hcl... pass
run "legacy_no_regions_payload"... pass
run "single_region_entry_does_not_double_provision"... pass
run "three_region_mgmt_fsn_hel"... pass
run "same_region_duplicates_produce_distinct_keys"... pass
run "non_hetzner_regions_are_filtered_out"... pass
Success! 5 passed, 0 failed.
Co-authored-by: hatiyildiz <hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
2ab442544e
|
feat(controllers): land environment-controller (slice C2, #1095) (#1127)
Implements slice C2 of EPIC-0 #1095 — the environment-controller Go binary. Watches Environment.catalyst.openova.io/v1 CRs (cluster-scoped) and reconciles each Environment to:
1. Verify the per-Org Gitea Org exists (parent Organization gate). Missing org surfaces GiteaOrgReady=False + Pending phase, never panics or crashloops.
2. Track the canonical branch name for this Environment in status.giteaRepoRef.{org,branch} per NAMING-CONVENTION.md §11.2 item 1 (develop/staging/main ↔ dev/stg/prod; uat/poc map to their own branch name).
3. Idempotently write per-vCluster Flux GitRepository manifests into the Org's Gitea repo at the canonical path `clusters/<host-cluster>/environments/<env-name>/gitrepository.yaml` per NAMING §11.2 item 3. Multi-region Environments fan out one commit per spec.regions[]. Identical bytes short-circuit (zero spurious commits in repo history); drift triggers an overwrite with the existing blob SHA.
4. Surface the canonical JetStream subject prefix `ws.{organizationRef}-{envType}.>` on status.jetstreamSubjectPrefix per NAMING §11.2 item 4 + ARCHITECTURE.md §5. Per-Environment NATS Stream CR creation is OUT OF SCOPE here — NACK isn't installed yet (future slice).
5. Set status.phase, status.regionCount (printer column), status.vclusters[], status.observedGeneration, and the Ready/GiteaOrgReady/GitRepositoryWritten conditions.
Architecture rules honored (per docs/INVIOLABLE-PRINCIPLES.md + docs/adr/0001-catalyst-control-plane-architecture.md):
- Flux is the only reconciler in production. The controller writes manifests to Gitea; Flux applies them. NO kubectl apply, NO helm install, NO exec.Command in the codebase.
- Crossplane is cloud-only. This controller is K8s-to-K8s native via controller-runtime + client-go.
- DR is a Placement, not an Env Type. The controller treats spec.envType as the schema-validated enum {prod|stg|uat|dev|poc} with no special-case for DR (per NAMING §11.1).
- Sovereign-independent. The Gitea base URL, secret ref, branch suffix, commit author, and Flux interval are ALL runtime config (per Inviolable Principle #4 — never hardcode).
Files:
- core/controllers/environment/api/v1/types.go — Environment Go types matching the CRD; hand-written DeepCopy to avoid build-time codegen tool dependency.
- core/controllers/environment/internal/gitea/client.go — minimal GitHub-compatible REST client targeting Gitea's /api/v1 (GET /orgs/{org}, GET/POST/PUT /repos/{org}/{repo}/contents/{path}). Idempotent UpsertFile with byte-equality short-circuit + blob-SHA conflict refusal.
- core/controllers/environment/internal/gitops/render.go — pure template rendering of the Flux GitRepository CR. Deterministic field ordering for byte-equality idempotency.
- core/controllers/environment/internal/controller/environment_controller.go — reconciler: validate spec, gate on Gitea Org, fan out per-region manifest writes, set status + conditions.
- core/controllers/environment/cmd/main.go — controller-runtime manager entry point with leader election.
- core/controllers/environment/Containerfile — two-stage build, alpine:3.20 runtime, non-root UID 65534, ENTRYPOINT.
- core/controllers/environment/deploy/rbac.yaml — ClusterRole watching Environments + status subresource + leader election lease.
- .github/workflows/build-environment-controller.yaml — CI mirrors build-cert-manager-dynadot-webhook.yaml: vet + race tests, docker buildx + cosign keyless sign + SBOM attest, push to ghcr.io/openova-io/openova/environment-controller.
Tests (35 total, all GREEN, race-detector enabled):
- internal/controller (T1–T11):
T1 happy-path single-region reconcile
T2 idempotent re-reconcile (zero spurious commits)
T3 parent Org missing → Pending + GiteaOrgReady=False (no panic)
T4 multi-region fan-out (3 commits, 3 regions)
T5 drift detection — operator hand-edit gets overwritten
T6 placement-vs-regions cardinality violations → Failed
T7 env_type→branch mapping table
T8 Gitea repo missing → Pending + GiteaRepoMissing reason
T9 partial-failure one region → Degraded with that region Failed
T10 Config.Defaults applies the documented defaults
T11 NotFound between dequeue and Get is benign
- internal/gitea: GET /orgs OK + 404 + 500; UpsertFile create / idempotent / update with SHA / repo-not-found; pathEscape preserves slashes; arg-validation.
- internal/gitops: BranchForEnvType / JetStreamSubjectPrefix / HostClusterName (with override) / GitRepositoryPath / RenderGitRepository (deterministic + complete + anonymous + default interval + required-field validation) / EnvironmentName.
go vet ./... clean. go test -count=1 -race ./... GREEN.
Out of scope per slice brief: organization-controller (C1), blueprint-controller (C3), application-controller (C4), useraccess-controller (C5), catalyst-api codebase changes, NACK install, per-Environment NATS Stream CRs.
Co-authored-by: hatiyildiz <hati@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
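The env-type to branch mapping and the JetStream subject prefix described in the environment-controller commit can be sketched as two small pure functions. The function names come from the commit's test list, but the signatures and bodies here are assumptions, and the configurable branch suffix mentioned in the commit is left out.

```go
package main

import "fmt"

// BranchForEnvType maps an Environment's envType to its Git branch:
// dev/stg/prod map to develop/staging/main, uat and poc map to themselves.
func BranchForEnvType(envType string) (string, error) {
	switch envType {
	case "dev":
		return "develop", nil
	case "stg":
		return "staging", nil
	case "prod":
		return "main", nil
	case "uat", "poc":
		return envType, nil
	default:
		return "", fmt.Errorf("unsupported envType %q", envType)
	}
}

// JetStreamSubjectPrefix builds ws.{organizationRef}-{envType}.>
func JetStreamSubjectPrefix(organizationRef, envType string) string {
	return fmt.Sprintf("ws.%s-%s.>", organizationRef, envType)
}

func main() {
	for _, t := range []string{"dev", "stg", "prod", "uat", "poc"} {
		branch, err := BranchForEnvType(t)
		if err != nil {
			panic(err)
		}
		// "acme" is a placeholder Organization name.
		fmt.Println(t, "->", branch, JetStreamSubjectPrefix("acme", t))
	}
}
```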
||
|
|
84167a768e
|
feat(controllers): land organization-controller (slice C1, #1095) (#1129)
A thin in-cluster Go controller that watches Organization CRs
(orgs.openova.io/v1) and reconciles four downstream artifacts per
the EPICS-1-6 unified design §3.3 + §3.7 and ADR-0001 §2.7:
1. vCluster HelmRelease — written into the per-Org Gitea repo
(NOT direct apply; Flux reconciles per ADR-0001 §2.1).
2. Keycloak group — at path /<slug> with attributes
{org=[<slug>], tier=[<sme|corporate>]}.
3. Gitea Org — auto-created if absent; one repo per Org seeds
the vCluster + tenant manifests.
4. UserAccess CR — one per spec.owners[] entry; slice C5's
useraccess-controller materializes the RoleBindings.
Per ADR-0001 §2.2 (Crossplane is cloud-only) this is K8s-to-K8s
reconciliation NOT a Crossplane Composition. Per §2.1 the controller
writes manifests via the Gitea HTTP contents API — never kubectl
apply, never helm install, never exec.Command("helm", ...).
Idempotent: re-running on a steady-state CR is a no-op (every
"ensure" is find-or-create with byte-equal short-circuit on PutFile).
What ships:
- core/controllers/organization/cmd/main.go — entry point with
envconfig, leader election, signal handling
- core/controllers/organization/internal/controller/ — reconciler +
KeycloakClient interface + LiveKeycloak impl
- core/controllers/organization/internal/gitea/ — minimal Gitea Admin
REST client (Org/Repo + contents-API). Self-contained — extractable
to core/pkg/gitea-client/ when slice C2 needs it.
- core/controllers/organization/internal/gitops/ — manifest renderer
(namespace + vcluster HelmRelease + kustomization)
- core/controllers/organization/internal/orgapi/ — Organization Go
types mirroring the CRD schema (no deepcopy-gen — inlined)
- core/controllers/organization/Containerfile — multi-stage build
(alpine-based, runs as UID 65534)
- core/controllers/organization/config/{rbac,manager}/ — ClusterRole
+ Deployment scaffolding for chart consumption (slice F1)
- .github/workflows/build-organization-controller.yaml — push/PR/
manual triggers, no cron
Tests: 9 unit tests across 3 packages cover happy-path reconcile,
idempotency (zero net writes on second reconcile), Keycloak group
already exists, Gitea Org already exists, slug/metadata drift,
missing CR no-op, byte-equal PutFile no-op, 422-race re-find,
template structural-YAML validity, and label-vocabulary compliance.
go test -count=1 -race ./... and go vet ./... both clean.
Out of scope: environment-controller (C2), application-controller
(C4), useraccess-controller (C5 — this controller only WRITES
UserAccess CRs).
Co-authored-by: hatiyildiz <hatiyildiz@noreply.openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
dd1699afe3
|
feat(controllers): land useraccess-controller — fix silently broken Crossplane path (slice C5, #1095, P0) (#1128)
Per docs/EPICS-1-6-unified-design.md §3.5 and ADR-0001 §2.3 amendment,
K8s-to-K8s reconciliation belongs to thin in-cluster controllers, not
Crossplane Compositions. The existing useraccess.compose.openova.io
Composition writes RoleBindings via provider-kubernetes — but
provider-kubernetes is NOT installed on any production Sovereign
(caught in the EPIC-0 audit). Every UserAccess CR has been silently
no-op'd. This controller fixes that.
What lands:
- core/controllers/useraccess/cmd/main.go — controller-runtime Manager
with leader election + signal handling, environment-only config
- internal/controller/{reconciler,desired,spec,status,types}.go — the
reconciler. Watches UserAccess.access.openova.io/v1alpha1 (cluster-
scoped, unstructured client) and owns RoleBinding +
ClusterRoleBinding via Owns() so drift triggers reconcile via
ownerRef indexing
- internal/labels/scope.go — Manara DNA scope matcher: AND-within /
OR-across, wildcard scopes, EnforcedScopes() per catalog tier (the
developer auto-injection of openova.io/env-type=dev)
- internal/controller/*_test.go + internal/labels/scope_test.go —
26 unit tests with the controller-runtime fake client. Covers
happy-path, multi-app/multi-ns fan-out, namespaces:["*"]→CRB,
group subjects, drift detection+restore, orphan deletion on spec
shrink, idempotency, invalid spec, ownerRef shape, NotFound no-op,
and the 5-catalog-tier matrix
- deploy/{rbac,deployment}.yaml — ClusterRole/SA/Deployment with
non-root, read-only-rootfs, drop-ALL caps, leader-election Role
- Containerfile — Alpine 3.20 final stage, CGO_ENABLED=0, UID 65534
- .github/workflows/useraccess-controller-build.yaml — event-driven
build (push-on-main + PR test job), SHA-pinned image tags
Behaviour:
- Per UserAccess CR, materialises RoleBindings (per namespace) or
ClusterRoleBindings (when namespaces:["*"]) referencing the
canonical openova:application-{admin,editor,viewer} ClusterRoles
- ownerRef back to the UserAccess CR with controller=true +
blockOwnerDeletion=true so K8s GC cascades deletes
- Drift detection: hand-mutated bindings are restored on next pass +
Condition Drift=True surfaced for the UI
- Idempotent: steady-state reconcile = 0 K8s writes
- Status: phase (Pending|Active|Failed), rolebindingsCreated,
observedGeneration, conditions[]
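The RoleBinding versus ClusterRoleBinding decision in the behaviour list can be sketched without any Kubernetes imports. `Binding` and `DesiredBindings` are hypothetical names, and treating any `"*"` entry in the list as the wildcard is an assumption; the commit only describes the `namespaces:["*"]` case.

```go
package main

import "fmt"

// Binding is a simplified, hypothetical view of a desired binding.
type Binding struct {
	Kind        string // "RoleBinding" or "ClusterRoleBinding"
	Namespace   string // empty for a ClusterRoleBinding
	ClusterRole string // openova:application-{admin,editor,viewer}
}

// DesiredBindings returns one RoleBinding per namespace, or a single
// ClusterRoleBinding when the namespace list contains the wildcard "*".
func DesiredBindings(role string, namespaces []string) ([]Binding, error) {
	switch role {
	case "admin", "editor", "viewer":
	default:
		return nil, fmt.Errorf("unknown role %q", role)
	}
	clusterRole := "openova:application-" + role
	for _, ns := range namespaces {
		if ns == "*" {
			return []Binding{{Kind: "ClusterRoleBinding", ClusterRole: clusterRole}}, nil
		}
	}
	out := make([]Binding, 0, len(namespaces))
	for _, ns := range namespaces {
		out = append(out, Binding{Kind: "RoleBinding", Namespace: ns, ClusterRole: clusterRole})
	}
	return out, nil
}

func main() {
	perNS, _ := DesiredBindings("editor", []string{"shop", "billing"})
	wide, _ := DesiredBindings("viewer", []string{"*"})
	fmt.Println(len(perNS), perNS[0].Kind, perNS[0].Namespace)
	fmt.Println(len(wide), wide[0].Kind, wide[0].ClusterRole)
}
```

Computing the desired set as plain data first makes the other listed behaviours a matter of diffing: bindings present in the cluster but absent from the desired set are orphans, and bindings whose contents differ are drift.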
Out of scope per the brief:
- Crossplane Composition deletion (operator retires post-verify)
- 5-catalog-tier role inheritance (lands with EPIC-3 #1098)
- Keycloak realm-role sync (slice D1b, this controller is consumer)
Tests:
go vet ./... # clean
go test -count=1 -race ./... # 26/26 pass
go test ./internal/labels/... -run TestScope # full 5-tier matrix
Co-authored-by: Hatice Yildiz <hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
358c32c032
|
ci: add cluster bootstrap-kit drift guardrail (slice H2 scope-reduced, #1095) (#1122)
Adds .github/workflows/cluster-template-drift.yaml — a warn-only workflow that reports drift between each clusters/<sovereign>/bootstrap-kit/ tree and the canonical clusters/_template/bootstrap-kit/.
Why warn-only, not enforce:
- Every existing Sovereign carries some legitimate drift (per-Sovereign image SHAs, region-specific values overlay) — blocking PRs on diff count would prevent ALL cluster work.
- The right place to enforce the boundary is Catalyst's organization-controller (slice C1 of #1095), not CI. Once C1 ships, every new Sovereign bootstrap-kit is generated from _template and the attestation lives at apply-time, not at CI-time.
- Retroactively reconciling the existing omantel.omani.works/ and otech.omani.works/ trees (which have 20+ differing files plus structural changes — extra files on each side) is a high-blast-radius maintenance-window operation, NOT a CI scoped slice.
What this workflow does:
- Triggers on push to main + PR + workflow_dispatch when clusters/** changes.
- For each clusters/<sovereign>/ directory, runs `diff -rq` against clusters/_template/bootstrap-kit/ and writes a Markdown report to the run summary AND a sticky PR comment.
- Counts differing files + only-in-template + only-in-Sovereign per Sovereign so reviewers can quickly see whether new drift was introduced.
Per docs/EPICS-1-6-unified-design.md §3.9 row 2 + §11 row 6 (decision amended from "reconcile + CI gate" to "warn-only CI gate"; structural reconcile deferred to slice C1 organization-controller). Per docs/INVIOLABLE-PRINCIPLES.md #4a — workflow only inspects YAML; no images built, no cloud calls.
Refs: #1094, #1095, slice C1 (organization-controller).
Co-authored-by: hatiyildiz <hatiyildiz@noreply.openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
eb6a3c1812
|
fix(chart,ci): auto-bump literal catalyst-{api,ui} SHAs — Sovereigns + contabo were frozen at :2122fb8 (#1060)
* fix(catalyst-api): rip out dangling sovereign_* route registrations + chart 1.4.56
PR #1050 deleted sovereign_more.go (which defined HandleSovereignUsers, HandleSovereignCatalog, HandleSovereignSettings, HandleSovereignTopology) but left four route registrations in cmd/api/main.go that still referenced those handler methods. The catalyst-api build for the merged revert (run 25439549879) failed with:
cmd/api/main.go:690:39: h.HandleSovereignUsers undefined
cmd/api/main.go:691:41: h.HandleSovereignCatalog undefined
cmd/api/main.go:692:42: h.HandleSovereignSettings undefined
cmd/api/main.go:693:42: h.HandleSovereignTopology undefined
That's why ghcr.io/openova-io/openova/catalyst-api:fdd3354 was never published — only the UI image rolled. Result: omantel.biz catalyst-api pod stuck in ImagePullBackOff.
Drop the four route registrations. Same baby, new address — the chroot Sovereign uses the existing /api/v1/deployments/{depId}/* handlers via the JWT-resolved deploymentId, not parallel-baby /api/v1/sovereign/* endpoints.
Also revert two more parallel-baby fragments still on main:
- getHierarchicalInfrastructure mode-aware fetcher → single mother URL (the chroot resolves deploymentId from the cookie and the mother-side topology handler serves byte-identical data once cutover-import has persisted the deployment record on the Sovereign's local store)
- CatalogAdminPage.fetchApps mode-aware → /catalog/apps everywhere
Bump bp-catalyst-platform chart 1.4.55 → 1.4.56 and the cluster Kustomization version pin to match.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(sovereignDynamicClient): in-cluster fallback when running ON the Sovereign
The chroot Sovereign Console at console.<sov-fqdn> is the SAME catalyst-api binary as the mother. When that binary runs ON the Sovereign cluster (catalyst-system namespace on the Sovereign itself), there is no posted-back kubeconfig — the catalyst-api IS in the cluster it needs to talk to, and rest.InClusterConfig() returns the right credentials.
Without this, every endpoint that needs the Sovereign-side dynamic client returned 503 with "sovereign cluster kubeconfig not yet posted back" — including ListUserAccess (/users page), CreateUserAccess, infrastructure CRUD, etc. Caught on omantel.biz 2026-05-06: /users rendered "list user-access: HTTP 503" because the Sovereign-side catalyst-api was looking for a kubeconfig that doesn't exist on the chroot side of the cutover boundary.
Detection: SOVEREIGN_FQDN env (set on every Sovereign-side catalyst-api deployment by the chart) matches dep.Request.SovereignFQDN. On the mother, SOVEREIGN_FQDN is unset → unchanged behavior. On the chroot, SOVEREIGN_FQDN matches the only deployment served (its own) → use in-cluster.
Same fallback applied to tryDynamicClientLocked (loaderInputFor's best-effort live-source client) so /infrastructure/topology and the /cloud graph render with live data on the chroot too.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(user-access): empty list when CRD absent + RBAC for chroot
Two coupled fixes for the /users page on chroot Sovereign Console:
1. catalyst-api-cutover-driver ClusterRole: grant read/write on useraccesses.access.openova.io. The Sovereign chroot's catalyst-api uses the in-cluster ServiceAccount (per PR #1052). The list call was returning 403 from the apiserver because the SA had no rule covering this CRD.
2. ListUserAccess: return 200 with empty items when the CRD itself is not installed (apierrors.IsNotFound). The access.openova.io CRD ships via a separate blueprint that may not yet be installed on a fresh Sovereign — the page should render its empty state, not a 500 toast.
Caught live on omantel.biz 2026-05-06 after PR #1052 unblocked the in-cluster client path: list call surfaced first as 403 (RBAC), then as 500 "server could not find the requested resource" (CRD absent). Both now resolve to a 200 + [].
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(chroot): byte-identical /jobs + /cloud — kill fixture fallback, lazy-seed jobs.Store from live cluster, single endpoint
Two parallel-baby paths still made the chroot diverge from the mother on /cloud and /jobs/{jobId}. Both now ship one path that serves byte-identical data on both surfaces.
1. CloudPage rendered fictional topology (Frankfurt, Helsinki, omantel-primary, omantel-secondary, edge-lb, vpc-net-eu, …) when the topology query errored — because it fell back to `infrastructureTopologyFixture` from `src/test/fixtures/`. That is a test-only file leaking into production via the production import tree, in direct violation of INVIOLABLE-PRINCIPLES #1 (no placeholder data — empty state when you don't know).
Fix: drop the fixture fallback. On error → null → empty-state render. The mother shows the same empty state when its loader returns nothing; byte-identical.
2. JobsTable + JobDetail rendered a flat green-grid because the chroot was hitting `/api/v1/sovereign/jobs` which returns a minimal shape (no dependsOn, no parentId, no exec records). Mother's `/api/v1/deployments/{depId}/jobs` returns the rich shape from a per-deployment jobs.Store, which on the chroot starts empty (the mother's exportDeploymentToChild only ships the deployment record, not the jobs.Store contents).
Fix: ship one URL on both surfaces — `/api/v1/deployments/{id}/jobs`. Add `chrootSeedJobsStoreIfEmpty` that runs at handler-time when SOVEREIGN_FQDN matches dep.Request.SovereignFQDN AND the per-deployment jobs.Store has 0 records: do a one-shot HelmRelease list via the in-cluster client (helmwatch.ListAndSnapshotHelmReleases — exported here, mirrors Watcher.SnapshotComponents without spinning up an informer), pass through snapshotsToSeeds + Bridge.SeedJobsFromInformerList. Subsequent calls read directly from the now-populated store and return rich Job records with dependsOn / parentId / status — exactly like the mother.
useLiveJobsBackfill loses its mode-aware fetcher; the chroot UI uses the same `/api/v1/deployments/{id}/jobs` URL as the mother.
3. HandleDeploymentImport now also loads the imported record into the in-memory deployments map immediately, so `/deployments/{id}/*` handlers don't need a pod restart's restoreFromStore to see the chroot-imported deployment.
Bump bp-catalyst-platform 1.4.56 → 1.4.57 (chart + Kustomization).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(jobdetail): bare-jobName URL — Traefik strips %3A so canonical id 404s
JobDetail navigation was 404ing on the chroot because the link builder URL-encoded the canonical Job id ("69e73b3abe673840:install-keycloak") and Traefik (or any upstream proxy that's RFC 3986 §3.3-strict) does not decode `%3A` inside path segments. The catalyst-api router saw the literal "%3A" and Store.GetJob's exact-match path missed.
Two coupled fixes:
1. useJobLinkBuilder strips the "<deploymentId>:" prefix before encoding, producing /jobs/install-keycloak (Traefik-safe) instead of /jobs/69e73b3abe673840%3Ainstall-keycloak. Store.GetJob already accepts both bare jobName and canonical id (see store.go:781-789).
2. JobDetail.jobsById indexes by BOTH canonical id AND bare jobName so the URL param resolves regardless of which format the link emitted.
Bump chart 1.4.58 → 1.4.59.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(cloud): resolve deploymentId from cookie on chroot — was firing topology against undefined
CloudPage's topology query fired against /deployments/undefined/... on the chroot (URL is /cloud, no deploymentId path segment), so the page showed "Couldn't load architecture" with all node counts at 0/0.
Fix: same pattern as JobDetail — useResolvedDeploymentId() reads the JWT cookie's deployment_id claim via /api/v1/sovereign/self, falling back from URL params. Topology query also gates on `!!deploymentId` so it doesn't waste a 404 round-trip during cookie resolution.
Bump chart 1.4.60 → 1.4.61.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(chroot): single chrome — no frame in frame, no mother handover banner
Two visible bleed-throughs from the mother's wizard UX onto the chroot Sovereign Console at console.<sov-fqdn>:
1. **Two stacked headers + sidebar inside sidebar** ("frame in frame"). SovereignConsoleLayout rendered its own sidebar+header AND the page inside rendered PortalShell which rendered ANOTHER header (its sidebar was already skipped for chroot per a prior fix). User saw two horizontal title bars stacked.
Resolution: SovereignConsoleLayout becomes auth-only on the chroot. It runs the cookie/OIDC auth gate + RequiredActionsModal, then renders <Outlet/> with NO chrome. PortalShell is now the single chrome owner on both surfaces:
- Mother (/sovereign/provision/$id): renders Sidebar with /provision/$id/X URLs + its header.
- Chroot (console.<sov-fqdn>): renders SovereignSidebar with clean /X URLs + the same header.
One sidebar, one header, byte-identical to mother layout.
2. **"✓ Sovereign is ready — Redirecting to your Sovereign console" banner on /apps.** This is the mother's wizard celebration that tells the operator "you can now jump to your new Sovereign". On the chroot the operator IS already on the Sovereign Console; the banner bleeds through because the imported deployment record carries the mother's handover-ready event in its history.
Resolution: AppsPage gates the banner, the toast, and the auto-redirect timer on `!isSovereignMode`. Chroot stays clean.
Bump chart 1.4.62 → 1.4.63.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(chroot): wrap chroot-only pages in PortalShell + drop /catalog page
Three chroot-only pages bypassed PortalShell entirely. After SovereignConsoleLayout went auth-only in #1057, they rendered full-bleed with no sidebar / no header — visible look-and-feel break.
/settings/marketplace → MarketplaceSettings (wrapped in PortalShell)
/parent-domains → ParentDomainsPage (wrapped in PortalShell)
/catalog → CatalogAdminPage (deleted)
Drop /catalog entirely per founder direction: a separate page just to flip a "publish to marketplace" boolean per app is the wrong shape. The natural place for that toggle is on each /apps card (future PR — needs HandleSovereignApps to join publish state from the SME catalog microservice).
Removed:
- /catalog route registration in router.tsx
- 'Catalog' entry in SovereignSidebar's FLAT_NAV
- CatalogAdminPage.tsx (525 lines)
- 'catalog' from ActiveSection union + deriveActiveSection regex
The publish-state PATCH endpoint at /catalog/admin/apps/{slug}/publish on the SME catalog service is unaffected; it's exposed at marketplace.<sov-fqdn>, not console.<sov-fqdn>, and the future apps-card toggle will call it via the same path.
Bump chart 1.4.64 → 1.4.65.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(apps): publish chip on each card — replaces deleted /catalog page
Per founder direction: "if the catalog is just labeling an app to be shown in marketplace, why don't we do it through the apps?" — drop the standalone /catalog page (#1058), put the publish toggle on each /apps card.
Backend (catalyst-api):
- New file sme_catalog_client.go — best-effort client for the in-cluster SME catalog microservice at http://catalog.sme.svc.cluster.local:8082. 30s response cache, 1.5s probe budget, returns nil on DNS NXDOMAIN (SME services tier not deployed on this Sovereign — common when marketplace.enabled is false).
- HandleSovereignApps decorates each app with `marketplacePublished` *bool joined by slug from the SME catalog. nil ⇒ slug not in SME catalog (bootstrap component, or marketplace not deployed) ⇒ FE suppresses the chip.
- New handler HandleSovereignAppPublish at PATCH /api/v1/sovereign/apps/{slug}/publish. Body {"published": bool}. Proxies to PATCH /catalog/admin/apps/{slug}/publish on the SME catalog. Surfaces upstream status verbatim. Invalidates the cache so the next /apps poll reflects the change immediately.
Frontend (AppsPage):
- liveAppsQuery returns { statusById, publishedBySlug } instead of the bare status map.
- Each AppCard with a non-null marketplacePublished renders a PUBLISHED / UNPUBLISHED chip alongside the status chip. Click → PATCH → optimistic refetch via React Query.
- Bootstrap components and apps not in the SME catalog have nil → no chip (correct: nothing to toggle).
- Cards with marketplace.enabled=false render no chips at all (SME catalog unreachable → nil for every slug).
Bump chart 1.4.66 → 1.4.67.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(chart,ci): auto-bump literal catalyst-{api,ui} SHAs so all Sovereigns + contabo get fresh code
Audit triggered by founder asking if PRs #1051..#1059 reach NEW Sovereigns or just my manual `kubectl set image` patches on omantel. Answer was: nothing reached anyone except omantel via manual patches. Both contabo AND every fresh Sovereign would install :2122fb8 — the SHA frozen at PR #1040's last manual chart-touch on May 6 morning.
Root cause: - chart/templates/api-deployment.yaml + ui-deployment.yaml carry LITERAL image refs ("ghcr.io/openova-io/openova/catalyst-api:2122fb8"), not Helm-templated `{{ .Values.images.catalystApi.tag }}`. - catalyst-build CI's deploy step bumped values.yaml's catalystApi.tag on every push — but no template reads from it. Dead code. - contabo's catalyst-platform Flux Kustomization at ./products/catalyst/chart/templates applies these as raw manifests. - Sovereigns Helm-install the same chart; Helm passes the literal through unchanged. - Both ended up frozen at whatever literal was committed at the last manual chart-touching PR. Fix: 1. CI's deploy step now bumps both the literal SHAs in the two template files AND the unused-but-kept-for-SME-services values.yaml. Sed-patches the literal directly so contabo's Kustomize path keeps working. 2. The commit step adds the two templates to the staged set alongside values.yaml, so every "deploy: update catalyst images to <sha>" commit propagates to contabo (10-min reconcile) AND Sovereigns (next OCI chart publish via blueprint-release). 3. Bump bp-catalyst-platform 1.4.68 → 1.4.69 so the new chart with the latest literal (currently :8361df4) gets republished and pinned in clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml. Why drop the "freeze contabo" intent of the previous comment: The previous comment said contabo auto-roll on every PR was bad because PR #975's image broke contabo (k8scache startup loop). Solution there is: fix the bug in the code, not freeze contabo. Freezing masked real divergence — the reason the founder caught this is that manual omantel patches were the only thing keeping omantel current while contabo + every other fresh Sovereign quietly ran 9 PRs behind. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
953ef8290f
|
fix(catalyst-build): stop auto-bumping contabo Kustomize-path image refs (#980)
* fix(catalyst-ui): drop stale params={{ deploymentId }} from clean-root Links (#975)
#976 collapsed `to="/provision/$deploymentId/<page>"` to clean root
paths (`to="/<page>"`) but left the `params={{ deploymentId }}` prop
on every callsite, breaking the Vite tsc build with TS2353. Fixes:
- Drop `params={{ deploymentId }}` from Links whose target is now a
parameterless clean root path (StatusStrip, AppDetail, AppsPage,
DecommissionPage, FlowPage, JobDetail, JobsPage, JobsTimeline,
SettingsPage, DeploymentsList).
- For Links whose `to` still uses `$componentId`/`$jobId`, cast
`params` with `as never` to match the existing pattern in
cloud-compute/cloud-network/cloud-storage/Sidebar/UserAccess
(the dual-mount under provisionRoute + consoleLayoutRoute defeats
TS's strict params inference; the runtime path is correct).
- Drop `deploymentId` prop + interface field from JobCard / JobRow /
JobsTable / AppCard now that the Links don't need it; update test
fixtures + the JobsTable row-link assertion to match the new
clean `/jobs/$jobId` href.
- Drop the unused ArchEdgeType import in k8sAdapter (TS6196).
- Dashboard navigateToApp uses `as never` casts to align with the
same pattern.
* fix(catalyst-build): stop auto-bumping contabo Kustomize-path image refs
Two paths consume the catalyst-api / catalyst-ui images:
1. bp-catalyst-platform OCI chart (Sovereigns) — values.yaml driven, tag
in values.yaml is rendered at helm install time by Sovereign Flux.
2. contabo Kustomize-path — literal image refs in templates/api-deployment.yaml
and templates/ui-deployment.yaml. Flux kustomize-controller on contabo
reconciles those files directly.
The CI deploy step was bumping BOTH on every PR, which auto-rolled
contabo every time anyone merged a catalyst-api code change. On
2026-05-05 PR #975's k8scache feature broke contabo startup on the
auto-roll because contabo has 27 dead-Sovereign kubeconfigs that the
new code iterates synchronously at startup, blocking readiness.
Fix: keep the values.yaml bump (Sovereigns auto-pick-up via OCI chart
which is the right behaviour for fresh provisions). Drop the
templates/*-deployment.yaml bump so contabo only rolls when an
operator manually commits a validated SHA into those files.
Closes the auto-deploy-to-contabo blast radius on every PR.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: hatiyildiz <hatiyildiz@users.noreply.github.com>
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
2ff50f0591
|
fix(bp-newapi+services-build): imagePullSecrets on Pod, sed bumps values.yaml smeTag (#955)
Two SME-blocker bugs caught live on otech113 (alice signup gate 5 fails on fresh Sovereign): #952 — bp-newapi 1.4.0 Pod has no imagePullSecrets, so kubelet pulls PRIVATE ghcr.io/openova-io/openova/{newapi-mirror,services-metering-sidecar} anonymously and gets 403 Forbidden. Fix: - Templatize spec.imagePullSecrets on Deployment + channel-seed Job. - Default values.yaml `imagePullSecrets: [{name: ghcr-pull}]`. - Add `newapi` to flux-system/ghcr-pull's reflector reflection-{allowed,auto}-namespaces in cloudinit-control-plane.tftpl so bp-reflector mirrors the source Secret into the namespace automatically on every fresh Sovereign. - Bump bp-newapi 1.4.0 -> 1.4.1, update _template overlay. #953 — services-build.yaml's image-rewrite loop only matched the hardcoded `image: ghcr.io/.../services-<svc>:<sha>` form. 7 of 8 sme-services templates use `image: "{{ ... }}/services-<svc>:{{ .Values.images.smeTag }}"`. Each services-build run bumped only auth.yaml while reporting "update sme service images to ${SHA}", leaving the live Pod on stale bytes (PR #951's #941 fix never reached services-catalog despite the merge + chart bump chain). Fix: - After the hardcoded loop, also bump `images.smeTag` in products/catalyst/chart/values.yaml with a strict regex match (`^ smeTag: "<sha>"$`); refuse to auto-bump if the line shape changes (defends against silent drift if a contributor renames the field). - Mirror the change into the retry-path `rewrite()` function so a reset-to-origin/main retry does not recreate the original bug. Tests: - platform/newapi/chart/tests/imagepullsecrets-render.sh — 4 cases asserting the Deployment and channel-seed Job carry the default ghcr-pull reference, that an empty override suppresses the block, and that custom secret names propagate (Inviolable Principle #4). 
- tests/integration/services-build-rewrite.sh — 3 cases reproducing the workflow's rewrite logic on a sandboxed copy of the live chart, asserting both auth.yaml's hardcoded line AND values.yaml's smeTag get bumped, that helm-render of the catalyst chart with the bumped values produces all 8 SME-service Deployments at the new SHA, and that an idempotent re-bump to a second SHA also lands cleanly. Refs: #952 #953 (umbrella #915 — alice signup gate 5). Co-authored-by: hatiyildiz <143030955+hatiyildiz@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
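The strict `smeTag` bump described for #953 can be approximated with the sketch below. It is Python rather than the workflow's shell, the function name `bump_sme_tag` is hypothetical, and the two-space indentation and lowercase-hex SHA shape are assumptions about how the line looks in values.yaml.

```python
import re

# Assumed line shape: exactly two spaces, the key, and a quoted hex SHA.
SME_TAG_LINE = re.compile(r'^  smeTag: "[0-9a-f]{7,40}"$', re.MULTILINE)


def bump_sme_tag(values_yaml: str, new_sha: str) -> str:
    """Rewrite the smeTag line, refusing if the expected shape is not found exactly once."""
    matches = SME_TAG_LINE.findall(values_yaml)
    if len(matches) != 1:
        # Refuse instead of silently doing nothing: a renamed or reformatted
        # field would otherwise leave the chart on a stale tag.
        raise ValueError(f"expected exactly one smeTag line, found {len(matches)}")
    return SME_TAG_LINE.sub(f'  smeTag: "{new_sha}"', values_yaml)
```

Failing loudly on a shape change is the point of the strict match: the earlier bug went unnoticed precisely because the rewrite loop matched nothing and still reported success.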
||
|
|
db332f6767
|
fix(ci): services-build auto-bumps chart patch + dispatches blueprint-release (#874)
* fix(bp-catalyst-platform): bump 1.4.8 -> 1.4.9 to republish with current services-auth image (#871) Chart 1.4.8 was published from commit |
||
|
|
1d93b6c5af
|
feat(e2e): SME demo Playwright spec — full 6-step happy path (#805) (#823)
Authors the load-bearing investor-demo proof artefact for the SME-tenant turnkey experience epic (#795). The spec walks the FULL happy path against the catalyst-ui SPA and emits 1440×900 screenshots at every assertion so the DoD checklist is satisfied with visual evidence rather than narrative. What landed: - products/catalyst/bootstrap/ui/e2e/sme-demo.spec.ts — single linear spec covering Step 1 (marketplace signup) → Step 2 (provisioning) → Step 3 (SME admin first login + dashboard) → Step 4 (create alice via unified-rbac with 3-step ADR-0003 hook progress) → Step 5a (alice on WordPress) → Steps 5b/5c/5d/6 fixme'd with TODO links to unblocking issues. - products/catalyst/bootstrap/ui/e2e/lib/config.ts — central registry of every URL, hostname, fixture user, and UUID the spec uses. Per feedback_never_hardcode_urls.md, no test inlines a hostname; every asserted host derives from OTECH_FQDN + SME_SLUG. - products/catalyst/bootstrap/ui/e2e/lib/sme-fixtures.ts — wire-shape- faithful page.route mocks for tenant discovery, /api/v1/whoami, /api/v1/sme/tenants, /api/v1/sme/users (CRUD), the deployment endpoints, app placeholders for WordPress/OpenClaw/webmail, and the /api/v1/sme/billing/ledger surface. Each helper is the seam between mock-mode (today) and live-mode (post-#804) so the spec opts out of any single mock by simply not calling that helper. - .github/workflows/sme-demo-e2e.yaml — push + PR + dispatch trigger that runs the spec against a freshly-installed dev tree with VITE_CATALYST_MODE=sovereign + VITE_SOVEREIGN_FQDN set so the SovereignConsoleLayout's auth gate has a non-null sovereignFQDN. Uploads the 805-* screenshot evidence as a 30-day artefact. 
Run today on a fresh checkout: cd products/catalyst/bootstrap/ui VITE_CATALYST_MODE=sovereign \ VITE_SOVEREIGN_FQDN=acme.otech.example \ npm run dev & PLAYWRIGHT_HOST=http://localhost:5173 \ npx playwright test e2e/sme-demo.spec.ts Result: 6 passed, 4 fixme (5b/5c/5d/6, all with TODO links to #804 / #798 / #802-followup). Live-mode follow-up (after #804 lands a fresh otech with the SME tenant pipeline wired): drop the mock installers from beforeEach and flip OTECH_FQDN/SME_SLUG via env. The spec stays — only the helper calls change. Per docs/INVIOLABLE-PRINCIPLES.md: #1 (waterfall): the canonical 6-step contract from #805 is asserted in this first cut, not staged across cycles. #2 (never compromise): every step that's deferred is fixme'd with a blocker link, never silently skipped. #4 (never hardcode): every URL routes through e2e/lib/config.ts. Refs: openova-io/openova#795, openova-io/openova#804, ADR-0003 Co-authored-by: hatiyildiz <hatice.yildiz@openova.io> |
||
|
|
9645a9044a
|
feat(metering): NewAPI NATS publisher + sme-billing subscriber + POST /metering/record (#798) (#818)
* feat(metering): NewAPI NATS publisher + sme-billing subscriber + POST /metering/record (#798) Per #795 [Q-mine-3] (NATS not RedPanda) + [Q-mine-4] (one ledger), add the SME-2 metering integration end-to-end. NewAPI is consumed as the upstream image `ghcr.io/openova-io/openova/newapi-mirror` (a pinned mirror, not a fork) — the metering envelope is produced by a Go sidecar that observes the OpenAI-style `usage.total_tokens` field on every 2xx /v1/* response. This avoids forking the upstream binary while still producing the canonical envelope shape on `catalyst.usage.recorded`. A) NewAPI metering sidecar — core/services/metering-sidecar/ - Transparent reverse proxy in front of NewAPI on its own port; the bp-newapi Service routes the cluster-fronting port to the sidecar, which forwards to NewAPI on the pod's loopback. - Observes successful /v1/* JSON responses, parses `usage.{prompt_tokens,completion_tokens,total_tokens}`, computes amount_micro_omr = -tokens * priceMicroOMRPerToken, and publishes one envelope on `catalyst.usage.recorded` per completed request. - Failed (non-2xx), non-JSON, and admin-path requests are NOT billed. - Customer-facing latency is NEVER blocked on metering: the response body is restored before publish; on NATS unreachable the envelope is persisted to disk and retried by a background drain loop. - 14 unit tests (proxy + publisher + safeFilename guards). B) sme-billing NATS subscriber — core/services/billing/handlers/ metering_consumer.go - JetStream durable consumer `sme-billing-metering` on stream `CATALYST_USAGE` (provisioned by sme-billing on startup). - Idempotent on metadata.request_id via a UNIQUE partial index on credit_ledger.external_ref; redelivery from the broker collapses to a single ledger row. - Customer auto-create on cold start (the rbac sme.user.created envelope may land AFTER the first metered request; we don't strand usage waiting for it). 
- 11 unit tests covering happy-path, idempotency, malformed-payload poison-pill, missing-request-id, non-negative amount guard, resolver error → Nak, derive-micro-OMR-from-OMR, DB-error → Nak. C) HTTP handler POST /billing/metering/record — handlers/metering.go - Synchronous validate → INSERT credit_ledger → return {ledger_entry_id, balance_after_omr, balance_after_micro_omr, duplicate}. Same payload + idempotency guard as the NATS path. - Auth: superadmin OR sovereign-admin (operator-admin model; end-user LLM traffic flows through the sidecar, never this URL). - 8 unit tests covering happy-path, idempotency, role gating, malformed-JSON, positive-amount rejection, customer-not-found. D) Schema — core/services/billing/store/store.go - ALTER TABLE credit_ledger ADD COLUMN amount_micro_omr BIGINT (1 OMR = 1,000,000 micro-OMR; -0.000234 OMR = -234 micro-OMR exact integer — preserves precision at metering rates). - ADD COLUMN external_ref TEXT + UNIQUE partial index for idempotency dedup. - ADD COLUMN metadata JSONB for the raw envelope. - GetCreditBalance projects both amount_omr (legacy) and amount_micro_omr (new) into the integer-OMR view. - GetCreditBalanceMicroOMR returns canonical precision. - RecordUsage method: ON CONFLICT DO UPDATE … RETURNING (xmax<>0) distinguishes fresh insert from duplicate without a follow-up SELECT. E) Wiring - core/services/shared/events/nats.go — minimal NATS JetStream publisher + subscriber surface; legacy RedPanda producer/consumer in events.go untouched per [Q-mine-3]. - core/services/billing/main.go — NATS_URL env; subscriber wired in parallel with the existing RedPanda tenant-events consumer. - middleware/jwt.go — exported test helper WithClaims so handler tests can construct an authenticated context without minting a real signed token. - .github/workflows/services-build.yaml — metering-sidecar added to the build matrix; deploy job skips it (image consumed by the bp-newapi chart, not products/catalyst sme-services). 
F) bp-newapi chart (1.0.0 → 1.1.0) - meteringSidecar block in values.yaml: image, port, NATS URL, priceMicroOMRPerToken (default 156 = 0.000156 OMR/token), spool dir, header names, resources, securityContext (read-only-rootfs). - deployment.yaml renders the sidecar container + emptyDir spool volume when meteringSidecar.enabled (default true). - service.yaml routes the cluster-fronting :3000 to the sidecar when enabled, exposes a separate :3001 → NewAPI direct port for bp-catalyst-platform admin-API traffic (ADR-0003 §3.2). - networkpolicy.yaml allows the sidecar's port + nats-system egress for JetStream publish. Tests: 33 new (14 sidecar + 11 subscriber + 8 HTTP handler), all green. Helm template renders cleanly with sidecar enabled and disabled. Closes #798 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(billing/store): cast SUM to BIGINT so lib/pq scans into int64 (#798) Postgres returns `SUM(int) + SUM(bigint)/integer` as `numeric`, which lib/pq presents as a `[]uint8` decimal string ("50.000000000000000000000000") that does NOT scan directly into Go int64 — the integration test TestVoucherLifecycle_IssueRedeemAndCreditApplied caught this in CI on the post-redeem balance read. Wrap the SUM expressions in CAST(... AS BIGINT) so the column type is unambiguously bigint and Scan target stays uniform across pre-#798 rows (amount_omr only) and post-#798 rows (amount_micro_omr present). Affects: - GetCreditBalance - GetCreditBalanceMicroOMR - RecordUsage's running-balance read Test mocks updated to match the new SQL prefix. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: hatiyildiz <hatice.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
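The metering arithmetic in this commit (negative amounts, integer micro-OMR, default price 156 per token) can be checked with a few lines. This is a sketch in Python rather than the Go of the sidecar; the function names are hypothetical, while the formula, the default price, and the 1 OMR = 1,000,000 micro-OMR ratio are taken from the commit message.

```python
MICRO_PER_OMR = 1_000_000  # 1 OMR = 1,000,000 micro-OMR


def amount_micro_omr(total_tokens: int, price_micro_omr_per_token: int = 156) -> int:
    """Usage is a debit, so the ledger amount is negative."""
    return -total_tokens * price_micro_omr_per_token


def format_omr(micro: int) -> str:
    """Render an integer micro-OMR amount as a decimal OMR string without floats."""
    sign = "-" if micro < 0 else ""
    whole, frac = divmod(abs(micro), MICRO_PER_OMR)
    return f"{sign}{whole}.{frac:06d}"
```

Keeping the amount as an integer until display time avoids the rounding drift that a floating-point OMR column would accumulate at per-token prices.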
||
|
|
93bd3ace5b
|
feat(bp-openclaw): workspace controller + per-user pod chart (#803) (#810)
Implements locked decision [A] of epic #795: per-SME-tenant workspace controller deployment + per-user runtime pod, identity-blind by construction. Consumes the per-user newapi-key-{uuid} Secrets rendered by the unified-rbac user-create hook (ADR-0003 §3.3). What this delivers: - platform/openclaw/chart/ bp-openclaw v0.1.0 (no-upstream) - platform/openclaw/runtime/ Go reference runtime (NEWAPI_BASE_URL + NEWAPI_KEY env contract only) - .github/workflows/openclaw-runtime.yaml Event-driven build for the runtime image (paths-on-push + manual rerun; NO schedule:cron per CLAUDE.md). - platform/openclaw/blueprint.yaml Catalyst registration + configSchema. Chart highlights: - Required values guarded by _helpers.tpl :: assertRequired so missing realmURL/clientSecretName/tenant.namespace/baseURL/host fail render with helpful messages. - RBAC: namespaced Role in tenant ns; create verbs split into separate rules WITHOUT resourceNames per feedback_rbac_create_no_resourcenames.md. Label-based ownership (catalyst.openova.io/openclaw-user) enforced at the controller, not in RBAC. - ingress: cert-manager.io/cluster-issuer annotation triggers ACME auto-issuance for openclaw.<sme-domain>. - per-user pod template ConfigMap holds the pod-spec the controller renders per session, with ${USER_UUID}/${SECRET_NAME} placeholders filled at session-start. - networkPolicy covers controller pod only; per-user pod NetworkPolicy is rendered by the controller at session-start (target hostname is read from the per-user Secret which doesn't exist at chart-render time — documented in README.md). Tests: chart/tests/render-toggles.sh (7 cases) covers required-value enforcement, RBAC create+resourceNames violation guard, ServiceMonitor default-off, networkPolicy toggle, pod-template placeholder presence, cert-manager annotation. All seven gates pass locally. Closes part of #795 (epic still open). 
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
9adca8442a
|
fix: ci actions:write + auth-layout overflow scroll (#712 followup, #721 followup) (#728)
Two unrelated production-bug fixes squashed because they came out of the same live verification pass on console.openova.io 2026-05-04.
1. catalyst-build.yaml deploy job permissions
PR #720 added a `gh workflow run blueprint-release.yaml` dispatch step at the end of the deploy job to close the bot-deploy-doesn't-trigger-workflows gap from #712. The step has been failing on every run since with HTTP 403 "Resource not accessible by integration" because GITHUB_TOKEN lacks `actions: write` by default. Result: blueprint-release was never dispatched after PR #722–727 merged; the bp-catalyst-platform OCI artifact stayed on the pre-fix chart and any Sovereign provisioned afterwards picked up the buggy chart. Add the missing permission so dispatch succeeds.
2. AuthLayout.tsx vertical centering at small viewport heights
The sign-in / verify cards were mathematically centered at 1440×900 (Δ=0.008px verified via getBoundingClientRect in Playwright) but the founder reports the card sitting at the top of the screen on real-world viewports. Root cause: the right panel had `flex flex-1 items-center justify-center`, which centers ONLY if the inner content fits within the viewport; at smaller heights the form's natural content flow pushed the card off-screen with no scroll fallback.
Fix: add `items-stretch` to the outer flex (so the right panel fills full viewport height), `overflow-y-auto` on the right column (so the card can scroll inside its column when too tall), and `py-8` padding on the card wrapper (breathing room when scrolling kicks in). Result: card is vertically centered when content fits, and stays visible (column-scrollable) when it doesn't, on every viewport height from 1024×600 up.
Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
|
||
|
|
35183af5be
|
fix(ci): catalyst-build dispatches blueprint-release after deploy commit (closes #712) (#720)
* feat(bp-catalyst-platform): expose marketplace + tenant wildcard, bump 1.3.0 (closes #710) Marketplace exposure for franchised Sovereigns. Otech becomes a SaaS operator with a single overlay toggle. Changes ======= products/catalyst/chart: - Chart.yaml 1.2.7 → 1.3.0 - values.yaml: ingress.marketplace.enabled toggle (default false) + marketplace.{brand,currency,paymentProvider,signupPolicy} surface - templates/sme-services/marketplace-routes.yaml: HTTPRoute marketplace.<sov> with /api/ → marketplace-api, /back-office/ → admin, / → marketplace; HTTPRoute *.<sov> → console (per-tenant wildcard) - templates/sme-services/marketplace-reference-grant.yaml: cross- namespace ReferenceGrant from catalyst-system HTTPRoute → sme Services - .helmignore: stop excluding sme-services/* and marketplace-api/* (only *.kustomization.yaml + *.ingress.yaml remain Kustomize-only) - All sme-services/* + marketplace-api/* manifests wrapped with {{ if .Values.ingress.marketplace.enabled }} so non-marketplace Sovereigns render the chart unchanged clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml: - chart version 1.2.7 → 1.3.0 - ingress.hosts.marketplace.host: marketplace.${SOVEREIGN_FQDN} - ingress.marketplace.enabled: ${MARKETPLACE_ENABLED:-false} infra/hetzner: - variables.tf: marketplace_enabled var (string "true"/"false", default "false") - main.tf: thread var into cloudinit-control-plane.tftpl - cloudinit-control-plane.tftpl: postBuild.substitute.MARKETPLACE_ENABLED on bootstrap-kit, sovereign-tls, infrastructure-config Kustomizations products/catalyst/bootstrap/api/internal/provisioner/provisioner.go: - Request.MarketplaceEnabled bool (json:"marketplaceEnabled") - writeTfvars: marketplace_enabled = "true"|"false" core/pool-domain-manager/internal/allocator/allocator.go: - canonicalRecordSet adds "marketplace" prefix → marketplace.<sov> resolves via PDM at zone-commit time (PR #710 explicit record so caches don't depend on the *.<sov> wildcard alone) DoD ready ========= - 
helm template with ingress.marketplace.enabled=false → identical manifest set to 1.2.7 (verified locally) - helm template with ingress.marketplace.enabled=true → emits 17 extra resources: 13 sme-services workloads + 2 marketplace-api + 1 HTTPRoute pair + 1 ReferenceGrant - pdm tests: TestCanonicalRecordSet, TestCommitDNSShape green - catalyst-api builds, provisioner cloudinit_path_test green * fix(ci): catalyst-build dispatches blueprint-release after deploy commit (closes #712) The deploy job's `git push` is made under GITHUB_TOKEN; per GitHub Actions design, commits authored by GITHUB_TOKEN don't re-trigger workflows. blueprint-release.yaml's `on.push.paths: products/*/chart/**` filter matches the deploy commit's diff (chart/values.yaml + chart/templates/{api,ui}-deployment.yaml), so the workflow SHOULD fire, but doesn't — leaving the bp-catalyst-platform:1.2.7 OCI artifact stuck on whatever catalyst-api SHA was current at the last manual chart- touching PR. Today (2026-05-03) this stranded otech62-otech66 on catalyst-api:74d08eb six PRs after the SHA was superseded — every fresh Sovereign installed the buggy pre-#701 image and rejected handover with 401 unauthenticated. Fix: after `git push` succeeds in the deploy job, dispatch blueprint-release explicitly via `gh workflow run`. The dispatched run re-renders + re-publishes the chart with the just-pushed values.yaml. Closes #712. --------- Co-authored-by: hatiyildiz <hatiyildiz@openova.io> |
||
|
|
b5c9839da7
|
feat(phase-8b): sovereign wizard auth-gate + handover JWT minting + Playwright CI fixes (#611)
Squash of PR #611 (feat/607) + PR #615 (feat/605)
Phase-8b deliverables:
UI:
- AuthCallbackPage: mode-aware dispatch (catalyst-zero → magic-link server callback; sovereign → client-side OIDC token exchange via oidc.ts)
- Router: sovereign console routes (/console/*), DETECTED_MODE index redirect, authCallbackRoute dedup fix, authHandoverRoute safety net
- StepSuccess: mints RS256 handover JWT via POST /deployments/{id}/mint-handover-token before redirecting operator to Sovereign console (falls back to plain URL on error)
API:
- main.go: wires handoverjwt.LoadOrGenerate signer from CATALYST_HANDOVER_KEY_PATH env
- deployments.go: stamps HandoverJWTPublicKey from signer.PublicJWK() at create time
- provisioner.go: injects HandoverJWTPublicKey into Tofu vars JSON
- auth.go: /auth/handover endpoint for seamless single-identity flow
Infra:
- cloudinit-control-plane.tftpl: writes handover JWT public JWK to /var/lib/catalyst/
- variables.tf: handover_jwt_public_key variable (sensitive, default empty)
Chart:
- api-deployment.yaml / ui-deployment.yaml / values.yaml: expose handover JWT env vars
Playwright CI fixes:
- playwright-smoke.yaml / cosmetic-guards.yaml: health-check URL /sovereign/wizard → /wizard
- playwright.config.ts: BASEPATH default /sovereign → / + baseURL construction fix
- cosmetic-guards.spec.ts: provision URL /sovereign/provision/* → /provision/*
- sovereign-wizard.spec.ts: WIZARD_URL /sovereign/wizard → /wizard
Closes #605, #606, #607. Fixes Playwright CI (#142 sovereign wizard smoke tests).
Co-authored-by: e3mrah <e3mrah@openova.io>
|
||
|
|
10c8e997c4
|
fix(catalyst): restore literal image refs in Kustomize-path deployment YAMLs (#614)
The feat/global-imageRegistry (#580) PR converted the literal image refs
in api-deployment.yaml and ui-deployment.yaml to Helm template expressions
({{ .Values.global.imageRegistry }}...) without updating the CI deploy step
to also patch those files. Since the catalyst-platform Flux Kustomization
reads these files as raw manifests (not via helm-controller), the Helm
template syntax was never rendered, leaving a literal '{{ if ... }}'
string as the image reference → InvalidImageName on every Pod start.
Root cause: two consumers of the same file — Helm chart path (Sovereign
clusters) and Kustomize path (contabo-mkt) — but only the Helm path was
handled by the deploy job.
Fix:
- Restore literal `ghcr.io/openova-io/openova/catalyst-{api,ui}:b50a600`
image refs in the Kustomize-path deployment YAMLs (immediate unblock).
- Update CI deploy step to sed-patch those literal refs on every deploy
commit so future image rolls keep both paths in sync (durable fix).
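The durable half of this fix, patching the literal image refs on each deploy, could look roughly like the sketch below. It is Python rather than the sed used in CI, the function name `bump_literal_image` is hypothetical, and the lowercase-hex tag shape is an assumption; the image path is the one quoted in the message.

```python
import re


def bump_literal_image(manifest: str, image: str, new_sha: str) -> tuple[str, int]:
    """Replace the tag on a literal ghcr.io image ref; return (new_text, replacements)."""
    pattern = re.compile(
        r'(image:\s*"?ghcr\.io/openova-io/openova/' + re.escape(image) + r":)[0-9a-f]+"
    )
    # A callable replacement avoids ambiguity between the group reference
    # and a SHA that starts with a digit.
    return pattern.subn(lambda m: m.group(1) + new_sha, manifest)
```

Checking the returned count matters: a count of zero means the file no longer carries a literal ref (for example because it was converted to a Helm expression), which is the silent no-op that other entries in this log describe.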
Closes: the InvalidImageName regression introduced in #580.
Unblocks: issue #608 (Phase-8b Agent A magic-link auth) — catalyst-api
was stuck at InvalidImageName since commit
|
||
|
|
59fb2b742c | fix(ci): use awk instead of python heredoc in deploy — fixes YAML parse error | ||
|
|
885e032dc5 |
fix(ci): deploy job updates values.yaml SHA tags, not Helm template files
The previous sed targeted ui-deployment.yaml + api-deployment.yaml for
`image: ghcr.io/.../catalyst-ui:.*` but those files use Helm template
expressions (`{{ .Values.images.catalystUi.tag }}`), so sed silently
no-ops. Result: every catalyst build committed "No changes" and the
deployed image was never updated.
Fix: switch deploy job to update images.catalystUi.tag and
images.catalystApi.tag in products/catalyst/chart/values.yaml via
python3 regex (handles multiline YAML reliably).
Also bump catalystUi + catalystApi tags to
|
||
|
|
942be6f58d
|
fix(ci): disable buildx provenance+sbom attestation in dynadot-webhook build (#583)
containerd 1.7.x on k3s cannot pull multi-arch images whose OCI index includes an attestation manifest (the unknown/unknown platform entry added by docker/build-push-action when provenance=true). Containerd resolves the manifest index, encounters the attestation entry, fetches its descriptor from GHCR, which returns an HTML 404 page, and then caches that HTML page as a blob SHA; every subsequent pull of ANY tag for that image returns the same HTML SHA instead of the real layer.
Fix: set provenance=false + sbom=false on the build-push-action step. SBOM attestation is handled separately by cosign attest, which does not embed its manifest into the OCI index.
Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
|
||
|
|
52c6938e02
|
ci(catalyst-build): watch infra/hetzner/** so cloudinit changes rebuild catalyst-api (#472)
Phase-8a-preflight bug #2 (after #471's tftpl escape fix): the catalyst-api Docker image bakes in /infra/hetzner/cloudinit-control-plane.tftpl. Without this path in the build trigger, fixes to that file do NOT rebuild the image; the running pod keeps using the stale tftpl and provisioning keeps failing with the same Tofu error.
Per CLAUDE.md Rule 4a (GitHub Actions is the only build path), the path filter MUST cover every directory the image depends on. Missing infra/hetzner/** was a long-standing latent CI bug, surfaced by the first live provision attempt of Phase-8a #454.
Co-authored-by: hatiyildiz <hatiyildiz@noreply.github.com>
|
||
|
|
1628a1b3aa
|
ci(preflight): GHCR auth for A+E + WBS tick — all 4 preflights done (#470)
First runs of preflight A (bootstrap-kit) and E (Keycloak) failed with the same error: helm OCI pull from ghcr.io/openova-io/bp-* returning 401 'unauthorized: authentication required'. bp-* are PRIVATE GHCR packages. #460's agent fixed it for B in c26fbcaf. The #461 workflow (preflight C) already had GHCR login. This commit applies the same helm-registry-login pattern to A and E.
WBS state on main after this commit:
- done (35): all chart-level + #317 + #319 + #453 + 4 preflights
- wip (0)
- blocked (3): 454, 455, 456 (Phase-8 live runs, operator-driven)
The preflights' first runs ALREADY surfaced a real CI bug pattern that would have hit Phase 8a, which is exactly what they're for.
Co-authored-by: hatiyildiz <hatiyildiz@noreply.github.com>
|
||
|
|
4a7eb42d26
|
feat(ci): Phase-8a preflight E — Keycloak realm-import + kubectl OIDC client (closes #462) (#468)
Surfaces Risk R6 (docs/omantel-handover-wbs.md §9a — Keycloak realm-import config-CLI bootstrap timing untested). bp-keycloak 1.2.0 ships a sovereign realm + a public kubectl OIDC client via the upstream bitnami/keycloak chart's keycloakConfigCli post-install Helm hook (issue #326); this workflow proves it actually wires up on a clean cluster before we run it on a real Sovereign. Workflow installs bp-keycloak 1.2.0 on a kind cluster (helm/kind-action v1, kindest/node:v1.30.6 — same versions as test-bootstrap-kit), waits for the keycloak StatefulSet to roll out, polls for the keycloakConfigCli post-install Job by label (app.kubernetes.io/component=keycloak-config-cli), waits for it to Complete, port-forwards svc/keycloak and asserts: 1. /realms/sovereign returns 200 (realm exists in Keycloak's DB). 2. The kubectl OIDC client is provisioned with publicClient=true, redirectUris contains http://localhost:8000 (kubectl-oidc-login default), and the groups client scope is wired with the oidc-group-membership-mapper (the per-Sovereign k3s api-server's --oidc-groups-claim flag depends on this). Acceptance per ticket: if the post-install Job fails, the workflow summary captures Job logs + StatefulSet logs + cluster state via GITHUB_STEP_SUMMARY so a failed run is debuggable without re-running. Triggers are event-driven only per CLAUDE.md "every workflow MUST be event-driven, NEVER scheduled" rule — push on the workflow file itself plus workflow_dispatch for ad-hoc re-runs. Closes #462. Co-authored-by: hatiyildiz <hatiyildiz@noreply.github.com> |
||
|
|
abac00d8b3
|
feat(ci): Phase-8a preflight A — bootstrap-kit reconcile dry-run on kind (closes #459) (#467)
Surfaces Risk-register R4 (docs/omantel-handover-wbs.md §9a: bootstrap-kit reconcile-chain order untested under load) before Phase 8a (#454) burns Hetzner credit on test.omani.works.
New workflow .github/workflows/preflight-bootstrap-kit.yaml:
- kind v0.25.0 + kindest/node:v1.30.6
- Gateway API CRDs v1.2.0 standard channel
- Full Flux controller set (fluxcd/flux2/action@main + flux install)
- Mock Secrets: flux-system/object-storage, flux-system/cloud-credentials, flux-system/ghcr-pull
- Renders clusters/_template/bootstrap-kit/ with SOVEREIGN_FQDN_PLACEHOLDER + ${SOVEREIGN_FQDN} -> test-sov.example.com (matches test harness pattern in tests/e2e/bootstrap-kit/main_test.go:247)
- 30 x 30s HR poll loop, never-fail-fast (goal: surface ALL bugs, not stop at first)
- $GITHUB_STEP_SUMMARY emits Markdown table of every HR's terminal Ready condition + per-HR describe blocks for non-Ready + recent flux-system events + raw hrs.json artefact (14d retention)
- Event-driven only: push on self-edit + workflow_dispatch; no schedule: cron (per CLAUDE.md "every workflow MUST be event-driven")
Canonical seam reused (no duplication):
- kind setup + flux install pattern from .github/workflows/test-bootstrap-kit.yaml
- bootstrap-kit kustomization at clusters/_template/bootstrap-kit/ (the same overlay production Sovereigns consume; substitution shape mirrors tests/e2e/bootstrap-kit/main_test.go:247)
- event-driven shape per .github/workflows/check-vendor-coupling.yaml (#428)
Out of scope (sibling preflights):
- #460 Crossplane provider-hcloud Healthy probe
- #461 Cilium Gateway HTTPRoute admission
- #462 Keycloak realm-import
Validated: actionlint clean, YAML parses cleanly.
WBS row #459 in §9 updated: 🟡 in flight -> 🟢 done (workflow shipped).
Co-authored-by: hatiyildiz <hatiyildiz@noreply.github.com>
|
||
|
|
6f9ee43a9d
|
fix(ci): GHCR auth for bp-crossplane OCI pull in preflight (#460) (#466)
Run 25221515110 surfaced the exact blocking error the workflow was designed to surface — but for the install step, not the Healthy probe: Error: INSTALLATION FAILED: failed to perform "FetchReference" on source: GET "https://ghcr.io/v2/openova-io/bp-crossplane/manifests/1.1.3": ... 401: unauthorized: authentication required bp-crossplane is a PRIVATE GHCR package (verified via `gh api /orgs/openova-io/packages/container/bp-crossplane`). The fix mirrors the canonical seam in .github/workflows/blueprint-release.yaml: add `packages: read` to the job permissions and run `helm registry login ghcr.io` against GITHUB_TOKEN before the `helm install oci://...` step. No new pattern; just reuse. This unblocks the actual goal of #460 — observing provider-hcloud Healthy=True (or surfacing whatever blocks it) on a kind cluster. Co-authored-by: hatiyildiz <hatiyildiz@noreply.github.com> |
||
|
|
48b73af6ae
|
feat(ci): Phase-8a preflight C — Cilium Gateway HTTPRoute admission on kind (closes #461) (#465)
Surfaces Risk-register R3 (docs/omantel-handover-wbs.md §9a) — Cilium Gateway HTTPRoute admission was untested on contabo because contabo runs Traefik (no `cilium-gateway` Gateway present per ADR-0001 §9.4). This workflow boots a kind cluster, installs upstream Cilium 1.16.5 with `gatewayAPI.enabled=true`, applies the per-Sovereign Gateway shape from `clusters/_template/bootstrap-kit/01-cilium.yaml` (HTTP listener only — TLS is Phase 8a), pulls bp-catalyst-platform:1.1.8 from GHCR, renders its httproute.yaml template with sovereign overlay values, and asserts that `catalyst-ui` and `catalyst-api` HTTPRoutes both reach Accepted=True against the Cilium Gateway. Anti-duplication: GHCR helm-registry-login mirrors blueprint-release .yaml (lines 173-177); kind+Cilium pattern matches playwright-smoke shape; per-Sovereign Gateway is a 1:1 mirror of the canonical bootstrap-kit slot 01 (HTTP listener), no new shape invented. Trigger pattern is event-driven per CLAUDE.md: push on this file or the chart templates it validates, plus workflow_dispatch for re-runs. No cron. Out of scope (Phase 8a/8b): TLS termination, real DNS resolution, backend Deployment health, the 10 leaf bp-* dependencies (which have their own chart-verify smoke runs). Co-authored-by: hatiyildiz <hatiyildiz@noreply.github.com> |
||
|
|
48a1623b28
|
feat(ci): Phase-8a preflight B — Crossplane provider-hcloud Healthy on kind (closes #460) (#463)
Surfaces Risk-register R2 (docs/omantel-handover-wbs.md §9a — provider-hcloud Healthy=True never observed). New workflow spins up kind, installs bp-crossplane 1.1.3 from GHCR, applies the EXACT Provider + ProviderConfig shape from infra/hetzner/cloudinit-control-plane.tftpl (#425), waits up to 5 min for Healthy=True, plants a fake hcloud-token Secret in flux-system to match the canonical secretRef, and asserts the ProviderConfig is accepted by the API. Reuses existing seams: - helm/kind-action@v1 pattern from .github/workflows/test-bootstrap-kit.yaml - event-driven trigger shape from .github/workflows/check-vendor-coupling.yaml - canonical Provider/ProviderConfig YAML from infra/hetzner/cloudinit-control-plane.tftpl No schedule: cron (per CLAUDE.md "every workflow MUST be event-driven"). No live Hetzner calls — fake-readonly-token only; real-credential validation is Phase 8a, not this preflight. Co-authored-by: hatiyildiz <hatiyildiz@noreply.github.com> |
||
|
|
1e7d1e67c9
|
test(e2e): omantel handover Playwright scaffold for Phase 8 (closes #429) (#432)
Phase 8 of the omantel handover (#369) needs an automated E2E that proves DoD: omantel.omani.works runs as a fully self-sufficient Sovereign with zero contabo dependency post-handover. Today this is a SCAFFOLD: when Phase 4/6/7 land, dispatching the new workflow against a live omantel is the entire Phase 8.
Canonical seam (anti-duplication, per memory/feedback_anti_duplication_seam_first.md):
- tests/e2e/playwright/tests/ ← mirror of sovereign-wizard.spec.ts shape (NOT specs/ as the issue body said; actual repo path is tests/)
- tests/e2e/playwright/playwright.config.ts (BASE_URL handling, retries, workers=1, reporter=list): reused as-is
- tests/e2e/playwright/tests/_helpers.ts:reachable(): reused for the pre-flight skip-when-unreachable pattern
- .github/workflows/playwright-smoke.yaml: workflow shape (checkout v4, setup-node v4, npm install, playwright install --with-deps chromium, upload-artifact on failure), mirrored, NOT duplicated
What ships:
- tests/e2e/playwright/tests/omantel-handover.spec.ts (NEW, 6 tests):
  1. sovereign Ready + 23/23 blueprints
  2. all bp-* HelmReleases Ready=True
  3. catalyst-platform self-hosts (healthz + dashboard "23 / 23 ready")
  4. vendor-agnostic Object Storage (post-#425 canonical secret name flux-system/object-storage, NOT hetzner-object-storage)
  5. dig +trace omantel.omani.works ends at omantel NS, not contabo
  6. zero contabo dependency (omantel /api/healthz keeps returning 200)
  Self-skips when OMANTEL_BASE_URL/OMANTEL_API_BASE/OPERATOR_BEARER unset.
- .github/workflows/omantel-e2e-handover.yaml (NEW): workflow_dispatch ONLY (no schedule cron, per CLAUDE.md "every workflow MUST be event-driven, NEVER scheduled"). Inputs let the operator override base URLs at dispatch time.
- docs/omantel-handover-wbs.md: new §10 "Phase 8 acceptance criteria (executable DoD)", 6 bullets 1:1 with the spec test() blocks; §9 status row added for #429 (🟢 scaffold-shipped).
Local verification:
    cd tests/e2e/playwright && npm install && \
    npx playwright test --list tests/omantel-handover.spec.ts
    → 6 tests listed cleanly
    npx playwright test tests/omantel-handover.spec.ts
    → 6 skipped (env vars unset, expected)
Out of scope (per #425 / #428 territory split):
- internal/hetzner/, infra/hetzner/, platform/velero/chart/, clusters/.../34-velero.yaml: #425's vendor-agnostic sweep
- .github/workflows/check-vendor-coupling.yaml: #428's coupling guard
Co-authored-by: hatiyildiz <hatiyildiz@noreply.github.com>
|
||
|
|
0fdd411e79
|
ci(guardrail): vendor-coupling check - fail CI if chart values use vendor name (closes #428) (#431)
Adds scripts/check-vendor-coupling.sh + .github/workflows/check-vendor-coupling.yaml
that scan platform/, clusters/, products/catalyst/bootstrap/{api,ui} for vendor names
(hetzner|aws|gcp|azure|oci) appearing in capability-named slots:
1. <vendor>-object-storage (sealed-secret / overlay-secret name)
2. <chart>Overlay\.<vendor>\. (chart values block keyed to vendor)
3. <vendor>ObjectStorage (camelCase payload field)
Excludes legitimately-per-provider paths (infra/<provider>/, internal/<provider>/,
internal/objectstorage/<provider>/, core/pkg/<provider>/), Crossplane Provider CR
refs (lines containing "crossplane-contrib/provider-"), and *.md files (docs may
discuss the rule).
Mode gate: warn-only while internal/objectstorage/ does not exist (pre-#425
work-in-progress); hard-fail once that directory lands. Locally on this branch
the script emits 49 warnings to stderr and exits 0 against the existing
hetzner-coupled references in platform/velero, platform/seaweedfs, and
clusters/.../bootstrap-kit/34-velero.yaml; once #425's rename lands those
warnings disappear and any future re-introduction fails CI.
Workflow trigger surface: push-to-main + pull_request on the scanned paths +
workflow_dispatch. No schedule: cron per CLAUDE.md "every workflow MUST be
event-driven, NEVER scheduled".
Canonical seam used: scripts/ + .github/workflows/ (mirrors
scripts/check-bootstrap-deps.sh + .github/workflows/blueprint-release.yaml
shape). NOT a duplicate - no prior vendor-coupling guard existed.
Refs: docs/omantel-handover-wbs.md §3a (canonical-seam map)
docs/INVIOLABLE-PRINCIPLES.md #4 (never hardcode)
Co-authored-by: hatiyildiz <hatiyildiz@noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
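The three capability-slot patterns and the two line-level exclusions described above can be illustrated with a short sketch. The real guard is a shell script (scripts/check-vendor-coupling.sh) whose contents are not shown here, so this Python version is an assumption-laden illustration only: the function name `find_vendor_coupling` is hypothetical, and the per-provider path excludes and the warn-only mode gate are left out.

```python
import re

# Vendor names listed in the commit message.
VENDORS = r"(?:hetzner|aws|gcp|azure|oci)"

# The three capability-named slots the commit describes.
PATTERNS = [
    re.compile(rf"\b{VENDORS}-object-storage\b"),   # secret name
    re.compile(rf"\w+Overlay\.{VENDORS}\."),        # chart values block
    re.compile(rf"\b{VENDORS}ObjectStorage\b"),     # camelCase payload field
]

# Lines referencing a Crossplane Provider CR are excluded.
CROSSPLANE_MARKER = "crossplane-contrib/provider-"


def find_vendor_coupling(path: str, text: str) -> list[tuple[int, str]]:
    """Return (line number, stripped line) for each coupled line in one file."""
    if path.endswith(".md"):
        # Docs may discuss the rule, so they are never flagged.
        return []
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if CROSSPLANE_MARKER in line:
            continue
        if any(p.search(line) for p in PATTERNS):
            hits.append((lineno, line.strip()))
    return hits
```

A caller would walk the scanned directories, skip the per-provider paths, and then decide between warning and failing based on the collected hits.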
||
|
|
956b976558
|
fix(ci): playwright-smoke port 4321→5173 for Vite 8 default (#335) (#418)
The catalyst-ui dev-server bind moved from 4321 to 5173 when Vite default
changed (Vite 8). The smoke workflow's curl-wait + BASE_URL env still
pointed at 4321, so:
Vite 8 starts fine on 5173 →
workflow polls 4321 for 60s → never returns 200 →
step exits 1 before Playwright ever runs.
Effect across last ~30 main commits: every push generated a 'Playwright UI
smoke failed' email despite the UI itself being healthy. We've been
shipping with --admin bypass + post-deploy verification against
console.openova.io. This restores actual smoke coverage on every PR.
Three substitutions on .github/workflows/playwright-smoke.yaml:
- line 80 curl wait URL: localhost:4321 → localhost:5173
- line 93 BASE_URL env: 4321 → 5173
- lines 72-73 comment: stale 'Vite binds 4321 by default' → 5173
Closes #335.
Co-authored-by: hatiyildiz <hati@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
4d24914ae4
|
feat(wipe): deployment-level Cancel & Wipe — backend endpoint + Cloud-Architecture + wizard banner entry-points (closes #318) (#346)
* feat(wipe): deployment-level Cancel & Wipe - backend endpoint + Cloud-Architecture + wizard banner entry-points (closes #318)
Adds a first-class Phase-0 recovery surface so an operator can purge a failed pre-handover deployment from the wizard UI without dropping to hcloud CLI runbooks. Two entry-points, one canonical implementation.
## Backend
NEW: products/catalyst/bootstrap/api/internal/handler/wipe.go
POST /api/v1/deployments/{id}/wipe is a single-flight destructive op:
1. tofu destroy against the per-deployment workdir (idempotent).
2. Hetzner orphan force-purge by label-selector `catalyst-deployment-id=<id>` (servers, load balancers, networks, firewalls, ssh-keys). Belt-and-braces: catches resources tofu didn't track (half-failed cloud-init, manual experiments). Per docs/INVIOLABLE-PRINCIPLES.md #3 this direct API path is fallback ONLY for orphan cleanup, never new resource creation.
3. PDM /v1/release for pool-subdomain Sovereigns (best-effort).
4. Local cleanup: kubeconfig file (mode 0600), tofu workdir, on-disk deployment record JSON.
5. SSE events stream throughout on the same channel as the original provisioning + Phase-1 watch.
6. Marks Status="wiped"; sync.Map entry reaped after a 60s TTL.
NEW: products/catalyst/bootstrap/api/internal/hetzner/purge.go
Hetzner Cloud API enumeration + force-delete by label selector. Uses a 60s timeout (vs the 10s ValidateToken default) because async server-delete jobs can queue. 404s treated as success (already gone).
NEW: products/catalyst/bootstrap/api/internal/provisioner/provisioner.go
Provisioner.Destroy() runs `tofu destroy -auto-approve` against the per-deployment workdir, then removes the workdir on success so re-provisioning starts fresh. Re-stages module + tfvars first so a partially-cleaned workdir still has what tofu needs.
TOUCHED: products/catalyst/bootstrap/api/cmd/api/main.go
Registers POST /api/v1/deployments/{id}/wipe.
## Frontend (aligned with existing CrudModals conventions per founder directive, no ad-hoc surface)
NEW: products/catalyst/bootstrap/ui/src/components/CrudModals/WipeDeploymentModal.tsx
Two-stage modal built on the canonical ModalShell. Pre-wipe confirm view requires the operator to:
- Type the sovereign FQDN to confirm scope.
- Re-paste their Hetzner Cloud API token (catalyst-api intentionally GCs the original after writeTfvars per credential hygiene).
Post-wipe success view shows the PurgeReport (servers, lbs, networks, firewalls, ssh-keys removed; tofu/PDM/local-state ✓/✗) and a "Start fresh deployment" CTA that navigates to /sovereign.
TOUCHED: products/catalyst/bootstrap/ui/src/components/CrudModals/index.ts
Re-exports WipeDeploymentModal + WipeReport.
TOUCHED: products/catalyst/bootstrap/ui/src/pages/sovereign/AppsPage.tsx
FailureCard now exposes a "Cancel & Wipe" red button next to "Retry stream" / "Back to wizard"; it opens WipeDeploymentModal.
TOUCHED: products/catalyst/bootstrap/ui/src/pages/sovereign/InfrastructureTopology.tsx
Cloud → Architecture canvas: the `cloud` (root) node action menu gains "Cancel & Wipe deployment" as a `danger:true` action, alongside the existing "+ Add region". Distinct from the per-resource DeleteCascadeConfirm on region/cluster/vCluster: this is deployment-scope (Phase-0 orphan purge), the others are Crossplane-XRC scope (day-2). The two paths coexist; operators choose by what state the deployment is in.
## Why two entry-points
Wizard banner (failed state on AppsPage): recovery from a known failure. Already a red-banner page; the button is right there.
Cloud → Architecture cloud-node action: proactive cancel from the canvas, mirrors how the existing per-resource deletes are reachable.
Same modal, same backend.
## Constraints honoured
- Per docs/INVIOLABLE-PRINCIPLES.md #3 (Crossplane is the ONLY day-2 IaC): the per-resource DELETE handler at infrastructure.go is unchanged and continues to flip XRC deletionPolicy. Wipe operates ONLY in Phase-0 scope where Crossplane never adopted resources.
- Per #4 (never hardcode): every endpoint lives behind API_BASE; the Hetzner purge enumerates by deterministic label selector built from var.sovereign_fqdn (the OpenTofu module's existing tagging convention).
- Per credential hygiene: the Hetzner token is re-prompted at wipe time rather than persisted; the modal uses an <input type="password">.
## Refs
#318: pre-handover wipe spec (this PR closes it)
#317: handover finalisation (sibling; this PR is the failure-path complement)
feedback_idempotent_iac_purge.md: operator runbook this implements
PR #313: sealed-secrets cleanup (independent; safe to land in any order)
PR #334: bp-external-secrets split (independent)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(ci): catalyst-build event-driven only - drop cron, push-on-main with path filter
Per docs/INVIOLABLE-PRINCIPLES.md (event-driven end to end: Flux dependsOn, NATS JetStream, SSE, Helm hooks), GitHub Actions must follow the same model. The previous `schedule: cron 0 3 * * *` daily build was the only canonical deploy path, which created a 24h roll latency on every change to the catalyst surface and incentivised "wait for cron" stalls in operator workflows.
Replaces with:
    on:
      push:
        branches: [main]
        paths:
          - 'core/console/**'
          - 'core/admin/**'
          - 'core/marketplace/**'
          - 'core/marketplace-api/**'
          - 'products/catalyst/bootstrap/**'
          - 'products/catalyst/chart/**'
          - '.github/workflows/catalyst-build.yaml'
      workflow_dispatch:
`workflow_dispatch` retained for ad-hoc re-runs (config-only changes that bypass the path filter, e.g. a secret rotation that doesn't touch code). Path filter mirrors the actual surface this workflow rebuilds.
After this lands, every merge to main that touches the catalyst surface auto-deploys. No cron lag.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Hatice Yildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
2de8bb68b9
|
fix(ci): bump helm 3.16.3 → 3.18.4 in blueprint-release — fixes seaweedfs smoke-render (#336)
'function fromToml not defined' error on bp-seaweedfs publish. Upstream seaweedfs/seaweedfs 4.22.0 (templates/shared/security-configmap.yaml:21) uses fromToml which exists in 3.13+ but the rendered context in the smoke step needs newer Sprig functions present in 3.18+. Bump unblocks the chain of HRs (bp-loki, bp-mimir, bp-tempo, bp-velero, bp-harbor, bp-grafana) all blocked on bp-seaweedfs publish. Co-authored-by: hatiyildiz <hatice.yildiz@openova.io> |
||
|
|
5502d9aa48
|
feat(dns): cert-manager-dynadot-webhook for DNS-01 wildcard TLS (closes #159) (#291)
Activates the previously-templated `letsencrypt-dns01-prod` ClusterIssuer
in bp-cert-manager by shipping the missing piece — a Go binary that
satisfies cert-manager's external webhook contract
(`webhook.acme.cert-manager.io/v1alpha1`) against the Dynadot api3.json.
Architecture
============
* `core/pkg/dynadot-client/` — canonical Dynadot HTTP client (shared with
pool-domain-manager and catalyst-dns). Encapsulates the api3.json
transport, command builders, response decoding, and the safe
read-modify-write semantics required to never accidentally wipe a
zone (memory: feedback_dynadot_dns.md). Destructive `set_dns2`
variant is unexported.
* `core/cmd/cert-manager-dynadot-webhook/` — the cert-manager webhook
binary. Implements `Solver.Present` via the client's append-only
`AddRecord` path and `Solver.CleanUp` via the read-modify-write
`RemoveSubRecord` path. Domain allowlist (`DYNADOT_MANAGED_DOMAINS`)
rejects challenges for unmanaged apexes BEFORE any Dynadot call.
* `platform/cert-manager-dynadot-webhook/` — Catalyst-authored Helm
wrapper. Templates Deployment + Service + APIService + serving
Certificate (CA chain via cert-manager Issuer self-signing) +
RBAC + ServiceAccount. Mirrors the standard cert-manager
external-webhook deployment shape.
* `platform/cert-manager/chart/` — flips `dns01.enabled: true` so the
paired ClusterIssuer activates. The interim http01 issuer remains
templated as the rollback path.
Test results
============
core/pkg/dynadot-client — 7 tests PASS (race-clean)
core/cmd/cert-manager-dynadot-... — 9 tests PASS (race-clean)
Test coverage includes a Present/CleanUp round-trip against an
httptest fixture that models Dynadot's zone state, an explicit
unmanaged-domain rejection, a regression preserving a pre-existing
CNAME across the DNS-01 round-trip (the zone-wipe defence), and a
typed-error propagation test that surfaces `ErrInvalidToken` to
cert-manager so the controller will retry.
Helm template smoke render
==========================
`helm template` against the new chart with default values yields 12
resources / 424 lines (APIService, Certificate, ClusterRoleBinding,
Deployment, Issuer, Role, RoleBinding, Service, ServiceAccount). The
modified bp-cert-manager chart still renders both ClusterIssuers
(`letsencrypt-dns01-prod` + `letsencrypt-http01-prod`) with default
values; flipping `certManager.issuers.dns01.enabled=false` is the
clean rollback.
Smoke command (post-deploy)
===========================
kubectl get apiservices.apiregistration.k8s.io \
v1alpha1.acme.dynadot.openova.io
# Issue a *.<sovereign>.<pool> wildcard cert and watch the
# Order/Challenge progress through cert-manager.
CI
==
`.github/workflows/build-cert-manager-dynadot-webhook.yaml` mirrors the
pool-domain-manager-build pattern (cosign keyless signing, SBOM
attestation, GHCR push at
`ghcr.io/openova-io/openova/cert-manager-dynadot-webhook:<sha>`).
Triggered by changes to either the binary or the shared dynadot-client
package.
Closes #159
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
0289f0388d
|
feat(scripts): bootstrap-kit dependency-graph audit script (W2.K0) (#259)
Adds scripts/check-bootstrap-deps.sh + scripts/expected-bootstrap-deps.yaml, the W2.K0 deliverable from docs/BOOTSTRAP-KIT-EXPANSION-PLAN.md §2 + §3. The script parses every clusters/_template/bootstrap-kit/*.yaml, extracts metadata.name + spec.dependsOn for the HelmRelease document(s), and mechanically verifies the actual graph against the expected DAG declared in scripts/expected-bootstrap-deps.yaml. It detects cycles via Kahn's algorithm and prints the rendered DAG as ASCII grouped by Wave 2 batch (W2.K1-K4) on success. Behaviour against the in-flight expansion: HRs declared expected but not yet on disk are reported as "deferred" (informational, not an error), so that this script can be the static authoritative list while W2.K1-K4 PRs land their HR files in series. After all four W2 PRs merge, the "deferred" count drops to 0 and the audit goes 100% green. Wired into the existing .github/workflows/test-bootstrap-kit.yaml as a new dependency-graph-audit job that runs on every PR touching: - clusters/** (any HR file edit) - scripts/check-bootstrap-deps.sh - scripts/expected-bootstrap-deps.yaml - .github/workflows/test-bootstrap-kit.yaml Co-authored-by: hatiyildiz <hatice.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
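The cycle check mentioned above relies on Kahn's algorithm. The audit itself is a shell script whose implementation is not shown here, so what follows is only a Python sketch of the technique over a `dependsOn` mapping; the function name `find_unorderable` is hypothetical.

```python
from collections import deque


def find_unorderable(depends_on: dict[str, list[str]]) -> list[str]:
    """Kahn's algorithm: return nodes that sit on, or depend on, a cycle.

    `depends_on` maps a release name to the names it depends on.
    An empty result means the graph is a DAG.
    """
    nodes = set(depends_on)
    for deps in depends_on.values():
        nodes.update(deps)

    unmet = {n: 0 for n in nodes}                 # unresolved dependency count
    dependents: dict[str, list[str]] = {n: [] for n in nodes}
    for name, deps in depends_on.items():
        for dep in set(deps):                     # ignore duplicate edges
            unmet[name] += 1
            dependents[dep].append(name)

    # Start from nodes with no dependencies and peel the graph layer by layer.
    ready = deque(sorted(n for n, count in unmet.items() if count == 0))
    while ready:
        current = ready.popleft()
        for child in dependents[current]:
            unmet[child] -= 1
            if unmet[child] == 0:
                ready.append(child)

    # Anything never released is part of, or blocked behind, a cycle.
    return sorted(n for n, count in unmet.items() if count > 0)
```

For example, a graph where `bp-crossplane-claims` depends on `bp-crossplane` and nothing depends back yields an empty list, whereas two releases that depend on each other are both reported, along with anything that depends on them.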
||
|
|
2d1799d738
|
fix(bp-crossplane): split XRDs+Compositions into bp-crossplane-claims (#247)
Resolves install ordering on fresh clusters where the apiserver rejects CompositeResourceDefinition CRs because the apiextensions.crossplane.io CRDs registered by the crossplane subchart aren't live yet at apply time.
- bp-crossplane bumped 1.1.2 -> 1.1.3 (controller-only payload)
- NEW bp-crossplane-claims@1.0.0 carries XRDs + Compositions
- Flux HelmRelease for crossplane-claims uses dependsOn: [bp-crossplane]
- composition-validate.sh + fixtures relocate to the new chart
- blueprint-release CI: opt-out annotation catalyst.openova.io/no-upstream=true permits zero-deps charts that legitimately ship only Catalyst-authored CRs (the original hollow-chart rule remains in force for every other umbrella chart)
Live error this fixes (from otech.omani.works):
no matches for kind "CompositeResourceDefinition" in version "apiextensions.crossplane.io/v1" -- ensure CRDs are installed first
Pattern: intra-chart CRD-ordering breaks -> split charts + Flux dependsOn. Apply universally to similar cases going forward.
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
fad36836ed
|
fix(ci): tempo + ntfy logos are now .svg (logo-fix-batch-2) (#213)
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io> |
||
|
|
1f5c76def1
|
fix(platform): sync blueprint.yaml versions with Chart.yaml (#199)
* feat(ui): Playwright cosmetic + step-flow regression guards
15 regression guards in
products/catalyst/bootstrap/ui/e2e/cosmetic-guards.spec.ts that fail HARD
when each user-flagged defect class returns:
1. card height drift from canonical 108px
2. reserved right padding eating description width
3. logo tile drift from per-brand LOGO_SURFACE
4. invisible glyph (white-on-white) via luminance proxy
5. wizard step order Org/Topology/Provider/Credentials/Components/
Domain/Review
6. legacy "Choose Your Stack" / "Always Included" tab labels
7. Domain step reachable before Components
8. CPX32 not the recommended Hetzner SKU
9. per-region SKU dropdown shows wrong provider catalog
10. provision page is .html (static) not SPA route
11. legacy bubble/edge DAG SVG markup on provision page
12. admin sidebar drift from canonical core/console (w-56 + 7 labels)
13. AppDetail uses tablist instead of sectioned layout
14. job rows navigate to /job/<id> instead of expand-in-place
15. Phase 0 banners (Hetzner infra / Cluster bootstrap) on AdminPage
Each test prints a failure message naming the canonical reference,
the source-of-truth file, and the data-testid PR needed (if any) so
the implementing agent has a precise target. No .skip() — per
INVIOLABLE-PRINCIPLES #2, missing components fail loud.
CI: .github/workflows/cosmetic-guards.yaml runs the suite on every
PR that touches products/catalyst/bootstrap/ui/** or core/console/**.
Docs: docs/UI-REGRESSION-GUARDS.md maps each test to the user's
original complaint, the canonical reference, and the green/red
semantics (5 tests intentionally RED on main today — they stay red
until the companion-agent's UI work lands).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(platform): sync blueprint.yaml versions with Chart.yaml so manifest-validation passes
---------
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|