e5c2797ce6
26 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
ce76a7b7ab
|
fix(bp-powerdns): root-cause Job DeadlineExceeded recurrence (post Fix #144) (#1425)
Fix #144 raised zoneBootstrap.activeDeadlineSeconds 300s → 840s after
prov #22 hit a 5m DeadlineExceeded on the bp-powerdns post-install hook.
That fix was insufficient: prov #37 + #38 (chroot omantel.biz, 2026-05-12)
both wedged on the SAME chart slot with `BackoffLimitExceeded`, NOT
`DeadlineExceeded`. The deadline never got a chance to fire.
Trace from prov #38 chroot
(`KUBECONFIG=/tmp/prov38.kubeconfig kubectl get hr bp-powerdns -o yaml`):
status: Helm install failed for release powerdns/powerdns with chart
bp-powerdns@1.2.2: failed post-install: 1 error occurred:
* job powerdns-zone-bootstrap failed: BackoffLimitExceeded
Pod events for powerdns-zone-bootstrap-tq7qq:
59m Started container zone-bootstrap
56m Back-off restarting failed container zone-bootstrap
55m Job has reached the specified backoff limit
Root cause walked end-to-end (per CLAUDE.md TRACE rule):
TEST: bp-powerdns HR Ready=True
↑ HR: Helm install succeeds (post-install Job exits 0)
↑ Zone-bootstrap Job: curl POST succeeds
↑ powerdns:8081 Service: reachable (has Ready endpoints)
↑ powerdns Deployment: Pods Ready (3 replicas) ← Pending, blocked here
↑ CNPG cluster: pdns-pg-app Secret exists
↑ pdns-pg-1-initdb Pod: scheduled, Running, Completed ← Pending too
↑ Worker node has capacity ← 99% CPU requested
The zone-bootstrap container curl'd `http://powerdns:8081`, hit
"connection refused" (empty Service endpoints), exited 7, and the container
restarted under `restartPolicy: OnFailure`. After 6 Kubernetes-level
backoffs (≈10min wall-time with exponential delay), the Job declared
`BackoffLimitExceeded`, well before activeDeadlineSeconds=840s (14min)
could even consider firing.
Fix #144 was directionally right (the upstream IS slow on cold k3s) but
operated on the wrong knob. The container's outer-loop retry budget is
bounded by backoffLimit × backoff-delay, not by activeDeadlineSeconds.
Bumping only the deadline left the BackoffLimit ceiling unchanged.
Architectural fix (this commit):
1. Move the wait-for-API loop INSIDE the container (one Pod, one inner
poll loop, restartPolicy=Never). The inner loop polls
GET /api/v1/servers every 10s until HTTP 200, bounded by new
`apiReadyTimeoutSeconds` (default 600s = 10min). Now ONE container run
owns the full wait budget instead of N short-lived containers racing
the backoff timer.
2. restartPolicy: OnFailure → Never. The container script handles its own
retry; Kubernetes-level backoff is reserved for genuinely transient pod
failures (image-pull, OS eviction) where the Job-level backoffLimit=6
still triggers a fresh Pod.
3. Surface POWERDNS_API_READY_TIMEOUT_S env var so operators on slower
clusters can raise the inner deadline without forking the chart (per
docs/INVIOLABLE-PRINCIPLES.md #4).
4. New value `zoneBootstrap.apiReadyTimeoutSeconds` (default 600s). Sits
below activeDeadlineSeconds (840s) so the zone-creation phase keeps
≥240s of headroom AFTER the API comes Ready.
Curl status handling in the wait loop:
200 → API up, proceed to bootstrap
401|403 → auth failure, FATAL (no retry; operator misconfig)
000|5xx|... → transient, sleep & retry until inner deadline
Files changed:
- platform/powerdns/chart/Chart.yaml 1.2.2 → 1.2.3 + history
- platform/powerdns/chart/values.yaml + apiReadyTimeoutSeconds knob
- platform/powerdns/chart/templates/zone-bootstrap-job.yaml inner
wait-for-API loop; restartPolicy: Never
- clusters/_template/bootstrap-kit/11-powerdns.yaml pin to 1.2.3 + HR
comment
Why this is sufficient where Fix #144 was not:
Fix #144 worked the chart-level deadline. This commit works the
inner-loop ownership: the wait budget is now owned by the script inside
the container, not by the Job spec arithmetic (backoffLimit ×
backoff-delay). The Job's outer activeDeadlineSeconds still caps the
worst-case runtime (no runaway poll), but the script now actually GETS to
use it.
Verification:
- helm template renders cleanly (deps build OK, empty-zones short-circuit
preserved, non-empty zones render Job + RBAC + Audit CM)
- kubectl create --dry-run=client --validate=false: 5/5 resources created
(sa, role, rb, cm, job)
- chart 1.2.3 pinned in clusters/_template/bootstrap-kit/11-powerdns.yaml
Companion infrastructure note (NOT addressed by this commit, flagged for
Coordinator):
The DEEPER bottom of the trace stack is worker capacity. Prov #38's single
cpx32 worker (4 vCPU / 8 GB) is at 99% CPU requested. The
cluster-autoscaler attempted 2→3 scale-up but is in backoff because two
unscheduled pods (gitea/gitea-* PV affinity conflict from a previous
wedged install; trivy-system/node-collector NodeAffinity) poison the
autoscaler's "can the template node fit" check.
Even with this chart fix in place, the powerdns Deployment cannot become
Ready until either:
(a) the worker autoscales successfully (gitea PV migrated / trivy taints
relaxed), or
(b) worker_count is bumped from 2 to 3 in the provisioning body, or
(c) qa_worker_size is bumped to cpx42.
This chart fix ensures bp-powerdns survives a slow CNPG cold-start. It
does NOT fix a fundamentally undersized cluster.
Coordinator next step: reprov with worker_count=3 OR qa_worker_size=cpx42
+ this chart landed. Either should converge.
Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
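The two retry budgets described in this commit can be sketched in a few lines of Python. This is only an illustration, not the chart's actual shell script: `classify_status`, `wait_for_api` and `restart_backoff_budget` are hypothetical names, the probe is injected rather than calling curl, elapsed time is simulated, and the 10s-doubling-to-300s restart delay is an assumption based on the kubelet's default crash-loop backoff.

```python
from typing import Callable, Tuple


def classify_status(code: str) -> str:
    """Map a curl-style HTTP status string to the action the wait loop takes."""
    if code == "200":
        return "ready"   # API up, proceed to bootstrap
    if code in ("401", "403"):
        return "fatal"   # auth failure: operator misconfig, never retried
    return "retry"       # 000, 5xx, anything else: treated as transient


def wait_for_api(probe: Callable[[], str],
                 timeout_s: float = 600.0,
                 interval_s: float = 10.0) -> Tuple[str, float]:
    """Poll `probe` until it is ready or fatal, or the inner deadline passes.

    Returns (outcome, elapsed_seconds). Elapsed time is simulated by counting
    poll intervals; a real loop would sleep `interval_s` between probes.
    """
    elapsed = 0.0
    while True:
        outcome = classify_status(probe())
        if outcome != "retry":
            return outcome, elapsed
        if elapsed + interval_s > timeout_s:
            return "timeout", elapsed
        elapsed += interval_s


def restart_backoff_budget(restarts: int, base_s: int = 10, cap_s: int = 300) -> int:
    """Total delay in seconds across `restarts` restarts (doubling, capped)."""
    return sum(min(base_s * 2 ** i, cap_s) for i in range(restarts))


if __name__ == "__main__":
    codes = iter(["000", "503", "200"])
    print(wait_for_api(lambda: next(codes)))
    print(restart_backoff_budget(6), "s of backoff vs an 840 s Job deadline")
```

Under the assumed backoff schedule, six restarts consume roughly 610 s, which is in line with the ≈10min figure quoted above and below the 840 s Job deadline.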
||
|
|
930529136c
|
fix(bp-powerdns): raise zone-bootstrap Job deadline 300s -> 840s (#144) (#1351)
Cold Sovereign on prov #22 (e2fc1004362ce765) hit terminal HR FAILED on
bp-powerdns: post-install hook Job DeadlineExceeded after 5m, Helm hook
reported `post-install: timed out waiting for the condition`, HelmRelease
retried 4x then went terminal.
Root cause: zoneBootstrap.activeDeadlineSeconds default of 300s was
shorter than the time bp-cnpg needed to synthesise the `pdns-pg-app`
Secret on a cold k3s control plane. The powerdns Pod was not Ready, curl
against http://powerdns:8081 inside the Job kept failing under
backoffLimit=6, and the 5-minute Job-level deadline killed it.
Canonical seam: chart values.yaml (the Job spec consumes
{{ .Values.zoneBootstrap.activeDeadlineSeconds }} via the existing
templated knob; no new template plumbing required, principle 18 met).
Fix: raise default 300s -> 840s (14m). Sits below the HR install.timeout
of 15m in clusters/_template/bootstrap-kit/11-powerdns.yaml, so a true
chart failure still surfaces via Flux's own remediation path rather than
wedging on a Helm wait that outlives its outer wrapper.
Chart bump: 1.2.1 -> 1.2.2. _template HR pinned to 1.2.2 with a comment
explaining the prov-#22 incident. Per-Sovereign HR files
(clusters/omantel.omani.works/, otech.omani.works/) remain pinned to
1.1.5; that is pre-existing drift, not in scope here. New Sovereign
provisioning reads from the _template path.
Same fix family as #127, #131, #143 (HR/Job timeout-ladder alignment
where a downstream Job's deadline must fit inside its HR wrapper cap).
## Claimed TCs
- prov-22-bp-powerdns-hr-ready
- prov-22-zone-bootstrap-job-completes-cold-cnpg
Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
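The "timeout ladder" idea in this commit (each inner deadline must fit inside its outer wrapper) can be checked mechanically. The sketch below is a hypothetical helper, not part of the chart or of Flux; only the 840s and 15m values come from the commit message.

```python
from typing import List, Tuple


def check_timeout_ladder(ladder: List[Tuple[str, int]]) -> List[str]:
    """Return one message per rung that does not fit inside the next one.

    `ladder` is ordered innermost first; each entry is (name, seconds).
    """
    problems = []
    for (inner, inner_s), (outer, outer_s) in zip(ladder, ladder[1:]):
        if inner_s >= outer_s:
            problems.append(
                f"{inner}={inner_s}s does not fit inside {outer}={outer_s}s"
            )
    return problems


if __name__ == "__main__":
    ladder = [
        ("zoneBootstrap.activeDeadlineSeconds", 840),  # value from this commit
        ("HR install.timeout", 15 * 60),               # 15m from the HR file
    ]
    print(check_timeout_ladder(ladder) or "ladder is consistent")
```

An empty result means every deadline expires before the wrapper that contains it, which is the property the commit relies on.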
||
|
|
25ef20a8e5
|
feat(catalyst-chart): land Blueprint CRD + fix 5 string-form depends (slice B4, #1095) (#1112)
Realizes the Blueprint CRD per docs/BLUEPRINT-AUTHORING.md §3 and design
doc §3.2.4. Promotes the doc-contract (apiVersion catalyst.openova.io)
from a YAML-loaded contract to a schema-validated CRD.
Schema design:
- Two versions served from one inline schema (YAML anchors): v1alpha1
(legacy, served, not storage) and v1 (canonical, served, storage). The
shared schema means the 38 existing v1alpha1 files in platform/ +
products/ continue to validate; migration to v1 is a follow-up slice.
- Required at this layer: spec.version (strict semver pattern),
spec.card.title (minLength=1).
- Card variants accommodated as documented: summary | description |
tagline interchangeable; category | family interchangeable; docs |
documentation interchangeable. All optional except title.
- visibility enum: listed | unlisted | private.
- placementSchema.modes enum: single-region | active-active | active-
hotstandby — same set Application.spec.placement validates against.
- depends[].blueprint pattern accepts both bp-* and bare-name (legacy).
- manifests accepts both manifests.chart (legacy short-form) AND
manifests.source.{kind,ref} (canonical). Three source kinds: HelmChart,
Kustomize, OAM.
- rotation[].ttl pattern '^[0-9]+(s|m|h|d)$'.
- x-kubernetes-preserve-unknown-fields liberally on configSchema (per-
Blueprint JSON Schema is arbitrary by design), card, manifests, owner,
observability, outputs, depends[].values, manifests.values, etc.
Existing files validation:
- Surveyed all blueprint.yaml in platform/ + products/ (59 files).
- Card field frequency: title (59), summary (38), description (20+1),
category (25), family (20), docs (20), documentation (14+1), icon (25),
tags (14), license (14).
- 54 of 59 files passed the schema unchanged.
- 5 files used `depends: [- bp-name]` (string form) instead of the
canonical `[- blueprint: bp-name]` object form per BLUEPRINT-AUTHORING
§3. Those 5 files are fixed in this commit:
* platform/cert-manager-powerdns-webhook/blueprint.yaml
* platform/cert-manager-dynadot-webhook/blueprint.yaml
* platform/crossplane-claims/blueprint.yaml
* platform/powerdns/blueprint.yaml
* platform/self-sovereign-cutover/blueprint.yaml
- After fix: ALL 59 files pass server-side validation (kubectl apply
--dry-run=server) against the new CRD.
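The string-form to object-form fix applied to those 5 files amounts to a small normalisation. A minimal Python sketch, assuming the `depends` list has already been parsed from YAML; `normalize_depends` is a hypothetical helper, not a tool from this repository:

```python
from typing import Any, Dict, List


def normalize_depends(depends: List[Any]) -> List[Dict[str, Any]]:
    """Rewrite legacy string entries as {"blueprint": name} objects."""
    normalized = []
    for entry in depends:
        if isinstance(entry, str):
            normalized.append({"blueprint": entry})   # legacy string form
        elif isinstance(entry, dict) and "blueprint" in entry:
            normalized.append(entry)                  # already canonical
        else:
            raise ValueError(f"unsupported depends entry: {entry!r}")
    return normalized


if __name__ == "__main__":
    print(normalize_depends(["bp-name"]))
```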
Negative validation (tests/blueprint-sample-invalid.yaml):
- spec.version "1.3" → semver pattern
- spec.card missing → required
- spec.card.title missing → required
- spec.visibility "secret" → enum listed|unlisted|private
- spec.placementSchema.modes "round-robin" → enum
- spec.depends[0] bare string "bp-bad-string" → must be object
- spec.depends[1].blueprint "Foo" → pattern fails (uppercase)
- spec.rotation[0].ttl "5 days" → pattern '^[0-9]+(s|m|h|d)$'
All 8 seeded vectors rejected.
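The rejection rules exercised by these vectors can be approximated outside the API server. The Python sketch below is an assumption-heavy illustration: only the ttl pattern and the two enums are taken from this commit message, while `SEMVER_RE`, `BLUEPRINT_RE`, the list shape of `placementSchema.modes` and the `validate_spec` helper are hypothetical stand-ins for the real CRD schema.

```python
import re
from typing import Any, Dict, List

TTL_RE = re.compile(r"^[0-9]+(s|m|h|d)$")            # pattern quoted in the commit
VISIBILITY = {"listed", "unlisted", "private"}       # enum quoted in the commit
MODES = {"single-region", "active-active", "active-hotstandby"}
SEMVER_RE = re.compile(r"^[0-9]+\.[0-9]+\.[0-9]+$")  # assumed strict semver
BLUEPRINT_RE = re.compile(r"^[a-z0-9]([a-z0-9-]*[a-z0-9])?$")  # assumed lowercase


def validate_spec(spec: Dict[str, Any]) -> List[str]:
    """Return the field paths that would be rejected under the assumptions above."""
    errors = []
    if not SEMVER_RE.match(str(spec.get("version", ""))):
        errors.append("spec.version")
    card = spec.get("card")
    if card is None:
        errors.append("spec.card")
    elif not card.get("title"):
        errors.append("spec.card.title")
    if "visibility" in spec and spec["visibility"] not in VISIBILITY:
        errors.append("spec.visibility")
    for mode in spec.get("placementSchema", {}).get("modes", []):
        if mode not in MODES:
            errors.append("spec.placementSchema.modes")
    for i, dep in enumerate(spec.get("depends", [])):
        if not isinstance(dep, dict):
            errors.append(f"spec.depends[{i}]")
        elif not BLUEPRINT_RE.match(str(dep.get("blueprint", ""))):
            errors.append(f"spec.depends[{i}].blueprint")
    for i, rot in enumerate(spec.get("rotation", [])):
        if not TTL_RE.match(str(rot.get("ttl", ""))):
            errors.append(f"spec.rotation[{i}].ttl")
    return errors


if __name__ == "__main__":
    bad = {
        "version": "1.3",
        "visibility": "secret",
        "placementSchema": {"modes": ["round-robin"]},
        "depends": ["bp-bad-string", {"blueprint": "Foo"}],
        "rotation": [{"ttl": "5 days"}],
    }
    print(validate_spec(bad))
```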
This commit ONLY touches new CRD + test files + the 5 depends fixes —
leaves the in-flight router.tsx + rootBeforeLoad.test.ts work from a
parallel agent and the .claude/worktrees/ directory untouched.
Refs: #1094, #1095, docs/EPICS-1-6-unified-design.md §3.2.4,
docs/BLUEPRINT-AUTHORING.md §3
Co-authored-by: hatiyildiz <hatiyildiz@noreply.openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
c42e98216c
|
fix(bp-powerdns): zone-bootstrap Job needs /tmp emptyDir (curl -o + readOnlyRootFS) (#843)
* fix(bootstrap-kit,bp-newapi): bump slot pins (gitea 1.2.4, catalyst-platform 1.4.2) + gate Traefik Middleware on Cilium Sovereigns (bp-newapi 1.2.0)
Three issues blocking the otech103 verification proof on a freshly merged
main, all uncovered while live-driving the Day-2 Independence cutover:
1. clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml pinned
1.4.0 and missed the bumps from PR #839 (1.4.1, RBAC dual-mode render)
and PR #841 (1.4.2, POWERDNS env literal). Bumping the slot pin to 1.4.2
lands those fixes on every fresh provision.
2. clusters/_template/bootstrap-kit/10-gitea.yaml pinned 1.2.3 and missed
the bump from PR #832 (1.2.4, gitea-admin-secret canonical Secret for
cutover Step-1 to mount). Bumping to 1.2.4 unblocks
bp-self-sovereign-cutover Step-1 (gitea-mirror Job).
3. platform/newapi/chart/templates/ingress.yaml hard-rendered a
traefik.io/v1alpha1 Middleware resource. On a Cilium Gateway Sovereign
that CRD does not exist; bp-newapi 1.1.0 install failed with 'no matches
for kind Middleware'. Gating the Middleware behind
.Values.ingress.middleware.enabled (default false) lets the chart install
on Cilium Sovereigns; contabo / Traefik clusters can still flip it on
per-overlay. Bumping to 1.2.0 (additive feature, default-off, no breaking
change). Slot 80-newapi pin bumped lockstep.
Verified live state on otech103.omani.works (deployment id
12dff5098e33053e):
- bp-newapi 1.1.0 HR: Status=False 'Helm install failed: ... no matches
for kind Middleware in version traefik.io/v1alpha1'
- bp-catalyst-platform HR pinned at 1.4.0 (lacks RBAC for cutover-driver)
- bp-gitea HR pinned at 1.2.3 (lacks gitea-admin-secret)
After this PR merges + Flux reconciles otech103, all three HRs upgrade in
place and the cutover proof can be driven to completion.
* fix(bp-powerdns): zone-bootstrap Job needs /tmp emptyDir (readOnlyRootFS + curl -o)
Caught live on otech103 2026-05-04: zone-bootstrap Job exit 23 (curl write
error) because curl -o /tmp/zone-resp + readOnlyRootFilesystem=true and no
/tmp emptyDir mount. Bumps bp-powerdns 1.2.0 → 1.2.1 + slot 11 pin
lockstep.
Without /tmp/zone-resp writable the Job CrashLoops every retry, never
completes, bp-external-dns dependency stuck, Phase-1 watcher never reaches
ready, handover never auto-fires.
--------- Co-authored-by: Hatice Yildiz <hatice.yildiz@openova.io> |
||
|
|
e96741a0ca
|
feat(powerdns,cert-manager): multi-zone bootstrap + per-zone wildcard cert (#827) (#838)
A franchised Sovereign now supports N parent zones, NOT one. The operator
brings 1+ parent domains at signup (`omani.works` for own use,
`omani.trade` for the SME pool, etc.) and may add more post-handover via
the admin console (#829).
bp-powerdns 1.2.0 (platform/powerdns/chart):
- New `zones: []` values key listing parent domains to bootstrap
- New Helm post-install/post-upgrade hook Job
(templates/zone-bootstrap-job.yaml) that POSTs each entry to
/api/v1/servers/localhost/zones at install time. Idempotent on HTTP 409:
re-runs after upgrades or chart bumps never fail.
- Default-values render skips when zones is empty (legacy behavior).
bp-catalyst-platform 1.4.0 (products/catalyst/chart):
- New `parentZones: []` + `wildcardCert.{enabled,namespace,issuerName}`
values
- New templates/sovereign-wildcard-certs.yaml renders one
cert-manager.io/v1.Certificate per zone (each `*.<zone>` + apex) via the
letsencrypt-dns01-prod-powerdns ClusterIssuer. Each cert renews
independently. Skips entirely when parentZones is empty so the legacy
clusters/_template/sovereign-tls/cilium-gateway-cert.yaml retains
ownership of `sovereign-wildcard-tls` (avoids helm-vs-kustomize ownership
flap).
- New `catalystApi.{powerdnsURL,powerdnsServerID}` values threaded into
the catalyst-api Pod as CATALYST_POWERDNS_API_URL +
CATALYST_POWERDNS_SERVER_ID env vars.
catalyst-api (products/catalyst/bootstrap/api):
- New internal/powerdns package with typed Client (CreateZone,
ZoneExists). Idempotent on HTTP 409/412.
- handler.pdmCreatePowerDNSZone (issue #829's stub) now uses the typed
client when wired via SetPowerDNSZoneClient; the admin-console "Add
another parent domain" flow now creates real zones in the Sovereign's
PowerDNS at runtime.
- main.go wires the client when CATALYST_POWERDNS_API_URL +
CATALYST_POWERDNS_API_KEY are set.
- Comprehensive unit tests (client_test.go: 9 cases incl. 201/409/412/500
+ custom NS + custom serverID).
Bootstrap-kit slot integration:
- clusters/_template/bootstrap-kit/11-powerdns.yaml: bumps to bp-powerdns
1.2.0 and threads `zones: ${PARENT_DOMAINS_YAML}` from Flux
postBuild.substitute.
- clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml: bumps to
bp-catalyst-platform 1.4.0 and threads
`parentZones: ${PARENT_DOMAINS_YAML}` (same source-of-truth string so the
two slots stay in lockstep).
- infra/hetzner: new `parent_domains_yaml` Terraform variable (defaults to
single-zone array derived from sovereign_fqdn) → cloud-init renders the
PARENT_DOMAINS_YAML Flux substitute.
DoD verified end-to-end with helm template + envsubst:
- Multi-zone overlay (omani.works + omani.trade) renders 2 PowerDNS
zone-create API calls in the bootstrap Job AND 2 Certificate resources
(`*.omani.works`, `*.omani.trade`) in bp-catalyst-platform.
- Single-zone fallback (PARENT_DOMAINS_YAML defaults to
`[{name: "<sov_fqdn>", role: "primary"}]`) keeps legacy provisioning
paths working without per-overlay edits.
Closes #827.
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io> |
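The idempotency contract described above (a zone that already exists is not an error) and the per-zone certificate names can be sketched as follows. This is a Python illustration under stated assumptions, not the chart's Job script or the Go client: `bootstrap_zones`, `wildcard_dns_names` and the in-memory `FakePowerDNS` are hypothetical, and 201 is assumed to be the success status alongside the 409 named in the commit.

```python
from typing import Callable, Dict, Iterable, List


def bootstrap_zones(zones: Iterable[str],
                    create_zone: Callable[[str], int]) -> Dict[str, str]:
    """Create each zone; treat HTTP 409 (already exists) as success."""
    results = {}
    for zone in zones:
        status = create_zone(zone)
        if status == 201:
            results[zone] = "created"
        elif status == 409:
            results[zone] = "exists"     # re-run after upgrade: not a failure
        else:
            raise RuntimeError(f"zone {zone}: unexpected HTTP {status}")
    return results


def wildcard_dns_names(zone: str) -> List[str]:
    """DNS names for one per-zone certificate: wildcard plus apex."""
    return [f"*.{zone}", zone]


class FakePowerDNS:
    """In-memory stand-in for the zones endpoint, for illustration only."""

    def __init__(self) -> None:
        self.zones = set()

    def create(self, zone: str) -> int:
        if zone in self.zones:
            return 409
        self.zones.add(zone)
        return 201


if __name__ == "__main__":
    server = FakePowerDNS()
    parents = ["omani.works", "omani.trade"]
    print(bootstrap_zones(parents, server.create))   # first install
    print(bootstrap_zones(parents, server.create))   # re-run on upgrade
    print([wildcard_dns_names(z) for z in parents])
```

Running the bootstrap twice against the same server yields "created" and then "exists" for every zone, so a post-upgrade hook run completes without error.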
||
|
|
684759564e
|
fix(powerdns+catalyst-api): zero-touch contabo PowerDNS API key for Sovereign cert-manager (PR #681 followup) (#686)
* fix(cilium-gateway): listener ports 80/443 → 30080/30443 + LB retarget
cilium-envoy refuses to bind privileged ports (80/443) on Sovereigns even
with all of:
- gatewayAPI.hostNetwork.enabled=true on the Cilium chart
- securityContext.privileged=true on the cilium-envoy DaemonSet
- securityContext.capabilities.add=[NET_BIND_SERVICE]
- envoy-keep-cap-netbindservice=true in cilium-config ConfigMap
- Gateway API CRDs at v1.3.0 (matching cilium 1.19.3 schema)
Repeatable error from cilium-envoy logs across otech45, otech46, otech47:
listener 'kube-system/cilium-gateway-cilium-gateway/listener' failed to
bind or apply socket options: cannot bind '0.0.0.0:80': Permission denied
The bind() syscall is intercepted by cilium-agent's BPF socket-LB program
in a way that does not honour container capabilities. Even PID 1 with
CapEff=0x000001ffffffffff (all caps) and uid=0 gets "Permission denied".
Cilium 1.19.3 → 1.16.5 made no difference (F1, PR #684 still ships; the
version bump is sound for other reasons, and the listener bind is just a
separate fix).
This commit moves the listeners to high ports (30080/30443) and lets the
Hetzner LB do the public-facing port translation:
HCLB :80 → CP node :30080 (cilium-gateway HTTP listener)
HCLB :443 → CP node :30443 (cilium-gateway HTTPS listener)
External users still hit `https://console.<sov>.omani.works/auth/handover`
on port 443; the high port is invisible. High-port bind succeeds without
NET_BIND_SERVICE because the kernel only gates ports below
`net.ipv4.ip_unprivileged_port_start` (default 1024).
Will be verified on otech48: the next fresh provision should serve
console.otech48/auth/handover end-to-end without the 502/timeout chain
seen on otech45–47.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(powerdns+catalyst-api): zero-touch contabo PowerDNS API key for Sovereign cert-manager
PR #681 followup. The new bp-cert-manager-powerdns-webhook (PR #681) calls
contabo's authoritative PowerDNS at pdns.openova.io to write DNS-01
challenge TXT records for *.otech<N>.omani.works. That webhook needs an
X-API-Key Secret in the Sovereign's cert-manager namespace. PR #681 didn't
ship the materialization seam, so on otech43..otech47 the Secret was
missing and the wildcard cert never issued.
This commit closes the seam from contabo to the Sovereign:
1. bp-powerdns chart 1.1.7 to 1.1.8: Reflector annotations on
openova-system/powerdns-api-credentials extended from "external-dns" to
"external-dns,catalyst" so contabo catalyst-api can mount the API key.
2. bp-powerdns: api.basicAuth.enabled flips default true to false. Layered
Traefik basicAuth + PowerDNS X-API-Key was double auth that blocked
machine-to-machine API access from Sovereigns. The X-API-Key contract is
unchanged.
3. bp-catalyst-platform 1.2.3 to 1.2.4: api-deployment.yaml adds
CATALYST_POWERDNS_API_KEY env from powerdns-api-credentials/api-key secret
(optional=true so Sovereign-side catalyst-api Pods that don't reflect this
still start clean).
4. catalyst-api provisioner.go: new Provisioner.PowerDNSAPIKey field reads
from CATALYST_POWERDNS_API_KEY env at New(). Stamps onto every Request
before Validate(). Forwards as tofu var powerdns_api_key.
5. infra/hetzner/variables.tf: new var.powerdns_api_key (sensitive,
default "").
6. infra/hetzner/cloudinit-control-plane.tftpl: replaces the defunct
dynadot-api-credentials Secret block (PR #681 dropped
bp-cert-manager-dynadot-webhook) with a new
cert-manager/powerdns-api-credentials Secret block. runcmd applies it
BEFORE Flux reconciles bp-cert-manager-powerdns-webhook.
End-to-end seam mirrors PR #543 ghcr-pull and PR #680 harbor-robot-token.
Will be verified live on otech48 (next provision after this lands).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
--------- Co-authored-by: hatiyildiz <hatice@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
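The port translation in the cilium-gateway fix above reduces to a small mapping plus the kernel's privileged-port rule. A toy Python sketch; the names are hypothetical and the threshold is the stated default of `net.ipv4.ip_unprivileged_port_start`:

```python
UNPRIVILEGED_PORT_START = 1024   # default net.ipv4.ip_unprivileged_port_start

# Public load-balancer port -> listener port on the control-plane node.
LB_TO_NODE_PORT = {80: 30080, 443: 30443}


def needs_net_bind_service(port: int, start: int = UNPRIVILEGED_PORT_START) -> bool:
    """True when binding `port` is gated by the privileged-port check."""
    return port < start


if __name__ == "__main__":
    for public, node in LB_TO_NODE_PORT.items():
        print(public, needs_net_bind_service(public),
              "->", node, needs_net_bind_service(node))
```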
||
|
|
8bb66fe43e
|
fix(bp-{harbor,gitea,powerdns}): bp-cnpg dependsOn + Reflector auto-enabled (#644)
* fix(infra): break tofu cycle, resolve CP public IP at boot via metadata service
PR #546 (Closes #542) introduced a dependency cycle:
hcloud_server.control_plane.user_data → local.control_plane_cloud_init
local.control_plane_cloud_init → hcloud_server.control_plane[0].ipv4_address
`tofu plan` failed with:
Error: Cycle: local.control_plane_cloud_init (expand), hcloud_server.control_plane
Caught live during otech23 first-end-to-end provisioning attempt.
Fix: stop templating `control_plane_ipv4` at plan time. cloud-init runs ON
the CP node, so it resolves its own public IPv4 at boot via Hetzner's
metadata service:
curl http://169.254.169.254/hetzner/v1/metadata/public-ipv4
Same observable behavior as #546 (kubeconfig server: rewritten to CP
public IP, not LB IP; preserves the wizard-jobs-page-not-stuck-PENDING
fix), with no graph cycle.
Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
* fix(infra+api): wire handover_jwt_public_key end-to-end
The OpenTofu cloud-init template references ${handover_jwt_public_key}
(infra/hetzner/cloudinit-control-plane.tftpl:371) and variables.tf
declares the variable, but neither side wires it:
- main.tf templatefile() call did not pass the key → "vars map does not
contain key handover_jwt_public_key" on tofu plan
- provisioner.writeTfvars never set the var → empty even when wired
Caught live during otech23 provisioning, immediately after the tofu-cycle
fix landed. tofu plan failed with:
Error: Invalid function argument
on main.tf line 170, in locals:
170: control_plane_cloud_init = replace(templatefile(...
Invalid value for "vars" parameter: vars map does not contain key
"handover_jwt_public_key", referenced at
./cloudinit-control-plane.tftpl:371,9-32.
Fix:
- main.tf templatefile() now passes handover_jwt_public_key =
var.handover_jwt_public_key
- provisioner.Request gains a HandoverJWTPublicKey field (json:"-",
server-stamped, never accepted from client JSON)
- handler.CreateDeployment stamps it from h.handoverSigner.PublicJWK()
when the signer is configured (CATALYST_HANDOVER_KEY_PATH set)
- writeTfvars emits the value into tofu.auto.tfvars.json
variables.tf default "" preserves the no-signer path: cloud-init writes an
empty handover-jwt-public.jwk and the new Sovereign is provisioned without
the handover-validation surface (handover flow simply not wired on that
Sovereign; degraded gracefully, not a hard failure).
Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
* fix(api): cloud-init kubeconfig postback must live outside RequireSession
The PUT /api/v1/deployments/{id}/kubeconfig route was registered inside
the RequireSession-gated chi.Group, so every cloud-init postback was
rejected with HTTP 401 {"error":"unauthenticated"} before PutKubeconfig
could run. Cloud-init has no browser session cookie; it authenticates with
the SHA-256-hashed bearer token PutKubeconfig already verifies internally.
Result on otech23: Phase 0 finished (Hetzner CP + LB up), but every
cloud-init `curl --retry 60 -X PUT ... /kubeconfig` returned 401 unauth.
catalyst-api never received the kubeconfig, Phase 1 helmwatch never
started, the wizard's Jobs page stayed in PENDING forever.
Fix: register the PUT outside the auth group so cloud-init's bearer-hash
auth path is the only gate. The matching GET stays inside session auth
because the operator's "Download kubeconfig" button needs the session
cookie.
Caught live during otech23 first end-to-end provisioning. Per the new
"punish-back-to-zero" rule, otech23 was wiped (Hetzner + PDM + PowerDNS +
on-disk state) and the next provision will use otech24.
Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
* fix(catalyst-api): wire harbor_robot_token through to tofu, never pull from docker.io
PR #557 added the registries.yaml mirror in cloudinit-control-plane.tftpl
and declared var.harbor_robot_token in infra/hetzner/variables.tf with a
default of "". The catalyst-api side never set it, so every Sovereign so
far provisioned with an empty token in registries.yaml: containerd's auth
to harbor.openova.io's proxy projects failed silently and pulls fell
through to docker.io. On a fresh Hetzner IP, Docker Hub returns rate-limit
HTML and:
Failed to pull image "rancher/mirrored-pause:3.6": unexpected media type
text/html for sha256:...
cilium / coredns / local-path-provisioner sit at Init:0/6 forever; Flux
pods stay Pending; no HelmReleases ever land; the wizard's job stream
shows everything PENDING because there's nothing to watch. Caught live
during otech24.
Wiring (mirrors the GHCRPullToken pattern):
1. Provisioner.HarborRobotToken: read from CATALYST_HARBOR_ROBOT_TOKEN env
at New().
2. Stamped onto every Request in Provision() and Destroy() before
writeTfvars.
3. Request.HarborRobotToken: server-stamped (json:"-"); never accepted
from the wizard payload.
4. writeTfvars emits "harbor_robot_token" into tofu.auto.tfvars.json.
5. api-deployment.yaml mounts the catalyst/harbor-robot-token Secret
(mirrored from openova-harbor; Reflector-managed on Sovereign clusters,
copied per-namespace on Catalyst-Zero contabo) as
CATALYST_HARBOR_ROBOT_TOKEN, optional=true so degraded paths still come
up.
variables.tf default "" preserves graceful fall-through if the operator
hasn't issued a robot token yet, and the architecture rule is now enforced
end-to-end: every image on every Sovereign goes through harbor.openova.io.
Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
* fix(handler): stamp CATALYST_HARBOR_ROBOT_TOKEN before Validate() (#638 follow-up)
PR #638 added Validate() rejection for missing harbor_robot_token, but the
handler only stamped req.HarborRobotToken from p.HarborRobotToken inside
Provision(), and Validate() runs in the handler BEFORE Provision() gets
the chance to stamp. Result: every wizard launch returned
Provisioning rejected: Harbor robot token is required
(CATALYST_HARBOR_ROBOT_TOKEN missing)
even though the env var is set on the Pod. Caught immediately on the
otech25 launch attempt.
Fix: same env-stamp pattern as GHCRPullToken at the top of the
CreateDeployment handler. Provisioner-level stamp in Provision() stays as
defense-in-depth.
Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
* fix(infra): registries.yaml needs rewrite; Harbor proxy URL is /v2/<proj>/<repo>, not /<proj>/v2/<repo>
PR #557 wrote registries.yaml with mirror endpoints like
https://harbor.openova.io/proxy-dockerhub
hoping containerd would build URLs like
https://harbor.openova.io/proxy-dockerhub/v2/rancher/mirrored-pause/manifests/3.6
But Harbor proxy-cache projects expose their API at
https://harbor.openova.io/v2/proxy-dockerhub/rancher/mirrored-pause/manifests/3.6
(the project name comes after /v2/ and directly before the image path, not
as a path prefix ahead of /v2/). Harbor returns its SPA UI HTML (status
200, content-type text/html) for the wrong shape; containerd then errors
with:
"unexpected media type text/html for sha256:... not found"
and pause-image / cilium / coredns pulls fail forever. Caught live during
otech24 and otech25.
Fix: switch to k3s registries.yaml `rewrite` syntax. Endpoint is the bare
Harbor host; per-mirror rewrite re-maps the image path so containerd's
final URL is correctly project-prefixed.
Verified manually:
curl https://harbor.openova.io/v2/proxy-dockerhub/rancher/mirrored-pause/manifests/3.6
-> 200 application/vnd.docker.distribution.manifest.list.v2+json
This unblocks every Sovereign image pull through the central Harbor.
Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
* fix(bp-vpa): drop registry.k8s.io/ prefix from repository; upstream chart prepends it
cowboysysop/vertical-pod-autoscaler subchart prepends `.image.registry`
(default registry.k8s.io) to `.image.repository`. Catalyst's bp-vpa
overrode `repository: registry.k8s.io/autoscaling/vpa-...` so the rendered
image was `registry.k8s.io/registry.k8s.io/autoscaling/vpa-...:1.5.0`:
doubled prefix, image-not-found, ImagePullBackOff on every fresh
Sovereign. Caught live during otech26.
Fix: drop the redundant prefix. Subchart's default `.image.registry` keeps
it pointing at registry.k8s.io which the new Sovereign's containerd routes
through harbor.openova.io/v2/proxy-k8s/... via registries.yaml rewrite
(#640).
Bumps bp-vpa chart version to 1.0.1 and bootstrap-kit reference to match.
Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
* fix(wizard): SOLO default SKU CPX32 → CPX42; 35-component bootstrap-kit needs 8 vCPU / 16 GB
CPX32 (4 vCPU / 8 GB) cannot fit the full SOLO bootstrap-kit on a single
node. Caught live during otech26: 38 pods Running, 34 pods stuck Pending
indefinitely with "Insufficient cpu". Cilium + Crossplane + Flux +
cert-manager + CNPG + Keycloak + OpenBao + Harbor + Gitea + Mimir + Loki +
Tempo + … each request 50-500m vCPU and the node hits 100% allocatable
before half the workloads schedule.
CPX42 (8 vCPU / 16 GB / 320 GB SSD) at €25.49/mo is the smallest size that
fits the bootstrap-kit with VPA-recommendation headroom. Operators can
still pick CPX32 explicitly if they trim the component set on
StepComponents, but the default SOLO path now provisions a node that
actually boots into a steady state.
Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
* fix(bp-cert-manager-dynadot-webhook): pin SHA tag + add ghcr-pull imagePullSecret (chart 1.1.2)
- Replace forbidden `:latest` tag with current short-SHA `942be6f` per
docs/INVIOLABLE-PRINCIPLES.md #4.
- Add default `webhook.imagePullSecrets: [{name: ghcr-pull}]` so kubelet
authenticates against private ghcr.io/openova-io/openova/* via the
Reflector-mirrored `ghcr-pull` Secret in cert-manager namespace. Without
this, the webhook Pod was stuck ErrImagePull/ImagePullBackOff on every
Sovereign. Caught live during otech27.
- Bumps chart version 1.1.1 -> 1.1.2 and bootstrap-kit reference.
Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
* fix(bp-{harbor,gitea,powerdns}): add bp-cnpg dependency + Reflector auto-enabled
Two related Phase-8a stragglers diagnosed live during otech28:
1. bp-powerdns missed bp-cnpg in dependsOn. Helm renders BEFORE
postgresql.cnpg.io/v1 CRD is registered → templates/cnpg-cluster.yaml
`Capabilities.APIVersions.Has` gate evaluates false → no Cluster CR → no
pdns-pg-app Secret → powerdns Pods stuck CreateContainerConfigError
forever ("secret pdns-pg-app not found"). Adds explicit dependsOn.
2. bp-harbor/gitea/powerdns CNPG inheritedMetadata only set
reflection-allowed; missing reflection-auto-enabled. Reflector races when
destination Secret (harbor-database-secret) is created BEFORE CNPG
provisions the source (harbor-pg-app). Reflector logs "Source could not be
found" once and never retries, leaving harbor-core stuck
CreateContainerConfigError. Adding auto-enabled makes Reflector actively
watch the source and re-fire when it appears.
Bumps:
bp-harbor 1.2.8 -> 1.2.9
bp-gitea 1.2.1 -> 1.2.2
bp-powerdns 1.1.5 -> 1.1.7 (skips 1.1.6 which was a non-released bump)
Bootstrap-kit references updated to pull the new chart versions on the
next Sovereign provisioning.
Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
--------- Co-authored-by: hatiyildiz <hatiyildiz@openova.io> |
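Two of the fixes in this squash are URL and image-reference shape bugs, which are easy to see side by side. The Python sketch below is illustrative only: the function names are hypothetical, and `autoscaling/vpa-component` is a placeholder because the commit message elides the real repository name.

```python
def harbor_manifest_url(host: str, project: str, repo: str, ref: str) -> str:
    """Shape Harbor proxy-cache projects actually serve: /v2/<proj>/<repo>."""
    return f"https://{host}/v2/{project}/{repo}/manifests/{ref}"


def prefixed_manifest_url(host: str, project: str, repo: str, ref: str) -> str:
    """Shape the earlier mirror endpoint produced: /<proj>/v2/<repo>."""
    return f"https://{host}/{project}/v2/{repo}/manifests/{ref}"


def render_image(registry: str, repository: str, tag: str) -> str:
    """Join registry and repository the way the subchart is described to."""
    return f"{registry}/{repository}:{tag}"


if __name__ == "__main__":
    args = ("harbor.openova.io", "proxy-dockerhub", "rancher/mirrored-pause", "3.6")
    print(harbor_manifest_url(*args))     # served by Harbor
    print(prefixed_manifest_url(*args))   # answered with the SPA HTML page

    # Placeholder repository name; shows the doubled registry prefix.
    print(render_image("registry.k8s.io",
                       "registry.k8s.io/autoscaling/vpa-component", "1.5.0"))
    print(render_image("registry.k8s.io", "autoscaling/vpa-component", "1.5.0"))
```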
||
|
|
5a403e66b1
|
fix(tls): DNS-01 wildcard TLS chain — solverName pdns, NodePort 30053, dynadot test fix (#582)
* fix(bp-harbor): CNPG database must be 'registry' not 'harbor' — matches coreDatabase
Harbor upstream always connects to a database named 'registry'
(harbor.database.external.coreDatabase default). The CNPG Cluster was
initialised with database='harbor', causing:
FATAL: database "registry" does not exist (SQLSTATE 3D000)
Fix: change postgres.cluster.database default from 'harbor' → 'registry'
in values.yaml and cnpg-cluster.yaml template. Both the CNPG bootstrap
and Harbor's coreDatabase now use 'registry'.
Runtime fix on otech22: CREATE DATABASE registry OWNER harbor was run
against harbor-pg-1. harbor-core is now 1/1 Running.
Bump bp-harbor 1.2.1 → 1.2.2. Bootstrap-kit refs updated.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(tls): DNS-01 wildcard TLS chain — solverName, NodePort 30053, dynadot test fix
Five independent fixes that together complete the DNS-01 wildcard TLS chain
for per-Sovereign certificate autonomy:
1. cert-manager-powerdns-webhook solverName mismatch (root cause of #550 echo):
- values.yaml: `webhook.solverName: powerdns` → `pdns`
- The zachomedia binary's Name() returns "pdns" (hardcoded). cert-manager
calls POST /apis/<groupName>/v1alpha1/<solverName>; when solverName is
"powerdns" cert-manager gets 404 → "server could not find the resource".
2. cert-manager-dynadot-webhook solver_test.go mock format:
- writeOK() and error injection used old ResponseHeader-wrapped format
- Real api3.json returns ResponseCode/Status directly in SetDnsResponse
- This caused the image build to fail at
|
||
|
|
ad9cfc0f23
|
feat(platform): add global.imageRegistry to bp-openbao/external-secrets/cnpg/valkey/nats-jetstream/powerdns/gitea (PR 2/3, #560) (#565)
Charts with template image refs (fully rewritten when registry set):
- bp-openbao 1.2.4→1.2.5: init-job.yaml + auth-bootstrap-job.yaml.
Catalyst job images now prefixed with global.imageRegistry when
non-empty. Default (empty) renders identical manifests.
- bp-powerdns 1.1.5→1.1.6: dnsdist.yaml Catalyst companion image prefixed
with global.imageRegistry when non-empty. Verified: dnsdist image
rewrites to harbor.openova.io/docker.io/powerdns/dnsdist-19:1.9.14.
Subchart-only charts (global.imageRegistry stub added; threading via
per-component subchart values.yaml keys documented in comments):
- bp-external-secrets 1.1.0→1.1.1
- bp-cnpg 1.0.0→1.0.1 (charts/ missing = pre-existing state, not this PR)
- bp-valkey 1.0.0→1.0.1 (charts/ missing = pre-existing state, not this PR)
- bp-nats-jetstream 1.1.1→1.1.2
- bp-gitea 1.1.2→1.1.3: upstream chart exposes gitea.image.registry for wiring
vcluster: N/A (no chart directory under platform/vcluster/chart/)
Co-authored-by: alierenbaysal <alierenbaysal@openova.io>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com> |
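The registry-prefix behaviour described in this commit can be sketched as below. This is a minimal model, assuming the charts simply prepend `global.imageRegistry` and a `/` to the full image reference when the value is non-empty. The function name `prefixed_image` is hypothetical; the actual Helm template code is not shown in the commit message.

```python
def prefixed_image(image: str, image_registry: str = "") -> str:
    """Return the image reference, prefixed with a mirror registry if set.

    Hypothetical helper that models the described behaviour only; it is
    not the chart's template code.
    """
    if not image_registry:
        # Empty default: the rendered reference is unchanged.
        return image
    # Tolerate a trailing slash on the registry value.
    return f"{image_registry.rstrip('/')}/{image}"


dnsdist = "docker.io/powerdns/dnsdist-19:1.9.14"
print(prefixed_image(dnsdist))
print(prefixed_image(dnsdist, "harbor.openova.io"))
```

With the registry set to `harbor.openova.io`, the second call yields the same rewritten reference the commit reports as verified.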
||
|
|
7d264d9647
|
fix(bp-powerdns): default cluster.namespace=powerdns not openova-system (Closes #553) (#556)
bp-powerdns HelmRelease upgrade fails on Sovereigns with: failed to create resource: namespaces "openova-system" not found The chart's CNPG Cluster CR template targets postgres.cluster.namespace which defaulted to openova-system (a contabo-only legacy ns). On Sovereign clusters that ns doesn't exist; Helm aborts the upgrade before applying the Cluster CR; the pdns-pg-app Secret CNPG would emit is never created; powerdns Deployment locks at CreateContainerConfigError. Default to powerdns (chart targetNamespace per bootstrap-kit overlay). Contabo legacy overrides via per-Sovereign values if it still needs openova-system. Bump bp-powerdns 1.1.4 -> 1.1.5 across template + omantel + otech overlays. Co-authored-by: hatiyildiz <hatiyildiz@openova.io> |
||
|
|
902d857702
|
fix(bp-powerdns): reflect powerdns-api-credentials to external-dns namespace (Closes #544) (#552)
Add reflector.v1.k8s.emberstack.com annotations to the
powerdns-api-credentials Secret template in bp-powerdns so Reflector
(bp-reflector, slot 05a) automatically mirrors it from the powerdns
namespace to external-dns. Bump chart version 1.1.3 → 1.1.4.
Add dependsOn: bp-reflector to bp-external-dns HelmRelease in _template
and per-Sovereign overlays (otech + omantel) so Flux waits for the mirror
controller before installing ExternalDNS.
Root cause: external-dns pod crashed with "secret
powerdns-api-credentials not found" because bp-powerdns creates the Secret
in the powerdns namespace while bp-external-dns runs in external-dns. No
cross-namespace propagation existed.
Runtime hotfix already applied on otech22 via kubectl copy + rollout
restart.
Co-authored-by: alierenbaysal <alierenbaysal@openova.io>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com> |
||
|
|
abf01b6f21
|
feat(platform): Gateway API migration audit (#387) (#401)
Migrates every minimal-Sovereign-set blueprint chart from
networking.k8s.io/v1.Ingress to gateway.networking.k8s.io/v1.HTTPRoute,
replacing the legacy Traefik-on-Sovereigns assumption with the canonical
Cilium + Envoy + Gateway API path per ADR-0001 §9.4 and the WBS §2
correction note (#388).
The single per-Sovereign Gateway is added as additional documents in the
existing bootstrap-kit slot clusters/_template/bootstrap-kit/01-cilium.yaml
(NOT a new top-level slot), since Cilium owns the GatewayClass. It includes:
- Certificate `sovereign-wildcard-tls` requesting `*.${SOVEREIGN_FQDN}`
from `letsencrypt-dns01-prod` (cert-manager + #373 webhook)
- Gateway `cilium-gateway` in `kube-system` with HTTPS (443, TLS
terminate) + HTTP (80) listeners, allowedRoutes.namespaces.from=All
Per-blueprint HTTPRoute templates (canonical seam: each wrapper chart's
existing `templates/` directory):
| Blueprint            | Host pattern               | Backend port |
|----------------------|----------------------------|--------------|
| bp-keycloak          | auth.<sov>                 | 80           |
| bp-gitea             | git.<sov>                  | 3000         |
| bp-openbao           | bao.<sov>                  | 8200         |
| bp-grafana           | grafana.<sov>              | 80           |
| bp-harbor            | registry.<sov>             | 80           |
| bp-powerdns          | pdns.<sov>/api (dual-mode) | 8081         |
| bp-catalyst-platform | console.<sov>, api.<sov>   | 80, 8080     |
bp-powerdns supports both Ingress (contabo legacy) and HTTPRoute
(Sovereign) simultaneously: the per-Sovereign overlay sets
`api.gateway.enabled=true` while leaving `api.enabled=true`. The Ingress
object is harmless on Cilium clusters with no Traefik. This preserves
contabo's existing pdns.openova.io flow per ADR-0001 §9.4.
bp-harbor flips `expose.type` from `ingress` to `clusterIP` in
platform/harbor/chart/values.yaml so the upstream chart no longer emits its
own Ingress; the HTTPRoute is the sole HTTP exposure. TLS terminates at the
Gateway (wildcard cert) rather than per-host Certificates inside the chart.
bp-catalyst-platform's `templates/httproute.yaml` is NOT excluded by
.helmignore (unlike templates/ingress.yaml +
templates/ingress-console-tls.yaml, which remain contabo-only legacy demo
infra). The contabo path keeps serving console.openova.io/sovereign via
Traefik unchanged.
Bootstrap-kit slot updates (per-Sovereign hostname interpolation):
- 08-openbao.yaml → gateway.host: bao.${SOVEREIGN_FQDN}
- 09-keycloak.yaml → gateway.host: auth.${SOVEREIGN_FQDN}
- 10-gitea.yaml → gateway.host: gitea.${SOVEREIGN_FQDN}
- 11-powerdns.yaml → api.host: pdns.${SOVEREIGN_FQDN}, api.gateway.enabled: true
- 19-harbor.yaml → gateway.host: registry.${SOVEREIGN_FQDN}
- 25-grafana.yaml → gateway.host: grafana.${SOVEREIGN_FQDN}
Server-side dry-run validation against the live Cilium Gateway API CRDs on
contabo: every HTTPRoute and the per-Sovereign Gateway + Certificate apply
cleanly via `kubectl apply --dry-run=server`.
Contabo unaffected: clusters/contabo-mkt/* not modified. The legacy SME
ingresses (console-nova, marketplace, admin, axon, talentmesh, stalwart,
...) continue to serve via Traefik as before. powerdns on contabo remains
on the Ingress path (api.gateway.enabled defaults to false at the chart
level).
Closes #387.
Co-authored-by: hatiyildiz <269457768+hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
726af6df81
|
fix(bp-powerdns): self-generate api-credentials Secret + disable upstream zone-bootstrap Job (#248)
Root cause investigation on otech.omani.works (kubectl, sanitized):
$ kubectl get pods -n powerdns
create-zone-if-not-exist-sh-tjtr4 0/1 CreateContainerConfigError 4h
powerdns-57d7d49f99-{9hrb4,lxlgt,nkmht} 0/1 CreateContainerConfigError 4h
dnsdist-594dbfc5f-wznsw 1/1 Running 4h
$ kubectl get secrets -n powerdns
powerdns Opaque 1 4h
powerdns-api-tls-8kxpx Opaque 1 4h (NO `powerdns-api-credentials`, NO `pdns-pg-app`)
$ kubectl describe pod ... powerdns-57d7d49f99-9hrb4
Environment:
PDNS_API_KEY: <set to the key 'api-key' in secret 'powerdns-api-credentials'> Optional: false
PDNS_DB_HOST: <set to the key 'host' in secret 'pdns-pg-app'> Optional: false
State: Waiting Reason: CreateContainerConfigError
The handover's chicken-egg-with-secret theory was directionally right but
the cause was more fundamental:
1. Wrapper chart's api-credentials-secret.yaml (1.1.2) was a no-op
unless operator set `apiKey` value out-of-band — comment said the
deployment would "fail to start until the named Secret exists" as
"the explicit signal we want". On a Sovereign that bootstraps from
bp-* OCI artifacts, no operator is standing by, so the Secret is
never created and pods sit in CreateContainerConfigError forever.
2. The upstream chart's `create-zone-if-not-exist-sh` Job is rendered
whenever both `zoneName` and `api.key` are set. Because upstream
defaults `zoneName: "example.de."`, it ALWAYS rendered and ALWAYS failed
(same missing Secret). Catalyst doesn't want this Job at all
because zones are loaded later by pool-domain-manager (PDM).
3. The chart's CNPG Cluster template is gated behind
Capabilities.APIVersions.Has "postgresql.cnpg.io/v1" — on a fresh
Sovereign without bp-cnpg yet (bp-cnpg is on the roadmap, not in
bootstrap-kit), no Cluster is rendered and `pdns-pg-app` Secret
never materialises. With Helm `--wait`, install times out
("context deadline exceeded") even though the manifests applied
cleanly.
Fix:
* api-credentials-secret.yaml: self-generate via Helm `lookup` +
`randAlphaNum 32`. First install creates fresh randoms; every
subsequent reconcile reads back the existing values from the
Secret so the API key never rotates on upgrade. Operator can
still pin specific values via .Values.powerdns.apiKey /
.Values.powerdns.webserverPassword, or skip Secret creation
entirely via .Values.powerdns.useExistingApiSecret. Same pattern
as bitnami/postgresql, bitnami/keycloak.
* values.yaml: set `powerdns.zoneName: ""` so upstream chart's
`{{- if and .Values.powerdns.zoneName .Values.powerdns.api.key }}`
gate skips the create-zone Job entirely. Catalyst's PDM creates
zones via the REST API after the cluster comes up; we don't want
a placeholder `example.de.` zone in production.
* HelmRelease (both _template and otech.omani.works overlays):
`install.disableWait: true` + `upgrade.disableWait: true` so the
HelmRelease reports Ready as soon as manifests apply cleanly,
rather than gating on powerdns Deployment readiness which depends
on bp-cnpg landing first to synthesise `pdns-pg-app`. Runtime
convergence is observed via kubectl, not gated on Helm.
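The generate-once, read-back behaviour of the self-generated Secret can be modelled as follows. This is a sketch under assumptions: `rand_alpha_num` stands in for Helm's `randAlphaNum`, the `existing` argument stands in for the result of `lookup`, and the precedence shown (operator-pinned value, then existing Secret, then fresh random) is assumed rather than confirmed by the commit. Base64 encoding of Secret data is ignored for brevity, and all function names are hypothetical.

```python
import secrets
import string
from typing import Dict, Optional

ALPHANUM = string.ascii_letters + string.digits


def rand_alpha_num(length: int) -> str:
    """Random alphanumeric string, standing in for Helm's randAlphaNum."""
    return "".join(secrets.choice(ALPHANUM) for _ in range(length))


def render_api_credentials(
    existing: Optional[Dict[str, str]],
    pinned_api_key: str = "",
    pinned_webserver_password: str = "",
) -> Dict[str, str]:
    """Model of the Secret data produced on each render.

    `existing` is the current Secret's data, or None on first install.
    Assumed precedence: pinned value > existing value > new random value.
    """
    existing = existing or {}
    return {
        "api-key": pinned_api_key or existing.get("api-key") or rand_alpha_num(32),
        "webserver-password": (
            pinned_webserver_password
            or existing.get("webserver-password")
            or rand_alpha_num(32)
        ),
    }


first_install = render_api_credentials(existing=None)
upgrade = render_api_credentials(existing=first_install)
print(upgrade == first_install)  # the key does not rotate on upgrade
```

Reading back the existing values is what keeps clients such as ExternalDNS working across reconciles; generating a fresh random on every render would rotate the API key on each upgrade.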
Live error this addresses:
Helm upgrade failed for release powerdns/powerdns with chart
bp-powerdns@1.1.2: context deadline exceeded
Verified locally with `helm template`:
- powerdns-api-credentials Secret renders with random api-key + webserver-password
- create-zone-if-not-exist-sh Job no longer rendered
- Deployment env continues to reference powerdns-api-credentials correctly
Bumped 1.1.2 -> 1.1.3 (chart, blueprint, both bootstrap-kit overlays).
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
bcd2e7980a
|
fix: hide CRD-emitting resources behind Capabilities gates (closes #190) (#200)
* fix(bp-external-dns): hide CRD-emitting resources behind Capabilities gates (refs #190) Wrap the Catalyst overlay's ServiceMonitor and ExternalSecret templates in `.Capabilities.APIVersions.Has` checks so a cold install on a fresh Sovereign — where bp-kube-prometheus-stack and bp-external-secrets have not yet reconciled — no longer fails with `no matches for kind X in version Y`. The values toggles (`externalDns.serviceMonitor.enabled`, `externalDns.externalSecret.enabled`) remain — Capabilities is defense in depth so an operator flipping the toggle on a Sovereign that hasn't reached Phase 2 doesn't break the bp-external-dns reconcile. Verified locally: `helm template` with toggles off renders 0 of these resources; with toggles ON and `--api-versions monitoring.coreos.com/v1 --api-versions external-secrets.io/v1beta1` both render exactly once. Bump version 1.1.0 → 1.1.2 to align with the Phase-1 architectural-fix wave from issue #190. * fix(bp-powerdns): hide CRD-emitting resources behind Capabilities gates (refs #190) Three Catalyst overlay templates emit resources whose CRDs ship in OTHER charts and were unconditionally rendered, causing a cold install of bp-powerdns to fail with `no matches for kind X` on a Sovereign that hasn't yet reconciled the upstream chart: - cnpg-cluster.yaml → postgresql.cnpg.io/v1 Cluster (CRD ships in bp-cnpg) - api-ingress.yaml → traefik.io/v1alpha1 Middleware (CRD ships with the Traefik controller; k3s ships it by default but a Sovereign overlay MAY disable Traefik in favour of cilium-only ingress) - crossplane-floatingip.yaml → compose.openova.io/v1alpha1 HetznerFloatingIP (CRD ships when the Catalyst Crossplane composition family lands — see GAP DISCLOSURE in that template) Each is wrapped in `.Capabilities.APIVersions.Has "<group>/<version>"`. The Traefik router-middleware annotation on the Ingress is similarly gated so the auth posture cleanly moves to the Sovereign's chosen ingress controller when Traefik is absent. 
Verified locally: `helm template` with default values renders 0 of these resources; with `--api-versions postgresql.cnpg.io/v1 --api-versions traefik.io/v1alpha1 --api-versions compose.openova.io/v1alpha1` plus `--set crossplane.floatingIP.enabled=true`, all three render exactly once. Existing tests/observability-toggle.sh still passes. Bump version 1.1.1 → 1.1.2. * fix(bp-powerdns): bump blueprint.yaml to match Chart.yaml 1.1.2 after Capabilities gate work --------- Co-authored-by: hatiyildiz <hatice.yildiz@openova.io> |
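The effect of the Capabilities gates can be illustrated with a small model. This is a sketch only: it treats each overlay template as a record keyed by its `apiVersion` and keeps it when the cluster advertises that group/version, which is the role `.Capabilities.APIVersions.Has` plays in the templates. The `render_gated` function is hypothetical, and the sketch ignores the values toggles (such as `crossplane.floatingIP.enabled`) that the real templates also check.

```python
from typing import Dict, Iterable, List

# The three gated bp-powerdns overlay resources named in the commit.
OVERLAY: List[Dict[str, str]] = [
    {"apiVersion": "postgresql.cnpg.io/v1", "kind": "Cluster"},
    {"apiVersion": "traefik.io/v1alpha1", "kind": "Middleware"},
    {"apiVersion": "compose.openova.io/v1alpha1", "kind": "HetznerFloatingIP"},
]


def render_gated(
    resources: List[Dict[str, str]], api_versions: Iterable[str]
) -> List[Dict[str, str]]:
    """Keep only resources whose group/version the cluster advertises."""
    available = set(api_versions)
    return [r for r in resources if r["apiVersion"] in available]


# Fresh Sovereign: none of the CRDs are registered, so nothing renders.
print(render_gated(OVERLAY, []))

# After bp-cnpg reconciles, only the CNPG Cluster renders.
print([r["kind"] for r in render_gated(OVERLAY, ["postgresql.cnpg.io/v1"])])
```

The trade-off is visible in the model: a gated resource is skipped silently, so the chart installs cleanly but the dependent Secret never appears unless something re-renders the chart after the CRD lands.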
||
|
|
1f5c76def1
|
fix(platform): sync blueprint.yaml versions with Chart.yaml (#199)
* feat(ui): Playwright cosmetic + step-flow regression guards
15 regression guards in
products/catalyst/bootstrap/ui/e2e/cosmetic-guards.spec.ts that fail HARD
when each user-flagged defect class returns:
1. card height drift from canonical 108px
2. reserved right padding eating description width
3. logo tile drift from per-brand LOGO_SURFACE
4. invisible glyph (white-on-white) via luminance proxy
5. wizard step order Org/Topology/Provider/Credentials/Components/
Domain/Review
6. legacy "Choose Your Stack" / "Always Included" tab labels
7. Domain step reachable before Components
8. CPX32 not the recommended Hetzner SKU
9. per-region SKU dropdown shows wrong provider catalog
10. provision page is .html (static) not SPA route
11. legacy bubble/edge DAG SVG markup on provision page
12. admin sidebar drift from canonical core/console (w-56 + 7 labels)
13. AppDetail uses tablist instead of sectioned layout
14. job rows navigate to /job/<id> instead of expand-in-place
15. Phase 0 banners (Hetzner infra / Cluster bootstrap) on AdminPage
Each test prints a failure message naming the canonical reference,
the source-of-truth file, and the data-testid PR needed (if any) so
the implementing agent has a precise target. No .skip() — per
INVIOLABLE-PRINCIPLES #2, missing components fail loud.
CI: .github/workflows/cosmetic-guards.yaml runs the suite on every
PR that touches products/catalyst/bootstrap/ui/** or core/console/**.
Docs: docs/UI-REGRESSION-GUARDS.md maps each test to the user's
original complaint, the canonical reference, and the green/red
semantics (5 tests intentionally RED on main today — they stay red
until the companion-agent's UI work lands).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(platform): sync blueprint.yaml versions with Chart.yaml so manifest-validation passes
---------
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
1ddd569789 |
fix(bp-*): observability toggles default false — break circular CRD dependency
Extends the v1.1.1 hardening that started with cilium / cert-manager /
crossplane to the remaining 8 bootstrap-kit + per-Sovereign Blueprints.
Every observability toggle in every Catalyst-curated Blueprint now ships
`false`/`null` by default; the operator opts in via a per-cluster values
overlay at clusters/<sovereign>/bootstrap-kit/* once
bp-kube-prometheus-stack reconciles.
Live failure mode that prompted this (omantel.omani.works 2026-04-29):
bp-cilium @ 1.1.0 defaulted hubble.relay/ui + prometheus.serviceMonitor
to true. The upstream Cilium 1.16.5 chart renders a
monitoring.coreos.com/v1 ServiceMonitor whose CRD ships with
kube-prometheus-stack — a tier-2 Application Blueprint that depends on
the bootstrap-kit (cilium first). Helm install fails on a fresh
Sovereign with "no matches for kind ServiceMonitor in version
monitoring.coreos.com/v1 — ensure CRDs are installed first" and every
downstream HelmRelease reports `dep is not ready`. The earlier
trustCRDsExist=true mitigation only suppresses Helm's render-time gate;
the apiserver still rejects the resource at install-time.
Per-Blueprint changes:
- bp-cilium: hubble.relay.enabled, hubble.ui.enabled → false;
hubble.metrics.enabled → null (this is the exact value that disables
the upstream metrics ServiceMonitor template branch, verified by
reading cilium 1.16.5's _hubble.tpl);
hubble.metrics.serviceMonitor.enabled → false.
tests/observability-toggle.sh extended with Case 4
(default render produces no hubble-relay / hubble-ui Deployments).
- bp-flux: flux2.prometheus.podMonitor.create → false.
- bp-sealed-secrets: sealed-secrets.metrics.serviceMonitor.enabled
→ false (explicit lock; upstream already defaults false).
- bp-spire: spire.global.spire.recommendations.enabled +
recommendations.prometheus → false.
- bp-nats-jetstream: nats.promExporter.enabled +
promExporter.podMonitor.enabled → false.
- bp-openbao: openbao.injector.metrics.enabled +
openbao.serviceMonitor.enabled → false.
- bp-keycloak: keycloak.metrics.enabled + metrics.serviceMonitor.enabled
+ metrics.prometheusRule.enabled → false.
- bp-gitea: gitea.gitea.metrics.* and gitea.postgresql.metrics.*
serviceMonitor + prometheusRule → false.
- bp-powerdns: powerdns.serviceMonitor.enabled + powerdns.metrics.enabled
→ false (forward-compatibility guard; current upstream
pschichtel/powerdns 0.10.0 has no ServiceMonitor template, but a future
upstream bump cannot silently regress).
Each chart ships a tests/observability-toggle.sh that asserts the rule
in three cases (default off / explicit on opt-in / explicit off) — runs
under blueprint-release.yaml's chart-test gate (added
|
||
|
|
43aff20254 |
feat(bp-*): convert all 11 bootstrap-kit charts to umbrella charts depending on upstream
Each platform/<name>/chart/Chart.yaml now declares the canonical upstream
chart as a dependencies: entry. helm dependency build pulls the upstream
payload into the OCI artifact at publish time, so Flux helm install of
bp-<name>:1.1.0 actually installs the upstream Helm release alongside the
Catalyst-curated overlays (NetworkPolicy, ServiceMonitor, ClusterIssuer,
ExternalSecret) under templates/.
Pinned upstream chart versions per platform/<name>/blueprint.yaml:
- cilium 1.16.5 https://helm.cilium.io
- cert-manager v1.16.2 https://charts.jetstack.io
- flux 2.4.0 https://fluxcd-community.github.io/helm-charts
- crossplane 1.17.x https://charts.crossplane.io/stable
- sealed-secrets 2.16.x https://bitnami-labs.github.io/sealed-secrets
- spire ... https://spiffe.github.io/helm-charts-hardened
- nats-jetstream ... https://nats-io.github.io/k8s/helm/charts
- openbao ... https://openbao.github.io/openbao-helm
- keycloak ... https://charts.bitnami.com/bitnami
- gitea ... https://dl.gitea.com/charts
- catalyst-platform umbrella over the 10 leaf bp-* charts via helm dependency
values.yaml in each chart adopts the umbrella convention: catalystBlueprint
metadata block (provenance + version) at top level, upstream subchart
values namespaced under the dependency name.
cert-manager specifically: clusterissuer-letsencrypt-dns01.yaml gets the
helm.sh/hook: post-install,post-upgrade annotation so it applies AFTER
cert-manager controllers are running and CRDs registered (the previous
hollow-chart shape ran the ClusterIssuer at install time when CRDs didn't
exist yet, which was the omantel cluster's exact failure mode).
Wrapper chart version bumped 1.0.0 → 1.1.0 across the board (umbrella
conversion is a meaningful structural revision). Cluster manifests in
clusters/_template/bootstrap-kit/ AND
clusters/omantel.omani.works/bootstrap-kit/ updated to reference 1.1.0.
The blueprint-release.yaml workflow's helm package step needs an explicit
helm dependency build before push so the upstream subchart bytes ship
inside the OCI artifact. That CI change is a follow-up commit on this same
branch (separate file scope). |
||
|
|
67fdecb770 | merge: remove k8gb (#171) | ||
|
|
f5daac52af |
refactor(platform): remove k8gb — replaced by PowerDNS lua-records (#171)
PowerDNS lua-records (`ifurlup`, `pickclosest`, `ifportup`) cover
everything k8gb was doing (geo-aware response selection, health-checked
failover, weighted round-robin) at the authoritative DNS layer. Eliminates
a separate K8s controller, CRD set, and CoreDNS plugin from every
Sovereign.
Changes:
- platform/k8gb/ deleted (Chart.yaml, values.yaml, blueprint.yaml never
authored; only README existed)
- products/catalyst/bootstrap/ui/public/component-logos/k8gb.svg deleted
- componentGroups.ts: remove k8gb component (PowerDNS already there)
- componentLogos.tsx: drop logo_k8gb + k8gb map entry
- model.ts DEFAULT_COMPONENT_GROUPS spine: replace k8gb with powerdns
- StepInfrastructure.tsx: copy refers to PowerDNS lua-records, not k8gb
- provision.html: replace k8gb tile and edges with powerdns
- catalog.generated.ts regenerated (now includes bp-powerdns)
- docs sweep: every k8gb reference in PLATFORM-TECH-STACK,
NAMING-CONVENTION, SOVEREIGN-PROVISIONING, SRE, ARCHITECTURE, GLOSSARY,
COMPONENT-LOGOS, IMPLEMENTATION-STATUS, BUSINESS-STRATEGY,
TECHNOLOGY-FORECAST, README, infra/hetzner/README, platform READMEs
(cilium, external-dns, failover-controller, litmus, flux, opentofu)
rewritten to point at PowerDNS lua-records / MULTI-REGION-DNS.md.
Historical entries in VALIDATION-LOG.md preserved as audit trail.
- New docs/MULTI-REGION-DNS.md: canonical reference for the lua-record
patterns (ifurlup all/pickclosest/pickfirst, ifportup, pickwhashed),
Application Placement → lua-record selector mapping, when to add a second
Sovereign region, operational checks.
Closes #171.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
f4679e2748 |
fix(powerdns): enable gpgsql-dnssec for DNSSEC API (1.0.6)
Without `gpgsql-dnssec=yes` the gpgsql backend driver does not expose the
DNSSEC API surface: `PUT /zones/<zone>` with `dnssec:true` returns 422
"no DNSSEC-capable backends are loaded". This blocks pool-domain-manager
from enabling DNSSEC on every Sovereign child zone (mandatory per
docs/PLATFORM-POWERDNS.md).
Fix lands in additionalConfig so the directive is rendered alongside
`default-soa-edit-signed=INCEPTION-EPOCH` and `direct-dnskey=yes`. No
schema migration needed: the gpgsql 5.0.3 schema already includes the
cryptokeys table; the missing piece was just the backend feature flag.
Bumps Chart.yaml to 1.0.6.
Verified: after this lands the PUT call returns 204 and POST /cryptokeys
mints a usable KSK.
Discovered while bringing up openova#168 (PDM per-Sovereign zones). |
||
|
|
fa84cac438 |
fix(powerdns): plain ALTER TABLE in postInitSQL (avoid $$ escape battle, 1.0.5)
The DO block in 1.0.4 rendered with $$ collapsed to $ by the time it
reached CNPG's postInitApplicationSQL, producing "syntax error at or near
$". Both Helm template processing and the YAML scalar block were chewing on
the dollar signs. Replaced with explicit ALTER TABLE statements (one per
gpgsql table) + GRANT: same end state, no PL/pgSQL quoting required.
Verified at runtime on contabo-mkt: powerdns Pod went CrashLoopBackOff →
Running 1/1 immediately after the ALTER statements were run by hand.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
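The shape of the replacement SQL can be sketched by generating one plain statement per table. This is an illustration under assumptions: the table list is taken from the stock PowerDNS generic PostgreSQL schema and is not enumerated in the commit, `pdns` is the owner named in the 1.0.4 message, and `ownership_sql` is a hypothetical helper rather than anything in the chart.

```python
from typing import List, Sequence

# Assumed table list (stock PowerDNS gpgsql schema); the commit does not
# enumerate the tables it alters.
GPGSQL_TABLES = (
    "domains",
    "records",
    "supermasters",
    "comments",
    "domainmetadata",
    "cryptokeys",
    "tsigkeys",
)


def ownership_sql(owner: str, tables: Sequence[str] = GPGSQL_TABLES) -> List[str]:
    """Build plain DDL statements with no dollar-quoted PL/pgSQL body."""
    statements = [f"ALTER TABLE {table} OWNER TO {owner};" for table in tables]
    statements.append(f"GRANT ALL PRIVILEGES ON SCHEMA public TO {owner};")
    return statements


for statement in ownership_sql("pdns"):
    print(statement)
```

Because none of the generated statements contains a `$`, there is nothing for Helm templating or YAML scalar handling to collapse, which is the point of the 1.0.5 change.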
||
|
|
214a3e1ada |
fix(powerdns): grant table ownership to pdns user in CNPG bootstrap (1.0.4)
Verified at runtime on Contabo-mkt: postInitApplicationSQL runs as the
postgres superuser, not the application owner, so the schema tables
created by the bootstrap block were owned by postgres. PowerDNS connects
as 'pdns' and got 'permission denied for table domains' on the first
SELECT against the zone cache.
Added a DO block at the end of the schema bootstrap that walks every
table in the public schema and ALTERs OWNER TO {{ .Values.postgres.cluster.owner }}
plus GRANT ALL PRIVILEGES ON SCHEMA public, the same shape PDM uses. The
fix was verified at runtime on contabo-mkt: the powerdns Pod went from
CrashLoopBackOff to 1/1 Ready immediately after the same DDL was run by
hand.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
db20e9d42b |
fix(powerdns): dnsdist backend resolution + drop DnstapLogAction (1.0.3)
dnsdist 1.9.14 runtime errors:
1. newServer{address='powerdns:5353'} → "Unable to convert presentation
address" — dnsdist's address parser expects IP[:port], not a DNS
name. Kubernetes auto-injects POWERDNS_SERVICE_HOST as an env var
into every pod in the same namespace as the powerdns Service; using
that gives us the ClusterIP at config-load time without needing an
init container or runtime DNS resolution.
2. DnstapLogAction(name, bool, fn) signature changed in 1.9 — the
2nd parameter now expects a shared_ptr to a RemoteLoggerInterface,
not a boolean. Rather than wire up a remote dnstap server (which
adds a moving part for marginal observability gain), drop the line.
Catalyst observability is the dnsdist /metrics endpoint surfaced
to Prometheus + the k8s container log.
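The env-var approach from item 1 can be sketched as follows. Kubernetes injects `<SERVICE>_SERVICE_HOST` for each Service that already exists in the pod's namespace when the pod starts, with the Service name upper-cased and dashes turned into underscores. The `backend_address` helper is hypothetical and handles IPv4 ClusterIPs only; the example IP is made up.

```python
import os
from typing import Mapping, Optional


def backend_address(
    service_name: str, port: int, env: Optional[Mapping[str, str]] = None
) -> str:
    """Build an IP:port backend address from the injected Service env var.

    Raises KeyError if the variable is absent, for example when the
    Service was created after the pod started.
    """
    env = os.environ if env is None else env
    variable = service_name.upper().replace("-", "_") + "_SERVICE_HOST"
    return f"{env[variable]}:{port}"


# Example with a made-up ClusterIP in place of the real pod environment.
print(backend_address("powerdns", 5353, {"POWERDNS_SERVICE_HOST": "10.43.0.12"}))
```

The value is fixed at pod start, so this trades runtime DNS resolution for a restart requirement if the Service's ClusterIP ever changes.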
Bumped chart to 1.0.3.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
20c0543806 |
fix(powerdns): correct dnsdist image tag + drop readOnlyRootFilesystem (1.0.2)
Two runtime issues caught during first contabo-mkt rollout:
1. dnsdist image tag was "1.9" (default); that tag doesn't exist in
docker.io/powerdns/dnsdist-19. The 1.9.x line publishes 1.9.0 .. 1.9.14
(no rolling "1.9" alias). Pinned to 1.9.14 (current latest).
2. PowerDNS pod crash-looped on Errno 30 (Read-only file system:
/etc/powerdns/pdns.d/0-api.conf.conf). The upstream pdns_server-startup
script writes rendered config files to /etc/powerdns/pdns.d/ at container
start, and the upstream template doesn't expose an emptyDir we could
redirect that path to. Set readOnlyRootFilesystem=false with a verbose
comment explaining why; the rest of the security context (runAsNonRoot,
runAsUser=953, drop ALL caps) stays in place.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
19d926bfeb |
fix(powerdns): avoid recursive include in dnsdist checksum, bump to 1.0.1
Helm flagged dnsdist.yaml's checksum/config annotation as a recursive
template self-reference (the file included itself). Replaced with a hash of
the rendered .Values.dnsdist.config (post-tpl), which is the substantive
content the annotation is supposed to track anyway.
Bumped Chart.yaml to 1.0.1 so the OCIRepository semver "1.x" picks up the
fix automatically on next reconcile. Blueprint API version stays at 1.0.0
(Blueprint contract is unchanged).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
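The idea behind the checksum annotation can be sketched as a hash over the rendered config text. This assumes SHA-256, as in the common Helm `sha256sum` idiom; the commit does not name the hash function. `config_checksum` and the sample config strings are hypothetical.

```python
import hashlib


def config_checksum(rendered_config: str) -> str:
    """Hex SHA-256 digest of the rendered config, for a pod annotation."""
    return hashlib.sha256(rendered_config.encode("utf-8")).hexdigest()


before = config_checksum('newServer({address="10.43.0.12:5353"})')
after = config_checksum('newServer({address="10.43.0.99:5353"})')

# A changed config yields a different annotation value, which changes the
# pod template and so triggers a rollout of the Deployment.
print(before != after)
```

Hashing the config value rather than the template file keeps the annotation tied to the content that actually reaches dnsdist, and avoids a template including itself.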
||
|
|
0190c60520 |
feat(powerdns): bp-powerdns wrapper chart + per-Sovereign zone model (#167)
Introduces the bp-powerdns Catalyst Blueprint wrapper as the authoritative
DNS service for every Sovereign zone. Replaces k8gb in componentGroups.ts —
PowerDNS Lua records cover geo + health-checked failover natively, removing
the dedicated GSLB controller.
Wrapper chart (platform/powerdns/chart/):
- Chart.yaml — bp-powerdns 1.0.0, depends on pschichtel/powerdns 0.10.0
upstream (verified Artifact Hub publisher; tracks
docker.io/powerdns/pdns-auth-50 at appVersion 5.0.3; surveyed Artifact
Hub, no official PowerDNS chart exists)
- values.yaml — 3 replicas, gpgsql backend, DNSSEC ECDSAP256SHA256,
lua-records ON, dnsdist 100 qps default per source IP, REST API at
pdns.openova.io/api behind Traefik basicAuth
- blueprint.yaml — Catalyst metadata, visibility=unlisted (mandatory
infra), section pts-3-2-gitops-and-iac
- templates/cnpg-cluster.yaml — separate `pdns-pg` Postgres (1 instance,
5Gi, postgres-16) with PowerDNS auth-5.0.3 schema applied via
postInitApplicationSQL
- templates/dnsdist.yaml — companion Deployment + ConfigMap with
rate-limiting policy (MaxQPSIPRule per source IP)
- templates/api-ingress.yaml — Traefik Ingress + basicAuth Middleware
- templates/anycast-endpoint.yaml — placeholder Service of type
LoadBalancer (Phase-0 stand-in for the anycast Floating IP target state)
- templates/crossplane-floatingip.yaml — DISCLOSED GAP: target-state
XHetznerFloatingIP composite, disabled by default until the
Crossplane composition is authored (the existing compositions cover
Server/Network/Firewall/LoadBalancer/PoolAllocation only). The
placeholder anycast Service is the operational stand-in.
Per docs/INVIOLABLE-PRINCIPLES.md:
- #4 (never hardcode): every value flows from values.yaml or a
referenced K8s Secret. Image tags come from upstream chart appVersion,
never duplicated.
- #8 (disclose every divergence): the XHetznerFloatingIP gap is
documented in the template + in docs/PLATFORM-POWERDNS.md ("Anycast
deferral" section).
componentGroups.ts: powerdns added to SPINE group as mandatory (depends on
cnpg). external-dns now lists powerdns as a dependency. k8gb removed.
docs/PLATFORM-POWERDNS.md: per-Sovereign zone model, DNSSEC posture, REST
API contract, lua-records GSLB pattern, dnsdist policy, anycast deferral
runbook, first-deploy procedure for Contabo-mkt.
Closes #167 (Phase 1 of public-repo work; Phase 4 cluster manifest lands
in openova-private feat/powerdns-deploy).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|