Commit Graph

26 Commits

Author SHA1 Message Date
e3mrah
ce76a7b7ab
fix(bp-powerdns): root-cause Job DeadlineExceeded recurrence (post Fix #144) (#1425)
Fix #144 raised zoneBootstrap.activeDeadlineSeconds 300s → 840s after
prov #22 hit a 5m DeadlineExceeded on the bp-powerdns post-install hook.
That fix was insufficient: prov #37 + #38 (chroot omantel.biz, 2026-05-12)
both wedged on the SAME chart slot with `BackoffLimitExceeded`, NOT
`DeadlineExceeded`. The deadline never got a chance to fire.

Trace from prov #38 chroot (`KUBECONFIG=/tmp/prov38.kubeconfig kubectl
get hr bp-powerdns -o yaml`):

  status:
    Helm install failed for release powerdns/powerdns with chart
    bp-powerdns@1.2.2: failed post-install: 1 error occurred:
      * job powerdns-zone-bootstrap failed: BackoffLimitExceeded

Pod events for powerdns-zone-bootstrap-tq7qq:
  59m Started container zone-bootstrap
  56m Back-off restarting failed container zone-bootstrap
  55m Job has reached the specified backoff limit

Root cause walked end-to-end (per CLAUDE.md TRACE rule):

  TEST: bp-powerdns HR Ready=True
    ↑
  HR: Helm install succeeds (post-install Job exits 0)
    ↑
  Zone-bootstrap Job: curl POST succeeds
    ↑
  powerdns:8081 Service: reachable (has Ready endpoints)
    ↑
  powerdns Deployment: Pods Ready (3 replicas)  ← Pending, blocked here
    ↑
  CNPG cluster: pdns-pg-app Secret exists
    ↑
  pdns-pg-1-initdb Pod: scheduled, Running, Completed  ← Pending too
    ↑
  Worker node has capacity                              ← 99% CPU requested

The zone-bootstrap container curl'd `http://powerdns:8081`, hit
"connection refused" (empty Service endpoints), exited 7, container
restarted under `restartPolicy: OnFailure`. After 6 Kubernetes-level
backoffs (≈10min wall-time with exponential delay), the Job declared
`BackoffLimitExceeded` — well before activeDeadlineSeconds=840s
(14min) could even consider firing.

Fix #144 was directionally right (the upstream IS slow on cold k3s) but
operated on the wrong knob. The container's outer-loop retry budget is
bounded by backoffLimit × backoff-delay, not by activeDeadlineSeconds.
Bumping only the deadline left the BackoffLimit ceiling unchanged.
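
A rough sketch of the arithmetic behind that ceiling. It assumes a 10s base delay that doubles on every retry and ignores container run time and any cap on the delay; both are assumptions for illustration, not values read from the cluster.

```python
# Illustrative only: the base delay and pure doubling are assumptions.
BACKOFF_LIMIT = 6          # backoffLimit from the Job spec
BASE_DELAY_S = 10          # assumed first backoff delay
ACTIVE_DEADLINE_S = 840    # activeDeadlineSeconds after Fix #144


def backoff_budget_s(limit: int = BACKOFF_LIMIT, base: int = BASE_DELAY_S) -> int:
    """Total seconds spent sleeping across `limit` exponential backoffs."""
    return sum(base * 2 ** attempt for attempt in range(limit))


budget = backoff_budget_s()
print(f"backoff budget ~{budget}s vs deadline {ACTIVE_DEADLINE_S}s")
```

Under those assumptions the retry budget comes to 630s, which runs out before the 840s deadline can fire.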

Architectural fix (this commit):

1. Move the wait-for-API loop INSIDE the container (one Pod, one inner
   poll loop, restartPolicy=Never). The inner loop polls
   GET /api/v1/servers every 10s until HTTP 200, bounded by new
   `apiReadyTimeoutSeconds` (default 600s = 10min). Now ONE container
   run owns the full wait budget instead of N short-lived containers
   racing the backoff timer.

2. restartPolicy: OnFailure → Never. The container script handles its
   own retry; Kubernetes-level backoff is reserved for genuinely
   transient pod failures (image-pull, OS eviction) where the Job-level
   backoffLimit=6 still triggers a fresh Pod.

3. Surface POWERDNS_API_READY_TIMEOUT_S env var so operators on slower
   clusters can raise the inner deadline without forking the chart
   (per docs/INVIOLABLE-PRINCIPLES.md #4).

4. New value `zoneBootstrap.apiReadyTimeoutSeconds` (default 600s).
   Sits below activeDeadlineSeconds (840s) so the zone-creation phase
   keeps ≥240s of headroom AFTER the API comes Ready.

Curl status handling in the wait loop:
  200          → API up, proceed to bootstrap
  401|403      → auth failure, FATAL (no retry — operator misconfig)
  000|5xx|...  → transient, sleep & retry until inner deadline
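
The status table and the inner poll loop can be sketched as below. This is a Python illustration of the control flow only; the chart's script is not shown in this commit message, and the names `classify`, `wait_for_api` and the injectable `probe`, `now` and `sleep` callables are hypothetical.

```python
import time
from typing import Callable


def classify(status: int) -> str:
    """Map an HTTP status (0 = no response) to the action in the table above."""
    if status == 200:
        return "ready"
    if status in (401, 403):
        return "fatal"
    return "retry"  # 000, 5xx and anything else are treated as transient


def wait_for_api(
    probe: Callable[[], int],
    timeout_s: float = 600,   # apiReadyTimeoutSeconds default
    interval_s: float = 10,   # poll interval
    now: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until 200 (True), inner deadline (False), or auth failure (raises)."""
    deadline = now() + timeout_s
    while True:
        action = classify(probe())
        if action == "ready":
            return True
        if action == "fatal":
            raise PermissionError("auth failure, not retrying")
        if now() >= deadline:
            return False
        sleep(interval_s)
```

Passing `now` and `sleep` as parameters keeps the loop testable without real waiting; a single call owns the whole wait budget, which is the point of the change.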

Files changed:
- platform/powerdns/chart/Chart.yaml         1.2.2 → 1.2.3 + history
- platform/powerdns/chart/values.yaml        + apiReadyTimeoutSeconds knob
- platform/powerdns/chart/templates/
    zone-bootstrap-job.yaml                  inner wait-for-API loop;
                                              restartPolicy: Never
- clusters/_template/bootstrap-kit/
    11-powerdns.yaml                         pin to 1.2.3 + HR comment

Why this is sufficient where Fix #144 was not:

Fix #144 worked the chart-level deadline. This commit works the
inner-loop ownership — the wait budget is now owned by the script
inside the container, not by the Job spec arithmetic
(backoffLimit × backoff-delay). The Job's outer activeDeadlineSeconds
still caps the worst-case runtime (no runaway poll), but the script
now actually GETS to use it.

Verification:
- helm template renders cleanly (deps build OK, empty-zones short-
  circuit preserved, non-empty zones render Job + RBAC + Audit CM)
- kubectl create --dry-run=client --validate=false: 5/5 resources
  created (sa, role, rb, cm, job)
- chart 1.2.3 pinned in clusters/_template/bootstrap-kit/11-powerdns.yaml

Companion infrastructure note (NOT addressed by this commit, flagged
for Coordinator):

The DEEPER bottom of the trace stack is worker capacity. Prov #38's
single cpx32 worker (4 vCPU / 8 GB) is at 99% CPU requested. The
cluster-autoscaler attempted 2→3 scale-up but is in backoff because
two unscheduled pods (gitea/gitea-* PV affinity conflict from a
previous wedged install; trivy-system/node-collector NodeAffinity)
poison the autoscaler's "can the template node fit" check. Even with
this chart fix in place, the powerdns Deployment cannot become Ready
until either:
  (a) the worker autoscales successfully (gitea PV migrated / trivy
      taints relaxed), or
  (b) worker_count is bumped from 2 to 3 in the provisioning body, or
  (c) qa_worker_size is bumped to cpx42.

This chart fix ensures bp-powerdns survives a slow CNPG cold-start.
It does NOT fix a fundamentally undersized cluster. Coordinator next
step: reprov with worker_count=3 OR qa_worker_size=cpx42 + this chart
landed. Either should converge.

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 02:13:34 +04:00
e3mrah
930529136c
fix(bp-powerdns): raise zone-bootstrap Job deadline 300s -> 840s (#144) (#1351)
Cold Sovereign on prov #22 (e2fc1004362ce765) hit terminal HR FAILED
on bp-powerdns: post-install hook Job DeadlineExceeded after 5m, Helm
hook reported `post-install: timed out waiting for the condition`,
HelmRelease retried 4x then went terminal.

Root cause: zoneBootstrap.activeDeadlineSeconds default of 300s was
shorter than the time bp-cnpg needed to synthesise the `pdns-pg-app`
Secret on a cold k3s control plane. The powerdns Pod was not Ready,
curl against http://powerdns:8081 inside the Job kept failing under
backoffLimit=6, and the 5-minute Job-level deadline killed it.

Canonical seam: chart values.yaml (the Job spec consumes
{{ .Values.zoneBootstrap.activeDeadlineSeconds }} via the existing
templated knob — no new template plumbing required, principle 18 met).

Fix: raise default 300s -> 840s (14m). Sits below the HR install.timeout
of 15m in clusters/_template/bootstrap-kit/11-powerdns.yaml, so a true
chart failure still surfaces via Flux's own remediation path rather
than wedging on a Helm wait that outlives its outer wrapper.
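
The ladder constraint can be expressed as a small check. This is a sketch with a hypothetical helper name (`ladder_ok`), not tooling that exists in the repo.

```python
def ladder_ok(timeouts_s: list[tuple[str, int]]) -> bool:
    """True when every inner timeout is strictly shorter than its outer wrapper.

    `timeouts_s` is ordered innermost first.
    """
    return all(inner[1] < outer[1] for inner, outer in zip(timeouts_s, timeouts_s[1:]))


ladder = [
    ("zoneBootstrap.activeDeadlineSeconds", 840),  # 14m, this commit
    ("HR install.timeout", 15 * 60),               # 15m wrapper
]
print(ladder_ok(ladder))
```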

Chart bump: 1.2.1 -> 1.2.2. _template HR pinned to 1.2.2 with a comment
explaining the prov-#22 incident.

Per-Sovereign HR files (clusters/omantel.omani.works/, otech.omani.works/)
remain pinned to 1.1.5 — pre-existing drift, not in scope here. New
Sovereign provisioning reads from the _template path.

Same fix family as #127, #131, #143 (HR/Job timeout-ladder alignment
where a downstream Job's deadline must fit inside its HR wrapper cap).

## Claimed TCs
- prov-22-bp-powerdns-hr-ready
- prov-22-zone-bootstrap-job-completes-cold-cnpg

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-11 07:08:46 +04:00
e3mrah
25ef20a8e5
feat(catalyst-chart): land Blueprint CRD + fix 5 string-form depends (slice B4, #1095) (#1112)
Realizes the Blueprint CRD per docs/BLUEPRINT-AUTHORING.md §3 and design
doc §3.2.4. Promotes the doc-contract (apiVersion catalyst.openova.io)
from a YAML-loaded contract to a schema-validated CRD.

Schema design:
- Two versions served from one inline schema (YAML anchors): v1alpha1
  (legacy, served, not storage) and v1 (canonical, served, storage). The
  shared schema means the 38 existing v1alpha1 files in platform/ +
  products/ continue to validate; migration to v1 is a follow-up slice.
- Required at this layer: spec.version (strict semver pattern),
  spec.card.title (minLength=1).
- Card variants accommodated as documented: summary | description |
  tagline interchangeable; category | family interchangeable; docs |
  documentation interchangeable. All optional except title.
- visibility enum: listed | unlisted | private.
- placementSchema.modes enum: single-region | active-active | active-
  hotstandby — same set Application.spec.placement validates against.
- depends[].blueprint pattern accepts both bp-* and bare-name (legacy).
- manifests accepts both manifests.chart (legacy short-form) AND
  manifests.source.{kind,ref} (canonical). Three source kinds: HelmChart,
  Kustomize, OAM.
- rotation[].ttl pattern '^[0-9]+(s|m|h|d)$'.
- x-kubernetes-preserve-unknown-fields liberally on configSchema (per-
  Blueprint JSON Schema is arbitrary by design), card, manifests, owner,
  observability, outputs, depends[].values, manifests.values, etc.

Existing files validation:
- Surveyed all blueprint.yaml in platform/ + products/ (59 files).
- Card field frequency: title (59), summary (38), description (20+1),
  category (25), family (20), docs (20), documentation (14+1), icon (25),
  tags (14), license (14).
- 54 of 59 files passed the schema unchanged.
- 5 files used `depends: [- bp-name]` (string form) instead of the
  canonical `[- blueprint: bp-name]` object form per BLUEPRINT-AUTHORING
  §3. Those 5 files are fixed in this commit:
    * platform/cert-manager-powerdns-webhook/blueprint.yaml
    * platform/cert-manager-dynadot-webhook/blueprint.yaml
    * platform/crossplane-claims/blueprint.yaml
    * platform/powerdns/blueprint.yaml
    * platform/self-sovereign-cutover/blueprint.yaml
- After fix: ALL 59 files pass server-side validation (kubectl apply
  --dry-run=server) against the new CRD.
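
The string-form to object-form rewrite applied to those 5 files amounts to the following. This is a sketch; `normalize_depends` is a hypothetical helper, and the actual edits were made by hand in the YAML.

```python
def normalize_depends(depends: list) -> list[dict]:
    """Rewrite string-form entries to object form; pass object entries through."""
    normalized = []
    for entry in depends:
        if isinstance(entry, str):
            normalized.append({"blueprint": entry})
        elif isinstance(entry, dict) and "blueprint" in entry:
            normalized.append(entry)
        else:
            raise ValueError(f"unsupported depends entry: {entry!r}")
    return normalized


print(normalize_depends(["bp-name"]))
```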

Negative validation (tests/blueprint-sample-invalid.yaml):
- spec.version "1.3" → semver pattern
- spec.card missing → required
- spec.card.title missing → required
- spec.visibility "secret" → enum listed|unlisted|private
- spec.placementSchema.modes "round-robin" → enum
- spec.depends[0] bare string "bp-bad-string" → must be object
- spec.depends[1].blueprint "Foo" → pattern fails (uppercase)
- spec.rotation[0].ttl "5 days" → pattern '^[0-9]+(s|m|h|d)$'
All 8 seeded vectors rejected.
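
A few of those vectors, re-expressed as plain checks. The ttl pattern and the visibility enum are quoted from this commit; the semver regex is a simplified X.Y.Z assumption, since the CRD's exact pattern is not reproduced here, and `violations` is a hypothetical helper.

```python
import re

TTL_RE = re.compile(r"^[0-9]+(s|m|h|d)$")              # pattern quoted above
SEMVER_RE = re.compile(r"^[0-9]+\.[0-9]+\.[0-9]+$")    # assumption: simplified
VISIBILITY = {"listed", "unlisted", "private"}


def violations(spec: dict) -> list[str]:
    """Return the spec paths that would fail the checks sketched here."""
    found = []
    if not SEMVER_RE.match(str(spec.get("version", ""))):
        found.append("spec.version")
    if not (spec.get("card") or {}).get("title"):
        found.append("spec.card.title")
    if "visibility" in spec and spec["visibility"] not in VISIBILITY:
        found.append("spec.visibility")
    for i, rotation in enumerate(spec.get("rotation", [])):
        if not TTL_RE.match(str(rotation.get("ttl", ""))):
            found.append(f"spec.rotation[{i}].ttl")
    return found


bad = {"version": "1.3", "visibility": "secret", "rotation": [{"ttl": "5 days"}]}
print(violations(bad))
```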

This commit ONLY touches new CRD + test files + the 5 depends fixes —
leaves the in-flight router.tsx + rootBeforeLoad.test.ts work from a
parallel agent and the .claude/worktrees/ directory untouched.

Refs: #1094, #1095, docs/EPICS-1-6-unified-design.md §3.2.4,
docs/BLUEPRINT-AUTHORING.md §3

Co-authored-by: hatiyildiz <hatiyildiz@noreply.openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 22:25:08 +04:00
e3mrah
c42e98216c
fix(bp-powerdns): zone-bootstrap Job needs /tmp emptyDir (curl -o + readOnlyRootFS) (#843)
* fix(bootstrap-kit,bp-newapi): bump slot pins (gitea 1.2.4, catalyst-platform 1.4.2) + gate Traefik Middleware on Cilium Sovereigns (bp-newapi 1.2.0)

Three issues blocking the otech103 verification proof on a freshly merged main, all uncovered while live-driving the Day-2 Independence cutover:

1. clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml pinned 1.4.0 — missed the bumps from PR #839 (1.4.1, RBAC dual-mode render) and PR #841 (1.4.2, POWERDNS env literal). Bumping the slot pin to 1.4.2 lands those fixes on every fresh provision.

2. clusters/_template/bootstrap-kit/10-gitea.yaml pinned 1.2.3 — missed the bump from PR #832 (1.2.4, gitea-admin-secret canonical Secret for cutover Step-1 to mount). Bumping to 1.2.4 unblocks bp-self-sovereign-cutover Step-1 (gitea-mirror Job).

3. platform/newapi/chart/templates/ingress.yaml hard-rendered a traefik.io/v1alpha1 Middleware resource. On a Cilium Gateway Sovereign that CRD does not exist; bp-newapi 1.1.0 install failed with 'no matches for kind Middleware'. Gating the Middleware behind .Values.ingress.middleware.enabled (default false) lets the chart install on Cilium Sovereigns; contabo / Traefik clusters can still flip it on per-overlay. Bumping to 1.2.0 (additive feature, default-off, no breaking change). Slot 80-newapi pin bumped lockstep.

Verified live state on otech103.omani.works (deployment id 12dff5098e33053e):
- bp-newapi 1.1.0 HR: Status=False 'Helm install failed: ... no matches for kind Middleware in version traefik.io/v1alpha1'
- bp-catalyst-platform HR pinned at 1.4.0 (lacks RBAC for cutover-driver)
- bp-gitea HR pinned at 1.2.3 (lacks gitea-admin-secret)

After this PR merges + Flux reconciles otech103, all three HRs upgrade in place and the cutover proof can be driven to completion.

* fix(bp-powerdns): zone-bootstrap Job needs /tmp emptyDir (readOnlyRootFS + curl -o)

Caught live on otech103 2026-05-04: zone-bootstrap Job exit 23 (curl write error) because curl -o /tmp/zone-resp + readOnlyRootFilesystem=true and no /tmp emptyDir mount. Bumps bp-powerdns 1.2.0 → 1.2.1 + slot 11 pin lockstep.

Without /tmp/zone-resp writable the Job CrashLoops every retry, never completes, bp-external-dns dependency stuck, Phase-1 watcher never reaches ready, handover never auto-fires.

---------

Co-authored-by: Hatice Yildiz <hatice.yildiz@openova.io>
2026-05-05 00:28:44 +04:00
e3mrah
e96741a0ca
feat(powerdns,cert-manager): multi-zone bootstrap + per-zone wildcard cert (#827) (#838)
A franchised Sovereign now supports N parent zones, NOT one. The
operator brings 1+ parent domains at signup (`omani.works` for own
use, `omani.trade` for the SME pool, etc.) and may add more
post-handover via the admin console (#829).

bp-powerdns 1.2.0 (platform/powerdns/chart):
- New `zones: []` values key listing parent domains to bootstrap
- New Helm post-install/post-upgrade hook Job
  (templates/zone-bootstrap-job.yaml) that POSTs each entry to
  /api/v1/servers/localhost/zones at install time. Idempotent on
  HTTP 409 — re-runs after upgrades or chart bumps never fail.
- Default-values render skips when zones is empty (legacy behavior).

bp-catalyst-platform 1.4.0 (products/catalyst/chart):
- New `parentZones: []` + `wildcardCert.{enabled,namespace,issuerName}`
  values
- New templates/sovereign-wildcard-certs.yaml renders one
  cert-manager.io/v1.Certificate per zone (each `*.<zone>` + apex)
  via the letsencrypt-dns01-prod-powerdns ClusterIssuer. Each cert
  renews independently. Skips entirely when parentZones is empty so
  the legacy clusters/_template/sovereign-tls/cilium-gateway-cert.yaml
  retains ownership of `sovereign-wildcard-tls` (avoids
  helm-vs-kustomize ownership flap).
- New `catalystApi.{powerdnsURL,powerdnsServerID}` values threaded
  into the catalyst-api Pod as CATALYST_POWERDNS_API_URL +
  CATALYST_POWERDNS_SERVER_ID env vars.

catalyst-api (products/catalyst/bootstrap/api):
- New internal/powerdns package with typed Client (CreateZone,
  ZoneExists). Idempotent on HTTP 409/412.
- handler.pdmCreatePowerDNSZone (issue #829's stub) now uses the
  typed client when wired via SetPowerDNSZoneClient — the
  admin-console "Add another parent domain" flow now creates real
  zones in the Sovereign's PowerDNS at runtime.
- main.go wires the client when CATALYST_POWERDNS_API_URL +
  CATALYST_POWERDNS_API_KEY are set.
- Comprehensive unit tests (client_test.go: 9 cases incl.
  201/409/412/500 + custom NS + custom serverID).
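
The idempotency rule the tests exercise can be sketched as follows. This is a Python illustration, not the Go client; `zone_create_outcome` and `ensure_zones` are hypothetical names, and `post` stands in for whatever issues the HTTP request.

```python
from typing import Callable, Iterable


def zone_create_outcome(status: int) -> str:
    """Interpret the HTTP status of a zone-create call."""
    if status == 201:
        return "created"
    if status in (409, 412):
        return "exists"  # already present: treated as success
    raise RuntimeError(f"zone create failed: HTTP {status}")


def ensure_zones(zones: Iterable[str], post: Callable[[str], int]) -> dict[str, str]:
    """Create each zone, tolerating zones that already exist."""
    return {zone: zone_create_outcome(post(zone)) for zone in zones}
```

Because an existing zone counts as success, re-running the same call after an upgrade leaves the result unchanged.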

Bootstrap-kit slot integration:
- clusters/_template/bootstrap-kit/11-powerdns.yaml: bumps to
  bp-powerdns 1.2.0 and threads `zones: ${PARENT_DOMAINS_YAML}` from
  Flux postBuild.substitute.
- clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml:
  bumps to bp-catalyst-platform 1.4.0 and threads `parentZones:
  ${PARENT_DOMAINS_YAML}` (same source-of-truth string so the two
  slots stay in lockstep).
- infra/hetzner: new `parent_domains_yaml` Terraform variable
  (defaults to single-zone array derived from sovereign_fqdn) →
  cloud-init renders the PARENT_DOMAINS_YAML Flux substitute.

DoD verified end-to-end with helm template + envsubst:
- Multi-zone overlay (omani.works + omani.trade) renders 2
  PowerDNS zone-create API calls in the bootstrap Job AND 2
  Certificate resources (`*.omani.works`, `*.omani.trade`) in
  bp-catalyst-platform.
- Single-zone fallback (PARENT_DOMAINS_YAML defaults to
  `[{name: "<sov_fqdn>", role: "primary"}]`) keeps legacy
  provisioning paths working without per-overlay edits.
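
The fallback logic, sketched in Python. The real default lives in the Terraform variable; `parent_domains` is a hypothetical helper and only the single-zone shape is taken from this commit.

```python
from typing import Optional


def parent_domains(sovereign_fqdn: str, configured: Optional[list] = None) -> list:
    """Use the configured zones when given, else a single zone from the FQDN."""
    if configured:
        return configured
    return [{"name": sovereign_fqdn, "role": "primary"}]


print(parent_domains("otech.omani.works"))
```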

Closes #827.

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
2026-05-04 23:42:00 +04:00
e3mrah
684759564e
fix(powerdns+catalyst-api): zero-touch contabo PowerDNS API key for Sovereign cert-manager (PR #681 followup) (#686)
* fix(cilium-gateway): listener ports 80/443 → 30080/30443 + LB retarget

cilium-envoy refuses to bind privileged ports (80/443) on Sovereigns
even with all of:

- gatewayAPI.hostNetwork.enabled=true on the Cilium chart
- securityContext.privileged=true on the cilium-envoy DaemonSet
- securityContext.capabilities.add=[NET_BIND_SERVICE]
- envoy-keep-cap-netbindservice=true in cilium-config ConfigMap
- Gateway API CRDs at v1.3.0 (matching cilium 1.19.3 schema)

Repeatable error from cilium-envoy logs across otech45, otech46, otech47:

  listener 'kube-system/cilium-gateway-cilium-gateway/listener' failed
  to bind or apply socket options: cannot bind '0.0.0.0:80':
  Permission denied

The bind() syscall is intercepted by cilium-agent's BPF socket-LB
program in a way that does not honour container capabilities. Even
PID 1 with CapEff=0x000001ffffffffff (all caps) and uid=0 gets
"Permission denied". Cilium 1.16.5 → 1.19.3 made no difference
(F1, PR #684 still ships — the version bump is sound for other
reasons; the listener bind is just a separate fix).

This commit moves the listeners to high ports (30080/30443) and lets
the Hetzner LB do the public-facing port translation:

  HCLB :80   → CP node :30080  (cilium-gateway HTTP listener)
  HCLB :443  → CP node :30443  (cilium-gateway HTTPS listener)

External users still hit `https://console.<sov>.omani.works/auth/handover`
on port 443; the high port is invisible. High-port bind succeeds
without NET_BIND_SERVICE because the kernel only gates ports below
`net.ipv4.ip_unprivileged_port_start` (default 1024).
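
The port rule and the LB mapping above, as a minimal sketch. The helper name `needs_bind_privilege` is hypothetical; the 1024 threshold is the sysctl default quoted above.

```python
UNPRIVILEGED_PORT_START = 1024  # default net.ipv4.ip_unprivileged_port_start

# HCLB public port -> CP node listener port, as in the mapping above
LB_TO_NODE = {80: 30080, 443: 30443}


def needs_bind_privilege(port: int, start: int = UNPRIVILEGED_PORT_START) -> bool:
    """True when binding `port` is gated by the kernel's privileged-port check."""
    return port < start


for public, node in LB_TO_NODE.items():
    print(public, needs_bind_privilege(public), node, needs_bind_privilege(node))
```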

Will be verified on otech48: the next fresh provision should serve
console.otech48/auth/handover end-to-end without the 502/timeout
chain seen on otech45–47.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(powerdns+catalyst-api): zero-touch contabo PowerDNS API key for Sovereign cert-manager

PR #681 followup. The new bp-cert-manager-powerdns-webhook (PR #681)
calls contabo's authoritative PowerDNS at pdns.openova.io to write
DNS-01 challenge TXT records for *.otech<N>.omani.works. That webhook
needs an X-API-Key Secret in the Sovereign's cert-manager namespace —
PR #681 didn't ship the materialization seam, so on otech43..otech47
the Secret was missing and the wildcard cert never issued.

This commit closes the seam from contabo to the Sovereign:

1. bp-powerdns chart 1.1.7 to 1.1.8: Reflector annotations on
   openova-system/powerdns-api-credentials extended from "external-dns"
   to "external-dns,catalyst" so contabo catalyst-api can mount the
   API key.

2. bp-powerdns: api.basicAuth.enabled flips default true to false.
   Layered Traefik basicAuth + PowerDNS X-API-Key was double auth that
   blocked machine-to-machine API access from Sovereigns. The X-API-Key
   contract is unchanged.

3. bp-catalyst-platform 1.2.3 to 1.2.4: api-deployment.yaml adds
   CATALYST_POWERDNS_API_KEY env from powerdns-api-credentials/api-key
   secret (optional=true so Sovereign-side catalyst-api Pods that don't
   reflect this still start clean).

4. catalyst-api provisioner.go: new Provisioner.PowerDNSAPIKey field
   reads from CATALYST_POWERDNS_API_KEY env at New(). Stamps onto every
   Request before Validate(). Forwards as tofu var powerdns_api_key.

5. infra/hetzner/variables.tf: new var.powerdns_api_key (sensitive,
   default "").

6. infra/hetzner/cloudinit-control-plane.tftpl: replaces the defunct
   dynadot-api-credentials Secret block (PR #681 dropped
   bp-cert-manager-dynadot-webhook) with a new
   cert-manager/powerdns-api-credentials Secret block. runcmd applies
   it BEFORE Flux reconciles bp-cert-manager-powerdns-webhook.

End-to-end seam mirrors PR #543 ghcr-pull and PR #680 harbor-robot-token.

Will be verified live on otech48 (next provision after this lands).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: hatiyildiz <hatice@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 18:23:27 +04:00
e3mrah
8bb66fe43e
fix(bp-{harbor,gitea,powerdns}): bp-cnpg dependsOn + Reflector auto-enabled (#644)
* fix(infra): break tofu cycle — resolve CP public IP at boot via metadata service

PR #546 (Closes #542) introduced a dependency cycle:
  hcloud_server.control_plane.user_data → local.control_plane_cloud_init
  local.control_plane_cloud_init → hcloud_server.control_plane[0].ipv4_address

`tofu plan` failed with:
  Error: Cycle: local.control_plane_cloud_init (expand), hcloud_server.control_plane

Caught live during otech23 first-end-to-end provisioning attempt.

Fix: stop templating `control_plane_ipv4` at plan time. cloud-init runs ON
the CP node, so it resolves its own public IPv4 at boot via Hetzner's
metadata service:
  curl http://169.254.169.254/hetzner/v1/metadata/public-ipv4

Same observable behavior as #546 (kubeconfig server: rewritten to CP public
IP, not LB IP — preserves the wizard-jobs-page-not-stuck-PENDING fix), with
no graph cycle.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(infra+api): wire handover_jwt_public_key end-to-end

The OpenTofu cloud-init template references ${handover_jwt_public_key}
(infra/hetzner/cloudinit-control-plane.tftpl:371) and variables.tf declares
the variable, but neither side wires it:
  - main.tf templatefile() call did not pass the key → "vars map does not
    contain key handover_jwt_public_key" on tofu plan
  - provisioner.writeTfvars never set the var → empty even when wired

Caught live during otech23 provisioning, immediately after the tofu-cycle
fix landed. tofu plan failed with:

  Error: Invalid function argument
    on main.tf line 170, in locals:
      170:   control_plane_cloud_init = replace(templatefile(...
    Invalid value for "vars" parameter: vars map does not contain key
    "handover_jwt_public_key", referenced at
    ./cloudinit-control-plane.tftpl:371,9-32.

Fix:
  - main.tf templatefile() now passes handover_jwt_public_key = var.handover_jwt_public_key
  - provisioner.Request gains a HandoverJWTPublicKey field (json:"-",
    server-stamped, never accepted from client JSON)
  - handler.CreateDeployment stamps it from h.handoverSigner.PublicJWK()
    when the signer is configured (CATALYST_HANDOVER_KEY_PATH set)
  - writeTfvars emits the value into tofu.auto.tfvars.json

variables.tf default "" preserves the no-signer path: cloud-init writes
an empty handover-jwt-public.jwk and the new Sovereign is provisioned
without the handover-validation surface (handover flow simply not wired
on that Sovereign — degraded gracefully, not a hard failure).

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(api): cloud-init kubeconfig postback must live outside RequireSession

The PUT /api/v1/deployments/{id}/kubeconfig route was registered inside the
RequireSession-gated chi.Group, so every cloud-init postback was rejected
with HTTP 401 {"error":"unauthenticated"} before PutKubeconfig could run.
Cloud-init has no browser session cookie — it authenticates with the
SHA-256-hashed bearer token PutKubeconfig already verifies internally.

Result on otech23: Phase 0 finished (Hetzner CP + LB up), but every
cloud-init `curl --retry 60 -X PUT ... /kubeconfig` returned 401 unauth.
catalyst-api never received the kubeconfig, Phase 1 helmwatch never
started, the wizard's Jobs page stayed in PENDING forever.

Fix: register the PUT outside the auth group so cloud-init's
bearer-hash auth path is the only gate. The matching GET stays inside
session auth — the operator's "Download kubeconfig" button needs the
session cookie.
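
For orientation, a hashed-bearer check of the kind described might look like this. It is a sketch that assumes the stored value is a hex-encoded SHA-256 digest; the actual PutKubeconfig code is not shown here and `token_matches` is a hypothetical name.

```python
import hashlib
import hmac


def token_matches(presented: str, stored_sha256_hex: str) -> bool:
    """Compare the SHA-256 of the presented token with the stored digest."""
    digest = hashlib.sha256(presented.encode("utf-8")).hexdigest()
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(digest, stored_sha256_hex)
```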

Caught live during otech23 first end-to-end provisioning. Per the
new "punish-back-to-zero" rule, otech23 was wiped (Hetzner + PDM +
PowerDNS + on-disk state) and the next provision will use otech24.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(catalyst-api): wire harbor_robot_token through to tofu — never pull from docker.io

PR #557 added the registries.yaml mirror in cloudinit-control-plane.tftpl
and declared var.harbor_robot_token in infra/hetzner/variables.tf with a
default of "". The catalyst-api side never set it, so every Sovereign so
far provisioned with an empty token in registries.yaml — containerd's
auth to harbor.openova.io's proxy projects failed silently and pulls
fell through to docker.io. On a fresh Hetzner IP, Docker Hub returns
rate-limit HTML and:

  Failed to pull image "rancher/mirrored-pause:3.6":
    unexpected media type text/html for sha256:...

cilium / coredns / local-path-provisioner sit at Init:0/6 forever; Flux
pods stay Pending; no HelmReleases ever land; the wizard's job stream
shows everything PENDING because there's nothing to watch. Caught live
during otech24.

Wiring (mirrors the GHCRPullToken pattern):
  1. Provisioner.HarborRobotToken — read from CATALYST_HARBOR_ROBOT_TOKEN
     env at New().
  2. Stamped onto every Request in Provision() and Destroy() before
     writeTfvars.
  3. Request.HarborRobotToken — server-stamped (json:"-"); never accepted
     from the wizard payload.
  4. writeTfvars emits "harbor_robot_token" into tofu.auto.tfvars.json.
  5. api-deployment.yaml mounts the catalyst/harbor-robot-token Secret
     (mirrored from openova-harbor — Reflector-managed on Sovereign
     clusters; copied per-namespace on Catalyst-Zero contabo) as
     CATALYST_HARBOR_ROBOT_TOKEN, optional=true so degraded paths
     still come up.

variables.tf default "" preserves graceful fall-through if the operator
hasn't issued a robot token yet, and the architecture rule is now
enforced end-to-end: every image on every Sovereign goes through
harbor.openova.io.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(handler): stamp CATALYST_HARBOR_ROBOT_TOKEN before Validate() (#638 follow-up)

PR #638 added Validate() rejection for missing harbor_robot_token, but
the handler only stamped req.HarborRobotToken from p.HarborRobotToken
inside Provision() — Validate() runs in the handler BEFORE Provision()
gets the chance to stamp. Result: every wizard launch returned

  Provisioning rejected: Harbor robot token is required (CATALYST_HARBOR_ROBOT_TOKEN missing)

even though the env var is set on the Pod. Caught immediately on the
otech25 launch attempt.

Fix: same env-stamp pattern as GHCRPullToken at the top of the
CreateDeployment handler. Provisioner-level stamp in Provision() stays
as defense-in-depth.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(infra): registries.yaml needs rewrite — Harbor proxy URL is /v2/<proj>/<repo>, not /<proj>/v2/<repo>

PR #557 wrote registries.yaml with mirror endpoints like
  https://harbor.openova.io/proxy-dockerhub
hoping containerd would build URLs like
  https://harbor.openova.io/proxy-dockerhub/v2/rancher/mirrored-pause/manifests/3.6

But Harbor proxy-cache projects expose their API at
  https://harbor.openova.io/v2/proxy-dockerhub/rancher/mirrored-pause/manifests/3.6
(project name lives AFTER the /v2/ API root, not as a path prefix before it).
Harbor returns its SPA UI HTML (status 200, content-type text/html) for the
wrong shape; containerd then errors with:
  "unexpected media type text/html for sha256:... not found"
and pause-image / cilium / coredns pulls fail forever — caught live during
otech24 and otech25.

Fix: switch to k3s registries.yaml `rewrite` syntax. Endpoint is the bare
Harbor host; per-mirror rewrite re-maps the image path so containerd's
final URL is correctly project-prefixed. Verified manually:

  curl https://harbor.openova.io/v2/proxy-dockerhub/rancher/mirrored-pause/manifests/3.6
  -> 200 application/vnd.docker.distribution.manifest.list.v2+json

This unblocks every Sovereign image pull through the central Harbor.
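
The URL shape the rewrite has to produce, as a small sketch. `harbor_manifest_url` is a hypothetical helper; the example values are the ones from the curl check above.

```python
def harbor_manifest_url(host: str, project: str, repository: str, reference: str) -> str:
    """Build a manifest URL with the proxy project placed right after /v2/."""
    return f"https://{host}/v2/{project}/{repository}/manifests/{reference}"


print(harbor_manifest_url("harbor.openova.io", "proxy-dockerhub", "rancher/mirrored-pause", "3.6"))
```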

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(bp-vpa): drop registry.k8s.io/ prefix from repository — upstream chart prepends it

cowboysysop/vertical-pod-autoscaler subchart prepends `.image.registry`
(default registry.k8s.io) to `.image.repository`. Catalyst's bp-vpa
overrode `repository: registry.k8s.io/autoscaling/vpa-...` so the rendered
image was `registry.k8s.io/registry.k8s.io/autoscaling/vpa-...:1.5.0` —
doubled prefix, image-not-found, ImagePullBackOff on every fresh
Sovereign. Caught live during otech26.

Fix: drop the redundant prefix. Subchart's default `.image.registry`
keeps it pointing at registry.k8s.io which the new Sovereign's
containerd routes through harbor.openova.io/v2/proxy-k8s/... via
registries.yaml rewrite (#640).
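
How the doubled prefix arises, in a minimal sketch. The repository name `autoscaling/vpa-example` is a hypothetical placeholder and `rendered_image` only mimics the subchart's join, it is not its template.

```python
def rendered_image(registry: str, repository: str, tag: str) -> str:
    """Mimic a subchart that always prepends the registry to the repository."""
    return f"{registry}/{repository}:{tag}"


# Override that repeats the registry inside `repository` (the bug):
print(rendered_image("registry.k8s.io", "registry.k8s.io/autoscaling/vpa-example", "1.5.0"))
# Override with the redundant prefix dropped (the fix):
print(rendered_image("registry.k8s.io", "autoscaling/vpa-example", "1.5.0"))
```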

Bumps bp-vpa chart version to 1.0.1 and bootstrap-kit reference to match.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(wizard): SOLO default SKU CPX32 → CPX42 — 35-component bootstrap-kit needs 8 vCPU / 16 GB

CPX32 (4 vCPU / 8 GB) cannot fit the full SOLO bootstrap-kit on a single
node. Caught live during otech26: 38 pods Running, 34 pods stuck Pending
indefinitely with "Insufficient cpu" — Cilium + Crossplane + Flux +
cert-manager + CNPG + Keycloak + OpenBao + Harbor + Gitea + Mimir +
Loki + Tempo + … each request 50-500m vCPU and the node hits 100%
allocatable before half the workloads schedule.

CPX42 (8 vCPU / 16 GB / 320 GB SSD) at €25.49/mo is the smallest size
that fits the bootstrap-kit with VPA-recommendation headroom. Operators
can still pick CPX32 explicitly if they trim the component set on
StepComponents — but the default SOLO path now provisions a node
that actually boots into a steady state.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(bp-cert-manager-dynadot-webhook): pin SHA tag + add ghcr-pull imagePullSecret (chart 1.1.2)

- Replace forbidden `:latest` tag with current short-SHA `942be6f` per
  docs/INVIOLABLE-PRINCIPLES.md #4.
- Add default `webhook.imagePullSecrets: [{name: ghcr-pull}]` so kubelet
  authenticates against private ghcr.io/openova-io/openova/* via the
  Reflector-mirrored `ghcr-pull` Secret in cert-manager namespace.
  Without this, the webhook Pod was stuck ErrImagePull/ImagePullBackOff
  on every Sovereign — caught live during otech27.
- Bumps chart version 1.1.1 -> 1.1.2 and bootstrap-kit reference.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(bp-{harbor,gitea,powerdns}): add bp-cnpg dependency + Reflector auto-enabled

Two related Phase-8a stragglers diagnosed live during otech28:

1. bp-powerdns missed bp-cnpg in dependsOn. Helm renders BEFORE
   postgresql.cnpg.io/v1 CRD is registered → templates/cnpg-cluster.yaml
   `Capabilities.APIVersions.Has` gate evaluates false → no Cluster CR
   → no pdns-pg-app Secret → powerdns Pods stuck CreateContainerConfigError
   forever ("secret pdns-pg-app not found"). Adds explicit dependsOn.

2. bp-harbor/gitea/powerdns CNPG inheritedMetadata only set
   reflection-allowed; missing reflection-auto-enabled. Reflector races
   when destination Secret (harbor-database-secret) is created BEFORE
   CNPG provisions the source (harbor-pg-app). Reflector logs
   "Source could not be found" once and never retries — leaving harbor-
   core stuck CreateContainerConfigError. Adding auto-enabled makes
   Reflector actively watch the source and re-fire when it appears.
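
The render-time gate from item 1, reduced to its logic. This is a Python stand-in for the Helm `Capabilities.APIVersions.Has` check; `rendered_kinds` and the base resource list are hypothetical.

```python
CNPG_API = "postgresql.cnpg.io/v1"


def rendered_kinds(api_versions: set) -> list:
    """Return the kinds a gated chart would render for the given API set."""
    kinds = ["Deployment", "Service"]  # assumption: always rendered
    if CNPG_API in api_versions:
        kinds.append("Cluster")        # only when the CRD is already registered
    return kinds


print(rendered_kinds(set()))
print(rendered_kinds({CNPG_API}))
```

Rendering before the CRD exists silently drops the Cluster CR, which is why the explicit dependsOn ordering matters.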

Bumps:
  bp-harbor    1.2.8 -> 1.2.9
  bp-gitea     1.2.1 -> 1.2.2
  bp-powerdns  1.1.5 -> 1.1.7 (skips 1.1.6 which was a non-released bump)

Bootstrap-kit references updated to pull the new chart versions on
the next Sovereign provisioning.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

---------

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
2026-05-03 00:16:34 +04:00
e3mrah
5a403e66b1
fix(tls): DNS-01 wildcard TLS chain — solverName pdns, NodePort 30053, dynadot test fix (#582)
* fix(bp-harbor): CNPG database must be 'registry' not 'harbor' — matches coreDatabase

Harbor upstream always connects to a database named 'registry'
(harbor.database.external.coreDatabase default). The CNPG Cluster was
initialised with database='harbor', causing:

  FATAL: database "registry" does not exist (SQLSTATE 3D000)

Fix: change postgres.cluster.database default from 'harbor' → 'registry'
in values.yaml and cnpg-cluster.yaml template. Both the CNPG bootstrap
and Harbor's coreDatabase now use 'registry'.

Runtime fix on otech22: CREATE DATABASE registry OWNER harbor was run
against harbor-pg-1. harbor-core is now 1/1 Running.

Bump bp-harbor 1.2.1 → 1.2.2. Bootstrap-kit refs updated.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(tls): DNS-01 wildcard TLS chain — solverName, NodePort 30053, dynadot test fix

Five independent fixes that together complete the DNS-01 wildcard TLS chain
for per-Sovereign certificate autonomy:

1. cert-manager-powerdns-webhook solverName mismatch (root cause of #550 echo):
   - values.yaml: `webhook.solverName: powerdns` → `pdns`
   - The zachomedia binary's Name() returns "pdns" (hardcoded). cert-manager
     calls POST /apis/<groupName>/v1alpha1/<solverName>; when solverName is
     "powerdns" cert-manager gets 404 → "server could not find the resource".

2. cert-manager-dynadot-webhook solver_test.go mock format:
   - writeOK() and error injection used old ResponseHeader-wrapped format
   - Real api3.json returns ResponseCode/Status directly in SetDnsResponse
   - This caused the image build to fail at ccc38987 so the dynadot fix
     never shipped; solver tests now pass cleanly (go test ./... OK)

3. PowerDNS NodePort 30053 anycast overlay (bootstrap-kit and template):
   - _template/bootstrap-kit/11-powerdns.yaml: adds anycast NodePort values
   - omantel + otech bootstrap-kit: same NodePort 30053 overlay applied
   - anycast-endpoint.yaml: optional nodePort field rendered in port list

4. Hetzner LB + firewall for DNS port 53 (infra/hetzner/main.tf):
   - hcloud_load_balancer_service.dns: TCP:53 → NodePort 30053
   - Firewall: TCP+UDP :53 from 0.0.0.0/0,::/0

5. dynadot-client JSON parsing fix (core/pkg/dynadot-client):
   - AddRecord + SetFullDNS: struct no longer wraps respHeader in ResponseHeader
   - client_test.go: mock responses updated to real api3.json format
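
The solverName mismatch in item 1 can be illustrated with a small sketch. It models only the path comparison; the group name `acme.example.com` and the function names are hypothetical, and the real routing happens inside the apiserver and the webhook binary.

```python
def solver_path(group_name, solver_name):
    """Path cert-manager calls for a given webhook solver."""
    return f"/apis/{group_name}/v1alpha1/{solver_name}"


def dispatch(registered_solver, group_name, requested_solver):
    """Return 200 when the requested path is the one the webhook serves, else 404."""
    served = solver_path(group_name, registered_solver)
    requested = solver_path(group_name, requested_solver)
    return 200 if served == requested else 404


if __name__ == "__main__":
    # The binary registers "pdns"; values.yaml previously asked for "powerdns".
    print(dispatch("pdns", "acme.example.com", "powerdns"))
    print(dispatch("pdns", "acme.example.com", "pdns"))
```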

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: alierenbaysal <alierenbaysal@openova.io>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-02 13:49:58 +04:00
e3mrah
ad9cfc0f23
feat(platform): add global.imageRegistry to bp-openbao/external-secrets/cnpg/valkey/nats-jetstream/powerdns/gitea (PR 2/3, #560) (#565)
Charts with template image refs (fully rewritten when registry set):
- bp-openbao 1.2.4→1.2.5: init-job.yaml + auth-bootstrap-job.yaml — Catalyst
  job images now prefixed with global.imageRegistry when non-empty. Default
  (empty) renders identical manifests.
- bp-powerdns 1.1.5→1.1.6: dnsdist.yaml Catalyst companion image prefixed
  with global.imageRegistry when non-empty. Verified: dnsdist image rewrites
  to harbor.openova.io/docker.io/powerdns/dnsdist-19:1.9.14.
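
A rough sketch of the prefixing rule described above, written in Python rather than Helm template syntax. The function name `rewrite_image` is hypothetical; only the input and output image strings come from the commit message.

```python
def rewrite_image(image, global_registry=""):
    """Prefix the image with the registry when one is set; otherwise leave it alone."""
    if not global_registry:
        return image  # default (empty) renders an identical reference
    return f"{global_registry.rstrip('/')}/{image}"


if __name__ == "__main__":
    image = "docker.io/powerdns/dnsdist-19:1.9.14"
    print(rewrite_image(image))
    print(rewrite_image(image, "harbor.openova.io"))
```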

Subchart-only charts (global.imageRegistry stub added; threading via per-component
subchart values.yaml keys documented in comments):
- bp-external-secrets 1.1.0→1.1.1
- bp-cnpg 1.0.0→1.0.1  (charts/ missing = pre-existing state, not this PR)
- bp-valkey 1.0.0→1.0.1 (charts/ missing = pre-existing state, not this PR)
- bp-nats-jetstream 1.1.1→1.1.2
- bp-gitea 1.1.2→1.1.3: upstream chart exposes gitea.image.registry for wiring

vcluster: N/A — no chart directory under platform/vcluster/chart/

Co-authored-by: alierenbaysal <alierenbaysal@openova.io>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-02 12:52:43 +04:00
e3mrah
7d264d9647
fix(bp-powerdns): default cluster.namespace=powerdns not openova-system (Closes #553) (#556)
bp-powerdns HelmRelease upgrade fails on Sovereigns with:
  failed to create resource: namespaces "openova-system" not found

The chart's CNPG Cluster CR template targets postgres.cluster.namespace
which defaulted to openova-system (a contabo-only legacy ns). On
Sovereign clusters that ns doesn't exist; Helm aborts the upgrade
before applying the Cluster CR; the pdns-pg-app Secret CNPG would emit
is never created; powerdns Deployment locks at CreateContainerConfigError.

Default to powerdns (chart targetNamespace per bootstrap-kit overlay).
Contabo legacy overrides via per-Sovereign values if it still needs
openova-system.

Bump bp-powerdns 1.1.4 -> 1.1.5 across template + omantel + otech overlays.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
2026-05-02 12:19:37 +04:00
e3mrah
902d857702
fix(bp-powerdns): reflect powerdns-api-credentials to external-dns namespace (Closes #544) (#552)
Add reflector.v1.k8s.emberstack.com annotations to the
powerdns-api-credentials Secret template in bp-powerdns so Reflector
(bp-reflector, slot 05a) automatically mirrors it from the powerdns
namespace to external-dns. Bump chart version 1.1.3 → 1.1.4.

Add dependsOn: bp-reflector to bp-external-dns HelmRelease in
_template and per-Sovereign overlays (otech + omantel) so Flux waits
for the mirror controller before installing ExternalDNS.

Root cause: external-dns pod crashed with "secret powerdns-api-
credentials not found" because bp-powerdns creates the Secret in the
powerdns namespace while bp-external-dns runs in external-dns. No
cross-namespace propagation existed. Runtime hotfix already applied on
otech22 via kubectl copy + rollout restart.

Co-authored-by: alierenbaysal <alierenbaysal@openova.io>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-02 12:11:43 +04:00
e3mrah
abf01b6f21
feat(platform): Gateway API migration audit (#387) (#401)
Migrates every minimal-Sovereign-set blueprint chart from
networking.k8s.io/v1.Ingress to gateway.networking.k8s.io/v1.HTTPRoute,
replacing the legacy Traefik-on-Sovereigns assumption with the canonical
Cilium + Envoy + Gateway API path per ADR-0001 §9.4 and the WBS §2
correction note (#388).

The single per-Sovereign Gateway is added as additional documents in
the existing bootstrap-kit slot clusters/_template/bootstrap-kit/01-cilium.yaml
(NOT a new top-level slot), since Cilium owns the GatewayClass. It
includes:

  - Certificate `sovereign-wildcard-tls` requesting `*.${SOVEREIGN_FQDN}`
    from `letsencrypt-dns01-prod` (cert-manager + #373 webhook)
  - Gateway `cilium-gateway` in `kube-system` with HTTPS (443, TLS
    terminate) + HTTP (80) listeners, allowedRoutes.namespaces.from=All

Per-blueprint HTTPRoute templates (canonical seam: each wrapper chart's
existing `templates/` directory):

  | Blueprint            | Host pattern                    | Backend port |
  |----------------------|---------------------------------|--------------|
  | bp-keycloak          | auth.<sov>                      | 80           |
  | bp-gitea             | git.<sov>                       | 3000         |
  | bp-openbao           | bao.<sov>                       | 8200         |
  | bp-grafana           | grafana.<sov>                   | 80           |
  | bp-harbor            | registry.<sov>                  | 80           |
  | bp-powerdns          | pdns.<sov>/api  (dual-mode)     | 8081         |
  | bp-catalyst-platform | console.<sov>, api.<sov>        | 80, 8080     |

bp-powerdns supports both Ingress (contabo legacy) and HTTPRoute
(Sovereign) simultaneously — the per-Sovereign overlay sets
`api.gateway.enabled=true` while leaving `api.enabled=true`. The
Ingress object is harmless on Cilium clusters with no Traefik. This
preserves contabo's existing pdns.openova.io flow per ADR-0001 §9.4.

bp-harbor flips `expose.type` from `ingress` to `clusterIP` in
platform/harbor/chart/values.yaml so the upstream chart no longer
emits its own Ingress; the HTTPRoute is the sole HTTP exposure.
TLS terminates at the Gateway (wildcard cert) rather than per-host
Certificates inside the chart.

bp-catalyst-platform's `templates/httproute.yaml` is NOT excluded by
.helmignore (unlike templates/ingress.yaml + templates/ingress-console-tls.yaml,
which remain contabo-only legacy demo infra). The contabo path keeps
serving console.openova.io/sovereign via Traefik unchanged.

Bootstrap-kit slot updates (per-Sovereign hostname interpolation):

  - 08-openbao.yaml      → gateway.host: bao.${SOVEREIGN_FQDN}
  - 09-keycloak.yaml     → gateway.host: auth.${SOVEREIGN_FQDN}
  - 10-gitea.yaml        → gateway.host: gitea.${SOVEREIGN_FQDN}
  - 11-powerdns.yaml     → api.host: pdns.${SOVEREIGN_FQDN}, api.gateway.enabled: true
  - 19-harbor.yaml       → gateway.host: registry.${SOVEREIGN_FQDN}
  - 25-grafana.yaml      → gateway.host: grafana.${SOVEREIGN_FQDN}
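
The per-Sovereign hostname interpolation listed above can be sketched as plain `${VAR}` substitution. This is an illustration only: the FQDN `sov.example.com` is made up, and the sketch does not claim to mirror whatever mechanism expands `${SOVEREIGN_FQDN}` in the real pipeline.

```python
from string import Template

# Host templates copied from the slot list above.
SLOT_HOSTS = {
    "08-openbao.yaml": "bao.${SOVEREIGN_FQDN}",
    "09-keycloak.yaml": "auth.${SOVEREIGN_FQDN}",
    "10-gitea.yaml": "gitea.${SOVEREIGN_FQDN}",
    "11-powerdns.yaml": "pdns.${SOVEREIGN_FQDN}",
    "19-harbor.yaml": "registry.${SOVEREIGN_FQDN}",
    "25-grafana.yaml": "grafana.${SOVEREIGN_FQDN}",
}


def render_hosts(sovereign_fqdn):
    """Expand ${SOVEREIGN_FQDN} in every slot's host template."""
    return {
        slot: Template(host).substitute(SOVEREIGN_FQDN=sovereign_fqdn)
        for slot, host in SLOT_HOSTS.items()
    }


if __name__ == "__main__":
    for slot, host in render_hosts("sov.example.com").items():
        print(f"{slot}: {host}")
```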

Server-side dry-run validation against the live Cilium Gateway API
CRDs on contabo: every HTTPRoute and the per-Sovereign Gateway
+ Certificate apply cleanly via `kubectl apply --dry-run=server`.

Contabo unaffected: clusters/contabo-mkt/* not modified. The legacy
SME ingresses (console-nova, marketplace, admin, axon, talentmesh,
stalwart, ...) continue to serve via Traefik as before. powerdns
on contabo remains on the Ingress path (api.gateway.enabled defaults
to false at the chart level).

Closes #387.

Co-authored-by: hatiyildiz <269457768+hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 16:19:30 +04:00
e3mrah
726af6df81
fix(bp-powerdns): self-generate api-credentials Secret + disable upstream zone-bootstrap Job (#248)
Root cause investigation on otech.omani.works (kubectl, sanitized):

  $ kubectl get pods -n powerdns
  create-zone-if-not-exist-sh-tjtr4   0/1  CreateContainerConfigError  4h
  powerdns-57d7d49f99-{9hrb4,lxlgt,nkmht}  0/1  CreateContainerConfigError  4h
  dnsdist-594dbfc5f-wznsw                  1/1  Running  4h

  $ kubectl get secrets -n powerdns
  powerdns                Opaque  1  4h
  powerdns-api-tls-8kxpx  Opaque  1  4h     (NO `powerdns-api-credentials`, NO `pdns-pg-app`)

  $ kubectl describe pod ... powerdns-57d7d49f99-9hrb4
  Environment:
    PDNS_API_KEY:  <set to the key 'api-key' in secret 'powerdns-api-credentials'>  Optional: false
    PDNS_DB_HOST:  <set to the key 'host' in secret 'pdns-pg-app'>                  Optional: false
    State: Waiting   Reason: CreateContainerConfigError

The handover's chicken-egg-with-secret theory was directionally right but
the cause was more fundamental:

  1. Wrapper chart's api-credentials-secret.yaml (1.1.2) was a no-op
     unless operator set `apiKey` value out-of-band — comment said the
     deployment would "fail to start until the named Secret exists" as
     "the explicit signal we want". On a Sovereign that bootstraps from
     bp-* OCI artifacts, no operator is standing by, so the Secret is
     never created and pods sit in CreateContainerConfigError forever.

  2. The upstream chart's `create-zone-if-not-exist-sh` Job is rendered
     whenever both `zoneName` and `api.key` are set. Because upstream
     defaults `zoneName: "example.de."`, it ALWAYS rendered and ALWAYS
     failed (same missing Secret). Catalyst doesn't want this Job at all
     because zones are loaded later by pool-domain-manager (PDM).

  3. The chart's CNPG Cluster template is gated behind
     Capabilities.APIVersions.Has "postgresql.cnpg.io/v1" — on a fresh
     Sovereign without bp-cnpg yet (bp-cnpg is on the roadmap, not in
     bootstrap-kit), no Cluster is rendered and `pdns-pg-app` Secret
     never materialises. With Helm `--wait`, install times out
     ("context deadline exceeded") even though the manifests applied
     cleanly.

Fix:

  * api-credentials-secret.yaml: self-generate via Helm `lookup` +
    `randAlphaNum 32`. First install creates fresh randoms; every
    subsequent reconcile reads back the existing values from the
    Secret so the API key never rotates on upgrade. Operator can
    still pin specific values via .Values.powerdns.apiKey /
    .Values.powerdns.webserverPassword, or skip Secret creation
    entirely via .Values.powerdns.useExistingApiSecret. Same pattern
    as bitnami/postgresql, bitnami/keycloak.

  * values.yaml: set `powerdns.zoneName: ""` so upstream chart's
    `{{- if and .Values.powerdns.zoneName .Values.powerdns.api.key }}`
    gate skips the create-zone Job entirely. Catalyst's PDM creates
    zones via the REST API after the cluster comes up; we don't want
    a placeholder `example.de.` zone in production.

  * HelmRelease (both _template and otech.omani.works overlays):
    `install.disableWait: true` + `upgrade.disableWait: true` so the
    HelmRelease reports Ready as soon as manifests apply cleanly,
    rather than gating on powerdns Deployment readiness which depends
    on bp-cnpg landing first to synthesise `pdns-pg-app`. Runtime
    convergence is observed via kubectl, not gated on Helm.
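
The lookup-then-generate behaviour in the first fix can be sketched as follows. This is a Python analogy of the precedence order, not the chart's template. The function names are hypothetical, and the sketch ignores the base64 encoding a real Secret's `data` field carries.

```python
import secrets
import string

ALPHABET = string.ascii_letters + string.digits


def rand_alpha_num(length):
    """Rough stand-in for a 32-character alphanumeric generator."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


def resolve_api_key(existing_secret=None, pinned=None):
    """Pinned value wins, then the value already stored, then a fresh random."""
    if pinned:
        return pinned
    if existing_secret and existing_secret.get("api-key"):
        return existing_secret["api-key"]  # never rotates on upgrade
    return rand_alpha_num(32)


if __name__ == "__main__":
    first = resolve_api_key()                        # first install
    second = resolve_api_key({"api-key": first})     # later reconcile
    print(first == second)
```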

Live error this addresses:
  Helm upgrade failed for release powerdns/powerdns with chart
  bp-powerdns@1.1.2: context deadline exceeded

Verified locally with `helm template`:
  - powerdns-api-credentials Secret renders with random api-key + webserver-password
  - create-zone-if-not-exist-sh Job no longer rendered
  - Deployment env continues to reference powerdns-api-credentials correctly

Bumped 1.1.2 -> 1.1.3 (chart, blueprint, both bootstrap-kit overlays).

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 16:55:12 +04:00
e3mrah
bcd2e7980a
fix: hide CRD-emitting resources behind Capabilities gates (closes #190) (#200)
* fix(bp-external-dns): hide CRD-emitting resources behind Capabilities gates (refs #190)

Wrap the Catalyst overlay's ServiceMonitor and ExternalSecret templates
in `.Capabilities.APIVersions.Has` checks so a cold install on a fresh
Sovereign — where bp-kube-prometheus-stack and bp-external-secrets have
not yet reconciled — no longer fails with `no matches for kind X in
version Y`. The values toggles (`externalDns.serviceMonitor.enabled`,
`externalDns.externalSecret.enabled`) remain — Capabilities is defense
in depth so an operator flipping the toggle on a Sovereign that hasn't
reached Phase 2 doesn't break the bp-external-dns reconcile.

Verified locally: `helm template` with toggles off renders 0 of these
resources; with toggles ON and `--api-versions monitoring.coreos.com/v1
--api-versions external-secrets.io/v1beta1` both render exactly once.

Bump version 1.1.0 → 1.1.2 to align with the Phase-1 architectural-fix
wave from issue #190.

* fix(bp-powerdns): hide CRD-emitting resources behind Capabilities gates (refs #190)

Three Catalyst overlay templates emit resources whose CRDs ship in OTHER
charts and were unconditionally rendered, causing a cold install of
bp-powerdns to fail with `no matches for kind X` on a Sovereign that
hasn't yet reconciled the upstream chart:

  - cnpg-cluster.yaml          → postgresql.cnpg.io/v1 Cluster
                                 (CRD ships in bp-cnpg)
  - api-ingress.yaml           → traefik.io/v1alpha1 Middleware
                                 (CRD ships with the Traefik controller;
                                  k3s ships it by default but a Sovereign
                                  overlay MAY disable Traefik in favour
                                  of cilium-only ingress)
  - crossplane-floatingip.yaml → compose.openova.io/v1alpha1 HetznerFloatingIP
                                 (CRD ships when the Catalyst Crossplane
                                  composition family lands — see GAP
                                  DISCLOSURE in that template)

Each is wrapped in `.Capabilities.APIVersions.Has "<group>/<version>"`.
The Traefik router-middleware annotation on the Ingress is similarly
gated so the auth posture cleanly moves to the Sovereign's chosen
ingress controller when Traefik is absent.
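
A small model of the gating described above, assuming each template needs its API version to be present and that the Crossplane template additionally needs its values toggle. The function name is hypothetical; the template and API-version pairs are the ones listed in this commit.

```python
# Template -> API version that must be registered before it renders.
GATED_TEMPLATES = {
    "cnpg-cluster.yaml": "postgresql.cnpg.io/v1",
    "api-ingress.yaml": "traefik.io/v1alpha1",
    "crossplane-floatingip.yaml": "compose.openova.io/v1alpha1",
}


def rendered_templates(api_versions, floating_ip_enabled=False):
    """Return the gated templates that would render for a given cluster."""
    rendered = []
    for template, required in GATED_TEMPLATES.items():
        if required not in api_versions:
            continue  # CRD not registered yet: skip instead of failing
        if template == "crossplane-floatingip.yaml" and not floating_ip_enabled:
            continue  # also needs crossplane.floatingIP.enabled=true
        rendered.append(template)
    return rendered


if __name__ == "__main__":
    print(rendered_templates(set()))
    print(rendered_templates(set(GATED_TEMPLATES.values()), floating_ip_enabled=True))
```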

Verified locally: `helm template` with default values renders 0 of
these resources; with `--api-versions postgresql.cnpg.io/v1
--api-versions traefik.io/v1alpha1 --api-versions compose.openova.io/v1alpha1`
plus `--set crossplane.floatingIP.enabled=true`, all three render
exactly once. Existing tests/observability-toggle.sh still passes.

Bump version 1.1.1 → 1.1.2.

* fix(bp-powerdns): bump blueprint.yaml to match Chart.yaml 1.1.2 after Capabilities gate work

---------

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
2026-04-29 20:10:14 +02:00
e3mrah
1f5c76def1
fix(platform): sync blueprint.yaml versions with Chart.yaml (#199)
* feat(ui): Playwright cosmetic + step-flow regression guards

15 regression guards in products/catalyst/bootstrap/ui/e2e/cosmetic-
guards.spec.ts that fail HARD when each user-flagged defect class
returns:

  1.  card height drift from canonical 108px
  2.  reserved right padding eating description width
  3.  logo tile drift from per-brand LOGO_SURFACE
  4.  invisible glyph (white-on-white) via luminance proxy
  5.  wizard step order Org/Topology/Provider/Credentials/Components/
      Domain/Review
  6.  legacy "Choose Your Stack" / "Always Included" tab labels
  7.  Domain step reachable before Components
  8.  CPX32 not the recommended Hetzner SKU
  9.  per-region SKU dropdown shows wrong provider catalog
  10. provision page is .html (static) not SPA route
  11. legacy bubble/edge DAG SVG markup on provision page
  12. admin sidebar drift from canonical core/console (w-56 + 7 labels)
  13. AppDetail uses tablist instead of sectioned layout
  14. job rows navigate to /job/<id> instead of expand-in-place
  15. Phase 0 banners (Hetzner infra / Cluster bootstrap) on AdminPage

Each test prints a failure message naming the canonical reference,
the source-of-truth file, and the data-testid PR needed (if any) so
the implementing agent has a precise target. No .skip() — per
INVIOLABLE-PRINCIPLES #2, missing components fail loud.

CI: .github/workflows/cosmetic-guards.yaml runs the suite on every
PR that touches products/catalyst/bootstrap/ui/** or core/console/**.

Docs: docs/UI-REGRESSION-GUARDS.md maps each test to the user's
original complaint, the canonical reference, and the green/red
semantics (5 tests intentionally RED on main today — they stay red
until the companion-agent's UI work lands).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(platform): sync blueprint.yaml versions with Chart.yaml so manifest-validation passes

---------

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 22:07:55 +04:00
hatiyildiz
1ddd569789 fix(bp-*): observability toggles default false — break circular CRD dependency
Extends the v1.1.1 hardening that started with cilium / cert-manager /
crossplane to the remaining 8 bootstrap-kit + per-Sovereign Blueprints.
Every observability toggle in every Catalyst-curated Blueprint now ships
`false`/`null` by default; the operator opts in via a per-cluster values
overlay at clusters/<sovereign>/bootstrap-kit/* once
bp-kube-prometheus-stack reconciles.

Live failure mode that prompted this (omantel.omani.works 2026-04-29):
bp-cilium @ 1.1.0 defaulted hubble.relay/ui + prometheus.serviceMonitor
to true. The upstream Cilium 1.16.5 chart renders a
monitoring.coreos.com/v1 ServiceMonitor whose CRD ships with
kube-prometheus-stack — a tier-2 Application Blueprint that depends on
the bootstrap-kit (cilium first). Helm install fails on a fresh
Sovereign with "no matches for kind ServiceMonitor in version
monitoring.coreos.com/v1 — ensure CRDs are installed first" and every
downstream HelmRelease reports `dep is not ready`. The earlier
trustCRDsExist=true mitigation only suppresses Helm's render-time gate;
the apiserver still rejects the resource at install-time.

Per-Blueprint changes:
- bp-cilium: hubble.relay.enabled, hubble.ui.enabled → false;
  hubble.metrics.enabled → null (this is the exact value that disables
  the upstream metrics ServiceMonitor template branch — verified by
  reading cilium 1.16.5's _hubble.tpl); hubble.metrics.serviceMonitor
  .enabled → false. tests/observability-toggle.sh extended with Case 4
  (default render produces no hubble-relay / hubble-ui Deployments).
- bp-flux: flux2.prometheus.podMonitor.create → false.
- bp-sealed-secrets: sealed-secrets.metrics.serviceMonitor.enabled
  → false (explicit lock; upstream already defaults false).
- bp-spire: spire.global.spire.recommendations.enabled +
  recommendations.prometheus → false.
- bp-nats-jetstream: nats.promExporter.enabled +
  promExporter.podMonitor.enabled → false.
- bp-openbao: openbao.injector.metrics.enabled +
  openbao.serviceMonitor.enabled → false.
- bp-keycloak: keycloak.metrics.enabled + metrics.serviceMonitor.enabled
  + metrics.prometheusRule.enabled → false.
- bp-gitea: gitea.gitea.metrics.* and gitea.postgresql.metrics.*
  serviceMonitor + prometheusRule → false.
- bp-powerdns: powerdns.serviceMonitor.enabled + powerdns.metrics.enabled
  → false (forward-compatibility guard; current upstream
  pschichtel/powerdns 0.10.0 has no ServiceMonitor template, but a future
  upstream bump cannot silently regress).

Each chart ships a tests/observability-toggle.sh that asserts the rule
in three cases (default off / explicit on opt-in / explicit off) — runs
under blueprint-release.yaml's chart-test gate (added bdeb0f54 + the
existing wiring) before helm push. A regression that re-introduces a
hardcoded enabled: true in any chart fails CI before the OCI artifact
is published.
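
The three-case rule can be sketched in Python as below. This is not the contents of tests/observability-toggle.sh; the `render` callable is a hypothetical stand-in for a `helm template` invocation and returns manifest text.

```python
def has_monitoring_crd(manifest):
    """True when the rendered text references a monitoring.coreos.com API."""
    return "monitoring.coreos.com/" in manifest


def check_toggle(render):
    """render(enabled) -> manifest text; enabled=None means chart defaults."""
    default_off = not has_monitoring_crd(render(None))
    opt_in_on = has_monitoring_crd(render(True))
    explicit_off = not has_monitoring_crd(render(False))
    return default_off and opt_in_on and explicit_off


if __name__ == "__main__":
    def fake_render(enabled):
        # Illustrative renderer: emits a ServiceMonitor only on opt-in.
        if enabled:
            return "apiVersion: monitoring.coreos.com/v1\nkind: ServiceMonitor\n"
        return "apiVersion: v1\nkind: Service\n"

    print(check_toggle(fake_render))
```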

Versioning:
- All 11 leaf charts bumped 1.1.0 → 1.1.1.
- products/catalyst/chart (bp-catalyst-platform umbrella) deps updated
  to 1.1.1 across the board.
- clusters/_template/bootstrap-kit/03-flux through 10-gitea bumped to
  1.1.1; clusters/omantel.omani.works/bootstrap-kit/* mirror.

docs/BLUEPRINT-AUTHORING.md §11.2 table extended to enumerate every
toggle disabled across all 11 Blueprints. References
docs/INVIOLABLE-PRINCIPLES.md #4.

GATES (all green):
- helm dep build resolves cleanly post-change for every chart whose
  upstream is published (umbrella waits on per-leaf publish).
- helm lint clean on all 11 leaves.
- helm template . default render produces zero monitoring.coreos.com
  references on every leaf (verified locally).
- tests/observability-toggle.sh PASS on all 11 leaves.

Live verification: with v1.1.1 published the omantel.omani.works
HelmRelease can roll forward without a manual values patch — Flux picks
up the new chart digest automatically (semver: 1.x in OCIRepository).
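
As a sketch of why a `1.x` range rolls forward on its own, the selection can be modelled as "highest version whose major is 1". The function names are hypothetical, and pre-release or build-metadata suffixes are deliberately not handled.

```python
def parse(version):
    """Split a plain MAJOR.MINOR.PATCH string into a comparable tuple."""
    return tuple(int(part) for part in version.split("."))


def latest_matching_1x(versions):
    """Pick the highest published version in the 1.x range, or None."""
    candidates = [v for v in versions if parse(v)[0] == 1]
    return max(candidates, key=parse) if candidates else None


if __name__ == "__main__":
    published = ["1.0.0", "1.1.0", "1.1.1", "2.0.0"]
    print(latest_matching_1x(published))
```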

Refs: issue #182.
2026-04-29 19:23:52 +02:00
hatiyildiz
43aff20254 feat(bp-*): convert all 11 bootstrap-kit charts to umbrella charts depending on upstream
Each platform/<name>/chart/Chart.yaml now declares the canonical upstream
chart as a dependencies: entry. helm dependency build pulls the upstream
payload into the OCI artifact at publish time, so Flux helm install of
bp-<name>:1.1.0 actually installs the upstream Helm release alongside the
Catalyst-curated overlays (NetworkPolicy, ServiceMonitor, ClusterIssuer,
ExternalSecret) under templates/.

Pinned upstream chart versions per platform/<name>/blueprint.yaml:
- cilium                 1.16.5  https://helm.cilium.io
- cert-manager           v1.16.2 https://charts.jetstack.io
- flux                   2.4.0   https://fluxcd-community.github.io/helm-charts
- crossplane             1.17.x  https://charts.crossplane.io/stable
- sealed-secrets         2.16.x  https://bitnami-labs.github.io/sealed-secrets
- spire                  ...     https://spiffe.github.io/helm-charts-hardened
- nats-jetstream         ...     https://nats-io.github.io/k8s/helm/charts
- openbao                ...     https://openbao.github.io/openbao-helm
- keycloak               ...     https://charts.bitnami.com/bitnami
- gitea                  ...     https://dl.gitea.com/charts
- catalyst-platform      umbrella over the 10 leaf bp-* charts via
                         helm dependency

values.yaml in each chart adopts the umbrella convention: catalystBlueprint
metadata block (provenance + version) at top level, upstream subchart
values namespaced under the dependency name.

cert-manager specifically: clusterissuer-letsencrypt-dns01.yaml gets the
helm.sh/hook: post-install,post-upgrade annotation so it applies AFTER
cert-manager controllers are running and CRDs registered (the previous
hollow-chart shape ran the ClusterIssuer at install time when CRDs
didn't exist yet, which was the omantel cluster's exact failure mode).

Wrapper chart version bumped 1.0.0 → 1.1.0 across the board (umbrella
conversion is a meaningful structural revision). Cluster manifests in
clusters/_template/bootstrap-kit/ AND clusters/omantel.omani.works/
bootstrap-kit/ updated to reference 1.1.0.

The blueprint-release.yaml workflow's helm package step needs an
explicit helm dependency build before push so the upstream subchart
bytes ship inside the OCI artifact. That CI change is a follow-up
commit on this same branch (separate file scope).
2026-04-29 17:21:36 +02:00
hatiyildiz
67fdecb770 merge: remove k8gb (#171)
2026-04-29 08:51:21 +02:00
hatiyildiz
f5daac52af refactor(platform): remove k8gb — replaced by PowerDNS lua-records (#171)
PowerDNS lua-records (`ifurlup`, `pickclosest`, `ifportup`) cover everything
k8gb was doing — geo-aware response selection, health-checked failover,
weighted round-robin — at the authoritative DNS layer. Eliminates a
separate K8s controller, CRD set, and CoreDNS plugin from every Sovereign.

Changes:
- platform/k8gb/ deleted (Chart.yaml, values.yaml, blueprint.yaml never
  authored — only README existed)
- products/catalyst/bootstrap/ui/public/component-logos/k8gb.svg deleted
- componentGroups.ts: remove k8gb component (PowerDNS already there)
- componentLogos.tsx: drop logo_k8gb + k8gb map entry
- model.ts DEFAULT_COMPONENT_GROUPS spine: replace k8gb with powerdns
- StepInfrastructure.tsx: copy refers to PowerDNS lua-records, not k8gb
- provision.html: replace k8gb tile and edges with powerdns
- catalog.generated.ts regenerated (now includes bp-powerdns)
- docs sweep — every k8gb reference in PLATFORM-TECH-STACK, NAMING-
  CONVENTION, SOVEREIGN-PROVISIONING, SRE, ARCHITECTURE, GLOSSARY,
  COMPONENT-LOGOS, IMPLEMENTATION-STATUS, BUSINESS-STRATEGY,
  TECHNOLOGY-FORECAST, README, infra/hetzner/README, platform READMEs
  (cilium, external-dns, failover-controller, litmus, flux, opentofu)
  rewritten to point at PowerDNS lua-records / MULTI-REGION-DNS.md.
  Historical entries in VALIDATION-LOG.md preserved as audit trail.
- New docs/MULTI-REGION-DNS.md — canonical reference for the lua-record
  patterns (ifurlup all/pickclosest/pickfirst, ifportup, pickwhashed),
  Application Placement → lua-record selector mapping, when to add a
  second Sovereign region, operational checks.

Closes #171.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 08:51:09 +02:00
hatiyildiz
f4679e2748 fix(powerdns): enable gpgsql-dnssec for DNSSEC API (1.0.6)
Without `gpgsql-dnssec=yes` the gpgsql backend driver does not expose
the DNSSEC API surface — `PUT /zones/<zone>` with `dnssec:true` returns
422 "no DNSSEC-capable backends are loaded". This blocks pool-domain-
manager from enabling DNSSEC on every Sovereign child zone (mandatory
per docs/PLATFORM-POWERDNS.md).

Fix lands in additionalConfig so the directive is rendered alongside
`default-soa-edit-signed=INCEPTION-EPOCH` and `direct-dnskey=yes`. No
schema migration needed — the gpgsql 5.0.3 schema already includes the
cryptokeys table; the missing piece was just the backend feature flag.

Bumps Chart.yaml to 1.0.6. Verified: after this lands the PUT call
returns 204 and POST /cryptokeys mints a usable KSK.
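
For reference, the PUT call mentioned above can be sketched with the standard library. The request is only constructed, not sent. The base URL, API key and zone name are placeholders, and the full `/api/v1/servers/localhost/zones/...` path is assumed from the PowerDNS HTTP API layout rather than quoted from this commit.

```python
import json
from urllib.request import Request


def dnssec_enable_request(api_base, zone, api_key, server_id="localhost"):
    """Build (but do not send) the PUT that turns DNSSEC on for a zone."""
    url = f"{api_base}/api/v1/servers/{server_id}/zones/{zone}"
    body = json.dumps({"dnssec": True}).encode("utf-8")
    headers = {"X-API-Key": api_key, "Content-Type": "application/json"}
    return Request(url, data=body, method="PUT", headers=headers)


if __name__ == "__main__":
    req = dnssec_enable_request("http://powerdns:8081", "example.org.", "placeholder-key")
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
```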

Discovered while bringing up openova#168 (PDM per-Sovereign zones).
2026-04-29 08:42:18 +02:00
hatiyildiz
fa84cac438 fix(powerdns): plain ALTER TABLE in postInitSQL (avoid $$ escape battle, 1.0.5)
The DO block in 1.0.4 rendered with $$ collapsed to $ by the time it
reached CNPG's postInitApplicationSQL — "syntax error at or near $".
Both Helm template processing and the YAML scalar block were chewing on
the dollar signs.

Replaced with explicit ALTER TABLE statements (one per gpgsql table) +
GRANT — same end state, no PL/pgSQL quoting required. Verified at
runtime on contabo-mkt: powerdns Pod went CrashLoopBackOff →
Running 1/1 immediately after the ALTER statements were run by hand.
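
A sketch of generating one plain statement per table, which avoids any `$$` quoting. The table list is an assumption based on the stock gpgsql schema and the `pdns` owner is taken from the neighbouring commit; neither is copied from the chart's postInitApplicationSQL.

```python
# Assumed table names from the stock gpgsql schema (not read from the chart).
GPGSQL_TABLES = [
    "domains",
    "records",
    "supermasters",
    "comments",
    "domainmetadata",
    "cryptokeys",
    "tsigkeys",
]


def ownership_sql(owner, tables=GPGSQL_TABLES):
    """Return plain ALTER TABLE statements plus the schema GRANT."""
    statements = [f"ALTER TABLE {table} OWNER TO {owner};" for table in tables]
    statements.append(f"GRANT ALL PRIVILEGES ON SCHEMA public TO {owner};")
    return statements


if __name__ == "__main__":
    print("\n".join(ownership_sql("pdns")))
```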

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 08:17:28 +02:00
hatiyildiz
214a3e1ada fix(powerdns): grant table ownership to pdns user in CNPG bootstrap (1.0.4)
Verified at runtime on contabo-mkt: postInitApplicationSQL runs as the
postgres superuser, not the application owner, so the schema tables
created by the bootstrap block were owned by postgres. PowerDNS connects
as 'pdns' and got 'permission denied for table domains' on the first
SELECT against the zone cache.

Added a DO block at the end of the schema bootstrap that walks every
table in the public schema and ALTERs OWNER TO {{ .Values.postgres.cluster.owner }}
plus GRANT ALL PRIVILEGES ON SCHEMA public, the same shape PDM uses. The
fix was verified at runtime on contabo-mkt: the powerdns Pod went from
CrashLoopBackOff to 1/1 Ready immediately after the same DDL was run by
hand.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 08:14:12 +02:00
hatiyildiz
db20e9d42b fix(powerdns): dnsdist backend resolution + drop DnstapLogAction (1.0.3)
dnsdist 1.9.14 runtime errors:
  1. newServer{address='powerdns:5353'} → "Unable to convert presentation
     address" — dnsdist's address parser expects IP[:port], not a DNS
     name. Kubernetes auto-injects POWERDNS_SERVICE_HOST as an env var
     into every pod in the same namespace as the powerdns Service; using
     that gives us the ClusterIP at config-load time without needing an
     init container or runtime DNS resolution.
  2. DnstapLogAction(name, bool, fn) signature changed in 1.9 — the
     2nd parameter now expects a shared_ptr to a RemoteLoggerInterface,
     not a boolean. Rather than wire up a remote dnstap server (which
     adds a moving part for marginal observability gain), drop the line.
     Catalyst observability is the dnsdist /metrics endpoint surfaced
     to Prometheus + the k8s container log.
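
The env-var approach in item 1 can be sketched as below, assuming the usual Kubernetes naming rule for injected Service variables (upper-case, dashes to underscores, `_SERVICE_HOST` suffix). The function names and the example ClusterIP are hypothetical.

```python
import os


def service_host_var(service_name):
    """Name of the env var Kubernetes injects for a Service's ClusterIP."""
    return service_name.upper().replace("-", "_") + "_SERVICE_HOST"


def backend_address(service_name, port, env=None):
    """Build an IP:port backend address from the injected variable."""
    env = os.environ if env is None else env
    host = env[service_host_var(service_name)]  # KeyError if the Service is absent
    return f"{host}:{port}"


if __name__ == "__main__":
    fake_env = {"POWERDNS_SERVICE_HOST": "10.43.0.12"}  # illustrative ClusterIP
    print(backend_address("powerdns", 5353, fake_env))
```

One caveat of this design is that the variable is only injected for Services that already exist when the pod starts, so the sketch raises `KeyError` instead of guessing an address.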

Bumped chart to 1.0.3.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 08:12:27 +02:00
hatiyildiz
20c0543806 fix(powerdns): correct dnsdist image tag + drop readOnlyRootFilesystem (1.0.2)
Two runtime issues caught during first contabo-mkt rollout:

1. dnsdist image tag was "1.9" (default) — that tag doesn't exist in
   docker.io/powerdns/dnsdist-19. The 1.9.x line publishes 1.9.0 .. 1.9.14
   (no rolling "1.9" alias). Pinned to 1.9.14 (current latest).

2. PowerDNS pod crash-looped on Errno 30 (Read-only file system:
   /etc/powerdns/pdns.d/0-api.conf.conf). The upstream pdns_server-startup
   script writes rendered config files to /etc/powerdns/pdns.d/ at
   container start, and the upstream template doesn't expose an emptyDir
   we could redirect that path to. Set readOnlyRootFilesystem=false with
   a verbose comment explaining why; the rest of the security context
   (runAsNonRoot, runAsUser=953, drop ALL caps) stays in place.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 08:06:39 +02:00
hatiyildiz
19d926bfeb fix(powerdns): avoid recursive include in dnsdist checksum, bump to 1.0.1
Helm flagged dnsdist.yaml's checksum/config annotation as a recursive
template self-reference (the file included itself). Replaced with a
hash of the rendered .Values.dnsdist.config (post-tpl), which is the
substantive content the annotation is supposed to track anyway.
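
The idea behind the replacement annotation can be sketched in Python: hash the already-rendered config text and attach the digest as `checksum/config`, so the pod template changes whenever the config does. The function name is hypothetical and the chart itself does this in Helm templating, not Python; SHA-256 is assumed here as the hash.

```python
import hashlib


def config_checksum_annotation(rendered_config: str) -> dict:
    """Return a pod annotation whose value changes when the config changes."""
    digest = hashlib.sha256(rendered_config.encode("utf-8")).hexdigest()
    return {"checksum/config": digest}


if __name__ == "__main__":
    # Hypothetical config snippets, for illustration only.
    before = config_checksum_annotation("setLocal('0.0.0.0:53')\n")
    after = config_checksum_annotation("setLocal('0.0.0.0:5353')\n")
    print(before != after)
```

Because the hash covers the rendered values rather than the template file, the template no longer needs to include itself.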

Bumped Chart.yaml to 1.0.1 so the OCIRepository semver "1.x" picks
up the fix automatically on next reconcile. Blueprint API version stays
at 1.0.0 (Blueprint contract is unchanged).
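
How a "1.x" range comes to select 1.0.1 can be illustrated with a simplified matcher. This is a sketch under stated assumptions, not Flux's actual semver implementation: it handles only constraints of the form `N.x` and plain `MAJOR.MINOR.PATCH` tags, and it ignores pre-release and build metadata entirely.

```python
def parse(tag):
    """Parse 'MAJOR.MINOR.PATCH' into a tuple of ints, or return None."""
    parts = tag.split(".")
    if len(parts) != 3 or not all(p.isdigit() for p in parts):
        return None
    return tuple(int(p) for p in parts)


def latest_matching(tags, constraint):
    """Return the highest tag whose major version matches an 'N.x' constraint."""
    major_text, _, rest = constraint.partition(".")
    if rest != "x" or not major_text.isdigit():
        raise ValueError("only 'N.x' constraints are supported in this sketch")
    major = int(major_text)
    candidates = []
    for tag in tags:
        version = parse(tag)
        if version is not None and version[0] == major:
            candidates.append((version, tag))
    # Tuples compare numerically, so 1.0.10 sorts above 1.0.9.
    return max(candidates)[1] if candidates else None


if __name__ == "__main__":
    print(latest_matching(["1.0.0", "1.0.1", "2.0.0"], "1.x"))
```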

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 08:02:53 +02:00
hatiyildiz
0190c60520 feat(powerdns): bp-powerdns wrapper chart + per-Sovereign zone model (#167)
Introduces the bp-powerdns Catalyst Blueprint wrapper as the authoritative
DNS service for every Sovereign zone. Replaces k8gb in componentGroups.ts —
PowerDNS Lua records cover geo + health-checked failover natively, removing
the dedicated GSLB controller.

Wrapper chart (platform/powerdns/chart/):
  - Chart.yaml — bp-powerdns 1.0.0, depends on pschichtel/powerdns 0.10.0
    upstream (verified Artifact Hub publisher; tracks docker.io/powerdns/
    pdns-auth-50 at appVersion 5.0.3; a survey of Artifact Hub found no
    official PowerDNS chart)
  - values.yaml — 3 replicas, gpgsql backend, DNSSEC ECDSAP256SHA256,
    lua-records ON, dnsdist 100 qps default per source IP, REST API at
    pdns.openova.io/api behind Traefik basicAuth
  - blueprint.yaml — Catalyst metadata, visibility=unlisted (mandatory
    infra), section pts-3-2-gitops-and-iac
  - templates/cnpg-cluster.yaml — separate `pdns-pg` Postgres (1 instance,
    5Gi, postgres-16) with PowerDNS auth-5.0.3 schema applied via
    postInitApplicationSQL
  - templates/dnsdist.yaml — companion Deployment + ConfigMap with
    rate-limiting policy (MaxQPSIPRule per source IP)
  - templates/api-ingress.yaml — Traefik Ingress + basicAuth Middleware
  - templates/anycast-endpoint.yaml — placeholder Service of type
    LoadBalancer (Phase-0 stand-in for the anycast Floating IP target state)
  - templates/crossplane-floatingip.yaml — DISCLOSED GAP: target-state
    XHetznerFloatingIP composite, disabled by default until the
    Crossplane composition is authored (the existing compositions cover
    Server/Network/Firewall/LoadBalancer/PoolAllocation only). The
    placeholder anycast Service is the operational stand-in.
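
The per-source-IP rate limit mentioned for dnsdist above (100 qps by default) can be pictured as one token bucket per client address. The Python below is an illustration of that concept only: the class name is hypothetical, and it is not a description of how dnsdist's MaxQPSIPRule is implemented internally.

```python
from typing import Dict, Optional


class PerSourceLimiter:
    """One token bucket per source IP; each query costs one token."""

    def __init__(self, qps: float, burst: Optional[float] = None):
        self.qps = qps
        self.burst = burst if burst is not None else qps
        self._tokens: Dict[str, float] = {}
        self._last: Dict[str, float] = {}

    def allow(self, source_ip: str, now: float) -> bool:
        """Return True if a query from source_ip at time `now` is within budget."""
        last = self._last.get(source_ip, now)
        tokens = self._tokens.get(source_ip, self.burst)
        # Refill in proportion to elapsed time, capped at the burst size.
        tokens = min(self.burst, tokens + (now - last) * self.qps)
        self._last[source_ip] = now
        allowed = tokens >= 1.0
        self._tokens[source_ip] = tokens - 1.0 if allowed else tokens
        return allowed


if __name__ == "__main__":
    limiter = PerSourceLimiter(qps=100)
    # 150 queries from one address at the same instant.
    accepted = sum(limiter.allow("192.0.2.1", 0.0) for _ in range(150))
    print(accepted)
```

Keeping the buckets keyed by address means one noisy client exhausts only its own budget; other sources are unaffected.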

Per docs/INVIOLABLE-PRINCIPLES.md:
  - #4 (never hardcode): every value flows from values.yaml or a
    referenced K8s Secret. Image tags come from upstream chart appVersion,
    never duplicated.
  - #8 (disclose every divergence): the XHetznerFloatingIP gap is
    documented in the template + in docs/PLATFORM-POWERDNS.md ("Anycast
    deferral" section).

componentGroups.ts: powerdns added to SPINE group as mandatory (depends on
cnpg). external-dns now lists powerdns as a dependency. k8gb removed.

docs/PLATFORM-POWERDNS.md: per-Sovereign zone model, DNSSEC posture, REST
API contract, lua-records GSLB pattern, dnsdist policy, anycast deferral
runbook, first-deploy procedure for Contabo-mkt.

Closes #167 (Phase 1 of public-repo work; Phase 4 cluster manifest lands
in openova-private feat/powerdns-deploy).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 07:49:51 +02:00