openova


e3mrah 4e199f137b fix(dns): auto-write per-Sovereign A records into parent zone after Phase-0 (#1505)

* fix(canvas): canonicalise Job.DependsOn entries with install- prefix; fix invisible edges

PR #1499 plumbed spec.dependsOn end-to-end and verified deps populate on first event (no /refresh-watch needed). But the openova-flow snapshot composer (flow_snapshot_local.go) emits finish-to-start relationships where fromId = jobs.JobID(deploymentID, dep). Without the "install-" prefix on each dep entry, fromId came out as:

    <dep>:hel1-2:seaweedfs   (secondary, missing "install-")
    <dep>:gitea              (primary, missing "install-")

But the FlowNode ids in the snapshot are:

    <dep>:install-hel1-2:seaweedfs
    <dep>:install-gitea

The FE canvas adapter matches by exact id → every finish-to-start rel points at a non-existent node → 224 rels emitted, 0 edges rendered.

Caught on prov t103.omani.works (005080699326a7ac, 2026-05-15):

    curl /v1/flows/.../snapshot → 376 rels total: 152 contains, 224 finish-to-start
    every finish-to-start fromId malformed
    canvas: sibling edges invisible across all 135 install Jobs

Fix in two places:

internal/handler/phase1_watch.go (spawnSecondaryRegionWatchers emit):
    Region-prefix each dep AND inject the "install-" prefix so ev.DependsOn = ["install-<region>:<chart>"] before the bridge receives the event. Symmetric with how ev.Component is constructed.

internal/jobs/helmwatch_bridge.go (OnHelmReleaseEvent):
    Canonicalise every dep entry: if it doesn't already start with JobNamePrefix ("install-"), prepend it. Idempotent on entries that are already canonical (set by the phase1_watch.go path). Covers the primary-region path (bare chart names like "gitea") too: Job.DependsOn now stores "install-gitea", which matches the composer's emitted FromId exactly.

Tests: go build ./... + go test on internal/jobs + helmwatch + provisioner all green. (Pre-existing TestHandleWhoami_* flake in handler is unrelated.)
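The canonicalisation rule described for helmwatch_bridge.go can be sketched roughly as follows. This is a minimal illustration, not the code in the repository: the helper name canonicaliseDeps is hypothetical, and only the JobNamePrefix value ("install-") comes from the commit message.

```go
package main

import (
	"fmt"
	"strings"
)

// JobNamePrefix mirrors the "install-" prefix named in the commit message.
const JobNamePrefix = "install-"

// canonicaliseDeps is a hypothetical helper. It returns a copy of deps in
// which every entry starts with JobNamePrefix. Entries that already carry
// the prefix are left unchanged, so running it twice is a no-op.
func canonicaliseDeps(deps []string) []string {
	out := make([]string, 0, len(deps))
	for _, d := range deps {
		if !strings.HasPrefix(d, JobNamePrefix) {
			d = JobNamePrefix + d
		}
		out = append(out, d)
	}
	return out
}

func main() {
	// A primary-region dep, a secondary-region dep, and an already canonical one.
	deps := []string{"gitea", "hel1-2:seaweedfs", "install-hel1-2:gitea"}

	once := canonicaliseDeps(deps)
	twice := canonicaliseDeps(once) // second pass changes nothing

	fmt.Println(once)  // [install-gitea install-hel1-2:seaweedfs install-hel1-2:gitea]
	fmt.Println(twice) // [install-gitea install-hel1-2:seaweedfs install-hel1-2:gitea]
}
```

Because the check is a plain prefix test, the same helper could in principle be applied both to event-carried deps and to values read back from an existing store.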
* fix(canvas): canonicalise resolved DependsOn too; kill malformed prior values

Follow-up to PR #1500. The canon block ran on the event-carried dependsOn arg, but the 3-tier resolve preferred the existing-store value when non-empty, which for any Job written BEFORE PR #1500 rolled out was malformed (no "install-" prefix). The t103.omani.works snapshot kept emitting 224 finish-to-start rels with malformed fromIds because the existing Job rows held "hel1-2:gitea" entries that the resolve preserved verbatim.

Fix: after the 3-tier resolve, run a final canonicalisation pass on resolvedDeps so every persisted entry is canonical regardless of whether it came from event-carried (already canon by my prior block) or from existing-store (potentially malformed legacy).

Note: this fix only takes effect on the NEXT HR state transition for a given Job. HRs already in terminal state (e.g. t103's 135 succeeded HRs) will keep their malformed deps until a new event fires. The loop's next cycle (t104+) writes canonical from event 1.

* fix(provisioner): auto-derive cluster_mesh_name + cluster_mesh_id for multi-region provs

Caught on prov t104.omani.works (98395b3d9bd9c1aa, 2026-05-15): operator submitted a multi-region body (3 regions cpx52) but omitted ClusterMeshName/ClusterMeshID. catalyst-api defaulted them to "" and 0. Tofu wrote cluster_mesh_name="" + cluster_mesh_id=0 to tfvars. Flux postBuild.substitute rendered cilium-config with cluster.name=default + cluster.id=0. Cilium kvstoremesh refused to start:

    "ClusterID 0 is reserved"
    clustermesh-apiserver CrashLoopBackOff 16 restarts.

No mesh ever formed. Cross-region observability + east-west routing permanently broken.

Auto-derivation:

    ClusterMeshName: <first-fqdn-label>-mesh
        e.g. t105.omani.works → "t105-mesh"
    ClusterMeshID: (sha256(deploymentID)[:4] as uint32) mod 252 + 1
        Range [1, 252]; main.tf increments for secondaries so the max id any region sees is primary + (regions - 1) ≤ 254. ID 255 is intentionally avoided (Cilium sentinel).

Operator override still respected: auto-derive only kicks in when both fields are zero/empty AND len(Regions) > 1. Single-region provs stay at "" / 0 (no mesh needed).

Tested derive helpers against the last 4 prov IDs; all land in valid range:

    98395b3d9bd9c1aa → 74 (secondaries 75, 76)
    005080699326a7ac → 29 (secondaries 30, 31)
    22af2b1120158239 → 139
    c9df5eed1c1ba6cf → 180

Build + provisioner unit tests green.

* fix(cloudinit): thread cluster.name + cluster.id into pre-Flux cilium-values.yaml

t105.omani.works (a6c0f5dfebd63bd0, 2026-05-15) found that PR #1502's catalyst-api auto-derive (cluster_mesh_name=t105-mesh, cluster_mesh_id=99) correctly reached cilium-config, but only AFTER Flux helm-upgraded the release. The pre-Flux Cilium install (cloud-init line 1473) used /var/lib/catalyst/cilium-values.yaml, which DIDN'T carry cluster.name or cluster.id, so cilium-agent started with the chart defaults ("default", 0). The Flux upgrade then changed cilium-config, but the already-running cilium-agent kept its in-memory cluster.name="default" because it reads the ConfigMap once at startup.

Downstream consequences observed live on t105:

    hubble-relay CrashLoopBackOff: "tls: failed to verify certificate: x509: certificate is valid for .t105-mesh.hubble-grpc.cilium.io, not catalyst-t105-omani-works-cp1.default.hubble-grpc.cilium.io"
    clustermesh peer announcements use stale "default" identity → cross-region mesh handshakes x509-fail.

Fix: include cluster.name + cluster.id in the pre-Flux helm install's values file, sourced from the templatefile() vars cluster_mesh_name + cluster_mesh_id (already threaded per-region by main.tf:381-382 and :900-901). Now the first cilium-agent process announces with the correct identity, no helm-upgrade race.
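The auto-derivation rule from the provisioner fix above can be sketched as below. The function names are hypothetical. The commit message does not say how the first four hash bytes are read or what exactly is hashed, so big-endian byte order and hashing the deployment ID as a plain string are both assumptions here; this sketch is therefore not guaranteed to reproduce the IDs listed in the message.

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
	"strings"
)

// deriveMeshName returns "<first-fqdn-label>-mesh", as described in the
// commit message (hypothetical helper name).
func deriveMeshName(fqdn string) string {
	label, _, _ := strings.Cut(fqdn, ".")
	return label + "-mesh"
}

// deriveMeshID maps a deployment ID into the range [1, 252].
// ASSUMPTIONS: the ID is hashed as its string bytes, and the first four
// digest bytes are interpreted as a big-endian uint32.
func deriveMeshID(deploymentID string) uint32 {
	sum := sha256.Sum256([]byte(deploymentID))
	return binary.BigEndian.Uint32(sum[:4])%252 + 1
}

func main() {
	fmt.Println(deriveMeshName("t105.omani.works")) // t105-mesh

	id := deriveMeshID("98395b3d9bd9c1aa")
	fmt.Println(id >= 1 && id <= 252) // true

	// With three regions the highest secondary id is primary + 2,
	// so the worst case is 252 + 2 = 254, which stays below 255.
	fmt.Println(id+2 <= 254) // true
}
```

Taking the value mod 252 and adding 1 keeps 0 out of the range, which matters because the message reports that Cilium rejects ClusterID 0 as reserved.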
* docs(sandbox): design docs for the Sandbox product

Captures the agreed product shape, end-user journeys (developer + Sovereign admin), technical architecture (native agent TUI via xterm.js + WebSocket + PTY, card protocol for mobile, MCP catalogue, four knowledge layers, JetStream/SSE integration), and the conversational-provisioning surface that reuses the same shell with a narrow MCP toolbox as an alternative to the catalyst-ui wizard.

Status: design only, no implementation. Identifies one prerequisite (long-lived API token carrying org_id claim) with the exact files to extend in core/services/auth and platform/keycloak.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(sovereign-tls): tls-restart Job needs list+watch on deployments/daemonsets

Caught on prov t110.omani.works (fe09897a1b6b3c1d, 2026-05-15): the cilium-envoy-tls-restart Job stuck Running 10m+ with:

    W reflector.go:561] failed to list unstructured.Unstructured: deployments.apps "cilium-operator" is forbidden: User "system:serviceaccount:kube-system:cilium-envoy-tls-restart" cannot list resource "deployments" in API group "apps" in the namespace "kube-system"

The Role grants `get` + `patch`, but `kubectl rollout status` (which the Job runs after `rollout restart`) does NOT just GET; internally it uses client-go informerwatcher to LIST+WATCH the resource. Without those verbs the informer fails and `rollout status` hangs until activeDeadlineSeconds (900s). The Job never restarts cilium-envoy, so console.<fqdn> never serves.

Fix: add `list` + `watch` to both rules (cilium-operator Deployment + cilium-envoy DaemonSet). Scoped by resourceName, so the SA still can't enumerate or watch other workloads.
* fix(dns): auto-write per-Sovereign A records into parent zone after Phase-0

Caught on prov t110.omani.works (fe09897a1b6b3c1d, 2026-05-15):

    dig +short A console.t110.omani.works @ns1.openova.io
    → 49.12.16.160   ← ORPHAN IP, Hetzner reassigned to a 3rd party

The mothership PowerDNS had ZERO records for t110's hostnames. A stale wildcard `*.omani.works` (manual leftover from earlier provs) was returning a wrong IP that no longer belonged to the openova project at Hetzner, sending operator traffic to an unrelated tenant.

The deeper gap: catalyst-api never auto-wrote the per-Sovereign A records that browsers need to resolve. The existing parent-domain flow has:

    pdmCreatePowerDNSZone: stub at parent_domains.go:1096
    certManagerStep: stub at parent_domains.go:1141
    commitPDMWithRetry: runs ONLY for pool-allocated FQDNs (otech<N>.<pool>), NOT BYO

So BYO-style (operator-owned parent like omani.works + arbitrary Sovereign FQDN like t111.omani.works) left the parent zone untouched.

Fix:

internal/powerdns/client.go
    + PatchRRSets(ctx, zone, rrsets): PATCH REPLACE on /api/v1/servers/{id}/zones/{zone} with idempotent re-runs

internal/handler/handler.go
    + powerdnsZoneClient interface gains PatchRRSets; wired automatically by SetPowerDNSZoneClient

internal/handler/sovereign_dns_records.go (new)
    + CanonicalSovereignSubdomains: console / auth / gitea / harbor / registry / bao / grafana / hubble / pdns / openova-flow / marketplace / api / guacamole
    + upsertSovereignParentZoneRecords: PATCH the parent zone with one A record per subdomain → primary LB IP
    + upsertSovereignParentZoneRecordsFromResult: deployment-flow wrapper that iterates every parentDomain in the request body

internal/handler/deployments.go
    + Call upsertSovereignParentZoneRecordsFromResult right after commitPDMWithRetry on Phase-0 success; best-effort (log + continue), so a PowerDNS hiccup doesn't bail the Sovereign

Operator override via CATALYST_SOVEREIGN_SUBDOMAINS not yet wired; filed as follow-up. Today the canonical list is the chart-side HTTPRoute list, kept aligned via the comment in sovereign_dns_records.go.

---------

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-15 21:12:38 +04:00
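For the fix(dns) change in the commit above, the request body for a PATCH REPLACE with one A record per subdomain might be assembled along these lines. This is a sketch under stated assumptions: the type and function names are hypothetical and are not the PatchRRSets implementation, the TTL of 300 is an arbitrary choice, and 203.0.113.10 is a documentation address standing in for the primary LB IP. The JSON field names (rrsets, changetype, records, content, disabled) follow the PowerDNS Authoritative HTTP API.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// record and rrset mirror the JSON shape of the PowerDNS zone PATCH body.
type record struct {
	Content  string `json:"content"`
	Disabled bool   `json:"disabled"`
}

type rrset struct {
	Name       string   `json:"name"`
	Type       string   `json:"type"`
	TTL        int      `json:"ttl"`
	ChangeType string   `json:"changetype"`
	Records    []record `json:"records"`
}

type patchBody struct {
	RRSets []rrset `json:"rrsets"`
}

// buildARecordPatch is a hypothetical helper: it emits one REPLACE rrset per
// subdomain, each pointing at lbIP. REPLACE overwrites the whole rrset, so
// sending the same body again leaves the zone unchanged.
func buildARecordPatch(sovereignFQDN, lbIP string, subdomains []string, ttl int) patchBody {
	body := patchBody{RRSets: make([]rrset, 0, len(subdomains))}
	for _, sub := range subdomains {
		body.RRSets = append(body.RRSets, rrset{
			Name:       sub + "." + sovereignFQDN + ".", // PowerDNS expects a trailing dot
			Type:       "A",
			TTL:        ttl,
			ChangeType: "REPLACE",
			Records:    []record{{Content: lbIP}},
		})
	}
	return body
}

func main() {
	body := buildARecordPatch("t111.omani.works", "203.0.113.10", []string{"console", "auth"}, 300)
	out, err := json.MarshalIndent(body, "", "  ")
	if err != nil {
		panic(err)
	}
	// This JSON would be sent with PATCH to /api/v1/servers/{id}/zones/{zone}.
	fmt.Println(string(out))
}
```

Keeping the payload construction separate from the HTTP call makes the best-effort behaviour easy to preserve: a caller can log a failed PATCH and continue without touching the record-building logic.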
axon	feat(axon): make qwen3-coder thinking mode toggleable via request parameter	2026-04-26 09:20:33 +02:00
catalyst	fix(dns): auto-write per-Sovereign A records into parent zone after Phase-0 (#1505)	2026-05-15 21:12:38 +04:00
continuum	feat(continuum): F — dry-run report + post-switchover health check + audit-emit coverage (slice F-1+F-2+F-3, #1101) (#1161)	2026-05-09 08:33:37 +04:00
cortex	docs(pass-52): bundled date-sweep + cross-component namespace clean; knative clean	2026-04-28 00:37:21 +02:00
dmz-vcluster	fix: mark bp-dmz-vcluster + bp-netbird default-off for smoke-render gate (#1286)	2026-05-10 15:57:18 +04:00
fabric	docs(seaweedfs+guacamole): replace MinIO with SeaweedFS as unified S3 encapsulation; add Guacamole to bp-relay	2026-04-28 10:23:46 +02:00
fingate	docs(pass-52): bundled date-sweep + cross-component namespace clean; knative clean	2026-04-28 00:37:21 +02:00
openova-flow	fix(openova-flow): COPY go.sum + go mod download in Dockerfile (#1475)	2026-05-14 14:23:57 +04:00
relay	docs(seaweedfs+guacamole): replace MinIO with SeaweedFS as unified S3 encapsulation; add Guacamole to bp-relay	2026-04-28 10:23:46 +02:00
sandbox	fix(sovereign-tls): tls-restart Job needs list+watch verbs (#1504)	2026-05-15 21:02:37 +04:00