Commit Graph

13 Commits

Author SHA1 Message Date
e3mrah
2ff50f0591
fix(bp-newapi+services-build): imagePullSecrets on Pod, sed bumps values.yaml smeTag (#955)
Two SME-blocker bugs caught live on otech113 (alice signup gate 5 fails on
fresh Sovereign):

#952 — bp-newapi 1.4.0 Pod has no imagePullSecrets, so kubelet pulls
PRIVATE ghcr.io/openova-io/openova/{newapi-mirror,services-metering-sidecar}
anonymously and gets 403 Forbidden. Fix:

- Templatize spec.imagePullSecrets on Deployment + channel-seed Job.
- Default values.yaml `imagePullSecrets: [{name: ghcr-pull}]`.
- Add `newapi` to flux-system/ghcr-pull's reflector
  reflection-{allowed,auto}-namespaces in cloudinit-control-plane.tftpl
  so bp-reflector mirrors the source Secret into the namespace
  automatically on every fresh Sovereign.
- Bump bp-newapi 1.4.0 -> 1.4.1, update _template overlay.

#953 — services-build.yaml's image-rewrite loop only matched the
hardcoded `image: ghcr.io/.../services-<svc>:<sha>` form. 7 of 8
sme-services templates use `image: "{{ ... }}/services-<svc>:{{
.Values.images.smeTag }}"`. Each services-build run bumped only
auth.yaml while reporting "update sme service images to ${SHA}",
leaving the live Pod on stale bytes (PR #951's #941 fix never reached
services-catalog despite the merge + chart bump chain). Fix:

- After the hardcoded loop, also bump `images.smeTag` in
  products/catalyst/chart/values.yaml with a strict regex match
  (`^  smeTag: "<sha>"$`); refuse to auto-bump if the line shape
  changes (defends against silent drift if a contributor renames the
  field).
- Mirror the change into the retry-path `rewrite()` function so a
  reset-to-origin/main retry does not recreate the original bug.

Tests:

- platform/newapi/chart/tests/imagepullsecrets-render.sh — 4 cases
  asserting the Deployment and channel-seed Job carry the default
  ghcr-pull reference, that an empty override suppresses the block,
  and that custom secret names propagate (Inviolable Principle #4).
- tests/integration/services-build-rewrite.sh — 3 cases reproducing
  the workflow's rewrite logic on a sandboxed copy of the live
  chart, asserting both auth.yaml's hardcoded line AND values.yaml's
  smeTag get bumped, that helm-render of the catalyst chart with
  the bumped values produces all 8 SME-service Deployments at the
  new SHA, and that an idempotent re-bump to a second SHA also lands
  cleanly.

Refs: #952 #953 (umbrella #915 — alice signup gate 5).

Co-authored-by: hatiyildiz <143030955+hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 15:47:37 +04:00
e3mrah
b5c9839da7
feat(phase-8b): sovereign wizard auth-gate + handover JWT minting + Playwright CI fixes (#611)
Squash of PR #611 (feat/607) + PR #615 (feat/605) Phase-8b deliverables:

UI:
- AuthCallbackPage: mode-aware dispatch (catalyst-zero → magic-link server
  callback; sovereign → client-side OIDC token exchange via oidc.ts)
- Router: sovereign console routes (/console/*), DETECTED_MODE index redirect,
  authCallbackRoute dedup fix, authHandoverRoute safety net
- StepSuccess: mints RS256 handover JWT via POST /deployments/{id}/mint-handover-token
  before redirecting operator to Sovereign console (falls back to plain URL on error)

API:
- main.go: wires handoverjwt.LoadOrGenerate signer from CATALYST_HANDOVER_KEY_PATH env
- deployments.go: stamps HandoverJWTPublicKey from signer.PublicJWK() at create time
- provisioner.go: injects HandoverJWTPublicKey into Tofu vars JSON
- auth.go: /auth/handover endpoint for seamless single-identity flow

Infra:
- cloudinit-control-plane.tftpl: writes handover JWT public JWK to /var/lib/catalyst/
- variables.tf: handover_jwt_public_key variable (sensitive, default empty)

Chart:
- api-deployment.yaml / ui-deployment.yaml / values.yaml: expose handover JWT env vars

Playwright CI fixes:
- playwright-smoke.yaml / cosmetic-guards.yaml: health-check URL /sovereign/wizard → /wizard
- playwright.config.ts: BASEPATH default /sovereign → / + baseURL construction fix
- cosmetic-guards.spec.ts: provision URL /sovereign/provision/* → /provision/*
- sovereign-wizard.spec.ts: WIZARD_URL /sovereign/wizard → /wizard

Closes #605, #606, #607. Fixes Playwright CI (#142 sovereign wizard smoke tests).

Co-authored-by: e3mrah <e3mrah@openova.io>
2026-05-02 19:17:56 +04:00
e3mrah
1e7d1e67c9
test(e2e): omantel handover Playwright scaffold for Phase 8 (closes #429) (#432)
Phase 8 of the omantel handover (#369) needs an automated E2E that proves
DoD: omantel.omani.works runs as a fully self-sufficient Sovereign with
zero contabo dependency post-handover. Today this is a SCAFFOLD — when
Phase 4/6/7 land, dispatching the new workflow against a live omantel is
the entire Phase 8.

Canonical seam (anti-duplication, per memory/feedback_anti_duplication_seam_first.md):
  - tests/e2e/playwright/tests/  ← mirror of sovereign-wizard.spec.ts shape
    (NOT specs/ as the issue body said — actual repo path is tests/)
  - tests/e2e/playwright/playwright.config.ts (BASE_URL handling, retries,
    workers=1, reporter=list) — reused as-is
  - tests/e2e/playwright/tests/_helpers.ts:reachable() — reused for the
    pre-flight skip-when-unreachable pattern
  - .github/workflows/playwright-smoke.yaml — workflow shape (checkout v4,
    setup-node v4, npm install, playwright install --with-deps chromium,
    upload-artifact on failure) — mirrored, NOT duplicated

What ships:
  - tests/e2e/playwright/tests/omantel-handover.spec.ts (NEW, 6 tests):
      1. sovereign Ready + 23/23 blueprints
      2. all bp-* HelmReleases Ready=True
      3. catalyst-platform self-hosts (healthz + dashboard "23 / 23 ready")
      4. vendor-agnostic Object Storage (post-#425 canonical secret name
         flux-system/object-storage — NOT hetzner-object-storage)
      5. dig +trace omantel.omani.works ends at omantel NS, not contabo
      6. zero contabo dependency (omantel /api/healthz keeps returning 200)
    Self-skips when OMANTEL_BASE_URL/OMANTEL_API_BASE/OPERATOR_BEARER unset.

  - .github/workflows/omantel-e2e-handover.yaml (NEW):
    workflow_dispatch ONLY (no schedule cron — per CLAUDE.md "every workflow
    MUST be event-driven, NEVER scheduled"). Inputs let the operator override
    base URLs at dispatch time.

  - docs/omantel-handover-wbs.md:
    new §10 "Phase 8 acceptance criteria (executable DoD)" — 6 bullets 1:1
    with the spec test() blocks; §9 status row added for #429
    (🟢 scaffold-shipped).

Local verification:
  cd tests/e2e/playwright && npm install && \
    npx playwright test --list tests/omantel-handover.spec.ts
  → 6 tests listed cleanly
  npx playwright test tests/omantel-handover.spec.ts
  → 6 skipped (env vars unset, expected)

Out of scope (per #425 / #428 territory split):
  - internal/hetzner/, infra/hetzner/, platform/velero/chart/,
    clusters/.../34-velero.yaml — #425's vendor-agnostic sweep
  - .github/workflows/check-vendor-coupling.yaml — #428's coupling guard

Co-authored-by: hatiyildiz <hatiyildiz@noreply.github.com>
2026-05-01 17:52:18 +04:00
hatiyildiz
4f56ae47da fix(cloudinit): keep k3s local-path-provisioner; mark StorageClass default before Flux runs
Pre-fix the cloud-init template passed --disable=local-storage to the k3s
installer with the design intent that Crossplane would install hcloud-csi
day-2 and register a StorageClass after bp-crossplane reconciled. That
created a circular dependency on a fresh Sovereign: every PVC-using
HelmRelease in the bootstrap-kit (bp-spire, bp-keycloak postgres,
bp-openbao, bp-nats-jetstream, bp-gitea, bp-catalyst-platform postgres)
blocks Pending on a StorageClass that would only exist after bp-crossplane
finished installing — but they ARE in the bootstrap-kit Kustomization
that needs to converge before the day-2 path runs. Verified live on
omantel.omani.works: data-keycloak-postgresql-0 and spire-data-spire-server-0
both stuck Pending for 20+ min with `no persistent volumes available for
this claim and no storage class is set`, `kubectl get sc` empty.

This change:
1. Drops --disable=local-storage from INSTALL_K3S_EXEC so k3s ships its
   built-in local-path-provisioner and registers the `local-path`
   StorageClass on first boot.
2. Adds a runcmd block AFTER /healthz wait and BEFORE the Flux bootstrap
   apply that:
     a. waits for the local-path-provisioner pod Ready
     b. patches the local-path SC with is-default-class=true
     c. fails loudly if the SC is missing post-wait (safety gate so a
        broken cluster doesn't fall through to Flux silently)
3. Adds tests/integration/storageclass.sh — phase 1 render-assertion
   (regression gate against re-introducing --disable=local-storage,
   plus positive assertions that the wait/patch/verify steps are
   present, plus ordering check that the patch precedes the Flux
   apply); phase 2 kind-cluster proof that a fresh cluster has a
   default StorageClass that binds a test PVC.
4. Adds docs/RUNBOOK-PROVISIONING.md §"StorageClass missing" — symptom,
   root cause, and the live-cluster recovery path (apply
   local-path-storage.yaml + patch default class) for already-provisioned
   Sovereigns that hit this without reprovisioning.

Trade-off: local-path PVs are node-pinned. For the solo-Sovereign target
(single CPX21/CPX31 control-plane node) that is the correct shape — the
data lives on the node, capacity is bounded by the disk, and there are
no other nodes for volumes to migrate to. Operators upgrading to
multi-node migrate to hcloud-csi (Hetzner Cloud Volumes) as a separate,
deliberate operation; that is not part of the cloud-init bootstrap.

Live verification on omantel.omani.works (reproduces the production
symptom + proves the recovery path):

  Before:
    NAMESPACE      NAME                         STATUS    AGE
    keycloak       data-keycloak-postgresql-0   Pending   10m
    spire-system   spire-data-spire-server-0    Pending   10m
    No StorageClass.

  After (kubectl apply local-path-storage.yaml + patch):
    NAME                   PROVISIONER             ...   AGE
    local-path (default)   rancher.io/local-path   ...   34s

    NAMESPACE      NAME                         STATUS   STORAGECLASS
    keycloak       data-keycloak-postgresql-0   Bound    local-path
    spire-system   spire-data-spire-server-0    Bound    local-path

Gates:
  - tofu validate: Success! The configuration is valid.
  - tests/integration/storageclass.sh: PASS (phase 1 render-assertion +
    phase 2 fresh kind cluster default StorageClass binds test PVC).
  - Regression sanity: re-injecting --disable=local-storage causes
    phase 1 to FAIL with the documented error message (verified).

Preserves the cloud-init Cilium-pre-Flux ordering (no changes to that
block); the StorageClass setup runs between healthz-wait and the Flux
bootstrap apply so the bootstrap-kit Kustomization sees a default class
on its first reconciliation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 19:43:09 +02:00
hatiyildiz
1e1c8f5c39 merge: cloud-init creates ghcr-pull secret durable + GHCR token pipeline 2026-04-29 18:08:32 +02:00
hatiyildiz
dddbab4b80 fix(cloudinit): create flux-system/ghcr-pull secret on Sovereign so private bp-* charts pull cleanly
Every bootstrap-kit HelmRepository CR carries `secretRef: name: ghcr-pull`
because bp-* OCI artifacts at ghcr.io/openova-io/ are private. Cloud-init
never created the Secret, so every fresh Sovereign's source-controller
logs `secrets "ghcr-pull" not found` and Phase 1 stalls at bp-cilium.
The operator workaround (kubectl apply by hand) is not durable across
reprovisioning. Verified live on omantel.omani.works pre-fix.

Changes:

- provisioner.Request gains GHCRPullToken (json:"-") so it is never
  serialized into persisted deployment records. provisioner.New() reads
  CATALYST_GHCR_PULL_TOKEN at startup; Provision() stamps it onto the
  Request before tofu.auto.tfvars.json. Validate() rejects empty for
  domain_mode=pool with a pointer to docs/SECRET-ROTATION.md.
- handler.CreateDeployment also stamps the env var onto the Request so
  the synchronous validation path returns 400 early on misconfiguration.
- infra/hetzner: variables.tf adds ghcr_pull_token (sensitive=true,
  default=""). main.tf computes ghcr_pull_username + ghcr_pull_auth_b64
  locals and passes both to templatefile().
  cloudinit-control-plane.tftpl emits a kubernetes.io/dockerconfigjson
  Secret manifest into /var/lib/catalyst/ghcr-pull-secret.yaml; runcmd
  applies it AFTER Flux core install but BEFORE flux-bootstrap.yaml so
  the GitRepository + Kustomization land into a cluster that already
  has working GHCR creds.
- products/catalyst/chart/templates/api-deployment.yaml mounts
  CATALYST_GHCR_PULL_TOKEN from the catalyst-ghcr-pull-token Secret in
  the catalyst namespace (key: token, optional: true so the Pod still
  starts on misconfigured installs and Validate() owns the gate).
- docs/SECRET-ROTATION.md: yearly-rotation runbook for the GHCR token,
  Hetzner per-Sovereign tokens, and the Dynadot pool-domain creds.
  Includes the kubectl create secret one-liner with <GHCR_PULL_TOKEN>
  placeholder; the token never lives in git.
- Tests: provisioner unit tests cover New() reading the env var,
  tolerance of missing env, pool-mode validation rejection with
  operator-facing error, BYO acceptance, and the json:"-" serialization
  invariant. tests/e2e/hetzner-provisioning gains a
  TestCloudInit_RendersGHCRPullSecret render-only integration test that
  asserts the rendered cloud-init contains the Secret, applies it
  before flux-bootstrap, and that the dockerconfigjson round-trips the
  sample token through templatefile() correctly. Existing
  pool-mode handler tests now t.Setenv the placeholder token; the
  on-disk redaction test asserts the placeholder never reaches disk.

Gates:
- go vet ./... and go test -race -count=1 ./... in
  products/catalyst/bootstrap/api: PASS.
- helm lint products/catalyst/chart: PASS (warnings pre-existing).
- tofu fmt + tofu validate: deferred to CI (no tofu binary on the
  development host).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 18:07:27 +02:00
hatiyildiz
015e7ab18b fix(catalyst-chart): annotate api-deployment for Flux strategy-flip recovery
DIVERGES from the literal "$patch: replace" prescription on the issue
because that directive cannot survive any apply path that actually
runs in production (verified end-to-end in
tests/integration/strategy-flip.sh):

  - Flux's kustomize-controller submits via Server-Side Apply. SSA
    rejects `.spec.strategy.$patch` with "field not declared in
    schema" — fluxcd/pkg/ssa Manager.Apply does not preprocess SMP
    directives.
  - kubectl strict-decoding rejects `$patch` on every CREATE path
    (`kubectl create`, `kubectl apply` to an empty namespace, every
    `--server-side` flavor) with "unknown field spec.strategy.$patch"
    — adding it to a chart base resource BREAKS fresh installs of
    every new Sovereign.

The durable fix is the documented Flux annotation
`kustomize.toolkit.fluxcd.io/force: enabled` on the Deployment.
When kustomize-controller's SSA dry-run fails Invalid (the contabo-
mkt failure mode: `spec.strategy.rollingUpdate: Forbidden` on the
post-merge object that retained `rollingUpdate.maxSurge=25%` /
`maxUnavailable=25%` from the prior `kubectl-client-side-apply`
field manager), the controller falls back to delete-and-recreate
THIS resource. The recreated Deployment carries no residual
`rollingUpdate.*` fields, so the regression cannot recur. The
annotation is IaC, scoped to the Deployment, applies on every
reconcile.

Verified gates:
  - `kubectl apply --dry-run=server -f .../api-deployment.yaml`
    over a Deployment in the bad pre-state (RollingUpdate +
    maxSurge=25% / maxUnavailable=25%) → exit 0,
    "deployment.apps/catalyst-api configured (server dry run)".
  - Same manifest applied to an empty namespace via SSA + CSA →
    both succeed (the fresh-install gate that catches `$patch:`-
    shaped regressions).
  - SSA path correctly REPRODUCES the regression mode (asserted
    in step 3 of the integration test) → proves the recovery layer
    is necessary.
  - Flux force-recovery equivalent (delete + apply) succeeds →
    proves the recovery path itself works.

Files:
  - products/catalyst/chart/templates/api-deployment.yaml: add
    `kustomize.toolkit.fluxcd.io/force: enabled` annotation +
    inline reference comment explaining failure mode and rejecting
    inline `$patch: replace` as a future regression vector.
  - docs/CHART-AUTHORING.md (new): authoritative chart-authoring
    doc, with §"Strategy flips on existing Deployments" anchoring
    the failure mode + canonical fix + table of related fields
    (selector, clusterIP, accessModes, etc.) that share the
    pattern. References docs/INVIOLABLE-PRINCIPLES.md #3 (Flux is
    the only GitOps reconciler) and #4 (never hardcode runtime
    knobs in operator runbooks).
  - tests/integration/strategy-flip.yaml (new): bad-state fixture
    + assertion ConfigMap. Reproduces the exact 25%/25% pre-state
    that triggered contabo-mkt.
  - tests/integration/strategy-flip.sh (new): 6-step runner —
    bad-state stage, CSA gate, SSA failure-mode reproduction,
    structural annotation check, recovery-path proof, fresh-
    install gate. Exits non-zero on any regression.
  - .github/workflows/test-strategy-flip.yaml (new): CI wiring on
    kind v1.30.6 (matches contabo-mkt k3s decoding behavior),
    triggered by edits to the chart manifest, the test, the doc,
    or the workflow itself.

Sweep of the rest of the Catalyst chart templates: the only
`strategy.type: Recreate` Deployment in the chart is catalyst-api.
catalyst-ui, marketplace-api, and all 11 sme-services Deployments
declare default RollingUpdate and live as RollingUpdate on contabo-
mkt — no latent flips. Services use ClusterIP with default IP
allocation; the api-deployments PVC is RWO and never re-shaped by
the chart. No additional resources needed hardening.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 18:04:07 +02:00
hatiyildiz
55b8a18b32 test(e2e): #142, #143, #144 — Playwright UI smoke tests for sovereign wizard, admin vouchers, marketplace bp-<x> grid
Group L closes the three UI smoke-test gaps the verify-sweep flagged:

  #142 sovereign wizard       — tests/e2e/playwright/tests/sovereign-wizard.spec.ts
  #143 admin voucher UI       — tests/e2e/playwright/tests/admin-vouchers.spec.ts
  #144 unified bp-<x> grid    — tests/e2e/playwright/tests/marketplace-cards.spec.ts

Tests target the actual shipped UI shape (Pass 105+):

* Wizard step model is StepOrg → StepTopology → StepProvider →
  StepCredentials → StepComponents → StepReview, not the original ticket's
  StepDomain/StepHetzner draft from before the unified-Blueprints refactor.
* Admin voucher model uses an `active` toggle, not ISSUED/REVOKED status.
* "Marketplace card grid" = the Catalyst wizard's StepComponents (bp-<x>
  Blueprints), NOT the SME marketplace at core/marketplace (which is for
  SaaS Apps). Today every Blueprint is `visibility: unlisted`, so the test
  asserts the data layer (catalog.generated.ts) plus the documented
  EmptyState; once `visibility: listed` lands, the third assertion
  auto-extends to the rendered card grid.

Per principle #4 ("never hardcode"), all URLs come from env vars with
sensible local-dev defaults. Per principle #1 ("never speculate"), tests
self-skip with explicit reasons when their target app isn't reachable
instead of fail-noisy.

CI: .github/workflows/playwright-smoke.yaml boots the Catalyst UI in the
background and runs the suite on PRs touching UI sources or tests; admin
and marketplace specs self-skip in that workflow because spinning up all
three Astro apps + catalyst-api + Postgres is the full E2E pipeline's
job, not this smoke.

Local run (Catalyst UI on :4399, admin on :4398): 5 passed, 2 skipped
(skip reasons: marketplace #3 needs StepComponents reachable past
required-field gating; admin #2 needs ADMIN_TEST_COOKIE for an
authenticated session).

Refs: #142, #143, #144
2026-04-28 19:54:04 +02:00
hatiyildiz
919514ca78 merge: /sovereign nginx routing — values-driven /sovereign + /api/v1 (a35da92) 2026-04-28 19:50:39 +02:00
hatiyildiz
a35da929f1 feat(sovereign-route): values-driven /sovereign + /api/v1 routing
Per docs/INVIOLABLE-PRINCIPLES.md #4 (never hardcode), the catalyst-ui
nginx config now flows from values.yaml at chart-render time:

- routing.basePath (/sovereign) — also drives ingress strip-prefix
- routing.catalystApi.serviceDNS — in-cluster reverse-proxy target
- routing.catalystApi.port — upstream port
- dns.resolverIP — CoreDNS for proxy-time resolution (avoids stale
  ClusterIP after catalyst-api restarts)
- ingress.host / ingress.priority / ingress.className

Files:
- products/catalyst/chart/values.yaml — new, documents every default
- products/catalyst/chart/templates/ui-configmap.yaml — new, nginx
  reverse-proxies /api/* to catalyst-api Service DNS
- products/catalyst/chart/templates/ui-deployment.yaml — mounts the
  ConfigMap at /etc/nginx/conf.d/default.conf
- products/catalyst/chart/templates/ingress.yaml — values-driven host
  + path + priority + class
- tests/e2e/sovereign-routing/* — Playwright smoke for the routing

Captured from stalled agent /tmp/agent-sovereign-route-finish — agent
stream watchdog timed out after the work was authored but before commit.
2026-04-28 19:48:40 +02:00
hatiyildiz
4554bd6d5d feat(dod): #149-#157 — Group M DoD scaffolding (DEMO-RUNBOOK + dod_test.go + dod.yaml)
Manual-dispatch-only DoD scaffolding for the omantel.omani.works
end-to-end test. Operator-gated; the test t.Skip()s when
HETZNER_TEST_TOKEN env var is missing so CI stays green.

- docs/DEMO-RUNBOOK.md: 9-step operator runbook covering Group C
  cutover, wizard provision, voucher issuance, tenant redemption.
- tests/dod/dod_test.go: HTTP-driven E2E that streams SSE through
  all 11 phases, asserts cert + DNS + voucher + redemption flow.
- .github/workflows/dod.yaml: workflow_dispatch only — never
  on-push (Hetzner cost gating).

Cherry-picked additive files from /tmp/agent-group-m-dod (a40b495);
the agent's branch had stale-base deletions of #108/#109/Pass-107
that we drop.
2026-04-28 19:34:46 +02:00
hatiyildiz
7c7c46bc62 test: Hetzner Sovereign end-to-end provisioning test (#141)
Closes the Group L "end-to-end provisioning test on Hetzner test project"
ticket. Per the ticket's exact wording: scaffolding + harness + CI
workflow, gated on HETZNER_TEST_TOKEN, NEVER mocked.

Lifecycle when HETZNER_TEST_TOKEN is set:
  1. Generate unique sovereign FQDN (e2e-<run-id>.openova.io)
  2. Stage canonical infra/hetzner/ OpenTofu module into temp dir
  3. Render tofu.auto.tfvars.json with test inputs (BYO domain mode so
     Dynadot isn't touched; region runtime-configurable; SSH key minted
     by CI per-run)
  4. tofu init && tofu apply -auto-approve (30m timeout)
  5. Assert outputs: control_plane_ip + load_balancer_ip are valid IPv4
  6. Assert TCP/22 reachable on control plane (5m await)
  7. Assert TCP/443 reachable on LB after Cilium + Flux land (15m await,
     soft-failure since the Catalyst control plane install is the long
     tail and partial-bootstrap is acceptable proof of OpenTofu + Flux)
  8. tofu destroy -auto-approve (always — t.Cleanup, runs even on fail)
  9. Verify state list is empty after destroy (no leaked resources)

When HETZNER_TEST_TOKEN is absent, the test SKIPS — does not mock, does
not fall through to a stub. Per docs/INVIOLABLE-PRINCIPLES.md #2,
mocking the cloud would tell us nothing about whether the OpenTofu module,
hcloud provider, cloud-init scripts, or k3s actually work. A second test
(TestHarness_NoHetznerCredsSkips) explicitly verifies the skip semantics
so future refactors don't accidentally land mocking.

CI workflow (.github/workflows/test-hetzner-e2e.yaml):
  - Triggers on workflow_dispatch (operator initiates real run) or PR
    labeled `test/hetzner-e2e` — NOT on every push (each run costs real
    Hetzner minutes ~EUR 0.005/run).
  - Generates a per-run throwaway SSH ed25519 keypair so no secret
    long-term key lands in any logs.
  - Installs OpenTofu via opentofu/setup-opentofu@v1.
  - Reads HETZNER_TEST_TOKEN + HETZNER_TEST_PROJECT_ID from repo secrets;
    operator populates them out-of-band (per the ticket: "operator will
    populate later").
  - 55m job timeout, plus the test itself uses contexts of 30m apply
    + 20m destroy.

Files:
  - tests/e2e/hetzner-provisioning/main_test.go (the harness)
  - tests/e2e/hetzner-provisioning/go.mod (separate module, stdlib-only)
  - .github/workflows/test-hetzner-e2e.yaml (gated CI)

Refs #141

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 14:00:29 +02:00
hatiyildiz
3dced3fdda test: bootstrap-kit Flux Kustomization integration test (#145)
Closes the Group L "integration test — provisioner backend bootstrap-kit
installer — all 11 phases install in sequence on a kind cluster" ticket.

Per the ticket note, the bootstrap installer is now Flux-driven from
clusters/<sovereign-fqdn>/ — NOT the bespoke Go-based installer that was
reverted in commit e668637. The test verifies that Flux reconciles the
right Kustomizations rather than that Go code helm-installs anything.

Two layers of validation:

1. Static manifest layer (runs on every push, cheap)
   - All 11 platform/<x>/blueprint.yaml + chart/Chart.yaml exist
   - Each blueprint.yaml satisfies catalyst.openova.io/v1alpha1 schema
     (apiVersion/kind/metadata.name/spec.version/card.title/card.summary)
   - Chart.yaml name matches "bp-<x>" and version matches blueprint.yaml
     spec.version
   - clusters/_template/ YAMLs parse after SOVEREIGN_FQDN_PLACEHOLDER
     substitution (when the template tree is on the branch — Group J/M
     ticket lands the per-Sovereign template)
   - The dependency order matches the canonical 11-phase sequence from
     SOVEREIGN-PROVISIONING.md §3 (cilium → cert-manager → flux →
     crossplane → sealed-secrets → spire → nats-jetstream → openbao →
     keycloak → gitea → bp-catalyst-platform)

2. Kind-cluster layer (runs on main pushes, gated on
   BOOTSTRAP_KIT_KIND_TEST=1)
   - Brings up kubernetes-in-docker
   - Installs Flux CRDs + source/kustomize controllers
   - Registers a GitRepository pointing at this monorepo
   - Synthesizes the 11 bootstrap-kit Kustomizations and applies them
   - Asserts the API server accepts all 11 (manifests are valid, schema
     satisfied) — this is the test's narrow scope per the ticket

The test deliberately does NOT wait for the kit to fully install upstream
charts or reach steady-state reconciliation. That belongs to #141 (real
Hetzner E2E with cloud credentials and outbound network), not a kind
cluster test in CI.

Files:
  - tests/e2e/bootstrap-kit/main_test.go (Go test, 11 subtests + 4 main)
  - tests/e2e/bootstrap-kit/go.mod (separate module — keeps test deps
    isolated from the production Go modules)
  - .github/workflows/test-bootstrap-kit.yaml (kind-action + flux2/action)

Refs #145

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 13:58:18 +02:00