openova/platform/cluster-autoscaler-hcloud/README.md
e3mrah d1431bed09
fix(autoscaler+wizard): wire HCLOUD_CLOUD_INIT, validate SKU/region in catalyst-api (#965)
Closes #921 — bp-cluster-autoscaler-hcloud chart shipped without
HCLOUD_CLUSTER_CONFIG / HCLOUD_CLOUD_INIT, so cluster-autoscaler 1.32.x
FATALs at startup with "HCLOUD_CLUSTER_CONFIG or HCLOUD_CLOUD_INIT is
not specified" on every Sovereign (otech112 evidence). HelmRelease
reports Ready=True (Helm install succeeded) but the Pod
CrashLoopBackOffs invisibly behind the False-positive condition.

Closes #916 — wizard let operators dispatch unbuildable topologies
(otech109: cpx32 worker in `ash`) because PROVIDER_NODE_SIZES did not
encode regional orderability. Hetzner rejected the worker creation 41s
into `tofu apply` after Phase-0 had already created the CP + network +
LB + firewall.

Chart fix (issue #921):
- Add `clusterAutoscalerHcloud.{clusterConfig,cloudInit}` values to the
  umbrella chart (base64-encoded per upstream contract).
- Render `hetzner-node-config` Secret unconditionally with both keys so
  the upstream Deployment's secretKeyRef references resolve cleanly
  during `helm template` AND in the live cluster regardless of overlay
  state.
- Wire HCLOUD_CLUSTER_CONFIG + HCLOUD_CLOUD_INIT extraEnvSecrets onto
  the upstream chart's deployment.
- Tofu Phase 0 base64-encodes the Phase-0 worker cloud-init and stamps
  it under `flux-system/cloud-credentials.hcloud-cloud-init`; the
  bootstrap-kit overlay lifts that key via Flux `valuesFrom` into
  `clusterAutoscalerHcloud.cloudInit`. Autoscaler-spawned workers thus
  receive the IDENTICAL bootstrap as the Phase-0 worker fleet.
- Bump bp-cluster-autoscaler-hcloud chart 1.0.0 → 1.1.0.
- Chart-test smoke gate (chart/tests/hetzner-node-config.sh) verifies
  Secret + env var wiring + no-regression of HCLOUD_TOKEN — runs in CI's
  blueprint-release "Run chart integration tests" step.

Wizard fix (issue #916):
- Add `availableRegions?: string[]` to NodeSize interface; encode
  cpx32 = ['fsn1','nbg1','hel1'], cpx21/cpx31 = [] (orderable nowhere
  new) per Hetzner /v1/server_types vs POST /v1/servers gap.
- Add `isSkuAvailableInRegion()` + `suggestAlternativeSkus()` helpers.
- StepProvider filters SKU dropdowns by selected region; auto-swaps
  current SKU to recommended default when region change drops it out
  of orderability.
- Mirror the matrix Go-side in sku_availability.go; gate
  `provisioner.Request.Validate()` with same predicate so a stale
  wizard build OR direct API caller bypassing the UI cannot dispatch
  otech109's failure mode.
- Two-sided enforcement covers both r.Regions[] (multi-region) and the
  legacy singular path.

Tests: 13 vitest cases on the wizard side + 38 Go subtests on the API
side. Chart smoke renders + helm template gates the env wiring at
publish time.

Co-authored-by: hatiyildiz <hati.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 16:21:59 +04:00

3.6 KiB

bp-cluster-autoscaler-hcloud

Catalyst Blueprint umbrella chart for the Kubernetes cluster-autoscaler configured with the Hetzner Cloud cloud-provider. Adds and removes Hetzner workers in response to FailedScheduling events on a Sovereign's k3s cluster.

Why

Per issue #767, a freshly-provisioned Sovereign reaches FailedScheduling the moment the bootstrap-kit's RAM aggregate exceeds the static worker pool the operator picked in the wizard. Live evidence (otech92): two cpx32 workers couldn't fit the external-secrets-webhook Pod because the bootstrap-kit consumed the full 16 GB. The fix is two-pronged:

  1. Pre-launch: the wizard's StepReview surfaces an estimated footprint so the operator picks a worker pool that fits.
  2. Runtime: this blueprint adds cluster-autoscaler so the Sovereign scales workers up/down on demand, bounded by the min/max operator chose at launch.

How it wires

  • Helm subchart: upstream kubernetes/autoscaler/cluster-autoscaler vendor-neutral, multi-cloud cluster-autoscaler. The Hetzner cloud provider ships in the same upstream container image.
  • Hetzner token: read at HelmRelease apply time from flux-system/cloud-credentials.hcloud-token (the canonical Secret cloud-init writes per ADR-0001 §11.3 — same Secret consumed by Crossplane provider-hcloud + provider-config-hcloud).
  • Node bootstrap (issue #921): cluster-autoscaler 1.32.x's Hetzner provider requires either HCLOUD_CLUSTER_CONFIG (per-pool JSON, base64) or HCLOUD_CLOUD_INIT (cloud-init.yaml, base64) — it FATALs at startup without one. This chart wires both via extraEnvSecrets against the rendered cluster-autoscaler/hetzner- node-config Secret. Per-Sovereign overlays populate the clusterAutoscalerHcloud.cloudInit value via Flux valuesFrom against flux-system/cloud-credentials.hcloud-cloud-init, which cloud-init at Phase 0 stamps with the base64 of the same worker cloud-init the Phase-0 worker fleet booted with.
  • Node group: a single canonical pool keyed off the Sovereign's worker SKU + region + cloud-init template. The pool's min is the operator's chosen worker count; max defaults to 10 (overridable per-Sovereign).
  • Scale-down: 10 minutes idle (cost-saving default).

What this blueprint does NOT do

  • It does not pre-create extra nodes. Phase 0 (tofu apply) only provisions the min worker count; cluster-autoscaler creates additional workers on-demand against the same Hetzner project.
  • It does not provision the OpenTofu node-pool template. That restructuring is tracked separately (see follow-up issue) — the MVP shipped in this PR pins the node-group config in chart values and assumes the existing single-pool topology.
  • It does not autoscale workloads. KEDA (event-driven workload autoscaling) and the kubernetes-builtin HPA (horizontal pod autoscaler) are layered on top; cluster-autoscaler handles the node dimension only.

Upstream pinning

Knob Value Notes
Chart cluster-autoscaler (kubernetes/autoscaler) 9.46.6 — current stable on 2026-05-04
App cluster-autoscaler 1.32.0 (matches k3s 1.31.x — within +/-1 minor of the Sovereign apiserver)
Cloud provider hetzner Built into upstream image

Per docs/INVIOLABLE-PRINCIPLES.md #4 (never hardcode), every value is runtime-configurable; cluster overlays in clusters/<sovereign>/ MAY override any of them without rebuilding the OCI artifact.