
e3mrah b743b646ac
fix(autoscaler): attach scale-up VMs to private network so they k3s-join (#1427)

Root cause (autoscaler pod log, prov #43 chroot):

    W orchestrator.go:626 Node group workers is not ready for scaleup - backoff with status: Scale-up timed out for node group workers after 15m2.273255226s

Hetzner API confirms autoscaler-spawned workers come up PUBLIC-ONLY:

    workers-77439321e2047e3e  public_net.ipv4=178.105.102.237  private_net=[]
    workers-a6410e81b24cced   public_net.ipv4=178.105.73.210   private_net=[]

The worker cloud-init (identical to Phase-0 user_data) issues

    curl -sfL https://get.k3s.io | K3S_URL=https://10.0.1.2:6443 ... sh -

against the CP's PRIVATE 10.0.1.2 IP. Without the 10.0.0.0/16 attachment that URL is unreachable → k3s agent install silent-fails → node never registers with apiserver → autoscaler 15m timeout → backoff → bp-catalyst-platform Pending Pods never schedulable → chroot canvas tests blocked.

Fix: wire HCLOUD_NETWORK / HCLOUD_FIREWALL / HCLOUD_SSH_KEY env vars on the cluster-autoscaler deployment so the Hetzner provider attaches every scale-up VM to the SAME private network + firewall + ssh-key the Phase-0 Tofu module created (resource names: catalyst-<sov-fqdn-with-dashes>-net / -fw / catalyst-<sov-fqdn-with-dashes>).

Names flow:

    Tofu (hcloud_network.main.name + hcloud_firewall.main.name + hcloud_ssh_key.main.name)
    → cloudinit-control-plane.tftpl (3 new template vars)
    → /var/lib/catalyst/cloud-credentials-secret.yaml (3 new keys)
    → flux-system/cloud-credentials Secret
    → bp-cluster-autoscaler-hcloud HelmRelease valuesFrom (3 optional entries with targetPath: cluster-autoscaler.extraEnv.HCLOUD_*)
    → upstream chart's deployment env

Chart bumped 1.2.0 → 1.3.0. New smoke-test gates (Cases 5+6) prevent regression of the three env-var slots in chart values.yaml.

Reaffirms canonical seam: values flow through Tofu → cloud-init → flux-system Secret → Flux valuesFrom → chart values → upstream env. Never via kubectl patch, never via bespoke Go API calls.

Refs: prov #38/#39/#41/#43 omantel.biz scale-up backoff.

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-12 06:11:30 +04:00
chart	fix(autoscaler): attach scale-up VMs to private network so they k3s-join (#1427)	2026-05-12 06:11:30 +04:00
blueprint.yaml	feat(provisioner): cluster-autoscaler-hcloud + wizard footprint estimate (closes #767) (#776)	2026-05-04 19:49:44 +04:00
README.md	fix(autoscaler+wizard): wire HCLOUD_CLOUD_INIT, validate SKU/region in catalyst-api (#965)	2026-05-05 16:21:59 +04:00

README.md

bp-cluster-autoscaler-hcloud

Catalyst Blueprint umbrella chart for the Kubernetes cluster-autoscaler configured with the Hetzner Cloud cloud-provider. Adds and removes Hetzner workers in response to FailedScheduling events on a Sovereign's k3s cluster.

Why

Per issue #767, a freshly provisioned Sovereign hits FailedScheduling the moment the bootstrap-kit's aggregate RAM demand exceeds the capacity of the static worker pool the operator picked in the wizard. Live evidence (otech92): two cpx32 workers couldn't fit the external-secrets-webhook Pod because the bootstrap-kit had already consumed the full 16 GB. The fix is two-pronged:

Pre-launch: the wizard's StepReview surfaces an estimated footprint so the operator picks a worker pool that fits.
Runtime: this blueprint adds cluster-autoscaler so the Sovereign scales workers up and down on demand, bounded by the min/max the operator chose at launch.
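
The sizing failure above comes down to simple arithmetic, which the following Python sketch makes explicit. It assumes 8 GB of RAM per cpx32 worker (implied by two workers totalling 16 GB). The function names and any footprint figure passed in are hypothetical and are not taken from the wizard's code.

```python
import math

# Assumption: 8 GB per cpx32 worker, inferred from "two cpx32 workers ... 16 GB".
NODE_RAM_GB = {"cpx32": 8}


def pool_ram_gb(sku: str, count: int) -> int:
    """Aggregate RAM of a static worker pool (system overhead ignored)."""
    return NODE_RAM_GB[sku] * count


def fits(sku: str, count: int, footprint_gb: float) -> bool:
    """True if the pool's aggregate RAM covers the estimated footprint."""
    return pool_ram_gb(sku, count) >= footprint_gb


def workers_needed(sku: str, footprint_gb: float) -> int:
    """Smallest worker count whose aggregate RAM covers the footprint."""
    return math.ceil(footprint_gb / NODE_RAM_GB[sku])
```

An aggregate check like this is necessary but not sufficient: the scheduler places Pods per node, so a pool can pass it and still hit FailedScheduling when no single node has enough free memory. That residual case is what the runtime prong covers.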

How it wires

Helm subchart: upstream kubernetes/autoscaler/cluster-autoscaler, the vendor-neutral, multi-cloud cluster-autoscaler. The Hetzner cloud provider ships in the same upstream container image.
Hetzner token: read at HelmRelease apply time from flux-system/cloud-credentials.hcloud-token (the canonical Secret cloud-init writes per ADR-0001 §11.3 — same Secret consumed by Crossplane provider-hcloud + provider-config-hcloud).
Node bootstrap (issue #921): cluster-autoscaler 1.32.x's Hetzner provider requires either HCLOUD_CLUSTER_CONFIG (per-pool JSON, base64) or HCLOUD_CLOUD_INIT (cloud-init.yaml, base64) and FATALs at startup without one. This chart wires both via extraEnvSecrets against the rendered cluster-autoscaler/hetzner-node-config Secret. Per-Sovereign overlays populate the clusterAutoscalerHcloud.cloudInit value via Flux valuesFrom against flux-system/cloud-credentials.hcloud-cloud-init, which Phase-0 cloud-init stamps with the base64 encoding of the same worker cloud-init that the Phase-0 worker fleet booted with.
Node group: a single canonical pool keyed off the Sovereign's worker SKU + region + cloud-init template. The pool's min is the operator's chosen worker count; max defaults to 10 (overridable per-Sovereign).
Scale-down: 10 minutes idle (cost-saving default).
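
Two of the rules above, the "one of two bootstrap variables" requirement and the node-group bounds, can be restated as a small Python sketch. This is an illustration only, not the provider's or the chart's code. The environment variable names and the default max of 10 come from the text above; the function names are hypothetical.

```python
import base64
from typing import Dict, Optional, Tuple

# The Hetzner provider needs at least one of these set (see "Node bootstrap").
BOOTSTRAP_VARS = ("HCLOUD_CLUSTER_CONFIG", "HCLOUD_CLOUD_INIT")

DEFAULT_MAX_WORKERS = 10  # chart default, overridable per Sovereign


def encode_cloud_init(cloud_init_yaml: str) -> str:
    """Base64-encode a cloud-init document, as HCLOUD_CLOUD_INIT expects."""
    return base64.b64encode(cloud_init_yaml.encode("utf-8")).decode("ascii")


def has_bootstrap_config(env: Dict[str, str]) -> bool:
    """True if at least one bootstrap variable is present and non-empty."""
    return any(env.get(name) for name in BOOTSTRAP_VARS)


def node_group_bounds(
    operator_worker_count: int, max_override: Optional[int] = None
) -> Tuple[int, int]:
    """Return (min, max) for the single canonical pool.

    min is the operator's chosen worker count; max falls back to the default.
    """
    max_size = DEFAULT_MAX_WORKERS if max_override is None else max_override
    if max_size < operator_worker_count:
        raise ValueError("max must be >= the operator's worker count")
    return operator_worker_count, max_size
```

For example, `has_bootstrap_config({"HCLOUD_TOKEN": "..."})` is false, which corresponds to the startup failure described under issue #921, and `node_group_bounds(2)` yields a pool bounded at 2 to 10 workers.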

What this blueprint does NOT do

It does not pre-create extra nodes. Phase 0 (tofu apply) provisions only the min worker count; cluster-autoscaler creates additional workers on demand in the same Hetzner project.
It does not provision the OpenTofu node-pool template. That restructuring is tracked separately (see follow-up issue) — the MVP shipped in this PR pins the node-group config in chart values and assumes the existing single-pool topology.
It does not autoscale workloads. KEDA (event-driven workload autoscaling) and the Kubernetes built-in HPA (horizontal pod autoscaler) are layered on top; cluster-autoscaler handles the node dimension only.

Upstream pinning

Knob	Value	Notes
Chart	`cluster-autoscaler` (kubernetes/autoscaler)	`9.46.6` — current stable on 2026-05-04
App	`cluster-autoscaler`	`1.32.0` (matches k3s 1.31.x — within +/-1 minor of the Sovereign apiserver)
Cloud provider	`hetzner`	Built into upstream image
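
The "within +/-1 minor" rule in the App row can be checked mechanically. The Python sketch below is one way to express it, assuming version strings of the form MAJOR.MINOR followed by anything (for example `1.31.x`). The helper names are hypothetical and are not part of the chart or its smoke tests.

```python
from typing import Tuple


def major_minor(version: str) -> Tuple[int, int]:
    """Parse 'v1.31.4+k3s1', '1.32.0' or '1.31.x' into (major, minor)."""
    major, minor = version.lstrip("v").split(".")[:2]
    return int(major), int(minor)


def within_one_minor(autoscaler: str, apiserver: str) -> bool:
    """True if both share a major version and differ by at most one minor."""
    a_major, a_minor = major_minor(autoscaler)
    s_major, s_minor = major_minor(apiserver)
    return a_major == s_major and abs(a_minor - s_minor) <= 1
```

With the pinned values, `within_one_minor("1.32.0", "1.31.x")` holds, whereas an apiserver on 1.30.x would fall outside the window.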

Per docs/INVIOLABLE-PRINCIPLES.md #4 (never hardcode), every value is runtime-configurable; cluster overlays in clusters/<sovereign>/ MAY override any of them without rebuilding the OCI artifact.