openova/platform/harbor
e3mrah 3d929e69d7
fix(httproute): collapse double-prefix when releaseName contains chart name (gitea/harbor/openbao 500/404) (#1483)
* fix(tls): cilium-gateway-cert STAGING/PROD issuer selectable via tofu

clusters/_template/sovereign-tls/cilium-gateway-cert.yaml hardcoded
letsencrypt-dns01-prod-powerdns regardless of qa_test_session_enabled.
On high-cadence QA reprov cycles this hits the LE PROD 5/168h rate
limit (caught on prov #76 at 13:45 UTC, retry-after 16:49 UTC) and
the wildcard Certificate sticks Ready=False — Cilium Gateway has no
valid TLS secret → envoy listener never binds → public TLS handshake
to console.<fqdn> dies with SSL_ERROR_SYSCALL.

Add tofu local.wildcard_cert_issuer = qa_test_session_enabled ?
staging : prod. Thread WILDCARD_CERT_ISSUER through the sovereign-
tls Kustomization postBuild.substitute. cilium-gateway-cert.yaml
references it as ${WILDCARD_CERT_ISSUER}.

Default behaviour unchanged for non-QA (production) Sovereigns —
they still resolve to letsencrypt-dns01-prod-powerdns.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(cilium-gateway): allow world ingress to Cilium Gateway reserved:ingress endpoint

When Cilium Gateway API runs with gatewayAPI.hostNetwork.enabled=true and
a default-deny CCNP is present, every public request to a Sovereign host
(console, auth, gitea, registry, api, ...) hits the gateway listener and
gets DENIED at envoy's cilium.l7policy filter with:

    cilium.l7policy: Ingress from 1 policy lookup for endpoint X for port 30443: DENY

Public response: HTTP/1.1 403 Forbidden, body "Access denied", server: envoy.

Root cause: Cilium creates a special endpoint with identity reserved:ingress (8)
representing the gateway listener. By default this endpoint has
policy-enabled=both with allowed-ingress-identities=[1 (host)] and empty
L4 rules — so no port is permitted. The default-deny CCNP's NotIn-namespace
endpointSelector does NOT cover this endpoint (it has no
io.kubernetes.pod.namespace label), and our qa-fixtures didn't ship a
matching allow-template for it. Net effect: TLS handshake succeeds, HTTPRoutes
are Programmed, backends are healthy in-cluster, but every request 403s.

Caught live on prov #80 (omantel.biz, 2026-05-14) after the Gateway hostNetwork
fix (#1480) finally activated host-bind on :30443. Verified by:
- envoy debug log: cilium.l7policy DENY for endpoint 10.42.0.201 port 30443
- cilium-dbg endpoint get 3282 -o json: l4.ingress: [] and allowed-ingress-identities: [1]
- transiently applying the same CCNP via kubectl: console.omantel.biz → 200

Fix: ship a CCNP scoped to reserved:ingress that allows ingress from world,
cluster, host, remote-node (multi-region CP-to-CP), and kube-apiserver,
plus egress to all so envoy can forward to any backend service. This is
the canonical Cilium hostNetwork Gateway-API zero-trust pattern.

Chart bump: catalyst 1.4.142 → 1.4.143.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(httproute): match upstream chart fullname-collapse when releaseName contains chart name

Three Sovereign-facing HTTPRoute templates (gitea, harbor, openbao) had
backend defaults hardcoded as `<release>-<chart>-<resource>` (e.g.
`gitea-gitea-http`, `harbor-harbor-core`, `openbao-openbao`). The
upstream subcharts use a `<chart>.fullname` helper that COLLAPSES the
prefix when `.Release.Name` already contains the chart name — i.e. when
the bootstrap-kit releaseName is the chart name (the convention), the
live Service is `<release>-<resource>` (or just `<release>` for openbao),
not `<release>-<chart>-<resource>`.

Effect on prov #80 (omantel.biz):
- gitea/gitea HTTPRoute → backendRef `gitea-gitea-http` (does not exist; live is `gitea-http`) → BackendNotFound → gitea.omantel.biz returns HTTP 500
- harbor/harbor HTTPRoute → `harbor-harbor-core` (live is `harbor-core`) → registry.omantel.biz returns HTTP 500
- openbao/openbao HTTPRoute → `openbao-openbao` (live is `openbao`) → bao.omantel.biz dead

Fix: replicate the upstream chart's `.fullname` collapse logic via
`(ternary .Release.Name (printf "%s-<chart>" .Release.Name) (contains "<chart>" .Release.Name))` so the default backend always matches
the live Service name regardless of releaseName choice. Operators retain
the `gateway.backendService` override for non-standard release names.
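
The collapse rule can be sketched in Python for illustration. The function names and the `resource` parameter are hypothetical; the actual fix is the Helm `ternary` expression quoted above, and the upstream helpers may differ in detail.

```python
def chart_fullname(release_name: str, chart_name: str) -> str:
    """Reuse the release name when it already contains the chart name."""
    if chart_name in release_name:
        return release_name
    return f"{release_name}-{chart_name}"


def default_backend(release_name: str, chart_name: str, resource: str = "") -> str:
    """Append the resource suffix, if any, to the collapsed fullname."""
    fullname = chart_fullname(release_name, chart_name)
    return f"{fullname}-{resource}" if resource else fullname


if __name__ == "__main__":
    # Release names equal to the chart name (the bootstrap-kit convention).
    print(default_backend("gitea", "gitea", "http"))
    print(default_backend("harbor", "harbor", "core"))
    print(default_backend("openbao", "openbao"))
```

With a release name that does not contain the chart name (say a hypothetical `registry`), the same function keeps the chart segment and yields `registry-harbor-core`.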

Chart bumps: bp-gitea 1.2.6 → 1.2.7, bp-harbor 1.2.16 → 1.2.17, bp-openbao 1.2.14 → 1.2.15.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: e3mrah <catalyst@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
2026-05-14 19:00:07 +04:00
chart fix(httproute): collapse double-prefix when releaseName contains chart name (gitea/harbor/openbao 500/404) (#1483) 2026-05-14 19:00:07 +04:00
blueprint.yaml feat(bp-harbor): vendor-agnostic Object Storage backend (closes #383) (#437) 2026-05-01 18:18:37 +04:00
README.md feat(bp-harbor): vendor-agnostic Object Storage backend (closes #383) (#437) 2026-05-01 18:18:37 +04:00

Harbor

Container registry with vulnerability scanning. Per-host-cluster infrastructure (see docs/PLATFORM-TECH-STACK.md §3.5) — every host cluster runs a Harbor instance for Catalyst component images, mirrored Blueprint OCI artifacts, and customer images.

Status: Accepted | Updated: 2026-04-27


Overview

Harbor is mandatory on every host cluster. Each host cluster runs its own Harbor instance that mirrors from upstream sources (ghcr.io/openova-io/... for Catalyst components and Blueprint OCI artifacts; the customer's own CI for application images). A local Harbor gives fast Pod pulls, avoids cross-region traffic on every image pull, and keeps the cluster air-gap ready.

flowchart TB
    subgraph Upstream["Upstream OCI sources"]
        GHCR[ghcr.io/openova-io/* — Catalyst + Blueprints]
        CustCI[Customer CI — Application images]
    end

    subgraph Cluster1["Host cluster A (e.g. hz-fsn-rtz-prod)"]
        H1[Harbor — local mirror]
        T1[Trivy Scanner]
        Pods1[Pods pull locally]
    end

    subgraph Cluster2["Host cluster B (e.g. hz-hel-rtz-prod)"]
        H2[Harbor — local mirror]
        T2[Trivy Scanner]
        Pods2[Pods pull locally]
    end

    GHCR -.->|"pull mirror"| H1
    CustCI -.->|"push"| H1
    GHCR -.->|"pull mirror"| H2
    CustCI -.->|"push"| H2
    H1 --> T1
    H2 --> T2
    H1 --> Pods1
    H2 --> Pods2

Why Mandatory?

| Requirement | Harbor (per host cluster) | External Registry |
|---|---|---|
| Local pulls (no cross-region traffic) | Each cluster's Pods pull from local Harbor | Pods pull cross-region |
| Vulnerability scanning | Trivy integrated | ⚠️ Depends on provider |
| Air-gap support | Self-hosted | |
| RBAC | Full control | ⚠️ Provider-specific |
| Audit logging | Complete | ⚠️ Limited |
| No external dependency at runtime | Once mirrored | |

Features

| Feature | Support |
|---|---|
| Image storage | OCI-compliant |
| Vulnerability scanning | Trivy integration |
| Image signing | Cosign/Notary |
| Replication | Push/pull between regions |
| RBAC | Project-based access |
| Quotas | Per-project storage limits |
| Garbage collection | Automatic cleanup |

Per-host-cluster mirroring (NOT primary-replica)

Catalyst's agreed model is one Harbor per host cluster, each independently pulling from upstream OCI sources. There is no primary/replica replication between Harbor instances.

sequenceDiagram
    participant CI as CI / Upstream OCI
    participant H1 as Harbor (cluster A)
    participant T1 as Trivy (cluster A)
    participant H2 as Harbor (cluster B)
    participant T2 as Trivy (cluster B)
    participant Pods as Pods

    CI->>H1: pull-mirror sync (configured per project)
    H1->>T1: scan on ingest
    CI->>H2: pull-mirror sync (independent of H1)
    H2->>T2: scan on ingest
    Pods->>H1: pull (cluster A Pods)
    Pods->>H2: pull (cluster B Pods)

Why pull-mirror, not Harbor-to-Harbor replication:

  • Single source of truth = upstream (ghcr.io/openova-io/... or customer CI), not a "primary Harbor".
  • Each cluster is its own failure domain — primary-replica drift between Harbors would be one more thing to fail.
  • The air-gap path has the same shape as pull-mirror: a one-time mirror import, rather than ongoing primary-pushed replication.

Benefits:

  • Images available locally in each cluster.
  • Survives any cluster (including the management cluster) going down — workload clusters keep pulling locally.
  • Faster pulls (no cross-region traffic per Pod start).
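
As an illustration of what "pull locally" means for an image reference, the sketch below swaps the upstream registry host for the cluster-local Harbor host. It assumes mirrored repositories keep their upstream path (for example `openova-io/...`) under the local host, which this README does not state; the host `registry.example.test` and the image name are hypothetical.

```python
def to_local_ref(upstream_ref: str, local_registry: str,
                 upstream_registry: str = "ghcr.io") -> str:
    """Replace the upstream registry host with the cluster-local Harbor host.

    The repository path and tag are kept unchanged (an assumption about
    how the mirror project is laid out).
    """
    prefix = upstream_registry + "/"
    if not upstream_ref.startswith(prefix):
        raise ValueError(f"{upstream_ref!r} is not hosted on {upstream_registry}")
    return f"{local_registry}/{upstream_ref[len(prefix):]}"


if __name__ == "__main__":
    # Hypothetical image name and local registry host.
    print(to_local_ref("ghcr.io/openova-io/example:1.0.0", "registry.example.test"))
```

Each cluster would apply the same mapping with its own registry host, so no cluster depends on another cluster's Harbor.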

Storage Backend Options

| Backend | Use Case | Notes |
|---|---|---|
| PVC (type: filesystem) | Dev / Contabo / single-node | Default render; no S3 wiring |
| Cloud-native S3 | Production Sovereigns | Hetzner Object Storage / AWS S3 / GCP / Azure |

S3-aware apps (Harbor is one) write DIRECTLY to the cloud-provider's native S3 endpoint. SeaweedFS is reserved as a POSIX→S3 buffer for legacy POSIX-only writers and is NOT in the minimal Sovereign set.

flowchart LR
    Harbor[Harbor] -->|"S3 API (HTTPS)"| Hetzner[Hetzner Object Storage<br/>fsn1.your-objectstorage.com]

Configuration

Helm Values (per-Sovereign overlay shape — issue #383 / #425)

gateway:
  host: registry.<sovereign-fqdn>

# Vendor-agnostic Object Storage seam — populated via Flux valuesFrom
# against the canonical flux-system/object-storage Sealed Secret.
objectStorage:
  enabled: true
  credentialsSecretName: harbor-objectstorage-credentials
  s3:
    accessKey: ""   # populated by Flux valuesFrom
    secretKey: ""   # populated by Flux valuesFrom

harbor:
  persistence:
    imageChartStorage:
      type: s3
      s3:
        # bucket / region / regionendpoint also populated by Flux valuesFrom
        existingSecret: harbor-objectstorage-credentials
        v4auth: true
        secure: true

trivy:
  enabled: true

database:
  type: internal  # or external for CNPG

redis:
  type: internal  # or external for Valkey

core:
  secretName: harbor-core-secret

Pull-mirror policy

{
  "name": "ghcr-openova-mirror",
  "src_registry": {
    "type": "github-ghcr",
    "url": "https://ghcr.io",
    "credential": {
      "access_key": "",
      "access_secret": ""
    }
  },
  "trigger": {
    "type": "scheduled",
    "trigger_settings": {
      "cron": "0 0 */6 * * *"
    }
  },
  "filters": [
    {
      "type": "name",
      "value": "openova-io/**"
    }
  ],
  "enabled": true
}
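
To make the `name` filter concrete, here is a minimal sketch of how a pattern such as `openova-io/**` could select repositories. It assumes `**` matches across path separators and a single `*` stays within one path segment; Harbor's own matcher may differ in edge cases, and the repository names below are hypothetical.

```python
import re


def glob_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate a simple glob into a regex: '**' spans '/', '*' does not."""
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**", i):
            parts.append(".*")
            i += 2
        elif pattern[i] == "*":
            parts.append("[^/]*")
            i += 1
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(parts))


def matches_filter(repository: str, pattern: str) -> bool:
    """Return True when the whole repository name matches the glob."""
    return glob_to_regex(pattern).fullmatch(repository) is not None


if __name__ == "__main__":
    for repo in ("openova-io/example", "openova-io/charts/example", "other/example"):
        print(repo, matches_filter(repo, "openova-io/**"))
```

Under these assumptions the policy above would mirror everything below `openova-io/`, at any depth, and nothing else.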

Security Scanning

Trivy Integration

| Scan Type | Trigger |
|---|---|
| On push | Automatic when image pushed |
| Scheduled | Daily full scan |
| Manual | On-demand via UI/API |

Scan Policy

| Severity | Action |
|---|---|
| Critical | Block pull |
| High | Allow (configurable) |
| Medium | Allow |
| Low | Allow |
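
The scan policy amounts to a severity threshold. The sketch below models it in Python under that assumption; the function and constant names are hypothetical, and the real enforcement happens inside Harbor, not in code like this.

```python
from typing import Iterable

# Ordered from least to most severe, matching the table above.
SEVERITY_ORDER = ["Low", "Medium", "High", "Critical"]


def pull_allowed(found: Iterable[str], block_at: str = "Critical") -> bool:
    """Allow the pull only if every finding is below the blocking severity."""
    threshold = SEVERITY_ORDER.index(block_at)
    return all(SEVERITY_ORDER.index(severity) < threshold for severity in found)


if __name__ == "__main__":
    print(pull_allowed(["Low", "High"]))              # only sub-Critical findings
    print(pull_allowed(["Medium", "Critical"]))       # contains a Critical finding
    print(pull_allowed(["High"], block_at="High"))    # stricter, configurable threshold
```

Lowering `block_at` to `"High"` corresponds to the "configurable" note on the High row.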

Kyverno Policies

Require Harbor Images

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-harbor-images
spec:
  validationFailureAction: Enforce
  rules:
    - name: require-harbor-registry
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "Images must be pulled from Harbor registry"
        pattern:
          spec:
            containers:
              - image: "harbor.<location-code>.<sovereign-domain>/*"
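
The effect of the `image` pattern can be approximated outside the cluster. This sketch uses Python's `fnmatchcase`, which also interprets `?` and `[...]`, so it only approximates Kyverno's wildcard matching; the registry host `harbor.fsn.example.test` stands in for the `harbor.<location-code>.<sovereign-domain>` placeholder and is hypothetical.

```python
from fnmatch import fnmatchcase
from typing import Iterable


def images_allowed(images: Iterable[str], pattern: str) -> bool:
    """Return True only if every container image matches the registry pattern."""
    return all(fnmatchcase(image, pattern) for image in images)


if __name__ == "__main__":
    pattern = "harbor.fsn.example.test/*"  # hypothetical substitution of the placeholder
    print(images_allowed(["harbor.fsn.example.test/openova-io/app:1.0"], pattern))
    print(images_allowed(["harbor.fsn.example.test/openova-io/app:1.0",
                          "ghcr.io/openova-io/sidecar:1.0"], pattern))
```

A single image from outside the local Harbor is enough to fail the check, which mirrors how the policy rejects the whole Pod.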

Resource Requirements

| Component | CPU | Memory |
|---|---|---|
| Harbor Core | 0.5 | 512Mi |
| Registry | 0.5 | 512Mi |
| Database | 0.5 | 512Mi |
| Redis | 0.25 | 256Mi |
| Trivy | 0.5 | 1Gi |
| Total | 2.25 | 2.75Gi |
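
The totals can be checked with a few lines of arithmetic. The dictionary below simply restates the table (memory in Mi, with 1Gi taken as 1024Mi); the names are illustrative.

```python
# (CPU cores, memory in Mi) per component, copied from the table above.
REQUESTS = {
    "Harbor Core": (0.5, 512),
    "Registry": (0.5, 512),
    "Database": (0.5, 512),
    "Redis": (0.25, 256),
    "Trivy": (0.5, 1024),
}


def totals(requests: dict) -> tuple:
    """Sum CPU cores and memory, returning memory in Gi."""
    cpu = sum(cpu for cpu, _ in requests.values())
    memory_mi = sum(mem for _, mem in requests.values())
    return cpu, memory_mi / 1024


if __name__ == "__main__":
    cpu, memory_gi = totals(REQUESTS)
    print(f"{cpu} CPU, {memory_gi}Gi")
```

The sum comes to 2.25 CPU and 2816Mi, which is 2.75Gi, in agreement with the Total row.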

Backup Strategy

Harbor data is backed up via Velero to Archival S3:

flowchart LR
    Harbor[Harbor] --> Velero[Velero]
    Velero --> S3[Archival S3]

Backed up:

  • Database (PostgreSQL)
  • Registry storage (blobs)
  • Configuration

Consequences

Positive:

  • Complete control over image lifecycle.
  • Built-in vulnerability scanning (Trivy on ingest).
  • Per-cluster mirror = no cross-region pull traffic; each cluster is an independent failure domain.
  • Air-gap ready (one-time import works the same way as ongoing pull-mirror).
  • Audit trail for compliance.

Negative:

  • Resource overhead (about 2.75Gi of RAM and 2.25 CPU in total, per Resource Requirements).
  • Operational responsibility.
  • Backup requirements (handled by Velero).

Part of OpenOva