openova/platform/kserve
e3mrah c09109a61a
feat(charts): bp-stunner + bp-knative + bp-kserve wrapper charts (closes #263 #264 #265) (#290)
Edge + serverless + model-serving batch (W2.5.C) — three upstream-
subchart umbrella Blueprints completing the bootstrap-kit slots for
WebRTC media relay (bp-relay → bp-stunner) and the AI/ML serving stack
(bp-cortex → bp-kserve → bp-knative).

Each chart follows the canonical umbrella pattern from
docs/BLUEPRINT-AUTHORING.md §11.1: Chart.yaml declares the upstream
chart under `dependencies:` so `helm dependency build` bundles the
upstream payload into the OCI artifact, and Catalyst-curated overlay
values + templates sit alongside in chart/values.yaml + chart/templates/.

Per-chart highlights:
- bp-stunner/1.0.0 — wraps stunner/stunner-gateway-operator 1.1.0.
  Ships a Cilium-native GatewayClass (Capabilities-gated on
  gateway.networking.k8s.io/v1) so bp-relay (LiveKit / SFU) can claim
  Gateway CRs without an operator-ordering dance. Default UDP TURN port
  range 30000-32767 matches the range opened at the Sovereign edge
  firewall (Crossplane bp-firewall composition).
- bp-knative/1.0.0 — wraps knative-operator v1.21.1. Ships a
  KnativeServing CR pre-configured for **istio-less mode**
  (ingress.istio.enabled=false, ingress.contour.enabled=false,
  ingress.kourier.enabled=false; config.network.ingress-class=cilium).
  Sovereign FQDN sourced from values, no hardcoded fallback per
  inviolable principle #4 — render fails loudly if cluster overlay
  doesn't set knativeOverlay.knativeServing.sovereignFqdn.
- bp-kserve/1.0.0 — wraps kserve/kserve v0.16.0 (latest version
  published on the official OCI registry as of 2026-04-30). Default
  deploymentMode=RawDeployment (no Knative hop on the hot path) but
  bp-knative is still installed (declared as a hard dep) so per-IS
  annotation `serving.kserve.io/deploymentMode: Serverless` opts in to
  scale-to-zero per tenant. Cilium native Gateway-API ingress
  (enableGatewayApi=true, className=cilium, disableIstioVirtualHost=
  true).

Observability discipline (issue #182): every observability toggle
(ServiceMonitor, HPA, GatewayClass) defaults false and is operator-
tunable via per-cluster overlay once bp-kube-prometheus-stack reconciles.
Each chart ships tests/observability-toggle.sh covering default-off,
opt-in (with `--api-versions monitoring.coreos.com/v1` to simulate
Prometheus Operator CRDs), and explicit-off cases.

Per-chart kind summary (helm template default render):

  bp-stunner: ClusterRole, ClusterRoleBinding, ConfigMap, Dataplane,
              Deployment, Role, RoleBinding, Service, ServiceAccount.
              (+ GatewayClass when --api-versions
              gateway.networking.k8s.io/v1 is passed.)

  bp-knative: ClusterRole, ClusterRoleBinding, ConfigMap,
              CustomResourceDefinition, Deployment, KnativeServing,
              Role, RoleBinding, Secret, Service, ServiceAccount.

  bp-kserve:  Certificate, ClusterRole, ClusterRoleBinding,
              ClusterServingRuntime, ClusterStorageContainer,
              ConfigMap, Deployment, Gateway, Issuer,
              MutatingWebhookConfiguration, Role, RoleBinding,
              Service, ServiceAccount, ValidatingWebhookConfiguration.

`helm lint` clean for all three (single INFO on missing icon — icons
land with marketplace card work).

`bash tests/observability-toggle.sh` green for all three (3 cases each:
default-off, opt-in, explicit-off).

Closes #263 #264 #265

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 19:37:38 +04:00

KServe

Kubernetes-native model serving. Application Blueprint (see docs/PLATFORM-TECH-STACK.md §4.6). Used by bp-cortex to serve LLMs via vLLM, embedding models such as BGE, and custom inference workloads.

Status: Accepted | Updated: 2026-04-30


Blueprint chart

This folder ships an umbrella Helm chart at chart/ that declares the upstream kserve/kserve chart (v0.16.0, the latest version published on the official OCI registry as of 2026-04-30) under dependencies: in Chart.yaml. Catalyst-curated overlay templates render alongside:

  • chart/templates/networkpolicy.yaml — locks the controller-manager namespace down (DEFAULT FALSE).
  • chart/templates/servicemonitor.yaml — controller-manager metrics scrape (DEFAULT FALSE per docs/BLUEPRINT-AUTHORING.md §11.2; Capabilities-gated).
  • chart/templates/hpa.yaml — controller-manager Deployment HPA (DEFAULT FALSE; controller is leader-elected).
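As a rough illustration of the umbrella pattern, the dependency declaration in chart/Chart.yaml might look like the sketch below. The chart name and versions come from this README and the commit message; the description and the repository URL are assumptions and should be checked against the shipped Chart.yaml.

```yaml
# chart/Chart.yaml (illustrative sketch, not the shipped file)
apiVersion: v2
name: bp-kserve
description: Catalyst umbrella chart wrapping upstream KServe  # assumed wording
type: application
version: 1.0.0
dependencies:
  - name: kserve
    version: v0.16.0
    # Assumption: location of the upstream OCI chart; verify before use.
    repository: oci://ghcr.io/kserve/charts
```

With a declaration of this shape, `helm dependency build chart/` pulls the upstream payload into chart/charts/ so it ships inside the packaged artifact.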

Catalyst defaults:

  • kserve.controller.deploymentMode: RawDeployment — KServe writes plain Deployment+Service+HPA per InferenceService (no Knative hop on the hot path).
  • kserve.controller.gateway.ingressGateway.enableGatewayApi: true + className: cilium — Catalyst's istio-less Cilium native Gateway-API path.
  • kserve.controller.gateway.disableIstioVirtualHost: true — Knative-Istio is NOT installed.
  • bp-knative is still installed (declared as a hard dependency in blueprint.yaml) so per-InferenceService annotation serving.kserve.io/deploymentMode: Serverless opts in to scale-to-zero on a per-tenant basis without infra changes.
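Written as an overlay in chart/values.yaml, the defaults listed above would nest roughly as follows. This is a sketch derived from the dotted paths in the list, not a copy of the shipped file.

```yaml
# chart/values.yaml (sketch): keys under `kserve:` are passed to the upstream subchart
kserve:
  controller:
    # Plain Deployment + Service + HPA per InferenceService
    deploymentMode: RawDeployment
    gateway:
      # Knative-Istio is not installed, so no Istio VirtualService is created
      disableIstioVirtualHost: true
      ingressGateway:
        # Route through the Cilium-native Gateway API implementation
        enableGatewayApi: true
        className: cilium
```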

Overview

KServe provides standardized model serving on Kubernetes with support for multiple ML frameworks, autoscaling, and inference graphs.

flowchart TB
    subgraph KServe["KServe"]
        Controller[KServe Controller]
        Predictor[Predictor]
        Transformer[Transformer]
        Explainer[Explainer]
    end

    subgraph Runtimes["Serving Runtimes"]
        vLLM[vLLM]
        TorchServe[TorchServe]
        Triton[Triton]
        SKLearn[SKLearn]
    end

    subgraph Knative["Knative Serving"]
        Autoscale[Autoscaling]
        Revisions[Revisions]
    end

    Controller --> Predictor
    Controller --> Transformer
    Controller --> Explainer
    Predictor --> Runtimes
    Runtimes --> Knative

Why KServe?

| Feature | Benefit |
|---|---|
| Multi-framework | TensorFlow, PyTorch, ONNX, vLLM, etc. |
| Autoscaling | Scale-to-zero via Knative |
| InferenceService | Standardized deployment pattern |
| Inference Graph | Multi-model pipelines |
| Model explainability | Integrated explainers |

Components

| Component | Purpose |
|---|---|
| InferenceService | Model deployment abstraction |
| ServingRuntime | Framework-specific runtime |
| InferenceGraph | Multi-model orchestration |
| ClusterStorageContainer | Model storage configuration |

Serving Runtimes

| Runtime | Use Case |
|---|---|
| vLLM | LLM inference (recommended) |
| TorchServe | PyTorch models |
| Triton | Multi-framework, high performance |
| SKLearn | Scikit-learn models |
| XGBoost | Gradient boosting models |
| ONNX | ONNX format models |

Configuration

InferenceService Example

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llm-service
  namespace: ai-hub
spec:
  predictor:
    model:
      modelFormat:
        name: vllm
      runtime: vllm-runtime
      storageUri: pvc://model-cache/models/qwen-32b
      resources:
        requests:
          cpu: "4"
          memory: 32Gi
          nvidia.com/gpu: "2"
        limits:
          cpu: "8"
          memory: 64Gi
          nvidia.com/gpu: "2"
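The Catalyst defaults run every InferenceService in RawDeployment mode unless it carries the serving.kserve.io/deploymentMode: Serverless annotation. A minimal sketch of a per-tenant opt-in to scale-to-zero might look like this; the service name, model format, and storage path are hypothetical placeholders.

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: tenant-a-classifier        # hypothetical name
  namespace: ai-hub
  annotations:
    # Overrides the cluster-wide RawDeployment default for this service only
    serving.kserve.io/deploymentMode: Serverless
spec:
  predictor:
    minReplicas: 0                 # allow Knative to scale the predictor to zero
    model:
      modelFormat:
        name: sklearn              # assumption: a matching runtime is installed
      storageUri: pvc://model-cache/models/tenant-a-classifier  # hypothetical path
```

Services without the annotation keep the plain Deployment path, so only workloads that tolerate cold starts need to opt in.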

ServingRuntime for vLLM

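apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: vllm-runtime
  namespace: ai-hub  # ServingRuntime is namespaced; keep it with the InferenceService
spec:
  supportedModelFormats:
    - name: vllm
      autoSelect: true
  containers:
    - name: kserve-container
      image: vllm/vllm-openai:latest  # pin a specific tag in production
      args:
        - --model=/mnt/models  # KServe mounts the storageUri contents at this path
        - --port=8080  # KServe's default container port
        - --tensor-parallel-size=2  # matches the two GPUs requested below
        - --max-model-len=32768
      resources:
        requests:
          nvidia.com/gpu: "2"
        limits:
          nvidia.com/gpu: "2"  # GPU requests are only valid with an equal limit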

Inference Graph

Multi-model pipeline for complex inference:

apiVersion: serving.kserve.io/v1alpha1
kind: InferenceGraph
metadata:
  name: rag-pipeline
spec:
  nodes:
    root:
      routerType: Sequence
      steps:
        # Each step calls an InferenceService by name; steps run in order.
        - name: embedder
          serviceName: bge-embedder
        - name: retriever
          serviceName: vector-search
          data: $response  # feed the previous step's response into this step
        - name: llm
          serviceName: qwen-llm
          data: $response

GPU Scheduling

# Node selector for GPU nodes
spec:
  predictor:
    nodeSelector:
      nvidia.com/gpu.product: NVIDIA-A10
    tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule

Model Storage

PVC-based Storage

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-cache
  namespace: ai-hub
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 100Gi
  storageClassName: oci-bv

S3-based Storage (SeaweedFS)

apiVersion: serving.kserve.io/v1alpha1
kind: ClusterStorageContainer
metadata:
  name: seaweedfs-storage
spec:
  supportedUriFormats:
    - prefix: s3://  # storageUri values this container is responsible for
  container:
    name: storage-initializer
    image: kserve/storage-initializer:latest
    env:
      - name: AWS_ACCESS_KEY_ID
        valueFrom:
          secretKeyRef:
            name: seaweedfs-credentials
            key: accesskey
      - name: AWS_SECRET_ACCESS_KEY
        valueFrom:
          secretKeyRef:
            name: seaweedfs-credentials
            key: secretkey
      - name: S3_ENDPOINT
        value: http://seaweedfs.storage.svc:8333
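An InferenceService picks up a storage container through its storageUri scheme. The sketch below assumes the container above is registered for s3:// URIs; the bucket, path, and model format are hypothetical, and the service name is reused from the inference graph example.

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: bge-embedder
  namespace: ai-hub
spec:
  predictor:
    model:
      modelFormat:
        name: huggingface          # assumption: a matching runtime is installed
      # Hypothetical bucket and path on SeaweedFS; replace with the real location.
      storageUri: s3://models/bge-embedder
```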

Monitoring

| Metric | Query |
|---|---|
| Inference latency | kserve_inference_duration_seconds |
| Request count | kserve_inference_count |
| GPU utilization | DCGM_FI_DEV_GPU_UTIL |
| Model load time | kserve_model_load_duration_seconds |
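Controller metrics are only scraped once the chart's ServiceMonitor overlay is enabled, since it defaults to false and is Capabilities-gated. A minimal sketch of what such a gated template could look like follows. The values key serviceMonitor.enabled (assumed to be defined as false in values.yaml), the selector label, and the port name are all assumptions, not the shipped template.

```yaml
{{- /* Illustrative sketch of chart/templates/servicemonitor.yaml */ -}}
{{- if and .Values.serviceMonitor.enabled (.Capabilities.APIVersions.Has "monitoring.coreos.com/v1") }}
# Rendered only when the toggle is on AND the Prometheus Operator CRDs exist.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: {{ .Release.Name }}-controller-manager
  namespace: {{ .Release.Namespace }}
spec:
  selector:
    matchLabels:
      control-plane: kserve-controller-manager  # assumption: label on the metrics Service
  endpoints:
    - port: metrics                             # assumption: named port on that Service
      interval: 30s
{{- end }}
```

To exercise the opt-in path without a live cluster, a render along the lines of `helm template chart/ --set serviceMonitor.enabled=true --api-versions monitoring.coreos.com/v1` simulates the presence of the Prometheus Operator CRDs, as the chart's tests/observability-toggle.sh does.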

Consequences

Positive:

  • Standardized model deployment
  • Multi-framework support
  • Autoscaling: HPA in RawDeployment mode, scale-to-zero via Knative in Serverless mode
  • Inference graphs for pipelines
  • GPU scheduling support

Negative:

  • Complexity for simple deployments
  • Scale-to-zero requires Knative (Serverless mode), so bp-knative is installed as a hard dependency
  • Learning curve for KServe concepts

Part of OpenOva