Edge + serverless + model-serving batch (W2.5.C): three upstream-subchart umbrella Blueprints completing the bootstrap-kit slots for WebRTC media relay (bp-relay → bp-stunner) and the AI/ML serving stack (bp-cortex → bp-kserve → bp-knative).

Each chart follows the canonical umbrella pattern from docs/BLUEPRINT-AUTHORING.md §11.1: Chart.yaml declares the upstream chart under `dependencies:` so `helm dependency build` bundles the upstream payload into the OCI artifact, and Catalyst-curated overlay values + templates sit alongside in chart/values.yaml + chart/templates/.

Per-chart highlights:

- bp-stunner/1.0.0: wraps stunner/stunner-gateway-operator 1.1.0. Ships a Cilium-native GatewayClass (Capabilities-gated on gateway.networking.k8s.io/v1) so bp-relay (LiveKit / SFU) can claim Gateway CRs without an operator-ordering dance. Default UDP TURN port range 30000-32767 matches the range opened at the Sovereign edge firewall (Crossplane bp-firewall composition).
- bp-knative/1.0.0: wraps knative-operator v1.21.1. Ships a KnativeServing CR pre-configured for **istio-less mode** (ingress.istio.enabled=false, ingress.contour.enabled=false, ingress.kourier.enabled=false; config.network.ingress-class=cilium). Sovereign FQDN is sourced from values, with no hardcoded fallback per inviolable principle #4: the render fails loudly if the cluster overlay doesn't set knativeOverlay.knativeServing.sovereignFqdn.
- bp-kserve/1.0.0: wraps kserve/kserve v0.16.0 (latest version published on the official OCI registry as of 2026-04-30). Default deploymentMode=RawDeployment (no Knative hop on the hot path), but bp-knative is still installed (declared as a hard dep) so the per-IS annotation `serving.kserve.io/deploymentMode: Serverless` opts in to scale-to-zero per tenant. Cilium native Gateway-API ingress (enableGatewayApi=true, className=cilium, disableIstioVirtualHost=true).

Observability discipline (issue #182): every observability toggle (ServiceMonitor, HPA, GatewayClass) defaults false and is operator-tunable via per-cluster overlay once bp-kube-prometheus-stack reconciles. Each chart ships tests/observability-toggle.sh covering default-off, opt-in (with `--api-versions monitoring.coreos.com/v1` to simulate Prometheus Operator CRDs), and explicit-off cases.

Per-chart kind summary (helm template default render):

- bp-stunner: ClusterRole, ClusterRoleBinding, ConfigMap, Dataplane, Deployment, Role, RoleBinding, Service, ServiceAccount. (+ GatewayClass when --api-versions gateway.networking.k8s.io/v1 is passed.)
- bp-knative: ClusterRole, ClusterRoleBinding, ConfigMap, CustomResourceDefinition, Deployment, KnativeServing, Role, RoleBinding, Secret, Service, ServiceAccount.
- bp-kserve: Certificate, ClusterRole, ClusterRoleBinding, ClusterServingRuntime, ClusterStorageContainer, ConfigMap, Deployment, Gateway, Issuer, MutatingWebhookConfiguration, Role, RoleBinding, Service, ServiceAccount, ValidatingWebhookConfiguration.

`helm lint` clean for all three (single INFO on missing icon; icons land with marketplace card work). `bash tests/observability-toggle.sh` green for all three (3 cases each: default-off, opt-in, explicit-off).

Closes #263 #264 #265

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
KServe
Kubernetes-native model serving. Application Blueprint (see docs/PLATFORM-TECH-STACK.md §4.6). Used by bp-cortex to serve LLMs via vLLM, embedding models via BGE, and any custom inference workload.
Status: Accepted | Updated: 2026-04-30
Blueprint chart
This folder ships an umbrella Helm chart at `chart/` that wraps the upstream `kserve/kserve` chart (v0.16.0, the latest version published on the official OCI registry as of 2026-04-30) by declaring it under `dependencies:`. Catalyst-curated overlay templates render alongside:
- `chart/templates/networkpolicy.yaml`: locks the controller-manager namespace down (DEFAULT FALSE).
- `chart/templates/servicemonitor.yaml`: controller-manager metrics scrape (DEFAULT FALSE per `docs/BLUEPRINT-AUTHORING.md` §11.2; Capabilities-gated).
- `chart/templates/hpa.yaml`: controller-manager Deployment HPA (DEFAULT FALSE; controller is leader-elected).
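As an illustration of what "DEFAULT FALSE; Capabilities-gated" can look like in practice, here is a minimal sketch of a gated ServiceMonitor template. It is not the chart's actual template: the value key `serviceMonitor.enabled`, the selector label, and the port name are all hypothetical, and the sketch assumes `values.yaml` defines `serviceMonitor.enabled: false` so the lookup never hits a missing key.

```yaml
# Sketch only: value keys, labels, and port name are assumptions, not the shipped template.
# Renders nothing unless the toggle is on AND the Prometheus Operator API is present.
{{- if and .Values.serviceMonitor.enabled (.Capabilities.APIVersions.Has "monitoring.coreos.com/v1") }}
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: {{ .Release.Name }}-controller-manager
  namespace: {{ .Release.Namespace }}
spec:
  selector:
    matchLabels:
      control-plane: kserve-controller-manager   # assumed label on the metrics Service
  endpoints:
    - port: metrics                              # assumed port name
      interval: 30s
{{- end }}
```

With a gate like this, `helm template` emits the resource only when the toggle is set and `--api-versions monitoring.coreos.com/v1` is passed (or the CRDs exist in the target cluster), which is the opt-in case the toggle tests exercise.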
Catalyst defaults:
- `kserve.controller.deploymentMode: RawDeployment`: KServe writes plain Deployment+Service+HPA per InferenceService (no Knative hop on the hot path).
- `kserve.controller.gateway.ingressGateway.enableGatewayApi: true` + `className: cilium`: Catalyst's istio-less Cilium native Gateway-API path.
- `kserve.controller.gateway.disableIstioVirtualHost: true`: Knative-Istio is NOT installed.

`bp-knative` is still installed (declared as a hard dependency in `blueprint.yaml`) so the per-InferenceService annotation `serving.kserve.io/deploymentMode: Serverless` opts in to scale-to-zero on a per-tenant basis without infra changes.
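Written out as an overlay, the defaults above would look roughly like the following. This is a sketch assembled from the dotted keys in this section, not a copy of `chart/values.yaml`; in particular, nesting `className` under `ingressGateway` is an assumption.

```yaml
# Sketch of the Catalyst defaults as a values overlay (structure inferred from the dotted keys).
kserve:
  controller:
    # Plain Deployment + Service + HPA per InferenceService
    deploymentMode: RawDeployment
    gateway:
      # No Istio VirtualService objects are created
      disableIstioVirtualHost: true
      ingressGateway:
        enableGatewayApi: true
        className: cilium   # assumed to sit under ingressGateway
```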
Overview
KServe provides standardized model serving on Kubernetes with support for multiple ML frameworks, autoscaling, and inference graphs.
```mermaid
flowchart TB
    subgraph KServe["KServe"]
        Controller[KServe Controller]
        Predictor[Predictor]
        Transformer[Transformer]
        Explainer[Explainer]
    end

    subgraph Runtimes["Serving Runtimes"]
        vLLM[vLLM]
        TorchServe[TorchServe]
        Triton[Triton]
        SKLearn[SKLearn]
    end

    subgraph Knative["Knative Serving"]
        Autoscale[Autoscaling]
        Revisions[Revisions]
    end

    Controller --> Predictor
    Controller --> Transformer
    Controller --> Explainer
    Predictor --> Runtimes
    Runtimes --> Knative
```
Why KServe?
| Feature | Benefit |
|---|---|
| Multi-framework | TensorFlow, PyTorch, ONNX, vLLM, etc. |
| Autoscaling | Scale-to-zero via Knative |
| InferenceService | Standardized deployment pattern |
| Inference Graph | Multi-model pipelines |
| Model explainability | Integrated explainers |
Components
| Component | Purpose |
|---|---|
| InferenceService | Model deployment abstraction |
| ServingRuntime | Framework-specific runtime |
| InferenceGraph | Multi-model orchestration |
| ClusterStorageContainer | Model storage configuration |
Serving Runtimes
| Runtime | Use Case |
|---|---|
| vLLM | LLM inference (recommended) |
| TorchServe | PyTorch models |
| Triton | Multi-framework, high performance |
| SKLearn | Scikit-learn models |
| XGBoost | Gradient boosting models |
| ONNX | ONNX format models |
Configuration
InferenceService Example
```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llm-service
  namespace: ai-hub
spec:
  predictor:
    model:
      modelFormat:
        name: vllm
      runtime: vllm-runtime
      storageUri: pvc://model-cache/models/qwen-32b
      resources:
        requests:
          cpu: "4"
          memory: 32Gi
          nvidia.com/gpu: "2"
        limits:
          cpu: "8"
          memory: 64Gi
          nvidia.com/gpu: "2"
```
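To opt a single InferenceService in to Knative-backed scale-to-zero, the Catalyst defaults describe setting the `serving.kserve.io/deploymentMode: Serverless` annotation. A minimal sketch follows; the service name is hypothetical, the runtime and storage URI are reused from the example above, and resource requests are omitted for brevity.

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llm-service-burst            # hypothetical name
  namespace: ai-hub
  annotations:
    # Overrides the RawDeployment default for this service only
    serving.kserve.io/deploymentMode: Serverless
spec:
  predictor:
    minReplicas: 0                   # allow scale-to-zero when idle
    model:
      modelFormat:
        name: vllm
      runtime: vllm-runtime
      storageUri: pvc://model-cache/models/qwen-32b
```

Scale-to-zero trades idle GPU cost for cold-start latency, so it tends to suit bursty or per-tenant workloads rather than the always-on hot path.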
ServingRuntime for vLLM
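```yaml
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: vllm-runtime
  # ServingRuntime is namespaced: it must live in the same namespace as the
  # InferenceService that references it
  namespace: ai-hub
spec:
  supportedModelFormats:
    - name: vllm
      autoSelect: true
  containers:
    - name: kserve-container
      # Pin a specific tag in production instead of :latest
      image: vllm/vllm-openai:latest
      args:
        # $(MODEL_ID) expands only if MODEL_ID is defined as an env var on this
        # container; otherwise the literal string is passed to vLLM
        - --model=$(MODEL_ID)
        - --tensor-parallel-size=2
        - --max-model-len=32768
      resources:
        requests:
          nvidia.com/gpu: "2"
        # Extended resources such as GPUs need a limit equal to the request
        limits:
          nvidia.com/gpu: "2"
```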
Inference Graph
Multi-model pipeline for complex inference:
```yaml
apiVersion: serving.kserve.io/v1alpha1
kind: InferenceGraph
metadata:
  name: rag-pipeline
spec:
  nodes:
    root:
      routerType: Sequence
      # Each step calls an InferenceService by name; steps run in order
      steps:
        - name: embedder
          serviceName: bge-embedder
        - name: retriever
          serviceName: vector-search
          data: $response   # feed the previous step's response forward
        - name: llm
          serviceName: qwen-llm
          data: $response
```
GPU Scheduling
```yaml
# Node selector for GPU nodes (fragment of an InferenceService spec)
spec:
  predictor:
    nodeSelector:
      nvidia.com/gpu.product: NVIDIA-A10
    tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
```
Model Storage
PVC-based Storage
```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-cache
  namespace: ai-hub
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 100Gi
  storageClassName: oci-bv
```
S3-based Storage (SeaweedFS)
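```yaml
apiVersion: serving.kserve.io/v1alpha1
kind: ClusterStorageContainer
metadata:
  name: seaweedfs-storage
spec:
  container:
    name: storage-initializer
    # Pin a specific tag in production instead of :latest
    image: kserve/storage-initializer:latest
    env:
      # The Secret is resolved in the namespace where the InferenceService pod runs
      - name: AWS_ACCESS_KEY_ID
        valueFrom:
          secretKeyRef:
            name: seaweedfs-credentials
            key: accesskey
      - name: AWS_SECRET_ACCESS_KEY
        valueFrom:
          secretKeyRef:
            name: seaweedfs-credentials
            key: secretkey
      - name: S3_ENDPOINT
        value: http://seaweedfs.storage.svc:8333
  # Storage URIs with this prefix are handled by the container above
  supportedUriFormats:
    - prefix: s3://
```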
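An InferenceService would then pull its model through this storage container simply by using an `s3://` storage URI, assuming the container is registered for that prefix. The sketch below is illustrative only: the bucket and path are hypothetical, and the `huggingface` model format is an assumption about which runtime serves the embedder.

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: bge-embedder
  namespace: ai-hub                 # seaweedfs-credentials must exist in this namespace
spec:
  predictor:
    model:
      modelFormat:
        name: huggingface           # assumed model format
      storageUri: s3://models/bge   # hypothetical bucket and path on SeaweedFS
```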
Monitoring
| Metric | Query |
|---|---|
| Inference latency | kserve_inference_duration_seconds |
| Request count | kserve_inference_count |
| GPU utilization | DCGM_FI_DEV_GPU_UTIL |
| Model load time | kserve_model_load_duration_seconds |
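As one way to put the latency metric to use, the following PrometheusRule sketches a p95 latency alert. It assumes the Prometheus Operator CRDs are installed and that `kserve_inference_duration_seconds` is exposed as a histogram (with a `_bucket` series); the rule name, threshold, and durations are hypothetical and should be tuned per workload.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: kserve-inference-latency      # hypothetical name
  namespace: ai-hub
spec:
  groups:
    - name: kserve.inference
      rules:
        - alert: KServeInferenceLatencyHigh
          # Assumes the metric from the table above is a histogram
          expr: |
            histogram_quantile(
              0.95,
              sum by (le) (rate(kserve_inference_duration_seconds_bucket[5m]))
            ) > 2
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: p95 inference latency above 2s for 10 minutes
```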
Consequences
Positive:
- Standardized model deployment
- Multi-framework support
- Autoscaling via HPA in RawDeployment mode, with scale-to-zero via Knative in Serverless mode
- Inference graphs for pipelines
- GPU scheduling support
Negative:
- Complexity for simple deployments
- Requires Knative for Serverless mode (bp-knative is a hard dependency even though RawDeployment is the default)
- Learning curve for KServe concepts
Part of OpenOva