Commit Graph

36 Commits

Author SHA1 Message Date
e3mrah
8878938a43
fix(ci): bump sme-services Containerfiles golang 1.22 → 1.26 (unblock 5 stranded fixes) (#1691)
Every services-build run since 2026-05-18 06:32 UTC failed with
"go: go.mod requires go >= 1.26.0 (running go 1.22.12; GOTOOLCHAIN=local)"
because a recent go.mod bump to `go 1.26.0` was not paired with a
Containerfile base-image bump.

5 stranded fixes that never produced new image SHAs:
- PR #1683 fix(billing): consume catalyst.usage.recorded from
  CATALYST_SME stream (was creating overlapping CATALYST_USAGE)
- PR #1684 fix(provisioning): set Organization.spec.tenantPublic
- PR #1685 fix(catalog+billing): Sandbox Free/Pro/Ent plans + quota
- PR #1686 feat(sandbox): orchestrator listens tenant.sandbox_requested
- test(sandbox): integration tests for orchestrator + sessions API

The stranded billing image is the root cause of every voucher 502 on
t22 and blocks the full marketplace customer journey (steps 9, 10, 15
all fail). t22 billing Pod is in CrashLoopBackOff with the exact NATS
subject-overlap signature PR #1683 fixes.

Bumps all 10 service Containerfiles (auth/billing/catalog/catalyst-
catalog/domain/gateway/metering-sidecar/notification/provisioning/
tenant) to golang:1.26-alpine, matching the toolchain in go.mod.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 16:36:39 +04:00
e3mrah
8017700ad4
feat(sandbox): tier-bound MCP capabilities (Free/Pro/Ent plans gate tool access) (#1690)
Stop handing every Sandbox session the full MCP surface. Each per-Sandbox
NewAPI token now carries a plan-derived capability allowlist that the MCP
server enforces against per-tool RequiredCapability via Claims.HasCapability:

  - Free: read-only k8s + gitea read + session/rag/skills
  - Pro:  + sandbox.db.* + sandbox.storage.* + sandbox.preview.* +
          sandbox.auth.* + sandbox.secrets.* + marketplace.* + flux.status
  - Ent:  + sandbox.deploy.{staging,production,...} + sandbox.stripe.* +
          flux.{reconcile,suspend,resume} + gitea.pr.{create,merge} +
          gitea.issue.*

Wiring:
  - Sandbox CRD spec gains planId + capabilities[] (operator overlay).
  - Sandbox sandboxapi.{CapabilitiesForPlan,ResolveCapabilities} is the
    SoT; tenant orchestrator carries an exact-mirror capabilitiesForPlan
    (no controllers-module dep — same isolation pattern quotaForPlan
    uses).
  - sandbox-controller threads spec.capabilities (falling back to plan)
    into newapi.MintRequest.
  - catalyst-api bridge handler accepts capabilities[] on the wire and
    encodes it as the JWT `capabilities` claim (omitted when empty).
  - Claims.HasCapability gains wildcard prefix matching (`sandbox.db.*`
    satisfies `sandbox.db.provision`, `sandbox.db`, etc.) so plan grants
    stay coarse. Plain stem matches WITHOUT a wildcard are intentionally
    rejected — the production second-gate in sandbox_deploy.go stays
    honest.
  - MCP registry: every gated tool now carries its granular dotted
    RequiredCapability (`sandbox.db.provision`, `gitea.pr.list`, …).
    Read-only / session tools previously ungated also get granular
    grants so Free tokens can browse without inheriting the write
    surface.
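
The wildcard rule described above can be sketched as follows. This is a minimal illustration and not the code from shared/auth/claims.go: the trimmed-down `Claims` struct and its `Capabilities` field are assumptions, and only the matching behaviour (exact match, `.*` prefix match, plain stems rejected) is taken from the commit message.

```go
package main

import (
	"fmt"
	"strings"
)

// Claims is a hypothetical, trimmed-down stand-in for the shared Claims struct.
type Claims struct {
	Capabilities []string
}

// HasCapability reports whether any grant satisfies the required capability.
// A grant matches when it equals the requirement exactly, or when it ends in
// ".*" and the requirement is the stem itself or sits below the stem.
func (c Claims) HasCapability(required string) bool {
	for _, grant := range c.Capabilities {
		if grant == required {
			return true
		}
		stem, isWildcard := strings.CutSuffix(grant, ".*")
		if !isWildcard {
			continue // a plain stem never widens into a prefix match
		}
		if required == stem || strings.HasPrefix(required, stem+".") {
			return true
		}
	}
	return false
}

func main() {
	pro := Claims{Capabilities: []string{"sandbox.db.*", "flux.status"}}
	fmt.Println(pro.HasCapability("sandbox.db.provision")) // true
	fmt.Println(pro.HasCapability("sandbox.db"))           // true
	fmt.Println(pro.HasCapability("flux.reconcile"))       // false

	// A stem without a wildcard does not cover its children.
	stemOnly := Claims{Capabilities: []string{"sandbox.deploy"}}
	fmt.Println(stemOnly.HasCapability("sandbox.deploy.production")) // false
}
```

Requiring the explicit `.*` suffix keeps a grant such as `sandbox.deploy` from silently covering `sandbox.deploy.production`, which is the property the second gate relies on.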

No Chart.yaml bump — CRD additions are additive; existing Sandbox CRs
parse fine. Empty token capabilities downgrades to introspection only,
matching pre-PR-#1671 callers.

Tests: shared/auth/claims_test.go (wildcard matrix),
sandboxapi/capabilities_test.go (plan ladder + spec override),
sandbox_token_test.go (capabilities round-trip + omit-on-empty),
sandbox_controller_test.go (plan-derived + spec-override mint),
sandbox_consumer_test.go (orchestrator stamps spec.capabilities), plus
updates to every per-namespace registry test asserting new granular
RequiredCapability values.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 16:30:00 +04:00
e3mrah
ffb79aab12
fix(billing): consume catalyst.usage.recorded from CATALYST_SME stream (was creating overlapping CATALYST_USAGE) (#1683)
t20 (2026-05-18) caught the bug: billing crashed at startup with NATS
error code 10065 "subjects overlap with an existing stream" because
CATALYST_SME (subjects `catalyst.>`, created by the tenant /
provisioning MultiSubscribers) had already claimed `catalyst.usage.recorded`
by the time billing tried to create CATALYST_USAGE
(subject `catalyst.usage.recorded`). JetStream forbids two Streams from
owning overlapping subject filters.

Option B per the matrix: have billing share CATALYST_SME and scope its
metering reads via a consumer-side FilterSubject instead of owning a
separate Stream. This matches the architecture every other SME service
(tenant, notification, provisioning) already uses for catalyst.* events.
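
To make the overlap concrete, here is a small sketch of NATS-style subject matching in plain Go. It does not use the NATS client and is not the code in nats.go; `subjectMatches` is a hypothetical helper. It assumes the standard wildcard semantics: `*` matches exactly one token and `>` matches one or more trailing tokens.

```go
package main

import (
	"fmt"
	"strings"
)

// subjectMatches reports whether a literal subject is covered by a filter.
// "*" matches exactly one token; ">" matches one or more trailing tokens.
func subjectMatches(filter, subject string) bool {
	f := strings.Split(filter, ".")
	s := strings.Split(subject, ".")
	for i, tok := range f {
		if tok == ">" {
			// ">" must be the last filter token and needs at least one token left.
			return i == len(f)-1 && len(s) > i
		}
		if i >= len(s) {
			return false
		}
		if tok != "*" && tok != s[i] {
			return false
		}
	}
	return len(f) == len(s)
}

func main() {
	const streamFilter = "catalyst.>"                // subjects owned by CATALYST_SME
	const consumerFilter = "catalyst.usage.recorded" // billing's metering filter

	// The dedicated stream's only subject is already covered by CATALYST_SME,
	// which is why creating a second stream for it is rejected.
	fmt.Println("overlap:", subjectMatches(streamFilter, consumerFilter))

	// With a consumer-side filter on the shared stream, billing sees only usage events.
	events := []string{"catalyst.tenant.created", "catalyst.usage.recorded", "catalyst.tenant.deleted"}
	for _, subj := range events {
		if subjectMatches(consumerFilter, subj) {
			fmt.Println("billing receives:", subj)
		}
	}
}
```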

Changes:
- core/services/shared/events/nats.go: add EnsureCatalystSMEStream
  (public wrapper around the existing package-private ensureSMEStream
  helper used by NewMultiSubscriber) + SubscribeUsageRecordedOnSME
  (durable consumer on CATALYST_SME with FilterSubject scoped to
  catalyst.usage.recorded). The original EnsureUsageStream and
  SubscribeUsageRecorded are retained but marked Deprecated for
  back-compat with any Catalyst-Zero / dev loop wired before t20.
- core/services/billing/main.go: replace the EnsureUsageStream call
  with EnsureCatalystSMEStream and the SubscribeUsageRecorded call
  with SubscribeUsageRecordedOnSME. Comment captures the t20 root
  cause + the bootstrap-order rationale so the next reader doesn't
  re-introduce the dedicated Stream.

The consumer-side FilterSubject (`catalyst.usage.recorded`) lives in
core/services/shared/events/nats.go inside SubscribeUsageRecordedOnSME.

go build + go test clean for core/services/billing and
core/services/shared.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 14:53:25 +04:00
e3mrah
3acb340b36
test(sandbox): integration tests for orchestrator + sessions API status reflection (#1680)
Adds regression coverage so the Sandbox event flow + REST surface can
be exercised without a live Sovereign — the convergence loop the
qa-loop's last 5 iterations relied on.

Tenant orchestrator (5 cases / 8 runs):
  * full event flow — tenant.sandbox_requested envelope → in-process
    BrokerSubscriber → SandboxOrchestrator.Start → recordingSandboxClient
    materialises a CR shaped per architecture.md §7 (labels, annotations,
    spec.owner/quota/agentCatalogue/planId)
  * NATS-style redelivery is idempotent — second Emit() goes Get(found)
    → no-op, Create count stays at 1
  * plan tiers fan out — free/pro/ent each stamp the right quota
    (catches the PR #1633 regression)
  * non-sandbox event types ignored at the dispatcher seam
  * agentCatalogue strips empty / whitespace entries before persist

Catalyst sessions API (7 cases / 10 runs):
  * POST → GET round-trip through a dynamic/fake apiserver via
    SetSovereignDepsFactory (mirrors chroot Sovereign "Path 2")
  * GET reflects controller status (sessions / storage / spend /
    previews / conditions) into the FE wire shape
  * Failed condition taxonomy — TokenMintFailed, GitopsWriteFailed,
    ManifestRenderFailed each preserved verbatim so the FE renders
    actionable error states instead of a generic red pill
  * POST invalid-agent returns 400 before any apiserver call
  * GET unknown sandbox returns 404 sandbox-not-found
  * LIST → DELETE → LIST round-trip
  * Org-scope isolation — claims.Org-scoped namespace boundary blocks
    cross-Org leak

Hard rules followed: READ-ONLY fake clients (no apiserver write), no
chart bump, no production code changes — only new _test.go files.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 14:41:41 +04:00
e3mrah
96d2d9bce7
fix(provisioning): set Organization.spec.tenantPublic on product-install (was empty; HTTPRoute reconciler had nothing to render) (#1650)
PR #1644 added Organization.spec.tenantPublic + per-tenant HTTPRoute
reconciler, but nothing set the field — every Org CR's TenantPublic
stayed zero-value, the reconciler short-circuited at the empty
ParentDomain guard, and `<slug>.omani.homes` 404'd at the Cilium
Gateway.

Wire the patch at the only point that knows a tenant's product is
actually Ready: the provisioning service. Both the initial workflow
(`provision.completed`) and the day-2 install path
(`provision.app_ready`) now patch the Organization CR's
spec.tenantPublic with parentDomain (from TENANT_PARENT_DOMAIN env),
subdomain (= slug), backendService (canonical vcluster-synced name),
port 80, and the picked product slug. Last-write-wins on subsequent
installs.

Per docs/INVIOLABLE-PRINCIPLES.md #4 the parent zone flows through
env, never hardcoded — every Sovereign picks its own pool zone.
Empty env disables the patch entirely (legacy tenants keep working
through the Sovereign-wide tenant-wildcard route). Best-effort:
failures don't fail the provision. 404 on the CR is benign (legacy
tenant without an Organization counterpart).
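
A rough sketch of how such a patch body could be assembled is shown below. The `parentDomain`, `subdomain`, `backendService` and `port` keys follow the field names in this message; the `product` key, the `buildTenantPublicPatch` helper and the example backend service name are assumptions made for illustration only.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// tenantPublic mirrors the fields named in the commit message.
// The "product" JSON key is an assumption of this sketch.
type tenantPublic struct {
	ParentDomain   string `json:"parentDomain"`
	Subdomain      string `json:"subdomain"`
	BackendService string `json:"backendService"`
	Port           int    `json:"port"`
	Product        string `json:"product"`
}

// buildTenantPublicPatch returns a JSON merge patch for spec.tenantPublic.
// An empty parent domain disables the patch: the result is (nil, nil).
func buildTenantPublicPatch(parentDomain, slug, backendService, product string) ([]byte, error) {
	if parentDomain == "" {
		return nil, nil
	}
	patch := map[string]any{
		"spec": map[string]any{
			"tenantPublic": tenantPublic{
				ParentDomain:   parentDomain,
				Subdomain:      slug, // subdomain equals the tenant slug
				BackendService: backendService,
				Port:           80,
				Product:        product,
			},
		},
	}
	return json.Marshal(patch)
}

func main() {
	// In the service the parent domain would come from TENANT_PARENT_DOMAIN.
	patch, err := buildTenantPublicPatch("omani.homes", "acme", "acme-wordpress", "wordpress")
	if err != nil {
		fmt.Println("patch skipped:", err) // best-effort: log and carry on
		return
	}
	fmt.Println(string(patch))
}
```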

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 11:44:00 +04:00
e3mrah
8888d9edd1
feat(catalog+billing): Sandbox Free/Pro/Ent plans + quota wire (was no plans = broken checkout) (#1642)
PR #1633 added the Sandbox app to seedApps but never wired the matching plan
rows. The marketplace checkout hit "plan_id not found" the moment a customer
picked Sandbox, and PR #1639's sandbox-orchestrator could only mint CRs with
the Wave 1 baseline quota regardless of the picked tier.

This PR closes both gaps in lockstep:

Catalog:
- Plan struct gets ProductSlug + IncludedQuotas fields (back-compat:
  omitempty BSON tags so legacy rows decode fine).
- expectedSandboxPlans() helper canonical-defines the three tiers:
    sandbox-free  0 OMR  1 session, 1 agent,    5 GB, BYOS
    sandbox-pro   9 OMR  3 sessions, 6 agents, 50 GB, BYOS (Popular)
    sandbox-ent  49 OMR  unlimited,  6 agents, 500 GB, BYOS
- seedAllData appends them on fresh seed; seedMissingSandboxPlans
  backfills them on already-populated Sovereigns (idempotent GET-then-
  create, patches missing ProductSlug/IncludedQuotas on legacy rows).
- UpdatePlan persists the two new fields.

Sandbox orchestrator wiring:
- SandboxRequestedPayload.PlanID added; CreateOrg forwards body.PlanID.
- buildSandbox stamps openova.io/plan-id annotation + spec.planId when
  PlanID is non-empty.
- quotaForPlan() maps sandbox-{free,pro,ent} → SandboxQuota; empty or
  unknown plan_id falls through to DefaultQuota (Wave 1 baseline =
  Sandbox Free shape). Hard-coded map mirrors catalog IncludedQuotas so
  tenant-service avoids a compile-time dep on the catalog mongo stack.
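
The plan-to-quota lookup might look roughly like this. The tier values come from the table above; the `SandboxQuota` field names and the use of -1 for "unlimited" are assumptions of this sketch, not the types in tenant-service.

```go
package main

import "fmt"

// SandboxQuota is a hypothetical shape; the real struct may carry more fields.
type SandboxQuota struct {
	Sessions  int // unlimited is encoded as -1 in this sketch
	Agents    int
	StorageGB int
}

const unlimited = -1

// DefaultQuota is the Wave 1 baseline, which matches the Sandbox Free shape.
var DefaultQuota = SandboxQuota{Sessions: 1, Agents: 1, StorageGB: 5}

// planQuotas is hard-coded so the caller needs no dependency on the catalog.
var planQuotas = map[string]SandboxQuota{
	"sandbox-free": {Sessions: 1, Agents: 1, StorageGB: 5},
	"sandbox-pro":  {Sessions: 3, Agents: 6, StorageGB: 50},
	"sandbox-ent":  {Sessions: unlimited, Agents: 6, StorageGB: 500},
}

// quotaForPlan falls back to DefaultQuota for empty or unknown plan IDs.
func quotaForPlan(planID string) SandboxQuota {
	if q, ok := planQuotas[planID]; ok {
		return q
	}
	return DefaultQuota
}

func main() {
	for _, id := range []string{"sandbox-free", "sandbox-pro", "sandbox-ent", "", "sandbox-typo"} {
		fmt.Printf("%q -> %+v\n", id, quotaForPlan(id))
	}
}
```

The fallback keeps publishers that predate the `PlanID` field working, at the cost of silently treating a mistyped plan as Free.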

Tests:
- TestExpectedSandboxPlans_Shape locks slugs, prices, quota keys, the
  Popular flag (sandbox-pro), and the quota ladder.
- TestSandboxHandle_PlanIDStampsAnnotationAndQuota table-test exercises
  all three tiers end-to-end (annotation + spec.planId + spec.quota).
- TestSandboxHandle_PlanIDEmptyKeepsDefaultQuota guards back-compat
  with pre-PR publishers.
- TestSandboxHandle_PlanIDUnknownFallsBackToDefault guards typo'd /
  retired plan IDs.

go build + go test clean for catalog, tenant, billing, provisioning,
shared, marketplace-api.

No Chart.yaml bump, no cluster touch.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 11:31:25 +04:00
e3mrah
4c83d98765
feat(sandbox): orchestrator listens tenant.sandbox_requested → Sandbox CR materialisation (#1639)
PR #1633 wired CreateOrg to publish `tenant.sandbox_requested` when the
marketplace cart includes the sandbox product. Nobody was subscribing —
the event landed in NATS `catalyst.tenant.sandbox_requested` and aged
out unread, so no Sandbox CR (PR #1622) was ever minted and the
customer sat on a "Provisioning…" spinner forever.

This slice closes the loop. A new SandboxOrchestrator in tenant-service:

- Subscribes via events.MultiSubscriber (PR #1636) to the canonical
  NATS subject + legacy Kafka topic.
- Parses {tenant_id, org_slug, owner_id, owner_email, agents,
  sovereign, requested_at} and resolves the owner email (event field
  → store.GetMemberEmail → owner_id fallback).
- Materialises a Sandbox CR in catalyst-system (SANDBOX_NAMESPACE
  override) via a dynamic client, with spec per architecture §7:
  owner.email + owner.orgRef.slug, default quota (4 CPU / 8 Gi /
  50 Gi / 3 sessions), spec.agentCatalogue from the cart.
- Idempotent: Get-then-Create with AlreadyExists swallowed so NATS
  redeliveries + duplicate marketplace submits stay no-ops; the
  sandbox-controller remains SoR for spec mutations.
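
The idempotency rule can be illustrated with an in-memory stand-in for the dynamic client. Everything here (`fakeClient`, `ensureSandbox`, the sentinel errors) is hypothetical; real code would go through the Kubernetes dynamic client and check errors with `apierrors.IsNotFound` and `apierrors.IsAlreadyExists` from k8s.io/apimachinery.

```go
package main

import (
	"errors"
	"fmt"
)

var (
	errNotFound      = errors.New("not found")
	errAlreadyExists = errors.New("already exists")
)

// fakeClient is an in-memory stand-in for the cluster API.
type fakeClient struct {
	objects    map[string]string
	creates    int
	staleReads bool // simulates a Get that misses an object created concurrently
}

func (f *fakeClient) Get(name string) (string, error) {
	spec, ok := f.objects[name]
	if !ok || f.staleReads {
		return "", errNotFound
	}
	return spec, nil
}

func (f *fakeClient) Create(name, spec string) error {
	if _, ok := f.objects[name]; ok {
		return errAlreadyExists
	}
	f.objects[name] = spec
	f.creates++
	return nil
}

// ensureSandbox is Get-then-Create with AlreadyExists swallowed, so a
// redelivered event or a duplicate submit ends up as a no-op.
func ensureSandbox(c *fakeClient, name, spec string) error {
	_, err := c.Get(name)
	if err == nil {
		return nil // already present: leave spec changes to the controller
	}
	if !errors.Is(err, errNotFound) {
		return err
	}
	if err := c.Create(name, spec); err != nil && !errors.Is(err, errAlreadyExists) {
		return err
	}
	return nil
}

func main() {
	c := &fakeClient{objects: map[string]string{}}
	fmt.Println(ensureSandbox(c, "acme", "agents=aider"), c.creates) // first delivery
	fmt.Println(ensureSandbox(c, "acme", "agents=aider"), c.creates) // redelivery
}
```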

Wiring in main.go is best-effort — when no in-cluster config nor
KUBECONFIG is available (CI / dev loops) the orchestrator is skipped
with a Warn; the rest of the tenant service still boots.

Hard rules: no chart bump, no cluster writes outside of the Sandbox
Create call (sandbox-controller reconciles the rest), `go build ./...`
clean, `go test ./...` clean.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 11:09:22 +04:00
e3mrah
72f82ea7f2
fix(sme): wire provisioning/notification/domain consumers to NATS (was Kafka-only, was silent-dropping every tenant.created event) (#1636)
PR #1626 wired the PUBLISH leg of tenant + billing to NATS via
events.MultiPublisher (canonical subject `catalyst.<event.Type>` per
ADR-0001 §6). The CONSUME leg stayed Kafka-only — provisioning,
notification, domain, billing's tenant-events cascade, AND tenant's own
provision-events + members-cleanup consumers all called
events.NewConsumer(redpandaBrokers, …). On Sovereigns REDPANDA_BROKERS
is empty by design (no Redpanda exists; NATS is the canonical bus per
the convergence-fix block in configmap.yaml) so those consumers either
never started OR dialed `localhost:9092` in a hot crash loop.

Net effect on every Sovereign install pre-this-PR:
  1. alice POSTs /sme/tenants → tenant publishes catalyst.tenant.created
     to NATS (PR #1626).
  2. provisioning's only subscriber was Kafka-only → silent drop.
  3. No Organization CR ever spawned → no vCluster → CONVERGENCE BROKEN.

This change introduces a symmetric subscribe-side abstraction mirroring
bridge.go's MultiPublisher:

  - events.BrokerSubscriber: unified Subscribe(ctx, handler) interface,
    satisfied by *Consumer, *DLQSubscriber, *MultiSubscriber.
  - events.MultiSubscriber: fans in from NATS JetStream durable
    consumers (one per canonical subject) + an optional legacy Kafka
    Consumer. NewMultiSubscriber refuses to construct with both legs
    nil (the silent-no-op pattern this PR exists to prevent).
  - events.NATSConn.ensureSMEStream: idempotently creates the
    CATALYST_SME Stream filtering `catalyst.>` so the first consumer
    on a fresh Sovereign bootstraps lifecycle.
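
A simplified sketch of the fan-in shape follows. The names mirror the ones in this message, but the signatures, the `stubLeg` transport and the `collect` helper are assumptions; the real handler presumably receives a full event envelope, and the real legs run concurrently rather than in sequence.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// Handler is a simplified callback; only the event type is passed here.
type Handler func(ctx context.Context, eventType string) error

// BrokerSubscriber is the unified subscribe-side interface.
type BrokerSubscriber interface {
	Subscribe(ctx context.Context, h Handler) error
}

// stubLeg is a hypothetical transport that replays a fixed list of events.
type stubLeg struct{ events []string }

func (s stubLeg) Subscribe(ctx context.Context, h Handler) error {
	for _, e := range s.events {
		if err := h(ctx, e); err != nil {
			return err
		}
	}
	return nil
}

// MultiSubscriber fans in from every configured transport.
type MultiSubscriber struct{ legs []BrokerSubscriber }

// NewMultiSubscriber refuses to build a subscriber with no transport at all,
// so a misconfigured service fails loudly instead of consuming nothing.
func NewMultiSubscriber(natsLeg, kafkaLeg BrokerSubscriber) (*MultiSubscriber, error) {
	var legs []BrokerSubscriber
	if natsLeg != nil {
		legs = append(legs, natsLeg)
	}
	if kafkaLeg != nil {
		legs = append(legs, kafkaLeg)
	}
	if len(legs) == 0 {
		return nil, errors.New("events: no transport configured")
	}
	return &MultiSubscriber{legs: legs}, nil
}

func (m *MultiSubscriber) Subscribe(ctx context.Context, h Handler) error {
	for _, leg := range m.legs {
		if err := leg.Subscribe(ctx, h); err != nil {
			return err
		}
	}
	return nil
}

// collect drains a subscriber into a slice; it exists only for the demo.
func collect(sub BrokerSubscriber) ([]string, error) {
	var got []string
	err := sub.Subscribe(context.Background(), func(_ context.Context, t string) error {
		got = append(got, t)
		return nil
	})
	return got, err
}

func main() {
	if _, err := NewMultiSubscriber(nil, nil); err != nil {
		fmt.Println("refused:", err)
	}
	sub, _ := NewMultiSubscriber(stubLeg{events: []string{"tenant.created"}}, nil)
	fmt.Println(collect(sub))
}
```

Because `*MultiSubscriber` satisfies `BrokerSubscriber` itself, consumer code written against the interface does not need to know how many transports sit behind it.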

Each service's main.go now constructs a MultiSubscriber and passes it
to the consumer dispatch loop. Consumer signatures take
events.BrokerSubscriber instead of *events.Consumer (interface upcast,
so *events.Consumer call sites keep working on Catalyst-Zero):

  - provisioning: tenant.created / tenant.deleted /
    tenant.app_install_requested / tenant.app_uninstall_requested /
    order.placed (the 5 subjects PR #1626 publishes to NATS).
    Also wires MultiPublisher so provision.* publishes hit NATS too —
    downstream tenant + notification consumers need them.
  - notification: full fan-in (user.login, order.placed,
    payment.received, provision.*, domain.*, member.invited).
  - domain: tenant.deleted (subdomain + BYOD reclamation cascade).
  - billing: tenant.deleted (Stripe sub-cancel + invoice void + ledger
    marker cascade). Existing metering NATS subscriber unaffected.
  - tenant: provision.* + tenant.deleted (members cleanup).
    Now reachable on Sovereigns; pre-this-PR they were inside the
    `if redpandaBrokersRaw != ""` block.

Chart wiring: NATS_URL env added to provisioning, notification, and
domain Deployments (tenant + billing already wired via PR #1626).
notification.yaml also flips its hardcoded REDPANDA_BROKERS literal to
the shared ConfigMap key so the per-topology default (empty on
Sovereigns, talentmesh redpanda on Catalyst-Zero) applies.

Verification:
  - go build ./core/services/{shared,tenant,billing,provisioning,
    notification,domain}/... clean.
  - go test ./... clean across all 6 modules.
  - helm template with global.sovereignFQDN=test.example.com renders
    NATS_URL="nats://nats-jetstream.nats-system.svc.cluster.local:4222"
    into all 5 Deployments + ConfigMap.
  - helm template without sovereignFQDN renders NATS_URL="" and
    REDPANDA_BROKERS=talentmesh redpanda, matching Catalyst-Zero.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 10:32:49 +04:00
Emrah Baysal
b8b80973de
feat(sandbox): Wave 4 — marketplace catalog entry (customer can pick Sandbox alongside WordPress)
Adds the Sandbox product to the marketplace storefront so a customer
picks it off marketplace.<sov>/apps the same way they pick WordPress /
Nextcloud. Card chrome is the existing .app-card shape verbatim — no
new components per the design-system inheritance rule. The detail page
gains a 6-agent picker (aider, claude-code, cursor-agent, little-coder,
opencode, qwen-code) using the existing .related-card chrome with a
picked state mirroring .app-card.in-cart. Picks land on cart.agents
and travel through checkout into the tenant create-org payload.

Tenant-service emits a sibling `tenant.sandbox_requested` event on
sme.tenant.events when the cart contains the sandbox product. The
event carries org slug + owner + agents list, sufficient for the
sandbox-controller (or its upstream orchestrator) to mint a Sandbox
CR with matching spec.agentCatalogue. The Organization CR creation
path is unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 08:22:37 +02:00
e3mrah
d681f64505
fix(catalyst-api): mint HS256 token on SME proxy calls (was forwarding incompatible RS256) (#1630)
PR #1625 shipped the /api/v1/sme/billing/vouchers/* proxies but the
SME gateway (core/services/gateway/proxy.go) rejects RS256 outright
— it only accepts HS256 signed with sme-secrets/JWT_SECRET. Result
on every fresh Sovereign: operator clicks on /bss/vouchers returned
silent 401 with no upstream audit trail.

This commit ships the bridge:

- core/services/shared/auth/mint_sme.go (new)
  - MintSMEAccessToken(secret, sub, email, role) → 5-min HS256 JWT
    in the wire shape billing's requireVoucherIssuer expects.
  - SMERoleFor(realmRoles, tier) → maps Keycloak roles + tier claim
    onto SME vocab (superadmin | sovereign-admin | member).
  - Pure, no IO, fully unit-tested (mint_sme_test.go).

- products/catalyst/bootstrap/api/internal/handler/sme_billing_vouchers.go
  - proxySMEVoucher now mints a fresh HS256 token per upstream hop
    from the operator's already-validated RS256 session claims and
    forwards that as Bearer to the SME gateway. RS256 header is no
    longer leaked upstream.
  - Unwired bridge (CATALYST_SME_JWT_SECRET empty) surfaces 503
    `sme-jwt-bridge-unwired` instead of the silent 401.

- products/catalyst/bootstrap/api/internal/handler/handler.go
  - h.smeJWTSecret field + SetSMEJWTSecret(secret) setter.

- products/catalyst/bootstrap/api/cmd/api/main.go
  - Reads CATALYST_SME_JWT_SECRET on startup and wires it.
  - Log line includes byte count only (never the secret value, per
    INVIOLABLE-PRINCIPLES.md #10).

- products/catalyst/chart/templates/api-deployment.yaml
  - New env CATALYST_SME_JWT_SECRET sourced from sme-secrets/JWT_SECRET
    in the same namespace (catalyst-system). optional: true so
    Sovereigns without marketplace surface a 503 rather than
    CreateContainerConfigError.

- products/catalyst/chart/templates/sme-services/sme-secrets.yaml
  - emberstack/reflector annotation block mirroring sme-secrets
    from `sme` ns into `catalyst-system` (Kubernetes secretKeyRef
    is same-namespace-only). Same pattern as cnpg-cluster.yaml
    and provisioning-github-token.yaml.

Operator-visible behaviour: the bridge is transparent on the happy
path (operator with sovereign-admin tier on a Sovereign with
marketplace enabled clicks /bss/vouchers → list returns). On the
unhappy paths the operator now sees a real status code:
  - 503 sme-jwt-bridge-unwired (chart wire missing) — actionable
  - 503 sme-gateway-unreachable (DNS NXDOMAIN) — pre-existing
  - 403 from billing's requireVoucherIssuer (role insufficient)
    — was silent 401 before, now propagates the real authz result.
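
For readers unfamiliar with the token format, the sketch below mints a 5-minute HS256 JWT using only the standard library. It is not mint_sme.go: the claim set (`sub`, `email`, `role`, `iat`, `exp`) and the function names are assumptions, and the actual wire shape billing expects may differ.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/base64"
	"encoding/json"
	"errors"
	"fmt"
	"strings"
	"time"
)

// smeClaims is a hypothetical claim set for illustration.
type smeClaims struct {
	Sub   string `json:"sub"`
	Email string `json:"email"`
	Role  string `json:"role"`
	Iat   int64  `json:"iat"`
	Exp   int64  `json:"exp"`
}

// sign returns the base64url HMAC-SHA256 of the signing input.
func sign(secret []byte, signingInput string) string {
	mac := hmac.New(sha256.New, secret)
	mac.Write([]byte(signingInput))
	return base64.RawURLEncoding.EncodeToString(mac.Sum(nil))
}

// mintSMEToken builds header.payload.signature with a 300-second lifetime.
// An empty secret is reported as an error rather than producing a weak token.
func mintSMEToken(secret []byte, sub, email, role string, nowUnix int64) (string, error) {
	if len(secret) == 0 {
		return "", errors.New("sme-jwt-bridge-unwired")
	}
	body, err := json.Marshal(smeClaims{Sub: sub, Email: email, Role: role, Iat: nowUnix, Exp: nowUnix + 300})
	if err != nil {
		return "", err
	}
	enc := base64.RawURLEncoding
	signingInput := enc.EncodeToString([]byte(`{"alg":"HS256","typ":"JWT"}`)) + "." + enc.EncodeToString(body)
	return signingInput + "." + sign(secret, signingInput), nil
}

// verifySignature recomputes the MAC; it does not check exp or any claim.
func verifySignature(secret []byte, token string) bool {
	i := strings.LastIndex(token, ".")
	if i < 0 {
		return false
	}
	return hmac.Equal([]byte(sign(secret, token[:i])), []byte(token[i+1:]))
}

func main() {
	secret := []byte("dev-only-secret") // placeholder, never a real secret
	tok, err := mintSMEToken(secret, "op-1", "op@example.com", "sovereign-admin", time.Now().Unix())
	if err != nil {
		fmt.Println("mint failed:", err)
		return
	}
	fmt.Println("segments:", strings.Count(tok, ".")+1, "valid:", verifySignature(secret, tok))
}
```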

Tests: core/services/shared/auth `go test ./...` PASS. catalyst-api
`go build ./...` PASS.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 10:11:04 +04:00
e3mrah
50a45a9783
fix(billing): skip Stripe when voucher covers 100% of total (unblocks fully-paid voucher checkout) (#1628)
POST /billing/checkout was 503'ing with "payment processor is not
configured" on Sovereigns that have not pasted Stripe keys yet — even
when the customer's credit balance (from a fresh voucher redemption
in the same request, or a prior balance) fully covered the order
total. Make the credit-only short-circuit explicit: compute
`remainingOMR := totalOMR - creditBalance` and settle via
CreditOnlyCheckout when `<= 0`, BEFORE any Stripe settings probe.
This is the path that has to keep working during the voucher-only
weeks of a new Sovereign.
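
The ordering matters more than the arithmetic, so here is a small sketch of it. Amounts are whole OMR for simplicity, and the `checkout` function, `result` type and `stripeConfigured` probe are hypothetical names; only the `remainingOMR <= 0` short-circuit before the processor probe is taken from this message.

```go
package main

import (
	"errors"
	"fmt"
)

var errNoProcessor = errors.New("payment processor is not configured")

// result is a hypothetical outcome shape for this sketch.
type result struct {
	PaidByCredit   bool
	ChargeOMR      int64 // amount left for the payment processor
	LeftoverCredit int64 // credit remaining after a credit-only settle
}

// checkout settles from credit when it covers the total and only then,
// if something remains, asks whether a payment processor is configured.
func checkout(totalOMR, creditBalance int64, stripeConfigured func() bool) (result, error) {
	remainingOMR := totalOMR - creditBalance
	if remainingOMR <= 0 {
		return result{PaidByCredit: true, LeftoverCredit: -remainingOMR}, nil
	}
	if !stripeConfigured() {
		return result{}, errNoProcessor
	}
	return result{ChargeOMR: remainingOMR}, nil
}

func main() {
	noStripe := func() bool { return false }
	fmt.Println(checkout(50, 50, noStripe))   // voucher covers the plan
	fmt.Println(checkout(100, 200, noStripe)) // standing balance covers the plan
	fmt.Println(checkout(100, 40, noStripe))  // remainder needs a processor
}
```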

Adds checkout_test.go covering two regression paths:

  - fresh-voucher path: customer with 0 credit redeems WELCOME50
    against a 50-OMR plan → 200 + paid_by_credit:true, settings table
    never probed (sqlmock asserts no unexpected queries).
  - pre-existing-credit path: customer with 200-OMR standing balance
    buys a 100-OMR plan, no promo_code in request → 200 +
    paid_by_credit:true + 100-OMR leftover credit.

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 09:44:22 +04:00
e3mrah
048cb2c3de
fix(sme): wire tenant + billing event dispatchers to NATS (was Redpanda-only, blocking convergence) (#1626)
The tenant + billing services hardcoded a franz-go Kafka publisher
pointing at REDPANDA_BROKERS. On Sovereigns there is NO Redpanda in
cluster — only NATS JetStream at
nats-jetstream.nats-system.svc.cluster.local:4222 — so every
tenant.created / tenant.deleted / order.placed event was silently
dropped, blocking provisioning + downstream consumers and stalling
the convergence chain end to end.

Per ADR-0001 §6 the canonical event bus is NATS JetStream with
subject convention `catalyst.<domain>.<event>`. This change:

  - Adds events.BrokerPublisher + events.MultiPublisher that fan out
    to NATS (`catalyst.<event.Type>` derived from Event.Type) and the
    legacy Redpanda topic in one call. Either transport may be nil;
    the constructor refuses to build a no-op publisher (the exact
    silent-failure mode we just hit).
  - Adds NATSConn.PublishEvent so the generic Event envelope can flow
    over the same JetStream connection used for the metering
    subscriber (#798), with Event.ID as the JetStream Msg-Id for
    broker-side de-dup.
  - Updates tenant + billing main.go to read NATS_URL +
    REDPANDA_BROKERS independently, construct the appropriate
    transports, and wire MultiPublisher into the Handler. Legacy
    Kafka consumers only start when REDPANDA_BROKERS is non-empty
    so the pods no longer crashloop dialling localhost:9092 on
    Sovereigns.
  - Updates chart templates to inject NATS_URL into both tenant and
    billing Deployments. ConfigMap default for NATS_URL on Sovereigns
    is nats://nats-jetstream.nats-system.svc.cluster.local:4222
    (fixes the existing bug where defaults pointed at the wrong
    namespace `nats-jetstream` — NATS actually lives in `nats-system`
    per clusters/_template/bootstrap-kit/07-nats-jetstream.yaml).
  - Sovereign default of REDPANDA_BROKERS is now empty (was the wrong
    NATS URL stuffed into a Kafka env, which made franz-go fail every
    dial).

Subject mapping per CanonicalSubject:
  tenant.created               → catalyst.tenant.created
  tenant.deleted               → catalyst.tenant.deleted
  tenant.app_install_requested → catalyst.tenant.app_install_requested
  order.placed                 → catalyst.billing.order.placed

Test:
  go build ./... in shared/, tenant/, billing/ (clean)
  go test ./events/... ./handlers/... in all three (existing + new
    bridge_test.go pass)
  helm template with global.sovereignFQDN set renders NATS_URL in
    both Deployments + REDPANDA_BROKERS="" in ConfigMap
  helm template without global.sovereignFQDN renders the legacy
    Redpanda broker (Catalyst-Zero contabo path remains intact)

NATS-side consumers for sme.tenant.events / sme.provision.events ship
in a follow-up PR per the ADR-0001 §6 migration plan; this PR only
unblocks the publish leg which is the immediate convergence blocker.

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 09:33:36 +04:00
e3mrah
255eb3bf17
feat(sandbox+auth+newapi): Wave 1b — newapi proxy + BYOS + org-scoped JWT (#1619)
Three coordinated deliverables for Sandbox Wave 1b — scaffolding +
design + the ONE prerequisite (long-lived org-scoped JWT) the rest of
Sandbox depends on.

Deliverable 1 — newapi proxy contract:
  - products/sandbox/docs/newapi-proxy-contract.md: agent-pod env
    (LLM_GATEWAY_URL / OPENAI_BASE_URL alias), provider selection
    (?provider=qwen; default Qwen via omtd.bankdhofar.com), per-Sandbox
    token issuance via /admin/tokens/sandbox bridge, lifecycle +
    rotation, auth model.
  - platform/newapi/internal/handler/sandbox_token.go: bridge handler
    stub. Validates the inbound PAT (typ=pat + aud=newapi + org_id
    cross-check vs request body), then echoes a NewAPI-shaped response
    so the contract is testable without the upstream NewAPI admin
    API. Wave 4 wires the actual upstream calls.

Deliverable 2 — Claude Code BYOS OAuth:
  - products/sandbox/docs/claude-code-byos.md: UX (Connect Claude Max →
    OAuth → refresh token Secret/catalyst-system/sandbox-byos-claude-
    code-<user-uid>), Pod env injection (ANTHROPIC_API_KEY bypassing
    newapi), per-session toggle, revocation paths, chart wiring.
  - products/catalyst/bootstrap/api/internal/handler/byos_claude_code.go:
    POST /start, GET /callback, DELETE, GET /status — four endpoints
    behind RequireSession. Honest 503 + 501 surface so the popup
    flow exercises end-to-end against the placeholder client_id;
    Wave 4 flips it live.

Deliverable 3 — Long-lived org-scoped JWT (THE prerequisite):
  - platform/keycloak/chart/templates/configmap-sovereign-realm.yaml +
    configmap-tenant-realm.yaml: add `org` protocolMapper emitting
    user attribute `org` as claim `org_id`; add `org` to default
    client scopes for ALL clients.
  - core/services/auth/handlers/handlers.go: include typ=session in
    JWTs + document the cross-service claim contract.
  - core/services/auth/handlers/pat.go: NEW POST /auth/pat with
    admin-configurable TTL (default 7d, max 90d), audience claim,
    capabilities pass-through, typ=pat discriminator.
  - core/services/auth/handlers/routes.go + main.go: wire /auth/pat
    behind JWTAuth middleware.
  - core/services/shared/auth/claims.go: single Claims struct +
    HasCapability/HasGroup helpers + ContextKey for cross-service
    consumers (sandbox-controller, newapi bridge, MCP server).
  - products/catalyst/bootstrap/api/internal/auth/session.go: align
    Org JSON tag with new `org_id` claim; UnmarshalJSON accepts BOTH
    legacy `org` and new `org_id` so a rolling chart upgrade does
    not regress org-scoped queries.
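
The dual-key decoding could be sketched like this. The struct is a cut-down stand-in for the session claims, `parseClaims` is a hypothetical helper, and the choice that `org_id` wins when both keys are present is an assumption, since the message does not state a precedence.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Claims is a hypothetical, trimmed-down session claim set.
type Claims struct {
	Sub string `json:"sub"`
	Org string `json:"org_id"`
}

// UnmarshalJSON accepts both the new "org_id" key and the legacy "org" key.
// When both are present, "org_id" wins (an assumption of this sketch).
func (c *Claims) UnmarshalJSON(data []byte) error {
	var raw struct {
		Sub       string `json:"sub"`
		OrgID     string `json:"org_id"`
		LegacyOrg string `json:"org"`
	}
	if err := json.Unmarshal(data, &raw); err != nil {
		return err
	}
	c.Sub = raw.Sub
	c.Org = raw.OrgID
	if c.Org == "" {
		c.Org = raw.LegacyOrg
	}
	return nil
}

// parseClaims is a small convenience wrapper used by the demo.
func parseClaims(payload string) (Claims, error) {
	var c Claims
	err := json.Unmarshal([]byte(payload), &c)
	return c, err
}

func main() {
	for _, p := range []string{
		`{"sub":"u1","org_id":"acme"}`,             // token minted after the upgrade
		`{"sub":"u1","org":"acme"}`,                // token minted before the upgrade
		`{"sub":"u1","org":"old","org_id":"acme"}`, // both keys present
	} {
		c, err := parseClaims(p)
		fmt.Println(c.Org, err)
	}
}
```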

Out of scope (Wave 4 wires):
  - Sandbox CRD + controller (writes Secret, mounts Pod env).
  - Actual outbound HTTP to Anthropic /oauth/token + KMS encrypt.
  - Actual outbound HTTP to NewAPI admin API.
  - Per-Sandbox capability projection from Keycloak groups.
  - PAT revocation lookup (jti store) + /auth/pats list.
  - Settings UI card + session-toolbar routing toggle.

Build verification (go vet + go build clean):
  - core/services/auth/...
  - core/services/shared/...
  - platform/newapi/internal/handler/...
  - products/catalyst/bootstrap/api/...

Founder TODO (single knob to flip BYOS live, Wave 4):
  Register an Anthropic OAuth client at
  https://console.anthropic.com/settings/oauth (public PKCE,
  redirect=https://console.<sov-fqdn>/api/v1/sandbox/byos/claude-code/callback)
  and paste the client_id into clusters/<sovereign>/bootstrap-kit/
  sandbox.yaml. Today every BYOS endpoint returns 503 with a clear
  message pointing at claude-code-byos.md §8.

Refs: products/sandbox/docs/architecture.md §6 (THE prerequisite).

Co-authored-by: hatiyildiz <269457768+hatiyildiz@users.noreply.github.com>
2026-05-18 08:43:11 +04:00
e3mrah
964dc15570
fix(catalog): D27 — fresh-seed apps default Published+Deployable (#1584)
* fix(handover): rename itoa→regionSlotIndex (collision with infrastructure.go)

PR #1581 introduced an `itoa` helper that collided with the existing
`itoa` in handler/infrastructure.go:1952. Go vet failed:

  internal/handler/infrastructure.go:1952:6: itoa redeclared in this block
  internal/handler/deployment_handover_export.go:199:6: other declaration of itoa

Rename my helper to `regionSlotIndex` — more descriptive of its actual
use (deriving the per-region slot suffix for the kubeconfig filename).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(catalyst-api): D16/D17 — 3 bugs caught on t138

Founder caught on t136 (now wiped) that /dashboard cluster grouping
still showed 1 region and /cloud nodes showed 1 node despite earlier
D16 PRs shipping. Root cause: 3 bugs in the D16 chain that surfaced
on t138 fresh prov.

1. exportSecondaryKubeconfigsToChild was guarded behind the early
   return of exportDeploymentToChild's failed POST. The child's
   ingress + cert + gateway are still racing to reach reachable
   state in the seconds after handover fires, so the first POST
   gets EOF and the goroutine never fires. Fix: kick off the
   D16 fan-out IMMEDIATELY at the top of exportDeploymentToChild
   in its own goroutine, BEFORE the deployment-record POST.

2. Both exports now retry with exponential backoff (5s → 60s) for
   up to 5 min total. Most handovers will succeed on attempt 2-4.
   Was: no retry, single shot, silent failure.

3. /api/v1/sovereign/secondary-kubeconfig route moved OUT of the
   auth group (rg) into the top-level router (r), alongside
   /api/v1/internal/deployments/import. The previous registration
   required an operator session that doesn't exist at handover —
   mothership POSTs were 401'd silently. Validation is now via
   safeIDPattern regex on depID + regionKey (same security model
   as the deployments/import companion endpoint).

4. HandleSovereignCloud now fans out across h.k8sCache.Clusters()
   instead of using only the in-cluster client. Adds Cluster
   field (omitempty) to sovereignNode/LB/SC/PVC so the UI can
   group/filter by region. Without this, /cloud?view=list&kind=nodes
   shows 1 node even when 3 secondary kubeconfigs are registered.
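
As a rough illustration of the retry window in item 2, the sketch below computes a delay schedule that starts at 5s, caps at 60s and stops once the next wait would exceed a 5-minute budget. Doubling between attempts is an assumption (the message says only "exponential"), and `backoffSeconds` is a hypothetical helper, not the code in the handover exporter.

```go
package main

import "fmt"

// backoffSeconds returns the waits between attempts, in seconds.
// Each wait doubles up to maxDelay; the schedule ends when adding the
// next wait would push the cumulative wait past the budget.
func backoffSeconds(initial, maxDelay, budget int) []int {
	if initial <= 0 {
		return nil // guard against a zero delay looping forever
	}
	var delays []int
	elapsed, delay := 0, initial
	for elapsed+delay <= budget {
		delays = append(delays, delay)
		elapsed += delay
		delay *= 2
		if delay > maxDelay {
			delay = maxDelay
		}
	}
	return delays
}

func main() {
	// 5s initial, 60s cap, 300s total budget.
	fmt.Println(backoffSeconds(5, 60, 300)) // [5 10 20 40 60 60 60]
}
```

Under these assumptions the second to fourth attempts land 5s, 15s and 35s after the first failure, early enough to catch an ingress or certificate that becomes reachable shortly after handover.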

Together these fix:
- D16 /dashboard Layer-1=Cluster grouping (3 bubbles, not 1)
- /cloud?view=list&kind=nodes (3+ nodes, not 1)

Refs: feedback_test_theater_3rd_violation_2026_05_17.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(catalog): D27 — fresh-seed apps default Published+Deployable

Founder caught on t136: marketplace.t136/apps shows blank application
grid. Root cause: catalog seed.go calls migrateAppPublished +
migrateAppDeployable ONLY on the "already populated" path. On a fresh
Sovereign install (empty catalog) seedAllData inserts 27 rows with
zero-value bools — Published=false, Deployable=false. The marketplace
storefront filters with `?published=true`, gets [], renders blank.

Fix: after seedAllData also call migrateAppDeployable + migrateAppPublished
+ seedSystemApps. Both migrations are idempotent (skip rows already
true), so re-runs are safe.
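
Why re-running is safe can be sketched as below. This is an in-memory illustration only; the app struct and the markPublished / markDeployable names are hypothetical stand-ins, not the store-backed migrateAppPublished / migrateAppDeployable code:

```go
package main

// app is a hypothetical stand-in for a catalog row.
type app struct {
	Slug       string
	Published  bool
	Deployable bool
}

// markPublished flips Published on rows that are still false and returns
// how many rows changed. Rows already true are skipped.
func markPublished(apps []app) int {
	changed := 0
	for i := range apps {
		if !apps[i].Published {
			apps[i].Published = true
			changed++
		}
	}
	return changed
}

// markDeployable flips Deployable for slugs present in the deployable map,
// again skipping rows that are already true.
func markDeployable(apps []app, deployable map[string]bool) int {
	changed := 0
	for i := range apps {
		if deployable[apps[i].Slug] && !apps[i].Deployable {
			apps[i].Deployable = true
			changed++
		}
	}
	return changed
}
```

A second pass over the same rows changes nothing, which is the property the fix relies on when it runs the migrations right after seedAllData.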

Verified the bug live on t138 (eaaee1ea24184c2a):
  http://catalog.sme:8082/catalog/apps returns 27 apps
  http://catalog.sme:8082/catalog/apps?published=true returns 0

With this fix the latter returns 27.

Refs: feedback_test_theater_3rd_violation_2026_05_17.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-17 09:28:35 +04:00
e3mrah
c04b2ec76d
feat(wordpress-tenant): activeHotStandby option wires bp-cnpg-pair (D31) (#1562)
Sovereign DoD D31 — tenants subscribing to an HA-capable marketplace app
may opt into a cross-region active-hot-standby Postgres pair for their
WordPress instance instead of the default single CNPG Cluster.

Mirrors the canonical bp-cnpg-pair pattern (primary + replica Cluster
CRs with WAL streaming over Cilium ClusterMesh via a managed Service
annotated service.cilium.io/global=true). When the new
pg.activeHotStandby.enabled flag is false (default), templates render
the existing single Cluster bit-for-bit — no regression for non-HA
tenants.

Catalog seed flags WordPress with ha + cnpg-pair tags so the marketplace
HA filter can surface it.

Chart bumped 0.2.1 -> 0.3.0. New render-gate test asserts both default
single-cluster shape AND the enabled 2-Cluster shape with the right
nodeSelectors, replica.source, externalCluster.host, Cilium global
annotation, and bootstrap.pg_basebackup; all 5 cases pass.

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 23:39:29 +04:00
e3mrah
f9ed292198
fix(billing): /redeem-preview + plans + addons bypass JWT (D29) (#1561)
* chore(slot-13): pin bp-catalyst-platform to 1.4.145 (D29 gateway public routes)

PR #1559 added /api/billing/{vouchers/redeem-preview,plans,addons} as
public gateway routes — required for the marketplace /redeem zero-touch
flow. Pin the slot so future provisions inherit it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(billing): /redeem-preview + plans + addons bypass JWT (D29)

Mirror PR #1559's gateway public routes in the billing service's own
middleware chain. The gateway now lets these requests through without
an Authorization header (D29 voucher-redeem landing), but billing
service's main.go was JWT-gating EVERY /billing/* path except
/billing/webhook — so the request still got 401, just one hop later.

Caught live on t132 2026-05-16 after PR #1559 rolled.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 23:28:48 +04:00
e3mrah
a11067da1a
fix(gateway): /redeem-preview + plans + addons must be public (D29) (#1559)
* feat(billing+notification): wire voucher-issued email (D28)

D28 of the Sovereign DoD requires that issuing a voucher emails it to
the recipient zero-touch. Today POST /billing/vouchers/issue persists
the PromoCode row but never notifies anyone — so a gifted voucher only
reaches its recipient if the operator manually sends the code over a
side channel. This wires sme-billing -> sme-notification so the email
fires automatically on every successful upsert that carries a
recipient_email field.

Architecture follows the existing notification-service seam:
sme-billing POSTs to http://notification.sme.svc.cluster.local:8087/
notification/send with template=voucher-issued; sme-notification renders
the HTML and dispatches via Stalwart over SMTP. No direct SMTP code is
added to billing, no stalwart-mail calls bypass notification.

Server-side only — the owner-UI for issuing vouchers (D28b) is a
separate PR.

Changes:

  notification/templates/templates.go
    + VoucherIssuedEmail(code, creditOMR, description, sovereignFQDN,
      validityHint) — renders code prominently, redeem button to
      https://marketplace.<sovereignFQDN>/redeem/?code=<CODE>; FQDN
      always supplied by caller, NEVER hardcoded.

  notification/handlers/handlers.go
    + renderTemplate("voucher-issued") case parsing
      {code, credit_omr, description, sovereign_fqdn, validity_hint}.
    + Default subject "You've been gifted a voucher for OpenOva SME".

  billing/handlers/handlers.go
    + Handler fields: NotificationURL, SovereignFQDN, NotificationClient.

  billing/handlers/vouchers.go
    + issueVoucherRequest = store.PromoCode + RecipientEmail (request-
      only; never persisted).
    + sendVoucherIssuedEmail() — POSTs to NotificationURL with a 5s
      timeout. Best-effort: a non-2xx or transport error logs but does
      NOT fail the IssueVoucher response, because the row is already
      persisted and re-issuing the same code re-fires the email.
    + Re-issue semantics (#91 resurrects soft-deleted rows) extend to
      the email path — documented in the handler comment.

  billing/main.go
    + Reads NOTIFICATION_SERVICE_URL (default
      http://notification.sme.svc.cluster.local:8087/notification/send)
      and SOVEREIGN_FQDN env vars. Wires a 5s default http.Client.

  products/catalyst/chart/templates/sme-services/billing.yaml
    + Pipes NOTIFICATION_SERVICE_URL (cluster-DNS constant) and
      SOVEREIGN_FQDN (from .Values.global.sovereignFQDN, NEVER
      hardcoded) into the billing Deployment.

Tests:

  notification/handlers/handlers_test.go (new)
    + TestRenderTemplate_VoucherIssued: rendered HTML contains code +
      credit + a redeem URL built from the supplied FQDN; never falls
      back to marketplace.openova.io.
    + TestRenderTemplate_VoucherIssued_CustomSubject + _NoDescription
      + TestRenderTemplate_UnknownTemplate as guard rails.

  billing/handlers/vouchers_test.go
    + TestIssueVoucher_SendsEmail_WhenRecipientPresent: a fake round-
      tripper sees the POST to notification with the right URL +
      template + data (code upper-cased, credit_omr, sovereign_fqdn,
      description) when recipient_email is set.
    + TestIssueVoucher_NoEmail_WhenRecipientAbsent: no notification
      call when recipient is empty.
    + TestIssueVoucher_NotificationFailure_DoesNotFailUpsert:
      operator gets 200 even when notification returns 500.
    + TestIssueVoucher_403WithoutVoucherIssuerRole: role gate preserved.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(chart): admin pod uses dedicated image tag (D27 SME stack)

t132 caught admin pod stuck in ImagePullBackOff on `admin:b0ed216` —
the SME services CI run for that mono-repo SHA published 10 services
but admin's image was missing from GHCR. Decouple admin's tag from
smeTag so a missing-build for one service doesn't wedge the SME stack.

Default to `3c2f7e4` (matches marketplaceApi + console, known-published).
When admin's UI changes, bump in lockstep with those.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(slot-13): pin bp-catalyst-platform to 1.4.144

PR #1556 (D28 voucher email wire) + PR #1557 (D27 admin tag override)
landed and Blueprint Release packaged 1.4.144. Pin the slot file so
future provisions get the latest chart by default — t132 manually
upgraded via kubectl patch but t133+ will inherit it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(gateway): /redeem-preview + plans + addons must be public (D29)

The marketplace /redeem?code=XXX landing page calls
/api/billing/vouchers/redeem-preview unauthenticated per docs/FRANCHISE-
MODEL.md §3, but the gateway's catch-all /api/billing/ entry was
returning 401 to it — breaking the entire voucher-redeem zero-touch
flow that D29 depends on.

Also expose /api/billing/plans and /api/billing/addons so the
marketplace landing can render pricing without a session.

Caught live on t132 2026-05-16 — every /redeem call returned 401.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 23:17:04 +04:00
e3mrah
1fe706769f
feat(billing+notification): wire voucher-issued email (D28) (#1556)
D28 of the Sovereign DoD requires that issuing a voucher emails it to
the recipient zero-touch. Today POST /billing/vouchers/issue persists
the PromoCode row but never notifies anyone — so a gifted voucher only
reaches its recipient if the operator manually sends the code over a
side channel. This wires sme-billing -> sme-notification so the email
fires automatically on every successful upsert that carries a
recipient_email field.

Architecture follows the existing notification-service seam:
sme-billing POSTs to http://notification.sme.svc.cluster.local:8087/
notification/send with template=voucher-issued; sme-notification renders
the HTML and dispatches via Stalwart over SMTP. No direct SMTP code is
added to billing, no stalwart-mail calls bypass notification.

Server-side only — the owner-UI for issuing vouchers (D28b) is a
separate PR.

Changes:

  notification/templates/templates.go
    + VoucherIssuedEmail(code, creditOMR, description, sovereignFQDN,
      validityHint) — renders code prominently, redeem button to
      https://marketplace.<sovereignFQDN>/redeem/?code=<CODE>; FQDN
      always supplied by caller, NEVER hardcoded.

  notification/handlers/handlers.go
    + renderTemplate("voucher-issued") case parsing
      {code, credit_omr, description, sovereign_fqdn, validity_hint}.
    + Default subject "You've been gifted a voucher for OpenOva SME".

  billing/handlers/handlers.go
    + Handler fields: NotificationURL, SovereignFQDN, NotificationClient.

  billing/handlers/vouchers.go
    + issueVoucherRequest = store.PromoCode + RecipientEmail (request-
      only; never persisted).
    + sendVoucherIssuedEmail() — POSTs to NotificationURL with a 5s
      timeout. Best-effort: a non-2xx or transport error logs but does
      NOT fail the IssueVoucher response, because the row is already
      persisted and re-issuing the same code re-fires the email.
    + Re-issue semantics (#91 resurrects soft-deleted rows) extend to
      the email path — documented in the handler comment.

  billing/main.go
    + Reads NOTIFICATION_SERVICE_URL (default
      http://notification.sme.svc.cluster.local:8087/notification/send)
      and SOVEREIGN_FQDN env vars. Wires a 5s default http.Client.

  products/catalyst/chart/templates/sme-services/billing.yaml
    + Pipes NOTIFICATION_SERVICE_URL (cluster-DNS constant) and
      SOVEREIGN_FQDN (from .Values.global.sovereignFQDN, NEVER
      hardcoded) into the billing Deployment.

Tests:

  notification/handlers/handlers_test.go (new)
    + TestRenderTemplate_VoucherIssued: rendered HTML contains code +
      credit + a redeem URL built from the supplied FQDN; never falls
      back to marketplace.openova.io.
    + TestRenderTemplate_VoucherIssued_CustomSubject + _NoDescription
      + TestRenderTemplate_UnknownTemplate as guard rails.

  billing/handlers/vouchers_test.go
    + TestIssueVoucher_SendsEmail_WhenRecipientPresent: a fake round-
      tripper sees the POST to notification with the right URL +
      template + data (code upper-cased, credit_omr, sovereign_fqdn,
      description) when recipient_email is set.
    + TestIssueVoucher_NoEmail_WhenRecipientAbsent: no notification
      call when recipient is empty.
    + TestIssueVoucher_NotificationFailure_DoesNotFailUpsert:
      operator gets 200 even when notification returns 500.
    + TestIssueVoucher_403WithoutVoucherIssuerRole: role gate preserved.
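
A sketch of how the redeem link could be built from a caller-supplied FQDN. The function name redeemURL is hypothetical, and upper-casing the code inside the URL builder is an assumption based on the "code upper-cased" test note above:

```go
package main

import (
	"net/url"
	"strings"
)

// redeemURL (hypothetical helper) builds the marketplace redeem link.
// The FQDN always comes from the caller; nothing is hardcoded here.
func redeemURL(sovereignFQDN, code string) string {
	q := url.Values{"code": {strings.ToUpper(code)}}
	return "https://marketplace." + sovereignFQDN + "/redeem/?" + q.Encode()
}
```

Going through url.Values keeps the code query-escaped if it ever contains characters outside the unreserved set.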

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 23:04:46 +04:00
e3mrah
b0ed216e81
feat(catalog): catalog-svc HTTP REST service + chart wiring (slice L1+L2, #1097) (#1148)
EPIC-2 Slice L of #1097. Multi-source Blueprint catalog HTTP REST
service backed by Gitea (3 sources: public mirror, sovereign-curated,
per-Org private). Replaces the per-Org SME catalog per ADR-0001 §4.3
(different scope: SME's was Org-bound; catalyst-catalog is Sovereign-
wide multi-source).

L1 — core/services/catalyst-catalog/ Go service:

  - Separate go.mod (services group is for HTTP services, controllers
    group is for CRD reconcilers — documented in DESIGN.md).
  - Imports the unified Gitea client via Go module replace directive.
  - Promoted core/controllers/internal/gitea → pkg/gitea so the catalog
    (a sibling Go module) can import it (Go internal/ rule). 5 Group C
    controllers updated atomically.
  - HTTP REST endpoints: /api/v1/catalog{,/{name},/{name}/versions,
    /{name}/versions/{version}} + /healthz.
  - Source resolution priority on collision: private > sovereign > public.
  - Per-Org access filter: caller's Claims.Groups[] determines visible
    private blueprints; Org A user does NOT see Org B's private set.
  - 30s TTL LRU cache on blueprint.yaml reads (capacity 1024 default).
  - Session-cookie / Bearer / ?access_token= claim extraction matching
    catalyst-api's seam; expired-token rejection in-process.
  - Containerfile: distroless-static, non-root UID 65532.
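
The collision rule (private > sovereign > public) can be sketched like this; the types and the resolve function are illustrative assumptions, not the service's actual API:

```go
package main

// source is ordered so that a higher value wins on a name collision.
type source int

const (
	public source = iota
	sovereign
	private
)

// blueprint is a hypothetical, trimmed-down catalog entry.
type blueprint struct {
	Name   string
	Source source
}

// resolve keeps one entry per name, preferring the highest-priority source.
func resolve(candidates []blueprint) map[string]blueprint {
	out := make(map[string]blueprint)
	for _, b := range candidates {
		if cur, ok := out[b.Name]; !ok || b.Source > cur.Source {
			out[b.Name] = b
		}
	}
	return out
}
```

The per-Org access filter would run before this step, so a caller never sees another Org's private entry as a candidate.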

L2 — products/catalyst/chart/templates/services/catalog/ wiring:

  - 5 templates (deployment, service, serviceaccount, rbac, httproute)
    + _helpers.tpl. Default-OFF gate via .Values.services.catalog.enabled.
  - helm template: 0 catalog resources when OFF, 6 when ON.
  - Empty image.tag fail-fasts at render per Inviolable Principle #4a.
  - HTTPRoute exposes /api/v1/catalog on api.<sovereign> hostname.
  - Chart bumped 1.4.85 → 1.4.86.

Gitea client extension (canonical seam, NOT per-service variant):

  - +ListOrgRepos(ctx, org) []Repo — paginated repo listing.
  - +ListContents(ctx, org, repo, branch, path) []ContentEntry —
    directory listing for per-Org shared-blueprints fan-out.

GitHub Actions workflow:

  - .github/workflows/catalyst-catalog-build.yaml — push-on-paths +
    pull_request + workflow_dispatch (NO cron). go vet + go test (race +
    count=1) + image build → GHCR :<sha>. repository_dispatch fan-out
    to chart-bump matches the Group C controllers' pattern.

Tests (3-tier gate): unit (config, cache, auth, source, handler) +
integration (httptest-backed Gitea fixtures across all 3 sources +
priority + per-Org access). All green; race detector on.

L3 (SME catalog retirement) is deferred per the EPIC-2 master brief.
GraphQL deferred (REST first; gqlgen would pull ~80MB of indirect deps
for a feature no UI consumer has asked for yet).

Co-authored-by: hatiyildiz <hati.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 04:04:52 +04:00
e3mrah
a57d05d4dd
fix(provisioning,catalog): parent-kustomization prefix collision + disable openclaw/stalwart-mail (#1043)
Two bugs surfaced live 2026-05-06 on tenant "test":

1) UpdateParentKustomization used substring match against "  - <slug>",
   which falsely "found" the slug when it was a PREFIX of an existing
   entry. Adding "test" to a file already listing "test11" or "test13"
   silently no-op'd. Result: tenant manifests committed but the
   tenants/kustomization.yaml never registered them, Flux's tenants
   Kustomization couldn't apply the new tenant, vCluster step timed
   out at 10m. Fix: exact line match on the resources entry.

2) openclaw + stalwart-mail were flagged Deployable=true in #941 but
   never had AppSpec entries in core/services/provisioning/gitops/apps.go
   KnownApps. The SME provisioning generator emits a single-Deployment
   template that requires Image + Port; for those two slugs it produced
   invalid manifests:

     Deployment.apps "openclaw" is invalid:
     containers[0].image: Required value
     containers[0].ports[0].containerPort: Required value

   tenant-test11-apps Kustomization rejected the dry-run, no apps ever
   landed inside the vcluster. Re-enabling these requires per-app
   overlay support beyond the single-Deployment template — separate
   work. For now: comment them out of DeployableAppSlugs so the catalog
   seed flips them back to Deployable=false on next pod restart and the
   marketplace UI shows them as COMING SOON.

Adds regression tests for both: prefix-collision in
UpdateParentKustomization, and a stability test on the deployable map
shape.
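
The prefix collision from bug 1 can be reproduced in a few lines. Both function names are hypothetical; the sketch only contrasts a substring check with an exact match on the trimmed resources line:

```go
package main

import "strings"

// sampleKustomization already lists test11 and test13, but not test.
const sampleKustomization = "resources:\n  - test11\n  - test13\n"

// hasResourceEntrySubstring reproduces the buggy check: "  - test" is a
// substring of "  - test11", so the slug is falsely reported as present.
func hasResourceEntrySubstring(kustomization, slug string) bool {
	return strings.Contains(kustomization, "  - "+slug)
}

// hasResourceEntry compares whole trimmed lines, so a slug only matches
// its own "- <slug>" entry.
func hasResourceEntry(kustomization, slug string) bool {
	want := "- " + slug
	for _, line := range strings.Split(kustomization, "\n") {
		if strings.TrimSpace(line) == want {
			return true
		}
	}
	return false
}
```

With the sample above, the substring check reports "test" as present while the exact match does not, and "test11" still matches exactly.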

Co-authored-by: hatiyildiz <hatice@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 10:21:39 +04:00
e3mrah
ff0e90156d
fix(provisioning): re-read parent kustomization on commit retry — prevent slug-resurrection race (#1034)
Live race seen 2026-05-06: bookcheck teardown committed at T (removed
the slug from tenants/kustomization.yaml + pruned its directory).
Multitest provision's first commit attempt at T-2s got a ref-race
rejection, the github client's retry replayed the SAME files map (which
held the pre-teardown parent kustomization with bookcheck still in it),
and the retry's commit at T+5s overwrote the teardown's removal. Result:
the parent kustomization listed bookcheck but the directory was gone,
Flux's tenants Kustomization wedged in build-failure loop, and EVERY
subsequent tenant change was blocked until manually unblocked.

Add CommitFilesWithPruneAndRebuild — same as CommitFilesWithPrune but
takes a `rebuild(ctx) (files, error)` callback invoked at the start of
each attempt. Wire both consumer paths (provision + teardown) through
it; each rebuild re-reads parent kustomization.yaml against the current
HEAD and re-applies UpdateParentKustomization / RemoveTenantFromParentKustomization
fresh. Static tenant-scoped manifests still flow through unchanged.

CommitFilesWithPrune is preserved as a thin wrapper for callers that
ship truly static files (e.g. day-2 app installs scoped to a tenant
subdir, no parent merge involved).
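
A sketch of the rebuild-per-attempt idea, with the Git operations replaced by closures. commitWithRebuild, errRefRace and demoRebuildRace are hypothetical names; only the shape of the `rebuild(ctx) (files, error)` callback comes from this message:

```go
package main

import (
	"context"
	"errors"
	"strings"
)

// errRefRace stands in for the ref-race rejection (hypothetical).
var errRefRace = errors.New("ref moved")

type files map[string]string

// commitWithRebuild calls rebuild at the start of every attempt, so each
// try commits files derived from the current HEAD, not a stale snapshot.
func commitWithRebuild(ctx context.Context, attempts int,
	rebuild func(context.Context) (files, error),
	commit func(context.Context, files) error) error {
	var lastErr error
	for i := 0; i < attempts; i++ {
		fs, err := rebuild(ctx)
		if err != nil {
			return err
		}
		lastErr = commit(ctx, fs)
		if !errors.Is(lastErr, errRefRace) {
			return lastErr
		}
	}
	return lastErr
}

// demoRebuildRace simulates a teardown landing between two attempts.
func demoRebuildRace() (calls int, committed string, err error) {
	head := "bookcheck" // parent kustomization content at HEAD
	rebuild := func(context.Context) (files, error) {
		calls++
		merged := strings.TrimPrefix(head+",multitest", ",")
		return files{"kustomization.yaml": merged}, nil
	}
	first := true
	commit := func(_ context.Context, fs files) error {
		if first {
			first = false
			head = "" // concurrent teardown removed bookcheck
			return errRefRace
		}
		committed = fs["kustomization.yaml"]
		return nil
	}
	err = commitWithRebuild(context.Background(), 3, rebuild, commit)
	return calls, committed, err
}
```

In the simulation the second attempt re-reads HEAD and commits only "multitest", so the torn-down slug is not resurrected.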

Co-authored-by: hatiyildiz <hatice@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 03:28:35 +04:00
e3mrah
f1744c8973
fix(provisioning): BookStack — also emit DB_USERNAME/DB_PASSWORD (Laravel-native) (#1031)
PR #1028 fixed the APP_KEY halt and switched to DB_USER/DB_PASS, but
linuxserver/bookstack's init script does NOT substitute DB_USER →
DB_USERNAME in the .env file. Laravel does read env vars natively, but
only under its canonical names DB_USERNAME / DB_PASSWORD. Without
those, Laravel falls back to the .env placeholder values
(database_username / database_user_password) and the app fails with:

  SQLSTATE[HY000] [1045] Access denied for user 'database_username'@...

Caught live on tenant 'bookcheck' 2026-05-06 after PR #1028 deployed —
pod ran, app started, but every request hit the placeholder credentials.

Emit BOTH name pairs so the env works regardless of which the LSIO
upstream eventually wires up.

Co-authored-by: hatiyildiz <hatice@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 02:59:14 +04:00
e3mrah
b180d56926
fix(provisioning): BookStack overlay — add DB_* envs + APP_KEY + APP_URL (#1028)
linuxserver/bookstack reads DB_HOST/DB_USER/DB_PASS/DB_DATABASE
(NOT WORDPRESS_DB_*) and halts init with "The application key is
missing, halting init!" when APP_KEY isn't set. The pod stays 1/1
Running because the readiness probe doesn't catch the silent halt,
but the application never binds to port 80, so the ingress returns
502. Discovered via live E2E on tenant 'aaa' (BookStack on m plan):
all 7 provisioning steps reported done, ingress healthy, cert ready,
but https://aaa.omani.rest → 502.

Add a "bookstack" DBEnvStyle case in the mysql env-emitter that
writes DB_*, APP_URL=https://<slug>.omani.rest, and a Laravel-format
APP_KEY (base64:<32-byte>). Also add a randomAppKey() helper alongside
randomHex(). Tag the catalog AppSpec with DBEnvStyle: "bookstack".
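
A minimal sketch of what a helper like randomAppKey() could look like, assuming the key is 32 random bytes, standard-base64 encoded, behind the "base64:" prefix; the real helper's signature is not shown in this message:

```go
package main

import (
	"crypto/rand"
	"encoding/base64"
)

// randomAppKey returns a Laravel-format APP_KEY: "base64:" followed by
// 32 random bytes in standard base64 encoding.
func randomAppKey() (string, error) {
	key := make([]byte, 32)
	if _, err := rand.Read(key); err != nil {
		return "", err
	}
	return "base64:" + base64.StdEncoding.EncodeToString(key), nil
}
```

The result is always 51 characters: the 7-character prefix plus 44 characters of base64 for 32 bytes.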

Co-authored-by: hatiyildiz <hatice@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 02:49:35 +04:00
e3mrah
c9b8c13406
fix(tenant): JWT-bypass /tenant/internal/* — paid checkouts never provisioned (#1018) (#1019)
Billing's dispatchOrderPlaced enriches the order.placed NATS event by
calling /tenant/internal/tenants/<id>/subdomain over the in-cluster
ClusterIP. routes.go registers that path with the comment "Internal —
unauthenticated service-to-service", but main.go wraps everything
under /tenant/ in JWTAuth except /tenant/check-slug/. So billing got
401, returned "" for the subdomain, published order.placed with
subdomain="", and provisioning rejected every paid checkout with
"invalid subdomain expected=[a-z][a-z0-9-]{2,30}".

Add /tenant/internal/ to the public-paths bypass. Both gateways
already 401 the path externally, and subdomain values are public DNS
names — the documented threat model.
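
The bypass can be sketched as a prefix check in front of the JWT middleware. This assumes a net/http-style middleware signature; isPublicPath and withJWTBypass are hypothetical names, and the two prefixes are the ones named in this message:

```go
package main

import (
	"net/http"
	"strings"
)

// publicPrefixes lists paths served without JWT validation.
var publicPrefixes = []string{"/tenant/check-slug/", "/tenant/internal/"}

// isPublicPath reports whether the request path skips JWT auth.
func isPublicPath(path string) bool {
	for _, p := range publicPrefixes {
		if strings.HasPrefix(path, p) {
			return true
		}
	}
	return false
}

// withJWTBypass wraps next in jwtAuth except for public paths.
func withJWTBypass(jwtAuth func(http.Handler) http.Handler, next http.Handler) http.Handler {
	protected := jwtAuth(next)
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if isPublicPath(r.URL.Path) {
			next.ServeHTTP(w, r)
			return
		}
		protected.ServeHTTP(w, r)
	})
}
```

Because the prefixes end in a slash, a path such as /tenant/internal-foo stays behind JWTAuth.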

Co-authored-by: hatiyildiz <hatice@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 02:09:55 +04:00
e3mrah
689276889c
fix(bp-catalyst-platform+bp-newapi): unblock alice signup gates 2-6 on Sovereigns (#915) (#951)
Six coupled chart + orchestrator fixes that unblock alice marketplace
signup → tenant ready → SaaS integrations → LLM → ledger on a freshly
franchised Sovereign. C5-final got Gate 1 GREEN on otech113 (2026-05-05)
but every downstream gate failed because the SME bundle hardcoded
contabo-only assumptions.

Bumps:
  - bp-catalyst-platform 1.4.21 → 1.4.22
  - bp-newapi             1.3.0 → 1.4.0
  - bootstrap-kit slot 13 + 80 pins updated in lockstep

Issues addressed (single consolidated PR — smaller PRs would race
against alice signup retries):

  - #934 (auth SMTP empty → "failed to send email"): sme-secrets.yaml
    now reads SMTP_* from `catalyst-system/sovereign-smtp-credentials`
    (the same A5-seeded source, #883/#905, that the chart 1.4.20
    catalyst-openova-kc-credentials Secret already uses) with source-wins
    precedence. Both canonical (smtp-host/port/from/user/pass) AND
    legacy (host/port/from/user/password) source-Secret key shapes
    accepted. Empty source falls back to chart-level defaults so the
    contabo path stays clean.

  - #940 (provisioning service GITHUB_TOKEN placeholder + hardcoded
    upstream github.com): chart values
    .Values.smeServices.provisioning.{githubToken,git.{apiURL,owner,
    repo,branch}} make every GitHub-API coordinate operator-overridable
    with topology-aware defaults (Sovereign ⇒ in-cluster Gitea REST
    API + `openova` org; contabo ⇒ api.github.com + `openova-io` org).
    Provisioning binary's startup gate validates the GITHUB_TOKEN does
    NOT contain placeholder substrings (<placeholder>, PLACEHOLDER,
    REPLACE_ME, ...) and crashes the Pod into Pending if it does — the
    operator sees the misconfig immediately instead of after alice
    signups have failed silently in service logs. GitHub client now
    accepts a custom API URL via NewClientWithAPIURL so Gitea's GitHub-
    compatible /api/v1 surface drops in without re-implementing the
    client.

  - #941 (catalog "27 apps COMING SOON"): added `openclaw` and
    `stalwart-mail` to migrateAppDeployable's deployable map at
    core/services/catalog/handlers/seed.go. Both blueprints (bp-openclaw,
    bp-stalwart-{sovereign,tenant}) ship with visibility=listed in the
    embedded blueprints.json AND have working SME-tenant overlay
    templates in sme_tenant_gitops.go, but the catalog handler silently
    filtered them out because they were missing here. Map extracted to
    DeployableAppSlugs() exported function so unit tests can assert
    membership without invoking a Mongo store.

  - #942 (REDPANDA_BROKERS hardcoded to talentmesh): configmap.yaml
    selects broker default at render time based on global.sovereignFQDN
    — Sovereign ⇒ NATS JetStream Service per ADR-0001 (the only local
    bus on Sovereigns); contabo ⇒ legacy Redpanda Service in talentmesh.
    Operator MAY override either default via
    .Values.smeServices.eventBus.brokers without forking the chart.
    The ConfigMap key name stays REDPANDA_BROKERS for back-compat with
    existing SME service Go env wiring; new EVENT_BUS_PROTOCOL key
    surfaces the protocol hint for services that want to switch wire
    format independently.

  - #943 (bp-newapi silently skips Deployment): NEW
    templates/cnpg-cluster.yaml auto-provisions a CNPG-backed Postgres
    Cluster + Helm-`lookup`-persistent DSN Secret when
    .Values.cnpg.enabled (DEFAULT true). NEW templates/credentials-
    secret.yaml auto-generates SESSION_SECRET + CRYPTO_SECRET (each
    64-char randAlphaNum, persistent across reconciles via Helm
    `lookup`) when .Values.credentials.autoProvision (DEFAULT true).
    deployment.yaml gate now resolves Secret names from the chart-
    emitted defaults when the operator hasn't supplied an override.
    Capabilities-gated on postgresql.cnpg.io/v1 so a cold install
    before bp-cnpg is Ready surfaces as "no Cluster yet" rather than
    a hard install error.

  - #944 (CRITICAL — cross-cluster pollution): provisioning.yaml
    templates GIT_BASE_PATH from
    .Values.smeServices.provisioning.gitBasePath with a topology-aware
    default `clusters/<sovereignFQDN>/sme-tenants` on Sovereigns. NEW
    `core/services/provisioning/gitguard` package validates at startup
    AND on every commit code path that the path begins with
    `clusters/<self-FQDN>/` — refusing to commit to any other cluster's
    tree. Defence in depth so a runtime env mutation (kubectl exec,
    ConfigMap update without Pod restart, hostile sidecar) cannot
    bypass the check. Pre-#944 every alice tenant overlay landed in
    upstream openova/openova `clusters/contabo-mkt/tenants/<id>/`
    which contabo Flux would then install on the contabo cluster —
    C5-final caught + reverted the alice2 incident at commit 5715db04.
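
The #944 guard can be sketched as a prefix check on the cleaned path. validateBasePath is a hypothetical name and the real gitguard package may check more than this; the sketch covers the traversal and prefix-collision cases only:

```go
package main

import (
	"errors"
	"fmt"
	"path"
	"strings"
)

// validateBasePath rejects any path that does not stay under
// clusters/<selfFQDN>/ once ".." segments are resolved.
func validateBasePath(basePath, selfFQDN string) error {
	if selfFQDN == "" {
		return errors.New("self FQDN is empty")
	}
	prefix := "clusters/" + selfFQDN + "/"
	clean := path.Clean(basePath)
	// The trailing slash in prefix blocks look-alike FQDNs that merely
	// start with selfFQDN.
	if !strings.HasPrefix(clean+"/", prefix) {
		return fmt.Errorf("refusing path %q: outside %q", basePath, prefix)
	}
	return nil
}
```

Calling the same function at startup and again on each commit path is what makes the check hold even if the environment changes while the Pod is running.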

Tests:
  - core/services/provisioning/gitguard: 22 cases covering Sovereign
    + contabo + traversal + prefix-collision + placeholder token
  - core/services/catalog/handlers: openclaw/stalwart-mail in
    deployable map + stable-shape lock against accidental deletes
  - helm-template smoke pass: bp-newapi (default values renders
    Deployment + auto-provisioned Secrets); bp-catalyst-platform
    (Sovereign render shows GIT_BASE_PATH=clusters/otech113.../sme-
    tenants, REDPANDA_BROKERS=nats-jetstream..., GITHUB_OWNER=openova,
    GITHUB_API_URL=http://gitea-http...)
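
The placeholder-token gate from #940 might look roughly like the sketch below. The marker list is partial (only the three substrings named in this message) and validateToken is a hypothetical name; matching is assumed to be case-sensitive:

```go
package main

import (
	"fmt"
	"strings"
)

// placeholderMarkers holds only the substrings named in the commit
// message; the real list is longer.
var placeholderMarkers = []string{"<placeholder>", "PLACEHOLDER", "REPLACE_ME"}

// validateToken returns an error when the token still contains a
// placeholder marker, so the service can refuse to start.
func validateToken(token string) error {
	for _, m := range placeholderMarkers {
		if strings.Contains(token, m) {
			return fmt.Errorf("GITHUB_TOKEN contains placeholder marker %q", m)
		}
	}
	return nil
}
```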

Closes #934 #940 #941 #942 #943 #944
Refs umbrella #915

Co-authored-by: hatiyildiz <hatice@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 15:27:23 +04:00
e3mrah
95a06f56f8
fix(sme-marketplace): unblock PIN signin — route /api/* to sme/gateway + add send-pin alias (#868) (#869)
Two-part fix for marketplace UI signin flow which 503'd then 404'd on
otech103. Live debugging found two stacked bugs.

Part A — chart (HTTPRoute backend):
- marketplace-routes.yaml: /api/* rule now backendRefs sme/gateway:8080
  (cross-namespace) instead of catalyst-system/marketplace-api which had
  a Service selector matching zero Pods. The gateway in sme already
  fronts services-auth, catalog, tenant, billing, provisioning.
- marketplace-reference-grant.yaml: extend `to:` list with the gateway
  Service so the cross-ns hop is authorised by Gateway API.
- Bump bp-catalyst-platform 1.4.7 → 1.4.8 + lockstep slot 13 pin.

Part B — services-auth (route name):
- Add /auth/send-pin alias delegating to existing SendMagicLink handler,
  and /auth/verify-pin alias delegating to VerifyMagicLink. The
  marketplace UI surfaces a 6-digit PIN ("Send PIN" button), so the
  PIN-named routes are the canonical UX-facing names. /auth/magic-link
  and /auth/verify remain registered for backward compat.
- services-build workflow auto-rebuilds the auth image on push to
  core/services/** — no manual dispatch needed.
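
The alias registration in Part B amounts to mounting one handler under two paths. A sketch using the standard library ServeMux (the router actually used by services-auth is an assumption, and newAuthMux / demoStatus are hypothetical names):

```go
package main

import (
	"net/http"
	"net/http/httptest"
)

// newAuthMux registers the PIN-named routes next to the legacy ones,
// each pair sharing a single handler.
func newAuthMux(sendMagicLink, verifyMagicLink http.HandlerFunc) *http.ServeMux {
	mux := http.NewServeMux()
	mux.HandleFunc("/auth/magic-link", sendMagicLink)
	mux.HandleFunc("/auth/send-pin", sendMagicLink)
	mux.HandleFunc("/auth/verify", verifyMagicLink)
	mux.HandleFunc("/auth/verify-pin", verifyMagicLink)
	return mux
}

// demoStatus sends a POST to path against stub handlers and returns the
// recorded status code.
func demoStatus(path string) int {
	ok := func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusAccepted)
	}
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodPost, path, nil)
	newAuthMux(ok, ok).ServeHTTP(rec, req)
	return rec.Code
}
```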

Refs: #868

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
2026-05-05 08:22:17 +04:00
e3mrah
fa4395fa3a
fix(bp-catalyst-platform): wire VALKEY_PASSWORD into SME auth + gateway (#863) (#864)
After PR #862 (1.4.4) made cross-ns Valkey reachable from `sme` ns, the
auth Pod started CrashLoopBackOff with "NOAUTH HELLO must be called with
the client already authenticated". Root cause: bp-valkey 1.0.0 ships
auth.enabled=true (bitnami default) but SME service code + Deployment
templates never plumbed a password through.

Chart 1.4.4 -> 1.4.5. Slot 13 pin lockstep.

Changes:
- core/services/shared/db/valkey.go: add ConnectValkeyWithAuth overload
  taking username + password. ConnectValkey kept backwards-compatible
  for contabo-mkt's auth-less in-namespace Valkey.
- core/services/auth/main.go + gateway/main.go: read VALKEY_USERNAME +
  VALKEY_PASSWORD env, call ConnectValkeyWithAuth when password set,
  else fall through to no-auth path.
- NEW templates/sme-services/valkey-cross-ns-secret.yaml: Helm `lookup`
  reads bp-valkey's auto-generated `valkey-password` from the
  `valkey/valkey` Secret and re-emits it as `sme-valkey-auth` in `sme`
  ns. Same pattern as sme-secrets.yaml (#859) and gitea-admin-secret
  (#830 Bug 2). On first install the lookup may return nil; Flux's 15m
  reconcile picks up the mirror once bp-valkey is Ready.
- auth.yaml + gateway.yaml: add VALKEY_PASSWORD env from `sme-valkey-
  auth` Secret with optional=true so contabo-mkt's auth-less path keeps
  working when the mirror Secret is absent.
- values.yaml: add `smeServices.valkey.{sourceSecretName,
  sourcePasswordKey, destNamespace, destSecretName}` knobs (Inviolable
  Principle #4).
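
The env-driven choice between the two connect paths can be sketched without a Valkey client. valkeyAuth and valkeyAuthFromEnv are hypothetical; the caller would pick ConnectValkeyWithAuth when the second return value is true and ConnectValkey otherwise:

```go
package main

// valkeyAuth carries the optional credentials read from the environment.
type valkeyAuth struct {
	Username string
	Password string
}

// valkeyAuthFromEnv reads VALKEY_USERNAME / VALKEY_PASSWORD through the
// supplied getenv function (os.Getenv in production). It reports false
// when no password is set, which selects the auth-less path.
func valkeyAuthFromEnv(getenv func(string) string) (valkeyAuth, bool) {
	pass := getenv("VALKEY_PASSWORD")
	if pass == "" {
		return valkeyAuth{}, false
	}
	return valkeyAuth{Username: getenv("VALKEY_USERNAME"), Password: pass}, true
}
```

Passing getenv as a parameter keeps the decision testable without touching the process environment.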

Live verified the failure mode on otech103: 11/13 SME pods Running 1/1,
auth in CrashLoopBackOff with NOAUTH HELLO error. Provisioning Pod's
CreateContainerConfigError is unrelated (ghcr-pull, separate ticket).

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 06:09:38 +04:00
e3mrah
5cdb738ac9
fix(services): go mod tidy across sibling services after #798 shared deps bump (#821)
#798 added github.com/nats-io/nats.go to core/services/shared/go.mod and
adjusted x/sys, x/crypto, and x/text to Go 1.22-compatible versions. The
sibling services (auth, catalog, domain, gateway, notification,
provisioning, tenant) reference the same shared module via the local
`replace` directive — their go.sum files must include the new transitive
hashes, otherwise the CI Containerfile build hits:

    go: updates to go.mod needed; to update it: go mod tidy

This commit is a pure `go mod tidy` across all 7 services; no source
changes. CI services-build is now unblocked.

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 22:35:46 +04:00
e3mrah
9645a9044a
feat(metering): NewAPI NATS publisher + sme-billing subscriber + POST /metering/record (#798) (#818)
* feat(metering): NewAPI NATS publisher + sme-billing subscriber + POST /metering/record (#798)

Per #795 [Q-mine-3] (NATS not RedPanda) + [Q-mine-4] (one ledger), add
the SME-2 metering integration end-to-end. NewAPI is consumed as the
upstream image `ghcr.io/openova-io/openova/newapi-mirror` (a pinned
mirror, not a fork) — the metering envelope is produced by a Go sidecar
that observes the OpenAI-style `usage.total_tokens` field on every
2xx /v1/* response. This avoids forking the upstream binary while still
producing the canonical envelope shape on `catalyst.usage.recorded`.

A) NewAPI metering sidecar — core/services/metering-sidecar/
   - Transparent reverse proxy in front of NewAPI on its own port; the
     bp-newapi Service routes the cluster-fronting port to the sidecar,
     which forwards to NewAPI on the pod's loopback.
   - Observes successful /v1/* JSON responses, parses
     `usage.{prompt_tokens,completion_tokens,total_tokens}`, computes
     amount_micro_omr = -tokens * priceMicroOMRPerToken, and publishes
     one envelope on `catalyst.usage.recorded` per completed request.
   - Failed (non-2xx), non-JSON, and admin-path requests are NOT billed.
   - Customer-facing latency is NEVER blocked on metering: the response
     body is restored before publish; on NATS unreachable the envelope
     is persisted to disk and retried by a background drain loop.
   - 14 unit tests (proxy + publisher + safeFilename guards).
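
The billing rule in section A can be sketched as follows. This is a minimal illustration, not the sidecar's actual code: `amountMicroOMR` and the `usage` struct are hypothetical names, admin-path filtering is left out, and skipping zero-token responses is an assumption that the commit message does not state.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// usage mirrors the OpenAI-style usage object named in the commit message.
type usage struct {
	PromptTokens     int64 `json:"prompt_tokens"`
	CompletionTokens int64 `json:"completion_tokens"`
	TotalTokens      int64 `json:"total_tokens"`
}

// amountMicroOMR is a hypothetical helper. It returns the negative ledger
// amount for one response, or ok=false when the request must not be billed
// (non-2xx status, non-JSON body, or no usage object).
func amountMicroOMR(status int, body []byte, priceMicroOMRPerToken int64) (amount int64, ok bool) {
	if status < 200 || status > 299 {
		return 0, false
	}
	var resp struct {
		Usage *usage `json:"usage"`
	}
	if err := json.Unmarshal(body, &resp); err != nil || resp.Usage == nil {
		return 0, false
	}
	// Assumption: a response with no tokens produces no envelope.
	if resp.Usage.TotalTokens <= 0 {
		return 0, false
	}
	return -resp.Usage.TotalTokens * priceMicroOMRPerToken, true
}

func main() {
	body := []byte(`{"usage":{"prompt_tokens":1,"completion_tokens":1,"total_tokens":2}}`)
	amount, ok := amountMicroOMR(200, body, 156)
	fmt.Println(amount, ok) // prints "-312 true"
}
```

With the chart default of 156 micro-OMR per token, a 2-token response debits 312 micro-OMR.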

B) sme-billing NATS subscriber — core/services/billing/handlers/
   metering_consumer.go
   - JetStream durable consumer `sme-billing-metering` on stream
     `CATALYST_USAGE` (provisioned by sme-billing on startup).
   - Idempotent on metadata.request_id via a UNIQUE partial index on
     credit_ledger.external_ref; redelivery from the broker collapses
     to a single ledger row.
   - Customer auto-create on cold start (the rbac sme.user.created
     envelope may land AFTER the first metered request; we don't strand
     usage waiting for it).
   - 11 unit tests covering happy-path, idempotency, malformed-payload
     poison-pill, missing-request-id, non-negative amount guard,
     resolver error → Nak, derive-micro-OMR-from-OMR, DB-error → Nak.

C) HTTP handler POST /billing/metering/record — handlers/metering.go
   - Synchronous validate → INSERT credit_ledger → return
     {ledger_entry_id, balance_after_omr, balance_after_micro_omr,
     duplicate}. Same payload + idempotency guard as the NATS path.
   - Auth: superadmin OR sovereign-admin (operator-admin model;
     end-user LLM traffic flows through the sidecar, never this URL).
   - 8 unit tests covering happy-path, idempotency, role gating,
     malformed-JSON, positive-amount rejection, customer-not-found.

D) Schema — core/services/billing/store/store.go
   - ALTER TABLE credit_ledger ADD COLUMN amount_micro_omr BIGINT
     (1 OMR = 1,000,000 micro-OMR; -0.000234 OMR = -234 micro-OMR
     exact integer — preserves precision at metering rates).
   - ADD COLUMN external_ref TEXT + UNIQUE partial index for
     idempotency dedup.
   - ADD COLUMN metadata JSONB for the raw envelope.
   - GetCreditBalance projects both amount_omr (legacy) and
     amount_micro_omr (new) into the integer-OMR view.
   - GetCreditBalanceMicroOMR returns canonical precision.
   - RecordUsage method: ON CONFLICT DO UPDATE … RETURNING (xmax<>0)
     distinguishes fresh insert from duplicate without a follow-up
     SELECT.
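
An in-memory sketch of the two balance views described in section D, assuming each ledger row carries its value in either the legacy `amount_omr` column or the new `amount_micro_omr` column. The `ledgerRow` type and both function names are hypothetical; the real store computes these sums in SQL.

```go
package main

import "fmt"

const microPerOMR int64 = 1_000_000 // 1 OMR = 1,000,000 micro-OMR

// ledgerRow is a hypothetical in-memory stand-in for a credit_ledger row.
type ledgerRow struct {
	AmountOMR      int64  // legacy whole-OMR column
	AmountMicroOMR *int64 // nil on rows written before the column existed
}

// sums adds up both columns separately, treating a NULL micro amount as zero.
func sums(rows []ledgerRow) (omr, micro int64) {
	for _, r := range rows {
		omr += r.AmountOMR
		if r.AmountMicroOMR != nil {
			micro += *r.AmountMicroOMR
		}
	}
	return omr, micro
}

// balanceOMR is the integer-OMR view: whole OMR plus micro-OMR truncated to OMR.
func balanceOMR(rows []ledgerRow) int64 {
	omr, micro := sums(rows)
	return omr + micro/microPerOMR
}

// balanceMicroOMR is the full-precision view.
func balanceMicroOMR(rows []ledgerRow) int64 {
	omr, micro := sums(rows)
	return omr*microPerOMR + micro
}

func main() {
	debit := int64(-234) // -0.000234 OMR
	rows := []ledgerRow{{AmountOMR: 50}, {AmountMicroOMR: &debit}}
	fmt.Println(balanceOMR(rows), balanceMicroOMR(rows)) // prints "50 49999766"
}
```

The integer view still reads 50 OMR after a 234 micro-OMR debit, while the micro-OMR view keeps the exact remainder.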

E) Wiring
   - core/services/shared/events/nats.go — minimal NATS JetStream
     publisher + subscriber surface; legacy RedPanda producer/consumer
     in events.go untouched per [Q-mine-3].
   - core/services/billing/main.go — NATS_URL env; subscriber wired
     in parallel with the existing RedPanda tenant-events consumer.
   - middleware/jwt.go — exported test helper WithClaims so handler
     tests can construct an authenticated context without minting a
     real signed token.
   - .github/workflows/services-build.yaml — metering-sidecar added
     to the build matrix; deploy job skips it (image consumed by the
     bp-newapi chart, not products/catalyst sme-services).

F) bp-newapi chart (1.0.0 → 1.1.0)
   - meteringSidecar block in values.yaml: image, port, NATS URL,
     priceMicroOMRPerToken (default 156 = 0.000156 OMR/token), spool
     dir, header names, resources, securityContext (read-only-rootfs).
   - deployment.yaml renders the sidecar container + emptyDir spool
     volume when meteringSidecar.enabled (default true).
   - service.yaml routes the cluster-fronting :3000 to the sidecar
     when enabled, exposes a separate :3001 → NewAPI direct port for
     bp-catalyst-platform admin-API traffic (ADR-0003 §3.2).
   - networkpolicy.yaml allows the sidecar's port + nats-system
     egress for JetStream publish.

Tests: 33 new (14 sidecar + 11 subscriber + 8 HTTP handler), all green.
Helm template renders cleanly with sidecar enabled and disabled.

Closes #798

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(billing/store): cast SUM to BIGINT so lib/pq scans into int64 (#798)

Postgres returns `SUM(int) + SUM(bigint)/integer` as `numeric`, which
lib/pq presents as a `[]uint8` decimal string ("50.000000000000000000000000")
that does NOT scan directly into Go int64 — the integration test
TestVoucherLifecycle_IssueRedeemAndCreditApplied caught this in CI on
the post-redeem balance read.

Wrap the SUM expressions in CAST(... AS BIGINT) so the column type is
unambiguously bigint and Scan target stays uniform across pre-#798 rows
(amount_omr only) and post-#798 rows (amount_micro_omr present).

Affects:
  - GetCreditBalance
  - GetCreditBalanceMicroOMR
  - RecordUsage's running-balance read

Test mocks updated to match the new SQL prefix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 22:32:42 +04:00
e3mrah
2a034a0959
feat(catalog): unified catalog with Published flag — operator curates marketplace (#710 wave 2) (#724)
Single source of truth for apps; Sovereign-console operator decides which
apps marketplace customers see; marketplace storefront filters by
Published. Per founder rule 2026-05-04: unpublish is a marketplace-
visibility toggle, not a deployment-lifecycle action — existing tenant
deployments of an unpublished app keep running unaffected.

core/services/catalog/store/store.go
====================================
- App.Published bool — operator-controlled visibility
- ListPublishedApps: marketplace-storefront subset
  (Published=true AND System=false AND Deployable=true).
  System and Deployable are catalog-team-controlled; Published is the
  operator's curation knob.
- SetAppPublished(slug, bool) — hot-path one-bit write the Sovereign
  console hits per row toggle. Cheaper than UpdateApp; slug-keyed so
  the UI doesn't need the internal Mongo _id.
- UpdateApp: thread published through full-update path too.
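
The storefront predicate above can be sketched in memory as follows. The `App` fields shown and the `listPublished` helper are illustrative only; the real store applies the filter in its database query.

```go
package main

import "fmt"

// App carries only the fields relevant to marketplace visibility.
type App struct {
	Slug       string
	Published  bool // operator-controlled
	System     bool // catalog-team-controlled
	Deployable bool // catalog-team-controlled
}

// listPublished keeps apps where Published=true AND System=false AND Deployable=true.
func listPublished(apps []App) []App {
	out := make([]App, 0, len(apps))
	for _, a := range apps {
		if a.Published && !a.System && a.Deployable {
			out = append(out, a)
		}
	}
	return out
}

func main() {
	apps := []App{
		{Slug: "wiki", Published: true, Deployable: true},
		{Slug: "hidden", Published: false, Deployable: true},
		{Slug: "core-dns", Published: true, System: true, Deployable: true},
		{Slug: "draft", Published: true, Deployable: false},
	}
	for _, a := range listPublished(apps) {
		fmt.Println(a.Slug) // prints "wiki"
	}
}
```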

core/services/catalog/handlers/handlers.go + routes.go
======================================================
- ListApps now honours ?published=true query param:
    GET /catalog/apps                  → operator view: every app
    GET /catalog/apps?published=true   → marketplace view: filtered
- New PATCH /catalog/admin/apps/{slug}/publish?value={true|false}
  for the Sovereign-console operator's row toggle.
- requireAdmin gating preserved on the admin endpoint.

core/services/catalog/handlers/seed.go
======================================
- migrateAppPublished: defaults Published=true on every existing app
  on the day Catalyst 1.3.x ships. Operators opt OUT of marketplace
  visibility per app, not IN — matches how a real SaaS storefront is
  curated and prevents an empty marketplace on flag-introduction day.
  Idempotent on re-run.

core/marketplace/src/lib/api.ts
================================
- getApps() now hits /catalog/apps?published=true so the marketplace
  storefront only renders the operator-curated subset.

DoD pending wave 2.5
====================
The Sovereign-console "Catalog & publishing" admin page (per-row
toggle UI) is the next chunk and ships in a follow-up — backend +
storefront filter are the load-bearing change here. Catalog admins
can flip the flag today via the PATCH endpoint; the per-row UI is
quality-of-life on top.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
2026-05-04 11:37:03 +04:00
Emrah Baysal
9519c1ef00 merge: Group L testing (Playwright e2e smoke tests, Hetzner provisioning test scaffold gated on HETZNER_TEST_TOKEN secret, integration tests for bootstrap installer + Dynadot + voucher)
2026-04-28 14:05:59 +02:00
hatiyildiz
7edf63ca7e docs(franchise),test(billing): voucher CRD propagation invariant
#118 verifies that the voucher shape on a franchised Sovereign is
identical to Catalyst-Zero. Two artefacts:

1. New §"Voucher shape propagates automatically" in
   docs/FRANCHISE-MODEL.md explaining WHY there is no propagation
   problem to solve: vouchers are not a CRD. They are rows in the
   per-Sovereign billing service's Postgres database, and every
   Sovereign runs the same SHA-pinned core/services/billing image.
   Same image → same migration → same schema → same handlers → same
   shape. The doc lists which file owns each part of the shape and
   includes a 4-step curl smoke test to run on any Sovereign at
   first-provisioning to confirm the invariant holds.

2. New core/services/billing/handlers/vouchers_test.go covering the
   public POST /billing/vouchers/redeem-preview endpoint added in
   #117. Four cases:
   - 404 on unknown / soft-deleted code (no tombstone leak)
   - 200 on a valid live code, asserting the public shape excludes
     times_redeemed and max_redemptions (defence-in-depth against
     enumeration)
   - 410 Gone on a code that exists but has hit its cap, with the
     credit/description still in the response so the landing page can
     show "campaign ended"
   - 400 on whitespace-only input

The tests run on every CI build of the billing service, on every
Sovereign that builds from this repo. If a future change drifts the
preview endpoint's shape, the tests fail before the regression can
ship.

Also tidies vouchers.go imports (removed two unused stdlib imports
that were placeholders).

Closes #118.
2026-04-28 13:59:31 +02:00
hatiyildiz
12387a4a74 feat(billing): /billing/vouchers/{issue,list,revoke,redeem-preview} surface
#117 adds a franchise-aligned URL surface for the existing PromoCode
voucher implementation, plus one new endpoint (redeem-preview) for the
public landing flow described in docs/FRANCHISE-MODEL.md §3.

The orchestrator's hint was right — the issue/list/revoke handlers
already exist (AdminUpsertPromo / AdminListPromos / AdminDeletePromo
on the legacy /billing/admin/promos surface). This commit:

1. Adds new endpoint handlers in core/services/billing/handlers/vouchers.go:
   - POST   /billing/vouchers/issue          (superadmin or sovereign-admin)
   - GET    /billing/vouchers/list           (superadmin or sovereign-admin)
   - DELETE /billing/vouchers/revoke/{code}  (superadmin or sovereign-admin)
   - POST   /billing/vouchers/redeem-preview (unauthenticated; public)

   The first three reuse the existing store-layer methods. The last is
   new — it validates a code without consuming it, returning a safe
   shape (no times_redeemed, no max_redemptions exposure) so an
   attacker scraping the public endpoint cannot enumerate cap status.

2. Distinguishes 404 (code never existed or soft-deleted — same
   tombstone-leak protection as #91) from 410 Gone (code exists but is
   inactive or capped). The 410 body still includes the credit and
   description so the landing page can show "this campaign has ended".

3. Keeps the legacy /billing/admin/promos endpoints in place — the
   existing admin UI continues to work without any breaking change.
   New code should target /billing/vouchers/...

4. Updates docs/FRANCHISE-MODEL.md to point to the new URL surface.
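
The 400 / 404 / 410 / 200 split used by redeem-preview can be sketched as a pure function. The `promo` struct, the `lookup` callback, and the rule that `MaxRedemptions == 0` means uncapped are assumptions made for illustration, not the handler's actual code.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// promo is a hypothetical view of a stored voucher.
type promo struct {
	Active         bool
	Deleted        bool // soft-deleted
	TimesRedeemed  int
	MaxRedemptions int // assumption: 0 means no cap
}

// previewStatus maps a submitted code to the HTTP status of the preview response.
func previewStatus(code string, lookup func(string) (promo, bool)) int {
	code = strings.TrimSpace(code)
	if code == "" {
		return http.StatusBadRequest // whitespace-only input
	}
	p, found := lookup(code)
	if !found || p.Deleted {
		return http.StatusNotFound // same answer for unknown and soft-deleted
	}
	capped := p.MaxRedemptions > 0 && p.TimesRedeemed >= p.MaxRedemptions
	if !p.Active || capped {
		return http.StatusGone // code exists but can no longer be redeemed
	}
	return http.StatusOK
}

func main() {
	store := map[string]promo{
		"LIVE":   {Active: true, MaxRedemptions: 5},
		"CAPPED": {Active: true, TimesRedeemed: 5, MaxRedemptions: 5},
		"GONE":   {Active: true, Deleted: true},
	}
	lookup := func(c string) (promo, bool) { p, ok := store[c]; return p, ok }
	for _, c := range []string{"  ", "NOPE", "GONE", "CAPPED", "LIVE"} {
		fmt.Printf("%q -> %d\n", c, previewStatus(c, lookup))
	}
}
```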

The actual REDEMPTION still happens transactionally inside POST
/billing/checkout via the `promo_code` field — that path locks the
promo row, inserts the promo_redemptions edge, increments
times_redeemed, and adds the credit_ledger entry in one transaction.
Splitting it into a separate /redeem endpoint would break that
atomicity, so we deliberately do not add one. The public redeem flow
is preview → signup → checkout-with-promo_code.

Closes #117.
2026-04-28 13:54:19 +02:00
hatiyildiz
3e956b7d81 test: voucher issuance integration test — real Postgres (#147)
Closes the Group L "integration test — voucher issuance via API — issue
→ redeem → Org created path" ticket.

Per docs/INVIOLABLE-PRINCIPLES.md principle #2 (no mocks where the test
would otherwise verify real behavior), this test runs against a real
PostgreSQL — not sqlmock. The voucher mechanic lives in
store.RedeemPromoCode which runs a transaction with SELECT FOR UPDATE on
promo_codes, COUNT lookup on promo_redemptions, and inserts into
credit_ledger. Mocking SQL strings doesn't verify whether the
transactional invariants actually hold under concurrent contention; this
codebase has been bitten by exactly that gap before (#93: counter
incremented before order was committed).

The test is gated on BILLING_TEST_PG_URL — when unset, it skips (NOT
mocks). CI populates it via the new postgres service container in
.github/workflows/test-billing-integration.yaml.

Each test gets its own Postgres schema (via CREATE SCHEMA + libpq's
options=-c search_path) so parallel runs don't cross-contaminate, and so
goroutine concurrency tests reliably hit the same schema regardless of
which pooled connection they pick up.

Coverage:
  - Issue → Redeem → Credit applied (the canonical happy path)
  - Per-customer double-redemption blocked
  - Redemption cap enforced under concurrency (12 goroutines fighting
    for a 5-cap voucher → exactly 5 successful redemptions, no more)
  - Soft-deleted codes rejected as "not found" (no tombstone leak per #91)
  - Inactive codes rejected with distinct "not active" error
  - Two different customers can each redeem the same voucher
  - Org-creation prerequisites: customer.tenant_id non-empty, balance > 0
    (these are the inputs the downstream tenant.created event consumer
    feeds into CreateTenant — covered by tenant-service consumer_test.go)
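
The cap-under-concurrency invariant can be illustrated with an in-memory stand-in, where a mutex plays the role that the row lock plays in the real transaction. The `voucher` type and `raceRedeem` are hypothetical; the actual test runs against PostgreSQL.

```go
package main

import (
	"fmt"
	"sync"
)

// voucher is an in-memory stand-in for a promo row plus its redemption edges.
type voucher struct {
	mu         sync.Mutex
	limit      int
	redeemed   int
	byCustomer map[string]bool
}

// redeem performs the per-customer check, the cap check, and the increment
// under one lock, so no two callers can both take the last slot.
func (v *voucher) redeem(customer string) bool {
	v.mu.Lock()
	defer v.mu.Unlock()
	if v.byCustomer[customer] || v.redeemed >= v.limit {
		return false
	}
	v.byCustomer[customer] = true
	v.redeemed++
	return true
}

// raceRedeem starts n goroutines, each a distinct customer, and returns how
// many redemptions succeeded.
func raceRedeem(v *voucher, n int) int {
	var wg sync.WaitGroup
	var mu sync.Mutex
	succeeded := 0
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			if v.redeem(fmt.Sprintf("customer-%d", id)) {
				mu.Lock()
				succeeded++
				mu.Unlock()
			}
		}(i)
	}
	wg.Wait()
	return succeeded
}

func main() {
	v := &voucher{limit: 5, byCustomer: map[string]bool{}}
	fmt.Println(raceRedeem(v, 12)) // prints "5"
}
```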

CI workflow added: .github/workflows/test-billing-integration.yaml runs
the tests against a postgres:16-alpine service container with -race.

Refs #147

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 13:53:43 +02:00
hatiyildiz
fabedd42c1 feat(admin,billing): per-Sovereign voucher issuance for sovereign-admin
#115 extends the existing PromoCode (voucher) admin surface so a
sovereign-admin role can issue, list, and revoke vouchers on a
franchised Sovereign. No new endpoints, no new schema, no new CRD —
all the changes are role-gating widenings on the existing surface.

Backend (core/services/billing/handlers/handlers.go):

- New `requireVoucherIssuer` helper accepts both `superadmin` and
  `sovereign-admin`. Used by AdminListPromos, AdminUpsertPromo, and
  AdminDeletePromo only. All other admin endpoints (Stripe settings,
  revenue, orders) keep the existing `requireAdmin` (superadmin-only).
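
The difference between the two gates can be sketched as follows. Only the role names and the 403 response come from this commit message; `requireRole`, `voucherIssuerStatus`, and `adminStatus` are illustrative names, not the handlers' real signatures.

```go
package main

import (
	"fmt"
	"net/http"
)

// requireRole returns 200 when role is in the allowlist, otherwise 403.
func requireRole(role string, allowed ...string) int {
	for _, a := range allowed {
		if role == a {
			return http.StatusOK
		}
	}
	return http.StatusForbidden
}

// voucherIssuerStatus models the widened gate on the promo endpoints.
func voucherIssuerStatus(role string) int {
	return requireRole(role, "superadmin", "sovereign-admin")
}

// adminStatus models the unchanged superadmin-only gate.
func adminStatus(role string) int {
	return requireRole(role, "superadmin")
}

func main() {
	for _, role := range []string{"superadmin", "sovereign-admin", "member"} {
		fmt.Println(role, voucherIssuerStatus(role), adminStatus(role))
	}
}
```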

UI (core/admin/src/components/AdminShell.svelte + BillingPage.svelte):

- AdminShell now accepts both roles. Sidebar nav is filtered by role:
  superadmin sees Revenue / Catalog / Tenants / Orders / Billing;
  sovereign-admin sees only Billing. Filtering is via a
  `superadminOnly` flag on each nav item (defence-in-depth: even if
  a sovereign-admin guesses a URL, the backend's requireAdmin will
  return 403).

- BillingPage hides the Stripe Configuration section for
  sovereign-admin (it would 403 from GET /billing/admin/settings
  anyway). The Vouchers (Promo Codes) section is shown to both roles
  with a small label tweak ("Issued vouchers are scoped to this
  Sovereign" for sovereign-admin).

Per docs/INVIOLABLE-PRINCIPLES.md §1 (target-state shape, no MVP)
and §3 (follow documented architecture exactly) — this matches the
FRANCHISE-MODEL.md design where "every franchised Sovereign runs the
same admin app" with role-based gating.

Closes #115.
2026-04-28 13:52:19 +02:00
hatiyildiz
7646840ffe feat(consolidation): move 8 SME backend services + shared module to public repo
Per docs/PROVISIONING-PLAN.md and tickets [B] sme-backend group. Migrates the 8 Go backend services from openova-private/services/ to openova/core/services/, plus the shared module they all depend on, plus the services-build CI workflow.

What moved:
- services/auth → core/services/auth (Go HTTP service for SME marketplace authentication)
- services/billing → core/services/billing (Go HTTP service for billing + voucher backend)
- services/catalog → core/services/catalog (Go HTTP service for App catalog)
- services/domain → core/services/domain (Go HTTP service for tenant domain mapping)
- services/gateway → core/services/gateway (Go HTTP gateway with rate limiting)
- services/notification → core/services/notification (Go HTTP service with email templates)
- services/provisioning → core/services/provisioning (Go HTTP service that commits tenant Application manifests via Gitea/GitHub API)
- services/tenant → core/services/tenant (Go HTTP service for tenant lifecycle)
- services/shared → core/services/shared (shared Go module: db, events, health, middleware, respond)
- 9 go.mod files updated: module github.com/openova-io/openova-private/services/<X> → github.com/openova-io/openova/core/services/<X>
- 9 go.sum and import paths similarly updated
- replace directives updated: openova-private/services/shared → openova/core/services/shared
- sme-services-build.yaml workflow → services-build.yaml in .github/workflows/, paths/context/image-base/deploy paths all repointed at core/services + ghcr.io/openova-io/openova/services-* + products/catalyst/chart/templates/sme-services
- All 8 manifests in products/catalyst/chart/templates/sme-services/ updated: image refs ghcr.io/openova-io/openova-private/sme-{X} → ghcr.io/openova-io/openova/services-{X}
- provisioning.yaml GITHUB_REPO env var: "openova-private" → "openova"

Closes [B] sme-backend (10 tickets).

After this commit, all 14 user-facing + backend Catalyst-Zero modules, plus the shared Go module they depend on, build from this public repo:
- 4 UIs: console, admin, marketplace, catalyst-ui
- 2 backends: marketplace-api, catalyst-api
- 8 SME services: auth, billing, catalog, domain, gateway, notification, provisioning, tenant
- 1 shared Go module

Note: 1 line in core/services/provisioning/main.go retains a literal default of "openova-private" for the GITHUB_REPO fallback when env var is unset; the K8s manifest sets GITHUB_REPO=openova explicitly so this path is never exercised in the deployed runtime, and the in-code default will be cleaned up in a follow-up.
2026-04-28 12:30:32 +02:00