
ADR-0001: Network Topology and Multi-Region Ingress Strategy

| Field | Value |
| --- | --- |
| Status | Accepted |
| Migrated from | Project-wide ADR-007 |

Context

CEREBRAL STRATUM is deployed on OpenShift and must satisfy several networking requirements that evolve across its lifecycle:

  • Public ingress without dependency on cloud provider or on-premises load balancing solutions, supporting both the commercial hub instance and future dedicated spoke deployments

  • Secure inter-service communication within the platform, with mutual TLS enforced between all components

  • Progressive delivery support for safe rollout of backend and microservice changes

  • Multi-region failover as the platform scales beyond a single region

  • Full-stack observability spanning infrastructure, platform, application, and networking layers, published to a centralised observability platform

  • Policy enforcement for authentication, rate limiting, and TLS lifecycle management, initially at the application layer with a path to gateway-level enforcement as the platform matures

The platform uses a hub-and-spoke deployment model, where the hub manages billing, identity, and operator functions, and spokes host tenant-isolated device telemetry workloads. The networking layer must support both the single-instance commercial deployment and the future operator-provisioned spoke model.

Decision

1. Public Ingress: Cloudflare Zero Trust + cloudflared

All public-facing services are exposed via Cloudflare Zero Trust using the cloudflared Kubernetes Deployment (tunnel connector), rather than Kubernetes Ingress resources or cloud provider / on-premises load balancers.

Traffic flow:

Cloudflare Edge (Zero Trust / Access policies) → cloudflared tunnel (Kubernetes Deployment, per cluster) → OpenShift Service Mesh Gateway (Istio Ingress Gateway) → Service Mesh (mTLS, Envoy sidecars) → Application services

This approach:

  • Eliminates dependency on cloud provider load balancers or on-premises hardware

  • Leverages Cloudflare's global anycast backbone for latency and resilience

  • Provides DDoS protection, WAF, and Zero Trust Access enforcement at the edge

  • Requires no inbound firewall rules on the OpenShift cluster — all connections are outbound from cloudflared

  • Supports spoke deployments natively: each spoke's Istio Ingress Gateway is exposed via its own cloudflared tunnel, provisioned by the Kubernetes Operator as part of spoke reconciliation

The cloudflared tunnel terminates at the Istio Ingress Gateway, which is configured using Kubernetes Gateway API resources (Gateway, HTTPRoute, GRPCRoute). This is consistent with OpenShift Service Mesh 3.x's Gateway API-first model.
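
As an illustration of the Gateway API resources named above, the following sketch builds a Gateway and an HTTPRoute as plain Python dictionaries and prints them as JSON, which kubectl accepts. The resource names, namespaces, hostname, port, and the plain HTTP listener are all assumptions made for the example and are not specified by this ADR; only the apiVersion, kind, and field layout come from the Gateway API itself.

```python
import json


def gateway_manifest(name: str, namespace: str, port: int = 80) -> dict:
    """Build a Gateway API Gateway that Istio will program."""
    return {
        "apiVersion": "gateway.networking.k8s.io/v1",
        "kind": "Gateway",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # "istio" is the GatewayClass name that Istio registers by default.
            "gatewayClassName": "istio",
            "listeners": [
                {
                    "name": "http",
                    "port": port,
                    # Assumption: the listener protocol is not stated in this ADR.
                    "protocol": "HTTP",
                    "allowedRoutes": {"namespaces": {"from": "All"}},
                }
            ],
        },
    }


def http_route_manifest(
    name: str,
    namespace: str,
    gateway: str,
    gateway_namespace: str,
    service: str,
    service_port: int,
    hostname: str,
) -> dict:
    """Build an HTTPRoute that attaches to the Gateway and forwards to one Service."""
    return {
        "apiVersion": "gateway.networking.k8s.io/v1",
        "kind": "HTTPRoute",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "parentRefs": [{"name": gateway, "namespace": gateway_namespace}],
            "hostnames": [hostname],
            "rules": [{"backendRefs": [{"name": service, "port": service_port}]}],
        },
    }


if __name__ == "__main__":
    # All names below are hypothetical placeholders.
    gw = gateway_manifest("hub-gateway", "istio-ingress")
    route = http_route_manifest(
        "backend", "hub", "hub-gateway", "istio-ingress",
        "backend", 8080, "app.example.com",
    )
    print(json.dumps([gw, route], indent=2))
```

The cloudflared tunnel would then target the Service that Istio creates for the Gateway. A GRPCRoute follows the same parentRefs and backendRefs shape.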

2. Service Mesh: OpenShift Service Mesh 3.x (Sail Operator)

OpenShift Service Mesh 3.x, managed by the Sail Operator (upstream Istio), is used for all inter-service communication within the cluster.

Key configuration:

  • PeerAuthentication set to STRICT mesh-wide — all pod-to-pod communication requires mTLS; no plaintext traffic is accepted within the mesh

  • Envoy sidecar injection enabled for all application namespaces

  • The Istio Ingress Gateway acts as the mTLS termination boundary for inbound tunnel traffic from cloudflared

  • GRPCRoute resources (or HTTPRoute with an HTTP/2 appProtocol, e.g. kubernetes.io/h2c, set on the backing Service port) are used for gRPC-capable services (e.g. the primary backend)
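
A minimal sketch of the mesh-wide STRICT setting described above, expressed as a Python dictionary that can be serialised to JSON. In Istio, a PeerAuthentication with no selector, placed in the mesh root namespace, applies to the whole mesh. The root namespace name istio-system is an assumption here; the actual value depends on how the Sail Operator's Istio resource is configured.

```python
import json


def mesh_wide_strict_mtls(root_namespace: str = "istio-system") -> dict:
    """PeerAuthentication with no selector in the root namespace: applies mesh-wide."""
    return {
        "apiVersion": "security.istio.io/v1",
        "kind": "PeerAuthentication",
        # "default" is the conventional name for the mesh-wide policy.
        "metadata": {"name": "default", "namespace": root_namespace},
        "spec": {"mtls": {"mode": "STRICT"}},
    }


def is_mesh_wide_strict(policy: dict, root_namespace: str = "istio-system") -> bool:
    """Check that a policy is selector-free, in the root namespace, and STRICT."""
    spec = policy.get("spec", {})
    return (
        policy.get("kind") == "PeerAuthentication"
        and policy.get("metadata", {}).get("namespace") == root_namespace
        and "selector" not in spec
        and spec.get("mtls", {}).get("mode") == "STRICT"
    )


if __name__ == "__main__":
    policy = mesh_wide_strict_mtls()
    print(json.dumps(policy, indent=2))
    print(is_mesh_wide_strict(policy))
```

A check like is_mesh_wide_strict could run in CI against rendered manifests to catch a namespace-level policy that silently relaxes the mode to PERMISSIVE; that use is a suggestion, not part of this decision.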

OSSM 3.x provides the Envoy-based data plane from which mesh telemetry (traces, metrics, access logs) is collected and forwarded to Grafana Cloud.

3. Multi-Region Failover: Cloudflare Load Balancing

Multi-region failover is handled by Cloudflare Load Balancing (origin pools + health checks), rather than DNS-based failover via a policy controller (e.g. Kuadrant/Connectivity Link DNSPolicy).

Rationale over DNS-based failover:

  • Cloudflare Load Balancing steers proxied traffic at the anycast edge based on origin health checks, so failover does not wait for DNS TTL expiry or resolver cache refresh

  • Consistent with the existing Cloudflare Zero Trust commitment — the platform is already on Cloudflare's backbone

  • Avoids introducing a separate DNS management control plane prematurely

Each region exposes a cloudflared tunnel as a Cloudflare origin. Origin pools are grouped by region with health checks against the Istio Ingress Gateway's health endpoint.
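
The failover behaviour can be pictured with a small model. This is a conceptual sketch only, not Cloudflare's API or its steering algorithm: it assumes simple ordered failover, where traffic goes to the first pool in priority order whose latest health check passed. The OriginPool type, its fields, and the pool names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class OriginPool:
    """Hypothetical model of one regional origin pool (one cloudflared tunnel)."""
    name: str
    region: str
    healthy: bool  # latest health-check result against the gateway's health endpoint


def select_pool(pools_in_priority_order: Sequence[OriginPool]) -> Optional[OriginPool]:
    """Return the first healthy pool, or None when every region is down."""
    for pool in pools_in_priority_order:
        if pool.healthy:
            return pool
    return None


if __name__ == "__main__":
    pools = [
        OriginPool("primary", "region-a", healthy=False),
        OriginPool("secondary", "region-b", healthy=True),
    ]
    chosen = select_pool(pools)
    print(chosen.name if chosen else "no healthy origin")
```

Because the decision is taken per request at the edge, a pool marked unhealthy stops receiving traffic as soon as the health check flips, without clients re-resolving DNS.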

For spoke deployments, spoke-specific subdomains are Cloudflare DNS records pointing at the spoke's tunnel. Multi-region failover for spokes is addressed per-spoke as required.

4. Progressive Delivery: Argo Rollouts

Argo Rollouts is used for progressive delivery of application services, integrated with the Istio data plane for traffic splitting.

  • Canary and blue-green rollout strategies use HTTPRoute (Gateway API) traffic weight management via the Argo Rollouts Gateway API plugin

  • VirtualService-based traffic splitting is retained where required for Argo Rollouts compatibility during the Gateway API plugin maturation period

  • Argo Rollouts operates at the HTTPRoute/VirtualService level (per-service traffic splitting), distinct from Cloudflare Load Balancing which operates at the origin/region level — the two layers do not conflict
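
To make the per-service traffic split concrete, the sketch below computes the weighted backendRefs that an HTTPRoute rule would carry at each canary step. It illustrates the weight arithmetic only; it is not the Argo Rollouts plugin's code, and the service names, port, and step percentages are hypothetical.

```python
from typing import Dict, List, Union

BackendRef = Dict[str, Union[str, int]]


def canary_backend_refs(
    stable: str, canary: str, port: int, canary_weight: int
) -> List[BackendRef]:
    """Split 100 weight units between the stable and canary Services."""
    if not 0 <= canary_weight <= 100:
        raise ValueError("canary_weight must be between 0 and 100")
    return [
        {"name": stable, "port": port, "weight": 100 - canary_weight},
        {"name": canary, "port": port, "weight": canary_weight},
    ]


if __name__ == "__main__":
    # Hypothetical rollout steps: shift traffic gradually, then promote.
    for step in (10, 25, 50, 100):
        print(step, canary_backend_refs("backend-stable", "backend-canary", 8080, step))
```

Gateway API treats weight as a proportion of the total across a rule's backendRefs, so keeping the two values summing to 100 makes each weight read directly as a percentage.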

5. Observability: Grafana Cloud (Full-Stack)

All observability signals — spanning infrastructure, platform, application, and networking layers — are published to Grafana Cloud, providing a single unified observability picture.

| Layer | Signal | Collection mechanism |
| --- | --- | --- |
| Infrastructure | Node metrics, resource utilisation | Grafana Alloy (OpenShift node exporters) |
| Platform | PostgreSQL metrics | Grafana Alloy scraping PostgreSQL views (read-only monitoring role, 30s interval) |
| Application | Traces, metrics, logs | OpenTelemetry SDK (quarkus-opentelemetry), exported via OTLP to Grafana Cloud |
| Networking / Mesh | Envoy access logs, mesh metrics, request traces | OSSM telemetry pipeline → Grafana Alloy → Grafana Cloud |

Mesh observability specifics:

  • Istio's Envoy sidecars emit per-request metrics (latency, error rate, throughput) and access logs

  • Distributed traces span application and mesh layers via W3C TraceContext propagation

  • Grafana Alloy is deployed as a DaemonSet/Deployment on OpenShift, collecting from both the application OTLP endpoint and the OSSM telemetry pipeline
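
Trace continuity between the mesh and application layers relies on the W3C traceparent header, whose version 00 format is version, trace-id (32 hex characters), parent-id (16 hex characters), and trace-flags (2 hex characters), joined by hyphens. The sketch below parses that header and derives a child header that keeps the same trace-id. In practice the OpenTelemetry SDK and Envoy handle this automatically; the function and type names here are hypothetical and the code handles version 00 only.

```python
import re
import secrets
from typing import NamedTuple, Optional

_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


class TraceContext(NamedTuple):
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceContext]:
    """Parse a version 00 traceparent header; return None if it is invalid."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # All-zero ids are invalid per the specification.
    if version != "00" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return TraceContext(trace_id, parent_id, bool(int(flags, 16) & 0x01))


def child_traceparent(parent: TraceContext) -> str:
    """Create the header for a child span: same trace-id, fresh span id."""
    span_id = secrets.token_hex(8)  # 8 random bytes -> 16 hex characters
    flags = "01" if parent.sampled else "00"
    return f"00-{parent.trace_id}-{span_id}-{flags}"


if __name__ == "__main__":
    incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
    ctx = parse_traceparent(incoming)
    if ctx is not None:
        print(child_traceparent(ctx))
```

Because every hop preserves the trace-id, spans emitted by Envoy sidecars and by the application land in the same trace in Grafana Cloud.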

Customer-facing dashboards are served as scoped Grafana embed URLs generated server-side and rendered in KMP clients, consistent with cerebralstratum-backend ADR-0002.

6. Policy Enforcement: Red Hat Connectivity Link (Deferred)

Red Hat Connectivity Link (Kuadrant) is identified as the future mechanism for attaching AuthPolicy, RateLimitPolicy, and TLSPolicy to Gateway and HTTPRoute resources across hub and spoke deployments.

This is deferred to the scaling phase. All Gateway and HTTPRoute resources are defined using Kubernetes Gateway API, which is the attachment surface for Connectivity Link policies. Adding Connectivity Link later is additive with no structural refactoring required.

DNSPolicy (Connectivity Link's DNS management capability) is explicitly not adopted — Cloudflare Load Balancing serves this function (see Section 3).

Consequences

Positive

  • No dependency on cloud provider or on-premises load balancing infrastructure — the platform is portable across any OpenShift environment with outbound internet access

  • Cloudflare's backbone provides globally distributed ingress with DDoS protection and Zero Trust access enforcement as baseline capabilities

  • OSSM mTLS in STRICT mode provides a strong zero-trust posture within the cluster without application-level changes

  • Full-stack observability in Grafana Cloud gives a single pane of glass across all signal types

  • Gateway API-first resource definitions are Connectivity Link-compatible, preserving the path to gateway-level policy enforcement without future refactoring

  • Spoke deployments are operationally consistent with the hub — the Operator provisions the same set of networking resources for each spoke

Negative / Risks

  • cloudflared introduces a runtime dependency on Cloudflare's service availability for all public ingress

  • Argo Rollouts Gateway API plugin maturity requires validation — VirtualService fallback is retained as a mitigation

  • OSSM 3.x (Sail Operator) is a relatively recent transition from the Maistra-based OSSM 2.x — operational runbooks and upgrade paths require validation

  • The mesh observability pipeline adds operational components that must be sized and monitored; the cardinality of Istio metrics at scale requires attention to avoid runaway Grafana Cloud ingestion costs
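
The cardinality risk can be estimated with simple arithmetic: the upper bound on time series for one metric is the product of the distinct values of each label, and ingestion volume scales with series count and scrape frequency. The sketch below shows that calculation. The label counts and the 30-second interval are illustrative assumptions, not measurements from this platform, and real series counts are usually lower because not every label combination occurs.

```python
from math import prod
from typing import Mapping

SECONDS_PER_DAY = 86_400


def estimated_series(label_cardinalities: Mapping[str, int]) -> int:
    """Upper bound on series for one metric: product of distinct label values."""
    return prod(label_cardinalities.values())


def samples_per_day(series: int, scrape_interval_seconds: int) -> int:
    """Samples ingested per day when each series is scraped at a fixed interval."""
    return series * (SECONDS_PER_DAY // scrape_interval_seconds)


if __name__ == "__main__":
    # Hypothetical label cardinalities for a request-count metric.
    labels = {"source_workload": 20, "destination_workload": 20, "response_code": 5}
    series = estimated_series(labels)
    print(series, samples_per_day(series, 30))
```

Under these assumed numbers a single metric yields up to 2,000 series and 5,760,000 samples per day, which is why dropping or aggregating high-cardinality labels before export is worth evaluating.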

Alternatives Considered

| Alternative | Reason not adopted |
| --- | --- |
| Kubernetes Ingress + cloud LB | Introduces cloud provider dependency; not portable across hub and spoke deployments |
| Connectivity Link DNSPolicy for failover | DNS TTL-based failover is slower than Cloudflare Load Balancing's anycast health-check model; redundant given existing Cloudflare commitment |
| Prometheus + self-hosted Grafana for observability | Increases operational burden; Grafana Cloud provides the same capability with managed infrastructure |
| OSSM 2.x (Maistra) | EOL trajectory; OSSM 3.x / Sail Operator aligns with upstream Istio and Red Hat's current supported path |

Last modified: 17 May 2026