ADR-0001: Network Topology and Multi-Region Ingress Strategy
| Field | Value |
|---|---|
| Status | Accepted |
| Migrated from | Project-wide ADR-007 |
Context
CEREBRAL STRATUM is deployed on OpenShift and must satisfy several networking requirements that evolve across its lifecycle:
Public ingress without dependency on cloud provider or on-premises load balancing solutions, supporting both the commercial hub instance and future dedicated spoke deployments
Secure inter-service communication within the platform, with mutual TLS enforced between all components
Progressive delivery support for safe rollout of backend and microservice changes
Multi-region failover as the platform scales beyond a single region
Full-stack observability spanning infrastructure, platform, application, and networking layers, published to a centralised observability platform
Policy enforcement for authentication, rate limiting, and TLS lifecycle management, initially at the application layer with a path to gateway-level enforcement as the platform matures
The platform uses a hub-and-spoke deployment model, where the hub manages billing, identity, and operator functions, and spokes host tenant-isolated device telemetry workloads. The networking layer must support both the single-instance commercial deployment and the future operator-provisioned spoke model.
Decision
1. Public Ingress: Cloudflare Zero Trust + cloudflared
All public-facing services are exposed via Cloudflare Zero Trust using the cloudflared Kubernetes Deployment (tunnel connector), rather than Kubernetes Ingress resources or cloud provider / on-premises load balancers.
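Traffic flow: client → Cloudflare edge (Zero Trust Access, WAF) → cloudflared tunnel → Istio Ingress Gateway → mesh services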
This approach:
Eliminates dependency on cloud provider load balancers or on-premises hardware
Leverages Cloudflare's global anycast backbone for latency and resilience
Provides DDoS protection, WAF, and Zero Trust Access enforcement at the edge
Requires no inbound firewall rules on the OpenShift cluster: all connections are outbound from cloudflared
Supports spoke deployments natively: each spoke's Istio Ingress Gateway is exposed via its own cloudflared tunnel, provisioned by the Kubernetes Operator as part of spoke reconciliation
The cloudflared tunnel terminates at the Istio Ingress Gateway, which is configured using Kubernetes Gateway API resources (Gateway, HTTPRoute, GRPCRoute). This is consistent with OpenShift Service Mesh 3.x's Gateway API-first model.
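As an illustration of the Gateway API resources mentioned above, the following minimal sketch builds a Gateway and an HTTPRoute as plain Python dicts and prints them as JSON, which Kubernetes accepts as a manifest format. The resource names, namespace, hostname, TLS Secret, backend Service, and port are hypothetical placeholders, and the HTTPS listener with TLS termination is an assumption rather than a setting recorded in this ADR.

```python
import json

GATEWAY_API_VERSION = "gateway.networking.k8s.io/v1"


def build_gateway(name: str, namespace: str, hostname: str, tls_secret: str) -> dict:
    """Gateway reconciled by Istio; the cloudflared tunnel would target its Service."""
    return {
        "apiVersion": GATEWAY_API_VERSION,
        "kind": "Gateway",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "gatewayClassName": "istio",
            "listeners": [
                {
                    "name": "https",
                    "hostname": hostname,
                    "port": 443,
                    "protocol": "HTTPS",
                    # Assumption: TLS is terminated at the gateway listener.
                    "tls": {
                        "mode": "Terminate",
                        "certificateRefs": [{"name": tls_secret}],
                    },
                    "allowedRoutes": {"namespaces": {"from": "All"}},
                }
            ],
        },
    }


def build_http_route(name: str, namespace: str, gateway: dict,
                     hostname: str, service: str, port: int) -> dict:
    """HTTPRoute that attaches to the given Gateway and forwards to one Service."""
    return {
        "apiVersion": GATEWAY_API_VERSION,
        "kind": "HTTPRoute",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "parentRefs": [
                {
                    "name": gateway["metadata"]["name"],
                    "namespace": gateway["metadata"]["namespace"],
                }
            ],
            "hostnames": [hostname],
            "rules": [{"backendRefs": [{"name": service, "port": port}]}],
        },
    }


if __name__ == "__main__":
    # All names below are hypothetical.
    gw = build_gateway("hub-ingress", "istio-ingress", "app.example.com", "hub-ingress-tls")
    route = build_http_route("backend", "backend", gw, "app.example.com", "backend", 8080)
    print(json.dumps([gw, route], indent=2))
```

A GRPCRoute for gRPC-capable services would follow the same parentRefs and backendRefs shape.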
2. Service Mesh: OpenShift Service Mesh 3.x (Sail Operator)
OpenShift Service Mesh 3.x, managed by the Sail Operator (upstream Istio), is used for all inter-service communication within the cluster.
Key configuration:
PeerAuthentication set to STRICT mesh-wide: all pod-to-pod communication requires mTLS; no plaintext traffic is accepted within the mesh
Envoy sidecar injection enabled for all application namespaces
The Istio Ingress Gateway acts as the mTLS termination boundary for inbound tunnel traffic from cloudflared
GRPCRoute resources (or HTTPRoute with appProtocol: h2) are used for gRPC-capable services (e.g. the primary backend)
OSSM 3.x provides the Envoy-based data plane from which mesh telemetry (traces, metrics, access logs) is collected and forwarded to Grafana Cloud.
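A mesh-wide STRICT policy is conventionally expressed as a PeerAuthentication named default, with no workload selector, in the mesh root namespace. The sketch below builds that manifest as a Python dict; the root namespace istio-system is an assumption, since it depends on where the Istio control plane is installed.

```python
import json


def mesh_wide_strict_mtls(root_namespace: str = "istio-system") -> dict:
    """PeerAuthentication that requires mTLS for every workload in the mesh.

    A policy without a selector in the root namespace applies mesh-wide.
    The default root namespace here is an assumption.
    """
    return {
        "apiVersion": "security.istio.io/v1",
        "kind": "PeerAuthentication",
        "metadata": {"name": "default", "namespace": root_namespace},
        "spec": {"mtls": {"mode": "STRICT"}},
    }


if __name__ == "__main__":
    print(json.dumps(mesh_wide_strict_mtls(), indent=2))
```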
3. Multi-Region Failover: Cloudflare Load Balancing
Multi-region failover is handled by Cloudflare Load Balancing (origin pools + health checks), rather than DNS-based failover via a policy controller (e.g. Kuadrant/Connectivity Link DNSPolicy).
Rationale over DNS-based failover:
Cloudflare Load Balancing operates at the anycast + health check layer — failover is faster than DNS TTL propagation allows
Consistent with the existing Cloudflare Zero Trust commitment — the platform is already on Cloudflare's backbone
Avoids introducing a separate DNS management control plane prematurely
Each region exposes a cloudflared tunnel as a Cloudflare origin. Origin pools are grouped by region with health checks against the Istio Ingress Gateway's health endpoint.
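The failover behaviour described above can be pictured with a simplified model: pools are tried in priority order, and traffic goes to the first pool whose health checks pass. This is a conceptual sketch only, not Cloudflare's implementation or API; the OriginPool type, the region names, and the one-healthy-origin threshold are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class OriginPool:
    """One region's pool; each origin stands for a cloudflared tunnel."""
    region: str
    healthy_origins: int
    total_origins: int

    def is_healthy(self, minimum_origins: int = 1) -> bool:
        # A pool counts as healthy when enough origins pass their health check.
        return self.healthy_origins >= minimum_origins


def select_pool(pools: Sequence[OriginPool]) -> Optional[OriginPool]:
    """Return the first healthy pool in priority order, or None if all are down."""
    for pool in pools:
        if pool.is_healthy():
            return pool
    return None


if __name__ == "__main__":
    # Hypothetical regions: the primary is failing its health checks.
    pools = [OriginPool("eu-west", 0, 2), OriginPool("us-east", 2, 2)]
    chosen = select_pool(pools)
    print(chosen.region if chosen else "no healthy pool")
```

Because the decision is driven by health-check state rather than by cached DNS answers, a change in pool health takes effect without waiting for record TTLs to expire.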
For spoke deployments, spoke-specific subdomains are Cloudflare DNS records pointing at the spoke's tunnel. Multi-region failover for spokes is addressed per-spoke as required.
4. Progressive Delivery: Argo Rollouts
Argo Rollouts is used for progressive delivery of application services, integrated with the Istio data plane for traffic splitting.
Canary and blue-green rollout strategies use HTTPRoute (Gateway API) traffic weight management via the Argo Rollouts Gateway API plugin
VirtualService-based traffic splitting is retained where required for Argo Rollouts compatibility during the Gateway API plugin maturation period
Argo Rollouts operates at the HTTPRoute/VirtualService level (per-service traffic splitting), distinct from Cloudflare Load Balancing, which operates at the origin/region level; the two layers do not conflict
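Per-service traffic splitting on an HTTPRoute comes down to adjusting the weight field on its backendRefs. The sketch below shows only that weight arithmetic on a manifest held as a Python dict; it is not the Argo Rollouts plugin's code, and the Service names backend-stable and backend-canary are hypothetical.

```python
import copy


def set_canary_weight(route: dict, stable_service: str,
                      canary_service: str, canary_percent: int) -> dict:
    """Return a copy of the HTTPRoute with stable/canary weights summing to 100."""
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")
    updated = copy.deepcopy(route)  # leave the caller's manifest untouched
    for rule in updated["spec"]["rules"]:
        for ref in rule.get("backendRefs", []):
            if ref["name"] == canary_service:
                ref["weight"] = canary_percent
            elif ref["name"] == stable_service:
                ref["weight"] = 100 - canary_percent
    return updated


if __name__ == "__main__":
    # Hypothetical route with all traffic on the stable Service.
    route = {
        "apiVersion": "gateway.networking.k8s.io/v1",
        "kind": "HTTPRoute",
        "metadata": {"name": "backend", "namespace": "backend"},
        "spec": {
            "rules": [
                {
                    "backendRefs": [
                        {"name": "backend-stable", "port": 8080, "weight": 100},
                        {"name": "backend-canary", "port": 8080, "weight": 0},
                    ]
                }
            ]
        },
    }
    for step in (10, 50, 100):
        refs = set_canary_weight(route, "backend-stable", "backend-canary", step)
        print(step, refs["spec"]["rules"][0]["backendRefs"])
```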
5. Observability: Grafana Cloud (Full-Stack)
All observability signals — spanning infrastructure, platform, application, and networking layers — are published to Grafana Cloud, providing a single unified observability picture.
| Layer | Signal | Collection mechanism |
|---|---|---|
| Infrastructure | Node metrics, resource utilisation | Grafana Alloy (OpenShift node exporters) |
| Platform | PostgreSQL metrics | Grafana Alloy scraping PostgreSQL views (read-only monitoring role, 30s interval) |
| Application | Traces, metrics, logs | OpenTelemetry SDK ( |
| Networking / Mesh | Envoy access logs, mesh metrics, request traces | OSSM telemetry pipeline → Grafana Alloy → Grafana Cloud |
Mesh observability specifics:
Istio's Envoy sidecars emit per-request metrics (latency, error rate, throughput) and access logs
Distributed traces span application and mesh layers via W3C TraceContext propagation
Grafana Alloy is deployed as a DaemonSet/Deployment on OpenShift, collecting from both the application OTLP endpoint and the OSSM telemetry pipeline
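Traces join up across the application and mesh layers because every hop forwards the W3C traceparent header, keeping the trace ID and replacing the parent span ID. The sketch below parses a version 00 header and derives the header a downstream call would carry. It is a simplified illustration (strict version 00 only, no tracestate handling), and in practice the OpenTelemetry SDK and Envoy do this work.

```python
import re
import secrets
from typing import NamedTuple

# version "00" - 32 hex trace-id - 16 hex parent-id - 2 hex trace-flags
_TRACEPARENT = re.compile(r"^(00)-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class TraceParent(NamedTuple):
    version: str
    trace_id: str
    parent_id: str
    flags: str


def parse_traceparent(header: str) -> TraceParent:
    """Parse a version 00 traceparent header, rejecting malformed values."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        raise ValueError(f"malformed traceparent: {header!r}")
    parsed = TraceParent(*match.groups())
    # All-zero IDs are defined as invalid by the specification.
    if set(parsed.trace_id) == {"0"} or set(parsed.parent_id) == {"0"}:
        raise ValueError("trace-id and parent-id must not be all zeros")
    return parsed


def child_traceparent(parent: TraceParent) -> str:
    """Header for an outgoing call: same trace ID and flags, new span ID."""
    new_span_id = secrets.token_hex(8)  # 8 random bytes -> 16 hex characters
    return f"{parent.version}-{parent.trace_id}-{new_span_id}-{parent.flags}"


if __name__ == "__main__":
    incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
    print(child_traceparent(parse_traceparent(incoming)))
```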
Customer-facing dashboards are served as scoped Grafana embed URLs generated server-side and rendered in KMP clients, consistent with cerebralstratum-backend ADR-0002.
6. Gateway-Level Policy Enforcement: Deferred (Red Hat Connectivity Link)
Red Hat Connectivity Link (Kuadrant) is identified as the future mechanism for attaching AuthPolicy, RateLimitPolicy, and TLSPolicy to Gateway and HTTPRoute resources across hub and spoke deployments.
This is deferred to the scaling phase. All Gateway and HTTPRoute resources are defined using Kubernetes Gateway API, which is the attachment surface for Connectivity Link policies. Adding Connectivity Link later is additive with no structural refactoring required.
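The reason adoption is additive can be seen in how policy attachment works: a policy is a separate resource that points at an existing Gateway or HTTPRoute through a targetRef, so the target itself is not edited. The sketch below shows that shape only. The kuadrant.io/v1 API version is an assumption to verify against the Connectivity Link release in use, the policy body is deliberately left to the caller, and all resource names are hypothetical.

```python
import json
from typing import Optional

# Assumption: confirm the API version against the installed Connectivity Link release.
POLICY_API_VERSION = "kuadrant.io/v1"


def attach_policy(kind: str, name: str, target: dict,
                  spec: Optional[dict] = None) -> dict:
    """Build a policy manifest that targets an existing Gateway or HTTPRoute."""
    if target["kind"] not in ("Gateway", "HTTPRoute"):
        raise ValueError("policies attach to Gateway or HTTPRoute resources")
    body = dict(spec or {})  # policy-specific fields are supplied by the caller
    body["targetRef"] = {
        "group": "gateway.networking.k8s.io",
        "kind": target["kind"],
        "name": target["metadata"]["name"],
    }
    return {
        "apiVersion": POLICY_API_VERSION,
        "kind": kind,
        # The policy is placed in the same namespace as its target.
        "metadata": {"name": name, "namespace": target["metadata"]["namespace"]},
        "spec": body,
    }


if __name__ == "__main__":
    # Hypothetical existing route; it is referenced, not modified.
    route = {"kind": "HTTPRoute", "metadata": {"name": "backend", "namespace": "backend"}}
    print(json.dumps(attach_policy("RateLimitPolicy", "backend-limits", route), indent=2))
```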
DNSPolicy (Connectivity Link's DNS management capability) is explicitly not adopted — Cloudflare Load Balancing serves this function (see Section 3).
Consequences
Positive
No dependency on cloud provider or on-premises load balancing infrastructure — the platform is portable across any OpenShift environment with outbound internet access
Cloudflare's backbone provides globally distributed ingress with DDoS protection and Zero Trust access enforcement as baseline capabilities
OSSM mTLS in STRICT mode provides a strong zero-trust posture within the cluster without application-level changes
Full-stack observability in Grafana Cloud gives a single pane of glass across all signal types
Gateway API-first resource definitions are Connectivity Link-compatible, preserving the path to gateway-level policy enforcement without future refactoring
Spoke deployments are operationally consistent with the hub — the Operator provisions the same set of networking resources for each spoke
Negative / Risks
cloudflared introduces a runtime dependency on Cloudflare's service availability for all public ingress
Argo Rollouts Gateway API plugin maturity requires validation; VirtualService fallback is retained as a mitigation
OSSM 3.x (Sail Operator) is a relatively recent transition from the Maistra-based OSSM 2.x; operational runbooks and upgrade paths require validation
Mesh observability pipeline adds operational components that must be sized and monitored; cardinality of Istio metrics at scale requires attention to avoid Grafana Cloud ingestion cost blowout
Alternatives Considered
| Alternative | Reason not adopted |
|---|---|
| Kubernetes Ingress + cloud LB | Introduces cloud provider dependency; not portable across hub and spoke deployments |
| Connectivity Link DNSPolicy (DNS-based failover) | DNS TTL-based failover is slower than Cloudflare Load Balancing's anycast health-check model; redundant given existing Cloudflare commitment |
| Prometheus + self-hosted Grafana for observability | Increases operational burden; Grafana Cloud provides the same capability with managed infrastructure |
| OSSM 2.x (Maistra) | EOL trajectory; OSSM 3.x / Sail Operator aligns with upstream Istio and Red Hat's current supported path |