ADR-0001: Network Topology and Multi-Region Ingress Strategy

Field	Value
Status	Accepted
Migrated from	Project-wide ADR-007

Context

CEREBRAL STRATUM is deployed on OpenShift and must satisfy several networking requirements that evolve across its lifecycle:

Public ingress without dependency on cloud provider or on-premises load balancing solutions, supporting both the commercial hub instance and future dedicated spoke deployments
Secure inter-service communication within the platform, with mutual TLS enforced between all components
Progressive delivery support for safe rollout of backend and microservice changes
Multi-region failover as the platform scales beyond a single region
Full-stack observability spanning infrastructure, platform, application, and networking layers, published to a centralised observability platform
Policy enforcement for authentication, rate limiting, and TLS lifecycle management, initially at the application layer with a path to gateway-level enforcement as the platform matures

The platform uses a hub-and-spoke deployment model, where the hub manages billing, identity, and operator functions, and spokes host tenant-isolated device telemetry workloads. The networking layer must support both the single-instance commercial deployment and the future operator-provisioned spoke model.

Decision

1. Public Ingress: Cloudflare Zero Trust + cloudflared

All public-facing services are exposed via Cloudflare Zero Trust using the cloudflared Kubernetes Deployment (tunnel connector), rather than Kubernetes Ingress resources or cloud provider / on-premises load balancers.

Traffic flow:

Cloudflare Edge (Zero Trust / Access policies)
    → cloudflared tunnel (Kubernetes Deployment, per cluster)
        → OpenShift Service Mesh Gateway (Istio Ingress Gateway)
            → Service Mesh (mTLS, Envoy sidecars)
                → Application services

This approach:

Eliminates dependency on cloud provider load balancers or on-premises hardware
Leverages Cloudflare's global anycast backbone for latency and resilience
Provides DDoS protection, WAF, and Zero Trust Access enforcement at the edge
Requires no inbound firewall rules on the OpenShift cluster — all connections are outbound from cloudflared
Supports spoke deployments natively: each spoke's Istio Ingress Gateway is exposed via its own cloudflared tunnel, provisioned by the Kubernetes Operator as part of spoke reconciliation

cloudflared forwards tunnel traffic to the Istio Ingress Gateway, which is configured using Kubernetes Gateway API resources (Gateway, HTTPRoute, GRPCRoute). This is consistent with the Gateway API-first model of OpenShift Service Mesh 3.x.
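As an illustration of the Gateway API resources named above, the following minimal Python sketch builds a Gateway and an HTTPRoute as plain dictionaries and prints them as JSON manifests. It assumes the `gateway.networking.k8s.io/v1` API version, the `istio` gateway class, and a plaintext HTTP listener on port 80. The resource names, namespaces, hostname and port are hypothetical and do not come from this ADR.

```python
import json

GATEWAY_API_VERSION = "gateway.networking.k8s.io/v1"


def gateway(name: str, namespace: str, port: int = 80) -> dict:
    """Gateway handled by Istio's Gateway API controller (class name "istio")."""
    return {
        "apiVersion": GATEWAY_API_VERSION,
        "kind": "Gateway",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "gatewayClassName": "istio",
            "listeners": [{
                "name": "http",
                "port": port,
                "protocol": "HTTP",
                # Let routes in application namespaces attach to this listener.
                "allowedRoutes": {"namespaces": {"from": "All"}},
            }],
        },
    }


def http_route(name: str, namespace: str, parent: dict, hostname: str,
               service: str, service_port: int) -> dict:
    """HTTPRoute that attaches to `parent` and forwards one hostname to a Service."""
    return {
        "apiVersion": GATEWAY_API_VERSION,
        "kind": "HTTPRoute",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "parentRefs": [{
                "name": parent["metadata"]["name"],
                "namespace": parent["metadata"]["namespace"],
            }],
            "hostnames": [hostname],
            "rules": [{"backendRefs": [{"name": service, "port": service_port}]}],
        },
    }


if __name__ == "__main__":
    # All names below are hypothetical placeholders.
    gw = gateway("public-gateway", "istio-ingress")
    route = http_route("backend", "cerebralstratum", gw,
                       "app.example.com", "backend", 8080)
    print(json.dumps([gw, route], indent=2))
```

A GRPCRoute for a gRPC-capable service would follow the same pattern, with `kind` set to `GRPCRoute`.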

2. Service Mesh: OpenShift Service Mesh 3.x (Sail Operator)

OpenShift Service Mesh 3.x, managed by the Sail Operator (upstream Istio), is used for all inter-service communication within the cluster.

Key configuration:

PeerAuthentication set to STRICT mesh-wide — all pod-to-pod communication requires mTLS; no plaintext traffic is accepted within the mesh
Envoy sidecar injection enabled for all application namespaces
The Istio Ingress Gateway acts as the mTLS termination boundary for inbound tunnel traffic from cloudflared
GRPCRoute resources (or HTTPRoute resources whose backing Service port declares HTTP/2 via appProtocol) are used for gRPC-capable services (e.g. the primary backend)
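The mesh-wide STRICT setting listed above can be sketched as a single PeerAuthentication resource. The Python snippet below builds it as a dictionary and prints a JSON manifest. It assumes the mesh root namespace is `istio-system` and the `security.istio.io/v1` API version. Neither is stated in this ADR, so both should be checked against the actual Sail Operator installation.

```python
import json


def mesh_wide_strict_mtls(root_namespace: str = "istio-system") -> dict:
    """Build a mesh-wide PeerAuthentication that rejects plaintext traffic."""
    return {
        "apiVersion": "security.istio.io/v1",
        "kind": "PeerAuthentication",
        # Assumption: the policy lives in the mesh root namespace.
        "metadata": {"name": "default", "namespace": root_namespace},
        # No selector: in the root namespace the policy applies to every workload.
        "spec": {"mtls": {"mode": "STRICT"}},
    }


if __name__ == "__main__":
    print(json.dumps(mesh_wide_strict_mtls(), indent=2))
```

Because the policy carries no workload selector, individual namespaces do not need their own copy unless they deliberately override the mode.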

OSSM 3.x provides the Envoy-based data plane from which mesh telemetry (traces, metrics, access logs) is collected and forwarded to Grafana Cloud.

3. Multi-Region Failover: Cloudflare Load Balancing

Multi-region failover is handled by Cloudflare Load Balancing (origin pools + health checks), rather than DNS-based failover via a policy controller (e.g. Kuadrant/Connectivity Link DNSPolicy).

Rationale over DNS-based failover:

Cloudflare Load Balancing operates at the anycast + health check layer — failover is faster than DNS TTL propagation allows
Consistent with the existing Cloudflare Zero Trust commitment — the platform is already on Cloudflare's backbone
Avoids introducing a separate DNS management control plane prematurely

Each region exposes a cloudflared tunnel as a Cloudflare origin. Origin pools are grouped by region with health checks against the Istio Ingress Gateway's health endpoint.
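The failover behaviour described here can be modelled in a few lines. The sketch below is a simplified model of priority-ordered origin pools with health checks, not Cloudflare's API or its actual steering algorithm. The `OriginPool` type, the `minimum_origins` threshold and the region names are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class OriginPool:
    """One region's pool of tunnel origins (hypothetical model)."""
    region: str
    healthy_origins: int       # origins currently passing their health check
    minimum_origins: int = 1   # below this count the pool counts as unhealthy

    @property
    def healthy(self) -> bool:
        return self.healthy_origins >= self.minimum_origins


def select_pool(pools_in_priority_order: Sequence[OriginPool]) -> Optional[OriginPool]:
    """Return the first healthy pool, or None when every pool is unhealthy."""
    for pool in pools_in_priority_order:
        if pool.healthy:
            return pool
    return None


if __name__ == "__main__":
    pools = [OriginPool("region-a", healthy_origins=0),
             OriginPool("region-b", healthy_origins=1)]
    chosen = select_pool(pools)
    print(chosen.region if chosen else "no healthy pool")  # prints "region-b"
```

The model omits a fallback pool for the case where every pool is unhealthy; a real configuration would need to decide what happens then.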

For spoke deployments, spoke-specific subdomains are Cloudflare DNS records pointing at the spoke's tunnel. Multi-region failover for spokes is addressed per-spoke as required.

4. Progressive Delivery: Argo Rollouts

Argo Rollouts is used for progressive delivery of application services, integrated with the Istio data plane for traffic splitting.

Canary and blue-green rollout strategies use HTTPRoute (Gateway API) traffic weight management via the Argo Rollouts Gateway API plugin
VirtualService-based traffic splitting is retained where Argo Rollouts compatibility requires it while the Gateway API plugin matures
Argo Rollouts operates at the HTTPRoute/VirtualService level (per-service traffic splitting), distinct from Cloudflare Load Balancing which operates at the origin/region level — the two layers do not conflict
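The per-service weight shifting described above can be illustrated with a minimal sketch. It shows only the arithmetic of splitting 100 weight units between a stable and a canary backend in an HTTPRoute's `backendRefs`; it is not the Argo Rollouts plugin's implementation. The service names, port and step schedule are hypothetical.

```python
from __future__ import annotations


def weighted_backend_refs(stable: str, canary: str, port: int,
                          canary_percent: int) -> list[dict]:
    """Return HTTPRoute backendRefs that send `canary_percent` % to the canary."""
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")
    return [
        {"name": stable, "port": port, "weight": 100 - canary_percent},
        {"name": canary, "port": port, "weight": canary_percent},
    ]


if __name__ == "__main__":
    # Hypothetical canary schedule: 10 % -> 50 % -> 100 %.
    for step in (10, 50, 100):
        refs = weighted_backend_refs("backend-stable", "backend-canary", 8080, step)
        print(step, [(r["name"], r["weight"]) for r in refs])
```

Gateway API treats weights as proportions of their sum, so keeping the two weights at a total of 100 makes each value read directly as a percentage.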

5. Observability: Grafana Cloud (Full-Stack)

All observability signals — spanning infrastructure, platform, application, and networking layers — are published to Grafana Cloud, providing a single unified observability picture.

Layer	Signal	Collection mechanism
Infrastructure	Node metrics, resource utilisation	Grafana Alloy (OpenShift node exporters)
Platform	PostgreSQL metrics	Grafana Alloy scraping PostgreSQL views (read-only monitoring role, 30s interval)
Application	Traces, metrics, logs	OpenTelemetry SDK (`quarkus-opentelemetry`), exported via OTLP to Grafana Cloud
Networking / Mesh	Envoy access logs, mesh metrics, request traces	OSSM telemetry pipeline → Grafana Alloy → Grafana Cloud

Mesh observability specifics:

Istio's Envoy sidecars emit per-request metrics (latency, error rate, throughput) and access logs
Distributed traces span application and mesh layers via W3C TraceContext propagation
Grafana Alloy is deployed as a DaemonSet/Deployment on OpenShift, collecting from both the application OTLP endpoint and the OSSM telemetry pipeline
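To make the W3C TraceContext propagation concrete, the following sketch parses a version `00` `traceparent` header and derives the header a downstream hop would send: the trace ID is preserved and a new span ID takes the parent position. This is a simplified illustration of the header format, not the code path used by Envoy or the OpenTelemetry SDK, and the function names are hypothetical.

```python
import re
import secrets

# version "00" - 32 hex trace-id - 16 hex parent-id - 2 hex trace-flags
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(header: str) -> dict:
    """Split a version-00 traceparent header into its fields."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        raise ValueError(f"not a version-00 traceparent header: {header!r}")
    trace_id, parent_id, flags = match.groups()
    # All-zero IDs are invalid according to the specification.
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        raise ValueError("trace-id and parent-id must not be all zeros")
    return {"trace_id": trace_id, "parent_id": parent_id, "flags": flags}


def child_traceparent(header: str) -> str:
    """Header for the next hop: same trace-id and flags, fresh span id."""
    ctx = parse_traceparent(header)
    new_span_id = secrets.token_hex(8)  # 8 random bytes -> 16 hex characters
    return f"00-{ctx['trace_id']}-{new_span_id}-{ctx['flags']}"


if __name__ == "__main__":
    incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
    print(child_traceparent(incoming))
```

Because every hop keeps the same trace ID, spans emitted by the sidecars and by the application can be joined into one trace in the backend.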

Customer-facing dashboards are served as scoped Grafana embed URLs generated server-side and rendered in KMP clients, consistent with cerebralstratum-backend ADR-0002.

6. Gateway-Level Policy Enforcement: Deferred (Red Hat Connectivity Link)

Red Hat Connectivity Link (Kuadrant) is identified as the future mechanism for attaching AuthPolicy, RateLimitPolicy, and TLSPolicy to Gateway and HTTPRoute resources across hub and spoke deployments.

This is deferred to the scaling phase. All Gateway and HTTPRoute resources are defined using Kubernetes Gateway API, which is the attachment surface for Connectivity Link policies, so Connectivity Link can be introduced later as an additive change with no structural refactoring.

DNSPolicy (Connectivity Link's DNS management capability) is explicitly not adopted — Cloudflare Load Balancing serves this function (see Section 3).

Consequences

Positive

No dependency on cloud provider or on-premises load balancing infrastructure — the platform is portable across any OpenShift environment with outbound internet access
Cloudflare's backbone provides globally distributed ingress with DDoS protection and Zero Trust access enforcement as baseline capabilities
OSSM mTLS in STRICT mode provides a strong zero-trust posture within the cluster without application-level changes
Full-stack observability in Grafana Cloud gives a single pane of glass across all signal types
Gateway API-first resource definitions are Connectivity Link-compatible, preserving the path to gateway-level policy enforcement without future refactoring
Spoke deployments are operationally consistent with the hub — the Operator provisions the same set of networking resources for each spoke

Negative / Risks

cloudflared introduces a runtime dependency on Cloudflare's service availability for all public ingress
Argo Rollouts Gateway API plugin maturity requires validation — VirtualService fallback is retained as a mitigation
OSSM 3.x (Sail Operator) is a relatively recent transition from the Maistra-based OSSM 2.x — operational runbooks and upgrade paths require validation
Mesh observability pipeline adds operational components that must be sized and monitored; the cardinality of Istio metrics at scale must be managed to avoid runaway Grafana Cloud ingestion costs
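The cardinality risk can be made tangible with a rough upper-bound calculation: the number of time series for one metric is at most the product of the distinct values of each label. The sketch below uses purely hypothetical label counts and bucket count, not measurements from this platform.

```python
from __future__ import annotations

from math import prod


def series_upper_bound(label_cardinalities: dict[str, int]) -> int:
    """Upper bound on series for one metric: product of distinct label values."""
    return prod(label_cardinalities.values())


def histogram_series(label_combinations: int, buckets: int) -> int:
    """A Prometheus-style histogram adds a _sum and a _count series per combination."""
    return label_combinations * (buckets + 2)


if __name__ == "__main__":
    # Hypothetical label cardinalities for a request metric.
    labels = {"source_workload": 20, "destination_workload": 20, "response_code": 5}
    combos = series_upper_bound(labels)
    print(combos)                        # 2000
    print(histogram_series(combos, 20))  # 44000
```

The bound is pessimistic, since not every workload talks to every other one, but it shows why dropping or aggregating a single high-cardinality label can reduce ingestion volume substantially.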

Alternatives Considered

Alternative	Reason not adopted
Kubernetes Ingress + cloud LB	Introduces cloud provider dependency; not portable across hub and spoke deployments
Connectivity Link `DNSPolicy` for failover	DNS TTL-based failover is slower than Cloudflare Load Balancing's anycast health-check model; redundant given existing Cloudflare commitment
Prometheus + self-hosted Grafana for observability	Increases operational burden; Grafana Cloud provides the same capability with managed infrastructure
OSSM 2.x (Maistra)	EOL trajectory; OSSM 3.x / Sail Operator aligns with upstream Istio and Red Hat's current supported path

Last modified: 17 May 2026