
ADR-0003: Identity Infrastructure Placement — Keycloak on ECS with RDS (PostgreSQL)

Field          Value
Status         Proposed
Date           2026-05-03
Author         Alex Henshaw
Relates to     cerebralstratum-backend ADR-0001 (IAM & Device Registration), cerebralstratum ADR-0002 (CI/CD)
Migrated from  Project-wide ADR-008 (Identity Infrastructure Placement)

Context

CEREBRAL STRATUM requires a robust, available identity provider (IdP) for two distinct concerns:

  1. Platform IAM — Keycloak (Red Hat Build of Keycloak / RHBK) serves as the OIDC/OAuth2 authority for all platform services, device authentication via private_key_jwt, and Cloudflare Zero Trust identity brokering (cerebralstratum-backend ADR-0001).

  2. Internal directory — Red Hat Identity Management (IdM) serves as the LDAP directory for internal user and host management, running on a reserved EC2 instance.

The current state is:

  • IdM: Running natively on a reserved EC2 instance (t3.medium or equivalent) in the ap-southeast-2 (Sydney) region.

  • Keycloak + PostgreSQL: Running as containers on the same EC2 instance as IdM, using a containerised PostgreSQL instance for the Keycloak database.

  • LDAP federation: Keycloak federates users from IdM via LDAP, providing a clean separation between directory (IdM) and OIDC broker (Keycloak).

This arrangement was pragmatic during the bootstrapping phase but introduces several concerns as Keycloak becomes a critical-path dependency for Cloudflare Zero Trust Access:

  • Single point of failure: A container crash, OOM event, or EC2 instance issue takes down Keycloak, blocking all Zero Trust-protected service access.

  • Ephemeral database risk: The containerised PostgreSQL has no managed backup, snapshot, or failover capability. The Keycloak database is the authoritative store for realm configuration, client registrations, users, and role mappings.

  • Migration cost: Moving Keycloak to ROSA HCP (the long-term production platform) will require a database migration. Establishing RDS now avoids a second migration later.

  • Operational boundary: Running Keycloak and IdM on the same instance couples two services with different lifecycle and update cadences.

The long-term target is ROSA HCP with the RHBK Operator managing Keycloak. However, ROSA HCP clusters are currently decommissioned due to cost constraints during the bootstrapping phase. A pragmatic intermediate architecture is required that:

  • Improves availability and recoverability of Keycloak.

  • Positions the Keycloak database for reuse when ROSA HCP is reinstated.

  • Operates within tight bootstrapping budget constraints.

  • Aligns with AWS Cloud Financial Management (CFM) Technical Implementation Playbooks (TIPs) cost optimisation guidance.

Decision

Migrate Keycloak from the shared EC2 instance to Amazon ECS (Fargate), backed by Amazon RDS for PostgreSQL (Single-AZ), within the existing AWS account.

The containerised Keycloak workload moves to ECS Fargate for managed container lifecycle. The containerised PostgreSQL instance is replaced by RDS for PostgreSQL, providing managed backups, point-in-time recovery, and a durable database that survives the eventual migration to ROSA HCP without requiring a data migration.

IdM remains on the existing EC2 reserved instance. The Keycloak → IdM LDAP federation is preserved via VPC-internal networking.

Architecture

Component Overview

┌──────────────────────────────────────────────────────────────────┐
│                             AWS VPC                              │
│                                                                  │
│  ┌──────────────────┐      ┌─────────────────────────────────┐   │
│  │  EC2 (Reserved)  │      │      ECS Cluster (Fargate)      │   │
│  │                  │      │                                 │   │
│  │  Red Hat IdM     │◄────►│  ┌───────────────────────────┐  │   │
│  │  (LDAP/Kerberos) │ LDAP │  │         ECS Task          │  │   │
│  │                  │      │  │                           │  │   │
│  └──────────────────┘      │  │  ┌─────────────────────┐  │  │   │
│                            │  │  │  Keycloak (RHBK)    │  │  │   │
│                            │  │  │  :8080              │  │  │   │
│                            │  │  └──────────┬──────────┘  │  │   │
│                            │  │             │ localhost   │  │   │
│                            │  │  ┌──────────▼──────────┐  │  │   │
│                            │  │  │ cloudflared sidecar │  │  │   │
│                            │  │  │ (tunnel daemon)     │  │  │   │
│                            │  │  └──────────┬──────────┘  │  │   │
│                            │  └─────────────┼─────────────┘  │   │
│                            └────────────────┼────────────────┘   │
│                                             │ JDBC               │
│                             ┌───────────────▼──────────────┐     │
│                             │  RDS PostgreSQL              │     │
│                             │  db.t4g.micro (Single-AZ)    │     │
│                             │  Keycloak DB                 │     │
│                             └──────────────────────────────┘     │
└──────────────────────────────────────────────────────────────────┘
                                 │ Outbound HTTPS (443)
                                 │ Cloudflare Tunnel
                                 ▼
                      ┌─────────────────────┐
                      │ Cloudflare Network  │
                      │ Zero Trust Access   │
                      │ OIDC endpoints      │
                      └─────────────────────┘

ECS Task Definition (Keycloak)

The ECS task runs two containers: the Keycloak (RHBK) application container and a cloudflared sidecar that maintains the outbound Cloudflare tunnel. cloudflared connects outbound to the Cloudflare network and proxies inbound requests to Keycloak on localhost:8080. No inbound security group rules or load balancer are required.

family: keycloak
cpu: 512        # 0.5 vCPU; scale to 1024 under load (Fargate then requires memory >= 2048)
memory: 1024    # 1GB total: Keycloak ~768MB, cloudflared ~64MB
networkMode: awsvpc
requiresCompatibilities: [FARGATE]
runtimePlatform:
  cpuArchitecture: ARM64    # Graviton: ~20% cost reduction vs x86
  operatingSystemFamily: LINUX
containerDefinitions:
  - name: keycloak
    image: registry.redhat.io/rhbk/keycloak-rhel9:latest
    essential: true
    portMappings:
      - containerPort: 8080
        protocol: tcp
    environment:
      - name: KC_DB
        value: postgres
      - name: KC_DB_URL
        value: jdbc:postgresql://<rds-endpoint>:5432/keycloak
      - name: KC_HOSTNAME
        value: auth.blueguardian.co
      - name: KC_HTTP_ENABLED
        value: "true"    # TLS terminated at Cloudflare edge
      - name: KC_PROXY
        value: edge
    secrets:
      - name: KC_DB_USERNAME
        valueFrom: arn:aws:secretsmanager:...:keycloak-db-username
      - name: KC_DB_PASSWORD
        valueFrom: arn:aws:secretsmanager:...:keycloak-db-password
    logConfiguration:
      logDriver: awslogs
      options:
        awslogs-group: /ecs/keycloak
        awslogs-region: ap-southeast-2
        awslogs-stream-prefix: keycloak
  - name: cloudflared
    image: cloudflare/cloudflared:latest
    essential: true    # if tunnel dies, restart the whole task
    # ECS does not expand $(VAR) in command arrays; cloudflared reads the
    # token from the TUNNEL_TOKEN environment variable injected below.
    command: [tunnel, --no-autoupdate, run]
    secrets:
      - name: TUNNEL_TOKEN
        valueFrom: arn:aws:secretsmanager:...:cloudflared-tunnel-token
    logConfiguration:
      logDriver: awslogs
      options:
        awslogs-group: /ecs/keycloak
        awslogs-region: ap-southeast-2
        awslogs-stream-prefix: cloudflared

Both containers are marked essential: true. If either container exits, ECS replaces the entire task.
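The sizing comments in the task definition can be sanity-checked with a small script. The following is a minimal sketch, not part of the deployment tooling: the memory table covers only the Fargate task sizes relevant to this ADR, and the function names are hypothetical.

```python
# Fargate task sizes relevant here: CPU units -> allowed task memory in MiB.
# Only the three smallest CPU tiers are listed.
FARGATE_MEMORY_MIB = {
    256: [512, 1024, 2048],
    512: [1024, 2048, 3072, 4096],
    1024: [2048, 3072, 4096, 5120, 6144, 7168, 8192],
}


def is_valid_fargate_size(cpu: int, memory: int) -> bool:
    """Return True if (cpu, memory) is one of the combinations listed above."""
    return memory in FARGATE_MEMORY_MIB.get(cpu, [])


def headroom_mib(task_memory: int, container_estimates: dict) -> int:
    """Task memory left after subtracting the per-container estimates."""
    return task_memory - sum(container_estimates.values())


# Estimates taken from the comments in the task definition.
estimates = {"keycloak": 768, "cloudflared": 64}

print(is_valid_fargate_size(512, 1024))   # the size chosen in this ADR
print(is_valid_fargate_size(1024, 1024))  # CPU scaled up, memory unchanged
print(headroom_mib(1024, estimates))      # MiB not covered by either estimate
```

The second check is the one worth noting: scaling the task to 1024 CPU units also means raising memory to at least 2048 MiB, which affects the cost estimate for the scaled-up case.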

RDS Instance

Engine:              PostgreSQL 16
Instance class:      db.t4g.micro (Graviton2, 2 vCPU, 1GB RAM)
Storage:             gp3, 20GB
Multi-AZ:            No (Single-AZ; see trade-offs)
Backup retention:    7 days automated snapshots
Deletion protection: Enabled
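For illustration, the same specification can be expressed as the keyword arguments of boto3's `rds.create_db_instance` call. This is a sketch only: the resources in this ADR are tagged as Terraform-managed, so the real definition would live there. The identifier, security group, subnet group and master username are hypothetical inputs, and letting RDS manage the master password is an assumption rather than something this ADR specifies.

```python
def keycloak_rds_params(identifier: str, master_username: str,
                        security_group_id: str, subnet_group: str) -> dict:
    """Build keyword arguments for boto3's rds.create_db_instance."""
    return {
        "DBInstanceIdentifier": identifier,
        "Engine": "postgres",
        # Major version from the ADR; a specific minor version may be required.
        "EngineVersion": "16",
        "DBInstanceClass": "db.t4g.micro",
        "AllocatedStorage": 20,              # GiB
        "StorageType": "gp3",
        "MultiAZ": False,                    # Single-AZ, see trade-offs
        "BackupRetentionPeriod": 7,          # days of automated backups
        "DeletionProtection": True,
        "PubliclyAccessible": False,         # private subnet only
        "EnablePerformanceInsights": True,
        "PerformanceInsightsRetentionPeriod": 7,
        "VpcSecurityGroupIds": [security_group_id],
        "DBSubnetGroupName": subnet_group,
        "MasterUsername": master_username,
        # Assumption: RDS stores the master password in Secrets Manager.
        "ManageMasterUserPassword": True,
    }


# With boto3 installed and credentials configured, the call would be:
#   boto3.client("rds").create_db_instance(**keycloak_rds_params(...))
```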

Networking

  • ECS task and RDS instance are in a private subnet; neither has a public IP.

  • No inbound security group rules are required on the Keycloak task; cloudflared connects outbound only.

  • Security group — Keycloak task: Outbound 443 (Cloudflare tunnel), 5432 (RDS), 389/636 (IdM LDAP).

  • Security group — RDS: Inbound 5432 from Keycloak task security group only.

  • Cloudflare tunnel ingress rules:

    Path prefix      Behaviour
    /realms/         Public: OIDC discovery, token endpoints
    /protocol/       Public: OIDC/SAML protocol endpoints
    /.well-known/    Public: OIDC discovery document
    /admin/          Zero Trust Access policy required (platform-admin realm role)
    / (catch-all)    Block
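The intended effect of the ingress rules can be modelled as first-match prefix routing. The sketch below is a model of the table only, not Cloudflare's configuration syntax; the behaviour labels and the `classify` function are hypothetical names, and the realm `master` in the usage example is just Keycloak's default realm used for illustration.

```python
# (prefix, behaviour) pairs in table order; the first matching prefix wins.
INGRESS_RULES = [
    ("/realms/", "public"),
    ("/protocol/", "public"),
    ("/.well-known/", "public"),
    ("/admin/", "access-policy"),
    ("/", "block"),  # catch-all
]


def classify(path: str) -> str:
    """Return the behaviour of the first rule whose prefix matches path."""
    for prefix, behaviour in INGRESS_RULES:
        if path.startswith(prefix):
            return behaviour
    return "block"  # paths that do not even start with "/"


for path in (
    "/realms/master/protocol/openid-connect/token",
    "/admin/master/console/",
    "/metrics",
):
    print(path, "->", classify(path))
```

Because the catch-all blocks everything else, any Keycloak path that must be reachable has to be covered by one of the earlier prefixes.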

CFM TIPs Cost Optimisation Alignment

This decision applies guidance from the AWS Cloud Financial Management Technical Implementation Playbooks (CFM TIPs) across the four CFM pillars.

Tagging Strategy

All resources carry consistent cost allocation tags:

Project:     cerebral-stratum
Component:   identity
Environment: bootstrap
ManagedBy:   terraform
Owner:       [email protected]
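A tag set like this is easy to check mechanically before resources are created. The following is a minimal sketch with a hypothetical `tag_problems` helper; it pins the four values given above and only requires that an `Owner` tag is present, since the owner address is not reproduced here.

```python
# Tag values taken from the tagging strategy above.
EXPECTED_VALUES = {
    "Project": "cerebral-stratum",
    "Component": "identity",
    "Environment": "bootstrap",
    "ManagedBy": "terraform",
}
REQUIRED_KEYS = set(EXPECTED_VALUES) | {"Owner"}


def tag_problems(tags: dict) -> list:
    """Return human-readable problems; an empty list means the tags pass."""
    problems = [f"missing tag: {key}" for key in sorted(REQUIRED_KEYS - set(tags))]
    problems += [
        f"unexpected value for {key}: {tags[key]!r}"
        for key, expected in EXPECTED_VALUES.items()
        if key in tags and tags[key] != expected
    ]
    return problems


print(tag_problems({"Project": "cerebral-stratum", "Component": "identity"}))
```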

ECS / Fargate Cost Optimisation

  • Graviton (ARM64) task runtime: ~20% cost reduction vs x86 Fargate.

  • Right-sized task (0.5 vCPU / 1GB RAM): Compute Optimizer review after 30 days before committing to Savings Plans.

  • Compute Savings Plan deferred — rightsize first.

  • CloudWatch log retention: 14-day retention policy on /ecs/keycloak log group.

  • No NAT Gateway for ECS → RDS traffic (same VPC). AWS service API calls routed via VPC Interface Endpoints.

RDS Cost Optimisation

  • Graviton instance class (db.t4g.micro): Comparable performance to t3 at lower cost.

  • Single-AZ deployment: Multi-AZ will be revisited pre-revenue.

  • gp3 storage: Up to 20% cost savings vs gp2.

  • RDS Reserved Instance deferred — rightsize after 30 days.

  • Performance Insights (7-day free tier): Enabled for rightsizing decisions.

Estimated Monthly Cost (ap-southeast-2, on-demand)

Resource                  Specification                        Estimated Cost (USD/mo)
ECS Fargate               0.5 vCPU / 1GB, ARM64, ~720h         ~$8–10
RDS PostgreSQL            db.t4g.micro, Single-AZ, 20GB gp3    ~$14–16
Secrets Manager           3 secrets + API calls                ~$1
CloudWatch Logs           14-day retention, ~1GB/mo            ~$1–2
VPC Interface Endpoints   Secrets Manager, CloudWatch          ~$7
Total                                                          ~$31–36/mo
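The total can be reproduced by summing the low and high ends of each line item. This is a small arithmetic check rather than a pricing model: the figures are copied from the table, and the Secrets Manager line is rounded to $1, which is an assumption made so that every row has a number.

```python
# (low, high) USD per month for each line item in the table.
LINE_ITEMS = {
    "ECS Fargate": (8, 10),
    "RDS PostgreSQL": (14, 16),
    "Secrets Manager": (1, 1),   # assumption: rounded to $1
    "CloudWatch Logs": (1, 2),
    "VPC Interface Endpoints": (7, 7),
}


def total_range(items: dict) -> tuple:
    """Sum the low and high estimates across all line items."""
    low = sum(lo for lo, _ in items.values())
    high = sum(hi for _, hi in items.values())
    return low, high


print(total_range(LINE_ITEMS))  # (31, 36)
```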

Migration Plan

Phase 1: RDS provisioning

  1. Provision RDS db.t4g.micro PostgreSQL 16, Single-AZ, in the existing VPC private subnet.

  2. Create keycloak database and user with least-privilege grants.

  3. Store credentials in AWS Secrets Manager.

  4. Validate connectivity from a temporary task or EC2 bastion.
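Step 4 can be as simple as a TCP reachability test run from the temporary task or bastion. The sketch below uses only the Python standard library; `can_connect` is a hypothetical helper, and the endpoint passed to it would be the RDS endpoint, which is not reproduced here. It confirms that security groups and routing allow the connection, not that the credentials work.

```python
import socket


def can_connect(host: str, port: int = 5432, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port opens within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS failures.
        return False
```

A follow-up login with `psql` using the stored credentials would cover the authentication half of the check.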

Phase 2: Database migration

# Export from existing containerised PostgreSQL
pg_dump -h localhost -U keycloak -d keycloak -Fc -f keycloak_bootstrap.dump

# Restore into RDS
pg_restore -h <rds-endpoint> -U keycloak -d keycloak keycloak_bootstrap.dump

Validate realm configuration, client registrations, and user federation settings.
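One way to support this validation is to compare per-table row counts between the source and the restored database. The sketch below assumes the counts have already been collected (for example with `SELECT count(*)` against each side). The table names come from Keycloak's schema as an assumption about which tables matter most, and the numbers in the usage example are purely illustrative.

```python
def count_mismatches(source: dict, target: dict) -> dict:
    """Map each table whose counts differ to (source_count, target_count).

    None marks a table that is missing on one side.
    """
    tables = sorted(set(source) | set(target))
    return {
        table: (source.get(table), target.get(table))
        for table in tables
        if source.get(table) != target.get(table)
    }


# Illustrative counts only; real values come from the two databases.
source_counts = {"realm": 2, "client": 14, "user_entity": 37, "component": 9}
target_counts = {"realm": 2, "client": 14, "user_entity": 36, "component": 9}

print(count_mismatches(source_counts, target_counts))
```

Matching counts do not prove the data is identical, so the functional checks on realms, clients and federation settings are still needed.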

Phase 3: ECS deployment

  1. Push RHBK image to ECR (or use Red Hat registry with pull credentials in Secrets Manager).

  2. Deploy ECS task definition and ECS service (desired count: 1).

  3. Configure cloudflared to route auth.blueguardian.co → http://localhost:8080.

  4. Validate Cloudflare Zero Trust OIDC flow end-to-end.
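A quick first check for step 4 is to fetch the realm's OIDC discovery document through the tunnel and confirm that the issuer matches the public hostname. The following is a hedged sketch using the standard library; the helper names are hypothetical, the realm name must be supplied by the caller, and the required-key list is a minimal subset of the discovery metadata rather than a full conformance check.

```python
import json
from urllib.request import urlopen

REQUIRED_KEYS = ("issuer", "authorization_endpoint", "token_endpoint", "jwks_uri")


def discovery_url(hostname: str, realm: str) -> str:
    """URL of the realm's OIDC discovery document."""
    return f"https://{hostname}/realms/{realm}/.well-known/openid-configuration"


def discovery_problems(doc: dict, hostname: str, realm: str) -> list:
    """Return problems found in a discovery document; empty means it passes."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in doc]
    expected_issuer = f"https://{hostname}/realms/{realm}"
    if doc.get("issuer") != expected_issuer:
        problems.append(
            f"issuer is {doc.get('issuer')!r}, expected {expected_issuer!r}"
        )
    return problems


def fetch_discovery(hostname: str, realm: str, timeout: float = 10.0) -> dict:
    """Fetch and parse the discovery document over HTTPS."""
    with urlopen(discovery_url(hostname, realm), timeout=timeout) as response:
        return json.load(response)
```

An issuer that starts with `http://` or carries an internal hostname usually points at the hostname or proxy settings of the Keycloak container rather than at the tunnel.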

Phase 4: Cutover and decommission

  1. Validate all Zero Trust Access Applications authenticate correctly.

  2. Validate device private_key_jwt authentication flows (cerebralstratum-backend ADR-0001).

  3. Remove Keycloak and containerised PostgreSQL from the EC2 instance.

  4. Evaluate whether IdM EC2 instance can be right-sized downward.
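When validating the device private_key_jwt flow in step 2, it helps to know what a client assertion is expected to contain. The sketch below builds the claim set and the token request form fields as described by the OAuth 2.0 JWT client authentication profile. It deliberately leaves out signing, which needs a JWT library and the device's private key. Using the token endpoint as the audience and `client_credentials` as the grant type are assumptions about how devices are configured, not details taken from ADR-0001.

```python
import time
import uuid
from typing import Optional

ASSERTION_TYPE = "urn:ietf:params:oauth:client-assertion-type:jwt-bearer"


def client_assertion_claims(client_id: str, token_endpoint: str,
                            lifetime_s: int = 60,
                            now: Optional[int] = None) -> dict:
    """Claims for a private_key_jwt client assertion (unsigned)."""
    issued_at = int(time.time()) if now is None else now
    return {
        "iss": client_id,           # issuer and subject are both the client
        "sub": client_id,
        "aud": token_endpoint,      # assumption: audience is the token endpoint
        "jti": str(uuid.uuid4()),   # unique per assertion to prevent replay
        "iat": issued_at,
        "exp": issued_at + lifetime_s,
    }


def token_request_form(signed_jwt: str) -> dict:
    """Form fields for the token request carrying the signed assertion."""
    return {
        "grant_type": "client_credentials",  # assumption about the device flow
        "client_assertion_type": ASSERTION_TYPE,
        "client_assertion": signed_jwt,
    }
```

After cutover the audience must reference the public hostname, so assertions built against the old EC2-hosted endpoint are a likely source of failures during this step.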

Rollback: Revert cloudflared tunnel configuration to point at the EC2-hosted Keycloak. Containerised PostgreSQL on EC2 is not decommissioned until Phase 4 validation is complete.

Long-Term Migration Path (ROSA HCP)

When ROSA HCP clusters are reinstated:

Current (this ADR):          Future (ROSA HCP):
ECS Fargate → Keycloak       RHBK Operator (OpenShift) → Keycloak
RDS PostgreSQL               RDS PostgreSQL (same instance, no data migration)
                             OR Crunchy Postgres Operator (one-time migration)

The RHBK Operator on OpenShift supports external database configuration, meaning the RDS instance provisioned by this ADR can be pointed at directly from the operator-managed Keycloak instance. This is the primary cost justification for RDS over containerised PostgreSQL.

Alternatives Considered

Harden in place (EC2 containerised PostgreSQL + Keycloak): Rejected — retains unmanaged database risk, no backup/PITR, couples Keycloak and IdM lifecycles, and requires a full database migration when ROSA is reinstated regardless.

ECS Fargate + containerised PostgreSQL (no RDS): Rejected — retains the unmanaged database risk. The marginal cost of RDS (~$14–16/mo for db.t4g.micro) buys managed backups, PITR, and migration continuity.

ECS Fargate + RDS Multi-AZ: Rejected for bootstrapping phase — roughly doubles RDS cost. Forward item for pre-revenue launch.

EC2 + RDS (dedicated EC2 for Keycloak): Rejected — worse cost efficiency than Fargate at this task size.

ROSA HCP immediately: Rejected — current budget constraints make this infeasible during bootstrapping.

Consequences

Positive

  • Keycloak database is now managed (automated backups, PITR, deletion protection) — the single largest operational risk is eliminated.

  • ECS Fargate provides automatic task restart on failure.

  • RDS survives the eventual ROSA HCP migration without a data migration.

  • LDAP federation to IdM is preserved without architectural change.

  • Cost is predictable and within bootstrapping budget.

  • Architecture aligns with CFM TIPs guidance: Graviton, rightsizing before commitment, gp3, VPC endpoints, tagging, budget alerting.

Negative / Trade-offs

  • VPC endpoints add ~$7/mo at low usage — disproportionate at bootstrapping scale.

  • Single-AZ RDS: database-layer recovery time is minutes, not seconds.

  • ECS Fargate cold start adds ~30–60 seconds to Keycloak recovery after task failure.

  • Adds ECS and RDS operational surface to the AWS account.

Open Items

Item                                                                 Target
Evaluate Multi-AZ RDS before first paying customer                   Pre-revenue launch
Apply Compute Savings Plan after 30 days Fargate utilisation data    T+30 days
Apply RDS Reserved Instance after 30 days RDS utilisation data       T+30 days
Evaluate VPC endpoint cost vs NAT Gateway at bootstrapping budget    T+7 days post-deploy
Define RHBK Operator external DB configuration for ROSA migration    ADR for ROSA reinstatement
Right-size IdM EC2 instance post Keycloak removal                    Phase 4 cutover

Last modified: 17 May 2026