ADR-0003: Identity Infrastructure Placement — Keycloak on ECS with RDS (PostgreSQL)
Field | Value |
|---|---|
Status | Proposed |
Date | 2026-05-03 |
Author | Alex Henshaw |
Relates to | |
Migrated from | Project-wide ADR-008 (Identity Infrastructure Placement) |
Context
CEREBRAL STRATUM requires a robust, available identity provider (IdP) for two distinct concerns:
Platform IAM: Keycloak (Red Hat Build of Keycloak / RHBK) serves as the OIDC/OAuth2 authority for all platform services, device authentication via private_key_jwt, and Cloudflare Zero Trust identity brokering (cerebralstratum-backend ADR-0001).
Internal directory: Red Hat Identity Management (IdM) serves as the LDAP directory for internal user and host management, running on a reserved EC2 instance.
The current state is:
IdM: Running natively on a reserved EC2 instance (t3.medium or equivalent), Canberra region.
Keycloak + PostgreSQL: Running as containers on the same EC2 instance as IdM, using a containerised PostgreSQL instance for the Keycloak database.
LDAP federation: Keycloak federates users from IdM via LDAP, providing a clean separation between directory (IdM) and OIDC broker (Keycloak).
This arrangement was pragmatic during the bootstrapping phase but introduces several concerns as Keycloak becomes a critical-path dependency for Cloudflare Zero Trust Access:
Single point of failure: A container crash, OOM event, or EC2 instance issue takes down Keycloak, blocking all Zero Trust-protected service access.
Ephemeral database risk: The containerised PostgreSQL has no managed backup, snapshot, or failover capability. The Keycloak database is the authoritative store for realm configuration, client registrations, users, and role mappings.
Migration cost: Moving Keycloak to ROSA HCP (the long-term production platform) will require a database migration. Establishing RDS now avoids a second migration later.
Operational boundary: Running Keycloak and IdM on the same instance couples two services with different lifecycle and update cadences.
The long-term target is ROSA HCP with the RHBK Operator managing Keycloak. However, ROSA HCP clusters are currently decommissioned due to cost constraints during the bootstrapping phase. A pragmatic intermediate architecture is required that:
Improves availability and recoverability of Keycloak.
Positions the Keycloak database for reuse when ROSA HCP is reinstated.
Operates within tight bootstrapping budget constraints.
Aligns with AWS Cloud Financial Management (CFM) Technical Implementation Playbooks (TIPs) cost optimisation guidance.
Decision
Migrate Keycloak from the shared EC2 instance to Amazon ECS (Fargate), backed by Amazon RDS for PostgreSQL (Single-AZ), within the existing AWS account.
The containerised Keycloak workload moves to ECS Fargate for managed container lifecycle. The containerised PostgreSQL instance is replaced by RDS for PostgreSQL, providing managed backups, point-in-time recovery, and a durable database that survives the eventual migration to ROSA HCP without requiring a data migration.
IdM remains on the existing EC2 reserved instance. The Keycloak → IdM LDAP federation is preserved via VPC-internal networking.
Architecture
Component Overview
ECS Task Definition (Keycloak)
The ECS task runs two containers: the Keycloak (RHBK) application container and a cloudflared sidecar that maintains the outbound Cloudflare tunnel. cloudflared connects outbound to the Cloudflare network and proxies inbound requests to Keycloak on localhost:8080. No inbound security group rules or load balancer are required.
Both containers are marked essential: true. If either container exits, ECS replaces the entire task.
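The two-container task described above could be expressed roughly as follows. This is a minimal sketch, not the project's actual task definition: the image URIs, the `keycloak` family name, and the log stream prefixes are hypothetical placeholders, while the 0.5 vCPU / 1 GB sizing, ARM64 runtime, port 8080, and `/ecs/keycloak` log group are taken from elsewhere in this ADR.

```python
import json

# Hypothetical placeholders: image URIs and the family name are illustrative
# and do not come from the ADR.
KEYCLOAK_IMAGE = "<account-id>.dkr.ecr.ap-southeast-2.amazonaws.com/rhbk:<tag>"
CLOUDFLARED_IMAGE = "cloudflare/cloudflared:<tag>"
LOG_GROUP = "/ecs/keycloak"  # log group named in this ADR
REGION = "ap-southeast-2"    # region used in the cost estimate


def container(name, image, port=None):
    """Build one container definition; every container is marked essential."""
    definition = {
        "name": name,
        "image": image,
        "essential": True,  # if this container exits, ECS stops and replaces the task
        "logConfiguration": {
            "logDriver": "awslogs",
            "options": {
                "awslogs-group": LOG_GROUP,
                "awslogs-region": REGION,
                "awslogs-stream-prefix": name,
            },
        },
    }
    if port is not None:
        definition["portMappings"] = [{"containerPort": port, "protocol": "tcp"}]
    return definition


def keycloak_task_definition():
    """Return a Fargate task definition with Keycloak and a cloudflared sidecar."""
    return {
        "family": "keycloak",
        "requiresCompatibilities": ["FARGATE"],
        "networkMode": "awsvpc",  # containers in one task share localhost
        "cpu": "512",             # 0.5 vCPU
        "memory": "1024",         # 1 GB
        "runtimePlatform": {
            "cpuArchitecture": "ARM64",
            "operatingSystemFamily": "LINUX",
        },
        "containerDefinitions": [
            container("keycloak", KEYCLOAK_IMAGE, port=8080),
            container("cloudflared", CLOUDFLARED_IMAGE),
        ],
    }


if __name__ == "__main__":
    print(json.dumps(keycloak_task_definition(), indent=2))
```

The printed JSON has the shape accepted by ECS task definition registration. Database credentials, the tunnel token, and IAM roles are deliberately omitted and would need to be added from Secrets Manager references.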
RDS Instance
Networking
ECS task and RDS instance are in a private subnet; neither has a public IP.
No inbound security group rules required on the Keycloak task: cloudflared connects outbound only.
Security group (Keycloak task): Outbound 443 (Cloudflare tunnel), 5432 (RDS), 389/636 (IdM LDAP).
Security group — RDS: Inbound 5432 from Keycloak task security group only.
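The security group rules above can be modelled as plain data to make the intent checkable. This is an illustrative sketch only: the peer labels (`cloudflare-tunnel`, `rds-sg`, `idm`, `keycloak-task-sg`) are hypothetical names, not real security group IDs, and the model is not an AWS API call.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    direction: str  # "ingress" or "egress"
    port: int
    peer: str       # hypothetical label for the remote side


# Keycloak task: outbound only, as listed in this ADR.
KEYCLOAK_TASK_SG = [
    Rule("egress", 443, "cloudflare-tunnel"),
    Rule("egress", 5432, "rds-sg"),
    Rule("egress", 389, "idm"),
    Rule("egress", 636, "idm"),
]

# RDS: inbound 5432 from the Keycloak task security group only.
RDS_SG = [Rule("ingress", 5432, "keycloak-task-sg")]


def allows(rules, direction, port, peer):
    """Return True if an exact matching rule exists (default deny otherwise)."""
    return Rule(direction, port, peer) in rules
```

A check such as `allows(RDS_SG, "ingress", 5432, "idm")` returns `False`, which reflects the requirement that only the Keycloak task may reach the database.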
Cloudflare tunnel ingress rules:
Path prefix | Behaviour |
|---|---|
/realms/ | Public: OIDC discovery, token endpoints |
/protocol/ | Public: OIDC/SAML protocol endpoints |
/.well-known/ | Public: OIDC discovery document |
/admin/ | Zero Trust Access policy required (platform-admin realm role) |
/ (catch-all) | Block |
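The ingress rules listed above amount to a first-match prefix lookup with a default of block. The sketch below models that behaviour in plain Python so the rule set can be reasoned about and tested. It is not cloudflared or Cloudflare Access configuration syntax, and the behaviour labels (`public`, `access-policy`, `block`) are names chosen here for illustration.

```python
# Ordered (prefix, behaviour) pairs taken from the ingress rules in this ADR.
INGRESS_RULES = [
    ("/realms/", "public"),
    ("/protocol/", "public"),
    ("/.well-known/", "public"),
    ("/admin/", "access-policy"),  # requires the platform-admin realm role
]


def classify(path: str) -> str:
    """Return the behaviour for a request path; unmatched paths are blocked."""
    for prefix, behaviour in INGRESS_RULES:
        if path.startswith(prefix):
            return behaviour
    return "block"
```

For example, `classify("/realms/master/protocol/openid-connect/token")` yields `"public"`, while `classify("/metrics")` yields `"block"`. Note that under this strict prefix model a request to `/admin` without the trailing slash also falls through to the catch-all.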
CFM TIPs Cost Optimisation Alignment
This decision applies guidance from the AWS Cloud Financial Management Technical Implementation Playbooks (CFM TIPs) across the four CFM pillars.
Tagging Strategy
All resources carry consistent cost allocation tags:
ECS / Fargate Cost Optimisation
Graviton (ARM64) task runtime: ~20% cost reduction vs x86 Fargate.
Right-sized task (0.5 vCPU / 1GB RAM): Compute Optimizer review after 30 days before committing to Savings Plans.
Compute Savings Plan deferred — rightsize first.
CloudWatch log retention: 14-day retention policy on /ecs/keycloak log group.
No NAT Gateway for ECS → RDS traffic (same VPC). AWS service API calls routed via VPC Interface Endpoints.
RDS Cost Optimisation
Graviton instance class (db.t4g.micro): Comparable performance to t3 at lower cost.
Single-AZ deployment: Multi-AZ will be revisited pre-revenue.
gp3 storage: Up to 20% cost savings vs gp2.
RDS Reserved Instance deferred — rightsize after 30 days.
Performance Insights (7-day free tier): Enabled for rightsizing decisions.
Estimated Monthly Cost (ap-southeast-2, on-demand)
Resource | Specification | Estimated Cost (USD/mo) |
|---|---|---|
ECS Fargate | 0.5 vCPU / 1GB, ARM64, ~720h | ~$8–10 |
RDS PostgreSQL | db.t4g.micro, Single-AZ, 20GB gp3 | ~$14–16 |
Secrets Manager | 2 secrets + API calls | <$1 |
CloudWatch Logs | 14-day retention, ~1GB/mo | ~$1–2 |
VPC Interface Endpoints | Secrets Manager, CloudWatch | ~$7 |
Total | | ~$31–36/mo |
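The total can be reproduced by summing the low and high ends of each line item. The figures below are the estimates from the table, not live AWS pricing. One assumption is made: the "<$1" Secrets Manager entry is rounded up to $1 at both ends of the range.

```python
# (low, high) monthly estimates in USD, copied from the table above.
MONTHLY_COSTS = {
    "ECS Fargate": (8, 10),
    "RDS PostgreSQL": (14, 16),
    "Secrets Manager": (1, 1),  # assumption: "<$1" rounded up to $1
    "CloudWatch Logs": (1, 2),
    "VPC Interface Endpoints": (7, 7),
}


def total_range(costs):
    """Sum the low and high bounds across all line items."""
    low = sum(lo for lo, _ in costs.values())
    high = sum(hi for _, hi in costs.values())
    return low, high


print(total_range(MONTHLY_COSTS))  # (31, 36)
```

On these figures RDS accounts for roughly 44 to 45 percent of the monthly total, and the VPC endpoints for about a fifth.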
Migration Plan
Phase 1: RDS provisioning
Provision RDS db.t4g.micro PostgreSQL 16, Single-AZ, in the existing VPC private subnet.
Create keycloak database and user with least-privilege grants.
Store credentials in AWS Secrets Manager.
Validate connectivity from a temporary task or EC2 bastion.
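The ADR does not spell out the grants, so the following is only one possible reading of "least-privilege" for the database bootstrap step. It renders standard PostgreSQL statements without executing them. The function name `bootstrap_sql` is hypothetical, and the password is intentionally left out so that it can be set separately from the Secrets Manager value.

```python
import re

# Accept only simple lowercase identifiers so names can be interpolated safely.
_IDENT = re.compile(r"^[a-z_][a-z0-9_]*$")


def bootstrap_sql(database="keycloak", role="keycloak"):
    """Render SQL for a dedicated Keycloak role and database (assumed grants)."""
    for name in (database, role):
        if not _IDENT.match(name):
            raise ValueError(f"unsafe identifier: {name!r}")
    return [
        # Run while connected as the RDS master user:
        f"CREATE ROLE {role} LOGIN;",
        f"CREATE DATABASE {database};",
        f"REVOKE ALL ON DATABASE {database} FROM PUBLIC;",
        f"GRANT CONNECT, TEMPORARY ON DATABASE {database} TO {role};",
        # Run while connected to the new database; PostgreSQL 15+ no longer
        # lets every role create objects in the public schema by default.
        f"GRANT USAGE, CREATE ON SCHEMA public TO {role};",
    ]


if __name__ == "__main__":
    print("\n".join(bootstrap_sql()))
```

Keeping the database owned by the master user and granting schema privileges explicitly avoids giving the application role ownership of the database itself. Whether Keycloak's schema migrations need anything broader should be confirmed during Phase 2 validation.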
Phase 2: Database migration
Validate realm configuration, client registrations, and user federation settings.
Phase 3: ECS deployment
Push RHBK image to ECR (or use Red Hat registry with pull credentials in Secrets Manager).
Deploy ECS task definition and ECS service (desired count: 1).
Configure cloudflared to route auth.blueguardian.co → http://localhost:8080.
Validate Cloudflare Zero Trust OIDC flow end-to-end.
Phase 4: Cutover and decommission
Validate all Zero Trust Access Applications authenticate correctly.
Validate device private_key_jwt authentication flows (cerebralstratum-backend ADR-0001).
Remove Keycloak and containerised PostgreSQL from the EC2 instance.
Evaluate whether IdM EC2 instance can be right-sized downward.
Rollback: Revert cloudflared tunnel configuration to point at the EC2-hosted Keycloak. Containerised PostgreSQL on EC2 is not decommissioned until Phase 4 validation is complete.
Long-Term Migration Path (ROSA HCP)
When ROSA HCP clusters are reinstated:
The RHBK Operator on OpenShift supports external database configuration, meaning the RDS instance provisioned by this ADR can be pointed at directly from the operator-managed Keycloak instance. This is the primary cost justification for RDS over containerised PostgreSQL.
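As a rough illustration of what pointing the operator at the existing RDS instance might look like, the sketch below builds a Keycloak custom resource as a Python dictionary that can be serialised to JSON. The exact configuration remains an open item in this ADR. The secret name `keycloak-db-credentials`, the resource name, and the `<rds-endpoint>` placeholder are hypothetical, and TLS and HTTP settings are omitted.

```python
import json


def keycloak_cr(db_host, hostname, secret_name="keycloak-db-credentials"):
    """Build a Keycloak custom resource that uses an external PostgreSQL database."""
    return {
        "apiVersion": "k8s.keycloak.org/v2alpha1",
        "kind": "Keycloak",
        "metadata": {"name": "keycloak"},
        "spec": {
            "instances": 1,
            "db": {
                "vendor": "postgres",
                "host": db_host,  # the RDS endpoint provisioned by this ADR
                "port": 5432,
                "database": "keycloak",
                # Credentials are read from a Kubernetes Secret (hypothetical name).
                "usernameSecret": {"name": secret_name, "key": "username"},
                "passwordSecret": {"name": secret_name, "key": "password"},
            },
            "hostname": {"hostname": hostname},
        },
    }


if __name__ == "__main__":
    print(json.dumps(keycloak_cr("<rds-endpoint>", "auth.blueguardian.co"), indent=2))
```

Because the database host is only a parameter, the same RDS instance can serve the ECS deployment now and the operator-managed deployment later, provided network reachability from the cluster to the VPC is in place.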
Alternatives Considered
Harden in place (EC2 containerised PostgreSQL + Keycloak): Rejected — retains unmanaged database risk, no backup/PITR, couples Keycloak and IdM lifecycles, and requires a full database migration when ROSA is reinstated regardless.
ECS Fargate + containerised PostgreSQL (no RDS): Rejected — retains the unmanaged database risk. The marginal cost of RDS (~$14–16/mo for db.t4g.micro) buys managed backups, PITR, and migration continuity.
ECS Fargate + RDS Multi-AZ: Rejected for bootstrapping phase — roughly doubles RDS cost. Forward item for pre-revenue launch.
EC2 + RDS (dedicated EC2 for Keycloak): Rejected — worse cost efficiency than Fargate at this task size.
ROSA HCP immediately: Rejected — current budget constraints make this infeasible during bootstrapping.
Consequences
Positive
Keycloak database is now managed (automated backups, PITR, deletion protection) — the single largest operational risk is eliminated.
ECS Fargate provides automatic task restart on failure.
RDS survives the eventual ROSA HCP migration without a data migration.
LDAP federation to IdM is preserved without architectural change.
Cost is predictable and within bootstrapping budget.
Architecture aligns with CFM TIPs guidance: Graviton, rightsizing before commitment, gp3, VPC endpoints, tagging, budget alerting.
Negative / Trade-offs
VPC endpoints add ~$7/mo at low usage — disproportionate at bootstrapping scale.
Single-AZ RDS: database-layer recovery time is minutes, not seconds.
ECS Fargate cold start adds ~30–60 seconds to Keycloak recovery after task failure.
Adds ECS and RDS operational surface to the AWS account.
Open Items
Item | Target |
|---|---|
Evaluate Multi-AZ RDS before first paying customer | Pre-revenue launch |
Apply Compute Savings Plan after 30 days Fargate utilisation data | T+30 days |
Apply RDS Reserved Instance after 30 days RDS utilisation data | T+30 days |
Evaluate VPC endpoint cost vs NAT Gateway at bootstrapping budget | T+7 days post-deploy |
Define RHBK Operator external DB configuration for ROSA migration | ADR for ROSA reinstatement |
Right-size IdM EC2 instance post Keycloak removal | Phase 4 cutover |