# Architecture

This section describes the infrastructure architecture for the UME data platform. It covers the GCP project topology, network layout, and how each component fits together.

## GCP Project Topology

### Target state (multi-project)

The target architecture uses multiple GCP projects to isolate concerns and minimize blast radius:

```
GCP Organization
├── ume-shared-prod          # Artifact Registry, Terraform state, WIF pool, shared KMS
├── ume-platform-dev         # GKE, Cloud SQL, VPC (dev)
├── ume-platform-prod        # GKE, Cloud SQL, VPC (prod)
├── ume-data-dev             # BigQuery datasets, GCS landing buckets (dev)
└── ume-data-prod            # BigQuery datasets, GCS landing buckets (prod)
```

Each project has its own IAM boundary. Cross-project access is granted explicitly via service-account bindings managed in the 10-platform-shared layer.
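For illustration, a cross-project grant of this kind could be expressed as below; the service-account name and role are assumptions, not the actual bindings in 10-platform-shared.

```hcl
# Sketch only: let an Airflow runtime SA from the platform project write to
# BigQuery in the data project. SA name and role are illustrative.
resource "google_project_iam_member" "airflow_bq_access" {
  project = "ume-data-prod"
  role    = "roles/bigquery.dataEditor"
  member  = "serviceAccount:airflow-worker@ume-platform-prod.iam.gserviceaccount.com"
}
```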

### Wave-1 reality (single project)

For wave-1, all resources live in the existing project poc-ume-data:

```
poc-ume-data
├── Terraform state bucket (GCS)
├── Artifact Registry (Docker)
├── VPC + subnets
├── GKE Standard cluster (zonal for dev, regional for prod)
│   ├── default-pool (Airflow + system services)          ← Phase 1
│   │   ├── Airflow scheduler (CeleryExecutor)
│   │   ├── Airflow Celery worker(s) + Redis
│   │   ├── Airflow webserver (OIDC auth)
│   │   ├── Airflow triggerer
│   │   └── GKE Ingress (GCLB)
│   ├── kpo-pool (spot, scale-to-zero batch tasks)        ← Phase 1
│   │   └── KubernetesPodOperator tasks (dbt runs, etc.)
│   └── workload-pool (DataHub + dependencies)            ← Phase 2
│       ├── DataHub (Helm)
│       ├── Strimzi Kafka (3 brokers)
│       ├── OpenSearch (3 data nodes)
│       └── Secret Manager CSI driver
├── Cloud SQL for PostgreSQL (Airflow metadata; shared with DataHub in Phase 2)
└── Cloud Monitoring + Managed Prometheus
```

Terraform modules accept project IDs as variables, so when the production projects are provisioned externally, only the terraform.tfvars files need to change.
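A minimal sketch of what that looks like; the variable names and region are illustrative, not the actual stack inputs:

```hcl
# terraform.tfvars (dev) -- illustrative variable names and values
project_id = "poc-ume-data"
region     = "europe-west1"   # assumed region

# terraform.tfvars (prod, once the project exists) -- same module code
# project_id = "ume-platform-prod"
```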

## Network Architecture

### VPC design

A single VPC per environment with purpose-built subnets:

| Subnet | CIDR (dev) | Purpose |
|---|---|---|
| gke-nodes | 10.0.0.0/20 | GKE node IPs |
| gke-pods | 10.4.0.0/14 | GKE pod secondary range |
| gke-services | 10.8.0.0/20 | GKE service secondary range |
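The pod and service ranges are attached as secondary ranges on the node subnet. A minimal sketch using the names and CIDRs from the table; the VPC name is illustrative:

```hcl
resource "google_compute_network" "vpc" {
  name                    = "ume-dev-vpc"   # illustrative name
  auto_create_subnetworks = false
}

resource "google_compute_subnetwork" "gke_nodes" {
  name                     = "gke-nodes"
  network                  = google_compute_network.vpc.id
  region                   = "europe-west1"   # assumed region
  ip_cidr_range            = "10.0.0.0/20"
  private_ip_google_access = true

  secondary_ip_range {
    range_name    = "gke-pods"
    ip_cidr_range = "10.4.0.0/14"
  }
  secondary_ip_range {
    range_name    = "gke-services"
    ip_cidr_range = "10.8.0.0/20"
  }
}
```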

### Private connectivity

  • Cloud NAT for outbound internet access from GKE nodes (no public IPs on nodes).
  • Private Service Access (PSA) for Cloud SQL: the database is reachable only via its private IP within the VPC (sketched after this list).
  • GKE private cluster with private nodes and a public endpoint restricted to authorized networks (CI runner IPs, operator IPs).
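A sketch of the PSA wiring, assuming the VPC from the subnet sketch above; the instance name, tier, and Postgres version are assumptions:

```hcl
# Reserve an internal range and peer it with the Service Networking service.
resource "google_compute_global_address" "psa_range" {
  name          = "ume-dev-psa-range"
  purpose       = "VPC_PEERING"
  address_type  = "INTERNAL"
  prefix_length = 16
  network       = google_compute_network.vpc.id
}

resource "google_service_networking_connection" "psa" {
  network                 = google_compute_network.vpc.id
  service                 = "servicenetworking.googleapis.com"
  reserved_peering_ranges = [google_compute_global_address.psa_range.name]
}

# Cloud SQL then attaches to the VPC and disables its public IP.
resource "google_sql_database_instance" "airflow_meta" {
  name             = "ume-dev-airflow-meta"   # illustrative
  database_version = "POSTGRES_15"            # assumed version
  region           = "europe-west1"           # assumed region

  settings {
    tier = "db-custom-2-7680"                 # assumed tier
    ip_configuration {
      ipv4_enabled    = false
      private_network = google_compute_network.vpc.id
    }
  }

  depends_on = [google_service_networking_connection.psa]
}
```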

### DNS

  • Cloud DNS private zone for internal service discovery.
  • Certificate Manager with DNS authorization for wildcard TLS certificates on public-facing services (Airflow UI, DataHub UI), as sketched below.
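A sketch of the Certificate Manager flow, assuming a hypothetical dev.ume.example.com domain; the zone, domain, and resource names are placeholders:

```hcl
resource "google_certificate_manager_dns_authorization" "wildcard" {
  name   = "ume-dev-wildcard-authz"
  domain = "dev.ume.example.com"
}

# The authorization emits a CNAME record that must exist in the Cloud DNS zone.
resource "google_dns_record_set" "authz_cname" {
  managed_zone = "ume-dev-zone"   # assumed Cloud DNS zone name
  name         = google_certificate_manager_dns_authorization.wildcard.dns_resource_record[0].name
  type         = google_certificate_manager_dns_authorization.wildcard.dns_resource_record[0].type
  ttl          = 300
  rrdatas      = [google_certificate_manager_dns_authorization.wildcard.dns_resource_record[0].data]
}

resource "google_certificate_manager_certificate" "wildcard" {
  name = "ume-dev-wildcard-cert"
  managed {
    domains            = ["dev.ume.example.com", "*.dev.ume.example.com"]
    dns_authorizations = [google_certificate_manager_dns_authorization.wildcard.id]
  }
}
```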

## Component Architecture

### Terraform Repository (ume-data-infra)

The repository separates global layers from per-environment stacks:

```
ume-data-infra/
├── layers/                        # Global, applied once
│   ├── 00-bootstrap/              # State bucket, Artifact Registry, WIF pool, CI SA
│   └── 10-platform-shared/        # Phase 2: cross-env resources (datahub-sa, KMS)
├── environments/                  # Per environment
│   ├── dev-01-base/               # VPC, GKE, Cloud SQL, Airflow SAs     ← Phase 1
│   ├── dev-02-runtime/            # Airflow + ingress (Phase 1), DataHub (Phase 2)
│   └── prod-*/                    # Mirrors dev; different tfvars
├── resources/                     # DAGs, dbt, Docker (temp; ports to own repo)
│   ├── docker/                    # Custom Airflow image
│   ├── dags/                      # Airflow DAG files
│   ├── dbt/                       # dbt project
│   └── scripts/                   # Build + utility scripts
└── modules/                       # Reusable, prefer upstream terraform-google-modules
```

Each stack has its own Terraform state in GCS. Stacks communicate via terraform_remote_state. Details in Terraform Structure.
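As a sketch, a downstream stack such as dev-02-runtime reads the base layer like this (bucket, prefix, and output names are illustrative):

```hcl
data "terraform_remote_state" "base" {
  backend = "gcs"
  config = {
    bucket = "ume-terraform-state"            # assumed state bucket
    prefix = "environments/dev-01-base"
  }
}

# Downstream resources then consume the published outputs.
locals {
  cluster_name        = data.terraform_remote_state.base.outputs.cluster_name
  sql_connection_name = data.terraform_remote_state.base.outputs.sql_connection_name
}
```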

### GKE Cluster

A GKE Standard cluster hosts Airflow (Phase 1) and DataHub with dependencies (Phase 2). Zonal for dev PoC, regional for prod.

```
GKE Cluster
├── default-pool (e2-standard-2, Airflow + system)           ← Phase 1
│   ├── Airflow scheduler (CeleryExecutor)
│   ├── Airflow Celery worker(s)
│   ├── Airflow webserver
│   ├── Airflow triggerer
│   ├── Redis (Celery broker)
│   ├── GCS FUSE sidecar (DAG delivery from GCS bucket)
│   ├── GKE Ingress controller
│   └── Google Managed Prometheus collectors
├── kpo-pool (e2-standard-2, spot, scale-to-zero)            ← Phase 1
│   └── KubernetesPodOperator tasks (dbt via KPO, etc.)
└── workload-pool (e2-standard-4, autoscaling min=2 max=6)   ← Phase 2
    ├── Kafka brokers (3, Strimzi-managed)
    ├── OpenSearch data nodes (3, operator-managed)
    ├── DataHub GMS
    ├── DataHub Frontend
    ├── DataHub MAE Consumer
    └── DataHub MCE Consumer
```

Details in GKE Platform.
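As an illustration of the kpo-pool's spot, scale-to-zero behaviour, a node pool along these lines could live in dev-01-base; the taint key, autoscaling ceiling, and cluster reference are assumptions:

```hcl
resource "google_container_node_pool" "kpo_pool" {
  name    = "kpo-pool"
  cluster = google_container_cluster.primary.id   # assumed cluster resource in the base layer

  autoscaling {
    min_node_count = 0    # scale to zero when no KPO tasks are running
    max_node_count = 6    # assumed ceiling
  }

  node_config {
    machine_type = "e2-standard-2"
    spot         = true

    # Keep system and Airflow pods off the pool; KPO pods tolerate the taint.
    taint {
      key    = "dedicated"
      value  = "kpo"
      effect = "NO_SCHEDULE"
    }
  }
}
```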

### Airflow

Airflow is deployed on the GKE cluster via the official Apache Airflow Helm chart with CeleryExecutor + Redis. It runs in the default-pool (scheduler, workers, webserver, triggerer) with batch tasks offloaded to the kpo-pool (spot VMs, scale-to-zero) via KubernetesPodOperator.

```
Airflow Deployment
├── Airflow scheduler (CeleryExecutor, enqueues tasks to Redis)
│   └── → Cloud SQL (Postgres, private IP) for metadata
│   └── → Cloud SQL Auth Proxy sidecar (IAM auth via Workload Identity)
├── Airflow Celery worker(s) (execute tasks from Redis queue)
│   └── → GCS FUSE sidecar (mounts DAGs + dbt from GCS bucket)
├── Airflow webserver (UI, Google OIDC auth)
├── Airflow triggerer (deferrable operators)
├── Redis (Celery task broker)
├── KPO tasks (on kpo-pool, separate namespace)
│   └── dbt runs, data quality checks, ingestion jobs
└── Custom image (from Artifact Registry)
    └── Base Airflow + astronomer-cosmos + dbt-core + dbt-bigquery
```

Details in Airflow on GKE.
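A minimal sketch of the Helm release as driven from dev-02-runtime via the Terraform Helm provider; only a subset of chart values is shown, and the image path and tag are assumptions:

```hcl
resource "helm_release" "airflow" {
  name       = "airflow"
  namespace  = "airflow"
  repository = "https://airflow.apache.org"
  chart      = "airflow"

  values = [yamlencode({
    executor = "CeleryExecutor"
    images = {
      airflow = {
        repository = "europe-west1-docker.pkg.dev/poc-ume-data/ume/airflow"  # assumed Artifact Registry path
        tag        = "2.9.3-ume"                                             # assumed tag of the custom image
      }
    }
  })]
}
```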

### DataHub

DataHub is deployed via Helm on our GKE cluster:

```
DataHub Deployment
├── DataHub GMS (metadata service)
│   └── → Cloud SQL (Postgres, private IP)
│   └── → Kafka (Strimzi, in-cluster)
│   └── → OpenSearch (in-cluster)
├── DataHub Frontend (React UI)
│   └── → GKE Ingress (GCLB, wildcard TLS)
│   └── → Google OIDC (org-domain restricted)
├── DataHub MAE Consumer (metadata audit events)
│   └── → Kafka → OpenSearch
├── DataHub MCE Consumer (metadata change events)
│   └── → Kafka → Cloud SQL
└── Ingestion (runs as Airflow DAGs, not in GKE)
    └── BigQuery, Airflow, dbt connectors
```

Details in DataHub.
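As with Airflow, the deployment is driven from dev-02-runtime; a minimal sketch of the Helm release follows. The values block that points GMS at Cloud SQL, the Strimzi brokers, and OpenSearch is omitted here because the exact keys depend on the chart version.

```hcl
resource "helm_release" "datahub" {
  name       = "datahub"
  namespace  = "datahub"
  repository = "https://helm.datahubproject.io"
  chart      = "datahub"

  # values = [...]  # wires GMS to Cloud SQL, Strimzi Kafka, and OpenSearch (chart-version dependent)
}
```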

## Inter-Stack Dependencies

```
00-bootstrap
    │
    ▼
dev-01-base ─────────────────── Phase 1
    │ VPC, GKE, Cloud SQL,       (Airflow)
    │ Airflow SAs + WI bindings
    ▼
dev-02-runtime
    │ GCS buckets, Airflow Helm,
    │ Ingress + TLS, DNS, OIDC
    │
    ▼
[CI: DAGs + dbt synced to GCS bucket]
```

```
00-bootstrap
    │
    ▼
10-platform-shared ──────────── Phase 2
    │ Cross-env: datahub-sa,     (DataHub)
    │ KMS, logging
    ▼
dev-01-base
    ▼
dev-02-runtime (+ datahub.tf, Kafka, OpenSearch)
```

Phase 1: 00-bootstrap → dev-01-base (VPC, GKE, Cloud SQL, Airflow SAs) → dev-02-runtime (GCS buckets, Airflow Helm, ingress + TLS, Google OIDC auth). 10-platform-shared is deferred to Phase 2; all Phase-1 infrastructure is reused in Phase 2.

Phase 2: Adds DataHub, Kafka, and OpenSearch to dev-02-runtime. The existing GKE cluster gains a workload-pool for DataHub. If platform add-ons need a separate layer, dev-02-runtime is renumbered.

Each arrow represents a terraform_remote_state dependency. A downstream stack reads outputs (VPC ID, cluster name, SQL connection name, Kafka endpoint) from the stack above it.
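For example, dev-01-base might publish outputs along these lines for the runtime stack to read; the output names are illustrative and must match what downstream locals expect:

```hcl
output "network_id" {
  value = google_compute_network.vpc.id
}

output "cluster_name" {
  value = google_container_cluster.primary.name
}

output "sql_connection_name" {
  value = google_sql_database_instance.airflow_meta.connection_name
}
```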