# Architecture

This section describes the infrastructure architecture for the UME data platform. It covers the GCP project topology, network layout, and how each component fits together.

## GCP Project Topology

### Target state (multi-project)

The target architecture uses multiple GCP projects to isolate concerns and minimize blast radius:

```
GCP Organization
├── ume-shared-prod          # Artifact Registry, Terraform state, WIF pool, shared KMS
├── ume-platform-dev         # GKE, Cloud SQL, VPC (dev)
├── ume-platform-prod        # GKE, Cloud SQL, VPC (prod)
├── ume-data-dev             # BigQuery datasets, GCS landing buckets (dev)
└── ume-data-prod            # BigQuery datasets, GCS landing buckets (prod)
```

Each project has its own IAM boundary. Cross-project access is granted explicitly via service-account bindings managed in the 10-platform-shared layer.
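For illustration, a cross-project grant of this kind could be expressed as below; the service-account name and role are assumptions, not the actual bindings in 10-platform-shared.

```hcl
# Sketch only: let an Airflow runtime SA from the platform project write to
# BigQuery in the data project. SA name and role are illustrative.
resource "google_project_iam_member" "airflow_bq_access" {
  project = "ume-data-prod"
  role    = "roles/bigquery.dataEditor"
  member  = "serviceAccount:airflow-worker@ume-platform-prod.iam.gserviceaccount.com"
}
```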

### Wave-1 reality (single project)

For wave-1, all resources live in the existing project poc-ume-data:

```
poc-ume-data
├── Terraform state bucket (GCS)
├── Artifact Registry (Docker)
├── VPC + subnets
├── GKE Standard cluster (zonal for dev, regional for prod)
│   ├── default-pool (Airflow + system services)          ← Phase 1
│   │   ├── Airflow scheduler (CeleryExecutor)
│   │   ├── Airflow Celery worker(s) + Redis
│   │   ├── Airflow webserver (OIDC auth)
│   │   ├── Airflow triggerer
│   │   └── GKE Ingress (GCLB)
│   ├── kpo-pool (spot, scale-to-zero batch tasks)        ← Phase 1
│   │   └── KubernetesPodOperator tasks (dbt runs, etc.)
│   └── workload-pool (DataHub + dependencies)            ← Phase 2
│       ├── DataHub (Helm)
│       ├── Strimzi Kafka (3 brokers)
│       ├── OpenSearch (3 data nodes)
│       └── Secret Manager CSI driver
├── Cloud SQL for PostgreSQL (Airflow metadata; shared with DataHub in Phase 2)
└── Cloud Monitoring + Managed Prometheus
```

Terraform modules accept project IDs as variables, so when the production projects are provisioned externally, only the terraform.tfvars files need to change.
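A minimal sketch of what that looks like; the variable names and region are illustrative, not the actual stack inputs:

```hcl
# terraform.tfvars (dev) -- illustrative variable names and values
project_id = "poc-ume-data"
region     = "europe-west1"   # assumed region

# terraform.tfvars (prod, once the project exists) -- same module code
# project_id = "ume-platform-prod"
```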

## Network Architecture

### VPC design

A single VPC per environment with purpose-built subnets:

| Subnet | CIDR (dev) | Purpose |
|---|---|---|
| gke-nodes | 10.0.0.0/20 | GKE node IPs |
| gke-pods | 10.4.0.0/14 | GKE pod secondary range |
| gke-services | 10.8.0.0/20 | GKE service secondary range |
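The pod and service ranges are attached as secondary ranges on the node subnet. A minimal sketch using the names and CIDRs from the table; the VPC name is illustrative:

```hcl
resource "google_compute_network" "vpc" {
  name                    = "ume-dev-vpc"   # illustrative name
  auto_create_subnetworks = false
}

resource "google_compute_subnetwork" "gke_nodes" {
  name                     = "gke-nodes"
  network                  = google_compute_network.vpc.id
  region                   = "europe-west1"   # assumed region
  ip_cidr_range            = "10.0.0.0/20"
  private_ip_google_access = true

  secondary_ip_range {
    range_name    = "gke-pods"
    ip_cidr_range = "10.4.0.0/14"
  }
  secondary_ip_range {
    range_name    = "gke-services"
    ip_cidr_range = "10.8.0.0/20"
  }
}
```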

### Private connectivity

  • Cloud NAT for outbound internet access from GKE nodes (no public IPs on nodes).
  • Private Service Access (PSA) for Cloud SQL: the database is reachable only via its private IP within the VPC (sketched after this list).
  • GKE private cluster with private nodes and a public endpoint restricted to authorized networks (CI runner IPs, operator IPs).
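A sketch of the PSA wiring, assuming the VPC from the subnet sketch above; the instance name, tier, and Postgres version are assumptions:

```hcl
# Reserve an internal range and peer it with the Service Networking service.
resource "google_compute_global_address" "psa_range" {
  name          = "ume-dev-psa-range"
  purpose       = "VPC_PEERING"
  address_type  = "INTERNAL"
  prefix_length = 16
  network       = google_compute_network.vpc.id
}

resource "google_service_networking_connection" "psa" {
  network                 = google_compute_network.vpc.id
  service                 = "servicenetworking.googleapis.com"
  reserved_peering_ranges = [google_compute_global_address.psa_range.name]
}

# Cloud SQL then attaches to the VPC and disables its public IP.
resource "google_sql_database_instance" "airflow_meta" {
  name             = "ume-dev-airflow-meta"   # illustrative
  database_version = "POSTGRES_15"            # assumed version
  region           = "europe-west1"           # assumed region

  settings {
    tier = "db-custom-2-7680"                 # assumed tier
    ip_configuration {
      ipv4_enabled    = false
      private_network = google_compute_network.vpc.id
    }
  }

  depends_on = [google_service_networking_connection.psa]
}
```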

### DNS

  • Cloud DNS private zone for internal service discovery.
  • Certificate Manager with DNS authorization for wildcard TLS certificates on public-facing services (Airflow UI, DataHub UI), as sketched below.
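A sketch of the Certificate Manager flow, assuming a hypothetical dev.ume.example.com domain; the zone, domain, and resource names are placeholders:

```hcl
resource "google_certificate_manager_dns_authorization" "wildcard" {
  name   = "ume-dev-wildcard-authz"
  domain = "dev.ume.example.com"
}

# The authorization emits a CNAME record that must exist in the Cloud DNS zone.
resource "google_dns_record_set" "authz_cname" {
  managed_zone = "ume-dev-zone"   # assumed Cloud DNS zone name
  name         = google_certificate_manager_dns_authorization.wildcard.dns_resource_record[0].name
  type         = google_certificate_manager_dns_authorization.wildcard.dns_resource_record[0].type
  ttl          = 300
  rrdatas      = [google_certificate_manager_dns_authorization.wildcard.dns_resource_record[0].data]
}

resource "google_certificate_manager_certificate" "wildcard" {
  name = "ume-dev-wildcard-cert"
  managed {
    domains            = ["dev.ume.example.com", "*.dev.ume.example.com"]
    dns_authorizations = [google_certificate_manager_dns_authorization.wildcard.id]
  }
}
```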

## Component Architecture

### Terraform Repository (ume-data-infra)

The repository separates global layers from per-environment stacks:

```
ume-data-infra/
├── layers/                        # Global, applied once
│   ├── 00-bootstrap/              # State bucket, Artifact Registry, WIF pool, CI SA
│   └── 10-platform-shared/        # Phase 2: cross-env resources (datahub-sa, KMS)
├── environments/                  # Per environment
│   ├── dev-01-base/               # VPC, GKE, Cloud SQL, Airflow SAs     ← Phase 1
│   ├── dev-02-runtime/            # Airflow + ingress (Phase 1), DataHub (Phase 2)
│   └── prod-*/                    # Mirrors dev; different tfvars
├── resources/                     # DAGs, dbt, Docker (temp; ports to own repo)
│   ├── docker/                    # Custom Airflow image
│   ├── dags/                      # Airflow DAG files
│   ├── dbt/                       # dbt project
│   └── scripts/                   # Build + utility scripts
└── modules/                       # Reusable, prefer upstream terraform-google-modules
```

Each stack has its own Terraform state in GCS. Stacks communicate via terraform_remote_state. Details in Terraform Structure.
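As a sketch, a downstream stack such as dev-02-runtime reads the base layer like this (bucket, prefix, and output names are illustrative):

```hcl
data "terraform_remote_state" "base" {
  backend = "gcs"
  config = {
    bucket = "ume-terraform-state"            # assumed state bucket
    prefix = "environments/dev-01-base"
  }
}

# Downstream resources then consume the published outputs.
locals {
  cluster_name        = data.terraform_remote_state.base.outputs.cluster_name
  sql_connection_name = data.terraform_remote_state.base.outputs.sql_connection_name
}
```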

### GKE Cluster

A GKE Standard cluster hosts Airflow (Phase 1) and DataHub with dependencies (Phase 2). Zonal for dev PoC, regional for prod.

```
GKE Cluster
├── default-pool (e2-standard-2, Airflow + system)           ← Phase 1
│   ├── Airflow scheduler (CeleryExecutor)
│   ├── Airflow Celery worker(s)
│   ├── Airflow webserver
│   ├── Airflow triggerer
│   ├── Redis (Celery broker)
│   ├── GCS FUSE sidecar (DAG delivery from GCS bucket)
│   ├── GKE Ingress controller
│   └── Google Managed Prometheus collectors
├── kpo-pool (e2-standard-2, spot, scale-to-zero)            ← Phase 1
│   └── KubernetesPodOperator tasks (dbt via KPO, etc.)
└── workload-pool (e2-standard-4, autoscaling min=2 max=6)   ← Phase 2
    ├── Kafka brokers (3, Strimzi-managed)
    ├── OpenSearch data nodes (3, operator-managed)
    ├── DataHub GMS
    ├── DataHub Frontend
    ├── DataHub MAE Consumer
    └── DataHub MCE Consumer
```

Details in GKE Platform.
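As an illustration of the kpo-pool's spot, scale-to-zero behaviour, a node pool along these lines could live in dev-01-base; the taint key, autoscaling ceiling, and cluster reference are assumptions:

```hcl
resource "google_container_node_pool" "kpo_pool" {
  name    = "kpo-pool"
  cluster = google_container_cluster.primary.id   # assumed cluster resource in the base layer

  autoscaling {
    min_node_count = 0    # scale to zero when no KPO tasks are running
    max_node_count = 6    # assumed ceiling
  }

  node_config {
    machine_type = "e2-standard-2"
    spot         = true

    # Keep system and Airflow pods off the pool; KPO pods tolerate the taint.
    taint {
      key    = "dedicated"
      value  = "kpo"
      effect = "NO_SCHEDULE"
    }
  }
}
```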

### Airflow

Airflow is deployed on the GKE cluster via the official Apache Airflow Helm chart with CeleryExecutor + Redis. It runs in the default-pool (scheduler, workers, webserver, triggerer) with batch tasks offloaded to the kpo-pool (spot VMs, scale-to-zero) via KubernetesPodOperator.

```
Airflow Deployment
├── Airflow scheduler (CeleryExecutor, enqueues tasks to Redis)
│   └── → Cloud SQL (Postgres, private IP) for metadata
│   └── → Cloud SQL Auth Proxy sidecar (IAM auth via Workload Identity)
├── Airflow Celery worker(s) (execute tasks from Redis queue)
│   └── → GCS FUSE sidecar (mounts DAGs + dbt from GCS bucket)
├── Airflow webserver (UI, Google OIDC auth)
├── Airflow triggerer (deferrable operators)
├── Redis (Celery task broker)
├── KPO tasks (on kpo-pool, separate namespace)
│   └── dbt runs, data quality checks, ingestion jobs
└── Custom image (from Artifact Registry)
    └── Base Airflow + astronomer-cosmos + dbt-core + dbt-bigquery
```

Details in Airflow on GKE.
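A minimal sketch of the Helm release as driven from dev-02-runtime via the Terraform Helm provider; only a subset of chart values is shown, and the image path and tag are assumptions:

```hcl
resource "helm_release" "airflow" {
  name       = "airflow"
  namespace  = "airflow"
  repository = "https://airflow.apache.org"
  chart      = "airflow"

  values = [yamlencode({
    executor = "CeleryExecutor"
    images = {
      airflow = {
        repository = "europe-west1-docker.pkg.dev/poc-ume-data/ume/airflow"  # assumed Artifact Registry path
        tag        = "2.9.3-ume"                                             # assumed tag of the custom image
      }
    }
  })]
}
```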

### DataHub

DataHub is deployed via Helm on our GKE cluster:

```
DataHub Deployment
├── DataHub GMS (metadata service)
│   └── → Cloud SQL (Postgres, private IP)
│   └── → Kafka (Strimzi, in-cluster)
│   └── → OpenSearch (in-cluster)
├── DataHub Frontend (React UI)
│   └── → GKE Ingress (GCLB, wildcard TLS)
│   └── → Google OIDC (org-domain restricted)
├── DataHub MAE Consumer (metadata audit events)
│   └── → Kafka → OpenSearch
├── DataHub MCE Consumer (metadata change events)
│   └── → Kafka → Cloud SQL
└── Ingestion (runs as Airflow DAGs, not in GKE)
    └── BigQuery, Airflow, dbt connectors
```

Details in DataHub.
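As with Airflow, the deployment is driven from dev-02-runtime; a minimal sketch of the Helm release follows. The values block that points GMS at Cloud SQL, the Strimzi brokers, and OpenSearch is omitted here because the exact keys depend on the chart version.

```hcl
resource "helm_release" "datahub" {
  name       = "datahub"
  namespace  = "datahub"
  repository = "https://helm.datahubproject.io"
  chart      = "datahub"

  # values = [...]  # wires GMS to Cloud SQL, Strimzi Kafka, and OpenSearch (chart-version dependent)
}
```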

## Inter-Stack Dependencies

```
00-bootstrap
    │
    ▼
dev-01-base ─────────────────── Phase 1
    │ VPC, GKE, Cloud SQL,       (Airflow)
    │ Airflow SAs + WI bindings
    ▼
dev-02-runtime
    │ GCS buckets, Airflow Helm,
    │ Ingress + TLS, DNS, OIDC
    │
    ▼
[CI: DAGs + dbt synced to GCS bucket]
```

```
00-bootstrap
    │
    ▼
10-platform-shared ──────────── Phase 2
    │ Cross-env: datahub-sa,     (DataHub)
    │ KMS, logging
    ▼
dev-01-base
    ▼
dev-02-runtime (+ datahub.tf, Kafka, OpenSearch)
```

Phase 1: 00-bootstrap → dev-01-base (VPC, GKE, Cloud SQL, Airflow SAs) → dev-02-runtime (GCS buckets, Airflow Helm, ingress + TLS, Google OIDC auth). 10-platform-shared is deferred to Phase 2; all Phase-1 infrastructure is reused in Phase 2.

Phase 2: Adds DataHub, Kafka, and OpenSearch to dev-02-runtime. The existing GKE cluster gains a workload-pool for DataHub. If platform add-ons need a separate layer, dev-02-runtime is renumbered.

Each arrow represents a terraform_remote_state dependency. A downstream stack reads outputs (VPC ID, cluster name, SQL connection name, Kafka endpoint) from the stack above it.
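For example, dev-01-base might publish outputs along these lines for the runtime stack to read; the output names are illustrative and must match what downstream locals expect:

```hcl
output "network_id" {
  value = google_compute_network.vpc.id
}

output "cluster_name" {
  value = google_container_cluster.primary.name
}

output "sql_connection_name" {
  value = google_sql_database_instance.airflow_meta.connection_name
}
```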