# GKE Platform

The GKE Standard cluster hosts Airflow (Phase 1) and DataHub with its dependencies (Phase 2) as well as platform add-ons (ingress, observability, secrets injection). This section covers the cluster design, node pool strategy, zero-downtime recipe, and ingress/TLS configuration.

## Cluster Design

### Why GKE Standard over Autopilot

Autopilot simplifies operations but imposes constraints and cost premiums that work against our needs:

| Factor | Standard | Autopilot |
| --- | --- | --- |
| Cost (dev, ~4 vCPU / 8 GB aggregate) | ~$50-80/mo | ~$100-200/mo |
| Stateful workload support | Full control over node config, DaemonSets, hostPath | Restricted; some Strimzi/OpenSearch features limited |
| Node-level tuning | Custom machine types, spot nodes for batch pools | Google chooses machine types |
| Scale-to-zero (node pools) | Supported (per-pool min_node_count=0) | Google manages scaling |
| Ops burden | Moderate (mitigated by automation) | Near-zero |

For Kafka and OpenSearch on GKE (Phase 2), Standard provides the necessary flexibility. The ops burden is mitigated by the zero-downtime recipe documented below.

### Cluster configuration

| Setting | Dev (PoC) | Dev (hardened) | Prod |
| --- | --- | --- | --- |
| Type | Zonal | Regional | Regional |
| Zones | 1 | 3 | 3 |
| Release channel | Regular | Regular | Regular |
| Maintenance window | Weekdays 02:00-06:00 UTC | Same | Weekends 02:00-06:00 UTC |
| Private cluster | Yes (private nodes, public endpoint with authorized networks) | Yes | Yes |
| Workload Identity | Enabled | Enabled | Enabled |
| Binary Authorization | Disabled (wave-1) | Evaluate for wave-2 | Evaluate |
| Network policy | Dataplane V2 (built-in) | Dataplane V2 (built-in) | Dataplane V2 (built-in) |

Why zonal for dev PoC: one node per pool instead of three, since a regional cluster replicates each pool across all three zones. Regional is deferred to prod or whenever HA is required. The GKE free tier ($74.40/mo credit) covers one zonal cluster; regional clusters pay the full $0.10/hr management fee (~$74/mo).

### Node resource reservations

GKE reserves CPU and memory on every worker node for kubelet, kube-proxy, containerd, and eviction thresholds. You pay for the full VM but can only schedule pods into the allocatable portion.

CPU reservation (dedicated-core machines):

| Core range | Reserved |
| --- | --- |
| 1st core | 6% (60m) |
| 2nd core | 1% (10m) |
| 3rd-4th cores | 0.5% each (5m each) |
| 5th+ cores | 0.25% each |

CPU reservation (shared-core E2): a flat 1060 millicores, so these machines lose over half their nominal 2 vCPU. Avoid shared-core machine types for GKE.

Memory reservation:

| Capacity range | Reserved |
| --- | --- |
| First 4 GiB | 25% (1024 MiB) |
| 4-8 GiB | 20% (819 MiB) |
| 8-16 GiB | 10% |
| 16-128 GiB | 6% |
| Above 128 GiB | 2% |
| Plus (every node) | 100 MiB eviction threshold |
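
Worked example (e2-standard-2, 2 vCPU / 8 GiB): CPU reserved is 60m + 10m = 70m, leaving 1930m allocatable; memory reserved is 25% of the first 4 GiB (1024 MiB) plus 20% of the next 4 GiB (819 MiB) plus the 100 MiB eviction threshold, roughly 1.9 GiB, leaving ~6.1 GiB. The same arithmetic produces the allocatable figures in the table below.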

### Machine type selection

| Machine | vCPU | RAM | Alloc. CPU | Alloc. RAM | $/mo | Verdict |
| --- | --- | --- | --- | --- | --- | --- |
| e2-small | 2 (shared) | 2 GiB | 940m | ~1.4 GiB | $12 | Unusable |
| e2-medium | 2 (shared) | 4 GiB | 940m | ~2.9 GiB | $24 | Marginal |
| e2-standard-2 | 2 | 8 GiB | 1930m | ~6.1 GiB | $49 | Phase 1 node |
| e2-standard-4 | 4 | 16 GiB | 3920m | ~13.3 GiB | $98 | Phase 2 workload nodes |
| e2-standard-8 | 8 | 32 GiB | 7900m | ~28 GiB | $195 | Prod workload nodes |

## Node Pools

### default-pool — Airflow + system services

Hosts Airflow (scheduler, Celery worker, webserver, triggerer, Redis) and lightweight system components (ingress controller, CSI driver, metrics agent). Single node in Phase 1.

| Setting | Dev (PoC) | Dev (hardened) | Prod |
| --- | --- | --- | --- |
| Machine type | e2-standard-2 | e2-standard-2 | e2-standard-4 |
| Min nodes | 1 | 1 | 3 |
| Max nodes | 2 | 3 | 6 |
| Autoscaling | Cluster Autoscaler | Cluster Autoscaler | Cluster Autoscaler |
| Spot | No | No | No |
| Surge upgrade | max_surge=1, max_unavailable=0 | Same | Same |
| Taints | None (default scheduling) | None | None |

Phase 1 resource budget (1x e2-standard-2, ~1930m CPU / ~6.1 GiB allocatable):

| Consumer | CPU request | Memory request |
| --- | --- | --- |
| Airflow scheduler | 500m | 1.5 Gi |
| Celery worker (1) | 250m | 1 Gi |
| Airflow webserver | 250m | 512 Mi |
| Airflow triggerer | 100m | 256 Mi |
| Redis | 50m | 128 Mi |
| System pods (kube-system) | ~300m | ~400 Mi |
| Used | ~1450m | ~3.8 Gi |
| Remaining headroom | ~480m | ~2.3 Gi |

This budget is snug but workable because dbt-bigquery is I/O-bound: it submits SQL to BigQuery and waits. CPU spikes during dbt compile are brief, and pods can burst above their requests up to their limits.

Scaling signals and upgrade path: see the Monitoring and Alerting section of Airflow on GKE for thresholds. Upgrade path: e2-standard-2 ($49/mo) → e2-standard-4 ($98/mo, 3920m / 13.3 GiB), or keep e2-standard-2 with min_nodes=2 ($98/mo, 3860m aggregate, better fault tolerance).

### kpo-pool — KubernetesPodOperator tasks

Ephemeral nodes for on-demand batch work: dbt runs via KPO, data quality checks, ingestion jobs. Scales from 0 to 10 nodes. Uses spot VMs for ~69% savings.

| Setting | Dev (PoC) | Prod |
| --- | --- | --- |
| Machine type | e2-standard-2 | e2-standard-4 |
| Min nodes | 0 | 0 |
| Max nodes | 10 | 20 |
| Autoscaling | Cluster Autoscaler | Cluster Autoscaler |
| Spot | Yes | Yes (with on-demand fallback via NAP) |
| Surge upgrade | max_surge=1, max_unavailable=0 | Same |
| Taints | workload=kpo:NoSchedule | Same |
| Labels | pool=kpo | Same |
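
A minimal Terraform sketch of this pool, assuming the google provider's google_container_node_pool schema (resource names and the cluster reference are illustrative):

```hcl
resource "google_container_node_pool" "kpo" {
  name    = "kpo-pool"
  cluster = google_container_cluster.main.id   # illustrative reference

  autoscaling {
    min_node_count = 0    # scale to zero when no KPO tasks are pending
    max_node_count = 10
  }

  node_config {
    machine_type = "e2-standard-2"
    spot         = true                 # preemptible capacity, ~69% cheaper
    labels       = { pool = "kpo" }

    taint {
      key    = "workload"
      value  = "kpo"
      effect = "NO_SCHEDULE"            # only pods with a matching toleration land here
    }
  }
}
```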

Scale-to-zero flow:

  1. Airflow triggers a KPO task.
  2. KPO creates a pod with a toleration for workload=kpo:NoSchedule and nodeSelector pool: kpo (shown as a fragment after this list).
  3. Pod is Pending — no nodes exist in the pool.
  4. Cluster Autoscaler detects the pending pod (~30 seconds).
  5. Spot VM provisioned (~60-90 seconds).
  6. Pod runs, completes, is cleaned up.
  7. After ~10 minutes idle (scaleDownUnneededTime), the autoscaler removes the empty node.
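
The toleration and selector from step 2, as a pod-spec fragment (standard Kubernetes fields; the key and label values come from the kpo-pool table above):

```yaml
spec:
  nodeSelector:
    pool: kpo              # pins the pod to kpo-pool nodes
  tolerations:
    - key: workload
      operator: Equal
      value: kpo
      effect: NoSchedule   # matches the pool taint workload=kpo:NoSchedule
```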

Spot pricing: ~$15/mo vs ~$49/mo on-demand for e2-standard-2. dbt tasks are idempotent, so if a spot node is preempted mid-task, Airflow simply retries.

### workload-pool — DataHub stack (Phase 2)

Hosts Kafka brokers, OpenSearch data nodes, and DataHub services. Not created in Phase 1. Added when DataHub work begins.

| Setting | Dev | Prod |
| --- | --- | --- |
| Machine type | e2-standard-4 | e2-standard-8 |
| Min nodes | 2 | 6 |
| Max nodes | 6 | 30 |
| Autoscaling | Cluster Autoscaler | Cluster Autoscaler + NAP |
| Spot | No (stateful workloads need consistent uptime) | No |
| Taints | None | None |

## Zero-Downtime Operations

The combination of the following mechanisms ensures that node replacements, upgrades, and scaling events do not interrupt running workloads.

### Surge upgrades

```hcl
max_surge       = 1   # Add 1 new node before draining the old one
max_unavailable = 0   # Never remove a node without adding a replacement first
```

During a node pool upgrade:

  1. GKE creates a new node with the updated version.
  2. The old node is cordoned (no new pods scheduled).
  3. The old node is drained (existing pods are evicted, respecting PDBs).
  4. Once all pods are safely rescheduled, the old node is deleted.
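
In Terraform these two settings sit in the node pool's upgrade_settings block; a minimal sketch (pool name and cluster reference are illustrative):

```hcl
resource "google_container_node_pool" "default" {
  name    = "default-pool"
  cluster = google_container_cluster.main.id

  upgrade_settings {
    max_surge       = 1   # one surge node added per upgrade step
    max_unavailable = 0   # capacity never drops while nodes roll
  }
}
```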

### PodDisruptionBudgets (PDBs)

PDBs tell Kubernetes how many pods of a set must remain available during voluntary disruptions (upgrades, scaling down).

| Workload | PDB | Phase | Source |
| --- | --- | --- | --- |
| Airflow scheduler | minAvailable: 1 | 1 | Airflow Helm values |
| Airflow webserver | minAvailable: 1 | 1 | Airflow Helm values |
| Kafka brokers | minAvailable: 2 (of 3) | 2 | Strimzi operator (automatic) |
| OpenSearch data nodes | minAvailable: 2 (of 3) | 2 | OpenSearch operator (automatic) |
| DataHub GMS | minAvailable: 1 | 2 | DataHub Helm values |
| DataHub Frontend | minAvailable: 1 | 2 | DataHub Helm values |
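
The manifests behind these entries are small. A sketch matching the Airflow scheduler row; the label selector is hypothetical and must match the labels the Helm chart puts on scheduler pods:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: airflow-scheduler
  namespace: airflow
spec:
  minAvailable: 1            # keep at least one scheduler through voluntary disruptions
  selector:
    matchLabels:
      component: scheduler   # hypothetical; align with the chart's pod labels
```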

### Topology spreading (Phase 2, regional cluster)

Stateful workloads spread across zones to survive zone-level failures:

```yaml
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:      # needed for the constraint to select pods;
      matchLabels:      # the label is illustrative and must match the workload
        app: kafka
```

Applied to: Kafka brokers, OpenSearch data nodes, DataHub GMS replicas (in prod). Not applicable in Phase 1 (zonal cluster).

### Maintenance window

Node auto-upgrades respect the maintenance window. Combined with surge upgrades and PDBs, upgrades happen during off-hours with zero service interruption.

### Autoscaling behavior

  • Cluster Autoscaler monitors pending pods. When pods can't schedule due to insufficient resources, it adds nodes (up to max_nodes). When nodes are underutilized for 10+ minutes, it scales down (respecting PDBs).
  • HPA (Horizontal Pod Autoscaler) scales DataHub GMS and Frontend pods based on CPU/memory utilization. Only in prod (dev uses fixed replicas).
  • Strimzi Cruise Control rebalances Kafka partitions across brokers after scaling events (Phase 2).
  • OpenSearch automatically redistributes shards when nodes join or leave the cluster (Phase 2).

### Alert on scaling limits

Cloud Monitoring alert fires when a node pool reaches its max_nodes count. This means the autoscaler cannot add more capacity and workloads may queue. See Observability.

## Ingress and TLS

### GKE Ingress (GCLB)

GKE's built-in Ingress controller provisions a Google Cloud L7 Load Balancer (GCLB) for each Ingress resource. No nginx-ingress, no Istio, no Traefik.

Advantages:

  • Zero ops: GKE manages the load balancer lifecycle.
  • Native integration with Certificate Manager for GCP-issued TLS.
  • Native integration with Cloud Armor for WAF (future).
  • Native integration with IAP for zero-trust access (future).

### Certificate Manager

TLS certificates are issued by GCP Certificate Manager. Wildcard certificate via DNS authorization:

*.data.ume.com.br → Certificate Manager → GCLB → GKE Ingress

| Setting | Value |
| --- | --- |
| Certificate type | Google-managed |
| Authorization | DNS (Cloud DNS) |
| Scope | Wildcard (*.data.ume.com.br or similar) |
| Renewal | Automatic (managed by GCP) |

No cert-manager pods. No Let's Encrypt ACME challenges. No certificate rotation runbooks. GCP handles everything.
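
A Terraform sketch of the wildcard setup, assuming the google provider's Certificate Manager resources (resource names and the exact domain are illustrative):

```hcl
resource "google_certificate_manager_dns_authorization" "data" {
  name   = "data-wildcard-authz"
  domain = "data.ume.com.br"          # parent domain validated via Cloud DNS
}

resource "google_certificate_manager_certificate" "data" {
  name = "data-wildcard-cert"
  managed {
    domains            = ["data.ume.com.br", "*.data.ume.com.br"]
    dns_authorizations = [google_certificate_manager_dns_authorization.data.id]
  }
}
```

The certificate typically reaches the GCLB through a certificate map referenced from the Ingress, as sketched in the routing section below.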

### Ingress routing

| Host | Backend | Phase | Notes |
| --- | --- | --- | --- |
| airflow.{domain} | Airflow webserver service | 1 (Story 4c) | Google OIDC auth, port-forward initially |
| datahub.{domain} | DataHub Frontend service | 2 | Google OIDC auth |
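
An illustrative Ingress for the airflow row. The networking.gke.io/certmap annotation attaches a Certificate Manager certificate map to the GCLB; the map name, host, and service name/port are assumptions to verify against the actual Helm release:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: airflow
  namespace: airflow
  annotations:
    networking.gke.io/certmap: data-cert-map   # hypothetical Certificate Manager map
spec:
  rules:
    - host: airflow.data.ume.com.br            # i.e. airflow.{domain}
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: airflow-webserver        # assumed service name from the chart
                port:
                  number: 8080
```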

## Terraform Configuration

GKE is provisioned via the modules/gke-standard/ local module, called from environments/{env}-01-base/gke.tf. The module uses direct Terraform resources (google_container_cluster, google_container_node_pool) and encapsulates naming, labels, and security defaults. Each environment calls the module with different parameters (machine types, node counts, location).

The module enforces:

  • Dataplane V2 (ADVANCED_DATAPATH) for built-in network policy.
  • Workload Identity on all nodes.
  • GCS FUSE CSI driver add-on (for mounting GCS buckets as volumes — used by DAG sync).
  • Shielded instances (secure boot + integrity monitoring).
  • Private nodes with public endpoint (restricted via authorized networks).
  • Legacy metadata endpoints disabled.
  • Surge upgrade defaults (max_surge=1, max_unavailable=0).
  • Mandatory labels on cluster and node pools.

All settings are exposed as variables with sensible defaults so environments can override without editing the module.
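
A hypothetical call site showing the shape of that interface (the variable names are illustrative, not the module's actual contract):

```hcl
module "gke" {
  source = "../../modules/gke-standard"

  location             = "us-central1-a"   # zonal for the dev PoC
  default_machine_type = "e2-standard-2"
  kpo_machine_type     = "e2-standard-2"
  kpo_max_nodes        = 10
}
```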

In-cluster resources (Helm releases, ingress config, operators) are provisioned in environments/{env}-02-runtime/ using the helm and kubernetes Terraform providers, authenticated via the GKE cluster credentials from {env}-01-base remote state.

## Workload Identity Bindings

| Kubernetes SA | Namespace | Google SA | Phase | Purpose |
| --- | --- | --- | --- | --- |
| airflow | airflow | ume-airflow | 1 | Cloud SQL IAM auth, Secret Manager, BigQuery, GCS |
| airflow-kpo | airflow-kpo | ume-airflow-kpo | 1 | Scoped identity for KPO tasks (BigQuery, GCS only) |
| datahub-gms | datahub | datahub-sa | 2 | Cloud SQL IAM auth, Secret Manager |
| datahub-frontend | datahub | datahub-sa | 2 | Secret Manager (OAuth client secret) |
| strimzi-operator | kafka | (none needed) | 2 | Operator runs cluster-internal only |
| opensearch-operator | opensearch | (none needed) | 2 | Operator runs cluster-internal only |
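
Each row reduces to the same two-sided binding: an iam.workloadIdentityUser grant on the Google SA and an annotation on the Kubernetes SA. A sketch for the airflow row (var.project_id and resource names are illustrative):

```hcl
resource "google_service_account" "airflow" {
  account_id = "ume-airflow"
}

# Allow the KSA airflow/airflow to impersonate the Google SA.
resource "google_service_account_iam_member" "airflow_wi" {
  service_account_id = google_service_account.airflow.name
  role               = "roles/iam.workloadIdentityUser"
  member             = "serviceAccount:${var.project_id}.svc.id.goog[airflow/airflow]"
}

# Annotate the KSA so GKE maps pod credentials to the Google SA.
resource "kubernetes_service_account" "airflow" {
  metadata {
    name      = "airflow"
    namespace = "airflow"
    annotations = {
      "iam.gke.io/gcp-service-account" = google_service_account.airflow.email
    }
  }
}
```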