# GKE Platform

The GKE Standard cluster hosts Airflow (Phase 1) and DataHub with its dependencies (Phase 2) as well as platform add-ons (ingress, observability, secrets injection). This section covers the cluster design, node pool strategy, zero-downtime recipe, and ingress/TLS configuration.

## Cluster Design

### Why GKE Standard over Autopilot

Autopilot simplifies operations but imposes constraints and cost premiums that work against our needs:

| Factor | Standard | Autopilot |
| --- | --- | --- |
| Cost (dev, ~4 vCPU / 8 GB aggregate) | ~$50-80/mo | ~$100-200/mo |
| Stateful workload support | Full control over node config, DaemonSets, hostPath | Restricted; some Strimzi/OpenSearch features limited |
| Node-level tuning | Custom machine types, spot nodes for batch pools | Google chooses machine types |
| Scale-to-zero (node pools) | Supported (per-pool min_node_count=0) | Google manages scaling |
| Ops burden | Moderate (mitigated by automation) | Near-zero |

For Kafka and OpenSearch on GKE (Phase 2), Standard provides the necessary flexibility. The ops burden is mitigated by the zero-downtime recipe documented below.

### Cluster configuration

| Setting | Dev (PoC) | Dev (hardened) | Prod |
| --- | --- | --- | --- |
| Type | Zonal | Regional | Regional |
| Zones | 1 | 3 | 3 |
| Release channel | Regular | Regular | Regular |
| Maintenance window | Weekdays 02:00-06:00 UTC | Same | Weekends 02:00-06:00 UTC |
| Private cluster | Yes (private nodes, public endpoint with authorized networks) | Yes | Yes |
| Workload Identity | Enabled | Enabled | Enabled |
| Binary Authorization | Disabled (wave-1) | Evaluate for wave-2 | Evaluate |
| Network policy | Dataplane V2 (built-in) | Dataplane V2 (built-in) | Dataplane V2 (built-in) |

Why zonal for dev PoC: one node per pool instead of three, since a regional cluster replicates each pool across all three zones. Regional is deferred to prod or whenever HA is required. The GKE free tier ($74.40/mo credit) covers one zonal cluster; regional clusters pay the full $0.10/hr management fee (~$74/mo).

### Node resource reservations

GKE reserves CPU and memory on every worker node for kubelet, kube-proxy, containerd, and eviction thresholds. You pay for the full VM but can only schedule pods into the allocatable portion.

CPU reservation (dedicated-core machines):

| Core range | Reserved |
| --- | --- |
| 1st core | 6% (60m) |
| 2nd core | 1% (10m) |
| 3rd-4th cores | 0.5% each (5m each) |
| 5th+ cores | 0.25% each |

CPU reservation (shared-core E2): a flat 1060 millicores, so these machines lose over half their nominal 2 vCPU. Avoid shared-core machine types for GKE.

Memory reservation:

| Capacity range | Reserved |
| --- | --- |
| First 4 GiB | 25% (1024 MiB) |
| 4-8 GiB | 20% (819 MiB) |
| 8-16 GiB | 10% |
| 16-128 GiB | 6% |
| Above 128 GiB | 2% |
| Plus (every node) | 100 MiB eviction threshold |
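
Worked example (e2-standard-2, 2 vCPU / 8 GiB): CPU reserved is 60m + 10m = 70m, leaving 1930m allocatable; memory reserved is 25% of the first 4 GiB (1024 MiB) plus 20% of the next 4 GiB (819 MiB) plus the 100 MiB eviction threshold, roughly 1.9 GiB, leaving ~6.1 GiB. The same arithmetic produces the allocatable figures in the table below.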

### Machine type selection

| Machine | vCPU | RAM | Alloc. CPU | Alloc. RAM | $/mo | Verdict |
| --- | --- | --- | --- | --- | --- | --- |
| e2-small | 2 (shared) | 2 GiB | 940m | ~1.4 GiB | $12 | Unusable |
| e2-medium | 2 (shared) | 4 GiB | 940m | ~2.9 GiB | $24 | Marginal |
| e2-standard-2 | 2 | 8 GiB | 1930m | ~6.1 GiB | $49 | Phase 1 node |
| e2-standard-4 | 4 | 16 GiB | 3920m | ~13.3 GiB | $98 | Phase 2 workload nodes |
| e2-standard-8 | 8 | 32 GiB | 7900m | ~28 GiB | $195 | Prod workload nodes |

## Node Pools

### default-pool — Airflow + system services

Hosts Airflow (scheduler, Celery worker, webserver, triggerer, Redis) and lightweight system components (ingress controller, CSI driver, metrics agent). Single node in Phase 1.

| Setting | Dev (PoC) | Dev (hardened) | Prod |
| --- | --- | --- | --- |
| Machine type | e2-standard-2 | e2-standard-2 | e2-standard-4 |
| Min nodes | 1 | 1 | 3 |
| Max nodes | 2 | 3 | 6 |
| Autoscaling | Cluster Autoscaler | Cluster Autoscaler | Cluster Autoscaler |
| Spot | No | No | No |
| Surge upgrade | max_surge=1, max_unavailable=0 | Same | Same |
| Taints | None (default scheduling) | None | None |

Phase 1 resource budget (1x e2-standard-2, ~1930m CPU / ~6.1 GiB allocatable):

| Consumer | CPU request | Memory request |
| --- | --- | --- |
| Airflow scheduler | 500m | 1.5 Gi |
| Celery worker (1) | 250m | 1 Gi |
| Airflow webserver | 250m | 512 Mi |
| Airflow triggerer | 100m | 256 Mi |
| Redis | 50m | 128 Mi |
| System pods (kube-system) | ~300m | ~400 Mi |
| Used | ~1450m | ~3.8 Gi |
| Remaining headroom | ~480m | ~2.3 Gi |

This budget is snug but workable because dbt-bigquery is I/O-bound: it submits SQL to BigQuery and waits. CPU spikes during dbt compile are brief, and pods can burst above their requests up to their limits.

Scaling signals and upgrade path: see the Monitoring and Alerting section of Airflow on GKE for thresholds. Upgrade path: e2-standard-2 ($49/mo) → e2-standard-4 ($98/mo, 3920m / 13.3 GiB), or keep e2-standard-2 with min_nodes=2 ($98/mo, 3860m aggregate, better fault tolerance).

### kpo-pool — KubernetesPodOperator tasks

Ephemeral nodes for on-demand batch work: dbt runs via KPO, data quality checks, ingestion jobs. Scales from 0 to 10 nodes. Uses spot VMs for ~69% savings.

| Setting | Dev (PoC) | Prod |
| --- | --- | --- |
| Machine type | e2-standard-2 | e2-standard-4 |
| Min nodes | 0 | 0 |
| Max nodes | 10 | 20 |
| Autoscaling | Cluster Autoscaler | Cluster Autoscaler |
| Spot | Yes | Yes (with on-demand fallback via NAP) |
| Surge upgrade | max_surge=1, max_unavailable=0 | Same |
| Taints | workload=kpo:NoSchedule | Same |
| Labels | pool=kpo | Same |
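
A minimal Terraform sketch of this pool, assuming the google provider's google_container_node_pool schema (resource names and the cluster reference are illustrative):

```hcl
resource "google_container_node_pool" "kpo" {
  name    = "kpo-pool"
  cluster = google_container_cluster.main.id   # illustrative reference

  autoscaling {
    min_node_count = 0    # scale to zero when no KPO tasks are pending
    max_node_count = 10
  }

  node_config {
    machine_type = "e2-standard-2"
    spot         = true                 # preemptible capacity, ~69% cheaper
    labels       = { pool = "kpo" }

    taint {
      key    = "workload"
      value  = "kpo"
      effect = "NO_SCHEDULE"            # only pods with a matching toleration land here
    }
  }
}
```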

Scale-to-zero flow:

  1. Airflow triggers a KPO task.
  2. KPO creates a pod with a toleration for workload=kpo:NoSchedule and nodeSelector pool: kpo (shown as a fragment after this list).
  3. Pod is Pending — no nodes exist in the pool.
  4. Cluster Autoscaler detects the pending pod (~30 seconds).
  5. Spot VM provisioned (~60-90 seconds).
  6. Pod runs, completes, is cleaned up.
  7. After ~10 minutes idle (scaleDownUnneededTime), the autoscaler removes the empty node.
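
The toleration and selector from step 2, as a pod-spec fragment (standard Kubernetes fields; the key and label values come from the kpo-pool table above):

```yaml
spec:
  nodeSelector:
    pool: kpo              # pins the pod to kpo-pool nodes
  tolerations:
    - key: workload
      operator: Equal
      value: kpo
      effect: NoSchedule   # matches the pool taint workload=kpo:NoSchedule
```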

Spot pricing: ~$15/mo vs ~$49/mo on-demand for e2-standard-2. dbt tasks are idempotent, so if a spot node is preempted mid-task, Airflow simply retries.

### workload-pool — DataHub stack (Phase 2)

Hosts Kafka brokers, OpenSearch data nodes, and DataHub services. Not created in Phase 1. Added when DataHub work begins.

| Setting | Dev | Prod |
| --- | --- | --- |
| Machine type | e2-standard-4 | e2-standard-8 |
| Min nodes | 2 | 6 |
| Max nodes | 6 | 30 |
| Autoscaling | Cluster Autoscaler | Cluster Autoscaler + NAP |
| Spot | No (stateful workloads need consistent uptime) | No |
| Taints | None | None |

## Zero-Downtime Operations

The combination of the following mechanisms ensures that node replacements, upgrades, and scaling events do not interrupt running workloads.

### Surge upgrades

```hcl
max_surge       = 1   # Add 1 new node before draining the old one
max_unavailable = 0   # Never remove a node without adding a replacement first
```

During a node pool upgrade:

  1. GKE creates a new node with the updated version.
  2. The old node is cordoned (no new pods scheduled).
  3. The old node is drained (existing pods are evicted, respecting PDBs).
  4. Once all pods are safely rescheduled, the old node is deleted.
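
In Terraform these two settings sit in the node pool's upgrade_settings block; a minimal sketch (pool name and cluster reference are illustrative):

```hcl
resource "google_container_node_pool" "default" {
  name    = "default-pool"
  cluster = google_container_cluster.main.id

  upgrade_settings {
    max_surge       = 1   # one surge node added per upgrade step
    max_unavailable = 0   # capacity never drops while nodes roll
  }
}
```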

### PodDisruptionBudgets (PDBs)

PDBs tell Kubernetes how many pods of a set must remain available during voluntary disruptions (upgrades, scaling down).

| Workload | PDB | Phase | Source |
| --- | --- | --- | --- |
| Airflow scheduler | minAvailable: 1 | 1 | Airflow Helm values |
| Airflow webserver | minAvailable: 1 | 1 | Airflow Helm values |
| Kafka brokers | minAvailable: 2 (of 3) | 2 | Strimzi operator (automatic) |
| OpenSearch data nodes | minAvailable: 2 (of 3) | 2 | OpenSearch operator (automatic) |
| DataHub GMS | minAvailable: 1 | 2 | DataHub Helm values |
| DataHub Frontend | minAvailable: 1 | 2 | DataHub Helm values |
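
The manifests behind these entries are small. A sketch matching the Airflow scheduler row; the label selector is hypothetical and must match the labels the Helm chart puts on scheduler pods:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: airflow-scheduler
  namespace: airflow
spec:
  minAvailable: 1            # keep at least one scheduler through voluntary disruptions
  selector:
    matchLabels:
      component: scheduler   # hypothetical; align with the chart's pod labels
```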

### Topology spreading (Phase 2, regional cluster)

Stateful workloads spread across zones to survive zone-level failures:

```yaml
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:      # needed for the constraint to select pods;
      matchLabels:      # the label is illustrative and must match the workload
        app: kafka
```

Applied to: Kafka brokers, OpenSearch data nodes, DataHub GMS replicas (in prod). Not applicable in Phase 1 (zonal cluster).

### Maintenance window

Node auto-upgrades respect the maintenance window. Combined with surge upgrades and PDBs, upgrades happen during off-hours with zero service interruption.

### Autoscaling behavior

  • Cluster Autoscaler monitors pending pods. When pods can't schedule due to insufficient resources, it adds nodes (up to max_nodes). When nodes are underutilized for 10+ minutes, it scales down (respecting PDBs).
  • HPA (Horizontal Pod Autoscaler) scales DataHub GMS and Frontend pods based on CPU/memory utilization. Only in prod (dev uses fixed replicas).
  • Strimzi Cruise Control rebalances Kafka partitions across brokers after scaling events (Phase 2).
  • OpenSearch automatically redistributes shards when nodes join or leave the cluster (Phase 2).

### Alert on scaling limits

Cloud Monitoring alert fires when a node pool reaches its max_nodes count. This means the autoscaler cannot add more capacity and workloads may queue. See Observability.

## Ingress and TLS

### GKE Ingress (GCLB)

GKE's built-in Ingress controller provisions a Google Cloud L7 Load Balancer (GCLB) for each Ingress resource. No nginx-ingress, no Istio, no Traefik.

Advantages:

  • Zero ops: GKE manages the load balancer lifecycle.
  • Native integration with Certificate Manager for GCP-issued TLS.
  • Native integration with Cloud Armor for WAF (future).
  • Native integration with IAP for zero-trust access (future).

### Certificate Manager

TLS certificates are issued by GCP Certificate Manager. Wildcard certificate via DNS authorization:

*.data.ume.com.br → Certificate Manager → GCLB → GKE Ingress

| Setting | Value |
| --- | --- |
| Certificate type | Google-managed |
| Authorization | DNS (Cloud DNS) |
| Scope | Wildcard (*.data.ume.com.br or similar) |
| Renewal | Automatic (managed by GCP) |

No cert-manager pods. No Let's Encrypt ACME challenges. No certificate rotation runbooks. GCP handles everything.
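
A Terraform sketch of the wildcard setup, assuming the google provider's Certificate Manager resources (resource names and the exact domain are illustrative):

```hcl
resource "google_certificate_manager_dns_authorization" "data" {
  name   = "data-wildcard-authz"
  domain = "data.ume.com.br"          # parent domain validated via Cloud DNS
}

resource "google_certificate_manager_certificate" "data" {
  name = "data-wildcard-cert"
  managed {
    domains            = ["data.ume.com.br", "*.data.ume.com.br"]
    dns_authorizations = [google_certificate_manager_dns_authorization.data.id]
  }
}
```

The certificate typically reaches the GCLB through a certificate map referenced from the Ingress, as sketched in the routing section below.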

### Ingress routing

| Host | Backend | Phase | Notes |
| --- | --- | --- | --- |
| airflow.{domain} | Airflow webserver service | 1 (Story 4c) | Google OIDC auth, port-forward initially |
| datahub.{domain} | DataHub Frontend service | 2 | Google OIDC auth |
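
An illustrative Ingress for the airflow row. The networking.gke.io/certmap annotation attaches a Certificate Manager certificate map to the GCLB; the map name, host, and service name/port are assumptions to verify against the actual Helm release:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: airflow
  namespace: airflow
  annotations:
    networking.gke.io/certmap: data-cert-map   # hypothetical Certificate Manager map
spec:
  rules:
    - host: airflow.data.ume.com.br            # i.e. airflow.{domain}
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: airflow-webserver        # assumed service name from the chart
                port:
                  number: 8080
```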

## Terraform Configuration

GKE is provisioned via the modules/gke-standard/ local module, called from environments/{env}-01-base/gke.tf. The module uses direct Terraform resources (google_container_cluster, google_container_node_pool) and encapsulates naming, labels, and security defaults. Each environment calls the module with different parameters (machine types, node counts, location).

The module enforces:

  • Dataplane V2 (ADVANCED_DATAPATH) for built-in network policy.
  • Workload Identity on all nodes.
  • GCS FUSE CSI driver add-on (for mounting GCS buckets as volumes — used by DAG sync).
  • Shielded instances (secure boot + integrity monitoring).
  • Private nodes with public endpoint (restricted via authorized networks).
  • Legacy metadata endpoints disabled.
  • Surge upgrade defaults (max_surge=1, max_unavailable=0).
  • Mandatory labels on cluster and node pools.

All settings are exposed as variables with sensible defaults so environments can override without editing the module.
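
A hypothetical call site showing the shape of that interface (the variable names are illustrative, not the module's actual contract):

```hcl
module "gke" {
  source = "../../modules/gke-standard"

  location             = "us-central1-a"   # zonal for the dev PoC
  default_machine_type = "e2-standard-2"
  kpo_machine_type     = "e2-standard-2"
  kpo_max_nodes        = 10
}
```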

In-cluster resources (Helm releases, ingress config, operators) are provisioned in environments/{env}-02-runtime/ using the helm and kubernetes Terraform providers, authenticated via the GKE cluster credentials from {env}-01-base remote state.

## Workload Identity Bindings

| Kubernetes SA | Namespace | Google SA | Phase | Purpose |
| --- | --- | --- | --- | --- |
| airflow | airflow | ume-airflow | 1 | Cloud SQL IAM auth, Secret Manager, BigQuery, GCS |
| airflow-kpo | airflow-kpo | ume-airflow-kpo | 1 | Scoped identity for KPO tasks (BigQuery, GCS only) |
| datahub-gms | datahub | datahub-sa | 2 | Cloud SQL IAM auth, Secret Manager |
| datahub-frontend | datahub | datahub-sa | 2 | Secret Manager (OAuth client secret) |
| strimzi-operator | kafka | (none needed) | 2 | Operator runs cluster-internal only |
| opensearch-operator | opensearch | (none needed) | 2 | Operator runs cluster-internal only |
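
Each row reduces to the same two-sided binding: an iam.workloadIdentityUser grant on the Google SA and an annotation on the Kubernetes SA. A sketch for the airflow row (var.project_id and resource names are illustrative):

```hcl
resource "google_service_account" "airflow" {
  account_id = "ume-airflow"
}

# Allow the KSA airflow/airflow to impersonate the Google SA.
resource "google_service_account_iam_member" "airflow_wi" {
  service_account_id = google_service_account.airflow.name
  role               = "roles/iam.workloadIdentityUser"
  member             = "serviceAccount:${var.project_id}.svc.id.goog[airflow/airflow]"
}

# Annotate the KSA so GKE maps pod credentials to the Google SA.
resource "kubernetes_service_account" "airflow" {
  metadata {
    name      = "airflow"
    namespace = "airflow"
    annotations = {
      "iam.gke.io/gcp-service-account" = google_service_account.airflow.email
    }
  }
}
```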