# Requirements

This section captures what the infrastructure must deliver and the quality attributes it must satisfy. Requirements are grouped by category and tagged with their wave (wave-1 = dev, wave-2 = prod rollout, future = backlog).

## Functional Requirements

### Airflow (Orchestration)

| ID | Requirement | Wave |
| --- | --- | --- |
| AIR-01 | Deploy Airflow on GKE Standard via the official Apache Airflow Helm chart with CeleryExecutor | wave-1 |
| AIR-02 | Custom image extending the Apache Airflow base with astronomer-cosmos, dbt-core, and dbt-bigquery | wave-1 |
| AIR-03 | DAGs synced via git-sync sidecar from the Git repository | wave-1 |
| AIR-04 | dbt project synced alongside DAGs via git-sync, readable by Cosmos at /opt/airflow/dags/repo/dbt | wave-1 |
| AIR-05 | Image tag updates flow via bot PR to the infra repo, preserving Terraform plan review | wave-1 |
| AIR-06 | Prod uses the same image tag validated in dev; only Helm values differ for sizing | wave-2 |
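AIR-01 through AIR-04 can be sketched as a Helm values fragment for the official Apache Airflow chart. The repository URL, image name, and tag below are placeholders, not the project's real values:

```yaml
# Sketch of Airflow Helm values (apache-airflow/airflow chart); all names
# and tags are placeholders for illustration only.
executor: CeleryExecutor
images:
  airflow:
    repository: europe-docker.pkg.dev/EXAMPLE_PROJECT/airflow/airflow-dbt  # placeholder
    tag: "example-tag"                                                     # placeholder
dags:
  gitSync:
    enabled: true
    repo: https://github.com/EXAMPLE_ORG/data-platform.git  # placeholder
    branch: main
    # git-sync checks the repo out under /opt/airflow/dags/repo/, so a dbt/
    # directory in the repo is visible to Cosmos at /opt/airflow/dags/repo/dbt
```

Because the dbt project rides along in the same checkout, no separate sync mechanism is needed for Cosmos.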

### DataHub

| ID | Requirement | Wave |
| --- | --- | --- |
| DH-01 | Deploy DataHub via Helm on GKE Standard, backed by Cloud SQL (Postgres), Strimzi Kafka, and OpenSearch | wave-1 |
| DH-02 | Authenticate users via Google OIDC, restricted to the organization domain | wave-1 |
| DH-03 | DataHub internal groups govern access levels (admin, editor, viewer) | wave-1 |
| DH-04 | Ingestion recipes for BigQuery metadata, Airflow DAGs, and dbt manifests run as scheduled Airflow DAGs | wave-1 |
| DH-05 | Helm values are parametrized: dev uses minimal replicas/resources; prod scales via tfvars | wave-1 |
| DH-06 | OpenSearch snapshots to GCS on a schedule; Cloud SQL has automated backups with PITR | wave-1 |
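For DH-04, the BigQuery recipe might look like the sketch below. The project ID and the in-cluster GMS service name are placeholders, and the exact config keys vary with the acryl-datahub version in use:

```yaml
# Illustrative DataHub ingestion recipe; values are placeholders.
source:
  type: bigquery
  config:
    project_ids: ["EXAMPLE_PROJECT"]        # placeholder project
    include_table_lineage: true
sink:
  type: datahub-rest
  config:
    server: http://datahub-datahub-gms:8080  # assumed in-cluster GMS service name
```

Running this from an Airflow DAG keeps ingestion on the same scheduler and alerting path as the rest of the platform.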

### GKE

| ID | Requirement | Wave |
| --- | --- | --- |
| GKE-01 | GKE Standard cluster: zonal for dev PoC, regional for prod | wave-1 |
| GKE-02 | Three node pools: default-pool (Airflow + system), kpo-pool (spot, scale-to-zero batch), workload-pool (DataHub, Phase 2) | wave-1 |
| GKE-03 | Workload Identity enabled cluster-wide | wave-1 |
| GKE-04 | GKE Ingress (GCLB) with Certificate Manager for GCP-issued wildcard TLS | wave-1 |
| GKE-05 | Secret Manager CSI driver for in-cluster secret injection | wave-1 |
| GKE-06 | Node Auto-Provisioning for heavy pod scheduling | wave-1 |
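GKE-01 through GKE-03 could look roughly like the Terraform sketch below (google provider). The project, zone, pool ceiling, and resource names are placeholders, not validated sizing:

```hcl
# Sketch only: zonal dev cluster with Workload Identity and a spot,
# scale-to-zero pool for KubernetesPodOperator batch work.
resource "google_container_cluster" "dev" {
  name                     = "dev-gke"        # placeholder
  location                 = "europe-west1-b" # zonal for dev; a region for prod
  remove_default_node_pool = true
  initial_node_count       = 1

  workload_identity_config {
    workload_pool = "EXAMPLE_PROJECT.svc.id.goog" # placeholder project
  }
}

resource "google_container_node_pool" "kpo" {
  name               = "kpo-pool"
  cluster            = google_container_cluster.dev.id
  initial_node_count = 0

  autoscaling {
    min_node_count = 0 # scale-to-zero when no batch pods are pending
    max_node_count = 5 # placeholder ceiling
  }

  node_config {
    spot = true # spot VMs for interruptible batch work
  }
}
```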

### CI/CD

| ID | Requirement | Wave |
| --- | --- | --- |
| CI-01 | GitHub Actions with Workload Identity Federation (no SA keys) | wave-1 |
| CI-02 | terraform-plan on PRs; changed-stack detection; plan posted as PR comment | wave-1 |
| CI-03 | terraform-apply on merge to main; dev auto-apply | wave-1 |
| CI-04 | terraform-drift daily; drift surfaced via Slack or GitHub issue | wave-1 |
| CI-05 | Prod apply gated by GitHub Environment required-reviewers | wave-2 |
| CI-06 | Semantic-release: git tag on main triggers prod apply | future |
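CI-01 and CI-02 could be wired roughly as below. The workload identity provider path and service-account email are placeholders:

```yaml
# Sketch of the PR plan workflow; identifiers are placeholders.
name: terraform-plan
on:
  pull_request:
    branches: [main]

permissions:
  id-token: write      # required for Workload Identity Federation (no SA keys)
  contents: read
  pull-requests: write # lets the job post the plan as a PR comment

jobs:
  plan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: google-github-actions/auth@v2
        with:
          workload_identity_provider: projects/000000/locations/global/workloadIdentityPools/ci/providers/github # placeholder
          service_account: terraform-ci@EXAMPLE_PROJECT.iam.gserviceaccount.com # placeholder
      - uses: hashicorp/setup-terraform@v3
      - run: terraform init -input=false && terraform plan -input=false -out=tfplan
```

Changed-stack detection and the PR-comment step are omitted here; the point is the keyless auth flow via the OIDC token.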

### Observability

| ID | Requirement | Wave |
| --- | --- | --- |
| OBS-01 | Google Managed Prometheus enabled on GKE | wave-1 |
| OBS-02 | Cloud Monitoring alert policies for Kafka, OpenSearch, DataHub, Cloud SQL, and GKE autoscaler events | wave-1 |
| OBS-03 | Project-level budget alerts at 50%, 80%, 100% thresholds | wave-1 |
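OBS-03 maps fairly directly onto a Terraform budget resource; billing account and project number below are placeholders:

```hcl
# Sketch of the project budget with the three alert thresholds from OBS-03.
resource "google_billing_budget" "dev" {
  billing_account = var.billing_account  # placeholder variable
  display_name    = "dev-monthly-budget"

  budget_filter {
    projects = ["projects/000000000000"] # placeholder project number
  }

  amount {
    specified_amount {
      currency_code = "USD"
      units         = "500" # matches the dev cost target below
    }
  }

  threshold_rules { threshold_percent = 0.5 }
  threshold_rules { threshold_percent = 0.8 }
  threshold_rules { threshold_percent = 1.0 }
}
```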

## Non-Functional Requirements

### Availability

- Dev SLO: best-effort. Acceptable to lose availability during maintenance or experiments. Maintenance windows outside business hours are preferred but not mandatory.
- Prod SLO (target): 99.5% monthly availability for Airflow and DataHub. Achieved via regional GKE, multi-zone spreading, and automated failover of Cloud SQL.
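The prod target translates into a concrete error budget; a quick sketch of the arithmetic, assuming a 30-day month:

```python
# Error budget implied by an availability SLO: 99.5% over a 30-day month
# leaves 0.5% of 43,200 minutes, i.e. ~216 minutes (~3.6 hours) of downtime.
def monthly_error_budget_minutes(slo: float, days: int = 30) -> float:
    """Minutes of allowed downtime per month at a given availability SLO."""
    return (1 - slo) * days * 24 * 60

print(f"{monthly_error_budget_minutes(0.995):.1f}")  # prints "216.0"
```

Roughly 3.6 hours per month of allowed downtime is what makes regional GKE plus Cloud SQL automated failover sufficient without a multi-region design.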

### Performance

- Airflow: DAG parsing time under 60 seconds for the full DAG bag. DAG sync latency under 2 minutes from push to scheduler pickup (via git-sync).
- DataHub: search latency under 2 seconds for 95th-percentile queries. Metadata ingestion backlog (Kafka consumer lag) clears within 10 minutes after a full crawl.
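The p95 latency target can be verified from sampled measurements; a minimal standard-library sketch (the sample values are illustrative, not real benchmarks):

```python
import statistics

def p95_latency(samples: list[float]) -> float:
    """95th-percentile latency, interpolated over the sample ('inclusive' method)."""
    # quantiles(n=100) returns the 1st..99th percentile cut points,
    # so index 94 is the 95th percentile.
    return statistics.quantiles(samples, n=100, method="inclusive")[94]

samples = [0.4, 0.6, 0.7, 0.9, 1.1, 1.3, 1.6, 1.8, 2.4, 3.0]  # seconds, illustrative
print(f"p95 = {p95_latency(samples):.2f}s")  # "p95 = 2.73s" -> would breach the 2 s target
```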

### Security

- No long-lived service-account keys anywhere in the platform.
- All inter-service communication over private IP or within the GKE cluster network.
- Cloud SQL accessible only via Private Service Access (no public IP).
- GKE API server restricted to authorized networks (CI runners + operator IPs).
- Secrets stored in Secret Manager; values never committed to code.
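Together with GKE-05, the last point implies a SecretProviderClass per consuming workload. A sketch, where the secret name and project are placeholders and the `provider` value depends on the chosen integration (`gke` for the built-in GKE Secret Manager add-on, `gcp` for the standalone provider):

```yaml
# Illustrative SecretProviderClass; names are placeholders.
apiVersion: secrets-store.csi.x-k8s.io/v1
kind: SecretProviderClass
metadata:
  name: airflow-secrets   # placeholder
spec:
  provider: gcp           # 'gke' if using the GKE Secret Manager add-on
  parameters:
    secrets: |
      - resourceName: "projects/EXAMPLE_PROJECT/secrets/airflow-fernet-key/versions/latest"
        path: "fernet-key"
```

The secret is mounted as a file in the pod, so nothing sensitive ever lands in Git or in plain Kubernetes Secrets managed by hand.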

### Cost

- Dev monthly budget target: under $500/month for the full stack (GKE nodes + Cloud SQL + network egress). Airflow runs on GKE, so its cost is included in GKE node spend. This is a soft target; actual cost will be measured after deployment and optimized iteratively.
- Labels for attribution: every Terraform-managed resource carries env, layer, service, owner, cost_center labels.
- Prod cost model: deferred until prod sizing is validated on dev workloads.
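The attribution labels are easiest to enforce as one shared map merged into every resource; a Terraform sketch with placeholder values:

```hcl
# Sketch: a single label map keeps the five required keys consistent
# across all Terraform-managed resources. Values are placeholders.
locals {
  common_labels = {
    env         = "dev"
    layer       = "platform"
    service     = "airflow"
    owner       = "data-platform"
    cost_center = "cc-0000" # placeholder
  }
}

resource "google_storage_bucket" "artifacts" {
  name     = "example-dev-artifacts" # placeholder; bucket names are globally unique
  location = "EU"
  labels   = local.common_labels
}
```

Centralizing the map means a billing export grouped by any of the five keys stays complete as new resources are added.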

### Operability

- Every component has a documented runbook in Operations.
- DataHub upgrades follow a dev-first, migration-preflight, then prod-promote workflow.
- Node pool upgrades use a surge strategy with a zero-downtime guarantee.
- Kafka and OpenSearch have documented backup and restore procedures.