# Requirements

This section captures what the infrastructure must deliver and the quality attributes it must satisfy. Requirements are grouped by category and tagged with their wave (wave-1 = dev, wave-2 = prod rollout, future = backlog).

## Functional Requirements

### Airflow (Orchestration)

| ID | Requirement | Wave |
| --- | --- | --- |
| AIR-01 | Deploy Airflow on GKE Standard via the official Apache Airflow Helm chart with CeleryExecutor | wave-1 |
| AIR-02 | Custom image extending the Apache Airflow base with astronomer-cosmos, dbt-core, and dbt-bigquery | wave-1 |
| AIR-03 | DAGs synced via git-sync sidecar from the Git repository | wave-1 |
| AIR-04 | dbt project synced alongside DAGs via git-sync, readable by Cosmos at /opt/airflow/dags/repo/dbt | wave-1 |
| AIR-05 | Image tag updates flow via bot PR to the infra repo, preserving Terraform plan review | wave-1 |
| AIR-06 | Prod uses the same image tag validated in dev; only Helm values differ for sizing | wave-2 |
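AIR-01 through AIR-04 can be sketched as a Helm values fragment for the official Apache Airflow chart. The repository URL, image name, and tag below are placeholders, not the project's real values:

```yaml
# Sketch of Airflow Helm values (apache-airflow/airflow chart); all names
# and tags are placeholders for illustration only.
executor: CeleryExecutor
images:
  airflow:
    repository: europe-docker.pkg.dev/EXAMPLE_PROJECT/airflow/airflow-dbt  # placeholder
    tag: "example-tag"                                                     # placeholder
dags:
  gitSync:
    enabled: true
    repo: https://github.com/EXAMPLE_ORG/data-platform.git  # placeholder
    branch: main
    # git-sync checks the repo out under /opt/airflow/dags/repo/, so a dbt/
    # directory in the repo is visible to Cosmos at /opt/airflow/dags/repo/dbt
```

Because the dbt project rides along in the same checkout, no separate sync mechanism is needed for Cosmos.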

### DataHub

| ID | Requirement | Wave |
| --- | --- | --- |
| DH-01 | Deploy DataHub via Helm on GKE Standard, backed by Cloud SQL (Postgres), Strimzi Kafka, and OpenSearch | wave-1 |
| DH-02 | Authenticate users via Google OIDC, restricted to the organization domain | wave-1 |
| DH-03 | DataHub internal groups govern access levels (admin, editor, viewer) | wave-1 |
| DH-04 | Ingestion recipes for BigQuery metadata, Airflow DAGs, and dbt manifests run as scheduled Airflow DAGs | wave-1 |
| DH-05 | Helm values are parametrized: dev uses minimal replicas/resources; prod scales via tfvars | wave-1 |
| DH-06 | OpenSearch snapshots to GCS on a schedule; Cloud SQL has automated backups with PITR | wave-1 |
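For DH-04, the BigQuery recipe might look like the sketch below. The project ID and the in-cluster GMS service name are placeholders, and the exact config keys vary with the acryl-datahub version in use:

```yaml
# Illustrative DataHub ingestion recipe; values are placeholders.
source:
  type: bigquery
  config:
    project_ids: ["EXAMPLE_PROJECT"]        # placeholder project
    include_table_lineage: true
sink:
  type: datahub-rest
  config:
    server: http://datahub-datahub-gms:8080  # assumed in-cluster GMS service name
```

Running this from an Airflow DAG keeps ingestion on the same scheduler and alerting path as the rest of the platform.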

### GKE

| ID | Requirement | Wave |
| --- | --- | --- |
| GKE-01 | GKE Standard cluster: zonal for dev PoC, regional for prod | wave-1 |
| GKE-02 | Three node pools: default-pool (Airflow + system), kpo-pool (spot, scale-to-zero batch), workload-pool (DataHub, Phase 2) | wave-1 |
| GKE-03 | Workload Identity enabled cluster-wide | wave-1 |
| GKE-04 | GKE Ingress (GCLB) with Certificate Manager for GCP-issued wildcard TLS | wave-1 |
| GKE-05 | Secret Manager CSI driver for in-cluster secret injection | wave-1 |
| GKE-06 | Node Auto-Provisioning for heavy pod scheduling | wave-1 |
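GKE-01 through GKE-03 could look roughly like the Terraform sketch below (google provider). The project, zone, pool ceiling, and resource names are placeholders, not validated sizing:

```hcl
# Sketch only: zonal dev cluster with Workload Identity and a spot,
# scale-to-zero pool for KubernetesPodOperator batch work.
resource "google_container_cluster" "dev" {
  name                     = "dev-gke"        # placeholder
  location                 = "europe-west1-b" # zonal for dev; a region for prod
  remove_default_node_pool = true
  initial_node_count       = 1

  workload_identity_config {
    workload_pool = "EXAMPLE_PROJECT.svc.id.goog" # placeholder project
  }
}

resource "google_container_node_pool" "kpo" {
  name               = "kpo-pool"
  cluster            = google_container_cluster.dev.id
  initial_node_count = 0

  autoscaling {
    min_node_count = 0 # scale-to-zero when no batch pods are pending
    max_node_count = 5 # placeholder ceiling
  }

  node_config {
    spot = true # spot VMs for interruptible batch work
  }
}
```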

### CI/CD

| ID | Requirement | Wave |
| --- | --- | --- |
| CI-01 | GitHub Actions with Workload Identity Federation (no SA keys) | wave-1 |
| CI-02 | terraform-plan on PRs; changed-stack detection; plan posted as PR comment | wave-1 |
| CI-03 | terraform-apply on merge to main; dev auto-apply | wave-1 |
| CI-04 | terraform-drift daily; drift surfaced via Slack or GitHub issue | wave-1 |
| CI-05 | Prod apply gated by GitHub Environment required-reviewers | wave-2 |
| CI-06 | Semantic-release: git tag on main triggers prod apply | future |
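CI-01 and CI-02 could be wired roughly as below. The workload identity provider path and service-account email are placeholders:

```yaml
# Sketch of the PR plan workflow; identifiers are placeholders.
name: terraform-plan
on:
  pull_request:
    branches: [main]

permissions:
  id-token: write      # required for Workload Identity Federation (no SA keys)
  contents: read
  pull-requests: write # lets the job post the plan as a PR comment

jobs:
  plan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: google-github-actions/auth@v2
        with:
          workload_identity_provider: projects/000000/locations/global/workloadIdentityPools/ci/providers/github # placeholder
          service_account: terraform-ci@EXAMPLE_PROJECT.iam.gserviceaccount.com # placeholder
      - uses: hashicorp/setup-terraform@v3
      - run: terraform init -input=false && terraform plan -input=false -out=tfplan
```

Changed-stack detection and the PR-comment step are omitted here; the point is the keyless auth flow via the OIDC token.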

### Observability

| ID | Requirement | Wave |
| --- | --- | --- |
| OBS-01 | Google Managed Prometheus enabled on GKE | wave-1 |
| OBS-02 | Cloud Monitoring alert policies for Kafka, OpenSearch, DataHub, Cloud SQL, and GKE autoscaler events | wave-1 |
| OBS-03 | Project-level budget alerts at 50%, 80%, 100% thresholds | wave-1 |
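OBS-03 maps fairly directly onto a Terraform budget resource; billing account and project number below are placeholders:

```hcl
# Sketch of the project budget with the three alert thresholds from OBS-03.
resource "google_billing_budget" "dev" {
  billing_account = var.billing_account  # placeholder variable
  display_name    = "dev-monthly-budget"

  budget_filter {
    projects = ["projects/000000000000"] # placeholder project number
  }

  amount {
    specified_amount {
      currency_code = "USD"
      units         = "500" # matches the dev cost target below
    }
  }

  threshold_rules { threshold_percent = 0.5 }
  threshold_rules { threshold_percent = 0.8 }
  threshold_rules { threshold_percent = 1.0 }
}
```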

## Non-Functional Requirements

### Availability

- Dev SLO: best-effort. Acceptable to lose availability during maintenance or experiments. Maintenance windows outside business hours are preferred but not mandatory.
- Prod SLO (target): 99.5% monthly availability for Airflow and DataHub. Achieved via regional GKE, multi-zone spreading, and automated failover of Cloud SQL.
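The prod target translates into a concrete error budget; a quick sketch of the arithmetic, assuming a 30-day month:

```python
# Error budget implied by an availability SLO: 99.5% over a 30-day month
# leaves 0.5% of 43,200 minutes, i.e. ~216 minutes (~3.6 hours) of downtime.
def monthly_error_budget_minutes(slo: float, days: int = 30) -> float:
    """Minutes of allowed downtime per month at a given availability SLO."""
    return (1 - slo) * days * 24 * 60

print(f"{monthly_error_budget_minutes(0.995):.1f}")  # prints "216.0"
```

Roughly 3.6 hours per month of allowed downtime is what makes regional GKE plus Cloud SQL automated failover sufficient without a multi-region design.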

### Performance

- Airflow: DAG parsing time under 60 seconds for the full DAG bag. DAG sync latency under 2 minutes from push to scheduler pickup (via git-sync).
- DataHub: search latency under 2 seconds for 95th-percentile queries. Metadata ingestion backlog (Kafka consumer lag) clears within 10 minutes after a full crawl.
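The p95 latency target can be verified from sampled measurements; a minimal standard-library sketch (the sample values are illustrative, not real benchmarks):

```python
import statistics

def p95_latency(samples: list[float]) -> float:
    """95th-percentile latency, interpolated over the sample ('inclusive' method)."""
    # quantiles(n=100) returns the 1st..99th percentile cut points,
    # so index 94 is the 95th percentile.
    return statistics.quantiles(samples, n=100, method="inclusive")[94]

samples = [0.4, 0.6, 0.7, 0.9, 1.1, 1.3, 1.6, 1.8, 2.4, 3.0]  # seconds, illustrative
print(f"p95 = {p95_latency(samples):.2f}s")  # "p95 = 2.73s" -> would breach the 2 s target
```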

### Security

- No long-lived service-account keys anywhere in the platform.
- All inter-service communication over private IP or within the GKE cluster network.
- Cloud SQL accessible only via Private Service Access (no public IP).
- GKE API server restricted to authorized networks (CI runners + operator IPs).
- Secrets stored in Secret Manager; values never committed to code.
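Together with GKE-05, the last point implies a SecretProviderClass per consuming workload. A sketch, where the secret name and project are placeholders and the `provider` value depends on the chosen integration (`gke` for the built-in GKE Secret Manager add-on, `gcp` for the standalone provider):

```yaml
# Illustrative SecretProviderClass; names are placeholders.
apiVersion: secrets-store.csi.x-k8s.io/v1
kind: SecretProviderClass
metadata:
  name: airflow-secrets   # placeholder
spec:
  provider: gcp           # 'gke' if using the GKE Secret Manager add-on
  parameters:
    secrets: |
      - resourceName: "projects/EXAMPLE_PROJECT/secrets/airflow-fernet-key/versions/latest"
        path: "fernet-key"
```

The secret is mounted as a file in the pod, so nothing sensitive ever lands in Git or in plain Kubernetes Secrets managed by hand.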

### Cost

- Dev monthly budget target: under $500/month for the full stack (GKE nodes + Cloud SQL + network egress). Airflow runs on GKE, so its cost is included in GKE node spend. This is a soft target; actual cost will be measured after deployment and optimized iteratively.
- Labels for attribution: every Terraform-managed resource carries env, layer, service, owner, cost_center labels.
- Prod cost model: deferred until prod sizing is validated on dev workloads.
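The attribution labels are easiest to enforce as one shared map merged into every resource; a Terraform sketch with placeholder values:

```hcl
# Sketch: a single label map keeps the five required keys consistent
# across all Terraform-managed resources. Values are placeholders.
locals {
  common_labels = {
    env         = "dev"
    layer       = "platform"
    service     = "airflow"
    owner       = "data-platform"
    cost_center = "cc-0000" # placeholder
  }
}

resource "google_storage_bucket" "artifacts" {
  name     = "example-dev-artifacts" # placeholder; bucket names are globally unique
  location = "EU"
  labels   = local.common_labels
}
```

Centralizing the map means a billing export grouped by any of the five keys stays complete as new resources are added.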

### Operability

- Every component has a documented runbook in Operations.
- DataHub upgrades follow a dev-first, migration-preflight, then prod-promote workflow.
- Node pool upgrades use a surge strategy with a zero-downtime guarantee.
- Kafka and OpenSearch have documented backup and restore procedures.