# Requirements
This section captures what the infrastructure must deliver and the quality attributes it must satisfy. Requirements are grouped by category and tagged with their wave (wave-1 = dev, wave-2 = prod rollout, future = backlog).
## Functional Requirements
### Airflow (Orchestration)
### DataHub
### GKE
### CI/CD
### Observability
Non-Functional Requirements
#
Availability
- Dev SLO: best-effort. Acceptable to lose availability during maintenance or experiments. Maintenance windows outside business hours are preferred but not mandatory.
- Prod SLO (target): 99.5% monthly availability for Airflow and DataHub. Achieved via regional GKE, multi-zone spreading, and automated failover of Cloud SQL.
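
To make the prod target concrete, the 99.5% figure can be translated into an allowed-downtime budget. The sketch below is illustrative only: the helper name `error_budget` and the 30-day window are assumptions, since the requirement says "monthly" without defining the window.

```python
from datetime import timedelta


def error_budget(slo: float, window: timedelta) -> timedelta:
    """Return the total downtime an availability SLO permits over a window."""
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be a fraction in (0, 1]")
    return window * (1.0 - slo)


# Assumption: a 30-day month; the requirement does not define the window.
budget = error_budget(0.995, timedelta(days=30))
print(f"Allowed downtime: {budget.total_seconds() / 3600:.1f} hours")
```

Under that assumption the budget comes to about 3.6 hours per month, shared across planned maintenance and unplanned outages.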
### Performance
- Airflow: DAG parsing time under 60 seconds for the full DAG bag. DAG sync latency under 2 minutes from push to scheduler pickup (via git-sync).
- DataHub: 95th-percentile (p95) search latency under 2 seconds. Metadata ingestion backlog (Kafka consumer lag) clears within 10 minutes after a full crawl.
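
One way to check the DataHub search target against measured latencies is a nearest-rank percentile. This is a minimal sketch, not the platform's actual monitoring logic: the function names are hypothetical, and the choice of the nearest-rank method is an assumption (the requirement does not say how the percentile is computed).

```python
import math


def percentile_nearest_rank(samples: list[float], pct: int) -> float:
    """Return the pct-th percentile of samples using the nearest-rank method."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be an integer in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def meets_search_slo(latencies_s: list[float], threshold_s: float = 2.0) -> bool:
    """True when the p95 latency is strictly under the threshold (seconds)."""
    return percentile_nearest_rank(latencies_s, 95) < threshold_s
```

With 100 samples, up to five slow queries can exceed 2 seconds before the check fails; a sixth pushes the p95 value over the threshold.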
### Security
- No long-lived service-account keys anywhere in the platform.
- All inter-service communication over private IP or within the GKE cluster network.
- Cloud SQL accessible only via Private Service Access (no public IP).
- GKE API server restricted to authorized networks (CI runners + operator IPs).
- Secrets stored in Secret Manager; values never committed to code.
### Cost
- Dev monthly budget target: under $500/month for the full stack (GKE nodes + Cloud SQL + network egress). Airflow runs on GKE, so its cost is included in GKE node spend. This is a soft target; actual cost will be measured after deployment and optimized iteratively.
- Labels for attribution: every Terraform-managed resource carries `env`, `layer`, `service`, `owner`, and `cost_center` labels.
- Prod cost model: deferred until prod sizing is validated on dev workloads.
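
The labelling requirement lends itself to an automated check, for example in CI against planned resources. The sketch below assumes resource labels are available as a plain dictionary; the helper `missing_labels` and the sample label values are hypothetical, and only the five label keys come from the requirement itself.

```python
# Label keys required on every Terraform-managed resource.
REQUIRED_LABELS = ("env", "layer", "service", "owner", "cost_center")


def missing_labels(labels: dict[str, str]) -> list[str]:
    """Return the required label keys that are absent or have an empty value."""
    return [key for key in REQUIRED_LABELS if not labels.get(key)]


# Hypothetical example values, for illustration only.
example = {"env": "dev", "service": "airflow", "owner": "platform"}
print(missing_labels(example))
```

An empty result means the resource satisfies the attribution requirement; anything else lists the keys that still need to be set.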
### Operability
- Every component has a documented runbook in Operations.
- DataHub upgrades follow a dev-first, migration-preflight, then prod-promote workflow.
- Node pool upgrades use the surge upgrade strategy and must complete without workload downtime.
- Kafka and OpenSearch have documented backup and restore procedures.