# Introduction

This section establishes the scope and assumptions behind the infrastructure wave-1 implementation. It defines what we are building, what we are not building, and the constraints that shape our decisions.

# Purpose

The UME data platform, as described in the Architecture and Tools section, envisions a governed, self-service analytics environment. Before any of that vision materializes, we need the infrastructure underneath it. This documentation covers:

  • How the Terraform repository is organized (layers, environments, modules)
  • How CI/CD pipelines provision and maintain infrastructure
  • How Airflow on GKE Standard is deployed and wired to DAGs and dbt
  • How DataHub is deployed on GKE with its backing services
  • How we observe, alert on, and operate all of the above
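To orient the reader, a layered Terraform repository of this kind often looks like the sketch below. The directory names are illustrative assumptions (only 00-bootstrap is named elsewhere in this documentation), not the repository's actual layout:

```text
ume-data-infra/
├── layers/
│   ├── 00-bootstrap/    # state bucket, core APIs (assumes the project exists)
│   ├── 10-network/      # VPC, subnets, firewall rules
│   ├── 20-gke/          # GKE Standard cluster and node pools
│   └── 30-platform/     # Airflow, DataHub, Kafka, OpenSearch
├── modules/             # reusable building blocks shared by layers
└── envs/
    ├── dev.tfvars       # single project: poc-ume-data
    └── prod.tfvars      # per-concern project IDs (TBD)
```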

# Wave-1 Scope

Wave-1 is the first implementation cycle. It delivers a working development environment that validates tooling choices, interoperability, and developer experience before we commit to production.

# In scope

  • Airflow on GKE Standard - Deployed via the official Apache Airflow Helm chart with CeleryExecutor. Custom image containing astronomer-cosmos, dbt-core, and dbt-bigquery. DAGs delivered via git-sync sidecar. At least one dbt model running end-to-end via Cosmos.
  • DataHub - Self-hosted on GKE Standard. Backed by Cloud SQL (PostgreSQL), self-hosted Kafka (Strimzi), and self-hosted OpenSearch. Google OIDC for authentication. Ingestion recipes for BigQuery metadata, Airflow, and dbt.
  • GKE Standard - Zonal cluster for the dev PoC (regional for prod). Hosts Airflow (Phase 1), then DataHub, Kafka, and OpenSearch (Phase 2). Zero-downtime node management.
  • CI/CD - GitHub Actions with Workload Identity Federation for plan/apply/drift workflows against the Terraform repository.
  • Observability - Google Managed Prometheus and Cloud Operations for metrics, dashboards, and alerts.
  • Cost groundwork - Mandatory labels on all resources. Project-level budget alerts.
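To make the Airflow bullet concrete, a minimal values override for the official Apache Airflow Helm chart might look like the sketch below. The keys shown (executor, images, dags.gitSync) are standard chart values, but the image repository, tag, and DAGs repository name are placeholders, not the project's actual configuration:

```yaml
# values-dev.yaml -- sketch only; image and repo names are placeholders
executor: CeleryExecutor

images:
  airflow:
    # custom image bundling astronomer-cosmos, dbt-core, dbt-bigquery
    repository: europe-docker.pkg.dev/poc-ume-data/airflow/airflow-dbt
    tag: "dev"

dags:
  gitSync:
    enabled: true
    repo: https://github.com/1edata/example-dags.git   # placeholder repo
    branch: main
    subPath: dags
```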

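Similarly, the CI/CD bullet implies plan workflows that authenticate via Workload Identity Federation rather than long-lived keys. A minimal sketch, in which the pool, provider, and service-account names are hypothetical:

```yaml
# .github/workflows/plan.yml -- sketch; WIF provider and SA names are assumptions
name: terraform-plan
on: [pull_request]

permissions:
  contents: read
  id-token: write   # required for Workload Identity Federation

jobs:
  plan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: google-github-actions/auth@v2
        with:
          workload_identity_provider: projects/123456789/locations/global/workloadIdentityPools/github/providers/github
          service_account: terraform-ci@poc-ume-data.iam.gserviceaccount.com
      - uses: hashicorp/setup-terraform@v3
      - run: terraform init && terraform plan
        working-directory: ./   # in practice, run once per layer and environment
```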
# Out of scope (wave-1)

  • Production environment - Prod stacks mirror dev but are only provisioned after dev is proven. The Terraform code is multi-project ready; only tfvars change.
  • BigQuery datasets and GCS landing buckets - These are data engineering concerns managed outside the infrastructure repository (via dbt, manual creation, or a future data-platform layer).
  • Billing export to BigQuery / Looker dashboards - Documented as a future enhancement; labels are in place to support it.
  • Managed Kafka migration - Documented as an upgrade path when pricing becomes viable.
  • Self-hosted Grafana - GMP + Cloud Ops covers wave-1 needs.
  • SIEM, advanced compliance, or regulated-environment controls - The platform is lightweight for now.
  • GCP project creation - Projects are provisioned externally; Terraform assumes they exist.
  • GCP Folders or Organization-level IAM - The current org has a flat project list. Terraform does not touch org-level resources.

# Assumptions

  1. Single GCP project for dev: poc-ume-data. The user has Owner-level access. All dev resources (GKE, Cloud SQL, networking, Artifact Registry, state bucket) coexist here.
  2. Multi-project target: Production will use dedicated projects per concern (shared-services, platform, data). Terraform modules accept a project-ID map from tfvars, so migration requires no code changes.
  3. Projects are pre-existing inputs: Terraform never creates GCP projects. The 00-bootstrap layer assumes the target project already exists.
  4. No GCP Folders: The organization uses a flat list of projects under a single org. No folder-level IAM or hierarchy.
  5. Existing DAGs repository: There is an existing repository containing Airflow DAGs and a large dbt project. This will be expanded to include the custom Airflow image Dockerfile and CI, as well as DataHub ingestion recipes. During Phase 1, DAG/dbt/image work lives in resources/ within ume-data-infra and will be ported to the dedicated repo later.
  6. GitHub as source control: The infrastructure repo lives at github.com/1edata/ume-data-infra (it will move to a different GitHub organization later).
  7. Two environments only: dev and prod. No staging, no QA. Dev validates; prod mirrors.
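Assumption 2 (project IDs supplied via tfvars) can be illustrated as follows. The variable name, map keys, and the commented-out prod project IDs are hypothetical; only the per-concern split (shared-services, platform, data) comes from the assumption itself:

```hcl
# variables.tf -- hypothetical variable name
variable "project_ids" {
  description = "Project ID per concern; dev maps every key to one project."
  type        = map(string)
}

# dev.tfvars: everything coexists in the single PoC project
project_ids = {
  shared_services = "poc-ume-data"
  platform        = "poc-ume-data"
  data            = "poc-ume-data"
}

# prod.tfvars (future): dedicated projects per concern, no code changes
# project_ids = {
#   shared_services = "..."
#   platform        = "..."
#   data            = "..."
# }
```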

# Environments

| Environment | GCP Project(s) | Purpose |
|-------------|----------------|---------|
| dev | poc-ume-data (single project) | Validation. All components deployed here first. |
| prod | Multiple projects (TBD, created externally) | Production. Brought up only after dev is proven. |

# Relationship to Other Documentation

  • Architecture and Tools - Defines the what and why: tool selection rationale, data flow, governance goals. This infrastructure section defines the how.
  • ETL - Describes transformation patterns, orchestration capabilities, and SDLC practices. Infrastructure provides the Airflow deployment, DAG sync mechanism, and image pipeline that ETL relies on.
  • Data Catalog - Describes DataHub's role, integrations, and governance workflows. Infrastructure provides the deployed DataHub instance, its backing services, and operational runbooks.