#
Observability and Cost
Observability covers metrics collection, alerting, and dashboarding. Cost attribution covers the labeling strategy and budget controls. Both are foundational for operating the platform confidently.
Wave-1 approach:
- Google Managed Prometheus (GMP) + Cloud Operations for metrics and alerts
- No self-hosted Grafana - Cloud Ops dashboards are sufficient for wave-1
- Mandatory labels on all resources for future per-service cost attribution
- Budget alerts on the project at 50/80/100% thresholds
#
Metrics Collection
#
Google Managed Prometheus (GMP)
GMP is enabled on the GKE cluster. It collects Prometheus-format metrics without requiring a self-hosted Prometheus server.
Cost model: you pay only for samples ingested and API queries. No CPU/memory cost for a Prometheus server, since Google manages the collection and storage infrastructure.
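Because the bill scales with samples ingested, a rough volume estimate is useful before adding scrape targets. The sketch below is back-of-the-envelope arithmetic, not a pricing calculator: the function name, the 30-day month, and the example workload are assumptions for illustration, and no per-sample price is included.

```python
SECONDS_PER_DAY = 86_400


def samples_per_month(active_series: int, scrape_interval_s: int, days: int = 30) -> int:
    """Estimate samples ingested over `days`, assuming a steady series count.

    Each active series yields one sample per scrape, so the total is
    series * number_of_scrapes.
    """
    if scrape_interval_s <= 0:
        raise ValueError("scrape_interval_s must be positive")
    scrapes = (days * SECONDS_PER_DAY) // scrape_interval_s
    return active_series * scrapes


if __name__ == "__main__":
    # Hypothetical workload: 1,000 active series scraped every 30 seconds.
    print(samples_per_month(1_000, 30))
```

Doubling the scrape interval halves the estimate, which is usually the cheapest lever when ingestion volume grows.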
GMP collectors run as DaemonSets on the system node pool. They scrape:
- GKE node and pod metrics (CPU, memory, disk, network) - built-in.
- Kafka metrics - Strimzi exposes Prometheus metrics via JMX exporter. GMP scrapes the Strimzi-configured PodMonitor.
- OpenSearch metrics - OpenSearch Prometheus exporter plugin or sidecar. GMP scrapes the ServiceMonitor.
- DataHub metrics - DataHub GMS exposes JMX/Prometheus metrics. GMP scrapes via PodMonitor.
- Cloud SQL metrics - collected natively by Cloud Monitoring (not GMP). Available in the same Cloud Ops console.
#
Cloud Monitoring integration
GMP metrics are queryable via Cloud Monitoring's PromQL interface. Dashboards and alerts are created in Cloud Monitoring, using PromQL for GKE/app metrics (including Airflow) and MQL for GCP-native metrics (Cloud SQL, GCS).
#
Alert Catalog
All alerts are provisioned via Terraform in environments/{env}-02-k8s-base/observability.tf (for GKE/app alerts) and environments/{env}-01-base/ (for Cloud SQL/project alerts).
#
GKE and node scaling
#
Kafka (Strimzi)
#
OpenSearch
#
DataHub
#
Cloud SQL
#
Airflow
#
Budget
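The 50/80/100% budget thresholds come down to comparing spend-to-date against the budget amount. The real evaluation is done by the budget alerting service, not by project code; the sketch below only illustrates the logic, and the function name and example figures are hypothetical.

```python
# Threshold fractions taken from the wave-1 approach (50/80/100%).
THRESHOLDS = (0.5, 0.8, 1.0)


def crossed_thresholds(spend: float, budget: float, thresholds=THRESHOLDS) -> list:
    """Return the threshold fractions that current spend has reached."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    ratio = spend / budget
    return [t for t in thresholds if ratio >= t]


if __name__ == "__main__":
    # Hypothetical month: 850 spent against a budget of 1,000.
    print(crossed_thresholds(850, 1000))
```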
#
Dashboards
Wave-1 uses Cloud Monitoring dashboards (not Grafana). Pre-built dashboards:
Dashboards are created via Terraform using google_monitoring_dashboard resources in observability.tf.
#
Cost Attribution
#
Label strategy
Every Terraform-managed resource carries five mandatory labels:
#
Enforcement
Labels are defined in locals.tf and merged into every resource. CI lints for missing labels before terraform plan (see CI/CD).
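The label lint amounts to a set-difference check per resource. The sketch below is a minimal illustration, not the actual CI step: it assumes resources have already been parsed into a mapping of address to labels (how the lint extracts them from Terraform is not covered here), and the resource addresses are hypothetical. The required set is passed in as a parameter; the example uses only the service and env labels.

```python
from __future__ import annotations

from typing import Iterable, Mapping


def missing_labels(labels: Mapping[str, str], required: Iterable[str]) -> list[str]:
    """Return the required label keys that are absent or empty, sorted."""
    return sorted(key for key in required if not labels.get(key))


def lint(resources: Mapping[str, Mapping[str, str]],
         required: Iterable[str]) -> dict[str, list[str]]:
    """Map each failing resource address to its missing label keys."""
    required = tuple(required)  # allow one-shot iterables to be reused
    failures = {}
    for address, labels in resources.items():
        missing = missing_labels(labels, required)
        if missing:
            failures[address] = missing
    return failures


if __name__ == "__main__":
    # Hypothetical resources; only two of the mandatory labels are checked here.
    resources = {
        "google_storage_bucket.example": {"service": "kafka", "env": "dev"},
        "google_sql_database_instance.example": {"service": "datahub"},
    }
    failures = lint(resources, ("service", "env"))
    for address, missing in failures.items():
        print(f"{address}: missing {', '.join(missing)}")
    raise SystemExit(1 if failures else 0)
```

Exiting non-zero on any failure lets the CI job stop before terraform plan runs.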
#
Future: billing export to BigQuery
The path to per-service cost dashboards (not built in wave-1):
- Enable GCP billing export to a BigQuery dataset.
- Query billing data grouped by the service and env labels.
- Create scheduled queries that aggregate daily costs by service.
- Surface in Looker Studio (or the reporting tool of choice).
The labels are in place now so this can be enabled at any time without re-tagging resources.
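The grouping step can be sketched in plain Python to show the shape of the aggregation. This is an illustration only: in practice it would be a scheduled BigQuery query. The rows below mimic the billing export layout, where labels arrive as a list of key/value records alongside cost and usage_start_time; the "unlabeled" bucket name and the sample rows are assumptions.

```python
from collections import defaultdict
from datetime import datetime


def label_value(row: dict, key: str, default: str = "unlabeled") -> str:
    """Look up one label in a row's list of {"key": ..., "value": ...} records."""
    for label in row.get("labels", []):
        if label["key"] == key:
            return label["value"]
    return default


def daily_cost_by_service(rows: list) -> dict:
    """Sum cost per (day, service, env); rows without a label land in 'unlabeled'."""
    totals = defaultdict(float)
    for row in rows:
        day = row["usage_start_time"].date().isoformat()
        key = (day, label_value(row, "service"), label_value(row, "env"))
        totals[key] += row["cost"]
    return dict(totals)


if __name__ == "__main__":
    # Hypothetical export rows for a single day.
    rows = [
        {"usage_start_time": datetime(2024, 1, 1, 3), "cost": 1.5,
         "labels": [{"key": "service", "value": "kafka"}, {"key": "env", "value": "dev"}]},
        {"usage_start_time": datetime(2024, 1, 1, 9), "cost": 2.5,
         "labels": [{"key": "service", "value": "kafka"}, {"key": "env", "value": "dev"}]},
        {"usage_start_time": datetime(2024, 1, 1, 9), "cost": 0.25, "labels": []},
    ]
    for key, cost in sorted(daily_cost_by_service(rows).items()):
        print(key, cost)
```

Tracking the unlabeled bucket over time is a cheap way to see whether label enforcement is holding.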
#
GKE namespace-level cost
For finer-grained GKE cost attribution:
- Workloads run in dedicated namespaces (datahub, kafka, opensearch, system).
- GKE usage metering (enabled in the cluster config) exports per-namespace resource consumption to BigQuery.
- Combined with billing export, this gives per-namespace cost.
This is a future enhancement; the namespace structure and usage metering config are set up in wave-1 to make it trivial to activate later.
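One simple way to combine the two exports is to split the cluster's billed cost in proportion to each namespace's metered usage. The sketch below assumes a single usage dimension (for example CPU core-seconds) and purely proportional allocation; both are simplifications, and the function name and figures are hypothetical.

```python
def allocate_by_namespace(cluster_cost: float, usage_by_namespace: dict) -> dict:
    """Split cluster_cost across namespaces in proportion to metered usage."""
    total = sum(usage_by_namespace.values())
    if total <= 0:
        raise ValueError("total usage must be positive")
    return {
        namespace: cluster_cost * used / total
        for namespace, used in usage_by_namespace.items()
    }


if __name__ == "__main__":
    # Hypothetical month: cluster cost of 100 and usage in arbitrary units.
    usage = {"kafka": 50, "opensearch": 30, "datahub": 15, "system": 5}
    for namespace, cost in allocate_by_namespace(100.0, usage).items():
        print(f"{namespace}: {cost:.2f}")
```

A fuller model would weight CPU and memory separately and decide how to treat idle capacity, which this sketch ignores.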