# Observability and Cost

Observability covers metrics collection, alerting, and dashboarding. Cost attribution covers the labeling strategy and budget controls. Both are foundational for operating the platform confidently.

## Metrics Collection

### Google Managed Prometheus (GMP)

GMP is enabled on the GKE cluster. It collects Prometheus-format metrics without requiring a self-hosted Prometheus server.

Cost model: you pay only for samples ingested and API queries. No CPU/memory cost for a Prometheus server, since Google manages the collection and storage infrastructure.
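
A minimal Terraform sketch of how managed collection is enabled on a GKE cluster; the cluster name and location here are placeholders, not the actual cluster definition:

```hcl
# Sketch: enable Google Managed Prometheus collection on the cluster.
# Name and location are illustrative placeholders.
resource "google_container_cluster" "platform" {
  name     = "platform-cluster"
  location = "us-central1"

  monitoring_config {
    enable_components = ["SYSTEM_COMPONENTS"]
    managed_prometheus {
      enabled = true
    }
  }
}
```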

GMP collectors run as DaemonSets on the system node pool. They scrape:

- GKE node and pod metrics (CPU, memory, disk, network) - built-in.
- Kafka metrics - Strimzi exposes Prometheus metrics via the JMX exporter. GMP scrapes the Strimzi-configured PodMonitor (see the sketch after this list).
- OpenSearch metrics - the OpenSearch Prometheus exporter plugin or a sidecar exporter. GMP scrapes the ServiceMonitor.
- DataHub metrics - DataHub GMS exposes JMX/Prometheus metrics. GMP scrapes via PodMonitor.
- Cloud SQL metrics - collected natively by Cloud Monitoring (not GMP). Available in the same Cloud Ops console.
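
For the in-cluster targets, GMP's managed collection reads PodMonitoring resources. A sketch of the Kafka one, applied via Terraform's kubernetes provider; the namespace, selector label, and port name are assumptions about the Strimzi deployment:

```hcl
# Sketch: a PodMonitoring resource telling GMP to scrape the Kafka brokers.
# The selector label and metrics port name are assumed; match them to what
# Strimzi actually exposes in this cluster.
resource "kubernetes_manifest" "kafka_pod_monitoring" {
  manifest = {
    apiVersion = "monitoring.googleapis.com/v1"
    kind       = "PodMonitoring"
    metadata = {
      name      = "kafka-metrics"
      namespace = "kafka"
    }
    spec = {
      selector = {
        matchLabels = {
          "strimzi.io/kind" = "Kafka"
        }
      }
      endpoints = [
        {
          port     = "tcp-prometheus" # assumed JMX-exporter port name
          interval = "30s"
        }
      ]
    }
  }
}
```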

### Cloud Monitoring integration

GMP metrics are queryable via Cloud Monitoring's PromQL interface. Dashboards and alerts are created in Cloud Monitoring, using PromQL for GKE/app metrics (including Airflow) and MQL for GCP-native metrics (Cloud SQL, GCS).

## Alert Catalog

All alerts are provisioned via Terraform in environments/{env}-02-k8s-base/observability.tf (for GKE/app alerts) and environments/{env}-01-base/ (for Cloud SQL/project alerts).
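
To make the shape concrete, here is a hedged sketch of one catalog entry (the Kafka consumer-lag alert) as a PromQL-based alert policy; the metric name and notification channel are assumptions, not the exact resources in observability.tf:

```hcl
# Sketch: the "Consumer lag high" alert as a PromQL-based alert policy.
# kafka_consumergroup_lag is an assumed exporter metric name.
resource "google_monitoring_alert_policy" "kafka_consumer_lag" {
  display_name = "Kafka consumer lag high"
  combiner     = "OR"
  severity     = "CRITICAL"

  conditions {
    display_name = "Consumer group lag > 10,000 for 15 min"
    condition_prometheus_query_language {
      query    = "max by (group) (kafka_consumergroup_lag) > 10000"
      duration = "900s"
    }
  }

  notification_channels = [
    google_monitoring_notification_channel.oncall.id, # assumed channel
  ]
}
```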

### GKE and node scaling

| Alert | Condition | Severity | Action |
| --- | --- | --- | --- |
| Cluster Autoscaler scale-up | Autoscaler adds nodes | Info | Awareness only |
| Cluster Autoscaler scale-down | Autoscaler removes nodes | Info | Awareness only |
| Node pool at max capacity | Workload pool nodes = max_nodes | Warning | Investigate pending pods; consider increasing max_nodes |
| Node not ready | Any node NotReady > 5 min | Critical | GKE usually auto-repairs; investigate if it persists |

### Kafka (Strimzi)

| Alert | Condition | Severity | Action |
| --- | --- | --- | --- |
| Broker CPU high | Any broker CPU > 80% for 10 min | Warning | Check for ingestion spike; consider scaling brokers |
| Broker memory high | Any broker memory > 80% for 10 min | Warning | Check for large messages; increase broker resources |
| Consumer lag high | Any consumer group lag > 10,000 for 15 min | Critical | DataHub is falling behind on events; check GMS/MAE/MCE consumer health |
| Under-replicated partitions | ISR count < replication factor for 5 min | Critical | Broker may be down; check Strimzi operator logs |
| Broker count low | Active brokers < expected count for 5 min | Critical | Check node health and Strimzi operator |

### OpenSearch

| Alert | Condition | Severity | Action |
| --- | --- | --- | --- |
| JVM heap high | JVM heap usage > 75% for 10 min | Warning | Consider increasing heap or adding data nodes |
| Unassigned shards | Unassigned shard count > 0 for 10 min | Critical | Node may be down or disk full; check cluster health |
| Disk usage high | Data node disk > 80% | Warning | Increase PV size or add data nodes |
| Cluster status red | Cluster health = red for 5 min | Critical | Primary shards missing; check node status and operator logs |

### DataHub

| Alert | Condition | Severity | Action |
| --- | --- | --- | --- |
| GMS OOMKilled | OOMKilled events > 0 in 10 min | Critical | Increase GMS memory requests; check for metadata spike |
| GMS pod restart loop | Restarts > 3 in 15 min | Critical | Check logs; likely DB connection issue or migration failure |
| Frontend unreachable | HTTP 5xx on ingress > 10% for 5 min | Critical | Check frontend pods and ingress config |

### Cloud SQL

| Alert | Condition | Severity | Action |
| --- | --- | --- | --- |
| Disk usage high | Storage > 80% of provisioned | Warning | Enable auto-increase or manually increase |
| Connections high | Active connections > 80% of max | Warning | Check for connection leaks; consider connection pooling |
| CPU high | CPU > 80% for 15 min | Warning | Consider upgrading instance tier |
| Replication lag (prod) | Replica lag > 30s for 10 min | Critical | Check network and replica health |

### Airflow

| Alert | Condition | Severity | Action |
| --- | --- | --- | --- |
| DAG parse time high | Parse time > 60s | Warning | Optimize DAG complexity or increase scheduler resources |
| Celery worker failures | Worker task failures > 5 in 1 hour | Warning | Check task logs; likely code or resource issue |
| Scheduler heartbeat | No heartbeat for 60s | Critical | Check scheduler pod status and resources; see Airflow scaling signals |
| Scheduler pod restarts | Restarts > 3 in 1 hour | Critical | Check OOM or crash loop; consider upgrading node to e2-standard-4 |
| Task failure rate | > 10% over 15 minutes | Warning | Check task logs for common failure patterns |

### Budget

| Alert | Condition | Channel |
| --- | --- | --- |
| Budget 50% | Monthly spend reaches 50% of budget | Email |
| Budget 80% | Monthly spend reaches 80% of budget | Email + Slack |
| Budget 100% | Monthly spend reaches 100% of budget | Email + Slack + PagerDuty (if configured) |
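
These thresholds map onto a google_billing_budget resource roughly as follows; the billing account, amount, and channel wiring are placeholders:

```hcl
# Sketch: monthly budget with the three alert thresholds above.
# Billing account, project, and amount are illustrative placeholders.
resource "google_billing_budget" "platform" {
  billing_account = var.billing_account_id
  display_name    = "data-platform-monthly"

  budget_filter {
    projects = ["projects/${var.project_id}"]
  }

  amount {
    specified_amount {
      currency_code = "USD"
      units         = "1000" # assumed monthly budget
    }
  }

  threshold_rules { threshold_percent = 0.5 }
  threshold_rules { threshold_percent = 0.8 }
  threshold_rules { threshold_percent = 1.0 }

  all_updates_rule {
    monitoring_notification_channels = [
      google_monitoring_notification_channel.email.id, # assumed channel
    ]
  }
}
```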

## Dashboards

Wave-1 uses Cloud Monitoring dashboards (not Grafana). Pre-built dashboards:

| Dashboard | Metrics |
| --- | --- |
| GKE Overview | Node count, CPU/memory by pool, pod count, autoscaler events |
| Kafka | Broker CPU/memory, consumer lag by group, throughput (bytes/sec), ISR count |
| OpenSearch | JVM heap, shard status, query latency, indexing rate |
| DataHub | GMS request latency, error rate, consumer lag, active users |
| Cloud SQL | CPU, memory, connections, disk, replication lag |
| Airflow | Scheduler heartbeat, task queue depth, DAG parse time, worker CPU/memory, task failure rate |

Dashboards are created via Terraform using google_monitoring_dashboard resources in observability.tf.
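
For example (the JSON file path is an assumption about the repo layout):

```hcl
# Sketch: one dashboard declared from a JSON layout file.
resource "google_monitoring_dashboard" "kafka" {
  dashboard_json = file("${path.module}/dashboards/kafka.json") # assumed path
}
```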

## Cost Attribution

### Label strategy

Every Terraform-managed resource carries five mandatory labels:

| Label | Purpose | Example values |
| --- | --- | --- |
| env | Environment | dev, prod |
| layer | Terraform stack | bootstrap, base, k8s-base, runtime |
| service | Logical service | gke, airflow, datahub, kafka, opensearch, cloudsql |
| owner | Responsible team | platform-team, data-engineering |
| cost_center | Cost group | data-platform |

### Enforcement

Labels are defined in locals.tf and merged into every resource. CI lints for missing labels before terraform plan (see CI/CD).
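
A minimal sketch of the pattern; the variable names and the example resource are hypothetical:

```hcl
# Sketch: shared labels defined once and merged into each resource,
# with a per-resource override for the service label.
locals {
  common_labels = {
    env         = var.env # e.g. "dev" or "prod"
    layer       = "k8s-base"
    service     = "gke"
    owner       = "platform-team"
    cost_center = "data-platform"
  }
}

resource "google_compute_disk" "opensearch_data" {
  name   = "opensearch-data"
  zone   = var.zone
  size   = 100
  labels = merge(local.common_labels, { service = "opensearch" })
}
```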

### Future: billing export to BigQuery

The path to per-service cost dashboards (not built in wave-1):

  1. Enable GCP billing export to a BigQuery dataset.
  2. Query billing data grouped by the service and env labels.
  3. Create scheduled queries that aggregate daily costs by service.
  4. Surface in Looker Studio (or the reporting tool of choice).

The labels are in place now so this can be enabled at any time without re-tagging resources.
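
A hedged sketch of steps 2-3, assuming the default billing-export table naming; the project, dataset, and table suffix are placeholders:

```hcl
# Sketch: a scheduled query aggregating daily cost by the service and env
# labels. Replace project, dataset, and table suffix with real values.
resource "google_bigquery_data_transfer_config" "daily_cost_by_service" {
  display_name           = "daily-cost-by-service"
  data_source_id         = "scheduled_query"
  schedule               = "every day 06:00"
  destination_dataset_id = "cost_reporting"

  params = {
    destination_table_name_template = "daily_cost_by_service"
    write_disposition               = "WRITE_APPEND"
    query                           = <<-SQL
      SELECT
        DATE(usage_start_time)                                   AS usage_date,
        (SELECT value FROM UNNEST(labels) WHERE key = 'service') AS service,
        (SELECT value FROM UNNEST(labels) WHERE key = 'env')     AS env,
        SUM(cost)                                                AS daily_cost
      FROM `my-project.billing_export.gcp_billing_export_v1_XXXXXX`
      GROUP BY usage_date, service, env
    SQL
  }
}
```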

### GKE namespace-level cost

For finer-grained GKE cost attribution:

- Workloads run in dedicated namespaces (datahub, kafka, opensearch, system).
- GKE usage metering (enabled in the cluster config) exports per-namespace resource consumption to BigQuery.
- Combined with billing export, this gives per-namespace cost.

This is a future enhancement; the namespace structure and usage metering config are set up in wave-1 to make it trivial to activate later.
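
For reference, a sketch of the usage-metering block on the cluster resource; the dataset name is a placeholder:

```hcl
# Sketch: GKE usage metering exporting per-namespace consumption to BigQuery.
resource "google_container_cluster" "platform" {
  # ...cluster config as in the GMP sketch above...

  resource_usage_export_config {
    enable_resource_consumption_metering = true

    bigquery_destination {
      dataset_id = "gke_usage_metering" # assumed dataset
    }
  }
}
```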