# Observability and Cost

Observability covers metrics collection, alerting, and dashboarding. Cost attribution covers the labeling strategy and budget controls. Both are foundational for operating the platform confidently.

## Metrics Collection

### Google Managed Prometheus (GMP)

GMP is enabled on the GKE cluster. It collects Prometheus-format metrics without requiring a self-hosted Prometheus server.

Cost model: you pay only for samples ingested and API queries. No CPU/memory cost for a Prometheus server, since Google manages the collection and storage infrastructure.
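
A minimal Terraform sketch of how managed collection is enabled on a GKE cluster; the cluster name and location here are placeholders, not the actual cluster definition:

```hcl
# Sketch: enable Google Managed Prometheus collection on the cluster.
# Name and location are illustrative placeholders.
resource "google_container_cluster" "platform" {
  name     = "platform-cluster"
  location = "us-central1"

  monitoring_config {
    enable_components = ["SYSTEM_COMPONENTS"]
    managed_prometheus {
      enabled = true
    }
  }
}
```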

GMP collectors run as DaemonSets on the system node pool. They scrape:

- GKE node and pod metrics (CPU, memory, disk, network) - built-in.
- Kafka metrics - Strimzi exposes Prometheus metrics via the JMX exporter. GMP scrapes the Strimzi-configured PodMonitor (see the sketch after this list).
- OpenSearch metrics - the OpenSearch Prometheus exporter plugin or a sidecar exporter. GMP scrapes the ServiceMonitor.
- DataHub metrics - DataHub GMS exposes JMX/Prometheus metrics. GMP scrapes via PodMonitor.
- Cloud SQL metrics - collected natively by Cloud Monitoring (not GMP). Available in the same Cloud Ops console.
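
For the in-cluster targets, GMP's managed collection reads PodMonitoring resources. A sketch of the Kafka one, applied via Terraform's kubernetes provider; the namespace, selector label, and port name are assumptions about the Strimzi deployment:

```hcl
# Sketch: a PodMonitoring resource telling GMP to scrape the Kafka brokers.
# The selector label and metrics port name are assumed; match them to what
# Strimzi actually exposes in this cluster.
resource "kubernetes_manifest" "kafka_pod_monitoring" {
  manifest = {
    apiVersion = "monitoring.googleapis.com/v1"
    kind       = "PodMonitoring"
    metadata = {
      name      = "kafka-metrics"
      namespace = "kafka"
    }
    spec = {
      selector = {
        matchLabels = {
          "strimzi.io/kind" = "Kafka"
        }
      }
      endpoints = [
        {
          port     = "tcp-prometheus" # assumed JMX-exporter port name
          interval = "30s"
        }
      ]
    }
  }
}
```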

### Cloud Monitoring integration

GMP metrics are queryable via Cloud Monitoring's PromQL interface. Dashboards and alerts are created in Cloud Monitoring, using PromQL for GKE/app metrics (including Airflow) and MQL for GCP-native metrics (Cloud SQL, GCS).

## Alert Catalog

All alerts are provisioned via Terraform in environments/{env}-02-k8s-base/observability.tf (for GKE/app alerts) and environments/{env}-01-base/ (for Cloud SQL/project alerts).
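
To make the shape concrete, here is a hedged sketch of one catalog entry (the Kafka consumer-lag alert) as a PromQL-based alert policy; the metric name and notification channel are assumptions, not the exact resources in observability.tf:

```hcl
# Sketch: the "Consumer lag high" alert as a PromQL-based alert policy.
# kafka_consumergroup_lag is an assumed exporter metric name.
resource "google_monitoring_alert_policy" "kafka_consumer_lag" {
  display_name = "Kafka consumer lag high"
  combiner     = "OR"
  severity     = "CRITICAL"

  conditions {
    display_name = "Consumer group lag > 10,000 for 15 min"
    condition_prometheus_query_language {
      query    = "max by (group) (kafka_consumergroup_lag) > 10000"
      duration = "900s"
    }
  }

  notification_channels = [
    google_monitoring_notification_channel.oncall.id, # assumed channel
  ]
}
```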

### GKE and node scaling

| Alert | Condition | Severity | Action |
| --- | --- | --- | --- |
| Cluster Autoscaler scale-up | Autoscaler adds nodes | Info | Awareness only |
| Cluster Autoscaler scale-down | Autoscaler removes nodes | Info | Awareness only |
| Node pool at max capacity | Workload pool nodes = max_nodes | Warning | Investigate pending pods; consider increasing max_nodes |
| Node not ready | Any node NotReady > 5 min | Critical | GKE usually auto-repairs; investigate if it persists |

### Kafka (Strimzi)

| Alert | Condition | Severity | Action |
| --- | --- | --- | --- |
| Broker CPU high | Any broker CPU > 80% for 10 min | Warning | Check for ingestion spike; consider scaling brokers |
| Broker memory high | Any broker memory > 80% for 10 min | Warning | Check for large messages; increase broker resources |
| Consumer lag high | Any consumer group lag > 10,000 for 15 min | Critical | DataHub is falling behind on events; check GMS/MAE/MCE consumer health |
| Under-replicated partitions | ISR count < replication factor for 5 min | Critical | Broker may be down; check Strimzi operator logs |
| Broker count low | Active brokers < expected count for 5 min | Critical | Check node health and Strimzi operator |

### OpenSearch

| Alert | Condition | Severity | Action |
| --- | --- | --- | --- |
| JVM heap high | JVM heap usage > 75% for 10 min | Warning | Consider increasing heap or adding data nodes |
| Unassigned shards | Unassigned shard count > 0 for 10 min | Critical | Node may be down or disk full; check cluster health |
| Disk usage high | Data node disk > 80% | Warning | Increase PV size or add data nodes |
| Cluster status red | Cluster health = red for 5 min | Critical | Primary shards missing; check node status and operator logs |

### DataHub

| Alert | Condition | Severity | Action |
| --- | --- | --- | --- |
| GMS OOMKilled | OOMKilled events > 0 in 10 min | Critical | Increase GMS memory requests; check for metadata spike |
| GMS pod restart loop | Restarts > 3 in 15 min | Critical | Check logs; likely DB connection issue or migration failure |
| Frontend unreachable | HTTP 5xx on ingress > 10% for 5 min | Critical | Check frontend pods and ingress config |

### Cloud SQL

| Alert | Condition | Severity | Action |
| --- | --- | --- | --- |
| Disk usage high | Storage > 80% of provisioned | Warning | Enable auto-increase or manually increase |
| Connections high | Active connections > 80% of max | Warning | Check for connection leaks; consider connection pooling |
| CPU high | CPU > 80% for 15 min | Warning | Consider upgrading instance tier |
| Replication lag (prod) | Replica lag > 30s for 10 min | Critical | Check network and replica health |

### Airflow

| Alert | Condition | Severity | Action |
| --- | --- | --- | --- |
| DAG parse time high | Parse time > 60s | Warning | Optimize DAG complexity or increase scheduler resources |
| Celery worker failures | Worker task failures > 5 in 1 hour | Warning | Check task logs; likely code or resource issue |
| Scheduler heartbeat | No heartbeat for 60s | Critical | Check scheduler pod status and resources; see Airflow scaling signals |
| Scheduler pod restarts | Restarts > 3 in 1 hour | Critical | Check OOM or crash loop; consider upgrading node to e2-standard-4 |
| Task failure rate | > 10% over 15 minutes | Warning | Check task logs for common failure patterns |

### Budget

| Alert | Condition | Channel |
| --- | --- | --- |
| Budget 50% | Monthly spend reaches 50% of budget | Email |
| Budget 80% | Monthly spend reaches 80% of budget | Email + Slack |
| Budget 100% | Monthly spend reaches 100% of budget | Email + Slack + PagerDuty (if configured) |
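
These thresholds map onto a google_billing_budget resource roughly as follows; the billing account, amount, and channel wiring are placeholders:

```hcl
# Sketch: monthly budget with the three alert thresholds above.
# Billing account, project, and amount are illustrative placeholders.
resource "google_billing_budget" "platform" {
  billing_account = var.billing_account_id
  display_name    = "data-platform-monthly"

  budget_filter {
    projects = ["projects/${var.project_id}"]
  }

  amount {
    specified_amount {
      currency_code = "USD"
      units         = "1000" # assumed monthly budget
    }
  }

  threshold_rules { threshold_percent = 0.5 }
  threshold_rules { threshold_percent = 0.8 }
  threshold_rules { threshold_percent = 1.0 }

  all_updates_rule {
    monitoring_notification_channels = [
      google_monitoring_notification_channel.email.id, # assumed channel
    ]
  }
}
```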

## Dashboards

Wave-1 uses Cloud Monitoring dashboards (not Grafana). Pre-built dashboards:

| Dashboard | Metrics |
| --- | --- |
| GKE Overview | Node count, CPU/memory by pool, pod count, autoscaler events |
| Kafka | Broker CPU/memory, consumer lag by group, throughput (bytes/sec), ISR count |
| OpenSearch | JVM heap, shard status, query latency, indexing rate |
| DataHub | GMS request latency, error rate, consumer lag, active users |
| Cloud SQL | CPU, memory, connections, disk, replication lag |
| Airflow | Scheduler heartbeat, task queue depth, DAG parse time, worker CPU/memory, task failure rate |

Dashboards are created via Terraform using google_monitoring_dashboard resources in observability.tf.
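
For example (the JSON file path is an assumption about the repo layout):

```hcl
# Sketch: one dashboard declared from a JSON layout file.
resource "google_monitoring_dashboard" "kafka" {
  dashboard_json = file("${path.module}/dashboards/kafka.json") # assumed path
}
```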

## Cost Attribution

### Label strategy

Every Terraform-managed resource carries five mandatory labels:

| Label | Purpose | Example values |
| --- | --- | --- |
| env | Environment | dev, prod |
| layer | Terraform stack | bootstrap, base, k8s-base, runtime |
| service | Logical service | gke, airflow, datahub, kafka, opensearch, cloudsql |
| owner | Responsible team | platform-team, data-engineering |
| cost_center | Cost group | data-platform |

### Enforcement

Labels are defined in locals.tf and merged into every resource. CI lints for missing labels before terraform plan (see CI/CD).
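
A minimal sketch of the pattern; the variable names and the example resource are hypothetical:

```hcl
# Sketch: shared labels defined once and merged into each resource,
# with a per-resource override for the service label.
locals {
  common_labels = {
    env         = var.env # e.g. "dev" or "prod"
    layer       = "k8s-base"
    service     = "gke"
    owner       = "platform-team"
    cost_center = "data-platform"
  }
}

resource "google_compute_disk" "opensearch_data" {
  name   = "opensearch-data"
  zone   = var.zone
  size   = 100
  labels = merge(local.common_labels, { service = "opensearch" })
}
```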

### Future: billing export to BigQuery

The path to per-service cost dashboards (not built in wave-1):

  1. Enable GCP billing export to a BigQuery dataset.
  2. Query billing data grouped by the service and env labels.
  3. Create scheduled queries that aggregate daily costs by service.
  4. Surface in Looker Studio (or the reporting tool of choice).

The labels are in place now so this can be enabled at any time without re-tagging resources.
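
A hedged sketch of steps 2-3, assuming the default billing-export table naming; the project, dataset, and table suffix are placeholders:

```hcl
# Sketch: a scheduled query aggregating daily cost by the service and env
# labels. Replace project, dataset, and table suffix with real values.
resource "google_bigquery_data_transfer_config" "daily_cost_by_service" {
  display_name           = "daily-cost-by-service"
  data_source_id         = "scheduled_query"
  schedule               = "every day 06:00"
  destination_dataset_id = "cost_reporting"

  params = {
    destination_table_name_template = "daily_cost_by_service"
    write_disposition               = "WRITE_APPEND"
    query                           = <<-SQL
      SELECT
        DATE(usage_start_time)                                   AS usage_date,
        (SELECT value FROM UNNEST(labels) WHERE key = 'service') AS service,
        (SELECT value FROM UNNEST(labels) WHERE key = 'env')     AS env,
        SUM(cost)                                                AS daily_cost
      FROM `my-project.billing_export.gcp_billing_export_v1_XXXXXX`
      GROUP BY usage_date, service, env
    SQL
  }
}
```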

### GKE namespace-level cost

For finer-grained GKE cost attribution:

- Workloads run in dedicated namespaces (datahub, kafka, opensearch, system).
- GKE usage metering (enabled in the cluster config) exports per-namespace resource consumption to BigQuery.
- Combined with billing export, this gives per-namespace cost.

This is a future enhancement; the namespace structure and usage metering config are set up in wave-1 to make it trivial to activate later.
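
For reference, a sketch of the usage-metering block on the cluster resource; the dataset name is a placeholder:

```hcl
# Sketch: GKE usage metering exporting per-namespace consumption to BigQuery.
resource "google_container_cluster" "platform" {
  # ...cluster config as in the GMP sketch above...

  resource_usage_export_config {
    enable_resource_consumption_metering = true

    bigquery_destination {
      dataset_id = "gke_usage_metering" # assumed dataset
    }
  }
}
```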