# DataHub

DataHub is the data catalog and governance platform for UME. It is deployed on GKE via Helm, backed by Cloud SQL (PostgreSQL), Strimzi-managed Kafka, and self-hosted OpenSearch. This section covers the deployment architecture, component configuration, identity, and known risks.

For the business rationale, feature set, and governance workflows, see Data Catalog.

# Component Architecture

┌─────────────────────────────────────────────────────────────────────┐
│  GKE Cluster                                                        │
│                                                                     │
│  ┌──────────────────┐   ┌──────────────────┐   ┌────────────────┐  │
│  │  DataHub Frontend │   │   DataHub GMS    │   │  MAE Consumer  │  │
│  │  (React UI)       │   │  (Metadata Svc)  │   │                │  │
│  └────────┬─────────┘   └──┬────┬────┬─────┘   └───────┬────────┘  │
│           │                 │    │    │                  │           │
│           │     ┌───────────┘    │    └────────┐        │           │
│           │     │                │             │        │           │
│  ┌────────▼─────▼──┐  ┌────────▼────────┐  ┌─▼────────▼────────┐  │
│  │   Kafka (Strimzi)│  │   OpenSearch    │  │  MCE Consumer     │  │
│  │   3 brokers      │  │   3 data nodes  │  │                   │  │
│  └──────────────────┘  └────────────────┘  └───────────────────┘  │
│                                                                     │
└──────────────────────────────┬──────────────────────────────────────┘
                               │ Private Service Access
                    ┌──────────▼──────────┐
                    │  Cloud SQL Postgres  │
                    │  (DataHub metadata)  │
                    └─────────────────────┘

# DataHub services

| Service | Role | Scaling |
|---|---|---|
| GMS (Generalized Metadata Service) | Core API. Reads/writes metadata to SQL, publishes events to Kafka, queries OpenSearch. | Dev: 1 replica. Prod: HPA, 2-4 replicas |
| Frontend | React web UI. Serves search, browse, and lineage visualization. | Dev: 1 replica. Prod: HPA, 2-3 replicas |
| MAE Consumer | Reads Metadata Audit Events from Kafka, indexes them into OpenSearch. | Dev: 1 replica. Prod: 2 replicas |
| MCE Consumer | Reads Metadata Change Events from Kafka, writes them to SQL. | Dev: 1 replica. Prod: 2 replicas |
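
In prod, GMS and Frontend scale via HorizontalPodAutoscalers. A minimal sketch of what the GMS autoscaler looks like as a raw manifest (the Deployment name, namespace, and CPU target are assumptions; the actual settings come from the Helm chart values):

```yaml
# Illustrative HPA for DataHub GMS in prod; names and thresholds are assumptions.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: datahub-gms
  namespace: datahub
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: datahub-datahub-gms   # assumed name for a Helm release called "datahub"
  minReplicas: 2
  maxReplicas: 4
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # illustrative CPU target
```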

# Cloud SQL (PostgreSQL)

DataHub's metadata store. Holds entities, relationships, aspects, and system metadata.

# Configuration

| Setting | Dev | Prod |
|---|---|---|
| Instance tier | db-g1-small | db-custom-2-7680 |
| HA | Single zone | Regional (automatic failover) |
| Storage | 10 GB SSD | 50 GB SSD, auto-increase |
| Backups | Daily, 7-day retention | Daily, 30-day retention + PITR |
| Network | Private IP via PSA | Private IP via PSA |
| Auth | IAM authentication for service accounts | Same |
| Admin password | In Secret Manager (break-glass only) | Same |
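
A hedged Terraform sketch of the prod instance shape implied by this table (resource names, the PostgreSQL major version, and the network reference are illustrative; the real definition lives in the environment's Terraform):

```hcl
# Illustrative Cloud SQL instance matching the prod column above.
resource "google_sql_database_instance" "datahub" {
  name             = "datahub-metadata-prod"   # hypothetical name
  database_version = "POSTGRES_15"             # assumed major version
  region           = var.region

  settings {
    tier              = "db-custom-2-7680"
    availability_type = "REGIONAL"             # regional HA with automatic failover
    disk_size         = 50
    disk_autoresize   = true

    ip_configuration {
      ipv4_enabled    = false                  # private IP only, via PSA
      private_network = var.private_network_self_link
    }

    backup_configuration {
      enabled                        = true
      point_in_time_recovery_enabled = true
      backup_retention_settings {
        retained_backups = 30
      }
    }

    database_flags {
      name  = "cloudsql.iam_authentication"    # required for IAM database auth
      value = "on"
    }
  }
}
```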

# IAM authentication

DataHub GMS authenticates to Cloud SQL using the datahub-sa Google service account via IAM database authentication. No passwords for programmatic access:

datahub-gms pod (k8s SA: datahub-gms)
    → Workload Identity → Google SA: datahub-sa
    → IAM database auth → Cloud SQL Postgres

The datahub-sa service account is granted roles/cloudsql.client and roles/cloudsql.instanceUser, and the corresponding IAM database user is created with GRANT ALL on the DataHub database.
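
A minimal Terraform sketch of those grants (the service-account resource is assumed to be defined elsewhere; names are illustrative):

```hcl
# Cloud SQL roles for the DataHub Google service account.
resource "google_project_iam_member" "datahub_sql_client" {
  project = var.project_id
  role    = "roles/cloudsql.client"
  member  = "serviceAccount:${google_service_account.datahub_sa.email}"
}

resource "google_project_iam_member" "datahub_sql_instance_user" {
  project = var.project_id
  role    = "roles/cloudsql.instanceUser"
  member  = "serviceAccount:${google_service_account.datahub_sa.email}"
}

# IAM database user: Cloud SQL expects the SA email without the
# ".gserviceaccount.com" suffix for CLOUD_IAM_SERVICE_ACCOUNT users.
resource "google_sql_user" "datahub_iam" {
  instance = google_sql_database_instance.datahub.name
  name     = trimsuffix(google_service_account.datahub_sa.email, ".gserviceaccount.com")
  type     = "CLOUD_IAM_SERVICE_ACCOUNT"
}
```

The GRANT ALL on the DataHub database itself is applied inside PostgreSQL; database-level privileges are not expressible as Cloud SQL IAM roles.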

# Kafka (Strimzi)

Kafka serves as DataHub's event bus for metadata change propagation.

# Why self-hosted

Pricing for GCP's Managed Service for Apache Kafka is prohibitive at our workload size: the smallest cluster costs several hundred dollars per month. Strimzi on GKE gives us a full Kafka cluster with an operator-managed lifecycle at node cost only.

Documented as an upgrade path: when Managed Kafka pricing improves, migrating from Strimzi to the managed service is a config-level change (update broker endpoints in the DataHub Helm values, decommission the Strimzi resources).

# Strimzi deployment

| Setting | Dev | Prod |
|---|---|---|
| Brokers | 3 | 5-7 |
| Replication factor | 2 | 3 |
| Storage per broker | 20 GB PD-SSD | 100 GB PD-SSD |
| Strimzi operator version | Latest stable | Same |
| Cruise Control | Enabled | Enabled |

Strimzi is deployed in two Terraform resources within dev-02-k8s-base/kafka.tf:

  1. Strimzi operator - Helm release of the Strimzi operator chart.
  2. Kafka cluster - Kafka CRD applied via the kubernetes_manifest resource, referencing the operator.
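
A condensed sketch of those two resources, assuming the upstream Strimzi chart and a dev-sized cluster (namespace, cluster name, and the exact Kafka spec fields are illustrative and depend on the operator version in use):

```hcl
# 1. Strimzi operator via Helm.
resource "helm_release" "strimzi_operator" {
  name       = "strimzi"
  namespace  = "kafka"
  repository = "https://strimzi.io/charts/"
  chart      = "strimzi-kafka-operator"
}

# 2. Kafka cluster CRD reconciled by the operator.
resource "kubernetes_manifest" "kafka_cluster" {
  manifest = {
    apiVersion = "kafka.strimzi.io/v1beta2"
    kind       = "Kafka"
    metadata = {
      name      = "datahub-kafka"   # hypothetical cluster name
      namespace = "kafka"
    }
    spec = {
      kafka = {
        replicas = 3
        config = {
          "default.replication.factor" = 2
          "min.insync.replicas"        = 1
        }
        storage = {
          type = "persistent-claim"
          size = "20Gi"
        }
        listeners = [
          { name = "plain", port = 9092, type = "internal", tls = false }
        ]
      }
      cruiseControl = {}   # enables Cruise Control
      # Depending on the Strimzi version, a zookeeper block or KRaft node pools
      # are also required; omitted here for brevity.
    }
  }

  depends_on = [helm_release.strimzi_operator]
}
```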

# Topics

DataHub requires several Kafka topics. The DataHub Helm chart creates them automatically via its kafka-setup init job. Topic names follow DataHub's defaults:

  • MetadataChangeLog_Versioned_v1
  • MetadataChangeLog_Timeseries_v1
  • MetadataAuditEvent_v1
  • PlatformEvent_v1
  • And several others (DataHub manages these; we do not create them manually).

# Autoscaling

Strimzi does not natively auto-scale brokers. However:

  • Kafka broker scaling: Strimzi supports declarative scaling (change replicas in the Kafka CRD and the operator handles rolling addition/removal). This is a manual operation triggered by monitoring.
  • Node-level scaling: the GKE Cluster Autoscaler adds nodes when Kafka pods are pending. This handles the compute side.
  • Partition rebalancing: Cruise Control rebalances partitions after broker additions, distributing load evenly.

Alerts to watch: broker CPU > 80%, consumer lag > threshold, broker count at minimum. See Observability.
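
As an example of the manual scaling flow described above: after bumping spec.kafka.replicas in the Kafka CRD, a Cruise Control rebalance can be requested declaratively with a KafkaRebalance resource (the cluster name is illustrative and must match the Kafka CRD):

```yaml
# Request a full rebalance after adding brokers; approve it with:
#   kubectl annotate kafkarebalance post-scale-up strimzi.io/rebalance=approve -n kafka
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaRebalance
metadata:
  name: post-scale-up
  namespace: kafka
  labels:
    strimzi.io/cluster: datahub-kafka   # must match the Kafka CRD name
spec: {}                                 # empty spec = full rebalance with defaults
```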

# OpenSearch

OpenSearch provides full-text search and graph query capabilities for DataHub's discovery UI.

# Deployment

| Setting | Dev | Prod |
|---|---|---|
| Data nodes | 3 | 5-7 |
| Storage per node | 20 GB PD-SSD | 50 GB PD-SSD |
| JVM heap | 512 MB | 2 GB |
| Master-eligible nodes | All data nodes | Dedicated master nodes |

OpenSearch is deployed via the OpenSearch operator Helm chart in dev-02-k8s-base/opensearch.tf. The operator manages rolling upgrades, shard rebalancing, and node replacement.
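
A sketch of the dev cluster shape as an operator CRD, assuming the upstream OpenSearch operator (opensearch.opster.io); field names are illustrative and should be checked against the operator chart version pinned in opensearch.tf:

```yaml
# Illustrative OpenSearchCluster resource for dev (3 data nodes, all master-eligible).
apiVersion: opensearch.opster.io/v1
kind: OpenSearchCluster
metadata:
  name: datahub-opensearch        # hypothetical name
  namespace: opensearch
spec:
  general:
    serviceName: datahub-opensearch
    version: 2.11.0               # illustrative version
  nodePools:
    - component: nodes
      replicas: 3
      diskSize: "20Gi"
      roles: ["master", "data"]   # dev: data nodes are also master-eligible
      jvm: "-Xms512M -Xmx512M"    # dev heap size
      resources:
        requests:
          cpu: "500m"
          memory: "1Gi"
```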

# Backups

OpenSearch snapshots are stored in GCS:

  • Schedule: daily snapshot via Kubernetes CronJob that calls the OpenSearch Snapshot API.
  • Bucket: ume-opensearch-snapshots-{env} (created in {env}-01-base).
  • Retention: 7 days dev, 30 days prod.
  • Restore procedure: documented in Operations.
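
A minimal sketch of the snapshot CronJob from the first bullet above, assuming a GCS snapshot repository (here called gcs-snapshots) has already been registered with the cluster; image, schedule, and service names are illustrative:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: opensearch-daily-snapshot
  namespace: opensearch
spec:
  schedule: "0 2 * * *"                    # daily at 02:00 UTC
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: snapshot
              image: curlimages/curl:8.8.0  # illustrative image
              command: ["/bin/sh", "-c"]
              args:
                # PUT a date-stamped snapshot into the pre-registered GCS repository.
                - >
                  curl -sf -X PUT
                  "http://datahub-opensearch:9200/_snapshot/gcs-snapshots/daily-$(date +%Y%m%d)?wait_for_completion=true"
```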

# Autoscaling

OpenSearch does not natively auto-scale. Scaling is handled by:

  • Horizontal: add data nodes by updating the operator CRD replica count. The operator handles rolling addition and shard redistribution.
  • Vertical: increase JVM heap and node resources in the CRD.
  • Node-level: GKE Cluster Autoscaler adds nodes when OpenSearch pods are pending.

Alerts to watch: JVM heap > 75%, unassigned shards > 0, disk usage > 80%.

# Identity and OAuth

# Google OIDC configuration

DataHub Frontend authenticates users via Google OIDC (OpenID Connect):

  1. An OAuth client is created in the GCP console (or via Terraform google_iap_client if IAP is used, or google_project_service_identity + manual OAuth consent screen setup).
  2. Client ID and client secret are stored in Secret Manager.
  3. Mounted into the DataHub Frontend pod via the Secret Manager CSI driver.
  4. DataHub's application.yml is configured via Helm values:
```yaml
datahub-frontend:
  extraEnvs:
    - name: AUTH_OIDC_ENABLED
      value: "true"
    - name: AUTH_OIDC_CLIENT_ID
      valueFrom:
        secretKeyRef:
          name: datahub-oidc-secret
          key: client_id
    - name: AUTH_OIDC_CLIENT_SECRET
      valueFrom:
        secretKeyRef:
          name: datahub-oidc-secret
          key: client_secret
    - name: AUTH_OIDC_DISCOVERY_URI
      value: "https://accounts.google.com/.well-known/openid-configuration"
    - name: AUTH_OIDC_BASE_URL
      value: "https://datahub.data.ume.com.br"
```

# Access restriction

  • Org-domain restriction: the OAuth consent screen is configured as "Internal" (GCP Workspace), which restricts login to users in the UME organization domain. This applies to both dev and prod.
  • DataHub groups: within DataHub, groups are created to map access levels:

| Group | DataHub role | Members |
|---|---|---|
| datahub-admins | Admin | Platform team |
| datahub-editors | Editor | Data engineers, analytics engineers |
| datahub-viewers | Viewer | All other org members |

Group membership is managed within DataHub's UI. Integration with Google Workspace groups is a future enhancement.

# Helm Values (Dev vs Prod)

The modules/datahub-helm/ module templates Helm values from per-environment variables. Key differences:

| Value path | Dev | Prod |
|---|---|---|
| datahub-gms.replicaCount | 1 | 2 |
| datahub-frontend.replicaCount | 1 | 2 |
| datahub-mae-consumer.replicaCount | 1 | 2 |
| datahub-mce-consumer.replicaCount | 1 | 2 |
| datahub-gms.resources.requests.memory | 1Gi | 4Gi |
| datahub-frontend.resources.requests.memory | 512Mi | 2Gi |
| global.sql.datasource.host | (from remote state) | (from remote state) |
| global.kafka.bootstrap.server | (from remote state) | (from remote state) |
| global.elasticsearch.host | (from remote state) | (from remote state) |

All backing-service endpoints are read from terraform_remote_state and passed into the Helm values dynamically.
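
A condensed sketch of that wiring (state layout, output names, and variables are illustrative; see modules/datahub-helm/ for the real template):

```hcl
# Read backing-service endpoints from an earlier layer's state.
data "terraform_remote_state" "k8s_base" {
  backend = "gcs"
  config = {
    bucket = var.state_bucket
    prefix = "${var.env}-02-k8s-base"   # hypothetical prefix
  }
}

resource "helm_release" "datahub" {
  name       = "datahub"
  repository = "https://helm.datahubproject.io"
  chart      = "datahub"
  version    = var.datahub_chart_version   # pinned in terraform.tfvars

  values = [yamlencode({
    global = {
      sql = {
        datasource = { host = data.terraform_remote_state.k8s_base.outputs.sql_host }
      }
      kafka = {
        bootstrap = { server = data.terraform_remote_state.k8s_base.outputs.kafka_bootstrap }
      }
      elasticsearch = {
        host = data.terraform_remote_state.k8s_base.outputs.opensearch_host
      }
    }
  })]
}
```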

# Known Risks and Mitigations

# The "metadata spike" risk

Analysts running large metadata crawls (e.g., a full BigQuery ingestion) can cause CPU/RAM spikes on GMS and a backlog on Kafka.

Mitigations:

  • GKE Cluster Autoscaler adds workload nodes when pods need more resources.
  • Kafka consumer lag alerts fire before the backlog becomes critical.
  • Documented guidance: run large crawls during off-hours.
  • DataHub supports staged ingestion (process N tables per run); configure crawls accordingly.

# Data persistence and corruption

If OpenSearch indices or Kafka topics are corrupted during a node upgrade, metadata discovery breaks.

Mitigations:

  • OpenSearch: daily GCS snapshots. Restore procedure in Operations.
  • Kafka: topics are recoverable by re-ingesting from DataHub's metadata sources (Cloud SQL is the durable store; Kafka is the event bus). Re-ingest procedure in Operations.
  • Cloud SQL: automated backups + PITR.
  • PDBs prevent multiple replicas from being evicted simultaneously during upgrades.
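
A representative PDB shape for GMS (the namespace and label selector are assumptions; the chart's actual labels should be used):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: datahub-gms
  namespace: datahub
spec:
  minAvailable: 1                          # keep at least one GMS replica up during node drains
  selector:
    matchLabels:
      app.kubernetes.io/name: datahub-gms  # assumed chart label
```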

# Version drift

DataHub releases frequently. Upgrading versions often involves metadata migrations that can take hours or fail.

Mitigations:

  • Never upgrade prod directly. Always upgrade dev first.
  • Exercise DataHub's migration path before prod: the Helm chart's upgrade hooks run metadata migrations automatically (and datahub docker quickstart includes a migration check), so test the full upgrade on dev first.
  • Pin the DataHub Helm chart version in terraform.tfvars. Version bumps are explicit PRs.
  • Upgrade procedure documented in Operations.

# The "connector trap"

Users will request fixes to Looker, dbt, or BigQuery ingestion connectors. This is data engineering work, not infrastructure.

Mitigations:

  • Clear ownership boundary: infrastructure team owns the DataHub platform (deployment, scaling, upgrades). Data engineering team owns ingestion recipes (connector config, scheduling, troubleshooting).
  • Ingestion recipes live in the DAGs repo, not in ume-data-infra.
  • The datahub-platform agent is scoped to Helm values and platform config; it does not touch ingestion recipes.

# Ingestion Recipes (Reference)

Ingestion recipes run as Airflow DAGs (in the DAGs repo, not in ume-data-infra). Wave-1 delivers three recipes:

| Recipe | Source | Frequency |
|---|---|---|
| BigQuery metadata | BigQuery datasets, tables, and column descriptions | Daily |
| Airflow DAGs | Airflow REST API | Daily |
| dbt manifests | dbt manifest.json + run_results.json | After each dbt run |

Each recipe is an Airflow DAG that invokes datahub ingest (DataHub's CLI) with a YAML recipe file. The DataHub CLI is included in the custom Airflow image.
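
For reference, a minimal shape of such a recipe for the BigQuery source (the project ID, source config fields, and GMS endpoint are illustrative; the real recipes live in the DAGs repo):

```yaml
# bigquery_recipe.yml -- run with: datahub ingest -c bigquery_recipe.yml
source:
  type: bigquery
  config:
    project_ids: ["ume-analytics-prod"]   # hypothetical project
    include_table_lineage: true
sink:
  type: datahub-rest
  config:
    # In-cluster GMS endpoint; service name assumed from the Helm release.
    server: "http://datahub-datahub-gms.datahub.svc.cluster.local:8080"
```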