# DataHub
DataHub is the data catalog and governance platform for UME. It is deployed on GKE via Helm, backed by Cloud SQL (PostgreSQL), Strimzi-managed Kafka, and self-hosted OpenSearch. This section covers the deployment architecture, component configuration, identity, and known risks.
For the business rationale, feature set, and governance workflows, see Data Catalog.
Key decisions:
- Cloud SQL Postgres (not MySQL, not AlloyDB) as the metadata store
- Strimzi Kafka on GKE (not GCP Managed Kafka) for cost reasons
- OpenSearch on GKE (not Elastic Cloud) for cost and data-residency reasons
- Google OIDC with org-domain restriction; DataHub internal groups for access levels
## Component Architecture
```
┌──────────────────────────────────────────────────────────────────────┐
│  GKE Cluster                                                         │
│                                                                      │
│  ┌──────────────────┐   ┌──────────────────┐   ┌────────────────┐    │
│  │ DataHub Frontend │   │   DataHub GMS    │   │  MAE Consumer  │    │
│  │    (React UI)    │   │  (Metadata Svc)  │   │                │    │
│  └────────┬─────────┘   └──┬─────┬─────┬───┘   └───────┬────────┘    │
│           │                │     │     │               │            │
│           │   ┌────────────┘     │     └──────────┐    │            │
│           │   │                  │                │    │            │
│  ┌────────▼───▼─────┐   ┌────────▼─────────┐   ┌──▼────▼──────────┐  │
│  │ Kafka (Strimzi)  │   │    OpenSearch    │   │   MCE Consumer   │  │
│  │    3 brokers     │   │   3 data nodes   │   │                  │  │
│  └──────────────────┘   └──────────────────┘   └──────────────────┘  │
│                                                                      │
└──────────────────────────────────┬───────────────────────────────────┘
                                   │ Private Service Access
                        ┌──────────▼──────────┐
                        │ Cloud SQL Postgres  │
                        │ (DataHub metadata)  │
                        └─────────────────────┘
```
## DataHub services
### Cloud SQL (PostgreSQL)
DataHub's metadata store. Holds entities, relationships, aspects, and system metadata.
#### Configuration
#### IAM authentication
DataHub GMS authenticates to Cloud SQL using the `datahub-sa` Google service account via IAM database authentication; there are no passwords for programmatic access:

```
datahub-gms pod (k8s SA: datahub-gms)
  → Workload Identity → Google SA: datahub-sa
  → IAM database auth → Cloud SQL Postgres
```

The `datahub-sa` service account is granted `roles/cloudsql.client` and `roles/cloudsql.instanceUser`, and the corresponding IAM database user is created with `GRANT ALL` on the DataHub database.
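A minimal sketch of the Workload Identity side of this chain, assuming the chart's `datahub-gms` Kubernetes service account and a `datahub` namespace (the project ID is a placeholder); the IAM role grants themselves live in Terraform:

```yaml
# Kubernetes service account used by the GMS pods. The annotation is what
# Workload Identity uses to map the KSA onto the datahub-sa Google SA.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: datahub-gms
  namespace: datahub                    # assumed namespace
  annotations:
    iam.gke.io/gcp-service-account: datahub-sa@<project-id>.iam.gserviceaccount.com
```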
### Kafka (Strimzi)
Kafka serves as DataHub's event bus for metadata change propagation.
#### Why self-hosted
GCP Managed Service for Apache Kafka pricing is prohibitive for the workload size. The smallest cluster costs several hundred dollars per month. Strimzi on GKE gives us full Kafka with operator-managed lifecycle at node cost only.
Documented as an upgrade path: when Managed Kafka pricing improves, migration from Strimzi to managed is a config-level change (update broker endpoints in DataHub Helm values, decommission Strimzi resources).
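For illustration, the broker endpoint is a single Helm value. The key path below follows the upstream DataHub chart's `global.kafka` values (verify against the pinned chart version), and the service name is a placeholder for the Strimzi bootstrap service:

```yaml
global:
  kafka:
    bootstrap:
      # Today: Strimzi's in-cluster bootstrap service (name is illustrative).
      # A managed-Kafka migration replaces this with the managed bootstrap endpoint.
      server: "datahub-kafka-kafka-bootstrap.kafka.svc.cluster.local:9092"
```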
#### Strimzi deployment
Strimzi is deployed in two Terraform resources within `dev-02-k8s-base/kafka.tf`:
- Strimzi operator - a Helm release of the Strimzi operator chart.
- Kafka cluster - a `Kafka` CRD applied via the `kubernetes_manifest` resource, referencing the operator (see the example below).
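The applied `Kafka` resource looks roughly like the sketch below. Names, sizes, and config values are illustrative rather than the actual Terraform-rendered manifest, and newer Strimzi versions may use KRaft node pools instead of the ZooKeeper stanza shown:

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: datahub-kafka              # illustrative cluster name
  namespace: kafka
spec:
  kafka:
    replicas: 3                    # the 3-broker layout from the diagram above
    listeners:
      - name: plain
        port: 9092
        type: internal
        tls: false
    config:
      default.replication.factor: 3
      min.insync.replicas: 2
      offsets.topic.replication.factor: 3
    storage:
      type: persistent-claim
      size: 100Gi                  # illustrative
  zookeeper:
    replicas: 3
    storage:
      type: persistent-claim
      size: 10Gi
  entityOperator:
    topicOperator: {}
    userOperator: {}
  cruiseControl: {}                # enables declarative rebalancing (see Autoscaling)
```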
#### Topics
DataHub requires several Kafka topics. The DataHub Helm chart creates them automatically via its `kafka-setup` init job. Topic names follow DataHub's defaults:
- `MetadataChangeLog_Versioned_v1`
- `MetadataChangeLog_Timeseries_v1`
- `MetadataAuditEvent_v1`
- `PlatformEvent_v1`
- And several others (DataHub manages these; we do not create them manually).
#### Autoscaling
Strimzi does not natively auto-scale brokers. However:
- Kafka broker scaling: Strimzi supports declarative scaling (change `replicas` in the `Kafka` CRD and the operator handles rolling addition/removal). This is a manual operation triggered by monitoring.
- Node-level scaling: the GKE Cluster Autoscaler adds nodes when Kafka pods are pending. This handles the compute side.
- Partition rebalancing: Cruise Control rebalances partitions after broker additions, distributing load evenly (see the sketch below).
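With Cruise Control enabled on the `Kafka` CRD, a post-scale-up rebalance can be requested declaratively. A minimal sketch, where the cluster label must match the actual `Kafka` resource name (illustrative here):

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaRebalance
metadata:
  name: rebalance-after-scale-up
  namespace: kafka
  labels:
    strimzi.io/cluster: datahub-kafka   # illustrative; must match the Kafka resource
spec: {}                                # empty spec requests a full rebalance with default goals
# Cruise Control computes a proposal; approve it with:
#   kubectl annotate kafkarebalance rebalance-after-scale-up strimzi.io/rebalance=approve -n kafka
```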
Alerts to watch: broker CPU > 80%, consumer lag > threshold, broker count at minimum. See Observability.
### OpenSearch
OpenSearch provides full-text search and graph query capabilities for DataHub's discovery UI.
#### Deployment
OpenSearch is deployed via the OpenSearch operator Helm chart in dev-02-k8s-base/opensearch.tf. The operator manages rolling upgrades, shard rebalancing, and node replacement.
#### Backups
OpenSearch snapshots are stored in GCS:
- Schedule: daily snapshot via a Kubernetes CronJob that calls the OpenSearch snapshot API (see the sketch below).
- Bucket: `ume-opensearch-snapshots-{env}` (created in `{env}-01-base`).
- Retention: 7 days dev, 30 days prod.
- Restore procedure: documented in Operations.
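A sketch of the snapshot CronJob, assuming a GCS snapshot repository named `gcs-snapshots` has already been registered against the bucket; the namespace, schedule, and in-cluster OpenSearch service URL are placeholders:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: opensearch-snapshot
  namespace: opensearch
spec:
  schedule: "0 3 * * *"              # daily, off-hours
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: snapshot
              image: curlimages/curl:8.8.0
              command:
                - sh
                - "-c"
                # Create a date-stamped snapshot in the pre-registered GCS repository.
                - >
                  curl -sf -XPUT
                  "http://opensearch-cluster-master:9200/_snapshot/gcs-snapshots/snap-$(date +%Y%m%d)?wait_for_completion=true"
```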
#### Autoscaling
OpenSearch does not natively auto-scale. Scaling is handled by:
- Horizontal: add data nodes by updating the operator CRD replica count (see the sketch below). The operator handles rolling addition and shard redistribution.
- Vertical: increase JVM heap and node resources in the CRD.
- Node-level: GKE Cluster Autoscaler adds nodes when OpenSearch pods are pending.
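For illustration, both horizontal and vertical scaling are edits to the node pool in the operator's `OpenSearchCluster` CRD. Field names follow the opensearch-k8s-operator; the cluster name, version, and sizes are placeholders:

```yaml
apiVersion: opensearch.opster.io/v1
kind: OpenSearchCluster
metadata:
  name: datahub-search               # illustrative
  namespace: opensearch
spec:
  general:
    serviceName: datahub-search
    version: 2.11.0                  # illustrative
  nodePools:
    - component: data
      replicas: 3                    # horizontal scaling: bump this count
      diskSize: "100Gi"
      roles: ["data", "ingest"]
      jvm: "-Xms2g -Xmx2g"           # vertical scaling: raise heap alongside resources
      resources:
        requests:
          cpu: "1"
          memory: 4Gi
        limits:
          memory: 4Gi
```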
Alerts to watch: JVM heap > 75%, unassigned shards > 0, disk usage > 80%.
## Identity and OAuth
### Google OIDC configuration
DataHub Frontend authenticates users via Google OIDC (OpenID Connect):
- An OAuth client is created in the GCP console (or via Terraform `google_iap_client` if IAP is used, or `google_project_service_identity` plus manual OAuth consent screen setup).
- Client ID and client secret are stored in Secret Manager.
- Mounted into the DataHub Frontend pod via the Secret Manager CSI driver.
- DataHub's `application.yml` is configured via Helm values:
```yaml
datahub-frontend:
  extraEnvs:
    - name: AUTH_OIDC_ENABLED
      value: "true"
    - name: AUTH_OIDC_CLIENT_ID
      valueFrom:
        secretKeyRef:
          name: datahub-oidc-secret
          key: client_id
    - name: AUTH_OIDC_CLIENT_SECRET
      valueFrom:
        secretKeyRef:
          name: datahub-oidc-secret
          key: client_secret
    - name: AUTH_OIDC_DISCOVERY_URI
      value: "https://accounts.google.com/.well-known/openid-configuration"
    - name: AUTH_OIDC_BASE_URL
      value: "https://datahub.data.ume.com.br"
```
### Access restriction
- Org-domain restriction: the OAuth consent screen is configured as "Internal" (GCP Workspace), which restricts login to users in the UME organization domain. This applies to both dev and prod.
- DataHub groups: within DataHub, groups are created to map access levels. Group membership is managed in DataHub's UI; integration with Google Workspace groups is a future enhancement.
## Helm Values (Dev vs Prod)
The `modules/datahub-helm/` module templates Helm values based on environment variables; the key dev/prod differences are captured there. All backing-service endpoints are read from `terraform_remote_state` and passed into the Helm values dynamically.
## Known Risks and Mitigations
### The "metadata spike" risk
Analysts running large metadata crawls (e.g., full BigQuery ingestion) can cause CPU/RAM spikes on GMS and backlogs on Kafka.
Mitigations:
- GKE Cluster Autoscaler adds workload nodes when pods need more resources.
- Kafka consumer lag alerts fire before the backlog becomes critical.
- Documented guidance: run large crawls during off-hours.
- DataHub supports staged ingestion (process N tables per run); configure crawls accordingly.
### Data persistence and corruption
If OpenSearch indices or Kafka topics are corrupted during a node upgrade, metadata discovery breaks.
Mitigations:
- OpenSearch: daily GCS snapshots. Restore procedure in Operations.
- Kafka: topics are recoverable by re-ingesting from DataHub's metadata sources (Cloud SQL is the durable store; Kafka is the event bus). Re-ingest procedure in Operations.
- Cloud SQL: automated backups + PITR.
- PodDisruptionBudgets (PDBs) prevent multiple replicas from being evicted simultaneously during upgrades (see the sketch below).
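A sketch of the kind of PodDisruptionBudget applied to the stateful components; the name, namespace, and label selector are placeholders, since the DataHub/Strimzi/OpenSearch charts and operators typically template their own:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: opensearch-data-pdb
  namespace: opensearch
spec:
  maxUnavailable: 1                # at most one data node down at a time during node upgrades
  selector:
    matchLabels:
      opensearch.role: data        # placeholder; match the actual pod labels
```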
### Version drift
DataHub releases frequently. Upgrading versions often involves metadata migrations that can take hours or fail.
Mitigations:
- Never upgrade prod directly. Always upgrade dev first.
- Run DataHub's built-in migration preflight script (`datahub docker quickstart` has a migration check; the Helm chart's upgrade hooks run migrations automatically, but test on dev first).
- Pin the DataHub Helm chart version in `terraform.tfvars`. Version bumps are explicit PRs.
- Upgrade procedure documented in Operations.
### The "connector trap"
Users will request fixes to Looker, dbt, or BigQuery ingestion connectors. This is data engineering work, not infrastructure.
Mitigations:
- Clear ownership boundary: infrastructure team owns the DataHub platform (deployment, scaling, upgrades). Data engineering team owns ingestion recipes (connector config, scheduling, troubleshooting).
- Ingestion recipes live in the DAGs repo, not in `ume-data-infra`.
- The `datahub-platform` agent is scoped to Helm values and platform config; it does not touch ingestion recipes.
## Ingestion Recipes (Reference)
Ingestion recipes run as Airflow DAGs (in the DAGs repo, not in `ume-data-infra`). Wave-1 delivers three recipes.
Each recipe is an Airflow DAG that invokes `datahub ingest` (DataHub's CLI) with a YAML recipe file; a sketch of such a recipe follows below. The DataHub CLI is included in the custom Airflow image.
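For reference only (recipes are owned by data engineering and live in the DAGs repo), a BigQuery recipe passed to `datahub ingest` has roughly this shape. Exact config keys vary by connector version, and the project ID and GMS URL below are placeholders:

```yaml
# recipe.yaml - invoked by the Airflow DAG as: datahub ingest -c recipe.yaml
source:
  type: bigquery
  config:
    project_id: my-gcp-project         # placeholder; key name varies by connector version
    include_table_lineage: true
sink:
  type: datahub-rest
  config:
    server: http://datahub-datahub-gms.datahub.svc.cluster.local:8080   # placeholder GMS endpoint
```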