# CDC fan-out architecture — Juice DB

# Context and problem statement

The production database is a Cloud SQL PostgreSQL 14 instance (700 GB disk, 10 vCPUs, 22 GB RAM). Today, Datastream replicates tables directly into BigQuery every 15 minutes. The BigQuery tables are unoptimized (no partitioning, no clustering, no column selection), and many reports source them directly with no bronze/silver/gold layering.

Several use cases require transactional (OLTP) data in PostgreSQL or AlloyDB rather than BigQuery, leading to inefficient patterns where data flows to BigQuery and then back to a relational database. The production database cannot be directly queried by downstream consumers or modified beyond minimal configuration.

# Architecture overview

The solution introduces Juice DB — a standalone AlloyDB instance that acts as a centralized, read-only fan-out hub. It receives data from production via PostgreSQL native logical replication and distributes it downstream through multiple mechanisms depending on the target type.

Production DB (Cloud SQL PG14)
        │
        │  PG logical replication (WAL)
        ▼
   Juice DB (AlloyDB PG15+)
        │
    ┌───┴─────────────┐
    │                 │
    ▼                 ▼
[Analytical]      [Transactional]
Datastream        PG publications with
+ dbt jobs        column + row filters
    │                 │
    ▼                 ▼
BigQuery          Cloud SQL instances
bronze +          (per-consumer)
silver/gold

# Key architectural decisions

# 1. Juice DB is a standalone AlloyDB instance, not a read replica

A Cloud SQL read replica would be simpler but cannot act as a logical replication publisher: a physical standby only replays the primary's WAL and, before PG16, cannot serve as a source of logical decoding. Juice DB must generate its own logical decoding output to fan out downstream. AlloyDB was chosen over Cloud SQL for Juice DB because its read pool architecture provides workload isolation — the primary node handles replication while read pool nodes absorb dbt queries and Datastream reads without affecting replication performance.

Implication: Juice DB is fed by logical replication from production, not physical streaming replication. This allows Juice DB to run a different PG major version (PG15+) than production (PG14).

# 2. Production → Juice DB: PG native logical replication (cross-version)

PostgreSQL logical replication supports cross-major-version replication. A PG14 publisher can feed a PG15 subscriber natively.

On production (one-time setup):

  • cloudsql.logical_decoding set to on (likely already enabled for existing Datastream)
  • A dedicated replication user with REPLICATION attribute
  • A publication: CREATE PUBLICATION juice_pub FOR TABLE table_a, table_b, ...;

On Juice DB:

  • Private IP connectivity to production (same project, same VPC)
  • CREATE SUBSCRIPTION juice_sub CONNECTION 'host=<prod-private-ip> ...' PUBLICATION juice_pub;

On PG14, the publication is table-level only — no column or row filtering at this leg. That is intentional; Juice DB is the full-fidelity copy. Filtering happens one hop later.

Production footprint: One publication, one replication user, one replication slot. The publication was created independently (not reusing Datastream's publication) to ensure independent lifecycle control.
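
For reference, a minimal end-to-end sketch of this one-time setup (role name, grants, and connection parameters are illustrative placeholders, not the actual values):

-- On production (PG14): replication role, read access for the initial table sync, publication
CREATE ROLE juice_repl WITH LOGIN REPLICATION PASSWORD '...';
GRANT SELECT ON table_a, table_b TO juice_repl;
CREATE PUBLICATION juice_pub FOR TABLE table_a, table_b;

-- On Juice DB (PG15+): matching table definitions must already exist;
-- the subscription creates its own replication slot on production
CREATE SUBSCRIPTION juice_sub
  CONNECTION 'host=<prod-private-ip> dbname=... user=juice_repl password=...'
  PUBLICATION juice_pub;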

# 3. Juice DB → downstream PG targets: PG15 logical replication with filtered publications

PG15 introduces native column-list and row-filter support on publications. This enables the core fan-out pattern:

-- Example: Customer A gets a filtered subset
CREATE PUBLICATION customer_a_pub
  FOR TABLE orders (id, order_date, amount, status)
  WHERE (tenant_code = 1);

-- Customer A's Cloud SQL instance subscribes
CREATE SUBSCRIPTION customer_a_sub
  CONNECTION 'host=<juice-db-ip> ...'
  PUBLICATION customer_a_pub;

Each downstream consumer gets exactly the tables, columns, and rows they need. This allows smaller instance sizes, tighter permission sets, and tenant-level data isolation.
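
One detail worth calling out: logical replication does not create tables on the subscriber, so each consumer instance needs matching DDL for its published columns before CREATE SUBSCRIPTION runs. A sketch for the example above (column types are assumptions, not production's actual schema):

-- On Customer A's Cloud SQL instance, before subscribing
CREATE TABLE orders (
  id         bigint PRIMARY KEY,
  order_date date,
  amount     numeric(12, 2),
  status     text
);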

# 4. Juice DB → BigQuery: Datastream CDC (bronze) + dbt (silver/gold)

Two parallel analytical paths:

  • Datastream from Juice DB lands raw CDC data into BigQuery as the bronze layer. Datastream supports table and column selection in its configuration. This replaces the current direct Datastream from production.
  • dbt jobs connect to Juice DB's AlloyDB read pool (isolating analytical query load from the replication primary) and materialize transformed models into BigQuery as the silver/gold layers.

# 5. Provisioning: Terraform for infrastructure, idempotent SQL for PG objects

The config repository is the single source of truth, structured as folders per consumer:

replication-configs/
  customer-a/
    config.yaml    # tables, columns, row_filter: "tenant_code = 1"
  customer-b/
    config.yaml    # tables, columns, row_filter: "tenant_code = 2"
  analytics-bronze/
    config.yaml    # Datastream config for BQ bronze
  analytics-dbt/
    config.yaml    # dbt source definitions

Terraform provisions infrastructure: AlloyDB/Cloud SQL instances, Datastream streams, networking, IAM, monitoring dashboards.

Idempotent SQL scripts provision PG-internal objects: publications, subscriptions, schema setup. These are generated from the YAML configs and executed via Airflow or CI/CD.

-- Idempotent publication creation pattern
DO $$
BEGIN
  IF NOT EXISTS (SELECT 1 FROM pg_publication WHERE pubname = 'customer_a_pub') THEN
    EXECUTE 'CREATE PUBLICATION customer_a_pub
      FOR TABLE orders (id, order_date, amount, status)
      WHERE (tenant_code = 1)';
  END IF;
END $$;

Rationale for the split: Terraform's postgresql provider can manage publications and subscriptions but is brittle for changes — publication modifications often require drop/recreate, which breaks active subscriptions. Idempotent SQL gives more surgical control over migration paths.
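
When a filtered publication does need to change, the more surgical path is to redefine the table entry in place and have the subscriber refresh, rather than dropping and recreating the publication. A sketch assuming the customer_a objects above, where shipped_at is a hypothetical newly exposed column:

-- On Juice DB: redefine the column list / row filter in place (PG15 syntax)
ALTER PUBLICATION customer_a_pub
  SET TABLE orders (id, order_date, amount, status, shipped_at)
  WHERE (tenant_code = 1);

-- On the consumer (its orders table must already have shipped_at):
ALTER SUBSCRIPTION customer_a_sub REFRESH PUBLICATION;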

# 6. Schema change propagation: polling-based drift detection

PostgreSQL logical replication does not propagate DDL. When production runs ALTER TABLE ADD COLUMN, the column is not created on subscribers, and apply halts with an error as soon as row changes that include it arrive (the subscriber table is missing a replicated column). DROP COLUMN leaves a stale column behind on subscribers, and an incompatible type change can likewise halt apply. While a subscription is halted, its replication slot retains WAL and disk usage grows on production.

Solution: An Airflow DAG polls information_schema.columns on production and Juice DB every 1-2 minutes. On drift detection:

  1. Generates ALTER TABLE statements for Juice DB
  2. Generates ALTER PUBLICATION statements for affected downstream publications
  3. Generates ALTER TABLE statements for each downstream subscriber
  4. Applies in order: Juice DB first, then downstream
  5. Notifies via Slack

Why polling over event triggers: An event_trigger on production would give real-time detection but adds ongoing maintenance overhead to the one system we committed to changing as little as possible. Polling is operationally simpler, and the 1-2 minute window is acceptable because there is always a gap between an ALTER TABLE and the first meaningful writes to the new columns.
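
A sketch of the column snapshot the DAG could run on each side and diff in Airflow (schema and table list are illustrative):

-- Run against production and against Juice DB, then compare the two result sets
SELECT table_name, column_name, data_type, is_nullable
FROM information_schema.columns
WHERE table_schema = 'public'
  AND table_name IN ('table_a', 'table_b', 'orders')
ORDER BY table_name, ordinal_position;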

# Monitoring requirements (to be detailed)

Key lag points to observe:

| Leg | Metric source | What to watch |
| --- | --- | --- |
| Prod → Juice DB | pg_stat_subscription on Juice DB | Replication lag (bytes + time) |
| Juice DB → PG targets | pg_stat_replication on Juice DB | Per-subscription lag |
| Juice DB → BQ | Cloud Monitoring (Datastream) | Stream freshness + throughput |
| Schema drift | Airflow DAG status | DAG failures, drift events |

Alerting thresholds and on-call routing TBD.
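
For the first two legs, a sketch of the underlying queries on Juice DB (the lag columns are approximations; thresholds themselves remain TBD):

-- Leg 1: production → Juice DB (run on Juice DB)
SELECT subname,
       latest_end_time,
       now() - latest_end_time AS approx_apply_lag
FROM pg_stat_subscription;

-- Leg 2: Juice DB → downstream PG targets (run on Juice DB)
SELECT application_name,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes,
       replay_lag
FROM pg_stat_replication;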

# Open questions

  • Monitoring and observability stack selection (Cloud Monitoring, Datadog, Grafana)
  • On-call routing: dedicated data platform team vs broader SRE
  • dbt model strategy: incremental vs full-refresh materializations, scheduling cadence
  • Existing Datastream from prod → BQ: migration/cutover plan to the new path through Juice DB
  • Catalog integration with DataHub: lineage tracking across the fan-out