# ETL

This section covers the processes responsible for moving and transforming data across the platform. We break this into two complementary components:

  • Transformations - the actual data work: cleaning, joining, aggregating, testing, and reshaping data
  • Orchestration - scheduling, dependency awareness, retry rules, operational monitoring and alerting

This helps us assign the right tools and responsibilities to each component.

# ELT and data movement

Traditional ETL (Extract, Transform, Load) transforms data before loading it into target storage for reporting. The more recent paradigm ELT (Extract, Load, Transform) changes this. The data is first loaded into the lake in raw form (bronze, landing, or raw area), then it is transformed in place using the lake engine's processing power. Ultimately, the "golden" data is consumed by reporting/dashboard applications. Sometimes the data is consumed from the lake directly or from dedicated workloads as seen in the Lake Engine section.

This approach offers several advantages:

  • Raw data preservation - original data is always available for reprocessing or auditing
  • Leverage the lake engine - BigQuery handles transformations efficiently, with no need for separate compute workloads or pricing SKUs.
  • Simpler pipelines - extraction and loading are decoupled from transformation logic.
  • Cost control - transformations expressed in SQL avoid spinning up dedicated compute and are more accessible for analytics engineers and power business users to understand and maintain.

Data flows through the medallion architecture:

Sources → Bronze (raw) → Silver (cleaned) → Gold (business-ready) → Consumption

# Data extraction

Extraction is the first step - getting data from source systems into the lake. While extraction itself is often handled by dedicated tools or source system integrations, there are important considerations:

  • CDC (Change Data Capture) is preferred for relational databases, as it captures only changes, reducing load on source systems.
  • When connecting to relational databases from our extraction code, copy data as fast as possible with minimal overhead. Prefer idempotent, timestamp-based filters (or cursors) to allow for reproducible extracts. Do not transform data on extraction. Keep original data types (schema-on-read).
  • Use incremental extraction where possible. Full extracts are expensive and slow, burdening both source systems and the lake engine.
  • Handle schema changes gracefully - sources evolve; extraction should not break silently. Implement "schema evolution" strategies where additive changes are accepted automatically.
  • Log extraction metadata - record what was extracted, when, record counts, and any errors. Batch IDs should group artifacts of the same schedule run (e.g., Airflow dag_run_id).
  • Quarantine failed extractions - do not mix bad data with good. Alert and investigate processes that fail validation.
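The considerations above - incremental, cursor-based, idempotent extraction that preserves raw data and records batch metadata - can be sketched in plain Python. All names here (`extract_incremental`, `ExtractionResult`) are illustrative, not part of any real framework:

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class ExtractionResult:
    batch_id: str        # groups artifacts of the same scheduled run
    extracted_at: str    # extraction metadata for logging
    record_count: int
    rows: list

def extract_incremental(source_rows, last_cursor, batch_id):
    """Idempotent, cursor-based extraction: the same cursor always
    yields the same rows, so a re-run is reproducible.

    `source_rows` stands in for a source table; each row carries an
    `updated_at` timestamp used as the incremental cursor. No
    transformation is applied: rows land in Bronze as-is.
    """
    new_rows = [r for r in source_rows if r["updated_at"] > last_cursor]
    return ExtractionResult(
        batch_id=batch_id,
        extracted_at=datetime.now(timezone.utc).isoformat(),
        record_count=len(new_rows),
        rows=new_rows,
    )
```

Running the same extraction with the same cursor twice returns identical rows, which is what makes re-runs and backfills safe.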

Extraction processes should land data in the Bronze tier in its raw form, preserving the original structure and data types. Transformation comes later. This also enables easier replacement of source systems or ingestion tools without rewriting business logic.

See Data Sources for detailed onboarding patterns and schema validation at extraction time.

# Extraction patterns & Metadata-Driven Ingestion

Because of the variety of technologies used for extracting data (API, DBMS/SQL, file deserialization), the data team must avoid writing custom code for every new source. Instead, we adopt Metadata-Driven Ingestion.

The data team provides standardized extraction frameworks (blueprints/operators) for common sources. Users (or the data team itself) only configure the metadata:

  • Source connection details
  • Target destination
  • Schedule frequency
  • Schema/Table list

This code can live at the orchestration level (e.g., reusable Airflow operators) or as a task configuration + wrapper execution container (e.g., Airflow task template launching a standardized Golang/Python container).
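A minimal sketch of the metadata-driven approach, independent of any orchestrator: the user supplies only the four metadata fields listed above, and framework code expands them into per-table task definitions. The class and function names are hypothetical; in practice these dicts would parameterize a reusable Airflow operator or a standardized extraction container.

```python
from dataclasses import dataclass, field

@dataclass
class IngestionConfig:
    # The four metadata fields a user supplies; everything else is framework code.
    source_conn: str                 # source connection details (e.g. a connection ID)
    target_dataset: str              # target destination in the Bronze tier
    schedule: str                    # schedule frequency (cron expression)
    tables: list = field(default_factory=list)  # schema/table list

def build_tasks(config: IngestionConfig):
    """Expand one metadata entry into per-table task definitions."""
    return [
        {
            "task_id": f"extract_{t}",
            "conn": config.source_conn,
            "destination": f"{config.target_dataset}.{t}",
            "schedule": config.schedule,
        }
        for t in config.tables
    ]
```

Adding a new source table becomes a one-line config change rather than new extraction code.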


# Transformations

# Transformation flavors

Because of different needs, we distinguish between a few types of transformation workloads. The main flavors based on complexity and tooling are:

# SQL-based transformations (dbt)

The preferred path for typical analytics workloads. If it can be expressed in SQL, it should be.

Characteristics:

  • Declarative, readable, version-controlled
  • Runs on BigQuery - no separate compute costs
  • Supports incremental models, testing, and documentation
  • Lineage captured automatically and imported into data governance tooling

Tooling: dbt Core with instrumented blueprints that enable non-developers to contribute. The data team maintains patterns for:

  • Creating new projects
  • Cross-referencing datasets and existing dbt projects
  • Maintaining existing datamarts
  • Evaluating impact

Please observe Lake Engine good practices for clustering and partitioning because transformations, where workload is more predictable, are some of the best contenders for cost optimization.
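As a sketch of what those good practices look like in BigQuery DDL terms, the helper below renders a partitioned and clustered table statement. The function and the placeholder `AS SELECT` body are illustrative only:

```python
def create_table_ddl(table, partition_col, cluster_cols):
    """Render a BigQuery-style DDL statement with partitioning and clustering.

    Partitioning on a date column plus clustering on frequent filter
    columns keeps predictable transformation workloads cheap, because
    queries scan only the partitions and blocks they need.
    """
    return (
        f"CREATE TABLE IF NOT EXISTS {table}\n"
        f"PARTITION BY DATE({partition_col})\n"
        f"CLUSTER BY {', '.join(cluster_cols)}\n"
        "AS SELECT * FROM source_table  -- hypothetical source"
    )
```

In dbt, the equivalent is set declaratively via the model's `partition_by` and `cluster_by` configs rather than hand-written DDL.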

# Compute-based transformations (Spark/containers)

For cases where SQL is insufficient - e.g., complex algorithms, machine learning processing, or legacy workloads.

Characteristics:

  • Runs on Dataproc (Spark) or Cloud Run (containers)
  • Spins up dedicated compute, charged by processing time
  • Requires proper tagging for cost attribution
  • Should be migrated to SQL when feasible
  • Can be triggered via dbt Python models

When to use:

  • Python/Scala logic that cannot be expressed (or not easily) in SQL
  • Legacy pipelines pending migration

# Design principles

  1. Idempotency - running a pipeline multiple times should produce the same result. See the still-relevant article "Functional Data Engineering - a modern paradigm for batch data processing"
  2. Incremental processing - process only new or changed data when possible
  3. Lineage capture - track data origins and transformations automatically
  4. Testability - every pipeline includes automated tests
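Principles 1 and 2 combine naturally in the partition-overwrite pattern from functional data engineering: instead of appending, each run replaces the whole partition it owns. A minimal sketch, with a dict standing in for a partitioned lake table:

```python
def overwrite_partition(table, partition_key, rows):
    """Idempotent write: replace the entire partition instead of appending.

    Re-running the same load for the same partition yields the same
    table state, so retries and backfills never duplicate data.
    """
    table[partition_key] = list(rows)
    return table
```

In BigQuery this corresponds to writing with `WRITE_TRUNCATE` against a single partition, or to dbt's `insert_overwrite` incremental strategy.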

# New project checklist

When onboarding a new data project or datamart, one suggestion is to ensure the following items are documented and agreed:

  • Data sources - what systems feed this project, who owns them
  • Schema definitions - expected structure, registered in data catalog
  • Refresh frequency - how often data should be extracted and transformed
  • Data owner and steward - who is accountable, who maintains
  • Downstream consumers - dashboards, models, or teams that depend on this data; these dependencies should also be reflected automatically in code (DAGs, dbt lineage, dashboards, etc.)
  • Testing strategy - which tests apply, leverage existing patterns
  • Retention policy - how long data is kept at each storage tier, where applicable
  • Cost estimate - expected storage growth YoY and compute/processing costs, what tagging or data separation strategy to use for cost monitoring

This checklist lives in the data catalog alongside the project documentation. Completing it could represent an additional indicator of quality.

# Schema enforcement

Validate data against expected schemas as early as possible:

  • Schema validation on ingestion - reject or quarantine non-conforming data
  • Schema evolution - handle backward-compatible changes gracefully (additive columns with defaults)
  • Breaking changes - require coordinated migration and documentation
  • Data catalog as source of truth - schema definitions live in the governance tool
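The distinction between backward-compatible and breaking changes can be sketched as a simple classifier. Schemas are modeled here as `{column: type}` dicts; a real implementation would also cover type widening, nullability, and nested fields:

```python
def check_schema_change(current, incoming):
    """Classify a schema change.

    'additive'  -> new columns only; accepted automatically
    'breaking'  -> dropped or retyped columns; requires coordinated migration
    'unchanged' -> schemas match
    """
    removed = set(current) - set(incoming)
    retyped = {c for c in current if c in incoming and current[c] != incoming[c]}
    if removed or retyped:
        return "breaking"
    if set(incoming) - set(current):
        return "additive"
    return "unchanged"
```

Run at ingestion time, an 'additive' result lets the pipeline proceed while a 'breaking' result routes the batch to quarantine and raises an alert.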

See Data Sources for additional schema validation patterns.

# Data testing

Automated validation ensures data quality and builds trust. dbt is our primary tool for data testing. It integrates testing directly into the transformation workflow.

| Test type | Description |
| --- | --- |
| Schema tests | Column presence, types, nullability |
| Uniqueness tests | Primary key violations |
| Referential tests | Foreign key relationships |
| Range tests | Values within expected bounds |
| Freshness tests | Data recency checks |
| Custom assertions | Business-specific rules |

Implementation:

  • Tests run as part of pipeline execution (dbt test)
  • Failures block downstream processing
  • Results logged and surfaced in the data catalog. Users should be able to see test status alongside data documentation
  • Alerts sent on test failures
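In dbt these checks are declared as generic tests in YAML; the uniqueness, schema, and range tests above boil down to assertions like the following (function names are illustrative, operating on rows as dicts):

```python
def check_unique(rows, key):
    """Uniqueness test: no primary-key violations."""
    keys = [r[key] for r in rows]
    return len(keys) == len(set(keys))

def check_not_null(rows, column):
    """Schema test: column present and non-null in every row."""
    return all(r.get(column) is not None for r in rows)

def check_in_range(rows, column, lo, hi):
    """Range test: values within expected bounds."""
    return all(lo <= r[column] <= hi for r in rows)
```

When any check returns False, the pipeline blocks downstream processing and raises an alert, mirroring the `dbt test` failure behavior described above.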

The data team curates a library of common testing patterns as reusable dbt macros (still to be implemented), via generally available documentation, and through generic tests. New projects can leverage these patterns rather than writing tests from scratch. A user building a solution should find these patterns readily accessible. Introducing testing agents in the data catalog/documentation tool will enable faster adoption.

Transparency about data testing in the data catalog builds confidence: users know whether data passed quality gates before consuming it.

# Intermediate snapshots for auditability

Each data transition applies specific transformations and quality gates. A well-designed transformation pipeline provides intermediate snapshots at each storage tier:

  • Bronze snapshots - raw data as received, timestamped
  • Silver snapshots - cleaned data before business logic - can have many intermediate models
  • Gold snapshots - final business-ready state - can have many intermediate models

This enables:

  • Reproducibility - recreate any report at any point in time
  • Auditing - trace values back to source
  • Debugging - isolate where issues were introduced
  • Compliance - demonstrate data lineage to maintainers and business people

Reporting over intermediate snapshots is the opposite of shift-right complexity, where dashboards contain all the business logic. Push logic into the earliest transformations possible (shift left), keep reporting simple, and apply standard data warehousing techniques. Offer dashboard users links to the underlying data and reports.

# PII and Sensitive Data Handling

Handling Personally Identifiable Information (PII) is a critical part of the ETL process.

  • Identify early: Tag sensitive columns (email, phone, national IDs) in the Data Catalog and dbt models.
  • Masking/Hashing: Apply masking or hashing at the Silver layer. The Bronze layer may contain raw PII (restricted access), but downstream Gold models exposed to broader audiences must have PII obfuscated.
  • Right to be Forgotten: Design pipelines to support data deletion requests (LGPD/GDPR). This often requires mutable storage patterns or keyed-deletion capability in the Lake Engine.
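The masking/hashing step at the Silver layer can be sketched with the standard library. Deterministic salted hashing (shown here; the salt itself must live in a secret store) preserves joinability across tables while obfuscating the raw value:

```python
import hashlib

def hash_pii(value: str, salt: str) -> str:
    """Deterministic salted SHA-256 hash for PII columns at the Silver layer.

    The same (value, salt) pair always produces the same digest, so
    hashed columns can still be used as join keys downstream.
    """
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
```

Note that deterministic hashing is pseudonymization, not anonymization: under LGPD/GDPR the hashed values may still count as personal data, which is why deletion support remains necessary.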

# Tenancy controls (Multi-tenancy)

For multi-tenant architectures, we apply strict isolation patterns:

  • Tenancy - partition data by tenant; apply tenant-specific logic where needed.
  • Compute isolation - separate pipeline workloads if strict regulatory isolation is required.
  • Cost attribution - tag jobs by tenant for cost separation.
  • Data partitioning - use tenant ID as a partition key for efficient filtering and cost control.
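Two of these patterns - cost attribution and tenant-keyed partitioning - can be sketched briefly. Both helpers are hypothetical illustrations, not part of any client library:

```python
def tenant_job_labels(tenant_id: str, team: str, purpose: str) -> dict:
    """Labels attached to compute jobs (Dataproc, BigQuery) so billing
    exports can be broken down per tenant, team, and purpose."""
    return {"tenant": tenant_id, "team": team, "purpose": purpose}

def tenant_scoped_query(table: str) -> str:
    """Query template filtering on the tenant partition key.

    Filtering on the partition key prunes partitions: less data scanned,
    lower cost, and no cross-tenant reads. The tenant value is bound as
    a query parameter at execution time, never interpolated.
    """
    return f"SELECT * FROM {table} WHERE tenant_id = @tenant_id"
```

For strict regulatory isolation, per-tenant datasets or projects replace the shared-table pattern entirely.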

# Orchestration

Orchestration is distinct from transformation: it handles when and in what order (dependencies, impact) pipelines run, not the actual data transformation work.

# Capabilities

  • Schedule-based execution - cron-like triggers for recurring jobs
  • Event-triggered execution - react to file arrivals, API calls, or upstream events
  • Dependency management - ensure upstream pipelines complete before downstream ones start
  • Retry logic and failure handling - automatic retries with backoff
  • Backfill support - reprocessing historical data when logic changes should be possible by design; treat backfills as a first-class pattern, not an afterthought
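Orchestrators like Airflow provide retries with backoff natively (the `retries` and `retry_exponential_backoff` task arguments); the sketch below shows the behavior itself, with an injectable `sleep` so it can be exercised without waiting:

```python
import time

def run_with_retries(task, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Retry a failing task with exponential backoff.

    Delays grow as base_delay * 2**attempt (1s, 2s, 4s, ...); the final
    failure is re-raised so the orchestrator can mark the task failed
    and trigger alerting.
    """
    for attempt in range(max_retries + 1):
        try:
            return task()
        except Exception:
            if attempt == max_retries:
                raise
            sleep(base_delay * 2 ** attempt)
```

Combined with idempotent pipelines, retries are safe: a task that partially ran and is retried cannot duplicate data.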

# Alerting and playbooks

Operational monitoring is critical for maintaining trust:

Alerting on:

  • Pipeline failures
  • Data quality test failures
  • Unusual execution times (anomaly detection)
  • Cost threshold breaches
  • Data freshness violations

Playbooks:

  • Documented response procedures for common issues
  • Escalation paths to responsible teams
  • Runbook automation where possible. Self-healing for known issues

# Security

  • No long-lived credentials - use workload identity for all jobs, especially on Cloud Composer, which supports it by design.
  • Least-privilege access - pipelines access only the data they need; scope permissions through per-DAG configuration.
  • Audit trail - all executions logged with identity and results.
  • Secret management - Whenever necessary, credentials must be stored in Secret Manager, not in code or environment variables. Secrets must be accessed using workload identity, not long lived credentials.

# Best practices

  • Define clear ownership for each pipeline.
  • Set appropriate timeouts and retries to avoid "zombie" tasks clogging the scheduler.
  • Micro-batch vs. Streaming: Most workloads are well-suited for micro-batch (e.g., hourly/daily). Use streaming (Dataflow/Flink) only when low latency (<15 min) is a strictly validated business requirement, as it introduces significant operational complexity.
  • Monitor execution duration trends - catch regressions and anomalies early.
  • Document dependencies explicitly - use code references (e.g., dbt ref()) or DAG dependencies.
  • Publish DAGs via CI/CD - integrated with version control, linting, and automated testing.

# SDLC and Environments

To maintain stability and trust, UME must follow a strict Software Development Life Cycle (SDLC) process for all data assets. This process should enable easy collaboration across teams with different skill levels.

Environments:

  • Development (Dev): Sandbox for engineers. Can use sampled or anonymized production data. Resources here should be transient.
  • Staging (QA): Mirror of production environment configurations. Used for integration testing and UAT (User Acceptance Testing). Data here must be masked if PII is involved. Can use a subset of production data.
  • Production (Prod): The source of truth. Locked down; only accessible via CI/CD pipelines. No manual write access.

Workflow:

  • [suggestion] Feature Branch > Pull Request (Code Review + Automated Linting) > Merge to Main > Deploy to staging > Automated Tests > Promotion to Production.
  • Promotion to production can use conventional commits + semantic release processes both for orchestration dags and transformation code.
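The conventional commits + semantic release step can be sketched as a classifier mapping a commit message to a version bump, which mirrors what semantic-release tooling does during promotion (the function is illustrative):

```python
def release_bump(commit_message: str) -> str:
    """Map a Conventional Commits message to a semantic version bump.

    'BREAKING CHANGE' in the body, or '!' in the type prefix -> major;
    'feat' -> minor; 'fix' -> patch; anything else -> no release.
    """
    header = commit_message.splitlines()[0]
    if "BREAKING CHANGE" in commit_message or "!" in header.split(":")[0]:
        return "major"
    if header.startswith("feat"):
        return "minor"
    if header.startswith("fix"):
        return "patch"
    return "none"
```

Because the bump is derived purely from commit history, promotion to production needs no manual version bookkeeping for either DAGs or transformation code.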

Infrastructure as Code (IaC):

  • Orchestration resources (Airflow environments), IAM roles, and storage buckets should be managed via Terraform, Pulumi or other IaC tool of choice, ensuring identical configurations across environments.

# Governance recommendations

  • Establish blueprints for common transformation patterns (CDC ingestion, API loaders, log parsers, aggregations) to accelerate development and enforce consistency
  • Require data tests for all pipelines - no exceptions; untested pipelines do not reach production. During the transition period, exceptions are only flagged, not blocked.
  • Surface test results and data quality metrics in the data catalog so consumers have visibility
  • Use workload identity for all jobs where possible. Reduce or eliminate long-lived credentials. Monitor.
  • Tag compute-based jobs (Spark, containers) with project, team, and purpose for cost attribution. Use dbt instrumentation for python jobs.
  • Monitor cost trends for Dataproc and container workloads; evaluate migration to SQL alternatives
  • Define schema contracts early; use data catalog as source of truth for schema definitions
  • Capture lineage metadata automatically. Transformation tools should integrate with catalog
  • Document playbooks for common failure scenarios. Automate recovery whenever possible (e.g., transient database errors should retry automatically; schema violations should alert).