# Data Science

Data science workloads occupy a unique position in the data platform. They are both consumers of governed data and producers of new assets - features, models, predictions - that feed back into business processes. This dual role creates governance challenges that are distinct from pure analytics.

The goal is not to constrain experimentation - rapid iteration is essential to data science work. The goal is to ensure that when experiments mature into production assets, they integrate cleanly with the governed platform: traceable, reproducible, and compliant.

# Current Challenges

Data science at UME currently faces challenges that are common in growing organizations:

  • Isolated notebooks - Data scientists work in local Jupyter environments, producing analyses and models that are only beginning to be version-controlled, are not always fully reproducible, and are only loosely connected to the governed data pipeline.
  • Disconnected data preparation - Feature engineering and data transformations happen in ad-hoc Python scripts; they should increasingly move into the governed ETL layer and follow the patterns the data teams recommend. Past practice created duplicate logic, inconsistent definitions, and ungoverned data sprawl.
  • Missing lineage - Models consume data and produce predictions, but these relationships are not always captured. When a source table changes, there is no way to know which models are affected or what the downstream impact is.
  • No model lifecycle - Models have been trained and deployed without standardized processes for versioning, monitoring, retraining, or retirement. This is beginning to change, but we want to establish a broader standard.
  • PII exposure - Experimentation environments often have broad access to raw data, including sensitive information that should be masked or restricted.

The recommendations here aim to bring data science work into the governed platform without sacrificing the agility that makes data science valuable.

# Integration with the Data Platform

Data scientists should be consumers of the same governed datasets that feed reporting. This alignment delivers consistency and reduces duplicate work.

# Consume Gold-layer Data

Models should be trained, as much as possible, on data from the Gold layer - curated, validated, and documented datasets that have passed quality gates. This ensures:

  • Consistency - The same data definitions used in reports are used in models.
  • Quality - Data has been cleaned, deduplicated, and tested.
  • Traceability - Lineage exists from source through transformations to the dataset.

Avoid training models directly on Bronze or Silver data. If a dataset needed for modeling does not exist in Gold, the solution is to create it through the governed ETL process, not to bypass the process.

We understand that experimentation often involves onboarding new data, and early results matter. Where ungoverned data must be used, work with it in dev/staging environments, and bring the data pipelines behind a model into the standard engineering patterns as soon as possible - before the model can be called production.
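As a minimal illustration, a promotion check could verify that every table a model reads comes from the Gold layer. The `gold.` dataset prefix and the helper below are hypothetical - adapt them to the platform's actual naming conventions:

```python
# Hypothetical guard: verify a model's input tables all come from the Gold layer.
# The "gold." dataset prefix is an assumed naming convention, not a platform API.

def gold_violations(input_tables: list[str]) -> list[str]:
    """Return the tables that violate the Gold-only rule (empty list = compliant)."""
    return [t for t in input_tables if not t.startswith("gold.")]

# A CI gate could fail the build whenever this list is non-empty.
violations = gold_violations(["gold.customers", "silver.raw_events"])
print(violations)  # ['silver.raw_events']
```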

# Align Data Preparation with ETL

When data scientists build feature engineering pipelines, they should follow patterns that can translate into dbt models. This does not mean every experiment must be production-ready from day one - that would kill agility. But it does mean:

  • Use SQL where possible - Transformations expressed in SQL can be migrated to dbt with minimal effort. Reserve Python for logic that genuinely cannot be expressed in SQL.
  • Follow naming conventions - Use the same naming patterns defined in ETL and Lake Engine so that features can be promoted to governed datasets.
  • Document assumptions - Even in exploratory work, capture what business rules are being applied. This documentation makes migration easier.

For complex data preparation that cannot reasonably be expressed in SQL, data scientists should work with the data engineering team to build well-architected pipelines that follow the standards defined in ETL - Compute-based transformations. The key principle: if it runs in production, it must be governed.

# Experimentation Environment

Data science experimentation needs a home - a place to explore, iterate, and collaborate without the overhead of production systems.

# Recommendation: Vertex AI Workbench with GitHub

Data scientists should own their experimentation platform. The recommendation is for each data science team (or domain) to maintain GitHub repositories as their codebase foundation, with Vertex AI Workbench providing the compute environment.

Why GitHub + Vertex AI Workbench:

  • Ownership and autonomy - Data scientists build and evolve their own platform within their repositories. Code, notebooks, and configurations live in version control from day one.
  • Managed kernels - Vertex AI Workbench provides Jupyter kernels that can be spun up, resized, and shut down as needed. No infrastructure management required.
  • Flexible access patterns - Data scientists can work in two ways:
    • Local IDE with remote kernel - Connect VS Code, PyCharm, or other IDEs to a Vertex kernel running in the cloud. Code executes remotely; data stays in GCP.
    • Jupyter interface - Access JupyterLab directly from the Vertex instance for a browser-based experience.
  • Compute sizing on demand - Easily scale kernel resources (CPU, memory, GPUs) based on workload requirements without provisioning new infrastructure.
  • Native GCP integration - Workload identity, networking, and access controls work seamlessly. No long-lived credentials needed.

Operational guardrails:

Data science kernels can be expensive if left running. The platform should include:

  • Idle detection and auto-shutdown - Configure kernels to shut down after a period of inactivity (e.g., 30 minutes with no execution). Vertex AI Workbench supports idle timeout settings.
  • Usage monitoring - Track kernel uptime and compute consumption per user and per project. Surface this in cost dashboards alongside other data platform costs.
  • Alerts for runaway kernels - Notify users and administrators when a kernel has been running continuously beyond a threshold (e.g., 24 hours).

The goal is to give data scientists the resources they need while protecting against the cost leakage that comes from forgotten instances.
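The guardrail logic above can be sketched as a plain decision function; the thresholds mirror the examples given, and the helper is hypothetical - Vertex AI Workbench supports idle timeouts natively, so this only illustrates the rule, not the API:

```python
from datetime import datetime, timedelta

IDLE_THRESHOLD = timedelta(minutes=30)   # no execution for 30 min -> shut down
RUNAWAY_THRESHOLD = timedelta(hours=24)  # continuous uptime beyond 24 h -> alert

def kernel_action(last_execution: datetime, started_at: datetime, now: datetime) -> str:
    """Decide what the guardrail should do with a kernel: 'shutdown', 'alert', or 'ok'."""
    if now - last_execution >= IDLE_THRESHOLD:
        return "shutdown"
    if now - started_at >= RUNAWAY_THRESHOLD:
        return "alert"
    return "ok"

now = datetime(2024, 1, 2, 12, 0)
print(kernel_action(last_execution=now - timedelta(minutes=45),
                    started_at=now - timedelta(hours=2), now=now))  # shutdown
```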

# Alternative: Hex

If Hex is adopted for reporting (see Reporting - Evaluation: Hex), it can serve as an alternative experimentation environment for lighter workloads:

  • Notebooks with collaboration - Code is saved, versioned, and shareable within the Hex platform.
  • BigQuery integration - Query governed datasets directly; access policies apply.
  • Lower barrier to entry - For analysts who need to do light data science work, Hex provides a gentler learning curve than Vertex.
  • Publication - Mature analyses can be published as reports or dashboards within the same tool.

Hex is appropriate for ad-hoc exploration, quick analyses, and work that does not require heavy compute. For production-oriented data science, model training, and workloads requiring GPUs or large-scale processing, Vertex AI Workbench remains the primary environment.

# Source Control as Foundation

Regardless of where code runs, GitHub must be the source of truth for all code that moves toward production:

  • Notebooks checked into repositories
  • Feature engineering code version-controlled
  • Model training scripts tracked with history
  • CI/CD pipelines for promotion

Local-only or notebook-only work is acceptable for early exploration. But once an asset is destined for production, it must live in version control. The recommendation for Vertex AI Workbench aligns naturally with this principle - data scientists already work from GitHub repositories, so the transition from exploration to production is seamless.

# Model Lifecycle Management

Models are not static artifacts. They degrade over time as data patterns shift. A governed platform requires processes for the full model lifecycle.

# Recommendation: Vertex AI

For model lifecycle management, we recommend Vertex AI - GCP's managed ML platform. Vertex AI provides:

| Capability | Description |
| --- | --- |
| Experiments | Track training runs, hyperparameters, and metrics in a structured way |
| Model Registry | Version and catalog models with metadata, lineage, and deployment history |
| Pipelines | Orchestrate training workflows with dependencies and reproducibility |
| Model Serving | Deploy models to endpoints with monitoring and traffic management |
| Model Monitoring | Detect drift, skew, and performance degradation in production |

Why Vertex AI over alternatives like MLflow:

  • Managed infrastructure - No clusters to maintain, no upgrades to manage. Aligns with the principle of minimizing operational overhead.
  • Native GCP integration - Identity, networking, and access controls integrate with existing infrastructure.
  • Workload identity - No long-lived credentials required for model training or serving.

For organizations that prefer open-source, MLflow is a solid choice. But at UME, where the engineering preference is to use native GCP tools and avoid infrastructure management, Vertex AI is the better fit.

# Lifecycle Stages

Adopt a clear model lifecycle with defined stages:

| Stage | Description | Governance |
| --- | --- | --- |
| Experimentation | Rapid iteration, exploring approaches | Minimal; tracked in Workbench or Hex |
| Development | Building a candidate model with proper engineering | Version-controlled; documented |
| Validation | Testing against holdout data, bias checks, stakeholder review | Reviewed by data steward; test results logged |
| Deployment | Serving predictions in production | Registered in Vertex AI; monitoring enabled |
| Monitoring | Ongoing observation of performance and drift | Alerts on degradation; retraining triggers |
| Retirement | Graceful deprecation when model is replaced | Documented; downstream consumers notified |

The Data Catalog should reflect model status - users should know whether a model is experimental, in production, or deprecated.
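The stage table can be encoded as a small state machine so tooling can reject invalid transitions (e.g., jumping from experimentation straight to deployment). The transition map below is a sketch of the stages as described, not an existing platform feature:

```python
from enum import Enum

class Stage(Enum):
    EXPERIMENTATION = "experimentation"
    DEVELOPMENT = "development"
    VALIDATION = "validation"
    DEPLOYMENT = "deployment"
    MONITORING = "monitoring"
    RETIREMENT = "retirement"

# Allowed transitions; validation failures and retraining loop back to development.
ALLOWED = {
    Stage.EXPERIMENTATION: {Stage.DEVELOPMENT},
    Stage.DEVELOPMENT: {Stage.VALIDATION},
    Stage.VALIDATION: {Stage.DEPLOYMENT, Stage.DEVELOPMENT},
    Stage.DEPLOYMENT: {Stage.MONITORING},
    Stage.MONITORING: {Stage.RETIREMENT, Stage.DEVELOPMENT},
    Stage.RETIREMENT: set(),
}

def can_transition(current: Stage, target: Stage) -> bool:
    """True if the lifecycle permits moving from `current` to `target`."""
    return target in ALLOWED[current]
```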

# SDLC for Data Science

Data science work must follow Software Development Life Cycle (SDLC) practices once it moves toward production. The challenge is applying these practices without killing the agility that makes experimentation valuable. The solution is a clear boundary: exploration can be informal, but production-bound work enters the SDLC workflow.

# Environment Separation

Like ETL workloads (see ETL - SDLC and Environments), data science work should flow through distinct environments:

| Environment | Purpose | Data Access |
| --- | --- | --- |
| Development | Experimentation, prototyping, model iteration | Sampled or anonymized data; PII masked |
| Staging | Integration testing, validation, UAT | Production-like data with PII masked; used for final validation before deployment |
| Production | Live model serving | Real data; no manual deployments - only via CI/CD |

Training typically happens in development or staging environments. Production environments are for serving predictions, not for training (except for online learning scenarios, which require additional safeguards).

# CI/CD Pipelines

Automate the path from code to deployed model:

On pull request:

  • Lint Python code (e.g., ruff, black)
  • Run unit tests
  • Run integration tests on sample data

On merge to main:

  • Deploy to staging environment
  • Trigger training pipeline on staging data
  • Run model validation tests
  • If validation passes, await manual approval or auto-promote

On promotion to production:

  • Register model in Vertex AI Model Registry
  • Deploy to Vertex AI endpoint
  • Enable monitoring
  • Notify stakeholders

The key principle: no manual deployments to production. All production changes flow through the pipeline.

# Model and Code Versioning

A trained model is a separate artifact from the code that produced it. Both must be versioned and linked:

| Artifact | Versioning Mechanism | What to Track |
| --- | --- | --- |
| Code | Git (commits, tags) | Feature logic, training scripts, pipeline definitions |
| Model | Vertex AI Model Registry | Trained weights, model metadata, performance metrics |
| Training data | Dataset snapshots, lineage | Which data version was used for training |
| Configuration | Git (alongside code) | Hyperparameters, feature lists, thresholds |
| Dependencies | requirements.txt or pyproject.toml | Pinned library versions for reproducibility |

When a model is registered, its metadata should include:

  • Git commit hash of the training code
  • Training dataset identifier (and lineage to source tables)
  • Hyperparameters used
  • Validation metrics at training time

This linkage enables reproducibility - given a model version, you can trace back to exactly what code, data, and configuration produced it.
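The linkage can be captured as a simple record at registration time. The field names below are illustrative, not the Vertex AI Model Registry schema:

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class ModelVersionMetadata:
    """Illustrative record linking a registered model back to its provenance."""
    model_name: str
    version: str
    git_commit: str            # commit hash of the training code
    training_dataset: str      # dataset identifier with lineage to source tables
    hyperparameters: dict = field(default_factory=dict)
    validation_metrics: dict = field(default_factory=dict)

# Invented example values - real ones come from the training pipeline.
meta = ModelVersionMetadata(
    model_name="churn-predictor",
    version="3",
    git_commit="a1b2c3d",
    training_dataset="gold.churn_features@2024-06-01",
    hyperparameters={"max_depth": 6, "learning_rate": 0.1},
    validation_metrics={"auc": 0.91},
)
```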

# Reproducibility Requirements

For a model to be considered production-ready, it must be reproducible:

  • Pinned dependencies - All Python packages have explicit versions
  • Seeded randomness - Random seeds are set and logged for any stochastic processes
  • Data snapshots - Training data is versioned or snapshotted; the exact dataset can be retrieved
  • Logged parameters - All hyperparameters and configuration are captured in Vertex AI Experiments or equivalent
  • Containerized training - Training runs in a defined container image, not an ad-hoc environment

If another data scientist cannot re-run the training and get the same (or statistically equivalent) model, it is not ready for production.
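Seeded randomness is the cheapest of these requirements to demonstrate: with the seed fixed and logged, two runs of a stochastic step produce identical results. A minimal stdlib sketch (real training code would also seed numpy, the ML framework, and any data shufflers):

```python
import random

def sample_rows(seed: int, population: range, k: int) -> list[int]:
    """A stand-in for any stochastic training step: sampling with a fixed seed."""
    rng = random.Random(seed)  # isolated generator; the seed is part of the run config
    return rng.sample(population, k)

run_a = sample_rows(seed=42, population=range(1000), k=5)
run_b = sample_rows(seed=42, population=range(1000), k=5)
assert run_a == run_b  # identical seed -> identical outcome, run after run
```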

# Promotion Gates

Define explicit criteria for promoting models between stages:

| Transition | Required Criteria |
| --- | --- |
| Dev > Staging | Code review approved; unit tests pass; training completes without errors |
| Staging > Production | Model validation tests pass (accuracy, AUC, bias metrics, etc.); integration tests pass; stakeholder sign-off (for business-critical models) |
| Production retirement | Replacement model deployed and stable; downstream consumers notified; deprecation period observed |
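Where the criteria are metric-based, the Staging > Production gate can be enforced mechanically. The thresholds below are placeholders to be set per model with stakeholders, not platform defaults:

```python
# Hypothetical promotion gate: compare validation metrics to per-model thresholds.
THRESHOLDS = {"accuracy": 0.85, "auc": 0.80}  # minimum acceptable values (placeholders)
MAX_BIAS_GAP = 0.05                           # max metric gap across segments (placeholder)

def passes_gate(metrics: dict[str, float], bias_gap: float) -> tuple[bool, list[str]]:
    """Return (passed, reasons-for-failure) for a staging -> production promotion."""
    failures = [f"{name} {metrics.get(name, 0.0):.3f} < {floor:.3f}"
                for name, floor in THRESHOLDS.items()
                if metrics.get(name, 0.0) < floor]
    if bias_gap > MAX_BIAS_GAP:
        failures.append(f"bias gap {bias_gap:.3f} > {MAX_BIAS_GAP:.3f}")
    return (not failures, failures)
```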

# Experimentation Escape Hatch

Not all data science work needs full SDLC. Early exploration is informal by design:

  • Notebooks can live outside the module structure during exploration
  • No PR required for experiments that won't go to production
  • Data access in development environments is permissive (within PII constraints)

The boundary: Once a model is identified as a production candidate, it enters the SDLC workflow. This means:

  1. Code is refactored into modules
  2. Tests are written
  3. A PR is opened
  4. The promotion process begins

The discipline is in recognizing when work crosses this boundary, and not letting "temporary" experiments become permanent production systems without going through the process.

# Feature Engineering

Features are the bridge between raw data and model inputs. Well-governed feature engineering prevents the proliferation of redundant, inconsistent features.

# Patterns for Feature Development

  1. Start in notebooks - Explore and prototype features in Workbench or Hex.
  2. Validate with SQL - Before committing to a feature, attempt to express it in SQL. If it can be expressed in SQL, it should eventually become a dbt model.
  3. Promote through ETL - Mature features that are used across multiple models should be promoted to the Gold layer through the standard ETL process. This makes them available to all consumers, not just the original modeler.
  4. Document in the catalog - Features - whether in dbt or in specialized feature pipelines - should appear in the Data Catalog with definitions, owners, and lineage.
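Step 2 can be as simple as reproducing the prototype feature in a scratch SQL engine before promoting it. The sketch below uses stdlib sqlite3 purely as a stand-in for BigQuery; the `orders` table and the spend feature are invented examples:

```python
import sqlite3

# Invented example: prototype a "total spend per customer" feature in SQL
# before promoting it to a dbt model. sqlite3 stands in for BigQuery here.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (customer_id TEXT, amount REAL);
    INSERT INTO orders VALUES ('c1', 10.0), ('c1', 15.0), ('c2', 7.5);
""")

feature_sql = """
    SELECT customer_id, SUM(amount) AS total_spend, COUNT(*) AS order_count
    FROM orders
    GROUP BY customer_id
    ORDER BY customer_id
"""
rows = conn.execute(feature_sql).fetchall()
print(rows)  # [('c1', 25.0, 2), ('c2', 7.5, 1)]
```

If the feature survives this translation, it is a candidate for a dbt model; if it cannot be expressed in SQL, it follows the complex-pipeline path below.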

# Complex Pipelines

When feature engineering genuinely requires Python or other non-SQL logic:

  • Build pipelines using orchestration patterns from ETL - Orchestration
  • Tag compute jobs for cost attribution
  • Emit lineage metadata that integrates with the data catalog
  • Follow the SDLC practices described in SDLC for Data Science (dev/staging/prod, code review, CI/CD)

The data engineering team should provide blueprints for common feature engineering patterns, just as they do for data ingestion. Data scientists use these blueprints rather than inventing custom infrastructure.

# Feature Store Considerations

A dedicated feature store (e.g., Vertex AI Feature Store) may be appropriate when:

  • Many models share common features
  • Features require low-latency serving
  • Feature lineage and versioning are critical

For initial efforts, the Gold layer in BigQuery can serve as a lightweight feature store. Evaluate dedicated feature store tooling as the number of production models grows.

# Governance and Compliance

Data science workloads must participate in the same governance framework as the rest of the platform.

# Lineage and Catalog Integration

All data science artifacts should appear in the Data Catalog:

| Artifact | Catalog Entry |
| --- | --- |
| Features | Documented with definition, owner, source lineage |
| Models | Registered with version, training data lineage, performance metrics |
| Notebooks | Linked to datasets consumed and outputs produced |
| Predictions | Tables of model outputs traced back to model version and input data |

This lineage enables impact analysis - when a source table changes, we can identify which features, models, and predictions are affected.

Vertex AI emits metadata that can be ingested into DataHub or similar catalog tools. The data team should configure this integration as part of the platform setup.

# PII and Sensitive Data

Data scientists often need access to detailed data for feature engineering. This creates PII exposure risk.

Controls:

  • Experimentation on masked data - Where possible, provide anonymized or synthetic datasets for experimentation. Reserve access to real PII for validated production use cases.
  • Access policies - Apply the same access controls to experimentation environments as to other data consumers. Data scientists should not have broader access than their role requires.
  • PII tagging - Data scientists who create features or model outputs containing sensitive information must tag them appropriately. The Data Catalog should reflect PII status.
  • Audit trails - Queries from experimentation environments should be logged with user identity for audit purposes.

The principle from ETL - PII and Sensitive Data Handling applies: identify PII early, mask or hash before it reaches broad audiences, and design for data deletion requests.
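The "mask or hash before it reaches broad audiences" principle can be illustrated with a salted keyed hash: joins on the hashed key still work, but the raw value never leaves the governed layer. The salt handling below is a sketch; production masking would use a managed secret and the platform's DLP tooling:

```python
import hashlib
import hmac

def pseudonymize(value: str, salt: bytes) -> str:
    """Deterministically hash a PII value so it can still serve as a join key."""
    return hmac.new(salt, value.encode("utf-8"), hashlib.sha256).hexdigest()

salt = b"example-salt"  # in production: a managed secret, rotated per policy
a = pseudonymize("alice@example.com", salt)
b = pseudonymize("alice@example.com", salt)
assert a == b            # deterministic: joins across tables still work
assert "alice" not in a  # the raw value does not appear in the token
```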

# Security

  • Workload identity - Model training and serving jobs use workload identity, not service account keys.
  • No long-lived credentials - Notebooks and pipelines access data through identity federation, not stored secrets.
  • Least-privilege access - Experimentation environments access only the datasets required for the work at hand.

# Governance Recommendations

  • Consume Gold-layer data - Train models on governed, validated datasets, not raw Bronze data. When speed is necessary, bring the supporting data into Gold as soon as possible - before a model can be called production.
  • Align feature engineering with ETL - Use SQL where possible; complex pipelines follow ETL patterns and blueprints.
  • Version control everything - GitHub is the source of truth for all code destined for production.
  • Use Vertex AI for model lifecycle - Managed infrastructure, native GCP integration, no operational overhead.
  • Register models and features in the catalog - Lineage flows from source data through features to models to predictions.
  • Tag PII appropriately - Data scientists are responsible for identifying and tagging sensitive data in artifacts they create.
  • Use Vertex AI Workbench for experimentation - Data scientists work from their GitHub repositories with Vertex kernels for compute. Configure idle timeout and cost monitoring to prevent runaway costs.
  • Apply SDLC to production workloads - Models and features that run in production follow dev/staging/prod workflow, with code review, automated testing, and CI/CD deployment.
  • Extract logic into modules - Production code lives in .py modules, not notebooks. Notebooks orchestrate; modules contain logic. This enables proper code review, testing, and reuse.

# Tasks

  • Configure Vertex AI project and enable Experiments, Model Registry, and Pipelines
  • Establish integration between Vertex AI metadata and the Data Catalog
  • Create blueprint for feature engineering pipelines that emit catalog-compatible lineage
  • Define PII access policy for experimentation environments
  • Document model lifecycle stages and criteria for promotion between stages
  • Configure Vertex AI Workbench with idle timeout policies and cost monitoring alerts
  • If Hex is adopted for reporting, evaluate for lightweight data science use cases
  • Set up CI/CD pipeline template for model training and deployment (linting, testing, validation gates)
  • Define model validation test requirements (accuracy thresholds, bias metrics, data quality checks)
  • Document promotion criteria for dev > staging > production transitions
  • Create onboarding guide for data scientists: how to access governed data, how to register models, how to tag PII, SDLC workflow