# Data Science

The Data Science layer provides infrastructure and tooling for machine learning workflows. It emphasizes reproducibility, governance, and integration with the broader data platform.

## Platform Capabilities

### Ad-Hoc Notebooks

Interactive development environment for data scientists:

Features:

  • Jupyter-compatible notebooks
  • Connection to data platform sources
  • Compute resources for experimentation
  • Collaborative editing capabilities

GitHub Integration:

  • Code stored in version control
  • Pull request workflows for review
  • CI/CD for notebook validation
  • Single source of truth for code

Best Practices:

  • Notebooks for exploration, scripts for production
  • Parameterize notebooks for reusability (see the sketch after this list)
  • Document assumptions and findings
  • Regular commits with meaningful messages
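
As an example, a parameterized notebook can be executed headlessly with a tool such as papermill (an illustrative choice, not a platform mandate); the notebook path and parameter names below are hypothetical.

```python
# Run a parameterized notebook headlessly with papermill (illustrative tool choice).
# "analysis.ipynb" and its parameters are hypothetical; the notebook needs a cell
# tagged "parameters" for papermill to inject these values.
import papermill as pm

pm.execute_notebook(
    "analysis.ipynb",                 # source notebook under version control
    "output/analysis_2024_q1.ipynb",  # executed copy with results baked in
    parameters={"start_date": "2024-01-01", "end_date": "2024-03-31"},
)
```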

### Experiment Tracking

Record and compare model experiments:

Tracked metadata:

  • Model parameters and hyperparameters
  • Training data version
  • Performance metrics
  • Artifacts (models, plots, reports)

Benefits:

  • Reproducibility of results
  • Comparison across experiments
  • Audit trail for model decisions
  • Team visibility into work
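
A minimal sketch of what logging this metadata could look like, assuming MLflow as the tracking backend (the section does not name a specific tool); the experiment name, parameters, and artifact path are illustrative.

```python
# Log parameters, metrics, and artifacts for one training run (MLflow assumed).
import mlflow

mlflow.set_experiment("churn-model")  # hypothetical experiment name

with mlflow.start_run():
    mlflow.log_param("max_depth", 8)                                  # hyperparameters
    mlflow.log_param("training_data_version", "silver.customers@2024-03-01")
    mlflow.log_metric("auc", 0.87)                                    # performance metrics
    mlflow.log_artifact("reports/roc_curve.png")                      # plots, reports, model files
```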

### Model Lifecycle Management

Structured approach from concept to retirement:

Stages:

| Stage | Activities |
| --- | --- |
| Conception | Problem definition, feasibility assessment |
| Development | Feature engineering, model training, validation |
| Deployment | Productionization, serving infrastructure |
| Observability | Monitoring, drift detection, performance tracking |
| Retraining | Scheduled or triggered model updates |
| Sunsetting | Deprecation and removal |

### Blueprints for Business Reporting

Templates for common data science outputs:

  • Forecasting reports
  • Anomaly detection dashboards
  • Segmentation analyses
  • Model performance reports

## Model Lifecycle

### Conception

Before starting development:

  1. Problem definition: Clear business objective
  2. Success criteria: How will we measure success?
  3. Data assessment: Is required data available?
  4. Feasibility: Is ML the right approach?

### Development

Building and validating models:

Data Preparation:

  • Use platform data sources (Silver/Gold layers)
  • Document data transformations
  • Version training datasets
  • Handle data types consistently

Training:

  • Reproducible training pipelines
  • Hyperparameter tuning
  • Cross-validation
  • Experiment logging

Validation:

  • Hold-out test sets
  • Business metric validation
  • Bias and fairness checks
  • Edge case testing
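
A condensed sketch of a reproducible training and validation step, assuming scikit-learn and a synthetic placeholder dataset; the hyperparameters and metrics are illustrative only.

```python
# Reproducible training with a fixed seed, cross-validation, and a hold-out test set.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score, train_test_split

SEED = 42  # fixed seed so runs are repeatable

X, y = make_classification(n_samples=5_000, random_state=SEED)  # placeholder data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=SEED, stratify=y
)

model = RandomForestClassifier(n_estimators=200, random_state=SEED)
cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring="roc_auc")

model.fit(X_train, y_train)
test_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"cv_auc={cv_scores.mean():.3f} test_auc={test_auc:.3f}")
```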

### Deployment

Moving models to production:

Patterns:

  • Batch inference (scheduled predictions)
  • Real-time inference (API serving)
  • Embedded models (in-database scoring)

Requirements:

  • Version-controlled model artifacts
  • Documented dependencies
  • Health checks and monitoring
  • Rollback capability
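
For the batch pattern, a scheduled job might look roughly like the sketch below; the model format (joblib), paths, feature columns, and version tag are all assumptions.

```python
# Batch inference: load a version-pinned model artifact, score a partition, write results.
import joblib
import pandas as pd

MODEL_VERSION = "churn-model/1.4.0"                                   # hypothetical version tag
FEATURE_COLS = ["tenure_months", "monthly_spend", "support_tickets"]  # placeholder features

# Load the version-controlled model artifact for this release.
model = joblib.load(f"models/{MODEL_VERSION}/model.joblib")

# Score one daily partition of platform data and write predictions back.
features = pd.read_parquet("data/gold/customer_features/2024-03-01/")
features["churn_score"] = model.predict_proba(features[FEATURE_COLS])[:, 1]
features[["customer_id", "churn_score"]].to_parquet("data/predictions/churn/2024-03-01.parquet")
```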

### Observability

Monitoring production models:

Metrics to track:

  • Prediction latency
  • Error rates
  • Input data distribution
  • Output distribution
  • Business outcome correlation

Drift detection:

  • Data drift (input distribution changes)
  • Concept drift (relationship changes)
  • Performance degradation alerts
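
As one simple illustration of data drift detection, a two-sample Kolmogorov-Smirnov test can compare a feature's training distribution against recent production inputs; the significance threshold here is arbitrary, not a platform standard.

```python
# Flag possible data drift on a numeric feature with a two-sample KS test.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
train_values = rng.normal(loc=0.0, scale=1.0, size=10_000)   # training distribution
recent_values = rng.normal(loc=0.4, scale=1.0, size=2_000)   # recent production inputs (shifted)

statistic, p_value = ks_2samp(train_values, recent_values)
if p_value < 0.01:  # illustrative significance threshold
    print(f"Possible data drift: KS={statistic:.3f}, p={p_value:.2e}")
```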

### Retraining

Keeping models current:

Triggers:

  • Scheduled (e.g., monthly)
  • Performance-based (degradation detected)
  • Data-based (significant new data available)

Process:

  • Automated pipeline execution
  • Validation against current model
  • Champion/challenger comparison
  • Approval for production update
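
A minimal sketch of the champion/challenger gate, assuming both models are scored on the same validation set; the metric (AUC), margin, and scores are placeholders.

```python
# Champion/challenger gate for retraining: promote only on measurable improvement.
def should_promote(champion_auc: float, challenger_auc: float, margin: float = 0.005) -> bool:
    """Return True if the retrained (challenger) model beats the current (champion)
    model by at least `margin` on the shared validation set."""
    return challenger_auc >= champion_auc + margin

# Hypothetical scores from evaluating both models on the same validation data.
if should_promote(champion_auc=0.871, challenger_auc=0.884):
    print("Challenger approved for production update (pending human sign-off).")
else:
    print("Keeping current champion model.")
```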

### Sunsetting

Retiring models responsibly:

  1. Identify replacement or alternative
  2. Notify stakeholders
  3. Deprecation period with warnings
  4. Remove from production
  5. Archive artifacts

## Feature Store

Centralized feature management (future capability):

### Concept

A feature store provides:

  • Reusable features: Define once, use in many models
  • Consistency: Same features in training and serving
  • Discovery: Find existing features before creating new ones
  • Lineage: Track feature origins and usage

### Benefits

  1. Reduce duplicate feature engineering
  2. Faster model development
  3. Consistent training/serving features
  4. Feature documentation and discovery
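
Because the capability is still planned, the sketch below only illustrates the "define once, use in many models" idea with a toy in-memory registry; it is not a proposed design, and all names are hypothetical.

```python
# Toy illustration of "define once, reuse everywhere": a registry of named feature
# definitions that both training and serving code would resolve by name.
import pandas as pd

FEATURE_REGISTRY = {}  # feature name -> function computing it from raw data


def feature(name):
    """Register a feature definition under a stable, discoverable name."""
    def decorator(fn):
        FEATURE_REGISTRY[name] = fn
        return fn
    return decorator


@feature("days_since_last_order")
def days_since_last_order(orders: pd.DataFrame, as_of: pd.Timestamp) -> pd.Series:
    # Hypothetical feature over a raw orders table keyed by customer_id.
    return (as_of - orders.groupby("customer_id")["order_date"].max()).dt.days


# Training and serving both call the same definition, keeping features consistent.
compute_feature = FEATURE_REGISTRY["days_since_last_order"]
```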

## Data Preparation Best Practices

### Data Types

Handle data types consistently:

  • Define schemas for input data
  • Validate types before processing
  • Handle missing values explicitly
  • Document type assumptions
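
A small sketch of what explicit schema and missing-value handling can look like with plain pandas; the column names, types, and fill rule are placeholders.

```python
# Enforce an explicit input schema and missing-value policy before processing.
import pandas as pd

REQUIRED_COLUMNS = {"customer_id", "signup_date", "monthly_spend"}  # placeholder schema


def validate_input(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the documented type assumptions and missing-value rules up front."""
    missing = REQUIRED_COLUMNS - set(df.columns)
    if missing:
        raise ValueError(f"Missing required columns: {sorted(missing)}")

    df = df.copy()
    df["signup_date"] = pd.to_datetime(df["signup_date"])            # fail loudly on bad dates
    df = df.astype({"customer_id": "int64", "monthly_spend": "float64"})
    df["monthly_spend"] = df["monthly_spend"].fillna(0.0)            # explicit missing-value rule
    return df
```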

### Code Reuse

Avoid duplicating logic:

  • Shared utility libraries
  • Common preprocessing functions
  • Centralized configuration
  • Package management

### Data Pipeline

Reproducible data preparation:

  • Version data transformations
  • Parameterize date ranges
  • Log data statistics
  • Test transformation logic

### Repeatability

Ensure experiments can be reproduced:

  • Seed random number generators
  • Version all dependencies
  • Document environment setup
  • Use containerization
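
A common starting point is a helper that seeds every random number generator the project uses; extend it to whichever frameworks are actually in play (the list below is an assumption).

```python
# Seed all random number generators used in the project for repeatable runs.
import os
import random

import numpy as np


def seed_everything(seed: int = 42) -> None:
    random.seed(seed)                          # Python's built-in RNG
    np.random.seed(seed)                       # NumPy legacy global RNG
    os.environ["PYTHONHASHSEED"] = str(seed)   # only affects subprocesses started after this


seed_everything(42)
```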

### Data Testing

Validate data quality:

  • Input validation
  • Assertion checks
  • Statistical tests
  • Integration tests
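
These checks can be as lightweight as assertions that run in CI or at pipeline start; the columns and bounds below are placeholders.

```python
# Lightweight data quality checks, suitable for running in CI or at pipeline start.
import pandas as pd


def test_customer_features(df: pd.DataFrame) -> None:
    # Input validation / assertion checks
    assert df["customer_id"].is_unique, "duplicate customer_id values"
    assert df["monthly_spend"].ge(0).all(), "negative monthly_spend values"
    assert df["signup_date"].notna().all(), "missing signup_date values"

    # Simple statistical sanity check: average spend should stay in a plausible band
    mean_spend = df["monthly_spend"].mean()
    assert 5.0 <= mean_spend <= 500.0, f"mean monthly_spend out of range: {mean_spend:.2f}"
```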

## Lineage and Auditability

### Model Lineage

Track the complete model history:

  • Training data sources
  • Feature transformations
  • Model code version
  • Training parameters
  • Validation results

### Audit Requirements

Support regulatory and compliance needs:

  • Explain model decisions
  • Document model assumptions
  • Track model versions in production
  • Maintain prediction logs

## Governance Integration

### Data Access

Data science workflows respect platform governance:

  • Access only authorized datasets
  • Follow PII handling policies
  • Log data access for audit
  • Use approved data sources

### Catalog Integration

Register data science artifacts:

  • Document models in catalog
  • Link to training data
  • Describe intended use
  • Assign ownership

## Current Tools

| Tool | Use Case | Notes |
| --- | --- | --- |
| Jupyter Notebooks | Exploration | Local and cloud-based |
| Streamlit | Dashboards | Data Science and Credit teams |
| GitHub | Version control | Code and notebooks |
| Vertex AI | ML infrastructure | Future expansion |

## Best Practices Summary

### Do

  • Version everything (code, data, models)
  • Document assumptions and decisions
  • Test data and model logic
  • Monitor production models
  • Collaborate through GitHub

### Don't

  • Deploy unversioned models
  • Skip validation steps
  • Ignore production monitoring
  • Create data silos
  • Duplicate feature logic

## Related Sections