# Data Science
The Data Science layer provides infrastructure and tooling for machine learning workflows. It emphasizes reproducibility, governance, and integration with the broader data platform.
## Platform Capabilities
### Ad-Hoc Notebooks
Interactive development environment for data scientists:
Features:
- Jupyter-compatible notebooks
- Connection to data platform sources
- Compute resources for experimentation
- Collaborative editing capabilities
GitHub Integration:
- Code stored in version control
- Pull request workflows for review
- CI/CD for notebook validation
- Single source of truth for code
Best Practices:
- Notebooks for exploration, scripts for production
- Parameterize notebooks for reusability (see the sketch after this list)
- Document assumptions and findings
- Regular commits with meaningful messages
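As a concrete illustration of the parameterization practice above, here is a minimal sketch using papermill (one common tool for executing notebooks with parameters; it is not mandated by the platform, and the notebook path and parameter names are purely illustrative):

```python
# Minimal sketch: execute a notebook with parameters via papermill.
# The paths and parameter names below are illustrative, not platform-specific.
import papermill as pm

pm.execute_notebook(
    "notebooks/churn_exploration.ipynb",       # source notebook with a "parameters" cell
    "output/churn_exploration_2024_q1.ipynb",  # executed copy kept as an artifact
    parameters={
        "start_date": "2024-01-01",
        "end_date": "2024-03-31",
        "sample_fraction": 0.1,
    },
)
```

Keeping the executed copy as an output artifact also gives a lightweight record of exactly which parameters a run used.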
### Experiment Tracking
Record and compare model experiments (a minimal logging sketch follows the lists below):
Tracked metadata:
- Model parameters and hyperparameters
- Training data version
- Performance metrics
- Artifacts (models, plots, reports)
Benefits:
- Reproducibility of results
- Comparison across experiments
- Audit trail for model decisions
- Team visibility into work
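To make the tracked metadata above concrete, here is a minimal logging sketch using MLflow as an example tracker (the platform's actual tracking tool is not specified here; the experiment, parameter, and metric names and values are illustrative placeholders):

```python
# Sketch of experiment logging with MLflow (one widely used tracker;
# names and values below are illustrative placeholders).
import mlflow

mlflow.set_experiment("churn-model")

with mlflow.start_run(run_name="baseline-logreg"):
    mlflow.log_param("model_type", "logistic_regression")
    mlflow.log_param("training_data_version", "silver.customers@2024-03-31")
    mlflow.log_param("C", 1.0)

    # ... train and evaluate the model here, then log the real metrics ...

    mlflow.log_metric("auc", 0.87)
    mlflow.log_metric("precision_at_10pct", 0.42)
    mlflow.log_artifact("reports/feature_importance.png")
```

Logging the training data version as a parameter ties each run back to a specific dataset snapshot, which is what makes results reproducible and auditable later.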
### Model Lifecycle Management
Structured approach from concept to retirement. The stages (Conception, Development, Deployment, Observability, Retraining, Sunsetting) are detailed in the Model Lifecycle section below.
### Blueprints for Business Reporting
Templates for common DS outputs:
- Forecasting reports
- Anomaly detection dashboards
- Segmentation analyses
- Model performance reports
## Model Lifecycle
### Conception
Before starting development:
- Problem definition: Clear business objective
- Success criteria: How will we measure success?
- Data assessment: Is required data available?
- Feasibility: Is ML the right approach?
### Development
Building and validating models:
Data Preparation:
- Use platform data sources (Silver/Gold layers)
- Document data transformations
- Version training datasets
- Handle data types consistently
Training:
- Reproducible training pipelines
- Hyperparameter tuning
- Cross-validation (see the training sketch after these lists)
- Experiment logging
Validation:
- Hold-out test sets
- Business metric validation
- Bias and fairness checks
- Edge case testing
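The following sketch pulls several of the training and validation practices together: a fixed seed, a hold-out test set that tuning never touches, and cross-validation on the training split only. It assumes scikit-learn is available; `load_training_frame` is a hypothetical helper standing in for the platform's data access.

```python
# Sketch of a reproducible training step with a hold-out set and
# cross-validation. load_training_frame is a hypothetical helper.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

SEED = 42  # fixed seed so runs are repeatable

X, y = load_training_frame()  # hypothetical: returns features and labels

# Hold out a test set that is never used during tuning
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=SEED, stratify=y
)

model = LogisticRegression(max_iter=1000, random_state=SEED)

# Cross-validate on the training split only
cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring="roc_auc")
print(f"CV AUC: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")

model.fit(X_train, y_train)
```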
### Deployment
Moving models to production:
Patterns:
- Batch inference (scheduled predictions)
- Real-time inference (API serving; sketched after this list)
- Embedded models (in-database scoring)
Requirements:
- Version-controlled model artifacts
- Documented dependencies
- Health checks and monitoring
- Rollback capability
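For the real-time pattern, a serving sketch might look like the following, including the health check that monitoring can probe. FastAPI and joblib are assumptions rather than platform requirements, and the model path, version, and feature names are illustrative.

```python
# Sketch of a real-time inference service with a health check endpoint.
# FastAPI/joblib are assumptions; paths and feature names are illustrative.
import joblib
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = joblib.load("artifacts/churn_model_v3.joblib")  # version-pinned artifact


class PredictionRequest(BaseModel):
    tenure_months: float
    monthly_spend: float


@app.get("/health")
def health() -> dict:
    # Lightweight liveness check for platform monitoring
    return {"status": "ok", "model_version": "v3"}


@app.post("/predict")
def predict(req: PredictionRequest) -> dict:
    score = model.predict_proba([[req.tenure_months, req.monthly_spend]])[0][1]
    return {"churn_probability": float(score)}
```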
### Observability
Monitoring production models:
Metrics to track:
- Prediction latency
- Error rates
- Input data distribution
- Output distribution
- Business outcome correlation
Drift detection:
- Data drift (input distribution changes; see the sketch below)
- Concept drift (relationship changes)
- Performance degradation alerts
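A minimal data-drift check can compare a feature's recent distribution against its training baseline, for example with a two-sample Kolmogorov-Smirnov test (scipy is assumed; the feature name, threshold, and placeholder data are illustrative):

```python
# Sketch of a simple data-drift check using a two-sample KS test.
# Threshold, feature name, and placeholder data are illustrative.
import numpy as np
from scipy.stats import ks_2samp


def check_feature_drift(baseline: np.ndarray, recent: np.ndarray,
                        p_threshold: float = 0.01) -> bool:
    """Return True if the feature distribution appears to have drifted."""
    result = ks_2samp(baseline, recent)
    return result.pvalue < p_threshold


# Example: baseline from the training set, recent from production logs
baseline_spend = np.random.normal(50, 10, size=5_000)  # placeholder data
recent_spend = np.random.normal(65, 12, size=1_000)    # placeholder data
if check_feature_drift(baseline_spend, recent_spend):
    print("Data drift detected for monthly_spend; consider retraining")
```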
### Retraining
Keeping models current:
Triggers:
- Scheduled (e.g., monthly)
- Performance-based (degradation detected)
- Data-based (significant new data available)
Process:
- Automated pipeline execution
- Validation against current model
- Champion/challenger comparison (see the sketch after this list)
- Approval for production update
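The champion/challenger comparison can be as simple as requiring the retrained model to beat the current production model on the same validation set by a small margin before it is approved. A sketch, where the metric choice and margin are illustrative:

```python
# Sketch of a champion/challenger gate: promote the retrained model only
# if it beats the production model on the same validation data.
from sklearn.metrics import roc_auc_score

PROMOTION_MARGIN = 0.005  # challenger must improve AUC by at least this much


def should_promote(champion, challenger, X_val, y_val) -> bool:
    champion_auc = roc_auc_score(y_val, champion.predict_proba(X_val)[:, 1])
    challenger_auc = roc_auc_score(y_val, challenger.predict_proba(X_val)[:, 1])
    return challenger_auc >= champion_auc + PROMOTION_MARGIN
```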
### Sunsetting
Retiring models responsibly:
- Identify replacement or alternative
- Notify stakeholders
- Deprecation period with warnings
- Remove from production
- Archive artifacts
## Feature Store
Centralized feature management (future capability):
### Concept
A feature store provides:
- Reusable features: Define once, use in many models
- Consistency: Same features in training and serving
- Discovery: Find existing features before creating new ones
- Lineage: Track feature origins and usage
### Benefits
- Reduce duplicate feature engineering
- Faster model development
- Consistent training/serving features
- Feature documentation and discovery
## Data Preparation Best Practices
### Data Types
Handle data types consistently (a validation sketch follows this list):
- Define schemas for input data
- Validate types before processing
- Handle missing values explicitly
- Document type assumptions
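A minimal sketch of explicit schema and missing-value handling with pandas (the column names and expected types are illustrative, not a platform schema):

```python
# Sketch of explicit type and missing-value handling with pandas.
# Column names and dtypes are illustrative.
import pandas as pd

EXPECTED_DTYPES = {
    "customer_id": "string",
    "signup_date": "datetime64[ns]",
    "monthly_spend": "float64",
}


def validate_input(df: pd.DataFrame) -> pd.DataFrame:
    # Fail fast if documented columns are missing
    missing = set(EXPECTED_DTYPES) - set(df.columns)
    if missing:
        raise ValueError(f"Missing columns: {missing}")

    # Cast to the documented types instead of relying on inference
    df = df.astype({"customer_id": "string", "monthly_spend": "float64"})
    df["signup_date"] = pd.to_datetime(df["signup_date"])

    # Handle missing values explicitly rather than silently
    if df["monthly_spend"].isna().any():
        df["monthly_spend"] = df["monthly_spend"].fillna(0.0)
    return df
```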
### Code Reuse
Avoid duplicating logic:
- Shared utility libraries
- Common preprocessing functions
- Centralized configuration
- Package management
### Data Pipeline
Reproducible data preparation (a pipeline sketch follows this list):
- Version data transformations
- Parameterize date ranges
- Log data statistics
- Test transformation logic
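A sketch of a parameterized, logged preparation step (the table name, columns, and `read_table` helper are hypothetical placeholders for the platform's own readers):

```python
# Sketch of a parameterized data-preparation step that logs basic
# statistics. read_table and the column names are hypothetical.
import logging

import pandas as pd

logger = logging.getLogger(__name__)


def prepare_training_data(start_date: str, end_date: str) -> pd.DataFrame:
    df = read_table("gold.customer_activity")  # hypothetical platform reader
    df = df[(df["event_date"] >= start_date) & (df["event_date"] <= end_date)]

    # Log basic statistics so each run can be compared and audited
    logger.info("rows=%d, date_range=%s..%s, null_spend=%d",
                len(df), start_date, end_date, df["monthly_spend"].isna().sum())
    return df
```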
### Repeatability
Ensure experiments can be reproduced (a seeding sketch follows this list):
- Seed random number generators
- Version all dependencies
- Document environment setup
- Use containerization
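Seeding usually looks like a small helper called once at the start of every run (extend it with framework-specific seeds, e.g. for PyTorch, if those libraries are used):

```python
# Sketch: seed the common random number generators at the start of a run.
import os
import random

import numpy as np


def set_seed(seed: int = 42) -> None:
    os.environ["PYTHONHASHSEED"] = str(seed)  # applies to subprocesses
    random.seed(seed)
    np.random.seed(seed)


set_seed(42)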
### Data Testing
Validate data quality (an example test follows this list):
- Input validation
- Assertion checks
- Statistical tests
- Integration tests
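Transformation logic can be covered with ordinary unit tests. A pytest-style sketch, where `clean_transactions` is a hypothetical function standing in for real preprocessing code:

```python
# Sketch of a pytest-style test for a preprocessing function.
# clean_transactions is a hypothetical transformation under test.
import pandas as pd


def clean_transactions(df: pd.DataFrame) -> pd.DataFrame:
    # Hypothetical transformation: drop rows with negative amounts
    return df[df["amount"] >= 0].reset_index(drop=True)


def test_clean_transactions_drops_negative_amounts():
    raw = pd.DataFrame({"amount": [10.0, -5.0, 3.5],
                        "currency": ["EUR", "EUR", "EUR"]})
    cleaned = clean_transactions(raw)

    # Input validation: negative amounts removed, schema preserved
    assert (cleaned["amount"] >= 0).all()
    assert list(cleaned.columns) == ["amount", "currency"]
    assert len(cleaned) == 2
```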
## Lineage and Auditability
### Model Lineage
Track the complete model history:
- Training data sources
- Feature transformations
- Model code version
- Training parameters
- Validation results
### Audit Requirements
Support regulatory and compliance needs:
- Explain model decisions
- Document model assumptions
- Track model versions in production
- Maintain prediction logs
## Governance Integration
### Data Access
Data science workflows respect platform governance:
- Access only authorized datasets
- Follow PII handling policies
- Log data access for audit
- Use approved data sources
### Catalog Integration
Register data science artifacts:
- Document models in catalog
- Link to training data
- Describe intended use
- Assign ownership
## Current Tools
## Best Practices Summary
### Do
- Version everything (code, data, models)
- Document assumptions and decisions
- Test data and model logic
- Monitor production models
- Collaborate through GitHub
### Don't
- Deploy unversioned models
- Skip validation steps
- Ignore production monitoring
- Create data silos
- Duplicate feature logic
## Related Sections
- Data Catalog - Registering models and features
- ETL - Data preparation pipelines
- Reporting - Sharing DS insights