# Data Sources

The data platform ingests data from multiple source systems across the organization. To ensure consistent, governed, and cost-effective data onboarding, we define blueprints for each category of data source.

## Source System Categories

### Transactional Databases (OLTP)

Core business systems including back-office applications, point-of-sale (POS) systems, and partner integrations. These systems generate the primary operational data for the business.

  • Current state: 19 transactional databases
  • Ingestion method: CDC (Change Data Capture) running at 15-minute intervals
  • Target: Bronze layer in object storage

### API Data

External service integrations such as Infobip (communications), partner APIs, and third-party data providers.

### Mobile Applications

User-facing mobile apps generate behavioral data, transaction events, and user interaction logs.

### Operational Logs

Application logs, system metrics, and audit trails from various services across the infrastructure.

## Data Onboarding Blueprints

Blueprints provide standardized patterns for onboarding new data sources. Each blueprint defines:

  • Ingestion method: How data is captured and moved
  • Schema handling: Validation, evolution, and documentation
  • Quality checks: Automated tests and assertions
  • Lineage metadata: Source and transformation tracking
  • Cost considerations: Storage and compute optimization
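To make these facets concrete, the sketch below models a blueprint as a plain Python dataclass. This is illustrative only: the class name, fields, and example values are assumptions rather than part of the platform.

```python
from dataclasses import dataclass, field


@dataclass
class SourceBlueprint:
    """Illustrative blueprint record; not an actual platform API."""
    name: str
    ingestion_method: str                # "cdc", "batch", or "streaming"
    schema_contract: dict                # expected columns/types and evolution policy
    quality_checks: list = field(default_factory=list)   # freshness/completeness assertions
    lineage_tags: dict = field(default_factory=dict)     # source system and transformation tracking
    cost_notes: str = ""                 # storage and compute considerations


# Hypothetical registration for a POS orders table onboarded via CDC
pos_orders = SourceBlueprint(
    name="pos_orders",
    ingestion_method="cdc",
    schema_contract={"order_id": "INT64", "amount": "NUMERIC", "updated_at": "TIMESTAMP"},
    quality_checks=["freshness < 30 minutes", "order_id is not null"],
    lineage_tags={"source_system": "pos", "target": "bronze"},
    cost_notes="partition the Bronze table by ingestion date",
)
```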

### DBMS Data Blueprint

For relational database sources using CDC:

  1. Connection setup: Secure connection with appropriate credentials
  2. Table selection: Define which tables to replicate
  3. CDC configuration: Set up change capture with appropriate frequency
  4. Schema registration: Document schema in data catalog
  5. Quality gates: Define freshness and completeness checks
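A minimal sketch of what a completed DBMS blueprint might record, expressed as a Python configuration dict keyed by the five steps above. Hostnames, secret names, and configuration keys are assumptions; the actual configuration depends on the CDC tooling in use.

```python
# Hypothetical CDC onboarding configuration; keys and values are illustrative only.
cdc_blueprint = {
    "connection": {                                    # step 1: secure connection setup
        "type": "mysql",
        "host": "pos-db.internal",                     # assumed hostname
        "credentials_secret": "pos-db-readonly",       # resolved from a secret manager at runtime
    },
    "tables": ["orders", "order_items", "customers"],  # step 2: table selection
    "cdc": {                                           # step 3: change capture settings
        "interval_minutes": 15,                        # matches the platform's 15-minute cycle
        "primary_keys": {"orders": ["order_id"]},
    },
    "catalog": {                                       # step 4: schema registration
        "dataset": "bronze_pos",
        "register_schema": True,
    },
    "quality": {                                       # step 5: freshness and completeness gates
        "max_staleness_minutes": 30,
        "required_columns": {"orders": ["order_id", "updated_at"]},
    },
}
```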

### API Data Blueprint

For REST APIs, webhooks, and event streams:

  1. Authentication: Secure credential management
  2. Rate limiting: Respect source system constraints
  3. Pagination handling: Complete data extraction patterns
  4. Error handling: Retry logic and failure notifications
  5. Schema inference: Document and validate response structures
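The sketch below ties the five steps together in Python using the `requests` library. The endpoint, token handling, and response shape (a JSON array per page) are assumptions for illustration; real integrations such as Infobip differ in their pagination and authentication details.

```python
import os
import time

import requests

BASE_URL = "https://api.example-partner.com/v1/events"    # hypothetical endpoint
API_TOKEN = os.environ.get("PARTNER_API_TOKEN", "")       # step 1: loaded from the environment, never hard-coded


def fetch_all_pages(page_size: int = 100, max_retries: int = 3) -> list:
    """Extract every page with basic retry and rate-limit handling (illustrative)."""
    headers = {"Authorization": f"Bearer {API_TOKEN}"}
    page, results = 1, []
    while True:
        for attempt in range(max_retries):                        # step 4: retry on transient failures
            resp = requests.get(
                BASE_URL,
                headers=headers,
                params={"page": page, "per_page": page_size},     # step 3: pagination
                timeout=30,
            )
            if resp.status_code == 429 or resp.status_code >= 500:
                time.sleep(int(resp.headers.get("Retry-After", 5)))  # step 2: back off when throttled
                continue
            resp.raise_for_status()
            break
        else:
            raise RuntimeError(f"Page {page} failed after {max_retries} attempts")
        batch = resp.json()                                       # step 5: validate/document the structure downstream
        if not batch:
            break
        results.extend(batch)
        page += 1
    return results
```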

### Log Data Blueprint

For streaming log data (e.g., application logs to GCS):

  1. Format standardization: Define expected log structure (JSON, structured text)
  2. Partitioning strategy: Time-based organization for efficient querying
  3. Retention policies: Define lifecycle based on compliance requirements
  4. Cost controls: Avoid expensive patterns such as single-column BigQuery tables built over raw JSON, which force a full scan on every query
  5. Aggregation: Define summary tables for common query patterns
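A small Python sketch of the format and partitioning steps: log records are standardized as compressed newline-delimited JSON and written under time-partitioned object paths so queries can prune by service, date, and hour. The bucket name and path layout are assumptions.

```python
import gzip
import json
from datetime import datetime, timezone

BUCKET = "example-app-logs"  # hypothetical GCS bucket


def partitioned_log_path(service: str, ts: datetime) -> str:
    """Time-partitioned object path (step 2) so queries can prune by service, date, and hour."""
    return (
        f"gs://{BUCKET}/logs/service={service}/"
        f"date={ts:%Y-%m-%d}/hour={ts:%H}/part-{ts:%Y%m%dT%H%M%S}.json.gz"
    )


def serialize_batch(records: list) -> bytes:
    """Standardize records as newline-delimited JSON (step 1), compressed to keep storage cheap (step 4)."""
    ndjson = "\n".join(json.dumps(r, default=str) for r in records)
    return gzip.compress(ndjson.encode("utf-8"))


# Example: where a batch of checkout-service logs would land
print(partitioned_log_path("checkout", datetime.now(timezone.utc)))
```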

### File-Based Data Blueprint

For batch file uploads (CSV, Excel, Parquet):

  1. Landing zone: Designated upload location with access controls
  2. Validation: Schema and content validation before processing
  3. Archival: Move processed files to archive storage
  4. Notification: Alert data owners on successful ingestion
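A minimal Python sketch of the four steps for a CSV drop: check the header against an expected schema, archive on success, and emit a notification either way. Paths and column names are hypothetical.

```python
import csv
import shutil
from pathlib import Path

LANDING = Path("/data/landing/finance")   # step 1: hypothetical landing zone
ARCHIVE = Path("/data/archive/finance")   # step 3: hypothetical archive location
EXPECTED_COLUMNS = {"invoice_id", "amount", "currency", "issued_at"}  # assumed schema


def validate_and_archive(path: Path) -> bool:
    """Validate a landed CSV (step 2) and archive it on success (step 3); illustrative only."""
    with path.open(newline="") as f:
        header = set(next(csv.reader(f), []))
    missing = EXPECTED_COLUMNS - header
    if missing:
        print(f"{path.name}: rejected, missing columns {sorted(missing)}")   # step 4: notify data owner
        return False
    ARCHIVE.mkdir(parents=True, exist_ok=True)
    shutil.move(str(path), str(ARCHIVE / path.name))
    print(f"{path.name}: ingested and archived")                             # step 4: success notification
    return True


for csv_file in sorted(LANDING.glob("*.csv")):
    validate_and_archive(csv_file)
```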

## Ingestion Methods

### CDC (Change Data Capture)

The primary method for transactional database ingestion. CDC captures incremental changes, reducing load on source systems and enabling near-real-time data availability.

Current implementation: 15-minute refresh cycles from source databases to BigQuery.
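Conceptually, each cycle delivers a batch of change events (inserts, updates, deletes) that is merged into the current state of the target table. The Python sketch below shows that merge logic in miniature; the event shape is an assumption, and in practice the merge would typically run in BigQuery or inside the CDC tooling rather than in application code.

```python
# Illustrative only: apply a batch of CDC change events to a keyed snapshot.
# The event shape (op, key, row) is an assumption, not an actual connector payload.
def apply_changes(snapshot: dict, events: list) -> dict:
    for event in events:
        key = event["key"]
        if event["op"] == "delete":
            snapshot.pop(key, None)      # hard delete; soft deletes are another option
        else:                            # "insert" and "update" both behave as upserts
            snapshot[key] = event["row"]
    return snapshot


orders = {1: {"order_id": 1, "status": "new"}}
batch = [
    {"op": "update", "key": 1, "row": {"order_id": 1, "status": "paid"}},
    {"op": "insert", "key": 2, "row": {"order_id": 2, "status": "new"}},
]
print(apply_changes(orders, batch))
```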

### Batch Loaders

For periodic bulk data loads from APIs, files, or systems without CDC support.
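A small sketch of the incremental pattern most batch loaders follow: keep a high-watermark from the last successful run and pull only rows newer than it. The in-memory source and Bronze lists stand in for a real API, file drop, or database query.

```python
# Illustrative batch loader using a high-watermark so each run only pulls new rows.
source = [
    {"id": 1, "updated_at": "2024-01-01T00:00:00Z"},
    {"id": 2, "updated_at": "2024-01-02T00:00:00Z"},
]
bronze, watermark = [], ""


def run_batch_load() -> int:
    global watermark
    new_rows = [r for r in source if r["updated_at"] > watermark]   # incremental filter
    bronze.extend(new_rows)                                         # land raw rows in Bronze
    if new_rows:
        watermark = max(r["updated_at"] for r in new_rows)          # advance the watermark
    return len(new_rows)


print(run_batch_load())  # first run loads both rows
print(run_batch_load())  # second run finds nothing new
```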

### Streaming

For real-time use cases requiring sub-minute latency (future consideration).

## Best Practices

  1. Document before ingesting: Register the data source in the catalog before building the pipeline
  2. Start with Bronze: Always land raw data in Bronze before transformation
  3. Automate quality checks: Include data tests in every pipeline
  4. Monitor costs: Set up alerts for unexpected cost increases
  5. Plan for schema evolution: Design for backward-compatible changes (see the sketch below)
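As a small illustration of practice 5, one common backward-compatible pattern is to make new fields additive and optional, with defaults applied at read time so records written before the change still parse. The field names below are hypothetical.

```python
# Illustrative backward-compatible read: a field added later gets a default,
# so rows written under the old schema remain usable.
SCHEMA_DEFAULTS = {"channel": "unknown"}   # "channel" added in a later schema version


def normalize(row: dict) -> dict:
    return {**SCHEMA_DEFAULTS, **row}


old_row = {"order_id": 1, "amount": 10.0}                      # written before the change
new_row = {"order_id": 2, "amount": 5.0, "channel": "mobile"}  # written after the change
print(normalize(old_row), normalize(new_row))
```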

## Related Sections

  • ETL - Pipeline implementation and blueprints
  • Object Storage - Where data lands after ingestion
  • Data Catalog - Registering and documenting sources