# Data Sources

The data platform ingests data from multiple source systems across the organization. To ensure consistent, governed, and cost-effective data onboarding, we define blueprints for each category of data source.

## Source System Categories

### Transactional Databases (OLTP)

Core business systems including back-office applications, point-of-sale (POS) systems, and partner integrations. These systems generate the primary operational data for the business.

  • Current state: 19 transactional databases
  • Ingestion method: CDC (Change Data Capture) running at 15-minute intervals
  • Target: Bronze layer in object storage

### API Data

External service integrations such as Infobip (communications), partner APIs, and third-party data providers.

### Mobile Applications

User-facing mobile apps generate behavioral data, transaction events, and user interaction logs.

### Operational Logs

Application logs, system metrics, and audit trails from various services across the infrastructure.

## Data Onboarding Blueprints

Blueprints provide standardized patterns for onboarding new data sources. Each blueprint defines:

  • Ingestion method: How data is captured and moved
  • Schema handling: Validation, evolution, and documentation
  • Quality checks: Automated tests and assertions
  • Lineage metadata: Source and transformation tracking
  • Cost considerations: Storage and compute optimization
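To make these facets concrete, the sketch below models a blueprint as a plain Python dataclass. This is illustrative only: the class name, fields, and example values are assumptions rather than part of the platform.

```python
from dataclasses import dataclass, field


@dataclass
class SourceBlueprint:
    """Illustrative blueprint record; not an actual platform API."""
    name: str
    ingestion_method: str                # "cdc", "batch", or "streaming"
    schema_contract: dict                # expected columns/types and evolution policy
    quality_checks: list = field(default_factory=list)   # freshness/completeness assertions
    lineage_tags: dict = field(default_factory=dict)     # source system and transformation tracking
    cost_notes: str = ""                 # storage and compute considerations


# Hypothetical registration for a POS orders table onboarded via CDC
pos_orders = SourceBlueprint(
    name="pos_orders",
    ingestion_method="cdc",
    schema_contract={"order_id": "INT64", "amount": "NUMERIC", "updated_at": "TIMESTAMP"},
    quality_checks=["freshness < 30 minutes", "order_id is not null"],
    lineage_tags={"source_system": "pos", "target": "bronze"},
    cost_notes="partition the Bronze table by ingestion date",
)
```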

### DBMS Data Blueprint

For relational database sources using CDC:

  1. Connection setup: Secure connection with appropriate credentials
  2. Table selection: Define which tables to replicate
  3. CDC configuration: Set up change capture with appropriate frequency
  4. Schema registration: Document schema in data catalog
  5. Quality gates: Define freshness and completeness checks
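A minimal sketch of what a completed DBMS blueprint might record, expressed as a Python configuration dict keyed by the five steps above. Hostnames, secret names, and configuration keys are assumptions; the actual configuration depends on the CDC tooling in use.

```python
# Hypothetical CDC onboarding configuration; keys and values are illustrative only.
cdc_blueprint = {
    "connection": {                                    # step 1: secure connection setup
        "type": "mysql",
        "host": "pos-db.internal",                     # assumed hostname
        "credentials_secret": "pos-db-readonly",       # resolved from a secret manager at runtime
    },
    "tables": ["orders", "order_items", "customers"],  # step 2: table selection
    "cdc": {                                           # step 3: change capture settings
        "interval_minutes": 15,                        # matches the platform's 15-minute cycle
        "primary_keys": {"orders": ["order_id"]},
    },
    "catalog": {                                       # step 4: schema registration
        "dataset": "bronze_pos",
        "register_schema": True,
    },
    "quality": {                                       # step 5: freshness and completeness gates
        "max_staleness_minutes": 30,
        "required_columns": {"orders": ["order_id", "updated_at"]},
    },
}
```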

### API Data Blueprint

For REST APIs, webhooks, and event streams:

  1. Authentication: Secure credential management
  2. Rate limiting: Respect source system constraints
  3. Pagination handling: Complete data extraction patterns
  4. Error handling: Retry logic and failure notifications
  5. Schema inference: Document and validate response structures
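The sketch below ties the five steps together in Python using the `requests` library. The endpoint, token handling, and response shape (a JSON array per page) are assumptions for illustration; real integrations such as Infobip differ in their pagination and authentication details.

```python
import os
import time

import requests

BASE_URL = "https://api.example-partner.com/v1/events"    # hypothetical endpoint
API_TOKEN = os.environ.get("PARTNER_API_TOKEN", "")       # step 1: loaded from the environment, never hard-coded


def fetch_all_pages(page_size: int = 100, max_retries: int = 3) -> list:
    """Extract every page with basic retry and rate-limit handling (illustrative)."""
    headers = {"Authorization": f"Bearer {API_TOKEN}"}
    page, results = 1, []
    while True:
        for attempt in range(max_retries):                        # step 4: retry on transient failures
            resp = requests.get(
                BASE_URL,
                headers=headers,
                params={"page": page, "per_page": page_size},     # step 3: pagination
                timeout=30,
            )
            if resp.status_code == 429 or resp.status_code >= 500:
                time.sleep(int(resp.headers.get("Retry-After", 5)))  # step 2: back off when throttled
                continue
            resp.raise_for_status()
            break
        else:
            raise RuntimeError(f"Page {page} failed after {max_retries} attempts")
        batch = resp.json()                                       # step 5: validate/document the structure downstream
        if not batch:
            break
        results.extend(batch)
        page += 1
    return results
```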

### Log Data Blueprint

For streaming log data (e.g., application logs to GCS):

  1. Format standardization: Define expected log structure (JSON, structured text)
  2. Partitioning strategy: Time-based organization for efficient querying
  3. Retention policies: Define lifecycle based on compliance requirements
  4. Cost controls: Avoid expensive patterns such as single-column BigQuery tables built over raw JSON, which force a full scan on every query
  5. Aggregation: Define summary tables for common query patterns
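A small Python sketch of the format and partitioning steps: log records are standardized as compressed newline-delimited JSON and written under time-partitioned object paths so queries can prune by service, date, and hour. The bucket name and path layout are assumptions.

```python
import gzip
import json
from datetime import datetime, timezone

BUCKET = "example-app-logs"  # hypothetical GCS bucket


def partitioned_log_path(service: str, ts: datetime) -> str:
    """Time-partitioned object path (step 2) so queries can prune by service, date, and hour."""
    return (
        f"gs://{BUCKET}/logs/service={service}/"
        f"date={ts:%Y-%m-%d}/hour={ts:%H}/part-{ts:%Y%m%dT%H%M%S}.json.gz"
    )


def serialize_batch(records: list) -> bytes:
    """Standardize records as newline-delimited JSON (step 1), compressed to keep storage cheap (step 4)."""
    ndjson = "\n".join(json.dumps(r, default=str) for r in records)
    return gzip.compress(ndjson.encode("utf-8"))


# Example: where a batch of checkout-service logs would land
print(partitioned_log_path("checkout", datetime.now(timezone.utc)))
```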

### File-Based Data Blueprint

For batch file uploads (CSV, Excel, Parquet):

  1. Landing zone: Designated upload location with access controls
  2. Validation: Schema and content validation before processing
  3. Archival: Move processed files to archive storage
  4. Notification: Alert data owners on successful ingestion
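A minimal Python sketch of the four steps for a CSV drop: check the header against an expected schema, archive on success, and emit a notification either way. Paths and column names are hypothetical.

```python
import csv
import shutil
from pathlib import Path

LANDING = Path("/data/landing/finance")   # step 1: hypothetical landing zone
ARCHIVE = Path("/data/archive/finance")   # step 3: hypothetical archive location
EXPECTED_COLUMNS = {"invoice_id", "amount", "currency", "issued_at"}  # assumed schema


def validate_and_archive(path: Path) -> bool:
    """Validate a landed CSV (step 2) and archive it on success (step 3); illustrative only."""
    with path.open(newline="") as f:
        header = set(next(csv.reader(f), []))
    missing = EXPECTED_COLUMNS - header
    if missing:
        print(f"{path.name}: rejected, missing columns {sorted(missing)}")   # step 4: notify data owner
        return False
    ARCHIVE.mkdir(parents=True, exist_ok=True)
    shutil.move(str(path), str(ARCHIVE / path.name))
    print(f"{path.name}: ingested and archived")                             # step 4: success notification
    return True


for csv_file in sorted(LANDING.glob("*.csv")):
    validate_and_archive(csv_file)
```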

## Ingestion Methods

### CDC (Change Data Capture)

The primary method for transactional database ingestion. CDC captures incremental changes, reducing load on source systems and enabling near-real-time data availability.

Current implementation: 15-minute refresh cycles from source databases to BigQuery.
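Conceptually, each cycle delivers a batch of change events (inserts, updates, deletes) that is merged into the current state of the target table. The Python sketch below shows that merge logic in miniature; the event shape is an assumption, and in practice the merge would typically run in BigQuery or inside the CDC tooling rather than in application code.

```python
# Illustrative only: apply a batch of CDC change events to a keyed snapshot.
# The event shape (op, key, row) is an assumption, not an actual connector payload.
def apply_changes(snapshot: dict, events: list) -> dict:
    for event in events:
        key = event["key"]
        if event["op"] == "delete":
            snapshot.pop(key, None)      # hard delete; soft deletes are another option
        else:                            # "insert" and "update" both behave as upserts
            snapshot[key] = event["row"]
    return snapshot


orders = {1: {"order_id": 1, "status": "new"}}
batch = [
    {"op": "update", "key": 1, "row": {"order_id": 1, "status": "paid"}},
    {"op": "insert", "key": 2, "row": {"order_id": 2, "status": "new"}},
]
print(apply_changes(orders, batch))
```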

### Batch Loaders

For periodic bulk data loads from APIs, files, or systems without CDC support.
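A small sketch of the incremental pattern most batch loaders follow: keep a high-watermark from the last successful run and pull only rows newer than it. The in-memory source and Bronze lists stand in for a real API, file drop, or database query.

```python
# Illustrative batch loader using a high-watermark so each run only pulls new rows.
source = [
    {"id": 1, "updated_at": "2024-01-01T00:00:00Z"},
    {"id": 2, "updated_at": "2024-01-02T00:00:00Z"},
]
bronze, watermark = [], ""


def run_batch_load() -> int:
    global watermark
    new_rows = [r for r in source if r["updated_at"] > watermark]   # incremental filter
    bronze.extend(new_rows)                                         # land raw rows in Bronze
    if new_rows:
        watermark = max(r["updated_at"] for r in new_rows)          # advance the watermark
    return len(new_rows)


print(run_batch_load())  # first run loads both rows
print(run_batch_load())  # second run finds nothing new
```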

### Streaming

For real-time use cases requiring sub-minute latency (future consideration).

## Best Practices

  1. Document before ingesting: Register the data source in the catalog before building the pipeline
  2. Start with Bronze: Always land raw data in Bronze before transformation
  3. Automate quality checks: Include data tests in every pipeline
  4. Monitor costs: Set up alerts for unexpected cost increases
  5. Plan for schema evolution: Design for backward-compatible changes (see the sketch below)
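As a small illustration of practice 5, one common backward-compatible pattern is to make new fields additive and optional, with defaults applied at read time so records written before the change still parse. The field names below are hypothetical.

```python
# Illustrative backward-compatible read: a field added later gets a default,
# so rows written under the old schema remain usable.
SCHEMA_DEFAULTS = {"channel": "unknown"}   # "channel" added in a later schema version


def normalize(row: dict) -> dict:
    return {**SCHEMA_DEFAULTS, **row}


old_row = {"order_id": 1, "amount": 10.0}                      # written before the change
new_row = {"order_id": 2, "amount": 5.0, "channel": "mobile"}  # written after the change
print(normalize(old_row), normalize(new_row))
```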

## Related Sections

  • ETL - Pipeline implementation and blueprints
  • Object Storage - Where data lands after ingestion
  • Data Catalog - Registering and documenting sources