# Data Sources
The data platform ingests data from multiple source systems across the organization. To ensure consistent, governed, and cost-effective data onboarding, we define blueprints for each category of data source.
## Source System Categories
### Transactional Databases (OLTP)
Core business systems, including back-office applications, point-of-sale (POS) systems, and partner integrations. These systems generate the primary operational data for the business.
- Current state: 19 transactional databases
- Ingestion method: CDC (Change Data Capture) running at 15-minute intervals
- Target: Bronze layer in object storage
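As a concrete illustration of how one 15-minute CDC batch might land in the Bronze layer, the sketch below builds an object-storage path partitioned by source, table, and batch time. The bucket name and layout are assumptions for illustration, not the platform's actual convention.
```python
from datetime import datetime, timezone

# Illustrative Bronze path for one 15-minute CDC batch. The bucket name and
# partition layout are assumptions, not the platform's actual convention.
def bronze_path(source: str, table: str, batch_time: datetime) -> str:
    return (
        f"gs://data-platform-bronze/{source}/{table}/"
        f"dt={batch_time:%Y-%m-%d}/batch={batch_time:%H%M}/"
    )

print(bronze_path("pos_core", "orders", datetime(2024, 5, 1, 10, 15, tzinfo=timezone.utc)))
# gs://data-platform-bronze/pos_core/orders/dt=2024-05-01/batch=1015/
```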
### API Data
External service integrations such as Infobip (communications), partner APIs, and third-party data providers.
### Mobile Applications
User-facing mobile apps generate behavioral data, transaction events, and user interaction logs.
### Operational Logs
Application logs, system metrics, and audit trails from various services across the infrastructure.
## Data Onboarding Blueprints
Blueprints provide standardized patterns for onboarding new data sources. Each blueprint defines:
- Ingestion method: How data is captured and moved
- Schema handling: Validation, evolution, and documentation
- Quality checks: Automated tests and assertions
- Lineage metadata: Source and transformation tracking
- Cost considerations: Storage and compute optimization
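To make these elements concrete, a blueprint could be captured as a small structured definition. The sketch below is illustrative only; the `Blueprint` class and its field names are not an existing platform API.
```python
from dataclasses import dataclass

# Illustrative blueprint definition mirroring the five elements above.
# None of these names are an existing platform API.
@dataclass
class Blueprint:
    ingestion_method: str        # how data is captured, e.g. "cdc" or "batch_api"
    schema_handling: str         # e.g. "strict" or "evolve_backward_compatible"
    quality_checks: list[str]    # automated tests to run on every load
    lineage: dict[str, str]      # source system and transformation references
    cost_notes: str              # storage/compute guidance for this pattern

dbms_cdc = Blueprint(
    ingestion_method="cdc",
    schema_handling="evolve_backward_compatible",
    quality_checks=["freshness_under_30_min", "row_count_vs_source"],
    lineage={"source_system": "pos_core", "target": "bronze.pos_core"},
    cost_notes="Partition Bronze tables by ingestion date.",
)
```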
### DBMS Data Blueprint
For relational database sources using CDC:
- Connection setup: Secure connection with appropriate credentials
- Table selection: Define which tables to replicate
- CDC configuration: Set up change capture with appropriate frequency
- Schema registration: Document schema in data catalog
- Quality gates: Define freshness and completeness checks
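A hypothetical onboarding definition for one relational source, following the steps above, might look like the following. All names and values are placeholders; credentials would be resolved from a secret manager, never stored in the config.
```python
# Hypothetical onboarding config for a single relational source; keys and
# values are placeholders for illustration.
dbms_source = {
    "connection": {
        "host": "backoffice-db.internal",                       # source database host
        "credentials_secret": "secrets/backoffice-db-reader",   # resolved from a secret manager
    },
    "tables": ["customers", "orders", "invoices"],               # tables selected for replication
    "cdc": {"interval_minutes": 15},                              # matches the current platform cadence
    "catalog": {"dataset": "bronze_backoffice", "owner": "data-platform"},  # schema registration
    "quality_gates": {
        "freshness_minutes": 30,                                  # no older than two CDC cycles
        "completeness_check": "row_count_vs_source",
    },
}
```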
### API Data Blueprint
For REST APIs, webhooks, and event streams:
- Authentication: Secure credential management
- Rate limiting: Respect source system constraints
- Pagination handling: Complete data extraction patterns
- Error handling: Retry logic and failure notifications
- Schema inference: Document and validate response structures
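The sketch below shows one way these concerns can be combined in a single extraction loop: bearer-token authentication, page-number pagination, and a bounded retry on rate-limit responses. The endpoint, parameter names, and response shape are assumptions, not any specific provider's API.
```python
import time
import requests

# Minimal paginated pull with retry on rate limits. Endpoint, parameters,
# and response shape are illustrative, not a specific provider's API.
def fetch_all(base_url: str, token: str, page_size: int = 200) -> list[dict]:
    records, page = [], 1
    headers = {"Authorization": f"Bearer {token}"}
    while True:
        for attempt in range(3):                             # simple bounded retry
            resp = requests.get(
                f"{base_url}/records",
                params={"page": page, "page_size": page_size},
                headers=headers,
                timeout=30,
            )
            if resp.status_code == 429 and attempt < 2:      # respect rate limits
                time.sleep(int(resp.headers.get("Retry-After", "5")))
                continue
            resp.raise_for_status()                           # surface non-retryable failures
            break
        batch = resp.json().get("results", [])
        if not batch:                                         # empty page: extraction complete
            return records
        records.extend(batch)
        page += 1
```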
### Log Data Blueprint
For streaming log data (e.g., application logs to GCS):
- Format standardization: Define expected log structure (JSON, structured text)
- Partitioning strategy: Time-based organization for efficient querying
- Retention policies: Define lifecycle based on compliance requirements
- Cost controls: Avoid expensive patterns such as single-column BigQuery tables holding raw JSON, which force every query to scan and parse the full payload
- Aggregation: Define summary tables for common query patterns
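For example, newline-delimited JSON logs staged in GCS could be loaded into a day-partitioned BigQuery table instead of a single raw-JSON column. The bucket, table, and timestamp field below are placeholders.
```python
from google.cloud import bigquery

# Load newline-delimited JSON logs from GCS into a day-partitioned table.
# Bucket, table, and field names are placeholders.
client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    autodetect=True,                                  # infer schema from the JSON records
    time_partitioning=bigquery.TimePartitioning(
        type_=bigquery.TimePartitioningType.DAY,
        field="event_timestamp",                      # partition on the event time
    ),
)
load_job = client.load_table_from_uri(
    "gs://app-logs-bucket/service-a/dt=2024-05-01/*.json",
    "my-project.bronze_logs.service_a_events",
    job_config=job_config,
)
load_job.result()                                     # wait for the load to finish
```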
### File-Based Data Blueprint
For batch file uploads (CSV, Excel, Parquet):
- Landing zone: Designated upload location with access controls
- Validation: Schema and content validation before processing
- Archival: Move processed files to archive storage
- Notification: Alert data owners on successful ingestion
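A minimal validate-and-archive pass over a landing zone might look like the sketch below. The paths, expected columns, and downstream load step are placeholders.
```python
import csv
import shutil
from pathlib import Path

# Validate the header of each CSV drop, then archive it after processing.
# Paths and expected columns are placeholders.
EXPECTED_COLUMNS = {"customer_id", "order_id", "amount", "order_date"}
LANDING = Path("/data/landing/finance")
ARCHIVE = Path("/data/archive/finance")

def process_file(path: Path) -> None:
    with path.open(newline="") as f:
        header = next(csv.reader(f))                 # first row is the header
    missing = EXPECTED_COLUMNS - set(header)
    if missing:
        raise ValueError(f"{path.name}: missing columns {sorted(missing)}")
    # ...load the validated file into Bronze here...
    ARCHIVE.mkdir(parents=True, exist_ok=True)
    shutil.move(str(path), ARCHIVE / path.name)      # archive after successful load

for file in LANDING.glob("*.csv"):
    process_file(file)
```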
## Ingestion Methods
### CDC (Change Data Capture)
The primary method for transactional database ingestion. CDC captures incremental changes, reducing load on source systems and enabling near-real-time data availability.
Current implementation: 15-minute refresh cycles from source databases to BigQuery.
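One common way to apply a captured batch of changes is a MERGE from a staging table into the target table, keyed on the source primary key. The project, dataset, table, and column names below are illustrative, not the platform's actual objects.
```python
from google.cloud import bigquery

# Apply one CDC batch by merging staged changes into the target table.
# Project, dataset, table, and column names are illustrative.
client = bigquery.Client()
merge_sql = """
MERGE `my-project.bronze_pos.orders` AS t
USING `my-project.staging_pos.orders_changes` AS s
ON t.order_id = s.order_id
WHEN MATCHED THEN
  UPDATE SET t.status = s.status, t.updated_at = s.updated_at
WHEN NOT MATCHED THEN
  INSERT (order_id, status, updated_at)
  VALUES (s.order_id, s.status, s.updated_at)
"""
client.query(merge_sql).result()   # run once per 15-minute CDC cycle
```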
### Batch Loaders
For periodic bulk data loads from APIs, files, or systems without CDC support.
### Streaming
For real-time use cases requiring sub-minute latency (future consideration).
## Best Practices
- Document before ingesting: Register the data source in the catalog before building the pipeline
- Start with Bronze: Always land raw data in Bronze before transformation
- Automate quality checks: Include data tests in every pipeline (see the freshness sketch after this list)
- Monitor costs: Set up alerts for unexpected cost increases
- Plan for schema evolution: Design for backward-compatible changes
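As an example of an automated quality check, the freshness assertion sketched below flags a CDC-fed table whose newest row is older than two refresh cycles. The table and timestamp column are placeholders.
```python
from google.cloud import bigquery

# Freshness assertion for a CDC-fed table: with 15-minute refresh cycles,
# a lag above ~30 minutes suggests a stalled pipeline. Names are placeholders.
client = bigquery.Client()
query = """
SELECT TIMESTAMP_DIFF(CURRENT_TIMESTAMP(), MAX(updated_at), MINUTE) AS lag_minutes
FROM `my-project.bronze_pos.orders`
"""
lag = list(client.query(query).result())[0].lag_minutes
if lag is None or lag > 30:
    raise RuntimeError(f"Freshness check failed: orders lag is {lag} minutes")
```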
## Related Sections
- ETL - Pipeline implementation and blueprints
- Object Storage - Where data lands after ingestion
- Data Catalog - Registering and documenting sources