# Data Sources
Although this document does not define any specific set of tools, there are a few guidelines we should take into account when interacting with external data sources.
The most common ways data enters the analytics data platform are:
- Relational DBMS pushes from CDC into Google Cloud Storage
- Application Logs - read from Cloud Logging or stored into GCS
- Automated extraction processes importing external data into GCS or directly inserting into BigQuery. This data typically comes from
- Relational databases
- APIs
- Data files manually added into Google Cloud Storage
- Event-based or streaming systems pushing data into Cloud Storage or directly into BigQuery
This data is then prepared and served through the ETL layer.
Some of these data source processes we own and control; others are outside the reach of our engineering and governance. In either case, here is what we should keep in mind:
# Schema validation
Enforce schema validation as early as possible.
There is an implementation pattern proposed in etl that also enforces schema validation, but the earlier these checks occur, the better.
Here are a few techniques and suggestions for schema validation:
- If we do not own the process
- dbt + staging external table - Map new incoming files as external tables in BigQuery and, with the help of dbt, map these as transient sources that fail early with schema contracts. In this case, we first define and document the schema, then implement validation.
- If we own the process
- Validate at runtime and create a quarantine area for failed validations.
- Use the data catalog as the source of truth.
- Be mindful about schema versioning and prepare the process for evolution
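When we own the process, the runtime-validation and quarantine bullets above can be sketched in Python. This is a minimal sketch with hypothetical field names; in practice the expected schema would come from the data catalog, and the quarantine would be a GCS prefix or BigQuery table rather than a list:

```python
from datetime import datetime, timezone

# Hypothetical expected schema; in a real pipeline this would be fetched
# from the data catalog (our source of truth), not hard-coded.
EXPECTED_SCHEMA = {"order_id": int, "customer_id": int, "amount": float}

def validate_batch(records, schema=EXPECTED_SCHEMA):
    """Split a batch into valid records and quarantined ones.

    Quarantined entries keep the original record plus the failure reason
    and a timestamp, so they can be inspected and reprocessed later.
    """
    valid, quarantine = [], []
    for record in records:
        missing = [f for f in schema if f not in record]
        wrong_type = [
            f for f, t in schema.items()
            if f in record and not isinstance(record[f], t)
        ]
        if missing or wrong_type:
            quarantine.append({
                "record": record,
                "missing_fields": missing,
                "type_mismatches": wrong_type,
                "quarantined_at": datetime.now(timezone.utc).isoformat(),
            })
        else:
            valid.append(record)
    return valid, quarantine

batch = [
    {"order_id": 1, "customer_id": 10, "amount": 9.5},
    {"order_id": "2", "customer_id": 11, "amount": 3.0},  # wrong type
    {"order_id": 3, "amount": 1.0},                       # missing field
]
valid, quarantined = validate_batch(batch)
```

The point of keeping the reason alongside the record is that the quarantine area doubles as the input for the failure reports described below.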
For all cases:
- Generate data and views (reports) about the process
- Be interested in knowing:
- What data came in and when
- What failed and what did not
- Why it failed
- For instance: create a table or a few tables with:
- object added, bucket, key, hour, metadata tags, etag
- expected and actual metadata at the time of extraction
- Execution logs, status and outcome.
- Maintain a historical view of schema evolution in the data governance tool
- Have a process for natural and expected schema evolution
- What is the order of change
- Use schema version
- How to create retroactive views
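The historical view of schema evolution can be sketched as an append-only registry of schema versions per table, plus a diff between versions. This is an in-memory sketch standing in for the governance tool; all table and column names are hypothetical:

```python
# Append-only history of schema versions; in practice this would live in
# the data governance tool, not in process memory.
history = []

def register_schema(table, version, columns):
    """Record a new schema version (column name -> type) for a table."""
    history.append({"table": table, "version": version, "columns": columns})

def diff_versions(table, old_version, new_version):
    """Report what changed between two registered schema versions."""
    by_version = {
        h["version"]: h["columns"] for h in history if h["table"] == table
    }
    old, new = by_version[old_version], by_version[new_version]
    return {
        "added": sorted(set(new) - set(old)),
        "removed": sorted(set(old) - set(new)),
        "retyped": sorted(c for c in set(old) & set(new) if old[c] != new[c]),
    }

register_schema("orders", 1, {"order_id": "INT64", "amount": "FLOAT64"})
register_schema("orders", 2, {"order_id": "INT64", "amount": "NUMERIC",
                              "currency": "STRING"})
changes = diff_versions("orders", 1, 2)
```

A diff like this gives the "order of change" between versions and is the input for deciding how retroactive views should cast or default the affected columns.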
# Incremental loads
Storage in data lakes is cheap.
Knowing this, we can store raw data to allow for recalculations, but processing this data comes at a cost, whether BigQuery data-scan (or slot) costs or compute costs for other runtimes.
With that said, optimize data modeling for incremental loads and for smart early deduplication mechanisms (where applicable), such as hashing. Avoid in-memory deduplication when possible.
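One hash-based approach, sketched in Python: compute a stable content hash per record and compare it against hashes already loaded, so we never have to hold full historical payloads in memory. The field names are illustrative, and in a real pipeline the seen-hash state would be a hash column in the warehouse matched by a BigQuery MERGE, not a Python set:

```python
import hashlib
import json

def row_hash(record, key_fields):
    """Stable content hash over the business-key fields of a record."""
    canonical = json.dumps({k: record[k] for k in sorted(key_fields)},
                           sort_keys=True, default=str)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def dedupe(records, key_fields, seen_hashes):
    """Keep only records whose content hash has not been seen before.

    `seen_hashes` stands in for warehouse-side state; only hashes cross
    batches, never the full payloads.
    """
    fresh = []
    for record in records:
        h = row_hash(record, key_fields)
        if h not in seen_hashes:
            seen_hashes.add(h)
            fresh.append(record)
    return fresh

seen = set()
first = dedupe([{"id": 1, "v": "a"}, {"id": 2, "v": "b"}], ["id", "v"], seen)
second = dedupe([{"id": 1, "v": "a"}, {"id": 3, "v": "c"}], ["id", "v"], seen)
```

Because the hash is computed over a canonical JSON form of the key fields, the same record always produces the same digest regardless of field ordering in the source.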
# Data onboarding
- Establish and obey naming conventions
- Prefer incremental loads, and if you cannot deduplicate during extraction, offer mechanisms that make deduplication easier at transform time.
- Do not rely on long-lived credentials. Instrument your workload with workload identity - e.g.: container with workload identity configured with policies to access a schema on a specific Google Cloud SQL instance.
- Be mindful about data size and growth. Coordinate this with the data team for better strategies.
- If storing on GCS, choose compressed and typed storage formats - e.g.: parquet, avro.
- If only CSV or JSON is available, convert away from it early in favor of improved downstream performance and lower costs.
- Store metadata about the onboarding process
- batch_id, timestamp, duration, object keys, etc.
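Capturing that onboarding metadata can be as simple as wrapping the load in a function that emits a batch manifest. This is a minimal sketch: `load_fn` is a hypothetical stand-in for the real extraction step, and only the field names mirror the bullets above:

```python
import time
import uuid

def run_onboarding(object_keys, load_fn):
    """Run a load and return a metadata manifest for the batch.

    `load_fn` is a placeholder for the actual loader; the manifest is
    what would be stored alongside the onboarded data.
    """
    batch_id = str(uuid.uuid4())
    started = time.time()
    statuses = {key: load_fn(key) for key in object_keys}
    finished = time.time()
    return {
        "batch_id": batch_id,
        "started_at": started,
        "duration_seconds": finished - started,
        "object_keys": sorted(object_keys),
        "statuses": statuses,
    }

manifest = run_onboarding(
    ["landing/orders/2024/01/01/part-0.parquet"],  # hypothetical key
    load_fn=lambda key: "loaded",  # stand-in for the real loader
)
```

Persisting this manifest (for example to a BigQuery audit table) is what makes the "what came in and when" reports from the schema validation section possible for onboarding as well.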