# Proposed Solution
This represents a future view. Read each block as a long list of checkmarks that would need to be ticked for that view to be considered a success.
Implementation, however, should be phased: first choose a vertical scope, then build it step by step, where each iteration ticks a few checkmarks off the list.
# Onboarding Blueprints
To encourage appropriate use of each storage tier, it is suggested that we introduce data onboarding blueprints.
Writing these blueprints forces us to think through each storage tier and its cost patterns, for example:
- GCS - how to use [biggest driver: governance and security]
- What kind of files to store - e.g.: recommend compressed formats (Parquet, ORC, gzip) and processing patterns to handle cases where data needs to arrive as flat files
- Lifecycle policy - When is data considered obsolete and should be pushed to a colder storage tier? (See the bucket-provisioning sketch after this list.)
- Encryption - Use customer-managed encryption keys and collaborate with Infrastructure to manage the lifecycle of these keys.
- Configure access logs - with company-established retention times.
- Establish bucket naming convention.
- Establish use-cases for object versioning.
- Establish data redundancy practices - redundancy choices directly affect pricing.
- Establish expected access policies - fine-grained IAM: which identities can perform which actions, and under which conditions.
- BigQuery
- Define organization and naming conventions
- How will data be organized across teams?
- What are the tiers and how will they be organized?
- Which project/s will hold data - define collaboration across projects
- Which region/s are we going to work in?
- Define partitioning and clustering blueprints (see the partitioned-table sketch after this list)
- Evaluate based on current tables
- With the credit and vertical businesses there is potentially a big savings opportunity here.
- Define fine-grained access policies, implement least-privilege principles.
- Optimized DW Layer
- What is this tool?
- No matter how often I run the same repetitive queries, cost stays stable for the capacity I need
- e.g.: BigQuery BI Engine or Clickhouse
- Establish what use-cases are eligible. Suggestion:
- Certified gold case
- Cases where detailed drill downs are needed and record count is high.
- Cases where data already meets reusability and centrality standards.
- Transactional DB Use cases
- Should we provision a dedicated transactional db instance? Postgres?
- What are the cases and non-functional requirements?
- Data federation layer
- Think of a "middleware" that serves as a central point for queries across different data sources
- e.g.: one query can retrieve data from Postgres and Buckets and is aware of all data access policies.
- BigQuery Data Federation (see the federated-query sketch after this list)
- ETL
- Establish data onboarding blueprints, with schema-lock for example (see the schema-check sketch after this list).
- Build ETL on top of a platform (dbt or similar) that will promote reuse and governance
- As much as possible, reusable blueprints should be accessible to key business users
- e.g.: a self-deployable blueprint that reads long-lived application logs; DE could then focus on strategic tools like these
- Governed, with visibility of company-wide lineage/dependencies.
- Build a centralized library of reusable tests.
- Enforce schema verification against data catalog.
- Data Catalog and centralized governance
- IMO, giving visibility into data, along with a strong set of enforced procedures, is the ultimate driver for improving data quality across the company.
- Rich search and retrieval experience, if possible with LLM assistance
- Users need to feel at home, confident and motivated.
- Needs to be able to govern all available data sources automatically, including automatic updates
- Needs to also govern downstream data artifacts - dashboards, reverse ETL.
- Needs to be able to:
- Communicate maturity level, data stewards
- Hold comments, feedback
- Implement direct channel for support
- E.g.: DataHub, OpenMetadata. Preference for a managed option that is well integrated with our tools.
- It should, itself or along with a sidecar, enable scheduled data scans for PII.
- Establish a playbook for PII identification procedures (see the PII scan sketch after this list).
- Master data management - identification of duplicate collections and records.
- As much as possible, include data definitions (e.g.: SQL) along with data types
- Bring examples and docs on how to consume/query data.
- Reporting / KPIs / Dashboards
- Adopt one tool that covers the majority of the use cases that Metabase and GLS cover today.
- Build a centralized view of important KPIs - they need to be monitored and protected by proper access management policies
- e.g.: monitored, socialized and, why not, governed by automated anomaly/spike detection models
- Enforce access control policies for:
- Vertical artifacts (data sources, dashboards, charts, etc)
- Horizontal artifacts - row-level security, if applicable (see the row access policy sketch after this list).
- Read-only permissions + Identity Federation for access audits, as opposed to a single identity per data connection.
- Data Science
- Socialize model lifecycle
- The data scientist devkit should come with an opinionated way of running DS workloads
- Github code as single source of truth
- Runs should be remote - such as on a notebook instance or a 3rd-party tool with GitHub integration.
- Provide blueprints for model lifecycle management - conception, development, deployment, observability, retraining, sunsetting
- Establish data preparation good practices - data types, code reuse, data pipeline, repeatability, feature store, extensive data testing
- Provide lineage metadata
- Expose auditing capabilities
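
The sketches below illustrate a few of the checkmarks above; all project, dataset, bucket, group and connection names are placeholders. First, a minimal bucket-provisioning sketch for the GCS blueprint (lifecycle tiering, CMEK, uniform access), assuming the google-cloud-storage client:

```python
# Hypothetical GCS onboarding blueprint: placeholder names, not real resources.
from google.cloud import storage

def provision_raw_bucket(project_id: str, bucket_name: str, kms_key: str) -> storage.Bucket:
    client = storage.Client(project=project_id)
    bucket = storage.Bucket(client, name=bucket_name)

    # Governance defaults: uniform bucket-level access (no per-object ACLs)
    # and a customer-managed encryption key owned together with Infrastructure.
    bucket.iam_configuration.uniform_bucket_level_access_enabled = True
    bucket.default_kms_key_name = kms_key  # projects/.../cryptoKeys/...
    bucket.storage_class = "STANDARD"

    # Lifecycle blueprint: push aging data to colder tiers, then expire it.
    bucket.add_lifecycle_set_storage_class_rule("NEARLINE", age=30)
    bucket.add_lifecycle_set_storage_class_rule("COLDLINE", age=180)
    bucket.add_lifecycle_delete_rule(age=730)

    return client.create_bucket(bucket, location="EU")
```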
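A partitioned-table sketch for the BigQuery blueprint, assuming the google-cloud-bigquery client; dataset, table and field names are hypothetical:

```python
# Hypothetical partitioning/clustering blueprint; names are placeholders.
from google.cloud import bigquery

def create_events_table(project_id: str) -> bigquery.Table:
    client = bigquery.Client(project=project_id)
    table = bigquery.Table(
        f"{project_id}.analytics_silver.events",
        schema=[
            bigquery.SchemaField("event_ts", "TIMESTAMP", mode="REQUIRED"),
            bigquery.SchemaField("customer_id", "STRING", mode="REQUIRED"),
            bigquery.SchemaField("vertical", "STRING"),
            bigquery.SchemaField("payload", "STRING"),
        ],
    )
    # Partition by day on the event timestamp so scan cost is bounded by date
    # filters; cluster on the most common filter/join columns.
    table.time_partitioning = bigquery.TimePartitioning(
        type_=bigquery.TimePartitioningType.DAY,
        field="event_ts",
    )
    table.clustering_fields = ["customer_id", "vertical"]
    table.require_partition_filter = True  # queries must filter on event_ts
    return client.create_table(table, exists_ok=True)
```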
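A federated-query sketch for the data federation layer: one BigQuery statement joins a native table with a Cloud SQL (Postgres) table through a pre-created BigQuery connection. Connection, project and table names are assumptions:

```python
# Hypothetical federated query; the 'postgres-conn' connection is a placeholder.
from google.cloud import bigquery

FEDERATED_SQL = """
SELECT o.order_id, o.amount, c.segment
FROM `my-project.analytics_silver.orders` AS o
JOIN EXTERNAL_QUERY(
  'my-project.us.postgres-conn',                -- BigQuery connection resource
  'SELECT customer_id, segment FROM customers'  -- pushed down to Postgres
) AS c
ON o.customer_id = c.customer_id
"""

def run_federated_query(project_id: str) -> list:
    # IAM is checked on both the native table and the connection, so the
    # access policies defined above apply to both sides of the join.
    client = bigquery.Client(project=project_id)
    return list(client.query(FEDERATED_SQL).result())
```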
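A schema-check sketch for the ETL schema-lock idea: incoming loads are rejected when columns or types drift from the contract registered in the catalog. Plain Python; the contract content is a hypothetical example:

```python
# Hypothetical schema-lock check; the contract below is illustrative only.
from dataclasses import dataclass

@dataclass(frozen=True)
class Column:
    name: str
    dtype: str          # e.g. "STRING", "INT64", "TIMESTAMP"
    nullable: bool = True

# Contract as it would be published in the data catalog (placeholder).
ORDERS_CONTRACT = [
    Column("order_id", "STRING", nullable=False),
    Column("customer_id", "STRING", nullable=False),
    Column("amount", "NUMERIC"),
    Column("created_at", "TIMESTAMP", nullable=False),
]

def check_schema(observed: list, contract: list) -> list:
    """Return a list of violations; an empty list means the load may proceed."""
    violations = []
    expected = {c.name: c for c in contract}
    seen = {c.name: c for c in observed}

    for name, col in expected.items():
        if name not in seen:
            violations.append(f"missing column: {name}")
        elif seen[name].dtype != col.dtype:
            violations.append(f"type drift on {name}: {seen[name].dtype} != {col.dtype}")
        elif not col.nullable and seen[name].nullable:
            violations.append(f"{name} must be NOT NULL")

    for name in seen.keys() - expected.keys():
        violations.append(f"unexpected column: {name}")
    return violations
```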
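A naive PII scan sketch for the catalog/playbook item - a regex pass over sampled string values. It only illustrates the idea; a managed scanner (e.g. Cloud DLP) would be the real candidate, and the patterns below are assumptions:

```python
# Hypothetical PII scan over column samples; patterns are illustrative only.
import re

PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "phone": re.compile(r"\+?\d[\d ()-]{7,14}\d"),
    "card_number": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}

def scan_sample(column_name: str, values: list) -> dict:
    """Count how many sampled values of a column match each PII pattern."""
    hits = {label: 0 for label in PII_PATTERNS}
    for value in values:
        for label, pattern in PII_PATTERNS.items():
            if pattern.search(value):
                hits[label] += 1
    # Columns with any hits would be flagged in the catalog for steward review.
    return {column_name: {label: count for label, count in hits.items() if count > 0}}
```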
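A row access policy sketch for the row-level security item under Reporting, assuming BigQuery DDL run through the Python client; the group, table and column are placeholders:

```python
# Hypothetical row-level security: restrict a vertical's analysts to their rows.
from google.cloud import bigquery

ROW_POLICY_DDL = """
CREATE OR REPLACE ROW ACCESS POLICY credit_analysts_only
ON `my-project.analytics_gold.kpis`
GRANT TO ('group:credit-analysts@example.com')
FILTER USING (business_vertical = 'credit')
"""

def apply_row_policy(project_id: str) -> None:
    client = bigquery.Client(project=project_id)
    client.query(ROW_POLICY_DDL).result()  # DDL statements return no rows
```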
# Notes
- Overall storage decisions should aim to simplify governance.
- The more diverse the tooling, the more governance complexity increases.
- While this first describes 1-step-forward improvements, it is not off the table to discuss more disruptive approaches, such as adopting tools like Databricks, Snowflake and others in their domain. I only think these could be the second step. First steps should increment on tools and governance, but focus on ownership and behaviour. Otherwise we give the kid a Garmin watch hoping they will become a runner. A small vertical implementation, however, should help us decide.
- Even though some tools are listed separately, 3rd-party providers might integrate more than one of the potential solutions - hex.tech, for example, would integrate dashboards and notebooks in a single platform. The adoption of such tools, however, should be analyzed with caution: even though it immediately solves some problems, it can introduce a hard-to-overcome lock-in that can hurt horizontal aspects of DS and reporting.