# Proposed Solution - MVP

# Object/File Storage

Google Cloud Storage

  • GCS offers the highest interoperability among the options considered
  • It decouples storage from compute
  • It is easier to offer volatile processing on top of stored data
  • Highly available
  • Easier to impose access policies / RBAC, audit trail, immutability
  • Easier to apply tier transition policies
  • Best use:
    • Raw data / Landing area
    • Restricted access to data integration people only
    • Offer per-user sandboxes to keep track of ownership/responsibility for ad-hoc data files
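
As a rough illustration of the tier transition policies mentioned above, the sketch below builds a lifecycle configuration in the JSON shape GCS accepts for object lifecycle management. The function name, the `raw/` prefix, and the 30/90-day thresholds are assumptions chosen for illustration, not values from this proposal.

```python
import json


def lifecycle_config(nearline_after_days, coldline_after_days, prefix):
    """Build a GCS lifecycle configuration as a plain dict.

    Hypothetical helper: objects under `prefix` move to colder
    storage classes as they age.
    """
    if not 0 < nearline_after_days < coldline_after_days:
        raise ValueError("expected 0 < nearline_after_days < coldline_after_days")

    def rule(storage_class, age_days):
        # One lifecycle rule: change storage class once the object is old enough.
        return {
            "action": {"type": "SetStorageClass", "storageClass": storage_class},
            "condition": {"age": age_days, "matchesPrefix": [prefix]},
        }

    return {
        "rule": [
            rule("NEARLINE", nearline_after_days),
            rule("COLDLINE", coldline_after_days),
        ]
    }


# Assumed thresholds and prefix, for illustration only.
config = lifecycle_config(30, 90, "raw/")
print(json.dumps(config, indent=2))
```

The printed JSON could be saved to a file and applied to a bucket, for example with `gsutil lifecycle set`; the actual thresholds would come out of the use-case analysis.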

Alternatives

  • Block Storage
    • Faster local workloads - e.g.: instances that process data while coupled to disk.
    • Can be much faster
    • Difficult to scale and operate
    • Can be used, depending on the processing framework
  • Google AlloyDB/Cloud SQL
    • Not a 1:1 comparison with object storage, but instead of keeping CSVs in GCS, this data could be stored directly in the DBMS. This covers only a subset of the current GCS use cases.

    • Maintenance won't scale as easily as with GCS. If GCS is coupled with good use of the DW, maintenance and costs tend to scale better. Subject to use-case analysis.

# Relational database

Google AlloyDB

  • Used for transactional workloads when a limited, fixed cluster cost is justified over BigQuery usage
  • Easier to maintain and replicate
  • Predictable cost
  • Appropriate transactional workload response times/latency
  • Still need to maintain indexes and keys

Alternatives

  • BigQuery
    • Higher latency
    • Prohibitive cost on high usage
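
The trade-off between a predictable cluster cost and usage-based pricing can be sketched as a simple break-even calculation. This is a minimal sketch, assuming a flat monthly cluster cost and a flat price per TB scanned; the function names are hypothetical and the numbers are placeholders, not real GCP prices.

```python
def break_even_tb_per_month(fixed_monthly_cost, price_per_tb_scanned):
    """Monthly TB scanned at which usage-based cost equals the fixed cost."""
    if fixed_monthly_cost < 0 or price_per_tb_scanned <= 0:
        raise ValueError("costs must be non-negative and the per-TB price positive")
    return fixed_monthly_cost / price_per_tb_scanned


def cheaper_option(tb_scanned_per_month, fixed_monthly_cost, price_per_tb_scanned):
    """Return which pricing model costs less for the given monthly volume."""
    usage_cost = tb_scanned_per_month * price_per_tb_scanned
    if fixed_monthly_cost < usage_cost:
        return "fixed-cost cluster"
    return "usage-based warehouse"


# Placeholder values, NOT real prices: 1200 per month fixed, 4 per TB scanned.
print(break_even_tb_per_month(1200.0, 4.0))
print(cheaper_option(500, 1200.0, 4.0))
print(cheaper_option(100, 1200.0, 4.0))
```

Real pricing has more dimensions (storage, replicas, reservations), so this only frames the question of when the fixed cluster becomes the cheaper choice.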

# DW / Acceleration engine

BigQuery

  • Appropriate for OLAP workloads
  • Low maintenance - e.g.: no indexes to manage
  • When paired with good design (partitioning, clustering), offers great performance
  • Storage is decoupled from compute - e.g.: no charge for idle compute time
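
To make "good design (partitioning, clustering)" concrete, here is a minimal sketch that generates a BigQuery `CREATE TABLE` statement partitioned by date and clustered by a few columns. The helper name, table name, and columns are hypothetical; only the `PARTITION BY` / `CLUSTER BY` DDL clauses are BigQuery's own.

```python
def partitioned_table_ddl(table, columns, partition_column, cluster_columns):
    """Return BigQuery DDL for a date-partitioned, clustered table.

    `columns` maps column name to BigQuery type; `partition_column`
    is assumed to be a TIMESTAMP column.
    """
    # BigQuery allows at most four clustering columns.
    if not 1 <= len(cluster_columns) <= 4:
        raise ValueError("BigQuery supports between 1 and 4 clustering columns")
    column_defs = ",\n  ".join(f"{name} {dtype}" for name, dtype in columns.items())
    return (
        f"CREATE TABLE `{table}` (\n  {column_defs}\n)\n"
        f"PARTITION BY DATE({partition_column})\n"
        f"CLUSTER BY {', '.join(cluster_columns)}"
    )


# Hypothetical table and columns, for illustration only.
ddl = partitioned_table_ddl(
    table="analytics.events",
    columns={"event_ts": "TIMESTAMP", "customer_id": "STRING", "event_type": "STRING"},
    partition_column="event_ts",
    cluster_columns=["customer_id", "event_type"],
)
print(ddl)
```

Queries that filter on the partition column and the clustering columns scan less data, which is where both the performance and the cost benefit come from.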

ClickHouse - experimental

  • Storage (DBMS) offering
  • Very fast; widely adopted DW technology outside of the cloud providers' own offerings

Alternatives

  • Dremio / Starburst (Trino)
    • Similar offerings for distributed SQL engine
    • Pay for compute instances reading from Object storage or other engines
    • Enterprise pricing $$
  • BI Engine
    • Pairs well with BigQuery for accelerating queries
    • Pay per memory capacity reserved
    • Queries can still fall back to regular BigQuery execution in a chaotic environment (many dashboards and users)
    • Low cost predictability
  • Databricks / Snowflake
    • Not only a DW; they also offer many other features on top of a lake
    • Enterprise pricing

# Reporting / Exploration

Add vertical cost control

Metabase Cloud

  • Managed instance
  • Can offer cac Looker Studio

Hex

  • Worth trying during PoC
  • Pay per editor only
  • Added value for ML

Alternatives

  • Thoughtspot
    • Enterprise pricing
  • Superset
  • Looker
    • Enterprise pricing
  • Tableau
    • Enterprise pricing
  • Power BI
    • Complex modeling overhead

# Orchestration

Cloud Composer

  • Best framework around for building dependency-aware data workloads
  • Unmatched community support
  • Publish DAGs via CI/CD integrated with GCS
  • Can use Dag Factory (https://astronomer.github.io/dag-factory) to declare DAGs as YAML
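
The idea of a declarative, dependency-aware workload can be sketched without Airflow itself. The example below is not the Airflow or Dag Factory API; it uses Python's standard `graphlib` to show how a declared task graph resolves into a valid execution order. The task names are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the tasks it depends on,
# similar in spirit to declaring dependencies per task in a YAML file.
pipeline = {
    "load_raw_to_gcs": [],
    "load_gcs_to_bigquery": ["load_raw_to_gcs"],
    "run_dbt_models": ["load_gcs_to_bigquery"],
    "refresh_dashboards": ["run_dbt_models"],
    "export_audit_log": ["load_raw_to_gcs"],
}


def execution_order(tasks):
    """Return one valid run order; raises graphlib.CycleError on cyclic dependencies."""
    return list(TopologicalSorter(tasks).static_order())


order = execution_order(pipeline)
print(order)
```

An orchestrator adds scheduling, retries, and parallel execution on top of this ordering, but the dependency declaration is the part that would live in version control and be published via CI/CD.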

n8n

  • Keep an eye on it; n8n could perhaps solve some problems

Alternatives

  • Astronomer
    • Managed Cloud-hosted Airflow
    • Better than Cloud Composer, but we are keeping Cloud Composer for its integration with the GCP account
  • Prefect
    • Simpler concept
    • Like Dagster, more dynamic
  • Kestra
    • YAML-defined transformations
  • Dagster
    • Like Airflow, but more data-model aware
    • Great for fine-grained data lineage

# Transformations

dbt Cloud

  • Same concept as now
  • Added interface for non-devs
  • Auto lineage docs
  • Supports many DWs / SQL-speaking databases

Semantic layer - how to use without dbt Cloud - pull from clients?

Alternatives

  • dbt Core
    • More dev friendly
    • Not so analyst-friendly
  • Dataform
    • Exclusive to BigQuery
    • Comparable to dbt
    • SQL + JavaScript for templating
    • Pay only for BigQuery
  • Hex
    • Also works as transformations engine
    • Not a first-class feature, though

# Governance

  • DataHub
    • OSS
    • Pricing not public

Alternatives

  • OpenMetadata
    • OSS
    • SaaS still not available
  • Atlan
    • Bad reputation
  • Alation
  • data.world