# Object Storage
Object storage is a highly scalable, flat-architecture storage model suited for massive volumes of unstructured data like images, backups, and sensor data, and also for data lake storage using files like CSV and Parquet. Data is stored as individual "objects" in a single flat pool (buckets or containers), each containing the data, metadata, and a unique identifier.
In our architecture, Google Cloud Storage plays this role.
# Object storage usage fit
Because bucket storage is decoupled from compute resources (CPU, RAM), pricing is based on data stored, which makes it cost-effective to keep large amounts of data.
Because data lake technologies typically leverage object storage as their underlying data store, we can take advantage of transient processing frameworks, such as BigQuery where we pay for data scanned when processing a query, or Spark, where compute instances are allocated and charged only for the time they spend processing data.
Key takeaways so far:
- Files are objects in an object storage
- And they're stored in a flat architecture - not hierarchical
- Each object has a unique identifier associated with it
- Each object contains metadata associated with it
- This metadata can be leveraged by data maintainers to record important information such as source system, ingestion timestamp, retention and lifecycle policies, and lineage identifiers
- Object storage is excellent for storing data that you don't want to process immediately, since you don't pay for instance-hours.
- Transient processing frameworks can be leveraged to process data at rest.
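As a sketch of the metadata idea above, a bronze object's custom metadata could carry keys like these (the key names here are illustrative assumptions, not an established standard; align them with the data governance tool):

```python
from datetime import datetime, timezone

def build_object_metadata(source_system: str, lineage_id: str,
                          retention_days: int) -> dict:
    """Assemble custom metadata for a bronze-tier object.

    Key names are illustrative; object metadata values must be strings.
    """
    return {
        "source-system": source_system,
        "ingestion-timestamp": datetime.now(timezone.utc).isoformat(),
        "retention-days": str(retention_days),
        "lineage-id": lineage_id,
    }

meta = build_object_metadata("infobip-whatsapp", "run-2024-001", 90)
print(sorted(meta))
```

A dictionary like this can be attached to an object at upload time so that downstream maintainers can answer ownership and lineage questions without opening the file.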
# Bucket naming and other conventions
Because the number of buckets in an organization like UME, with many systems and departments, can be very large, it is important to establish a naming convention that helps maintainers spend less time figuring out the ownership and destination of data. This saves people time and some visits to the data governance tool.
This naming convention will help mostly on the bronze data tier (see the Medallion architecture section below).
For the bronze area, one suggestion is to use a few keys such as:
- department - owning department at UME
- environment - prod for production, dev for development
- source organization - the organization that originates this data - for bronze
- system or datamart - the (sub)system that originates this data - for bronze
- tier - medallion tier: bronze, silver, or gold. In most cases buckets will be used for bronze only.
{department}-{environment}-{source organization}-{system or datamart}-{tier}
Examples:
- credit-dev-ume-transactional-bronze
- cs-prod-infobip-whatsapp-bronze
- hr-prod-ume-employees-bronze
- marketing-dev-meta-fbads-bronze
- risk-prod-serasa-fraud-bronze
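The bronze pattern above can be encoded in a small helper. This is a sketch of the convention, assuming each key is a single lowercase alphanumeric segment so the hyphen stays an unambiguous separator:

```python
import re

# Each key in the convention becomes one hyphen-separated segment.
_SEGMENT = re.compile(r"^[a-z0-9]+$")

def bronze_bucket_name(department: str, environment: str,
                       source_org: str, system: str) -> str:
    """Build a bucket name following
    {department}-{environment}-{source organization}-{system or datamart}-{tier},
    with the tier fixed to bronze.
    """
    parts = [department, environment, source_org, system, "bronze"]
    for part in parts:
        if not _SEGMENT.match(part):
            raise ValueError(f"invalid segment: {part!r}")
    return "-".join(parts)

print(bronze_bucket_name("cs", "prod", "infobip", "whatsapp"))
```

A helper like this can run in ingestion tooling or CI to reject buckets that drift from the convention before they are created.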
If we need intermediate object storage for the silver and gold tiers as well, a similar rule applies. There are some reasons why we would want silver or gold data to also be on buckets, such as:
- We are using another processing framework such as Spark
- We are exporting golden data into another serving layer
In those cases, the naming rule could be:
{department}-{environment}-{system or datamart}-{tier}
Examples:
- risk-prod-default-silver
- risk-prod-fraud-models-gold
- credit-prod-scoring-silver
- credit-dev-analytics-gold
- marketing-prod-campaigns-silver
- finance-prod-consolidated-payments-gold
Because most intermediate data is transient, consider applying the lifecycle policies described below to keep storage costs under control.
# Security considerations
- Encryption - Use encryption at rest. Always. Enforce through organizational policy at Google Cloud.
- Access policies - Apply least-privilege IAM controls at bucket and object level. Use groups rather than individual users. Avoid broad project-level permissions.
- Have policy templates and managed policies to handle access controls.
- Audit trail - Enable access and admin logs for compliance and security reviews. Retain according to company policy.
- Credentials - Do not rely on long-lived credentials. Use workload identity whenever possible.
- Track all IAM credentials and their usage, and enforce narrower policies to reduce or eliminate that usage.
- Prefer identity federation over shared service accounts to enable per-user audit trails.
- PII detection - All data stored in buckets should be scanned for Personally Identifiable Information. Establish scheduled scans and alerting for new or modified objects containing sensitive data. See Data Catalog for more information on PII classification and governance.
- Object versioning - Enable versioning on buckets that store critical or source-of-truth data. This allows recovery from accidental overwrites or deletions and supports point-in-time analysis when reprocessing historical data through the lakehouse pipeline.
# File format guidelines
When initiating file storage in object storage for analytics purposes, prefer compressed, typed storage formats:
- Parquet - Columnar, compressed, schema-aware. Best for analytical workloads.
- ORC - Optimized for large scans and heavy aggregations.
- Avro - Row-based with schema registry support. Good for schema evolution scenarios.
Eventually there will be no option to store flat uncompressed files such as JSONL or CSV, or large numbers of small files.
The problems with these are:
- They take more storage - we pay for data size - and more data transfer.
- They aren't typed, so they add parsing overhead when deserializing.
- Many files increase data scan overhead and lower performance.
If only csv or json - for instance - is available from source, convert early in the pipeline to improve downstream performance and costs.
In this case, do small batch transformations. Make them idempotent. Strictly control what has been transformed into the target format - e.g. from JSONL to Parquet.
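One way to keep such conversions idempotent, as a sketch: record which source objects have already been converted in a small manifest and skip them on reruns. The manifest format and the caller-supplied `convert` step are assumptions for illustration:

```python
import json
from pathlib import Path

def convert_new_files(source_files, manifest_path: Path, convert) -> list:
    """Run `convert` only on files not yet recorded in the manifest.

    `convert` is a caller-supplied function (e.g. JSONL -> Parquet);
    the JSON manifest of completed files makes reruns idempotent.
    """
    done = set(json.loads(manifest_path.read_text())) if manifest_path.exists() else set()
    converted = []
    for src in source_files:
        if src in done:
            continue  # already converted in a previous batch
        convert(src)
        done.add(src)
        converted.append(src)
    manifest_path.write_text(json.dumps(sorted(done)))
    return converted
```

In practice the manifest would itself live in a bucket or a metadata table, but the control logic - convert, record, skip on rerun - stays the same.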
Never source or query a BigQuery external table from flat files, or from a large number of source files, if the purpose is reporting. Under the supervision of the data team, BigQuery can be used as an incremental ETL engine to optimize this storage scenario.
# Medallion architecture
Data should be organized into tiers based on its maturity and quality level:
- Bronze Layer - Raw data preservation. Stores data exactly as received from source systems, with minimal or no transformation. This ensures auditability, ability to reprocess, and easier debugging.
- Silver Layer - Cleaned and validated data. Contains data that has been cleaned (null handling, deduplication, type corrections), validated (passed quality checks), and conformed (standardized naming, consistent formats).
- Gold Layer - Business-ready, curated datasets. Provides consumption-ready data with aggregations, business logic applied, and proper governance assigned.
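As a minimal illustration of the silver-layer steps (null handling, deduplication, type corrections), assuming bronze rows arrive as dicts - the `id` and `amount` field names are hypothetical:

```python
def to_silver(rows, key: str):
    """Sketch of bronze -> silver cleaning: drop rows missing the key,
    coerce amounts to float, and deduplicate on the key (last record wins)."""
    cleaned = {}
    for row in rows:
        if row.get(key) is None:
            continue  # null handling: drop rows without an identifier
        row = dict(row)
        if "amount" in row:
            row["amount"] = float(row["amount"])  # type correction
        cleaned[row[key]] = row  # dedup: later records overwrite earlier ones
    return list(cleaned.values())

bronze = [
    {"id": 1, "amount": "10.5"},
    {"id": None, "amount": "3"},
    {"id": 1, "amount": "11.0"},  # duplicate id, newer record
]
print(to_silver(bronze, "id"))
```

A real pipeline would do this in Spark or BigQuery, but the three operations - filter nulls, fix types, deduplicate - are the same ones listed above.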
# Lifecycle policies
Define retention periods based on business value, compliance requirements, and cost considerations. Configure lifecycle rules to automatically:
- Move data to colder storage classes after defined periods (Standard → Nearline → Coldline → Archive)
- Delete temporary or expired data
- Clean up incomplete uploads
Example policy:
- Bronze data → Nearline after 90 days
- Staging data → Delete after 30 days
- Archive data → Coldline after 1 year
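For a bronze bucket, the first two rules above could be expressed as a GCS lifecycle configuration along these lines (a sketch; the `staging/` prefix is an assumption, and the exact rules depend on the bucket's contents):

```json
{
  "lifecycle": {
    "rule": [
      {
        "action": {"type": "SetStorageClass", "storageClass": "NEARLINE"},
        "condition": {"age": 90}
      },
      {
        "action": {"type": "Delete"},
        "condition": {"age": 30, "matchesPrefix": ["staging/"]}
      }
    ]
  }
}
```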
There is no one-size-fits-all rule to apply to projects. The suggested practice is:
- On project/data onboarding, whiteboard the expected data lineage and structure - sources, transformations, buckets, etc.
- For each bucket or source data, document data retention policies.
- Revisit data retention policies every quarter.
- Use the data catalog tool to store documentation.
- Have documentation freshness alerts. How often do we want data stewards to validate information, such as security information for our important cloud accounts?
# Tasks
- Wrap up bucket naming rules with the teams.
- Add access auditing controls (reporting, alerts) to operational suggestions.
- Add long-lived credentials controls (reporting, alerts) to operational suggestions.