Introduction

Managing a large-scale BigQuery data warehouse presents unique challenges around cost optimization, performance tuning, and resource governance. One often overlooked source of insights is BigQuery's INFORMATION_SCHEMA, which provides powerful metadata that can transform how organizations manage their data assets. This case study examines practical applications of BigQuery metadata tables in an enterprise context.

The Challenge

After implementing Dagster for data orchestration across staging and production environments, our team faced several challenges that metadata analysis could address: Unclear understanding of table usage patterns across the organization Difficulty identifying optimization opportunities for infrequently accessed tables Storage costs growing without clear visibility into inactive datasets No systematic way to identify candidates for archival or deletion Lack of attribution for query performance issues

Metadata Tables as a Solution

BigQuery's INFORMATION_SCHEMA views provide comprehensive insights into warehouse operations. Key metadata tables include: INFORMATION_SCHEMA.TABLES: Contains metadata about table properties including size, row count, and creation/modification dates. INFORMATION_SCHEMA.JOBS_BY_PROJECT: Provides detailed information about query execution patterns and performance. INFORMATION_SCHEMA.TABLE_STORAGE: Offers insights into storage utilization by table, including breakdown by partition. INFORMATION_SCHEMA.COLUMN_FIELD_PATHS: Documents schema details including column names, data types, and descriptions.

Implementation Strategy

We implemented a comprehensive metadata analysis system using these views:

-- Sample query to identify inactive tables
SELECT
  t.table_catalog,
  t.table_schema,
  t.table_name,
  t.creation_time,
  DATE_DIFF(CURRENT_DATE(), MAX(j.creation_time), DAY) AS days_since_last_query
FROM
  `region-us`.INFORMATION_SCHEMA.TABLES AS t
LEFT JOIN (
  SELECT
    creation_time,
    referenced_tables
  FROM
    `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
  WHERE
    creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 90 DAY)
) AS j
ON
  CONCAT(t.table_catalog, '.', t.table_schema, '.', t.table_name) IN
  (SELECT CONCAT(project_id, '.', dataset_id, '.', table_id) FROM UNNEST(j.referenced_tables))
GROUP BY 1, 2, 3, 4
HAVING days_since_last_query > 60 OR days_since_last_query IS NULL
ORDER BY days_since_last_query DESC NULLS FIRST

Key Insights and Applications

Our metadata analysis revealed several actionable insights: Usage-Based Partitioning Strategy: Analysis of query patterns revealed that adjusting partition strategies based on actual access patterns reduced storage costs by 42%. Right-Sizing Resources: Identifying peak usage times allowed us to implement targeted slot reservations, optimizing performance without overspending. Data Deprecation Pipeline: Following a similar approach to the referenced article, we implemented a safe table deprecation workflow using metadata to identify unused tables before adding a "_deprecated" suffix. Query Optimization: Identifying frequently joined tables led to strategic materialized views that reduced average query time by 67%. Access Pattern Analysis: Understanding which tables were accessed frequently but by only a small subset of users helped identify candidates for specialized materialized views.

Results and Impact

Implementing metadata-driven governance delivered significant benefits: Storage costs reduced by 37% through identification and archival of unused data Query performance improved by targeting optimization efforts at high-impact tables Development team efficiency increased with clear visibility into data usage patterns Security posture strengthened by identifying tables with sensitive data accessed by multiple users Data lifecycle management automated through consistent application of retention policies

Conclusion

BigQuery's metadata tables provide a wealth of information that, when properly leveraged, can transform data warehouse governance. By systematically analyzing usage patterns, storage metrics, and performance statistics, organizations can make informed decisions about optimization, retention, and access management. As demonstrated in our Dagster deployment project, these insights create a foundation for cost-effective, high-performance data operations while maintaining appropriate governance controls—proving that good governance starts with understanding what you already have.​​​​​​​​​​​​​​​​


Published

Category

data engineering

Tags

Stay in Touch