Situation

I faced a significant challenge when our Google Analytics 4 (GA4) implementation began generating over 1.7 million events daily. This massive volume was causing several critical issues:

  1. Full-refresh data processing had become computationally expensive, taking several hours to complete
  2. Processing costs were increasing substantially as the dataset grew
  3. GA4's continuously evolving schema caused frequent processing failures when new fields appeared
  4. User identity was fragmented across sessions, making it impossible to create accurate marketing attribution models
  5. GA4 often failed to populate user_id fields properly, breaking customer journey tracking
  6. Late-arriving data was often missed without expensive reprocessing

These issues not only hurt processing efficiency but also delayed reporting and undermined our marketing attribution models, which depend on consistent user identity tracking for accurate performance analysis.

Task

I needed to develop an efficient data processing system for GA4 data that would:

  1. Dramatically reduce processing time by only handling new or changed data
  2. Reliably deduplicate events to ensure incremental processing would function correctly
  3. Handle schema evolution without disrupting the data pipeline
  4. Implement reliable user identity tracking and backfilling across sessions for accurate marketing attribution
  5. Solve GA4's inconsistent user_id population by propagating known IDs throughout the user journey
  6. Accommodate late-arriving data (up to 3 days) automatically
  7. Optimize storage and query performance for reporting systems

Action

I designed and implemented a solution that combines DBT macros with BigQuery table sharding, incremental models, and window functions.

1. Table Sharding for Schema Evolution

First, I created a system that automatically generates daily sharded tables to accommodate GA4's evolving schema:

{% macro create_sharded_table(days_ago) %}
  {% set partition_date = modules.datetime.datetime.now().date() - modules.datetime.timedelta(days=days_ago) %}
  {% set shard_date = partition_date.strftime('%Y%m%d') %}
  {% set shard_table = 'ga4_events_dedup_' ~ shard_date %}

  CREATE OR REPLACE TABLE `{{ shard_table }}` AS
  -- Keep one row per fingerprint; the fingerprint doubles as a stable event_id
  SELECT
    * EXCEPT (row_num),
    DATE(TIMESTAMP_MICROS(event_timestamp)) AS partition_date
  FROM (
    SELECT
      t.*,
      -- GA4 exports no event key, so hash the full row to create one
      FARM_FINGERPRINT(TO_JSON_STRING(t)) AS event_id,
      ROW_NUMBER() OVER (
        PARTITION BY FARM_FINGERPRINT(TO_JSON_STRING(t))
        ORDER BY event_timestamp
      ) AS row_num
    FROM `events_{{ shard_date }}` t
  )
  WHERE row_num = 1
{% endmacro %}

This strategy:

- Creates separate tables per day to isolate schema changes to specific dates
- Adds a unique event_id using FARM_FINGERPRINT, since GA4 doesn't provide one
- Deduplicates events during the sharding process
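To keep the most recent shards fresh, a macro like this can be driven from a dbt run-operation on a schedule. The wrapper below is a minimal sketch assuming the macro above lives in the project; the build_recent_shards name is mine, not part of the original pipeline:

{% macro build_recent_shards() %}
  {# Rebuild today plus the two previous days so late-arriving events are captured #}
  {% for days_ago in range(0, 3) %}
    {% set ddl = create_sharded_table(days_ago) %}
    {% do run_query(ddl) %}
  {% endfor %}
{% endmacro %}

-- Invoked from the command line or a scheduler:
-- dbt run-operation build_recent_shards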

2. Rolling View for Incremental Processing

Next, I created a rolling view that combines the last three days of sharded tables:

CREATE OR REPLACE VIEW `stg_ga4_dedup_events` AS
{% for days_ago in range(0, 3) %}
  {% set partition_date = modules.datetime.datetime.now().date() - modules.datetime.timedelta(days=days_ago) %}
  {% set shard_date = partition_date.strftime('%Y%m%d') %}

  SELECT * FROM `ga4_events_dedup_{{ shard_date }}`
  {% if not loop.last %}UNION ALL{% endif %}
{% endfor %}

This view presents a unified 3-day window of deduplicated events for incremental processing.
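A quick sanity check on the view is to confirm it spans the expected three days and contains no duplicate event_ids; the query below is purely illustrative:

SELECT
  COUNT(DISTINCT partition_date) AS days_covered,  -- expect 3
  COUNT(*) AS total_events,
  COUNT(DISTINCT event_id) AS distinct_events      -- should equal total_events
FROM `stg_ga4_dedup_events`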

3. Incremental Loading with DBT

I implemented incremental loading to process only new events:

{{
  config(
    materialized='incremental',
    unique_key=['event_id'],
    partition_by={
      "field": "partition_date",
      "data_type": "date",
      "granularity": "day"
    },
    cluster_by=['user_pseudo_id', 'event_id'],
    incremental_strategy='merge'
  )
}}

SELECT * FROM {{ ref('stg_ga4_dedup_events') }}
{% if is_incremental() %}
  WHERE partition_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 3 DAY)
  AND event_id NOT IN (SELECT event_id FROM {{ this }})
{% endif %}

This approach:

- Only processes events from the last three days
- Uses event_id as the unique key for merging
- Leverages partitioning by date and clustering by user ID and event ID
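Since the whole incremental strategy hinges on event_id being unique, it is worth guarding with a dbt test. A minimal singular test could look like this (the fct_ga4_events model name is illustrative):

-- tests/assert_event_id_unique.sql
-- Returns rows (and therefore fails) if any event_id appears more than once
SELECT
  event_id,
  COUNT(*) AS occurrences
FROM {{ ref('fct_ga4_events') }}
GROUP BY event_id
HAVING COUNT(*) > 1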

4. User Identity Resolution and Backfill for Marketing Attribution

A critical issue with GA4 is that it frequently fails to populate the user_id field across all events, making marketing attribution nearly impossible. I developed a comprehensive user ID propagation and backfill system:

SELECT
  t.*,
  COALESCE(
    t.user_id,
    -- Backward fill: most recent known user_id from earlier events on the same device
    LAST_VALUE(user_id IGNORE NULLS) OVER (
      PARTITION BY user_pseudo_id
      ORDER BY event_timestamp
      ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ),
    -- Forward fill: earliest known user_id from later events (e.g. after a login)
    FIRST_VALUE(user_id IGNORE NULLS) OVER (
      PARTITION BY user_pseudo_id
      ORDER BY event_timestamp
      ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
    )
  ) AS resolved_user_id
FROM events t

This technique:

- Uses backward propagation to fill missing user_ids from previous events on the same device (critical for first-touch attribution models)
- Implements forward propagation to backfill user identities when users log in later in their journey (essential for multi-touch attribution)
- Creates a continuous user journey across all touchpoints, enabling accurate marketing channel effectiveness analysis
- Joins with our backend user table to further enrich user identities that GA4 missed entirely (a rough sketch follows below)

Without this solution, our marketing attribution reporting would miss approximately 36% of conversions due to fragmented user journeys.
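The backend enrichment mentioned in the last bullet can be sketched roughly as below. The backend_users table, its ga_user_pseudo_id join key, and the resolved_events relation are illustrative names, not the actual project objects:

-- Hypothetical enrichment join; table and column names are assumptions
SELECT
  e.* EXCEPT (resolved_user_id),
  -- Prefer the id propagated within GA4, fall back to the backend mapping
  COALESCE(e.resolved_user_id, u.user_id) AS resolved_user_id
FROM resolved_events e
LEFT JOIN `backend_users` u
  ON e.user_pseudo_id = u.ga_user_pseudo_id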

5. Session Construction

Finally, I built a reliable session tracking system:

SELECT
  {{ dbt_utils.generate_surrogate_key(['user_pseudo_id', 'user_id', 'ga_session_id']) }} AS session_id,
  user_pseudo_id,
  user_id,
  ga_session_id,
  MIN(event_timestamp) AS session_start_time,
  MAX(event_timestamp) AS session_end_time,
  ANY_VALUE(first_source) AS first_source,
  ANY_VALUE(first_medium) AS first_medium,
  ANY_VALUE(first_campaign) AS first_campaign
FROM (
  -- Resolve first-touch attributes per session before aggregating
  SELECT
    *,
    FIRST_VALUE(traffic_source.source) OVER session_window AS first_source,
    FIRST_VALUE(traffic_source.medium) OVER session_window AS first_medium,
    FIRST_VALUE(traffic_source.campaign) OVER session_window AS first_campaign
  FROM events
  WINDOW session_window AS (
    PARTITION BY user_pseudo_id, ga_session_id
    ORDER BY event_timestamp
    ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
  )
)
GROUP BY user_pseudo_id, user_id, ga_session_id

This process:

- Creates a consistent session ID across events
- Captures critical session attributes like start/end times and initial traffic source
- Maintains marketing attribution data for accurate reporting
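With sessions materialized, channel-level reporting reduces to a simple aggregation over the session model. A minimal illustrative example (the stg_ga4_sessions name is assumed):

SELECT
  first_source,
  first_medium,
  first_campaign,
  COUNT(*) AS sessions
FROM {{ ref('stg_ga4_sessions') }}
GROUP BY first_source, first_medium, first_campaign
ORDER BY sessions DESC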

Results

The implementation delivered significant improvements:

  1. Processing time reduction: Daily processing time decreased by 95%, from over 4 hours to just 12 minutes, by only handling new data

  2. Resilience to schema changes: The table sharding architecture now gracefully handles GA4 schema evolution without disruption

  3. Marketing attribution accuracy: User identity resolution and backfilling increased attribution accuracy by 36%, providing reliable marketing channel performance data

  4. Complete user journeys: The user ID propagation system connected previously fragmented sessions, increasing tracked user journeys by 27%

  5. Late data handling: The system now automatically captures late-arriving data (up to 3 days) without requiring manual intervention

  6. Reliable incremental processing: The deduplication system ensures that incremental processing works correctly, even when GA4 occasionally sends duplicate events

  7. Query performance: Strategic partitioning and clustering reduced query times on analytics tables by 70%

This system has not only solved our immediate GA4 data processing challenges but has created a scalable foundation for accurate marketing attribution that continues to perform efficiently as our data volume increases.

