GA4 Consent Modeling in BigQuery with the DBT-GA4 Package

GA4 Consent Modeling in BigQuery with the DBT-GA4 Package

While reviewing an issue with null event keys in the dbt-GA4 package, it occurred to me that one of the potential fixes would also enable modeling of non-consented data generated by Google’s consent mode.

We decided to not include this by default in the dbt-GA4 package, you can read our discussion here, but it’s still something that many people would like to do so let’s look at how to customize the dbt-GA4 package to mimic GA4’s consent modeling in BigQuery.

One of Google Analytics 4’s improvements over UA is the introduction of behavioral modeling for consent mode.

Behavioral modeling for consent mode takes the ID-less pings that GA4 sends when consent is denied, combines them with consented data to model the missing data.

Before going on, please be careful with consent mode. There are many people in the analytics community who doubt the legality of consent mode. Many of the applicable privacy laws are concerned with personal data and it seems unlikely that removing user and client IDs makes your browsing data not personal anymore.

Please be sure that any organization that you are working with is comfortable with the potential issues with consent mode before enabling it.

With that out of the way, lets replicate GA4’s behavioral modeling for consent mode and use the consent-denied pings to model the missing traffic in BigQuery using the dbt-GA4 package.

The concepts covered here are fairly advanced. The free, dbt-GA4 Setup course covers a basic implementation up to your basic dimensional models. This post builds on aggregate models that are covered in Advanced dbt-GA4.

Calculating Modeled Values Using Non-Consented Pings

We don’t need any complex machine learning algorithms. The math is really quite simple.

We want to divide the metric by the number of consented rows that the metric was calculated from to get the metric’s value per row, then we multiply that by the number of non-consented rows, and add the initial metric value.

consented_metric_value 
/ consented rows 
x non-consented_rows 
+ consented_metric_value

This calculation works for any counting metric with any number of dimensions added. And it has the bonus of being really simple.

Percentages and rates should be the same modeled or un-modeled so there’s no need to model them.

This calculation will work for session metrics but you have to plug in some different values because all non-consented sessions get the same session_key so the number of non-consented rows, when plugging in a session table, will always equal one.

This calculation does not work with your fact and dimension models. It requires that you run an aggregation of some sort which you would normally do down-stream of your fact and dimension models.

You could alternatively use this calculation in your data visualization platform, but we’re going to keep on with dbt and BigQuery.

Preparing the Data

The first thing that we need to do is to identify the non-consented rows. We’re going to do this by modifying the session_key which is defined in stg_ga4__events.

When the session_key is null, it is often because of consent mode, but not always. When user_pseudo_id is null, then it is consent mode at work. When ga_session_id is null, then it is an audience trigger firing.

We’re going to modify our session_key generation in order to identify both of these situations. It is a data modeling best-practice to replace null values with descriptive replacements.

Copy the stg_ga4__events file from the package to a folder under your dbt project’s models folder and give it the same name.

Replace the include_session_key common table expression (CTE) with this code:

-- Add key for sessions. session_key will be null if client_key is null due to consent being denied. ga_session_id may be null during audience trigger events.

include_session_key as (
    select
        *
        , case
            when client_key is null then 'consent_denied'
            when session_id is null then 'audience_trigger_null_session_id'
            else to_base64(md5(CONCAT(client_key, CAST(session_id as STRING))))
        end as session_key
from include_client_key
),

Here we are using a case statement to replace the null inputs to the session_key with descriptive values.

The user_pseudo_id is not a direct input to session_key but it is an input to client_key and a null user_pseudo_id will result in a null client_key so we use client_key to detect whether consent has been denied.

We could look up the user_pseudo_id field, but since we’re already using client_key in the else statement, else to_base64(md5(CONCAT(client_key, CAST(session_id as STRING)))), we process less data by working off of the client_key which is already in use.

You’ll need to full-refresh stg_ga4__events and all of its descendants to populate the new ID values.

Model Counting Metrics Macro

We’ve already gone over the math for the model counting metrics macro. All that remains is to convert the math to BigQuery SQL and dbt Jinja.

Here’s the model_counting_metrics macro that I saved to my project’s macros folder.

{%- macro model_counting_metrics(consented_metrics_value, consented_rows, non_consented_rows) -%}
    safe_divide(sum({{consented_metrics_value}}) , count({{consented_rows}}) )
    * ifnull(count({{non_consented_rows}}), 0)
    + {{consented_metrics_value}}
{%- endmacro -%}

We’re using dbt’s Jinja parser to populate the macro with the various inputs.

We use safe_divide to prevent any divide by zero shenanigans.

We need the ifnull(count({{non_consented_rows}}), 0) function for cases where there are no non-consented rows. If we don’t force this to be zero, then we’ll get strange results.

Otherwise, the macro is pretty straight-forward.

Modeling Your Metrics

Now we can put this macro to use.

Here you can see a model named agg_ga4__page_location.

I often replace the default fct_ga4__pages model with this model because I like to prefix aggregations agg_ instead of fct_, I don’t care for the scroll metrics in the default table, and building off of your relevant event fact tables, if you’ve created them, is less expensive than building off of the event staging tables.

{% set partitions_to_replace = ['current_date'] %}
{% for i in range(var('static_incremental_days')) %}
    {% set partitions_to_replace = partitions_to_replace.append('date_sub(current_date, interval ' + (i+1)|string + ' day)') %}
{% endfor %}

{{
    config(
        materialized = 'incremental',
        incremental_strategy = 'insert_overwrite',
        tags = ["incremental"],
        partition_by={
            "field": "date",
            "data_type": "date",
            "granularity": "day"
        },
        partitions = partitions_to_replace
    )
}}

with pv as (
    select
        page_location
        , event_date_dt as date
        , count(*) as page_views
        , sum(
            case when session_key != 'consent_denied' then 1 else 0 end
        ) as consented_page_views
        , sum(
            case when session_key = 'consent_denied' then 1 else 0 end
        ) as non_consented_page_views
        , count(distinct session_key) as sessions
        , count(distinct
            case
                when user_id is null then session_key
                else user_id 
            end
        ) as users
        , sum(event_value_in_usd) as page_value
    from {{ref('fct_ga4__event_page_view')}}
    {% if is_incremental() %}
        where event_date_dt in ({{ partitions_to_replace | join(',') }})
    {% endif %}
    group by 1,2
)
, modeled_values as (
    select
        page_location
        , date
        , page_views
        , sessions
        , {{model_counting_metrics('sessions', 'page_views', 'non_consented_page_views')}} as modeled_sessions
        , users
        , {{model_counting_metrics('users', 'page_views', 'non_consented_page_views')}} as modeled_users
        , page_value
        , {{model_counting_metrics('page_value', 'page_views', 'non_consented_page_views')}} as modeled_page_value
    from pv
    group by 1,2,3,4,6,8
)
select * from modeled_values

Let’s break this down.

{% set partitions_to_replace = ['current_date'] %}
{% for i in range(var('static_incremental_days')) %}
    {% set partitions_to_replace = partitions_to_replace.append('date_sub(current_date, interval ' + (i+1)|string + ' day)') %}
{% endfor %}

This is the standard start for any dbt-GA4 partitioned model. We set use static_incremental_days to define partitions_to_replace so that we can plug that in to our config block and when we do query pruning as seen below.

{% if is_incremental() %}
    where event_date_dt in ({{ partitions_to_replace | join(',') }})
{% endif %}

The config block is pretty standard for a partitioned incremental model in BigQuery. This page location model will contain page data by date by page_location. So the number of rows will be roughly equal to the number of pages on the site.

{{
    config(
        materialized = 'incremental',
        incremental_strategy = 'insert_overwrite',
        tags = ["incremental"],
        partition_by={
            "field": "date",
            "data_type": "date",
            "granularity": "day"
        },
        partitions = partitions_to_replace
    )
}}

Typically, you want tens of thousands of rows in a partition before you start partitioning. The site I’m working on is smaller than that. I could change to a non-partitioned incremental model, but it’s not that big of a deal.

Next, we get the data that we want from fct_ga4__event_page_view.

with pv as (
    select
        page_location
        , event_date_dt as date
        , count(*) as page_views
        , sum(
            case when session_key != 'consent_denied' then 1 else 0 end
        ) as consented_page_views
        , sum(
            case when session_key = 'consent_denied' then 1 else 0 end
        ) as non_consented_page_views
        , count(distinct session_key) as sessions
        , count(distinct
            case
                when user_id is null then session_key
                else user_id 
            end
        ) as users
        , sum(event_value_in_usd) as page_value
    from {{ref('fct_ga4__event_page_view')}}
    {% if is_incremental() %}
        where event_date_dt in ({{ partitions_to_replace | join(',') }})
    {% endif %}
    group by 1,2
)

Our two main dimensions are page_location and date, so we get those right away and group by those fields later.

Then we calculate our metrics.

Since the fct_ga4__event_page_view model contains one row per page_view event and includes non-consented rows, then we don’t need to model the number of page views; we can just count the number of rows to get that metric.

, count(*) as page_views

Our macro takes consented_rows and non_consented_rows as inputs, so we need to calculate those.

, sum(
    case when session_key != 'consent_denied' then 1 else 0 end) as consented_page_views
, sum(
    case when session_key = 'consent_denied' then 1 else 0 end) as non_consented_page_views

Remember that we set the session_key to consent_denied when the client_key, as a result of a null user_pseudo_id, is null.

Finally, we need the counting metrics that we want to model.

, count(distinct session_key) as sessions
, count(distinct
    case
        when user_id is null then session_key
        else user_id 
    end
) as users
, sum(event_value_in_usd) as page_value

In this case, we’re getting sessions, users, and page value, but you can choose whatever you like.

Finally, we model create our modeled metrics.

, modeled_values as (
    select
        page_location
        , date
        , page_views
        , sessions
        , {{model_counting_metrics('sessions', 'page_views', 'non_consented_page_views')}} as modeled_sessions
        , users
        , {{model_counting_metrics('users', 'page_views', 'non_consented_page_views')}} as modeled_users
        , page_value
        , {{model_counting_metrics('page_value', 'page_views', 'non_consented_page_views')}} as modeled_page_value
    from pv
    group by 1,2,3,4,6,8
)

This is a simple example, but you can apply this to more complex aggregated models easily enough.

I do think it is probably a good idea to show both the modeled and raw metrics like I did in this sample. I’ve fielded too many why don’t these metrics match questions where the answer was that you’re looking at different metrics to not prefix any modeled data with modeled_.

Session Metrics

If we’re aggregating the session tables, all of the non-consented data will be found in one row with a session_key of consent_denied rather than one row per session with consent denied.

This row will have just a single session start timestamp, but hugely inflated page views, conversions, engagement time, and other metrics.

We then need to then do two things:

  1. Remove the non-consented data from our session tables to not throw off our session metrics too badly with a single, mega-session
  2. Use the ratio of consented to non-consented rows for all events to model session metrics

Removing Non-Consented Data from Session Tables

We have a few session_partition_key is not null checks that handled this before, but those don’t work now that we are assigning session_key values when the client_key or session_id is null.

We’ll need to copy a few session models from the dbt-GA4 package to our project, disable the package models, and update the project models.

stg_ga4__sessions_traffic_sources_daily.sql and stg_ga4__sessions_traffic_sources.sql

Filter out the new session_key values.

left join {{ref('ga4_source_categories')}} source_categories on events.event_source = source_categories.source
where session_partition_key is not null
and not contains_substr(session_partition_key, 'consent_denied')
and not contains_substr(session_partition_key, 'audience_trigger_null_session_id')

I use contains_substr(session_partition_key instead of checking session_key because we are already checking if the session_partition_key is not null so BigQuery caching shouldn’t cause any extra data look-ups with the extra conditions but it is a tiny optimization.

dbt_project.yml

Disable the package models.

models:
  ga4:
    staging:
      stg_ga4__sessions_traffic_sources_daily:
        +enabled: false
      stg_ga4__sessions_traffic_sources:
        +enabled: false

Model session metrics

Now that we have excluded the new session keys from our sessions table, we are ready to calculate the ratio of consented to non-consented events and plug those values along with session metrics in to our model_counting_metrcis macro.

We could calculate the ratio every time we need it, but that is wasteful as we need to scan the full events partition every single time we use it.

Instead, we’re going to create a table to hold these calculations so we don’t have to scan so much data.

In your staging folder, create a model named stg_ga4__consent_counts:


{% set partitions_to_replace = ['current_date'] %}
{% for i in range(var('static_incremental_days')) %}
    {% set partitions_to_replace = partitions_to_replace.append('date_sub(current_date, interval ' + (i+1)|string + ' day)') %}
{% endfor %}
{{
    config(
        materialized='table'
    )
}}

select
    event_date_dt
    , sum(
        case when session_key = 'consent_denied' then 1 else 0 end
    ) as consent_denied_events
    , sum(
        case when session_key != 'consent_denied' then 1 else 0 end
    ) as consented_events
from {{ref('stg_ga4__events')}}
{%- if is_incremental() -%}
    where event_date_dt in ({{ partitions_to_replace | join(',') }})
{%- endif -%}
group by 1

We create this model as a table rather than a view because we want to prevent our macro from scanning the entire base_ga4__events partition.

First, we set partitions_to_replace in order to use it for query pruning later.

{% set partitions_to_replace = ['current_date'] %}
{% for i in range(var('static_incremental_days')) %}
    {% set partitions_to_replace = partitions_to_replace.append('date_sub(current_date, interval ' + (i+1)|string + ' day)') %}
{% endfor %}

Then, we have our config block.

{{
    config(
        materialized='table'
    )
}}

With this set up as a table without incremental updates, we completely drop the previous table and build the table again with the new settings.

Our query just needs to get the daily counts of consented and non-consented events by day so we can calculate that ratio in our macro.

select
    event_date_dt
    , sum(
        case when session_key = 'consent_denied' then 1 else 0 end
    ) as consent_denied_events
    , sum(
        case when session_key != 'consent_denied' then 1 else 0 end
    ) as consented_events
from {{ref('stg_ga4__events')}}
{%- if is_incremental() -%}
    where event_date_dt in ({{ partitions_to_replace | join(',') }})
{%- endif -%}
group by 1

We use partitions_to_replace for query pruning and group by event_date_dt.

{%- if is_incremental() -%}
    where event_date_dt in ({{ partitions_to_replace | join(',') }})
{%- endif -%}
group by 1

We don’t need to incrementally update this table if all of the aggregate models where we call this macro are incremental.

This is because if we’re updating the last three days and today, with static_incremental_days set to 3, then the incremental models only need reference the last three days of data so it is okay that we drop the old version of the table and rebuild with only four rows for today and the last three days.

If we had an aggregate model that was not incremental, then we would need the ratio for every day in our warehouse and the stg_ga4__consent_counts model would only have today and the last three days.

Next, we need to create a model that aggregates session metrics and apply the macro.

I’m going to create an aggregated attribution model, agg_ga4__last_non_direct_attribution, in my marts folder that holds channel, source, medium, and campaign dimensions with various daily metrics.

-- not a partitioned model, but using partitions to replace for query pruning
{% set partitions_to_replace = ["current_date"] %}
{% for i in range(var("static_incremental_days", 1)) %}
    {% set partitions_to_replace = partitions_to_replace.append( "date_sub(current_date, interval " + (i + 1) | string + " day)") %}
{% endfor %}

{{config(
    materialized='incremental',
    unique_key=['session_partition_date', 'last_non_direct_default_channel_grouping', 'last_non_direct_source', 'last_non_direct_medium', 'last_non_direct_campaign']
)}}

with cnsent as (
    select
        *
    from {{ref('stg_ga4__consent_counts')}}
)
, dim as (
    select
        session_partition_key
        , session_partition_date
        , last_non_direct_default_channel_grouping
        , last_non_direct_source
        , last_non_direct_medium
        , last_non_direct_campaign
    from {{ref('dim_ga4__sessions_daily')}}
    {% if is_incremental() %}
        where session_partition_date in ({{ partitions_to_replace | join(',') }})
    {% endif %}
)
, fct as (
    select
        dim.session_partition_date
        , dim.last_non_direct_default_channel_grouping
        , dim.last_non_direct_source
        , dim.last_non_direct_medium
        , dim.last_non_direct_campaign
        , count(*) as sessions
        , sum(session_partition_count_page_views) as page_view_count
        , sum(session_partition_max_session_engaged) as engaged_sessions
        , sum(session_partition_sum_event_value_in_usd) as page_value
        , sum(session_partition_sum_engagement_time_msec) as session_engagement_time_msec
    , count(distinct
        case
            when session_number = 1 and user_id is not null then user_id
            else client_key
        end
    ) as new_users
    , count(distinct
        case
            when session_number != 1 and user_id is not null then user_id
            else client_key
        end
    ) as returning_users
    from {{ref('fct_ga4__sessions_daily')}} as f
    right join dim on dim.session_partition_key = f.session_partition_key
    {% if is_incremental() %}
        where f.session_partition_date in ({{ partitions_to_replace | join(',') }})
    {% endif %}
    group by 1,2,3,4,5
)
select
    session_partition_date
    , last_non_direct_default_channel_grouping
    , last_non_direct_source
    , last_non_direct_medium
    , last_non_direct_campaign
    , sessions
    , {{model_counting_metrics(
        'sessions'
        , 'cnsent.consented_events'
        , 'cnsent.consent_denied_events'
    )}} as modeled_sessions
    , page_view_count
    , {{model_counting_metrics(
        'page_view_count'
        , 'cnsent.consented_events'
        , 'cnsent.consent_denied_events'
    )}} as modeled_page_view_count
    , engaged_sessions
    , {{model_counting_metrics(
        'engaged_sessions'
        , 'cnsent.consented_events'
        , 'cnsent.consent_denied_events'
    )}} as modeled_engaged_sessions
    , page_value
    , {{model_counting_metrics(
        'page_value'
        , 'cnsent.consented_events'
        , 'cnsent.consent_denied_events'
    )}} as modeled_page_value
    , session_engagement_time_msec
    , {{model_counting_metrics(
        'session_engagement_time_msec'
        , 'cnsent.consented_events'
        , 'cnsent.consent_denied_events'
    )}} as modeled_session_engagement_time_msec
    , new_users
    , {{model_counting_metrics(
        'new_users'
        , 'cnsent.consented_events'
        , 'cnsent.consent_denied_events'
    )}} as modeled_new_users
    , returning_users
    , {{model_counting_metrics(
        'returning_users'
        , 'cnsent.consented_events'
        , 'cnsent.consent_denied_events'
    )}} as modeled_returning_users
from fct as f
left join cnsent on cnsent.event_date_dt = f.session_partition_date
group by 1,2,3,4,5,6,8,10,12,14,16,18

This really isn’t all that different from the previous models. The big difference is that we’re plugging in the total consented and non-consented event counts into the model_counting_metrics macro rather than the counts of consented and non-consented sessions because we can’t get the count of non-consented sessions without consent.

We start by setting partitions_to_replace for query pruning.

{% set partitions_to_replace = ["current_date"] %}
{% for i in range(var("static_incremental_days", 1)) %}
    {% set partitions_to_replace = partitions_to_replace.append( "date_sub(current_date, interval " + (i + 1) | string + " day)") %}
{% endfor %}

Then we set our config ensuring that we are creating an incremental model so we don’t try to update rows where the date is not set in stg_ga4__consent_counts.

{{config(
    materialized='incremental',
    unique_key=['session_partition_date', 'last_non_direct_default_channel_grouping', 'last_non_direct_source', 'last_non_direct_medium', 'last_non_direct_campaign']
)}}

We also set the unique key to the combination of unique, daily dimensions.

Then, we get all of the rows from stg_ga4__consent_counts

with cnsent as (
    select
        *
    from {{ref('stg_ga4__consent_counts')}}
)

If we made stg_ga4__consent_counts an incremental model, then we’d want to add a where condition but, because it is a table that gets dropped and built with only the rows that we are processing, then we don’t gain anything here by adding a where condition.

We get our dimensions while using partitions_to_replace for partition pruning.

, dim as (
    select
        session_partition_key
        , session_partition_date
        , last_non_direct_default_channel_grouping
        , last_non_direct_source
        , last_non_direct_medium
        , last_non_direct_campaign
    from {{ref('dim_ga4__sessions_daily')}}
    {% if is_incremental() %}
        where session_partition_date in ({{ partitions_to_replace | join(',') }})
    {% endif %}
)

We join those to our metrics again remembering to prune partitions.

, fct as (
    select
        dim.session_partition_date
        , dim.last_non_direct_default_channel_grouping
        , dim.last_non_direct_source
        , dim.last_non_direct_medium
        , dim.last_non_direct_campaign
        , count(*) as sessions
        , sum(session_partition_count_page_views) as page_view_count
        , sum(session_partition_max_session_engaged) as engaged_sessions
        , sum(session_partition_sum_event_value_in_usd) as page_value
        , sum(session_partition_sum_engagement_time_msec) as session_engagement_time_msec
    , count(distinct
        case
            when session_number = 1 and user_id is not null then user_id
            else client_key
        end
    ) as new_users
    , count(distinct
        case
            when session_number != 1 and user_id is not null then user_id
            else client_key
        end
    ) as returning_users
    from {{ref('fct_ga4__sessions_daily')}} as f
    right join dim on dim.session_partition_key = f.session_partition_key
    {% if is_incremental() %}
        where f.session_partition_date in ({{ partitions_to_replace | join(',') }})
    {% endif %}
    group by 1,2,3,4,5

And then finally, we calculate modeled values.

select
    session_partition_date
    , last_non_direct_default_channel_grouping
    , last_non_direct_source
    , last_non_direct_medium
    , last_non_direct_campaign
    , sessions
    , {{model_counting_metrics(
        'sessions'
        , 'cnsent.consented_events'
        , 'cnsent.consent_denied_events'
    )}} as modeled_sessions
    , page_view_count
    , {{model_counting_metrics(
        'page_view_count'
        , 'cnsent.consented_events'
        , 'cnsent.consent_denied_events'
    )}} as modeled_page_view_count
    , engaged_sessions
    , {{model_counting_metrics(
        'engaged_sessions'
        , 'cnsent.consented_events'
        , 'cnsent.consent_denied_events'
    )}} as modeled_engaged_sessions
    , page_value
    , {{model_counting_metrics(
        'page_value'
        , 'cnsent.consented_events'
        , 'cnsent.consent_denied_events'
    )}} as modeled_page_value
    , session_engagement_time_msec
    , {{model_counting_metrics(
        'session_engagement_time_msec'
        , 'cnsent.consented_events'
        , 'cnsent.consent_denied_events'
    )}} as modeled_session_engagement_time_msec
    , new_users
    , {{model_counting_metrics(
        'new_users'
        , 'cnsent.consented_events'
        , 'cnsent.consent_denied_events'
    )}} as modeled_new_users
    , returning_users
    , {{model_counting_metrics(
        'returning_users'
        , 'cnsent.consented_events'
        , 'cnsent.consent_denied_events'
    )}} as modeled_returning_users
from fct as f
left join cnsent on cnsent.event_date_dt = f.session_partition_date
group by 1,2,3,4,5,6,8,10,12,14,16,18

It would actually be cheaper to combine the last two steps, join dimensions to metrics and calculate modeled values.

Each group by costs more, but then you would have to input the calculation that you used for the metric that you are trying to model rather than the name that we set when we joined dimensions to metrics in the fct CTE like in the below code.

 , count(*) as sessions
    , {{model_counting_metrics(
        'count(*)'
        , 'cnsent.consented_events'
        , 'cnsent.consent_denied_events'
    )}} as modeled_sessions
    , sum(session_partition_count_page_views) as page_view_count
    , {{model_counting_metrics(
        'sum(session_partition_count_page_views)'
        , 'cnsent.consented_events'
        , 'cnsent.consent_denied_events'
    )}} as modeled_page_value

I chose readability here, but I would probably choose efficiency over readability with production data.

Similar Posts

Leave a Reply

Your email address will not be published. Required fields are marked *