GA4 Consent Modeling in BigQuery with the DBT-GA4 Package
While reviewing an issue with null event keys in the dbt-GA4 package, it occurred to me that one of the potential fixes would also enable modeling of non-consented data generated by Google’s consent mode.
We decided to not include this by default in the dbt-GA4 package, you can read our discussion here, but it’s still something that many people would like to do so let’s look at how to customize the dbt-GA4 package to mimic GA4’s consent modeling in BigQuery.
One of Google Analytics 4’s improvements over UA is the introduction of behavioral modeling for consent mode.
Behavioral modeling for consent mode takes the ID-less pings that GA4 sends when consent is denied, combines them with consented data to model the missing data.
Before going on, please be careful with consent mode. There are many people in the analytics community who doubt the legality of consent mode. Many of the applicable privacy laws are concerned with personal data and it seems unlikely that removing user and client IDs makes your browsing data not personal anymore.
Please be sure that any organization that you are working with is comfortable with the potential issues with consent mode before enabling it.
With that out of the way, lets replicate GA4’s behavioral modeling for consent mode and use the consent-denied pings to model the missing traffic in BigQuery using the dbt-GA4 package.
The concepts covered here are fairly advanced. The free, dbt-GA4 Setup course covers a basic implementation up to your basic dimensional models. This post builds on aggregate models that are covered in Advanced dbt-GA4.
Calculating Modeled Values Using Non-Consented Pings
We don’t need any complex machine learning algorithms. The math is really quite simple.
We want to divide the metric by the number of consented rows that the metric was calculated from to get the metric’s value per row, then we multiply that by the number of non-consented rows, and add the initial metric value.
consented_metric_value
/ consented rows
x non-consented_rows
+ consented_metric_value
This calculation works for any counting metric with any number of dimensions added. And it has the bonus of being really simple.
Percentages and rates should be the same modeled or un-modeled so there’s no need to model them.
This calculation will work for session metrics but you have to plug in some different values because all non-consented sessions get the same session_key
so the number of non-consented rows, when plugging in a session table, will always equal one.
This calculation does not work with your fact and dimension models. It requires that you run an aggregation of some sort which you would normally do down-stream of your fact and dimension models.
You could alternatively use this calculation in your data visualization platform, but we’re going to keep on with dbt and BigQuery.
Preparing the Data
The first thing that we need to do is to identify the non-consented rows. We’re going to do this by modifying the session_key
which is defined in stg_ga4__events
.
When the session_key
is null, it is often because of consent mode, but not always. When user_pseudo_id
is null, then it is consent mode at work. When ga_session_id
is null, then it is an audience trigger firing.
We’re going to modify our session_key
generation in order to identify both of these situations. It is a data modeling best-practice to replace null values with descriptive replacements.
Copy the stg_ga4__events
file from the package to a folder under your dbt project’s models folder and give it the same name.
Replace the include_session_key
common table expression (CTE) with this code:
-- Add key for sessions. session_key will be null if client_key is null due to consent being denied. ga_session_id may be null during audience trigger events.
include_session_key as (
select
*
, case
when client_key is null then 'consent_denied'
when session_id is null then 'audience_trigger_null_session_id'
else to_base64(md5(CONCAT(client_key, CAST(session_id as STRING))))
end as session_key
from include_client_key
),
Here we are using a case statement to replace the null inputs to the session_key
with descriptive values.
The user_pseudo_id
is not a direct input to session_key
but it is an input to client_key
and a null user_pseudo_id
will result in a null client_key
so we use client_key
to detect whether consent has been denied.
We could look up the user_pseudo_id
field, but since we’re already using client_key
in the else
statement, else to_base64(md5(CONCAT(client_key, CAST(session_id as STRING))))
, we process less data by working off of the client_key
which is already in use.
You’ll need to full-refresh stg_ga4__events
and all of its descendants to populate the new ID values.
Model Counting Metrics Macro
We’ve already gone over the math for the model counting metrics macro. All that remains is to convert the math to BigQuery SQL and dbt Jinja.
Here’s the model_counting_metrics
macro that I saved to my project’s macros folder.
{%- macro model_counting_metrics(consented_metrics_value, consented_rows, non_consented_rows) -%}
safe_divide(sum({{consented_metrics_value}}) , count({{consented_rows}}) )
* ifnull(count({{non_consented_rows}}), 0)
+ {{consented_metrics_value}}
{%- endmacro -%}
We’re using dbt’s Jinja parser to populate the macro with the various inputs.
We use safe_divide
to prevent any divide by zero shenanigans.
We need the ifnull(count({{non_consented_rows}}), 0)
function for cases where there are no non-consented rows. If we don’t force this to be zero, then we’ll get strange results.
Otherwise, the macro is pretty straight-forward.
Modeling Your Metrics
Now we can put this macro to use.
Here you can see a model named agg_ga4__page_location
.
I often replace the default fct_ga4__pages
model with this model because I like to prefix aggregations agg_
instead of fct_
, I don’t care for the scroll metrics in the default table, and building off of your relevant event fact tables, if you’ve created them, is less expensive than building off of the event staging tables.
{% set partitions_to_replace = ['current_date'] %}
{% for i in range(var('static_incremental_days')) %}
{% set partitions_to_replace = partitions_to_replace.append('date_sub(current_date, interval ' + (i+1)|string + ' day)') %}
{% endfor %}
{{
config(
materialized = 'incremental',
incremental_strategy = 'insert_overwrite',
tags = ["incremental"],
partition_by={
"field": "date",
"data_type": "date",
"granularity": "day"
},
partitions = partitions_to_replace
)
}}
with pv as (
select
page_location
, event_date_dt as date
, count(*) as page_views
, sum(
case when session_key != 'consent_denied' then 1 else 0 end
) as consented_page_views
, sum(
case when session_key = 'consent_denied' then 1 else 0 end
) as non_consented_page_views
, count(distinct session_key) as sessions
, count(distinct
case
when user_id is null then session_key
else user_id
end
) as users
, sum(event_value_in_usd) as page_value
from {{ref('fct_ga4__event_page_view')}}
{% if is_incremental() %}
where event_date_dt in ({{ partitions_to_replace | join(',') }})
{% endif %}
group by 1,2
)
, modeled_values as (
select
page_location
, date
, page_views
, sessions
, {{model_counting_metrics('sessions', 'page_views', 'non_consented_page_views')}} as modeled_sessions
, users
, {{model_counting_metrics('users', 'page_views', 'non_consented_page_views')}} as modeled_users
, page_value
, {{model_counting_metrics('page_value', 'page_views', 'non_consented_page_views')}} as modeled_page_value
from pv
group by 1,2,3,4,6,8
)
select * from modeled_values
Let’s break this down.
{% set partitions_to_replace = ['current_date'] %}
{% for i in range(var('static_incremental_days')) %}
{% set partitions_to_replace = partitions_to_replace.append('date_sub(current_date, interval ' + (i+1)|string + ' day)') %}
{% endfor %}
This is the standard start for any dbt-GA4 partitioned model. We set use static_incremental_days
to define partitions_to_replace
so that we can plug that in to our config block and when we do query pruning as seen below.
{% if is_incremental() %}
where event_date_dt in ({{ partitions_to_replace | join(',') }})
{% endif %}
The config block is pretty standard for a partitioned incremental model in BigQuery. This page location model will contain page data by date by page_location
. So the number of rows will be roughly equal to the number of pages on the site.
{{
config(
materialized = 'incremental',
incremental_strategy = 'insert_overwrite',
tags = ["incremental"],
partition_by={
"field": "date",
"data_type": "date",
"granularity": "day"
},
partitions = partitions_to_replace
)
}}
Typically, you want tens of thousands of rows in a partition before you start partitioning. The site I’m working on is smaller than that. I could change to a non-partitioned incremental model, but it’s not that big of a deal.
Next, we get the data that we want from fct_ga4__event_page_view
.
with pv as (
select
page_location
, event_date_dt as date
, count(*) as page_views
, sum(
case when session_key != 'consent_denied' then 1 else 0 end
) as consented_page_views
, sum(
case when session_key = 'consent_denied' then 1 else 0 end
) as non_consented_page_views
, count(distinct session_key) as sessions
, count(distinct
case
when user_id is null then session_key
else user_id
end
) as users
, sum(event_value_in_usd) as page_value
from {{ref('fct_ga4__event_page_view')}}
{% if is_incremental() %}
where event_date_dt in ({{ partitions_to_replace | join(',') }})
{% endif %}
group by 1,2
)
Our two main dimensions are page_location
and date
, so we get those right away and group by
those fields later.
Then we calculate our metrics.
Since the fct_ga4__event_page_view
model contains one row per page_view
event and includes non-consented rows, then we don’t need to model the number of page views; we can just count the number of rows to get that metric.
, count(*) as page_views
Our macro takes consented_rows
and non_consented_rows
as inputs, so we need to calculate those.
, sum(
case when session_key != 'consent_denied' then 1 else 0 end) as consented_page_views
, sum(
case when session_key = 'consent_denied' then 1 else 0 end) as non_consented_page_views
Remember that we set the session_key
to consent_denied
when the client_key
, as a result of a null user_pseudo_id
, is null.
Finally, we need the counting metrics that we want to model.
, count(distinct session_key) as sessions
, count(distinct
case
when user_id is null then session_key
else user_id
end
) as users
, sum(event_value_in_usd) as page_value
In this case, we’re getting sessions, users, and page value, but you can choose whatever you like.
Finally, we model create our modeled metrics.
, modeled_values as (
select
page_location
, date
, page_views
, sessions
, {{model_counting_metrics('sessions', 'page_views', 'non_consented_page_views')}} as modeled_sessions
, users
, {{model_counting_metrics('users', 'page_views', 'non_consented_page_views')}} as modeled_users
, page_value
, {{model_counting_metrics('page_value', 'page_views', 'non_consented_page_views')}} as modeled_page_value
from pv
group by 1,2,3,4,6,8
)
This is a simple example, but you can apply this to more complex aggregated models easily enough.
I do think it is probably a good idea to show both the modeled and raw metrics like I did in this sample. I’ve fielded too many why don’t these metrics match questions where the answer was that you’re looking at different metrics to not prefix any modeled data with modeled_
.
Session Metrics
If we’re aggregating the session tables, all of the non-consented data will be found in one row with a session_key
of consent_denied
rather than one row per session with consent denied.
This row will have just a single session start timestamp, but hugely inflated page views, conversions, engagement time, and other metrics.
We then need to then do two things:
- Remove the non-consented data from our session tables to not throw off our session metrics too badly with a single, mega-session
- Use the ratio of consented to non-consented rows for all events to model session metrics
Removing Non-Consented Data from Session Tables
We have a few session_partition_key is not null
checks that handled this before, but those don’t work now that we are assigning session_key
values when the client_key
or session_id
is null.
We’ll need to copy a few session models from the dbt-GA4 package to our project, disable the package models, and update the project models.
stg_ga4__sessions_traffic_sources_daily.sql
and stg_ga4__sessions_traffic_sources.sql
Filter out the new session_key
values.
left join {{ref('ga4_source_categories')}} source_categories on events.event_source = source_categories.source
where session_partition_key is not null
and not contains_substr(session_partition_key, 'consent_denied')
and not contains_substr(session_partition_key, 'audience_trigger_null_session_id')
I use contains_substr(session_partition_key
instead of checking session_key
because we are already checking if the session_partition_key
is not null so BigQuery caching shouldn’t cause any extra data look-ups with the extra conditions but it is a tiny optimization.
dbt_project.yml
Disable the package models.
models:
ga4:
staging:
stg_ga4__sessions_traffic_sources_daily:
+enabled: false
stg_ga4__sessions_traffic_sources:
+enabled: false
Model session metrics
Now that we have excluded the new session keys from our sessions table, we are ready to calculate the ratio of consented to non-consented events and plug those values along with session metrics in to our model_counting_metrcis
macro.
We could calculate the ratio every time we need it, but that is wasteful as we need to scan the full events partition every single time we use it.
Instead, we’re going to create a table to hold these calculations so we don’t have to scan so much data.
In your staging folder, create a model named stg_ga4__consent_counts
:
{% set partitions_to_replace = ['current_date'] %}
{% for i in range(var('static_incremental_days')) %}
{% set partitions_to_replace = partitions_to_replace.append('date_sub(current_date, interval ' + (i+1)|string + ' day)') %}
{% endfor %}
{{
config(
materialized='table'
)
}}
select
event_date_dt
, sum(
case when session_key = 'consent_denied' then 1 else 0 end
) as consent_denied_events
, sum(
case when session_key != 'consent_denied' then 1 else 0 end
) as consented_events
from {{ref('stg_ga4__events')}}
{%- if is_incremental() -%}
where event_date_dt in ({{ partitions_to_replace | join(',') }})
{%- endif -%}
group by 1
We create this model as a table rather than a view because we want to prevent our macro from scanning the entire base_ga4__events
partition.
First, we set partitions_to_replace
in order to use it for query pruning later.
{% set partitions_to_replace = ['current_date'] %}
{% for i in range(var('static_incremental_days')) %}
{% set partitions_to_replace = partitions_to_replace.append('date_sub(current_date, interval ' + (i+1)|string + ' day)') %}
{% endfor %}
Then, we have our config block.
{{
config(
materialized='table'
)
}}
With this set up as a table without incremental updates, we completely drop the previous table and build the table again with the new settings.
Our query just needs to get the daily counts of consented and non-consented events by day so we can calculate that ratio in our macro.
select
event_date_dt
, sum(
case when session_key = 'consent_denied' then 1 else 0 end
) as consent_denied_events
, sum(
case when session_key != 'consent_denied' then 1 else 0 end
) as consented_events
from {{ref('stg_ga4__events')}}
{%- if is_incremental() -%}
where event_date_dt in ({{ partitions_to_replace | join(',') }})
{%- endif -%}
group by 1
We use partitions_to_replace
for query pruning and group by event_date_dt
.
{%- if is_incremental() -%}
where event_date_dt in ({{ partitions_to_replace | join(',') }})
{%- endif -%}
group by 1
We don’t need to incrementally update this table if all of the aggregate models where we call this macro are incremental.
This is because if we’re updating the last three days and today, with static_incremental_days
set to 3, then the incremental models only need reference the last three days of data so it is okay that we drop the old version of the table and rebuild with only four rows for today and the last three days.
If we had an aggregate model that was not incremental, then we would need the ratio for every day in our warehouse and the stg_ga4__consent_counts
model would only have today and the last three days.
Next, we need to create a model that aggregates session metrics and apply the macro.
I’m going to create an aggregated attribution model, agg_ga4__last_non_direct_attribution
, in my marts folder that holds channel, source, medium, and campaign dimensions with various daily metrics.
-- not a partitioned model, but using partitions to replace for query pruning
{% set partitions_to_replace = ["current_date"] %}
{% for i in range(var("static_incremental_days", 1)) %}
{% set partitions_to_replace = partitions_to_replace.append( "date_sub(current_date, interval " + (i + 1) | string + " day)") %}
{% endfor %}
{{config(
materialized='incremental',
unique_key=['session_partition_date', 'last_non_direct_default_channel_grouping', 'last_non_direct_source', 'last_non_direct_medium', 'last_non_direct_campaign']
)}}
with cnsent as (
select
*
from {{ref('stg_ga4__consent_counts')}}
)
, dim as (
select
session_partition_key
, session_partition_date
, last_non_direct_default_channel_grouping
, last_non_direct_source
, last_non_direct_medium
, last_non_direct_campaign
from {{ref('dim_ga4__sessions_daily')}}
{% if is_incremental() %}
where session_partition_date in ({{ partitions_to_replace | join(',') }})
{% endif %}
)
, fct as (
select
dim.session_partition_date
, dim.last_non_direct_default_channel_grouping
, dim.last_non_direct_source
, dim.last_non_direct_medium
, dim.last_non_direct_campaign
, count(*) as sessions
, sum(session_partition_count_page_views) as page_view_count
, sum(session_partition_max_session_engaged) as engaged_sessions
, sum(session_partition_sum_event_value_in_usd) as page_value
, sum(session_partition_sum_engagement_time_msec) as session_engagement_time_msec
, count(distinct
case
when session_number = 1 and user_id is not null then user_id
else client_key
end
) as new_users
, count(distinct
case
when session_number != 1 and user_id is not null then user_id
else client_key
end
) as returning_users
from {{ref('fct_ga4__sessions_daily')}} as f
right join dim on dim.session_partition_key = f.session_partition_key
{% if is_incremental() %}
where f.session_partition_date in ({{ partitions_to_replace | join(',') }})
{% endif %}
group by 1,2,3,4,5
)
select
session_partition_date
, last_non_direct_default_channel_grouping
, last_non_direct_source
, last_non_direct_medium
, last_non_direct_campaign
, sessions
, {{model_counting_metrics(
'sessions'
, 'cnsent.consented_events'
, 'cnsent.consent_denied_events'
)}} as modeled_sessions
, page_view_count
, {{model_counting_metrics(
'page_view_count'
, 'cnsent.consented_events'
, 'cnsent.consent_denied_events'
)}} as modeled_page_view_count
, engaged_sessions
, {{model_counting_metrics(
'engaged_sessions'
, 'cnsent.consented_events'
, 'cnsent.consent_denied_events'
)}} as modeled_engaged_sessions
, page_value
, {{model_counting_metrics(
'page_value'
, 'cnsent.consented_events'
, 'cnsent.consent_denied_events'
)}} as modeled_page_value
, session_engagement_time_msec
, {{model_counting_metrics(
'session_engagement_time_msec'
, 'cnsent.consented_events'
, 'cnsent.consent_denied_events'
)}} as modeled_session_engagement_time_msec
, new_users
, {{model_counting_metrics(
'new_users'
, 'cnsent.consented_events'
, 'cnsent.consent_denied_events'
)}} as modeled_new_users
, returning_users
, {{model_counting_metrics(
'returning_users'
, 'cnsent.consented_events'
, 'cnsent.consent_denied_events'
)}} as modeled_returning_users
from fct as f
left join cnsent on cnsent.event_date_dt = f.session_partition_date
group by 1,2,3,4,5,6,8,10,12,14,16,18
This really isn’t all that different from the previous models. The big difference is that we’re plugging in the total consented and non-consented event counts into the model_counting_metrics
macro rather than the counts of consented and non-consented sessions because we can’t get the count of non-consented sessions without consent.
We start by setting partitions_to_replace
for query pruning.
{% set partitions_to_replace = ["current_date"] %}
{% for i in range(var("static_incremental_days", 1)) %}
{% set partitions_to_replace = partitions_to_replace.append( "date_sub(current_date, interval " + (i + 1) | string + " day)") %}
{% endfor %}
Then we set our config ensuring that we are creating an incremental model so we don’t try to update rows where the date is not set in stg_ga4__consent_counts
.
{{config(
materialized='incremental',
unique_key=['session_partition_date', 'last_non_direct_default_channel_grouping', 'last_non_direct_source', 'last_non_direct_medium', 'last_non_direct_campaign']
)}}
We also set the unique key to the combination of unique, daily dimensions.
Then, we get all of the rows from stg_ga4__consent_counts
with cnsent as (
select
*
from {{ref('stg_ga4__consent_counts')}}
)
If we made stg_ga4__consent_counts
an incremental model, then we’d want to add a where condition but, because it is a table that gets dropped and built with only the rows that we are processing, then we don’t gain anything here by adding a where condition.
We get our dimensions while using partitions_to_replace
for partition pruning.
, dim as (
select
session_partition_key
, session_partition_date
, last_non_direct_default_channel_grouping
, last_non_direct_source
, last_non_direct_medium
, last_non_direct_campaign
from {{ref('dim_ga4__sessions_daily')}}
{% if is_incremental() %}
where session_partition_date in ({{ partitions_to_replace | join(',') }})
{% endif %}
)
We join those to our metrics again remembering to prune partitions.
, fct as (
select
dim.session_partition_date
, dim.last_non_direct_default_channel_grouping
, dim.last_non_direct_source
, dim.last_non_direct_medium
, dim.last_non_direct_campaign
, count(*) as sessions
, sum(session_partition_count_page_views) as page_view_count
, sum(session_partition_max_session_engaged) as engaged_sessions
, sum(session_partition_sum_event_value_in_usd) as page_value
, sum(session_partition_sum_engagement_time_msec) as session_engagement_time_msec
, count(distinct
case
when session_number = 1 and user_id is not null then user_id
else client_key
end
) as new_users
, count(distinct
case
when session_number != 1 and user_id is not null then user_id
else client_key
end
) as returning_users
from {{ref('fct_ga4__sessions_daily')}} as f
right join dim on dim.session_partition_key = f.session_partition_key
{% if is_incremental() %}
where f.session_partition_date in ({{ partitions_to_replace | join(',') }})
{% endif %}
group by 1,2,3,4,5
And then finally, we calculate modeled values.
select
session_partition_date
, last_non_direct_default_channel_grouping
, last_non_direct_source
, last_non_direct_medium
, last_non_direct_campaign
, sessions
, {{model_counting_metrics(
'sessions'
, 'cnsent.consented_events'
, 'cnsent.consent_denied_events'
)}} as modeled_sessions
, page_view_count
, {{model_counting_metrics(
'page_view_count'
, 'cnsent.consented_events'
, 'cnsent.consent_denied_events'
)}} as modeled_page_view_count
, engaged_sessions
, {{model_counting_metrics(
'engaged_sessions'
, 'cnsent.consented_events'
, 'cnsent.consent_denied_events'
)}} as modeled_engaged_sessions
, page_value
, {{model_counting_metrics(
'page_value'
, 'cnsent.consented_events'
, 'cnsent.consent_denied_events'
)}} as modeled_page_value
, session_engagement_time_msec
, {{model_counting_metrics(
'session_engagement_time_msec'
, 'cnsent.consented_events'
, 'cnsent.consent_denied_events'
)}} as modeled_session_engagement_time_msec
, new_users
, {{model_counting_metrics(
'new_users'
, 'cnsent.consented_events'
, 'cnsent.consent_denied_events'
)}} as modeled_new_users
, returning_users
, {{model_counting_metrics(
'returning_users'
, 'cnsent.consented_events'
, 'cnsent.consent_denied_events'
)}} as modeled_returning_users
from fct as f
left join cnsent on cnsent.event_date_dt = f.session_partition_date
group by 1,2,3,4,5,6,8,10,12,14,16,18
It would actually be cheaper to combine the last two steps, join dimensions to metrics and calculate modeled values.
Each group by
costs more, but then you would have to input the calculation that you used for the metric that you are trying to model rather than the name that we set when we joined dimensions to metrics in the fct
CTE like in the below code.
, count(*) as sessions
, {{model_counting_metrics(
'count(*)'
, 'cnsent.consented_events'
, 'cnsent.consent_denied_events'
)}} as modeled_sessions
, sum(session_partition_count_page_views) as page_view_count
, {{model_counting_metrics(
'sum(session_partition_count_page_views)'
, 'cnsent.consented_events'
, 'cnsent.consent_denied_events'
)}} as modeled_page_value
I chose readability here, but I would probably choose efficiency over readability with production data.