Comparing GA4 Session Attribution with DBT-GA4
The raw GA4 BigQuery export includes event-scoped attribution data.
On the first page of a visit, the referrer will be from a different site from your own, or blank, and any UTM codes will be present on that page. When someone clicks through to a subsequent page, the referrer will be the previous page and UTM codes will be lost unless you do something specific to persist them.
GA4 calculates the source, medium, and campaign and adds those parameters to events whenever the referrer is from a site that hasn’t been excluded or UTM parameters are present.
This means that all events on the first page of a visit will have source, medium, and campaign, if applicable, and that events on subsequent pages will not have these parameters unless there is an external referrer again or another set of UTM parameters.
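The rule described above can be sketched in a few lines of Python. This is a simplified illustration of the behavior described in this article, not GA4's actual implementation. The function name `carries_attribution` and the `excluded_domains` argument are hypothetical names chosen for the example.

```python
from urllib.parse import urlparse, parse_qs

UTM_KEYS = ("utm_source", "utm_medium", "utm_campaign")


def carries_attribution(page_location, page_referrer, excluded_domains):
    """Return True when events on this page would be expected to carry
    source / medium / campaign values under the rule described above."""
    # UTM parameters on the page URL always count.
    query = parse_qs(urlparse(page_location).query)
    has_utm = any(key in query for key in UTM_KEYS)

    # A referrer counts only if it exists and is not an excluded (own) domain.
    referrer_host = urlparse(page_referrer).hostname if page_referrer else None
    has_external_referrer = (
        referrer_host is not None and referrer_host not in excluded_domains
    )
    return has_utm or has_external_referrer


own = {"site.com"}
landing = carries_attribution(
    "https://site.com/?utm_source=facebook&utm_medium=cpc&utm_campaign=my_campaign",
    "https://facebook.com/",
    own,
)
second_page = carries_attribution("https://site.com/buy-now", "https://site.com/", own)
```

Here `landing` is true because the page has UTM parameters, while `second_page` is false because the referrer is the site itself and no UTMs are present.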
Furthermore, the channel calculations are performed after the data is exported to BigQuery, so there is no channel data in the export.
As a result, the raw data really isn’t all that appropriate for reporting attribution data. The best you can do is use the first_visit and session_start events to get a count of user- and session-scoped metrics, but these events are sometimes missing and are often duplicated.
If you want to know the source/medium of your purchase events, then those purchase events had better happen on the first page of the visit or you will be out of luck.
Session attribution is one of the main reasons to use the dbt-GA4 package as it handles all of the intricacies of session attribution and channel assignment without needing any configuration.
However, the package does not calculate attribution the same way that GA4 does.
We are going to look at how the dbt-GA4 attribution calculations compare with the default GA4 ones so that you can explain to your users why dbt-GA4 attribution metrics might differ from metrics in the GA4 interface and why the dbt-GA4 numbers are (probably) better.
How GA4 calculates session dimensions
First, we’re going to have to look at how GA4 calculates session dimensions to understand how dbt-GA4 differs and why we decided to calculate these dimensions differently in the package.
As I’ve already said, the raw data is event-scoped with all events on the first page getting attribution data. A typical, two-page visit would look something like this:
event_number | event_name | source | medium | campaign | page_referrer | page_location |
---|---|---|---|---|---|---|
1 | session_start | facebook | cpc | my_campaign | facebook.com | site.com?utm_source=facebook&utm_medium=cpc&utm_campaign=my_campaign |
1 | first_visit | facebook | cpc | my_campaign | facebook.com | site.com?utm_source=facebook&utm_medium=cpc&utm_campaign=my_campaign |
1 | page_view | facebook | cpc | my_campaign | facebook.com | site.com?utm_source=facebook&utm_medium=cpc&utm_campaign=my_campaign |
2 | scroll | facebook | cpc | my_campaign | facebook.com | site.com?utm_source=facebook&utm_medium=cpc&utm_campaign=my_campaign |
3 | user_engagement | facebook | cpc | my_campaign | facebook.com | site.com?utm_source=facebook&utm_medium=cpc&utm_campaign=my_campaign |
4 | page_view | null | null | null | site.com | site.com/buy-now |
5 | purchase | null | null | null | site.com | site.com/buy-now |
I’m using event_number, which isn’t in the export, as a simple replacement for timestamp. Events that share an event_number value are part of the same batch and have the same timestamp.
Additionally, this sample unnests the source, medium, and campaign values and simplifies some of the values for demonstration purposes.
The first five events in the above example all happen on the same page with a page_referrer set to ‘facebook.com’ and a page_location of ‘site.com?utm_source=facebook&utm_medium=cpc&utm_campaign=my_campaign’.
All of those events have their source, medium, and campaign values set because the page they are on has UTM values.
The events on the subsequent pages do not have values set because there are no UTMs and the page_referrer is from your own site.
If you were to blindly count the number of events with a campaign, you would get a result of 5 even though all of these events are from one person.
Similarly, if you were to count purchase events by source / medium, you would get one purchase with a null source / medium.
If you want to get the source / medium for the purchase event, you would need to get that dimension from another event in the session.
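As a rough sketch of what "getting that dimension from another event in the session" involves, the Python below attaches the first non-null source / medium found in a session to that session's purchase events. The `session_id` key and the function name are hypothetical; the real export identifies sessions differently, and this is not how any particular tool implements the lookup.

```python
def purchases_with_session_source(events):
    """Attach the first non-null source / medium seen in each session
    to that session's purchase events."""
    first_attribution = {}
    # Walk events in time order and remember the first attributed event per session.
    for event in sorted(events, key=lambda e: e["event_number"]):
        session = event["session_id"]
        if event["source"] is not None and session not in first_attribution:
            first_attribution[session] = (event["source"], event["medium"])

    purchases = []
    for event in events:
        if event["event_name"] == "purchase":
            source, medium = first_attribution.get(event["session_id"], (None, None))
            purchases.append(
                {**event, "session_source": source, "session_medium": medium}
            )
    return purchases


# A trimmed version of the two-page visit above; "s1" is a made-up session id.
events = [
    {"session_id": "s1", "event_number": 1, "event_name": "page_view",
     "source": "facebook", "medium": "cpc"},
    {"session_id": "s1", "event_number": 4, "event_name": "page_view",
     "source": None, "medium": None},
    {"session_id": "s1", "event_number": 5, "event_name": "purchase",
     "source": None, "medium": None},
]
result = purchases_with_session_source(events)
```

The purchase event keeps its own null event-level source, but gains a session-level source / medium of ‘facebook / cpc’ borrowed from the landing page event.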
This example is a simple one. The reality is actually a whole lot more complex.
One of the biggest differences between UA and GA4 is how GA4 does not break sessions when you use UTMs internally. From the document comparing UA metrics to GA4, “Using UTM tagging on your own website isn’t recommended since it will reset the session in Universal Analytics. If you do use UTMs on your own website, you may notice a much higher count of sessions in UA than in GA4.”
The truth is that GA4 looks at each attribution parameter individually with some surprising results.
Let’s look at some more examples. We’re going to only look at page_view events from now on to keep things simple.
event_number | event_name | source | medium | campaign |
---|---|---|---|---|
1 | page_view | google | organic | null |
2 | page_view | null | null | null |
3 | page_view | internal_link | link | my_campaign |
All of these page_view events are in the same session. In GA4, the source / medium for this session would be ‘google / organic’. This is what we want.
However, GA4 calculates each attribution field independently. The campaign in this example would be ‘my_campaign’ with a source of ‘google’ and a medium of ‘organic’.
Since Google doesn’t allow us to tag their organic search results with utm_campaign parameters, it should be impossible for an organic visit to have a campaign, but it happens, and this is why.
Last, non-direct attribution reverses the direction of this calculation giving you a source of ‘internal_link’, a medium of ‘link’ and a campaign of ‘my_campaign’ even though, as an internally-tagged UTM link, it could literally be impossible for a session to start with those values and the session started as ‘google / organic’.
A better name for this attribution model might be first, non-direct attribution.
These issues are a big problem with campaign values in GA4 and are even worse if you use some of the less-used UTM parameters like utm_term. They also make malformed UTM parameters even more painful, as a misspelling like utm_soucre would result in a null value for source paired with a value for the medium.
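To see why a misspelled parameter produces a null source alongside a valid medium, here is a small Python sketch that reads UTM values from a URL with the standard library. It only illustrates URL parsing and deliberately ignores the referrer; the function name `utm_values` and the example URL are made up for the illustration.

```python
from urllib.parse import urlparse, parse_qs


def utm_values(page_location):
    """Read utm_source, utm_medium and utm_campaign from a URL.
    Any parameter that is absent (or misspelled) comes back as None."""
    query = parse_qs(urlparse(page_location).query)
    return {
        name: query.get("utm_" + name, [None])[0]
        for name in ("source", "medium", "campaign")
    }


# utm_source is misspelled as utm_soucre, so it is simply not found.
broken = utm_values(
    "https://site.com/?utm_soucre=facebook&utm_medium=cpc&utm_campaign=my_campaign"
)
# broken == {"source": None, "medium": "cpc", "campaign": "my_campaign"}
```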
Google shouldn’t take too much heat for users messing up UTM codes, but it is worth knowing how this works so you know what is causing any mismatched source and medium values that you see.
How DBT-GA4 handles attribution
It seems likely that Google uses this structure to make running Google Analytics cheaper. By looking up each column individually, Google doesn’t need to build and store a sessions table. Everything works off of the raw data or a single processed events table.
Additionally, Google can calculate these values without using a window function. Window functions look to the past or the future to calculate values and can be costly processing-wise, which is the main expense of a BigQuery warehouse.
The Advanced dbt-GA4 course goes into depth on the challenges of window functions so that you can understand why we avoid using them and how to minimize the costs when using them.
Additionally, we have tried to limit the use of window functions in the dbt-GA4 package. We removed the exits metrics, for example, because we did not deem the value of the metric to be worth the cost of calculating it.
The cost of this no doubt adds up when you are giving away the most popular web analytics platform on the internet.
Session tables are probably the most complex feature of dbt-GA4, but, even with a high-traffic site, the benefit of providing a simplified interface for data visualization users to work with should almost always outweigh the cost of building these tables.
The session tables, being shorter than the raw event tables, may even be cheaper to use depending on how often people query the session tables and how often you rebuild your daily partitions but this is situational.
For most sites, the costs are trivial, so the dbt-GA4 package builds session tables by default, and it calculates session attribution differently from GA4.
In dbt-GA4, attribution parameters are keyed off of the source field. The package takes all attribution parameters from the first event in the session whose source field has a value.
event_number | event_name | source | medium | campaign |
---|---|---|---|---|
1 | page_view | google | organic | null |
2 | page_view | null | null | null |
3 | page_view | internal_link | link | internal_campaign |
In the above session, dbt-GA4 would report a source / medium of ‘google / organic’ and a campaign of ‘null’ while GA4 would report ‘google / organic’ with a campaign of ‘internal_campaign’.
This prevents situations where you can have a campaign or term attached to a source / medium combination that shouldn’t have either of those parameters like ‘google / organic’.
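The source-keyed approach can be sketched in Python as below. The package itself is implemented in SQL, so this is only a model of the logic described in this article, with a hypothetical function name, not the package's code.

```python
FIELDS = ("source", "medium", "campaign")


def source_keyed_session_attribution(events):
    """Take every attribution field from the first event in the session
    whose source is not null; nulls on that event stay null."""
    for event in events:
        if event["source"] is not None:
            return {field: event[field] for field in FIELDS}
    # No event had a source: the session has no attribution values.
    return {field: None for field in FIELDS}


# The session from the table above, in time order.
session = [
    {"source": "google", "medium": "organic", "campaign": None},
    {"source": None, "medium": None, "campaign": None},
    {"source": "internal_link", "medium": "link", "campaign": "internal_campaign"},
]
attribution = source_keyed_session_attribution(session)
# attribution == {"source": "google", "medium": "organic", "campaign": None}
```

Because all three fields come from the same event, the null campaign on the first event is kept rather than being filled in from the later internal link.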
It does have the drawback that, if you were to misspell utm_source, the misspelled source would be recorded as null. The session would then be attributed as ‘direct / none’, even with a valid medium and campaign, unless internal UTMs were found later in the session.
event_number | event_name | source | medium | campaign |
---|---|---|---|---|
1 | page_view | null | cpc | my_campaign |
2 | page_view | null | null | null |
3 | page_view | internal_link | link | internal_campaign |
This session would be ‘internal_link / link’ with a campaign of ‘internal_campaign’ in dbt-GA4 while it would be ‘internal_link / cpc’ with a campaign of ‘my_campaign’ in GA4.
Similarly, a direct session would get attributed to the internal UTM values if the visitor clicked an internally-tagged link, or to ‘google / organic’ if the visitor, for example, went to Google mid-session to search for a specific page on your site.
event_number | event_name | source | medium | campaign |
---|---|---|---|---|
1 | page_view | null | null | null |
2 | page_view | null | null | null |
3 | page_view | internal_link | link | internal_campaign |
Both GA4 and dbt-GA4 would attribute this session to ‘internal_link / link’ with a campaign of ‘internal_campaign’ even though the session should be a direct session.
In either case, it is true that internal UTM links don’t break sessions, but it is not true that they don’t mess up your attribution data, so it is still best to avoid them.
For last, non-direct attribution, dbt-GA4 performs the calculation after the session values have been set and, like session-scope attribution, last, non-direct-scope attribution is keyed off of the source field.
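A minimal Python sketch of this idea follows, under several assumptions that are not taken from the package: a direct session is represented by a null session source, one user's sessions arrive in chronological order, and there is no limit on how far back the lookup can reach. The function name and the example values are made up for illustration.

```python
ATTRIBUTION_FIELDS = ("source", "medium", "campaign")


def add_last_non_direct(sessions):
    """Given one user's sessions in chronological order, copy attribution
    from the most recent session with a non-null source (assumption: a null
    source means direct, and there is no lookback limit)."""
    last_attributed = None
    enriched = []
    for session in sessions:
        if session["session_source"] is not None:
            last_attributed = session
        # Fall back to the session itself when no earlier session qualifies.
        donor = last_attributed if last_attributed is not None else session
        row = dict(session)
        for field in ATTRIBUTION_FIELDS:
            row["last_non_direct_" + field] = donor["session_" + field]
        enriched.append(row)
    return enriched


# Made-up sessions for one user: organic, then direct, then a newsletter click.
sessions = [
    {"session_source": "google", "session_medium": "organic", "session_campaign": None},
    {"session_source": None, "session_medium": None, "session_campaign": None},
    {"session_source": "newsletter", "session_medium": "email", "session_campaign": "spring"},
]
rows = add_last_non_direct(sessions)
```

In this sketch the direct session in the middle inherits ‘google / organic’ with a null campaign as a unit, because the lookup is keyed off the session source rather than filling each field separately.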
Furthermore, in the package, all attribution parameters, including channel groupings, are prefixed with their scope. Event-scope dimensions are prefixed with event_ so the source parameter, for example, is renamed to event_source. Session- and last, non-direct-scoped dimensions are similarly prefixed with session_ and last_non_direct_ so that end users aren’t ever confronted with parameters of ambiguous scope like source or medium.
Comparing GA4 with dbt-GA4
The main difference between GA4 and dbt-GA4 is that GA4 calculates each attribution dimension individually while the dbt-GA4 package keys off of the source field and sets all attribution parameters based on the first event where the source field is present.
The GA4 approach can result in combinations of mismatched values, like ‘google / organic’ traffic having a campaign, while the dbt-GA4 package is immune to this issue.
This difference extends to the calculation of last, non-direct traffic where GA4 continues to look at each event-scope dimension in isolation while the dbt-GA4 package calculates sessions first and then looks at those session-scope dimensions to calculate last, non-direct.
The dbt-GA4 package suffers most when source is misconfigured, which is what would happen if you misspelled ‘utm_source’ when tagging traffic.
Internal links do not break sessions in either GA4 or the dbt-GA4 package, but internal links can mess up the attribution for both resulting in source, medium, and campaign combinations that should be impossible in GA4, and shifting direct traffic to the internal link-tagged UTM values in both GA4 and dbt-GA4.