How does working with dbt-GA4 tables compare with working with the raw export?

If you Google how to do virtually anything with the GA4 raw export to BigQuery, you will probably find guides showing you how to accomplish your objective with the raw data.

The GA4 BigQuery website is probably the most most comprehensive resource out there but there are new ones popping up every week.

Theoretically, it is possible to use the dbt-GA4 package without understanding the raw export, but that discounts the immense debugging usefulness that comes from understanding the raw export so it is best to plan on learning how to query the raw data.

The dbt-GA4 package makes it much easier to query the GA4 data in BigQuery so much so that by using the package you are able to accomplish the same things as outlined in most of the tutorials available online without needing to consult a tutorial.

Sharded tables

The raw GA4 export to BigQuery uses sharded tables instead of partitioned tables. Sharded tables are the old way of partitioning tables.

Partitioning is how we prevent BigQuery from getting a lifetime’s worth of data when we only want one day.

Sharding and partitioning accomplish the same things: they both prevent BigQuery from processing more data than you need. Partitioning, however, is more flexible and easier to use.

When you query a sharded table, your query will look something like this:

select
  *
from `project.dataset.events_20240326`

If you look at the end of the from statement, you’ll see the table events_20240326 includes the date of the table that you want to query.

That’s fine and easy if you want to only query one day’s worth of data, but what if you want all of the data since the beginning of March, 2024? Then your query would look like this:

select
	*
from `project.dataset.events_*`
where cast(_table_suffix as int64) >= 20240301

And if you wanted to query only data from March, 2024?

select
	*
from `project.dataset.events_*`
where _table_suffix between format_date('%Y%m%d', '2024-03-01') and format_date('%Y%m%d', '2024-03-31')

It’s not a huge amount of code, but a similar query against the date partitioned dbt-GA4 tables would look something like this:

select
	*
from `project.dataset.events`
where event_date_dt between '2024-03-01' and '2024-03-31'

The dbt-GA4 query works against the actual date field which is more intuitive than manipulating the _table_suffix and it doesn’t require that you know how to use format_date or cast functions.

It is a small advantage, but we are only on the first simplification that dbt-GA4 provides when interacting with your GA4 data.

Unnesting data

The GA4 export to BigQuery uses Repeated Record data types that need to be unnested before you can really use them.

Event parameters are where you will interact the most with repeated records, but user properties and ecommerce items are also nested. The raw event parameter repeated records look like the below screenshot.

![[event_params.png]]

The whole repeated record is event_params and nested underneath it is the key column and another, non-repeated record called value which contains string_value, int_value, float_value, and double_value columns.

You turn the event parameters in to metrics and dimensions by selecting the key that you want and pairing that with the relevant value type.

Extracting the page location, for example, is done with this code:

(select value.string_value from unnest(event_params) where key = 'page_location') as page_location`

That is just one column of data that in dbt-GA4 would be selected by page_location.

If you wanted to query the raw data to count all page_view events that happened on each page on a visitors’ first session in the month of March, 2024, then the query would look like this:

with prep as (
	select
		(select value.string_value from unnest(event_params) where key = 'page_location') as page_location
		, (select value.int_value from unnest(event_params) where key = 'ga_session_number') as ga_session_number
	from `project.dataset.events_*`
	where _table_suffix between format_date('%Y%m%d', '2024-03-01') and format_date('%Y%m%d', '2024-03-31')
	and event_name = 'page_view'
)
select
	page_location
	, countif(ga_session_number = 1) as new_users
from prep
group by page_location

The with prep as (select ...) portion of the query is something called a common table expression (CTE). It lets you chain queries together in situations where you need to do something to a field, like unnesting ga_session_number, before doing performing another operation on the same field, the countif(ga_session_number = 1) statement in the above query.

The dbt-GA4 package unnests everything for you in stages with the sgt_ga4__events model unnesting all of the parameters that are used by all events and event-specific parameters in staging models, like stg_ga4__event_page_view for page_view events. This has the effect of abstracting away all of the unnest statements while letting you select date-sharded raw data using friendlier date format.

The code to perform the same task as above with dbt-GA4, would look like this:

select
	page_location
	, countif(ga_session_number = 1) as new_users
from `project.dataset.stg_ga4__event_page_view`
where event_date_dt between '2024-03-01' and '2024-03-31'
group by page_location

Differences between querying raw GA4 and dbt-GA4

The dbt-GA4 package abstracts away all of the complexity of working with BigQuery making it easier to query the data while rendering all of the GA4 data in a format that any BigQuery-compatible data visualization tool can use.

This is accomplished by a series of intermediary staging tables. These staging tables are a table type called a View which act like an alias to a query that would create a table in the format that you specify, but does not actually create that table.

As a result, most of the data users will never need to even learn how much you have simplified their ability to access and use the data.

The abstracted interface against the nested and sharded raw data is the base improvement that the dbt-GA4 package provides. The advantages compound as you build complex logic on top of this abstracted interface.

Session logic in the dbt-GA4 package, for example, is quite complex. But that complexity is multiplied if you need to replicate it against the raw data to the point where you will need hundreds of lines of code to do a query that takes five lines in dbt-GA4.