US Population Data for Google Analytics

Wednesday, Mar 13, 2019 | Tags: Google Analytics

Geography data import in Google Analytics is a lesser-used feature. In their docs, Google emphasizes the ability to create custom sales districts. While this is quite useful for the right type of business, you can also use geo-data import to augment your data.

Please note, this covers how to format the data for Google. If you just want to import the data that I’ve formatted, skip to the end of this post, grab the CSV and follow the instructions in the last section.

Google lets you match your imported data to City ID, Country ISO code, Region ID, and Sub-Continent Code, so to start you’ll want some data that you can match at one of those levels.

Additionally, you can only import custom dimensions this way. This isn’t really a problem, but it means that you’ll probably need to do some pre-processing of your data to, for example, turn numerical data into buckets that are suitable as dimensions.
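As a quick illustration of that kind of pre-processing (the bins and labels here are made up), pandas can turn a numeric column into labelled buckets in one call:

```python
import pandas as pd

# hypothetical numeric values that we want to import as a GA custom dimension
populations = pd.Series([120, 4500, 98000, 250000])

# GA dimensions want labels, not numbers; bucket with illustrative cut points
labels = pd.cut(populations,
                bins=[0, 1000, 10000, 1000000],
                labels=['small', 'medium', 'large'])
```

We’ll use this same `pd.cut` approach later in the post, just with data-driven cut points.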

I looked at importing raw numerical data and then bucketing the data via custom segments, but GA doesn’t let you create segments using less than or greater than operators and doing this via regular expressions seems like a quick way to give myself nightmares.

First, let’s look at Google’s geo data.

import pandas as pd
import numpy as np

ga = pd.read_csv('geotargets-2019-02-11.csv')

Criteria ID Name Canonical Name Parent ID Country Code Target Type Status
0 1000002 Kabul Kabul,Kabul,Afghanistan 9075393.0 AF City Active
1 1000003 Luanda Luanda,Luanda Province,Angola 9070431.0 AO City Active
2 1000004 The Valley The Valley,Anguilla 2660.0 AI City Active
3 1000010 Abu Dhabi Abu Dhabi,Abu Dhabi,United Arab Emirates 9041082.0 AE City Active
4 1000011 Ajman Ajman,Ajman,United Arab Emirates 9047096.0 AE City Active

Google Analytics Geo Data

GA uses the Criteria ID to match city data so we need to find a way to match our external data with this ID.

For that, we can replicate the Canonical Name in our external data without too much manual intervention and then merge the two data sets on Canonical Name before uploading the merged data to GA.
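As a sketch of that plan (with toy frames and made-up canonicals), an outer merge with `indicator=True` is a handy way to audit which canonicals fail to match before committing to the final inner join:

```python
import pandas as pd

# toy stand-ins for Google's geo data and our external data
ga_toy = pd.DataFrame({'Criteria ID': [1000002, 1000003],
                       'Canonical Name': ['A,X,United States', 'B,Y,United States']})
ext_toy = pd.DataFrame({'Canonical Name': ['A,X,United States', 'C,Z,United States'],
                        'metric': [10, 20]})

merged = pd.merge(ga_toy, ext_toy, on='Canonical Name',
                  how='outer', indicator=True)
# rows where _merge != 'both' are canonicals that failed to match
unmatched = merged[merged['_merge'] != 'both']
```

Checking `unmatched` during development is how you catch formatting mismatches like the “Saint”/“St.” problem discussed below.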

It’s not evident in the above sample, but there are many more Target Types than just the four entities that we can match to GA data. The techniques I describe here would be fantastic when combined with Zip/Postal Code census data to analyze conversion rates by average household income.

However, Google’s geo data is used for more products than just GA, Ads being a prime example. It doesn’t seem like you can back door postal code and other data types into GA using the data import feature.

ga['Target Type'].value_counts()
Postal Code               48390
City                      38101
Neighborhood               4666
County                     3417
Municipality               2117
Province                   1131
District                    948
Region                      909
Congressional District      441
Airport                     395
Department                  240
State                       235
University                  219
Country                     213
City Region                 183
Governorate                 121
National Park                96
Borough                      88
Prefecture                   49
Okrug                        28
Canton                       26
Autonomous Community         22
TV Region                    14
Union Territory               7
Territory                     4
Name: Target Type, dtype: int64

The basic pattern for city canonicals seems to be City,State,Country. However, a deeper look into the data reveals that this pattern doesn’t hold one hundred percent of the time: there are sometimes multiple non-city entities that would cause naming collisions if the canonical weren’t modified. As a result, some cities canonicalize to City,City,State,Country.

Luckily, these collisions don’t seem to be a problem if you remove the non-city entities from this data. So we’re going to remove anything that isn’t a city.

We’ll be matching US city population data in this example, so let’s also remove the non-US data.

ga = ga[ga['Country Code'] == 'US']
ga = ga[ga['Target Type'] == 'City']
Criteria ID Name Canonical Name Parent ID Country Code Target Type Status
10642 1012873 Anchorage Anchorage,Anchorage,Alaska,United States 21132.0 US City Active
10643 1012874 Anderson Anderson,Alaska,United States 21132.0 US City Active
10644 1012875 Angoon Angoon,Alaska,United States 21132.0 US City Active
10645 1012876 Atqasuk Atqasuk,Alaska,United States 21132.0 US City Active
10646 1012877 Utqiagvik Utqiagvik,Alaska,United States 21132.0 US City Active

You can see Anchorage as an example of a canonical that uses the City,City,State,Country pattern.

Another pattern that is used to avoid collisions is City,County,State,Country. Luckily Google seems to add “County” to the end of every county name so this will be fairly easy to detect and fix.

In addition, a little birdie told me that we’re going to have problems matching “Saint” and “St.” as well as consistently matching names containing “Town” and “Township”.

So let’s make canonicals that are more useful to us by using just the City,State,Country pattern, standardizing all names to use “St.” rather than “Saint”, and dropping “Town” and “Township” from names.

We’ll drop some columns that we’re done with while we’re at it. Please note I spent days comparing source and formatted columns for potential collisions and errors before dropping any data. If you’re following along with another data set, don’t drop anything just yet.

import re

no_match = []

def ga_format_temp_canonical(string):
  if "Saint" in string:
    string = string.replace('Saint', 'St.')
  if "Township" in string:
    string = string.replace(' Township', '')
  if " Town" in string:
    string = string.replace(' Town', '')
  #find and remove County
  match = re.search(r"^([\w \-\.\/]+),([\w ]+) County,(.*)", string)
  if match:
    string = match.group(1) + ',' + match.group(3)
  #find and remove duplicate City patterns
  match = re.search(r"^([\w \-\.\/]+),([\w ]+),(.*)", string)
  if match:
    if match.group(1) == match.group(2):
      string = match.group(1) + ',' + match.group(3)
    return string
  no_match.append(string) #for finding holes in our regex
  return string

ga['Temp Canonical'] = ga['Canonical Name'].apply(ga_format_temp_canonical)

ga = ga.drop(['Name','Canonical Name', 'Parent ID', 'Country Code', 'Target Type', 'Status'], axis=1)
Criteria ID Temp Canonical
10642 1012873 Anchorage,Alaska,United States
10643 1012874 Anderson,Alaska,United States
10644 1012875 Angoon,Alaska,United States
10645 1012876 Atqasuk,Alaska,United States
10646 1012877 Utqiagvik,Alaska,United States

US Census Data

So for this exercise, I’m going to grab city and town population data from the US census. This dataset is particularly convenient because it is offered in CSV format without any need for scraping or other processing. It is also very thorough as it includes geographical estimates for the entire US population.

By comparison, Canadian data seems to be limited to actual cities (as defined by their respective provinces) or broken out into census geo codes that need to be remapped separately.

Let’s grab the US data and look at it.

us = pd.read_csv('sub-est2017_all.csv', encoding = "ISO-8859-1")
0 40 1 0 0 0 0 0 A Alabama Alabama 4779736 4780135 4785579 4798649 4813946 4827660 4840037 4850858 4860545 4874747
1 162 1 0 124 0 0 0 A Abbeville city Alabama 2688 2688 2684 2677 2629 2612 2595 2587 2575 2567
2 162 1 0 460 0 0 0 A Adamsville city Alabama 4522 4522 4516 4502 4479 4457 4437 4409 4376 4347
3 162 1 0 484 0 0 0 A Addison town Alabama 758 754 751 751 744 743 740 734 734 728
4 162 1 0 676 0 0 0 A Akron town Alabama 356 356 355 345 345 341 337 335 332 332

Well that’s an easy way to get a headache.

Once again presenting a veneer of confidence that masks the trial and error that went into producing this, let’s fix this data and drop everything we don’t need.

We’re going to take the name and state name fields and fuse them together with some commas and a country name to match against the Temp Canonical that we created in the GA set.

I don’t know what the PRIMGEO_FLAG, or primitive geography flag, is for. But what it does is duplicate all of the data. Let’s drop every row where it equals 0. This will also have the side-benefit of removing state populations from our data.

If you look through the source data, you will see all sorts of suffixes in the NAME field like city and charter township. We’ll need to drop them, and while we’re at it we’re going to make sure “Saint,” “Township,” and “Town” are standardized the same way as in our Google data.

This will create quite a few duplicate canonicals. The (pt.) suffix stands for part, and to match the Google data we want to amalgamate everything and look at the whole. So we’ll do that, summing the 2017 population estimate and dropping all of the other fields in the process.

Finally, we’ll drop rural and county data that isn’t located in a town or city that we can match to the Google data.

us = us[us['PRIMGEO_FLAG'] != 0]

def format_canonical_name(city, state):
  #US data uses St. for Saint consistently, Google is mixed so standardize on US
  if "Saint" in city:
    city = city.replace('Saint', 'St.')
  if "Township" in city:
    city = city.replace(' Township', '')
  if " Town" in city:
    city = city.replace(' Town', '')
  #Careful with the order here
  if city.endswith(' city and borough'):
    city = city[:-17]
  elif city.endswith(' charter township'):
    city = city[:-17]
  elif city.endswith(' municipality'):
    city = city[:-13]
  elif city.endswith(' city'):
    city = city[:-5]
  elif city.endswith(' town'):
    city = city[:-5]
  elif city.endswith(' borough'):
    city = city[:-8]
  elif city.endswith(' city (pt.)'):
    city  = city[:-11]
  elif city.endswith(' township'):
    city = city[:-9]
  elif city.endswith(' village'):
    city = city[:-8]
  elif city.endswith(' charter'):
    city = city[:-8]
  elif city.endswith(' town (pt.)'):
    city = city[:-11]
  elif city.endswith(' borough (pt.)'):
    city = city[:-14]
  elif city.endswith(' village (pt.)'):
    city = city[:-14]
  elif city.endswith(' city (balance) (pt.)'):
    city = city[:-21]
  elif city.endswith(' city (balance)'):
    city = city[:-15]
  return city + ',' + state +',United States'

us['Temp Canonical'] = us.apply( lambda row: format_canonical_name(row['NAME'], row['STNAME']), axis=1 )

us = us.groupby(['Temp Canonical'], as_index=False)['POPESTIMATE2017'].sum()

#sometimes we can't automatically generate working canonicals
manual_interventions = {
    'Nashville-Davidson metropolitan government (balance),Tennessee,United States': 'Nashville,Tennessee,United States',
    'Louisville/Jefferson County metro government (balance),Kentucky,United States': 'Louisville,Kentucky,United States',
    'Urban Honolulu CDP,Hawaii,United States': 'Honolulu,Hawaii,United States',
    'Lexington-Fayette urban county,Kentucky,United States': 'Lexington,Kentucky,United States',
    'Boise City,Idaho,United States': 'Boise,Idaho,United States',
    'San Buenaventura (Ventura),California,United States': 'Ventura,California,United States',
    'New York,New York,United States': 'New York,United States',
    'Union City,Ohio,United States': 'Union,Ohio,United States',
    'Penn Run,Pennsylvania,United States': 'Penn,Pennsylvania,United States',
    'Lakewood,Pierce,Washington,United States': 'Lakewood,Washington,United States',
}

for key, value in manual_interventions.items():
  us.loc[us['Temp Canonical'] == key, 'Temp Canonical' ] = value

#We're only interested in urban population
us = us[~us['Temp Canonical'].str.startswith('Balance of')  ]

#Let's also get rid of county data
us = us[~us['Temp Canonical'].str.contains('County')]

Temp Canonical POPESTIMATE2017
0 Aastad,Minnesota,United States 213
1 Abbeville,Alabama,United States 2567
2 Abbeville,Georgia,United States 2789
3 Abbeville,Louisiana,United States 12272
4 Abbeville,Mississippi,United States 436

Blending US and GA Datasets

Now we merge the datasets. We’ll rename some columns while we’re at it.

us_ga = pd.merge(ga, us, on="Temp Canonical", suffixes=('_ga','_us'), how="inner" )
us_ga.rename(columns={'Criteria ID': 'ga:cityId', 'POPESTIMATE2017': '2017'}, inplace=True)

ga:cityId Temp Canonical 2017
0 1012873 Anchorage,Alaska,United States 294356
1 1012874 Anderson,Alaska,United States 337
2 1012875 Angoon,Alaska,United States 452
3 1012876 Atqasuk,Alaska,United States 244
4 1012878 Bethel,Alaska,United States 6456

Sort the cities into buckets by population.

a = us_ga['2017'].tolist()
a.sort() #cut points must be monotonically increasing for pd.cut
p = sum(a) / 5
curr_percentile = p
q = []
q.append(min(a) - 1)
r = 0
for each in a:
    if r < curr_percentile:
        r = r + each
    else:
        #crossed a quintile of total population: record a bucket edge
        q.append(each + 1)
        r = r + each
        curr_percentile = curr_percentile + p
q.append(max(a) + 1)

labels = ['small town', 'town', 'large town','city','major city']

us_ga['quintile'] = pd.cut(us_ga['2017'], q, include_lowest=True, labels=labels)

#check that our buckets are roughly equal
us_ga.groupby('quintile')['2017'].sum()
small town    43610647
town          43637694
large town    43716638
city          44237984
major city    42709542
Name: 2017, dtype: int64

#GA throws an error if there are columns in your csv that you don't use
us_ga = us_ga.drop(columns=['Temp Canonical', '2017'], axis=1)
#export to CSV
us_ga.to_csv('us-geo-ga.csv', index=False)
#we dropped the index column from the CSV
ga:cityId quintile
0 1012873 city
1 1012874 small town
2 1012875 small town
3 1012876 small town
4 1012878 small town

Google Analytics Data Import

From here on we are following Google’s documentation starting at Step 3. If you need more detailed instructions, I encourage you to go there.

You can just grab the finished CSV and follow along from here.

  1. Create a Custom Dimension with session scope.
  2. Create a Data Set (under Admin > Property > Data Import) where you assign the newly created Custom Dimension to the import.
  3. At the end of the Data Set creation dialogue, copy the dimension ID (it will look like ga:dimension10).
  4. Open the CSV, replace “quintile” with the dimension ID and save it again as a CSV.
  5. Upload the CSV manually: under Admin > Property > Data Import, select the Data Set you created previously, then choose Manage Uploads.
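Step 4 can also be done in pandas instead of a text editor. Note that `ga:dimension10` below is just a placeholder; use whatever dimension ID GA actually gave you in Step 3:

```python
import pandas as pd

# the exported CSV has two columns: ga:cityId and quintile
df = pd.DataFrame({'ga:cityId': [1012873, 1012874],
                   'quintile': ['city', 'small town']})

# swap the placeholder header for your actual dimension ID
df = df.rename(columns={'quintile': 'ga:dimension10'})
df.to_csv('us-geo-ga.csv', index=False)
```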

And there, you’re done.
