The raw tracking data in BigQuery is one of the most interesting and powerful features offered by GA360 (the paid version of Google Analytics). It allows you to query the tracking data without any kind of limitations or sampling. This article shows basic examples on how to use BigQuery to extract information from the GA data.
Of course, data in BigQuery can be queried using the basic web interface on https://bigquery.cloud.google.com. Nevertheless, it's much more convenient to use the BigQuery API in conjunction with your favorite programming language. For the examples shown below, I used Google Cloud Datalab - a useful web-based tool which combines Python code, documentation and visualization.
A full description of the GA360 BigQuery export schema can be found on https://support.google.com/analytics/answer/3437719.
Import necessary libraries
import datalab.bigquery as bq
import pandas as pd
import matplotlib.pyplot as plt
Sample 1: Get number of users, sessions and pageviews for a given timeframe
Notes:
- in BigQuery the visitId is kept over the 0am break
- the GA interface only counts sessions that have at least one interaction
%%sql
SELECT
date,
EXACT_COUNT_DISTINCT(fullVisitorId) AS Users,
EXACT_COUNT_DISTINCT(CONCAT(STRING(fullVisitorId), STRING(visitStartTime))) AS GASessions,
EXACT_COUNT_DISTINCT(CONCAT(STRING(fullVisitorId), STRING(visitId))) AS BigQuerySessions,
SUM(IF(hits.type ='PAGE', 1, 0)) AS PageViews
FROM TABLE_DATE_RANGE([myProject:123456789.ga_sessions_],
TIMESTAMP('2018-06-01'),
TIMESTAMP('2018-06-05'))
GROUP BY date
date | Users | GASessions | BigQuerySessions | PageViews |
---|---|---|---|---|
20180604 | 23.623 | 28.730 | 28.730 | 74.206 |
20180601 | 15.326 | 19.526 | 19.526 | 58.923 |
20180602 | 14.322 | 18.080 | 18.080 | 52.225 |
20180603 | 16.301 | 21.035 | 21.035 | 61.926 |
20180605 | 19.137 | 23.870 | 23.870 | 65.074 |
Sample 2: Show sessions grouped by source and medium
The totals.visits field contains 1 for visits with at least one interaction hit.
%%sql
SELECT
trafficSource.source,
trafficSource.medium,
SUM(totals.visits) as Sessions
FROM TABLE_DATE_RANGE([myProject:123456789.ga_sessions_],
TIMESTAMP('2018-06-01'),
TIMESTAMP('2018-06-05'))
GROUP BY 1,2
ORDER BY 3 DESC
LIMIT 10
trafficSource_source | trafficSource_medium | Sessions |
---|---|---|
organic | 35.731 | |
cpc | 30.458 | |
newsletter | 13.384 | |
(direct) | (none) | 12.449 |
myPartner1 | partner | 5.278 |
myPartner2 | partner | 2.022 |
bing | cpc | 1.649 |
bing | organic | 1.382 |
myPartner3 | partner | 1.371 |
m.facebook.com | referral | 1.078 |
Sample 3: Show pageviews grouped by page
Pageviews have the hits.type field set to "PAGE".
%%sql
SELECT
hits.page.pagePath,
COUNT(*) as pageviews
FROM TABLE_DATE_RANGE([myProject:123456789.ga_sessions_],
TIMESTAMP('2018-06-01'),
TIMESTAMP('2018-06-05'))
WHERE hits.type = 'PAGE'
GROUP BY hits.page.pagePath
ORDER BY pageviews DESC
LIMIT 10
hits_page_pagePath | pageviews |
---|---|
/shop/page1 | 32.030 |
/home/ | 23.035 |
/shop/customer/account/login/ | 19.960 |
/ | 15.730 |
/news | 14.756 |
/shop/zubehoer | 12.288 |
/shop/ | 11.801 |
/page3 | 10.084 |
/shop/checkout/cart/ | 10.038 |
/shop/ | 7.790 |
Sample 4: Group data by custom dimension
Custom dimension values are stored in a nested field as index, value pairs. User and session custom dimensions are stored on the top level of the session grouped data - hit level custom dimensions are stored for each hit in the hits.customDimensions object.
%%sql
SELECT pagetype, SUM(pageviews) AS pageviews
FROM
(
SELECT
1 AS pageviews,
IF(hits.customDimensions.index = 4, hits.customDimensions.value, null) AS pagetype,
FROM TABLE_DATE_RANGE([myProject:123456789.ga_sessions_],
TIMESTAMP('2018-06-01'),
TIMESTAMP('2018-06-05'))
WHERE hits.type='PAGE'
HAVING Pagetype IS NOT NULL
)
GROUP BY pagetype
ORDER BY pageviews DESC
pagetype | pageviews |
---|---|
content | 70.177 |
listing | 52.203 |
homepage | 35.239 |
product | 34.134 |
Sample 5: Load data into pandas dataframe and plot
BigQuery can be loaded in the widely used pandas dataframe format for further processing or plotting.
First Step: Define Query
%%sql --module myExampleQuery
SELECT
date,
EXACT_COUNT_DISTINCT(fullVisitorId) AS Users,
EXACT_COUNT_DISTINCT(CONCAT(STRING(fullVisitorId), STRING(visitStartTime))) AS GASessions,
EXACT_COUNT_DISTINCT(CONCAT(STRING(fullVisitorId), STRING(visitId))) AS BigQuerySessions,
SUM(IF(hits.type ='PAGE', 1, 0)) AS PageViews
FROM TABLE_DATE_RANGE([myProject:123456789.ga_sessions_],
TIMESTAMP('2018-06-01'),
TIMESTAMP('2018-06-15'))
GROUP BY date
Second Step: Load Data into Pandas dataframe
myDataFrame = bq.Query(myExampleQuery).to_dataframe()
Preview the data
myDataFrame.head()
Plot the data
myPlot = myDataFrame.plot(x="date",y=["GASessions","Users"])
myPlot.set_xticklabels(myDataFrame["date"], rotation=25)
plt.show()