GA4 Path Analysis with BigQuery

This article details the process of building two BigQuery tables for path analysis, with a focus on creating Looker Studio reports that visualize user journeys through page views and events. It accompanies a GitHub repository featuring code for automating these transformations using Google Dataform.

I’m often frustrated by GA4’s limited ability to visualize user journeys through a website, and I know many people feel the same. Understanding how users navigate from blogs to product pages, which paths lead to conversions, or where they go after viewing service pages should be straightforward. Clear insights from these paths could offer significant value.

GA4 does have the Path Exploration report, but it is a bit clunky, and the data is not exportable. So when a client recently asked me about path exploration reporting, I thought to myself, “I’ve got all the data I need in BigQuery, how hard can it be?”

It turns out, it is pretty hard. I found it difficult for various reasons, but the main one is what I dubbed “the angels are in the details” principle. 

Usually, when we analyze data, we are most interested in aggregated metrics that pertain to a dimension. We tend to abstract away from individual users and look at the top traffic sources, landing pages, keywords, etcetera. But with path behavior, the most common paths are boring and teach us little. I can pretty much guarantee that the number one path on your website looks like this:

/

In other words, it’s a single-page visit to your home page. As I dug into path behavior, I realized that I could learn little from paths like that, but a lot from exploring detailed path behavior. The goal of charts and graphs is typically to highlight trends and simplify, but sometimes, the angels are in the details.

In this and subsequent articles, I will share the methods I’ve developed for finding the angels.

Google Dataform

I’m going to be doing the BigQuery data transformation in Google Dataform. Google Dataform is part of the Google Cloud Platform – there are no fees to use Dataform itself, but the data transformations will incur some BigQuery analysis fees. What those fees are depends very much on how many events your GA4 property collects. For twooctobers.com, which is what I will be using as an example, we get about two thousand events per day and the analysis fees are pennies per month.

The reason I’m using Dataform is because it facilitates automating data transformations in BigQuery and integrates with GitHub source control. It is also very helpful for organizing projects and testing and documenting code.

I will be sharing Dataform code examples, but I will not be spending a lot of time explaining how Dataform works. If you are not familiar with it, I recommend adding it to your ‘things-I-need-to-learn’ list. Or, if you are already comfortable with dbt, converting my code samples to dbt syntax should be fairly straightforward. One feature of Dataform I will point out, because it is super-cool, is that you can define a query as ‘incremental’. This updates a table without any risk of duplicating rows. You can schedule the query to run daily, and set the date range of the query to several days to allow for the GA4 export data to settle.

Creating your own Dataform repository from my github repository

(Note, if all of this sounds too complicated, hire us and we’d be happy to set it up and get it working for you. If you mostly care about reporting on user paths in Looker Studio, I also created a Looker Studio BigQuery custom query version, but be mindful that it could get expensive if you run it on high-volume GA4 properties or for long date ranges.)

I’ve shared my path analysis Dataform project on github, and will walk through the components below. To work with the code, at a bare minimum you’ll need a Google Cloud Project with billing enabled, and an export of your GA4 data in BigQuery. Assuming you have those already:

It is also possible to create a Dataform repository without connecting to github. If you want to do that, you can just manually recreate & copy my files in your repository.

Turning GA4 Event Data into Path Sequences

sequenced_events.sqlx

In my first pass at doing this, I created a single query that generated a table suitable for reporting in Looker Studio. Then, when I started working with the data, I realized that that schema was good for that purpose, but pretty limiting otherwise, so I broke it out into multiple stages. This query extracts just the data we need for path analysis from the GA4 BigQuery export, making subsequent steps easier and saving on analysis costs. Think of it as your path-analysis foundation. I’ll show you a few things you can build on top of the foundation, but the possibilities are limitless. 

This step is executed by the ‘sequenced_paths.sqlx’ file and creates/updates a table named the ‘name’ property in the ‘config’ setting at the top of the file. By default, Dataform creates tables in the ‘dataform’ dataset, but you can change that in the workflow_settings.yaml file. 

Here is a data dictionary of the table it creates:

  • date [date] – the date on which the session started
  • session_id [string] – the GA4 ga_session_id parameter is not actually unique, so this is a concatenation of user_pseudo_id and ga_session_id
  • event_timestamp [integer] – the timestamp of the event, directly from the event record
  • page_path [string] – the page path extracted from the full URL (page_location) associated with a page_view event. This populates with the event name if you have modified the query to include events besides page_views
  • seq_num [integer] – the sequence number of the page_view (or other event). The first page_view has a sequence number of 1, the second is 2, and so on.
  • session_source_medium [string] – equates to the Session source / medium dimension in GA4

See comments in the code for additional details.

Extracting Path Data from the page_location Parameter

extract_path.sqlx

This is a User-defined Function (UDF) that is part of an excellent library of BigQuery utilities authored by Google and community contributors. The UDF is referenced in the above query and removes the domain name and URL parameters from the GA4 page_location parameter.

Structuring Path Sequences for Looker Studio Reporting

session_paths.sqlx

Whereas the previous table has one row for each step, this query builds a table where one row equals a session and each step is broken out into its own column. In my query, I am getting the first ten steps in a session, but you can easily modify the last part of the query to add more.

Here is a data dictionary of the table it creates:

  • date [date]
  • session_id [string] – the GA4 ga_session_id parameter is not actually unique, so this is a concatenation of user_pseudo_id and ga_session_id
  • session_source_medium – equates to the Session source / medium dimension in GA4
  • step_1 … step_10 [string] – columns for each step.
  • all_steps [string] – this is a concatenation of all steps in a sequence. This field makes it possible to search for sequences with a particular page somewhere in the sequence.
  • number_of_steps [integer] – the number of steps in a given session’s path sequence. This is useful if you want to filter for paths longer than three steps, for example

Reporting on User Paths in Looker Studio

Below is an example of a Looker Studio report you can build with the session_paths table.

What’s in the report?

  1. A drop-down filter for step_1, which has been renamed to ‘Landing page’.
  2. A drop-down filter for session_source_medium, which has been renamed to ‘Source / medium’.
  3. A slider to filter on the number of steps in a sequence – this is connected to the dimension number_of_steps.
  4. A search box tied to the dimension all_steps. It defaults to a Contains search, so searching will bring up paths that include your search term anywhere in the sequence.
  5. The dimensions columns in the report are ‘step_1’ through ‘step_5’. Each has been renamed to be a bit friendlier. I also added conditional formatting to make it a little easier to interpret. Blank values mean a user left the site – e.g. the first row above describes users who landed on the home page, went to the /contact/ page, completed a ‘contact us submission’ event, then left.
  6. A count of session_ids.

To build a report like this:

  1. Add the session_paths table as a data source
    1. Add Data > BigQuery
    2. Find and select the table
  2. Insert a table chart
    1. Make sure the data source you just added is the data source for the table
    2. Select Steps 1 through 5 as your dimensions (I picked 5, but you could do more if you make the report wider or enable horizontal scrolling on your table).
    3. Rename Step 1 to “Landing page”
    4. Select Record Count as your metric, and rename it sessions. Since there  is one row in the table per session, these are equivalent.

Fuss about as much as you’d like with formatting. Personally, this is my favorite part 🙂

Additional Analysis with GA4 Path Data

Custom Queries

For ad-hoc queries, I find the sequenced_events table very useful. It’s great for answering questions like, “what are the pages that precede or follow a particular page view?” It is possible to query the GA4 export table directly to answer this, but much easier with sequenced_events.

Here’s an example of a query to report on previous and next pages for a given page:

    
     WITH page_sequences AS (
  SELECT
    session_id,
    seq_num,
    page_path,
    COALESCE(LAG(page_path) OVER (PARTITION BY session_id ORDER BY seq_num),"entrance") AS prev_page_path,
    COALESCE(LEAD(page_path) OVER (PARTITION BY session_id ORDER BY seq_num),"exit") AS next_page_path
  FROM `yourproject.sandbox.sequenced_events`
  WHERE date BETWEEN "2024-09-10" AND "2024-09-16"
)
SELECT
  page_path,
  prev_page_path,
  next_page_path,
  COUNT(*) AS count
FROM page_sequences
WHERE page_path = '/' -- Replace with your desired page_path
GROUP BY page_path, prev_page_path, next_page_path
ORDER BY count DESC;
    
   

Note that the output of this query is great for building a sankey diagram in Looker Studio. With a few adjustments you can replace the dates and page_path with Looker Studio custom query parameters to make a dynamic dashboard!

You may also want to modify my sequenced_events and session_paths queries to add additional dimensions such as device category or geography to sessions. I wanted to keep my example relatively simple, but you could also compare paths and path lengths based on other factors.

Insights from AI

I got good results feeding path data to ChatGPT and asking for insights. I’ll be sharing that process in another article, but here is a teaser of the sort of thing I got back.

Analyzing the user paths that include /blog/ pages on Two Octobers’ website reveals several important insights about user engagement, content effectiveness, and potential areas for improvement. Here are the key findings:

1. User Engagement with Blog Content

      • Users who visit /blog/ pages are likely seeking information related to marketing trends, apprenticeship, and help with features of GA4. The presence of multiple paths involving /blog/ indicates that this content attracts significant traffic, suggesting that users find it valuable.

    2. Common Paths Including /blog/

        • After visiting /blog/, users often navigate to other pages, such as:
              • /about-us/

              • /services/

              • /team/

          • This indicates that users may be seeking more context about the company, its values, and leadership after consuming news content. The inclusion of /services/ is particularly notable, as it suggests an interest in the company’s offerings following blog consumption.

        3. Drop-Off Rates

            • Some paths show abrupt drop-offs either at /blog/ or immediately after navigating away from it. Understanding where users leave the site can help identify potential issues
            • If a user accesses a specific /blog/ article but does not proceed to further pages, it could suggest that the article did not meet their expectations or that the next logical steps weren’t clear.

          It keeps going, but that gives you a taste. I’m not worried about being replaced by AI (yet), but that’s a really great starting point for sharing insights with a client.

          Starburst Diagrams

          As I described at the outset, sequence behavior is really hard to visualize. One approach I am experimenting with is starburst diagrams. Using the plotly visualization library in python, you can make a dynamic starburst that allows you to expand and collapse nodes to explore paths to or from a particular content type or page. I will be sharing how to do this in a future article.

          I also think it would be interesting to use this data to create nodes and weights on a network or force-directed tree diagram, but I haven’t tried that yet.

          I would really appreciate feedback and suggestions on what I have shared, in the comments below or by contacting me directly. In my research, I was not able to find a lot of in-depth content on this topic. I like what I have done so far, but I feel like there is so much I have yet to explore and I would love to connect with anyone who is interested in collaborating.

          Nico Brooks

          Nico loves marketing analytics, running, and analytics about running. He's Two Octobers' Head of Analytics, and loves teaching. Learn more about Nico or read more blogs he has written.

          Recent Posts

          Analytics Roundup – Updates from October 2024

          ChatGPT traffic in the GA house! Plus new features in GA4 and understanding GTM first-party…

          13 hours ago

          Five Great Plug-and-Play SEO Dashboards

          Preview five great dashboards for SEO reporting and analysis, and find the one that works…

          3 weeks ago

          Digital Marketing Updates: October 2024

          Ways Google is emphasizing the value of authenticity in organic results, the intro of search…

          1 month ago