User Paths – A Data Driven Model
Why User Paths?
What’s the most used concept in web analytics?
I bet that if you asked this question around, “session” would be one of more common answers.
It is universally acknowledged that session is one of the key concepts in web analytics and it is very useful in a variety of applications. In certain situations though, we need to consider something more general, don’t we?
In some cases we’d like to analyze interactions over longer periods of time and across many sessions. Marketing attribution is one of the important use cases requiring such a multi-session approach. In an attribution problem, we analyze the traffic sources leading to a conversion and evaluate the impact of each source (basically saying, which sources were essential for that particular conversion and which were redundant).
Certainly, to do so we need to define which sources are even taken into account in the first place! We need to define a user path leading to the conversion.
Attribution is not the only area where the user paths are essential. Think about the possibilities! Behavioral analysis…across many sessions. Identifying pain points…across many sessions. These seem like very powerful tools to enhance the business and they do require finding user paths as a preliminary step.
So, how do we do this in a data-driven way? Let’s dive in!
We’re going to assume the Google Analytics data model, since this is the industry standard. We’ll go ahead and use the GA4 data as Universal Analytics is deprecated anyway.
We’ll need to access the raw data to see the parameters which are obscured in the reports (like clientID), hence the Big Query link is a must.
To illustrate the method, we’re going to use the sample GA4 dataset provided by Google:
Though Google warns us that the data is obfuscated and the internal consistency of the dataset might be somewhat limited, we can work with that for the sake of the illustration.
Let’s get to it and grab some data! All we really need for now is:
- Unique client ID
- Unique session ID
- Timestamp of each session start
- Label for the converting sessions
GA4 data model is strictly event-based and session is more of a derived concept – unfortunately this means that we won’t be able to get all the data we need right away – we’ll have to work a bit harder for that.
Let me just state the query first and explain the details below:
WITH TMP AS ( SELECT user_pseudo_id, ep.value.int_value AS session_id, event_timestamp, CASE WHEN event_name = 'purchase' THEN 1 ELSE 0 END AS purchase FROM `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_*` e,UNNEST(event_params) ep WHERE _TABLE_SUFFIX BETWEEN '20210101' AND '20210131' AND ep.key='ga_session_id' ) SELECT user_pseudo_id, session_id, TIMESTAMP_MICROS(MIN(event_timestamp)) AS session_start, SUM(purchase) AS transactions FROM TMP GROUP BY user_pseudo_id, session_id
So, if you examine the GA4 data model you’ll notice the user_pseudo_id column which will work fine as the unique clientID. sessionID, on the other hand, is somewhat hidden – it’s available in the nested array called event_params.
The array contains key-value pairs of all parameters given for a particular event – we’re looking for the int_value of the ga_session_id parameter.
Since we’re working on the event-level data, we don’t have any session-level statistics – we’ll have to calculate them ourselves.
That’s why we’re creating a temporary table TMP first – we select all events along with their timestamps and use a CASE WHEN statement to distinguish the purchase events representing transactions.
Once we have all that data, we can summarize the temporary table grouping by user_pseudo_id and session_id to get the desired session-level results.
Last but not least, mind the condition for the _TABLE_SUFFIX in the WHERE clause. GA data is stored in table shards (one for each day) – this condition is going to select the data from January 2021 yielding around 120’000 sessions – for the sake of the experiment, this kind of volume will do okay; moreover, the resulting table should be small enough for you to save it as a local CSV file.
Introducing Breakpoints (transactions)
Now that we have the session data, let’s define our user paths. To do so, we’ll find the breakpoints – the moments when one user path ends and another begins.
It’d be just splendid to do this in an objective way that’s universally true, but in reality this would prove very difficult if not undoable. What we’re going to do is create a model based on some assumptions.
To find our first assumption, let’s rephrase the problem first. A user path is a collection of sessions leading to the conversion, right? This means that alternatively we could call it a conversion process. So, when does the conversion process end? When the conversion occurs, of course!
Well, in reality this might not be exactly true. Imagine the situation when a customer buys an item and returns an hour later to buy a related item – in this case it would be reasonable to include that follow-up transaction in the previous user path.
But again, a model is a simplification of reality based on some assumptions. While one could consider a more nuanced approach to the breakpoints which would result in a more complex model, this time we’re going to assume that a conversion definitely ends a conversion process.
Let’s illustrate this with an example! For instance, check out the person with user_pseudo_id = 27488559.4267170544. Let’s call them Alice for convenience. Of course I’m using Excel, by the way, why wouldn’t I?
There’s a bunch of sessions and two transactions associated with Alice. Let’s plot this on the timeline – since we had two transactions, there are two separate conversion processes for this user (I marked the transactions in orange):
Alice’s conversion processes look rather reasonable but there’s one thing that could raise an eyebrow. These two sessions in the very beginning don’t quite match, do they?
See, a lot of time passed between session number 2 and session number 3 – it’s been more than a week! Typically sessions within one customer journey have been more clustered together. Wouldn’t it be sensible to exclude those first sessions and treat them as an entirely separate customer journey?
This observation leads us to the second key assumption of the model. We’re going to assume that prolonged inactivity ends the transaction process.
This, on the other hand, raises another essential question: when does a period of inactivity become prolonged? Which basically means: how do we decide that a transaction process has failed due to inactivity and should be discarded?
To do it we’ll empirically find the distribution of the waiting times and then test each waiting time against it. We’ll say that a waiting time longer than a certain quantile of this distribution ends the current user path. The choice of the breakpoint is up to us – if we want to promote longer customer journeys, we might allow longer inactivity periods and vice versa.
Let’s plot the normalized histogram of waiting times along with a couple of quantiles:
We can see that if a user is going to return at all, it’s most likely to happen within the first couple of hours. Also, we might have been right about Alice’s sessions – the probability of return within the first week is more than 90% so the period of inactivity between second and third session seems a bit excessive.
Let’s include the breakpoints due to inactivity in Alice’s session history; we’ll take the 90% quantile as a inactivity limit (5.79 days, no less).
Now we can plot Alice’s sessions on the timeline again (converting sessions are marked in orange, detected inactivities in red):
Indeed, we can see that the first two sessions have been separated as intended!
Do note that the breakpoints due to transactions are completely independent from the breakpoints due to inactivity. In fact, two breakpoints can occur simultaneously – in this case, the session is going to be orange and red at the same time!
Let’s illustrate this for Alice’s sessions. This time we’re going to set the inactivity limit at 75% quantile (1.79 days):
In this case session number five becomes an isolated one-session customer journey!
Take notice that breakpoints of different types affect the customer journeys in different ways: breakpoint due to inactivity affects the current session (“if inactivity has been detected, start new user path right now”), breakpoint due to conversion, on the other hand, affects the next session (“if transaction has been detected, next session will belong to a new user path”). Be sure to take this into consideration when you put all the elements together!
This about concludes this lengthy blog post, thanks for sticking around to the very end! I hope that you found it interesting and educational (at least a little bit).
In my opinion, the concept of a user path is essential in modern web analytics since grouping sessions in larger building blocks is the “Step 0” in many advanced analyses.
Today, we’ve created a data-driven model of user paths which takes into account behavioral patterns of our customers. We’ve also learned how to change the model parameters to promote longer or shorter conversion processes depending on our preference.
Now that we’ve completed the “Step 0”, we can use this newly acquired knowledge and dive into further analyses. For instance, we can use our custom user paths to build more sophisticated attribution models – even a simplistic position based approach (not to mention the actual data-driven and machine learning models) will benefit from this added layer of complexity.
We can also expand the behavioral analyses. We won’t be confined to a single session anymore! Instead, we’ll be able to analyze the interactions spanning across all sessions on the user path. This way, we might be able to identify the early pain points which are likely to cause the conversion process to fail.
We’re going to discuss these subjects in more detail in the dedicated blog posts. See you next time!