A handy trick for using google sheets regex
File under: anecdotes from my love affair with Google Sheets.
If you’ve ever had to consolidate a bunch of messy data into a few stem patterns, read on.
I recently found myself needing to consolidate a bunch of Google Analytics page view data into a shortened list. The client had not made use of the ‘Exclude URL Query Parameters’ feature, so the page views included session IDs and other dynamic parameters that turned a few pages into a list of 1,500 unique URLs. I knew I could do this with regular expressions, and I thought to check if Sheets has a way to do it since Google includes support for regular expressions in many of their products. I was happy to find that Sheets was no exception. If you are not familiar with regular expressions, they are instances of a standard syntax for pattern-matching text. Check out this post for some basic how-tos: Regular Expressions 101.
Find and replace
The find and replace function in Google Sheets has a checkbox ‘Search using regular expressions’. Next to the checkbox, there is a link to their help page. Dive in there if you want to get into the real nitty-gritty. I’m going to describe one simple use of this functionality that doesn’t require in-depth knowledge of regular expressions.
This is a sample of the data I started with:
Page | Pageviews |
---|---|
/products/calendar_new?arrival=2017-02-04&nights;=4&session;=Nb1lmU0p | 1 |
/products/calendar_new?arrival=28+Jan+2017&nights;=4&session;=M0tpd0I0 | 1 |
/products/calendar_new?arrival=19+Jan+2017&nights;=6&session;=KZjV0cksz | 1 |
/products/calendar_new?arrival=21+Oct+2016&nights;=2&session;=b3I2SzUE | 1 |
/products/new?arrival=11+Oct+2016&nights;=4&session;=EVjNMZn4 | 1 |
/products/new?arrival=1+Jan+2017&nights;=6&session;=MR1lCR09aV | 1 |
/products/new?arrival=1+Nov+2016&nights;=4&session;=JeHFpZ2o | 1 |
/products/new?arrival=1+Oct+2016&nights;=7&session;=GRVV2SU9 | 1 |
To replace the “arrival” value in the URL, I use a regular expression that looks like this:
arrival=.+?&
Here’s a breakdown of what that says:
- Look for the text “arrival=”
- . and + are special characters in regular expressions. The pattern .+ looks for a sequence of one or more of any character
- The ? is also a special character – in this case, it indicates to stop searching after the first instance it finds of the & character
- Altogether, it says to look for “arrival=” followed by anything until you reach an ampersand.
One thing you might be wondering: if ? is a special character, how do you match a ? in text? To do this, use a front slash to escape the ?, like this \?. There are a number of other special characters that need to be escaped. For example, periods and plus signs need to be escaped if you are searching for them as text, e.g. \. and \+
Test it out
If you want to test out a regular expression to see if it matches what you are looking for, www.regexr.com is a great, free tool. It also has a concise reference on how to write regular expressions, including more detail on escape characters.
Here’s what my pattern looks like in the Find and replace box:
Notice that I am replacing it with “arrival=&” – I didn’t want to remove the parameter entirely, I just wanted to get rid of the dynamic values. You could also replace with nothing to get rid of the parameter completely. Next, I do the following replacements:
- Find nights=.+?& and replace with nights=&
- Find session=.+ and replace with session=
Once I’m done replacing, I run a pivot to get summary totals on the consolidated URLs:
Page | Pageviews |
---|---|
/products/calendar_new?arrival=&nights;=&session;= | 4 |
/products/new?arrival=&nights;=&session;= | 4 |
How cool is that? With Excel, I used to do this kind of thing with ridiculously complicated text formulas. Did I mention how much I love Google Sheets?