Google Analytics spam – getting a clean view
In Part 1 of our ‘Tackling GA spam’ series, we examined the hot topic of language spam: what it is, how to identify it and, most importantly, how to deal with it. This week we’ll be taking a look at the different types of Google Analytics (GA) spam and how you can deal with them effectively to give you a clean dataset and accurate traffic numbers for your reports.
What is GA spam?
GA spam can appear in a number of forms in your analytics account, but the end result will nearly always be the addition of fake sessions which can artificially inflate your reports, make conversion rates look lower and generally make a mess of things.
Let’s take a look at a few of the key types of GA spam and what they mean for you…
Language spam
In this example, fake sessions show up within your ‘Language’ report (found under the ‘Geo’ dropdown in the ‘Audience’ reports section of your GA account).
In a healthy view, you’d ordinarily see a variety of language codes built on the ISO 639-1 format (e.g. “en-gb” and “en-us”).
When affected by language spam, you will typically see that a message has been injected in place of a language code, such as the following recent example from the presidential campaign in the USA:
In that example, the spam language label registered 526 fake sessions, which will give your reports an artificial traffic increase. Failing to deal with this can create artificial seasonality: the month containing the spam looks unusually strong, so the months that follow appear to perform poorly month on month.
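As a rough illustration of the difference between a healthy language value and an injected message, here is a minimal Python sketch. It assumes a genuine value is a short language code with up to two short subtags; the function names and sample rows are hypothetical, and anything it flags should still be reviewed by eye.

```python
import re

# Assumption: a genuine value is a 2-3 letter language code, optionally
# followed by one or two short subtags (e.g. "en-gb", "zh-hans-cn").
PLAUSIBLE_LANGUAGE = re.compile(r"[a-z]{2,3}(-[a-z0-9]{2,8}){0,2}", re.IGNORECASE)

def is_plausible_language(value: str) -> bool:
    """Return True if the value looks like a real browser language setting."""
    return PLAUSIBLE_LANGUAGE.fullmatch(value.strip()) is not None

def split_language_rows(rows):
    """Split (language, sessions) rows into plausible and suspicious lists."""
    plausible, suspicious = [], []
    for language, sessions in rows:
        target = plausible if is_plausible_language(language) else suspicious
        target.append((language, sessions))
    return plausible, suspicious

# Hypothetical rows copied from a 'Language' report
sample_rows = [
    ("en-gb", 1200),
    ("en-us", 950),
    ("Some injected promotional message spam.example", 526),
    ("(not set)", 14),
]

plausible, suspicious = split_language_rows(sample_rows)
for language, sessions in suspicious:
    print(f"Review: {language!r} ({sessions} sessions)")
```

This is only a heuristic: a short spam label made of a few letters would slip through, so treat it as a first pass rather than a replacement for checking the report yourself.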
Fake landing pages spam
In this example, fake landing page URLs will show up in the ‘Landing Pages’ report (found under the ‘Site Content’ dropdown in the ‘Behaviour’ reports section of your GA account).
This is usually an easy one to spot, as they stick out like a sore thumb amongst your expected list of page URLs, as in the example below:
As with the language spam example, this will add fake sessions to your ‘Landing Pages’ report, skewing your reporting numbers, conversion and bounce rates, etc.
Referral spam
In this example, when looking at your ‘Referral’ report (found under the ‘All Traffic’ dropdown in the ‘Acquisition’ reports section of your GA account) you’ll see the offending domains listed. The good news is, they’re easy to spot.
Typical examples in the past have included “abcdefgi.xyz”, “social-buttons.xyz” and “rank-check.online”, but there are countless variations, and they tend to give themselves away by looking spammy from the word go. You can also sort by ‘Bounce Rate’ to show referral sources with a 100% bounce rate; typically this will contain your spam referrers.
As well as adding fake sessions to your report, this type of spam is designed to get you to visit the offending sites, where you’ll be presented with products for sale and display ads. With this in mind: don’t visit those sites!
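If you export your referral report, the bounce rate check described above can be sketched in a few lines of Python. This is a minimal sketch with a hypothetical row structure and made-up session figures; the domain names are the examples quoted above plus a placeholder for a genuine referrer.

```python
from typing import NamedTuple

class ReferralRow(NamedTuple):
    source: str
    sessions: int
    bounce_rate: float  # percentage, 0-100

def suspected_spam_referrers(rows, min_bounce_rate=100.0):
    """Return sources at or above the bounce rate threshold, busiest first."""
    flagged = [row for row in rows if row.bounce_rate >= min_bounce_rate]
    flagged.sort(key=lambda row: row.sessions, reverse=True)
    return [row.source for row in flagged]

# Hypothetical export rows; the numbers are illustrative only
sample = [
    ReferralRow("partner-site.example", 340, 41.2),
    ReferralRow("social-buttons.xyz", 88, 100.0),
    ReferralRow("rank-check.online", 131, 100.0),
]

print(suspected_spam_referrers(sample))
```

A 100% bounce rate is a hint rather than proof, so a genuine low-traffic referrer can end up on this list. Check each name before adding it to a filter.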
How is it added to my account?
The spam to be found lurking in that dark corner of your GA account is usually injected in one of 2 key ways. These methods are responsible for the 3 types of spam we have listed above.
Ghost visits
Ghost visits are so called because there is no physical visit to your website. This method is usually responsible for the majority of GA spam attacks and follows a simple process: the data is sent directly to Google using the “Measurement Protocol”, which was intended for developers to send raw user interaction data to GA’s servers.
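To see why this kind of hit is often recorded without a sensible hostname, it helps to look at what a hit contains. The sketch below is a minimal illustration, assuming the Universal Analytics Measurement Protocol parameter names (“dh” for document hostname, “dp” for page path, “dr” for referrer). The payloads, property ID and domains are hypothetical placeholders, and nothing is sent anywhere.

```python
from urllib.parse import parse_qs

def hit_hostname(payload: str) -> str:
    """Return the hostname a hit claims to have been recorded on."""
    fields = parse_qs(payload)
    # A hit sent without a 'dh' parameter has no hostname to report
    return fields.get("dh", ["(not set)"])[0]

# Hypothetical payloads: a hit from your own site, and a hit sent
# straight to GA by someone who never loaded your pages.
genuine_hit = "v=1&tid=UA-XXXXX-Y&cid=555&t=pageview&dh=www.example.com&dp=%2Fpricing"
ghost_hit = "v=1&tid=UA-XXXXX-Y&cid=556&t=pageview&dp=%2F&dr=http%3A%2F%2Fsocial-buttons.xyz"

print(hit_hostname(genuine_hit))
print(hit_hostname(ghost_hit))
```

Because the sender typically does not know which hostnames your tracking code really runs on, the hostname is a useful field to filter on.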
Spam crawlers
With this method, your website is crawled by a spambot which impersonates a user visiting your website (in a similar way to how major search engines crawl your website to index your data).
The spambot exits your site and leaves a record that will appear to be a genuine visit in your reports, even adding fake session times and bounce rates in its wake.
It is important to note that these spambots typically ignore any exclusions you have added to your robots.txt file to try to counter this problem.
How can I remove it?
All of the fake data in your GA account caused by the spam types discussed in this article can be removed with the addition of filters to your default reporting view.
Don’t worry, it’s not as complicated as it sounds and we’ve run through each option in a step by step format below.
NB: It’s important to add any new filters to a new reporting view first, to allow you time to test them and ensure they don’t strip out data they shouldn’t. Once you add a filter to a view within your analytics account, the data it strips out from that point onwards is lost from that view for good.
Valid hostnames filter
One way of removing fake referral traffic from a GA account is to set up a “Valid Hostnames” filter. This will allow only genuine referral traffic through to your reports and eliminate ‘Ghost visits’. Genuine referrals should record the hostname as your own domain rather than a spammy domain name or “(not set)”.
Navigate to the ‘Network’ report (found under the ‘Technology’ dropdown in the ‘Audience’ reports section of your GA account).
Set a date range which covers a long period of time (say the last 6 – 12 months) to ensure you uncover all variations of fake hostnames.
Change the ‘Primary Dimension’ to ‘Hostname’ to get a list of all of the hostnames that your traffic claims to have arrived at.
You will now have a list of all hostnames, valid and fake alike. You should see your primary domain listed as one of these hostnames and other properties where your GA tracking code has been added.
This could include subdomains, IP addresses which you know to be yours, payment services and shopping carts, caching services, Google translation services and content delivery networks.
Next we need to make a list of all of the valid hostnames that you want to record GA data for.
Be careful to fully investigate any hostnames you are unsure of before leaving them off your list, as any hostname that is not on it will be excluded by the filter.
Your list should now look something like this:
Now we are going to combine the valid hostnames into a regular expression filter (REGEX) that can be utilised to set up our “Include” filter.
We will be using the “|” (pipe) symbol to separate each hostname, and will need to place a “\” (backslash) before each full stop (.) and hyphen (-) so that they are treated as literal characters.
The resulting REGEX filter pattern from the hostname list above would therefore appear as follows:
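As a minimal sketch of that process, the Python below builds a pattern from a hypothetical list of valid hostnames (the “example” domains are placeholders, so swap in your own). The function name is ours, and it simply follows the pipe and backslash rules described above.

```python
import re

def build_hostname_pattern(hostnames):
    """Join hostnames with '|' and put a backslash before each '.' and '-'."""
    escaped = [h.replace(".", r"\.").replace("-", r"\-") for h in hostnames]
    return "|".join(escaped)

# Hypothetical list of valid hostnames
valid_hostnames = ["example.com", "shop.example.com", "my-checkout.example.net"]

pattern = build_hostname_pattern(valid_hostnames)
print(pattern)  # example\.com|shop\.example\.com|my\-checkout\.example\.net
```

The printed string is what you would paste into the ‘Filter Pattern’ field.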
Set up a new reporting view to allow testing of this filter for 2 – 4 weeks, then go to the ‘Admin’ settings for your view and select the ‘Filters’ option in the ‘View’ column. Now click on the ‘Add Filter’ button:
Choose the ‘Create New Filter’ radio button and give your filter a meaningful name which explains at a glance what it does.
- Choose the ‘Custom filter’ type and the ‘Include’ radio button.
- Under the ‘Filter Field’ dropdown, search for ‘Hostname’ and select it.
- Now, enter your ‘Hostname’ regular expression pattern in the ‘Filter Pattern’ field.
- Do not select the ‘Case-sensitive’ box or any of the additional radio buttons below.
- To get a view of how the filter would affect your data over the last 7 days, click on the ‘Verify this filter’ hyperlink at the bottom of the page (this may return no result if the numbers are small).
- If you get a result, check the list of hostnames this filter will exclude to ensure that nothing is being stripped out in error.
Your new filter should look like the below:
Once you have successfully created your new filter you should monitor its success over the next 2 – 4 week period before making the decision to copy it over to your default reporting view.
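If ‘Verify this filter’ returns no result, you can get a rough preview offline by running your pattern against the hostnames from your ‘Hostname’ report. This is a minimal sketch that assumes the filter matches the pattern anywhere within the hostname and ignores case (as with the ‘Case-sensitive’ box left unticked). The pattern and hostnames below are hypothetical.

```python
import re

def preview_include_filter(pattern, hostnames):
    """Split hostnames into those an Include filter would keep and drop."""
    regex = re.compile(pattern, re.IGNORECASE)
    kept = [h for h in hostnames if regex.search(h)]
    dropped = [h for h in hostnames if not regex.search(h)]
    return kept, dropped

# Hypothetical pattern and hostnames copied from a 'Hostname' report
pattern = r"example\.com|my\-checkout\.example\.net"
seen = [
    "www.example.com",
    "Shop.Example.com",
    "my-checkout.example.net",
    "social-buttons.xyz",
    "(not set)",
]

kept, dropped = preview_include_filter(pattern, seen)
print("Kept:", kept)
print("Dropped:", dropped)
```

Anything in the dropped list that you recognise as your own is a sign that the pattern is missing a hostname.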
Spam crawlers filters
To tackle targeted spam visits made by bots and to capture spam from those who have figured out your hostname, you will need to set up ‘Spam Crawler’ filters which contain a REGEX filter pattern featuring the list of persistent offenders.
This will need to be periodically updated to ensure it continues to include all domains that are actively spamming your GA account, but combined with the ‘Valid Hostnames’ and ‘Language Spam’ filters, will give you a clean dataset.
Make a list of the offending domains currently clogging up your referral reports and add any typical examples being reported by SEOs to use as your filter pattern. Check out great posts which feature the latest filter patterns you can utilise, such as this one from Analyticsedge.com.
There is a character limit in relation to the filter pattern box, but here at Click Consult, we’ve been able to add up to 800 characters before hitting it.
Your filter pattern should look like a longer version of the below example:
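As a minimal sketch, the Python below turns a list of offending domains into one or more filter patterns, starting a new pattern whenever the next domain would push the current one over a character limit. The function name is hypothetical and the default limit of 800 is simply the figure quoted above, so adjust it to whatever limit you actually hit. The domains are the examples mentioned earlier in this post.

```python
def build_spam_patterns(domains, max_length=800):
    """Group escaped domains into '|'-joined patterns within max_length."""
    patterns, current = [], ""
    for domain in domains:
        escaped = domain.replace(".", r"\.").replace("-", r"\-")
        candidate = escaped if not current else current + "|" + escaped
        if current and len(candidate) > max_length:
            # The current pattern is full: store it and start a new one
            patterns.append(current)
            current = escaped
        else:
            current = candidate
    if current:
        patterns.append(current)
    return patterns

spam_domains = ["abcdefgi.xyz", "social-buttons.xyz", "rank-check.online"]

for number, pattern in enumerate(build_spam_patterns(spam_domains), start=1):
    print(f"Spam Crawlers {number}: {pattern}")
```

Each pattern returned would go into its own ‘Spam Crawler’ filter, which keeps a growing list manageable when it is updated.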
Select the new reporting view you’ve just set up to test your ‘Valid Hostnames’ filter, then go to the ‘Admin’ settings for your view and select the ‘Filters’ option in the ‘View’ column. Now click on the ‘Add Filter’ button as you did for your ‘Valid Hostnames’ filter.
Again choose the ‘Create New Filter’ radio button and give your ‘Spam Crawler’ filter/s a name.
- Choose the ‘Custom filter’ type and the ‘Exclude’ radio button.
- Under the ‘Filter Field’ dropdown, search for ‘Campaign Source’ and select it.
- Now, enter your ‘Spam Filter’ REGEX pattern in the ‘Filter Pattern’ field.
- Do not select the ‘Case-sensitive’ box or any of the additional radio buttons below, as before.
- Again, you can get a view of how the filter would affect your data over the last 7 days with the ‘Verify this filter’ hyperlink.
- Check any results to ensure you are happy with what is being stripped out.
Your new filter should look like the below:
As was the process with your ‘Valid Hostnames’ filter, you should monitor its success over the next 2 – 4 week period, copying it over to your default reporting view once you are happy that it is effective.
To find out how to strip out language spam from your GA reports, check out Part 1 of our GA spam series.
Top tip: utilising GA’s built-in bot filtering function will exclude hits from a variety of known spiders and bots. To enable this, select the ‘Admin’ item in the header navigation of your GA account, select the view you want to apply this to from the ‘View’ dropdown and click on ‘View Settings’. Tick the box under the ‘Bot Filtering’ heading and you’re done.
What about my historical data?
If you’ve set up the spam filters suggested above and found them to be effective, you might be wondering how you can also get a clean set of historical data.
Don’t worry, this is possible too and it’s really easy to do by setting up a new segment which your spam filters can be applied to.
To find out how to do this for each of the above examples, check out the ‘What about historical data’ section in our dealing with language spam in Google Analytics post.
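The logic behind such a segment is simply both conditions applied together: keep sessions with a valid hostname, and drop those from a known spam source. As a minimal sketch, here is the same logic applied to exported historical rows in Python. The row structure, patterns and figures are hypothetical, and matching is assumed to be partial and case-insensitive.

```python
import re

def clean_historical_rows(rows, valid_hostname_pattern, spam_source_pattern):
    """Keep rows with a valid hostname that are not from a spam source."""
    valid_host = re.compile(valid_hostname_pattern, re.IGNORECASE)
    spam_source = re.compile(spam_source_pattern, re.IGNORECASE)
    return [
        row for row in rows
        if valid_host.search(row["hostname"])
        and not spam_source.search(row["source"])
    ]

# Hypothetical exported rows
rows = [
    {"hostname": "www.example.com", "source": "google", "sessions": 420},
    {"hostname": "(not set)", "source": "abcdefgi.xyz", "sessions": 37},
    {"hostname": "www.example.com", "source": "rank-check.online", "sessions": 12},
]

clean = clean_historical_rows(
    rows,
    valid_hostname_pattern=r"example\.com",
    spam_source_pattern=r"abcdefgi\.xyz|rank\-check\.online",
)
print(sum(row["sessions"] for row in clean), "clean sessions")
```

Unlike a view filter, this leaves the original data untouched, which is why the same approach is safe to use on historical reports.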