Despite the hate that Jetpack gets for being a bloatware plugin, it is one of my favorites and the first thing I set up on a new WordPress install. However, Jetpack does have a few irritating habits that I cannot overlook. One of these is the stats module. The module actually does pretty well, posting data to the wordpress.com dashboard and making it easy for me to glance quickly at the number of visitors I’ve had for the day.
However, every so often the module craps out and logs a large number of visits from crawlers, bots and spiders as legitimate hits, since those particular ones are not in its official list of crawlers, bots and spiders to look out for. To fix this, I went looking for the list so I could add to it. One quick GitHub code search later, I found that the file class.jetpack-user-agent.php is responsible for hosting the list of non-humans to watch for. What I found inside was actually a pretty comprehensive list of software, but one that definitely needed extending.
If you want to do what I did, find the file in your WP installation at –
/wp-content/plugins/jetpack/class.jetpack-user-agent.php
Inside the file, look for the following array variable –
$bot_agents
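For reference, the declaration you’re hunting for looks roughly like this. This is an abridged sketch – the exact set of stock entries varies between Jetpack versions, so don’t copy it verbatim –
$bot_agents = array(
	'alexa', 'baiduspider', 'googlebot', // ... plus the rest of Jetpack's stock list ...
	'yammybot',
);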
You’ll see that the array already contains common bots like alexa, googlebot, baiduspider and so on. However, I deep-dived (meaning I did a Sublime Text search) into my access.log files and found some more. To extend the array, simply look for the last element (which should be ‘yammybot’) and extend it as follows –
'yammybot', 'ahrefsbot', 'pingdom.com_bot', 'kraken', 'yandexbot', 'twitterbot', 'tweetmemebot', 'openhosebot', 'queryseekerspider', 'linkdexbot', 'grokkit-crawler', 'livelapbot', 'germcrawler', 'domaintunocrawler', 'grapeshotcrawler', 'cloudflare-alwaysonline',
Note that you want to leave in the trailing comma, and you want all the entries in lower case. The lower case doesn’t strictly matter, because the PHP function that does the string compare is case-insensitive, but it looks neater. You’ll also notice that I’ve added the precise names of the bots, like ‘grokkit-crawler’ and ‘cloudflare-alwaysonline’, but you can be less specific (say, just ‘crawler’) and save yourself some pain. That will, however, affect your final stats outcome, since a broader entry matches more user agents.
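If you’re wondering why the casing doesn’t matter: the check boils down to a case-insensitive substring match of each entry against the visitor’s user-agent string. I haven’t traced Jetpack’s exact code path, but a check of that kind typically looks something like this sketch (the function name is mine, not Jetpack’s) –
<?php
// Illustrative only – not Jetpack's actual implementation.
function looks_like_a_bot( $user_agent, $bot_agents ) {
	foreach ( $bot_agents as $bot_agent ) {
		// stripos() matches case-insensitively, so 'twitterbot' also catches 'Twitterbot/1.0'.
		if ( false !== stripos( $user_agent, $bot_agent ) ) {
			return true;
		}
	}
	return false;
}

// Example: returns true no matter how the entry or the UA string is cased.
var_dump( looks_like_a_bot( 'Mozilla/5.0 (compatible; Googlebot/2.1)', array( 'googlebot' ) ) );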
Notes –
- Some of the bots are pretty interesting. I saw tweetmemebot, which comes from a company called DataSift that seems to be in the business of trawling social networks for interesting links and providing meaningful insights into them. Another was twitterbot. Why the heck does Twitter need to send out a bot? We submit our links to it willingly! Also interesting were livelapbot, germcrawler and kraken. I have no idea why they’re looking at my site.
- Although Jetpack does not have a comprehensive list of bots, it still does a pretty good job. I found the main culprit of the stats mess in my case: turns out CloudFlare, in order to provide their AlwaysOnline service (which is enabled for my site), crawls all our pages frequently, and that doesn’t sit well with Jetpack. I hope this tweak fixes it.
- Although this fix is in place for now, every time the Jetpack plugin gets updated, all these entries will disappear. That’s why this blog post is both a tutorial for you all and a reminder and diary entry for me to make this change every time I run a Jetpack update. However, if someone can tell me a way to permanently extend Jetpack, or if someone can reach out to the Jetpack team (hey Nitin, why don’t you file a GitHub issue against this?), it’ll be awesome and I’ll be super thankful!
Update – I was trying to be hip, so I forked Jetpack on GitHub, made the changes and then tried to open a pull request. Turns out, I don’t know how to do that, so I opened an issue instead. It sits here.
Twitterbot is actually a good bot in that it needs to fetch the URLs for Twitter Cards (enhanced Tweets for URLs on your site). If you do not use the Twitter Card markup, then you can block Twitterbot in robots.txt.
well, the idea is not to block any bots but to make sure they don’t show up in your analytics… 🙂
True enough. But I was surprised to see Twitterbot called out in this post, because I have used Jetpack for years and I don’t believe it records Twitterbot, which spends a lot of time on my servers. If Twitterbot were triggering the Jetpack statistics, I would expect to see more reported traffic on some newer sites than I do. It might be worth running some experiments to see what can trigger Jetpack (I know the Semalt bot does). I’ll have to think about that.
Maybe you’re right. Maybe Jetpack’s servers ignore Twitterbot for metrics by default. But it doesn’t hurt to have it removed from the data anyway. I’m not hurting Twitterbot’s ability to work, just Jetpack’s ability to see it.
Or am I? I’m not familiar enough with the Jetpack code to say for sure. There’s no way for me to check what is being sent to Jetpack’s servers. That’s troublesome for me, if only because I’d love to know how to edit the data they use to calculate my visitor analytics.
I have a love-hate relationship with Jetpack, but I’m afraid it does not run deep enough for me to say I know exactly what it is doing.
Do share the results of your experiments… Would help me a lot!
If I find time to do it I’ll leave a comment here.