If you're reading this blog and you don't subscribe to Nathan Gilliatt's treasure trove of intelligent insight and good data stop reading and go subscribe.
Now, Nathan and I are normally on the same page on most topics, but perhaps never as much as today. His post on bad data in social media metrics is dead on. A few key points:
"Lately, I'm hearing more about social data that's intentionally tainted. If you're looking for meaning in social media data, you may have to deal with adversaries.
At the recent Social Media Analytics Summit, Dana Jacob gave a talk on the spam that finds its way into the search results of social media analysis platforms, skewing the numbers. One tidbit that Dana shared to illustrate the challenge: If you consider all of the creative misspellings, there are 600 quintillion (6 × 10^20) ways to spell Viagra. So removing all of the spam from your data is a challenge.
- Spam seems to come in two flavors, neither of which will help you understand public opinion or online coverage. One is designed to fool people, to get them to click a link. It may lead to malware or fraud, or to some sort of product for sale. The other is designed to fool search engines with keywords and links embedded in usually irrelevant text. It's usually obvious to a human reader, but the hope seems to be that some search engines will count the links in their ranking of the target site.
- Gaming analytics platforms
Another presenter outlined a more direct challenge to the social media analyst when he described his system to game analytics systems with content farms and SEO tactics. He talked about using weaknesses in analytics systems to plant information in them. One slide described his methods as "weaponizing information in a predictive system," which doesn't leave a lot of room for exaggeration.
Beyond the crowdmapping context, can you detect opposition personas that post false reports in social media? It's a standard tactic in the government/political arena, but it could hit you in business, too. All you need is a motivated opponent."
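That eye-popping misspelling count comes from simple combinatorics: each character can be rendered many ways (case changes, leetspeak digits, look-alike symbols, inserted punctuation), and the choices multiply. Here's a minimal sketch of the idea; the substitution sets below are purely illustrative assumptions, and the real 6 × 10^20 figure presumably also counts insertions, accented characters, and other tricks.

```python
# Illustrative, assumed per-character substitution sets -- not the actual
# sets behind the 600-quintillion figure cited in the talk.
subs = {
    "v": ["v", "V", "\\/"],
    "i": ["i", "I", "1", "!"],
    "a": ["a", "A", "@", "4"],
    "g": ["g", "G", "9"],
    "r": ["r", "R"],
}

def variant_count(word, subs):
    """Count spelling variants by multiplying the choices per character."""
    n = 1
    for ch in word:
        # Characters with no substitution set contribute a single choice.
        n *= len(subs.get(ch, [ch]))
    return n

print(variant_count("viagra", subs))  # 3 * 4 * 4 * 3 * 2 * 4 = 1152
```

Even with these tiny toy substitution sets, a six-letter word yields over a thousand variants; allow a dozen choices per character plus inserted characters, and counts in the quintillions follow quickly. This is why naive keyword filters can't keep spam out of social media datasets.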
Several recent projects we've worked on suggest that roughly 60% of the raw data coming in from commercial sources (names you would all recognize) may be invalid. That is all the more reason, as volumes of data skyrocket, that we need better visibility into what is going into the data. That's what we'll be talking about at the AMEC Measurement Conference in June.
In the meantime, we're looking for more examples. If you've encountered invalid data, let us know.