“Big Data” is another one of those buzzwords that seem to be everywhere these days. We hear stories regularly about how fast the world’s data grows and how big it’s going to be by 20xx. Vendors then reason that we should buy their wares to cope. This infographic is typical:
I have several deep professional connections to big data, going back decades, so when I say I think a lot of it is manufactured silliness, I’m hoping you’ll pause before laughing me off.
The fact is, most of the “data” that’s exploding is not hard-won intellectual treasure for the ages; it’s marginal stuff like the viewing history on Fred Flintstone’s deleted Netflix account. More than big data, we’re experiencing a “big crud” wave, because we’re pack rats. This comic has it right:
I’m not claiming that all big data is worthless; some amazing things become possible at the scale of billions of records. For Netflix, maybe Fred Flintstone’s viewing history is valuable. Maybe. However, big data is only an asset if we can derive some value from it. And an awful lot of big data doesn’t pass that smell test, either because our tools are inadequate, or because the data becomes stale, or because it wasn’t particularly interesting data to start with.
The value we want to derive is insight.
If you’re willing to be serious about the big data wave, then find the best-of-breed tools that push what’s possible. I recommend Perfect Search, for example; a tool that runs a query 100x to 1000x faster than Google or Oracle, on a dataset 100x bigger, is the kind of tool that you need. And of course there are tools like Hadoop and Google BigQuery and Amazon’s bulk load and Glacier and … Consider capturing value from big data while it’s in flight, and not storing it at all.
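To make the “in flight” idea concrete, here is a minimal sketch in Python. It assumes the insight you want is a simple summary (count, mean, variance) of a numeric stream, and it uses Welford’s online algorithm so that each record is looked at once and then dropped. The names `RunningStats`, `summarize`, and `trade_sizes` are made up for illustration; they don’t come from any of the products mentioned above.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass
class RunningStats:
    """Summary of a stream, updated one record at a time (Welford's method)."""
    count: int = 0
    mean: float = 0.0
    _m2: float = 0.0  # running sum of squared deviations from the mean

    def update(self, value: float) -> None:
        # Fold one record into the summary; the record itself is not kept.
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (value - self.mean)

    @property
    def variance(self) -> float:
        # Sample variance; reported as 0.0 when there are fewer than two records.
        return self._m2 / (self.count - 1) if self.count > 1 else 0.0


def summarize(stream: Iterable[float]) -> RunningStats:
    """Consume a stream once, keeping only the running summary."""
    stats = RunningStats()
    for value in stream:
        stats.update(value)
    return stats


def trade_sizes() -> Iterator[float]:
    """Hypothetical feed standing in for a real-time source (assumption)."""
    yield from (100.0, 250.0, 75.0, 300.0)


if __name__ == "__main__":
    stats = summarize(trade_sizes())
    print(stats.count, stats.mean, stats.variance)
```

Memory use stays constant no matter how many records flow past, which is the whole point: you keep the insight and skip the storage bill. Whether a summary like this is enough depends on the questions you expect to ask later.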
If you don’t want to surf the wave, then I have a relatively easy solution. It’s called the delete button. Go watch an episode of “Hoarders” and tell me I’m wrong. :-)
 I worked in the backup industry for over a decade, including on Backup Exec and NetBackup, which collectively owned most of the world’s backups. When hundreds or thousands of clients stream over InfiniBand to media servers backed by petascale tape farms, and then use backups for security auditing and disaster recovery planning and regulatory compliance, that’s big data.
I also worked in the search industry. We used to get requirements like “We need to index 380 billion tweets. How long would that take?” Or, “We’d like to index each trade on the New York Stock Exchange, the FTSE, and the Tokyo Stock Exchange. We want to do it in real time, and support thousands of queries per second at the same time.” Yep. I wasn’t kidding about the world needing Perfect Search technology.
Now I work at Adaptive Computing, which happens to A) make the scheduling software that runs the largest supercomputers on the planet; and B) sell cloud management software that’s at the core of some of the world’s largest private cloud deployments. Each of these markets generates serious big data war stories.
 I know, I know. Deleting isn’t easy. You have to know what can be deleted and what can’t. You have regulatory compliance issues. I still claim that getting better at deleting is easier than getting better at big data. That’s probably a good subject for another post…
3 thoughts on “Big Crud Isn’t Big Data”
Data that isn’t valuable today could be critical tomorrow, and getting rid of it is irreversible. The very nature of data forces us to become digital packrats, accumulating and maintaining bits (pun intended) of cruft for what seems like an incomprehensible period of time. With storage getting cheaper and cheaper, there’s not much disincentive to do so.
I was really disappointed that Microsoft backed off of its ambitious WinFS project. It would have helped home users tame some of this ever-increasing data.