Codecraft

software = science + art + people

Big Crud Isn't Big Data

2013-04-09

“Big Data” is another one of those buzzwords that seem to be everywhere these days. We regularly hear stories about how fast the world’s data is growing and how big it’s going to be by 20xx. Vendors then reason that we should buy their wares to cope. This infographic is typical:

[Infographic: “Data Never Sleeps 2.0”]

I have several deep professional connections to big data[1], going back decades, so when I say I think a lot of it is manufactured silliness, I’m hoping you’ll pause before laughing me off.

The fact is, most of the “data” that’s exploding is not hard-won intellectual treasure for the ages; it’s marginal stuff like the viewing history on Fred Flintstone’s deleted Netflix account. More than big data, we’re experiencing a “big crud” wave, because we’re pack rats. This comic has it right:

image credit: Ryan North (qwantz.com)

I’m not claiming that all big data is worthless; some amazing things become possible at the scale of billions of records. For Netflix, maybe Fred Flintstone’s viewing history is valuable. Maybe. However, big data is only an asset if we can derive some value from it. And an awful lot of big data doesn’t pass that smell test, either because our tools are inadequate, or because the data becomes stale, or because it wasn’t particularly interesting data to start with.

The value we want to derive is insight.

If you’re willing to be serious about the big data wave, then find the best-of-breed tools that push what’s possible. I recommend capturing value from big data while it’s in flight, and not storing it at all.
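
To make “in flight” a little more concrete, here is a minimal Python sketch, not a recommendation of any particular tool. It assumes the data arrives as a stream of numeric readings, and it keeps only a running count, mean, and variance (using Welford's online algorithm), so no individual record is ever stored. The names `RunningStats` and `summarize` are hypothetical, invented for this illustration.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class RunningStats:
    """Summary of a stream, updated one value at a time (Welford's algorithm)."""
    count: int = 0
    mean: float = 0.0
    m2: float = 0.0  # running sum of squared deviations from the current mean

    def update(self, x: float) -> None:
        # Fold one new value into the summary; the value itself is then discarded.
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self) -> float:
        # Sample variance; defined as 0.0 until there are at least two values.
        return self.m2 / (self.count - 1) if self.count > 1 else 0.0


def summarize(stream: Iterable[float]) -> RunningStats:
    """Consume a stream once, keeping only the constant-size summary."""
    stats = RunningStats()
    for value in stream:
        stats.update(value)
    return stats


if __name__ == "__main__":
    # Hypothetical stand-in for a live feed: a generator, never a stored list.
    readings = (float(i % 10) for i in range(100_000))
    result = summarize(readings)
    print(result.count, result.mean, result.variance)
```

The memory used stays constant no matter how long the stream runs, which is the whole point: the insight (here, a mean and a spread) survives, and the crud never lands on disk.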

If you don’t want to surf the wave, then I have a relatively easy[2] solution. It’s called the delete button. Go watch an episode of “Hoarders” and tell me I’m wrong. :-)


[1] I worked in the backup industry for over a decade, including on BackupExec and NetBackup, which collectively owned most of the world's backups. When hundreds or thousands of clients stream over InfiniBand to media servers backed by peta-scale tape farms, and then use backups for security auditing and disaster recovery planning and regulatory compliance, that's big data. I also worked in the search industry. We used to get requirements like "We need to index 380 billion tweets. How long would that take?" Or, "We'd like to index each trade on the New York Stock Exchange, the FTSE, and the Tokyo Stock Exchange. We want to do it in real time, and support thousands of queries per second at the same time." Yep. I wasn't kidding about the world needing Perfect Search technology. Now I work at Adaptive Computing, which happens to A) make the scheduling software that runs the largest supercomputers on the planet; and B) sell cloud management software that's at the core of some of the world's largest private cloud deployments. Each of these markets generates serious big data war stories.

[2] I know, I know. Deleting isn't easy. You have to know what can be deleted and what can't. You have regulatory compliance issues. I still claim that getting better at deleting is easier than getting better at big data. That's probably a good subject for another post...

Comments

  • Jesse Harris, 2013-04-09:

    Data that isn't valuable today could be critical tomorrow, and getting rid of it is irreversible. The very nature of data forces us to become digital packrats, accumulating and maintaining bits (pun intended) of cruft for what seems like an incomprehensible period of time. With storage getting cheaper and cheaper, there's not much disincentive to do so. I was really disappointed that Microsoft backed off of its ambitious WinFS project. It would have helped home users tame some of this ever-increasing data.