software = science + art + people
2012-11-07
I’ve been at Cloud Expo this week, listening to lots of industry hoopla about building cloud-centric apps, managing clouds, purchasing hardware for clouds, buying private clouds from public cloud providers, and so forth.
One interesting decision made by the organizers of the conference was to bring “big data” under the same conference umbrella. There’s a whole track here about big data, and it gets mentioned in almost every presentation.
And I’ve sensed a shift in the wind.
Years and months ago, “big data” was all about mining assets in a data warehouse. You accumulated your big data over time. It sat in a big archive, and you planned to analyze it. You spun up Hadoop or used some other map-reduce-style tool to crunch for days or weeks until you achieved some analytical goal.
What I’m hearing now is an acknowledgement that an important use case for big data — perhaps the most important use case — has little to do with data at rest. Instead, it recognizes that you’ll never have time to go back and sift through a vast archive; you have to notice trends by analyzing data as it streams past and disappears into the bit bucket. The data is still big, but the bigness has more to do with volume/throughput, and less to do with cumulative size.
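To make that concrete, here is a minimal sketch (my own illustration, not anything demoed at the show) of what single-pass analysis can look like: an exponentially weighted moving average that folds each event into a running trend and then lets the raw value vanish. The stream source and the alpha smoothing factor are assumptions for the example.

```python
def ewma(stream, alpha=0.05):
    """Track a trend over a stream with an exponentially weighted moving average.

    Each value is folded into the running average exactly once and can then
    be discarded; memory use stays constant no matter how long the stream runs.
    """
    avg = None
    for value in stream:
        avg = value if avg is None else alpha * value + (1 - alpha) * avg
        yield avg

# Hypothetical usage: watch a latency trend drift without storing raw samples.
# for trend in ewma(latency_events()):
#     alert_if_above_threshold(trend)
```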
This has interesting implications. Algorithms that were written on the assumption that you can corral the data set under analysis need to be replaced by ones based on statistical sampling; exactness needs to give way to fuzziness.
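Reservoir sampling is one classic example of the sampling-based style: it keeps a fixed-size, uniformly random sample of a stream whose total length you never know, so anything you compute from the sample is an estimate rather than an exact answer. A small sketch, again just illustrative:

```python
import random

def reservoir_sample(stream, k):
    """Maintain a uniform random sample of k items from a stream of unknown length."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)          # fill the reservoir first
        else:
            j = random.randint(0, i)        # uniform over everything seen so far
            if j < k:
                reservoir[j] = item         # keep this item with probability k/(i+1)
    return reservoir
```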
Interestingly, I think this will make computer-driven data analysis much more similar to the way humans process information. As I’ve said elsewhere, when faced with a difficult design problem, a smart question to ask is: how does Mother Nature solve it?