Big Data In Motion

I’ve been at Cloud Expo this week, listening to lots of industry hoopla about building cloud-centric apps, managing clouds, purchasing hardware for clouds, buying private clouds from public cloud providers, and so forth.

Photo credit: aquababe (Flickr)

One interesting decision made by the organizers of the conference was to bring “big data” under the same conference umbrella. There’s a whole track here about big data, and it gets mentioned in almost every presentation.

And I’ve sensed a shift in the wind.

Until recently, “big data” was all about mining assets in a data warehouse. You accumulated your big data over time. It sat in a big archive, and you planned to analyze it. You spun up Hadoop or some other MapReduce-style tool and crunched for days or weeks until you achieved some analytical goal.

What I’m hearing now is an acknowledgement that an important use case for big data, perhaps the most important use case, has little to do with data at rest. Instead, it recognizes that you’ll never have time to go back and sift through a vast archive; you have to notice trends by analyzing data as it streams past and disappears into the bit bucket. The data is still big, but the bigness has more to do with volume and throughput, and less to do with cumulative size.

This has interesting implications. Algorithms that were written on the assumption that you can corral the data set under analysis need to be replaced by ones based on statistical sampling; exactness needs to give way to fuzziness.
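To make the sampling idea concrete, here is a minimal sketch of one classic streaming technique, reservoir sampling. This is a generic illustration of the mindset, not any particular vendor's implementation: it maintains a fixed-size uniform sample of a stream whose total length is unknown, so you never need to corral the whole data set.

```python
import random

def reservoir_sample(stream, k, seed=None):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Replace a random slot with probability k / (i + 1), which
            # keeps every item's inclusion probability at exactly k / n.
            j = rng.randrange(i + 1)
            if j < k:
                reservoir[j] = item
    return reservoir

# Sample 100 events from a million-event stream using O(k) memory.
sample = reservoir_sample(range(1_000_000), k=100, seed=7)
```

The answer you get is a statistically sound sample rather than an exact tally: exactness gives way to fuzziness, but the memory cost is fixed no matter how big the stream grows.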

Interestingly, I think this will make computer-driven data analysis much more similar to the way humans process information. As I’ve said elsewhere, when faced with a difficult design problem, a smart question to ask is: how does Mother Nature solve it?

What makes high-value questions?

Perfect Search (where I used to work) makes a search engine that performs and scales orders of magnitude better than competitors like Solr/Lucene with Hadoop, FAST, Autonomy, and the Google Search Appliance. This makes it a best-of-breed tool for many big data problems. It can do on one box what it would take competitors an entire rack of hardware to pull off, and usually that one box still runs an order of magnitude faster.

Image: semantic network for "big data" (photo credit: metaroll, Flickr)

Despite the compelling value, sales have ramped more slowly than Perfect Search would like (ain’t it always the case…). Some reasons have to do with marketing, but I’ve recently had another insight that feels compelling to me.

My insight is this: high-value questions demand insight, not fact retrieval.

This might seem like old hat. After all, there’s a reason why business intelligence is a market segment unto itself, and why IBM is betting its corporate future on analytics. I think BI is going after the right kind of thing, but a lot of that community has lost its way and become little more than glorified reporting.

Here’s why.

Question categories

Questions that are interesting in the information age have answers that fall into three broad categories:

  • Unknowable
  • Known
  • Discoverable

Why is chocolate so awesome? Unknowable.

What is the population of Bangladesh? Known.

How can I sell more widgets to housewives between the ages of 25 and 40? Discoverable.

For structured data, the preferred way to get known answers is a DBMS (or maybe a NoSQL DB). For unstructured data, Google’s full-text indexing is state of the art (and Perfect Search’s is a quantum improvement). But nowadays, looking up known answers is passé. The world needs tools to do it, but the technology is not especially interesting.

Do our BI tools discover anything?

The central value proposition of big data is inseparably connected to discoverable answers. These gems are fundamentally different from facts waiting to be sliced; they’re rational guesses based on deduction and supported by rigorous data analysis.

In other words, if we’re not building big data solutions that hypothesize rather than report, we’re underdelivering. We call it data science, right? Isn’t the scientific method all about hypotheses and testing?
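The hypothesize-and-test loop can be surprisingly lightweight. As a toy illustration (the widget-campaign numbers here are hypothetical, invented purely for the example), a two-proportion z-test can tell you whether two observed rates differ by more than chance, using only aggregate counts rather than the raw data:

```python
from math import sqrt, erf

def two_proportion_z(success_a, n_a, success_b, n_b):
    """Two-sided z-test: are two observed rates plausibly the same?"""
    p_a, p_b = success_a / n_a, success_b / n_b
    # Pooled rate under the null hypothesis that both groups share one rate.
    p = (success_a + success_b) / (n_a + n_b)
    se = sqrt(p * (1 - p) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-tailed p-value from the standard normal CDF.
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# Hypothesis: campaign A converts better than campaign B.
# 120 of 1000 buyers vs. 90 of 1000 buyers.
z, p_value = two_proportion_z(120, 1000, 90, 1000)
```

A tool built this way doesn’t just chart the two conversion rates side by side; it reports whether the difference is signal or noise, which is the beginning of an actual hypothesis-driven answer.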

Business intelligence products and services that show pretty dashboards or reports are not really delivering insight; they’re exposing information and depending on the human intelligence in the minds of the users to provide the hypotheses and analysis that turn it into insight. Sometimes that happens, if a graph shows something interesting and noteworthy. But a lot of the time, minutiae overwhelm, and BI is a waste of the customer’s money.

Enterprise search struggles, as an industry, because it’s trying to sell drill bits to customers who want holes, and it’s forgotten that it’s the hole, not the bit, that makes the customer passionate (thanks to Zig Ziglar for the analogy). In other words, it too depends on the customer to provide the analysis that turns data into insight.

A new kind of big data technology

We can do better.

I propose a new kind of data analytics product/process/service that implements the scientific method on big data. Click here for an overview presentation.