Big Crud Isn’t Big Data

“Big Data” is another one of those buzzwords that seem to be everywhere these days. We regularly hear stories about how fast the world’s data is growing and how big it’s going to be by 20xx. Vendors then reason that we should buy their wares to cope. This infographic is typical:


I have several deep professional connections to big data[1], going back decades, so when I say I think a lot of it is manufactured silliness, I’m hoping you’ll pause before laughing me off.

The fact is, most of the “data” that’s exploding is not hard-won intellectual treasure for the ages; it’s marginal stuff like the viewing history on Fred Flintstone’s deleted Netflix account. More than big data, we’re experiencing a “big crud” wave, because we’re pack rats. This comic has it right:

Do Androids Browse (For Electric Sheep)?

The movie Blade Runner is based on a Philip K. Dick novel entitled “Do Androids Dream of Electric Sheep?”

Perhaps some new questions should be added to this classic…

In an interesting example of science fiction becoming reality, a group of researchers is now creating a sort of world wide web for the robots of the world. Whether or not androids dream, they may soon be able to use social networks for robots, and use public, internet-accessible resources to get their day-to-day work done. The initiative is called “RoboEarth”:

I believe this sort of technological evolution is the wave of the future. It represents a promising confluence of cloud computing, distributed architecture, big data, Hadoop-like map-reduce, supercomputing, ubiquitous internet connectivity, and the every-device-has-an-IP-address promise of IPv6. It would be nice if my next Roomba didn’t have to relearn the floorplan of my house, but could simply download knowledge that the older model has laboriously developed. I’ll bet over the next decade, the market will discover hundreds of variations on that theme.

I just hope we’re smart enough to stop before robots start frittering away their time clicking cows on Facebook… :-)

Big Data In Motion

I’ve been at Cloud Expo this week, listening to lots of industry hoopla about building cloud-centric apps, managing clouds, purchasing hardware for clouds, buying private clouds from public cloud providers, and so forth.

Photo credit: aquababe (Flickr)

One interesting decision made by the organizers of the conference was to bring “big data” under the same conference umbrella. There’s a whole track here about big data, and it gets mentioned in almost every presentation.

And I’ve sensed a shift in the wind.

A few years ago, and even a few months ago, “big data” was all about mining assets in a data warehouse. You accumulated your big data over time. It sat in a big archive, and you planned to analyze it. You spun up Hadoop or used some other map-reduce-style tool to crunch for days or weeks until you achieved some analytical goal.

What I’m hearing now is an acknowledgement that an important use case for big data–perhaps the most important use case–has little to do with data at rest. Instead, it recognizes that you’ll never have time to go back and sift through a vast archive; you have to notice trends by analyzing data as it streams past and disappears into the bit bucket. The data is still big, but the bigness has more to do with volume/throughput, and less to do with cumulative size.

This has interesting implications. Algorithms that were written on the assumption that you can corral the data set under analysis need to be replaced by ones based on statistical sampling; exactness needs to give way to fuzziness.
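To make the shift from exact to sampled analysis concrete, here is a minimal sketch of reservoir sampling, one well-known way to keep a fixed-size uniform random sample of a stream you can only see once. The function names and the toy stream of integers are my own illustrations, not something presented at the conference.

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int, rng: random.Random) -> List[T]:
    """Keep a uniform random sample of k items from a stream of unknown length.

    Memory use is O(k) no matter how much data streams past.
    """
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


def estimate_mean(sample: List[float]) -> float:
    """Approximate the stream's mean from the sample (fuzzy, not exact)."""
    return sum(sample) / len(sample)


if __name__ == "__main__":
    # Hypothetical stream: a million readings we never store in full.
    readings = (float(n) for n in range(1_000_000))
    sample = reservoir_sample(readings, k=1000, rng=random.Random(42))
    print(f"estimated mean from {len(sample)} samples: {estimate_mean(sample):.1f}")
```

The estimate will not match the true mean exactly, which is the point: you trade exactness for an answer you can compute while the data is still in motion.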

Interestingly, I think this will make computer-driven data analysis much more similar to the way humans process information. As I’ve said elsewhere, when faced with a difficult design problem, a smart question to ask is: how does Mother Nature solve it?

What makes high-value questions?

Perfect Search (where I used to work) makes a search engine that performs/scales orders of magnitude better than competitors like Solr/Lucene with Hadoop, FAST, Autonomy, and Google Search Appliances. This makes it a best-of-breed tool for many big data problems. It can do on one box what competitors need an entire rack of hardware to pull off. And usually that one box still runs an order of magnitude faster.

semantic network for "big data"

photo credit: metaroll (Flickr)

Despite the compelling value, sales have ramped more slowly than Perfect Search would like (ain’t it always the case…). Some reasons have to do with marketing, but I’ve recently had another insight that feels compelling to me.

My insight is this: high-value questions demand insight, not fact retrieval.

This might seem like old hat. After all, there’s a reason why business intelligence is a market segment unto itself, and why IBM is betting its corporate future on analytics. I think BI is going after the right kind of thing, but a lot of that community has lost its way and become little more than glorified reporting.

Here’s why.

Question categories

Questions that are interesting in the information age have answers that fall into three broad categories:

  • Unknowable
  • Known
  • Discoverable

Why is chocolate so awesome? Unknowable.

What is the population of Bangladesh? Known.

How can I sell more widgets to housewives between the ages of 25 and 40? Discoverable.

For structured data, the preferred way to get known answers is a DBMS (or a NoSQL DB, maybe). For unstructured data, Google’s full-text indexing is state-of-the-art (and Perfect Search’s is a quantum improvement). But nowadays, looking up known answers is passé. The world needs tools to do it, but the technology is not especially interesting.

Do our BI tools discover anything?

The central value proposition of big data is inseparably connected to discoverable answers. These gems are fundamentally different from facts waiting to be sliced; they’re rational guesses based on deduction and supported by rigorous data analysis.

In other words, if we’re not building big data solutions that hypothesize rather than report, we’re underdelivering. We call it data science, right? Isn’t the scientific method all about hypotheses and testing?
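As a rough illustration of what “hypothesize rather than report” could look like in code, here is a minimal sketch of a permutation test. It takes a hypothesis (say, “a promotion raises weekly widget sales”) and checks how often a difference as large as the observed one would show up by chance. The function name and the sales figures are made up for illustration; this is one simple statistical technique, not a description of any particular product.

```python
import random
from statistics import mean
from typing import Sequence, Tuple


def permutation_test(
    a: Sequence[float], b: Sequence[float], trials: int = 10_000, seed: int = 0
) -> Tuple[float, float]:
    """Two-sided permutation test on the difference of means.

    Returns (observed difference, estimated p-value).
    """
    rng = random.Random(seed)
    observed = mean(a) - mean(b)
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(trials):
        # Under the null hypothesis the group labels are interchangeable,
        # so shuffle them and recompute the difference.
        rng.shuffle(pooled)
        diff = mean(pooled[: len(a)]) - mean(pooled[len(a):])
        if abs(diff) >= abs(observed):
            hits += 1
    # Add-one correction keeps the estimated p-value above zero.
    return observed, (hits + 1) / (trials + 1)


if __name__ == "__main__":
    # Hypothetical weekly widget sales, with and without a promotion.
    with_promo = [132, 141, 128, 150, 139, 145]
    without_promo = [120, 118, 131, 125, 122, 127]
    diff, p = permutation_test(with_promo, without_promo)
    print(f"observed lift: {diff:.1f} widgets/week, p-value ~ {p:.4f}")
```

A dashboard would stop at showing the two averages. The test goes one step further and says whether the hypothesis survives contact with the data.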

Business intelligence products and services that show pretty dashboards or reports are not really delivering insight; they’re exposing information and depending on the human intelligence in the minds of the users to provide the hypotheses and analysis that turn it into insight. Sometimes that happens, if a graph shows something interesting and noteworthy–but a lot of times, minutiae overwhelm, and BI is a waste of the customer’s money.

Enterprise search struggles, as an industry, because it’s trying to sell drill bits to customers who want holes, and it’s forgotten that it’s the hole, not the bit, that makes the customer passionate (thanks, Zig Ziglar, for the analogy). In other words, it too is depending on the customer to provide the analysis that turns data into insight.

A new kind of big data technology

We can do better.

I propose a new kind of data analytics product/process/service that implements the scientific method on big data. Click here for an overview presentation.