Friday, February 28, 2014

Big Data : On the Precipice of a Collapse

Before anyone freaks out, I'm talking about a technology collapse, not a market collapse or the steep downhill slope of a hype curve.  But I guess a "technology collapse" doesn't sound much better.  Maybe "technology consolidation" is a better word, but that's not really what I mean.  Oh well, let me explain...

For anyone that has been through introductory computer science courses, you know that different data structures are suited for different applications.   With a bit of analysis you can classify the cost of an operation against a specific structure, and how that will scale as the size of the data or the input scales. (See Big O Notation if you would like to geek out.)  In Big Data, things are no different, but we are usually talking about performance and scaling relative to additional constraints such as the CAP Theorem.

So what are our "data structures and algorithms" in the Big Data space?  Well, here is my laundry/reading list:
So what does that have to do with an imminent collapse?  Well, market demands are pushing our systems to ingest an increasing amount of data in a decreasing amount of time, while also making that data immediately available to an increasing variety of queries.

We know from our classes that to accommodate the increased variety of queries, we need different data structures and algorithms to service those queries quickly.  That leaves us with no choice but to duplicate data across different data structures and to use different tools to query across those systems.   Often this approached is referred to as "Polyglot Persistence".  That worked, but it left the application to orchestrate the writes and queries across the different persistence mechanisms.  (painful)

To alleviate that pain, people are collapsing the persistence mechanisms and interfaces. Already, we see first-level integrations.  People are combining inverted-indexes and search/query mechanisms with distributed databases. e.g.
  • Titan integrated Elastic Search.
  • Datastax integrated SOLR.
  • Stargate combines Cassandra and Lucene
This is a natural pairing because the queries you can perform efficiently against distributed databases are constrained by the partitioning and physical layout of the data.  The inverted indexes fill the gap, allowing users to perform queries along dimensions that may not align with the data model used in the distributed database.  The distributed database can handle the write-intensive traffic, while the inverted indexes handle the read-intensive traffic, likely with some lag in synchronization between the two. (but not necessarily)

The tight-integration between persistence mechanisms makes it transparent to the end-user that data was duplicated across stores, but IMHO, we still have a ways to go.  What happens if you want to perform an ad-hoc ETL against the data?  Well, then you need to fire up Hive or Pig (Spark and/or Shark), and use a whole different set of infrastructure, and a whole other language to accomplish your task.

One can imagine a second or third-level integration here, which unifies the interface into "big data": a single language that would provide search/query capabilities (e.g. lucene-like queries), with structured aggregations (e.g. SQL-like queries), with transformation, load and extract functions (e.g. Pig/Hive-like queries) rolled into one-cohesive platform that was capable of orchestrating the operations/functions/queries across the various persistence and execution frameworks.

I'm not sure quite what that looks like.  Would it use Storm or Spark as the compute framework? perhaps running on YARN, backed by Cassandra, Elastic Search and Titan, with pre-aggregation capabilities like those found in Druid?

Who knows?   Food for thought on a Friday afternoon.
Time to grab a beer.

(P.S. shout out Philip Klein and John Savage, two professors back at Brown that inspired these musings)



No comments: