Wednesday, January 23, 2013

Big Data & NoSQL Databases: The Market Trend


Over the past couple of years there has been a lot of technology news around NoSQL databases, especially with the Big Data story revving up. Some of them are MongoDB, Cassandra, HBase, CouchDB, Redis, Membase, Neo4j, Accumulo, TripleStore, DynamoDB, etc. Most of these NoSQL databases can be categorized into a few groups, as follows:
  1. Key-Value Stores:
    This technology uses a hash table in which a unique key points to a particular item of data. The key-value model is the simplest and easiest to implement, but it is inefficient when you are only interested in querying or updating part of a value.
    Examples: Tokyo Cabinet/Tyrant, Redis, Voldemort, Oracle BDB, Amazon SimpleDB, Riak
    Typical Applications: Content caching (Focus on scaling to huge amounts of data, designed to handle massive load), logging, etc.
    Strengths:
    Fast lookups
    Weakness:
    Stored data has no schema
  2. Column family store:
    This type was created to store and process very large amounts of data distributed over many machines. There are still keys but they point to multiple columns. The columns are arranged by column family.
    Examples: Cassandra, HBase
    Typical Applications: Distributed file systems
    Strengths:
    Fast lookups, good distributed storage of data
    Weakness:
    Very low-level API
  3. Document store:
    These were inspired by Lotus Notes and are similar to key-value stores. The model is basically versioned documents that are collections of other key-value collections. The semi-structured documents are stored in formats like JSON. Document databases are essentially the next level of key-value stores: they allow nested values associated with each key and support querying on those values more efficiently.
    Examples: CouchDB, MongoDB, Elasticsearch
    Typical Applications: Web applications, Content Management systems
    Strengths:
    Tolerant of incomplete data
    Weakness:
    Query performance, no standard query syntax
  4. Graph Databases:
    This model follows a flexible graph model which can scale across multiple machines. It does not have the tables of rows and columns or the rigid structure of SQL. NoSQL databases in general do not provide a high-level declarative query language like SQL, in order to avoid processing overhead. Rather, querying these databases is data-model specific. Many of the NoSQL platforms allow for RESTful interfaces to the data, while others offer query APIs.
    Examples: Neo4j, InfoGrid, InfiniteGraph
    Typical Applications: Social networking, recommendations, etc.
    Strengths: Graph algorithms e.g. shortest path, connectedness, n degree relationships, etc.
    Weakness: Has to traverse the entire graph to achieve a definitive answer. Not easy to cluster.
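
The four data models above can be sketched with plain Python structures. This is only an illustration under stated assumptions: the dictionaries stand in for real stores, and all names (`kv_store`, `cf_store`, `documents`, `friends`, `find_by_city`, `degrees_of_separation`) are hypothetical and not taken from any product's API.

```python
import json
from collections import deque

# 1. Key-value store: an opaque value behind a unique key.
kv_store = {}
kv_store["user:42"] = json.dumps({"name": "Ada", "city": "London"})

# Updating one field means reading, decoding and rewriting the whole value,
# which is the inefficiency noted for partial updates.
record = json.loads(kv_store["user:42"])
record["city"] = "Paris"
kv_store["user:42"] = json.dumps(record)

# 2. Column family store: row key -> column family -> columns.
cf_store = {
    "user:42": {
        "profile": {"name": "Ada", "city": "Paris"},
        "activity": {"last_login": "2013-01-23"},
    }
}
# A single column can be addressed without touching the rest of the row.
cf_store["user:42"]["activity"]["last_login"] = "2013-01-24"

# 3. Document store: nested, semi-structured documents; fields may be missing.
documents = [
    {"_id": 1, "name": "Ada", "address": {"city": "Paris"}},
    {"_id": 2, "name": "Grace"},  # no address: incomplete data is tolerated
]


def find_by_city(docs, city):
    """Return documents whose nested address.city equals `city`."""
    return [d for d in docs if d.get("address", {}).get("city") == city]


# 4. Graph database: nodes plus edges; queries are traversals.
friends = {
    "Ada": ["Grace", "Linus"],
    "Grace": ["Ada", "Ken"],
    "Linus": ["Ada"],
    "Ken": ["Grace"],
}


def degrees_of_separation(graph, start, goal):
    """Breadth-first search returning the number of hops, or None."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None
```

Note how the query style differs per model: a key lookup, a column address, a filter on a nested field, and a traversal. That is what "data-model specific" querying means in practice.
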
Even though there are several of these NoSQL databases in the market, I am not sure they can be compared apples to apples; I consider most of them specialized for very specific horizontal use cases or vertical needs. Hence there is no single NoSQL database that can solve your whole enterprise-wide problem, and the choice of NoSQL database would be made by different departments within the organization.

NoSQL databases are not a replacement for the conventional RDBMS. They are expected to supplement or augment the data for other business needs. It is also expected that in the coming years some of the established database vendors such as Oracle, Sybase, Microsoft and IBM will take over some of the active NoSQL databases and merge them into their portfolios.


The worldwide NoSQL software market is expected to reach $3.4 billion by 2018, growing at a CAGR of 21% between 2013 and 2018. The NoSQL market is expected to generate $14 billion in revenues over the period 2013 – 2018.
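
As a rough sanity check on these figures, the yearly revenues implied by the 2018 target and the growth rate can be backed out. The sketch below assumes (my assumption, not stated in the forecast) that the 21% rate compounds uniformly once per year over the five steps from 2013 to 2018; `implied_revenues` is a hypothetical helper name.

```python
def implied_revenues(final_value, cagr, years):
    """Yearly values growing at `cagr` and ending at `final_value`.

    `years` counts calendar years, so there are `years - 1` growth steps.
    """
    start = final_value / (1 + cagr) ** (years - 1)
    return [start * (1 + cagr) ** i for i in range(years)]


# Figures from the text: $3.4 billion in 2018, 21% CAGR, 2013 through 2018.
revenues = implied_revenues(3.4, 0.21, 6)
for year, revenue in zip(range(2013, 2019), revenues):
    print(f"{year}: ${revenue:.2f}B")
print(f"Cumulative 2013-2018: ${sum(revenues):.1f}B")
```

Under that assumption the cumulative total comes out a little over $13 billion, in the same ballpark as the quoted $14 billion.
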

The NoSQL market has been very active in the past year, with several venture capital fundings, mergers & acquisitions and new product offerings:
  • September 2012 – In-Q-Tel, the venture investment arm of the U.S. Intelligence Community, invests in 10gen, developer of the MongoDB open source database;
  • August 2012 – Sqrrl, a National Security Agency spin-off startup, raises $2 million to develop the NoSQL database Accumulo;
  • July 2012 – NuoDB raises $10 million to develop a cloud NoSQL database that behaves like traditional SQL;
  • June 2012 – Cloudant launches a NoSQL data layer service for Windows Azure;
  • May 2012 – 10gen secures $42 million in venture funding;
  • January 2012 – Amazon launches DynamoDB, a new NoSQL data service;
  • January 2012 – Oracle announces the availability of Oracle Big Data Appliance and partners with Cloudera to provide an Apache Hadoop distribution and tools for the Big Data Appliance;
  • November 2011 – Cloudera Inc., the provider of Apache Hadoop-based data management software and services, raises $40 million;
  • November 2011 – Basho, the company behind Riak, raises $5 million.

Some Use Cases for NoSQL:

  • Logging/Archiving. Log-mining tools are handy because they can access logs across servers, relate them and analyze them.
  • Social Computing Insight. Many enterprises today give their users the ability to do social computing through message forums, blogs, etc.
  • External Data Feed Integration. Many companies need to integrate data coming from business partners. Even if the two parties conduct numerous discussions and negotiations, enterprises have little control over the format of the data coming to them. Also, there are many situations where those formats change very frequently – based on the changes in the business needs of partners.
  • Front-end order processing systems. Today, the volume of orders, applications and service requests flowing through different channels to retailers, banks, insurance providers, entertainment service providers, logistics providers, etc. is enormous. These requests need to be captured without any interruption whenever an end user makes a transaction from anywhere in the world. Afterwards, a reconciliation system typically pushes them to back-end systems and updates the end user on his/her order status.
  • Enterprise Content Management Service. Content Management is now used across companies’ different functional groups, for instance, HR or Sales. The challenge is bringing together different groups using different meta data structures in a common content management service.
  • Real-time stats/analytics. Sometimes it is necessary to use the database as a way to track real-time performance metrics for websites (page views, unique visits, etc.). Tools like Google Analytics are great but not real-time, so sometimes it is useful to build a secondary system that provides basic real-time stats. Other alternatives, such as 24/7 monitoring of web traffic, are a good way to go, too.
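
To make the real-time stats use case concrete, here is a minimal in-memory sketch of the kind of write-heavy counters such a secondary system keeps. The class and method names (`RealTimeStats`, `record_hit`, `snapshot`) are hypothetical, and a Python dictionary stands in for the NoSQL store; a real deployment would keep the counters in the database itself.

```python
from collections import Counter, defaultdict


class RealTimeStats:
    """In-memory stand-in for a counter store (illustrative only)."""

    def __init__(self):
        self.page_views = Counter()        # page -> total hits
        self.visitors = defaultdict(set)   # page -> distinct visitor ids

    def record_hit(self, page, visitor_id):
        """One cheap write per request: bump a counter, add to a set."""
        self.page_views[page] += 1
        self.visitors[page].add(visitor_id)

    def snapshot(self, page):
        """Current totals for a page, readable at any moment."""
        return {
            "views": self.page_views[page],
            "unique_visitors": len(self.visitors.get(page, ())),
        }


stats = RealTimeStats()
stats.record_hit("/home", "alice")
stats.record_hit("/home", "bob")
stats.record_hit("/home", "alice")
print(stats.snapshot("/home"))
```

Every request is a small write keyed by page, with no joins involved, which is the access pattern described in the summary that follows.
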

What Type of Storage Should You Use?

Here’s a short summary that might help you make your decision:
NoSQL
  • Storage should be able to deal with very high load
  • You do many write operations on the storage
  • You want storage that is horizontally scalable
  • Simplicity is good, as in a very simple query language (without joins)
RDBMS
  • Storage is expected to be high-load, too, but it mainly consists of read operations
  • You want performance over a more sophisticated data structure
  • You need a powerful SQL query language


The other big news is around cloud databases and query services such as Google BigQuery, Amazon Redshift and Cloudera Impala.
This big data space should be even hotter by the end of 2013, as these offerings would solve many of the big data and analytical problems for businesses. We would also see several investments from the established vendors in this space, and some consolidation through M&A by the end of 2014 or early 2015. A nice space to watch!

News from Feb 2014: IBM buys NoSQL cloud provider Cloudant
http://www.computerworld.com/s/article/9246543/IBM_buys_NoSQL_cloud_provider_Cloudant 
