
Database Continuum On The Cloud - From Schemaless To Full-Schema

A recent paper by Mike Stonebraker and others compared relational and columnar databases in a parallel configuration with MapReduce. The paper concludes that MapReduce is easy to configure and easy to use, whereas the other data stores, relational and columnar databases, pay the upfront price of organizing the data but outperform MapReduce at runtime. The study does highlight that the chosen option does not necessarily dictate or limit the scale, as long as other attributes such as an effective parallelism algorithm, B-tree indices, main-memory computation, and compression can help achieve the desired scale.

The real issue, which is not being addressed, is that even if the chosen approach does not limit the scale, it still significantly impacts the design-time decisions that developers and architects have to make. These upfront decisions limit the functionality of the applications built on these data stores and reduce the overall design agility of the system. Let's look at a brief history of the evolution of DBMS, a data mining renaissance, and what we really need in order to design a data store that makes sense from the consumption viewpoint and not the production viewpoint.

Brief history of evolution of DBMS

Traditionally, relational database systems were designed to meet the needs of transactional applications such as ERP, SCM, and CRM, also known as OLTP. These database systems provided a row store, indexes that work for selective queries, and high transactional throughput.

Then came the BI age, which required accessing all the rows but fewer columns and applying aggregate functions such as sum and average to the data being queried. A relational DBMS did not seem to be the right choice, but the vendors figured out creative ways to use the same relational DBMS for BI systems.

As the popularity of BI systems and the volume of data grew, two kinds of solutions emerged. One still used the relational DBMS but accelerated performance via innovative schemas and specialized hardware. The other, the columnar database, used a column store instead of a row store. A columnar DBMS stores data grouped by column, so a typical BI query can read all the rows but only the few columns it needs, without touching the rest of each row. Columnar vendors also started adding compression and main-memory computation to accelerate runtime performance. The overall runtime performance of BI systems certainly got better.
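The difference between the two layouts can be sketched in a few lines of Python. This is only an illustration of the idea, not how any particular DBMS is implemented; the sample records and the helper `to_columns` are made up for the example.

```python
# Hypothetical sample data; the field names are made up for illustration.
rows = [
    {"order_id": 1, "region": "EU", "amount": 120.0},
    {"order_id": 2, "region": "US", "amount": 80.0},
    {"order_id": 3, "region": "EU", "amount": 50.0},
]


def to_columns(rows):
    """Pivot a list of row dicts into a dict of column lists."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# Row store: every full row is visited even though only one field is used.
row_total = sum(row["amount"] for row in rows)

# Column store: only the "amount" column is read; the others are never touched.
column_total = sum(columns["amount"])
```

Both totals are the same; what changes is how much data has to be read to compute them, which is where a column store gains on aggregate-heavy BI queries.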

Both approaches, row-based and columnar, still required ETL - a process to extract data out of the transactional systems, apply some transformation functions, and load the data into a separate BI store. They did not solve the issue of "design latency" - the upfront time consumed in designing a BI report due to the required transformation and the series of complicated steps needed to model a report.

Companies such as Greenplum and Aster Data decided to solve some of these legacy issues. Greenplum provides design-time agility by adopting a dump-all-your-data approach and applying the transformation on the fly, only when needed. Aster Data has three layers to address the query, load, and execute aspects of the data. These are certainly better approaches that use parallelism really well and have cloud-like behavior, but they are still designed to patch up the legacy issues and do not provide a clean design-time data abstraction.

What do we really need?

MapReduce is powerful because it is extremely simple to use. The developer writes only two functions, map and reduce, and the framework takes care of splitting the input and grouping the intermediate results. Such schemaless approaches have lately grown in popularity because developers don't want to lock themselves into a specific data model. They also want to explore ad hoc computing before optimizing the performance. There are also extreme scenarios, such as FriendFeed using the relational database MySQL to store schemaless data. MapReduce has a very low barrier to entry. On the other hand, the fully-defined schema approach of relational and columnar DBMS offers great runtime performance once the data is loaded and indexed, both for transactional access and for executing BI functions such as aggregation and averaging.
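To show how little the developer has to write, here is a minimal single-process sketch of the programming model in Python. It is a toy, not Hadoop or any real framework: `run_mapreduce` is a hypothetical driver that stands in for the distributed machinery, and word counting is just the usual example workload.

```python
from collections import defaultdict


def map_fn(record):
    """Emit a (word, 1) pair for every word in one input record."""
    for word in record.split():
        yield word.lower(), 1


def reduce_fn(key, values):
    """Combine all values emitted for one key into a single result."""
    return key, sum(values)


def run_mapreduce(records, map_fn, reduce_fn):
    """Hypothetical in-memory driver: map every record, group by key, reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    return dict(reduce_fn(key, values) for key, values in groups.items())


counts = run_mapreduce(["a b a", "B c"], map_fn, reduce_fn)
```

Note that nothing here declares a schema: the input is just lines of text, and the structure is imposed by `map_fn` at the moment the data is read.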

What we really need is a continuum from a schemaless to a full-schema database, based on the context, action, and access patterns of the data. The right approach is a declarative, abstracted persistence layer for accessing and manipulating the database, optimized locally for various actions and access patterns. This will allow developers to fetch and manipulate the data independently of the storage and access mechanism. For example, developers can design an application where a single page can perform a complex structured and unstructured search, create a traditional transaction, and display rich analytics information from a single logical data store, without worrying about what algorithms are being used to fetch and store data or how the system is designed to scale. This might require a hybrid data store architecture that optimizes the physical storage of data for certain access patterns and, for other patterns, uses redundant storage replicated in real time and other mechanisms such as accelerators, so that it can provide unified data access to the applications upstream.
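As a rough illustration of what such a layer might look like to a developer, here is a toy Python sketch. Everything in it is an assumption made for the example: the class `HybridStore` and its methods `put`, `get`, and `total` are hypothetical names, and a real system would replicate across machines rather than across two dictionaries.

```python
class HybridStore:
    """Toy facade: one logical store, two physical layouts kept in sync."""

    def __init__(self):
        self._rows = {}     # key -> record, suited to transactional lookups
        self._columns = {}  # field -> {key: value}, suited to analytic scans

    def put(self, key, record):
        """Write once; the redundant column copy is updated in the same call."""
        # Drop column entries for fields the new record no longer has.
        for field in self._rows.get(key, {}):
            if field not in record:
                del self._columns[field][key]
        self._rows[key] = dict(record)
        for field, value in record.items():
            self._columns.setdefault(field, {})[key] = value

    def get(self, key):
        """Transactional access: fetch one whole record by key."""
        return dict(self._rows[key])

    def total(self, field):
        """Analytic access: sum one numeric field across all records."""
        return sum(self._columns.get(field, {}).values())


store = HybridStore()
# Records need not share the same fields (the schemaless end of the continuum).
store.put("o1", {"customer": "acme", "amount": 120.0})
store.put("o2", {"customer": "globex", "amount": 80.0, "note": "rush"})
```

The caller only says what it wants (`get` a record, `total` a field) and never chooses between the row copy and the column copy; that choice is the part the persistence layer would optimize behind the scenes.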

Schemaless databases such as SimpleDB, CouchDB, and Dovetail are in their infancy, but the cloud is a good platform to support their key requirements - incremental provisioning and progressive structure. The cloud is also a great platform for the full-schema DBMS, since it offers utility-style incremental computing to accelerate runtime performance. A continuum on the cloud may not be that far-fetched after all.
