
Database Continuum On The Cloud - From Schemaless To Full-Schema

A recent paper by Mike Stonebraker and others compared relational and columnar databases with MapReduce in a parallel configuration. The paper concludes that MapReduce is an easy-to-configure and easy-to-use option, whereas the other data stores, relational and columnar databases, pay an upfront price to organize the data but outperform MapReduce in runtime performance. The study does highlight that the chosen option does not necessarily dictate or limit scale, as long as other attributes such as an effective parallelism algorithm, B-tree indices, main-memory computation, and compression can help achieve the desired scale.

The real issue, which is not being addressed, is that even if the chosen approach does not limit scale, it still significantly impacts the design-time decisions that developers and architects have to make. These upfront decisions limit the functionality of the applications built on these data stores and reduce the overall design agility of the system. Let's look at a brief history of the evolution of DBMS, a data mining renaissance, and what we really need to design a data store that makes sense from the consumption, and not the production, viewpoint.

Brief history of evolution of DBMS

Traditionally, relational database systems were designed to meet the needs of transactional (OLTP) applications such as ERP, SCM, and CRM. These database systems provided row-stores, indexes that work well for selective queries, and high transactional throughput.

Then came the BI age, which required accessing all the rows but fewer columns, and applying mathematical functions such as aggregation and averaging to the data being queried. A relational DBMS did not seem to be the right choice, but vendors figured out creative ways to use the same relational DBMS for BI systems.

As the popularity of BI systems and the volume of data grew, two kinds of solutions emerged: one that still used a relational DBMS but accelerated performance via innovative schemas and specialized hardware, and the other, the columnar database, which used a column-store instead of a row-store. A columnar DBMS stores data grouped by column, so a typical BI query can read fewer columns across all the rows in a single read operation. Columnar vendors also started adding compression and main-memory computation to accelerate runtime performance. The overall runtime performance of BI systems certainly got better.
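To make the layout difference concrete, here is a toy illustration (not a real DBMS, and the table and field names are made up): the same table held as a row-store and as a column-store, with a BI-style aggregate that only needs one column.

```python
# Row-store: each record is kept together - good for OLTP point lookups.
rows = [
    {"id": 1, "region": "EU", "revenue": 120.0},
    {"id": 2, "region": "US", "revenue": 80.0},
    {"id": 3, "region": "EU", "revenue": 200.0},
]

# Column-store: each column is kept together - good for analytic scans.
columns = {
    "id": [1, 2, 3],
    "region": ["EU", "US", "EU"],
    "revenue": [120.0, 80.0, 200.0],
}

# A query like SELECT AVG(revenue) reads one contiguous column
# instead of stepping through every row and skipping unused fields.
avg_revenue = sum(columns["revenue"]) / len(columns["revenue"])
```

In a real columnar engine the per-column layout also makes compression far more effective, since values in one column tend to be similar.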

Both approaches, row-based and columnar, still required ETL: a process to extract data out of the transactional systems, apply transformation functions, and load the data into a separate BI store. They did not solve the issue of "design latency", the upfront time consumed to design a BI report due to the required transformation and a series of complicated steps to model a report.
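A minimal ETL sketch makes the "design latency" point visible; the table and field names below are hypothetical, but the shape is typical: every transformation decision is made up front, before any report exists.

```python
# Extract: rows pulled from a transactional (OLTP) source.
source = [
    {"order_id": 1, "amount_cents": 1999, "country": "de"},
    {"order_id": 2, "amount_cents": 550, "country": "us"},
]

def transform(row):
    # Typical ETL-time decisions: unit conversion, normalization, renaming.
    # All of these must be designed before the first BI report can run.
    return {
        "order_id": row["order_id"],
        "amount": row["amount_cents"] / 100.0,
        "country": row["country"].upper(),
    }

bi_store = []  # stand-in for the separate BI warehouse

def load(rows):
    bi_store.extend(transform(r) for r in rows)

load(source)
```

Approaches like Greenplum's, described next, defer the transform step and apply it on the fly at query time instead.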

Companies such as Greenplum and Aster Data decided to solve some of these legacy issues. Greenplum provides design-time agility by adopting a dump-all-your-data approach and applying transformations on the fly, only when needed. Aster Data has three layers to address the query, load, and execute aspects of the data. These are certainly better approaches that use parallelism well and have cloud-like behavior, but they are still designed to patch up legacy issues and do not provide a clean design-time data abstraction.

What do we really need?

MapReduce is powerful because it is extremely simple to use: a developer writes only two functions, map and reduce, while the framework takes care of splitting the input and shuffling the intermediate results. Such schemaless approaches have lately grown in popularity because developers don't want to lock themselves into a specific data model. They also want to explore ad hoc computing before optimizing for performance. There are even extreme scenarios such as FriendFeed using the relational database MySQL to store schema-less data. MapReduce has a very low barrier to entry. On the other hand, the fully-defined-schema approach of relational and columnar DBMS offers great runtime performance once the data is loaded and indexed, both for transactional access and for executing BI functions such as aggregation, average, and mean.
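The two-function model can be sketched as a single-process word count; this is an illustration of the programming model only, not Hadoop or any real distributed runtime. The user supplies map and reduce, and the "framework" here is a few lines that split the input and shuffle intermediate pairs by key.

```python
from collections import defaultdict

def map_fn(document):
    # User-supplied map: emit (key, value) pairs for one input record.
    for word in document.split():
        yield (word.lower(), 1)

def reduce_fn(word, counts):
    # User-supplied reduce: combine all values that share one key.
    return (word, sum(counts))

def map_reduce(documents):
    # The framework's job: run map, shuffle by key, then run reduce.
    shuffled = defaultdict(list)
    for doc in documents:
        for key, value in map_fn(doc):
            shuffled[key].append(value)
    return dict(reduce_fn(k, v) for k, v in shuffled.items())

result = map_reduce(["the cloud", "the data the cloud"])
# -> {'the': 3, 'cloud': 2, 'data': 1}
```

Notice that no schema is declared anywhere; the structure lives entirely in the map and reduce functions, which is exactly the low barrier to entry the text describes.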

What we really need is a continuum from a schemaless to a full-schema database based on the context, action, and access patterns of the data. A declarative, abstracted persistence layer to access and manipulate the database, optimized locally for various actions and access patterns, is the right approach. This will allow developers to fetch and manipulate data independent of the storage and access mechanism. For example, developers can design an application where a single page can perform a complex structured and unstructured search, create a traditional transaction, and display rich analytics information from a single logical data store, without worrying about what algorithms are used to fetch and store data or how the system is designed to scale. This might require a hybrid data store architecture that optimizes the physical storage of data for certain access patterns and, for other patterns, uses redundant storage replicated in real time and other mechanisms such as accelerators, to provide unified data access to the applications upstream.
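One way such a unified facade could look is sketched below. Everything here is hypothetical (the class, method names, and sample data are made up for illustration): the application asks for data declaratively, and the routing to a row-store, a column-store, or a search index stays behind one interface.

```python
class DataStore:
    """Hypothetical unified persistence facade over redundant,
    differently-optimized physical stores."""

    def __init__(self, oltp, columnar, search_index):
        self._oltp = oltp            # row-store for transactional lookups
        self._columnar = columnar    # column-store for analytics
        self._search = search_index  # inverted index for text search

    def get(self, entity, key):
        # Transactional point lookup - routed to the row-store.
        return self._oltp[entity][key]

    def aggregate(self, entity, column, fn=sum):
        # BI-style aggregate - routed to the column-store.
        return fn(self._columnar[entity][column])

    def search(self, term):
        # Unstructured search - routed to the inverted index.
        return self._search.get(term, [])

store = DataStore(
    oltp={"orders": {1: {"id": 1, "total": 20.0}}},
    columnar={"orders": {"total": [20.0, 35.0]}},
    search_index={"cloud": [1]},
)
```

The single page described above would call `get`, `aggregate`, and `search` against this one logical store; keeping the redundant physical copies consistent in real time is where the hard engineering lives.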

Schemaless databases such as SimpleDB, CouchDB, and Dovetail are in their infancy, but the cloud is a good platform to support the key requirements of schemaless databases: incremental provisioning and progressive structure. The cloud is also a great platform for full-schema DBMS, offering utility-style incremental computing to accelerate runtime performance. A continuum on the cloud may not be that far-fetched after all.
