
Database Continuum On The Cloud - From Schemaless To Full-Schema

A recent paper by Mike Stonebraker and others compared relational and columnar databases, in a parallel configuration, with MapReduce. The paper concludes that MapReduce is an easy-to-configure and easy-to-use option, whereas the other data stores, relational and columnar databases, pay the upfront price of organizing the data but outperform MapReduce in runtime performance. The study does highlight the fact that the chosen option does not necessarily dictate or limit scale, as long as other attributes such as an effective parallelism algorithm, B-tree indices, main-memory computation, and compression can help achieve the desired scale.

The real issue, which is not being addressed, is that even if the chosen approach does not limit scale, it still significantly impacts the design-time decisions that developers and architects have to make. These upfront decisions limit the functionality of the applications built on these data stores and reduce the overall design agility of the system. Let's look at a brief history of the evolution of DBMS, a data mining renaissance, and what we really need to design a data store that makes sense from the consumption, and not the production, viewpoint.

A brief history of the evolution of DBMS

Traditionally, relational database systems were designed to meet the needs of transactional applications such as ERP, SCM, and CRM, collectively known as OLTP. These database systems provided a row-store layout, indexes that work well for selective queries, and high transactional throughput.

Then came the BI age, which required accessing all the rows but fewer columns and applying mathematical functions such as aggregation and averaging to the data being queried. A relational DBMS did not seem to be the right choice, but the vendors figured out creative ways to use the same relational DBMS for BI systems.

As the popularity of BI systems and the volume of data grew, two kinds of solutions emerged: one that still used a relational DBMS but accelerated performance via innovative schemas and specialized hardware, and the other, the columnar database, that used a column-store instead of a row-store. A columnar DBMS stores data grouped by column so that a typical BI query can read all the rows but only the few columns it needs in a single read operation. Columnar vendors also started adding compression and main-memory computation to accelerate runtime performance. The overall runtime performance of BI systems certainly got better.
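To make the storage-layout difference concrete, here is a toy sketch in Python; the table, column names, and values are invented for illustration and do not reflect how any particular columnar engine lays out data on disk.

```python
# Toy illustration of why a column-store layout helps a typical BI query.
# The table, column names, and values are made up for illustration only.

rows = [  # row-store: each record kept together, good for OLTP lookups
    {"order_id": 1, "customer": "acme", "region": "east", "amount": 120.0},
    {"order_id": 2, "customer": "globex", "region": "west", "amount": 75.5},
    {"order_id": 3, "customer": "initech", "region": "east", "amount": 42.0},
]

columns = {  # column-store: each column kept together, good for scans and aggregates
    "order_id": [1, 2, 3],
    "customer": ["acme", "globex", "initech"],
    "region": ["east", "west", "east"],
    "amount": [120.0, 75.5, 42.0],
}

# A BI-style query ("average order amount") touches every row but only one column.
avg_from_rows = sum(r["amount"] for r in rows) / len(rows)          # scans whole records
avg_from_columns = sum(columns["amount"]) / len(columns["amount"])  # scans one column

print(avg_from_rows, avg_from_columns)  # both print 79.1666..., but the column-store
# only had to read the "amount" values, which is the storage-level saving.
```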

Both approaches, row-based and columnar, still required ETL: a process to extract data out of the transactional systems, apply transformation functions, and load the data into a separate BI store. They did not solve the issue of "design latency", the upfront time consumed to design a BI report due to the required transformation and the series of complicated steps needed to model a report. A rough sketch of such an ETL pass follows.
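As an illustration of what that ETL step looks like, here is a minimal extract-transform-load pass using in-memory SQLite databases as stand-ins for the transactional system and the BI store; the table names, the aggregation, and the report shape are assumptions made only for this sketch.

```python
# Minimal, hypothetical ETL pass: extract from an OLTP source, transform, load into a BI store.
import sqlite3

oltp = sqlite3.connect(":memory:")   # stand-in for the transactional system
bi = sqlite3.connect(":memory:")     # stand-in for the separate BI store

oltp.execute("CREATE TABLE orders (id INTEGER, region TEXT, amount_usd REAL)")
oltp.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(1, "east", 120.0), (2, "west", 75.5), (3, "east", 42.0)])

bi.execute("CREATE TABLE sales_by_region (region TEXT, total_usd REAL)")

# Extract: pull the raw rows out of the transactional store
rows = oltp.execute("SELECT region, amount_usd FROM orders").fetchall()

# Transform: reshape the data into the form the BI report expects
totals = {}
for region, amount in rows:
    totals[region] = totals.get(region, 0.0) + amount

# Load: write the transformed data into the BI store
bi.executemany("INSERT INTO sales_by_region VALUES (?, ?)", totals.items())
print(bi.execute("SELECT * FROM sales_by_region").fetchall())
```

The design latency comes from having to design this pipeline, and the target schema, before the first report can exist at all.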

Companies such as Greenplum and Aster Data decided to solve some of these legacy issues. Greenplum provides design-time agility by adopting a dump-all-your-data approach and applying the transformation on the fly only when needed. Aster Data has three layers to address the query, load, and execute aspects of the data. These are certainly better approaches that use parallelism well and have cloud-like behavior, but they are still designed to patch up the legacy issues and do not provide a clean design-time data abstraction.

What do we really need?

MapReduce is powerful since it is extremely simple to use: the developer writes essentially two functions, map and reduce, which the framework applies to input that has been split across machines. Such schemaless approaches have lately grown in popularity because developers don't want to lock themselves into a specific data model. They also want to explore ad hoc computing before optimizing for performance. There are also extreme scenarios such as FriendFeed using the relational database MySQL to store schema-less data. MapReduce has a very low barrier to entry. On the other hand, the fully-defined schema approach of relational and columnar DBMS offers great runtime performance once the data is loaded and indexed, both for transactional access and for executing BI functions such as aggregation, average, and mean.
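To show how little a developer has to write in this model, here is a minimal word-count job expressed with plain Python functions; it mimics only the map and reduce steps and is not tied to any specific MapReduce framework, which would also handle splitting the input, shuffling, and distributing the work.

```python
# Minimal sketch of the MapReduce programming model in plain Python:
# a word-count job expressed as a map function and a reduce function.
from collections import defaultdict

def map_fn(document):
    # emit (key, value) pairs for each word in the input split
    for word in document.split():
        yield word.lower(), 1

def reduce_fn(word, counts):
    # combine all values emitted for one key
    return word, sum(counts)

def run_job(documents):
    grouped = defaultdict(list)          # stands in for the framework's shuffle/group step
    for doc in documents:
        for key, value in map_fn(doc):
            grouped[key].append(value)
    return dict(reduce_fn(k, v) for k, v in grouped.items())

print(run_job(["the quick brown fox", "the lazy dog", "the fox"]))
# {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}
```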

What we really need is a continuum from a schemaless to a full-schema database, based on the context, action, and access patterns of the data. A declarative, abstracted persistence layer for accessing and manipulating the database, optimized locally for the various actions and access patterns, is the right approach. This would allow developers to fetch and manipulate the data independently of the storage and access mechanism. For example, developers could design an application where a single page can perform a complex structured and unstructured search, create a traditional transaction, and display rich analytics information from a single logical data store, without worrying about what algorithms are being used to fetch and store data or how the system is designed to scale. This might require a hybrid data store architecture that optimizes the physical storage of data for certain access patterns, and uses redundant storage replicated in real time and other mechanisms such as accelerators for other patterns, to provide unified data access to the applications upstream.
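A hypothetical sketch of such an abstracted persistence layer might look like the following; every class and method name here is invented for illustration, with trivial in-memory stand-ins for the specialized backends that a real hybrid architecture would route to.

```python
# Hypothetical sketch of a declarative persistence layer: one logical store that
# routes each access pattern to a backend suited for it. All names are invented
# for illustration and do not correspond to any existing product or API.

class RowStore:
    """Stand-in for an OLTP backend optimized for transactions."""
    def __init__(self):
        self.rows = {}
    def commit(self, operations):
        for key, value in operations:
            self.rows[key] = value
        return len(operations)

class SearchIndex:
    """Stand-in for a structured/unstructured search backend."""
    def __init__(self, docs):
        self.docs = docs
    def query(self, term):
        return [d for d in self.docs if term.lower() in d.lower()]

class ColumnStore:
    """Stand-in for an analytics backend kept in sync via replication."""
    def __init__(self, amounts_by_region):
        self.amounts_by_region = amounts_by_region
    def rollup(self, region):
        return sum(self.amounts_by_region.get(region, []))

class LogicalDataStore:
    """The application talks only to this layer, never to the backends directly."""
    def __init__(self, oltp, search, analytics):
        self.oltp, self.search_backend, self.analytics = oltp, search, analytics
    def transact(self, operations):
        return self.oltp.commit(operations)
    def search(self, term):
        return self.search_backend.query(term)
    def aggregate(self, region):
        return self.analytics.rollup(region)

store = LogicalDataStore(
    RowStore(),
    SearchIndex(["Quarterly sales report", "Customer complaint: late delivery"]),
    ColumnStore({"east": [120.0, 42.0], "west": [75.5]}),
)
print(store.transact([("order-1", {"amount": 120.0})]))  # 1
print(store.search("sales"))                             # ['Quarterly sales report']
print(store.aggregate("east"))                           # 162.0
```

The point of the sketch is that the single page described above would call only the logical store, while the layer decides which physically optimized backend serves each request.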

Schemaless databases such as SimpleDB, CouchDB, and Dovetail are in their infancy, but the cloud makes a good platform for supporting the key requirements of schemaless databases: incremental provisioning and progressive structure. The cloud also makes a great platform for full-schema DBMS by offering utility-style incremental computing to accelerate runtime performance. A continuum on the cloud may not be that far-fetched after all.
