Skip to main content

Rise Of Big Data On Cloud


Growing up as an engineer and as a programmer I was reminded every step along the way that resources—computing as well as memory—are scarce. The programs were designed on these constraints. Then the cloud revolution happened and we told people not to worry about scarce computing. We saw rise of MapReduce, Hadoop, and countless other NoSQL technology. Software was the new hardware. We owe it to all the software development, especially computing frameworks, that allowed developers to leverage the cloud—computational elasticity—without having to understand the complexity underneath it. What has changed in the last two to three years is a) the underlying file systems and computational frameworks have matured b) adoption of Big Data is driving the demand for scale out and responsive I/Os in the cloud.

Three years back, I wrote a post, The Future Of The BI In Cloud where I had highlighted two challenges of using cloud as a natural platform for Big Data. The first one was to create a large scale data warehouse and the second was lack of scale out computing for I/O intensive applications.

A year back Amazon announced RedShift, a data warehouse service in the cloud, and last week they announced high I/O instances for EC2. We have come a long way and more and more I look at the current capabilities and trends, Big Data, at scale, on the cloud, seems much closer to reality.

From a batched data warehouse to interactive analytic applications:

Hadoop was never designed for I/O intensive applications, but Hadoop being a compelling computational scale out platform developers had a strong desire to use it for their data warehousing needs. This made Hive and HiveQL popular analytic frameworks but this was a sub optimal solution that worked well for batch loads and wasn't suitable for responsive and interactive analytic applications. Several vendors realized there's no real reason to stick to the original style of MapReduce. They still stuck to the HDFS but significantly invested into alternatives to Hive which are way faster.

There are series of such projects/products that are being developed on HDFS and MapReduce as a foundation but by adding special data management layers on top of it to run interactive queries much faster compared to plain vanilla Hive. Some of those examples are Impala from Cloudera and Apache Drill from MapR (both based on Dremel), HAWQ from EMC, Stinger from Hortonworks and many other start-ups. Not only vendors but the early adopters such as Facebook created Hive projects such as Presto, an accelerated Hive, which they recently open sourced.

From raw data access frameworks to higher level abstraction tools: 

As vendors continue to build more and more Hive alternatives I am also observing vendors investing in higher level abstraction frameworks. Pig was amongst those first higher level frameworks that made it easier to express data analysis programs. But, now, we are witnessing even higher layer rich frameworks such as Cascading and Cascalog not only to write SQL queries but write interactive programs in higher level languages such as Clojure and Java. I'm a big believer in empowering developers with right tools. Working directly against Hadoop has a significant learning curve and developers often end up spending time on plumbing and other things that can be abstracted out in a tool. For web development, popularity of Angular and Bootstrap are examples of how right frameworks and tools can make developers way more efficient not having to deal with raw HTML, CSS, and Javascript controls.

From solid state drives to in-memory data structures: 

Solid state drives were the first step in upstream innovation to make I/Os much faster but I am observing this trend go further where vendors are investing into building in-memory resident data management layers on top of HDFS. Shark and Spark are amongst the popular ones. Databricks has made big bets on Spark and recently raised $14M. Shark (and hence Spark) is designed to be compatible with Hive but designed to run queries 100x times faster by using in-memory data structures, columnar representation, and optimizing MapReduce not to write intermediate results back to disk. This looks a lot like MapReduce Online which was a research paper published a few years back. I do see a UC Berkeley connection here.

Photo courtesy: Trey Ratcliff

Comments

Popular posts from this blog

Emergent Cloud Computing Business Models

The last year I wrote quite a few posts on the business models around SaaS and cloud computing including SaaS 2.0 , disruptive early stage cloud computing start-ups , and branding on the cloud . This year people have started asking me – well, we have seen PaaS, IaaS, and SaaS but what do you think are some of the emergent cloud computing business models that are likely to go mainstream in coming years. I spent some time thinking about it and here they are: Computing arbitrage: I have seen quite a few impressive business models around broadband bandwidth arbitrage where companies such as broadband.com buys bandwidth at Costco-style wholesale rate and resells it to the companies to meet their specific needs. PeekFon solved the problem of expensive roaming for the consumers in Eurpoe by buying data bandwidth in bulk and slice-it-and-dice-it to sell it to the customers. They could negotiate with the operators to buy data bandwidth in bulk because they made a conscious decision not to st...

Reveiw: Celluon Epic Laser Keyboard

The Celluon Epic is a Bluetooth laser keyboard. The compact device projects a QWERTY keyboard onto most flat surfaces. (Glass tabletops being the exception) You can connect the Epic to vertically any device that supports Bluetooth keyboards including devices running iOS , Android , Windows Phone, and Blackberry 10. On the back of the device there is a charging port and pairing button. Once you have the Epic paired with your device it acts the same as any other keyboard. For any keyboard the most important consideration is the typing experience that it provides. The virtual keyboard brightness is adjustable and is easy to see in most lighting conditions. Unfortunately the brightness does not automatically adjust based on ambient light. With each keystroke a beeping sound is played which can be turned down. The typing experience on the Epic is mediocre at best. Inadvertently activating the wrong key can make typing frustrating and tiring. Even if you are a touch typist you'll still ...