Skip to main content

Unsupervised Machine Learning, Most Promising Ingredient Of Big Data


Orange (France Telecom), one of the largest mobile operators in the world, issued a challenge "Data for Development" by releasing a dataset of their subscribers in Ivory Coast. The dataset contained 2.5 billion records, calls and text messages exchanged between 5 million anonymous users in Ivory Coast, Africa. Various researchers got access to this dataset and submitted their proposals on how this data can be used for development purposes in Ivory Coast. It would be an understatement to say these proposals and projects were mind-blowing. I have never seen so many different ways of looking at the same data to accomplish so many different things. Here's a book [very large pdf] that contains all the proposals. My personal favorite is AllAborad where IBM researchers used the cell-phone data to redraw optimal bus routes. The researchers have used several algorithms including supervised and unsupervised machine learning to analyze the dataset resulting in a variety of scenarios.

In my conversations and work with the CIOs and LOB executives the breakthrough scenarios always come from a problem that they didn't even know existed or could be solved. For example, the point-of-sale data that you use for your out-of-stock analysis could give you new hyper segments using clustering algorithms such as k-means that you didn't even know existed and also could help you build a recommendation system using collaborative filtering. The data that you use to manage your fleet could help you identify outliers or unproductive routes using SOM (self organizing maps) with dimensionality reduction. Smart meter data that you use for billing could help you identify outliers and prevent thefts using a variety of ART (Adoptive Resonance Theory) algorithms. I see endless scenarios based on a variety of unsupervised machine learning algorithms similar to using cell phone data to redraw optimal bus routes.

Supervised and semi-supervised machine learning algorithms are also equally useful and I see them complement unsupervised machine learning in many cases. For example, in retail, you could start with a k-means to unearth new shopping behavior and end up with Bayesian regression followed by exponential smoothing to predict future behavior based on targeted campaigns to further monetize this newly discovered shopping behavior. However, unsupervised machine learning algorithms are by far the best that I have seen—to unearth breakthrough scenarios—due to its very nature of not requiring you to know a lot of details upfront regarding the data (labels) to be analyzed. In most cases you don't even know what questions you could ask.

Traditionally, BI has been built on pillars of highly structured data that has well-understood semantics. This legacy has made most enterprise people operate on a narrow mindset, which is: I know the exact problem that I want to solve and I know the exact question that I want to ask, and, Big Data is going to make all this possible and even faster. This is the biggest challenge that I see in embracing and realizing the full potential of Big Data. With Big Data there's an opportunity to ask a question that you never thought or imagined you could ask. Unsupervised machine learning is the most promising ingredient of Big Data.

Comments

Popular posts from this blog

Emergent Cloud Computing Business Models

The last year I wrote quite a few posts on the business models around SaaS and cloud computing including SaaS 2.0 , disruptive early stage cloud computing start-ups , and branding on the cloud . This year people have started asking me – well, we have seen PaaS, IaaS, and SaaS but what do you think are some of the emergent cloud computing business models that are likely to go mainstream in coming years. I spent some time thinking about it and here they are: Computing arbitrage: I have seen quite a few impressive business models around broadband bandwidth arbitrage where companies such as broadband.com buys bandwidth at Costco-style wholesale rate and resells it to the companies to meet their specific needs. PeekFon solved the problem of expensive roaming for the consumers in Eurpoe by buying data bandwidth in bulk and slice-it-and-dice-it to sell it to the customers. They could negotiate with the operators to buy data bandwidth in bulk because they made a conscious decision not to st...

Reveiw: Celluon Epic Laser Keyboard

The Celluon Epic is a Bluetooth laser keyboard. The compact device projects a QWERTY keyboard onto most flat surfaces. (Glass tabletops being the exception) You can connect the Epic to vertically any device that supports Bluetooth keyboards including devices running iOS , Android , Windows Phone, and Blackberry 10. On the back of the device there is a charging port and pairing button. Once you have the Epic paired with your device it acts the same as any other keyboard. For any keyboard the most important consideration is the typing experience that it provides. The virtual keyboard brightness is adjustable and is easy to see in most lighting conditions. Unfortunately the brightness does not automatically adjust based on ambient light. With each keystroke a beeping sound is played which can be turned down. The typing experience on the Epic is mediocre at best. Inadvertently activating the wrong key can make typing frustrating and tiring. Even if you are a touch typist you'll still ...

Rise Of Big Data On Cloud

Growing up as an engineer and as a programmer I was reminded every step along the way that resources—computing as well as memory—are scarce. The programs were designed on these constraints. Then the cloud revolution happened and we told people not to worry about scarce computing. We saw rise of MapReduce, Hadoop, and countless other NoSQL technology. Software was the new hardware. We owe it to all the software development, especially computing frameworks, that allowed developers to leverage the cloud—computational elasticity—without having to understand the complexity underneath it. What has changed in the last two to three years is a) the underlying file systems and computational frameworks have matured b) adoption of Big Data is driving the demand for scale out and responsive I/Os in the cloud. Three years back, I wrote a post, The Future Of The BI In Cloud  where I had highlighted two challenges of using cloud as a natural platform for Big Data. The first one was to create a lar...