Skip to main content

A Data Scientist's View On Skills, Tools, And Attitude



I recently came across this interview (thanks Dharini for the link!) with Nick Chamandy, a statistician a.k.a a data scientist at Google. I would encourage you to read it; it does have some great points. I found the following snippets interesting:

Recruiting data scientists:
When posting job opportunities, we are cognizant that people from different academic fields tend to use different language, and we don’t want to miss out on a great candidate because he or she comes from a non-statistics background and doesn’t search for the right keyword. On my team alone, we have had successful “statisticians” with degrees in statistics, electrical engineering, econometrics, mathematics, computer science, and even physics. All are passionate about data and about tackling challenging inference problems.
I share the same view. The best scientists I have met are not statisticians by academic training. They are domain experts and design thinkers and they all share one common trait: they love data! When asked how they might build a team of data scientists I highly recommend people to look beyond traditional wisdom. You will be in good shape as long as you don't end up in a situation like this :-)

Skills:
The engineers at Google have also developed a truly impressive package for massive parallelization of R computations on hundreds or thousands of machines. I typically use shell or python scripts for chaining together data aggregation and analysis steps into “pipelines.”
Most companies won't have the kind of highly skilled development army that Google has but then not all companies would have Google scale problem to deal with. Though I suggest two things: a) build a very strong community of data scientists using social tools so that they can collaborate on challenges and tools they use b) make sure that the chief data scientist (if you have one) has very high level of management buy-in to make things happen otherwise he/she would be spending all the time in "alignment" meetings as opposed to doing the real work.

Data preparation:
There is a strong belief that without becoming intimate with the raw data structure, and the many considerations involved in filtering, cleaning, and aggregating the data, the statistician can never truly hope to have a complete understanding of the data.
I disagree. I do strongly believe the tools need to involve to do some of these things and the data scientists should not be spending their time to compensate for the inefficiencies of the tools. Becoming intimate with the data—have empathy for the problem—is certainly a necessity but spending time on pulling, fixing, and aggregating data is not the best use of their time.

Attitude:
To me, it is less about what skills one must brush up on, and much more about a willingness to adaptively learn new skills and adjust one’s attitude to be in tune with the statistical nuances and tradeoffs relevant to this New Frontier of statistics.
As I would say bring tools and knowledge but leave bias and expectations aside. The best data scientists are the ones who are passionate about data, can quickly learn a new domain, and are willing to make and fail and fail and make.

Image courtesy: xkcd

Comments

Popular posts from this blog

Emergent Cloud Computing Business Models

The last year I wrote quite a few posts on the business models around SaaS and cloud computing including SaaS 2.0 , disruptive early stage cloud computing start-ups , and branding on the cloud . This year people have started asking me – well, we have seen PaaS, IaaS, and SaaS but what do you think are some of the emergent cloud computing business models that are likely to go mainstream in coming years. I spent some time thinking about it and here they are: Computing arbitrage: I have seen quite a few impressive business models around broadband bandwidth arbitrage where companies such as broadband.com buys bandwidth at Costco-style wholesale rate and resells it to the companies to meet their specific needs. PeekFon solved the problem of expensive roaming for the consumers in Eurpoe by buying data bandwidth in bulk and slice-it-and-dice-it to sell it to the customers. They could negotiate with the operators to buy data bandwidth in bulk because they made a conscious decision not to st...

Reveiw: Celluon Epic Laser Keyboard

The Celluon Epic is a Bluetooth laser keyboard. The compact device projects a QWERTY keyboard onto most flat surfaces. (Glass tabletops being the exception) You can connect the Epic to vertically any device that supports Bluetooth keyboards including devices running iOS , Android , Windows Phone, and Blackberry 10. On the back of the device there is a charging port and pairing button. Once you have the Epic paired with your device it acts the same as any other keyboard. For any keyboard the most important consideration is the typing experience that it provides. The virtual keyboard brightness is adjustable and is easy to see in most lighting conditions. Unfortunately the brightness does not automatically adjust based on ambient light. With each keystroke a beeping sound is played which can be turned down. The typing experience on the Epic is mediocre at best. Inadvertently activating the wrong key can make typing frustrating and tiring. Even if you are a touch typist you'll still ...

Rise Of Big Data On Cloud

Growing up as an engineer and as a programmer I was reminded every step along the way that resources—computing as well as memory—are scarce. The programs were designed on these constraints. Then the cloud revolution happened and we told people not to worry about scarce computing. We saw rise of MapReduce, Hadoop, and countless other NoSQL technology. Software was the new hardware. We owe it to all the software development, especially computing frameworks, that allowed developers to leverage the cloud—computational elasticity—without having to understand the complexity underneath it. What has changed in the last two to three years is a) the underlying file systems and computational frameworks have matured b) adoption of Big Data is driving the demand for scale out and responsive I/Os in the cloud. Three years back, I wrote a post, The Future Of The BI In Cloud  where I had highlighted two challenges of using cloud as a natural platform for Big Data. The first one was to create a lar...