Saturday, April 8, 2017

What do you really need to know to be a data scientist? Data Science Lovers and Haters

Previously I discussed the Super Data Science podcast and credit modeling in terms of the modeling strategy and models used. The discussion also covered data science in general, and one part of the conversation is well worth discussing in more detail. It really gets to the question of what it takes to be a data scientist. A ton of energy is spent on this in places like LinkedIn and other forums. I think the answer comes in two forms. For the 'lovers' of data science, it's all about what kind of advice they can give people to help and encourage them to create value in this space. For the 'haters', it's more like: now that I have established myself in this space, what kind of criteria should we have to keep people out and prevent them from creating value? But before we get to that, here is some great dialogue from Kirill discussing a trap that data scientists or aspiring data scientists fall into:

Kirill: "I think there’s a level of acumen that people should have, especially going into data science role. And then if you’re a manager you might take a step back from that. You might not need that much detail…If you’re doing the algorithms, that acumen might be enough. You don’t need to know the nitty-gritty mathematical academic formulas to everything about support vector machines or Kernels and stuff like that to apply it properly and get results. On the other hand, if you find that you do need that stuff you can go and spend some additional time learning. A lot of people fall into the trap. They try to learn everything in a lot of depth, whereas I think the space of data science is so broad you can’t just learn everything in huge depths. It’s better to learn everything to an acceptable level of acumen and then deepen your knowledge in the spaces that you need."

Greg: "if you don’t want to get into that detail, I totally get it. You can be totally fine without it. I have never once in my career had somebody ask me what are the formulas behind the algorithm….there’s a lot of jobs out there for people that don’t know them."

I admit I used to fall into this trap. In fact, this blog is a direct result. Early in my career I had the mindset that if you can't prove it, you can't use it. I really didn't feel confident about an algorithm or method until I understood it 'on paper' and could at least code my own version in SAS IML or R. A number of posts here were based on this work and mindset. Then a very well-known and accomplished developer and computational scientist who frequently helped me gave me the good advice that with this mindset you might never get any work done, or only a fraction of it.

Given the amount of discussion you might see on LinkedIn or in the so-called data science community about real or fake data scientists (lots of haters out there), it is refreshing that author Joel Grus (of Data Science from Scratch), speaking on the Talk Python to Me podcast, provides what I think is the most honest discussion of what data science is and what data scientists do:

"there are just as many jobs called data science as there are data scientists"

That is kind of paraphrasing, kind of out of context, and yes, very general. It almost defines a word by using the word in the definition. But it is very, very TRUE. That is because the field is largely undefined. To attempt to define it is futile, and I think it would be the antithesis of data science itself. I will warn, though, that there are plenty of data science haters out there who would quibble with what Greg and Joel have said above.

These are people who want to impose something more strict, some minimum threshold. The common thread is a fear of a poser or fake data scientist fooling some company into hiring them, or incompetently pointing and clicking their way through an analysis without knowing what is going on, and calling themselves a data scientist. While I understand that concern, it's one extreme. It can easily morph into a straw man argument for a more political agenda at the other extreme. That might lead to a laundry list of minimum requirements to be a real data scientist (think big data technologies, degrees, and the like). Economists know all about this: we see it in the form of licensing and rent seeking in a number of professions and industries. Broadly speaking, it's a waste of resources. Certainly, in a space this broad, economists would also recognize merit in signaling through certification, certain degree programs or course work, or other methods of credentialing. But there is a big difference between competitive signaling and non-competitive rent-seeking behavior.

At its inception, data science was all about disruption. As the Johns Hopkins applied economics program description puts it:

“Economic analysis is no longer relegated to academicians and a small number of PhD-trained specialists. Instead, economics has become an increasingly ubiquitous as well as rapidly changing line of inquiry that requires people who are skilled in analyzing and interpreting economic data, and then using it to effect decisions … Advances in computing and the greater availability of timely data through the Internet have created an arena which demands skilled statistical analysis, guided by economic reasoning and modeling.”

This parallels data science. Suddenly you no longer need a PhD in statistics, a software engineering background, or an academic's level of acumen to create value-added analysis (although those are all excellent backgrounds for doing some advanced work in data science, no doubt). It's that basic combination of subject matter expertise, some knowledge of statistics and machine learning, and the ability to write code or use software to solve problems. That's it. It's disruptive and the haters hate it. They simultaneously embrace the disruption and want to rein it in and fence out the competition. I hate it for the haters, but you don't need to be able to code your own estimators or train a neural net from scratch to use these tools. And there is probably as much or more value-creating professional space out there for someone who can clean a data set and provide a set of cross tabs as there is for someone who knows how to set up a Hadoop cluster.

Below are a couple of really great KDnuggets articles in this regard, written by Karolis Urbonas, Head of Business Intelligence at Amazon:

How to think like a data scientist to become one

What makes a great data scientist?


