1

If I had a maths degree with a little foundation in statistics, what would be the top $100$ questions/posts on CrossValidated, MathStackExchange, MathOverflow or Stack Overflow that I would have to study in order to become a data scientist?

And if that is not possible, could you give me an idea of the steps (skills, knowledge, courses) that I would need for different career paths, for example research data scientist, data scientist in a start-up, machine learning scientist, etc.?

gung - Reinstate Monica
Euler_Salter
  • the answer would be very subjective and too broad... – Haitao Du Apr 19 '17 at 14:02
  • Learn programming, that's the main skill of a successful data scientist. It's all about being able to manipulate the data in different forms and sizes. – Aksakal Apr 19 '17 at 14:03
  • @hxd1011 I know it would be very broad. however I feel that the practical aspect of the stackexchange sites is incomparable to any other programme or textbook I could find, in any area of mathematics or programming. Hence I think that if someone was to gather the most educative and important questions here, then people could become data scientists just by browsing the site. – Euler_Salter Apr 19 '17 at 14:05
  • 1
    I don't think this is the right approach to 'become' a data scientist. The way I am doing it is that I had started with a few (but good quality) online courses on data science/machine learning/statistics to understand the basics and then find a couple of problems to work my skills on. Only _then_ you should start consulting stackexchange, when you have a specific question arising from your problem. – DimP Apr 19 '17 at 14:05
  • @Aksakal What programming languages would you suggest? I am intermediate Python programmer and I can use MATLAB and Mathematica. However, talking to a guy doing Machine Learning, he said probably in the future Julia will be the most used. – Euler_Salter Apr 19 '17 at 14:06
  • @DimP Can I ask you what online courses you would rank as good quality? – Euler_Salter Apr 19 '17 at 14:07
  • It doesn't matter which programming language you know as long as you know it well to manipulate massive amounts of data efficiently. This involves understanding the infrastructure and tools, e.g. Linux, cloud, hadoop etc. That's why I didn't write "programming *language*", and instead used *programming*, it's wider than just knowing the language. It's all about the tool set. – Aksakal Apr 19 '17 at 14:09
  • 1
    @Aksakal And what about the statistics tool set, is it secondary? – Euler_Salter Apr 19 '17 at 14:11
  • 4
    You *won't* learn it by reading any top 100 Q&A. – Tim Apr 19 '17 at 14:13
  • 1
    Statistics is important BUT a lack of programming skills is disqualifying for a data scientist. If you can't program you're useless, while without strong statistics you're still useful as a data scientist – Aksakal Apr 19 '17 at 14:14
  • @Tim then what would be the most efficient way? – Euler_Salter Apr 19 '17 at 14:15
  • Shouldn't [datascience.stackexchange.com](https://datascience.stackexchange.com/) be on the list? (BTW I agree with Tim) – GeoMatt22 Apr 19 '17 at 14:17
  • @Aksakal Okay. Since my programming skills are beginner level, what areas of programming would you suggest me to look at? I know you said "programming" and not programming language, however I feel kind of lost having to wander around such a broad spectrum of languages and skills – Euler_Salter Apr 19 '17 at 14:17
  • 2
    @Euler_Salter from the top 100 questions you will learn how to solve 100 specific problems, or answer 100 questions. It is as if you learned the 100 most common words in Japanese and pretended that you knew the language! – Tim Apr 19 '17 at 14:17
  • @GeoMatt22 oh wow I didn't even know about that site.. – Euler_Salter Apr 19 '17 at 14:17
  • @Tim true. Although I thought data science was very practical as in you needed a lot of experience to really succeed – Euler_Salter Apr 19 '17 at 14:19
  • @Euler_Salter Any of the following have some plausibility and value (and likely none are really sufficient on their own): appropriate university degree, relevant courses/MOOCs, working through relevant books or just getting into it (if you can, with training/supervision in the department). It's hard to imagine that reading some top 100 Q&A would work, it would be too much without context, is too little material. Of course, if you are looking for the fastest way (and don't care if you would be able to do anything useful), just self-label yourself as a data scientist on LinkedIn... – Björn Apr 19 '17 at 14:20
  • @Bjorn ahah I loved the LinkedIn thing. Do you have any suggestions about the MOOCs? For example, is datacamp any good? Or maybe Lynda.com? – Euler_Salter Apr 19 '17 at 14:24
  • Programming skills: relational databases and SQL particularly; non relational DB with a whole crop of NoSQL buzz; distributed data storage such as Hadoop; distributed computing such as Spark; analytics presentation tools; Javascript and HTML5 and basic GUI development to be able to scrape the data from Web; a proper programming language such as C++ or Java even Python will do; math/stat libraries; some stat package like R; some scripting such as Perl/sed/awk on OS level. – Aksakal Apr 19 '17 at 15:39
  • @Aksakal wow, I didn't think that much – Euler_Salter Apr 19 '17 at 15:40
  • How to eat an elephant? Pick a spoon and start eating. Start with databases, for instance, for a mathematician SQL will make a lot of sense, it's based on a solid theory (relations). Or Javascript - it's fun, write a web crawler, for example. Every programmer knows a lot of languages and tools, because they're all very interesting to learn and use. If you find this annoying then quit the field and do something else, maybe sell vacuum cleaners? – Aksakal Apr 19 '17 at 15:44

1 Answer

5

First, it is hard to define what a data scientist is; different people have different definitions. See this question: What is a data scientist?

Then, think about what kind of job you want to do: more statistical modelling, or more software engineering work, such as tuning clusters to process huge amounts of data? In addition, what field are you interested in, e.g., health care or education? An interesting diagram can be found here, in which "hacking skills" and "subject expertise" are also extremely important.

I personally do not think Kaggle competitions can be counted as "science". Nor do I think that blindly using pre-developed packages to make "accurate predictions" and climb the leaderboard is meaningful work.

The steps to follow really depend on your objective (dream job). Working in a big company as a research (data) scientist is totally different from working in a start-up, where you take care of everything from server/cloud maintenance to talking to customers.

If you want to become a research scientist in AI / machine learning, read papers from the top conferences in the field (such as AAAI, NIPS, ICML, ICDM) and try to get a Ph.D. in machine learning / statistics / operations research. (We have an interesting question here: Why do Statistics, Machine learning and Operations research stand out as separate entities)

If you want to become a data scientist in a start-up company, starting with MOOC courses may be better. Most of those courses emphasize how to use statistical methods in the real world rather than the theoretical background behind them. In addition, try to learn how to use the cloud (e.g., AWS) and how to program (e.g., in Python / Java). Finally, communication skills are also very important.

To conclude: think more about the career path you want to have in the future. How much time do you want to spend reading papers and deriving math? How much time do you want to spend debugging code and configuring servers? Which field / subject expertise do you want to get into? Then search for related questions and answers to read.

Haitao Du
  • stake in the ground? :) – Aksakal Apr 19 '17 at 14:10
  • @hxd1011 nice link, I never saw that venn diagram before. I would like to work on Machine Learning/ AI. I often find it confusing indeed to see the difference between someone working in ML or a data scientist. I'll have a read of your website and the question you linked, thanks! – Euler_Salter Apr 19 '17 at 14:14
  • Yes, the link probably gives you as much information as you will find on StackExchange. – Michael R. Chernick Apr 19 '17 at 14:15
  • @hxd1011 so what steps would you suggest me to follow? What online courses, what skills would I need to learn/enrol ? – Euler_Salter Apr 19 '17 at 14:18
  • @hxd1011 I know it is a big effort, but if you could select the steps for some of these different careers (for example research data scientist, working in start-up, machine learning scientist, AI scientist) it would be great! – Euler_Salter Apr 19 '17 at 14:25
  • @hxd1011 would you say a PhD in ML or Statistics is better than a MSc? For example I know CSML (Computational Statistics and Machine Learning) at UCL is really good as a Master. Would you say it is just better to get a PhD or would I need both master and PhD? – Euler_Salter Apr 19 '17 at 14:40
  • @Euler_Salter Sorry you get more high level suggestions from me... there is no good or bad for different things / program / career path. Is Stanford Ph.D good? depends what you want to do... – Haitao Du Apr 19 '17 at 14:44
  • @hxd1011 thank you a lot for your answer, it is really good! I like that you keep improving it! However my goal would be to work in ML and AI research, or start ups at most. But still on ML. I would like to be at the forefront of innovation in that field – Euler_Salter Apr 19 '17 at 15:00