22

Is machine learning an important subject for any statistician to become acquainted with? It seems that machine learning is statistics. Why don't statistics programs (undergraduate and graduate) require machine learning?

Andre Silva • 3,070
user20616 • 1,431
  • 1
    Readers here might be interested in the following thread: [What is the difference between data mining, statistics, machine learning, and AI](http://stats.stackexchange.com/questions/5026/). – gung - Reinstate Monica Jul 10 '13 at 21:12
  • 2
    Don't know, but I'm pretty sure that everyone doing machine learning should learn statistics. – Dave Mar 28 '14 at 15:27

3 Answers

18

Machine learning is a specialized field of high-dimensional applied statistics. It also requires a considerable programming background, which isn't necessary for a good quantitative program, especially at the undergraduate level but also, to some extent, at the graduate level. It applies mainly to the prediction side of statistics, whereas mathematical statistics, as well as inferential and descriptive applied statistics, also demand attention. Many programs (CMU, for instance) give students the chance to get a great deal of exposure to machine learning, but industrial statisticians on the whole rarely get the chance to apply these tools, barring certain high-profile tech jobs.

While I have recently seen many data scientist and machine learning positions on the job market, I think the general job description of "statistician" does not require a machine learning background; it does require an impeccable understanding of basic statistics, inference, and communication, and these should really be the core of a graduate statistics program. Machine learning and data science are also relatively new, both as job titles and as disciplines. It would be a disservice to those seeking employment as statisticians to sway their problem-solving strategies toward machine learning if, in 10 or 20 years, it has largely been abandoned in business/pharma/bioscience enterprise for underwhelming efficacy.

Lastly, I don't feel that machine learning tremendously enhances a solid understanding of statistics. Statistics is fundamentally a cross-disciplinary field, and it's important to communicate to and convince non-technical experts in your field (such as doctors, CFOs, or administrators) exactly why you chose the methodology you did. Machine learning is such a niche, highly technical field that, in many applied settings, it promises only incrementally better performance than standard tools and techniques. Many supervised and unsupervised learning methods are perceived by non-experts (and even some less trained experts) as "black boxes". When practitioners are asked to defend their choice of a specific learning method, their explanations often fall flat and draw on none of the circumstances that motivated the applied problem. That is a great risk in advising any decision-making process.

AdamO • 52,330
  • I'm afraid I don't agree with your explanation (last paragraph) of machine learning! What tools do you refer to when you say ML provides only slight improvement compared to "standard tools"? There are some problems that machine learning can tackle for which there are no "standard" tools! Contrary to your belief, ML is not a "black box" science, there is always sound scientific theory behind it. – revolusions Feb 07 '13 at 19:55
  • Well, if machine learning is a niche of applied statistics, and there are black boxes around - wouldn't that just mean that statisticians should work out the theory and find out what is in the black box? I.e. speaking for more interaction between normal/pure/proper statistics and machine learning? After all, there may be good black boxes around... (remember partial least squares?) – cbeleites unhappy with SX Feb 07 '13 at 20:08
  • 1
    Could you explain a bit more in detail what exactly you mean with the explanations falling flat (examples maybe?)? – cbeleites unhappy with SX Feb 07 '13 at 20:09
  • 10
    I can't describe the differences between a linear discriminant analysis, support vector machines, and a GLM LASSO in a way that makes sense to a doctor. So I built a logistic regression model for breast cancer risk prediction using a handful of carefully adjusted covariates. When presented, the doctors immediately launched into an enlightening discussion about their effect sizes. The discrimination of my "science" model was very comparable to more sophisticated ML techniques (overlapping 90% CIs for AUC based on bootstrap in validation sample), and I'm not the only one with such a case report! – AdamO Feb 07 '13 at 21:18
  • @AdamO: what has *your* inability to communicate so that medical doctors understand you to do with methods being black boxes? I find particularly the principles of LDA and SVM quite easy to explain. – cbeleites unhappy with SX Feb 07 '13 at 22:28
  • But: the possibilities for interpretation are, IMHO, definitely a big point in deciding for or against a certain method. I'm not surprised that your model, built with the techniques you prefer, does well; see [this question](http://stats.stackexchange.com/questions/49536/predictive-performance-depends-more-on-expertise-of-data-analyst-than-on-method), which I asked partially triggered by your answer. (Though a 90% CI overlap for AUC alone does not convince me ;-) ) – cbeleites unhappy with SX Feb 07 '13 at 22:34
  • 4
    @cbeleites, have you ever had to communicate to a substantive person with at best a college-algebra amount of math knowledge? SVM does not produce effect sizes in terms the doctors would understand; the width of the margin does not make sense to them, unlike the odds ratios they are very much used to. If you can't talk the client's language, they won't waste their time and money on you. – StasK Feb 07 '13 at 23:53
  • @AdamO The comment about you not being able to explain a SVM to a doctor is unfair. Just because you couldn't explain it doesn't mean that it is "black-box". – revolusions Feb 08 '13 at 10:35
  • @AdamO As long as we are talking about breast cancer, I've helped develop ML tools that can optimise breast cancer multi agent chemotherapy treatment regimens. We had NO PROBLEM whatsoever explaining our methods to oncologists! – revolusions Feb 08 '13 at 10:36
  • @Stask: I regularly work with medical doctors, biologists etc. ("substantive" ranging from students learning data analysis to chief surgeons/professors who don't experiment themselves). I agree completely that *we* have to talk the client's language. IMHO interpretability in the way needed is an important argument when deciding about the method, particularly if both description and prediction are needed. Usually works well: "We can interpret LDA spectroscopically because ..., but not SVMs, because.... OTOH, LDA finds straight class boundaries only, while SVM manages curved as well => ..." – cbeleites unhappy with SX Feb 08 '13 at 13:04
  • @cbeleites, thanks for clarification. Your own comment here goes back to your question about what's more important, the method or the person applying it. I probably would not go for a straight LDA unless it falls short on the sample size/dimensions ratio, and would try QDA or some sort of kernel stuff, at least to see if there is any improvement. Then it becomes a matter of personal preferences (I would use B-splines over polynomials on any day) and what's there in the analyst's statistical tool box. From that perspective, ML is an important but not the only possible tool. – StasK Feb 08 '13 at 14:54
  • @revolusions I have sat in on a number of consulting sessions to know a few warning signs. I would never recommend a method to a client if they couldn't write up a substantial "statistical methods" section on their work in an academic paper, including insightful critiques. I prefer model based approaches to all aspects of statistics because I appreciate a client interpreting model coefficients and verifying that they agree with his/her previous technical knowledge. When a client is confused, he/she often won't say so, they tend to disengage from the process. That's a huge risk. – AdamO Feb 08 '13 at 16:21
  • @StasK: we're usually far on the wrong side with our sample sizes. In fact, often so far that plain LDA is discouraged because it needs large sample sizes :-) : Typically raw data of 10³ variates, which I reduce to 10² by spectroscopic knowledge, and 10⁰ - 10¹ patients/animals/cell culture batches per class of which say, 10¹-10⁶ spectra are available per "case". But we do have the advantage that we know how certain measurement channels correspond linearly to physico- and bio-chemical properties (concentrations)... Comments are getting too long; if we want to go on, let's continue in chat. – cbeleites unhappy with SX Feb 08 '13 at 20:21
  • (-1) This answer uses a very narrow interpretation of machine learning. Not all machine learning deals with high-dimensional data, and not all machine learning is a form of specialised statistics. – MLS Feb 20 '13 at 12:57
  • Great discussion here about communicating and interpreting a model's results. The modelling procedure and the relationship between producers of models and consumers of models is worthy of a separate thread, but I'm not sure whether it would be on-topic on CV. The Introduction to [Modelling Economic Series](http://books.google.ie/books?id=180ipKzWPScC&lpg=PP1&dq=modelling%20economic%20series%20advanced%20texts%20in%20econometrics%20consumer&pg=PA15#v=onepage&q=consumer&f=false) (edited by Clive Granger) touches upon this (e.g. p.15), but it would be nice to see it extended to the context of ML. – Graeme Walsh Jun 25 '13 at 08:11
  • 2
    @GraemeWalsh fantastic point. I struggle greatly with the concept of using sophisticated predictive models for predictive inference, as is often the case in structural equation modeling or Granger's eponymous causality. I think there remains a great deal of work to be done in this area. For instance, intuitively I recognize a great deal of similarity between semi-parametric modeling and marginal structural models, but am unsure where the differences lie. – AdamO Jun 25 '13 at 21:32
  • ***"only promises incrementally better performance than standard tools and techniques"*** is wrong. Ever looked at the Kaggle prediction competitions lately? Traditional statistical tools are utterly dominated (and this is a massive understatement) in every single competition, across completely different domains, when compared to ML models with tens of millions of parameters. And this is on a totally unseen test set; linear regression and GLMs get laughed out the door in favour of random forests and deep learning. – Jase Feb 08 '14 at 10:27
  • 2
    @Jase you should take a look at the invited paper from the Netflix contest winners. Their reports were very similar: even with Bayesian model averaging running posterior weights on a large space of models, they observed that PCA seemed to have a dominating posterior weight under all conditions. That's not to say that they are equivalent, but there is a trade-off between simplicity and accuracy that makes me favor simpler models than those the ML arena offers. One could analogously think of how sophisticated parametric models perform similarly to nonparametric ones. – AdamO Feb 08 '14 at 19:14
14

OK, let's describe the elephant of statistics while blindfolded by what we've learned from the one or two people we worked closely with in our grad programs...

Stat programs require what they see fit, that is, the most important material they want their students to learn in the limited time the program allows. Requiring one narrow area means kissing goodbye to other areas that can be argued to be equally important. Some programs require measure-theoretic probability; some don't. Some require a foreign language, but most don't. Some programs take the Bayesian paradigm as the only thing worth studying, but most don't. Some programs know that the greatest demand for statisticians is in survey statistics (at least that's the case in the US), but most don't. Biostat programs follow the money and teach SAS plus the methods that will sell easily to the medical and pharma sciences.

For a person designing agricultural experiments, collecting survey data by phone, validating psychometric scales, or producing disease-incidence maps in a GIS, machine learning is an abstract art of computer science, very distant from the statistics they work with on a daily basis. None of these people will see any immediate benefit from learning about support vector machines or random forests.

All in all, machine learning is a nice complement to other areas of statistics, but I would argue that mainstream material like the multivariate normal distribution and generalized linear models needs to come first.

StasK • 29,235
5

Machine learning is about gaining knowledge from data, i.e., learning from data. For example, I work with machine learning algorithms that can select, from DNA microarray data, a few genes that may be involved in a particular type of disease (e.g. cancer or diabetes). Scientists can then use these genes (the learned models) for early diagnosis in the future (classification of unseen samples).
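As a toy illustration of that select-then-classify workflow (entirely synthetic data, with a deliberately simple t-like score and a nearest-centroid classifier standing in for real microarray methods):

```python
import random
import statistics

random.seed(0)

# Synthetic "microarray": 40 samples x 100 genes; genes 0-2 carry signal
# (shifted by 3.0 in the disease class), the rest are pure noise.
n_samples, n_genes, k = 40, 100, 3
labels = [i % 2 for i in range(n_samples)]          # 0 = healthy, 1 = disease
X = [[random.gauss(0.0, 1.0) + (3.0 if (g < 3 and y == 1) else 0.0)
      for g in range(n_genes)]
     for y in labels]

def t_score(X, labels, g):
    """Absolute class-mean difference for gene g, scaled by a pooled std."""
    a = [row[g] for row, y in zip(X, labels) if y == 0]
    b = [row[g] for row, y in zip(X, labels) if y == 1]
    pooled = (statistics.stdev(a) + statistics.stdev(b)) / 2
    return abs(statistics.mean(a) - statistics.mean(b)) / pooled

# "Learning" step: pick the k genes that best separate the classes.
selected = sorted(range(n_genes), key=lambda g: -t_score(X, labels, g))[:k]

# Use the learned gene panel in a nearest-centroid classifier.
def centroid(cls):
    rows = [row for row, y in zip(X, labels) if y == cls]
    return [statistics.mean(r[g] for r in rows) for g in selected]

c0, c1 = centroid(0), centroid(1)

def classify(sample):
    d0 = sum((sample[g] - m) ** 2 for g, m in zip(selected, c0))
    d1 = sum((sample[g] - m) ** 2 for g, m in zip(selected, c1))
    return 0 if d0 < d1 else 1
```

On this synthetic data the score recovers the three signal genes, and new samples are then classified using only that small panel, which is the essence of the select-then-diagnose pipeline.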

There is a lot of statistics involved in machine learning but there are branches of machine learning that do not require statistics (e.g. genetic programming). The only time you would need statistics in these instances would be to see if a model that you have built using machine learning is statistically significantly different from some other model.

In my opinion, an introduction to machine learning would be advantageous for statisticians: it would help them see real-world applications of statistics. However, it shouldn't be compulsory. You can become a successful statistician and spend your whole life without ever going near machine learning!
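To illustrate the earlier point about checking whether one learned model is significantly different from another: with paired predictions on the same test set, even an exact McNemar-style sign test needs only a few lines. The disagreement counts below are made up for illustration:

```python
from math import comb

# Hypothetical disagreement counts for two models scored on the same test set:
# b = samples model A got right but model B got wrong, c = the reverse.
b, c = 12, 3

# Exact sign test: under H0 (equal accuracy), the b-vs-c split among the
# n = b + c disagreements is Binomial(n, 0.5).
n = b + c
k = min(b, c)
p_two_sided = min(1.0, 2 * sum(comb(n, i) for i in range(k + 1)) / 2 ** n)
# Here p_two_sided ≈ 0.035, so the accuracy difference is unlikely to be noise.
```

Concordant pairs (both models right or both wrong) carry no information about which model is better, which is why only b and c enter the test.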

revolusions • 138
  • 2
    I'd say you need statistics every time you report the performance of your model. Mabe that is because my profession is analytical chemistry, where one of the important rules is "a number without confidence interval is no result". – cbeleites unhappy with SX Feb 07 '13 at 19:58
  • 1
    @cbeleites I agree with you. What I meant was that statisticians don't necessarily need to be machine learning experts! They can get by without learning machine learning :) – revolusions Feb 07 '13 at 20:09
  • 1
    @cbeleites, or *multiple* confidence intervals in the case of multimodal esimators (eg, Sivia & Skilling *Data Analysis*). – alancalvitti Feb 08 '13 at 01:49