
I do not think that this is a difficult question, but I guess it takes some experience to answer it. It is a question that is asked a lot here, but I have not found any answer that explains the reasoning behind choosing an appropriate ML algorithm.

So, let's suppose we have a set of data. And let's suppose I want to do clustering (this could be classification or regression if I also had labels or values for my training set data).

What should I consider before choosing an appropriate algorithm? Or do I just choose algorithms at random?

In addition, how do I choose the data preprocessing to apply to my data? I mean, are there any rules of the form "IF feature X has property Z THEN do Y"?

Also, are there any other things, besides preprocessing and choosing my data, that I am missing and that you would advise me about?

For example, let's suppose that I want to do clustering. Is saying "I will apply k-means to that problem" the best approach? What can improve my performance?

I will accept the answer that is best justified and explains everything that someone should consider.

Jim Blum
  • While the question is legitimate when asked in a specific situation, it is too complex to be answered in general. What you are essentially looking for is a cookbook or a cheatsheet. In this regard, see these questions: [Machine learning cookbook / reference card / cheatsheet?](http://stats.stackexchange.com/questions/12386/machine-learning-cookbook-reference-card-cheatsheet), [Statistical models cheat sheet](http://stats.stackexchange.com/questions/1252/statistical-models-cheat-sheet/). – mlwida Feb 19 '14 at 13:01
  • I'd like to add this magnificent flowchart (to my mind anyway): http://scikit-learn.org/stable/tutorial/machine_learning_map/index.html – merours Feb 19 '14 at 13:16
  • In my experience the cheatsheets are useful only within the domain they were made for (if at all). E.g. the scikit-learn flowchart doesn't resemble very well the workflow I use to decide how my chemometric data analyses should be done (besides the fact that I can hardly ever get >50 *biological* replicates) ... Which doesn't mean that it isn't useful - it's just that chemometric classification data is different from text classification (which is my impression/guess of the domain the author of the flowchart works in), and thus different considerations get different weight. – cbeleites unhappy with SX Feb 19 '14 at 13:32
  • @steffen Thanks for your answer. However, I do not think that my question is answered by the question you propose. In addition, what is proposed there is actually a variety of books (Barber or Friedman), or the Andrew Ng course, which is really different from what I am asking here. If I wanted that, I would simply ask what some good resources for starting to learn ML are :) But thanks again for your comment. – Jim Blum Feb 19 '14 at 13:47
  • Here is another attempt: according to the no-free-lunch theorem, there is no best method for every data set. So: in order to write the best-method guide, one has to take into account all properties which could occur in the data, plus all types of domain-specific problems (kudos to cbeleites), plus all properties of all algorithms. This is insane. Hence the cheat-sheet question got books as answers, where all this knowledge is stored. You extended it even more by asking for methods for both clustering (unsupervised) and classification / regression (supervised learning). – mlwida Feb 19 '14 at 14:34
  • No offense, but you may underestimate the complexity ... similar questions (IMHO) would be: What should one consider to cure an arbitrary but fixed illness? What should one take into account when creating a piece of software? No offense, English is not my native tongue, and I do not know how to explain my point. As I have stated in the linked question, if I could write such a guide, I would sell it :). I am an advocate of "keep it simple and accessible", but in this case I really don't know how :( – mlwida Feb 19 '14 at 14:40
  • @steffen Thanks again for your comment. I didn't ask for a best method :) that is what I said :) "What should I consider before choosing an appropriate algorithm?" I did not say "Tell me the best algorithm" :) So I guess I am looking for some rules that exist, or some empirical rules that someone can find. – Jim Blum Feb 19 '14 at 15:14
  • @steffen The software counterexample actually has an answer :) One should think about the intended use of the program: there are some languages that are more mathematical than others[...], and there are languages that have more ready-made libraries than others, or that are more popular than others for this reason. In the same way, I could say that if the system is intended to run on different OSes, you should consider using Java, or that if it will be online, you should also think more about security. So, there are empirical (or not) answers to these kinds of questions :) – Jim Blum Feb 19 '14 at 15:19

2 Answers


are there any rules of the form "IF feature X has property Z THEN do Y"?

Yes, there are such rules. Or rather: if X, then it is sensible to try Y and Z and to avoid W (one such rule is sketched in the toy example after the list below).

However, what is sensible and what is not depends on

  • your application (influences e.g. expected complexity of the problem)
  • the size of the data set: how many rows, how many columns, how many independent cases
  • the type of data / what kind of measurement. E.g. gene microarray data and vibrational spectroscopy data often have comparable size, but the different nature of the data suggests different regularization approaches.
  • and in practice also on your experience in applying different methods.
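To make the "if X then try Y" idea concrete, here is a minimal sketch of one such rule, assuming Python and scikit-learn (my choice for illustration, not something the question specifies): if the data set is much wider than it is long (many more variables than cases, as with the microarray or spectroscopy data mentioned above), a strongly regularized linear model is usually a safer first try than an essentially unregularized one. The data set and parameter values below are made up purely for illustration.

```python
# Toy instance of the "IF feature X has property Z THEN do Y" kind of rule:
# IF there are far more variables than cases (p >> n, e.g. microarray or
# spectroscopy data) THEN a strongly regularized model is usually the safer start.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Wide synthetic data: 60 cases, 500 variables (numbers chosen for illustration).
X, y = make_classification(n_samples=60, n_features=500, n_informative=10,
                           random_state=0)

candidates = {
    # small C = strong L2 regularization: sensible first try for p >> n
    "strongly regularized logistic (C=0.01)": LogisticRegression(C=0.01, max_iter=5000),
    # huge C = essentially unregularized: prone to overfitting here
    "nearly unregularized logistic (C=1e6)": LogisticRegression(C=1e6, max_iter=5000),
}

for name, clf in candidates.items():
    model = make_pipeline(StandardScaler(), clf)
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name:42s} CV accuracy: {scores.mean():.2f} +/- {scores.std():.2f}")
```

On data like this the heavily regularized model will typically cross-validate better; the point is not the particular numbers but that properties of the data (here: far more columns than rows) drive which candidates are worth trying first.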

Without more specific information I think that is about as much as we can say.

If you want to have a general answer to the general problem, I recommend the Elements of Statistical Learning for a start.

cbeleites unhappy with SX

There is a classic paper (Wolpert, 1996) that discusses the no-free-lunch theorem mentioned above; the full reference is given below. But according to the paper and most practitioners, "there are [rarely] a priori distinctions between learning algorithms." Note: I replaced "no" with "rarely".
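To make the "no a priori distinction" point a bit more tangible, here is a small sketch, assuming Python and scikit-learn (my choice for illustration, not anything taken from Wolpert's paper): two perfectly reasonable learners are cross-validated on two differently structured synthetic data sets, and which one looks better typically depends on the data set rather than on the algorithms alone. Both data sets and both models are arbitrary picks for the demonstration.

```python
# Sketch: the ranking of two reasonable learners can flip between data sets,
# so there is little to distinguish them before looking at the data.
from sklearn.datasets import make_classification, make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Two synthetic data sets with very different structure (arbitrary choices).
datasets = {
    "mostly linear, with redundant features": make_classification(
        n_samples=300, n_features=20, n_informative=2, n_redundant=10,
        random_state=1),
    "two interleaving half-moons": make_moons(
        n_samples=300, noise=0.25, random_state=1),
}

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "5-nearest neighbours": KNeighborsClassifier(n_neighbors=5),
}

for data_name, (X, y) in datasets.items():
    print(data_name)
    for model_name, model in models.items():
        acc = cross_val_score(model, X, y, cv=5).mean()
        print(f"  {model_name:22s} mean CV accuracy: {acc:.2f}")
```

Neither learner is "best" in general; you only find out which is better for your data by estimating generalization performance, e.g. by cross-validation, which is the point made in the comments below.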

Reference

Wolpert, D. H. (1996). The lack of a priori distinctions between learning algorithms. Neural Computation, 8(7), 1341-1390.

Scott Worland
  • This is a bit brief by our standards and it would help if you could flesh the answer out a little, perhaps by saying some more about what Wolpert discussed. I've added a full citation for you - we prefer giving a complete reference, partly out of fear of "linkrot" if links stop working in the future. – Silverfish Aug 12 '16 at 15:53
  • The paper formalizes the no-free-lunch theorem mathematically. A very short summary would be that even if an algorithm just provides a random guess for the target, there are instances where that algorithm would outperform another, more sophisticated algorithm that can get "confused by the data". It doesn't mean that nothing can be known a priori (i.e., you obviously wouldn't choose a binary classifier for a regression problem), but just that there is no straightforward way to predict how an algorithm will generalize to a test data set until you try. – Scott Worland Aug 15 '16 at 13:49
  • Thanks. It's best to edit this into the answer itself rather than post it as a comment. – Silverfish Aug 15 '16 at 14:26