
I am currently working on a classification problem, and I am observing that linear classifiers outperform non-linear classifiers. This is very counterintuitive to me. What could be the cause?

  • the amount of training data?
  • the number of features? the feature types?

I hypothesize that it's the lack of training data that is the cause, but my Google-fu hasn't been good enough to turn up any scientific articles suggesting so.

kanghj91
  • Can you say more about your situation, your data, your classifiers & their performance? I suspect this will be hard to answer well without more information. – gung - Reinstate Monica Mar 03 '16 at 14:33
  • Oops, sorry, I should have included that in the question. I am working on an NLP problem in a very specific domain, and the features are standard NLP-style features such as n-grams and POS tags. However, my dataset is not large, which is why I suggested the lack of training data as the reason. Performance was evaluated using accuracy and F-measure. – kanghj91 Mar 03 '16 at 15:46

4 Answers


If you are truly interested in all-or-nothing classification, the proportion classified correctly can be fooled by a variety of bogus models. If you are interested in prediction instead, and use proper accuracy scoring rules, you'll see that what outperforms other methods in a variety of situations is additive models that allow predictors to act nonlinearly (e.g., regression splines). You could call this class of models generalized additive models (GAMs; strictly speaking, that term refers to the nonparametric case) or additive smooth models.
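
To make the scoring-rule point concrete, here is a minimal sketch in Python with scikit-learn (my own illustration on synthetic data, not code from the answer; the SplineTransformer pipeline is only a rough stand-in for a full GAM):

```python
# A sketch, not the answerer's code: compare accuracy (an improper score)
# with log loss (a proper scoring rule) for a linear model versus a
# spline-based additive model on synthetic, additively nonlinear data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, log_loss
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import SplineTransformer  # scikit-learn >= 1.0

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
# True signal is nonlinear in each predictor but has no interactions.
logit = np.sin(X[:, 0]) + X[:, 1] ** 2 - X[:, 2]
y = (rng.random(500) < 1.0 / (1.0 + np.exp(-logit))).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

linear = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
# Additive model: a spline basis per predictor, still no interaction terms.
additive = make_pipeline(
    SplineTransformer(n_knots=5, degree=3),
    LogisticRegression(max_iter=1000),
).fit(X_tr, y_tr)

for name, model in [("linear", linear), ("additive splines", additive)]:
    p = model.predict_proba(X_te)[:, 1]
    print(f"{name:17s} accuracy={accuracy_score(y_te, (p > 0.5).astype(int)):.3f} "
          f"log loss={log_loss(y_te, p):.3f}")
```

On data like these, the two models can look similar on raw accuracy while the additive model shows a clearly lower (better) log loss, which is the kind of gap improper scores hide.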

The reason that additive models, or linear models for that matter, can outperform other methods in many situations is that they are effectively Bayesian, with a prior distribution that places weight on additive effects and little weight on non-additive (interactive; synergistic) effects. Use of prior information can substantially improve mean squared error (and other measures of predictive accuracy). We find in many situations that the dominant effects are additive, and that complex interactive effects (of the kind featured by random forests, SVMs, recursive partitioning, and other approaches) are not very predictive.
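
A small companion sketch (again my own, on hypothetical data with a purely additive signal) of this point: an interaction-hunting learner such as a random forest has nothing extra to find, so the simple additive model tends to predict better.

```python
# A sketch of the "additive prior" point: when the true signal is purely
# additive, an interaction-capable learner gains nothing from its flexibility.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n = 300  # modest sample size, as in the question
X = rng.normal(size=(n, 10))
logit = X[:, 0] - 0.8 * X[:, 1] + 0.5 * X[:, 2]  # additive, no interactions
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-logit))).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

for name, model in [
    ("logistic regression", LogisticRegression(max_iter=1000)),
    ("random forest", RandomForestClassifier(random_state=1)),
]:
    p = model.fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
    print(f"{name:20s} log loss={log_loss(y_te, p):.3f}")
```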

Frank Harrell
  • Thanks for your answer, Frank. Although many of the other answers were extremely useful too, I think yours was the most helpful for learning more. I find this very interesting. Would you have any good introductory material on GAMs? – kanghj91 Mar 04 '16 at 16:10
  • Start with my Regression Modeling Strategies course notes at http://biostat.mc.vanderbilt.edu/rms - see the link at the very top right. Look at the chapter on regression splines. Once you understand parametric GAMs you can look at nonparametric GAMs, then transform-both-sides nonparametric additive regression. – Frank Harrell Mar 04 '16 at 16:19

If a problem is nonlinear and its class boundaries cannot be approximated well with linear hyperplanes, then nonlinear classifiers are often more accurate than linear classifiers. If a problem is linear, it is best to use a simpler linear classifier.

The performance depends on the problem.
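
One way to see "it depends on the problem" in code, as a sketch on assumed synthetic datasets (my own illustration, not from the answer):

```python
# A sketch: the same two classifiers, two different problems.
from sklearn.datasets import make_classification, make_moons
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# One roughly linearly separable problem, one with a curved boundary.
linear_problem = make_classification(n_samples=200, n_features=2,
                                     n_informative=2, n_redundant=0,
                                     class_sep=2.0, random_state=0)
curved_problem = make_moons(n_samples=200, noise=0.2, random_state=0)

for label, (X, y) in [("linear problem", linear_problem),
                      ("curved problem", curved_problem)]:
    for kernel in ("linear", "rbf"):
        acc = cross_val_score(SVC(kernel=kernel), X, y, cv=5).mean()
        print(f"{label}  {kernel:6s} CV accuracy={acc:.3f}")
```

With few samples and a truly linear boundary, the linear kernel typically matches or beats the RBF kernel; on the moons data the ordering usually reverses.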

WeiYuan

What sort of nonlinear classifier you're using matters. Supposing there is an underlying linear relationship, a polynomial fit will work at least as well as a linear fit, but a decision tree, which uses a sequence of axis-aligned binary splits, will need a large number of splits to approximate that relationship.
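
A minimal sketch of this contrast (mine, on made-up data): a shallow tree approximates a straight line with a handful of constant steps, while the linear fit recovers it directly.

```python
# A sketch: recovering a straight line with a linear fit versus a
# shallow decision tree (a piecewise-constant step function).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(2)
X = rng.uniform(-3, 3, size=(200, 1))
y = 2.0 * X[:, 0] + rng.normal(scale=0.3, size=200)  # truly linear signal

# Evaluate against the noise-free truth on a fresh grid.
X_grid = np.linspace(-3, 3, 200).reshape(-1, 1)
y_true = 2.0 * X_grid[:, 0]

models = [("linear fit", LinearRegression()),
          ("depth-3 tree", DecisionTreeRegressor(max_depth=3, random_state=2))]
for name, model in models:
    model.fit(X, y)
    mse = mean_squared_error(y_true, model.predict(X_grid))
    print(f"{name:12s} MSE vs. true line: {mse:.4f}")
```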

Matthew Graves
  • So far, I have tried SVMs with different kernels, neural networks, and decision trees. Oddly, the best classifier was LIBLinear, beating out the SVMs with other kernels. – kanghj91 Mar 03 '16 at 15:49

"Performance" is a loaded term.

Variations on meaning:

  • How long it takes to compute (seconds, cycles, ...)
  • How much memory it uses during compute (on-die Bytes, L2, RAM, cache to disk)
  • How big is your favorite fit metric ($R^2$, AIC, ROC, ...)

I suspect that it is really a linear problem; in that case a nonlinear classifier can model part of the noise along with the signal, and a decent fit metric can indicate that. Nonlinear fits also take longer to compute than linear fits.
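
A rough sketch of that suspicion (my own, on assumed synthetic data): on a small, essentially linear problem, a flexible model can memorize noise, which shows up as a gap between training and test accuracy.

```python
# A sketch: on a small, roughly linearly separable problem with label noise,
# a flexible model can memorize noise; compare train vs. test accuracy.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=120, n_features=20, n_informative=5,
                           n_clusters_per_class=1, flip_y=0.1,  # label noise
                           random_state=3)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=3)

for name, model in [("logistic", LogisticRegression(max_iter=1000)),
                    ("random forest", RandomForestClassifier(random_state=3))]:
    model.fit(X_tr, y_tr)
    print(f"{name:14s} train={model.score(X_tr, y_tr):.3f} "
          f"test={model.score(X_te, y_te):.3f}")
```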

When you say "performance", what exactly do you mean?

EngrStudent
  • Sorry for missing details in the question! By performance, I meant accuracy and F-measure. But surely your answer then raises the question of what a linear problem is, and how to identify one... – kanghj91 Mar 03 '16 at 15:50
  • @kanghj - "is it linear?" is a fun question! It is a variation of "when is a cow like a sphere". Linearization is big in control-system engineering circles, such as aircraft controls. There are near- (but sub-) optimal solutions that assume linearity on weakly nonlinear systems. SVM is about turning things into linearly separable forms. As a first answer, I would say that if the purely linear model has better separability, run-time, and fit statistics than the non-linear one, that might be a place to start looking. – EngrStudent Mar 03 '16 at 16:07