
Conventional logistic regression expresses the log odds as a linear function of the predictors, i.e., $\beta'x$, so the problem reduces to classifying with the linear boundary $\beta'x=0$. But sometimes the training set is apparently not linearly separable (this can be tested with linear programming). For example, if I visualize a two-dimensional training set and the two classes form two almost separable circles, should I just use something like $\beta_1x_1^2+\beta_2x_2^2+\beta_3$ to express the log odds? Isn't that arbitrary, since the functional form is based purely on my visualization? But on the other hand, isn't conventional logistic regression also quite arbitrary, since it assumes a linearity that one would check with techniques such as linear programming?
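(By "tested with linear programming" I mean something like the rough sketch below: it checks linear separability as a feasibility problem, assuming SciPy's `linprog` is available and the labels are coded as ±1.)

```python
import numpy as np
from scipy.optimize import linprog

def linearly_separable(X, y):
    """Return True if labels y in {-1, +1} can be split by a hyperplane w'x + b = 0."""
    X, y = np.asarray(X, dtype=float), np.asarray(y, dtype=float)
    n, d = X.shape
    # Feasibility LP: find (w, b) with y_i * (w'x_i + b) >= 1 for every point i.
    # Decision variables are [w_1, ..., w_d, b]; each row is -y_i * [x_i, 1] (dotted with [w, b]) <= -1.
    A_ub = -y[:, None] * np.hstack([X, np.ones((n, 1))])
    b_ub = -np.ones(n)
    res = linprog(c=np.zeros(d + 1), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * (d + 1))
    return res.success  # feasible => separable, infeasible => not separable
```

On the raw circle data this should report non-separability, while on the squared features $x_1^2,x_2^2$ it should report separability.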


Do people in practice attempt to fit nonlinear functions in logistic regression?

Nicholas
  • See http://stats.stackexchange.com/a/64039/919 for an example. That answers your last question. Perhaps it indicates ways in which your first question might be answered. For a more extensive answer to the last question, consult Frank Harrell's book on *Regression Modeling Strategies* and study his extensive remarks here on CV concerning splines. – whuber Oct 03 '16 at 15:33
  • Just finished reading the long post you hyperlinked. Is the idea there to choose the statistical model (here, the kernel method) based on theory (correct me if I am wrong)? If so, does logistic regression without any kernel trick also need a theory to support linear separability of the given data? And unfortunately, I can't use linear programming for that, since it's just not a theory, I guess. @whuber – Nicholas Oct 04 '16 at 23:11
  • You seem to be confusing logistic regression with something entirely different, such as SVMs. – whuber Oct 05 '16 at 00:14
  • Could you please explain a bit? It's really interesting, but I don't know SVMs yet. BTW, it seems I used the wrong term: by "kernel method" I actually mean applying a nonlinear transformation to the predictors. @whuber – Nicholas Oct 05 '16 at 01:08

1 Answer


The term "linear" here refers to the relationship between the features (predictors) and the parameters in the regression equation $$\mathbb{E}[\,y\mid \mathbf{f}\,]=\mathbf{f}^T\boldsymbol{\beta}$$ Here $\boldsymbol{\beta}=[\beta_1,\ldots,\beta_m]$ is the vector of parameters to be estimated from the data, while for a given data point $(\mathbf{x},y)$, the vector $\mathbf{f}(\mathbf{x})=[f_1,\ldots,f_m]$ is a set of features computed from $\mathbf{x}$, and for logistic regression, $y=\mathrm{logit}[\Pr(\mathrm{class}=1)]$.

In your example, the feature-mapping $$\mathbf{f}(\mathbf{x})=[x_1^2,x_2^2,1]$$ is nonlinear, but the regression is still linear in the parameters $\boldsymbol{\beta}$.
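As a concrete illustration (a minimal sketch assuming scikit-learn; the two-circles toy data are only illustrative), the model from your question is just an ordinary logistic regression fit to a transformed design matrix:

```python
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression

# Toy data: two nearly separable concentric circles (one class inside the other).
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

# Linear features [x1, x2]: a linear boundary can do little better than chance here.
raw = LogisticRegression().fit(X, y)
print(raw.score(X, y))                  # roughly 0.5

# Nonlinear feature mapping f(x) = [x1^2, x2^2]; still linear in the parameters beta.
F = X ** 2
model = LogisticRegression().fit(F, y)
print(model.coef_, model.intercept_)    # the estimated beta
print(model.score(F, y))                # near 1.0 on this toy data
```

Note that `LogisticRegression` adds the intercept itself, so it plays the role of the constant feature $1$ in $\mathbf{f}(\mathbf{x})$.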


For your example problem of "separating circles", you can see a nice demonstration of linear vs. nonlinear approaches using the TensorFlow Playground site, where you can train each of the three classifiers described below by pressing "play".

The first, a linear classifier on the raw features, can do no better than chance (i.e., 50% error). The second, using the squared features, is the one from your question. The third is a neural network with a hidden layer that learns 3 new "hybrid" features, which are then used to classify the data.


For your second question, I would say it is quite common to use nonlinear feature mappings. This is often done in the context of the so-called "kernel trick" (e.g. for SVMs). Classically, these feature mappings are pre-specified. As shown in the third example above, nonlinear feature mappings can also be learned, which is often done in the context of deep learning.
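For instance (a sketch with scikit-learn; the particular models and settings are only illustrative), a kernelized SVM uses a pre-specified implicit feature mapping, while a small neural network learns its feature mapping from the data:

```python
from sklearn.datasets import make_circles
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

# Pre-specified (implicit) nonlinear feature mapping, via the RBF kernel.
svm = SVC(kernel="rbf").fit(X, y)

# Learned nonlinear feature mapping: one hidden layer of 3 units,
# roughly analogous to the third example above.
net = MLPClassifier(hidden_layer_sizes=(3,), activation="tanh",
                    solver="lbfgs", max_iter=5000, random_state=0).fit(X, y)

print(svm.score(X, y), net.score(X, y))  # both should separate the circles well
```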

GeoMatt22
  • Appreciate this jargon clarification. But any comment on choosing a nonlinear feature mapping? – Nicholas Oct 03 '16 at 15:50
  • Thanks again. One more question: suppose your training set is not linearly separable but there seems to be a pattern (maybe a circle, maybe a hyperbola). Would you apply a kernel trick, or would you prefer a nonlinear model such as deep learning? – Nicholas Oct 04 '16 at 23:20
  • @Nicholas I am really not an expert in these areas. My understanding is that for many problem domains there are well developed (possibly nonlinear) feature mappings that are known to work. These may be very sophisticated, and commonly were honed over years of study by generations of specialists. For deep learning, the "top level" classifier will typically be an SVM/logistic regression/etc. The goal is for the hidden layers to *learn* appropriate nonlinear feature mappings in a generic way, that can apply across domains. (This is my broad understanding, at least.) – GeoMatt22 Oct 04 '16 at 23:37
  • By the way, the "kernel trick" is *not* a method for creating complex/nonlinear feature mappings. It is purely a way to efficiently *train* (and evaluate) a model that *uses* high-dimensional feature mappings. – GeoMatt22 Oct 04 '16 at 23:38
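To illustrate this last point with the textbook quadratic-kernel example (a small NumPy sketch, not tied to any model above): the kernel returns the inner product of the high-dimensional features without ever constructing them.

```python
import numpy as np

def phi(x):
    """Explicit feature mapping whose inner product matches the kernel (x'z + 1)^2 for 2-D x."""
    x1, x2 = x
    return np.array([x1**2, x2**2, np.sqrt(2)*x1*x2,
                     np.sqrt(2)*x1, np.sqrt(2)*x2, 1.0])

def k(x, z):
    """Quadratic polynomial kernel, evaluated directly in the original 2-D space."""
    return (np.dot(x, z) + 1.0) ** 2

x, z = np.array([0.3, -1.2]), np.array([2.0, 0.7])
print(np.dot(phi(x), phi(z)))  # inner product of the explicit 6-D features
print(k(x, z))                 # same value, computed without building phi at all
```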