2

I'm starting my first Machine Learning project to classify some entities and I decided to use Logistic Regression for the task.

Initially I started with around 10 features, and I can see that my model is underfitting the data (F-score around 0.63).

That can be explained by the fact that all of my features are first-order, so my hypothesis is a first-order polynomial.

I would like to add higher-order features, but I quickly realized that I don't have a good intuition for how to do that. I could take each of my features $X_n$ and add new ones $X_n^2$, $X_n^3$, etc. I could also start adding interaction features like $X_1 \cdot X_2$, etc.

Immediately I noticed that there are countless possibilities. How do I start? What are good practices for adding more features? How can I avoid overfitting the data?

Frank Harrell
  • 74,029
  • 5
  • 148
  • 322
ŁukaszBachman
  • 435
  • 1
  • 5
  • 9
  • 2
    Logistic regression is not a classification technique. – Frank Harrell May 19 '17 at 12:15
  • @FrankHarrell why not? It seems completely fine to use the output hypothesis from Logistic Regression to classify items. Example: https://www.cs.cmu.edu/~kdeng/thesis/logistic.pdf – ŁukaszBachman May 22 '17 at 10:32
  • 2
    It's only appropriate to do that if you have a utility function, e.g., the costs of wrong decision in both directions. When logistic regression was invented by DR Cox in 1958 it was to directly estimate probabilities. For more see http://www.fharrell.com/2017/01/classification-vs-prediction.html . As described there, one of the greatest mistakes made in machine learning is to develop classifiers when the appropriate solution is risk estimation. There is a nomenclature problem in the reference you gave. Probabilities can have nothing to do with classification. – Frank Harrell May 22 '17 at 11:37
  • @ŁukaszBachman This thread may be of interest. https://stats.stackexchange.com/questions/127042/why-isnt-logistic-regression-called-logistic-classification/127044#127044 – Sycorax Jun 29 '17 at 03:07

3 Answers

0

If you really want to add higher-order features to a logistic regression, then I would suggest you expand your features with interactions between features ($X_1 \cdot X_2$) and nonlinear features like $\log(X_1)$ and $X_1^2$, everything exactly as you proposed.

Finally, to avoid overfitting and perform variable selection at the same time, apply a LASSO (L1) regularizer: it both penalizes model complexity and induces sparsity, so only the subset of (higher-order) features that matter most will be kept by the model.

You might also want to consider nonlinear models, which try to discover the optimal nonlinearity by themselves (e.g. neural networks).
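A minimal sketch of the LASSO idea above, assuming scikit-learn (the dataset and the `C`/`degree` values are illustrative, not tuned): expand the features with squares and pairwise interactions, then fit an L1-penalized logistic regression and count how many of the expanded features survive.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# degree=2 adds squares and all pairwise interactions (X1^2, X1*X2, ...)
model = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    StandardScaler(),
    # L1 penalty (LASSO-style): zeroes out the coefficients of
    # expanded features that don't help the fit
    LogisticRegression(penalty="l1", solver="liblinear", C=0.1),
)
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))

coef = model.named_steps["logisticregression"].coef_
print("features kept:", (coef != 0).sum(), "of", coef.size)
```

The point is that you can be generous when generating candidate features and let the L1 penalty do the pruning, rather than deciding up front which terms to include.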

Ramalho
  • 737
  • 5
  • 14
  • sounds sensible, and thanks for mentioning LASSO, but again my question is: where do I stop? I can easily add $X_n^2$, but should I also add $X_n^3$, $X_n^4$, etc.? If I create new features that represent interactions like $X_1 \cdot X_2$, should I then also add $X_1 \cdot X_2^2$? Of course I could do that iteratively and stop when I'm satisfied with the accuracy of my classifier, but that seems like too big a job to be done manually. – ŁukaszBachman Oct 16 '16 at 08:08
0

First of all, it's important to know how many entities you have; the number of regressors you can afford depends heavily on that.

Have you split your data into a training set and a validation set?

Also, you're not necessarily underfitting; maybe a model with an F-score of 0.63 is the best model possible.

Be careful not to add too many features, as they will add variance to your model. To know which features to keep, use a significance test on every feature. You can see an example in R here: http://www.r-tutor.com/elementary-statistics/logistic-regression/significance-test-logistic-regression. If there are insignificant features, delete them one by one, starting with the highest p-value.

You will quickly see that features with a high polynomial degree are often insignificant.

el Josso
  • 392
  • 1
  • 14
0

Your best bet for "where to stop" is to continuously plot your metrics (precision/recall/accuracy/misclassification) on your test set. As soon as they start to deteriorate, you're likely overfitting and might want to reduce the number of features. Also, use your intuition when selecting which polynomial features to add; prioritize the features that seem most relevant.

See the error-analysis section here: http://www.holehouse.org/mlclass/11_Machine_Learning_System_Design.html

Another way to add features that might improve the algorithm is to break down factor/categorical features into dummy (one-hot) features (https://discuss.analyticsvidhya.com/t/how-to-handle-categorical-variables-in-logistic-regression/247/4).
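A one-line sketch of that dummy encoding with pandas (the `color` column is a made-up example):

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# drop_first avoids the dummy-variable trap (perfect collinearity
# with the intercept): one category becomes the implicit baseline
dummies = pd.get_dummies(df["color"], prefix="color", drop_first=True)
print(dummies.columns.tolist())  # ['color_green', 'color_red']
```

Each remaining category becomes its own 0/1 column, which logistic regression can then weight independently.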

Hope this helps

aike
  • 1