I am using SelectKBest for my feature selection process. https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html
My data is non-normal and skewed. I don't transform or scale it either, since I am using a tree-based method (an XGBoost binary classifier).
I have 200+ features, so for better performance I would like to reduce this number somehow.
I am using SelectKBest(score_func=f_classif).
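For context, here is a minimal sketch of my setup. The data is synthetic (make_classification is a stand-in for my real dataset) and k=50 is an arbitrary illustrative choice:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Toy stand-in for the real data: 200 features, binary target.
X, y = make_classification(n_samples=500, n_features=200,
                           n_informative=20, random_state=0)

# Keep the k highest-scoring features by ANOVA F-value.
selector = SelectKBest(score_func=f_classif, k=50)
X_reduced = selector.fit_transform(X, y)
print(X_reduced.shape)  # (500, 50)
```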
From my understanding, f_classif interprets the values of y as class labels and computes, for each feature X[:, i] of X, an F-statistic. The formula used is exactly the one-way ANOVA F-test, with K the number of distinct values of y. I am fairly sure this rests on an underlying assumption of normally distributed features. I have been reading about alternative scoring functions for my classification task, e.g. chi2 as opposed to f_classif.
Since chi2 is non-parametric, would you say it is better suited to my data?
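For reference, swapping in chi2 would look like the sketch below. One caveat I am aware of: scikit-learn's chi2 only accepts non-negative feature values (it is intended for counts/frequencies), so I shift the illustrative data to make it non-negative; whether that shift is appropriate for real features is part of my question:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, chi2

# Same illustrative stand-in data as before.
X, y = make_classification(n_samples=500, n_features=200,
                           n_informative=20, random_state=0)

# chi2 requires non-negative inputs, so shift each feature
# so its minimum is zero (demo only).
X_nonneg = X - X.min(axis=0)

selector = SelectKBest(score_func=chi2, k=50)
X_reduced = selector.fit_transform(X_nonneg, y)
print(X_reduced.shape)  # (500, 50)
```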