I have a biological dataset with 30,000 features (genes) and 1,000 data points (cells). The cells fall into two major classes, 1 and 0, with a 90/10 split.
I am trying to classify them correctly using nested cross-validation. My first attempt was to manually reduce the number of features by considering only biologically relevant subsets of the full feature set (down to 20 features), which gives reasonable results (an F2 score of 0.7).
However, I am wondering whether using the whole feature set would lead to severe overfitting, since I have far fewer data points than features.
Is it true that I would overfit if I used the whole feature set? And if so, are there ways to reduce the feature set without prior biological knowledge?
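For reference, here is a minimal sketch of the kind of setup I mean by nested cross-validation scored with F2 on the reduced 20-feature set. It assumes scikit-learn, and everything specific in it is a placeholder rather than my actual pipeline: the data is synthetic, the classifier (logistic regression), the hyperparameter grid, the fold counts, and treating the rare class as the positive class (label 1) are all assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data: 1000 cells, 20 features,
# roughly 90/10 class imbalance (assumption: label 1 is the rare class).
X, y = make_classification(
    n_samples=1000,
    n_features=20,
    n_informative=5,
    weights=[0.9, 0.1],
    random_state=0,
)

# F2 weights recall more heavily than precision; the positive class is label 1.
f2_scorer = make_scorer(fbeta_score, beta=2)

# Stratified folds keep the class ratio similar in every split.
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

# Scaling lives inside the pipeline so it is refit on each training fold only.
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)

# Inner loop: hyperparameter search (placeholder grid).
search = GridSearchCV(
    model,
    param_grid={"logisticregression__C": [0.01, 0.1, 1.0]},
    scoring=f2_scorer,
    cv=inner_cv,
)

# Outer loop: performance estimate of the whole tuning procedure.
scores = cross_val_score(search, X, y, scoring=f2_scorer, cv=outer_cv)
print("F2 per outer fold:", scores)
print("Mean F2:", scores.mean())
```

The inner loop picks the hyperparameters and the outer loop estimates how well that whole procedure generalizes, so the reported score is not tuned on the same folds it is evaluated on.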
Thanks a lot! Tomi