I have a $198$-sample dataset containing miRNA types (numerical features) and one categorical feature "Type" with values "Tumor" or "Healthy".
Index  miRNA1    miRNA2    miRNA3      Type
1      48421.52  24242.14  23842.1518  Tumor
2      2757.96   28965.2   7339.57     Healthy
3      4300.34   52565.07  6981.41     Healthy
...    ...       ...       ...         ...
198    23854.73  24722.28  7611.53     Tumor
Since there are $1584$ of these features in total, I need to select the ones that are most influential in Tumor development.
My approach is described below. Is it correct?
The distributions of the features are mostly log-normal. I transformed each feature with a Box-Cox transformation to get approximately normal distributions, then scaled the values with a Min-Max scaler to put them in the range $[0,1]$.
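For concreteness, here is a minimal sketch of that preprocessing step, assuming NumPy and SciPy are available. The helper name `boxcox_minmax` and the synthetic log-normal matrix `X_demo` are illustrative assumptions, not my actual code or data; the Min-Max step is written out by hand rather than taken from a library scaler.

```python
import numpy as np
from scipy import stats


def boxcox_minmax(X):
    """Box-Cox transform each column, then rescale it to [0, 1].

    Hypothetical helper for illustration only.
    """
    X = np.asarray(X, dtype=float)
    if np.any(X <= 0):
        raise ValueError("Box-Cox requires strictly positive values")
    out = np.empty_like(X)
    lambdas = np.empty(X.shape[1])
    for j in range(X.shape[1]):
        # With lmbda left unset, boxcox estimates lambda by maximum likelihood
        transformed, lambdas[j] = stats.boxcox(X[:, j])
        lo, hi = transformed.min(), transformed.max()
        # Min-Max scaling: map the column's minimum to 0 and maximum to 1
        out[:, j] = (transformed - lo) / (hi - lo)
    return out, lambdas


# Synthetic stand-in for the real miRNA matrix (assumption, not real data)
rng = np.random.default_rng(0)
X_demo = rng.lognormal(mean=9.0, sigma=1.0, size=(198, 3))
X_scaled, lambdas = boxcox_minmax(X_demo)
```

Note that Min-Max scaling is a positive affine rescaling applied to all samples of a feature, so it should not change a t-statistic computed on that feature afterwards.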
miRNA1 has $100$ Healthy samples and $98$ Tumor samples. My null hypothesis is that Tumor samples have the same mean value as Healthy samples. I calculate the mean and standard deviation for the Tumor samples and for the Healthy samples, then calculate the t-score and the $p$-value, using a significance level of $0.05$ and, in this case, $97$ degrees of freedom. Since this is a two-tailed test, I multiply the one-tailed $p$-value by $2$. If the result is lower than $0.05$, I reject the null hypothesis and consider miRNA1 a feature that impacts Tumor development, right?
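The per-feature test described above could be sketched as follows, assuming SciPy's `scipy.stats.ttest_ind`. This is one possible way to run it, not necessarily the right procedure: `ttest_per_feature` and the synthetic data are hypothetical, and choosing Welch's variant (`equal_var=False`) is an assumption of the sketch. SciPy derives the degrees of freedom itself (Welch-Satterthwaite here; with `equal_var=True` it would use $n_1 + n_2 - 2 = 196$) rather than the $97$ I used, and the $p$-value it returns is already two-sided, so it is not doubled again.

```python
import numpy as np
from scipy import stats


def ttest_per_feature(X, labels, alpha=0.05):
    """Two-sample t-test of Tumor vs. Healthy for every column of X.

    Hypothetical helper for illustration only.
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    tumor = X[labels == "Tumor"]
    healthy = X[labels == "Healthy"]
    # equal_var=False selects Welch's t-test (an assumption of this sketch);
    # the returned p-values are two-sided by default
    t_stat, p_val = stats.ttest_ind(tumor, healthy, axis=0, equal_var=False)
    return t_stat, p_val, p_val < alpha


# Synthetic stand-in data (assumption): 100 Healthy and 98 Tumor samples
rng = np.random.default_rng(1)
labels = np.array(["Healthy"] * 100 + ["Tumor"] * 98)
X_demo = rng.normal(size=(198, 3))
X_demo[labels == "Tumor", 0] += 2.0  # shift feature 0 in the Tumor group

t_stat, p_val, reject = ttest_per_feature(X_demo, labels)
print(p_val, reject)
```

Each entry of `reject` says whether the null hypothesis is rejected for the corresponding feature at the chosen significance level.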