1

I have a $198$-sample dataset containing expression values for different miRNA types (numerical features) and one categorical feature "Type" with the values "Tumor" or "Healthy".

  Index    miRNA1       miRNA2       miRNA3        Type
  1        48421.52     24242.14     23842.1518    Tumor
  2        2757.96      28965.2      7339.57       Healthy
  3        4300.34      52565.07     6981.41       Healthy
  ...      ...          ...          ...           ...
  198      23854.73     24722.28     7611.53       Tumor

Since there are $1584$ of these features in total, I need to select the ones that are most strongly associated with developing a Tumor.

My approach is described below. Is it correct?

The distributions of the features are mostly log-normal. I've transformed each feature with a Box-Cox transformation to get approximately normal distributions, and then scaled the values with a Min-Max scaler to put them in the range $[0,1]$.
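A minimal sketch of this preprocessing, assuming the data sit in a hypothetical pandas DataFrame `df` with the miRNA columns plus the "Type" column (names are illustrative):

```python
import pandas as pd
from scipy.stats import boxcox
from sklearn.preprocessing import MinMaxScaler

feature_cols = [c for c in df.columns if c != "Type"]

# Box-Cox requires strictly positive values; each feature gets its own fitted lambda.
transformed = df[feature_cols].apply(
    lambda col: pd.Series(boxcox(col)[0], index=col.index)
)

# Rescale every transformed feature into [0, 1].
scaled = pd.DataFrame(
    MinMaxScaler().fit_transform(transformed),
    columns=feature_cols,
    index=df.index,
)
```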

miRNA1 has $100$ Healthy samples and $98$ Tumor samples. I take as the null hypothesis that the Tumor samples have the same values as the Healthy samples. I calculate the mean and standard deviation for the Tumor samples and the Healthy samples, compute the t-score, and from it the p-value, using a significance level of $0.05$; the DF in this case is $97$. Since this is a two-tailed test, the one-tailed $p$-value is multiplied by $2$. If the result is lower than $0.05$, I reject the null hypothesis and consider miRNA1 a feature that impacts Tumor development, right?
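A minimal sketch of this test for a single feature, reusing the hypothetical `scaled` and `df` from above. Note that scipy's `ttest_ind` already returns a two-sided p-value, so no manual doubling is needed:

```python
from scipy.stats import ttest_ind

tumor = scaled.loc[df["Type"] == "Tumor", "miRNA1"]
healthy = scaled.loc[df["Type"] == "Healthy", "miRNA1"]

# equal_var=False gives Welch's t-test, which does not assume equal variances;
# with equal_var=True, the pooled two-sample test has n1 + n2 - 2 = 196
# degrees of freedom here, not 97.
t_stat, p_value = ttest_ind(tumor, healthy, equal_var=False)
reject_null = p_value < 0.05
```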

– asked by Alex (edited by Karolis Koncevičius)
  • Sounds good except that the results of the final model will be extremely biased (too large effects, too good inference etc.) – Michael M Sep 04 '19 at 07:28
  • @MichaelM Thanks for the comment, I browsed several posts here [1](https://stats.stackexchange.com/questions/365209/feature-selection-using-cross-validation) [2](https://stats.stackexchange.com/questions/2306/feature-selection-for-final-model-when-performing-cross-validation-in-machine) [3](https://stats.stackexchange.com/questions/27750/feature-selection-and-cross-validation) and indeed bias is a problem in general; I will probably use advice from other posts to minimize its effect. – Alex Sep 04 '19 at 12:05

2 Answers

5

First, what you have is high-dimensional data: far more features than samples. This alone poses problems, and you should use a method that was designed for, and is better suited to, this setting.

Second, a regular t-test is a bad idea in this case: it is a univariate test, meaning it considers each variable on its own and ignores possible interactions between variables. Also, p-values are not meant to be used for feature selection.

Nonetheless, if you are set on a t-test, it would be better to use a permutation test for significance. You have many variables, which will require serious corrections when you adjust your p-values for multiple testing (and you will adjust them, right?).
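To make both suggestions concrete, here is a rough sketch, reusing the hypothetical `df`, `scaled`, and `feature_cols` from the question: a permutation-based t-test per feature (available in scipy >= 1.7) followed by a Benjamini-Hochberg adjustment of all 1584 p-values:

```python
import numpy as np
from scipy.stats import ttest_ind
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(0)
p_values = []
for col in feature_cols:
    tumor = scaled.loc[df["Type"] == "Tumor", col]
    healthy = scaled.loc[df["Type"] == "Healthy", col]
    # The null distribution of t is built by permuting the group labels;
    # increase `permutations` for more precise p-values (at a runtime cost).
    res = ttest_ind(tumor, healthy, permutations=1_000, random_state=rng)
    p_values.append(res.pvalue)

# Adjust for 1584 simultaneous tests; `reject` marks features kept at FDR 0.05.
reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")
```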

Finally, I would personally use LASSO regression to solve this, which is a better and simpler option: LASSO performs feature selection automatically, and it considers all the variables together rather than one by one.
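Since the outcome here is binary, the LASSO idea translates to L1-penalized logistic regression. A minimal scikit-learn sketch, again reusing the hypothetical `scaled`, `df`, and `feature_cols`:

```python
from sklearn.linear_model import LogisticRegression

X = scaled.values
y = (df["Type"] == "Tumor").astype(int)

# C controls the inverse penalty strength; smaller C shrinks more
# coefficients to exactly zero (in practice you would tune it by CV).
lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
lasso.fit(X, y)

# Features whose coefficients were not shrunk to zero are the selected ones.
selected = [c for c, w in zip(feature_cols, lasso.coef_[0]) if w != 0]
```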

– user2974951
    (+1) piggybacking to add: university stats classes often teach that doing some transformations to make data normal will solve your problems, but that is largely not the case. LASSO is a great option here if you need a full model, but another thing statistical genetics/genomics people do in this situation is false-discovery rate control testing, which I would argue is a bit simpler and works better out-of-the-box (though you gotta be careful with the interpretation). – Sheridan Grant Sep 04 '19 at 08:00
  • @SheridanGrant You are right, I forgot to add a comment about needlessly transforming variables, since a t-test does not require normally distributed _data_. As for FDR, it is a good approach, but I would still consider LASSO a better option, since it models all the variables together, while FDR is still used in the context of univariate hypothesis testing. – user2974951 Sep 04 '19 at 08:12
  • @SheridanGrant and user2974951, thanks for your input! I wanted to start with t-tests because they seem relatively easy to understand and are used in numerous studies in this area. A lot of papers also use ANOVA, Pearson correlation, and data clustering techniques ([study](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1523197/)). Unfortunately none of these papers describe how they carry out these tests, so I don't really know how to get started properly. I will consider using LASSO and random forests for this task. 1/2 – Alex Sep 04 '19 at 12:22
  • The objective, however, is first to reduce the dataset to around 10-20 features and then experiment with different models. That's what most papers in this area do: they use these techniques to get a dataset of ~10 features and then use an ANN to build a model (they have microarray data; maybe that's why they use ANNs instead of simpler classifiers). On the distribution part: I normalized the features since almost every study does that, and as a beginner I just did the same; I also read that parametric tests do assume a Gaussian-like distribution. (In these studies they use a Kolmogorov-Smirnov test and then normalize.) 2/2 – Alex Sep 04 '19 at 12:22
  • @Alex I understand, feature selection is important. However, you should not use p-values to do that; it is a bad idea that has been discussed numerous times on this site, along with why it leads to potentially very bad results. Additionally: 1) the other measures you mentioned are also not used for feature selection, except for clustering, 2) I would not use an ANN with such a small sample size, 3) a lot of people perform normalization because of the wrong belief that it will improve the results, 4) LASSO or RF is a good option. – user2974951 Sep 04 '19 at 12:43
0

Even though there are better options for feature selection, the t-test can still be part of your research. What you did is correct to some extent, but you should use the t-score as the reference for your feature ranking: the higher the t-score, the better the feature. I would recommend information-theory-based filter methods; you can take the order of the features from a decision tree using information gain. In my experience this worked far better than the t-test (the t-test cannot remove redundancy), and it was even better than embedded feature selection approaches such as L1-penalized logistic regression and RF in terms of feature subset compactness.
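As a rough sketch of such information-theoretic rankings (not necessarily this answerer's exact setup), using scikit-learn's `mutual_info_classif` and an entropy-criterion decision tree, and reusing the hypothetical `X`, `y`, and `feature_cols` from the LASSO sketch above:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.tree import DecisionTreeClassifier

# Score each feature by its estimated mutual information with the class label.
mi_scores = mutual_info_classif(X, y, random_state=0)
mi_ranking = [feature_cols[i] for i in np.argsort(mi_scores)[::-1]]

# Rank features by the information gain they contribute to the tree's splits.
tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)
tree_ranking = [feature_cols[i] for i in np.argsort(tree.feature_importances_)[::-1]]

top_10 = mi_ranking[:10]  # e.g. reduce to the 10-20 features mentioned above
```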

  • "worked better" in the sense of proper out-of-sample and out-of-time validation? Or rather by getting seemingly good results? – Michael M Sep 26 '21 at 08:33