Simple preprocessing for a dataset with many columns

Question

I have a huge dataset on the effect of a drug on cancer cell lines, with about 17k columns. I need to fit a simple regression, but I don't know how to pick the most important columns. I wrote a simple preprocessing step:

# variance of each column
variances <- apply(X = data.train, MARGIN = 2, FUN = var)

# indices of the 500 columns with the highest variance
top.idx <- order(variances, decreasing = TRUE)[1:500]

# use that to subset the original data
dat.highvariance <- data.train[, top.idx]

Is this the correct way?

score 1 · Answer 1 · answered Jun 03 '20 at 09:47

how to pick the most important columns

It depends on how we define "important". What you are doing is picking the columns with the highest variance (information), which is a good way to start.

However, be aware that columns with large variance may have nothing to do with the regression target.
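
To illustrate the point, here is a minimal sketch on simulated data. Everything in it is made up for the example (the column names `noise` and `signal`, the target `y`, the sample size); it is not your dataset. One column has a large variance but is unrelated to the target, while the other has a small variance and drives the target. Ranking by variance and ranking by absolute correlation with the target then pick different columns.

```r
set.seed(1)
n <- 200

# Hypothetical predictors: a high-variance column unrelated to y,
# and a low-variance column that actually determines y.
noise  <- rnorm(n, sd = 10)
signal <- rnorm(n, sd = 0.5)
X <- cbind(noise = noise, signal = signal)

# Hypothetical regression target, built from the low-variance column only.
y <- 3 * signal + rnorm(n, sd = 0.2)

# Unsupervised ranking: variance of each column.
variances <- apply(X, 2, var)

# Supervised ranking: absolute Pearson correlation of each column with y.
target.cor <- abs(cor(X, y))[, 1]

top.by.variance <- names(which.max(variances))
top.by.cor      <- names(which.max(target.cor))

print(top.by.variance)
print(top.by.cor)
```

If you do filter on correlation with the target, it is usually safer to do the filtering inside your cross-validation folds, since choosing columns with the help of `y` on the full data can make the later error estimate look too optimistic.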

See this post for details.

How to decide between PCA and logistic regression?
