I have a huge dataset on the effect of a drug on cancerous cell lines, with 17k columns. I need to fit a simple regression, but I don't know how to pick the most important columns. I wrote a simple preprocessing step:
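# variance of each column (feature)
variances <- apply(X=data.train, MARGIN=2, FUN=var)
# column indices of the 500 highest-variance features (order() keeps original positions, NAs go last)
sorted <- order(variances, decreasing=TRUE)[1:500]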
# use that to subset the original data
dat.highvariance <- data.train[, sorted]
Is this the correct way to do it?