
I tried to use random forest to classify microarray data. Following the work of L. Breiman and Tao Shi, I constructed a synthetic data set using a bootstrap method: assuming a matrix with samples in rows and genes in columns, for each gene in each sample a value is drawn with replacement from that gene's column, as described at http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#unsup.
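A minimal sketch of this construction in R (the helper name make_synthetic is my own, not from any package):

make_synthetic <- function(x) {
    ## for each gene (column), draw values with replacement from that
    ## column; this keeps each gene's marginal distribution but destroys
    ## the dependence structure between genes
    apply(x, 2, function(g) sample(g, length(g), replace = TRUE))
}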

The original data is labeled as class 1 and the synthetic data as class 2, and the combined data is used as input to a supervised random forest. The weird thing is that I got an error rate of 100%. In my view, the worst case should be an error rate of 50%, which would mean the combined data set is being partitioned at random. I tried this approach many times, and it always gave an error rate of nearly 100%.

In practice, my data set is a 7 × 297 matrix: 7 samples and 297 genes.
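For concreteness, a sketch of the whole procedure on simulated data of the same shape (the values are random noise, so the exact error rate will differ from my real data):

library(randomForest)
set.seed(42)

x     <- matrix(rnorm(7 * 297), nrow = 7, ncol = 297)  # stand-in for my data
synth <- make_synthetic(x)                             # bootstrap copy, see above

combined <- rbind(x, synth)
y        <- factor(rep(1:2, each = 7))                 # 1 = real, 2 = synthetic

rf <- randomForest(combined, y)
rf$confusion                                           # OOB confusion matrix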

Besides, I have a problem understanding how unsupervised mode works in the R "randomForest" package, and how it calculates the proximity. From the source code, I get this:

if (!is.null(y)) {
    ## supervised mode: a response was supplied
    if (length(y) != n) 
        stop("length of response must be the same as predictors")
    addclass <- FALSE
}
else {
    ## unsupervised mode: fabricate a two-class problem by stacking the
    ## data on top of itself and labelling the two copies 1 and 2
    if (!addclass) 
        addclass <- TRUE
    y <- factor(c(rep(1, n), rep(2, n)))
    x <- rbind(x, x)
}

It seems that the new data set is just the raw data stacked on top of itself and treated as a two-class supervised problem. However, I don't get the result I would expect if that assumption were true:

library(randomForest)
set.seed(17)

## unsupervised mode: no y, so the forest builds its own two-class problem
## (proximity defaults to TRUE in this mode)
iris.urf <- randomForest(iris[, -5])
iris.urf$proximity[1:5, 1:5]
MDSplot(iris.urf, iris$Species)

## my hand-built version: two exact copies of the data, labelled 1 and 2
xx.1 <- randomForest(as.matrix(rbind(iris[, -5], iris[, -5])),
                     factor(rep(1:2, each = 150)), proximity = TRUE)
xx.1$proximity[1:5, 1:5]

## keep only the proximities among the 150 original observations
xx.2 <- xx.1$proximity
xx.1$proximity <- xx.2[1:150, 1:150]
MDSplot(xx.1, iris$Species)

The proximity matrices are very different, but the MDS plots show a similar pattern.

Could someone help me with these two questions?
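Update: judging by the comments below (and Breiman's page), class 2 is apparently not an exact copy; the C code (createClass in rfutils.c) bootstraps each column of the second copy. A hand-built sketch of that construction, reusing the make_synthetic helper from above:

set.seed(17)
x    <- as.matrix(iris[, -5])
xx.3 <- randomForest(rbind(x, make_synthetic(x)),
                     factor(rep(1:2, each = 150)), proximity = TRUE)
xx.3$proximity[1:5, 1:5]   # compare with iris.urf$proximity above

Even with the same seed the proximities will not match the package's exactly, since the column sampling there happens in C with its own random draws, but the construction should be the same in spirit.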

  • This usually happens when you have paradoxes in the data, i.e. pairs of samples that are identical with respect to the genes but carry different classes. However, I don't understand why you are doing this -- an unsupervised model is not a very useful construct; in practice it is only good for clustering objects (i.e. samples in your case). – mbq Sep 03 '12 at 17:11
  • With 7 observations there are probably not enough data points. A random forest is an ensemble of trees, and trees typically require more data points to learn a classification model. – JCWong Sep 03 '12 at 17:11
  • @mbq: I do want to use the proximity produced by RF to cluster samples. Another advantage of RF is that it also provides an importance measure for each variable, which will be helpful in selecting a subset of variables for downstream analysis. – ccshao Sep 04 '12 at 08:44
  • @hiberbear Well, clustering 7 samples... but it is doable. But the importance *certainly won't make any sense* here. Let's say those are all human genes -- then you'll get genes responsible for eye color, arthritis and alcohol tolerance mixed together. Without a response to predict, how can the forest tell which gene set to look for? This way you will only get a fairly random selection depending on how the synthetic objects were generated. – mbq Sep 04 '12 at 08:58
  • @mbq, you and JCWong are right: RF needs a suitable data set to work properly. Another technical question, which is not so relevant, is how randomForest (the R package I used) calculates the proximity in unsupervised mode. From the source code, I guessed that the function simply uses `rbind` to replicate the raw input data, labels the new rows with factors "1" and "2", and then treats the combined set as a supervised problem. I tried to explicitly give the new data set to the function randomForest and let it run in supervised mode. It gave very different results even when I set the seed. Any experience with this? – ccshao Sep 04 '12 at 09:24
  • @hiberbear OK, so let's get this straight: you have duplicated the implementation of randomForest's unsupervised learning and it is not giving you the same proximity matrix, am I right? If so, can you edit your post to a clarified version and maybe also add your code? – mbq Sep 04 '12 at 14:45
  • Prior to the code you cited there is `n <- nrow(x)` – O_Devinyak Sep 05 '12 at 19:46
  • @hiberbear You do not need to construct your synthetic data. Just type `rf <- randomForest(x)` – O_Devinyak Sep 05 '12 at 19:50
  • @fosgen, I saw the unsupervised method in the randomForest examples. However, what I don't understand is how the proximity is calculated. I got different proximity matrices using the default method and my customized synthetic data. Maybe one of them is better. – ccshao Sep 07 '12 at 10:42
  • @hiberb This is because you don't bootstrap your variables in class 2. This procedure is described on the page you cited and is implemented in the function _createClass_, which is located in the source file _rfutils.c_ (line 25); _createClass_ itself is called from _classRF_ (line 238, file _rf.c_). – O_Devinyak Sep 07 '12 at 16:54
  • @fosgen, thanks so much for the detailed explanation; I saw the code you mentioned. – ccshao Sep 10 '12 at 09:16
  • Try removing some features with low variable importance. The error rate will probably start dropping. – amanita kiki Nov 11 '15 at 06:23
