I am trying to use random forest to classify microarray data. Based on the work of L. Breiman and Tao Shi, I constructed a synthetic data set using a bootstrap method (assuming a matrix with samples in rows and genes in columns, each value of each gene in the synthetic data is drawn with replacement from that gene's column in the original data, as described at http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#unsup ).
The original data is labeled as class 1 and the synthetic data as class 2. The combined data is then passed to random forest as supervised data. The weird thing is that I get an error rate of 100%. In my view, the worst case should be an error rate of 50%, which would correspond to randomly partitioning the combined data set. I have tried this approach many times, and it always gives an error rate of nearly 100%.
In practice, my data set is a 7 x 297 matrix: 7 samples and 297 genes.
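For concreteness, here is a minimal sketch of how I build the combined data, using base R only. The names `expr` and `make_synthetic` are hypothetical, and the random matrix is just a stand-in for my real 7 x 297 expression matrix, so this shows the construction rather than my exact script.

```r
# Sketch only: `expr` and `make_synthetic` are hypothetical names, and the
# random matrix stands in for the real 7 x 297 expression data.
set.seed(1)
expr <- matrix(rnorm(7 * 297), nrow = 7, ncol = 297,
               dimnames = list(NULL, paste0("gene", 1:297)))

# Resample each gene (column) independently with replacement. This keeps each
# gene's marginal distribution but breaks the dependence between genes.
make_synthetic <- function(x) {
  out <- apply(x, 2, function(col) sample(col, length(col), replace = TRUE))
  dimnames(out) <- dimnames(x)
  out
}

synth <- make_synthetic(expr)

# Stack the original data (class 1) on top of the synthetic data (class 2).
combined <- rbind(expr, synth)
labels <- factor(rep(1:2, each = nrow(expr)))
```

`combined` and `labels` are then what I pass as `x` and `y` to `randomForest`, in the same way as in the supervised iris call further down.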
In addition, I have trouble understanding how unsupervised mode works in the R "randomForest" package, and how it calculates the proximity. In the source code, I found this:
    if (!is.null(y)) {
        if (length(y) != n)
            stop("length of response must be the same as predictors")
        addclass <- FALSE
    }
    else {
        if (!addclass)
            addclass <- TRUE
        y <- factor(c(rep(1, n), rep(2, n)))
        x <- rbind(x, x)
    }
It seems that the two-class data set is built by simply stacking the raw data on itself. However, when I do the same thing by hand, I cannot reproduce the result of the unsupervised call:
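    library(randomForest)

    # Unsupervised mode: no response is supplied
    iris.urf <- randomForest(iris[, -5])
    iris.urf$proximity[1:5, 1:5]
    MDSplot(iris.urf, iris$Species)

    # My assumption done by hand: stack the data on itself, label the halves 1 and 2
    xx.1 <- randomForest(as.matrix(rbind(iris[, -5], iris[, -5])),
                         factor(rep(1:2, each = 150)), proximity = TRUE)
    xx.1$proximity[1:5, 1:5]
    # Keep only the proximities among the first 150 (original) rows
    xx.2 <- xx.1$proximity
    xx.1$proximity <- xx.2[1:150, 1:150]
    MDSplot(xx.1, iris$Species)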
The two proximity matrices are very different, but the MDS plots show a similar pattern.
Could someone help me with these two questions?