
UPDATE: caret now uses foreach internally, so this question is no longer really relevant. If you can register a working parallel backend for foreach, caret will use it.
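For example, on Windows the doParallel package provides a backend that works out of the box. A minimal sketch (the core count and the glmboost model are placeholders; adjust them to your machine and data, and note that glmboost requires the mboost package):

```r
library(caret)
library(doParallel)
library(mlbench)   # provides the BostonHousing data used below

data(BostonHousing)

# Start a cluster and register it as the foreach parallel backend
cl <- makeCluster(detectCores())
registerDoParallel(cl)

# train() now distributes the resampling iterations across the workers
set.seed(1)
fit <- train(medv ~ ., data = BostonHousing, method = "glmboost")

stopCluster(cl)
```

Once a backend is registered, no further changes to the `train()` call are needed.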


I have the caret package for R, and I'm interested in using the train function to cross-validate my models. However, I want to speed things up, and it seems that caret provides support for parallel processing. What is the best way to access this feature on a Windows machine? I have the doSMP package, but I can't figure out how to translate the foreach function into an lapply-style function that I can pass to the train function.

Here is an example from the train documentation. This is exactly what I want to do, but using the doSMP package rather than the doMPI package.

## A function to emulate lapply in parallel
mpiCalcs <- function(X, FUN, ...)
{
    theDots <- list(...)
    parLapply(theDots$cl, X, FUN)
}

library(snow)
cl <- makeCluster(5, "MPI")

## 50 bootstrap models distributed across 5 workers
mpiControl <- trainControl(workers = 5,
    number = 50,
    computeFunction = mpiCalcs,
    computeArgs = list(cl = cl))

set.seed(1)
usingMPI <- train(medv ~ .,
    data = BostonHousing,
    "glmboost",
    trControl = mpiControl)

Here's a version of mbq's function that uses the same variable names as the lapply documentation:

felapply <- function(X, FUN, ...) {
    foreach(i=X) %dopar% {
        FUN(i, ...)
    }       
}

x <- felapply(seq(1, 10), sqrt)
y <- lapply(seq(1, 10), sqrt)
all.equal(x, y)  # TRUE: foreach returns a list, just like lapply

2 Answers


Try

computeFunction = function(onWhat, what, ...) {
    foreach(i = onWhat) %dopar% what(i, ...)
}
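With the doSMP backend from the question, the full setup would look roughly like this. This is an untested sketch: doSMP has since been withdrawn from CRAN in favour of doParallel, and `startWorkers()`, `registerDoSMP()`, and `stopWorkers()` are its historical API.

```r
library(caret)
library(foreach)
library(doSMP)

# Start SMP workers and register them as the foreach backend
w <- startWorkers(workerCount = 4)
registerDoSMP(w)

# Plug the foreach-based compute function into trainControl
smpControl <- trainControl(workers = 4,
    number = 50,
    computeFunction = function(onWhat, what, ...) {
        foreach(i = onWhat) %dopar% what(i, ...)
    })

# ... pass smpControl as trControl to train(), then shut down:
stopWorkers(w)
```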

caret already does this internally for you as part of the train() function; see the bottom section of the caret webpage for starters.

  • The default function used by train is lapply. If you want to parallelize train, you need a parallel function that mimics lapply, such as multicore:::mclapply. At least, that's the way I understand things. – Zach Jun 19 '11 at 15:56
  • @Zach, +1 for this question, I wonder is there any update of how one can do parallel processing with `caret::train()` for `Windows`, most of the examples of `APM` book are computationally expensive, at least for me 3GB RAM, 2.1GHz, dual core, 32bit Win. Had I known this issue before, I would change to `Linux`, but it is too late for me now to do such a thing. Do you know any idea of how to combat this issue in windows? if the answer by `mbq` is still active, can you pls just show in code using a concrete example of any model with moderate data size of how to implement the `computeFunction`? – doctorate Jan 07 '14 at 09:49
  • @doctorate caret has been updated to use the `foreach` package internally, which works with any parallel backend you can register. Take a look at the doParallel package. Once you register a backend, caret will automatically use it. Also note that, on Windows, each core needs its own copy of RAM, so if you register 4 cores, you need 4x as much RAM. – Zach Jan 11 '14 at 15:30
  • @Zach, thanks indeed, I tried it and it worked. I know also that you contributed to `caret`, can you pls take a look at this question, I would be very grateful. http://stats.stackexchange.com/questions/81962/how-to-plug-in-pls-method-in-preprocess-function-of-caret-in-r-to-perform-pls – doctorate Jan 11 '14 at 22:33