
Consider the following example. Suppose I want to model x in DF2 using everything else. While it makes sense to do PCA pre-processing on variables 1 to 10, I want to keep the factor variable z as is and simply center and scale y and v. My first question is whether this approach makes sense at all.

DF <- data.frame(x = rep(c("a", "b", "c"), each = 3),
                 y = c(1, 3, 6), v = 1:9,
                 z = rep(c('m', 'n', 'l'), 3))
DF

## ten extra numeric predictors (each column is constant in this toy example)
i <- 1:10
s <- sapply(i, function(x) rep(x * 5, 9))
s

DF2 <- cbind(DF, s)
DF2
str(DF2)

If it does make sense, how can I do it efficiently? I think one approach is to use preProcess on variables 1 to 10 and recreate the data frame from the resulting principal components plus y, v, and z, but that doesn't seem very efficient.
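
Concretely, here is a sketch of the manual route I have in mind (assuming caret is loaded; note the constant columns in the toy s above would need real data for the PCA step to be meaningful):

## PCA pre-processing on the ten numeric columns only
pca_pp <- preProcess(DF2[, as.character(1:10)], method = "pca")
pcs <- predict(pca_pp, DF2[, as.character(1:10)])

## center and scale y and v; leave the factor z untouched
cs_pp <- preProcess(DF2[, c("y", "v")], method = c("center", "scale"))
yv <- predict(cs_pp, DF2[, c("y", "v")])

## reassemble the modeling data frame
DF2_new <- cbind(DF2[, c("x", "z")], yv, pcs)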

At the same time, it is often advised, such as in this post, that pre-processing be done during resampling in the train function. But I haven't figured out how to specify different pre-processing methods for different predictors.

Thanks in advance for any insight.

GABAergic
  • This doesn't get at the substance of your question much, but I think the `transform` function is probably a lot more efficient for adding variables; i.e., `DF2 …` – Russ Lenth Aug 12 '14 at 19:58

1 Answer


Here is an example using logistic regression.

You would need to adapt this for other models. For example, if the model needs the predictors to be on the same scale, then you would need to add another step in the fit and predict functions to normalize all the predictors.
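
To make that concrete, here is a rough sketch of such a step (the field name `out$cs` is made up; adapt as needed). It fits a second preProcess inside the fit function and re-applies it inside the predict function:

## hypothetical extra step inside funcs$fit, before building `dat`:
cs <- preProcess(x[, -for_pca, drop = FALSE],
                 method = c("center", "scale"))
dat <- cbind(predict(cs, x[, -for_pca, drop = FALSE]), pc)
out$cs <- cs   ## saved alongside out$pp after the glm() call

## ...and the mirror image inside funcs$predict:
dat <- cbind(predict(modelFit$cs,
                     newdata[, orig_vars, drop = FALSE]),
             pc)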

Also, you could make the number of components a tuning variable (see the other example that you mentioned on the package website).
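
One possible way to wire that up (a sketch only; `ncomp` is a made-up parameter name and the grid values are arbitrary) is to replace the model's parameters and grid elements and read param$ncomp inside the fit function:

funcs$parameters <- data.frame(parameter = "ncomp",
                               class = "numeric",
                               label = "#Components")
funcs$grid <- function(x, y, len = NULL, search = "grid")
  data.frame(ncomp = seq_len(len))

## then, inside funcs$fit, use the tuning value instead of a constant:
##   num_pc <- param$ncomp

train would then evaluate each candidate number of components during resampling.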

library(caret)

set.seed(1)
dat <- twoClassSim(200)

funcs <- getModelInfo("glm", regex = FALSE)[[1]]

funcs$fit <- function(x, y, wts, param, lev, last, classProbs, ...) {
  ## Conduct PCA on the designated columns and generate the new predictors
  for_pca <- 1:5
  num_pc <- 2
  pca <- preProcess(x[, for_pca], method = "pca", pcaComp = num_pc)
  pc <- predict(pca, x[, for_pca])

  ## glm needs a data frame and formula, so bind the data together
  ## in a data frame and attach the outcome
  dat <- cbind(x[, -for_pca, drop = FALSE], pc)
  dat <- as.data.frame(dat)
  dat$y <- y

  ## Save the model and attach the information needed
  ## to predict new samples
  out <- glm(y ~ ., data = dat, family = binomial)
  out$pp <- pca
  out$for_pca <- colnames(x)[for_pca]
  out
}

funcs$predict <- function(modelFit, newdata, submodels = NULL) {
  ## Generate the PCs, attach them, and predict
  pc <- predict(modelFit$pp, newdata[, modelFit$for_pca])
  orig_vars <- !(colnames(newdata) %in% modelFit$for_pca)
  dat <- cbind(newdata[, orig_vars, drop = FALSE], pc)
  dat <- as.data.frame(dat)
  prob <- predict(modelFit, dat, type = "response")
  ## Convert the estimated probability to a class prediction
  ifelse(prob >= .5,
         modelFit$obsLevels[2],
         modelFit$obsLevels[1])
}


set.seed(2)
mod <- train(Class ~ ., data = dat,
             method = funcs,
             trControl = trainControl(method = "cv"))

For this example:

> mod
Generalized Linear Model 

200 samples
15 predictor
2 classes: 'Class1', 'Class2' 

No pre-processing
Resampling: Cross-Validated (10 fold) 

Summary of sample sizes: 180, 180, 180, 180, 180, 180, ... 

Resampling results

Accuracy  Kappa  Accuracy SD  Kappa SD
0.665     0.321  0.116        0.232 

> coef(mod$finalModel)
(Intercept)    Linear04    Linear05    Linear06    Linear07
-0.22918645 -0.86777592  0.22813460 -0.59662663  0.52737593
   Linear08    Linear09    Linear10  Nonlinear1  Nonlinear2
-0.21819921  0.50468429 -0.14011715  0.57582282 -0.18439884
 Nonlinear3         PC1         PC2
 0.04742595  0.02288815  0.14073538

Max

topepo