
I have a question about the "weights" and "prior" in R's rpart function. This question has been asked before here, but the answer doesn't quite make sense.

Currently I have very unbalanced data where the target is only 0.0066% of the whole dataset, which has over 2 million rows. I want to know if either the "weights" or the "prior" can help me with this biased dataset, and how they would be used.

I tried oversampling the target and downsampling the noise and then producing an ensemble of my predictions, but I did not achieve the desired result.

Gavin M. Jones
Jason

1 Answer


I see two questions here.

1) What is the difference between weights and parms in rpart?

If you look at the code, the weights argument is passed to stats::model.frame, so it is applied to each observation of your dataset, just like in lm.

if (is.data.frame(model)) {
    m <- model  ## <---- m is defined here
    model <- FALSE
}
else {
    indx <- match(c("formula", "data", "weights", "subset"), 
        names(Call), nomatch = 0L)
    if (indx[1] == 0L) 
        stop("a 'formula' argument is required")
    temp <- Call[c(1L, indx)]
    temp$na.action <- na.action
    temp[[1L]] <- quote(stats::model.frame)  ## <---- passed to model.frame
    m <- eval.parent(temp)
}
Terms <- attr(m, "terms")
if (any(attr(Terms, "order") > 1L)) 
    stop("Trees cannot handle interaction terms")
Y <- model.response(m)
wt <- model.weights(m)  ## <---- used as observation weights

On the other hand, parms carries class-level settings such as the prior probabilities, which is how unbalanced class sizes are handled. I believe this is what you are looking for.
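To make the per-observation nature of weights concrete, here is a minimal sketch on synthetic data. The data frame d, its columns, and the weighting rule (each rare-class row counts as much as the ratio of common to rare rows) are my own assumptions for illustration, not something from your dataset.

```r
library(rpart)

set.seed(1)

# Hypothetical data: a rare "yes" class (roughly 2% of rows).
n <- 5000
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- factor(ifelse(d$x1 + rnorm(n) > 2.9, "yes", "no"),
              levels = c("no", "yes"))

# One weight per row: rare-class rows are up-weighted so that both
# classes carry the same total weight (an assumed, illustrative rule).
n_no  <- sum(d$y == "no")
n_yes <- sum(d$y == "yes")
d$w <- ifelse(d$y == "yes", n_no / n_yes, 1)

# weights is looked up in `data`, exactly like in lm().
fit_w <- rpart(y ~ x1 + x2, data = d, weights = w, method = "class")

# Confusion table on the training data, just to inspect the effect.
table(predicted = predict(fit_w, d, type = "class"), actual = d$y)
```

The point is only that weights has one entry per row, whereas the prior in parms has one entry per class.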

2) How to use the parms argument?

If you look at the description of parms:

For classification splitting, the list can contain any of: the vector of prior probabilities (component prior), ...

Hence, you want to store your prior probability vector in a list under the name "prior". The order of the probabilities should exactly match the output of levels(data$y), where y is your response variable. For example, you might want to try something like the following:

fit <- rpart(y ~ x1 + x2 + x3, data = data, parms = list(prior = c(0.000066, 1 - 0.000066)))
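One caveat worth checking before picking the numbers: according to the rpart documentation, the default priors are proportional to the observed class counts, so passing the observed frequencies should reproduce the default fit. To push the tree towards the rare class you would typically choose something more balanced. The sketch below uses synthetic data and equal priors; the data, the class names "noise" and "target", and the 0.5/0.5 choice are all assumptions for illustration.

```r
library(rpart)

set.seed(42)

# Hypothetical data with a rare "target" class (roughly 2% of rows).
n <- 5000
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- factor(ifelse(d$x1 + rnorm(n) > 2.9, "target", "noise"))

# The prior vector must follow this order.
levels(d$y)

# Observed class frequencies, i.e. what rpart uses when no prior is given.
prop.table(table(d$y))

# Assumed choice: equal priors, named only so the order can be checked.
prior <- c(noise = 0.5, target = 0.5)
stopifnot(identical(names(prior), levels(d$y)))

fit_prior <- rpart(y ~ x1 + x2, data = d, method = "class",
                   parms = list(prior = unname(prior)))

# Confusion table on the training data, just to inspect the effect.
table(predicted = predict(fit_prior, d, type = "class"), actual = d$y)
```

Equal priors are only one option; any vector that is positive, sums to 1, and follows the level order should be accepted.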
Boxuan
  • how to calculate the prior values for more, i.e. example 3 classes ? @mark, i cant write comments and at the moment i'm not able to finish the registration to ask a new question, after clicking the confirmation link in email and setting my password it tells me that i'm already registered. so maybe one of the mods can make this a comment? – laz Jun 29 '17 at 13:29
  • @guest12345 You could just write 3 probabilities to the prior? e.g., `parms = list(prior = c(0.2, 0.3, 0.5))` – Boxuan May 27 '18 at 05:36
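
For the three-class case raised in the comments, a minimal sketch using the built-in iris data (my choice of dataset, and the prior values are the arbitrary ones from the comment) could look like this. The three probabilities follow the order of levels(iris$Species) and sum to 1.

```r
library(rpart)

# Order of the classes; the prior vector below follows it.
levels(iris$Species)

# Arbitrary illustrative priors for the three classes.
prior3 <- c(0.2, 0.3, 0.5)

fit3 <- rpart(Species ~ ., data = iris, method = "class",
              parms = list(prior = prior3))

# Confusion table on the training data.
table(predicted = predict(fit3, iris, type = "class"),
      actual = iris$Species)
```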