Difference between weights and prior in rpart and how to use them

Question

I have a question about the "weights" and "prior" in R's rpart function. This question has been asked before here, but the answer doesn't quite make sense.

Currently I have very unbalanced data where the target is only 0.0066% of the whole dataset, which has over 2 million rows. I want to know if either the "weights" or the "prior" can help me with this biased dataset, and how they would be used.

I tried oversampling the target and downsampling the noise and then producing an ensemble of my predictions, but I did not achieve the desired result.

score 5 · Answer 1 · answered May 23 '16 at 20:56

I see two questions here.

1) What is the difference between `weights` and `parms` in `rpart`?

If you look at the code, the `weights` argument is passed to `model.frame`, so it is applied to each observation of your dataset, just as in `lm`.

if (is.data.frame(model)) {
    m <- model  ## <---- m is defined here
    model <- FALSE
}
else {
    indx <- match(c("formula", "data", "weights", "subset"), 
        names(Call), nomatch = 0L)
    if (indx[1] == 0L) 
        stop("a 'formula' argument is required")
    temp <- Call[c(1L, indx)]
    temp$na.action <- na.action
    temp[[1L]] <- quote(stats::model.frame)  ## <---- passed to model.frame
    m <- eval.parent(temp)
}
Terms <- attr(m, "terms")
if (any(attr(Terms, "order") > 1L)) 
    stop("Trees cannot handle interaction terms")
Y <- model.response(m)
wt <- model.weights(m)  ## <---- used as observation weights

On the other hand, `parms` carries class-level settings such as the prior probabilities, which is how you deal with unbalanced class sizes. I believe this is what you are looking for.
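To make the distinction concrete, here is a minimal sketch using the `kyphosis` data that ships with rpart. The dataset, the weight of 4, and the 50/50 prior are my own illustration values, not recommendations for your data.

```r
library(rpart)

# kyphosis ships with rpart; Kyphosis is a factor with levels "absent", "present".
df <- kyphosis

# Observation weights: one number per row (here the rarer "present" rows count 4x).
# The value 4 is an arbitrary assumption for illustration.
df$w <- ifelse(df$Kyphosis == "present", 4, 1)
fit_weights <- rpart(Kyphosis ~ Age + Number + Start, data = df, weights = w)

# Class priors: one number per class, in the order of levels(df$Kyphosis).
# The 50/50 split is likewise an arbitrary assumption.
fit_prior <- rpart(Kyphosis ~ Age + Number + Start, data = df,
                   parms = list(prior = c(0.5, 0.5)))

length(df$w)            # one weight per observation
fit_prior$parms$prior   # one prior per class
```

The point is the shape of the input: `weights` has one entry per row, while `prior` has one entry per class.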

2) How to use the `parms` argument?

If you look at the description of parms:

For classification splitting, the list can contain any of: the vector of prior probabilities (component prior), ...

Hence, store your prior probability vector in a list under the name `prior`. The probabilities must appear in exactly the same order as the output of `levels(data$y)`, where `y` is your response variable, and they must sum to 1. For example, you might want to try something like the following:

fit <- rpart(y ~ x1 + x2 + x3, data = data, parms = list(prior = c(0.000066, 1 - 0.000066)))
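Before choosing values, it may help to check the level order and the observed class shares. As far as I can tell from the rpart documentation, the default priors are proportional to the class counts, so passing the observed frequencies should simply reproduce the default fit; to make the rare class matter more, you would give it a larger prior than its observed share. A minimal sketch on the bundled `kyphosis` data (the 50/50 prior is an assumption for illustration):

```r
library(rpart)

# The prior vector must follow the factor levels of the response.
levels(kyphosis$Kyphosis)

# Observed class shares, in level order.
obs <- as.numeric(prop.table(table(kyphosis$Kyphosis)))

# With no parms supplied, the stored prior should match the observed shares.
fit_default <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)
all.equal(as.numeric(fit_default$parms$prior), obs)

# A prior that gives the rarer class more mass than its observed share
# (0.5/0.5 is an arbitrary choice, not a recommendation).
fit_balanced <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
                      parms = list(prior = c(0.5, 0.5)))

# Compare how often each class is predicted under the two settings.
table(predict(fit_default,  type = "class"))
table(predict(fit_balanced, type = "class"))
```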

How do I calculate the prior values for more classes, e.g. 3 classes? @mark, I can't write comments, and at the moment I'm not able to finish the registration to ask a new question: after clicking the confirmation link in the email and setting my password, it tells me that I'm already registered. Maybe one of the mods can make this a comment? — laz, Jun 29 '17 at 13:29
@guest12345 You could just write 3 probabilities to the prior? e.g., `parms = list(prior = c(0.2, 0.3, 0.5))` — Boxuan, May 27 '18 at 05:36
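For the three-class case, a small sketch on the built-in `iris` data might look like this. The dataset and the values 0.2/0.3/0.5 are only illustrative; the one firm requirement is one probability per level, in level order.

```r
library(rpart)

# Check the level order first; the prior is matched to it by position.
levels(iris$Species)

# One probability per class; values are illustrative and sum to 1.
prior3 <- c(0.2, 0.3, 0.5)
stopifnot(length(prior3) == nlevels(iris$Species))

fit3 <- rpart(Species ~ ., data = iris, parms = list(prior = prior3))
fit3$parms$prior   # the prior actually stored with the fit
```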
