Let's start with a description of the website-visit data I am analysing:
- 6M rows
- The dependent variable `quotation` is binary, taking the values `0` and `1`; only 1% of rows have the value `1`
- The other 3 variables are `temperature`, `humidity` and `minute` of the day
The objective is to identify quotation trends based on the weather in order to optimize communication campaigns, not to determine whether a given visit will lead to a quotation.
To avoid overfitting problems due to the large dataset, I decided to cross-validate my tree models to determine the right one.
My questions:
Because the probability of `quotation = 1` is so low, even the best leaf node only reaches 5% of `1`s on the training sample. Therefore, if I run `predict()` on my testing sample, I get only `0` for all nodes.
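To make the problem concrete, here is a minimal sketch on simulated data. Everything in it is an assumption for illustration rather than my real data: the sample size, the 30-degree threshold and the two quotation rates are invented so that the overall rate is about 1% and the best segment sits near 5%. It only uses `ctree()` and `predict()` from `party`.

```r
# Illustrative sketch only: simulated data, not the real 6M-row dataset.
library(party)

set.seed(1)
n <- 20000  # assumed sample size, kept small so the example runs quickly

visits <- data.frame(
  temperature = runif(n, -5, 35),
  humidity    = runif(n, 20, 100),
  minute      = sample(0:1439, n, replace = TRUE)
)

# Assumed pattern: about 5% quotations on hot visits, about 0.4% otherwise,
# which gives roughly 1% of 1s overall.
p <- ifelse(visits$temperature > 30, 0.05, 0.004)
visits$quotation <- factor(rbinom(n, 1, p), levels = c(0, 1))

# Simple train/test split
train_idx <- sample(n, n / 2)
train <- visits[train_idx, ]
test  <- visits[-train_idx, ]

fit <- ctree(quotation ~ temperature + humidity + minute, data = train)

# The default prediction is the majority class of the leaf an observation
# falls into, so a leaf with only ~5% of 1s is still predicted as 0.
pred_class <- predict(fit, newdata = test)
table(pred_class)
```

With a setup like this, the table is dominated by `0`, which is the behaviour I see on my real data.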
- Is there a way with the `party` package to assign the corresponding node to each observation of the testing sample?
- Is that the right method to evaluate my different models, since `predict()` doesn't seem to work for me (`0` for all observations)?
I went there, but every suggestion is based on `predict`, which I feel is of no help in my case...