I am new to R and have only a basic understanding of statistics, so please excuse me if my questions are elementary. I am still learning, and your answers would help me improve my knowledge.
I have a dataset with around a million rows, which is why I am using data.table.
I want to define intervals (bins) for my continuous independent variables. The intervals should be chosen so that they are related to, and have an impact on, my dependent variable. After some searching, I came across rpart.
Below is my sample data and code:
library("data.table")
library("rpart")
#####Generating sample data
set.seed(1200)
id <- 1:100
bills <- sample(1:20,100,replace = TRUE)
nos <- sample(1:80,100,replace = TRUE)
stru <- sample(c("A","B","C","D"),100,replace = TRUE)
type <- sample(1:7,100,replace = TRUE)
value <- sample(100:1000,100,replace = TRUE)
df1 <- as.data.table(data.frame(id,bills,nos,stru,type,value))
#####Fitting a regression tree of value on nos
tree.reg <- rpart(value ~ nos, data=df1, method="anova",
                  control=rpart.control(minsplit=30,cp=0))
plot(tree.reg)
text(tree.reg)
This generates a tree, and I can read the intervals off the plot. In this example the intervals would be:
- >=60
- 48-59
- 41.5-47
- 31.5-41
- 20.5-31
- <20.5
I need help with the following points:
- How should I treat the extreme left and right intervals (value = 463.2 and value = 766.1 in the plot)?
- Is there a way to get these intervals from the tree object itself with code, without having to look at the plot?
- I wanted to limit the tree to 5-7 nodes, so I kept changing the minsplit value by hand: I started with 5 and stopped at 30. The right value would be different for other variables, say bills. How can I control this so that the tree always gives only 5-7 nodes?
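For the last two points, here is a minimal sketch of one possible approach, not a definitive recipe. It assumes a single numeric predictor, grows a deliberately large tree, prunes it back using the fitted object's `cptable` until it has at most `max_leaves` terminal nodes, and then reads the cut points from the `splits` matrix. The data is a simulated stand-in for the sample above, and `max_leaves`, `breaks` and `nos_bin` are illustrative names. Using `-Inf` and `Inf` as the outer breaks is one way to leave the extreme left and right intervals open-ended.

```r
library(rpart)

# Simulated stand-in data (a plain data.frame; a data.table should work the same way)
set.seed(1200)
df1 <- data.frame(nos   = sample(1:80, 100, replace = TRUE),
                  value = sample(100:1000, 100, replace = TRUE))

# Grow a large tree first. maxcompete/maxsurrogate = 0 keeps only primary
# splits in fit$splits; xval = 0 skips cross-validation.
fit <- rpart(value ~ nos, data = df1, method = "anova",
             control = rpart.control(minsplit = 10, cp = 0, xval = 0,
                                     maxcompete = 0, maxsurrogate = 0))

# Prune back to the largest subtree with at most max_leaves terminal nodes.
max_leaves <- 7
cpt <- fit$cptable                          # columns include CP and nsplit
ok  <- cpt[, "nsplit"] + 1 <= max_leaves    # leaves = splits + 1
pruned <- prune(fit, cp = min(cpt[ok, "CP"]))

# Cut points of the pruned tree, taken from the object instead of the plot.
breaks <- sort(unique(pruned$splits[, "index"]))

# Bin the predictor; right = FALSE mirrors rpart's "nos < cut goes left" rule.
df1$nos_bin <- cut(df1$nos, breaks = c(-Inf, breaks, Inf), right = FALSE)

# Mean of the dependent variable per interval
aggregate(value ~ nos_bin, data = df1, FUN = mean)
```

Because the limit is applied by pruning rather than by tuning `minsplit`, the same code could be reused for another variable such as bills by changing only the formula.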
Can you guide me through the tree algorithm step by step? How does it decide where to split and which group to split? My aim is to reproduce the same tree in base R code without using the rpart function.
What are the assumptions for building a tree? Does my data need to be normally distributed? Should I apply any transformation?
Please also suggest any better method for creating intervals for continuous variables while taking the dependent variable into account. The aim is to get homogeneous groups with no more than 5-7 intervals.
Any help would be appreciated!
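As a starting point for the base R version, here is a rough sketch of greedy recursive partitioning for one numeric predictor. It rests on these assumptions: the "anova" criterion is the reduction in the sum of squared deviations from the mean, candidate cut points are midpoints between adjacent distinct values, and every node with at least `minsplit` observations is split in turn. The helpers `sse`, `best_split` and `grow` are hypothetical names, and the sketch ignores rpart's cp-based pruning, tie handling and surrogate splits, so it is not guaranteed to reproduce rpart's output exactly.

```r
# Sum of squared deviations from the mean (the node "impurity")
sse <- function(y) sum((y - mean(y))^2)

# Where to split: try every midpoint and keep the one with the largest
# reduction in SSE, subject to a minimum child size.
best_split <- function(x, y, minbucket = 1) {
  ux <- sort(unique(x))
  if (length(ux) < 2) return(NULL)
  candidates <- (head(ux, -1) + tail(ux, -1)) / 2
  best <- NULL
  for (s in candidates) {
    left <- x < s
    if (sum(left) < minbucket || sum(!left) < minbucket) next
    gain <- sse(y) - sse(y[left]) - sse(y[!left])
    if (is.null(best) || gain > best$gain) best <- list(split = s, gain = gain)
  }
  best
}

# Which group to split: every node that is large enough is split in turn,
# recursively. Returns the cut points in increasing order.
grow <- function(x, y, minsplit = 30, minbucket = round(minsplit / 3),
                 depth = 0, maxdepth = 30) {
  if (length(y) < minsplit || depth >= maxdepth) return(numeric(0))
  b <- best_split(x, y, minbucket)
  if (is.null(b) || b$gain <= 0) return(numeric(0))
  left <- x < b$split
  c(grow(x[left],  y[left],  minsplit, minbucket, depth + 1, maxdepth),
    b$split,
    grow(x[!left], y[!left], minsplit, minbucket, depth + 1, maxdepth))
}

# Tiny usage example with three obvious groups
x <- 1:10
y <- c(rep(1, 3), rep(5, 3), rep(20, 4))
grow(x, y, minsplit = 2, minbucket = 1)
```

The stopping rule here is only `minsplit`, `minbucket` and `maxdepth`; a cap on the number of leaves would have to be added separately.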