Discretization to create intervals for continuous variables

Question

I am new to R and have only a basic understanding of statistics, so please excuse me if my questions are elementary. Your answers will help me learn.

My data has around a million rows, which is why I am using data.table.

I want to define intervals for my continuous independent variables. The intervals should be chosen so that they relate to, and have an impact on, my dependent variable. After some searching I came across rpart.

Below is my sample data and code:

library("data.table")
library("rpart")  # provides rpart() and rpart.control()

##### Generating sample data

set.seed(1200)
id    <- 1:100
bills <- sample(1:20, 100, replace = TRUE)
nos   <- sample(1:80, 100, replace = TRUE)
stru  <- sample(c("A", "B", "C", "D"), 100, replace = TRUE)
type  <- sample(1:7, 100, replace = TRUE)
value <- sample(100:1000, 100, replace = TRUE)

df1 <- as.data.table(data.frame(id, bills, nos, stru, type, value))

# Regression tree of value on nos; cp = 0 switches off the complexity-based stopping rule
tree.reg <- rpart(value ~ nos, data = df1, method = "anova",
                  control = rpart.control(minsplit = 30, cp = 0))

plot(tree.reg)
text(tree.reg)

This generates a tree, and I can look at the plot and read off the intervals. In this example the intervals would be:

>=60
48-59
41.5-47
31.5-41
20.5-31
<20.5
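
The break points do not have to be read off the plot. Below is a minimal sketch, assuming a single numeric predictor and using a plain data.frame for brevity. The split thresholds are stored in the "index" column of the fitted object's splits matrix, and cut() with -Inf and Inf as outer breaks leaves the two extreme intervals open-ended. The column name nos_bin is made up for illustration, and the exact break points depend on the random number generator of the R version in use.

```r
library(rpart)

set.seed(1200)
nos   <- sample(1:80, 100, replace = TRUE)
value <- sample(100:1000, 100, replace = TRUE)
df1   <- data.frame(nos, value)

# maxcompete = 0 and maxsurrogate = 0 keep only primary splits in $splits
tree.reg <- rpart(value ~ nos, data = df1, method = "anova",
                  control = rpart.control(minsplit = 30, cp = 0,
                                          maxcompete = 0, maxsurrogate = 0))

# Split thresholds chosen by the tree (NULL $splits means the root was never split)
breaks <- if (is.null(tree.reg$splits)) numeric(0) else
  sort(unique(tree.reg$splits[, "index"]))

# Open-ended outer intervals; [a, b) matches the "x < threshold" form of the splits
df1$nos_bin <- cut(df1$nos, breaks = c(-Inf, breaks, Inf), right = FALSE)

print(breaks)
print(table(df1$nos_bin))
```

Each resulting bin should correspond to exactly one leaf of the tree, which can be checked against tree.reg$where.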

In this case, I need help with the following points:

1. How should the extreme left and right intervals be treated (value = 463.2 and value = 766.1)?
2. Is there a way to get these intervals from the tree itself using code, without having to look at the plot?
3. I wanted to limit the tree to 5-7 nodes, and to do that I kept changing the minsplit value manually, starting at 5 and stopping at 30. The right value would be different for other variables, say bills. How can this be controlled so that the tree gives only 5-7 nodes?
4. Can you guide me through the step-by-step method the tree algorithm uses to create the tree? How does it decide "where to split" and "which group to split"? The aim is to reproduce the same tree in base R code without using the rpart function.
5. What are the assumptions for building a tree? Does my data need to be normally distributed? Should I apply any transformation?
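
For the "where to split" point, the search at a single node can be sketched in base R. This is only an illustration of the usual regression-tree criterion (pick the threshold that most reduces the within-group sum of squares), not rpart's actual implementation. The function name best_split and its minbucket argument are hypothetical.

```r
# Hypothetical helper: best single split of y on a numeric x,
# chosen by the largest reduction in the sum of squared errors (SSE).
best_split <- function(x, y, minbucket = 1) {
  sse <- function(v) sum((v - mean(v))^2)
  ux  <- sort(unique(x))
  if (length(ux) < 2) return(NULL)            # nothing to split on
  cand <- (head(ux, -1) + tail(ux, -1)) / 2   # midpoints between neighbouring values
  gain <- vapply(cand, function(s) {
    left  <- y[x <  s]
    right <- y[x >= s]
    if (length(left) < minbucket || length(right) < minbucket) return(NA_real_)
    sse(y) - (sse(left) + sse(right))         # improvement from splitting at s
  }, numeric(1))
  if (all(is.na(gain))) return(NULL)          # no admissible split
  best <- which.max(gain)                     # which.max() skips NA entries
  list(split = cand[best], improve = gain[best])
}

# Toy data with an obvious jump between x = 3 and x = 10
x <- c(1, 2, 3, 10, 11, 12)
y <- c(5, 6, 5, 20, 21, 19)
res <- best_split(x, y)
print(res$split)   # 6.5 for this toy data
```

A full tree would apply the same search recursively to the observations on each side of the chosen split until a stopping rule (for example a minimum node size) is hit.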

Please also suggest any better method for creating intervals for continuous variables that takes the dependent variable into account. The aim is to have homogeneous groups and no more than 5-7 intervals.
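
One way to cap the number of intervals without tuning minsplit by hand might be to grow a larger tree and prune it back using its complexity table. This is a sketch under assumptions: max_leaves is a made-up name, a plain data.frame is used for brevity, and the pruned tree has at most (not necessarily exactly) that many leaves, because the table can skip tree sizes.

```r
library(rpart)

set.seed(1200)
nos   <- sample(1:80, 100, replace = TRUE)
value <- sample(100:1000, 100, replace = TRUE)
df1   <- data.frame(nos, value)

# Grow a deliberately large tree (xval = 0 skips cross-validation)
big <- rpart(value ~ nos, data = df1, method = "anova",
             control = rpart.control(minsplit = 10, cp = 0, xval = 0))

max_leaves <- 7                    # hypothetical cap on the number of intervals
cpt <- big$cptable                 # one row per nested subtree: CP, nsplit, rel error

# Largest subtree whose number of splits is at most max_leaves - 1
ok      <- cpt[cpt[, "nsplit"] <= max_leaves - 1, , drop = FALSE]
cp_pick <- ok[nrow(ok), "CP"]

small    <- prune(big, cp = cp_pick)
n_leaves <- sum(small$frame$var == "<leaf>")
print(n_leaves)
```

The same steps could be repeated per variable (say bills), so the cap stays fixed while the data decide where the cuts fall.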

Please help!

[Don't bin your continuous data at all.](https://stats.stackexchange.com/q/68834/1352) Feed them into your algorithm as-is; potentially transform them using (e.g.) restricted cubic splines (see, e.g., Frank Harrell's Regression Modeling Strategies) to capture any nonlinearity. - Stephan Kolassa, Oct 25 '17 at 18:50
@StephanKolassa, thank you for your suggestion. Once I have created the intervals, I want to use them to calibrate my factors by looking at the difference in proportions between my universe and my sample. There will be cases where the same continuous value does not appear in both the sample and the universe, hence the need for intervals that cover all possible values. I hope this makes sense. Please suggest a way to create the intervals in this situation. - user1412, Oct 26 '17 at 05:48
