I'm new to statistics and my English is not great, but I will try my best to explain my problem.
- First, I have a dependent variable (y) and 5 independent variables (x1, x2, x3, x4, x5)
- x1 and x2 are categorical. x3, x4, and x5 are continuous (integer-valued).
- Focus on x5, its histogram looks like this:
- Then I categorize x5 into two groups, using 5 as the cutpoint, and call this new categorical variable x5_cat
- Next, I use x1, x2, x3, x4, x5 and an interaction between x5 and x5_cat as independent variables in a logistic regression to predict y
- The R code looks like this: glm(y ~ x1 + x2 + x3 + x4 + x5 + x5:x5_cat, data = train, family = 'binomial')
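
The steps in the list above can be sketched end to end as follows. This is only an illustration under assumptions: the data frame is simulated (the factor levels, the distributions, and the placeholder outcome are all made up), and treating x5 = 5 itself as part of the lower group is a guess, since the post only says that 5 is the cutpoint.

```r
# Simulated stand-in for the real `train` data; every value here is hypothetical.
set.seed(1)
n <- 500
train <- data.frame(
  x1 = factor(sample(c("a", "b"), n, replace = TRUE)),
  x2 = factor(sample(c("u", "v", "w"), n, replace = TRUE)),
  x3 = rpois(n, 3),
  x4 = rpois(n, 10),
  x5 = sample(0:10, n, replace = TRUE)
)
train$y <- rbinom(n, 1, 0.5)  # placeholder outcome, unrelated to the predictors

# Split x5 at the cutpoint 5 (assumption: x5 <= 5 is "low", x5 > 5 is "high").
train$x5_cat <- factor(ifelse(train$x5 <= 5, "low", "high"),
                       levels = c("low", "high"))

# Same model formula as in the post.
fit <- glm(y ~ x1 + x2 + x3 + x4 + x5 + x5:x5_cat,
           data = train, family = "binomial")
summary(fit)
```

With this coding, R reports one coefficient for x5 and one for the interaction, named `x5:x5_cathigh`.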
Because of my limited knowledge of statistics, I can't explain why my solution is wrong. Can anyone help me?
Edit:
The cutpoint splits x5 into 2 categories, which turns one nonlinear distribution into 2 nearly linear ones, one with a positive slope and the other with a negative slope. I think the coefficient of the interaction term can tell me whether the effect of x5 on y differs between values of x5 below the cutpoint and values above it.
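
To make that reading of the interaction term concrete, here is a small sketch on simulated data. Everything in it is an assumption for illustration: the slopes on the log-odds scale (0.4 below the cutpoint, -0.2 above it), the intercept, and the "low"/"high" coding. The other predictors are left out to keep it short.

```r
# Hypothetical data with a different x5 slope on each side of the cutpoint.
set.seed(2)
n <- 2000
x5 <- sample(0:10, n, replace = TRUE)
x5_cat <- factor(ifelse(x5 <= 5, "low", "high"), levels = c("low", "high"))

# Assumed true log-odds: slope 0.4 when "low", 0.4 - 0.6 = -0.2 when "high".
eta <- -1 + 0.4 * x5 - 0.6 * x5 * (x5_cat == "high")
y <- rbinom(n, 1, plogis(eta))

fit <- glm(y ~ x5 + x5:x5_cat, family = "binomial")
b <- coef(fit)

# Slope of x5 (log-odds scale) in each group under this parameterization.
slope_low  <- b[["x5"]]
slope_high <- b[["x5"]] + b[["x5:x5_cathigh"]]

# The interaction coefficient is the estimated difference between the two slopes;
# its row in the summary table carries the Wald test for that difference.
summary(fit)$coefficients["x5:x5_cathigh", ]
```

Note that this formula has no main effect for x5_cat, so both groups share a single intercept at x5 = 0.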