Feature engineering: combine a categorical feature and a continuous feature

Question

When we analyze data, we often observe several variables that share information. For example, there can be a binary variable such as $X_1$ = "Have you ever smoked?", followed by a follow-up question (in this case a continuous variable) such as "How old were you when you first smoked?"

Let $X_1$ indicate whether the person has ever smoked, and let $X_2$ measure the age at which they first smoked:

$$X_1 = \begin{cases} x_1 = 0, & \text{if never smoked} \\ x_1 = 1, & \text{if smoked} \end{cases}$$

$$X_2 = \begin{cases} x_2 = 0, & \text{if } x_1 = 0 \\ x_2 \ge 0, & \text{if } x_1 = 1 \end{cases}$$

So the distribution of $X_2$ has a spike at zero: it contains many zeros, because its value depends on the previous question ($X_1$).

One way to deal with this type of problem is to analyze the age at first smoke ($X_2$) only for smokers (i.e., eliminate the zeros). The drawback is that this reduces the sample size for the $X_2$ variable.

Another way to model $X_2$ is to convert it to a categorical variable. For example:

$$X_2^{\text{categorized}} = \begin{cases} \text{"Never Smoke"}, & X_2 = 0 \\ \text{"Young"}, & 0 < X_2 \le 15 \\ \text{"Middle"}, & 15 < X_2 \le 20 \\ \text{"Old"}, & X_2 > 20 \end{cases}$$
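The binning above can be sketched in a few lines of NumPy. This is only an illustration: the function name `categorize_age_first_smoked` is made up for this example, and the sample ages are invented, not taken from the data in the question.

```python
import numpy as np

def categorize_age_first_smoked(x2):
    """Map age at first smoke to the four bins above (0 means never smoked)."""
    x2 = np.asarray(x2, dtype=float)
    labels = np.array(["Never Smoke", "Young", "Middle", "Old"])
    # right=True gives right-closed bins: x <= 0, (0, 15], (15, 20], (20, inf)
    idx = np.digitize(x2, bins=[0, 15, 20], right=True)
    return labels[idx]

# Invented example ages; 0 encodes "never smoked"
ages = [0, 12, 15, 16, 20, 35]
print(categorize_age_first_smoked(ages))
```

The cut points 15 and 20 are the ones from the question; the boundary values 15 and 20 fall into the lower bin, matching the `<=` in the definition.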

But is there a way to model $X_2$ while preserving its continuous nature, using a mixture distribution? By mixture distribution I mean something like the product of $X_2$ and $X_1$. However, I am not sure how to do this.

Since $X_1$ is binary in this case, taking the product of $X_2$ and $X_1$ seems to make sense. But I am not sure how this would work in general, i.e., when $X_1$ has more than two categories.
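A note on the product idea: with the coding above, $X_1 X_2$ equals $X_2$ in every row (because $X_2 = 0$ whenever $X_1 = 0$), so the product alone adds no information. One common way to handle such nested predictors, sketched here as an assumption and not as a mixture distribution, is to keep the indicator and add a second column that equals $X_2$ centered on the smokers' mean for smokers and 0 otherwise. The same pattern extends to more than two categories with one indicator and one centered column per non-reference level. The function names `nested_design` and `nested_design_multi` and the example values are hypothetical.

```python
import numpy as np

def nested_design(x1, x2):
    """Two predictor columns: the indicator x1, and x2 centered among smokers (0 for never-smokers)."""
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    smokers = x1 == 1
    center = x2[smokers].mean()  # mean age at first smoke, smokers only
    centered = np.where(smokers, x2 - center, 0.0)
    return np.column_stack([x1, centered])

def nested_design_multi(group, x, reference):
    """Multi-category version: an indicator and a centered column per non-reference level."""
    group = np.asarray(group)
    x = np.asarray(x, dtype=float)
    levels = [g for g in np.unique(group) if g != reference]
    cols = []
    for level in levels:
        mask = group == level
        cols.append(mask.astype(float))                       # level indicator
        cols.append(np.where(mask, x - x[mask].mean(), 0.0))  # within-level centered x
    return levels, np.column_stack(cols)

# Invented toy data: one never-smoker and two smokers
print(nested_design([0, 1, 1], [0, 10, 20]))
```

In a regression, the coefficient on the indicator then compares never-smokers with smokers at the average starting age, and the coefficient on the centered column is the effect of starting age among smokers. Rows in the reference level are all zeros, so the arbitrary 0 stored in $X_2$ for never-smokers no longer acts as a real age.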

Any help would be great

Why do you need this combined into a single variable? It seems to me that if you did, you'd need to account for the fact that the relationship might differ for the never smokers, which would get you back to having multiple variables. — gung - Reinstate Monica, Mar 16 '20 at 00:44
@gung-ReinstateMonica I may need to combine them because there can be situations where the number of zeros in X is very high. In such situations, my opinion is that dealing only with X can be misleading. — student_R123, Mar 16 '20 at 00:52
@gung-ReinstateMonica In this case there are also about 150 zeros for the X variable. — student_R123, Mar 16 '20 at 00:53
I'm not sure I see the problem with that. Are these the response variable[s], or are these predictor variables? — gung - Reinstate Monica, Mar 16 '20 at 04:26
@gung-ReinstateMonica All are predictors. I changed the notation in the question. I apologize for any confusion. — student_R123, Mar 16 '20 at 13:55
This might be an answer, maybe even a duplicate: https://stats.stackexchange.com/questions/372257/how-do-you-deal-with-nested-variables-in-a-regression-model/372258#372258 — kjetil b halvorsen, Oct 07 '20 at 15:49
@student_R123 - One of my favorite papers is "Zero-Inflated Poisson Regression ..." by Diane Lambert. https://www.jstor.org/stable/1269547 You can make a zero-inflated Gaussian mixture model to fit your data. GMM is the probit-domain analog of piecewise linear fits. (I'm sure there is a probit equivalent of a piecewise cubic Hermite polynomial, but I don't know what it is.) — EngrStudent, Oct 07 '20 at 15:57

score 0 · Answer 1 · answered Mar 20 '20 at 02:11

This is probably just a hack that does not solve this kind of problem in general, but it may be well suited to your problem: a person who does not smoke is equivalent to a person who starts smoking at the age of infinity. Hence, if you transform $X_2$ into $X_2' = 1/X_2$, a person who never smoked should get the value $0 = 1/\infty$ (so the coded zeros are set to 0 directly rather than computed as $1/0$), while everyone else just gets $1/X_2$. If you are doing some kind of linear regression, this will destroy the original linearity, but it should be fine for nonlinear regression techniques.
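A minimal sketch of this transformation, assuming the data are coded as in the question ($X_2 = 0$ for never-smokers) and using NumPy; the function name `reciprocal_age` and the sample values are made up for illustration.

```python
import numpy as np

def reciprocal_age(x1, x2):
    """Return 1/x2 for smokers and 0 for never-smokers (treated as age = infinity)."""
    x1 = np.asarray(x1)
    x2 = np.asarray(x2, dtype=float)
    out = np.zeros_like(x2)
    # Only invert genuine ages; the coded zeros stay at 0 instead of 1/0
    smokers = (x1 == 1) & (x2 > 0)
    out[smokers] = 1.0 / x2[smokers]
    return out

# Invented toy data: one never-smoker, then first smoke at ages 10 and 20
print(reciprocal_age([0, 1, 1], [0, 10, 20]))
```

On this scale a later starting age moves a smoker closer to the never-smokers at 0, which is the ordering the infinity argument relies on.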
