
Let's imagine that we have three variables, $Y$, $X_1$ and $X_2$, where $Y$ is a continuous dependent variable, $X_1$ is a continuous variable and $X_2$ is a discrete variable with two factors - $0$ and $1$. I'm interested in the linear model $Y \sim X_1$; however, I want to find a statistical test that determines whether it's necessary to split the effect of $X_1$ into two categories according to $X_2$.

My idea

My idea was to fit a first model $Y \sim X_1 + X_1 \cdot 1_{\{X_2 = 0\}}$ and a second model $Y \sim X_1$.

Since those two models are nested, I can compare them using an F-test to check whether the reduction is rational, i.e. whether the residual sum of squares ($\sum_{i = 1}^n (Y_i - \hat{Y}_i)^2$) is significantly different between them.

Does this make any sense to you? Is there a more popular way to do this?

kjetil b halvorsen
John
  • *discrete variable with two factors - 0 and 1*: it would be more standard terminology to say *with two levels*. Also, what you are really doing with the second model is introducing the *interaction* of $X_1$ with $X_2$. Then you would usually also include the direct effect of $X_2$. – kjetil b halvorsen Dec 09 '21 at 15:25

2 Answers


You are using some non-standard terminology (a more descriptive title would also help). Your two models can be written as

$$ \begin{align} \label{I}\tag{I} Y_i&= \alpha +\beta X_{1i} + \epsilon_i \\ \label{II'}\tag{II'} Y_i&=\alpha+\beta X_{1i} + \gamma X_{1i} \cdot \mathbb{1}(X_{2i}=0)+\epsilon_i \end{align} $$

but equation $\eqref{II'}$ is a restricted version of the interaction model (assuming the factor $X_2$ is coded 0/1)

$$ \label{II}\tag{II} Y_i=\alpha+\beta X_{1i} + \gamma_0 X_{2i} +\gamma_1 X_{1i}X_{2i} + \epsilon_i $$

and usually you would prefer $\eqref{II}$ to your $\eqref{II'}$, because $\eqref{II'}$ violates the heredity principle: one should usually include every main effect that takes part in an interaction. But of course, your case could be an exception ... see the earlier posts "Do all interactions terms need their individual terms in regression model?" and "Including the interaction but not the main effects in a model".
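To see why the formula in the question is a restricted version of $\eqref{II}$, one can substitute $\mathbb{1}(X_{2i}=0) = 1 - X_{2i}$ into $Y \sim X_1 + X_1 \cdot 1_{\{X_2 = 0\}}$. The coefficient names $a$, $b$, $c$ below are labels introduced here for illustration only; they do not appear in the question.

$$ Y_i = a + b X_{1i} + c X_{1i}(1 - X_{2i}) + \epsilon_i = a + (b + c) X_{1i} - c X_{1i} X_{2i} + \epsilon_i $$

This is $\eqref{II}$ with $\alpha = a$, $\beta = b + c$, $\gamma_1 = -c$ and the constraint $\gamma_0 = 0$: the two groups get different slopes but are forced to share a single intercept.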

Then to your question: modulo the above, your idea makes sense. The models are nested, so the F-test is a good way to compare them.
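As a rough illustration of that comparison, here is a minimal sketch of the nested-model F-test for $\eqref{I}$ against $\eqref{II}$, assuming NumPy and SciPy are available. The simulated data, the coefficient values and the helper names `rss` and `nested_f_test` are all made up for this example. The statistic is $F = \frac{(RSS_{I} - RSS_{II})/q}{RSS_{II}/(n - p)}$, where $q$ is the number of extra parameters in the larger model and $p$ is its total number of parameters.

```python
import numpy as np
from scipy import stats


def rss(design, y):
    """Residual sum of squares of an ordinary least-squares fit."""
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    return float(resid @ resid)


def nested_f_test(X_small, X_big, y):
    """F-test of a smaller model against a larger model that nests it.

    Assumes both design matrices have full column rank and that every
    column of X_small also appears in X_big.
    """
    n = len(y)
    df_num = X_big.shape[1] - X_small.shape[1]  # number of extra parameters
    df_den = n - X_big.shape[1]                 # residual df of the big model
    rss_small, rss_big = rss(X_small, y), rss(X_big, y)
    f_stat = ((rss_small - rss_big) / df_num) / (rss_big / df_den)
    p_value = float(stats.f.sf(f_stat, df_num, df_den))
    return f_stat, p_value


# Simulated data (hypothetical): the slope of x1 differs between the groups.
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = rng.integers(0, 2, size=n).astype(float)
y = 1.0 + 2.0 * x1 + 0.5 * x2 + 1.5 * x1 * x2 + rng.normal(size=n)

ones = np.ones(n)
X_I = np.column_stack([ones, x1])                # model (I)
X_II = np.column_stack([ones, x1, x2, x1 * x2])  # model (II)

f_stat, p_value = nested_f_test(X_I, X_II, y)
print(f"F = {f_stat:.2f}, p = {p_value:.3g}")
```

A small p-value suggests that the extra terms involving $X_2$ reduce the residual sum of squares by more than chance alone would explain. Statistical packages provide the same test directly, so the hand-rolled version is only meant to show the arithmetic.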

kjetil b halvorsen

If I understand you correctly, you are basically asking whether $X_2$ (and/or its interactions) is a "useful" variable. You use words like "rational" or "necessary", but these words only have meaning in the context of some defined goal.

The role $X_2$ ought to play depends on this goal. One goal might be maximizing $R^2$. In that case, adding $X_2$ and interaction terms will in all likelihood help, unless $X_2$ is entirely uncorrelated with $Y$. On the other hand, maybe you are interested in a particular causal effect, like the effect of $X_1$ on $Y$. In this case, whether to include $X_2$ depends on $\operatorname{Cov}(X_2, X_1)$ and $\operatorname{Cov}(X_2, \text{error})$, and including it does not necessarily help you reach your goal.
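The difference between the two goals can be sketched with a small simulation, assuming NumPy is available. Everything here is hypothetical: the data-generating process, the coefficient values and the helper name `ols` are chosen only for illustration. In this setup $X_2$ affects $Y$ and is correlated with $X_1$, so leaving it out shifts the estimated slope of $X_1$, while $R^2$ cannot go down when terms are added.

```python
import numpy as np


def ols(design, y):
    """Return OLS coefficients and R^2 for a design with an intercept column."""
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    centered = y - y.mean()
    r2 = 1.0 - float(resid @ resid) / float(centered @ centered)
    return coef, r2


# Hypothetical data: x1 is correlated with x2, and x2 has its own effect on y.
rng = np.random.default_rng(1)
n = 2000
x2 = rng.integers(0, 2, size=n).astype(float)
x1 = rng.normal(size=n) + 0.8 * x2
y = 1.0 + 2.0 * x1 + 3.0 * x2 + rng.normal(size=n)  # true slope of x1 is 2

ones = np.ones(n)
coef_small, r2_small = ols(np.column_stack([ones, x1]), y)
coef_big, r2_big = ols(np.column_stack([ones, x1, x2, x1 * x2]), y)

slope_small = coef_small[1]  # slope of x1 when x2 is left out
slope_big = coef_big[1]      # slope of x1 in the group with x2 = 0

print(f"R^2 without x2: {r2_small:.3f}, with x2 and interaction: {r2_big:.3f}")
print(f"slope of x1 without x2: {slope_small:.3f}, with x2: {slope_big:.3f}")
```

Under these assumptions the smaller model overstates the slope of $X_1$ because part of the effect of $X_2$ is attributed to it. If $X_2$ were unrelated to $X_1$, the slope would be estimated without this distortion in either model.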

confused student