I have a continuous dependent variable and two independent variables: one continuous and one categorical (with 2 categories).

The interaction between the independent variables is significant. Which statistical analysis should I use (in R) to proceed with the analysis and document the interaction?

(Should I simply analyze each of the two categories separately using simple linear regression?)

Nick Cox
user2564011

2 Answers

In the scenario you describe least squares regression will allow you to tell a very straightforward story:

First of all, imagine that you have no dichotomous independent variable. So:

(1) $y_{i} = \beta_{0} + \beta_{1}x_{1i} + \varepsilon_{i}$

Your regression describes the relationship between your dependent variable $y$ and your continuous independent variable $x_{1}$ as a straight line, with intercept $\beta_{0}$ and slope $\beta_{1}$. Cool? Cool.

Now add both the dichotomous independent variable $x_{2}$ and the interaction between $x_{1}$ and $x_{2}$ to the model:

(2) $y_{i} = \beta_{0} + \beta_{1}x_{1i} + \beta_{2}x_{2i} + \beta_{3}x_{1i}x_{2i} + \varepsilon_{i}$

So now what is your model telling you? Well, (assuming $x_{2}$ is coded 0/1) when $x_{2} = 0$, then the model reduces to equation (1) because $\beta_{2} \times 0 = 0$ and $\beta_{3} \times x_{1} \times 0 = 0$. So that is easy-peasy puddin' pie.

What about when $x_{2} =1$? Well now the $y$-intercept is $\beta_{0} + \beta_{2}$ (Right? Because $\beta_{2} \times 1 = \beta_{2}$).

And the slope of the line relating $y$ to $x_{1}$ is now $\beta_{1} + \beta_{3}$ (Right? Because $\beta_{1}\times x_{1} + \beta_{3} \times x_{1} \times 1 = \beta_{1}\times x_{1} + \beta_{3} \times x_{1} = (\beta_{1} + \beta_{3})\times x_{1}$).

So when $x_{2}=1$ you simply have a second regression line relating $y$ to $x_{1}$, with a different intercept (if $\beta_{2} \ne 0$) and a different slope (if $\beta_{3} \ne 0$, which is what a significant interaction term in, say, ANOVA is telling you).
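To make the "two lines" reading concrete, here is a minimal R sketch. The data are simulated purely for illustration (the names `dat`, `x1`, `x2`, `y` and the coefficient values are assumptions, not the asker's data). It fits equation (2) and checks that the intercept and slope implied for each group match what you would get from a separate simple regression within that group:

```r
# Simulated data: an assumption for illustration only
set.seed(1)
n  <- 200
x1 <- rnorm(n)                               # continuous predictor
x2 <- rbinom(n, 1, 0.5)                      # dichotomous predictor coded 0/1
y  <- 1 + 2 * x1 + 0.5 * x2 - 3 * x1 * x2 + rnorm(n)
dat <- data.frame(y, x1, x2)

fit <- lm(y ~ x1 * x2, data = dat)           # equation (2)
b   <- coef(fit)                             # (Intercept), x1, x2, x1:x2

# Line implied by the model for each group
line0 <- unname(c(b["(Intercept)"],           b["x1"]))
line1 <- unname(c(b["(Intercept)"] + b["x2"], b["x1"] + b["x1:x2"]))

# Separate simple regressions within each group
sep0 <- unname(coef(lm(y ~ x1, data = dat, subset = x2 == 0)))
sep1 <- unname(coef(lm(y ~ x1, data = dat, subset = x2 == 1)))

all.equal(line0, sep0)                       # same intercept and slope
all.equal(line1, sep1)
```

So analyzing each category separately gives the same two fitted lines. What the single model adds is a direct test of whether the slopes differ ($\beta_{3}$), plus a residual variance pooled over both groups, so the standard errors will generally not match those from the separate fits.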

How do you communicate this? A single graph with two regression lines overlaying your data (possibly with different colored/shaped/sized markers when $x_{2}=1$), and a label indicating which line corresponds to $x_{2}=0$ and which to $x_{2}=1$. It is also good to give your audience the values of the $\beta$s with their standard errors and/or confidence intervals (in a table of multiple regression results, say).

Cool? Cool.
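A rough sketch of how such a table might be put together in R, again on simulated data (the data frame `dat` and the coefficient values are assumptions for illustration). It also shows one way to get a standard error and confidence interval for the slope in the $x_{2}=1$ group, $\beta_{1}+\beta_{3}$, from the estimated covariance matrix of the coefficients:

```r
# Simulated data: an assumption for illustration only
set.seed(2)
dat   <- data.frame(x1 = rnorm(150), x2 = rbinom(150, 1, 0.5))
dat$y <- 0.5 + 1.5 * dat$x1 - 1 * dat$x2 + 2 * dat$x1 * dat$x2 + rnorm(150)

fit <- lm(y ~ x1 * x2, data = dat)

# Estimates, standard errors, t, p, and 95% confidence limits in one table
tab <- cbind(coef(summary(fit)), confint(fit))
print(round(tab, 3))

# Slope for the x2 = 1 group is beta1 + beta3; its variance is
# Var(b1) + Var(b3) + 2 * Cov(b1, b3)
V      <- vcov(fit)
slope1 <- unname(coef(fit)["x1"] + coef(fit)["x1:x2"])
se1    <- sqrt(V["x1", "x1"] + V["x1:x2", "x1:x2"] + 2 * V["x1", "x1:x2"])
ci1    <- slope1 + c(-1, 1) * qt(0.975, df.residual(fit)) * se1
```

The slope for the $x_{2}=0$ group needs no extra work: it is the `x1` row of the table.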

Finally, while all the above tells you about trend relationships between $y$ and $x_{1}$ given $x_{2}$, least squares regression also tells you about strength of association. If you had a single independent variable, you'd probably want to use $R^{2}$ to describe this strength of association, but $R^{2}$ can only go up as you add variables, so it doesn't quite mean what it did before. So you might report adjusted $R^{2}$ instead (generalized $R^{2}$ and pseudo-$R^{2}$ are the analogues for models that are not fit by least squares).

Alexis
  • How do I follow up such a significant interaction between a categorical and a continuous variable? That is, how can I break down the interaction effect for each sub-category of the factor variable? – user2564011 Jun 09 '14 at 17:17
  • Well. You can do the same thing, except instead of two lines, you would have $k$ lines corresponding to $k$ groups. Assume you have 4 groups, A, B, C, and D. Then you could use three 0/1 indicator variables for groups B, C and D. When B, C and D = 0, the estimates reduce to equation (1). When only C and D = 0, the estimates reduce to something like equation (2); the same holds when B and D = 0 and when B and C = 0. – Alexis Jun 09 '14 at 20:37
  • What if there are multiple levels in the categorical variable, i.e., more than 2, even 7 or 8? – CocoCrisp Nov 13 '18 at 12:52
  • @Liger I already answered this with an example in the comment directly above the one you just wrote (I used a categorical variable with 4 groups, but categorical variables with other numbers of groups work the same way; they just need to be [effect coded](https://stats.idre.ucla.edu/other/mult-pkg/faq/general/faqwhat-is-effect-coding/), and you are good to go). – Alexis Nov 14 '18 at 01:41
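The four-group case from the comments can be sketched in R as follows. The data are simulated and the names (`g`, `slope_by_group`, `dat`) are assumptions for illustration. This sketch relies on R's default treatment contrasts for a factor, which create exactly the 0/1 indicators for B, C and D described above, with A as the reference group:

```r
# Simulated data: an assumption for illustration only
set.seed(3)
n   <- 120
g   <- factor(sample(c("A", "B", "C", "D"), n, replace = TRUE))
x1  <- rnorm(n)
slope_by_group <- c(A = 1, B = 0, C = -1, D = 2)   # hypothetical true slopes
y   <- unname(slope_by_group[as.character(g)]) * x1 + rnorm(n)
dat <- data.frame(y, x1, g)

fit <- lm(y ~ x1 * g, data = dat)
head(model.matrix(fit))        # columns gB, gC, gD are the 0/1 indicators

# Slope for each group: reference slope plus that group's interaction term
b <- coef(fit)
slopes <- c(A = b[["x1"]],
            B = b[["x1"]] + b[["x1:gB"]],
            C = b[["x1"]] + b[["x1:gC"]],
            D = b[["x1"]] + b[["x1:gD"]])

# Same slopes as fitting a separate line within each group
sep <- sapply(split(dat, dat$g), function(d) coef(lm(y ~ x1, data = d))[["x1"]])
all.equal(slopes, sep)
```

With 7 or 8 levels nothing changes except the number of indicator columns and interaction terms.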
@Alexis seems to cover the equations pretty well. Here's some example code in R:

set.seed(8)
d8a <- data.frame(x = rnorm(99), z = rbinom(99, 1, .5))   # Data simulated to fit the scenario
d8a$y <- (d8a$x + rnorm(99, 0, 3)) * (2 * d8a$z - 1)      # Guarantees an interaction
summary(lm(y ~ scale(x) * factor(z), d8a))                # Fits the linear model by OLS; this is the part you need

$$\rm Output$$

Residuals:
    Min      1Q  Median      3Q     Max 
-6.1575 -2.1416 -0.2051  1.8558  6.5765 

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)         -0.17374    0.43557  -0.399 0.690867    
scale(x)            -1.11354    0.45224  -2.462 0.015608 *  
factor(z)1           0.01546    0.58976   0.026 0.979144    
scale(x):factor(z)1  2.24831    0.59689   3.767 0.000287 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Residual standard error: 2.922 on 95 degrees of freedom
Multiple R-squared: 0.1328, Adjusted R-squared: 0.1054 
F-statistic:  4.85 on 3 and 95 DF,  p-value: 0.003492
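Reading the two regression lines off the coefficient table above (writing $x_{s}$ for the standardized predictor `scale(x)`, so slopes are per standard deviation of $x$ and intercepts are the fitted values at the mean of $x$):

$$z = 0:\quad \hat{y} = -0.17374 - 1.11354\,x_{s}$$

$$z = 1:\quad \hat{y} = (-0.17374 + 0.01546) + (-1.11354 + 2.24831)\,x_{s} = -0.15828 + 1.13477\,x_{s}$$

So the two groups have nearly the same intercept but slopes of roughly equal size and opposite sign, which is what the simulation was built to produce.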

$$\rm Plot$$

library(ggplot2)
ggplot(d8a, aes(x, y, color = factor(z))) + stat_smooth(method = lm) + geom_point()

Nick Stauner