Effect of a variable in a t-test comparison?

Question

In a study, we used Student's t-test (the data are normal) to compare the mean expression of a protein between 2 groups of patients (n=100). We found that the expression was statistically different (p-value < 0.005).

One reviewer of our work is asking if the ages of the patients, in the two different groups, can affect the statistical significance that we found?

Could you please tell me what approach I should use to assess whether the ages of the patients are biasing the test or not?

Here are some details about the procedure that I am using, especially regarding the comparison between the t.test results and the regression results.

I am using R (the 't.test' and 'glm' functions) for all the computations. I have simplified my dataset by creating some artificial data and leaving age out, because my new question, following the comments above, is: does it make sense to get different results from a t-test and a regression?

#50 random values
x <- rnorm(50)

#60 other random values
y <- rnorm(60)

# perform a t.test
t.test(x,y)



     Welch Two Sample t-test

data:  x and y
t = 1.956, df = 25.253, p-value = 0.04161
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.01016491  0.39826273
sample estimates:
mean of x mean of y 
0.7273823 0.5333334

#format the data
df <- data.frame(y=c(x,y),group=c(rep("x",50),rep("y",60)))

#perform a regression
fit <- glm(y~group,data=df)

#print the results
summary(fit)

Call:
glm(formula = y ~ bc, data = df)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-0.48892  -0.23710   0.04165   0.22003   0.46359  

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.72738    0.09225   7.885 1.37e-08 ***
bcy         -0.19405    0.11298  -1.717   0.0969 .  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

As you can see, the t.test is significant but the regression coefficient is not. Does that make sense?

PyjamaNinja · Answer 1 · 2017-10-15T20:39:10.657

5

You could simply run a multiple regression analysis with two predictors (group and age) predicting the outcome variable (protein expression).

In the output of standard statistical software, you will get a beta value for each predictor (i.e. the predicted change in the outcome variable [protein expression] for a one-unit increase in the predictor; for the group variable, this is the difference between the two groups), as well as a t value, which tests whether the beta value is significantly different from 0. You have then essentially run a t-test for the group variable whilst controlling for the variance associated with age (and vice versa).
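A minimal sketch of such a model in R, on simulated data. The variable names (`expression`, `group`, `age`), the effect sizes and the seed are all made up for illustration; they are not the asker's data.

```r
set.seed(42)  # arbitrary seed, only for reproducibility
n <- 100

# Hypothetical data: two groups whose ages differ on average
group <- factor(rep(c("A", "B"), each = n / 2))
age   <- rnorm(n, mean = ifelse(group == "A", 50, 60), sd = 8)

# Simulated outcome that depends on both group and age
expression <- 1 + 0.5 * (group == "B") + 0.03 * age + rnorm(n)
dat <- data.frame(expression, group, age)

# Unadjusted comparison: group only
fit_group <- lm(expression ~ group, data = dat)

# Adjusted comparison: group effect controlling for age
fit_adj <- lm(expression ~ group + age, data = dat)

# The "groupB" row holds the age-adjusted group difference and its t-test
coef(summary(fit_adj))
```

Comparing the `groupB` row of `coef(summary(fit_group))` with that of `coef(summary(fit_adj))` shows how much the estimated group difference moves once age is in the model.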

You suggest that your data are normally distributed (not sure how you tested for this and if you tested the observed data or residuals), but it's also worth checking if there are linear relationships between the predictors and outcome variable, as well as if there are any outliers.
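These checks could look roughly as follows in R. This is only a sketch on simulated data: the variable names are hypothetical, and the 4/n cutoff for Cook's distance is a common rule of thumb rather than a fixed standard.

```r
set.seed(1)  # arbitrary seed
group <- factor(rep(c("A", "B"), each = 50))
age <- runif(100, min = 30, max = 80)
expression <- 2 + 0.4 * (group == "B") + 0.02 * age + rnorm(100)
fit <- lm(expression ~ group + age)

# The normality assumption concerns the residuals, not the raw outcome
res <- residuals(fit)
sw <- shapiro.test(res)
sw$p.value

# Cook's distance as a rough screen for influential observations
cd <- cooks.distance(fit)
flagged <- which(cd > 4 / length(cd))
flagged

# Residuals-vs-fitted and Q-Q plots for a visual check of linearity and normality
if (interactive()) plot(fit, which = 1:2)
```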

edited Oct 15 '17 at 20:39

answered Oct 15 '17 at 18:45

PyjamaNinja


Thank you very much PyjamaNinja. Someone suggested that I perform a t-test between the ages of the two groups. If the p-value of that t-test is not significant, it would imply that age does not affect my experiment. Is that true? Would you also suggest this approach? – Mel Oct 15 '17 at 19:08
2

The problem with the method you describe is that even if your groups differ in age, you have no idea whether this influences their scores on the outcome. It probably does, but you can't say for sure. By adding both group and age to a regression model, you explicitly account for the relationship each has with the outcome. – PyjamaNinja Oct 15 '17 at 19:17
3

Doing a significance test on ages is not very informative, since you shouldn't accept the null hypothesis just because the test is not significant. PyjamaNinja's suggestion, which is basically analysis of covariance, is better. – David Lane Oct 15 '17 at 19:18
Thank you both for your suggestions. I have performed the regression analysis using these two factors. The t-values associated with the beta values (of group and age) are not significant (cool! So age has no impact on my variable). One last question, please: when I perform a regression analysis using only the groups to predict the expression of my variable, the t-value associated with the beta value is not significant. Does that make sense? Thanks! – Mel Oct 16 '17 at 05:24
Yes, that does make sense. Parameter estimates and significances of predictors will change if you add (or remove) other predictors. – Stephan Kolassa Oct 16 '17 at 07:57

score 1 · Answer 2 · answered Jan 23 '18 at 21:03

The problem with using a standard linear regression model to assess the validity of a t-test estimate with unequal variance assumption is that the linear regression uses an inappropriate pooled variance estimate.
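To make the pooled-variance point concrete, here is a small sketch on simulated data with unequal variances (the seed, sample sizes and standard deviations are arbitrary choices for illustration). It compares the p-value from `lm` with the pooled-variance t-test (`var.equal = TRUE`) and with Welch's t-test, which is what `t.test` runs by default.

```r
set.seed(2018)  # arbitrary seed
x <- rnorm(50)
y <- rnorm(60, sd = 2)
df <- data.frame(y = c(x, y), group = c(rep("x", 50), rep("y", 60)))

# Student's t-test with a pooled variance estimate
tt_pooled <- t.test(x, y, var.equal = TRUE)

# Welch's t-test (R's default), which does not pool the variances
tt_welch <- t.test(x, y)

# Ordinary least squares on the group indicator
fit <- lm(y ~ group, data = df)
p_lm <- coef(summary(fit))["groupy", "Pr(>|t|)"]

# Side-by-side p-values
c(pooled = tt_pooled$p.value, lm = p_lm, welch = tt_welch$p.value)

# The pooled t-test and the regression give the same p-value
all.equal(tt_pooled$p.value, p_lm)
```

The regression reproduces the pooled test, not the Welch test, so any gap between `lm` and the default `t.test` output comes from the variance estimate and the degrees of freedom.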

To obtain approximately similar (but not identical) inference to the t-test with unequal variance assumption you should use robust standard errors, which use a weighted combination of the residuals to estimate the variability of the regression coefficient.

I cannot verify your output since you did not use set.seed, but it just so happens that the inference from the two models is arbitrarily close. In fact, I could not possibly care less about their disagreement at the equally arbitrary 0.05 significance level. However, let's introduce marked heteroscedasticity to underscore their differences. Now the t.test and regression model disagree at the 0.0295 level. In another post I can tell you why 0.0295 reflects a clinically appropriate level of statistical significance.

set.seed(1234)
x <- rnorm(50)
y <- rnorm(60, sd=2)
t.test(x,y)
df <- data.frame(y=c(x,y),group=c(rep("x",50),rep("y",60)))
fit <- lm(y ~ group, data=df)
summary(fit)

Gives:

t = -2.2603, df = 85.708, p-value = 0.02634

and

            Estimate Std. Error t value Pr(>|t|)  
groupy        0.6329     0.2974   2.128   0.0356 *

As you can see, they plainly disagree at my well-reasoned and clinically sound significance level. However, the sandwich-based inference is much closer:

library(sandwich)
library(lmtest)
coeftest(fit, vcov=vcovHC)

Which gives:

            Estimate Std. Error t value  Pr(>|t|)    
groupy       0.63291    0.28247  2.2406 0.0270991 *

This agrees with the t-test out to 2 decimal places rather than 1, and totally agrees with my well-reasoned 0.0295 significance level. As a note, they only agree in terms of committing a type 1 error.

In summary:

Sandwich standard errors give approximately unbiased inference for the Behrens-Fisher problem when you stack the outcomes into a single vector and regress it on the group indicator in a linear regression model, much like the t-test with unequal variance assumption using Welch's degrees of freedom approximation.
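For the two-group case the connection can be checked by hand. The sketch below builds a sandwich variance in base R using the HC2 weighting, which is my own choice for illustration (as far as I know `vcovHC` defaults to HC3, so its numbers differ slightly), and compares the resulting standard error with the one used by Welch's t-test.

```r
set.seed(1234)
x <- rnorm(50)
y <- rnorm(60, sd = 2)
df <- data.frame(y = c(x, y), group = c(rep("x", 50), rep("y", 60)))
fit <- lm(y ~ group, data = df)

X <- model.matrix(fit)  # design matrix: intercept and group indicator
e <- residuals(fit)     # deviations from each group's mean
h <- hatvalues(fit)     # leverage, equal to 1 / (group size) here

# Sandwich estimator: (X'X)^{-1} X' diag(e^2 / (1 - h)) X (X'X)^{-1}
bread <- solve(crossprod(X))
meat  <- crossprod(X * (e / sqrt(1 - h)))
V_hc2 <- bread %*% meat %*% bread

se_hc2   <- sqrt(V_hc2["groupy", "groupy"])
se_welch <- sqrt(var(x) / length(x) + var(y) / length(y))

# With HC2 weights the two standard errors coincide
all.equal(se_hc2, se_welch)
```

The standard errors match, but the p-values need not: the regression-based test still uses the residual degrees of freedom, whereas Welch's test uses its own approximation. That is one reason the agreement is close rather than exact.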

score 0 · Answer 3 · answered Oct 16 '17 at 14:06


If I understood correctly, the Student's t-test gives you a difference between groups while the regression analysis shows no difference. In this case, there may be a problem because both analyses should show the same result.

answered Oct 16 '17 at 14:06

Emmanuel.W


I disagree. The question itself does not consider regression techniques (the answer by @pyjamaNinja does). Moreover, a regression analysis which would include age as a covariate/confounder would/could show an effect on protein expression for the grouping variable **which is different from a t-test** (as the t-test does not 'correct' for age). – IWS Oct 16 '17 at 15:28
@IWS, my answer was actually a response to Mel's last comment, but I don't yet have enough reputation to reply directly. Furthermore, in that comment Mel does refer to a model without age as a covariate. – Emmanuel.W Oct 16 '17 at 15:44
@Emmanuel.W I have just checked my results. The Student's t-test is significant (p-value < 0.05) in this situation. Nevertheless, a regression analysis built only with the groups (thus without considering age) to predict the protein expression is not significant at all, and the coefficient is very low. I have many other situations like this with other proteins that we are studying. Is there something that I am doing wrong? Thank you for your help. Mel – Mel Oct 17 '17 at 17:07
1

@Mel Hi Mel, it might be worth starting a new question with specific details of what you are doing, characteristics of your variables, software, etc. Try to be as thorough as you can - right now we have no idea what you are doing and so cannot give any answers. – PyjamaNinja Oct 17 '17 at 17:29
1

@Mel can you give us a screenshot of your results (maybe in a new question as suggested by PyjamaNinja)? – Emmanuel.W Oct 18 '17 at 09:39
