Significance means detectability. That, in turn, depends (among other things) on the amount of data. A common way to obtain large but insignificant effect sizes, then, is simply to have too little data. Since such examples are numerous and easy to create, I won't dwell on this rather uninteresting point.
There are subtler things that can go on. Even with relatively large amounts of data, a model can fail to detect an effect because that effect is masked or otherwise confused with another effect.
Here's an example involving a plain-vanilla linear regression with two explanatory variables $x_1$ and $x_2$ and a response $y$ that is conditionally independent, Normal, and of constant variance--in other words, as beautiful a situation as one could hope for when applying Least Squares methods.
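In symbols, this is the standard model

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon, \qquad \varepsilon \sim \text{Normal}(0, \sigma^2),$$

with the errors independent across observations.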

The scatterplot matrix of the three variables (the pairs plot produced by the R code at the end of this post) shows how they are related within a sample of three hundred observations. Perhaps, for instance, an experimenter was able to observe a system in three different conditions; measured $x_1, x_2,$ and $y$ one hundred times in each condition; and wishes to understand how $y$ might be related to the $x_i$. That sounds like a typical (and therefore important) situation to understand.
There does seem to be a relationship: although the values of $y$ are spread over ranges of roughly $0$ to $4$, $1$ to $5$, and $2$ to $6$ in the three conditions, these ranges shift upwards as the values of the $x_i$ increase from near $-1$ to near $1$. The regression overall is significant (its p-value is so tiny that the software reports it only as being less than $2\times 10^{-16}$).
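If you want to check that overall p-value yourself, one way is to extract it from the fit object created by the R code at the end of this post:

f <- summary(fit)$fstatistic                               # overall F statistic with its degrees of freedom
pf(f["value"], f["numdf"], f["dendf"], lower.tail=FALSE)   # p-value of the regression as a whole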
Here are the coefficient estimates and their associated statistics:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  3.06269    0.05619  54.501   <2e-16
x.1         -1.95875    1.88340  -1.040    0.299
x.2          2.96594    1.88206   1.576    0.116
Notice:
* The p-values for the coefficients are 0.299 and 0.116. Neither would be considered "significant" in most situations: they are too large.
* The coefficient estimates ("effect sizes") of $-1.96$ and $2.97$ are large. What does "large" mean here? Simply that since each of the $x_i$ varies by more than $1 - (-1) = 2$ in the dataset, a coefficient of (say) $2.97$ translates to variations of more than $2\times 2.97\approx 6$ in the prediction of $y$. Since the total variation in $y$ is only from $0$ to $6$, this means $x_2$ alone can completely determine the value of $y$! That's large.
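To make that arithmetic concrete, here is one way to compute the swing in predicted $y$ implied by each coefficient over the observed range of its regressor, using the fit and X objects from the code at the end of the post (a quick check, not part of the original output):

coef(fit)["x.2"] * diff(range(X$x.2))   # roughly 2 * 2.97, about 6: the entire range of y
coef(fit)["x.1"] * diff(range(X$x.1))   # roughly 2 * (-1.96), about -4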
This failure to detect the effects of the $x_i$ comes about because the $x_i$ separately give us nearly the same information: they are said to be (almost) collinear. Collinearity can be subtle and much more difficult to detect when there are more than two explanatory variables. Search our site for more examples.
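One standard diagnostic is the variance inflation factor (VIF), which measures how well each explanatory variable is predicted by the others; the car package provides vif() for this, but with only two regressors it is easy to compute by hand using the X object from the code below (a sketch, not part of the original analysis):

cor(X$x.1, X$x.2)                             # very nearly 1
r2.1 <- summary(lm(x.1 ~ x.2, X))$r.squared   # how well x.2 predicts x.1
r2.2 <- summary(lm(x.2 ~ x.1, X))$r.squared   # and vice versa
c(x.1 = 1/(1 - r2.1), x.2 = 1/(1 - r2.2))     # VIFs far above the usual cutoffs of 5 or 10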
You may modify this example to explore the effects of sample size, variability, coefficients, and so on; a small wrapper for running such experiments is sketched after the code. Here is the R code. Comment out the line set.seed(17) when experimenting, so that you get randomly different results each time.
library(data.table)
n <- 100 # One-third of the sample size
tau <- .02 # Conditional standard deviation of the explanatory variables
sigma <- 1 # Error standard deviation
#
# Create data.
#
set.seed(17) # Creates reproducible data
x <- c(rep(1,n), rep(0,n), rep(-1,n)) # Experimental "condition"
X <- data.table(x.1=rnorm(3*n, x, tau),               # Regressor x.1
                x.2=rnorm(3*n, x, tau))               # Regressor x.2
invisible(X[, y := rnorm(3*n, 3+x+x.2-x.1, sigma)])   # y: mean 3+x, written via the nearly collinear regressors, plus error
#
# Plot the data.
#
pairs(X, pch=19, col="#00000010")
#
# Perform least squares regression.
#
fit <- lm(y ~ ., X)
summary(fit)
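One way to organize such experiments is to wrap the simulation in a function. The sketch below (not part of the original code) regenerates the data with adjustable settings and returns the p-values of the coefficients; increasing tau weakens the collinearity between the regressors, and the individual effects then become easy to detect.

simulate <- function(n=100, tau=0.02, sigma=1) {
  # Regenerate the data as above: three conditions, two nearly collinear regressors
  x <- c(rep(1,n), rep(0,n), rep(-1,n))
  x.1 <- rnorm(3*n, x, tau)
  x.2 <- rnorm(3*n, x, tau)
  y <- rnorm(3*n, 3+x+x.2-x.1, sigma)
  coef(summary(lm(y ~ x.1 + x.2)))[, "Pr(>|t|)"]   # p-values for the intercept and both slopes
}
simulate(tau=0.02)   # nearly collinear: large p-values for x.1 and x.2
simulate(tau=0.5)    # much less collinear: tiny p-values for both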