
My dependent variable is a binomial count, and I used a GLM as suggested in this post. The model shows up as highly significant, which I found very suspicious. To check, I ran the model on a vector of randomly generated numbers.

I get the following output: [GLM output]. If I'm interpreting it correctly, the model is still highly significant, with the associated p-value of the chi-squared statistic < 0.01.

Any ideas on what might be happening?

vvv
  • You have over 300,000 observations. Anything that is not exactly zero will likely be statistically significant. – Heteroskedastic Jim Jul 24 '18 at 22:24
  • @user162986 While it's true that smaller and smaller effects become significant with large sample sizes, for randomly generated data that should not be the case. – Bryan Krause Jul 24 '18 at 22:32
  • I'm not familiar with GLM in python - is your model lacking an intercept? – Bryan Krause Jul 24 '18 at 22:35
  • @BryanKrause in a single run, one can't say. We can only say what we'd expect over many runs. I now realize I didn't read the question properly and I should have. OP can gain learn more about the performance of stats models glm by running a simulation. I'll try one and post an answer. – Heteroskedastic Jim Jul 24 '18 at 22:37
  • I suppose I presumed that the OP ran more than one iteration with the synthetic data, since doing so would be so easy. – Bryan Krause Jul 24 '18 at 22:38
  • Also @user162986 you should see https://stats.stackexchange.com/questions/2516/are-large-data-sets-inappropriate-for-hypothesis-testing – Bryan Krause Jul 24 '18 at 22:40
  • @BryanKrause I skimmed the post. Does it contradict what we know, that if the null is exactly true, i.e. the effect is exactly 0, the distribution of p-values should be uniform? The difference is that in reality the null is never exactly true; effects are merely trivially small, so at large n everything becomes statistically significant. – Heteroskedastic Jim Jul 24 '18 at 22:45
  • @user162986 Yes that post covers that topic. Like I said, smaller and smaller effects in real data sets become significant with large sample sizes, but if you generate data under the null hypothesis there is no bias toward erroneous rejection with large sample sizes. – Bryan Krause Jul 24 '18 at 22:48
  • The model keeps being significant every time I run it, with or without an intercept. @BryanKrause so I can rule out the 'too much data' problem? Also, could anyone point me to where I could read about deviance explained in this context? Thanks so much for the help – vvv Jul 24 '18 at 23:08
  • Is that call to np.random.randint correct? It doesn't look right to me... check https://docs.scipy.org/doc/numpy-1.14.0/reference/generated/numpy.random.randint.html – jbowman Jul 24 '18 at 23:13
  • It seems to be, when I look at a few values. In any case, what is the probability that I would randomly generate a non-random vector that ends up significant? :) – vvv Jul 24 '18 at 23:59
  • Correct me if I'm wrong - I may well be - but wouldn't that generate values between `0` and `len(data_endog)-1` rather than 0 or 1? Try replicating the sequence (0,1) 184,000 times and using that as your target variable, just as another check. – jbowman Jul 25 '18 at 00:09
  • I redid it with np.random.randint(0,1,len(data_endog)), same result – vvv Jul 25 '18 at 20:27
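As a side note on the exchange above: NumPy's np.random.randint treats its upper bound as exclusive, so randint(0, 1, n) can only ever return zeros, while randint(0, 2, n) draws from {0, 1}. A minimal check:

```python
import numpy as np

only_zeros = np.random.randint(0, 1, 10)  # high=1 is exclusive, so every draw is 0
print(only_zeros.max())                   # 0

zeros_and_ones = np.random.randint(0, 2, 10)      # draws from {0, 1}
print(set(np.unique(zeros_and_ones)) <= {0, 1})   # True
```

This means a check run with randint(0, 1, ...) fits the model to a constant-zero response rather than to Bernoulli noise.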

1 Answer


I attempted to test OP's claim.

import matplotlib.pyplot as plt
import numpy as np
import statsmodels.api as sm

n = 300000
res = []
for _ in range(1000):
    # Bernoulli(0.5) outcome: randint's upper bound is exclusive, so this draws from {0, 1}
    y = np.random.randint(0, 2, n)
    # x ~ discrete uniform on {0, ..., n - 1}
    x = np.random.randint(0, n, n)
    # Standardize the oddly scaled x to help with numerical issues
    x = (x - np.mean(x)) / np.std(x)
    res.append(sm.GLM(y, x, family=sm.families.Binomial()).fit().pvalues[0])

A couple of comments. The data-generating process for the regressor in the OP's example is strange but correct: in my simulation, as in the OP's, x is uniformly distributed over the integers 0 to 299,999. I standardized the regressor to reduce the chance of numerical problems during model fitting. I used 1,000 replications with a sample size of 300,000.

Here is the result:

plt.hist(res)
plt.show()

[Histogram of the 1,000 simulated p-values]

The p-values are uniformly distributed between 0 and 1, as one would expect: statsmodels behaves correctly. I also tried the variation in the OP's example, keeping y the same and regenerating x each time; same outcome.
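Beyond eyeballing the histogram, uniformity can be checked numerically: under the null, the rejection rate at the 5% level should be close to 0.05, and a Kolmogorov-Smirnov test against Uniform(0, 1) should not reject. A sketch (using seeded uniform draws as a stand-in for the res vector produced by the simulation above):

```python
import numpy as np
from scipy import stats

np.random.seed(0)
res = np.random.uniform(size=1000)  # stand-in for the 1000 simulated p-values

# Fraction of "significant" results at the 5% level; expect roughly 0.05
print(np.mean(res < 0.05))

# KS test of the p-values against Uniform(0, 1); a large p-value here means
# no evidence of departure from uniformity
print(stats.kstest(res, "uniform").pvalue)
```

Run on the actual res from the simulation, both checks come out consistent with the uniform histogram.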

Heteroskedastic Jim
  • Sorry, I am confused - what is this saying? That my random generator is correct, or..? – vvv Jul 25 '18 at 20:24
  • @vvv Your random number generation, though correct, is strange. You are generating a variable that can take on any value between 0 and ~370,000. Usually, we do not use such variables in models without transforming them somehow. The major point here is that statsmodels GLM behaves as one might expect: most of the p-values at large sample sizes are greater than .05 (~95% of them), if you repeat the process enough times. – Heteroskedastic Jim Jul 25 '18 at 21:35