
I have an extremely large number of observations (8,524,152) of soil moisture, precipitation, evapotranspiration, delta precipitation, and delta evapotranspiration. I ran a multiple linear regression model, and my result looks like this:

Call:
lm(formula = SMDI ~ ET + delta_ET + PRCP + delta_PRCP, data = regData)

Residuals:
     Min       1Q   Median       3Q      Max 
-10414.0     67.1    133.9    192.2   8737.3 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -87.508196   0.797889 -109.67   <2e-16 ***
ET            0.083853   0.001225   68.46   <2e-16 ***
delta_ET      0.267973   0.001270  211.04   <2e-16 ***
PRCP          0.237649   0.003255   73.02   <2e-16 ***
delta_PRCP    0.257458   0.003250   79.23   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1705 on 8524147 degrees of freedom
Multiple R-squared:  0.4424,    Adjusted R-squared:  0.4424 
F-statistic: 1.691e+06 on 4 and 8524147 DF,  p-value: < 2.2e-16

The t-stats for evapotranspiration (ET), precipitation (PRCP), delta_PRCP, and delta_ET are the same, and the p-values are all extremely small, almost < 2.2e-16. Is this possible?

Juvin

  • The t-statistics are not the same, but they are similar. Yes, this is possible. You have a very large sample, so we'd expect the p-values to be tiny. – Jeremy Miles Mar 04 '15 at 23:34
  • Simply having a large sample doesn't imply that p-values will be "tiny." – StatsStudent Mar 04 '15 at 23:40
  • 2
    I didn't say it was guaranteed, I said it was expected. In any real dataset with a sample of over 8 million, it's almost guaranteed that the p-values will be tiny. I'd be interested in a counterexample where the sample is that large and they p-values are not tiny. – Jeremy Miles Mar 04 '15 at 23:50
  • 4
    @StatsStudent in practice point nulls are almost never perfectly true, so in large enough samples you would expect even trivial effects to be giving extremely small p-values. – Glen_b Mar 05 '15 at 00:16
  • 2
    Juvin -- The reason the p-values are tiny is due to the very large sample size, as @Jeremy pointed out - even tiny effects will be many standard errors from 0. The reason the p-values are all shown as the same value is discussed in [this answer](http://stats.stackexchange.com/a/78840/805). In essence, it's effectively the smallest value it's numerically meaningful to give as a calculated p-value, so if the p-value goes below that, you really should just show the inequality. The comments there about whether such tiny values are statistically meaningful would be apply here as well. – Glen_b Mar 05 '15 at 00:26
  • Excellent points made so far, but there are major complications you didn't ask about. It's hardly possible to have 8 million data points without some redundancy in time or space, and your regression calculations are not in any sense adjusting for that. Here's another point you didn't ask about, but it's crucial: a hyperplane fitted to these variables predicts negative soil moisture at the origin, which is unphysical, even as a limiting case beyond the range of the data. So, you have a physically implausible model with p-values too small to be computable, or so it seems. – Nick Cox Mar 05 '15 at 02:04
  • Nick Cox actually hints at the point I was making to some extent. Duplication of values and accuracy of measurements (and rounding) can result in real-life datasets such that, when analyzed, the p-values aren't "almost guaranteed" to be tiny. A counterexample as requested is provided here in R; the fitted summary gives: Estimate, Std. Error, t value, Pr(>|t|): (Intercept) 0.3360, 0.2861, 1.174, 0.240; x 0.1687, 0.1435, 1.176, 0.239 – StatsStudent Mar 05 '15 at 03:02
  • 1
    Here's another. Your model has a **positive** coefficient on evapotranspiration (ET) and another on change in ET. But the effect of evapotranspiration is a **decrease** in soil moisture. These positive coefficients may be an artefact of time and space resolution of your data, or of ET working as a kind of proxy for nonlinear relationship(s) with precipitation, but on the face of it this is physically absurd. Check for correlations between your variables. (If soil moisture is a deficit, you still have the same problem, but with the signs of the precipitation coefficients.) – Nick Cox Mar 05 '15 at 12:06

2 Answers


What you are getting here is actually not a coincidence in your data - it is just a numerical limit of statistical computing. R stores numbers as double-precision (IEEE 754) floating-point values, which carry 52 binary digits of fractional precision. The value you are seeing, 2.2e-16, is the machine epsilon: the spacing between 1 and the next larger representable double, equal to 2^(-52). You can check this in R by running the following code:

> .Machine$double.ulp.digits
[1] -52

> .Machine$double.eps
[1] 2.220446e-16

> 2^(-52)
[1] 2.220446e-16

> identical(2^(-52), .Machine$double.eps)
[1] TRUE

So, all that your regression output is telling you is that every p-value falls below the machine epsilon, which is why R prints the same inequality, < 2.2e-16, for each of them. This does not imply any amazing coincidence among your p-values; they are simply all below the same reporting threshold. If you would like more precision in the reported p-values, you will need to compute them with higher-precision arithmetic (see e.g., here).
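This constant is not specific to R: any environment using IEEE 754 doubles has the same epsilon. A quick sketch in Python (used here only for illustration; the arithmetic is identical):

```python
import sys

# Machine epsilon for IEEE 754 doubles: 2^-52, the gap between
# 1.0 and the next larger representable number.
eps = 2.0 ** -52
print(eps)                             # 2.220446049250313e-16
print(eps == sys.float_info.epsilon)   # True

# Anything smaller than eps vanishes when added to 1.0, which is
# why p-values at this scale are reported only as an inequality.
print(1.0 + eps / 2 == 1.0)            # True
```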

Ben

The t-statistics are similar, but not the same. The p-values look identical only because they are all so small that they fall outside the range of numerical accuracy - too small to be computed reliably, so R reports each one as < 2.2e-16. They simply indicate strong significance, assuming a properly constructed model that meets the model assumptions.
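To see why huge samples make tiny p-values the norm, here is a toy calculation (hypothetical effect size and noise levels, in Python for illustration, with a normal approximation in place of the exact t distribution - indistinguishable at millions of degrees of freedom):

```python
import math

def two_sided_p(t):
    # Two-sided p-value from the standard normal approximation.
    return math.erfc(abs(t) / math.sqrt(2.0))

# Hypothetical slope of 0.05, unit noise, unit predictor spread.
beta, sigma, sx = 0.05, 1.0, 1.0

for n in (100, 10_000, 8_524_152):
    se = sigma / (sx * math.sqrt(n))   # standard error shrinks like 1/sqrt(n)
    t = beta / se
    print(n, round(t, 1), two_sided_p(t))
```

At n = 100 the same slope gives t = 0.5 and p ≈ 0.62; at n = 8,524,152 it gives t ≈ 146, and the p-value underflows to zero outright - far below anything double precision can represent.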

StatsStudent