
Let's say I have data that has some uncertainty. For example:

X  Y
1  10±4
2  50±3
3  80±7
4  105±1
5  120±9

The uncertainty could come from repeated measurements or experiments, or from measuring-instrument uncertainty, for example.

I would like to fit a curve to it using R, something I would normally do with lm. However, lm does not take the uncertainty in the data into account when it gives me the uncertainty in the fit coefficients, and consequently in the prediction intervals. Looking at the documentation, the lm help page has this:

...weights can be used to indicate that different observations have different variances...

So it makes me think that maybe this has something to do with it. I know the theory of doing it manually, but I was wondering whether it's possible to do it with the lm function. If not, is there any other function (or package) capable of doing this?

EDIT

Seeing some of the comments, here is some clarification. Take this example:

x <- 1:10
y <- c(131.4,227.1,245,331.2,386.9,464.9,476.3,512.2,510.8,532.9)
mod <- lm(y ~ x + I(x^2))
summary(mod)

Gives me:

Residuals:
    Min      1Q  Median      3Q     Max 
-32.536  -8.022   0.087   7.666  26.358 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  39.8050    22.3210   1.783  0.11773    
x            92.0311     9.3222   9.872 2.33e-05 ***
I(x^2)       -4.2625     0.8259  -5.161  0.00131 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 18.98 on 7 degrees of freedom
Multiple R-squared:  0.986, Adjusted R-squared:  0.982 
F-statistic: 246.7 on 2 and 7 DF,  p-value: 3.237e-07

So basically, my coefficients are a=39.8±22.3, b=92.0±9.3, c=-4.3±0.8. Now let's say that for each data point the error is 20. I will use weights = rep(20,10) in the lm call, and I get this instead:

Residual standard error: 84.87 on 7 degrees of freedom

but the standard errors on the coefficients do not change.

Manually, I know how to do it by calculating the covariance matrix using matrix algebra, putting the weights/errors in there, and deriving the confidence intervals from that. So is there a way of doing it in the lm function itself, or in any other function?
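For reference, here is a sketch of the manual calculation I mean (with sigma_y holding the per-point errors, e.g. rep(20, 10) in the example above):

X <- cbind(1, x, x^2)        # design matrix for the quadratic model
W <- diag(1 / sigma_y^2)     # weight matrix: inverse variances

beta <- solve(t(X) %*% W %*% X, t(X) %*% W %*% y)  # coefficient estimates
covb <- solve(t(X) %*% W %*% X)   # covariance matrix from the known errors only
sqrt(diag(covb))                  # standard errors of the coefficients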

Gimelist
  • If you know the distribution of the data you can bootstrap it using the `boot` package in R. Afterwards you can run a linear regression over the bootstrapped data set. – Ferdi Sep 19 '16 at 08:33
  • `lm` will use the normalized variances as weights and then assume that your model is statistically valid to estimate parameter uncertainties. If you think that this is not the case (error bars too small or too large), you should not trust any uncertainty estimate. – Pascal Sep 19 '16 at 08:49
  • See also this question here: http://stats.stackexchange.com/questions/113987/lm-weights-and-the-standard-error – jwimberley Sep 19 '16 at 11:58

1 Answer


This type of model is actually much more common in certain branches of science (e.g. physics) and engineering than "normal" linear regression. So, in physics tools like ROOT, doing this type of fit is trivial, while linear regression is not natively implemented! Physicists tend to call this just a "fit" or a chi-square-minimizing fit.

The normal linear regression model assumes that there is an overall variance $\sigma^2$ attached to every measurement. It then maximizes the likelihood $$ L \propto \prod_i e^{-\frac{1}{2} \left( \frac{y_i-(ax_i+b)}{\sigma} \right)^2} $$ or equivalently its logarithm $$ \log(L) = \mathrm{constant} - \frac{1}{2\sigma^2} \sum_i (y_i-(ax_i+b))^2 $$ Hence the name least squares: maximizing the likelihood is the same as minimizing the sum of squares, and $\sigma$ is an unimportant constant, as long as it is constant. With measurements that have different known uncertainties, you want to maximize $$ L \propto \prod_i e^{-\frac{1}{2} \left( \frac{y_i-(ax_i+b)}{\sigma_i} \right)^2} $$ or equivalently its logarithm $$ \log(L) = \mathrm{constant} - \frac{1}{2} \sum_i \left( \frac{y_i-(ax_i+b)}{\sigma_i} \right)^2 $$ So you actually want to weight the measurements by the inverse variance $1/\sigma_i^2$, not the variance. This makes sense: a more accurate measurement has a smaller uncertainty and should be given more weight. Note that if this weight is constant, it still factors out of the sum, so it doesn't affect the estimated values, but it should affect the standard errors, which are taken from the second derivative of $\log(L)$.
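To make the weighting concrete in R, here is a minimal sketch (the per-point uncertainties sigma_y are made up; I take them to be sqrt(y), matching the ROOT example below):

x <- 1:10
y <- c(131.4, 227.1, 245, 331.2, 386.9, 464.9, 476.3, 512.2, 510.8, 532.9)
sigma_y <- sqrt(y)   # assumed measurement uncertainties on y

# weights are the inverse variances, 1/sigma_i^2
mod <- lm(y ~ x + I(x^2), weights = 1 / sigma_y^2)

As explained next, however, the standard errors that lm reports for this fit still need a correction.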

However, here we come to another difference between physics/science and statistics at large. Typically in statistics, you expect that a correlation might exist between two variables, but it will rarely be exact. In physics and other sciences, on the other hand, you often expect a correlation or relationship to be exact, if only it weren't for pesky measurement errors (e.g. $F=ma$, not $F=ma+\epsilon$). Your problem seems to fall into the physics/engineering case. Consequently, lm's interpretation of the uncertainties attached to your measurements, and of the weights, isn't quite what you want. It will take the weights, but it still assumes there is an overall $\sigma^2$ accounting for regression error, which is not what you want; you want your measurement errors to be the only kind of error there is. (The end result of lm's interpretation is that only the relative values of the weights matter, which is why the constant weights you added as a test had no effect.) The question and answer here have more details:

lm weights and the standard error
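As a quick check of the relative-weights point above, two fits whose weights differ only by a constant factor report identical coefficient standard errors (a sketch using the question's data):

m1 <- lm(y ~ x + I(x^2))                         # unweighted
m2 <- lm(y ~ x + I(x^2), weights = rep(20, 10))  # constant weights

# identical, as the questioner observed:
summary(m1)$coefficients[, "Std. Error"]
summary(m2)$coefficients[, "Std. Error"]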

There are a couple of possible solutions given in the answers there. In particular, an anonymous answer there suggests using

vcov(mod)/summary(mod)$sigma^2

Basically, lm scales the covariance matrix based on its estimated $\sigma$, and you want to undo this. You can then get the information you want from the corrected covariance matrix. Try this, but double-check it if you can with manual linear algebra. And remember that the weights should be the inverse variances.
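Putting this together, a minimal sketch (continuing with the assumed sqrt(y) uncertainties from above):

# undo lm's rescaling of the covariance matrix by its estimated sigma^2
V <- vcov(mod) / summary(mod)$sigma^2

coef(mod)       # point estimates (unaffected by the rescaling)
sqrt(diag(V))   # standard errors based only on the stated measurement errors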

EDIT

If you're doing this sort of thing a lot, you might consider using ROOT (which seems to do this natively, while lm and glm do not). Here's a brief example of how to do it in ROOT. First off, ROOT can be used via C++ or Python, and it's a huge download and installation. You can try it in the browser using a Jupyter notebook: follow the link here, choose "Binder" on the right and "Python" on the left.

import ROOT
from array import array
import math

x = list(range(1, 11))
xerrs = [0] * 10                   # no uncertainty on the x values
y = [131.4, 227.1, 245, 331.2, 386.9, 464.9, 476.3, 512.2, 510.8, 532.9]
yerrs = [math.sqrt(i) for i in y]  # made-up uncertainties on the y values

# TGraphErrors carries the per-point uncertainties into the chi-square fit
graph = ROOT.TGraphErrors(len(x), array('d', x), array('d', y),
                          array('d', xerrs), array('d', yerrs))
graph.Fit("pol2", "S")  # fit a quadratic polynomial; "S" returns the fit result
c = ROOT.TCanvas("test", "test", 800, 600)
graph.Draw("AP")        # draw axes and points with error bars
c.Draw()

I've put in square roots as the uncertainties on the $y$ values. The output of the fit is

Welcome to JupyROOT 6.07/03

****************************************
Minimizer is Linear
Chi2                      =       8.2817
NDf                       =            7
p0                        =      46.6629   +/-   16.0838     
p1                        =       88.194   +/-   8.09565     
p2                        =     -3.91398   +/-   0.78028    

and a nice plot is produced:

[plot: quadratic fit to the data points with error bars]

The ROOT fitter can also handle uncertainties in the $x$ values, which would probably require even more hacking of lm. If anyone knows a native way to do this in R, I'd be interested to learn it.

SECOND EDIT

The other answer to the same question linked above, by @Wolfgang, gives an even better solution: the rma function from the metafor package (I originally interpreted text in that answer to mean that it did not calculate the intercept, but that's not the case). Taking the variances of the measurements y to be simply y:

> rma(y~x+I(x^2),y,method="FE")

Fixed-Effects with Moderators Model (k = 10)

Test for Residual Heterogeneity: 
QE(df = 7) = 8.2817, p-val = 0.3084

Test of Moderators (coefficient(s) 2,3): 
QM(df = 2) = 659.4641, p-val < .0001

Model Results:

         estimate       se     zval    pval    ci.lb     ci.ub     
intrcpt   46.6629  16.0838   2.9012  0.0037  15.1393   78.1866   **
x         88.1940   8.0956  10.8940  <.0001  72.3268  104.0612  ***
I(x^2)    -3.9140   0.7803  -5.0161  <.0001  -5.4433   -2.3847  ***

---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

This is definitely the best pure R tool for this type of regression that I've found.
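For completeness, a self-contained version of the call above (a sketch; it uses the same made-up variances, vi = y, as the rest of this answer):

# install.packages("metafor")   # if not already installed
library(metafor)

x <- 1:10
y <- c(131.4, 227.1, 245, 331.2, 386.9, 464.9, 476.3, 512.2, 510.8, 532.9)

# fixed-effects model with known sampling variances vi = y
rma(y ~ x + I(x^2), vi = y, method = "FE")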

jwimberley
  • I think it is basically wrong to undo the scaling by `lm`. If you do this, validation statistics, such as the chi-squared, will be off. If the dispersion of your residuals does not match your error bars, something is wrong in the statistical model (either the model choice, or the error bars, or the normality hypothesis...). In either case, the parameter uncertainties will be unreliable!!! – Pascal Sep 19 '16 at 15:41
  • @PascalPERNOT I haven't thought about this; I'll think about your comments. To be honest, I agree in a general sense that the best solution is to use physics or engineering software guaranteed to solve this problem correctly, rather than hack `lm` to get the correct output. (If anyone is curious, I'll show how to do this in `ROOT`.) – jwimberley Sep 19 '16 at 15:46
  • @jwimberley It took me some time to figure this out, but you should not make predictions (that's generally what parameter uncertainties are used for) with a statistically invalid model. If the standard deviation of your residuals is smaller than your stated error bars (chi2 < 1), the error bars are overestimated, or random errors have been combined with systematic errors (a very common scenario); at the other end, if your chi2 is too large, your error bars are underestimated or your model is wrong. In both scenarios, the parameter uncertainty you would estimate is probably meaningless... – Pascal Sep 19 '16 at 15:58
  • @PascalPERNOT OK, I agree with you there. But what if the chi2 is neither too small nor too large? With good data and a good model this is frequently the case. For instance, a quadratic fits the data provided by the questioner nearly perfectly: in my `ROOT` example, just updated to show a quadratic fit, the quadratic fit has a $\chi^2$ of 8.2817 on 7 degrees of freedom. Granted, I made up the errors here, but this isn't an atypical occurrence in practice. – jwimberley Sep 19 '16 at 16:45
  • @PascalPERNOT Re: your first comment -- are you basically claiming that the $\sigma$ produced by `lm` is related to the dispersion of the residuals? I agree, but I believe that it's *unrelated* to the dispersion of the normalized residuals, $(y_i-(a x_i + b))/\sigma_i$. Since the latter is what matters in this problem, the $\sigma$ produced by `lm` is still double-counting uncertainties. – jwimberley Sep 19 '16 at 17:39
  • 1
    One potential advantage of the statistician's approach to the problem is that it allows pooling of variance estimates among observations at different levels. If the underlying variance is constant or has some defined relation to measurements as in Poisson processes, then the analysis will typically be improved versus what you get from the (typically unrealistic) assumption that the measured variance for each data point is correct and thus unfairly weighting some data points. In the data of the OP, I'd guess that the constant-variance assumption might be better. – EdM Sep 19 '16 at 17:58
  • @EdM I agree with you to a certain extent, in that physicists/scientists can over-rely on the procedure I outlined, and that it requires careful estimates of systematic uncertainties, etc., which brings pitfalls; but I do think the better approach varies on a case-by-case basis. I'm familiar with cases where, after accounting for systematic uncertainties, there are still 2x-5x or greater relative differences in estimated uncertainties, and the extra weighting of some points is very much fair. – jwimberley Sep 19 '16 at 18:25
  • Interesting discussion here. I wonder if there's an R package that allows doing this instead of using ROOT. – Gimelist Sep 19 '16 at 23:08
  • @Michael A more common name for your problem is regression with errors-in-variables. When there are only errors in y, I haven't found anything better than the lm hack shown in the answer -- this gives the same results as `ROOT` (so either they're both right or they're both wrong; I'm confident it's the former). When there are uncertainties in both the x and y measurements, there is apparently something called Deming regression, implemented in the package `deming`. It seems to be much more robust than what `ROOT` does with x errors (a hack), but it sadly only supports linear regressions. – jwimberley Sep 20 '16 at 01:59
  • 1
    @jwimberley I assume that $\sigma$ ensures that the Weighted residuals standard error is rescaled to 1, before computing the parameters covariance matrix. You can check this my multiplying your weights by $\sigma^2$ and see how the "Residuals standard error" output is affected. In your example, it changes from 1.088 to 1. If your setup is statistically valid, the scaling has only a minor effect on the parameters uncertainties... – Pascal Sep 20 '16 at 12:43
  • @Michael Actually, the `rma` tool is perfect for what you need, as shown in the new edit. However, do pay attention to the comments from Pascal and EdM: you must be careful that your uncertainties truly reflect the total uncertainty in your measurements, including systematic effects; otherwise they could lead to biased results. – jwimberley Sep 20 '16 at 14:35
  • 1
    There is a good discussion of these issues in Chapter 8 of Andreon, S. and Weaver, B. (2015) Bayesian methods for the physical sciences. Springer. http://www.springer.com/us/book/9783319152868 – Tony Ladson Sep 20 '16 at 22:51
  • **This answer appears to be based on a fundamental mistake:** it confounds the error variance in the regression, which for observation $i$ should be written as $\sigma^2+\sigma_i^2$, with the *measurement error* in the response variable $y_i$. The extra term $\sigma^2$ has the same meaning as the original $\sigma^2$ at the beginning. Neglecting it is tantamount to supposing virtually all the variation in $y$ is due to (a) variation in $x$ and (b) measurement error, but nothing more. – whuber Jul 31 '20 at 15:56
  • @whuber "Neglecting it is tantamount to supposing virtually all the variation in $y$ is due to (a) variation in $x$ and (b) measurement error, but nothing more." Completely agree, and that is exactly the point. In a scientific or engineering setting, the use of a regression like this is to use the data (which includes measurement *and* systematic errors) to test the null hypothesis that y is a linear function of x. A poor chi^2 can lead to a rejection of that null hypothesis. Keeping an extra regression error term floating around can deflate the chi-square and make for an incorrectly weak test – jwimberley Jul 31 '20 at 16:16
  • For a not-unrealistic example, imagine that I wanted to see evidence for a sinusoidal signal in some physical observable (e.g. the DAMA experiment, or other experiments looking for seasonal changes correlated with Earth's orbit). Imagine the data looked like x – jwimberley Jul 31 '20 at 16:31