I have a two-dimensional dataset that comes from a prior statistical analysis. Each point $(x_n,y_n)$ in the dataset has an error estimate $\sigma_{x,n}, \sigma_{y,n}$ for both coordinates. The $x_n$ are proportions, so by definition they lie in $(0,1)$; the $y_n$ are growth rates and tend to be in the range $(-0.02,0.06)$. From both theoretical considerations and eyeballing the plot, there appears to be a linear relationship between the coordinates.

I want to find the line $y = \hat{a}x + \hat{b}$ of best fit for this dataset, with estimates of uncertainty for both $\hat{a}$ and $\hat{b}$. What are the standard (or best?) ways to do this?


What I tried so far

Currently I am using least squares weighted by $\frac{1}{\sqrt{\sigma_{x,n}^2 + \sigma_{y,n}^2}}$.

In particular, with the polyfit function from numpy, I do the following:

import numpy as np
# degree-1 fit; cov=True also returns the covariance matrix of (a, b)
[a,b], [[a_v,ab_v],[_,b_v]] = \
    np.polyfit(x_data,y_data,1,w=1/np.sqrt(y_err**2 + x_err**2),full=False,cov=True)

Here I interpret the square roots of a_v and b_v as my errors on $\hat{a}$ and $\hat{b}$. I've seen this sort of technique used before, but I understand that it can be very sensitive to outliers. Alternatively, I've considered getting error estimates through bootstrapping or jackknifing. This eliminates, to some extent, the need to get an error estimate from the fit itself, but it still forces me to choose some method of finding a line of best fit, and it is computationally more intensive (especially if I start resampling the raw data rather than the output of the prior statistical analyses).
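To make the bootstrapping option concrete, here is a minimal sketch of what I have in mind: resample the $(x_n, y_n)$ points with replacement, refit with the same weights as above, and take the spread of the refitted coefficients as the uncertainty. The function name bootstrap_line_fit and the synthetic data are assumptions for illustration only, not my real dataset, and this resamples the fitted points rather than the raw data.

```python
import numpy as np

def bootstrap_line_fit(x, y, x_err, y_err, n_boot=2000, seed=0):
    """Hypothetical helper: nonparametric bootstrap of a weighted straight-line fit.

    Returns (mean, std) of [slope, intercept] over the bootstrap resamples.
    """
    rng = np.random.default_rng(seed)
    x, y, x_err, y_err = (np.asarray(v, dtype=float) for v in (x, y, x_err, y_err))
    w = 1.0 / np.sqrt(x_err**2 + y_err**2)  # same weights as in the question
    n = len(x)
    fits = np.empty((n_boot, 2))
    for i in range(n_boot):
        idx = rng.integers(0, n, size=n)  # draw n points with replacement
        fits[i] = np.polyfit(x[idx], y[idx], 1, w=w[idx])
    return fits.mean(axis=0), fits.std(axis=0, ddof=1)

# Synthetic data (an assumption, chosen to resemble the ranges described above)
rng = np.random.default_rng(1)
x_data = rng.uniform(0.05, 0.95, size=50)
x_err = np.full(50, 0.02)
y_err = np.full(50, 0.005)
y_data = 0.08 * x_data - 0.02 + rng.normal(0.0, y_err)

(a_hat, b_hat), (a_se, b_se) = bootstrap_line_fit(x_data, y_data, x_err, y_err)
print(f"a = {a_hat:.4f} +/- {a_se:.4f}, b = {b_hat:.4f} +/- {b_se:.4f}")
```

The bootstrap standard deviations can then be compared against the square roots of a_v and b_v from polyfit; a large disagreement would suggest that the covariance from the fit is not trustworthy.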

Artem Kaznatcheev
  • These fall under the aegis of measurement error models https://en.wikipedia.org/wiki/Errors-in-variables_models ... It's been a few years since I have looked at these models; hopefully the wiki link is helpful. – ashokragavendran Jul 05 '16 at 22:15
  • See http://stats.stackexchange.com/search?q=deming. Your question is essentially a duplicate of http://stats.stackexchange.com/questions/137115, which hasn't (yet) been answered. – whuber Jul 05 '16 at 22:18
  • @whuber thanks for those links. I did not know about error-in-variables before ashokragavendran commented on this post, so this has been useful. I don't think I am asking the same thing as 137115, since I am only interested in the linear case (I am alright with ignoring that $x$ is a proportion, if that makes life significantly easier) and really want explicit discussion of error estimates on the fits. I will look into Deming regression. – Artem Kaznatcheev Jul 05 '16 at 23:26
  • After the lead that @whuber gave me, I think the answers on the following two questions also work as answers for me, and they come with code examples in [R](http://stats.stackexchange.com/q/201859/4872) and [Mathematica](http://mathematica.stackexchange.com/q/13054/40835). It is fine to close my question as a duplicate of these. If it is left open, then I will answer my own question with Python code in a couple of days.
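
For readers following the errors-in-variables pointers above, here is a rough Python sketch of one simple way to let the $x$ errors enter the fit. It is not a full Deming or York regression. It uses the "effective variance" idea: an error $\sigma_{x,n}$ in $x$ shifts the predicted $y$ by about $a\,\sigma_{x,n}$, so each point is weighted by $1/\sqrt{\sigma_{y,n}^2 + a^2\sigma_{x,n}^2}$ and the fit is repeated until the slope settles. The function name effective_variance_fit and the synthetic data are assumptions for illustration, and the iteratively reweighted scheme only approximates the true minimum of the corresponding chi-square.

```python
import numpy as np

def effective_variance_fit(x, y, x_err, y_err, n_iter=50, tol=1e-12):
    """Hypothetical helper: iteratively reweighted line fit with effective variance.

    Returns slope a, intercept b, and their standard errors (sa, sb).
    Assumes all y_err are strictly positive.
    """
    x, y, x_err, y_err = (np.asarray(v, dtype=float) for v in (x, y, x_err, y_err))
    a, b = np.polyfit(x, y, 1, w=1.0 / y_err)  # starting guess ignores x errors
    for _ in range(n_iter):
        sigma_eff = np.sqrt(y_err**2 + (a * x_err)**2)
        # cov="unscaled" treats w = 1/sigma as absolute, trusted uncertainties
        (a_new, b_new), cov = np.polyfit(x, y, 1, w=1.0 / sigma_eff, cov="unscaled")
        converged = abs(a_new - a) < tol
        a, b = a_new, b_new
        if converged:
            break
    sa, sb = np.sqrt(np.diag(cov))
    return a, b, sa, sb

# Synthetic data (an assumption): noise is added to both coordinates
rng = np.random.default_rng(2)
n = 200
x_true = rng.uniform(0.05, 0.95, size=n)
x_err = np.full(n, 0.02)
y_err = np.full(n, 0.005)
x_obs = x_true + rng.normal(0.0, x_err)
y_obs = 0.08 * x_true - 0.02 + rng.normal(0.0, y_err)

a_fit, b_fit, a_sd, b_sd = effective_variance_fit(x_obs, y_obs, x_err, y_err)
print(f"a = {a_fit:.4f} +/- {a_sd:.4f}, b = {b_fit:.4f} +/- {b_sd:.4f}")
```

Unlike the weight $1/\sqrt{\sigma_{x,n}^2 + \sigma_{y,n}^2}$, this keeps the units consistent, since the $x$ error is converted into $y$ units through the slope before the two are combined.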