
I have a lot of $x,y$ data. I was considering using linear regression to fit the equation $y=mx+c$, but I want to find a value for $m$ that makes $c$ as near as possible to zero.

Can I therefore use the equation $y=mx$ and merely divide the sum of all $y$ by the sum of all $x$ to obtain $m$?

Would it be appropriate to square the data before summing, and then take the square root, so that the error is least-squares? This would, however, mean that $m$ is inevitably positive, which may be wrong.

Edit: $c$ is actually an error term which I would like to be zero. When I have new data for $x$ and want to predict $y$, would it be better to use $m$ from fitting $y = mx$, or to use $m$ from fitting $y = mx + c$ and pretend that $c$ is zero?

HumbleOrange
  • You may want to see https://stats.stackexchange.com/questions/159691/regression-without-intercept-deriving-hat-beta-1-in-least-squares-no-matr. There is no need to square the data; you can apply least-squares optimization to the $y = mx + \epsilon$ equation as it is. – David Jun 27 '19 at 09:45

2 Answers


If you want $c$ to be exactly 0, just fit a linear regression without an intercept.

Dividing $y$ by $x$ would be a bad idea unless you assume the error is proportional to $x$. There is no reason why you should take the sums, and there is no need to square the data.
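
For intuition, here is a minimal R sketch with simulated data (the true slope, noise, and sample size are arbitrary choices) contrasting the ratio-of-sums estimator from the question with the least-squares slope for $y = mx$:

```r
set.seed(42)
x <- runif(50, 1, 10)
y <- 2 * x + rnorm(50)     # simulated: true slope 2, intercept 0

# Ratio of sums, as proposed in the question
sum(y) / sum(x)

# Least-squares slope for y = mx: minimizes sum((y - m*x)^2),
# which gives m = sum(x*y) / sum(x^2)
sum(x * y) / sum(x^2)
```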

If you only need a soft constraint, i.e. $c$ near zero, you could fit a Bayesian linear regression with a prior on $c$ centred on 0 and arbitrarily sharp.
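
A minimal sketch of that idea with the brms package (also suggested in the comments below). The prior standard deviation 0.1 is an arbitrary illustration of "sharp", and `d` is assumed to be a data frame with columns `x` and `y`:

```r
library(brms)

# Bayesian linear regression with a prior pulling the intercept c
# towards 0; sd = 0.1 is arbitrary, tighten it for a harder constraint
fit <- brm(y ~ x, data = d,
           prior = prior(normal(0, 0.1), class = "Intercept"))
summary(fit)
```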

Guillem
  • The wikipedia article on simple linear regression has an entire section for this $y = \beta x + \epsilon$ case: [Simple linear regression without the intercept term (single regressor)](https://en.wikipedia.org/wiki/Simple_linear_regression#Simple_linear_regression_without_the_intercept_term_(single_regressor)). Instead of estimating $\beta$ by the ratio of the means, $\hat\beta = \frac{\bar{y}}{\bar{x}} = \frac{\sum y_i}{\sum x_i}$, you will use the weighted ratio $\hat\beta = \frac{\sum x_i y_i}{\sum x_i^2}$. – Sextus Empiricus Jun 27 '19 at 09:07
  • But indeed, much more important is describing the situation well. What can we assume about the errors? (I think it is wrong to call dividing $y$ by $x$ a bad idea, or at least it is expressed very strongly and lacks the nuance that follows in your answer.) Is $c$ fixed to zero, or does it follow a distribution centered around zero? Etc. – Sextus Empiricus Jun 27 '19 at 09:08
  • I expect $c$ to be centered around zero. Would this make any difference to the calculation of $m$? – HumbleOrange Jun 27 '19 at 12:37
  • If you assume $c = 0$ but it's not, the model is misspecified and the estimation of $m$ will be biased. If your assumption is just that $c$ is near 0, you could just try a regression with an intercept and check the intercept estimate post-hoc. As I have mentioned a better solution would be a Bayesian linear regression which would allow you to explicitly encode the belief that $c$ should be near 0. You could do this with the brms or rstanarm packages in R for instance. – Guillem Jun 27 '19 at 12:57
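
A minimal sketch of the post-hoc check Guillem describes, again assuming a data frame `d` with columns `x` and `y`: fit with an intercept and see whether the estimate is consistent with zero.

```r
fit <- lm(y ~ x, data = d)
coef(fit)["(Intercept)"]        # point estimate of c
confint(fit)["(Intercept)", ]   # does the 95% interval cover 0?
```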

As Guillem said: fit a linear regression without an intercept. This minimizes the squared error while satisfying your requirement that $c = 0$.

In R, you can do `lm(y ~ x - 1)`.

For details, search for "linear regression without intercept". A small self-contained example follows below.
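
Here is that sketch (data simulated for illustration); it also shows prediction for new $x$, as asked in the question's edit:

```r
set.seed(1)
d <- data.frame(x = 1:20)
d$y <- 3 * d$x + rnorm(20)        # simulated: true slope 3, no intercept

fit <- lm(y ~ x - 1, data = d)    # "- 1" drops the intercept
coef(fit)                                    # estimated slope m
predict(fit, newdata = data.frame(x = 25))   # predicted y for a new x
```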

Haitao Du