Questions tagged [least-squares]

Overview

Refers to a general estimation technique that selects parameter values to minimize the sum of squared differences between two quantities, typically the observed values of a variable and the expected values of those observations given the parameter values. Gaussian linear models are fit by least squares, and least squares is the idea underlying the use of mean squared error (MSE) as a way of evaluating an estimator.

Formulation

Given a set of data $(x_1,y_1),...,(x_n,y_n)$, where $x_i \in \mathbb{R}^{p}$, and a coefficient vector $\beta \in \mathbb{R}^{p}$, the least squares estimate is the solution to the minimization problem:

$$\widehat{\beta}_{LS} = \underset{\beta}{\operatorname{arg\,min}} \sum\limits_{i=1}^{n}\Big(y_i - \sum\limits_{j=1}^{p}x_{i,j}\beta_{j}\Big)^2 = \underset{\beta}{\operatorname{arg\,min}} \; \|{\bf y - X\beta}\|^2$$

Provided ${\bf X}$ has full column rank, so that ${\bf X^TX}$ is invertible, linear algebra gives the closed-form solution:

$$ {\bf \widehat{\beta} = (X^TX)^{-1}X^{T}y} $$
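For concreteness, here is a minimal NumPy sketch of this closed form; the simulated data and variable names are illustrative only. `np.linalg.lstsq` is the more numerically robust route, since it never forms ${\bf X^TX}$ explicitly:

```python
import numpy as np

# Illustrative sketch: simulate n = 100 observations with p = 3 predictors.
rng = np.random.default_rng(0)
n, p = 100, 3
X = rng.normal(size=(n, p))
beta_true = np.array([1.5, -2.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.1, size=n)

# Normal equations: solve (X'X) beta = X'y rather than inverting X'X.
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# The same estimate via a dedicated least squares solver.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta_normal)  # both close to [1.5, -2.0, 0.5]
print(beta_lstsq)
```

Solving the linear system is preferred over computing $({\bf X^TX})^{-1}$ explicitly; the explicit inverse is both slower and less accurate.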

References

Least squares methods are treated in many introductory statistics resources and textbooks, and there are also advanced resources dedicated entirely to the subject.

2460 questions
98 votes · 5 answers

Mean absolute error OR root mean squared error?

Why use Root Mean Squared Error (RMSE) instead of Mean Absolute Error (MAE)? Hi, I've been investigating the error generated in a calculation - I initially calculated the error as a Root Mean Normalised Squared Error. Looking a little closer, I…
user1665220
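For concreteness, a minimal sketch of the two error measures this question contrasts, with made-up observed and predicted values (the variable names are illustrative):

```python
import numpy as np

# Made-up observed and predicted values; any equal-length pair works.
y_obs = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

err = y_obs - y_pred
rmse = np.sqrt(np.mean(err ** 2))  # weights large errors more heavily
mae = np.mean(np.abs(err))         # treats all error magnitudes linearly

print(f"RMSE = {rmse:.3f}, MAE = {mae:.3f}")  # RMSE >= MAE always
```

RMSE equals MAE only when every absolute error is identical, which is why the two can rank models differently.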
93 votes · 2 answers

When to use regularization methods for regression?

In what circumstances should one consider using regularization methods (ridge, lasso or least angle regression) instead of OLS? In case this helps steer the discussion, my main interest is improving predictive accuracy.
NPE
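As a hedged illustration of the trade-off this question is about: ridge regression shrinks coefficients toward zero, which can reduce prediction variance when predictors are strongly correlated. The data below are simulated and the penalty is arbitrary rather than tuned:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 10
# Strongly correlated predictors: a setting where shrinkage often helps.
z = rng.normal(size=(n, 1))
X = z + 0.1 * rng.normal(size=(n, p))
y = X[:, 0] + rng.normal(scale=0.5, size=n)

lam = 1.0  # illustrative penalty; in practice chosen by cross-validation
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Ridge trades a little bias for less variance: its coefficients are shrunk.
print(np.linalg.norm(beta_ols), np.linalg.norm(beta_ridge))
```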
66 votes · 3 answers

Maximum likelihood method vs. least squares method

What is the main difference between maximum likelihood estimation (MLE) and least squares estimation (LSE)? Why can't we use MLE for predicting $y$ values in linear regression and vice versa? Any help on this topic will be greatly appreciated.
evros
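A standard step connecting the two, sketched here: under the Gaussian linear model $y_i = x_i^{T}\beta + \varepsilon_i$ with $\varepsilon_i \sim N(0,\sigma^2)$ i.i.d., the log-likelihood is

$$\ell(\beta,\sigma^2) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum\limits_{i=1}^{n}\left(y_i - x_i^{T}\beta\right)^2,$$

so for any fixed $\sigma^2$, maximizing over $\beta$ means minimizing the sum of squared residuals: under normal errors the MLE of $\beta$ coincides with the LSE.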
66 votes · 3 answers

Why does ridge estimate become better than OLS by adding a constant to the diagonal?

I understand that the ridge regression estimate is the $\beta$ that minimizes residual sum of square and a penalty on the size of $\beta$ $$\beta_\mathrm{ridge} = (\lambda I_D + X'X)^{-1}X'y = \operatorname{argmin}\big[ \text{RSS} + \lambda…
Heisenberg
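One standard way to see the effect of the added diagonal, sketched briefly: if $X'X$ has eigenvalues $d_1 \ge \dots \ge d_p \ge 0$, then $\lambda I + X'X$ has eigenvalues $d_j + \lambda$, so for $\lambda > 0$ the matrix being inverted is well conditioned even when $X'X$ is singular or nearly so:

$$\operatorname{cond}(\lambda I + X'X) = \frac{d_1 + \lambda}{d_p + \lambda} \le \frac{d_1}{d_p} = \operatorname{cond}(X'X).$$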
56 votes · 5 answers

How to derive the ridge regression solution?

I am having some issues with the derivation of the solution for ridge regression. I know the regression solution without the regularization term: $$\beta = (X^TX)^{-1}X^Ty.$$ But after adding the L2 term $\lambda\|\beta\|_2^2$ to the cost function,…
user34790
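The derivation this question asks for is short; a sketch of the standard calculus argument: set the gradient of the penalized objective to zero,

$$\frac{\partial}{\partial\beta}\left(\|y - X\beta\|_2^2 + \lambda\|\beta\|_2^2\right) = -2X^{T}(y - X\beta) + 2\lambda\beta = 0,$$

which rearranges to $(X^{T}X + \lambda I)\beta = X^{T}y$, giving $\widehat{\beta}_\mathrm{ridge} = (X^{T}X + \lambda I)^{-1}X^{T}y$. For $\lambda > 0$ the matrix $X^{T}X + \lambda I$ is positive definite, hence always invertible.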
56 votes · 5 answers

Regression when the OLS residuals are not normally distributed

There are several threads on this site discussing how to determine if the OLS residuals are asymptotically normally distributed. Another way to evaluate the normality of the residuals with R code is provided in this excellent answer. This is another…
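The answer linked in the excerpt uses R; as an independent, minimal Python sketch of two common normality checks on residuals (the residuals here are simulated stand-ins):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
residuals = rng.normal(size=200)  # stand-in for residuals from a fitted OLS model

# Shapiro-Wilk: small p-values indicate departure from normality.
w_stat, p_value = stats.shapiro(residuals)
print(f"Shapiro-Wilk W = {w_stat:.3f}, p = {p_value:.3f}")

# Normal Q-Q coordinates; plot osm vs. osr to inspect the tails visually.
(osm, osr), (slope, intercept, r) = stats.probplot(residuals, dist="norm")
```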
55 votes · 3 answers

Where does the misconception that Y must be normally distributed come from?

Seemingly reputable sources claim that the dependent variable must be normally distributed: Model assumptions: $Y$ is normally distributed, errors are normally distributed, $e_i \sim N(0,\sigma^2)$, and independent, and $X$ is fixed, and …
colorlace
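A small simulation makes the distinction concrete (everything below is made up for illustration): the errors are exactly normal, so every model assumption holds, yet the marginal distribution of $Y$ is bimodal because $X$ is:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
# X falls in two clusters, e.g. a predictor measured in two distinct groups.
x = np.concatenate([rng.normal(-5, 1, 500), rng.normal(5, 1, 500)])
# Errors are exactly N(0, 1), just as the model assumes.
y = 2.0 * x + rng.normal(0.0, 1.0, size=1000)

# Marginally, Y is bimodal and decisively non-normal...
print(stats.shapiro(y).pvalue)            # tiny p-value
# ...while the errors themselves look like genuinely normal data.
print(stats.shapiro(y - 2.0 * x).pvalue)  # typically a large p-value
```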
55 votes · 4 answers

Why sigmoid function instead of anything else?

Why is the de-facto standard sigmoid function, $\frac{1}{1+e^{-x}}$, so popular in (non-deep) neural networks and logistic regression? Why don't we use many of the other differentiable functions, with faster computation time or slower decay (so…
Mark Horvath
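One concrete property often cited in answers, sketched here rather than asserted as the whole story: the sigmoid's derivative can be computed from its own output, $\sigma'(x) = \sigma(x)\,(1-\sigma(x))$, which makes gradients cheap:

```python
import numpy as np

def sigmoid(x):
    """Logistic sigmoid 1 / (1 + e^{-x})."""
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-5, 5, 11)
s = sigmoid(x)

# The derivative reuses the forward value: no extra exp() evaluation needed.
analytic_grad = s * (1.0 - s)

# Check against a central-difference numerical derivative.
h = 1e-6
numeric_grad = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)
print(np.max(np.abs(analytic_grad - numeric_grad)))  # very small
```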
50 votes · 6 answers

What algorithm is used in linear regression?

I usually hear about "ordinary least squares". Is that the most widely used algorithm for linear regression? Are there reasons to use a different one?
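By way of illustration: "ordinary least squares" names the estimator, and several algorithms can compute it. A minimal sketch (simulated data) of the numerically stable QR route next to the naive normal-equations route:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, 2.0, 3.0]) + rng.normal(scale=0.1, size=100)

# QR route: factor X = QR, then solve the triangular system R beta = Q'y.
Q, R = np.linalg.qr(X)
beta_qr = np.linalg.solve(R, Q.T @ y)

# Normal-equations route, for comparison (less stable when X is ill conditioned).
beta_ne = np.linalg.solve(X.T @ X, X.T @ y)
print(np.allclose(beta_qr, beta_ne))  # True on well conditioned data
```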
50 votes · 5 answers

Is minimizing squared error equivalent to minimizing absolute error? Why is squared error more popular?

When we conduct linear regression $y=ax+b$ to fit a bunch of data points $(x_1,y_1),(x_2,y_2),...,(x_n,y_n)$, the classic approach minimizes the squared error. I have long been puzzled by whether minimizing the squared error will yield the…
Tony
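A quick numerical illustration with made-up numbers: for a single location parameter, squared error is minimized by the mean and absolute error by the median, so the two criteria genuinely disagree in the presence of outliers:

```python
import numpy as np

data = np.array([1.0, 2.0, 3.0, 4.0, 100.0])  # one large outlier

# Evaluate both losses over a grid of candidate constants c.
grid = np.linspace(0.0, 110.0, 11001)
sq_loss = ((data[None, :] - grid[:, None]) ** 2).sum(axis=1)
abs_loss = np.abs(data[None, :] - grid[:, None]).sum(axis=1)

print(grid[np.argmin(sq_loss)], data.mean())       # ~22.0, the mean
print(grid[np.argmin(abs_loss)], np.median(data))  # ~3.0, the median
```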
46 votes · 6 answers

Why don't linear regression assumptions matter in machine learning?

When I learned linear regression in my statistics class, we were asked to check for a few assumptions which need to be true for linear regression to make sense. I won't delve deep into those assumptions; however, these assumptions don't appear when…
44 votes · 8 answers

Is it valid to include a baseline measure as control variable when testing the effect of an independent variable on change scores?

I am attempting to run an OLS regression. DV: change in weight over a year (initial weight - end weight). IV: whether or not you exercise. However, it seems reasonable that heavier people will lose more weight per unit of exercise than thinner…
42 votes · 1 answer

Proof that the coefficients in an OLS model follow a t-distribution with (n-k) degrees of freedom

Background: Suppose we have an Ordinary Least Squares model with $k$ coefficients in our regression model, $$\mathbf{y}=\mathbf{X}\mathbf{\beta} + \mathbf{\epsilon}$$ where $\mathbf{\beta}$ is a $(k\times1)$ vector of coefficients,…
Garrett
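The skeleton of the proof, stating the standard ingredients without full detail: under normal errors, $\widehat{\beta} \sim N\!\left(\beta, \sigma^2(X^{T}X)^{-1}\right)$ and, independently, $(n-k)s^2/\sigma^2 \sim \chi^2_{n-k}$ with $s^2 = \mathrm{RSS}/(n-k)$; a standard normal divided by the square root of an independent chi-squared over its degrees of freedom is a $t$ variable, so

$$\frac{\widehat{\beta}_j - \beta_j}{s\sqrt{\left[(X^{T}X)^{-1}\right]_{jj}}} \sim t_{n-k}.$$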
39 votes · 3 answers

Why is RSS distributed chi square times n-p?

I would like to understand why, under the OLS model, the RSS (residual sum of squares) is distributed $$\chi^2\cdot (n-p)$$ ($p$ being the number of parameters in the model, $n$ the number of observations). I apologize for asking such a basic…
Tal Galili
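For reference, the usual statement is slightly different from the title's wording: with normal errors, $\mathrm{RSS} = \varepsilon^{T}M\varepsilon$ where $M = I - X(X^{T}X)^{-1}X^{T}$ is idempotent with rank $n-p$, so

$$\frac{\mathrm{RSS}}{\sigma^2} \sim \chi^2_{n-p}, \qquad \mathbb{E}[\mathrm{RSS}] = \sigma^2(n-p);$$

that is, $\mathrm{RSS}/\sigma^2$ follows a $\chi^2$ distribution with $n-p$ degrees of freedom, rather than a $\chi^2$ variable multiplied by $n-p$.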
38 votes · 4 answers

Why squared residuals instead of absolute residuals in OLS estimation?

Why are we using the squared residuals instead of the absolute residuals in OLS estimation? My idea was that we use the square of the error values, so that residuals below the fitted line (which are then negative) would still have to be able to be…
PascalVKooten