Elsewhere on this site, explicit solutions to the ordinary least squares regression
$$\mathbb{E}(z_i) = A x_i + B y_i + C$$
are available in matrix form as
$$(C,A,B)^\prime = (X^\prime X)^{-1} X^\prime z\tag{1}$$
where $X$ is the "model matrix"
$$X = \pmatrix{1 & x_1 & y_1 \\ 1 & x_2 & y_2 \\ \vdots & \vdots & \vdots \\ 1 & x_n & y_n}$$
and $z$ is the response vector
$$z = (z_1, z_2, \ldots, z_n)^\prime.$$
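Formula $(1)$ is easy to evaluate directly. Here is a minimal R sketch, assuming numeric vectors `x`, `y`, and `z` of a common length are already available (hypothetical data; any trivariate sample will do):
#
# Evaluate (1) directly: build the model matrix and solve the normal equations.
# Assumes numeric vectors x, y, z of equal length.
#
X <- cbind(1, x, y)                                  # the model matrix
coef.direct <- solve(crossprod(X), crossprod(X, z))  # (X'X)^{-1} X'z = (C, A, B)'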
That's a perfectly fine, explicit, computable answer. But maybe there is some additional understanding that can be wrung out of it by inspecting the coefficients. This can be achieved by choosing appropriate units in which to express the variables.
The best units for this purpose center each variable at its mean and use its standard deviation as the unit of measurement. Explicitly, let the three means be $m_x, m_y,$ and $m_z$ and the three standard deviations be $s_x, s_y,$ and $s_z$. (It turns out not to matter whether you divide by $n$ or $n-1$ in computing the standard deviations. Just make sure you use a consistent convention when you compute any second moment of the data.) The values of the variables in these new units of measurement are
$$\xi_i = \frac{x_i - m_x}{s_x},\ \eta_i = \frac{y_i - m_y}{s_y},\ \zeta_i = \frac{z_i - m_z}{s_z}.$$
This process is known as standardizing the data. The variables $\xi$, $\eta$, and $\zeta$ are the standardized versions of the original variables $x$, $y$, and $z$.
These relationships are invertible:
$$x_i = s_x \xi_i + m_x,\ y_i = s_y \eta_i + m_y,\ z_i = s_z \zeta_i + m_z.$$
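In R, standardizing can be carried out with the `scale` function, which also records the centers and scales needed to invert the transformation. A minimal sketch for a single variable, assuming a numeric vector `x`:
#
# Standardize x with `scale`, then invert the transformation using the
# attributes it stores.  (`scale` uses the n-1 convention for the standard
# deviation, which is fine as long as it is applied consistently.)
#
xi <- scale(x)                        # (x - m_x) / s_x
x.back <- xi * attr(xi, "scaled:scale") + attr(xi, "scaled:center")
all.equal(as.vector(x.back), x)       # TRUE: the relationship is invertible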
Plugging the expressions for $x_i$, $y_i$, and $z_i$ into the defining relationship
$$\mathbb{E}(z_i) = C + Ax_i + By_i$$
and simplifying yields
$$\mathbb{E}(s_z \zeta_i + m_z) = C + A(s_x \xi_i + m_x) + B(s_y \eta_i + m_y).$$
Solving for the expectation of the dependent variable $\zeta_i$ yields
$$\mathbb{E}(\zeta_i) = \left(\frac{C + Am_x + Bm_y - m_z}{s_z}\right) + \left(\frac{A s_x}{s_z}\right) \xi_i + \left(\frac{B s_y}{s_z}\right) \eta_i.$$
If we write these coefficients as $\beta_0, \beta_1, \beta_2$ respectively, then we can recover $A, B, C$ by comparing and solving. For the record this gives
$$A = \frac{s_z \beta_1}{s_x},\ B = \frac{s_z \beta_2}{s_y},\text{ and }C = s_z \beta_0 + m_z - A m_x - B m_y.\tag{2}$$
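Equation $(2)$ is simple enough to package as a small helper. The following function is purely illustrative (it is not part of any library); `beta` holds $(\beta_0, \beta_1, \beta_2)$ while `m` and `s` hold the means and standard deviations of $(x, y, z)$ in that order:
#
# Convert standardized coefficients (beta0, beta1, beta2) back to (C, A, B)
# via equation (2).  A hypothetical helper for illustration only.
#
unstandardize <- function(beta, m, s) {
  A <- s[3] * beta[2] / s[1]
  B <- s[3] * beta[3] / s[2]
  C <- s[3] * beta[1] + m[3] - A * m[1] - B * m[2]
  c(C = C, A = A, B = B)
}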
The point of standardizing the data becomes apparent when we consider the new model matrix
$$\Xi = \pmatrix{1 & \xi_1 & \eta_1 \\ 1 & \xi_2 & \eta_2 \\ \vdots & \vdots & \vdots \\ 1 & \xi_n & \eta_n}$$
and the new response vector $\zeta = (\zeta_1, \zeta_2, \ldots, \zeta_n)^\prime$, because now
$$\Xi^\prime \Xi = \pmatrix{n & 0 & 0 \\ 0 & n & n\rho \\ 0 & n\rho & n}$$
and
$$\Xi^\prime \zeta = (0, n\tau, n\upsilon)^\prime$$
where $\rho = \frac{1}{n}\sum_{i=1}^n \xi_i \eta_i$ is the correlation coefficient of $x$ and $y$, $\tau = \frac{1}{n}\sum_{i=1}^n \xi_i \zeta_i$ is the correlation coefficient of $x$ and $z$, and $\upsilon = \frac{1}{n}\sum_{i=1}^n \eta_i \zeta_i$ is the correlation coefficient of $y$ and $z$.
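These claims are easy to verify numerically. Here is a sketch, assuming a data frame `xyz` with columns `x`, `y`, `z` (such as the one simulated in the code at the end) and using the $1/n$ convention consistently:
#
# Check the structure of Xi'Xi and Xi'zeta numerically, using the 1/n
# convention for the standard deviations.  Assumes a data frame `xyz`
# with columns x, y, z.
#
n <- nrow(xyz)
s <- sqrt(diag(cov(xyz)) * (n - 1) / n)          # 1/n standard deviations
std <- scale(xyz, center = colMeans(xyz), scale = s)
Xi <- cbind(1, std[, c("x", "y")])
crossprod(Xi) / n                # 1's on the diagonal, rho off-diagonal, zeros elsewhere
crossprod(Xi, std[, "z"]) / n    # (0, tau, upsilon)'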
To solve the normal equations $\Xi^\prime \Xi \beta = \Xi^\prime \zeta$, the standardized counterpart of the system inverted in $(1)$, we may divide both sides by $n$, giving
$$\pmatrix{1 & 0 & 0 \\ 0 & 1 & \rho \\ 0 & \rho & 1}\pmatrix{\beta_0 \\ \beta_1 \\ \beta_2} = \pmatrix{0 \\ \tau \\ \upsilon} .$$
What originally looked like a formidable matrix formula has been reduced to a truly elementary set of three simultaneous equations. Provided $|\rho| \lt 1$, its solution is easily found to be
$$\pmatrix{\hat\beta_0 \\ \hat\beta_1 \\ \hat\beta_2} = \frac{1}{1-\rho^2}\pmatrix{0 \\ \tau-\rho\upsilon \\ \upsilon-\rho\tau}.$$
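As a quick sanity check, the reduced system can also be handed to a generic linear solver and compared with the closed form. A sketch with made-up values of $\rho, \tau, \upsilon$:
#
# Solve the reduced normal equations for illustrative (made-up) values of
# rho, tau, upsilon and compare with the closed-form solution.
#
rho <- 0.8; tau <- 0.5; upsilon <- 0.6
lhs <- matrix(c(1, 0, 0,  0, 1, rho,  0, rho, 1), 3, 3)
beta <- solve(lhs, c(0, tau, upsilon))
beta.closed <- c(0, tau - rho * upsilon, upsilon - rho * tau) / (1 - rho^2)
all.equal(beta, beta.closed)    # TRUE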
Plugging these estimates $\hat\beta_0, \hat\beta_1, \hat\beta_2$ into $(2)$ produces the estimates $\hat A, \hat B,$ and $\hat C$.
In fact, even more has been achieved:
It is now evident why the cases $|\rho|=1$ are problematic: the regressors are then perfectly collinear, and the factor $1-\rho^2$ in the solution vanishes.
It is equally evident how to determine whether a solution exists when $|\rho|=1$ and how to obtain it: a solution exists when the second and third of the standardized normal equations are redundant, and it can then be found simply by dropping one of the variables $x$ and $y$ from the regression.
We can derive some insight into the solution generally. For instance, from $\hat\beta_0=0$ in all cases, we may conclude that the fitted plane must pass through the point of averages $(m_x, m_y, m_z)$.
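To see this concretely, note that $(2)$ with $\hat\beta_0 = 0$ gives $\hat C = m_z - \hat A m_x - \hat B m_y$, so the fitted value at $(m_x, m_y)$ is
$$\hat C + \hat A m_x + \hat B m_y = (m_z - \hat A m_x - \hat B m_y) + \hat A m_x + \hat B m_y = m_z.$$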
It is now evident that the solution can be found in terms of the first two moments of the trivariate dataset $(x, y, z)$. This sheds further light on the fact that coefficient estimates can be found from means and covariance matrices alone.
Furthermore, equation $(2)$ shows that the means are needed only to estimate the intercept term $C$. Estimates of the two slopes $A$ and $B$ require only the second moments.
When the regressors are uncorrelated, $\rho=0$ and the standardized solution is simply $\hat\beta_0 = 0$, $\hat\beta_1 = \tau$, and $\hat\beta_2 = \upsilon$: the intercept is zero and the slopes are the correlation coefficients between the response $z$ and the regressors $x$ and $y$. This is both easy to remember and provides insight into how regression coefficients are related to correlation coefficients.
Putting this all together, we find that (except in the degenerate cases $|\rho|=1$) the estimates can be written
$$\eqalign{
\hat A &= \frac{\tau - \rho\upsilon}{1-\rho^2} \frac{s_z}{s_x} \\
\hat B &= \frac{\upsilon - \rho\tau}{1-\rho^2} \frac{s_z}{s_y} \\
\hat C &= m_z -m_x \hat A - m_y \hat B.
}$$
In these formulae, the $m_{*}$ are the sample means, the $s_{*}$ are the sample standard deviations, and the Greek letters $\rho, \tau,$ and $\upsilon$ represent the three correlation coefficients (between $x$ and $y$, $x$ and $z$, and $y$ and $z$, respectively).
Please note that these formulas are not the best way to carry out the calculations. They all involve subtracting quantities that may be of comparable size, such as $\tau-\rho\upsilon$, $\upsilon-\rho\tau$, and $m_z - (m_x \hat A + m_y \hat B)$, and such subtraction can lose precision. The matrix formulation allows numerical analysts to obtain more stable solutions that preserve as much precision as possible. That is one reason term-by-term formulas attract little interest; the other is that as the number of regressors increases, the formulas rapidly grow too complicated to be useful.
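One concrete illustration of the point: rather than forming $X^\prime X$ and inverting it, a least-squares problem can be solved through a QR decomposition of the model matrix, which is what R's `lm` does internally. A minimal sketch, assuming the same hypothetical vectors `x`, `y`, and `z` as above:
#
# A more numerically stable route: solve the least-squares problem via a QR
# decomposition of the model matrix instead of forming X'X explicitly.
# (This is essentially what `lm` does.)  Assumes numeric vectors x, y, z.
#
X <- cbind(1, x, y)
qr.coef(qr(X), z)    # the coefficients (C, A, B), computed without forming X'X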
As further evidence of the correctness of these formulas, we may compare their answers to those of a standard least-squares solver, the `lm` function in R.
#
# Generate trivariate data.
#
library(MASS)
set.seed(17)
n <- 20
mu <- 1:3
# Build a symmetric covariance matrix with unit variances.
Sigma <- matrix(1, 3, 3)
Sigma[lower.tri(Sigma)] <- Sigma[upper.tri(Sigma)] <- c(.8, .5, .6)
xyz <- data.frame(mvrnorm(n, mu, Sigma))
names(xyz) <- c("x", "y", "z")
#
# Obtain the least squares coefficients.
#
beta.hat <- coef(lm(z ~ x + y, xyz))
#
# Compute the first two moments via `colMeans` and `cov`.
#
m <- colMeans(xyz)
sigma <- cov(xyz)
s <- sqrt(diag(sigma))
rho <- t(t(sigma/s)/s)                    # the correlation matrix
rho <- as.vector(rho[lower.tri(rho)])     # correlations (x,y), (x,z), (y,z)
#
# Here are the least squares coefficient estimates in terms of the moments.
#
A.hat <- (rho[2] - rho[1]*rho[3]) / (1 - rho[1]^2) * s[3] / s[1]
B.hat <- (rho[3] - rho[1]*rho[2]) / (1 - rho[1]^2) * s[3] / s[2]
C.hat <- m[3] - m[1]*A.hat - m[2]*B.hat
#
# Compare the two solutions.
#
rbind(beta.hat, formulae=c(C.hat, A.hat, B.hat))
The output exhibits two identical rows of estimates, as expected:
         (Intercept)         x        y
beta.hat    1.522571 0.3013662 0.403636
formulae    1.522571 0.3013662 0.403636