
In the theory of test equating (see, e.g., Test Equating, Scaling, and Linking by Kolen and Brennan (2014)), suppose we have a group of examinees, each of whom is randomly assigned one of two exam forms, say form X and form Y.

Let $X$ be a random variable representing scores on form X, and similarly for $Y$. Letting $\sigma(X)$ denote the standard deviation of a random variable $X$, linear equating assumes that $$\dfrac{X - \mathbb{E}[X]}{\sigma(X)} = \dfrac{Y - \mathbb{E}[Y]}{\sigma(Y)}$$ with probability one. To interpret this, suppose we observe a particular score $X = x$ on form X; we would use the equation above to find the equivalent score on form Y.

Solving for $Y$ in the above equation yields $$Y = \dfrac{\sigma(Y)}{\sigma(X)}\cdot X + \mathbb{E}[Y] - \dfrac{\sigma(Y)}{\sigma(X)}\cdot \mathbb{E}[X]\text{.}\tag{1}$$ This can be interpreted as a linear function of $X$, with slope $\dfrac{\sigma(Y)}{\sigma(X)}$ and intercept $\mathbb{E}[Y] - \dfrac{\sigma(Y)}{\sigma(X)}\cdot \mathbb{E}[X]$.
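
For concreteness, here is a minimal numerical sketch of how $(1)$ converts a form X score to the form Y scale; the means and standard deviations below are made up purely for illustration.

```python
# Hypothetical summary statistics for the two forms (invented for illustration).
mu_x, sigma_x = 72.0, 10.0   # E[X], sigma(X)
mu_y, sigma_y = 75.0, 12.0   # E[Y], sigma(Y)

def linear_equate(x):
    """Convert a form X score to the form Y scale using equation (1)."""
    slope = sigma_y / sigma_x
    intercept = mu_y - slope * mu_x
    return slope * x + intercept

# A score one SD above the form X mean maps to one SD above the form Y mean.
print(linear_equate(82.0))  # 87.0 = 75 + 12
```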

My question is about $(1)$ above. In Kolen and Brennan, toward the end of Section 3.3, they state:

The equation for linear equating... is deceptively like a linear regression equation. The difference is that, for linear regression, the $\sigma(Y)/\sigma(X)$ terms are multiplied by the correlation between $X$ and $Y$.

I am very familiar with linear regression and with the fact that $\hat{\beta}_1 = r \cdot \dfrac{S_y}{S_x}$ when working with sample data points, where $S_y$ and $S_x$ denote the sample standard deviations.

But how does this make sense in a population context, where we are working with population quantities such as $\sigma(X)$ and $\sigma(Y)$? In particular, what function are we minimizing so that $\rho_{X, Y} \cdot \sigma(Y)/\sigma(X)$ (note: these are population quantities, not the sample ones) is the least-squares regression slope of $Y$ on $X$?

Clarinetist
  • Apart from using the population variances rather than sample variances, there's no difference whatsoever in the population context. In particular, you are still minimizing the sum of squared residuals. – whuber Apr 12 '21 at 22:15
  • @whuber Yes, and I agree with that. However, unless I'm doing the math wrong in my head, you'd instead end up with, say, $\hat{\beta}_0 = \sum Y_i / n - \hat{\beta}_1 \cdot \sum X_i / n$ (simply take the partial with respect to $\beta_0$), rather than $\mathbb{E}[Y] - \hat{\beta}_1 \cdot \mathbb{E}[X]$ as the text in my question suggests. – Clarinetist Apr 13 '21 at 01:11
  • @whuber Does this mean, then, that we should replace sample means with population means as well? – Clarinetist Apr 13 '21 at 01:15
  • @whuber I made an attempt at an answer. It is likely wrong, but it was the best I could come up with. – Clarinetist Apr 13 '21 at 19:40

2 Answers


I offer this post as one possible way to unify the concepts of (ordinary) linear regression for a random sample and for a population. I will take you through a sequence of concepts, from broadest to narrowest, finally arriving at what you're looking for.

Corresponding with two conceptions of regression, which differ on whether the explanatory variable $X$ is considered a random variable, are two conceptions of "population."

  1. The population is a function $\mathcal P$ from a set $X\subset \mathbb R$ into a set of random variables. When $x\in X,$ let $Y_x$ designate the random variable associated with $x.$

  2. The population is a bivariate random variable $(X,Y).$

"Regression" most broadly means the process of associating with each $x$ the expectation of the random variable attached to $x.$

  1. The "regression of $Y$ on $X$" in the first case is the function $x\to E[\mathcal{P}(x)].$ It exists when all these expectations exist.

  2. The "regression of $Y$ on $X$" in the second case is the conditional expectation $E[Y\mid X].$ The "function" $x\to E[Y\mid X=x]$ is not actually well-defined, but any two such functions can disagree only on a set of zero probability.

"Linear regression" means finding a linear function to approximate the regression in the least squares sense. That is, in some sense the average squared difference between the $Y$ values and what the linear function "predicts" is minimized. In both cases we may express this function in the form

$$f(x;\theta) = \beta_0 + \beta_1 x$$

where $\theta = (\beta_0, \beta_1)$ are the parameters. The two situations differ subtly in what "mean square difference" might mean.

  1. In the first case, the expected squared difference between $f(x;\theta)$ and the random variable $Y_x$ is (of course) $E\left[\left(Y_x - f(x;\theta)\right)^2\right].$ This isn't enough, though: evidently we need some way to average the expected squared differences over the set $X.$ That implies there is some measure $\lambda$ defined on $X$ so that we may define the mean squared difference as $$\operatorname{MSE}(\theta)=\int_X E\left[\left(Y_x - f(x;\theta)\right)^2\right]\,\mathrm{d}\lambda(x).$$ For minimization purposes, any positive multiple of $\lambda$ will yield the same linear regression.

  2. In the second case, we already have a probability measure (given by the marginal distribution of $X$). Thus, $$\operatorname{MSE}(\theta)=E\left[E\left[\left(Y - f(X;\theta)\right)^2\mid X\right]\right] = E\left[\left(Y - f(X;\theta)\right)^2\right].$$
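
To make the second case concrete, here is a minimal numerical sketch under an invented discrete joint distribution for $(X, Y)$ (the support points and probabilities are not from the source): it minimizes $\operatorname{MSE}(\theta)$ directly and compares the result with $\operatorname{Cov}(X,Y)/\operatorname{Var}(X)$ and $E[Y] - \beta_1 E[X]$.

```python
import numpy as np
from scipy.optimize import minimize

# A small, made-up discrete joint distribution for (X, Y): support points and probabilities.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 1.5, 3.5, 4.0])
p = np.array([0.1, 0.4, 0.3, 0.2])            # probabilities, summing to 1

EX, EY = np.sum(p * x), np.sum(p * y)
cov  = np.sum(p * (x - EX) * (y - EY))        # population covariance Cov(X, Y)
varx = np.sum(p * (x - EX) ** 2)              # population variance Var(X)

def mse(theta):
    """MSE(theta) = E[(Y - beta0 - beta1*X)^2] under the joint distribution."""
    b0, b1 = theta
    return np.sum(p * (y - b0 - b1 * x) ** 2)

beta0_hat, beta1_hat = minimize(mse, x0=[0.0, 0.0]).x
print(beta1_hat, cov / varx)                  # both approximately Cov(X, Y) / Var(X)
print(beta0_hat, EY - (cov / varx) * EX)      # both approximately E[Y] - beta1 * E[X]
```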

Both cases can be framed identically in the language of Euclidean vector spaces. The vector space in question has at most three dimensions: it is generated by $Y,$ $X,$ and the constant function $1,$ which I will write as $\mathbf{1}.$ The Euclidean norm in the second case is given by

$$||V||^2 = E\left[V^2\right]$$

where $V$ is any linear combination of $Y,$ $X,$ and $\mathbf 1.$

It determines the inner product

$$\langle U,V\rangle = \frac{1}{4}\left(||U+V||^2 - ||U-V||^2\right) = E[UV].$$

For convenience in the following calculations, let's rescale this norm to make $||\mathbf{1}||^2 = 1.$

The least squares objective is the squared distance between $Y$ and the linear combination $\beta_0\mathbf{1} + \beta_1 X.$ The shortest distance occurs when the residual $Y - (\beta_0 \mathbf{1} + \beta_1 X)$ is orthogonal to the subspace generated by $\mathbf{1}$ and $X.$ Orthogonality can be checked by establishing that the residual is orthogonal (separately) to each of those vectors. This gives a pair of simultaneous linear equations, the Normal Equations

$$\left\{\begin{aligned} 0 &= \langle Y - (\beta_0\mathbf{1} + \beta_1 X), \mathbf{1}\rangle &= \langle Y, \mathbf{1}\rangle - \beta_0 - \beta_1 \langle X, \mathbf{1}\rangle \\ 0 &= \langle Y - (\beta_0\mathbf{1} + \beta_1 X), X\rangle &= \langle Y, X\rangle - \beta_0\langle \mathbf{1}, X\rangle - \beta_1 \langle X, X\rangle \end{aligned} \right.$$

Writing $\bar X = \langle X, \mathbf{1}\rangle = \langle \mathbf{1}, X \rangle$ and similarly for $\bar Y,$ let

$$V(X) = ||X||^2 - \left(\bar X\right)^2;\quad V(Y) = ||Y||^2 - \left(\bar Y\right)^2;\quad \operatorname{Cov}(X,Y) = \langle Y, X\rangle - \left(\bar Y\right)\left(\bar X\right).$$

Assuming $V(X)\ne 0,$ the unique solution is

$$\hat \beta_1 = \frac{\operatorname{Cov}(Y,X)}{V(X)}$$

from which $\hat \beta_0$ is readily computed.

Notice that in the second formulation, the correlation coefficient is given by

$$r = \frac{\operatorname{Cov}(Y,X)}{\sigma(Y)\,\sigma(X)}$$

where $\sigma^2(X) = V(X) = S_X^2$ and $\sigma^2(Y)=V(Y)=S_Y^2.$ In this notation the solution given in the question is

$$\hat\beta_1 = r\frac{S_Y}{S_X} = r\frac{\sigma(Y)}{\sigma(X)} = \frac{\operatorname{Cov}(Y,X)}{\sigma(Y)\,\sigma(X)}\frac{\sigma(Y)}{\sigma(X)} = \frac{\operatorname{Cov}(Y,X)}{V(X)},$$

algebraically equivalent to the solution obtained above.
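
As a quick numerical check of the last few displays, the sketch below (using the same kind of invented discrete distribution as before) verifies that the residual is orthogonal to $\mathbf{1}$ and $X$ under $\langle U, V\rangle = E[UV]$ and that $\operatorname{Cov}(Y,X)/V(X)$ coincides with $r\,\sigma(Y)/\sigma(X)$.

```python
import numpy as np

# An invented discrete joint distribution for (X, Y).
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 1.5, 3.5, 4.0])
p = np.array([0.1, 0.4, 0.3, 0.2])            # probabilities, summing to 1
one = np.ones_like(x)

def inner(u, v):
    """Inner product <U, V> = E[UV]; note ||1||^2 = 1 because p sums to 1."""
    return np.sum(p * u * v)

xbar, ybar = inner(x, one), inner(y, one)
VX  = inner(x, x) - xbar ** 2
VY  = inner(y, y) - ybar ** 2
cov = inner(y, x) - ybar * xbar

beta1 = cov / VX
beta0 = ybar - beta1 * xbar
r = cov / np.sqrt(VX * VY)

resid = y - (beta0 * one + beta1 * x)
print(inner(resid, one), inner(resid, x))     # both ~ 0: the Normal Equations hold
print(beta1, r * np.sqrt(VY) / np.sqrt(VX))   # Cov(Y, X)/V(X) equals r * sigma(Y)/sigma(X)
```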


Any (nonempty) finite dataset $((X_1,Y_1), (X_2,Y_2),\ldots, (X_n,Y_n))$ determines its empirical (bivariate) distribution. It is a discrete distribution in which the probability of any ordered pair $(x,y)$ is $1/n$ times the number of observations equal to $(x,y).$ The foregoing integrals reduce to finite sums, where for any $V = (V_1,V_2,\ldots, V_n)$ that is a linear combination of $Y,$ $X,$ and $\mathbf{1}=(1,1,\ldots,1),$

$$||V||^2 = E\left[V^2\right] = \sum_{i=1}^n \frac{1}{n} V_i^2 = \frac{1}{n}\sum_{i=1}^n V_i^2.$$

You will recognize this as the mean square. It follows that $\bar X = (X_1+X_2+\cdots+X_n)/n$ is the usual mean and $V(X)$ is the population variance (the normalization factor is $1/n$ rather than $1/(n-1)$). However, since the normalization factors cancel in the fraction $\hat\beta_1 = \operatorname{Cov}(Y,X)/V(X),$ the estimate is identical to the Ordinary Least Squares estimate.
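
Here is a short sketch of that reduction on a small simulated dataset (the data are invented): the slope computed from the $1/n$-normalized covariance and variance agrees with the ordinary least squares fit, because the normalization factors cancel.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=50)
Y = 1.5 * X + 4.0 + rng.normal(size=50)

# Slope and intercept from the empirical (1/n-normalized) moments...
beta1 = np.cov(X, Y, bias=True)[0, 1] / np.var(X)   # bias=True and np.var both use 1/n
beta0 = Y.mean() - beta1 * X.mean()

# ...agree with the Ordinary Least Squares fit, because the 1/n factors cancel.
ols_slope, ols_intercept = np.polyfit(X, Y, deg=1)
print(beta1, ols_slope)
print(beta0, ols_intercept)
```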

whuber

This answer is likely wrong, and I would appreciate any corrections.

This is the only way I can reason it: we assume $(X, Y)$ is uniformly distributed over $\{(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)\} \subset \mathbb{R}^2$. Then $X$ is a random variable uniformly distributed on $\{x_1, \dots, x_n\}$ and $Y$ is a random variable uniformly distributed on $\{y_1, \dots, y_n\}$.

Minimize the residual sum of squares $$R(\beta_0, \beta_1) = \sum_{i=1}^{n}[y_i-(\beta_0+\beta_1x_i)]^2\text{.}$$ This would result in the standard linear regression coefficient equations \begin{align*} \hat{\beta}_0&=\sum_{i=1}^{n}y_i \cdot \dfrac{1}{n} - \hat{\beta}_1 \cdot \sum_{i=1}^{n}x_i \cdot \dfrac{1}{n} \\ &= \mathbb{E}[Y] - \hat\beta_1 \cdot \mathbb{E}[X] \\ \hat{\beta}_1 &= \dfrac{\sum_{i=1}^{n}(x_i - \mathbb{E}[X])(y_i - \mathbb{E}[Y])}{\sum_{i=1}^{n}(x_i - \mathbb{E}[X])^2}\tag{1} \\ &= \dfrac{\frac{1}{n}\sum_{i=1}^{n}(x_i - \mathbb{E}[X])(y_i - \mathbb{E}[Y])}{\frac{1}{n}\sum_{i=1}^{n}(x_i - \mathbb{E}[X])^2} \\ &= \dfrac{\mathbb{E}[(X - \mathbb{E}[X])(Y - \mathbb{E}[Y])]}{\mathbb{E}[(X - \mathbb{E}[X])^2]} \\ &= \dfrac{\text{Cov}(X, Y)}{\text{Var}(X)} \\ &= \dfrac{\text{Cov}(X, Y)}{\sigma(X)} \cdot \dfrac{\sigma(Y)}{\sigma(X)\sigma(Y)} \\ &= \dfrac{\text{Cov}(X, Y)}{\sigma(X)\sigma(Y)} \cdot \dfrac{\sigma(Y)}{\sigma(X)} \\ &= \rho_{X, Y} \cdot \dfrac{\sigma(Y)}{\sigma(X)} \end{align*} as desired.

The line $(1)$ is true because $\sum_{i=1}^{n}x_i \cdot \dfrac{1}{n} = \mathbb{E}[X]$ by the uniform assumption.
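
As a quick numerical sanity check of the algebra above, on a small made-up dataset: the slope that minimizes the residual sum of squares equals $\rho_{X, Y} \cdot \sigma(Y)/\sigma(X)$ computed with the $1/n$ ("population") quantities implied by the uniform assumption.

```python
import numpy as np

# Small invented dataset; (X, Y) is taken to be uniform over these points.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 2.5, 4.5, 5.0, 6.5])

beta1, beta0 = np.polyfit(x, y, deg=1)        # minimizes the residual sum of squares

rho = np.corrcoef(x, y)[0, 1]
slope_from_rho = rho * np.std(y) / np.std(x)  # np.std uses ddof=0, i.e. the "population" sd

print(beta1, slope_from_rho)                  # same slope
print(beta0, y.mean() - beta1 * x.mean())     # intercept = E[Y] - beta1 * E[X]
```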

Clarinetist