
The question refers to @whuber's algorithm for drawing a random variable that has specified correlations with a given set of random variables: https://stats.stackexchange.com/a/313138/3277

The algorithm does exactly what it is supposed to do.

However, I don't yet grasp how and why it does so. Can anyone please clarify why the proposal by @whuber actually works?

In particular,

1) one line of his code reads

y.dual <- with(svd(y), (n-1)*u %*% diag(ifelse(d > 0, 1/d, 0)) %*% t(v))

where the inner matrix is a diagonal matrix containing the reciprocals of the nonzero singular values of y (zero singular values are left at zero).

u contains the left-singular vectors of y. v contains the right-singular vectors of y.

So what is y.dual?

2) the final step in calculating z is

z <- y.dual %*% rho + sigma*e

Why do we have to add sigma*e here? And, frankly, why does y.dual %*% rho + sigma*e produce what we are looking for?

I would be grateful for a clarification / explanation of this procedure.

pf11
  • Both questions are (sketchily) answered in my post: `y.dual` is (proportional to) a pseudo-inverse of `y` (which you can check by inspecting `zapsmall(t(y.dual) %*% y)` provided the columns of `y` are linearly independent) and (2) implements the generalization of the formula for $X_{Y;\rho}$ from ordinary to multiple regression. – whuber Jan 09 '20 at 17:06

1 Answer


As a point for further elaboration, here is the explanation in the thread you reference:

If it's mathematically possible, [this method] will find an $X_{Y_1,Y_2,\ldots,Y_k;\rho_1,\rho_2,\ldots,\rho_k}$ having specified correlations [between $X$ and] an entire set of $Y_i$. Just use ordinary least squares to take out the effects of all the $Y_i$ from $X$ and form a suitable linear combination of the $Y_i$ and the residuals. (It helps to do this in terms of a dual basis for $Y$, which is obtained by computing a pseudo-inverse. The following code uses the SVD of $Y$ to accomplish that.)

(I apologize for the reversal from the usual roles of $X$ (which here is treated as a response variable in a multiple regression) and $Y$ (which here is a collection of explanatory or regressor variables). This notation was adopted out of respect for the original question.)

The R code given to specify this procedure is sufficiently compact to be reproduced here:

y <- scale(y)             # Makes computations simpler
e <- residuals(lm(x ~ y)) # Take out the columns of matrix `y`
y.dual <- with(svd(y), (n-1)*u %*% diag(ifelse(d > 0, 1/d, 0)) %*% t(v))
sigma2 <- c((1 - rho %*% cov(y.dual) %*% rho) / var(e))
return(y.dual %*% rho + sqrt(sigma2)*e)
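
Taken by itself, this fragment assumes that `x`, `y`, `rho`, and `n` (the number of observations) already exist in the enclosing function. For readers who want to run it directly, here is a minimal self-contained sketch; the wrapper name gen_with_correlations, the toy data, and the seed are my own additions for illustration, not part of the original answer, and the requested `rho` must of course be jointly achievable for the given `y`.

gen_with_correlations <- function(x, y, rho) {
  n <- nrow(y)              # number of observations
  y <- scale(y)             # standardize the columns of `y`
  e <- residuals(lm(x ~ y)) # take out the effects of the columns of `y`
  y.dual <- with(svd(y), (n-1)*u %*% diag(ifelse(d > 0, 1/d, 0)) %*% t(v))
  sigma2 <- c((1 - rho %*% cov(y.dual) %*% rho) / var(e))
  y.dual %*% rho + sqrt(sigma2)*e
}

set.seed(17)                    # arbitrary seed
y <- matrix(rnorm(300), 100, 3) # three arbitrary regressors
x <- rnorm(100)                 # an arbitrary starting variable
rho <- c(0.5, -0.3, 0.2)        # target correlations
z <- gen_with_correlations(x, y, rho)
round(cor(z, y), 6)             # reproduces 0.5, -0.3, 0.2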

Let's go through this recipe step by step.

  • Taking out the effects of $Y$. Ordinary Least Squares finds a vector $\beta=(\beta_i)$ of $k$ coefficients for which the residuals $$e = X - Y\beta$$ are orthogonal to (have zero covariance with) the vector space spanned by the columns of $Y = \pmatrix{Y_1,&Y_2,&\cdots, & Y_k}.$ (These orthogonality relations are known as the Normal Equations.) The line of code

    e <- residuals(lm(x ~ y))
    

    carries out this calculation.

  • A linear combination. By definition, a linear combination of $Y$ and the residuals $e$ can be written in terms of a number $\sigma $ and $k$-vector $\alpha$ as $$Z=\sigma e + Y\alpha =\sigma e + \alpha_1 Y_1 + \cdots + \alpha_k Y_k.$$ By virtue of the Normal Equations, the covariance between any $Y_i$ and such a combination is $$\operatorname{Cov}(Z, Y_i) = \sigma \operatorname{Cov}(e,Y_i) + \operatorname{Cov}\left(\sum_{j=1}^k \alpha_j Y_j, Y_i\right) = \operatorname{Cov}\left(\sum_{j=1}^k \alpha_j Y_j, Y_i\right).\tag{1}$$ Ultimately, we want to find coefficients $\alpha_j$ that give the right hand side the right values to achieve the desired correlations $\rho = (\rho_i).$

  • A dual basis. This problem of finding the $\alpha_j$ is easily solved with a dual basis. Without loss of generality, re-index the $Y_i$ if necessary so that the first $d$ of them are linearly independent and span the same space spanned by all the $Y_i:$ that is, they form a basis. We seek a collection of $d$ vectors $W_j,$ each of which is a linear combination of the $Y_j, j\le d,$ for which $$\operatorname{Cov}(W_i,Y_j) = 0\tag{2}$$ whenever $i\ne j$ and $j \le d.$ We may scale the $W_j$ so that $$\operatorname{Cov}(W_j,Y_j)=1.\tag{3}$$ $(W_j)$ is a dual basis associated with the $Y_i.$

    As described in the text, such a matrix $W$ can be found using the Singular Value Decomposition (SVD) $$Y = U\, D\, V^\prime$$ by inverting the diagonal matrix $D$ (leaving any diagonal zeros alone) to create a pseudo-inverse $D^{-}$ whose first $d$ diagonal values are nonzero and forming $$Y^{-} = U\, D^{-}\, V^\prime.$$ You may check that the $d\times d$ block of $$(Y^{-})^\prime Y = (V\, D^{-}\, U^\prime)\, (U\, D\, V^\prime) = V\, I^{-}\, V^\prime = \pmatrix{\mathbb{I}_d & * \\ * & *}$$ is the identity matrix, which is just a compact way of expressing formulas $(2)$ and $(3)$ simultaneously. This is what the code

    y.dual <- with(svd(y), (n-1)*u %*% diag(ifelse(d > 0, 1/d, 0)) %*% t(v))
    

    does. It turns out that this formula for $W$ works even when the first $d$ columns of $Y$ are not linearly independent. (Thinking of SVD in terms of geometric operations makes this apparent.) This obviates any need to execute the preliminary re-indexing step of the $Y_i.$ (A numerical check of this identity, and of the final construction, appears after this list.)

  • A suitable linear combination. Returning to $(1),$ express the linear combination $Y\alpha$ in terms of the dual basis, $$Y\alpha = \sum_{j=1}^d \gamma_jW_j.$$ This re-expresses $(1)$ as $$\operatorname{Cov}\left(\sum_{j=1}^k \alpha_j Y_j, Y_i\right) = \operatorname{Cov}\left(\sum_{j=1}^d \gamma_jW_j, Y_i\right) = \sum_{j=1}^d \gamma_j \operatorname{Cov}\left(W_j, Y_i\right) = \gamma_i.$$ Consequently, using a standard formula for correlations, the $d$ equations we need to solve are $$\rho_i = \operatorname{Cor}(Z, Y_i) = \frac{\operatorname{Cov}(Z, Y_i)}{\operatorname{sd}(Z)\operatorname{sd}(Y_i)} = \frac{\gamma_i}{\operatorname{sd}(Z)\operatorname{sd}(Y_i)}.\tag{4}$$

  • The final step. As a preliminary step, we may as well have scaled the $Y_i$ to give them unit standard deviations. (The line of code `y <- scale(y)` does that.) We're going to make sure $Z$ also has a unit standard deviation, so that the previous formula $(4)$ actually gives the correlation. That is, the solution by all rights ought to be $\gamma_i = \rho_i.$ If that's possible, then, because $e$ is uncorrelated with every $W_j$ (the $W_j$ lie in the span of the $Y_i$), there is no cross term and $$1 = \operatorname{Var}(Z) = \sigma^2 \operatorname{Var}(e) + \operatorname{Var}(W\rho).$$ Consequently, assuming the residual vector $e$ is nonzero, $$\sigma^2 = \frac{1 - \operatorname{Var}(W\rho)}{\operatorname{Var}(e)} = \frac{1 - \rho^\prime \operatorname{Cov}(W)\rho}{\operatorname{Var}(e)}.$$ This formula (omitted in the present question) is implemented in the code as

    sigma2 <- c((1 - rho %*% cov(y.dual) %*% rho) / var(e))
    

    Finally, the solution $Z = \sigma e + Y\alpha = \sigma e + W\gamma = W\rho + \sigma e$ is implemented in the code as

    y.dual %*% rho + sqrt(sigma2)*e
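
The claims above are straightforward to verify numerically. Here is a hedged check (the data, seed, and object names are my own, matching the earlier sketch): `cov(y.dual, y)` reproduces the identity block promised by relations $(2)$ and $(3)$, $\operatorname{Var}(Z) = 1$ so that $(4)$ really does deliver correlations, and the result has the requested correlations with the columns of `y`.

set.seed(17)                    # same arbitrary data as in the sketch above
y <- scale(matrix(rnorm(300), 100, 3))
x <- rnorm(100)
rho <- c(0.5, -0.3, 0.2)
n <- nrow(y)

e <- residuals(lm(x ~ y))
y.dual <- with(svd(y), (n-1)*u %*% diag(ifelse(d > 0, 1/d, 0)) %*% t(v))
sigma2 <- c((1 - rho %*% cov(y.dual) %*% rho) / var(e))
z <- y.dual %*% rho + sqrt(sigma2)*e

zapsmall(cov(y.dual, y))   # the 3 x 3 identity matrix: relations (2) and (3)
var(z)                     # 1, so formula (4) yields correlations directly
zapsmall(cor(z, y) - rho)  # zero: the requested correlations are achieved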
    
whuber