Given a standardized $n\times p$ design matrix $X$ and a specified coefficient $p$-vector $b,$ you want to generate normalized random responses $y$ for which the ordinary least squares fit of the model $y = X\beta + \varepsilon$ is $b;$ that is,
$$b = (X^\prime X)^{-}X^\prime y.$$
It doesn't matter how you normalize (whether to unit variance or to unit Euclidean length), so I will suppose the constraints are
$$\left\{\begin{aligned}&\sum_{i=1}^n y_i = 0,\\&\sum_{i=1}^n y_i^2 =1.\end{aligned} \right.$$
The first constraint is automatically achieved when $y$ and all the columns of $X$ are centered, so let's assume this has already been done.
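In R this preliminary step is a one-liner (a minimal sketch using the base function `scale`; the full code below standardizes the columns of $X$ instead):

X <- scale(X, center = TRUE, scale = FALSE) # Subtract each column's mean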
What this formula really means is that $b$ solves the system of normal equations
$$(X^\prime X)b = X^\prime y.\tag{*}$$
This system always has the obvious unnormalized solution
$$y_0 = Xb$$
(because plugging it into the system gives a tautological equality). It is attractive to attempt to generate solutions of the form
$$Y = y_0 + \lambda\delta$$
where $\delta$ is an $n$-variate random vector of unit length (of some sort) and $\lambda$ is chosen to normalize $y_0 + \lambda\delta.$ Thus,
$$\lambda = \pm\sqrt{1 - ||y_0||^2}.$$
There typically are two solutions $\lambda$ (provided any solutions to this problem exist at all: when the coefficients of $b$ are too large in magnitude, then $||y_0||\gt 1$ and no solutions exist). If the distribution of $\delta$ is symmetric around the origin, it won't matter which solution is used. Otherwise, you might want to select one uniformly at random.
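To see where this formula comes from, recall that $\delta$ will be shown below to lie in the kernel of $X^\prime,$ making it orthogonal to $y_0 = Xb$ (because $y_0^\prime \delta = b^\prime X^\prime \delta = 0$). Since $||\delta|| = 1,$
$$1 = ||y_0 + \lambda\delta||^2 = ||y_0||^2 + 2\lambda\,y_0^\prime\delta + \lambda^2||\delta||^2 = ||y_0||^2 + \lambda^2.$$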
Because this solution $y_0 + \lambda \delta$ must satisfy $(*),$
$$(X^\prime X)b = X^\prime(y_0 + \lambda\delta) = (X^\prime X)b + \lambda X^\prime \delta,$$
demonstrating that (unless fortuitously $\lambda=0$)
$$X^\prime \delta = 0.$$
That is, $\delta$ must have a distribution supported on the kernel (nullspace) of the linear transformation represented by $X^\prime.$ Let $(e_1, e_2, \ldots, e_q)$ be an orthonormal basis of the kernel (which in the R code below are the columns of a matrix $\Phi$). Usually $q = n-p-1$ (the $1$ accounts for the preliminary centering), but it can be larger when $X$ is not of full rank. Any possible value of $\delta$ is determined by its $q$ coefficients in this basis. These coefficients are what we want to vary randomly.
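In matrix form, with $\Phi = (e_1\ e_2\ \cdots\ e_q),$ every such $\delta$ can be written
$$\delta = \Phi c,\qquad X^\prime \delta = (X^\prime \Phi)\,c = 0\quad\text{for every } c\in\mathbb{R}^q,$$
so randomizing $\delta$ amounts to randomizing its coefficient vector $c.$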
So, let the coefficient vector $c$ have any zero-mean $q$-variate distribution and use its components to generate the error terms. (In what follows I will use a standard Normal distribution.) It remains only to show how this program can be carried out efficiently.
One way (which works) is to find the null space of $X$ using a QR decomposition of $X$ adjoined to (a) a column of $1$'s (to assure mean centering) and (b) any nonsingular matrix you like (such as a diagonal matrix with nonzero entries; the code below uses the identity). I start with the $U$ matrix from the Singular Value Decomposition $X = UDV^\prime$ (retaining only its columns with nonzero singular values) because that handles rank-deficient matrices. (In such cases, though, it's arguably not even meaningful to reproduce $b.$)
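As an aside, if the MASS package is available, its `Null` function (which returns an orthonormal basis for everything orthogonal to the columns of its argument) can stand in for the hand-rolled construction below; a minimal sketch:

library(MASS)
Phi <- Null(cbind(rep(1, n), X)) # Orthonormal basis orthogonal to 1 and the columns of X

Because `Null` works from the rank reported by a QR decomposition, it should accommodate rank-deficient $X$ as well.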
When $X$ is of full rank, $\lambda \delta$ must reside on a sphere of dimension $n-p-2$ (the normalization reduces the dimension of the solution space from $n-p-1$ to $n-p-2$). For instance, when $n=p+3,$ all solutions will lie on a circle. Here is an example showing a scatterplot matrix (and histograms) of the components of $Y-y_0$ in a case where $n=4$ and $p=1.$ I generated 2,000 independent values of $Y$ for this figure. (In order to graph the Beta density, I also standardized the coordinates: see the R code for details.) As expected, each cell is a projection of a circle onto a coordinate plane.

When $n-p-2\ge 2,$ these projections will be filled ellipses. (This is a property of the geometry of least squares, not of the distribution of $\delta.$)
When $\delta$ has a standard Normal distribution, (1) the errors are uniformly distributed on the manifold of all possible solutions (the sphere) and (2) the coordinate distributions must be shifted, scaled versions of a Beta$((n-p-2)/2, (n-p-2)/2)$ distribution. These Beta densities are plotted in red in the figures over histograms of (appropriately shifted and scaled) error terms (the components of $\delta$). The agreement between simulation and theory is excellent.
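The Beta claim is easy to check in isolation: normalizing an $m$-variate standard Normal vector produces a uniform point on the sphere $S^{m-1},$ and its first coordinate, mapped to $[0,1],$ has a Beta$((m-1)/2, (m-1)/2)$ distribution. Here is a minimal sketch, taking $m = n-p-1$ (the kernel dimension in the full-rank case):

m <- n - p - 1 # Kernel dimension, so that m - 1 = n - p - 2
u <- replicate(1e4, {z <- rnorm(m); z[1] / sqrt(sum(z^2))})
ks.test((u + 1)/2, "pbeta", (m - 1)/2, (m - 1)/2) # Expect a large p-value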
As another example, when $n-p-2=3,$ the projection of $Y$ into any coordinate plane must have a uniform distribution. We may confirm this visually with another simulation. I changed only the first two lines of the R code (below) to simulate $n=105, p=100,$ partly to demonstrate that the distributions really do depend only on $n-p:$

Sure enough, the points appear to fill the ellipses uniformly.
How efficient is this procedure? A good SVD and QR implementation takes little time even for $n$ in the hundreds. For $n=500$ and $p=25,$ a medium-sized multiple regression problem, my system generates $400$ $Y$ vectors per second (and that includes checking the validity of every solution by regressing each $Y$ against $X$). When the individual checks are not performed, it goes ten times faster. For larger $n,$ the kernel of $X$ gets large: it requires $O(n^2)$ coefficients, and finding them is an $O(n^3)$ task, so the SVD and QR decomposition become the bottlenecks. To make the calculations more efficient, consider selecting a smaller subspace of the kernel: start a Gram-Schmidt process on $X_1$ (the matrix $X$ with a column of $1$'s adjoined) and stop partway through, as sketched below.
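Here is one way that truncated process might look. The helper `partial.kernel` is hypothetical (it is not part of the code below): it orthogonalizes random vectors against $X_1$ and against one another, stopping once $k \le q$ kernel directions have been found.

partial.kernel <- function(X, k, tol = 1e-8) {
  n <- dim(X)[1]
  B <- qr.Q(qr(cbind(rep(1, n), X))) # Orthonormal basis containing span{1, X}
  Phi <- matrix(0, n, k)
  found <- 0
  while (found < k) {
    v <- rnorm(n)
    v <- v - B %*% crossprod(B, v)   # Project out span{1, X}
    if (found > 0) {
      Q <- Phi[, seq_len(found), drop = FALSE]
      v <- v - Q %*% crossprod(Q, v) # Project out directions already found
    }
    len <- sqrt(sum(v^2))
    if (len > tol) {                 # Keep only a numerically nonzero remainder
      found <- found + 1
      Phi[, found] <- v / len
    }
  }
  Phi
}

The price of a $\Phi$ with only $k \ll n-p-1$ columns is that $\delta$ then ranges over a $(k-1)$-dimensional subsphere of the full solution sphere rather than all of it.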
For details, see the R code that generated the example. The code is structured in three sections. The first creates data to work with; in practice, you would supply your own data. The second performs the preliminary computations. The third draws random values of $Y$ and plots them. Notice that only two lines of code are needed for each random value: a matrix multiplication to form a random linear combination of the null basis, followed by standardizing that combination and adding it to $y_0.$ This is an $O(n(n-p))$ calculation.
#
# Create data.
#
n <- 105
p <- 100
set.seed(17)
X <- matrix(rnorm(n*p), n, p)
X <- apply(X, 2, scale) / sqrt(n-1) # (Usually X is standardized)
colnames(X) <- paste0("x.", seq_len(dim(X)[2]))
rownames(X) <- as.character(seq_len(n))
#
# Specify a feasible value of b.
# This will normalize y.0 to |y.0| = mu.
#
mu <- 0.9 # Choose a value between 0 and 1
b <- runif(dim(X)[2], -1, 1)
b <- b * mu / sqrt(sum((X %*% b)^2))
#------------------------------------------------------------------------------#
#
# Preliminary computations:
# Compute y.0 and the (orthogonal) dual of X.
#
dual <- function(X, tol=1e-16) {
  n <- dim(X)[1]
  p <- dim(X)[2]
  qr.Q(qr(cbind(X, rep(1, n), diag(1, n, n)), tol=tol))[, -seq_len(p+1)]
}
y.0 <- X %*% b
y2 <- sum(y.0^2)
if (y2 > 1) stop("No solutions are possible: b is too large.")
eps.norm <- sqrt(1 - y2) # This will be the length of any error term
Phi <- with(svd(X), dual(u[, which(d > 1e-8), drop=FALSE]))
#------------------------------------------------------------------------------#
#
# Simulate data `y` (for fixed `X`).
#
tol <- 1e-8
run.checks <- TRUE
sim <- replicate(2e3, {
  ##############################################
  # This is the entire algorithm.              #
  # It uses precomputed y.0, Phi, and eps.norm.#
  ##############################################
  eps <- Phi %*% rnorm(dim(Phi)[2])            # Unnormalized errors
  y <- y.0 + eps * eps.norm / sqrt(sum(eps^2)) # Normalize and add
  #
  # Confirm the generated `y` meets all constraints.
  #
  if (isTRUE(run.checks)) {
    fit <- lm(y ~ . + 0, as.data.frame(X))
    if (sum((coefficients(fit) - b)^2) > tol) warning("Beta not reproduced.")
    if (abs(mean(y)) > tol) warning("Not centered: ", abs(mean(y)))
    if (abs(1 - sqrt(sum(y^2))) > tol*1e2) warning("Not normalized: ", sqrt(sum(y^2)))
  }
  c(y)
})
rownames(sim) <- rownames(X)
#
# Plot the results.
#
f <- function(x) dbeta((x/eps.norm+1)/2, (n-p-2)/2, (n-p-2)/2) / 2 / eps.norm
panel.hist <- function(x, ...) {
  usr <- par("usr"); on.exit(par(usr))
  par(usr = c(usr[1:2], 0, 1.5))
  h <- hist(x, plot = FALSE, breaks=seq(min(x), max(x), length.out=20))
  breaks <- h$breaks; nB <- length(breaks)
  y <- h$density; y.max <- max(y)
  rect(breaks[-nB], 0, breaks[-1], y/y.max, ...)
  #
  # Overplot the theoretical density.
  #
  curve(f(x)/y.max, add=TRUE, n=201, lwd=2, col="#e01010")
}
m <- min(4, nrow(sim)) # Limit the scatterplot matrix dimensions
i <- sort(sample.int(nrow(sim), m))
sim.0 <- (sim - c(y.0)) / sqrt(rowSums(Phi^2)) # The error terms only
pairs(t(sim.0[i, ]), asp=1, col="#00000010", cex.labels=1, diag.panel=panel.hist)
mtext(paste0("Standardized Error Terms in ",
             ifelse(m < nrow(sim), "Selected ", ""),
             "Components with n=", n, " and p=", p), line=3)