
I have some complex-valued time-series data, $y \in \mathbb{C}^n$ - a signal with additive Gaussian white noise. The goal is to find the Fourier coefficients of this signal.

Ideally, you would just take a discrete Fourier transform, but in this case the data are non-uniformly sampled, and each data point is multiplied by an arbitrary (but known) complex amplitude (a "feature" of the measurement method) before the noise is added. So, I want to solve for the underlying signal via a linear model.

We have $y = A x + \epsilon$, where $A \in \mathbb{C}^{n \times p}$ is essentially the analogue of the DFT matrix but also includes the complex amplitudes, $x \in \mathbb{C}^p$ holds the Fourier coefficients we want to solve for, and $\epsilon \sim \mathcal{CN}(0, \sigma^2 I_n)$ is a circularly-symmetric complex normal noise vector (variance $\sigma^2$ per sample).
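
For concreteness, here is a rough R sketch of how such an $A$ might be built; the sample times, basis frequencies, and per-sample amplitudes below are just placeholders for my actual measurement setup:

# Rough sketch of the design matrix; t_samples, freqs, and amps are placeholders
n <- 64; p <- 16
t_samples <- sort(runif(n))                        # non-uniform sample times
freqs     <- 0:(p - 1)                             # Fourier basis frequencies
amps      <- complex(modulus  = runif(n, 0.5, 1.5),
                     argument = runif(n, -pi, pi)) # arbitrary (known) complex amplitudes

# DFT-like matrix with each row scaled by that sample's complex amplitude
A <- amps * exp(2i * pi * outer(t_samples, freqs))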

Due to non-uniform sampling, $A$ does not necessarily have full rank and is most definitely ill-conditioned (large ratio between highest and lowest singular values). So, we need to add regularization.

So, the problem I am trying to solve is:

$$ \min_x \|x\|^2_2 \,\, \text{ subject to } \|y - Ax\|^2_2 \le \delta \tag{1}\label{eq1} $$

This is a nice way to frame the problem for two reasons:

  1. Our knowledge of the noise term allows us to set $\delta \approx n \sigma^2$ - this is the expected RSS. This constraint prevents underfitting.

  2. $\|x\|^2_2$ has the interpretation of being the "total power" in the model. Finding the model with minimum power prevents overfitting.

These conditions should hit the right spot in the bias-variance tradeoff.

How do you go about solving this sort of problem?


Here's some of my work on the problem:

I don't know how to rigorously prove this statement, but I am fairly certain my problem is essentially equivalent to Ridge Regression.

$$ \hat{x}_\lambda = \underset{x}{\operatorname{argmin}} \, (\| y - A x \|^2_2 + \lambda \|x\|^2_2)$$

Is this true?

You can claim that as $\lambda$ decreases, $\delta$ increases (if you switch the inequality to an equality in (1)). Ideally, there's a nice way to numerically solve (1) preserving the interpretability of $\delta$.

An idea is to write the RSS as a function of $\lambda$,

$$ f(\lambda) = \| y - A \hat{x}_\lambda \|^2_2 $$

and solve for $\lambda$ in the equation: $f(\lambda) = \delta$.

This gives us:

$$\begin{aligned} \| y - A \hat{x}_\lambda \|^2_2 &= \delta \\ (y - A \hat{x}_\lambda)^* (y - A \hat{x}_\lambda) &= \delta \end{aligned}$$

where $X^*$ is the conjugate transpose of $X$. We know that,

$$ \hat{x}_\lambda = (A^* A + \lambda I)^{-1} A^* y $$

So, we have:

$$ (y - A (A^* A + \lambda I)^{-1} A^* y)^* \, (y - A (A^* A + \lambda I)^{-1} A^* y) = \delta $$

I am not sure how to approach this.
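
For what it's worth, evaluating $f(\lambda)$ at a given $\lambda$ via the closed form above is easy enough; here is a rough R sketch, assuming the complex $A$ and $y$ are already in hand:

# Evaluate f(lambda) = ||y - A x_hat_lambda||^2 using the closed form above;
# A and y are assumed to be the complex design matrix and data
f_rss <- function(lambda, A, y) {
  AH    <- Conj(t(A))                                  # conjugate transpose A^*
  x_hat <- solve(AH %*% A + lambda * diag(ncol(A)), AH %*% y)
  sum(Mod(y - A %*% x_hat)^2)
}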

XYZT

1 Answer


Your intuition about the connection between your problem and Ridge regression is correct.

Let's consider the least squares regression problem:

$$ \begin{bmatrix} y \\ 0 \end{bmatrix} = \begin{bmatrix} A \\ \lambda I_p \end{bmatrix} x + \begin{bmatrix} \epsilon \\ e \end{bmatrix} $$

where we have "stacked" the two regression problems $y = Ax + \epsilon$ and $0 = \lambda I_p x + e$. This is equivalent to ridge regression: see @whuber's answer to How to derive the ridge regression solution?.
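
Writing out the stacked objective makes the equivalence explicit:

$$ \left\| \begin{bmatrix} y \\ 0 \end{bmatrix} - \begin{bmatrix} A \\ \lambda I_p \end{bmatrix} x \right\|_2^2 = ||y - Ax||_2^2 + \lambda^2 ||x||_2^2 $$

so the stacked least squares problem is ridge regression with penalty parameter $\lambda^2$ (rather than $\lambda$) multiplying $||x||_2^2$.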

Clearly, if we set $\lambda = 0$, the first block is just the ordinary regression problem $y = Ax + \epsilon$, while as $\lambda \to \infty$ the second block dominates and the solution approaches that of the regression problem $0 = I_p x + e$ alone. The latter is evidently equivalent to $\min_x ||x||_2^2$, whose solution is $x = 0$.

Let $x_{\lambda}$ be the least squares solution to this problem for a given value of $\lambda$. Then we define:

$$ ||y - Ax_{\lambda}||_2^2 = \delta_{\lambda}$$

$x_{\lambda}$ is the solution to $\min_x ||x||_2^2 \; \text{s.t.} \; ||y - Ax||_2^2 \leq\delta_{\lambda}$.

One proof of this is by contradiction. Assume there is an $x'$ that satisfies the constraint and achieves a smaller value of $||x||_2^2$. Then it would have achieved a smaller value of $||e||_2^2 \; (= \lambda^2||x||_2^2)$ in the original regression problem while achieving the same or a smaller value of $||\epsilon||_2^2$. Since the least squares solution minimizes $||e||_2^2 + ||\epsilon||_2^2$, and $x'$ reduces one or both terms without increasing the other, $x_{\lambda}$ could not be the least squares solution. But it is, by definition; therefore no such $x'$ can exist.

How do we find $\lambda$ such that $\delta_{\lambda} = $ the desired $\delta$? A little thought will convince us that $\delta_{\lambda}$ is strictly monotonic (increasing) in $\lambda$, implying that this is a straightforward one-dimensional root finding problem, and can be solved using any number of off-the-shelf routines.

Here's some sample code in R, solving a simpler (non-complex) version of your problem:

library(MASS)

# Original problem
A <- matrix(rnorm(100), 10, 10)
x_true <- rep(1, 10)
y <- (A %*% x_true + rnorm(10))[,1]

# Extension to Ridge formulation
yr <- c(y, rep(0, 10))

# Objective function for root-finding routine
foo <- function(lambda, delta) {
  Ar <- rbind(A, diag(lambda, 10))
  e <- residuals(lm(yr~Ar))[1:10]  # note: lm() adds an intercept; use yr ~ Ar - 1 to match the stated model exactly
  sum(e*e) - delta
}

# We handwave over bracketing lambda, but, due to monotonicity, it's easy to do
lambda <- uniroot(foo, lower=0.0001, upper=100, delta=5)$root

Ar <- rbind(A, diag(lambda, 10))
m1 <- lm(yr~Ar)

# Compare this to our target delta of 5
e <- residuals(m1)[1:10]
sum(e*e)
[1] 4.999904   
 
> coef(m1)  # constrained coefficients
(Intercept)         Ar1         Ar2         Ar3         Ar4         Ar5         Ar6 
 -0.9502549   0.5227828   1.1742053   0.7032605   0.5329104   0.7776386   0.5640878 
        Ar7         Ar8         Ar9        Ar10 
  0.6192965   1.2160958   0.8605507   0.3305862 

The unconstrained coefficients, i.e., the minimizers of $||x||_2^2$ with the RSS constraint dropped, would naturally all equal $0$.
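
For your actual complex-valued problem, the same root-finding idea carries over, but (as far as I know) lm() does not accept complex data, so the ridge estimate can instead be computed from the normal equations, $(A^*A + \lambda I)^{-1}A^*y$, from your derivation; note this uses your $\lambda$ parameterization rather than the $\lambda^2$ of the stacked form. A rough, self-contained sketch on toy complex data (everything below is a placeholder):

# Rough sketch for the complex case; A, y, sigma, and delta are toy placeholders
set.seed(1)
n <- 30; p <- 10; sigma <- 1
A <- matrix(complex(real = rnorm(n * p), imaginary = rnorm(n * p)), n, p)
x_true <- complex(real = rnorm(p), imaginary = rnorm(p))
noise  <- complex(real = rnorm(n, sd = sigma / sqrt(2)),
                  imaginary = rnorm(n, sd = sigma / sqrt(2)))
y     <- A %*% x_true + noise
delta <- n * sigma^2                        # target RSS, as in the question

ridge_fit <- function(lambda) {
  AH <- Conj(t(A))                          # conjugate transpose A^*
  solve(AH %*% A + lambda * diag(p), AH %*% y)
}
rss <- function(lambda) sum(Mod(y - A %*% ridge_fit(lambda))^2)

# rss(lambda) is increasing in lambda, so bracket the root and hand it to uniroot
# (widen the bracket if rss at the lower end already exceeds delta)
lambda_hat <- uniroot(function(l) rss(l) - delta, lower = 1e-8, upper = 1e6)$root
x_hat <- ridge_fit(lambda_hat)
rss(lambda_hat)                             # should come out very close to delta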

jbowman
  • Why do you say $\delta$ is monotonically increasing in $\lambda$? I thought it would be decreasing. – XYZT Sep 28 '21 at 22:03
  • The way I have it set up is that large $\lambda$ will force $x$ closer to $0$, which will naturally increase the errors associated with $y - Ax$. This happens because the second block of the regression will cause least squares to try to make $\lambda I_p x$ as close to zero as possible, so large $\lambda \to $ small $x$. – jbowman Sep 29 '21 at 01:01