
Consider the following setup:

  • Let $x_1^k$ and $x_2^k$ be length $N$ vectors of observed reals, where $N$ is about, say, 100,000, $k\in 1:K$, and $K$ is about, say, 200. (So $x_1,\;x_2$ can be thought of as $N\times K$ matrices if helpful).

The purpose of setting $N$ and $K$ to these magnitudes is to 1) show the system is over-identified and 2) rule out naive brute-force (grid-search-like) solutions.

  • Let $y$ also be a length $N$ vector of observations.

  • Let $a$ and $b$ be length-$K$ vectors of reals representing the focal values to be estimated. You may assume $a^k$ and $b^k$ are strictly positive, with reasonable upper bounds $\bar{a}^k$ and $\bar{b}^k$ of around, say, $100.0$.

  • The data generating process is given by: $$ y_n=\sum_{k\in 1:K}|x_{n2}^k b^k - x_{n1}^k a^k| + \epsilon_n \\ s.t.\\ \mathbb{E}(x_{n1}^k\epsilon_n)=\mathbb{E}(x_{n2}^k\epsilon_n)=0\text{ (for all k and n)} $$

What is the "best" way to run the above regression? I am flexible with respect to the objective function, but if it helps narrow down the problem, feel free to use: $$ \min_{a^k,b^k\,\forall k}\sum_{n\in 1:N}(y_n-\hat{y}_n)^2\\ \hat{y}_n\equiv\sum_{k\in 1:K}|x_{n2}^k b^k - x_{n1}^k a^k| $$
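
For concreteness, here is a minimal sketch of the setup and objective in code (Python/NumPy). The Gaussian draws for $x_1$, $x_2$, the uniform draws for the true $a$, $b$, and the unit noise scale are illustrative assumptions only, not part of the problem statement:

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 100_000, 200                      # sizes as in the setup above

# Illustrative data; the distributions here are assumptions for the sketch only.
x1 = rng.normal(size=(N, K))
x2 = rng.normal(size=(N, K))
a_true = rng.uniform(0.1, 10.0, size=K)  # strictly positive focal values
b_true = rng.uniform(0.1, 10.0, size=K)

def y_hat(a, b):
    """\\hat{y}_n = sum_k |x_{n2}^k b^k - x_{n1}^k a^k|, vectorised over n."""
    return np.abs(x2 * b - x1 * a).sum(axis=1)

eps = rng.normal(scale=1.0, size=N)      # noise scale is arbitrary here
y = y_hat(a_true, b_true) + eps

def sse(a, b):
    """The least-squares objective to be minimised over (a, b)."""
    r = y - y_hat(a, b)
    return r @ r
```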

By "best", I am looking for an algorithm, methodology and/or procedure that produces an optimal solution given a reasonable objective function. The procedure should be computationally feasible on a modern computer in a reasonable amount of time. Faster is better, but I also would like some confidence that the solution is truly optimal (or nearly so) given the objective.

You may make other reasonable assumptions, including assumptions about the distributions of $a^k$, $b^k$, and $\epsilon$ (such as normal errors) and/or other inputs if necessary or particularly helpful. Priors are fine too if you want to go a Bayesian route.

Notes and prior work:

  • The data generating process can be reframed as: $$ y_n=\sum_{k\in 1:K}|x_{n2}^k \lambda^k - x_{n1}^k|a^k + \epsilon_n\\ \lambda^k\equiv b^k/a^k $$ This cuts the number of unknowns down by half if $a$ is assumed to have an OLS solution for a chosen $\lambda$ (see the sketch after this list). Obviously this creates numerical issues if $a^k$ is small, but a log transform can help. A bigger issue is that $\lambda^k$ still represents a lot of unknowns. All of the approaches below use this framing of the problem, but it's not fundamental, and I'd be happy to consider others or the original.
  • Grid search, at least naively applied, will not work here due to the dimensionality.
  • LBFGS using an (automatically differentiated, not finite difference) gradient gives an answer, but I'm not very confident in it due to the lack of smoothness. It also takes forever to solve.
  • I've tried global optimization techniques, including simulated annealing and evolutionary strategies but the computational time has been infeasible (at least days). However, feel free to suggest something in this category if you think it will work.
  • I'm hesitant to mention this since I don't want to exclusively imply a Bayesian strategy (machine learning is fine too!), but I'm currently working on a component-wise Metropolis-Hastings approach like the following (letting $N_{++}$ be a normal distribution truncated at 0) $$ \lambda^k\sim N_{++}(\mu_{\lambda},\sigma^{2}_{\lambda})\text{ (prior)} \\ a^k\sim N_{++}(\mu_{a},\sigma^{2}_{a})\text{ (prior)} \\ \sigma^2\sim IG(\alpha /2,\beta/2)\text{ (prior)} \\ L(D|\cdot)=(2\pi\sigma^2)^{-N/2}\exp[-\tfrac{1}{2\sigma^2}(y-\hat{y})^{\prime} (y-\hat{y})] $$ I am somewhat uncomfortable with what I think is an implicit assumption of independence between components of $\lambda$. In addition, I'm worried about the convergence time of the MCMC and about getting comfortable with the posterior draws. I could try to draw values of $\lambda$ as a vector, but I think finding a reasonable proposal distribution will prove challenging in the non-component-wise setup (though I haven't tried).
  • Per the comments, I initially believed the problem is convex. Logic: the absolute value function is (weakly) convex and a sum of convex functions is convex, so $\hat{y}_n$ is convex in the parameters. However, the quadratic objective is not necessarily convex: squaring a convex function preserves convexity only when that function is non-negative, so each term $n$ is convex only if $y_n\le 0$ (which guarantees $\hat{y}_n-y_n\ge 0$).
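
To make the reframing in the first note concrete, here is a hedged sketch of the profiled objective: for a fixed $\lambda$ the model is linear in $a$, so $a$ can be solved by non-negative least squares and only $\lambda$ remains for the outer search. The use of `scipy.optimize.nnls` and the all-ones starting guess are assumptions of the sketch, not part of the procedure described above:

```python
import numpy as np
from scipy.optimize import nnls

def profiled_sse(lam, x1, x2, y):
    """Objective in lambda only: profile out a >= 0 by NNLS and return the SSE.

    With lambda^k = b^k / a^k, the fitted b is recovered as b = lam * a.
    """
    z = np.abs(x2 * lam - x1)   # N x K design matrix implied by this lambda
    a, _ = nnls(z, y)           # enforces a^k >= 0; np.linalg.lstsq if positivity is dropped
    r = y - z @ a
    return r @ r, a

# e.g., evaluate at a starting guess (using x1, x2, y, K from the sketch above);
# the outer search over the K-dimensional lambda is still the hard part.
# sse0, a0 = profiled_sse(np.ones(K), x1, x2, y)
```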
Comments:

  • If I'm understanding this correctly, it's a convex optimisation problem. If so, gradient descent methods will find the global optimum and you can prove a solution is optimal with the KKT conditions. Quasi-Newton methods like L-BFGS-B might not work because of the lack of smoothness. – Thomas Lumley Sep 20 '20 at 04:59
  • That's a good thought. I believe since the absolute value function is weakly convex, the problem is weakly convex as well (sum of many convex functions). I didn't realize SG worked better than LBFGS on non-smooth problems. I had tried ADAM with fairly poor results in the original formulation, but I'll give it a shot with the formulation in the first note. – Matterhorn Sep 20 '20 at 18:03
  • Or not even stochastic gradient descent, but the full thing. Or conjugate gradients. – Thomas Lumley Sep 20 '20 at 21:33
  • So this improved the quality of the output, which I am grateful for, as well as my confidence in said output (doubly grateful!). But it's still very slow, with the gradient taking most of the computational time. – Matterhorn Sep 22 '20 at 23:25
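
For what it's worth, here is a hedged sketch of the full-gradient idea from the comments: projected (sub)gradient descent on the original $(a,b)$ parameterisation with an automatically differentiated, jit-compiled gradient. The choice of jax, the step size, the iteration count, and the $10^{-6}$ positivity floor are all assumptions of the sketch, and jit-compiling the gradient is offered only as one way to address the gradient cost mentioned above:

```python
import jax
import jax.numpy as jnp

def sse(params, x1, x2, y):
    """Squared-error objective; jnp.abs yields a valid subgradient at the kink."""
    a, b = params
    r = y - jnp.abs(x2 * b - x1 * a).sum(axis=1)
    return (r @ r) / r.size   # mean rather than sum: same minimiser, easier step size

# Compile the value-and-gradient evaluation once; this is where the speedup comes from.
loss_and_grad = jax.jit(jax.value_and_grad(sse))

def fit(x1, x2, y, K, steps=5000, lr=1e-3):
    params = (jnp.ones(K), jnp.ones(K))   # positive starting values (assumption)
    for _ in range(steps):
        loss, grads = loss_and_grad(params, x1, x2, y)
        # plain (sub)gradient step, projected onto a > 0, b > 0; lr needs tuning
        params = tuple(jnp.maximum(p - lr * g, 1e-6)
                       for p, g in zip(params, grads))
    return params, loss
```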
