5

I am currently reading a paper stating the following regression problem $$\min \sum_{i=1}^N ||\beta\cdot x_i-y_i|| \\ \text{subject to } \sum_{j=2}^M ||\beta_{j}-\beta_{j-1}|| \leq S $$ for vectors $x_1,\dots,x_N, \beta \in \mathbb{R}^M$ and $y_i \in \mathbb{R}$. This is a slightly modified version of the fused lasso problem for regression. The authors state that this problem can be solved via maximum likelihood estimation. I would have used quadratic program solvers instead, but those are of course slower. Are there closed-form solutions (from MLE) for this kind of problem?
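
For concreteness, this is roughly how I would hand the problem to a generic convex/QP solver (a minimal sketch with made-up data; cvxpy and the absolute-value reading of $||\cdot||$ are my own choices, not the paper's):

```python
# Minimal sketch: the constrained problem posed for a generic convex solver.
# cvxpy, the toy data and the L1 reading of ||.|| are my assumptions, not the paper's.
import cvxpy as cp
import numpy as np

N, M, S = 200, 10, 1.0                      # samples, features, constraint budget
rng = np.random.default_rng(0)
X = rng.normal(size=(N, M))                 # rows are the samples x_i
y = rng.normal(size=N)

beta = cp.Variable(M)
objective = cp.Minimize(cp.norm(X @ beta - y, 1))       # sum_i |beta . x_i - y_i|
constraints = [cp.norm(cp.diff(beta), 1) <= S]          # sum_j |beta_j - beta_{j-1}| <= S
cp.Problem(objective, constraints).solve()
print(beta.value)
```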

AlexConfused
  • Do you have a reference to the paper? – Sextus Empiricus Aug 28 '18 at 15:59
  • http://people.cs.vt.edu/gangwang/ccs18.pdf The mentioned paragraph on the computation is directly below equation (7) – AlexConfused Aug 28 '18 at 16:15
  • This is actually related to the penalty used by P-splines, see https://projecteuclid.org/download/pdf_1/euclid.ss/1038425655 (Eilers and Marx, Flexible Smoothing with B-splines and Penalties). – jbowman Aug 28 '18 at 16:30
  • The regression problem you present defines an estimator based on a loss function that is being minimized. Meanwhile, maximum likelihood (ML) defines an estimator based on maximizing the likelihood. I do not think the two can ever coincide, since ML would never yield a penalized solution such as this one. ML is not even a method for calculating parameter estimates, unlike quadratic solvers, gradient descent, Newton–Raphson, etc. (I have a problem finding the right term for all of these; optimization methods, maybe?). ML tells you what you are looking for but not how to calculate it mechanically. – Richard Hardy Sep 16 '18 at 08:38
  • Therefore, I am not sure whether the current formulation of the question quite makes sense. – Richard Hardy Sep 16 '18 at 08:40

2 Answers

1

How about transforming the vectors as follows:

$$z_j = \sum_{i=j}^{M} x_i $$

where the index refers to the index of the regressors.

then

$$\beta^\star \cdot z = \beta \cdot x$$

for

$$\beta_j = \sum_{i=1}^{j} \beta_i^\star \quad \text{ or (when $i \geq 2$)} \quad \beta^\star_i = \beta_i-\beta_{i-1}$$

Now you just have to solve the problem for $z$ (instead of $x$) with the easier constraint $ \sum_{i=2}^M ||\beta^\star_{i}|| \leq S$.

This turns into ridge regression (with $\Gamma_{11} = 0$ in the Tikhonov regularization matrix) when we assume $\Vert \cdot \Vert$ is the quadratic norm, in which case there is a closed-form solution. Otherwise, for the lasso case, there are good algorithms to solve it.
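
For the quadratic case, here is a minimal numerical sketch of that closed form, using the penalized (Lagrangian) version with a multiplier $\lambda$ standing in for the bound $S$ (numpy and the toy data are my own, not from the paper):

```python
# Sketch of the closed-form ridge solution on the transformed design.
# The penalty weight lam plays the role of the bound S (Lagrangian form).
import numpy as np

rng = np.random.default_rng(1)
N, M, lam = 100, 5, 2.0
X = rng.normal(size=(N, M))                 # design matrix, columns = regressors
y = rng.normal(size=N)

# transformed design: z_j = x_j + x_{j+1} + ... + x_M (sums over regressor columns)
Z = np.flip(np.cumsum(np.flip(X, axis=1), axis=1), axis=1)

# Tikhonov matrix: identity with Gamma_{11} = 0, so beta*_1 is left unpenalized
Gamma = np.eye(M)
Gamma[0, 0] = 0.0

# closed-form solution beta* = (Z'Z + lam * Gamma'Gamma)^{-1} Z'y
beta_star = np.linalg.solve(Z.T @ Z + lam * (Gamma.T @ Gamma), Z.T @ y)

# map back to the original parameterization: beta_j = beta*_1 + ... + beta*_j
beta = np.cumsum(beta_star)
```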


Schematic example for $\beta^\star \cdot z = \beta \cdot x$ using three variables:

$$\begin{array}{ccccccc}
\beta_1^\star z_1 &=& \beta_1^\star x_1 & + & \beta_1^\star x_2 &+& \beta_1^\star x_3 \\
\beta_2^\star z_2 &=& & & \beta_2^\star x_2 &+& \beta_2^\star x_3 \\
\beta_3^\star z_3 &=& & & && \beta_3^\star x_3 \\
& & \vert\vert & & \vert\vert & & \vert\vert \\
& & \beta_1 x_1 & & \beta_2 x_2 & & \beta_3 x_3 \\
\end{array}$$
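
The identity can also be checked numerically; a small sketch with made-up numbers and three regressors, as in the schematic (numpy is my choice here):

```python
# Numerical check that beta* . z equals beta . x under the transformation.
import numpy as np

X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])            # two samples, three regressors
beta_star = np.array([0.5, -1.0, 2.0])

Z = np.flip(np.cumsum(np.flip(X, axis=1), axis=1), axis=1)   # z_j = x_j + ... + x_3
beta = np.cumsum(beta_star)                                  # beta_j = beta*_1 + ... + beta*_j

assert np.allclose(Z @ beta_star, X @ beta)
```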


To avoid confusion about indices

The article speaks of the cost function

$$L(f(\mathbf{x}),y) = \sum_{i=1}^N \Vert \boldsymbol\beta \mathbf{x_i} -y_i \Vert$$

with $\mathbf{x_i}$ the $i$-th sample, represented by a vector $(x_{i1}, x_{i2}, \cdots, x_{iM})$ of $M$ different features. This can be simplified to a matrix notation using the design matrix $\mathbf{X}$ whose $i$-th row is $\mathbf{x_i}$, so that each column of $\mathbf{X}$ is a vector of the same feature for the different samples $i$.

$$L(f(\mathbf{x}),\mathbf{y}) = \Vert \mathbf{X} \boldsymbol\beta^T -\mathbf{y} \Vert$$

When I speak about the index of the regressors, I mean the column index of the matrix $\mathbf{X}$ (so adding columns of $\mathbf{X}$ is equivalent to adding the corresponding entries in the row vectors $\mathbf{x_i}$).

see also https://en.wikipedia.org/wiki/Design_matrix

Sextus Empiricus
  • I don't think this is correct. Firstly, I am not sure if your transformation works, and even if it does, you will end up solving a regression problem with a ridge penalty, i.e. your parameters $\beta$ will not have a form such that neighboring parameters are within $S$ of each other. – AlexConfused Aug 28 '18 at 13:58
  • I changed the transformation (which was indeed in error). The $\beta^\star$ will indeed not necessarily be close to each other. Instead, the $\beta^\star$ will be as small as possible, such that the neighbors in $\beta_j=\sum_{i=1}^j \beta_i^\star$ will be as close as possible. – Sextus Empiricus Aug 28 '18 at 14:01
  • In your transformation $z_j = \sum_{i=j}^N x_i$ , is $x_i$ the $i$-th vector from the training set (as introduced in question) or the $i$-th entry of some vector $x$? – AlexConfused Aug 28 '18 at 14:44
  • It refers to the i-th vector – Sextus Empiricus Aug 28 '18 at 14:56
  • I think it should refer to the i-th entry. Only then is the transformation correct. – AlexConfused Aug 28 '18 at 16:25
  • @AlexConfused I was adding the regressors, not the entries. See the final part with the example of how it works out (the $x_i$ there are the vectors, with the index $_i$ referring to the $i$-th regressor). So in that example there are three regressors, but there can be many more measurements. Could you explain *why* you think this is wrong? – Sextus Empiricus Aug 28 '18 at 17:18
  • Let us [continue this discussion in chat](https://chat.stackexchange.com/rooms/82401/discussion-between-alexconfused-and-martijn-weterings). – AlexConfused Aug 29 '18 at 06:20
0

After some research I have found that there is no closed-form solution for this problem (and therefore no MLE in particular). A (fast) solution for this problem has been proposed via solution paths; the algorithm can be found here.

AlexConfused