19

I've been using iteratively reweighted least squares (IRLS) to minimize functions of the following form,

$J(m) = \sum_{i=1}^{N} \rho \left(\left| x_i - m \right|\right)$

where $N$ is the number of data points $x_i \in \mathbb{R}$, $m \in \mathbb{R}$ is the robust estimate that I want, and $\rho$ is a suitable robust penalty function. Let's say $\rho$ is convex (though not necessarily strictly convex) and differentiable for now. A good example of such a $\rho$ is the Huber loss function.

What I've been doing is differentiating $J(m)$ with respect to $m$ (and manipulating) to obtain,

$-\frac{dJ}{dm}= \sum_{i=1}^{N} \frac{\rho'\left( \left|x_i-m\right|\right) }{\left|x_i-m\right|} \left( x_i-m \right) $

and iteratively solving this by setting it equal to 0 and fixing the weights at iteration $k$ to $w_i(k) = \frac{\rho'\left( \left|x_i-m(k)\right|\right) }{\left|x_i-m(k)\right|}$ (note that the apparent singularity at $x_i=m(k)$ is really a removable singularity in all $\rho$'s I might care about). Then I obtain,

$\sum_{i=1}^{N} w_i(k) \left( x_i-m(k+1) \right)=0$

and I solve to obtain $m(k+1) = \frac{\sum_{i=1}^{N} w_i(k) x_i}{ \sum_{i=1}^{N} w_i(k)}$.

I repeat this fixed-point iteration until "convergence". Note that if you reach a fixed point, you are optimal: the derivative is 0 there, and the function is convex.
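
To make the update concrete, here is a minimal Python sketch of the iteration (my own illustration; my actual experiments were in Matlab). The Huber weights, the conventional tuning constant $\delta = 1.345$, and the function names are assumptions for the example, not something fixed by the question:

```python
import numpy as np

def huber_weights(r, delta=1.345):
    """IRLS weights w_i = rho'(|r_i|)/|r_i| for the Huber penalty:
    1 on the quadratic branch, delta/|r_i| on the linear branch."""
    a = np.abs(r)
    # np.maximum avoids a 0/0 warning; the a <= delta branch already
    # handles the removable singularity at r = 0 (weight 1).
    return np.where(a <= delta, 1.0, delta / np.maximum(a, 1e-12))

def irls_location(x, delta=1.345, tol=1e-10, max_iter=100):
    """Fixed-point iteration m(k+1) = sum(w_i x_i) / sum(w_i)."""
    m = np.mean(x)  # sample mean as the starting value
    for _ in range(max_iter):
        w = huber_weights(x - m, delta)
        m_new = np.sum(w * x) / np.sum(w)
        if abs(m_new - m) < tol:
            break
        m = m_new
    return m_new

x = np.array([0.1, -0.3, 0.2, 0.0, 5.0])  # one gross outlier
print(np.mean(x), irls_location(x))       # the robust estimate resists the outlier
```

The sample mean is used as the starting value, which is how I initialize the procedure in practice.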

I have two questions about this procedure:

  1. Is this the standard IRLS algorithm? After reading several papers on the topic (and they were very scattered and vague about what IRLS is), this is the most consistent definition of the algorithm I can find. I can post the papers if people want, but I actually didn't want to bias anyone here. Of course, you can generalize this basic technique to many other types of problems involving vector $x_i$'s and arguments other than $\left|x_i-m(k)\right|$, provided the argument is a norm of an affine function of your parameters. Any help or insight would be great on this.
  2. Convergence seems to work in practice, but I have a few concerns about it. I've yet to see a proof of it. After some simple Matlab simulations I see that one iteration of this is not a contraction mapping (I generated two random instances of $m$, computed $\frac{\left|m_1(k+1) - m_2(k+1)\right|}{\left|m_1(k)-m_2(k)\right|}$, and saw that this ratio is occasionally greater than 1; a sketch of this check appears after this list). The mapping defined by several consecutive iterations is also not strictly a contraction mapping, but the empirical probability of the Lipschitz ratio exceeding 1 becomes very low. So is there a notion of a contraction mapping in probability? What is the machinery I'd use to prove that this converges? Does it even converge?
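
Here is a rough Python analogue of that Matlab check (a sketch under assumptions: Huber weights, $\delta = 1.345$, and an arbitrary standard-normal sample and seed):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(50)   # a fixed sample; the map below is conditioned on it
delta = 1.345                 # illustrative Huber tuning constant

def step(m):
    # one IRLS update m(k) -> m(k+1) with Huber weights
    a = np.abs(x - m)
    w = np.where(a <= delta, 1.0, delta / np.maximum(a, 1e-12))
    return np.sum(w * x) / np.sum(w)

# ratio |A(m1) - A(m2)| / |m1 - m2| for random pairs; any value > 1
# rules out a global contraction with a constant below 1
ratios = []
for _ in range(10_000):
    m1, m2 = rng.uniform(x.min(), x.max(), size=2)
    ratios.append(abs(step(m1) - step(m2)) / abs(m1 - m2))
print("fraction of ratios exceeding 1:", np.mean(np.array(ratios) > 1))
```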

Any guidance at all is helpful.

Edit: I like the paper on IRLS for sparse recovery/compressive sensing by Daubechies et al. 2008 "Iteratively Re-weighted Least Squares Minimization for Sparse Recovery" on the arXiv. But it seems to focus mostly on weights for nonconvex problems. My case is considerably simpler.

Chris A.
  • Looking at the wiki page on [IRWLS](http://en.wikipedia.org/wiki/Iteratively_reweighted_least_squares) I struggle to see the difference between the procedure you describe and IRWLS (they just use $|y_i-\pmb x_i'\pmb\beta|^2$ as their particular $\rho$ function). Can you explain in what ways you think the algorithm you propose is *different* from IRWLS? – user603 May 24 '14 at 07:38
  • I never stated that it was different, and if I implied it, I didn't mean to. – Chris A. May 27 '14 at 06:48

1 Answer

11

As for your first question, one would first have to define "standard", or acknowledge that a "canonical model" has gradually been established. As a comment indicated, the way you use IRWLS appears to be fairly standard.

As for your second question, "contraction mapping in probability" could be linked (however informally) to the convergence of "recursive stochastic algorithms". From what I have read, there is a huge literature on the subject, mainly in engineering. In economics, we use a tiny bit of it, especially the seminal works of Lennart Ljung (the first paper was Ljung (1977)), which showed that the convergence (or not) of a recursive stochastic algorithm can be determined by the stability (or not) of a related ordinary differential equation.

(what follows has been re-worked after a fruitful discussion with the OP in the comments)

Convergence

I will use as reference Saber Elaydi, "An Introduction to Difference Equations", 2005, 3rd ed. The analysis is conditional on the given data sample, so the $x_i$'s are treated as fixed.

The first-order condition for the minimization of the objective function, viewed as a recursion in $m$, $$m(k+1) = \sum_{i=1}^{N} v_i[m(k)] x_i, \;\; v_i[m(k)] \equiv \frac{w_i[m(k)]}{ \sum_{j=1}^{N} w_j[m(k)]} \qquad [1]$$

has a fixed point (the argmin of the objective function). By Theorem 1.13, pp. 27-28 of Elaydi, if the first derivative with respect to $m$ of the RHS of $[1]$, evaluated at the fixed point $m^*$ (denote it $A'(m^*)$), is smaller than unity in absolute value, then $m^*$ is asymptotically stable (AS). Moreover, by Theorem 4.3, p. 179, this also implies that the fixed point is uniformly AS (UAS).
"Asymptotically stable" means that for some range of values around the fixed point, a neighborhood $(m^* \pm \gamma)$, not necessarily small in size, the fixed point is attractive, and so if the algorithm gives values in this neighborhood, it converges. The property being "uniform" means that the boundary of this neighborhood, and hence its size, is independent of the initial value of the algorithm. The fixed point becomes globally UAS if $\gamma = \infty$.
So in our case, if we prove that

$$|A'(m^*)|\equiv \left|\sum_{i=1}^{N} \frac{\partial v_i(m^*)}{\partial m}x_i\right| <1 \qquad [2]$$

we have proven the UAS property, but without global convergence. Then we can either try to establish that the neighborhood of attraction is in fact the whole real line, or that the specific starting value the OP uses, as mentioned in the comments (and it is standard in IRLS methodology), namely the sample mean of the $x_i$'s, $\bar x$, always belongs to the neighborhood of attraction of the fixed point.

We calculate the derivative (using $j$ for the inner summations, since $i$ is fixed) $$\frac{\partial v_i(m^*)}{\partial m} = \frac {\frac{\partial w_i(m^*)}{\partial m}\sum_{j=1}^{N} w_j(m^*)-w_i(m^*)\sum_{j=1}^{N}\frac{\partial w_j(m^*)}{\partial m}}{\left(\sum_{j=1}^{N} w_j(m^*)\right)^2}$$

$$=\frac 1{\sum_{j=1}^{N} w_j(m^*)}\cdot\left[\frac{\partial w_i(m^*)}{\partial m}-v_i(m^*)\sum_{j=1}^{N}\frac{\partial w_j(m^*)}{\partial m}\right]$$ Then

$$A'(m^*) = \frac 1{\sum_{i=1}^{N} w_i(m^*)}\cdot\left[\sum_{i=1}^{N}\frac{\partial w_i(m^*)}{\partial m}x_i-\left(\sum_{i=1}^{N}\frac{\partial w_i(m^*)}{\partial m}\right)\sum_{i=1}^{N}v_i(m^*)x_i\right]$$

$$=\frac 1{\sum_{i=1}^{N} w_i(m^*)}\cdot\left[\sum_{i=1}^{N}\frac{\partial w_i(m^*)}{\partial m}x_i-\left(\sum_{i=1}^{N}\frac{\partial w_i(m^*)}{\partial m}\right)m^*\right]$$

and

$$|A'(m^*)| <1 \Rightarrow \left|\sum_{i=1}^{N}\frac{\partial w_i(m^*)}{\partial m}(x_i-m^*)\right| < \left|\sum_{i=1}^{N} w_i(m^*)\right| \qquad [3]$$

we have

$$\begin{align}\frac{\partial w_i(m^*)}{\partial m} = &\frac{-\rho''(|x_i-m^*|)\cdot \frac {x_i-m^*}{|x_i-m^*|}|x_i-m^*|+\frac {x_i-m^*}{|x_i-m^*|}\rho'(|x_i-m^*|)}{|x_i-m^*|^2} \\ \\ &=\frac {x_i-m^*}{|x_i-m^*|^3}\rho'(|x_i-m^*|) - \rho''(|x_i-m^*|)\cdot \frac {x_i-m^*}{|x_i-m^*|^2} \\ \\ &=\frac {x_i-m^*}{|x_i-m^*|^2}\cdot \left[\frac {\rho'(|x_i-m^*|)}{|x_i-m^*|}-\rho''(|x_i-m^*|)\right]\\ \\ &=\frac {x_i-m^*}{|x_i-m^*|^2}\cdot \left[w_i(m^*)-\rho''(|x_i-m^*|)\right] \end{align}$$

Inserting this into $[3]$ we have

$$\left|\sum_{i=1}^{N}\frac {x_i-m^*}{|x_i-m^*|^2}\cdot \left[w_i(m^*)-\rho''(|x_i-m^*|)\right](x_i-m^*)\right| < \left|\sum_{i=1}^{N} w_i(m^*)\right|$$

$$\Rightarrow \left|\sum_{i=1}^{N}w_i(m^*)-\sum_{i=1}^{N}\rho''(|x_i-m^*|)\right| < \left|\sum_{i=1}^{N} w_i(m^*)\right| \qquad [4]$$

This is the condition that must be satisfied for the fixed point to be UAS. Since in our case the penalty function is convex, the sums involved are non-negative. So, provided that $\sum_{i=1}^{N}\rho''(|x_i-m^*|) > 0$ (if this sum were zero, we would have $|A'(m^*)| = 1$ and the criterion would be inconclusive), condition $[4]$ is equivalent to

$$\sum_{i=1}^{N}\rho''(|x_i-m^*|) < 2\sum_{i=1}^{N}w_i(m^*) \qquad [5]$$
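
As a numerical sanity check on this derivation (my own sketch, not part of the argument above), one can take a smooth convex $\rho$, here the pseudo-Huber penalty $\rho(u) = \delta^2\left(\sqrt{1+(u/\delta)^2}-1\right)$, iterate to the fixed point, and compare the value $|A'(m^*)| = \left|\sum_i w_i - \sum_i \rho''_i\right| / \sum_i w_i$ implied by $[4]$ with a finite-difference derivative of the map:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(30)
delta = 1.0   # illustrative scale for the pseudo-Huber penalty

# pseudo-Huber: rho(u) = delta^2 (sqrt(1 + (u/delta)^2) - 1), smooth and convex
def w(u):    return 1.0 / np.sqrt(1.0 + (u / delta) ** 2)   # rho'(u)/u
def rho2(u): return (1.0 + (u / delta) ** 2) ** (-1.5)      # rho''(u)

def A(m):
    wv = w(np.abs(x - m))
    return np.sum(wv * x) / np.sum(wv)

m = x.mean()          # standard starting value
for _ in range(500):  # iterate to (near) the fixed point m*
    m = A(m)

u = np.abs(x - m)
analytic = abs(w(u).sum() - rho2(u).sum()) / w(u).sum()   # |A'(m*)| from [4]
h = 1e-6
numeric = abs(A(m + h) - A(m - h)) / (2 * h)              # central difference
print(analytic, numeric)  # the two should agree, and both should be < 1
```

For this $\rho$, $\rho''(u) \leq w(u)$ pointwise, so condition $[5]$ holds automatically.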

If $\rho(|x_i-m|)$ is Huber's loss function, then we have a quadratic ($q$) and a linear ($l$) branch,

$$\rho(|x_i-m|)=\cases{ (1/2)|x_i- m|^2 \qquad\;\;\;\; |x_i-m|\leq \delta \\ \\ \delta\big(|x_i-m|-\delta/2\big) \qquad |x_i-m|> \delta}$$

and

$$\rho'(|x_i-m|)=\cases{ |x_i- m| \qquad |x_i-m|\leq \delta \\ \\ \delta \qquad \qquad \;\;\;\; |x_i-m|> \delta}$$

$$\rho''(|x_i-m|)=\cases{ 1\qquad |x_i-m|\leq \delta \\ \\ 0 \qquad |x_i-m|> \delta} $$

$$\cases{ w_{i,q}(m) =1\qquad \qquad \qquad |x_i-m|\leq \delta \\ \\ w_{i,l}(m) =\frac {\delta}{|x_i-m|} <1 \qquad |x_i-m|> \delta} $$

Since we do not know how many of the $|x_i-m^*|$'s place us in the quadratic branch and how many in the linear, we decompose condition $[5]$ as ($N_q + N_l = N$)

$$\sum_{i=1}^{N_q}\rho_q''+\sum_{i=1}^{N_l}\rho_l'' < 2\left[\sum_{i=1}^{N_q}w_{i,q} +\sum_{i=1}^{N_l}w_{i,l}\right]$$

$$\Rightarrow N_q + 0 < 2\left[N_q +\sum_{i=1}^{N_l}w_{i,l}\right] \Rightarrow 0 < N_q+2\sum_{i=1}^{N_l}w_{i,l}$$

which holds whenever $N_q \geq 1$, i.e. whenever at least one residual falls in the quadratic branch (if all residuals fall in the linear branch, then $\sum_{i}\rho'' = 0$ and, as noted above, $|A'(m^*)| = 1$, so the criterion is inconclusive). So, apart from that degenerate case, for the Huber loss function the fixed point of the algorithm is uniformly asymptotically stable, irrespective of the $x$'s. We note that the first derivative is smaller than unity in absolute value for any $m$, not just the fixed point.
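
That last note can at least be probed empirically. The following sketch (mine; arbitrary data, conventional $\delta = 1.345$) scans $|A'(m)|$ by central differences over a wide grid for the Huber map, giving a quick empirical read on the claim:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.standard_normal(40)
delta = 1.345   # conventional Huber tuning constant

def A(m):
    # one IRLS update with Huber weights
    a = np.abs(x - m)
    w = np.where(a <= delta, 1.0, delta / np.maximum(a, 1e-12))
    return np.sum(w * x) / np.sum(w)

# scan |A'(m)| by central differences over a grid well beyond the data range;
# expected to stay below 1 when some residuals fall in the quadratic branch
grid = np.linspace(x.min() - 5.0, x.max() + 5.0, 2001)
h = 1e-6
deriv = [(A(m + h) - A(m - h)) / (2.0 * h) for m in grid]
print("max |A'(m)| on the grid:", np.max(np.abs(deriv)))
```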

What we should do now is either prove that the UAS property is also global, or prove that, if $m(0) = \bar x$, then $m(0)$ belongs to the neighborhood of attraction of $m^*$.

Alecos Papadopoulos
  • Thanks for the response. Give me some time to analyze this answer. – Chris A. May 27 '14 at 06:49
  • Certainly. After all, the question waited 20 months. – Alecos Papadopoulos May 27 '14 at 08:49
  • Yeah, I was reminded of the problem and decided to put up a bounty. :) – Chris A. May 27 '14 at 17:48
  • Lucky me. I wasn't there 20 months ago; I would have taken up this question, bounty or not. – Alecos Papadopoulos May 27 '14 at 21:49
  • Thanks so much for this response. It's looking like, so far, that you've earned the bounty. BTW, your indexing on the derivative of $v_i$ w.r.t $m$ is notationally weird. Couldn't the summations on the second line of this use another variable, such as $j$? – Chris A. May 28 '14 at 01:16
  • So to summarize your proof: basically, you're saying that a theorem from Ljung states that a stable fixed point implies stability of the algorithm, and then much of your work goes into proving that the fixed point is stable. What other conditions are necessary in Ljung's theorem for this? For instance, is continuity of the function required? I can think of functions with a stable fixed point that certainly do not converge, but they are all pathological. – Chris A. May 28 '14 at 01:21
  • I got his paper and I'm reading it now. – Chris A. May 28 '14 at 01:31
  • I am not sure I understand the problem with the indexing of $v_i$. $v$ is just compact notation for the weights on each $x$, so naturally they follow the indexing of the $x$'s. – Alecos Papadopoulos May 28 '14 at 01:49
  • I mentioned Ljung for your information. You don't need his results to say that a recursive function has a stable fixed point if the derivative at the fixed point is smaller than unity in absolute value; that's basic stability theory. Ljung's results are concerned with increasing incoming information (increasing available $x$'s), so he has to take into account the variability of the $x$'s, which are random variables. In your case the sample (the $x$'s) is fixed, which makes things much easier, since you can prove convergence conditional on the $x$'s (as though they were deterministic). – Alecos Papadopoulos May 28 '14 at 01:54
  • I understand. Thanks for the clarification. I guess my concern is more whether this procedure always converges. Correct me if I'm wrong, but it seems that you proved only that there's a stable fixed point and this is not sufficient for the procedure to always converge. This isn't my area of expertise so I'd appreciate this clarification. – Chris A. May 28 '14 at 02:39
  • Stability and convergence are related. I will add something about the matter in my answer, in about 12 hours from now. – Alecos Papadopoulos May 28 '14 at 09:47
  • Thanks Alecos. You've earned the bounty for sure. But I'm really curious about the answer to my last question. – Chris A. May 28 '14 at 17:21
  • By the way, the problem with the indexing was that you were defining partials of $v_i$ w.r.t. $m$ and then also reused $i$ for some internal summations in this definition, when in reality it should have been a different index, e.g., $j$. I know what you meant, but it seemed ambiguous. – Chris A. May 28 '14 at 23:27
  • Ok Alecos, not to beat this to death, but you can come up with some very simple functions that have a derivative of less than one at the fixed point and yet do not converge. An example is $f(x)=-(x-5)^3+5$, which oscillates between $x=4$ and $x=6$ even though the derivative at the fixed point $x=5$ is 0. What am I missing? – Chris A. May 29 '14 at 23:23
  • No, that's interesting. This is a good example of a function that is "asymptotically stable" but not _globally_ asymptotically stable (start with $m_1=4.0000000001$ and it will converge). An oversight on my part: as it stands, my answer proves only that the algorithm is the former, and not necessarily the latter. This means that there is certainly a (maybe very large) region around the fixed point that guarantees convergence if the first $m_1$ is inside it, but we have yet to prove that it will converge for an arbitrary $m_1$. Hmm, it gets better and better. – Alecos Papadopoulos May 29 '14 at 23:39
  • Ok, thanks for your honesty. I understand. If I get some time, I'll try to use the technique you showed to prove it, because I think this IRWLS case has global asymptotic stability. – Chris A. May 30 '14 at 00:07
  • Chris, how do you obtain the _first_ value of $m$? – Alecos Papadopoulos May 30 '14 at 00:21
  • I simply use the arithmetic mean of the $x_i$. This can be far from the minimizer of my cost function, but it's still a reasonable starting point that's guaranteed to be in $[\min_i x_i, \max_i x_i]$. – Chris A. May 30 '14 at 00:53