
I have a problem where LAD regression is not giving me a solution, as the R package (L1pack) throws an error whenever there are infinitely many possible solutions*. This occurs in the following example: suppose you want to find the constant $c$ that minimises the sum of absolute distances between it and two $y$ values, that is, minimise $f(c) = |c - y_1| + |c - y_2|$. The median of $\{y_1, y_2\}$ is a solution for $c$, as is any value between $y_1$ and $y_2$. For example:

set.seed(0)
tmp_df <- data.frame(y = sort(runif(2))) # generate data with 2 y values

> tmp_df
          y
1 0.2655087
2 0.8966972
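
As a quick sanity check with the `tmp_df` generated above (the names `f` and `grid` below are just throwaway illustrations), the objective really is flat on $[y_1, y_2]$, so every value in that interval is optimal:

f <- function(const) sum(abs(const - tmp_df$y))        # f(c) = |c - y1| + |c - y2|
grid <- seq(tmp_df$y[1], tmp_df$y[2], length.out = 5)  # a few points spanning [y1, y2]
sapply(grid, f)                                        # all (numerically) equal: the flat region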

But running the regression with L1pack leads to the following error:

> L1pack::lad(y ~ 1, tmp_df)
Error in L1pack::lad(y ~ 1, tmp_df) : 
  L1FIT optimal solution is probably non-unique.

What I would like is for this regression to return the 'median' value whenever the solution is not unique. In this very simple example it is clear that taking the mean of the possible values of $c$, i.e. the midpoint of $[y_1, y_2]$, gives the median. How do I extend this to a more general regression?
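
Just to confirm the simple case before worrying about the general one (again purely illustrative):

c(midpoint = mean(range(tmp_df$y)), median = median(tmp_df$y))  # identical when n = 2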


*: This is not an R-specific question, but I am using R to demonstrate the problem. Also, the quantreg package has no problem supplying one solution even if the supplied solution is not the median.
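
For reference, the quantreg call for the same data looks roughly like this; as far as I can tell it returns one point of the solution set (possibly with a non-uniqueness warning), not necessarily the interval midpoint:

fit <- quantreg::rq(y ~ 1, tau = 0.5, data = tmp_df)  # may warn that the solution is non-unique
coef(fit)  # one point of the solution set; not necessarily the midpoint/median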

Alex
  • In the case of a constant model (`y~1`) it's clear what you mean by the 'median', but in the case of an ordinary simple regression (`y~x`) the solution can be a region (bounded by straight lines) in $(\beta_0,\beta_1)$ space (e.g. see the example data set I made up [here](https://stats.stackexchange.com/questions/148438/minimization-of-the-sum-of-absolute-deviations/148441#148441)); degenerate cases aside, I believe it's always a convex region. What is the intended meaning of `median` in that case? / How do you want to generalize the notion? – Glen_b Apr 26 '17 at 11:23
  • One simple possibility may be to take the midrange of possible values of each coefficient separately. If I'm right that the region should be convex, that will always be in the solution space, though it might not correspond well to every notion of "median". There are [several](http://onlinelibrary.wiley.com/doi/10.1002/0471667196.ess1107.pub2/abstract) higher-dimensional "median" definitions that could be used, that generalize the notion of univariate median in different ways. In addition various L1 fitting algorithms have been proposed over the years and some of those choose an interior point. – Glen_b Apr 26 '17 at 22:09
  • With the linked example, if we look at the y-vs-x plot there's a clear enough line which many people would tend to regard as the natural choice (the red line); the more usual situation would be that the region was a simplex (where again, perhaps there's a natural enough choice even in higher dimensions), but given the ambiguity in the question it would be necessary to be clear about what the intent is, or to rephrase, perhaps, to ask about the possibilities in some way. – Glen_b Apr 26 '17 at 22:15
  • If you don't have a clear sense of how you'd like to generalize to more than one dimension, I could perhaps post the above as an answer explaining why it's not clear what solution to suggest. ... $\quad$[NB While it's not in any notional sense a multivariate median, as a purely practical matter I've sometimes tended to choose the L1 solution which minimizes SSE (generally a corner of the region of possible values, sometimes interior to a section of boundary or even interior to the region). It has the merit of being easy to explain, though it's not especially satisfying as a choice.] – Glen_b Apr 26 '17 at 22:19
  • One of those multivariate medians that could in some situations be a reasonable choice is to choose the *[geometric median](https://en.wikipedia.org/wiki/Geometric_median)* of the points in the region -- to minimize the sum of distances to all of the points in the solution region. In $(\beta_0,\beta_1)$ space that's not generally going to make sense though (because the two are generally in completely different units). As a result I don't think we could rely on it being equivariant to a change of scale, which could be disturbing, though it will also be a problem with several other possibilities. – Glen_b Apr 26 '17 at 23:10
  • Very helpful comments, thanks. Some notes from my end: 1) I have no clear sense of how to generalise beyond one dimension. 2) Your suggestion of taking the midpoints of all possible coefficient choices seems sensible. To achieve this, would one be able to modify the objective function so that the midpoints are returned? (A rough sketch of one such modification appears after this comment thread.) 3) This method would produce a 'natural' looking line with respect to your linked plot ... ctd ... – Alex Apr 26 '17 at 23:16
  • ... ctd ... 4) if we consider that there is a 'natural' looking line in the 2d case $y$ vs $x$, then even in the multiple regression case all possible solutions lead to lines of the form $y$ vs $\hat{y}$. Would it not be possible to find the 'median' at this stage? (Possibly a very silly idea as I have not thought about the geometry of this projection onto one dimension. It will probably be instructive to look at a case with three regression coefficients.) – Alex Apr 26 '17 at 23:21
  • On 1) that's a common issue and probably why some different methods (and hence packages) have different solutions. On 2) I don't know the answer to that right now; there may be a simple way to do it or perhaps one might have to formulate a second optimization problem or something. It might be an interesting new question either here or wherever else looks at optimization (math.SE perhaps, or maybe some other site?). On 3) it's the horizontal line in your example, but for my linked example I think it's parallel to and close to the magenta line; reasonable but not quite as natural. – Glen_b Apr 26 '17 at 23:24
  • 4) is a very interesting idea. I presently have no intuition there. Worth looking into I think. Would you prefer I turn my comments into an answer or do you want to modify the question? – Glen_b Apr 26 '17 at 23:25
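
To make the 'modify the objective' idea from the comments concrete, here is a rough sketch rather than a definitive method: add a tiny squared-error penalty to the L1 loss so that a general-purpose optimiser tends to pick the minimum-SSE point of the L1 solution set, in the spirit of the practical tie-break mentioned above. The function name lad_tiebreak, the least-squares starting value, and the weight eps are all arbitrary illustrative choices.

lad_tiebreak <- function(X, y, eps = 1e-6) {
  obj <- function(beta) {
    r <- y - as.vector(X %*% beta)
    sum(abs(r)) + eps * sum(r^2)   # L1 loss plus a tiny squared-error tie-breaker
  }
  start <- qr.solve(X, y)          # least-squares start, typically close to the L1 solution set
  optim(start, obj, method = "BFGS")$par
}

X <- model.matrix(y ~ 1, data = tmp_df)
lad_tiebreak(X, tmp_df$y)  # for the constant model this lands near median(tmp_df$y)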

0 Answers