
I'm writing a program for evaluating real estate, and I don't really understand the differences between some robust regression models, which is why I don't know which one to choose.

I tried lmrob, ltsReg and rlm. For the same data set, all three methods gave me different values for the coefficients.

I thought it best to use ltsReg because summary(ltsReg()) provides information about the R-squared and p-values, which will help me decide whether to accept or dismiss the model.

Do you think that ltsReg is a good choice?

EDIT: I've just read in Goodness-of-Fit Statistics that the adjusted R-squared is generally the best indicator of fit quality.

user603
Paul
    Both p-values and $R^2$ can be misleading, so choosing a package based on the fact that it outputs them is not really a good criterion for such a choice... – Tim Mar 09 '15 at 10:58
  • then how can I decide if the model is a valid one without plotting it? – Paul Mar 09 '15 at 11:01
    Also consider ordinal regression because it may be more robust and interpretable, and more powerful. – Frank Harrell Mar 09 '15 at 11:19
  • @user603: I confirm it. For the evaluation of a real estate, I create several models, which contain different numbers of characteristics (e.g., 1. price ~ livingArea + floorNumber + age + ...). – Paul Mar 09 '15 at 11:59
  • I don't think this question is opinion based. The answer I propose is unbiased towards any of the two fits and so it is not opinion based. Since it admits an answer that is not opinion based, I think the question isn't either. – user603 Mar 09 '15 at 14:26
    Someone wants to close this question! I don't think that is right: even if on the surface it is about choosing R functions, it is really about the how and why of choosing robust regression methods, and that is on-topic. – kjetil b halvorsen Mar 09 '15 at 14:45
  • @kjetilbhalvorsen: I asked this question because I'm not sure which robust regression I should choose, because I don't understand the differences between them. – Paul Mar 09 '15 at 14:59
  • And the methods you mention, are really made for solving **very** different problems ... – kjetil b halvorsen Mar 09 '15 at 15:00
  • "in R" is incidental to the question and would be better excised from the title. – Nick Cox Mar 10 '15 at 12:51

1 Answer


In the notation I will use, $p$ will be the number of design variables (including the constant term) and $n$ the number of observations, with $n\geq2p+1$ (if this last condition were not met, the package would not have returned a fit but an error, so I assume it is met). I will denote by $\hat{\boldsymbol\beta}_{FLTS}$ the vector of coefficients estimated by FLTS (ltsReg) and by $\hat{\boldsymbol\beta}_{MM}$ the coefficients estimated by MM (lmrob). I will also write:

$$r^2_i(\hat{\boldsymbol\beta})=(y_i-\boldsymbol x_i^\top\hat{\boldsymbol\beta})^2$$

(these are the squared residuals, not the standardized ones!)
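For concreteness, here is a minimal sketch in plain Python of how these squared residuals are computed; the data and the helper name `squared_residuals` are my own toy illustration, not part of any package in the question:

```python
def squared_residuals(X, y, beta):
    """Return the squared residuals r_i^2(beta) = (y_i - x_i^T beta)^2.
    Each row of X is a design vector x_i (including the constant term)."""
    return [(yi - sum(xij * bj for xij, bj in zip(xi, beta))) ** 2
            for xi, yi in zip(X, y)]

# Toy example: intercept plus one covariate, coefficients beta = (1, 2),
# so the fitted line is y = 1 + 2x.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]
y = [1.0, 3.0, 5.5]
print(squared_residuals(X, y, [1.0, 2.0]))  # [0.0, 0.0, 0.25]
```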

The rlm function fits an 'M' estimate of regression and, like @Frank Harrell's proposal made in the comments to your question, it is not robust to outliers in the design space. Ordinal regression has a breakdown point (the proportion of your data that needs to be replaced by outliers to pull the fitted coefficients to arbitrary values) of essentially $1/n$, meaning that a single outlier (regardless of $n$!) suffices to render the fit meaningless. For regression M estimates (e.g. Huber M regression) the breakdown point is essentially $1/(p+1)$. This is somewhat higher, but in practice still uncomfortably close to 0 (because $p$ will often be large). The only conclusion that can be drawn from rlm finding a different fit than the other two methods is that it has been swayed by design outliers and that there must be more than $p+1$ of these in your data set.

In contrast, the other two algorithms are much more robust: their breakdown point is just below $1/2$ and more importantly, doesn't shrink as $p$ gets large. When fitting a linear model using a robust method, you assume that at least $h=\lfloor(n+p+1)/2\rfloor+1$ observations in your data are uncontaminated. The task of these two algorithms is to find those observations and fit them as well as possible. More precisely, if we denote:

\begin{align} H_{FLTS} &= \{i:r^2_i(\hat{\boldsymbol\beta}_{FLTS})\leq q_{h/n}(r^2_i(\hat{\boldsymbol\beta}_{FLTS}))\} \\ H_{MM} &= \{i:r^2_i(\hat{\boldsymbol\beta}_{MM})\leq q_{h/n}(r^2_i(\hat{\boldsymbol\beta}_{MM}))\} \end{align}

(where $q_{h/n}(r^2_i(\hat{\boldsymbol\beta}_{MM}))$ is the $h/n$ quantile of the vector $r^2_i(\hat{\boldsymbol\beta}_{MM})$)

then $\hat{\boldsymbol\beta}_{MM}$ ($\hat{\boldsymbol\beta}_{FLTS}$) tries to fit the observations with indices in $H_{MM}$ ($H_{FLTS}$).
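The index sets above can be sketched in plain Python, assuming the squared residuals of each fit are already available; the helper names `h_size` and `h_set` are mine, not from any package, and taking the $h/n$ quantile as the $h$-th order statistic:

```python
def h_size(n, p):
    # h = floor((n + p + 1) / 2) + 1: the minimum number of observations
    # assumed to be uncontaminated.
    return (n + p + 1) // 2 + 1

def h_set(sq_resid, h):
    """Indices of the h observations with the smallest squared residuals,
    i.e. {i : r_i^2 <= q_{h/n}(r^2)}."""
    order = sorted(range(len(sq_resid)), key=lambda i: sq_resid[i])
    return set(order[:h])

# Toy example: n = 7 observations, p = 2 design variables.
sq_resid_mm = [0.1, 0.3, 9.0, 0.2, 0.05, 7.5, 0.4]
h = h_size(n=7, p=2)  # h = 6
print(h, h_set(sq_resid_mm, h))  # observation 2 (residual 9.0) is excluded
```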

The fact that there are large differences between $\hat{\boldsymbol\beta}_{FLTS}$ and $\hat{\boldsymbol\beta}_{MM}$ indicates that the two algorithms do not identify the same set of observations as outliers. This means that at least one of them is swayed by the outliers. In this case, using the (adjusted) $R^2$ or any other statistic from either of the two fits to decide which to use, though intuitive, is a terrible idea: contaminated fits typically have smaller residuals than clean ones (but since knowledge of this is the reason one uses robust statistics in the first place, I assume that the OP is well aware of this fact and that I don't need to expand on it).

The two robust fits give conflicting results, and the question is which one is correct. One way to resolve this is to consider the set:

$$H^+=H_{MM}\cap H_{FLTS}$$

Because $h\geq[n/2]$, we have $\#\{H^+\}\geq p$. Furthermore, if either of $H_{MM}$ or $H_{FLTS}$ is free of outliers, so is $H^+$. The solution I propose exploits this fact. Compute:

$$D(H^+,\hat{\boldsymbol\beta}_{FLTS},\hat{\boldsymbol\beta}_{MM})=\sum_{i\in H^+}\left(r^2_i(\hat{\boldsymbol\beta}_{FLTS})-r^2_i(\hat{\boldsymbol\beta}_{MM})\right)$$

For example, if $D(H^+,\hat{\boldsymbol\beta}_{FLTS},\hat{\boldsymbol\beta}_{MM})<0$, then, $\hat{\boldsymbol\beta}_{FLTS}$ fits the good observations better than $\hat{\boldsymbol\beta}_{MM}$ and so I would trust $\hat{\boldsymbol\beta}_{FLTS}$ more. And vice versa.
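The whole decision rule can be sketched end-to-end in plain Python (toy numbers, not from any real fit; the function name `d_statistic` is mine):

```python
def d_statistic(h_flts, h_mm, sq_resid_flts, sq_resid_mm):
    """D(H+, beta_FLTS, beta_MM): sum over i in H+ = H_FLTS ∩ H_MM of
    r_i^2(beta_FLTS) - r_i^2(beta_MM).
    A negative value means the FLTS fit describes the jointly-trusted
    observations better, so one would prefer it; and vice versa."""
    h_plus = h_flts & h_mm  # intersection of the two trusted index sets
    return sum(sq_resid_flts[i] - sq_resid_mm[i] for i in h_plus)

# Toy example with n = 5: both fits agree that observations 0-3 look clean,
# so H+ = {0, 1, 2, 3}.
sq_flts = [0.1, 0.2, 0.1, 0.3, 8.0]
sq_mm   = [0.4, 0.5, 0.2, 0.6, 0.1]
D = d_statistic({0, 1, 2, 3}, {0, 1, 2, 3, 4}, sq_flts, sq_mm)
print(D)  # ≈ -1.0 < 0, so here one would trust the FLTS fit more
```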

user603
    +1. I guess you are using $[\ \ ]$ to mean round down to integer, i.e. the floor function $\lfloor\ \ \rfloor$. I find the latter notation more explicit. It's easy for readers new to that notation for integer rounding to assume that square brackets are just brackets. – Nick Cox Mar 10 '15 at 12:48