
I am running a multinomial logistic regression using the mlogit function from the mlogit package in R. Now I need to check the model for outliers.

Is there any approach or function in R for testing outliers in an mlogit model?

user603
kalyani
  • You can start by removing the outliers from the continuous variables in your design. How this is done depends on the number of variables you have. – user603 Nov 14 '12 at 16:07
  • Are we talking about model outliers or data outliers? – gregmacfarlane Nov 14 '12 at 16:19
  • @gmacfarlane what is the difference? A data point can only be outlying with respect to a model! – user603 Nov 14 '12 at 17:00
  • @user603, outliers are not well defined but at the very least, they're high leverage / high influence points. In multivariate models, identifying high leverage points is not trivial. Some advanced methods would include looking at leave-one-out cross validation and comparing jackknife to model based standard errors. – AdamO Mar 14 '13 at 18:02
  • @AdamO: once you assume a model, outliers are very well defined. The maximum number of outliers that can be detected by the methods you cite is 1. Do you think the 's' after 'outlier' in the O.P.'s question was a typo? Btw I'm not sure where you get this from: even the inventor of the jackknife did not recommend using it for outlier detection. – user603 Mar 14 '13 at 18:45
  • @user603, you would be correct if you had said "you can at most identify single points, but mutual point-cluster outliers cannot be detected." LOO gives a sampling distribution of the parameter of interest from which you can select any number of points in the sample that give inconsistent parameter estimates. Standard techniques don't identify pairwise outliers for which points are not individually outliers. Furthermore, I challenge you to provide a rigorous definition of what an "outlier" is. – AdamO Mar 14 '13 at 19:31
  • @kalyani what is the purpose of this model? Prediction, classification, or inference? Be as specific as possible based on the context of the problem. – AdamO Mar 14 '13 at 19:35
  • @AdamO: I posted a link to a real data set [here](http://stats.stackexchange.com/a/50780/603). I don't think any of the methods you cited would have found the outliers. For the challenge, [have a look here](http://www.jstor.org/stable/2289995). – user603 Mar 15 '13 at 01:09
  • "observations that do not follow the pattern of the majority of the data" is not a rigorous definition. Furthermore, such observations may be representative of the sample population and modifying any primary analysis based on outlier identification invalidates results. – AdamO Mar 15 '13 at 16:02
  • Lastly, I stress that research is all about the *methods*. If your *methods* are such that they obtain outliers (shoddy PCA tools, liars on surveys), you need to describe a process that an independent researcher can reliably replicate to obtain data that roughly resemble the same patterns you've obtained. This means that outlier identification involves inference, but paradoxically you've said your original inference was invalid because there were outliers. The whole thing is a farce; I never exclude outliers from any primary analysis. – AdamO Mar 15 '13 at 16:11

1 Answer


I assume that what you want is a diagnostic plot of some sort that examines residuals against fitted values. Typically model outliers are observations whose fitted values $\hat{y}$ are very different from their observed values $y$. In other words, they have an abnormally large residual $\epsilon = y - \hat{y}$.

The trick is that multinomial logit models rely on a latent, unobserved $y^*$ instead of $y$. The entire model rests on the assumption that the error terms are independently and identically distributed extreme value (IID-EV), an assumption that doesn't leave room for the concept of an "outlier." If you think your data are not IID-EV, you should use a different model.
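To make the latent-variable setup concrete, here is the standard textbook form of the model. The notation ($V_{ij}$ for the systematic part, $i$ for individuals, $j$ and $k$ for alternatives) is my own shorthand, not anything specific to mlogit:

$$
y^*_{ij} = V_{ij} + \epsilon_{ij}, \qquad
y_i = \arg\max_j \, y^*_{ij}, \qquad
P(y_i = j) = \frac{\exp(V_{ij})}{\sum_{k} \exp(V_{ik})}
$$

The closed-form probability on the right follows only when the $\epsilon_{ij}$ are IID extreme value (Gumbel). You never observe $y^*_{ij}$, only the chosen alternative $y_i$, so there is no observed-minus-fitted residual on the latent scale.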

Also, remember that the ultimate output of an MNL model is a probability; just because you observe someone "choosing" a category without the highest probability doesn't mean you have an "outlier."
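A minimal sketch of that point, in plain Python rather than R so it runs without any packages. The utilities are made-up numbers and `choice_probabilities` is a hypothetical helper, not part of mlogit:

```python
import math

def choice_probabilities(utilities):
    """Softmax: turn systematic utilities into MNL choice probabilities."""
    m = max(utilities)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in utilities]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical systematic utilities for three alternatives (illustrative only)
utilities = [1.0, 0.5, 0.0]
probs = choice_probabilities(utilities)

# Share of identical individuals the model expects to pick something
# other than the most likely alternative
p_not_modal = 1.0 - max(probs)

print([round(p, 3) for p in probs])
print(round(p_not_modal, 3))
```

With these numbers the most likely alternative has a probability of only about one half, so the model itself expects roughly half of otherwise identical people to choose a "less likely" category. Those observations are consistent with the model, not outliers.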

Having said all of this, you can still do a leverage points analysis to determine if some observations are unique to the extent that they can affect your likelihood estimates.
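As a rough illustration of what a leverage check looks at, here is a sketch that computes classical hat values for a design with an intercept and one continuous covariate. It is an assumption-laden simplification: it looks only at the design space and ignores the probability weights that a proper MNL diagnostic would use. The data and the name `hat_values` are hypothetical, and the $2p/n$ cutoff is just a common rule of thumb.

```python
def hat_values(x):
    """Leverage h_i for a design with an intercept and one covariate x."""
    n = len(x)
    mean = sum(x) / n
    sxx = sum((xi - mean) ** 2 for xi in x)
    return [1.0 / n + (xi - mean) ** 2 / sxx for xi in x]

# Hypothetical covariate values; the last one sits far from the rest
income = [2.1, 2.4, 1.9, 2.6, 2.2, 9.5]
h = hat_values(income)

# Rule of thumb: flag h_i > 2p/n, with p = 2 parameters (intercept + slope)
p = 2
threshold = 2 * p / len(income)
flagged = [i for i, hi in enumerate(h) if hi > threshold]

print(flagged)
```

Observations flagged this way are unusual in the covariates only. Whether they actually distort the estimates is a separate question that needs the fitted model, for example by refitting without them and comparing coefficients.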

gregmacfarlane
  • The first paragraph is wrong. Think of [leverage points](http://www.jstor.org/stable/2289995). – user603 Nov 14 '12 at 15:58
  • I don't see how the linked article has any relevance to discrete choice analysis or latent variable regression. – gregmacfarlane Nov 14 '12 at 16:18
  • @gmacfarlane: section 2 discusses outliers in regression. – user603 Nov 14 '12 at 17:01
  • section two relies on the definition of outliers that I give in paragraph one. Also from the paper: "To distinguish between good and bad leverage points we have to consider $y_i$ as well as $X_i$, and we also need to know the linear pattern set by the majority of the data." Again, this definition doesn't work if you can't see $y$. – gregmacfarlane Nov 14 '12 at 17:53
  • @gmacfarlane: I'm not sure what your claim is. Are you stating that there can't be outliers in the regression setting when the responses are multinomial? If this is your claim, it is very simple to construct a counterexample using the fact (stated in that second section) that observations outlying only in the design space will have an unbounded influence on the estimated parameters of any M-estimator (a class to which multivariate logit belongs). – user603 Nov 14 '12 at 18:06
  • The model in question is multi*nomial* logit, which is different from multi*variate* logit. I suppose you could use the method on the latent regression for each alternative, and this could provide some useful information. But I've never seen such an analysis done for a multinomial logit model. – gregmacfarlane Nov 14 '12 at 18:32
  • Indeed, there was a typo in my last response. Read "multinomial logit" in the last line of my response. I do not understand the last two sentences of your response, nor how they are relevant to my question ("Is your claim that..."). – user603 Nov 14 '12 at 19:19