Treatment of outliers produced by kurtosis

Question

I was wondering if anyone could help me with information about kurtosis (i.e. is there any way to transform your data to reduce it?)

I have a questionnaire dataset with a large number of cases and variables. For a few of my variables, the data shows pretty high kurtosis values (i.e. a leptokurtic distribution) which is derived from the fact that many of the participants gave the exact same score for the variable. I do have a particularly large sample size, so according to the central limit theorem, violations of normality should still be fine.

The problem, however, is the fact that the particularly high levels of kurtosis are producing a number of univariate outliers in my dataset. As such, even if I transform the data, or remove/adjust the outliers, the high levels of kurtosis mean that the next most extreme scores automatically become outliers. I aim to use discriminant function analysis (DFA). DFA is said to be robust to departures from normality provided that the violation is caused by skewness and not outliers. Furthermore, DFA is also said to be particularly influenced by outliers in the data (Tabachnick & Fidell).

Any ideas of how to get around this? (My initial thought was some way of controlling the kurtosis, but isn't it kind of a good thing if most of my sample are giving similar ratings?)

The OP is long gone, but the thread remains. Two general comments: First, some of the wording makes "kurtosis" sound almost like a medical condition afflicting the data: it makes more sense to say that outliers cause (relatively high) kurtosis than conversely. Second, the question is vague on what the variables are, but if they are e.g. scores on 5-point or 7-point scales then a normal distribution is hardly a reference distribution and indeed each distribution is discrete. Answering with extreme scores isn't pathological, outliers are genuine, and transformation isn't needed or helpful. — Nick Cox, Sep 27 '20 at 09:11

score 10 · Answer 1 · answered Mar 08 '11 at 14:11

The obvious "common sense" way of resolving your problem is to:

1. Get the conclusion using the full data set, i.e. what results will you declare, ignoring intermediate calculations?
2. Get the conclusion using the data set with said "outliers" removed, i.e. what results will you declare, ignoring intermediate calculations?
3. Compare step 2 with step 1.
4. If there is no difference, forget you even had a problem. Outliers are irrelevant to your conclusion. The outliers may influence some other conclusion that may have been drawn using these data, but this is irrelevant to your work. It is somebody else's problem.
5. If there is a difference, then you basically have a question of "trust". Are these "outliers" real in the sense that they genuinely represent something about your analysis? Or are the "outliers" bad in that they come from some "contaminated source"?
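Steps 1 to 3 can be sketched in code. This is only a rough illustration, not the procedure from the question: the two-group linear discriminant, the |z| > 3 outlier rule, the function names, and the simulated data are all assumptions made for the example.

```python
import numpy as np

def fisher_rule(X, y):
    """Two-group linear discriminant: predict group 1 when X @ w > c."""
    X0, X1 = X[y == 0], X[y == 1]
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    # Pooled within-group covariance matrix.
    pooled = ((len(X0) - 1) * np.cov(X0, rowvar=False)
              + (len(X1) - 1) * np.cov(X1, rowvar=False)) / (len(X) - 2)
    w = np.linalg.solve(pooled, m1 - m0)
    c = w @ (m0 + m1) / 2.0
    return w, c

def sensitivity_check(X, y, z_cut=3.0):
    """Fit on all cases, refit without flagged cases, compare the two rules."""
    z = np.abs((X - X.mean(axis=0)) / X.std(axis=0, ddof=1))
    keep = (z <= z_cut).all(axis=1)  # assumed outlier rule: any |z| > z_cut
    w_full, c_full = fisher_rule(X, y)
    w_trim, c_trim = fisher_rule(X[keep], y[keep])
    # Classify every case (flagged ones included) with both rules.
    same = (X @ w_full > c_full) == (X @ w_trim > c_trim)
    return int(keep.sum()), float(same.mean())

# Made-up illustration data: two groups, heavy-tailed noise.
rng = np.random.default_rng(0)
y = np.repeat([0, 1], 1000)
X = rng.standard_t(3, size=(2000, 2)) + y[:, None] * 1.0
n_kept, agreement = sensitivity_check(X, y)
print(n_kept, round(agreement, 3))
```

If `agreement` is close to 1, the flagged cases do not change the classifications and step 4 applies; otherwise you are in step 5.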

In situation 5 you basically have a case where whatever "model" you have used to describe the "population" is incomplete: there are details which have been left unspecified, but which matter to the conclusions. There are two ways to resolve this, corresponding to the two "trust" scenarios:

1. Add some additional structure to your model so that it describes the "outliers". So instead of $P(D|\theta)$, consider $P(D|\theta)=\int P(\lambda|\theta)P(D|\theta,\lambda) d\lambda$.
2. Create a "model-model", one for the "good" observations, and one for the "bad" observations. So instead of $P(D|\theta)$ you would use $P(D|\theta)=G(D|\theta)u+B(D|\theta)(1-u)$, where $u$ is the probability of obtaining a "good" observation in your sample, and $G$ and $B$ represent the models for the "good" and "bad" data.
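As a minimal sketch of the second option for a single observation, assume (purely for illustration) that $G$ is a normal density and $B$ is a Cauchy density. The function names and parameter values below are hypothetical. Bayes' rule then gives the probability that a given score came from the "good" component.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of the assumed 'good' model G (normal)."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def cauchy_pdf(x, loc, scale):
    """Density of the assumed 'bad' model B (Cauchy)."""
    z = (x - loc) / scale
    return 1.0 / (math.pi * scale * (1.0 + z * z))

def mixture_pdf(x, mu, sigma, u, bad_scale):
    """P(D|theta) = G(D|theta) u + B(D|theta) (1 - u) for one observation."""
    return u * normal_pdf(x, mu, sigma) + (1.0 - u) * cauchy_pdf(x, mu, bad_scale)

def prob_good(x, mu, sigma, u, bad_scale):
    """Posterior probability that x came from the 'good' component."""
    return u * normal_pdf(x, mu, sigma) / mixture_pdf(x, mu, sigma, u, bad_scale)

# Illustrative values only: 95% of observations assumed "good".
for x in (0.0, 2.0, 4.0, 8.0):
    print(x, prob_good(x, mu=0.0, sigma=1.0, u=0.95, bad_scale=1.0))
```

Under these assumptions a score near the centre is almost certainly "good", while a score far in the tail is attributed almost entirely to the "bad" component, so it is down-weighted rather than deleted.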

Most of the "standard" procedures can be shown to be approximations to these kinds of models. The most obvious example comes from case 1, where the variance has been assumed constant across observations. By relaxing this assumption and giving the variance a distribution, you get a mixture distribution. This is the connection between the "normal" and "t" distributions. The normal has fixed variance, whereas the "t" mixes over different variances, and the amount of "mixing" depends on the degrees of freedom. High DF means low mixing (outliers are unlikely); low DF means high mixing (outliers are likely). In fact you could take case 2 as a special case of case 1, where the "good" observations are normal, and the "bad" observations are Cauchy (t with 1 DF).
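The normal-to-t connection can be checked by simulation. The sketch below (sample size, seed, and degrees of freedom are arbitrary choices for illustration) draws each observation from a normal distribution whose variance is itself random, with precision distributed as a chi-square divided by its degrees of freedom, and compares the result with direct draws from a t distribution.

```python
import numpy as np

rng = np.random.default_rng(42)
n, df = 200_000, 10

# Case 1 as a simulation: each observation gets its own random variance.
z = rng.standard_normal(n)
precision = rng.chisquare(df, size=n) / df  # variance of each draw is 1 / precision
x_mix = z / np.sqrt(precision)

# Direct draws from a t distribution with the same degrees of freedom.
x_t = rng.standard_t(df, size=n)

qs = [0.01, 0.25, 0.5, 0.75, 0.99]
q_mix = np.quantile(x_mix, qs)
q_t = np.quantile(x_t, qs)
print(np.round(q_mix, 2))
print(np.round(q_t, 2))
print(x_mix.var(), df / (df - 2))  # sample variance vs. theoretical t variance
```

The two sets of quantiles should agree up to simulation noise. Lowering `df` makes the random variances more spread out, which produces heavier tails.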

Just a clarifying note: Optimal classification requires knowledge of the true multivariate distributions. If you can estimate these distributions well, then the resulting classification function is nearly optimal. Outliers (as indicated by kurtosis) are indeed problematic because there is little data in the region with which to estimate the density. With multivariate data, the curse of dimensionality also contributes to this problem. — BigBendRegion, Nov 19 '17 at 20:40
