
Suppose I have a relatively large number of samples (~1k) drawn from a series (~40) of increasingly long-tailed distributions (going from approximately normal to approximately log-normal). I want to estimate the mean and its uncertainty for these distributions, which I do using jackknife resampling because the samples are highly correlated between adjacent distributions in the series. However, as the distributions become increasingly long-tailed, the estimated uncertainty of the mean becomes so large that the data are useless.
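For concreteness, here is a minimal sketch of the jackknife step described above, assuming the samples are stored as a 1000 × 40 array (rows = realizations, columns = distributions); deleting whole rows preserves the cross-distribution correlations within each realization. The array shapes, the skew schedule, and the function name are illustrative only, not taken from the original analysis.

```python
import numpy as np

def jackknife_mean(samples):
    """Leave-one-out jackknife estimate of the column means and their
    standard errors.  `samples` is an (n, k) array: n realizations of a
    k-dimensional vector (here k ~ 40 correlated distributions).
    Deleting whole rows keeps the within-realization correlations intact."""
    n = samples.shape[0]
    total = samples.sum(axis=0)
    loo_means = (total - samples) / (n - 1)        # leave-one-out means, shape (n, k)
    jack_mean = loo_means.mean(axis=0)
    jack_se = np.sqrt((n - 1) / n * ((loo_means - jack_mean) ** 2).sum(axis=0))
    return jack_mean, jack_se

# Illustrative data only: ~1k realizations of 40 correlated columns whose
# skew increases from roughly normal to roughly log-normal.
rng = np.random.default_rng(0)
base = rng.standard_normal((1000, 1))              # shared component -> correlated columns
z = base + 0.3 * rng.standard_normal((1000, 40))
skew = np.linspace(0.05, 1.5, 40)                  # hypothetical skew schedule
x = np.expm1(skew * z) / skew                      # ~normal for small skew, long-tailed for large
mean, se = jackknife_mean(x)
```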

I might think to apply robust estimators, but my review of the literature seems to indicate that these methods assume the "outliers" are erroneous data, drawn from some distribution other than the one of interest. In my case, they are samples from the true distribution that merely appear to be outliers because the region in which they lie is sparsely sampled.

  • Are the usual robust estimators valid for this case? If so, how should they be applied if an outlier in one distribution does not correspond to outliers in other distributions?
  • If not, are there other suitable methods? I experimented with power transformations, but could not work out how they could be used while retaining the correlations, since the transformation parameters would differ between distributions.
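On the power-transformation point in the last bullet: below is a minimal sketch of fitting a separate Box-Cox transform to each of the ~40 distributions with scipy.stats.boxcox. The function name and the positivity shift are illustrative devices, not part of the original analysis; the sketch shows the fitting step only and does not by itself answer how to carry the cross-distribution correlations, or the mean on the original scale, through transforms whose parameters differ by column.

```python
import numpy as np
from scipy import stats

def fit_boxcox_per_column(samples):
    """Fit an independent Box-Cox power transform to each column of an
    (n, k) array.  Box-Cox needs strictly positive input, so each column
    is shifted above zero first (the shift is purely an illustrative device).
    Returns the transformed array and the per-column (lambda, shift) pairs."""
    n, k = samples.shape
    transformed = np.empty((n, k), dtype=float)
    params = []
    for j in range(k):
        col = samples[:, j].astype(float)
        shift = max(0.0, -col.min()) + 1e-6
        y, lam = stats.boxcox(col + shift)   # lambda chosen by maximum likelihood
        transformed[:, j] = y
        params.append((lam, shift))
    return transformed, params
```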
Xerxes
  • You are trying to lift water with a fork ;) My point is that the mean (as a statistic) doesn't have its usual interpretation when the data you apply it to exhibit fat tails. In that case, characterizing your distribution in terms of its mean is just a bad approach. You want to characterize the distribution of your data in terms of a statistic that retains its interpretability for fat-tailed distributions. Statistics built on quantiles are a direct way to do that. – user603 Jun 23 '14 at 20:12
  • Unfortunately, in this case the mean has a physical meaning that makes other forms of characterization irrelevant. Presumably, although it is difficult to demonstrate, the link to a physical interpretation implies that the distributions cannot become so fat tailed that the mean and moments no longer exist. – Xerxes Jun 23 '14 at 20:20
  • Then, a solution could be to re-express the *estimates of uncertainty around the mean* in terms of empirical quantiles of the data. Would that be acceptable in your application? – user603 Jun 23 '14 at 20:23
  • I'm not exactly sure, but I believe any estimator that goes to the true mean as the number of samples goes to infinity would be okay. An estimator that goes to the true median or any nonzero shift from the mean would be wrong. So knowing the 50th quantile would not be helpful unless you can work out the mean from it. – Xerxes Jun 23 '14 at 21:55
  • Can you please verify the structure of your data? You have 40 sets of realizations, each set coming from a different distribution, and each of these 40 sets containing ~1k realizations? – Alecos Papadopoulos Jun 29 '14 at 23:25
  • @AlecosPapadopoulos: Yes, that is correct. Each realization produces a member drawn from each distribution, and the correlations between the sets for a given realization are very important. – Xerxes Jun 30 '14 at 19:35
  • So do I understand correctly that your data is a realization of a 40-dimensional random variable, X = (X1, ..., XJ), J = 40, whose marginal in each dimension has a heavier right tail / stronger right skew as j increases? And you observe 1000 samples of it, which show that its components are dependent. Are those non-negative random variables? Is Xj related to Xj-1 in a meaningful, real-world way? – Georg M. Goerg Jan 05 '17 at 11:57
  • @GeorgM.Goerg Yes, it is a 40-dim random variable with increasing skew. No, there is no positivity constraint on the variables. Yes, each variable is related to the other variables in a meaningful physical way; in addition, there are strong correlations between them from sample to sample. – Xerxes Jan 05 '17 at 19:00
  • Re physical meaning: is it by construction that Xj has higher skew than X(j-1), or did you just observe that in your data? Could you post some plots of summary stats or boxplots of the marginal variables? – Georg M. Goerg Jan 08 '17 at 06:39
  • FWIW have a look at Lambert W x Normal distributions. Estimates of mu parameter converge to true location/center even for distributions with non-existing mean (Cauchy and more heavy tailed). See https://stats.stackexchange.com/questions/152361/location-parameter-estimation-in-alpha-stable-distributions/443848#443848 for illustrations. – Georg M. Goerg Dec 19 '20 at 13:55
  • @GeorgM.Goerg Looks interesting, but does mu go to the mean in the large-N limit? A parameter that estimates the median of a skewed distribution seems like it wouldn't. – Xerxes Dec 20 '20 at 15:26
  • @Xerxes The MLE is consistent. The particular example was for heavy tails (symmetric); at least empirically it also works for asymmetric distributions. It converges to the true location parameter, which is the median if the tails are heavy (delta > 1) or the mean if delta < 1. – Georg M. Goerg Dec 24 '20 at 01:01
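For reference on the Lambert W x Gaussian suggestion in the last few comments, here is a minimal sketch of the heavy-tail ("type h") back-transform using scipy.special.lambertw. The parameters delta, mu, and sigma are taken as given purely for illustration; in practice they would be estimated first (e.g. by maximum likelihood, as in the linked answer) before Gaussianizing the data.

```python
import numpy as np
from scipy.special import lambertw

def gaussianize_heavy_tail(x, delta, mu, sigma):
    """Invert the heavy-tail Lambert W x Gaussian transform (type 'h').

    Forward model:  X = mu + sigma * U * exp(delta / 2 * U**2),  U ~ N(0, 1).
    For delta > 0 the inverse uses the principal branch of the Lambert W
    function:  U = sign(Z) * sqrt(W(delta * Z**2) / delta),  Z = (X - mu) / sigma."""
    z = (np.asarray(x, dtype=float) - mu) / sigma
    u = np.sign(z) * np.sqrt(lambertw(delta * z ** 2).real / delta)
    return mu + sigma * u   # same location/scale, but with the heavy tail removed
```

The mean of the Gaussianized values then estimates the location parameter mu, which, as the last comment notes, coincides with the mean of X only when the tails are light enough for the mean to exist (delta < 1).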

0 Answers