The location and scale of normally distributed data can be estimated by drawing repeated samples and then taking the mean of the sample means and the mean of the sample standard deviations, respectively. For non-normal (heavy-tailed) data, is it correct to take the median of the sample medians and of the sample IQRs/MADs instead? That is, is the median of sample medians a robust estimator of location, analogous to the mean of sample means for normal data?
1 Answer
If all the samples come from the same distribution, then yes, the median of the sample medians is a fairly robust estimate of the median of the underlying distribution (though this need not be the same as the mean), since the median of a sample from a continuous distribution has probability 0.5 of being below (or above) the population median (exactly so when the sample size is odd).
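As a rough check of the probability-0.5 claim, here is a minimal sketch in base R. The choice of a standard Cauchy distribution (heavy-tailed, population median 0, no finite mean), the seed, and the sample counts are all assumptions made for illustration, not part of the original answer.

```r
set.seed(42)               # arbitrary seed, for reproducibility only
n_samples <- 10000         # number of samples (assumed)
n_each    <- 5             # odd sample size, so each median is an observed value

# One sample per row, drawn from a standard Cauchy (population median is 0)
x <- matrix(rcauchy(n_samples * n_each), nrow = n_samples)

meds          <- apply(x, 1, median)   # one median per sample
prop_below    <- mean(meds < 0)        # share of sample medians below the population median
med_of_meds   <- median(meds)          # median of the sample medians
mean_of_means <- mean(rowMeans(x))     # for contrast: not a stable estimate for Cauchy data

print(c(prop_below = prop_below,
        median_of_medians = med_of_meds,
        mean_of_means = mean_of_means))
```

The share of sample medians below 0 should come out near one half and the median of medians near 0, while the mean of means can land far from 0 because the Cauchy distribution has no mean.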
Added
Here is some illustrative R code. It draws data from a standard normal distribution, plus a variant with outliers in which about 1% of the points are 10,000 times bigger than they should be. It computes the various statistics first for the pooled data (50,000 points) and then as the centre (mean or median) of the per-sample statistics of 10,000 samples with 5 points in each sample.
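library(matrixStats)  # provides rowQuantiles() and rowMedians()

# Statistics of the pooled data: x holds the values, n is the number of points
wholestats <- function(x,n) {
  mea <- sum(x)/n
  var <- sum((x-mea)^2)/(n-1)
  sdv <- sqrt(var)
  qun <- quantile(x, probs=c(0.25,0.5,0.75))
  mad <- median(abs(x-qun[2]))  # raw MAD, not rescaled like stats::mad()
  c(mean=mea, variance=var, st.dev=sdv,
    median=qun[2], IQR=qun[3]-qun[1],
    MAD=mad)
}

# Per-sample statistics (one sample per row, b points each), then their centre:
# the mean for moment-based statistics, the median for quantile-based ones
rowstats <- function(x,b) {
  rmea <- rowSums(x)/b
  rvar <- rowSums((x-rmea)^2)/(b-1)  # rmea recycles down columns, so each row uses its own mean
  rsdv <- sqrt(rvar)
  rqun <- rowQuantiles(x, probs=c(0.25,0.5,0.75))
  rmad <- rowMedians(abs(x-rqun[,2]))
  c(mean=mean(rmea), variance=mean(rvar), st.dev=mean(rsdv),
    median=median(rqun[,2]), IQR=median(rqun[,3]-rqun[,1]),
    MAD=median(rmad))
}

a <- 10000  # number of samples
b <- 5      # sample size
set.seed(1)
d <- array(rnorm(a*b), dim=c(a,b))
# Outlier variant: about 1% of the points are multiplied by 10,000
doutlier <- array(d * ifelse(runif(a*b)>0.99, 10000, 1) , dim=c(a,b))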
As expected, the median-based statistics are more robust, though they fail to reveal that the outlier variant is heavy-tailed.
> wholestats(d,a*b)
        mean     variance       st.dev   median.50%      IQR.75%          MAD
-0.002440456  1.011306552  1.005637386 -0.001610677  1.357029247  0.678706371
> wholestats(doutlier,a*b)
         mean      variance        st.dev    median.50%       IQR.75%           MAD
-3.425664e+00  9.591583e+05  9.793663e+02 -1.610677e-03  1.373658e+00  6.871415e-01
> rowstats(d,b)
        mean     variance       st.dev       median          IQR          MAD
-0.002440456  1.014611308  0.947630870  0.003460172  0.917642167  0.510115277
> rowstats(doutlier,b)
         mean      variance        st.dev        median           IQR           MAD
-3.425664e+00  9.607212e+05  1.685929e+02  3.460172e-03  9.301795e-01  5.175084e-01
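One detail in the rowstats(d,b) output is that the mean of the sample standard deviations is about 0.95 rather than 1, even though the mean of the sample variances is close to 1. A likely explanation is the known small-sample bias of the sample standard deviation for normal data, where the expected value of s is c4(n) times the true standard deviation. The sketch below is an illustrative check under that assumption; the names row_sd and c4 are introduced here and are not part of the original code.

```r
set.seed(1)                           # same seed as the code above
a <- 10000                            # number of samples
b <- 5                                # points per sample
d <- matrix(rnorm(a * b), nrow = a)   # one sample per row

row_sd  <- apply(d, 1, sd)            # sd() uses the b - 1 denominator
mean_sd <- mean(row_sd)               # mean of the sample standard deviations

# Bias factor for the sample standard deviation of normal data with b points
c4 <- sqrt(2 / (b - 1)) * gamma(b / 2) / gamma((b - 1) / 2)

print(c(mean_of_sds = mean_sd, c4 = c4, corrected = mean_sd / c4))
```

Dividing the mean of the sample standard deviations by c4 should bring it back close to 1 for the normal data. The correction is specific to normal data and would not rescue the outlier variant.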

-
Thanks! It seems that this method is not quite common or I'm just reading the wrong papers/books. Is it really not common? Is there a more standard way to solve the problem? Can you suggest keywords/references where I can learn more about this? – Christian Alis Apr 24 '11 at 07:48
-
@ianalis: The median of sample medians works well, and is still close to the "true" median of 0; it is not the same as the overall median but there is no reason to expect that it would be. The mean of sample means does give the overall mean (if the sample sizes are equal). Combining measures of dispersion from different samples is more complicated, especially with median or quantile based measures: with variances the answer is to use the denominator $n-1$ to get unbiased estimates, as shown here. – Henry Apr 24 '11 at 14:24
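The points in this comment can be checked with a short base R sketch. It assumes standard normal data with equal sample sizes; the seed and variable names are chosen here for illustration only.

```r
set.seed(1)                            # arbitrary seed
a <- 10000                             # number of samples
b <- 5                                 # points per sample (equal sizes)
d <- matrix(rnorm(a * b), nrow = a)    # one sample per row

overall_mean  <- mean(d)
mean_of_means <- mean(rowMeans(d))     # matches overall_mean up to rounding error

row_vars     <- apply(d, 1, var)       # var() divides by b - 1
mean_of_vars <- mean(row_vars)         # close to the true variance of 1

overall_median    <- median(d)
median_of_medians <- median(apply(d, 1, median))  # near 0, generally not equal to overall_median

print(c(overall_mean = overall_mean, mean_of_means = mean_of_means,
        mean_of_vars = mean_of_vars,
        overall_median = overall_median, median_of_medians = median_of_medians))
```

With unequal sample sizes the mean of sample means would need to be weighted by sample size to reproduce the overall mean.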