Median of medians as robust mean of means?

Question

The location and scale of normally distributed data can be estimated by drawing samples from the data and then taking the mean of the sample means and the mean of the sample standard deviations, respectively. For non-normal (heavy-tailed) data, is it correct to take the median of the sample medians and of the sample IQRs/MADs instead? That is, is it correct to use the median of sample medians as a robust estimator of location, analogous to the mean of sample means for normal data?

Henry · Accepted Answer · 2019-07-08T19:16:50.617

If all the samples come from the same distribution, then yes, the median of the sample medians is a fairly robust estimate of the median of the underlying distribution (though this need not be the same as the mean). The reason is that the median of a sample from a continuous distribution has probability 0.5 of being below (or above) the population median (exactly so when the sample size is odd).
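As a short sketch of why the probability is 0.5, assume an odd sample size $n = 2k+1$ and a continuous distribution with population median $m$ (the symbols $k$, $m$, $M$ and $N$ are introduced here for illustration only). The sample median $M$ is the $(k+1)$-th order statistic, so $M < m$ exactly when at least $k+1$ observations fall below $m$. The number $N$ of observations below $m$ is $\mathrm{Binomial}(n, 1/2)$, and the symmetry $\binom{n}{j} = \binom{n}{n-j}$ gives

$$
P(M < m) = P(N \ge k+1) = \sum_{j=k+1}^{2k+1} \binom{2k+1}{j} 2^{-(2k+1)} = \frac{1}{2}.
$$

So each sample median is equally likely to land on either side of $m$, which is what keeps the median of the sample medians centred near $m$.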

Added

Here is some illustrative R code. It generates data from a normal distribution, plus a variant with outliers in which 1% of the values are 10,000 times bigger than they should be. It computes various statistics for the overall data (50,000 points), and then the centre (mean or median) of the same statistics computed on each of 10,000 samples of 5 points.

library(matrixStats)

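wholestats <- function(x,n) {
     # statistics of the pooled data x (n values in total)
     mea <- sum(x)/n
     var <- sum((x-mea)^2)/(n-1)    # unbiased variance (denominator n-1)
     sdv <- sqrt(var)
     qun <- quantile(x, probs=c(0.25,0.5,0.75))
     mad <- median(abs(x-qun[2]))   # unscaled median absolute deviation
     c(mean=mea, variance=var, st.dev=sdv,
       median=qun[2], IQR=qun[3]-qun[1],
       MAD=mad)
     }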

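rowstats <- function(x,b) {
     # x holds one sample per row, with b values in each row
     rmea <- rowSums(x)/b
     # rmea is recycled down the columns, so row i is centred on rmea[i]
     rvar <- rowSums((x-rmea)^2)/(b-1)
     rsdv <- sqrt(rvar)
     rqun <- rowQuantiles(x, probs=c(0.25,0.5,0.75))
     rmad <- rowMedians(abs(x-rqun[,2]))
     # combine per-sample statistics: means for the moment-based ones,
     # medians for the quantile-based ones
     c(mean=mean(rmea), variance=mean(rvar), st.dev=mean(rsdv),
       median=median(rqun[,2]), IQR=median(rqun[,3]-rqun[,1]),
       MAD=median(rmad))
     }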

a <- 10000 # number of samples
b <- 5     # size of each sample

set.seed(1)
d <- array(rnorm(a*b), dim=c(a,b))   # one sample per row
# multiply a random 1% of the values by 10,000 to create outliers
doutlier <- array(d * ifelse(runif(a*b)>0.99, 10000, 1) , dim=c(a,b))

As expected, the median-based statistics are more robust, though they fail to show that the outlier variant is heavy-tailed.

> wholestats(d,a*b)
        mean     variance       st.dev   median.50%      IQR.75%          MAD 
-0.002440456  1.011306552  1.005637386 -0.001610677  1.357029247  0.678706371 
> wholestats(doutlier,a*b) 
         mean      variance        st.dev    median.50%       IQR.75%           MAD 
-3.425664e+00  9.591583e+05  9.793663e+02 -1.610677e-03  1.373658e+00  6.871415e-01 
> rowstats(d,b)
        mean     variance       st.dev       median          IQR          MAD 
-0.002440456  1.014611308  0.947630870  0.003460172  0.917642167  0.510115277 
> rowstats(doutlier,b) 
         mean      variance        st.dev        median           IQR           MAD 
-3.425664e+00  9.607212e+05  1.685929e+02  3.460172e-03  9.301795e-01  5.175084e-01
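In the output above, the median of the per-sample IQRs (about 0.92) is clearly smaller than the whole-data IQR (about 1.36). The following is a minimal sketch, using base R only, of how that gap depends on the sample size. The helper `median_sample_iqr` and the sample sizes tried are assumptions made for illustration and are not part of the original code.

```r
# Hypothetical helper (not from the original answer): draw `a` samples of
# size `b` from a standard normal and return the median of the sample IQRs.
median_sample_iqr <- function(a, b) {
  x <- matrix(rnorm(a * b), nrow = a, ncol = b)  # one sample per row
  median(apply(x, 1, IQR))                       # IQR() uses quantile type 7 by default
}

set.seed(1)
sizes <- c(5, 25, 125)                           # illustrative sample sizes
res <- sapply(sizes, function(b) median_sample_iqr(2000, b))
names(res) <- sizes
print(res)

# Population IQR of a standard normal, for comparison
pop_iqr <- qnorm(0.75) - qnorm(0.25)
print(pop_iqr)
```

With only 5 points per sample, the quartiles are taken from order statistics that sit well inside the bulk of the data, so the typical sample IQR underestimates the population value; the gap should shrink as the sample size grows.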

Thanks! It seems that this method is not very common, or I'm just reading the wrong papers/books. Is it really not common? Is there a more standard way to solve the problem? Can you suggest keywords/references where I can learn more about this? — Christian Alis, Apr 24 '11 at 07:48
@ianalis: The median of sample medians works well, and is still close to the "true" median of 0; it is not the same as the overall median but there is no reason to expect that it would be. The mean of sample means does give the overall mean (if the sample sizes are equal). Combining measures of dispersion from different samples is more complicated, especially with median or quantile based measures: with variances the answer is to use the denominator $n-1$ to get unbiased estimates, as shown here. — Henry, Apr 24 '11 at 14:24
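The two claims about means and variances in the comment above can be checked with a minimal sketch, assuming standard normal data and the same `a` and `b` as in the answer. The variable names are illustrative only.

```r
set.seed(1)
a <- 10000   # number of samples
b <- 5       # size of each sample
x <- matrix(rnorm(a * b), nrow = a, ncol = b)   # one sample per row

row_means <- rowMeans(x)
ss <- rowSums((x - row_means)^2)   # within-sample sums of squares

var_unbiased <- ss / (b - 1)       # denominator n-1
var_biased   <- ss / b             # denominator n

# With equal sample sizes, the mean of sample means equals the overall mean
equal_means <- isTRUE(all.equal(mean(row_means), mean(x)))
print(equal_means)

# The mean of the n-1 variances should sit near the true variance of 1;
# the mean of the n-denominator variances should sit near (b-1)/b = 0.8
print(c(unbiased = mean(var_unbiased), biased = mean(var_biased)))
```

No comparably simple correction exists for the median of sample IQRs or MADs, which is why combining quantile-based dispersion measures is described as more complicated.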
