
I have data in the "dat" data frame and I want to report the weighted mean along with some information on the variation of that mean.

As a toy example, the data are in the "value" column and the weight of each observation is in the "weight" column:

```r
dat = data.frame(value = c(1,2,3,4,5,6,7,8,9), weight = c(200,2,3,4,5,6,7,8,9))
dat
```

The weighted mean and sd are 1.98 and 2.28:

```r
library(Hmisc)
mu = wtd.mean(dat$value, dat$weight)
sd = sqrt(wtd.var(dat$value, dat$weight))
mu
# [1] 1.983607
sd
# [1] 2.280653
```
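
As a cross-check, here is a minimal base R sketch of what those two numbers are. It assumes `wtd.var` is called with its defaults, under which the weights appear to be treated as frequency counts, so the divisor is the total weight minus one. The names `x`, `w`, `mu_w`, `var_w` and `sd_w` are illustrative only.

```r
# Toy data from the question
dat <- data.frame(value  = c(1, 2, 3, 4, 5, 6, 7, 8, 9),
                  weight = c(200, 2, 3, 4, 5, 6, 7, 8, 9))

x <- dat$value
w <- dat$weight

# Weighted mean: sum of weight * value divided by the total weight
mu_w <- sum(w * x) / sum(w)

# Weighted variance with the weights treated as frequencies (assumption),
# so the divisor is (total weight - 1)
var_w <- sum(w * (x - mu_w)^2) / (sum(w) - 1)
sd_w  <- sqrt(var_w)

round(c(mean = mu_w, sd = sd_w), 2)
# mean   sd
# 1.98 2.28
```

Whether a frequency-style divisor is appropriate depends on what the weights actually represent.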

The weighted 95% confidence interval runs from 0.49 to 3.47:

```r
upperConfidenceInterval = mu + 1.96*(sd/sqrt(9))
lowerConfidenceInterval = mu - 1.96*(sd/sqrt(9))
upperConfidenceInterval
# [1] 3.473633
lowerConfidenceInterval
# [1] 0.4935802
```

BUT the data in this toy example are not normal, and neither are the data in my real data set.

**So, when it comes to providing information on the variation of the data, do the weighted sd and confidence interval make sense? Or can I use Chebyshev's inequality with k = 2 to say**

```r
upperConfidenceInterval = mu + 2*(sd/sqrt(9))
lowerConfidenceInterval = mu - 2*(sd/sqrt(9))
```

that at least 75% of the distribution is between 0.46 and 3.5?
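
For reference, this is the usual population form of Chebyshev's inequality, which is where the 75% figure for k = 2 comes from. The symbols are notation assumed here: X is a random variable with true mean μ and true standard deviation σ.

```latex
P\left(|X - \mu| \ge k\sigma\right) \le \frac{1}{k^2}
\quad\Longrightarrow\quad
P\left(\mu - 2\sigma < X < \mu + 2\sigma\right) \ge 1 - \frac{1}{2^2} = 0.75
```

In this form the bound is stated in terms of σ itself rather than σ/√n, and it takes μ and σ as known quantities rather than sample estimates.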

If the data are not normal and I don't use Chebyshev's inequality, can I use the 1st and 3rd quartiles to give a measure of spread?

Some say to report the 1st and 3rd quartiles. The 1st and 3rd quartiles of the UNWEIGHTED data are 3 and 7, but the mean of the WEIGHTED data is 1.98, which lies outside that range, so reporting UNWEIGHTED quartiles alongside a weighted mean doesn't seem to make sense:

```r
quantile(dat$value)[2] # 1st quartile
quantile(dat$value)[4] # 3rd quartile
```

The WEIGHTED 1st and 3rd quartiles are 0.07 and 0.26, and again the weighted mean is not between them:

```r
quantile(dat$value * dat$weight / sum(dat$weight))[2] # 1st quartile
#        25%
# 0.06557377
quantile(dat$value * dat$weight / sum(dat$weight))[4] # 3rd quartile
#       75%
# 0.2622951
```

Since the quantiles don't make sense, I am thinking of using the weighted standard deviation with Chebyshev's inequality to say that at least 75% of the distribution is between 0.46 and 3.5. Do you agree?

Nick Cox
user3022875
  • This would be a misapplication of Chebyshev's inequality. That inequality uses the *actual parameters* of a distribution, not their estimates. Whether your weighted CI makes any sense depends on what the weights mean and on the underlying distribution of the data. – whuber Jul 07 '16 at 17:05
  • It's machete work to see the question here. You are asking what would be good summaries of ... data you don't show us. That's hard to say. I'll peel off one facet and say that for all the interest of Chebyshev's inequality in probability theory I've never seen it used to report on real data. For one thing, would your audience know anything about it? – Nick Cox Jul 07 '16 at 17:06
  • @Nick the question is: when you have a non-normal distribution, report a weighted mean, and want to report a measure of variation of that mean, what statistic is best? A weighted standard deviation, some sort of weighted quantile (which doesn't seem to make sense), or something else? – user3022875 Jul 07 '16 at 17:53
  • @Whuber in real life the weights are dollar values. – user3022875 Jul 07 '16 at 17:57
  • @Whuber would you suggest the extended version for when the mean and variance are not known, described here: https://en.wikipedia.org/wiki/Chebyshev%27s_inequality#Finite_samples – user3022875 Jul 07 '16 at 19:12
  • There is a nice analysis of this at http://stats.stackexchange.com/a/82694/919 . I cannot find any circumstances under which these "Chebyshev" limits produce CIs, though. Yes, the calculation produces intervals, but they don't behave as CIs do. – whuber Jul 07 '16 at 20:49
  • There is no inherent problem in weighted quantiles. It's a matter of inverting the cumulative probability distribution, where the cumulation includes weighting. – Nick Cox Jul 10 '16 at 12:32
  • @Nick - can you give an example? TY – user3022875 Jul 11 '16 at 15:18
  • Weights 0.1, 0.2, 0.4, 0.2, 0.1; values 1,2,3,4,5; median corresponds to cumulative weight 0.5 and so is 3. You can be as complicated as you wish, e.g. interpolate in the sum of weights to get values corresponding to 0.25 and 0.75 cumulative probability. – Nick Cox Jul 11 '16 at 16:15
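
The inversion of the weighted cumulative distribution described in the last comment can be sketched in base R as follows. This is a minimal illustration only: `weighted_quantile()` is a hypothetical helper written for this example, not a function from Hmisc, and it returns the smallest value whose cumulative weight reaches the requested probability, with no interpolation.

```r
# Hypothetical helper: weighted quantile by inverting the
# weighted cumulative distribution (no interpolation)
weighted_quantile <- function(x, w, probs) {
  ord <- order(x)               # sort the values, carrying weights along
  x <- x[ord]
  w <- w[ord]
  cum_w <- cumsum(w) / sum(w)   # weighted cumulative probability
  # smallest value whose cumulative weight reaches each probability
  sapply(probs, function(p) x[which(cum_w >= p)[1]])
}

# Example from the comment: the weighted median is 3
weighted_quantile(c(1, 2, 3, 4, 5), c(0.1, 0.2, 0.4, 0.2, 0.1), 0.5)
# [1] 3

# Toy data from the question
dat <- data.frame(value  = c(1, 2, 3, 4, 5, 6, 7, 8, 9),
                  weight = c(200, 2, 3, 4, 5, 6, 7, 8, 9))
weighted_quantile(dat$value, dat$weight, c(0.25, 0.5, 0.75))
# [1] 1 1 1
```

For the toy data the first value carries 200 of the 244 units of weight, about 82%, so under this definition the weighted quartiles and median all equal 1.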
