35

So, I have a data set of percentages like so:

100   /   10000   = 1% (0.01)
2     /     5     = 40% (0.4)
4     /     3     = 133% (1.3) 
1000  /   2000    = 50% (0.5)

I want to find the standard deviation of the percentages, but weighted for their data volume. ie, the first and last data points should dominate the calculation.

How do I do that? And is there a simple way to do it in Excel?

Yahel
  • 555
  • 3
  • 9
  • 11
  • The formula with (M-1)/M is correct. If you have a doubt, check it by setting all the weights equal to 1, and you will obtain classical formula for unbiased estimate for the standard deviation with (N-1) in the denominator. To whuber: unusual does not mean incorrect. –  Sep 24 '14 at 18:10
  • 1
    The formula with (M-1)/M is NOT CORRECT. Imagine you add a million points with weights of one trillionth. You don't change your answer at all regardless of what those weights are, but your $(M-1)/M$ term becomes 1? Absolutely NOT! If you care that $(M-1)/M \neq 1$, then you also care that this is just wrong. – Rex Kerr Sep 08 '15 at 16:43
  • I wonder *why* you want the standard deviation here? you only have $4$ numbers! How is that too many numbers? Especially when percentages are more easily explained and understood. – probabilityislogic May 25 '18 at 05:32
  • The highest vote is correct. Please check https://www.itl.nist.gov/div898/software/dataplot/refman2/ch2/weightsd.pdf – Bo Wang May 25 '18 at 05:13
  • 3
    @probabilityislogic it was a simplified example for keeping the question short. – Yahel Jun 11 '18 at 15:50

6 Answers6

47

The formula for weighted standard deviation is:

$$ \sqrt{ \frac{ \sum_{i=1}^N w_i (x_i - \bar{x}^*)^2 }{ \frac{(M-1)}{M} \sum_{i=1}^N w_i } },$$

where

$N$ is the number of observations.

$M$ is the number of nonzero weights.

$w_i$ are the weights

$x_i$ are the observations.

$\bar{x}^*$ is the weighted mean.

Remember that the formula for weighted mean is:

$$\bar{x}^* = \frac{\sum_{i=1}^N w_i x_i}{\sum_{i=1}^N w_i}.$$

Use the appropriate weights to get the desired result. In your case I would suggest to use $\frac{\mbox{Number of cases in segment}}{\mbox{Total number of cases}}$.

To do this in Excel, you need to calculate the weighted mean first. Then calculate the $(x_i - \bar{x}^*)^2$ in a separate column. The rest must be very easy.

anjoschu
  • 3
  • 3
deps_stats
  • 1,615
  • 1
  • 17
  • 17
  • 3
    @Gilles, you're right. deps_stats, the fraction $(M-1)/M$ in the SD is unusual. Do you have a citation for this formula or can you at least explain the reason for including that term? – whuber Jun 25 '12 at 14:14
  • Thanks for your post. But I think there may be a mistake in the weighted mean expression. Shouldn't it be divided with the sum of the weights ? –  Jun 25 '12 at 13:40
  • the sum of all weights is BY DEFINITION = 1, so what is the point of including that in the divisor term? –  Jan 22 '13 at 21:28
  • 4
    @Aaron Weights are not always defined to sum to unity, as exemplified by the weights given in this question! – whuber Jan 22 '13 at 21:36
  • 2
    (-1) I am downvoting this answer because no justification or reference for the $(M-1)/M$ term has been provided (and I'm pretty sure it does *not* make the estimate of the variance unbiased, which would be its apparent motivation). – whuber Jan 22 '13 at 21:39
  • 2
    In light of the added reference (which is not authoritative, but it is a reference) I am removing the downvote. I am not upvoting this answer, though, because calculations show the proposed weighting does *not* produce an unbiased estimate of anything at all (except when all weights equal $1$). The real difficulty here--which is the fault of the question, not the answer--is that it's not clear what this "weighted standard deviation" is attempting to estimate. Without a definite estimand, there is no justification to introduce an $(M-1)/M$ factor to "reduce bias" (or for any other reason). – whuber Jul 31 '13 at 14:47
  • The formula with (M-1)/M is correct. If you have a doubt, check it by setting all the weights equal to 1, and you will obtain classical formula for unbiased estimate for the standard deviation with (N-1) in the denominator. To whuber: unusual does not mean incorrect. –  Sep 24 '14 at 18:10
  • 1
    @Mikhail You are correct that "unusual" and "right" have little to do with one another. However, unusual results do implicitly demand a little more justification because being unusual is one indicator that an error may have been made. Your argument is invalid: although the formula indeed reduces to one for an unbiased estimator when all weights are equal, that does not imply the estimator remains unbiased when unequal weights are used. I am not asserting your conclusion is wrong, but only that so far no valid justification has been offered. – whuber Sep 24 '14 at 20:17
  • I'm sorry to ask, but this isn't clear to me. The comments seem to indicate this answer introduces a bias, but I'm not seeing a complete solution with the unbiased version. I do see the (N - 1), is that used instead of (M - 1)/M? That remark says standard deviation, not weighted standard deviation. I'm not trying to split hairs, just want to make sure I'm understanding things properly. Would someone that feels this answer is introducing bias please submit a new answer containing their proposed modification, justification, and references? – Bennett Dill Jan 04 '15 at 02:48
  • 1
    @whuber - This is wrong, as you suspected. It's correct if the weights themselves are frequencies. But although frequencies go into computing the percentages _in this case the weights, though unspecified, are not frequencies of occurrence but something else to do with "data volume"_. So this is the wrong answer. – Rex Kerr Sep 08 '15 at 17:50
  • How do you go about computing the weights for the measurements? If i'm reading this right, they just exist with no source? – Diesel Apr 16 '17 at 19:13
23

The formulae are available various places, including Wikipedia.

The key is to notice that it depends on what the weights mean. In particular, you will get different answers if the weights are frequencies (i.e. you are just trying to avoid adding up your whole sum), if the weights are in fact the variance of each measurement, or if they're just some external values you impose on your data.

In your case, it superficially looks like the weights are frequencies but they're not. You generate your data from frequencies, but it's not a simple matter of having 45 records of 3 and 15 records of 4 in your data set. Instead, you need to use the last method. (Actually, all of this is rubbish--you really need to use a more sophisticated model of the process that is generating these numbers! You apparently do not have something that spits out Normally-distributed numbers, so characterizing the system with the standard deviation is not the right thing to do.)

In any case, the formula for variance (from which you calculate standard deviation in the normal way) with "reliability" weights is

$${ \sum {w_i (x_i - x^*)^2} \over {\sum w_i - {\sum w_i^2 \over \sum w_i }} }$$

where $x^* = \sum w_i x_i / \sum w_i$ is the weighted mean.

You don't have an estimate for the weights, which I'm assuming you want to take to be proportional to reliability. Taking percentages the way you are is going to make analysis tricky even if they're generated by a Bernoulli process, because if you get a score of 20 and 0, you have infinite percentage. Weighting by the inverse of the SEM is a common and sometimes optimal thing to do. You should perhaps use a Bayesian estimate or Wilson score interval.

Rex Kerr
  • 338
  • 2
  • 7
  • 2
    +1. The discussion of the different meanings of weights was what I was looking for in this thread all along. It is an important contribution to all of this site's questions about weighted statistics. (I am a little concerned about the parenthetical remarks concerning normal distributions and standard deviations, though, because they incorrectly suggest that SDs have no use outside a model based on normality.) – whuber Sep 08 '15 at 18:23
  • @whuber - Well, central limit theorem to the rescue, of course! But for what the OP was doing, trying to characterize that set of numbers with a mean and standard deviation seems exceedingly inadvisable. And in general, for many uses the standard deviation ends up luring one into a false feeling of understanding. For instance, if the distribution is anything but normal (or a good approximation thereof), relying on the standard deviation will give you a bad idea of the shape of the tails, when it is exactly those tails that you probably most care about in statistical testing. – Rex Kerr Sep 08 '15 at 19:44
  • @RexKerr We can hardly blame standard deviation if people place interpretations on it that are undeserved. But let's move away from normality and consider the much broader class of continuous, symmetric unimodal distributions with finite variance (for example). Then between 89 and 100 percent of the distribution lies within two standard deviations. That's often pretty useful to know (and 95% lies pretty much in the middle, so it's never more than about 7% off); with many common distributions, the dropping symmetry aspect doesn't change much (e.g. look at the exponential, for example).... ctd – Glen_b Oct 01 '15 at 23:57
  • ctd... -- or if we don't make any of those assumptions, there's always the ordinary Chebyshev bounds which do at least say something about the tails and standard deviation.. – Glen_b Oct 01 '15 at 23:58
  • @RexKerr you mention the "formula for standard deviation", isn't that the variance? – Gabriel Oct 02 '15 at 00:56
  • 1
    @Gabriel - Yes, sorry, I was being sloppy. (I figure people can tell which is which by glancing.) I've corrected my description. – Rex Kerr Oct 02 '15 at 01:32
  • @Rex Kerr You said that "the formula for variance (from which you calculate standard deviation in the normal way) with "reliability" weights is ∑wi(xi−x∗)2∑wi−∑w2i∑wi ∑wi(xi−x∗)2∑wi−∑wi2∑wi where x∗=∑wixi/∑wix∗=∑wixi/∑wi is the weighted mean. Do you have a reference for this formula apart from wikipedia? – javivr Jun 28 '16 at 11:42
  • @javivr Wikipedia cites the [GSL documentation](https://www.gnu.org/software/gsl/manual/html_node/Weighted-Samples.html), although I'm not sure if you should consider that any more authoritative than Wikipedia itself. There is also a derivation given in the Wikipedia page itself, which looks sound at a glance. – shadowtalker Nov 12 '18 at 16:11
6
=SQRT(SUM(G7:G16*(H7:H16-(SUMPRODUCT(G7:G16,H7:H16)/SUM(G7:G16)))^2)/
     ((COUNTIFS(G7:G16,"<>0")-1)/COUNTIFS(G7:G16,"<>0")*SUM(G7:G16)))

Column G are weights, Column H are values

gung - Reinstate Monica
  • 132,789
  • 81
  • 357
  • 650
user35936
  • 61
  • 1
  • 1
2

If we treat weights like probabilities, then we build them as follows: $$p_i=\frac{v_i}{\sum_iv_i},$$ where $v_i$ - data volume.

Next, obviously the weighted mean is $$\hat\mu=\sum_ip_ix_i,$$ and the variance:$$\hat\sigma^2=\sum_ip_i(x_i-\hat\mu)^2$$

Aksakal
  • 55,939
  • 5
  • 90
  • 176
1

Late in the day I know, but in reference to Whuber's insistance on an authoritative justification for the (M-1)/M term for an unbiased estimate, perhaps Prof. James Kirchner's justification, download currently available at http://seismo.berkeley.edu/~kirchner/Toolkits/Toolkit_12.pdf, which references

Bevington, P. R., Data Reduction and Error Analysis for the Physical Sciences, 336 pp.,
McGraw-Hill, 1969

will do?

Prof. Kirchner distinguishes between

  1. "Case I" in which some points are more important than others (hence the weighting) but the uncertainties associated with each point are assumed to be the same
  2. "Case II" in which the points are equally important but the uncertainties associated with each point are not the same.

For FabioSpaghetti's comment from yesterday, the above linked paper also shows how to calculate the standard error.

babelproofreader
  • 4,544
  • 4
  • 22
  • 35
0
Option Explicit

Function wsdv(vals As Range, wates As Range)
Dim i, xV, xW, y As Integer
Dim wi, xi, WgtAvg, N
Dim sumProd, SUMwi

    sumProd = 0
    SUMwi = 0
    N = vals.Count  ' number of values to determine W Standard Deviation
    xV = vals.Column  ' Column number of first value element
    xW = wates.Column  ' Column number of first weight element
    y = vals.Row - 1  ' Row number of the values and weights

    WgtAvg = WorksheetFunction.SumProduct(vals, wates) / WorksheetFunction.Sum(wates)

    For i = 1 To N  ' step through the elements, calculating the sum of values and the sumproduct
        wi = ActiveSheet.Cells(i + y, xW).Value  ' (i+y, xW) is the cell containing the weight element
        SUMwi = SUMwi + wi
        xi = ActiveSheet.Cells(i + y, xV).Value  ' (i+y, xV) is the cell containing the value element
        sumProd = sumProd + wi * (xi - WgtAvg) ^ 2
    Next i

    wsdv = (sumProd / SUMwi * N / (N - 1)) ^ (1 / 2)  ' output of weighted standard deviation

End Function
gung - Reinstate Monica
  • 132,789
  • 81
  • 357
  • 650