Average of values, their standard deviations and ranges

Question

After reading for hours, I am still swimming.

I have a set of plasma concentration results from different labs:

value (mg/L), standard deviation, range, sample count

For example (in reality, I have a lot more rows):

1. 25.8 mg/L, +/- 8.0, 13.0-41.0, 24
2. 55.0 mg/L, +/- 7.9, ?-100.0, 10

For the second value, I have only the highest measured concentration available, which I put as the high end of the range. For the low end, I put a "?".

I would like to calculate an average of these two rows to summarize the results in a single row. Further data is not available.

Update: I just realized that the values could be means or medians; I don't even know which.

Distribution:

The lab values are from different labs, and they might use different techniques. I would think that a unit like mg/L represents an "absolute" measurement, meaning that established different techniques should result in similar results within a certain range, and if not, the technique should be questioned (which is out of scope for me), but I am ready to learn better. Anyway, I have no way of knowing which technique was used.

The results are from a somewhat uniform population: the patients have a certain untreated medical condition which appears on average at a certain older age. But it could happen to a much younger or much older person, too, and I don't know the patient's age for a particular value. Also, it's medical, so there could be differences in location, genetics, or other preconditions.

So considering tristan's answer, I would say they are from different distributions, although I'd rather they were not :)

Question 1: For the value, I sum the values and divide by the number of rows: (55.0+25.8)/2 = 40.4 mg/L. Is that good practice, or should I do it differently?

Question 2: What would be the statistically correct way to get an average of the standard deviation? Does that even make sense?

Question 3: How to get an "average" of the range? I would think that I just take the lowest range and the highest range of all results (13.0-100.0), can someone confirm/disconfirm?

Question 4: How should I handle missing low ranges as in the example, and is putting the max value as high range even a good idea?

Question 5: I am not at all sure if the samples should play a role here. If they do, how?

Thanks in advance for reading and any help.

There are a number of posts on site for getting variance (or standard deviation, obviously if you have one you can get the other) for combined samples (they don't match the answer on this page). — Glen_b, Apr 13 '15 at 16:08
As I said, I've been reading here for hours. One post I studied in particular and was able to understand is [here](http://stats.stackexchange.com/questions/25848/how-to-sum-a-standard-deviation), but I am so out of my league here that I was unsure if my samples play a role (aren't the values already the mean or median of them?), or if and how to factor in ranges. So I thought I'd rather explain my available data set and learn from there. But please go ahead and give me links I should read! — user1840267, Apr 13 '15 at 16:24
A key question is why you want to combine the data? What will combining the data achieve for you? Are you interested in analysing differences between labs, or between groups of patients? — tristan, Apr 13 '15 at 16:29
user1840267 See [here](https://stats.stackexchange.com/questions/43159/how-to-calculate-pooled-variance-of-two-groups-given-known-group-variances-mean) for one example — Glen_b, Apr 13 '15 at 16:37
I think the best description is I would like to give an overview of results between healthy patients and diseased ones. Like, having an overview page with the average results and then show for each type the detailed rows of different measurements. Should I update the question to explain? — user1840267, Apr 13 '15 at 16:41
More explanation of why I would like to combine the data: Suppose I would like to explain to someone else in which range (not in the statistical sense, but to give an idea) a pathological value lies. Presenting rows of single measurements is not very helpful, so I am trying to give one average number as a "signal", with the option to view on request all the single values that lead to that average. — user1840267, Apr 13 '15 at 17:00

tristan · Accepted Answer · 2015-04-13T17:31:31.903

Welcome to Cross Validated!

I assume that you have far more than just results for two labs. I will attempt to answer your questions:

Question 1: The formula you would use here depends on whether you believe the labs are all sampling from the same distribution or not. If you think they are, then you should weight according to the number of samples each lab took:

$$ \bar{x}_p = \frac{\sum_{i}{n_i x_i}}{\sum_i{n_i}} $$

where $x_i$ is the value reported by lab $i$ and $n_i$ is the number of samples that lab took. For your two lines of data this gives $\bar{x}_p = (24\times 25.8 + 10\times 55.0)/(24+10) = 1169.2/34 \approx 34.4$.

Pseudo-code, where k is the number of labs:

numerator <- 0
denominator <- 0
For i From 1 To k Do
    numerator <- numerator + x[i]*n[i]
    denominator <- denominator + n[i]
Next i
Return (numerator/denominator)
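
For anyone who prefers something runnable, the same calculation could be sketched in Python roughly as follows. The function name `pooled_mean` and the list arguments are illustrative choices rather than anything from the question, and the sketch assumes the reported values are means and that all labs sample from the same distribution.

```python
def pooled_mean(values, counts):
    """Sample-size-weighted mean of per-lab values.

    Assumption: every lab samples from the same distribution and each
    reported value is a mean (not a median).
    """
    if len(values) != len(counts):
        raise ValueError("values and counts must have the same length")
    # Weight each lab's value by the number of samples behind it.
    numerator = sum(n * x for x, n in zip(values, counts))
    denominator = sum(counts)
    return numerator / denominator


# The two example rows from the question.
print(round(pooled_mean([25.8, 55.0], [24, 10]), 1))  # 34.4
```

Note how the result sits closer to 25.8 than the unweighted 40.4 does, because the first lab contributed more samples.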

If you think these labs might actually be sampling from different distributions (perhaps some labs are using different techniques?) then there are other common approaches which might be used. If this is the case we can go through them.

Question 2: Again, this is fairly simple if you believe the labs are sampling from the same distribution:

$$ s_p = \sqrt{\frac{\sum_i{(n_i-1)s_i^2}}{\sum_i{(n_i-1)}}} $$

where $s_i$ is the sample standard deviation reported by lab $i$ (the one computed with the $n_i-1$ denominator). For your data it gives $s_p = \sqrt{(23\times 8.0^2 + 9\times 7.9^2)/(23+9)} = \sqrt{2033.7/32} = \sqrt{63.55} \approx 7.97$.

Pseudo-code:

numerator <- 0
denominator <- 0
For i From 1 To k Do
    numerator <- numerator + (n[i]-1)*sd[i]^2
    denominator <- denominator + (n[i]-1)
Next i
Return sqrt(numerator/denominator)
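
A runnable version of the pooled standard deviation might look like the sketch below. The name `pooled_sd` is illustrative, and the sketch assumes each reported SD is a sample SD with the n - 1 denominator and that all labs share a common distribution. The final step takes the square root, as in the formula for $s_p$.

```python
import math


def pooled_sd(sds, counts):
    """Pooled standard deviation from per-lab SDs and sample counts.

    Assumption: a common distribution across labs, and each SD was
    computed with the n - 1 denominator.
    """
    if len(sds) != len(counts):
        raise ValueError("sds and counts must have the same length")
    # Sum of squared deviations within each lab, recovered from its SD.
    numerator = sum((n - 1) * s ** 2 for s, n in zip(sds, counts))
    # Total degrees of freedom across labs.
    denominator = sum(n - 1 for n in counts)
    return math.sqrt(numerator / denominator)


# The two example rows from the question.
print(round(pooled_sd([8.0, 7.9], [24, 10]), 2))  # 7.97
```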

The picture is more complicated if you do not believe the labs are sampling from the same distribution.
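
One way to see part of the difficulty: the pooled formula only captures the spread within each lab. If the rows are instead treated as one merged sample, the spread between the lab means has to be added too. The sketch below uses the standard sum-of-squares decomposition for the SD of merged data. It is not part of the calculation above, `combined_sd` is a hypothetical name, and it still assumes that the reported values are means and the SDs use the n - 1 denominator. It describes the merged data only; it is not a model for labs that genuinely sample from different distributions.

```python
import math


def combined_sd(means, sds, counts):
    """SD of the merged data, reconstructed from per-lab summaries.

    Assumption: each value is a mean and each SD uses n - 1.
    """
    total_n = sum(counts)
    grand_mean = sum(n * m for m, n in zip(means, counts)) / total_n
    # Squared deviations around each lab's own mean.
    within = sum((n - 1) * s ** 2 for s, n in zip(sds, counts))
    # Extra squared deviations caused by lab means differing.
    between = sum(n * (m - grand_mean) ** 2 for m, n in zip(means, counts))
    return math.sqrt((within + between) / (total_n - 1))


# The two example rows from the question.
print(round(combined_sd([25.8, 55.0], [8.0, 7.9], [24, 10]), 1))  # 15.6
```

For the two example rows this is roughly twice the pooled value, which reflects how far apart the two lab means are.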

Question 3: If you have missing values, it is best to assume that they can take any physically possible value. For your example I would combine the ranges for those two labs as ?–100.0. If another lab had the range, for example, 24.0–?, I would combine them to make ?–?, or leave it as ?–100.0 if 100.0 is the maximum value that could possibly be taken.
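
As a rough sketch of that rule in Python, a missing bound can be represented as `None`. The function name `combine_ranges` and the optional `physical_low` / `physical_high` arguments are assumptions made for illustration; they stand in for whatever limits are known to be physically possible, and stay unknown unless supplied.

```python
def combine_ranges(ranges, physical_low=None, physical_high=None):
    """Combine (low, high) ranges; None marks a missing bound.

    A missing bound is treated as "could be anything physically
    possible", so the combined bound falls back to physical_low or
    physical_high, which are None (unknown) unless the caller sets them.
    """
    lows = [low for low, _ in ranges]
    highs = [high for _, high in ranges]
    combined_low = physical_low if None in lows else min(lows)
    combined_high = physical_high if None in highs else max(highs)
    return combined_low, combined_high


# The two example rows: the low bound of the second row is unknown.
print(combine_ranges([(13.0, 41.0), (None, 100.0)]))  # (None, 100.0)

# Adding a lab with an unknown high bound makes both ends unknown...
rows = [(13.0, 41.0), (None, 100.0), (24.0, None)]
print(combine_ranges(rows))  # (None, None)

# ...unless 100.0 is known to be the largest possible value.
print(combine_ranges(rows, physical_high=100.0))  # (None, 100.0)
```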

Question 4: See answer to Question 3.

Question 5: The sample counts are valuable, because they tell you (along with the standard deviation) how precise each lab's estimate of the mean is. You will see that the formulae in the answers to Questions 1 and 2 both include $n_i$. It would be common to report the sum of $n_i$ along with any combined results.
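
To make "how precise" concrete, the usual measure is the standard error of the mean, $s_i/\sqrt{n_i}$. The sketch below computes it for the two example rows; `standard_error` is an illustrative name, and it assumes the reported values are means of independent samples.

```python
import math


def standard_error(sd, n):
    """Standard error of a mean estimated from n independent samples."""
    if n <= 0:
        raise ValueError("n must be positive")
    return sd / math.sqrt(n)


# (label, standard deviation, sample count) for the two example rows.
rows = [("row 1", 8.0, 24), ("row 2", 7.9, 10)]
for label, sd, n in rows:
    print(label, round(standard_error(sd, n), 2))
```

This gives roughly 1.6 mg/L for the first row and 2.5 mg/L for the second: the standard deviations are nearly identical, but the lab with 24 samples pins down its mean more tightly than the lab with 10.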

So, how do you know whether the labs are sampling from the same distribution? Ideally this should be something you know. If the labs are receiving randomly allocated samples from a uniform source then they probably are sampling from the same distribution. If they are measuring samples from different populations (e.g., some labs from a healthy population, some labs from a diseased population, or different prevalences, or different ages, etc.) or are measuring them using different techniques then you probably can't assume they are from the same distribution.

Dear Tristan, thank you very much, that is already a tremendous help. Yes, I have a lot more lab results. I updated the question about the distribution; bottom line, I think they are different. I have one tiny request: Would it be possible for you to create an example for the formulas with the numbers in the example rows? Examples with real numbers are much easier for me to understand. And thanks for the welcome! — user1840267, Apr 13 '15 at 16:11
@user1840267 maybe *you* could provide some sample data as an example illustrating your question (it could be made-up)? This is something people do here quite often. It also helps you to get an answer illustrated with an example that is clear to *you*. — Tim, Apr 13 '15 at 16:43
You mean I should give an example in the form I would best understand? If yes, I think it would be some pseudocode or PHP, but would that make sense on this site? — user1840267, Apr 13 '15 at 16:46
@user1840267 I have used your two example data rows to show calculation details and provided some pseudocode. — tristan, Apr 13 '15 at 17:32
