
I have a set of temperature measurements from amateur weather stations around where I live. Since these are amateur stations, the measurements vary because of probe placement, measurement conditions, etc.

I would like to derive a realistic mean from these measurements.

The frequency distribution for this morning is

[Histogram of this morning's temperature measurements]

Since the temperature was around 4 to 7°C (according to the official bulletin), the bulk of the measurements makes sense, except for a few which must be next to a heater or something.

How should I calculate the mean of such a sample, specifically so that these bizarre temperatures are given less weight?

My first thought was to bin the data into 1° bins (I will ultimately display a rounded temperature anyway) and calculate a mean from the frequency table.

But then I thought that since the temperature distribution should be normal (source: gut feeling), maybe there is a mechanism to discard the measurements which do not fit a normal distribution fitted to the bulk of the data (I am sure there is a theorem named after someone which states exactly this).

I am looking for an automatic mechanism / algorithm as the data pooling and handling will be scripted.
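
For illustration, here is a minimal Python sketch of that kind of automatic filtering (the readings are made up, and discarding anything more than 3 MADs from the median is just one plausible rule, not something prescribed in the thread):

```
# Keep only readings close to the bulk (median ± 3 MAD), then average them.
import statistics

def filtered_mean(temps, cutoff=3.0):
    """Mean of the values lying within `cutoff` MADs of the median."""
    center = statistics.median(temps)
    # Median absolute deviation: a robust measure of spread.
    mad = statistics.median(abs(t - center) for t in temps)
    if mad == 0:
        return center
    kept = [t for t in temps if abs(t - center) <= cutoff * mad]
    return sum(kept) / len(kept)

readings = [4.8, 5.1, 5.6, 6.2, 4.3, 5.0, 5.9, 6.8, 21.4, 18.9]  # two "heater" outliers
print(round(filtered_mean(readings)))  # rounded to 1°, as wanted for display
```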

WoJ
  • You might want to look at methods from the field of robust statistics and/or outlier detection. For example, using the arithmetic mean as an estimator for the mean of the distribution is not considered robust, whereas the median is. – Dr_Be Jan 08 '16 at 10:56
  • Similarly to @BerndH I'd recommend a trimmed mean here. Indeed nothing stops you from looking at trimmed means with different proportions of trimming and thinking about what level of trimming works well for your situation (a minimal sketch follows this comment thread). See e.g. http://stats.stackexchange.com/questions/117950/how-can-i-interpret-a-plot-of-trimming-percentage-vs-trimmed-mean – Nick Cox Jan 08 '16 at 13:22
  • Note that rounding the data and then using the frequency table to calculate means would do nothing to solve the main problem. It just adds imprecision to the estimate of the mean. – Nick Cox Jan 08 '16 at 14:37
  • @NickCox: sorry for not having been clear. What I meant is that I want to see temperatures **ultimately** rounded (the final mean on my display), thus the choice of the size of the bin for the sampling when doing the frequency table. – WoJ Jan 08 '16 at 15:04
  • In turn I now don't follow the point you are making. Neither rounding nor binning is needed in any calculation of the mean except in the presentation of the final result. You have, it seems, 50 measurements and the question is how to summarize them. From your example, the mean is a bad idea and some trimmed mean would work better. – Nick Cox Jan 08 '16 at 15:22
  • @NickCox: yes, that's right and this is what I will end up doing. The binning (and mean of a frequency) was my first idea (and when I was mentioning rounding it was to explain why I would be choosing 1° bins, and not 2° or 0.3° - because the mean temperature I will use at the end will be rounded to 1°). Thanks for the help. – WoJ Jan 08 '16 at 15:26
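
A minimal sketch of the trimmed mean suggested in the comments, looping over several trimming proportions as Nick Cox recommends (the readings are made up; scipy.stats.trim_mean would give the same result where SciPy is available):

```
# Trimmed mean: drop the lowest and highest `proportion` of the sorted values.
def trimmed_mean(values, proportion=0.1):
    data = sorted(values)
    k = int(len(data) * proportion)           # how many values to cut from each end
    kept = data[k:len(data) - k] if k else data
    return sum(kept) / len(kept)

readings = [4.8, 5.1, 5.6, 6.2, 4.3, 5.0, 5.9, 6.8, 21.4, 18.9]
# Try a few trimming proportions and see where the estimate stabilises.
for p in (0.0, 0.1, 0.2, 0.3):
    print(p, round(trimmed_mean(readings, p), 2))
```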

1 Answer


Your title (and question body) ask about computing the mean of a sample.

The sample mean is perfectly well-defined and doesn't depend in any way on the distribution from which the data were drawn (as long as it makes sense to actually calculate a mean; you can hardly do that if the values are nominal categories like $\{\text{red},\text{blue},\text{yellow}\}\, $).

If your problem were instead how best to estimate a population mean, and you had values that were in some sense obviously contaminated, that might lead you to try some kind of estimate that is somewhat robust to such contamination (perhaps an $M$-estimator, for example). But as it stands, the question is about the sample mean, and the sample mean is what it is: the sum of the values divided by the count of values.
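
To make that last point concrete, here is a minimal Python sketch of one such robust estimate: a Huber-type location M-estimate computed by iteratively re-weighted averaging (the data are made up, and the tuning constant k = 1.5 is just a common default; libraries such as statsmodels offer ready-made versions):

```
# Huber M-estimate of location, using the MAD as the scale estimate.
import statistics

def huber_location(values, k=1.5, iters=50):
    mu = statistics.median(values)
    mad = statistics.median(abs(v - mu) for v in values)
    scale = 1.4826 * mad or 1.0              # fall back if all values are identical
    for _ in range(iters):
        # Values within k scale-units of mu keep full weight; others are down-weighted.
        weights = [min(1.0, k * scale / abs(v - mu)) if v != mu else 1.0 for v in values]
        new_mu = sum(w * v for w, v in zip(weights, values)) / sum(weights)
        if abs(new_mu - mu) < 1e-9:
            break
        mu = new_mu
    return mu

readings = [4.8, 5.1, 5.6, 6.2, 4.3, 5.0, 5.9, 6.8, 21.4, 18.9]
print(round(huber_location(readings), 2))    # stays close to the bulk despite the two outliers
```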

Glen_b