
Suppose we have a skewed, non-normal sample, where the skewness has various causes. The skewness affects the calculation of the mean. Would the following steps give a better estimate of the mean? (A rough code sketch follows the list.)

  1. Apply a transformation, such as Box-Cox (or another; which one?), to obtain a distribution closer to normal.

  2. Calculate the mean of the transformed data.

  3. Apply the inverse transformation to that mean to get a better estimate of the original mean.
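
For concreteness, a rough R sketch of the three steps might look like this (the data values are placeholders; `boxcox` from the MASS package is one way to estimate the transformation parameter, and the one-parameter Box-Cox form with its inverse is shown):

    library(MASS)

    x <- c(149, 99, 99, 97, 95, 97, 98, 62, 48, 135)  # placeholder values

    # Step 1: estimate the Box-Cox parameter lambda
    bc <- boxcox(x ~ 1, plotit = FALSE)
    lambda <- bc$x[which.max(bc$y)]

    # Step 2: transform the data and take the mean on the transformed scale
    # (this form assumes lambda != 0; lambda == 0 corresponds to log)
    x_t <- (x^lambda - 1) / lambda
    m_t <- mean(x_t)

    # Step 3: apply the inverse transformation to the transformed mean
    (lambda * m_t + 1)^(1 / lambda)

As whuber points out in the comments, the quantity produced by step 3 is a power mean of the data, not the usual arithmetic mean.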

Link to the sample:

https://www.dropbox.com/scl/fi/2rd64a0h0dolmsqwyjdpp/test2.xlsx?dl=0&rlkey=556dp5sm9w4x44csudtiw3x7l

Ami
  • Welcome to Cross Validated! I’m thinking of an answer to post, but it might help to say what in your mind makes one assessment better than another. – Dave Feb 26 '22 at 00:02
  • This approach does not estimate the usual arithmetic mean: it estimates a *power mean.* [Duan's smearing estimator](https://stats.stackexchange.com/a/58077/919) is worth a look. cc @Dave There are other methods, ranging from Maximum Likelihood to Winsorizing, that might be worth considering. – whuber Feb 26 '22 at 00:08
  • Are you interested only in point estimation, or might you also want a confidence interval for the population mean $\mu?$ – BruceET Feb 26 '22 at 00:56
  • To Dave: skewed data yields a mean calculation that is also skewed. – Ami Feb 26 '22 at 01:04
  • BruceET: confidence interval for the population mean μ will be a plus. – Ami Feb 26 '22 at 01:06
  • To BruceET: I would like to know about all the methods, like Maximum Likelihood and Winsorizing, that have already been mentioned. – Ami Feb 26 '22 at 02:02
  • Have a look at Lambert W x F distributions. See here for an example: https://stats.stackexchange.com/questions/33115/whats-the-distribution-of-these-data . You can get confidence intervals (via MLE) and inverse transformations as well. Do you have data to share/post? – Georg M. Goerg Feb 26 '22 at 03:23
  • @Ami the sampling distribution of the sample mean may be skewed, but it's unbiased. In some cases with skewed distributions (e.g. exponential, Poisson) the sample mean is as good as you can do for estimating the population mean (in at least one typical sense). – Glen_b Feb 26 '22 at 06:37
  • To Georg M. Goerg: I added a link to the data in the body of the question. – Ami Feb 26 '22 at 11:26

2 Answers


This question, like many others, can be answered generally or specifically with reference to your dataset.

The first creates opportunities for experienced members to jump on hobby-horses and ride off in all directions.

Here I focus on the second. The dataset is not too big to be presented here, which may be more convenient for people unwilling to grapple with Dropbox and a spreadsheet file.

effciencies
149
 99
 99
 97
 95
 97
 98
 62
 48
135
 44
 89
 87
 74
 30
 75
 75
 78
 63
 75
 75
 94
 96
 90
 89
 30
 98
 98
 38
 99
 94
 82
102
 82
200
102
101
 95
 93
 66
166
 44
 65
 55
 26
 20
  3
180
102
 95
100
 77
 55
 47
 37
100
 83
 35
 66
  8
 67
 68
 58
 70
 94
 89
 93
 60
 60
 87
 75
 72
 86
 84
 96
 72
 70
 75
 75
 80
 67
 67
 86
 75
 74
 73
 84
 70
 81
 81
 85
 81
 76
 72
 91
 68
 93
 99
 13
100
 88
 93
 78
 88
 90
 90
 85
 35
 17
 33
 77
 49
 81
 43
100
 99
 82
 99
 37
 71
 99
 42
 83
100
 80
 57
 65
 65
 25
100
 99
 74
 72
 74
 56
 65
 76
 80
 84
 31
 51

The data must be plotted for any decisions to be made carefully in context. I use normal quantile plots on original and logarithmic scales, mostly because these show all the detail (some of which needs a story) and because that is a standard form which statistically minded researchers should find familiar. There is no implication that the data are, or should be, normal -- or that it is a problem if they are not.

[Figure: normal quantile plots of the efficiencies on the original and logarithmic scales]
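
For readers working in R, a minimal sketch of similar plots, assuming the efficiencies are stored in a numeric vector `eff`:

    # Normal quantile plots on the original and logarithmic scales
    par(mfrow = c(1, 2))
    qqnorm(eff, main = "Original scale"); qqline(eff)
    qqnorm(log(eff), main = "Logarithmic scale"); qqline(log(eff))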

The idiosyncratic detail here concerns five moderate outliers. Is there a story about those? Do they belong with the others? I am not in favour of discarding or even ignoring outliers unless they are wrong or irrelevant strangers, but that is a question for the OP.

What is a good choice of mean? The mean is what it is; the geometric mean damps the effect of outliers, being exp(mean(log(data))); I make no case for the harmonic mean, but present it alongside.

    Variable |       Type          Obs        Mean       [95% conf. interval]
-------------+---------------------------------------------------------------
 effciencies | Arithmetic          141    76.85816        72.07696   81.63935
             |  Geometric          141    69.68447        63.85033    76.0517
             |   Harmonic          141    54.39093         42.8449   74.45552
-----------------------------------------------------------------------------
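
The point estimates (though not the confidence intervals) are straightforward to reproduce in R -- a sketch, with `eff` as in the earlier snippet:

    mean(eff)            # arithmetic mean
    exp(mean(log(eff)))  # geometric mean
    1 / mean(1 / eff)    # harmonic mean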

Trimmed means don't vary much, but tend to drift slightly, as the plots would imply. The 25% trimmed mean has a longish history as the midmean, and can be explained to boxplot users as the mean of those values that would fall inside the box.

  +------------------------------+
  | percent     #   trimmed mean |
  |------------------------------|
  |       5   127        76.3071 |
  |      10   113        77.4956 |
  |      25    71        79.0563 |
  +------------------------------+
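
In R these are one-liners, since `mean()` takes a `trim` argument giving the fraction removed from each tail (a sketch, with `eff` as before):

    mean(eff, trim = 0.05)  # 5% trimmed mean
    mean(eff, trim = 0.10)  # 10% trimmed mean
    mean(eff, trim = 0.25)  # 25% trimmed mean, the midmean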

In short, summaries based on different recipes vary at least from about 70 to about 79, depending on the focus and on what is downplayed.

The thread "What are good data visualization techniques to compare distributions?" is a recent puff for quantile plots, a 19th-century idea that is still too often neglected.

Nick Cox
  • Nick, First, thanks for your answer and I will study it in depth. Second, I tried very hard to phrase my question without the word "outliers" because it takes you, the statisticians, to districts I did not mean. My story is in a question I asked before. I would be happy if you could add something on the matter. See: https://stats.stackexchange.com/questions/565299/analysis-of-observations-with-and-without-outliers-for-study-the-statistic-diffe – Ami Feb 26 '22 at 13:32
  • I am not a statistician… I intend the description “moderate outliers” to be neutral and do not pre-judge what to do about them. – Nick Cox Feb 26 '22 at 14:13

A common way to decide which estimator is better is to compare them on mean squared error (MSE). Loosely speaking, MSE measures the average squared difference between an estimate and the true value. More technically, MSE is defined as follows for an estimator $\hat\theta$ of a true value $\theta$. (That $\theta$ has nothing to do with angles; it is just the Greek letter statisticians prefer for parameters.)

$$ MSE(\hat\theta) = \mathbb{E}\bigg[ \big(\theta - \hat\theta\big)^2 \bigg] $$

MSE has a convenient decomposition into the bias of the estimator and the variance of the estimator (the so-called "bias-variance decomposition" or "bias-variance" tradeoff that you might hear in machine learning circles).

$$ MSE(\hat\theta) = bias(\hat\theta)^2 + var(\hat\theta) = \big(\mathbb{E}\big[\theta - \hat\theta\big]\big)^2 + var(\hat\theta) $$
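
In a simulation, both sides of this identity can be estimated directly from a vector of repeated estimates. A minimal R sketch (the names `estimates` and `truth` are illustrative, not from any particular package):

    # Empirical MSE of repeated estimates of a known true value
    mse <- function(estimates, truth) {
      mean((estimates - truth)^2)
    }

    # Nearly equivalent, via the bias-variance decomposition
    # (var() divides by B - 1 rather than B, hence "nearly")
    mse_decomposed <- function(estimates, truth) {
      (mean(estimates) - truth)^2 + var(estimates)
    }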

When we know that an estimator has desirable MSE in a situation like the one we face, we might choose that estimator, so let's compare your proposed estimator with the usual $\bar X$ sample mean in a situation like you've described. I've worked this out for $X\sim\chi^2_5$.

We know the bias and variance of $\bar X$, with the small caveat that these quantities must exist (a reasonable assumption in many situations):

$$ bias(\bar X) = 0\\ var(\bar X) = \dfrac{var(X)}{n} $$

In this setting, $var(X) = 10$, so $MSE(\bar X) = bias(\bar X)^2 + var(\bar X) = 0^2 + \dfrac{10}{n} = \dfrac{10}{n}$.

To estimate the MSE of your proposed estimator, which applies a Box-Cox transformation, calculates the mean of the transformed data, and then inverts the transformation, I turned to a simulation in R. I started by simulating many observations from $\chi^2_5$ and determining that the right Box-Cox transformation to achieve approximate normality is about a cube root.

library(MASS)
set.seed(2022)
N <- 30000
x <- rchisq(N, 5)
bc <- boxcox(x ~ 1)
bc$x[which.max(bc$y)] # The lambda maximizing the profile log-likelihood
                      # is about 0.3, close enough to a cube root (1/3)

Next, I simulated $1000$ small samples from $\chi^2_5$, calculated your proposed estimate of the mean (cube-root the data, average the transformed values, cube that average) each time, and calculated the mean squared error of those $1000$ estimates, knowing that the true mean is $5$.

# Now let's simulate the MSE of your proposed estimator
#
B <- 1000 # Iterations
N <- 3 # Sample size...fiddle with this and check out what happens :)
means_bc <- rep(NA, B)
for (i in 1:B){
  
  # Simulate some data
  #
  x <- rchisq(N, 5)
  
  # Apply the cube-root transform
  #
  x_prime <- x^(1/3)
  
  # Calculate the mean of x_prime
  #
  m <- mean(x_prime)
  
  # Now cube the mean and save that value as an estimate of the mean of X
  #
  means_bc[i] <- m^3
  
  print(i)
}

# Calculate the MSE of means_bc (the Box-Cox-based estimates)
#
mse_bc <- (mean(means_bc - 5))^2 + var(means_bc)
  
# Calculate the MSE of the usual x-bar estimator:
# x-bar is unbiased, and var(x-bar) = var(X)/N = 2*5/N
#
mse_usual <- 0^2 + 2*5/N

hist(means_bc)
abline(v = mse_usual, col = 'blue')
abline(v = mse_bc, col = 'black')

The histogram gives the distribution of values calculated your way, the black line is the MSE of your approach, and the blue line is the MSE of $\bar X$.

[Figure: histogram of the Box-Cox-based estimates, with the two MSE values marked by the vertical lines]

Your estimator gives a lower MSE than the usual $\bar X$ estimator. I was surprised.

This doesn't mean that we should routinely Box-Cox-transform our data, calculate the mean, and then invert the transformation. I used a contrived example with $\chi^2_5$ data, and $\chi^2$ distributions usually arise in hypothesis testing, not as models for real data. Further, the Box-Cox transformation is quite effective in this case, yet it need not be in others.

However, this result so surprised me that I had to share.

Dave
  • The cube root being about right for chi-squared distributions is a finding that goes back to Wilson and Hilferty. (+1) – Nick Cox Feb 26 '22 at 19:41
  • Dave, thanks. The idea of making use of a transformation is in what whuber calls a power mean. Architecture performance analyzers usually use the logarithmic transform for better estimation of the mean. And as you showed, sometimes it works. https://stats.stackexchange.com/questions/159061/transformation-among-power-means – Ami Feb 26 '22 at 20:24
  • Transformation does not lead to a better estimate of the mean; quite possibly what it may do, if you're lucky, is suggest a scale on which calculating a mean is not too misleading. At the simplest, if a logarithmic transformation is a good idea, then the geometric mean is a candidate summary, but that is a different mean, not a better estimate of the mean. – Nick Cox Feb 27 '22 at 10:53
  • Nick, your clarification has sharpened my understanding. Thanks. – Ami Feb 27 '22 at 17:57