
Question:

Can we calculate the kurtosis and skewness of 2 or more combined samples, given each sample's mean, standard deviation, sample size, and skewness/kurtosis?

Let's say we have the sample size, sample mean, sample standard deviation, and sample kurtosis of 2 samples ($\{x_1,x_2,\dots,x_n\}$ and $\{y_1,y_2,\dots,y_m\}$); can we calculate the kurtosis of the combination of these 2 samples, $\{x_1, x_2,\dots,x_n,y_1,\dots,y_m\}$?

The difficulty here is that we focus on calculating skewness/kurtosis from the subsamples' statistics instead of the original samples/subsamples. In other words, we need to get the skewness/kurtosis without touching the original data at all! This is reasonable in real industrial practice due to limits on RAM and CPU time.

Further Question:

I have proved that:

  1. For 2 independent samples, if we have the sample size, sample mean, and sample standard deviation of each sample, we can calculate the mean and standard deviation of the combined sample.
  2. For 2 independent samples ($\mathrm{pvctr_1}= \frac{\# clicks}{\# expos}=\frac{\sum_{i=1}^{n} x_{1i}}{\sum_{i=1}^{n} y_{1i}}$, $\mathrm{pvctr_2}= \frac{\# clicks}{\# expos}=\frac{\sum_{j=1}^{m} x_{2j}}{\sum_{j=1}^{m} y_{2j}}$), if we have the sample sizes ($n$ for sample 1, $m$ for sample 2) and the numerator's/denominator's sample mean and sample standard deviation for each sample, we cannot calculate the mean and standard deviation of the combined sample's $\mathrm{pvctr}$. If we also know the covariance between the numerator and denominator in each sample, then we can.
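The first result above can be sketched in a few lines of R. This is a minimal sketch (the helper name `pool_mean_sd` is made up for illustration), assuming the usual $(n-1)$-denominator sample variance as computed by R's `sd()`:

```r
# Pooled mean and sd from subsample summaries only (no raw data needed).
# Assumes the (n - 1)-denominator sample variance convention, as in sd().
pool_mean_sd <- function(n1, mean1, sd1, n2, mean2, sd2) {
  n <- n1 + n2
  pooled_mean <- (n1 * mean1 + n2 * mean2) / n
  # Recover each subsample's sum of squares, then rebuild the pooled variance.
  ss1 <- (n1 - 1) * sd1^2 + n1 * mean1^2
  ss2 <- (n2 - 1) * sd2^2 + n2 * mean2^2
  pooled_var <- (ss1 + ss2 - n * pooled_mean^2) / (n - 1)
  c(mean = pooled_mean, sd = sqrt(pooled_var))
}
```

For example, `pool_mean_sd(length(x), mean(x), sd(x), length(y), mean(y), sd(y))` agrees with `mean(c(x, y))` and `sd(c(x, y))`.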

I was wondering if there exists any theorem that gives the sufficient sample statistics of each sample, to help us calculate the sample statistics of a combined sample?

With all due respect, I think I should emphasize the last question of the post:

I was wondering if there exists any theorem that gives the sufficient sample statistics of each sample, to help us calculate the sample statistics of a combined sample?

I am asking for the most powerful solution to this kind of question, rather than a general way of thinking. For my part, I do know how to calculate the mean/std/skewness/kurtosis based on the subsamples' raw/central moments up to 4th order. But there is duplicate/useless information in the raw/central moments up to 4th order, which means we don't need all of them to calculate the combined sample's statistics. Thus, I want to know the "sufficient statistics of the subsamples" for calculating a particular statistic of the combined sample, so that I can keep my solution extremely small and powerful.
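One concrete sufficient set along these lines is the sample size together with the first four power sums $\sum x, \sum x^2, \sum x^3, \sum x^4$ (equivalently, the raw moments): combining subsamples then reduces to elementwise addition. A minimal R sketch, with made-up function names, using the moment-based (population-style) definitions of skewness and kurtosis (bias-corrected conventions differ):

```r
# Sufficient statistics per subsample: c(n, sum(x), sum(x^2), sum(x^3), sum(x^4)).
# Combining any number of subsamples is elementwise addition of these vectors.
power_sums <- function(x) c(n = length(x), sapply(1:4, function(k) sum(x^k)))

stats_from_sums <- function(s) {
  n <- s[1]
  m <- s[2:5] / n                           # raw moments m1..m4
  # Convert raw moments to central moments via the standard identities.
  mu2 <- m[2] - m[1]^2
  mu3 <- m[3] - 3 * m[1] * m[2] + 2 * m[1]^3
  mu4 <- m[4] - 4 * m[1] * m[3] + 6 * m[1]^2 * m[2] - 3 * m[1]^4
  c(mean = m[1], var = mu2, skew = mu3 / mu2^1.5, kurt = mu4 / mu2^2)
}
```

Usage: `stats_from_sums(power_sums(x) + power_sums(y))` matches computing the same population-style moments directly on `c(x, y)`.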

Travis
  • Yes — one way is to convert all your data into raw (non-central) moments, combine those moments, and then get the skewness and kurtosis for the combination. See https://stats.stackexchange.com/questions/237558/kurtosis-expressed-in-raw-moments – Matt F. Jul 05 '21 at 18:22
  • @MattF. Thanks for your comments! We know how to calculate skewness and kurtosis. However, I want to calculate them based on each sample's raw/central moments; that's the point. – Travis Jul 06 '21 at 02:02
  • Did you look at the link? – Matt F. Jul 06 '21 at 06:31
  • I don't think this question is the same as [combining two covariance matrices](https://stats.stackexchange.com/questions/51622/combining-two-covariance-matrices) and it should not be closed. I am asking for a particular method for skewness and kurtosis, and I have already finished some proofs for the sample mean and variance. That link only provides some naive solutions for some special cases, which cannot help with my specific/harder question. – Travis Jul 07 '21 at 04:58
  • We focus on calculating skewness from the subsamples' __statistics__ instead of the __original data__. See the answer below. @MattF. – Travis Jul 07 '21 at 05:05
  • For the moderators: I think they should give some reasons for closing posts, showing their concerns and understanding of the post. @Ben – Travis Jul 07 '21 at 07:24
  • @Travis: I assume that the moderator is of the view that the answer to the linked question covers the case of higher-order moments (including kurtosis). While you are correct that the linked question only asks about covariance, the answer to that question gives a broader solution that covers higher-order sample moments. In any case, I hope the answer below suffices for the present question. – Ben Jul 07 '21 at 08:59
  • @Ben I did not feel it necessary to offer any explanation because, being familiar with my own answer to the duplicate, I know that it fully covers the present case (which is a special example of combining moments). The reason I originally posted that answer was to offer a canonical reply to future questions that are just variations on the same problem, like this one. This is how we want our site to work: we prefer curating existing answers rather than allowing myriad answers to the same question to accumulate. – whuber Jul 07 '21 at 13:56
  • @whuber: Perhaps it would be worth constructing a canonical question/answer from scratch (so that the question clearly covers all relevant cases)? – Ben Jul 07 '21 at 23:30
  • @whuber If I had emphasized the last question of the post, then people would know I am asking for a theoretical proof of the "sufficient statistics of subsamples" for calculating a combined sample's particular statistics. I think it's a harder question that the other question does not cover. – Travis Jul 08 '21 at 00:45
  • That's a question of algebra which might attract some interest on [math.se]. – whuber Jul 08 '21 at 14:05

1 Answer


As a general rule, you can compute the $k$th order sample moments of a pooled sample so long as you have all the sample moments of the underlying subgroups up to order $k$. So, in order to compute the sample kurtosis of the pooled sample you need to have the sample mean, sample variance, sample skewness and sample kurtosis of the two subgroups that are to be pooled. (In your question you missed the skewness, so technically, the answer to your question is no; you can't get the pooled sample kurtosis with just the statistics you mention.)
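This rule can be carried out directly: convert each subgroup's (mean, variance, skewness, kurtosis) to raw moments, average the raw moments with weights proportional to the subgroup sizes, and convert back. A rough R illustration (the name `pool_moments` is made up here, and it uses population-style, biased moment definitions; bias-corrected conventions require the corresponding adjustments):

```r
# Pool (mean, var, skew, kurt) of two subgroups by passing through raw moments.
# Population-style definitions: skew = mu3 / mu2^1.5, kurt = mu4 / mu2^2.
pool_moments <- function(n1, s1, n2, s2) {
  # s1, s2: numeric vectors c(mean, var, skew, kurt) for each subgroup.
  to_raw <- function(s) {
    m1 <- s[1]; mu2 <- s[2]
    mu3 <- s[3] * mu2^1.5; mu4 <- s[4] * mu2^2
    m2 <- mu2 + m1^2
    m3 <- mu3 + 3 * m1 * m2 - 2 * m1^3
    m4 <- mu4 + 4 * m1 * m3 - 6 * m1^2 * m2 + 3 * m1^4
    c(m1, m2, m3, m4)
  }
  # Raw moments of the pooled sample are size-weighted averages.
  m <- (n1 * to_raw(s1) + n2 * to_raw(s2)) / (n1 + n2)
  # Convert the pooled raw moments back to central/standardized moments.
  mu2 <- m[2] - m[1]^2
  mu3 <- m[3] - 3 * m[1] * m[2] + 2 * m[1]^3
  mu4 <- m[4] - 4 * m[1] * m[3] + 6 * m[1]^2 * m[2] - 3 * m[1]^4
  c(mean = m[1], var = mu2, skew = mu3 / mu2^1.5, kurt = mu4 / mu2^2)
}
```

The round trip through raw moments is exact algebra, so the result matches computing the same moment definitions directly on the concatenated data.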

Fortunately, you needn't re-invent the wheel here, since a brilliant and devoted statistician has already done this work for you in the utilities package in R. Statistical problems of this kind are automated in the sample.decomp function in that package. This function can compute pooled sample moments from subgroup moments, or compute missing subgroup moments from the other subgroup moments and pooled moments. It works for any decomposition up to fourth order, i.e., decompositions of sample size, sample mean, sample variance/standard deviation, sample skewness, and sample kurtosis.


How to use the function: Here we give an example where we use the function to compute the sample moments of the pooled data from two subgroups. First we randomly generate some data from a fixed distribution and find the sample moments of the two subgroups.

#Generate some example data
set.seed(1)
DATA.A <- rgamma(n = 12, shape = 2, scale = 6)
DATA.B <- rgamma(n = 20, shape = 2, scale = 6)

#Compute and store the sample moments
library(utilities)
MOMENTS.A <- moments(DATA.A, include.sd = TRUE)
MOMENTS.B <- moments(DATA.B, include.sd = TRUE)

#Show the moments of dataset A
MOMENTS.A
        n sample.mean sample.sd sample.var sample.skew sample.kurt NAs
DATA.A 12    11.94847  6.946869   48.25898   0.4535123    1.738779   0

#Show the moments of dataset B
MOMENTS.B
        n sample.mean sample.sd sample.var sample.skew sample.kurt NAs
DATA.B 20    11.12291  6.301247   39.70571   0.2532274    1.731373   0

Now that we have two subgroups with known sample moments (up to fourth order), let's find the moments of the pooled sample using the sample.decomp function.

#Compute sample decomposition
N    <- c(MOMENTS.A$n, MOMENTS.B$n)
MEAN <- c(MOMENTS.A$sample.mean, MOMENTS.B$sample.mean)
VAR  <- c(MOMENTS.A$sample.var,  MOMENTS.B$sample.var)
SKEW <- c(MOMENTS.A$sample.skew, MOMENTS.B$sample.skew)
KURT <- c(MOMENTS.A$sample.kurt, MOMENTS.B$sample.kurt)
sample.decomp(n = N, sample.mean = MEAN, sample.var = VAR,
              sample.skew = SKEW, sample.kurt = KURT, 
              names = c('DATA.A', 'DATA.B'), include.sd = TRUE)

            n sample.mean sample.sd sample.var sample.skew sample.kurt
DATA.A     12    11.94847  6.946869   48.25898   0.4535123    1.738779
DATA.B     20    11.12291  6.301247   39.70571   0.2532274    1.731373
--pooled-- 32    11.43250  6.451729   41.62481   0.3535067    1.791869

We can confirm that these moments match the results from first combining the subgroups and then computing the moments of the pooled data directly.

#Combine the two subgroups
POOLED <- c(DATA.A, DATA.B)

#Compute the moments of the pooled sample
MOMENTS.POOLED <- moments(POOLED, include.sd = TRUE)

#Show the pooled moments
MOMENTS.POOLED
        n sample.mean sample.sd sample.var sample.skew sample.kurt NAs
POOLED 32     11.4325  6.451729   41.62481   0.3535067    1.791869   0
Ben
  • I really appreciate your help; that's what I want! I re-invented the wheel with the help of the [Higher-order statistics document](https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Higher-order_statistics) this morning T-T. I will use your package in several real-data scenarios. – Travis Jul 06 '21 at 08:16