Question:
Can we calculate the kurtosis and skewness of two or more combined samples, given each sample's mean, standard deviation, sample size, and kurtosis/skewness?
Let's say we have the sample size, sample mean, sample standard deviation and sample kurtosis of two samples ($\{x_1,x_2,\ldots,x_n\}$ and $\{y_1,y_2,\ldots,y_m\}$). Can we calculate the kurtosis of the combination of these two samples, $\{x_1,x_2,\ldots,x_n,y_1,\ldots,y_m\}$?
The difficulty here is that we want to calculate the skewness/kurtosis from the subsamples' statistics rather than from the original sample or subsamples. In other words, we need to get the skewness/kurtosis without touching the original data at all! This is a reasonable requirement in real industrial practice, given limits on RAM and CPU time.
Further Question:
I have proved that:
- For 2 independent samples, if we have the sample size, sample mean, and sample standard deviation of each sample, we can calculate the mean and std of the combined sample.
- For 2 independent samples ($\mathrm{pvctr_1}= \frac{\# clicks}{\# expos}=\frac{\sum_{i=1}^{n} x_{1i}}{\sum_{i=1}^{n} y_{1i}}, \mathrm{pvctr_2}= \frac{\# clicks}{\# expos}=\frac{\sum_{j=1}^{m} x_{2j}}{\sum_{j=1}^{m} y_{2j}}$), if we have the sample size ($n$ for sample 1, $m$ for sample 2) and the sample mean and sample standard deviation of the numerator and denominator for each sample, we cannot calculate the mean and std of the combined sample's $\mathrm{pvctr}$. If we also know the covariance between the numerator and denominator in each sample, then we can.
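To make the first point concrete, here is a minimal sketch of pooling two samples' size, mean and std without touching the raw data. The function name `combine_mean_std` and the `ddof` parameter are my own (hypothetical), and the sketch assumes each std was computed with the $n-1$ denominator unless `ddof=0` is passed.

```python
import math
import statistics

def combine_mean_std(n1, mean1, std1, n2, mean2, std2, ddof=1):
    """Pool (size, mean, std) of two samples into the combined sample's values.

    Assumption: both stds use the same denominator, n - ddof.
    """
    n = n1 + n2
    delta = mean2 - mean1
    mean = mean1 + delta * n2 / n
    # Recover each sample's sum of squared deviations from its own mean.
    ss1 = std1 ** 2 * (n1 - ddof)
    ss2 = std2 ** 2 * (n2 - ddof)
    # Within-sample spread plus the spread caused by the gap between the means.
    ss = ss1 + ss2 + delta ** 2 * n1 * n2 / n
    return n, mean, math.sqrt(ss / (n - ddof))

# Toy data, used only to check the pooled result against a direct computation.
x = [1.0, 2.0, 3.0, 4.0]
y = [10.0, 12.0, 14.0]
n, mean, std = combine_mean_std(
    len(x), statistics.mean(x), statistics.stdev(x),
    len(y), statistics.mean(y), statistics.stdev(y),
)
```

The pooled values should agree with `statistics.mean(x + y)` and `statistics.stdev(x + y)` up to floating-point error.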
I was wondering whether there exists any theorem that gives the sufficient sample statistics of each sample needed to calculate the sample statistics of a combined sample?
With all due respect, I think I should emphasize the last question of the post:
I was wondering whether there exists any theorem that gives the sufficient sample statistics of each sample needed to calculate the sample statistics of a combined sample?
I am asking for the most powerful solution to this kind of question, rather than a general way of thinking. I already know how to calculate the mean/std/skewness/kurtosis from the subsamples' raw/central moments up to the 4th order. But those moments contain duplicate or useless information, which means we don't need all of them to calculate the combined sample's statistics. Thus, I want to know the "sufficient statistics of the subsamples" for calculating a particular statistic of the combined sample, so that I can keep my solution extremely small and powerful.
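For reference, the moment-based approach I already know can be sketched as follows. It uses the standard pairwise update formulas for sums of powered deviations. The helper names (`summarize`, `to_sums`, `combine`) are hypothetical, and the sketch assumes population conventions: std with denominator $n$, skewness $m_3/m_2^{3/2}$, and non-excess kurtosis $m_4/m_2^2$. It also assumes neither sample is constant, since the ratios are undefined when $m_2 = 0$.

```python
import math

def summarize(data):
    """Reference values from raw data: (n, mean, population std, skewness, kurtosis)."""
    n = len(data)
    mean = sum(data) / n
    m2, m3, m4 = (sum((v - mean) ** k for v in data) / n for k in (2, 3, 4))
    return n, mean, math.sqrt(m2), m3 / m2 ** 1.5, m4 / m2 ** 2

def to_sums(n, mean, std, skew, kurt):
    """Turn (n, mean, std, skew, kurt) back into sums of powered deviations."""
    m2 = std ** 2
    return n, mean, n * m2, n * skew * m2 ** 1.5, n * kurt * m2 ** 2

def combine(a, b):
    """Combine two summaries (n, mean, std, skew, kurt) without the raw data."""
    na, mean_a, s2a, s3a, s4a = to_sums(*a)
    nb, mean_b, s2b, s3b, s4b = to_sums(*b)
    n = na + nb
    d = mean_b - mean_a
    mean = mean_a + d * nb / n
    s2 = s2a + s2b + d ** 2 * na * nb / n
    s3 = (s3a + s3b
          + d ** 3 * na * nb * (na - nb) / n ** 2
          + 3 * d * (na * s2b - nb * s2a) / n)
    s4 = (s4a + s4b
          + d ** 4 * na * nb * (na ** 2 - na * nb + nb ** 2) / n ** 3
          + 6 * d ** 2 * (na ** 2 * s2b + nb ** 2 * s2a) / n ** 2
          + 4 * d * (na * s3b - nb * s3a) / n)
    var = s2 / n
    return n, mean, math.sqrt(var), (s3 / n) / var ** 1.5, (s4 / n) / var ** 2

# Toy data, used only to compare the combined summary with a direct computation.
x = [1.0, 2.0, 4.0, 7.0]
y = [3.0, 3.5, 10.0]
combined = combine(summarize(x), summarize(y))
direct = summarize(x + y)
```

In this sketch the combined skewness never uses the subsamples' kurtosis, while the combined kurtosis does use the subsamples' skewness through the third-moment terms, so the inputs each statistic needs are not the same.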