
I have around 8 billion data points, and I need to calculate their distribution and its cumulants.

However, due to technical restrictions and time constraints, I can only calculate the cumulants for half of the data at once, but I still need the cumulants of the whole data set.

Question:

If I have two distributions and know their cumulants separately, what are the cumulants of the combined distribution in terms of the cumulants of each separate distribution?

Apart from analytical results, I would also accept approximate results.

kjetil b halvorsen
Our
  • Which cumulants are you calculating: $\kappa_1, \kappa_2, \kappa_3, \ldots, \kappa_{n-1}, \kappa_n$? – user158565 Jul 27 '19 at 16:17
  • Hopefully only skewness and kurtosis, but it would be great if someone could give a general method of how to derive the higher order cumulants. – Our Jul 27 '19 at 19:40
  • If you want skewness and kurtosis, why do you not calculate them directly, instead of calculating the cumulants and deriving them from the cumulants? – user158565 Jul 27 '19 at 19:46
  • @user158565 Do you know how much space 8 billion double-precision data points occupy in memory? – Our Jul 27 '19 at 19:49
  • Alternatively you could ask for "online" updating formulas for cumulants, analogous to [Online estimation of variance with limited memory](https://stats.stackexchange.com/questions/235129/online-estimation-of-variance-with-limited-memory?noredirect=1&lq=1), and look at the tag [tag:online]. Also see this [arXiv paper](https://arxiv.org/pdf/1701.06446.pdf), which tackles a more general problem: cumulant tensors. – kjetil b halvorsen Jul 27 '19 at 20:41
  • I think this is more of an issue of calculating a statistic for half of the data, calculating the same statistic for the other half of the data, and then "averaging" them together, than it is about anything else. Is this right? – Dave Jul 27 '19 at 20:46
  • Are moments okay, or do you need cumulants? – Akababa Jul 27 '19 at 20:54
  • @Dave Do you have an argument for why the cumulants of the combined data should be the mean of the cumulants of its parts, or did you just think intuitively that this should be the case? – Our Jul 28 '19 at 04:55
  • @kjetilbhalvorsen Thanks a lot, I will check them out. – Our Jul 28 '19 at 04:56
  • @onurcanbektas I don't literally mean adding the two together and dividing by two (though maybe that's an idea), just more the idea of "averaging" values in general, even if we don't take the arithmetic mean. (We don't take the arithmetic mean for pooled variance in t-testing, for instance, yet we're still kind of averaging the variances of the two distributions.) – Dave Jul 28 '19 at 16:45

1 Answer


Let's say $X_1, X_2$ are independent random variables drawn from the two half-distributions, and $k\sim \operatorname{Bernoulli}(0.5)$ is another independent random variable. Then you want to find the distribution of $X=kX_1+(1-k)X_2$, which is an equal-weight mixture of the two halves.

If you can use your cumulants $\alpha_i$ to approximate $\log E(e^{tX_1})\approx\sum_{i=1}^n\alpha_i\frac{t^i}{i!}$, and similarly for $X_2$ with its cumulants $\beta_i$, then you can plug those into $$\log E(e^{tX})=\log(.5E(e^{tX_1})+.5E(e^{tX_2}))$$ and differentiate the RHS at $t=0$ to get the cumulants of $X$: the $j$-th cumulant is the $j$-th derivative of $\log E(e^{tX})$ evaluated at $t=0$.
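
One way to carry out that differentiation numerically, sketched here under assumptions rather than as the definitive method: because the moment generating function of the mixture is the weighted average of the two moment generating functions, the raw moments of $X$ are the weighted averages of the raw moments of $X_1$ and $X_2$. So you can convert each set of cumulants to raw moments, average them, and convert back, using the standard recursion $m_n=\sum_{k=1}^{n}\binom{n-1}{k-1}\kappa_k m_{n-k}$. The function names and the weight parameter `w` below are hypothetical, introduced only for illustration.

```python
from math import comb


def cumulants_to_moments(kappa):
    """Raw moments m_1..m_n from cumulants kappa_1..kappa_n."""
    m = [1.0]  # m_0 = 1
    for n in range(1, len(kappa) + 1):
        m.append(sum(comb(n - 1, k - 1) * kappa[k - 1] * m[n - k]
                     for k in range(1, n + 1)))
    return m[1:]


def moments_to_cumulants(moments):
    """Cumulants kappa_1..kappa_n from raw moments m_1..m_n."""
    m = [1.0] + list(moments)
    kappa = []
    for n in range(1, len(m)):
        # Same recursion as above, solved for kappa_n.
        lower = sum(comb(n - 1, k - 1) * kappa[k - 1] * m[n - k]
                    for k in range(1, n))
        kappa.append(m[n] - lower)
    return kappa


def combine_cumulants(alpha, beta, w=0.5):
    """Cumulants of the mixture w * dist_1 + (1 - w) * dist_2.

    alpha, beta: cumulants of the two parts, same length.
    w: fraction of the data in the first part (0.5 for equal halves).
    """
    ma = cumulants_to_moments(alpha)
    mb = cumulants_to_moments(beta)
    mixed = [w * a + (1 - w) * b for a, b in zip(ma, mb)]
    return moments_to_cumulants(mixed)


# Example: one half is a point mass at 0, the other a point mass at 2.
print(combine_cumulants([0, 0, 0, 0], [2, 0, 0, 0]))
```

For the first two orders this reduces to $\kappa_1=\frac{\alpha_1+\beta_1}{2}$ and $\kappa_2=\frac{\alpha_2+\beta_2}{2}+\frac{(\alpha_1-\beta_1)^2}{4}$, so the combined cumulants are not simply the averages of the parts unless the lower-order cumulants agree. Note that the $n$-th combined cumulant needs all cumulants up to order $n$ from both halves, and that raw moments can lose floating-point precision when the mean is large relative to the spread.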

Akababa