
I'm looking to do a PCA on the count-based data itself rather than on averages. I'm hoping this will help with variable observation depths; for example, 3/4 reads is not really equivalent to 15/20. There is more confidence in the 15/20 being near 75% than in the 3/4.

Any ideas how I could do this?

Here is some example data; each site entry is the number of positive reads out of the total number of reads.

Individual, Site1, Site2, ...
Indiv1, 7/9, 4/5, ...
Indiv2, 5/11, 7/22, ...
Indiv3, 14/29, 3/5, ...
  • I would treat it as weighted PCA with cell-specific weights, see http://stats.stackexchange.com/questions/113485 – amoeba Jul 27 '16 at 11:39
  • @amoeba: could you be more specific about which would be the weights $w_i$ in the particular case described in the question ? –  Jul 27 '16 at 14:13
  • @fcop For example, $1/\hat{\sigma}_{\hat{p}}$ from your answer. Each cell in the data table (individual/site) will have its own weight corresponding to the certainty in that number. $7/9$ should have little influence on the PCA result (little weight), but $700/900$ should have large influence (large weight). – amoeba Jul 28 '16 at 14:22
  • @amoeba: so I use the $w_i$ that you mention? –  Jul 28 '16 at 14:34
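
For concreteness, here is a minimal sketch of the cell-weighted PCA idea from these comments: a low-rank approximation fitted by alternating weighted least squares, with each cell weighted by $1/\hat{\sigma}_{\hat{p}}$. The `weighted_pca` helper, the iteration count, and the small ridge term are illustrative choices, not something prescribed in the linked thread:

```python
import numpy as np

def weighted_pca(X, W, n_components=2, n_iter=200, seed=0):
    # Low-rank fit minimising sum_ij W_ij * (Xc_ij - (U @ V.T)_ij)^2
    # by alternating weighted least squares; U holds scores, V holds loadings.
    rng = np.random.default_rng(seed)
    n, m = X.shape
    mu = (W * X).sum(axis=0) / W.sum(axis=0)   # weighted column means for centring
    Xc = X - mu
    U = rng.normal(size=(n, n_components))
    V = rng.normal(size=(m, n_components))
    ridge = 1e-9 * np.eye(n_components)        # tiny ridge for numerical stability
    for _ in range(n_iter):
        for i in range(n):                     # update the scores row by row
            Wi = np.diag(W[i])
            U[i] = np.linalg.solve(V.T @ Wi @ V + ridge, V.T @ Wi @ Xc[i])
        for j in range(m):                     # update the loadings column by column
            Wj = np.diag(W[:, j])
            V[j] = np.linalg.solve(U.T @ Wj @ U + ridge, U.T @ Wj @ Xc[:, j])
    return U, V, mu

# Toy table from the question: positive reads / total reads per individual and site.
pos = np.array([[7, 4], [5, 7], [14, 3]], dtype=float)
tot = np.array([[9, 5], [11, 22], [29, 5]], dtype=float)
p = pos / tot
se = np.sqrt(np.clip(p * (1 - p), 1e-4, None) / tot)  # standard error of p-hat (floored)
W = 1.0 / se                                           # cell-specific weight = 1 / SE
scores, loadings, mu = weighted_pca(p, W, n_components=1)
```

With this weighting, a shallow observation like 7/9 pulls the fit far less than the same fraction observed at high depth would.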

1 Answer


My take on your problem

You ask how to incorporate the statistical uncertainties on the data in your table into the PCA, since 4 out of 5 carries a larger uncertainty than 40 out of 50.

The solution

Put the uncertainty into your data. I'll try to explain below.

First an assumption

You have to make an assumption, though: that your measurements (the fractions) follow a certain distribution, one that reflects the statistical uncertainties we wish to incorporate into the data.

I'd recommend the beta distribution.

The procedure

Try the following procedure:

  1. Consider each data point $p_{i}$ in your data table and determine the numbers $i_{i}$ and $n_{i}$ such that $p_{i}\equiv\frac{i_{i}}{n_{i}}$. For example, if $p_{i}=\frac{3}{4}$, then $i_{i}=3$ and $n_{i}=4$.
  2. Generate extra tables $j$, in which you draw each data point $p_{i,j}$ from a beta distribution $\texttt{B}\left(\alpha, \beta\right)$ with $\alpha=i_{i}$ and $\beta=n_{i}-i_{i}+1$ (see here for why this is so). Leave all the other data points as they were; only manipulate the $p_{i}$.
  3. Do your PCA analysis treating all your tables as "real" data.

Keep adding new "fake" data until the outcomes don't change significantly any more.
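
A minimal sketch of steps 1–3 in Python, assuming the toy table from the question; the beta parameters follow this answer ($\alpha=i_{i}$, $\beta=n_{i}-i_{i}+1$), and `n_tables = 500` is just a starting value to be increased until the output stabilises:

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy table from the question: positive reads i and total reads n per individual/site.
pos = np.array([[7, 4], [5, 7], [14, 3]], dtype=float)
tot = np.array([[9, 5], [11, 22], [29, 5]], dtype=float)

# Step 2: simulate extra tables, drawing each cell from Beta(i, n - i + 1).
# (Cells with zero positive reads would need special handling, since alpha must be > 0.)
n_tables = 500
tables = rng.beta(pos, tot - pos + 1, size=(n_tables, *pos.shape))

# Step 3: ordinary PCA on all simulated tables stacked together (SVD of the centred matrix).
X = tables.reshape(-1, pos.shape[1])
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ Vt.T                     # one row of scores per simulated row; average per individual if desired
explained_var = s**2 / (X.shape[0] - 1)
```

Rerunning with a larger `n_tables` and checking that the loadings in `Vt` stop changing appreciably is the stopping rule described above.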

Let us know if that worked for you.

Edit (2016 Sep 13)

I improved my answer to accommodate the situation amoeba sketches in the comments below.

Ytsen de Boer
  • Point 3 looks very strange to me. It's like saying if you have 10 numbers and the standard error of the mean is too large, just replicate them 100 times to get 1000 numbers and your standard error of the mean will decrease. This would clearly be nonsense. – amoeba Jul 27 '16 at 11:35
  • On the contrary, amoeba: this way you force the standard error with respect to the mean to stay _larger_, to incorporate the statistical uncertainty from your counts. – Ytsen de Boer Jul 27 '16 at 11:45
  • Wait, do you mean that if you take 10 numbers and replicate them 100 times, then the SEM of the resulting 1000 numbers will be larger than SEM of the original 10 numbers? – amoeba Jul 27 '16 at 11:46
  • Do not replicate the same numbers. The trick is to draw your new numbers from the appropriate statistical distribution. It is the uncertainty from that distribution, which you are then forcing into the analysis. If you would not do that, then 4 out of 5 will be treated the same as 400 out of 500. – Ytsen de Boer Jul 27 '16 at 12:00
  • I see now. I still don't like your approach though, because 4/5 will get "replicated" into 3/5 and 5/5 a lot of times, i.e. a lot of variance will be created around the 80% mean, whereas 400/500 will stay 80% with almost no variance "added". PCA is analyzing variance, hence I am not sure this will yield meaningful results. – amoeba Jul 27 '16 at 12:11
  • 1
    There _should_ be a lot more variance in the 4/5 case than in the 400/500 case, since you _could_ have obtained those numbers (3/5 or 5/5 just as well (given the Binomial distribution) contrary to 300/500 or 500/500. This Monte Carlo approach lets you augment your data artificially to force that variance onto your data. If you take enough simulated data points, the PCA should be _prevented_ from picking up correlations which are introduced exactly because you picked 4 out of 5 in stead of 3 out of 5. – Ytsen de Boer Jul 27 '16 at 12:26
  • 1
    Thanks, I see what you mean. This makes sense. My only comment then is that I would use a beta-binomial distribution instead of binomial one. E.g. if you get 5/5, then you should not just generate 5/5 all the time, because there is quite some uncertainty about this really being 100%... – amoeba Jul 27 '16 at 12:34
  • I think that could speed up your Monte Carlo simulation, but note that it would also lead to data points which you could _never_ have obtained in real life. It probably won't matter a lot for the conclusions you will reach from your PCA analysis, though. Good luck. – Ytsen de Boer Jul 27 '16 at 12:42
  • It's not *my* analysis; I am not the one who asked the question... – amoeba Jul 27 '16 at 13:09
  • I missed your earlier point about the 5/5. Good point. – Ytsen de Boer Sep 08 '16 at 19:37