
I have a large metabolomics dataset: 6000 samples and 3300 features. The only thing that differentiates the samples is that a different gene was knocked out in each, which will not affect most of the metabolites. The features are metabolite concentrations.

There are some known/measured batch and technical variables, such as different mass spec runs and differential growth of the bacteria in the samples. However, I also want to adjust for unknown variables.

It has been suggested that I perform PCA and then throw out the first few principal components. However, I'm not sure how to use the PCA output to predict values for the original features from the remaining principal components.

df.pca <- prcomp(as.matrix(df.rw),
                 center = TRUE,
                 scale. = TRUE)

df.pca.x.minus12 <- df.pca$x[, -c(1, 2)]

How can I predict the values of the original 3300 features after removing the first 2 components?
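For reference, here is a minimal sketch of one way this could be done (assuming `df.rw` is a samples × features matrix; `df.recon` is an illustrative name): reconstruct the data from all principal components except the first two, then undo the centering and scaling that `prcomp` applied.

## Sketch: rebuild the features from all PCs except the first two
keep <- 3:ncol(df.pca$x)
df.recon <- df.pca$x[, keep] %*% t(df.pca$rotation[, keep])

## Undo the scaling and centering applied by prcomp()
df.recon <- sweep(df.recon, 2, df.pca$scale, "*")
df.recon <- sweep(df.recon, 2, df.pca$center, "+")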

  • The first (accepted) answer here http://stats.stackexchange.com/questions/57467/ explains how to reconstruct original features in R using a small number of leading principal components. You need just a tiny modification: instead of using the leading principal components, you use all *except* the leading ones. – amoeba Sep 21 '15 at 14:32
  • Thanks, I had actually read that answer earlier but still couldn't quite follow. Reading it a second time, it's now clearer. – user2814482 Sep 21 '15 at 19:33
  • `df.denoised – amoeba Sep 23 '15 at 10:42
  • @amoeba I appreciate that the modification is in some sense "small" but in another sense it is asking quite the reverse. I think overall that this question benefits from having an answer in its own right, but certainly it is good for the new thread to be strongly linked to it. – Silverfish Aug 12 '16 at 12:11
  • @Silverfish I tried to explicitly cover this situation in my new thread but perhaps I failed. Keeping the leading PCs and discarding the rest, or discarding the leading PCs and keeping the rest, is conceptually *exactly the same thing*! Do you think I can make it more explicit in my new answer so that it could work as a duplicate? Any advice welcome. – amoeba Aug 12 '16 at 12:14
  • @Silverfish Note that I put "how can one remove or discard several principal components from the data?" explicitly in my question in the linked thread... Perhaps the answer is not doing it justice though. And this Q has not gotten a good answer since 2015. – amoeba Aug 12 '16 at 12:15
  • @Silverfish I edited that answer to be a little bit more explicit about this aspect. I am wondering if I can make a title general enough to refer explicitly to both aspects, but it gets too long... – amoeba Aug 12 '16 at 12:19
  • @Silverfish This Q ended up closed with gung's dupehammer, but I would still very much appreciate any suggestions on how that thread can be improved. – amoeba Aug 12 '16 at 12:34
  • @amoeba I think closing as a duplicate is not unreasonable. Naturally I agree about the concepts at play being essentially unified, but I often think that when considering dupe-closing we should bear in mind that to the uninitiated, or those working on a very "practical" level (who may tend to disregard a lot of the underlying theory), such underlying, fundamental unity may be less obvious than it is to the thread-closers. – Silverfish Aug 12 '16 at 14:37
  • For what it's worth, perhaps it would be worth including a small example in the other thread about *why* one might remove leading PCs, and a pictorial example (similar to the image reconstruction you have based on discarding the PCs that explain little variance). I guess my preferred solution might have been to have your "big" thread serve as an overview thread (which may mention this issue in passing), and a thread elsewhere that links to the big thread but isn't closed as a dupe of it, that considers the discarding of leading PCs in more detail. – Silverfish Aug 12 '16 at 14:39
  • (Incidentally, I'm not sure this thread would be a great example anyway, if there were to be such a canonical thread it might be something with more practical/illustrative data for instance.) – Silverfish Aug 12 '16 at 14:40

1 Answer


You might want to take a look at Surrogate Variable Analysis (SVA) to estimate unknown batch effects (link).

The author of the method (J. Leek) is also currently teaching a MOOC, "Statistics for Genomic Data Science".
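A minimal sketch of how this could look with the Bioconductor sva package (the `pheno` and `knockout` objects are illustrative placeholders, not from the question):

library(sva)

## Sketch only: 'pheno' (sample annotations) and 'knockout' are illustrative names
dat  <- t(as.matrix(df.rw))                     # sva expects a features x samples matrix
mod  <- model.matrix(~ knockout, data = pheno)  # full model: variable of interest
mod0 <- model.matrix(~ 1, data = pheno)         # null model: intercept only

n.sv  <- num.sv(dat, mod, method = "be")        # estimate the number of surrogate variables
svobj <- sva(dat, mod, mod0, n.sv = n.sv)       # svobj$sv holds the surrogate variables

The columns of `svobj$sv` can then be included as covariates in downstream models of the metabolite concentrations, alongside the known batch variables.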