3

Using factor analysis or PCA to estimate and remove hidden latent variables from high-dimensional data is a very useful way to reduce confounding, noise, measurement error, and batch effects.

However, what if the variable of interest is itself highly correlated with many of the factors/PCs? Is there anything you can do?

What are the statistical limitations to removing hidden factors?

David shaw

1 Answer

4

The question (and, largely, the topic itself) seemed interesting enough to warrant some brief research. Please note that most of the information below comes from bioinformatics; I believe it is generally applicable, but I'm not fluent in that domain's subject matter. There appear to be various approaches and methods for detecting and eliminating batch effects and other unwanted effects or noise. Including some bioinformatics-specific ones, they are (Chen, Grennan, Badner, Zhang, Gershon et al., 2011):

  • Distance-weighted discrimination (DWD), based on support vector machines (SVM);
  • Mean-centering (PAMR), based on one-way ANOVA;
  • Surrogate variable analysis (SVA), based on a combination of singular value decomposition (SVD) and a linear model analysis;
  • Geometric ratio-based method (Ratio_G);
  • Combating Batch Effects When Combining Batches of Gene Expression Microarray Data (ComBat), based on an empirical Bayes method;
  • Singular value decomposition (SVD);
  • Standardization (location/scale adjustment model);
  • Ratio-based method with arithmetic mean (Ratio_A).

Note that the last three methods in the list above are excluded from the study by Chen et al. (2011); I include them here for the sake of completeness. Approaches to detecting and removing systematic variation, as well as several other bioinformatics-focused methods, are also discussed by Li, Łabaj, Zumbo, Sykacek, Shi, Shi et al. (2014).
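To make the simplest of these ideas concrete, here is a minimal sketch of per-batch mean-centering in Python with NumPy. It is only an illustration of the general idea, not the PAMR implementation evaluated by Chen et al.; the function name `mean_center_by_batch` and the simulated data are my own assumptions.

```python
import numpy as np

def mean_center_by_batch(X, batch):
    """Subtract each batch's per-feature mean from the samples in that batch.

    X     : (n_samples, n_features) array of measurements
    batch : (n_samples,) array of batch labels (assumed to be known)
    """
    X = np.asarray(X, dtype=float)
    batch = np.asarray(batch)
    corrected = X.copy()
    for b in np.unique(batch):
        rows = batch == b
        corrected[rows] -= X[rows].mean(axis=0)
    return corrected

# Hypothetical toy data: two batches of 10 samples, batch 1 shifted by +3.
rng = np.random.default_rng(0)
batch = np.repeat([0, 1], 10)
X = rng.normal(size=(20, 5)) + 3.0 * batch[:, None]

X_centered = mean_center_by_batch(X, batch)
```

After this step every feature has mean zero within each batch. If the variable of interest is unevenly distributed across batches, part of its effect is subtracted along with the batch shift, which is the concern raised in the question.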

As for software that could be used for the task, various packages are available, many of them within the Bioconductor R project ecosystem. One of the most popular seems to be the sva package (Surrogate Variable Analysis). Its vignette contains a detailed description of the functionality, with examples. It also briefly covers other complementary functions, such as the above-mentioned ComBat and svaseq. The latter is described in more detail in this paper (too specific, hence no citation).
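To illustrate the trade-off the question is about, here is a rough NumPy sketch comparing two strategies on simulated data: blindly removing the top PC, versus estimating the factor from the residuals after regressing out the variable of interest and removing only that. This is a heavily simplified illustration in the spirit of surrogate variable approaches, not the actual algorithm of the sva package; all function names, the simulation, and the choice of k = 1 are assumptions of mine.

```python
import numpy as np

def top_left_singular_vectors(M, k):
    """Return the first k left singular vectors of M as orthonormal columns."""
    U, _, _ = np.linalg.svd(M, full_matrices=False)
    return U[:, :k]

def project_onto_design(X, design):
    """Least-squares fit of every feature on the design matrix."""
    return design @ np.linalg.pinv(design) @ X

def remove_top_pcs(X, k):
    """Naive correction: project the top-k PCs of the centered data out of X."""
    Xc = X - X.mean(axis=0)
    U = top_left_singular_vectors(Xc, k)
    return X - U @ (U.T @ Xc)

def remove_factors_protecting_design(X, design, k):
    """Estimate k factors from the residuals of X ~ design, remove only those."""
    R = X - project_onto_design(X, design)   # variation the design cannot explain
    S = top_left_singular_vectors(R, k)      # estimated factors, orthogonal to design
    return X - S @ (S.T @ X)

# Simulated data: y is the variable of interest, `hidden` is correlated with it.
rng = np.random.default_rng(0)
n, p = 40, 200
y = np.repeat([0.0, 1.0], n // 2)
hidden = y + 0.5 * rng.normal(size=n)
effect = np.zeros(p)
effect[:20] = 2.0                            # y truly affects the first 20 features
loadings = rng.normal(size=p)                # hidden factor affects all features
X = np.outer(y, effect) + np.outer(hidden, loadings) + rng.normal(size=(n, p))
design = np.column_stack([np.ones(n), y])

X_naive = remove_top_pcs(X, k=1)
X_protected = remove_factors_protecting_design(X, design, k=1)

fit_original = project_onto_design(X, design)
kept_protected = np.allclose(project_onto_design(X_protected, design), fit_original)
kept_naive = np.allclose(project_onto_design(X_naive, design), fit_original)
print(kept_protected, kept_naive)
```

The protected version leaves the fit on the variable of interest unchanged by construction, because the estimated factors are orthogonal to the design. The naive version generally alters that fit when the hidden factor is correlated with y. The price of the protection is that whatever part of the hidden factor is collinear with y stays in the data and is attributed to y; without extra information (such as negative controls), that part cannot be told apart from a true effect.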

References

Chen, C., Grennan, K., Badner, J., Zhang, D., Gershon, E., et al. (2011). Removing batch effects in analysis of expression microarray data: An evaluation of six batch adjustment methods. PLoS ONE, 6(2): e17238. doi:10.1371/journal.pone.0017238 Retrieved from http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0017238

Li, S., Łabaj, P. P., Zumbo, P., Sykacek, P., Shi, W., Shi, L., ..., & Mason, C. E. (2014). Detecting and correcting systematic variation in large-scale RNA sequencing data. Nature Biotechnology, 32, 888–895. doi:10.1038/nbt.3000 Retrieved from http://www.nature.com/nbt/journal/v32/n9/full/nbt.3000.html

Aleksandr Blekh
  • 2
    Depending on whether you have positive/negative control probes, you can use RUV (http://www.stat.berkeley.edu/~johann/ruv/). Otherwise, SVA is the next best bet, but you cannot guarantee that you won't remove true variation together with your unwanted variation. – purple51 Mar 26 '15 at 01:38
  • @purple51: Thank you for the comment (and vote). RUV (RUV2), along with other methods, is used in the second study that I've referenced above. Point on lack of guarantees taken. However, I'm curious about whether it's a method-specific aspect or a general statistical one. – Aleksandr Blekh Mar 26 '15 at 01:52
  • 2
    RUV assumes that you have some kind of control data that allows you to estimate the batch effects, for example, some non-human probes that are guaranteed not to match any human sequence, such that any difference in measured levels for these probes can definitely be assigned to batch effects and not true biological variation (these are negative controls). SVA, at least the original version, doesn't use that information but tries to infer that from the data itself, which is the best you can do when there are no negative controls. – purple51 Mar 26 '15 at 02:07
  • @purple51: Thanks for the clarification. But my question was whether it is, in general, theoretically (in a statistical sense) possible to separate true effects from systematic ones. – Aleksandr Blekh Mar 26 '15 at 02:22
  • 1
    In the sense of consistency, as the sample size goes to infinity, etc.? Not sure; there probably are papers out there giving theoretical guarantees under assumptions on sample size, number of variables, limits on the correlations between the variables, etc. – purple51 Mar 26 '15 at 02:23
  • @purple51: Any other dependencies beyond sample size? – Aleksandr Blekh Mar 26 '15 at 02:24
  • @purple51: Thank you. I will try to find some sources. – Aleksandr Blekh Mar 26 '15 at 02:31
  • 1
    Thanks for the excellent answer. However, I would emphasise that I'm aware of a lot of these analysis methods; my focus is on methods that do not remove variation that is also attributable to the variable of interest, and on whether that is even possible. – David shaw Mar 26 '15 at 13:18
  • 2
    @Davidshaw: You're welcome. Thank you for the kind words - feel free to upvote my answer if you feel it's still helpful. Note that I said "upvote", not "accept", which should be reserved for **direct and comprehensive** answers addressing the main problem (I haven't _positioned_ my answer as such). I will post an update if I come across any information reflecting your question's focus. – Aleksandr Blekh Mar 26 '15 at 14:25