Can anyone explain the theoretical consequences of a traditional variance-stabilizing transformation for the Poisson, such as the square root, versus "projection to a normal distribution", and the pros and cons of each? I am familiar with the traditional square-root transformation, but I came across this "projection to normality" in a paper (details below). After digging through their code I understand how they actually perform the transform:
1. Calculate mu and sigma for the data vector.
2. Convert the data vector to its percentiles (I am guessing from the ECDF somehow).
3. Use the inverse CDF of the normal with that mu and sigma to transform the percentiles into normal variates.
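For reference, the traditional transformation I have in mind is just taking the square root of the counts. A quick R sketch (my own illustration, not from the paper) of why it stabilizes the variance at roughly 1/4:

# If X ~ Poisson(lambda), Var(sqrt(X)) is approximately 1/4 for
# moderately large lambda, regardless of lambda.
set.seed(1)
lambdas <- c(5, 20, 100)
sapply(lambdas, function(l) {
  x <- rpois(1e5, l)
  var(sqrt(x))   # should be close to 0.25 for each lambda
})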
The paper is from PLOS Computational Biology. They use a glasso-type approach to model gene expression networks from RNA-seq data, which are typically counts. Specifically, in the methods section they say: "Normalization of Data For each read count ni in each sample, we computed the normalized read count ri = log2(2 + C ⋅ ni/n) ........ Because GMRFs are designed for Gaussian data, we projected all samples for each transcript for each tissue onto a Gaussian with variance 1."
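Their normalization step, as I read it, would be something like this in R (C and n are whatever the paper defines them to be; their definitions fall in the part of the quote I elided, so here they are just arguments):

# ri = log2(2 + C * ni / n) for each read count ni in a sample;
# C and n are taken as given (defined in the paper, elided above).
normalize_counts <- function(ni, C, n) {
  log2(2 + C * ni / n)
}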
MATLAB code from the paper for the transformation:
function v = gaussianProject(x)
%GAUSSIANPROJECT Projects x onto a Gaussian v.
%   percentile() is a helper from the paper's code; it appears to return
%   the empirical CDF value (percentile) of each element of x.
p = percentile(x);
p(p == 1) = .99;      % otherwise these values get sent to infinity by norminv
mu = mean(x);
sigma = std(x);
if sigma == 0         % guard against constant vectors
    sigma = 1;
end
v = norminv(p, mu, sigma);   % inverse normal CDF evaluated at each percentile
end
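Since I plan to work in R, here is my attempt at an equivalent (assuming the paper's percentile() helper returns empirical CDF values; the function name below is mine):

# My R interpretation of gaussianProject(); assumes percentile() in the
# MATLAB code returns the empirical CDF value of each element of x.
gaussian_project <- function(x) {
  p <- ecdf(x)(x)             # empirical CDF evaluated at each observation
  p[p == 1] <- 0.99           # avoid mapping the maximum to +Inf
  mu <- mean(x)
  sigma <- sd(x)
  if (sigma == 0) sigma <- 1  # guard against constant vectors
  qnorm(p, mean = mu, sd = sigma)
}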
I am planning to run some simulations in R to see how the two approaches compare, but I would be really grateful if anyone could offer a theoretical explanation.
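Roughly the kind of simulation I have in mind (a minimal sketch using my gaussian_project() interpretation above; all names are mine, not the paper's):

# Compare the square-root VST with the quantile-based projection
# on simulated Poisson counts.
set.seed(42)
x <- rpois(1000, lambda = 10)

# 1) Traditional variance-stabilizing transform
v_sqrt <- sqrt(x)

# 2) Quantile projection onto a normal, as in gaussian_project() above
p <- ecdf(x)(x)
p[p == 1] <- 0.99
v_proj <- qnorm(p, mean = mean(x), sd = sd(x))

# Crude comparisons: variance and a normality test
var(v_sqrt)                    # should be near 0.25
shapiro.test(v_sqrt)$p.value
shapiro.test(v_proj)$p.value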