What statistical comparison would I use with data that is sparse, but normally distributed at a low frequency?

Question

Please forgive me if this is a stupid question; I am a novice with larger data sets and statistics in general. I have a 50 x 429 matrix of data that reflects "signal" obtained from a peptide array experiment: 429 peptides probed with plasma from 50 people. The signal reflects how much antibody each person has against that peptide. Unfortunately, out of the 50, only 4 are controls and 46 are patients, so the experiment is very unbalanced.

As I understand, we should normalize the data in some manner before trying any comparisons, especially because each array was done separately. The company we worked with provided a dataset that is adjusted using variance stabilizing normalization (VSN) with some modification to the standard method. Below are histograms to show the distribution of the overall data.

In both cases I cut the y-axis off at 500 - the first bin has a frequency of over 19000 in either case. Ignoring the first bin, VSN signal seems to be pretty normal.

Essentially what I'd like to answer is if there are any peptides that have different signal between the 46 patients and 4 healthy controls. I am not sure what test(s) to use. If I can assume normality of my VSN data, despite the huge frequency of the "0.5" bin, a Welch's t test gives me the best result for one peptide that I am confident shows a signal consistently.
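For reference, here is a minimal sketch of what Welch's t test computes for a single peptide. It is written in Python with only the standard library for illustration; in R the same test is the default behaviour of `t.test`. The function name `welch_t` and the signal values are made up, not taken from the experiment.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(a, b):
    """Return (t, df) for Welch's unequal-variance t test of mean(a) - mean(b)."""
    na, nb = len(a), len(b)
    va, vb = variance(a) / na, variance(b) / nb  # squared standard error of each mean
    if va + vb == 0:
        # Both groups are constant (e.g. every value sits at the floor): t is undefined
        return float("nan"), float("nan")
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    # Welch-Satterthwaite approximation to the degrees of freedom
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df

# Hypothetical signals for one peptide (made-up numbers)
controls = [0.5, 0.6, 0.5, 0.7]
patients = [1.9, 2.4, 0.5, 3.1, 2.2, 0.5]
t, df = welch_t(patients, controls)
print(f"t = {t:.3f}, df = {df:.2f}")
```

With only 4 controls, the control variance rests on 4 values, and a peptide where every control sits at the floor value has zero control variance, which is why the sketch guards against a zero denominator.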

If I cannot make the assumption of normality, I presume I would use a Mann-Whitney test to compare these groups. This does not give me any p-values that make sense.
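As a rough illustration of what the Mann-Whitney statistic measures, the sketch below counts, over all patient/control pairs, how often the patient value is larger, with ties counted as one half. It is plain Python for illustration (in R the test is `wilcox.test`); the function name `mann_whitney_u` and the data are hypothetical.

```python
def mann_whitney_u(a, b):
    """U statistic for a versus b: each pair with a > b counts 1, each tie counts 0.5."""
    return sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in a for y in b)

# Hypothetical peptide where most values sit at the floor of 0.5 (made-up numbers)
controls = [0.5, 0.5, 0.5, 0.6]
patients = [0.5, 0.5, 0.5, 0.5, 2.1, 3.0]

u = mann_whitney_u(patients, controls)
n_pairs = len(patients) * len(controls)
# u / n_pairs estimates P(patient > control) + 0.5 * P(tie); 0.5 means no separation
print(u, n_pairs, u / n_pairs)
```

In this made-up example two patients are clearly elevated, yet most pairs are ties at the floor, so the statistic stays close to the no-difference value of one half. That is one plausible reason, not something the data here confirm, for a rank test to look weak when most values share the "0.5" bin.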

I then adjust the list of 429 p-values I get for false-discovery rate (FDR). The t test still gives me results that make sense; the Mann-Whitney results get worse.
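The post does not say which FDR procedure was used. Assuming it was the common Benjamini-Hochberg adjustment (what `p.adjust(p, method = "BH")` does in R), a minimal Python sketch of the calculation looks like this. The function name `bh_adjust` and the p-values are made up.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values never decrease with rank
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

raw = [0.0001, 0.004, 0.03, 0.04, 0.5]  # made-up raw p-values
for p, q in zip(raw, bh_adjust(raw)):
    print(f"raw = {p:.4f}, adjusted = {q:.4f}")
```

Under this procedure, if only a single peptide stands out among 429 tests, its raw p-value has to be at or below 0.05 / 429 (roughly 0.00012) to stay under an adjusted threshold of 0.05.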

After this point I'd also like to compare sub-groups, for example comparing patients treated with one drug or the other, or not treated, or healthy. I am comfortable using GraphPad for comparisons like this with simple data - I'd likely end up using a one-way ANOVA with a multiple comparison test. I have tried to replicate this in R, but am unsure whether I can use the standard linear model (lm function) followed by an ANOVA (Anova function from the car package) followed by multiple comparisons using the glht function from the multcomp package. Most of my hesitation comes from being unable to determine what assumptions about my data I can carry over to these functions.

Any advice or help would be very much appreciated!

If I were in your position, I would not use the VSN signal unless I fully understood the transformation and its implications. — Rodolphe, Jun 06 '20 at 01:31
If you wish to study each peptide, you will need to fit one model per peptide. You are screening your 429 peptides, over 50 people, in search of a signal. So do not adjust for false discovery rate, or else you will most likely find nothing: your tests will be too stringent on the p-value while having quite low power because of the sample size. However, if you detect something, keep in mind that the risk of false discovery is quite high, so that would warrant further exploration or experiments to confirm the findings. — Rodolphe, Jun 06 '20 at 01:42
About normality. Normality should be looked for in the residuals of a model, not on the raw dependent variables. So, here, for peptide 1, fit a model signal as a function of groups, with the control being the base group for easier reading of the results. And plot the residuals of this model vs the predicted values. There should be no discernible structure in the residuals. — Rodolphe, Jun 06 '20 at 01:46
Thank you very much Rodolphe, and please excuse the late reply. You make a good point about not using the VSN signal without knowing how it was transformed; I will use another transformation. I will try to balance the benefit and risk of any FDR p-value adjustment. Regarding normality, is there a more user-friendly way of assessing the residuals (I use R), so I wouldn't have to make a subjective decision based on 429 plots? Finally, depending on the structure of my residuals, what tests would I use? — Maltesers, Jun 16 '20 at 22:57
To automate the normality check on your residuals, look at the Shapiro-Wilk test. It is available in R. — Rodolphe, Jun 16 '20 at 23:20
Another approach, on reflection, could be mixed-effects linear regression, possibly generalized (if you think your residuals are not normally distributed), with the peptide signals as 429 fixed-effect predictors, array ID as a random effect, and group as the predicted variable. The objective would then be to identify, through any model selection procedure, which fixed effects (i.e. which peptides) are useful for predicting the group of the individual. — Rodolphe, Jun 17 '20 at 07:36
This is somewhat related, in its final objectives, to principal component analysis and is therefore more a kind of descriptive analysis of your data. — Rodolphe, Jun 17 '20 at 07:38
Does this thread: https://stats.stackexchange.com/questions/97098/practically-speaking-how-do-people-handle-anova-when-the-data-doesnt-quite-mee/103446 address what you suggest? Specifically, the part about generalized least squares (gls function from package nlme)? This worked very nicely with respect to peptides that, by eye, should be sensibly significant. — Maltesers, Jun 23 '20 at 18:16
Not really. First, I suggested the "generalized" part only as a possibility, if and only if you have reasons to believe your data (more precisely, the residuals) are not normally distributed. So this is absolutely not mandatory. Second, it does not address the "random effects" part, which I strongly believe you should take into consideration: the array random effects or, equivalently in your case, the person random effects. Look for linear mixed-effects models. — Rodolphe, Jun 23 '20 at 21:53
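To make the residual check from the comments concrete: for one peptide modelled as signal against group, the fitted value for each person is simply their group mean, and the residual is the observed signal minus that mean. Below is a minimal Python sketch with made-up data and a hypothetical function name `group_residuals`; in R the same quantities come from `fitted()` and `residuals()` applied to `lm(signal ~ group)`, and `shapiro.test` can then be run on the residuals.

```python
from statistics import mean

def group_residuals(signal, group):
    """Fitted values and residuals for a one-factor model (signal as a function of group)."""
    # For a single categorical predictor, least squares gives each group its own mean
    means = {g: mean([s for s, lab in zip(signal, group) if lab == g]) for g in set(group)}
    fitted = [means[g] for g in group]
    residuals = [s - f for s, f in zip(signal, fitted)]
    return fitted, residuals

# Hypothetical signals for one peptide (made-up numbers)
signal = [0.5, 0.6, 0.5, 0.7, 1.9, 2.4, 0.5, 3.1, 2.2, 0.5]
group = ["control"] * 4 + ["patient"] * 6

fitted, residuals = group_residuals(signal, group)
for g, f, r in zip(group, fitted, residuals):
    print(f"{g:8s} fitted = {f:.3f} residual = {r:+.3f}")
```

Plotting `residuals` against `fitted` for this kind of model gives one vertical strip per group, so the thing to look for is whether the spread differs sharply between strips or whether the residuals pile up at one value because of the floor.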
