2

I have some biological data points collected from individual cells. In my field I often see that people average the data points per cell, and then run a t-test using these averages as input. This means the n in each group equals the number of cells from which you collected and averaged data points. For example, you end up with 5 averages in the control group (from 5 cells) and 5 averages in the treated group (another 5 cells), and then these 10 averages are compared using the t-test.

But in my case, the data from each cell (hundreds of data points per cell) are not normally distributed, and as I understand it the median will then represent the data more accurately. I used the normality tests in GraphPad Prism (Anderson-Darling, D'Agostino & Pearson, Shapiro-Wilk, Kolmogorov-Smirnov); all indicated that the data were not normally distributed.

Is it then allowed to use these medians (one from each cell) as input for a t-test?

Many thanks for your help!

[edit for some additional details]

Holioneok
  • You indicate that you have data points *within each cell*, but that your *n* is the total number of cells. Do you average the *within-cell* data points, and then average over the total number of cells? Maybe show us a subset of your data, if possible. Also, what criteria did you use to determine the data were *not* normally distributed? I often hear this, but in practice it is rare to observe perfectly symmetrical distributions. – Thomas Bilach Jun 04 '20 at 19:20
  • Thanks for your answer! Yes, or in this case I took the median of the within-cell data points. But usually people indeed average, and then compare the means of the average values (using the t-test). I used the normality tests in GraphPad Prism (Anderson-Darling, D'Agostino & Pearson, Shapiro-Wilk, Kolmogorov-Smirnov); all indicated that the data were not normally distributed. One difference I forgot to mention is that in my case there are hundreds of data points per cell; usually people have far fewer than that and then average those. – Holioneok Jun 04 '20 at 19:39

3 Answers

3

You say you have large numbers of data points per cell. In that case the cell medians should be approximately normally distributed. (See simulation below.) So you could run a two-sample t test on cell medians to see if Control and Treatment groups differ. I haven't seen your data, but it would probably also be OK to use cell means because you have so many data points per cell.
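A minimal sketch of that workflow in R, using simulated data since I haven't seen yours. The data frame `dat`, its column names, the gamma parameters, and the cell-to-cell spread in `cell_rate` are all assumptions made up for illustration:

```r
# Hypothetical layout: 5 control and 5 treated cells, 300 skewed points per cell.
set.seed(1)
n_cells <- 5
n_pts   <- 300

# Assumed cell-to-cell variation: each cell gets its own gamma rate.
cell_rate <- c(runif(n_cells, 0.9, 1.1),   # control cells
               runif(n_cells, 0.7, 0.9))   # treated cells

dat <- data.frame(
  group = rep(c("control", "treated"), each = n_cells * n_pts),
  cell  = rep(seq_len(2 * n_cells), each = n_pts),
  value = rgamma(2 * n_cells * n_pts, shape = 2,
                 rate = rep(cell_rate, each = n_pts))
)

# Reduce each cell to a single summary value (here the median).
cell_med <- aggregate(value ~ group + cell, data = dat, FUN = median)

# Welch two-sample t test on the 5 + 5 cell medians.
res <- t.test(value ~ group, data = cell_med)
res$p.value
```

With real data, `dat` would hold one row per measurement. The test still sees only five values per group, so each cell counts once regardless of how many points it contains.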

In principle, there is nothing to stop you from running a nonparametric Mann-Whitney-Wilcoxon test as @Parnian suggests, but with only five cells each in Treatment and Control groups, that may well be a futile exercise. You are near the absolute lower limit of sample sizes for which the MWW test is useful.

For example, for the rank-based MWW test, if you had only four cells in each group, then all of the Treatment cells would need to have greater 'averages' than any of the Control cells (or vice versa) to get a significant result. There are only ${8 \choose 4} = 70$ possible arrangements of ranks, the most extreme two of which correspond to complete separation of the values in the two groups; $2/70 = 0.029,$ so it is possible to get a significant P-value. But as soon as there is any overlap at all, the smallest possible P-value becomes greater than $0.05.$
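The counting argument above can be checked with a few lines of R. This is only a sketch; the helper name `min_p` is made up for illustration, and the rank vectors are arbitrary tie-free examples:

```r
# Smallest possible two-sided exact MWW P-value with n values per group:
# 2 of the choose(2n, n) rank arrangements give complete separation.
min_p <- function(n) 2 / choose(2 * n, n)

p_sep4 <- min_p(4)   # four per group: 2/70
p_sep5 <- min_p(5)   # five per group: 2/252

# Exact test, complete separation of ranks 1..8.
p_complete <- wilcox.test(1:4, 5:8)$p.value

# One overlapping pair (5 > 4): the next most extreme arrangement, 4/70.
p_overlap <- wilcox.test(c(1, 2, 3, 5), c(4, 6, 7, 8))$p.value

c(p_sep4 = p_sep4, p_sep5 = p_sep5,
  p_complete = p_complete, p_overlap = p_overlap)
```

With five values per group the floor is lower ($2/252$), so a little overlap can still be significant, but the test remains coarse at these sample sizes.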

Also, here is an example of an MWW test with four Treatment values and four Control values that is not significant at the 5% level. By contrast, a t test does find a significant difference at that level.

 wilcox.test(c(10, 20, 30, 40), c(38, 48, 58, 68))$p.val
 [1] 0.05714286

 t.test(c(10, 20, 30, 40), c(38, 48, 58, 68))$p.val
 [1] 0.02201958

Simulation: CLT for mean and median. Finally, suppose you have samples of size 500 from the skewed distribution $\mathsf{Gamma}(2, 1).$ By the Central Limit Theorem, means of such samples will be nearly normal. But there is also a CLT for medians. Here is a simulation using means a and medians h of $100\,000$ samples of size $n=500$ from this distribution. The medians are a little more variable, but normal nevertheless.

set.seed(604);  m = 10^5;  n = 500
x = rgamma(m*n, 2, 1)
DTA = matrix(x, nrow=m)  # each row of matrix is sample
a = rowMeans(DTA);  h = apply(DTA,1,median)
par(mfrow=c(1,3))
 curve(dgamma(x,2,1), 0,10, col="blue", lwd=2, ylab="PDF", 
       main="Density of GAMMA(2,1)")
   abline(v=0,col="green2"); abline(h=0,col="green2")
 hist(a, prob=T, br=30, col="skyblue2", 
      main="n=500: Sample Means")
 hist(h, prob=T, br=30, col="skyblue2", 
      main="n=500: Sample Medians")
par(mfrow=c(1,1))

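To put a rough number on "a little more variable": for a density $f$ with median $\eta$, the sample median is asymptotically normal with standard deviation $1/(2 f(\eta)\sqrt{n})$, while the sample mean has standard deviation $\sigma/\sqrt{n}$. The sketch below evaluates both for $\mathsf{Gamma}(2, 1)$ and compares them with a smaller simulation (2000 samples, an arbitrary choice to keep it quick). The variable names are my own.

```r
n <- 500

# Theoretical standard errors for Gamma(shape = 2, rate = 1).
eta   <- qgamma(0.5, shape = 2, rate = 1)   # population median
f_eta <- dgamma(eta, shape = 2, rate = 1)   # density at the median
se_mean   <- sqrt(2 / n)                    # variance of Gamma(2, 1) is 2
se_median <- 1 / (2 * f_eta * sqrt(n))      # asymptotic SE of sample median

# Quick simulation check: 2000 samples of size n, one sample per row.
set.seed(2020)
sim <- matrix(rgamma(2000 * n, shape = 2, rate = 1), nrow = 2000)
sd_mean_sim   <- sd(rowMeans(sim))
sd_median_sim <- sd(apply(sim, 1, median))

round(c(se_mean = se_mean, sd_mean_sim = sd_mean_sim,
        se_median = se_median, sd_median_sim = sd_median_sim), 4)
```

The simulated standard deviations should land close to the theoretical ones, with the median's somewhat larger than the mean's for this distribution.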

BruceET
  • Many thanks for your extensive reply. Indeed, if I test my population of medians they are normally distributed just like you show in your example! And I suppose the t-test is blind to the data that is used as input, be it means, medians, percentages or what have you, so then it should be okay to compare the cell medians between the groups. – Holioneok Jun 05 '20 at 12:17
1

If your data are not normally distributed and that is why you are using the median instead of the mean to compare the groups, you can use the Mann-Whitney test to compare the two groups.

Parnian
  • 127
  • 5
  • Thanks for your answer! Do you mean that one can take the medians of the individual cells, and then use a M-W test to compare the medians? So you end up with e.g. 5 medians in the control group (from 5 cells) and 5 medians from the treated group (another 5 cells), and then you compare these 10 medians with each other using the M-W test? – Holioneok Jun 04 '20 at 19:41
  • yes you can use it that way too – Parnian Jun 05 '20 at 08:58
0

Here is a paper about using the t-test with extremely small sample sizes: "Using the Student's t-test with extremely small sample sizes". On the normality requirement, here is something I found on Wikipedia:

For exactness, the t-test and Z-test require normality of the sample means, and the t-test additionally requires that the sample variance follows a scaled $\chi^2$ distribution, and that the sample mean and sample variance be statistically independent. The normality of the individual data values is not required if these conditions are met.

Notice that the sample mean in the above text refers to the average of the cell means in your case. Let $C_i$ denote the mean value of cell $i$, where $i$ only ranges from 1 to 5. With so few cells, the number of cells alone can't guarantee that $\bar{C}$ has a normal distribution. It would be easier if the $C_i$ were i.i.d. normal. When you calculate the mean value of each cell (assuming the samples in each cell across the entire group are from the same population), the $C_i$ are approximately i.i.d. normal by the CLT. Although the median can follow a CLT in some cases (please see "Central limit theorem for sample medians" for details), using the mean works in the most general setting.

Tbone
  • 79
  • 2