
I have two groups, A and B, each consisting of 5 samples. Each sample is described by a vector of more than 1000 continuous numeric values (characteristics). I want to test whether the samples in the first group vary among themselves more than the samples in the second group do.

I ran a one-way ANOVA on each group independently and calculated the F-test statistic for each group. Can I now compare the two statistics directly (so that if statistic 1 > statistic 2, I say the first group has more variance between its samples than the second group)? Or do I need to perform a statistical significance test to compare the two statistics?

Pitouille
Abbas
  • Have you tried two-way `ANOVA`? – SAAN Aug 03 '13 at 03:23
  • +1 to two-way as the way forward. Note: Various problems of terminology here. I have edited fairly strongly and assumed that by "scores" you mean "statistics". – Nick Cox Aug 03 '13 at 07:57
  • Thanks @NickCox for the answer and the accurate edit. As far as I know, the purpose of two-way ANOVA is to find out whether data from two groups have a common mean. However, I want to show that the samples in one group differ from each other more than the samples in the other group do. So I think two-way ANOVA may not work. I performed one-way ANOVA on each group independently to show that the samples within each group differ from each other, and then used the F-statistics to compare the group 1 difference to the group 2 difference. What do you think? – Abbas Aug 03 '13 at 15:50
  • @Abbas If you think two-way `ANOVA` is not appropriate (I believe it is), then generate a dummy variable for the two groups and run a regression; the resulting `ANOVA` will fulfil your requirement. Your approach does not seem good to me. – SAAN Aug 03 '13 at 16:05
  • What you describe sounds like an interaction. A two-way ANOVA would still be the way to go but you would simply not be primarily interested in the main effects, only in the interaction term. – Gala Aug 03 '13 at 17:05
  • The question is unclear. Your data are in two 5 x 1000+ matrices: one for group A, one for group B. In each matrix, rows correspond to samples and columns correspond to characteristics. Are all the characteristics in the same (or comparable) units? What variances are you talking about: down each column, giving you 1000+ variances for each group; or across each row, giving you 5 variances for each group? – Ray Koopman Aug 04 '13 at 06:35
  • Thanks @RayKoopman. Yes, all characteristics are in comparable units. I am talking about the variances down each column of the 5 x 1000+ matrix (so I have 1000+ variances for each group). In my own layout I have two matrices, each of size 1000+ x 5, one matrix per group. Columns represent individuals (samples) and rows represent genes, so the values are the "expression rates" of genes in each sample. I normalize the values (mean=0, std=1) row-wise across both matrices. I want to show that the "expression profiles" (i.e. samples) of the first group vary among themselves more than the samples of the other group do. – Abbas Aug 04 '13 at 16:03
  • @Abbas So you have a 1000+ x 10 matrix, with cols 1--5 for samples from group A and cols 6--10 for samples from group B. Standardizing within each row won't change the ratio of the variance of the A-values to the variance of the B-values within each row. You might consider a simple t-test on the mean of the 1000+ logs of the ratios of the two variances. Why did you standardize? It might help if you could link to some info about "expression rates": are they proportions, or rates in some other sense? (I assume they can't be negative.) – Ray Koopman Aug 05 '13 at 09:19
  • +1 for the t-test. Thanks @RayKoopman. I think I need to standardize because some genes are highly expressed in group A (e.g. 70,90,80,60,75) but less expressed in group B (e.g. 1,9,3,10,2). Although this gene is relatively more variable in group B, its variance is higher in A. So I standardize across all 10 samples to remove this effect. When I applied the t-test with standardization, I got p-value=0 (rejection of the null hypothesis); without standardization I got p-value=0.23 (failure to reject the null hypothesis). What do you think? One more thing: is this test more useful than the rank-sum test? – Abbas Aug 06 '13 at 03:14
  • @Abbas Since standardizing across all 10 samples divides both within-sample SDs by the overall SD, the ratio of the within-sample SDs should not change. Did you do the t-test on the logs of those ratios? – Ray Koopman Aug 06 '13 at 16:25
  • @RayKoopman. Oh, sorry, I forgot to take the log of the ratios. When I did a paired-sample t-test on the two vectors of variances, I got p-value=0; however, the t-test on the vector of logs of the variance ratios gives NaN (not a number) in Matlab. I think it means that there is no alternative hypothesis to reject against the null hypothesis. – Abbas Aug 06 '13 at 17:28
  • @Abbas Are some of the variances zero? How did you get the expression rates? How precise are they? How much did you round them before getting their variance? – Ray Koopman Aug 07 '13 at 01:41
  • @RayKoopman The zero variances in the first group make up less than 1% of the vector length; in the second group they make up 2.3% of the vector length. The expression rates are obtained from next-generation sequencing tools, and they are precise and validated by multiple quality assurance tools. I did not round them. I used the raw counts (e.g. 12, 15, 30, 90, etc.), which tell how much a gene is expressed in a sample. I only standardized the values gene-wise. – Abbas Aug 07 '13 at 15:01
  • @Abbas So before standardizing, the numbers in each row are raw counts, not proportions? Is there any upper limit to the counts, and if so are they close enough to it to matter, or can we interpret them as being in [0,$\infty$)? – Ray Koopman Aug 07 '13 at 15:26
  • @RayKoopman. Yes, the numbers are raw counts, not proportions. (Some people in the domain convert the raw counts to proportions; while that is valid, I did not do so.) In theory, there is no upper limit to the expression rate of each gene. In my matrix, the maximum is 29236, the median of the values is around 13, and the mean is around 63. – Abbas Aug 08 '13 at 18:28
  • @Abbas If the counts have Poisson distributions then the variances should be proportional to the means, so the variance differences you are seeing would not be surprising. Have other people treated the counts as having Poisson distributions (or any particular other distributions)? – Ray Koopman Aug 09 '13 at 00:21
  • @Abbas A question I should have asked earlier: You have on the order of 10 genes with zero-variance counts in one group and 23 in the other. What are the corresponding counts? – Ray Koopman Aug 10 '13 at 07:11
  • Possibly relevant: "The difference between significant and non-significant is not significant." http://www.stat.columbia.edu/~gelman/research/published/signif4.pdf – jona Jun 27 '14 at 10:04
  • "if the sample in the first group varies between each other" what does it mean if the samples vary between each other? How is this 'varies' quantified? And what is the underlying assumed distribution? – Sextus Empiricus Oct 01 '21 at 14:49
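
The test suggested in the comments above (a t-test on the per-gene logs of the ratios of the two variances) can be sketched roughly as follows. This is only an illustration on simulated data, not the asker's data or code: the function names `log_variance_ratio_test` and `standardize`, the simulated values, and the choice to skip genes with a zero variance are all assumptions. Skipping those genes avoids taking the log of zero or dividing by zero, which is one plausible source of the NaN reported above. The p-value uses a normal approximation to the t distribution, which seems reasonable only because there are 1000+ genes, and the test treats genes as independent, which may not hold for expression data.

```python
import math
import random
from statistics import NormalDist, mean, stdev, variance


def log_variance_ratio_test(group_a, group_b):
    """One-sample t-test on log(var_A / var_B) across genes.

    group_a, group_b: lists of rows, one row of sample values per gene.
    Genes with a zero variance in either group are skipped, because
    their log ratio is undefined.
    """
    log_ratios = []
    skipped = 0
    for row_a, row_b in zip(group_a, group_b):
        var_a, var_b = variance(row_a), variance(row_b)
        if var_a == 0 or var_b == 0:
            skipped += 1
            continue
        log_ratios.append(math.log(var_a / var_b))

    n = len(log_ratios)
    t_stat = mean(log_ratios) / (stdev(log_ratios) / math.sqrt(n))
    # Assumption: with 1000+ genes the t distribution is close to normal,
    # so a normal approximation gives the two-sided p-value.
    p_value = 2 * (1 - NormalDist().cdf(abs(t_stat)))
    return t_stat, p_value, n, skipped


def standardize(row):
    """Scale a row to mean 0 and standard deviation 1."""
    m, s = mean(row), stdev(row)
    return [(x - m) / s for x in row]


# Simulated data (made up): group A is built to be more variable than B.
random.seed(0)
n_genes = 1000
group_a = [[random.gauss(50, 2) for _ in range(5)] for _ in range(n_genes)]
group_b = [[random.gauss(50, 1) for _ in range(5)] for _ in range(n_genes)]
# One extra gene with identical counts in group B: zero variance, skipped.
group_a.append([12, 15, 30, 90, 20])
group_b.append([3, 3, 3, 3, 3])

t_stat, p_value, n_used, n_skipped = log_variance_ratio_test(group_a, group_b)
print(f"t = {t_stat:.2f}, p = {p_value:.3g}, used = {n_used}, skipped = {n_skipped}")

# Standardizing a row across all 10 samples rescales both variances by the
# same factor, so the variance ratio is unchanged (up to rounding).
row = group_a[0] + group_b[0]
z = standardize(row)
raw_ratio = variance(row[:5]) / variance(row[5:])
std_ratio = variance(z[:5]) / variance(z[5:])
print(math.isclose(raw_ratio, std_ratio, rel_tol=1e-9))
```

The last lines check the point made in the comments that row-wise standardization across all 10 samples should not change the ratio of the two within-group variances, so a result that differs with and without standardization suggests the test was run on something other than the log ratios.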

1 Answer

Your model may be:

response variable=dummy variable(Category A=1 or B=0) + Samples(1,2,3,4,5)

Run the regression with the fixed factors as independent variables, and examine the effect of the dummy, the effect of the samples, and their combined effect, as your question requires.

SAAN
  • Thanks @Azeem. I am wondering how this regression model can help answer the question above. Should I fit a regression model for each group and compare the response variables obtained from the model for each variable (e.g. plot them)? And how can "Samples(1,2,3,4,5)" be formulated in the regression model? More clarification would be of great help. – Abbas Aug 03 '13 at 18:01
  • If you have several response variables, then it is a multivariate regression model (Category $A$ having five samples does not mean five variables). And do not think "shall I generate a regression model for each group", because that strongly affects the precision of the model. You must have one data file with columns for the response variable and for the classifying variables. The response variable can occupy distinct columns, and the classifying variables (I think you have two, category and samples) label each value: e.g. the first response value belongs to category $A=1$ and sample=1, the second value to category $A=1$ and sample=2. – SAAN Aug 04 '13 at 03:07
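
The single data file described in the last comment might be laid out as in the sketch below. This is an assumption about what is meant, not the answerer's own code: the function name `to_long_format`, the field names, and the convention that the first five columns are group A are all hypothetical. The first toy row reuses the example values quoted in the question's comments; the second row is made up.

```python
def to_long_format(matrix, n_per_group=5):
    """Reshape a genes x samples matrix into long-format records.

    Columns 0..n_per_group-1 are assumed to be group A (category = 1),
    the remaining columns group B (category = 0).
    """
    records = []
    for gene, row in enumerate(matrix):
        for col, value in enumerate(row):
            records.append({
                "gene": gene,
                "category": 1 if col < n_per_group else 0,  # dummy variable
                "sample": col % n_per_group + 1,            # 1..5 within group
                "response": value,
            })
    return records


# Toy 2-gene matrix: columns 1-5 are group A, columns 6-10 are group B.
toy = [
    [70, 90, 80, 60, 75, 1, 9, 3, 10, 2],
    [12, 15, 30, 90, 20, 11, 14, 13, 12, 15],
]
long_rows = to_long_format(toy)
for record in long_rows[:3]:
    print(record)
```

Each record then carries one response value together with its category dummy and sample index, which is the shape most regression routines expect for a model with a group dummy and a sample factor.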