Data analysis: how to compare 2 variables from each subject across 3 groups of 200 subjects?

Question

New here and not a statistician, am MD doing research. Need advice on best method for data analysis.

I have 3 groups of 200 subjects each. We collected 2 data points (A,B) from each subject. Baseline group is (A,B), Group 1 is (Ax,B), and Group 2 is (A,Bx), where x is the study condition. Note A is typically larger than B.

Question: How to best compare A and B for each subject? We considered A/B, B/A, (A-B)/A, (A-B)/B, (A-B)/(A*B). The results for A/B perfectly fit our prediction, so of course we like that one! But would like a more solid reason for choosing an analysis method to make sure our conclusion is real.

It would help if you could say more about what the A and B values represent. Presumably they are numeric, but are they integers or on a continuous scale, what are their ranges, are they necessarily non-negative, are their distributions skewed? — EdM, Jul 30 '15 at 18:24
@EdM Thanks for your response. The numbers are non-negative integers. A ranges from 0.104 - 71.183 and B ranges from 0.098 - 75.272. For each of the 3 conditions, skewness of A ranges from 2.2 - 4.0 and skewness of B ranges from 2.9 - 6.2. — skapl, Jul 31 '15 at 03:47

score 1 · Accepted Answer · edited Apr 13 '17 at 12:44

From your description of the data, with a wide range and positive skew, it seems that analysis of variance or linear modeling of your raw data values will pose problems. Many standard statistical tests work best if the residual errors (after accounting for the treatment effects and so forth) are normally distributed; that seems highly unlikely in your case.

For these types of data, you will probably be better off if you take logarithms of your readings before you begin your statistical analysis. This is often done to get a more reliable handle on data like these; see this page as an introduction to discussion about this topic. That log transformation will often lead to a better statistical model.

Note that $\log(A/B) = \log(A) - \log(B)$. So if you do analysis of variance or other linear modeling of your log-transformed data, the differences among data values (here, log-transformed) that are used in those analyses translate pretty directly into analyses of $A/B$ ratios.
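As a rough illustration of that point, here is a minimal sketch using only the Python standard library. The data are simulated: the log-normal parameters, the group shifts, and the helper names `simulate_group` and `one_way_f` are all assumptions made up for this example, not values from the study. It computes a per-subject log ratio as $\log(A) - \log(B)$ and then a one-way ANOVA $F$ statistic across the three groups.

```python
import math
import random

random.seed(0)  # fixed seed so the sketch is reproducible


def simulate_group(n, shift_a=0.0, shift_b=0.0):
    """Return n hypothetical (A, B) pairs with positive, right-skewed values."""
    pairs = []
    for _ in range(n):
        a = math.exp(random.gauss(1.0 + shift_a, 0.8))
        b = math.exp(random.gauss(0.7 + shift_b, 0.8))
        pairs.append((a, b))
    return pairs


# Three made-up groups of 200 subjects; the shifts stand in for condition "x".
groups = {
    "baseline": simulate_group(200),
    "group1": simulate_group(200, shift_a=0.3),
    "group2": simulate_group(200, shift_b=0.3),
}

# Per-subject log ratio, computed as a difference of logs.
log_ratios = {
    name: [math.log(a) - math.log(b) for a, b in pairs]
    for name, pairs in groups.items()
}

# Check the identity log(A/B) = log(A) - log(B) on every simulated subject.
identity_holds = all(
    math.isclose(math.log(a / b), math.log(a) - math.log(b), abs_tol=1e-12)
    for pairs in groups.values()
    for a, b in pairs
)


def one_way_f(samples):
    """One-way ANOVA F statistic: between-group over within-group mean square."""
    all_values = [v for s in samples for v in s]
    grand_mean = sum(all_values) / len(all_values)
    means = [sum(s) / len(s) for s in samples]
    ss_between = sum(len(s) * (m - grand_mean) ** 2 for s, m in zip(samples, means))
    ss_within = sum((v - m) ** 2 for s, m in zip(samples, means) for v in s)
    df_between = len(samples) - 1
    df_within = len(all_values) - len(samples)
    return (ss_between / df_between) / (ss_within / df_within)


f_stat = one_way_f(list(log_ratios.values()))
print("identity holds:", identity_holds)
print("F statistic on log ratios:", round(f_stat, 2))
```

In practice a statistics package would also supply the $p$-value and residual diagnostics; the manual $F$ computation here is only meant to show where the log ratios enter the analysis.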

You do have to be careful in the way you present your results. The mean values of log-transformed data represent geometric means in the original scale. The difference between 2 log-values for an individual is not the log of their difference on the original data scale; it is the log of their ratio. You will have to decide how best to present the results in tabular or graphical form, and whether to present the means in the log scale or to back-transform into the original scale. If you back-transform, confidence intervals will no longer be symmetric about the (back-transformed) mean value.
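To make the back-transformation concrete, here is a small sketch with made-up ratio values (the list `ratios` is purely illustrative, not study data). It uses a normal-approximation interval for simplicity, which is an assumption of this sketch; with a sample this small a $t$ quantile would be more appropriate.

```python
import math
import statistics

# Hypothetical A/B ratios for a handful of subjects (illustrative values only).
ratios = [0.4, 0.9, 1.3, 1.8, 2.5, 3.1, 4.7, 9.2]

# Work on the log scale: mean and standard error of the log ratios.
logs = [math.log(r) for r in ratios]
mean_log = statistics.mean(logs)
se_log = statistics.stdev(logs) / math.sqrt(len(logs))

# 95% interval on the log scale (normal approximation, see note above).
z = statistics.NormalDist().inv_cdf(0.975)
low_log = mean_log - z * se_log
high_log = mean_log + z * se_log

# Back-transform: exp(mean of logs) is the geometric mean of the ratios.
geo_mean = math.exp(mean_log)
ci_low = math.exp(low_log)
ci_high = math.exp(high_log)

print("arithmetic mean:", statistics.mean(ratios))
print("geometric mean: ", geo_mean)
print("distance below: ", geo_mean - ci_low)
print("distance above: ", ci_high - geo_mean)
```

The interval is symmetric around the mean on the log scale, but after exponentiating, the upper limit sits farther from the geometric mean than the lower limit does.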

If interpretation of your data in terms of $A/B$ ratios makes sense from your knowledge of the subject matter, however, standard statistical analyses based on the logarithms of your data could work well.


answered Jul 31 '15 at 14:16

EdM


Thanks so much. We may use logs, but even in that case, I'm not sure if log(A/B) is more appropriate than log(B/A). Even with simple numbers, such as (1,2,3,4,5,6,7,8) vs (1,1/2,1/3,1/4,1/5,1/6,1/7,1/8), the means and SDs are different. In our case, it seems that the decision of which value to put in the denominator is arbitrary. I'm hoping there is a rational (pun intended) reason to put a smaller number in the denominator. – skapl Jul 31 '15 at 14:56
Just using raw data, when I use a ratio A/B, my values range from 0.065 to 9.237, skewness 0.9 - 3.7. Using B/A, my values range from 0.108 to 15.492, skewness 1.5 - 11.1. If I take just the 1st - 99th percentiles to remove a few outliers, skewness for A/B drops to 0.7 - 1.6 and skewness for B/A drops to 0.4 - 2.9. We hoped that by taking a ratio of the raw data before attempting any stats, we could reduce some of the range of the distribution. I see how logs could be helpful in any case, since we are working with a mixture of fractions and integers. – skapl Jul 31 '15 at 15:08
Work on the logs and the problems with skewness of ratios on the original scale will probably go away. $\log(B/A) = -\log(A/B)$, so it doesn't really matter which way you do the analysis on log values from a statistical standpoint; _p_-values and all will be the same. Consider instead which makes most sense from the standpoint of the subject matter and explaining to your audience. Don't throw away "outliers" unless you know the data were collected in error; on the log scale, outliers often end up being much less of a problem. – EdM Jul 31 '15 at 15:15
In the example of your first comment, if you do the log transform _first_ you will get the same SD for both series, and the mean of the second series (on the log scale) is simply the negative of the mean of the first series (on the log scale), as expected because $\log(A) = -\log(1/A)$. – EdM Jul 31 '15 at 15:24
Thanks, this is tremendously helpful. I see from other discussions on the board that this is an extremely basic question, and I appreciate you taking time to answer! – skapl Jul 31 '15 at 15:32
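The (1,2,...,8) versus (1,1/2,...,1/8) example from the comments can be checked directly. This is a minimal sketch using only the Python standard library; the variable names are chosen for illustration.

```python
import math
import statistics

# The two series from the comment thread: a set of values and their reciprocals.
series = [1, 2, 3, 4, 5, 6, 7, 8]
reciprocals = [1 / x for x in series]

# On the raw scale the two series have different means and SDs.
raw_sd_series = statistics.stdev(series)
raw_sd_reciprocals = statistics.stdev(reciprocals)

# Take logs first, then summarize.
log_series = [math.log(x) for x in series]
log_reciprocals = [math.log(x) for x in reciprocals]

log_mean_series = statistics.mean(log_series)
log_mean_reciprocals = statistics.mean(log_reciprocals)
log_sd_series = statistics.stdev(log_series)
log_sd_reciprocals = statistics.stdev(log_reciprocals)

print("raw SDs:  ", raw_sd_series, raw_sd_reciprocals)
print("log means:", log_mean_series, log_mean_reciprocals)
print("log SDs:  ", log_sd_series, log_sd_reciprocals)
```

On the log scale the two means differ only in sign and the SDs agree up to floating-point rounding, which is why the choice of numerator and denominator does not affect tests carried out on the logs.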
