How to determine the correct analysis from two variables?

Question

I'm trying to determine the correct form of analysis to find out whether people who drink one can of cola per day have different amounts of acne from those who don't. I am struggling to figure out whether I need a bivariate correlational analysis and, if so, which type: Pearson or Spearman?

Any input would be greatly appreciated! Thanks!

BruceET · Answer 1 · 2019-10-25T23:41:12.110

According to your description, I can find no basis for use of correlation in this study. Presumably, you have $n_1$ cola drinkers and $n_2$ non-drinkers of cola in the study. Also, you might have numerical or ordinal categorical scores for degree of acne.

Welch t test of numerical data. Suppose $n_1 = 100, n_2 = 150$ and that acne scores are numerical and reasonably close to normal (e.g., not heavily right-skewed with many far outliers to the right). Then you might use a Welch 2-sample t-test.

For example, here are some continuous data, sampled using R, that are close enough to normal that a two-sample t test can give useful results. [Results from R.]

x1 = rgamma(100, 50, 5);  x2 = rgamma(150, 50, 4.5)
t.test(x1, x2)

        Welch Two Sample t-test

data:  x1 and x2
t = -7.2884, df = 216.31, p-value = 5.807e-12
alternative hypothesis: 
  true difference in means is not equal to 0
95 percent confidence interval:
 -1.7352022 -0.9964768
sample estimates:
mean of x mean of y 
 9.957286 11.323125

The P-value very near 0 shows a highly significant difference between the groups. Whether a difference in scores between 9.96 and 11.32 is of actual practical importance might be a question for those who treat acne to determine. [Notice that my fake data show higher acne scores among non-drinkers of cola.]
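
As a rough sketch of what the Welch test computes, the t statistic and its approximate (Satterthwaite) degrees of freedom can be reproduced by hand and compared with `t.test`. The helper name `welch_by_hand`, the vectors `a` and `b`, and the seed are illustrative choices that are not part of the analysis above, so the numbers will differ from the output shown there.

```r
# Hypothetical helper: Welch t statistic and Satterthwaite df computed by hand
welch_by_hand = function(a, b) {
  se2a = var(a)/length(a);  se2b = var(b)/length(b)   # squared standard errors of the means
  t  = (mean(a) - mean(b)) / sqrt(se2a + se2b)
  df = (se2a + se2b)^2 / (se2a^2/(length(a) - 1) + se2b^2/(length(b) - 1))
  c(t = t, df = df)
}

set.seed(2019)    # arbitrary seed, used only to make this sketch reproducible
a = rgamma(100, 50, 5);  b = rgamma(150, 50, 4.5)   # fresh fake data, not x1 and x2 above
by.hand  = welch_by_hand(a, b)
built.in = t.test(a, b)
by.hand
c(built.in$statistic, built.in$parameter)   # should agree with by.hand
```

Because the variances are not pooled, the degrees of freedom are generally not an integer, which is why the output above reports df = 216.31.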

Here are box plots summarizing scores in the two groups. The 'notches' in the sides of the boxes are nonparametric confidence intervals calibrated for comparing two groups; notches that do not overlap indicate a significant difference in group medians.

boxplot(x1, x2, col="skyblue2", pch=10, notch=T, names=c("Cola", "No cola"))

Wilcoxon test for ordinal data. If acne scores are ordinal (0 = none, 1 = minimal, 2 = moderate, 3 = severe), then use an implementation of the 2-sample Wilcoxon test that is programmed to handle ties.

Using artificial data, here is a demonstration of how the Wilcoxon test would work.

y1 = sample(0:3, 100, replace=TRUE, prob = c(1,1,2,3))
y2 = sample(0:3, 150, replace=TRUE, prob = c(1,2,2,2))
table(y1)
y1
  0  1  2  3 
 11 13 31 45            # counts out of 100
table(y1)/100
y1
   0    1    2    3 
 0.11 0.13 0.31 0.45    # proportions
table(y2)
y2
  0  1  2  3 
 26 39 38 47                       # counts out of 150
table(y2)/150
y2
        0         1         2         3 
0.1733333 0.2600000 0.2533333 0.3133333  # proportions

For these data, the non-cola group has a greater proportion of 0's and a smaller proportion of 3's.

wilcox.test(y1, y2)

        Wilcoxon rank sum test with continuity correction

data:  y1 and y2
W = 9031, p-value = 0.004324
alternative hypothesis: 
   true location shift is not equal to 0

With these data, the Wilcoxon test shows a significant difference at the 0.5% level.
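
To see how ties enter the computation, here is a small sketch that rebuilds the two samples from the tabulated counts above and computes W from average ranks (midranks). The names `z1`, `z2`, `r`, and `W` are introduced only for this illustration. The order of the observations does not matter for W, so the reconstructed samples give the same statistic as the original `y1` and `y2`.

```r
# Rebuild the two samples from the tabulated counts shown above
z1 = rep(0:3, times = c(11, 13, 31, 45))   # cola group, n = 100
z2 = rep(0:3, times = c(26, 39, 38, 47))   # non-cola group, n = 150

r = rank(c(z1, z2))               # tied values share the average rank (midrank)
W = sum(r[1:100]) - 100*101/2     # rank sum of the first group minus its minimum possible value
W                                 # 9031, as in the output above
wilcox.test(z1, z2)$statistic     # should match W
```

With only four distinct scores there are many ties, so the P-value comes from a normal approximation with a tie correction rather than from the exact null distribution.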

Chi-squared test, treating data as nominal. A chi-squared test based on a 'contingency table' of counts looks at the proportions in each categorical level. The null hypothesis is that the proportions are the same for the two groups. This test ignores the ordinal character of the data, recognizing the categories only as nominal. Even so, the test is significant at the 5% level for my fake data.

You can look in a statistics text or search online to see how the expected counts and the chi-squared statistic are computed.

chisq.out = chisq.test(cbind(c(11,13,31,45), c(26,39,38,47)))
chisq.out

        Pearson's Chi-squared test

data:  cbind(c(11, 13, 31, 45), c(26, 39, 38, 47))
X-squared = 10.244, df = 3, p-value = 0.0166

chisq.out$obs
     [,1] [,2]
[1,]   11   26
[2,]   13   39
[3,]   31   38
[4,]   45   47
chisq.out$exp
     [,1] [,2]
[1,] 14.8 22.2
[2,] 20.8 31.2
[3,] 27.6 41.4
[4,] 36.8 55.2
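
The expected counts and the statistic printed above can be reproduced by hand; this is a minimal sketch, with `obs`, `exp.ct`, `chi.sq`, `df`, and `p.val` being names chosen just for the illustration. Each expected count is (row total × column total) / grand total, and the statistic sums (observed − expected)² / expected over all eight cells.

```r
# Observed counts: rows are acne levels 0-3; columns are cola, non-cola
obs = cbind(c(11, 13, 31, 45), c(26, 39, 38, 47))

exp.ct = outer(rowSums(obs), colSums(obs)) / sum(obs)   # expected counts under the null hypothesis
chi.sq = sum((obs - exp.ct)^2 / exp.ct)                 # Pearson chi-squared statistic
df     = (nrow(obs) - 1) * (ncol(obs) - 1)              # (4 - 1)*(2 - 1) = 3
p.val  = pchisq(chi.sq, df, lower.tail = FALSE)         # upper-tail probability

exp.ct
round(c(chi.sq, df, p.val), 4)
```

All expected counts here are well above 5, so the chi-squared approximation should be reasonable for these data.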

If you have further questions, please provide sample sizes and summary statistics, along with any misgivings about the possibilities above.
