Statistical line comparison

Question

I have a dataset like the one in this question, i.e.,

interval    mean  Drug    lower   upper
14  0.004   a   0.002   0.205
30  0.022   a   0.001   0.101
60  0.13    a   0.061   0.23
90  0.22    a   0.14    0.34
180 0.25    a   0.17    0.35
365 0.31    a   0.23    0.41
14  0.84    b   0.59    1.19
30  0.85    b   0.66    1.084
60  0.94    b   0.75    1.17
90  0.83    b   0.68    1.01
180 1.28    b   1.09    1.51
365 1.58    b   1.38    1.82
14  1.90    c   0.9     4.27
30  2.91    c   1.47    6.29
60  2.57    c   1.52    4.55
90  2.05    c   1.31    3.27
180 2.422   c   1.596   3.769
365 2.83    c   1.93    4.26
14  0.29    d   0.04    1.18
30  0.09    d   0.01    0.29
60  0.39    d   0.17    0.82
90  0.39    d   0.20    0.7
180 0.37    d   0.22    0.59
365 0.34    d   0.21    0.53

You can see a good graphical representation in the top answer on the linked thread. Let's assume that upper = mean + 1 standard deviation and lower = mean - 1 standard deviation. The means and standard deviations were computed over a fixed number of trials (say, $n=20$) at each interval for each Drug.

My question is, how do I get p-values for the overall superiority of say drug C to drug A or drug B to drug D? What is the correct statistical procedure here and how can it be implemented?

More information is needed here. Is `interval` the number of days since administration of the drug, or something? Are these repeated measurements from the same subjects, or independent samples? The simplest answer will be to use an ANOVA, and maybe repeated measures ANOVA, depending on the design. – Eoin, Sep 07 '20 at 15:44
@Eoin We should be able to say that the samples along `interval` are independent. (That is, a different group of people is inspected after 14 days than after 30 days.) Each of these 'inspections' consisted of 20 trials at each 'interval' for each model, based on a random train/test split per trial (my actual use case isn't medicinal drugs...), and that's how the means and standard deviations were calculated. – Mobeus Zoom, Sep 07 '20 at 16:30
OK. So you have the raw data from which the means were calculated? And finally, the raw values for each trial are values on some linear scale (e.g. some medical test score)? Or is `OR` an odds ratio, or similar? I'm going to assume the former, and start writing an answer. Please correct me if it isn't! – Eoin, Sep 07 '20 at 16:40
@Eoin Yes, the raw values are on a linear scale from 0-100 (e.g., "patients who survive", if that's not too morbid). I don't have the raw data, only the means, standard deviations, and the sample sizes (20). – Mobeus Zoom, Sep 07 '20 at 16:47

Eoin · Answer 1 · 2020-09-07T17:03:02.917

Assuming that you have access to the values from each individual trial, the simplest model here is a two-way drug (a, b, c, or d) × interval (14, 30, 60, 90, 180, or 365) ANOVA.

# Two-way ANOVA: both predictors are treated as categorical factors
m = lm(score ~ factor(interval) * factor(drug), data=your_data)
anova(m)

This will tell you if a) there's a main effect of drug (indicating that some drugs are better than others), b) there's a main effect of interval (some intervals have higher scores than others), and c) there's a drug × interval interaction (the difference between the drugs varies depending on the interval).
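As a rough illustration of what those three tests look like in practice, here is a minimal sketch on synthetic data. Everything in it is an assumption made for illustration: the data frame `sim`, the per-drug effects in `drug_effect`, and the noise level are invented, and `interval` is treated as a categorical factor.

```r
# Synthetic data only: 4 drugs x 6 intervals x 20 trials per cell
set.seed(1)
sim <- expand.grid(trial    = 1:20,
                   interval = c(14, 30, 60, 90, 180, 365),
                   drug     = c("a", "b", "c", "d"))

# Hypothetical per-drug mean scores (not estimated from any real data)
drug_effect <- c(a = 0.2, b = 1.0, c = 2.5, d = 0.3)
sim$score <- unname(drug_effect[as.character(sim$drug)]) +
  rnorm(nrow(sim), sd = 0.5)

# Two-way ANOVA with interaction, as in the model above
m_sim <- lm(score ~ factor(interval) * factor(drug), data = sim)
tab <- anova(m_sim)
print(tab)
```

The first three rows of the printed table correspond to the interval main effect, the drug main effect, and the drug × interval interaction; the `Pr(>F)` column holds the p-values.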

If you do find a main effect of drug, you may want to explore various post hoc tests, for instance, testing whether there's a significant difference between drugs a and b. The simplest way to do this is to repeat the analysis on a subset of the data.

# Post hoc comparison: keep only drugs a and b, then refit the same model
data_aVb = dplyr::filter(your_data, drug %in% c('a', 'b'))
m_aVb = lm(score ~ factor(interval) * drug, data=data_aVb)
anova(m_aVb)

You'll also want to read about correcting for multiple comparisons, but I won't go into that here.
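For orientation only, here is a minimal sketch of one common approach, using base R's `p.adjust()`. The p-values in `p_raw` are hypothetical placeholders standing in for the six pairwise drug comparisons, not results from this dataset.

```r
# Hypothetical raw p-values from six pairwise comparisons (illustrative only)
p_raw <- c(a_vs_b = 0.001, a_vs_c = 0.0004, a_vs_d = 0.20,
           b_vs_c = 0.03,  b_vs_d = 0.01,   c_vs_d = 0.002)

# Holm's step-down correction controls the family-wise error rate
p_holm <- p.adjust(p_raw, method = "holm")

# Bonferroni is simpler but never less conservative than Holm
p_bonf <- p.adjust(p_raw, method = "bonferroni")

print(cbind(p_raw, p_holm, p_bonf))
```

Each comparison would then be judged against the chosen significance level using its adjusted p-value rather than the raw one.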

Update!

Since your data are actually proportions, this needs a slight adjustment. Standard ANOVA is a version of linear regression, and assumes the outcome is on a linear scale. What you actually have is a proportion, indicating that $y$ out of 20 patients survived (or similar). You can deal with this by using logistic regression instead of linear regression, as follows (assuming that `survived` is the number of patients, out of 20, who survived):

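# glm() expects cbind(successes, failures), so the second column is 20 - survived
m = glm(cbind(survived, 20 - survived) ~ factor(interval) * factor(drug),
        data=your_data, family=binomial)
anova(m, test="Chisq")  # likelihood-ratio test for each term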

Again, there are plenty of resources on this online.
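A minimal sketch of such a binomial fit on synthetic counts, assuming one row per drug × interval cell with 20 patients each. The survival probabilities in `p_true` and the column names `survived` and `died` are invented for illustration. The response is written as `cbind(successes, failures)`, which is the two-column form R's `glm()` accepts for binomial models. An additive model is used here because, with a single row per cell, the interaction model would be saturated.

```r
# Synthetic counts only: one row per drug x interval cell, 20 patients each
set.seed(2)
cells <- expand.grid(interval = c(14, 30, 60, 90, 180, 365),
                     drug     = c("a", "b", "c", "d"))

# Hypothetical true survival probabilities per drug
p_true <- c(a = 0.2, b = 0.5, c = 0.8, d = 0.3)
cells$survived <- rbinom(nrow(cells), size = 20,
                         prob = unname(p_true[as.character(cells$drug)]))
cells$died <- 20 - cells$survived

# Binomial response given as (successes, failures)
m_bin <- glm(cbind(survived, died) ~ factor(interval) + factor(drug),
             data = cells, family = binomial)

# Sequential likelihood-ratio (chi-square) tests for each term
lrt <- anova(m_bin, test = "Chisq")
print(lrt)
```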

edited Sep 07 '20 at 17:03

answered Sep 07 '20 at 16:56

Eoin

Thanks. What if the original data is no longer available? – Mobeus Zoom Sep 07 '20 at 17:09
Is `mean` just a proportion out of 20? If so, you just create a new column, `survived = mean * 20`. That is, if you're fitting logistic regression, you don't need the original data if you have the totals (`n_survived` and `n_total`, in this case). If you're using regular ANOVA, you can also [calculate it using the means and SDs](https://www.tandfonline.com/doi/abs/10.1207/S15328031US0103_04?journalCode=hzzk20), but I don't know if there's a built-in function in R for this. – Eoin Sep 07 '20 at 17:14
No, `mean` is a proportion out of 100, averaged over 20 trials (at each `interval`, for each `Drug`). So then `mean*100` instead? Would it be statistically valid (given the assumptions ANOVA makes anyway) to use the means and SDs to simulate data from the normal distribution and let R run ANOVA on that? – Mobeus Zoom Sep 07 '20 at 17:34
Ah, ok. Now I understand. It's not necessary or advisable to simulate data as you suggest, since this just adds noise for no reason. Instead, you can calculate an ANOVA using the means and standard deviations, using the method in the link provided. There's probably an R package for this somewhere as well, although it's not complicated to implement yourself. – Eoin Sep 07 '20 at 17:38
Thanks. I don't seem to be able to get access to that paper. Could you edit your post with some details how it might be done (mathematically)? – Mobeus Zoom Sep 07 '20 at 17:43
You can get the paper on [sci-hub](https://sci-hub.se/https://www.tandfonline.com/doi/abs/10.1207/S15328031US0103_04?journalCode=hzzk20). This is also discussed on this site [here](https://stats.stackexchange.com/a/57759/42952), [here](https://stats.stackexchange.com/a/126521/42952), [here](https://stats.stackexchange.com/a/329293/42952), and in a few other questions. – Eoin Sep 08 '20 at 07:50
Please don't forget to accept the answer and/or upvote if you found it helpful. – Eoin Sep 08 '20 at 07:51
The first two of those posts recommend simulation (which you explicitly advised against), and the last is a one-way ANOVA. I wouldn't say the question has been answered yet. – Mobeus Zoom Sep 08 '20 at 12:42
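For a balanced design, the two-way ANOVA discussed in this thread can be computed directly from the cell means and SDs using the standard sums-of-squares decomposition, without simulating data. The following is a minimal sketch under stated assumptions, not the method from the linked paper: the helper `anova_from_summary` is a hypothetical name, every cell is assumed to hold `n = 20` independent observations, and the SD of each cell is taken as `(upper - lower) / 2`, which follows the question's "mean ± 1 SD" reading even though the listed intervals are not symmetric about the mean.

```r
# Summary table copied from the question (one row per drug x interval cell)
d <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
interval mean Drug lower upper
14  0.004 a 0.002 0.205
30  0.022 a 0.001 0.101
60  0.13  a 0.061 0.23
90  0.22  a 0.14  0.34
180 0.25  a 0.17  0.35
365 0.31  a 0.23  0.41
14  0.84  b 0.59  1.19
30  0.85  b 0.66  1.084
60  0.94  b 0.75  1.17
90  0.83  b 0.68  1.01
180 1.28  b 1.09  1.51
365 1.58  b 1.38  1.82
14  1.90  c 0.9   4.27
30  2.91  c 1.47  6.29
60  2.57  c 1.52  4.55
90  2.05  c 1.31  3.27
180 2.422 c 1.596 3.769
365 2.83  c 1.93  4.26
14  0.29  d 0.04  1.18
30  0.09  d 0.01  0.29
60  0.39  d 0.17  0.82
90  0.39  d 0.20  0.7
180 0.37  d 0.22  0.59
365 0.34  d 0.21  0.53
")

# ASSUMPTION: upper/lower are mean +/- 1 SD, so SD is half the band width
d$sd <- (d$upper - d$lower) / 2

# Two-way ANOVA from cell means and SDs; valid only for a balanced design
# with n independent observations in every cell (hypothetical helper)
anova_from_summary <- function(d, n) {
  a <- length(unique(d$Drug))      # number of drugs
  b <- length(unique(d$interval))  # number of intervals
  stopifnot(nrow(d) == a * b)      # each cell must appear exactly once
  grand   <- mean(d$mean)
  ss_drug <- n * b * sum((tapply(d$mean, d$Drug, mean) - grand)^2)
  ss_int  <- n * a * sum((tapply(d$mean, d$interval, mean) - grand)^2)
  ss_cell <- n * sum((d$mean - grand)^2)   # between-cell sum of squares
  ss_err  <- (n - 1) * sum(d$sd^2)         # pooled within-cell sum of squares
  out <- data.frame(
    term = c("drug", "interval", "drug:interval", "residual"),
    df   = c(a - 1, b - 1, (a - 1) * (b - 1), a * b * (n - 1)),
    ss   = c(ss_drug, ss_int, ss_cell - ss_drug - ss_int, ss_err)
  )
  out$ms <- out$ss / out$df
  out$f  <- c(out$ms[1:3] / out$ms[4], NA)
  out$p  <- pf(out$f, out$df, out$df[4], lower.tail = FALSE)
  out
}

full   <- anova_from_summary(d, n = 20)                              # all drugs
c_vs_a <- anova_from_summary(d[d$Drug %in% c("a", "c"), ], n = 20)   # c vs a
print(full)
print(c_vs_a)
```

Restricting the rows to two drugs, as in `c_vs_a`, gives the pairwise version of the test; the `drug` row then compares just those two. Note that the SDs implied by this table differ a great deal between drugs (drug c's are far larger), so the equal-variance assumption behind these F tests is questionable here, and any pairwise p-values would still need a multiple-comparison correction.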
