I have the following simulated data on 2500 persons regarding the incidence of a rare disease over 20 years:

year number_affected
1   0
2   0
3   1
4   0
5   0
6   0
7   1
8   0
9   1
10  0
11  1
12  0
13  0
14  1
15  1
16  0
17  1
18  0
19  2
20  1

What test can I apply to show that the disease is becoming more common?

Edit: As suggested by @Wrzlprmft, I tried simple correlation using the Spearman and Kendall methods:

        Spearman's rank correlation rho

data:  year and number_affected
S = 799.44, p-value = 0.08145
alternative hypothesis: true rho is not equal to 0
sample estimates:
      rho 
0.3989206 

Warning message:
In cor.test.default(year, number_affected, method = "spearman") :
  Cannot compute exact p-value with ties

        Kendall's rank correlation tau

data:  year and number_affected
z = 1.752, p-value = 0.07978
alternative hypothesis: true tau is not equal to 0
sample estimates:
      tau 
0.3296319 

Warning message:
In cor.test.default(year, number_affected, method = "kendall") :
  Cannot compute exact p-value with ties

Are these sufficiently good for this type of data? The Mann-Kendall test, using the method shown by @AWebb, gives a p-value of 0.04319868. The Poisson regression suggested by @dsaxton gives the following result:

Call:
glm(formula = number_affected ~ year, family = poisson, data = mydf)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.3187  -0.8524  -0.6173   0.5248   1.2158  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)  
(Intercept) -1.79664    0.85725  -2.096   0.0361 *
year         0.09204    0.05946   1.548   0.1217  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 16.636  on 19  degrees of freedom
Residual deviance: 14.038  on 18  degrees of freedom
AIC: 36.652

Number of Fisher Scoring iterations: 5

The year component here is not significant. What can I finally conclude? Also, in all these analyses the number 2500 (the denominator population) has not been used. Does that number not make a difference? Can we use simple linear regression (Gaussian) with incidence (number_affected/2500) versus year?
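
For reference, the last question in code form: a sketch of the Gaussian regression on the incidence proportion (assuming the data frame `mydf` from the output above; whether this model is appropriate is what the answers below address):

summary(lm(I(number_affected/2500) ~ year, data = mydf))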

rnso
  • Some resources you might find useful: the US Geological Survey has published a textbook online, [*Statistical Methods in Water Resources*](http://pubs.usgs.gov/twri/twri4a3/html/toc.html). The chapter on trend analysis, [here](http://pubs.usgs.gov/twri/twri4a3/pdf/chapter12.pdf), covers things like the Mann-Kendall test and when you might prefer to undertake regression analysis instead. It also shows how to deal with seasonality, which might be relevant to you if your data was quarterly rather than annual. – Silverfish Jul 03 '15 at 00:04
  • Interestingly, Scipy’s implementation of Kendall’s τ yields the same coefficient but a drastically different *p*-value, namely 0.042. – Wrzlprmft Jul 03 '15 at 04:40
  • Regarding the Poisson model, I would instead use `drop1(fit, test="LRT")` to do a likelihood ratio test, instead of doing an asymptotic *z*-test on the Poisson statistic. (Doing so gives you a *p*-value of 0.107, so still not statistically significant.) You don’t need to include the population number in the regression if it’s the same for each year. Then it just plays the role of a scaling factor. But you *should* include it (with per-year population values), as the population at risk probably *does* vary over the twenty years. Just add `offset=log(pop_at_risk)` to the `glm` call. – Karl Ove Hufthammer Jul 12 '15 at 15:51
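
A minimal sketch of what that comment suggests, assuming the data frame `mydf` from the `glm` output above; `pop_at_risk` is a hypothetical column, filled here with the constant 2500 as a placeholder (a constant offset only rescales the intercept):

fit <- glm(number_affected ~ year, family = poisson, data = mydf)
drop1(fit, test = "LRT")                  # likelihood ratio test for dropping `year`

# Modelling the incidence *rate* with the population at risk as an offset;
# replace the placeholder with real per-year values if they are available.
mydf$pop_at_risk <- 2500
fit_rate <- glm(number_affected ~ year + offset(log(pop_at_risk)),
                family = poisson, data = mydf)
drop1(fit_rate, test = "LRT")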

3 Answers

3

You can use the non-parametric Mann-Kendall test. For this sample data, with the counts stored in `cases`, and the one-sided null hypothesis that there is no increasing trend, you can implement it as follows in R:

> cases <- c(0,0,1,0,0,0,1,0,1,0,1,0,0,1,1,0,1,0,2,1)  # number_affected by year
> n <- length(cases)
> d <- outer(cases,cases,"-")                # all pairwise differences
> s <- sum(sign(d[lower.tri(d)]))            # Mann-Kendall S statistic
> ties <- table(cases)                       # tie counts for the variance correction
> v <- 1/18*(n*(n-1)*(2*n+5)-sum(ties*(ties-1)*(2*ties+5)))
> t <- sign(s)*(abs(s)-1)/sqrt(v)            # continuity-corrected normal statistic
> 1-pnorm(t)                                 # one-sided p-value
[1] 0.04319868

So you reject the null hypothesis at the 5% level in favor of an increasing trend.
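
For comparison, a packaged implementation of the same test is available in the `Kendall` package (assuming it is installed; note that `MannKendall` reports a two-sided p-value, so it will be roughly double the one-sided value above):

library(Kendall)
MannKendall(cases)   # reports tau and a two-sided p-value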

A. Webb
  • Do you happen to know whether there is any difference between the Mann–Kendall test and the normal significance test for Kendall’s τ? Or is the Mann–Kendall test even the normal way of obtaining significance values for Kendall’s τ? At least the test statistics only differ by a normalisation factor that only depends on the length of the time series: $S = \tfrac{1}{2} n (n-1) \tau$. – Wrzlprmft Jul 03 '15 at 19:45
  • @Wrzlprmft This is the typical normal approximation significance test in the presence of ties. The [Wikipedia article](https://en.wikipedia.org/wiki/Kendall_rank_correlation_coefficient) has good information/references for the various adjustments needed to account for ties. – A. Webb Jul 06 '15 at 13:39
2

You could fit a very simple regression model consisting only of an intercept and a time component and test the "significance" of the time component. For instance, you might model $Y_t \sim \text{Poisson}(\lambda_t)$, where $Y_t$ is the number of occurrences in year $t$ and $\log(\lambda_t) = \alpha + \beta t$, and check whether $\beta > 0$.
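
A minimal sketch of this model in R, assuming the counts are in a vector `cases` as above; the one-sided p-value for $\beta > 0$ comes from the Wald $z$ statistic:

years <- 1:20
fit <- glm(cases ~ years, family = poisson)
z <- coef(summary(fit))["years", "z value"]   # Wald z statistic for beta
1 - pnorm(z)                                  # one-sided p-value for beta > 0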

dsaxton
  • I agree that Poisson regression is appropriate. And if one had more data, one could even fit the (log) incidence rate as a *non-linear* function of time. An added advantage of Poisson regression is that it’s easy to take the number of persons *at risk* into account. And when one’s dealing with time, this is especially important, as the (possible) trend in incidence that we’re seeing may just be the effect of an increasing *population at risk*, not an increasing incidence *rate*. (For example, the world population has increased by a quarter in the last twenty years.) – Karl Ove Hufthammer Jul 12 '15 at 15:38
1

Just check whether your number of new cases (i.e., number_affected) is significantly correlated with time (i.e., year). As any possible linear dependence of the event rate is at least distorted due to the observational discretisation, you want to use a rank-based correlation coefficient, e.g., Kendall’s τ or Spearman’s ρ.
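
In R this is a one-liner (assuming vectors `year` and `number_affected` as in the question; this is exactly the call that produced the output in the question's edit):

cor.test(year, number_affected, method = "kendall")    # or method = "spearman"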

Wrzlprmft
  • I actually meant incidence i.e. number_affected indicates new cases in that year. But your method of simple correlation should work for that also. – rnso Jul 03 '15 at 00:56
  • @rnso: *I actually meant incidence i.e. number_affected indicates new cases in that year.* – that’s how I understood it and I see no contradiction. – Wrzlprmft Jul 03 '15 at 04:38
  • I made that comment since you used the word 'prevalence' in your answer. Prevalence will include previous years' cases also (unless they have died). https://en.wikipedia.org/wiki/Incidence_%28epidemiology%29#Incidence_vs._prevalence – rnso Jul 03 '15 at 13:20
  • @rnso: Ah, point taken. – Wrzlprmft Jul 03 '15 at 18:39
  • Correlation measures, like Kendall’s τ or Spearman’s ρ, are not appropriate, as they are created for *random* variables, and here one of the variables (time) is obviously not random at all. See, e.g., [Don’t Summarize Regression Sampling Schemes with Correlation](http://vanbelle.org/rom/rom_2003_12.pdf). Besides that, the Kendall’s τ or Spearman’s ρ **tests** won’t work very well, as there is a large number of ties in the data. A regression approach would be better, e.g., a Poisson regression (with a suitable trend function) and a likelihood ratio test. – Karl Ove Hufthammer Jul 12 '15 at 09:40
  • @KarlOveHufthammer: That link makes too many references to unknown material to be understandable to me, though it does not really seem to be about the same thing. Also, in my understanding (ignoring ties for simplicity’s sake), you can break ranked correlation coefficients down to the question of whether there is a significant trend in a sequence of ranks, so one variable being non-random does not really pose a problem. Finally, the ties should be a problem for any analysis. – Wrzlprmft Jul 12 '15 at 15:12
  • @Wrzlprmft The link is about the problem of using a correlation measure (which is useful only in the *bivariate* case) in a regression setting. (I understand that it’s difficult to understand, as it references a book.) Correlation measures make no sense in a regression setting (which this is). That is, you can’t properly interpret the magnitude of the correlation coefficients when one of the variables is non-random. The corresponding tests *may* work (I guess that depends on which correlation measure one uses). And no, the ties would **not** be a problem for, e.g., Poisson regression. – Karl Ove Hufthammer Jul 12 '15 at 15:21