
This question has two, possibly related, parts.

Question #1

We have data on the occurrence of a disease in several age groups over several years, and we want to investigate the secular trend of the disease (i.e. how it changes over the years). We will use negative binomial regression with splines to that end, but as the age composition of the (background) population is changing substantially due to aging, we have to apply some adjustment. I can think of two ways:

  1. Classical method: perform a usual indirect standardization, and add the expected counts as an offset to the regression.
  2. Modern method: don't collapse the age groups; feed the whole database to the regression, but add age as a predictor (the offset is then simply the population count of the stratum).

Apart from being "classical" and "modern", what are the pros and cons of these approaches?
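To make the contrast concrete, here is a minimal sketch of the two specifications (the data frames Collapsed and Strata and their columns are hypothetical placeholders, not part of the analysis below):

library( mgcv )

# Classical: one row per time point, collapsed over age; the indirectly
# standardized expected count enters as the offset
fit_classical <- gam( count ~ s( time ), offset = log( expected ),
                      data = Collapsed, family = nb( link = "log" ) )

# Modern: one row per age stratum and time point; age is a smooth predictor
# and the stratum population is the offset
fit_modern <- gam( count ~ s( time ) + s( age ), offset = log( pop ),
                   data = Strata, family = nb( link = "log" ) )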

Question #2

Let's try out these two approaches on a simulated dataset.

We first simulate a changing age composition (by transitioning from Kazakhstan's to Sweden's population pyramid), and add some predictors to create a more or less realistic scenario:

library( data.table )
library( mgcv )
library( epitools )
set.seed( 12 )

# Two countries' population pyramids (19 age groups each)
RawData <- data.table( read.table( "http://data.princeton.edu/eco572/datasets/preston21long.dat",
                                   header = FALSE, col.names = c( "country", "ageg", "pop", "deaths" ) ) )
RawData$ageg <- rep( c( 0, 1, seq( 5, 85, 5 ) ), 2 )

# Linearly interpolate each age group's population between the two pyramids
# over 20 years of 360 days
RawData <- RawData[ , approx( x = c( 0, 20*360-1 ), y = pop,
                              xout = 0:(20*360-1) ), .( ageg ) ]
names( RawData )[ c( 2, 3 ) ] <- c( "date", "pop" )
RawData$year <- floor( RawData$date/360 ) + 1
RawData$month <- floor( ( RawData$date - (RawData$year-1)*360 )/30 ) + 1

# Poisson counts with an age effect, a long-term trend and a seasonal component
RawData$count <- rpois( nrow( RawData ),
                        exp( log( 0.0001*RawData$pop ) + 0.00004*RawData$ageg^2 -
                             0.005*(RawData$date/300)^2 + sin( RawData$month/12*2*pi ) ) )

That is, we have daily data for 20 years (with 360-day years and 30-day months).
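A quick sanity check of the simulated panel (a sketch; there are 19 age groups, each observed on 20 × 360 = 7200 days):

RawData[ , .N, .( ageg ) ]   # 7200 rows per age group
range( RawData$year )        # 1 ... 20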

First let's try the modern approach:

fit <- gam( count ~ s( date ) + s( month, bs = "cc" ) + s( ageg ), offset = log( pop ),
       data = RawData, family = poisson( link = "log" ) )
par( mfrow = c( 2, 2 ) )
plot( fit, trans = exp, scale = 0 )

The result is:

[figure: estimated smooths of date, month and age from the modern model]

Let's now try the classical approach:

# Totals per age group serve as the standard population
StdPop <- RawData[ , .( count = sum( count ), pop = sum( pop ) ), by = .( ageg ) ]

# Collapse to one row per day: the indirectly standardized expected count
# (column 'exp', evaluated within the data, not base::exp) is the offset
fit2 <- gam( count ~ s( date ) + s( month, bs = "cc" ), offset = log( exp ),
             data = merge( RawData, StdPop, by = "ageg" )[
               , .( exp = ageadjust.indirect( count.x, pop.x, count.y, pop.y )$sir[ c( "exp" ) ],
                    count = sum( count.x ) ),
               .( date, year, month ) ], family = poisson( link = "log" ) )
par( mfrow = c( 1, 2 ) )
plot( fit2, trans = exp, scale = 0 )

Result:

[figure: estimated smooths of date and month from the classical model]

As expected, we obtain no information on the effect of age, but otherwise the results are quite similar. Now comes the surprising part.

Let's say we are worried about whether the spline for the time trend was flexible enough (i.e. whether k was chosen correctly). The standard approach to this problem in mgcv is simply to set k high enough: if k is too low we have a problem, but once it is high enough, its exact value usually doesn't matter much.
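In code, such a check might look like this (a minimal sketch, anticipating the k = 30 refit below; gam.check() reports k' against the estimated edf for each smooth):

fit30 <- gam( count ~ s( date, k = 30 ) + s( month, bs = "cc" ) + s( ageg ),
              offset = log( pop ), data = RawData,
              family = poisson( link = "log" ) )
gam.check( fit30 )   # compare k' and edf for s( date )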

So let's increase k to 30. With the modern approach we see that the effective df increases:

[figure: modern model, k = 30]

Just as with the traditional approach:

[figure: classical model, k = 30]

Was 30 still too low? Let's increase it to 50! The modern one produces an even stranger output:

[figure: modern model, k = 50]

As does the traditional one:

[figure: classical model, k = 50]

Something clearly went very wrong with k = 50, but even putting that aside, it is still strange: instead of settling once k is high enough, the effective df simply keeps increasing...

What's going on here...?

EDIT (01-Feb-2018)

See earlier attempts here (chronologically).

In light of an earlier answer, I tried experimenting with an adaptive spline (bs = "ad"), but it didn't really help.
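The variant I fitted was along these lines (a sketch; only the basis of s( date ) changes relative to the model above):

fit_ad <- gam( count ~ s( date, bs = "ad", k = 30 ) + s( month, bs = "cc" ) + s( ageg ),
               offset = log( pop ), data = RawData, family = poisson( link = "log" ) )

For k = 30: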

[figure: adaptive spline, k = 30]

With k=50:

[figure: adaptive spline, k = 50]

EDIT (15-Dec-2017)

Following @Paul's suggestions, I experimented with making population a true offset (i.e. including it in the simulated mean):

# Counts whose mean includes the population, with multiplicative lognormal noise
RawData$count2 <- round( exp( log( 0.0001*RawData$pop ) + 0.00004*RawData$ageg^2 -
                              0.005*RawData$year^2 + sin( RawData$month/12*2*pi ) +
                              rnorm( nrow( RawData ), 0, 2 ) ) )

However, it doesn't seem to change the story:

fit <- gam( count2 ~ s( date, k = 30 ) + s( month, bs = "cc" ) + s( ageg ),
            offset = log( pop ), data = RawData, family = nb( link = "log" ) )
par( mfrow = c( 2, 2 ) )
plot( fit, trans = exp, scale = 0 )

results in

[figure: negative binomial model on count2, k = 30]

With k=50:

[figure: negative binomial model on count2, k = 50]

So it seems that everything is unchanged...

EDIT (08-Jan-2018)

Following another suggestion of @Paul's, I tried changing the smoothing parameter estimation method, but to no avail...
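Concretely, I refitted the count2 model from the previous edit with only the method argument changed (a sketch):

fit_reml <- gam( count2 ~ s( date, k = 30 ) + s( month, bs = "cc" ) + s( ageg ),
                 offset = log( pop ), data = RawData, family = nb( link = "log" ),
                 method = "REML" )   # and likewise with method = "ML"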

REML (with the usual k=30):

[figure: REML fit, k = 30]

ML (with the usual k=30):

[figure: ML fit, k = 30]

EDIT (10-Jan-2018)

Following the final suggestion from @Paul, I generated the simulated data so that it follows a Poisson distribution, and set the model family to Poisson as well:

RawData$count3 <- rpois( nrow( RawData ), exp( log( 0.0001*RawData$pop ) +
                         0.00004*RawData$ageg^2 - 0.005*RawData$year^2 +
                         sin( RawData$month/12*2*pi ) ) )
fit <- gam( count3 ~ s( date, k = 50 ) + s( month, bs = "cc" ) + s( ageg ),
            offset = log( pop ), data = RawData,
            family = poisson( link = "log" ), method = "ML" )

...but it didn't help either.

k=10:

[figure: Poisson fit, k = 10]

k=30:

[figure: Poisson fit, k = 30]

k=50:

[figure: Poisson fit, k = 50]

So, as before, the edf seems to follow k instead of settling once k is sufficiently high.
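One compact way to see this (a sketch building on the Poisson model above; summary()$s.table holds the per-smooth edf):

edf_date <- sapply( c( 10, 30, 50 ), function( k ) {
  f <- gam( count3 ~ s( date, k = k ) + s( month, bs = "cc" ) + s( ageg ),
            offset = log( pop ), data = RawData,
            family = poisson( link = "log" ), method = "ML" )
  summary( f )$s.table[ "s(date)", "edf" ]
} )
edf_date   # keeps growing with k instead of levelling off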

Tamas Ferenci
  • 1. For mgcv, your graphs are not consistent with the model formula you wrote: there is an `s(month)` term in the plots but not in the formula. 2. Is there a reason you are using negative binomial instead of Poisson? – Paul Dec 01 '17 at 22:10
  • @Paul 1) Sorry, you're totally correct, I just mixed a few things up... (I had many versions while experimenting with this phenomenon.) I cleaned up the code; I hope everything is correct now. 2) Nothing extraordinary, I just tend to use NB when there is any chance of overdispersion... just to be on the safe side. – Tamas Ferenci Dec 03 '17 at 01:22
  • Other issues: (1) You have a log offset for pop but your simulated count doesn't include pop. (2) For diagnostic purposes it would be less ambiguous for me if we used Poisson simulated count data and a poisson family in the fit, so we can at least rule out a poor statistical model as the cause. – Paul Dec 03 '17 at 04:01
  • If you get your simulation fixed and still have issues, try setting method="REML" in gam. The current default of GCV.Cp is not the most recommended; REML or ML is preferred due to occasional undersmoothing issues with GCV.Cp. – Paul Dec 03 '17 at 04:05
  • @Paul (1) You are right, but I don't think it should matter: right now it is generated independently, so it is similar to the inclusion of an irrelevant covariate, isn't it? (Actually, as the offset _is_ simply a covariate, just with a fixed coefficient, it is perhaps not only similar but completely identical to including an unnecessary covariate.) And the inclusion of an irrelevant covariate shouldn't lead to such strange undersmoothing...?? Nevertheless, I tried out what you're suggesting, see the edit of my question! I'll try out your further remarks too, thank you! – Tamas Ferenci Dec 15 '17 at 21:20
  • It's not undersmoothing if the simulated data is in fact very wiggly. Your simulation needs to be correct for GAM to have a chance of working. Yes things can be messed up and still somehow work, but it's easier to debug if everything is correct and matches up precisely with simulation. Good luck with the other variants. – Paul Dec 15 '17 at 21:25
  • Yes, sorry, perhaps "undersmoothing" wasn't the best terminology here, I was just referring to the fact the effective df simply follows `k` (instead of settling if `k` is sufficiently high). – Tamas Ferenci Dec 15 '17 at 21:46
  • If you still can't get it to work after trying everything I suggested, consider sending in your code to the developer (Simon Wood). He has fixed bugs and other issues for me in the past. – Paul Dec 15 '17 at 21:54
  • Yes, I also thought of this, but I thought it worth first asking it here, before directly disturbing him. (And indeed it was useful, thanks again!) But that will be the next step then, if these don't help. – Tamas Ferenci Dec 16 '17 at 08:55
  • I am a bit concerned that GAM cannot readily distinguish between your desired solution and the one that is wavy with respect to the date term. Both solutions fit the data well, so they have to be distinguished based on the degree of penalization. The penalization scheme is complex and it's not clear to me if your favored solution would come out on top. That would definitely be a topic for Prof Wood. In the mean time, just keeping $k$ low on the s(date) term is a reasonable way to make this work. – Paul Dec 16 '17 at 13:38
  • @Paul I finally had time to try out other methods, unfortunately they didn't help (see the update of the post). I'll try your last remaining suggestion (Poisson data generation with Poisson model), and if it doesn't help either, I'll ask further assistance, as you suggested. – Tamas Ferenci Jan 08 '18 at 18:45
  • The behaviour of the spline of time seems to be the result of your model currently not accounting for the seasonal aspect of the data --- aren't those seasonal oscillations turning up in the `s(date)`? It's clear from that plot, if those are seasonal oscillations, that the seasonal signal is changing over time whilst your model has a fixed seasonal effect. As Paul mentions, I think you need to dial the model in correctly first, and *then* see if there are still problems. Increasing `k` will only make a wigglier spline if there is unmodelled structure to fit to. – Gavin Simpson Jan 10 '18 at 17:41
  • If you have daily data, then you could use the "day of the year" as the seasonal predictor. You probably want to switch from `date` to using `year` (`DoY`) as the trend covariate in the smooth. You will also want to interact these two splines via `te(year, DoY)`. And as you have daily data it is highly unlikely that you don't have autocorrelation and then you may have an identifiability issue: wiggly spline + low or zero AR is one potential description of the data but also a smooth spline with high AR is another valid description. – Gavin Simpson Jan 10 '18 at 17:45
  • I think what you are seeing is the spline picking up residual autocorrelation not modelled by the constant seasonal effect. If you switch to `s(year)` plus add in autocorrelation structure to the residuals, this may help, but you can't do that with `gamm()` as it can't handle the NBin conditional distribution. You could fit the model with `bam()` and then use `acf()` on the residuals to get a guess at $\rho$, and then refit the `bam()` with the estimate for $\rho$ plugged in via the `rho` argument. (A sketch of this workflow appears after these comments.) – Gavin Simpson Jan 10 '18 at 17:47
  • (Also, can you clean the question up - there's too much going on now for users to reasonably grapple with here.) – Gavin Simpson Jan 10 '18 at 17:48
  • @GavinSimpson First of all, thank you very much for your comments! "of your model currently not accounting for the seasonal aspect of the data" But I included `s( month, bs = "cc" )`, doesn't it account for seasonality? "the seasonal signal is changing over time whilst your model has a fixed seasonal effect" That's true, but the seasonality is fixed in the population model too (`sin( RawData$month/12*2*pi )`). "If you have daily data, then you could use the "day of the year" as the seasonal predictor." OK, I'll try it! – Tamas Ferenci Jan 10 '18 at 20:05
  • @GavinSimpson "And as you have daily data it is highly unlikely that you don't have autocorrelation" Hm, indeed! You're totally right, I simply haven't thought of this aspect.. "plus add in autocorrelation structure to residuals" That's definitely an interesting idea, but my problem is that in this case the whole autocorrelation story is a bit more convoluted as I have 19 age strata. Am I right to assume that now I don't have "an" ACF (and 1 $\varrho$), but rather 19 ACFs and $\varrho$s - one for each stratum..? And thus the corr. structure should also somehow take this into account? – Tamas Ferenci Jan 10 '18 at 20:53
  • Hi Tamas, in your Jan 10 simulation the fitted model and ground truth are much closer than before, but there is still some discrepancy. In your simulation the long-term drift is based on `year` which is discrete (changes every year) but the model is based on `date`. So the correct latent factor associated with `date` is a discretized downward parabola with discontinuity at each year. GAM is designed to extract smooth curves, not stairstepping curves. So GAM might perform better if the drift term was based on `date` instead of `year`. – Paul Jan 29 '18 at 16:37
  • It is also worth noting that improving the agreement between simulation and model did help, although not fully solving the problem. The results at $k=30$ are clearly much better in the Jan 10 experiment than in the earlier experiments. – Paul Jan 29 '18 at 16:58
  • I agree with Gavin that the question should be cleaned up. Only the Jan 10 simulations are really relevant and helpful for our users. I don't see the value in showing that GAM "fails" to fit data when the generating process is so different from the GAM model formula. – Paul Jan 29 '18 at 17:09
  • @Paul Thank you Paul again! As you suggested, I cleaned up the question and changed the model so that `date` is used both in the simulation and in the model. Unfortunately it didn't solve the problem! The "good" news is that now even the traditional method produces the same problem... Or is it the unidentifiability issue you raised in your last comment?? – Tamas Ferenci Feb 01 '18 at 14:49
  • 1
    It's not identifiability, your simulation has problems. Take floor out of the month definition and everything works – Paul Feb 01 '18 at 17:21
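To make Gavin Simpson's `bam()` + `rho` suggestion from the comments concrete, a minimal sketch (assumptions: rows are ordered by date within each stratum, a single common AR(1) parameter is adequate, and the Poisson family is used; with 19 age strata the start of each series should really be flagged via bam()'s AR.start argument):

fit_b  <- bam( count ~ s( year ) + s( month, bs = "cc" ) + s( ageg ),
               offset = log( pop ), data = RawData, family = poisson( link = "log" ) )
rho1   <- acf( residuals( fit_b ), plot = FALSE )$acf[ 2 ]   # lag-1 autocorrelation
fit_b2 <- bam( count ~ s( year ) + s( month, bs = "cc" ) + s( ageg ),
               offset = log( pop ), data = RawData, family = poisson( link = "log" ),
               rho = rho1 )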

3 Answers


Yes, you are right, Paul: the true function of date has a step at the year boundaries. That accounts for the estimated cyclic smooth of month having the strange blocky shape instead of the simulated sinusoid --- it is trying to match what has been simulated. The cyclic smooth can't capture the change in step height over time, so, if you give the model the freedom to do so, it uses the smooth of date to deal with this instead.

If the two `floor` functions are removed in the original simulations, to avoid the sawtooth in the dependence on date, then everything behaves as expected: the smooth of date remains smooth and stable as k is increased (despite the data not being remotely negative binomial). Also, the cyclic smooth then looks like a sinusoid, as it should.

So in fact there was variability in seasonality from year to year, generated by the rounding to year, and that seems to have been the main cause of the apparent problem.
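For reference, a sketch of the repair along the lines of Paul's comment above (making the simulated month continuous in date rather than a step function, and regenerating the counts exactly as before):

RawData$month <- ( RawData$date - ( RawData$year - 1 )*360 )/30 + 1
RawData$count <- rpois( nrow( RawData ),
                        exp( log( 0.0001*RawData$pop ) + 0.00004*RawData$ageg^2 -
                             0.005*( RawData$date/300 )^2 + sin( RawData$month/12*2*pi ) ) )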

Simon Wood
  • Unfortunately, changing `year` to `date` did not solve the problem! – Tamas Ferenci Feb 01 '18 at 14:51
  • 1
    @TamasFerenci Simon's right. Go up to your code and remove the floor function from the line where you define `month`. Everything works perfectly after that. The unphysical discretization of the month effect in your simulation is throwing off both GAM and the traditional method. – Paul Feb 01 '18 at 17:15
  • @Paul ...and that is correct, thank you very much! `RawData$DOY` – Tamas Ferenci Feb 14 '18 at 18:29

Sorry, I hadn't read your simulation code properly, so my answer was a bit misleading. I think you need to try simulating your data from a model somewhat more similar to the one you are fitting (or fitting a model more similar to the one simulated from). If you look at the residual plots you can see that the negative binomial is a very poor model of data generated from round(exp(model.mean + Gaussian.deviate)). If the model is this far wrong then you are likely to find that it will try to overfit in some places and underfit in others, leading to things like an apparent change in amplitude of the seasonal cycle (peakier where the model likelihood says it should be fitting closer, attenuated where the likelihood says it should not be such a close fit).
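One quick way to see the mismatch described here (a sketch; fit is assumed to be the negative binomial model fitted to the lognormal count2 data):

gam.check( fit )   # the QQ-plot of deviance residuals shows the misfit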

Simon Wood

  • If you look farther down to the Jan 10, he has done a variant experiment where the model is more faithful to the simulation process. The only problem I can see with that experiment is that the long-term drift is based on `year` but the model is based on `date`. So the ground truth for `s(date)` is a discretized parabola with stair steps at each year, rather than a completely smooth parabola. – Paul Jan 29 '18 at 16:52

It looks to me as if there is variability in the seasonality effect from year to year --- this gets picked up by the smooth of date only if you give it enough flexibility to allow it to do so. For confirmation that this is the issue, notice the way that the cyclic smooth of month is attenuated when the smooth of time starts to pick up the annual cycle. Probably, if you look at the residuals when the smooth of time has a low k, you will see a pattern with respect to time.

My guess is that when you collapse the data there is just not enough information to detect the variability in annual cycle (but I may be missing something).

One detail: if month is discrete, then you want to set the cyclic spline to run from 0 to 12 (or 1 to 13), otherwise you are forcing s(december) == s(january), which is not what is intended (use `knots` to do this).
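For example (a sketch of that detail applied to the modern model above):

fit_cc <- gam( count ~ s( date ) + s( month, bs = "cc" ) + s( ageg ),
               offset = log( pop ), data = RawData,
               family = poisson( link = "log" ),
               knots = list( month = c( 0, 12 ) ) )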

Simon Wood, Bristol

  • Thank you very much for your answer! I don't really see how we can have "variability in the seasonality effect from year to year" in an artificially generated database, where I have manually added the seasonality (which is of course the same in every year), but the more important question is: what to do now, what is the solution? In particular, shall I start by adding some autocorrelation model, as @Gavin Simpson suggested...? – Tamas Ferenci Jan 16 '18 at 22:42