We work on cervical cancer detection and started accumulating data in 2006 by performing HPV testing on all our samples. We have accumulated a data set of almost 400,000 women. Data enter the database on a daily basis. Most women come for a yearly smear, so most appear in the database more than once (2-13 times). Most women are HPV negative (approx. 85%), but some become HPV positive during the observation period.

I would be very interested in looking at the rate of infection (incidence) of HPV16. I suppose I will have to look at the women with at least 2 samples and restrict the analysis to women going from a negative sample to a positive sample. I have a few questions:

  1. How do I start? Do I use a survival analysis and how do I do this specifically in Stata or Prism5?
  2. Do I censor some women, and how? (For women who are HPV16+ on the first sample, we don't know when they acquired the infection.)
  3. What do I do with all the HPV negative women, do they stay in the analysis?
  4. How would I compare the incidence rate according to age? I would specifically like to know if young women (<30 years) have a different pattern when compared to women of >30 years.
chl
JP Bogers
  • May I suggest that you talk to your friendly neighborhood statistician? This project seems to be too large and important to mess up, and it would be easy to do it incorrectly. – Aniko Dec 30 '10 at 15:18
  • May I just emphasize what Aniko writes. The biggest concern in my mind, after reading this question, is not about statistical procedure or software, but about what all this might *mean*. Unless your data are an exhaustive survey (of a well-defined population) or obtained through a formal randomized procedure, considerable thought--of both a statistical and epidemiological nature--will be necessary to assure that you can learn *anything* from these data that is validly generalizable. – whuber Dec 30 '10 at 15:55

3 Answers

Survival analysis is a good idea, but:

  • Do not discard anybody - even those with only one data point. The fact that they did or did not have HPV at that age is informative.
  • Your data is interval censored: left-censored for those who are positive at first measurement, right-censored for those who stay negative, and censored between the two measurement times for those who change status.

I would set up age 0 (or some other fixed value before the earliest measurement) as the starting point for time, and then estimate the hazard of infection non-parametrically - Stata probably has a routine for that, but if not, R certainly does. The result would be the age-dependent instantaneous rate (hazard) of infection - pretty much what you are looking for.
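The three censoring cases listed above can be made concrete with a minimal Python sketch that turns one woman's visit history into an interval (left, right] on the age scale. The function name, the `(age, positive)` tuple layout, and the example ages are assumptions made up for illustration; the sketch also assumes that only the first positive test matters (clearance and reinfection are ignored).

```python
import math

def censoring_interval(visits, origin=0.0):
    """Return (left, right): infection occurred at an age in (left, right].

    visits: list of (age_in_years, hpv16_positive) tuples for one woman.
    origin: age at which the time scale starts (0 here).
    """
    visits = sorted(visits)  # order the visits by age
    last_negative = origin
    for age, positive in visits:
        if positive:
            # First positive test: infection happened after the last negative
            # test, or after the origin if she was positive at her first visit.
            return (last_negative, age)
        last_negative = age
    # Never tested positive: right-censored at the last visit.
    return (last_negative, math.inf)

# Hypothetical visit histories (ages in years)
print(censoring_interval([(25.0, True)]))                  # left-censored
print(censoring_interval([(31.0, False), (32.1, False)]))  # right-censored
print(censoring_interval([(28.0, False), (29.0, True)]))   # interval-censored
```

Intervals of this form are the usual input to interval-censored estimation routines, whichever package ends up doing the fitting.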

If you want to estimate how covariates affect this hazard, then interval-censored regression is needed - standard Cox regression does not handle this situation.
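To see why interval censoring needs its own likelihood, here is a rough Python sketch under a deliberately unrealistic assumption: a constant hazard from the time origin onwards. Each woman contributes S(left) - S(right), the probability that infection falls in her interval, rather than an exact event time. The function names, the grid search, and the example intervals are all invented for illustration; a real analysis would use an established interval-censored routine and a more flexible hazard.

```python
import math

def log_likelihood(rate, intervals):
    """Log-likelihood of a constant hazard `rate` for intervals (left, right]."""
    total = 0.0
    for left, right in intervals:
        surv_left = math.exp(-rate * left)
        # Survival at infinity is 0, which covers right-censored women.
        surv_right = 0.0 if math.isinf(right) else math.exp(-rate * right)
        # P(left < T <= right) = S(left) - S(right)
        total += math.log(surv_left - surv_right)
    return total

def fit_rate(intervals, grid=None):
    """Crude grid-search maximum-likelihood estimate of the hazard."""
    if grid is None:
        grid = [i / 10000 for i in range(1, 2001)]  # 0.0001 to 0.2 per year
    return max(grid, key=lambda r: log_likelihood(r, intervals))

# Hypothetical intervals in years of age: one left-censored, two
# right-censored, and one woman who converted between ages 28 and 29.
intervals = [(0.0, 25.0), (32.1, math.inf), (28.0, 29.0), (40.0, math.inf)]
rate_hat = fit_rate(intervals)
print(f"estimated hazard: {rate_hat:.4f} per year")
```

A covariate such as age group would enter by letting the rate depend on it, which is the step an interval-censored regression model takes care of.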

Aniko
  • Thank you very much for the advice, I will think about this. I agree that I need a statistician, but: a. I underestimated the problem; the previous, smaller questions were easier and we could handle them ourselves; b. I'm in a program for a master's in public health specializing in epidemiology, the advanced statistics modules only start in 2011, and I wanted to have an idea now; c. I like to understand my statistical friend when we are discussing this. I'm a professor at the University of Antwerp; we even have an epidemiology and biostatistics department ;-) (politics....) – JP Bogers Dec 31 '10 at 05:18

This may be too late, looking at the time since you asked your question, but I'm going at it anyway, just in case.

First, I'd like to echo the suggestion that you go talk to your friendly local biostatistician, or at least a methods-oriented epidemiologist. Your data sounds too nice to waste on bad analysis.

Taking your questions in order:

> We work on cervical cancer detection and started accumulating data in 2006 by performing HPV testing on all our samples. We have accumulated a data set of almost 400,000 women. Data enter the database on a daily basis. Most women come for a yearly smear, so most appear in the database more than once (2-13 times). Most women are HPV negative (approx. 85%), but some become HPV positive during the observation period.

Sounds like you've got a delightful cohort study on your hands. Congrats!

> I would be very interested in looking at the rate of infection (incidence) of HPV16. I suppose I will have to look at the women with at least 2 samples and restrict the analysis to women going from a negative sample to a positive sample. I have a few questions:

Requiring at least two samples is probably justified. Restricting the analysis to women who go from a negative to a positive sample is not.

> How do I start? Do I use a survival analysis and how do I do this specifically in Stata or Prism5?

Yes. What you're looking for is survival analysis, and if you're particularly interested in the incidence rate (cases/time), what you're looking for is Poisson regression. The specifics of implementing such a model in Stata should be easy enough to find with a web search.

> Do I censor some women, and how? (For women who are HPV16+ on the first sample, we don't know when they acquired the infection.)

Your program should be able to handle this type of censoring, but I'm not an expert in Stata, so I can't give you specifics as to how.

> What do I do with all the HPV negative women, do they stay in the analysis?

These women absolutely stay in the analysis. You follow them until they either develop HPV (and thus become cases) or until your study ends (at which point they become censored). You want them in there - partially because survival analysis techniques actually assume they do get HPV at some point - just some point in the unknown, infinite future.
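As a rough illustration of how negative women contribute follow-up time, the Python sketch below computes person-years and an event flag for women who start HPV16-negative. The function name, the data layout, and the ages are invented; placing the infection midway between the last negative and the first positive test is an assumption of this sketch, not something the data can confirm.

```python
def follow_up(visits):
    """Return (person_years, event) for one woman who starts HPV16-negative.

    visits: list of (age_in_years, hpv16_positive); the first test is negative.
    """
    visits = sorted(visits)  # order the visits by age
    start_age = visits[0][0]
    previous_age = start_age
    for age, positive in visits[1:]:
        if positive:
            # Assumption: infection occurred midway between the last negative
            # test and the first positive test.
            event_age = (previous_age + age) / 2
            return (event_age - start_age, 1)
        previous_age = age
    # Still negative at the last visit: censored there.
    return (previous_age - start_age, 0)

# Hypothetical cohort of two women
cohort = [
    [(28.0, False), (29.0, False), (30.0, True)],   # becomes a case
    [(35.0, False), (36.0, False), (37.0, False)],  # stays negative
]
results = [follow_up(v) for v in cohort]
person_years = sum(t for t, _ in results)
cases = sum(e for _, e in results)
print(f"{cases} case(s) in {person_years} person-years")
```

The woman who never converts still adds two person-years to the denominator, which is why dropping the negatives would inflate the estimated rate.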

> How would I compare the incidence rate according to age? I would specifically like to know if young women (<30 years) have a different pattern when compared to women of >30 years.

You can include either a binary (<30 vs 30+) age variable in the model or, preferably, model age continuously. When you run your analysis, you'll have a number of model coefficients. exp(coefficient for your age variable) is the rate ratio comparing old with young (for the binary variable) or the multiplicative change in the rate per one-year step in age (for the continuous variable). So, for example, if you used the binary variable and got a result of 1.50, it means older women acquire HPV at 1.5 times the rate of younger women, i.e. a 50% higher rate per unit time. This is an incidence density ratio, which is similar to a relative risk.
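For the binary case, the quantity can be checked by hand. In a Poisson model with only a group indicator and log person-time as the offset, exp(coefficient) equals the crude ratio of the two incidence rates, so a short Python sketch is enough to show the arithmetic. The function name and the counts are invented for illustration, and the interval is a simple Wald interval on the log scale.

```python
import math

def rate_ratio(cases_a, py_a, cases_b, py_b):
    """Incidence rate ratio (group B vs group A) with a 95% Wald interval."""
    ratio = (cases_b / py_b) / (cases_a / py_a)
    se_log = math.sqrt(1 / cases_a + 1 / cases_b)  # SE of log(ratio)
    lower = math.exp(math.log(ratio) - 1.96 * se_log)
    upper = math.exp(math.log(ratio) + 1.96 * se_log)
    return ratio, lower, upper

# Invented counts: 200 cases in 10,000 person-years among women <30,
# 300 cases in 10,000 person-years among women 30+.
ratio, lower, upper = rate_ratio(200, 10_000, 300, 10_000)
print(f"rate ratio = {ratio:.2f} (95% CI {lower:.2f} to {upper:.2f})")
```

With several covariates or continuous age, the regression model does the same job while adjusting for the other terms.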

But again, I would echo the advice of those suggesting you bring a statistician or someone familiar with survival analysis on board.

Fomite

I don't use Stata or Prism 5, so I can't help with that part, but a survival analysis does sound appropriate, at least from what you've said. The most commonly used method is Cox proportional hazards regression, which is certainly available in Stata.

Instructions on how to code the data are program specific, so I will leave it to a Stata expert to help with that.

Survival analysis can deal with various kinds of censoring. In particular, you have two types to deal with: 1) Some women were positive when you first saw them, and 2) Some women were negative when you last saw them. But, again, how to do this will be dependent on the program you use.

You can include age as a time varying covariate.

onestop
Peter Flom
  • I'd agree with most of this, except that instead of including age as a time-varying covariate you'd almost certainly be better off using age as the time scale. See Korn, Graubard & Midthune 1997 http://aje.oxfordjournals.org/content/145/1/72. – onestop Dec 30 '10 at 16:57
  • Thanks! Can you recommend any books on a. survival analysis of this kind; b. programming advice in any program (I'm not married to Stata, my university has a licence for all major programs; I tried R but the learning curve of both statistics and R was a bit too steep...) – JP Bogers Dec 31 '10 at 05:24
  • Applied Survival Analysis by Hosmer and Lemeshow is good; if you use SAS, then Survival Analysis Using SAS by Allison is very good and clear. – Peter Flom Dec 31 '10 at 22:08