This may be too late, looking at the time since you asked your question, but I'm going at it anyway, just in case.
First, I'd like to echo the suggestion that you go talk to your friendly local biostatistician, or at least a methods-oriented epidemiologist. Your data sounds too nice to waste on bad analysis.
Taking your questions in order:
We work on cervical cancer detection and we started accumulating data in 2006 by performing HPV testing on all our samples. We have accumulated a data set with almost 400,000 women. Data enters the database on a daily basis. Most women come for a yearly smear so most women are more than one time in the database (2-13x). Most women are HPV negative (appr 85%) but some become HPV positive during this observation period.
Sounds like you've got a delightful cohort study on your hands. Congrats!
I would be very interested in looking at the rate of infection (incidence) of HPV16. I suppose I will have to look at the women with at least 2 samples and restricting to women going from a negative sample to a positive sample. I have a few questions:
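Requiring at least two samples is probably justified. Restricting to women who go from a negative to a positive sample is not: the women who stay negative supply time at risk, and that time belongs in the denominator of your incidence rate.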
How do I start? Do I use a survival analysis and how do I do this specifically in Stata or Prism5?
Yes. What you're looking for is survival analysis, and if you're particularly interested in the incidence rate (cases per unit of person-time), what you're looking for is Poisson regression. The specifics of implementing such a model in Stata should be easy enough to find with a quick search.
Do I censor some women and how (these women who are HPV16+ on the first sample, we don't know when they acquired this)?
Your program should be able to handle this type of censoring, but I'm not an expert in Stata, so I can't give you specifics as to how.
What do I do with all the HPV negative women, do they stay in the analysis?
These women absolutely stay in the analysis. You follow them until they either develop HPV (and thus become cases) or until your study ends (at which point they become censored). You want them in there - partially because survival analysis techniques actually assume they do get HPV at some point - just some point in the unknown, infinite future.
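To make the "follow them until they become a case or get censored" idea concrete, here is a minimal sketch in Python rather than Stata. Everything in it is made up for illustration: the visit records, the `follow_up` helper, and the choice to leave out women who are positive at their first sample or who have only one sample. It also dates each infection to the first positive sample, which is a simplification, since the true infection happened somewhere between the last negative and the first positive visit.

```python
from datetime import date

# Hypothetical visit records: (woman_id, sample_date, hpv16_positive)
visits = [
    ("A", date(2006, 3, 1), False),
    ("A", date(2007, 3, 1), False),
    ("A", date(2008, 3, 1), True),   # converts: counts as a case
    ("B", date(2006, 6, 1), False),
    ("B", date(2009, 6, 1), False),  # never positive: censored at last sample
    ("C", date(2007, 1, 1), True),   # positive at first sample
    ("C", date(2008, 1, 1), True),
    ("D", date(2008, 5, 1), False),  # single sample: no follow-up time
]


def follow_up(visits):
    """Return (cases, person_years) for women who start out HPV16 negative."""
    by_woman = {}
    for woman, day, positive in sorted(visits, key=lambda v: (v[0], v[1])):
        by_woman.setdefault(woman, []).append((day, positive))

    cases = 0
    person_years = 0.0
    for samples in by_woman.values():
        first_day, first_positive = samples[0]
        # Assumption: skip women positive at baseline or with one sample only.
        if first_positive or len(samples) < 2:
            continue
        end_day, event = samples[-1][0], False  # default: censored at last sample
        for day, positive in samples[1:]:
            if positive:
                end_day, event = day, True      # follow-up stops at first positive
                break
        cases += event
        person_years += (end_day - first_day).days / 365.25
    return cases, person_years


cases, person_years = follow_up(visits)
rate = cases / person_years
print(f"{cases} case(s) over {person_years:.2f} person-years, rate {rate:.3f} per person-year")
```

Note that woman B never becomes a case, yet her three years of follow-up are in the denominator. Dropping her would inflate the rate.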
How would I compare the incidence rate according to age? I would specifically like to know if young women (<30 years) have a different pattern when compared to women of >30 years.
You can include a binary (<30 vs 30+) age variable in the model or, preferably, model age continuously. When you run your analysis, you'll get a number of model coefficients. exp(coefficient for your age variable) is the rate ratio comparing old with young (for the binary variable) or for a one-year step in age (for the continuous variable). So for example, if you used the binary variable and got a result of 1.50, it means older women acquire HPV at 1.5 times the rate of younger women, i.e. a 50% higher rate per unit time. This is an incidence density ratio, which is similar to a relative risk.
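As a rough illustration of where that 1.50 comes from: with a single binary age variable and person-time as the exposure, the Poisson regression coefficient is just the log of the ratio of the two crude rates, so it can be worked out by hand. The counts below are invented to reproduce the 1.50 example, and the variable names are mine. The confidence interval is the usual Wald interval on the log scale.

```python
import math

# Hypothetical aggregated follow-up per age group: (cases, person_years)
groups = {
    "under_30": (40, 10_000.0),
    "30_plus": (60, 10_000.0),
}

cases_young, py_young = groups["under_30"]
cases_old, py_old = groups["30_plus"]

rate_young = cases_young / py_young
rate_old = cases_old / py_old

# Poisson model: log(E[cases]) = log(person_years) + b0 + b1 * older
# With one binary covariate, the maximum likelihood estimates are:
b0 = math.log(rate_young)                       # log rate in the reference group
b1 = math.log(rate_old) - math.log(rate_young)  # log rate ratio, old vs young

rate_ratio = math.exp(b1)

# Wald 95% interval: standard error of the log rate ratio is sqrt(1/a + 1/b)
se = math.sqrt(1 / cases_old + 1 / cases_young)
ci_low = math.exp(b1 - 1.96 * se)
ci_high = math.exp(b1 + 1.96 * se)

print(f"rate ratio {rate_ratio:.2f} (95% CI {ci_low:.2f} to {ci_high:.2f})")
```

A real analysis would fit the model in your statistics package so that you can adjust for other variables or treat age as continuous, but the interpretation of exp(coefficient) stays the same.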
But again, I would echo the advice of those suggesting you bring a statistician or someone familiar with survival analysis on board.