We are currently collecting data for a study whose purpose is to show whether scientists are focusing more or less on a specific subject over time. To preserve some privacy, let's say the subject is jelly beans: we reviewed a thousand random studies and checked whether each one was about jelly beans or not. The dataset has only two columns and looks like this:
| JellyBeans | Year |
|------------|------|
| YES | 2010 |
| NO | 2001 |
| NO | 2010 |
| NO | 2015 |
| YES | 2009 |
| NO | 2016 |
| ... | .... |
| YES | 1999 |
We thought of using logistic regression for this, since the dependent variable (DV) is categorical with two levels (YES/NO). In R, this would look something like:

    logreg_jelly_year = glm(JellyBeans ~ Year, family = "binomial", data = dataset)
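To make the setup concrete, here is a minimal sketch of the same fit on simulated stand-in data. The simulated values, the `YearC` column, and the explicit factor coding are assumptions for illustration only, not our real data or results. The sketch codes `JellyBeans` as a factor with `NO` as the reference level, so the model describes the probability of `YES`, and shifts `Year` so that the intercept refers to the earliest year in the sample.

```r
# Simulated stand-in for the real dataset (hypothetical values only)
set.seed(1)
dataset <- data.frame(
  JellyBeans = sample(c("YES", "NO"), 1000, replace = TRUE),
  Year       = sample(1999:2016, 1000, replace = TRUE)
)

# glm() treats the first factor level as "failure", so put NO first
dataset$JellyBeans <- factor(dataset$JellyBeans, levels = c("NO", "YES"))

# Shift Year so that 0 corresponds to the earliest year in the sample
dataset$YearC <- dataset$Year - min(dataset$Year)

# Logistic regression of P(JellyBeans == "YES") on the shifted year
logreg_jelly_year <- glm(JellyBeans ~ YearC, family = binomial, data = dataset)

summary(logreg_jelly_year)

# Odds ratio per one-year increase
exp(coef(logreg_jelly_year)["YearC"])
```

Shifting `Year` does not change the slope or its test; it only changes what the intercept refers to.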
We have, however, some doubts about the validity of the procedure, in particular:
- Is there any specific assumption we have to check that could jeopardise the scientific value of the procedure?
- Is the fact that `Year` is not truly continuous a problem?
- Is there any other test or procedure that we should run on top of, or instead of, logistic regression?