Can logistic regression be used with "years" as a continuous variable?

Question

We are currently collecting data for a study whose purpose is to show whether scientists are focusing more or less on a specific subject over time. To preserve some privacy, let's say the subject is jelly beans: we reviewed a thousand random studies and checked whether each was about jelly beans or not. The dataset has only two columns and looks like this:

| JellyBeans | Year |
|------------|------|
|    YES     | 2010 |
|    NO      | 2001 |
|    NO      | 2010 |
|    NO      | 2015 |
|    YES     | 2009 |
|    NO      | 2016 |
|    ...     | .... |
|    YES     | 1999 |

We thought of using logistic regression for this purpose, as the dependent variable (DV) is categorical. In R, this would look something like:

logreg_jelly_year = glm(JellyBeans ~ Year, family = "binomial", data = dataset)

We have, however, some doubts about the validity of the procedure, in particular:

Is there any specific assumption we have to check that could jeopardise the scientific value of the procedure?
Is the fact that Year is not truly continuous a problem?
Is there any other test or procedure that we should run on top or instead of logistic regression?

See also http://stats.stackexchange.com/questions/65900/does-it-make-sense-to-use-a-date-variable-in-a-regression Any kind of model seems overkill here to me: why not just plot the observed proportions against time? — Nick Cox, May 23 '16 at 09:33

score 3 · Answer 1 · answered May 23 '16 at 08:54

Yes, you can use years as a continuous variable in your model. However, I would not estimate a logit model for this problem. Some specific issues:

The way to show your data here is as a plot, where the x-axis shows the years and the y-axis shows the proportion of studies about jelly beans. Estimating a logit model to do this carries the risk of making an error and offers no benefit in terms of interpretation.
If you are desperate to compute a p-value, you would be better off using Kendall's tau-b, as then you have almost no distributional assumptions to worry about.
If the plot reveals a non-linear relationship I suppose you could use a logit model with a polynomial effect, using, say, JellyBeans ~ poly(Year, 3) or something similar and a likelihood ratio test for significance of the model.
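
The three points above can be sketched in R roughly as follows. This is only an illustration under stated assumptions: the data are simulated (the real dataset is not available), the helper column `is_jelly` and the object names are made up for the example, and `cor.test(..., method = "kendall")` is, as far as I know, tie-corrected (tau-b) when ties are present.

```r
# Simulated stand-in for the real data (hypothetical values, illustration only)
set.seed(1)
n <- 1000
Year <- sample(1999:2016, n, replace = TRUE)
p <- plogis(-2 + 0.08 * (Year - 1999))   # assumed mild upward trend
dataset <- data.frame(
  Year = Year,
  JellyBeans = factor(ifelse(rbinom(n, 1, p) == 1, "YES", "NO"),
                      levels = c("NO", "YES"))
)
dataset$is_jelly <- as.numeric(dataset$JellyBeans == "YES")  # 1 = YES, 0 = NO

# 1. Observed proportion of jelly-bean studies per year, plotted against time
prop_by_year <- aggregate(is_jelly ~ Year, data = dataset, FUN = mean)
plot(prop_by_year$Year, prop_by_year$is_jelly, type = "b",
     xlab = "Year", ylab = "Proportion of studies about jelly beans")

# 2. Rank-based test of a monotone trend (Kendall's tau)
kt <- cor.test(dataset$Year, dataset$is_jelly, method = "kendall")
print(kt)

# 3. If the plot looks non-linear: cubic polynomial logit model,
#    compared with the intercept-only model by a likelihood ratio test
fit_null <- glm(JellyBeans ~ 1, family = binomial, data = dataset)
fit_poly <- glm(JellyBeans ~ poly(Year, 3), family = binomial, data = dataset)
lrt <- anova(fit_null, fit_poly, test = "Chisq")
print(lrt)
```

Note that `JellyBeans` is stored as a factor with `NO` as the first level, so `glm` models the probability of `YES`.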

answered May 23 '16 at 08:54

Tim

I'd try a logistic GAM instead of a polynomial. – Roland May 23 '16 at 08:56
Better to use splines instead of polynomials. Polynomials are unstable at the extremes, and the later years may be exactly what we are most interested in. And once you use splines, a logistic model *does* give you more information than a simple Kendall's correlation. – Stephan Kolassa May 23 '16 at 09:00
I saw somewhere (I do not remember where) that it can be useful to add or subtract a constant from the _year_ variable in order to treat it as continuous. A GAM might be an option as well. – rsl May 23 '16 at 10:34
@Moazzem Hossen: subtracting the mean of Year can help avoid numerical error, but this problem is simple enough that it will not be necessary. – Tim May 23 '16 at 23:25
@Roland I agree a spline is a better solution. But it is also a lot harder to implement if the asker is not so familiar with the area. – Tim May 23 '16 at 23:27
@Tim, regarding your first point, would you run a simple linear regression between years and the proportion of jelly beans? That would simplify my life AND give me a so-desired p-value. – Edgar Derby May 25 '16 at 13:04
@Secret Parrot, a regression is going to be just as useful as a logit model. But a plot and a nonparametric test are still the better approach in my opinion. – Tim May 27 '16 at 08:02
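
For readers who want to try the spline and centering suggestions from these comments, here is a minimal sketch. It again uses simulated data, and the names `Year_c`, `fit_lin` and `fit_spl` are hypothetical. It uses a natural cubic spline from the `splines` package inside `glm` rather than a full GAM, which is a simplification of what Roland proposed; `df = 3` is an arbitrary choice.

```r
library(splines)  # ships with R; provides ns()

# Simulated stand-in for the real data (hypothetical values, illustration only)
set.seed(2)
n <- 1000
Year <- sample(1999:2016, n, replace = TRUE)
dataset <- data.frame(
  Year = Year,
  JellyBeans = rbinom(n, 1, plogis(-2 + 0.08 * (Year - 1999)))  # 1 = YES
)

# Centering Year changes the intercept but not the slope
dataset$Year_c <- dataset$Year - mean(dataset$Year)
fit_raw <- glm(JellyBeans ~ Year,   family = binomial, data = dataset)
fit_lin <- glm(JellyBeans ~ Year_c, family = binomial, data = dataset)
print(c(raw = unname(coef(fit_raw)[2]), centered = unname(coef(fit_lin)[2])))

# Natural cubic spline in Year; the linear model is nested within it,
# so a likelihood ratio test checks for departure from linearity on the logit scale
fit_spl <- glm(JellyBeans ~ ns(Year_c, df = 3), family = binomial, data = dataset)
lrt_spl <- anova(fit_lin, fit_spl, test = "Chisq")
print(lrt_spl)

# Fitted probability of a jelly-bean study for each year
grid <- data.frame(Year = 1999:2016)
grid$Year_c <- grid$Year - mean(dataset$Year)
grid$p_hat <- predict(fit_spl, newdata = grid, type = "response")
print(grid[, c("Year", "p_hat")])
```

Plotting `p_hat` over the observed yearly proportions is a quick way to check whether the smooth curve says anything the raw plot does not.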
