We are currently collecting data for a study whose purpose is to show whether scientists are focusing more or less on a specific subject over time. To preserve some privacy, let's say the subject is jelly beans: we reviewed a thousand random studies and checked whether each one was about jelly beans or not. The dataset has only two columns and looks like this:

| JellyBeans | Year |
|------------|------|
|    YES     | 2010 |
|    NO      | 2001 |
|    NO      | 2010 |
|    NO      | 2015 |
|    YES     | 2009 |
|    NO      | 2016 |
|    ...     | .... |
|    YES     | 1999 |

We thought of using logistic regression for this purpose, as the dependent variable (DV) is categorical. In R, this would look something like:

    logreg_jelly_year = glm(JellyBeans ~ Year, family = "binomial", data = dataset)

We have, however, some doubts about the validity of the procedure, in particular:

  1. Is there any specific assumption we have to check that could jeopardise the scientific value of the procedure?
  2. Is the fact that Year is not truly continuous a problem?
  3. Is there any other test or procedure that we should run on top of, or instead of, logistic regression?
amoeba
Edgar Derby
  • See also http://stats.stackexchange.com/questions/65900/does-it-make-sense-to-use-a-date-variable-in-a-regression Any kind of model seems overkill here to me: why not just plot the observed proportions against time? – Nick Cox May 23 '16 at 09:33

1 Answer

Yes, you can use year as a continuous variable in your model. However, I would not estimate a logit model for this problem. Some specific issues:

  • The way to show your data here is as a plot, where the x-axis shows the years and the y-axis shows the proportion of studies about jelly beans in each year. Estimating a logit model to do this brings the risk of making an error, with no benefit in terms of interpretation.
  • If you are desperate to compute a p-value, you would be better off using Kendall's tau-b, as then you have very few assumptions to worry about (essentially only that the studies are independent observations).
  • If the plot reveals a non-linear relationship, I suppose you could use a logit model with a polynomial effect, using, say, JellyBeans ~ poly(Year, 3) or something similar, and a likelihood ratio test for the significance of the model.
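The three suggestions above could be sketched in R roughly as follows. This is only an illustration on simulated data: the `dataset` values, the assumed upward trend, and the helper column `jb` are all made up for the example and are not the asker's data. Kendall's correlation is computed with base R's `cor.test(..., method = "kendall")`, which, to my understanding, adjusts for ties in the tau-b manner.

```r
# Simulated stand-in for the real data (hypothetical values, illustration only)
set.seed(1)
n <- 1000
Year <- sample(1999:2016, n, replace = TRUE)
p_true <- plogis(-2 + 0.05 * (Year - 1999))  # assumed slowly rising share
dataset <- data.frame(
  JellyBeans = ifelse(rbinom(n, 1, p_true) == 1, "YES", "NO"),
  Year = Year
)

# Code the outcome as 0/1 so it can be averaged, correlated and modelled
dataset$jb <- as.integer(dataset$JellyBeans == "YES")

# 1. Observed proportion of jelly-bean studies per year, plotted against time
prop_by_year <- aggregate(jb ~ Year, data = dataset, FUN = mean)
plot(prop_by_year$Year, prop_by_year$jb, type = "b",
     xlab = "Year", ylab = "Proportion about jelly beans")

# 2. Kendall's rank correlation between year and the 0/1 outcome
#    (exact = FALSE because the data contain many ties)
kendall <- cor.test(dataset$Year, dataset$jb, method = "kendall", exact = FALSE)
print(kendall)

# 3. Cubic polynomial logit model, compared with the intercept-only
#    model by a likelihood ratio test
fit_null <- glm(jb ~ 1, family = binomial, data = dataset)
fit_poly <- glm(jb ~ poly(Year, 3), family = binomial, data = dataset)
lrt <- anova(fit_null, fit_poly, test = "Chisq")
print(lrt)
```

Note that the outcome is converted to 0/1 first: `glm` with a binomial family expects a factor or a 0/1 response, so a character column of "YES"/"NO" would need this kind of recoding.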
Tim
  • I'd try a logistic GAM instead of a polynomial. – Roland May 23 '16 at 08:56
  • Better to use splines instead of polynomials. Polynomials are unstable at the extremes, and the later years may be exactly what we are most interested in. And once you use splines, a logistic model *does* give you more information than a simple Kendall's correlation. – Stephan Kolassa May 23 '16 at 09:00
  • I read somewhere (I do not remember where) that it could be useful to add or subtract a constant value from the _year_ variable in order to treat it as continuous. A GAM might be an option as well. – rsl May 23 '16 at 10:34
  • @Moazzem Hossen: subtracting the mean of Year can be useful for avoiding computational error. But the simplicity of this problem means it will not be necessary. – Tim May 23 '16 at 23:25
  • @Roland I agree a spline is a better solution. But it is also a lot harder to implement if the asker is not so familiar with the area. – Tim May 23 '16 at 23:27
  • @Tim, regarding your first point, would you run a simple linear regression of the proportion of jelly beans on year? That would simplify my life AND give me the much-desired p-value. – Edgar Derby May 25 '16 at 13:04
  • @Secret Parrot, a regression is going to be just as useful as a logit model. But a plot and a nonparametric test are still the better approach, in my opinion. – Tim May 27 '16 at 08:02
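For readers who want to try the spline idea raised in the comments, here is a minimal sketch using a natural cubic spline from the `splines` package that ships with R. The data are simulated and purely hypothetical, and `df = 3` is an arbitrary choice made for the example rather than a recommendation.

```r
library(splines)  # ships with standard R distributions

# Simulated stand-in for the real data (hypothetical values, illustration only)
set.seed(1)
n <- 1000
Year <- sample(1999:2016, n, replace = TRUE)
jb <- rbinom(n, 1, plogis(-2 + 0.05 * (Year - 1999)))  # 1 = about jelly beans
dataset <- data.frame(jb = jb, Year = Year)

# Logistic regression with a natural cubic spline in Year
fit_spline <- glm(jb ~ ns(Year, df = 3), family = binomial, data = dataset)

# Likelihood ratio test against the intercept-only model
fit_null <- glm(jb ~ 1, family = binomial, data = dataset)
print(anova(fit_null, fit_spline, test = "Chisq"))

# Fitted probability for each year, e.g. to overlay on the observed proportions
grid <- data.frame(Year = 1999:2016)
grid$fitted <- predict(fit_spline, newdata = grid, type = "response")
print(grid)
```

A logistic GAM, as Roland suggests, would be a close relative of this; with the `mgcv` package it could be written along the lines of `gam(jb ~ s(Year), family = binomial, data = dataset)`, which chooses the amount of smoothing from the data instead of fixing the degrees of freedom.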