
My dataframe looks like this:

Julian day     Year
 153           1951
 161           1952
 167           1953
  .             .
  .             .
 161           2007

I have the Julian day of an event for each year, and I am trying to determine whether the day on which the event occurs is changing over time. Should I treat Julian day as continuous or as count data? My guess is that if I treat it as continuous, I would run a linear regression of Julian day against time, and if it is count data, I would run a Poisson regression against time.

Nick Cox
user3013423
  • Is the Julian day your response variable? Can you say more about your situation, your data, & your goals? – gung - Reinstate Monica Jun 15 '14 at 20:53
  • Hi. Yes, Julian day is my response (dependent) variable and year is my independent variable. My goal is to study whether the timing of the event (measured in Julian day) has changed over the past 60 years. – user3013423 Jun 15 '14 at 20:56
  • Is there a first date when this event could have occurred? It sounds like your response variable is a *duration*, in which case you want to use a survival analysis, not a count model. – gung - Reinstate Monica Jun 15 '14 at 21:02
  • Not really. Basically, the Julian day for each year is the day when the monsoon begins. Therefore I am examining whether the day of monsoon onset has changed over time. – user3013423 Jun 15 '14 at 21:24
  • Strictly speaking, it's discrete but not actually a count. It's arguably a count of days since an origin, but not in the sense that would lead to a distribution like a Poisson, a binomial, a negative binomial or a hypergeometric; it's not a count variable in the usual sense. I'd suggest you treat it as continuous time, because it's a measure of 'how long' after the start of the year something happens, and the discreteness from only measuring it to the day should not matter for that. – Glen_b Jun 15 '14 at 21:33
  • The day of monsoon onset is a duration since the beginning of the year. It isn't a count in any meaningful sense. In addition, the beginning of the year is arbitrary. In truth, any number of days could be used to start the clock so long as it is consistent across the years, but I would try to ground it in something outside of the calendar. Perhaps the number of days since the solstice might work (I'll admit I know little of this topic). – gung - Reinstate Monica Jun 15 '14 at 22:30
  • Another issue is that there will be autocorrelations in this series. You will need to account for those somehow. A straight regression (survival, Poisson, OLS, etc.) will be inappropriate. – gung - Reinstate Monica Jun 15 '14 at 22:32
  • A bit late to the party, but this isn't Julian day! At least it isn't when compared with https://simple.wikipedia.org/wiki/Julian_day I think you're talking about day of the year, in which 1 January = 1 and 31 December = 365 or 366. Either way, as @Glen_b-ReinstateMonica explains, it's discrete in principle but with many possible values it is approximately continuous. Note that if the event moved back and forth between, say, December and January, you would need to start counting at a different time of year. – Nick Cox Dec 11 '19 at 17:56

1 Answer


It is count data

The day of the year is a discrete variable with a categorical distribution.

In your case you could also consider it as count data, because it describes the number of days that you count until something (the monsoon) happens. That is different from the case where the day number has no ordered meaning. E.g. 'the day when somebody is born' does not relate to something like a waiting time.

Count data: how does it help you?

I imagine that the waiting time for the monsoon does not allow you to use typical distributions for count data like the Poisson or negative binomial. Those fit better when the count is a sum of simple individual events, each of which has a simple distribution.

An example would be when each day has an independent probability of monsoon onset, as if the onset were determined by rolling a die. But that doesn't seem realistic to me.

Possibly there are more advanced models that treat it in some way like that (I guess the problem is uncertainty about the model, and it is questionable whether this would help you). For instance, each day adds a random amount of change until some random trigger point is reached (e.g. the distribution of the waiting time for a sum of exponentially distributed variables to surpass some level is related to a Poisson distribution).
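As a rough illustration of that last remark, here is a small simulation sketch in Python. It assumes each day contributes an independent exponentially distributed increment and that onset happens on the first day the running total exceeds a fixed level. The function name, the rate, the level and the seed are all made up for illustration and are not fitted to any monsoon data. Under these assumptions the number of days before the crossing day is Poisson distributed with mean rate × level.

```python
import random
import statistics

def days_until_threshold(level, rate, rng):
    """Count days until a running sum of Exp(rate) increments exceeds level."""
    total = 0.0
    days = 0
    while total <= level:
        total += rng.expovariate(rate)  # hypothetical daily increment
        days += 1
    return days

# Illustrative values only (assumptions, not estimates from real data).
rng = random.Random(1)
level, rate = 160.0, 1.0

onset_days = [days_until_threshold(level, rate, rng) for _ in range(5000)]

# Days completed before the crossing day; Poisson(rate * level) in this toy model.
days_before = [d - 1 for d in onset_days]
mean_before = statistics.mean(days_before)
var_before = statistics.variance(days_before)
print(mean_before, var_before)
```

The simulated mean and variance should both come out close to rate × level, which is the Poisson signature. Real onset dates need not behave like this, which is the reason for the doubt expressed above.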

Treat the sampled statistic as continuous

Anyway, you can treat it as a continuous variable. The underlying categorical distribution can likely be described by a few parameters (rather than by a separate frequency/probability for each of the 365 days), and something like the mean of the distribution might be a meaningful quantity, which could be estimated by the sample mean. The sample mean is a discrete variable with such small steps that you could analyze it as a continuous variable, i.e. neglect errors due to discreteness.

Plotting a histogram of your data might help you decide what type of analysis would be best. Possibly the mean is not the interesting quantity, and what changes is instead the variance or some other aspect.
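If you do go the continuous route, a minimal sketch of the linear trend could look like the following. It uses only the four rows visible in the question (the full series has one row per year), and the function name `ols_trend` is just an illustrative helper, so the resulting slope says nothing about the real data.

```python
def ols_trend(years, days):
    """Least-squares slope and intercept of day of year against year."""
    n = len(years)
    mean_x = sum(years) / n
    mean_y = sum(days) / n
    sxx = sum((x - mean_x) ** 2 for x in years)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, days))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Only the rows shown in the question; replace with the full series.
years = [1951, 1952, 1953, 2007]
days = [153, 161, 167, 161]

slope, intercept = ols_trend(years, days)

# Residuals from the fitted line, for plotting or further checks.
residuals = [y - (intercept + slope * x) for x, y in zip(years, days)]
print(f"slope: {slope:.4f} days per year")
```

The residuals are worth plotting against year and checking for autocorrelation, which the comments under the question flag as a concern for a plain regression.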


Count data is a bit ambiguous and you can look at it in different ways.

  • Counts can relate to aggregate data. E.g. when $X_i$ is a categorically distributed variable, where $i$ indexes the individual observations, you can tabulate the $X_i$ by counting how often each particular value occurs.
  • Counts can also relate to an individual sampled value, where the value itself is a count, e.g. counting the number of days until something (the monsoon) happens.
Sextus Empiricus