A regression model whose response variable is the day of year that an annual event (usually) occurs

Question

In this particular case I'm referring to the day on which a lake freezes. This "ice-on" date only occurs once a year, but sometimes it doesn't occur at all (if the winter is warm). So on one year the lake may freeze on day 20 (january 20th), and another year it might not freeze at all.

The goal is to figure out drivers of ice-on date.

Predictors would be things like fall/ winter air temperature each year. Year could be a predictor for the long-term linear trend.

1) Is the integer "day of year" a reasonable response variable (if not, what is?)?

2) How should one handle the years when the lake never froze?

Edit:

I don't know what the etiquette is here, but I figured I'd post the outcome of the suggestions I received. Here's the paper, open access. I got good feedback on the approach used, thanks @pedrofigueira and @cboettig . Of course, errors are my own.

what kind of dataset do you have? Measures during all the days of the year? — Donbeo, May 25 '14 at 23:01
@Donbeo, ice-on occurs once a year, so the response variable is at an annual resolution. The other data come in at an annual frequency as well, but in some cases could be converted to higher frequency data. — rbatt, May 26 '14 at 01:16
For which purpose do you want to consider the ice-on date? I ask this because statistical modelling is never true or false but useful or useless. So the use for the statistical results matters, also the insight if the target variable is of use at all. E.g. what if the lake freezes with a thin ice shield already in October but melts the same week and never freezes again this winter? Maybe you do your analysis to predict when to start using something like snow tires? This could give a hint to a useful answer to your 2nd question. — Horst Grünbusch, Jun 01 '14 at 06:15
Thanks for your thoughts, @HorstGrünbusch. I want to know how variations in climate have affected ice, because putting a lid on an aquatic system affects a lot of things (gas exchange, light, etc). The only ice data available are these ice-on dates (not thickness, etc). — rbatt, Jun 01 '14 at 18:28

pedrofigueira · Accepted Answer · 2014-06-05T13:16:38.397

4

I think one can consider "day of the year" as a response variable to a multivariate regression. In order to handle years when the lake never froze I would simply consider that the day of freezing is larger than an observable lower limit which corresponds, for instance, to the day when ice content starts to melt (or melts completely, if you want to be very conservative). Theoretically it should freeze after that, or can freeze after that, but we do not know. This way you could use the data you collected on the different parameters to understand how the freezing day depends on them, if it was allowed to be later than the latest observable date. You can then use a Tobit model to handle simultaneously freezing days (corresponding to "normal" datapoints) and lower limits (corresponding to limits and thus a censored regression).

In order to correctly include the measured lower limits in the analysis, you can use a censored regression model in which the dependent variable has a cut-off at the value of the lower limit. The above-mentioned Tobit model is appropriate for this case; it assumes the existence of an unobservable (latent) dependent variable $y_i^*$ which in our case corresponds to the freezing date if the winter extended indefinitely. The observable dependent variable $y_i$ (i.e. the measured lower limit on freezing date) is then taken to be equal to the latent variable in the absence of a lower limit $L_i$, and equal to the lower limit otherwise

\begin{eqnarray} y_i = \left\{ \begin{array}{ll} y_i^* & \quad \mathrm{if} \quad \bar{\exists}\,L_i \:\: (\textrm{i.e.} \, y_i^* < L_i) \\ L_i & \quad \mathrm{if} \quad y_i^*\geq L_i \end{array} \right. \end{eqnarray}

The application of the Tobit model to handle observation-by-observation censoring, results in a log-likelihood function of the form

\begin{eqnarray} \mathcal{L} = \sum_{i \,\in\, y_i^* < L_i} ln \left[ \phi\left(\frac{y_i-X_{ij}\beta_j}{\sigma}\right)/\sigma \right] \,+\, \sum_{i \,\in\, y_i^*\geq L_i}ln \left[ \Phi\left(\frac{L_i-X_{ij}\beta_j}{\sigma}\right) \right] \, \, \end{eqnarray}

where $\phi(.)$ and $\Phi(.)$ denote the probability and cumulative density functions, respectively, of the the standard normal distribution. The index $i$ runs on the observations and $j$ on the independent variables. The solution to the linear regression is the set of parameters $\beta_j$(including intercept) that maximizes the log-likelihood function.

edited Jun 05 '14 at 13:16

answered May 26 '14 at 14:06

pedrofigueira

741
3
17

3

The big issue with "day of year" concerns how to encode it. Ordinarily it would be represented as a Julian day between $1$ and $365$ or as a decimal year from $0$ to $1$, but neither of these is appropriate because this is a *circular* variable: the Julian day of $1$ immediately follows day $365$, for instance. Thus, in particular, "upper" and "lower" limits are meaningless. (There is also a minor issue of how to handle leap years; this could be resolved in various simple ways.) The other big issue concerns handling years where freezing does not occur: this is *not* missing or censored data. – whuber May 26 '14 at 14:26
1

I would argue that the concept of lower limit retains its meaning if each year can be considered as an independent experiment, i.e., if the experiment does not have memory and the freezing date in one year can be assumed to be completely independent from the date in the previous; then it should depend only on the parameters of the year in question. If that is the case, then, to the best of my understanding, the variable is not circular. – pedrofigueira May 26 '14 at 14:31
I think you must have a different understanding of ["circular."](http://en.wikipedia.org/wiki/Directional_statistics) For example, how would you deal with a situation where lakes freeze on January 20 *and* on December 30 of the same year? If you were fitting a model to the data and your model predicted a freeze-up on January 1 but the freeze-up occurred on December 31 of the previous year, would you penalize the model heavily or very lightly? Obviously the latter is correct--but that corresponds to a circular variable whereas the former does not. – whuber May 26 '14 at 14:35
@whuber I see your point. This means that one cannot use day of the year as a variable, even if the measurements are independent as I said. But one could use the day as measured from the equinox, for instance. In that case the day would always be in the interval [0, melting date/lower limit]. You would only have one measurement for "pseudo-year". – pedrofigueira May 26 '14 at 14:40
1

Yes, in some circumstances such *ad hoc* techniques can work. When (a) the event *always* occurs each year and (b) the events are tightly dispersed around a predictable date, you will be fine by choosing the year's origin appropriately. But with larger amounts of dispersion (which is likely the case here)--or in the most drastic cases when the event may be absent altogether--you really need to apply the methods of circular ("directional") statistics. BTW, serial correlation or independence is a separate concern altogether. – whuber May 26 '14 at 14:45
You both raise several valid points. @whuber, what if a freeze date of December 31 was simply encoded as -1? Freeze dates only span a range of a few months, so nothing outside -60 or +60 is ever observed, e.g. (I think this was one of your suggestions). Also, what were your concerns about serial correlation? Finally, I'm still unsure of how to handle no-freeze years – perhaps there is a binomial component to this model, but I'm very unsure. – rbatt May 26 '14 at 15:22
1

Encoding December 31 as -1 does not solve the problem, because then it would be 364 days away from December 30. You could number the days with an origin of, say, July 31, but it's still a good idea to model the circular nature of the data explicitly. I have little concern about serial correlation now because it's too early in the modeling process to address it. I would favor a model in which the preconditions necessary for a freezeup were estimated, rather than the date of the freeze. That would make absence of a freeze simple and natural to describe: the preconditions simply did not occur. – whuber May 26 '14 at 20:52
@whuber Each year gets an ice-on date. An ice-on date is always reference by the "new year" of that winter. So ice-on for 2014 could occur in December 2013, or February 2014. The date is recorded as a DoY, where positive DoY indicate julian day for current "ice year" (i.e., 2014, Jan 5 is DoY = 5), and negative indicates a julian day for the calendar year previous to the current "ice year" (i.e., an ice-on date of Dec 22 2013 is DoY = -10). The range of ice-on dates is fairly constrained Nov - Feb. – rbatt May 29 '14 at 17:11
I'm warming up to the idea of the Tobit model. How would one choose the upper censoring limit? Just use the latest ice-on date ever observed? Are there any situations where it would make sense for the upper limit to change between samples? – rbatt Jun 01 '14 at 21:14
2

I think the upper limit should be year defined as precisely as possible; if one can do so the Tobit analysis becomes more insightful. I would suggest as lower limit (freezing could happen before, but was not observed/observable) the DoY beyond which you consider that you cannot detect melting any more. Maybe this can be done by taking a look at the (P, T) required for water to freeze, and assuming a constant pressure, choose the last local minima of the year, or similar. I believe the question at this points becomes more a physical than statistical question (but very interesting, anyway). – pedrofigueira Jun 02 '14 at 14:59
2

@rbatt I think this answer is sensible. The start-date is arbitrary, you can start from some other date or use negative numbers; I don't see an issue. The circularity takes care of itself by numbering by day of the year. – cboettig Jun 02 '14 at 19:17
Cencoring should have a conservative effect on the result. It should not artificially make the regression coefficients larger. The last day where melting is possible is reasonable. However, this day may already belong to the next winter, i.e. may be past the first ice-on date. So you would end up with >365. – Horst Grünbusch Jun 03 '14 at 09:00
@pedrofigueira Could you include your explanation of how to implement the tobit (your comment about upper/ lower limits was helpful) as part of your answer? – rbatt Jun 04 '14 at 17:03
@rbatt thank you. Apologies for not being able to do it earlier. I'll do it now. – pedrofigueira Jun 05 '14 at 12:52
@pedrofigueira, in your revised answer (thanks!), you refer to a "lower limit". Are the limits for the censored regression refer to the limits of possible values of the latent variable? Do limits refer to what can be observed (e.g., upper or lower limits of detection), or to the possible values of what is not observed (if it wasn't detected, it must be outside the limits of detection)? Is my question/ source of confusion clear? – rbatt Jun 05 '14 at 16:34
@rbatt yes, these are the limits of possible values for the measured variable (note that the latent one is the one that is mener measured). The limits are basically the limits of what you observe, i.e., the latent variable can be larger but is never observed because the different factors did not render the observable date earlier and thus observable. – pedrofigueira Jun 06 '14 at 21:26

score 1 · Answer 2 · answered Jun 02 '14 at 19:32

Day of year is one sensible predictor variable, and for that I think it is sensible to treat it as @pedrofigueira suggests.

For other predictor variables you may need to be careful about how you represent time. For instance, imagine you have air temperatures by day -- how would you model air temperature as a predictor of ice-on day? I don't think comparing same day-of-year samples is sufficient.

In any such analysis, I think it helps to write down what you think a plausible generating model (or models) of the data might be, (where some physics might be available as a guide). For instance, a reasonable model might be to integrate the number of days below freezing, and when that integral passes a threshold (e.g. related to the thermal mass of the lake), ice-on occurs. From such a model you can then ask what is a reasonable approximation and what isn't.

For instance, day-of-year as predictor matters to that model only in so much as day of year is a good predictor of temperature. Thus knowing only the day of the year, one would just have an average day-of-year corresponding to the ice-on threshold, with perhaps some normal distribution about it resulting from interannual temperature variations, and therefore looking for a trend in day-of-year is completely justified.

But if you know other variables like air-temp by day, you probably face dealing with somewhat more complicated model more directly. If you are just using the annual values (minimums? means?) than variable as a predictor of ice-on day also seems reasonable (by the same argument as above).

+1 for pointing to physics. If you cannot explain the statistical result by reason, it may be spurious, even if it showed up significant. — Horst Grünbusch, Jun 03 '14 at 08:44
Just to be clear, day-of-year for ice-on is the response variable ... it is what I am trying to "predict" (in your answer you refer to it as the 'predictor' in a few places). Do you have a suggestion for handling the years with no freezing (other the Tobit suggestion below)? — rbatt, Jun 04 '14 at 16:49
@rbatt, sorry for the confusion. The simplest model is 1D, using the day-of-year that ice-on occured in the past as the predictor. But if you want to detect trends in ice-on date, you have the full DATE, not Day Of Year, as the thing you want to predict, because the prediction for, say, 2020 could then differ from that for 2050. — cboettig, Jun 05 '14 at 21:42

score 0 · Answer 3 · answered May 29 '14 at 17:17

0

For this problem you need two response variables. One Boolean response that indicates whether the lake froze or not, and one integer response giving the day of the year, conditional on the indicator being true. In years when the lake froze, both the Boolean and integer are observed. In years when the lake didn't freeze, the Boolean is observed and the integer is not. You can use a logistic regression for the Boolean. The regression for day of the year could be an ordinary linear regression.

The circular nature of the day of the year shouldn't be a problem as long as you number the possible freeze-over days consecutively within a given time period. If you are wondering where to start the numbering, I would suggest the day when the predictors were measured. If you want the model to represent causal effects, it must be the case that all predictors were measured before any possible freeze-over.

To handle the integer and bounded nature of the day of the year, could use a discretization model. That is, there is a real latent value which generates an observation in the following way: if the value is within the bounds then the observation equals the latent value rounded to the nearest integer, otherwise the value is truncated to the bounds. The latent value itself can then be modelled as a linear function of the predictors plus noise.

answered May 29 '14 at 17:17

Tom Minka

6,610
1
22
33

I understand the premise of the approach, but I'm not sure how to implement it. How would I arrange data and estimate the influence of candidate drivers of the boolean/ date? I work in R. – rbatt May 29 '14 at 17:25
Put the data into a data frame where one column is the Boolean and another is the date. Then use: fit1 = glm(froze ~ x, frame, family = "binomial") fit2 = lm(date ~ x, frame) – Tom Minka May 29 '14 at 17:45
Sorry, may I understand "fit2 = lm(date ~ x, frame, subset = Boolean==TRUE)"? – Sergio May 29 '14 at 17:53
Those would be two separate models. In the model where "date" is the response, what do I do with the years when the water never froze? If I simply remove those years, then I am biasing the results (or severely reducing my observed range of responses) because I am selectively removing the most extreme observations of the response (i.e., never freezing is the most extreme ice-on date). So the years when the water never freezes should tell us something about the influence of those drivers on ice-on date. It seems that the information in both models should be combined. – rbatt May 29 '14 at 19:26
I'm uncomfortable with treating freezing as a boolean variable because the underlying process is no doubt more continuous than that. – cboettig Jun 02 '14 at 19:13

kjetil b halvorsen · Answer 4 · 2017-04-17T13:32:54.343

What you have is time-to-event data, which is also termed survival analysis. That is not really my area, so I am not giving a detailed answer here. Googling for "time-to event data" or "survival analysis" will give you a lot of hits!

One good starting point could be the chapter (13) about survival analysis in Venables/Ripley: MASS, or the classic "The Statistical Analysis of Failure Time Data, Second Edition" by John D. Kalbfleisch, Ross L. Prentice(auth.)

EDIT, EXTENDED ANSWER

As an alternative to survival analysis, you could approximate that by ordinal logistic regression. By example, in your example case of first freezing date, define some dates for which you give the "have been freezing at or before" state, 0 (no freezing), 1 (freezing). That nicely accomodates the years without freezing, you simply have an all-zero response vector. If your choosen dates are, say,

1:08   15:08 1:09 15:09 1:10 15:10 1:11 15:11 1:12  15:12  1:01  15:01
and the actual date of first freezing was  17:11, then your observed vector will be
0       0    0    0     0    0     0    0      1     1     1      1

and, in general, all response vectors will have an initial block of zeros, followed by a block of ones. Then, you can use this with ordinal logistic regression, obtaining an estimated probability of freezing for each date. Plotting that curve will give an approximation for a survival curve (survival, in this context, becomes "not having frozen yet").

EDIT

One could also see your data as recurrent events, since river freezes (almost) every year. Se my answer here: Finding significant predictors of psychiatric readmissions

A regression model whose response variable is the day of year that an annual event (usually) occurs

Edit:

4 Answers4