5

According to Yale:

Categorical variables represent types of data which may be divided into groups (Lacey M, 1997)

To me, dates do not fit this definition. They are ordinal, as one date is bigger than the date before it. They are also quantitative, as they can be added, subtracted, etc.

I am interested in correlating these observations to other variables in a sample, so I wanted to perform pre-modelling analysis.

Is my understanding correct?

EDIT:

Thank you for your replies. The general consensus is that dates can be considered either binomial or count data according to these data-type characterisations: https://en.wikipedia.org/wiki/Statistical_data_type#Simple_data_types I tried to fit the explanations in the comments to the data types on Wikipedia, but they don't seem to fit what people actually mean, so I'll reread.

EDIT 2: To give context for the question: I am trying to measure the effect of various processes over time, and these effects may not be linear, but cyclical (e.g. the seasons). The observations have dates (dd/mm/yyyy), but the dates are only significant in relation to the other dates.

kjetil b halvorsen
Sinker
  • 5
    Dates are interval. There's no true 0 beside arbitrary definitions yet the difference between adjacent values is constant. – HEITZ Mar 10 '18 at 09:34
  • Differences between dates are ratio. – Nick Cox Mar 10 '18 at 09:40
  • 2
    Dates cannot be summed, but you can take differences. Geometrically, dates are affine points: differences, means, and other contrasts with coefficients summing to one are defined, but not other sums. So the timeline is an affine line, a one-dimensional affine geometry. See https://en.wikipedia.org/wiki/Affine_geometry – kjetil b halvorsen Mar 10 '18 at 11:15
  • @kjetilbhalvorsen On whether dates can be summed: If I have friends with various years of birth then I am not interested in the sum of those dates, but the mean birth year makes sense and is related to their mean age at any time. This isn't exceptional: for example, total temperatures aren't of interest or even meaning; nevertheless the mean temperature is interesting and useful. Taking a mean certainly depends on taking a sum. – Nick Cox Mar 10 '18 at 11:49
  • What I got from the discussion above is that it seems more correct to refer to dates as part of a ratio scale, or an affine line, but not as much an interval scale...hmmm – Sinker Mar 10 '18 at 12:18
  • @Nick Cox: Algorithmically, yes, taking a mean depends on taking a sum. The *numbers* representing the points (relative to a given origin, or an "affine frame") can be summed, but not the points! Conceptually, means can be defined without using any sum. This is clear physically: the mass center of a steel plate, also called the barycenter, can be given as a (weighted) average, but can be found physically by experiment: just shift the plate around until it balances. No summing there. – kjetil b halvorsen Mar 10 '18 at 14:03
  • @Sinker I don't see anything in the discussion that defines a general consensus, or even supports the view, that you're identifying in your EDIT. I don't even see that dates are essentially counted or discrete: dates are always (equivalent to) integers we assign to finite intervals, but those intervals can always be subdivided, just as where I am we're about 0.62 of the way through 10 March 2018. Reporting dates as (equivalent to) integers is a useful convention: we just choose the resolution we want, but it is not intrinsic. – Nick Cox Mar 10 '18 at 14:53
  • Similarly, dates are **not** ratio scale unless there is a natural zero. When there is, we usually call dates something else, such as the time since the start of a match or a TV programme. – Nick Cox Mar 10 '18 at 14:54
  • 3
    Your edit is surprising: I cannot see any possible way in which a date could be considered a count. – whuber Mar 10 '18 at 17:07
  • 4
    Evil, dates and time are pure unmitigated evil. ;) – russellpierce Mar 10 '18 at 17:30
  • "The general consensus is that dates can either be considered binomial or count data" - uh, what??? Sorry, but *whose* "consensus" is that? Certainly not the one of the commenters here. – Stephan Kolassa Mar 10 '18 at 22:42
  • I must have misunderstood. I am trying to apply the answers to a set of classifications to see which one fits. I'll go through the replies again and try and understand them better – Sinker Mar 11 '18 at 02:50
  • Re edit 2: Now you seem to be trying to ask about what are often called "circular" variables. Focusing on a measurement type usually is less than helpful. Consider reformulating your post to ask about the problem you actually face. – whuber Mar 11 '18 at 16:13

3 Answers

7

This is a tricky question, and personally I feel this question is more about semantics and conventions.

Let's go back to basics. What is a date? It's just a name we give to a period of 86,400 seconds. A date, by definition, is counted from a reference point (year 1 AD). You could simply treat dates as natural numbers if your problem is about numbers of days. Or you could convert days to seconds and count seconds from the first day of 1 AD. In other words, a date is a 'name' we give to a specific range of numbers.

You can argue that date is a categorical variable, as you can sort dates into seven categories ("Sunday", "Monday", etc.). But will that serve the purpose?

Or you could treat a date as a range of numbers (seconds, minutes or hours), measured with reference to a particular date or point in time.
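As a minimal sketch in R of the two views just described, the snippet below puts the numeric and the categorical reading of the same dates side by side. The dates and variable names are made up for illustration. Note that R counts days from 1 January 1970 rather than from year 1 AD, so the reference point differs from the one mentioned above.

```r
# Hypothetical observation dates, chosen only for illustration
d <- as.Date(c("2018-03-10", "2018-03-11", "2018-03-17"))

# Numeric view: whole days counted from R's internal origin, 1970-01-01
days_since_origin <- as.numeric(d)

# Numeric view with another reference point: days since the first observation
days_since_first <- as.numeric(d - d[1])

# Categorical view: day of week as a factor with seven levels.
# as.POSIXlt()$wday runs from 0 (Sunday) to 6 (Saturday).
labels <- c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat")
day_of_week <- factor(labels[as.POSIXlt(d)$wday + 1], levels = labels)

result <- data.frame(date = d, days_since_origin, days_since_first, day_of_week)
print(result)
```

The two numeric columns differ only by a constant shift, which is one way to see that the reference point is a convention. The day-of-week factor, by contrast, throws away the ordering across weeks.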

I feel this question doesn't have a universally agreed answer, as dates can be used in so many ways in a variety of applications.

Ultimately you'll have to think about the specific application you're looking at and then take a call.

tired and bored dev
  • 1
    It is not basic that dates are here taken to mean daily dates. That may be a good guess about language usage, but there are dates on many scales, as you say. – Nick Cox Mar 10 '18 at 14:57
6

It is correct that dates do not fit nicely into the Stevens typology https://en.wikipedia.org/wiki/Level_of_measurement#Ordinal_scale of different levels of measurement. Dates are certainly ordered, so we could say that dates are ordinal type, but they are certainly more than that. When talking specifically about days in this sense, astronomers use Julian days.

I take your question to be: what mathematical structure can we give to the set of dates (or, more generally, dates/times)? That is about a mathematical representation of time, and we generally talk of time in at least two ways: events ("when did something happen?") and durations ("how long did the last Winter Olympic Games in PyeongChang last?"). If $P$ is the date of the opening ceremony and $Q$ the date of the closing ceremony, then the duration is $Q-P$. So we can take a difference of two events (dates); that difference is a duration. But we cannot sum two events (dates): what would $P+Q$ mean? The halfway point of the Winter Olympics does have meaning, though; it is the average $0.5 P+0.5 Q$. So averages make sense!

This looks like a strange mathematical structure, with two kinds of objects "events" and "durations" and operations only defined in some cases, not all. But it is a very well-known object, an affine space; see https://en.wikipedia.org/wiki/Affine_space.

The usual way of introducing an affine space is saying it is a vector space "where we have forgotten the origin". Since we have forgotten the origin, any operation whose result depends on the origin is invalid or undefined. We can now represent "events" (dates) as vectors in the underlying (1-dim) vector space, which we can identify with the real line. But note that this representation depends on the choice of an origin! We must just remember that anything we actually do must not depend on this choice.

We can represent "durations" as differences between the vectors representing dates. It should be quite obvious that the duration of the Winter Olympic Games does not depend on whether we choose as time origin the birth of Christ or 1 January 1970 (the time origin used in Linux). The average of events also has meaning: if we write the events as $P_i$, then the average of the $P_i$ is an event $Q$ such that $$ \sum_i (P_i - Q)=0 $$ (In affine geometry, $Q$ is often called the barycenter.) Note that here we are only summing durations, which is allowed.

If we want to implement some data type representing dates in a computing environment, it must have these properties. Let us see in R:

 P <- as.Date("2018-2-9") # Starting date of Olympics
 Q <- as.Date("2018-2-25") # end date
 Q-P   # duration 
Time difference of 16 days
 Q+P
Error in `+.Date`(Q, P) : binary + is not defined for "Date" objects
 mean(c(P, Q))  # time midpoint of the games 
[1] "2018-02-17"
 weighted.mean(c(P, Q), c(1/4, 3/4))  # games 3/4 finished
[1] "2018-02-21"
 P+16  # 16 days after the opening ceremony 
[1] "2018-02-25"

That all seems to be well-behaved.
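The claim that the average must not depend on the choice of origin can also be checked directly. The following is only a sketch: the third date, the two origins and the helper `mean_from` are hypothetical and chosen for illustration. It computes the mean from offsets relative to two different origins and then verifies the barycenter condition $\sum_i (P_i - Q)=0$.

```r
# Three event dates; the first and last are the Olympic dates used above,
# the middle one is added only for illustration
events <- as.Date(c("2018-02-09", "2018-02-17", "2018-02-25"))

# Hypothetical helper: average the durations measured from `origin`,
# then shift the origin by that average duration
mean_from <- function(origin) {
  origin <- as.Date(origin)
  origin + mean(as.numeric(events - origin))
}

m1 <- mean_from("1970-01-01")  # origin used internally by R
m2 <- mean_from("2000-06-15")  # an arbitrary other origin

print(m1)
print(m2)
print(m1 == m2)

# Barycenter condition: the durations from the mean sum to zero
deviations <- as.numeric(events - m1)
print(sum(deviations))
```

Both origins lead to the same event, and only durations are ever summed, which is the point made above.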

kjetil b halvorsen
  • 1
    +1 for `R` example! I appreciate that I didn't give a context before, but if I care more about the "events", would I then consider dates points on an affine line? I think this paragraph "The usual way of introducing an affine space is saying it is a vector space" put it best for my context – Sinker Mar 11 '18 at 04:47
  • 2
    Dates are *clearly* of interval type according to Stevens' original definition. Most physicists would reflexively agree with that on the basis of Special Relativity. The definition of a type is not in terms of the "mathematical structure" of its objects: it is in terms of the *group of operations* on the class of those objects. – whuber Dec 11 '20 at 18:25
  • 1
    Note that dates in the sense of time of year or day of week may be considered a circular scale too. Working backwards, if sinusoids or other periodic functions make sense in modelling the outcome then the time variable has circular flavour, which need not rule out other flavours. Naturally, there can be problems in which time has a complicated role, so that in climatology or Earth or environmental science generally we might be looking at a long-term trend and also seasonal variations, just as a starting point. – Nick Cox Jun 05 '21 at 10:23
0

Dates can be ordinal, categorical or both. It really depends on what these dates represent and what you are trying to answer with them.

If the data your dates represent can be described as elapsed time then I would use ordinal.

Examples:

  1. If you are looking at how your process affects the growth of a population over decades, and the date field represents the day the population was counted, I would treat this field as ordinal.

  2. How much does a company's historical stock price influence the current value of a stock?

  3. The effect a process has on a person's memory over time, where the date field is the date on which a person took a memory test and is recorded alongside their score.

If the data your dates represent can be described as part of a cycle then I would use categorical.

Examples:

  1. If you want to determine whether your process has an effect on the number of births per calendar week, I would use categorical.

  2. Does the day of the week influence the value of a stock price?

  3. Does the month the process was started in influence its results?

Comparing the matching examples in the two lists, it is easy to see that a model of the effect a process has on the reproduction of a species, or a model of influences on stock prices, would most likely convert dates into both categorical and ordinal variables.

I believe that the question the model is created to answer, and what the data represent, greatly influence which type (categorical, ordinal or both) should be used.
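A rough sketch in R of deriving both kinds of variable from one date column follows. The dates and column names are hypothetical. Elapsed time is kept here as a plain number of days rather than a strictly ordinal code. The sine/cosine pair is an alternative cyclic encoding that is not part of this answer, and the 365.25-day year length is an assumption.

```r
# Hypothetical sampling dates, chosen only for illustration
dates <- as.Date(c("2017-01-15", "2017-07-15", "2018-01-15"))

# Elapsed-time view: days since the first observation
elapsed_days <- as.numeric(dates - min(dates))

# Cyclic view as a category: calendar month as a 12-level factor
month_factor <- factor(format(dates, "%m"), levels = sprintf("%02d", 1:12))

# Cyclic view as a smooth encoding: position within the year as an angle.
# as.POSIXlt()$yday is the zero-based day of the year;
# 365.25 is an assumed average year length.
day_of_year <- as.POSIXlt(dates)$yday
angle <- 2 * pi * day_of_year / 365.25

features <- data.frame(
  date = dates,
  elapsed_days,
  month_factor,
  sin_doy = sin(angle),
  cos_doy = cos(angle)
)
print(features)
```

The first and third rows are a year apart in `elapsed_days` but share the same month level and the same sine and cosine values, so a model can use the trend and the seasonal position separately.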

Nick Cox