33

Is there a good way to measure smoothness of a time series in R? For example,

-1, -0.8, -0.6, -0.4, -0.2, 0, 0.2, 0.4, 0.6, 0.8, 1.0

is much smoother than

-1, 0.8, -0.6, 0.4, -0.2, 0, 0.2, -0.4, 0.6, -0.8, 1.0

although they have the same mean and standard deviation. It would be cool if there were a function to give me a smoothness score for a time series.

chl
agmao
  • Smoothness has a well-defined meaning in the theory of stochastic processes. ("A variogram is a statistically-based, quantitative, description of a surface's roughness": http://www.goldensoftware.com/variogramTutorial.pdf, p. 16.) Smoothness is related to the *extrapolation* of the variogram to zero distance. (The SD of successive differences and the lag-one autocorrelation are quick and dirty versions of this.) The essential information is contained in the coefficients of the Taylor series at 0. E.g., a non-zero constant is rough indeed; a high-order zero at 0 indicates a very smooth series. – whuber Mar 14 '12 at 16:59 (see the sketch after these comments)
  • How funny, I've been wondering this exact same thing myself. Thanks for posting! – Chris Beeley Mar 14 '12 at 22:18
  • @whuber: that's an answer, not a comment. – naught101 Aug 24 '12 at 03:21
  • @naught101 I humbly beg to differ: my comment is apropos a related situation and it refers only to the theoretical process used to model spatial data, not to how one would actually *estimate* that smoothness. There is an art to that estimation with which I am familiar in multiple dimensions, but not in one, which is special (due to the direction of time's arrow), so I hesitate to claim that applying the multidimensional procedures to time series is at all conventional or even a good approach. – whuber Aug 24 '12 at 03:24
  • I've heard of [hurst](http://en.wikipedia.org/wiki/Hurst_exponent) exponents too. – Taylor Mar 14 '12 at 18:22
  • @whuber: fair call. My understanding is limited, but a web search tells me that a variogram in one dimension is a correlogram (or equivalent to one), which supports something like Cyan's answer. I don't really see how directionality impacts smoothness - surely a sawtooth wave is just as (un)smooth as a reversed sawtooth... – naught101 Aug 24 '12 at 09:44
  • http://www.johndcook.com/blog/2009/02/06/the-smoothest-curve-through-a-set-of-points/ – sav May 12 '14 at 23:15
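
whuber's comment above ties smoothness to how the variogram behaves as the lag shrinks toward zero. A minimal sketch of the empirical version of that idea (semivariogram is a hypothetical helper written for this illustration, not an established R function, and the lag range 1:3 is arbitrary):

# Empirical semivariogram: gamma(h) = 0.5 * mean squared difference at lag h
semivariogram <- function(v, h) {
  n <- length(v)
  mean((v[(h + 1):n] - v[1:(n - h)])^2) / 2
}
x <- c(-1, -0.8, -0.6, -0.4, -0.2, 0, 0.2, 0.4, 0.6, 0.8, 1.0)  # smooth series
y <- c(-1, 0.8, -0.6, 0.4, -0.2, 0, 0.2, -0.4, 0.6, -0.8, 1.0)  # jagged series
sapply(1:3, function(h) semivariogram(x, h))  # 0.02 0.08 0.18: tiny at lag 1 (smooth)
sapply(1:3, function(h) semivariogram(y, h))  # 0.66 0.08 0.42: large at lag 1 (rough)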

4 Answers

27

The standard deviation of the differences will give you a rough smoothness estimate:

x <- c(-1, -0.8, -0.6, -0.4, -0.2, 0, 0.2, 0.4, 0.6, 0.8, 1.0)
y <- c(-1, 0.8, -0.6, 0.4, -0.2, 0, 0.2, -0.4, 0.6, -0.8, 1.0)
sd(diff(x))
sd(diff(y))
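
For the two example series this gives exactly 0 for x (its first differences are constant at 0.2) and about 1.19 for y.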

Update: As Cyan points out, that gives you a scale-dependent measure. A similar scale-independent measure would use the coefficient of variation rather than standard deviation:

sd(diff(x))/abs(mean(diff(x)))
sd(diff(y))/abs(mean(diff(y)))
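
On the same examples this yields 0 for x and about 5.96 for y; note that mean(diff(x)) and mean(diff(y)) are both 0.2 here.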

In both cases, small values correspond to smoother series.

Rob Hyndman
  • That score isn't scale-invariant, which may or may not make sense depending on the application. (And my own suggestion _is_ scale-invariant, so the same concern applies to it.) Also, it's worth pointing out that for the above score, smaller values indicate smoother time series. – Cyan Mar 14 '12 at 04:02
  • Thanks @Cyan. I've now added a scale-independent version as well. – Rob Hyndman Mar 14 '12 at 05:14
  • Do you really intend to include `diff` in the denominators? The values would algebraically reduce to `(x[n]-x[1])/(n-1)` which is a (crude) measure of trend and ought, in many cases, to be extremely close to zero, resulting in an unstable and not terribly meaningful statistic. I'm puzzled by that, but maybe I'm overlooking something obvious... – whuber Jul 25 '12 at 22:10
  • I used `diff` to avoid an assumption of stationarity. If it was defined with the denominator `abs(mean(x))` then the scaling would only work when `x` was stationary. Taking diffs means it will work for difference stationary processes as well. Of course, diffs may not make `x` stationary and then there are still problems. Scaling time series is tricky for this reason. But I take your point about stability. I think to do anything better would require something substantially more sophisticated --- using a nonparametric smoother for example. – Rob Hyndman Jul 25 '12 at 22:53
  • Usually, if you want to characterise the smoothness of a dataset, you probably also want to try to improve it. If, as an example, your "ideal" dataset is a sine curve where you change the sampling frequency, you will very quickly find that `mean(diff(y)) = 0` in the ideal case. I think this clearly shows that you should add an arbitrary quantity to the denominator to ensure that `mean(diff(y)) <> 0` – Samuel Albert Jun 22 '15 at 12:05
  • @RobHyndman thanks a lot for the great answer. If I have a constant trend (e.g., [5,5,5,5,5,5,5]), it returns 0/0 (division by zero). Do you have any suggestions for the smoothness value in such situations? Looking forward to hearing from you. Thank you once again :) – EmJ Mar 28 '19 at 01:14
  • I would have thought a constant trend should be perfectly smooth, so the answer should be 0. – Rob Hyndman Mar 28 '19 at 10:15
  • Your answer was very helpful to me, but I think it's not a great solution. It's easy for abs(mean(diff(x))) to be zero or very small if the array is stationary-ish. Perhaps you meant mean(abs(diff(y))). Regardless, see what you think of my solution using second differences. – Jonathan Mar 06 '20 at 12:43
  • Dear all, I too think that the answer is a good starting point, but currently the best answer is Jonathan's (and the even better one he links!). IMHO, Jonathan's answer is underrated, since it is the only one that correctly points out that second differences are necessary! – ckrk Jan 21 '21 at 09:50
16

The lag-one autocorrelation will serve as a score and has a reasonably straightforward statistical interpretation too.

cor(x[-length(x)],x[-1])

Score interpretation:

  • scores near 1 imply a smoothly varying series
  • scores near 0 imply that there's no overall linear relationship between a data point and the following one (that is, plot(x[-length(x)],x[-1]) won't give a scatterplot with any apparent linearity)
  • scores near -1 suggest that the series is jagged in a particular way: if one point is above the mean, the next is likely to be below the mean by about the same amount, and vice versa.
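
As a quick check against the two example series (x and y as defined in Rob Hyndman's answer), the score behaves as advertised:

cor(x[-length(x)], x[-1])  # exactly 1: the smooth series is perfectly linear
cor(y[-length(y)], y[-1])  # about -0.94: adjacent points tend to flip sign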
Cyan
1

To estimate the roughness of an array, take the squared difference of the normalized differences, and divide by 4. This gives you scale-independence (because of the normalization), and ignores trends (because of using the second difference).

firstD <- diff(x)                                   # first differences
normFirstD <- (firstD - mean(firstD)) / sd(firstD)  # normalise: scale-independent
roughness <- diff(normFirstD)^2 / 4                 # squared second differences

Zero will be perfect smoothness, 1 is maximal roughness.

You then take either the sum or the mean of this measure: the mean gives a roughness score that is independent of the length of the series, while the sum grows with it.
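
As a rough check, the three lines above can be wrapped in a helper and applied to the question's series (roughness_score is a hypothetical name introduced here, not an established function):

roughness_score <- function(v) {
  firstD <- diff(v)
  normFirstD <- (firstD - mean(firstD)) / sd(firstD)
  mean(diff(normFirstD)^2 / 4)  # mean, for a length-independent score
}
roughness_score(y)  # 0.75 for the jagged example series
# Caveat: for the perfectly linear x, sd(firstD) is 0, so the normalization
# divides by zero and the result is NaN; a constant-difference (perfectly
# smooth) series needs special-casing, with 0 the natural score.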

I think this may be the same as a previous answer elsewhere.

Similar things are discussed in academic sources, which say we should integrate the squared second derivative.

I don't read algebra, so I'm not sure if what I'm suggesting is quite the same as any of these.

Jonathan
0

You could just check the correlation of the series against the timestep number; its square is the R² of a simple linear regression of the series on time. Note, though, that those are two very different timeseries, so I don't know how well that works as a comparison.
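
A minimal check of that idea in R, using the x and y from the question (seq_along supplies the timestep index):

summary(lm(x ~ seq_along(x)))$r.squared  # 1: the smooth series is a perfect line
summary(lm(y ~ seq_along(y)))$r.squared  # about 0.074: time explains little of y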

naught101