
I asked this question over on StackOverflow, and was recommended to ask it here.


I have two time series of 3D accelerometer data that have different time bases (clocks started at different times, with some very slight creep during the sampling time), as well as containing many gaps of different size (due to delays associated with writing to separate flash devices).

The accelerometers I'm using are the inexpensive GCDC X250-2. I'm running the accelerometers at their highest gain, so the data has a significant noise floor.

The time series each have about 2 million data points (over an hour at 512 samples/sec), and contain about 500 events of interest, where a typical event spans 100-150 samples (200-300 ms each). Many of these events are affected by data outages during flash writes.

So, the data isn't pristine, and isn't even very pretty. But my eyeball inspection shows it clearly contains the information I'm interested in. (I can post plots, if needed.)

The accelerometers are in similar environments but are only moderately coupled, meaning that I can tell by eye which events match from each accelerometer, but I have been unsuccessful so far doing so in software. Due to physical limitations, the devices are also mounted in different orientations, where the axes don't match, but they are as close to orthogonal as I could make them. So, for example, for 3-axis accelerometers A & B, +Ax maps to -By (up-down), +Az maps to -Bx (left-right), and +Ay maps to -Bz (front-back).

My initial goal is to correlate shock events on the vertical axis, though I would eventually like to a) automatically discover the axis mapping, b) correlate activity on the mapped axes, and c) extract behavior differences between the two accelerometers (such as twisting or flexing).

The gaps and mismatched time bases make Python's numpy.correlate() unusable as-is. I've also looked at R's zoo package, but have made no headway with it. I've looked to different fields of signal analysis for help, but I've made no progress.

Anyone have any clues for what I can do, or approaches I should research?

Update 28 Feb 2011: Added some plots here showing examples of the data.

asked by BobC

  • @BobC, maybe one of the moderators can have your post migrated to this site. That would be the most reasonable. As for your technical questions, first of all, are you using the FFT to do the correlation? That should be feasible for 2 million data points on a half-decent computer. Your signal-to-noise ratio looks reasonably high, so you should be in business. A quick-and-dirty cut would be to fill in the missing data with either the last available sample or with zeros. The creep from sampling-interval differences may be the most challenging "feature" of your data to deal with. – cardinal Mar 01 '11 at 02:16
  • @cardinal: I did indeed try an FFT, only to get garbage as a result. The 'interesting' features readily visible in the data are indistinguishable from noise in the FFT. However, I have done FFTs only on the entire data set: Perhaps a moving window FFT would provide better results, but I haven't yet been able to find a computationally efficient way to implement it. I suspect a Wavelet transform could help, but I'm unfamiliar with it (but am slowly learning about it). – BobC Mar 01 '11 at 02:24
  • @BobC, what I meant was, did you consider an FFT-based implementation for calculating the correlation? Direct convolution is $O(n^2)$, but an FFT-based implementation would reduce this to $O(n \log n)$, making it feasible (a sketch of this appears after these comments). As for looking at the FFT itself, with 2 million data points, your frequency resolution will be very high. Any sampling creep and other stuff is bound to wash the signal out on a per-frequency basis. But, you should be able to aggregate over many bins to bring the signal out of the noise. Something like a Welch approach or maybe a custom windowing technique. – cardinal Mar 01 '11 at 02:29
  • @BobC, off the top of my head, it seems like some variant of an *overlap-and-add* or *overlap-and-save* algorithm could be used to do a sliding-window FFT. Sliding the samples within a window just amounts to a phase shift, so all you have to do is compensate for those samples that "fall off" the left end and those that "come in" on the right end. – cardinal Mar 01 '11 at 02:31
  • @cardinal, did you take a look at the plots I posted as an update to the original question? The sensors are not measuring identical environments, and the environment does lots of filtering, the nature of which will be quantified by future analysis. But the changes appear sufficient to preclude use of phase shift. – BobC Mar 01 '11 at 06:55
  • Hi, I have a similar question. I have 2 time series, each represented by a matrix with its first column corresponding to the values and second column corresponding to the time difference (since the previous value). How do I find the correlation between these 2 matrices? I tried xcorr2(), but it doesn't seem right, and xcorr would probably take only the values into consideration, but I also want to account for the time. I'm really confused here, will an FFT help? How would you suggest I go about it? –  Mar 01 '11 at 03:09
  • @lmelza Converted; yet if you have a question, just ask it as a new one (ASK QUESTION button at the top-right side). –  Mar 01 '11 at 07:12
  • @BobC, I did take a quick look at the three plot snippets you posted. I'll try to take a closer look later today. It's not entirely germane to your question, but I'm a little puzzled by the data dropouts. I used to work with multichannel accelerometers sampling at 1kHz per channel or more for hours at a time and never experienced any such issues. Assuming you're running each channel into, say, a 12 bit ADC, you can easily pipe that over even a standard serial cable and still have more than enough left over for some additional error correction coding. – cardinal Mar 01 '11 at 12:54
  • It's a mobile acquisition where minimal size and weight are critical, and no external power is available. I link to the devices used in the OP. They lack internal memory sufficient to buffer data during slow writes to a microSD card. The devices have a variable sample rate, and I'm using them at 512 Hz, their fastest rate, thus obtaining the worst dropouts. Only sample rates below 60 Hz are gap-free, and such rates completely undersample the events I'm interested in. The device does have a burst-mode acquisition, but the implementation has limited use. – BobC Mar 02 '11 at 22:28
  • Do you have time stamp of the data? If yes then why don't you simply exclude missing time from both series? – Aksakal Jan 14 '15 at 12:39
  • Bob, I'm facing the same need, have you made any progress? Kind regards, Nicolas – Nicolas Le Ny Jan 14 '15 at 12:26
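To make the FFT-based suggestion in the comments above concrete, here is a minimal sketch in Python/numpy. It assumes each series is available as an array of timestamps (in seconds) plus one channel of samples; the names tA, aX, tB, bY are placeholders, not from the post. Each series is snapped onto a uniform 512 Hz grid with the gaps left at zero after demeaning, and the cross-correlation is then computed with the FFT.

```python
# Sketch only: zero-fill the gaps on a uniform grid, then FFT cross-correlate.
import numpy as np

def to_uniform_grid(t, x, fs=512.0):
    """Snap time-stamped samples onto a uniform grid; gaps stay at zero."""
    t = np.asarray(t, dtype=float)
    x = np.asarray(x, dtype=float)
    n = int(np.round((t[-1] - t[0]) * fs)) + 1
    grid = np.zeros(n)
    idx = np.round((t - t[0]) * fs).astype(int)
    grid[idx] = x - x.mean()           # demean so the zero-filled gaps are neutral
    return grid

def fft_xcorr_best_lag(a, b, fs=512.0, max_lag_s=10.0):
    """Lag (seconds) at which the two gridded series align best.

    Positive means a lags b.  O(n log n) via the FFT rather than O(n^2)
    direct correlation; the absolute value is used because the matching
    axes of the two devices may differ in sign.
    """
    n = len(a) + len(b) - 1
    nfft = 1 << (n - 1).bit_length()   # next power of two >= n
    xc = np.fft.irfft(np.fft.rfft(a, nfft) * np.conj(np.fft.rfft(b, nfft)), nfft)
    max_lag = int(max_lag_s * fs)      # only search a +/- 10 s window (assumed)
    lags = np.arange(-max_lag, max_lag + 1)
    return lags[np.argmax(np.abs(xc[lags % nfft]))] / fs

# offset = fft_xcorr_best_lag(to_uniform_grid(tA, aX), to_uniform_grid(tB, bY))
```

Zero-filling the dropouts depresses the size of the correlation peak but should leave its location largely intact, which is all that matters for estimating the time offset.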

1 Answer


The question concerns calculating the correlation between two irregularly sampled time series (one-dimensional stochastic processes) and using that to find the time offset where they are maximally correlated (their "phase difference").

This problem is not usually addressed in time series analysis, because time series data are presumed to be collected systematically (at regular intervals of time). It is rather the province of geostatistics, which concerns the multidimensional generalizations of time series. The archetypal geostatistical dataset consists of measurements of geological samples at irregularly spaced locations.

With irregular spacing, the distances among pairs of locations vary: no two distances may be the same. Geostatistics overcomes this with the empirical variogram. This computes a "typical" (often the mean or median) value of $(z(p) - z(q))^2 / 2$--the "semivariance"--where $z(p)$ denotes a measured value at point $p$ and the distance between $p$ and $q$ is constrained to lie within an interval called a "lag". If we assume the process $Z$ is stationary and has a covariance, then the expectation of the semivariance equals the maximum covariance (equal to $Var(Z(p))$ for any $p$) minus the covariance between $Z(p)$ and $Z(q)$. This binning into lags copes with the irregular spacing problem.
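To make the binning concrete, here is a hand-rolled one-dimensional version in Python/numpy (a sketch rather than a geostatistics package; the anchor thinning via `stride` is my own shortcut to keep the pair count manageable):

```python
import numpy as np

def semivariogram_1d(t, z, lag_width, max_lag, stride=50):
    """Mean of (z(p) - z(q))^2 / 2 for pairs binned by their time separation.

    t must be sorted (seconds); z is one channel.  Only every `stride`-th
    point is used as an anchor, which subsamples the pairs for speed.
    """
    t = np.asarray(t, float)
    z = np.asarray(z, float)
    edges = np.arange(0.0, max_lag + lag_width, lag_width)   # lag-bin edges
    sums = np.zeros(len(edges) - 1)
    counts = np.zeros(len(edges) - 1)
    for i in range(0, len(t), stride):
        j_hi = np.searchsorted(t, t[i] + max_lag)            # neighbours within max_lag
        h = t[i + 1:j_hi] - t[i]                             # pair separations ("lags")
        sv = 0.5 * (z[i + 1:j_hi] - z[i]) ** 2               # pair semivariances
        k = np.digitize(h, edges) - 1                        # lag-bin index per pair
        ok = (k >= 0) & (k < len(sums))
        np.add.at(sums, k[ok], sv[ok])
        np.add.at(counts, k[ok], 1)
    return edges[:-1] + lag_width / 2, sums / np.maximum(counts, 1)
```

Plotted against the bin centers, this is the empirical semivariogram; under stationarity it should level off near the total variance at lags beyond the correlation time of the signal.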

When an ordered pair of measurements $(z(p), w(p))$ is made at each point, one can similarly compute the empirical cross-variogram between the $z$'s and the $w$'s and thereby estimate the covariance at any lag. You want the one-dimensional version of the cross-variogram. The R packages gstat and sgeostat, among others, will estimate cross-variograms. Don't worry that your data are one-dimensional; if the software won't work with them directly, just introduce a constant second coordinate: that will make them appear two-dimensional.
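Here is what the one-dimensional cross-version looks like when hand-rolled in Python/numpy rather than done through gstat or sgeostat (a sketch; the standardization of each channel, the anchor thinning, and the ±5 s lag window in the usage note are assumptions of mine, not part of the answer):

```python
import numpy as np

def cross_variogram_1d(tA, a, tB, b, lag_width, max_abs_lag, stride=20):
    """0.5 * mean (a(t_i) - b(s_j))^2, binned by the signed lag t_i - s_j.

    Pairs one series' points with the other's, so the irregular time bases
    and the data gaps are handled directly, with no common grid.  Both
    channels are standardized first, so the minimum of the curve marks the
    lag at which the two series agree best -- a candidate phase difference.
    """
    tA = np.asarray(tA, float); tB = np.asarray(tB, float)
    a = np.asarray(a, float);   b = np.asarray(b, float)
    a = (a - a.mean()) / a.std()                 # put both channels on one scale
    b = (b - b.mean()) / b.std()
    edges = np.arange(-max_abs_lag, max_abs_lag + lag_width, lag_width)
    sums = np.zeros(len(edges) - 1)
    counts = np.zeros(len(edges) - 1)
    for i in range(0, len(tA), stride):          # thin the anchors for speed
        lo = np.searchsorted(tB, tA[i] - max_abs_lag)
        hi = np.searchsorted(tB, tA[i] + max_abs_lag)
        h = tA[i] - tB[lo:hi]                    # signed lag of each pair
        g = 0.5 * (a[i] - b[lo:hi]) ** 2         # cross-semivariance of each pair
        k = np.digitize(h, edges) - 1
        ok = (k >= 0) & (k < len(sums))
        np.add.at(sums, k[ok], g[ok])
        np.add.at(counts, k[ok], 1)
    return edges[:-1] + lag_width / 2, sums / np.maximum(counts, 1)

# e.g. (placeholder names; -By reflects the axis mapping given in the question):
# lags, gamma = cross_variogram_1d(tA, Ax, tB, -By, lag_width=1/512, max_abs_lag=5.0)
# phase = lags[np.argmin(gamma)]
```

The lag at which the cross-semivariance dips lowest is a candidate for the phase difference. Note that only pairs within the ±max_abs_lag window are ever formed, so the cost is $O(nk)$ rather than $O(n^2)$, which is the computational point made at the end of this answer.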

With two million points you should be able to detect small deviations from stationarity. It's possible the phase difference between the two time series could vary over time, too. Cope with this by computing the cross-variogram separately for different windows spaced throughout the time period.
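A sketch of that windowing, assuming that zero-filling the gaps onto a uniform 512 Hz grid within each window is acceptable; the window length, the ±2 s search range, and the variable names are my own choices:

```python
import numpy as np

def window_grid(t, x, t0, n, fs=512.0):
    """Zero-filled uniform grid of n samples starting at time t0."""
    grid = np.zeros(n)
    idx = np.round((np.asarray(t, float) - t0) * fs).astype(int)
    ok = (idx >= 0) & (idx < n)
    grid[idx[ok]] = np.asarray(x, float)[ok] - np.mean(x)
    return grid

def best_lag(a, b, max_lag):
    """Integer lag maximizing |sum a[n] * b[n - lag]| for two equal-length arrays."""
    scores = []
    for l in range(-max_lag, max_lag + 1):
        if l >= 0:
            scores.append(np.dot(a[l:], b[:len(b) - l]))
        else:
            scores.append(np.dot(a[:len(a) + l], b[-l:]))
    return int(np.argmax(np.abs(scores))) - max_lag

def windowed_offsets(tA, a, tB, b, fs=512.0, win_s=60.0, max_lag_s=2.0):
    """Best alignment lag per window, in seconds (positive means A lags B)."""
    n, max_lag = int(win_s * fs), int(max_lag_s * fs)
    start, stop = max(tA[0], tB[0]), min(tA[-1], tB[-1])
    centers, lags = [], []
    t0 = start
    while t0 + win_s <= stop:
        lag = best_lag(window_grid(tA, a, t0, n, fs),
                       window_grid(tB, b, t0, n, fs), max_lag)
        centers.append(t0 + win_s / 2)
        lags.append(lag / fs)
        t0 += win_s
    return np.array(centers), np.array(lags)

# A straight-line fit to the per-window lags separates a constant start-time
# offset (intercept) from slow clock creep (slope):
# centers, lags = windowed_offsets(tA, Ax, tB, -By)
# creep_per_second, start_offset = np.polyfit(centers, lags, 1)
```

The slope and intercept of that fit correspond roughly to the clock creep and the start-time offset, respectively (the $\beta$ and $\alpha$ discussed in the comments below).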

@cardinal has already brought up most of these points in comments. The main contribution of this reply is to point towards the use of spatial statistics packages to do your work for you and to use techniques of geostatistics to analyze these data. As far as computational efficiency goes, note that the full convolution (cross-variogram) is not needed: you only need its values near the phase difference. This makes the effort $O(nk)$, not $O(n^2)$, where $k$ is the number of lags to compute, so it might be feasible even with out-of-the-box software. If not, the direct convolution algorithm is easy to implement.

answered by whuber
  • @whuber, good comments and suggestions. If I'm reading the question correctly, I believe one primary concern is uncertainty of the time-point of sampling. This may be a little different from the typical geostatistics framework where, I believe, the spacing is irregular but still assumed known (at least to high precision). I think a rough model is: if the $n$th point of series one is at time $t_n = n t$ for fixed $t$, then the $n$th point of series 2 is at $\tau_n = t_n + \alpha + \beta n$, where $\alpha$ is probably on the order of a few milliseconds and $\beta$ is likely tiny. – cardinal Mar 08 '11 at 03:40
  • @Cardinal I didn't get that from the question. I can't think of a way of estimating $\beta$ that wouldn't be quite computationally intensive. Perhaps by breaking the time series into groups where the net effect of $\beta$ is negligible? – whuber Mar 08 '11 at 05:08
  • @whuber, @BobC, I'm making a semi-educated guess based on past experience dealing with similar problems and issues. Most approaches I've seen are computationally intensive and fail to impress. One attempt might be via something like [dynamic time warping](http://en.wikipedia.org/wiki/Dynamic_time_warping) or what Ramsay and Silverman call [curve registration](http://books.google.ca/books?id=mU3dop5wY_4C&pg=PA127#v=onepage&q&f=false). Whether either of those would be feasible on this size dataset is unclear to me. – cardinal Mar 08 '11 at 05:17
  • It'll take me a bit to get my brain wrapped around this. I'll start with the examples in the R packages you mentioned. – BobC Mar 08 '11 at 09:21
  • @BobC, is the rough model I gave for the timing asynchronicity close to what you have? I'm thinking it's "random initial offset" + "linear error" where the latter is due to a small constant difference in the sampling interval between your two devices. Then there is some additional small random error, say due to interrupt processing of two different uC's on top of that. – cardinal Mar 08 '11 at 17:53
  • @cardinal, the initial offset is from two sources: Different actual start times (manually start Device A, then start Device B a second or two later), and differences in the RTC setting in each device (a few seconds at most). I see no clear effects of sample timing jitter (nothing measurable), so that should be safe to ignore. During the 1-2 hour durations of data acquisition the RTC drift is also not noticeable, though longer trial acquisitions did reveal some, but it seems more of a temperature effect that will be difficult to quantify, so it can probably be ignored too. – BobC Mar 08 '11 at 20:46
  • @whuber, I'm presently working my way through this **[variogram tutorial](http://www.goldensoftware.com/variogramTutorial.pdf)**. – BobC Mar 08 '11 at 20:52
  • @BobC, that makes it sound like a simple constant, albeit unknown, time delay, then. No? If that's the case, the variogram approach may be a bigger hammer than you want! – cardinal Mar 08 '11 at 21:20
  • @cardinal Variograms are fairly simple, intuitive graphical indicators. They are straightforward to calculate. So I wouldn't characterize them as being big, complex, unwieldy, or unsuitable (which are some of the connotations of a "big hammer"). – whuber Jun 08 '11 at 20:15
  • @whuber, I'm trying to resurrect my thoughts from several months ago, which is a bit difficult. But, I think what I was trying to get at was essentially that the variogram approach might appear more complicated than what is needed, since it is geared toward handling irregularly spaced data. Since the OP's question seems to involve regularly spaced data, somewhat simpler time-series techniques can be used. Indeed, I believe the variogram would reduce to a linear transformation of the autocorrelation function in this setting. – cardinal Jun 09 '11 at 12:55
  • @cardinal Those are good points. You can do a lot more with the autocorrelation function of a regularly spaced series than you can with a variogram of an irregularly spaced one. I recall my motivation to suggest a variogram (or cross-variogram) was that one of the clocks exhibited "creep" with respect to the other, so that at a minimum the two series do not use the same equal spacings. It also seemed possible that the "creeping" could be irregular, resulting in changes of spacing throughout a series. – whuber Jun 09 '11 at 13:31
  • @whuber: Yes, I think your comments hit the mark. The comment you referenced was written in response to @BobC's immediately prior comments as my understanding of his problem was evolving. When I originally read the post and in my initial discussion with him, I also thought there was more irregularity to the data spacing. We actually never got confirmation one way or the other from @BobC. – cardinal Jun 09 '11 at 17:36