I have biological time series (9 years long) of species biomass which, as expected, exhibit a seasonal pattern. I would like to cluster them into a few groups based on their typical seasonal evolution (e.g. spring vs. summer species). To do so, I was advised to use a Fourier transform to decompose each signal into N harmonics (e.g. 3: one, two and three cycles per year, i.e. periods of 12, 6 and 4 months) and to use their amplitudes and phases in a Principal Components Analysis (PCA; which should work because the harmonics are orthogonal/uncorrelated).
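
To make the decomposition concrete, here is a minimal sketch in Python (standard library only) rather than R. It assumes 468 weekly values over 9 years, so a cycle of h per year falls in DFT bin k = 9h: the 12-, 6- and 4-month harmonics are bins 9, 18 and 27, not bins 1 to 3. R's `fft()` returns the same unnormalized coefficients, with bin k stored at index k + 1. The function names and the synthetic series are illustrative assumptions, not part of the original analysis.

```python
import cmath
import math

def dft_bin(x, k):
    """Unnormalized DFT coefficient X_k = sum_t x[t] * exp(-2*pi*i*k*t/n)."""
    n = len(x)
    return sum(v * cmath.exp(-2j * cmath.pi * k * t / n) for t, v in enumerate(x))

def annual_harmonic_features(x, n_years, n_harmonics=3):
    """Return [(amplitude, phase), ...] for 1, 2, ..., n_harmonics cycles per year."""
    n = len(x)
    features = []
    for h in range(1, n_harmonics + 1):
        k = h * n_years                    # h cycles per year = h * n_years cycles per record
        coef = dft_bin(x, k)
        amplitude = 2.0 * abs(coef) / n    # amplitude of the cosine at this frequency
        phase = cmath.phase(coef)          # x ~ amplitude * cos(2*pi*k*t/n + phase)
        features.append((amplitude, phase))
    return features

# Synthetic check series (an assumption, not real data): 9 years of weekly values.
weeks_per_year, n_years = 52, 9
n = weeks_per_year * n_years
series = [3.0 * math.cos(2 * math.pi * t / weeks_per_year - 1.0)
          + 1.5 * math.cos(2 * math.pi * 2 * t / weeks_per_year + 0.5)
          for t in range(n)]
features = annual_harmonic_features(series, n_years)
for h, (amp, ph) in enumerate(features, start=1):
    print(f"harmonic {h}: amplitude={amp:.3f}, phase={ph:.3f} rad")
```

Phase is a circular quantity (values near -π and +π describe almost the same timing), so it may be worth feeding the cosine and sine coefficients to the PCA instead of amplitude and phase; that is a suggestion to test, not an established recipe.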

I know there are already some similar questions on this forum, yet some aspects remain unclear to me. My questions are:

(1) When I reconstruct the time evolution from the first N harmonics computed from the Discrete Fourier Transform (DFT), the explained variability of the original signal (the R² of the linear model between the reconstructed signal and the original data) is sometimes only 0.40 (N=3) or 0.60 (N=5). In your experience, does this mean the data are not suited to this approach? Is there more pre-processing I could do to improve this (e.g., smoothing the signals, …)? Some species exhibit sudden increases separated by periods of total absence, and I wonder whether this calls for higher-frequency harmonics; should I expect difficulties there, and how should I tackle them?
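
One way to look at the R² in question (1): because the reconstruction is an orthogonal projection, its R² equals the share of the signal's variance carried by the retained DFT bins, which follows from Parseval's theorem without building the reconstruction. The sketch below, in Python with the standard library only, assumes weekly data over whole years and uses two synthetic series (assumptions, not real data). It suggests why a species present for only a couple of weeks per year gets a low R² from a few harmonics: a narrow pulse spreads its variance over many harmonics.

```python
import cmath
import math

def dft_bin(x, k):
    """Unnormalized DFT coefficient X_k = sum_t x[t] * exp(-2*pi*i*k*t/n)."""
    n = len(x)
    return sum(v * cmath.exp(-2j * cmath.pi * k * t / n) for t, v in enumerate(x))

def explained_share(x, bins):
    """Share of the variance of x captured by the DFT bins in `bins` (each 0 < k < n/2)."""
    n = len(x)
    mean = sum(x) / n
    # Parseval: the sum over k = 1..n-1 of |X_k|^2 equals n * sum((x - mean)^2).
    total = n * sum((v - mean) ** 2 for v in x)
    # Factor 2: a real series has the same power in bin k and in its mirror bin n - k.
    kept = sum(2.0 * abs(dft_bin(x, k)) ** 2 for k in bins)
    return kept / total

weeks_per_year, n_years = 52, 9
n = weeks_per_year * n_years

# Synthetic examples: a smooth seasonal curve and a "bloom" species
# that is present only two weeks per year.
smooth = [2.0 + math.cos(2 * math.pi * t / weeks_per_year) for t in range(n)]
bloom = [1.0 if t % weeks_per_year in (20, 21) else 0.0 for t in range(n)]

for name, series in (("smooth", smooth), ("bloom", bloom)):
    for n_harm in (3, 5):
        bins = [h * n_years for h in range(1, n_harm + 1)]   # annual harmonics
        print(name, n_harm, round(explained_share(series, bins), 3))
```

A low share for a spiky series therefore says more about the shape of the signal than about the validity of the method. Transforming the data (for example taking logarithms of biomass plus a constant) or keeping more harmonics are options to explore, each with its own trade-offs.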

(2) Besides the DFT, which appears limited here, I considered using a continuous Fourier Transform through a Fast Fourier Transform (FFT) algorithm and working on the power spectrum of each time series. I wonder whether this would allow me to select N' so-called “harmonics” by picking the N' highest peaks in the periodogram and then calculating the corresponding amplitudes and phases to be used in a subsequent PCA... Does that make sense? Concretely, how can I use the output of an FFT function in R (such as fft() or spec.pgram()) to run a subsequent PCA (or any other clustering method)? [any R code snippet would be very welcome]
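
A note on question (2): an FFT is a fast algorithm for computing the DFT, so it yields the same coefficients, and the raw periodogram is the squared magnitude of those coefficients. Picking the N' highest periodogram peaks can be sketched as follows, in Python with the standard library only. The function `top_peaks`, the definition of a peak as a bin exceeding both neighbours, and the synthetic series are assumptions made for illustration.

```python
import cmath
import math
import random

def dft_bin(x, k):
    """Unnormalized DFT coefficient X_k = sum_t x[t] * exp(-2*pi*i*k*t/n)."""
    n = len(x)
    return sum(v * cmath.exp(-2j * cmath.pi * k * t / n) for t, v in enumerate(x))

def top_peaks(x, n_peaks):
    """Return the n_peaks largest local maxima of the raw periodogram.

    Each result is (bin k, period in samples, amplitude, phase in radians).
    """
    n = len(x)
    mean = sum(x) / n
    centered = [v - mean for v in x]
    nyquist = n // 2
    coefs = [dft_bin(centered, k) for k in range(nyquist + 1)]   # bins 0 .. n//2
    power = [abs(c) ** 2 / n for c in coefs]
    # A peak is an interior bin whose power exceeds that of both neighbours.
    peaks = [k for k in range(1, nyquist)
             if power[k] > power[k - 1] and power[k] > power[k + 1]]
    peaks.sort(key=lambda k: power[k], reverse=True)
    return [(k, n / k, 2.0 * abs(coefs[k]) / n, cmath.phase(coefs[k]))
            for k in peaks[:n_peaks]]

# Synthetic weekly series over 9 years (an assumption, not real data):
# annual and 6-month cycles plus Gaussian noise.
rng = random.Random(0)
n = 52 * 9
series = [5.0
          + 3.0 * math.cos(2 * math.pi * t / 52 - 1.0)
          + 1.5 * math.cos(2 * math.pi * t / 26 + 0.5)
          + rng.gauss(0.0, 0.5)
          for t in range(n)]
found = top_peaks(series, 2)
for k, period, amplitude, phase in found:
    print(f"bin {k}: period={period:.1f} weeks, amplitude={amplitude:.2f}, phase={phase:.2f}")
```

One caveat for a later PCA: if each species gets its own set of peak frequencies, the columns of the feature matrix no longer mean the same thing across species. Using a common set of frequencies for all series keeps the features comparable.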

(3) How can I reconstruct the signal from selected harmonics in the continuous case (FFT)? I can easily do this in the DFT case, but I am stuck in the continuous case… Any R code snippet is of course very welcome.
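
For question (3), the reconstruction works the same way for any set of selected frequency bins, not only the first N: each bin k, together with its mirror bin n - k, contributes one real cosine with amplitude 2|X_k|/n and phase arg(X_k). Below is a minimal sketch in Python (standard library only), restricted to bins strictly between 0 and n/2. The function names and the synthetic series are illustrative assumptions.

```python
import cmath
import math

def dft_bin(x, k):
    """Unnormalized DFT coefficient X_k = sum_t x[t] * exp(-2*pi*i*k*t/n)."""
    n = len(x)
    return sum(v * cmath.exp(-2j * cmath.pi * k * t / n) for t, v in enumerate(x))

def reconstruct(x, bins):
    """Rebuild x from its mean plus the selected DFT bins (each 0 < k < n/2)."""
    n = len(x)
    if any(not 0 < k < n / 2 for k in bins):
        raise ValueError("bins must lie strictly between 0 and n/2")
    mean = sum(x) / n
    coefs = {k: dft_bin(x, k) for k in bins}
    recon = []
    for t in range(n):
        value = mean
        for k, c in coefs.items():
            # Bin k and its mirror n - k combine into a single real cosine.
            value += 2.0 * abs(c) / n * math.cos(2 * math.pi * k * t / n + cmath.phase(c))
        recon.append(value)
    return recon

# Synthetic weekly series over 9 years (an assumption, not real data).
n = 52 * 9
series = [5.0
          + 3.0 * math.cos(2 * math.pi * t / 52 - 1.0)
          + 1.5 * math.cos(2 * math.pi * t / 26 + 0.5)
          for t in range(n)]
recon = reconstruct(series, [9, 18])           # annual and 6-month bins
max_error = max(abs(a - b) for a, b in zip(series, recon))
print("largest reconstruction error:", max_error)
```

The same result can be obtained by zeroing all unselected coefficients (keeping bin 0 and the mirror bins) and applying an inverse FFT, which in R would be `fft(z, inverse = TRUE) / length(z)`.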

Any help with these questions would be much appreciated. Links to concrete examples, especially with associated R code, would be very helpful too (as well as method names or keywords). Thank you.

PS: in case it is useful: the time series are of equal length and pre-processed to have uniform sampling intervals; stationarity may be assumed, and there is no long-term trend. I divided each year into 52 equally spaced observations (i.e., 468 observations over the 9 years).

kjetil b halvorsen
ztl
  • What results do you get when you perform PCA on the *untransformed* data? – whuber Nov 30 '13 at 17:51
  • I am sorry if I misunderstand your question, but I don’t see how I can perform PCA on the raw data, i.e. on the 468 observations of each time series. I thought the Fourier transform was practical here to reduce the dimensionality of the data to a few variables (i.e., the amplitudes and phases of the first few harmonics)… – ztl Dec 01 '13 at 20:38
  • PCA is a good way to reduce dimensionality and does not make the stronger assumptions required of the Fourier analysis. PCA can be performed on any matrix of any size (up to certain computational limits), so if you have a set of parallel time series, you can arrange them in a matrix and proceed. [Search our site](http://stats.stackexchange.com/questions/tagged/pca?sort=votes) for more information. – whuber Dec 01 '13 at 21:44
  • As I see it, performing PCA on the amplitudes and phases of the first harmonics for each species and looking at the species’ locations in a biplot would give me information on clustering between species (based on their scores) and on the loadings (the harmonics’ variables). In contrast, am I right that to perform PCA on the untransformed data I would have to use species as ‘variables’, so that the PCA would give me the locations of the sampling dates (less useful) while the loadings (arrows) alone would give me the needed clustering information? I am completely new to this and may be wrong… – ztl Dec 02 '13 at 18:28
  • This sounds as if something simpler should be tried first, or at least as well. I'd fit a few sine-cosine pairs and look at times of fitted peaks and troughs, amplitude of cycle measured in some way and fraction of variation explained. Such measures might be more interesting biologically and easier to interpret. As you have just 9 series, fitting similar models to each and comparing results might be as instructive as full-blown multivariate. – Nick Cox Dec 02 '13 at 18:44
  • You can perform PCA on the transpose of the untransformed data matrix, interchanging the roles of species and dates. It's easy to do and worth an exploratory look. – whuber Dec 02 '13 at 19:42
  • So, @whuber: considering species as observations and the sampling dates as (many) variables? Performing PCA on the correlation matrix (scaled data, which makes more sense to me in this context) in this way yields a proportion of variance explained of (0.26+0.17=) 0.43 for the first 2 components. The clustering of the species (the rotated data) is not very clear but may make some sense. But doesn’t the nature of the time variables pose any problem (correlation)? [note: PCA on sampling days as observations and species as variables yields 0.21 variance explained by PC1&2 and unclear clustering…] – ztl Dec 03 '13 at 18:28
  • @Nick Cox: I am not sure I see how this differs from the DFT-based analysis I tried. Basically, I am also decomposing the signal into sine-cosine pairs (the harmonics) and looking at their phases and amplitudes. It is true that I haven’t looked specifically at their fraction of variation explained yet (though it can be done using Parseval’s theorem, if I am not mistaken), but am I missing your point here? Also, sorry I was not clear: the time series are 9 years long, but there are more than 9 of them (several dozen). Thanks! – ztl Dec 03 '13 at 18:46
  • You likely need more than two components. The first few often will be uninteresting, reflecting the overall magnitudes of the data, but the next few might have coefficients revealing any seasonality, clustering, and so on. Correlation among the time variables is not necessarily a problem--PCA is an exploratory method--but it does have a (somewhat predictable) effect: see http://stats.stackexchange.com/questions/50537. – whuber Dec 03 '13 at 18:57
  • Sorry; silly misreading of mine about 9 time series. That strengthens the case for multivariate. With environmental data at least scientists care about when the peaks and troughs are, which PCA only reveals indirectly (correct me if I'm wrong). – Nick Cox Dec 03 '13 at 19:03
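
The suggestion in the comments to fit a few sine-cosine pairs and read off the timing of peaks and troughs can be sketched as follows, in Python with the standard library only. It assumes evenly spaced data covering a whole number of years, in which case the sine and cosine regressors are orthogonal and the least-squares coefficients reduce to simple projections. The function `seasonal_curve` and the synthetic series are illustrative assumptions.

```python
import math

def seasonal_curve(x, period, n_pairs):
    """Fit n_pairs sine-cosine pairs with fundamental `period`; return one fitted cycle."""
    n = len(x)
    if n % period != 0 or 2 * n_pairs >= period:
        raise ValueError("needs whole cycles and harmonics below the Nyquist frequency")
    mean = sum(x) / n
    curve = [mean] * period
    for h in range(1, n_pairs + 1):
        w = 2 * math.pi * h / period
        # With even sampling over whole cycles the regressors are orthogonal,
        # so the least-squares coefficients are these projections.
        a = 2.0 / n * sum(v * math.cos(w * t) for t, v in enumerate(x))
        b = 2.0 / n * sum(v * math.sin(w * t) for t, v in enumerate(x))
        curve = [c + a * math.cos(w * s) + b * math.sin(w * s)
                 for s, c in enumerate(curve)]
    return curve

# Synthetic weekly series over 9 years peaking in week 20 (an assumption, not real data).
series = [4.0 + 3.0 * math.cos(2 * math.pi * (t - 20) / 52) for t in range(52 * 9)]
curve = seasonal_curve(series, period=52, n_pairs=3)

peak_week = max(range(52), key=curve.__getitem__)     # week of the fitted maximum
trough_week = min(range(52), key=curve.__getitem__)   # week of the fitted minimum
seasonal_range = max(curve) - min(curve)              # one possible amplitude measure
print(peak_week, trough_week, round(seasonal_range, 3))
```

Peak week, trough week and seasonal range computed this way for each species give a small table of biologically interpretable descriptors that could be compared directly or used as input to a clustering method.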

0 Answers