I am very new to R and statistics so this may be a simple question. I have a matrix (1000,756) containing 1000 years of winter sea-level pressure data (SLP) at 756 locations in the North Atlantic. I need to identify an oscillation in SLP anomalies (i.e. the difference between high and low regions in the North and South), called the North Atlantic Oscillation.

I have done a principal component analysis of the data using princomp. According to the literature I need to use the leading PC and

... project the time series of SLP anomaly fields on to this pattern (i.e. compute the scalar or dot product between field and pattern).

Can anyone help me with how to do this?

amoeba
  • Welcome to our site! Is your question about what this projection means (mathematically) or about how to compute the dot products in `R`? – whuber Feb 19 '14 at 15:15
  • Hi whuber, thank you for the response! I am really asking both of those questions. I am trying to get my head round it mathematically and at the same time calculate it in R but am struggling a little. – Edward Armstrong Feb 19 '14 at 15:18
  • The first principal component is a vector of length 756 (number of your "locations"), so 756 numbers $w_i$. To project your data onto the first principal component you need to take each column of the data matrix (1000 years at one location), multiply it by the corresponding number $w_i$ and add them all together. You will get one column of length 1000, this is your North Atlantic Oscillation. I have a strong feeling that this is a duplicate of many many questions here, try reading highest voted questions tagged "PCA". – amoeba Feb 19 '14 at 15:45
  • Hi amoeba, I really appreciate your help. Just to confirm and reword slightly to make sure I understand: The different rows (years) of each column (location) need to be multiplied by the same corresponding value of PC1. And then the 788 values of each row are added up? – Edward Armstrong Feb 19 '14 at 16:49
  • @EdwardArmstrong: By 788 you mean 756? I think you understood it correctly, but in general it's easy to get confused between rows and columns. The important thing is that PC1 is a vector in LOCATION space, i.e. it has as many coordinates as there are locations. Projecting onto PC1 means reducing all your 756 locations to one "composite location". To do that you take your data at each location (1000 values), multiply it by the corresponding coordinate of PC1 (all 1000 values are multiplied by the same value, yes) and sum up the resulting 756 data vectors of length 1000, obtaining one data vector of length 1000. – amoeba Feb 20 '14 at 10:22
  • Hi amoeba, yes I meant 756. You have been very helpful and I really appreciate it! It has worked well. – Edward Armstrong Feb 21 '14 at 16:31
  • @EdwardArmstrong: I am happy to help. I have now joined my comments into a single answer, so that your question does not remain "unanswered". – amoeba Feb 24 '14 at 00:17

1 Answer

The first principal axis (some people would refer to it as "principal component", but I advocate calling it "principal axis") is a vector of length $756$ (number of your "locations"), so $756$ numbers $w_i$. To project your data onto the first principal axis you need to take each column of the data matrix (i.e. $1000$ years at one location), multiply it by the corresponding number $w_i$ (the whole column is multiplied by the same number), and add the $756$ resulting $1000$-long columns together. You will get one column of length $1000$, and this is your North Atlantic Oscillation. This projection is also what I would call "principal component".
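
Written out as a formula, the weighted sum described above reads as follows. The symbols $x_{ti}$ (the centered SLP anomaly in year $t$ at location $i$) and $s_t$ (the resulting index value in year $t$) are notation introduced here for illustration only:

$$s_t = \sum_{i=1}^{756} w_i \, x_{ti}, \qquad t = 1, \dots, 1000.$$

In matrix form this is $\mathbf{s} = \mathbf{X}\mathbf{w}$, where $\mathbf{X}$ is the $1000 \times 756$ anomaly matrix and $\mathbf{w}$ is the vector of the $756$ weights. Each $s_t$ is the dot product between the anomaly field of year $t$ and the pattern $\mathbf{w}$.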

The important thing to realize is that the principal axis is a vector in location space, i.e. it has as many coordinates as there are locations. Projecting onto this axis means reducing all your $756$ locations to one "composite location", which is simply a linear combination (i.e. a "weighted sum") of all the individual locations.
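
The projection described above can be sketched in a few lines. This is only an illustration in plain Python, not the asker's actual workflow: the $4 \times 3$ matrix `X` is a made-up stand-in for the $1000 \times 756$ data, and the weight vector `w` is a hypothetical unit-length principal axis rather than one estimated by a PCA. In R, the analogous operation would be the matrix product (`%*%`) of the centered data matrix with the weight vector.

```python
# Toy stand-in for the 1000 x 756 matrix: 4 years (rows) x 3 locations (columns).
X = [
    [1012.0, 1008.0, 1020.0],
    [1015.0, 1004.0, 1018.0],
    [1009.0, 1010.0, 1023.0],
    [1016.0, 1002.0, 1019.0],
]

# Hypothetical unit-length principal axis: one weight w_i per location.
w = [0.6, -0.8, 0.0]

n_years = len(X)
n_locs = len(X[0])

# Anomalies: subtract each location's mean over time from that location's column.
col_means = [sum(row[j] for row in X) / n_years for j in range(n_locs)]
anomalies = [[row[j] - col_means[j] for j in range(n_locs)] for row in X]

# Column view (as worded in the answer): scale each location's column by its
# weight w_j and add the scaled columns together.
index_by_columns = [0.0] * n_years
for j in range(n_locs):
    for t in range(n_years):
        index_by_columns[t] += w[j] * anomalies[t][j]

# Row view (as worded in the question): for each year, take the dot product
# between that year's anomaly field and the pattern w.
index_by_rows = [sum(a * wi for a, wi in zip(row, w)) for row in anomalies]

print(index_by_columns)
print(index_by_rows)
```

Both views give the same time series with one value per year, which is why "weighted sum of columns" and "dot product of each field with the pattern" describe the same computation.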


See my answer here about this terminological distinction: What exactly is called "principal component" in PCA?

amoeba
  • Hi amoeba. I wanted to ask you a follow-up question regarding the calculation of the NAO. Following the calculation above, I have a vector of length 1000, with values ranging from -54.8 to 50.5. In time-series plots in the literature, the NAO index ranges anywhere from between -3 and 3 to between -15 and 15, depending on the paper. Is there a specific scaling that I need to apply to my data to get an NAO index similar to these values? Hope this makes sense, many thanks in advance. – Edward Armstrong Mar 03 '14 at 13:31
  • @EdwardArmstrong: I am not familiar with the literature on NAO, so I have no idea how it is usually defined. However, I can imagine that there is some normalization involved, because otherwise the values of the principal component are going to depend on the number of locations you have pressure measurements from. E.g. if you had twice as many locations, the PC1 you get would be scaled by $\sqrt{2}$. One way to normalize the index would be to divide it by its standard deviation (assuming that the means are already subtracted at each location). Then you would expect the NAO index to lie roughly within $\pm 3$. – amoeba Mar 03 '14 at 14:06
  • Thanks amoeba for the rapid response! That all makes sense. I subtracted the mean of each spatial point over time before doing the PCA analysis (i.e. for each of the 756 locations, I calculated the mean of the 1000 time measurements for that point, and subtracted this from each value for that location). I trust this is what you are referring to? Thanks again. – Edward Armstrong Mar 03 '14 at 15:04
  • @EdwardArmstrong: Yep, that's what I meant. – amoeba Mar 03 '14 at 15:07
  • Hi @amoeba. I hope you don't mind me messaging you so long after this thread opened, but it's good to check with an expert! If I wanted to get a spatial map this time instead of a time series, would I do the same as you stated above but instead sum over the rows, thus ending up with one row of length 756? – Edward Armstrong Feb 01 '16 at 18:18
  • Hi @EdwardArmstrong, sorry for the delay in responding. I am not sure I understand your question though! Is it still in the context of your 1000x756 data matrix? What "spatial map" do you want to get? Of how each of the 756 locations contributes to the NAO? Or did I misunderstand? – amoeba Feb 05 '16 at 00:29
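
The normalization suggested in the comments above (dividing the index by its standard deviation) can be sketched as follows. This is an illustration under stated assumptions, not a definition of the NAO index from the literature: `raw_index` holds made-up projection values, and the "doubled locations" case is simulated simply by multiplying the raw index by $\sqrt{2}$, as described in the comment, rather than by rerunning a PCA.

```python
import math
import statistics

# Hypothetical raw index: projection of centered anomalies onto the first
# principal axis. It is already zero-mean because each location was centered.
raw_index = [-2.2, 2.8, -5.6, 5.0]


def standardize(values):
    """Divide by the sample standard deviation (n - 1 denominator)."""
    sd = statistics.stdev(values)
    return [v / sd for v in values]


nao_index = standardize(raw_index)

# Simulated effect of having twice as many locations: the raw projection
# grows by a factor of sqrt(2) ...
raw_index_doubled = [math.sqrt(2) * v for v in raw_index]

# ... but the standardized index does not change.
nao_index_doubled = standardize(raw_index_doubled)

print(nao_index)
print(nao_index_doubled)
```

After this step the index has unit standard deviation regardless of how many locations went into the projection, which makes values comparable across data sets with different spatial coverage.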