
First of all, sorry for the strange title; I had no idea how to describe my problem better. My issue is the following, and I think it is pretty much limited to the geosciences.

I have several properties for every sample, each of which is divided by depth.

For instance:

$ \qquad \displaystyle \small \begin{array}{r|rrrr} \hline \text{ID} & 1 & 2 & 3 & \dots\\ \hline \text{var1}_{0-20cm} & 2.3 & 2.0 & 1.0 & \dots\\ \text{var1}_{20-50cm} & 2.1 & 1.1 & 0.0 & \dots\\ \text{var1}_{50-100cm} & 2.6 & 1.1 & 0.0 & \dots\\ \hline \text{var2}_{0-20cm} & 10.5 & 5.5 & 3.5 & \dots\\ \text{var2}_{20-50cm} & 10.9 & 5.9 & 1.9 & \dots\\ \text{var2}_{50-100cm} & 15.0 & 5.0 & 1.0 & \dots\\ \hline \vdots & \vdots & \vdots & \vdots\\ \hline \end{array} $

Basically these are geological layers going from the surface down to 100 cm depth. I am trying to reduce the number of variables, either with PCA or with factor analysis. The issue is that I would like to handle the properties together, no matter what the depth is.

(For instance, I do not want to drop a layer between the surface and the bottom layer.)

Is there any way to handle them together, or to group them, for PCA or a similar method? I tried to find relevant information, but I think the problem is limited to a small corner of the field (maybe I am wrong), so I could not find anything useful.

Gottfried Helms
user1775772
  • How, *exactly,* are the samples obtained and measured? This matters because if the samples represent averages within each layer at their locations, then the varying layer thickness will change the distributions of the values and thereby suggest one set of approaches. Otherwise, if the results are from subsampling each layer (which often happens in the lab), then another set of approaches might be favored. – whuber Nov 13 '12 at 22:43
  • Thanks @whuber for the fast reply: layers are (re)calculated as weighted averages from the sampling layers. So they do not represent the actual sampling and lab-measured samples (every profile has a different layer division for sampling). After recalculating, the layering is uniform for every sample for every property. – user1775772 Nov 13 '12 at 22:49
  • You may have a hard time, then, interpreting the results: they could tell you more about your interpolation (averaging) method than about what's really going on. Is there a reason not to do the PCA with the original data? – whuber Nov 13 '12 at 22:50
  • Sampling is really diverse for every single point: sometimes the first layer is 0-1 cm, sometimes 0-100 cm. The goal would be clustering on a uniform layering, but I'd like to get rid of correlated properties. – user1775772 Nov 13 '12 at 22:54
  • @whuber I was also considering splines rather than a weighted average, but in some cases that would have resulted in misleading values if NAs are present in a profile. – user1775772 Nov 13 '12 at 23:05
  • Slightly off-topic, but there's a [geoscience proposal on area51](http://area51.stackexchange.com/proposals/36296/geoscience) that's currently in the commitment phase. While this is certainly a stats question, the fact that it's so field specific means you might get better help there or pointers to solutions that CV users might not be aware of. So go and sign up! :D – naught101 Sep 10 '13 at 08:23
  • Have you looked into functional data analysis? I am very far from an expert in that, but the little I know suggests it might be useful here. See e.g. [this book](http://www.textbooks.com/BooksDescription.php?BKN=773525&SBC=ME3&kpid=9780387400808U&network=GoogleShopping&tracking_id=9780387400808U&utm_medium=cpc&utm_term=9780387400808U&utm_source=googleshopping&kenshu=2275d976-eb91-6b88-5311-00004cb31ff0&gclid=CIGl0ceWw7kCFaYDOgodFxIA6A) – Peter Flom Sep 11 '13 at 10:52

3 Answers


What you could do is use Multiple Factor Analysis (MFA). This method allows a factor analysis in which you consider multiple groups of variables. If you set up your analysis so that each group is a depth, then it guarantees that all your depths will be 'preserved'.

EDIT: Maybe explaining a bit more would be useful.

In MFA, as in PCA, you get coordinates for your individuals and your variables. What is new in MFA is that the groups of variables also receive coordinates, so you can extract coordinates for all of your groups (depths) on the first few dimensions, effectively reducing the number of variables while keeping all your depths.

If you consider your individuals, you will have several sets of coordinates: one for each group of variables (a description of the individuals by each group, the so-called partial representations), plus one set of coordinates that is the centroid of the group coordinates. That last set can be interpreted as how the individuals are described overall.
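To make the idea concrete, here is a minimal sketch of the core MFA computation in NumPy (the data and group layout are invented for illustration): each depth group is standardized, weighted by the inverse of its first singular value so that no single depth dominates, and a global PCA is then run on the concatenated table. Dedicated implementations (e.g. FactoMineR in R) also give the group and partial coordinates discussed above.

```python
# Minimal MFA sketch: groups are depth layers, each holding the same
# properties (var1, var2). Data here are random placeholders.
import numpy as np

rng = np.random.default_rng(0)
n_samples = 30
groups = {d: rng.normal(size=(n_samples, 2))
          for d in ("0-20cm", "20-50cm", "50-100cm")}

blocks = []
for name, X in groups.items():
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)    # standardize within the group
    s1 = np.linalg.svd(Xs, compute_uv=False)[0]  # first singular value of the group
    blocks.append(Xs / s1)                       # reweight so no depth dominates

Z = np.hstack(blocks)                            # concatenated, reweighted table
U, S, Vt = np.linalg.svd(Z, full_matrices=False) # global PCA via SVD
scores = U * S                                   # individual coordinates
print(scores[:, :2].shape)                       # first two global dimensions: (30, 2)
```

The per-group reweighting is what distinguishes this from simply running PCA on all columns at once: it balances the influence of the depth layers before the global decomposition.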

Riff

If I understand you correctly, you want to use

Variable1 :=  var1_0-20cm + var1_20-50cm + var1_50-100cm
Variable2 :=  var2_0-20cm + var2_20-50cm + var2_50-100cm

(i.e. "independent of depth"; depending on how your data were generated, you might want to use the mean or a weighted average instead of the sum, e.g. 0.20 times the first layer, 0.30 the second and 0.50 the third) instead of the full data space? What exactly is the problem with doing this? A key benefit is that you can control quite well what happens.

Then, in the end, you can e.g. run PCA on these non-divided variables. You can, however, also use the PCA result to project the original data: apply the same mapping to the "divided" attributes that PCA gave you for the non-divided variables.
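A small sketch of this idea in NumPy, with invented data and the 0.20/0.30/0.50 layer weights from above: collapse each property to one thickness-weighted column, run PCA on those, then reuse the same centering and loadings on a single layer's values.

```python
# Depth-aggregation sketch: 6 columns = var1 and var2 at three depths.
# Data and weights are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(1)
n = 50
X = rng.normal(size=(n, 6))              # [var1_L1, var1_L2, var1_L3, var2_L1, ...]
w = np.array([0.20, 0.30, 0.50])         # layer thickness weights (20/30/50 cm)

# thickness-weighted average per property -> two "non-divided" variables
A = np.column_stack([X[:, 0:3] @ w, X[:, 3:6] @ w])

# PCA on the aggregated variables
Ac = A - A.mean(axis=0)
U, S, Vt = np.linalg.svd(Ac, full_matrices=False)
pcs = Ac @ Vt.T                          # scores in the reduced space
print(pcs.shape)                         # (50, 2)

# project one "divided" layer with the same centering and loadings
layer1 = np.column_stack([X[:, 0], X[:, 3]]) - A.mean(axis=0)
layer1_pcs = layer1 @ Vt.T               # layer 1 mapped into the same PC space
```

Note the caveat raised in the comments below: summing or averaging over depth discards exactly the depth information the study cares about, so this projection step is what lets you look at individual layers again afterwards.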

Has QUIT--Anony-Mousse
  • Could you please explain how this answer is connected to PCA? Indeed, what do your two equations mean? – whuber Nov 14 '12 at 13:31
  • As I understand his question, he might want to run PCA on these new variables instead of the "divided" attributes. To cite the question: "any way to handle them together, or group them for PCA" – Has QUIT--Anony-Mousse Nov 14 '12 at 15:05
  • I see now: you are summing values, so that the PCA is tantamount to using vertically averaged variable values. (At first the "Var" looked like you were taking variances somehow.) There's an [ecological fallacy](http://www.stat.berkeley.edu/~census/549.pdf) lurking here: that's the problem with your recommendation. – whuber Nov 14 '12 at 15:53
  • Summing would not work; the variables would lose their meaning. The depth is an important point in the data: where one of the variables starts to increase or decrease is the key in the study. I wondered about it a lot, reading the literature etc., and I came up with the idea of fitting splines to the variables and clustering the groups based on their splines. – user1775772 Nov 14 '12 at 18:14
  • Drop it for PCA, but work with the full data later on maybe? – Has QUIT--Anony-Mousse Nov 14 '12 at 18:15
  • That is an option, but I think I could fit the splines, run PCA on them, and then cluster based on that. – user1775772 Nov 14 '12 at 19:07

A data-driven (and thus probably not so very good) approach:

Calculate four correlation matrices: one for each layer and one for the pooled data (three rows per sample). If they all look quite similar, run a PCA based on the correlation matrix of the pooled sample and proceed with the first few PCs.

Instead of comparing the four correlation matrices, you could also consider the four loading matrices of the corresponding PCAs and compare the loadings of the first few PCs. This is much easier if you have lots of variables.
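The check described above can be sketched as follows in NumPy; the data layout (one row per sample-layer combination, columns = properties) and the similarity measure (max absolute entry-wise difference) are assumptions for illustration.

```python
# Compare per-layer correlation matrices with the pooled one,
# then run a correlation-based PCA on the pooled data.
import numpy as np

rng = np.random.default_rng(2)
n, p = 40, 4                                       # samples, properties
layers = [rng.normal(size=(n, p)) for _ in range(3)]

corrs = [np.corrcoef(L, rowvar=False) for L in layers]
pooled = np.vstack(layers)                         # three rows per sample
pooled_corr = np.corrcoef(pooled, rowvar=False)

# crude similarity check: max absolute difference to the pooled matrix
diffs = [np.abs(C - pooled_corr).max() for C in corrs]
print(diffs)                                       # small values -> layers look alike

# if similar enough: PCA on the pooled correlation matrix
eigvals, eigvecs = np.linalg.eigh(pooled_corr)
order = np.argsort(eigvals)[::-1]                  # sort eigenvalues descending
loadings = eigvecs[:, order]                       # first few columns = leading PCs
```

Comparing the loading matrices of four separate PCAs, as suggested above, would replace the `diffs` step with a comparison of the leading columns of each layer's `loadings`.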

Michael M