6

Suppose I have a multivariate, compositional dataset that records the concentrations of different elements. However, the data are not on a single scale; i.e., some values are of the form 0.00x while others are integers. Should I apply any kind of normalization / standardization before the log-ratio transformations (isometric log-ratio [ilr], centered log-ratio [clr], etc.), or can I transform the data as-is and then start the analysis: imputation of missing values using robust methods, robust PCA, and clustering?

Some pointers for understanding compositional data analysis would also be welcome.

[Update]

For example: Consider two vectors:

[ 0.016, 71.2, 0.123, 1.74, 14.0, 0.002, 2310, 0.064, 0.29, 0.32, 5.63, 96.5, 0.044,
  4360, 1110, 585, 0.052, 62.9, 4.45, 1110, 1.50, 15.10, 783, 0.015, 78.9, 5.61, 0.007,
  0.022, 0.007, 0.53, 29.3 ]
[ 0.073, 245.0, 0.299, 2.77, 17.4, 0.039, 2460, 0.145, 0.85, 0.99, 20.40, 359.0, 0.062,
  4040, 1530, 148, 0.113, 217.0, 18.10, 1310, 4.61, 4.56, 880, 0.069, 230.0, 12.20,
  0.028, 0.025, 0.013, 9.92, 34.1 ]

These two vectors represent concentrations of different elements in soil samples collected at two different positions. If I were to analyse them using robust methods, should I standardize / normalize them in some form first, or should I simply transform them into the Aitchison geometry and start my analysis?
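As a minimal sketch of what the clr transform does here (the values are the first few concentrations from the first vector above; the helper name `clr` is my own, not from any particular library): the clr coordinates are unchanged when a whole sample is multiplied by a constant, so the overall scale of a sample drops out once you move to log-ratios.

```python
import numpy as np

def clr(x):
    """Centered log-ratio transform: log of each part over the geometric mean."""
    x = np.asarray(x, dtype=float)
    g = np.exp(np.mean(np.log(x)))  # geometric mean of the parts
    return np.log(x / g)

sample = np.array([0.016, 71.2, 0.123, 1.74, 14.0])  # first few parts of vector 1

# Rescaling the whole sample (e.g. ppm vs. percent) leaves the clr unchanged
print(np.allclose(clr(sample), clr(sample * 10)))
# clr coordinates always sum to zero
print(clr(sample).sum())
```

Note this only removes a *per-sample* scale factor; it does not by itself settle whether individual attributes should additionally be weighted.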

gung - Reinstate Monica
user41728
    This sentence: "*But data is not available on a single scale i.e. some are of form 0.00x while others are integers*" is a bit cryptic, especially if your data (as you write below) are concentrations of different elements. Why integers? Why ratios? What are the values? – amoeba Mar 13 '14 at 14:22

1 Answer

Make sure you understand the algorithms before using them.

E.g., k-means minimizes variance, and of course an attribute on a larger scale will have a much larger variance, too. Therefore, standardizing the data is often beneficial there.
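A small sketch of this point with synthetic data (the two columns mimic a ppm-scale element and a trace element; the numbers are invented for illustration, not taken from the question): before standardization the large-scale attribute carries essentially all the variance that k-means would minimize.

```python
import numpy as np

rng = np.random.default_rng(0)
# Two attributes on wildly different scales, as in the soil data above
major = rng.normal(2000, 400, size=100)   # e.g. an element measured in ppm
trace = rng.normal(0.02, 0.004, size=100) # e.g. a trace element near 0.01

X = np.column_stack([major, trace])
print(X.var(axis=0))   # the large-scale column dominates the total variance

# z-score standardization: now both attributes have unit variance,
# i.e. equal influence on a variance-minimizing method like k-means
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
print(Xs.var(axis=0))
```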

But with, e.g., hierarchical clustering, you need to supply a distance function. Euclidean distance is just one of many options, and you can be much more specific about how much influence each attribute should have on the result.

The key question is: what is a sensible measure of similarity for your domain? There is no universal measure. With hierarchical clustering, this is just more explicit. K-means is based on the sum of squared deviations, so there you need to rescale / transform your data to give each attribute appropriate weight, which is much more limited than specifying a similarity measure tailored to your data.
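To make "which attribute gets which influence" concrete, here is a sketch of a weighted Euclidean distance (the weights and sample values are invented for illustration; `weighted_euclidean` is a hypothetical helper, not a library function). With unit weights the ppm-scale attribute swamps the trace element; weighting by inverse squared scale gives both a comparable say.

```python
import numpy as np

def weighted_euclidean(a, b, w):
    """Euclidean distance where w makes each attribute's influence explicit."""
    a, b, w = (np.asarray(v, dtype=float) for v in (a, b, w))
    return float(np.sqrt(np.sum(w * (a - b) ** 2)))

a = [2000.0, 0.016]  # sample 1: a ppm-scale element, a trace element
b = [2400.0, 0.048]  # sample 2

# Unit weights: the distance is essentially just the ppm-scale difference
print(weighted_euclidean(a, b, [1.0, 1.0]))
# Inverse-squared-scale weights: both attributes now contribute meaningfully
print(weighted_euclidean(a, b, [1 / 400**2, 1 / 0.01**2]))
```

Such a distance can then be fed to any hierarchical clustering routine that accepts a custom metric.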

So: when are two soil samples alike? As you can see, this is a domain and purpose question, not so much a statistical question.

Has QUIT--Anony-Mousse