Multivariate analysis for data with different metric units

Question

I have some unusual data that I don't know how to treat.

I have metabolite measurements for different groups of samples, with 5 replicates in each group:

Groups      treatment   diet
Group1:     yes         A
Group2:     no          A
Group3:     yes         B
Group4:     no          B

Each sample has been measured for 500 different metabolites, but the measured values are difficult to work with because:

The measured values are signals, not concentrations, so the units differ between metabolites (i.e. a value of 2 for metabolite 1 means something entirely different from a value of 2 for metabolite 2).

There are some missing values. A missing value means that the metabolite could not be detected in that specific sample; it does not mean the value is zero. For example:

samples         metabolite1
Group1          12374
Group1          NA
Group1          NA
Group1          NA
Group1          46091
Group2          128025
Group2          90689
Group2          129950
Group2          76813
Group2          66439

What I want to do:

First, I would like to do a principal component analysis to see whether there is any clear separation between the groups.
Then I would like to test whether treatment, diet, or their interaction has an effect on each metabolite.

What do you suggest I do with these data?

P.S. I analyze my data in R!

Regarding your "missing data", these are *censored* because the values are below the limit of detection. You might find this thread helpful: [How small a quantity should be added to x to avoid taking the log of zero?](http://stats.stackexchange.com/q/30728/7290) More generally, what are you trying to find out from these data? What model do you want to fit? — gung - Reinstate Monica, Nov 23 '16 at 15:24

score 3 · Answer 1 · edited Apr 13 '17 at 12:44

Missing values

As @gung says, a first step would be to go and find out why there are NAs:

If concentrations are below the LOD/LOQ, or the signal is below a critical value: ask for the uncensored values. At the very least, make sure that future data are not censored and that you are instead provided with the critical values/LOD/LOQ in addition to the data.
Btw: a known LOD and LOQ implies that a calibration is available.
Obviously, if the NAs are not due to censoring but mean that measurement was not possible for some other reason, you'll have to proceed with missing values.
Whether the NAs come from (left) censoring or from other causes matters for how you deal with them.
Ivana Stanimirova from Katowice has done a lot of work on data analysis with values that are missing completely at random (MCAR) and/or missing due to censoring.
I. Stanimirova: Practical approaches to principal component analysis for simultaneously dealing with missing and censored elements in chemical data, Analytica Chimica Acta, 796 (2013) 27–37.
DOI: 10.1016/j.aca.2013.08.026
may be a good starting point.
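
As a first diagnostic before choosing any method, it can help to tabulate where the NAs occur. The following is only a sketch on simulated data: the object names (`signal`, `group`), the three metabolites and the way censoring is mimicked are all assumptions made for illustration, and it is not the approach from the Stanimirova paper.

```r
# Simulated stand-in for the real data: 20 samples (4 groups x 5 replicates)
# and 3 hypothetical metabolites. All names and values are made up.
set.seed(1)
group <- factor(rep(paste0("Group", 1:4), each = 5))
signal <- matrix(rlnorm(20 * 3, meanlog = 10, sdlog = 1), nrow = 20,
                 dimnames = list(NULL, paste0("metabolite", 1:3)))

# Mimic left-censoring: blank out the lowest 30% of metabolite1 signals
cutoff <- quantile(signal[, "metabolite1"], 0.3)
signal[signal[, "metabolite1"] < cutoff, "metabolite1"] <- NA

# NAs per metabolite and group. NAs that pile up in groups whose observed
# signals are low hint at censoring rather than random measurement failure.
na_count <- apply(is.na(signal), 2, function(x) tapply(x, group, sum))
print(na_count)

# Smallest observed signal per metabolite: a rough indication of where a
# detection threshold might lie, not an actual LOD.
min_observed <- apply(signal, 2, min, na.rm = TRUE)
print(min_observed)
```

If the table shows NAs mostly where the remaining replicates are small (as for Group1 of metabolite1 in the question), that supports the censoring explanation; it does not replace asking the lab for the actual LOD/LOQ.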

Signals instead of concentrations and scaling

Having signals instead of concentrations is not very problematic:
different metabolites occur at vastly different concentrations as well, so you'd need to think about scaling anyway
if all signals have physically the same unit and don't vary in their order of magnitude, you may keep them as they are
see also: Variables are often adjusted (e.g. standardised) before making a model - when is this a good idea, and when is it a bad one?
Sometimes it is possible to set up scaling according to some control group.
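
One common choice, shown here purely as a sketch, is to log-transform the signals and then centre and scale each metabolite to unit variance before the PCA. The simulated matrix `signal`, the four metabolites and the decision to take logs are assumptions for illustration; the sketch also assumes there are no NAs left, because `prcomp()` on a matrix does not accept missing values.

```r
# Simulated data: 20 samples x 4 hypothetical metabolites whose signals
# live on very different scales. All values are made up.
set.seed(2)
group <- factor(rep(paste0("Group", 1:4), each = 5))
signal <- sapply(c(5, 8, 11, 14),
                 function(m) rlnorm(20, meanlog = m, sdlog = 0.5))
colnames(signal) <- paste0("metabolite", 1:4)

# Log-transform (signals are positive and right-skewed), then centre each
# metabolite and scale it to unit variance so that no metabolite dominates
# the PCA merely because of its signal magnitude.
log_signal <- log(signal)
pca <- prcomp(log_signal, center = TRUE, scale. = TRUE)

# Proportion of variance explained by each component
print(summary(pca)$importance)

# Scores on the first two components, one row per sample
scores <- data.frame(group = group, pca$x[, 1:2])
print(head(scores))
```

Plotting `PC1` against `PC2` from `scores`, coloured by `group`, then shows whether the groups separate. Scaling relative to a control group, as mentioned above, would replace the `scale. = TRUE` step with a division by the control group's standard deviation.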

Study factors

ASCA (ANOVA Simultaneous Component Analysis), PCA-ANOVA and rMANOVA (regularized MANOVA) could be starting points.
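
The methods above are multivariate. As a simpler univariate baseline for the question "does treatment, diet or their interaction affect each metabolite", one could fit a two-way ANOVA per metabolite and correct for multiple testing. This is a sketch only and is not ASCA or rMANOVA; the simulated `log_signal` matrix, the helper `fit_one()` and the choice of the Benjamini-Hochberg adjustment are assumptions for illustration.

```r
# Design from the question: 2 treatments x 2 diets, 5 replicates per group
set.seed(3)
design <- data.frame(
  treatment = factor(rep(c("yes", "no", "yes", "no"), each = 5)),
  diet      = factor(rep(c("A", "A", "B", "B"), each = 5))
)

# Hypothetical log-signals for 3 metabolites; metabolite1 is given an
# artificial treatment effect so that the example has something to find.
log_signal <- matrix(rnorm(20 * 3), nrow = 20,
                     dimnames = list(NULL, paste0("metabolite", 1:3)))
treated <- design$treatment == "yes"
log_signal[treated, "metabolite1"] <- log_signal[treated, "metabolite1"] + 3

# Fit treatment * diet for one metabolite and return the three p-values
fit_one <- function(y) {
  tab <- summary(aov(y ~ treatment * diet, data = design))[[1]]
  setNames(tab[1:3, "Pr(>F)"], c("treatment", "diet", "treatment:diet"))
}
p_raw <- t(apply(log_signal, 2, fit_one))   # metabolites x effects

# Adjust across metabolites (Benjamini-Hochberg), separately per effect
p_adj <- apply(p_raw, 2, p.adjust, method = "BH")
print(round(p_adj, 4))
```

With 500 metabolites the adjustment step matters much more than in this three-column toy example. Samples with NA are silently dropped by `aov()`, which is not appropriate if the NAs are censored values.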
