MANOVA with variables from different datasets

Question

I conducted several experiments where I have one underlying independent variable (tree species, IV). Each of these experiments gave me one dependent variable (DV), like bark pH, rugosity or the water-holding capacity. Now I want to conduct a MANOVA to see if the tree species differ in the various dependent variables. My analysis is conducted in R.

My model therefore looks like:

pH + rugosity + water-holding capacity + [...] ~ tree species

where I have per tree species...

3 measurements of the bark pH,
9 measurements of the bark rugosity,
4 measurements of the bark thickness,
5 measurements of the water-holding capacity,
5 measurements of the water retention.

However, unlike most examples I've found on how to do a MANOVA (e.g. here, here, here), my data come from different measurements and from different individuals. I've found only this thread discussing unequal sample sizes, but it addresses only unequal sample sizes across the levels of the explanatory factor.

My Question:

My dependent variables all have different sample sizes. Would a MANOVA be appropriate for this kind of data? Can I just ignore the different sample sizes of the variables? Is there an alternative way to do this, or an alternative statistical test? Does my small sample size matter?

EDIT: What I really want to find out

I really just want to conduct a statistical test that tells me whether there is an underlying pattern. So are the tree species different with regard to the dependent variables? In the end I want to be able to tell whether some species have a certain set of traits that differs from other species.

Example Data:

My data looks like this:

> manova_df
    # A tibble: 45 x 6
   tree_species rugosity bark_mm    pH   whc   ret
   <fct>           <dbl>   <int> <dbl> <dbl> <dbl>
 1 AS              2.36        8  6.49  295. 119. 
 2 AS              1.45        8  6.83  222. 105. 
 3 AS              3.13        9  5.8   291. 181. 
 4 AS              2.38        8 NA     314. 214. 
 5 AS              4.39        7 NA     613. 317. 
 6 AS              2.21       NA NA      NA   NA  
 7 AS              0.810      NA NA      NA   NA  
 8 AS              1.58       NA NA      NA   NA  
 9 AS              0.934      NA NA      NA   NA  
10 BU              3.34        6  7.22  189.  74.9
# ... with 35 more rows

The NAs stem from the fact that I have different sample sizes but had to get all the variables into one data.frame. So I just bound all the columns of the different observations together. This means that the single observations of the various DVs are not tied together, and the order within the tree species levels is totally arbitrary!

My analysis is pretty straightforward:

mano_mod = manova(cbind(pH, bark_mm, rugosity, whc, ret) ~ tree_species, data = manova_df)
> summary(mano_mod)
             Df Pillai approx F num Df den Df    Pr(>F)    
tree_species  4 2.4207    3.372     20     44 0.0003836 ***
Residuals    12                                            
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Full Data Set:

structure(list(tree_species = structure(c(1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 
3L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 5L, 
5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L), .Label = c("AS", "BU", "CL", 
"MB", "PR"), class = "factor"), rugosity = c(2.36, 1.45, 3.13, 
2.38, 4.39, 2.21, 0.81, 1.58, 0.93, 3.34, 5.06, 0, 0.77, 12.64, 
4.1, 0.8, 1.03, 0.84, 6.49, 9.09, 5.96, 5.32, 8.41, 15.29, 9.91, 
7.65, 2.13, 9.43, 10.14, 13.24, 10.26, 9.81, 12.34, 17.23, 16.63, 
8.82, 1.68, 0.7, 0.82, 2.43, 0, 0.76, 0.77, 0, 1), bark_mm = c(8L, 
8L, 9L, 8L, 7L, NA, NA, NA, NA, 6L, 8L, 8L, 7L, 9L, NA, NA, NA, 
NA, 9L, 9L, 8L, 10L, 9L, NA, NA, NA, NA, 5L, 9L, 9L, 8L, 4L, 
NA, NA, NA, NA, 5L, 5L, 5L, 6L, NA, NA, NA, NA, NA), pH = c(6.49, 
6.83, 5.8, NA, NA, NA, NA, NA, NA, 7.22, 7.11, 7.72, 7.29, NA, 
NA, NA, NA, NA, 7.39, 7.18, 7.3, 7.3, NA, NA, NA, NA, NA, 6.76, 
6.55, 6.24, NA, NA, NA, NA, NA, NA, 5.76, 6.59, 5.44, NA, NA, 
NA, NA, NA, NA), whc = c(295.2, 222.4, 290.6, 314.3, 613.4, NA, 
NA, NA, NA, 189.4, 248.2, 336.8, 330.1, 427.8, NA, NA, NA, NA, 
236, 492.6, 549.3, 330.1, 370.7, NA, NA, NA, NA, 430, 142.2, 
372.4, 260, 176.1, 680, 215, NA, NA, 333.8, 320.6, 282.4, 322.9, 
576.7, NA, NA, NA, NA), ret = c(118.9, 104.9, 180.6, 214.5, 317.3, 
NA, NA, NA, NA, 74.9, 95.7, 127.3, 150.1, 327.3, NA, NA, NA, 
NA, 80.8, 176.7, 255.7, 142.6, 236.6, NA, NA, NA, NA, 148.4, 
32.4, 244.2, 66.8, 76.4, 246.1, 73.6, NA, NA, 111.2, 151.3, 102.1, 
200.6, 258.1, NA, NA, NA, NA)), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -45L))

(If anything is unclear, please ask.)

@Jaap hm, what do you miss? I tried to focus only on my problem... — bamphe, Dec 09 '18 at 16:25
[Here](https://stats.stackexchange.com/questions/77812/manova-with-unequal-sample-sizes) is a discussion about MANOVA with different sample size. — pockeystar, Dec 09 '18 at 18:04
@pockeystar As stated above in my question, the mentioned question only targets different sample sizes in the independent variable, i.e. if I had sample 1000 trees of one species and only 10 of another. However, in my case the number of sampled individuals per tree species are the same but different for each dependent variable. — bamphe, Dec 09 '18 at 18:18
Among others, you had a five-item-list describing the type of data you have. I think that is quite informative and maybe even essential to the question. — Jaap, Dec 09 '18 at 20:43
If you can put the whole data set up there that would certainly help me walk through one option for dealing with these data. — Matt Barstead, Dec 10 '18 at 21:17
Why do you stick with the MANOVA? What MANOVA can do, a linear mixed model can also do, and it has several advantages over MANOVA. One of the advantages is that it can incorporate missing values. In addition, why can't you analyze the dependent variables one by one? What do you want to get beyond the univariate (one-by-one) analysis? — user158565, Dec 11 '18 at 05:24
@user158565 Good point to think of linear mixed models, I didn't know they work with multiple DVs as well. Why I want to conduct a MANOVA: I basically just want to conduct one test to see an underlying pattern that I would probably miss if I just did individual ANOVAs (see my Edit). Does this make sense? — bamphe, Dec 11 '18 at 08:40

Matt Barstead · Answer 1 · 2018-12-14T14:46:56.320

One option here is to think of this as a missing data problem. You have varying amounts of information for each case (an individual tree) that you have classified as one of five different species. If you can meet the assumption that the data are missing at random, you could impute.

Before going any further though, I should let you know that I am going to argue against a MANOVA for these data. This is a small data set to be missing as many scores as I can already see in the first 10 rows. In fact, if you look at your degrees of freedom (4 for tree_species plus 12 residual, plus 1), you only had a total of 17 cases with complete data. MANOVA is going to listwise delete, so you lose all non-complete cases. Right now your inferential test is based on less than half your data and that seems suboptimal to me.
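To make the listwise deletion concrete, here is a minimal sketch that rebuilds the data set posted in the question in compact form (the helper `na()` and the name `dat` are just conveniences for this example) and counts how many rows `manova()` can actually use:

```r
# Full data set from the question, written compactly (NA = not measured).
na <- function(k) rep(NA_real_, k)  # helper for this sketch only
dat <- data.frame(
  tree_species = factor(rep(c("AS", "BU", "CL", "MB", "PR"), each = 9)),
  rugosity = c(2.36, 1.45, 3.13, 2.38, 4.39, 2.21, 0.81, 1.58, 0.93,
               3.34, 5.06, 0, 0.77, 12.64, 4.1, 0.8, 1.03, 0.84,
               6.49, 9.09, 5.96, 5.32, 8.41, 15.29, 9.91, 7.65, 2.13,
               9.43, 10.14, 13.24, 10.26, 9.81, 12.34, 17.23, 16.63, 8.82,
               1.68, 0.7, 0.82, 2.43, 0, 0.76, 0.77, 0, 1),
  bark_mm = c(8, 8, 9, 8, 7, na(4), 6, 8, 8, 7, 9, na(4),
              9, 9, 8, 10, 9, na(4), 5, 9, 9, 8, 4, na(4),
              5, 5, 5, 6, na(5)),
  pH = c(6.49, 6.83, 5.8, na(6), 7.22, 7.11, 7.72, 7.29, na(5),
         7.39, 7.18, 7.3, 7.3, na(5), 6.76, 6.55, 6.24, na(6),
         5.76, 6.59, 5.44, na(6)),
  whc = c(295.2, 222.4, 290.6, 314.3, 613.4, na(4),
          189.4, 248.2, 336.8, 330.1, 427.8, na(4),
          236, 492.6, 549.3, 330.1, 370.7, na(4),
          430, 142.2, 372.4, 260, 176.1, 680, 215, na(2),
          333.8, 320.6, 282.4, 322.9, 576.7, na(4)),
  ret = c(118.9, 104.9, 180.6, 214.5, 317.3, na(4),
          74.9, 95.7, 127.3, 150.1, 327.3, na(4),
          80.8, 176.7, 255.7, 142.6, 236.6, na(4),
          148.4, 32.4, 244.2, 66.8, 76.4, 246.1, 73.6, na(2),
          111.2, 151.3, 102.1, 200.6, 258.1, na(4))
)

dvs <- c("pH", "bark_mm", "rugosity", "whc", "ret")
n_per_dv   <- colSums(!is.na(dat[dvs]))      # observations available per DV
n_complete <- sum(complete.cases(dat[dvs]))  # rows with all five DVs observed

# manova() drops every row with at least one missing DV
fit <- manova(cbind(pH, bark_mm, rugosity, whc, ret) ~ tree_species, data = dat)
n_used <- nrow(residuals(fit))               # rows that entered the fit

print(n_per_dv)
print(c(complete = n_complete, used = n_used))
```

The number of complete rows is driven by the sparsest variable (pH here), so every extra measurement on the other variables is discarded by the MANOVA.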

My guess is that you would prefer to make inferences using as much information as is possible, which is going to be limited if you are using a MANOVA without some sort of imputation. Even if you did impute, Finch (2016) found that missing rates over 40%, even under the condition that they were missing at random (i.e., MAR), may cause an undesirable inflation of Type I error. His simulation study was based on two groups with 50 observations per group (i.e., $N = 100$). Your model can likely tolerate even less missing data given both a smaller total sample ($N_{max} = 45$) and a larger number of groups ($k = 5$).

If it is not possible to include the observed scores for the cases that are missing values on your different measures, I would recommend considering an approach in which you impute your data using as much information as you have about your cases (i.e., if there are additional variables you have access to that are completely observed across all cases), and then perform a series of one-way ANOVAs. Check out the mice and Amelia packages in R for imputation options.

Another reason to drop the whole MANOVA model, other than your missing data, is that it may not be all that good of a conceptual fit. Unless you have a strong hypothesis about group differences in the location of the multivariate distributions (i.e., their centroids), you gain little by performing a MANOVA in the first place. There is a myth still floating around that running a MANOVA is somehow going to protect you against the Type I error inflation that results from multiple comparisons, which is something it doesn't really do all that well.

My guess is that you are most likely interested in which dimensions your different species differ on. You were going to have to conduct follow-up one-way ANOVAs to find out anyway, so why not cut out the MANOVA middle man, which does nothing to control type I error of these separate one-way ANOVAs?
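As a rough sketch of what "separate one-way ANOVAs with a correction" could look like across several DVs at once, the following uses simulated, purely hypothetical data (`toy`, `y1` to `y3` and the group means are made up for illustration) and adjusts the omnibus p-values with Holm's method. Holm is one choice among several that `p.adjust()` offers.

```r
set.seed(1)  # simulated, hypothetical data; not the tree data
toy <- data.frame(
  species = factor(rep(c("A", "B", "C"), each = 6)),
  y1 = rnorm(18, mean = rep(c(1, 2, 3), each = 6)),
  y2 = rnorm(18),
  y3 = rnorm(18, mean = rep(c(0, 0, 2), each = 6))
)
# Give one DV fewer observations, as in the question
toy$y2[c(5, 6, 11, 12, 17, 18)] <- NA

dvs <- c("y1", "y2", "y3")

# One omnibus F test per DV; each aov() uses all rows observed for that DV
p_raw <- sapply(dvs, function(dv) {
  fit <- aov(reformulate("species", response = dv), data = toy)
  summary(fit)[[1]][["Pr(>F)"]][1]
})

# Adjust across the DVs to keep the family-wise error rate in check
p_holm <- p.adjust(p_raw, method = "holm")
print(cbind(raw = p_raw, holm = p_holm))
```

Each test uses every observation available for its own DV, so nothing is thrown away because another variable happens to be missing.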

UPDATE:

I think the gls option from Heteroskedastic Jim is a clever solution if you want to keep this in one model. But it does require a complex model for a simple problem, and without more complex random effects being specified in the model, I am not sure what the approach "buys" over and above the simpler ANOVA approach with multiple comparisons. For clarity (using the data you provided), here is what you would get with an ANOVA workflow for rugosity. Note that, for comparison with the gls example in Heteroskedastic Jim's answer, I am not imputing any values here.

# dat is the full data set from the question (manova_df there)
fit.lm1 <- lm(rugosity ~ tree_species, data = dat)
fit.lm1.aov <- aov(fit.lm1)
summary(fit.lm1.aov)

First, and importantly, the overall omnibus test is significant:

             Df Sum Sq Mean Sq F value   Pr(>F)    
tree_species  4  763.1  190.77   23.49 4.78e-10 ***
Residuals    40  324.9    8.12                     
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

And now for the model coefficients, from summary(fit.lm1):

Call:
lm(formula = rugosity ~ tree_species, data = dat)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.6756 -1.8456 -0.1467  0.9244  9.4644 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)       2.138      0.950   2.250 0.029992 *  
tree_speciesBU    1.038      1.343   0.772 0.444374    
tree_speciesCL    5.668      1.343   4.219 0.000137 ***
tree_speciesMB    9.851      1.343   7.333 6.48e-09 ***
tree_speciesPR   -1.231      1.343  -0.916 0.364958    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.85 on 40 degrees of freedom
Multiple R-squared:  0.7014,    Adjusted R-squared:  0.6715 
F-statistic: 23.49 on 4 and 40 DF,  p-value: 4.775e-10

But to make a little more sense of the differences between your 5 species, a set of $p$ values based on multiple comparisons might help. I am using the more conservative Bonferroni correction here to reduce Type I error inflation, but alternatives exist.

pairwise.t.test(dat$rugosity, dat$tree_species, p.adjust.method = "bonferroni")

Which returns:

Pairwise comparisons using t tests with pooled SD 

data:  dat$rugosity and dat$tree_species 

   AS      BU      CL      MB     
BU 1.0000  -       -       -      
CL 0.0014  0.0135  -       -      
MB 6.5e-08 7.7e-07 0.0341  -      
PR 1.0000  0.9903  7.7e-05 3.6e-09

P value adjustment method: bonferroni

That is your most complete variable. So what happens with a different DV? Well with bark_mm you'd get the following output:

fit.lm2 <- lm(bark_mm ~ tree_species, data = dat)
fit.lm2.aov <- aov(fit.lm2)
summary(fit.lm2.aov)
             Df Sum Sq Mean Sq F value  Pr(>F)   
tree_species  4  34.01   8.502   5.056 0.00603 **
Residuals    19  31.95   1.682                   
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
21 observations deleted due to missingness

summary(fit.lm2)

Call:
lm(formula = bark_mm ~ tree_species, data = dat)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.0000 -0.3375  0.0000  0.8125  2.0000 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)      8.0000     0.5799  13.795 2.38e-11 ***
tree_speciesBU  -0.4000     0.8201  -0.488  0.63133    
tree_speciesCL   1.0000     0.8201   1.219  0.23765    
tree_speciesMB  -1.0000     0.8201  -1.219  0.23765    
tree_speciesPR  -2.7500     0.8699  -3.161  0.00514 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.297 on 19 degrees of freedom
  (21 observations deleted due to missingness)
Multiple R-squared:  0.5156,    Adjusted R-squared:  0.4136 
F-statistic: 5.056 on 4 and 19 DF,  p-value: 0.006027

pairwise.t.test(dat$bark_mm, dat$tree_species, p.adjust.method = "bonferroni")

    Pairwise comparisons using t tests with pooled SD 

data:  dat$bark_mm and dat$tree_species 

   AS     BU     CL     MB    
BU 1.0000 -      -      -     
CL 1.0000 1.0000 -      -     
MB 1.0000 1.0000 0.2473 -     
PR 0.0514 0.1414 0.0038 0.5865

P value adjustment method: bonferroni

Now compare this with the relevant gls model output:

Coefficients:
                                  Value Std.Error   t-value p-value
(Intercept)                      8.0000   0.57993 13.794673  0.0000
tree_speciesBU                  -0.4000   0.82015 -0.487715  0.6267
tree_speciesCL                   1.0000   0.82015  1.219288  0.2252
tree_speciesMB                  -1.0000   0.82015 -1.219288  0.2252
tree_speciesPR                  -2.7500   0.86990 -3.161279  0.0020
measurepH                       -1.6267   0.61771 -2.633383  0.0096
measureret                     179.2400  37.30808  4.804321  0.0000
measurerugosity                 -5.8622   1.11299 -5.267085  0.0000
measurewhc                     339.1800  64.50607  5.258110  0.0000
tree_speciesBU:measurepH         1.3617   0.86708  1.570412  0.1191
tree_speciesCL:measurepH        -0.0808   0.86708 -0.093225  0.9259
tree_speciesMB:measurepH         1.1433   0.87357  1.308800  0.1932
tree_speciesPR:measurepH         2.3067   0.92044  2.506045  0.0136
tree_speciesBU:measureret      -31.7800  52.76159 -0.602332  0.5481
tree_speciesCL:measureret       -9.7600  52.76159 -0.184983  0.8536
tree_speciesMB:measureret      -59.3971  48.84873 -1.215940  0.2265
tree_speciesPR:measureret      -19.8300  52.76239 -0.375836  0.7077
tree_speciesBU:measurerugosity   1.4378   1.57401  0.913450  0.3629
tree_speciesCL:measurerugosity   4.6678   1.57401  2.965536  0.0037
tree_speciesMB:measurerugosity  10.8511   1.57401  6.893937  0.0000
tree_speciesPR:measurerugosity   1.5189   1.60049  0.949012  0.3446
tree_speciesBU:measurewhc      -40.3200  91.22536 -0.441982  0.6593
tree_speciesCL:measurewhc       47.5600  91.22536  0.521346  0.6031
tree_speciesMB:measurewhc      -21.0800  84.45884 -0.249589  0.8034
tree_speciesPR:measurewhc       22.8500  91.22582  0.250477  0.8027

Where is your bark_mm variable in the model? Well, it may not be obvious initially, but bark_mm is actually your reference measure (its level comes first alphabetically). So you'll have to change the reference level in your gls model to get the comparison you want. This is not a condemnation of the approach, just a minor annoyance, and a reason I think the ANOVA workflow is more straightforward in getting you what you want in the end. And if Type I error is a concern, ANOVAs can still be used responsibly without inflating Type I error, with appropriate corrections.
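Changing the reference measure might look like the sketch below. It works on a stand-in vector of the five measure labels rather than on the long data itself. With the real data you would presumably convert `dat.l$measure` (a character column after `gather()`) to a factor, relevel it, and refit the model.

```r
# Stand-in for the measure column of the long data
measure <- factor(c("rugosity", "bark_mm", "pH", "whc", "ret"))

# Factor levels are sorted alphabetically, so bark_mm is the default reference
ref_before <- levels(measure)[1]

# Make rugosity the reference level instead
measure_r <- relevel(measure, ref = "rugosity")
ref_after <- levels(measure_r)[1]

print(c(before = ref_before, after = ref_after))
```

After releveling, the tree_species main effects in the refitted model would describe species differences on the new reference measure.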

Also, don't forget that you need to get your different species pairwise comparisons out of the gls model if you stick with that approach (which is certainly doable - and as an aside the pairwise.t.test() function is far from the only option for post hoc mean comparisons).
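As one example of an alternative post hoc procedure, here is a sketch using base R's `TukeyHSD()` on the rugosity values from the posted data set. This swaps Tukey's honest significant difference in for the Bonferroni-adjusted t tests shown earlier, so the adjusted p-values will not match those tables exactly.

```r
# Rugosity values and species labels copied from the question's data set
dat <- data.frame(
  tree_species = factor(rep(c("AS", "BU", "CL", "MB", "PR"), each = 9)),
  rugosity = c(2.36, 1.45, 3.13, 2.38, 4.39, 2.21, 0.81, 1.58, 0.93,
               3.34, 5.06, 0, 0.77, 12.64, 4.1, 0.8, 1.03, 0.84,
               6.49, 9.09, 5.96, 5.32, 8.41, 15.29, 9.91, 7.65, 2.13,
               9.43, 10.14, 13.24, 10.26, 9.81, 12.34, 17.23, 16.63, 8.82,
               1.68, 0.7, 0.82, 2.43, 0, 0.76, 0.77, 0, 1)
)

fit <- aov(rugosity ~ tree_species, data = dat)

# All pairwise species differences with family-wise 95% confidence intervals
tuk <- TukeyHSD(fit, conf.level = 0.95)
pairs <- tuk$tree_species  # matrix with columns diff, lwr, upr, "p adj"
print(round(pairs, 4))
```

Unlike a bare table of p-values, this also gives the estimated size of each difference with an interval around it.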

Reference(s)

Finch, W. H. (2016). Missing data and multiple imputation in the context of multivariate analysis of variance. The Journal of Experimental Education, 84, 356-372. doi: 10.1080/00220973.2015.1011594

Thanks a lot for your very informative answer! I uploaded my data set, so if you want to play around with it... However, from your answer it is clear to me that a MANOVA cannot be the test I want to conduct. If it really just uses half my values, then the result is meaningless to me. I will see what I can do with the data instead. Maybe I'll stick to the single ANOVAs and compare them individually... — bamphe, Dec 11 '18 at 08:26

Heteroskedastic Jim · Accepted Answer · 2018-12-11T16:51:23.653


I think you can treat the problem as a multilevel regression instead of a MANOVA. The first step is to reshape the data into long format, with one row per measurement:

dat$ID <- 1:nrow(dat)
dat.l <- tidyr::gather(dat, measure, value, rugosity:ret)
dat.l <- na.omit(dat.l)
dat.l <- dat.l[order(dat.l$ID), ]
head(dat.l)
# # A tibble: 6 x 4
#   tree_species    ID measure   value
#   <fct>        <int> <chr>     <dbl>
# 1 AS               1 rugosity   2.36
# 2 AS               1 bark_mm    8   
# 3 AS               1 pH         6.49
# 4 AS               1 whc      295.  
# 5 AS               1 ret      119.  
# 6 AS               2 rugosity   1.45

Now each row pertains to a specific measurement, and we have identifiers for the measurement type and the tree species.
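If tidyr is not available, roughly the same reshaping can be done in base R. This is a minimal sketch on a small hypothetical data frame (`toy`, `y1` and `y2` are made-up names and values), not the tree data:

```r
# Hypothetical wide data: one row per arbitrary slot, NA where nothing was measured
toy <- data.frame(
  species = factor(rep(c("A", "B"), each = 3)),
  y1 = c(1.2, 0.8, 1.1, 2.1, 1.9, 2.3),
  y2 = c(5.0, NA, NA, 6.1, 6.3, NA)
)
toy$ID <- seq_len(nrow(toy))

measures <- c("y1", "y2")

# Stack the measure columns on top of each other, repeating the identifiers
long <- data.frame(
  species = rep(toy$species, times = length(measures)),
  ID      = rep(toy$ID, times = length(measures)),
  measure = rep(measures, each = nrow(toy)),
  value   = unlist(toy[measures], use.names = FALSE)
)

long <- long[!is.na(long$value), ]  # drop slots with no measurement
long <- long[order(long$ID), ]      # group rows by original row ID
print(head(long))
```

The result has the same four columns as `dat.l` above: species, ID, measure and value.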

A reasonable baseline model lets the value variable differ by measure in both its mean and its variance, since the measures appear to be on very different scales:

library(nlme)

fit.0 <- gls(
  value ~ measure, dat.l, # ~ 1 | ID,
  weights = varIdent(form = ~ 1 | measure))

A second step would be to include the tree species as a predictor:

fit.1 <- gls(
  value ~ measure + tree_species, dat.l, # ~ 1 | ID,
  weights = varIdent(form = ~ 1 | measure))

We can then compare both models using standard model comparison techniques:

anova(fit.0, fit.1)
#       Model df      AIC      BIC    logLik   Test  L.Ratio p-value
# fit.0     1 10 1054.608 1083.660 -517.3037                        
# fit.1     2 14 1035.080 1075.333 -503.5402 1 vs 2 27.52702  <.0001

There is a warning, because REML likelihoods are not comparable across models with different fixed effects. You can get rid of it by refitting with ML instead of REML, something like: anova(update(fit.0, method = "ML"), update(fit.1, method = "ML")). But focusing on the output we have, the likelihood ratio test suggests that the more complicated model fits the data better, and the lower AIC suggests that it has better out-of-sample predictive performance. So including species may help us understand and predict the data beyond just knowing which measure we are looking at.
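For reference, a self-contained sketch of the ML-based comparison follows. It uses simulated, hypothetical data (two measures `m1` and `m2` on very different scales, three species) rather than the tree data, and assumes the nlme package is installed.

```r
library(nlme)

set.seed(42)  # simulated, hypothetical data
long <- data.frame(
  species = factor(rep(c("A", "B", "C"), times = 20)),
  measure = factor(rep(c("m1", "m2"), each = 30))
)
# m2 lives on a much larger scale than m1, loosely mimicking pH versus whc
long$value <- ifelse(long$measure == "m1",
                     rnorm(60, mean = 5, sd = 1),
                     rnorm(60, mean = 300, sd = 50)) +
  as.integer(long$species)  # small additive species shift

# Fit both models by maximum likelihood so their likelihoods are comparable
fit0 <- gls(value ~ measure, data = long,
            weights = varIdent(form = ~ 1 | measure), method = "ML")
fit1 <- gls(value ~ measure + species, data = long,
            weights = varIdent(form = ~ 1 | measure), method = "ML")

lrt <- anova(fit0, fit1)     # likelihood ratio test of the species effect
p_value <- lrt[["p-value"]][2]
print(lrt)
```

Once a model is chosen this way, it is common to refit it with REML (the `gls()` default) before reporting its variance estimates.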

An even more complicated model would be permitting the interaction between species and measures:

fit.2 <- gls(
  value ~ tree_species * measure, dat.l, # ~ 1 | ID,
  weights = varIdent(form = ~ 1 | measure))

In this model, we fit 16 additional parameters, since there are 5 measures and 5 species, giving (5 - 1) × (5 - 1) = 16 interaction terms. We can again use anova() to compare all three models. This model has so many coefficients that it might be difficult to interpret, but if the comparison suggests it is the better model, then the species differences vary across the measures. This is likely to happen since the measures are on very different scales, even if it's the same species that is consistently different across measures. You will probably have to plot the model or break it up to better understand what is going on.

edited Dec 11 '18 at 16:51

answered Dec 11 '18 at 15:04

Heteroskedastic Jim


Great answer! Still, I've got some questions. (1) Do I need to use the `ID` variable for my baseline model? Because the measurements are not linked to each other (i.e. `pH` of `ID` 3 could also be linked with `rugosity` of `ID` 4,7 or 9...). Is this a problem? – bamphe Dec 11 '18 at 16:22
(2) Slowly I'm beginning to grasp what I really want. So what about the interaction between `tree_species` & `measure` (in the last model): if the interaction is significant, wouldn't that mean that my underlying idea that the tree species differ consistently across the parameters is wrong? In that case the tree species could differ but only for single parameters!? – bamphe Dec 11 '18 at 16:23
Note that ID is a variable I created to identify the rows in your data. Are you saying the measurements on the same rows of your original data are not linked? If the interaction is significant, yes you are right, there is no consistent difference. But that is bound to happen in your data. Even if species AS is always the most different, since the measurements are on different scales, the gap can be huge on measure whc and smaller on measure pH, simply because whc has a much greater range than pH. – Heteroskedastic Jim Dec 11 '18 at 16:32
Exactly, they are not linked! OK, I thought that the range of the data would have little impact on the test... – bamphe Dec 11 '18 at 16:37

The range of the data applies to the heteroskedasticity assumption of regression models, so it is important. Also, the difference between AS and BU will be different on pH and whc simply because whc has a huge range. This affects the interaction term. If the data are not linked, then I would switch from the `lme()` function to `gls()` function and remove the `~ 1 | ID,` part of the syntax. – Heteroskedastic Jim Dec 11 '18 at 16:43

Works like a charm! Now I just have to figure out the whole meaning behind this ;) – bamphe Dec 11 '18 at 16:54
Haha, see https://stats.stackexchange.com/questions/356080/how-to-test-for-difference-in-means-between-5-groups-the-variance-between-the-g/357142#357142 where I explained a `gls()` output in some detail. – Heteroskedastic Jim Dec 11 '18 at 17:12
Could I transform my data somehow to avoid the difference in variance? By, say, dividing the values by the respective standard deviation? Then my "Variance Function" output from the gls is quite nice (ranging between 1 - 1.6) for all parameters! ... Maybe I should read the linked answer first ;) – bamphe Dec 11 '18 at 17:21
If inference on the original scales is not of concern, then yes, I might center then standardize each of the original outcome measures. And maybe that helps with clarifying your interactions. – Heteroskedastic Jim Dec 11 '18 at 17:24
Clever solution with the `gls` model. Makes me want to do a mini-simulation comparison to see what this buys over and above the simpler ANOVA workflow I propose in more detail in my answer. – Matt Barstead Dec 14 '18 at 14:55

@MattBarstead I think the only rationale for GLS is OP's desire for a single statistical test. So we'd need a way to model heteroskedasticity. I agree with you on a simpler model for interpretation. And from a simulation I conducted, I think GLS handles heteroskedasticity well under normal errors. Under non-normal errors, I'd go for OLS with heteroskedasticity-consistent SEs for inference. – Heteroskedastic Jim Dec 14 '18 at 15:14
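The per-measure centering and standardizing discussed in the comments above could be sketched in base R as follows. The values are a few illustrative numbers on pH-like and whc-like scales, and `value_z` is a name made up for this example.

```r
# Illustrative long-format data with two measures on very different scales
long <- data.frame(
  measure = rep(c("pH", "whc"), each = 4),
  value   = c(6.5, 6.8, 5.8, 7.2, 295, 222, 291, 314)
)

# Center and scale within each measure, so every measure has mean 0 and SD 1
long$value_z <- ave(long$value, long$measure,
                    FUN = function(x) (x - mean(x)) / sd(x))

print(long)
```

The standardized values are unit-free, so coefficients from a model fitted to `value_z` can no longer be read on the original measurement scales.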
