When performing an RDA in which every row of the constraining matrix sums to the same value, the last column of the matrix is always treated as aliased and therefore dropped. Why does this happen, and how can it be avoided?

Given a matrix of response variables and a constraining matrix of explanatory variables (for instance, in the ecological sciences, one could use environmental data to explain species occurrence data), one can perform some form of constrained ordination, such as an RDA (redundancy analysis).

I am doing exactly this with a matrix of environmental variables whose rows all sum to the same value (in this case, 1). In this use case, the last column of the matrix is always dropped because it is considered "aliased", which I understand to mean collinear with the other variables in the matrix. I use the package vegan in R.

library(vegan)

## Environmental data: the constraining matrix
env <- matrix(c(0.2, 0.3, 0.5, 
                0.1, 0.8, 0.1, 
                0.7, 0.2, 0.1, 
                0.3, 0.3, 0.4), 4, 3, byrow = TRUE)
colnames(env) <- c("var1", "var2", "var3")
## Rows sum to 1
rowSums(env)

## Species data: the response variables
spe <- matrix(c(4, 7,
                2, 3, 
                6, 8, 
                2, 1), 4, 2)

## RDA
my_rda <- rda(spe ~ ., data = as.data.frame(env))
my_rda

The results of the RDA mention aliased variables, which we can list with alias():

## Aliases
alias(my_rda)

Shuffling the columns confirms that it is always the last column of the matrix that is considered aliased and dropped:

## Shuffling the columns
env_shuffled <- env[,c(3,1,2)]

## Running RDA again
my_rda_shuffled <- rda(spe ~ ., data = as.data.frame(env_shuffled))
my_rda_shuffled

## Aliases seem to be always in the last column
alias(my_rda_shuffled)

How can this be explained, and how can I run the RDA properly so that all my variables are kept in the analysis?

  • See https://stats.stackexchange.com/questions/259208 for an account of one flexible, principled method. – whuber Feb 23 '21 at 12:55
  • Try log-transforming your env data, for example log(10000*env+1), before calling rda(). – kai Min Feb 23 '21 at 12:03
  • Although mathematically this works, its arbitrariness should give one pause. See https://stats.stackexchange.com/a/30749/919 for a discussion. – whuber Feb 23 '21 at 12:56

1 Answer

"I am doing exactly this with a matrix of environmental variables whose sum of rows are all equal to the same value (in this case, 1)."

I wonder whether this is the problem right here, although I am not certain about it. Basically, if every row sums to 1, then the last column can always be calculated from the other columns (here, var3 = 1 - var1 - var2). Another way to think about collinearity is that a collinear variable adds no new information, which is the case here. The last column explains no variation beyond what the other columns already explain, so it is aliased and dropped.
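The dependency described above can be checked directly. The following is only a minimal sketch in base R, using the `env` values copied from the question; the helper names `var3_rebuilt`, `env_centered`, `env_rank` and `shuffled_rank` are introduced here for illustration and are not part of vegan. It does not show what rda() does internally, only that the three columns carry two columns' worth of information.

```r
## Minimal sketch, base R only; `env` is copied from the question.
env <- matrix(c(0.2, 0.3, 0.5,
                0.1, 0.8, 0.1,
                0.7, 0.2, 0.1,
                0.3, 0.3, 0.4), 4, 3, byrow = TRUE)
colnames(env) <- c("var1", "var2", "var3")

## Because every row sums to 1, var3 is fully determined by var1 and var2
var3_rebuilt <- 1 - env[, "var1"] - env[, "var2"]
all.equal(unname(env[, "var3"]), unname(var3_rebuilt))

## After centering the columns, count how many independent directions remain
env_centered <- scale(env, center = TRUE, scale = FALSE)
env_rank <- qr(env_centered)$rank
env_rank

## Reordering the columns does not change the rank, only which column is last
shuffled_rank <- qr(scale(env[, c(3, 1, 2)], center = TRUE, scale = FALSE))$rank
shuffled_rank
```

If the rank reported is smaller than the number of columns, one column is redundant whichever order the columns are in, which would be consistent with the aliased variable simply being whichever one comes last.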

A possible solution, since your data seem like they might be proportions, is to use the original values rather than the proportional ones.

No idea if this is, for sure, the actual answer, but I hope someone else can chime in, or this sets you on a hunt for the right answer.

Cheers.

DaveH
  • Yes, it does seem to be related to the fact that "the last column can always be calculated from the other columns", but here I am using the original species abundances. The issue is not in the response data; it's in the environmental data. – Valentin Lucet Dec 09 '20 at 14:47