analysis of non-normal data

Question

I collected some data on a species of goose, the Brent Goose, over the winter. A CSV file of the data can be downloaded from Dropbox or imported straight into R with this code:

library(repmis)
goose_behaviour <- repmis::source_DropboxData("goose_behaviour.csv", "hy6labsyh56050g", sep = ",", header = TRUE)

Each row of the data represents a flock of geese. Each flock belonged to one of two subspecies: Dark-bellied Brent Goose or Light-bellied Brent Goose. The Dark-bellied Brent Geese were present on the east side of an intertidal mudflat and the Light-bellied Brent Geese on the west side. The values in each row represent the proportions of the flock exhibiting each behaviour, so each row sums to 1.
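The sum-to-one constraint can be checked before any analysis. This is a minimal sketch on a small made-up data frame, `toy`, standing in for `goose_behaviour`; the values are invented for illustration, and it assumes the behaviour columns are exactly the numeric ones.

```r
# Hypothetical stand-in for goose_behaviour; all values are invented.
toy <- data.frame(
  pecking = c(0.50, 0.20, 0.70),
  alert   = c(0.30, 0.60, 0.20),
  asleep  = c(0.20, 0.20, 0.10),
  subspecies = c("Dark-bellied Brent Goose",
                 "Light-bellied Brent Goose",
                 "Dark-bellied Brent Goose")
)

# Sum the numeric (behaviour) columns within each row.
behaviour_cols <- sapply(toy, is.numeric)
row_totals <- rowSums(toy[, behaviour_cols])

# Compare with a tolerance, because floating-point sums are rarely exactly 1.
sums_to_one <- abs(row_totals - 1) < 1e-8
all(sums_to_one)
```

Rows that fail the check would point to rounding in the CSV or to a behaviour column that was left out.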

I want to know if Dark-bellied Brent Goose and Light-bellied Brent Goose are exhibiting different proportions of each of the 7 behavioural types.

Plots of the seven behavioural types show they are each very non-normally distributed. Nonetheless, I calculated means and standard errors for each behaviour and for each subspecies as follows:

library(dplyr)
# %>% replaces the older %.% operator, which current versions of dplyr no longer provide
goose_behaviour %>%
  group_by(subspecies) %>%
  summarise(pecking = mean(pecking), alert = mean(alert), aggression = mean(aggression),
            asleep = mean(asleep), preening = mean(preening), flying = mean(flying),
            other = mean(other)) %>%
  as.data.frame()


                 subspecies   pecking     alert aggression     asleep   preening     flying       other
1  Dark-bellied Brent Goose 0.4048882 0.3438450 0.02123310 0.05914777 0.10377128 0.06418008 0.002934522
2 Light-bellied Brent Goose 0.3620766 0.3467897 0.00534835 0.17768323 0.04889585 0.05657772 0.002628567


statStandardError <- function(x) sqrt(var(x,na.rm=TRUE)/length(na.omit(x)))

# %>% replaces the older %.% operator, which current versions of dplyr no longer provide
goose_behaviour %>%
  group_by(subspecies) %>%
  summarise(pecking = statStandardError(pecking), alert = statStandardError(alert),
            aggression = statStandardError(aggression), asleep = statStandardError(asleep),
            preening = statStandardError(preening), flying = statStandardError(flying),
            other = statStandardError(other)) %>%
  as.data.frame()

                 subspecies    pecking      alert  aggression    asleep   preening     flying        other
1  Dark-bellied Brent Goose 0.03422627 0.02902893 0.003839248 0.0163771 0.01617119 0.02160593 0.0012157325
2 Light-bellied Brent Goose 0.03014162 0.02489201 0.001104804 0.0242016 0.01068396 0.01498604 0.0007520355

Because the data are non-normally distributed, I've also used a Wilcoxon rank sum test on each behavioural type to test whether the two subspecies differ:

library(plyr)
llply(goose_behaviour[,1:7],  function(x) wilcox.test(x ~ subspecies, goose_behaviour))

Here are my questions:

If data are non-normally distributed, is it appropriate to calculate means and standard errors? Would calculating medians be more appropriate?
Is a Wilcoxon rank sum test appropriate to test if the two subspecies differ in behavioural types?
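For reference, medians and interquartile ranges can be computed per subspecies in the same way as the means. This is only a sketch on invented numbers: the data frame `toy` and its values are hypothetical, and a single behaviour column stands in for all seven.

```r
# Hypothetical data: proportion asleep in eight flocks; values are invented.
toy <- data.frame(
  subspecies = rep(c("Dark-bellied", "Light-bellied"), each = 4),
  asleep = c(0.00, 0.00, 0.05, 0.40, 0.00, 0.10, 0.20, 0.60)
)

# Mean, median and interquartile range of one behaviour within each subspecies.
asleep_mean   <- tapply(toy$asleep, toy$subspecies, mean)
asleep_median <- tapply(toy$asleep, toy$subspecies, median)
asleep_iqr    <- tapply(toy$asleep, toy$subspecies, IQR)

# Side-by-side summary, one row per subspecies.
cbind(mean = asleep_mean, median = asleep_median, IQR = asleep_iqr)
```

In these made-up, right-skewed values the mean sits above the median in both groups, which is the kind of gap the first question is about.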

As proportions, these would be count data that you've scaled by some total. I assume you have the numbers you had to divide by to turn counts into proportions, in which case you might want to consider models appropriate to count data, such as a GLM. — Glen_b, Apr 20 '14 at 09:11
@Glen_b My guess is different, that these are fractions of time spent doing different things. But I agree on a fundamental: what the raw data are and how they were generated is germane to recommending how the data should be analysed. These data may be labelled compositional. I can't see that Wilcoxon ideas apply, as they are based on comparing groups two at a time, but with precisely no attention to the compositional constraint. — Nick Cox, Apr 20 '14 at 11:07
@Nick "*the values in each row represent the proportions of the flock exhibiting each behaviour*" seems to me to be count based rather than continuous, but it may depend on how the values were obtained. — Glen_b, Apr 20 '14 at 11:09
Yes these data are count based. Count the number of geese in the flock then count the number of geese doing each behaviour. — luciano, Apr 20 '14 at 11:14
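Since the raw data are counts, the GLM suggestion above could look roughly like the following. This is a sketch under stated assumptions, not a recommended final model: the data are simulated, the column names `flock_size` and `n_asleep` are hypothetical, only one behaviour is modelled, and flocks are treated as independent, which the later comments question. The `quasibinomial` family is my choice here to allow for overdispersion.

```r
set.seed(1)

# Simulated stand-in for the raw counts; names and values are hypothetical.
n_flocks <- 40
toy <- data.frame(
  subspecies = factor(rep(c("Dark-bellied", "Light-bellied"), each = n_flocks / 2)),
  flock_size = sample(20:200, n_flocks, replace = TRUE)
)

# Simulate the number of sleeping geese per flock with different true proportions.
true_p <- ifelse(toy$subspecies == "Dark-bellied", 0.06, 0.18)
toy$n_asleep <- rbinom(n_flocks, size = toy$flock_size, prob = true_p)

# Binomial-type GLM on counts: successes = asleep, failures = not asleep.
# quasibinomial estimates a dispersion parameter instead of fixing it at 1.
fit <- glm(cbind(n_asleep, flock_size - n_asleep) ~ subspecies,
           family = quasibinomial, data = toy)

# Coefficients are on the log-odds scale; the second row compares the subspecies.
summary(fit)$coefficients
```

Fitting one behaviour at a time like this ignores the compositional constraint (the seven proportions sum to 1) and any dependence between observations taken close together.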
OK, so what are the replicates for each subspecies, different days, different sites? Are the replicates totally independent of each other? — Nick Cox, Apr 20 '14 at 11:18
Data was collected on the same intertidal mudflat (about 5.5 square kilometres in size). See question for further details. Data was collected between October 2013 and March 2014. Some observations were collected temporally close to each other, for example on some days I collected 10+ observations within the same hour. — luciano, Apr 20 '14 at 11:26
I don't see that you have a hope of getting an honest and credible P-value for differences between subspecies without modelling the dependence structure of your data. Alternatively, it may be that by comparing successive observations you can make a case that they are in some sense independent, but that's an assumption to be tested, and not an article of faith. (I don't have a clear idea of an appropriate model for your data, absent a dependence structure, perhaps a Dirichlet distribution.) — Nick Cox, Apr 20 '14 at 11:36
Would it be helpful if I included the date and time of each observation? I have this easily available — luciano, Apr 20 '14 at 11:43
That's indeed what you need to look at dependence structure. — Nick Cox, Apr 20 '14 at 11:47
Another point is that I might have to trade off accuracy with understanding. Trying to model something like a Dirichlet distribution with a mixed effects model might lead to more problems than it will solve. I'm very willing to learn techniques like this, but this might take years and I need to analyse these data in the next few months — luciano, Apr 20 '14 at 11:48
I am sure we appreciate that. However, it also seems that you planned data collection without planning how the data would be analysed: why should that compromise suggestions on what should be done to analyse the data? Correspondence analysis seems a descriptive possibility here that should be familiar to you from ecological or ethological literature, and it's not perturbed by inferential nuances, as it tends to avoid them. Moreover, plotting the data to look at time dependence should take just a few minutes and seems to me to be something of scientific interest anyway. — Nick Cox, Apr 20 '14 at 11:56
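A quick look at time dependence, as suggested in the last comment, might be sketched like this. Everything here is hypothetical: the data frame `toy`, the column `observed_at`, and the simulated values stand in for the real date and time of each observation.

```r
set.seed(2)

# Simulated stand-in: 30 observation times over about 150 days, invented proportions.
toy <- data.frame(
  observed_at = as.POSIXct("2013-10-01 09:00:00", tz = "UTC") +
    sample.int(150 * 24 * 3600, 30),
  asleep = runif(30)
)

# Put the observations in time order.
toy <- toy[order(toy$observed_at), ]

# Plot one behaviour against observation time.
plot(toy$observed_at, toy$asleep, type = "b",
     xlab = "Observation time", ylab = "Proportion asleep")

# Gap in hours between successive observations; small gaps flag clustered sampling.
gap_hours <- as.numeric(diff(toy$observed_at), units = "hours")

# Crude check: correlation between each observation and the next one in time.
lag1 <- cor(head(toy$asleep, -1), tail(toy$asleep, -1))
lag1
```

Because the observations are irregularly spaced, the lag-1 correlation is only a rough indicator; it would be more informative computed within days or restricted to pairs with short gaps.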
