Analysis of count data with percentages

Question

For my master's thesis I count and identify sediment grains. In total I have 82 samples from 3 different gravity cores. I divided the sediment components into 11 groups (quartz, mica, opaque, aggregate, other terrigenous, etc.). To estimate how many grains I need to count, I ran a preliminary study with 5 samples from one core.

First, I randomly counted and identified 300 grains in every sample. After that, I did the same with 100 grains. The null hypothesis is that there is no difference between counting 100 and 300 grains; the alternative hypothesis is that there is a statistically significant difference between counting 100 and 300 grains.

To compare the two methods (counting 100 or 300 grains), I converted the count data to percentages. Part of the data:

"Component" "Method" "Percentage"
"Aggregate" "A" 4
"Aggregate" "A" 3
"Aggregate" "A" 2
"Aggregate" "A" 1
"Aggregate" "A" 1
"Aggregate" "B" 1.66666
"Aggregate" "B" 0.66666
"Aggregate" "B" 1.66666
"Aggregate" "B" 2
"Aggregate" "B" 1.33333
"BenthForam" "A" 19
"BenthForam" "A" 11
"BenthForam" "A" 9
"BenthForam" "A" 15
"BenthForam" "A" 13
"BenthForam" "B" 16
"BenthForam" "B" 11.33333
"BenthForam" "B" 11.66666
"BenthForam" "B" 17.66666
"BenthForam" "B" 15.33333
"Mica" "A" 3
"Mica" "A" 19
"Mica" "A" 13
"Mica" "A" 8
"Mica" "A" 14
"Mica" "B" 6.66666
"Mica" "B" 7.33333
"Mica" "B" 10
"Mica" "B" 8.66666

My first attempt was to use ANOVA with a nested linear model (R-code):

aov(Percentage ~ factor(Component) + factor(Component):factor(Method))

Component is the (11) different groups, Percentage is the count data expressed as a percentage, and Method is counting 100 (Method A) or 300 (Method B) grains.

But the residuals of the ANOVA are not normally distributed, and equal variances cannot be assumed. The data also show overdispersion, so I was thinking about negative binomial regression. The problem here is that I have an upper bound, and the only way to use this test would be to exclude one component, such as quartz, since it is the most abundant component in each sample.

What test would you recommend, or should I change my approach? I use R and would prefer to have references if possible.

" The null hypothesis is that there is no difference when counting 100 or 300 grains, whereas the alternative hypothesis implies that there is a statistically difference when counting 100 or 300 grains." So you're performing an analysis that assumes that statistical methods are valid to test whether statistical methods are valid? I don't understand the point of what you're doing. — Acccumulation, May 02 '19 at 15:11
I want to test whether there is a significant difference between those two methods because I could save a lot of time analysing the grains under the microscope. I know that in this case more is always better. But counting and identifying 8200 grains is far more time-efficient than 24600 grains. If counting 100 grains/sample does not give completely different results from counting 300/sample, I can save a lot of time. — Vincent, May 02 '19 at 15:23
Statistical theorems will tell you what the effect of a smaller sample size is. If you don't trust statistics to give the answer to that question, why do you trust them to evaluate an empirical test? — Acccumulation, May 02 '19 at 15:28
Yes, I agree that statistically the best would be to count every single grain (thousands?) of the sample. But if I can prove that counting 100 grains gives similar results to counting 300, I can use this information to explain why I only counted 100. Or is your question why I chose to count either 100 or 300 grains? — Vincent, May 02 '19 at 15:39
The variance of the mean of a sample of 100 is, making various assumptions such as independence, three times that of 300. If that factor of three is insufficient to invalidate whatever you're trying to do, then you need only note this statistical fact. If you are not comfortable making the assumptions required to make this assertion, then why are you comfortable making these assumptions when performing your 100 vs. 300 comparison? — Acccumulation, May 02 '19 at 15:56

score 3 · Accepted Answer · answered May 02 '19 at 09:59

When dealing with count data, it is generally best not to convert to percentages. Instead, implement a model on the original count values that were used to obtain the percentages, if necessary, using a fixed offset for the denominator in the percentage (see e.g., here). A negative-binomial GLM with a log-link function is a good starting point for analysis, and it allows easy incorporation of a fixed offset on a log-scale. Given percentages formed from a Numerator_count divided by a Denominator_count you would use a model like this:

library(MASS)  # glm.nb() is provided by the MASS package
MODEL <- glm.nb(Numerator_count ~ offset(log(Denominator_count)) + ..., data = DATA)
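
As a rough illustration of how this could look for the data excerpt in the question, here is a minimal sketch. It assumes the raw counts can be recovered by multiplying each percentage by the number of grains counted (100 for Method A, 300 for Method B); the column names Count and Total and the Component * Method specification are my own choices for the example, not something given in the question. With the offset, the model is log E[Count] = log(Total) + linear predictor, so the coefficients describe log proportions rather than raw counts. Only the three components posted in the question are included, so treat the output as a demonstration of the syntax and not as an answer to the 100 vs. 300 question.

```r
library(MASS)  # provides glm.nb()

# Counts reconstructed from the percentages posted in the question
# (assumption: Method A = 100 grains counted, Method B = 300 grains counted).
d <- data.frame(
  Component = rep(c("Aggregate", "BenthForam", "Mica"), times = c(10, 10, 9)),
  Method    = c(rep(c("A", "B"), each = 5),        # Aggregate
                rep(c("A", "B"), each = 5),        # BenthForam
                rep(c("A", "B"), times = c(5, 4))),  # Mica (only 4 B values posted)
  Count     = c( 4,  3,  2,  1,  1,   5,  2,  5,  6,  4,
                19, 11,  9, 15, 13,  48, 34, 35, 53, 46,
                 3, 19, 13,  8, 14,  20, 22, 30, 26)
)

# Denominator of each percentage: total grains counted in that sample
d$Total <- ifelse(d$Method == "A", 100, 300)

# Negative-binomial GLM (log link) on the counts, with the log of the
# denominator as a fixed offset; the Method terms then compare the
# estimated proportions between the two counting schemes.
fit <- glm.nb(Count ~ Component * Method + offset(log(Total)), data = d)
summary(fit)
```

With so few samples, glm.nb() may emit convergence warnings while estimating the dispersion parameter, so check the warnings and the estimated theta before interpreting the coefficients.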
