
Assume a dataset of counts at distances (1, 2, 5, and 10) from a point, taken from three independent samples (a, f, and m), like the following:

sample;distance;count;count_central
a;1;34253;14640
a;2;42085;17988
a;5;32365;13833
a;10;11587;4953
f;1;8198;19413
f;2;5580;13214
f;5;3510;8312
f;10;4424;10476
m;1;5296;22244
m;2;5296;22244
m;5;1116;4687
m;10;533;2239

I am trying to show a relationship (these are very few samples, both my tutor and I are aware of this) between distance and counts. My knowledge and understanding of statistics is very limited, so please be gentle. What I did to show the overall relationship was to centralise the means of all three samples around a common central mean, then do a linear regression and report the values of the pooled linear regression analysis (between distance and count_central).

Does that make sense? Is means centralisation a thing? Is there a reference for this that I can cite in my research? I mean, intuitively, I would argue this is totally legitimate, but my tutor argues this is totally illegal and I can only do the analysis for each sample individually.

If what I am trying to do is indeed illegal, is there a way to describe the relationship between distance and counts in a sensible manner and how would you do it? (in fact, "how would you do it?" is probably not the right question. I should be able to understand the method. If somebody later asks me what I did and I go "uh, well, you know, it's statistics... heh", then that's not gonna cut it, I guess.)

If the tagging is out of whack, please feel free to add and remove tags so that it makes sense.

  • I counted microplastic particles in roadside soils in three different sites. Distance is physical distance (in m) from the road curb. – thymaro Sep 03 '19 at 03:56
  • Thank you for the clarification. You can include this information in your question by using the edit button below it. I will write an answer later today when I have time. – Frans Rodenburg Sep 03 '19 at 04:57
  • @FransRodenburg yes, thanks, saw it, haven't had time to check it out. I'll have to work through it and it will take some time, but I think I should be ok. This will be a very asynchronous evaluation ;) – thymaro Sep 05 '19 at 05:34
  • Not a problem, I wasn't sure if you had noticed it. – Frans Rodenburg Sep 06 '19 at 05:02

2 Answers


What you want is a model that relates the expected distance $E[D]$ to the count $C$. If you thought the three samples functioned differently, you could estimate three separate regression models of the form $E[D|C,S=s']=\beta_{s'0}+\beta_{s'1}C$, where $S$ is the sample variable and $s'$ is one of the samples (i.e., a, f, m). This would give you three different estimates of the relationship between count and distance.
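As a minimal sketch in R, assuming the data sit in a hypothetical data frame `dat` with one row per observation and columns `sample`, `distance` and `count`, the three separate regressions could look like this:

    # A sketch only: assumes a data frame `dat` with columns sample, distance, count
    fit_a <- lm(distance ~ count, data = subset(dat, sample == "a"))
    fit_f <- lm(distance ~ count, data = subset(dat, sample == "f"))
    fit_m <- lm(distance ~ count, data = subset(dat, sample == "m"))
    lapply(list(a = fit_a, f = fit_f, m = fit_m), coef)  # three intercept/slope pairs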

You could also run a single regression model for the whole sample using subsample indicators. Let $g_{s'}$ be a dummy variable that is equal to $1$ if the observation is in sample $s'$ and $0$ otherwise. Now you can specify the following regression model: $$E[D|C,S]=g_{a}(\beta_{a0}+\beta_{a1}C)+g_{f}(\beta_{f0}+\beta_{f1}C)+g_{m}(\beta_{m0}+\beta_{m1}C) \\ =\beta_{a0}g_{a}+\beta_{a1}g_{a}C+\beta_{f0}g_{f}+\beta_{f1}g_{f}C+\beta_{m0}g_{m}+\beta_{m1}g_{m}C$$ To get a single number that represents the marginal effect of count on distance, you could take the average of the group-specific $\beta_{.1}$s (i.e., $\frac{\beta_{a1}+\beta_{f1}+\beta_{m1}}{3}$).
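The same idea as a single model in R, again using the hypothetical `dat` from above (the coefficient names assume `sample` is, or is coerced to, a factor with levels a, f, m):

    # Sample-specific intercepts and slopes in one model (no global intercept)
    fit_sep <- lm(distance ~ 0 + sample + sample:count, data = dat)
    coef(fit_sep)

    # One summary number: the average of the three sample-specific slopes
    mean(coef(fit_sep)[c("samplea:count", "samplef:count", "samplem:count")])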

If you thought the group-specific slopes were all the same, you could estimate a simpler model ignoring group membership as follows: $$E[D|C,S]=\beta_{a0}g_{a}+\beta_{f0}g_{f}+\beta_{m0}g_{m}+\beta_{1}C$$ Your approach involves estimating the following model: $$E[D|C,S]=\beta_{0}+\beta_{1}(g_{a}(C-\bar{C}_a)+g_{f}(C-\bar{C}_f)+g_{m}(C-\bar{C}_m)) \\=\beta_{0}+\beta_{1}((g_{a}+g_{f}+g_{m})C-(g_{a}\bar{C}_a+g_{f}\bar{C}_f+g_{m}\bar{C}_m)) \\= \beta_{0}-\beta_{1}(g_{a}\bar{C}_a+g_{f}\bar{C}_f+g_{m}\bar{C}_m)+\beta_{1}(g_{a}+g_{f}+g_{m})C \\= \tilde{\beta}_{a0}g_{a}+\tilde{\beta}_{f0}g_{f}+\tilde{\beta}_{m0}g_{m}+\beta_{1}C$$ So, you're still estimating the same effect, $\beta_{1}$, which will have the same value regardless. The only difference will be in the interpretation of the group intercepts $\tilde{\beta}_{.0}$, which are now a bit more challenging to interpret after group-mean centering.
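A sketch comparing these two versions in R, with additive group-mean centring via `ave()` mirroring the $C-\bar{C}_s$ terms above (again using the hypothetical `dat`):

    # Common-slope model versus the group-mean-centred version;
    # the slope estimate is the same in both, only the intercepts differ
    fit_pool <- lm(distance ~ 0 + sample + count, data = dat)

    dat$count_c <- dat$count - ave(dat$count, dat$sample)  # centre count within each sample
    fit_cent <- lm(distance ~ count_c, data = dat)

    coef(fit_pool)["count"]    # same slope...
    coef(fit_cent)["count_c"]  # ...as here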

So, centering buys you nothing, but it doesn't make things much worse. Just fit the model you want to fit without manipulating the data. I recommend the first approach because it's fairly simple and doesn't require any constraining assumptions. If you're willing to make the assumption that the slope is the same, then just fit the second model.

Noah
  • Unfortunately, you lost me after "form". I clearly need to up my understanding of statistics. It's on the list. Thanks for the answer. Now I have something to work towards to. – thymaro Sep 04 '19 at 19:33
  • Sorry! I wasn't sure how technical you wanted to be since this is potentially a technical question. If you ask on stackoverflow I can show you R code to run the method I recommended. – Noah Sep 05 '19 at 01:27
  • I might do that and link to it. Just to be clear, I actually really want to get into the matter and work up enough of the theory to be able to follow your answer. Even if it takes me two years, I'll probably come back to this and evaluate the model you propose. – thymaro Sep 05 '19 at 05:29
  • Uhm, what's wrong about posting R code here? R is statistical software. – thymaro Sep 05 '19 at 05:30
  • It would be too much to add to a comment, and the question "How do I implement the method described in R?" is a programming question. Just ask there and I can help out. – Noah Sep 06 '19 at 05:53
  • This will have to wait until next weekend. Thanks for the explanation, though. – thymaro Sep 07 '19 at 13:13

Visual summary

There is no evidence more compelling than a simple visual summary of your data:

[Figure: microplastic count vs. distance from the road curb, shown with a linear (left) and logarithmic (right) y-axis]

Both figures show count vs distance, but the one on the right has a logarithmic y-axis. There is a clear trend towards lower particle counts further from the road curb, which I assume is also what you would expect. Which of these plots is better depends on your preference and your audience.


Using a statistical model

Since the sample size is limited and the figure perhaps not convincing enough to everyone, it is nice to also include some sort of measure to convince the audience that these variables do indeed correlate. You could simply calculate the Spearman rank correlation between count and distance, but this (A) is not very powerful, because ranks throw away a lot of information, and (B) does not respect the dependency in your data (multiple pseudoreplicates of a, f and m).
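If you did want to report it anyway, a one-line sketch (using the count and distance vectors constructed in the Code section below) would be:

cor.test(count, distance, method = "spearman")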

Modeling counts

Instead of using ranks, or some other non-linear measure, a more efficient way to model the data is to use an appropriate conditional probability distribution: the distribution of your response (microplastic count) after accounting for the effect of the explanatory variable(s) (distance from the road curb).

Counts follow a discrete probability distribution. Specifically, independent counts follow a Poisson distribution. However, the Poisson distribution always has variance equal to the mean. In many cases there is an excess of variance caused by unknown contributing factors or some form of dependency. This excess of variance can be modeled by including an extra dispersion parameter, such as in the negative binomial distribution, or the quasi-Poisson distribution.
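As a quick informal check (not a formal test), you can compare each sample's mean and variance using the count and sample vectors constructed in the Code section below; variances far above the means point to overdispersion:

tapply(count, sample, mean)  # sample means
tapply(count, sample, var)   # sample variances, here far larger than the means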

This can be done using a generalized linear model (GLM), which can have any error distribution in the exponential family of distributions (including normal, Poisson, etc.).

Modeling dependency

Since your data come from the same three samples (a, f and m), it is logical to assume that measurements within the same sample are more strongly correlated with each other. The plot above also clearly demonstrates this. To account for this dependency, you should include a random effect. Models with both random effects (here: sample) and fixed effects (here: distance) are called mixed models.

Without further knowledge of your experiment, I chose to model this below using a random intercept. However, you generally have to choose between a random intercept, a random slope, or both; a sketch of the random-slope alternative is included at the end of the Code section below.

A model with an error structure in the exponential family and mixed effects is a generalized linear mixed model (GLMM). In the code below I demonstrate how you could run this analysis using either a Poisson or negative binomial error structure. I then compare the model to one that does not include a slope for distance, from which you can conclude that including distance results in a significantly improved fit.


Code

To produce the figure, run the following in $\textsf{R}$:

sample   <- rep(c("a", "f", "m"), each = 4)
distance <- rep(c(1, 2, 5, 10), times = 3)  # distances 1, 2, 5, 10 within each sample
count    <- c(34253, 42085, 32365, 11587, 8198, 5580, 3510, 4424, 5296, 5296, 1116, 533)

plot(count ~ distance, pch = sample, bty = "n", las = 1, cex.lab = 1.1, cex.axis = 0.8,
     xlab = "Distance from road curb (m)", ylab = "Microplastic count")

You can obtain the version on the right by including log = "y" in the call to plot.

To fit a generalized linear mixed model in $\textsf{R}$, I recommend the package lme4:

library(lme4)  # provides glmer() and glmer.nb()

# Using a Poisson error structure
GLMM.p  <- glmer(count ~ distance + (1 | sample), family = "poisson") # model with distance
GLMM.p0 <- glmer(count ~ 1 + (1 | sample), family = "poisson")        # null model
summary(GLMM.p)                                                       # see the coefficients
anova(GLMM.p0, GLMM.p)                                                # compare fit


# Using a negative binomial error structure
GLMM.nb  <- glmer.nb(count ~ distance + (1 | sample))
GLMM.nb0 <- glmer.nb(count ~ 1 + (1 | sample))                        # null model
summary(GLMM.nb)                                                      # see the coefficients
anova(GLMM.nb0, GLMM.nb)                                              # compare fit

The model with either error structure produces a significantly better fit if you include distance as an explanatory variable.

To get rid of the warning when fitting either model, scale the distance variable:

GLMM.p  <- glmer(count ~ scale(distance) + (1 | sample), family = "poisson")
GLMM.nb <- glmer.nb(count ~ scale(distance) + (1 | sample))

Note that the coefficients are now for the scaled distance. The remaining warning about a data argument can be ignored.
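If you prefer the random-slope alternative mentioned under Modeling dependency, a sketch would look like this (GLMM.rs is just an illustrative name; with only three samples of four observations each, this model may well fail to converge):

GLMM.rs <- glmer(count ~ scale(distance) + (scale(distance) | sample), family = "poisson")
summary(GLMM.rs)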


What does this model look like?

Since you only have one explanatory variable (distance), you can plot the mixed model's fit over the scatter plot at the beginning, which then looks like this:

[Figure: the fitted GLMM (one line per sample) overlaid on the log-scale scatter plot of count vs. distance]

The $\textsf{R}$ code I used for this is:

plot(count ~ distance, pch = (sample), bty = "n", las = 1, log = "y", 
     xlab = "Distance from road curb (m)", ylab = "Microplastic count", 
     cex.lab = 1.1, cex.axis = 0.8, main = "Original")
for(i in 1:length(unique(sample))){
    lines(
      fitted(GLMM.nb)[sample == unique(sample)[i]] ~ distance[sample == unique(sample)[i]],
      col = 2
    )
}

This figure shows the same linear relationship for each sample, albeit with a random offset (the random intercept in the GLMM). Both the Poisson and the negative binomial model here use a log link, so the figure on the right (log scale) looks like a line (I used the negative binomial version here). Again, change the log = ... argument if you prefer the original scale.

You can use this figure, the model summary, the measure of improved fit and of course the theoretical justification for the model to convince the audience that microplastic counts diminish with distance from the road curb.

Frans Rodenburg