I am a bioinformatics master's student working on a study at work, and I would like some advice on a biostatistics issue. As a side note: we will be consulting with a biostats core, but I wanted to get practice working through problems like this.

Data: I have a data set that contains phenotypic information on a cell population for each subject. The readout of the phenotype is the % positive for that marker in that population. Here is an example table showing the data:

    Group Participant Gender      Age Caucasian API Hispanic Other       Date Marker1 Marker2 Marker3
2  Group1          A1      0 39.74795         1   0        0     0 2018-07-11    1.77   13.60    77.8
3  Group1          A2      1 39.50411         0   1        0     0 2018-07-11    1.38    3.90    96.1
4  Group1          A3      0 43.79178         1   0        0     0 2018-07-25    9.34   13.60    85.2
5  Group1          A4      0 42.80274         0   0        0     0 2018-07-11    2.06    4.08    77.6
6  Group2          A5      1 41.27619         1   0        0     0 2018-07-25    0.65   16.00    79.9
7  Group2          A6      0 42.07710         1   0        0     0 2018-07-25    2.46   18.20    93.8
8  Group2          A7      0 42.70411         0   1        0     0 2018-07-11    0.30    0.00    75.0
10 Group2          A8      0 38.70387         0   0        0     0 2018-07-11    1.48    3.73    84.4
11 Group3          A9      0 40.71483         0   0        0     0 2018-07-25    1.76    7.48    90.5
13 Group3         A10      1 38.96690         0   1        0     0 2018-07-25    5.87   12.90    81.6
15 Group3         A11      0 41.46002         0   1        0     0 2018-07-25    2.40   18.80    96.0
16 Group3         A12      0 33.87945         0   1        0     0 2018-07-11    4.16    8.56    60.4

As you can see there are 3 groups with different individuals in each. Now getting to my question: we are looking to determine the impact of aging on these phenotypic markers. But, we have a lot of variables that can also impact this, like gender, race, and in infected groups the type and length of treatment. The date column is the date the samples were run. I am looking for guidance on how to best design a linear model formula. My initial readings suggested doing

Marker1 ~ Age + Gender + Caucasian + API + Hispanic + Other + Date

But I am curious whether this is the best method. I read about the lmer() function and was wondering whether Date could be used as a random effect there, with something like:

Marker1 ~ Age + Gender + Caucasian + API + Hispanic + Other + (1 | Date) 

Thanks for all advice!

Kyle K.

2 Answers

Interesting idea. There is nothing wrong with using Date as a random effect. However, one thing to remember about mixed models is that the random effects are assumed to be centered at zero, and the lmer model will simply return the variance of the random effect. I'm unsure if that really answers your question. You can (and in many fields people do) include a variable as both a fixed and a random effect, yielding a model like this:

Marker1 ~ Age + Gender + Caucasian + API + Hispanic + Other + Date + (1 | Date) 

My suggestion in a case like this would be to run both models and see which one makes sense and seems sound. You may also want to look into modeling this as a time series. I will not try to give any advice on that, as I'm not an expert.

Tanner Phillips

The best way to adjust for batch effects can depend on how many batches you have.

An advantage of random-effect modeling (if the random effect only represents differences in the intercept, as in your second example) is that you are only modeling a single parameter, the variance of intercept values among the batches. With fixed effects, you need to estimate separate coefficient values for each batch beyond the first. Models with more parameter values to estimate use up more degrees of freedom, making it harder to detect true differences.

So if you really only have 2 batches as in your data sample, you probably don't gain anything by using random effects. If you have half a dozen batches or so, random effects might be more efficient. But in either type of analysis you need to think carefully about how you expect batches to differ. For example, do batches only differ in terms of an intercept offset (as is implicit in both your examples) or might they also differ in the sensitivity for detecting marker percentages? The latter possibility would require more complicated models.
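The parameter-count argument above can be made concrete with a small Python sketch. It only counts the parameters spent on the batch (Date) term under the two intercept-only models discussed here; the function name `batch_parameters` is hypothetical, and the batch counts of 2 and 7 are taken from the sample table and from the follow-up comment below, respectively.

```python
def batch_parameters(n_batches: int) -> dict:
    """Count parameters used by the batch term in an intercept-only batch model."""
    if n_batches < 1:
        raise ValueError("need at least one batch")
    return {
        # Fixed effect: one coefficient for every batch beyond the reference batch.
        "fixed": n_batches - 1,
        # Random intercept: a single variance, however many batches there are.
        "random_intercept": 1,
    }


for k in (2, 7):
    print(k, batch_parameters(k))
```

With 2 batches both approaches spend one parameter on the batch term, so nothing is saved; with 7 batches the fixed-effect version spends six where the random intercept still spends one.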

A couple of other things to think about before you meet with the bioinformatics core. First, outcome variables that are percentages often shouldn't be modeled with standard linear regression. Second, as you have 3 outcome variables it might be good to try to include all 3 in a single multivariate model that takes correlations among them into account. Ask for advice on those issues, too.
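As a rough illustration of why a percentage scale is awkward for ordinary linear regression, the Python sketch below converts percentages to log-odds. This is only meant to show how the bounded scale behaves, not to stand in for a recommended analysis; the helper name `logit_percent` is hypothetical.

```python
import math


def logit_percent(pct: float) -> float:
    """Log-odds of a percentage; undefined at exactly 0 or 100."""
    p = pct / 100.0
    if not 0.0 < p < 1.0:
        raise ValueError(f"log-odds undefined for {pct}%")
    return math.log(p / (1.0 - p))


# The same 5-point step is a much larger change near the boundary than mid-scale.
step_mid = logit_percent(55.0) - logit_percent(50.0)
step_high = logit_percent(95.0) - logit_percent(90.0)
print(step_mid, step_high)

# A value of exactly 0 (Marker2 for participant A7 in the table) has no log-odds.
try:
    logit_percent(0.00)
except ValueError as err:
    print(err)
```

A linear model on the raw percentages treats both 5-point steps as equal and can predict values below 0 or above 100, which is part of why methods built for proportions or counts are usually preferred.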

EdM
  • Thank you for the response! You raised a lot of good points for me to discuss with the bioinformatics core. I realize I made an oversight with my sample data and that yes, there are about 7 different dates that data was collected. For modeling percentages, is there another method you would recommend in place of linear regression? – Kyle K. Mar 15 '20 at 17:29
  • @Creatine discuss this with the bioinformaticians. I suspect that they will want to know the actual numbers of cells tested and the numbers that showed each of your markers, and then analyze with methods suited for count data, like [Poisson regression](https://en.wikipedia.org/wiki/Poisson_regression) or the related negative-binomial regression. In general for probabilities/percentages, [beta regression or logistic regression](https://stats.stackexchange.com/q/259131/28500) could be used. With some very high and some very low percentages as you have, linear regression can be unreliable. – EdM Mar 15 '20 at 18:47
  • @Creatine I now notice that some cells evidently can have more than one of the markers, as the sums of percentages for the 3 markers exceed 100 for several individuals. Make sure to discuss that issue with the bioinformaticians, too. – EdM Mar 15 '20 at 18:55
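The observation in the last comment can be checked directly against the example table in the question. The short Python sketch below copies the three marker columns from that table and lists the participants whose percentages sum to more than 100; the variable names are illustrative only.

```python
# (Marker1, Marker2, Marker3) per participant, copied from the example table.
markers = {
    "A1": (1.77, 13.60, 77.8),
    "A2": (1.38, 3.90, 96.1),
    "A3": (9.34, 13.60, 85.2),
    "A4": (2.06, 4.08, 77.6),
    "A5": (0.65, 16.00, 79.9),
    "A6": (2.46, 18.20, 93.8),
    "A7": (0.30, 0.00, 75.0),
    "A8": (1.48, 3.73, 84.4),
    "A9": (1.76, 7.48, 90.5),
    "A10": (5.87, 12.90, 81.6),
    "A11": (2.40, 18.80, 96.0),
    "A12": (4.16, 8.56, 60.4),
}

# Participants whose three percentages add up to more than 100, so at least
# some of their cells must be positive for more than one marker.
over_100 = [pid for pid, values in markers.items() if sum(values) > 100]
print(over_100)
```

In this sample the totals for A2, A3, A6, A10, and A11 exceed 100, so the three markers cannot be treated as mutually exclusive categories of a single composition.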