Even with asymptotic normality among estimates of AUC* values, you might be better off using bootstrapping to estimate the precision and bias of your AUC estimates. Furthermore, depending on how you intend to use your model, you might want to reconsider whether AUC gives you the measure you need and whether a way to deal with intra-individual correlations other than random-effect modeling would be better suited to your application.
This Cross Validated thread is a great resource for seeing how AUC, concordance, and the Wilcoxon-Mann-Whitney U test are related. This answer in particular shows (1) "The AUC can also be seen as a concordance measure," reporting the fraction of comparable pairs of cases for which the order of the linear predictor values, in your terminology $(x_i\cdot\beta + \mu_i)$, agrees with the observed class memberships and (2) the U statistic "is just a simple transformation of the estimated concordance probability."
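As a quick numerical check of that equivalence, here is a small Python sketch (simulated linear-predictor values, not your data; scipy assumed available) that computes the concordance fraction directly over all pairs and via the Mann-Whitney U statistic:

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)
# Hypothetical linear-predictor values, (x_i . beta + mu_i) in the
# question's terminology, for 40 positive and 60 negative cases
pos = rng.normal(1.0, 1.0, 40)   # cases with observed class 1
neg = rng.normal(0.0, 1.0, 60)   # cases with observed class 0

# AUC as a concordance measure: the fraction of all positive/negative
# pairs in which the positive case has the larger predictor (ties count 1/2)
diff = pos[:, None] - neg[None, :]
auc_pairs = ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / (len(pos) * len(neg))

# The U statistic is just n1 * n0 times that concordance probability
u, _ = mannwhitneyu(pos, neg)
auc_u = u / (len(pos) * len(neg))

print(np.isclose(auc_pairs, auc_u))  # True
```

(The identity `AUC = U / (n1 * n0)` relies on scipy ≥ 1.7, which returns the U statistic for the first sample.)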
In turn, the Wikipedia page notes: "For large samples, U is approximately normally distributed." That answers your question as posed, provided that your AUC calculation (not described here) is equivalent to all comparable pairwise comparisons of cases with both fixed and random effects included in the linear-predictor values.
That does not answer the question of how large a sample is needed to get close enough to normality. Unless your data set is massive, you might be better off getting an empirical estimate of the distribution of AUC values from logistic regressions on multiple bootstrap samples. That approach has the further advantage of allowing you to gauge any optimism bias in your AUC values by using the results of the multiple logistic models to evaluate AUC (or any other performance measure) on the full data set. This Vanderbilt web page shows how to do bootstrap sampling properly with correlated or hierarchical data that might be analyzed by mixed models. Bootstrapping is often a better approach than asymptotics for inference on mixed models, as presented on this UCLA web page.
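A minimal sketch of that cluster bootstrap, assuming simulated data and sklearn (a plain logistic regression stands in for the mixed model here, for brevity). Whole individuals are resampled so that repeated measurements stay together, and each bootstrap model is also evaluated on the full data to gauge optimism:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)

# Hypothetical data: 50 individuals, 4 repeated measurements each
n_ind, n_rep = 50, 4
ind = np.repeat(np.arange(n_ind), n_rep)
x = rng.normal(size=n_ind * n_rep)
mu = rng.normal(0.0, 1.0, n_ind)[ind]          # individual random effects
p = 1.0 / (1.0 + np.exp(-(x + mu)))
y = (rng.random(n_ind * n_rep) < p).astype(int)
X = x.reshape(-1, 1)

# Cluster bootstrap: resample whole individuals so that an individual's
# repeated measurements stay together, then refit and recompute AUC
boot_aucs, full_aucs = [], []
for _ in range(200):
    chosen = rng.choice(n_ind, n_ind, replace=True)
    idx = np.concatenate([np.flatnonzero(ind == c) for c in chosen])
    fit = LogisticRegression().fit(X[idx], y[idx])
    boot_aucs.append(roc_auc_score(y[idx], fit.decision_function(X[idx])))
    # evaluating each bootstrap model on the full data gauges optimism
    full_aucs.append(roc_auc_score(y, fit.decision_function(X)))

boot_aucs, full_aucs = np.array(boot_aucs), np.array(full_aucs)
print(round(boot_aucs.std(), 3))                 # spread of AUC estimates
print(round((boot_aucs - full_aucs).mean(), 3))  # rough optimism estimate
```

The bootstrap standard deviation gives an empirical precision estimate without leaning on the normal approximation, and the mean in-sample-minus-full-data gap is the usual optimism correction.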
A potentially bigger problem is whether AUC will be a useful measure for your application. The question is whether AUC values from your mixed-effect model could be applied usefully to new cases. With a wide distribution of random-effect values, $\mu_i$ in your terminology, the ordering of cases implicit in AUC evaluation could have random effects dominating the fixed effects, $x_i\cdot\beta$ in your terminology. For application to new cases you will have no information on the random effects. A model with high discrimination/concordance on your training set, which includes the random effects, might thus have no useful discrimination when applied to new cases.
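To illustrate that risk, here is a minimal numpy simulation (hypothetical data, assuming a weak fixed effect and a wide random-effect distribution) in which in-sample concordance is driven almost entirely by the random effects:

```python
import numpy as np

rng = np.random.default_rng(2)

def auc(score, y):
    """Concordance: fraction of positive/negative pairs ordered correctly."""
    pos, neg = score[y == 1], score[y == 0]
    d = pos[:, None] - neg[None, :]
    return ((d > 0).sum() + 0.5 * (d == 0).sum()) / d.size

n = 2000
x = rng.normal(size=n)
beta = 0.3                        # weak fixed effect
mu = rng.normal(0.0, 3.0, n)      # wide random-effect distribution
lp = beta * x + mu                # full linear predictor
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-lp))).astype(int)

# In-sample, with the random effects included, concordance looks strong...
print(round(auc(lp, y), 2))        # well above 0.5
# ...but a new case brings no information about its random effect,
# so only the fixed-effect part of the predictor is usable
print(round(auc(beta * x, y), 2))  # close to 0.5
```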
If your interest is in evaluating a test for application to new cases, you should consider an alternative approach to correcting for intra-individual correlations, such as a robust "sandwich" variance estimator from generalized estimating equations (GEE), which can be used in logistic regression. The R sandwich package provides several ways to produce such estimates depending on the correlation structure of the data. This Cross Validated thread discusses how to choose the correlation structure.
Levels of clustering and independence in AUC/Mann-Whitney U
Part of this answer was based on the form of repeated measures originally of interest in the question, as explained in a comment from the OP: "repeated measurements are caused by the fact that measurements at different body parts were analyzed." In that scenario, presumably representing a single time of measurement and a single outcome observation per individual, there is only one set of measurements, one outcome, and thus one linear-predictor value per individual to correspond with the single outcome in the formula provided in the question. The individuals are independent of each other, so the assumptions underlying the normality of U-statistic estimates, and thus of AUC, hold.
Things are more complicated in the general case where there might be multiple observations over time for an individual or multiple individuals in a cluster. A simple solution for AUC estimates, suggested in this answer, is to "treat as a unit of analysis, not an individual measurement, but rather the cluster of repeated measurements"; the answer goes on to indicate approaches for specifically longitudinal data. One could also evaluate within-cluster concordance.
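A minimal sketch of within-cluster concordance on simulated clustered data: only pairs of observations sharing a cluster are compared, so between-cluster (random-effect) shifts drop out of the comparison.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical clustered data: 30 clusters, 6 observations each
n_clu, n_rep = 30, 6
clu = np.repeat(np.arange(n_clu), n_rep)
score = rng.normal(size=n_clu * n_rep) + rng.normal(0.0, 1.0, n_clu)[clu]
y = (rng.random(n_clu * n_rep) < 1.0 / (1.0 + np.exp(-score))).astype(int)

# Within-cluster concordance: compare only pairs of observations that
# share a cluster, so between-cluster shifts cancel out
conc = disc = ties = 0
for c in range(n_clu):
    s, yc = score[clu == c], y[clu == c]
    d = s[yc == 1][:, None] - s[yc == 0][None, :]
    conc += (d > 0).sum()
    disc += (d < 0).sum()
    ties += (d == 0).sum()

within_auc = (conc + 0.5 * ties) / (conc + disc + ties)
print(round(within_auc, 2))  # within-cluster discriminative ability
```

Clusters whose outcomes are all 0 or all 1 contribute no pairs and simply drop out of the tally.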
For a binary outcome over time for an individual, it might be better to use survival modeling, with a cluster or frailty term if multiple events per individual are possible. The frailtypack package can provide concordance estimates for survival models at among-cluster and within-cluster levels, along with an overall weighted-average estimate. A quick look at the underlying code, however, suggests that standard errors/CIs are only returned if bootstrapping is used. That package also allows for other types of clustered survival data, like individuals clustered within hospitals.
*AUC is an acceptable way to evaluate the discrimination provided by a single model, but it doesn't take calibration directly into account and it isn't necessarily a good way to compare different models. This answer focuses on AUC as asked in the question, although much of it is applicable to other measures of model performance.