Even with asymptotic normality among estimates of AUC* values, you might be better off using bootstrapping to estimate the precision and bias of your AUC estimates. Furthermore, depending on how you intend to use your model, you might want to reconsider whether AUC gives you the measure you need and whether a way to deal with intra-individual correlations other than random-effect modeling would be better suited to your application.
This Cross Validated thread is a great resource for seeing how AUC, concordance, and the Wilcoxon-Mann-Whitney U test are related. This answer in particular shows (1) "The AUC can also be seen as a concordance measure," reporting the fraction of comparable pairs of cases for which the order of the linear predictor values, in your terminology $(x_i\cdot\beta + \mu_i)$, agrees with the observed class memberships and (2) the U statistic "is just a simple transformation of the estimated concordance probability."
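As a quick numerical check of that equivalence, here is a small Python sketch (simulated linear-predictor values, not your data; scipy assumed available) that computes the concordance fraction directly over all pairs and via the Mann-Whitney U statistic:

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)
# Hypothetical linear-predictor values, (x_i . beta + mu_i) in the
# question's terminology, for 40 positive and 60 negative cases
pos = rng.normal(1.0, 1.0, 40)   # cases with observed class 1
neg = rng.normal(0.0, 1.0, 60)   # cases with observed class 0

# AUC as a concordance measure: the fraction of all positive/negative
# pairs in which the positive case has the larger predictor (ties count 1/2)
diff = pos[:, None] - neg[None, :]
auc_pairs = ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / (len(pos) * len(neg))

# The U statistic is just n1 * n0 times that concordance probability
u, _ = mannwhitneyu(pos, neg)
auc_u = u / (len(pos) * len(neg))

print(np.isclose(auc_pairs, auc_u))  # True
```

(The identity `AUC = U / (n1 * n0)` relies on scipy ≥ 1.7, which returns the U statistic for the first sample.)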
In turn, the Wikipedia page notes: "For large samples, U is approximately normally distributed." That answers your question as posed, provided that your AUC calculation (not described here) is equivalent to all comparable pairwise comparisons of cases with both fixed and random effects included in the linear-predictor values.
That does not answer the question of how large a sample is needed to get close enough to normality. Unless your data set is massive, you might be better off getting an empirical estimate of the distribution of AUC values from logistic regressions on multiple bootstrap samples. That approach has the further advantage of allowing you to gauge any optimism bias in your AUC values by using the results of the multiple logistic models to evaluate AUC (or any other performance measure) on the full data set. This Vanderbilt web page shows how to do bootstrap sampling properly with correlated or hierarchical data that might be analyzed by mixed models. Bootstrapping is often a better approach than asymptotics for inference on mixed models, as presented on this UCLA web page.
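A minimal sketch of that cluster bootstrap, assuming simulated data and sklearn (a plain logistic regression stands in for the mixed model here, for brevity). Whole individuals are resampled so that repeated measurements stay together, and each bootstrap model is also evaluated on the full data to gauge optimism:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)

# Hypothetical data: 50 individuals, 4 repeated measurements each
n_ind, n_rep = 50, 4
ind = np.repeat(np.arange(n_ind), n_rep)
x = rng.normal(size=n_ind * n_rep)
mu = rng.normal(0.0, 1.0, n_ind)[ind]          # individual random effects
p = 1.0 / (1.0 + np.exp(-(x + mu)))
y = (rng.random(n_ind * n_rep) < p).astype(int)
X = x.reshape(-1, 1)

# Cluster bootstrap: resample whole individuals so that an individual's
# repeated measurements stay together, then refit and recompute AUC
boot_aucs, full_aucs = [], []
for _ in range(200):
    chosen = rng.choice(n_ind, n_ind, replace=True)
    idx = np.concatenate([np.flatnonzero(ind == c) for c in chosen])
    fit = LogisticRegression().fit(X[idx], y[idx])
    boot_aucs.append(roc_auc_score(y[idx], fit.decision_function(X[idx])))
    # evaluating each bootstrap model on the full data gauges optimism
    full_aucs.append(roc_auc_score(y, fit.decision_function(X)))

boot_aucs, full_aucs = np.array(boot_aucs), np.array(full_aucs)
print(round(boot_aucs.std(), 3))                 # spread of AUC estimates
print(round((boot_aucs - full_aucs).mean(), 3))  # rough optimism estimate
```

The bootstrap standard deviation gives an empirical precision estimate without leaning on the normal approximation, and the mean in-sample-minus-full-data gap is the usual optimism correction.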
A potentially bigger problem is whether AUC will be a useful measure for your application. The question is whether AUC values from your mixed-effect model could be applied usefully to new cases. With a wide distribution of random-effect values, $\mu_i$ in your terminology, the ordering of cases implicit in AUC evaluation could have random effects dominating the fixed effects, $x_i\cdot\beta$ in your terminology. For application to new cases you will have no information on the random effects. A model with high discrimination/concordance on your training set, which includes the random effects, might thus have no useful discrimination when applied to new cases.
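To illustrate that risk, here is a minimal numpy simulation (hypothetical data, assuming a weak fixed effect and a wide random-effect distribution) in which in-sample concordance is driven almost entirely by the random effects:

```python
import numpy as np

rng = np.random.default_rng(2)

def auc(score, y):
    """Concordance: fraction of positive/negative pairs ordered correctly."""
    pos, neg = score[y == 1], score[y == 0]
    d = pos[:, None] - neg[None, :]
    return ((d > 0).sum() + 0.5 * (d == 0).sum()) / d.size

n = 2000
x = rng.normal(size=n)
beta = 0.3                        # weak fixed effect
mu = rng.normal(0.0, 3.0, n)      # wide random-effect distribution
lp = beta * x + mu                # full linear predictor
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-lp))).astype(int)

# In-sample, with the random effects included, concordance looks strong...
print(round(auc(lp, y), 2))        # well above 0.5
# ...but a new case brings no information about its random effect,
# so only the fixed-effect part of the predictor is usable
print(round(auc(beta * x, y), 2))  # close to 0.5
```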
If your interest is in evaluating a test for application to new cases, you should consider an alternative approach to correcting for intra-individual correlations, such as a robust "sandwich" variance estimator from generalized estimating equations (GEE), which can be used in logistic regression. The R sandwich package provides several ways to produce such estimates depending on the correlation structure of the data. This Cross Validated thread discusses how to choose the correlation structure.
Levels of clustering and independence in AUC/Mann-Whitney U
Part of this answer was based on the form of repeated measures originally of interest in the question, as explained in a comment from the OP: "repeated measurements are caused by the fact that measurements at different body parts were analyzed." In that scenario, presumably representing a single time of measurement and a single outcome observation per individual, there is only one set of measurements, one outcome, and thus one linear-predictor value per individual to correspond with the single outcome in the formula provided in the question. The individuals are independent of each other, so the assumptions underlying the normality of U-statistic estimates, and thus of AUC, hold.
Things are more complicated in the general case where there might be multiple observations over time for an individual or multiple individuals in a cluster. A simple solution for AUC estimates, suggested in this answer, is to "treat as a unit of analysis, not an individual measurement, but rather the cluster of repeated measurements"; the answer goes on to indicate approaches for specifically longitudinal data. One could also evaluate within-cluster concordance.
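A minimal sketch of within-cluster concordance on simulated clustered data: only pairs of observations sharing a cluster are compared, so between-cluster (random-effect) shifts drop out of the comparison.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical clustered data: 30 clusters, 6 observations each
n_clu, n_rep = 30, 6
clu = np.repeat(np.arange(n_clu), n_rep)
score = rng.normal(size=n_clu * n_rep) + rng.normal(0.0, 1.0, n_clu)[clu]
y = (rng.random(n_clu * n_rep) < 1.0 / (1.0 + np.exp(-score))).astype(int)

# Within-cluster concordance: compare only pairs of observations that
# share a cluster, so between-cluster shifts cancel out
conc = disc = ties = 0
for c in range(n_clu):
    s, yc = score[clu == c], y[clu == c]
    d = s[yc == 1][:, None] - s[yc == 0][None, :]
    conc += (d > 0).sum()
    disc += (d < 0).sum()
    ties += (d == 0).sum()

within_auc = (conc + 0.5 * ties) / (conc + disc + ties)
print(round(within_auc, 2))  # within-cluster discriminative ability
```

Clusters whose outcomes are all 0 or all 1 contribute no pairs and simply drop out of the tally.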
For a binary outcome over time for an individual, it might be better to use survival modeling, with a cluster or frailty term if multiple events per individual are possible. The frailtypack package can provide concordance estimates for survival models at among-cluster and within-cluster levels, along with an overall weighted-average estimate. A quick look at the underlying code, however, suggests that standard errors/CIs are only returned if bootstrapping is used. That package also allows for other types of clustered survival data, like individuals clustered within hospitals.
*AUC is an acceptable way to evaluate the discrimination provided by a single model, but it doesn't take calibration directly into account and it isn't necessarily a good way to compare different models. This answer focuses on AUC as asked in the question, although much of it is applicable to other measures of model performance.