
In a study, the administrative data contain everyone in a population. I am using ICD-10 codes E65-E68 (excluding E66.3, which is overweight) to construct the obese cohort. Conversely, the rest of the population, who did not receive these codes, will be considered non-obese. The problems are that 1) the codes are incompletely recorded in the dataset, which leads to low sensitivity (many false negatives) but high specificity, and 2) while those who did receive an obesity code are very likely to be truly obese (i.e., high positive predictive value), they tend to be more severely obese than the average obese individual (which cannot be directly observed in the current study).
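For concreteness, the cohort flag is built roughly as in the sketch below, assuming a pandas DataFrame of diagnosis records; the column names `person_id` and `icd10_code` are placeholders for illustration, not the actual field names in the dataset.

```python
import pandas as pd

def flag_obese(claims: pd.DataFrame) -> pd.Series:
    """Boolean flag per person: any code in E65-E68, excluding E66.3 (overweight)."""
    codes = claims["icd10_code"].str.upper().str.replace(".", "", regex=False)
    in_block = codes.str[:3].isin(["E65", "E66", "E67", "E68"])
    overweight = codes.str.startswith("E663")  # E66.3 = overweight, excluded
    flagged = claims.assign(obese_code=in_block & ~overweight)
    return flagged.groupby("person_id")["obese_code"].any()
```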

In the literature, the sensitivity of this method ranges between 10% and 50%. I wonder if I can conduct a sensitivity analysis in which I reconstruct the "obesity" cohort by sampling people who were originally labelled non-obese. For example, I could sample a certain percentage of the overweight individuals and another percentage of the non-obese, non-overweight individuals into the obese cohort to account for the potentially high false-negative rate. Is there a formalized way to do this? I have tried to find appropriate literature but couldn't find anything that directly addresses my set-up.
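To make the idea concrete, a rough sketch of the relabelling I have in mind is below; it assumes a single overall sensitivity value and an arbitrary split of the implied false negatives between the overweight and the remaining non-obese individuals. The column names (`obese_code`, `overweight_code`) and the split are placeholders for illustration, not estimates.

```python
import numpy as np
import pandas as pd

def relabel_cohort(df: pd.DataFrame, sens: float,
                   frac_from_overweight: float = 0.5, rng=None) -> pd.Series:
    """Move the implied false negatives into the obese cohort under an assumed sensitivity."""
    rng = np.random.default_rng() if rng is None else rng
    n_flagged = int(df["obese_code"].sum())
    n_true = int(round(n_flagged / sens))      # implied true number of obese
    n_missed = n_true - n_flagged              # implied false negatives
    n_from_ow = int(round(frac_from_overweight * n_missed))

    ow_pool = df.loc[~df["obese_code"] & df["overweight_code"]].index.to_numpy()
    other_pool = df.loc[~df["obese_code"] & ~df["overweight_code"]].index.to_numpy()
    picked = np.concatenate([
        rng.choice(ow_pool, size=min(n_from_ow, len(ow_pool)), replace=False),
        rng.choice(other_pool, size=min(n_missed - n_from_ow, len(other_pool)),
                   replace=False),
    ])
    new_label = df["obese_code"].copy()
    new_label.loc[picked] = True
    return new_label
```

The cost comparison would then be repeated over a grid of assumed sensitivities (e.g. 0.1 to 0.5) to see how much the conclusions move.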


Edited to provide more information

The study objective is to compare the costs of utilized clinical services between the obese and non-obese cohorts. The clinical utilization information is already recorded in the dataset. Because of the low sensitivity, and because the obese individuals readily identified by this approach are likely more severe and use/cost more in clinical services, my intent is to reconstruct the obese cohort in a sensitivity analysis to reduce the underlying bias (towards overestimating the excess costs in the initially identified obese cohort).

For example, some of the overweight individuals probably have a true obese status, so I'd like to resample some of these overweight individuals into the new "obese" cohort. The same goes for the non-overweight/non-obese individuals.

All the variables indicative of obesity status will be used to identify obese individuals. For our discussion, we can assume there is no other information in the database that further informs or relates to obesity status.

KubiK888
  • Please say more about what you're trying to accomplish with your study, in particular what you hope to achieve by accounting for the potentially high false negative rate. Do the data include information that might be associated with obesity status (height, weight, other informative ICD codes)? – EdM Aug 20 '20 at 16:43
  • I have added information in the OP. – KubiK888 Aug 20 '20 at 17:09

1 Answer


If there were a set of cases that you were reasonably certain not to be obese, just as you are reasonably certain that those with particular ICD codes are obese, it might be possible to turn this into a missing-data problem for which there are well established techniques.

Once you have identified cases that are pretty definitively in one or the other category, you label those cases accordingly and then label the other cases as having missing data. Provided that the probability of that missingness is only a function of the information you have, your data are technically "missing at random" (MAR). That's a much less stringent requirement than for data to be "missing completely at random" (MCAR), in which there is simply a single probability of missingness that applies throughout.

Although your data then would certainly not be MCAR, they would be MAR if the probability of being missing depended only on available information (ICD codes, clinical services utilized, provider information, patient characteristics and demographics, etc.) and not on any unobserved information. There is no way to prove that a data set is MAR, but it often is a reasonable assumption. That judgment would depend on your knowledge of the subject matter.

If the MAR assumption is reasonable, you would use the standard multiple imputation approach. You probabilistically impute obesity-class assignments for all the "missing" values several times, based on the information that you have, generating several complete (albeit imputed) data sets. The higher the fraction of missing data, the more imputed data sets you typically have to produce. You then do your analysis on each of those data sets and combine the results in a way that takes into account both the variability of the modeling itself and the variability among the imputations.
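As a very rough sketch of that workflow (in Python, and not a substitute for a chained-equations package such as van Buuren's mice in R): fit an imputation model on the cases whose labels you trust, draw obesity labels for the "missing" cases from the predicted probabilities, repeat several times, fit the analysis model to each completed data set, and pool with Rubin's rules. The column names (`obese`, `cost`), the predictor list, and the simple logistic/OLS models are illustrative assumptions, not recommendations for your data.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

def impute_and_pool(df: pd.DataFrame, x_cols, m: int = 20, seed: int = 0):
    """Multiple imputation of a binary `obese` label; obesity-cost effect pooled by Rubin's rules."""
    rng = np.random.default_rng(seed)
    obs = df["obese"].notna()            # cases with a trusted label (1 or 0)
    estimates, variances = [], []

    for _ in range(m):
        # 1. Refit the imputation model on a bootstrap resample of the labelled
        #    cases so that uncertainty about the model itself is propagated.
        boot = df[obs].sample(frac=1.0, replace=True,
                              random_state=int(rng.integers(1 << 31)))
        imp_model = sm.Logit(boot["obese"],
                             sm.add_constant(boot[x_cols])).fit(disp=0)

        # 2. Draw labels for the "missing" cases from predicted probabilities
        #    (a stochastic draw, not a hard threshold).
        p = imp_model.predict(sm.add_constant(df.loc[~obs, x_cols]))
        completed = df["obese"].copy()
        completed.loc[~obs] = (rng.random(len(p)) < p).astype(float)

        # 3. Analysis model on the completed data: mean cost difference
        #    between obese and non-obese.
        fit = sm.OLS(df["cost"], sm.add_constant(completed)).fit()
        estimates.append(fit.params["obese"])
        variances.append(fit.bse["obese"] ** 2)

    # 4. Rubin's rules: total variance = within-imputation + (1 + 1/m) * between.
    qbar = float(np.mean(estimates))
    total_var = np.mean(variances) + (1 + 1 / m) * np.var(estimates, ddof=1)
    return qbar, float(np.sqrt(total_var))
```

In practice the imputation model should include everything plausibly related to both obesity and missingness, and costs are usually skewed, so the OLS step would typically be replaced by whatever cost model you intend to use.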

Stef van Buuren, a leader in the field of imputation, has a freely available web book that describes the theoretical basis and practical implementation of multiple imputation. Think about this approach carefully, as it seems to accomplish in a principled way the type of analysis you wish to perform.

One warning, though: obesity is not really an all-or-none phenomenon. It's on a continuum of severity, so it might be more reliable to evaluate costs along a more graded measure of obesity. For example, if you do the above and find lower costs associated with obesity than you would have based on the ICD codes alone, one could argue that all you are really doing is shifting the threshold at which you assign the all-or-none "obese" label. In general it's not a good idea to break a continuum down into categories. Anything you can do to obtain something other than a binary obese/not distinction would probably be a more generalizable and useful way to determine the association of obesity with costs.

EdM
  • Thanks for the insight; I will look closer, but I'd like some clarification. The challenge with the current setup is that we are only highly certain about the individuals identified as obese. Among the non-obese individuals, there may be a markedly large number who are obese but not recorded. Likewise, since overweight is a subset of the non-obese cohort, it's reasonable to assume quite a portion of them are indeed obese (but wrongly labeled as overweight). Since well-managed and less severely obese individuals are more likely to be mislabeled as non-obese, is the MAR assumption unlikely to hold true? – KubiK888 Aug 20 '20 at 20:08
  • @KubiK888 how likely is someone not even labeled "overweight" to be obese? That's key. I don't know much about ICD coding, so read van Buuren and see if MAR might apply. The more I think about this, though, the more skeptical I am that the simple obese/not will be very informative. To start, why not just use the 3 classes you do have available from ICD: obese, overweight but not marked obese, and others? That at least would put some bounds on the problem. I don't see a way just to "resample" those last 2 classes in a way that would accomplish what you want, outside of multiple imputation. – EdM Aug 20 '20 at 20:57