
This question is motivated by a study my colleague was working on.

A surgeon intends to examine tissue samples within patients in hopes of finding the presence of some disease within the tissue. The procedure is invasive and risky, and so the surgeon is only able to take a maximum number of samples, say 3. The surgeon is to sample until a) the disease is found, or b) the maximum number of samples has been reached. The goal of the study is to estimate the risk of the disease given patient covariates $X$ (e.g. age, weight, genotype, etc.) from the samples.

If it helps to envision the data, it might look like this:

 Subject Diseased Num_Samples
  <chr>      <dbl>       <dbl>
1 001            1           2
2 002            1           3
3 003            0           3

Here, subject 1 had disease detected on the second sample. Subject 2 had disease detected on the third sample, and subject 3 was sampled the maximum number of times without detection of the disease.

The question I have is about the proper way to analyze these data.

  • If the subjects were sampled once, we could frame the study as examining the probability of finding disease from a single sample and use a logistic regression.

  • If the subjects were sampled until disease was found (assuming they were all diseased) we could model the number of samples required until detection using something like negative binomial regression.

The analysis is complicated by the sampling design. It sounds like negative binomial sampling with censoring? I'm not sure how to properly account for a) the fact that there is a limited number of samples we can take, and b) the fact that the patient may have the disease and we fail to detect it within the 3 samples, but could in theory have detected it were we to continue sampling.

The answer to this question may be that the design is poor, and if that is the case, so be it. But I think this poses an interesting problem for the development of a likelihood, and I'd like to discuss it, if only as something interesting to think about.

kjetil b halvorsen
Demetri Pananos
  • Possibly related -- MLE for the "success" probability parameter of the negative binomial: https://stats.stackexchange.com/questions/164934/maximum-likelihood-estimator-for-negative-binomial-distribution. Why wouldn't modeling this as a negative binomial already account for the case where you fail to detect within n samples? Also, I don't know if you need to make the censoring assumption, but I'm not too familiar with that. – tchainzzz Dec 25 '20 at 22:12
  • *"The goal of the study is to estimate the risk of the disease given patient covariates $X$"* You have an ordinal outcome with 4 levels. In the model from Ben you have *two* parameters $\theta$ and $r$ and you might be able to split the class $y > 3$ into people that have the disease (but not found) and people that do not have the disease.... – Sextus Empiricus Dec 26 '20 at 12:06
  • ...However the $\theta$ might also be a function of $\beta$ and $X$ and possibly you have that the ratio for finding the disease in the first, second or third step is not according to the geometric distribution (maybe the surgeon tries the most likely spot first and less likely spots in the 2nd or 3rd). So the approach is a lot dependent on assumptions regarding those issues. Are they negligible or not? In the worst case you can not find the risk of the disease, but instead only the probability of 'finding' the disease. – Sextus Empiricus Dec 26 '20 at 12:06

1 Answer


This can be framed as a problem with censoring if you like. Given that it is possible to take a sample without finding the disease (even if it is there) you have two aspects to your inference problem: (a) estimating the probability that a person has the disease; and (b) estimating the probability that the disease is found in a single sample (if it is there). I recommend you build up an appropriate model via basic assumptions about the relationship between your samples and the underlying disease. Here I will give an example of a simple model that could serve as a starting point for your analysis.


A simple model: Suppose we assume that the results of samples are independent conditional on the presence of the disease. (For simplicity, assume that there are no false positives, so if the disease is not there then the sample is always negative.) Let $0<\theta<1$ be the probability of finding the disease (if it is present) in a single sample, and assume that this probability is fixed for all samples over all patients.

Now, consider an individual patient with covariates $\mathbf{x}_i$ and let $D_i$ be the indicator that the person has the disease. Suppose, hypothetically, that we were to continue testing the patient until the disease is found, and let $y_i$ be the number of samples taken from the patient until the disease is found (with $y_i = \infty$ if the patient does not have the disease). Under these assumptions we have the conditional geometric distribution:

$$Y_i|D_i=1 \sim \text{Geom}(\theta).$$

The sampling mechanism used here is that we observe a censored version of $y_i$, with censoring after three samples (i.e., we observe either $y_i=1$, $y_i=2$, $y_i=3$ or $y_i > 3$). These outcomes have conditional sampling density:

$$\begin{align} \mathbb{P}(y_i=1|\mathbf{x}_i, D_i=0) &= 0, \\[6pt] \mathbb{P}(y_i=2|\mathbf{x}_i, D_i=0) &= 0, \\[6pt] \mathbb{P}(y_i=3|\mathbf{x}_i, D_i=0) &= 0, \\[6pt] \mathbb{P}(y_i>3|\mathbf{x}_i, D_i=0) &= 1, \\[6pt] \mathbb{P}(y_i=1|\mathbf{x}_i, D_i=1) &= \theta , \\[6pt] \mathbb{P}(y_i=2|\mathbf{x}_i, D_i=1) &= \theta(1-\theta), \\[6pt] \mathbb{P}(y_i=3|\mathbf{x}_i, D_i=1) &= \theta(1-\theta)^2, \\[6pt] \mathbb{P}(y_i>3|\mathbf{x}_i, D_i=1) &= (1-\theta)^3. \\[6pt] \end{align}$$
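As a quick sanity check, these conditional probabilities can be written out in code; for any $\theta$, the four outcomes sum to one in each disease state. A minimal Python sketch (the function name `outcome_probs` is illustrative, not from any package):

```python
# Conditional probabilities of the four censored outcomes
# (y = 1, y = 2, y = 3, y > 3) given disease status.
def outcome_probs(theta, diseased):
    """theta: per-sample detection probability; diseased: disease indicator."""
    if not diseased:
        return [0.0, 0.0, 0.0, 1.0]        # no false positives
    return [theta,                          # found on 1st sample
            theta * (1 - theta),            # found on 2nd sample
            theta * (1 - theta) ** 2,       # found on 3rd sample
            (1 - theta) ** 3]               # not found in 3 samples

# Each set of outcome probabilities is a valid distribution (sums to one):
for d in (True, False):
    assert abs(sum(outcome_probs(0.4, d)) - 1.0) < 1e-12
```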

Suppose we denote the disease probability (given the covariates) by $r(\mathbf{x},\boldsymbol{\beta}) \equiv \mathbb{P}(D_i=1|\mathbf{x})$ where $\boldsymbol{\beta}$ is a set of model parameters. Then for a single patient we have:

$$\begin{align} \mathbb{P}(y_i=1|\mathbf{x}_i) &= \theta \cdot r(\mathbf{x}_i, \boldsymbol{\beta}), \\[6pt] \mathbb{P}(y_i=2|\mathbf{x}_i) &= \theta(1-\theta) \cdot r(\mathbf{x}_i, \boldsymbol{\beta}), \\[6pt] \mathbb{P}(y_i=3|\mathbf{x}_i) &= \theta(1-\theta)^2 \cdot r(\mathbf{x}_i, \boldsymbol{\beta}), \\[6pt] \mathbb{P}(y_i>3|\mathbf{x}_i) &= (1-\theta)^3 \cdot r(\mathbf{x}_i, \boldsymbol{\beta}) + (1- r(\mathbf{x}_i, \boldsymbol{\beta})). \\[6pt] \end{align}$$

Setting a model for the disease is equivalent to setting the form of the function $r$. For this you might like to use a logistic regression, or some similar model used on binary outcome data. Once you have done this you can form the likelihood function for your model, which is the product of the above sampling densities taken over all patients.
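To make this concrete, here is a minimal sketch of the maximum-likelihood fit in Python, assuming a logistic form for $r(\mathbf{x}, \boldsymbol{\beta})$ and parameterizing $\theta$ on the logit scale so the optimization is unconstrained. The simulated data and the function name `neg_log_lik` are purely illustrative:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit  # logistic function 1 / (1 + exp(-z))

def neg_log_lik(params, X, y):
    """Negative log-likelihood. params = [logit(theta), beta];
    y codes the outcome as 1, 2, 3, or 4 (where 4 means y > 3)."""
    theta = expit(params[0])
    r = expit(X @ params[1:])                 # P(disease | x), logistic model
    p = np.column_stack([                     # marginal outcome probabilities
        theta * r,
        theta * (1 - theta) * r,
        theta * (1 - theta) ** 2 * r,
        (1 - theta) ** 3 * r + (1 - r),
    ])
    return -np.sum(np.log(p[np.arange(len(y)), y - 1]))

# Simulated data, purely for illustration:
rng = np.random.default_rng(0)
n = 2000
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # intercept + 1 covariate
beta_true, theta_true = np.array([-0.5, 1.0]), 0.6
D = rng.random(n) < expit(X @ beta_true)               # disease indicators
y = np.where(D, np.minimum(rng.geometric(theta_true, size=n), 4), 4)

fit = minimize(neg_log_lik, x0=np.zeros(3), args=(X, y), method="Nelder-Mead")
theta_hat = expit(fit.x[0])
beta_hat = fit.x[1:]
```

With a reasonable sample size the recovered $\hat{\theta}$ and $\hat{\boldsymbol{\beta}}$ should land near the simulating values, which illustrates the identifiability point made below.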

In terms of your inferences, you will be able to get estimators for the parameters $\theta$ and $\boldsymbol{\beta}$, and the latter will give an induced estimate for the probability of disease for a patient with any particular set of covariates. Both parameters are identifiable (so long as your $\boldsymbol{\beta}$ vector is identifiable within $r$) so this type of model should allow you to make simultaneous inferences about the probability of disease and the probability of detection of disease.

Of course, you will notice that this model is heavily dependent on the assumption that sample outcomes are independent given the presence of the disease (i.e., the second or third sample is no more likely than the first to find the disease if it is present). Some kind of assumption here is likely to be crucial to the model, and if it is wrong then your inferences will be wrong. Consequently, it is worth thinking deeply about whether this is a plausible description of your sampling method in this case.

Ben