
It is likely that I am using incorrect terminology here, but I am trying to compute a "mean" of an interval-censored random variable.

Here is an example where:
1. a random sample from the standard normal distribution is discretized to create an interval censored random sample;
2. the marginal probability distribution of the sample is computed;
3. the midpoints of the intervals are computed;
4. the weighted mean of the interval midpoints is computed.

A simple simulation shows that this leads to a biased estimate of the mean of the underlying variable, and the bias obviously depends on:
1. the number of intervals into which the variable is discretized; and
2. the sample size.

replIntMean = replicate(100, {
  X = rnorm(1000)

  # break points: the sample deciles of X
  breaks = quantile(X, probs = seq(0, 1, 0.1))

  # discretize the variable
  XD = cut(X, breaks, include.lowest = TRUE)

  # compute the marginal probability table
  probX = prop.table(table(XD))

  # midpoints of the censoring intervals, taken from the break points
  # themselves; parsing the factor labels uses rounded limits and misses
  # any limit printed without a decimal point (e.g. "-3" or "0")
  midX = (head(breaks, -1) + tail(breaks, -1)) / 2

  # compute the weighted mean of the interval midpoints
  sum(probX * midX)
  },
  simplify = "array")

plot(replIntMean, type = "l")
abline(h = 0, col = "blue")
abline(h = mean(replIntMean), col = "red")
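
To check how the result depends on the number of intervals and the sample size, the simulation could be extended along the following lines. This is only a sketch: the helper `binnedMeanError`, the grid of `n` and `k` values, the 200 replications and the seed are illustrative choices, not part of the simulation above. It takes the midpoints from the break points directly and reports the average difference between the weighted midpoint mean and the mean of the underlying continuous sample.

```r
set.seed(1)  # arbitrary seed, for reproducibility only

# Hypothetical helper: difference between the weighted midpoint mean and
# the sample mean of the continuous variable, for n draws cut into k
# intervals at the sample quantiles
binnedMeanError <- function(n, k) {
  X <- rnorm(n)
  breaks <- unname(quantile(X, probs = seq(0, 1, length.out = k + 1)))
  mids <- (head(breaks, -1) + tail(breaks, -1)) / 2
  XD <- cut(X, breaks, include.lowest = TRUE)
  probX <- as.numeric(prop.table(table(XD)))
  sum(probX * mids) - mean(X)
}

# Illustrative grid of sample sizes and numbers of intervals
grid <- expand.grid(n = c(100, 1000, 10000), k = c(5, 10, 50))

# Average error over 200 replications for each (n, k) combination
grid$meanError <- mapply(
  function(n, k) mean(replicate(200, binnedMeanError(n, k))),
  grid$n, grid$k
)

grid
```

Comparing against `mean(X)` rather than against 0 separates the effect of the discretization from ordinary sampling variation in the mean.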


I am wondering if there is any guidance in the literature on the right way to compute measures of central tendency of interval-coded variables, and a discussion of their relative properties and interpretations.

Please note that the application is not survival analysis, which is what most of the references tend to deal with. I am also aware that the simplest measure of central tendency here is the modal class.

amoeba
tchakravarty
  • I see no significant bias in your simulation, nor do I see a mechanism to introduce bias in your description. However, that may be because you are vague about the operation "the marginal probability distribution of the sample is computed"--if there is any way to bias the result, it must be there. Perhaps you could elaborate on that process in your question? You might also be interested in a related question asking about how to estimate variances for binned data (which is more difficult): http://stats.stackexchange.com/questions/60256. – whuber Nov 24 '14 at 18:27
  • @whuber Code included! But yes, there doesn't appear to be much bias, and I was going to plot the means based on the underlying continuous variable as well for comparison, but I am highly suspicious of this procedure, and would like guidance from the literature. – tchakravarty Nov 24 '14 at 18:29
  • As you might guess, the situation can be analyzed theoretically. However, the results depend on seemingly minor details, such as whether the intervals have the same size and whether they were chosen independently of the data. Your references to "censoring" rather than *binning* suggest the procedure might be more complex and subtler than mere binning. Details of the censoring process would be needed to carry out an analysis. – whuber Nov 24 '14 at 18:40
  • @whuber Please do point me to the theory. I fear that my searches for interval censoring might be leading me astray (the censoring process is noninformative), and that I should be looking for binning/quantization instead? – tchakravarty Nov 24 '14 at 18:42
  • @whuber Can you point to the literature or a textbook reference? It would be highly appreciated. – tchakravarty Nov 25 '14 at 08:47
  • Kendall & Stuart contain a discussion of using Sheppard's Corrections to estimate means and variances when the underlying distribution might be approximately normal. – whuber Feb 10 '17 at 17:07
  • @whuber 2.5 years later, but totally worth the wait. :) – tchakravarty Feb 10 '17 at 19:05
  • Sorry about that. Somehow I didn't notice your last comment. I have been [mentioning Sheppard's corrections for years](https://www.google.com/search?q=whuber+sheppard+site%3Astats.stackexchange.com&ie=utf-8&oe=utf-8). The earliest reference I can find is a 2011 comment at http://stats.stackexchange.com/questions/12919 (with a Web link). – whuber Feb 10 '17 at 19:09
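
For readers following the Sheppard's corrections pointer in the comments above, here is a minimal sketch of what the correction does, under assumptions that differ from the simulation in the question. It uses equal-width bins of width `h` on a fixed grid chosen independently of the data, rather than sample-quantile bins. In that setting the grouped mean is left uncorrected, and the grouped variance is reduced by `h^2 / 12`. The seed, `n`, `h` and the grid limits are illustrative choices.

```r
set.seed(1)   # arbitrary seed, for reproducibility only
n <- 1e5
h <- 0.5      # assumed common bin width
X <- rnorm(n)

# Fixed, equal-width grid chosen independently of the data
breaks <- seq(-10, 10, by = h)
mids <- head(breaks, -1) + h / 2

# Bin index of each observation, then the bin proportions
idx <- findInterval(X, breaks)
counts <- tabulate(idx, nbins = length(mids))
p <- counts / sum(counts)

# Moments computed from the bin midpoints only
groupedMean <- sum(p * mids)
groupedVar <- sum(p * (mids - groupedMean)^2)

# Sheppard's correction for the variance; the mean needs none
sheppardVar <- groupedVar - h^2 / 12

c(mean = mean(X), groupedMean = groupedMean)
c(var = var(X), groupedVar = groupedVar, sheppardVar = sheppardVar)
```

The correction relies on the density being smooth and tailing off at both ends, which is why the comment restricts it to approximately normal data. It should not be expected to carry over unchanged to unequal or data-dependent intervals.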
