A very similar question has been asked before, but it didn't get a real answer.

Background

I would like to develop a probability model for a continuous, ratio-scale random variable $Y$. Let's say it represents the annual total income of a household.

I also have several covariates, collectively called $X$, that are predictive of annual household income.

I'm using a standard regression setup: I want to model $\log(Y)$ with a Gaussian regression model, using the $X$ variables as predictors: $$ \log\left(Y\right) \sim \mathcal{N}\left(\beta X, \sigma^2\right) $$ where $\beta$ and $\sigma$ are model parameters to be estimated. My goal is to estimate the conditional distribution $P\left(Y \mid X=x\right)$.

The challenge

I have a dataset containing $N$ observations from different households (assume they're IID conditional on $X$), indexed by $n$.

In one subset of the observations (denoted $A$), the collected data is continuous, but noisy. These heads of household were asked "what is your total household income?"

In the remaining observations, $B$, the collected data is binned into tiers, e.g. "\$10,000 - \$30,000". The bin widths are not necessarily constant in linear or log space. For each of these observations, I know the range of possible $y$ values, but not the actual value.

How can I use the information from the $B$ observations to fit my model?

Some ideas

  1. Bin the data in $A$ using the same bins as in $B$.
  2. Replace the binned values in $B$ with the bin midpoints.
  3. Replace the binned values in $B$ with data sampled uniformly from the bin; repeat $K$ times and analyze using standard techniques for multiple imputation.
  4. Fit two models: a continuous model on $A$ and some kind of latent-variable model on $B$. The conditional distribution of $Y$ is a mixture of both, weighted by the relative sizes of the $A$ and $B$ sets.

Ideas 1 and 2 sound bad to me. Idea 3 is straightforward but seems hacky, and I'm afraid it might give poor results. Idea 4 is enticing, but I'm not sure what kind of model would be needed.

An extended challenge

I now have 3 observation groups: $A$, $B$, and $C$. The $A$ data set is continuous, and the $B$ and $C$ data sets are binned. $B$ and $C$ use different bins.

How can I fit this model using the data from $A$, $B$, and $C$?

shadowtalker
  • If I read your post correctly, you are asking about estimation methods for interval-censored data, about which we have [many posts](https://stats.stackexchange.com/search?q=interval+censored). – whuber Apr 12 '18 at 19:03
  • @whuber It probably should be called interval-censored, but I don't see much of a correspondence with the typical survival analysis setup. – shadowtalker Apr 12 '18 at 19:44
  • To clarify, it's not about detecting whether an event occurs in a window of time. It's about inferring a continuous distribution parameter when I only know a range for some of the data points. – shadowtalker Apr 12 '18 at 19:52
  • That's *precisely* what interval censoring means. It doesn't necessarily have anything to do with survival analysis or time intervals. But the posts I referred you to cover the situation pretty well. See https://stats.stackexchange.com/questions/56015 and https://stats.stackexchange.com/questions/265785 for Maximum Likelihood methods that can immediately be applied to your situation. – whuber Apr 12 '18 at 19:58
  • @whuber in my searching all I found was survival analysis stuff. Your answer in the first link really clears it up! So I would need to use $F(high) - F(low)$ instead of $f(value)$ in the likelihood for the censored observations? – shadowtalker Apr 12 '18 at 20:12
  • That's correct. – whuber Apr 12 '18 at 20:13
  • Let us [continue this discussion in chat](https://chat.stackexchange.com/rooms/75939/discussion-between-shadowtalker-and-whuber). – shadowtalker Apr 12 '18 at 20:15

1 Answer

As per the comments by whuber, this is just a type of censored data, called interval-censored.

The way to handle this is straightforward.

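  • The likelihood contribution of a non-censored observation is the density $f_{\beta,\sigma}\left(y\right)$, evaluated at the observed value and conditional on that observation's covariates.
  • The likelihood contribution of a censored observation is the probability of its bin, $F_{\beta,\sigma}\left(y^\mathrm{upper}\right) - F_{\beta,\sigma}\left(y^\mathrm{lower}\right)$, where $F_{\beta,\sigma}$ is the corresponding CDF and $y^\mathrm{upper}$ and $y^\mathrm{lower}$ are the upper and lower bounds of that observation's bin.
  • The likelihood for the whole data set is the product of the contributions from each observation, because the observations are independent conditional on $X$.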
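
Written out for the log-normal model in the question, the three pieces above combine as follows. This is a sketch under the question's assumptions; the index $n$, the per-observation bounds $y_n^\mathrm{lower}$ and $y_n^\mathrm{upper}$, and the symbols $\phi$ and $\Phi$ (standard normal density and CDF) are notation introduced here for illustration.

$$ L\left(\beta, \sigma\right) = \prod_{n \in A} f_{\beta,\sigma}\left(y_n \mid x_n\right) \prod_{n \notin A} \left[ F_{\beta,\sigma}\left(y_n^\mathrm{upper} \mid x_n\right) - F_{\beta,\sigma}\left(y_n^\mathrm{lower} \mid x_n\right) \right] $$

with

$$ f_{\beta,\sigma}\left(y \mid x\right) = \frac{1}{y \sigma} \phi\left(\frac{\log y - \beta x}{\sigma}\right), \qquad F_{\beta,\sigma}\left(y \mid x\right) = \Phi\left(\frac{\log y - \beta x}{\sigma}\right) $$

Because the bounds are attached to each observation rather than to a shared set of bins, the extended case should need no change: observations from $B$ and $C$ enter the second product with their own bounds.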

Then we can apply standard tools for maximum-likelihood analysis.
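
As a minimal sketch of what that could look like in practice, the following fits the model by numerical maximum likelihood with SciPy. Everything specific here is an assumption made for illustration: the simulated data, the bin edges, the helper name `neg_log_lik`, and the choice to optimise over $\log\sigma$ so that $\sigma$ stays positive.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm


def neg_log_lik(params, X, log_y, log_lo, log_hi, censored):
    """Negative log-likelihood for log(Y) ~ N(X @ beta, sigma^2)."""
    beta, sigma = params[:-1], np.exp(params[-1])  # last entry is log(sigma)
    mu = X @ beta
    # Exact observations: normal log-density of log(y). The Jacobian term
    # -log(y) does not depend on the parameters, so it is dropped.
    ll_exact = norm.logpdf(log_y[~censored], mu[~censored], sigma)
    # Interval-censored observations: log of F(upper) - F(lower).
    p = (norm.cdf(log_hi[censored], mu[censored], sigma)
         - norm.cdf(log_lo[censored], mu[censored], sigma))
    ll_cens = np.log(np.clip(p, 1e-300, None))  # guard against log(0)
    return -(ll_exact.sum() + ll_cens.sum())


# --- Simulated data (hypothetical values, for illustration only) ---
rng = np.random.default_rng(0)
n = 2000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
true_beta, true_sigma = np.array([10.5, 0.4]), 0.6
log_y = X @ true_beta + true_sigma * rng.normal(size=n)

# Roughly half of the households report only a bin.
censored = rng.random(n) < 0.5
edges = np.concatenate(
    [[-np.inf], np.log([10_000, 30_000, 60_000, 100_000]), [np.inf]]
)
idx = np.searchsorted(edges, log_y, side="right") - 1
log_lo, log_hi = edges[idx], edges[idx + 1]
log_y_obs = np.where(censored, np.nan, log_y)  # exact value hidden if binned

# Start from ordinary least squares on the exactly observed subset.
beta0 = np.linalg.lstsq(X[~censored], log_y_obs[~censored], rcond=None)[0]
start = np.append(beta0, 0.0)

fit = minimize(
    neg_log_lik,
    start,
    args=(X, log_y_obs, log_lo, log_hi, censored),
    method="BFGS",
)
beta_hat, sigma_hat = fit.x[:-1], np.exp(fit.x[-1])
print("beta_hat:", beta_hat, "sigma_hat:", sigma_hat)
```

Since the bounds are stored per observation, a second binned group with different bin edges could be handled by filling `log_lo` and `log_hi` from its own edges; the likelihood function itself would not change.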

shadowtalker