Learning a continuous model from binned data

Question

A very similar question has been asked before, but it didn't get a real answer.

Background

I would like to develop a probability model for a continuous, ratio-scale random variable $Y$. Let's say it represents the annual total income of a household.

I also have several covariates, collectively called $X$, that are predictive of annual household income.

I'm using a standard regression model setup. I want to model $\log(Y)$ with a Gaussian regression model, using the $X$ variables as predictors: $$ \log\left(Y\right) \sim \mathcal{N}\left(\beta X, \sigma^2\right) $$ where $\beta$ and $\sigma$ are model parameters to be estimated. My goal is to estimate $P\left(Y|X=x\right)$.

The challenge

I have a dataset containing $N$ observations from different households (assume they're IID conditional on $X$), indexed by $n$.

In one subset of the observations (denoted $A$), the collected data is continuous, but noisy. These heads of household were asked "what is your total household income?"

In the remaining observations, $B$, the collected data is binned into tiers, e.g. "\$10,000 - \$30,000". The bin widths are not necessarily constant in linear or log space. For each of these observations, I know the range of possible $y$ values, but not the actual value.

How can I use the information from the $B$ observations to fit my model?

Some ideas

1. Bin the data in $A$ using the same bins as in $B$.
2. Replace the binned values in $B$ with the bin midpoints.
3. Replace the binned values in $B$ with data sampled uniformly from the bin; repeat $K$ times and analyze using standard techniques for multiple imputation.
4. Fit two models: a continuous model on $A$ and some kind of latent-variable model on $B$. The conditional distribution of $Y$ is a mixture of both, weighted by the relative sizes of the $A$ and $B$ sets.

Ideas 1 and 2 sound bad to me. Idea 3 is straightforward but seems hacky, and I'm afraid it might give poor results. Idea 4 is enticing, but I'm not sure what kind of model would be needed.

An extended challenge

I now have 3 observation groups: $A$, $B$, and $C$. The $A$ data set is continuous, and the $B$ and $C$ data sets are binned. $B$ and $C$ use different bins.

How can I fit this model using the data from $A$, $B$, and $C$?

If I read your post correctly, you are asking about estimation methods for interval-censored data, about which we have [many posts](https://stats.stackexchange.com/search?q=interval+censored). — whuber, Apr 12 '18 at 19:03
@whuber It probably should be called interval-censored, but I don't see much of a correspondence with the typical survival analysis setup. — shadowtalker, Apr 12 '18 at 19:44
To clarify, it's not about detecting whether an event occurs in a window of time. It's about inferring a continuous distribution parameter when I only know a range for some of the data points. — shadowtalker, Apr 12 '18 at 19:52
That's *precisely* what interval censoring means. It doesn't necessarily have anything to do with survival analysis or time intervals. But the posts I referred you to cover the situation pretty well. See https://stats.stackexchange.com/questions/56015 and https://stats.stackexchange.com/questions/265785 for Maximum Likelihood methods that can immediately be applied to your situation. — whuber, Apr 12 '18 at 19:58
@whuber in my searching all I found was survival analysis stuff. Your answer in the first link really clears it up! So I would need to use $F(high) - F(low)$ instead of $f(value)$ in the likelihood for the censored observations? — shadowtalker, Apr 12 '18 at 20:12
Let us [continue this discussion in chat](https://chat.stackexchange.com/rooms/75939/discussion-between-shadowtalker-and-whuber). — shadowtalker, Apr 12 '18 at 20:15

shadowtalker · Accepted Answer · 2018-04-12T20:39:55.507

As per the comments by whuber, this is just a type of censored data, called interval-censored.

The way to handle this is straightforward.

The likelihood for a non-censored observation is $f_{\beta,\sigma}\left(y\right)$.
The likelihood for a censored observation is $F_{\beta,\sigma}\left(y^\mathrm{upper}\right) - F_{\beta,\sigma}\left(y^\mathrm{lower}\right)$, where $y^\mathrm{upper}$ and $y^\mathrm{lower}$ are the upper and lower bounds of that observation's bin.
The likelihood for the whole data set is the product of the likelihoods for each observation, because the observations are independent conditional on $X$.
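
Written out for the log-normal model in the question, the log-likelihood would look roughly as follows. This is a sketch under some notation of my own: $x_n$ denotes the covariates of observation $n$, $\phi$ and $\Phi$ are the standard normal density and CDF, and everything is expressed on the log scale. The Jacobian term $-\log y_n$ for the exact observations is dropped because it does not depend on $\beta$ or $\sigma$.

$$ \ell\left(\beta, \sigma\right) = \sum_{n \in A} \log\left[\frac{1}{\sigma}\,\phi\left(\frac{\log y_n - \beta x_n}{\sigma}\right)\right] + \sum_{n \in B} \log\left[\Phi\left(\frac{\log y_n^\mathrm{upper} - \beta x_n}{\sigma}\right) - \Phi\left(\frac{\log y_n^\mathrm{lower} - \beta x_n}{\sigma}\right)\right] $$

A third group $C$ with different bins simply adds another sum of the same form as the one over $B$, using each observation's own bounds.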

Then we can apply standard tools for maximum-likelihood analysis.
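
As a minimal sketch of what that could look like in Python with NumPy and SciPy: the simulated data, the bin edges, the "true" parameter values, and the helper names `neg_log_likelihood` and `log_bin_bounds` are all assumptions made for illustration, not part of the question. The exact observations are treated as noise-free here, and the optimizer works with $\log\sigma$ so that $\sigma$ stays positive.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm


def neg_log_likelihood(params, X, log_y, log_lo, log_hi, binned):
    """Negative log-likelihood of a Gaussian regression on log(Y) with interval censoring."""
    beta, sigma = params[:-1], np.exp(params[-1])  # last entry is log(sigma)
    mu = X @ beta
    # Exact rows contribute the Gaussian log-density of log(y).
    ll_exact = norm.logpdf(log_y[~binned], loc=mu[~binned], scale=sigma)
    # Binned rows contribute log[F(upper) - F(lower)], evaluated on the log scale.
    z_hi = (log_hi[binned] - mu[binned]) / sigma
    z_lo = (log_lo[binned] - mu[binned]) / sigma
    prob = norm.cdf(z_hi) - norm.cdf(z_lo)
    ll_binned = np.log(np.clip(prob, 1e-300, None))  # clip guards against log(0)
    return -(ll_exact.sum() + ll_binned.sum())


def log_bin_bounds(y, edges):
    """Log-scale (lower, upper) bounds of the bin that contains each value of y."""
    edges = np.asarray(edges, dtype=float)
    idx = np.searchsorted(edges, y, side="right") - 1
    with np.errstate(divide="ignore"):  # log(0) = -inf is intended for the lowest bin
        log_edges = np.log(edges)
    return log_edges[idx], log_edges[idx + 1]


# Simulated data (illustrative values only).
rng = np.random.default_rng(0)
n = 3000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true, sigma_true = np.array([10.5, 0.4]), 0.6
log_y = X @ beta_true + sigma_true * rng.normal(size=n)

# Group 0 plays the role of A (exact); groups 1 and 2 play B and C (different bins).
group = rng.integers(0, 3, size=n)
binned = group > 0
edges_by_group = {
    1: [0, 10_000, 30_000, 60_000, 100_000, np.inf],
    2: [0, 25_000, 50_000, 75_000, np.inf],
}
log_lo = np.full(n, np.nan)
log_hi = np.full(n, np.nan)
for g, edges in edges_by_group.items():
    rows = group == g
    log_lo[rows], log_hi[rows] = log_bin_bounds(np.exp(log_y[rows]), edges)
log_y_obs = np.where(binned, np.nan, log_y)  # binned rows have no exact value

# Start the optimizer from ordinary least squares on the exact rows.
beta0, *_ = np.linalg.lstsq(X[~binned], log_y_obs[~binned], rcond=None)
resid = log_y_obs[~binned] - X[~binned] @ beta0
x0 = np.append(beta0, np.log(resid.std()))

result = minimize(
    neg_log_likelihood,
    x0,
    args=(X, log_y_obs, log_lo, log_hi, binned),
    method="BFGS",
)
beta_hat, sigma_hat = result.x[:-1], np.exp(result.x[-1])
print("beta:", beta_hat, "sigma:", sigma_hat)
```

Because each binned row carries its own lower and upper bound, the same likelihood covers the extended challenge: groups with different bins need no special handling.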
