Linear regression where some known records have a measurement error in dependent variable

Question

I am modelling data where the dependent variable is the number of units of a certain product sold each month in each area. In all areas, the product is sold by a chain of shops 'A' and we have exact sales data from that chain.

In some areas, the product is also sold by a smaller chain of shops 'B', and we know how many shops 'B' are present in each area, but we don't know how many units they sell.

The data is arranged in such a way that records = locations (areas). Independent variables are mainly area characteristics.

So in those areas where only shops 'A' are present, our dependent variable is measured without error. In areas where shops 'B' are also present, the dependent variable is measured with error as the number of units of that product sold in such areas is underestimated (no data from shops 'B').

I have two questions:

Should I add to the regression either a dummy for 'a shop 'B' is present in the area', or perhaps a variable that is a count of shops 'B' in each area? Would that help, and if so, why would it be helpful?
I understand that if the measurement error is random, this will increase standard errors but will not bias coefficients.

I am not sure however if my measurement error is random. The two clues that I have are as follows:

I have reasons to believe that in areas where shops 'A' sell more units than average, shops 'B' also sell more units than average.
I have reasons to believe that the likelihood that there is a shop 'B' in the area is not related to how many shops 'A' there are in the area.

If these clues are valid, is the measurement error 'random'?

kjetil b halvorsen · Answer 1 · 2019-03-22T20:53:17.077

First, the measurement error is certainly not random, since you always have an underestimate of the response variable. Your best bet would be to get some additional data, such as the number of shops and some measure of the overall size (total sales?) of the 'A' and 'B' shops in each area. To get good results, you really need as much extra information as you can get.

If that is impossible, maybe try the following.
Since this is count data, I would start out with a Poisson model (in practice there might well be overdispersion, so you may need to change to a negative binomial model, but most of what follows will still apply). Let the total count in area $i$ be $Y_i$, $$ Y_i = Y_i^A + Y_i^B $$ where only $Y_i^A$ is observed. A Poisson regression model for the unobserved $Y_i$ will be $$ Y_i \sim \mathcal{P}(\lambda_i),~~~~ \lambda_i=\exp\{\beta_0+\beta_1\cdot \text{Area}_i+\beta_2\cdot x_i\} $$ where $\text{Area}_i$ stands for an area characteristic and $x_i$ is the count of 'B' shops in the area. As a start, I would estimate the model using only the subset of the data with $x_i=0$ (and therefore $Y_i=Y_i^A$). Then compute fitted values and residuals from that estimated model, but for all of the data. Plot those residuals in various ways, maybe with a different color for the data not used in estimation. That should give some insight.
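The first step above (fit on the $x_i=0$ areas only, then compute residuals for every area) could be sketched roughly as follows. This is only an illustration on simulated data, not real sales data: the single covariate `z`, the coefficient values, and the assumption that shop 'A' sells 30% less per 'B' shop present are all hypothetical and chosen just so that the residuals show a pattern. The small Newton-Raphson fitter stands in for whatever GLM routine you would normally use.

```python
import math
import random

def rpois(lam, rng):
    """Draw one Poisson variate with Knuth's method (adequate for small lam)."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def fit_poisson(z, y, iters=25):
    """Newton-Raphson for the model log(lambda) = b0 + b1 * z (one covariate)."""
    b0, b1 = math.log(sum(y) / len(y)), 0.0
    for _ in range(iters):
        mu = [math.exp(b0 + b1 * zi) for zi in z]
        # Score vector of the Poisson log-likelihood
        g0 = sum(yi - mi for yi, mi in zip(y, mu))
        g1 = sum((yi - mi) * zi for yi, mi, zi in zip(y, mu, z))
        # Fisher information (2x2), inverted by hand
        h00 = sum(mu)
        h01 = sum(mi * zi for mi, zi in zip(mu, z))
        h11 = sum(mi * zi * zi for mi, zi in zip(mu, z))
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Simulated areas (all numbers here are assumptions for illustration only)
rng = random.Random(0)
n = 400
z = [rng.random() for _ in range(n)]                 # hypothetical area characteristic
x = [rng.choice([0, 0, 0, 1, 2]) for _ in range(n)]  # count of 'B' shops in the area
y_a = [rpois(math.exp(1.5 + 1.0 * zi) * 0.7 ** xi, rng) for zi, xi in zip(z, x)]

# Step 1: estimate the model on areas without 'B' shops only
z0 = [zi for zi, xi in zip(z, x) if xi == 0]
y0 = [yi for yi, xi in zip(y_a, x) if xi == 0]
b0, b1 = fit_poisson(z0, y0)

# Step 2: fitted values and Pearson residuals for ALL areas
mu = [math.exp(b0 + b1 * zi) for zi in z]
pearson = [(yi - mi) / math.sqrt(mi) for yi, mi in zip(y_a, mu)]

# Step 3: summarise residuals by number of 'B' shops (in practice, plot them)
mean_resid = {}
for k in sorted(set(x)):
    vals = [r for r, xi in zip(pearson, x) if xi == k]
    mean_resid[k] = sum(vals) / len(vals)
    print(f"B shops = {k}: n = {len(vals)}, mean Pearson residual = {mean_resid[k]:.2f}")
```

With real data you would replace the simulated `z`, `x` and `y_a` by your own columns and plot `pearson` against `x` and against the fitted values, coloring the areas that were not used in the fit.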

Then I would think about ways of estimating the full model (for all of the data), maybe using the EM algorithm; see for instance the question "Numerical example to understand Expectation-Maximization". On further thought, though, this is optimistic. Without observations on the $x_i$ (when it is positive), there is no information about the process generating the $x_i$ values. The parameters in the regression model have nothing to say about this, since the regression model is conditional on the $x_i$ (and other) variables. So you really need some more information. Failing that, you are left with the results of the initial residual analysis proposed above, which at least tells you about the size of the underestimation.

Thank you very much Kjetil, I'll follow your advice. Filip – Filip S, Mar 22 '19 at 15:29
