I am modelling data where the dependent variable is the number of units of a certain product sold each month in each area. In all areas, the product is sold by a chain of shops 'A' and we have exact sales data from that chain.
In some areas, the product is also sold by a smaller chain of shops 'B', and we know how many shops 'B' are present in each area, but we don't know how many units they sell.
The data are arranged so that each record corresponds to one location (area). The independent variables are mainly area characteristics.
So in areas where only shops 'A' are present, the dependent variable is measured without error. In areas where shops 'B' are also present, it is measured with error: the number of units sold there is underestimated, because we have no data from shops 'B'.
I have two questions:
Should I add to the regression either a dummy variable indicating that a shop 'B' is present in the area, or perhaps a count of shops 'B' in each area? Would that help, and if so, why?
I understand that if the measurement error in the dependent variable is random, it will increase standard errors but will not bias the coefficients.
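To make that understanding concrete, here is a minimal simulation sketch in Python (NumPy only). Everything in it is invented for illustration: the single regressor `x`, the true slope of 2, the noise scales, and the helper `ols_slope_and_se` are assumptions, not my real data or model. It compares an OLS fit on an error-free dependent variable with a fit on the same variable plus random noise that is independent of `x`.

```python
import numpy as np

rng = np.random.default_rng(0)


def ols_slope_and_se(x, y):
    """Fit y = a + b*x by OLS; return the slope b and its standard error."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (len(y) - X.shape[1])  # residual variance
    cov = sigma2 * np.linalg.inv(X.T @ X)           # covariance of the estimates
    return beta[1], np.sqrt(cov[1, 1])


n = 2000
x = rng.normal(size=n)                          # hypothetical area characteristic
y_true = 10.0 + 2.0 * x + rng.normal(size=n)    # error-free outcome, true slope = 2
noise = rng.normal(scale=3.0, size=n)           # random error, independent of x
y_obs = y_true + noise                          # outcome measured with random error

slope_true, se_true = ols_slope_and_se(x, y_true)
slope_obs, se_obs = ols_slope_and_se(x, y_obs)

print(f"no measurement error:   slope={slope_true:.3f}, se={se_true:.3f}")
print(f"with random error in y: slope={slope_obs:.3f}, se={se_obs:.3f}")
```

Under these assumptions, both slope estimates should land near 2, with the noisy version having the larger standard error.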
I am not sure, however, whether my measurement error is random. I have two clues:
I have reasons to believe that in areas where shops 'A' sell more units than average, shops 'B' also sell more units than average.
I have reasons to believe that the likelihood that there is a shop 'B' in the area is not related to how many shops 'A' there are in the area.
If these clues are valid, can the measurement error be considered 'random'?
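In case it helps to reason about the question, here is a rough sketch of the situation as I picture it, again in Python with NumPy. All numbers are made up: the 30% share of areas with a shop 'B', the assumption that shops 'B' sell roughly 20% of what shops 'A' sell in the same area, and the single area characteristic `x` are hypothetical. I have also assumed that the presence of 'B' is independent of `x`, which is stronger than my second clue. The sketch only encodes the two clues and prints how the resulting measurement error behaves.

```python
import numpy as np

rng = np.random.default_rng(1)

n = 5000
x = rng.normal(size=n)  # hypothetical area characteristic

# Units sold by chain 'A' (observed); depends on x plus noise.
sales_a = np.exp(3.0 + 0.5 * x + rng.normal(scale=0.3, size=n))

# Clue 2 (plus my extra assumption): presence of 'B' is unrelated to 'A' and to x.
has_b = rng.random(n) < 0.3

# Clue 1: where 'A' sells more, 'B' sells more (about 20% of 'A', with noise).
sales_b = np.where(has_b, 0.2 * sales_a * np.exp(rng.normal(scale=0.2, size=n)), 0.0)

total = sales_a + sales_b   # true dependent variable (not observed in B areas)
observed = sales_a          # what I actually record
error = observed - total    # measurement error, never positive

mean_err_no_b = error[~has_b].mean()
mean_err_b = error[has_b].mean()
corr_err_x = np.corrcoef(error[has_b], x[has_b])[0, 1]

print("mean error, areas without 'B':", mean_err_no_b)
print("mean error, areas with 'B':   ", mean_err_b)
print("corr(error, x) in 'B' areas:  ", corr_err_x)
```

The printed quantities are the mean error in each group of areas and the correlation between the error and `x` within areas that have a shop 'B', which seem to be the things that matter for deciding whether the error can be treated as random.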