How to deal with intentionally missing data

Question

I have a dataset describing a vehicle's sensors. One of the sensors records the distance from cars in other lanes. Sometimes there are no cars to the right or the left of the vehicle and this is recorded as NA.

I would like to use this data to build a prediction model. However, I cannot drop the missing values, because they indicate that no car was present at that moment, which is information the model should have. I don't know how to represent them. Should I use a large number (a million, say)? Should I use zero? Or should I build one model per lane, drop the missing values within each lane, and then combine the two models? (Sometimes there is a car in one lane but not the other, and sometimes there is no car in either lane.)

What is the best approach to handle this scenario?

score 0 · Accepted Answer · answered Mar 28 '20 at 22:39

There is no single answer to your question. Because I tend to use Bayesian methods, I would split the reading into two variables. The first variable would be present/absent. The second would be the distance given that a car is present.
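As a rough illustration of that split, here is a minimal sketch in plain Python. The function name `split_reading` and the sample readings are made up for the example, and it assumes the NA values arrive as `None` or NaN. The distance is left undefined (`None`) when no car is present rather than replaced by zero or a large number.

```python
import math

def split_reading(distance):
    """Split one raw reading into (present, distance_given_present).

    Assumes a missing reading is encoded as None or NaN (hypothetical encoding).
    """
    missing = distance is None or math.isnan(distance)
    if missing:
        # No car detected: the distance is undefined, so keep it as None.
        return False, None
    return True, float(distance)

# Hypothetical right-lane readings in metres; None and NaN both mean "no car".
readings = [3.2, None, 7.5, float("nan")]
features = [split_reading(r) for r in readings]
print(features)
# [(True, 3.2), (False, None), (True, 7.5), (False, None)]
```

The same function can be applied to each lane separately, which gives one present/absent flag and one conditional distance per lane.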

Because blind spots could exist, you would also need to decide how much uncertainty there is in the present/absent reading. If there are no blind spots and no risk of instrumentation error, then you could condition on present/absent in the predictive work. If the reading is stochastic, then you would need to marginalize it out of the prediction.
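To make the difference between conditioning and marginalizing concrete, here is a small sketch for point predictions only. The function names and the numbers are invented for the example, and it assumes you already have one prediction for the "car present" case, one for the "car absent" case, and a probability that a car is really there.

```python
def predict_conditioned(present, pred_if_present, pred_if_absent):
    """Trust the sensor: pick the prediction matching the observed flag."""
    return pred_if_present if present else pred_if_absent

def predict_marginalized(p_present, pred_if_present, pred_if_absent):
    """Sensor may be wrong: average both cases, weighted by P(car present)."""
    return p_present * pred_if_present + (1.0 - p_present) * pred_if_absent

# Hypothetical numbers: the sensor reports "absent", but a blind spot means
# there is still a 25% chance that a car is actually there.
conditioned = predict_conditioned(False, pred_if_present=2.0, pred_if_absent=6.0)
marginalized = predict_marginalized(0.25, pred_if_present=2.0, pred_if_absent=6.0)
print(conditioned, marginalized)
# 6.0 5.0
```

In a full Bayesian treatment the marginalized prediction would be a mixture of two predictive distributions rather than a weighted average of two numbers; the sketch only shows the expected value.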

If you are using a Frequentist method, I cannot give you a single answer because I do not know what you are predicting. Predicting a crash is different from predicting relative position. It depends entirely on the functional form that you choose. In most cases, I would still split it into two variables.

Still, there is a second Frequentist possibility, because the absence of a car implies that there are still cars out of range in both directions. An absent reading would then signal a need to switch algorithms rather than rely on a single estimation. The alternate prediction would be the location of the next observation. Indeed, even with no current reading, it may be possible to predict the distance to the next vehicle from the time between past observations together with your current location and velocity. After all, if you are booking it down the interstate at 100 mph, you probably will not be approached from the rear unless you pass a cop on the side of the road. Likewise, if you are on the same interstate in normal conditions and traveling 10 mph below the speed limit, then you are likely to be approached from the rear.
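One very simplified way to picture the arrival-time idea is sketched below. It is not the only possible model: it assumes that gaps between detections are roughly exponential (a Poisson arrival process), it ignores your own speed relative to traffic, and the timestamps and function names are hypothetical.

```python
import math

def arrival_rate(timestamps):
    """Estimate arrivals per second from past detection times (seconds, ascending)."""
    gaps = [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]
    mean_gap = sum(gaps) / len(gaps)
    return 1.0 / mean_gap

def prob_arrival_within(rate, horizon_s):
    """P(at least one car arrives within horizon_s), assuming exponential gaps."""
    return 1.0 - math.exp(-rate * horizon_s)

# Hypothetical times (seconds) at which a car entered sensor range.
past_detections = [0.0, 10.0, 20.0, 30.0]
rate = arrival_rate(past_detections)      # one car every 10 s on average
p_10s = prob_arrival_within(rate, 10.0)   # about 0.63 under this assumption
print(rate, round(p_10s, 2))
```

A more realistic version would let the rate depend on your speed relative to the traffic around you, which is the point of the 100 mph example.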

Sorry, I cannot give you a clearer answer without knowing the functional form.

I was thinking of switching from continuous to boolean but wasn't sure if it was the best choice. Thank you for your feedback. — Michal, Mar 28 '20 at 23:26
