When to replace Not Applicable" or N/A values with a zero?

Question

I am trying to compare user ratings of various products, the majority of which come in several standard versions. However some of the products do not come in certain versions meaning that my data set has several N/A values.

The data is continuous interval type data with a range of -100 to +100

My question is, when should these values be replaced by 0?

            V1      V2      V3      V4      V5      V6      V7      V8      V9
Product 1   2.63    -5.12   -0.41   5.29    9.89    4.16    14.73   9.06    -7.80
Product 2   0.60    0.94    4.47    N/A     0.12    21.47   N/A     -4.63   1.29
Product 3   5.53    -16.20  -19.56  N/A     2.24    N/A     15.07   -3.47   -6.93

With N/A values included, excel tells me the average user rating given to each product is

Product 1 = 3.60
Product 2 = 3.47
Product 3 = -3.33

However if I replace the values with 0 then it changes the scores:

Product 1 = 3.60
Product 2 = 2.70
Product 3 = -2.59

I am sure others have dealt with this question before but I am not sure what to do. In future I want to undertake t-tests or z-tests on the data.

Some Research

To be honest apart from a little bit here and some questions on Research Gate I cant find a lot on this topic, I suspect I am using the wrong search terms however. The below is what I have got so far

There appears to be lots of questions about how to replace N/A values in datasets but not when or whether it should be done

There are a few other simalar questions on CV but they have not received an answer, e.g. here

A comment in reply to a question here, only slightly similar question suggests setting NA values to the minimum of the range, but this would significantly change my results...

P.S. I am really not sure what to tag this question with so if anyone could apply better tags it would be much appreciated.

Thanks

If you got lots of free time, take a look at the Expectation Maximization algorithm from ML. Basically what you do is try to predict what is the most likely value a missing number would take given all of what you saw in the rest of your data set. https://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm — FisherDisinformation, Aug 10 '16 at 19:14
maybe this can help : http://stats.stackexchange.com/questions/226992/how-to-determine-which-of-the-important-missing-variables-to-ignore/227025#227025 — , Aug 10 '16 at 19:22
@Artificial Since these N/A values correspond to non-existent categories, any effort at imputing them would seem to have no meaning. — whuber, Aug 10 '16 at 19:22
@whuber My bad, I didn't read the full description, but I'm pretty sure EM would work if not for this constraint. — FisherDisinformation, Aug 10 '16 at 23:36

Ashe · Accepted Answer · 2016-08-10T21:16:33.207

To formalize @Whuber's comment in an answer, it seems like no, you shouldn't impute here.

From the range of available scores, I'll assume that a $0$ denotes a neutral opinion of that product/version combination. If you were to impute values of $0$ for your NAs, what you're saying is that the average user rating of that product/version combination is neutral. However, that doesn't make any sense, given that those product/version combinations don't exist at all! If a given product/version combination doesn't exist, it would be inappropriate to assign a numerical user rating to represent an opinion of a nonexistent product/version combination.

If the NAs represented missingness due to lack of data, imputation might be an option (though there are other options beyond imputation, including doing nothing). But that doesn't seem like the case here.

When to replace Not Applicable" or N/A values with a zero?

1 Answers1

Linked