How to fill NaN values that exist because there are no measures of certain features?

Question

I'm currently doing an ML project for a class (the goal is simply to clean the data set, apply some of the models we learned, like Random Forests, ensemble learning, etc., and test the results), and I'm cleaning my data set. It's about hotels/homestays, and it has a few columns that hold review scores for different parameters (cleanliness, location, etc.). Another column holds the number of reviews that a given hotel has.

The problem is that several hotels/homestays have a value of 0 in the number of reviews (perhaps because they are new places), so for those hotels/homestays the review score columns (cleanliness, etc.) contain NaN.

I'm really torn on how to deal with these NaNs. Dropping the rows seems like a bad idea, as these observations make up about 10% of the overall data, which is quite a lot.

My question is:

1. Should I just assign a fixed value (like -1 or 0) to the cells that are NaN because the corresponding hotels/homestays have 0 reviews, thereby kind of "grouping" all of the places that are new or have no reviews? OR
2. Should I fill those cells with the column mean, with interpolation, or with a prediction algorithm?

If I do the latter, though, it would make sense to also turn the 0s in the number of reviews into NaN and fill them with another value, right? Otherwise I'd have values in the review columns while the last_review column would indicate that the place has 0 reviews (which wouldn't make much sense).

Sorry for the long question and thanks in advance for taking the time to read this!!

Can you tell us more about these other variables? A variable such as `number of reviews` can be 0 without issues. — user2974951, Feb 12 '19 at 10:07
@user2974951 The other variables that are NaN are variables like cleanliness_score, location_score, accuracy_score, etc. For the hotels/homestays that have a number of reviews equal to 0, these variables have NaN values, which makes sense. However, I have to change this if I want to implement some of the models I mentioned. Since a number of reviews equal to 0 makes sense, should I just replace the NaNs in those columns with a fixed value, or should I turn the 0s in the number of reviews into NaN and also fill them with a value based on the column average or some other metric? — J.Doe, Feb 12 '19 at 10:13
You may end up simply imputing these missing values with 0 or something, this would be the simplest choice. If you know R and are willing to learn, have a look at https://stats.stackexchange.com/questions/372257/how-do-you-deal-with-nested-variables-in-a-regression-model — user2974951, Feb 12 '19 at 10:19
Also you could end up treating the hotels with 0 reviews differently, for ex. creating a new variable which is 0 for these, and 1 for all the others. In doing so you would try to differentiate these two classes as much as possible. — user2974951, Feb 12 '19 at 10:20
@user2974951 Thanks, I was thinking about doing that as well! Will replacing all the NaN values with 0 (which are about 10% of the overall data) have a bad influence if I implement a model, like a regression, to predict price? I was discussing this with some colleagues and one of them said that it could create bias... but then again, it does make sense to replace them with 0, as there are, in fact, no reviews! — J.Doe, Feb 12 '19 at 17:29
This is something that is heavily dependent on the data and I cannot tell you whether it's going to have an effect. What I would do is build two models, one with all the data where you deal with the NaN's for ex. by imputing with 0, and the second model with 90 % of the data, dropping the 10 % of the data that are troublesome. See if there are big differences in the models. If there are, then those 10 % of the data are important and you need to deal with them more sensibly. — user2974951, Feb 13 '19 at 07:46
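
The two suggestions from the comments (an indicator for places without reviews plus a fixed fill value, then comparing a model on all rows against one on the reviewed rows only) could be sketched roughly as below. This is only an illustration under stated assumptions: the column names `number_of_reviews`, `cleanliness_score`, `location_score` and `price`, the helper names `flag_and_fill` and `fit_linear`, and the tiny data set are all made up, and a plain least-squares fit stands in for whatever model is actually used.

```python
import numpy as np
import pandas as pd

# Hypothetical column names; the question only gives examples such as cleanliness_score.
SCORE_COLS = ["cleanliness_score", "location_score"]


def flag_and_fill(df, score_cols=SCORE_COLS, fill_value=0.0):
    """Add a has_reviews indicator and fill scores missing because there are no reviews."""
    out = df.copy()
    no_reviews = out["number_of_reviews"] == 0
    # 0 for places without reviews, 1 for all the others
    out["has_reviews"] = (~no_reviews).astype(int)
    # Only touch NaNs in rows with zero reviews; any other NaNs are left alone
    for col in score_cols:
        out.loc[no_reviews & out[col].isna(), col] = fill_value
    return out


def fit_linear(df, feature_cols, target="price"):
    """Ordinary least squares with an intercept; returns the coefficients as a Series."""
    X = np.column_stack([np.ones(len(df)), df[feature_cols].to_numpy(dtype=float)])
    y = df[target].to_numpy(dtype=float)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return pd.Series(coef, index=["intercept", *feature_cols])


# Tiny made-up data set, for illustration only
df = pd.DataFrame({
    "number_of_reviews": [12, 0, 5, 0, 30, 8],
    "cleanliness_score": [9.0, np.nan, 8.0, np.nan, 10.0, 7.0],
    "location_score": [8.0, np.nan, 9.0, np.nan, 9.0, 6.0],
    "price": [120.0, 80.0, 95.0, 70.0, 150.0, 60.0],
})

filled = flag_and_fill(df)

# Model 1: all rows, with the missing scores imputed as 0
coef_all = fit_linear(filled, SCORE_COLS)
# Model 2: only the rows that actually have reviews
coef_reviewed = fit_linear(df[df["number_of_reviews"] > 0], SCORE_COLS)

# Side-by-side coefficients; large differences suggest the zero-filled rows matter
print(pd.DataFrame({"all_rows": coef_all, "reviewed_only": coef_reviewed}))
```

The `has_reviews` column is left out of the first model here so that it matches the plain "impute with 0" comparison; it can be added to the feature list to let the model separate the two groups explicitly.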
