Numeric variable with outliers as categories

Question

I'm working with a dataset that has a few variables I'm having difficulty preprocessing. One of them, MENTHLTH, is a numeric variable.

The variable measures the number of days in the last 30 on which a person had bad mental health. An answer of 1 means one bad mental health day in the past 30 days; an answer of 30 means all of them were bad. There are exceptions, however: if you had no bad days in the past 30 days you'd put 88, and if you were not sure you'd put 77.

About 1/3 of the responses have a value of 1-30. Nearly 2/3 have a value of 88, and the remainder are 77 or blank.

So how should I go about dealing with this variable? Should I make it nominal and bin the values into meaningful groupings, or can I continue to treat it as a numeric variable?

I'm running it through Weka.

You must have a reason for not treating `88` as `0`. What is it? — BruceET, Dec 06 '20 at 04:17

Deathkill14 · Accepted Answer · 2020-12-06T12:07:34.357

I agree with @Boom that replacing the "no bad days in the past 30 days" with 0 seems reasonable if there is nothing special going on (which @BruceET asks about).

I was also thinking about the dummy variable for unsure individuals, and maybe that is a start. But another way to look at these entries is that they are a lot like missing observations. Sometimes there are patterns in why observations are missing, and if there are, they have a major impact on your analysis if you want to fill the values in (which can be interesting). If there are no patterns, there are other things to think about. So first, maybe take a look at the kinds of missingness (they are well defined) and see what kind you might have in your data. You can find a good description of these missingness types in Section 25.1 of this paper. Ask yourself which type of missingness you are likely confronted with. If you believe that there is a mechanism behind the missingness you observe, you may want to reconsider imputation. In any case, I think it's a worthwhile exercise to understand your data better through this lens.
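As a rough first look at whether the 77s and blanks follow a pattern, one option is to compare how often they occur across the levels of some other observed variable. The sketch below uses made-up rows and a hypothetical `age_group` column, neither of which comes from the question's dataset. Clearly different rates across groups would argue against the values being missing completely at random, although a check like this cannot tell you whether the missingness depends on the unobserved answer itself.

```python
from collections import defaultdict

# Made-up illustration rows: (age_group, raw MENTHLTH code).
# None stands for a blank response; age_group is a hypothetical covariate.
rows = [
    ("18-34", 88), ("18-34", 4), ("18-34", 77), ("18-34", 88),
    ("65+", 77), ("65+", None), ("65+", 88), ("65+", 77),
]


def missing_rate_by_group(rows):
    """Return the share of 77/blank responses within each group."""
    counts = defaultdict(lambda: [0, 0])  # group -> [missing, total]
    for group, raw in rows:
        counts[group][1] += 1
        if raw is None or raw == 77:
            counts[group][0] += 1
    return {group: missing / total for group, (missing, total) in counts.items()}


rates = missing_rate_by_group(rows)
print(rates)
```

With real data you would repeat this for several covariates and follow up with a formal test before drawing conclusions.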

If you decide to go the imputation route, there is some support in R. Multiple imputation for both continuous and categorical variables can be performed using the mi package.

So in short, perhaps you can find a path forward by understanding what type of missingness lies behind the 77s. Even if you don't impute and there isn't a good statistical trick to make them work for you, understanding them through the lens of missingness will let you talk about them more effectively to the audience of your paper or project report.

score 1 · Answer 2 · answered Dec 06 '20 at 05:21

I suggest recoding the value for no bad days in the past 30 days as 0, and also including a separate dummy variable for the individuals who were unsure.
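A minimal sketch of that recoding in plain Python, assuming the codes described in the question (88 for no bad days, 77 for not sure, blank for no response). The function name `recode_menthlth` and the choice to leave unsure and blank responses as missing are illustrative assumptions, not part of the original data documentation.

```python
from typing import Optional, Tuple

NONE_CODE = 88    # "no bad days", per the coding described in the question
UNSURE_CODE = 77  # "not sure", per the coding described in the question


def recode_menthlth(raw: Optional[int]) -> Tuple[Optional[int], int]:
    """Return (days, unsure_flag) for one raw MENTHLTH response."""
    if raw is None:
        return None, 0   # blank: days missing, not flagged as unsure
    if raw == NONE_CODE:
        return 0, 0      # no bad days becomes a true zero
    if raw == UNSURE_CODE:
        return None, 1   # days missing, dummy variable set
    if 1 <= raw <= 30:
        return raw, 0    # ordinary day count is kept as-is
    raise ValueError(f"unexpected MENTHLTH code: {raw}")


responses = [88, 5, 77, None, 30, 88]
recoded = [recode_menthlth(r) for r in responses]
print(recoded)
# [(0, 0), (5, 0), (None, 1), (None, 0), (30, 0), (0, 0)]
```

The day count for unsure respondents is still missing after this step, so you would need to decide separately whether to fill it with a placeholder, impute it, or leave it missing before modelling.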

Boom
