Should I convert integer variables with very few unique values to factors for predictive modelling?

Question

My data set has 13 variables. The dependent variable is Boolean, and all the independent variables are either integer or numeric (i.e., continuous).

'data.frame':    20000 obs. of  13 variables:
 $ PassFlag    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
 $ var1        : int  6 14 4 14 9 17 6 5 13 7 ...
 $ var2        : int  34 34 35 34 34 34 35 34 34 34 ...
 $ var3        : int  -1 0 6 0 0 -1 0 0 0 0 ...
 $ var4        : num  1096.6 403.4 162754.8 20.1 59874.1 ...
 $ var5        : num  0 0.03 0.02 0.04 0.8 0.02 0.3 0.1 0.36 0.05 ...
 $ var6        : int  0 0 0 0 0 0 0 0 0 0 ...
 $ var7        : int  0 0 0 0 0 0 0 0 0 0 ...
 $ var8        : int  0 0 0 0 0 0 0 0 0 0 ...
 $ var9        : num  0.368 0.368 0.368 0.368 6.457 ...
 $ var10       : int  110 116 112 116 114 112 109 115 112 111 ...
 $ var11       : int  6 6 6 6 6 6 5 5 6 6 ...
 $ var12       : int  0 1 1 1 1 1 1 1 1 1 ...

There are not many unique values in them.

[1] "Unique values in  PassFlag : 2"
[1] "Unique values in  var1 : 23"
[1] "Unique values in  var2 : 3"
[1] "Unique values in  var3 : 9"
[1] "Unique values in  var4 : 14"
[1] "Unique values in  var5 : 68"
[1] "Unique values in  var6 : 2"
[1] "Unique values in  var7 : 2"
[1] "Unique values in  var8 : 2"
[1] "Unique values in  var9 : 3486"
[1] "Unique values in  var10 : 13"
[1] "Unique values in  var11 : 12"
[1] "Unique values in  var12 : 3"

I am trying to predict whether a student will pass or not. I want to try logistic regression, decision trees, random forests, SVMs and XGBoost.

Should I work with the independent variables the way they are, or convert them into factors, since some of them have very few unique values? Or should I one-hot encode them? Or does it depend on the model I choose, and if so, what kind of data should I use for each of the models mentioned?

Thanks Stephen for making it more readable. I would be obliged if you could answer the question. — Mighty, Nov 05 '17 at 12:26
It looks, from what you've shown, like vars 6,7 and 8 are boolean. Var12 might also be, if there is some data entry error. — Peter Flom, Nov 05 '17 at 13:50
Okay, but what about the others? How should I treat them for modelling purposes? — Mighty, Nov 05 '17 at 19:20

score 1 · Answer 1 · answered Nov 05 '17 at 20:05

As Peter Flom notes, your variables 6, 7 and 8 are essentially Boolean. For these, whether you treat them as numeric, as Boolean or as factors will make literally no difference.
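To see why the choice does not matter for a two-level variable, here is a minimal sketch in plain Python. The column `var6` below is made up for illustration (it is not the question's data), and `dummy_code` is a hypothetical helper that mimics the usual treatment coding of a factor, where the first level is the baseline. Assuming the model includes an intercept, the factor version contributes exactly the same column as the raw 0/1 numbers.

```python
# Hypothetical 0/1 column, standing in for something like var6.
var6 = [0, 0, 1, 0, 1, 1, 0]

def dummy_code(values):
    """Treatment (dummy) coding: one indicator column per level except the first."""
    levels = sorted(set(values))
    return {level: [1 if v == level else 0 for v in values] for level in levels[1:]}

columns = dummy_code(var6)

# A two-level factor yields a single indicator column...
print(list(columns))
# ...and that column is identical to the original numeric 0/1 variable.
print(columns[1] == var6)
```

The same reasoning applies to tree-based methods: a split on a 0/1 numeric variable and a split on a two-level factor partition the data in the same way.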

Variable 12 has three different values, but the first few are all 0 or 1. This might also be a Boolean variable with a data problem.

For everything else, it depends. If your variables are not interval scaled, but only ordinal, then treating them as continuous variables is, strictly speaking, wrong. (It often works well enough.) You should really use models and methods that are appropriate for ordinal independent variables.

However, if the variables are interval scaled and just happen to have only a few values, I would argue that you should typically not treat them as factors. First, treating them as factors implicitly bins them, and binning continuous variables is usually a bad idea. For instance, if your variable has $k$ different values, then treating it as a factor uses up $k-1$ degrees of freedom, where a linear term would use only one. Some of the resulting estimates might be very noisy if you have observed the corresponding value only rarely. Treating the variable as a factor also allows the response at each "factor level" to be whatever the noise in your data suggests: your response won't be constrained to be monotonic in the independent variable.
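The degrees-of-freedom point can be illustrated with a small sketch in plain Python. The data are invented for this example (an interval-scaled predictor `x` with five distinct values and a 0/1 outcome `y`), so the numbers mean nothing beyond the illustration. The sketch counts the parameters each coding would need and shows how few observations sit behind some of the per-level estimates.

```python
from collections import defaultdict

# Made-up data: x is an interval-scaled predictor with few distinct values,
# y is a 0/1 outcome. Neither comes from the question's data set.
x = [1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 5]
y = [0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1]

k = len(set(x))
params_as_factor = k - 1   # one indicator per level beyond the baseline
params_as_numeric = 1      # a single slope

# Per-level sample size and pass rate: what a factor coding has to estimate from.
groups = defaultdict(list)
for xi, yi in zip(x, y):
    groups[xi].append(yi)
level_summary = {lvl: (len(ys), sum(ys) / len(ys)) for lvl, ys in sorted(groups.items())}

print(params_as_factor, params_as_numeric)
for lvl, (n, rate) in level_summary.items():
    print(lvl, n, round(rate, 2))
```

In this toy data, levels 4 and 5 each rest on a single observation, so whatever a model estimates for them is driven entirely by that one data point. A single slope instead pools information across all levels.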

Even worse, you lose information about unobserved intermediate values. If you have only ever observed the values 1, 5 and 9 and fit a model to these as factor data, then what do you do if you ever have to predict for a value of 7? If you treat the variable as continuous, this will not be a problem.
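A minimal sketch of the 1, 5, 9 example, again in plain Python with invented numbers. For simplicity it uses a numeric response and an ordinary least-squares line rather than a classifier; that is an assumption made for illustration, but the same issue arises for any model that stores one parameter per factor level.

```python
# Made-up training data in which only the predictor values 1, 5 and 9 occur.
x = [1, 1, 5, 5, 9, 9]
y = [2.0, 2.4, 6.1, 5.9, 10.2, 9.8]

# Factor-style model: one fitted mean per observed level.
level_means = {}
for level in set(x):
    ys = [yi for xi, yi in zip(x, y) if xi == level]
    level_means[level] = sum(ys) / len(ys)

# Continuous model: ordinary least-squares line y = a + b * x.
n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n
b = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sum(
    (xi - mean_x) ** 2 for xi in x
)
a = mean_y - b * mean_x

new_x = 7
factor_prediction = level_means.get(new_x)  # None: level 7 was never observed
linear_prediction = a + b * new_x           # interpolates between 5 and 9

print(factor_prediction, round(linear_prediction, 2))
```

The per-level model has nothing to say about 7, so you would need an ad hoc rule (nearest level, overall mean, and so on), while the continuous model interpolates without any special handling.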

Bottom line: it is better not to convert continuous data with few values into factors. Leave them as they are, possibly transforming them using splines.
