My data set has 13 variables. The dependent variable is Boolean, and all the independent variables are either integer or numeric (i.e., continuous).
'data.frame': 20000 obs. of 13 variables:
$ PassFlag : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1
$ var1 : int 6 14 4 14 9 17 6 5 13 7 ...
$ var2 : int 34 34 35 34 34 34 35 34 34 34 ...
$ var3 : int -1 0 6 0 0 -1 0 0 0 0 ...
$ var4 : num 1096.6 403.4 162754.8 20.1 59874.1 ...
$ var5 : num 0 0.03 0.02 0.04 0.8 0.02 0.3 0.1 0.36 0.05 ...
$ var6 : int 0 0 0 0 0 0 0 0 0 0 ...
$ var7 : int 0 0 0 0 0 0 0 0 0 0 ...
$ var8 : int 0 0 0 0 0 0 0 0 0 0 ...
$ var9 : num 0.368 0.368 0.368 0.368 6.457 ...
$ var10 : int 110 116 112 116 114 112 109 115 112 111 ...
$ var11 : int 6 6 6 6 6 6 5 5 6 6 ...
$ var12 : int 0 1 1 1 1 1 1 1 1 1 ...
Most of the variables have only a few unique values:
[1] "Unique values in PassFlag : 2"
[1] "Unique values in var1 : 23"
[1] "Unique values in var2 : 3"
[1] "Unique values in var3 : 9"
[1] "Unique values in var4 : 14"
[1] "Unique values in var5 : 68"
[1] "Unique values in var6 : 2"
[1] "Unique values in var7 : 2"
[1] "Unique values in var8 : 2"
[1] "Unique values in var9 : 3486"
[1] "Unique values in var10 : 13"
[1] "Unique values in var11 : 12"
[1] "Unique values in var12 : 3"
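For reference, counts like these can be produced along the following lines. This is a minimal sketch: the data frame `df` here is a small hypothetical stand-in (only the column names come from my data), and the exact code used for the output above may have differed.

```r
# Hypothetical toy data; only the column names are taken from the real data set.
df <- data.frame(
  PassFlag = factor(c(0, 0, 1, 0)),
  var2     = c(34L, 34L, 35L, 34L),
  var5     = c(0, 0.03, 0.02, 0.04)
)

# Count the distinct values in each column.
n_unique <- sapply(df, function(col) length(unique(col)))

# Print one line per column, in the same format as the output above.
for (name in names(n_unique)) {
  print(paste("Unique values in", name, ":", n_unique[[name]]))
}
```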
I am trying to predict whether a student will pass or not. I want to try logistic regression, decision trees, random forest, SVM and XGBoost.
Should I use the independent variables as they are, or convert them to factors, since some of them have very few unique values? Or should I one-hot encode them? Or does it depend on the model I choose, and if so, what kind of data should I use for each of the models mentioned?
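To make the two alternatives concrete, here is a minimal sketch of what I mean by "convert to factor" and "one-hot encode". The toy column is hypothetical (its values only mimic `var2`), and the name `var2_factor` is made up for illustration; base R's `factor()` and `model.matrix()` are just one way to do this.

```r
# Hypothetical toy column with three distinct values, mimicking var2.
df <- data.frame(var2 = c(34L, 34L, 35L, 34L, 36L))

# Option 1: treat the integer column as a categorical factor.
df$var2_factor <- factor(df$var2)

# Option 2: one-hot encode the factor.
# "- 1" drops the intercept so that every level gets its own 0/1 column.
one_hot <- model.matrix(~ var2_factor - 1, data = df)

print(levels(df$var2_factor))
print(colnames(one_hot))
```

The third option is simply to leave `var2` as an integer column, in which case the model treats it as an ordered numeric quantity.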