I am working on the "Diabetes in 130 US hospitals for years 1999-2008" dataset. After removing unnecessary variables (e.g. some IDs and near-zero-variance variables) and doing some naive imputation, I am left with:
'data.frame': 101766 obs. of 31 variables:
race : Factor w/ 5 levels "AfricanAmerican",..: 3 3 1 3 3 3 3 3 3 3 ...
gender : Factor w/ 3 levels "Female","Male",..: 1 1 1 2 2 2 2 2 1 1 ...
age : Factor w/ 10 levels "[0-10)","[10-20)",..: 1 2 3 4 5 6 7 8 9 10 ...
admission_type_id : Factor w/ 8 levels "1","2","3","4",..: 6 1 1 1 1 2 3 1 2 3 ...
discharge_disposition_id: Factor w/ 26 levels "1","2","3","4",..: 24 1 1 1 1 1 1 1 1 3 ...
admission_source_id : Factor w/ 17 levels "1","2","3","4",..: 1 7 7 7 7 2 2 7 4 4 ...
time_in_hospital : num 1 3 2 2 1 3 4 5 13 12 ...
payer_code : Factor w/ 17 levels "BC","CH","CM",..: 16 4 4 8 3 8 8 11 15 8 ...
medical_specialty : Factor w/ 72 levels "AllergyandImmunology",..: 38 19 12 63 19 19 63 19 19 19 ...
num_lab_procedures : num 41 59 11 44 51 31 70 73 68 33 ...
num_procedures : num 0 0 5 1 0 6 1 0 2 3 ...
num_medications : num 1 18 13 16 8 16 21 12 28 18 ...
number_outpatient : num 0 0 2 0 0 0 0 0 0 0 ...
number_emergency : num 0 0 0 0 0 0 0 0 0 0 ...
number_inpatient : num 0 0 1 0 0 0 0 0 0 0 ...
diag_1 : Factor w/ 716 levels "10","11","110",..: 125 144 455 555 55 264 264 277 253 283 ...
diag_2 : Factor w/ 748 levels "11","110","111",..: 80 80 79 98 25 247 247 315 261 47 ...
diag_3 : Factor w/ 789 levels "11","110","111",..: 247 122 767 249 87 87 771 87 230 318 ...
number_diagnoses : num 1 9 6 7 5 9 7 8 8 8 ...
A1Cresult : Factor w/ 4 levels ">7",">8","None",..: 3 3 3 3 3 3 3 3 3 3 ...
metformin : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 3 2 2 2 ...
glipizide : Factor w/ 4 levels "Down","No","Steady",..: 2 2 3 2 3 2 2 2 3 2 ...
glyburide : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 2 3 2 2 ...
pioglitazone : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 2 2 2 2 ...
rosiglitazone : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 2 2 2 3 ...
insulin : Factor w/ 4 levels "Down","No","Steady",..: 2 4 2 4 3 3 3 2 3 3 ...
change : Factor w/ 2 levels "Ch","No": 2 1 2 1 1 2 1 2 1 1 ...
diabetesMed : Factor w/ 2 levels "No","Yes": 1 2 2 2 2 2 2 2 2 2 ...
readmitted : Factor w/ 3 levels "<30",">30","NO": 3 2 3 3 3 2 3 2 3 3 ...
payer.code : Factor w/ 18 levels "1","10","11",..: 18 18 18 18 18 18 18 18 18 18 ...
medical.speciality : Factor w/ 73 levels "1","10","11",..: 32 73 73 73 73 73 73 73 73 11 ...
The problem is that I have no idea how to reduce the dimensionality here or how to handle the diag_1, diag_2 and diag_3 variables. There simply seem to be too many levels involved (716, 748 and 789 respectively). How should I go about this? Every method I have tried (multiple correspondence analysis, multiple factor analysis) struggles badly, and I do not see any of them finishing in a feasible amount of time.
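To give a rough sense of scale, here is a small sketch that counts how many indicator columns a plain dummy (treatment) coding of the predictor factors would produce. The level counts are copied by hand from the str() output above; the names n_levels, dummy_cols, total and diag_share are just illustrative, and the target readmitted is left out.

```r
# Level counts copied from the str() output above (predictor factors only).
n_levels <- c(
  race = 5, gender = 3, age = 10,
  admission_type_id = 8, discharge_disposition_id = 26,
  admission_source_id = 17, payer_code = 17, medical_specialty = 72,
  diag_1 = 716, diag_2 = 748, diag_3 = 789,
  A1Cresult = 4, metformin = 4, glipizide = 4, glyburide = 4,
  pioglitazone = 4, rosiglitazone = 4, insulin = 4,
  change = 2, diabetesMed = 2,
  payer.code = 18, medical.speciality = 73
)

# Treatment coding uses one column per level, minus one reference level.
dummy_cols <- n_levels - 1

# Total indicator columns, and the fraction contributed by the diag_* factors.
total <- sum(dummy_cols)
diag_share <- sum(dummy_cols[c("diag_1", "diag_2", "diag_3")]) / total

print(total)
print(round(diag_share, 3))
```

Under these assumptions the three diag_* factors account for the large majority of the indicator columns, which is why I suspect they are the main obstacle.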
The goal is to predict the readmitted variable from the others. Because of the issues above, I cannot even fit a simple decision tree.