I'm working on the Loan Prediction III contest on Analytics Vidhya.
The dataset is structured as follows:
Loan_Status: Loan approved (Y/N), target variable
Loan_ID: Unique Loan ID
Gender: Male/ Female
Married: Applicant married (Y/N)
Dependents: Number of dependents
Education: Applicant Education (Graduate/ Under Graduate)
Self_Employed: Self employed (Y/N)
ApplicantIncome: Applicant income
CoapplicantIncome: Coapplicant income
LoanAmount: Loan amount in thousands
Loan_Amount_Term: Term of loan in months
Credit_History: Credit history meets guidelines (0/1)
Property_Area: Urban/ Semi Urban/ Rural
All the variables intuitively seem correlated with the target variable.
I'm using XGBoost, which gives me 77.7% classification accuracy. If I use all the variables (except the ID, of course) without any kind of transformation, and I plot their importance in the model, I get this:
Feature Gain Cover Frequency
1: Credit_History.0 0.423935159 0.136344565 0.028294260
2: ApplicantIncome 0.176082874 0.207100146 0.306386419
3: LoanAmount 0.152516620 0.228038940 0.262732417
4: CoapplicantIncome 0.113407830 0.147036347 0.156022635
5: Loan_Amount_Term 0.025578214 0.060612626 0.042845594
6: Education.Graduate 0.019317142 0.026056739 0.025869038
7: Property_Area.Semiurban 0.016191576 0.037535354 0.016976556
8: Property_Area.Rural 0.012931300 0.028867237 0.027485853
9: Self_Employed.No 0.012382003 0.022994295 0.022635408
10: Married.No 0.012002906 0.059047750 0.019401778
11: Dependents.0 0.011809200 0.006748493 0.021018593
12: Dependents.1 0.009099146 0.016882861 0.025060631
13: Gender.Female 0.004831299 0.006484499 0.016168149
14: Property_Area.Urban 0.004399681 0.002157539 0.010509297
15: Dependents.2 0.003361149 0.009833036 0.011317704
16: Dependents.3+ 0.002153900 0.004259572 0.007275667
After 25 iterations I get this train error:
[25] train-error:0.048860
So it seems that Credit_History is the most important variable when it comes to predicting the outcome of the loan.
Next I tried using XGBoost again with just Credit_History, i.e. modeling only this covariate. The train error is of course much higher than before (I used only 1 round; more rounds just give the same outcome):
[1] train-error:0.190554
When I check my classification score on the site I get... 77.7%! So with all the variables I get the same score as with just one of them. This is bugging me a bit.
So the other covariates reduce the error on my training set, but when it comes to the score on the test set they do basically... nothing. Why does this happen? Could it be that XGBoost isn't the right model for this application?
I think that feature engineering on the other covariates could improve the model's accuracy. Do you have any suggestions for new variables derived from the existing ones, or transformations of the starting variables, that could help me?
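For example, these are some candidate derived variables I was thinking about (a sketch with made-up rows; only the column names come from the dataset):

```python
import numpy as np
import pandas as pd

# Toy frame using the dataset's column names (values are invented)
df = pd.DataFrame({
    "ApplicantIncome": [5000, 3000],
    "CoapplicantIncome": [1500, 0],
    "LoanAmount": [150, 90],          # in thousands
    "Loan_Amount_Term": [360, 180],   # in months
})

# Candidate engineered features
df["TotalIncome"] = df["ApplicantIncome"] + df["CoapplicantIncome"]
df["LogTotalIncome"] = np.log1p(df["TotalIncome"])            # tame the skew
df["LoanToIncome"] = df["LoanAmount"] * 1000 / df["TotalIncome"]
df["EMI"] = df["LoanAmount"] * 1000 / df["Loan_Amount_Term"]  # crude monthly payment

print(df[["TotalIncome", "LoanToIncome", "EMI"]])
```

The idea being that repayment ability probably depends on household income and the payment burden, not on the raw income columns separately. Does something along these lines seem reasonable?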