I'm working on the Loan Prediction III contest on Analytics Vidhya.
The dataset is structured as follows:
Loan_Status: Loan approved (Y/N), target variable
Loan_ID: Unique Loan ID
Gender: Male/ Female
Married: Applicant married (Y/N)
Dependents: Number of dependents
Education: Applicant Education (Graduate/ Under Graduate)
Self_Employed: Self employed (Y/N)
ApplicantIncome: Applicant income
CoapplicantIncome: Coapplicant income
LoanAmount: Loan amount in thousands
Loan_Amount_Term: Term of loan in months
Credit_History: Credit history meets guidelines (0/1)
Property_Area: Urban/ Semi Urban/ Rural
All the variables intuitively seem correlated with the target variable.
I'm using XGBoost, which gives me 77.7% classification accuracy. If I use all the variables (except the ID, of course) without any kind of transformation, and I plot their importance in the model, I get this:
Feature Gain Cover Frequency
1: Credit_History.0 0.423935159 0.136344565 0.028294260
2: ApplicantIncome 0.176082874 0.207100146 0.306386419
3: LoanAmount 0.152516620 0.228038940 0.262732417
4: CoapplicantIncome 0.113407830 0.147036347 0.156022635
5: Loan_Amount_Term 0.025578214 0.060612626 0.042845594
6: Education.Graduate 0.019317142 0.026056739 0.025869038
7: Property_Area.Semiurban 0.016191576 0.037535354 0.016976556
8: Property_Area.Rural 0.012931300 0.028867237 0.027485853
9: Self_Employed.No 0.012382003 0.022994295 0.022635408
10: Married.No 0.012002906 0.059047750 0.019401778
11: Dependents.0 0.011809200 0.006748493 0.021018593
12: Dependents.1 0.009099146 0.016882861 0.025060631
13: Gender.Female 0.004831299 0.006484499 0.016168149
14: Property_Area.Urban 0.004399681 0.002157539 0.010509297
15: Dependents.2 0.003361149 0.009833036 0.011317704
16: Dependents.3+ 0.002153900 0.004259572 0.007275667
After 25 iterations I get this train error:
[25] train-error:0.048860
So it seems that Credit_History is the most important variable when it comes to predicting the outcome of the loan.
Next I tried using XGBoost again with just Credit_History, i.e. modeling only this covariate. The train error is of course much higher than before (I used only 1 round; more rounds just give the same outcome):
[1] train-error:0.190554
When I check my classification score on the site I get... 77.7%! So with all the variables I get the same score as with just one of them. This is bugging me a bit.
So the other covariates reduce the error on my training set, but when it comes to the score on the test set they do basically... nothing. Why does this happen? Could it be that XGBoost isn't the right model for this application?
I think that feature engineering on the other covariates could improve the model's accuracy. Do you have any suggestions for new variables derived from the existing ones, or transformations of the starting variables, that could help me?
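For example, these are some candidate derived variables I was thinking about (a sketch with made-up rows; only the column names come from the dataset):

```python
import numpy as np
import pandas as pd

# Toy frame using the dataset's column names (values are invented)
df = pd.DataFrame({
    "ApplicantIncome": [5000, 3000],
    "CoapplicantIncome": [1500, 0],
    "LoanAmount": [150, 90],          # in thousands
    "Loan_Amount_Term": [360, 180],   # in months
})

# Candidate engineered features
df["TotalIncome"] = df["ApplicantIncome"] + df["CoapplicantIncome"]
df["LogTotalIncome"] = np.log1p(df["TotalIncome"])            # tame the skew
df["LoanToIncome"] = df["LoanAmount"] * 1000 / df["TotalIncome"]
df["EMI"] = df["LoanAmount"] * 1000 / df["Loan_Amount_Term"]  # crude monthly payment

print(df[["TotalIncome", "LoanToIncome", "EMI"]])
```

The idea being that repayment ability probably depends on household income and the payment burden, not on the raw income columns separately. Does something along these lines seem reasonable?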