
So I am currently trying to analyse, in R, the impact that 3 categorical variables have on a continuous/quantitative dependent variable.

Basically, I want to analyse the impact that credit scores have on the remaining balance a customer has when they default and stop paying their loan.

Could someone tell me what kind of regression or test I could do to obtain explanatory results? My linear regression only gives me an $R^2$ of $0.27$ with about 8700 observations in the initial data set.
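For concreteness, here is a minimal sketch of what such a model could look like in R; the data frame `defaults` and its column names are hypothetical, not from the post:

    # Hypothetical data frame `defaults` with the continuous outcome
    # `remaining_balance` and three categorical predictors (invented names).
    defaults$credit_score <- factor(defaults$credit_score)  # treat the 1-20 score as categories
    defaults$loan_type    <- factor(defaults$loan_type)
    defaults$region       <- factor(defaults$region)

    fit <- lm(remaining_balance ~ credit_score + loan_type + region, data = defaults)
    summary(fit)  # coefficient estimates, p-values, and the R-squared (about 0.27 in the post)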

Vee
  • You can have perfect coefficient estimates to do inference with and a horrible $R^2$, and vice versa, so definitely don't treat $R^2$ as the end-all-be-all metric for judging your regression. – Tylerr Aug 11 '21 at 13:55
  • Why do you think that $R^2 = 0.27$ is so awful? I would expect a ton of unexplainable variability in the data (or explained only by complex interactions of variables that are not among the three you are considering). – Dave Aug 11 '21 at 13:58
  • You're treating credit scores as categorical data? – Acccumulation Aug 11 '21 at 22:41
  • Yes. Credit scores range from 1 to 20. If you're assigned a 1, you're risky for the bank and they will not increase your credit limit. If you get a 20, your credit limit could be increased. The scores come from an automated process at the bank (in my case). – Vee Aug 12 '21 at 06:12
  • Those would be ordinal data, which are kind of a hybrid of numerical and categorical data. // I thought the highest credit score was $850$. – Dave Aug 12 '21 at 09:38
  • I'm doing an internship in a bank for the summer and they operate under a different system, I believe. I'm not sure if you are speaking of US credit scores; I'm based in Luxembourg. I don't think there's a common credit system that anyone can access; this one is strictly internal to the bank. – Vee Aug 12 '21 at 13:26
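A minimal sketch of the ordinal-coding point raised in the comments above, again using the hypothetical `defaults` data frame: storing the internal 1-20 score as an ordered factor lets R's modelling functions use polynomial contrasts rather than 19 unrelated dummy variables.

    # Hypothetical column names; the 1-20 score stored as an ordered factor.
    defaults$credit_score <- factor(defaults$credit_score, levels = 1:20, ordered = TRUE)

    fit_ord <- lm(remaining_balance ~ credit_score, data = defaults)
    summary(fit_ord)  # coefficients labelled .L, .Q, .C, ... are the polynomial contrasts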

2 Answers


You can use ANOVA, as jzTUD indicates, in which case you may need a post-hoc test such as Duncan's or Bonferroni to determine which group differences are significant. Alternatively, you can use multiple regression with indicator variables; although it requires a little more analysis after you get the results, it will tell you immediately which independent variables are relevant to your model. See Ways of comparing linear regression intercepts and slopes?.
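A hedged sketch of both routes, using the same hypothetical `defaults` data frame and column names as above. Tukey's HSD is shown simply because it ships with base R; a Bonferroni adjustment can be applied via `pairwise.t.test()`, while Duncan's test would need an add-on package.

    # Route 1: ANOVA with omnibus F-tests, then post-hoc comparisons.
    aov_fit <- aov(remaining_balance ~ credit_score + loan_type + region, data = defaults)
    summary(aov_fit)                          # one F-test per factor
    TukeyHSD(aov_fit, which = "loan_type")    # pairwise comparisons within one factor
    pairwise.t.test(defaults$remaining_balance, defaults$loan_type,
                    p.adjust.method = "bonferroni")

    # Route 2: multiple regression; R expands each factor into indicator variables.
    lm_fit <- lm(remaining_balance ~ credit_score + loan_type + region, data = defaults)
    summary(lm_fit)                           # a coefficient and p-value per indicator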

LDBerriz

Sounds like an ANOVA would be appropriate, with three categorical predictors and one continuous outcome. It can easily be done in any common statistical software such as R, SPSS, or Matlab. If a main effect or an interaction reaches significance, you can run a post-hoc test using only the variables of that main effect or interaction to see which way the effect goes.

On another note: please make up your mind about what kind of test you want to run before you analyse your data. What you are doing now is p-hacking, looking for the test that will give you the most impressive results. Test choice should always be theory driven.
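A sketch of the full-factorial version with interactions, under the same hypothetical column names (adjust the formula to your actual data):

    # Three-way ANOVA: main effects plus all interaction terms.
    aov_full <- aov(remaining_balance ~ credit_score * loan_type * region, data = defaults)
    summary(aov_full)

    # If, say, the loan_type main effect is significant, follow up on that factor:
    TukeyHSD(aov_full, which = "loan_type")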

jzTUD