I've been working on a data set with a binary outcome. I fit a logistic regression of the outcome on several covariates, all of which are categorical.
I tried to assess the goodness of fit of the logistic model using the Pearson chi-square and deviance statistics, but both indicate lack of fit (p-values < 0.0001). I also ran the Hosmer-Lemeshow (HL) test, and the result was similar, with p-value < 0.0001. To improve the fit, I then added interactions between the main effects. Even after adding all possible interactions, the fit didn't improve much and the p-values are still < 0.0001.
Deviance and Pearson Goodness-of-Fit Statistics
Criterion    Value         DF      Value/DF    Pr > ChiSq
Deviance     29693.7476    4384    6.7732      <.0001
Pearson      31175.7340    4384    7.1113      <.0001
Number of unique profiles: 4409
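As a quick sanity check on the output above, the Value/DF column can be reproduced by hand. The sketch below also backs out the number of estimated parameters, assuming (my assumption, not something stated in the output) that the degrees of freedom equal the number of unique profiles minus the number of parameters for a binary response.

```python
# Values copied from the goodness-of-fit output above.
deviance = 29693.7476
pearson = 31175.7340
df = 4384
n_profiles = 4409

# Value/DF: a ratio well above 1 is what the small p-values reflect.
deviance_ratio = deviance / df
pearson_ratio = pearson / df

# Assumption: DF = unique profiles - estimated parameters (binary response).
n_params = n_profiles - df

print(f"Deviance/DF = {deviance_ratio:.4f}")
print(f"Pearson/DF  = {pearson_ratio:.4f}")
print(f"Implied number of parameters = {n_params}")
```

Both ratios are around 7, so the statistics are roughly seven times their degrees of freedom.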
I recall that a model can always be made to fit by making it more complex, but I'm wondering how I should proceed. Does it make sense to include polynomial terms of the categorical variables, or to make the model non-linear? (I'm reluctant to do either, as it would make the results hard to interpret.)
Here is a brief summary of the data I'm working with:
Number of Observations    495851
Number of Events          105069
I'm also wondering whether the sample size is simply so large that it's hard to find a relatively simple model that fits the data.
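To make the sample-size concern concrete, here is a rough sketch of how the expected Pearson statistic per profile would grow with the total sample size when the misfit itself stays fixed. Everything about the setup is hypothetical: equal-sized profiles, a model that predicts the overall event rate for every profile, and true probabilities that sit `delta` above or below that prediction. The function name `expected_pearson_ratio` and the value `delta = 0.02` are mine, chosen only for illustration.

```python
def expected_pearson_ratio(n_total, n_profiles, pi_fit, delta):
    """Expected Pearson X^2 per profile when half the profiles have true
    probability pi_fit + delta and half pi_fit - delta (hypothetical setup)."""
    n_g = n_total / n_profiles      # assumed equal profile sizes
    var_fit = pi_fit * (1 - pi_fit)
    total = 0.0
    for sign in (+1, -1):
        p_true = pi_fit + sign * delta
        # For y ~ Binomial(n, p): E[(y - n*pi)^2] = n*p*(1-p) + n^2*(p-pi)^2.
        # Dividing by n*pi*(1-pi) gives one profile's expected contribution.
        total += (p_true * (1 - p_true) + n_g * delta ** 2) / var_fit
    return total / 2


# Overall event rate from the data summary, used as the fitted probability.
event_rate = 105069 / 495851

for n_total in (50_000, 495_851, 5_000_000):
    ratio = expected_pearson_ratio(n_total, 4409, event_rate, delta=0.02)
    print(f"N = {n_total:>9,}: expected Pearson per profile ~ {ratio:.2f}")
```

With `delta = 0` the ratio is exactly 1, and for any fixed nonzero `delta` it increases linearly with the number of observations per profile. The ratio here divides by the number of profiles rather than the degrees of freedom, so it only approximates Value/DF when the number of parameters is small relative to the number of profiles.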
Thanks!