I have built a logistic regression model in R. The class that I want to predict is very unbalanced (99 vs 1).
My first finding is that this logistic model does a better job if I train it on a balanced (50 - 50) training set instead of on the whole, unbalanced (99 - 1) training set. Judging from sources on the internet, this is a common way to deal with unbalanced data (source).
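For reference, the balancing step can be sketched roughly as below. This is only a minimal illustration in base R, not my exact code: the toy data, the predictor `x`, and the names `train_set` and `balanced_train` are placeholders, and it assumes the target is coded 0/1 with 1 as the rare class.

```r
# Toy data standing in for the real training set (hypothetical values)
set.seed(42)
n <- 10000
train_set <- data.frame(x = rnorm(n))
train_set$target <- rbinom(n, 1, plogis(-5 + 1.5 * train_set$x))

# Keep every minority row and an equally sized random sample of majority rows
minority_idx <- which(train_set$target == 1)
majority_idx <- which(train_set$target == 0)
sampled_majority <- sample(majority_idx, length(minority_idx))
balanced_train <- train_set[c(minority_idx, sampled_majority), ]

# Class counts after downsampling (should be equal)
table(balanced_train$target)

# Fit the logistic regression on the balanced data
mylogit <- glm(target ~ x, data = balanced_train, family = binomial)
```

Only the training data is downsampled here; the test set keeps its original class ratio.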
But I am unsure about the following two questions:
First question: To judge my model's performance, I used a confusion matrix. I played around with the threshold (the cutoff above which a prediction is classified as 1 rather than 0). See this code:
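library(caret)  # assuming confusionMatrix() comes from the caret package
predictions <- predict(mylogit, test_set, type = "response")
# confusionMatrix() expects factors with identical levels (target assumed coded 0/1)
predicted_class <- factor(as.numeric(predictions > 0.5), levels = c(0, 1))
confusionMatrix(data = predicted_class, reference = factor(test_set$target, levels = c(0, 1)))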
# Here I played around with the 0.5 cutoff, to decide when
# my model is performing best on my test set.
So I tweaked the 0.5 cutoff until I had the best confusion matrix on my test set. Is this valid?
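To make the procedure concrete, the tweaking amounts to something like the sweep below. It is a rough sketch rather than my actual code: the toy `truth` and `predictions` vectors are made up, and balanced accuracy is just one assumed choice of "score" (any other summary of the confusion matrix could be substituted).

```r
# Toy stand-ins for the true labels and predicted probabilities (hypothetical data)
set.seed(1)
truth <- rbinom(2000, 1, 0.05)
predictions <- plogis(rnorm(2000, mean = ifelse(truth == 1, 1, -1)))

# Balanced accuracy at a given cutoff: mean of sensitivity and specificity
balanced_accuracy <- function(cutoff, prob, truth) {
  pred <- as.numeric(prob > cutoff)
  sensitivity <- sum(pred == 1 & truth == 1) / sum(truth == 1)
  specificity <- sum(pred == 0 & truth == 0) / sum(truth == 0)
  (sensitivity + specificity) / 2
}

# Evaluate a grid of cutoffs and keep the one with the highest score
cutoffs <- seq(0.05, 0.95, by = 0.05)
scores <- sapply(cutoffs, balanced_accuracy, prob = predictions, truth = truth)
best_cutoff <- cutoffs[which.max(scores)]
best_cutoff
```

In my case the labels and probabilities fed into this loop came from the test set, which is exactly the part I am unsure about.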
Second question: If my results are bad in terms of prediction (judged by the confusion matrix), can I still use the model to see the influence of the factors (coefficients) on the target? That is, only for describing the data and not for predicting? And why?