I have trouble interpreting logistic regressions, and reading through several materials has only confused me more. Perhaps an example, along with my line of reasoning, will make clear what exactly I'm getting wrong.
Say we have a data set of sales for some generic product; let's make it cars for the sake of example (the data is completely made up, so the results may well lack any logic). We have a sample of car owners, some of whom own a particular car while the rest own some other car. The goal is to estimate the probability of selling this particular car to a potential client with a given set of parameters or, put another way, to segment the market artificially and rank the segments by sales probability. For example, we'll define 2 parameters: age of client and age of car. And let's make it all categorical: 3 client age groups and 3 car age groups.
car: 1 0
client_age: [under 30] [31 - 40] [over 41]
car_age: [under 2] [2 - 5] [over 5]
So the regression would look like car ~ client_age + car_age
And the output:
Intercept: -0.8
client_age[31 - 40]: 0.5
client_age[over 41]: -0.6
car_age[2 - 5]: 0.2
car_age[over 5]: -0.9
All coefficients are significant at the 95% confidence level.
So, now my line of reasoning. The overall probability of buying the particular car would simply be the ratio of car == 1 cases to the sample size. Let's make it 3%. As I understand it, logistic regression with categorical variables shows the change in odds (the exponent of the coefficient) relative to the baseline category. E.g. client_age[31 - 40] would have an odds ratio of exp(0.5), or 1.649, i.e. about 65% higher odds of owning this particular car than client_age[under 30], with everything else held constant. Similarly, client_age[over 41] would have exp(-0.6), or 0.549, i.e. about 45% lower odds of owning this car than client_age[under 30]. The same applies to car age.
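To make the arithmetic explicit, here is a minimal Python sketch of the exponentiation step. It assumes the coefficients are on the log-odds scale, which is the usual convention for logistic regression output; the dictionary and function names are only for illustration.

```python
import math

# Coefficients copied from the made-up output above (assumed to be log-odds).
coefs = {
    "client_age[31 - 40]": 0.5,
    "client_age[over 41]": -0.6,
    "car_age[2 - 5]": 0.2,
    "car_age[over 5]": -0.9,
}


def odds_ratio(beta):
    """Odds ratio of a level relative to its baseline category."""
    return math.exp(beta)


odds_ratios = {name: odds_ratio(beta) for name, beta in coefs.items()}

for name, ratio in odds_ratios.items():
    # Percentage change in the odds (not in the probability) versus baseline.
    change = (ratio - 1.0) * 100.0
    print(f"{name}: odds ratio {ratio:.3f} ({change:+.0f}% odds vs. baseline)")
```

Note that these percentages describe changes in odds, which is not the same as a change in probability.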
So, my list of questions:
- What is the role/interpretation of the intercept?
- Would it be possible to get all combinations of the categorical values and rank them by probability relative to the average 3% probability of buying the car?
- Is there a point in using logistic regression in this particular example? By that I mean: why estimate a logistic regression with only categorical variables when I can simply calculate the probability for every particular subset of the data set? How would the results compare?
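To make the second question concrete, here is a rough Python sketch of the ranking I have in mind. It assumes that the intercept is the log-odds for the baseline segment (client under 30, car under 2), that the coefficients simply add on the log-odds scale, and that the inverse logit turns the sum into a probability. I am not sure this is the right reading, and the names are purely illustrative.

```python
import math
from itertools import product

# Made-up output from above; baseline levels get a coefficient of 0.
intercept = -0.8
client_age = {"under 30": 0.0, "31 - 40": 0.5, "over 41": -0.6}
car_age = {"under 2": 0.0, "2 - 5": 0.2, "over 5": -0.9}


def inv_logit(x):
    """Convert log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def segment_probabilities():
    """Return (client_age, car_age, probability), highest probability first."""
    rows = []
    for (c_level, c_beta), (k_level, k_beta) in product(
        client_age.items(), car_age.items()
    ):
        log_odds = intercept + c_beta + k_beta
        rows.append((c_level, k_level, inv_logit(log_odds)))
    return sorted(rows, key=lambda row: row[2], reverse=True)


ranked = segment_probabilities()
for c_level, k_level, prob in ranked:
    print(f"client {c_level:>8} | car {k_level:>7} | p = {prob:.3f}")
```

The third question would then amount to comparing these model-based probabilities with the raw share of car == 1 inside each of the 9 cells.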