I'm trying to fit a GLM with a categorical variable that has 400 categories, and I'd like to reduce the number of categories. A few categories have a lot of data, and many categories have very little. The response variable is losses.
What would be the best way to predict losses for categories that don't have much data? I'm thinking of either grouping the smaller categories into one larger group or removing the observations in the smaller categories altogether, but I'm afraid either choice will affect the accuracy of the estimates for the larger categories. Ideally I'd like an estimate for each of the 400 categories.
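To make the grouping idea concrete, here is a minimal sketch of what I mean by lumping small categories together. It is written in Python for illustration only; the function name `lump_rare_levels`, the threshold `min_count`, and the toy labels are all made up, and the cutoff is arbitrary rather than a recommendation.

```python
from collections import Counter

def lump_rare_levels(levels, min_count, other_label="Other"):
    """Replace every level seen fewer than min_count times with other_label."""
    counts = Counter(levels)
    return [lvl if counts[lvl] >= min_count else other_label for lvl in levels]

# Toy data: two frequent categories and two rare ones (made-up labels).
levels = ["A"] * 6 + ["B"] * 5 + ["C"] * 2 + ["D"]
lumped = lump_rare_levels(levels, min_count=5)

# Rare levels "C" and "D" now share a single "Other" level.
print(Counter(lumped))
```

The drawback is visible here: once "C" and "D" are merged, they can no longer get separate estimates.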
For example, say the model I'm trying to fit is Losses ~ Model + Age, where Model is the model of the car and Age is the age of the car. Model has 400 categories, and the data set has 400,000 observations.
50,000 of those observations have Camry as the car model, 50,000 have Corolla, and 10,000 have Volkswagen. Then there are 50 categories with around 5,000 observations, and 297 categories with the rest of the observations (140,000). Suppose one of these is Lamborghini, with 10 observations.
I fit a GLM and get the following coefficients:
Call: glm(formula = losses ~ model + age, family = Gamma(link = "log"), data = cars)
Coefficients:
(Intercept)          2.280700
model:Camry          0.009783
model:Corolla        0.01
....
model:Lamborghini    0.409
age                  0.37
My goal is to estimate how much each of these car models affects losses, i.e. I'm interested in coefficients like 'model:Camry 0.009783' above.
I know it's possible that some rarer models (like the Lamborghini) are significantly more expensive, although the fit will probably report much higher standard errors for them.
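Since the fit uses a log link, my understanding is that each coefficient acts multiplicatively on expected losses: exp(coefficient) is the factor relative to the baseline car model, holding age fixed. A small sketch of that conversion, in Python for illustration, using only the coefficients quoted above (the dictionary name `coefs` is just for this example):

```python
import math

# Coefficients copied from the output above (on the log scale).
coefs = {
    "model:Camry": 0.009783,
    "model:Corolla": 0.01,
    "model:Lamborghini": 0.409,
}

# With a log link, exp(coefficient) is the multiplicative effect on expected
# losses relative to the baseline car model, holding age fixed.
multipliers = {name: math.exp(beta) for name, beta in coefs.items()}

for name, mult in multipliers.items():
    print(f"{name}: x{mult:.4f} ({(mult - 1) * 100:+.1f}% vs. baseline)")
```

So the Lamborghini coefficient would correspond to roughly 50% higher expected losses than the baseline model, while the Camry and Corolla coefficients correspond to about 1%.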
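My rough intuition for how different those standard errors would be is sketched here, in Python for illustration. It assumes a simplified one-factor Gamma model with a log link, where the variance of a level's estimated log-mean is approximately dispersion / n. That ignores age and the baseline level, and the dispersion value of 1.0 is a placeholder, not something I estimated.

```python
import math

def approx_se(n_obs, dispersion=1.0):
    """Rough standard error of a level's log-mean: sqrt(dispersion / n_obs)."""
    return math.sqrt(dispersion / n_obs)

# Observation counts taken from the example above.
se_camry = approx_se(50_000)
se_lambo = approx_se(10)

# The dispersion cancels in the ratio, so only the sample sizes matter here.
ratio = se_lambo / se_camry
print(f"Lamborghini SE is about {ratio:.1f}x the Camry SE")
```

Under these assumptions the ratio is sqrt(50,000 / 10), about 71, so the Lamborghini estimate would be far noisier than the Camry one.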
How do I:
1) Ensure that the coefficients for car models with a high number of observations (like Camry, Corolla) are as accurate as possible? (This is more of a priority.)
2) Estimate the coefficients of the rarer car models (like Lamborghini)?