One of the predictors in my logit model is "City". The problem is that this categorical variable has too many factor levels: in a sample of $\sim 3000$ there are already $\sim 200$ different cities.
Is it fair to still retain City as a predictor, or should I purge it entirely from the model? An alternative is to retain, say, the five most common cities and code all the rest as a new factor level "Others". The top city occurs $\sim 60$ times, but by the fifth most common city the count drops to $\sim 30$. Some cities occur only once or twice in the dataset.
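To make the lumping idea concrete, here is a minimal base R sketch of what I have in mind. The `city` vector is made-up toy data, and `lump_top_n` is a hypothetical helper name of my own, not a function from any package.

```r
# Toy data (made up for illustration): city names with uneven frequencies
city <- c(rep("A", 6), rep("B", 5), rep("C", 4), "D", "E", "E", "F")

# Hypothetical helper: keep the n most frequent levels, lump the rest
lump_top_n <- function(x, n = 5, other = "Others") {
  x <- as.character(x)
  # Frequency of each level, most common first
  counts <- sort(table(x), decreasing = TRUE)
  keep <- names(counts)[seq_len(min(n, length(counts)))]
  # Anything not in `keep` is recoded to the catch-all level
  factor(ifelse(x %in% keep, x, other), levels = c(keep, other))
}

city_lumped <- lump_top_n(city, n = 3)
table(city_lumped)
```

Whether $n = 5$ is a sensible cutoff for my data is exactly what I am unsure about; the sketch only shows the mechanics.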
PS. One problem I face (if I retain all factor levels) is that certain levels occur in the validation set but not in the training set. Then the model complains at validation time.
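One workaround I can think of is to choose the retained levels from the training set only and then recode both sets against that fixed list, so that any city unseen in training falls into "Others". A rough sketch, using toy vectors and a hypothetical `recode_city` helper (both are my own illustration, not from a package):

```r
# Toy data (made up): "Z" appears only in the validation set
train_city <- c("A", "A", "A", "B", "B", "C")
valid_city <- c("A", "B", "Z", "C")

# Levels to keep are chosen from the training data only
counts <- sort(table(train_city), decreasing = TRUE)
keep <- names(counts)[1:2]

# Hypothetical helper: anything outside `keep` becomes "Others"
recode_city <- function(x, keep, other = "Others") {
  x <- as.character(x)
  factor(ifelse(x %in% keep, x, other), levels = c(keep, other))
}

train_f <- recode_city(train_city, keep)
valid_f <- recode_city(valid_city, keep)

# Both factors now share the same set of levels
identical(levels(train_f), levels(valid_f))
```

This assumes "Others" actually occurs in the training set; if it does not, the model has no estimate for that level and the problem just moves elsewhere.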
PPS. On further reading, I found suggestions to use combine.levels() from the Hmisc package in R. Maybe that will work, though I am not sure exactly how yet.
Is there an elegant way to deal with these issues?