8

I'm building a logistic regression model, and one of the variables I have is postcode. I might be overthinking this, but is it fine for me to leave postcode as is and regress it as:

fitlogit <- glm(target ~ postcode + ..., family="binomial", data = dat)

I was thinking it might be better to combine postcodes into 4 distinct areas and model them as dummy variables once I figure out rough ranges for the groups.

kjetil b halvorsen
  • Your example code doesn't show logistic regression? I hope `postcode` is a `factor` variable? – Roland Dec 10 '14 at 08:20
  • Are postcodes meaningful in your data? In the US, they are just delivery routes that the post office finds logistically convenient. They do not correspond to geographic areas and are often rather demographically heterogenous. – dimitriy Dec 10 '14 at 08:31
  • In Australia they are geographical boundaries, a good measure of location. And yes, I did use it as a factor; I forgot to write it out as a logistic model. I've edited the post. – Grant McKinnon Dec 11 '14 at 01:18
  • See also https://stats.stackexchange.com/questions/146907/principled-way-of-collapsing-categorical-variables-with-many-levels – kjetil b halvorsen Jan 16 '20 at 06:08

1 Answer

7

There are a couple of approaches you can take here, from least to most laborious:

  • If there are few enough post codes in your study, it's probably fine to leave it as is. A general rule of thumb is that there should be ten events per variable, where each postal code becomes its own dummy variable on top of the other predictors you're considering. If each postal code has more than ten residents and you don't have too many other features, this is probably fine, though you should of course check for overfitting.

  • If you don't have enough events to make all postcodes into dummy variables, you could also dummify only the most-often-sampled post codes (for instance, every post code with more than 50 residents).

  • You could take the solution that you mentioned and cluster postcodes in some way. Of course there are multiple ways to cluster them--geographically is the most obvious, but income, demographics, or population density might be more informative.

  • If you have ideas about what attributes of the postal code might mediate the relationship between it and your dependent variable, you could use a multilevel model. This requires the most additional effort since it requires you to fit a more complex model and acquire post-code data, but it is potentially the most rewarding since it allows you to understand the postcode relationship better.
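As a minimal sketch of the second bullet above — collapsing rarely sampled post codes into a single "Other" level before fitting — assuming a data frame `dat` with `postcode` and `target` columns as in the question (the simulated data and the 50-row cutoff here are purely illustrative):

```r
set.seed(1)
# Illustrative data: a few common post codes and many rare ones
dat <- data.frame(
  postcode = factor(sample(sprintf("%04d", 2000:2019), 500, replace = TRUE,
                           prob = c(10, 10, 10, rep(1, 17)))),
  target   = rbinom(500, 1, 0.5)
)

# Collapse post codes with 50 or fewer rows into a single "Other" level
counts <- table(dat$postcode)
rare   <- names(counts)[counts <= 50]
dat$postcode_grp <- factor(ifelse(dat$postcode %in% rare, "Other",
                                  as.character(dat$postcode)))

fit <- glm(target ~ postcode_grp, family = "binomial", data = dat)
length(coef(fit))  # one intercept plus one dummy per retained level
```

The same collapsing step works as a pre-processing stage before any of the other approaches, and the cutoff can be tuned against the ten-events-per-variable rule of thumb.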

Finally, this is probably obvious, but make sure the postcode is encoded as a categorical rather than numerical variable--this one's bitten me too many times!
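To see why that check matters, here is a small sketch with simulated data (column names are hypothetical): a numeric postcode is fitted as a single meaningless slope, while a factor gets one dummy per level.

```r
set.seed(1)
dat <- data.frame(
  postcode = sample(c(2000, 3000, 4000), 200, replace = TRUE),
  target   = rbinom(200, 1, 0.5)
)

fit_num <- glm(target ~ postcode,         family = "binomial", data = dat)
fit_fac <- glm(target ~ factor(postcode), family = "binomial", data = dat)

length(coef(fit_num))  # 2: intercept + one "slope" over the raw code number
length(coef(fit_fac))  # 3: intercept + one dummy per extra level
```

If `summary(fitlogit)` shows a single coefficient named `postcode` rather than one per level, the variable was read in as numeric.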

Ben Kuhn
  • I actually have 697 distinct postcodes, so I'm thinking clustering them into small groups may be better. I like the idea of grouping by income/density as well; it could be more informative, as you say. – Grant McKinnon Dec 10 '14 at 03:44
  • In the last point on fitting a multilevel model, I don't get why this specifically would require ideas about what attributes might mediate. – Gijs Jan 16 '20 at 06:27
  • In addition, hierarchies in postal codes are possibly quite easy to come by. This could be as simple as `glmer(target ~ (1 | pc3/pc4) + ..., family="binomial")`, where `pc4` is the full four-character code and `pc3` its three-character prefix, corresponding to one hierarchy. – Gijs Jan 16 '20 at 06:32
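For what it's worth, a runnable sketch of that nested random-intercept idea, assuming the `lme4` package and an Australian-style four-digit code stored as character (the simulated data are purely illustrative):

```r
library(lme4)

set.seed(1)
dat <- data.frame(
  pc4    = sample(sprintf("%04d", 2000:2039), 800, replace = TRUE),
  target = rbinom(800, 1, 0.5)
)
dat$pc3 <- substr(dat$pc4, 1, 3)  # coarser area: first three digits

# Random intercepts for the coarse area, and for the full post code nested
# within it; fixed-effect predictors would be added to the formula as usual
fit_ml <- glmer(target ~ (1 | pc3/pc4), family = binomial, data = dat)
```

Because the intercepts are partially pooled, sparsely sampled post codes borrow strength from their three-digit neighbours instead of getting their own noisy dummy coefficient.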