Imagine I have a dataset of people in which I can find the city and country each person lives in. The data is such that, given the city, there is only one possible country. For example, given Madrid as the city, the country the person lives in can only be Spain, and if I say London, the person can only live in England.

So one can say that the information of which country the person lives in is already contained in the city variable.

Given this situation, is it any better to fit a generalized linear model using both country and city variables instead of only city? Does this change if the model is non-linear? Does it depend on the specific kind of model I use (regression, SVM, trees...)?

kjetil b halvorsen
dukebody
  • What is your dependent variable? Adding those grouping variables can make a big difference in some cases, and linear mixed models are generally designed for such data; see questions tagged [tag:mixed-model]. – Tim May 21 '15 at 09:26
  • The dependent variable could be, for example, if the person buys a product or not (binary), or the price of the product bought (continuous, zero if didn't buy anything). – dukebody May 21 '15 at 10:23
  • "price of the product bought (continuous, zero if didn't buy anything)" is a troublesome way of coding a variable... – Tim May 21 '15 at 10:25
  • Ok, then think only about the previous binary case, buy or not buy. – dukebody May 21 '15 at 10:26
  • Even then, GLMMs seem to be a kind of model that may fit here (e.g. http://www.ats.ucla.edu/stat/mult_pkg/glmm.htm) – Tim May 21 '15 at 10:28

1 Answer

If your number of cities is really large, you should read "Principled way of collapsing categorical variables with many levels?" (and the posts linked there!). Apart from that, you have what we can call a hierarchical (nested) variable, and how you code it depends on what interpretations you want. If you are interested in country effects, you can make a Country variable and then a City variable. These variables of course cannot be crossed, since they have a hierarchical structure. So in effect you are modeling first a country effect, and then a city effect as a deviation from the effect of that city's country. Or you can use only a City variable, in which case you model the city effect directly. These two ways of modeling will give the same fitted values, because each country indicator is just the sum of the indicators of its cities, so adding Country does not enlarge the set of predictions the model can produce; the only difference lies in parameter interpretation. These comments do not really depend on which class of model you are fitting, at least within the class of regression-like models.
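
To make the "same fitted values" claim concrete, here is a minimal sketch using ordinary least squares as the simplest regression-like model. The toy data, the city-to-country mapping, and the helper names `one_hot` and `fitted` are all made up for illustration and are not from the question. It assumes an unpenalized fit; with a penalty such as ridge or lasso the two codings would generally not agree.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: every city belongs to exactly one country.
city_to_country = {"Madrid": "Spain", "Barcelona": "Spain",
                   "London": "England", "Leeds": "England"}
cities = np.array(list(city_to_country))
city = rng.choice(cities, size=200)
country = np.array([city_to_country[c] for c in city])

# Made-up response with both a city-level and a country-level shift.
y = rng.normal(size=200) + 2.0 * (city == "Madrid") + 1.0 * (country == "England")

def one_hot(labels):
    """Return an (n, k) indicator matrix with one column per distinct label."""
    levels = np.unique(labels)
    return (labels[:, None] == levels[None, :]).astype(float)

def fitted(X, y):
    """Least-squares fitted values; lstsq copes with rank-deficient X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return X @ beta

X_city = one_hot(city)                                  # City only
X_both = np.hstack([one_hot(country), one_hot(city)])   # Country + City

# Country columns are sums of city columns, so the rank does not increase.
same_rank = np.linalg.matrix_rank(X_city) == np.linalg.matrix_rank(X_both)

# Both designs span the same column space, so the projections of y coincide.
max_diff = np.max(np.abs(fitted(X_city, y) - fitted(X_both, y)))
print(same_rank, max_diff < 1e-8)
```

The individual coefficients differ between the two designs (and are not even unique in the second one without a constraint or reference level), which is the "parameter interpretation" difference mentioned above.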

Alternatively, you could use a random (or mixed) effects model. Which is appropriate depends on the question you are asking of the data; see "Should "City" be a fixed or a random effect variable?".

kjetil b halvorsen