Using a multi-level model when some variables are clustered but the majority are not, and there are plenty of clusters?

Question

I use an example to illustrate my question.

I have a model that explains the choice of low-fat vs. full-fat milk that was actually bought in a store. We model it with a binary logistic regression.

The model parameters mostly stem from a questionnaire that a lot of low-fat and high-fat milk customers filled out. However, we also used their ZIP codes to see whether they live in a rural area and whether cows are kept in their ZIP code (those 2 variables have a correlation of .5).

For rural areas we use ZIP code density as a proxy and group accordingly. For the cows we use the number of cows per 100 inhabitants "Cowsper100".

We argue that the more rural the area, the more high-fat milk is bought, as processed food is less popular in rural areas, and that more cows per inhabitant also lead to more interest in high-fat milk. (This is a mock example, so yeah, I am not sure how convinced you are, but assume you were convinced.)

For simplicity of this question assume we only look at the following model:

logit(P(High Fat Milk Purchase = Yes)) = b0 + b1*RuralArea + b2*Cowsper100 + b3*SurveyCovariate

One of the reviewers encourages us to use a multi-level model. However, we are unsure, because we have very few people per ZIP code and many ZIP codes. Following the top answer to the question "OLS with clustered standard errors vs. multilevel modeling when the main interest is at the individual level", we might not need it, right?

In all areas you can purchase both high-fat and low-fat milk. (People who purchase both are counted in only one group, according to a rule that makes more sense in a non-milky context.)

What is the general rule: when do you need a multi-level model? Could anyone help me by pointing to the relevant literature?

Can you provide some more descriptive information about the number of people per ZIP code? What is the mean number of people, the standard deviation, and the range? How many ZIP codes have just 1 person, 2 persons, or 3+? — Erik Ruzek, Apr 28 '20 at 15:47
Dear Erik, I basically have 500 area codes, and only 70 of them have 17 or more people in them. I am not able to provide better statistics at the moment. — canIchangethis, Apr 29 '20 at 11:49
Can you give a rough estimate of how many of the 500 area codes have only 1 person in them? — Erik Ruzek, Apr 29 '20 at 12:51
A few, but to be honest I do not have the exact number (the data is on a server and our VPN is always overloaded, so I do not have much access to it at the moment). — canIchangethis, Apr 29 '20 at 13:41
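
The cluster-size summary requested in the comments (mean, standard deviation, range, and the number of ZIP codes with 1, 2, or 3+ respondents) could be computed along these lines. This is only a minimal sketch using the Python standard library; the function name `cluster_size_summary` and the toy ZIP codes are hypothetical and not taken from the actual data.

```python
from collections import Counter
from statistics import mean, stdev


def cluster_size_summary(zip_codes):
    """Summarise how many respondents fall into each ZIP code.

    `zip_codes` holds one entry per respondent (hypothetical input layout).
    """
    sizes = list(Counter(zip_codes).values())  # respondents per ZIP code
    size_counts = Counter(sizes)               # how many ZIP codes have each size
    return {
        "n_clusters": len(sizes),
        "mean": mean(sizes),
        "sd": stdev(sizes) if len(sizes) > 1 else 0.0,  # sample SD
        "min": min(sizes),
        "max": max(sizes),
        "singletons": size_counts[1],
        "pairs": size_counts[2],
        "three_plus": sum(n for size, n in size_counts.items() if size >= 3),
    }


# Made-up toy data: one ZIP code per respondent.
respondent_zips = ["10115", "10115", "10115", "20095", "20095", "80331"]
summary = cluster_size_summary(respondent_zips)
print(summary)
```

A large share of singleton ZIP codes would be the main thing to look for here, since those clusters carry no information about within-cluster variation.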

score 1 · Accepted Answer · answered Apr 29 '20 at 17:54

Since you appear to have many more area codes with >1 person than you do area codes with exactly 1 person, multilevel modeling (MLM) may indeed be appropriate. Your question about whether it should be used instead of OLS with standard errors adjusted for the nesting of individuals within area codes is an important one.

The main reason, in my opinion, to move toward an MLM would be if you wanted to explore variation in the effects of your predictor(s) on your outcome. MLMs, through random slopes, allow the association between a predictor and the outcome to vary from group to group (area code to area code in your data). You can then explore whether predictors measured at the area code level explain this variation through interactions between the varying predictor and area code variables. Note that there are additional assumptions entailed in allowing for such random slopes (and intercepts), and this can also be a turn-off for some researchers.
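
As a sketch of what this would look like for the example in the question, a two-level logistic model with a random intercept and a random slope on the survey covariate could be written as follows. The indices ($i$ for respondent, $j$ for area code), the random effects $u_{0j}$ and $u_{1j}$, and the choice of SurveyCovariate as the varying predictor are assumptions made for illustration only.

$$
\begin{aligned}
\operatorname{logit} P(y_{ij} = 1) &= b_0 + b_1\,\text{RuralArea}_j + b_2\,\text{Cowsper100}_j + (b_3 + u_{1j})\,\text{SurveyCovariate}_{ij} + u_{0j}, \\
(u_{0j}, u_{1j})^\top &\sim \mathcal{N}(\mathbf{0}, \Sigma).
\end{aligned}
$$

The area-code-level predictors stay in the fixed part of the model, so $b_1$ and $b_2$ are still estimated. Asking whether an area-code variable explains the slope variation would amount to adding a cross-level interaction term such as $b_4\,\text{Cowsper100}_j \times \text{SurveyCovariate}_{ij}$.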

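However, if such varying effects of predictors are not of interest to you, then a single-level regression (OLS in the linked question, logistic regression for your binary outcome) with a standard error adjustment for the clustering or nesting in your data will give you what you need: coefficient estimates with standard errors that account for the nesting.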
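
For reference, the usual cluster-robust ("sandwich") variance estimator for a single-level logistic regression has the form sketched below. The notation is an assumption for illustration: $X_j$ and $y_j$ collect the rows for area code $j$, $\hat{p}_j$ are the fitted probabilities, and $W = \operatorname{diag}\{\hat{p}_i (1 - \hat{p}_i)\}$. Finite-sample correction factors, which software typically applies, are omitted.

$$
\widehat{\operatorname{Var}}(\hat{b}) = \left(X^\top W X\right)^{-1} \left( \sum_{j} X_j^\top (y_j - \hat{p}_j)(y_j - \hat{p}_j)^\top X_j \right) \left(X^\top W X\right)^{-1}
$$

The point estimates are the same as in the unadjusted model; only the standard errors change, because residuals are allowed to be correlated within an area code.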

May I ask a short follow-up question, because I cannot easily implement and test this: using MLM, would I still obtain estimates for my 2 geographical variables "Cowsper100" and "RuralArea"? Thanks for helping me clarify this. Extremely helpful! — canIchangethis, May 04 '20 at 13:25
Thank you! That is extremely helpful! Accepted your answer as right :) Very neat! — canIchangethis, May 05 '20 at 09:01
