R large glm with sample weights

Question

I have a "stacked panel" data set with ~600,000 rows. While the data are compiled from a survey, they are no longer in survey format: rather, they are person-year observations for ~100000 people over ~10 years. Thus, I'm not able to use R's survey package in a strict sense, because I do not have survey data.

It is called mydata and it looks something like this:

unique_pid    housing.category  agebucket  sample.weight   year
1_1           0                 (30, 45]   29.9            1999
1_1           0                 (30, 45]   29.9            2000
1_1           1                 (30, 45]   19.9            2001
1_1           1                 (30, 45]   39.9            2002
1000_33       0                 (15, 30]   10              1982
1000_33       1                 (15, 30]   10.2            1983
1000_33       0                 (15, 30]   13              1984
1000_33       1                 (15, 30]   12              1985
1000_33       0                 (15, 30]   10              1986
1000_33       1                 (15, 30]   12              1987 
88_2          0                 (30, 45]   0.99            1990
88_2          0                 (30, 45]   0.89            1991 
88_2          1                 (30, 45]   1.99            1992
88_2          0                 (30, 45]   2.99            1993

I am running a weighted logit with R's glm(). My call looks like this:

glm(housing.category ~ agebucket, data= mydata, family="binomial", weights=sample.weight)

I get the following warning:

Warning message:
In eval(expr, envir, enclos) : non-integer #successes in a binomial glm!

My weights are contained in the vector sample.weight in the data frame above. While they are not integers, they are designed to adjust for over- and under-sampling of certain groups in the survey: oversampled populations are assigned values < 1, while undersampled populations are assigned values > 1. In this sense, they are frequency weights.

However, when I read the documentation for glm(), I see that these weights should represent the number of trials, i.e. they should be a vector of integers.

So far, I have worked with the survey R package, but my data are no longer panel.

Am I misinterpreting the weights argument? Is this a numerical error? If so, how can I run a logit weighted by sample weights?

Here is a rough translation from software output language into English: How could a number like "29.9" or "10.2" possibly represent a "number of observations" of a person? How do you take fractional observations? — whuber, Oct 28 '14 at 17:15
My mistake: thanks for pointing this out. Vector sample.weight is not strictly representative of the number of observations; rather, it is a sample weight: Persons in under-represented groups get sample.weight > 1, and those in over-represented groups get sample.weight < 1. In order to generate statistics that are representative of the population, it isn't necessary that sample.weight be an integer. I've edited my question to reflect this. Is it possible to run a logit weighted by these sample weights? — svenkatesh, Oct 28 '14 at 18:22
That sounds like a probability weight that you should be using, not a glm weight. Look at the survey package, and you won't get this warning. — Jeremy Miles, Oct 28 '14 at 18:55
Thanks. My data are no longer survey data - they are person-year observations for ~10000 people across ~20 years that come from a survey. Would the survey package's equivalent methods still be applicable here? — svenkatesh, Oct 28 '14 at 21:20

score 5 · Accepted Answer · edited Apr 13 '17 at 12:44

I don't know why you are saying that these are "no longer survey data". Jeremy Miles is right that you have probability weights, and need to account for them differently. (Why glm calls the number of trials a "weight" is a question for the developers' conscience.) See also Jeremy's earlier answer on a somewhat related topic.

You should still stick to the survey package, and at the very least try its svyglm function. However, since you have longitudinal data, correlations within "subjects" (unique_pid) will not be properly accounted for in the modeling part with either glm or svyglm. You would need some version of gee to do that properly, but I don't think you can easily marry the gee and survey packages, although conceptually it is quite possible (GEE generates estimating equations, and if a statistical estimand is produced by estimating equations, it can be wrapped into a survey-compatible estimator).
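A minimal sketch of the svyglm route, on simulated data shaped like the question's mydata. The simulated values, the object names des and fit, and the choice to declare unique_pid as the sampling cluster are all assumptions for illustration; the real survey's strata and primary sampling units are not given in the thread. Clustering on the person only affects the variance estimate and is not a substitute for the GEE-style modelling mentioned above.

```r
library(survey)

# Simulated stand-in for mydata (hypothetical values, not the asker's data)
set.seed(1)
n_people <- 200
n_years <- 5
mydata <- data.frame(
  unique_pid = rep(paste0(seq_len(n_people), "_1"), each = n_years),
  year = rep(1999:2003, times = n_people),
  agebucket = rep(sample(c("(15, 30]", "(30, 45]"), n_people, replace = TRUE),
                  each = n_years),
  sample.weight = runif(n_people * n_years, 0.5, 40)
)
mydata$housing.category <- rbinom(
  nrow(mydata), 1,
  ifelse(mydata$agebucket == "(30, 45]", 0.6, 0.4)
)

# Assumption: each person is treated as a cluster; no strata are declared
des <- svydesign(ids = ~unique_pid, weights = ~sample.weight, data = mydata)

# quasibinomial() fits the same mean model as binomial() but does not
# complain about non-integer weighted counts
fit <- svyglm(housing.category ~ agebucket, design = des,
              family = quasibinomial())
summary(fit)
```

The point estimates are the weighted logit coefficients; the standard errors come from the design-based variance estimator rather than from treating the weights as trial counts.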

score -2 · Answer 2 · answered Oct 28 '14 at 18:46

Try scaling all your weights to be integers. For example, multiply them all by 100.

The docs say:

For a binomial GLM prior weights are used to give the number of trials when the response is the proportion of successes
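To make the quoted sentence concrete, here is a small sketch with made-up grouped data (the vectors trials, successes and x are hypothetical). It shows what glm's weights argument means for a binomial family: a proportion response with weights equal to the number of trials should give the same fit as the two-column successes/failures form.

```r
# Made-up grouped data: four groups with known numbers of trials
trials <- c(10, 20, 15, 30)
successes <- c(3, 12, 6, 21)
x <- c(0, 1, 0, 1)

# Response as a proportion, weights = number of trials
fit_prop <- glm(successes / trials ~ x, family = binomial, weights = trials)

# Equivalent two-column (successes, failures) response, no weights
fit_cbind <- glm(cbind(successes, trials - successes) ~ x, family = binomial)

# The two parameterisations should agree up to numerical tolerance
all.equal(coef(fit_prop), coef(fit_cbind))
```

This is why a weight such as 29.9 triggers the warning: glm reads it as 29.9 trials, which is not what a sampling weight represents.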

wolfsatthedoor

Unfortunately, much of the output would then become arbitrary: the standard errors of estimate, for instance, would vary inversely with the scale factor used. This is because larger multiples make it look like more data were gathered, leading to the false impression of greater precision in the results. – whuber Oct 28 '14 at 19:11
So what should she do? – wolfsatthedoor Oct 28 '14 at 20:15
Until we find out what those numbers in `sample.weight` mean, we really don't know and are just guessing. The guess in a comment by @Jeremy Miles looks reasonable. – whuber Oct 28 '14 at 21:14
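The effect of rescaling described in the comments can be checked on simulated data. This is a sketch under stated assumptions: the data, the integer weights w and the factor of 100 are all made up for illustration. Multiplying every weight by a constant leaves the coefficients essentially unchanged, while the reported standard errors shrink by about the square root of that constant.

```r
# Simulated data with integer weights (hypothetical, for illustration only)
set.seed(42)
n <- 500
x <- rbinom(n, 1, 0.5)
y <- rbinom(n, 1, plogis(-0.3 + 0.8 * x))
w <- sample(1:5, n, replace = TRUE)

# Same model, weights as given and weights multiplied by 100
fit1 <- glm(y ~ x, family = binomial, weights = w)
fit100 <- glm(y ~ x, family = binomial, weights = 100 * w)

se1 <- sqrt(diag(vcov(fit1)))
se100 <- sqrt(diag(vcov(fit100)))

# Coefficients agree; standard errors differ by about sqrt(100) = 10
cbind(coef_w = coef(fit1), coef_100w = coef(fit100))
se1 / se100
```

Since the scale factor is arbitrary, so is the apparent precision, which is the objection raised to scaling the weights up to integers.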
