I have a "stacked panel" data set with ~600,000 rows. While the data are compiled from a survey, they are no longer in survey format: rather, these are person-year observations for ~100,000 people over ~10 years. Thus, I'm not able to use R's survey package in a strict sense, because I do not have survey data.
It is called mydata and it looks something like this:
unique_pid  housing.category  agebucket  sample.weight  year
1_1         0                 (30, 45]   29.9           1999
1_1         0                 (30, 45]   29.9           2000
1_1         1                 (30, 45]   19.9           2001
1_1         1                 (30, 45]   39.9           2002
1000_33     0                 (15, 30]   10             1982
1000_33     1                 (15, 30]   10.2           1983
1000_33     0                 (15, 30]   13             1984
1000_33     1                 (15, 30]   12             1985
1000_33     0                 (15, 30]   10             1986
1000_33     1                 (15, 30]   12             1987
88_2        0                 (30, 45]   0.99           1990
88_2        0                 (30, 45]   0.89           1991
88_2        1                 (30, 45]   1.99           1992
88_2        0                 (30, 45]   2.99           1993
I am running a weighted logit with R's glm(). My call looks like this:
glm(housing.category ~ agebucket, data= mydata, family="binomial", weights=sample.weight)
I get the following warning:
Warning message:
In eval(expr, envir, enclos) : non-integer #successes in a binomial glm!
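For anyone who wants to reproduce this without my data, here is a minimal sketch. The data frame `toy` and all of its values are made up for illustration; only the column names mirror mydata. It suggests that the message is a warning rather than an error, since a fitted model is still returned.

```r
# Toy data with the same column names as mydata (all values are invented)
set.seed(1)
toy <- data.frame(
  housing.category = rbinom(200, 1, 0.4),
  agebucket = factor(sample(c("(15, 30]", "(30, 45]"), 200, replace = TRUE)),
  sample.weight = runif(200, 0.5, 40)  # non-integer weights
)

# Collect warnings in a vector instead of letting them print
warn_msgs <- character(0)
fit <- withCallingHandlers(
  glm(housing.category ~ agebucket, data = toy,
      family = "binomial", weights = sample.weight),
  warning = function(w) {
    warn_msgs <<- c(warn_msgs, conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)

print(warn_msgs)  # the captured warning text
print(coef(fit))  # coefficients are still estimated
```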
My weights are contained in the vector sample.weight in the data frame above. While they are not integers, they are designed to adjust for over- and under-sampling of certain groups in the survey: oversampled populations are assigned values < 1, while undersampled populations are assigned values > 1. In this sense, they are frequency weights.
However, when I read the documentation for glm(), I see that these weights should represent the number of trials, i.e. they should be a vector of integers.
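To make concrete what I understand the documentation to mean, here is a small sketch of the "number of trials" usage. The aggregated data frame `agg` and its counts are hypothetical. When the response is a proportion of successes, the weights give the number of trials behind each proportion, which should match the two-column cbind(successes, failures) form.

```r
# Hypothetical aggregated data: one row per age bucket
agg <- data.frame(
  agebucket = factor(c("(15, 30]", "(30, 45]", "(45, 60]")),
  successes = c(12, 30, 45),
  trials    = c(40, 60, 70)
)

# Response is a proportion; weights give the number of trials per row
fit_prop <- glm(successes / trials ~ agebucket, data = agg,
                family = binomial, weights = trials)

# Same model written as a two-column (successes, failures) response
fit_cbind <- glm(cbind(successes, trials - successes) ~ agebucket,
                 data = agg, family = binomial)

# The two parameterisations should give the same coefficients
print(all.equal(coef(fit_prop), coef(fit_cbind)))
```

In this usage the weights are integer counts by construction, which is not what my sampling weights are.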
So far, I have worked with the survey R package, but my data are no longer in survey format.
Am I misinterpreting the weights argument? Is this a numerical error? If so, how can I run a logit weighted by sample weights?
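One workaround that is commonly suggested for this warning is family = quasibinomial. The sketch below uses made-up toy data (the data frame `toy` is hypothetical) and is only meant to show the behaviour, not to claim it is the right model for my case. The point estimates should match the binomial fit, but the standard errors differ because a dispersion parameter is estimated, and neither fit accounts for repeated observations on the same person.

```r
# Toy data with the same column names as mydata (all values are invented)
set.seed(42)
toy <- data.frame(
  housing.category = rbinom(300, 1, 0.5),
  agebucket = factor(sample(c("(15, 30]", "(30, 45]"), 300, replace = TRUE)),
  sample.weight = runif(300, 0.5, 40)
)

# Binomial fit: triggers the non-integer warning, suppressed here
fit_binom <- suppressWarnings(
  glm(housing.category ~ agebucket, data = toy,
      family = binomial, weights = sample.weight)
)

# Quasibinomial fit: same mean model, dispersion estimated from the data
fit_quasi <- glm(housing.category ~ agebucket, data = toy,
                 family = quasibinomial, weights = sample.weight)

print(all.equal(coef(fit_binom), coef(fit_quasi)))  # compare point estimates
print(summary(fit_quasi)$dispersion)                # estimated dispersion
```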