I want to estimate adjusted expenditure (AE) as a proportion of income (Y) for different income bands and family compositions. The adjustment subtracts mortgage (M), pension (P) and child- and education-related (K) expenditure from total expenditure (E).
My variables are Y, E, M, P, K, marital status, pension status and family composition, amongst many others; the kind of variables available in most national living-standards surveys.
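In symbols, the two quantities I am after can be written as follows (the symbol $r$ for the ratio is my own shorthand, not standard notation):

$$
AE = E - M - P - K, \qquad r = \frac{AE}{Y}
$$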
I have survey data from 6 periods, each about 4 years apart; 40,000 observations in all. If I pool all of the data, there are sufficient observations for each combination of income band and family composition to simply take the average Y, E, M, P and K and compute AE for each cell. For example, I take the averages Y = 100k, E = 80k and M + P + K = 20k, which gives AE = 60k and a ratio AE/Y of 60% for households in the highest decile that have 1 earner and 2 children.
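To make the pooled calculation concrete, here is a minimal sketch of what I am doing. The column names (`income_band`, `family`) and the five rows of data are invented for illustration; the top-band cell is chosen so that it reproduces the 60% example above.

```python
import pandas as pd

# Hypothetical pooled extract: column names and values are made up for illustration.
df = pd.DataFrame({
    "income_band": [7, 7, 7, 1, 1],
    "family": ["1 earner, Cpl + 2"] * 3 + ["Single earner"] * 2,
    "Y": [90_000, 100_000, 110_000, 10_000, 12_000],
    "E": [75_000, 80_000, 85_000, 9_500, 11_400],
    "M": [10_000, 12_000, 8_000, 0, 0],
    "P": [5_000, 4_000, 6_000, 500, 600],
    "K": [5_000, 4_000, 6_000, 0, 0],
})

# Average each component within every (income band, family composition) cell.
grouped = df.groupby(["income_band", "family"])
cells = grouped[["Y", "E", "M", "P", "K"]].mean()
cells["n"] = grouped.size()  # cell size, to flag cells with very few observations

# Adjusted expenditure and its share of income, computed from the cell averages.
cells["AE"] = cells["E"] - cells["M"] - cells["P"] - cells["K"]
cells["AE_over_Y"] = cells["AE"] / cells["Y"]
print(cells)
```

Note that this is a ratio of cell averages, not the average of household-level ratios; the two can differ.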
My problem is that an ANOVA test of equality of means tells me that the 6 data sets are significantly different. (The smallest data set has the largest variance, which might inflate the p-value and let me fail to reject the null, but perhaps relying on that is just a bit of statistical chicanery?) Taking the data sets individually, many cohorts have only 2 or 3 observations and some have none, so the results are unreliable and erratic.
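For reference, the kind of test I mean looks roughly like this. The data are simulated stand-ins, not my survey: the period sizes, means and standard deviations are invented, with the smallest sample given the largest spread to mimic my situation. I use `scipy.stats.f_oneway` for the one-way ANOVA and `scipy.stats.levene` as a check on equal variances.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Six hypothetical survey periods (40,000 observations in total).
# All numbers below are assumptions chosen only to illustrate the setup.
sizes = [9000, 8000, 7000, 7000, 6000, 3000]
means = [0.60, 0.61, 0.62, 0.63, 0.64, 0.65]
sds = [0.10, 0.10, 0.10, 0.10, 0.10, 0.25]  # smallest sample, largest spread
periods = [rng.normal(m, s, n) for m, s, n in zip(means, sds, sizes)]

# One-way ANOVA: null hypothesis is that all six period means are equal.
f_stat, p_anova = stats.f_oneway(*periods)

# Levene's test: null hypothesis is that all six period variances are equal.
w_stat, p_levene = stats.levene(*periods)

print(f"ANOVA  F = {f_stat:.2f}, p = {p_anova:.3g}")
print(f"Levene W = {w_stat:.2f}, p = {p_levene:.3g}")
```

In this simulated example even differences of one percentage point between period means are flagged as significant, because the samples are so large.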
Either I can pool (and if someone can suggest a way to make this feasible, that would be great), or I can attempt to model. For example, I could model each of E, M, P and K as a function of Y, marital status (dummy), pension status (dummy), mortgage status (dummy), the number of children, perhaps the square of the number of children, and a dummy for having any children.
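A minimal sketch of the kind of regression I have in mind, shown for K only, using ordinary least squares via `numpy.linalg.lstsq`. Everything here is simulated: the variable names, the sample size and the "true" coefficients are assumptions made up to show the design matrix, not estimates from my data.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

# Simulated household characteristics (all hypothetical).
Y = rng.uniform(10_000, 150_000, n)   # income
married = rng.integers(0, 2, n)       # marital status dummy
pension = rng.integers(0, 2, n)       # pension status dummy
mortgage = rng.integers(0, 2, n)      # mortgage status dummy
kids = rng.integers(0, 4, n)          # number of children, 0 to 3

# Design matrix: intercept, income, three dummies, kids, kids squared, any-kids dummy.
X = np.column_stack([
    np.ones(n), Y, married, pension, mortgage,
    kids, kids**2, (kids > 0).astype(float),
])

# Made-up coefficients used only to generate an outcome to fit.
true_beta = np.array([500.0, 0.02, 300.0, -200.0, 100.0, 2500.0, -300.0, 1000.0])
K = X @ true_beta + rng.normal(0, 50, n)

# Ordinary least squares fit of K on the regressors.
beta_hat, *_ = np.linalg.lstsq(X, K, rcond=None)
print(np.round(beta_hat, 3))
```

The same design matrix would be reused with E, M and P as the outcome.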
When I attempt such a regression there are a number of issues:
1. The small-sample problem occurs here too, does it not, if I make the regression conditional on family composition, that is, if I use only the observations from a particular cohort in the regression for that cohort?
2. It doesn't account for non-linearity in the X variables. For example, P, M and K probably increase with Y, but at a decreasing rate.
3. There is an obvious systematic relationship between the independent variables.
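On the non-linearity point, one option I have considered is entering income in logs, which allows an outcome to rise with Y at a decreasing rate. The sketch below compares a linear-in-Y fit with a log-Y fit on simulated data; the concave relationship and its coefficients are invented, so this only illustrates the specification and is not evidence about my data.

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated data with a made-up concave relationship between K and income.
Y = rng.uniform(10_000, 150_000, 400)
K = -40_000 + 5_000 * np.log(Y) + rng.normal(0, 500, 400)

def fit_rss(x, y):
    """Least-squares fit of y on an intercept and x; returns (coefficients, RSS)."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return beta, float(resid @ resid)

beta_lin, rss_lin = fit_rss(Y, K)          # K = a + b * Y
beta_log, rss_log = fit_rss(np.log(Y), K)  # K = a + b * log(Y)

print(f"RSS linear in Y : {rss_lin:.3e}")
print(f"RSS linear in log Y: {rss_log:.3e}")
```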
So my two questions are:
1. Can I pool the data even though the ANOVA test tells me I can't?
2. Can someone suggest a model or technique (shrinkage estimators, non-linear regression?) that will allow me to reliably estimate these components?
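To be concrete about what I mean by a shrinkage estimator: something that pulls the mean of a thinly populated cell towards the pooled mean, with less pull as the cell gets larger. Below is a minimal sketch with a weight of n / (n + prior_strength). The function name and the `prior_strength` constant are hypothetical; I have simply picked 10 here, whereas I understand that in an empirical Bayes setting it would be estimated from the within-cell and between-cell variances.

```python
import numpy as np

def shrink_cell_mean(cell_values, pooled_mean, prior_strength=10.0):
    """Shrink a cell mean towards the pooled mean.

    The weight on the cell's own mean is n / (n + prior_strength), so cells
    with few observations lean mostly on the pooled mean. An empty cell
    simply returns the pooled mean.
    """
    n = len(cell_values)
    if n == 0:
        return float(pooled_mean)
    w = n / (n + prior_strength)
    return w * float(np.mean(cell_values)) + (1 - w) * float(pooled_mean)

# Hypothetical AE/Y ratios: a 2-observation cell and a 58-observation cell.
pooled = 0.60
small_cell = [0.90, 0.80]
large_cell = [0.70] * 58

print(shrink_cell_mean(small_cell, pooled))  # pulled strongly towards 0.60
print(shrink_cell_mean(large_cell, pooled))  # stays close to 0.70
print(shrink_cell_mean([], pooled))          # empty cell falls back to 0.60
```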
Thanks.
| Income band | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|
| Single earner | 58 | 27 | 18 | 10 | 2 | 1 | 1 |
| Single earner, Cpl | 50 | 34 | 14 | 12 | 7 | 3 | 4 |
| Single earner | 26 | 21 | 12 | 10 | 11 | 2 | 3 |
| Single earner, Cpl | 26 | 33 | 16 | 10 | 12 | 11 | 6 |
| Single earner, Cpl + 1 | 21 | 19 | 22 | 12 | 13 | 1 | 10 |
| Single earner, Cpl + 2 | 58 | 27 | 18 | 10 | 2 | 1 | 1 |
| Single earner, Cpl + 3 | 11 | 15 | 18 | 6 | 6 | 1 | 6 |
| Two earners, Cpl | 25 | 15 | 23 | 15 | 13 | 4 | 15 |
| Two earners, Cpl + 1 | 23 | 17 | 24 | 18 | 9 | 6 | 9 |
| Two earners, Cpl + 2 | 11 | 17 | 18 | 13 | 10 | 4 | 16 |
| Two earners, Cpl + 3 | 58 | 27 | 18 | 10 | 2 | 1 | 1 |
| Single parent + 1 | 25 | 12 | 2 | 1 | 0 | 0 | 0 |
| Single parent + 2 | 6 | 6 | 1 | 1 | 0 | 0 | 0 |
| Single parent + 3 | 4 | 3 | 1 | 0 | 0 | 0 | 0 |