Help with the functional form of my model?

Question

I want to figure out adjusted expenditure (AE) as a proportion of income (Y) for different income bands and family compositions. The adjustment subtracts mortgage (M), pension (P) and child & education (K) related expenditure from total expenditure (E), so AE = E - (M + P + K).

My variables are Y, E, M, P, K, marriage status, pension status and family composition, amongst many others: the kind of variables available in most national living standards surveys.

I have survey data from 6 periods, each about 4 years apart; 40,000 observations in all. If I pool all of the data there are sufficient observations for each combination of income band and family size to just take the average Y, E, M, P and K, and compute AE, by income band, for each combination of family composition. For example, I take the average: Y of 100k, E of 80k, M P & K of 20k, and produce AE of 60k which gives me the ratio AE/Y of 60% for households in the highest decile who have 1 earner and 2 children.
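The cell-average calculation described above can be sketched as follows. This is only an illustration: the record layout, the field names (`band`, `composition`) and the household values are hypothetical, chosen so that the first cell reproduces the 100k / 80k / 20k example from the question.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical household records; only the meaning of Y, E, M, P, K comes from the question.
households = [
    {"band": 10, "composition": "1 earner, couple + 2",
     "Y": 110_000, "E": 85_000, "M": 12_000, "P": 5_000, "K": 5_000},
    {"band": 10, "composition": "1 earner, couple + 2",
     "Y": 90_000, "E": 75_000, "M": 8_000, "P": 5_000, "K": 5_000},
    {"band": 1, "composition": "single parent + 1",
     "Y": 20_000, "E": 19_000, "M": 0, "P": 1_000, "K": 4_000},
]


def adjusted_expenditure_ratios(records):
    """Average Y, E, M, P, K within each (band, composition) cell, then form AE / Y."""
    cells = defaultdict(list)
    for record in records:
        cells[(record["band"], record["composition"])].append(record)

    ratios = {}
    for key, rows in cells.items():
        avg = {v: mean(row[v] for row in rows) for v in ("Y", "E", "M", "P", "K")}
        ae = avg["E"] - (avg["M"] + avg["P"] + avg["K"])  # AE = E - (M + P + K)
        # Keep the cell size n so that thin cells can be flagged.
        ratios[key] = {"n": len(rows), "AE": ae, "AE_over_Y": ae / avg["Y"]}
    return ratios


ratios = adjusted_expenditure_ratios(households)
for cell, summary in ratios.items():
    print(cell, summary)
```

Note that this is a ratio of averages, as in the question, rather than an average of household-level ratios; the two generally differ.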

My problem is that an ANOVA test of the equality of the means tells me that the 6 data-sets are significantly different. (The smallest data set has the largest variance which might inflate the true p value and allow me to accept the null but perhaps this is just a bit of statistical chicanery?) Taking the data-sets individually I have many cohorts that have only 2 or 3 observations, some have none, and thus provide unreliable and erratic results.

Either I can pool, and if someone can suggest a way that this might be feasible that would be great, or I can attempt to model. Perhaps E, M, P and K as a function of Y, marriage status (dummy), pension status (dummy), mortgage status (dummy), the number of kids, maybe the square of the number of kids and if you have kids (dummy).

When I attempt such a regression there are a number of issues: 1. The small sample problem occurs here too, does it not, if I make the regression conditional on family composition, that is, if I only use the observations relating to a particular cohort for the regression pertaining to that cohort? 2. This doesn't account for non-linearity in the X variables. For example, P, M and K probably increase with Y but at a decreasing rate. 3. There is an obvious systematic relationship between the independent variables.

So my two questions are 1. Can I pool the data even though ANOVA tests tell me I can't? 2. Can someone suggest a model or technique (shrinkage estimators, non-linear regression?) that will allow me to reliably estimate these components?

Thanks.

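| Income Band | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|
| Single earner | 58 | 27 | 18 | 10 | 2 | 1 | 1 |
| Single earner, Cpl | 50 | 34 | 14 | 12 | 7 | 3 | 4 |
| Single earner | 26 | 21 | 12 | 10 | 11 | 2 | 3 |
| Single earner, Cpl | 26 | 33 | 16 | 10 | 12 | 11 | 6 |
| Single earner, Cpl + 1 | 21 | 19 | 22 | 12 | 13 | 1 | 10 |
| Single earner, Cpl + 2 | 58 | 27 | 18 | 10 | 2 | 1 | 1 |
| Single earner, Cpl + 3 | 11 | 15 | 18 | 6 | 6 | 1 | 6 |
| Two earners, Cpl | 25 | 15 | 23 | 15 | 13 | 4 | 15 |
| Two earners, Cpl + 1 | 23 | 17 | 24 | 18 | 9 | 6 | 9 |
| Two earners, Cpl + 2 | 11 | 17 | 18 | 13 | 10 | 4 | 16 |
| Two earners, Cpl + 3 | 58 | 27 | 18 | 10 | 2 | 1 | 1 |
| Single parent + 1 | 25 | 12 | 2 | 1 | 0 | 0 | 0 |
| Single parent + 2 | 6 | 6 | 1 | 1 | 0 | 0 | 0 |
| Single parent + 3 | 4 | 3 | 1 | 0 | 0 | 0 | 0 |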

Could you add some tables showing the amount of data broken down by period, income band, and family composition? — eric_kernfeld, Sep 13 '17 at 19:06
There are non-parametric regression methods where you need not assume a specific functional form for the regression curve. (I am thinking of generalized additive models.) GAMs are typically fitted by assuming that the regression function is smooth. Is that a type of solution you would be interested in? — eric_kernfeld, Sep 13 '17 at 19:09
Is this what you need Eric? Treat the top band as the income header. So there are 4 observations in the 1st income band for a single parent with 3 children. I'll figure out a better way to format it if this is what you need. — steve, Sep 14 '17 at 10:50
Thanks for the sample-size update. Is the following correct? You have calculated your response variable as (E - (M+P+K)) / Y. You want to predict this response variable based on income and family composition. You don't want to pool the data from your six sources because they don't seem to all have the same mean value of your response. — eric_kernfeld, Sep 20 '17 at 21:08
What you're calling the response variable IS the item I'm trying to calculate, but I didn't include it in a regression because Y appears both as part of the response and as an independent variable. Instead I would estimate each component of the response separately (apart from Y) and then calculate the response from those estimates. — steve, Sep 22 '17 at 10:16
I would love to pool but a comparison of mean expenditure or income in each one of the datasets tells me they are significantly different, with the caveats mentioned above. — steve, Sep 22 '17 at 10:16
Sorry if this is a stupid question, but I am confused and I need to take a step back. What is the purpose of this modeling effort? You say you want to figure out AE/Y, so what's keeping you from calculating it directly and moving on with your study? — eric_kernfeld, Sep 22 '17 at 15:43
The problem is that for some of the cohorts for which I'm trying to calculate the measure, for example 1 person families with 3 kids, I have almost no observations. If I can pool the 6 datasets this problem goes away but ANOVA tests suggest that this is not legitimate. The smaller datasets have the largest variance and I've read that this can inflate the true p value for the ANOVA but I'm not sure if this is a fudge. So what I need is some way of pooling the data or else some model that will allow me to estimate each component separately and then calculate the ratio for each cohort. — steve, Sep 23 '17 at 17:20
This seems problematic to me because a) I still have very few observations with which to run the regression for some cohorts and also b) there are obvious non-linearities in some of the variables. For example, mortgage expenditure increases with income so a straight dummy for mortgage expenditure is not very realistic. All interactions turn out to be insignificant though even though I can see that mortgage payments are non-linear. Not sure if this explanation is clarifying or just making it worse?! — steve, Sep 23 '17 at 17:21
I don't think I can specify a particular model in good conscience without seeing the data myself, but I can write an answer about possible tools. — eric_kernfeld, Sep 26 '17 at 12:42

eric_kernfeld · Accepted Answer · 2017-09-26T13:53:38.873

Here are a couple of tools that seem useful in this scenario.

For smooth, flexible regression, generalized additive models (GAMs) are popular, and there are great tools for fitting them. The idea is to fit a flexible curve, typically built from piecewise polynomials (splines), with a penalty on the curvature, so that the fitted values at adjacent observations are neither forced to be identical nor left completely unrelated. Modern GAM fitting tools can choose the degree of penalization adaptively from the data.
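To make the curvature penalty concrete, here is a minimal sketch of a discrete penalized smoother (a Whittaker-style smoother on equally spaced points). It is not a GAM and not what a GAM library does internally; it only illustrates how a penalty weight, here called `lam`, trades fidelity to the data against smoothness. The band-level values are hypothetical, and the penalty weight is fixed by hand rather than chosen adaptively.

```python
def second_difference_matrix(n):
    """Rows hold the [1, -2, 1] stencil, so (D f)[i] = f[i] - 2 f[i+1] + f[i+2]."""
    rows = []
    for i in range(n - 2):
        row = [0.0] * n
        row[i], row[i + 1], row[i + 2] = 1.0, -2.0, 1.0
        rows.append(row)
    return rows


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def penalized_smooth(y, lam):
    """Minimise sum (y_i - f_i)^2 + lam * sum (second differences of f)^2."""
    n = len(y)
    d = second_difference_matrix(n)
    # Normal equations of the penalized problem: (I + lam * D^T D) f = y
    a = [[(1.0 if i == j else 0.0) + lam * sum(row[i] * row[j] for row in d)
          for j in range(n)] for i in range(n)]
    return solve(a, list(y))


def roughness(f):
    """Sum of squared second differences, the quantity being penalized."""
    return sum((f[i] - 2 * f[i + 1] + f[i + 2]) ** 2 for i in range(len(f) - 2))


# Hypothetical mean mortgage spend (in thousands) for income bands 1..7.
band_means = [2.0, 3.5, 3.0, 5.5, 5.0, 6.5, 6.0]
smooth_light = penalized_smooth(band_means, 1.0)
smooth_strong = penalized_smooth(band_means, 100.0)
print([round(v, 2) for v in smooth_light])
print([round(v, 2) for v in smooth_strong])
```

With `lam = 0` the fit reproduces the data exactly; as `lam` grows the fit approaches a straight line.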

The "additive" part of GAM refers to a structure where variables contribute to the response through sums, as in $y = f(x_1) + g(x_2)$, but if this seems too restrictive, there are ways to include interactions. See here (especially the second answer with the te() in the formula):

How to include an interaction term in GAM?

Another useful idea here is mixed-effects modeling. This allows you to fit a single regression model across all cohorts, but include terms accounting for heterogeneity of effects. For example, you could write

$$K_{it} = \beta_{married} X_{i, married} + \beta_{kids} X_{i, kids} + b_{t,0} + \beta_0 + \epsilon_{i,t}$$

where $i$ indexes households and $t$ indexes cohorts, $K_{it}$ is child and education expenditure, $X_{i, married}$ indicates the number of spouses (probably 0 or 1), $X_{i, kids}$ indicates the number of kids, $\beta_0$ is the overall intercept, $b_{t,0}$ is a random intercept for cohort $t$, and $\epsilon_{i,t}$ is an error term. In my example, the $\beta$'s are unconstrained but $b_{t,0}$ is assumed to follow some zero-mean distribution, for example $N(0, \sigma^2)$. When the parameters are estimated, this assumption shrinks the $b_{t,0}$'s towards zero, moving the estimates from different cohorts closer to one another. They don't have to be identical, but they still share information this way. You can even allow for heterogeneity in the slopes:

$$K_{it} = (\beta_{married}+ b_{t,married}) X_{i, married} + (\beta_{kids}+ b_{t,kids})X_{i, kids} + \beta_0 + b_{t,0} + \epsilon_{i,t}$$
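The shrinkage of the random intercepts can be illustrated with the simplest special case: no covariates, so that $K_{it} = \beta_0 + b_{t,0} + \epsilon_{i,t}$. The sketch below assumes the two variance components are already known (`sigma2` for the error term and `tau2` for the random intercept, both hypothetical values here); a real mixed-model package would estimate them from the data. The cohort names and observations are made up.

```python
from statistics import mean


def shrunken_cohort_means(cohorts, sigma2, tau2):
    """Partial pooling for K_it = beta0 + b_t0 + eps_it with known variances.

    cohorts: dict mapping cohort name -> list of observed K values.
    sigma2:  variance of the error term (assumed known).
    tau2:    variance of the random intercept (assumed known).
    """
    stats = {t: (len(obs), mean(obs)) for t, obs in cohorts.items()}
    # GLS estimate of beta0: weight each cohort mean by 1 / Var(cohort mean).
    weights = {t: 1.0 / (tau2 + sigma2 / n) for t, (n, _) in stats.items()}
    beta0 = sum(weights[t] * stats[t][1] for t in stats) / sum(weights.values())

    estimates = {}
    for t, (n, ybar) in stats.items():
        shrink = tau2 / (tau2 + sigma2 / n)  # between 0 and 1; near 1 when n is large
        estimates[t] = beta0 + shrink * (ybar - beta0)  # beta0 + predicted b_t0
    return beta0, estimates


# Hypothetical child-related expenditure (in thousands) for three cohorts.
cohorts = {
    "couple + 2": [10.0, 12.0, 11.0, 9.0, 13.0, 10.0, 12.0, 11.0],
    "two earners + 1": [8.0, 9.0, 10.0, 9.0, 8.0, 10.0],
    "single parent + 3": [18.0, 20.0],  # a thin cell with only two observations
}
beta0, estimates = shrunken_cohort_means(cohorts, sigma2=4.0, tau2=4.0)
for name, value in estimates.items():
    print(name, round(mean(cohorts[name]), 2), "->", round(value, 2))
```

The cohort with only two observations is pulled furthest towards the overall intercept, while the well-populated cohorts stay close to their own averages. That is the sense in which cohorts "share information" without being forced to be identical.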

Finally, my graduate program would disown me if I didn't mention that these models should be constructed only along with thorough investigation of, and visualization of, the data. Here are some questions I would want to look into.

Are there any outliers that might indicate data entry errors (or unexpected but genuine signals)?
If you are going to shrink some estimates towards one another, do they show the same trend individually?
If the data are gathered at different times and in different places, how are they adjusted for inflation?
If you fit a linear model on one or two covariates and plot the residuals, do they display patterns that justify more complex models?
Are the outcomes correlated? For example, do high residuals for income correspond with high residuals for mortgage expenses?
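As one way to act on the residual question in the list above, the sketch below fits a straight line by ordinary least squares and prints the residuals in order of the covariate. The numbers are hypothetical: an outcome that rises with income band at a decreasing rate, as the question suggests for P, M and K.

```python
def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Hypothetical data: income band (1..7) and an outcome that flattens out.
income_band = [1, 2, 3, 4, 5, 6, 7]
outcome = [1.0, 1.9, 2.6, 3.1, 3.5, 3.8, 4.0]

intercept, slope = fit_line(income_band, outcome)
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(income_band, outcome)]

for xi, ri in zip(income_band, residuals):
    print(xi, round(ri, 3))
```

Residuals that are negative at both ends and positive in the middle, as here, are the kind of pattern that would justify a curved term or a smooth function of income; residuals with no visible pattern would argue for keeping the simpler model.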

What an excellent answer. Thanks so much for your time Eric. I'm not familiar with GAMs but I know mixed effects so may try there first. Thanks again. (I voted your answer up but am a lightweight statsexchanger so no green arrow) — steve, Sep 27 '17 at 12:53
I am glad you found it helpful! Good luck with your project. — eric_kernfeld, Sep 27 '17 at 13:07
