
I would like to split a large sample into categories (subsets) and make some estimates for each one. The problem is that some subsets contain very few data points. How do I deal with that? For example:

1) Data set

Say there is a large sample of personal annual incomes (Euros) in Europe. Fields and example entry:

Country  Region   Occupation  Age  Income
Germany  Bavaria  Engineer     31   50200

2) Tuscany singers

Estimate the average income of 53 year old professional singers in Tuscany, Italy. How do you do it? My first thought would be to just average the Incomes of all entries where Country=Italy, Region=Tuscany, Occupation=Singer and Age=53.

But what if only two entries match those criteria?

Country  Region   Occupation  Age  Income
Italy    Tuscany  Singer      53   22500
Italy    Tuscany  Singer      53   13700
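
A minimal sketch of that direct average, together with its standard error, using only the two rows above. With n = 2 the standard error is itself very poorly estimated, so this illustrates the arithmetic rather than a usable estimate:

```python
import math
import statistics

# The two matching rows from the example table above.
incomes = [22500, 13700]

mean = statistics.mean(incomes)
# Sample standard deviation (n - 1 in the denominator).
sd = statistics.stdev(incomes)
# Standard error of the mean: sd / sqrt(n).
se = sd / math.sqrt(len(incomes))

print(f"mean = {mean:.0f}, standard error = {se:.0f}")
```

The standard error here is roughly a quarter of the mean, which shows how little two observations pin down the population average.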

I could then maybe look at e.g. ages 50-56 in Tuscany to get a bigger sample. I could also look at the average income of all singers aged 53 in all of Italy, and adjust it if incomes in Tuscany generally differ from the rest of Italy. So manually and intuitively I could guess a number, but I need to be able to tell a computer how to do it...

3) Greater London taxi-drivers

Next, estimate the population average income of 37 year old taxi drivers in Greater London, UK. For this combination, say the data set contains 1230 entries, which should be a large enough sample, so you would just average them to get the estimate, right? No need to look at other ages or regions.

4) The question

What I am looking for is a way to get an estimate of the population average Income for any combination of Age, Occupation and Region - something that works whether or not there is a lot of data (by looking at similar data if the sample is small). I imagine that you would use the same formula/procedure in both 2) and 3), but with the Greater London sample subset, very little weight (but how little?) would be assigned to other ages and regions, since the sample is large.

There should also be some measure of confidence in the estimate, for example standard error or a confidence interval.

Surely this kind of problem must be quite common in statistics. E.g. http://en.wikipedia.org/wiki/Multilevel_model discusses some of the same things, but does not seem to fit my problem still.

If anyone could name some methods or concepts that might help with this problem, it would be most helpful.

Just something that might put me on the right track - I am quite stuck and don't even know what to google at this point.

Thank you for reading!

user843355

1 Answer


Here is one way to get started; there may be very different approaches. Your problem sounds like a case for multiple regression, but with care. Regression, because you are trying to model the conditional mean of the income given the values of some covariates.

Care is needed for several reasons. For starters, income distributions are often skewed (there are only a handful of Pavarottis but many poor singers). The model specification needs to account for possible non-linearities (e.g. a non-monotonic age effect), non-constant variance (heteroscedasticity), outliers, interaction effects, and so on. In practice, modelling incomes is very tricky (econometrics deals with this). For example, it is well known that gender, education and other factors also matter.
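
To illustrate the skewness point, here is a small sketch with made-up singer incomes (the numbers are hypothetical, chosen only to include one extreme earner). It shows how a single very large value pulls the mean far away from the median:

```python
import statistics

# Hypothetical incomes: five ordinary singers and one star (made-up values).
incomes = [12000, 13700, 15000, 18000, 22500, 2000000]

mean = statistics.mean(incomes)
median = statistics.median(incomes)

print(f"mean = {mean:.0f}, median = {median:.0f}")
```

In a sample like this, the mean describes almost nobody in the group, which is one reason a plain linear model of raw income needs checking.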

With multiple regression, there is also the problem of hidden extrapolation: there might not be any young Bavarian opera singers in your sample, yet you can still use your model's estimates to predict their income, perhaps very badly. If you have no data in that region of your data space, the estimate is essentially a wild guess (a bet on linearity). But if your model specification is correct, using regression is more efficient than taking the means of subsets.

Standard tools exist to assess the fit of your model (model diagnostics) and to quantify the uncertainty associated with a prediction.
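
As a rough sketch of what "quantifying the uncertainty" can look like, the following fits a straight line of income on age by ordinary least squares and computes the standard error of the estimated mean at a given age. The data are made up, the helper `predict_mean` is a hypothetical name, and the formula assumes the linear model with constant error variance is correct:

```python
import numpy as np

# Hypothetical (made-up) ages and incomes, for illustration only.
age = np.array([25, 30, 35, 40, 45, 50, 55], dtype=float)
income = np.array([21000, 26000, 30500, 33000, 36500, 37000, 39500], dtype=float)

# Design matrix: intercept column plus age.
X = np.column_stack([np.ones_like(age), age])
beta, *_ = np.linalg.lstsq(X, income, rcond=None)

# Residual variance with n - p degrees of freedom.
n, p = X.shape
resid = income - X @ beta
sigma2 = resid @ resid / (n - p)
XtX_inv = np.linalg.inv(X.T @ X)

def predict_mean(a):
    """Return (estimated mean income, its standard error) at age a."""
    x0 = np.array([1.0, a])
    est = x0 @ beta
    se = np.sqrt(sigma2 * (x0 @ XtX_inv @ x0))
    return est, se

inside = predict_mean(40)   # within the observed age range
outside = predict_mean(80)  # far outside it: extrapolation
print(inside, outside)
```

The standard error grows as the age moves away from the observed data, but it only reflects sampling noise. It does not warn you if the straight-line assumption itself fails outside the observed range, which is the hidden extrapolation risk mentioned above.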

binkyhorse
  • Thank you for answering! I can't figure out how regression can be used here - but that could of course be just me. I see how it could work for Age - you can put Income on one axis and Age on another, increase or decrease Age and see what happens to the average Income. But you can't put Region on an axis and increase/decrease it and see what happens to the average Income. Only Age is a numeric variable, the other columns are categories. Maybe I just don't get it, so if you still think multiple regression is a good fit, please give me a nod, and I will look into it further. – user843355 Apr 07 '14 at 08:54
  • Regression models can incorporate so-called categorical variables (whose values are categories, such as occupation or region) as regressors. One way to deal with them is to estimate a separate intercept for every category (except one, the so-called reference level). This corresponds to a model with one regression line per value of the categorical variable, all lines having the same slope but each having its own intercept. More refined models can give each line its own slope, and much more. Any decent text on multiple regression will treat this. (tbc) – binkyhorse Apr 07 '14 at 15:21
  • The big argument in favor of regression is that the standard parameter estimates have extremely attractive theoretical properties: they are BLUE (best linear unbiased estimators). Again, all of this is treated in any good text on regression. If you are especially interested in the economic example you gave, look for standard econometrics books. See [on this site](http://stats.stackexchange.com/questions/4612/good-econometrics-textbooks) for references. (end) – binkyhorse Apr 07 '14 at 15:21
  • I just saw that I lost a sentence between my first and my second comment, namely: "Under some conditions that you will find in the textbooks on linear regression, regression has very useful properties." – binkyhorse Apr 07 '14 at 15:55
  • Thank you for taking the time to explain - much appreciated! This is really useful information about regression that I didn't know, and I will certainly look into it further. As I mentioned, I have come across this problem in many different situations (when categorizing some large sample). Can you think of a similar problem where linear regression is not a good fit? Maybe I should make that a new question... – user843355 Apr 08 '14 at 13:49
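
The "one intercept per category, common slope" idea from the comments can be sketched as follows. This is only an illustration under stated assumptions: the rows are made up, the helper `design_row` is a hypothetical name, and only Region and Age are used as regressors:

```python
import numpy as np

# Hypothetical rows (region, age, income); all values are made up.
rows = [
    ("Bavaria", 30, 48000), ("Bavaria", 40, 56000), ("Bavaria", 50, 61000),
    ("Tuscany", 30, 29000), ("Tuscany", 45, 38000), ("Tuscany", 53, 41000),
    ("London", 28, 41000), ("London", 37, 47000), ("London", 55, 58000),
]

# The first region alphabetically serves as the reference level.
regions = sorted({region for region, _, _ in rows})
reference, others = regions[0], regions[1:]

def design_row(region, age):
    """Intercept, age, and one 0/1 dummy per non-reference region."""
    dummies = [1.0 if region == r else 0.0 for r in others]
    return [1.0, float(age)] + dummies

X = np.array([design_row(region, age) for region, age, _ in rows])
y = np.array([income for _, _, income in rows], dtype=float)

# Ordinary least squares: common age slope, region-specific intercepts.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Estimated mean income for a 53-year-old in Tuscany under this model.
pred = np.array(design_row("Tuscany", 53)) @ beta
print(dict(zip(["intercept", "age"] + others, beta)), pred)
```

Each dummy coefficient is the shift of that region's line relative to the reference region, so Region never has to be placed on a numeric axis.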