I would like to split a large sample into categories (subsets) and make some estimates for each one. The problem is that some subsets contain very few data points. How do I deal with that? For example:
1) Data set
Say there is a large sample of personal annual incomes (Euros) in Europe. Fields and example entry:
    Country   Region    Occupation   Age   Income
    Germany   Bavaria   Engineer     31    50200
2) Tuscany singers
Estimate the average income of 53-year-old professional singers in Tuscany, Italy. How do you do it? My first thought would be to just average the incomes of all entries where Country=Italy, Region=Tuscany, Occupation=Singer and Age=53.
But what if only two entries match those criteria?
    Country   Region    Occupation   Age   Income
    Italy     Tuscany   Singer       53    22500
    Italy     Tuscany   Singer       53    13700
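To show how little these two entries pin down, here is a minimal Python sketch of the naive approach: average the matching incomes and compute the standard error of that mean. The function name `mean_and_standard_error` is just something I made up for illustration, and the standard error formula assumes the entries are independent draws from the subset.

```python
import math
import statistics

# The two matching entries from the example above.
incomes = [22500, 13700]

def mean_and_standard_error(values):
    """Return (sample mean, standard error of the mean)."""
    n = len(values)
    mean = statistics.mean(values)
    if n < 2:
        # The sample standard deviation is undefined for a single value.
        return mean, float("nan")
    se = statistics.stdev(values) / math.sqrt(n)
    return mean, se

mean, se = mean_and_standard_error(incomes)
print(f"n={len(incomes)}, mean={mean:.0f}, standard error={se:.0f}")
# prints: n=2, mean=18100, standard error=4400
```

A standard error of 4400 on an estimate of 18100 is too wide to be useful, which is why I want to borrow information from similar entries.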
I could then look at, say, ages 50-56 in Tuscany to get a bigger sample. I could also take the average income of all singers aged 53 in all of Italy and adjust it if incomes in Tuscany generally differ from the rest of Italy. So manually and intuitively I could guess a number, but I need to be able to tell a computer how to do it...
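The two manual workarounds could be sketched in Python roughly as follows. This is only an illustration of my intuition, not a method I know to be sound. Apart from the two Tuscany singers aged 53, every record below is made up as toy data, and the helper `incomes` is a hypothetical name. All toy records are Italian, so there is no country filter.

```python
from statistics import mean

# Toy records: (country, region, occupation, age, income).
# Only the two Tuscany singers aged 53 come from the example; the rest are made up.
records = [
    ("Italy", "Tuscany", "Singer", 53, 22500),
    ("Italy", "Tuscany", "Singer", 53, 13700),
    ("Italy", "Tuscany", "Singer", 51, 19000),
    ("Italy", "Tuscany", "Baker", 40, 21000),
    ("Italy", "Lazio", "Singer", 53, 24000),
    ("Italy", "Lazio", "Baker", 45, 26000),
    ("Italy", "Lombardy", "Singer", 53, 27000),
    ("Italy", "Lombardy", "Baker", 38, 30000),
]

def incomes(rows, region=None, occupation=None, age_range=None):
    """Return the incomes of rows matching every filter that is not None."""
    selected = []
    for _country, reg, occ, age, income in rows:
        if region is not None and reg != region:
            continue
        if occupation is not None and occ != occupation:
            continue
        if age_range is not None and not (age_range[0] <= age <= age_range[1]):
            continue
        selected.append(income)
    return selected

# Option A: widen the age window to 50-56 within Tuscany.
option_a = mean(incomes(records, region="Tuscany", occupation="Singer",
                        age_range=(50, 56)))

# Option B: national average for singers aged 53, scaled by how Tuscan
# incomes compare with Italian incomes overall (Tuscany is included in the
# national figures, which is a simplification).
national = mean(incomes(records, occupation="Singer", age_range=(53, 53)))
regional_factor = mean(incomes(records, region="Tuscany")) / mean(incomes(records))
option_b = national * regional_factor

print(f"Option A (wider age window): {option_a:.0f}")
print(f"Option B (national, region-adjusted): {option_b:.0f}")
```

Both options involve arbitrary choices (how wide a window, which adjustment), and I do not know how to combine them or how to quantify their uncertainty.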
3) Greater London taxi-drivers
Next, estimate the population average income of 37-year-old taxi drivers in Greater London, UK. For this combination, say the data set contains 1230 entries, which should be a large enough sample, so you would just average them to get the estimate, right? No need to look at other ages or regions.
4) The question
What I am looking for is a way to get an estimate of the population average Income for any combination of Age, Occupation and Region - something that works whether or not there is a lot of data (by looking at similar data if the sample is small). I imagine that you would use the same formula/procedure in both 2) and 3), but with the Greater London sample subset, very little weight (but how little?) would be assigned to other ages and regions, since the sample is large.
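To make the weighting idea concrete, here is a rough Python sketch of what I imagine: a single formula that blends the subset mean with the mean of a broader, similar group, giving the subset more weight the more entries it has. The weight n / (n + k) and the constant `k` are my own assumptions for illustration; I have no principled way to choose `k`, which is part of what I am asking. The broader-group means and the London subset mean are made-up numbers.

```python
def blended_estimate(subset_mean, n, broader_mean, k=20.0):
    """Blend a subset mean with a broader-group mean.

    The subset gets weight n / (n + k). k is a hypothetical tuning
    constant: larger k means small subsets lean more on the broader group.
    Returns (estimate, weight given to the subset).
    """
    weight = n / (n + k)
    estimate = weight * subset_mean + (1.0 - weight) * broader_mean
    return estimate, weight

# Tuscany singers aged 53: n = 2, mean 18100 (from the two example entries).
# The broader mean of 21000 is made up.
tuscany, w_tuscany = blended_estimate(18100, 2, 21000)

# Greater London taxi drivers aged 37: n = 1230.
# Both means are made up for illustration.
london, w_london = blended_estimate(30000, 1230, 28000)

print(f"Tuscany: estimate={tuscany:.0f}, subset weight={w_tuscany:.3f}")
print(f"London:  estimate={london:.0f}, subset weight={w_london:.3f}")
```

With these numbers the Tuscany estimate is pulled almost entirely toward the broader group, while the London estimate stays very close to its own subset mean, which is the behaviour I am after.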
There should also be some measure of confidence in the estimate, for example standard error or a confidence interval.
Surely this kind of problem must be quite common in statistics. For example, http://en.wikipedia.org/wiki/Multilevel_model discusses some of the same issues, but it still does not seem to fit my problem.
If anyone could name some methods or concepts that might help with this problem, it would be most helpful.
Just something that might put me on the right track - I am quite stuck and don't even know what to google at this point.
Thank you for reading!