About four years ago I took a graduate course in multivariate statistics. I'm now facing a data problem that rings a bell about something covered toward the end of the semester, but even after going back through my old homework I can't determine which technique I'm looking for. Secondarily, I'd like to implement whatever technique it turns out to be in R, but I'll settle for being pointed in the right direction.
Here is a simplified statement of the problem.
Suppose I have a data set where each entry describes an object, and for each entry there are several variables. I'll arbitrarily name those variables X, A, B, C, and D; some may be numerical quantities and others categorical:
X | A | B | C | D
---|---|---|---|---
4.3|265| Y |red|7.1
3.2|740| N |grn|9.0
2.2|655| Y |wht|8.2
.
.
.
Through some mathematical process performed on this data, a function f is established such that X = f(A,B,C,D), with the understanding that by "=" I mean "an estimated value with some confidence interval", not strict equality.
Suppose now that the aforementioned data set is a subset of a larger data set in which the X variable is missing, along with some number of the other variables (always the same variables for all elements of the data set). Then, for example, f(B,D) gives me an estimated value for X, with what I would expect to be a wider confidence interval than I would have obtained had I also supplied fields A and C.
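To make the idea concrete, here is a minimal sketch of what I imagine, assuming purely for illustration that f is an ordinary least-squares regression (which may well not be the right technique). The data are synthetic, the coefficients are invented, variable C is left out for brevity, and the helper names `fit_ols` and `residual_se` are hypothetical. It is written in Python only as a stand-in; I believe the R counterpart would be built on `lm`.

```python
import random

def fit_ols(rows, y):
    """Least-squares coefficients via the normal equations (X'X) b = X'y."""
    p = len(rows[0])
    # Augmented matrix [X'X | X'y].
    m = [[sum(r[i] * r[j] for r in rows) for j in range(p)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda k: abs(m[k][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for k in range(p):
            if k != col:
                factor = m[k][col] / m[col][col]
                m[k] = [u - factor * v for u, v in zip(m[k], m[col])]
    return [m[i][p] / m[i][i] for i in range(p)]

def residual_se(rows, y, beta):
    """Residual standard error: sqrt(RSS / (n - number of coefficients))."""
    rss = sum((yi - sum(b * v for b, v in zip(beta, r))) ** 2
              for r, yi in zip(rows, y))
    return (rss / (len(y) - len(beta))) ** 0.5

# Synthetic data (an assumption, not real measurements): X depends on A, B, D.
rng = random.Random(0)
records = []
for _ in range(200):
    a = rng.uniform(200, 800)      # numeric, like column A
    b = rng.choice(["Y", "N"])     # categorical, like column B
    d = rng.uniform(5, 10)         # numeric, like column D
    x = 1.0 + 0.004 * a + 0.5 * (b == "Y") + 0.1 * d + rng.gauss(0, 0.2)
    records.append((x, a, b, d))

y = [x for x, _, _, _ in records]
# Design matrices: intercept, then predictors; B is dummy-coded (Y=1, N=0).
full = [[1.0, a, 1.0 if b == "Y" else 0.0, d] for _, a, b, d in records]
reduced = [[1.0, 1.0 if b == "Y" else 0.0, d] for _, _, b, d in records]

se_full = residual_se(full, y, fit_ols(full, y))           # f(A, B, D)
se_reduced = residual_se(reduced, y, fit_ols(reduced, y))  # f(B, D)
print(f"residual SE, f(A,B,D): {se_full:.3f}")
print(f"residual SE, f(B,D):   {se_reduced:.3f}")
```

In this toy setup the residual standard error of f(B,D) comes out larger than that of f(A,B,D) because A carries real signal, and since that error drives the width of any interval around an estimate, this is the "wider confidence interval" behavior I have in mind.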
In my actual problem, the data set elements are buildings, the X variable is a given year's energy consumption for each building, and I have a great many other variables I can supply, including footprint square footage, number of stories, land use type (roughly 200 mostly-not-hierarchical categories), annual heating and cooling degree days at the building's location, and construction year. My expectation is that some of the variables will prove to have a very high bearing on energy consumption, others will not, and yet others may have a significant bearing even though their connection to energy consumption is not at all readily apparent.
Of the entire territory whose building energy consumption concerns me, I have the full set of variables only for a contiguous area comprising roughly 1/10th of the territory, and I can obtain actual annual energy consumption (X) only for a subset of that area's buildings. My hope is that statistical techniques will let me leverage the buildings for which I have many variables, including energy consumption, to estimate 1) energy consumption for the rest of that contiguous area's buildings and 2) energy consumption for the other 9/10ths of the territory, where I have no energy consumption data and only a few of the other variables. I may be able to get real energy consumption data for some buildings where I have only a few variables rather than the complete set, if that helps (even if just to increase the sample size).
Is what I'm trying to do possible and does it have a name? Is this what ANOVA/MANOVA are for?