A multivariate data problem in search of a technique

Question

About four years ago I took a graduate course in multivariate statistics. I'm now faced with a data problem that rings a bell about something covered toward the end of the semester, but even after going back through my old homework I'm having trouble determining which technique I'm looking for. Secondarily, I'd like to implement whatever technique it is in R, but I'll settle for just being pointed in the right direction.

Here is a simplified statement of the problem.

Suppose I have a data set where each entry describes an object and for each entry there are several variables. I'll arbitrarily name those variables X, A, B, C, and D, some of which may be numerical quantities and some may be categories:

 X | A | B | C | D
---|---|---|---|---
4.3|265| Y |red|7.1
3.2|740| N |grn|9.0
2.2|655| Y |wht|8.2
...|...|...|...|...

Through some mathematical process performed on this data, a function f is established such that X = f(A,B,C,D), with the understanding that here by "=" I mean "a value and some confidence interval" and not strict equality.

Suppose now that the aforementioned data set is a subset of a larger data set in which the X variable is missing, and so are some of the other variables (always the same variables for all elements of the data set), such that e.g. f(B,D) gives me an estimated value for X, with what I would expect to be a wider confidence interval than I would have obtained had I also supplied fields A and C.

In my actual problem, the data set elements are buildings, the X variable is a given year's energy consumption for each building, and I have a great many other variables I can supply, including footprint square footage, number of stories, land use type (roughly 200 mostly-not-hierarchical categories), annual heating and cooling degree days at the building's location, and construction year. My expectation is that some of the variables will prove to have a strong bearing on energy consumption, others will not, and yet others may have a significant bearing even though the connection is not at all readily apparent.

Of the entire territory whose buildings' energy consumption concerns me, I have the full set of variables only for a contiguous area comprising roughly 1/10th of the territory, and I can obtain actual annual energy consumption (X) for only a subset of that area's buildings. My hope is that through statistical techniques I can leverage the data where I have many variables, including energy consumption, to estimate 1) energy consumption for the rest of that contiguous area's buildings and 2) energy consumption for the other 9/10ths of the territory, where I don't have energy consumption data and have only a few of the other variables. I may be able to get real energy consumption data for buildings where I have only a few variables rather than the complete set (less energy consumption), if that helps at all (even if just to increase the sample size).

Is what I'm trying to do possible and does it have a name? Is this what ANOVA/MANOVA are for?

score 0 · Answer 1 · edited Apr 13 '17 at 12:44

> Is what I'm trying to do possible and does it have a name?

Yes. I would call this "multiple regression with missing data" (one response, several predictors; it is often loosely called "multivariate regression").

> Is this what ANOVA/MANOVA are for?

There are some key differences: ANOVA and MANOVA are used when the predictors are categorical. You seem to have some continuous predictors. Also, ANOVA is usually thought of as an inference tool, not a prediction tool. You need predictions, not parameter inferences.

Before I continue, be warned that inferences made using data from one area may not apply to other areas, and missing data may not have similar characteristics to data that are present. These sources of uncertainty will not disappear if you ignore them.

I'd suggest you use a linear model or a generalized linear model. Within this framework, you have several options for handling missing data and selecting variables. For variable selection, one popular option is LASSO regularization, which is discussed elsewhere on CV. There are also AIC and its descendants (like AICc). For missing data, you could:

1. Train the model separately for each subset of variables where you need predictions.
2. Impute the data.
3. Use built-in features to handle missing values; R has some facilities for this.
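
As a minimal sketch of the first option, assuming ordinary least squares and a handful of invented building records (the numbers and the helper names `solve`, `fit_ols`, `predict`, and `rss` are hypothetical, not from the question), one model can be fit on all predictors and another on only the predictors that are available elsewhere. It is written in plain Python rather than R to stay dependency-free.

```python
# Toy data, invented for illustration: footprint area and stories -> energy use.
area = [100.0, 150.0, 200.0, 250.0, 300.0, 120.0]
stories = [1.0, 2.0, 1.0, 3.0, 2.0, 4.0]
energy = [216.0, 318.0, 417.0, 524.0, 621.0, 269.0]


def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        # Partial pivoting: bring the largest remaining entry to the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_ols(rows, y):
    """Least-squares coefficients (intercept first) via the normal equations."""
    design = [[1.0] + list(r) for r in rows]
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * t for d, t in zip(design, y)) for i in range(k)]
    return solve(xtx, xty)


def predict(coef, row):
    """Predicted response for one row of predictor values."""
    return coef[0] + sum(c * v for c, v in zip(coef[1:], row))


def rss(coef, rows, y):
    """Residual sum of squares of a fitted model on the given rows."""
    return sum((t - predict(coef, r)) ** 2 for r, t in zip(rows, y))


# Model 1: for buildings where both predictors are known.
full_rows = list(zip(area, stories))
full_coef = fit_ols(full_rows, energy)

# Model 2: for buildings where only the footprint area is known.
reduced_rows = [(a,) for a in area]
reduced_coef = fit_ols(reduced_rows, energy)

rss_full = rss(full_coef, full_rows, energy)
rss_reduced = rss(reduced_coef, reduced_rows, energy)
print("full model RSS:", rss_full)
print("area-only model RSS:", rss_reduced)
```

Each building is then scored with whichever model matches the variables it actually has. The reduced model cannot fit the training data better than the full one, and its larger residual sum of squares is in line with the expectation of wider intervals when fewer fields are supplied. In R, the same idea would amount to two `lm()` calls with different formulas.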

Regression is a rabbit hole, and so is missing data, and so is model selection. Good luck.

Thank you, Eric. Will a linear or generalized linear model be able to make use of only continuous predictors and not the categorized ones? Or, at least where a categorized predictor has just two values, can one "linearize" that (i.e., give it a scalar multiple)? I also assume that if I have reason to expect that one of my inputs is related to the output in a nonlinear way - suppose that I assume that building energy consumption goes as the square of the building footprint area - I'd need to force that factor to linearity by using its square root? — WatcherOfAll, Dec 04 '15 at 17:22
GLMs can use binary or categorical predictors by forming [dummy variables](https://www.moresteam.com/whitepapers/download/dummy-variables.pdf). RE: nonlinearity, that seems like a nice solution. — eric_kernfeld, Dec 05 '15 at 22:44
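
A minimal sketch of the dummy-variable idea from the comment above, using the three example rows from the question. The choice of `grn` as the baseline level and the helper names `dummy_levels` and `encode` are assumptions made for illustration. It also shows one common way to handle a suspected quadratic effect: add the squared predictor as an extra column, so the model stays linear in its coefficients.

```python
# The three example rows from the question's table (columns A, B, C only).
records = [
    {"A": 265.0, "B": "Y", "C": "red"},
    {"A": 740.0, "B": "N", "C": "grn"},
    {"A": 655.0, "B": "Y", "C": "wht"},
]


def dummy_levels(values, baseline):
    """Return the non-baseline category levels, in sorted order."""
    return sorted(set(values) - {baseline})


def encode(record, c_levels):
    """Build one numeric design-matrix row from a mixed-type record."""
    row = [record["A"], record["A"] ** 2]           # linear and squared terms
    row.append(1.0 if record["B"] == "Y" else 0.0)  # binary predictor as 0/1
    # One 0/1 column per non-baseline level of the categorical predictor.
    row.extend(1.0 if record["C"] == lvl else 0.0 for lvl in c_levels)
    return row


# Assumption: "grn" is the reference level, so it gets no column of its own.
c_levels = dummy_levels([r["C"] for r in records], baseline="grn")
design = [encode(r, c_levels) for r in records]

for row in design:
    print(row)
```

A categorical predictor with k levels becomes k - 1 columns here, because the baseline level is absorbed by the intercept. In R, `lm()` and `glm()` build such columns automatically for factor predictors.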
