
Assume that we have 30 features (inputs), each of which can potentially influence the result (output). We try to use the available ("observed") mapping from features to targets to develop a predictive model that uses only some of the features. Each feature can either be included in the model's input or not, which means that in total we have $2^{30}$ combinations of arguments to use. Note that the model itself is fixed; I do not change the model (in a broad sense), I only decide which variables to use.

Now suppose we have considered several thousand combinations, and for each combination we have a measure of how good it is (computed, of course, on the test set). For example, we know that if we take variables 1, 7, 13, and 27, we get an error of 123.456. If we use variables 3, 4, 10, 11, 13, 27, and 29, we get an error of 23.456.

My question is: how can I summarize the knowledge about the performance of the model as a function of the arguments used?

In other words, I need a "meta-model" that describes the performance of my models. I need this because I want to use the obtained knowledge to search the huge space of potential models more intelligently.

More specifically, I have a mapping from binary vectors (of length $N$) to real (float) values. Each binary vector indicates whether the corresponding variable is used, and the real output describes the performance of the corresponding model. Now I want to use this "meta-data" to train a "meta-model". But before I start the "meta-training" I would like to decide what kind of model is a good one to train. This model should capture concepts such as "information stored in variables". For example, I can imagine that variable 7 adds no information beyond the combination of variables 3 and 5. Or we might have "orthogonal" variables that cover the target space "independently".
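To make the setting concrete, here is a minimal sketch of the meta-data and a first-cut meta-model (Python with scikit-learn; the masks and error values here are synthetic placeholders, in reality they come from the several thousand evaluated combinations):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
N_FEATURES, N_EVALUATED = 30, 5000

# Placeholder meta-data; in reality each row is one of the evaluated
# feature-inclusion masks and each target is its observed error.
masks = rng.integers(0, 2, size=(N_EVALUATED, N_FEATURES))
errors = rng.normal(size=N_EVALUATED)

# First-cut meta-model: linear regression with pairwise interaction
# terms, so redundancy between variables (e.g. variable 7 adding nothing
# beyond variables 3 and 5) can show up in the interaction coefficients
# instead of being invisible to a purely additive model.
design = PolynomialFeatures(degree=2, interaction_only=True,
                            include_bias=False).fit_transform(masks)
meta_model = LinearRegression().fit(design, errors)

print("main effects:", meta_model.coef_[:N_FEATURES])
```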

Roman
  • This is a very broad topic. You can get started in learning about it by searching our site for [model selection](http://stats.stackexchange.com/search?q=model+selection). BTW, there are $2^{30}$ possible combinations of features (without even considering interactions or higher-order relationships), which is much greater than $30^2$. – whuber Nov 17 '14 at 17:37
  • It was a typo with $30^2$, and I think the question is much more specific than the general model-selection question. I am not asking here about how to use training and testing sets, how to validate a model, or what measure to use. My question is more specific: I would like to know what a good model is for summarizing knowledge about the performance of another model. In other words, I am searching for a meta-model. – Roman Nov 18 '14 at 07:54
  • After the edit, this is actually a really interesting question. I suspect you can just start with linear regression but I'd be curious to know if there's a "right" answer. – shadowtalker Nov 18 '14 at 08:30
  • Once you're at this point with your $2^{30}$ scores, there's an optimization problem. If this is too large a space to search exhaustively, you need a search heuristic. For example, add/subtract variables one at a time, always choosing the optimal next one to add/subtract (see the sketch after these comments). – awcc Nov 20 '14 at 08:40
  • The more interesting question is whether scoring all $2^{30}$ models on this one training set is reasonable (probably not if it's small), and what else to do if it isn't. Also, and maybe this is what you're getting at, you can look to some meta-model that blends the various choices depending on how well they do. This seems pretty interesting. – awcc Nov 20 '14 at 08:43
  • Is there some reason why you need to examine the $2^{30}$ possibilities in parallel, as the question seems to suggest, rather than examine the 30 variables serially with a procedure like the LASSO? The pattern of predictor coefficients as more variables are introduced sequentially illustrates "information stored in the variables," in the context of all the variables that could be chosen for inclusion. – EdM Nov 20 '14 at 19:28
  • What is the purpose of your meta-model? Should it (a) predict the accuracy as accurately as possible, or (b) be human-understandable? – user31264 Nov 20 '14 at 22:54
  • @user31264, both. I would like to know how much information each variable contains, as well as to detect cases when one or more variables contain (almost) the same information as some other set of variables. – Roman Nov 21 '14 at 10:16
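A minimal sketch of the greedy add/subtract heuristic from the comment above (Python; `score` is a toy stand-in for training the fixed model on a subset and returning its test error, lower is better):

```python
import numpy as np

# Toy stand-in for the real evaluation: pretend variables 3, 4 and 10
# are informative and every other variable adds noise (illustrative only).
INFORMATIVE = {3, 4, 10}

def score(mask):
    used = set(np.flatnonzero(mask))
    return 100.0 - 25.0 * len(used & INFORMATIVE) + 0.5 * len(used - INFORMATIVE)

def greedy_forward(n_features=30):
    """Add the single variable that most reduces the error; stop when
    no single addition helps (the heuristic suggested in the comments)."""
    mask = np.zeros(n_features, dtype=bool)
    best = score(mask)
    improved = True
    while improved:
        improved = False
        for i in np.flatnonzero(~mask):
            trial = mask.copy()
            trial[i] = True
            if score(trial) < best:
                best, best_i, improved = score(trial), int(i), True
        if improved:
            mask[best_i] = True
    return mask, best

mask, err = greedy_forward()
print("selected:", np.flatnonzero(mask), "error:", err)
```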

2 Answers


In case you are interested in a linear model, here's a direction that hasn't been mentioned in the comments or other answers:

Let $x \in \mathbb{R}^{30}$ be the feature vector, and suppose you want to learn a linear mapping $y = w \cdot x$ that maps $x$ to a value (which can be transformed into a probability using logistic regression, from which you can do anything).

I assume that there's a cost involved in obtaining many features or using all of them for prediction (otherwise, why not use all of them?).

What you can do is regularize the learning process with the $L_1$ norm. It is known that $L_1$ regularization results in sparser models (i.e., a weight vector $w$ with lots of zeros in it), which is what you want.

Specifically, you would minimize the following: $$ \textrm{RegularizedLoss}(w,\lambda) = \textrm{Loss}(w) + \lambda \cdot \|w\|_1,$$ where $\textrm{Loss}(w)$ is the logistic loss. There are efficient ways to do this (for example, Vowpal Wabbit can handle $L_1$ regularization).

The specific value of $\lambda$ controls the tradeoff between accuracy and sparsity, i.e. how much loss in accuracy you are willing to accept in return for a model that uses fewer features.
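For illustration, here is a minimal sketch of that tradeoff using scikit-learn instead of Vowpal Wabbit, on synthetic data (note that scikit-learn parameterizes the penalty as $C = 1/\lambda$, so smaller $C$ means a stronger penalty and a sparser model):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in: 30 features, only 8 of them informative.
X, y = make_classification(n_samples=2000, n_features=30,
                           n_informative=8, random_state=0)

# Sweep the regularization strength and watch the sparsity change.
for C in [0.01, 0.1, 1.0]:
    clf = LogisticRegression(penalty="l1", solver="liblinear", C=C).fit(X, y)
    n_used = np.count_nonzero(clf.coef_)
    print(f"C={C}: {n_used} features used, "
          f"train accuracy {clf.score(X, y):.3f}")
```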

HTH.

Amir

Whenever I approach an optimization problem whose objective function I cannot write down explicitly but can evaluate, I use some heuristic method.

Since your search space is big (but not BIG big; 30 features is not a lot), I would go with a genetic algorithm. This could be quite simple: the natural representation of your problem is just a binary vector, so standard masked crossover and bit-flip mutation may give quite good results.
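A minimal sketch of what I have in mind (Python; the `score` function is a toy stand-in for training the fixed model on the selected subset and returning its test error):

```python
import numpy as np

rng = np.random.default_rng(0)
N, POP, GENS = 30, 40, 50

# Toy stand-in for the real evaluation (lower error is better); in
# practice this would train the fixed model on the selected subset.
INFORMATIVE = {3, 4, 10, 11, 13}

def score(mask):
    used = set(np.flatnonzero(mask))
    return 100.0 - 15.0 * len(used & INFORMATIVE) + len(used - INFORMATIVE)

def crossover(a, b):
    """Masked (uniform) crossover: each bit comes from a random parent."""
    m = rng.integers(0, 2, size=N).astype(bool)
    return np.where(m, a, b)

def mutate(child, rate=1.0 / N):
    """Bit-flip mutation: flip each bit independently with prob. `rate`."""
    return child ^ (rng.random(N) < rate)

pop = rng.integers(0, 2, size=(POP, N)).astype(bool)
for _ in range(GENS):
    errs = np.array([score(ind) for ind in pop])
    parents = pop[np.argsort(errs)[: POP // 2]]  # truncation selection
    pop = np.array([
        mutate(crossover(parents[rng.integers(len(parents))],
                         parents[rng.integers(len(parents))]))
        for _ in range(POP)
    ])

best = min(pop, key=score)
print("best subset:", np.flatnonzero(best), "error:", score(best))
```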

As for your question specifically: you can summarize this knowledge in the form of the GA itself. You do not need a human-readable representation of your knowledge, so the actual knowledge will be encoded in the specimen representation and the evaluation function.

Disclaimer: this is my intuition, close to guessing (but backed by some experience in heuristic optimization). I cannot point you to any papers that support this idea, but I'm pretty sure you'll find something on it.

  • It is clear that an alternative to exhaustive search is non-exhaustive search with some optimization technique (for example, genetic optimization). However, my question was about what would be an appropriate way to summarize our knowledge about the importance of the arguments (how informative they are). – Roman Nov 20 '14 at 11:19