Ways to extract patterns yielding high scores

Question

Suppose I have a table containing several features and a score denoting the performance (higher is better) of the corresponding feature values, like this:

| feat1 | feat2 | feat3 | ... | score |
|-------|-------|-------|-----|-------|
| 'a'   | 256   | 490   | ... | 0.336 |
| 'a'   | 128   | 469   | ... | 0.614 |
| 'b'   | 64    | 533   | ... | 0.826 |
...

Imagine these to be records of, e.g., hyperparameters of an ML model and its accuracy, settings of a denoising filter and the denoising score, or adjustments of a bike and the resulting comfort level. The table might contain thousands of rows and 10–100 features.

I am interested in two things:

1. What features (columns) affect the score?
2. What feature values (rows) lead to high scores?

So I am not asking to estimate the performance of a new setting (as a regression would do); I am asking the other way around: what settings lead to high scores? (By the way: is there a name for this type of task? I assume it is neither classification, regression, nor clustering, but what is it called?)

What are (beginner-friendly) methods to answer questions 1 and 2? For example, I could just put this data into a spreadsheet and sort by descending score, but I might not be able to see a common pattern among the top rows.

I am sure there are statistical methods for finding the patterns I am looking for, but I lack the vocabulary to search for them properly. (I appreciate any clarifications regarding vocabulary and also references to literature.)

Additional note: Some of the features are only nominal or ordinal (they are not purely cardinal).

BruceET · Answer 1 · 2019-07-30T19:14:59.420

First, in case you just want to make strategic use of the overall score: A simple place to start for (1) would be to find Spearman correlations between the 'overall score' and the values in each of the other columns (the 'features'). The most important columns would be the ones with Spearman correlation nearest to $1$ in absolute value; a correlation near $-1$ marks a feature whose larger values go with lower scores.

Notes: (a) If you have access only to ordinary (Pearson) correlations, they might be useful, but they reflect linear relationships, and some important relationships might not be strictly linear. (b) For Feature 1, you would have to make sure your program recognizes the values as ordinal (numerical for Pearson); perhaps use something like a = 5, b = 4, etc.
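To make the column-by-column comparison concrete, here is a minimal sketch in plain Python. It computes Spearman correlation as the Pearson correlation of the ranks, with tied values sharing their average rank. The first three rows are the ones shown in the question; the remaining rows and the coding c = 3 are made up for illustration. If pandas is available, `DataFrame.corr(method="spearman")` should give the same numbers with far less code.

```python
from math import sqrt

def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the current run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Ordinary (Pearson) correlation; NaN if either column is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return float("nan")
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman correlation is the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Ordinal coding for feat1 as in note (b): a = 5, b = 4; c = 3 is an assumption.
FEAT1_CODE = {"a": 5, "b": 4, "c": 3}

# (feat1, feat2, feat3, score): first three rows are from the question,
# the last three are invented toy data.
rows = [
    ("a", 256, 490, 0.336),
    ("a", 128, 469, 0.614),
    ("b", 64, 533, 0.826),
    ("b", 32, 501, 0.900),
    ("c", 512, 478, 0.100),
    ("c", 16, 520, 0.950),
]

columns = {
    "feat1": [FEAT1_CODE[r[0]] for r in rows],
    "feat2": [r[1] for r in rows],
    "feat3": [r[2] for r in rows],
}
score = [r[3] for r in rows]

correlations = {name: spearman(vals, score) for name, vals in columns.items()}

# List the strongest monotone associations first, regardless of sign.
for name, rho in sorted(correlations.items(), key=lambda kv: -abs(kv[1])):
    print(f"{name}: {rho:+.3f}")
```

In this toy data the score falls steadily as feat2 grows, so feat2 ends up at the top of the list with a strongly negative correlation. This coding only makes sense for ordinal features; for a purely nominal feature the order of the codes is arbitrary and so is the resulting correlation.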

As for (2), I'm not sure I understand what you mean by 'setting'. Maybe you should find the highest 1% or 5% of overall scores in the last column, and then see which rows are involved and what they have in common.
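A rough way to see what the top rows "have in common" is to compare how often each feature value occurs among the top 5% with how often it occurs in the whole table. The sketch below does that with the standard library only; the function names and the synthetic rows are hypothetical, chosen so that the value 'b' of feat1 goes with high scores.

```python
from collections import Counter

def top_rows(rows, fraction=0.05, score_key="score"):
    """Return the highest-scoring `fraction` of rows (at least one row)."""
    k = max(1, round(len(rows) * fraction))
    return sorted(rows, key=lambda r: r[score_key], reverse=True)[:k]

def value_shares(rows, feature):
    """Share of rows taking each value of `feature`."""
    counts = Counter(r[feature] for r in rows)
    return {value: count / len(rows) for value, count in counts.items()}

def lift_in_top(rows, feature, fraction=0.05):
    """Ratio of a value's share among the top rows to its share overall.

    A ratio above 1 means the value is over-represented among top scorers.
    """
    top = value_shares(top_rows(rows, fraction), feature)
    overall = value_shares(rows, feature)
    return {value: top[value] / overall[value] for value in top}

# Synthetic data: every fourth row has feat1 = 'b' and a high score.
rows = []
for i in range(100):
    feat1 = "b" if i % 4 == 0 else "a"
    base = 0.9 if feat1 == "b" else 0.3
    rows.append({"feat1": feat1, "score": base + i / 1000})

lift = lift_in_top(rows, "feat1", fraction=0.05)
for value, ratio in lift.items():
    print(f"feat1 = {value!r}: {ratio:.1f}x as common in the top 5% as overall")
```

For numerical features with many distinct values, it would probably help to group the values into a few ranges first, so that each group contains enough rows for the shares to mean something.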

Second, if you have access to the method used to get the formula for making the overall scores, you may be able to figure out which columns in the table were most influential. (You couldn't just look at the sizes of the coefficients for the various columns because those will depend heavily on the units and variability of the numbers in the columns.)
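The caveat about coefficient sizes can be illustrated with a small, made-up example: the same hypothetical feature recorded in metres and in millimetres gets least-squares slopes that differ by a factor of 1000, although the relationship with the score is identical. Standardizing both variables (subtracting the mean and dividing by the standard deviation) removes that dependence on units.

```python
from statistics import mean, pstdev

def slope(x, y):
    """Least-squares slope of y on a single predictor x."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def standardize(values):
    """Rescale to mean 0 and standard deviation 1."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]

# Invented data: one feature in two different units, and a score.
length_m = [1.0, 2.0, 3.0, 4.0, 5.0]
length_mm = [1000 * v for v in length_m]
score = [0.2, 0.35, 0.5, 0.7, 0.8]

raw_m = slope(length_m, score)
raw_mm = slope(length_mm, score)
std_m = slope(standardize(length_m), standardize(score))
std_mm = slope(standardize(length_mm), standardize(score))

print("raw slopes:         ", raw_m, raw_mm)    # differ by the unit factor
print("standardized slopes:", std_m, std_mm)    # agree up to rounding
```

With a single predictor the standardized slope equals the Pearson correlation, which is one reason correlations are a convenient first look when the scoring formula itself is not available.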

Moreover, if the makers of the score had some ideal 'dependent' (i.e., predicted) variable in mind and used a regression procedure to make a score that imitates that ideal variable, then the regression output may make it clear which variables influenced the score most heavily. To some extent, the ability to discover this would depend on the style of the regression output. If you get to the point of doing this, you can show some regression output in another question and ask for help interpreting it.

I replaced the term "settings" by "feature values". As far as I understand, I can use the Spearman correlations to see if there are linear relationships between individual features and the score. Are there methods to consider multiple features at once, in case the analysis of individual features does not give further insights? Anyway: The scenario you describe by **First** is exactly what I am looking for – making strategic use of the score. I do not have access to the method used to get the formula. — user3389669, Jul 31 '19 at 07:28
Pearson correlation looks just at the linear component of association; Spearman correlation is more general: see the first example [here](https://stats.stackexchange.com/questions/419587/when-does-the-sum-of-the-medians-the-median-of-the-sum/419758#419758). I had in mind comparing two variables (columns) at a time to start. Regression methods can be used to look at effects of several at a time, but that is more subtle. My guess is you'll detect the most important/interesting connections with Spearman correlations. — BruceET, Jul 31 '19 at 08:16
Pearson correlation looks just at the linear component of association; Spearman correlation is more general: see the first example [here](https://stats.stackexchange.com/questions/419587/when-does-the-sum-of-the-medians-the-median-of-the-sum/419758#419758). Also [here](https://stats.stackexchange.com/questions/366326/is-spearman-correlation-never-greater-than-pearson-correlation). I had in mind comparing 2 variables (columns) at a time. Regression methods can be used to look at effects of several at a time, but that is more subtle. Hope you'll detect the most important connections w/ Spearman. — BruceET, Jul 31 '19 at 08:22
