4

I have a lot of data on previous race history and I'm trying to predict a percentage chance of winning the next race using Regression, kNN, and SVM learning algorithms.

Say a race has 5 runners, and each runner has a previous best course time of, say $T_i$ (seconds).

I've also introduced an additional input for the RANK of each runner's previous best course time among the 5 runners, taking values from 0 to 1: $R_i = 1 - \frac{T_i - T_{\min}}{T_{\max} - T_{\min}}$ (so the fastest previous time maps to 1 and the slowest to 0).
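
For concreteness, here is how I compute that scaled value for one race (the numbers below are just made up to illustrate):

```python
# Illustrative only: scale each runner's previous best time T_i to [0, 1],
# so the fastest runner gets 1 and the slowest gets 0.
best_times = [58.2, 59.0, 60.5, 61.1, 63.4]  # hypothetical T_i values in seconds

t_min, t_max = min(best_times), max(best_times)
scaled = [1 - (t - t_min) / (t_max - t_min) for t in best_times]

print(scaled)  # [1.0, 0.846..., 0.557..., 0.442..., 0.0]
```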

My question is: does introducing both the absolute best course time and rank best course time cause any problems?

I understand that these inputs are likely to correlate. However, if someone runs a world-record time they are more likely to win easily, and that information is lost if I use the rank input alone, which would simply assign them a rank of 1.

  • mbq - thank you for editing my formulae. Are you able to direct me on how to do this properly for future posts? I'm used to using [code] ... [/code] but couldn't find the proper method here. Thank you! –  Mar 14 '11 at 15:37
  • This is a Markdown-enabled website with $\LaTeX$ support; further info can be found here: http://stats.stackexchange.com/editing-help. – chl Mar 14 '11 at 17:00

2 Answers

5

It depends on the classifier. Some classifiers (such as Naive Bayes) explicitly assume feature independence, so they might behave in unexpected ways. Other classifiers (such as SVM) care about it much less. In image analysis it is routine to throw thousands of highly correlated features at SVM, and SVM seems to perform decently.

For kNN, adding correlated features artificially inflates their importance. Suppose you have two features, best course time and coach experience; they influence the distance equally. Now add a third feature, 'course time multiplied by two'. This is essentially a replica of the first feature, but kNN doesn't know that, so course time now influences the distance computation more heavily. Whether you want this or not depends on the task; you probably don't want features to influence the distance more just because you thought of more "synonyms" for them.
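
A tiny sketch of that effect with made-up numbers (Euclidean distance, as kNN typically uses):

```python
import numpy as np

# Two runners described by (best course time, coach experience), hypothetical units.
a = np.array([60.0, 5.0])
b = np.array([62.0, 9.0])

# Distance with the two original features.
d_original = np.linalg.norm(a - b)

# Add a redundant feature: course time duplicated (e.g. "time * 2" carries
# the same information). The time difference now enters the distance twice.
a_dup = np.append(a, a[0] * 2)
b_dup = np.append(b, b[0] * 2)
d_duplicated = np.linalg.norm(a_dup - b_dup)

print(d_original, d_duplicated)  # ~4.47 vs 6.0: course time now dominates more
```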

A compromise might be to perform feature selection first and then use kNN. This way two "synonyms" of the same feature will be retained only if both are important.
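
As a rough sketch of that compromise, assuming you already have a feature matrix `X` and win/lose labels `y` (the scikit-learn pipeline below is purely illustrative, not a tuned model):

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.neighbors import KNeighborsClassifier

# Keep only the k features with the strongest univariate relationship to the
# class, then let kNN compute distances on the reduced feature set.
model = Pipeline([
    ("scale", StandardScaler()),            # put features on a comparable scale
    ("select", SelectKBest(f_classif, k=5)),
    ("knn", KNeighborsClassifier(n_neighbors=10)),
])
# model.fit(X_train, y_train); model.predict_proba(X_test)  # with your own data
```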

SheldonCooper
  • @SheldonCooper - "In image analysis it is routine to throw thousands of highly correlated features at SVM, and SVM seems to perform decently." - I would love to know more about how you do this without running out of memory. SVMs have always performed quite well for me in prediction accuracy, but with 200 input variables and over 100k samples, SVM in Clementine and SPSS quickly exceeds memory limits. This may well be another thread, but I would love to know how to balance SVM performance against large dimensionality and sample sizes. –  Mar 15 '11 at 00:56
  • I'm not sure about SPSS. I've used libsvm with thousands of features without any problem. People have used it with millions of features (although the features in those cases are sparse, i.e. only some are active in any given instance). – SheldonCooper Mar 15 '11 at 01:15
2

I think it depends on whether the purpose of your model is descriptive (e.g., considering variable importance or hypothesis tests in the regression) or purely predictive. If it is the former, then input features that are strongly correlated will certainly create difficulties in making inferences about how variables affect the output and relate to each other. For example, can you say variable 1 is the most important if it shares most of its variance with variable 2?

Even in regression, multicollinearity does not bias the coefficient estimates; it only inflates their standard errors (the estimates are still those that minimize the squared error), so the predictions are fine.
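
A small synthetic simulation of this point (none of this is your data; it just shows that the fitted values barely change while the standard errors of the collinear coefficients blow up):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)   # nearly collinear with x1
y = 2 * x1 + rng.normal(size=n)

both = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()
one = sm.OLS(y, sm.add_constant(x1)).fit()

print(both.bse)   # large standard errors on the two collinear coefficients
print(one.bse)    # much smaller standard error when only x1 is used
print(np.corrcoef(both.fittedvalues, one.fittedvalues)[0, 1])  # ~1: predictions agree
```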

I tend to consider collinearity between inputs not that big an issue when building a predictive model. The best (and really the only) way to know for certain is to build one model with both variables and another with only the variable most strongly related to the target, and see which produces the better predictions on new data.
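
For example, that comparison could look roughly like this (synthetic stand-in data, since I don't have your race history; 5-fold cross-validation stands in for "new data"):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in: best course time, its scaled rank, and a win/lose label.
rng = np.random.default_rng(1)
time = rng.normal(60, 2, size=500)
rank = 1 - (time - time.min()) / (time.max() - time.min())
y = (time + rng.normal(0, 1, size=500) < 60).astype(int)

X_both = np.column_stack([time, rank])
X_time_only = time.reshape(-1, 1)

clf = LogisticRegression(max_iter=1000)
print(cross_val_score(clf, X_both, y, cv=5).mean())       # both variables
print(cross_val_score(clf, X_time_only, y, cv=5).mean())  # strongest variable only
```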

B_Miner
  • Thanks for your response; am I correct in summarising as follows? If I care only about prediction accuracy and not about the reasons behind a prediction, I can introduce multiple inputs that are various functions of other inputs? But if I want to explain which input factors influence a prediction, then multicollinearity is very important? Does this hold for all classifier methods? –  Mar 14 '11 at 15:18
  • @osknows it is my opinion that regression (linear, generalized linear) is the domain where multicollinearity is an issue, and it is an issue *largely* for inference, hypothesis tests, etc. As for other types of classifiers where predictive accuracy is the key interest, one should be careful about throwing the kitchen sink at the algorithm and hoping it figures it all out. If prediction is what you are concerned with, I would suggest experimenting and seeing how the inclusion of correlated variables affects the accuracy. As in all things, there is no single answer; it depends on the degree of correlation. – B_Miner Mar 14 '11 at 17:07
  • Also, if you are concerned and the variables are continuous, you can try a PCA (or related) analysis first to create features that are independent. Your description (using a best time and a rank of that time) doesn't sound like near-perfect collinearity, does it? Two runners could both have a rank of 1 and yet very different times, depending on who else was in their race. One other idea is to run the analysis with a boosted decision tree or random forest and examine the variable importance ranking to see whether both variables are important (a rough sketch of both ideas is below)... – B_Miner Mar 14 '11 at 17:12
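
A rough sketch of the two ideas mentioned in the last comment (PCA to decorrelate continuous inputs, and a random forest's variable importance ranking), again using synthetic stand-in data:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in: two correlated continuous inputs and a win/lose label.
rng = np.random.default_rng(2)
time = rng.normal(60, 2, size=500)
rank = 1 - (time - time.min()) / (time.max() - time.min())
X = np.column_stack([time, rank])
y = (time + rng.normal(0, 1, size=500) < 60).astype(int)

# Idea 1: PCA rotates the correlated inputs into uncorrelated components.
X_pca = PCA(n_components=2).fit_transform(X)
print(np.corrcoef(X_pca[:, 0], X_pca[:, 1])[0, 1])  # ~0: components are uncorrelated

# Idea 2: a random forest's importance ranking shows whether both original
# variables carry useful signal.
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
print(forest.feature_importances_)  # [importance of time, importance of rank]
```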