
I am looking for some intuition about the impact of features on the accuracy of a classification algorithm. I compute accuracy by performing a 50-50 training-testing split on the dataset. The classification algorithm I am using is kNN. I have two questions:

  1. Assume I have 2 features, and say I get an accuracy of 10% with feature 1 alone and an accuracy of 20% with feature 2 alone. For simplicity, I assume that the two features are independent. Further, the features have been scaled. What accuracy should I expect when I use both features together? Are there any limits (a minimum/maximum range)? Is there any theoretical support for such limits?

Specifically, can the accuracy improve drastically (compared to the accuracy with a single feature) when both features are used together? In one of my datasets, I see 4% accuracy individually with each feature, but together they give greater than 40% accuracy. (The first sketch below illustrates this effect.)

  2. Taking (1) further, is it possible that the accuracy in fact degrades (to worse than 20%) when I use both features together? In other words, under what circumstances might the accuracy degrade when more information is provided? (The second sketch below illustrates this.)
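
To make the first question concrete, here is a minimal synthetic sketch using scikit-learn (the XOR-style data and all parameters are made-up assumptions, not my real dataset): each feature alone sits at chance level, but both together separate the classes almost perfectly, because the class structure exists only in the joint distribution.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
n = 2000
X = rng.uniform(-1, 1, size=(n, 2))              # two independent, identically scaled features
y = ((X[:, 0] > 0) ^ (X[:, 1] > 0)).astype(int)  # XOR-style class labels

def knn_accuracy(features, labels):
    # 50-50 training-testing split, then 5-NN accuracy on the held-out half
    X_tr, X_te, y_tr, y_te = train_test_split(features, labels, test_size=0.5, random_state=0)
    clf = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)
    return accuracy_score(y_te, clf.predict(X_te))

print(knn_accuracy(X[:, [0]], y))  # feature 1 alone: ~0.5 (chance level)
print(knn_accuracy(X[:, [1]], y))  # feature 2 alone: ~0.5 (chance level)
print(knn_accuracy(X, y))          # both together: close to 1.0
```

This suggests there is no useful upper bound derived from the single-feature accuracies: the joint accuracy can jump far above both.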
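And for the second question, a similar sketch (again with made-up parameters) in which feature 2 is pure noise at the same scale as the informative feature 1: the noise dilutes the Euclidean distances, and the kNN accuracy typically drops somewhat below what feature 1 achieves alone (more noise features would degrade it further).

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
n = 2000
y = rng.integers(0, 2, size=n)
f1 = rng.normal(loc=2.0 * y - 1.0, scale=1.0, size=n)  # informative: class means -1 and +1
f2 = rng.normal(loc=0.0, scale=1.0, size=n)            # pure noise, same scale as f1

def knn_accuracy(features, labels):
    # 50-50 training-testing split, then 5-NN accuracy on the held-out half
    X_tr, X_te, y_tr, y_te = train_test_split(features, labels, test_size=0.5, random_state=0)
    clf = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)
    return accuracy_score(y_te, clf.predict(X_te))

print(knn_accuracy(f1.reshape(-1, 1), y))          # informative feature alone: ~0.8
print(knn_accuracy(np.column_stack([f1, f2]), y))  # with the noise feature added: a bit lower
```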

Thanks, Girish

  • Welcome to CV.SE! Related question [here](https://stats.stackexchange.com/questions/18815/increasing-number-of-features-results-in-accuracy-drop-but-prec-recall-increase) and a related search on CV.SE [here](https://stats.stackexchange.com/search?q=feature%20selection). – LmnICE Sep 08 '20 at 16:47
  • The information that the features are uncorrelated doesn't really imply anything here. Features can be strongly dependent (nonlinear dependence) even if uncorrelated, and this can have a big impact on kNN classification. It may imply something (I haven't thought it through) if you assume the features are independent rather than merely uncorrelated, which you may want to do; uncorrelatedness alone is useless as an assumption here. – Christian Hennig Sep 08 '20 at 19:27
  • There are further details on which this depends. If you classify using a single variable, scaling doesn't matter, but if you classify using several variables, it does. For example, if feature 1 has variance 1000 and feature 2 has variance 0.01, and you put them together without rescaling, you should expect the same accuracy as from feature 1 alone (here 10%), because feature 1 will totally dominate the resulting distance (I assume you use Euclidean distance); a sketch after these comments illustrates this. – Christian Hennig Sep 08 '20 at 19:31
  • Thanks LmnICE for the pointers – Girish Vaidya Sep 09 '20 at 12:41
  • Thanks Lewian for your inputs. You may be correct that the features being merely uncorrelated might not imply anything. I have corrected the original question to state that they are independent, and I have also added the assumption that they are scaled. – Girish Vaidya Sep 09 '20 at 12:44
  • I have further edited part (1) of the question. Based on the question LmnICE linked, I understand that the accuracy can indeed degrade with more features. However, can the accuracy improve drastically when the features are used together? – Girish Vaidya Sep 09 '20 at 13:02
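
To illustrate Christian Hennig's scaling point above, here is a minimal sketch with synthetic data (all parameters are illustrative assumptions, not from the real dataset): without rescaling, the large-variance feature dominates the Euclidean distance, so the combined accuracy stays close to that of feature 1 alone; after standardizing, the informative small-scale feature can contribute.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 2000
y = rng.integers(0, 2, size=n)
f1 = rng.normal(loc=10.0 * y, scale=30.0, size=n)  # weakly informative, huge scale
f2 = rng.normal(loc=0.2 * y, scale=0.1, size=n)    # strongly informative, tiny scale
X = np.column_stack([f1, f2])

def knn_accuracy(features, labels):
    # 50-50 training-testing split, then 5-NN accuracy on the held-out half
    X_tr, X_te, y_tr, y_te = train_test_split(features, labels, test_size=0.5, random_state=0)
    clf = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)
    return accuracy_score(y_te, clf.predict(X_te))

# (Scaling before the split keeps the sketch short; in practice,
#  fit the scaler on the training half only.)
print(knn_accuracy(X, y))                               # unscaled: f1 dominates, ~ f1 alone
print(knn_accuracy(StandardScaler().fit_transform(X), y))  # standardized: f2 can contribute
```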

0 Answers