
I have a relatively large dataset (100k items) that I need to split into two groups. So far I've tried k-NN, and the results are not good, mainly because my training data is imbalanced: 90% of the points belong to the first group. The same proportion is expected in the test data.

Is there a way to improve prediction quality with this kind of data? Computational performance is not important; prediction quality is paramount.

Moonwalker
  • Do you mean 'into' groups or 'in two' groups? Also, how many variables do you have? – user603 Feb 21 '13 at 13:25
  • @user603 I mean in two groups, say spam and not spam. I have <100 features, but that may grow to up to 1000 features (other people are working on implementing new features). – Moonwalker Feb 21 '13 at 13:29
  • This thread is somewhat related, & may be interesting for you: [does-an-unbalanced-sample-matter-when-doing-logistic-regression](http://stats.stackexchange.com/questions/6067/). – gung - Reinstate Monica Feb 21 '13 at 14:15
  • Wait: this is confusing. Do you have access to the real labels? I wrote my answer assuming that you have no access to the true labels. – user603 Feb 21 '13 at 14:17
  • @user603 For a few datasets, yes, I have access to the real labels (I'm using separate datasets for training and testing), but not for the ones used in production. Sorry for the confusion; I really liked your answer. – Moonwalker Feb 21 '13 at 15:13
  • When you say your results are not good, how do you measure that? I could guess 'group 1' for every sample and be right about 90% of the time from the group proportions alone; that sounds pretty good for some applications. Telling us what level of performance you need, what you're getting now, and whether all errors are equally acceptable (e.g. spam misclassified as not spam vs. not spam misclassified as spam) might be helpful in working out what you should try next. – Pat Feb 21 '13 at 17:09
  • @Pat I'm doing a kind of outlier detection for text, like an inverse spam problem, or in other words detecting important texts. I may remove up to 90% of the points from the set in such a way that the proportion of important texts in the new set would be higher. I'm running k-NN and keeping only the texts it classifies as "important", which gives a concentration of really important texts among them of around 20-30%, which is not enough for my task. I need the concentration to be >50%. – Moonwalker Feb 22 '13 at 00:37
  • Okay. So the concentration needs to go up by a factor of $5$ or so. Depending on the degree of class overlap, one would have thought that would be possible with a kNN-type model. The first thing I'd suggest is looking at how you're measuring the 'distance' between different texts when calculating nearest neighbors, and playing around with different methods. That can make a huge difference. Have you tried bag-of-words type modelling, and if so, are you using term weighting? – Pat Feb 22 '13 at 11:30
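
To make the bag-of-words / term-weighting suggestion in the last comment concrete, here is a minimal sketch in Python with scikit-learn, assuming TF-IDF features and cosine distance for the k-NN step; the tiny corpus and labels below are purely illustrative stand-ins, not the asker's data or pipeline.

```python
# Sketch: TF-IDF term weighting + cosine-distance k-NN for "important text" detection.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Toy corpus (illustrative only).
texts = [
    "quarterly earnings report for the board",
    "urgent: contract renewal deadline tomorrow",
    "lunch plans for friday?",
    "funny cat pictures attached",
    "minutes of the strategy meeting",
    "weekend party invitation",
]
labels = [1, 1, 0, 0, 1, 0]  # 1 = important text, 0 = not important

model = make_pipeline(
    TfidfVectorizer(sublinear_tf=True, stop_words="english"),  # term weighting
    KNeighborsClassifier(n_neighbors=3, metric="cosine"),      # cosine distance
)
model.fit(texts, labels)
print(model.predict(["draft agenda for the strategy meeting"]))
```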

1 Answer


First of all, if you ditch accuracy in favor of AUC and use a k-NN implementation that outputs some continuous score (proportion of votes, weighted votes, etc.), you will be able to tell whether your model has any discriminant power at all.
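
For example, here is a hedged sketch (not part of the original answer) using scikit-learn: `predict_proba` returns the proportion of neighbor votes, which can be fed to `roc_auc_score`. The synthetic 90/10 data only mimics the imbalance described in the question.

```python
# Sketch: score the minority class by the k-NN vote proportion and evaluate with AUC.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import roc_auc_score

# Synthetic data with a 90/10 class split, standing in for the real dataset.
X, y = make_classification(n_samples=10_000, n_features=50,
                           weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=25)
knn.fit(X_train, y_train)

scores = knn.predict_proba(X_test)[:, 1]   # proportion of neighbors voting for class 1
print("AUC:", roc_auc_score(y_test, scores))
```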

Now, if you want to stick with accuracy, you could try assigning different weights to the votes of each class.
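
One way to sketch that suggestion (this is an assumption about the implementation, not code from the answer; scikit-learn's `KNeighborsClassifier` has no class-weight option of its own) is to collect the k nearest neighbors and weight each neighbor's vote by its class, for instance inversely to the class frequency in the training data:

```python
# Sketch: k-NN with class-weighted votes.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def weighted_knn_predict(X_train, y_train, X_test, k=25, class_weights=None):
    """Predict by summing per-class vote weights over the k nearest neighbors."""
    y_train = np.asarray(y_train)
    classes = np.unique(y_train)
    if class_weights is None:
        # Default: weight each class inversely to its training frequency,
        # so minority-class votes count for more.
        freq = np.array([(y_train == c).mean() for c in classes])
        class_weights = dict(zip(classes, 1.0 / freq))
    nn = NearestNeighbors(n_neighbors=k).fit(X_train)
    _, idx = nn.kneighbors(X_test)          # indices of the k nearest neighbors
    neighbor_labels = y_train[idx]          # shape (n_test, k)
    votes = np.stack([
        np.where(neighbor_labels == c, class_weights[c], 0.0).sum(axis=1)
        for c in classes
    ], axis=1)                              # shape (n_test, n_classes)
    return classes[votes.argmax(axis=1)]

# Example usage (hypothetical arrays): up-weight the rare "important" class.
# y_pred = weighted_knn_predict(X_train, y_train, X_test,
#                               class_weights={0: 1.0, 1: 9.0})
```

Raising the weight of the minority class trades some accuracy on the majority class for better recall on the rare "important" class, which matches the goal described in the comments.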

Firebug