I'm working with a CSV which contains approximately 220,000 entries. My aim is to predict one of the attributes (ATT1) using the other 3 (ATT2, ATT3, ATT4).

I've been able to do this using NaiveBayes, but I'm not satisfied with the result. The reason is that ATT1 can take one of 6 values (VAL1-6), but these are not evenly distributed across the dataset. I'm afraid this could lead to imprecise predictions.

How do I select a given number of entries for each value of ATT1 from within RapidMiner?

mdewey
Gurzo
  • @Gurzo Instead of "data subset", I think the precise term for what you're trying to do is "stratified sampling". Maybe, this is a solution: http://rapid-i.com/api/rapidminer-5.1/com/rapidminer/operator/preprocessing/sampling/AbsoluteStratifiedSampling.html. – chl Apr 09 '11 at 12:54
  • As much as I'd like to see more people use rapidminer, I think that this question is far more appropriate for the rapidminer forum (http://forum.rapid-i.com/). Besides, Naive Bayes can handle unevenly distributed discrete labels/targets. – mlwida Apr 09 '11 at 12:54
  • @chl: Stratified Sampling works best if the classes are equally distributed, which they aren't. :( – Gurzo Apr 09 '11 at 12:59
  • @Gurzo OK, I mean you can force it to keep a certain number of cases from each class. – chl Apr 09 '11 at 13:08
  • @Gurzo: What? Stratified sampling is exactly the way to go if the classes are unevenly distributed. If they are evenly distributed (given such a huge number of observations), the result of simple random sampling will be equivalent to that of stratified sampling. By the way: vote to close! – mlwida Apr 09 '11 at 13:29
  • @steffen: Stratified Sampling alone boosts accuracy by 2%, but also completely "ignores" the 2 values with the least entries... That's what I was trying to avoid. – Gurzo Apr 09 '11 at 13:50
  • Why have you removed your answer? Was it wrong? –  Apr 09 '11 at 14:23
  • @mbq: I couldn't make it work for my problem... I probably should have waited to post it as answer... :/ – Gurzo Apr 09 '11 at 15:48

1 Answer

Use the Sample operator with the Balance checkbox ticked. That way you can set the sample size per class and get a balanced sample.
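
For readers who want to see the idea outside RapidMiner, here is a minimal Python sketch of drawing a fixed number of entries per class. It is only an illustration of balanced per-class sampling, not RapidMiner's implementation: the function name `balanced_sample`, its parameters, and the toy rows are all hypothetical, and a class with fewer rows than requested is simply taken whole (an assumption made here for simplicity).

```python
import random
from collections import defaultdict


def balanced_sample(rows, label_key, n_per_class, seed=0):
    """Return up to n_per_class randomly chosen rows for each label value.

    Hypothetical helper for illustration; not part of RapidMiner.
    """
    rng = random.Random(seed)  # fixed seed so the draw is repeatable

    # Group the rows by the value of the label attribute (e.g. ATT1).
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)

    sample = []
    for label in sorted(by_label):
        group = by_label[label]
        # A rare class may hold fewer rows than requested; take what exists.
        k = min(n_per_class, len(group))
        sample.extend(rng.sample(group, k))
    return sample


# Toy data: VAL1 is nine times as common as VAL2.
rows = [{"ATT1": "VAL1", "ATT2": i} for i in range(90)]
rows += [{"ATT1": "VAL2", "ATT2": i} for i in range(10)]

picked = balanced_sample(rows, "ATT1", n_per_class=5)
```

With this kind of draw, every class contributes the same number of rows to the training sample regardless of how common it is in the full dataset, which is the difference from a proportional stratified sample that keeps rare classes rare.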

@steffen, the mandate for this site covers stats AND stats software. There are tons of R questions on here, so it's fair to ask questions about other software too.

Neil McGuigan
  • (+1 for the rapidminer tips) Agree about R questions and stats software. However, questions about statistical software that are barely related to statistical analysis per se will be better served on SO (which is clearly stated on the [FAQ](http://stats.stackexchange.com/faq), see the *programming* subject). Now, the ongoing debate about stratified sampling in the comments makes it relevant (to a certain extent) on here. I'm sure @steffen will agree with that, and his first comment might also be tempered by the ensuing ones. – chl Apr 10 '11 at 10:18
  • @chl: I agree. However, I estimate that 99% of the questions about rapidminer will be related to machine learning (it's in the nature of the tool). Take a look at the rm-forum. It is overflowing with questions similar to this one, which is a good representative example. In-depth coverage of machine learning in questions like "how to do X in tool Y" will only emerge if the goal and the approach of the OP are questioned. I am just afraid that stats* will mirror the rm-forum. Ceterum censeo R is special and deserves a special treatment ;) – mlwida Apr 10 '11 at 15:15
  • @steffen Be sure I've heard you, and I'm certainly sharing the same concern about technical or basic questions. But please note that a lot of users here (including myself) answered similar questions about R in the past, even if those questions would have been rejected/downvoted on R-help or SO. It seems @Neil's response proved to be helpful, the OP is registered, replied to (somewhat fruitful) comments, and, as I hope, may come again with some good stats questions. Sorry, these are my Sunday evening lucubrations... – chl Apr 10 '11 at 20:05