I'm trying to build a training set for a classifier.

Each vector is labelled either conclusive ('C') or inconclusive ('U').

U Y69S 12 -1.5 1.83 3.45 5.412 6.441 9.864 14.666 15.68 12.082 8.384 4.016 0.0 
U Y69T 12 0.904 1.699 3.672 6.543 7.642 10.435 16.099 16.604 13.411 8.916 5.427 0.0 
C Y69V 12 -0.293 2.192 4.202 5.835 7.97 10.467 16.623 16.588 13.109 8.209 4.192 0.0 
C Y69W 12 -6.65 -7.501 -6.627 -4.786 -5.456 -2.025 1.883 14.33 10.738 6.658 7.978 0.0 
C Y80A 12 1.505 0.597 2.105 4.901 5.007 9.476 13.273 14.413 11.049 6.402 2.726 0.0 
U Y80C 12 0.633 -0.558 0.328 3.899 5.734 7.99 13.345 14.463 10.246 4.905 1.134 0.0 
C Y80D 12 4.928 6.02 6.754 9.612 12.618 17.849 17.876 17.605 12.73 7.035 2.059 0.0 
U Y80E 12 -0.772 -1.421 0.855 2.469 7.932 16.783 16.341 15.808 12.597 8.455 4.644 0.0 
C Y80F 12 0.311 -1.267 -0.332 3.294 5.497 8.231 11.756 13.57 9.524 5.054 1.777 0.0 
U Y80G 12 -0.023 -0.346 1.376 4.351 4.044 8.748 12.373 15.347 10.454 6.044 2.55 0.0 
C Y80H 12 -2.762 -4.235 -3.276 -0.661 1.749 5.74 10.979 13.685 9.291 6.207 1.279 0.0

When preparing the data set, should I include roughly equal numbers of 'C' and 'U' examples?

TMOTTM

1 Answer

The first-level answer is to keep the original proportion of U and C. This is sometimes called stratified sampling; applied to cross-validation folds, it is known as stratified cross-validation (see, for example, "Understanding stratified cross-validation").
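As a rough illustration of keeping the original proportions, here is a minimal standard-library sketch of a stratified train/test split. The function name `stratified_split` and the 25% test fraction are assumptions made for the example; the labels are those of the eleven rows shown in the question.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.25, seed=0):
    """Split indices so each class keeps roughly the same share in both parts."""
    rng = random.Random(seed)
    # Group row indices by class label.
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    train_idx, test_idx = [], []
    # Sample the test portion separately inside each class.
    for idx in by_class.values():
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Labels of the eleven example rows in the question (5 'U', 6 'C').
labels = list("UUCCCUCUCUC")
train_idx, test_idx = stratified_split(labels, test_fraction=0.25)
```

With so few rows the proportions can only be matched approximately; on a larger data set the per-class shares in both parts get closer to the original ones.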

At a second level, if there is a severe imbalance between C and U (the minority class making up less than 20%, or even less than 10%, of the data), you may want to balance the training set, provided your goal is not global accuracy but detecting the low-frequency class. Whether balancing helps depends on the classification algorithm you plan to use. See the blog post http://www.win-vector.com/blog/2015/02/does-balancing-classes-improve-classifier-performance/#more-3022 for a nice comparison of the effect of imbalance on different algorithms.
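One possible way to balance a training set is random undersampling of the larger class; this is just one option and not necessarily what the blog post recommends. The sketch below assumes each row is a tuple whose first element is the class label, and the name `undersample` and the 90/10 example data are hypothetical.

```python
import random
from collections import defaultdict

def undersample(rows, label_of=lambda row: row[0], seed=0):
    """Randomly drop rows from larger classes until all classes match the smallest."""
    rng = random.Random(seed)
    # Group rows by class label.
    by_class = defaultdict(list)
    for row in rows:
        by_class[label_of(row)].append(row)
    n_min = min(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        # Draw n_min rows without replacement from each class.
        balanced.extend(rng.sample(group, n_min))
    rng.shuffle(balanced)
    return balanced

# Hypothetical imbalanced training set: 90 'C' rows and 10 'U' rows.
train_rows = [("C", i) for i in range(90)] + [("U", i) for i in range(10)]
balanced_rows = undersample(train_rows)
```

A common practice is to apply this only to the training portion and leave the evaluation data at its original proportions, so that measured performance reflects the real class frequencies.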

Jacques Wainer