First, I should say that I am new to this community, so this question may already have been answered. If so, please point me to those answers. I will keep my question as brief as I can.
I have a randomly collected data set with 56,498 rows and a feature vector of size 381. The rows are categorized into 11 classes. The number of rows in each class is as follows.
- African 352
- EastAsian 2512
- EasternEuropean 381
- LatinAmerican 2917
- MiddleEastern 645
- NorthAmerican 41524
- NorthernEuropean 250
- SouthAsian 621
- SoutheastAsian 457
- SouthernEuropean 4180
- WesternEuropean 2659
These data were randomly collected, and the counts do not reflect the size of each population; every class should be given the same consideration. The problem is that, because NorthAmerican (41,524 rows) dominates the data set, the classifiers (C4.5 and Random Forests) report high accuracy that is biased toward that class.
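To make the bias concrete, here is a minimal sketch using only the class counts listed above. It assumes a hypothetical classifier that always predicts the largest class, and compares plain accuracy with balanced accuracy (the mean of the per-class recalls). The variable names are my own and are only for illustration.

```python
# Class counts copied from the list above.
counts = {
    "African": 352,
    "EastAsian": 2512,
    "EasternEuropean": 381,
    "LatinAmerican": 2917,
    "MiddleEastern": 645,
    "NorthAmerican": 41524,
    "NorthernEuropean": 250,
    "SouthAsian": 621,
    "SoutheastAsian": 457,
    "SouthernEuropean": 4180,
    "WesternEuropean": 2659,
}

total = sum(counts.values())
majority_class = max(counts, key=counts.get)

# Hypothetical classifier: always predict the majority class.
# Its plain accuracy is simply the majority class's share of the rows.
plain_accuracy = counts[majority_class] / total

# Per-class recall is 1.0 for the majority class and 0.0 for every other class,
# so balanced accuracy (mean recall) collapses to 1 / number_of_classes.
recalls = [1.0 if name == majority_class else 0.0 for name in counts]
balanced_accuracy = sum(recalls) / len(recalls)

print(f"rows: {total}, majority class: {majority_class}")
print(f"plain accuracy:    {plain_accuracy:.3f}")
print(f"balanced accuracy: {balanced_accuracy:.3f}")
```

Even this trivial predictor gets roughly three quarters of the rows right while learning nothing about the other ten classes, which is why I do not trust the plain accuracy figure here.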
Therefore, I need to sample this data. The sampling method I am considering is disproportionate stratification with equal allocation. Since the smallest class has 250 rows, I am considering taking 250 random samples from each class and feeding them into the classifier.
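The sampling plan described above can be sketched roughly as follows. This is only an illustration under my own assumptions: the labels are a toy list rebuilt from the class counts (not my real feature vectors), and `equal_allocation_sample` is a hypothetical helper name, not a library function.

```python
import random
from collections import Counter, defaultdict


def equal_allocation_sample(labels, n_per_class, seed=0):
    """Return sorted row indices with n_per_class rows drawn from each class.

    Rows are drawn without replacement, so n_per_class must not exceed the
    size of the smallest class (random.sample raises ValueError otherwise).
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    chosen = []
    for label in sorted(by_class):
        chosen.extend(rng.sample(by_class[label], n_per_class))
    return sorted(chosen)


# Toy label column with the same class sizes as my data set.
counts = {
    "African": 352, "EastAsian": 2512, "EasternEuropean": 381,
    "LatinAmerican": 2917, "MiddleEastern": 645, "NorthAmerican": 41524,
    "NorthernEuropean": 250, "SouthAsian": 621, "SoutheastAsian": 457,
    "SouthernEuropean": 4180, "WesternEuropean": 2659,
}
labels = [name for name, n in counts.items() for _ in range(n)]

n_per_class = min(counts.values())  # size of the smallest class
sample_idx = equal_allocation_sample(labels, n_per_class)
sample_counts = Counter(labels[i] for i in sample_idx)

print(len(sample_idx), "rows kept")
print(sample_counts)
```

The returned indices would then be used to select the matching rows of the feature matrix before training.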
But I strongly doubt that 250 samples drawn from the 41,524 NorthAmerican rows will represent that entire class. This is the problem I am facing right now: I cannot use the data set as it is, yet the sampling method must still be powerful enough to represent the population.
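To put a number on this worry, the short sketch below computes what fraction of each class would survive if 250 rows were kept per class. It uses only the counts listed above; the variable names are my own.

```python
# Class counts copied from the list above.
counts = {
    "African": 352, "EastAsian": 2512, "EasternEuropean": 381,
    "LatinAmerican": 2917, "MiddleEastern": 645, "NorthAmerican": 41524,
    "NorthernEuropean": 250, "SouthAsian": 621, "SoutheastAsian": 457,
    "SouthernEuropean": 4180, "WesternEuropean": 2659,
}

n_per_class = min(counts.values())  # 250, the smallest class

# Fraction of each class that is retained under equal allocation.
fractions = {name: n_per_class / n for name, n in counts.items()}

# Fraction of the whole data set that is retained.
overall = n_per_class * len(counts) / sum(counts.values())

for name, frac in sorted(fractions.items(), key=lambda kv: kv[1]):
    print(f"{name:17s} keeps {frac:7.2%} of its rows")
print(f"overall: {overall:.2%} of all rows")
```

Under this scheme the NorthAmerican class keeps well under 1% of its rows, while NorthernEuropean keeps all of them, which is exactly the imbalance in information loss that concerns me.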
What should I do? Can anyone suggest a good method to follow?
Thank you!