First, I should say that I am new to this community, and this question may already have been answered. If so, please direct me to those answers. I will summarize my question as briefly as I can.

I have a randomly collected data set with 56,498 rows and a feature vector of size 381. The rows are categorized into 11 classes. The number of rows in each class is as follows.

  1. African 352
  2. EastAsian 2512
  3. EasternEuropean 381
  4. LatinAmerican 2917
  5. MiddleEastern 645
  6. NorthAmerican 41524
  7. NorthernEuropean 250
  8. SouthAsian 621
  9. SoutheastAsian 457
  10. SouthernEuropean 4180
  11. WesternEuropean 2659

These data were randomly collected; the class counts do not reflect the sizes of the underlying populations, and each class should be given equal consideration. The problem is that, because NorthAmerican (41,524 rows) dominates the data set, the classifiers (C4.5 and Random Forests) report high accuracy that is really an artifact of this imbalance: the results are biased toward the majority class.

Therefore, I need to sample the data. The method I am considering is disproportionate stratified sampling with equal allocation. Since the smallest class has 250 rows, I am considering drawing 250 random samples from each class and feeding them to the classifier.

But I strongly doubt that 250 samples drawn from the 41,524 NorthAmerican rows will represent that entire class. This is the problem I am facing: I cannot use the data set as it is, yet the sampling method must still represent each population adequately.

What should I do? Can anyone suggest a good method to follow?

Thank You!

Isura Nirmal
  • 1
    Couldn't you take a stratified random sample, but sample with certainty from the smaller classes? Then you could simply take an SRS from within the other classes. Why do you strongly doubt that 250 will not represent the entire population? In what way? If there is auxiliary information available in your dataset, you could further stratify your PSUs to ensure your data is more representative. – StatsStudent Oct 18 '15 at 07:28
  • 1
    The term 'raws' is strange; I imagine you mean samples/observations/rows? So you intend to train a classifier predicting 'regionality' on the basis of other features, and you have 56k samples. – Soren Havelund Welling Oct 19 '15 at 11:26
  • Yes, raws = samples/observations/rows. Forgive me for the spelling mistake; I didn't notice it. – Isura Nirmal Oct 19 '15 at 14:55

1 Answer

Your training target class distribution is highly skewed towards "NorthAmerican". If the classes are not easy to separate, the RF model will tend to predict "NorthAmerican" for almost every new sample. If you believe the training class distribution should not be the prior expectation for future samples, you can instead assume, for example, a uniform prior over the classes, so that the predictions of the random forest model are not tainted by the skewed training data. In practice, you can incorporate such a flat prior with stratified bootstrapping.

Suppose you train 500 trees. For each tree you can bootstrap 250 samples from each target class, $250 \cdot 11 = 2750$ samples in total per tree. I simulated 1000 times how many distinct "NorthAmerican" samples would be selected at least once across the whole RF model. It is very unlikely that fewer than 90% of the "NorthAmerican" samples are included. If you increase the number of trees, the coverage gets close to 100%.
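The simulated result can be cross-checked with a short back-of-the-envelope calculation. This is only a sketch, assuming each of the $500 \cdot 250 = 125000$ draws picks one of the 41524 "NorthAmerican" rows uniformly and independently, with replacement, as in the simulation. The probability that a given row is never drawn is

$$\left(1 - \frac{1}{41524}\right)^{125000} \approx e^{-125000/41524} \approx e^{-3.01} \approx 0.049,$$

so the expected coverage is roughly $1 - 0.049 \approx 95\%$ of the "NorthAmerican" rows. Under the same assumptions, doubling to 1000 trees gives $e^{-6.02} \approx 0.0024$, i.e. about 99.8% coverage.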

In short:

  • Stratify only if it is reasonable to do so.
  • Don't worry about leaving the "NorthAmerican" class examples unused; across the forest, nearly all of them will be used.

Here's a thread on how to use stratified random forests in R.
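To make "stratified bootstrapping" concrete, here is a minimal base-R sketch of what one tree's sample looks like. The helper name `stratified_bootstrap` and the toy labels are made up for illustration; this is not the internals of any particular package.

```r
# Hypothetical helper: draw one stratified bootstrap sample of row indices,
# taking n_per_class rows (with replacement) from every class in y.
stratified_bootstrap <- function(y, n_per_class) {
  idx_by_class <- split(seq_along(y), y)  # row indices grouped by class
  unlist(
    lapply(idx_by_class, function(idx) {
      # index into idx so a one-row class is not misread by sample() as 1:idx
      idx[sample.int(length(idx), n_per_class, replace = TRUE)]
    }),
    use.names = FALSE
  )
}

# Toy labels (made up): one dominant class and two small ones.
set.seed(1)
y <- factor(rep(c("big", "small_a", "small_b"), times = c(5000, 60, 40)))

one_tree <- stratified_bootstrap(y, n_per_class = 40)
table(y[one_tree])  # every class contributes exactly 40 rows
```

In the randomForest package the same idea is available through the `strata` and `sampsize` arguments, for example `randomForest(x, y, strata = y, sampsize = rep(250, nlevels(y)))`, where `x` and `y` stand for your own feature matrix and class factor.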

Simulation of stratified sampling:

hist( # plot histogram
  main = "Simulating 1000 stratified RF models",
  xlab = "Number of NorthAmerican samples included in at least one tree",
  x = replicate(1000, # simulate 1000 forests
    length(unique(unlist( # count distinct NorthAmerican samples used in each forest
      replicate(500, # stratified bootstrap for each of 500 trees
        sample(41524, 250, replace = TRUE), simplify = FALSE)
    )))
  )
)

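[Figure: histogram titled "Simulating 1000 stratified RF models"; x-axis: number of "NorthAmerican" samples included in at least one tree]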

  • Cannot thank you enough for taking so much time to help me. Actually, the upper limit of 250 per class comes from the NorthernEuropean set, which has 250 instances. I am less familiar with R, as I am using Weka and Python. Also, the data set is a binary occurrence matrix with a 384-dimensional feature vector. Let me try these and let you know. Thanks again! – Isura Nirmal Oct 19 '15 at 15:03