I have a large dataset describing numerous customers' behaviour, and I am trying to solve a binary classification problem with a null accuracy of 90% (a 90/10 distribution between the two classes).
Given that I have computational limitations and am thus forced to take a subset of the data, would it make sense to manipulate the class balance in my sample to, say, 60/40 or 50/50, now that my hardware limits me to a fixed number of total observations, just to "expose the machine learning algorithm to more of both classes" (from a marginal utility point of view)?
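To make the idea concrete, this is roughly the kind of subsampling I have in mind (a minimal sketch only; the column name `target`, the budget `n_total`, and the 60/40 target fraction are placeholders, not my actual setup):

```python
import pandas as pd

# Build a fixed-size subsample with a chosen positive-class fraction by
# under-sampling the majority class. Assumes df has one row per customer and
# a binary "target" column where 1 is the rare class (~10%).
def rebalanced_subsample(df, target="target", n_total=100_000, pos_frac=0.4, seed=0):
    pos = df[df[target] == 1]
    neg = df[df[target] == 0]
    n_pos = min(int(n_total * pos_frac), len(pos))  # take as many positives as the budget allows
    n_neg = n_total - n_pos                         # fill the rest with majority-class rows
    sample = pd.concat([
        pos.sample(n=n_pos, random_state=seed),
        neg.sample(n=n_neg, random_state=seed),
    ])
    return sample.sample(frac=1, random_state=seed)  # shuffle the combined rows
```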
I have found several discussions about this online, but none about this exact situation. I am well aware that it would be optimal to just use ALL observations, and that rebalancing will distort the true distribution, but my rationale is that this is nothing like a poll sample; rather, the idea is to feed the algorithm more examples of the class it has not seen very often.
The following guide states: "Consider testing under-sampling when you have a lot of data (tens or hundreds of thousands of instances or more)."
Would this negatively impact the performance of the machine learning algorithm, and thus my prediction model, so that I get worse classifications on a 90/10 test set? And could someone explain to me why?
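To be explicit about what "worse classifications" would mean here, this is the kind of comparison I would run, assuming a held-out test set that keeps the real 90/10 distribution (the model and metrics below are only placeholders for whatever I end up using):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score

# Train the same model on a natural 90/10 subsample and on a rebalanced 60/40
# subsample of equal size, then score both on the same 90/10 test set.
def compare(train_natural, train_rebalanced, test, features, target="target"):
    scores = {}
    for name, train in [("natural 90/10", train_natural), ("rebalanced 60/40", train_rebalanced)]:
        model = LogisticRegression(max_iter=1000).fit(train[features], train[target])
        proba = model.predict_proba(test[features])[:, 1]
        scores[name] = {
            # accuracy at the default 0.5 threshold is sensitive to the training class prior
            "accuracy": accuracy_score(test[target], (proba >= 0.5).astype(int)),
            # AUC ranks predictions and is insensitive to the class prior
            "auc": roc_auc_score(test[target], proba),
        }
    return scores
```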