
I was doing an analysis of water quality and building a classification model for it. The data is imbalanced and also needs to be scaled. My question is about the order of operations: should I do the train/test split first and then normalize, or normalize and then split the data? And how do I balance the data? I have chosen to use SMOTE, but should I apply it after normalization and splitting, or beforehand? I am stuck here, and I am new to this domain. Any suggestions would be highly appreciated. Thank you!
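For what it's worth, the usual ordering is: split first, fit the scaler on the training fold only (then apply it to both folds), and oversample the training fold only, so the test set keeps the real class frequencies. A minimal NumPy-only sketch on made-up data to show the ordering (in practice you would use scikit-learn's `train_test_split` and `StandardScaler` and imbalanced-learn's `SMOTE`; the interpolation below picks random minority pairs rather than k-nearest neighbours, so it only approximates what SMOTE does):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy imbalanced dataset: 90 negatives, 10 positives (made-up data).
X = np.vstack([rng.normal(0, 1, (90, 2)), rng.normal(2, 1, (10, 2))])
y = np.array([0] * 90 + [1] * 10)

# 1) Split FIRST, so the test fold stays untouched.
idx = rng.permutation(len(y))
train, test = idx[:80], idx[80:]
X_train, y_train = X[train], y[train]
X_test, y_test = X[test], y[test]

# 2) Fit the scaler on the TRAINING fold only, apply to both folds.
mu, sigma = X_train.mean(axis=0), X_train.std(axis=0)
X_train = (X_train - mu) / sigma
X_test = (X_test - mu) / sigma   # test fold uses training statistics

# 3) Oversample (SMOTE-style interpolation) on the TRAINING fold only.
minority = X_train[y_train == 1]
n_new = (y_train == 0).sum() - (y_train == 1).sum()
a = minority[rng.integers(len(minority), size=n_new)]
b = minority[rng.integers(len(minority), size=n_new)]
synthetic = a + rng.random((n_new, 1)) * (b - a)  # points between minority pairs
X_train = np.vstack([X_train, synthetic])
y_train = np.concatenate([y_train, np.ones(n_new, dtype=int)])
```

Note that the test fold is never resampled and is scaled with statistics computed on the training fold alone; doing either step before the split leaks information from the test set into training.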

    Statisticians do not see class imbalance as such a problem. It might be helpful if you say why you find the imbalance problematic. https://stats.stackexchange.com/questions/357466 https://www.fharrell.com/post/class-damage/ https://www.fharrell.com/post/classification/ https://stats.stackexchange.com/a/359936/247274 https://stats.stackexchange.com/questions/464636/ https://twitter.com/f2harrell/status/1062424969366462473?lang=en – Dave Feb 25 '22 at 12:19
    Do not use SMOTE unless you are using an old classifier system, such as a single decision tree, that doesn't have a good means of avoiding over-fitting, or a means of implementing cost-sensitive learning. SMOTE is unlikely to help (and may make things worse) if you are using a modern classifier system, such as an SVM, which already has good solutions for both problems. – Dikran Marsupial Feb 25 '22 at 12:57
  • A water quality problem suggests that it might be better to address the problem as a regression task, and then if you are looking for occasions where threshold levels are breached, you can do that from the regression output. Note that in those circumstances, predicting the conditional variance as well as the conditional mean can be helpful. http://dx.doi.org/10.1016/j.neunet.2007.04.024 – Dikran Marsupial Feb 25 '22 at 13:01
  • The easiest way to understand the invalidity of SMOTE and other balancing methods is to realize that when training to a balanced sample you can't apply the prediction rules to future data unless you balance them. And to balance them you need to know the outcome you are predicting, which is not possible. – Frank Harrell Feb 25 '22 at 13:01
  • @FrankHarrell I don't think that is true of simple rebalancing. If you have a probabilistic classifier, you can post process the estimates of conditional probability to account for the difference in training set and operational class frequencies. An EM type algorithm can also be used to adapt to operational class frequencies without knowing what they are beforehand, see Saerens et al doi:10.1162/089976602753284446 . – Dikran Marsupial Feb 25 '22 at 13:06
  • Having said which, with a probabilistic classifier, there is little point in balancing the dataset, except for reasons of computational expense. It isn't the case for SMOTE though as SMOTE has a regularizing effect because of the method used to generate the synthetic examples. It probably isn't going to be as good a method of regularization as implemented in modern machine learning tools though (such as ridge regression ;o) – Dikran Marsupial Feb 25 '22 at 13:07
  • @Hridikalpa look into cost-sensitive learning. Unless you have a *very* limited amount of data, the justification for re-balancing or re-weighting the data is because the false-positive and false-negative costs are different, and the degree of imbalance in the dataset is entirely irrelevant to the amount of rebalancing that is required, it depends only on the misclassification costs. – Dikran Marsupial Feb 25 '22 at 13:15
  • False positive and false negative are artificial constructs that result from the use of forced-choice classification. Probability models don't need any of this, and lead to better decision making. A win-win. Probability models use all available data and get the model intercept correct, leading to maximizing the chance of getting accurate absolute predictions. – Frank Harrell Feb 25 '22 at 13:35
  • @FrankHarrell In many practical applications, decisions do have to be made and errors do have costs that may not be equal. The probability approach is not a win-win because the model is unlikely to be perfect, and compromises will be involved. Those compromises may favour factors that do not affect the decision at the expense of factors that do. I give a concrete (if somewhat adversarial) example here: https://stats.stackexchange.com/questions/312780/why-is-accuracy-not-the-best-measure-for-assessing-classification-models/538524#538524 – Dikran Marsupial Feb 25 '22 at 13:46
  • I pretty much agree about the value of probabilistic modelling, but there *are* downsides that we need to be aware of, and at the end of the day we need to choose our performance metric according to the demands of the application. In a lot of cases, that is minimising the expected loss (appropriately accounting for the misclassification costs). – Dikran Marsupial Feb 25 '22 at 13:49
  • The oversimplified method of forced-choice classification (premature decisions) does nothing but cover up the problems you mentioned. Coupled with the fact that probability predictions tell you not to make a decision (say when the probability is near 0.5), the probability approach has multiple advantages and better reflects uncertainties. And it means we can quit talking about balance. One more point is that probability estimates allow the utility function to vary over types of observations. – Frank Harrell Feb 25 '22 at 17:58
  • Let us [continue this discussion in chat](https://chat.stackexchange.com/rooms/134429/discussion-between-dikran-marsupial-and-frank-harrell). – Dikran Marsupial Feb 25 '22 at 18:22
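The prior-shift correction mentioned in the comments above (post-processing a probabilistic classifier's output for the operational class frequencies, rather than rebalancing the training data) is just an application of Bayes' rule. A sketch, assuming the model was trained at class-1 frequency `pi_train` and deployed where the true prevalence is `pi_new`:

```python
def correct_prior(p, pi_train, pi_new):
    """Adjust P(y=1 | x) from a model trained with class-1 frequency
    pi_train so it is calibrated for an operational frequency pi_new."""
    num = p * pi_new / pi_train
    den = num + (1 - p) * (1 - pi_new) / (1 - pi_train)
    return num / den

# If the training set was balanced (0.5) but the true prevalence is 10%,
# a predicted probability of 0.5 is corrected down to 0.1.
correct_prior(0.5, 0.5, 0.1)
```

When `pi_train == pi_new` the correction is the identity, which is one way of seeing why a probabilistic classifier trained on the natural class frequencies needs no rebalancing at all.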

0 Answers