Imbalanced categorical data

Question

I'm studying the behavior of machine failures in a production scenario. For this, I generated random data to form my imbalanced training set. It consists of categorical data indicating whether or not there was a failure in each subperiod. The failures were generated according to an exponential distribution. I have 24 features (Period_1 to Period_24), each containing the historical failures for 448 subperiods. I also have three more features, Temperature, Moisture, and Pressure, generated from a Normal distribution. My intention is to predict the failures for the next period based on these features.

I used ROC as the metric and considered several strategies to deal with the imbalanced data: oversampling, undersampling, ROSE, and ADASYN. I also tried ensembles to improve performance. I tested all of the following models: gradient boosting, random forest, Classification and Regression Trees, neural networks, Bagged CART, SVM, C5.0, eXtreme Gradient Boosting, and k-Nearest Neighbors. I also tried regularized models, but none of these strategies worked.

The best result was obtained with the model "SVMRadial" combined with resampling from the ROSE package. In this case, ROC = 0.7614, Sensitivity = 0.7639, and Specificity = 0.6065 for the training set, and Sensitivity = 0.75 and Specificity = 0.6914 for the test set (the latter obtained from the confusion matrix). However, when making predictions, the trained model assigns high probabilities to wrong predictions. So I would like to know whether this is a problem with the trained model or with the fact that I have 24 categorical variables + 3 numerical variables. Also, does anyone have any idea how to improve these results?

Any help will be appreciated.

A sample of the data:

If you generate random data, why would the model be able to predict the labels correctly? Am I missing something? Can you elaborate on how you generated the random data? — StupidWolf, Aug 14 '20 at 16:54
Welcome to CV.SE. [This question](https://stats.stackexchange.com/questions/357466/are-unbalanced-datasets-problematic-and-how-does-oversampling-purport-to-he) might be helpful for dealing with unbalanced data. It's a duplicate, but I think it's a better summary. — LmnICE, Aug 14 '20 at 17:06
@StupidWolf I generated my own data because I didn't find any real data that fit the problem I'm studying. I thought that the learning models would be able to analyze the data and predict the labels, is that incorrect? — Fernanda, Aug 14 '20 at 17:59
What's not clear from your explanation is whether your simulation included any predictors that were, by design, associated with failure rates. You might, for example, have let the exponential distribution parameter be a function of temperature in your simulation. Whether you did something like that isn't clear. If all you simulated were random values of temperature, pressure, and moisture without any attempts to build in associations with failure, then there is no true relationship between those predictors and failure to model, and any relationship you found would be spurious. — EdM, Aug 14 '20 at 21:01
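To illustrate the point about predictors with no built-in association, here is a minimal sketch using only the Python standard library. Everything in it is an assumption made for illustration and not taken from the question: the 10% failure share, the temperature mean of 70 and standard deviation of 10, the sample size, and the helper name `auc_from_ranks`. It draws failure labels and temperatures independently and computes the area under the ROC curve with the rank-sum (Mann-Whitney) formula. Under these assumptions the AUC should land close to 0.5, i.e. no better than chance.

```python
import random


def auc_from_ranks(scores, labels):
    """AUC via the rank-sum (Mann-Whitney U) formula; assumes no tied scores."""
    # Indices sorted by ascending score, so position in this list gives the rank.
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    rank_sum = 0
    n_pos = 0
    for rank, i in enumerate(order, start=1):
        if labels[i]:
            rank_sum += rank
            n_pos += 1
    n_neg = len(labels) - n_pos
    # U statistic for the positive class, normalised by the number of pairs.
    u = rank_sum - n_pos * (n_pos + 1) / 2
    return u / (n_pos * n_neg)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
n = 20000               # hypothetical sample size

# Labels and temperature are drawn independently: no true relationship exists.
labels = [rng.random() < 0.10 for _ in range(n)]          # about 10% failures (assumed)
temperature = [rng.gauss(70.0, 10.0) for _ in range(n)]   # assumed mean and sd

auc_independent = auc_from_ranks(temperature, labels)
print(round(auc_independent, 3))
```

Any apparent deviation from 0.5 in a smaller sample of such data would be noise and not something a model could exploit on new data.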
So each row in your sample corresponds to a period, and the features are the 24 preceding periods plus temperature, moisture and pressure? — LmnICE, Aug 15 '20 at 14:56
Further, how much data do you have? Machine learning models tend to need a lot of data as a proportion of the number of features in order to make accurate predictions. — LmnICE, Aug 15 '20 at 15:04
@LmnICE Each row in my sample corresponds to a subperiod. I have 24 features (each corresponding to one period), each with 448 subperiods (rows), plus temperature, moisture, and pressure. So I have 27*448 = 12096 data values. Is this amount of data enough? I'm sorry, I'm new to Machine Learning and I have a lot of questions about how to structure my problem. — Fernanda, Aug 15 '20 at 17:19
@EdM I didn't consider this relation when generating my data. How can I make this association between the predictors when generating my data? — Fernanda, Aug 15 '20 at 17:21
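One way to build such an association into the simulation, sketched below in Python with only the standard library, is to let the exponential rate depend on temperature, as suggested in the earlier comment. This is a sketch under stated assumptions and not the method used in the question: the function names, the base rate of 0.05, the effect size `beta`, the temperature mean of 70 and standard deviation of 10, and the rule "a failure occurs if the simulated time to failure is shorter than one subperiod" are all hypothetical choices.

```python
import math
import random


def simulate_subperiods(n=448, base_rate=0.05, beta=1.0, seed=0):
    """Simulate (temperature, failed) pairs where heat raises the failure rate.

    All parameter values are illustrative assumptions.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        temperature = rng.gauss(70.0, 10.0)      # assumed Normal(70, 10)
        z = (temperature - 70.0) / 10.0          # standardised temperature
        rate = base_rate * math.exp(beta * z)    # rate grows with temperature
        time_to_failure = rng.expovariate(rate)  # exponential waiting time
        failed = time_to_failure < 1.0           # failure within one subperiod
        rows.append((temperature, failed))
    return rows


def failure_share(rows):
    """Fraction of subperiods labelled as failures."""
    return sum(failed for _, failed in rows) / len(rows)


rows = simulate_subperiods(n=5000)
hot = [r for r in rows if r[0] >= 70.0]
cold = [r for r in rows if r[0] < 70.0]
print(failure_share(hot), failure_share(cold))
```

With `beta` above zero, hot subperiods should fail more often than cold ones, so a model has a real signal to learn. Setting `beta=0` recovers the case where temperature carries no information about failures.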
@Fernanda It's still not clear to me what you're trying to do. Rephrasing: what is it that you're trying to predict (is it machine failure during one subperiod?), and which variables do you want to use to predict it (is it the 24 historical failure variables plus temperature, moisture and pressure?)? — LmnICE, Aug 15 '20 at 19:54
@LmnICE I'm trying to predict the machine failures for the next period, e.g., Period_25. In this case, I have to predict every subperiod of this period, indicating whether it is a 'Failure' or 'Normal'. As predictors, I want to use the 24 historical failure variables plus temperature, moisture and pressure. — Fernanda, Aug 15 '20 at 20:10
Let us [continue this discussion in chat](https://chat.stackexchange.com/rooms/111831/discussion-between-lmnice-and-fernanda). — LmnICE, Aug 15 '20 at 20:13
