I have a dataframe with 930,000 rows and 220 variables. The objective is binary classification, but my response classes are imbalanced (88% vs. 12%).

I want to use SMOTE to artificially create observations for the rare event but the function takes forever to run. By forever I mean it has been running for over 90 minutes.

My PC is not the slowest: it has an SSD and 8 GB of RAM.

Can anyone confirm if this duration is unusual or if this function is just slow in general, as the creation of observations with 220 variables might be computationally intensive?

Is there perhaps a better way to do this?

LeGossler

2 Answers

As has been written about extensively on StackExchange, when you have a rare outcome, classification is not the right approach, and it is appropriate instead to engage in the estimation of tendencies (probabilities). The existence of an invalid statistical procedure like SMOTE is evidence for misunderstanding. Any approach that requires one to delete data is invalid. See https://fharrell.com/post/classification for details.

Frank Harrell
    I think my favorite comment about SMOTE is your tweet: https://twitter.com/f2harrell/status/1062424969366462473?lang=en. – Dave Apr 26 '21 at 11:27

I just tested it on my relatively new MacBook Pro (i7, 32 GB RAM), and it takes about 30 s for a dataset with 100,000 rows and 8 variables. I also tried an even larger dataset, and it took much longer. Your computer's memory is indeed an issue, but I'm sure it's not the main problem. Using a feature extraction method such as PCA to reduce the number of variables might help. (I used the smotefamily package for this test.)
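
To get a feel for the scale involved, here is a rough back-of-envelope sketch in Python. It assumes a brute-force nearest-neighbour search among the minority rows (every minority row compared with every other one, across all features), which is only an assumption and may not be how smotefamily or any other package actually does it. The function name `smote_cost_estimate` and the reduced feature count of 20 are made up for illustration; the row count, class share, and variable count come from the question.

```python
def smote_cost_estimate(n_rows, minority_pct, n_features, bytes_per_value=8):
    """Rough cost figures, ASSUMING a brute-force neighbour search.

    Hypothetical helper for illustration only; it does not reflect
    the internals of any particular SMOTE implementation.
    """
    # Number of rows in the rare class (integer arithmetic keeps it exact).
    n_minority = n_rows * minority_pct // 100
    # Each minority row is compared with each minority row,
    # and each comparison touches every feature once.
    distance_ops = n_minority * n_minority * n_features
    # Memory needed to hold the full table as 8-byte doubles.
    table_bytes = n_rows * n_features * bytes_per_value
    return n_minority, distance_ops, table_bytes


# Figures from the question: 930,000 rows, 12% minority, 220 variables.
n_min, ops, mem = smote_cost_estimate(930_000, 12, 220)
print(f"minority rows: {n_min:,}")
print(f"feature-level distance operations: {ops:.2e}")
print(f"full table as doubles: {mem / 1e9:.2f} GB")

# Hypothetical: the same data reduced to 20 components (e.g. via PCA).
_, ops_reduced, _ = smote_cost_estimate(930_000, 12, 20)
print(f"reduction in distance work: {ops / ops_reduced:.0f}x")
```

Under that assumption the distance work grows with the square of the number of minority rows but only linearly with the number of variables, so reducing 220 variables to a handful helps, while the sheer number of rows remains the dominant factor.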

Carl