I am trying to find the f1-score, precision, and recall on a highly imbalanced dataset, and I would like to use a k-fold cross-validation approach. I followed this procedure:
- create arrays to store the test labels and the predictions
- split the data into training and testing sets
- over/undersample the training data only
- train the model and get its predictions on the test set
- append the test labels to array [A] and the predictions to array [B]
- go back to the splitting step for the next fold
- calculate the f1-score by comparing [A] and [B]
This is my code:
import pandas as pd
from sklearn.datasets import make_classification
from collections import Counter
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
# pip install imbalanced-learn
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import RandomOverSampler
# creating a dataset
X, y = make_classification(n_samples = 1500, n_features = 5, n_redundant = 0, weights = [0.9])
df = pd.concat([pd.DataFrame(X), pd.DataFrame(y, columns = ['class'])], axis = 1)
# df.head()
# separating the input features and the target
input_features = df.drop(labels = 'class', axis = 1)
target_feature = df['class']
# specifying the number of folds
n_folds = 5
# array to store the test data and predictions
test_data_array, predictions_array = [], []
# model
clf = RandomForestClassifier()
for fold in range(n_folds):
    # splitting the dataset
    X_train, X_test, y_train, y_test = train_test_split(input_features, target_feature, test_size = int(len(target_feature)/n_folds))
    # print(X_test.shape)
    # print(Counter(y_train))
    # do the oversampling / undersampling only on the training data
    over_sample = RandomOverSampler(sampling_strategy = 'minority')
    X_train_sampled, y_train_sampled = over_sample.fit_resample(X = X_train, y = y_train)
    # print(Counter(y_train_sampled))
    # training the model
    clf.fit(X_train_sampled, y_train_sampled)
    # getting the predictions on the testing data
    predictions = clf.predict(X_test)
    # appending them to the arrays
    test_data_array = test_data_array + y_test.tolist()
    predictions_array = predictions_array + predictions.tolist()
# precision, recall and f1-score computed over the pooled arrays
print(classification_report(y_true = test_data_array, y_pred = predictions_array))
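One thing I was not sure about: train_test_split draws a fresh random split on every iteration, so the test sets can overlap between folds and some rows may never appear in any test set. If I understand the docs correctly, StratifiedKFold gives disjoint, class-balanced test folds that together cover every sample exactly once. This is an untested sketch of how I think my loop would look with it (reusing the variables from above; random_state = 42 is an arbitrary choice):
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits = n_folds, shuffle = True, random_state = 42)
test_data_array, predictions_array = [], []
for train_index, test_index in skf.split(input_features, target_feature):
    # each sample lands in exactly one test fold
    X_train, X_test = input_features.iloc[train_index], input_features.iloc[test_index]
    y_train, y_test = target_feature.iloc[train_index], target_feature.iloc[test_index]
    # resample only the training fold
    X_train_sampled, y_train_sampled = RandomOverSampler(sampling_strategy = 'minority').fit_resample(X_train, y_train)
    clf.fit(X_train_sampled, y_train_sampled)
    predictions = clf.predict(X_test)
    test_data_array = test_data_array + y_test.tolist()
    predictions_array = predictions_array + predictions.tolist()
print(classification_report(y_true = test_data_array, y_pred = predictions_array))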
I would like to get expert advice on this procedure. I tried the built-in cross-validation in sklearn, but it only reports the average f1-score across the folds. I am not sure which is the right approach: append the results and calculate the f1-score over the pooled predictions, or calculate an f1-score per fold and average them?
Is this the right way to do this, or is there a better way to achieve it?
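For reference, this is roughly what I tried with the built-in tools. As far as I understand, imblearn's Pipeline applies the sampler only during fit (i.e., only to each training fold), and cross_val_predict pools the out-of-fold predictions, so classification_report can be computed over the whole dataset instead of averaging per-fold scores (untested sketch, same data as above):
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from imblearn.pipeline import Pipeline

# the sampler runs only when fitting, so the test fold is never resampled
pipeline = Pipeline([
    ('sampler', RandomOverSampler(sampling_strategy = 'minority')),
    ('model', RandomForestClassifier())
])
cv = StratifiedKFold(n_splits = n_folds, shuffle = True, random_state = 42)
# every sample is predicted exactly once, by a model that never saw it during training
pooled_predictions = cross_val_predict(pipeline, input_features, target_feature, cv = cv)
print(classification_report(y_true = target_feature, y_pred = pooled_predictions))
If the per-fold averages are wanted instead, I believe cross_validate(pipeline, input_features, target_feature, cv = cv, scoring = 'f1') would return one f1-score per fold.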