
I am trying to compute the F1 score, precision, and recall of a classifier on a highly imbalanced dataset, and I would like to use a k-fold cross-validation approach. I followed this procedure:

  1. Create arrays to store the test data and the predictions.
  2. Split the data into training and test sets.
  3. Over/under-sample the training data only.
  4. Train the model and get predictions on the test data.
  5. Append the test labels to the test array [A] and the predictions to the predictions array [B].
  6. Go back to (2) for the next fold.
  7. Calculate the F1 score by comparing [A] and [B].

This is my code:

import pandas as pd
from sklearn.datasets import make_classification
from collections import Counter
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
# pip install imblearn
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import RandomOverSampler

# creating a dataset
X, y = make_classification(n_samples = 1500, n_features = 5, n_redundant = 0, weights = [0.9])
df = pd.concat([pd.DataFrame(X), pd.DataFrame(y, columns = ['class'])], axis = 1)
# df.head()
# getting the input and output feature
input_features = df.drop(labels = 'class', axis = 1)
target_feature = df['class']
# specifying the number of folds
n_folds = 5
# array to store the test data and predictions
test_data_array, predictions_array = [], []
# model
clf = RandomForestClassifier()
for fold in range(0, n_folds):
    # splitting the dataset
    X_train, X_test, y_train, y_test = train_test_split(input_features, target_feature, test_size = int(len(target_feature)/n_folds))
    # print(X_test.shape)
    # print(Counter(y_train))
    # do the oversampling / undersampling only in training data
    over_sample = RandomOverSampler(sampling_strategy = 'minority')
    X_train_sampled, y_train_sampled = over_sample.fit_resample(X = X_train, y = y_train)
    # print(Counter(y_train_sampled))
    # training the model:
    clf.fit(X_train_sampled, y_train_sampled)
    # getting the predictions from the testing data
    predictions = clf.predict(X_test)
    # appending them into the array
    test_data_array = test_data_array + y_test.tolist()
    predictions_array = predictions_array + predictions.tolist()
# calculating f1 score from the appended arrays
print(classification_report(y_true = test_data_array, y_pred = predictions_array))

I would like to get expert advice about this procedure. I tried the built-in cross-validation in sklearn, but it only gives me the average F1 score across the folds. I am not sure which is the right approach: append the predictions from all folds and calculate one F1 score over the whole set, or calculate an F1 score for each fold and take the average.

Is this the right way to do this, or is there a better way to achieve it?
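
For reference, a minimal sketch of the pooled-predictions variant using sklearn's own utilities (this is not the original code; it reuses input_features and target_feature from above, and the StratifiedKFold settings are illustrative assumptions). Putting the oversampler inside an imblearn Pipeline means it is refit on the training portion of every fold, while cross_val_predict collects one out-of-fold prediction per sample:

from imblearn.over_sampling import RandomOverSampler
from imblearn.pipeline import Pipeline as ImbPipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# the sampler is a pipeline step, so resampling is applied only to the
# training folds; each held-out fold is predicted on the original data
pipeline = ImbPipeline([
    ('over_sample', RandomOverSampler(sampling_strategy = 'minority')),
    ('clf', RandomForestClassifier()),
])
cv = StratifiedKFold(n_splits = 5, shuffle = True, random_state = 42)
# one out-of-fold prediction per sample, pooled across all folds
oof_predictions = cross_val_predict(pipeline, input_features, target_feature, cv = cv)
print(classification_report(y_true = target_feature, y_pred = oof_predictions))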

  • [Why is accuracy not the best measure for assessing classification models?](https://stats.stackexchange.com/q/312780/1352) [Is accuracy an improper scoring rule in a binary classification setting?](https://stats.stackexchange.com/q/359909/1352) [Classification probability threshold](https://stats.stackexchange.com/q/312119/1352) The same problems apply to the F1 score, and indeed to all evaluation metrics that rely on hard classifications. Instead, use probabilistic classifications, and evaluate these using [proper scoring rules](https://stats.stackexchange.com/tags/scoring-rules/info). – Stephan Kolassa Apr 20 '21 at 05:13
  • I'm working on the AutoML python package https://github.com/mljar/mljar-supervised, and I use the approach of getting all predictions from all folds and then computing the score. If you have a large dataset with many samples, the difference between the score on all samples and the average score over the folds should be small. For a small number of samples the difference can be high, and I prefer the score from all samples. – pplonski Apr 20 '21 at 07:41
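
Following up on the pooled-versus-averaged point raised in the comments, the per-fold alternative is easy to compute for comparison. A minimal sketch (again reusing input_features and target_feature from the question; the pipeline and splitter here are illustrative choices, not from the original post):

from imblearn.over_sampling import RandomOverSampler
from imblearn.pipeline import Pipeline as ImbPipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

pipeline = ImbPipeline([
    ('over_sample', RandomOverSampler(sampling_strategy = 'minority')),
    ('clf', RandomForestClassifier()),
])
cv = StratifiedKFold(n_splits = 5, shuffle = True, random_state = 42)
# F1 of the positive (minority) class, computed separately on each fold
fold_f1 = cross_val_score(pipeline, input_features, target_feature, cv = cv, scoring = 'f1')
print(fold_f1)          # one score per fold
print(fold_f1.mean())   # the averaged-over-folds figure, to compare with the pooled report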
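
To illustrate the proper-scoring-rule suggestion from the comments, here is a sketch that evaluates out-of-fold predicted probabilities with the Brier score instead of thresholded labels (the pipeline and splitter are the same illustrative choices as above, not from the original post):

from imblearn.over_sampling import RandomOverSampler
from imblearn.pipeline import Pipeline as ImbPipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import StratifiedKFold, cross_val_predict

pipeline = ImbPipeline([
    ('over_sample', RandomOverSampler(sampling_strategy = 'minority')),
    ('clf', RandomForestClassifier()),
])
cv = StratifiedKFold(n_splits = 5, shuffle = True, random_state = 42)
# out-of-fold class-membership probabilities instead of hard labels
oof_proba = cross_val_predict(pipeline, input_features, target_feature, cv = cv, method = 'predict_proba')
# Brier score (a proper scoring rule) on the probability of the positive (minority) class
print(brier_score_loss(target_feature, oof_proba[:, 1]))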

0 Answers