
I am currently working on a classification problem with two classes, A and B, using tf-idf features and Naive Bayes. I randomly shuffled the dataset before splitting it, and I experimented with the following train/test splits (a rough sketch of the setup follows the list):

  1. Training - 90% , Testing - 10%

  2. Training - 80% , Testing - 20%

  3. Training - 30% , Testing - 70%
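
Roughly, the split-based setup looks like the sketch below (a simplified sketch only; `texts` and `labels` are placeholder names for my documents and class labels, and the details may differ from my actual code):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

    # texts: list of raw documents, labels: list of 0/1 class labels (placeholder names)
    for test_size in (0.10, 0.20, 0.70):
        X_train, X_test, y_train, y_test = train_test_split(
            texts, labels, test_size=test_size, shuffle=True, random_state=0)

        vec = TfidfVectorizer()
        clf = MultinomialNB()
        clf.fit(vec.fit_transform(X_train), y_train)   # tf-idf fitted on the training data only
        y_pred = clf.predict(vec.transform(X_test))

        print(classification_report(y_test, y_pred))
        print(confusion_matrix(y_test, y_pred))
        print(accuracy_score(y_test, y_pred))          # fraction of correct predictions on the test set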

The result was that the accuracy kept increasing by 5 to 10% every time I decreased the percentage of training data, i.e. (3) had the highest accuracy (around 50%). Meanwhile, the precision, recall and F1 score stayed roughly between 75% and 79% for all of them.

I then repeated the evaluation with k-fold cross-validation, for k = 3, 5 and 10. Here k = 10 gave me the highest accuracy (around 80%), while the precision, recall and F1 score stayed around 68% to 72%.
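
The cross-validated numbers come from something along these lines (again a simplified sketch with the same placeholder names; a Pipeline is used here so that tf-idf is re-fitted on the training folds inside each split):

    from sklearn.pipeline import make_pipeline
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.model_selection import cross_val_predict
    from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

    pipe = make_pipeline(TfidfVectorizer(), MultinomialNB())

    for k in (3, 5, 10):
        # each sample is predicted exactly once, by the model trained on the other folds
        y_pred = cross_val_predict(pipe, texts, labels, cv=k)
        print(classification_report(labels, y_pred))
        print(confusion_matrix(labels, y_pred))
        print(accuracy_score(labels, y_pred))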

I am not able to justify why this happens (accuracy increasing while the size of the training data decreases). As far as I know, overfitting and underfitting should show up in the values of precision, recall and F1 score, but the results above don't show any such pattern.

Also, why do the tests without cross-validation have better precision, recall and F1 score, yet lower accuracy, than the k-fold cross-validation results?

Edit: I have used the same amount of data for classes A and B, so there is no class imbalance.

Results:

without CV

('total data : ', 1266) ('training pc: ', 30)

                 precision    recall  f1-score   support
            0.0       0.70      0.73      0.71       438
            1.0       0.73      0.69      0.71       449
    avg / total       0.71      0.71      0.71       887

NB (Confusion matrix and accuracy)

    [[ 321.  117.] 
     [ 139.  310.]] 0.498420221169 (Accuracy)

('total data : ', 1266) ('training pc: ', 80)

                 precision    recall  f1-score   support
            0.0       0.79      0.79      0.79       141
            1.0       0.73      0.73      0.73       113
    avg / total       0.76      0.76      0.76       254


    [[ 111.   30.]
     [  30.   83.]] 0.153238546603

('total data : ', 1266) ('training pc: ', 90)

                 precision    recall  f1-score   support
            0.0       0.76      0.81      0.79        64
            1.0       0.80      0.75      0.77        63
    avg / total       0.78      0.78      0.78       127

    [[ 52.  12.]
     [ 16.  47.]] 0.0781990521327

With CV K fold

k=3 ('total data size: ', 1266) training pc: 60

                 precision    recall  f1-score   support
            0.0       0.69      0.60      0.64       211
            1.0       0.65      0.73      0.69       211
    avg / total       0.67      0.67      0.66       422

    [[ 445.  188.]
     [ 183.  450.]] 0.824125230203

k=5 ('total data size: ', 1266) training pc: 80

                 precision    recall  f1-score   support
            0.0       0.64      0.58      0.61       126
            1.0       0.62      0.67      0.64       126
    avg / total       0.63      0.63      0.63       252


    [[ 466.  167.]
     [ 170.  463.]] 0.855432780847

k= 10 ('total data size: ', 1266) training pc: 90

                 precision    recall  f1-score   support
            0.0       0.64      0.57      0.61        63
            1.0       0.61      0.68      0.65        63
    avg / total       0.63      0.63      0.63       126


    [[ 482.  151.]
     [ 153.  480.]] 0.885819521179
OnePunchMan
  • What are the proportions of the classes? – Jakub Bartczuk Mar 10 '18 at 14:53
  • @JakubBartczuk I have put it in the question. – OnePunchMan Mar 10 '18 at 14:56
  • Can you post training and test metrics for both test and training data? – Jakub Bartczuk Mar 10 '18 at 15:00
  • Also, if you use Python, can you post the outputs of `classification_report` from scikit-learn? – Jakub Bartczuk Mar 10 '18 at 15:01
  • I am new to classification problems, and yes, I am using Python and scikit-learn, which is also new to me. I have the confusion matrix and the accuracy for all of the above tests, but I don't know how to get the classification_report. – OnePunchMan Mar 10 '18 at 15:03
  • [classification_report](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html) - you use it on y_test, and y_pred (which should be obtained from running `predict` on classifier with your test data) – Jakub Bartczuk Mar 10 '18 at 15:06
  • OK, so you mean the actual values of precision, recall and F1 score, right? I calculated them manually from the confusion matrix [TP FN; FP TN] using the formulas Precision = TP/(TP+FP), Recall = TP/(TP+FN), F1 = 2*(Precision*Recall)/(Precision + Recall). – OnePunchMan Mar 10 '18 at 15:10
  • Let us [continue this discussion in chat](https://chat.stackexchange.com/rooms/74302/discussion-between-jakub-bartczuk-and-onepunchman). – Jakub Bartczuk Mar 10 '18 at 15:11
  • You may be interested in [Why is accuracy not the best measure for assessing classification models?](https://stats.stackexchange.com/q/312780/1352) – Stephan Kolassa Mar 10 '18 at 22:37
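
A quick worked check of the formulas from the comments, applied to the 30%-training confusion matrix above (scikit-learn prints rows as true classes and columns as predicted classes; class 0.0 is treated as the positive class here):

    # 30%-training confusion matrix, rows = true class, cols = predicted class
    # [[321 117]
    #  [139 310]]
    tp, fn, fp, tn = 321., 117., 139., 310.              # class 0.0 taken as "positive"
    precision = tp / (tp + fp)                            # 321/460 ~ 0.70
    recall    = tp / (tp + fn)                            # 321/438 ~ 0.73
    f1 = 2 * precision * recall / (precision + recall)    # ~ 0.71

These agree with the 0.70 / 0.73 / 0.71 row reported for class 0.0 in the classification report above.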
