I would like to seek some veteran feedback on this.

I am working on an imbalanced, multiclass stock market dataset for educational purposes. The label distribution looks like this:

 0: 2503
 1:  234
-1:   32

0 represents a hold signal, 1 represents a buy signal, and -1 represents a sell signal.

I have been reading up on which evaluation metric to use and concluded that the F1 score is the one to go with. The next question is whether to use micro or macro averaging. If I understand correctly, micro should be used if I want the metric to be weighted towards the majority label (in this case 0, which represents hold), while macro weights all labels equally and therefore emphasizes the minority labels (in this case, I assume both 1 and -1?).
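
To make the difference concrete, here is a minimal sketch (assuming scikit-learn; the label counts follow the distribution above, and the predictions are made up purely to illustrate the averaging behaviour, not produced by any real model):

```python
import numpy as np
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)

# Ground truth with roughly the class balance above (-1 = sell, 0 = hold, 1 = buy)
y_true = np.array([0] * 2503 + [1] * 234 + [-1] * 32)
# A toy "classifier" that predicts hold (0) 95% of the time, otherwise the true label
y_pred = np.where(rng.random(y_true.size) < 0.95, 0, y_true)

# Micro averaging pools every individual decision, so the majority class dominates
print("micro F1:", f1_score(y_true, y_pred, average="micro"))
# Macro averaging computes F1 per class and then averages, so each class counts equally
print("macro F1:", f1_score(y_true, y_pred, average="macro"))
```

As far as I understand, for a single-label multiclass problem the micro-averaged F1 works out to the same number as plain accuracy, so it shares accuracy's insensitivity to the rare buy/sell classes.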

Choosing an evaluation metric seems to depend on the domain of the data and on the relative importance of the labels. In the case of the stock market, where I do not have enough domain knowledge since I am still a student, which option would be better?

P.S.: I feel like micro is the one to use, as most of the time in a stock market we want to hold the stock rather than buy or sell. What do you think? I am also trying SVM/Random Forest/Naive Bayes on this prediction problem.
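
For reference, this is roughly the setup I have in mind, sketched with a synthetic stand-in for my dataset (scikit-learn, with `class_weight="balanced"` as one possible way to handle the imbalance for SVM and Random Forest):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# Synthetic stand-in for the real features/labels, with a similar class imbalance
X, y = make_classification(
    n_samples=2769, n_classes=3, n_informative=5,
    weights=[0.90, 0.085, 0.015], random_state=0,
)

# class_weight="balanced" reweights the rare classes for SVM and Random Forest;
# GaussianNB has no class_weight option, but its class priors could be adjusted instead
models = {
    "SVM": SVC(class_weight="balanced"),
    "Random Forest": RandomForestClassifier(class_weight="balanced", random_state=0),
    "Naive Bayes": GaussianNB(),
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="f1_macro")
    print(f"{name}: mean macro F1 = {scores.mean():.3f}")
```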

  • Don't use accuracy, precision, recall, sensitivity, specificity, or the F1 score. Every criticism at the following threads applies equally to all of these, and indeed to all evaluation metrics that rely on hard classifications: [Why is accuracy not the best measure for assessing classification models?](https://stats.stackexchange.com/q/312780/1352) [Is accuracy an improper scoring rule in a binary classification setting?](https://stats.stackexchange.com/q/359909/1352) [Classification probability threshold](https://stats.stackexchange.com/q/312119/1352) – Stephan Kolassa Oct 03 '21 at 12:56
  • Instead, use probabilistic classifications, and evaluate these using [proper scoring rules](https://stats.stackexchange.com/tags/scoring-rules/info). – Stephan Kolassa Oct 03 '21 at 12:56
  • ((Recommented)) @StephanKolassa thanks for your feedback. I went to read up further and saw that Naive Bayes is one of the probabilistic classifiers that you mentioned. Does that mean that after training and fitting a model, I will then apply a proper scoring rule (e.g. Brier) to evaluate the performance instead? Additionally, why would probabilistic classification be better in this case? – datanewbie96 Oct 03 '21 at 13:43
  • @StephanKolassa, scoring rules offer a means of relative evaluation and thus are useful for comparing multiple forecasts. For a single forecast, it may be more meaningful to assess it in absolute terms. Two relevant aspects are statistical adequacy and the expected loss (given a user's loss function). Or am I mistaken? – Richard Hardy Oct 03 '21 at 15:11
  • @RichardHardy: you make good points. If by *expected loss* you mean expected loss in *utility*, then we need to look at the classification/prediction in the context of the subsequent *action* and assess whether the entire system is "good enough". (Which again implies a comparison: "good enough" *compared to what*? To doing nothing? To continuing as before?) Also, the F1 score is commonly also used exactly in this way: to compare predictions, or models. – Stephan Kolassa Oct 03 '21 at 15:48
  • @datanewbie96: probabilistic predictions are better because they help you cleanly separate the modeling/prediction aspect from the subsequent decision aspect. Hard 0-1 classifications always smuggle in a threshold somewhere, and you don't know where subpar performance comes from: from a wrong classifier, or from a badly thought out (implicit) mapping from the probabilistic classifications to actions. You may find the link to an earlier thread about thresholds useful. – Stephan Kolassa Oct 03 '21 at 15:50
  • @StephanKolassa thanks Stephan. Sorry I am still new to this subject, but what do you mean by "hard classification"? – datanewbie96 Oct 03 '21 at 16:40
  • Sorry, please bear with me as I would like to check two more things: my dataset is actually discrete, I don't suppose this will affect what you mentioned? Also, I have evaluated both RF and SVM (just to see log loss) and got log loss values of 0.46 and 1.06 respectively. I can't understand what the log loss value represents despite reading about it online... appreciate it if you can provide some insights too, thanks. – datanewbie96 Oct 03 '21 at 16:55
  • By "hard classifications", I mean classifiers that output "hard" labels: "instance 1 is class B, instance 2 is class A". This contrasts with probabilistic classifiers, which output something like "instance 1 has a 20% probability of being class A, instance 2 has a 67% probability" (assuming two classes A and B). No, if your training data are discrete, that is no problem at all. Finally, the log loss is a [strictly proper scoring rule](https://stats.stackexchange.com/q/477479/1352), so that is good. [This thread may be helpful.](https://stats.stackexchange.com/q/274088/1352) – Stephan Kolassa Oct 04 '21 at 16:05

0 Answers