
I have lately been reading papers on sentiment analysis, where many of them report that their improvements yield an increase of 1-2%, or even 0.5%, in accuracy compared to non-trivial baseline methods.

Of course, I understand that such an increase, even if small, is a good thing if it is statistically significant. But what are the advantages, in terms of application or utility, of such a small improvement? Is there any practical application that could directly benefit from it? If such an application exists, how can one determine the minimum accuracy it requires?

(A similar question is posted here, but I am asking from a practical rather than a statistical point of view.)

kanzen_master
    Generally speaking, I do not think that being able to detect a small difference through statistical analysis makes that difference worthwhile. Statistical significance just means that claiming the difference is at least x carries a type I error of, say, 0.05; it does not make x important. What counts as a practically meaningful difference is a question for the investigator. With a very large sample size you may be able to detect a small difference x that is quite unimportant. – Michael R. Chernick May 21 '12 at 01:15
    Often the appropriate way to pick the sample size is to find how large n has to be to detect what the investigator considers a meaningful difference d. n is chosen so that the test has high power (say 90%) to detect a difference of d or more (see the sketch after these comments). – Michael R. Chernick May 21 '12 at 01:20
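
To make the two comments above concrete, here is a minimal sketch in Python (assuming statsmodels is available; the accuracies, test-set sizes, and the 2-point "meaningful difference" are made-up numbers for illustration). It shows that the same 0.5% accuracy gap can be non-significant or "significant" depending only on the test-set size, and how one might instead choose n to detect a difference the investigator actually considers meaningful:

```python
# Hypothetical numbers: baseline at 85.0% accuracy, new system at 85.5%.
from statsmodels.stats.proportion import proportions_ztest, proportion_effectsize
from statsmodels.stats.power import NormalIndPower

acc_baseline, acc_new = 0.850, 0.855

# Part 1: the same 0.5% gap, tested on a small vs. a huge test set.
for n in (1_000, 100_000):
    correct = [int(acc_new * n), int(acc_baseline * n)]  # correct predictions
    _, p = proportions_ztest(correct, [n, n])
    print(f"n = {n:>7}: p-value for the 0.5% gap = {p:.4f}")
# n =    1000: p ~ 0.75  -> not significant
# n =  100000: p ~ 0.002 -> "significant", yet x = 0.5% is as unimportant as before

# Part 2: pick n for a difference the investigator cares about,
# e.g. d = 2 percentage points (0.85 vs 0.87), alpha = 0.05, power = 0.90.
effect = proportion_effectsize(0.87, 0.85)  # Cohen's h for the 2-point gap
n_needed = NormalIndPower().solve_power(effect_size=effect, alpha=0.05, power=0.90)
print(f"n per system for 90% power on a 2-point gap: {n_needed:.0f}")
```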

1 Answer


Such improvements matter if you want to publish or win a Kaggle competition, but they are less important for practical applications.

Obviously, more accurate is better, so if you can use something more accurate, so much the better. But, first, there are implementation costs: for example, Netflix never deployed the algorithm that won its competition and stayed with a simpler, less accurate one, because the switch was not worth the additional costs. Second, the fact that some algorithm, trained on a particular training set, achieved some result on a particular test set does not mean that it will do the same on any other dataset, or even that its performance will not change over time. Moreover, there is an ongoing debate about whether the algorithms described in the literature are overfitting to the test sets that everyone uses (see Recht et al., arXiv:1806.00451).
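
To put numbers on the generalization point (a sketch on a synthetic dataset with a stand-in model, not the setup of any particular paper): the spread of accuracy across cross-validation folds is often on the same order as, or larger than, a reported 1-2% improvement, so a single test-set score should be read with its error bars in mind.

```python
# Sketch: how large are the "error bars" on an accuracy estimate?
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for a sentiment dataset (2,000 examples, 20 features).
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=10)
print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f} across 10 folds")
# If the fold-to-fold standard deviation is ~0.02, a 1% improvement
# measured on a single test set is well within the noise.
```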

So usually you won't redesign your machine learning pipeline just because a "better" algorithm was described in the literature. On the other hand, you may still consider testing it, or it may inspire improvements in ongoing or future projects, but only after assessing the cost-effectiveness (where the cost includes time, computational resources needed, etc.).

Tim