
I see that image-classification models from the torchvision package don't have a softmax layer as the final layer. For instance, the following snippet shows that the resnet18 outputs don't sum to 1, so a softmax layer is certainly absent.

from torchvision import models
import torch

# untrained ResNet-18; pretrained weights are irrelevant for this check
model = models.resnet18(pretrained=False)

# a batch of 8 random RGB "images" of size 200x200
x = torch.rand(8, 3, 200, 200)

y = model(x)

# the rows do not sum to 1, so the final layer outputs raw logits, not probabilities
print(y.sum(dim=1))
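
For comparison, explicitly applying a softmax to the same outputs does make each row sum to 1:

# rows sum to 1 once a softmax is applied explicitly
print(torch.softmax(y, dim=1).sum(dim=1))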

So the question is: why doesn't torchvision put a softmax layer at the end? And how much would adding a softmax layer improve performance, and why?

  • How do you plan to relate `y` to the class labels? What is the loss function that you will use to train this model? In NNs, Softmax is *nearly* synonymous with classification, but there are lots of ways to train models to learn something about classes that are not, themselves, **classification** networks, because they are learning a representation, e.g. [tag:triplet-loss]. Likewise, there are alternatives to softmax for classification. Comparing logits and probits is one example: https://stats.stackexchange.com/questions/20523/difference-between-logit-and-probit-models/30909#30909 – Sycorax Aug 31 '21 at 14:35
  • Does this answer your question? https://stats.stackexchange.com/questions/162988/why-sigmoid-function-instead-of-anything-else – Sycorax Aug 31 '21 at 14:42

1 Answer


Whether you need a softmax layer to train a neural network in PyTorch depends on which loss function you use. If you use torch.nn.CrossEntropyLoss, then the softmax is computed as part of the loss. From its documentation:

The loss can be described as: $$ \text{loss}(x, \text{class}) = -\log\left(\frac{\exp(x[\text{class}])}{\sum_j \exp(x[j])}\right) $$

This loss is just a torch.nn.LogSoftmax layer followed by torch.nn.NLLLoss. From the documentation of torch.nn.CrossEntropyLoss:

This criterion combines LogSoftmax and NLLLoss in one single class.

and from the documentation of torch.nn.NLLLoss:

Obtaining log-probabilities in a neural network is easily achieved by adding a LogSoftmax layer in the last layer of your network. You may use CrossEntropyLoss instead, if you prefer not to add an extra layer.
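
As a quick sanity check, here is a minimal sketch (the batch size and the 1000-way output are arbitrary choices for illustration) showing that the two formulations give the same value when fed raw logits:

import torch
import torch.nn as nn

logits = torch.randn(8, 1000)            # raw model outputs, no softmax applied
targets = torch.randint(0, 1000, (8,))   # integer class labels

ce = nn.CrossEntropyLoss()(logits, targets)
nll = nn.NLLLoss()(nn.LogSoftmax(dim=1)(logits), targets)

print(torch.allclose(ce, nll))  # True: CrossEntropyLoss = LogSoftmax + NLLLoss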

It seems that the developers of these pretrained models had torch.nn.CrossEntropyLoss in mind when creating them.
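
In other words, a typical training setup would pass the raw outputs straight to the loss, and only apply a softmax explicitly if probabilities are needed, e.g. at inference time. A rough sketch (the input size, learning rate, and dummy labels are just illustrative assumptions):

import torch
from torchvision import models

model = models.resnet18(pretrained=False)       # outputs raw logits, no softmax layer
criterion = torch.nn.CrossEntropyLoss()         # the softmax happens inside the loss
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

x = torch.rand(8, 3, 200, 200)                  # dummy batch of images
labels = torch.randint(0, 1000, (8,))           # dummy class labels

logits = model(x)
loss = criterion(logits, labels)                # no softmax needed here
loss.backward()
optimizer.step()

# only if you actually want probabilities (e.g. at inference time):
probs = torch.softmax(logits, dim=1)
print(probs.sum(dim=1))                         # each row sums to 1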

mhdadk
  • This seems to beg the question, because `CrossEntropyLoss` is just the concatenation of a `torch.nn.LogSoftmax` followed by the `torch.nn.NLLLoss` loss. Are you saying that a classification model must necessarily use softmax, either as a layer in the model or as a step in computing the loss? Or are there alternative classification models that do not use softmax? – Sycorax Aug 31 '21 at 15:02
  • I am mainly answering the OP's question: "why doesn't torchvision put a softmax layer at the end?". In the testing phase, you don't *need* the softmax layer, since the softmax function is monotonically increasing: you can just take the argmax of the outputs of the linear layer to obtain the predicted class (see the sketch after these comments). – mhdadk Aug 31 '21 at 15:07
  • Reading between the lines, it seems that your answer is written specifically about `torchvision.models.resnet18`, and does not comment on neural networks for classification generally. I think your answer would be clearer if you added this specificity. – Sycorax Aug 31 '21 at 15:11
  • In general, it doesn't *have* to be softmax; it just enforces the "sum to one" constraint "in hardware" rather than the network having to learn it. If you have a lot of data, then I suspect a well-trained network with logistic outputs will probably be just as good. Linear outputs and MSE also work (since the conditional mean of the targets gives the probabilities of class membership), but that pushes things even further into the network's "software" when it could easily be done for the network with a better activation function. – Dikran Marsupial Aug 31 '21 at 15:19
  • Of course, that assumes the implementation is modular and doesn't hardcode specific combinations (which it shouldn't; it isn't as if it were the computationally expensive bit ;o). – Dikran Marsupial Aug 31 '21 at 15:21
  • ... and you can get symbolic maths packages to write that bit of the code for you! – Dikran Marsupial Aug 31 '21 at 15:39
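
Regarding the argmax point in the comments above: since softmax is monotonic, it does not change which class has the largest score, so a sketch like the following (shapes chosen arbitrarily) gives the same predictions with or without it:

import torch

logits = torch.randn(8, 1000)

pred_from_logits = logits.argmax(dim=1)
pred_from_probs = torch.softmax(logits, dim=1).argmax(dim=1)

# True: applying softmax does not change the predicted class
print(torch.equal(pred_from_logits, pred_from_probs))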