Questions tagged [softmax]
201 questions

The softmax is a normalizing exponential function that transforms a numeric vector so that all its entries lie between 0 and 1 and together sum to 1. It is often used as the final layer of a neural network performing a classification task.
109 votes · 4 answers
Softmax vs Sigmoid function in Logistic classifier?
What decides the choice of function (softmax vs sigmoid) in a logistic classifier?
Suppose there are 4 output classes. Each of the above functions gives the probability of each class being the correct output. So which one to take for a…

mach (1,545 · 3 · 10 · 12)
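As an aside, the difference is easy to see numerically; a minimal sketch with NumPy, using hypothetical logits for the 4 classes:

```python
import numpy as np

logits = np.array([2.0, 1.0, 0.1, -1.0])  # hypothetical scores for 4 classes

softmax = np.exp(logits) / np.exp(logits).sum()  # one coupled distribution, sums to 1
sigmoid = 1 / (1 + np.exp(-logits))              # independent per-class values in (0, 1)

print(softmax.round(2), softmax.sum())  # [0.64 0.23 0.1  0.03], sums to 1
print(sigmoid.round(2), sigmoid.sum())  # [0.88 0.73 0.52 0.27], ~2.41: not a distribution
```

Softmax is the natural fit for mutually exclusive classes; independent per-class sigmoids fit multi-label problems where classes can co-occur.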
60 votes · 5 answers
Backpropagation with Softmax / Cross Entropy
I'm trying to understand how backpropagation works for a softmax/cross-entropy output layer.
The cross entropy error function is
$$E(t,o)=-\sum_j t_j \log o_j$$
with $t$ and $o$ as the target and output at neuron $j$, respectively. The sum is over…

micha (703 · 1 · 6 · 5)
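For reference, the identity this question leads to can be derived in one line. Assuming softmax outputs $o_j = e^{z_j}/\sum_k e^{z_k}$ over pre-activations $z_j$, and targets normalized so that $\sum_k t_k = 1$, the softmax Jacobian $\partial o_k/\partial z_j = o_k(\delta_{kj} - o_j)$ collapses the chain rule:
$$\frac{\partial E}{\partial z_j} = -\sum_k \frac{t_k}{o_k}\,\frac{\partial o_k}{\partial z_j} = -\sum_k t_k(\delta_{kj} - o_j) = -t_j + o_j\sum_k t_k = o_j - t_j.$$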
56 votes · 2 answers
Cross-Entropy or Log Likelihood in Output layer
I read this page:
http://neuralnetworksanddeeplearning.com/chap3.html
It says that a sigmoid output layer with cross-entropy is quite similar to a softmax output layer with log-likelihood.
What happens if I use sigmoid with log-likelihood or…

malioboro (851 · 1 · 11 · 19)
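The similarity being asked about is tight. For a one-hot target with correct class $c$, the cross-entropy error function from the previous question reduces to a negative log-likelihood:
$$E(t,o) = -\sum_j t_j \log o_j = -\log o_c \qquad (t_c = 1,\ t_{j\ne c} = 0),$$
so minimizing cross-entropy over softmax outputs and maximizing the log-likelihood of the correct class are the same objective.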
47 votes · 6 answers
Why is softmax output not a good uncertainty measure for Deep Learning models?
I've been working with Convolutional Neural Networks (CNNs) for some time now, mostly on image data for semantic segmentation/instance segmentation. I've often visualized the softmax of the network output as a "heat map" to see how high per pixel…

Honeybear (599 · 1 · 6 · 8)
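One way to see the problem behind this question, as a minimal sketch with made-up scores: rescaling the logits changes the apparent "confidence" without changing the prediction, so the softmax value alone is a poor uncertainty measure.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())  # max-shift for numerical stability
    return e / e.sum()

scores = np.array([1.0, 0.5, 0.2])  # hypothetical per-pixel class scores
print(softmax(scores).max())        # ~0.49: looks uncertain
print(softmax(10 * scores).max())   # ~0.99: same ranking, looks near-certain
```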
33 votes · 2 answers
How to set up neural network to output ordinal data?
I have a neural network set up to predict something where the output variable is ordinal. I will describe below using three possible outputs A < B < C.
It is pretty obvious how to use a neural network to output categorical data: the output is…

Alex I (913 · 2 · 9 · 18)
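One commonly suggested setup, sketched below under the cumulative-encoding assumption (not the only option): replace the single softmax over {A, B, C} with sigmoid outputs for the cumulative events "y > A" and "y > B", which bakes the ordering into the targets.

```python
# Cumulative encoding for ordinal targets A < B < C (a sketch):
# two sigmoid outputs answer "is y > A?" and "is y > B?".
targets = {"A": [0, 0], "B": [1, 0], "C": [1, 1]}

def decode(outputs, threshold=0.5):
    # Predicted rank = number of cumulative thresholds passed.
    return "ABC"[sum(o > threshold for o in outputs)]

print(decode([0.9, 0.2]))  # 'B': passed "y > A?" but not "y > B?"
```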
30 votes · 3 answers
Why is softmax function used to calculate probabilities although we can divide each value by the sum of the vector?
Applying the softmax function on a vector will produce "probabilities" and values between $0$ and $1$.
But we can also divide each value by the sum of the vector and that will produce probabilities and values between $0$ and $1$.
I read the…

floyd (1,240 · 13 · 24)
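A two-line illustration of the gap (hypothetical values): dividing by the sum breaks down as soon as entries can be negative, whereas exponentiating first guarantees positive values that sum to 1.

```python
import numpy as np

x = np.array([2.0, -1.0, 0.5])

print(x / x.sum())                  # [ 1.33 -0.67  0.33]: a negative "probability"
print(np.exp(x) / np.exp(x).sum())  # [ 0.79  0.04  0.18]: a valid distribution
```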
19 votes · 2 answers
How deep is the connection between the softmax function in ML and the Boltzmann distribution in thermodynamics?
The softmax function, commonly used in neural networks to convert real numbers into probabilities, is the same function as the Boltzmann distribution, the probability distribution over energies for an ensemble of particles in thermal equilibrium at…

bjarkemoensted (452 · 3 · 15)
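For context, the identification is exact once softmax is written with a temperature parameter:
$$p_i = \frac{e^{x_i/T}}{\sum_j e^{x_j/T}} \quad\longleftrightarrow\quad p_i = \frac{e^{-E_i/(k_B T)}}{Z}, \qquad Z = \sum_j e^{-E_j/(k_B T)},$$
with $x_i = -E_i$ and the softmax temperature playing the role of $k_B T$; the normalizer is the Boltzmann partition function $Z$.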
15 votes · 2 answers
Different definitions of the cross entropy loss function
I started off learning about neural networks with the neuralnetworksanddeeplearning.com tutorial. In particular, the 3rd chapter has a section about the cross-entropy function, which defines the cross-entropy loss as:
$C = -\frac{1}{n}…

Reginald (153 · 1 · 6)
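The two definitions being contrasted are, reconstructing the truncated formula from the cited chapter:
$$C = -\frac{1}{n}\sum_x \big[\,y \ln a + (1-y)\ln(1-a)\,\big] \qquad\text{vs.}\qquad C = -\sum_j t_j \log o_j,$$
the first treating each sigmoid output $a$ as an independent Bernoulli variable averaged over the $n$ training inputs $x$, the second treating the softmax outputs as a single categorical distribution.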
14 votes · 3 answers
Why is hierarchical softmax better for infrequent words, while negative sampling is better for frequent words?
I wonder why hierarchical softmax is better for infrequent words, while negative sampling is better for frequent words, in word2vec's CBOW and skip-gram models. I have read the claim on https://code.google.com/p/word2vec/.

Franck Dernoncourt (42,093 · 30 · 155 · 271)
14 votes · 3 answers
Non-linearity before final Softmax layer in a convolutional neural network
I'm studying and trying to implement convolutional neural networks, but I suppose this question applies to multilayer perceptrons in general.
The output neurons in my network represent the activation of each class: the most active neuron corresponds…

rand (427 · 1 · 5 · 10)
13 votes · 1 answer
Softmax overflow
While waiting for Andrew Ng's next course on Coursera, I'm trying to program a classifier in Python with the softmax function on the last layer to get the different probabilities.
However, when I try to use it on the CIFAR-10 dataset (input: (3072,…

Dlmss (143 · 1 · 6)
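The standard fix for this overflow, as a sketch (not the asker's code): shift the logits by their maximum before exponentiating. The shift cancels in the ratio, $e^{x_i - m}/\sum_j e^{x_j - m} = e^{x_i}/\sum_j e^{x_j}$, but keeps every exponent at or below zero.

```python
import numpy as np

def stable_softmax(z):
    # Subtracting the row-wise max is a mathematical no-op for softmax,
    # but it bounds the exponents by 0 so np.exp cannot overflow.
    shifted = z - np.max(z, axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)

z = np.array([1000.0, 1001.0, 1002.0])  # naive np.exp(z) overflows to inf
print(stable_softmax(z))                # ≈ [0.09 0.24 0.67]
```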
12 votes · 1 answer
Log probabilities in reference to softmax classifier
In this page, https://cs231n.github.io/neural-networks-case-study/, why does it mention "the Softmax classifier interprets every element of $f$ as holding the (unnormalized) log probabilities of the three classes"?
I understand why it is unnormalized but…

Abhishek Bhatia (461 · 4 · 13)
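A small sketch of what the cs231n note means, with hypothetical scores $f$: exponentiating and normalizing turns the scores into probabilities, and "unnormalized" refers to the additive constant (the log of the normalizer) that separates $f$ from $\log p$.

```python
import numpy as np

f = np.array([3.0, 1.0, 0.0])    # class scores = unnormalized log probabilities

p = np.exp(f) / np.exp(f).sum()  # softmax: now a proper distribution

# log(p) differs from f only by a constant, the log of the normalizer:
print(np.log(p) - f)             # ≈ [-3.17 -3.17 -3.17]
```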
12 votes · 4 answers
Why is the softmax used to represent a probability distribution?
In the machine learning literature, to represent a probability distribution, the softmax function is often used. Is there a reason for this? Why isn't another function used?

SHASHANK GUPTA (1,139 · 2 · 10 · 17)
11 votes · 4 answers
What is an intuitive interpretation for the softmax transformation?
A recent question on this site asked about the intuition of softmax regression. This has inspired me to ask a corresponding question about the intuitive meaning of the softmax transformation itself. The general scaled form of the softmax function…

Ben (91,027 · 3 · 150 · 376)
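One concrete reading of the scaled form mentioned here, as a sketch with a temperature parameter: softmax interpolates between a hard argmax and a uniform distribution.

```python
import numpy as np

def softmax(x, temperature=1.0):
    e = np.exp((x - x.max()) / temperature)
    return e / e.sum()

x = np.array([3.0, 1.0, 0.0])
print(softmax(x, 0.1).round(2))   # [1. 0. 0.]: low T approaches argmax
print(softmax(x, 1.0).round(2))   # [0.84 0.11 0.04]: a soft ranking of the scores
print(softmax(x, 10.0).round(2))  # [0.39 0.32 0.29]: high T approaches uniform
```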
11 votes · 3 answers
Why 'e' in softmax?
I am doing an introduction to ML with TensorFlow and I came across the softmax activation function. Why is $e$ used in the softmax formula? Why not 2, 3, or 7?
$$
\text{softmax}(x)_i = \frac{\exp(x_i)}{\sum_j \exp(x_j)}
$$
$$\sum_j a^L_j = …$$

Gillian (213 · 2 · 6)
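A quick check of why the base is largely a convention (a sketch): any base $b > 1$ gives an ordinary softmax with inputs rescaled by $\ln b$, since $b^x = e^{x \ln b}$; $e$ is preferred mainly because it keeps the derivative clean, $\frac{d}{dx} e^x = e^x$.

```python
import numpy as np

def softmax_base(x, b):
    # A "base-b softmax" is just softmax with rescaled inputs: b**x == exp(x * ln b).
    e = b ** (x - x.max())
    return e / e.sum()

x = np.array([2.0, 1.0, 0.5])
print(softmax_base(x, 2.0))                 # base-2 version
print(softmax_base(x * np.log(2.0), np.e))  # ordinary softmax on x * ln 2: identical
```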