
I am doing an introduction to ML with TensorFlow and I came across the softmax activation function. Why is $e$ the base in the softmax formula? Why not 2, 3, or 7?

$$ \text{softmax}(x)_i = \frac{\exp(x_i)}{\sum_j \exp(x_j)} $$

(from the TensorFlow tutorial)

$$ \sum_j a^L_j = \frac{\sum_j e^{z^L_j}}{\sum_k e^{z^L_k}} = 1. \tag{79} $$

(from the NN book)

Gillian
  • Possible duplicate of [What is the reason why we use natural logarithm (ln) rather than log to base 10 in specifying function in econometrics?](https://stats.stackexchange.com/questions/27682/what-is-the-reason-why-we-use-natural-logarithm-ln-rather-than-log-to-base-10) – Tim Aug 06 '17 at 12:51
  • @Tim I think the answer to that question really doesn't get at the heart of the issue here. Usually you're not trying to interpret the softmax variables in the same way as you would with functions in econometrics. I thought it was more that it was easier to calculate the derivatives of softmax. – John Aug 06 '17 at 13:16
  • @Tim I understand why the compound-interest limit yields $e$, but I am unable to transpose the reason for this connection into softmax. – Gillian Aug 06 '17 at 13:36
  • It seems it could be any arbitrary base and it would give us an approximately (maybe even precisely?) correct image of what the distribution looks like. And working with irrational numbers such as $e$ surely slows down computation. – Gillian Aug 06 '17 at 13:44
  • How does this slow the computation? In any case you'd be dealing with floating-point numbers... – Tim Aug 06 '17 at 13:48
  • @Tim I assume that there must be a difference between raising an integer and a transcendental number to the same power. – Gillian Aug 06 '17 at 14:14
  • Run an experiment and check how much computation time you would save by using floating-point numbers vs. integers here. The difference would be negligible, especially inside a complicated algorithm like a neural network. [Laziness is a virtue of a programmer](http://threevirtues.com/), don't waste your time on useless optimizations. – Tim Aug 21 '17 at 09:22

3 Answers


Using a different base is equivalent to scaling your data

Let $\mathbf{z} = \left(\ln a\right) \mathbf{y}$

Now observe that $e^{z_i} = a^{y_i}$ hence:

$$ \frac{e^{z_i}}{\sum_j e^{z_j}} = \frac{a^{y_i}}{\sum_j a^{y_j}}$$

Multiplying vector $\mathbf{y}$ by the natural logarithm of $a$ is equivalent to switching the softmax function to base $a$ instead of base $e$.
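This equivalence is easy to verify numerically. A minimal sketch in NumPy (the base $a = 2$ and the example vector are just illustrative choices):

```python
import numpy as np

def softmax(z):
    """Standard base-e softmax (max subtracted for numerical stability)."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def softmax_base(y, a):
    """Softmax computed with an arbitrary base a instead of e."""
    p = a ** (y - np.max(y))
    return p / p.sum()

y = np.array([1.0, 2.0, 3.0])
a = 2.0

# Base-a softmax of y equals base-e softmax of (ln a) * y.
print(softmax_base(y, a))      # [0.14285714 0.28571429 0.57142857]
print(softmax(np.log(a) * y))  # same values
```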

You often have a linear model inside the softmax function (e.g. $z_i = \mathbf{x}' \mathbf{w}_i$). The $\mathbf{w}$ in $\mathbf{x}' \mathbf{w}$ can already scale the data, so allowing a different base wouldn't add any explanatory power. If the scaling can change, there's a sense in which all choices of base $a$ give equivalent models.

So why base $e$?

In exponential settings, $e$ is typically the most aesthetically beautiful, natural base to use: $\frac{d}{dx} e^x = e^x$. A lot of math can look prettier on the page when you use base $e$.
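One concrete instance of that tidiness: writing $p_i = \operatorname{softmax}(z)_i$, the gradient of the log-softmax with base $e$ is simply

$$ \frac{\partial \log p_i}{\partial z_j} = \delta_{ij} - p_j, $$

whereas with an arbitrary base $a$ every such derivative picks up an extra factor of $\ln a$.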

Matthew Gunn
  • Would the function be the same if e to the z was replaced with ln(z) ? – Jack Vial Nov 09 '17 at 17:05
  • @Jack I'm not sure I follow what you specifically had in mind? – Matthew Gunn Nov 09 '17 at 17:24
  • If $e^x = \ln(x)$, can the softmax function be written using $\ln(x)$ instead of $e^x$? – Jack Vial Nov 09 '17 at 18:01
  • You could write the [softmax](https://en.wikipedia.org/wiki/Softmax_function) as $\operatorname{Softmax}(\mathbf{x})_i = \frac{e^{x_i}}{\sum_{j=1}^k e^{x_j}} = \exp\left(\ln\left( \frac{e^{x_i}}{\sum_{j=1}^k e^{x_j}} \right) \right) = \exp\left( x_i - \ln\left( \sum_{j=1}^k e^{x_j} \right)\right)$. But other than that, I really don't know what you're looking for? – Matthew Gunn Nov 09 '17 at 18:42
  • MatthewGunn I think he wanted to know if rewriting the softmax equation to $\text{softmax}(x)_i = \frac{\ln(x_i)}{\sum_j \ln(x_j)}$ would yield the same result as $$\text{softmax}(x)_i = \frac{\exp(x_i)}{\sum_j \exp(x_j)}$$. However, the statement @Jack made, that "$e^x = \ln(x)$", isn't true, so I am not sure where he was going with that. – Sebastian Nielsen Oct 22 '18 at 21:45
  • **I have a question**. The only reason we raise the independent variable as an exponent (x, y, z, whatever you want to call it) is so that we can avoid negative values canceling out positive values, right? If that is true (please tell me if you know that is the case), then we can replace $e^x$ with $|x|$, right? $\text{softmax}(x)_i = \frac{|x_i|}{\sum_j |x_j|}$ – Sebastian Nielsen Oct 22 '18 at 21:58
  • @SebastianNielsen No, that's different. Graph $y = \frac{e^x}{e^x + 1}$ and $y = \frac{|x|}{|x| + 1}$ and see they're rather different. The former is strictly increasing in $x$. The latter is decreasing in $x$ for $x < 0$. – Matthew Gunn Oct 22 '18 at 22:03
  • @SebastianNielsen My point was that $f(x) = \frac{a^x}{a^x + 1}$ and $f(x) = \frac{e^{bx}}{e^{bx} + 1}$ are literally the **same** function for $b = \ln a$. Basic math: $e^{x \ln a} = e^{\ln a^x} = a^x$. – Matthew Gunn Oct 22 '18 at 22:06
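As a side note to the $x_i - \ln\left(\sum_j e^{x_j}\right)$ rewrite in the comment thread above: that log-sum-exp form is also how softmax is typically computed in practice, because it avoids overflow. A minimal NumPy sketch (the function name is my own, not any particular library's API):

```python
import numpy as np

def log_softmax(x):
    """Log-softmax via the log-sum-exp trick: x_i - log(sum_j exp(x_j)),
    with the max subtracted first so exp() never overflows."""
    shifted = x - np.max(x)
    return shifted - np.log(np.sum(np.exp(shifted)))

x = np.array([1000.0, 1001.0, 1002.0])
print(np.exp(log_softmax(x)))  # stable: approx [0.090, 0.245, 0.665]
# A naive np.exp(x) / np.exp(x).sum() would overflow for inputs this large.
```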

This is indeed a somewhat arbitrary choice:

The choice of the softmax function seems somehow arbitrary as there are many other possible normalizing functions. It is thus unclear why the log-softmax loss would perform better than other loss alternatives.

Some potential reasons why this may be preferred over other normalizing functions:

  • it frames the inputs as log-likelihoods (a short derivation follows this list)
  • it is easily differentiable
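To make the first point concrete: if the inputs already are log-probabilities, $z_i = \ln P(y = i)$, then the base-$e$ softmax recovers those probabilities exactly (since the $P(y=j)$ sum to 1):

$$ \operatorname{softmax}(z)_i = \frac{e^{\ln P(y=i)}}{\sum_j e^{\ln P(y=j)}} = \frac{P(y=i)}{\sum_j P(y=j)} = P(y=i). $$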
brazofuerte

Some math becomes easier with $e$ as the base; that's why. Otherwise, consider this form of softmax: $\frac{e^{ax_i}}{\sum_j e^{ax_j}}$, which is equivalent to $\frac{b^{x_i}}{\sum_j b^{x_j}}$, where $b=e^a$.

Now, consider this function: $\sum_i\frac{e^{ax_i}}{\sum_j e^{ax_j}} x_i$. You can play with the coefficient $a$ to make the function a softer or harder max.

When $a\to\infty$, it becomes $\max(x)$, because the weights $\frac{e^{ax_i}}{\sum_j e^{ax_j}}$ converge to the one-hot indicator of $\mathrm{argmax}(x)$ (assuming a unique maximum).

When $a=1$ it is $\mathrm{softmax}(x)\cdot x$, a smoother version of the max.

When $a=0$ it is as soft as it gets: a simple average $\frac{1}{n} \sum_i x_i$.
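A small NumPy sketch of that interpolation (the example vector and the values of $a$ are arbitrary):

```python
import numpy as np

def soft_max_value(x, a):
    """Softmax-weighted average of x: interpolates between the plain mean
    (a = 0) and max(x) (a -> infinity)."""
    w = np.exp(a * (x - np.max(x)))  # subtract max for numerical stability
    w = w / w.sum()
    return np.dot(w, x)

x = np.array([1.0, 2.0, 5.0])
for a in [0.0, 1.0, 10.0, 100.0]:
    print(a, soft_max_value(x, a))
# a = 0   -> 2.666...  (the simple average)
# a = 1   -> ~4.79     (a smooth max)
# a = 100 -> 5.0       (essentially max(x))
```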

Aksakal