
I know that for big datasets we should take computational effort into account and try to minimize execution time, as long as it does not harm quality.

In many models, such as regression and neural networks (and probably elsewhere), the sigmoid function is used as a cost function. Input values in the range $(-\infty, -4]$ give outputs of roughly 0.02 or less, and input values in $[4, \infty)$ give outputs of roughly 0.98 or more.

Could we just map such values to 0.01 or 0.99 directly to improve calculation speed?

Does it make sense to trade some inaccuracy in the cost function for better performance?
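
To make the idea concrete, here is a rough sketch of the replacement I have in mind (plain Python with math.exp; fast_sigmoid is just an illustrative name, and the cutoffs/constants are the approximate values above):

from math import exp

def sigmoid(x):
    return 1.0 / (exp(-x) + 1.0)

def fast_sigmoid(x):
    # Proposed shortcut: skip exp() entirely in the saturated regions.
    if x <= -4.0:
        return 0.01
    if x >= 4.0:
        return 0.99
    return sigmoid(x)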

Denis
  • That's done sometimes with features (it's called min-max scaling). One advantage of logits (and activation functions like tanh/relu) is that they help with skew, avoid numeric precision issues, and aid interpretation. – Learning stats by example Jun 16 '20 at 13:09
  • Does this answer your question? [Why sigmoid function instead of anything else?](https://stats.stackexchange.com/questions/162988/why-sigmoid-function-instead-of-anything-else) – Learning stats by example Jun 16 '20 at 13:39
  • It is used as an *activation* function, *not* as a cost function. – desertnaut Jun 16 '20 at 14:33
  • You could. It's not *obvious* that speed would be improved -- you're replacing an exponential and some arithmetic with a comparison and a branch, so it's possible the replacement would be slower. Testing would be needed to see if it was better, worse, or basically the same. – Thomas Lumley Jun 25 '20 at 06:51
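
As a rough illustration of the kind of test the last comment suggests (plain Python and timeit; this is only a sketch, and actual results depend heavily on the hardware, the interpreter, and whether the code is vectorised):

from math import exp
import timeit

def sigmoid(x):
    return 1.0 / (exp(-x) + 1.0)

def clamped_sigmoid(x):
    # The question's proposal: replace exp() with a comparison in the saturated regions.
    if x <= -4.0:
        return 0.01
    if x >= 4.0:
        return 0.99
    return 1.0 / (exp(-x) + 1.0)

xs = [i / 100.0 - 10.0 for i in range(2000)]  # inputs spanning [-10, 10)

print(timeit.timeit(lambda: [sigmoid(x) for x in xs], number=1000))
print(timeit.timeit(lambda: [clamped_sigmoid(x) for x in xs], number=1000))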

1 Answer


If you're using a numerically stable version of the sigmoid function, some version of your proposal is already done to prevent overflow. In the function

$$f(x) = \frac{1}{\exp(-x)+1},$$ large negative values of $x$ cause $\exp(-x)$ to overflow. To remedy that, a sigmoid implementation might look something like

from math import exp

def sigmoid(x):
    # Short-circuit deep in the saturated region, before exp(-x) can overflow.
    if x < -20.0:
        return 0.0
    else:
        return 1.0 / (exp(-x) + 1.0)

We don't need to worry about the case of large $x$: if $x$ is very large, then $\exp(-x)$ underflows to zero and we simply get $\frac{1}{0+1}=1$.
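
As a quick sanity check of both extremes (plain Python; the exact overflow threshold is implementation-dependent, roughly $x \approx -710$ for 64-bit floats):

from math import exp

def sigmoid(x):
    # The guarded version from above.
    if x < -20.0:
        return 0.0
    return 1.0 / (exp(-x) + 1.0)

print(sigmoid(800.0))    # 1.0 -- exp(-800) underflows to 0, so 1/(0+1) = 1
print(sigmoid(-800.0))   # 0.0 -- the guard short-circuits before exp(800) is evaluated
# Without the guard, 1.0 / (exp(800.0) + 1.0) raises OverflowError in plain Python.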

The cutoff of -20 is a conservative choice: at that point the true value is already about $2 \times 10^{-9}$, so returning 0 loses essentially nothing, and it sits far away from where $\exp(-x)$ itself overflows (around $x \approx -710$ in double precision). A different cutoff could be more appropriate depending on the floating-point precision and the particular usage; in particular, we want a value that precludes erratic behavior near the point where precision is lost.
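
For example, one way to derive such a cutoff from an error tolerance (a sketch only; sigmoid_cutoff is a hypothetical helper, not part of any library): for $x \ll 0$ we have $1/(1+\exp(-x)) \approx \exp(x)$, so the sigmoid drops below a tolerance $\mathrm{tol}$ once $x < \log(\mathrm{tol})$.

from math import log

def sigmoid_cutoff(tol):
    # For x << 0, sigmoid(x) ~ exp(x), so sigmoid(x) < tol whenever x < log(tol).
    return log(tol)

print(sigmoid_cutoff(1e-9))   # about -20.7, in line with the -20 used above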

The purpose is not to conserve compute time, because the difference between 0.01 and 0.001 can be very important to your computation. Throwing away that precision could give bogus results, such as stopping a gradient-based method in its tracks because the gradient suddenly becomes zero. Whether or not it's a good idea to compromise your computation's precision to get a small performance increase should be decided on a case-by-case basis, since the cost of imprecision could be very high in one instance but negligible in another.
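
To see the "stops a gradient-based method in its tracks" failure concretely, here is a small sketch comparing the true sigmoid with the aggressive clamp proposed in the question (the finite-difference helper is just for illustration):

from math import exp

def sigmoid(x):
    return 1.0 / (exp(-x) + 1.0)

def clamped_sigmoid(x):
    # The clamp proposed in the question.
    if x <= -4.0:
        return 0.01
    if x >= 4.0:
        return 0.99
    return sigmoid(x)

def numerical_grad(f, x, h=1e-3):
    # Central finite difference, for illustration only.
    return (f(x + h) - f(x - h)) / (2.0 * h)

print(numerical_grad(sigmoid, -5.0))          # ~0.0066: still a usable gradient
print(numerical_grad(clamped_sigmoid, -5.0))  # 0.0: the clamp wiped out the signal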

Sycorax