0

Step and switch-like functions can be thought of as deterministic switches at some threshold; smooth sigmoid-like functions then describe the same switch with uncertainty around that threshold.

switch vs smooth transition

The machine learning literature discusses 'activation functions' extensively, but the choices are usually justified by application results. I am not aware of an argument for particular functions from first principles.

The Heaviside function is sometimes approximated by $H(x) \approx \frac{1}{2} + \frac{1}{2}\tanh(rx)$, and I think the Gaussian CDF has a more immediate probabilistic interpretation.

Since probability distributions can be described as arising from idealized processes (e.g. Gaussian from Brownian motion, gamma from a sequence of exponential waiting times, binomial from repeated sampling with a fixed probability), my question is whether there is a canonical choice, or canonical choices depending on the process that generates the step function.

EDIT

The question is not whether there are continuous approximations to the step or Heaviside functions, nor whether there are functions that are popular or commonly used (so it's different from this question) as an alternative to the discontinuous step. The question is whether there are functions that represent probabilistic switch-like behavior and arise naturally from probabilistic reasoning (if not in general, for specific processes that generate some sort of switch).

Sycorax
caesoma
  • "Canonical" in what sense, please? You *seem* to be asking about *approximating* step functions, but to understand such a question we would need to know what the context is in which that is done. – whuber Mar 19 '21 at 16:39
  • That is part of the question. I am not asking about approximating the function, but whether there are processes that generate probabilistic switch-like behavior and are associated with a specific distribution or distributions. I edited the question to try to make that clear. Thanks. – caesoma Mar 19 '21 at 17:03
  • I have voted to reopen but wonder, based on your comments, whether your title and first few lines are misleading about your intentions. Asking for a "canonical sigmoid-like function" seems likely to produce very similar answers to asking "why is the sigmoid function so widely used", which explains why your question was closed originally. If your true intent is closer to "are there processes in which the sigmoid naturally arises" then I suggest you edit your question and title more extensively, rather than simply add this as a post-script at the bottom. – Silverfish Mar 22 '21 at 23:10
  • This is partly because I think making such an edit would strengthen your case for reopening, partly as a service to future users/readers/searchers. The title and first few lines are what gets shown, e.g., in search engine snippets, so it helps for these to capture the "meat" of the issue. Also, future readers shouldn't need to make sense of a question being closed/edited/reopened. They ideally just see an improved, coherent, flowing question. A "tacked-on" paragraph explaining why your "real question" is different from what the paragraphs above suggest doesn't flow so well and feels confusing. – Silverfish Mar 22 '21 at 23:17
  • Thank you for voting to reopen. From the beginning the title stated "probabilistic", not "sigmoid-like". This is 'Cross Validated', so I expect that the distinction would be clear to frequenters of this site. The question was originally closed because it asked for a "canonical" probabilistic version, which I understand is vague, but with the edits it should also be clear what the aim is. The reference to processes is auxiliary as motivation to any probabilistic reasoning. – caesoma Mar 24 '21 at 14:04
  • I appreciate everyone's effort to maintain the integrity of the site; however, it seems like it is expected that the questions must be elaborated in a way that there is a "right answer" that should be known to someone. This discourages more open questions that may still deserve an informative discussion. Editing the question twice and posting three comments to get no answers is not the ideal outcome to me. That is why I usually go to more specific discussion boards, and rarely here, although I think this should be a central resource. – caesoma Mar 24 '21 at 14:13
  • Thanks for the clarification. Those are valid points. Many questions are indeed high-quality and to the point; it makes sense to avoid excessive noise in the responses. Hopefully these procedures will also be able to balance that with the need for more specific replies. – caesoma Mar 25 '21 at 02:51
  • Let us [continue this discussion in chat](https://chat.stackexchange.com/rooms/121253/discussion-between-caesoma-and-silverfish). – caesoma Mar 25 '21 at 02:51
  • I think that he is going in a different direction than "why logistic". There are a number of "pathological" but really useful functions like the step and the Dirac delta. One could argue that the Dirac delta is the first derivative of the step function. It is the basis for Haar wavelet structure. – EngrStudent Mar 25 '21 at 12:43

3 Answers

2

Take a coin with probability of heads $p$. Flip the coin $n$ times, where $n$ is odd. Mark a success if more than half of the flips show heads (that is, at least $(n+1)/2$ heads), or a failure otherwise.

This can be modeled by a polynomial that approaches the "step" function as n gets large; in this case it's a polynomial in Bernstein form with $n+1$ coefficients; the first $(n+1)/2$ are zeros and the rest are ones.

Any other kind of "switch" (which succeeds only if certain counts of heads occur) can be approximated with a suitable polynomial in Bernstein form: For coefficient $j$, where $j \in [0, n]$, set it to 1 if $j$ heads are treated as a success, and to 0 otherwise. See also the reference below.
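As a rough illustration of the construction above, here is a minimal Python sketch that evaluates a polynomial in Bernstein form with 0/1 coefficients. The function names `bernstein_switch` and `majority_coeffs` are made up for this example, and the values of $n$ and $p$ are arbitrary; it is a sketch of the idea, not code from the cited paper.

```python
from math import comb

def bernstein_switch(p, coeffs):
    """Evaluate sum_j coeffs[j] * C(n, j) * p^j * (1 - p)^(n - j), n = len(coeffs) - 1.

    With 0/1 coefficients this is the probability that the number of heads
    in n flips lands on a count j that is marked as a success.
    """
    n = len(coeffs) - 1
    return sum(c * comb(n, j) * p**j * (1 - p)**(n - j)
               for j, c in enumerate(coeffs))

def majority_coeffs(n):
    """Hypothetical helper: coefficients for 'more than half of n flips are heads'."""
    if n % 2 == 0:
        raise ValueError("n must be odd")
    half = (n + 1) // 2
    # The first (n + 1) / 2 coefficients are zeros, the rest are ones.
    return [0] * half + [1] * (n + 1 - half)

# The curve in p steepens around p = 1/2 as n grows.
for n in (1, 11, 101):
    coeffs = majority_coeffs(n)
    values = [round(bernstein_switch(p, coeffs), 4) for p in (0.3, 0.45, 0.5, 0.55, 0.7)]
    print(n, values)
```

Any other 0/1 list of length $n+1$ can be passed as `coeffs` to model a switch that succeeds only for certain counts of heads.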

EDIT (May 26):

I found a paper by Ferreira and Zocchi that appears to explore several probabilistic models for smooth versions of piecewise linear functions.

REFERENCES:

  • Goyal, V. and Sigman, K., 2012. On simulating a class of Bernstein polynomials. ACM Transactions on Modeling and Computer Simulation (TOMACS), 22(2), pp.1-5.
  • Ferreira, I.E.P. and Zocchi, S.S. A family of smooth piecewise-linear models with probabilistic interpretations. arXiv preprint, https://arxiv.org/pdf/2011.07753.pdf
Peter O.
  • That's a good probability-based explanation for a step-like switch. The only question is whether that process would converge to a specific distribution, since it's not convenient to use a polynomial with high degree/several coefficients in a modeling setting. – caesoma Apr 01 '21 at 16:39
  • This polynomial is formed as $\mathbb{P}(X \ge (n+1)/2)$, where $X$ is a binomial($n$, $p$) random variable, or alternatively by taking $1 - F((n+1)/2-1)$, where $F(x)$ is the distribution function of a binomial($n$, $p$) random variable. – Peter O. Apr 02 '21 at 00:30
  • Of course. I meant continuous, but that step is straightforward. – caesoma Apr 02 '21 at 02:36
  • Also, the polynomial is relatively simple here. Each coefficient is either 0 or 1, meaning the polynomial could be stored in $n+1$ bits of memory. If only a threshold of heads matters, then it's even easier: just store $n$ and the threshold. – Peter O. Apr 02 '21 at 09:29
1

A probability distribution is a mathematical function; it assigns the probabilities to the outcomes of an experiment. A Bernoulli random variable takes the value $1$ with probability $p$ and the value $0$ with probability $1-p$ for some fixed $0 \le p\le 1.$ In this sense, it is a switch (it takes on values $0$ and $1$), and it is probabilistic (how often it takes the value $0$ or $1$ varies with $p$).

Sycorax
1

Depending on the underlying continuous probability distribution(s), different "soft" switch functions may appear.

$f(x) = \frac{1}{2} + \frac{1}{2} \tanh (rx)$, which is the logistic function $1/(1 + e^{-2rx})$ written in terms of $\tanh$, appears naturally when you have two normally distributed classes with equal variances (or covariance matrices, in a multidimensional case). It describes the probability that an observation with the continuous value $x$ was generated by one of the two classes. You may take a look at this post for details and a one-dimensional derivation.
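To make the two-class claim concrete, here is a small numerical check in Python. It assumes, purely for illustration, equal prior probabilities and class means placed symmetrically at $-\mu$ and $+\mu$ with common standard deviation $\sigma$; under those assumptions Bayes' rule gives the posterior $\frac{1}{2} + \frac{1}{2}\tanh(rx)$ with $r = \mu/\sigma^2$. The function names and parameter values are hypothetical.

```python
from math import tanh
from statistics import NormalDist

def posterior_upper_class(x, mu=1.0, sigma=1.5):
    """P(class with mean +mu | x) under equal priors and a shared sigma (assumed)."""
    lower = NormalDist(-mu, sigma).pdf(x)
    upper = NormalDist(mu, sigma).pdf(x)
    return upper / (lower + upper)

def tanh_switch(x, r):
    """The soft switch 1/2 + 1/2 * tanh(r * x)."""
    return 0.5 + 0.5 * tanh(r * x)

mu, sigma = 1.0, 1.5      # illustrative values only
r = mu / sigma**2         # steepness implied by the two Gaussians

# The two columns should agree up to floating-point error.
for x in (-3.0, -1.0, 0.0, 0.5, 2.0):
    print(x, round(posterior_upper_class(x, mu, sigma), 6), round(tanh_switch(x, r), 6))
```

With unequal priors or means that are not symmetric about zero, the same calculation only shifts the switch point; the logistic shape is unchanged as long as the variances are equal.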

Also, the logistic function fits well into the formalism of generalised linear models: it is the inverse of the canonical (logit) link, which maps the success probability $p$ of the Bernoulli distribution to its natural parameter $\log\frac{p}{1-p}$.

Incidentally, the logistic function appears in physics as the Fermi-Dirac distribution, where it governs the distribution of a large class of elementary particles ("fermions"), e.g. electrons, over energy states.

The other "soft switch" from your question, the Gaussian CDF, appears naturally when the boundary between two "classes" can be modelled by the normal distribution. This is common e.g. in toxicology, where the exact lethal dose of a toxicant varies between individual subjects (bacteria, lab animals), but this variation follows (approximately) the normal distribution. In that case the Gaussian CDF gives you the probability that the particular dose will be lethal.
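The dose-response argument can be sketched with a short simulation in Python. The tolerance mean and standard deviation below are invented for illustration, and `simulated_lethal_fraction` is a hypothetical helper: each subject gets its own normally distributed lethal threshold, and a dose is lethal for exactly those subjects whose threshold it reaches.

```python
import random
from statistics import NormalDist

def simulated_lethal_fraction(dose, tolerances):
    """Fraction of subjects whose individual lethal threshold is at or below the dose."""
    return sum(t <= dose for t in tolerances) / len(tolerances)

rng = random.Random(0)                               # fixed seed for repeatability
tolerance_dist = NormalDist(mu=10.0, sigma=2.0)      # assumed tolerance distribution
tolerances = [rng.gauss(tolerance_dist.mean, tolerance_dist.stdev)
              for _ in range(100_000)]

# Each subject is a hard (step) switch; averaging over subjects gives the Gaussian CDF.
for dose in (6.0, 8.0, 10.0, 12.0, 14.0):
    simulated = simulated_lethal_fraction(dose, tolerances)
    print(dose, round(simulated, 3), round(tolerance_dist.cdf(dose), 3))
```

Replacing the normal tolerance distribution with another one would give a different soft switch, namely the CDF of that distribution.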

So, as you can see, both of your "soft switches" have some connection to the Gaussian distribution, which is quite common, thanks to the Central Limit Theorem. That's probably the closest you can get to being "canonical", and you still have to make some additional assumptions. And, of course, there are infinitely many other distributions and their combinations, leading to different "soft switches", but they are less common in practice.

caesoma
Igor F.
  • It is interesting because I tend to think of the logistic function, not distribution. It also appears as the solution to differential equations in population dynamics. Although that may be more of a modeling than probabilistic argument, the population densities could be interpreted as probability densities and the "switch" could be between sparse and crowded populations. – caesoma Apr 02 '21 at 02:30