
Letting $\sigma$ denote the (standard) logistic function, this page explains that, given any linear model (which can return any real number as the predicted value $\hat{y}$)

$$f_{\hat{\theta}}(\mathbf{x}) = \hat{\theta} \cdot \mathbf{x} $$

one can get a logistic model which returns probabilities as fitted/estimated values by defining:

$$f_{\hat{\theta}}(\mathbf{x}) = \sigma\left(\hat{\theta} \cdot \mathbf{x} \right) \,. $$

In other words, we take the output of linear regression—any number in $\mathbb{R}$—and use the sigmoid function to restrict the model's final output to be a valid probability between zero and one... To construct the model, we use the output of linear regression as the input to the nonlinear logistic function.

The only property of the logistic function $\sigma$ that this motivation seems to use is that it takes values only between $0$ and $1$. Even if the facts that it is also monotone increasing and continuous are somehow important, these properties are still satisfied by any continuous CDF.
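For concreteness, here is a minimal sketch of what I mean (with made-up parameter values, and assuming NumPy and SciPy are available): the logistic function and the standard normal CDF both map the real-valued linear output monotonically into $(0, 1)$, so either would seem to give valid probabilities.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical fitted parameters and a feature vector, purely for illustration.
theta_hat = np.array([0.5, -1.2, 0.3])
x = np.array([1.0, 0.8, 2.0])

z = theta_hat @ x  # output of the linear model: any real number

p_logistic = 1.0 / (1.0 + np.exp(-z))  # logistic function applied to z
p_normal_cdf = norm.cdf(z)             # standard normal CDF applied to z

# Both values lie strictly between 0 and 1 and increase monotonically in z.
print(p_logistic, p_normal_cdf)
```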

Question: What is the motivation for using the logistic function as opposed to any other CDF?

Logistic regression has always seemed somewhat arbitrary to me, and this framing seems to let me pin down precisely the source of my confusion.

This question seems to be related to the concept of an activation function in neural networks. Namely, it is commonly accepted that one has a wide variety of valid choices of activation function for the neurons in a neural network, with the logistic (sigmoidal) function being only one of many viable choices.

Question: Why shouldn't we (we obviously can) define and use analogous regressions for any bounded activation function, re-scaled to take values between $0$ and $1$ inclusive?

Chill2Macht
    Well, if you replace it with the normal cdf, you get probit regression. My understanding is that a lot of it comes down to computational complexity. It is just easy to work with the logistic function. – Alex Aug 23 '18 at 02:25
  • Does it have something to do with the logistic function being somehow optimal for cross-entropy loss? http://www.textbook.ds100.org/ch/17/classification_cost.html Couldn't we still use any other CDF for which cross-entropy loss was convex? (Also, this would only reduce the question to the question of why we should prefer cross-entropy loss, although apparently the answer to that is KL divergence: http://www.textbook.ds100.org/ch/17/classification_cost_justification.html ) – Chill2Macht Aug 23 '18 at 02:27
  • @Alex Ohh I think you're right -- I remember having heard of probit regression before, but seemingly understand it for the first time now. Thank you for connecting the dots for me! (I hope that doesn't sound sarcastic -- I do genuinely mean it.) – Chill2Macht Aug 23 '18 at 02:28

0 Answers