Letting $\sigma$ denote the (standard) logistic function, this page explains that, given any linear model (which can return any real number as the predicted value $\hat{y}$)
$$f_{\hat{\theta}}(\mathbf{x}) = \hat{\theta} \cdot \mathbf{x} $$
one can obtain a logistic model, which returns probabilities as fitted/estimated values, by instead defining:
$$f_{\hat{\theta}}(\mathbf{x}) = \sigma\left(\hat{\theta} \cdot \mathbf{x} \right) \,. $$
In other words, we take the output of linear regression (any number in $\mathbb{R}$) and use the sigmoid function to restrict the model's final output to a valid probability between zero and one... To construct the model, we use the output of linear regression as the input to the nonlinear logistic function.
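To make the construction concrete, here is a minimal NumPy sketch; the parameter vector `theta_hat` and feature vector `x` are made-up values for illustration only:

```python
import numpy as np

def sigmoid(z):
    """Standard logistic function: maps any real z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Made-up fitted parameters and feature vector, for illustration only;
# the first entry of x plays the role of an intercept feature.
theta_hat = np.array([0.5, -1.2, 3.0])
x = np.array([1.0, 0.7, 0.2])

z = theta_hat @ x   # linear model: can be any real number
p = sigmoid(z)      # logistic model: a probability in (0, 1)
print(f"linear output {z:.3f} -> probability {p:.3f}")
```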
The only property of the logistic function $\sigma$ that this motivation seems to use is that it takes values strictly between $0$ and $1$. Even if the facts that it is also continuous and monotone increasing are somehow important, these properties are shared by any continuous CDF (the standard normal CDF $\Phi$, for instance, which would give probit regression).
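For instance, substituting $\Phi$ for $\sigma$ yields probit regression, and a small sketch shows the two links behave the same way qualitatively (the grid of predictor values is arbitrary):

```python
import numpy as np
from scipy.stats import norm

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-4.0, 4.0, 9)   # arbitrary grid of linear-predictor values

p_logistic = sigmoid(z)   # logistic link -> logistic regression
p_probit = norm.cdf(z)    # standard normal CDF -> probit regression

# Both are continuous, monotone increasing maps from the real line
# into (0, 1), so both turn a linear predictor into a valid probability.
print(np.column_stack([z, p_logistic, p_probit]))
```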
Question: What is the motivation for using the logistic function as opposed to any other CDF?
Logistic regression has always seemed somewhat arbitrary to me, and this observation lets me pin down precisely the source of my confusion.
This question seems to be related to the concept of an activation function in neural networks: it is commonly accepted that one has a wide variety of valid choices of activation function for the neurons in a neural network, the logistic (sigmoid) function being only one of many viable options.
Question: Why shouldn't we (we obviously can) define and use analogous regressions for any bounded activation function, re-scaled to take values between $0$ and $1$?
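To illustrate what such an "analogous regression" would look like, here is a sketch using two rescaled activations: $\tanh$, whose affine rescaling $(\tanh(z)+1)/2$ lands in $(0,1)$, and $\arctan$, which rescales to the standard Cauchy CDF (the "cauchit" link). Note that the rescaled $\tanh$ is identical to $\sigma(2z)$, so it merely reparametrizes the logistic model, whereas the $\arctan$ version is a genuinely different link:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rescaled_tanh(z):
    """tanh takes values in (-1, 1); this affine rescaling lands in (0, 1)."""
    return (np.tanh(z) + 1.0) / 2.0

def rescaled_arctan(z):
    """arctan rescaled to (0, 1); this equals the standard Cauchy CDF."""
    return np.arctan(z) / np.pi + 0.5

z = np.linspace(-3.0, 3.0, 7)

# (tanh(z) + 1)/2 == sigmoid(2z) identically, so this choice merely rescales
# the coefficients of a logistic model; the arctan link is genuinely different.
print(np.allclose(rescaled_tanh(z), sigmoid(2.0 * z)))   # True
print(np.column_stack([z, rescaled_tanh(z), rescaled_arctan(z)]))
```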