
I am repeatedly surprised by how often these three things come up in almost any ML discussion:

  1. Log-likelihood: I understand the maximum likelihood principle, but why take the log?
  2. Softmax: Why softmax everywhere? Is it tied to log-likelihood in any way?
  3. Sigmoid: Why is the sigmoid function used only in NNs?

Please help me understand, or direct me to resources which provide an intuitive as well as mathematically rigorous justification of these observations. Thanks a lot.

Sie Tw
  • Three quite distinct questions (your title is wrong), and some of the premises seem to be not wholly justified. You should consider posting separate questions, but beware - it's likely they're already answered. e.g. in relation to the first, consider [this existing question](http://stats.stackexchange.com/questions/141087/i-am-wondering-why-we-use-negative-log-likelihood-sometimes) or [this one on math.SE](http://math.stackexchange.com/questions/892832/why-we-consider-log-likelihood-instead-of-likelihood-in-gaussian-distribution) (which doesn't really rely on the Gaussian part) and ... – Glen_b Oct 03 '16 at 09:03
  • ... [this one on SO](http://stackoverflow.com/questions/2343093/what-is-log-likelihood). So search carefully before editing to ask one question and before reposting the others – Glen_b Oct 03 '16 at 09:07
  • @Glen_b thanks for the links, but the truth is I tried looking for the questions separately; I wanted to see if there is something unifying these things - whether one derives or justifies the others, e.g. that taking the log of softmax helps, or something like that. – Sie Tw Oct 03 '16 at 09:44
  • @Glen_b you didn't point to any solutions for question 2 and question 3, so two thirds of the question is still valid; please remove the hold. It is not justified, as I am more interested in knowing the link between these 3 things. It is not too broad. Thanks. – Sie Tw Oct 03 '16 at 09:48
  • No, the links for the first one were *examples* of what you can find with a simple search. It's not up to me to find them for you, it's up to you to search for them before posting your question. – Glen_b Oct 03 '16 at 11:06
  • @Glen_b okay, my point was the one after that. Anyways, thanks. – Sie Tw Oct 03 '16 at 11:08
  • My response was to you saying "you didn't point to any solutions for question 2 and question 3. So, 2/3rd of the question is valid" ... That's not how it works. As it stands you appear to have three questions -- and still too broad. In addition I pointed out that the first question appeared to be a duplicate or near duplicate. Your edit goes a little way toward connecting the first two questions but the questions are not clearly related (especially not the third), so it remains too broad as well as now less clear. You either need a much deeper edit to clarify what single question ... ctd – Glen_b Oct 03 '16 at 11:15
  • ctd ... you are asking or you need to split them up. Please read our [help] especially in relation to [asking questions](http://stats.stackexchange.com/help/asking) – Glen_b Oct 03 '16 at 11:16

1 Answer

  1. Transforming the likelihood with the log, giving the log-likelihood, often makes things easier to handle. A well-known case is maximum-likelihood estimation of the parameters of a Gaussian population from a random sample. Working directly with the likelihood is hard, because it is a product of one density per observation; taking logs turns that product into a sum, so the independent terms separate and can be handled one at a time. This is legitimate because log is a monotonic transformation over positive values and therefore preserves the location of the maximum. An example can be seen here: Why minimize negative log likelihood? (A numerical sketch appears after this list.)

  2. Softmax is a way to transform a vector of real-valued scores in $(-\infty,\infty)$ into a vector of values in $(0,1)$ that sum to 1. This kind of mapping is useful mainly when you want to turn scores into probabilities. For example, consider a categorical distribution: a variable that can take one of $k$ possible values, representable as a vector of zeros with a 1 at the selected category. Now suppose that instead of binary indicators you have a real-valued score for each category; how can you transform that vector into the desired space? One solution is softmax (see the second sketch below). Note that it is not the only one, and there is not always a connection with log-likelihood. The Wikipedia page should give you the details.

  3. Sigmoid is a particular case of softmax: with two categories, softmax over the scores $(x, 0)$ reduces to the logistic function $1/(1+e^{-x})$. It is also not used only in NNs; it often appears in logistic regression. Basically, the logistic function is called sigmoid in the binary case and softmax when there are more than 2 categories. (The third sketch below checks this numerically.)
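
To make the first point concrete, here is a minimal sketch (Python with NumPy; the sample data are hypothetical) of why the log matters in practice: the raw likelihood of even a modest sample underflows to zero in floating point, while the log-likelihood stays well-behaved, and both are maximized at the same parameter value because log is monotonic.

```python
import numpy as np

rng = np.random.default_rng(0)
sample = rng.normal(loc=5.0, scale=2.0, size=1000)  # hypothetical Gaussian sample

def gaussian_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Grid of candidate means (sigma held fixed at its true value for brevity).
mus = np.linspace(3.0, 7.0, 401)
likelihood = np.array([np.prod(gaussian_pdf(sample, m, 2.0)) for m in mus])
log_likelihood = np.array([np.sum(np.log(gaussian_pdf(sample, m, 2.0))) for m in mus])

print(likelihood.max())                # 0.0 -- the product of 1000 densities underflows
print(mus[np.argmax(log_likelihood)])  # ~5.0, essentially the sample mean, as theory predicts
```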
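
For the second point, a minimal sketch of softmax itself (the score vector below is made up for illustration). Subtracting the maximum before exponentiating is the standard numerical-stability trick; it changes nothing because softmax is invariant to shifting all scores by a constant.

```python
import numpy as np

def softmax(scores):
    shifted = scores - np.max(scores)  # shift for numerical stability; result is unchanged
    exps = np.exp(shifted)
    return exps / exps.sum()

scores = np.array([2.0, 1.0, -1.0])   # hypothetical real-valued per-category scores
probs = softmax(scores)
print(probs)        # ~[0.705, 0.259, 0.035]: each value in (0, 1)
print(probs.sum())  # 1.0: a valid categorical distribution
```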
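
And for the third point, a quick numerical check of the claim that sigmoid is the two-category case of softmax: applying softmax to the scores $(x, 0)$ gives $e^x/(e^x+e^0) = 1/(1+e^{-x})$, which is exactly the logistic sigmoid.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(scores):
    exps = np.exp(scores - np.max(scores))
    return exps / exps.sum()

for x in [-3.0, 0.0, 0.5, 4.0]:
    two_class = softmax(np.array([x, 0.0]))[0]  # first component of two-class softmax
    print(x, np.isclose(sigmoid(x), two_class))  # True for every x
```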

rapaio