
I am trying to implement logistic regression where the label space is $\{-1,+1\}$ instead of the usual $\{0,1\}$. I know that I can model this as a 0-1 problem, but I nevertheless wanted to see if I can derive it from first principles (using MLE).

The negative log-likelihood expression I get (to be minimized) is: $ l(\theta) = \sum_{i=1}^{m} \log\left(1+\exp(-y^{i}\theta^{T}x^{i})\right) $, where $\{\dots, (x^{i},y^{i}), \dots \}$ are the $m$ training examples ($x$ is an $n$-dimensional vector).

So now I try to find the gradient of this, and I get: $ \frac{\partial l(\theta)}{\partial \theta_j} = \frac{\mu \cdot y \cdot x_j}{1+\mu} $, where $j=1,\dots,n$ indexes the features and $\mu = \exp(-y\theta^{T}x)$ (written here for a single example, matching the code below).

However, when I try to solve this with MATLAB's fminunc, I do not get any updates to my initial weight vector. My MATLAB code for the gradient is:

temp1 = exp((-y).*(X*w));             % mu = exp(-y .* (X*w)), one entry per example
temp2 = temp1.*((1+temp1).^(-1)).*y;  % mu .* y ./ (1 + mu)
grad  = (X'*temp2);                   % sum over examples, one entry per feature
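One way to sanity-check the analytic gradient before handing it to fminunc is to compare it against a finite-difference approximation (a minimal sketch, assuming the same X, y, and w as above, with the objective taken from the likelihood expression):

% Objective: the negative log-likelihood derived above
f = @(w) sum(log(1 + exp(-y.*(X*w))));

% Central-difference approximation of the gradient, one coordinate at a time
h = 1e-6;
numgrad = zeros(size(w));
for j = 1:numel(w)
    e = zeros(size(w)); e(j) = h;
    numgrad(j) = (f(w + e) - f(w - e)) / (2*h);
end

% A large discrepancy between numgrad and the analytic grad above
% points to a sign or scaling error in the derivation.
disp(max(abs(numgrad - grad)));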

Can somebody point out what I am doing wrong here?

Jagadeesh

3 Answers


Expanding Frank Harrell's answer: to derive the likelihood function, you first need to define a probabilistic model of the problem. In the case of logistic regression we are modeling a binary target variable (e.g. male vs. female, survived vs. died, sold vs. not sold), and for such data the Bernoulli distribution is the distribution of choice. Notice that the choice between $\{0, 1\}$ and $\{-1, +1\}$ coding is not part of the definition of the problem; it is just a way of encoding your data, and the labels are arbitrary and can be changed. We usually choose the $\{0, 1\}$ labels because they have some nice properties: the main task in logistic regression is estimating the probability of "success", and the model is defined in terms of the Bernoulli distribution, which uses exactly those labels.

If you insisted on defining the likelihood function in terms of a distribution that assigns probability $1-p$ to $-1$ and probability $p$ to $+1$, then you would need to use such a distribution in your likelihood function. That distribution has the following probability mass function

$$ g(x) = p^{(x+1)/2} (1-p)^{1-(x+1)/2} $$

which simply reduces to the Bernoulli distribution after re-expressing the labels as $(x+1)/2 \in \{0, 1\}$.
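To make the equivalence concrete, here is a small numerical check (a MATLAB sketch; the value of $p$ is arbitrary and chosen purely for illustration):

p = 0.7;                                      % arbitrary success probability
for x = [-1, 1]
    g = p^((x+1)/2) * (1-p)^(1 - (x+1)/2);    % pmf on the {-1,+1} labels
    k = (x+1)/2;                              % mapped label in {0,1}
    b = p^k * (1-p)^(1-k);                    % standard Bernoulli pmf
    fprintf('x = %+d: g = %.4f, Bernoulli = %.4f\n', x, g, b);
end

Both columns agree for every label, so the two likelihoods are identical up to the relabeling.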

Tim
  • (+1) Rewriting your answer as $$g(x)=p^{(x+1)/2}\,(1-p)^{(1-x)/2}$$ would make it more evident that all one has to do here is re-express $x\in\{-1,1\}$ as $(x+1)/2\in\{0,1\}$ and apply the usual result. Thus, there is nothing to be gained either conceptually or mathematically by the change in coding. – whuber May 02 '19 at 14:00
  • @whuber Thanks, that seems clearer. – Tim May 02 '19 at 14:59

This is not machine learning. The tags should be logistic regression and maximum likelihood. I've corrected this.

It is traditional to have $Y \in \{0,1\}$ in formulating the likelihood function. But if you want to show that you can get the same result with any coding, choose character values instead of numeric ones to stay general, e.g., $Y \in \{A,B\}$. Then write out the associated functions, avoiding software code until the very end.
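For instance, writing $p_i = P(Y_i = B \mid x_i)$, one coding-free way to express the likelihood is

$$ L(\theta) = \prod_{i=1}^{m} p_i^{\,[y_i = B]}\,(1 - p_i)^{\,[y_i = A]}, $$

where $[\cdot]$ is the indicator function; any numeric coding of the labels only changes how the indicators are spelled out, not the likelihood itself.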

Frank Harrell
  • Ok, but could you please see if what I have done for the {-1,+1} case is correct or not? – Jagadeesh Jan 24 '15 at 12:52
  • I'd like to see the general case. – Frank Harrell Jan 24 '15 at 12:53
  • A lesson to us all -- every once in a while even Frank Harrell will get downvoted. :D – tchakravarty Jan 24 '15 at 14:13
  • I don't really understand why it is not machine learning. – wij Jan 24 '15 at 19:01
  • The logistic regression model was invented no later than 1958 by D. R. Cox, long before the field of machine learning existed, and at any rate your problem is low-dimensional. – Frank Harrell Jan 24 '15 at 19:37
  • Kindly do not downvote an answer unless you can show that it is wrong or irrelevant. – Frank Harrell Jan 24 '15 at 19:38
  • In case it's needed, let's stress that while using 0 and 1 may be a convention, it's the best convention by far, allowing a direct interpretation of means as observed proportions and of the model as predicting probability. (Me too on the downvoting: it's hard to interpret the 2 downvotes here except as personal irritation about the comment on machine learning. I added a +1 to do what I can.) Cf. comments in @Tim's answer. – Nick Cox Feb 14 '18 at 13:31

After reading the answer from @Tim, I think I understand the use of the transformation in the Bernoulli distribution, but I am still somewhat confused. When using the log-likelihood for the response $y \in \{-1, 1\}$, we use

$$ \log \frac{P(x_{t})}{1 - P(x_{t})} = X_{t}^{T}\beta, \qquad P(x_{t}) = \frac{\exp(X_{t}^{T}\beta)}{1 + \exp(X_{t}^{T}\beta)}, $$

that is,

$$ P(y=1 \mid X_{t}) = P(x_{t}) = \frac{\exp(X_{t}^{T}\beta)}{1 + \exp(X_{t}^{T}\beta)}, \qquad P(y=-1 \mid X_{t}) = 1 - P(x_{t}) = \frac{1}{1 + \exp(X_{t}^{T}\beta)}, $$

which, for $y \in \{-1, +1\}$, can be written compactly as

$$ P(y \mid X_{t}) = \frac{1}{1 + \exp(-y \, X_{t}^{T}\beta)}. $$

The log-likelihood is then

$$ \log L = \sum_{t=1}^{n} \log \frac{1}{1 + \exp(-y_{t} \, X_{t}^{T}\beta)}, $$

and maximizing it gives the coefficient estimates.

How does this log-likelihood function connect to the answer from @Tim?

I think the expression for $g(x)$ in their answer is just a way of unifying $p(x)$ and $1-p(x)$ into a single formula. But as long as we obtain and hold on to $P(y=1 \mid X)$, we are good.
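A quick numerical check of that equivalence (a MATLAB sketch with made-up data; X, beta, and y below are arbitrary and only for illustration):

% Made-up data: 5 examples, 2 features, labels in {-1,+1}
X    = randn(5, 2);
beta = [0.5; -1.0];
y    = 2*(rand(5,1) > 0.5) - 1;   % random labels in {-1,+1}

% {-1,+1} form of the log-likelihood (as above)
ll_pm1 = sum(log(1 ./ (1 + exp(-y .* (X*beta)))));

% {0,1} Bernoulli form after mapping z = (y+1)/2, as in Tim's g(x)
z     = (y + 1) / 2;
p     = exp(X*beta) ./ (1 + exp(X*beta));
ll_01 = sum(z .* log(p) + (1 - z) .* log(1 - p));

fprintf('ll_pm1 = %.6f, ll_01 = %.6f\n', ll_pm1, ll_01);

The two values agree for any data, which is exactly the point of @Tim's $g(x)$: it is the same Bernoulli likelihood written directly in terms of the $\{-1,+1\}$ labels.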

vtshen