Let $x_0, x_1, x_2, \ldots, x_n$ be our features, and let $y$ be the target variable.
With linear regression, our hypothesis is: $$h_{\theta}(x) = \sum_{i=0}^{n} \theta_{i} x_{i}$$ where $x_0 = 1$.
Now, with logistic regression, the hypothesis is: $$h_{\theta}(x) = \frac{1}{1 + e^{-\theta^{T}x}}$$ I have a few questions:
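To make the question concrete, here is a small sketch of the two hypotheses side by side, with made-up values for $\theta$ and $x$ (the numbers are purely illustrative):

```python
import math

# Hypothetical values, purely to make the question concrete.
theta = [0.5, -1.0, 2.0]   # theta_0, theta_1, theta_2
x = [1.0, 3.0, 0.5]        # x_0 = 1 (intercept term), then the features

# Linear regression hypothesis: theta^T x
z = sum(t_i * x_i for t_i, x_i in zip(theta, x))

# Logistic regression hypothesis: the same linear combination,
# pushed through the sigmoid
h = 1.0 / (1.0 + math.exp(-z))

print(z)  # -1.5
print(h)  # ~0.1824
```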
Are we simply taking the hypothesis from linear regression, plugging it into the sigmoid function, and treating the result as our new hypothesis for logistic regression? So we're still assuming the result is a linear combination of the features (before substituting into the sigmoid function)?
Where does probability come into this? I've seen that $$P(y = 1 | x; \theta) = h_{\theta}(x)$$ Where does this come from? Of course, it's plausible, since the sigmoid's output lies in $(0, 1)$, but I don't understand why this output can be interpreted as the probability that the target variable belongs to class 1.
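To show what I mean about the range: no matter what $\theta^T x$ is, the sigmoid's output stays strictly between 0 and 1, so reading $h_\theta(x)$ as $P(y=1|x;\theta)$ and $1 - h_\theta(x)$ as $P(y=0|x;\theta)$ is at least numerically sensible (the inputs below are arbitrary examples):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Whatever value theta^T x takes, the output lies strictly in (0, 1),
# which makes a probability reading plausible, though not yet justified.
for z in [-10.0, -1.5, 0.0, 1.5, 10.0]:
    p = sigmoid(z)
    assert 0.0 < p < 1.0
    print(z, p, 1.0 - p)  # candidate P(y=1|x) and P(y=0|x)
```

But that the output merely *lands* in $(0,1)$ is exactly what I don't find convincing on its own, hence my question.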