
I have a training data set of predictors classified into 2 classes. So far I have fitted a logistic regression model, that is, I have estimated the coefficients $\beta_0, \beta_1, \cdots$.

What do I do now to be able to classify the testing data set?

cgo
  • You make predictions. – user2974951 Oct 28 '21 at 07:50
  • is it in R, or are you asking generally? – Spätzle Oct 28 '21 at 07:57
  • @Spätzle I am asking in general, not specific code in R. I want to get the intuition first before I use a package. So my question is, what do I do next with the coefficients that I have calculated? – cgo Oct 28 '21 at 08:01
  • Please add the [tag:self-study] tag & read its [wiki](https://stats.stackexchange.com/tags/self-study/info). Then tell us what you understand thus far, what you've tried & where you're stuck. We'll provide hints to help you get unstuck. Please make these changes as just posting your homework & hoping someone will do it for you is grounds for closing. – Stephan Kolassa Oct 28 '21 at 08:16
  • Have you looked at [the Wikipedia page on logistic regression](https://en.wikipedia.org/wiki/Logistic_regression)? – Stephan Kolassa Oct 28 '21 at 08:17
  • Your logistic model will give predictions for your test set in some form (essentially log-odds, odds, or probabilities). You can then decide on a criterion to turn these into classification predictions, typically a cutoff above which you classify a prediction as positive, chosen based on the costs of making erroneous classifications, and you can test those predicted classifications against reality – Henry Oct 28 '21 at 08:54
  • @Henry: note that it is *much* better to test the *probabilistic* predictions against actual outcomes, using proper scoring rules, than to test the thresholded "hard" classifications. *All* evaluation measures on "hard" classifications (accuracy, sensitivity, specificity, F1, ...) suffer from the problems described at [Why is accuracy not the best measure for assessing classification models?](https://stats.stackexchange.com/q/312780/1352) – Stephan Kolassa Oct 28 '21 at 10:52
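To make the last two comments concrete, here is a minimal Python sketch (with made-up `p_hat` and `y_test` arrays, not data from the question) contrasting two proper scoring rules computed on the predicted probabilities with accuracy computed on thresholded predictions:

```python
import numpy as np

# Hypothetical test-set quantities (not from the question): predicted
# probabilities from the fitted model and the observed 0/1 outcomes.
p_hat = np.array([0.92, 0.15, 0.60, 0.08, 0.77])
y_test = np.array([1, 0, 1, 0, 1])

# Proper scoring rules evaluate the probabilistic predictions directly.
brier = np.mean((p_hat - y_test) ** 2)                    # Brier score (lower is better)
log_loss = -np.mean(y_test * np.log(p_hat)
                    + (1 - y_test) * np.log(1 - p_hat))   # log loss (lower is better)

# A thresholded "hard" classification discards the probability first.
y_pred = (p_hat >= 0.5).astype(int)
accuracy = np.mean(y_pred == y_test)

print(f"Brier: {brier:.3f}  log loss: {log_loss:.3f}  accuracy: {accuracy:.3f}")
```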

1 Answer


Given a training dataset $(X,y)$, where $X$ is the covariate matrix and $y$ the dependent variable, we fit a logistic regression model with coefficient vector $\hat{\beta}$ and covariance matrix $\operatorname{Cov}(\hat{\beta})=\hat{V}$.

When provided with a new sample $x_i$, the LR model works as follows:

  1. Estimate the linear predictor, $\hat{\theta}_i=x_i^T\hat{\beta}$
  2. Estimate the output probability using the sigmoid function, $$\hat{\pi}_i=P(y_i=1\mid x_i)=\operatorname{sigmoid}(\hat{\theta}_i)=\frac{e^{\hat{\theta}_i}}{1+e^{\hat{\theta}_i}}=\frac{1}{1+e^{-\hat{\theta}_i}}$$ This is for the logit link function; a similar method is available for the probit link function.
  3. Provide a prediction based on $\hat{\pi}_i$. In a well-balanced problem the threshold would be 0.5, that is, $\hat{y}_i=\begin{cases}1 & \hat{\pi}_i\ge 0.5\\0 & \hat{\pi}_i< 0.5\end{cases}$, but other cutoff values are possible.
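As a concrete illustration of the three steps, here is a minimal Python sketch with a hypothetical coefficient vector and two hypothetical new samples (the numbers are made up; a leading 1 is included in each sample so the intercept enters the dot product):

```python
import numpy as np

# Hypothetical fitted coefficients (intercept first) and two new samples,
# each prefixed with a 1 so the intercept is picked up by the dot product.
beta_hat = np.array([-1.2, 0.8, 0.5])
X_new = np.array([[1.0, 2.3, -0.7],
                  [1.0, 0.1,  1.5]])

theta_hat = X_new @ beta_hat               # step 1: linear predictor x_i^T beta_hat
pi_hat = 1.0 / (1.0 + np.exp(-theta_hat))  # step 2: sigmoid gives P(y_i = 1 | x_i)
y_hat = (pi_hat >= 0.5).astype(int)        # step 3: threshold at 0.5 (see the discussion below)

print(pi_hat, y_hat)
```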
Spätzle
  • The covariance matrix is not necessary, is it? Also, I have to disagree with your point 3 - setting the threshold should never be done unthinkingly: [Classification probability threshold](https://stats.stackexchange.com/q/312119/1352). – Stephan Kolassa Oct 28 '21 at 08:40
  • While I'm not ruling out other threshold values (as written above), numerous literature references relate to this issue (believe me, I had to refer to them in my Master's thesis). The general agreement is that other values are possible, but the "proper" way is sticking with 0.5 and adjusting the coefficients accordingly (which usually means changing the intercept value). Regarding the covariance matrix - it is one important feature of the LR model, and while it's not used for binary predictions it is used for constructing CIs for $\hat{\pi}_i$. – Spätzle Oct 28 '21 at 08:51
  • Regarding the threshold, I would be interested in any arguments as to why 0.5 is the "proper" threshold, and especially why one should change the intercept. As you can see from [my answer at the link](https://stats.stackexchange.com/a/312124/1352), my position is diametrically opposed. Perhaps you would be interested in writing a comment there and pointing to literature? – Stephan Kolassa Oct 28 '21 at 08:57
  • It's not likely as I have a huge to-do list and a hectic schedule (and as you can see, I'm here instead). In a nutshell: 0.5 as the threshold represents the assumption $\pi_i\sim \mathrm{Logi}(0,1)$. Taking a different threshold means $\mu\ne 0$, which in turn could reflect a violation of the assumption that $\hat{\theta}_i$ is an unbiased estimator of $\theta_i$. This should be your approach when considering a high-stakes binary classification problem (e.g. benign vs. malignant): you should assume no continuum. It is either 0 or 1, and a type-II error could be fatal. Different case? Different approach. – Spätzle Oct 28 '21 at 10:59
  • I understand all about not following up on CV because of other commitments, no pressure at all. However, I have to admit that I do not follow your argument at all, sorry. Maybe if you ever have the time to add a few pointers to literature (as I said, ideally at the other thread, which is more focused on this topic), we could continue the conversation. – Stephan Kolassa Oct 28 '21 at 11:05
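As a side note on the covariance-matrix point above: one common way to get a confidence interval for $\hat{\pi}_i$ is a Wald interval built on the linear-predictor scale using $x_i^T\hat{V}x_i$, with the endpoints mapped back through the sigmoid. The sketch below uses made-up numbers purely for illustration:

```python
import numpy as np

# Hypothetical fitted quantities: coefficient vector and its covariance matrix V_hat.
beta_hat = np.array([-1.2, 0.8, 0.5])
V_hat = np.array([[ 0.10, -0.02, 0.01],
                  [-0.02,  0.05, 0.00],
                  [ 0.01,  0.00, 0.04]])

x_i = np.array([1.0, 2.3, -0.7])        # new sample, with a leading 1 for the intercept

theta_hat = x_i @ beta_hat              # linear predictor
se_theta = np.sqrt(x_i @ V_hat @ x_i)   # its standard error via x_i^T V_hat x_i

z = 1.96                                # 95% Wald interval on the linear-predictor scale
lo, hi = theta_hat - z * se_theta, theta_hat + z * se_theta

sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))
print(sigmoid(theta_hat), (sigmoid(lo), sigmoid(hi)))   # point estimate and CI for pi_hat
```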