
I have a dataframe containing many feature columns, one of them being Score, with values between 0 and 1 (if it helps: it represents the difficulty of a test. The closer to 1, the easier the test). I created an extra column that applies the logit function to each of these values.

The reason I did this transformation is to do a logistic regression to predict the difficulty. However, many of the transformed values are inf, because the input contains exact 0s and 1s (most of them are 1s). What should I do with them? Should I apply the logistic function (the inverse of logit) to the 0s and 1s? Should I just eliminate them?
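Concretely, the transform diverges exactly at those boundary values. A minimal sketch (the `logit` helper here is illustrative; with numpy/pandas you would typically see ±inf plus a runtime warning rather than an explicit check, since plain `math.log` raises at the boundary instead):

```python
import math

def logit(p):
    """Log-odds transform; diverges at the boundaries p = 0 and p = 1.
    The boundary cases are handled explicitly because math.log raises
    on 0, whereas numpy would silently return -inf/inf with a warning."""
    if p <= 0.0:
        return float("-inf")
    if p >= 1.0:
        return float("inf")
    return math.log(p / (1.0 - p))

print(logit(0.5))   # 0.0
print(logit(1.0))   # inf -- the problem described above
```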

  • You're predicting difficulty, based on difficulty? This is confusing. – Gijs Jul 12 '21 at 08:56
  • @Gijs I'm sorry, the column with the scores will be deleted, and I will only have the features (test length, test time etc.) and this new target column where the logit function was applied. So instead of having a target column "Scores" with values between 0 and 1, I will have a target column with values in ℝ, but the problem is that there are lots of infinities. – n.mathfreak Jul 12 '21 at 09:06
  • Logit or logistic regression is in my experience always implemented by a dedicated routine or through some more general routine such as code for generalized linear models. It is not equivalent to regression following a logit transform: in fact, for the classic case of a binary outcome with values 0 and 1, the logit doesn't yield any finite values at all. For your case with a more nearly continuous proportion as outcome you should **not** drop any values, but just use an appropriate routine in your unnamed software environment. – Nick Cox Jul 12 '21 at 09:34
  • @NickCox I tried various regression models in Python and Google AutoML, using the original column with values between 0 and 1, and I've been told that that was the problem because there was a minimum and maximum value. I had low MAE, RMSE, RMSLE and MAPE, but also a low R^2. That's why I wanted to transform the values. – n.mathfreak Jul 12 '21 at 09:53
  • Indeed; your response or outcome is bounded, and much of the point of logit regression is to respect that and also that it has a particular variance structure, so that for example as the mean approaches 0 or 1, so also the variance approaches zero. But as said, logit regression is not regression on a logit-transformed outcome. Some people would be happy in practice with a linear model fit to the untransformed data. Otherwise, interest in applying a logit model to a continuous proportion is common in practice, but often only discussed in the context of generalized linear models. – Nick Cox Jul 12 '21 at 10:03
  • Otherwise put, linear regression on the original data might not work too badly, although the ideal conditions for such a regression are unlikely to be met closely; logit regression using a generalized linear model routine is the alternative. Logit transformation of a continuous proportion is problematic because zeros and ones don't fit that recipe. – Nick Cox Jul 12 '21 at 10:06
  • @NickCox thank you, things are clearer now. If I may ask, do you have any recommendations regarding GLM routines (documentation, sources)? I'm very new to this. – n.mathfreak Jul 12 '21 at 10:28
  • I would start with the book by Dobson and Barnett. I can't recall if they cover your application. – Nick Cox Jul 12 '21 at 10:39

1 Answer


What you seem to want is sometimes called fractional logistic regression. There are many questions about it on this site; start with What is the difference between logistic regression and Fractional response regression? and search from there.

But very short:

  • You should NOT transform the scores; use them directly as the outcome variable. Logistic regression implicitly transforms their expectation, not the scores themselves.

  • Since your data are not really binomial, the binomial likelihood is not strictly correct, so use a quasi-likelihood approach (the quasibinomial family), which keeps the logit mean model and variance structure without assuming a true binomial likelihood.

  • With R you would use something like

mod0 <- glm(score ~ ., family = quasibinomial, data = your_df)
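The key point of the quasi-likelihood fit can also be sketched outside R: the score equations only involve (y − μ), so fractional outcomes, including exact 0s and 1s, need no transformation at all. A minimal pure-Python illustration with one feature (`fractional_logit_fit` and the toy data are my own illustrative choices, not from the answer; in practice you would use a dedicated GLM routine):

```python
import math

def fractional_logit_fit(x, y, lr=0.5, n_iter=5000):
    """Fit E[y | x] = 1 / (1 + exp(-(a + b*x))) by gradient ascent on the
    Bernoulli quasi-log-likelihood.  The score equations only involve
    (y - mu), so fractional outcomes -- including exact 0s and 1s --
    are handled without any transformation."""
    a, b = 0.0, 0.0
    n = len(x)
    for _ in range(n_iter):
        grad_a = grad_b = 0.0
        for xi, yi in zip(x, y):
            mu = 1.0 / (1.0 + math.exp(-(a + b * xi)))  # fitted mean
            grad_a += yi - mu          # score w.r.t. intercept
            grad_b += (yi - mu) * xi   # score w.r.t. slope
        a += lr * grad_a / n
        b += lr * grad_b / n
    return a, b

# Toy data: scores rise with a hypothetical feature; note the exact 1.0
# at the end causes no trouble, unlike a logit transform of the outcome.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [0.05, 0.2, 0.5, 0.8, 1.0]
a, b = fractional_logit_fit(x, y)
```

At the optimum the fitted means satisfy the score equations (the residuals (y − μ) sum to zero), which is the same estimating-equation logic `glm(..., family = quasibinomial)` uses.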
– kjetil b halvorsen
  • Thank you! I'm working in Python and I found and used this reference: https://statcompute.wordpress.com/2012/12/16/fractional-logit-model-with-python/. But I am getting a negative pseudo R^2, Log-Likelihood, LL-Null and LLR p-value. Is this something to worry about? – n.mathfreak Jul 13 '21 at 06:50
  • I have no experience with pseudo-R2. Negative loglik is as expected, but a negative p-value is impossible! Better for you to include the results (with some plots) in an edit to the Q – kjetil b halvorsen Jul 13 '21 at 15:18