I am using the scikit-learn library to perform regression. However, in my case I need the dependent variable to be constrained to the range 0 to 1. The dependent variable represents count proportions (counts in some category divided by a total count) and is therefore not continuous. I can see two ways to achieve this.

  1. Transform the dependent variable to the full real number line and perform normal regression.
  2. Transform the regression problem into a categorical one by selecting n classes, with class i representing the range i/n to (i+1)/n.

My guess is that the first option wouldn't work well in practice and the second looks like an ugly kludge (which might work).

What is a good way to constrain the dependent variable in regression (in Python)?


The question "Regression for an outcome (ratio or fraction) between 0 and 1" suggested using Beta regression, but I don't fully understand this option. Could anyone set out what Beta regression is in technical detail for those who don't use R?

graffe
  • Possible duplicate of [Regression for an outcome (ratio) between 0 and 1](http://stats.stackexchange.com/questions/29038/regression-for-an-outcome-ratio-between-0-and-1) – Tim Oct 10 '16 at 10:26
  • @Tim Thanks. I added something to the question as I don't understand the accepted answer. – graffe Oct 10 '16 at 10:29
  • Are your outcomes count proportions (counts in some category divided by a total count) or continuous proportions? – Glen_b Oct 10 '16 at 10:31
  • @Glen_b They are count proportions. – graffe Oct 10 '16 at 10:31
  • Thanks, that's crucial information. Please edit it into your question. – Glen_b Oct 10 '16 at 10:34
  • Check e.g. http://stats.stackexchange.com/questions/232979/logistic-regression-use-of-real-values-between-0-and-1as-opposed-to-two-clas/233003#233003 or – Tim Oct 10 '16 at 10:40
  • Possible duplicate of [How to do logistic regression in R when outcome is fractional?](http://stats.stackexchange.com/questions/26762/how-to-do-logistic-regression-in-r-when-outcome-is-fractional) – amoeba Oct 10 '16 at 10:58
  • @amoeba Ah.. that is the answer in R. My question is how to do it in Python. – graffe Oct 10 '16 at 11:03
  • I know. But note that if you were to write a question "How to specify proportion as a DV in scikit-learn?" it would be closed as off-topic... – amoeba Oct 10 '16 at 11:08

1 Answer

Beta regressions are used for continuous proportions (like the proportion of land with a particular soil type).

For count proportions, the most common models would be binomial regression models, a particular type of generalized linear model (GLM).

Of those, logistic regression is the most widely used, though a number of other link functions are also in use.

The estimated fit is automatically constrained to lie within the bounds.

A binomial GLM doesn't transform the response; instead it fits a function for the mean that stays inside the limits.

[Numerous questions on this site discuss logistic regression. A few discuss other models, such as probit regression and complementary log-log regression.]
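As a rough illustration of the point about bounds, here is a minimal sketch of a binomial regression with a logit link, fitted by iteratively reweighted least squares using only NumPy. The function names (`fit_binomial_logit`, `predict_proportion`) and the data are made up for this example; in practice a GLM routine from a library such as statsmodels (a GLM with a binomial family) would be the more usual choice.

```python
import numpy as np


def fit_binomial_logit(X, successes, totals, n_iter=25, tol=1e-10):
    """Fit a binomial GLM with logit link by iteratively reweighted least squares.

    X         : predictors, shape (n,) or (n, p); an intercept column is added here
    successes : number of successes per row
    totals    : number of trials per row
    """
    X = np.column_stack([np.ones(len(X)), X])
    y = successes / totals                       # observed count proportions
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = X @ beta                           # linear predictor
        mu = 1.0 / (1.0 + np.exp(-eta))          # fitted proportion, inside (0, 1)
        w = totals * mu * (1.0 - mu)             # IRLS weights
        z = eta + (y - mu) / (mu * (1.0 - mu))   # working response
        beta_new = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * z))
        converged = np.max(np.abs(beta_new - beta)) < tol
        beta = beta_new
        if converged:
            break
    return beta


def predict_proportion(X, beta):
    """Predicted proportion for new predictor values (intercept added here)."""
    X = np.column_stack([np.ones(len(X)), X])
    return 1.0 / (1.0 + np.exp(-(X @ beta)))


# Made-up illustrative data: one predictor, successes out of totals per row.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
totals = np.array([10, 12, 8, 15, 10])
successes = np.array([1, 3, 4, 11, 9])

beta = fit_binomial_logit(x, successes, totals)
# Predictions stay strictly between 0 and 1, even when extrapolating.
preds = predict_proportion(np.array([-5.0, 10.0]), beta)
print(beta, preds)
```

Note that the response is never transformed: the observed proportions enter as they are, weighted by their totals, and the constraint comes entirely from the inverse logit applied to the linear predictor.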

Glen_b
  • Thanks. Please excuse the naive question, but it looks like http://scikit-learn.org/stable/modules/linear_model.html#logistic-regression expects the dependent variable to be either 0 or 1, not a real number in the range 0 to 1. See e.g. "Logistic regression, despite its name, is a linear model for classification rather than regression." – graffe Oct 10 '16 at 10:43
  • I think it is worth mentioning (as I did [here](http://stats.stackexchange.com/questions/232979/logistic-regression-use-of-real-values-between-0-and-1as-opposed-to-two-clas/233003#233003)) that in fact count proportion plus sample size is just another way of *storing* the same data about number of successes and number of failures, i.e. this is a standard logistic regression problem. – Tim Oct 10 '16 at 10:44
  • @Tim Thank you, but I am stuck at the practicalities. The logistic regression libraries I have found are classifiers. That is, they expect the dependent variable to be a class, 0 or 1. – graffe Oct 10 '16 at 10:46
  • @Lembik logistic regression is **not** a classifier (see [here](http://stats.stackexchange.com/questions/127042/why-isnt-logistic-regression-called-logistic-classification)) and it is possible to use it with proportions data -- but I'm not sure if scikit-learn is able to do so; from what you are saying, I guess it's not. – Tim Oct 10 '16 at 10:48
  • @Lembik If your software does not allow proportions as DVs then you can hack it by replicating the rows of your data table: e.g. if you have a row with 7/10 proportion, make it 7 rows with outcome 1 and 3 rows with outcome 0. Then run logistic regression. – amoeba Oct 10 '16 at 10:57
  • Some programs (R included) will allow you to specify proportions at each combination of predictors, but as amoeba says you can simply "unpack" your data back to 1s and 0s if you know the successes and the total count (i.e. the numerator and the denominator of your proportion). – Glen_b Oct 10 '16 at 11:47
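
The "unpacking" described in the comments above can be sketched in Python roughly as follows. The helper name `unpack_proportions` and the data are hypothetical, chosen to mirror the 7/10 example; only NumPy is assumed.

```python
import numpy as np


def unpack_proportions(X, successes, totals):
    """Expand grouped rows into one row per trial with a 0/1 outcome.

    A row with 7 successes out of 10 becomes 7 copies with outcome 1
    and 3 copies with outcome 0.
    """
    X = np.asarray(X)
    successes = np.asarray(successes)
    failures = np.asarray(totals) - successes
    # Success copies of every row come first, then failure copies.
    X_long = np.concatenate([np.repeat(X, successes, axis=0),
                             np.repeat(X, failures, axis=0)])
    y_long = np.concatenate([np.ones(successes.sum(), dtype=int),
                             np.zeros(failures.sum(), dtype=int)])
    return X_long, y_long


# Made-up data: two predictor settings with proportions 7/10 and 2/5.
X = np.array([[0.5], [1.5]])
successes = np.array([7, 2])
totals = np.array([10, 5])

X_long, y_long = unpack_proportions(X, successes, totals)
print(X_long.shape, int(y_long.sum()))
```

The expanded arrays can then be passed to any logistic regression routine that expects 0/1 outcomes, for instance scikit-learn's `LogisticRegression().fit(X_long, y_long)`, with `predict_proba` giving the fitted proportions. Two caveats: scikit-learn's `LogisticRegression` applies L2 regularisation by default, so its coefficients will differ somewhat from an unpenalised binomial GLM, and for large totals it may be cheaper to keep one success row and one failure row per group and pass the counts through the `sample_weight` argument of `fit` instead of duplicating rows.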