My dependent variable is a probability. As such, values lie between 0 and 1. The most common values are 0, 0.5, and 1 each occurring in 20% to 30% of the observations but any value in between is possible and some do occur.
Question 1: Which regression model is best to explain such data?
Ordinary least squares (OLS, function lm in R's stats package) is not suitable, as it accounts for neither the limited interval nor the accumulation at the margins.
Logit regression (function glm with parameter family="binomial" in R's stats package) accounts for the accumulation at 0 and 1 but does not allow intermediate values.
Ordered logit regression (function polr in R's MASS package) could be applied if I divided the [0, 1] interval into subintervals. However, I would lose the continuous nature of the dependent variable.
For probit and ordered probit regressions, the same applies as for logit and ordered logit.
Left- and right-censored tobit regression (function tobit with parameters left=0 and right=1 in R's AER package) might be appropriate. However, I found the following quote: "Some researchers have considered using censored normal regression techniques such as tobit ([R] tobit) on proportions data that contain zeros or ones. However, this is not an appropriate strategy, as the observed data in this case are not censored: values outside the [0, 1] interval are not feasible for proportions data." (p. 302 in Baum (2008), http://www.stata-journal.com/sjpdf.html?articlenum=st0147).
Below you find a code example.
# Load libraries
library(MASS)  # provides polr()
library(AER)   # provides tobit(); stats (lm, glm) is attached by default
# Generate data
set.seed(123)
data <- data.frame(x1 = runif(60, min = 0, max = 1), x2 = runif(60, min = 0, max = 1))
data$y <- -0.7 + data$x1 + 2 * data$x2 + rnorm(60, mean = 0, sd = 0.5)
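# Censor y at 0 and 1 and pile values near 0.5 onto exactly 0.5 to mimic the accumulation points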
data$y <- ifelse(data$y < 0, 0, data$y)
data$y <- ifelse(data$y > 0.4 & data$y < 0.6, 0.5, data$y)
data$y <- ifelse(data$y > 1, 1, data$y)
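# Discretise y into ordered categories (0, 0.25, 0.5, 0.75, 1) for the ordered logit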
data$yCat <- data$y
data$yCat <- ifelse(data$yCat > 0 & data$yCat < 0.5, 0.25, data$yCat)
data$yCat <- ifelse(data$yCat > 0.5 & data$yCat < 1, 0.75, data$yCat)
data$yCat <- as.factor(data$yCat)
hist(data$y, breaks=101)
# Different regression models
summary(lm(y ~ x1 + x2, data=data)) # OLS
summary(glm(y ~ x1 + x2, data=data, family="binomial")) # Logit
summary(polr(yCat ~ x1 + x2, data=data)) # Ordered logit
summary(tobit(y ~ x1 + x2, data=data, left=0, right=1)) # Tobit
To make matters worse, my data is panel data. I know how to handle individual, time, and mixed effects as well as random and fixed effects models using plm from R's plm package, and how to use the F-test, LM-test, and Hausman test to decide which of these is best (see the test sketch after the panel code example below).
Question 2: For the dependent variable described above, which panel regression model is best?
Below you find a code example for the data structure. This extends the prior example.
# Load library
library(plm)
# Generate data (builds on prior example)
data$id <- rep( paste( "F", 1:15, sep = "_" ), each = 4)
data$time <- rep( 1981:1984, 15 )
pData <- pdata.frame(data, c( "id", "time" ))
# Panel regression example
summary(plm(y ~ x1 + x2, data=pData, model="within", effect="twoways")) # Based on OLS
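For completeness, below is a sketch of the specification tests mentioned above, using pFtest, plmtest, and phtest from the plm package on the example data. It ignores Question 1 and simply reuses the OLS-style formula; the object names pooledModel, withinModel, and randomModel are only illustrative.
# Specification tests (sketch; assumes pData from above)
pooledModel <- plm(y ~ x1 + x2, data = pData, model = "pooling")
withinModel <- plm(y ~ x1 + x2, data = pData, model = "within", effect = "individual")
randomModel <- plm(y ~ x1 + x2, data = pData, model = "random", effect = "individual")
pFtest(withinModel, pooledModel)   # F-test: fixed effects vs. pooled OLS
plmtest(pooledModel, type = "bp")  # Breusch-Pagan LM test: random effects vs. pooled OLS
phtest(withinModel, randomModel)   # Hausman test: fixed vs. random effects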