My dependent variable is a probability. As such, values lie between 0 and 1. The most common values are 0, 0.5, and 1 each occurring in 20% to 30% of the observations but any value in between is possible and some do occur.
Question 1: Which regression model is best to explain such data?
Ordinary least squares (OLS, function lm in R's stats package) is not suitable, as it accounts for neither the limited interval nor the accumulation at the margins.
Logit regression (function glm with parameter family="binomial" in R's stats package) accounts for the accumulation at 0 and 1 but does not allow intermediate values.
Ordered logit regression (function polr in R's MASS package) could be applied if I divided the [0, 1] interval into subintervals. However, I would lose the continuous nature of the dependent variable.
For probit and ordered probit regressions, the same applies as for logit and ordered logit.
Left- and right-censored tobit regression (function tobit with parameters left=0 and right=1 in R's AER package) might be appropriate. However, I found the following quote: "Some researchers have considered using censored normal regression techniques such as tobit ([R] tobit) on proportions data that contain zeros or ones. However, this is not an appropriate strategy, as the observed data in this case are not censored: values outside the [0, 1] interval are not feasible for proportions data." (p. 302 in Baum (2008), http://www.stata-journal.com/sjpdf.html?articlenum=st0147).
Below you find a code example.
# Load libraries
library(MASS)  # provides polr()
library(AER)   # provides tobit(); stats (lm, glm) is attached by default
# Generate data
set.seed(123)
data <- data.frame(x1 = runif(60, min = 0, max = 1), x2 = runif(60, min = 0, max = 1))
data$y <- -0.7 + data$x1 + 2 * data$x2 + rnorm(60, mean = 0, sd = 0.5)
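# Censor y at 0 and 1 and pile values near 0.5 onto exactly 0.5 to mimic the accumulation points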
data$y <- ifelse(data$y < 0, 0, data$y)
data$y <- ifelse(data$y > 0.4 & data$y < 0.6, 0.5, data$y)
data$y <- ifelse(data$y > 1, 1, data$y)
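# Discretise y into ordered categories (0, 0.25, 0.5, 0.75, 1) for the ordered logit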
data$yCat <- data$y
data$yCat <- ifelse(data$yCat > 0 & data$yCat < 0.5, 0.25, data$yCat)
data$yCat <- ifelse(data$yCat > 0.5 & data$yCat < 1, 0.75, data$yCat)
data$yCat <- as.factor(data$yCat)
hist(data$y, breaks=101)
# Different regression models
summary(lm(y ~ x1 + x2, data=data)) # OLS
summary(glm(y ~ x1 + x2, data=data, family="binomial")) # Logit
summary(polr(yCat ~ x1 + x2, data=data)) # Ordered logit
summary(tobit(y ~ x1 + x2, data=data, left=0, right=1)) # Tobit
To make matters worse, my data is panel data. I know how to handle individual, time, and mixed effects as well as random and fixed effects models using plm from R's plm package, and how to use the F-test, LM-test, and Hausman test to decide which of these is best (see the test sketch after the panel code example below).
Question 2: For the dependent variable described above, which panel regression model is best?
Below you find a code example for the data structure. This extends the prior example.
# Load library
library(plm)
# Generate data (builds on prior example)
data$id <- rep( paste( "F", 1:15, sep = "_" ), each = 4)
data$time <- rep( 1981:1984, 15 )
pData <- pdata.frame(data, c( "id", "time" ))
# Panel regression example
summary(plm(y ~ x1 + x2, data=pData, model="within", effect="twoways")) # Based on OLS
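For completeness, below is a sketch of the specification tests mentioned above, using pFtest, plmtest, and phtest from the plm package on the example data. It ignores Question 1 and simply reuses the OLS-style formula; the object names pooledModel, withinModel, and randomModel are only illustrative.
# Specification tests (sketch; assumes pData from above)
pooledModel <- plm(y ~ x1 + x2, data = pData, model = "pooling")
withinModel <- plm(y ~ x1 + x2, data = pData, model = "within", effect = "individual")
randomModel <- plm(y ~ x1 + x2, data = pData, model = "random", effect = "individual")
pFtest(withinModel, pooledModel)   # F-test: fixed effects vs. pooled OLS
plmtest(pooledModel, type = "bp")  # Breusch-Pagan LM test: random effects vs. pooled OLS
phtest(withinModel, randomModel)   # Hausman test: fixed vs. random effects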