How to model a dependent variable with a U-shaped dependent variable?

Question

Note: I should have flipped x and y in this example to be in line with typical notation. Sorry for any confusion.

Consider the following data:

set.seed(1839)

## generate u-shaped dependent variable
x1 <- rgamma(100, .5, 1)
x2 <- rgamma(100, .5, 1)
x2_reversed <- max(x2) + min(x2) - x2
x <- c(x1, x2_reversed)
hist(x)

Note the U-shape of the distribution of x. Imagine we have a dichotomous, categorical predictor y:

## generate independent variable
y <- c(rep("a", 100), rep("b", 100))

Because of the way I generated the data, I know that people from groups of a and b come from different distributions. But this is merely a reproducible example, so the obvious fact that b is higher than a is not interesting to me. Instead, I am wondering how to model data more generally when the dependent variable has a U-shape.

I model political data, where respondents tend to be very polarized, hence the bimodal/U-shaped data of the dependent variable. Let's say I do an experiment, comparing two messaging techniques. How would I model this relationship? If I use a linear model with a Gaussian link function, I violate the assumption of normality:

## lm with gaussian link function
model1 <- lm(x ~ y)

## problem is that residuals are not normally distributed
qqnorm(model1$residuals)
qqline(model1$residuals)
hist(model1$residuals)

I have seen questions elsewhere on this site about regression using a beta link function (Fit a Beta Regression model to a U-shaped Dependent Variable), but the beta distribution is bounded between 0 and 1, and my dependent variable falls outside of those bounds.

My first thought was to cut people into three groups, based on the dependent variable, and then do an ordinal regression (low, medium, high). But it is bothering me that I do not know how to model this in a way that doesn't turn a continuous variable artificially into a categorical one.

My next thought was to model this in Stan using some specific distribution in the likelihood specification, but I am not sure what distribution (or mixture of distributions) to use to specify the likelihood.

How can I model a continuous dependent variable with a strong U-shaped distribution?

Edit 1: I feel as if beta regression seems the way to go, but I am having trouble with values needing to be less than 1 or greater than 0. Normalizing my values to be between 0 and 1 (inclusive) yields this error, as beta regression requires the dependent variable cannot be 0 or 1 exactly (see https://cran.r-project.org/web/packages/betareg/vignettes/betareg.pdf):

## using beta regression
library(betareg)
x <- sapply(x, function(x_i) {
  (x_i - min(x)) / (max(x) - min(x)) 
})

betareg(x ~ factor(y))

Error in betareg(x ~ factor(y)) : 
  invalid dependent variable, all observations must be in (0, 1)

Edit 2: Per the betareg vignette, I rescaled my dependent variable like:

x <- sapply(x, function(x_i) {
  x_i <- (x_i * (length(x) - 1) + 0.5) / length(x)
  (x_i - min(x)) / (max(x) - min(x))
})

model2 <- betareg(x ~ y)
summary(model2)

Call:
betareg(formula = x ~ y)

Standardized weighted residuals 2:
    Min      1Q  Median      3Q     Max 
-4.5795 -0.4558  0.0019  0.6160  1.6084 

Coefficients (mean model with logit link):
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -1.7528     0.1176  -14.91   <2e-16 ***
yb            3.3336     0.1820   18.32   <2e-16 ***

Phi coefficients (precision model with identity link):
      Estimate Std. Error z value Pr(>|z|)    
(phi)   3.3452     0.3507   9.538   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Type of estimator: ML (maximum likelihood)
Log-likelihood: 194.2 on 3 Df
Pseudo R-squared: 0.6756
Number of iterations: 14 (BFGS) + 1 (Fisher scoring)

Since the beta distribution can seemingly take any shape, I'm still unsure if this is the appropriate way to model the dependent variable. I'm sure I can figure it out with enough reading into beta regression, but if anyone can explain it concisely, it would be appreciated!

Beta distributions can be extended to any closed interval for $y \in [a, b]$ given a scaling $(y - a) / (b - a)$. — Nick Cox, Nov 08 '17 at 00:33
@NickCox correct me if I'm wrong, but Beta distributions do *not* include 0 and 1, right? Values must be less than 1 or greater than 0? So if I took my zero values from `x`, they would equal (0 - 0)/(1 - 0) = 0, which is not a valid value for a Beta distribution? Similarly, this would make something like 3 be (3 - 0) / (1 - 0) = 3, which is also not valid for a Beta distribution? — Mark White, Nov 08 '17 at 00:37
https://en.wikipedia.org/wiki/Beta_distribution cites the interval as I have it. In any case, the key point is how your mean response behaves approximately. The mean won't ever equal either bound for practical problems. Also, don't worry too much about U-shapes. The logit transform of a beta density is always unimodal. Indeed, there is another answer: apply logit to continuous bounded intervals. — Nick Cox, Nov 08 '17 at 00:46
@NickCox The reason I mention that is because `betareg::betareg()` will not take the dependent variable being 0 or 1. See edit in original post. — Mark White, Nov 08 '17 at 00:56
I didn't write it! In any case this isn't an R forum and the suggestion of logit remains. — Nick Cox, Nov 08 '17 at 01:00
@NickCox Sure, but beta regression requires 0 < y < 1, is what I was remarking on: https://cran.r-project.org/web/packages/betareg/vignettes/betareg.pdf. Could you expound upon your other answer of "apply logit to continuous bounded intervals"? — Mark White, Nov 08 '17 at 01:03
See e.g. my answer at https://stats.stackexchange.com/questions/70917/why-do-we-need-clearly-defined-variables-as-response-dependent-variables/70921#70921 In social science, see Papke and Wooldridge papers on fractional responses. — Nick Cox, Nov 08 '17 at 01:16

How to model a dependent variable with a U-shaped dependent variable?

0 Answers0