How to model conditional variance?

Question

Sorry if this question has been asked before; I'd love to read any discussion around this. There's got to be a better way to summarize this question as well.

I've got covariates $X$ and response $Y$, and suppose I know that when $X$ is high (or low), so is the variance in $Y$, though perhaps not the expectation. Is there a standard approach to modeling this?

Thinking of examples where this could pop up:

- Maybe Y is a stock price, and X is the number of articles written about the stock that day - we don't know if the news is good or bad, but we know the stock probably did something interesting that day if so many articles were written about it. This is assuming you're not interested in forecasting into the future, I guess.
- Maybe $X = Z^2$, $Z \sim N(0, \sigma)$, and $Y$ has a partial correlation with $Z$, but you don't have access to $Z$ directly.
- You have some advanced archers, and you're interested in the technique they use given their builds/limb lengths/injuries, but for some reason you gave them varying levels of caffeine beforehand, ranging from 0 to 10 cups of coffee (which you recorded dutifully). So their technique is still unbiased, just more jittery, and the noise in your measurements is greater for more caffeinated archers.

These are contrived examples, but this is just for fun right now.

One kind of exploratory approach that I can think of goes something like this:

1. Fit a linear regression on all the other covariates, leaving out your "variance predictor" $X_v$.
2. Plot the squared residuals against your variance predictor to decide on some functional form of the relationship between the two, e.g. $V(Y) \propto \exp(X_v)$.
3. Use $\frac{1}{\exp(X_v)}$ as weights and run a weighted linear regression.
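
The three steps above can be sketched in base R. This is only an illustration on simulated data: the variable names (`x1`, `x_v`, `y`), the simulated relationship, and the auxiliary regression on the log squared residuals (used here in place of eyeballing a plot) are all assumptions, not part of the question.

```r
set.seed(1)
n   <- 500
x1  <- rnorm(n)                # an ordinary covariate (hypothetical)
x_v <- runif(n, 0, 3)          # the "variance predictor" (hypothetical)
# Simulated response: mean depends on x1 only, Var(Y) is proportional to exp(x_v)
y   <- 1 + 2 * x1 + rnorm(n, sd = exp(0.5 * x_v))

# Step 1: fit the mean model, leaving out the variance predictor
fit_ols <- lm(y ~ x1)

# Step 2: relate the squared residuals to x_v on the log scale,
# i.e. assume log Var(Y) is linear in x_v
aux     <- lm(log(resid(fit_ols)^2) ~ x_v)
var_hat <- exp(fitted(aux))    # estimated variance, up to a constant factor

# Step 3: weighted least squares with weights inversely proportional to the variance
fit_wls <- lm(y ~ x1, weights = 1 / var_hat)

coef(aux)                      # slope on x_v estimates the exponent
summary(fit_wls)
```

If the slope on `x_v` in the auxiliary fit comes out close to 1, the weights reduce to the $\frac{1}{\exp(X_v)}$ proposed in step 3. The intercept of the auxiliary fit is biased on the log scale, but that only rescales all weights by a constant, which does not change the WLS coefficients.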

This makes some sense to my little bird brain, but I'm sure there's a better choice than an exponential link function, or even WLS.

Isabella Ghement · Accepted Answer · 2021-02-10T09:58:27.677

When you say "I've got covariates X and Y", do you really mean that you have a response variable Y and a set of predictor variables X?

Your examples cover a lot of ground, so you might benefit from keeping things simple to begin with. "Just for fun" is relative and the more complex the question, the more time it requires someone to spend answering it. (There are generally no guarantees that a question will be answered on this forum, I would think.)

For the sake of simplicity, I will assume that the observations collected on X and Y are actually independent (which would most likely NOT be the case for the data in your first example). I will also assume that we only have a single predictor variable X.

With such data and the stated assumptions in place, one flexible approach you could use to model both the conditional mean and conditional variance of Y given X would involve the use of so-called GAMLSS models, aka Generalized Additive Models for Location, Scale and Shape. See https://www.gamlss.com.

If we can further assume that the conditional distribution of Y given X is Normal, we would in effect deal with a special case of the above type of model: Additive Model for Location and Scale. In R, this model could be specified like this:

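library(gamlss)

# Smooth terms in X for both distribution parameters
model <- gamlss(Y ~ pb(X),                 # model for mu (conditional mean)
                sigma.formula = ~ pb(X),   # model for sigma (conditional variance under NO2)
                family = NO2,
                data = Data)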

Here, the function NO2() defines the normal distribution as a two-parameter distribution with mean equal to mu and variance equal to sigma. (This differs from NO(), where sigma is the standard deviation.)

The function pb() is a P-spline smoother which will allow for the possibility that the effect of X on the conditional mean and the conditional variance of Y given X is potentially nonlinear. The potential nonlinearity of the effects is captured in a nonparametric fashion. The data will help reveal the underlying shape of these effects - you will not have to guess what complicated forms these effects might have. (There are other types of smoothers available, such as cubic splines cs().)
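
To try the gamlss() call above without real data, here is a minimal sketch on simulated data. The data-generating process is an assumption made up for illustration; only the model call itself comes from the answer. It requires the gamlss package to be installed.

```r
library(gamlss)

set.seed(3)
n    <- 300
Data <- data.frame(X = runif(n, 0, 3))
# Hypothetical truth: nonlinear mean, standard deviation growing with X
Data$Y <- sin(Data$X) + rnorm(n, sd = exp(0.4 * Data$X))

model <- gamlss(Y ~ pb(X),
                sigma.formula = ~ pb(X),
                family = NO2,
                data = Data)

# Fitted distribution parameters at the observed X values
mu_hat  <- fitted(model, what = "mu")     # conditional mean
var_hat <- fitted(model, what = "sigma")  # with NO2, sigma is the variance

# Estimated conditional variance as a function of X
ord <- order(Data$X)
plot(Data$X[ord], var_hat[ord], type = "l",
     xlab = "X", ylab = "Estimated Var(Y | X)")
```

Because NO2 is parameterized by the variance, `var_hat` can be read directly as the estimated conditional variance; with NO() the same call would return a standard deviation instead.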

For more, see http://www.gamlss.com/wp-content/uploads/2013/01/gamlss-manual.pdf and https://www.gamlss.com/wp-content/uploads/2019/10/Practicals-Bilbao.pdf.

Of course, if the data collected on X and Y are not independent (e.g., X and Y are daily time series), you would need to change your modelling framework. Dependence complicates things!

I think one important consideration is whether the conditional variance of Y given X is of primary interest or just a nuisance. If it is just a nuisance, trying to get an explicit model for it - let alone the best possible model - might be overkill. If you can assume that X has a linear effect on the conditional mean of Y given X, then you might get away with a Huber-White (heteroskedasticity-consistent) correction of the standard error of the estimated linear effect. In other words, you fit a model for the conditional mean which assumes constant variability of the Y values at each X value, and then correct the standard errors so that inference on the linear effect of X is robust to non-constant variability.
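
As a rough sketch of what such a correction does, the basic Huber-White (HC0) sandwich estimator can be computed by hand in base R. The simulated data and variable names are assumptions for illustration; in practice a package such as sandwich (`vcovHC(fit, type = "HC0")`) computes the same matrix and offers small-sample variants.

```r
set.seed(2)
n <- 400
x <- runif(n, 0, 3)
# Hypothetical data: linear mean, error variance increasing with x
y <- 1 + 2 * x + rnorm(n, sd = exp(0.5 * x))

fit <- lm(y ~ x)               # ordinary least squares, constant variance assumed
X   <- model.matrix(fit)       # design matrix (intercept and x)
e   <- resid(fit)

# HC0 sandwich: (X'X)^{-1} X' diag(e^2) X (X'X)^{-1}
bread    <- solve(crossprod(X))
meat     <- crossprod(X * e)   # row i of X is scaled by e[i]
vcov_hc0 <- bread %*% meat %*% bread

se_classical <- sqrt(diag(vcov(fit)))
se_robust    <- sqrt(diag(vcov_hc0))
cbind(estimate = coef(fit), se_classical, se_robust)
```

The coefficient estimates are unchanged; only the standard errors (and hence tests and confidence intervals) differ between the two columns.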

thank you! you're right, this was kind of a sprawling question, but this is exactly what i came for. besides the approach of explicitly modeling everything, i like that you emphasized the alternative of using a standard error correction for heteroskedasticity, i hadn't even considered that. — goopy, Feb 10 '21 at 14:34
Cool! There are other ways to explicitly model conditional variance by imposing a specific model on it - if you check out Zuur et al.’s book on mixed effects models, you will see how to do this in R using the *gls()* function from the **nlme** package: https://link.springer.com/book/10.1007/978-0-387-87458-6. — Isabella Ghement, Feb 10 '21 at 17:19
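
A minimal sketch of the gls() route mentioned in the comment above, on simulated data. The data, the variable names, and the choice of the `varExp()` variance function (which matches the exponential form suggested in the question) are assumptions for illustration, not taken from the book.

```r
library(nlme)

set.seed(4)
n <- 500
d <- data.frame(x1 = rnorm(n), x_v = runif(n, 0, 3))
# Hypothetical truth: sd(Y) proportional to exp(0.5 * x_v)
d$y <- 1 + 2 * d$x1 + rnorm(n, sd = exp(0.5 * d$x_v))

# varExp(form = ~ x_v) models the residual standard deviation as
# proportional to exp(delta * x_v), with delta estimated from the data
fit_gls <- gls(y ~ x1, data = d, weights = varExp(form = ~ x_v))

summary(fit_gls)

# Estimated exponent of the variance function
delta_hat <- coef(fit_gls$modelStruct$varStruct, unconstrained = FALSE)
delta_hat
```

Unlike the two-stage weighted regression, this estimates the mean coefficients and the variance parameter jointly, so there is no need to fix the weights in advance.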
