6

Is it possible to put standard deviations or variances into a linear model, as the data to be explained? I have a predictor which I think will linearly increase the standard deviation of a measure, and it is this variability that is of interest.

For each condition, I calculated the standard deviation, so that I have a vector of standard deviations which I'd like to model. I then fed this into a linear model

$$\operatorname{sd}_k(y_{ik}) = \sum_j X_{ij}\,\beta_j + \varepsilon_i,$$

where the SD is taken over the repetitions $k$ within condition $i$,

where X is something like

$$X = \begin{bmatrix} 1 & -2 \\ 1 & -1 \\ 1 & 0 \\ 1 & 1 \\ 1 & 2 \end{bmatrix}$$

I realise that standard deviations are not normally distributed, so this isn't quite right. Can I transform the variable so that the error terms would be normally distributed? Or can I use a "generalised" linear model with a link function?

(I actually want to feed this into a mixed model, since several subjects performed the experiment. Each subject will have a different baseline variability, and I want to look at variability across subjects by condition. I will also need to compare groups of subjects; a mixed model seems appropriate for that purpose.)

Sanjay Manohar
  • "*Is it possible to put standard deviations or variances into a linear model, as the data to be explained?*" -- Not only possible, but actually common, since effectively that's the basis of what's done in [Levene's test](http://en.wikipedia.org/wiki/Levene%27s_test). More generally, I'd lean toward using GLMs for models involving variances, but @rvl seems to have that well-covered in his answer. (With a mixed model, perhaps GLMMs) – Glen_b Sep 29 '14 at 23:44

2 Answers

8

Yes, you can do this. A GLM of the SDs with a log link and a gamma family is one way to do it, if you think the populations are normal.

It is also not uncommon for people to regress log SD on a bunch of predictors. It is approximate, but all models are. One text where you can see this done is Box, Hunter, and Hunter, *Statistics for Experimenters* (2nd edition), in the helicopter experiment in Chapter 12.

The log is intuitively correct here because scale parameters like SDs act multiplicatively, and taking logs makes them additive -- suitable for a linear model.
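A minimal sketch of that log-SD regression (NumPy only; the condition codes, true coefficients, and sample sizes below are made up purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
conditions = np.array([-2, -1, 0, 1, 2])    # centered condition codes, as in the question
true_sd = np.exp(0.5 + 0.3 * conditions)    # SD grows multiplicatively with condition

# Simulate 20 normal measurements per condition and compute the sample SD of each.
sds = np.array([rng.normal(0.0, s, size=20).std(ddof=1) for s in true_sd])

# Design matrix: intercept column plus the condition code (the X in the question).
X = np.column_stack([np.ones(len(conditions)), conditions])

# Ordinary least squares on log SD recovers the multiplicative coefficients.
beta, *_ = np.linalg.lstsq(X, np.log(sds), rcond=None)
print(beta)  # roughly [0.5, 0.3]
```

The same fit could be done with a gamma-family GLM and log link as suggested above; the OLS-on-log-SD version is just the simplest implementation of the Box, Hunter, and Hunter approach.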

Russ Lenth
  • How would you incorporate the fact that you have many measurements contributing to this? I.e., even setting aside the repeated measures / mixed effects aspect: if you regress 3 SDs where each SD is calculated from $n_j$ (say 20) measurements, you go from $N = 60$ original measurements to 3. – gung - Reinstate Monica Sep 29 '14 at 22:00
  • @rvl Could you explain in more detail what you mean by scale parameters being multiplicative effects? – 114 Sep 30 '14 at 01:15
  • Thank you very much for the answer and reference. Very helpful - I will go with a gamma family GLM, which I had not come across before, but makes perfect sense now. The original variable `y` is normally distributed. – Sanjay Manohar Sep 30 '14 at 06:58
  • @gung, in these situations the data are often subsampled anyway, meaning they are not true replicates. Also, as subsample size increases, the error variance drops (even with log SD), so there is some compensation at hand. If subsample sizes are unequal, weights should probably be used. – Russ Lenth Sep 30 '14 at 13:21
  • @114 - Think about the way we standardize measurements, or un-standardize them. If you have $Z$ with mean $0$ and SD $1$, then $X=\mu +\sigma Z$ has mean $\mu$ and SD $\sigma$. Note that we *multiplied* by $\sigma$. That's why I say SDs are multiplicative by nature. – Russ Lenth Sep 30 '14 at 13:29
2

It sounds like you are proposing essentially a two-stage analysis, where stage one reduces each cluster to its standard deviation about a cluster-specific mean. This seems fine, although note that you could instead model at the observation level, i.e., let the variance of each observation be a linear function of covariates. I don't know of any off-the-shelf software that fits exactly that, however.

Returning to the two-stage approach: if the observations in clusters $i=1,\ldots,N$ are normally distributed, e.g. $Z_{ij} \sim N(\mu_i, \rho^2_i)$ for $j = 1, \ldots, N_i$, then the sample variances are scaled chi-square distributed with $N_i -1$ degrees of freedom. Letting $S^2_i$ denote the sample variance in cluster $i$, we have $$S^2_i \sim \frac{\rho^2_i}{N_i-1} \times \chi^2(N_i-1).$$

In more detail, we have that \begin{align*} \operatorname{E} S^2_i & = \rho^2_i, \\ \operatorname{Var} S^2_i & = \frac{2\rho_i^4}{N_i - 1}. \end{align*}

A gamma GLM assumes that $\operatorname{Var} Y = \phi \,(\operatorname{E} Y)^2$, so this might actually be a case for gamma regression, with an identity link! (Which is a first for me, I think.) If the $N_i$ differ very much, you should also supply prior weights of $N_i - 1$, since the precision of each $S^2_i$ is proportional to $N_i - 1$.
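As a quick sanity check on those moments, here is a small Monte Carlo simulation (NumPy only; the values of $\rho$, $N_i$, and the replication count are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
rho, n, reps = 2.0, 10, 200_000

# reps independent normal samples of size n, all with SD rho;
# take the sample variance (ddof=1) of each sample.
samples = rng.normal(0.0, rho, size=(reps, n))
s2 = samples.var(axis=1, ddof=1)

# E[S^2] should be rho^2 = 4, and Var[S^2] should be
# 2*rho^4/(n-1) = 32/9, per the scaled chi-square result above.
print(s2.mean())
print(s2.var())
```

The empirical mean and variance of `s2` land close to $\rho^2$ and $2\rho^4/(N_i-1)$, which is the quadratic mean–variance relationship that motivates the gamma GLM.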

Andrew M
  • Thank you for the clear logic - I had not thought of it as a 2-level estimation, which of course it is! That helps understand exactly how the error will be distributed, and clarifies the assumptions required. – Sanjay Manohar Sep 30 '14 at 07:00