Correct mathematical notation for subscripts in regression equation

Question

For the first time in a paper I am going to attempt to write out the equation for my model using mathematical notation, but I am a little unclear about how best to do it for the particular model I am using.

In my experiment I have two groups (let's call them A and B) of participants and I want to estimate the average level of the outcome variable y in each group. My model is a simple linear regression using level-means coding, with a separate intercept term for each group (see here).

In the draft of the paper I have written the model out like so

$y_i = \alpha_Ax_{Ai} + \alpha_Bx_{Bi} + \varepsilon$

and have explained it with the following text "where $y_i$ is participant $i$'s expected score, $\alpha_A$ is the average score for group A, $x_{Ai}$ is a binary variable indicating whether participant $i$ belongs to group A, $\alpha_B$ is the average score for group B, $x_{Bi}$ is a binary variable indicating whether participant $i$ belongs to group B, and $\varepsilon$ is measurement error."

I have several questions about this equation. First, in my model I estimate separate variances for group A and group B.

Question 1: Should I acknowledge the fact that there are separate estimates of the variance in the regression equation? For example would this equation be more appropriate?

$y_i = \alpha_Ax_{Ai} + \alpha_Bx_{Bi} + \varepsilon_{ij}$

where $y_i$ is participant $i$'s expected score, $\alpha_A$ is the average score for group A, $x_{Ai}$ is a binary variable indicating whether participant $i$ belongs to group A, $\alpha_B$ is the average score for group B, $x_{Bi}$ is a binary variable indicating whether participant $i$ belongs to group B, and $\varepsilon$ is the average amount participant i's score deviates from the average in group j, the group they were allocated to.

Question 2: If this second version is the correct version:

Is it correct to have the $i$ and $j$ subscripts after $\varepsilon$ to account for the fact that there are separate variances? I am concerned that in this version $j$ only appears after $\varepsilon$ and not after the $x$'s.
Is it more correct to describe $\varepsilon$ as measurement error or as the amount participant $i$'s score deviates from their group mean?

Your explanation of the model appears to confuse its *parameters* with *data statistics.* For instance, although $\alpha_A$ might be *estimated* as the average score for group $A,$ it is an unknown model parameter that will likely differ from the average Group $A$ score. Moreover, $y_i$ is not the "expected" score: it *is* the score. In question 2, the subscript $j$ makes little sense, as you suspect. You may continue to write "$\varepsilon_i,$" but you need to state something about its variance (which will depend on the values of the $x_{Ai}$ and $x_{Bi}$). — whuber, Sep 04 '19 at 22:48
Fantastic @whuber, thank you. Immensely helpful. I have a few follow-up questions about your comment: (a) would it be correct to say "$\alpha_A$ is the *estimated* average score for group A"? (b) When you say I need to state something about $\varepsilon_i$'s variance being group-dependent, do you mean using mathematical notation (and if so, how?) or simply explaining it in the text? (c) Would it be more correct to describe $\varepsilon_i$ as measurement error or as the deviation of participant $i$'s score from the average score for group A? — llewmills, Sep 04 '19 at 23:14
No (a) $\alpha_A$ and $\alpha_B$ are parameters, not estimates; (b) the assumptions about the $\epsilon_i$'s [no $j$ involved] need be spelled out, for instance if there is a different variance for each group, which your estimates differing seems to imply; (c) both descriptions are vague so fit at a general level, if group $A$ means the population and not the sample. — Xi'an, Sep 05 '19 at 03:05
Thank you @Xi'an, it seems I haven't given anywhere near enough thought to this, either about what parameters are or how to explain them to readers who know even less than me. So how would you describe the $\alpha$ and $\varepsilon_i$ terms in the model if you had to explain the equation in a paper? — llewmills, Sep 05 '19 at 05:26

score 3 · Accepted Answer · answered Sep 05 '19 at 14:34

It helps to be straightforward and non-technical with statistical explanations, even with a scientific audience. You will have to determine the best trade-off between clarity and verbosity. In my experience as a peer reviewer of scientific papers, I have been impressed with those that describe their models clearly and correctly in plain language (which, regrettably, is unconventional and unusual): that is evidence the authors truly understand what they are writing about and care that their audience shares that understanding.

There are many ways to approach the problem of describing your model, but presuming the intended audience consists of people interested in the subject rather than the statistics, consider formulating the clearest possible explanations of the statistical concepts. This is facilitated by breaking the model description into a logical sequence.

For instance, you might begin by setting the scene:

This model describes a hypothetical population of people who are represented by the participants in this study. Each person in the population is characterized by their group $\mathcal A$ or $\mathcal B$ and their response $y.$

(Supply an unambiguous characterization of this population somewhere in your paper.)

You can proceed to describe your variables:

Group membership is coded with an ordered pair of variables $(x_{\mathcal A}, x_{\mathcal B})$ that is set to $(1,0)$ for people in group $\mathcal A$ and $(0,1)$ for people in group $\mathcal B.$ The response $y$ is the raw score (which is observed for each participant and hypothetical for all others).

Now you can posit a quantitative model:

Individual scores are expected to be close to an unknown value determined by the group membership: $\alpha_{\mathcal A}$ for members of $\mathcal A$ and $\alpha_{\mathcal B}$ for members of $\mathcal B.$ Writing $\varepsilon$ for the deviation between a score and that value permits the scores to be expressed as $$y = \alpha_{\mathcal A} x_{\mathcal A} + \alpha_{\mathcal B} x_{\mathcal B} + \varepsilon.$$

Explicitly make it a probability model by characterizing the random terms:

The deviations, or "errors" $\varepsilon,$ are modeled as independent zero-mean random variables. The variances of all the errors within a group are assumed to be the same, given by $\sigma_{\mathcal A}^2$ in group $\mathcal A$ and $\sigma_{\mathcal B}^2$ in group $\mathcal B.$ Both these variances are unknown and may be different.

If you feel the need to reiterate the distinction between the population and the sample, or wish to expose the basic simplicity hiding behind the notation, you may make this explicit for the observations:

Quantities bearing the subscript $i$ denote values for participant $i;$ thus, viewing the groups as disjoint subsets of the population, $$y_i = \alpha_{\mathcal A} x_{\mathcal{A} i} + \alpha_{\mathcal B} x_{\mathcal{B} i} + \varepsilon_i = \begin{cases}\alpha_{\mathcal A}+\varepsilon_i & \text{if } i\in\mathcal{A} \\ \alpha_{\mathcal B}+\varepsilon_i & \text{if } i\in\mathcal{B}\end{cases}$$ and $$\operatorname{Var}(\varepsilon_i)=\sigma^2_{\mathcal A} x_{\mathcal{A}i} + \sigma^2_{\mathcal B} x_{\mathcal{B}i} = \begin{cases}\sigma^2_{\mathcal A} & \text{if } i\in\mathcal{A}\\ \sigma^2_{\mathcal B} & \text{if } i\in\mathcal{B}.\end{cases}$$

Finally, it is usually a good idea to explicitly distinguish parameters from estimates:

Values of the model parameters $\alpha_{\mathcal A},$ $\alpha_{\mathcal B},$ $\sigma^2_{\mathcal A},$ and $\sigma^2_{\mathcal B}$ estimated from the data are distinguished with "hats" as $\hat\alpha_{\mathcal A},$ etc.
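
To make the difference between parameters and estimates concrete, here is a minimal Python sketch using simulated data. Everything numeric in it is an assumption made up for illustration: the "true" values of $\alpha_{\mathcal A},$ $\alpha_{\mathcal B},$ $\sigma_{\mathcal A},$ $\sigma_{\mathcal B},$ the group sizes, and the use of Normal errors (the model as described only requires independent zero-mean errors with group-specific variances).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "true" parameters, chosen only for this illustration.
alpha = {"A": 10.0, "B": 12.0}  # group means alpha_A, alpha_B
sigma = {"A": 1.0, "B": 3.0}    # group standard deviations sigma_A, sigma_B

# Hypothetical group labels for 50 + 60 participants.
group = np.array(["A"] * 50 + ["B"] * 60)

# Level-means coding: (x_A, x_B) is (1, 0) in group A and (0, 1) in group B.
x_A = (group == "A").astype(float)
x_B = (group == "B").astype(float)
X = np.column_stack([x_A, x_B])  # no separate intercept column

# Simulate y_i = alpha_A x_Ai + alpha_B x_Bi + eps_i, where the standard
# deviation of eps_i depends on the group. Normal errors are an assumption.
sd_i = sigma["A"] * x_A + sigma["B"] * x_B
y = alpha["A"] * x_A + alpha["B"] * x_B + rng.normal(0.0, sd_i)

# Estimates ("hats"): least squares on the two indicator columns.
alpha_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# Group-specific variance estimates from the residuals.
resid = y - X @ alpha_hat
var_hat_A = resid[group == "A"].var(ddof=1)
var_hat_B = resid[group == "B"].var(ddof=1)

print("alpha_hat:", alpha_hat)
print("var_hat:  ", var_hat_A, var_hat_B)
```

With this coding the least-squares coefficients coincide with the two group sample means, and the residual variances coincide with the two group sample variances. They will generally be near, but not equal to, the parameter values used to generate the data, which is exactly the distinction the hats are meant to mark.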

Later, or when writing for a statistically sophisticated audience, you may abbreviate the foregoing with a more telegraphic description:

The model allows for different group means and group variances. It assumes the raw scores $y_i$ are independent random variables with $$y_i = \alpha_{\mathcal A} x_{\mathcal{A} i} + \alpha_{\mathcal B} x_{\mathcal{B} i} + \varepsilon_i$$ and $$\operatorname{Var}(\varepsilon_i) = \sigma^2_{\mathcal A} x_{\mathcal{A} i} + \sigma^2_{\mathcal B} x_{\mathcal{B} i}$$ where the variables $x_{\mathcal{A} i}, x_{\mathcal{B} i}$ are the indicator functions for groups $\mathcal A$ and $\mathcal B$ respectively (also known as level-means coding).
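
In the manuscript source, the abbreviated version might be typeset along the following lines. This is only a sketch: it assumes the `amsmath` package is loaded (for `align`, `cases`, `\text`, and `\operatorname`), and the choice to display the variance by cases rather than as a sum over the indicators is a matter of taste.

```latex
% Assumes \usepackage{amsmath} in the preamble.
\begin{align}
  y_i &= \alpha_{\mathcal{A}} x_{\mathcal{A}i}
       + \alpha_{\mathcal{B}} x_{\mathcal{B}i} + \varepsilon_i, \\
  \operatorname{Var}(\varepsilon_i) &=
    \begin{cases}
      \sigma^2_{\mathcal{A}} & \text{if } i \in \mathcal{A}, \\
      \sigma^2_{\mathcal{B}} & \text{if } i \in \mathcal{B}.
    \end{cases}
\end{align}
```

Note that only $x,$ $y,$ and $\varepsilon$ carry the participant subscript $i;$ the parameters $\alpha$ and $\sigma^2$ are indexed by group alone.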

Thank you @whuber, a precise exacting answer. I will almost certainly not have space to outline your full model explanation as this model is only one of five in the paper, but you have given me an excellent roadmap for a reduced version. Thank you again. — llewmills, Sep 05 '19 at 21:33
You're welcome--and I suspected the space limitations would apply, which is why I have offered a collection of points you can select from as well as an abbreviated account at the end. — whuber, Sep 05 '19 at 21:40
