Should the response variable meet the assumption of constant variance?

Question

Information about the data:

The tensile strength of a synthetic fibre used to make cloth for men’s shirts is of interest to a manufacturer. It is suspected that the strength is affected by the percentage of cotton in the fibre. Five levels of cotton percentage are of interest: 15%, 20%, 25%, 30% and 35%. Five observations were taken at each level of cotton percentage; the 25 observations were run in random order. The data collected from the experiment are given in the table below. Y denotes the response variable, that is, the tensile strength, measured as resistance to a fixed stress on a scale from 0 to 50. The means of Y for the different levels of cotton percentage, and the overall mean response, are also given.

Table

Problem to solve:

The idea is to analyze the data with an analysis of variance.

Let $Y_{ij}$ be the $i$th replicate of the response variable taken at the $j$th level of the factor representing the percentage of cotton. Which of the following assumptions (below) is needed for the analysis to be appropriate? The objective is to justify and explain why the assumptions that have not been chosen are inadequate or otherwise inappropriate.

Assumptions

(i) The data $y_{ij}$ must be a sample from a normal distribution.
(ii) The observations $y_{ij}$ must be a sample from a normal distribution with constant variance.
(iii) The observations $y_{ij}$ must be independent and must be sampled from normal distributions $N(\mu_j,\sigma^2)$, where $\mu_j$ is the expected tensile strength when the factor representing the percentage of cotton takes level $j$.

I have read extensively but I am still confused about the following:

What does the notation $N(\mu_j,\sigma^2)$ stand for?
Should the observations from $Y_{ij}$ be from a sample with both a normal distribution and constant variance?
Would it be reasonable to model the relationship between strength and percentage cotton (as a continuous variate) using a simple linear regression model?

Glen_b · Accepted Answer · 2017-01-06T10:20:49.473

What does the notation $N(\mu_j,\sigma^2)$ stand for?

Read as "from a normal distribution with mean $\mu_j$ (mu-j) and variance $\sigma^2$ (sigma-squared)" (i.e. standard deviation $\sigma$). This is simply saying that the mean in each group may differ but the variance is constant, and that the population distributions the samples are drawn from are normal.
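Written out as a model, assumption (iii) can be sketched as follows. The error term $\varepsilon_{ij}$ is notation introduced here for illustration; it does not appear in the question.

$$Y_{ij} = \mu_j + \varepsilon_{ij}, \qquad \varepsilon_{ij} \overset{\text{iid}}{\sim} N(0,\sigma^2), \qquad i = 1,\dots,5, \quad j = 1,\dots,5.$$

The subscript $j$ on $\mu_j$ is what lets the group means differ, while the absence of a subscript on $\sigma^2$ is the constant-variance assumption.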

Should the observations from $Y_{ij}$ be from a sample with both a normal distribution and constant variance?

On normality:

The individual sample is only five numbers; the sample can't be "normal" as such. It's the population from which the sample is drawn that is assumed to be normally distributed (and then only when it comes to doing hypothesis tests or finding confidence intervals (etc) that make that assumption).
You can't really determine that a sample is drawn from any distribution. You can sometimes tell that it isn't.
Even then, it may not be important that it isn't, since in real situations data are almost never actually drawn from a normal distribution. Our distributional assumptions are generally false, but they may well be reasonable approximate models and may lead us to tests with good properties if the distribution from which the data are drawn is reasonably close to normal (even in cases when we could tell that it isn't).
Common situations where we can immediately tell that we don't actually have normal distributions are (a) when the values are necessarily positive (truly normal distributions always include some probability of being less than 0, so if your values can't be negative they can't actually be normal) and (b) when the values are integers (the data are drawn from some discrete distribution, but the normal distribution is not discrete). [But as already mentioned, this may not really matter - our models are approximations.]

On constant variance:

Similar remarks apply here. The sample standard deviations won't be equal. It's the population from which the sample is drawn that is assumed to have constant variance (or standard deviation).
You can't really determine that the samples are drawn from distributions with the same variance. You can sometimes tell that they aren't.
Even then, it may not be important that it isn't the case, since in real situations the samples are almost never actually drawn from distributions with identical spread. If the population variances are nearly equal, it won't matter all that much, and when the sample size is the same in every group the test is pretty insensitive to this assumption. In any case, it's possible to do a test that doesn't assume constant variance (for example, via a Welch-Satterthwaite type procedure), and this may very often be a good choice if you're not confident that the assumption would be reasonable.
Note that if you did sample from distributions that have the same standard deviation, the sample standard deviations would differ, and with small samples like these they could differ quite a bit (you typically expect bigger variation in standard deviation than you have here even when the assumption is exactly true). These standard deviations are reasonably close together and given the equal sample sizes, I wouldn't bother worrying about the assumption any further; it won't present any substantial problems for the properties of the test.
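The point about sample standard deviations differing even when the assumption holds exactly can be checked with a small simulation. This is only a sketch under stated assumptions: the common population SD `sigma`, the seed and the number of simulated experiments are arbitrary illustrative choices, not values from the experiment.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

n_groups, n_per_group, n_sims = 5, 5, 10_000
sigma = 3.0  # hypothetical common population SD (illustrative value only)

# Every group is drawn from a normal population with the SAME standard deviation.
samples = rng.normal(loc=0.0, scale=sigma, size=(n_sims, n_groups, n_per_group))

# Sample SD of each group (ddof=1 gives the usual n-1 denominator).
group_sds = samples.std(axis=2, ddof=1)

# Ratio of the largest to the smallest sample SD within each simulated experiment.
sd_ratio = group_sds.max(axis=1) / group_sds.min(axis=1)

median_ratio = float(np.median(sd_ratio))
print(f"median largest/smallest sample SD: {median_ratio:.2f}")
print(f"90th percentile of that ratio:     {np.percentile(sd_ratio, 90):.2f}")
```

Comparing the ratio of the largest to the smallest sample SD in your own data against this simulated distribution gives a rough feel for whether the observed spread is unusual under equal population variances.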

Would it be reasonable to model the relationship between strength and percentage cotton (as a continuous variate) using a simple linear regression model?

It may sometimes be reasonable to do something like that but keep in mind that it's quite possible that the relationship is not linear. Indeed, often in an experiment like this the aim is to find approximately where the outcome (tensile strength in this case) is highest precisely because it's known/believed that it likely occurs "somewhere in the middle" of the range of possible input values (while a linear model would always place the highest strength at one of the ends).
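To illustrate why a straight line cannot locate an interior optimum, here is a minimal sketch. The `strength` values are made up purely for illustration (a response that rises and then falls); they are not the data from the experiment, and the quadratic is only one possible curved alternative.

```python
import numpy as np

# Made-up illustrative values, NOT the experiment's data: a response that
# rises and then falls across the five cotton levels.
cotton = np.array([15.0, 20.0, 25.0, 30.0, 35.0])
strength = np.array([8.0, 13.0, 16.0, 15.0, 9.0])

# Least-squares fits: straight line (degree 1) and quadratic (degree 2).
lin = np.polyfit(cotton, strength, deg=1)
quad = np.polyfit(cotton, strength, deg=2)

# Evaluate both fits on a fine grid over the experimental range and find
# where each fitted curve is highest.
grid = np.linspace(cotton.min(), cotton.max(), 201)
x_best_linear = float(grid[np.argmax(np.polyval(lin, grid))])
x_best_quadratic = float(grid[np.argmax(np.polyval(quad, grid))])

print("fitted line is highest at cotton % =", x_best_linear)
print("fitted quadratic is highest at cotton % =", x_best_quadratic)
```

With a nonzero slope, the fitted line is necessarily highest at one end of the range, whereas the quadratic fit can peak in the interior.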

In this experiment we can readily see that this is the situation (i.e. the relationship isn't anywhere close to linear, and the higher values are not at either end):

Ultimately whether to use something like anova or a linear regression or perhaps some form of nonlinear model depends on what you're trying to achieve (what it is you want to find out) -- the choice should be made before you collect the data you want to use for inference; you should avoid making assumptions that you don't feel confident would be reasonable before you collect the data (whether of normality, constant variance, linearity and so on). [Failing that, at least collect enough data to split the sample so you're not performing inference on the same data you used to pick a model.]

Typically with an industrial process like this you're not "in the dark" as to how the process operates and may well have a good sense of the typical mean and variation in the output as the main input or inputs change, so that you can start such an experiment with tenable assumptions.

Thank you Glen_b, I am feeling more confident after reading your answer. Many thanks — Alice Hobbs, Jan 06 '17 at 08:28
