How to generate heteroskedastic data for linear regression analysis given Y

Question

I have n measurements of an organ property (such as wall thickness), one per subject, at each of m different points on a surface representing the organ. These values are stored in a matrix Y with n rows and m columns. The measurements at different points are highly correlated: the correlation coefficient between any two columns of Y is always between 0.6 and 0.9. For each column $\tilde{Y}$ of Y I have fitted a linear regression model of the form

$ \tilde{Y} = \beta X $

where X is a vector which contains n values of a variable such as age or height - one for each subject. X is always the same for all the m regression models.

By doing this, I am trying to test, at every point of the surface, whether there is a significant association between the values at that point and the variable in X (age, for example). This approach (mass univariate analysis) lets me discover the regional effects of that clinical variable on the organ. However, as the number of points under study is greater than 100k, I have to apply a multiple testing correction, which I have created specifically for this problem, to the p-values associated with each correlation coefficient.
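
As a rough illustration of the mass univariate setup described above, here is a minimal NumPy sketch that fits all m regressions at once. The function name `mass_univariate_slopes`, the simulated sizes and the inclusion of an intercept are assumptions made for the example (the model as written in the question has no intercept term), and the multiple testing correction is not shown.

```python
import numpy as np

def mass_univariate_slopes(Y, x):
    """Fit y_j = a_j + b_j * x separately for every column j of Y.

    Returns the slopes b_j and their t-statistics (n - 2 degrees of freedom).
    Assumption: an intercept is included in every model.
    """
    n = Y.shape[0]
    xc = x - x.mean()                    # centred predictor
    Yc = Y - Y.mean(axis=0)              # each column centred on its own mean
    sxx = xc @ xc
    slopes = (xc @ Yc) / sxx             # one OLS slope per column, shape (m,)
    resid = Yc - np.outer(xc, slopes)    # residuals of all m fits
    sigma2 = (resid ** 2).sum(axis=0) / (n - 2)
    t_stats = slopes / np.sqrt(sigma2 / sxx)
    return slopes, t_stats

# Hypothetical data standing in for the real measurements.
rng = np.random.default_rng(0)
n, m = 200, 1000
x = rng.normal(50, 10, size=n)                    # e.g. age
Y = rng.normal(size=(n, m)) + 0.02 * x[:, None]   # weak common effect of x
slopes, t_stats = mass_univariate_slopes(Y, x)
```

Each t-statistic would then be converted to a p-value using a t distribution with n - 2 degrees of freedom before any correction is applied.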

Unfortunately, the data are heteroscedastic, so one of the linear regression assumptions is violated, and I would like to test how the failure of this assumption affects the results I have obtained. In particular, I would like to generate a vector X that has no relationship ($\beta=0$) with $\tilde{Y}$ and that makes the variance of Y unequal along the range of X (heteroskedasticity). In doing so, I would like to leave the values of Y untouched, as I believe the correlation between its columns plays an important role.

Do you have any idea/suggestion on how I can generate such data, please?

I have already succeeded in generating heteroskedastic data by drawing X from a normal distribution and adding to each column of Y an additional term drawn from a normal distribution with mean 0 and variance equal to X, but in this way I lose the correlation between the columns of Y.
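
The loss of correlation described here can be reproduced with a small simulation. This is only a sketch under stated assumptions: the real Y is replaced by synthetic columns sharing a common factor (pairwise correlation around 0.8), the added noise is drawn independently for each column, and the noise variance is taken as |X| rather than X, since a normally distributed X can be negative.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 200, 50                        # hypothetical sizes

# Stand-in for the real Y: one shared factor gives pairwise correlation ~0.8.
common = rng.normal(size=(n, 1))
Y = np.sqrt(0.8) * common + np.sqrt(0.2) * rng.normal(size=(n, m))

x = rng.normal(size=n)
noise_sd = np.sqrt(np.abs(x))         # assumption: variance |x|, not x

# Independent noise per column, with spread depending on x.
Y_het = Y + rng.normal(size=(n, m)) * noise_sd[:, None]

def mean_offdiag_corr(A):
    """Average correlation between distinct columns of A."""
    C = np.corrcoef(A, rowvar=False)
    off_diagonal = ~np.eye(C.shape[0], dtype=bool)
    return float(C[off_diagonal].mean())

corr_before = mean_offdiag_corr(Y)
corr_after = mean_offdiag_corr(Y_het)
print(corr_before, corr_after)
```

Because the added term is independent across columns, it inflates every column's variance without adding any shared covariance, which is why the average correlation drops.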

Please edit your question to add [these details](http://arfer.net/w/statqgl). — Kodiologist, Jul 01 '16 at 17:27
That's much better, but I still don't see what broader problem you're trying to solve. — Kodiologist, Jul 01 '16 at 17:50
Thank you for your help. I am trying to test at all the points of the surface where there is a significant association between the values at that point (i.e. wall thickness) and the variable in X - age for example. By using this approach (mass univariate analysis) I could discover the regional effects of that variable on the organ. I hope it helps. Thank you. — Helm, Jul 01 '16 at 18:06
Great. Another thing: in your $Y$ matrix, do all $mn$ points represent *the same* organ property, or are some measurements of thickness, some measurements of curvature, etc.? Likewise, is all the data about a single organ, which you haven't named for some reason, or is some about one organ and some about another? If it's not all about one property of one organ, please describe how the properties and organs are organized (pun not intended) in the data. — Kodiologist, Jul 01 '16 at 18:08
Yes, each of the nm values in the matrix Y represents the same organ property, but at a different point on its surface and for a different subject. Each column represents a point on the organ, each row represents one patient. (n,m) = (patient n, point m). This approach can be used to study different organs such as the brain, the heart, the liver etc. For example, I am currently studying the heart, so Y is composed of 100k+ columns that store the wall thickness values of 200 patients. Thank you. I hope it helps. – Helm, Jul 01 '16 at 18:22

score 4 · Answer 1 · answered Jul 01 '16 at 18:39

There's a big gain to be gotten by reorganizing your data. Right now, you're treating the wall thickness of each point in the heart as a completely separate dependent variable (DV), when clearly there are meaningful relationships among all these things. Instead, move the position information to $X$, so now you have only one DV. (How to encode the position in $X$ is potentially a deep topic, for which see a textbook on spatial data analysis, but whatever coordinate information you use to identify the positions in the first place is a decent place to start.) Note that you now need to add a subject identifier to $X$ since you now have $m$ rows for each subject instead of just one. Now, you can build one big regression model instead of $m$ small ones. You can look at, e.g., the overall effect of age on wall thickness by using a main effect of age, or the effect of age on wall thickness at a particular point on the heart by using an interaction term. You will probably want to use a mixed model where each person gets a random intercept, and perhaps also each position gets a random intercept.
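
The reorganization described above amounts to turning the wide n × m matrix into a long table with one row per (subject, position) pair. Here is a minimal sketch; the variable names, the tiny sizes and the use of a plain position index (instead of real spatial coordinates) are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 4, 3                                # tiny hypothetical sizes
age = rng.integers(30, 80, size=n).astype(float)
Y = rng.normal(10, 1, size=(n, m))         # wide: one row per subject

subject = np.repeat(np.arange(n), m)       # 0,0,0,1,1,1,...
position = np.tile(np.arange(m), n)        # 0,1,2,0,1,2,...
thickness = Y.reshape(-1)                  # row-major order matches the two index vectors
age_long = age[subject]                    # each subject's age repeated m times

# One row per measurement: subject id, position id, age, thickness.
long_table = np.column_stack([subject, position, age_long, thickness])
```

A table in this shape is what a mixed-model routine would typically take as input (for instance `mixedlm` in statsmodels, grouping by subject); fitting that model is not shown here.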

Heteroscedasticity may or may not show up again, and there are no doubt many ways to respond to it, but you can keep it from optimistically biasing your conclusions by predictively validating your model instead of relying on significance testing.
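
One possible reading of "predictive validation" is k-fold cross-validation: compare the out-of-sample error of a model that uses age with that of a baseline that ignores it. The sketch below does this for a single simulated response with variance that grows with age; the function `cv_mse`, the simulated data and the choice of a simple linear fit are assumptions for illustration, not the full model described above.

```python
import numpy as np

def cv_mse(x, y, k=5, use_x=True, seed=0):
    """Mean squared prediction error from k-fold cross-validation."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    errors = []
    for test in np.array_split(idx, k):
        train = np.setdiff1d(idx, test)
        if use_x:
            slope, intercept = np.polyfit(x[train], y[train], 1)
            pred = intercept + slope * x[test]
        else:
            # Baseline: predict the training mean, ignoring x.
            pred = np.full(len(test), y[train].mean())
        errors.append(np.mean((y[test] - pred) ** 2))
    return float(np.mean(errors))

# Hypothetical heteroscedastic data: noise spread grows with age.
rng = np.random.default_rng(4)
n = 200
age = rng.uniform(20, 80, size=n)
thickness = 10 - 0.05 * age + rng.normal(size=n) * 0.02 * age

mse_with_age = cv_mse(age, thickness, use_x=True)
mse_baseline = cv_mse(age, thickness, use_x=False)
print(mse_with_age, mse_baseline)
```

The comparison relies only on held-out prediction error, so it does not depend on the constant-variance assumption behind the usual p-values.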

answered Jul 01 '16 at 18:39

Kodiologist


Thank you for your answer. Yes, I am treating the points as separate variables, as this is the major strength of the method. My question is about how I can generate X randomly, without any relationship with Y, given Y, in such a way that the data become heteroscedastic. I can generate heteroskedastic data by drawing X from a normal distribution and then adding to Y an additional term drawn from a normal distribution with mean 0 and variance equal to X, but in this way I lose the correlation between the columns of Y. – Helm Jul 01 '16 at 18:54
That's not a strength. It's throwing away massive amounts of information (viz. the dependency between points), and creating unnecessary multiple-comparison problems. – Kodiologist Jul 01 '16 at 18:56
I see your point, but it is not at all easy to do what you suggest. Maybe I will try it in the future, but at present I cannot change the entire approach (which has also been used in more than 4000 scientific articles, so people definitely cleverer than me didn't find anything better), and I would only like to find an answer to the question I asked, please. – Helm Jul 01 '16 at 19:04
"Not easy at all"? It seems straightforward to me. But in any case, you are doing medical research and have a responsibility to do the right thing even when it's inconvenient. I'm saddened but not surprised to hear that an entire literature has committed itself to a statistical mistake. My own field, psychology, has even deeper problems like that. – Kodiologist Jul 01 '16 at 19:13
Sorry, I believe it seems straightforward to you because the main topic here is expressed in an oversimplified way. The approach I am using is not the subject of the question. This is also why I expressed the question in general terms without going into detail. The question is about how to generate heteroskedastic data for linear regression analysis given Y, as specified in the main topic. Thank you. – Helm Jul 01 '16 at 19:24
"the main topic here is expressed in a oversimplified way" — Then express it correctly. That's why I asked for all those details. You probably have an [XY problem](http://xyproblem.info/). – Kodiologist Jul 01 '16 at 19:28
Kodiologist, thank you for your help, but I believe we are not understanding each other. I don't think I have an XY problem, as my problem is how to generate heteroskedastic data for linear regression analysis given Y, as specified in the main topic, not how to perform the regression analysis in a different way. I am not going to change the structure of Y. I hope that is clear. If you don't know how to solve it, that is not a problem. Thank you. – Helm Jul 01 '16 at 19:45
I work in biostatistical consulting, & I'm sorry to say I am not the least surprised to find that there is some field w/ 4k published papers in all of which the analyses are invalid. (But then again, I am a grumpy, jaded, cynical, misanthropic, etc., old man.) – gung - Reinstate Monica Jul 02 '16 at 21:49
But @gung, we love you here, just the way you are! – Matthew Drury Jul 04 '16 at 02:52

score 4 · Answer 2 · edited Apr 13 '17 at 12:44

+1 to @Kodiologist, that's clearly the answer as best we can tell from what has been written. Moreover, this is by definition an XY problem, since how to generate heteroscedastic data is what you want to do to "test how the failing of this assumptions affects the results that I have obtained".

That said, I can somewhat address the topic you have specified. Namely, you cannot generate $X$ data to create a null relationship in which the existing $Y$ data will be heteroscedastic. You will not be able to do this because heteroscedasticity is in the $Y$ data, and not in the $X$ data. That's why you have been able to generate new $Y$ data with heteroscedasticity, but have been having trouble replicating the heteroscedasticity by generating pseudo-random $X$ data. To understand this more fully, it may help you to read my answer here: What does having “constant variance” in a linear regression model mean?
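
A small simulation can illustrate the point about fixed $Y$, at least for $X$ values drawn with no reference to $Y$. In the sketch below, one column of $Y$ is held fixed while $X$ is regenerated many times; both the fitted slope and a crude heteroscedasticity proxy (the correlation between $|X|$ and the squared residuals) average out near zero. The simulated column, the choice of proxy and the number of repetitions are assumptions for illustration. This is not a proof, and any single draw can show an apparent pattern by chance.

```python
import numpy as np

rng = np.random.default_rng(3)
n, reps = 200, 500
y = rng.normal(size=n)               # stand-in for one fixed column of Y; never modified

slopes, spread_corr = [], []
for _ in range(reps):
    x = rng.normal(size=n)           # X drawn with no reference to y
    slope, intercept = np.polyfit(x, y, 1)
    resid = y - (intercept + slope * x)
    slopes.append(slope)
    # Crude proxy: do squared residuals tend to grow with |x|?
    spread_corr.append(np.corrcoef(np.abs(x), resid ** 2)[0, 1])

mean_slope = float(np.mean(slopes))
mean_spread_corr = float(np.mean(spread_corr))
print(mean_slope, mean_spread_corr)
```

Individual draws scatter around zero on both measures, which is consistent with the residual spread being a property of how $Y$ was generated rather than something a freshly generated $X$ can impose.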

edited Apr 13 '17 at 12:44

Community

answered Jul 02 '16 at 21:47

gung - Reinstate Monica


Hi gung, thank you for your help. I came to the same conclusion as you: "you cannot generate X data to create a null relationship in which the existing Y data will be heteroscedastic" as "heteroscedasticity is in the Y data, and not in the X data". This is what I was asking for. I still don't see why I should have added tons of other information, but I believe it's my fault, so sorry about that. Many thanks. – Helm Jul 03 '16 at 10:08
No worries, @Helm. We are, even at our most-grumpy-seeming, actually trying to help you. The extra information is so that we can provide what you really need; this stuff about heteroscedasticity is only what you think you need to solve a different problem, so that you can solve a third problem. It would be faster, and better for everyone, to solve your actual problem from the start. – gung - Reinstate Monica Jul 03 '16 at 12:54
