Correlation vs. independence

Question

I would like to have a scholarly discussion, if that is allowed on this site. If it is not allowed, please direct me to the relevant site, or better, please migrate this question.

My question is embarrassingly fundamental. I would like to know, in general, does correlation help or hurt? In what way is a data analysis/statistical problem harder when you have correlation? What are some ways in which correlation can help?

Well, to start off, we have the central limit theorem (or laws of large numbers) for independent random variables. Would you say that having correlated random variables makes a limit law harder? Yes, if we look at it from a mathematician's point of view: the analysis will be harder. But, doesn't it actually help us because now the variables cannot behave erratically? Shouldn't it be easier then to expect a limit and even compute it by looking perhaps at suitable chunks?

Proper correlation helps in separating signal from noise. If the eigenvalues of the population dispersion matrix are all very close to each other, then the dispersion matrix is close to a multiple of the identity, which implies that the coordinates are nearly uncorrelated. In such a case, no direction stands out from any other. But with correlation, we know the eigenvalues have to differ in magnitude, and hence if the largest eigenvalue is large enough, the principal axis explains a sizeable part of the variation in the data. This is the well-known basis for PCA.
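A minimal sketch of the eigenvalue argument, assuming an equicorrelated dispersion matrix (unit variances, common correlation `rho`); the function name `top_eigen_share` and the values of `rho` and `p` are illustrative choices, not part of the question:

```python
import numpy as np

def top_eigen_share(rho, p=5):
    """Share of total variance carried by the largest eigenvalue
    of a p x p equicorrelation matrix (hypothetical example)."""
    cov = np.full((p, p), rho)
    np.fill_diagonal(cov, 1.0)
    eigvals = np.linalg.eigvalsh(cov)  # eigenvalues in ascending order
    return eigvals[-1] / eigvals.sum()

share_indep = top_eigen_share(0.0)  # identity matrix: all eigenvalues equal
share_corr = top_eigen_share(0.8)   # strong correlation: one dominant axis
print(share_indep, share_corr)
```

With no correlation each of the five axes carries the same share of the variance, while with `rho = 0.8` the leading eigenvalue is `1 + (p - 1) * rho` and its axis carries most of it.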

Practitioners complain that a lot of theoretical results deal with independent observations. But, it seems to me that that theory tackles the harder case, a case about which nothing really can be predicted until someone comes along with a result. Correlation should make our lives easier. What are your thoughts on this?

This is a vague question. You write "In what way is a data analysis/statistical problem harder when you have correlation? What are some ways in which correlation can help?". Do you have any specific analysis/statistical problems you're looking at? — Demetri Pananos, Mar 23 '21 at 22:41
Not at all. I want to hear about the experiences people have had in general. I have cited two, but I am sure other people have a lot to chime in with. — Landon Carter, Mar 23 '21 at 22:42
There are multiple limit theorems that deal with dependent data. In general, correlation is just a hurdle to answering a scientific question, one that requires specific methods. What's more, a sample composed of dependent observations will contain less information than one of independent observations. However, correlated data can answer specific questions, like how a measure *changes* over time within an individual, or what the variability is between individuals within a cluster (family, household, school, etc.). So really nothing specific can be said. — AdamO, Mar 23 '21 at 23:13

score 1 · Answer 1 · answered Mar 23 '21 at 22:54

An important piece of information would be correlation between what? Indeed, in OLS the slope parameter in simple regression is a rescaling of the correlation between the $y$ and $x$ (see here for a discussion).
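As a rough numerical check of the rescaling claim, here is a sketch on simulated data; the true slope of 2, the sample size, and the seed are arbitrary assumptions made for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2.0 * x + rng.normal(size=200)  # hypothetical data-generating process

# OLS slope from a degree-1 least-squares fit
slope = np.polyfit(x, y, 1)[0]

# Sample correlation rescaled by the ratio of standard deviations
r = np.corrcoef(x, y)[0, 1]
rescaled = r * y.std(ddof=1) / x.std(ddof=1)

print(slope, rescaled)
```

The two numbers agree up to floating-point error, because both reduce to the sample covariance of $x$ and $y$ divided by the sample variance of $x$.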

Correlation (insofar as $y$ and $x$ are correlated) is thus a good thing! In fact, it's the main thing we care about, or at least it plays a major part in the thing we care about.

However, correlation between covariates in a multiple regression can prove an obstacle. We know already that if $x_1$ and $x_2$ are perfectly correlated then this results in a degenerate solution to the optimization problem. However, if $x_1$ and $x_2$ are highly correlated yet not perfectly correlated, the optimum still exists but we have high uncertainty as to the effect of either variable. I discuss correlation between covariates here, here, and here (in this one, I demonstrate why the uncertainty inflates). So correlation between covariates in this case is not bad per se, but disadvantages us when we would otherwise be better off without it.
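One way to see the inflation numerically is the sketch below. It is not the demonstration from the linked posts; it assumes two standard-normal covariates with correlation `rho`, a known noise standard deviation `sigma`, and the usual OLS result that the coefficient covariance is `sigma**2 * inv(X.T @ X)`. The helper name `slope_std_error` is made up for this example.

```python
import numpy as np

def slope_std_error(rho, n=500, sigma=1.0, seed=0):
    """Standard error of the x1 coefficient in an OLS fit of y on
    (1, x1, x2), where corr(x1, x2) = rho (hypothetical setup)."""
    rng = np.random.default_rng(seed)
    cov = np.array([[1.0, rho], [rho, 1.0]])
    covariates = rng.multivariate_normal([0.0, 0.0], cov, size=n)
    X = np.column_stack([np.ones(n), covariates])  # add intercept column
    xtx_inv = np.linalg.inv(X.T @ X)
    return sigma * np.sqrt(xtx_inv[1, 1])

se_low = slope_std_error(0.0)    # uncorrelated covariates
se_high = slope_std_error(0.99)  # highly, but not perfectly, correlated
print(se_low, se_high)
```

In the population version of this setup the variance of the coefficient scales like $1/(1-\rho^2)$, so the standard error grows without bound as the correlation approaches 1, which is the degenerate case.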

There is much, much more to say here. Within-subject correlation is a good thing to estimate if you can, since it may explain some observed heterogeneity of outcomes (see mixed models). Independence is a key assumption of most GLMs, since it lets the joint density factor into a product of density evaluations, which the log turns into a sum of log-density evaluations, and so on. Without a much more targeted question, I'm not sure what else to provide you with.
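To spell out the point about independence, using generic notation that is assumed here ($f$ for the density of a single observation, $\theta$ for the parameters, $y_1, \dots, y_n$ for the data):

$$
L(\theta) = \prod_{i=1}^{n} f(y_i; \theta)
\quad\Longrightarrow\quad
\ell(\theta) = \log L(\theta) = \sum_{i=1}^{n} \log f(y_i; \theta).
$$

The first equality holds only when the observations are independent; with dependence the joint density no longer factors this way, and the simple sum on the right is lost.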

Your last paragraph is very interesting. Don't worry that you have to write something specific, I am happy for your two cents. Let's talk about your third paragraph. While choosing predictors, you can have difficulty when predictors are correlated. This raises questions. What is the meaning of choosing predictors when you do not even know the correct model? Would you be happy to get an R^2 of 0.95 on a linear model while the real model is quadratic? Why should we have to choose among predictors? — Landon Carter, Mar 23 '21 at 22:59
@LandonCarter I think these are all good questions, but they are fairly broad and I have lots -- perhaps too much -- to say about them to attempt to provide answers in the comments. — Demetri Pananos, Mar 23 '21 at 23:05
