Identifying linear relationships among many interrelated variables

Question

I have a dataset with many variables. Some of these variables are linked to each other in various ways, but I don't know in advance which ones.

For example, these are some relationships:

$A=B$ -- obvious and easy to spot/remove;
$A=B\cdot\mathrm{constant}$ -- again easy to remove using a correlation matrix;
The problem comes with $A = B+C+D+E$ or even $A = B+(0.5\cdot C)+(0.5\cdot D+E).$
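To make the difference between these cases concrete, here is a minimal sketch on synthetic data. The independent standard-normal columns, the constant 3.0, and the variable names are all assumptions chosen for illustration, not taken from my real dataset. It computes the pairwise correlation for the scaled case and for the sum case.

```python
import numpy as np

# Hypothetical synthetic data: four independent standard-normal variables.
rng = np.random.default_rng(0)
n = 5000
B, C, D, E = rng.standard_normal((4, n))

# Case 1: A is B times a constant (3.0 is an arbitrary illustrative value).
A_scaled = 3.0 * B
# Case 2: A is the sum of four variables.
A_sum = B + C + D + E

# Pairwise Pearson correlations, as a correlation matrix would report them.
r_scaled = np.corrcoef(A_scaled, B)[0, 1]
r_sum = [np.corrcoef(A_sum, x)[0, 1] for x in (B, C, D, E)]

print("corr(3*B, B):", r_scaled)
print("corr(B+C+D+E, each term):", r_sum)
```

Under these assumptions the scaled case gives a correlation of essentially 1, while in the sum case each individual term correlates only partially with A, so no single entry of the correlation matrix reveals the exact relationship.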

Relationships of this last kind are difficult to identify, and PCA doesn't give me enough clues to work out what A is made up of.

For example, a simple $A = B + C$ will not always show a link. If half of the B values are 0.1, with the other half distributed randomly between 0.5 and 1.5, and C is normally distributed with mean 1, no correlation is detected between B and A. However, a correlation is often detected between A and C, because the distribution of C is still present in A.
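The example above can be simulated with a short sketch. The sample size, the uniform distribution for the upper half of B, and the standard deviation of 1 for C are assumptions, since I only specified the mean of C. The exact correlations depend on that choice: the larger the spread of C relative to the spread of B, the weaker the correlation between A and B.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
half = n // 2

# B: half the values are exactly 0.1, the rest are spread over 0.5-1.5
# (a uniform distribution is assumed here).
B = np.concatenate([np.full(half, 0.1), rng.uniform(0.5, 1.5, n - half)])
rng.shuffle(B)

# C: normal with mean 1; the standard deviation of 1.0 is an assumption.
C = rng.normal(loc=1.0, scale=1.0, size=n)

A = B + C

corr_ab = np.corrcoef(A, B)[0, 1]
corr_ac = np.corrcoef(A, C)[0, 1]
print("corr(A, B):", corr_ab)
print("corr(A, C):", corr_ac)
```

With these assumed settings, the correlation between A and C comes out noticeably stronger than the one between A and B, even though both B and C enter A with the same weight.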

Without knowing in advance that these relationships are present, how do I identify them?

In essence, I am trying to remove these relationships before running PCA and then modelling with the remaining uncorrelated variables.
