
I have a dataset with many variables. Some of these variables are linked to each other in various ways, but I don't know in advance which ones.

For example, these are some relationships:

  • $A=B$ -- obvious and easy to spot/remove;

  • $A=B\cdot\mathrm{constant}$ -- again easy to remove using a correlation matrix;

  • The problem comes with $A = B+C+D+E$ or even $A = B+(0.5\cdot C)+(0.5\cdot D+E).$

Relationships of the last kind are difficult to identify, and PCA doesn't give me enough clues to work out what $A$ is made up of.
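To make the difficulty concrete, here is a minimal sketch using simulated data. The distributions (independent standard normals), the sample size, the scaling constant and all variable names are assumptions chosen for illustration, not taken from the dataset in question. It contrasts the easy case, a constant multiple, with a sum of four components.

```python
import numpy as np

# Assumption: B, C, D, E are independent standard normal variables.
rng = np.random.default_rng(0)
n = 10_000
b, c, d, e = rng.standard_normal((4, n))

# Easy case: A is a constant multiple of B (the constant 3.0 is arbitrary).
a_scaled = 3.0 * b
corr_scaled = np.corrcoef(a_scaled, b)[0, 1]

# Hard case: A is the sum of four components.
a_sum = b + c + d + e
corr_sum = [np.corrcoef(a_sum, x)[0, 1] for x in (b, c, d, e)]

print(f"corr(3*B, B) = {corr_scaled:.3f}")
for name, r in zip("BCDE", corr_sum):
    print(f"corr(B+C+D+E, {name}) = {r:.3f}")
```

With independent components of equal variance, each pairwise correlation with the sum is about $1/\sqrt{4} = 0.5$, so no single entry of the correlation matrix flags $A$ as redundant, even though it is fully determined by the other four variables.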

For example, a simple $A = B + C$ will not always show a link. If half of the $B$ values are 0.1, the other half are randomly distributed between 0.5 and 1.5, and $C$ is normally distributed with mean 1, no correlation is detected between $B$ and $A$. However, a correlation is often detected between $A$ and $C$, because the distribution of $C$ is still present in $A$.
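The scenario above can be simulated with the sketch below. The standard deviation of $C$ is not stated, so it is left as a parameter (`c_sd`, a hypothetical name); the uniform distribution for the upper half of $B$, the sample size and the function name `simulate_correlations` are likewise assumptions for illustration.

```python
import numpy as np

def simulate_correlations(c_sd, n=100_000, seed=0):
    """Return (corr(A, B), corr(A, C)) for A = B + C under the assumed setup."""
    rng = np.random.default_rng(seed)
    half = n // 2
    # Half of the B values are exactly 0.1; the rest are uniform on [0.5, 1.5).
    b = np.concatenate([np.full(half, 0.1), rng.uniform(0.5, 1.5, n - half)])
    rng.shuffle(b)
    # C is normal with mean 1; its spread c_sd is an assumption.
    c = rng.normal(loc=1.0, scale=c_sd, size=n)
    a = b + c
    corr_ab = np.corrcoef(a, b)[0, 1]
    corr_ac = np.corrcoef(a, c)[0, 1]
    return corr_ab, corr_ac

for c_sd in (0.5, 1.0, 5.0):
    corr_ab, corr_ac = simulate_correlations(c_sd)
    print(f"sd(C) = {c_sd}: corr(A, B) = {corr_ab:.3f}, corr(A, C) = {corr_ac:.3f}")
```

For independent $B$ and $C$, $\operatorname{corr}(A, B) = \sigma_B / \sqrt{\sigma_B^2 + \sigma_C^2}$, so the link between $A$ and $B$ fades as the spread of $C$ grows relative to that of $B$, while $\operatorname{corr}(A, C)$ approaches 1.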

Without knowing in advance that these relationships are present, how do I identify them?

In essence, I am trying to remove these relationships before running PCA and later modelling with the remaining uncorrelated variables.

amoeba
Samuel
