
Suppose I have a dataset with 5 columns; let's call them id, X_1, X_2, X_3, result.

  1. For some reason, result is not as good as it should be (a greater result is considered better, a smaller one worse). I need to show that the bad result was caused by a bad X_3. How can I show that X_3 has more influence on result than X_1 and X_2?

  2. What is the correct way to show that X_3 is not a linear combination of X_1 and X_2? Maybe I can fit X_3 with a linear regression on X_1 and X_2 and see how well the regression fits X_3?

  3. Is there a way to determine whether it's possible to fit X_3 with a nonlinear function of X_1 and X_2?

Eugeny89
  • How do you know how good result is supposed to be? Is there a target? Is there a loss metric? I would look at information criteria for model selection. AIC, ... can say "this model is better even if it has slightly worse error, because it also has significantly fewer parameters." – EngrStudent Sep 21 '20 at 13:50
  • @EngrStudent result is a number varying from `0` to `N`. `0` is bad while `N` is very good – Eugeny89 Sep 21 '20 at 13:55
  • I saw that. It depends on how many elements are 0. If only a few are then there is negligible loss of information. The important part, I think, might be between the mean and the upper tail. If log10 doesn't work so well, you might try showing square-root instead. – EngrStudent Sep 21 '20 at 14:06
  • On question 1: You seem to imply that `result` is a function of `X_1`, `X_2` and `X_3`, and you might want to expand on that assumption. Do you mean to show that `X_3` has more influence in a linear model, or in any possible model? There is a PCA tag but no mention of PCA in the question? – Bernhard Oct 01 '20 at 14:11
  • @Bernhard Yes, you're right. I'm considering `result` as a function of `X_1`, `X_2` and `X_3`. So yes, whether we take a linear or non-linear model, I want to show that `X_3` has more influence on result than `X_1` and `X_2`. The reason I've put the pca tag is that PCA seems to be close to what I'm searching for. It shows how important a variable is, so maybe there's some sort of modification of the PCA method for my problem. – Eugeny89 Oct 05 '20 at 08:18
  • I did just upvote @gazza89's answer. Fit a model that can handle most sorts of nonlinearity, like a random forest, and see if you get good predictive power. – Bernhard Oct 05 '20 at 12:38

2 Answers


The question of "which feature has the most impact on the target" (basically your Q1) is quite a common one and it comes in many flavours. I do however note that you specifically mentioned the bad result being "caused" by x_3. There's basically no way to conclusively prove from raw data alone that x_3 "causes" bad results.

To use a simple example, if the result column represented how long somebody lives (and thus low result = bad), and the x_3 column represented how much they spent in their life on running shoes, you would likely find a correlation. People who spend more on running shoes likely live longer. This is of course because spend on running shoes is likely correlated with things that tend to actually cause you to live longer, such as exercising more, or generally having more disposable income which is in turn correlated with better access to healthcare.

The problem with interpreting this as causation is that you would incorrectly conclude that "in order to live longer, spend more money on running shoes". People who have no intention of exercising would spend more on running shoes (thus breaking the correlations that existed in your training data) and they wouldn't start living longer. Obviously this is a silly example because it's all very intuitive, but most data is less intuitive than this.

The only way you can truly test for causation is to run an experiment in which you randomise the value of the variable whose causative properties you're trying to establish or investigate... which in many useful, real-life situations is very difficult or impossible to do. To de-jargonise this a little: in your case, you'd need x_3 to be a feature which you, as the experiment designer, are able to vary randomly without changing the values of any other features (and to be clear, this means both features we have access to and ones we don't). The value of x_3 must not be predictive of anything apart from the result variable.

Generally, all you'll be able to do with your data is establish the extent to which x_3 predicts your result. This should only be interpreted as "if I know the value of x_3, how much more accurately can I predict result than if I did not know it?", and not as "if I want to get a better result, is taking action to get a favourable x_3 a viable strategy?"
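As a minimal sketch of that predictive framing (assuming the data sits in a pandas DataFrame called `df`; the with/without-x_3 comparison and the random forest are illustrative choices, not prescribed by the answer):

```python
# Sketch: how much does knowing X_3 improve prediction of `result`?
# Assumes a pandas DataFrame `df` with columns X_1, X_2, X_3, result.
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

train, test = train_test_split(df, test_size=0.3, random_state=0)

def test_r2(features):
    model = RandomForestRegressor(random_state=0)
    model.fit(train[features], train["result"])
    return r2_score(test["result"], model.predict(test[features]))

r2_without = test_r2(["X_1", "X_2"])          # baseline: no X_3
r2_with = test_r2(["X_1", "X_2", "X_3"])      # add X_3
print(f"test R^2 without X_3: {r2_without:.3f}, with X_3: {r2_with:.3f}")
# A large gain when X_3 is added says X_3 is predictive of `result`,
# not that it causes it.
```

The gap between the two scores quantifies exactly the "how much more accurately can I predict" reading above.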

On Q2: yes, fit a linear regression. Then plot your residuals. If they appear normally distributed, or at least distributed according to some sensible distribution, you can say that x_3 is a linear combination of x_1 and x_2 up to a noise term. If that's not the case, you can conclude it's not a linear combination.
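A minimal sketch of that check (again assuming the hypothetical DataFrame `df`; statsmodels is just one convenient choice, not mandated by the answer):

```python
# Sketch: regress X_3 on X_1 and X_2, then inspect fit quality and residuals.
import statsmodels.api as sm

X = sm.add_constant(df[["X_1", "X_2"]])   # intercept plus the two predictors
fit = sm.OLS(df["X_3"], X).fit()

print(f"R^2 = {fit.rsquared:.3f}")        # near 1 suggests a near-linear combination
residuals = fit.resid                     # inspect these, e.g. with a histogram
# or a Q-Q plot, to judge whether they look like plain noise
```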

On Q3: While this isn't conclusive, you could simply fit your favourite regressor (e.g. a random forest, xgboost or a neural network, depending on data size) using x_1 and x_2 as features and x_3 as the target. If you manage to predict x_3 quite well (e.g. a good test $R^{2}$ value) and your residuals look sensibly distributed, then you can conclude that x_3 is well described by a non-linear combination of x_1 and x_2.
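A sketch of that idea with a random forest, evaluated on held-out data (everything about `df` remains an assumption, and any flexible regressor would do):

```python
# Sketch: can a flexible nonlinear model recover X_3 from X_1 and X_2?
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

X_train, X_test, y_train, y_test = train_test_split(
    df[["X_1", "X_2"]], df["X_3"], test_size=0.3, random_state=0
)
rf = RandomForestRegressor(random_state=0).fit(X_train, y_train)
print(f"test R^2 = {r2_score(y_test, rf.predict(X_test)):.3f}")
# A high held-out R^2 suggests X_3 is well described by some (possibly
# nonlinear) function of X_1 and X_2; a low value is inconclusive.
```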

gazza89

Taking your questions in order.

Q1: there is a substantial literature on relative importance. See this Q&A, "Methodology for calculating variable importance in dataset using regression", where I gave some references.
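To make "relative importance" concrete, here is a hedged sketch of one of the simplest measures from that literature, standardized regression coefficients (the DataFrame `df` is hypothetical, and that literature contains more refined decompositions, e.g. LMG):

```python
# Sketch: standardized coefficients as a crude relative-importance measure.
import statsmodels.api as sm

cols = ["X_1", "X_2", "X_3", "result"]
z = (df[cols] - df[cols].mean()) / df[cols].std()   # standardize all columns
fit = sm.OLS(z["result"], sm.add_constant(z[["X_1", "X_2", "X_3"]])).fit()
print(fit.params)   # larger |coefficient| => larger (linear) relative influence
```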

Q2: yes.

Q3: you would need to specify what sorts of non-linear relationship you would consider before embarking on this.

mdewey