How do you "correct" for another predictor when investigating correlations?

Question

I have a large data set of survey responses in which test takers, after taking a test, reported on resources they used, hours spent studying each month, etc. I also have access to the test takers' grades, GPA, class rank, etc. I'd like to know which preparation factors are most predictive of exam score.

I have computed correlation coefficients for each of the predictors using the corrcoef function in MATLAB to get an idea of how predictive each factor is, but I'm concerned there may be some interacting effects. For example, I think some study habits or resources are more commonly used by students who had higher GPAs prior to taking the exam. I've done some initial analysis by chunking the data and getting correlation coefficients within ranges of GPA percentile, e.g., the 25th-50th GPA percentile.

I know in studies they often refer to "correcting" for a particular predictor for a response variable. For example, a statistically significant correlation is observed between X and Y, but when correcting for Z, the correlation is no longer significant.

Is multiple linear regression the right way to "correct" for GPA (or another predictor)? Or is chunking the data into GPA ranges better?

I'm using MATLAB with the Statistics & Machine Learning Toolbox, and I've used fitlm and stepwiselm for linear relationships without any interactions. If anyone with MATLAB experience can weigh in on how to accomplish this particular task with fitlm, regress, or another function, that would be greatly appreciated.

Thank you!

You can do this using a linear model; however, this won't give you the effect. You may be looking for partial correlation. - user2974951, Jul 17 '19 at 06:02
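
For reference, a minimal sketch of the partial correlation idea in MATLAB, assuming the Statistics & Machine Learning Toolbox is available. The data are simulated and the variable names (Hours, GPA, Score) are placeholders, not the asker's actual columns. In this made-up setup, GPA drives both study hours and score, and hours have no direct effect on score.

```matlab
% Simulated placeholder data (assumption: not the asker's real survey).
rng(1);                                     % reproducible random numbers
n     = 500;
GPA   = 3 + 0.5*randn(n,1);
Hours = 10 + 4*(GPA - 3) + 2*randn(n,1);    % higher-GPA students study more
Score = 60 + 10*(GPA - 3) + 5*randn(n,1);   % Hours has no direct effect here

% Ordinary (zero-order) correlation between Hours and Score
[r, p] = corr(Hours, Score);

% Partial correlation between Hours and Score, controlling for GPA
[rPartial, pPartial] = partialcorr(Hours, Score, GPA);

fprintf('corr              = %.3f (p = %.3g)\n', r, p);
fprintf('partialcorr | GPA = %.3f (p = %.3g)\n', rPartial, pPartial);
```

Because the association between Hours and Score in this simulation runs entirely through GPA, the partial correlation should come out much closer to zero than the raw correlation.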

score 2 · Accepted Answer · answered Jul 22 '19 at 08:45

Yes, a multiple regression probably would be appropriate here. You would need to specify the interactions in the model for this to work, so it's good that you have specific hypotheses/expectations about which variables interact.
I said 'probably' in the previous point because it depends on the structure of the dependencies you are interested in. It is possible that a structural equation model (SEM) would be more appropriate.
Another possibility is to use a machine learning approach in which you do not specify your interactions a priori. For example, random forests capture interactions quite well without them being specified. However, extracting understanding from a fitted random forest is more complicated than simply reading off coefficient values.
I would not 'chunk' the data and investigate correlations (or even partial correlations). It's more efficient to use all the data at one time in a single model.
Whichever approach you use, keep in mind that what you will be modelling will be biased by students' perceptions. Especially after having taken a test, how much they believe they have studied may be coloured by how well they think they did on the test.
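
To make the regression suggestion concrete, here is a rough sketch with fitlm, under the assumption that the data sit in a table with one column per variable. The data below are simulated and the names Hours, GPA and Score are placeholders for illustration only. The sketch compares the Hours coefficient without GPA in the model, with GPA as a covariate, and with an Hours-by-GPA interaction.

```matlab
% Simulated placeholder data (assumption: not the asker's real survey).
% GPA drives both study hours and exam score, so Hours looks strongly
% predictive on its own even though its direct effect is small.
rng(1);                                    % reproducible random numbers
n     = 500;
GPA   = 3 + 0.5*randn(n,1);
Hours = 10 + 4*(GPA - 3) + 2*randn(n,1);   % higher-GPA students study more
Score = 60 + 10*(GPA - 3) + 0.2*Hours + 5*randn(n,1);
tbl   = table(Hours, GPA, Score);

% Model 1: Hours alone (no correction for GPA)
mdl1 = fitlm(tbl, 'Score ~ Hours');

% Model 2: Hours adjusted for GPA (GPA enters as a covariate)
mdl2 = fitlm(tbl, 'Score ~ Hours + GPA');

% Model 3: additionally let the Hours effect vary with GPA;
% 'Hours*GPA' expands to Hours + GPA + Hours:GPA
mdl3 = fitlm(tbl, 'Score ~ Hours*GPA');

% Compare the Hours coefficient before and after adjusting for GPA
disp(mdl1.Coefficients('Hours', :));
disp(mdl2.Coefficients('Hours', :));
disp(mdl3.Coefficients);
```

In Model 2, the Hours coefficient is the estimated change in Score per extra hour at a fixed GPA, which is what "correcting for GPA" usually means. In this simulation the unadjusted slope from Model 1 should be noticeably larger, because it also absorbs the GPA effect.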
