I have a large data set of survey responses in which test takers, after taking a test, reported on the resources they used, the hours they spent studying each month, etc. I also have access to the test takers' grades, GPA, class rank, etc. I'd like to know which preparation factors are most predictive of exam score.
I have computed correlation coefficients for each of the predictors using the corrcoef function in MATLAB to get an idea of how predictive each factor is, but I'm concerned there may be some interacting effects. For example, I think some study habits or resources are more commonly used by students who had higher GPAs before taking the exam. I've done some initial analysis by chunking the data and computing correlation coefficients within ranges of GPA percentile, e.g., the 25th-50th GPA percentile.
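To make the chunking approach concrete, here is a minimal sketch of what I mean. It uses synthetic data, and the variable names (gpa, hours, score) and the choice of quartile bins are placeholders for illustration, not my actual survey fields or results.

```matlab
% Sketch only: synthetic data; gpa, hours, and score are hypothetical names.
rng(0);                                   % reproducible synthetic data
n     = 1000;
gpa   = 2 + 2*rand(n,1);                  % prior GPA
hours = 5 + 3*gpa + randn(n,1);           % study hours, correlated with GPA
score = 50 + 10*gpa + 0.5*hours + 5*randn(n,1);

% Overall correlation between study hours and exam score
R    = corrcoef(hours, score);
rAll = R(1,2);

% Correlation within each GPA quartile (the "chunking" step)
edges = quantile(gpa, [0 0.25 0.5 0.75 1]);
bin   = discretize(gpa, edges);           % quartile index, 1 through 4
rBin  = zeros(4,1);
for k = 1:4
    idx     = (bin == k);
    Rk      = corrcoef(hours(idx), score(idx));
    rBin(k) = Rk(1,2);
end

disp(rAll)
disp(rBin)
```

Comparing rAll against the entries of rBin is how I have been judging whether a factor still looks predictive once GPA is held roughly constant.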
I know that studies often refer to "correcting" for a particular predictor of a response variable. For example, a statistically significant correlation is observed between X and Y, but when correcting for Z, the correlation is no longer significant.
Is multiple linear regression the right way to "correct" for GPA (or another predictor)? Or is chunking the data into GPA ranges better?
I'm using MATLAB with the Statistics and Machine Learning Toolbox, and I've used fitlm and stepwiselm for linear relationships without any interactions. If anyone with MATLAB experience can weigh in on how to accomplish this particular task with fitlm, regress, or another function, that would be greatly appreciated.
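For concreteness, this is roughly the kind of fitlm call I have in mind for "correcting" for GPA. It is only a sketch on synthetic data with hypothetical variable names (gpa, hours, score), and I'm not sure it is the right approach, which is really my question.

```matlab
% Sketch only: synthetic data; gpa, hours, and score are hypothetical names.
rng(0);                                   % reproducible synthetic data
n     = 1000;
gpa   = 2 + 2*rand(n,1);                  % prior GPA
hours = 5 + 3*gpa + randn(n,1);           % study hours, correlated with GPA
score = 50 + 10*gpa + 0.5*hours + 5*randn(n,1);
tbl   = table(gpa, hours, score);

% Additive model: GPA enters as a covariate alongside study hours,
% so the hours coefficient is estimated with GPA held fixed.
mdlAdd = fitlm(tbl, 'score ~ hours + gpa');

% Interaction model: hours*gpa expands to both main effects
% plus their product term.
mdlInt = fitlm(tbl, 'score ~ hours*gpa');

disp(mdlAdd.Coefficients)
disp(mdlInt.Coefficients)
```

My understanding is that the coefficient and p-value for hours in the additive model would be the "corrected for GPA" quantities, and that the product term in the second model would indicate whether the effect of hours depends on GPA, but I'd appreciate confirmation.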
Thank you!