I am using a dataset to examine the relationship between student effort and student success in MOOCs (massive open online courses). The dataset is quite large: 641 138 rows. Each row represents an individual and their aggregate interactions with the MOOC they were enrolled in. Not all rows can be used for analysis because of missing values. The dataset covers 16 MOOCs.
With regard to effort, there are variables for how many times the student interacted with the MOOC, on how many days they interacted with it, how many chapters they interacted with, how many times they posted on the forums, and how many times they watched videos. As a measure of student success, I want to use whether or not they earned a certificate (similar to passing or failing the course).
I was thinking of running a logistic regression in which the dependent variable is certified (true/false) and one (or more?) independent variables represent student effort. I also want to control for some demographic variables available in the dataset by adding them as independent variables to the logistic regression: age, gender, country (sometimes available only in aggregated form for anonymity reasons), and level of education.
While working on this problem, a few things are still unclear to me:
- How do I select the one (or combination of) independent variable(s) that should represent student effort? (I imagine some will be strongly correlated, so it might not be good to use them in combination.)
- Is it okay to control for differences between courses (MOOCs) by adding course_id as an additional independent variable to the logistic regression? (This option seems to yield higher prediction accuracy and better model fit.)
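To make the first question concrete, here is a minimal sketch of the kind of check I have in mind, run on simulated data rather than the real dataset. Apart from nchapters, the column names (ndays, nvideos, nposts) are hypothetical stand-ins, and using Spearman correlations plus a principal component analysis is only one possible approach, not necessarily the right one:

```r
# Simulated stand-in for the real data; every column name except
# nchapters is a hypothetical placeholder.
set.seed(1)
n <- 1000
latent <- rnorm(n)  # unobserved "effort" that drives all of the counts
effort <- data.frame(
  nchapters = rpois(n, exp(1.0 + 0.5 * latent)),
  ndays     = rpois(n, exp(1.5 + 0.5 * latent)),
  nvideos   = rpois(n, exp(2.0 + 0.6 * latent)),
  nposts    = rpois(n, exp(-1.0 + 0.4 * latent))
)

# Rank correlations are less sensitive to the skew typical of count data.
cor_mat <- cor(effort, method = "spearman")
print(round(cor_mat, 2))

# PCA on log-transformed, standardized counts: if the first component
# carries most of the variance, a single composite score may be defensible.
pca <- prcomp(log1p(effort), scale. = TRUE)
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(var_explained, 2))
```

If the pairwise correlations are high, I would presumably either pick one effort variable or combine them into a composite, but I am unsure which is preferable.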
In case it is not clear in the above description, below is an example of a logistic regression command (in R), where I have included number of chapters as the independent variable representing effort, demographic variables have been included as independent variables, and course_id has been included as an independent variable to control for course (MOOC) differences:
fit <- glm(certified ~ nchapters + final_cc_cname_DI + LoE_DI + YoB + gender + course_id,
           na.action = na.omit, data = ds, family = binomial)
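For the course_id question, this is roughly how I compared the models with and without the course term. It is only a sketch on simulated data: ds_sim, the four course labels, and the coefficients used to generate the outcome are all made up for illustration. On the real data, both models would need to be fitted to the same rows (after dropping missing values) for the comparison to be valid:

```r
# Simulated data: 4 hypothetical courses with different baseline
# certification rates, plus one effort variable.
set.seed(2)
n <- 2000
ds_sim <- data.frame(
  nchapters = rpois(n, 5),
  course_id = factor(sample(paste0("course_", 1:4), n, replace = TRUE))
)
course_shift <- c(-1, -0.5, 0, 0.5)[as.integer(ds_sim$course_id)]
ds_sim$certified <- rbinom(n, 1, plogis(-3 + 0.4 * ds_sim$nchapters + course_shift))

# Nested models: without and with course_id as a categorical predictor.
fit_base   <- glm(certified ~ nchapters, data = ds_sim, family = binomial)
fit_course <- glm(certified ~ nchapters + course_id, data = ds_sim, family = binomial)

# Likelihood ratio test for the added course_id term, and AIC for both fits.
lrt <- anova(fit_base, fit_course, test = "Chisq")
print(lrt)
print(AIC(fit_base, fit_course))
```

With 16 courses this adds 15 dummy coefficients, which is what makes me wonder whether a fixed-effect factor is the appropriate way to handle course differences.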
Edit: To clarify, I am examining the above problem because I am writing a paper about it for a course in Multivariate Quantitative Research Methods. The course focuses on commonly used first-generation multivariate analyses in psychological research.