Logistic regression and inclusion of independent and/or correlated variables

Question

I am using a dataset to examine the relationship between student effort and student success in MOOCs (Massive open online courses). The dataset is quite large, 641 138 rows. Each row represents an individual and their (aggregate) interactions with the MOOC (course) they were enrolled in. Not all rows can be used for analysis due to missing data values. The dataset contains data for 16 MOOCs.

With regard to effort, there are variables representing how many times the student interacted with the MOOC, how many days they interacted with the MOOC, how many chapters they interacted with, how many times they wrote on forums, and how many times they watched videos. As a measure of student success I want to use whether they earned a certificate or not (it is like passing or failing the course).

So I was thinking to run a logistic regression, where the dependent variable is certified (true/false) and I use one (or more?) independent variables that represent student effort. I also want to control for some demographic variables available in the dataset, by adding them as independent variables to the logistic regression: Age, gender, country (this is sometimes available in aggregated form in the dataset for anonymity reasons), and level of education.

Working on this problem a few things are still unclear to me:

1. How do I select the one (or combination of) independent variable(s) that should represent student effort? (I imagine some will be strongly correlated, so might not be good to use in combination.)
2. Is it okay to control for differences between courses (MOOCs) by adding course_id as an additional independent variable to the logistic regression? (This option seems to yield higher prediction accuracy and better model fit.)

In case it is not clear in the above description, below is an example of a logistic regression command (in R), where I have included number of chapters as the independent variable representing effort, demographic variables have been included as independent variables, and course_id has been included as an independent variable to control for course (MOOC) differences:

fit <- glm(certified~nchapters+final_cc_cname_DI+LoE_DI+YoB+gender+course_id, 
           na.action=na.omit, data=ds, family=binomial)

Edit: To clarify, I am examining the above problem because I am writing a paper about it for a course in Multivariate Quantitative Research Methods. The course's focus is on commonly used first generation multivariate analyses in psychological research.

Please add the `[self-study]` tag & read its [wiki](https://stats.stackexchange.com/tags/self-study/info). Then tell us what you understand thus far, what you've tried & where you're stuck. We'll provide hints to help you get unstuck. Will you actually analyze these data, or is this just a hypothetical scenario to probe your understanding of these issues? — gung - Reinstate Monica, Aug 16 '17 at 14:26
Yes, I will analyze these data (I added an edit about it now). I am writing a paper (topic, dataset and analysis method chosen by me) where I will examine the relationship between student effort and student success in MOOCs. I am not stuck, and these are not homework questions, so I don't think the self-study tag is appropriate (?). I wrote the questions here to ensure that I address the problem in the best way possible, hoping to learn something from people who are more knowledgeable and experienced in statistics than I am. — Jea, Aug 16 '17 at 14:53

score 4 · Answer 1 · answered Aug 16 '17 at 13:40

What you are asking are fundamental questions about regression analysis that are not specific to your use case. Hence, I recommend reading more on regression analysis or taking an online course such as Statistical Learning, taught by those who actually proposed some of the regularization methods I'm discussing.

How do I select the one (or combination of) independent variable(s) that should represent student effort? (I imagine some will be strongly correlated, so might not be good to use in combination.)

How correlated (not so independent) variables are selected depends on the regularization method used for the logistic regression. There are several, such as LASSO (L1), Ridge Regression (L2), and Elastic Net. Have a look at this question.

Each of these regularization methods treats correlated variables/features differently. L1 tends to favor one variable and ignore the others (zero weight), while L2 tends to spread the weight across the correlated variables, and Elastic Net is basically a mix of both. This is from the Elastic Net Wikipedia page:

if there is a group of highly correlated variables, then the LASSO tends to select one variable from a group and ignore the others. To overcome these limitations, the elastic net adds a quadratic part to the penalty, which when used alone is ridge regression (known also as Tikhonov regularization).
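
To make the L2 behaviour concrete, here is a minimal base-R sketch on simulated data. Everything in it is an illustrative assumption rather than part of the question: the variables `x1` and `x2`, the penalty `lambda = 10`, and the helper functions `pen_nll` and `pen_grad`. It fits a ridge-penalized logistic regression with `optim()` on two nearly collinear predictors and compares it with the unpenalized fit. It covers the ridge case only; in practice a dedicated package such as glmnet, which also handles the LASSO and elastic net, would be the more usual choice.

```r
set.seed(1)
n  <- 2000
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.05)          # near-duplicate of x1
y  <- rbinom(n, 1, plogis(-0.5 + x1))   # outcome truly depends on x1 only

# Design matrix: intercept plus standardized predictors
X <- cbind(1, scale(x1), scale(x2))

# Ridge-penalized negative log-likelihood (the intercept is not penalized)
pen_nll <- function(beta, lambda) {
  eta <- drop(X %*% beta)
  sum(log1p(exp(eta)) - y * eta) + lambda * sum(beta[-1]^2)
}

# Gradient of the penalized negative log-likelihood
pen_grad <- function(beta, lambda) {
  eta <- drop(X %*% beta)
  drop(crossprod(X, plogis(eta) - y)) + 2 * lambda * c(0, beta[-1])
}

ridge <- optim(c(0, 0, 0), pen_nll, pen_grad, lambda = 10, method = "BFGS")
ridge_coef <- setNames(ridge$par, c("(Intercept)", "x1", "x2"))

# Unpenalized maximum-likelihood fit on the same standardized predictors
mle_coef <- coef(glm(y ~ X[, 2] + X[, 3], family = binomial))

print(round(ridge_coef, 3))
print(round(mle_coef, 3))
```

Under the penalty the two slope estimates should come out close to each other, whereas the unpenalized estimates can be far apart and unstable, because the data barely distinguish `x1` from `x2`.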

Now, regarding your second question:

Is it okay to control for differences between courses (MOOCs) by adding course_id as an additional independent variable to the logistic regression? (This option seems to yield higher prediction accuracy and better model fit.)

This is totally fine as long as you know what you are doing, which is separating the contribution of variables within a course from their contribution across all courses. This is a very important topic in regression analysis and is very common in clinical research.

If you create a feature such as course_id, you would also need to fit a regression for each course separately to analyze the contribution of the variables within each course (within group).

The weights (or contributions of variables) that you get when training on all courses together with course_id simply tell you the contribution of the variables across all courses in general.
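
The distinction between the pooled model and the per-course models can be sketched as follows. This is a minimal illustration on simulated data, not the asker's dataset: the three courses, their intercepts and slopes, and the names `ds_sim`, `pooled`, and `per_course` are all hypothetical. Only `certified`, `nchapters`, and `course_id` are taken from the question.

```r
set.seed(42)
n_per   <- 500
courses <- c("A", "B", "C")

# Hypothetical data: each course has its own baseline pass rate and effort slope
ds_sim <- do.call(rbind, lapply(seq_along(courses), function(i) {
  nchapters <- rpois(n_per, lambda = 6)
  intercept <- c(-4, -3, -5)[i]
  slope     <- c(0.5, 0.3, 0.6)[i]
  data.frame(
    course_id = courses[i],
    nchapters = nchapters,
    certified = rbinom(n_per, 1, plogis(intercept + slope * nchapters))
  )
}))
ds_sim$course_id <- factor(ds_sim$course_id, levels = courses)

# Pooled model: one common nchapters slope, course-specific intercepts
pooled <- glm(certified ~ nchapters + course_id, data = ds_sim, family = binomial)

# Separate model per course: one nchapters slope for each course
per_course <- lapply(split(ds_sim, ds_sim$course_id), function(d) {
  glm(certified ~ nchapters, data = d, family = binomial)
})
slopes <- vapply(per_course, function(m) unname(coef(m)["nchapters"]), numeric(1))

print(round(coef(pooled)["nchapters"], 3))  # overall effect across courses
print(round(slopes, 3))                     # within-course effects
```

The pooled coefficient summarizes the effect of `nchapters` across all courses after shifting each course's baseline, while `slopes` shows how much that effect varies from course to course.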

gung - Reinstate Monica · Accepted Answer · 2017-08-17T00:23:11.657

@NULL is right that this is a general question that isn't specific to your use case. Let me supplement his answer a little.

What you really need to do in any given situation is think very hard about what you want to do and why. You have a situation where you want to build a model with response $Y$, but you believe that a set of $X$-variables will be highly correlated. That may not be a problem.

First, why do you think the variables are correlated?

Are they all measures of the same underlying construct?
Are they partly overlapping, but partly distinct?
Are some of them causally related to others?

Second, what are your long-run goals?

Do you just want to describe the data (their distributions and relationships)?
Are you exploring the data to generate hypotheses for future research?
Do you want to test a hypothesis? About what,
1. one of the correlated variables relative to another, or
2. about what they have in common, or
3. about something completely unrelated (these are controls and the variable of interest is uncorrelated with them)?
Do you want to make a predictive model?
1. Who is going to be using this to make predictions? What data will they have (e.g., will they have some of the variables, but not others)?
2. Is it likely that values of the different variables will diverge in future cases?

Your answers to the questions above will guide how you deal with the variables.

Here are some possible strategies:

If you want to describe the data, just do so. The collinearity is part of the results to be described.
Good exploratory data analysis to generate hypotheses is hard. Try a lot of different models (e.g., different combinations of variables) and think deeply about them. Which are plausible? What would it mean, substantively, if one or another were the true data generating process? Don't only consider the point estimates of your betas, but also compute the models that would result from the true values being towards the extremes of the confidence intervals.
If you think these $X$-variables are all just different manifestations of a single latent variable (which is somewhat implied by your description), you could:
1. Combine them (for example, you could run a factor analysis, or maybe a PCA).
2. Or possibly, use all of them. Extracting factors will inevitably leave out some information (and hopefully the measurement error); using all of them will guarantee every nuance is captured. To test them, drop all the correlated $X$-variables as a group and perform a nested model test.
3. Or possibly, just pick one $X$-variable at random if the correlations are close enough to $r = 1.0$. At that point there is little to be gained by bothering with other strategies.
If you think the $X$-variables are partly overlapping, you could perform a factor analysis and extract $>1$ factor, or use the $X$-variable you see as most directly measuring the main idea and residualize the rest so that the resulting set is orthogonal.
There are various ways to deal with causally related $X$-variables, and which to use will depend on the nature of the causal pattern you suspect and what you are trying to do. That said, the default approach might be to model the relations among the $X$-variables and $Y$ using something like structural equation modeling.
A test of an extracted central factor from your $X$-variables used as an explanatory variable in a multiple regression model will be a good test of the information shared in common amongst your $X$-variables.
If you want to test something about two (or more) correlated $X$-variables vis-a-vis each other, that is going to be very difficult to do, and you are just in a difficult situation.
On the other hand, if you want to test an exposure completely unrelated to the correlated $X$-variables, just go ahead. The multicollinearity won't have any effect on the test of interest.
If you are trying to create a predictive model, consider who will use the model to make predictions, in what situation, and what data they will most likely have access to. If they are likely to have $X_3$, but not $X_1, X_2, X_4,$ or $X_5$, use $X_3$.
Predicted means aren't terribly affected by collinearity, so if that's all you care about and the correlations are likely to be similar when the model is used to make predictions, you should be fine.
Conversely, if future data may occur in the regions of the $X$-variable space that aren't represented in your dataset, thar be dragons. Making a variety of different models and using model averaging may provide some limited safeguard.
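
A few of the strategies above can be sketched in base R. This is a minimal illustration on simulated data under stated assumptions: the latent `effort` variable, the variable names `nevents`, `ndays_act`, and `age`, and all coefficients are hypothetical (only `nchapters` and `certified` come from the question). It shows (1) combining the correlated variables into a first principal component, (2) keeping all of them and testing them as a group with a nested model test, and (3) residualizing one variable on another so the two are orthogonal.

```r
set.seed(7)
n      <- 3000
effort <- rnorm(n)                       # hypothetical latent effort
sim <- data.frame(
  nevents   = effort + rnorm(n, sd = 0.5),
  ndays_act = effort + rnorm(n, sd = 0.5),
  nchapters = effort + rnorm(n, sd = 0.5),
  age       = rnorm(n, mean = 30, sd = 8)
)
sim$certified <- rbinom(n, 1, plogis(-1 + 1.2 * effort))
effort_vars   <- c("nevents", "ndays_act", "nchapters")

# How strongly are the effort measures correlated?
print(round(cor(sim[effort_vars]), 2))

# (1) Combine them: first principal component as a single effort score
pca       <- prcomp(sim[effort_vars], center = TRUE, scale. = TRUE)
pc1_share <- pca$sdev[1]^2 / sum(pca$sdev^2)   # variance explained by PC1
sim$effort_pc1 <- pca$x[, 1]
# The sign of a principal component is arbitrary; orient it so higher = more effort
if (cor(sim$effort_pc1, sim$nchapters) < 0) sim$effort_pc1 <- -sim$effort_pc1
fit_pc <- glm(certified ~ effort_pc1 + age, data = sim, family = binomial)

# (2) Use all of them, then drop them as a group and run a nested model test
fit_full <- glm(certified ~ nevents + ndays_act + nchapters + age,
                data = sim, family = binomial)
fit_ctrl <- glm(certified ~ age, data = sim, family = binomial)
lrt      <- anova(fit_ctrl, fit_full, test = "Chisq")
p_group  <- lrt[["Pr(>Chi)"]][2]

# (3) Residualize: the part of ndays_act not linearly explained by nchapters
sim$ndays_resid <- resid(lm(ndays_act ~ nchapters, data = sim))

print(round(pc1_share, 3))
print(summary(fit_pc)$coefficients)
print(lrt)
```

Which of these is appropriate depends on the answers to the questions above; the code only shows the mechanics, not a recommendation for this particular dataset.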
