What is the appropriate regression type and approach when dealing with multiple continuous and categorical variables?

Question

I am trying to figure out the best approach to deal with cross-correlated variables in statistical analysis.

I am helping to analyse the results of a randomised controlled trial of an educational intervention for a group of poor pre-school children in three different localities. In each location, children were randomly assigned to either an experimental group or a control-intervention group.

We see a strong association of both the children's age and the time they spent on the activities (participation was voluntary) with the test results. However, the most active children ended up concentrated in one of the locations. We have qualitative explanations for why that is the case, but we would like to see whether we can tease out the effect of time spent on the activities.

Working in Python, I have tried different approaches: hierarchical regression, where I am unsure in what order to enter the individual variables, and multiple regression, where I am uncertain how valid the approach is for binary categorical variables (this also applies to the hierarchical version). What is the appropriate analysis for this problem, with two continuous variables (age, time at activities) and one categorical variable (three locations)?

What are your variables? What is the design of the study? What is the response? — gung - Reinstate Monica, Oct 08 '16 at 18:48
Variables: Time (minutes in educational activities), Age (years and months), Locale (municipality where the study took place). In each locality, students were in one of two different educational interventions (experimental and control intervention), but since this was an extracurricular activity they could spend different amounts of time in each (hence the time variable). At this stage, we are interested in finding out to what extent any of these variables affects their results on the cognitive tests that were administered to them at the end of the interventions. — Matt, Oct 08 '16 at 19:13
There is nothing wrong with using a categorical variable (treatment group) with two levels or location (with 3 levels) in regression. Is that what you were wondering? — mdewey, Oct 09 '16 at 12:48
Thanks! Yes, that was the first thing I was worried about. The second is the right way to enter individual predictors in stepwise regression when there is a [multicollinearity problem](http://stats.stackexchange.com/questions/14500/how-can-a-regression-be-significant-yet-all-predictors-be-non-significant), which is what I am facing. I want to make sure that at each step I enter the predictors in an unbiased way (i.e. which goes first, second, etc.) and follow best practices here. — Matt, Oct 09 '16 at 13:11
You should not use stepwise regression. All the output is incorrect. — Peter Flom, Aug 25 '18 at 11:34

score 1 · Answer 1 · answered Aug 25 '18 at 11:39

First, the "type of regression" (e.g. OLS, logistic ...) depends mostly on the nature of the dependent variable. When that variable is continuous, the usual starting point is ordinary least squares, but there are alternatives. Going by your comment, your dependent variable appears to be the score on the cognitive tests, which could be treated as continuous or not, depending on how you measured it.
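As a rough illustration of that starting point, here is a minimal sketch of an OLS fit with two continuous predictors and two categorical ones, using the formula interface of `statsmodels`. The data are simulated, and the column names (`score`, `age`, `minutes`, `locale`, `group`) are assumptions made for the example, not the actual dataset. Wrapping a column in `C(...)` asks the formula parser to dummy-code it, which is all that is needed to use a two- or three-level categorical variable in a regression.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated stand-in data; all column names and effect sizes are hypothetical.
rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "age": rng.uniform(3.0, 6.0, n),                        # years
    "minutes": rng.uniform(0.0, 600.0, n),                  # time in activities
    "locale": np.tile(["A", "B", "C"], n // 3),             # three localities
    "group": np.tile(["control", "experimental"], n // 2),  # trial arm
})
df["score"] = 50 + 4 * df["age"] + 0.02 * df["minutes"] + rng.normal(0, 5, n)

# C(...) dummy-codes a categorical predictor against a reference level.
model = smf.ols("score ~ age + minutes + C(locale) + C(group)", data=df).fit()
print(model.params)
```

Each categorical variable contributes one coefficient per non-reference level, so `locale` adds two terms and `group` adds one.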

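Second, variable selection for models has been covered many times on this site. Stepwise selection is not a good method: the resulting p-values, standard errors, and coefficient estimates are biased because they ignore the selection process.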

Third, one assumption of most forms of regression is that the errors are independent. Since you have multiple locations with multiple people at each, this assumption is likely to be violated. One solution is to use multilevel models.
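For illustration only, a random-intercept multilevel model along these lines could be sketched with `statsmodels` as below. The data are simulated and the column names (`score`, `age`, `minutes`, `locale`) are assumptions for the example. The model lets each location have its own baseline score while estimating common slopes for age and time.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated stand-in data; names, shifts, and effect sizes are hypothetical.
rng = np.random.default_rng(1)
n_per = 100
df = pd.DataFrame({
    "locale": np.repeat(["A", "B", "C"], n_per),
    "age": rng.uniform(3.0, 6.0, 3 * n_per),
    "minutes": rng.uniform(0.0, 600.0, 3 * n_per),
})
locale_shift = df["locale"].map({"A": -3.0, "B": 0.0, "C": 3.0})
noise = rng.normal(0, 5, len(df))
df["score"] = 50 + 4 * df["age"] + 0.02 * df["minutes"] + locale_shift + noise

# Random intercept per locale; age and minutes enter as fixed effects.
mixed = smf.mixedlm("score ~ age + minutes", data=df, groups=df["locale"]).fit()
print(mixed.summary())
```

With only three locations, the between-location variance is estimated from very little information, so the fit may warn about convergence or a boundary estimate. Entering location as a fixed categorical predictor, as in `C(locale)`, is a common alternative in that situation.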
