0

I am looking to model physical activity (in minutes) as my dependent variable. I have several independent variables of the environment around the school (intersections, traffic, etc).

What type of model would make sense? I was thinking multiple linear regression but some of the variables do not really have a linear relationship.

jonsca
user10720
    Could you tell us something about the evidence you have of nonlinear relationships? In many cases a little bit of nonlinearity won't matter. – whuber Aug 22 '12 at 19:08

2 Answers

6

Linear regression can accommodate non-straight-line relationships between IVs and the DV through various transformations of variables, addition of polynomial terms and so on.

That is a model like

$y = b_0 + b_1x_1^2 + b_2x_1 + b_3x_3^5$

is a linear model. But a model such as

$y = b_0 + 2^{b_1x_1}$

is not.
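To make the distinction concrete, here is a minimal sketch (not part of the original answer) showing that a model with a squared term can be fit by ordinary least squares, because it is linear in the coefficients. The data are synthetic, the model is a simplified version of the one above (`y = b0 + b1*x1^2 + b2*x1`), and the helper names `solve_linear_system` and `fit_ols` are made up for illustration; in practice a statistics package would do this step.

```python
# Sketch: polynomial terms keep the model linear in its coefficients,
# so plain least squares (normal equations) can estimate them.
# All data below are synthetic and purely illustrative.

def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b], copied so the inputs are left untouched.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def fit_ols(design, y):
    """Least-squares coefficients from the normal equations X'X b = X'y."""
    p = len(design[0])
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(p)]
           for i in range(p)]
    xty = [sum(row[i] * yi for row, yi in zip(design, y)) for i in range(p)]
    return solve_linear_system(xtx, xty)

# Hypothetical predictor values; response generated as y = 1 + 2*x^2 + 3*x.
x1 = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y = [1 + 2 * x**2 + 3 * x for x in x1]

# Design matrix columns: intercept, x1 squared, x1.
design = [[1.0, x**2, x] for x in x1]
coefs = fit_ols(design, y)  # estimates of b0, b1, b2
```

The curve in `x1` is handled entirely by the extra column in the design matrix; the fitting step is the same as for a straight line.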

If the data are really nonlinear, then the choice of model depends partly on what you know about the relationships. If you don't know much, a spline regression may work well.

Peter Flom
  • I have tried taking logs of the IVs but there is still no linear relationship. Is there another way to transform the variables? Findings to date on the relationships between these variables are quite mixed. I will look into a spline regression. Thanks – user10720 Aug 22 '12 at 18:29
  • Have you looked into Box-Cox? It's been discussed here a lot, and there are also lots of resources on the web. – Peter Flom Aug 22 '12 at 18:36
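
For reference, the Box-Cox family mentioned in the comment above is usually written as follows for a strictly positive variable $x$ (the symbol $\lambda$ is the transformation parameter, typically chosen from the data; this definition is added here and is not from the original thread):

$x^{(\lambda)} = \begin{cases} \dfrac{x^{\lambda} - 1}{\lambda} & \lambda \neq 0 \\ \log x & \lambda = 0 \end{cases}$

So $\lambda = 1$ leaves the variable essentially unchanged, $\lambda = 0.5$ behaves like a square root, and $\lambda = 0$ is the log transformation already tried.
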
2

I don't know if this suggestion might be too advanced, but if you want to model duration (i.e., time until cessation), the appropriate approach is survival analysis. Most likely, the Cox proportional hazards model is best.

With regard to non-linear relationships, @PeterFlom is giving you good advice that transformations (such as squared terms) and splines can help.
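
As a reminder of what the Cox model assumes, its standard form (added here for reference, using the same $b$ notation as the other answer, with $h_0(t)$ denoting an unspecified baseline hazard) is:

$h(t \mid x) = h_0(t)\,\exp(b_1x_1 + b_2x_2 + \dots + b_px_p)$

The predictors enter through a linear combination inside the exponential, so squared terms or splines in the $x$'s can be added in the same way as in ordinary regression.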

gung - Reinstate Monica
  • Is physical activity recorded over time as well as measured in duration of activity? If so then a time series model might be more appropriate. – Michael R. Chernick Aug 22 '12 at 19:01
  • That's a good point, @MichaelChernick, I'm not sure. I'm interpreting "I am looking to model physical activity (in minutes)" as meaning that how many minutes physical activity lasted is the DV, but I could be wrong. – gung - Reinstate Monica Aug 22 '12 at 19:03
  • No, I interpret it that way too. But what is not said is whether duration of physical activity is recorded continuously over time. I think your assumption when you suggested survival analysis was that it is recorded once for each of a number of students. – Michael R. Chernick Aug 22 '12 at 19:09
  • Yes I am trying to model minutes of PA as the DV. It is measured as the duration of PA over a 1 hour sample (0 minutes to 60 minutes). The sample is quite large and evenly distributed. It is the IV's where they are skewed and have a non-linear relationship. What is my best bet? Should I transform the IV's for multiple linear regression? Or use another form of regression? – user10720 Aug 22 '12 at 19:09
  • Each student has a number of minutes when they are active (0-60). No intensity is being included – user10720 Aug 22 '12 at 19:16
  • I would not divide by 60. Moreover, if PA>60 in some cases, you have censored data, which is the biggest reason for needing SA instead of OLS reg. Note that the dist of your IVs is irrelevant (for more on that, see my answer [here](http://stats.stackexchange.com/questions/12262/what-if-residuals-are-normally-distributed-but-y-is-not/33320#33320)). As for non-linear relationships, Peter Flom is giving you good advice. – gung - Reinstate Monica Aug 22 '12 at 19:16
  • No values are over 60 as it was only a 1 hour sample I am looking at. For the IV's I should run BoxCox? – user10720 Aug 22 '12 at 19:20
  • So there was no one who was still active when the hour ended? E.g., let's say a child has been resting under a shade tree & then starts playing 10 minutes before the 1 hour sample period ends. When the period ends, that child is still playing. Thus, this child's duration of PA is >10 min, but we don't know by how much; i.e., that duration is *censored*. – gung - Reinstate Monica Aug 22 '12 at 19:23
  • Yes, the child may still be active after the 60 minute sample, but I only want to look at the 1 hour increment. What they do before or after is irrelevant. What type of model would you suggest? Multiple linear or survival? – user10720 Aug 22 '12 at 19:26
  • You may only be interested in that hour, but you would be introducing an artifact that will distort your results. You should use SA. – gung - Reinstate Monica Aug 22 '12 at 19:29
  • Ok thanks, so PA will be my time variable, what is my Status variable? Do I need to adjust the IVs to become linear? – user10720 Aug 22 '12 at 19:31
  • Your status variable is whether they were still active when the hour ended (be sure you know whether your status encodes *event* or *censored*; software packages differ on that). I don't know what it means for an IV to be linear; I think of linear as pertaining to a relationship between the IV & the DV. If that relationship isn't linear, then you could use squared terms (etc.) as Peter Flom suggested. – gung - Reinstate Monica Aug 22 '12 at 19:37
  • Ok so I square the IV's to create a linear relationship with the DV? – user10720 Aug 22 '12 at 19:42
  • That would be perfectly fine. – gung - Reinstate Monica Aug 22 '12 at 19:47
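
The status variable discussed in the comments above could be built along these lines. This is only a sketch: the records, the field layout, and the function name `encode_event` are hypothetical, and it assumes the convention that 1 marks an observed event (activity stopped within the hour) and 0 marks a censored observation (still active when the hour ended). Some software uses the opposite coding, so check the documentation before fitting.

```python
# Hypothetical records: (minutes_active, still_active_when_hour_ended).
records = [(12, False), (60, True), (35, False), (10, True)]

def encode_event(records):
    """Return (duration, event) pairs.

    event = 1: activity ended within the hour (duration fully observed)
    event = 0: child was still active at the end (duration is censored)
    """
    return [(minutes, 0 if still_active else 1)
            for minutes, still_active in records]

durations_events = encode_event(records)

# Count how many durations are censored under this coding.
n_censored = sum(1 for _, event in durations_events if event == 0)
```

The `(duration, event)` pairs are the two columns a survival routine typically asks for as the time and status variables.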