
I am currently analysing data where the outcome variable is 'U' shaped. The outcome variable asks 'how many of the last seven days have you smoked'. Most responses to this fall in the first (none) and last (all seven) categories. Because of this I do not think a count data model is appropriate.

What would be a good approach to modelling this variable?

Jeromy Anglim
Becky
  • What are you trying to get at by using the number of days as the dependent variable? It seems (and your data seems to verify this) that people are either smokers or they are not, so it could be appropriately viewed as a binary measurement. If you simply must use the number of days as the outcome, then you could use an ordinal regression model (e.g. the proportional odds model), but I'm not sure what added understanding that would give since your response distribution is basically binary. – Macro Feb 21 '13 at 13:36
  • Just to clarify, you say "the outcome variable asks", do you mean "the outcome variable indicates"? I.e. the "outcome variable" is actually the variable predicted by the regression? – Wayne Feb 21 '13 at 16:13
  • @Macro: It's "basically binary", but perhaps they are concerned with the middle outcomes -- which are tail events as it were. For example, maybe they are looking at smokers who are trying to quit and possible relapse triggers? (And perhaps once you relapse, it's highly likely that you'll stay in a relapsed state for a while.) Or perhaps they're looking at non-addicted smokers (who do exist), to see if events on certain days tend to trigger smoking. – Wayne Feb 21 '13 at 16:18
  • Thank you for your help. Sorry, I realise my question does not make complete sense. The variable I am predicting is how many days a week a respondent says that they have smoked. From the data I can see that responses are most frequent at zero and seven; however, I think it would be inefficient to make this variable binary. – Becky Feb 21 '13 at 16:53
  • Answers here will be applicable to this case as well, [How to model this odd-shaped distribution (almost a reverse-J)](http://stats.stackexchange.com/q/49443/1036). – Andy W Feb 21 '13 at 18:05

2 Answers


You might want to take a look at two-part (aka hurdle) count data models. A good place to start is Chapter 17 of Cameron and Trivedi's Microeconometrics Using Stata. In fact, your smoking example is the one they use to motivate this. Essentially, you have one model for whether a person smokes at all, and then a second model for how much they smoke, given that they do.

Another good source for overdispersed hurdle count data is Farbmacher's (2011) Stata Journal paper. Overdispersion happens when the (conditional) variance of your outcome exceeds the (conditional) mean, which is often the case with data like this.
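As a rough illustration of the two-part idea and of the overdispersion check, here is a minimal Python sketch. The `days` values are made up to look U-shaped and are not the asker's data, and there are no covariates, so this only shows the marginal decomposition (overall mean = share who smoke at all × mean days among those who do), not a fitted hurdle model.

```python
from statistics import mean, pvariance

# Hypothetical U-shaped responses (days smoked out of the last 7); not real data.
days = [0] * 50 + [1] * 3 + [2] * 2 + [3] * 2 + [4] * 1 + [5] * 2 + [6] * 4 + [7] * 36

# Part 1 (participation): share of respondents who smoked on at least one day.
p_any = sum(d > 0 for d in days) / len(days)

# Part 2 (amount): mean number of days among those who smoked at all.
positive = [d for d in days if d > 0]
mean_if_any = mean(positive)

# The unconditional mean is the product of the two parts.
overall_mean = mean(days)
assert abs(overall_mean - p_any * mean_if_any) < 1e-9

# Overdispersion check: a Poisson model implies variance equal to the mean,
# so a ratio well above 1 signals overdispersion.
dispersion_ratio = pvariance(days) / overall_mean

print(f"P(any) = {p_any:.2f}, mean | any = {mean_if_any:.2f}, "
      f"variance/mean = {dispersion_ratio:.2f}")
```

In a real analysis each part would be a regression (for example a logit for participation and a count or proportion model for the amount), but the product structure is the same.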

dimitriy
  • I would be concerned with the censoring at the top of the distribution as well. – Andy W Feb 21 '13 at 18:11
  • @AndyW You might have to elaborate on this. I am not sure I understand what it would mean for someone to smoke for 8 or more days per week. In some contexts, like when stadium demand exceeds seating capacity, this makes sense, but not here. – dimitriy Feb 21 '13 at 18:17
  • On second thought after a bit more coffee, I see what you're getting at. The outcome is bounded above by 7, which violates the nonnegative integer count assumption, which is indeed a problem. I wonder if this can be recast as a two-part proportion model if you rescale the outcome to be fraction of the week. Maybe one can do this using frm from SSC. – dimitriy Feb 21 '13 at 18:36
  • It doesn't really matter how you logically interpret it, there is an upper limit on the outcome that will likely not be approximated well with any count model. It is always possible if you have a really excellent exogenous predictor of when someone will smoke all of the last 7 days there won't be a problem with predictions, but I wouldn't assume that in the (vast) majority of situations. – Andy W Feb 21 '13 at 18:36
  • I'm not following `I wonder if this can be recast as a two-part proportion model if you rescale the outcome to be fraction of the week. Maybe one can do this using frm from SSC.`. Off the cuff it might be reasonable to approach this as an ordinal regression problem as well. – Andy W Feb 21 '13 at 19:40
  • @AndyW Your criticism is a fair one. Let me see if I can elaborate. The general two-part model consists of a binary participation decision (the extensive margin) and then a how-much decision (the intensive margin). The how-much part can be OLS, a count, or a proportion model, depending on the nature of the outcome. There are several appropriate models for proportional outcomes outlined in Stata Tip #63. This would involve some hacking since the expected value is the product of the two. On the other hand, the frm command from SSC seems to be an out-of-the-box solution. – dimitriy Feb 21 '13 at 21:55
  • I'm unfamiliar with [fractional regression models](http://ideas.repec.org/c/boc/bocode/s457542.html), so I can't make much comment on that. How exactly is this better/more appropriate than ordinal regression though? Related to the logit transform and modelling proportions, see Smithson & Verkuilen, 2006, [A Better Lemon Squeezer](http://psychology3.anu.edu.au/people/smithson/details/betareg/Smithson_Verkuilen06.pdf), which suggests beta regression for Likert scales with large floor/ceiling effects. – Andy W Feb 22 '13 at 13:16
  • FYI the next article after Farbmacher in that same Stata journal introduces a right censored poisson regression model, with an example outcome very similar to this ([Raciborski, 2009](http://www.stata-journal.com/article.html?article=st0219)). – Andy W Feb 22 '13 at 13:18
  • let us [continue this discussion in chat](http://chat.stackexchange.com/rooms/7631/discussion-between-andy-w-and-dimitriy-v-masterov) – Andy W Feb 22 '13 at 13:18
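The concern raised in these comments about the upper bound can be illustrated with a small Python sketch. It assumes a plain Poisson model with a hypothetical mean of 3 days per week (`lam` is an illustrative value, not an estimate from any data) and shows two things: the model places some probability on more than 7 days, and its mass peaks near the mean rather than at 0 and 7.

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """Probability that a Poisson(lam) variable equals k."""
    return exp(-lam) * lam**k / factorial(k)

lam = 3.0  # hypothetical mean number of days smoked per week

# Probability the Poisson model assigns to the feasible range 0..7 ...
mass_within_week = sum(poisson_pmf(k, lam) for k in range(8))
# ... and to the impossible outcomes of 8 or more days.
mass_above_seven = 1.0 - mass_within_week

print(f"P(Y > 7) under Poisson({lam}) = {mass_above_seven:.4f}")
for k in range(8):
    print(k, round(poisson_pmf(k, lam), 4))
```

A right-censored or bounded model (or an ordinal one) avoids the first problem; the second is why a single Poisson component is unlikely to reproduce a U shape.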

Think about the construct of interest

I'd think about the construct you are trying to measure. As Macro mentioned, it may be that your variable is largely reflecting the fact that people are either smokers or not smokers. If they are smokers, they will tend to smoke every day of the week, and if they are not smokers, they won't.

There might also be a third category of casual or occasional smokers. That said, your single item measure might not be the best way of discriminating between these three categories. So, if you are interested in the distinction between regular and casual smokers, then I'd look at incorporating some other indicators of casual smoking.

If you are interested in frequency or intensity of smoking, then your item is poor at measuring that. You would be better off asking about average frequency of smoking per day or some similar question.

General recommendations

Thus, I'd consider thinking more deeply about what you want to measure. But if you're stuck with the data you have, you might want to do one of a few different things:

  • Recode the variable to none versus one or more and predict using binary logistic regression.
  • Recode the variable to none, one to six, and seven and predict using multinomial logistic regression.
  • Do no recoding and predict the variable using something like an ordered probit or ordered logistic regression.
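
The three options above differ only in how the outcome is coded before modelling. A minimal Python sketch of the recoding step follows; the `days` values are hypothetical, and the labels `"none"`, `"some"` and `"daily"` are names chosen here for illustration.

```python
from collections import Counter

# Hypothetical responses (days smoked out of the last 7); not real data.
days = [0, 0, 0, 7, 7, 3, 0, 7, 1, 6, 0, 7]

def to_binary(d):
    """Option 1: none (0) versus one or more days (1)."""
    return int(d > 0)

def to_three_level(d):
    """Option 2: none, one to six days, or all seven days."""
    if d == 0:
        return "none"
    return "daily" if d == 7 else "some"

binary = [to_binary(d) for d in days]
three_level = [to_three_level(d) for d in days]
# Option 3: no recoding; the observed values serve as ordered categories.
ordinal_levels = sorted(set(days))

print(Counter(binary))
print(Counter(three_level))
print(ordinal_levels)
```

Tabulating the recoded variables like this is also a quick way to see how few respondents fall in the middle category, which affects how much the multinomial or ordinal models can add over the binary one.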
Jeromy Anglim