
My dataset contains many variables that, although recorded on a continuous scale, appear to me to be practically categorical to differing degrees.

Many have a large chunk of zeros (or some other specific value) followed by one or more apparently separate chunks. In some cases this is obvious: there are literally two specific values, effectively on/off. Others are less clear-cut candidates, where there appear to be two or more separate distributions.

I am trying to model a continuous, normally distributed dependent variable on a number of potential predictors (all collected on a continuous scale). Most of these are likely not to contribute to the model. I will be using various modelling methods to explore what works best (i.e. I will also be trying tree methods, where the apparent bimodal appearance isn't a problem). I am not assuming a good model can be produced.

In these situations, are there any hard-and-fast rules/techniques for deciding whether to categorise or not? Also, having performed such transformations, what considerations/measures would I need to be aware of? I would say most of the dataset is like this.

Scortchi - Reinstate Monica
Samuel
  • The answer's going to depend on what you want to do with them next. – Scortchi - Reinstate Monica Mar 13 '14 at 16:53
  • I am trying to model a continuous, normally distributed dependent variable on a number of potential predictors (collected on a continuous scale). Most of these are likely not to contribute to the model. I will be using various modelling methods to explore what works best (i.e. I will be trying tree methods, where the apparent bimodal appearance isn't a problem). I am not assuming a good model can be produced. – Samuel Mar 13 '14 at 17:03

1 Answer


What's relevant is knowledge of what each predictor represents & the implications for its relation to the response given the uses to which your model will be put; the distribution of the predictors may provide clues & impose limitations, but doesn't itself determine the functional form of the model. Discretizing continuous variables is rarely a good idea (see What is the benefit of breaking up a continuous predictor variable?).

In the case of a predictor that takes only two values, how you code it won't affect the model's fit (assuming you're applying no shrinkage based on the magnitude of coefficients). It's still worth thinking how you'll use the model to predict future observations: if 20% & 40% represent the rate in lower & higher tax bands, what'll you do when rates go up to 22% & 45% next year?
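
To make that first point concrete, here is a minimal Python/NumPy sketch (simulated data and hypothetical values; the answer itself doesn't prescribe any software) showing that an affine recoding of a two-valued predictor, such as replacing tax rates of 0.20/0.40 with a 0/1 dummy, leaves ordinary-least-squares fitted values unchanged:

```python
# Sketch only: recoding a two-valued predictor does not change the OLS fit.
import numpy as np

rng = np.random.default_rng(0)
rate = rng.choice([0.20, 0.40], size=100)        # original coding (e.g. tax rates)
dummy = (rate == 0.40).astype(float)             # 0/1 recoding of the same predictor
y = 3.0 + 5.0 * dummy + rng.normal(size=100)     # simulated response

def ols_fitted(x, y):
    """Fitted values from a simple linear regression of y on x."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return X @ beta

# Same fitted values, only the coefficients are rescaled.
print(np.allclose(ols_fitted(rate, y), ols_fitted(dummy, y)))   # True
```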

In the case of a predictor that takes only $k$ values, you have a choice of modelling it as nominal—with $k-1$ dummies—or as continuous—with up to $k-1$ coefficients in, say, a polynomial. You need to think what you'll do if a $(k+1)$th value arises in a future sample.
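
For illustration, a sketch (made-up data, not from the question) comparing the two codings for $k = 3$: with exactly three observed values, the $k-1$ dummies and a quadratic polynomial span the same column space, so they reproduce the same three group means in-sample and differ only in how they extrapolate to a new, $(k+1)$th value:

```python
# Sketch only: nominal (dummy) coding vs polynomial coding for a 3-valued predictor.
import numpy as np

rng = np.random.default_rng(1)
x = rng.choice([10.0, 20.0, 40.0], size=150)      # predictor with 3 observed values
y = 1.0 + 0.05 * x + rng.normal(size=150)         # simulated response

def fitted(X, y):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return X @ beta

# Nominal coding: intercept plus k - 1 = 2 dummies.
X_nominal = np.column_stack([np.ones_like(x), x == 20.0, x == 40.0]).astype(float)
# Continuous coding: intercept plus a polynomial with k - 1 = 2 coefficients.
X_poly = np.column_stack([np.ones_like(x), x, x ** 2])

print(np.allclose(fitted(X_nominal, y), fitted(X_poly, y)))   # True: identical in-sample fit
```

The difference only shows up when you predict at a value of $x$ you never observed, which is exactly the question the answer asks you to think about.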

In the case of a predictor with a continuous bimodal distribution, you shouldn't take that as an indication that it needs to be made discrete. If people in your sample tend to be either short or tall for some reason, that doesn't imply blood pressure has a discontinuity in its relation to height.

In the case of a predictor generally continuous but with specific over-represented values, you need to think how they arise, & whether that suggests something special about them in relation to the response. If you're modelling propensity to buy loft insulation, the value of someone's monthly gas bill could be a useful predictor; but a gas bill of zero more likely means they don't have a mains gas supply than that they don't heat their home. In situations like this it can be useful to include a dummy variable to flag the over-represented value, in addition to the continuous predictor. (See here for some details.)
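
As a hedged sketch of that last suggestion, with made-up data and a hypothetical `gas_bill` variable: keep the continuous predictor and add a dummy flagging the over-represented zero, so the spike gets its own coefficient separate from the continuous trend:

```python
# Sketch only: continuous predictor plus a dummy for an over-represented value.
import numpy as np

rng = np.random.default_rng(2)
n = 200
# About 30% of households have a zero bill (no mains gas); the rest are continuous.
gas_bill = np.where(rng.random(n) < 0.3, 0.0, rng.gamma(2.0, 50.0, n))
no_mains_gas = (gas_bill == 0).astype(float)      # dummy flagging the spike at zero

# Simulated response: zero bills behave differently from merely low bills.
y = 0.5 - 0.004 * gas_bill - 0.3 * no_mains_gas + rng.normal(scale=0.1, size=n)

# Design matrix: intercept, continuous bill, and the zero-value dummy.
X = np.column_stack([np.ones(n), gas_bill, no_mains_gas])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta)   # separate coefficients for the continuous trend and the zero flag
```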

Scortchi - Reinstate Monica
  • With the data in question, the potential for predictors to change (as in the tax example) is a definite possibility. And yes, the variables in question with two peak values will likely not form a step in the response. Also, where zeros/one value predominates, this is (potentially) often a default value. Sometimes this changes, but by a step. I have very recently discovered this is the case, as what I have is, in effect, inputs of nutritional ingredients dictated by market price. Thus, for a set time, a value of 40 may jump to 80 or not be used at all, i.e. 0. Does this mean I should not categorise? – Samuel Mar 18 '14 at 13:32
  • In summary, for what I have, it appears from your (and the linked) responses that I should not categorise, as I expect there is more likely a linear response between any two distribution peaks where I simply don't have data to fill the gap? I hope this is correct. – Samuel Mar 18 '14 at 13:42
  • From what you've said it certainly sounds like you've no reason to categorize anything. Of course interpolating or extrapolating into regions where you've no data is not free of risk; the alternatives are collecting more data or not using the model to make predictions. – Scortchi - Reinstate Monica Mar 18 '14 at 16:58
  • I agree. It is likely that these predictors are more on the side of a somewhat linear response than not, just from the real context of the data. But I cannot rule out a step in the values between what I have. It is likely that more data will come in, but as it is something beyond my control, it is likely that these gaps will not be filled. – Samuel Mar 20 '14 at 09:04