
I'd like to model an interaction term between a continuous variable and a categorical variable, while accounting for possible aliasing among the predictors. What is the best way to do this?

As an example, suppose I have a data set containing damages incurred, car type (sedan, truck, etc.), car model, and car age.

| Damages incurred ($) | Type       | Model        | Age |
|----------------------|------------|--------------|-----|
| 1000                 | Sedan      | Hyundai G30S | 10  |
| 300                  | Truck      | Ford F150    | 3   |
| 500                  | Motorcycle | Yamaha F90   | 2   |

I'd like to include Age as a predictor, but I have reason to suspect that the effect of car age depends on the car type: for instance, age affects losses very differently for sedans than for trucks. So I'd like to include an interaction term, Type:Age, to account for this.

I also want to include Model. However, once I know the car model I also know the car type, so I cannot include Type in the modeling equation due to aliasing.

I also don't want to use Model:Age in the modeling equation, because I have reason to believe that the car model doesn't add much information beyond the car type; i.e. car type and age combined have roughly the same effect as car model and age. Moreover, including Model:Age would consume many degrees of freedom, since there are so many car models.

So is there a way to include Age:Type, Model, and Age in the GLM without causing significant issues in the model output? If not, what would be the best way around it?
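For concreteness, here is a minimal sketch of what such a model could look like in R, using simulated data in the same shape as the example above; the model names, sample size, and the Gamma/log-link family are illustrative assumptions, not part of the original question:

```r
set.seed(1)

# Simulated data in the same shape as the example above; the model names,
# sample size, and the Gamma/log-link family are illustrative assumptions.
model_type <- c(G30S = "Sedan",      Elantra = "Sedan",
                F150 = "Truck",      Ranger  = "Truck",
                F90  = "Motorcycle", R3      = "Motorcycle")
n   <- 500
dat <- data.frame(
  Model = factor(sample(names(model_type), n, replace = TRUE)),
  Age   = sample(1:15, n, replace = TRUE)
)
dat$Type    <- factor(model_type[as.character(dat$Model)])
dat$Damages <- rgamma(n, shape = 2, rate = 2 / exp(5 + 0.05 * dat$Age))

# Age main effect, Model main effect, and an Age-by-Type interaction;
# Type's own main effect is omitted because Model already determines Type.
fit <- glm(Damages ~ Age + Model + Age:Type,
           family = Gamma(link = "log"),
           data = dat)
summary(fit)
```

Depending on how R codes the `Age:Type` term, the interaction shows up either as one slope per type or as slope offsets from a baseline, but either way it captures the type-specific effect of age while `Model` supplies the level differences.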

platypus17
  • Yes, you can definitely use `Age:Type`, `Model`, and `Age` as your variables in the GLM. Just think of it as a model including `Type` as well, but with the coefficients of `Type` constrained to 0. – Tim Mak May 07 '20 at 07:09
  • Ah, I see, thank you! I'm curious how it might affect the estimated coefficients. Do the other coefficients change their interpretation? If not, why would that be the case? – platypus17 May 07 '20 at 16:06
  • Also, how does this not contradict the answers found here: https://stats.stackexchange.com/questions/11009/including-the-interaction-but-not-the-main-effects-in-a-model – platypus17 May 07 '20 at 16:08
  • You can actually parameterize it such that `Type` is included in the model without collinearity, and you would obtain the same result. For example, suppose your `Model` takes values {1,2,3,4,5,6}, where `Type`=1 when `Model` <= 3, and `Type`=2 otherwise. You can then recode `Model` as `Model2`, with values {1,2,3,1,2,3}, such that 1,2,3 indicate "subtypes" within `Type`. – Tim Mak May 08 '20 at 01:46
  • So it doesn't contradict the answer to https://stats.stackexchange.com/questions/11009/including-the-interaction-but-not-the-main-effects-in-a-model in that you have in fact included the main effects by including `Model`. – Tim Mak May 08 '20 at 01:47
  • Actually, to correct my above comments: if you use `Model2`, then you need to include `Model2` and the interaction term `Type*Model2` to make it equivalent to the original model (see the sketch after these comments). I assume all of these are `factor` variables as well (in R). – Tim Mak May 08 '20 at 01:53
  • To clarify, does this mean I'll have `Model2`, `Type`, `Type:Model2`, and `Age` in the final model? I'm wondering why the GLM doesn't treat the two 2's in `Model2` as the same, even though they are technically in the same column. – platypus17 May 08 '20 at 02:58
  • Let us [continue this discussion in chat](https://chat.stackexchange.com/rooms/107748/discussion-between-platypus17-and-tim-mak). – platypus17 May 08 '20 at 03:01
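
Following up on the comments, here is a hedged sketch of Tim Mak's `Model2` recoding, continuing the simulated `dat` from the sketch in the question; it illustrates that the reparameterization with `Type`, `Model2`, and `Type:Model2` reproduces the fit of the original formula:

```r
# Recode Model into within-Type "subtype" labels, as in Tim Mak's comments.
# Continues the simulated `dat` from the sketch in the question above.
dat$Model2 <- factor(ave(as.integer(dat$Model), dat$Type,
                         FUN = function(m) match(m, sort(unique(m)))))

# Original parameterization: Model main effect, no separate Type main effect.
fit1 <- glm(Damages ~ Age + Model + Age:Type,
            family = Gamma(link = "log"), data = dat)

# Reparameterization: Type enters explicitly and Model is replaced by
# Type*Model2 (i.e. Type + Model2 + Type:Model2).
fit2 <- glm(Damages ~ Age + Type * Model2 + Age:Type,
            family = Gamma(link = "log"), data = dat)

# Both design matrices span the same column space, so the fits agree
# even though the coefficients are labelled differently.
all.equal(deviance(fit1), deviance(fit2), tolerance = 1e-6)
all.equal(fitted(fit1),   fitted(fit2),   tolerance = 1e-6)
```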

0 Answers