
I am using one-hot encoding to transform my categorical variable, but it's not just a presence-absence situation. Think of the variable as a device that comes in different brands, each with its own model numbers. For example, it can be Sony 10, Sony 10.5, LG 2000, or LG 3200. The brands differ, and the model numbers have their own range within each brand.

What I did was something like this:

I convert:

---------------------------
|   Index   | Device      
---------------------------
|   0       | Sony,10 
|   1       | Sony,10.5
|   2       | LG,2000
|   3       | LG,3200    

to:

---------------------------
|   Index   | Dev_Sony | Dev_LG      
---------------------------
|   0       | 10       | 0
|   1       | 10.5     | 0
|   2       | 0        | 2000
|   3       | 0        | 3200 
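The conversion above can be sketched in plain Python as follows. This is only an illustration of the encoding shown in the two tables; the helper name `encode_brand_columns` and the `"Brand,Model"` string format are assumptions taken from the example rows, not part of any library.

```python
# Hypothetical helper: one column per brand that holds the model number,
# and 0 where the row belongs to a different brand.
rows = ["Sony,10", "Sony,10.5", "LG,2000", "LG,3200"]


def encode_brand_columns(devices):
    # Split each "Brand,Model" string into a (brand, numeric model) pair.
    parsed = []
    for device in devices:
        brand, model = device.split(",")
        parsed.append((brand.strip(), float(model)))

    # Collect brands in order of first appearance.
    brands = []
    for brand, _ in parsed:
        if brand not in brands:
            brands.append(brand)

    # Build one dict per row, keyed by Dev_<brand>.
    return [
        {f"Dev_{b}": (model if b == brand else 0.0) for b in brands}
        for brand, model in parsed
    ]


encoded = encode_brand_columns(rows)
for index, row in enumerate(encoded):
    print(index, row)
```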

Question: I am using multiple linear regression. Using the above encoding, the model numbers (e.g. 10 vs 10.5) are useful when comparing devices of the same brand, but I'm not sure if they make sense in comparison with other brands. So, I was wondering if there is a better way of encoding such data.

UPDATE

Based on the answer, my dataframe would look like this:

-------------------------------------------------
|   Index   | Dev_Sony | Dev_LG  | Model_Number
-------------------------------------------------
|   0       | 1        | 0       | 10
|   1       | 1        | 0       | 10.5
|   2       | 0        | 1       | 2000
|   3       | 0        | 1       | 3200
kjetil b halvorsen

1 Answer


Make two categorical variables: Device, with values Sony, LG, ..., and Model_Number, with values 10, 10.5, 2000, 3200, .... Model_Number is then nested within Device. See "How do you deal with "nested" variables in a regression model?" for how to model this.

In short, if you are using R, use the nesting operator `/` in the formula language: `y ~ Device/Model_Number + ...`.
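For readers not using R, here is a rough sketch in plain Python of the design-matrix columns that a nested term of the form `Device + Device:Model_Number` would produce. It rests on several assumptions that are not in the answer: Model_Number is treated as numeric (as in the question's update) rather than categorical, Sony is chosen as the reference level for the brand dummy, and the function name `nested_design` and the column labels are made up for illustration.

```python
# (brand, model number) pairs from the question's example.
data = [("Sony", 10.0), ("Sony", 10.5), ("LG", 2000.0), ("LG", 3200.0)]


def nested_design(rows, reference):
    # Hypothetical helper: intercept, brand dummies (reference level dropped),
    # and one model-number column per brand (the "within brand" slopes).
    brands = sorted({brand for brand, _ in rows})
    others = [b for b in brands if b != reference]

    names = (
        ["Intercept"]
        + [f"Device[{b}]" for b in others]
        + [f"Device[{b}]:Model_Number" for b in brands]
    )

    matrix = []
    for brand, number in rows:
        main_effects = [1.0 if brand == b else 0.0 for b in others]
        slopes = [number if brand == b else 0.0 for b in brands]
        matrix.append([1.0] + main_effects + slopes)
    return names, matrix


names, matrix = nested_design(data, reference="Sony")
print(names)
for row in matrix:
    print(row)
```

Each brand gets its own intercept shift and its own slope on the model number, so 10 versus 10.5 is only compared within Sony and 2000 versus 3200 only within LG.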

kjetil b halvorsen
  • Thanks for the answer. I'm using `Python`. If I understand your other post correctly, it should be modelled as "conditional variables": `y ~ Device + Device:Model_Number`? And why `Device/Model_num` rather than `Device*Model_num`? Can you give me a textbook reference where I can read more about this? – towi_parallelism Jan 27 '20 at 17:22
  • Those two notations are equivalent! – kjetil b halvorsen Jan 27 '20 at 18:04
  • No, I mean `Dev/Num` (which can be read as "Dev, and within Dev, Num") expands into `Dev + Dev:Num`. A good discussion is in https://www.springer.com/gp/book/9780387954578 – kjetil b halvorsen Jan 27 '20 at 18:45
  • I was implementing this today and realised that I had forgotten one part. In practice, I'd still need to get the dummy variables for the devices, so I'd still end up with one column per unique device. R should be doing the same behind the scenes (I get the dummy codes in Python using the Pandas library). Please have a look at the `Update` based on your answer. And then, I'll have `Dev` and `Dev*Num` in the formula. – towi_parallelism Feb 12 '20 at 05:13
  • The data frame in your update looks fine. – kjetil b halvorsen Feb 12 '20 at 14:15
  • Also, multicollinearity itself is a problem anyway, since, in your approach, all dummy variables will be almost perfectly collinear with their interaction terms (the interaction is 0 when the dummy variable is 0, and a positive number, `Dev*Num`, when the dummy variable is 1). My point is that, since they are collinear, we can drop one of them, say the dummy variable itself, and we end up using only the interaction term, which is exactly what I used in the first place (the second dataframe in the question). – towi_parallelism Feb 13 '20 at 18:26
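The "almost perfectly collinear" claim in the last comment can be checked on the four example rows with a small plain-Python sketch. The `pearson` helper and the variable names are illustrative assumptions, and four rows are far too few to say anything general; the point is only to show how one might measure the correlation between a brand dummy and its interaction column.

```python
import math


def pearson(x, y):
    # Plain Pearson correlation coefficient, written out by hand.
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


# Brand dummies and dummy * model-number columns for the example rows.
dummy_sony = [1, 1, 0, 0]
inter_sony = [10, 10.5, 0, 0]
dummy_lg = [0, 0, 1, 1]
inter_lg = [0, 0, 2000, 3200]

r_sony = pearson(dummy_sony, inter_sony)
r_lg = pearson(dummy_lg, inter_lg)
print(r_sony, r_lg)
```

On these rows both correlations are high but strictly below 1, and the LG pair is noticeably less correlated than the Sony pair because its model numbers vary more relative to their size. The dummy and the interaction column are therefore strongly related here, but not exact copies of each other.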