
Background of the Question

Suppose I have four categories (A, B, C, D). Taking one of them (D) as the reference category leaves three categories to encode. The problem is that a participant can belong to several categories within a single observation. For example, in a single observation a participant can be in both A and C, which violates the rule for creating dummy variables described here.

My Question

Which type of variable (analogous to a dummy variable) can I use that allows a participant to belong to two or more categories in a single observation?

Notes

  • I am aware of interaction variables. None of the categories in my problem can be treated as that type of variable.
  • I know that CV has many questions about dummy variables. However, I did not find an answer to my question there; those questions mostly taught me what should and should not be done with dummy variables.
  • My question is similar to this one, which is unanswered.
Md. Sabbir Ahmed

1 Answer

Here is an example of Base-N encoding in Python, using the following example data:

import pandas as pd

# One categorical column; "ab", "bc" and "bd" are combined labels.
df = pd.DataFrame({"A": ["a", "b", "c", "d", "e", "ab", "bc", "bd"]})

Applying the Base-N encoder (here with base 5):

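import category_encoders as ce

# Encode column "A" as base-5 digits spread over several columns.
encoder = ce.BaseNEncoder(cols=["A"], return_df=True, base=5)
data = encoder.fit_transform(df)
# Put the original labels back next to the encoded columns for comparison.
data["A"] = df["A"]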

Base-N Encoder Data Output

    A_0 A_1 A_2 A
0   0   0   1   a
1   0   0   2   b
2   0   0   3   c
3   0   0   4   d
4   0   1   0   e
5   0   1   1   ab
6   0   1   2   bc
7   0   1   3   bd
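
The digits in the table above can be reproduced by hand, which may make it clearer what the encoder is doing. The following is a minimal sketch in plain Python, not the library's implementation: it assumes the levels are numbered 1, 2, 3, ... in order of first appearance (which is consistent with the table) and then writes each number in base 5 with three digits. The helper name `to_base_digits` is made up for this illustration.

```python
def to_base_digits(value, base, width):
    """Return `value` written in `base`, left-padded with zeros to `width` digits."""
    digits = []
    for _ in range(width):
        value, remainder = divmod(value, base)
        digits.append(remainder)
    # Digits were collected least significant first, so reverse them.
    return digits[::-1]


labels = ["a", "b", "c", "d", "e", "ab", "bc", "bd"]

# Assumption: each distinct label gets an integer code starting at 1,
# in order of first appearance.
ordinal = {label: index for index, label in enumerate(dict.fromkeys(labels), start=1)}

# Base-5 digits for every label, three digits wide as in the table above.
base5_rows = {label: to_base_digits(ordinal[label], base=5, width=3) for label in labels}

for label, row in base5_rows.items():
    print(label, row)
```

Calling the same helper with `base=2` and `width=4` gives the rows of the binary encoder output shown further down, since binary encoding is the base-2 case.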

Binary Encoder Strategy

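# Encode column "A" as binary digits, one column per bit.
encoder = ce.BinaryEncoder(cols=["A"], return_df=True)
data = encoder.fit_transform(df)
# Put the original labels back next to the encoded columns for comparison.
data["A"] = df["A"]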

Binary Encoder Data Output

    A_0 A_1 A_2 A_3 A
0   0   0   0   1   a
1   0   0   1   0   b
2   0   0   1   1   c
3   0   1   0   0   d
4   0   1   0   1   e
5   0   1   1   0   ab
6   0   1   1   1   bc
7   1   0   0   0   bd
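
Note that both encoders above treat a combined label such as ab as one more distinct level; the encoded columns do not record that it shares anything with a or b. If the combined labels are meant to express membership in several categories at once, one possible alternative is a set of non-exclusive 0/1 indicator columns, one per category. The sketch below is only an illustration under that assumption (each character of a label names one category); the helper name `multi_hot` is made up for this example.

```python
labels = ["a", "b", "c", "d", "e", "ab", "bc", "bd"]

# Assumption: each character of a label names one category,
# so "ab" means membership in both "a" and "b".
categories = sorted({char for label in labels for char in label})


def multi_hot(label, categories):
    """Return one 0/1 indicator per category; several can be 1 at once."""
    members = set(label)
    return [1 if category in members else 0 for category in categories]


indicator_rows = [multi_hot(label, categories) for label in labels]

print("label", *categories)
for label, row in zip(labels, indicator_rows):
    print(f"{label:<5}", *row)
```

Unlike ordinary dummy coding, the rows here need not sum to one: an observation labelled ab has a 1 in both the a and the b column.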
Anant Kumar