
I am working on a binary classification problem with 1000 rows and several high-cardinality categorical variables.

So, I decided to use a hashing encoder to avoid the curse of dimensionality.

However, after feeding in my columns as shown below,

    import category_encoders as ce

    # Hash the high-cardinality categorical columns into a fixed set of output columns
    encoder = ce.HashingEncoder(cols=['market', 'Segment', 'Application',
               'Product Classification', 'State', 'Pincode',
               'Project Status', 'Country', 'line', 'DIV'], return_df=True)
    categorical_data_transformed = encoder.fit_transform(categorical_data)

I got output like this:

[image: screenshot of the encoder output]

My questions are

a) How do I know which hash value corresponds to what category or column?

b) How do I get the original column names instead of col1, col2, col3, etc.?

c) How do we use this if we wish to explain our predictions to business users?

The Great

1 Answer


You don’t. The encoder uses the hashing trick: each value is passed through a hash function that maps it to a code. The function can and will map different values to the same code, because it is generally used to reduce the cardinality of your data. To learn which values were mapped to which codes, you would need to build a dictionary by looping over your data, transforming it, and saving the input $\to$ output pairs for all the unique inputs. The mappings themselves are arbitrary, so you can’t interpret them. In terms of explainability, it is a classic black box.
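As a rough illustration of that dictionary-building step, here is a minimal sketch. It does not call the encoder from the question; `hash_bucket` is a hypothetical stand-in (MD5 of the string value, modulo an assumed 8 output columns), and the `state_…` values are made up. With a real encoder you would transform each unique value and record where it lands in the same way.

```python
import hashlib
from collections import defaultdict

N_COMPONENTS = 8  # assumed number of output columns (hypothetical choice)


def hash_bucket(value, n_components=N_COMPONENTS):
    """Map a category to a column index with a stand-in hash (MD5 mod n)."""
    digest = hashlib.md5(str(value).encode("utf-8")).hexdigest()
    return int(digest, 16) % n_components


def build_lookup(values, n_components=N_COMPONENTS):
    """Record which unique inputs land in which output column."""
    lookup = defaultdict(set)
    for value in set(values):
        lookup[hash_bucket(value, n_components)].add(value)
    return dict(lookup)


# Made-up data: 20 distinct categories squeezed into 8 columns.
states = [f"state_{i}" for i in range(20)]
lookup = build_lookup(states)

for bucket in sorted(lookup):
    print(bucket, sorted(lookup[bucket]))
```

Because there are more categories than columns, at least one column necessarily collects several categories, which is why a hashed column cannot be traced back to a single original value.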

Tim
  • Thanks, upvoted. So hashing may not be a good approach if we want to explain the predictions to business users? Meaning, we can't tell which feature values lead to a prediction of 1? – The Great Jan 23 '22 at 10:01
  • Is there any other approach you would suggest for feature encoding (other than one-hot encoding)? I have high-cardinality variables. – The Great Jan 23 '22 at 10:02
  • @TheGreat see the linked thread for many solutions. – Tim Jan 23 '22 at 10:03