How to decide on encoding high cardinality variables for a small dataset?

Question

I have already read the posts here, here, here, here, and here, so please don't mark this as a duplicate.

I have a dataset with 1008 rows, 16 input variables, and 1 target variable.

However, 14 of my input variables are categorical, and the number of unique values in each feature is shown below.

My questions are as follows:

a) My problem can be solved with a traditional classification technique such as logistic regression, random forests, or decision trees. Can these algorithms work if I one-hot encode my 14 categorical input features? I suggest one-hot encoding because my categorical variables and their values don't have any inherent order or rank; they are things like product_ids, part_ids, market_segment, etc. So there is no order, but the cardinality is high. Can traditional algorithms cope with such a large feature space if I one-hot encode? Our objective is to predict which product (id), with which part id, from which region, sold by which seller, failed. Here, product is f1, part is f2, region is f3, seller is f4, etc.

b) Is there a better alternative? I ask because my dataset has only 1008 rows, so we have to make sure the model doesn't overfit.

c) I will drop correlated input features to reduce the number of input variables. Are there any other suggestions?

d) Though the dataset is small, we are working on extracting more data, so we may have more data points in the future. But for now we have only 1008 rows. How can we best get our methodology right now, so that I can reuse it for future data points?

What about embeddings or hashing-trick https://stats.stackexchange.com/a/388051/35989 ? — Tim, Jan 15 '22 at 16:01
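For readers unfamiliar with the hashing trick mentioned in the comment above, here is a minimal sketch in plain Python. It is an illustration only, not the method from the linked answer: the function name `hash_features`, the bucket count, and the use of MD5 as the hash are all assumptions made for this example. In practice a library implementation such as scikit-learn's `FeatureHasher` would normally be used instead.

```python
import hashlib


def hash_features(row, n_buckets=32):
    """Map a dict of categorical values to a fixed-length signed count vector.

    `row` maps feature names (e.g. "f1") to category values. The output
    length depends only on `n_buckets`, not on how many distinct
    categories exist, which is the point of the hashing trick.
    """
    vec = [0] * n_buckets
    for name, value in row.items():
        # Hash "feature=value" so the same value in two different
        # features does not automatically land in the same bucket.
        token = f"{name}={value}".encode("utf-8")
        digest = hashlib.md5(token).digest()
        index = int.from_bytes(digest[:8], "big") % n_buckets
        # A separate byte of the digest picks the sign, so colliding
        # categories tend to cancel rather than always add up.
        sign = 1 if digest[8] & 1 == 0 else -1
        vec[index] += sign
    return vec


# Hypothetical row using the f1..f4 naming from the question.
row = {"f1": "product_17", "f2": "part_203", "f3": "EMEA", "f4": "seller_9"}
vec = hash_features(row, n_buckets=32)
```

With 14 categorical features, every row becomes a vector of `n_buckets` numbers regardless of cardinality. The cost is that distinct categories can collide in the same bucket, and with only about a thousand rows the bucket count would need to be tuned by cross-validation.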

user304405 · Answer 1 · 2022-01-15T16:20:12.513

1

I'm a beginner in this topic, but we covered it in one of my university courses. In your case, you could remove near-zero-variance variables and aggregate certain categories. I recommend reading Sections 3.4 and 3.6 of the following book: https://bradleyboehmke.github.io/HOML/engineering.html#feature-and-target-filtering
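The two ideas above can be sketched roughly as follows. This is a plain-Python illustration under stated assumptions, not code from the linked book: the helper names `lump_rare_categories` and `is_near_zero_variance`, the thresholds, and the toy data are all hypothetical.

```python
from collections import Counter


def lump_rare_categories(values, min_count=10, other_label="other"):
    """Replace categories seen fewer than `min_count` times with one label."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other_label for v in values]


def is_near_zero_variance(values, max_top_share=0.95):
    """Return True if the most frequent value covers more than
    `max_top_share` of the rows, i.e. the column is almost constant."""
    top_count = Counter(values).most_common(1)[0][1]
    return top_count / len(values) > max_top_share


# Toy columns standing in for two of the categorical features.
f1 = ["A"] * 12 + ["B"] * 5 + ["C", "D", "E"]
f2 = ["x"] * 19 + ["y"]

f1_lumped = lump_rare_categories(f1, min_count=5)
columns = {"f1": f1_lumped, "f2": f2}
kept = [name for name, col in columns.items()
        if not is_near_zero_variance(col, max_top_share=0.90)]
```

Here the three singleton categories in `f1` are merged into `"other"`, and `f2` is dropped because one value covers 95% of its rows. Both thresholds are judgment calls; on a dataset of about a thousand rows it is probably safer to choose them inside cross-validation than to fix them up front.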

edited Jan 15 '22 at 16:20

answered Jan 15 '22 at 15:42


Thanks for your help. Upvoted. – The Great, Jan 20 '22 at 06:55
