
I am working on a binary classification problem with input variables such as country, state, city, product, product type, product segment, etc. I have many more hierarchical categorical variables like these.

As you can see, the city variable is granular-level information under the country variable. The same goes for the other hierarchical variables.

My questions are as follows:

a) We want our ML model to identify factors such as state, country, city, etc.

For example, we would like to predict in which country, state, and city our product has a high likelihood of selling, e.g. "Product A has a 90% likelihood of selling in Country A, State A, and City A."

b) How do we run correlation between hierarchical variables? Should we retain the top-level variable or the bottom/granular-level variable?

c) Does it make sense to feed all these hierarchical variables into the ML model? How do we decide on feature selection here?

d) Any other suggestions on how to handle hierarchical variables during feature engineering, ML model building, etc.?

Can you guide me on this?

The Great

1 Answer


Given a city, you can immediately infer both the state and the country, so those two features give you no additional information whatsoever. Because of that, the simple answer would be to remove state and country from the dataset and keep only the most granular level, city.
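Before dropping them, it is worth verifying that the hierarchy really is strict, i.e. that every city maps to exactly one state and one country. A minimal sketch of such a check (the DataFrame `df`, the file name, and the column names `city`, `state`, `country` are hypothetical):

```python
import pandas as pd

df = pd.read_csv("sales.csv")  # placeholder for your dataset

# For every city, count how many distinct states/countries it maps to.
levels_per_city = df.groupby("city")[["state", "country"]].nunique()

# If the maximum count is 1 for both columns, state and country are
# fully redundant given city and can be dropped.
print(levels_per_city.max())
```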

However, there are some caveats. First of all, what is the original problem you're trying to solve? If you're interested in explaining the factors driving sales in different states, you could, for instance, build a separate model for each state and use city as a feature (it can indeed be an important component). In this case you'd use both: state to partition the dataset, and city as a feature within each partition. Second, can you have missing data? Is it possible that you have a sample for which you know the state but not the city? In that case the state feature is no longer redundant, and including it in the model would be worth considering.
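As an illustration of the first caveat, here is a rough sketch of the per-state setup, assuming a hypothetical DataFrame `df` with columns `state`, `city`, `product` and a binary target `sold` (not your actual schema):

```python
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

models = {}
for state, part in df.groupby("state"):       # state partitions the data
    X = part[["city", "product"]]             # city is an ordinary feature
    y = part["sold"]
    pipe = Pipeline([
        ("encode", ColumnTransformer([
            ("onehot", OneHotEncoder(handle_unknown="ignore"),
             ["city", "product"]),
        ])),
        ("clf", LogisticRegression(max_iter=1000)),
    ])
    models[state] = pipe.fit(X, y)            # one fitted model per state
```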

Tomasz Bartkowiak
  • Thanks, upvoted for the help – The Great Jan 18 '22 at 11:11
  • Do you know whether there is any AutoML solution that can handle the encoding of my input categorical variables with more than 100 unique values? Like hashing, etc. – The Great Jan 18 '22 at 11:13
  • Why not use [OneHotEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html)? Also, before encoding a variable, plot its histogram. Maybe 95 out of 100 unique values appear only once or a couple of times and you can consider all of them as "Other"? In such a case you'd only have 6 unique values after preprocessing (see the sketch after this comment thread). – Tomasz Bartkowiak Jan 18 '22 at 11:16
  • Right. But unfortunately, in my case, all product IDs are unique, meaning in a dataset of 1,000 rows I have 813 unique values, whereas other columns have 100-plus unique values in a 1,000-row dataset – The Great Jan 18 '22 at 11:22
  • Then look at [BinaryEncoder](https://contrib.scikit-learn.org/category_encoders/binary.html) – Tomasz Bartkowiak Jan 18 '22 at 11:34
  • So, there is no special consideration for hierarchical variables? Just that we retain the granular-level detail (so we can infer the higher-level data without ML). For correlation between categorical input variables (input variable 1 vs input variable 2), do you run multiple chi-square tests? I have around 14-15 input categorical variables. Or is there any automated approach that can generate such a correlation heatmap? Do you know of any package that can take the cat_features_list and output the heatmap? – The Great Jan 18 '22 at 12:01
  • You mean something like [pairplot](https://seaborn.pydata.org/generated/seaborn.pairplot.html)? – Tomasz Bartkowiak Jan 18 '22 at 12:22
  • Yes, two things. First, how do you do correlation between categorical variables when there are more than 15 input variables? Do you run 15*14 chi-square tests? Second, how do you visualize the results from those 15*14 tests? – The Great Jan 18 '22 at 12:49
  • What is the question you want to answer by performing those tests or calculating the correlation? Are the categorical variables ordinal? If not, the concept of correlation [might not really apply](https://stats.stackexchange.com/questions/119835/correlation-between-a-nominal-iv-and-a-continuous-dv-variable). In such a case you might consider information-theoretic concepts such as [mutual information](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.mutual_info_classif.html) (loosely speaking, how much knowing the value of one feature tells you about the other feature); a sketch appears after this comment thread. – Tomasz Bartkowiak Jan 18 '22 at 15:51
  • Yes, those input categorical variables are not ordinal; they are nominal with many unique values. Okay, I will try MI. Is it okay if I reach out here in the comments section for any doubts? Appreciate all your help and time – The Great Jan 18 '22 at 22:06
  • Best if you create a new question so other people could possibly a) help you, b) benefit from others' answers. Also, anything related to _MI_ is a bit off-topic in this thread, so it'd make sense to keep it separate – Tomasz Bartkowiak Jan 19 '22 at 05:22
  • One quick question on hierarchical variables. Like you said, if we know the city info, we can infer the state, country, and region info. So we have to input only the city variable into the model, am I right? The rest of the variables can be excluded from the list of features (except in the special case you described above)? – The Great Jan 19 '22 at 08:11
  • If you have time, can I seek your help/ inputs for this post? https://stats.stackexchange.com/questions/560349/ranking-and-predicting-an-outcome-with-without-a-ml – The Great Jan 19 '22 at 11:41
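Following up on the encoding discussion in the comments, here is a small sketch of collapsing rare categories into "Other" and of the BinaryEncoder suggestion for very-high-cardinality columns. The column names `product_type` and `product_id`, the file name, and the threshold of 10 occurrences are placeholders, not the actual schema:

```python
import pandas as pd
from category_encoders import BinaryEncoder  # pip install category_encoders

df = pd.read_csv("sales.csv")  # placeholder for your dataset

# Collapse values that appear fewer than 10 times into "Other"
# before one-hot encoding.
counts = df["product_type"].value_counts()
rare = counts[counts < 10].index
df["product_type"] = df["product_type"].where(
    ~df["product_type"].isin(rare), "Other")

# For a near-unique column such as a product ID, binary encoding keeps
# the number of output columns to roughly log2(number of categories).
df_encoded = BinaryEncoder(cols=["product_id"]).fit_transform(df)
```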
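And following up on the mutual-information suggestion, a minimal sketch of a pairwise association heatmap for nominal features. The DataFrame `df` and the feature names are placeholders; with `average_method="min"`, the normalised score is 1 when one feature fully determines the other and 0 when they are independent:

```python
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.metrics import normalized_mutual_info_score

df = pd.read_csv("sales.csv")  # placeholder for your dataset
cat_features = ["country", "state", "city", "product", "product_type"]

# Pairwise normalised mutual information between nominal features.
mi = pd.DataFrame(np.zeros((len(cat_features), len(cat_features))),
                  index=cat_features, columns=cat_features)
for a in cat_features:
    for b in cat_features:
        mi.loc[a, b] = normalized_mutual_info_score(
            df[a], df[b], average_method="min")

sns.heatmap(mi, annot=True, cmap="viridis")
```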