
I am working on this dataset: https://www.kaggle.com/russellyates88/suicide-rates-overview-1985-to-2016. It has a lot of categorical variables, while I am more used to working with continuous ones.

Except for the binary "sex" variable, which is simple, there are variables with more categories: "generation", "continent" and, especially, "country". There are ~100 countries, and countries are definitely not ordinal, so I suppose I cannot just convert countries to numbers, since the distances between them would not make sense. But at the same time I do not have a good feeling about making ~100 dummy-variable columns for the countries.

  1. Is it a good approach to create these dummy variables and then just carry out dimensionality reduction? Which kind of dimensionality reduction would be the most suitable?
  2. Is there a better alternative I don't know about?
kjetil b halvorsen
Valeria
  • Have a look at https://stats.stackexchange.com/questions/146907/principled-way-of-collapsing-categorical-variables-with-many-levels – kjetil b halvorsen Jan 12 '20 at 03:36
  • Everything depends on how you are going to use these categorical variables. If you mean reducing the number of categories, see the link provided above. If by dimensionality reduction you mean (as is usual) embedding the many categories in a few dimensions that are _continuous_ variables, then this quantification of categorical variables is known as (multiple) correspondence analysis. – ttnphns Jan 12 '20 at 03:41
  • @ttnphns yes, I was originally thinking about performing MCA, but I am not sure if it makes sense in this case (as I've mentioned, I have experience only with continuous data). So, if I have columns country_0, country_1,..., country_100 and all the other categorical columns and then perform MCA, would this be a viable solution to the problem? And if I want to combine it with the continuous data, should I e.g. perform PCA separately on the continuous data, and then take the results from both MCA and PCA? (I know that FAMD exists, but I currently cannot seem to run the only implementation I've found.) – Valeria Jan 12 '20 at 03:46
  • What is the problem with having 100 one-hot encodings? – Tim Jan 12 '20 at 07:49
  • @Tim most of them do not seem to be correlated with the target variable, so I don't see much point in having them around in this case. – Valeria Jan 12 '20 at 07:50
  • "Lack of correlation" is not a valid criterion for excluding variables. – Tim Jan 12 '20 at 08:33

1 Answer

In the case of countries I would recommend using predefined groups (at least in the beginning): geographic groups (continent, region, etc.), economic groups (currency, trade, income, etc.), or cultural/political groups (religion, state, war, etc.).

I would prefer a solution using these kinds of variables, because interpreting the results will be a lot easier if they are based on groups that have already been studied.
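As a rough illustration of the idea, here is a minimal sketch in plain Python that replaces "country" with a coarser predefined group and then builds dummy columns for the group instead. The lookup table `COUNTRY_TO_REGION`, the helper names `add_region` and `one_hot`, and the sample rows are all made up for the example; a real lookup would have to cover every country in the data and would come from whichever grouping you decide to trust.

```python
# Hypothetical hand-written lookup; a real one would list every country in the data.
COUNTRY_TO_REGION = {
    "Albania": "Europe",
    "Austria": "Europe",
    "Brazil": "South America",
    "Japan": "Asia",
    "Mexico": "North America",
}


def add_region(rows, lookup, default="Other"):
    """Return copies of the rows with a 'region' key derived from 'country'."""
    return [{**row, "region": lookup.get(row["country"], default)} for row in rows]


def one_hot(rows, key):
    """Build one 0/1 indicator column per distinct value of `key`."""
    levels = sorted({row[key] for row in rows})
    matrix = [[int(row[key] == level) for level in levels] for row in rows]
    return levels, matrix


# Made-up sample rows, only to show the mechanics.
rows = [
    {"country": "Albania", "sex": "male"},
    {"country": "Austria", "sex": "female"},
    {"country": "Japan", "sex": "male"},
    {"country": "Brazil", "sex": "female"},
    {"country": "Iceland", "sex": "male"},  # not in the lookup, falls back to "Other"
]

grouped = add_region(rows, COUNTRY_TO_REGION)
levels, matrix = one_hot(grouped, "region")

print(levels)
for row, indicators in zip(grouped, matrix):
    print(row["country"], row["region"], indicators)
```

With a grouping like this the number of dummy columns equals the number of groups rather than the number of countries, and each coefficient can be read as "the effect of belonging to this group".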

kjetil b halvorsen
  • The dataset contains total GDP, GDP per capita and population. Do you think it would make sense to perform clustering on those, and from the clusters create several groups of countries for this dataset? (As it's my own exploration and not a real-life project, I'm just curious to do as much as possible with the dataset itself, without bringing in additional sources.) – Valeria Jan 12 '20 at 04:21
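
The clustering idea raised in this comment could be sketched roughly as follows, using only the Python standard library. This is an illustration under several assumptions, not a recommendation: the data are assumed to be aggregated to one row per country first, the `country_stats` values are made up, the `kmeans` helper is a plain Lloyd's algorithm written for the example, and taking logarithms of both features is just one possible way to tame their skew. Total GDP is left out because it is the product of the other two quantities.

```python
import math


def kmeans(points, k, n_iter=50):
    """Plain Lloyd's algorithm; the first k points serve as initial centroids."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: nearest centroid by Euclidean distance.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Made-up per-country values: (GDP per capita, population).
country_stats = {
    "country_a": (50_000, 5_000_000),
    "country_b": (3_000, 100_000_000),
    "country_c": (48_000, 8_000_000),
    "country_d": (2_500, 120_000_000),
    "country_e": (45_000, 6_000_000),
    "country_f": (3_500, 90_000_000),
}

# Log-transform both features so a few very large countries do not dominate.
points = [[math.log10(gdp), math.log10(pop)] for gdp, pop in country_stats.values()]

labels, centroids = kmeans(points, k=2)
country_group = dict(zip(country_stats, labels))
print(country_group)
```

The resulting `country_group` mapping can then be used as a low-cardinality categorical variable in place of "country". Keep in mind that groups derived this way are harder to interpret than predefined ones, and that they depend on the choice of `k`, the scaling, and the initial centroids.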