Alternatives to using dummy variables?

Question

I am working on this dataset: https://www.kaggle.com/russellyates88/suicide-rates-overview-1985-to-2016. It has a lot of categorical variables, while I am more used to working with continuous ones.

Except for the binary "sex" variable, which is simple, there are variables with more categories: "generation", "continent" and, especially, "country". There are ~100 countries, and countries are definitely not ordinal, so I suppose I cannot just convert countries to numbers, since the distances between them would not make sense. But at the same time I do not have a good feeling about making ~100 dummy-variable columns for the countries.

Is it a good approach to create these dummy variables and then just carry out dimensionality reduction? Which kind of dimensionality reduction would be the most suitable?
Is there a better alternative I don't know about?

Have a look at https://stats.stackexchange.com/questions/146907/principled-way-of-collapsing-categorical-variables-with-many-levels – kjetil b halvorsen, Jan 12 '20 at 03:36
Everything depends on how you are going to use these categorical variables. If you are talking about reducing the number of categories, see the link provided above. If by dimensionality reduction you mean (as is usual) embedding the many categories in a few dimensions that are _continuous_ variables, then this quantification of categorical variables is known as (multiple) correspondence analysis. – ttnphns, Jan 12 '20 at 03:41
@ttnphns yes, I was originally thinking about performing MCA, but I am not sure if it makes sense in this case (as I've mentioned, I have experience only with continuous data). So, if I have columns country_0, country_1, ..., country_100 and all the other categorical columns and then perform MCA, would this be a viable solution to the problem? And if I want to combine it with the continuous data, should I, e.g., perform PCA separately on the continuous data and then take the results from both MCA and PCA? (I know that FAMD exists, but I currently cannot seem to run the only implementation I've found.) – Valeria, Jan 12 '20 at 03:46
@Tim most of them do not seem to be correlated with the target variable, so I don't see much point in having them around in this case. – Valeria, Jan 12 '20 at 07:50
"Lack of correlation" is not a valid criterion for excluding variables. – Tim, Jan 12 '20 at 08:33

score 1 · Answer 1 · edited Feb 15 '21 at 03:31

In the case of countries I would recommend using predefined groups (at least in the beginning), for example geographic groups (continent, region, etc.), economic groups (currency, trade, income, etc.), or cultural/political groups (religion, state, war, etc.).

I would prefer a solution using these kinds of variables, because interpreting the results will be a lot easier if they are based on groups that have already been studied.
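
As a minimal sketch of this grouping idea (an illustration, not code from the answer), the snippet below maps each country to a predefined group and dummy-codes the group instead of the country. The lookup table `COUNTRY_TO_REGION`, the function `encode_regions`, and the fallback level "Other" are hypothetical names chosen for illustration; a real table would have to cover every country in the data.

```python
# Hypothetical lookup table: extend it to cover every country in the data.
COUNTRY_TO_REGION = {
    "Albania": "Europe",
    "Austria": "Europe",
    "Japan": "Asia",
    "Thailand": "Asia",
    "Brazil": "Americas",
    "Mexico": "Americas",
}


def encode_regions(countries, lookup=COUNTRY_TO_REGION, other="Other"):
    """Map each country to its group, then dummy-code the group.

    Returns (dummy_levels, rows). The alphabetically first level is
    dropped and acts as the reference category.
    """
    # Countries missing from the lookup fall into the `other` level.
    regions = [lookup.get(c, other) for c in countries]
    levels = sorted(set(regions))
    dummy_levels = levels[1:]  # drop the reference level
    rows = [[int(r == level) for level in dummy_levels] for r in regions]
    return dummy_levels, rows


# "Atlantis" is deliberately absent from the lookup to show the fallback.
levels, rows = encode_regions(["Japan", "Austria", "Brazil", "Japan", "Atlantis"])
print(levels)
for row in rows:
    print(row)
```

With k groups this gives k - 1 dummy columns rather than one column per country, and each coefficient can be read as a contrast against the reference group.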

edited Feb 15 '21 at 03:31 by kjetil b halvorsen

answered Jan 12 '20 at 04:18 by Grigorij Abramov

The dataset contains total GDP, GDP per capita, and population. Do you think it would make sense to cluster on those and use the clusters as groups of countries for this dataset? (As this is my own exploration and not a real-life project, I'm curious to do as much as possible with the dataset itself, without bringing in additional sources.) – Valeria Jan 12 '20 at 04:21
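
The clustering idea raised in this comment could be sketched roughly as follows. This is only an illustration under several assumptions: the values are first aggregated to one row per country, the country names "A" to "F" and all numbers are made up (they are not taken from the Kaggle data), and plain k-means with k = 3 is just one possible choice of method. Total GDP is left out because it is approximately GDP per capita times population, so on a log scale it adds little beyond the other two features.

```python
import math

# Hypothetical per-country averages: (GDP per capita, population).
# Placeholder names and made-up numbers, not values from the Kaggle data.
COUNTRIES = {
    "A": (1_000, 2_000_000),
    "B": (1_200, 3_000_000),
    "C": (40_000, 5_000_000),
    "D": (45_000, 6_000_000),
    "E": (1_100, 200_000_000),
    "F": (1_300, 150_000_000),
}


def standardise(column):
    """Rescale a list of numbers to mean 0 and standard deviation 1."""
    mean = sum(column) / len(column)
    sd = math.sqrt(sum((x - mean) ** 2 for x in column) / len(column))
    return [(x - mean) / sd for x in column]


def kmeans(points, k, n_iter=100):
    """Plain Lloyd's algorithm with a deterministic start.

    Initial centres are taken at evenly spaced positions in the input
    order (an arbitrary choice made here for reproducibility).
    """
    step = len(points) // k
    centres = [points[i * step] for i in range(k)]
    labels = None
    for _ in range(n_iter):
        # Assignment step: nearest centre by Euclidean distance.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centres[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each centre to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centres[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels


names = list(COUNTRIES)
# Log-transform the heavily skewed features, then put them on one scale.
log_gdp = standardise([math.log10(COUNTRIES[n][0]) for n in names])
log_pop = standardise([math.log10(COUNTRIES[n][1]) for n in names])
labels = kmeans(list(zip(log_gdp, log_pop)), k=3)

for name, label in zip(names, labels):
    print(name, label)
```

The resulting cluster label can then be dummy-coded like any other low-cardinality categorical variable. Unlike predefined groups, such clusters have no ready-made interpretation, so they would need to be inspected before being used in a model.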
