I have a dataframe with 20 categorical variables, each with 30+ levels. I don't have a target variable per se, but I would like to use statistical techniques or machine learning to show how specific levels of each variable relate to each other.

For example: when we see "B" in Column D, should we also expect to see "G" in Column J?

I was thinking of starting with counts of the levels, but is there any way to go beyond Fisher's exact and chi-squared tests? Perhaps something that looks at the interplay between the frequency distributions of more than one variable at a time?

My main point is that I would like to use machine learning to determine which levels come up more often than others, but without a target variable I am unsure how to proceed with feature selection. The problem seems unsupervised, but I am not sure how to single out a specific level of a variable or show how levels relate to one another.

1 Answer

Hard to say, as this is quite a general question. Tree-based models (random forests or boosted regression trees) could be useful here if the categories are clearly separated. However, these machine learning models are used for prediction or for exploratory analysis: for example, how well do the levels in Column D correlate with (predict) the levels in Column J? Would you like to see a pairwise comparison for each possible combination?
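If the level-by-level comparison is what you are after, one simple way to look at it (not a tree model, just the chi-squared bookkeeping applied per cell) is to compare observed co-occurrence counts with the counts expected under independence. The sketch below is only an illustration: the function name `pearson_residuals` and the toy columns `col_d` and `col_j` are made up, and it uses the standard library only so it runs anywhere. A clearly positive residual means the pair of levels occurs together more often than independence would predict.

```python
from collections import Counter
from math import sqrt


def pearson_residuals(xs, ys):
    """Return {(x_level, y_level): (observed - expected) / sqrt(expected)}.

    `expected` is the count implied by independence of the two columns:
    row total * column total / number of rows.
    """
    n = len(xs)
    pair_counts = Counter(zip(xs, ys))
    x_counts = Counter(xs)
    y_counts = Counter(ys)
    residuals = {}
    for x, nx in x_counts.items():
        for y, ny in y_counts.items():
            expected = nx * ny / n
            observed = pair_counts.get((x, y), 0)
            residuals[(x, y)] = (observed - expected) / sqrt(expected)
    return residuals


# Hypothetical toy data standing in for "Column D" and "Column J".
col_d = ["B", "B", "B", "A", "A", "C", "C", "B"]
col_j = ["G", "G", "G", "H", "H", "H", "G", "H"]

res = pearson_residuals(col_d, col_j)

# Level pairs sorted from most over-represented to most under-represented.
ranked = sorted(res.items(), key=lambda kv: kv[1], reverse=True)
for (d_level, j_level), r in ranked:
    print(f"D={d_level!r}, J={j_level!r}: residual {r:+.2f}")
```

With 30+ levels per variable many cells will have very small expected counts, so the residuals for rare level pairs should be read with caution.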

It seems to me that you want to produce something similar to a correlation matrix, but for categorical variables. Machine learning is not necessary for this (or perhaps I am too pragmatic). I would not make the analysis unnecessarily complex when a simple approach would suffice. I have never performed a pairwise categorical correlation myself, but a quick search turned up this: Correlations with unordered categorical variables.
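As a rough sketch of what such a "categorical correlation matrix" could look like, the snippet below computes Cramér's V for every pair of columns. Cramér's V is one common chi-squared-based association measure and is my choice for illustration here, not necessarily what the linked thread recommends. The function names and the toy columns `D`, `J` and `K` are hypothetical, and only the standard library is used.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def cramers_v(xs, ys):
    """Cramér's V between two categorical sequences (0 = none, 1 = perfect)."""
    n = len(xs)
    x_counts, y_counts = Counter(xs), Counter(ys)
    pair_counts = Counter(zip(xs, ys))
    k = min(len(x_counts), len(y_counts)) - 1
    if k == 0:
        # A constant column carries no association; 0.0 is a convention
        # chosen for this sketch.
        return 0.0
    chi2 = 0.0
    for x, nx in x_counts.items():
        for y, ny in y_counts.items():
            expected = nx * ny / n
            observed = pair_counts.get((x, y), 0)
            chi2 += (observed - expected) ** 2 / expected
    return sqrt(chi2 / (n * k))


def pairwise_cramers_v(columns):
    """Map each unordered pair of column names to its Cramér's V."""
    return {
        (a, b): cramers_v(columns[a], columns[b])
        for a, b in combinations(columns, 2)
    }


# Hypothetical toy data: J is fully determined by D, K is unrelated to D.
data = {
    "D": ["A", "A", "B", "B", "C", "C"],
    "J": ["G", "G", "H", "H", "I", "I"],
    "K": ["x", "y", "x", "y", "x", "y"],
}

v = pairwise_cramers_v(data)
for (a, b), value in v.items():
    print(f"{a} vs {b}: V = {value:.2f}")
```

Keep in mind that with many levels and sparse cells the plain estimate tends to be inflated, so for 20 variables with 30+ levels each a bias-corrected variant may be worth looking into.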

A4-paper