
There is a dataset of about 8500 different kinds of mushrooms, and each data point has about 20 features. The features are purely categorical (cap color, cap shape and so on), and none of them are ordered. For every data point it is known whether the mushroom is poisonous or not. My task is to determine which features distinguish the edible mushrooms from the non-edible ones.

My knowledge of statistics is limited, but I have read the following example on analysing categorical data. Following that example, I intend to:

  1. For every feature, perform a chi-square test on a $2 \times 2$ table to see whether there is any kind of association.
  2. For every feature with an association, compute an odds ratio to see if the association is significant or not.
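The two steps above can be sketched for a single feature as follows. This is only a minimal illustration using the Python standard library: the `odor` values and the counts are made up, and the helper names (`contingency_table`, `chi_square`, `odds_ratio`) are hypothetical. Note that a feature with more than two levels gives an $r \times 2$ table rather than a $2 \times 2$ one, and that the chi-square test indicates whether an association exists while the odds ratio describes how strong it is.

```python
from collections import Counter

def contingency_table(feature, label):
    """Cross-tabulate feature levels (rows) against class labels (columns)."""
    counts = Counter(zip(feature, label))
    rows, cols = sorted(set(feature)), sorted(set(label))
    return [[counts[(r, c)] for c in cols] for r in rows]

def chi_square(table):
    """Return Pearson's chi-square statistic and its degrees of freedom."""
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    n = sum(row_tot)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof

def odds_ratio(table):
    """Odds ratio of a 2x2 table [[a, b], [c, d]]; undefined if b or c is zero."""
    (a, b), (c, d) = table
    return (a * d) / (b * c)

# Made-up toy sample: 16 mushrooms, one binary feature.
odor = ["foul"] * 8 + ["none"] * 8
label = ["poisonous"] * 6 + ["edible"] * 2 + ["poisonous"] * 2 + ["edible"] * 6

# Rows: foul, none. Columns: edible, poisonous.
table = contingency_table(odor, label)
stat, dof = chi_square(table)
ratio = odds_ratio(table)
print(table, stat, dof, ratio)
```

With one degree of freedom, the statistic is compared against the 5% critical value of 3.841. Repeating this for about 20 features means about 20 separate tests, which is exactly the concern described below.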

My concern is that I will be treating each feature separately, so I will have about 20 separate experiments. Maybe I am missing some statistical test that takes into account the fact that there are many (not just one) categorical features that determine whether the mushroom is edible or not.

alisianoi

1 Answer


Since your outcome is categorical (poisonous or not), I think this case calls for logistic regression. Since your predictors are also categorical, they should, as far as I know, be coded using dummy variables. See more details on dummy coding (in R) on this page and on logistic regression on this website.

Now, a couple of notes on performing logistic regression in different statistical languages and environments. Since logistic regression belongs to the generalized linear model (GLM) family, in R you can use the glm() function with family = binomial for the analysis: http://data.princeton.edu/R/glms.html. In SAS, you can use the LOGISTIC procedure. If you use SPSS, this tutorial might be helpful.
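To make the idea concrete, here is a rough sketch of dummy coding followed by a logistic regression fit, written in plain Python so it does not depend on any particular package. Everything in it is an assumption for illustration: the toy rows, the feature names `odor` and `cap`, and the helpers `dummy_code` and `fit_logistic`. The fit uses simple gradient ascent, not the iteratively reweighted least squares that glm() uses, and it gives point estimates only (no standard errors or p-values), so for a real analysis one of the tools above is the better choice.

```python
import math

def dummy_code(rows, features):
    """One-hot encode each feature, dropping its first level as the reference."""
    levels = {f: sorted({r[f] for r in rows}) for f in features}
    names = ["(intercept)"] + [f"{f}={lv}" for f in features for lv in levels[f][1:]]
    X = [[1.0] + [1.0 if r[f] == lv else 0.0
                  for f in features for lv in levels[f][1:]]
         for r in rows]
    return names, X

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lr=0.5, steps=5000):
    """Maximise the mean log-likelihood by plain gradient ascent."""
    w = [0.0] * len(X[0])
    n = len(X)
    for _ in range(steps):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            err = yi - sigmoid(sum(wj * xj for wj, xj in zip(w, xi)))
            for j, xj in enumerate(xi):
                grad[j] += err * xj
        w = [wj + lr * gj / n for wj, gj in zip(w, grad)]
    return w

# Made-up toy sample: 16 mushrooms, two categorical features, 1 = poisonous.
y = [1] * 6 + [0] * 2 + [1] * 2 + [0] * 6
rows = [{"odor": "foul" if i < 8 else "none",
         "cap": "brown" if i % 2 == 0 else "white"}
        for i in range(16)]

names, X = dummy_code(rows, ["odor", "cap"])
coef = dict(zip(names, fit_logistic(X, y)))
for name, value in coef.items():
    # exp(coefficient) is the odds ratio relative to the reference level
    print(f"{name:12s} coef={value: .3f}  odds ratio={math.exp(value):.3f}")
```

Each coefficient is adjusted for the other features in the model, which is what the separate per-feature tests cannot provide. In this toy sample `cap` carries no information, so its coefficient comes out near zero while `odor=none` gets a clearly negative one.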

Aleksandr Blekh