How to calculate categorical distribution?

Question

Suppose I have 150 records with continuous and categorical values. Only one column is categorical, with three categories: setosa, versicolor and virginica.

How do I calculate the categorical distribution for them?

The allusion is to the _Iris_ data (e.g. http://stats.stackexchange.com/questions/74776/what-aspects-of-the-iris-data-set-make-it-so-successful-as-an-example-teaching). As far as I can see, the categorical distribution is that there are 50 values of each category, so the frequencies are (50, 50, 50) and the probabilities are 1/3 each. Any decent program will have ways to give you a table and/or to save the frequencies in a vector or variable. Is that what you want? If you want something else, you may have to add more detail. Note that questions on how to do this in particular software are off-topic here. — Nick Cox, Jan 21 '16 at 10:14
What does `How to calculate categorical distribution for them?` actually mean in your case? — ttnphns, Jan 21 '16 at 11:25
I want to find the distance between clusters when handling categorical data. While searching I found out about the categorical distribution and thought it would help me. If there is any way, please help me. — Melvin Arun, Jan 21 '16 at 11:55
I don't think this will help you. As @NickCox pointed out, the Iris data is uniform across the three types of plants. But that won't help you find the distance between clusters because that information doesn't go into the cluster analysis — Peter Flom, Jan 21 '16 at 11:58

score 3 · Answer 1 · edited Jan 21 '16 at 12:39

The term categorical distribution describes the probabilities of observing one of $k$ mutually exclusive events, which for convenience are denoted by numbers $x \in \{1,...,k\}$. A probability mass function assigns a probability to each of the events

$$ \Pr(x = i) = p_i $$

with the constraints that $p_i \ge 0$ and $\sum_{i=1}^k p_i = 1$. If you want to calculate the probabilities from data, then you are probably interested in an empirical distribution. Calculating empirical probabilities is very simple. If the number of times that $x=i$ was observed in the dataset is denoted $n_i$, then

$$ \hat p_i = \frac{n_i}{\sum_{j=1}^k n_j} $$
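As a minimal sketch of this formula, the counts $n_i$ can be tallied and divided by their total. The function name `empirical_probabilities` and the `species` list are hypothetical; the list only mimics the 50/50/50 split mentioned in the comments and is not read from the actual data.

```python
from collections import Counter


def empirical_probabilities(labels):
    """Return {category: n_i / sum_j n_j} for the observed labels."""
    counts = Counter(labels)          # n_i for every category that was observed
    total = sum(counts.values())      # sum_j n_j
    if total == 0:
        raise ValueError("need at least one observation")
    return {category: n / total for category, n in counts.items()}


# Hypothetical stand-in for the categorical column: 50 records per category.
species = ["setosa"] * 50 + ["versicolor"] * 50 + ["virginica"] * 50
probs = empirical_probabilities(species)
```

Here each of the three categories gets probability 1/3. Only categories that appear in `labels` show up in the result.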

Notice, however, that such an estimate would be obviously incorrect for any value that can occur in general but happened not to be observed in your dataset. In such a case the estimate of its probability from your data would be zero (i.e. impossibility). This is called the zero-frequency problem, and a number of work-arounds for it are possible. The simplest correction is to add some value $\alpha > 0$ to each of your counts

$$ \hat p_i = \frac{n_i + \alpha}{(\sum_{j=1}^k n_j) + k\alpha} $$

Common choices for $\alpha$ are $1$ (a uniform prior, as in Laplace's rule of succession), $1/2$ (the Krichevsky-Trofimov estimate), or $1/k$ (the Schurmann-Grassberger (1996) estimator). Notice, however, that what you do here is apply out-of-data (prior) information in your model, so it gets a subjective, Bayesian flavor. With this approach you have to remember the assumptions you made and take them into consideration.
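The additive correction can be sketched as follows. This is an illustration under assumptions: the function name `smoothed_probabilities`, its arguments, and the small `observed` sample are all made up for the example. The full list of categories has to be supplied separately, because an unobserved category cannot be discovered from the data alone.

```python
from collections import Counter


def smoothed_probabilities(labels, categories, alpha=1.0):
    """Return {category: (n_i + alpha) / (sum_j n_j + k * alpha)}."""
    counts = Counter(labels)                      # missing categories count as 0
    k = len(categories)
    total = sum(counts[c] for c in categories)    # sum_j n_j over known categories
    return {c: (counts[c] + alpha) / (total + k * alpha) for c in categories}


# Hypothetical sample in which "virginica" was never observed.
categories = ["setosa", "versicolor", "virginica"]
observed = ["setosa"] * 3 + ["versicolor"] * 2

laplace = smoothed_probabilities(observed, categories, alpha=1.0)
kt = smoothed_probabilities(observed, categories, alpha=0.5)
```

With $\alpha = 1$ the unobserved category receives probability $1/8$ instead of zero, and the estimates still sum to one.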

Such an approach is equivalent to Bayesian estimation using a symmetric Dirichlet prior (as described in Wikipedia), i.e. one with equal parameters $(\alpha,...,\alpha)$, which for $\alpha = 1$ is the uniform prior $(1,...,1)$. You can use a Bayesian approach even if there is no zero-frequency problem, whenever you want to include some out-of-data information in your statistical model. In this case the posterior mean estimate of $p_i$ is

$$ E(p_i \mid n_1,...,n_k) = \frac{n_i + \alpha_i }{\sum_{j=1}^k (n_j + \alpha_j)} $$

where the $\alpha_i$ can be interpreted as assumed a priori "pseudocounts" for each event. In this case, the posterior distribution of the probabilities $(p_1,...,p_k)$ given the data is a Dirichlet distribution

$$ (p_1,...,p_k) \sim \mathrm{Dir}(n_1+\alpha_1,...,n_k+\alpha_k) $$
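A rough sketch of this update, using only the standard library: the Dirichlet parameters are the counts plus the pseudocounts, and the mean of that Dirichlet is $(n_i + \alpha_i)/\sum_j (n_j + \alpha_j)$. All function names here are hypothetical. The sampler relies on the standard construction of a Dirichlet draw as independent Gamma variates normalised by their sum, and the counts are the 50/50/50 split mentioned in the comments.

```python
import random


def dirichlet_posterior(counts, prior):
    """Dirichlet parameters n_i + alpha_i after observing the counts."""
    return [n + a for n, a in zip(counts, prior)]


def dirichlet_mean(params):
    """Mean of a Dirichlet distribution: each parameter divided by their sum."""
    total = sum(params)
    return [a / total for a in params]


def sample_dirichlet(params, rng):
    """One draw from Dir(params) via normalised Gamma(a, 1) variates."""
    draws = [rng.gammavariate(a, 1.0) for a in params]
    total = sum(draws)
    return [d / total for d in draws]


counts = [50, 50, 50]          # n_i for the three categories
prior = [1.0, 1.0, 1.0]        # uniform pseudocounts alpha_i (an assumption)
params = dirichlet_posterior(counts, prior)
mean = dirichlet_mean(params)
draw = sample_dirichlet(params, random.Random(0))   # seeded for repeatability
```

Repeated calls to `sample_dirichlet` give a picture of how uncertain the probabilities remain after seeing the data; with 150 records the draws stay close to 1/3 each.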

Of course, if your out-of-data knowledge suggests using an informative, non-uniform prior, you can use different values for the $\alpha_i$.

Schurmann, T., and P. Grassberger. (1996). Entropy estimation of symbol sequences. Chaos, 6, 414-427.
