I have a dataset consisting of an ordered series of categorical samples that are not independent of one another: every sample has an elevated chance of being equal to the sample before it, i.e. there is a tendency for 'streaks' of the same value. I want to take a subset of this dataset using a selection function, call it S(), and then draw from this subset a uniform sample of, say, 10% of all values in the original dataset. I then want to know whether the distribution of this result matches the distribution of the original dataset; in other words, I want to measure whether my selection function S() changes the distribution, or, put yet another way, whether my resulting selection is an accurate representation of the whole population.
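To make this concrete, here is a minimal toy sketch of the setup (the category labels, the streak probability, and the selection function are all made up for illustration; my real S() is more complicated):

```python
import random

# Toy generator for a 'streaky' categorical series: with probability
# p_repeat the next sample copies the previous one, otherwise a fresh
# category is drawn uniformly at random.
def streaky_series(n, categories=("A", "B", "C"), p_repeat=0.7, seed=42):
    rng = random.Random(seed)
    series = [rng.choice(categories)]
    for _ in range(n - 1):
        if rng.random() < p_repeat:
            series.append(series[-1])          # continue the streak
        else:
            series.append(rng.choice(categories))
    return series

data = streaky_series(10_000)

# S(): placeholder selection function -- say it keeps every sample
# at an even index (purely for illustration).
subset = [x for i, x in enumerate(data) if i % 2 == 0]

# Uniform sample of 10% of the *original* dataset's size.
rng = random.Random(0)
sample = rng.sample(subset, k=len(data) // 10)
```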
For a normally distributed quantity, I'd normally get the sample size for a given confidence level and margin of error from an online 'sample size calculator' and call it a day. But in the case described above I'm not quite sure how the pieces fit together: does it matter that my data is categorical? Does the order of the samples matter? Is this an appropriate case for the chi-square test of independence?
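Continuing from the snippet above, this is the kind of comparison I'm imagining; I've used a chi-square goodness-of-fit test here rather than the independence test, though I'm not sure which (if either) is appropriate, or whether the 'streakiness' invalidates it:

```python
from collections import Counter
from scipy.stats import chisquare

categories = sorted(set(data))
sample_counts = Counter(sample)
full_counts = Counter(data)

# Observed category counts in the sample vs. the counts we'd expect
# if the sample followed the full dataset's proportions exactly.
observed = [sample_counts[c] for c in categories]
expected = [full_counts[c] / len(data) * len(sample) for c in categories]

stat, p_value = chisquare(f_obs=observed, f_exp=expected)
print(f"chi2 = {stat:.2f}, p = {p_value:.3f}")
```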
I think my question reduces to 'how can I compare the distributions of two series of ordered categorical values?', but I'm not even sure that question makes sense, hence my admittedly clumsy description above. Can anyone clarify how I could go about this?