Here I quote an excerpt from an SPSS tutorial document on exact testing with nonparametric tests (such as the Mann-Whitney test).
When asymptotics break down
Modern statistical methods rely on the results of mathematical
statistics, which have established theorems and distributional results
that hold true “in the limit” for larger sample sizes, or
asymptotically. You should be concerned that asymptotic results may not
apply if:
- the sample size is small;
- data [e.g. crosstables] are sparse;
- data are skewed in distribution across groups; or
- data are heavily tied.
If any of these conditions hold for your data, the p-value at which you
actually operate can differ from the asymptotic p-value, and you can
come to incorrect conclusions.
While the above conditions may seem intuitively simple, it is
difficult to make recommendations in practice. It is tempting to try
to come up with a rule of thumb stating that some sample size is small
while another is large. Yet any such rule would be too simplistic,
because it ignores other elements of the data's structure, such as the
degree of imbalance.
For example, with regard to sparseness in contingency tables, two
commonly cited rules of thumb are:
- the minimum expected cell count for all cells should be at least 5;
- for tables larger than 2x2, a minimum expected count of 1 is permissible as long as no more than about 20 percent of the cells have
expected values below 5.
The problem is that these (and other) rules are sometimes unduly
conservative. Once again, it is difficult to find a rule that always
holds true. When faced with sparseness, some researchers collapse
categories to conform to the above rules. However, collapsing
categories cannot be recommended because it can seriously distort what
the data convey about associations.
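As a side note, here is a minimal sketch (in Python, not part of the SPSS document; the table is made up) of how one might check these two rules of thumb. scipy.stats.chi2_contingency conveniently returns the table of expected counts under independence:

    import numpy as np
    from scipy.stats import chi2_contingency

    # Hypothetical 3x3 contingency table of observed frequencies.
    observed = np.array([[12, 5, 2],
                         [ 9, 3, 1],
                         [ 4, 2, 1]])

    # chi2_contingency also returns the expected counts under independence.
    _, _, _, expected = chi2_contingency(observed)

    # Rule 1: every expected cell count is at least 5.
    rule1 = (expected >= 5).all()

    # Rule 2 (for tables larger than 2x2): minimum expected count of 1,
    # with no more than ~20 percent of cells below 5.
    rule2 = expected.min() >= 1 and (expected < 5).mean() <= 0.20

    print(expected.round(2), rule1, rule2, sep="\n")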
Skewness refers to imbalance in group sizes. When you perform studies
prospectively, you can sometimes ensure groups are balanced. When
studies are done retrospectively, or you are studying a relatively
rare event or phenomenon, you may have little control over balance. A
rule of thumb is the “80:20 Rule.” This rule assumes two things:
- data are relatively balanced if skewness is no more extreme than 80:20;
- this applies to every subgroup of interest.
When skewness is extreme, even seemingly large sample sizes might be
inadequate.
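For illustration (mine, not the tutorial's), checking the 80:20 rule against observed group sizes takes only a few lines:

    from collections import Counter

    # Hypothetical group labels, one per subject.
    groups = ["treated"] * 83 + ["control"] * 17

    counts = Counter(groups)
    largest_share = max(counts.values()) / sum(counts.values())

    # Flag the data as skewed if the largest group exceeds 80 percent;
    # by the rule, this check should be repeated in every subgroup.
    print(counts, "skewed:", largest_share > 0.80)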
A common reason for ties is measurement. In areas such as medical
studies or the social sciences, researchers use ordered items and
scales, and so end up lumping together subjects who would be distinct
if it were possible to measure them on some finer quantitative metric.
In these situations, one obvious remedy is to get more data.
However, because of time or cost, this is not always possible.
Instead, the methods of exact statistics have shown great promise.
Exact statistics
For a given data set and test situation, you can generate a reference
set of like data of which the observed data are a particular
realization. Having generated repeated realizations, you will know how
discrepant the observed data are relative to a “universe” of like
data. The fraction of possible realizations at least as discrepant as
the observed data generate an exact p-value.
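To make this concrete, here is a brute-force sketch in Python (my own illustration, not SPSS's algorithm) for a small Mann-Whitney-type setting. The reference set consists of every way of assigning the pooled observations to the two groups while keeping the group sizes fixed, and the exact one-sided p-value is the fraction of assignments whose rank sum is at least as extreme as the observed one:

    from itertools import combinations
    from scipy.stats import rankdata

    # Two small hypothetical samples.
    x = [1.3, 2.1, 4.0, 5.7]
    y = [0.9, 1.1, 1.8]

    pooled = x + y
    ranks = rankdata(pooled)        # midranks, so ties are handled
    n1 = len(x)

    # Observed statistic: the rank sum of the first sample.
    observed = ranks[:n1].sum()

    # Reference set: every choice of n1 of the pooled values for group x.
    ref = [sum(ranks[i] for i in c)
           for c in combinations(range(len(pooled)), n1)]

    # Exact one-sided p-value: fraction of realizations at least as
    # discrepant (here, with a rank sum at least as large) as observed.
    p_exact = sum(r >= observed for r in ref) / len(ref)
    print(p_exact)                  # 35 possible assignments here

On data without ties, this should agree with scipy.stats.mannwhitneyu(x, y, alternative="greater", method="exact"), since the rank sum and the U statistic are monotonically related.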
Generating an exact p-value is computationally intensive. Fortunately,
advances in statistical computing, coupled with advances in computing
power, have made it possible to quickly calculate exact p-values for
common statistical situations.
Note that exact test methodology does not necessarily rely on “brute
force” evaluation of all possible tables in a reference set, which
could take a very long time. Instead, sophisticated algorithms make it
possible to calculate a p-value by implicit, rather than explicit,
enumeration of the reference set. Other things being equal, exact
methods work faster on small data sets than on large ones, and on
large data sets even the known approaches may be too time-consuming.
For large and well-balanced data sets, however, asymptotic statistical
results apply.
When you want an exact p-value but it would take too long to compute,
you can conveniently generate a Monte Carlo interval that will contain
the exact p-value with specified confidence.
Although exact results are always reliable, some data sets are too
large for the exact p-value to be calculated, yet don't meet the
assumptions necessary for the asymptotic method. In this situation,
the Monte Carlo method provides an unbiased estimate of the exact
p-value, without the requirements of the asymptotic method. The Monte
Carlo method is a repeated-sampling method. For any observed table [a
crosstable of frequencies on which a nonparametric test is based],
there are many possible tables, each with the same dimensions and the
same row and column margins as the observed table. The Monte Carlo
method repeatedly samples a specified number of these possible tables
in order to obtain an unbiased estimate of the true p-value.
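Again as my own illustration (not SPSS's implementation), the Monte Carlo version of the sketch above replaces full enumeration with random sampling from the reference set, and attaches a confidence interval to the estimated p-value via the usual binomial-proportion formula. Here I use 10,000 samples and a 99% level:

    import math
    import random
    from scipy.stats import rankdata

    x = [1.3, 2.1, 4.0, 5.7]
    y = [0.9, 1.1, 1.8]

    ranks = list(rankdata(x + y))
    n1, B = len(x), 10_000
    observed = sum(ranks[:n1])

    # Sample B random members of the reference set by shuffling the
    # pooled ranks and treating the first n1 as group x.
    rng = random.Random(0)
    hits = 0
    for _ in range(B):
        rng.shuffle(ranks)
        hits += sum(ranks[:n1]) >= observed

    p_hat = hits / B        # unbiased estimate of the exact p-value

    # 99% confidence interval for the exact p-value: p_hat is a
    # binomial proportion, so a normal-approximation interval applies.
    half = 2.576 * math.sqrt(p_hat * (1 - p_hat) / B)
    print(f"p ~ {p_hat:.4f}, 99% CI ({max(0.0, p_hat - half):.4f}, "
          f"{min(1.0, p_hat + half):.4f})")

Increasing B shrinks the interval, which is why reports of Monte Carlo p-values state both the number of samples and the confidence level.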
Here is another quote from the SPSS documentation, showing the formulas for the Mann-Whitney exact and Monte Carlo exact significance estimation: