Mann-Whitney test with ties

Question

I have two data sets with PDFs roughly like this:

$$ p(x) = \left\{ \begin{array}{lr} .75 & x = 0\\ \text{Lomax}(x) & x > 0 \end{array} \right. $$

i.e. it's continuous, except there is a huge number of points at $x=0$.

I wish to tell if the values in one set are lower than the other. However there are a lot of ties in my data, which makes it kind of hard. It was suggested to me that I do something like: bootstrap a bunch and if a 95% confidence interval has x's being less than y's (as determined by Mann-Whitney U), we declare $X<Y$.

It's not really clear from the definition of U what I do about ties though. wilcox.test in R seems to break them evenly, e.g. when comparing $(0,0)$ to $(0,0)$ it says $U=2$. (This behavior seems to contradict its documentation though which says it returns "the number of all pairs (x[i], y[j]) for which y[j] is not greater than x[i].")

Is breaking them evenly the right thing to do here? I'm worried since I have such a huge fraction of ties that my intuition about a "95% confidence interval" doesn't actually correspond to 95% of my samples.

(this is basically a response to Procrastinator on my [old question](http://stats.stackexchange.com/questions/41526/hypothesis-testing-for-pareto-distributions) but I thought asking a new question better fit the SE form.) — Xodarap, Nov 02 '12 at 18:49
(+1) For spotting this. [Baklizi and Eidous (2006)](ftp://metron.sta.uniroma1.it/RePEc/articoli/2006-1-4.pdf) give a reference on this `The Mann- Whitney statistic (Lehmann, 1975) is defined as ... A version corrected for ties is also given in Lehmann (1975).` The reference is [Nonparametrics: Statistical Methods Based on Ranks](http://www.amazon.co.uk/Nonparametrics-Statistical-Methods-Based-Ranks/dp/0387352120). Unfortunately I have not had the chance to have a look at it. — , Nov 02 '12 at 19:12

score 3 · Accepted Answer · 2012-11-05T15:35:16.727

Answer to the question

First of all, it is important to notice that the quantities $P(X\leq Y)$ and $P(X\lt Y)$ are different, given that the variables are not continuous.

Let $X$ and $Y$ be two independent random variables whose distribution is a mixture of a discrete and a continous distribution such that $P(X=0)=p_1>0$ and $P(Y=0)=p_2>0$. Then by the law of total probability we have that

\begin{eqnarray*} P(X\leq Y)&=&P(X\leq Y \vert Y=0)P(Y=0)+P(X\leq Y \vert Y>0)P(Y>0)\\ &=& P(X=0)P(Y=0)+P(X\leq Y \vert Y>0)[1-P(Y=0)]\\ &=& p_1p_2 +P(\{X\leq Y\} \cap \{X=0 \cup X>0\} \vert Y>0)(1-p_2). \end{eqnarray*}

Now, $P(\{X\leq Y\} \cap \{X=0 \cup X>0\} \vert Y>0)=p_1+P(X\leq Y\vert X>0,Y>0)(1-p_1)$. With this, we obtain an expression for $\theta=P(X\leq Y)$ in terms of quantities that we can estimate. Note that

\begin{eqnarray*} P(X\leq Y\vert X>0,Y>0)=\int_0^{\infty}F_X(y)f_Y(y)dy, \end{eqnarray*}

where $F_X$ is the CDF of the continuos part $X$ and $f_Y$ is the PDF of the continuos part of $Y$ (In your case, a Lomax distribution).

Now, how to estimate the parameters? I am going to use nonlinear squares between the CDFs and the empirical CDFs. This method works in your case given the large sample size. Please, find below an R code for conducting this estimation using a simulated sample of size $n=1000$.

rm(list=ls())
p0 = 0.75
alpha0 = 3
lambda0 = 1

# Function for simulating a Lomax variable
rlomax = function(n,alpha,lambda) return( lambda*( (1-runif(n))^(-1/alpha) - 1 ))

# Simulated data, X and Y
set.seed(1)
ns = 1000
simx = simy = rep(0,ns)

for(i in 1:ns){
u = runif(1)
if(u<p0) simx[i] =  0
else simx[i] = rlomax(1,alpha0,lambda0)
}

for(i in 1:ns){
u = runif(1)
if(u<p0) simy[i] =  0
else simy[i] = rlomax(1,alpha0,lambda0)
}

hist(simx,col="red")
hist(simy,add=T,col="blue")

# Distribution function of the mixture

FM = function(x,p,alpha,lambda){
temp = x
for(i in 1:length(x)){
if(x[i]==0) temp[i]=p
if(x[i]>0) temp[i] = p + (1-p)*( 1-(1+x[i]/lambda)^(-alpha) )
}
return(temp)
}


ecdfdatx = ecdf(simx)(sort(simx))
ecdfdaty = ecdf(simy)(sort(simy))

Datax = data.frame(sort(simx),ecdfdatx)
Datay = data.frame(sort(simy),ecdfdaty)

# Fit for the first data set

nls_fitx = nls(ecdfdatx ~ FM(sort(simx),p,alpha,lambda), data=Datax, start =     list(p = 0.75, alpha = 3,  lambda = 1) )
nls_fitx
plot(ecdf(simx))
lines(sort(simx), predict(nls_fitx), col = "red")

# Fit for the second data set

nls_fity = nls(ecdfdaty ~ FM(sort(simy),p,alpha,lambda), data=Datax, start = list(p =     0.75, alpha = 3,  lambda = 1) )
nls_fity
plot(ecdf(simy))
lines(sort(simy), predict(nls_fity), col = "red")

With this code we obtain estimators of the parameters $(p_1,\alpha_X,\lambda_X,p_2,\alpha_Y,\lambda_Y)$. The remaining step consists of calculating $P(X\leq Y\vert X>0,Y>0)=\int_0^{\infty}F_X(y)f_Y(y)dy$.

# remaining quantity

px.h = coef(nls_fitx)[1]
py.h =coef(nls_fity)[1]
ax.h = coef(nls_fitx)[2]
ay.h = coef(nls_fity)[2] 
lx.h = coef(nls_fitx)[3] 
ly.h = coef(nls_fity)[3] 

# Lomax PDF
dlomax = function(x,alpha,lambda) return(alpha*(1+x/lambda)^(-(alpha+1))/lambda)

# Lomax CDF
plomax = function(x,alpha,lambda) return(1-(1+x/lambda)^(-alpha) )

tempf = function(x) plomax(x,ax.h,lx.h)*dlomax(x,ay.h,ly.h)

p.l = integrate(tempf,0,Inf)$value

# Estimator of theta
px.h + p.l*(1-px.h)*(1-py.h)

Similarly, the estimator of $P(X<Y)$ can be calculated as follows

# Estimator of theta2
p.l*(1-px.h)*(1-py.h)

Then the quantity $P(X\leq Y)$ depends on the probabilities $p_1$ and $p_2$ and therefore this quantity may be misleading. For instance, if $X$ and $Y$ are i.i.d. and $p_1,p_2\approx 1$, we have that $P(X\lt Y)\approx 0$ and $P(X \leq Y)\approx 1$.

My conclusion: The stress-strength coefficient is not what you need for comparing the performance of both companies.

How to solve the problem?

I think this problem can be seen as a decision problem. You have two companies providing a programming service and you want to decide which one is better. Consider the hypothetical case where one of the companies produce a large proportion of codes with zero errors but also that when a code contains errors, it is likely that the number of errors is large. Is this better than a company with a lower proportion of codes with zero errors but smaller positive errors?

A toy naive example. Suppose that your decision rule is based on the estimated proportions of 0-error codes as follows:

Estimate $p_1$ and $p_2$. If $\hat{p}_1/\hat{p}_2\in(0.9,1.1)$, then proceed to estimate $\theta = P(X<Y)$ using only the positive quantities. If a $95\%$ confidence interval for $\theta$ contains the value $0.5$, then there is no criterion for choosing one of the companies. If this value is not contained in the confidence interval, then choose Company X if $P(X<Y)>0.5$ or Company Y is $P(X<Y)<0.5$.
If the ratio of estimators $\hat{p}_1/\hat{p}_2<0.9$, then choose Company Y.
If the ratio of estimators $\hat{p}_1/\hat{p}_2>1.1$, then choose Company X.

This (naive, I repeat) criterion favours companies that produce more 0-error codes and proceeds to select one based on the stress-stress coefficient of the continous part when they seem to produce a similar proportion of 0-error codes.

In order to conduct a proper analysis, one would need to select a proper loss-function based on expert's opinion in order to come up with a reasonable selection criterion. This would require more effort and I think it would fall out of the scope of this site but I hope this answer gives you some help.

Some references of possible interest:

Statistical Decision Theory and Bayesian Analysis

Bayesian Theory

Bayesian Decision Analysis: Principles and Practice

It would also help to check the literature on Software quality control and see the critera adopted by some companies.

Thanks, this seems very clever! I'll definitely keep looking into the literature - since any method I choose is going to require as much art as science, having an expert on my side will surely help. — Xodarap, Nov 03 '12 at 20:44

Mann-Whitney test with ties

1 Answers1

Answer to the question

How to solve the problem?

Linked