Answer to the question
First of all, it is important to notice that the quantities $P(X\leq Y)$ and $P(X\lt Y)$ are different, given that the variables are not continuous.
Let $X$ and $Y$ be two independent random variables whose distribution is a mixture of a discrete and a continous distribution such that $P(X=0)=p_1>0$ and $P(Y=0)=p_2>0$. Then by the law of total probability we have that
\begin{eqnarray*}
P(X\leq Y)&=&P(X\leq Y \vert Y=0)P(Y=0)+P(X\leq Y \vert Y>0)P(Y>0)\\
&=& P(X=0)P(Y=0)+P(X\leq Y \vert Y>0)[1-P(Y=0)]\\
&=& p_1p_2 +P(\{X\leq Y\} \cap \{X=0 \cup X>0\} \vert Y>0)(1-p_2).
\end{eqnarray*}
Now, $P(\{X\leq Y\} \cap \{X=0 \cup X>0\} \vert Y>0)=p_1+P(X\leq Y\vert X>0,Y>0)(1-p_1)$. With this, we obtain an expression for $\theta=P(X\leq Y)$ in terms of quantities that we can estimate. Note that
\begin{eqnarray*}
P(X\leq Y\vert X>0,Y>0)=\int_0^{\infty}F_X(y)f_Y(y)dy,
\end{eqnarray*}
where $F_X$ is the CDF of the continuos part $X$ and $f_Y$ is the PDF of the continuos part of $Y$ (In your case, a Lomax distribution).
Now, how to estimate the parameters? I am going to use nonlinear squares between the CDFs and the empirical CDFs. This method works in your case given the large sample size. Please, find below an R code for conducting this estimation using a simulated sample of size $n=1000$.
rm(list=ls())
p0 = 0.75
alpha0 = 3
lambda0 = 1
# Function for simulating a Lomax variable
rlomax = function(n,alpha,lambda) return( lambda*( (1-runif(n))^(-1/alpha) - 1 ))
# Simulated data, X and Y
set.seed(1)
ns = 1000
simx = simy = rep(0,ns)
for(i in 1:ns){
u = runif(1)
if(u<p0) simx[i] = 0
else simx[i] = rlomax(1,alpha0,lambda0)
}
for(i in 1:ns){
u = runif(1)
if(u<p0) simy[i] = 0
else simy[i] = rlomax(1,alpha0,lambda0)
}
hist(simx,col="red")
hist(simy,add=T,col="blue")
# Distribution function of the mixture
FM = function(x,p,alpha,lambda){
temp = x
for(i in 1:length(x)){
if(x[i]==0) temp[i]=p
if(x[i]>0) temp[i] = p + (1-p)*( 1-(1+x[i]/lambda)^(-alpha) )
}
return(temp)
}
ecdfdatx = ecdf(simx)(sort(simx))
ecdfdaty = ecdf(simy)(sort(simy))
Datax = data.frame(sort(simx),ecdfdatx)
Datay = data.frame(sort(simy),ecdfdaty)
# Fit for the first data set
nls_fitx = nls(ecdfdatx ~ FM(sort(simx),p,alpha,lambda), data=Datax, start = list(p = 0.75, alpha = 3, lambda = 1) )
nls_fitx
plot(ecdf(simx))
lines(sort(simx), predict(nls_fitx), col = "red")
# Fit for the second data set
nls_fity = nls(ecdfdaty ~ FM(sort(simy),p,alpha,lambda), data=Datax, start = list(p = 0.75, alpha = 3, lambda = 1) )
nls_fity
plot(ecdf(simy))
lines(sort(simy), predict(nls_fity), col = "red")
With this code we obtain estimators of the parameters $(p_1,\alpha_X,\lambda_X,p_2,\alpha_Y,\lambda_Y)$. The remaining step consists of calculating $P(X\leq Y\vert X>0,Y>0)=\int_0^{\infty}F_X(y)f_Y(y)dy$.
# remaining quantity
px.h = coef(nls_fitx)[1]
py.h =coef(nls_fity)[1]
ax.h = coef(nls_fitx)[2]
ay.h = coef(nls_fity)[2]
lx.h = coef(nls_fitx)[3]
ly.h = coef(nls_fity)[3]
# Lomax PDF
dlomax = function(x,alpha,lambda) return(alpha*(1+x/lambda)^(-(alpha+1))/lambda)
# Lomax CDF
plomax = function(x,alpha,lambda) return(1-(1+x/lambda)^(-alpha) )
tempf = function(x) plomax(x,ax.h,lx.h)*dlomax(x,ay.h,ly.h)
p.l = integrate(tempf,0,Inf)$value
# Estimator of theta
px.h + p.l*(1-px.h)*(1-py.h)
Similarly, the estimator of $P(X<Y)$ can be calculated as follows
# Estimator of theta2
p.l*(1-px.h)*(1-py.h)
Then the quantity $P(X\leq Y)$ depends on the probabilities $p_1$ and $p_2$ and therefore this quantity may be misleading. For instance, if $X$ and $Y$ are i.i.d. and $p_1,p_2\approx 1$, we have that $P(X\lt Y)\approx 0$ and $P(X \leq Y)\approx 1$.
My conclusion: The stress-strength coefficient is not what you need for comparing the performance of both companies.
How to solve the problem?
I think this problem can be seen as a decision problem. You have two companies providing a programming service and you want to decide which one is better. Consider the hypothetical case where one of the companies produce a large proportion of codes with zero errors but also that when a code contains errors, it is likely that the number of errors is large. Is this better than a company with a lower proportion of codes with zero errors but smaller positive errors?
A toy naive example. Suppose that your decision rule is based on the estimated proportions of 0-error codes as follows:
- Estimate $p_1$ and $p_2$. If $\hat{p}_1/\hat{p}_2\in(0.9,1.1)$, then proceed to estimate $\theta = P(X<Y)$
using only the positive quantities. If a $95\%$ confidence interval for $\theta$ contains the value $0.5$, then there is no criterion for choosing one of the companies. If this value is not contained in the confidence interval, then choose Company X if $P(X<Y)>0.5$ or Company Y is $P(X<Y)<0.5$.
- If the ratio of estimators $\hat{p}_1/\hat{p}_2<0.9$, then choose Company Y.
- If the ratio of estimators $\hat{p}_1/\hat{p}_2>1.1$, then choose Company X.
This (naive, I repeat) criterion favours companies that produce more 0-error codes and proceeds to select one based on the stress-stress coefficient of the continous part when they seem to produce a similar proportion of 0-error codes.
In order to conduct a proper analysis, one would need to select a proper loss-function based on expert's opinion in order to come up with a reasonable selection criterion. This would require more effort and I think it would fall out of the scope of this site but I hope this answer gives you some help.
Some references of possible interest:
Statistical Decision Theory and Bayesian Analysis
Bayesian Theory
Bayesian Decision Analysis: Principles and Practice
It would also help to check the literature on Software quality control and see the critera adopted by some companies.