In this article by David Hand, an implicit distribution over the classification cost ratio is calculated for a specific dataset, resulting in a discrete distribution:

[Figure: plot of the resulting discrete distribution $w_G(c)$]

This is defined as

$$ w_G(c) = \pi_0 f_0(P_1^{-1}(c))\left\vert \frac{dP_1^{-1}(c)}{dc}\right\vert + \pi_1 f_1(P_1^{-1}(c))\left\vert \frac{dP_1^{-1}(c)}{dc}\right\vert$$

with

$$c= \Pr(1\vert T)= P_1(T)= \pi_1 f_1(T) / \left\{ \pi_0f_0(T) + \pi_1f_1(T)\right\}$$

and $$T(c_0,c_1)=\underset{t}{\text{arg min}}\left\{ c\pi_0(1-F_0(t)) + (1-c) \pi_1 F_1(t) \right\}$$

with $t$ being any threshold score, $T$ the threshold that minimizes the expected misclassification loss for a given cost ratio $c=\frac{c_0}{c_0+c_1}$, $f_0,f_1$ the pdfs of the score values for the normal and diseased groups (with corresponding cdfs $F_0,F_1$), and $\pi_0, \pi_1$ the proportions of normal and diseased cases.
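To make the definition concrete, here is a minimal sketch of the continuous case. It assumes two unit-variance Gaussian score densities with illustrative parameter values, so that $P_1$ is monotone and can be inverted in closed form. The helper names `P1`, `P1_inv`, `dP1_inv` and `w_G` are made up for this illustration and are not taken from the article.

```r
# Illustrative (assumed) setup: unit-variance Gaussian scores for both groups
pi0 <- 0.8; pi1 <- 0.2
mu0 <- 3;   mu1 <- 6;  sigma <- 1

# Mixture density of the scores: pi0 * f0(t) + pi1 * f1(t)
f_mix <- function(t) pi0 * dnorm(t, mu0, sigma) + pi1 * dnorm(t, mu1, sigma)

# P_1(t) = Pr(1 | t); for equal variances its logit is linear in t
P1 <- function(t) pi1 * dnorm(t, mu1, sigma) / f_mix(t)

# Closed-form inverse of P_1 and its derivative with respect to the cost ratio
P1_inv  <- function(cr) (mu0 + mu1) / 2 +
  sigma^2 / (mu1 - mu0) * (qlogis(cr) - log(pi1 / pi0))
dP1_inv <- function(cr) sigma^2 / (mu1 - mu0) / (cr * (1 - cr))

# w_G(c): the score mixture density pushed through the map c = P_1(T)
w_G <- function(cr) f_mix(P1_inv(cr)) * abs(dP1_inv(cr))

cr <- seq(0.01, 0.99, by = 0.01)
plot(cr, w_G(cr), type = "l", xlab = "c", ylab = "w_G(c)")
```

Under these assumptions $w_G(c)$ is a smooth curve, which is what makes the discrete spikes obtained from real data stand out.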

Can someone give me some mathematical equation, algorithm or pseudocode sketch as to how the peaks in the plot are generated?


To illustrate with an example from this related question:

# Install once if needed:
# install.packages(c('pROC', 'ROCR', 'Epi', 'hmeasure'))
library(pROC)
library(ROCR)
library(Epi)
library(hmeasure)   # provides HMeasure() and plotROC(), used further down

set.seed(561)

cost0 = 1   # Cost of mis-classifying a normal as having cancer in million $
cost1 = 10   # Cost of mis-classifying a cancer patient as normal (death?)

b = cost0 + cost1
c = cost0/(b)

n = 7000    # Total cases
pi0 =.8     # Percentage of normal
pi1 =.2     # Percentage of disease

# Actual values of the test for normals and disease (D higher test values)
testA_Normals = rnorm(n*pi0, mean=3, sd=1)
testA_Sick = rnorm(n*pi1, 6, 1)

# Determining a threshold based on cost 
# arg t min {Loss = cost0 * (1 - pnorm(t,3,1)) * pi0 + 
#            cost1 * pnorm(t,6,1) * pi1}

t = seq(0,10,0.0001)
loss <- cost0 * (1 - pnorm(t,3,1)) * pi0 + cost1 * pnorm(t,6,1) * pi1
Threshold = t[which.min(loss)]   # grid value with the smallest expected loss

hist(testA_Normals,border=F, xlim=c(0,10))
hist(testA_Sick,col=2,border=F, add=T)

abline(v=Threshold)

[Figure: overlaid histograms of the test values for normals and sick, with a vertical line at the cost-based threshold]
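As a cross-check on the grid search, the minimizer can also be written in closed form for this particular setup. Setting the derivative of the loss to zero gives $c_0\pi_0 f_0(t) = c_1\pi_1 f_1(t)$, and for two Gaussians with a common standard deviation this can be solved for $t$. The sketch below assumes exactly the parameters used above; `t_closed` and `t_grid` are names introduced here for illustration.

```r
cost0 <- 1;  cost1 <- 10
pi0 <- 0.8;  pi1 <- 0.2
mu0 <- 3;    mu1 <- 6;   sigma <- 1

# Closed form: solve cost0 * pi0 * f0(t) = cost1 * pi1 * f1(t) for t
# (valid only because both groups share the same sigma)
t_closed <- (mu0 + mu1) / 2 +
  sigma^2 / (mu1 - mu0) * log((cost0 * pi0) / (cost1 * pi1))

# Grid search over the same expected loss, for comparison
grid <- seq(0, 10, by = 1e-4)
loss <- cost0 * pi0 * (1 - pnorm(grid, mu0, sigma)) +
        cost1 * pi1 * pnorm(grid, mu1, sigma)
t_grid <- grid[which.min(loss)]

print(c(closed_form = t_closed, grid_search = t_grid))
```

The two values should agree up to the grid spacing.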

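# 'data' is not defined above; it is assumed here to hold the simulated labels and scores
data = data.frame(outcome = rep(c(0, 1), c(length(testA_Normals), length(testA_Sick))),
                  testA = c(testA_Normals, testA_Sick))
hmeas = HMeasure(data$outcome, data$testA)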
par(mfrow=c(2,2))
plotROC(hmeas,which=1)
plotROC(hmeas,which=2)
plotROC(hmeas,which=3)
plotROC(hmeas,which=4)

[Figure: the four plotROC panels produced by the code above]

Antoni Parellada

1 Answer

The discrete nature of the cost plot is the result of using the convex hull of the ROC curve to calculate the implicit cost distribution. This is done to circumvent concave regions of the ROC curve, where the cost function

$$Q(t;b,c)\overset{\Delta}{=} \left\{c\pi_0(1-F_0(t)) + (1-c)\pi_1 F_1(t)\right\}b$$

is not minimized by any threshold in the region, whatever the choice of $c=\frac{c_0}{c_0+c_1}$ (here $b=c_0+c_1$).

Between the two points at the lower ($L$) and upper ($U$) ends of a concave segment of the ROC curve, the convex hull is a straight line. Its slope is constant, so the whole segment corresponds to a single value of $c$, at which the thresholds $T_L$ and $T_U$ both minimize the cost function above:

$$c=\frac{\pi_1(F_1(T_L) -F_1(T_U))}{\pi_0(F_0(T_L) - F_0(T_U)) + \pi_1(F_1(T_L)-F_1(T_U))}.$$
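A short derivation of this value, using only the cost function defined earlier: at the $c$ attached to a hull segment, the two endpoint thresholds must give the same cost, so (the factor $b$ cancels)

$$c\pi_0(1-F_0(T_L)) + (1-c)\pi_1 F_1(T_L) = c\pi_0(1-F_0(T_U)) + (1-c)\pi_1 F_1(T_U).$$

Collecting the $F_0$ terms on the left and the $F_1$ terms on the right gives

$$c\,\pi_0\left(F_0(T_L)-F_0(T_U)\right) = (1-c)\,\pi_1\left(F_1(T_L)-F_1(T_U)\right),$$

and solving for $c$ yields the expression above. Read this way, $c$ is the share of class 1 among all cases whose scores fall between $T_L$ and $T_U$.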

This is all found here.

Quick illustration with very simple code:

[Figure: the two plotROC panels produced by the code below]

library(hmeasure)
n = 10
set.seed(1)
y = c(rep(1, n), rep(0, n))                       # true labels: n ones, then n zeros
scores = data.frame(Test_A = rnorm(2 * n, 0, 1))  # pure-noise scores, unrelated to y
out = HMeasure(true.class = y, scores = scores)

par(mfrow=c(1,2))
plotROC(out, which=1)
plotROC(out, which=3)

out
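The following base-R sketch shows one way the spikes could be reproduced by hand. It is not the hmeasure implementation. It assumes that higher scores indicate class 1 and that there are no tied scores; the helper `upper_hull` and the reading of each spike's mass as the share of observations lying under one hull segment are assumptions made for illustration. Working with cumulative counts instead of rates leaves the hull unchanged and keeps the arithmetic exact.

```r
set.seed(1)
n <- 10
y <- c(rep(1, n), rep(0, n))
s <- rnorm(2 * n)                       # pure-noise scores, as in the example above
N <- length(y)

# Sweep the threshold from high to low and count cases above it in each class
ord <- order(s, decreasing = TRUE)
k1 <- c(0, cumsum(y[ord] == 1))         # proportional to the true positive rate
k0 <- c(0, cumsum(y[ord] == 0))         # proportional to the false positive rate

# Upper convex hull of points already ordered from left to right
upper_hull <- function(x, y) {
  keep <- integer(0)
  for (i in seq_along(x)) {
    while (length(keep) >= 2) {
      a <- keep[length(keep) - 1]; b <- keep[length(keep)]
      # cross >= 0: point b lies on or below the chord from a to i, so drop it
      cross <- (x[b] - x[a]) * (y[i] - y[a]) - (y[b] - y[a]) * (x[i] - x[a])
      if (cross >= 0) keep <- keep[-length(keep)] else break
    }
    keep <- c(keep, i)
  }
  keep
}

h  <- upper_hull(k0, k1)
d0 <- diff(k0[h])                       # class-0 cases under each hull segment
d1 <- diff(k1[h])                       # class-1 cases under each hull segment

# One spike per hull segment: since pi_k * (change in F_k) equals (count)/N,
# the formula for c reduces to d1 / (d0 + d1)
peaks <- data.frame(cost_ratio = d1 / (d0 + d1),
                    mass       = (d0 + d1) / N)
print(peaks)
plot(peaks$cost_ratio, peaks$mass, type = "h", xlim = c(0, 1),
     xlab = "c", ylab = "mass")
```

Each row of `peaks` is one spike: its location is the value of $c$ from the formula above, and under the stated assumption its mass is the fraction of the sample covered by that hull segment. The heights drawn by `plotROC` may be scaled differently.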
Antoni Parellada