AUC-ROC is calculated by plotting the TPR against the FPR as the threshold τ is varied and computing the area under the resulting curve. But why is this area under the curve the same as this probability? Let's assume the following:
- A is the distribution of scores the model produces for data points that are actually in the positive class.
- B is the distribution of scores the model produces for data points that are actually in the negative class (we want this to be to the left of A).
- τ is the cutoff threshold. If a data point gets a score greater than this, it's predicted as belonging to the positive class. Otherwise, it's predicted to be in the negative class.
Note that the TPR (recall) is given by $P(A > \tau)$ and the FPR (fall-out) is given by $P(B > \tau)$.
Now, we plot the TPR on the y-axis and FPR on the x-axis, draw the curve for various τ and calculate the area under this curve (AUC).
We get:
$$\text{AUC} = \int_0^1 \text{TPR}(x)\,dx = \int_0^1 P(A > \tau(x))\,dx$$
where $x$ is the FPR.
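The construction above can be sketched numerically. This is a minimal illustration, not a reference implementation: the function names `roc_points` and `trapezoid_auc` and the six toy scores are made up for this example. It sweeps the threshold over the observed scores, records the (FPR, TPR) pair at each step, and sums the area with the trapezoid rule.

```python
import math


def roc_points(pos_scores, neg_scores):
    """Return (FPR, TPR) pairs as the threshold tau is lowered step by step."""
    # Start at the highest score (nothing is strictly above it) and finish
    # at -inf (everything is above it), so the curve runs from (0,0) to (1,1).
    thresholds = sorted(set(pos_scores) | set(neg_scores), reverse=True)
    thresholds.append(-math.inf)
    points = []
    for tau in thresholds:
        # A point is predicted positive when its score is greater than tau.
        tpr = sum(s > tau for s in pos_scores) / len(pos_scores)
        fpr = sum(s > tau for s in neg_scores) / len(neg_scores)
        points.append((fpr, tpr))
    return points


def trapezoid_auc(points):
    """Area under the piecewise-linear curve through the given points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical toy scores, chosen only for illustration.
pos = [0.9, 0.8, 0.35]   # scores of actual positives (samples of A)
neg = [0.7, 0.3, 0.1]    # scores of actual negatives (samples of B)

auc = trapezoid_auc(roc_points(pos, neg))
print(auc)  # about 0.889, i.e. 8/9
```

In this toy data, 8 of the 9 (positive, negative) pairs have the positive scored higher, which matches the area.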
Now, one way to calculate this integral is to consider $x$ as belonging to a uniform distribution. In that case, it simply becomes the expectation of the TPR:
$$\text{AUC} = E_x[P(A > \tau(x))] \tag{1}$$
if we consider $x \sim U[0,1)$.
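To see the integral and the expectation agree, here is a small sketch under an assumption that is not part of the argument itself: both score distributions are taken to be Gaussian (A with mean 1, B with mean 0, unit variance), purely so that the TPR and the threshold have closed forms. The names `tpr_at_fpr` and `draw_x` are hypothetical helpers.

```python
import random
from statistics import NormalDist, fmean

# Assumed score distributions, for illustration only.
A = NormalDist(mu=1.0, sigma=1.0)  # positive class
B = NormalDist(mu=0.0, sigma=1.0)  # negative class


def tpr_at_fpr(x):
    """TPR at the threshold whose FPR equals x (requires 0 < x < 1)."""
    tau = B.inv_cdf(1.0 - x)   # solves P(B > tau) = x
    return 1.0 - A.cdf(tau)    # P(A > tau)


n = 20_000

# Midpoint Riemann sum for the integral of TPR(x) over [0, 1].
riemann_auc = sum(tpr_at_fpr((i + 0.5) / n) for i in range(n)) / n

# Monte Carlo estimate of E_x[TPR(x)] with x drawn uniformly.
rng = random.Random(0)


def draw_x():
    # Clamp away from 0 and 1 because inv_cdf is undefined at the endpoints.
    return min(max(rng.random(), 1e-12), 1.0 - 1e-12)


expected_tpr = fmean(tpr_at_fpr(draw_x()) for _ in range(n))

print(riemann_auc, expected_tpr)  # the two values should be close
```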
Now, $x$ here was just the FPR:
$$x = \text{FPR} = P(B > \tau(x))$$
Since we considered $x$ to be from a uniform distribution,
$$P(B > \tau(x)) \sim U$$
$$\Rightarrow P(B < \tau(x)) \sim (1 - U) \sim U$$
$$\Rightarrow F_B(\tau(x)) \sim U \tag{2}$$
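But we know from the inverse transform law that for any random variable $X$, if $F_X(Y) \sim U$ then $Y \sim X$. This follows since taking any random variable and applying its own CDF to it leads to the uniform distribution. For a continuous $X$ with an invertible CDF and any $x \in [0, 1]$:
$$P(F_X(X) < x) = P(X < F_X^{-1}(x)) = F_X(F_X^{-1}(x)) = x,$$
and a CDF of this form holds only for the uniform distribution.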
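The claim that a random variable passed through its own CDF becomes uniform can be checked empirically. The sketch below assumes a standard normal $X$ (any continuous distribution would do; this choice is only for convenience) and compares the empirical CDF of $F_X(X)$ with the uniform CDF at a few points. `empirical_cdf` is a hypothetical helper.

```python
import random
from statistics import NormalDist

X = NormalDist(mu=0.0, sigma=1.0)  # assumed distribution, for illustration
rng = random.Random(1)

# Draw samples of X and push each one through X's own CDF.
u = [X.cdf(rng.gauss(X.mean, X.stdev)) for _ in range(50_000)]


def empirical_cdf(values, x):
    """Fraction of values strictly below x."""
    return sum(v < x for v in values) / len(values)


# For a uniform variable on [0, 1], P(U < x) = x.
checks = {x: empirical_cdf(u, x) for x in (0.1, 0.25, 0.5, 0.75, 0.9)}
for x, p in checks.items():
    print(f"P(F_X(X) < {x}) is roughly {p:.3f}")
```

Each printed fraction should land close to the corresponding `x`, up to sampling noise.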
Using this fact in equation (2) gives:
$$\tau(x) \sim B$$
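As a quick sanity check of this step, one can draw $x$ uniformly, solve $P(B > \tau) = x$ for $\tau$, and look at how the resulting thresholds are distributed. The sketch assumes a standard normal $B$, which is an illustrative choice and not something the argument requires. Matching mean and standard deviation is only a rough consistency check, not a proof of equality in distribution.

```python
import random
from statistics import NormalDist, fmean, pstdev

B = NormalDist(mu=0.0, sigma=1.0)  # assumed negative-class distribution
rng = random.Random(2)

taus = []
for _ in range(50_000):
    # x ~ U(0, 1), clamped because inv_cdf is undefined at 0 and 1.
    x = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
    # Threshold with FPR equal to x: F_B(tau) = 1 - x.
    taus.append(B.inv_cdf(1.0 - x))

tau_mean = fmean(taus)
tau_std = pstdev(taus)
print(tau_mean, tau_std)  # should be close to B.mean and B.stdev
```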
Substituting this into equation (1), we get:
$$\text{AUC} = E_x[P(A > B)] = P(A > B)$$
In other words, the area under the curve is the probability that a random positive sample has a higher score than a random negative sample.
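The final identity can also be checked on simulated scores. The sketch below is an illustration under stated assumptions: scores are drawn from two Gaussians (a hypothetical choice), there are no tied scores, and `pairwise_probability` and `roc_auc` are names invented for this example. One function counts the fraction of (positive, negative) pairs in which the positive scores higher; the other accumulates the area under the empirical ROC curve while lowering the threshold.

```python
import random


def pairwise_probability(pos, neg):
    """Fraction of (positive, negative) pairs with the positive scored higher."""
    wins = sum(a > b for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


def roc_auc(pos, neg):
    """Area under the empirical ROC curve, assuming no tied scores."""
    labeled = sorted([(s, 1) for s in pos] + [(s, 0) for s in neg], reverse=True)
    tp = 0
    area = 0.0
    for _, is_pos in labeled:
        if is_pos:
            tp += 1  # curve steps up: TPR increases by 1/len(pos)
        else:
            # Curve steps right by 1/len(neg) at the current TPR height.
            area += (tp / len(pos)) * (1 / len(neg))
    return area


rng = random.Random(3)
pos = [rng.gauss(1.0, 1.0) for _ in range(200)]  # simulated samples of A
neg = [rng.gauss(0.0, 1.0) for _ in range(300)]  # simulated samples of B

p_a_gt_b = pairwise_probability(pos, neg)
auc = roc_auc(pos, neg)
print(p_a_gt_b, auc)  # the two numbers agree up to floating-point error
```

On a finite sample the agreement is exact rather than approximate: each negative contributes a strip whose height is the share of positives ranked above it, which is the same count the pairwise comparison makes.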