I have a test set of 100 cases and two classifiers.

I generated predictions and computed the ROC AUC, sensitivity and specificity for both classifiers.

Question 1: How can I calculate a p-value to check whether one is significantly better than the other with respect to all scores (ROC AUC, sensitivity, specificity)?

Now, for the same test set of 100 cases, I have different and independent feature assignments for each case. This is because my features are fixed but subjective, and are provided by multiple (5) subjects.

So I evaluated my two classifiers again on 5 "versions" of my test set and obtained 5 ROC AUCs, 5 sensitivities and 5 specificities for both classifiers. Then I computed the mean of each performance measure over the 5 subjects (mean ROC AUC, mean sensitivity and mean specificity) for both classifiers.

Question 2: How can I calculate a p-value to check whether one is significantly better than the other with respect to the mean scores (mean ROC AUC, mean sensitivity, mean specificity)?

Answers with example Python (preferably) or MatLab code are more than welcome.

— kostek

Compare precision, accuracy and AUC directly to pick the better of the two classifiers. A p-value makes no sense here. The p-value is used in the context of assessing whether the model performs better

First, I disagree that comparing two performance measures with a p-value makes no sense here. I see that one classifier has an AUC of 0.80 and the other 0.85. My null hypothesis would be that there is no difference in the performance of the two classifiers. I want to know whether the difference is statistically significant.

— Kostek

Second, I am not making 5 versions of my model. I trained two models on a separate training set and am now evaluating them on 5 different "versions" of my test set. I have a mean performance for both classifiers (e.g. 0.81 AUC and 0.84 AUC) and want to check whether the difference is statistically significant.

— Kostek

I wouldn't say that what I'm doing is close to cross-validation. In my case, the feature values depend on the subject who provides them. I know that AUC can be used to compare models, but I want to know whether the result of my comparison is statistically significant in my setting. I'm sure that this is possible and makes sense. My question is how to do it.

— Kostek

I'm not sure what @Nishad is getting at. You can and should use a hypothesis test to determine whether your models differ significantly from each other. The standard deviations of your metrics exist and shrink as the sample size grows (all else being equal). An AUC difference between 0.8 and 0.9 might not be significant if you have only 10 samples, but might be very significant if you have 10 million samples. I also don't see any relation to cross-validation. Would downvote the comments if I could.

— Nuclear Wang

Wojtek J. Krzanowski and David J. Hand's ROC Curves for Continuous Data (2009) is a great reference for all things related to ROC curves. It collects a number of results from a frustratingly broad literature base, which often uses different terminology to discuss the same topic.

In addition, this book offers commentary on and comparisons of alternative methods that have been derived to estimate the same quantities, and points out that some methods make assumptions that may be untenable in particular contexts. This is one such context; other answers report the Hanley & McNeil method, which assumes the binormal model for the distributions of scores. That may be inappropriate in cases where the distribution of class scores is not (close to) normal. The assumption of normally distributed scores seems especially inappropriate in modern machine-learning contexts: typical common models such as xgboost tend to produce scores with a "bathtub" distribution for classification tasks (that is, distributions with high densities in the extremes near 0 and 1).

Question 1 - AUC

Section 6.3 discusses comparisons of ROC AUC for two ROC curves (pp. 113-114). In particular, my understanding is that these two models are correlated, so the information about how to compute $r$ is critically important here; otherwise, your test statistic will be biased because it does not account for the contribution of the correlation.

For the case of uncorrelated ROC curves not based on any parametric distributional assumptions, statistics for tests and confidence intervals comparing AUCs can be straightforwardly based on estimates $\widehat{\text{AUC}}_1$ and $\widehat{\text{AUC}}_2$ of the AUC values, and estimates of their standard deviations $S_1$ and $S_2$, as given in section 3.5.1:

$Z = \frac{\widehat{\text{AUC}}_1 - \widehat{\text{AUC}}_2}{\sqrt{S_1^2 + S_2^2}}$

To extend such tests to the case where the same data is used for both classifiers, we need to take account of the correlation between the AUC estimates:

$z=\frac{\widehat{\text{AUC}}_1 - \widehat{\text{AUC}}_2}{\sqrt{S_1^2 + S_2^2 - 2rS_1S_2}}$

where $r$ is the estimate of this correlation. Hanley and McNeil (1983) made such an extension, basing their analysis on the binormal case, but only gave a table showing how to calculate the estimated correlation coefficient $r$ from the correlation $r_P$ of the two classifiers within class P and the correlation $r_N$ of the two classifiers within class N, saying that the mathematical derivation was available on request. Various other authors (e.g. Zou, 2001) have developed tests based on the binormal model, assuming that an appropriate transformation can be found that simultaneously transforms the score distributions of classes P and N to normal.

DeLong et al. (1988) used the identity between AUC and the Mann-Whitney test statistic, together with results from the theory of generalized $U$-statistics due to Sen (1960), to derive an estimate of the correlation between the AUCs that does not rely on the binormal assumption. In fact, DeLong et al. (1988) presented the following results for comparisons between $k\ge 2$ classifiers.

In section 3.5.1, we showed that the area under the empirical ROC curve is equal to the Mann-Whitney $U$-statistic, and is given by

$\widehat{AUC}=\frac{1}{n_N n_P} \sum_{i=1}^{n_P} \sum_{j=1}^{n_N} \left[ I(s_{P_i} > s_{N_j}) + \frac{1}{2}I(s_{P_i} = s_{N_j}) \right]$

where $s_{P_i}, i = 1, \dots,n_P$ are the scores for the class $P$ objects and $s_{N_j}, j = 1, \dots,n_N$ are the scores for the class $N$ objects in the sample. Suppose that we have $k$ classifiers, yielding scores $s^r_{N_j}, j=1, \dots, n_N$ and $s_{P_i}^r, i = 1, \dots,n_P$ [I corrected an indexing error in this part - Sycorax], and $\widehat{AUC}_r, r = 1, \dots, k$. Define

$V^r_{10}(s_{P_i})=\frac{1}{n_N}\sum_{j=1}^{n_N} \left[ I(s_{P_i}^r > s_{N_j}^r) + \frac{1}{2}I(s_{P_i}^r = s_{N_j}^r) \right] , i=1,\dots,n_P$

and

$V^r_{01}(s_{N_j}) = \frac{1}{n_P}\sum_{i=1}^{n_P} \left[ I(s_{P_i}^r > s_{N_j}^r) + \frac{1}{2}I(s_{P_i}^r = s_{N_j}^r) \right] , j=1,\dots,n_N$

Next, define the $k \times k$ matrix $\mathbf{W}_{10}$ with $(r,s)$th element

$w_{10}^{r,s} = \frac{1}{n_P - 1}\sum_{i=1}^{n_P} \left[ V_{10}^r(s_{P_i}) - \widehat{AUC}_r \right] \left[ V_{10}^s(s_{P_i}) - \widehat{AUC}_s \right]$

and the $k \times k$ matrix $\mathbf{W}_{01}$ with $(r,s)$th element

$w_{01}^{r,s} = \frac{1}{n_N - 1}\sum_{j=1}^{n_N} \left[ V_{01}^r(s_{N_j}) - \widehat{AUC}_r \right] \left[ V_{01}^s(s_{N_j}) - \widehat{AUC}_s \right]$

Then the estimated covariance matrix for the vector $(\widehat{AUC}_1, \dots, \widehat{AUC}_k)$ of the estimated areas under the curves is

$\mathbf{W} = \frac{1}{n_P}\mathbf{W}_{10} + \frac{1}{n_N}\mathbf{W}_{01}$

with elements $w^{r,s}$. This is a generalization of the result for the estimated variance of a single estimated AUC, also given in section 3.5.1. In the case of two classifiers, the estimated correlation $r$ between the estimated AUCs is thus given by $\frac{w^{1,2}}{\sqrt{w^{1,1}w^{2,2}}}$, which can be used in $z$ above.
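The DeLong construction quoted above can be sketched in plain Python for the two-classifier case. This is a minimal illustration under assumptions, not a reference implementation: the function names (`delong_components`, `delong_test`) and the toy scores are made up for this example, the paired lists must be aligned so that `pos1[i]` and `pos2[i]` are the two models' scores for the same class-P case, and the variance of the difference is taken as $w^{1,1} + w^{2,2} - 2w^{1,2}$.

```python
import math
from statistics import NormalDist


def _psi(p, n):
    """Mann-Whitney kernel: 1 if p > n, 0.5 if tied, 0 otherwise."""
    if p > n:
        return 1.0
    if p == n:
        return 0.5
    return 0.0


def delong_components(pos, neg):
    """Return (AUC, V10, V01) for one classifier's class-P and class-N scores."""
    v10 = [sum(_psi(p, n) for n in neg) / len(neg) for p in pos]
    v01 = [sum(_psi(p, n) for p in pos) / len(pos) for n in neg]
    return sum(v10) / len(v10), v10, v01


def _cov(x, mean_x, y, mean_y):
    """Covariance with an n - 1 denominator, centred on the supplied means."""
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (len(x) - 1)


def delong_test(pos1, neg1, pos2, neg2):
    """Two-sided test of equal AUCs for two classifiers scored on the same cases."""
    auc1, v10_1, v01_1 = delong_components(pos1, neg1)
    auc2, v10_2, v01_2 = delong_components(pos2, neg2)
    n_p, n_n = len(pos1), len(neg1)

    # Elements of W = W10 / n_P + W01 / n_N for k = 2 classifiers.
    w11 = _cov(v10_1, auc1, v10_1, auc1) / n_p + _cov(v01_1, auc1, v01_1, auc1) / n_n
    w22 = _cov(v10_2, auc2, v10_2, auc2) / n_p + _cov(v01_2, auc2, v01_2, auc2) / n_n
    w12 = _cov(v10_1, auc1, v10_2, auc2) / n_p + _cov(v01_1, auc1, v01_2, auc2) / n_n

    r = w12 / math.sqrt(w11 * w22)                      # estimated correlation
    z = (auc1 - auc2) / math.sqrt(w11 + w22 - 2.0 * w12)
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))    # two-sided
    return auc1, auc2, r, z, p_value


# Toy scores (hypothetical): 5 class-P and 5 class-N cases, two models.
pos1 = [0.9, 0.8, 0.7, 0.6, 0.55]
neg1 = [0.5, 0.65, 0.3, 0.2, 0.1]
pos2 = [0.8, 0.4, 0.6, 0.7, 0.3]
neg2 = [0.5, 0.35, 0.45, 0.2, 0.65]

auc1, auc2, r, z, p_value = delong_test(pos1, neg1, pos2, neg2)
print(auc1, auc2, r, z, p_value)
```

The double loop costs $O(n_P n_N)$ per classifier, which is negligible for 100 test cases; for much larger test sets a rank-based formulation would be the usual choice.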

Since another answer gives the Hanley and McNeil expressions for estimators of AUC variance, here I'll reproduce the DeLong estimator from p. 68:

The alternative approach due to DeLong et al (1988) and exemplified by Pepe (2003) gives perhaps a simpler estimate, and one that introduces the extra useful concept of a placement value. The placement value of a score $s$ with reference to a specified population is that population's survivor function at $s$ . Thus the placement value for $s$ in population N is $1 - F(s)$ and for $s$ in population P it is $1 - G(s)$ . Empirical estimates of placement values are given by the obvious proportions. Thus the placement value of observation $s_{N_i}$ in population P denoted $s^P_{N_i}$ , is the proportion of sample values from P that exceed $s_{N_i}$ , and $\text{var}(s_{N_i}^P)$ is the variance of the placement values of each observation from N with respect to population P...

The DeLong et al (1988) estimate of variance of $\widehat{AUC}$ is given in terms of these variances:
$s^2(\widehat{AUC}) = \frac{1}{n_P} \text{var}\left(s_{P_i}^N\right) + \frac{1}{n_N}\text{var}\left(s_{N_i}^P\right)$

Note that $F$ is the cumulative distribution function of the scores in population N and $G$ is the cumulative distribution function of the scores in population P. A standard way to estimate $F$ and $G$ is to use the ecdf. The book also provides some alternative methods to the ecdf estimates, such as kernel density estimation, but that is outside the scope of this answer.
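As a rough illustration of the placement-value form of the variance, the sketch below estimates the survivor functions with the ecdf, as described above. The helper names and the toy scores are assumptions for this example, the variances use the usual $n - 1$ denominator, and tied scores are assumed absent (with ties, a half-weight correction as in the Mann-Whitney kernel would be needed).

```python
from statistics import variance


def placement_values(scores, reference):
    """Empirical survivor function of `reference`, evaluated at each score."""
    return [sum(1 for r in reference if r > s) / len(reference) for s in scores]


def delong_auc_variance(pos, neg):
    """DeLong variance of a single AUC from placement values (no ties assumed)."""
    pv_pos_in_neg = placement_values(pos, neg)  # s_{P_i}^N
    pv_neg_in_pos = placement_values(neg, pos)  # s_{N_i}^P
    return variance(pv_pos_in_neg) / len(pos) + variance(pv_neg_in_pos) / len(neg)


# Toy scores (hypothetical): 5 class-P and 5 class-N cases.
pos = [0.9, 0.8, 0.7, 0.6, 0.55]
neg = [0.5, 0.65, 0.3, 0.2, 0.1]

auc_variance = delong_auc_variance(pos, neg)
print(auc_variance, auc_variance ** 0.5)  # variance and standard deviation S
```

The square root of this quantity plays the role of $S_1$ or $S_2$ in the test statistics.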

The statistics $Z$ and $z$ may be assumed to be standard normal deviates, and statistical tests of the null hypothesis proceed in the usual way. (See also: hypothesis-testing)

This is a simplified, high-level outline of how hypothesis testing works:

Testing, in your words, "whether one classifier is significantly better than the other" can be rephrased as testing the null hypothesis that the two models have statistically equal AUCs against the alternative hypothesis that the statistics are unequal.
This is a two-tailed test.
We reject the null hypothesis if the test statistic is in the critical region of the reference distribution, which is a standard normal distribution in this case.
The size of the critical region depends on the level $\alpha$ of the test. For a significance level of $\alpha = 5\%$ (that is, 95% confidence), the test statistic falls in the critical region if $z > 1.96$ or $z < -1.96$ . (These are the $1 - \alpha/2$ and $\alpha/2$ quantiles of the standard normal distribution.) Otherwise, you fail to reject the null hypothesis and the two models are statistically tied.
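The steps in this outline can be turned into a few lines of Python using only the standard library. This is a small sketch; the function name `two_sided_p_value` and the example value of $z$ are placeholders chosen for illustration.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def two_sided_p_value(z):
    """Probability of a standard normal deviate at least as extreme as z."""
    return 2.0 * (1.0 - STD_NORMAL.cdf(abs(z)))


alpha = 0.05
z_crit = STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)  # upper critical value, about 1.96

z = 2.3  # hypothetical test statistic
p = two_sided_p_value(z)
reject_null = abs(z) > z_crit  # equivalent to p < alpha
print(z_crit, p, reject_null)
```

Comparing $|z|$ with the critical value and comparing the p-value with $\alpha$ lead to the same decision; the p-value is simply more informative to report.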

Question 1 - Sensitivity and Specificity

The general strategy for comparing sensitivity and specificity is to observe that both of these statistics amount to performing statistical inference on proportions, and this is a standard, well-studied problem. Specifically, sensitivity is the proportion of population P that has a score greater than some threshold $t$ , and likewise for specificity wrt population N:

$\begin{aligned} \text{sensitivity} = tp &= \mathbb{P}(s_P > t) \\ 1 - \text{specificity} = fp &= \mathbb{P}(s_N > t) \end{aligned}$

The main sticking point is developing the appropriate test given that the two sample proportions will be correlated (as you've applied two models to the same test data). This is addressed on p. 111.

Turning to particular tests, several summary statistics reduce to proportions for each curve, so that standard methods for comparing proportions can be used. For example, the value of $tp$ for fixed $fp$ is a proportion, as is the misclassification rate for fixed threshold $t$ . We can thus compare curves, using these measures, by means of standard tests to compare proportions. For example, in the unpaired case, we can use the test statistic $(tp_1 - tp_2) / s_{12}$ , where $tp_i$ is the true positive rate for curve $i$ at the point in question, and $s_{12}^2$ is the sum of the variances of $tp_1$ and $tp_2$ ...

For the paired case, however, one can derive an adjustment that allows for the covariance between $tp_1$ and $tp_2$ , but an alternative is to use McNemar's test for correlated proportions (Marascuilo and McSweeney, 1977).

McNemar's test is appropriate when you have $N$ subjects, and each subject is tested twice, once for each of two dichotomous outcomes. Given the definitions of sensitivity and specificity, it should be clear that this is exactly the test that we seek, since you've applied two models to the same test data and computed sensitivity and specificity at some threshold.

The McNemar test uses a different statistic, but a similar null and alternative hypothesis. For example, considering sensitivity, the null hypothesis is that the proportion $tp_1 = tp_2$ , and the alternative is $tp_1 \neq tp_2$ . Re-arranging the proportions to instead be raw counts, we can write a contingency table

$\begin{array}{c|c|c|} & \text{Model 1 Positive at } t & \text{Model 1 Negative at } t \\ \hline \text{Model 2 Positive at } t & a & b \\ \hline \text{Model 2 Negative at } t & c & d \\ \hline \end{array}$

where cell counts are given by counting the true positives and false negatives according to each model

$\begin{aligned} a &= \sum_{i=1}^{n_P} I(s_{P_i}^1 > t) \cdot I(s_{P_i}^2 > t) \\ b &= \sum_{i=1}^{n_P} I(s_{P_i}^1 \le t) \cdot I(s_{P_i}^2 > t) \\ c &= \sum_{i=1}^{n_P} I(s_{P_i}^1 > t) \cdot I(s_{P_i}^2 \le t) \\ d &= \sum_{i=1}^{n_P} I(s_{P_i}^1 \le t) \cdot I(s_{P_i}^2 \le t) \end{aligned}$

and we have the test statistic

$M = \frac{(b-c)^2}{b + c}$

which is distributed as $\chi^2_1$ , a chi-squared distribution with 1 degree of freedom. At a significance level of $\alpha=5\%$ , the null hypothesis is rejected for $M > 3.841459$ .

For the specificity, you can use the same procedure, except that you replace the $s^r_{P_i}$ with the $s^r_{N_j}$ .
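The McNemar procedure described above might look like the following in Python. This is a hedged sketch: the function names and toy scores are invented for illustration, the two score lists must be aligned case by case, and the p-value uses the identity that a $\chi^2_1$ variable is the square of a standard normal deviate, so $\mathbb{P}(\chi^2_1 > M) = \operatorname{erfc}(\sqrt{M/2})$. No continuity correction is applied.

```python
import math


def mcnemar_counts(scores1, scores2, t):
    """Cross-tabulate two models' calls at threshold t on the same cases."""
    a = b = c = d = 0
    for s1, s2 in zip(scores1, scores2):
        positive1, positive2 = s1 > t, s2 > t
        if positive1 and positive2:
            a += 1
        elif positive2:        # model 1 negative, model 2 positive
            b += 1
        elif positive1:        # model 1 positive, model 2 negative
            c += 1
        else:
            d += 1
    return a, b, c, d


def mcnemar_test(b, c):
    """Return (M, p-value) from the discordant counts b and c."""
    if b + c == 0:
        return 0.0, 1.0  # no discordant pairs: no evidence of a difference
    m = (b - c) ** 2 / (b + c)
    return m, math.erfc(math.sqrt(m / 2.0))


# Sensitivity: pass the scores of the class-P cases only (toy values).
pos_scores_model1 = [0.9, 0.2, 0.8, 0.4]
pos_scores_model2 = [0.7, 0.6, 0.3, 0.1]
a, b, c, d = mcnemar_counts(pos_scores_model1, pos_scores_model2, t=0.5)
m_stat, p_value = mcnemar_test(b, c)
print((a, b, c, d), m_stat, p_value)
```

For specificity, the same two functions would be called with the class-N scores instead. When $b + c$ is small, the chi-squared approximation is rough and an exact binomial version of the test is usually preferred.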

Question 2

It seems that it is sufficient to merge the results by averaging the prediction values for each respondent, so that for each model you have 1 vector of 100 averaged predicted values. Then compute the ROC AUC, sensitivity and specificity statistics as usual, as if the original models didn't exist. This reflects a modeling strategy that treats each of the 5 respondents' models as one of a "committee" of models, sort of like an ensemble.

— Sycorax says Reinstate Monica

Thanks for your answer and provided references. What about p-values for sensitivity and specificity?

— kostek

For Q1, does it mean that there is no difference between computing p-value for sensitivity and specificity and that they both always have the same p-value and I simply make a contingency table and run McNemar test on it?

— kostek

No, you’d do one test for each.

— Sycorax says Reinstate Monica

That is a very detailed answer, thank you. About the McNemar test: what exactly are $a,b,c,d$ ? What proportions are these?

— Drey

@Drey They're not proportions; they're counts. I make this explicit in a revision.

— Sycorax says Reinstate Monica

Let me keep the answer short, because this guide does explain a lot more and better.

Basically, you have the number of truly positive cases ( $nTP$ ) and the number of truly negative cases ( $nTN$ ). You also have your AUC, $A$. The standard error of $A$ is:

$\texttt{SE}_A = \sqrt{\frac{A(1-A) + (nTP-1)(Q_1 - A^2)+(nTN-1)(Q_2 - A^2)}{nTP \cdot nTN}}$

with $Q_1 = A / (2 - A)$ and $Q_2 = 2A^2 / (1 + A)$ .

To compare two AUCs you need to compute the SE of them both using:

$\texttt{SE}_{A_1 - A_2} = \sqrt{(SE_{A_1})^2 + (SE_{A_2})^2 - 2r\cdot (SE_{A_1})(SE_{A_2})}$

where $r$ is a quantity that represents the correlation induced between the two areas by the study of the same set of cases. If your cases are different, then $r=0$ ; otherwise you need to look it up (Table 1, page 3 in freely available article).

Given that, you compute the $z$ -score by

$z = (A_1 - A_2) / SE_{A_1-A_2}$

From there you can compute the p-value using the cumulative distribution function of a standard normal distribution. Or simply use this calculator.
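The formulas above translate fairly directly into Python. The sketch below is only an illustration: the function names are made up, and the value `r=0.5` is a placeholder, since the real $r$ has to be looked up in the Hanley and McNeil table mentioned above (or set to 0 for independent test sets).

```python
import math
from statistics import NormalDist


def hanley_mcneil_se(auc, n_pos, n_neg):
    """Standard error of an AUC; n_pos / n_neg are the positive / negative case counts."""
    q1 = auc / (2.0 - auc)
    q2 = 2.0 * auc ** 2 / (1.0 + auc)
    numerator = (auc * (1.0 - auc)
                 + (n_pos - 1) * (q1 - auc ** 2)
                 + (n_neg - 1) * (q2 - auc ** 2))
    return math.sqrt(numerator / (n_pos * n_neg))


def compare_aucs(auc1, auc2, n_pos, n_neg, r=0.0):
    """Return (z, two-sided p-value) for the difference between two AUCs."""
    se1 = hanley_mcneil_se(auc1, n_pos, n_neg)
    se2 = hanley_mcneil_se(auc2, n_pos, n_neg)
    se_diff = math.sqrt(se1 ** 2 + se2 ** 2 - 2.0 * r * se1 * se2)
    z = (auc1 - auc2) / se_diff
    return z, 2.0 * (1.0 - NormalDist().cdf(abs(z)))


# Hypothetical numbers: 50 positive and 50 negative cases, AUCs 0.85 vs 0.80.
se_a1 = hanley_mcneil_se(0.85, 50, 50)
z, p_value = compare_aucs(0.85, 0.80, 50, 50, r=0.5)  # r = 0.5 is a placeholder
print(se_a1, z, p_value)
```

Note how strongly the result depends on $r$: ignoring a positive correlation inflates the standard error of the difference and makes the test conservative.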

This hopefully answers Question 1. - at least the part comparing AUCs. Sens/Spec is already covered by the ROC/AUC in some way. Otherwise, the answer I think lies in the Question 2.

As for Question 2, Central Limit Theorem tells us that your summary statistic would follow a normal distribution. Hence, I would think a simple t-test would suffice (5 measures of one classifier against 5 measures of the second classifier where measures could be AUC, sens, spec)

Edit: corrected formula for $\texttt{SE}$ ( $\ldots - 2r \ldots$ )

— Drey

Thanks for provided links. For Question 1, If I set A to be sensitivity or specificity, would the equations for SE and z-Score hold?

— kostek

No, because sens only handles TPs and spec handles TNs. It is possible to compute confidence intervals for sens/spec with a binomial proportion CI, but be vigilant (small sample size?). Your $\hat{p}$ would be sens or spec. If the CIs overlap in your comparison, then the difference would not be statistically significant at that alpha level.

— Drey

For Question 1, @Sycorax provided a comprehensive answer.

For Question 2, to the best of my knowledge, averaging predictions from subjects is incorrect. I decided to use bootstrapping to compute p-values and compare models.

In this case, the procedure is as follows:

For N iterations:
  sample 5 subjects with replacement
  sample 100 test cases with replacement
  compute mean performance of sampled subjects on sampled cases for model M1
  compute mean performance of sampled subjects on sampled cases for model M2
  take the difference of mean performance between M1 and M2
p-value equals the proportion of differences less than or equal to 0

This procedure performs a one-tailed test and assumes that M1 mean performance > M2 mean performance.
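The procedure above can be sketched in plain Python as follows. This is not the code from the repository linked below; it is a minimal illustration under assumptions: performance is measured by ROC AUC, `preds_m1[r][i]` is the score of model M1 for case `i` using the features supplied by subject `r`, resamples that happen to contain only one class are redrawn, and the toy data (3 subjects, 40 cases, 200 iterations) are synthetic stand-ins for the 5 subjects and 100 cases in the question.

```python
import random


def auc(labels, scores):
    """Empirical ROC AUC (Mann-Whitney form, ties count one half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mean_auc(labels, preds, subjects, cases):
    """Mean AUC over the sampled subjects, evaluated on the sampled cases."""
    y = [labels[i] for i in cases]
    return sum(auc(y, [preds[r][i] for i in cases]) for r in subjects) / len(subjects)


def bootstrap_p_value(labels, preds_m1, preds_m2, n_iter=200, seed=0):
    """One-tailed p-value for H1: mean performance of M1 > that of M2."""
    rng = random.Random(seed)
    n_subjects, n_cases = len(preds_m1), len(labels)
    diffs = []
    while len(diffs) < n_iter:
        subjects = [rng.randrange(n_subjects) for _ in range(n_subjects)]
        cases = [rng.randrange(n_cases) for _ in range(n_cases)]
        if len({labels[i] for i in cases}) < 2:
            continue  # AUC is undefined without both classes; redraw
        diffs.append(mean_auc(labels, preds_m1, subjects, cases)
                     - mean_auc(labels, preds_m2, subjects, cases))
    return sum(1 for d in diffs if d <= 0) / len(diffs)


# Synthetic toy data: M1 is informative, M2 is pure noise.
data_rng = random.Random(1)
labels = [i % 2 for i in range(40)]
preds_m1 = [[y + data_rng.gauss(0, 0.3) for y in labels] for _ in range(3)]
preds_m2 = [[data_rng.gauss(0, 1) for _ in labels] for _ in range(3)]

p_value = bootstrap_p_value(labels, preds_m1, preds_m2)
print(p_value)
```

Sampling the same subjects and cases for both models in each iteration keeps the comparison paired, which is what lets the differences reflect the models rather than the resampling noise.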

A Python implementation of bootstrapping for computing p-values comparing multiple readers can be found in this GitHub repo: https://github.com/mateuszbuda/ml-stat-util

— kostek

Statistical significance (p-value) for comparing two classifiers with respect to (mean) ROC AUC, sensitivity and specificity
