Example of how the log-sum-exp trick works in Naive Bayes

I have read about the log-sum-exp trick in many places (e.g., here and here), but I have never seen an example of how it is applied specifically to the Naive Bayes classifier (e.g., with discrete features and two classes).

How exactly would one use this trick to avoid the problem of numerical underflow?

naive-bayes underflow

— Josh

There are a few examples of its use here, though not necessarily explicitly for naive Bayes. That hardly matters, however, since the idea of the trick is quite simple and easily adapted.

— Glen_b -Reinstate Monica

The problem is more one of underflow than of overflow.

— Henry

I would suggest that you try a search on underflow and then update your question to address more specifically whatever is not already covered.

— Glen_b -Reinstate Monica

Could you also clarify: is this Bernoulli-model naive Bayes? Or perhaps something else?

— Glen_b -Reinstate Monica

See the example at the bottom here (just before 'See also'), where they take logs; exponentiating both sides but leaving the RHS "as is" (as the exp of a sum of logs) would be an example of the log-sum-exp trick. Does that give you enough information about its use in Naive Bayes to ask a more specific question?

— Glen_b -Reinstate Monica

$p(Y=C|\mathbf{x}) = \frac{p(\mathbf{x}|Y=C)p(Y=C)}{~\sum_{k=1}^{|C|}{}p(\mathbf{x}|Y=C_k)p(Y=C_k)}$

Both the denominator and the numerator can become very small, typically because $p(x_i \vert C_k)$ can be close to 0 and we multiply many of them together. To prevent underflows, one can simply take the log of the numerator, but one needs to use the log-sum-exp trick for the denominator.

More specifically, to prevent underflows:

If we only care about knowing which class $(\hat{y})$ the input $(\mathbf{x}=x_1, \dots, x_n)$ most likely belongs to under the maximum a posteriori (MAP) decision rule, we do not have to apply the log-sum-exp trick, since we do not have to compute the denominator in that case. For the numerator, one can simply take the log to prevent underflows: $\log \left( p(\mathbf{x}|Y=C)p(Y=C) \right)$. More specifically:

$\hat{y} = \underset{k \in \{1, \dots, |C|\}}{\operatorname{argmax}} \ p(C_k \vert x_1, \dots, x_n) = \underset{k \in \{1, \dots, |C|\}}{\operatorname{argmax}} \ p(C_k) \displaystyle\prod_{i=1}^n p(x_i \vert C_k)$

which becomes, after taking the log:

$\begin{align} \hat{y} &= \underset{k \in \{1, \dots, |C|\}}{\operatorname{argmax}} \log \left( p(C_k \vert x_1, \dots, x_n) \right)\\ &= \underset{k \in \{1, \dots, |C|\}}{\operatorname{argmax}} \log \left( \ p(C_k) \displaystyle\prod_{i=1}^n p(x_i \vert C_k) \right) \\ &= \underset{k \in \{1, \dots, |C|\}}{\operatorname{argmax}} \left( \log \left( p(C_k) \right) + \ \displaystyle\sum_{i=1}^n \log \left(p(x_i \vert C_k) \right) \right) \end{align}$
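The log-space MAP rule above can be sketched roughly as follows. This is only an illustration, assuming two classes and binary (Bernoulli) features; the priors, the per-feature probabilities, and the helper names `log_joint`, `naive_joint`, and `map_class` are all made up for the example and do not come from the question.

```python
import math

# Hypothetical model: 2 classes, 2000 binary features (illustrative numbers only).
N_FEATURES = 2000
priors = [0.5, 0.5]
# p(x_i = 1 | C_k); the same value is used for every feature to keep the sketch short.
p_one = [0.01, 0.02]


def log_joint(x, k):
    """log p(C_k) + sum_i log p(x_i | C_k) for a binary feature vector x."""
    total = math.log(priors[k])
    for x_i in x:
        p = p_one[k] if x_i == 1 else 1.0 - p_one[k]
        total += math.log(p)
    return total


def naive_joint(x, k):
    """p(C_k) * prod_i p(x_i | C_k), multiplied out directly (prone to underflow)."""
    prod = priors[k]
    for x_i in x:
        prod *= p_one[k] if x_i == 1 else 1.0 - p_one[k]
    return prod


def map_class(x):
    """Index k of the class with the largest log joint probability."""
    scores = [log_joint(x, k) for k in range(len(priors))]
    return max(range(len(scores)), key=scores.__getitem__)


x = [1] * N_FEATURES                               # every feature is "on"
naive = [naive_joint(x, k) for k in range(2)]      # both products underflow to 0.0
log_scores = [log_joint(x, k) for k in range(2)]   # finite, so they can be compared
y_hat = map_class(x)                               # class 1 has the larger log score
```

With the direct products, both classes tie at 0.0 and the argmax is meaningless; the sums of logs stay finite and still rank the classes.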

If we want to compute the class probability $p(Y=C|\mathbf{x})$ , we will need to compute the denominator:

$\begin{aligned} \log \left( p(Y=C|\mathbf{x}) \right) &= \log \left( \frac{p(\mathbf{x}|Y=C)p(Y=C)}{\sum_{k=1}^{|C|} p(\mathbf{x}|Y=C_k)p(Y=C_k)} \right) \\ &= \log \Big( \underbrace{p(\mathbf{x}|Y=C)p(Y=C)}_{\text{numerator}} \Big) - \log \Big( \underbrace{\sum_{k=1}^{|C|} p(\mathbf{x}|Y=C_k)p(Y=C_k)}_{\text{denominator}} \Big) \end{aligned}$

The term $\log \left( \sum_{k=1}^{|C|} p(\mathbf{x}|Y=C_k)p(Y=C_k) \right)$ may underflow because $p(x_i \vert C_k)$ can be very small: it is the same issue as in the numerator, but this time we have a summation inside the logarithm, which prevents us from transforming each $p(x_i \vert C_k)$ (which can be close to 0) into $\log \left(p(x_i \vert C_k) \right)$ (negative and no longer close to 0, since $0 \leq p(x_i \vert C_k) \leq 1$). To circumvent this issue, we can use the fact that $p(x_i \vert C_k) = \exp \left( {\log \left(p(x_i \vert C_k) \right)} \right)$ to obtain:

$\log \left( \sum_{k=1}^{|C|} p(\mathbf{x}|Y=C_k)p(Y=C_k) \right) = \log \left( \sum_{k=1}^{|C|} \exp \left( \log \left( p(\mathbf{x}|Y=C_k)p(Y=C_k) \right) \right) \right)$

At that point, a new issue arises: $\log \left( p(\mathbf{x}|Y=C_k)p(Y=C_k) \right)$ may be quite negative, which implies that $\exp \left( \log \left( p(\mathbf{x}|Y=C_k)p(Y=C_k) \right) \right)$ may become very close to 0, i.e. underflow. This is where we use the log-sum-exp trick:

$\log \sum_{k} e^{a_{k}} = \log \sum_{k} e^{a_{k}} e^{A - A} = A + \log \sum_{k} e^{a_{k} - A}$

with:
- $a_k=\log \left( p(\mathbf{x}|Y=C_k)p(Y=C_k) \right)$ ,
- $A = \underset{k \in \{1, \dots, |C|\}} \max a_k.$
We can see that introducing the variable $A$ avoids underflows. E.g., with two classes and $a_1 = - 245, a_2 = - 255$ , we have:
- $\exp \left(a_1\right) = \exp \left(- 245\right) =3.96143\times 10^{- 107}$
- $\exp \left(a_2\right) = \exp \left(- 255\right) =1.798486 \times 10^{-111}$
Using the log-sum-exp trick we avoid the underflow, with $A=\max ( -245, -255 )=-245$ : $\begin{align}\log \sum_k e^{a_k} &= \log \sum_k e^{a_k}e^{A-A} \\&= A+ \log\sum_k e^{a_k -A}\\ &= -245+ \log\sum_k e^{a_k +245}\\&= -245+ \log \left(e^{-245 +245}+e^{-255 +245}\right) \\&=-245+ \log \left(e^{0}+e^{-10}\right) \end{align}$

We avoided the underflow since $e^{-10}$ is much farther away from 0 than $3.96143\times 10^{- 107}$ or $1.798486 \times 10^{-111}$ .
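The worked example can be reproduced with a few lines of Python. This is a minimal sketch using only the standard library; `log_sum_exp` and `log_posteriors` are hypothetical helper names, and the second pair of scores (-800 and -810) is an added assumption. It is included because $e^{-245}$ is still representable in double precision (it would underflow only in single precision), whereas $e^{-800}$ is not.

```python
import math


def log_sum_exp(a):
    """Return log(sum_k exp(a_k)), shifting by A = max_k a_k before exponentiating."""
    A = max(a)
    return A + math.log(sum(math.exp(a_k - A) for a_k in a))


def log_posteriors(a):
    """log p(C_k | x) = a_k - log(sum_j exp(a_j)), with a_k = log(p(x | C_k) p(C_k))."""
    log_denominator = log_sum_exp(a)
    return [a_k - log_denominator for a_k in a]


# Values from the worked example: a_1 = -245, a_2 = -255.
a = [-245.0, -255.0]
log_den = log_sum_exp(a)                       # -245 + log(e^0 + e^-10)
posteriors = [math.exp(lp) for lp in log_posteriors(a)]

# More extreme, hypothetical scores: exponentiating directly gives 0.0 in
# double precision, while the shifted computation stays finite.
extreme = [-800.0, -810.0]
direct_sum = sum(math.exp(a_k) for a_k in extreme)   # 0.0
extreme_log_den = log_sum_exp(extreme)               # -800 + log(e^0 + e^-10)
```

Because the largest shifted term is always $e^0 = 1$, the sum inside the logarithm is at least 1 and the logarithm is well defined. If SciPy is available, `scipy.special.logsumexp` performs the same computation.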

— Franck Dernoncourt

Suppose we want to identify which of two databases is more likely to have generated a phrase (for example, which novel is this phrase more likely to have come from). We could assume independence of the words conditional on the database (Naive Bayes assumption).

Now look at the second link you posted. There, $a$ would represent the joint probability of observing the sentence given a database, and the $e^{b_{t}}$ terms would represent the probabilities of observing each of the words in the sentence.

— Sid