If we want to compute the class probability $p(Y=C \mid x)$, we will need to compute the denominator:
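$$\log p(Y=C \mid x) = \log\left(\frac{p(x \mid Y=C)\,p(Y=C)}{\sum_{k=1}^{|C|} p(x \mid Y=C_k)\,p(Y=C_k)}\right) = \log\Big(\underbrace{p(x \mid Y=C)\,p(Y=C)}_{\text{numerator}}\Big) - \log\Bigg(\underbrace{\sum_{k=1}^{|C|} p(x \mid Y=C_k)\,p(Y=C_k)}_{\text{denominator}}\Bigg)$$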
The term $\log\left(\sum_{k=1}^{|C|} p(x \mid Y=C_k)\,p(Y=C_k)\right)$ may underflow because $p(x_i \mid C_k)$ can be very small. This is the same issue as in the numerator, but this time the summation sits inside the logarithm, which prevents us from replacing each $p(x_i \mid C_k)$ (possibly close to 0) with $\log p(x_i \mid C_k)$ (negative and no longer close to 0, since $0 \le p(x_i \mid C_k) \le 1$). To circumvent this issue, we can use the fact that $p(x_i \mid C_k) = \exp(\log p(x_i \mid C_k))$ to obtain:
$$\log\left(\sum_{k=1}^{|C|} p(x \mid Y=C_k)\,p(Y=C_k)\right) = \log\left(\sum_{k=1}^{|C|} \exp\Big(\log\big(p(x \mid Y=C_k)\,p(Y=C_k)\big)\Big)\right)$$
At that point, a new issue arises: $\log\big(p(x \mid Y=C_k)\,p(Y=C_k)\big)$ may be a large negative number, which implies that $\exp\Big(\log\big(p(x \mid Y=C_k)\,p(Y=C_k)\big)\Big)$ may become very close to 0, i.e. underflow. This is where we use the log-sum-exp trick:
$$\log\sum_k e^{a_k} = \log\sum_k e^{a_k} e^{A-A} = A + \log\sum_k e^{a_k - A}$$
with:
- $a_k = \log\big(p(x \mid Y=C_k)\,p(Y=C_k)\big)$,
- $A = \max_{k \in \{1,\dots,|C|\}} a_k$.
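The identity above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `log_sum_exp` is chosen here for the example, the input is assumed to be a non-empty list of floats, and the handling of infinite inputs is an added assumption.

```python
import math

def log_sum_exp(a):
    """Return log(sum_k exp(a_k)), shifting every term by A = max_k a_k."""
    A = max(a)
    if math.isinf(A):
        # Assumption for this sketch: if every term is -inf (or one is +inf),
        # a_k - A would be nan, so return A directly.
        return A
    # The largest shifted term is exp(0) = 1, so the sum is at least 1
    # and the logarithm is always well defined.
    return A + math.log(sum(math.exp(a_k - A) for a_k in a))

# Values chosen to be extreme enough to underflow in double precision.
a = [-1000.0, -1010.0]
naive_sum = sum(math.exp(a_k) for a_k in a)  # every exp underflows to 0.0
stable = log_sum_exp(a)                      # finite, close to -1000
```

Here `math.log(naive_sum)` would raise `ValueError` because the naive sum is exactly `0.0`, whereas the shifted version only ever exponentiates numbers that are at most 0.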
We can see that introducing the variable $A$ avoids underflows. For example, with two classes, $a_1 = -245$ and $a_2 = -255$, we have:
- $\exp(a_1) = \exp(-245) \approx 3.96143 \times 10^{-107}$
- $\exp(a_2) = \exp(-255) \approx 1.798486 \times 10^{-111}$
Using the log-sum-exp trick we avoid the underflow, with $A = \max(-245, -255) = -245$:
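$$\log\sum_k e^{a_k} = \log\sum_k e^{a_k} e^{A-A} = A + \log\sum_k e^{a_k - A} = -245 + \log\sum_k e^{a_k + 245} = -245 + \log\left(e^{-245+245} + e^{-255+245}\right) = -245 + \log\left(e^{0} + e^{-10}\right)$$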
We avoided the underflow since $e^{-10}$ is much farther from 0 than $3.96143 \times 10^{-107}$ or $1.798486 \times 10^{-111}$.
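To tie this back to the class probability $p(Y=C \mid x)$, here is a rough sketch that uses the two values from the example as the log-numerators and subtracts the log-denominator obtained with the trick. The variable names (`log_joint`, `log_evidence`, `log_posterior`) are illustrative choices, and the helper is redefined so the snippet runs on its own.

```python
import math

def log_sum_exp(a):
    """log(sum_k exp(a_k)) computed with the shift A = max_k a_k."""
    A = max(a)
    return A + math.log(sum(math.exp(a_k - A) for a_k in a))

# a_k = log(p(x | Y=C_k) p(Y=C_k)) for the two classes of the example.
log_joint = [-245.0, -255.0]

# Log of the denominator: -245 + log(e^0 + e^-10).
log_evidence = log_sum_exp(log_joint)

# log p(Y=C_k | x) = a_k - log(denominator), one value per class.
log_posterior = [a_k - log_evidence for a_k in log_joint]

# Exponentiating is safe here because the log-posteriors are not extreme.
posterior = [math.exp(lp) for lp in log_posterior]
```

The resulting probabilities sum to 1 up to rounding, and the gap between the two log-posteriors equals the original gap $a_1 - a_2 = 10$, since subtracting the same log-denominator from both leaves their difference unchanged.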