In Naive Bayes, why bother with Laplace smoothing when we have unknown words in the test set?


27

I was reading about Naive Bayes classification today. Under the heading Parameter Estimation with add-1 smoothing, I read:

Let c refer to a class (such as Positive or Negative), and let w refer to a token or word.

The maximum likelihood estimate of P(w|c) is

$$\frac{\text{count}(w,c)}{\text{count}(c)} = \frac{\text{counts of } w \text{ in class } c}{\text{counts of all words in class } c}.$$

This estimate of P(w|c) could be problematic because it would give us probability 0 for documents containing unknown words. A common way of solving this problem is to use Laplace smoothing.

Let V be the set of words in the training set, and add a new element UNK (for unknown) to that set.

Define

$$P(w|c) = \frac{\text{count}(w,c) + 1}{\text{count}(c) + |V| + 1},$$

where V refers to the vocabulary (the words in the training set).

In particular, any unknown word will have probability

$$\frac{1}{\text{count}(c) + |V| + 1}.$$
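For concreteness, here is a minimal Python sketch of this add-1 estimate; the words and counts below are made up purely for illustration:

    from collections import Counter

    # Hypothetical training tokens for one class c (purely illustrative).
    tokens_in_c = ["good", "great", "good", "fine"]
    counts = Counter(tokens_in_c)

    V = set(tokens_in_c)                         # vocabulary seen in training
    denom = sum(counts.values()) + len(V) + 1    # count(c) + |V| + 1, the extra 1 for UNK

    def p_w_given_c(w):
        # Add-1 (Laplace) estimate; any unseen word gets the UNK mass of 1/denom.
        return (counts.get(w, 0) + 1) / denom

    print(p_w_given_c("good"))       # (2 + 1) / (4 + 3 + 1) = 0.375
    print(p_w_given_c("terrible"))   # unseen word: 1 / 8 = 0.125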

My question is this: why do we bother with this Laplace smoothing at all? If these unknown words that we encounter in the test set have a probability that is obviously almost zero, namely $\frac{1}{\text{count}(c) + |V| + 1}$, what is the point of including them in the model? Why not just disregard them and delete them?


3
If you don't, then any statement containing a previously unseen word will have p = 0. This means an impossible event has occurred, which in turn means your model was an incredibly bad fit. Also, in a proper Bayesian model this could never happen, since the unknown-word probability would have a numerator given by the prior (possibly not 1). So I don't know why this requires the fancy name "Laplace smoothing".
conjectures

1
Which text was this reading from?
wordsforthewise

Answers:


17

You always need this "fail-safe" probability.

To see why, consider the worst case where none of the words in the training sample appear in the test sentence. In this case, under your model we would conclude that the sentence is impossible, even though it clearly exists, which is a contradiction.

Another extreme example is the test sentence "Alex met Steve", where "met" appears several times in the training sample but "Alex" and "Steve" do not. Your model would conclude that this statement is very unlikely to be true.
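To make this concrete, under the unsmoothed maximum-likelihood estimates the unseen names contribute zero factors:

$$P(\text{Alex met Steve} \mid c) \propto P(\text{Alex} \mid c)\, P(\text{met} \mid c)\, P(\text{Steve} \mid c) = 0 \cdot P(\text{met} \mid c) \cdot 0 = 0 \quad \text{for every class } c,$$

so the sentence is assigned zero probability under every class, no matter how common "met" is.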


I hate to sound like a complete idiot, but would you mind elaborating? How does removing "Alex" and "Steve" change the likelihood of the statement occurring?
Matt O'Brien

2
If we assume independence of the words, then P(Alex) P(Steve) P(met) << P(met)
Sid

1
We build a vocabulary when we train the model on the training dataset. So why not just remove any new words that do not appear in the vocabulary when making predictions on the test dataset?
Avocado

15

Let's say you have trained your Naive Bayes classifier on two classes, "Ham" and "Spam" (i.e., it classifies emails). For simplicity, assume the prior probabilities are 50/50.

Suppose you receive an email $(w_1, w_2, \ldots, w_n)$ that your classifier rates as

$$P(\text{Ham} \mid w_1, w_2, \ldots, w_n) = 0.90$$
$$P(\text{Spam} \mid w_1, w_2, \ldots, w_n) = 0.10$$

So far, so good.

Now suppose a second email $(w_1, w_2, \ldots, w_n, w_{n+1})$ that is identical except for one extra word $w_{n+1}$ that never occurred in the training data, so that

$$P(\text{Ham} \mid w_{n+1}) = P(\text{Spam} \mid w_{n+1}) = 0$$

Suddenly,

$$P(\text{Ham} \mid w_1, w_2, \ldots, w_n, w_{n+1}) = P(\text{Ham} \mid w_1, w_2, \ldots, w_n)\, P(\text{Ham} \mid w_{n+1}) = 0$$
and
$$P(\text{Spam} \mid w_1, w_2, \ldots, w_n, w_{n+1}) = P(\text{Spam} \mid w_1, w_2, \ldots, w_n)\, P(\text{Spam} \mid w_{n+1}) = 0$$

Despite the 1st email being strongly classified in one class, this 2nd email may be classified differently because of that last word having a probability of zero.

Laplace smoothing solves this by giving the last word a small non-zero probability for both classes, so that the posterior probabilities don't suddenly drop to zero.
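A small sketch of the effect described above; all numbers (the 0.90/0.10 scores and the smoothing denominators) are hypothetical stand-ins:

    # Hypothetical scores for the first n (known) words of the second email.
    known_ham, known_spam = 0.90, 0.10      # stand-ins for P(Ham|w1..wn) and P(Spam|w1..wn)

    # Without smoothing, the out-of-vocabulary word w_{n+1} contributes probability 0 to both classes.
    print(known_ham * 0.0, known_spam * 0.0)        # 0.0 0.0 -> no basis for a decision

    # With add-1 smoothing, the unseen word keeps a small non-zero mass in each class.
    # The denominators count(c) + |V| + 1 below are made-up numbers.
    p_unseen_ham = 1 / (120 + 50 + 1)
    p_unseen_spam = 1 / (80 + 50 + 1)
    print(known_ham * p_unseen_ham > known_spam * p_unseen_spam)   # True: the known words still dominate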


Why would we keep a word which doesn't exist in the vocabulary at all? Why not just remove it?
avocado

4
If your classifier rates an email as likely to be ham, then p(ham|w1,...,wn) is 0.9, not p(w1,...,wn|ham).
braaterAfrikaaner

5

This question is rather simple if you are familiar with Bayes estimators, since it is a direct consequence of the Bayes estimator.

In the Bayesian approach, parameters are considered to be quantities whose variation can be described by a probability distribution (a prior distribution).

So, if we view the procedure of picking words as draws from a multinomial distribution, then we can solve the question in a few steps.

First, define

$$m = |V|, \qquad n = \sum_{i} n_i,$$

where $n_i$ is the number of times word $i$ occurs in the training data.

If we assume that the prior distribution of the $p_i$ is the uniform distribution, we can calculate their conditional (posterior) probability distribution as

$$p(p_1, p_2, \ldots, p_m \mid n_1, n_2, \ldots, n_m) = \frac{\Gamma(n+m)}{\prod_{i=1}^{m}\Gamma(n_i+1)} \prod_{i=1}^{m} p_i^{n_i}$$

We can see that this is in fact a Dirichlet distribution, and the expectation of $p_i$ is

$$E[p_i] = \frac{n_i + 1}{n + m}$$

A natural estimate of $p_i$ is the mean of the posterior distribution, so we can give the Bayes estimator of $p_i$ as:

$$\hat{p}_i = E[p_i] = \frac{n_i + 1}{n + m}$$

You can see that we arrive at the same conclusion as Laplace smoothing.
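As a quick sanity check of this conclusion, one can sample from the Dirichlet posterior and compare its empirical mean with the closed form $(n_i+1)/(n+m)$; the counts below are made up:

    import numpy as np

    # Toy word counts n_i for a vocabulary of size m (hypothetical numbers).
    counts = np.array([5, 2, 0, 1])      # n_i
    n, m = counts.sum(), len(counts)     # n = 8, m = 4

    # A uniform prior is Dirichlet(1, ..., 1), so the posterior is Dirichlet(n_i + 1).
    samples = np.random.default_rng(0).dirichlet(counts + 1, size=200_000)

    print(samples.mean(axis=0))          # approx. [0.50, 0.25, 0.083, 0.167]
    print((counts + 1) / (n + m))        # exactly (n_i + 1)/(n + m) -- the add-1 estimate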


4

Disregarding those words is another way to handle it. It corresponds to averaging over (integrating out) all missing variables. So the result is different. How?

Assuming the notation used here:

$$P(C \mid d) = \arg\max_C \frac{\prod_i p(t_i \mid C)\, P(C)}{P(d)} \propto \arg\max_C \prod_i p(t_i \mid C)\, P(C),$$
where the $t_i$ are the tokens in the vocabulary and $d$ is a document.

Let's say token $t_k$ does not appear. Instead of using Laplace smoothing (which comes from imposing a Dirichlet prior on the multinomial Bayes model), you sum out $t_k$, which corresponds to saying: I take a weighted vote over all possibilities for the unknown token (having it or not).

$$P(C \mid d) \propto \arg\max_C \sum_{t_k} \prod_i p(t_i \mid C)\, P(C) = \arg\max_C P(C) \prod_{i \neq k} p(t_i \mid C) \sum_{t_k} p(t_k \mid C) = \arg\max_C P(C) \prod_{i \neq k} p(t_i \mid C)$$

But in practice one prefers the smoothing approach. Instead of ignoring those tokens, you assign them a low probability, which is like thinking: if I have unknown tokens, it is less likely that this is the kind of document I would otherwise think it is.
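A tiny sketch (with hypothetical, already-normalized token distributions) of the collapse in the last equality above: summing over every possible identity of the unknown token gives exactly the same score as dropping it, because the per-class token probabilities sum to 1.

    # Hypothetical per-class token distributions (each sums to 1).
    p_given_A = {"x": 0.5, "y": 0.3, "z": 0.2}
    p_given_B = {"x": 0.2, "y": 0.2, "z": 0.6}

    doc_known = ["x", "y"]    # the document's in-vocabulary tokens; one further token is unknown

    def score_dropping(p, prior=0.5):
        # Ignore the unknown token entirely.
        s = prior
        for t in doc_known:
            s *= p[t]
        return s

    def score_summing_out(p, prior=0.5):
        # Sum over every possible identity of the unknown token: sum_t p(t|C) = 1,
        # so the score collapses to exactly the "dropping" score.
        return score_dropping(p, prior) * sum(p.values())

    print(score_dropping(p_given_A), score_summing_out(p_given_A))   # identical (0.075)
    print(score_dropping(p_given_B), score_summing_out(p_given_B))   # identical (0.02)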


2

You want to know why we bother with smoothing at all in a Naive Bayes classifier (when we can throw away the unknown features instead).

The answer to your question is: not all words have to be unknown in all classes.

Say there are two classes M and N with features A, B and C, as follows:

M: A=3, B=1, C=0

(In the class M, A appears 3 times and B only once)

N: A=0, B=1, C=3

(In the class N, C appears 3 times and B only once)

Let's see what happens when you throw away features that appear zero times.

A) Throw Away Features That Appear Zero Times In Any Class

If you throw away features A and C because they appear zero times in any of the classes, then you are only left with feature B to classify documents with.

And losing that information is a bad thing as you will see below!

If you're presented with a test document as follows:

B=1, C=3

(It contains B once and C three times)

Now, since you've discarded the features A and C, you won't be able to tell whether the above document belongs to class M or class N.

So, losing any feature information is a bad thing!

B) Throw Away Features That Appear Zero Times In All Classes

Is it possible to get around this problem by discarding only those features that appear zero times in all of the classes?

No, because that would create its own problems!

The following test document illustrates what would happen if we did that:

A=3, B=1, C=1

The probability of M and N would both become zero (because we did not throw away the zero probability of A in class N and the zero probability of C in class M).

C) Don't Throw Anything Away - Use Smoothing Instead

Smoothing allows you to classify both of the above documents correctly (a short sketch follows this list) because:

  1. You do not lose count information in classes where such information is available and
  2. You do not have to contend with zero counts.
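As a concrete illustration, here is a short sketch applying the add-1 formula from the question (with the extra UNK slot) to the counts above; the two test documents are the ones from scenarios A and B:

    from math import prod

    # Training counts from the example above.
    train = {"M": {"A": 3, "B": 1, "C": 0}, "N": {"A": 0, "B": 1, "C": 3}}
    vocab = {"A", "B", "C"}

    def smoothed(w, c):
        # Add-1 estimate with the extra UNK slot, as in the question:
        # (count(w,c) + 1) / (count(c) + |V| + 1).
        return (train[c].get(w, 0) + 1) / (sum(train[c].values()) + len(vocab) + 1)

    def classify(doc):
        # doc maps feature -> count; equal class priors are assumed.
        scores = {c: prod(smoothed(w, c) ** k for w, k in doc.items()) for c in train}
        return max(scores, key=scores.get)

    print(classify({"B": 1, "C": 3}))          # N  -- the document from scenario A
    print(classify({"A": 3, "B": 1, "C": 1}))  # M  -- the document from scenario B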

Naive Bayes Classifiers In Practice

The Naive Bayes classifier in NLTK used to throw away features that had zero counts in any of the classes.

This used to make it perform poorly when trained using a hard EM procedure (where the classifier is bootstrapped up from very little training data).


2
@Aiaioo Labs You failed to realize that he was referring to words that did not appear in the training set at all; for your example, he was referring to, say, if D appeared. The issue isn't with Laplace smoothing on the calculations from the training set but rather on the test set. Using Laplace smoothing on unknown words from the TEST set causes the probability to be skewed towards whichever class had the smallest number of tokens, since (0 + 1)/(2 + 3) is bigger than (0 + 1)/(3 + 3) (if one of the classes had 3 tokens and the other had 2). ...

2
This can actually turn a correct classification into an incorrect one if enough unknown words are smoothed into the equation. Laplace smoothing is fine for training-set calculations, but detrimental to test-set analysis. Also, imagine you have a test set consisting entirely of unknown words: it should immediately be classified as the class with the highest prior probability, but in practice it usually will not be, and will instead tend to be classified as the class with the smallest number of tokens.
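A small sketch of the skew described in these comments, using the commenter's numbers (|V| = 3, one class with 2 training tokens and one with 3; the class labels are just placeholders):

    # Hypothetical setup matching the comment above.
    V = 3
    class_tokens = {"smaller": 2, "larger": 3}

    def p_unknown(c):
        # Probability that add-1 smoothing assigns to each unknown test-set word in class c.
        return (0 + 1) / (class_tokens[c] + V)

    print(p_unknown("smaller"), p_unknown("larger"))   # 0.2 vs 0.1666...

    # A test document made only of k unknown words drifts towards the smaller class as k grows.
    for k in (1, 5, 10):
        print(k, (p_unknown("smaller") / p_unknown("larger")) ** k)   # likelihood ratio 1.2**k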

@DrakeThatcher, I highly agree with you; yes, if we don't remove words not in the vocabulary, then the predicted probability will be skewed towards the class with the smallest number of words.
avocado

1

I also came across the same problem while studying Naive Bayes.

As I see it, whenever we encounter a test example that we hadn't come across during training, our posterior probability will become 0.

So by adding 1, even if we never trained on a particular feature/class combination, the posterior probability will never be 0.


1

Matt, you are correct; you raise a very good point - yes, Laplace smoothing is quite frankly nonsense! Simply throwing away those features can be a valid approach, particularly when the denominator is also a small number - there is simply not enough evidence to support the probability estimation.

I have a strong aversion to solving any problem via use of some arbitrary adjustment. The problem here is zeros, the "solution" is to just "add some small value to zero so it's not zero anymore - MAGIC the problem is no more". Of course that's totally arbitrary.

Your suggestion of better feature selection to begin with is a less arbitrary approach, and in my experience it increases performance. Furthermore, Laplace smoothing in conjunction with naive Bayes as the model has, in my experience, worsened the granularity problem - i.e., the problem where the output scores tend to be close to 1.0 or 0.0 (if the number of features is infinite, then every score will be 1.0 or 0.0 - this is a consequence of the independence assumption).

Now, alternative techniques for probability estimation do exist (other than maximum likelihood + Laplace smoothing), but they are massively under-documented. In fact, there is a whole field called Inductive Logic and Inference Processes that uses a lot of tools from information theory.

What we use in practice is Minimum Cross-Entropy Updating, which is an extension of Jeffrey's Updating, where we define the convex region of probability space consistent with the evidence to be the region such that a point in it would mean the maximum likelihood estimate is within the expected absolute deviation from that point.

This has the nice property that as the number of data points decreases, the estimates piecewise-smoothly approach the prior, and therefore their effect on the Bayesian calculation is null. Laplace smoothing, on the other hand, makes each estimate approach the point of maximum entropy, which may not be the prior, so its effect on the calculation is not null and just adds noise.

Licensed under cc by-sa 3.0 with attribution required.