How do you calculate the perplexity of a holdout with Latent Dirichlet Allocation?

I'm confused about how to calculate the perplexity of a holdout sample when doing Latent Dirichlet Allocation (LDA). The papers on the topic breeze over it, making me think I'm missing something obvious...

Perplexity is seen as a good measure of performance for LDA. The idea is that you keep a holdout sample, train your LDA on the rest of the data, then calculate the perplexity of the holdout.

The perplexity could be given by the formula:

$per(D_{test})=\exp\left\{-\frac{\sum_{d=1}^{M}\log p(\mathbb{w}_d)}{\sum_{d=1}^{M}N_d}\right\}$

(Taken from Image retrieval on large-scale image databases, Horster et al.)

Here $M$ is the number of documents (in the test sample, presumably), $\mathbb{w}_d$ represents the words in document $d$, and $N_d$ is the number of words in document $d$.
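As a minimal sketch of the formula itself, assuming the per-document log-likelihoods $\log p(\mathbb{w}_d)$ are already available as natural logs (obtaining them is the hard part discussed below). The function name `perplexity` and its argument names are hypothetical, chosen only for illustration.

```python
import math

def perplexity(doc_log_likelihoods, doc_lengths):
    """Return exp(-sum_d log p(w_d) / sum_d N_d), with natural logs."""
    if len(doc_log_likelihoods) != len(doc_lengths):
        raise ValueError("need one log-likelihood per document")
    total_tokens = sum(doc_lengths)
    if total_tokens == 0:
        raise ValueError("held-out set contains no words")
    return math.exp(-sum(doc_log_likelihoods) / total_tokens)

# Sanity check: a model that assigns probability 1/10 to every word
# should have perplexity 10, whatever the document lengths are.
uniform = [5 * math.log(0.1), 3 * math.log(0.1)]
print(perplexity(uniform, [5, 3]))
```

The sanity check reflects the usual reading of perplexity as an effective vocabulary size: a uniform model over 10 words is "as confused" as a fair 10-sided die.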

It is not clear to me how to sensibly calculate $p(\mathbb{w}_d)$, since we don't have topic mixtures for the held-out documents. Ideally, we would integrate over the Dirichlet prior for all possible topic mixtures and use the topic multinomials we learned. Calculating this integral doesn't seem an easy task, however.
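One simple (if crude) way to approximate this integral is plain Monte Carlo: draw topic mixtures from the Dirichlet prior and average the document likelihood over the draws. The sketch below assumes the learned topics are given as a K x V matrix `topic_word` with rows summing to one, and that the Dirichlet parameter `alpha` is known; these names and the function `log_p_doc_prior_sampling` are hypothetical. This estimator can have high variance for long documents, so it is meant to illustrate the quantity being computed rather than to be a recommended method.

```python
import numpy as np

def log_p_doc_prior_sampling(word_ids, topic_word, alpha, n_samples=2000, seed=0):
    """Monte Carlo estimate of log p(w_d | alpha, topics).

    Averages prod_n sum_k theta_k * topic_word[k, w_n] over theta ~ Dirichlet(alpha).
    """
    rng = np.random.default_rng(seed)
    topic_word = np.asarray(topic_word, dtype=float)   # shape (K, V)
    alpha = np.asarray(alpha, dtype=float)             # shape (K,)
    thetas = rng.dirichlet(alpha, size=n_samples)      # shape (S, K)
    # Probability of each word in the document under each sampled mixture.
    word_probs = thetas @ topic_word[:, word_ids]      # shape (S, N)
    log_lik = np.log(word_probs).sum(axis=1)           # shape (S,)
    # Log of the sample mean, computed stably in log space.
    m = log_lik.max()
    return float(m + np.log(np.mean(np.exp(log_lik - m))))

# If all topics are identical, the mixture does not matter and the
# estimate equals the exact value N * log(1/V).
topics = np.full((3, 4), 0.25)
print(log_p_doc_prior_sampling([0, 1, 2, 3, 0], topics, [0.5, 0.5, 0.5]))
print(5 * np.log(0.25))
```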

Alternatively, we could attempt to learn an optimal topic mixture for each held-out document (given our learned topics) and use this to calculate the perplexity. This would be doable; however, it's not as trivial as papers such as Horster et al. and Blei et al. seem to suggest, and it's not immediately clear to me that the result will be equivalent to the ideal case above.
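A minimal sketch of this "fit a mixture per held-out document" idea, assuming fixed topics in a K x V matrix `topic_word` and a maximum-likelihood fit of the mixture by EM. The function names `fold_in_theta` and `log_p_doc_given_theta` are hypothetical, and the code assumes every word in the document has non-zero probability under at least one topic. Note that plugging a fitted mixture back into the likelihood of the same words is not equivalent to the integral: the maximized likelihood is at least as large as the prior-averaged one, so this approach would tend to understate perplexity.

```python
import numpy as np

def fold_in_theta(word_ids, topic_word, n_iter=100):
    """Maximum-likelihood topic mixture for one document, topics held fixed."""
    topic_word = np.asarray(topic_word, dtype=float)
    beta_w = topic_word[:, word_ids]                  # shape (K, N)
    n_topics = beta_w.shape[0]
    theta = np.full(n_topics, 1.0 / n_topics)         # start from uniform
    for _ in range(n_iter):
        # E-step: responsibility of each topic for each word token.
        resp = theta[:, None] * beta_w
        resp /= resp.sum(axis=0, keepdims=True)
        # M-step: mixture weights are the average responsibilities.
        theta = resp.mean(axis=1)
    return theta

def log_p_doc_given_theta(word_ids, topic_word, theta):
    """log p(w_d | theta, topics) = sum_n log sum_k theta_k * topic_word[k, w_n]."""
    topic_word = np.asarray(topic_word, dtype=float)
    return float(np.log(theta @ topic_word[:, word_ids]).sum())

topics = np.array([[0.9, 0.1], [0.1, 0.9]])
doc = [0, 0, 0, 0]
theta_hat = fold_in_theta(doc, topics)
print(theta_hat, log_p_doc_given_theta(doc, topics, theta_hat))
```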

text-mining topic-models

— drevicko

Answers:

This is indeed something often glossed over.

Some people are doing something a bit cheeky: holding out a proportion of the words in each document, and using the predictive probabilities of these held-out words given the document-topic mixtures as well as the topic-word mixtures. This is obviously not ideal, as it doesn't evaluate performance on any held-out documents.
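For concreteness, the held-out-words variant can be sketched as follows, assuming the document-topic mixtures (`doc_topic`, D x K) were estimated from the retained words of each document and the topic-word distributions (`topic_word`, K x V) come from training. All names here, including `completion_perplexity`, are hypothetical.

```python
import numpy as np

def completion_perplexity(held_out_words, doc_topic, topic_word):
    """Perplexity of held-out words, given each document's topic mixture."""
    doc_topic = np.asarray(doc_topic, dtype=float)     # shape (D, K)
    topic_word = np.asarray(topic_word, dtype=float)   # shape (K, V)
    total_log_prob = 0.0
    total_tokens = 0
    for d, word_ids in enumerate(held_out_words):
        if len(word_ids) == 0:
            continue  # nothing was held out for this document
        # Predictive probability of each held-out word in document d.
        probs = doc_topic[d] @ topic_word[:, word_ids]
        total_log_prob += np.log(probs).sum()
        total_tokens += len(word_ids)
    if total_tokens == 0:
        raise ValueError("no held-out words to evaluate")
    return float(np.exp(-total_log_prob / total_tokens))

doc_topic = [[1.0, 0.0], [0.0, 1.0]]
topic_word = [[0.5, 0.5], [0.25, 0.75]]
held_out = [[0, 1], [0]]
print(completion_perplexity(held_out, doc_topic, topic_word))
```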

To do it properly with held-out documents, as suggested, you do need to "integrate over the Dirichlet prior for all possible topic mixtures". http://people.cs.umass.edu/~wallach/talks/evaluation.pdf reviews a few methods for tackling this slightly unpleasant integral. I'm just about to try and implement this myself in fact, so good luck!

— Matt

Thanks for dredging up this question! Wallach et al also have a paper on topic model evaluations: Evaluation methods for topic models

— drevicko

No worries. I've found there's some code for Wallach's left-to-right method in the MALLET topic modelling toolbox. If you're happy to use their LDA implementation, it's an easy win, although it doesn't seem super easy to run it on a set of topics learned elsewhere from a different variant of LDA, which is what I'm looking to do. I ended up implementing the Chib-style estimator from their paper, using the MATLAB code they supply as a guide, although I had to fix a couple of issues in doing that. Let me know if you want the code.

— Matt

Hi @Matt, is it possible to hand me the MATLAB code for perplexity evaluation on LDA? Thanks

— princess of persia

@princessofpersia I think the author fixed the problem I alluded to with the matlab code, see here: homepages.inf.ed.ac.uk/imurray2/pub/09etm

— Matt

The parameters of LDA are often estimated through variational inference. In that case,

$\log p(w|\alpha, \beta) = E_q[\log p(\theta,z,w|\alpha,\beta)]-E_q[\log q(\theta,z)] + D(q(\theta,z)||p(\theta,z|w,\alpha,\beta))$.

If your variational distribution is close enough to the true posterior, then $D(q(\theta,z)||p(\theta,z|w,\alpha,\beta)) \approx 0$. So $\log p(w|\alpha, \beta) \approx E_q[\log p(\theta,z,w|\alpha,\beta)]-E_q[\log q(\theta,z)]$, which is the variational lower bound on the log-likelihood.

In other words, $\log p(w|\alpha, \beta)$ is approximated by the bound you get from variational inference.
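To connect this to the perplexity formula in the question, here is a short worked step. It assumes the variational distribution $q$ is fitted separately for each held-out document $d$ with $\alpha$ and $\beta$ held fixed; the symbol $\mathcal{L}_d$ is introduced here only for illustration. Because the divergence term is non-negative, dropping it gives a lower bound on each document's log-likelihood and hence an upper bound on the perplexity:

$$\mathcal{L}_d = E_q[\log p(\theta,z,w_d|\alpha,\beta)] - E_q[\log q(\theta,z)] \le \log p(w_d|\alpha,\beta)$$

$$per(D_{test}) = \exp\left\{-\frac{\sum_{d=1}^{M}\log p(w_d|\alpha,\beta)}{\sum_{d=1}^{M}N_d}\right\} \le \exp\left\{-\frac{\sum_{d=1}^{M}\mathcal{L}_d}{\sum_{d=1}^{M}N_d}\right\}$$

So a perplexity computed from the variational bound is, strictly speaking, an upper bound on the true perplexity, and it is tight only to the extent that $q$ matches the posterior.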

— user32509

I think it is possible to improve the answer to be more specific about how to actually calculate the perplexity on the test set.

— Momo