Regarding resources:
Here are some key quotes from "ADADELTA: An Adaptive Learning Rate Method", along with some examples and short explanations:
ADAGRAD
The update rule for ADAGRAD is as follows:

$$\Delta x_t=-\frac{\eta}{\sqrt{\sum_{\tau=1}^{t}g_\tau^2}}\,g_t \tag{5}$$

Here the denominator computes the $\ell_2$ norm of all previous gradients on a per-dimension basis, and $\eta$ is a global learning rate shared by all dimensions. So while there is a hand-tuned global learning rate, each dimension also has its own dynamic rate.
I.e. if the gradients in the first three steps are $g_1=\begin{pmatrix}a_1\\b_1\\c_1\end{pmatrix}$, $g_2=\begin{pmatrix}a_2\\b_2\\c_2\end{pmatrix}$, $g_3=\begin{pmatrix}a_3\\b_3\\c_3\end{pmatrix}$, then:

$$\Delta x_3=-\frac{\eta}{\sqrt{\sum_{\tau=1}^{3}g_\tau^2}}\,g_3=-\frac{\eta}{\sqrt{\begin{pmatrix}a_1^2+a_2^2+a_3^2\\b_1^2+b_2^2+b_3^2\\c_1^2+c_2^2+c_3^2\end{pmatrix}}}\begin{pmatrix}a_3\\b_3\\c_3\end{pmatrix}$$
$$\downarrow$$
$$\Delta x_3=-\begin{pmatrix}\frac{\eta}{\sqrt{a_1^2+a_2^2+a_3^2}}\,a_3\\[1.5ex]\frac{\eta}{\sqrt{b_1^2+b_2^2+b_3^2}}\,b_3\\[1.5ex]\frac{\eta}{\sqrt{c_1^2+c_2^2+c_3^2}}\,c_3\end{pmatrix}$$

Here it is easier to see that, as promised, each dimension has its own dynamic learning rate.
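The per-dimension scaling can be sketched in a few lines of NumPy. This is a minimal illustration of equation (5), not the paper's code; the gradient values below are hypothetical, chosen only to make the accumulator concrete:

```python
import numpy as np

def adagrad_step(x, grad, accum, eta=0.1):
    """One ADAGRAD step (eq. 5): accumulate squared gradients per
    dimension and divide the global rate eta by their square root."""
    accum = accum + grad ** 2                 # running sum of g_tau^2
    x = x - eta / np.sqrt(accum) * grad       # per-dimension dynamic rate
    return x, accum

# Three steps with hypothetical gradients standing in for
# (a_r, b_r, c_r) in the example above.
g = [np.array([1.0, 2.0, 3.0]),
     np.array([0.5, 1.0, 1.5]),
     np.array([2.0, 0.5, 1.0])]

x, accum = np.zeros(3), np.zeros(3)
for grad in g:
    x, accum = adagrad_step(x, grad, accum)

# accum now holds [a1^2+a2^2+a3^2, b1^2+b2^2+b3^2, c1^2+c2^2+c3^2].
```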
Problems of ADAGRAD that ADADELTA tries to counter
The idea presented in this paper was derived from ADAGRAD in order to improve upon the two main drawbacks of the method: 1) the continual decay of learning rates throughout training, and 2) the need for a manually selected global learning rate.
The second drawback is fairly self-explanatory.
Regarding the first drawback: every squared gradient $g_\tau^2$ is non-negative, so the sum $\sum_{\tau=1}^{t}g_\tau^2$ in the denominator can only grow as $t$ increases. Consequently, the magnitude of $\Delta x_t$ continually shrinks during training, until eventually the updates become vanishingly small and learning effectively stops.
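This decay is easy to see numerically. A one-dimensional sketch, assuming a constant gradient of 1.0 (an idealized setting chosen only to isolate the effect of the growing denominator):

```python
import math

# ADAGRAD's denominator grows without bound, so the effective step
# eta / sqrt(sum g^2) shrinks toward zero even though the gradient
# never changes.
eta, grad = 0.1, 1.0
accum, steps = 0.0, []
for t in range(100):
    accum += grad ** 2
    steps.append(eta / math.sqrt(accum) * grad)

# Step sizes are strictly decreasing: 0.1, 0.0707..., 0.0577..., ...
assert all(a > b for a, b in zip(steps, steps[1:]))
```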
ADADELTA
Instead of accumulating the sum of squared gradients over all time, ADADELTA restricts the window of past gradients to some fixed size $w$. Since storing $w$ previous squared gradients is inefficient, this accumulation is implemented as an exponentially decaying average of the squared gradients. Assume at time $t$ this running average is $E[g^2]_t$; then we compute:
$$E[g^2]_t=\rho E[g^2]_{t-1}+(1-\rho)\,g_t^2 \tag{8}$$
where $\rho$ is a decay constant. Since the parameter updates require the square root of this quantity, it effectively becomes the RMS of the previous squared gradients up to time $t$:

$$RMS[g]_t=\sqrt{E[g^2]_t+\epsilon} \tag{9}$$

where a small constant $\epsilon$ is added to better condition the denominator.
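In code, equations (8) and (9) are just a one-line recurrence plus a square root. A single-dimension sketch; the gradient values, $\rho=0.9$, and $\epsilon=10^{-6}$ are assumptions for illustration:

```python
import math

def update_running_avg(prev_avg, g, rho=0.9):
    """Exponentially decaying average of squared gradients (eq. 8)."""
    return rho * prev_avg + (1 - rho) * g ** 2

def rms(avg, eps=1e-6):
    """RMS of the accumulated quantity (eq. 9)."""
    return math.sqrt(avg + eps)

# Example: one dimension over a few steps, starting from E[g^2]_0 = 0.
E_g2 = 0.0
for g in [1.0, 2.0, 0.5]:
    E_g2 = update_running_avg(E_g2, g)

print(rms(E_g2))  # RMS of the decayed squared-gradient history
```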
An exponentially decaying average of the squared updates, along with its RMS, is accumulated in the same way:

$$E[\Delta x^2]_{t-1}=\rho E[\Delta x^2]_{t-2}+(1-\rho)\,\Delta x_{t-1}^2$$
$$RMS[\Delta x]_{t-1}=\sqrt{E[\Delta x^2]_{t-1}+\epsilon}$$
The update $\Delta x_t$ then replaces the global learning rate $\eta$ with the RMS of the window of previous updates $\Delta x$:
$$\Delta x_t=-\frac{RMS[\Delta x]_{t-1}}{RMS[g]_t}\,g_t \tag{14}$$
The same constant $\epsilon$ appears in the numerator RMS as well; among other things, it lets the very first iteration make progress, where $\Delta x_0=0$ would otherwise leave the numerator at zero.
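Putting equations (8), (9), and (14) together gives the full update. A minimal sketch, assuming $\rho=0.95$ and $\epsilon=10^{-6}$ (values in the range the paper experiments with) and a toy quadratic objective I made up for demonstration:

```python
import numpy as np

def adadelta_step(x, grad, E_g2, E_dx2, rho=0.95, eps=1e-6):
    """One ADADELTA update: note there is no global learning rate."""
    E_g2 = rho * E_g2 + (1 - rho) * grad ** 2                  # eq. 8
    dx = -np.sqrt(E_dx2 + eps) / np.sqrt(E_g2 + eps) * grad    # eq. 14
    E_dx2 = rho * E_dx2 + (1 - rho) * dx ** 2                  # update accumulator
    return x + dx, E_g2, E_dx2

# Toy objective f(x) = sum(x^2), so the gradient is 2x per dimension.
x = np.array([1.0, -2.0])
E_g2 = np.zeros_like(x)   # E[g^2]_0 = 0
E_dx2 = np.zeros_like(x)  # E[dx^2]_0 = 0; eps keeps the first step nonzero
for _ in range(10):
    grad = 2 * x
    x, E_g2, E_dx2 = adadelta_step(x, grad, E_g2, E_dx2)
```

On the first iteration the numerator is $\sqrt{0+\epsilon}$, so the very first steps are small but nonzero, which is exactly the role of $\epsilon$ described above.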
To see the per-dimension behavior, denote the gradient and the update at step $r$ by $g_r=\begin{pmatrix}a_r\\b_r\\c_r\end{pmatrix}$ and $\Delta x_r=\begin{pmatrix}i_r\\j_r\\k_r\end{pmatrix}$. Expanding the update gives:
$$\begin{aligned}
\Delta x_t&=-\frac{RMS[\Delta x]_{t-1}}{RMS[g]_t}\,g_t\\
&=-\frac{\sqrt{E[\Delta x^2]_{t-1}+\epsilon}}{\sqrt{E[g^2]_t+\epsilon}}\,g_t\\
&=-\frac{\sqrt{\rho E[\Delta x^2]_{t-2}+(1-\rho)\Delta x_{t-1}^2+\epsilon}}{\sqrt{\rho E[g^2]_{t-1}+(1-\rho)g_t^2+\epsilon}}\,g_t\\
&=-\frac{\sqrt{\rho\left(\rho E[\Delta x^2]_{t-3}+(1-\rho)\Delta x_{t-2}^2\right)+(1-\rho)\Delta x_{t-1}^2+\epsilon}}{\sqrt{\rho\left(\rho E[g^2]_{t-2}+(1-\rho)g_{t-1}^2\right)+(1-\rho)g_t^2+\epsilon}}\,g_t\\
&=-\frac{\sqrt{\rho^2E[\Delta x^2]_{t-3}+\rho^1(1-\rho)\Delta x_{t-2}^2+\rho^0(1-\rho)\Delta x_{t-1}^2+\epsilon}}{\sqrt{\rho^2E[g^2]_{t-2}+\rho^1(1-\rho)g_{t-1}^2+\rho^0(1-\rho)g_t^2+\epsilon}}\,g_t\\
&\;\;\vdots\\
&=-\frac{\sqrt{\rho^{t-1}E[\Delta x^2]_0+\sum_{r=1}^{t-1}\rho^{t-1-r}(1-\rho)\Delta x_r^2+\epsilon}}{\sqrt{\rho^{t-1}E[g^2]_1+\sum_{r=2}^{t}\rho^{t-r}(1-\rho)g_r^2+\epsilon}}\,g_t
\end{aligned}$$
$\rho$ is a decay constant, so we choose it such that $\rho\in(0,1)$ (typically $\rho\geq 0.9$).
Therefore, multiplying by a high power of $\rho$ results in a very small number.
Let $w$ be the lowest exponent such that we deem the product of multiplying sane values by $\rho^w$ negligible.
Now, we can approximate $\Delta x_t$ by dropping negligible terms:
$$\Delta x_t\approx-\frac{\sqrt{\sum_{r=t-w}^{t-1}\rho^{t-1-r}(1-\rho)\Delta x_r^2+\epsilon}}{\sqrt{\sum_{r=t+1-w}^{t}\rho^{t-r}(1-\rho)g_r^2+\epsilon}}\,g_t=-\frac{\sqrt{\sum_{r=t-w}^{t-1}\rho^{t-1-r}(1-\rho)\begin{pmatrix}i_r^2\\j_r^2\\k_r^2\end{pmatrix}+\epsilon}}{\sqrt{\sum_{r=t+1-w}^{t}\rho^{t-r}(1-\rho)\begin{pmatrix}a_r^2\\b_r^2\\c_r^2\end{pmatrix}+\epsilon}}\begin{pmatrix}a_t\\b_t\\c_t\end{pmatrix}$$
$$\downarrow$$
$$\Delta x_t\approx-\begin{pmatrix}\frac{\sqrt{\sum_{r=t-w}^{t-1}\rho^{t-1-r}(1-\rho)\,i_r^2+\epsilon}}{\sqrt{\sum_{r=t+1-w}^{t}\rho^{t-r}(1-\rho)\,a_r^2+\epsilon}}\,a_t\\[2.5ex]\frac{\sqrt{\sum_{r=t-w}^{t-1}\rho^{t-1-r}(1-\rho)\,j_r^2+\epsilon}}{\sqrt{\sum_{r=t+1-w}^{t}\rho^{t-r}(1-\rho)\,b_r^2+\epsilon}}\,b_t\\[2.5ex]\frac{\sqrt{\sum_{r=t-w}^{t-1}\rho^{t-1-r}(1-\rho)\,k_r^2+\epsilon}}{\sqrt{\sum_{r=t+1-w}^{t}\rho^{t-r}(1-\rho)\,c_r^2+\epsilon}}\,c_t\end{pmatrix}$$
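The claim that truncating to a window of $w$ terms changes almost nothing can be checked numerically. A one-dimensional sketch; the gradient sequence, $\rho=0.9$, and $w=100$ are arbitrary choices made for this demonstration:

```python
import math

rho, eps, w = 0.9, 1e-6, 100
# Arbitrary bounded gradient sequence, 200 steps.
g = [math.sin(r) + 1.5 for r in range(1, 201)]
t = len(g)

# Full recursive average E[g^2]_t (eq. 8), starting from E[g^2]_0 = 0.
E = 0.0
for gr in g:
    E = rho * E + (1 - rho) * gr ** 2

# Truncated sum over only the last w steps (the approximation above).
trunc = sum(rho ** (t - r) * (1 - rho) * g[r - 1] ** 2
            for r in range(t + 1 - w, t + 1))

# The dropped terms carry total weight about rho^w, which is ~2.7e-5
# for rho = 0.9, w = 100, so the two RMS values are nearly identical.
diff = abs(math.sqrt(E + eps) - math.sqrt(trunc + eps))
print(diff)
```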