I am slightly confused about the backpropagation algorithm used in the multilayer perceptron (MLP).

The error is adjusted by the cost function. In backpropagation, we are trying to adjust the weights of the hidden layers. I can understand the output error, that is, e = d - y (without the subscripts).

The questions are:

How does one get the hidden-layer error? How does one calculate it?
If I backpropagate it, should I use it as a cost function of an adaptive filter, or should I use a pointer (in the C/C++ programming sense) to update the weights?

machine-learning neural-networks backpropagation

— HIGGINS

NN is a rather obsolete technology, so you will ...

@mbq: I do not doubt your words, but how did you come to the conclusion that NNs are an "obsolete technology"?

— Steffen

@steffen By observation. I mean, it is obvious that nobody from the NN community will come out and say "Hey guys, let's drop our life's work and play with something better!", but we have tools that achieve the same or better accuracy without all this ambivalence and never-ending training. And people do drop NNs in favour of them.

This had some truth to it when you said it, @mbq, but not any more.

— Jerad

@jerad Pretty simple: I just have not seen a fair comparison with other methods yet (Kaggle is not a fair comparison, because it lacks confidence intervals for the accuracies, especially when the results of all the high-scoring teams are as close together as in the Merck contest), nor any analysis of the robustness of the parameter optimisation, which is much worse.

I figured I would write a self-contained answer here for anyone who is interested. It uses the notation described here.

Introduction

The idea behind backpropagation is to have a set of "training examples" that we use to train our network. Each of these has a known answer, so we can plug them into the neural network and find out how wrong it was.

For example, with handwriting recognition, you would have lots of handwritten characters alongside what they actually are. The neural network can then be trained via backpropagation to "learn" how to recognize each symbol, so that when it is later presented with an unknown handwritten character it can identify it correctly.

Specifically, we feed some training samples into the neural network, see how well it did, then "trickle backwards" to find how much we can change each node's weights and bias to get a better result, and adjust them accordingly. As we do this, the network "learns".

There are also other steps that may be included in the training process (for example, dropout), but I will focus mostly on backpropagation, since that is what this question was about.

Partial Derivatives

A partial derivative $\frac{\partial f}{\partial x}$ is a derivative of $f$ with respect to some variable $x$.

For example, if $f(x, y)=x^2 + y^2$, then $\frac{\partial f}{\partial x}=2x$, because $y^2$ is simply a constant with respect to $x$. Likewise, $\frac{\partial f}{\partial y}= 2y$, because $x^2$ is simply a constant with respect to $y$.

The gradient of a function, denoted $\nabla f$, is a function containing the partial derivative for every variable of $f$. Specifically:

$\nabla f(v_1, v_2, ..., v_n) = \frac{\partial f}{\partial v_1 }\mathbf{e}_1 + \cdots + \frac{\partial f}{\partial v_n }\mathbf{e}_n$

where $e_i$ is a unit vector pointing in the direction of variable $v_i$.

Now, once we have computed $\nabla f$ for some function $f$, if we are at position $(v_1, v_2, ..., v_n)$, we can "slide down" $f$ by going in the direction $-\nabla f(v_1, v_2, ..., v_n)$.

With our example of $f(x, y)=x^2 + y^2$, the unit vectors are $e_1=(1, 0)$ and $e_2=(0, 1)$, because $v_1=x$ and $v_2=y$, and those vectors point in the direction of the $x$ and $y$ axes. Thus, $\nabla f(x, y) = 2x (1, 0) + 2y(0, 1)$.

Now, to "slide down" our function $f$, let's say we are at the point $(-2, 4)$. Then we would need to move in the direction $-\nabla f(-2, 4)= -(2 \cdot -2 \cdot (1, 0) + 2 \cdot 4 \cdot (0, 1)) = -((-4, 0) + (0, 8))=(4, -8)$.

The magnitude of this vector tells us how steep the hill is (higher values mean the hill is steeper). In this case, we have $\sqrt{4^2+(-8)^2}\approx 8.944$.
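As a quick numeric check of the worked example above, here is a minimal Python sketch. The names `grad_f` and `magnitude` are made up for illustration; the gradient $(2x, 2y)$ is the one derived above.

```python
import math

def grad_f(x, y):
    """Gradient of f(x, y) = x^2 + y^2, which is (2x, 2y)."""
    return (2 * x, 2 * y)

def magnitude(v):
    """Euclidean length of a vector given as a tuple."""
    return math.sqrt(sum(c * c for c in v))

point = (-2, 4)
g = grad_f(*point)                          # gradient at the point: (-4, 8)
descent_direction = tuple(-c for c in g)    # negative gradient: (4, -8)
steepness = magnitude(descent_direction)    # about 8.944

print(descent_direction, round(steepness, 3))
```

The printed direction and magnitude should match the hand calculation.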

[Figure: gradient descent]

Hadamard Product

The Hadamard product of two matrices $A, B \in R^{n\times m}$ is just like matrix addition, except that instead of adding the matrices element-wise, we multiply them element-wise.

Formally, while matrix addition is $A + B = C$, where $C \in R^{n \times m}$ such that

$C^i_j = A^i_j + B^i_j$

the Hadamard product is $A \odot B = C$, where $C \in R^{n \times m}$ such that

$C^i_j = A^i_j \cdot B^i_j$
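To make the comparison concrete, here is a small Python sketch with matrices stored as lists of rows. The helper names `matrix_add` and `hadamard` and the sample matrices are only illustrative.

```python
def matrix_add(A, B):
    """Element-wise sum of two matrices of the same shape."""
    return [[a + b for a, b in zip(row_a, row_b)] for row_a, row_b in zip(A, B)]

def hadamard(A, B):
    """Element-wise (Hadamard) product of two matrices of the same shape."""
    return [[a * b for a, b in zip(row_a, row_b)] for row_a, row_b in zip(A, B)]

A = [[1, 2, 3], [4, 5, 6]]
B = [[10, 20, 30], [40, 50, 60]]

print(matrix_add(A, B))  # [[11, 22, 33], [44, 55, 66]]
print(hadamard(A, B))    # [[10, 40, 90], [160, 250, 360]]
```

The two functions differ only in the operator applied to each pair of entries.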

Computing the Gradients

(Most of this section is from Nielsen's book.)

We have a set of training samples $(S, E)$, where $S^r$ is a single input training sample and $E^r$ is the expected output value of that training sample. We also have our neural network, which is composed of weights $W$ and biases $B$. $r$ is used to prevent confusion with the $i$, $j$, and $k$ used in the definition of a feedforward network.

Next, we define a cost function $C(W, B, S^r, E^r)$ that takes in our neural network and a single training example and outputs how well it did.

Normally the quadratic cost is used, which is defined by

$C(W, B, S^r, E^r) = 0.5\sum\limits_j (a^L_j - E^r_j)^2$

where $a^L$ is the output of our neural network, given input sample $S^r$.
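As a minimal illustration of the quadratic cost, the sketch below evaluates it for one made-up output vector and target. The function name `quadratic_cost` and the numbers are assumptions for the example only.

```python
def quadratic_cost(output, expected):
    """0.5 * sum_j (a_j - E_j)^2 for one training sample."""
    return 0.5 * sum((a - e) ** 2 for a, e in zip(output, expected))

a_L = [0.8, 0.1]   # hypothetical network output a^L
E_r = [1.0, 0.0]   # hypothetical expected output E^r

cost = quadratic_cost(a_L, E_r)
print(cost)  # approximately 0.025
```

The cost is zero only when the output matches the target exactly, and it grows with the squared distance between them.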

Then we want to find $\frac{\partial C}{\partial w^i_{jk}}$ and $\frac{\partial C}{\partial b^i_j}$ for each node in our feedforward neural network.

We can call this the gradient of $C$ at each neuron, because we treat $S^r$ and $E^r$ as constants, since we cannot change them when we are trying to learn. And this makes sense: we want to move in a direction relative to $W$ and $B$ that minimizes the cost, and moving in the negative direction of the gradient with respect to $W$ and $B$ will do this.

To do this, we define $\delta^i_j=\frac{\partial C}{\partial z^i_j}$ as the error of neuron $j$ in layer $i$.

We start by computing $a^L$ by plugging $S^r$ into our neural network.

Then we compute the error of our output layer, $\delta^L$, via

$\delta^L_j = \frac{\partial C}{\partial a^L_j} \sigma^{\prime}(z^L_j)$

which can also be written as

$\delta^L = \nabla_a C \odot \sigma^{\prime}(z^L)$

Next, we find the error $\delta^i$ in terms of the error in the next layer, $\delta^{i+1}$, via

$\delta^i=((W^{i+1})^T \delta^{i+1}) \odot \sigma^{\prime}(z^i)$

Now that we have the error of each node in our neural network, computing the gradient with respect to our weights and biases is easy:

$\frac{\partial C}{\partial w^i_{jk}}=\delta^i_j a^{i-1}_k=\delta^i(a^{i-1})^T$

$\frac{\partial C}{\partial b^i_j} = \delta^i_j$
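The four equations above can be sketched in plain Python as follows. This is only an illustration under stated assumptions: a sigmoid activation, weighted inputs of the form $z^i = W^i a^{i-1} + b^i$, the quadratic cost, and an arbitrary 2-3-2 network. All function names and numbers are made up for the example. The last few lines compare one backpropagated partial derivative with a finite-difference estimate.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def forward(weights, biases, sample):
    """Return weighted inputs z and activations a for every layer."""
    activations, zs = [list(sample)], []
    for W, b in zip(weights, biases):
        z = [sum(w * a for w, a in zip(row, activations[-1])) + b_j
             for row, b_j in zip(W, b)]
        zs.append(z)
        activations.append([sigmoid(v) for v in z])
    return zs, activations

def cost(weights, biases, sample, expected):
    """Quadratic cost for one training sample."""
    _, activations = forward(weights, biases, sample)
    return 0.5 * sum((a - e) ** 2 for a, e in zip(activations[-1], expected))

def backprop(weights, biases, sample, expected):
    """Return (grad_w, grad_b), shaped like weights and biases."""
    zs, activations = forward(weights, biases, sample)
    # Output layer: delta = (a - E) * sigma'(z), elementwise (quadratic cost).
    delta = [(a - e) * sigmoid_prime(z)
             for a, e, z in zip(activations[-1], expected, zs[-1])]
    grad_w, grad_b = [None] * len(weights), [None] * len(biases)
    for i in range(len(weights) - 1, -1, -1):
        a_prev = activations[i]                        # input to this layer
        grad_w[i] = [[d * a for a in a_prev] for d in delta]  # delta_j * a_k
        grad_b[i] = list(delta)                        # dC/db_j = delta_j
        if i > 0:
            # Error of the previous layer: (W^T delta) * sigma'(z), elementwise.
            W = weights[i]
            delta = [sum(W[j][k] * delta[j] for j in range(len(delta)))
                     * sigmoid_prime(zs[i - 1][k])
                     for k in range(len(a_prev))]
    return grad_w, grad_b

# Arbitrary example network: 2 inputs, 3 hidden neurons, 2 outputs.
weights = [[[0.1, -0.2], [0.4, 0.3], [-0.5, 0.2]],
           [[0.3, -0.1, 0.2], [-0.4, 0.5, 0.1]]]
biases = [[0.0, 0.1, -0.1], [0.2, -0.2]]
sample, expected = [1.0, 0.5], [1.0, 0.0]

grad_w, grad_b = backprop(weights, biases, sample, expected)

# Finite-difference estimate for one hidden-layer weight.
eps = 1e-6
weights[0][1][0] += eps
c_plus = cost(weights, biases, sample, expected)
weights[0][1][0] -= 2 * eps
c_minus = cost(weights, biases, sample, expected)
weights[0][1][0] += eps
numeric = (c_plus - c_minus) / (2 * eps)

print(grad_w[0][1][0], numeric)  # the two values should agree closely
```

Checking against a finite difference is a common way to catch indexing mistakes in a hand-written backward pass.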

Note that the equation for the error of the output layer is the only equation that depends on the cost function, so the last three equations are the same regardless of the cost function.

As an example, with the quadratic cost, we get

$\delta^L = (a^L - E^r) \odot \sigma^{\prime}(z^L)$

for the error of the output layer. This can then be plugged into the equation for $\delta^i$ to get the error of the $(L-1)^{\text{th}}$ layer:

$\delta^{L-1}=((W^{L})^T \delta^{L}) \odot \sigma^{\prime}(z^{L-1})$

$=((W^{L})^T ((a^L - E^r) \odot \sigma^{\prime}(z^L))) \odot \sigma^{\prime}(z^{L-1})$

We can repeat this process to find the error of any layer with respect to $C$, which then allows us to compute the gradient of any node's weights and bias with respect to $C$.

I could write up an explanation and proof of these equations if desired, though one can also find proofs of them here. I'd encourage anyone that is reading this to prove these themselves though, beginning with the definition $\delta^i_j=\frac{\partial C}{\partial z^i_j}$ and applying the chain rule liberally.
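For readers who want a starting point, here is a sketch of that chain-rule argument. It assumes the usual feedforward definitions $z^i_j = \sum_k w^i_{jk} a^{i-1}_k + b^i_j$ and $a^i_j = \sigma(z^i_j)$, which come from the referenced notation rather than from this answer.

```latex
\begin{align*}
% Output layer: C depends on z^L_j only through a^L_j.
\delta^L_j &= \frac{\partial C}{\partial z^L_j}
            = \frac{\partial C}{\partial a^L_j}\,
              \frac{\partial a^L_j}{\partial z^L_j}
            = \frac{\partial C}{\partial a^L_j}\,\sigma'(z^L_j) \\[4pt]
% Hidden layer: C depends on z^i_j through every z^{i+1}_k.
\delta^i_j &= \frac{\partial C}{\partial z^i_j}
            = \sum_k \frac{\partial C}{\partial z^{i+1}_k}\,
              \frac{\partial z^{i+1}_k}{\partial z^i_j}
            = \sum_k \delta^{i+1}_k\, w^{i+1}_{kj}\, \sigma'(z^i_j) \\[4pt]
% Weights and biases: z^i_j is linear in w^i_{jk} and b^i_j.
\frac{\partial C}{\partial w^i_{jk}}
           &= \frac{\partial C}{\partial z^i_j}\,
              \frac{\partial z^i_j}{\partial w^i_{jk}}
            = \delta^i_j\, a^{i-1}_k \\[4pt]
\frac{\partial C}{\partial b^i_j}
           &= \frac{\partial C}{\partial z^i_j}\,
              \frac{\partial z^i_j}{\partial b^i_j}
            = \delta^i_j
\end{align*}
```

The sum over $k$ in the second line is the component form of $(W^{i+1})^T \delta^{i+1}$, which gives the matrix equation used earlier.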

For some more examples, I made a list of some cost functions alongside their gradients here.

Gradient Descent

Now that we have these gradients, we need to use them to learn. Earlier, we found how to "slide down" a function from a given point. In this case, because we have the gradient of the cost with respect to the weights and bias of some node, our "coordinate" is the current weights and bias of that node. Since we have already found the gradients with respect to those coordinates, those values already tell us how much we need to change.

We don't want to slide down the slope at a very fast speed, otherwise we risk sliding past the minimum. To prevent this, we want some "step size" $\eta$ .

Then we find how much we should modify each weight and bias by. Because we have already computed the gradient with respect to the current weights and biases, we have

$\Delta w^i_{jk}= -\eta \frac{\partial C}{\partial w^i_{jk}}$

$\Delta b^i_j = -\eta \frac{\partial C}{\partial b^i_j}$

Thus, our new weights and biases are

$w^i_{jk} = w^i_{jk} + \Delta w^i_{jk}$

$b^i_j = b^i_j + \Delta b^i_j$

Using this process on a neural network with only an input layer and an output layer is called the Delta Rule.
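To see the update rule and the role of the step size in isolation, here is a small sketch that applies it to the earlier toy function $f(x, y) = x^2 + y^2$ instead of a network. The function names and the two values of $\eta$ are chosen only for illustration.

```python
def grad_f(x, y):
    """Gradient of f(x, y) = x^2 + y^2."""
    return (2 * x, 2 * y)

def gradient_descent(start, eta, steps):
    """Repeatedly apply: new value = old value - eta * gradient."""
    x, y = start
    for _ in range(steps):
        gx, gy = grad_f(x, y)
        x, y = x - eta * gx, y - eta * gy
    return x, y

# A small step size slides towards the minimum at (0, 0).
small = gradient_descent(start=(-2.0, 4.0), eta=0.1, steps=50)

# A step size that is too large overshoots further on every step.
large = gradient_descent(start=(-2.0, 4.0), eta=1.1, steps=50)

print(small)
print(large)
```

With $\eta = 0.1$ each coordinate shrinks by a factor of 0.8 per step, while with $\eta = 1.1$ each step multiplies it by $-1.2$, so the iterate moves away from the minimum.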

Stochastic Gradient Descent

Now that we know how to perform backpropagation for a single sample, we need some way of using this process to "learn" our entire training set.

One option is simply performing backpropagation for each sample in our training data, one at a time. This is pretty inefficient though.

A better approach is Stochastic Gradient Descent. Instead of performing backpropagation for each sample, we pick a small random sample (called a batch) of our training set, then perform backpropagation for each sample in that batch. The hope is that by doing this, we capture the "intent" of the data set, without having to compute the gradient of every sample.

For example, if we had 1000 samples, we could pick a batch of size 50, then run backpropagation for each sample in this batch. The hope is that we were given a large enough training set that it represents the distribution of the actual data we are trying to learn well enough that picking a small random sample is sufficient to capture this information.

However, doing backpropagation for each training example in our mini-batch isn't ideal, because we can end up "wiggling around" where training samples modify weights and biases in such a way that they cancel each other out and prevent them from getting to the minimum we are trying to get to.

To prevent this, we want to go to the "average minimum," because the hope is that, on average, the samples' gradients are pointing down the slope. So, after choosing our batch randomly, we create a mini-batch, which is a small random sample of our batch. Then, given a mini-batch with $n$ training samples, we only update the weights and biases after averaging the gradients of each sample in the mini-batch.

Formally, we do

$\Delta w^{i}_{jk} = \frac{1}{n}\sum\limits_r \Delta w^{ri}_{jk}$

and

$\Delta b^{i}_{j} = \frac{1}{n}\sum\limits_r \Delta b^{ri}_{j}$

where $\Delta w^{ri}_{jk}$ is the computed change in weight for sample $r$ , and $\Delta b^{ri}_{j}$ is the computed change in bias for sample $r$ .

Then, like before, we can update the weights and biases via:

$w^i_{jk} = w^i_{jk} + \Delta w^{i}_{jk}$

$b^i_j = b^i_j + \Delta b^{i}_{j}$
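The averaging step can be sketched as below for a single weight matrix. The per-sample changes are made-up numbers standing in for the values $\Delta w^{ri}_{jk}$ that backpropagation would produce, and the helper names are hypothetical.

```python
def average_updates(per_sample_deltas):
    """Average n same-shaped matrices of per-sample changes element-wise."""
    n = len(per_sample_deltas)
    rows, cols = len(per_sample_deltas[0]), len(per_sample_deltas[0][0])
    return [[sum(d[j][k] for d in per_sample_deltas) / n for k in range(cols)]
            for j in range(rows)]

def apply_update(W, delta_W):
    """New weights = old weights + averaged change."""
    return [[w + d for w, d in zip(row_w, row_d)]
            for row_w, row_d in zip(W, delta_W)]

W = [[0.5, -0.5]]                       # one layer with a 1x2 weight matrix
per_sample = [[[0.2, -0.1]],            # change suggested by sample 1
              [[-0.2, 0.3]],            # change suggested by sample 2
              [[0.3, 0.1]]]             # change suggested by sample 3

delta_W = average_updates(per_sample)   # roughly [[0.1, 0.1]]
W_new = apply_update(W, delta_W)        # roughly [[0.6, -0.4]]
print(delta_W, W_new)
```

Here the first two samples pull the first weight in opposite directions, and averaging leaves only the net tendency of the mini-batch.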

This gives us some flexibility in how we want to perform gradient descent. If the function we are trying to learn has lots of local minima, this "wiggling around" behavior is actually desirable, because it means that we are much less likely to get "stuck" in one local minimum, and more likely to "jump out" of it and hopefully fall into another that is closer to the global minimum. Thus we want small mini-batches.

On the other hand, if we know that there are very few local minima, and gradient descent generally heads towards the global minimum, we want larger mini-batches, because this "wiggling around" behavior will prevent us from going down the slope as fast as we would like. See here.

One option is to pick the largest mini-batch possible, considering the entire batch as one mini-batch. This is called Batch Gradient Descent, since we are simply averaging the gradients of the batch. This is almost never used in practice, however, because it is very inefficient.

— Phylliida

I haven't dealt with Neural Networks for some years now, but I think you will find everything you need here:

Neural Networks - A Systematic Introduction, Chapter 7: The backpropagation algorithm

I apologize for not writing the direct answer here, but since I would have to look up the details to remember them (like you), and given that an answer without some backing might even be useless, I hope this is OK. However, if any questions remain, drop a comment and I'll see what I can do.

— steffen
