What is the time complexity for training a neural network using back-propagation?

16

Suppose that a NN contains $n$ hidden layers, $m$ training examples, $x$ features, and $n_i$ nodes in each layer. What is the time complexity of training this NN using back-propagation?

I have a basic idea of how the time complexity of algorithms is found, but here there are 4 different factors to consider, i.e. iterations, layers, nodes in each layer, training examples, and possibly more. I found an answer here, but it was not clear enough.

Are there other factors, apart from those mentioned above, that influence the time complexity of the training algorithm of a NN?

— DuttaA

See also https://qr.ae/TWttzq .

— Nr.

9

I haven't seen an answer from a trusted source, but I'll try to answer this myself with a simple example (based on my current knowledge).

In general, note that training an MLP using back-propagation is usually implemented with matrices.

Time complexity of matrix multiplication

The time complexity of matrix multiplication for $M_{ij} * M_{jk}$ is simply $\mathcal{O}(i*j*k)$.

Note that we are assuming the simplest multiplication algorithm here: there are other algorithms with somewhat better time complexity.
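To make the $\mathcal{O}(i*j*k)$ count concrete, here is a minimal sketch of the schoolbook algorithm in plain Python. The function name `matmul_naive` and the multiplication counter are illustrative additions, not part of any library.

```python
def matmul_naive(A, B):
    """Multiply an i x j matrix A by a j x k matrix B (lists of lists).

    Returns the product and the number of scalar multiplications performed.
    """
    i, j, k = len(A), len(B), len(B[0])
    assert all(len(row) == j for row in A), "inner dimensions must match"
    C = [[0.0] * k for _ in range(i)]
    mults = 0
    for r in range(i):            # i rows of A
        for c in range(k):        # k columns of B
            for s in range(j):    # j terms per dot product
                C[r][c] += A[r][s] * B[s][c]
                mults += 1
    return C, mults


A = [[1, 2, 3], [4, 5, 6]]        # 2 x 3
B = [[1, 0], [0, 1], [1, 1]]      # 3 x 2
C, mults = matmul_naive(A, B)
print(C, mults)                   # mults is 2 * 3 * 2 = 12
```

The three nested loops run exactly $i*j*k$ times, which is where the bound comes from.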

Feedforward pass algorithm

The feedforward propagation algorithm is as follows.

First, to go from layer $i$ to layer $j$, you compute

$S_j = W_{ji}*Z_i$

Then you apply the activation function

$Z_j = f(S_j)$

If we have $N$ layers (including the input and output layers), this runs $N-1$ times.
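The two steps above can be sketched in plain Python as follows. This is only an illustration: the sigmoid is an assumed choice for $f$ (the answer does not fix one), and the helper names `matmul`, `sigmoid` and `forward` are made up for this sketch.

```python
import math


def matmul(A, B):
    """Schoolbook matrix product of two lists of lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


def sigmoid(x):
    """Assumed activation function f."""
    return 1.0 / (1.0 + math.exp(-x))


def forward(weights, Z):
    """Run the forward pass.

    weights: list of N-1 weight matrices, one per pair of adjacent layers.
    Z: input activations, one column per training example.
    """
    for W in weights:                                  # executed N-1 times
        S = matmul(W, Z)                               # S_j = W_ji * Z_i
        Z = [[sigmoid(s) for s in row] for row in S]   # Z_j = f(S_j)
    return Z


# A 2 -> 3 -> 1 network with all-zero weights, evaluated on two examples.
W1 = [[0.0, 0.0] for _ in range(3)]   # 3 x 2
W2 = [[0.0, 0.0, 0.0]]                # 1 x 3
X = [[1.0, 2.0], [3.0, 4.0]]          # 2 features x 2 examples
out = forward([W1, W2], X)
print(out)                            # sigmoid(0) = 0.5 for every output
```

With $N = 3$ layers the loop body runs twice, once per weight matrix.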

Example

As an example, let's compute the time complexity of the forward pass algorithm for an MLP with $4$ layers, where $i$ denotes the number of nodes in the input layer, $j$ the number of nodes in the second layer, $k$ the number of nodes in the third layer, and $l$ the number of nodes in the output layer.

Since there are $4$ layers, you need $3$ matrices to represent the weights between these layers. Let's denote them by $W_{ji}$, $W_{kj}$ and $W_{lk}$, where $W_{ji}$ is a matrix with $j$ rows and $i$ columns ($W_{ji}$ thus contains the weights going from layer $i$ to layer $j$).

Assume you have $t$ training examples. For propagating from layer $i$ to $j$, we first have

$S_{jt} = W_{ji} * Z_{it}$

and this operation (i.e. matrix multiplication) has $\mathcal{O}(j*i*t)$ time complexity. Then we apply the activation function

$Z_{jt} = f(S_{jt})$

and this has $\mathcal{O}(j*t)$ time complexity, because it is an element-wise operation.

So, in total, we have

$\mathcal{O}(j*i*t + j*t) = \mathcal{O}(j*t*(i + 1)) = \mathcal{O}(j*i*t)$

Using the same logic, for going $j \to k$, we have $\mathcal{O}(k*j*t)$, and for $k \to l$, we have $\mathcal{O}(l*k*t)$.

In total, the time complexity for feedforward propagation will be

$\mathcal{O}(j*i*t + k*j*t + l*k*t) = \mathcal{O}(t*(ij + jk + kl))$

I'm not sure whether this can be simplified further. $\mathcal{O}(t*i*j*k*l)$ would also be a valid upper bound, but a much looser one, so the sum form is the one to keep.
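As a quick numeric check of the formula, the sketch below counts the scalar multiplications of one forward pass. The function name `forward_cost` and the example layer sizes are assumptions chosen for illustration.

```python
def forward_cost(layer_sizes, t):
    """Scalar multiplications for one forward pass over t examples.

    layer_sizes lists the node counts from input to output, e.g. [i, j, k, l].
    Element-wise activation costs are lower-order terms and are left out.
    """
    return t * sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))


sizes = [3, 4, 5, 2]                   # i, j, k, l (illustrative values)
cost = forward_cost(sizes, t=10)
print(cost)                            # 10 * (3*4 + 4*5 + 5*2) = 420

# The product form t*i*j*k*l is far larger for the same network.
product_bound = 10 * 3 * 4 * 5 * 2
print(product_bound)                   # 1200
```

Because only adjacent layers are connected, the cost is a sum of pairwise products rather than a product of all layer sizes.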

Back-propagation algorithm

The back-propagation algorithm proceeds as follows. Starting from the output layer $l \to k$, we compute the error signal $E_{lt}$, a matrix containing the error signals for the nodes at layer $l$

$E_{lt} = f'(S_{lt}) \odot {(Z_{lt} - O_{lt})}$

where $\odot$ means element-wise multiplication. Note that $E_{lt}$ has $l$ rows and $t$ columns: it simply means that each column is the error signal for training example $t$.

We then compute the "delta weights", $D_{lk} \in \mathbb{R}^{l \times k}$ (between layer $l$ and layer $k$ )

$D_{lk} = E_{lt} * Z_{tk}$

where $Z_{tk}$ is the transpose of $Z_{kt}$ .

We then adjust the weights

$W_{lk} = W_{lk} - D_{lk}$

For $l \to k$ , we thus have the time complexity $\mathcal{O}(lt + lt + ltk + lk) = \mathcal{O}(l*t*k)$ .

Now, going back from $k \to j$ . We first have

$E_{kt} = f'(S_{kt}) \odot (W_{kl} * E_{lt})$

Then

$D_{kj} = E_{kt} * Z_{tj}$

And then

$W_{kj} = W_{kj} - D_{kj}$

where $W_{kl}$ is the transpose of $W_{lk}$ . For $k \to j$ , we have the time complexity $\mathcal{O}(kt + klt + ktj + kj) = \mathcal{O}(k*t(l+j))$ .

And finally, for $j \to i$ , we have $\mathcal{O}(j*t(k+i))$ . In total, we have

$\mathcal{O}(ltk + tk(l + j) + tj (k + i)) = \mathcal{O}(t*(lk + kj + ji))$

which is the same as for the feedforward pass algorithm. Since they are the same, the total time complexity for one epoch will be

$\mathcal{O}(t*(ij + jk + kl)).$
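The per-layer terms of the backward pass can be tallied directly. The sketch below is an illustration only: `backward_cost` and `forward_cost` are hypothetical helper names, and the counts follow the terms written in the derivation (element-wise steps, the two matrix products, and the weight update).

```python
def forward_cost(layer_sizes, t):
    """Dominant multiplication count of one forward pass."""
    return t * sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))


def backward_cost(layer_sizes, t):
    """Operation count of one backward pass, term by term."""
    last = len(layer_sizes) - 1
    total = 0
    for idx in range(last, 0, -1):
        cur, prev = layer_sizes[idx], layer_sizes[idx - 1]
        if idx == last:
            total += cur * t + cur * t        # f'(S) and (Z - O), element-wise
        else:
            nxt = layer_sizes[idx + 1]
            total += cur * t + cur * nxt * t  # f'(S) and W^T * E
        total += cur * t * prev               # D = E * Z^T
        total += cur * prev                   # W = W - D
    return total


sizes = [3, 4, 5, 2]                          # i, j, k, l (illustrative values)
fwd = forward_cost(sizes, t=10)
bwd = backward_cost(sizes, t=10)
print(fwd, bwd)                               # 420 892
```

The backward count is larger by a constant factor (a little over 2 here), which big-O notation absorbs.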

This time complexity is then multiplied by number of iterations (epochs). So, we have

$\mathcal{O}(n*t*(ij + jk + kl)),$

where $n$ is the number of iterations.

Notes

Note that these matrix operations can be heavily parallelized on GPUs.

Conclusion

We tried to find the time complexity for training a neural network that has 4 layers with respectively $i$ , $j$ , $k$ and $l$ nodes, with $t$ training examples and $n$ epochs. The result was $\mathcal{O}(nt*(ij + jk + kl))$ .

We assumed the simplest form of matrix multiplication, which has cubic time complexity. We used the batch gradient descent algorithm. The results for stochastic and mini-batch gradient descent should be the same. (Let me know if you think otherwise: note that batch gradient descent is the general form and, with little modification, it becomes stochastic or mini-batch.)
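One way to check the claim about stochastic and mini-batch gradient descent is to split the $t$ examples into batches and add up the per-batch cost. The sketch below does that. `epoch_cost` is a hypothetical helper that counts only the dominant $t*(ij + jk + kl)$ term and ignores constant factors and the per-batch weight update.

```python
def epoch_cost(layer_sizes, t, batch_size):
    """Dominant multiplication count of one epoch split into mini-batches."""
    w = sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))
    total = 0
    for start in range(0, t, batch_size):
        b = min(batch_size, t - start)   # the last batch may be smaller
        total += b * w                   # cost is linear in the batch size
    return total


sizes = [3, 4, 5, 2]                     # i, j, k, l (illustrative values)
costs = {bs: epoch_cost(sizes, t=10, batch_size=bs) for bs in (1, 4, 10)}
print(costs)                             # every batch size gives 10 * 42 = 420
```

Smaller batches do mean more weight updates per epoch, but each update costs on the order of the number of weights, so even with batch size 1 the updates add at most a term of order $t*(ij + jk + kl)$.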

Also, if you use momentum optimization, you will get the same time complexity, because the extra matrix operations it requires are all element-wise and therefore do not affect the time complexity of the algorithm.
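A minimal sketch of a classical momentum update shows why it is element-wise. The function name `momentum_step` and the hyperparameter names `lr` and `mu` (with their default values) are assumptions made for this illustration.

```python
def momentum_step(W, grad, velocity, lr=0.1, mu=0.9):
    """Update W in place with classical momentum.

    Returns the number of weights touched, which equals the work done.
    """
    touched = 0
    for r in range(len(W)):
        for c in range(len(W[r])):
            # Each weight is updated independently of all the others.
            velocity[r][c] = mu * velocity[r][c] + lr * grad[r][c]
            W[r][c] -= velocity[r][c]
            touched += 1
    return touched


W = [[1.0] * 3 for _ in range(2)]          # 2 x 3 weight matrix
grad = [[1.0] * 3 for _ in range(2)]
velocity = [[0.0] * 3 for _ in range(2)]
touched = momentum_step(W, grad, velocity)
print(touched)                             # 6, one visit per weight
```

For a $j \times i$ weight matrix this costs $\mathcal{O}(j*i)$ per update, which is dominated by the $\mathcal{O}(j*i*t)$ matrix products.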

I'm not sure what the results would be using other optimizers such as RMSprop.

Sources

The following article http://briandolhansky.com/blog/2014/10/30/artificial-neural-networks-matrix-form-part-5 describes an implementation using matrices. Although that implementation uses a "row major" layout, this does not affect the time complexity.

If you're not familiar with back-propagation, check this article:

http://briandolhansky.com/blog/2013/9/27/artificial-neural-networks-backpropagation-part-4

— M.kazem Akhgary

Your answer is great. I could not find any ambiguity so far, but you forgot the number of iterations part, just add it. If no one else answers in 5 days, I'll surely accept your answer.

— DuttaA

@DuttaA I tried to put in everything I knew. It may not be 100% correct, so feel free to leave this unaccepted :) I'm also waiting for other answers to see what other points I missed.

— M.kazem Akhgary

3

For the evaluation of a single pattern, you need to process all weights and all neurons. Given that every neuron has at least one weight, we can ignore the neurons and get $\mathcal{O}(w)$, where $w$ is the number of weights, i.e., roughly $n * n_i^2$, assuming full connectivity between your layers (about $n_i^2$ weights between each pair of adjacent layers).

The back-propagation has the same complexity as the forward evaluation (just look at the formula).

So, the complexity for learning $m$ examples, where each gets repeated $e$ times, is $\mathcal{O}(w*m*e)$ .
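Using the symbols from the question, the count can be sketched as follows. The helper names `count_weights` and `training_cost`, and the `outputs` parameter, are assumptions for illustration (the question does not state the size of the output layer), and biases are ignored.

```python
def count_weights(x, n, n_i, outputs=1):
    """Weights in a fully connected net.

    x: input features, n: hidden layers, n_i: nodes per hidden layer,
    outputs: assumed number of output nodes.
    """
    sizes = [x] + [n_i] * n + [outputs]
    return sum(a * b for a, b in zip(sizes, sizes[1:]))


def training_cost(w, m, e):
    """Constant work per weight, repeated for m examples over e epochs."""
    return w * m * e


w = count_weights(x=3, n=2, n_i=4, outputs=2)   # 3*4 + 4*4 + 4*2 = 36
total = training_cost(w, m=10, e=100)
print(w, total)                                  # 36 36000
```

The number of epochs `e` still has to be chosen empirically, as noted below.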

The bad news is that there's no formula telling you what number of epochs $e$ you need.

— maaartinus

From the above answer, don't you think it depends on more factors?

— DuttaA

1

@DuttaA No. There's a constant amount of work per weight, which gets repeated e times for each of m examples. I didn't bother to compute the number of weights, I guess, that's the difference.

— maaartinus

I think the answers are the same. In my answer, I can take the number of weights to be w = ij + jk + kl, i.e. the sum over adjacent layers of the product of their node counts.

— M.kazem Akhgary