What is the purpose of an activation function in neural networks?

16

It is said that activation functions in neural networks help introduce non-linearity.

What does this mean?
What does non-linearity mean in this context?
How does introducing this non-linearity help?
Are there any other purposes of activation functions?

neural-networks deep-learning

— Mohsin Bukhari

13

Almost all of the functionality provided by non-linear activation functions has been covered by other answers. Let me sum it up:

First, what does non-linearity mean? It means something (a function in this case) that is not linear with respect to a given variable or variables, i.e. $f(c_1 x_1 + c_2 x_2 + \dots + c_n x_n + b) \neq c_1 f(x_1) + c_2 f(x_2) + \dots + c_n f(x_n) + b$.
What does non-linearity mean in this context? It means that the neural network can successfully approximate functions that do not follow linearity (up to a certain error $e$ decided by the user), or that it can successfully predict the class of data separated by a decision boundary that is not linear.
Why does it help? I hardly think you can find any physical-world phenomenon that follows linearity straightforwardly. So you need a non-linear function that can approximate the non-linear phenomenon. A good intuition is also that any decision boundary or function is a linear combination of polynomial combinations of the input features (so ultimately non-linear).
Purposes of an activation function? In addition to introducing non-linearity, every activation function has its own features.

Sigmoid $\frac{1}{1 + e^{-(w_1 x_1 + \dots + w_n x_n + b)}}$

This is one of the most common activation functions and is monotonically increasing everywhere. It is generally used at the final output node because it squashes values into the range between 0 and 1 (useful if the output is required to be 0 or 1). Thus an output above 0.5 is considered 1 and an output below 0.5 is considered 0, although a different threshold (not 0.5) may be set. Its main advantage is that its derivative is easy to compute and reuses already calculated values. Supposedly, horseshoe crab neurons have this activation function.
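As a minimal sketch of the "reuses already calculated values" point: the sigmoid's derivative can be written as $s(1 - s)$, where $s$ is the output that was already computed in the forward pass. The function names and the sample input below are illustrative assumptions, and the code is not hardened against overflow for very negative inputs.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic sigmoid 1 / (1 + e^(-z)); fine for moderate z, not overflow-safe."""
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad_from_output(s: float) -> float:
    """Derivative of the sigmoid written in terms of its own output s = sigmoid(z)."""
    return s * (1.0 - s)

def classify(z: float, threshold: float = 0.5) -> int:
    """Turn the squashed output into a 0/1 decision; the threshold is adjustable."""
    return 1 if sigmoid(z) > threshold else 0

z = 0.8                      # hypothetical pre-activation value
s = sigmoid(z)               # forward pass
analytic = sigmoid_grad_from_output(s)

# Central finite difference as an independent check of the derivative.
h = 1e-6
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)

print(s, analytic, numeric)
print(classify(2.0), classify(-2.0))
```

No extra exponential is needed for the backward pass, which is the convenience the answer refers to.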

Tanh $\frac{e^{(w_1 x_1 + \dots + w_n x_n + b)} - e^{-(w_1 x_1 + \dots + w_n x_n + b)}}{e^{(w_1 x_1 + \dots + w_n x_n + b)} + e^{-(w_1 x_1 + \dots + w_n x_n + b)}}$

This has an advantage over the sigmoid activation function because it tends to centre the output at 0, which leads to better learning in the subsequent layers (it acts as a feature normaliser). A nice explanation here. Negative and positive output values may be treated as 0 and 1 respectively. It is used mostly in RNNs.
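A small sketch of the zero-centring claim, assuming a hand-picked symmetric set of inputs: the mean tanh output is 0, while the mean sigmoid output sits at 0.5. It also checks the standard identity $\tanh(z) = 2\,\mathrm{sigmoid}(2z) - 1$, which shows that tanh is a rescaled, shifted sigmoid.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical pre-activations, symmetric around 0.
inputs = [-3.0, -1.0, 0.0, 1.0, 3.0]

tanh_out = [math.tanh(z) for z in inputs]
sig_out = [sigmoid(z) for z in inputs]

mean_tanh = sum(tanh_out) / len(tanh_out)   # centred at 0
mean_sig = sum(sig_out) / len(sig_out)      # centred at 0.5

# tanh is a rescaled and shifted sigmoid.
identity_gap = max(abs(math.tanh(z) - (2.0 * sigmoid(2.0 * z) - 1.0)) for z in inputs)

print(mean_tanh, mean_sig, identity_gap)
```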

ReLU activation function - This is another very common, simple non-linear activation function (linear in the positive range and linear in the negative range, taken separately). Its advantage is that it largely avoids the vanishing-gradient problem faced by the two functions above, whose gradient tends to 0 as x tends to +infinity or -infinity. Here is an answer about the approximation power of ReLU in spite of its apparent linearity. ReLUs have the disadvantage of dead neurons (units whose output is stuck at 0), which tends to call for larger NNs.
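To make the gradient comparison concrete, here is a rough sketch that evaluates the derivatives of sigmoid, tanh, and ReLU at a few assumed sample points. Taking the ReLU derivative at exactly 0 to be 0 is a convention chosen here, not something the answer specifies.

```python
import math

def sigmoid_grad(z: float) -> float:
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

def tanh_grad(z: float) -> float:
    return 1.0 - math.tanh(z) ** 2

def relu(z: float) -> float:
    return max(0.0, z)

def relu_grad(z: float) -> float:
    # Convention: use 0 as the (sub)gradient at z == 0.
    return 1.0 if z > 0 else 0.0

for z in (-10.0, -1.0, 1.0, 10.0):
    print(z, sigmoid_grad(z), tanh_grad(z), relu_grad(z))
```

For large positive inputs the sigmoid and tanh gradients are nearly 0 while the ReLU gradient stays at 1. For negative inputs the ReLU gradient is exactly 0, which is the mechanism behind dead neurons.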

You can also design your own activation functions depending on your specific problem. You may have a quadratic activation function that approximates quadratic functions much better. But then you have to design a cost function that is somewhat convex in nature, so that you can optimise it using first-order derivatives and the NN actually converges to a decent result. This is the main reason why standard activation functions are used. But I believe that, with proper mathematical tools, there is huge potential for new and eccentric activation functions.

For example, say you are trying to approximate a single-variable quadratic function, say $a x^2 + c$. This is best approximated by a quadratic activation $w_1 x^2 + b$, where $w_1$ and $b$ are the trainable parameters. However, designing a loss function that works with the conventional first-order derivative method (gradient descent) can be quite tough for a function that is not monotonically increasing.
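A minimal sketch of the quadratic example, under assumptions not in the answer: the target coefficients, the sample grid, the learning rate, and a mean-squared-error loss are all chosen here for illustration. In this single-unit case the model is linear in $w_1$ and $b$, so the squared-error loss is convex and plain gradient descent recovers the coefficients. That convenience need not carry over once such units are stacked in a deeper network.

```python
# Hypothetical target a*x^2 + c and training grid.
A_TRUE, C_TRUE = 2.0, -1.0
xs = [i * 0.5 for i in range(-4, 5)]          # -2.0 ... 2.0
ys = [A_TRUE * x * x + C_TRUE for x in xs]

def quadratic_unit(x: float, w1: float, b: float) -> float:
    """Single unit with a quadratic activation: w1 * x^2 + b."""
    return w1 * x * x + b

w1, b = 0.0, 0.0      # trainable parameters
lr = 0.05             # assumed learning rate, small enough for this data

for _ in range(2000):
    grad_w1 = 0.0
    grad_b = 0.0
    for x, y in zip(xs, ys):
        err = quadratic_unit(x, w1, b) - y
        # Gradients of the mean squared error with respect to w1 and b.
        grad_w1 += 2.0 * err * x * x / len(xs)
        grad_b += 2.0 * err / len(xs)
    w1 -= lr * grad_w1
    b -= lr * grad_b

print(w1, b)
```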

For mathematicians: in the sigmoid activation function $\frac{1}{1 + e^{-(w_1 x_1 + \dots + w_n x_n + b)}}$ we see that $e^{-(w_1 x_1 + \dots + w_n x_n + b)}$ is < 1 whenever the weighted sum is positive. By binomial expansion, or by reversing the sum of the infinite geometric (GP) series, we then get $\frac{1}{1 + y} = 1 - y + y^2 - \dots$, where in a NN $y = e^{-(w_1 x_1 + \dots + w_n x_n + b)}$. Thus we get all the powers of $y$, and each power of $y$ can be thought of as a multiplication of several decaying exponentials, each based on one feature $x$, for example $y^2 = e^{-2 w_1 x_1} \cdot e^{-2 w_2 x_2} \cdot e^{-2 w_3 x_3} \cdots e^{-2b}$. Thus each feature has a say in the scaling of the graph of $y^2$.
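The factorisation of $y^2$ into one exponential per feature can be checked numerically. The weights, inputs, and bias below are made-up values for illustration only.

```python
import math

# Hypothetical weights, inputs and bias.
w = [0.5, 0.3, 0.8]
x = [1.0, 2.0, 0.5]
b = 0.1

z = sum(wi * xi for wi, xi in zip(w, x)) + b
y = math.exp(-z)

# y^2 as a product of one exponential factor per feature, times the bias factor.
product = math.exp(-2.0 * b)
for wi, xi in zip(w, x):
    product *= math.exp(-2.0 * wi * xi)

print(y ** 2, product)
```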

Another way of thinking would be to expand the exponentials according to the Taylor series:
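The expansion meant here is presumably the standard Taylor series of the exponential. Writing $z$ as an assumed shorthand for the weighted sum, it reads:

```latex
e^{-z} \;=\; \sum_{k=0}^{\infty} \frac{(-z)^k}{k!}
       \;=\; 1 - z + \frac{z^2}{2!} - \frac{z^3}{3!} + \dots,
\qquad z = w_1 x_1 + \dots + w_n x_n + b
```

Each power $z^k$ multiplies out into products of the inputs $x_1, \dots, x_n$, which is where the polynomial combinations come from.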

Thus we get a very complex combination containing all possible polynomial combinations of the input variables. I believe that if a neural network is structured correctly, the NN can fine-tune these polynomial combinations just by modifying the connection weights, selecting the most useful polynomial terms and rejecting others by subtracting the outputs of two properly weighted nodes.

The $tanh$ activation can work in the same way, since its output satisfies $|tanh| < 1$. I am not sure how ReLUs work, but due to their rigid structure and the problem of dead neurons, ReLUs require larger networks for a good approximation.

But for a formal mathematical proof one has to look at the Universal Approximation Theorem.

For non-mathematicians, these links offer some better insights:

Activation Functions by Andrew Ng - for a more formal and scientific answer

How does a neural network classifier classify from just drawing a decision plane?

Differentiable activation function

A visual proof that neural nets can compute any function

— DuttaA

3

I would argue that ReLU is actually more common in NNs today than sigmoid :)

— Andreas Storvik Strauman

@AndreasStorvikStrauman and you are quite right... But sigmoid has a child called softmax :)

— DuttaA

7

If you only had linear layers in a neural network, all the layers would essentially collapse into one linear layer, and therefore a "deep" neural network architecture would effectively not be deep anymore but just a linear classifier.

$y = f(W_1 W_2 W_3x) = f(Wx)$

where $W$ corresponds to the matrix that represents the network weights and biases for one layer, and $f()$ to the activation function.
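The collapse can be seen numerically. This is a minimal sketch with three hand-picked 2x2 weight matrices (assumed values, no biases, no activation): applying the layers one after another gives the same result as applying the single pre-multiplied matrix $W = W_1 W_2 W_3$.

```python
def matmul(A, B):
    """Plain-Python matrix product of two nested-list matrices."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]

def matvec(A, x):
    """Matrix-vector product."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

# Hypothetical weights of three purely linear layers.
W1 = [[1.0, 2.0], [0.0, -1.0]]
W2 = [[0.5, 1.0], [2.0, 0.0]]
W3 = [[1.0, -1.0], [3.0, 2.0]]
x = [1.0, -2.0]

# Layer by layer: W3 is applied first, then W2, then W1.
layer_by_layer = matvec(W1, matvec(W2, matvec(W3, x)))

# One collapsed layer.
W = matmul(matmul(W1, W2), W3)
collapsed = matvec(W, x)

print(layer_by_layer, collapsed)
```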

With the introduction of a non-linear activation unit after every linear transformation, this no longer happens.

$y = f_1( W_1 f_2( W_2f_3( W_3x)))$

Each layer can now build on the results of the preceding non-linear layer, which essentially leads to a complex non-linear function that is able to approximate every possible function, given the right weights and enough depth/width.
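As a rough illustration, assuming ReLU for every $f_i$ and hand-picked weights, the stacked network no longer satisfies additivity, so it cannot be rewritten as a single matrix.

```python
def matvec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def relu_vec(v):
    return [max(0.0, t) for t in v]

# Hypothetical weights; ReLU plays the role of f_1, f_2 and f_3.
W1 = [[1.0, 2.0], [0.0, -1.0]]
W2 = [[0.5, 1.0], [2.0, 0.0]]
W3 = [[1.0, -1.0], [3.0, 2.0]]

def net(x):
    """y = f1(W1 f2(W2 f3(W3 x))) with ReLU activations."""
    h3 = relu_vec(matvec(W3, x))
    h2 = relu_vec(matvec(W2, h3))
    return relu_vec(matvec(W1, h2))

a = [1.0, 0.0]
b = [-1.0, 0.0]
a_plus_b = [ai + bi for ai, bi in zip(a, b)]

out_of_sum = net(a_plus_b)                                # net(a + b)
sum_of_outs = [p + q for p, q in zip(net(a), net(b))]     # net(a) + net(b)

print(out_of_sum, sum_of_outs)
```

A linear map would give the same vector in both cases. Here they differ, because ReLU clips the negative pre-activations produced by `b`.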

— Marcel_marcel1991

5

Let's first talk about linearity. Linearity means the map (a function), $f: V \rightarrow W$ , used is a linear map, that is, it satisfies the following two conditions

$f(x + y) = f(x) + f(y), \; x, y \in V$
$f(c x) = cf(x), \; c \in \mathbb{R}$

You should be familiar with this definition if you have studied linear algebra in the past.
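The two conditions can be spot-checked numerically for scalar functions. The helper names and the sample points below are assumptions for illustration, and a passing check at a few points is evidence of linearity, not a proof.

```python
import math

def is_additive(f, x, y, tol=1e-9):
    """Check f(x + y) == f(x) + f(y) at one pair of points."""
    return abs(f(x + y) - (f(x) + f(y))) <= tol

def is_homogeneous(f, c, x, tol=1e-9):
    """Check f(c * x) == c * f(x) at one point."""
    return abs(f(c * x) - c * f(x)) <= tol

def linear(x):
    return 3.0 * x

def affine(x):
    return 3.0 * x + 1.0      # a bias term already breaks both conditions

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

for name, f in (("linear", linear), ("affine", affine), ("sigmoid", sigmoid)):
    print(name, is_additive(f, 1.5, 2.0), is_homogeneous(f, 4.0, 0.5))
```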

However, it's more important to think of linearity in terms of linear separability of data, which means the data can be separated into different classes by drawing a line (or hyperplane, if more than two dimensions), which represents a linear decision boundary, through the data. If we cannot do that, then the data is not linearly separable. Often times, data from a more complex (and thus more relevant) problem setting is not linearly separable, so it is in our interest to model these.

To model nonlinear decision boundaries of data, we can utilize a neural network that introduces non-linearity. Neural networks classify data that is not linearly separable by transforming data using some nonlinear function (or our activation function), so the resulting transformed points become linearly separable.

Different activation functions are used for different problem setting contexts. You can read more about that in the book Deep Learning (Adaptive Computation and Machine Learning series).

For an example of non linearly separable data, see the XOR data set.

Can you draw a single line to separate the two classes?
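No single line works for XOR, but a tiny network with one non-linear hidden layer does. The sketch below uses hand-set weights and ReLU, both assumptions chosen for illustration rather than trained values. The hidden layer maps the four points to a representation in which one linear readout separates the classes.

```python
def relu(z: float) -> float:
    return max(0.0, z)

def hidden(x1: float, x2: float):
    """Non-linear transformation of the input (hand-set weights)."""
    h1 = relu(x1 + x2)
    h2 = relu(x1 + x2 - 1.0)
    return h1, h2

def xor_net(x1: float, x2: float) -> float:
    """Linear readout on top of the transformed points."""
    h1, h2 = hidden(x1, x2)
    return h1 - 2.0 * h2

for x1, x2 in ((0, 0), (0, 1), (1, 0), (1, 1)):
    print((x1, x2), hidden(x1, x2), xor_net(x1, x2))
```

In the hidden space the inputs (0, 1) and (1, 0) land on the same point, and the readout `h1 - 2 * h2` is a single linear boundary there.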

— skim

4

Consider a very simple neural network with just 2 layers, where the first has 2 neurons and the last has 1 neuron, and the input size is 2. The inputs are $x_1$ and $x_2$.

The weights of the first layer are $w_{11}, w_{12}, w_{21}$ and $w_{22}$ . We do not have activations, so the outputs of the neurons in the first layer are

$\begin{aligned} o_1 &= w_{11}x_1 + w_{12}x_2 \\ o_2 &= w_{21}x_1 + w_{22}x_2 \end{aligned}$

Let's calculate the output of the last layer with weights $z_1$ and $z_2$

$out = z_1o_1 + z_2o_2$

Just substitute $o_1$ and $o_2$ and you will get:

$out = z_1(w_{11}x_1 + w_{12}x_2) + z_2(w_{21}x_1 + w_{22}x_2)$

or

$out = (z_1w_{11} + z_2 w_{21})x_1 + (z_2w_{22} + z_1w_{12})x_2$

And look at this! If we create a NN with just one layer and weights $z_1w_{11} + z_2 w_{21}$ and $z_2w_{22} + z_1w_{12}$, it will be equivalent to our 2-layer NN.

The conclusion: without nonlinearity, the computational power of a multilayer NN is equal to that of a 1-layer NN.
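The algebra can be confirmed with numbers. The values below are arbitrary assumptions; any choice of weights and inputs gives the same agreement.

```python
# Hypothetical weights and inputs.
w11, w12, w21, w22 = 0.5, -1.0, 2.0, 0.25
z1, z2 = 3.0, -2.0
x1, x2 = 1.0, 4.0

# Two layers, no activation.
o1 = w11 * x1 + w12 * x2
o2 = w21 * x1 + w22 * x2
out_two_layers = z1 * o1 + z2 * o2

# One layer with the merged weights from the derivation.
v1 = z1 * w11 + z2 * w21
v2 = z2 * w22 + z1 * w12
out_one_layer = v1 * x1 + v2 * x2

print(out_two_layers, out_one_layer)
```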

Also, you can think of the sigmoid function as a differentiable IF statement that gives a probability. Adding new layers can create new, more complex combinations of IF statements. For example, the first layer combines features and gives probabilities that there are eyes, a tail, and ears in the picture; the second combines these into new, more complex features and gives the probability that there is a cat.
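A loose sketch of the "differentiable IF" reading: a hard threshold next to its sigmoid counterpart, and a toy second-layer unit that fires only when all three part detectors are confident. The names `soft_if` and `cat_probability`, the sharpness of 10, and the threshold of 2.5 are all hypothetical choices, not values from the answer.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def hard_if(x: float, threshold: float) -> float:
    """Ordinary IF: jumps from 0 to 1, gradient is 0 almost everywhere."""
    return 1.0 if x > threshold else 0.0

def soft_if(x: float, threshold: float, sharpness: float = 1.0) -> float:
    """Smooth version: a value in (0, 1) that can be read as a probability."""
    return sigmoid(sharpness * (x - threshold))

def cat_probability(p_eyes: float, p_tail: float, p_ears: float) -> float:
    """Toy second-layer unit: roughly 'IF eyes AND tail AND ears THEN cat'."""
    return soft_if(p_eyes + p_tail + p_ears, threshold=2.5, sharpness=10.0)

print(hard_if(0.7, 0.5), soft_if(0.7, 0.5))
print(cat_probability(0.9, 0.9, 0.9), cat_probability(0.9, 0.9, 0.1))
```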

For more information: Hacker's guide to Neural Networks.

— user2674414

3

First Degree Linear Polynomials

Non-linearity is not the correct mathematical term. Those who use it probably mean the absence of a first degree polynomial relationship between input and output, the kind of relationship that would be graphed as a straight line, a flat plane, or a higher-dimensional surface with no curvature.

To model relations more complex than y = a₁x₁ + a₂x₂ + ... + b, more than just those two terms of a Taylor series approximation is needed.

Tune-able Functions with Non-zero Curvature

Artificial networks such as the multi-layer perceptron and its variants are matrices of functions with non-zero curvature that, when taken collectively as a circuit, can be tuned with attenuation grids to approximate more complex functions of non-zero curvature. These more complex functions generally have multiple inputs (independent variables).

The attenuation grids are simply matrix-vector products, the matrix being the parameters that are tuned to create a circuit that approximates the more complex curved, multivariate function with simpler curved functions.

Oriented with the multi-dimensional signal entering at the left and the result appearing on the right (left-to-right causality), as in the electrical engineering convention, the vertical columns are called layers of activations, mostly for historical reasons. They are actually arrays of simple curved functions. The most commonly used activations today are these.

ReLU
Leaky ReLU
ELU
Threshold (binary step)
Logistic
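For reference, the five functions in this list can be sketched with their usual textbook definitions. The default slopes (`alpha`) and the convention that the threshold outputs 1 at exactly 0 are assumptions, since the answer does not specify them.

```python
import math

def relu(x: float) -> float:
    return max(0.0, x)

def leaky_relu(x: float, alpha: float = 0.01) -> float:
    """Like ReLU, but keeps a small slope alpha for negative inputs."""
    return x if x > 0 else alpha * x

def elu(x: float, alpha: float = 1.0) -> float:
    """Exponential linear unit: smooth saturation towards -alpha."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

def threshold(x: float) -> float:
    """Binary step; the value at exactly 0 is a convention."""
    return 1.0 if x >= 0 else 0.0

def logistic(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

for x in (-2.0, 0.0, 2.0):
    print(x, relu(x), leaky_relu(x), elu(x), threshold(x), logistic(x))
```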

The identity function is sometimes used to pass through signals untouched for various structural convenience reasons.

These are less used but were in vogue at one point or another. They are still used but have lost popularity because they place additional overhead on back propagation computations and tend to lose in contests for speed and accuracy.

Softmax
Sigmoid
TanH
ArcTan
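A short sketch of the remaining functions, using standard definitions. Softmax is written with the common max-subtraction trick for numerical stability, which is an implementation choice made here; tanh and arctan come straight from Python's `math` module.

```python
import math

def softmax(values):
    """Map a list of scores to positive numbers that sum to 1."""
    m = max(values)                          # subtract the max to avoid overflow
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 1.0, 0.1]                     # hypothetical output scores
probs = softmax(scores)

print(probs, sum(probs))
print(math.tanh(1.0))                        # output lies in (-1, 1)
print(math.atan(1.0))                        # output lies in (-pi/2, pi/2)
```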

The more complex of these can be parametrized and all of them can be perturbed with pseudo-random noise to improve reliability.

Why Bother With All of That?

Artificial networks are not necessary for tuning well developed classes of relationships between input and desired output. For instance, these are easily optimized using well developed optimization techniques.

Higher degree polynomials — Often directly solvable using techniques derived directly from linear algebra
Periodic functions — Can be treated with Fourier methods
Curve fitting — converges well using the Levenberg–Marquardt algorithm, a damped least-squares approach

For these, approaches developed long before the advent of artificial networks can often arrive at an optimal solution with less computational overhead and more precision and reliability.

Where artificial networks excel is in the acquisition of functions about which the practitioner is largely ignorant or the tuning of the parameters of known functions for which specific convergence methods have not yet been devised.

Multi-layer perceptrons (ANNs) tune the parameters (attenuation matrix) during training. Tuning is directed by gradient descent or one of its variants to produce a digital approximation of an analog circuit that models the unknown functions. The gradient descent is driven by some criterion, toward which circuit behavior is steered by comparing outputs with that criterion. The criterion can be any of these.

Matching labels (the desired output values corresponding to the training example inputs)
The need to pass information through narrow signal paths and reconstruct from that limited information
Another criterion inherent in the network
Another criterion arising from a signal source outside the network

In Summary

In summary, activation functions provide the building blocks that can be used repeatedly in two dimensions of the network structure so that the network, combined with an attenuation matrix to vary the weight of signaling from layer to layer, is known to be able to approximate an arbitrary and complex function.

Deeper Network Excitement

The post-millennial excitement about deeper networks is because the patterns in two distinct classes of complex inputs have been successfully identified and put into use within larger business, consumer, and scientific markets.

Heterogeneous and semantically complex structures
Media files and streams (images, video, audio)

— FauChristian

But the question was about the purpose of activation and not the use of ANN

— DuttaA

@DuttaA, your comment was accurate. Thank you. The answer had only one sentence that directly answered the question, and the tie-in to the rest of that answer wasn't very well communicated. I substantially edited it.

— FauChristian

This actually is the best answer, should have more upvotes and should be the accepted answer.

— DuttaA

2

There is no purpose to an activation function in an artificial network, just like there is no purpose to 3 in the factors of the number 21. Multi-layer perceptrons and recurrent neural networks were defined as a matrix of cells, each of which contains an activation function. Remove the activation functions and all that is left is a series of useless matrix multiplications. Remove the 3 from 21 and the result is not a less effective 21 but a completely different number, 7.

Activation functions do not help introduce non-linearity, they are the sole components in network forward propagation that do not fit a first degree polynomial form. If a thousand layers had an activation function $ax$ , where $a$ is a constant, the parameters and activations of the thousand layers could be reduced to a single dot product and no function could be simulated by the deep network other than those that reduce to $ax$ .

— han_nah_han_