Which loss function is correct for logistic regression?

31

I have read two versions of the loss function for logistic regression. Which of them is correct, and why?

From Machine Learning, Zhou Z.H. (in Chinese), with $\beta = (w, b)\text{ and }\beta^Tx=w^Tx +b$:

$l(\beta) = \sum\limits_{i=1}^{m}\Big(-y_i\beta^Tx_i+\ln(1+e^{\beta^Tx_i})\Big) \tag 1$

From my college course, with $z_i = y_if(x_i)=y_i(w^Tx_i + b)$:

$L(z_i)=\log(1+e^{-z_i}) \tag 2$

I know that the first one is a sum over all samples and the second one is for a single sample, but I am more curious about the difference in the form of the two loss functions. Somehow I have a feeling that they are equivalent.

logistic loss-functions

— xtt

31

The relationship is as follows: $l(\beta) = \sum_i L(z_i)$.

Define a logistic function as $f(z) = \frac{e^{z}}{1 + e^{z}} = \frac{1}{1+e^{-z}}$. It has the property that $f(-z) = 1-f(z)$. Or in other words:

$\frac{1}{1+e^{z}} = \frac{e^{-z}}{1+e^{-z}}.$

If you take the reciprocal of both sides and then take the log, you get:

$\ln(1+e^{z}) = \ln(1+e^{-z}) + z.$

Subtract $z$ from both sides and you should see this:

$-y_i\beta^Tx_i+\ln(1+e^{y_i\beta^Tx_i}) = L(z_i).$
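
The identity $\ln(1+e^{z}) = \ln(1+e^{-z}) + z$ can be sanity-checked numerically. The snippet below is only a sketch: the helper name `softplus` and the grid of test values are illustrative assumptions, not part of the answer, and `np.logaddexp(0, z)` is used as a numerically stable way to evaluate $\ln(1+e^{z})$.

```python
import numpy as np

def softplus(z):
    # ln(1 + e^z), evaluated stably as log(e^0 + e^z)
    return np.logaddexp(0.0, z)

# Arbitrary test points, chosen only for illustration
z = np.linspace(-30.0, 30.0, 13)

lhs = softplus(z)        # ln(1 + e^z)
rhs = softplus(-z) + z   # ln(1 + e^{-z}) + z

# Largest absolute difference between the two sides
max_gap = np.max(np.abs(lhs - rhs))
print(max_gap)
```

The two sides should agree up to floating-point rounding.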

Edit:

Right now I am re-reading this answer and am confused about how I got $-y_i\beta^Tx_i+\ln(1+e^{\beta^Tx_i})$ to be equal to $-y_i\beta^Tx_i+\ln(1+e^{y_i\beta^Tx_i})$. Maybe there is a typo in the original question.

Edit 2:

In case there was no typo in the original question, @ManelMorales appears to be correct to draw attention to the fact that, when $y \in \{-1,1\}$, the probability mass function can be written as $P(Y_i=y_i) = f(y_i\beta^Tx_i)$, due to the property that $f(-z) = 1 - f(z)$. I am writing it differently here because he introduces a new ambiguity in the notation $z_i$. The rest follows by taking the negative log-likelihood for each coding of $y$. See his answer below for more details.

— Taylor

42

OP mistakenly believes that the relationship between these two functions is due to the number of samples (i.e. single vs. all). However, the actual difference is simply how we select our training labels.

In the case of binary classification we may assign the labels $y=\pm1$ or $y=0,1$.

As has already been stated, the logistic function $\sigma(z)$ is a good choice since it has the form of a probability, i.e. $\sigma(-z)=1-\sigma(z)$ and $\sigma(z)\in (0,1)$, tending to $1$ and $0$ as $z\rightarrow \pm \infty$. If we pick the labels $y=0,1$ we may assign

$\begin{equation} \begin{aligned} \mathbb{P}(y=1|z) & =\sigma(z)=\frac{1}{1+e^{-z}}\\ \mathbb{P}(y=0|z) & =1-\sigma(z)=\frac{1}{1+e^{z}}\\ \end{aligned} \end{equation}$

which can be written more compactly as $\mathbb{P}(y|z) =\sigma(z)^y(1-\sigma(z))^{1-y}$ .

It is easier to maximize the log-likelihood. Maximizing the log-likelihood is the same as minimizing the negative log-likelihood. For $m$ samples $\{x_i,y_i\}$ , after taking the natural logarithm and some simplification, we will find out:

$\begin{equation} \begin{aligned} l(z)=-\log\big(\prod_i^m\mathbb{P}(y_i|z_i)\big)=-\sum_i^m\log\big(\mathbb{P}(y_i|z_i)\big)=\sum_i^m\big(-y_iz_i+\log(1+e^{z_i})\big) \end{aligned} \end{equation}$
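
The simplification in the last step can be spelled out for a single sample. This is a sketch that assumes the $y\in\{0,1\}$ coding and uses $\log\sigma(z) = z - \log(1+e^{z})$ and $\log(1-\sigma(z)) = -\log(1+e^{z})$, both of which follow from the definition of $\sigma$:

$\begin{equation} \begin{aligned} -\log \mathbb{P}(y|z) &= -y\log\sigma(z) - (1-y)\log(1-\sigma(z)) \\ &= -y\big(z-\log(1+e^{z})\big) + (1-y)\log(1+e^{z}) \\ &= -yz + \log(1+e^{z}). \end{aligned} \end{equation}$

Summing this term over the $m$ samples gives the expression for $l(z)$.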

Full derivation and additional information can be found on this jupyter notebook. On the other hand, we may have instead used the labels $y=\pm 1$ . It is pretty obvious then that we can assign

$\begin{equation} \mathbb{P}(y|z)=\sigma(yz). \end{equation}$

It is also obvious that $\mathbb{P}(y=0|z)=\mathbb{P}(y=-1|z)=\sigma(-z)$ . Following the same steps as before we minimize in this case the loss function

$\begin{equation} \begin{aligned} L(z)=-\log\big(\prod_j^m\mathbb{P}(y_j|z_j)\big)=-\sum_j^m\log\big(\mathbb{P}(y_j|z_j)\big)=\sum_j^m\log(1+e^{-y_jz_j}) \end{aligned} \end{equation}$

Where the last step follows after we take the reciprocal which is induced by the negative sign. While we should not equate these two forms, given that in each form $y$ takes different values, nevertheless these two are equivalent:

$\begin{equation} \begin{aligned} -y_iz_i+\log(1+e^{z_i})\equiv \log(1+e^{-y_iz_i}) \end{aligned} \end{equation}$

The case $y_i=1$ is trivial to show. If $y_i \neq 1$ , then $y_i=0$ on the left hand side and $y_i=-1$ on the right hand side.
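
The equivalence of the two per-sample forms can also be checked numerically. The following is a minimal sketch, not taken from the answer: the function names `loss_01` and `loss_pm1`, the random scores, and the mapping $y_{\pm} = 2y_{01} - 1$ between the two label codings are assumptions made for illustration.

```python
import numpy as np

def loss_01(y01, z):
    # -y z + log(1 + e^z), with labels y in {0, 1}
    return -y01 * z + np.logaddexp(0.0, z)

def loss_pm1(ypm, z):
    # log(1 + e^{-y z}), with labels y in {-1, +1}
    return np.logaddexp(0.0, -ypm * z)

rng = np.random.default_rng(0)
z = rng.normal(scale=5.0, size=1000)   # arbitrary scores z_i
y01 = rng.integers(0, 2, size=1000)    # labels coded as 0 / 1
ypm = 2 * y01 - 1                      # same labels coded as -1 / +1

# Largest per-sample difference between the two forms
print(np.max(np.abs(loss_01(y01, z) - loss_pm1(ypm, z))))
```

The printed difference should be at the level of floating-point rounding, for both the positive and the negative class.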

Finally, one can use the identity $\partial \sigma(z) / \partial z=\sigma(z)(1-\sigma(z))$ to trivially calculate $\nabla l(z)$ and $\nabla^2l(z)$, both of which are needed for convergence analysis (i.e. to determine the convexity of the loss function by calculating the Hessian).
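
As a sketch of that calculation, assume the $y\in\{0,1\}$ form of $l$ and the linear model $z_i=\beta^Tx_i$ from the question (the answer itself writes the loss in terms of $z$ only). The chain rule then gives

$\begin{equation} \begin{aligned} \nabla_\beta l &= \sum_i^m \big(\sigma(z_i) - y_i\big)\, x_i \\ \nabla^2_\beta l &= \sum_i^m \sigma(z_i)\big(1-\sigma(z_i)\big)\, x_i x_i^T \end{aligned} \end{equation}$

Each summand of the Hessian is a nonnegative multiple of $x_ix_i^T$, so the Hessian is positive semidefinite and $l$ is convex in $\beta$.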

— Manuel Morales

Is logistic loss function convex?

— user85361

2

Log reg $l(z)$ IS convex, but not $\alpha$-convex. Thus we can't place a bound on how long gradient descent takes to converge. We can adjust the form of $l$ to make it strongly convex by adding a regularization term: with positive constant $\lambda$ define our new function to be $l'(z)=l(z)+\lambda\|z\|^2$ s.t. $l'(z)$ is $\lambda$-strongly convex and we can now prove the convergence bound of $l'$. Unfortunately, we are now minimizing a different function! Luckily, we can show that the value of the optimum of the regularized function is close to the value of the optimum of the original.

— Manuel Morales

The notebook you referred to is gone. I found another proof: statlect.com/fundamentals-of-statistics/…

— Domi.Zhang

2

I found this to be the most helpful answer.

— mohit6up

@ManuelMorales Do you have a link to the regularized function's optimum value being close to the original?

— Mark

19

I learned the loss function for logistic regression as follows.

Logistic regression performs binary classification, and so the label outputs are binary, 0 or 1. Let $P(y=1|x)$ be the probability that the binary output $y$ is 1 given the input feature vector $x$ . The coefficients $w$ are the weights that the algorithm is trying to learn.

$P(y=1|x) = \frac{1}{1 + e^{-w^{T}x}}$

Because logistic regression is binary, the probability $P(y=0|x)$ is simply 1 minus the term above.

$P(y=0|x) = 1- \frac{1}{1 + e^{-w^{T}x}}$

For one training example, the loss function $J(w)$ adds (A) the label $y$ multiplied by $\log P(y=1)$ and (B) $1-y$ multiplied by $\log P(y=0)$; these terms are then summed over the $m$ training examples.

$J(w) = \sum_{i=1}^{m} y^{(i)} \log P(y=1) + (1 - y^{(i)}) \log P(y=0)$

where $y^{(i)}$ indicates the $i^{th}$ label in your training data. If a training instance has a label of $1$ , then $y^{(i)}=1$ , leaving the left summand in place but making the right summand with $1-y^{(i)}$ become $0$ . On the other hand, if a training instance has $y=0$ , then the right summand with the term $1-y^{(i)}$ remains in place, but the left summand becomes $0$ . Log probability is used for ease of calculation.

If we then replace $P(y=1)$ and $P(y=0)$ with the earlier expressions, then we get:

$J(w) = \sum_{i=1}^{m} y^{(i)} \log \left(\frac{1}{1 + e^{-w^{T}x}}\right) + (1 - y^{(i)}) \log \left(1- \frac{1}{1 + e^{-w^{T}x}}\right)$

You can read more about this form in these Stanford lecture notes.
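
The expression for $J(w)$ can be evaluated directly in NumPy. This is a minimal sketch under assumptions: the function name `log_likelihood`, the toy data, and the use of `np.logaddexp` for the two log-probabilities are illustrative choices, and $x$ in the formula is read as the $i$-th feature vector $x^{(i)}$ (row $i$ of `X`). Written this way the value is a log-likelihood, so its negative is what would be minimized.

```python
import numpy as np

def log_likelihood(w, X, y):
    # Linear score w^T x for every row of X
    z = X @ w
    # log P(y=1|x) = -log(1 + e^{-z}) and log P(y=0|x) = -log(1 + e^{z})
    log_p1 = -np.logaddexp(0.0, -z)
    log_p0 = -np.logaddexp(0.0, z)
    # Sum of y * log P(y=1) + (1 - y) * log P(y=0) over all examples
    return np.sum(y * log_p1 + (1 - y) * log_p0)

# Toy data (made up): three examples, two features
X = np.array([[1.0, 2.0], [1.0, -1.0], [1.0, 0.5]])
y = np.array([1, 0, 1])
w = np.zeros(2)

# With w = 0 every predicted probability is 0.5, so J(w) = 3 * log(0.5)
print(log_likelihood(w, X, y))
```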

— stackoverflowuser2010

This answer also provides some relevant perspective here.

— GeoMatt22

6

The expression you have is not a loss (to be minimized), but rather a log-likelihood (to be maximized).

— xenocyon

2

@xenocyon true - this same formulation is typically written with a negative sign applied to the full summation.

— Alex Klibisz

1

Instead of Mean Squared Error, we use a cost function called Cross-Entropy, also known as Log Loss. The cross-entropy loss can be divided into two separate cost functions: one for $y=1$ and one for $y=0$.

$\begin{align} j(\theta) &= \frac 1 m \sum_{i=1}^m \mathrm{Cost}(h_\theta(x^{(i)}), y^{(i)}) & & \\ \mathrm{Cost}(h_\theta(x), y) &= -\log(h_\theta(x)) & \text{if } y &= 1 \\ \mathrm{Cost}(h_\theta(x), y) &= -\log(1-h_\theta(x)) & \text{if } y &= 0 \end{align}$

When we put them together we have:

$j(\theta) = -\frac 1 m \sum_{i=1}^m \big[y^{(i)}\log(h_\theta(x^{(i)})) + (1-y^{(i)})\log(1-h_\theta(x^{(i)})) \big]$

Multiplying by $y$ and $(1-y)$ in the above equation is a sneaky trick that lets us use the same equation to solve for both the $y=1$ and $y=0$ cases. If $y=0$, the first term cancels out. If $y=1$, the second term cancels out. In both cases we only perform the operation we need to perform.

If you don't want to use a for loop, you can try a vectorized form of the equation above

$\begin{align} h &= g(X\theta) \\ J(\theta) &= \frac 1 m \cdot \big(-y^T\log(h)-(1-y)^T\log(1-h)\big) \end{align}$
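
A minimal NumPy sketch of the vectorized form might look as follows. The function names `sigmoid` and `cost` and the toy data are assumptions for illustration, with $g$ taken to be the logistic function. Note that this direct version can hit $\log(0)$ when $X\theta$ is very large in magnitude, so it is meant to mirror the formula rather than to be numerically robust.

```python
import numpy as np

def sigmoid(z):
    # Logistic function g(z) = 1 / (1 + e^{-z})
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    m = X.shape[0]
    # h = g(X theta): predicted probabilities for all m examples at once
    h = sigmoid(X @ theta)
    # J(theta) = (1/m) * (-y^T log(h) - (1 - y)^T log(1 - h))
    return (-(y @ np.log(h)) - ((1 - y) @ np.log(1 - h))) / m

# Toy data (made up): four examples, intercept column plus one feature
X = np.array([[1.0, 0.5], [1.0, -1.5], [1.0, 2.0], [1.0, -0.3]])
y = np.array([1.0, 0.0, 1.0, 0.0])
theta = np.zeros(2)

# With theta = 0 every prediction is 0.5, so the cost equals log(2)
print(cost(theta, X, y))
```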

The entire explanation can be viewed in the Machine Learning Cheatsheet.

— Emanuel Fontelles