Proof of the equivalent formulations of ridge regression

15

I have read the most popular books on statistical learning:

1- The Elements of Statistical Learning.

2- An Introduction to Statistical Learning.

Both mention that ridge regression has two equivalent formulations. Is there an accessible mathematical proof of this result?

I also went through Cross Validated, but I could not find a definitive proof there.

Furthermore, does LASSO admit the same kind of proof?

— jeza

2

en.wikipedia.org/wiki/…

— Taylor

1

Lasso is not a form of ridge regression.

— Xi'an

@jeza, could you explain what is missing in my answer? It really derives everything about the connection.

— Royi

@jeza, could you be specific? If you are not familiar with the Lagrangian concept for a constrained problem, it is hard to give a concise answer.

— Royi

1

@jeza, a constrained optimization problem can be converted into an optimization of the Lagrangian function / KKT conditions (as explained in the current answers). This principle already has many different simple explanations all over the internet. In which direction is more explanation of the proof needed? An explanation/proof of the Lagrange multiplier/function, an explanation/proof of how this problem is a case of optimization that relates to the method of Lagrange, the difference between KKT and Lagrange, an explanation of the principle of regularization, etc.?

— Sextus Empiricus

19

The classic Ridge Regression (Tikhonov Regularization) is given by:

$\arg \min_{x} \frac{1}{2} {\left\| x - y \right\|}_{2}^{2} + \lambda {\left\| x \right\|}_{2}^{2}$

The claim above is that the following problem is equivalent:

$\begin{align*} \arg \min_{x} \quad & \frac{1}{2} {\left\| x - y \right\|}_{2}^{2} \\ \text{subject to} \quad & {\left\| x \right\|}_{2}^{2} \leq t \end{align*}$

Let us define $\hat{x}$ as the optimal solution of the first problem and $\tilde{x}$ as the optimal solution of the second problem.

The claim of equivalence means that $\forall t > 0, \: \exists \lambda \geq 0 : \hat{x} = \tilde{x}$.
That is, for every $t$ you can always find a $\lambda \geq 0$ such that both problems have the same solution.

How could we find such a pair?
Well, by solving the problems and looking at the properties of the solutions.
Both problems are convex and smooth, which should make things simpler.

The solution of the first problem is given at the point where the gradient vanishes, which means:

$\hat{x} - y + 2 \lambda \hat{x} = 0$

The KKT conditions of the second problem state:

$\tilde{x} - y + 2 \mu \tilde{x} = 0$

and

$\mu \left( {\left\| \tilde{x} \right\|}_{2}^{2} - t \right) = 0$

The last equation implies that either $\mu = 0$ or ${\left\| \tilde{x} \right\|}_{2}^{2} = t$.

Notice that the two stationarity equations have the same form.
That is, if $\hat{x} = \tilde{x}$ and $\mu = \lambda$, both equations hold.

So in the case ${\left\| y \right\|}_{2}^{2} \leq t$ the point $\tilde{x} = y$ is feasible and one must set $\mu = 0$, which means that for $t$ large enough one must set $\lambda = 0$ for both problems to be equivalent.

In the other case one should find the $\mu$ for which:

${y}^{T} \left( I + 2 \mu I \right)^{-1} \left( I + 2 \mu I \right)^{-1} y = t$

This is just the condition ${\left\| \tilde{x} \right\|}_{2}^{2} = t$, written out with $\tilde{x} = \left( I + 2 \mu I \right)^{-1} y$.

Once you find that $\mu$, the solutions coincide.
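For this simple problem the matching multiplier can be written down, since the last condition reads ${\left\| y \right\|}_{2}^{2} / (1 + 2 \mu)^{2} = t$. Below is a minimal numerical sketch of that pairing, assuming NumPy; the function names (`penalized_solution`, `constrained_solution`, `matching_multiplier`) and the test data are made up for illustration and are not part of the answer above.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(size=5)
t = 0.25 * np.dot(y, y)  # bound tighter than ||y||^2, so the constraint binds


def penalized_solution(y, lam):
    # Gradient condition x - y + 2*lam*x = 0  =>  x = y / (1 + 2*lam)
    return y / (1.0 + 2.0 * lam)


def constrained_solution(y, t):
    # Closest point to y inside the ball ||x||^2 <= t
    norm_y = np.linalg.norm(y)
    if norm_y ** 2 <= t:
        return y.copy()  # constraint is slack, x = y is feasible
    return y * np.sqrt(t) / norm_y  # rescale y onto the boundary


def matching_multiplier(y, t):
    # Solve ||y||^2 / (1 + 2*mu)^2 = t for mu, clipped at 0 for the slack case
    return max(0.0, (np.linalg.norm(y) / np.sqrt(t) - 1.0) / 2.0)


lam = matching_multiplier(y, t)
x_hat = penalized_solution(y, lam)
x_tilde = constrained_solution(y, t)
print(lam, np.allclose(x_hat, x_tilde))
```

The multiplier depends on both $t$ and the norm of $y$, which is why there is no data-independent formula linking $\lambda$ and $t$.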

Regarding the ${L}_{1}$ (LASSO) case, it works with the same idea.
The only difference is that we have no closed-form solution, so deriving the connection is trickier.

Have a look at my answers at StackExchange Cross Validated Q291962 and StackExchange Signal Processing Q21730 - Significance of $\lambda$ in Basis Pursuit.

Remark
What is actually happening?
In both problems, $x$ tries to be as close as possible to $y$.
In the first case, $x = y$ makes the first term (the ${L}_{2}$ distance) vanish, and in the second case it makes the objective function vanish.
The difference is that in the first case one must also balance the ${L}_{2}$ norm of $x$. As $\lambda$ gets higher, the balance means you should make $x$ smaller.
In the second case there is a wall: you bring $x$ closer and closer to $y$ until you hit the wall, which is the constraint on the norm (set by $t$).
If the wall is far enough away (a high value of $t$), where "enough" depends on the norm of $y$, then it has no effect, just as $\lambda$ only becomes relevant once its value multiplied by the norm of $y$ starts to be meaningful.
The exact connection is given by the Lagrangian stated above.

Resources

I found this paper today (03/04/2019):

Approximation Hardness for a Class of Sparse Optimization Problems.

— Royi

Does equivalence mean that $\lambda$ and $t$ should be equal? Because I cannot see that in the proof. Thanks.

— jeza

@jeza, as I wrote above, for any $t$ there is a $\lambda \geq 0$ (not necessarily equal to $t$, but a function of $t$ and the data $y$) such that the solutions of the two forms are the same.

— Royi

3

@jeza, both $\lambda$ & $t$ are essentially free parameters here. Once you specify, say, $\lambda$, that yields a specific optimal solution. But $t$ remains a free parameter. So at this point the claim is that there is some value of $t$ that yields the same optimal solution. There are essentially no constraints on what that $t$ must be; it's not like it has to be some fixed function of $\lambda$, like $t=\lambda/2$ or something.

— gung - Reinstate Monica

@Royi, I would like to know: 1- Why does your formula have (1/2), while the formulas in the question do not? 2- Are you using KKT to show the equivalence of the two formulas? 3- If yes, I still cannot see this equivalence. I am not sure, but I expect the proof to show that formula one = formula two.

— Jeza

1. It is just easier when you differentiate the LS term. You can move from my $\lambda$ to the OP's $\lambda$ by a factor of two. 2. I used KKT for the 2nd case. The first case has no constraints, hence you can just solve it. 3. There is no closed-form equation between them. I showed the logic and how you can create a graph connecting them. But as I wrote, it will change for each $y$ (it is data dependent).

— Royi

9

A less mathematically rigorous, but possibly more intuitive, approach to understanding what is going on is to start with the constraint version (equation 3.42 in the question) and solve it using the "Lagrange multiplier" method (https://en.wikipedia.org/wiki/Lagrange_multiplier or your favorite multivariable calculus text). Just remember that in calculus $x$ is the vector of variables, but in our case $x$ is constant and $\beta$ is the variable vector. Once you apply the Lagrange multiplier technique you end up with the first equation (3.41), after throwing away the extra $-\lambda t$, which is constant relative to the minimization and can be ignored.

This also shows that this works for lasso and other constraints.

— Greg Snow

8

It's perhaps worth reading about Lagrangian duality and a broader relation (at times equivalence) between:

optimization subject to hard (i.e. inviolable) constraints
optimization with penalties for violating constraints.

Quick intro to weak duality and strong duality

Assume we have some function $f(x,y)$ of two variables. For any $\hat{x}$ and $\hat{y}$ , we have:

$\min_x f(x, \hat{y}) \leq f(\hat{x}, \hat{y}) \leq \max_y f(\hat{x}, y)$

Since that holds for any $\hat{x}$ and $\hat{y}$ it also holds that:

$\max_y \min_x f(x, y) \leq \min_x \max_y f(x, y)$

This is known as weak duality. In certain circumstances, you also have strong duality (also known as the saddle point property):

$\max_y \min_x f(x, y) = \min_x \max_y f(x, y)$

When strong duality holds, solving the dual problem also solves the primal problem. They're in a sense the same problem!
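A quick numerical illustration of the two inequalities, as a sketch assuming NumPy: the random table `F` and the saddle function $x^2 - y^2$ are arbitrary examples chosen here, not taken from the references.

```python
import numpy as np

rng = np.random.default_rng(1)

# F[i, j] plays the role of f(x_i, y_j) on a finite grid of x and y values
F = rng.normal(size=(6, 4))

max_min = F.min(axis=0).max()  # max over y of (min over x)
min_max = F.max(axis=1).min()  # min over x of (max over y)
print(max_min <= min_max)      # weak duality holds for any table of values

# f(x, y) = x^2 - y^2 on a grid that contains the saddle point (0, 0)
grid = np.linspace(-1.0, 1.0, 21)
S = grid[:, None] ** 2 - grid[None, :] ** 2

saddle_max_min = S.min(axis=0).max()
saddle_min_max = S.max(axis=1).min()
print(saddle_max_min, saddle_min_max)  # the two values agree at the saddle
```

For the random table the inequality is typically strict, while for the saddle-shaped function both orders give the same value.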

Lagrangian for constrained Ridge Regression

Let me define the function $\mathcal{L}$ as:

$\mathcal{L}(\mathbf{b}, \lambda) = \sum_{i=1}^n (y_i - \mathbf{x}_i \cdot \mathbf{b})^2 + \lambda \left( \sum_{j=1}^p b_j^2 - t \right)$

The min-max interpretation of the Lagrangian

The Ridge regression problem subject to hard constraints is:

$\min_\mathbf{b} \max_{\lambda \geq 0} \mathcal{L}(\mathbf{b}, \lambda)$

You pick $\mathbf{b}$ to minimize the objective, cognizant that after $\mathbf{b}$ is picked, your opponent will set $\lambda$ to infinity if you chose $\mathbf{b}$ such that $\sum_{j=1}^p b_j^2 > t$ .

If strong duality holds (which it does here because Slater's condition is satisfied for $t>0$ ), you then achieve the same result by reversing the order:

$\max_{\lambda \geq 0} \min_\mathbf{b} \mathcal{L}(\mathbf{b}, \lambda)$

Here, your opponent chooses $\lambda$ first! You then choose $\mathbf{b}$ to minimize the objective, already knowing their choice of $\lambda$ . The $\min_\mathbf{b} \mathcal{L}(\mathbf{b}, \lambda)$ part (taking $\lambda$ as given) is equivalent to the 2nd form of your Ridge Regression problem, since the $-\lambda t$ term does not depend on $\mathbf{b}$ .

As you can see, this isn't a result particular to Ridge regression. It is a broader concept.
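To see the correspondence on actual numbers, here is a rough sketch assuming NumPy and synthetic data. It solves the penalized problem in closed form for a chosen $\lambda$, sets $t$ to the squared norm of that solution, and then solves the hard-constrained problem independently with projected gradient descent (a technique swapped in here for the check, not something used in the references). The helper names `ridge_penalized` and `ridge_constrained` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 50, 3
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=n)


def ridge_penalized(X, y, lam):
    # Minimizer of ||y - X b||^2 + lam * ||b||^2
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)


def ridge_constrained(X, y, t, n_iter=5000):
    # Projected gradient descent for min ||y - X b||^2 s.t. ||b||^2 <= t
    step = 1.0 / (2.0 * np.linalg.norm(X, 2) ** 2)  # 1 / Lipschitz constant
    b = np.zeros(X.shape[1])
    for _ in range(n_iter):
        b = b - step * 2.0 * X.T @ (X @ b - y)  # gradient step
        norm_b = np.linalg.norm(b)
        if norm_b ** 2 > t:
            b = b * np.sqrt(t) / norm_b  # project back onto the ball
    return b


lam = 5.0
b_pen = ridge_penalized(X, y, lam)
t = b_pen @ b_pen              # bound matched to this lambda and this data
b_con = ridge_constrained(X, y, t)
print(np.allclose(b_pen, b_con, atol=1e-6))
```

Changing `lam` or the data changes the matched `t`, in line with the comments above that the link between the two parameters is data dependent.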

References

(I started this post following an exposition I read from Rockafellar.)

Rockafellar, R.T., Convex Analysis

You might also examine lecture 7 and lecture 8 from Prof. Stephen Boyd's course on convex optimization.

— Matthew Gunn

note that your answer can be extended to any convex function.

— 81235

6

They are not equivalent.

For a constrained minimization problem

$\min_{\mathbf b} \sum_{i=1}^n (y_i - \mathbf{x}'_i \cdot \mathbf{b})^2\\ \text{s.t.} \sum_{j=1}^p b_j^2 \leq t,\;\;\; \mathbf b = (b_1,...,b_p) \tag{1}$

we solve by minimizing over $\mathbf b$ the corresponding Lagrangean

$\Lambda = \sum_{i=1}^n (y_i - \mathbf{x}'_i \cdot \mathbf{b})^2 + \lambda \left( \sum_{j=1}^p b_j^2 - t \right) \tag{2}$

Here, $t$ is a bound given exogenously, $\lambda \geq 0$ is a Karush-Kuhn-Tucker non-negative multiplier, and both the beta vector and $\lambda$ are to be determined optimally through the minimization procedure given $t$ .

Comparing $(2)$ and eq $(3.41)$ in the OP's post, it appears that the Ridge estimator can be obtained as the solution to

$\min_{\mathbf b}\{\Lambda + \lambda t\} \tag{3}$

Since in $(3)$ the function to be minimized appears to be the Lagrangean of the constrained minimization problem plus a term that does not involve $\mathbf b$ , it would appear that indeed the two approaches are equivalent...

But this is not correct because in the Ridge regression we minimize over $\mathbf b$ given $\lambda >0$ . But, through the lens of the constrained minimization problem, assuming $\lambda >0$ imposes the condition that the constraint is binding, i.e. that

$\sum_{j=1}^p (b^*_{j,ridge})^2 = t$

The general constrained minimization problem allows for $\lambda = 0$ also, and essentially it is a formulation that includes as special cases the basic least-squares estimator ( $\lambda ^*=0$ ) and the Ridge estimator ( $\lambda^* >0$ ).
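The two regimes can be made concrete with a small sketch, assuming NumPy and synthetic data. The helper `multiplier_for_bound` is a hypothetical name; it finds the multiplier by bisection, relying on the squared norm of the ridge solution decreasing as $\lambda$ grows, and assumes the search bracket `hi` is large enough for the chosen bound.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(40, 2))
y = X @ np.array([2.0, -1.0]) + 0.1 * rng.normal(size=40)


def ridge(X, y, lam):
    # Minimizer of ||y - X b||^2 + lam * ||b||^2 (lam = 0 gives least squares)
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)


def multiplier_for_bound(X, y, t, hi=1e6, n_iter=200):
    # Multiplier lambda* >= 0 of the constrained problem for a given bound t
    b_ols = ridge(X, y, 0.0)
    if b_ols @ b_ols <= t:
        return 0.0  # constraint is slack: least squares is already feasible
    lo = 0.0
    for _ in range(n_iter):
        mid = 0.5 * (lo + hi)
        b = ridge(X, y, mid)
        if b @ b > t:
            lo = mid  # still outside the ball, penalize more
        else:
            hi = mid
    return 0.5 * (lo + hi)


b_ols = ridge(X, y, 0.0)
norm2_ols = b_ols @ b_ols
lam_slack = multiplier_for_bound(X, y, 2.0 * norm2_ols)  # loose bound
lam_bind = multiplier_for_bound(X, y, 0.5 * norm2_ols)   # tight bound
b_bind = ridge(X, y, lam_bind)
print(lam_slack, lam_bind, b_bind @ b_bind, 0.5 * norm2_ols)
```

With the loose bound the multiplier is zero and the solution is the least-squares estimator; with the tight bound the multiplier is positive and the constraint holds with equality.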

So the two formulations are not equivalent. Nevertheless, Matthew Gunn's post shows in another and very intuitive way how the two are very closely connected. But duality is not equivalence.

— Alecos Papadopoulos

@MartijnWeterings Thanks for the comment, I have reworked my answer.

— Alecos Papadopoulos

@MartijnWeterings I do not see what is confusing since the expression written in your comment is exactly the expression I wrote in my reworked post.

— Alecos Papadopoulos

1

This was the duplicate question I had in mind, where the equivalence is explained very intuitively to me: math.stackexchange.com/a/336618/466748. The argument that you give for the two not being equivalent seems only secondary to me, and a matter of definition (the OP uses $\lambda \geq 0$ instead of $\lambda > 0$ and we could just as well add the constraint $t < \Vert \beta^{OLS} \Vert^2_2$ to exclude the cases where $\lambda=0$ ).

— Sextus Empiricus

@MartijnWeterings When A is a special case of B, A cannot be equivalent to B. And ridge regression is a special case of the general constrained minimization problem, namely a situation at which we arrive if we further constrain the general problem (like you do in your last comment).

— Alecos Papadopoulos

Certainly you could define some constrained minimization problem that is more general than ridge regression (like you can also define some regularization problem that is more general than ridge regression, e.g. negative ridge regression), but then the non-equivalence is due to the way that you define the problem and not due to the transformation from the constrained representation to the Lagrangian representation. The two forms can be seen as equivalent within the (non-general) constrained formulation/definition that is useful for ridge regression.

— Sextus Empiricus