Is PCA optimization convex?

The objective function of principal component analysis (PCA) is to minimize the reconstruction error in the L2 norm (see Section 2.12 here). Another view is that it tries to maximize the variance of the projection. We also have an excellent post on this here: What is the objective function of PCA?
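The two views pick out the same directions. As a minimal sketch (the data matrix `X` and the helper names are made up for illustration and are not taken from the linked posts): for centered data and any unit vector `w`, the squared reconstruction error and the sum of squared projections add up to a constant, so minimizing the first is the same as maximizing the second.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 100 points in 3 dimensions, centered column by column.
X = rng.normal(size=(100, 3)) * np.array([2.0, 1.0, 0.3])
X = X - X.mean(axis=0)

def projected_ss(w):
    """Sum of squared projections onto w (proportional to the projected variance)."""
    return float(np.sum((X @ w) ** 2))

def reconstruction_ss(w):
    """Squared L2 error after rebuilding each row from its projection onto w."""
    return float(np.sum((X - np.outer(X @ w, w)) ** 2))

total_ss = float(np.sum(X ** 2))

# For every unit vector w the two objectives sum to the same constant.
gaps = []
for _ in range(5):
    w = rng.normal(size=3)
    w /= np.linalg.norm(w)
    gaps.append(abs(projected_ss(w) + reconstruction_ss(w) - total_ss))

print(max(gaps))  # numerically zero, up to floating-point rounding
```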

My question is: is the PCA optimization convex? (I found some discussions here, but I wish someone could provide a nice proof here on CV.)

— Haitao Du

No. You are maximizing a convex function (subject to constraints).

— user603

I think you need to define precisely what you mean by "PCA optimization." One standard formulation is to maximize $x^\prime\mathbb{A}x$ subject to $x^\prime x=1$. The problem is that convexity does not even make sense: the domain $x^\prime x=1$ is a sphere, not a Euclidean space.

— Whuber

@whuber thank you for your comment. Because of my limited knowledge, I may not be able to clarify the question myself. I may wait for some answers that also help me clarify the question.

— Haitao Du

I would refer you to any definition of "convex" with which you are familiar. Don't they all involve a concept of points in the domain of a function lying "between" other points? This is worth noting because it reminds you to consider the geometry of a function's domain as well as any algebraic or analytic properties of the function's values. In that light, it occurs to me that the variance-maximizing formulation can be modified slightly to make the domain convex: simply require $x^\prime x\le1$ instead of $x^\prime x=1$. The solution is the same, and the answer becomes quite clear.

— Whuber

No, the usual formulations of PCA are not convex problems. But they can be transformed into a convex optimization problem.

The insight and the fun of this lie in following and visualizing the sequence of transformations rather than just getting the answer: it's in the journey, not the destination. The main steps in this journey are

Obtain a simple expression for the objective function.
Enlarge its domain, which is not convex, into one that is.
Modify the objective, which is not convex, into one that is, in a way that obviously does not change the points at which it attains its optimal values.

If you look closely, you can see the SVD and Lagrange multipliers lurking, but they are just a sideshow, there for scenic interest, and I won't comment on them further.

The standard variance-maximizing formulation of PCA (or at least its key step) is

$\text{Maximize }f(x)=\ x^\prime \mathbb{A} x\ \text{ subject to }\ x^\prime x=1\tag{*}$

where the $n\times n$ matrix $\mathbb A$ is a symmetric, positive-semidefinite matrix constructed from the data (usually its sums-of-squares-and-products matrix, its covariance matrix, or its correlation matrix).

(Equivalently, we may try to maximize the unconstrained objective $x^\prime \mathbb{A} x / x^\prime x$. Not only is this a nastier expression (it is no longer a quadratic function), but graphing special cases will quickly show it is not a convex function, either. Usually one observes that this function is invariant under rescalings $x\to \lambda x$ and then reduces it to the constrained formulation $(*)$.)
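A small numerical sketch of both remarks, assuming the toy matrix `A = diag(2, 1)` (an illustrative choice, not part of the argument): the quotient is unchanged by rescaling, and along one chord its midpoint value dips below both endpoints while along another it rises above both, so neither direction of the chord inequality can hold.

```python
import numpy as np

A = np.diag([2.0, 1.0])  # a small symmetric positive-semidefinite example

def rayleigh(x):
    """Unconstrained objective x'Ax / x'x (undefined at x = 0)."""
    return (x @ A @ x) / (x @ x)

x = np.array([1.0, 1.0])
scale_invariant = np.isclose(rayleigh(3.0 * x), rayleigh(x))

# Midpoint of (1, 1) and (-1, 1) is (0, 1): the value drops below both endpoints.
dips = rayleigh(np.array([0.0, 1.0])) < min(rayleigh(np.array([1.0, 1.0])),
                                            rayleigh(np.array([-1.0, 1.0])))
# Midpoint of (1, 1) and (1, -1) is (1, 0): the value rises above both endpoints.
bumps = rayleigh(np.array([1.0, 0.0])) > max(rayleigh(np.array([1.0, 1.0])),
                                             rayleigh(np.array([1.0, -1.0])))

print(scale_invariant, dips, bumps)  # True True True
```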

Any optimization problem can be abstractly formulated as

Find at least one $x\in\mathcal{X}$ that makes the function $f:\mathcal{X}\to\mathbb{R}$ as large as possible.

Recall that an optimization problem is convex when it enjoys two separate properties:

The domain $\mathcal{X}\subset\mathbb{R}^n$ is convex. This can be formulated in many ways. One is that whenever $x\in\mathcal{X}$ and $y\in\mathcal{X}$ and $0 \le \lambda \le 1$, $\lambda x + (1-\lambda)y\in\mathcal{X}$ also. Geometrically: whenever the two endpoints of a line segment lie in $\mathcal X$, the entire segment lies in $\mathcal X$.
The function $f$ is convex. This also can be formulated in many ways. One is that whenever $x\in\mathcal{X}$ and $y\in\mathcal{X}$ and $0 \le \lambda \le 1$, $f(\lambda x + (1-\lambda)y) \ge \lambda f(x) + (1-\lambda) f(y).$ (We needed $\mathcal X$ to be convex in order for this condition to make any sense.) Geometrically: whenever $\bar{xy}$ is any line segment in $\mathcal X$, the graph of $f$ (as restricted to this segment) lies above or on the segment connecting $(x,f(x))$ and $(y,f(y))$ in $\mathbb{R}^{n+1}$.

The archetype of a convex function is locally everywhere parabolic with non-positive leading coefficient: on any line segment it can be expressed in the form $y\to a y^2 + b y + c$ with $a \le 0.$
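As a quick numerical reading of the inequality above (a sketch, not a proof), the check below samples random pairs of points and weights and tests the inequality exactly as it is written in this answer. The two parabolas and the helper name `satisfies_inequality` are illustrative assumptions.

```python
import numpy as np

def satisfies_inequality(f, lo, hi, n_pairs=1000, seed=0):
    """Test f(lam*x + (1-lam)*y) >= lam*f(x) + (1-lam)*f(y) on random samples."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(lo, hi, n_pairs)
    y = rng.uniform(lo, hi, n_pairs)
    lam = rng.uniform(0.0, 1.0, n_pairs)
    lhs = f(lam * x + (1.0 - lam) * y)
    rhs = lam * f(x) + (1.0 - lam) * f(y)
    return bool(np.all(lhs >= rhs - 1e-9))  # small slack for rounding

down = lambda t: -2.0 * t**2 + 3.0 * t + 1.0  # leading coefficient a = -2 <= 0
up = lambda t: 2.0 * t**2 + 3.0 * t + 1.0     # leading coefficient a = +2 > 0

print(satisfies_inequality(down, -5.0, 5.0))  # True
print(satisfies_inequality(up, -5.0, 5.0))    # False
```

The difference between the two sides is $-a\lambda(1-\lambda)(x-y)^2$, which is why the sign of the leading coefficient alone decides the outcome.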

A difficulty with $(*)$ is that $\mathcal X$ is the unit sphere $S^{n-1}\subset\mathbb{R}^n$, which is decidedly not convex. However, we can modify this problem by including smaller vectors. That's because when we scale $x$ by a factor $\lambda$, $f$ is multiplied by $\lambda^2$. When $0 \lt x^\prime x \lt 1$, we can scale $x$ up to unit length by multiplying it by $\lambda=1/\sqrt{x^\prime x} \gt 1$, thereby increasing $f$ (or at least not decreasing it) while staying within the unit ball $D^n = \{x\in\mathbb{R}^n\mid x^\prime x \le 1\}$. Let us therefore reformulate $(*)$ as

$\text{Maximize }f(x)=\ x^\prime \mathbb{A} x\ \text{ subject to }\ x^\prime x\le1\tag{**}$

Its domain is $\mathcal{X}=D^n$, which clearly is convex, so we're halfway there. It remains to consider the convexity of the graph of $f$.
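Both halves of this step can be checked numerically. The sketch below assumes a covariance matrix built from made-up random data and two hand-picked points; it shows that a midpoint of two unit vectors leaves the sphere but stays in the ball, and that scaling an interior point up to unit length multiplies $f$ by $\lambda^2 \gt 1$.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data; A is its 3 x 3 covariance matrix (symmetric, PSD).
A = np.cov(rng.normal(size=(50, 3)), rowvar=False)

def f(x):
    return x @ A @ x

# The midpoint of two unit vectors leaves the sphere but stays in the ball.
u = np.array([1.0, 0.0, 0.0])
v = np.array([0.0, 1.0, 0.0])
mid = 0.5 * (u + v)
mid_sq_norm = mid @ mid            # 0.5: off the sphere, inside D^n

# An interior point, scaled up to unit length.
x = np.array([0.3, -0.2, 0.4])     # x'x = 0.29 < 1
lam = 1.0 / np.sqrt(x @ x)         # lam > 1
x_unit = lam * x

print(mid_sq_norm, lam > 1.0, f(x_unit) >= f(x))
```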

A good way to think about the problem $(**)$, even if you don't intend to carry out the corresponding calculations, is in terms of the Spectral Theorem. It says that by means of an orthogonal transformation $\mathbb P$, you can find at least one basis of $\mathbb{R}^n$ in which $\mathbb A$ is diagonal: that is,

$\mathbb {A = P \Sigma P^\prime}$

where all the off-diagonal entries of $\Sigma$ are zero. Such a choice of $\mathbb{P}$ can be conceived of as changing nothing at all about $\mathbb A$, but merely changing how you describe it: when you rotate your point of view, the axes of the level hypersurfaces of the function $x\to x^\prime \mathbb{A} x$ (which were always ellipsoids) become aligned with the coordinate axes.

Since $\mathbb A$ is positive-semidefinite, all the diagonal entries of $\Sigma$ must be non-negative. We may further permute the axes (which is just another orthogonal transformation, and therefore can be absorbed into $\mathbb P$ ) to assure that

$\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_n \ge 0.$

If we let $x=\mathbb{P}^\prime y$ be the new coordinates $x$ (entailing $y=\mathbb{P}x$ ), the function $f$ is

$f(y) = y^\prime \mathbb{A} y = x^\prime \mathbb{P^\prime A P} x = x^\prime \Sigma x = \sigma_1 x_1^2 + \sigma_2 x_2^2 + \cdots + \sigma_n x_n^2.$
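The change of coordinates can be reproduced with `numpy.linalg.eigh`, which returns eigenvalues in ascending order with eigenvectors as columns. This is only a sketch: the matrix comes from made-up random data, and the columns are reversed so that $\sigma_1$ is the largest and $\mathbb{P^\prime A P} = \Sigma$ as in the display above.

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.cov(rng.normal(size=(200, 4)), rowvar=False)  # hypothetical PSD matrix

# Reverse eigh's ascending order so that sigma_1 >= sigma_2 >= ... >= sigma_n.
eigvals, eigvecs = np.linalg.eigh(A)
sigma = eigvals[::-1]
P = eigvecs[:, ::-1]             # orthogonal; columns are eigenvectors of A

y = rng.normal(size=4)
x = P.T @ y                      # new coordinates, so that y = P x

f_direct = y @ A @ y                   # y'Ay
f_diagonal = np.sum(sigma * x ** 2)    # sigma_1 x_1^2 + ... + sigma_n x_n^2
print(np.isclose(f_direct, f_diagonal))
```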

This function is decidedly not convex! Its graph looks like part of a hyperparaboloid: at every point in the interior of $\mathcal X$, the fact that all the $\sigma_i$ are nonnegative makes it curl upward rather than downward.

However, we can turn $(**)$ into a convex problem with one very useful technique. Knowing that the maximum will occur where $x^\prime x = 1$, let's subtract the constant $\sigma_1$ from $f$, at least for points on the boundary of $\mathcal{X}$. That will not change the locations of any points on the boundary at which $f$ is optimized, because it lowers all the values of $f$ on the boundary by the same value $\sigma_1$. This suggests examining the function

$g(y) = f(y) - \sigma_1 y^\prime y.$

This indeed subtracts the constant $\sigma_1$ from $f$ at boundary points, and subtracts smaller values at interior points. This will assure that $g$ , compared to $f$ , has no new global maxima on the interior of $\mathcal X$ .

Let's examine what has happened with this sleight-of-hand of replacing $-\sigma_1$ by $-\sigma_1 y^\prime y$ . Because $\mathbb P$ is orthogonal, $y^\prime y = x^\prime x$ . (That's practically the definition of an orthogonal transformation.) Therefore, in terms of the $x$ coordinates, $g$ can be written

$g(y) = \sigma_1 x_1 ^2 + \cdots + \sigma_n x_n^2 - \sigma_1(x_1^2 + \cdots + x_n^2) = (\sigma_2-\sigma_1)x_2^2 + \cdots + (\sigma_n - \sigma_1)x_n^2.$

Because $\sigma_1 \ge \sigma_i$ for all $i$ , each of the coefficients is zero or negative. Consequently, (a) $g$ is convex and (b) $g$ is optimized when $x_2=x_3=\cdots=x_n=0$ . ( $x^\prime x=1$ then implies $x_1=\pm 1$ and the optimum is attained when $y = \mathbb{P} (\pm 1,0,\ldots, 0)^\prime$ , which is--up to sign--the first column of $\mathbb P$ .)

Let's recapitulate the logic. Because $g$ is optimized on the boundary $\partial D^n=S^{n-1}$ where $y^\prime y = 1$ , because $f$ differs from $g$ merely by the constant $\sigma_1$ on that boundary, and because the values of $g$ are even closer to the values of $f$ on the interior of $D^n$ , the maxima of $f$ must coincide with the maxima of $g$ .
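The recap can be checked numerically under stated assumptions: the matrix below is a covariance matrix of made-up random data, and the sampled points are an arbitrary cloud inside the unit ball. The sketch confirms that the Hessian of $g$, namely $2(\mathbb{A} - \sigma_1 I)$, has no positive eigenvalues, and that no sampled point of the ball beats the top eigenvector for either $f$ or $g$.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
A = np.cov(rng.normal(size=(200, n)), rowvar=False)  # hypothetical PSD matrix

eigvals, eigvecs = np.linalg.eigh(A)   # ascending eigenvalues
sigma1 = eigvals[-1]                   # largest eigenvalue
top = eigvecs[:, -1]                   # corresponding unit eigenvector

def f(y):
    return y @ A @ y

def g(y):
    return f(y) - sigma1 * (y @ y)

# The Hessian of g is 2(A - sigma1 I); its eigenvalues are all <= 0,
# so g curls downward everywhere.
hess_max = np.linalg.eigvalsh(2.0 * (A - sigma1 * np.eye(n))).max()

# Sample points of the unit ball: draw Gaussian vectors and pull any
# that fall outside the ball back onto the unit sphere.
pts = rng.normal(size=(5000, n)) * 0.6
pts /= np.maximum(1.0, np.linalg.norm(pts, axis=1, keepdims=True))
f_vals = np.array([f(p) for p in pts])
g_vals = np.array([g(p) for p in pts])

print(hess_max <= 1e-10,
      f_vals.max() <= f(top) + 1e-10,
      g_vals.max() <= g(top) + 1e-10)
```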

— whuber

+1 Very nice. I edited to fix one formula to what I think you intended (but please check). Apart from that, I found the sentence "That won't change any boundary values at which f is optimized" to be confusing at first, because the boundary values do change: you are subtracting $\sigma_1$. Maybe it makes sense to reformulate a bit?

— amoeba says Reinstate Monica

@amoeba Right on all counts; thank you. I have amplified the discussion of that point.

— whuber

(+1) In your answer, you seem to define a convex function to be what most people would consider to be a concave function (perhaps since a convex optimization problem has a convex domain and a concave function over which a maximum is computed (or a convex function over which a minimum is computed))

— user795305

@amoeba It's a subtle argument. Note, however, that the new maxima (those of $g$) are found to occur only on the boundary. That rules out your counterexamples. Another point worth noting is that in the end we don't really care whether new local (or even global) maxima happen to show up in the interior of $\mathcal X$, because we are originally concerned only about local maxima on its boundary. We are therefore free to alter $f$ in any way that will not make any of those local boundary maxima move or disappear.

— whuber

Yes, I agree. It does not matter how $f$ is modified on the inside, if the resulting $g$ is "convex" and happens to have maxima on the boundary. Your $g$ does happen to have maxima on the boundary, and this makes the whole argument work. Makes sense.

— amoeba says Reinstate Monica

No.

Rank $k$ PCA of matrix $M$ can be formulated as

$\hat{X} = \underset{rank(X) \leq k}{argmin} \| M - X\|_F^2$

( $\|\cdot\|_F$ is the Frobenius norm.) For a derivation, see the Eckart-Young theorem.

Though the norm is convex, the set over which it is optimized is nonconvex.
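By the Eckart-Young theorem the minimizer is the truncated SVD, and the minimal squared error equals the sum of the squared discarded singular values. A minimal sketch, assuming a made-up random matrix `M` (for PCA proper, `M` would be the centered data matrix) and a hypothetical helper `best_rank_k`:

```python
import numpy as np

def best_rank_k(M, k):
    """Truncated SVD: keep the k largest singular values of M."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k, :]

rng = np.random.default_rng(0)
M = rng.normal(size=(8, 5))      # hypothetical data matrix
k = 2

X_hat = best_rank_k(M, k)
s = np.linalg.svd(M, compute_uv=False)
best_err = np.linalg.norm(M - X_hat, "fro") ** 2

# Random rank-k competitors never do better than the truncated SVD.
competitor_errs = [
    np.linalg.norm(M - rng.normal(size=(8, k)) @ rng.normal(size=(k, 5)), "fro") ** 2
    for _ in range(1000)
]

print(np.isclose(best_err, np.sum(s[k:] ** 2)), min(competitor_errs) >= best_err)
```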

A convex relaxation of PCA's problem is called Convex Low Rank Approximation

$\hat{X} = \underset{\|X\|_* \leq c}{argmin} \| M - X\|_F^2$

( $\|\cdot\|_*$ is the nuclear norm. It is a convex relaxation of rank, just as $\|\cdot\|_1$ is a convex relaxation of the number of nonzero elements for vectors.)
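What makes the relaxed feasible set convex is that the nuclear norm, being a norm, satisfies the triangle inequality, so the ball $\|X\|_* \le c$ is closed under convex combinations. The sketch below checks this on random matrices; the radius `c`, the matrix shape, and the helper names are illustrative assumptions.

```python
import numpy as np

def nuclear_norm(X):
    """Sum of the singular values of X."""
    return np.linalg.norm(X, ord="nuc")

def random_point_in_ball(rng, c, shape=(4, 3)):
    """A random matrix rescaled so that its nuclear norm is at most c."""
    X = rng.normal(size=shape)
    return X * (c * rng.uniform() / nuclear_norm(X))

rng = np.random.default_rng(0)
c = 1.0  # hypothetical radius of the nuclear-norm ball

all_in_ball = True
for _ in range(200):
    X, Y = random_point_in_ball(rng, c), random_point_in_ball(rng, c)
    lam = rng.uniform()
    Z = lam * X + (1.0 - lam) * Y          # convex combination of X and Y
    all_in_ball = all_in_ball and bool(nuclear_norm(Z) <= c + 1e-12)

print(all_in_ball)  # True
```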

You can see Statistical Learning with Sparsity, ch 6 (matrix decompositions) for details.

If you're interested in more general problems and how they relate to convexity, see Generalized Low Rank Models.

— Jakub Bartczuk

Disclaimer: The previous answers do a pretty good job of explaining how PCA in its original formulation is non-convex but can be converted to a convex optimization problem. My answer is only meant for those poor souls (such as me) who are not so familiar with the jargon of Unit Spheres and SVDs - which is, btw, good to know.

My source is these lecture notes by Prof. Tibshirani.

For an optimization problem to be solved with convex optimization techniques, there are two prerequisites.

The objective function has to be convex.
The constraint functions should also be convex.

Most formulations of PCA involve a constraint on the rank of a matrix.

In these types of PCA formulations, condition 2 is violated, because the constraint $rank(X) = k$ is not convex. For example, let $J_{11}$ , $J_{22}$ be 2 × 2 zero matrices with a single 1 in the upper left corner and lower right corner respectively. Then each of these has rank 1, but their average has rank 2.
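The counterexample is easy to verify with `numpy.linalg.matrix_rank`; the snippet is just a direct check of the two matrices described above.

```python
import numpy as np

J11 = np.array([[1.0, 0.0],
                [0.0, 0.0]])   # single 1 in the upper left corner
J22 = np.array([[0.0, 0.0],
                [0.0, 1.0]])   # single 1 in the lower right corner
avg = 0.5 * (J11 + J22)        # midpoint of the segment joining them

ranks = [int(np.linalg.matrix_rank(m)) for m in (J11, J22, avg)]
print(ranks)  # [1, 1, 2]
```

Two feasible points of the rank-1 set have a midpoint outside it, so the set is not convex.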

— kasa

Could you please explain what "$X$" refers to and why there is any constraint on its rank? This doesn't correspond with my understanding of PCA, but perhaps you are thinking of a more specialized version in which only $k$ principal components are sought.

— whuber

Yeah, $X$ is the transformed (rotated) data matrix. In this formulation, we seek matrices that are at least of rank $k$. You can refer to the link in my answer for a more accurate description.

— kasa