Firstly, I think that decorrelation and whitening are two separate procedures.
In order to decorrelate the data, we need to transform it so that the transformed data have a diagonal covariance matrix. This transform can be found by solving the eigenvalue problem. We find the eigenvectors and associated eigenvalues of the covariance matrix $\Sigma = XX'$ by solving

$$\Sigma\Phi = \Phi\Lambda$$

where $\Lambda$ is a diagonal matrix having the eigenvalues as its diagonal elements. The matrix $\Phi$ thus diagonalizes the covariance matrix of $X$; the columns of $\Phi$ are the eigenvectors of the covariance matrix.

We can also write the diagonalized covariance as:

$$\Phi'\Sigma\Phi = \Lambda \tag{1}$$
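To make this concrete, here is a minimal NumPy sketch (my own illustration, not from the linked notes) checking equation (1) numerically: the eigenvector matrix of the sample covariance diagonalizes it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Correlated toy data: 2 features, 1000 examples (columns are examples).
X = rng.multivariate_normal(mean=[0.0, 0.0],
                            cov=[[2.0, 1.2], [1.2, 1.0]],
                            size=1000).T

Sigma = X @ X.T / X.shape[1]          # sample covariance (data assumed zero-mean)
eigvals, Phi = np.linalg.eigh(Sigma)  # eigh, since Sigma is symmetric
Lambda = np.diag(eigvals)

# Equation (1): Phi' Sigma Phi equals the diagonal matrix Lambda.
print(np.allclose(Phi.T @ Sigma @ Phi, Lambda))  # True
```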
So, to decorrelate a single data point $x_i$, we compute:

$$x_i^* = \Phi' x_i \tag{2}$$

The diagonal elements (eigenvalues) in $\Lambda$ may all be equal or may differ; making them all equal is what whitening the data means. Since the eigenvalues set the scale along each eigenvector, the required rescaling is

$$\Lambda^{-1/2}\Lambda\Lambda^{-1/2} = I$$

Equivalently, substituting into equation (1):

$$\Lambda^{-1/2}\Phi'\Sigma\Phi\Lambda^{-1/2} = I$$

Thus, to whiten the decorrelated data point $x_i^*$, we simply multiply it by this scale factor, obtaining the whitened data point $x_i^\dagger$:

$$x_i^\dagger = \Lambda^{-1/2} x_i^* = \Lambda^{-1/2}\Phi' x_i \tag{3}$$
Now the covariance of $x_i^\dagger$ is not only diagonal but also uniform (white), since $E(x_i^\dagger x_i^{\dagger\prime}) = I$.
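As a sanity check, here is a short NumPy sketch (again my own, assuming zero-mean data stored with one example per column) that applies equations (2) and (3) to a whole data matrix and verifies that the whitened covariance is the identity:

```python
import numpy as np

rng = np.random.default_rng(1)

# Zero-mean correlated data; each column is one data point x_i.
X = rng.multivariate_normal([0.0, 0.0], [[3.0, -1.5], [-1.5, 2.0]], size=5000).T

Sigma = X @ X.T / X.shape[1]
eigvals, Phi = np.linalg.eigh(Sigma)

X_star = Phi.T @ X                            # equation (2): decorrelated
X_dagger = np.diag(eigvals ** -0.5) @ X_star  # equation (3): whitened

cov_dagger = X_dagger @ X_dagger.T / X.shape[1]
print(np.allclose(cov_dagger, np.eye(2)))  # True: diagonal *and* uniform
```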
Following on from this, I can see two cases where this might not be useful. The first is rather trivial: it could happen that the scale of the data examples is somehow important to the inference problem you are looking at, and whitening throws that information away. (You could, of course, keep the eigenvalues as an additional set of features to get around this.) The second is computational: first, you have to compute the covariance matrix $\Sigma$, which may be too large to fit in memory (if you have thousands of features) or take too long to compute; second, the eigenvalue decomposition is $O(n^3)$ in practice, which again is pretty horrible with a large number of features.
And finally, there is a common "gotcha" to be careful of: calculate the scaling factors on the training data only, and then use equations (2) and (3) to apply those same factors to the test data. Otherwise you risk overfitting, because you would be using information from the test set in the training process. A sketch of that workflow follows below.
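This is a minimal sketch of that workflow, assuming zero-mean data with one example per column (the split sizes and the names `X_train`/`X_test` are just for illustration):

```python
import numpy as np

def fit_whitening(X_train):
    """Estimate the factors Phi and Lambda^{-1/2} from the training data only."""
    Sigma = X_train @ X_train.T / X_train.shape[1]
    eigvals, Phi = np.linalg.eigh(Sigma)
    # In practice, near-zero eigenvalues may need regularizing before inversion.
    return Phi, np.diag(eigvals ** -0.5)

def whiten(X, Phi, Lambda_inv_sqrt):
    """Apply equations (2) and (3) with fixed, previously fitted factors."""
    return Lambda_inv_sqrt @ (Phi.T @ X)

rng = np.random.default_rng(2)
data = rng.multivariate_normal([0.0, 0.0], [[2.0, 0.8], [0.8, 1.0]], size=1000).T
X_train, X_test = data[:, :800], data[:, 800:]

Phi, Lambda_inv_sqrt = fit_whitening(X_train)       # fit on the training set only
X_train_w = whiten(X_train, Phi, Lambda_inv_sqrt)
X_test_w = whiten(X_test, Phi, Lambda_inv_sqrt)     # reuse the same factors

# The test covariance is close to, but not exactly, the identity.
print(X_test_w @ X_test_w.T / X_test_w.shape[1])
```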
Source: http://courses.media.mit.edu/2010fall/mas622j/whiten.pdf