Maximum Likelihood Estimators - Multivariate Gaussian

Context

The multivariate Gaussian appears frequently in machine learning, and the following results are used in many ML books and courses without the derivations.

Given data in the form of a matrix $\mathbf{X}$ of dimensions $m \times p$, if we assume that the data follows a $p$-variate Gaussian distribution with parameters mean $\mu$ ($p \times 1$) and covariance matrix $\Sigma$ ($p \times p$), the Maximum Likelihood Estimators are given by:

$\hat \mu = \frac{1}{m} \sum_{i=1}^m \mathbf{ x^{(i)} } = \mathbf{\bar{x}}$

$\hat \Sigma = \frac{1}{m} \sum_{i=1}^m \mathbf{(x^{(i)} - \hat \mu) (x^{(i)} -\hat \mu)}^T$
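For readers who want to see the two estimators in action, here is a minimal NumPy sketch. The function name `gaussian_mle` and the simulated parameters `true_mu` and `true_Sigma` are made up for illustration and are not part of the question.

```python
import numpy as np

def gaussian_mle(X):
    """Return (mu_hat, Sigma_hat) for an (m, p) data matrix X."""
    m = X.shape[0]
    mu_hat = X.mean(axis=0)                 # sample mean vector, shape (p,)
    centered = X - mu_hat                   # subtract the mean from every row
    Sigma_hat = centered.T @ centered / m   # sum of outer products, divided by m
    return mu_hat, Sigma_hat

# Simulated data (illustrative parameters only)
rng = np.random.default_rng(0)
true_mu = np.array([1.0, -2.0, 0.5])
true_Sigma = np.array([[2.0, 0.3, 0.0],
                       [0.3, 1.0, 0.4],
                       [0.0, 0.4, 1.5]])
X = rng.multivariate_normal(true_mu, true_Sigma, size=500)

mu_hat, Sigma_hat = gaussian_mle(X)
print(mu_hat)
print(Sigma_hat)
```

Note the division by $m$ rather than $m-1$: `np.cov(X, rowvar=False, bias=True)` should agree with `Sigma_hat`, whereas the default `np.cov` uses the $m-1$ normalization.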

I understand that knowledge of the multivariate Gaussian is a prerequisite for many ML courses, but it would be helpful to have the full derivation in a self-contained answer once and for all, as I feel many self-learners are bouncing around the stats.stackexchange and math.stackexchange websites looking for answers.

Question

What is the full derivation of the Maximum Likelihood Estimators for the multivariate Gaussian?

Examples:

These lecture notes (page 11) on Linear Discriminant Analysis, or these ones, use the results and assume prior knowledge.

There are also a few posts that are partly answered or closed:

— Xavier Bourret Sicotte

Answers:

Deriving the Maximum Likelihood Estimators

Assume that we have $m$ random vectors, each of size $p$: $\mathbf{X^{(1)}, X^{(2)},...,X^{(m)}}$, where each random vector can be interpreted as an observation (data point) across $p$ variables. Suppose the $\mathbf{X}^{(i)}$ are independent and identically distributed (i.i.d.) multivariate Gaussian vectors:

$\mathbf{X^{(i)}} \sim \mathcal{N}_p(\mu, \Sigma)$

where the parameters $\mu, \Sigma$ are unknown. To obtain their estimates we can use the method of maximum likelihood and maximize the log-likelihood function.

Note that, by the independence of the random vectors, the joint density of the data $\{\mathbf{X}^{(i)}, i = 1,2,...,m\}$ is the product of the individual densities, that is $\prod_{i=1}^m f_{\mathbf{X^{(i)}}}(\mathbf{x^{(i)} ; \mu , \Sigma })$. Taking the logarithm gives the log-likelihood function

$\begin{aligned} l(\mathbf{ \mu, \Sigma | x^{(i)} }) & = \log \prod_{i=1}^m f_{\mathbf{X^{(i)}}}(\mathbf{x^{(i)} | \mu , \Sigma }) \\ & = \log \ \prod_{i=1}^m \frac{1}{(2 \pi)^{p/2} |\Sigma|^{1/2}} \exp \left( - \frac{1}{2} \mathbf{(x^{(i)} - \mu)^T \Sigma^{-1} (x^{(i)} - \mu) } \right) \\ & = \sum_{i=1}^m \left( - \frac{p}{2} \log (2 \pi) - \frac{1}{2} \log |\Sigma| - \frac{1}{2} \mathbf{(x^{(i)} - \mu)^T \Sigma^{-1} (x^{(i)} - \mu) } \right) \end{aligned}$

$\begin{aligned} l(\mathbf{ \mu, \Sigma | x^{(i)} }) & = - \frac{mp}{2} \log (2 \pi) - \frac{m}{2} \log |\Sigma| - \frac{1}{2} \sum_{i=1}^m \mathbf{(x^{(i)} - \mu)^T \Sigma^{-1} (x^{(i)} - \mu) } \end{aligned}$
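The closed form above can be checked against a direct sum of log densities. The following is only a sketch: the function names `log_likelihood` and `log_likelihood_pointwise` and the test values of `mu` and `Sigma` are assumptions chosen for illustration.

```python
import numpy as np

def log_likelihood(X, mu, Sigma):
    """Closed-form log-likelihood of an (m, p) data matrix X under N_p(mu, Sigma)."""
    m, p = X.shape
    diff = X - mu
    _, logdet = np.linalg.slogdet(Sigma)        # log|Sigma|, numerically stable
    # sum_i (x_i - mu)^T Sigma^{-1} (x_i - mu), without forming the inverse
    quad = np.sum(diff * np.linalg.solve(Sigma, diff.T).T)
    return -0.5 * m * p * np.log(2 * np.pi) - 0.5 * m * logdet - 0.5 * quad

def log_likelihood_pointwise(X, mu, Sigma):
    """Same quantity, computed as the log of each individual density, then summed."""
    p = X.shape[1]
    Sigma_inv = np.linalg.inv(Sigma)
    norm_const = (2 * np.pi) ** (p / 2) * np.sqrt(np.linalg.det(Sigma))
    total = 0.0
    for x in X:
        d = x - mu
        total += np.log(np.exp(-0.5 * d @ Sigma_inv @ d) / norm_const)
    return total

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 2))
mu = np.zeros(2)
Sigma = np.array([[1.0, 0.2],
                  [0.2, 1.5]])

closed_form = log_likelihood(X, mu, Sigma)
pointwise = log_likelihood_pointwise(X, mu, Sigma)
print(closed_form, pointwise)   # the two values should agree up to rounding
```

The closed form is preferable in practice because it avoids exponentiating and then taking logs, which can underflow for points far from the mean.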

Deriving $\hat \mu$

To take the derivative with respect to $\mu$ and equate it to zero, we will make use of the following matrix calculus identity:

$\mathbf{ \frac{\partial w^T A w}{\partial w} = 2Aw}$ if $\mathbf{w}$ does not depend on $\mathbf{A}$ and $\mathbf{A}$ is symmetric.
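If the identity looks unfamiliar, it is easy to check numerically with finite differences. This is a rough sketch; `numerical_gradient` is a hypothetical helper written for this check, not a NumPy function.

```python
import numpy as np

def numerical_gradient(f, w, eps=1e-6):
    """Central finite-difference gradient of a scalar function f at the vector w."""
    grad = np.zeros_like(w)
    for j in range(w.size):
        step = np.zeros_like(w)
        step[j] = eps
        grad[j] = (f(w + step) - f(w - step)) / (2 * eps)
    return grad

rng = np.random.default_rng(2)
M = rng.normal(size=(4, 4))
A = M + M.T                      # symmetric, as the identity requires
w = rng.normal(size=4)

analytic = 2 * A @ w
numeric = numerical_gradient(lambda v: v @ A @ v, w)
print(np.max(np.abs(analytic - numeric)))   # should be close to zero
```

For a non-symmetric $\mathbf{A}$ the gradient is $\mathbf{(A + A^T)w}$ instead, which is why the symmetry assumption matters.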

$\begin{aligned} \frac{\partial }{\partial \mu} l(\mathbf{ \mu, \Sigma | x^{(i)} }) & = - \sum_{i=1}^m \mathbf{ \Sigma^{-1} ( \mu - x^{(i)} ) } = 0 \\ & \text{Since $\Sigma$ is positive definite (hence invertible)} \\ 0 & = m \mu - \sum_{i=1}^m \mathbf{ x^{(i)} } \\ \hat \mu &= \frac{1}{m} \sum_{i=1}^m \mathbf{ x^{(i)} } = \mathbf{\bar{x}} \end{aligned}$

This is often called the sample mean vector.

Deriving $\hat \Sigma$

Deriving the MLE for the covariance matrix requires more work and the use of the following linear algebra and calculus properties:

The trace is invariant under cyclic permutations of matrix products: $tr[ABC] = tr[CAB] = tr[BCA]$

Since $x^TAx$ is scalar, we can take its trace and obtain the same value: $x^TAx = tr[x^TAx] = tr[xx^TA]$

$\frac{\partial}{\partial A} tr[AB] = B^T$

$\frac{\partial}{\partial A} \log |A| = A^{-T}$
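These properties can also be verified numerically. Below is a small sketch using finite differences, where `numerical_matrix_gradient` is a hypothetical helper (not a library function) and the random matrices are arbitrary test inputs.

```python
import numpy as np

def numerical_matrix_gradient(f, A, eps=1e-6):
    """Central finite-difference gradient of scalar f with respect to each entry of A."""
    grad = np.zeros_like(A)
    for i in range(A.shape[0]):
        for j in range(A.shape[1]):
            step = np.zeros_like(A)
            step[i, j] = eps
            grad[i, j] = (f(A + step) - f(A - step)) / (2 * eps)
    return grad

rng = np.random.default_rng(3)
B = rng.normal(size=(3, 3))
C = rng.normal(size=(3, 3))
M = rng.normal(size=(3, 3))
A = M @ M.T + 3 * np.eye(3)      # positive definite, so log|A| is well defined

# Cyclic invariance of the trace: tr[ABC] = tr[CAB] = tr[BCA]
trace_abc = np.trace(A @ B @ C)
trace_cab = np.trace(C @ A @ B)
trace_bca = np.trace(B @ C @ A)

# d/dA tr[AB] = B^T
grad_trace = numerical_matrix_gradient(lambda Z: np.trace(Z @ B), A)

# d/dA log|A| = A^{-T}
grad_logdet = numerical_matrix_gradient(lambda Z: np.linalg.slogdet(Z)[1], A)

print(np.max(np.abs(grad_trace - B.T)))
print(np.max(np.abs(grad_logdet - np.linalg.inv(A).T)))
```

Both derivative identities treat the entries of $A$ as independent variables, which is the convention used in the rest of the derivation.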

Combining these properties allows us to calculate

$\frac{\partial}{\partial A} x^TAx =\frac{\partial}{\partial A} tr[xx^TA] = [xx^T]^T = (x^{T})^Tx^T = xx^T$

Which is the outer product of the vector $x$ with itself.

We can now re-write the log-likelihood function and compute the derivative w.r.t. $\Sigma^{-1}$ (note $C$ is constant)

$\begin{aligned} l(\mathbf{ \mu, \Sigma | x^{(i)} }) & = \text{C} - \frac{m}{2} \log |\Sigma| - \frac{1}{2} \sum_{i=1}^m \mathbf{(x^{(i)} - \mu)^T \Sigma^{-1} (x^{(i)} - \mu) } \\ & = \text{C} + \frac{m}{2} \log |\Sigma^{-1}| - \frac{1}{2} \sum_{i=1}^m tr[ \mathbf{(x^{(i)} - \mu) (x^{(i)} - \mu)^T \Sigma^{-1} } ] \\ \frac{\partial }{\partial \Sigma^{-1}} l(\mathbf{ \mu, \Sigma | x^{(i)} }) & = \frac{m}{2} \Sigma - \frac{1}{2} \sum_{i=1}^m \mathbf{(x^{(i)} - \mu) (x^{(i)} - \mu)}^T \ \ \text{Since $\Sigma^T = \Sigma$} \end{aligned}$

Equating to zero and solving for $\Sigma$

$\begin{aligned} 0 &= m \Sigma - \sum_{i=1}^m \mathbf{(x^{(i)} - \mu) (x^{(i)} - \mu)}^T \\ \hat \Sigma & = \frac{1}{m} \sum_{i=1}^m \mathbf{(x^{(i)} - \hat \mu) (x^{(i)} -\hat \mu)}^T \end{aligned}$
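As a sanity check on this step, one can differentiate the log-likelihood numerically with respect to the precision matrix $\Sigma^{-1}$ and confirm that the gradient matches $\frac{m}{2}\Sigma - \frac{1}{2}\sum_i (x^{(i)}-\mu)(x^{(i)}-\mu)^T$ and vanishes at $\hat\Sigma$. This is only a sketch: the helper names and the simulated data are assumptions, and the constant $C$ is dropped since it does not affect the gradient.

```python
import numpy as np

def log_likelihood_precision(X, mu, Lam):
    """Log-likelihood (without the constant C) as a function of Lam = Sigma^{-1}."""
    m = X.shape[0]
    diff = X - mu
    S = diff.T @ diff                        # sum of outer products
    _, logdet = np.linalg.slogdet(Lam)       # log|Sigma^{-1}|
    return 0.5 * m * logdet - 0.5 * np.trace(S @ Lam)

def numerical_matrix_gradient(f, A, eps=1e-6):
    """Central finite-difference gradient of scalar f with respect to each entry of A."""
    grad = np.zeros_like(A)
    for i in range(A.shape[0]):
        for j in range(A.shape[1]):
            step = np.zeros_like(A)
            step[i, j] = eps
            grad[i, j] = (f(A + step) - f(A - step)) / (2 * eps)
    return grad

rng = np.random.default_rng(4)
mixing = np.array([[1.0, 0.5, 0.0],
                   [0.0, 1.0, 0.3],
                   [0.0, 0.0, 2.0]])
X = rng.normal(size=(200, 3)) @ mixing       # correlated synthetic data
m = X.shape[0]
mu_hat = X.mean(axis=0)
diff = X - mu_hat
Sigma_hat = diff.T @ diff / m

f = lambda L: log_likelihood_precision(X, mu_hat, L)

# Gradient at an arbitrary positive definite precision matrix
Lam = np.linalg.inv(Sigma_hat + 0.5 * np.eye(3))
analytic = 0.5 * m * np.linalg.inv(Lam) - 0.5 * diff.T @ diff
numeric = numerical_matrix_gradient(f, Lam)

# Gradient at the MLE, expected to vanish up to finite-difference error
grad_at_mle = numerical_matrix_gradient(f, np.linalg.inv(Sigma_hat))
print(np.max(np.abs(numeric - analytic)), np.max(np.abs(grad_at_mle)))
```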

— Xavier Bourret Sicotte

Alternative proofs, more compact forms, or intuitive interpretations are welcome!

— Xavier Bourret Sicotte

In the derivation for $\mu$, why does $\Sigma$ need to be positive definite? Does it seem enough that $\Sigma$ is invertible? For an invertible matrix $A$, $Ax=0$ only when $x=0$?

— Tom Bennett

To clarify, $\Sigma$ is an $m \times m$ matrix that may have finite diagonal and non-diagonal components indicating correlation between vectors, correct? If that is the case, in what sense are these vectors independent? Also, why is the joint probability function equal to the likelihood? Shouldn't the joint density, $f(x,y)$, be equal to the likelihood multiplied by the prior, i.e. $f(x|y)f(y)$?

— Mathews24

@TomBennett the sigma matrix is positive definite by definition - see stats.stackexchange.com/questions/52976/… for the proof. The matrix calculus identity requires the matrix to be symmetric, not positive definite. But since positive definite matrices are always symmetric that works

— Xavier Bourret Sicotte

Yes indeed - independence between observations allows us to get the likelihood - the wording may be unclear, fair enough - this is the multivariate version of the likelihood. The prior is still irrelevant regardless

— Xavier Bourret Sicotte

An alternative proof for $\widehat{\Sigma}$ that takes the derivative with respect to $\Sigma$ directly:

Picking up with the log-likelihood as above:

$\begin{aligned} \ell(\mu, \Sigma) &= C - \frac{m}{2}\log|\Sigma|-\frac{1}{2} \sum_{i=1}^m \text{tr}\left[(\mathbf{x}^{(i)}-\mu)^T \Sigma^{-1} (\mathbf{x}^{(i)}-\mu)\right]\\ &= C - \frac{1}{2}\left(m\log|\Sigma| + \sum_{i=1}^m\text{tr} \left[(\mathbf{x}^{(i)}-\mu)(\mathbf{x}^{(i)}-\mu)^T\Sigma^{-1} \right]\right)\\ &= C - \frac{1}{2}\left(m\log|\Sigma| +\text{tr}\left[ S_\mu \Sigma^{-1} \right] \right) \end{aligned}$

where $S_\mu = \sum_{i=1}^m (\mathbf{x}^{(i)}-\mu)(\mathbf{x}^{(i)}-\mu)^T$ and we have used the cyclic and linear properties of $\text{tr}$. To compute $\partial \ell /\partial \Sigma$ we first observe that

$\frac{\partial}{\partial \Sigma} \log |\Sigma| = \Sigma^{-T}=\Sigma^{-1}$

by the fourth property above. To take the derivative of the second term we will need the property that

$\frac{\partial}{\partial X}\text{tr}\left( A X^{-1} B\right) = -(X^{-1}BAX^{-1})^T$

(from The Matrix Cookbook, equation 63). Applying this with $X = \Sigma$, $A = S_\mu$ and $B=I$ we obtain that

$\frac{\partial}{\partial \Sigma}\text{tr}\left[S_\mu \Sigma^{-1}\right] = -\left( \Sigma^{-1} S_\mu \Sigma^{-1}\right)^T = -\Sigma^{-1} S_\mu \Sigma^{-1}$

because both $\Sigma$ and $S_\mu$ are symmetric. Then

$\frac{\partial}{\partial \Sigma}\ell(\mu, \Sigma) \propto m \Sigma^{-1} - \Sigma^{-1} S_\mu \Sigma^{-1}.$

Setting this to 0 and multiplying on the left and on the right by $\Sigma$ gives

$\widehat{\Sigma} = \frac{1}{m}S_\mu.$
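The trace identity used in this proof is less well known than the others, so here is a quick finite-difference check of it. This is a sketch only: `numerical_matrix_gradient` is a hypothetical helper and the matrices are random test inputs.

```python
import numpy as np

def numerical_matrix_gradient(f, X, eps=1e-6):
    """Central finite-difference gradient of scalar f with respect to each entry of X."""
    grad = np.zeros_like(X)
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            step = np.zeros_like(X)
            step[i, j] = eps
            grad[i, j] = (f(X + step) - f(X - step)) / (2 * eps)
    return grad

rng = np.random.default_rng(5)
A = rng.normal(size=(3, 3))
B = rng.normal(size=(3, 3))
M = rng.normal(size=(3, 3))
X = M @ M.T + 3 * np.eye(3)          # well-conditioned and invertible

X_inv = np.linalg.inv(X)
analytic = -(X_inv @ B @ A @ X_inv).T
numeric = numerical_matrix_gradient(lambda Z: np.trace(A @ np.linalg.inv(Z) @ B), X)
print(np.max(np.abs(analytic - numeric)))   # should be close to zero
```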

This approach is more work than the standard one using derivatives with respect to $\Lambda = \Sigma^{-1}$ , and requires a more complicated trace identity. I only found it useful because I currently need to take derivatives of a modified likelihood function for which it seems much harder to use $\partial/{\partial \Sigma^{-1}}$ than $\partial/\partial \Sigma$ .

— Eric Kightley
