In simple linear regression, where does the formula for the variance of the residuals come from?

According to a text I am using, the formula for the variance of the $i^{th}$ residual is:

$\sigma^2\left ( 1-\frac{1}{n}-\frac{(x_{i}-\overline{x})^2}{S_{xx}} \right )$

I find this hard to believe, since the $i^{th}$ residual is the difference between the $i^{th}$ observed value and the $i^{th}$ fitted value; if one were to compute the variance of the difference, I would expect at least some "pluses" in the resulting expression. Any help in understanding the derivation would be appreciated.

regression variance residuals

— Eric

Is it possible that some "$+$" signs in the text are being rendered (or read) incorrectly as "$-$" signs?

— Whuber

I had thought of that, but it happened twice in the text (in 2 different chapters), so I considered it unlikely. Of course, a derivation of the formula would help! :)

— Eric

The negative signs result from the positive correlation between an observation and its fitted value, which reduces the variance of the difference.

— Glen_b -Reinstate Monica

@Glen Thank you for explaining why the formula makes sense, and for your matrix derivation below.

— Eric

The intuition about the "plus" signs in the variance (from the fact that even when we compute the variance of a difference of independent random variables, we add their variances) is correct but fatally incomplete: if the random variables involved are not independent, then covariances are also involved, and covariances can be negative. There is an expression that is almost what the OP (and I) thought the expression in the question "should" be: the variance of the prediction error. Denote the prediction error by $e^0 = y^0 - \hat y^0$, where $y^0 = \beta_0+\beta_1x^0+u^0$. Then

$\text{Var}(e^0) = \sigma^2\cdot \left(1 + \frac 1n + \frac {(x^0-\bar x)^2}{S_{xx}}\right)$

The critical difference between the variance of the prediction error and the variance of the estimation error (i.e. of the residual) is that the error term of the predicted observation is not correlated with the estimator: the value $y^0$ is out of sample, so it was not used in constructing the estimator or computing the estimates.

The algebra for both proceeds in exactly the same way up to a point (with $^0$ in place of $_i$), but then diverges. Specifically:

In the simple linear regression $y_i = \beta_0 + \beta_1x_i + u_i$, with $\text{Var}(u_i)=\sigma^2$, the variance of the estimator $\hat \beta = (\hat \beta_0, \hat \beta_1)'$ is still

$\text{Var}(\hat \beta) = \sigma^2 \left(\mathbf X' \mathbf X\right)^{-1}$

We have

$\mathbf X' \mathbf X= \left[ \begin{matrix} n & \sum x_i\\ \sum x_i & \sum x_i^2 \end{matrix}\right]$

and so

$\left(\mathbf X' \mathbf X\right)^{-1}= \left[ \begin{matrix} \sum x_i^2 & -\sum x_i\\ -\sum x_i & n \end{matrix}\right]\cdot \left[n\sum x_i^2-\left(\sum x_i\right)^2\right]^{-1}$

We have

$\left[n\sum x_i^2-\left(\sum x_i\right)^2\right] = \left[n\sum x_i^2-n^2\bar x^2\right] = n\left[\sum x_i^2-n\bar x^2\right] = n\sum (x_i^2-\bar x^2) = n\sum (x_i-\bar x)^2 \equiv nS_{xx}$

Therefore

$\left(\mathbf X' \mathbf X\right)^{-1}= \left[ \begin{matrix} (1/n)\sum x_i^2 & -\bar x\\ -\bar x & 1 \end{matrix}\right]\cdot (1/S_{xx})$

which means that

$\text{Var}(\hat \beta_0) = \sigma^2\left(\frac 1n\sum x_i^2\right)\cdot (1/S_{xx}) = \frac {\sigma^2}{n}\frac{S_{xx}+n\bar x^2} {S_{xx}} = \sigma^2\left(\frac 1n + \frac{\bar x^2} {S_{xx}}\right)$

$\text{Var}(\hat \beta_1) = \sigma^2(1/S_{xx})$

$\text{Cov}(\hat \beta_0,\hat \beta_1) = -\sigma^2(\bar x/S_{xx})$
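As a quick sanity check on the three closed forms above, here is a minimal NumPy sketch that compares them with $\sigma^2(\mathbf X'\mathbf X)^{-1}$ computed directly. The regressor values in `x` and the value of `sigma2` are arbitrary assumptions chosen for illustration; they do not come from the original post.

```python
import numpy as np

# Hypothetical regressor values and error variance (illustration only).
x = np.array([1.0, 2.0, 4.0, 7.0, 11.0])
sigma2 = 2.5
n = x.size

# Design matrix with an intercept column, then sigma^2 (X'X)^{-1}.
X = np.column_stack([np.ones(n), x])
cov_beta = sigma2 * np.linalg.inv(X.T @ X)

# Closed-form entries from the derivation.
x_bar = x.mean()
S_xx = np.sum((x - x_bar) ** 2)
var_b0 = sigma2 * (1.0 / n + x_bar**2 / S_xx)
var_b1 = sigma2 / S_xx
cov_b0_b1 = -sigma2 * x_bar / S_xx

closed_form = np.array([[var_b0, cov_b0_b1],
                        [cov_b0_b1, var_b1]])

# The two matrices should agree up to floating-point rounding.
print(np.allclose(cov_beta, closed_form))
```

Any non-constant `x` should give the same agreement, since the identity is purely algebraic.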

The $i$-th residual is defined as

$\hat u_i = y_i - \hat y_i = (\beta_0 - \hat \beta_0) + (\beta_1 - \hat \beta_1)x_i +u_i$

The true coefficients are treated as constants, and the regressor is fixed (or we condition on it) and has zero covariance with the error term. The estimators, however, are correlated with the error term, because the estimators contain the dependent variable, and the dependent variable contains the error term. So we have

$\text{Var}(\hat u_i) = \Big[\text{Var}(u_i)+\text{Var}(\hat \beta_0)+x_i^2\text{Var}(\hat \beta_1)+2x_i\text{Cov}(\hat \beta_0,\hat \beta_1)\Big] + 2\text{Cov}([(\beta_0 - \hat \beta_0) + (\beta_1 - \hat \beta_1)x_i],u_i)$

$=\Big[\sigma^2 + \sigma^2\left(\frac 1n + \frac{\bar x^2} {S_{xx}}\right) + x_i^2\sigma^2(1/S_{xx}) - 2x_i\sigma^2(\bar x/S_{xx})\Big] +2\text{Cov}([(\beta_0 - \hat \beta_0) + (\beta_1 - \hat \beta_1)x_i],u_i)$

Tidying this up a bit (note that $\bar x^2 - 2x_i\bar x + x_i^2 = (x_i-\bar x)^2$), we obtain

$\text{Var}(\hat u_i)=\left[\sigma^2\cdot \left(1 + \frac 1n + \frac {(x_i-\bar x)^2}{S_{xx}}\right)\right]+ 2\text{Cov}([(\beta_0 - \hat \beta_0) + (\beta_1 - \hat \beta_1)x_i],u_i)$

The term in the large brackets has exactly the same structure as the variance of the prediction error, the only change being that $x^0$ appears in place of $x_i$ (and the variance is that of $e^0$, not of $\hat u_i$). The last covariance term is zero for the prediction error, because $y^0$, and hence $u^0$, is not included in the estimators. It is not zero for the estimation error, because $y_i$, and hence $u_i$, is part of the sample and so is included in the estimators. We have

$2\text{Cov}([(\beta_0 - \hat \beta_0) + (\beta_1 - \hat \beta_1)x_i],u_i) = 2E\left([(\beta_0 - \hat \beta_0) + (\beta_1 - \hat \beta_1)x_i]u_i\right)$

$=-2E\left(\hat \beta_0u_i\right)-2x_iE\left(\hat \beta_1u_i\right) = -2E\left([\bar y -\hat \beta_1 \bar x]u_i\right)-2x_iE\left(\hat \beta_1u_i\right)$

the last substitution from how $\hat \beta_0$ is calculated. Continuing,

$\dots=-2E(\bar yu_i) -2(x_i-\bar x)E\left(\hat \beta_1u_i\right) = -2\frac {\sigma^2}{n} -2(x_i-\bar x)E\left[\frac {\sum_j(x_j-\bar x)(y_j-\bar y)}{S_{xx}}u_i\right]$

$=-2\frac {\sigma^2}{n} -2\frac {(x_i-\bar x)}{S_{xx}}\left[ \sum_j(x_j-\bar x)E(y_ju_i-\bar yu_i)\right]$

Since the errors are uncorrelated, $E(y_ju_i)=\sigma^2$ for $j=i$ and $0$ otherwise, while $E(\bar yu_i)=\sigma^2/n$, so

$=-2\frac {\sigma^2}{n} -2\frac {(x_i-\bar x)}{S_{xx}}\left[ -\frac {\sigma^2}{n}\sum_{j\neq i}(x_j-\bar x) + (x_i-\bar x)\sigma^2\left(1-\frac 1n\right)\right]$

$=-2\frac {\sigma^2}{n}-2\frac {(x_i-\bar x)}{S_{xx}}\left[ -\frac {\sigma^2}{n}\sum_j(x_j-\bar x) + (x_i-\bar x)\sigma^2\right]$

$=-2\frac {\sigma^2}{n}-2\frac {(x_i-\bar x)}{S_{xx}}\left[ 0 + (x_i-\bar x)\sigma^2\right] = -2\frac {\sigma^2}{n}-2\sigma^2\frac {(x_i-\bar x)^2}{S_{xx}}$

Inserting this into the expression for the variance of the residual, we obtain

$\text{Var}(\hat u_i)=\sigma^2\cdot \left(1 - \frac 1n - \frac {(x_i-\bar x)^2}{S_{xx}}\right)$
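For readers who prefer to see the result numerically, the following is a rough Monte Carlo sketch of the residual-variance formula. The regressor values, the "true" parameters `beta0`, `beta1`, `sigma`, the normal errors, the seed and the number of replications are all assumptions made for illustration; none of them come from the original post.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, only for reproducibility

# Hypothetical fixed design and true parameters (illustration only).
x = np.array([1.0, 2.0, 4.0, 7.0, 11.0])
n = x.size
beta0, beta1, sigma = 1.0, 0.5, 2.0
reps = 200_000

X = np.column_stack([np.ones(n), x])

# Simulate many samples from the same design: one row per replication.
u = rng.normal(0.0, sigma, size=(reps, n))
y = beta0 + beta1 * x + u

# OLS for all replications at once: solve (X'X) b = X'y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y.T)  # shape (2, reps)
resid = y - (X @ beta_hat).T                    # shape (reps, n)

# Empirical variance of each residual versus the derived formula.
x_bar = x.mean()
S_xx = np.sum((x - x_bar) ** 2)
theory = sigma**2 * (1 - 1 / n - (x - x_bar) ** 2 / S_xx)
simulated = resid.var(axis=0)

print(np.round(theory, 3))
print(np.round(simulated, 3))
```

The two printed vectors should be close to each other, with the smallest variances at the observations furthest from $\bar x$.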

So hats off to the text the OP is using.

(I have skipped some algebraic manipulations, no wonder OLS algebra is taught less and less these days...)

SOME INTUITION

So it appears that what works "against" us (larger variance) when predicting works "for" us (lower variance) when estimating. This is a good starting point for pondering why an excellent fit may be a bad sign for the predictive ability of the model (however counter-intuitive this may sound...).

The fact that we are estimating the expected value of the regressor decreases the variance by $1/n$. Why? Because by estimating, we "close our eyes" to some error variability existing in the sample, since we are essentially estimating an expected value. Moreover, the larger the deviation of an observation of a regressor from the regressor's sample mean, the smaller the variance of the residual associated with this observation will be... the more deviant the observation, the less deviant its residual... It is the variability of the regressors that works for us, by "taking the place" of the unknown error variability.

But that's good for estimation. For prediction, the same things turn against us: now, by not taking into account, however imperfectly, the variability in $y^0$ (since we want to predict it), our imperfect estimators obtained from the sample show their weaknesses: we estimated the sample mean, we don't know the true expected value -the variance increases. We have an $x^0$ that is far away from the sample mean as calculated from the other observations -too bad, our prediction error variance gets another boost, because the predicted $\hat y^0$ will tend to go astray... in more scientific language "optimal predictors in the sense of reduced prediction error variance, represent a shrinkage towards the mean of the variable under prediction". We do not try to replicate the dependent variable's variability -we just try to stay "close to the average".
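The contrast with prediction can be illustrated in the same way. Below is a minimal simulation sketch of the prediction-error variance $\sigma^2\left(1 + \frac 1n + \frac{(x^0-\bar x)^2}{S_{xx}}\right)$; the design, the out-of-sample point `x0`, the true parameters, the normal errors and the seed are hypothetical choices for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)  # fixed seed, only for reproducibility

# Hypothetical design, out-of-sample point and true parameters.
x = np.array([1.0, 2.0, 4.0, 7.0, 11.0])
x0 = 9.0
n = x.size
beta0, beta1, sigma = 1.0, 0.5, 2.0
reps = 200_000

X = np.column_stack([np.ones(n), x])
y = beta0 + beta1 * x + rng.normal(0.0, sigma, size=(reps, n))
beta_hat = np.linalg.solve(X.T @ X, X.T @ y.T)  # shape (2, reps)

# New observation at x0, drawn independently of the estimation sample,
# so its error term is uncorrelated with the estimators.
y0 = beta0 + beta1 * x0 + rng.normal(0.0, sigma, size=reps)
pred_err = y0 - (beta_hat[0] + beta_hat[1] * x0)

x_bar = x.mean()
S_xx = np.sum((x - x_bar) ** 2)
theory_pred = sigma**2 * (1 + 1 / n + (x0 - x_bar) ** 2 / S_xx)
simulated_pred = pred_err.var()

print(round(theory_pred, 3), round(simulated_pred, 3))
```

Here the simulated variance should exceed $\sigma^2$, whereas the in-sample residual variances are all below $\sigma^2$.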

— Alecos Papadopoulos

Thank you for a very clear answer! I'm glad that my "intuition" was correct.

— Eric

Alecos, I really don't think this is right.

— Glen_b -Reinstate Monica

@Alecos the mistake is in taking the parameter estimates to be uncorrelated with the error term. This part:

$\text{Var}(\hat u_i) = \text{Var}(u_i)+\text{Var}(\hat \beta_0)+x_i^2\text{Var}(\hat \beta_1)+2x_i\text{Cov}(\hat \beta_0,\hat \beta_1)$ isn't right.

— Glen_b -Reinstate Monica

@Eric I apologize for misleading you earlier. I have tried to provide some intuition for both formulas.

— Alecos Papadopoulos

+1 You can see why I did the multiple regression case for this... thanks for going to the extra effort of doing the simple-regression case.

— Glen_b -Reinstate Monica

Sorry for the somewhat terse answer, perhaps overly-abstract and lacking a desirable amount of intuitive exposition, but I'll try to come back and add a few more details later. At least it's short.

Given $H=X(X^TX)^{-1}X^T$,

$\begin{aligned} \text{Var}(y-\hat{y})&=\text{Var}((I-H)y)\\ &=(I-H)\text{Var}(y)(I-H)^T\\ &=\sigma^2(I-H)^2\\ &=\sigma^2(I-H) \end{aligned}$

Hence

$\text{Var}(y_i-\hat{y}_i)=\sigma^2(1-h_{ii})$

In the case of simple linear regression ... this gives the answer in your question.
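As a sketch of how the simple-regression case can be filled in, assume the design matrix has rows $(1, x_i)$ and write $S_{xx}=\sum_j (x_j-\bar x)^2$. Using the standard closed form $(X^TX)^{-1} = \frac{1}{S_{xx}}\left[ \begin{matrix} \frac 1n\sum_j x_j^2 & -\bar x\\ -\bar x & 1 \end{matrix}\right]$ for this two-column design, the leverage is

$h_{ii} = \left[ \begin{matrix} 1 & x_i \end{matrix}\right] (X^TX)^{-1} \left[ \begin{matrix} 1 \\ x_i \end{matrix}\right] = \frac{1}{S_{xx}}\left[\frac 1n \sum_j x_j^2 - 2\bar x x_i + x_i^2\right]$

$= \frac{1}{S_{xx}}\left[\frac{S_{xx}}{n} + \bar x^2 - 2\bar x x_i + x_i^2\right] = \frac 1n + \frac{(x_i-\bar x)^2}{S_{xx}}$

where the second line uses $\frac 1n\sum_j x_j^2 = \frac{S_{xx}}{n} + \bar x^2$. Substituting into $\sigma^2(1-h_{ii})$ gives $\sigma^2\left(1-\frac 1n-\frac{(x_i-\bar x)^2}{S_{xx}}\right)$.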

This answer also makes sense: since $\hat{y}_i$ is positively correlated with $y_i$ , the variance of the difference should be smaller than the sum of the variances.

Edit: Explanation of why $(I-H)$ is idempotent.

(i) $H$ is idempotent:

$H^2=X(X^TX)^{-1}X^TX(X^TX)^{-1}X^T$ $= X\ [(X^TX)^{-1}X^TX]\ (X^TX)^{-1}X^T=X(X^TX)^{-1}X^T=H$

(ii) $(I-H)^2= I^2-IH-HI+H^2=I-2H+H=I-H$
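The idempotency argument, and the link between $h_{ii}$ and the simple-regression formula, can also be checked numerically. This is only an illustrative sketch: the regressor values in `x` are an arbitrary assumption, and `M` is just a local name for $I-H$.

```python
import numpy as np

# Hypothetical regressor values (illustration only).
x = np.array([1.0, 2.0, 4.0, 7.0, 11.0])
n = x.size

X = np.column_stack([np.ones(n), x])   # design matrix with intercept
H = X @ np.linalg.inv(X.T @ X) @ X.T   # hat matrix
M = np.eye(n) - H                      # I - H

# Closed-form leverage for simple regression: 1/n + (x_i - x_bar)^2 / S_xx
x_bar = x.mean()
S_xx = np.sum((x - x_bar) ** 2)
h_closed = 1.0 / n + (x - x_bar) ** 2 / S_xx

print(np.allclose(H @ H, H))             # H is idempotent
print(np.allclose(M @ M, M))             # so is I - H
print(np.allclose(np.diag(H), h_closed)) # diagonal matches the closed form
```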

— Glen_b -Reinstate Monica

This is a very nice derivation for its simplicity, although one step that is not clear to me is why $(I-H)^2=(I-H)$. Maybe when you expand on your answer a little, as you're planning to do anyway, you could say a little something about that?

— Jake Westfall

@Jake Added a couple of lines at the end.

— Glen_b -Reinstate Monica