What is the distribution of $R^2$ in linear regression under the null hypothesis? Why is its mode not at zero when $k>3$?

What is the distribution of the coefficient of determination, or R squared, $R^2$, in linear univariate multiple regression under the null hypothesis $H_0:\beta=0$?

How does it depend on the number of predictors $k$ and the number of samples $n>k$? Is there a closed-form expression for the mode of this distribution?

In particular, I have a feeling that for simple regression (with one predictor $x$) this distribution has its mode at zero, but for multiple regression the mode is at a non-zero positive value. If this is indeed true, is there an intuitive explanation of this "phase transition"?

Update

As @Alecos showed below, the distribution indeed peaks at zero when $k=2$ and $k=3$, and not at zero when $k>3$. I feel that there should be a geometric view of this phase transition. Consider the geometric view of OLS: $\mathbf y$ is a vector in $\mathbb R^n$, and $\mathbf X$ defines a $k$-dimensional subspace there. OLS amounts to projecting $\mathbf y$ onto this subspace, and $R^2$ is the squared cosine of the angle between $\mathbf y$ and its projection $\hat{\mathbf y}$.

Now, from @Alecos's answer it follows that if all vectors are random, then the probability distribution of this angle peaks at $90^\circ$ for $k=2$ and $k=3$, but has a mode at some other value $<90^\circ$ for $k>3$. Why?!
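The squared-cosine reading of $R^2$ can be checked numerically. The following is only a minimal sketch, assuming the model contains an intercept (so the angle is measured after centring $\mathbf y$ and $\hat{\mathbf y}$); the seed, `n` and `k` are arbitrary illustrative choices, not values from the discussion.

```r
# Sketch: R^2 equals the squared cosine of the angle between the centred
# response and the centred fitted values (intercept assumed in the model).
set.seed(1)                                   # arbitrary seed, illustration only
n <- 20                                       # hypothetical sample size
k <- 4                                        # parameters, intercept included
X <- cbind(1, matrix(rnorm(n * (k - 1)), n))  # design matrix with constant column
y <- rnorm(n)                                 # y unrelated to X, so H0 holds

fit  <- lm.fit(X, y)                          # OLS = projection of y onto col(X)
yhat <- fit$fitted.values                     # the projected vector

yc    <- y - mean(y)                          # centred response
yhatc <- yhat - mean(yhat)                    # centred fitted values

r2   <- 1 - sum(fit$residuals^2) / sum(yc^2)  # usual definition of R^2
cos2 <- sum(yc * yhatc)^2 / (sum(yc^2) * sum(yhatc^2))

angle_deg <- acos(sqrt(cos2)) * 180 / pi      # angle between y and its projection
print(c(r2 = r2, cos2 = cos2, angle_deg = angle_deg))
```

An $R^2$ near zero corresponds to an angle near $90^\circ$, which is why a mode of $R^2$ at zero and a peak of the angle at $90^\circ$ are the same statement.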

Update 2: I am accepting @Alecos's answer, but I still have the feeling that I am missing some important insight here. If anybody ever suggests another (geometric or not) view of this phenomenon that would make it "obvious", I will be happy to offer a bounty.

— amoeba says Reinstate Monica

Are you willing to assume normality of errors?

— Dimitriy V. Masterov

Yes, I guess one has to assume it to make this question answerable (?).

— amoeba says Reinstate Monica

Have you checked this davegiles.blogspot.jp/2013/05/good-old-r-squared.html ?

— Khashaa

@Khashaa: In fact, I have to admit that I found that blogspot page before posting my question here. Honestly, I still wanted to have a discussion of this phenomenon on our forum, so I kind of pretended that I had not seen it.

— amoeba says Reinstate Monica

Closely related CV question stats.stackexchange.com/questions/123651/…

— Alecos Papadopoulos

Answers:

For the specific hypothesis (that all regressor coefficients are zero, not including the constant term, which is not examined in this test) and under normality, we know (see e.g. Maddala 2001, p. 155, but note that there $k$ counts the regressors without the constant term, so the expression looks a bit different) that the statistic

$F = \frac {n-k}{k-1}\frac {R^2}{1-R^2}$

is distributed as a central $F(k-1, n-k)$ random variable.

Note that although we do not test the constant term, $k$ counts it too.

Moving things around,

$(k-1)F - (k-1)FR^2 = (n-k)R^2 \Rightarrow (k-1)F = R^2\big[(n-k) + (k-1)F\big]$

$\Rightarrow R^2 = \frac {(k-1)F}{(n-k) + (k-1)F}$

But the right-hand side is distributed as a Beta distribution, specifically

$R^2 \sim Beta\left (\frac {k-1}{2}, \frac {n-k}{2}\right)$
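A quick Monte Carlo check can make this Beta result tangible. This is a rough sketch under stated assumptions (i.i.d. standard normal regressors and errors, and illustrative values `n = 30`, `k = 5`, `reps = 2000` that do not come from the answer); it compares simulated quantiles of $R^2$ under the null with the quantiles of the claimed Beta distribution.

```r
# Sketch: simulate R^2 under H0 and compare with Beta((k-1)/2, (n-k)/2).
set.seed(42)                 # arbitrary seed
n <- 30                      # hypothetical sample size
k <- 5                       # parameters, intercept included
reps <- 2000                 # number of simulated regressions

sim_r2 <- replicate(reps, {
  X <- cbind(1, matrix(rnorm(n * (k - 1)), n))  # intercept + k-1 irrelevant regressors
  y <- rnorm(n)                                 # response independent of X
  res <- lm.fit(X, y)$residuals
  1 - sum(res^2) / sum((y - mean(y))^2)         # R^2 of this regression
})

a <- (k - 1) / 2             # first Beta shape parameter
b <- (n - k) / 2             # second Beta shape parameter
probs <- c(0.1, 0.25, 0.5, 0.75, 0.9)
comparison <- rbind(simulated   = quantile(sim_r2, probs, names = FALSE),
                    theoretical = qbeta(probs, a, b))
colnames(comparison) <- paste0("q", probs)
print(round(comparison, 3))

# A Kolmogorov-Smirnov test against the Beta CDF is one more informal check.
print(ks.test(sim_r2, "pbeta", a, b))
```

The two rows of `comparison` should agree up to simulation noise; a small KS p-value would argue against the claimed distribution.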

The mode of this distribution is

$\text{mode}R^2 = \frac {\frac {k-1}{2}-1}{\frac {k-1}{2}+ \frac {n-k}{2}-2} =\frac {k-3}{n-5}$
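For readers who want the intermediate step, here is a short worked derivation of the mode formula. It is a sketch that assumes an interior maximum exists, i.e. $\alpha>1$ and $\beta>1$, and uses the standard Beta notation $\alpha$, $\beta$ for the two shape parameters.

```latex
\begin{align*}
f(x) &\propto x^{\alpha-1}(1-x)^{\beta-1},
  \qquad \alpha = \tfrac{k-1}{2},\quad \beta = \tfrac{n-k}{2} \\
\frac{d}{dx}\log f(x) &= \frac{\alpha-1}{x} - \frac{\beta-1}{1-x} = 0 \\
\Rightarrow\ (\alpha-1)(1-x) &= (\beta-1)\,x \\
\Rightarrow\ x^{*} &= \frac{\alpha-1}{\alpha+\beta-2}
  = \frac{(k-3)/2}{(n-5)/2} = \frac{k-3}{n-5}
\end{align*}
```

The last step uses $\alpha+\beta-2 = \frac{n-1}{2}-2 = \frac{n-5}{2}$.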

FINITE & UNIQUE MODE
From the above relation we can infer that for the distribution to have a unique and finite mode we must have

$k\geq 3, n >5$

This is consistent with the general requirement for a Beta distribution, which is

$\{\alpha >1 , \beta \geq 1\},\;\; \text {OR}\;\; \{\alpha \geq1 , \beta > 1\}$

as one can infer from this CV thread or read here.
Note that if $\{\alpha =1 , \beta = 1\}$, we obtain the Uniform distribution, so all the density points are modes (finite but not unique). Which raises the question: why, if $k=3, n=5$, is $R^2$ distributed as $U(0,1)$?

IMPLICATIONS
Assume that you have $k=5$ regressors (including the constant) and $n=99$ observations. Pretty nice regression, no overfitting. Then

$R^2\Big|_{\beta=0} \sim Beta\left (2, 47\right), \text{mode}R^2 = \frac 1{47} \approx 0.021$

and the density plot is

[Figure: density plot of the $Beta(2, 47)$ distribution]

Intuition please: this is the distribution of $R^2$ under the hypothesis that no regressor actually belongs to the regression. So a) the distribution is independent of the regressors, b) as the sample size increases, the distribution concentrates towards zero, because the increased information swamps the small-sample variability that may produce some "fit", but also c) as the number of irrelevant regressors increases for a given sample size, the distribution concentrates towards $1$, and we have the "spurious fit" phenomenon.

But also note how "easy" it is to reject the null hypothesis: in the particular example, the cumulative probability at $R^2=0.13$ has already reached $0.99$, so an obtained $R^2>0.13$ will reject the null of "insignificant regression" at the $1$% significance level.
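The numbers in this example can be reproduced with the Beta functions in base R. The sketch below assumes only the example's own values ($k=5$, $n=99$, threshold $0.13$); the variable names are illustrative. It also shows that the tail probability agrees with the usual overall F-test, since $R^2$ and $F$ are monotone transformations of each other.

```r
# Sketch: the k = 5, n = 99 example evaluated with base R.
k <- 5
n <- 99
a <- (k - 1) / 2                      # alpha = 2
b <- (n - k) / 2                      # beta  = 47

mode_r2    <- (k - 3) / (n - 5)       # closed-form mode, 1/47
cdf_at_013 <- pbeta(0.13, a, b)       # P(R^2 <= 0.13) under H0, close to 0.99
crit_1pct  <- qbeta(0.99, a, b)       # exact 1% critical value for R^2

# Same tail probability through the F statistic implied by R^2 = 0.13.
f_stat  <- (n - k) / (k - 1) * 0.13 / (1 - 0.13)
p_value <- pf(f_stat, k - 1, n - k, lower.tail = FALSE)

print(c(mode = mode_r2, cdf_at_0.13 = cdf_at_013,
        crit_1pct = crit_1pct, p_value = p_value))
```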

ADDENDUM
To respond to the new issue regarding the mode of the $R^2$ distribution, I can offer the following (non-geometric) line of thought, which links it to the "spurious fit" phenomenon: when we run least squares on a data set, we essentially solve a system of $n$ linear equations with $k$ unknowns (the only difference from high-school math is that back then we called "known coefficients" what in linear regression we call "variables/regressors", "unknown x" what we now call "unknown coefficients", and "constant terms" what we now call the "dependent variable"). As long as $k<n$ the system is over-identified and there is no exact solution, only an approximate one, and the difference emerges as "unexplained variance of the dependent variable", which is captured by $1-R^2$. If $k=n$ the system has one exact solution (assuming linear independence). In between, as we increase $k$, we reduce the "degree of over-identification" of the system and we "move towards" the single exact solution. Under this view, it makes sense why $R^2$ increases spuriously with the addition of irrelevant regressors, and consequently why its mode moves gradually towards $1$ as $k$ increases for given $n$.
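The "moving towards the exact solution" picture is easy to watch in a toy experiment. This is a minimal sketch, assuming pure-noise data, an arbitrary seed and a hypothetical sample size `n = 10`: irrelevant regressors are added one at a time until $k=n$, where the system is exactly solvable and $R^2$ reaches 1.

```r
# Sketch: spurious fit as irrelevant regressors are added for fixed n.
set.seed(7)                               # arbitrary seed
n   <- 10                                 # hypothetical sample size
y   <- rnorm(n)                           # response, pure noise
Z   <- matrix(rnorm(n * (n - 1)), n)      # n-1 irrelevant regressors
tss <- sum((y - mean(y))^2)               # total sum of squares

# k counts the intercept, so k = 2 means intercept plus one regressor.
r2_by_k <- sapply(2:n, function(k) {
  X <- cbind(1, Z[, seq_len(k - 1), drop = FALSE])
  1 - sum(lm.fit(X, y)$residuals^2) / tss
})
names(r2_by_k) <- paste0("k=", 2:n)

null_mean <- (2:n - 1) / (n - 1)          # mean of the null Beta, (k-1)/(n-1)
print(round(rbind(observed_r2 = r2_by_k, null_mean = null_mean), 3))
```

Because the models are nested, the observed $R^2$ can only go up with each added column, and it hits 1 at $k=n$.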

— Alecos Papadopoulos

It's mathematical. For $k=2$ the first parameter of the Beta distribution (the "$\alpha$" in standard notation) becomes smaller than unity. In that case the Beta distribution has no finite mode. Play around with keisan.casio.com/exec/system/1180573226 to see how the shapes change.

— Alecos Papadopoulos

@Alecos Excellent answer! (+1) Can I strongly suggest you add to your answer the requirement for the mode to exist? This is usually stated as $\alpha>1$ and $\beta>1$, but more subtly, it's okay if equality holds in one of the two ... I think for our purposes this becomes $k \geq 3$ and $n \geq k + 2$, with at least one of these inequalities strict.

— Silverfish

@Khashaa Except when theory requires it, I never exclude the intercept from the regression. It is the average level of the dependent variable, regressors or no regressors (and this level is usually positive, so omitting it would be a foolish self-inflicted misspecification). But I always exclude it from the F-test of the regression, since what I care about is not whether the dependent variable has a non-zero unconditional mean, but whether the regressors have any explanatory power as regards deviations from this mean.

— Alecos Papadopoulos

+1! Are there results for the distribution of $R^2$ for nonzero $\beta_j$?

— Christoph Hanck

@ChristophHanck See also davegiles.blogspot.jp/2013/05/good-old-r-squared.html

— Alecos Papadopoulos

I won't rederive the $\mathrm{Beta}(\frac{k-1}{2}, \, \frac{n-k}{2})$ distribution in @Alecos's excellent answer (it's a standard result, see here for another nice discussion) but I want to fill in more details about the consequences! Firstly, what does the null distribution of $R^2$ look like for a range of values of $n$ and $k$? The graph in @Alecos's answer is quite representative of what occurs in practical multiple regressions, but sometimes insight is gleaned more easily from smaller cases. I've included the mean, mode (where it exists) and standard deviation. The graph/table deserves a good eyeball: best viewed at full-size. I could have included fewer facets but the pattern would have been less clear; I have appended R code so that readers can experiment with different subsets of $n$ and $k$.

[Figure: Distribution of $R^2$ for small sample sizes]

Values of shape parameters

The graph's colour scheme indicates whether each shape parameter is less than one (red), equal to one (blue), or more than one (green). The left-hand side shows the value of $\alpha$ while $\beta$ is on the right. Since $\alpha = \frac{k-1}{2}$ , its value increases in arithmetic progression by a common difference of $\frac{1}{2}$ as we move right from column to column (add a regressor to our model) whereas, for fixed $n$ , $\beta = \frac{n-k}{2}$ decreases by $\frac{1}{2}$ . The total $\alpha + \beta = \frac{n-1}{2}$ is fixed for each row (for a given sample size). If instead we fix $k$ and move down the column (increase sample size by 1), then $\alpha$ stays constant and $\beta$ increases by $\frac{1}{2}$ . In regression terms, $\alpha$ is half the number of regressors included in the model, and $\beta$ is half the residual degrees of freedom. To determine the shape of the distribution we are particularly interested in where $\alpha$ or $\beta$ equal one.

The algebra is straightforward for $\alpha$ : we have $\frac{k-1}{2}=1$ so $k=3$ . This is indeed the only column of the facet plot that's filled blue on the left. Similarly $\alpha < 1$ for $k<3$ (the $k=2$ column is red on the left) and $\alpha > 1$ for $k>3$ (from the $k=4$ column onwards, the left side is green).

For $\beta=1$ we have $\frac{n-k}{2}=1$ hence $k=n-2$. Note how these cases (marked with a blue right-hand side) cut a diagonal line across the facet plot. For $\beta > 1$ we obtain $k < n - 2$ (the graphs with a green right side lie to the left of the diagonal line). For $\beta < 1$ we need $k > n - 2$, which involves only the right-most cases on my graph: at $n=k$ we have $\beta=0$ and the distribution is degenerate, but $n=k+1$ where $\beta = \frac{1}{2}$ is plotted (right side in red).

Since the PDF is $f(x;\,\alpha,\,\beta) \propto x^{\alpha-1} (1-x)^{\beta-1}$ , it is clear that if (and only if) $\alpha<1$ then $f(x) \to \infty$ as $x \to 0$ . We can see this in the graph: when the left side is shaded red, observe the behaviour at 0. Similarly when $\beta<1$ then $f(x) \to \infty$ as $x \to 1$ . Look where the right side is red!
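The colour coding boils down to comparing each shape parameter with 1, which a few lines of R can express. This is a sketch only: `beta_shape` is a hypothetical helper written for this illustration, not part of the plotting code, and it covers the non-degenerate cases $n>k$.

```r
# Sketch: classify the end-point behaviour of the null density of R^2.
# beta_shape() is a hypothetical helper, named here for illustration only.
beta_shape <- function(k, n) {
  stopifnot(k >= 2, n > k)            # non-degenerate cases only
  a <- (k - 1) / 2                    # alpha, governs behaviour at x = 0
  b <- (n - k) / 2                    # beta, governs behaviour at x = 1
  at_end <- function(p) {
    if (p < 1) "diverges" else if (p == 1) "finite, non-zero" else "zero"
  }
  list(alpha = a, beta = b, at_zero = at_end(a), at_one = at_end(b))
}

str(beta_shape(2, 3))    # red/red: density diverges at both ends
str(beta_shape(3, 5))    # blue/blue: flat (uniform) density
str(beta_shape(5, 99))   # green/green: density vanishes at both ends
```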

Symmetries

One of the most eye-catching features of the graph is the level of symmetry, but when the Beta distribution is involved, this shouldn't be surprising!

The Beta distribution itself is symmetric if $\alpha = \beta$ . For us this occurs if $n = 2k-1$ which correctly identifies the panels $(k=2, n=3)$ , $(k=3, n=5)$ , $(k=4, n=7)$ and $(k=5, n=9)$ . The extent to which the distribution is symmetric across $R^2 = 0.5$ depends on how many regressor variables we include in the model for that sample size. If $k = \frac{n+1}{2}$ the distribution of $R^2$ is perfectly symmetric about 0.5; if we include fewer variables than that it becomes increasingly asymmetric and the bulk of the probability mass shifts closer to $R^2 = 0$ ; if we include more variables then it shifts closer to $R^2 = 1$ . Remember that $k$ includes the intercept in its count, and that we are working under the null, so the regressor variables should have coefficient zero in the correctly specified model.

There is also an obvious symmetry between distributions for any given $n$, i.e. any row in the facet grid. For example, compare $(k=3, n=9)$ with $(k=7, n=9)$. What's causing this? Recall that the distribution of $\mathrm{Beta}(\alpha, \beta)$ is the mirror image of $\mathrm{Beta}(\beta, \alpha)$ across $x=0.5$. Now we had $\alpha_{k,n} = \frac{k-1}{2}$ and $\beta_{k,n} = \frac{n-k}{2}$. Consider $k'=n-k+1$ and we find:

$\alpha_{k',n} = \frac{(n-k+1)-1}{2} = \frac{n-k}{2} = \beta_{k,n}$

$\beta_{k',n} = \frac{n-(n-k+1)}{2} = \frac{k-1}{2} = \alpha_{k,n}$

So this explains the symmetry as we vary the number of regressors in the model for a fixed sample size. It also explains the distributions that are themselves symmetric as a special case: for them, $k' = k$ so they are obliged to be symmetric with themselves!
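The mirror relation can be confirmed numerically for the pair mentioned above. A small sketch, where `r2_null_density` is a hypothetical wrapper around `dbeta` introduced just for this check:

```r
# Sketch: density for (k, n) equals the reflected density for (n - k + 1, n).
# r2_null_density() is a hypothetical wrapper, not part of the plotting code.
r2_null_density <- function(x, k, n) dbeta(x, (k - 1) / 2, (n - k) / 2)

n <- 9
k <- 3
k_mirror <- n - k + 1                       # the partner column, here 7

x        <- seq(0.05, 0.95, by = 0.05)      # interior grid, avoids end points
d_k      <- r2_null_density(x, k, n)
d_mirror <- r2_null_density(1 - x, k_mirror, n)

max_gap <- max(abs(d_k - d_mirror))         # should be zero up to rounding
print(max_gap)
```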

This tells us something we might not have guessed about multiple regression: for a given sample size $n$ , and assuming no regressors have a genuine relationship with $Y$ , the $R^2$ for a model using $k-1$ regressors plus an intercept has the same distribution as $1 - R^2$ does for a model with $k-1$ residual degrees of freedom remaining.

Special distributions

When $k=n$ we have $\beta=0$ , which isn't a valid parameter. However, as $\beta \to 0$ the distribution becomes degenerate with a spike such that $\mathsf{P}(R^2 = 1)=1$ . This is consistent with what we know about a model with as many parameters as data points - it achieves perfect fit. I haven't drawn the degenerate distribution on my graph but did include the mean, mode and standard deviation.

When $k=2$ and $n=3$ we obtain $\mathrm{Beta}(\frac{1}{2}, \, \frac{1}{2})$ which is the arcsine distribution. This is symmetric (since $\alpha = \beta$ ) and bimodal (0 and 1). Since this is the only case where both $\alpha < 1$ and $\beta < 1$ (marked red on both sides), it is our only distribution which goes to infinity at both ends of the support.

The $\mathrm{Beta}(1, \, 1)$ distribution is the only Beta distribution to be rectangular (uniform). All values of $R^2$ from 0 to 1 are equally likely. The only combination of $k$ and $n$ for which $\alpha = \beta =1$ occurs is $k=3$ and $n=5$ (marked blue on both sides).

The previous special cases are of limited applicability but the case $\alpha > 1$ and $\beta=1$ (green on left, blue on right) is important. Now $f(x;\,\alpha,\,\beta) \propto x^{\alpha-1} (1-x)^{\beta-1} = x^{\alpha-1}$ so we have a power-law distribution on [0, 1]. Of course it's unlikely we'd perform a regression with $k=n-2$ and $k>3$ , which is when this situation occurs. But by the previous symmetry argument, or some trivial algebra on the PDF, when $k=3$ and $n > 5$ , which is the frequent procedure of multiple regression with two regressors and an intercept on a non-trivial sample size, $R^2$ will follow a reflected power law distribution on [0, 1] under $H_0$ . This corresponds to $\alpha=1$ and $\beta>1$ so is marked blue on left, green on right.

You may also have noticed the triangular distributions at $(k=5,n=7)$ and its reflection $(k=3,n=7)$ . We can recognise from their $\alpha$ and $\beta$ that these are just special cases of the power-law and reflected power-law distributions where the power is $2-1=1$ .
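Each of these special cases has a simple closed-form density, so they can be compared directly against `dbeta`. The sketch below evaluates on an interior grid only; the variable names are illustrative.

```r
# Sketch: special cases of the null distribution versus their closed forms.
x <- seq(0.1, 0.9, by = 0.1)                 # interior grid

# k = 2, n = 3: arcsine distribution, Beta(1/2, 1/2)
arcsine_gap  <- max(abs(dbeta(x, 0.5, 0.5) - 1 / (pi * sqrt(x * (1 - x)))))

# k = 3, n = 5: uniform distribution, Beta(1, 1)
uniform_gap  <- max(abs(dbeta(x, 1, 1) - 1))

# k = 5, n = 7: rising triangular (power law with power 1), Beta(2, 1)
tri_up_gap   <- max(abs(dbeta(x, 2, 1) - 2 * x))

# k = 3, n = 7: falling triangular (reflected power law), Beta(1, 2)
tri_down_gap <- max(abs(dbeta(x, 1, 2) - 2 * (1 - x)))

print(c(arcsine = arcsine_gap, uniform = uniform_gap,
        tri_up = tri_up_gap, tri_down = tri_down_gap))
```

All four gaps should be zero up to floating-point rounding.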

Mode

If $\alpha>1$ and $\beta>1$, all green in the plot, $f(x; \, \alpha, \, \beta)$ is unimodal with $f(0)=f(1)=0$, and the Beta distribution has a unique mode $\frac{\alpha-1}{\alpha+\beta-2}$. Putting these in terms of $k$ and $n$, the condition becomes $k>3$ and $n>k+2$ while the mode is $\frac{k-3}{n-5}$.

All other cases have been dealt with above. If we relax the inequality to allow $\beta=1$ , then we include the (green-blue) power-law distributions with $k=n-2$ and $k>3$ (equivalently, $n>5$ ). These cases clearly have mode 1, which actually agrees with the previous formula since $\frac{(n-2)-3}{n-5}=1$ . If instead we allowed $\alpha=1$ but still demanded $\beta>1$ , we'd find the (blue-green) reflected power-law distributions with $k=3$ and $n>5$ . Their mode is 0, which agrees with $\frac{3-3}{n-5}=0$ . However, if we relaxed both inequalities simultaneously to allow $\alpha=\beta=1$ , we'd find the (all blue) uniform distribution with $k=3$ and $n=5$ , which does not have a unique mode. Moreover the previous formula can't be applied in this case, since it would return the indeterminate form $\frac{3-3}{5-5}=\frac{0}{0}$ .

When $n=k$ we get a degenerate distribution with mode 1. When $\beta < 1$ (in regression terms, $n=k+1$ so there is only one residual degree of freedom) then $f(x) \to \infty$ as $x \to 1$, and when $\alpha < 1$ (in regression terms, $k=2$ so a simple linear model with intercept and one regressor) then $f(x) \to \infty$ as $x \to 0$. These would be unique modes except in the unusual case where $k=2$ and $n=3$ (fitting a simple linear model to three points) which is bimodal at 0 and 1.
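The whole case analysis for the mode fits in one small function. This is a sketch under the convention used in this answer (a point where the density diverges is counted as a mode); `r2_null_modes` is a hypothetical name, and it returns `NA` for the uniform case, where no unique mode exists.

```r
# Sketch: mode(s) of R^2 under H0 as a function of k (intercept included) and n.
# r2_null_modes() is a hypothetical helper, not part of the plotting code.
r2_null_modes <- function(k, n) {
  stopifnot(k >= 2, n >= k)
  a <- (k - 1) / 2
  b <- (n - k) / 2
  if (n == k) return(1)                     # degenerate spike at R^2 = 1
  if (a == 1 && b == 1) return(NA_real_)    # uniform: no unique mode
  if (a < 1 && b < 1) return(c(0, 1))       # arcsine case: bimodal
  if (a < 1) return(0)                      # density diverges at 0
  if (b < 1) return(1)                      # density diverges at 1
  (a - 1) / (a + b - 2)                     # equals (k - 3) / (n - 5)
}

r2_null_modes(5, 99)   # interior mode, 1/47
r2_null_modes(3, 10)   # reflected power law, mode at 0
r2_null_modes(8, 10)   # power law, mode at 1
r2_null_modes(2, 3)    # both end points
```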

Mean

The question asked about the mode, but the mean of $R^2$ under the null is also interesting - it has the remarkably simple form $\frac{k-1}{n-1}$ . For a fixed sample size it increases in arithmetic progression as more regressors are added to the model, until the mean value is 1 when $k=n$ . The mean of a Beta distribution is $\frac{\alpha}{\alpha+\beta}$ so such an arithmetic progression was inevitable from our earlier observation that, for fixed $n$ , the sum $\alpha+\beta$ is constant but $\alpha$ increases by 0.5 for each regressor added to the model.

$\frac{\alpha}{\alpha+\beta} = \frac{(k-1)/2}{(k-1)/2 + (n-k)/2} = \frac{k-1}{n-1}$
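The arithmetic progression of the mean is easy to tabulate, and one entry can be cross-checked by numerical integration. A minimal sketch, assuming an illustrative sample size `n = 9` (one row of the facet grid):

```r
# Sketch: mean of R^2 under H0 for fixed n, as k runs from 2 to n.
n <- 9                                   # illustrative sample size
k <- 2:n                                 # parameters, intercept included
a <- (k - 1) / 2
b <- (n - k) / 2

mean_r2 <- a / (a + b)                   # equals (k - 1) / (n - 1)
steps   <- diff(mean_r2)                 # constant step of 1 / (n - 1)
names(mean_r2) <- paste0("k=", k)
print(round(mean_r2, 3))

# Cross-check k = 4 (alpha = 1.5, beta = 2.5) by integrating x * f(x).
num_mean <- integrate(function(x) x * dbeta(x, 1.5, 2.5), 0, 1)$value
print(c(formula = 3 / 8, numerical = num_mean))
```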

Code for plots

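library(grid)     # unit(), used for the panel spacing
library(dplyr)    # mutate()
library(ggplot2)  # ggplot(), geom_line(), facet_grid(), theme()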

nlist <- 3:9 #change here which n to plot
klist <- 2:8 #change here which k to plot

totaln <- length(nlist)
totalk <- length(klist)

df <- data.frame(
    x = rep(seq(0, 1, length.out = 100), times = totaln * totalk),
    k = rep(klist, times = totaln, each = 100),
    n = rep(nlist, each = totalk * 100)
)

df <- mutate(df,
    kname = paste("k =", k),
    nname = paste("n =", n),
    a = (k-1)/2,
    b = (n-k)/2,
    density = dbeta(x, (k-1)/2, (n-k)/2),
    groupcol = ifelse(x < 0.5, 
        ifelse(a < 1, "below 1", ifelse(a ==1, "equals 1", "more than 1")),
        ifelse(b < 1, "below 1", ifelse(b ==1, "equals 1", "more than 1")))
)

g <- ggplot(df, aes(x, density)) +
    geom_line(size=0.8) + geom_area(aes(group=groupcol, fill=groupcol)) +
    scale_fill_brewer(palette="Set1") +
    facet_grid(nname ~ kname)  + 
    ylab("probability density") + theme_bw() + 
    labs(x = expression(R^{2}), fill = expression(alpha~(left)~beta~(right))) +
    theme(panel.spacing = unit(0.6, "lines"),  # called panel.margin before ggplot2 2.2.0
        legend.title=element_text(size=20),
        legend.text=element_text(size=20), 
        legend.background = element_rect(colour = "black"),
        legend.position = c(1, 1), legend.justification = c(1, 1))


df2 <- data.frame(
    k = rep(klist, times = totaln),
    n = rep(nlist, each = totalk),
    x = 0.5,
    ymean = 7.5,
    ymode = 5,
    ysd = 2.5
)

df2 <- mutate(df2,
    kname = paste("k =", k),
    nname = paste("n =", n),
    a = (k-1)/2,
    b = (n-k)/2,
    meanR2 = ifelse(k > n, NaN, a/(a+b)),
    modeR2 = ifelse((a>1 & b>=1) | (a>=1 & b>1), (a-1)/(a+b-2), 
        ifelse(a<1 & b>=1 & n>=k, 0, ifelse(a>=1 & b<1 & n>=k, 1, NaN))),
    sdR2 = ifelse(k > n, NaN, sqrt(a*b/((a+b)^2 * (a+b+1)))),
    meantext = ifelse(is.nan(meanR2), "", paste("Mean =", round(meanR2,3))),
    modetext = ifelse(is.nan(modeR2), "", paste("Mode =", round(modeR2,3))),
    sdtext = ifelse(is.nan(sdR2), "", paste("SD =", round(sdR2,3)))
)

g <- g + geom_text(data=df2, aes(x, ymean, label=meantext)) +
    geom_text(data=df2, aes(x, ymode, label=modetext)) +
    geom_text(data=df2, aes(x, ysd, label=sdtext))
print(g)

— Silverfish

Really illuminating visualization. +1

— Khashaa

Great addition, +1, thanks. I noticed that you call $0$ a mode when the distribution goes to $+\infty$ when $x\to 0$ (and nowhere else) -- something @Alecos above (in the comments) did not want to do. I agree with you: it is convenient.

— amoeba says Reinstate Monica

@amoeba from the graphs we'd like to say "values around 0 are most likely" (or 1). But the answer of Alecos is also both self-consistent and consistent with many authorities (people differ on what to do about the 0 and 1 full stop, let alone whether they can count as a mode!). My approach to the mode differs from Alecos mostly because I use conditions on alpha and beta to determine where the formula is applicable, rather than taking my starting point as the formula and seeing which k and n give sensible answers.

— Silverfish

(+1), this is a very meaty answer. By keeping $k$ too close to $n$ and both small, the question studies in detail, and so decisively, the case of really small samples with relatively too many and irrelevant regressors.

— Alecos Papadopoulos

@amoeba You probably noticed that this answer furnishes an algebraic answer for why, for sufficiently large $n$, the mode of the distribution is 0 for $k=3$ but positive for $k>3$. Since $f(x) \propto x^{(k-3)/2}(1-x)^{(n-k-2)/2}$ then for $k=3$ we have $f(x) \propto (1-x)^{(n-5)/2}$ which will clearly have mode at 0 for $n>5$, whereas for $k=4$ we have $f(x) \propto x^{1/2}(1-x)^{(n-6)/2}$ whose maximum can be found by calculus to be the quoted mode formula. As $k$ increases, the power of $x$ rises by 0.5 each time. It's this $x^{\alpha-1}$ factor which makes $f(0)=0$ so kills the mode at 0.

— Silverfish