Can a meta-analysis of studies which are all "not statistically significant" lead to a "significant" conclusion?


29

A meta-analysis includes a number of studies, all of which reported a P value greater than 0.05. Can the overall meta-analysis report a P value of less than 0.05? Under what circumstances?

(I am fairly sure the answer is yes, but I would like a reference or explanation.)


1
I don't know much about meta-analysis, but I was under the impression that it is not a hypothesis test, just an estimate of the population effect, in which case there is no notion of significance.
Kodiologist

1
Well, a meta-analysis is, at the end of the day, just a weighted average, and you can certainly construct a hypothesis test for that weighted average. See, for example, Borenstein, Michael, et al. "A basic introduction to fixed-effect and random-effects models for meta-analysis." Research Synthesis Methods 1.2 (2010): 97-111.
Boscovich

1
The other answers are good too, but here is a simple case: two studies are each significant at the 0.9 confidence level but not at 0.95. The probability that two independent studies would both reach the 0.9 level by chance is only 0.1 × 0.1 = 0.01, so your meta-analysis is significant at the 0.99 level.
barrycarter

2
Take the limit: no single measurement can provide enough evidence for/against a (nontrivial) hypothesis to yield a small p-value, but a sufficiently large collection of measurements can.
Eric Towers

p-values indicate neither a "statistically significant" nor a non-significant effect. What are we to understand by a "significant" conclusion? Is it a meta-analytic conclusion?
Subhash C. Davar

Answers:


31

In theory, yes...

The results of individual studies may be insignificant but viewed together, the results may be significant.

In theory you can proceed by treating the result $y_i$ of study $i$ like any other random variable.

Let $y_i$ be some random variable (e.g., the estimate from study $i$). Then if the $y_i$ are independent and $\mathrm{E}[y_i] = \mu$, you can consistently estimate the mean with:

$$\hat{\mu} = \frac{1}{n} \sum_i y_i$$

Adding more assumptions, let $\sigma_i^2$ be the variance of estimate $y_i$. Then you can efficiently estimate $\mu$ with inverse variance weighting:

$$\hat{\mu} = \sum_i w_i y_i, \qquad w_i = \frac{1/\sigma_i^2}{\sum_j 1/\sigma_j^2}$$

In either of these cases, $\hat{\mu}$ may be statistically significant at some confidence level even if the individual estimates are not (see the sketch below).
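As a concrete illustration, here is a minimal R sketch (all numbers invented for illustration) in which no single simulated study is typically significant at the 5% level, yet the inverse-variance-weighted pooled estimate is:

set.seed(1)
N  <- 10                                  # number of studies
n  <- 30                                  # observations per study
yi <- replicate(N, mean(rnorm(n, mean = 0.2, sd = 1)))  # study estimates of a true effect of 0.2
si <- rep(1 / sqrt(n), N)                 # standard error of each estimate

# Per-study z-tests: typically none fall below 0.05
round(2 * pnorm(-abs(yi / si)), 3)

# Inverse variance weighted pooled estimate and its standard error
wi     <- (1 / si^2) / sum(1 / si^2)
mu_hat <- sum(wi * yi)
se_hat <- sqrt(1 / sum(1 / si^2))
2 * pnorm(-abs(mu_hat / se_hat))          # pooled p-value, typically well below 0.05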

BUT there may be big problems, issues to be cognizant of...

  1. If $\mathrm{E}[y_i] \neq \mu$, then the meta-analysis may not converge to $\mu$ (i.e., the mean of the meta-analysis is an inconsistent estimator).

    For example, if there's a bias against publishing negative results, this simple meta-analysis may be horribly inconsistent and biased! It would be like estimating the probability that a coin flip lands heads by only observing the flips where it didn't land tails!

  2. If $y_i$ and $y_j$ were based upon the same data for studies $i \neq j$, then treating $y_i$ and $y_j$ as independent in the meta-analysis may vastly underestimate the standard errors and overstate statistical significance. Your estimates would still be consistent, but the standard errors need to reasonably account for cross-correlation in the studies.

  3. Combining (1) and (2) can be especially bad.

    For example, the meta-analysis of averaging polls together tends to be more accurate than any individual poll. But averaging polls together is still vulnerable to correlated error. Something that has come up in past elections is that young exit poll workers may tend to interview other young people rather than old people. If all the exit polls make the same error, then you have a bad estimate which you may think is a good estimate (the exit polls are correlated because they use the same approach to conduct exit polls and this approach generates the same error).

Undoubtedly people more familiar with meta-analysis may come up with better examples, more nuanced issues, more sophisticated estimation techniques, etc., but this gets at some of the most basic theory and some of the bigger problems. If the different studies make independent, random errors, then meta-analysis may be incredibly powerful. If the error is systematic across studies (e.g., everyone undercounts older voters), then the average of the studies will also be off. If you underestimate how correlated the studies or their errors are, you effectively overestimate your aggregate sample size and underestimate your standard errors, as the simulation below illustrates.
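To see point (2) numerically, here is a hedged R sketch (all parameters made up) in which every study shares a common error component; the naive independent-studies standard error is then too small, and a nominal 5% test rejects a true null far too often:

set.seed(2)
reps <- 5000                            # meta-analyses to simulate
N    <- 10                              # studies per meta-analysis
naive_se <- 1 / sqrt(N)                 # SE if the N unit-variance estimates were independent
meta <- replicate(reps, {
  u <- rnorm(1, sd = sqrt(0.25))        # error common to all N studies (e.g., a shared method bias)
  e <- rnorm(N, sd = sqrt(0.75))        # idiosyncratic errors; each estimate still has variance 1
  mean(u + e)                           # meta-analytic average under a true effect of 0
})
sd(meta)                                # true sampling sd: about 0.57, not 1/sqrt(10) = 0.32
mean(abs(meta) / naive_se > 1.96)       # false-positive rate: roughly 25-30%, not the nominal 5%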

There are also all kinds of practical issues, such as keeping definitions consistent across studies.


1
I'm criticizing a meta-analysis for ignoring dependencies between effect sizes (i.e., many effect sizes were based on the same participants, but treated as independent). The authors say no biggie, we are just interested in moderators anyways. I'm making the point you made here: treating them "as independent in the meta-analysis may vastly underestimate the standard errors and overstate statistical significance." Is there a proof/simulation study showing why this is the case? I have lots of references saying that correlated errors mean underestimated SEs... but I don't know why?
Mark White

1
@MarkWhite The basic idea isn't more complicated than $\mathrm{Var}\left(\frac{1}{n}\sum_i X_i\right) = \frac{1}{n^2}\left(\sum_i \mathrm{Var}(X_i) + \sum_{i \neq j} \mathrm{Cov}(X_i, X_j)\right)$. If for all $i$ we have $\mathrm{Var}(X_i) = \sigma^2$ and $\mathrm{Cov}(X_i, X_j) = 0$ for $i \neq j$, then $\mathrm{Var}\left(\frac{1}{n}\sum_i X_i\right) = \frac{\sigma^2}{n}$ and your standard error is $\frac{\sigma}{\sqrt{n}}$. On the other hand, if the covariance terms are positive and big, the standard error is going to be larger.
Matthew Gunn

@MarkWhite I'm not a meta-analysis expert, and I honestly don't know what's a great source for how one should do modern meta-analysis. Conceptually, replicating analysis on the same data is certainly useful (as is intensively studying some subjects), but it's not the same as reproducing a finding on new, independent subjects.
Matthew Gunn

1
Ah, so in words: the total variance of an effect size comes from (a) its variance and (b) its covariance with other effect sizes. If the covariance is 0, then the standard error estimate is fine; but if it covaries with other effect sizes, we need to account for that variance, and ignoring it means we are underestimating the variance. It's like the variance is made up of two parts A and B, and ignoring dependencies assumes the B part is 0 when it is not?
Mark White

1
Also, this looks to be a good source (see especially Box 2): nature.com/neuro/journal/v17/n4/pdf/nn.3648.pdf
Mark White

29

Yes. Suppose you have $N$ p-values from $N$ independent studies.

Fisher's test

(EDIT - in response to @mdewey's useful comment below, it is relevant to distinguish between different meta tests. I spell out the case of another meta test mentioned by mdewey below)

The classical Fisher meta test (see Fisher (1932), "Statistical Methods for Research Workers") uses the statistic
$$F = -2 \sum_{i=1}^N \ln(p_i),$$
which has a $\chi^2_{2N}$ null distribution, as $-2\ln(U) \sim \chi^2_2$ for a uniform r.v. $U$.

Let $\chi^2_{2N}(1-\alpha)$ denote the $(1-\alpha)$-quantile of the null distribution.

Suppose all p-values are equal to $c$, where, possibly, $c > \alpha$. Then $F = -2N\ln(c)$, and $F > \chi^2_{2N}(1-\alpha)$ when
$$c < \exp\left(-\frac{\chi^2_{2N}(1-\alpha)}{2N}\right)$$
For example, for $\alpha = 0.05$ and $N = 20$, the individual p-values only need to be less than
> exp(-qchisq(0.95, df = 40)/40)
[1] 0.2480904

Of course, what the meta statistic tests is "only" the "aggregate" null that all individual nulls are true, which is to be rejected as soon as only one of the $N$ nulls is false.

EDIT:

Here is a plot of the "admissible" p-values against $N$, which confirms that $c$ grows in $N$, although it seems to level off at $c \approx 0.36$.

[Figure: the maximal common p-value $c$ still yielding a significant Fisher meta test, plotted against $N$.]
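(The plot can be reproduced with a few lines of R, directly from the formula just derived:)

N     <- 1:100
c_max <- exp(-qchisq(0.95, df = 2 * N) / (2 * N))   # maximal common p-value still rejected
plot(N, c_max, type = "l", xlab = "N", ylab = "admissible common p-value c")
abline(h = exp(-1), lty = 2)                        # the limiting value 1/e derived below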

I found an upper bound for the quantiles of the $\chi^2$ distribution,
$$\chi^2_{2N}(1-\alpha) \leq 2N + 2\log(1/\alpha) + 2\sqrt{2N\log(1/\alpha)},$$
suggesting that $\chi^2_{2N}(1-\alpha) = O(N)$, so that $\exp\left(-\frac{\chi^2_{2N}(1-\alpha)}{2N}\right)$ is bounded from above by $\exp(-1)$ as $N \to \infty$. As $\exp(-1) \approx 0.3679$, this bound seems reasonably sharp.
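A quick numerical sanity check of this bound in R (for $\alpha = 0.05$; the column c_max is the admissible common p-value implied by the exact quantile):

alpha <- 0.05
N     <- c(10, 100, 1000)
exact <- qchisq(1 - alpha, df = 2 * N)
bound <- 2 * N + 2 * log(1 / alpha) + 2 * sqrt(2 * N * log(1 / alpha))
round(cbind(N, exact, bound, c_max = exp(-exact / (2 * N))), 4)   # c_max creeps toward 1/e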

Inverse Normal test (Stouffer et al., 1949)

The test statistic is given by

$$Z = \frac{1}{\sqrt{N}} \sum_{i=1}^N \Phi^{-1}(p_i),$$
with $\Phi^{-1}$ the standard normal quantile function. The test rejects for large negative values, viz., if $Z < -1.645$ at $\alpha = 0.05$. Hence, for $p_i = c$, $Z = \sqrt{N}\,\Phi^{-1}(c)$. When $c < 0.5$, $\Phi^{-1}(c) < 0$ and hence $Z \to_p -\infty$ as $N \to \infty$. If $c \geq 0.5$, $Z$ will take values in the acceptance region for any $N$. Hence, a common p-value less than 0.5 is sufficient to produce a rejection of the meta test as $N \to \infty$.

More specifically, $Z < -1.645$ if $c < \Phi(-1.645/\sqrt{N})$, which tends to $\Phi(0) = 0.5$ from below as $N \to \infty$.
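As a small R illustration of this (the values of $c$ and $N$ are arbitrary), the statistic drifts into the rejection region for $c = 0.4$ as $N$ grows, but never for $c = 0.5$:

stouffer_z <- function(p) sum(qnorm(p)) / sqrt(length(p))  # Z = (1/sqrt(N)) * sum of Phi^{-1}(p_i)

for (N in c(5, 50, 500)) {
  cat("N =", N,
      "  Z(c = 0.4):", round(stouffer_z(rep(0.4, N)), 2),
      "  Z(c = 0.5):", round(stouffer_z(rep(0.5, N)), 2), "\n")
}
# Z(c = 0.4) = sqrt(N) * qnorm(0.4) falls below -1.645 once N is large enough;
# Z(c = 0.5) stays at 0 for every N.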


2
+1 and wow! did not expect there to be an upper bound at all, let alone 1/e.
amoeba says Reinstate Monica

Thanks :-). I had not expected one either before I saw the plot...
Christoph Hanck

5
Interestingly, the method due to Fisher is the only one of the commonly used methods which has this property. For most of the others, what you call $F$ increases with $N$ if $c > 0.5$ and decreases otherwise. That applies to Stouffer's method and Edgington's method, as well as methods based on logits and on the mean of p. The various methods which are special cases of Wilkinson's method (minimum p, maximum p, etc.) have different properties again.
mdewey

1
@mdewey, that is interesting indeed; I just picked Fisher's test purely because it came to my mind first. That said, by "only one", do you mean the specific bound 1/e? Your comments, which I try to spell out in my edit, suggest to me that Stouffer's method also has an upper bound, which turns out to be 0.5?
Christoph Hanck

I am not going to have time to go into this for another week, but I think if you have ten studies with p = 0.9 you get an overall p as close to unity as makes no difference. There may be a one- versus two-sided issue here. If you want to look at more material, I have a draft of extra material for my R package metap which you are free to use to expand your answer if you wish.
mdewey

4

The answer to this depends on what method you use for combining p-values. Other answers have considered some of these, but here I focus on one method for which the answer to the original question is no.

The minimum p method, also known as Tippett's method, is usually described in terms of a rejection at the $\alpha$ level of the null hypothesis. Define
$$p_{[1]} \leq p_{[2]} \leq \dots \leq p_{[k]}$$
for the $k$ studies. Tippett's method then evaluates whether
$$p_{[1]} < 1 - (1-\alpha)^{1/k}$$

It is easy to see that, since the $k$th root of a number less than unity is closer to unity, we have $(1-\alpha)^{1/k} > 1-\alpha$, so the right-hand side above is less than $\alpha$; hence the overall result will be non-significant unless $p_{[1]}$ is already less than $\alpha$.

It is possible to work out the overall p-value: for example, if we have ten primary studies each with a p-value of 0.05, so each as close to significant as can be, then the overall p-value is $1 - (1 - 0.05)^{10} \approx 0.40$. The method can be seen as a special case of Wilkinson's method, which uses $p_{[r]}$ for $1 \leq r \leq k$; in fact, for this particular set of primary studies, even $r = 2$ is not significant ($p = 0.09$).
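These two numbers are easy to verify in R from the formulas above:

k <- 10; p <- 0.05
1 - (1 - p)^k                       # Tippett (r = 1): overall p-value, about 0.40
1 - pbinom(1, size = k, prob = p)   # Wilkinson, r = 2: P(at least 2 of 10 p-values <= 0.05), about 0.09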

L. H. C. Tippett's method is described in his book The Methods of Statistics (1st ed., 1931), and Wilkinson's method appears in an article, "A statistical consideration in psychological research".


1
Thanks. But note that most meta-analysis methods combine effect sizes (accounting for any difference in sample size), and do not combine P values.
Harvey Motulsky

@HarveyMotulsky agreed, combining p-values is a last resort, but the OP did tag his question with the combining-p-values tag, so I responded in that spirit
mdewey

I think that your answer is correct.
Subhash C. Davar
Licensed under cc by-sa 3.0 with attribution required.