
I have a sample of about 4,000 $r$ (that is, Pearson correlation), $\chi^2$, $t$-, or $F$-tests reported in psychology journals. These tests were drawn randomly from a larger dataset of about 500,000 statistical tests extracted from roughly 32,000 articles in 126 psychology journals.

For each statistical test I have the following data:

  • Test statistic value
  • Category of test statistic ($t$, $F$, $\chi^2$ or $r$)
  • Degrees of freedom (both $df$ values in the case of $F$-tests)
  • Reported $p$-value
  • Whether the reported $p$-value is consistent with the reported test statistic value and $df$ (with inconsistency likely indicating a reporting error)
  • Year of publication (ranging from 1980 to 2019, though with relatively few articles from the early part of that period)
  • Journal name (126 different journal names)
  • Classification of the statistical test as either “central” or “peripheral”

That last point relates to a classification of whether the statistical test was central to the main aims of the article, or whether it was peripheral (e.g. a statistical test done in the course of assumption-checking). These judgments were made by human raters, who have been shown to have decent reliability/validity in relation to this task (Cohen's $\kappa$ of 0.73).

All test statistics were converted to Fisher $Z$-transformed correlation coefficients using the "correlation coefficient per $df$" method, in order that they may be compared.
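
To make the conversion concrete, here is a minimal sketch of the $t$-test case only, assuming the standard $r = t/\sqrt{t^2 + df}$ conversion followed by the usual Fisher $Z$ transform (the full "correlation coefficient per $df$" method, which also covers $F$, $\chi^2$ and $r$, is described in the link given in the comments below):

```r
# Minimal sketch of the t-test case only (not the full conversion pipeline):
# convert a t statistic to a correlation-scale effect size, then Fisher-Z it.
t_to_fisher_z <- function(t, df) {
  r <- t / sqrt(t^2 + df)  # convert t and its df to an r-scale effect size
  atanh(r)                 # Fisher Z transform: 0.5 * log((1 + r) / (1 - r))
}

t_to_fisher_z(t = 2.5, df = 48)
```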

There are two main research questions:

  1. Are reported effect sizes declining over time? A prior analysis (in which no distinction was made between central and peripheral tests) suggests that overall reported effect sizes are slightly declining over time. But we are seeking to confirm this, and also to clarify whether this decline is being driven by tests of "central" hypotheses, or tests of "peripheral" hypotheses, or both.

  2. Are statistical reporting errors more common in central tests, or peripheral tests?

I’d originally planned to address these questions using two multilevel models (a rough code sketch follows this list):

  1. A multilevel regression in which tests are nested inside journals, and the outcome variable is test effect size (the Fisher $Z$-transformed correlation coefficient mentioned earlier). Predictors would be statistic type ($F$, $t$, $\chi^2$, $r$), central/peripheral status, year of publication, and the interaction between central/peripheral status and year of publication.

  2. A multilevel logistic regression in which tests are nested inside journals, and the outcome variable is the probability that the test contains a reporting error. Predictors would be statistic type ($F$, $t$, $\chi^2$, $r$) and central/peripheral status.
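
In lme4-style code, the planned models would look roughly like this (all variable names — fisher_z, stat_type, status, year_c, error, journal, dat — are illustrative placeholders rather than my actual column names):

```r
# Rough sketch of the two originally planned multilevel models using lme4.
library(lme4)

# (1) Effect size over time, with tests nested within journals
m1 <- lmer(fisher_z ~ stat_type + status * year_c + (1 | journal), data = dat)

# (2) Probability of a reporting error, with tests nested within journals
m2 <- glmer(error ~ stat_type + status + (1 | journal),
            family = binomial, data = dat)
```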

It’s been suggested to me that I should instead be doing “a multilevel meta-regression”. This is not a concept I was previously familiar with, but looking at the Cochrane Handbook I read the following:

  “Meta-regressions usually differ from simple regressions in two ways. First, larger studies have more influence on the relationship than smaller studies, since studies are weighted by the precision of their respective effect estimate. Second, it is wise to allow for the residual heterogeneity among intervention effects not modelled by the explanatory variables. This gives rise to the term ‘random-effects meta-regression’, since the extra variability is incorporated in the same way as in a random-effects meta-analysis.”

It wasn’t clear to me that either of those differences should be relevant, given my research questions.

Regarding the first research question (effect sizes over time), I understand that weighting large $N$ studies higher makes sense if I'm interested in the size of the underlying effects being studied by psychologists. However, if I'm only interested in assessing the effect sizes psychologists report over time I don’t see why large $N$ studies should be weighted higher.

Regarding the second research question (statistical reporting errors), I don’t see why large $N$ studies should be weighted higher.
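
For comparison, my understanding is that the suggested multilevel meta-regression would look something like the following metafor sketch, in which each test is weighted by the inverse of its Fisher $Z$ sampling variance (roughly $1/(n-3)$) and residual heterogeneity is absorbed by the random effects (again, the variable names are placeholders, and test_id is a hypothetical unique identifier for each test):

```r
# Sketch of the suggested random-effects multilevel meta-regression in metafor.
# vi holds the sampling variance of each Fisher Z (approximately 1 / (n - 3)).
library(metafor)

m3 <- rma.mv(yi = fisher_z, V = vi,
             mods = ~ stat_type + status * year_c,
             random = ~ 1 | journal / test_id,  # tests nested within journals
             data = dat)
```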

Given my research questions, what analysis should I be doing?

  • I would recommend mvmeta in Stata or R – Giuseppe Biondi-Zoccai Apr 17 '20 at 06:35
  • Would you really want a study of ten people to have the same influence as one of ten thousand? – mdewey Apr 17 '20 at 14:18
  • @mdewey Your question gets to the heart of the matter; I'm wondering exactly that. Re: the first research question (effect sizes over time), I get that weighting large N studies higher makes sense if the meta-analysis is interested in the size of the underlying effects being studied by psychologists. However, if the meta-analysis is only interested in assessing the effect sizes psychologists _report_ over time I don’t see why large N studies should be weighted higher. Re: the second research question (statistical reporting errors), I don’t see why large N studies should be weighted higher. – user1205901 - Reinstate Monica Apr 17 '20 at 22:19
  • If I were attacking this, I would look at using a random forest. I would also look for geospatial migration over time, so affiliated institution of authors would be a column. – EngrStudent Jun 01 '20 at 11:44
  • (+1 for the post) Keeping N as a predictor, you will learn its relationship to your outcomes. Weighting by N, you will more or less assume that it matters for your estimates of interest but you will not learn about the above relationships. – rolando2 Jun 01 '20 at 12:02
  • Meta-regression is a specialization of meta-analysis. Meta-analyses are concerned with summarizing research results on a particular topic across different articles. I'm not sure your research question falls under this domain (of traditional meta-analyses). Given your sample size and research aims, it seems it is more in line with data-mining (albeit with pre-specified hypotheses), and as such I think you have a lot of freedom to choose whatever is the best method for the analyses. If both weighting and not weighting the effect sizes make sense, why not do both? – Tim Mak Jun 02 '20 at 03:41
  • You are studying effect size change over time, but seem to mostly have data on test statistics. These are not really effect sizes. They are values on a distribution used for tests to see if there is an unlikely effect. So do you have the actual effect sizes and their standard errors or variance? – Deathkill14 Sep 13 '20 at 12:26
  • The test statistics have been converted to effect sizes (Fisher-transformed correlation coefficients) as per the "correlation coefficient per $df$" method described [here](https://stats.stackexchange.com/questions/484418/the-correlation-coefficient-per-df-effect-size-measure). I can use $\sqrt{\frac{1}{(n-3)}}$ to derive estimated standard errors for all the test statistics besides the $\chi^2$ statistics and the $F$-statistics with denominator degrees of freedom $> 1$. – user1205901 - Reinstate Monica Sep 13 '20 at 12:37
