
I'm interested in ways to detect mistakes in published papers without analyzing the raw data, for example the GRIM test [1]. Here's another, similar one from the blog of one of the GRIM authors. I don't know of any others.

Looking for inconsistencies in reported stats seems attractive, because digging through raw data is difficult and sometimes the data isn't available. It's probably also easier to automate.

Edit: Benford's law, credit to DJohnson. Any others?
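
For concreteness, here's a rough sketch (Python) of the kind of automated check I have in mind: a GRIM-style granularity check on a reported mean, plus a Benford first-digit tally against the expected log10(1 + 1/d) proportions. The function names and example numbers below are mine and purely illustrative, and the GRIM check ignores edge cases around rounding conventions.

```python
import math
from collections import Counter

def grim_consistent(reported_mean, n, decimals=2):
    """GRIM-style check: could a mean reported to `decimals` places have
    come from n integer-valued observations? (Ignores rounding-tie rules.)"""
    target = round(reported_mean, decimals)
    centre = reported_mean * n          # the integer total must be near this
    for total in (math.floor(centre), math.ceil(centre)):
        if abs(round(total / n, decimals) - target) < 1e-9:
            return True
    return False

def benford_first_digits(values):
    """Observed vs. Benford-expected proportions of leading digits 1-9."""
    digits = [int(abs(v) / 10 ** math.floor(math.log10(abs(v))))
              for v in values if v != 0]
    counts = Counter(digits)
    return {d: (counts[d] / len(digits), math.log10(1 + 1 / d))
            for d in range(1, 10)}

# A mean of 5.19 from n = 28 integer responses is impossible:
print(grim_consistent(5.19, n=28))   # False
print(grim_consistent(5.18, n=28))   # True (integer total 145)
```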


[1] Brown, N. J. L., & Heathers, J. A. J. (2016). The GRIM Test: A Simple Technique Detects Numerous Anomalies in the Reporting of Results in Psychology. Social Psychological and Personality Science.

R Greg Stacey
  • There are a few blogs that much more systematically track *inconsistencies* in reported data, e.g., Data Colada (http://datacolada.org/) or Andrew Gelman's blog (andrewgelman.com). Benford's Law is used in accounting to detect nonrandom, i.e., fraudulent, reported numbers. – Mike Hunter Oct 23 '17 at 18:56
  • @R.Greg All you need to reproduce the carrot data at the blog link are two sets of kids -- one group that doesn't take any carrots, and another group that takes (not eats) about double the mean. Sure, that's still a lot of carrots, but nobody needs to be taking more than about 39 carrots, rather than 60 or more. There are a bunch of semi-plausible ways for that to happen. Note that in the third group (the one where the mean number taken was 19) the actual mean number eaten was only about *7*. If they were small carrots, eating say 18 carrots or so may be feasible. – Glen_b Oct 25 '17 at 08:52
  • @Glen_b What do you think about these kinds of tests, in general, not just GRIM and SPRITE? Do people run tests like this in any systematic way (look for inconsistencies in reported stats)? Going into the raw data is really time consuming, and there often just isn't any published data to look at. This seems much simpler - you might even be able to automate it. Is anyone doing that? Have people tried? – R Greg Stacey Oct 25 '17 at 17:25
  • 2
    Calling them 'tests' is a bit much, we're basically just looking at consistency checks -- which we should all be in the habit of carrying out as we read. They might help you find places to focus attention, but I would be very wary of putting much faith in these things as a way of identifying people doing something wrong (it's trivial to avoid such naive detection methods, so you can only catch the really incompetent). In the example [here](https://medium.com/@jamesheathers/the-grim-test-a-method-for-evaluating-published-research-9a4e5f05e870) about the ages -- ...ctd – Glen_b Oct 25 '17 at 21:26
  • 2
    ctd... I spotted where the problem was as I was reading the setup. Such errors occur for all manner of innocent reasons and some may not actually be errors (e.g. *one* person gives their age as 17 years 5 months which was recorded as 17.42 -- there's no error in the reporting at all, though perhaps a slight issue in data-handling). I have many times spotted more egregious things that must have been errors (such as seeing -in a published paper- data grouped into age ranges where the standard deviation reported for one group was considerably larger than half the range of the age-bin). – Glen_b Oct 25 '17 at 21:31
  • 2
    There are dozens of things that could easily be checked for. There's a time when I think a set of more comprehensive consistency checks would be useful -- before publication. Where the original data is not provided in the paper, it would be a simple matter for an editor to get someone to run the information in the paper through a collection of consistency checks and flag anything sufficiently weird to query the author about ("With this summary, how do you have a mean equal to the lowest reported value in the group when the standard deviation isn't 0?"). ...ctd – Glen_b Oct 25 '17 at 21:41
  • 2
    ctd... *That* would be handy because if it does turn out to have an innocent explanation that should go in the paper ("Sorry, we forgot to state that the reported means were rounded to the nearest integer." .... "please say so under the table!"). It's fine to do consistency checks and raise queries (if we don't too readily jump to conclusions), but the author there seems to be making rather a big deal out of a fairly obvious thing to check for. – Glen_b Oct 25 '17 at 21:41
  • 2
    I don't know of any wholesale checks being done, though some certainly could be, and in any case we should always be skeptical readers (I often ask myself when looking at reported information ... "does this make sense?"; often people ignore basic numerical facts when they read, though for me thinking about it is often part of comprehending what I read) – Glen_b Oct 25 '17 at 21:52
  • @Glen_b Thanks for the discussion. You more than answered my question, so feel free to copy and paste the comments as an answer and I'll accept it. – R Greg Stacey Oct 25 '17 at 22:05
  • Oh, I don't feel like I answered the question at all. But I may be able to come back later (a bit short on time today, but might find a few minutes) and add a little discussion and make a more even-handed answer. – Glen_b Oct 25 '17 at 22:12
  • Oh, when do we have all journals adopting a reproducible research policy? – StatsStudent Mar 11 '19 at 22:34
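
To make the consistency checks discussed in the comments concrete (e.g. Glen_b's "mean equal to the lowest reported value when the standard deviation isn't 0" and the SD-versus-range example), here is a minimal sketch that needs nothing beyond the reported n, mean, SD, minimum and maximum. It treats the reported values as exact; in practice you would allow for rounding before flagging anything, and the function name and example numbers are purely illustrative.

```python
import math

def summary_flags(n, mean, sd, minimum, maximum):
    """Flag reported summary statistics that cannot all be true at once
    (treating the reported values as exact)."""
    flags = []
    if not minimum <= mean <= maximum:
        flags.append("mean lies outside the reported [min, max] range")
    if mean == minimum and sd > 0:
        flags.append("mean equals the minimum, yet the SD is not zero")
    if n > 1:
        # For n values confined to [min, max], the sample SD cannot exceed
        # (range / 2) * sqrt(n / (n - 1)), the value reached when n is even
        # and the observations are split between the two endpoints.
        sd_max = (maximum - minimum) / 2 * math.sqrt(n / (n - 1))
        if sd > sd_max:
            flags.append(f"SD {sd} exceeds the maximum possible {sd_max:.2f}")
    return flags

# An age bin of 10-19 years whose reported SD exceeds half the bin's range:
print(summary_flags(n=30, mean=14.2, sd=9.5, minimum=10, maximum=19))
```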

2 Answers


You could simply ask for the data if you think there is an error. If they say no, that would be a concern to me, although I find it hard to believe people would deliberately falsify results (some will, but I think that will be rare).

user54285

Below are heuristics: not necessarily direct mistakes, but frequent ways in which statistics are used sub-optimally.

Use of underpowered statistical tests: do they mention sample-size calculations, for example? (A short sketch illustrating this and the multiple-comparisons point follows the list.)

Data dredging, especially on “big data”. If all variables are highly significant, this could be a sign that stepwise regression or something similar was used, rather than more principled reasoning.

Over-reliance on hypothesis tests: as consumers of the paper we do not know how many other tests were tried, and corrections for multiple comparisons themselves depend on how many (acknowledged) tests were performed. Further, some fields have substantial prior knowledge which, if not encoded into the analysis through, say, a Bayesian approach, leaves “significant” hypothesis-test results too prone to randomness.

Not using multiple imputation or similar, but instead dropping observations with missing values from the analysis, may bias the remaining data and also reduces the power of subsequent tests.

Something that is more difficult to discern is whether the techniques used are well understood by the authors. This can become more apparent if they give a presentation of the paper. If a technique is not sufficiently understood, it may have been misused.
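
As a rough illustration of the sample-size and multiple-comparisons points above (assuming statsmodels is available; the effect size and p-values below are made up purely for the example):

```python
# Assumes statsmodels is installed (pip install statsmodels).
from statsmodels.stats.power import TTestIndPower
from statsmodels.stats.multitest import multipletests

# Sample-size sanity check: subjects per group needed by a two-sample
# t-test to detect a medium effect (Cohen's d = 0.5) at 80% power.
n_per_group = TTestIndPower().solve_power(effect_size=0.5, alpha=0.05, power=0.8)
print(round(n_per_group))            # roughly 64 per group

# Multiple-comparisons check: which of these hypothetical reported
# p-values would survive a Holm correction at alpha = 0.05?
pvals = [0.002, 0.013, 0.030, 0.048, 0.049]
reject, p_adjusted, _, _ = multipletests(pvals, alpha=0.05, method="holm")
print(list(zip(pvals, reject)))
```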

Single Malt