Questions tagged [reproducible-research]

The research practice of making the full experimental description, all collected data, and all data analysis scripts publicly available, so that published results can be reproduced elsewhere.

Reproducible research refers to scientific findings or results that can be independently replicated using the methods detailed by the original investigator. It is a cornerstone of the scientific method. For statistical methods, reproducible research involves clearly describing the assumptions, approaches, and tests used in a data analysis. Statistical methods can also be used to assess how reproducible an original set of findings was, based on how closely independent replications match it.

81 questions
95 votes, 2 answers

How much do we know about p-hacking "in the wild"?

The phrase p-hacking (also: "data dredging", "snooping" or "fishing") refers to various kinds of statistical malpractice in which results become artificially statistically significant. There are many ways to procure a "more significant" result,…
73 votes, 15 answers

Complete substantive examples of reproducible research using R

The Question: Are there any good examples of reproducible research using R that are freely available online? Ideal Example: Specifically, ideal examples would provide: The raw data (and ideally meta data explaining the data), All R code including…
Jeromy Anglim
51 votes, 3 answers

How are we defining 'reproducible research'?

This has come up in a few questions now, and I've been wondering about something. Has the field as a whole moved toward "reproducibility" focusing on the availability of the original data, and the code in question? I was always taught that the core…
Fomite
43 votes, 8 answers

How do I get people to take better care of data?

My workplace has employees from a very wide range of disciplines, so we generate data in lots of different forms. Consequently, each team has developed its own system for storing data. Some use Access or SQL databases; some teams (to my horror)…
Richie Cotton
37 votes, 5 answers

Is p-value essentially useless and dangerous to use?

This article "The Odds, Continually Updated" from the NY Times happened to catch my attention. In short, it states that [Bayesian statistics] is proving especially useful in approaching complex problems, including searches like the one the Coast…
31 votes, 6 answers

How to increase longer term reproducibility of research (particularly using R and Sweave)

Context: In response to an earlier question about reproducible research, Jake wrote: One problem we discovered when creating our JASA archive was that versions and defaults of CRAN packages changed. So, in that archive, we also include the…
Jeromy Anglim
28 votes, 3 answers

Who to follow on github to learn about best practice in data analysis?

It is helpful to study the data analysis code of experts. I've recently been perusing github and there are a number of people sharing data analysis code there. This includes a few R Packages (which of course are available directly from CRAN), but…
Jeromy Anglim
28 votes, 2 answers

What are some standard practices for creating synthetic data sets?

As context: When working with a very large data set, I am sometimes asked if we can create a synthetic data set where we "know" the relationship between predictors and the response variable, or relationships among predictors. Over the years, I…
Iterator
23 votes, 4 answers

As a reviewer, can I justify requesting data and code be made available even if the journal does not?

As science must be reproducible, by definition, there is increasing recognition that data and code are essential components of reproducibility, as discussed by the Yale Roundtable for data and code sharing. In reviewing a manuscript for a…
David LeBauer
23 votes, 1 answer

Has the reported state-of-the-art performance of using paragraph vectors for sentiment analysis been replicated?

I was impressed by the results in the ICML 2014 paper "Distributed Representations of Sentences and Documents" by Le and Mikolov. The technique they describe, called "paragraph vectors", learns unsupervised representations of arbitrarily-long…
17 votes, 1 answer

How to create coloured tables with Sweave and xtable?

I am using Sweave and xtable to generate a report. I would like to add some coloring on a table. But I have not managed to find any way to generate colored tables with xtable. Is there any other option?
RockScience
15 votes, 1 answer

What if high validation accuracy but low test accuracy in research?

I have a specific question about validation in machine learning research. As we know, the machine learning regime asks researchers to train their models on the training data, choose among candidate models using the validation set, and report accuracy on the…
Mou
12 votes, 3 answers

Hosting options for publicly available data

So you've decided to support the idea of reproducible research and want to make your data available online for people to see and use. The question is, where do you host it? My first inclination is of course the private webspace I have on a…
Fomite
10 votes, 4 answers

Implications of current debate on statistical significance

In the past few years, various scholars have raised a serious problem with scientific hypothesis testing, dubbed "researcher degrees of freedom," meaning that scientists have numerous choices to make during their analysis that bias towards finding…
8 votes, 1 answer

Why do people use PCA when it has so many issues?

(This is a soft question) I have recently been learning Principal Component Analysis, and it appears to have a lot of issues: You have to transform the data to roughly the same scale before applying PCA, but how the feature scaling should be performed is…