
We often hear about project management and design patterns in computer science, but less frequently in statistical analysis. However, it seems that a decisive step toward designing an effective and durable statistical project is to keep things organized.

I often advocate using R together with a consistent organization of files in separate folders (raw data, transformed data, R scripts, figures, notes, etc.). The main reason for this approach is that it makes it easier to rerun your analysis later (when you have forgotten how you produced a given plot, for instance).

What are the best practices for statistical project management, or what recommendations would you give from your own experience? Of course, this applies to any statistical software. (One answer per post, please.)

gung - Reinstate Monica
chl

7 Answers


I am compiling a quick series of guidelines I found on SO (as suggested by @Shane), Biostar (hereafter, BS), and this SE. I tried my best to acknowledge ownership for each item, and to select the first or most highly upvoted answers. I also added a few things of my own, and flagged items that are specific to the [R] environment.

Data management

  • Create a project structure for keeping everything in the right place (data, code, figures, etc.; giovanni /BS) -- see the sketch after this list
  • Never modify raw data files (ideally, they should be read-only); copy or rename them to new files when making transformations, cleaning them up, etc.
  • Check data consistency (whuber /SE)
  • Manage script dependencies and data flow with a build automation tool, like GNU make (Karl Broman/Zachary Jones)
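
As a concrete illustration of the first two items, here is a minimal R sketch; the folder names are hypothetical, and the read-only step assumes a Unix-style permission mask:

    # Minimal sketch of a project skeleton (folder names are hypothetical)
    dirs <- c("data/raw", "data/processed", "R", "figures", "reports")
    invisible(lapply(dirs, dir.create, recursive = TRUE, showWarnings = FALSE))

    # Make raw data files read-only so they cannot be modified by accident
    # ("0444" assumes a Unix-like system)
    raw_files <- list.files("data/raw", full.names = TRUE)
    Sys.chmod(raw_files, mode = "0444")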

Coding

  • Organize source code in logical units or building blocks (Josh Reich/hadley/ars /SO; giovanni/Khader Shameer /BS)
  • Separate source code from editing/reporting material, especially for large projects -- partly overlapping with the previous item and with the Editing/Reporting section below
  • Document everything, e.g. with [R]oxygen (Shane /SO) or consistent self-annotation in the source file -- see also a good discussion on Medstats, "Documenting analyses and data edits"
  • [R] Custom functions can be put in a dedicated file (sourced when necessary), in a new environment (so as to avoid populating the top-level namespace; Brendan O'Connor /SO), or in a package (Dirk Eddelbuettel/Shane /SO) -- see the sketch after this list
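
A minimal sketch of the environment option from the last bullet; the file name utils.R and the helper clean_names are hypothetical:

    # Load helper functions into their own environment instead of the
    # global workspace, so they don't clutter the top-level namespace
    helpers <- new.env()
    sys.source("utils.R", envir = helpers)  # "utils.R" is a hypothetical file

    # Helpers are then called through that environment, e.g.:
    # helpers$clean_names(my_data)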

Analysis

  • Don't forget to set and record the seed you used when calling RNGs or stochastic algorithms (e.g. k-means) -- see the sketch after this list
  • For Monte Carlo studies, it may be worth storing specs/parameters in a separate file (Sumatra may be a good candidate; giovanni /BS)
  • Don't limit yourself to one plot per variable; use multivariate (Trellis) displays and interactive visualization tools (e.g. GGobi)
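
A minimal sketch of the seed-recording item, using k-means as in the bullet above; the seed value and example data are arbitrary:

    # Record the seed with the results so the stochastic part of the
    # analysis (k-means here) can be reproduced exactly
    seed <- 42  # arbitrary; store it in your specs/parameters file
    set.seed(seed)
    fit <- kmeans(scale(iris[, 1:4]), centers = 3, nstart = 25)

    # Rerunning with the same recorded seed gives identical clusters
    set.seed(seed)
    fit2 <- kmeans(scale(iris[, 1:4]), centers = 3, nstart = 25)
    stopifnot(identical(fit$cluster, fit2$cluster))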

Versioning

  • Use some kind of revision control for easy tracking/export, e.g. Git (Sharpie/VonC/JD Long /SO) -- this follows from nice questions asked by @Jeromy and @Tal
  • Backup everything, on a regular basis (Sharpie/JD Long /SO)
  • Keep a log of your ideas, or rely on an issue tracker like ditz (giovanni /BS) -- partly redundant with the previous item, since ditz keeps its issue database under version control alongside the code

Editing/Reporting

As a side note, Hadley Wickham offers a comprehensive overview of R project management, including reproducible examples and a unified philosophy of data.

Finally, in his R-oriented Workflow of statistical data analysis, Oliver Kirchkamp offers a very detailed overview of why adopting and obeying a specific workflow will help statisticians collaborate with one another while ensuring data integrity and reproducibility of results. It further includes some discussion of weaving and version control systems. Stata users might find J. Scott Long's The Workflow of Data Analysis Using Stata useful too.

chl
  • Great job chl! Would it be o.k. by you if I were to publish this on my blog? (I mean, this text is CC, so I could, but I wanted your permission anyway :) ) Cheers, Tal – Tal Galili Sep 30 '10 at 14:49
  • @Tal No problem. It's far from being an exhaustive list, but maybe you can aggregate other useful links at a later time. Also, feel free to adapt or reorganize in a better way. – chl Sep 30 '10 at 15:07
  • +1 This is a nice list. You might consider "accepting this" so that it's always on top; given that it's CW, anyone can keep it updated. – Shane Sep 30 '10 at 15:34
  • @Shane Well, I am indebted to you for providing a first answer with such useful links. Feel free to add/modify the way you want. – chl Sep 30 '10 at 15:45
  • I republished it here. Great list! http://www.r-statistics.com/2010/09/managing-a-statistical-analysis-project-guidelines-and-best-practices/ – Tal Galili Sep 30 '10 at 16:03
  • I'd vote for Mercurial rather than Git as a versioning tool. I've found it to be easier to use and the user community isn't so harsh. (On the Mac, MacHG is a great GUI front end for Mercurial.) Whatever versioning tool you use, a GUI front end is very useful and powerful for tracking and managing things. – Wayne May 01 '12 at 20:49
  • @Wayne Thanks for that. Oliver Kirchkamp discussed the use of SVN; I found myself often using [RCS](http://www.gnu.org/software/rcs/) for various stuff, but I've heard good things from Hg. I agree that a GUI can be a plus although I mainly work from the command-line and Emacs. ([GitHub for Mac](http://mac.github.com/) isn't that bad, btw.) – chl May 01 '12 at 20:55
  • @CHL: At first glance, it's easy to think that a GUI simply makes it easier for a novice to use, but I've found that the power of the GUI (at least MacHG) is that it's dynamic. I keep MacHG open all the time, and I can see at a glance what files are in a project, and which have been updated. Click on a file to see what's been changed. It especially helps if I change projects, to help remind me where I was. – Wayne May 01 '12 at 21:03
  • This could really do with the addition of a description of how to use a Makefile for managing data munging and caching. If anyone knows of one, please add it. If not, I will try to write one up soon, once I get my head around it. – naught101 Apr 07 '14 at 12:38
  • @naught101 Karl Broman has some tutorials on using GNU tools and R in his "Tools for Reproducible Research", http://kbroman.github.io/Tools4RR/. – chl Apr 08 '14 at 15:03
  • Interesting - your answer is a very comprehensive guide to code/file management, but not much in terms of checking whether you have actually answered the key research questions or output requirements – probabilityislogic May 07 '19 at 07:43
  • @probabilityislogic That was the way I originally envisioned the question: in terms of tools and project structure, rather than statistical expertise or general research/scientific methodology. – chl Sep 14 '20 at 18:24

This doesn't specifically provide an answer, but you may want to look at related Stack Overflow questions on organizing and managing analysis projects.

You may also be interested in John Myles White's recent project to create a statistical project template.

Shane
  • Thanks for the links! The question is open to any statistical software -- I use Python and Stata from time to time, so I wonder whether experienced users might bring interesting recommendations there. – chl Sep 20 '10 at 20:51
  • Absolutely; although I would add that the recommendations in the above links could really apply to any statistical project (regardless of the language). – Shane Sep 20 '10 at 20:58
  • Definitely, yes! I updated my question at the same time. – chl Sep 20 '10 at 21:03

This overlaps with Shane's answer, but in my view there are two main pillars:

  • Reproducibility: not only so that you don't end up with results produced "somehow", but also so that you can rerun the analysis faster (on other data, or with slightly changed parameters) and have more time to think about the results. For huge datasets, you can first test your ideas on a small "playset" and then easily extend them to the whole data (see the sketch below).
  • Good documentation: commented scripts under version control, some kind of research journal, even a ticket system for more complex projects. This improves reproducibility, makes error tracking easier, and makes writing final reports trivial.
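
As a hypothetical sketch of the "playset" idea (the data and model below are made-up stand-ins for your own), it helps to write the analysis as a function of the data:

    # Made-up data standing in for your own
    set.seed(1)
    full_data <- data.frame(x1 = rnorm(1e5), x2 = rnorm(1e5))
    full_data$y <- 2 * full_data$x1 - full_data$x2 + rnorm(1e5)

    # Wrap the analysis in a function of the data so the same code runs
    # on a small "playset" (fast iteration) and on the full dataset
    run_analysis <- function(dat) lm(y ~ x1 + x2, data = dat)

    playset <- full_data[sample(nrow(full_data), 1000), ]
    fit_small <- run_analysis(playset)    # quick experiments
    fit_full  <- run_analysis(full_data)  # same code, all the data
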
  • +1 I like the second point (I use roxygen + git). The first point also makes me think of the possibility of handing your code to another statistician, who will be able to reproduce your results at a later stage of the project without any help. – chl Sep 26 '10 at 09:52
  • Reproducibility? Data has random error anyway, so who cares. Documentation? Two possible answers: 1) We're too busy, we don't have time for documentation or 2) We only had budget to either do the analysis or document it, so we chose to do the analysis. You think I'm joking? I've seen/heard these attitudes on many occasions - on projects on which lives were on the line. – Mark L. Stone Aug 10 '16 at 00:37

van Belle's Statistical Rules of Thumb is the source for the rules of successful statistical projects.

Carlos Accioly

Just my 2 cents. I've found Notepad++ useful for this. I can maintain separate scripts (program control, data formatting, etc.) and a .pad file for each project. The .pad file calls all the scripts associated with that project.


While the other answers are great, I would add another sentiment: avoid using SPSS. I used SPSS for my master's thesis and now use it in my regular job in market research.

While working with SPSS, I found it incredibly hard to develop organized statistical code, because SPSS handles multiple files poorly (sure, you can work with multiple files, but it's not as painless as in R) and because you cannot assign a dataset to a variable -- you have to use "DATASET ACTIVATE x" syntax, which can be a total pain. Also, the syntax is clunky and encourages shorthands, which make code even more unreadable.
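
For contrast, a minimal R sketch with toy data of what storing datasets in variables buys you:

    # In R, each dataset is just a variable, so several datasets can sit
    # in memory at once with no "DATASET ACTIVATE" bookkeeping
    wave1 <- data.frame(id = 1:3, score1 = c(10, 12, 9))  # toy data
    wave2 <- data.frame(id = 1:3, score2 = c(11, 14, 8))
    merged <- merge(wave1, wave2, by = "id")  # both remain available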

Christian Sauer

Jupyter Notebooks, which work with R/Python/Matlab/etc., remove the hassle of remembering which script generates a certain figure. This post describes a tidy way of keeping the code and the figure right beside each other. Keeping all figures for a paper or thesis chapter in a single notebook makes the associated code very easy to find.

Even better, you can scroll through, say, a dozen figures to find the one you want, while the code stays hidden until it is needed.
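
A minimal sketch of this pattern, assuming the R kernel (IRkernel) runs in a Jupyter cell; the figures/ output path is hypothetical and must already exist:

    # The plot renders inline below the cell, right next to the code
    # that made it; dev.copy() also writes a copy to disk for the paper
    plot(mpg ~ wt, data = mtcars, main = "Weight vs. fuel economy")
    dev.copy(png, filename = "figures/wt-vs-mpg.png")
    dev.off()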

hugke729