As an enthusiastic user of R, bash, Python, asciidoc, (La)TeX, open-source software, and un*x tools in general, I cannot provide an objective answer. Moreover, as I often argue against the use of MS Excel or spreadsheets of any kind (well, you see your data, or part of it, but what else?), I would not contribute positively to the debate. I'm not the only one, e.g.:
- Spreadsheet Addiction, by P. Burns.
- MS Excel's precision and accuracy, a 2004 post on the R mailing list.
- L. Knüsel, On the accuracy of statistical distributions in Microsoft Excel 97, Computational Statistics & Data Analysis, 26: 375–377, 1998. (pdf)
- B.D. McCullough & B. Wilson, On the accuracy of statistical procedures in Microsoft Excel 2000 and Excel XP, Computational Statistics & Data Analysis, 40: 713–721, 2002.
- M. Altman, J. Gill & M.P. McDonald, Numerical Issues in Statistical Computing for the Social Scientist, Wiley, 2004. [e.g., pp. 12–14]
A colleague of mine lost all his macros because of the lack of backward compatibility, etc. Another colleague tried to import genetics data (around 700 subjects genotyped on 800,000 markers, about 120 MB), just to "look at them". Excel failed, and Notepad gave up too... I was able to "look at them" with vi, and to quickly reformat the data with a short sed/awk or Perl script (a rough R alternative for this kind of quick inspection is sketched below).

So I think there are different levels to consider when discussing the usefulness of spreadsheets. If you work on small data sets and only want to apply elementary statistical procedures, then maybe it's fine. It's up to you to trust the results, or you can always ask for the source code, but it might be simpler to run a quick test of all the built-in procedures against the NIST benchmarks. I don't think this amounts to a good way of doing statistics, simply because Excel is not a true statistical package (IMHO), although, as an update to the list above, newer versions of MS Excel do seem to have improved the accuracy of their statistical routines; see Keeling & Pavur, A comparative study of the reliability of nine statistical software packages, Computational Statistics & Data Analysis, 51: 3811, 2007.
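To make that concrete, here is a minimal sketch of how one might peek at such a file from R instead; the file name and the tab delimiter are assumptions made up for illustration:

```r
## Rough sketch: inspect a very large plain-text genotype file without
## opening it in a spreadsheet (hypothetical file name, assumed tab-delimited).
geno_file <- "genotypes.txt"

## Look at the raw layout of the first few lines
writeLines(readLines(geno_file, n = 3))

## Import only the first 100 rows to check how the markers are coded
geno_peek <- read.table(geno_file, header = TRUE, sep = "\t", nrows = 100)
str(geno_peek[, 1:10])
```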
Still, about one paper out of 10 or 20 (in biomedicine, psychology, psychiatry) includes graphics made with Excel, sometimes without removing the gray background, the horizontal black gridlines, or the automatic legend (Andrew Gelman and Hadley Wickham are certainly as happy as I am when they see this). But more generally, Excel tends to be the most widely used "software" according to a recent poll on FlowingData, which reminds me of an old talk by Brian Ripley (who co-authored the MASS package for R and wrote an excellent book on pattern recognition, among others):
Let's not kid ourselves: the most widely used piece of software for statistics is Excel. (B. Ripley via Jan De Leeuw), http://www.stats.ox.ac.uk/~ripley/RSS2002.pdf
Now, if you feel it provides you with a quick and easy way to get your statistics done, why not? The problem is that there are still things that cannot be done (or at least not without considerable trickery) in such an environment: bootstrapping, permutation tests, and multivariate exploratory data analysis, to name a few. Unless you are very proficient in VBA (which is hardly a substitute for a scripting or statistically oriented language), I am inclined to think that even minor operations on data are better handled in R (or Matlab, or Python, provided you get the right tool for dealing with, e.g., data.frame-like structures). Above all, I think Excel does not promote good practices for the data analyst (the same applies to any "cliquodrome", i.e. point-and-click environment; see the discussion on Medstats about the need to maintain a record of data processing, Documenting analyses and data edits), and I found this post on Practical Stats fairly illustrative of some of Excel's pitfalls. Still, that applies to Excel; I don't know how it translates to GDocs.
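For instance, here is a minimal sketch of a two-sample permutation test in R, using simulated data (group sizes and effect size are made up), the kind of resampling that is awkward in a spreadsheet but takes a few lines in a statistical language:

```r
## Two-sample permutation test on simulated data (illustration only)
set.seed(101)
x <- rnorm(20, mean = 0)            # group 1 (simulated)
y <- rnorm(20, mean = 0.5)          # group 2 (simulated)
obs <- mean(x) - mean(y)            # observed difference in means

pooled <- c(x, y)
perm_diff <- replicate(9999, {
  idx <- sample(length(pooled), length(x))   # reshuffle group labels
  mean(pooled[idx]) - mean(pooled[-idx])
})

## Two-sided permutation p-value
mean(abs(c(obs, perm_diff)) >= abs(obs))
```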
About sharing your work, I tend to think that GitHub (or Gist for source code) or Dropbox (although its EULA might discourage some people) are very good options (revision history, access management if needed, etc.). I cannot encourage the use of software that basically stores your data in a binary format. I know it can be imported into R, Matlab, Stata, or SPSS, but in my opinion:
- data should definitely be stored in a plain-text format that can be read by any other statistical software;
- analyses should be reproducible, meaning you should provide a complete script for your analysis, and it should run (we are approaching the ideal case here...) on another operating system at any time (a minimal skeleton of such a script is sketched after this list);
- your statistical software should implement well-established algorithms and there should be an easy way to update it to reflect current best practices in statistical modeling;
- the sharing system you choose should include versioning and collaborative facilities.
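To illustrate the reproducibility point, here is a minimal sketch of what such a self-contained script might look like in R; the file name, variables, and model are all hypothetical:

```r
## analysis.R -- minimal, hypothetical skeleton of a self-contained analysis
## script; file names, variables, and the model are made up for illustration.

## 1. Read the raw data from a plain-text file kept under version control
dat <- read.csv("data/raw_measurements.csv")

## 2. Document every data edit in the script itself, not by hand in a GUI
dat$group <- factor(dat$group, levels = c("control", "treated"))
dat <- subset(dat, !is.na(outcome))

## 3. Fit the model and report it
fit <- lm(outcome ~ group + age, data = dat)
summary(fit)

## 4. Record the exact software environment for reproducibility
sessionInfo()
```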
That's it.