43

My workplace has employees from a very wide range of disciplines, so we generate data in lots of different forms. Consequently, each team has developed its own system for storing data. Some use Access or SQL databases; some teams (to my horror) are reliant almost entirely on Excel spreadsheets. Often, data formats change from project to project. In some cases calling it a 'system' is too kind.

This causes several problems: I have to write new data-cleaning code for every project, which is expensive; people manually editing spreadsheets make reproducibility and auditing of the data near impossible; and, worse still, there is a chance that data gets lost or corrupted.

I've been given the opportunity to discuss these problems with a board member of the company and I need to work out what to tell him. I think I've already persuaded him that we have a problem and that getting this right will enable better science and save money. The question is: what should we be aiming for, and how do we get there?

More specifically:

How should we store data, in a way that lets us track it from creation to publication in a paper? (Databases stored on a central server?)

How do you go about standardising database formats?

Are there any good resources for educating people on how to care for data? (As a general rule, occupational hygienists and explosives engineers aren't data nerds; so non-technical content preferred.)

Jeromy Anglim
Richie Cotton
  • The title question is different to the substantive question in the post. The former asks about how to convince people to take care of data, and the latter asks about the best way to store data. Which is the question you want answered? – Ben Nov 29 '18 at 01:18

8 Answers

16

It's worth considering ideas from the software world. In particular, you might think of setting up a version control repository and a central database server.

Version control helps you most with otherwise free-floating files, such as Excel spreadsheets and text files, but it can also cover the files associated with the data, such as R and SAS scripts. The idea is that there is a system which tracks changes to your files, letting you see what happened when and roll back to a point in the past if needed.

Where you already have SQL databases, the best thing you can do is set up a central server and hire a capable DBA. The DBA is the person tasked with ensuring and maintaining the integrity of the data. Part of the job description involves things like backups and tuning, but another part is more relevant here: controlling how data enters the system, ensuring that constraints are met, putting access policies in place to prevent harm to the data, setting up views to expose custom or simplified data formats, and so on. In short, implementing a methodology around the data process. Even if you don't hire an actual DBA (the good ones are very hard to recruit), having a central server still lets you start instituting some kind of methodology around the data.
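To make the "constraints and views" part concrete for a non-DBA audience, here is a minimal sketch. It uses SQLite via Python's standard library purely for illustration (a real central server would be something like PostgreSQL or SQL Server), and the table, columns and valid ranges are all invented:

```python
# Minimal sketch: constraints reject bad data at the point of entry,
# and a view exposes a simplified, read-only face on the data.
# SQLite stands in for a real central server; all names here are invented.
import sqlite3

conn = sqlite3.connect("central_example.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS measurement (
    id          INTEGER PRIMARY KEY,
    project     TEXT NOT NULL,                     -- every row must say where it came from
    measured_on TEXT NOT NULL,                     -- ISO 8601 date, e.g. '2010-10-21'
    analyte     TEXT NOT NULL,
    value       REAL NOT NULL CHECK (value >= 0),  -- impossible values are rejected
    unit        TEXT NOT NULL CHECK (unit IN ('mg/m3', 'ppm'))
);

-- A simplified view for people who only need to browse recent results.
CREATE VIEW IF NOT EXISTS recent_measurements AS
SELECT project, measured_on, analyte, value, unit
FROM measurement
ORDER BY measured_on DESC;
""")

# A bad row is rejected by the database itself, not by downstream cleaning code.
try:
    conn.execute(
        "INSERT INTO measurement (project, measured_on, analyte, value, unit) "
        "VALUES (?, ?, ?, ?, ?)",
        ("P-001", "2010-10-21", "benzene", -1.0, "ppm"),
    )
except sqlite3.IntegrityError as err:
    print("rejected:", err)
conn.close()
```

The selling point is that the rules live in one place, next to the data, instead of being re-implemented in every analyst's cleaning script.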

ars
  • Absolutely agreed on the version control. I use it; as do a substantial proportion of the developers and statisticians. (I'd like to see 100% adoption but that's another pipe-dream for now.) The hard bit is getting non-techies to use it. Any ideas appreciated. – Richie Cotton Oct 21 '10 at 20:18
  • @Richie Cotton: I don't know why, but version control seems to be a difficult concept for non-techies to grasp. People continue to just make some changes to a file, rename it and send it via email. How I hate those "PaperDraftCorrectedByJohnRevision3RewroteByLeslie-NewVersion3.doc" files... – nico Oct 21 '10 at 21:14
12

One free online resource is the set of Statistical Good Practice Guidelines from the Statistical Services Centre at the University of Reading.

In particular:

MrGomez
onestop
  • Excellent links. I think two important messages for me to pass on are: we need more automated data checking and I need to start explaining about separating data entry and data presentation. – Richie Cotton Oct 21 '10 at 21:06
6

I think first of all you have to ask yourself: why do people use Excel to do tasks Excel was not made for?

  1. They already know how to use it.
  2. It works. Maybe in a clumsy way, but it works, and that's what they want.

I copy a series of numbers in, press a button and I have a plot. As easy as that.

So, make them understand what advantages they can have by using centralized datasets, proper databases (note that Access is NOT one of those) and so on. But remember the two points above: you need to set up a system that works and is easy to use.

Too many times I've seen badly made systems that made me want to go back, not to Excel, but to pen and paper!

Just as an example, we have a horrible ordering system where I work.

We used to have to fill in an order form, an Excel spreadsheet where you would input the name of the product, the quantity, the cost and so on. It would add everything up, add VAT, etc.; you printed it, gave it to the secretary who would place the order, and that was it. Inefficient, but it worked.

Now we have an online ordering system, with a centralized DB and everything. It's a horror. It should not take me 10 minutes to fill in a damn form because of the unintuitive keyboard shortcuts and the various oddities of the software. And note that I am quite computer-savvy, so imagine what happens to people who don't like computers...

nico
  • Agreed that things need to be user friendly. Since people get very protective over their working practices, any changes must make people's lives easier or they'll fail. – Richie Cotton Oct 21 '10 at 20:13
5

I second all the answers given already, but let's call a spade a spade: in many workplaces it is all but impossible to convince management that investment in "exotic" software tools (exotic to them, that is) is necessary, let alone hiring somebody who could set them up and maintain them. I have told quite a few clients that they would benefit greatly from hiring a statistician with a thorough background in software and databases, but "no can do" is the general response.

So as long as that isn't going to happen, there are some simple things you can do with Excel that will make life easier. The first of these is without doubt version control. More info on version control with Excel can be found here.

Some things about using Excel

People using Excel very often like its formula features. Yet, in my experience, these are the most important source of errors within Excel sheets, and of problems when trying to read Excel files into other software. I refuse to work with sheets containing formulas.

I also force everybody I work with to deliver their Excel sheets in a plain format, meaning that:

  • The first row contains the names of the different variables
  • The spreadsheet starts in cell A1
  • All data is put in columns, without interruptions and without formatting.
  • If possible, the data is saved in .csv format as well. It's not difficult to write a VBA script that will extract the data, reformat it and put it in a .csv file. This also allows for better version control, as you can make a .csv dump of the data every day (see the sketch after this list).
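As an illustration only (the answer suggests VBA; this is a rough Python equivalent using pandas, with invented folder names and openpyxl assumed to be installed for reading .xlsx files):

```python
# Rough Python alternative to the VBA extraction script described above:
# read each plain-format sheet and dump it to .csv for daily version control.
# Folder and file names are invented; assumes pandas and openpyxl are installed.
from pathlib import Path

import pandas as pd

Path("csv_dump").mkdir(exist_ok=True)
for xlsx in sorted(Path("incoming").glob("*.xlsx")):
    # Plain format: variable names in row 1, data from A2 down, no formulas.
    df = pd.read_excel(xlsx, sheet_name=0)
    df.to_csv(Path("csv_dump") / (xlsx.stem + ".csv"), index=False)
```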

If there is a general structure the data always has, then it might be good to develop a template with underlying VBA macros to add data and generate the dataset for analysis. In general, this prevents every employee from coming up with their own "genius" system of data storage, and it allows you to write your code around that structure.

That said, if you can convince everybody to use SQL (with a front end for entering data), you can link R directly to the database. This will greatly increase performance.

Data structure and management

As a general rule, the data stored in databases (or Excel sheets if they insist) should be the absolute minimum, meaning that any variable that can be calculated from some other variables should not be contained in the database. Mind you, sometimes it can be beneficial to store those derived or transformed variables as well, if the calculations are tedious and take a long time. But these should be stored in a separate database, if necessary linked to the original one.

Thought should also be given to what counts as one case (and hence one row). As an example, people tend to produce time series by making a new variable for each time point. While this makes sense in an Excel sheet, reading in these data demands quite a bit of reshaping of the data matrix (a sketch follows below). The same goes for comparing groups: there should be one group indicator and one response variable, not a response variable for each group. This way data structures can be standardized as well.
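For instance, a "one column per time point" sheet can be reshaped into the long format described here with a few lines of pandas; the column names below are made up:

```python
# Sketch of reshaping a wide "one column per time point" layout into the
# long format described above (one row per case and time point).
# Column names are invented for illustration.
import pandas as pd

wide = pd.DataFrame({
    "subject": ["A", "B"],
    "t0": [1.2, 0.9],
    "t1": [1.5, 1.1],
    "t2": [1.4, 1.3],
})

long = wide.melt(id_vars="subject", var_name="time", value_name="response")
print(long)
```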

A last thing I run into frequently is the use of different units. Lengths are given in meters or centimeters, temperatures in Celsius, Kelvin or Fahrenheit, and so on. Any front end or template should indicate the unit in which each variable is measured.

And even after all these things, you still want a data control step before you actually start the analysis. Again, this can be any script that runs daily (e.g. overnight) on new entries and flags problems immediately (out of range, wrong type, missing fields, ...) so they can be corrected as fast as possible. If you have to return to an entry that was made 2 months ago to find out what is wrong and why, you had better have some good "Sherlock" skills to correct it.
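A minimal sketch of what such a data control script might look like in Python with pandas; the file path, column names and valid ranges are invented and would have to match your own data:

```python
# Overnight data-control sketch: flag missing, out-of-range or oddly-typed
# entries so they can be corrected quickly. All names and ranges are invented.
import pandas as pd

new_entries = pd.read_csv("csv_dump/new_entries.csv")

problems = []
for idx, row in new_entries.iterrows():
    if pd.isna(row.get("value")):
        problems.append((idx, "missing value"))
    elif not (0 <= row["value"] <= 1000):
        problems.append((idx, f"value out of range: {row['value']}"))
    if row.get("unit") not in ("mg/m3", "ppm"):
        problems.append((idx, f"unknown unit: {row.get('unit')}"))

for idx, message in problems:
    print(f"row {idx}: {message}")
```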

my 2 cents

Joris Meys
  • Some very interesting points here. Convincing people to simplify and standardise their spreadsheets is likely to be more successful than getting them to abandon them. Also I had no idea that version control could integrate with Excel. Nice to know. – Richie Cotton Oct 22 '10 at 14:46
  • Concerning the advice not to store redundant variables: this is appropriate for RDBMSes but I would like to suggest that the opposite should be encouraged for spreadsheets. The latter are so error-prone that mechanisms to detect and correct errors are invaluable. One of the best consists of redundant information, such as computed fields and statistical summaries. For example, if column C is the ratio of columns A and B, then an error in a single column in any given row can be detected and usually fixed. – whuber Oct 25 '10 at 15:55
  • @whuber: that is what we check in the data control step. You can use that extra column for a quick check, but you shouldn't keep it in the final sheet. Formulas in spreadsheets are a horror, and the larger the spreadsheet, the harder it is to get the data out of it. Plus, in the case of Excel you'll be fighting the differences between .xls and .xlsx anyway. Be aware that a manager's decision to update Microsoft Office can break tons of code if you rely heavily on Excel files. So: save as csv, and keep these csv files as small as possible. – Joris Meys Oct 25 '10 at 16:33
  • After spending a significant part of the last 24 years of my career coping with data transmitted in spreadsheets and managing substantial databases, I must respectfully disagree. There is no such thing as "control" over spreadsheets (whether .xls, .xlsx, .wks, .wb*, etc) or even csv files. The presence of redundant information in such files--even when they're available only in printed form--has many times resurrected some fairly large databases (100k+ records). Every time this happens I (and my clients) have been grateful for the redundancies. – whuber Oct 25 '10 at 17:14
  • @whuber: We do data control with extra scripts, looking for impossible values/outliers/odd cases. That's what I mean by the data control step. This is industry standard, by the way, in companies like SGS and others doing analyses of clinical trials etc. The redundant information that is needed is kept in separate databases. If one of them fails, the other is needed for the resurrection. In case you don't have a decent backup system, that is... – Joris Meys Oct 25 '10 at 20:56
3

VisTrails: A Python-Based Scientific Workflow and Provenance System. This talk, given at PyCon 2010, has some good ideas; it is worth listening to even if you are not interested in using VisTrails or Python. In the end, I think you should require that there be a clearly documented way to reproduce the data, and require some validation that people actually can.

Quoting:

"In this talk, we will give an overview of VisTrails (http://www.vistrails.org), a python-based open-source scientific workflow that transparently captures provenance (i.e., lineage) of both data products and the processes used to derive these products. We will show how VisTrails can be used to streamline data exploration and visualization. Using real examples, we will demonstrate key features of the system, including the ability to visually create information processing pipelines that combine multiple tools and Iibraries such as VTK, pylab, and matplotlib. We will also show how VisTrails leverages provenance information not only to support result reproducibility, but also to simplify the creation and refinement of pipelines."

naught101
Vincent
  • Python isn't widespread in our organisation but it looks like an interesting project. I'll see if I can pinch some ideas on how things should be done from their documentation. – Richie Cotton Oct 22 '10 at 17:02
2

I just came across this webpage hosted by ICPSR on data management plans. Although I think the goals of ICPSR will be somewhat different from yours (e.g. they are heavily interested in making data easy to disseminate without violating confidentiality), I imagine they have useful information for businesses. In particular, the advice on creating metadata seems to me to be universal.

Andy W
2

At much smaller scales, I have used Dropbox for sharing/syncing a copy of the data files (and scripts and results) with other researchers/collaborators (I wrote about it here).

The other tool I have used is Google Docs, for collecting and sharing data (about which I wrote here).

Tal Galili
0

Dropbox + packrat is nice for sharing files with backup/versioning.

Then you load those files (after automated canonicalization/massage) into a database and do the analyses off the cleaned-up data. Put the scripts that automate the Extract-Transform-Load cycle under version control (or at least in a separate Dropbox folder with the packrat option...).

When your database server eventually crashes (or needs to be sharded or whatever), you have a pipeline for moving data from people-friendly formats (Excel, web forms, etc.) to analysis-friendly ones (typically normalized and constrained, always cleaned up).
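A bare-bones sketch of such an E-T-L pipeline, using pandas and SQLite purely as stand-ins (the paths, column names and cleaning rules here are invented):

```python
# Bare-bones E-T-L sketch: extract the people-friendly file, canonicalise it,
# load it into an analysis-friendly database. All names here are invented.
import sqlite3

import pandas as pd

# Extract: the raw, people-friendly file (e.g. a csv dump of a spreadsheet)
raw = pd.read_csv("dropbox/raw_results.csv")

# Transform: canonicalise column names and units before loading
raw.columns = [c.strip().lower().replace(" ", "_") for c in raw.columns]
raw["unit"] = raw["unit"].str.strip().str.lower()

# Load: append the cleaned-up rows to the analysis-friendly copy
conn = sqlite3.connect("warehouse.db")
raw.to_sql("results", conn, if_exists="append", index=False)
conn.close()
```

Keeping this script under version control gives you the reproducible, auditable path from spreadsheet to analysis that the question asks about.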

That "E-T-L" phase is from data warehousing. And if you're not building an online transaction processing system, you're probably building a data warehouse. So embrace it and take advantage of what people have learned from building those for the past 30 years.

Have fun.

naught101
Tim