I second all the answers given already, but let's call a spade a spade: in many workplaces it is nearly impossible to convince management that investing in "exotic" software tools (exotic to them, that is) is necessary, let alone hiring somebody who could set them up and maintain them. I have told quite a few clients that they would benefit greatly from hiring a statistician with a thorough background in software and databases, but "no can do" is the general response.
So as long as that ain't going to happen, there are some simple things you can do with Excel that will make life easier. And the first of these is without doubt version control. More info on version control with Excel can be found here.
Some things about using Excel
People using Excel very often like its formula features. Yet, in my experience, formulas are the most important source of errors within Excel sheets, and of problems when trying to read Excel files into other software. I refuse to work with sheets containing formulas.
I also force everybody I work with to deliver their Excel sheets in a plain format, meaning that:
- The first row contains the names of the different variables
- The spreadsheet starts in cell A1
- All data is put in columns, without interruptions and without formatting.
- If possible, the data is saved in .csv format as well. It's not difficult to write a VBA script that extracts the data, reformats it and writes it to a .csv file. This also allows for better version control, as you can make a .csv dump of the data every day (see the sketch after this list).
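The dump does not have to be done in VBA either; here is a minimal R sketch of the same idea, assuming the readxl package and a made-up file name, that reads a plain-format sheet and writes a dated .csv for version control.

```r
# Minimal sketch (file and sheet names are hypothetical): read the
# plain-format sheet and write a dated .csv dump for version control.
library(readxl)

dat <- read_excel("measurements.xlsx", sheet = 1)  # header in row 1, data from A1

dir.create("csv_dumps", showWarnings = FALSE)
out <- file.path("csv_dumps", paste0("measurements_", Sys.Date(), ".csv"))
write.csv(dat, out, row.names = FALSE)
```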
If there is a general structure that the data always has, it might be good to develop a template with underlying VBA macros to add data and generate the dataset for analysis. In general this avoids every employee coming up with his or her own "genius" system of data storage, and it allows you to write your code around that fixed structure.
This said, if you can convince everybody to use a SQL database (and a front end for entering data), you can link R directly to it. This will greatly increase performance.
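As a sketch of what "linking R directly" can look like, here is a minimal example using the DBI and RSQLite packages; the table and column names are made up, and for a server database you would swap in the appropriate driver (e.g. odbc).

```r
# Minimal sketch: query the database from R instead of passing spreadsheets around.
# Database, table and column names are hypothetical.
library(DBI)

con <- dbConnect(RSQLite::SQLite(), "lab.sqlite")

# Pull only the columns needed for the analysis
dat <- dbGetQuery(con, "SELECT subject, group_id, response FROM measurements")

dbDisconnect(con)
```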
Data structure and management
As a general rule, the data stored in databases (or Excel sheets, if they insist) should be the absolute minimum, meaning that any variable that can be calculated from other variables should not be contained in the database. Mind you, sometimes it can be beneficial to store those derived or transformed variables as well, if the calculations are tedious and take a long time. But these should be stored in a separate database, if necessary linked to the original one.
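To illustrate why derived variables rarely need to be stored: they can usually be recomputed in one line at analysis time. A minimal sketch with made-up column names:

```r
# Minimal sketch: recompute a derived variable from the raw columns at
# analysis time instead of storing it. Column names are hypothetical.
dat <- data.frame(
  weight_kg = c(70, 82, 65),
  height_m  = c(1.75, 1.80, 1.62)
)

dat$bmi <- dat$weight_kg / dat$height_m^2
```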
Thought should also be given to what is considered one case (and hence one row). As an example, people tend to produce time series by making a new variable for each time point. While this makes sense in an Excel sheet, reading in these data demands quite some flipping around of the data matrix. The same goes for comparing groups: there should be one group indicator and one response variable, not a response variable for each group. This way data structures can be standardized as well.
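For the curious, that "flipping around" is a wide-to-long reshape. A minimal sketch using the tidyr package on a made-up data set with one column per time point:

```r
# Minimal sketch: reshape a "wide" sheet (one column per time point) into
# the "long" format most analyses expect. Column names are hypothetical.
library(tidyr)

wide <- data.frame(
  subject = 1:3,
  t1 = c(5.1, 4.8, 5.3),
  t2 = c(5.4, 5.0, 5.6),
  t3 = c(5.9, 5.2, 6.1)
)

long <- pivot_longer(
  wide,
  cols      = c(t1, t2, t3),
  names_to  = "time",
  values_to = "response"
)
# 'long' now has one row per subject-by-time-point combination,
# with columns subject, time and response.
```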
A last thing I run into frequently is the use of different units of measurement: lengths are given in meters or centimeters, temperatures in Celsius, Kelvin or Fahrenheit, ... Any front end or template should indicate the unit in which each variable is measured.
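If the unit is recorded next to the value, the conversion to a single unit can happen once at import time. A minimal sketch with hypothetical columns 'temp' and 'temp_unit':

```r
# Minimal sketch: convert all temperatures to Celsius at import time.
# The data and the columns 'temp' and 'temp_unit' are hypothetical.
dat <- data.frame(
  temp      = c(21.5, 70.7, 294.6),
  temp_unit = c("C", "F", "K")
)

to_celsius <- function(value, unit) {
  switch(unit,
         "C" = value,
         "F" = (value - 32) * 5 / 9,
         "K" = value - 273.15,
         stop("Unknown temperature unit: ", unit))
}

dat$temp_c <- mapply(to_celsius, dat$temp, dat$temp_unit)
```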
And even after all these things, you still want a data control step before you actually start the analysis. Again, this can be any script that runs daily (e.g. overnight) on new entries and flags problems immediately (out of range, wrong type, missing fields, ...) so they can be corrected as fast as possible. If you have to return to an entry that was made 2 months ago to find out what is wrong and why, you had better have some good "Sherlock skills" to correct it.
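Such a check does not need to be fancy. A minimal R sketch that flags out-of-range, missing and mistyped entries; the file name, column names and allowed ranges are all made up and would come from whatever your template enforces:

```r
# Minimal sketch of a daily data-control step. File name, column names and
# expected ranges are hypothetical.
dat <- read.csv("csv_dumps/measurements_latest.csv", stringsAsFactors = FALSE)

checks <- list(
  "missing response"      = is.na(dat$response),
  "response out of range" = !is.na(dat$response) & (dat$response < 0 | dat$response > 100),
  "unknown group"         = !dat$group_id %in% c("A", "B")
)

# Named list: each check maps to the offending row numbers
flagged <- lapply(checks, which)
flagged <- flagged[sapply(flagged, length) > 0]
print(flagged)
```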
my 2 cents