5

I am a medical student working on a project that requires a large amount of data to be entered manually by data entry operators. I am currently using Epi Info, the traditional tool used by epidemiologists, but I wanted to know if anyone could recommend some good alternatives. They need not be free/open source, though that is preferred.

The data has already been collected, but it's all on paper, spanning over five years. It contains ~22 fields for many individuals, by which I mean >100,000. I'm from India, and although I'm in a medical college, this project is solely our own endeavor (i.e., self-funded). A double data entry proposal is on the table, but a final call on check mechanisms will be taken only after the pilot study.

All of the data is handwritten (literally scribbled) and is inconsistent in many places. I have almost finished designing the whole form in Epi Info; it uses drop-down boxes, radio buttons, and check boxes, and takes advantage of Check Code wherever required. My first concern with Epi Info is geocoding: the present version, EI7, promises to incorporate geocoding, but I could not get it to work, even after getting separate keys from Bing. As a potential solution I would use the Google API, although that limits queries to ~2500/day/IP (yes, there are workarounds).

Secondly, Epi Info is painfully slow, and I am finding it hard to place confidence in it for such a large dataset, especially since merging will be required and the data will have to be reimported after editing it in Excel (geocoding and cleaning up other discrepancies). Personally, I think the latest version has a very good Dashboard for data analysis, but many among us are in favour of SPSS.

Lastly, we are not in favour of any web-based application (of course, because we don't have internet access everywhere here, and it would have to be fast as well); an offline solution is required.

Ankush
  • Are you looking for data entry software (database) or statistical analysis software? Be more specific and perhaps we can help. – pmgjones May 01 '12 at 11:11
  • Yes, I'm looking for data entry software that would let me create operator-friendly data entry forms which, once filled in, would in turn generate a database in an Excel sheet. – Ankush May 01 '12 at 11:34
  • Can you give us a better idea of what kind of data you're collecting? Is this cross-sectional? Longitudinal? Are you collecting many fields for a few individuals, a few fields for many individuals, many for many? How massive is massive? What country are you in, and which institutions are you affiliated with? Are you doing double data entry? We need so much more info! – Matt Parker May 01 '12 at 15:25
  • 1
    The data has already been collected, but it's all on paper, spanning over five years. It contains ~22 fields for many individuals, by which I mean >100,000. I'm from India, and although I'm in a medical college, this project is solely our own endeavor (i.e., self-funded). A double data entry proposal is on the table, but a final call on check mechanisms will be taken only after the pilot study. – Ankush May 01 '12 at 15:47
  • Can you tell us more about the data format, and edit the original question to include the information in there? The more details the better. For example, if the data is typed/printed on paper and present on consistent styles of forms, you will probably want to use one of the many form-scanning-OCR solutions out there. If it's handwritten or inconsistent then you will need more manual methods. – Jonathan May 01 '12 at 17:08
  • Also it would help to know why you are looking for an alternative to Epi Info. There are many potential tools and knowing more about your desires will help with making a recommendation. Is it too slow/costly? Not enough data validation? etc. – Jonathan May 01 '12 at 17:10
  • 1
    All of the data is handwritten (literally scribbled) and is inconsistent in many places. I have almost finished designing the whole form in Epi Info; it uses drop-down boxes, radio buttons, and check boxes, and takes advantage of Check Code wherever required. My first concern with Epi Info is geocoding: the present version, EI7, promises to incorporate geocoding, but I could not get it to work, even after getting separate keys from Bing. As a potential solution I would use the Google API, although that limits queries to ~2500/day/IP (yes, there are workarounds). – Ankush May 01 '12 at 17:39
  • 1
    Secondly, Epi Info is painfully slow, and I am finding it hard to place confidence in it for such a large dataset, especially since merging will be required and the data will have to be reimported after editing it in Excel (geocoding and cleaning up other discrepancies). Personally, I think the latest version has a very good Dashboard for data analysis, but many among us are in favour of SPSS. @Jonathan – Ankush May 01 '12 at 17:49

2 Answers

5

Very interesting question. The three tasks you have at hand (data entry, geocoding, and data analysis) are all things that can be done by one program or by three (or more) completely separate programs. This isn't an "answer", exactly, but I've outlined my experiences below.

Data Entry:

  • MS Access: The old standby. Build forms and enter data. Potentially hazardous if you have multiple users entering data simultaneously. I've used this for some small projects and would prefer to avoid it in the future - I had weird problems with records linked across tables, but your data sounds like you have just one table. You'd have to have sufficient licenses and computers with Windows. I find it slow, but most of my Access DBs are on a network drive across campus, so that's part of the problem.
  • SurveyMonkey: SurveyMonkey can make a reasonably good data entry tool for simple, form-based input - there are a few configuration tweaks necessary, but I've used this for entering a few thousand surveys. It's web-based, so you'd need a sufficiently reliable Internet connection, but otherwise you're using SurveyMonkey's hardware. Shouldn't be any problem with multiple simultaneous data entry, and it has several options for data export. You'd need at least the Select plan (US$17 per month) to get unlimited questions and respondents.
  • REDCap: Vanderbilt University runs a consortium around its REDCap software, which is purpose-built for research (including medical research). You have to be part of the consortium to use it, though, and I think most consortium members host their own servers - but you might be able to piggyback on someone else's.
  • A homebrew solution built on a web framework (e.g., Django or Rails): Provides maximum control, but also has the highest technical capacity requirements. I've played around with Django a bit and I think you could get a 22-field form up pretty quickly. I don't think 22 x 100k counts as "big" where Django is concerned.
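
To make the homebrew option a little more concrete for an offline setting, here is a minimal sketch. It is not Django: it swaps in Python's standard-library `sqlite3` and `csv` modules, which need no server or Internet connection. The table name, the three fields, and the allowed values are hypothetical stand-ins for the real form; the idea is only that database constraints can reject invalid entries at entry time and that the result can be exported to a CSV file that Excel opens.

```python
import csv
import io
import sqlite3

# Hypothetical schema: three illustrative fields standing in for the real form.
SCHEMA = """
CREATE TABLE IF NOT EXISTS records (
    record_id INTEGER PRIMARY KEY,
    sex       TEXT    NOT NULL CHECK (sex IN ('M', 'F')),
    age_years INTEGER NOT NULL CHECK (age_years BETWEEN 0 AND 120),
    address   TEXT    NOT NULL
)
"""


def open_db(path=":memory:"):
    """Open (or create) the database file and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def add_record(conn, record_id, sex, age_years, address):
    """Insert one record; return False if a constraint rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO records VALUES (?, ?, ?, ?)",
                (record_id, sex, age_years, address),
            )
        return True
    except sqlite3.IntegrityError:
        # Raised for a duplicate record_id or a failed CHECK constraint.
        return False


def export_csv(conn):
    """Return all records as CSV text with a header row."""
    cur = conn.execute("SELECT * FROM records ORDER BY record_id")
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow([col[0] for col in cur.description])
    writer.writerows(cur)
    return out.getvalue()
```

Passing a file path to `open_db` instead of the default `":memory:"` would keep the data on disk between sessions. A real tool would still need an entry screen on top of this, which is where most of the effort goes.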

Data Analysis:

Does Epi Info support writing code for analysis, or is it limited to the widgets and other menu-driven choices I saw in the tutorial video? Coding up an analysis is key for being able to reproduce results and find errors, so if Epi Info doesn't have that, SPSS would be an improvement. Better still, R - it's free, has a really robust community, and has packages that will let you do just about any kind of analysis you'd like. All three of these should be able to import data from whatever data entry option you choose, so don't worry about that.


Geocoding (and, I'm assuming, some mapping):

The maps in Epi Info looked nice! That was the segment of the intro video that really drew my attention. But there are many ways to geocode your addresses, and it may be easier to do data entry in your system of choice and then bulk-geocode the addresses afterward (that'll save you 100,000 clicks of that 'Get Coordinates' button, anyway). There are several options for that - Bing and Google, of course, and many others (check out the geocoding questions on our sister site, GIS.StackExchange). I think it would be especially worthwhile to check out bulk geocoders that explicitly refer to their ability to code in India - many geocoders (e.g., SmartyStreets) are nation-specific, and many others are just going to return crappy results. A number of geocoding APIs are available through packages in R, and there are a variety of packages available for mapping.
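
As a rough illustration of bulk geocoding after data entry under a daily quota (such as the ~2500/day limit mentioned in the question), the sketch below deduplicates addresses so repeated ones do not use up quota, then splits them into daily batches. It makes no calls to any real geocoding service: `geocode` is a hypothetical function supplied by the caller that wraps whichever service is chosen, and the function names here are made up for the example.

```python
from itertools import islice


def unique_addresses(addresses):
    """Drop blanks and duplicates (ignoring case and extra spaces), keeping order."""
    seen = set()
    result = []
    for addr in addresses:
        key = " ".join(addr.lower().split())
        if key and key not in seen:
            seen.add(key)
            result.append(addr.strip())
    return result


def daily_batches(addresses, daily_limit=2500):
    """Yield successive lists of at most daily_limit addresses."""
    it = iter(addresses)
    while True:
        batch = list(islice(it, daily_limit))
        if not batch:
            return
        yield batch


def geocode_batch(batch, geocode):
    """Apply a caller-supplied geocode(address) function to one day's batch."""
    return {addr: geocode(addr) for addr in batch}
```

Each day's batch can then be sent to the chosen service and the returned coordinates joined back to the records by address.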

So - I would definitely consider using Epi Info for only the things that it's doing well for you (the form creation and data entry looks nice, but if it's slow, what can you do?), and reaching out to other tools for the things it's not doing well. My ideal version of this would probably be double-data-entry in a Django database, automatic geocoding through an API or by sending a file to a service (whichever gets the most accurate results), and analysis and mapping in R.
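
Whichever tool is used for entry, the checking step of double data entry is simple to script. The following is a minimal sketch in Python, assuming each entry pass has been loaded into a dictionary keyed by a record ID; the function name, the field names, and the `<missing record>` marker are hypothetical. It lists every field where the two passes disagree so those records can be checked against the paper forms.

```python
def compare_entries(first_pass, second_pass):
    """Compare two entry passes (record_id -> {field: value}).

    Returns a list of (record_id, field, first_value, second_value) tuples.
    A record present in only one pass is reported once, with booleans
    showing which pass contains it.
    """
    discrepancies = []
    for record_id in sorted(set(first_pass) | set(second_pass)):
        a = first_pass.get(record_id)
        b = second_pass.get(record_id)
        if a is None or b is None:
            discrepancies.append(
                (record_id, "<missing record>", a is not None, b is not None)
            )
            continue
        for field in sorted(set(a) | set(b)):
            # Compare as trimmed strings so stray spaces are not flagged.
            va = str(a.get(field, "")).strip()
            vb = str(b.get(field, "")).strip()
            if va != vb:
                discrepancies.append((record_id, field, va, vb))
    return discrepancies
```

For example, if one operator typed an age of 51 and the other typed 15 for the same record, that record and field would appear in the returned list.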

Matt Parker
  • 1
    I've been using the `sp`, `spatstat` and `ggmap` R packages for a much smaller project and they've worked well. My concern would be how well Google covers the areas of India you want to geocode. – Wayne May 01 '12 at 20:24
  • Also, a quick search for open source forms found Orbeon (http://www.orbeon.com/forms/orbeon-forms), whose Community Edition might be useful. – Wayne May 01 '12 at 20:29
  • MS Access is not a solution: the data is indeed linked in some places, and most of us are not in favour of it. SurveyMonkey and REDCap are web-based solutions. SurveyMonkey is actually a very good tool if you are planning to collect data from the applicants directly, and yes, of course I will look into it. REDCap is also a very good solution, but a first look at the map on its homepage shows that no institution in India is using it. And even if I plan to become a partner, the page clearly states... – Ankush May 02 '12 at 00:21
  • "The technical core staff will need to be affiliated with your institution, not a third-party software vendor or service provider." The latter would have been my case, so this can't be a solution. Yes, I will look into it in future. Django has a learning curve, plus it's again web-based. Now, for geocoding I have found a solution: I will be using [GeodesiX](http://www.calvert.ch/geodesix/) with Excel, and because we will be geocoding alongside the data being entered every day, we can use two or three IPs to get around the request limit. For analysis, SPSS is a good option. – Ankush May 02 '12 at 00:35
  • So finally the question is if there is any offline form builder, which supports linking within and is capable of handling large amount of data and finally export it in excel! @Matt Parker – Ankush May 02 '12 at 00:38
  • @Ankush That sounds like Access, I'm afraid - the 2007 and 2010 versions should be able to handle the amount of data you've got. There's really not much else in that particular niche, though you might browse the answers to [this question](http://stackoverflow.com/questions/221826/alternatives-to-access). A couple of years ago I was setting up a multi-site data entry system that couldn't use the Internet (too many organizational policies to overcome), and I wound up using Access. It was okay, and would probably have been better if I were more skilled at Access. – Matt Parker May 02 '12 at 15:47
  • @Ankush Django does come with a light, built-in server, so you could distribute the application to run on disconnected computers and then merge the results together. You'd still have to code it up, of course... – Matt Parker May 02 '12 at 15:48
  • @MattParker Thanks a lot for all your help. Finally, I'm pinning all my hopes on Epi Info (28 fields), because the only other option would be MS Access, and I'm not as good at it as I am at Epi Info. I will try contacting the CDC through the Epi Info forums in case of trouble. It was a good learning experience about various options that I may consider in future for other projects. Now we have to look at other logistical problems, i.e., getting data entry operators, etc., and the rest after the pilot study. Thanks again for the personal research you did. Thanks to everybody else, and best wishes from India. – Ankush May 04 '12 at 10:10
-3

In my view, you should opt for MS Excel for this work; it will automatically generate the forms for entering the data based on your database structure.

cardinal
  • 1
    (-1) Please don't put email addresses in answers. Please consider reading the FAQ on how to formulate good answers. It would help for you to provide some firm justification for this recommendation. – cardinal May 01 '12 at 13:37
  • 2
    Excel is an awful recommendation for this (Access is defensible). I've never seen a form for excel that was more than a few items, and I have no idea what you are talking about with "excel will automatically generate forms", perhaps you should elaborate and point to some examples. – Andy W May 01 '12 at 13:45
  • I totally agree with Andy W: Excel isn't the recommended one, plus it doesn't automatically generate forms. You have to use the Visual Basic Editor, and that can be a learning curve. I know Epi Info pretty well, but I would like to know if there is a good alternative, even a reasonably priced one. – Ankush May 01 '12 at 14:47