Which algorithm should I use?

Question

I was reading many machine learning questions like this one but I am not sure how to apply them to my scenario. I come from a biology/medicine background, and my math knowledge is limited (last thing I learned was calculus long ago), and some of the explanations on this site go over my head.

The problem

Suppose someone fills out forms every day. The form contains various fields (patient information, medical history, job, etc.). Now suppose n patients come to the hospital each day (so at least n forms are filled out each day, one each time a patient arrives). Some parts of the form may never change (e.g. gender, race), but some parts may (e.g. the medical history, checkup time). I need to write a program that "learns" from this data so that I can anticipate when a patient is likely to arrive again. Then I need to determine whether the form is likely to change, based on how often it changed in the past. If the "important" parts (like the medical history) are unlikely to change, I need to generate a form in anticipation of the visit.

Things I've considered

With whatever limited knowledge I've got, I figured I'd need to calculate some kind of variance to determine how different each form is from other forms of each patient. Then I figured I'd be pushing the data through some unsupervised learning 'thing' to detect if some kind of pattern exists to the patients' arrivals, and if it does then I can use the past data on the variance to determine if a form should be generated in advance.

So, I assume I'll need multiple algorithms (different for each part of the question), and I've looked at Naive Bayes, Logistic Regression, Decision Trees, and SVMs, but I hit my head on complex math and nearly suffered an aneurysm (joking)... To begin this quest, which model/algorithm should I start with, and why does it work well for this scenario over the others? I would appreciate a more wordy (less math) answer.

"*some of the explanations on this site go over my head*" -- you're in good company -- quite a few of them go over mine, too. For each part of the form, for a given patient you essentially have a history of observations where each specific part either changed from the previous time or it didn't. It sounds like you're trying to develop some model that tries to determine the probability of change in each part at some given time $t$ based on the various characteristics, as well as the history of changes to date. — Glen_b, Dec 02 '14 at 04:40
@Glen_b I have a very similar question and instead of creating a new question for it, I thought I'd ask here given the similarity. Like the OP, I need to determine the probability of change at some time t. What would be the best way to go about it? — arao6, Dec 02 '14 at 23:09

score 4 · Accepted Answer · answered Dec 02 '14 at 06:38

This is a good question but not very focused, so it will be impossible to go into each aspect in depth. As general advice: start with simple models, e.g. linear ones like glm, logistic regression, linear SVM. These are fairly straightforward to understand, fast to train and typically come with fewer bells and whistles (= headaches) than their nonlinear counterparts.
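To make "start simple" a bit more concrete, here is a minimal sketch of logistic regression written in plain Python, so the moving parts are visible. Everything specific in it is an assumption made for illustration: the two features (years since the last visit, and the fraction of past visits at which the medical history changed), the made-up training rows, and the learning rate and number of passes. In practice you would more likely use an existing library implementation than write the fitting loop yourself.

```python
import math

def sigmoid(z):
    # Logistic function, written to avoid overflow for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def fit_logistic(X, y, lr=0.5, epochs=5000):
    """Fit weights and a bias by batch gradient descent on the log-loss."""
    n, n_features = len(X), len(X[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            err = p - yi  # gradient of the log-loss w.r.t. the linear score
            for j in range(n_features):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict_proba(w, b, x):
    # Estimated probability that the form field changes at the next visit.
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

# Made-up training data (assumption, not real patient data).
# Each row: [years since last visit, fraction of past visits where history changed]
X = [[0.05, 0.0], [0.1, 0.1], [0.15, 0.0], [0.2, 0.2],
     [0.5, 0.5], [0.7, 0.6], [0.8, 0.4], [1.0, 0.8]]
y = [0, 0, 0, 0, 1, 1, 1, 1]  # 1 = medical history changed at the next visit

w, b = fit_logistic(X, y)
p_low = predict_proba(w, b, [0.1, 0.1])   # recent visit, stable history
p_high = predict_proba(w, b, [0.8, 0.6])  # long gap, history changes often
print(w, b, p_low, p_high)
```

The fitted weights are what make this kind of model readable: a positive weight on a feature means larger values of that feature push the predicted probability of a change upward.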

An important question to answer before deciding on a method is whether you only care about predictions or whether you want to understand them. This is essentially a choice between white box models that are interpretable (such as logistic regression) or black box models that are designed to offer better predictive performance but without being easily interpretable. Most people with a medical background want white box models, but their usefulness depends entirely on the task at hand. Based on your question, you seem to focus on making accurate predictions.

An important aspect of building any model is having enough data of sufficient quality. Some of the things you want to do require a lot of data. For instance, if you want to predict how forms change for a given patient, this is inevitably learned from previously seen similar patients. Predicting detailed changes will be difficult and would require thousands of patients (since you need enough similar ones to learn the underlying patterns). Another question is whether you can reasonably make such predictions at all, e.g. does medical history provide enough insight to project what will happen in the future? This would probably be possible in some highly specific cases (for instance a diabetic who develops cardiovascular issues), but not in general.

As a quick primer, some keywords that may be useful:

If you want to predict some discrete outcome with a finite number of values, you are dealing with classification. Typically, methods deal with binary classification (2 classes) but are extendable to more than two.
If you want to predict some continuous outcome, you are doing regression. For instance, predicting the time before a patient returns.
If you only want to learn some structure within the data, you are doing clustering. For instance, do groups of similar patients exist? (This looks at all of the data, not at one specific patient.)
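As a rough illustration of how the first two keywords map onto the form problem, the sketch below turns one patient's visit log into a regression target (days until the next visit) and a classification target (did a given field change since the previous visit). The visit log, the field names and the helper functions are all hypothetical, and the "predictions" are only naive baselines (an average and a frequency), not a trained model. Clustering is not shown.

```python
from datetime import date

# Hypothetical visit log for one patient: (visit date, form contents).
visits = [
    (date(2014, 1, 6), {"sex": "F", "job": "teacher", "history": "none"}),
    (date(2014, 2, 5), {"sex": "F", "job": "teacher", "history": "none"}),
    (date(2014, 3, 7), {"sex": "F", "job": "teacher", "history": "diabetes"}),
    (date(2014, 4, 6), {"sex": "F", "job": "nurse", "history": "diabetes"}),
]

def days_between_visits(visits):
    # Regression target: number of days from each visit to the next one.
    dates = [d for d, _ in visits]
    return [(later - earlier).days for earlier, later in zip(dates, dates[1:])]

def change_labels(visits, field):
    # Classification target: 1 if the field differs from the previous form, else 0.
    forms = [form for _, form in visits]
    return [int(prev[field] != cur[field]) for prev, cur in zip(forms, forms[1:])]

gaps = days_between_visits(visits)
history_changed = change_labels(visits, "history")

# Naive baselines: average gap, and how often each field changed in the past.
mean_gap = sum(gaps) / len(gaps)
change_rate = {}
for field in visits[0][1]:
    labels = change_labels(visits, field)
    change_rate[field] = sum(labels) / len(labels)

print(gaps)             # [30, 30, 30]
print(history_changed)  # [0, 1, 0]
print(mean_gap, change_rate)
```

Baselines like these are worth computing first: a classifier or regressor is only useful here if it clearly beats "assume the usual gap" and "assume each field changes as often as it has before".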

Great answer. Like the OP, I'm also not very good at math. Do you know of any applied examples on the net of those three types? I seem to find a lot of research papers and math-heavy jargon but nothing with simple descriptions for beginners. — arao6, Dec 03 '14 at 01:17
