What are the most common machine learning algorithms applied to binary categorical data?

Question

I am developing a set of medical diagnostic procedures that will be assessed using binary categorical variables. I want to assess the relative importance of these criteria so that we can focus our treatments on those that have the highest impact on the patient's overall health.

Mock Example of Our Data

This is an example of the kind of data we would collect. On a given day, we would evaluate the patient based on a set of binary evaluation criteria (aka "metrics").

Our Interest in Machine Learning

What we want to do is start to understand correlations and relationships between the metrics so we can prioritize our treatments. The work we do is an advanced form of physical therapy. We tailor our exercise program to the improvements we see in the patient. We experiment with different exercises to find combinations that maximize the total number of metrics the patient tests positive for. But I don't think this is the most efficient way to improve patient health, because the number of metrics they test positive for is not the most important factor. Some of the metrics are clearly more important than others, based on our theoretical understanding and training. But actually finding this in the data has proved hard to do just by eyeballing tables of 1s and 0s. Computing Pearson correlations is easy but insufficient for identifying patterns systematically. From what I have read, I think a machine learning, algorithmic approach would be substantially more effective at identifying effective treatments.

In what sense is our problem binary

Although we use binary features, this isn't a binary classification problem I think. Health is not a binary category for us. The patient isn't considered healthy unless they test positive on all the metrics we use. So simply saying they are healthy or unhealthy isn't a useful problem to solve because we already have a way to diagnose this.

I think our goal is to use machine learning to help better identify degrees of "health" by clustering criteria that seem to influence each other. At the moment, we are using just binary features. In our work, numerical features (i.e. defined over $\Bbb R$) generally don't work well because it's hard to quantify attributes of the patient in numerical terms that are actually useful for predicting treatments. Graded/ordinal metrics also aren't great because it is hard to know how to define the magnitudes of the scale. So binary metrics are often the most useful.

What I'm Looking For

I was thinking of testing out code for machine learning algorithms applicable to binary features. I figure if I find some examples to start with, I can experiment a bit and test out which ones might be most useful for our purposes. But I'm having trouble narrowing down what my options are. Often when I search for "binary machine learning" I get "binary classification", which I don't think is what I want. Decision trees look plausible, but I'm not sure what kind I should be looking for given how many kinds there are.

Key Properties to Keep in Mind

Binary Features
Unsupervised learning
Features are not independent (in probability sense) and there will be correlations between them.
I may be looking for something related to "feature selection"

My Question

What are the most common machine learning algorithms applied to binary categorical data?

This may be too subjective, in which case I'll delete it if asked.

Not sure if this is a duplicate of the following question: http://stats.stackexchange.com/questions/2234/alternatives-to-logistic-regression-in-r/235144#235144 — Ferdi, Oct 13 '16 at 07:53
Could you explain how they are related? I don't see binary mentioned there. Mine could be a duplicate, but I don't see how they're related at first glance. — Stan Shunpike, Oct 13 '16 at 07:55
A logistic regression is basically a binary classifier. E.g. take 0 = unemployed and 1 = having a job. Then a logistic regression estimates the probability that a certain person has a job rather than being unemployed, using several predictors, e.g. age, sex, education. The dependent variable is binary categorical data. — Ferdi, Oct 13 '16 at 08:00
What is the exact structure of the output variable you're trying to predict? Is it binary (e.g. $y_i \in \{0, 1\}$)? Is it a categorical variable (e.g. $y_i \in \{0, 1, 2, 3, 4\}$)? Is a "binary categorical variable" simply a binary variable? Is this what you're trying to forecast or what you're using to forecast? — Matthew Gunn, Oct 13 '16 at 08:03
I'd like a ranking of which evaluation criteria are most important. That's kind of what I'm envisioning. A way to prioritize which evaluation criteria are best at predicting the behaviors of the others. — Stan Shunpike, Oct 13 '16 at 08:08
Each criterion is binary. Examples: "Can the patient lift their arms over their head?" "Does the patient have complete range of motion of their shoulder socket?" — Stan Shunpike, Oct 13 '16 at 08:09
@Ferdi but all of my criteria are binary. Does that mean the inputs to the logistic regression can be binary? — Stan Shunpike, Oct 13 '16 at 08:11
What exactly is binary? Your dependent variable or your independent variable? If your dependent variable is binary you can use the algorithms in the link I already sent you no matter if your independent variables are binary or not binary — Ferdi, Oct 13 '16 at 08:15
So for each individual $i$ you have $k$ binary variables $x_1, \ldots, x_k$? What are you trying to predict? Are you trying to predict $x_j$ given all the other $x_l$, $l \neq j$? I don't really understand the objective. — Matthew Gunn, Oct 13 '16 at 08:15
Stan - please move all your clarifications into the main question. In addition, just lay out exactly what your medical problem is (and what data has been collected) and give samples of your data (what are the inputs, what are the desired outputs). Almost any ML algorithm is applicable to categorical data (logistic regression, trees, random forests, neural networks, SVMs, etc.). But if you explain your problem more, one can narrow it down. E.g. how much data do you have? Have you collected it through an experimental procedure or is it more 'observational'... — seanv507, Oct 13 '16 at 08:20
Hmmm ok I will have to do that tomorrow. I will provide more complete details after I think through how to answer all these points that have been raised, and update the question. — Stan Shunpike, Oct 13 '16 at 08:36
I don't think this is too broad. I think the apparent "broadness" came from answering specific requests made in this comment section by folks with O(1k) to O(10k) reputation. I think that good links to logistic regression, and generalised linear models with multivariate binomial and continuous independent variables might be really helpful here. — EngrStudent, Jun 14 '18 at 12:11

score 0 · Answer 1 · answered Oct 14 '16 at 12:10

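Consider it as a 2-class problem (healthy/unhealthy). To check the importance of each variable, a $\chi^2$ test on the 2 × 2 table of that variable against the class would be very good; McNemar’s test is the counterpart when the two binary measurements are paired (taken on the same patients). A p-value of less than 0.05 is conventionally taken as significant, though with many metrics tested you should keep multiple comparisons in mind. https://www.researchgate.net/publication/5883796_Which_is_the_correct_statistical_test_to_use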
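As a rough illustration of the test on a 2 × 2 table, here is a minimal sketch in plain Python. The data and the names `metric` and `healthy` are made up for the example, and it computes the uncorrected Pearson statistic (no continuity correction), so treat it as a sketch rather than a replacement for a statistics package.

```python
import math

def chi2_2x2(x, y):
    """Pearson chi-squared test of independence for two 0/1 sequences.

    Returns (statistic, p_value). No continuity correction is applied,
    and every row and column of the table must be non-empty.
    """
    # counts[a][b] = number of observations with x == a and y == b
    counts = [[0, 0], [0, 0]]
    for a, b in zip(x, y):
        counts[a][b] += 1
    total = len(x)
    row = [sum(counts[0]), sum(counts[1])]
    col = [counts[0][0] + counts[1][0], counts[0][1] + counts[1][1]]

    stat = 0.0
    for a in (0, 1):
        for b in (0, 1):
            expected = row[a] * col[b] / total
            stat += (counts[a][b] - expected) ** 2 / expected

    # With 1 degree of freedom, P(chi2 > s) = erfc(sqrt(s / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value

# Hypothetical data: 60 patients, one binary metric and a binary outcome.
metric = [1] * 30 + [0] * 30
healthy = [1] * 25 + [0] * 5 + [1] * 10 + [0] * 20

stat, p_value = chi2_2x2(metric, healthy)
print(f"chi2 = {stat:.2f}, p = {p_value:.2g}")
```

Running this once per metric gives one p-value per metric, which is where the multiple-comparisons issue comes in.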

Here a standard decision tree is a good choice, but make sure to prune it to avoid overfitting the model. A more appropriate choice would be a random forest, which is an ensemble (collection) of many decision trees.
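A minimal sketch of both options with scikit-learn, assuming synthetic data in place of the real patient table. The target here is invented (a noisy copy of metric 0) purely so there is something to rank, and the pruning settings (`max_depth=3`, `ccp_alpha=0.01`) are arbitrary illustrative values, not recommendations.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in for the data: 300 patients, 6 binary metrics.
n_patients, n_metrics = 300, 6
X = rng.integers(0, 2, size=(n_patients, n_metrics))

# Hypothetical binary target: metric 0, flipped for roughly 10% of patients.
flip = rng.random(n_patients) < 0.1
y = np.where(flip, 1 - X[:, 0], X[:, 0])

# A single tree, kept small via a depth limit and cost-complexity pruning.
tree = DecisionTreeClassifier(max_depth=3, ccp_alpha=0.01, random_state=0)
tree.fit(X, y)

# A random forest: many trees fit on bootstrap samples, then averaged.
forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X, y)

# Impurity-based importances sum to 1; sort metrics from most to least important.
importances = forest.feature_importances_
ranking = np.argsort(importances)[::-1]
for idx in ranking:
    print(f"metric {idx}: importance {importances[idx]:.3f}")
```

The same pattern could be repeated with each metric in turn as the target and the remaining metrics as inputs, which is closer to the "which criteria predict the others" ranking asked about in the comments.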

score 0 · Answer 2 · answered May 08 '18 at 05:13

I'm not sure I entirely get what you want to do, but if you want to cluster the patients (using a clustering algorithm), your biggest problem will be the curse of dimensionality. To avoid that, you should use a dimension reduction algorithm. I particularly like autoencoders, especially when I have binary features. You start with a large number of binary inputs and end up with a smaller number of continuous inputs, which allows you to use traditional clustering methods (k-means or others).
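A rough sketch of that pipeline, assuming scikit-learn and synthetic data. It uses `MLPRegressor` trained to reconstruct its own input as a stand-in for a proper autoencoder (a dedicated deep learning library with a sigmoid output and cross-entropy loss would be the more usual choice for binary inputs). The group structure, the bottleneck size of 3, and the number of clusters are all assumptions made for the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

# Synthetic data: 3 hidden patient groups, each with its own probability
# of testing positive on each of 12 binary metrics.
n_patients, n_metrics, n_groups = 300, 12, 3
group = rng.integers(0, n_groups, size=n_patients)
pass_prob = rng.uniform(0.1, 0.9, size=(n_groups, n_metrics))
X = (rng.random((n_patients, n_metrics)) < pass_prob[group]).astype(float)

# One hidden layer of 3 units acts as the bottleneck; the network is
# trained to reproduce X from X.
autoencoder = MLPRegressor(
    hidden_layer_sizes=(3,),
    activation="logistic",
    max_iter=2000,
    random_state=0,
)
autoencoder.fit(X, X)

# Encoder half of the network: input -> hidden layer activations.
W, b = autoencoder.coefs_[0], autoencoder.intercepts_[0]
codes = 1.0 / (1.0 + np.exp(-(X @ W + b)))  # shape (n_patients, 3), values in (0, 1)

# Ordinary k-means on the low-dimensional continuous codes.
labels = KMeans(n_clusters=n_groups, n_init=10, random_state=0).fit_predict(codes)
print(np.bincount(labels))
```

In practice the bottleneck size and the number of clusters would need to be chosen by looking at reconstruction error and cluster quality on the real data.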
