Is it OK to mix categorical and continuous data for SVM (Support Vector Machines)?

Question

I have a dataset like this:

+--------+------+-------------------+
| income | year |        use        |
+--------+------+-------------------+
|  46328 | 1989 | COMMERCIAL EXEMPT |
|  75469 | 1998 | CONDOMINIUM       |
|  49250 | 1950 | SINGLE FAMILY     |
|  82354 | 2001 | SINGLE FAMILY     |
|  88281 | 1985 | SHOP & HOUSE      |
+--------+------+-------------------+

I embed it into a vector space in LIBSVM format:

+1 1:46328 2:1989 3:1
-1 1:75469 2:1998 4:1
+1 1:49250 2:1950 5:1
-1 1:82354 2:2001 5:1
+1 1:88281 2:1985 6:1

Feature indices:

1 is "income"
2 is "year"
3 is "use/COMMERCIAL EXEMPT"
4 is "use/CONDOMINIUM"
5 is "use/SINGLE FAMILY"
6 is "use/SHOP & HOUSE"
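The encoding above can be sketched in a few lines of Python. This is only an illustration: the labels are copied from the LIBSVM lines above, and the names `rows`, `use_index` and `to_libsvm` are made up for the example. Each distinct "use" value gets its own index, starting at 3, in order of first appearance.

```python
# (label, income, year, use) for each row of the table above
rows = [
    (+1, 46328, 1989, "COMMERCIAL EXEMPT"),
    (-1, 75469, 1998, "CONDOMINIUM"),
    (+1, 49250, 1950, "SINGLE FAMILY"),
    (-1, 82354, 2001, "SINGLE FAMILY"),
    (+1, 88281, 1985, "SHOP & HOUSE"),
]

# Indices 1 and 2 are reserved for income and year; every distinct
# category gets the next free index in order of first appearance.
use_index = {}
for _, _, _, use in rows:
    if use not in use_index:
        use_index[use] = 3 + len(use_index)

def to_libsvm(label, income, year, use):
    # Only the active category is written; absent indices are implicitly 0.
    return f"{label:+d} 1:{income} 2:{year} {use_index[use]}:1"

lines = [to_libsvm(*row) for row in rows]
print("\n".join(lines))
```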

Is it OK to train a support vector machine (SVM) with a mix of continuous (year, income) and categorical (use) data like this?

You should spell out "SVM", at least once. – Peter Flom Feb 21 '13 at 01:28
Make sure you scale that data! – Patrick Caldon Feb 21 '13 at 02:16
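
As a rough illustration of the scaling comment: income is in the tens of thousands while the one-hot "use" features are 0 or 1, so the unscaled income would dominate any distance or kernel computation. Below is a minimal min-max scaling sketch in plain Python. The helper name `min_max_scale` is made up here, and the minimum and maximum should be computed on the training data and reused for new data. LIBSVM also ships an `svm-scale` tool for this purpose.

```python
incomes = [46328, 75469, 49250, 82354, 88281]
years = [1989, 1998, 1950, 2001, 1985]

def min_max_scale(values):
    """Linearly map values to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no information; map it to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

scaled_income = min_max_scale(incomes)
scaled_year = min_max_scale(years)
```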

Kyle. · Accepted Answer · 2013-02-21T03:01:00.917


Yes! But maybe not in the way you mean. In my research I frequently create categorical features from continuous-valued ones using an algorithm like recursive partitioning. I usually use this approach with the SVMLight implementation of support vector machines, but I've used it with LibSVM as well. You'll need to make sure each partitioned categorical feature is assigned to the same fixed position in your feature vector during both training and classification; otherwise the positions won't line up between the two and your model is going to end up jumbled.

Edit: That is to say, when I've done this, I assign the first n elements of the vector to the binary values associated with the output of recursive partitioning. In binary feature modeling, you just have a giant vector of 0s and 1s, so everything looks the same to the model unless you explicitly indicate where the different features are. This is probably overly specific, as I imagine most SVM implementations will do this on their own, but if you like to program your own, it might be something to think about!
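
One way to keep features in a fixed place is to build the name-to-index map once, at training time, and reuse it unchanged at classification time. The sketch below is an assumption about how that could look, not the code used for the answer: `FeatureIndexer` and the bin names such as `income<=50000` are hypothetical, and silently skipping names unseen during training is just one possible policy.

```python
class FeatureIndexer:
    """Hypothetical helper: maps feature names to fixed 1-based indices."""

    def __init__(self):
        self.index = {}

    def fit(self, names):
        # Assign each new name the next free index; existing names keep theirs.
        for name in names:
            if name not in self.index:
                self.index[name] = len(self.index) + 1
        return self

    def encode(self, active_names):
        # Emit "index:1" pairs in ascending index order (the order LIBSVM
        # expects); names not seen during fit() are skipped.
        pairs = sorted((self.index[n], 1) for n in active_names if n in self.index)
        return " ".join(f"{i}:{v}" for i, v in pairs)

# Build the map once from the training features...
indexer = FeatureIndexer().fit(
    ["income<=50000", "income>50000", "use/CONDOMINIUM", "use/SINGLE FAMILY"]
)
train_vec = indexer.encode(["use/SINGLE FAMILY", "income<=50000"])

# ...and reuse the same map when classifying new examples.
test_vec = indexer.encode(["income<=50000", "use/SINGLE FAMILY", "use/BARN"])
```

Because both vectors come from the same map, the same feature always lands at the same index regardless of the order in which the names are supplied.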

edited Feb 21 '13 at 03:01

answered Feb 21 '13 at 02:01


thanks Kyle, can you be a little more specific? What do you mean "assign your partitioned categorical features to a specific place"? – Seamus Abshere Feb 21 '13 at 02:42
@SeamusAbshere No problem! I edited my answer to address this! – Kyle. Feb 21 '13 at 03:01
I feel like I've heard that libsvm does what you're talking about automatically - any thoughts? – Seamus Abshere Feb 21 '13 at 15:31
@SeamusAbshere I imagine you're right, but I don't know for sure. Now that I think about it, I'm not sure how it could work any other way. – Kyle. Feb 21 '13 at 15:51
Emboldened by @Kyle's answer, I wrote a Ruby library ([VectorEmbed](https://github.com/seamusabshere/vector_embed)) that does this conversion (embedding) automatically, both for categorical (using Murmur32 hashes) and continuous data. It outputs libsvm-formatted files. – Seamus Abshere Mar 29 '13 at 00:38
