libsvm data format

Question

I'm using the libsvm (http://www.csie.ntu.edu.tw/~cjlin/libsvm/) tool for support vector classification. However, I'm confused about the format of the input data.

From the README:

The format of training and testing data file is:
<label> <index1>:<value1> <index2>:<value2> ...
.
.
.
Each line contains an instance and is ended by a '\n' character. For classification, <label> is an integer indicating the class label (multi-class is supported). For regression, <label> is the target value which can be any real number. For one-class SVM, it's not used so can be any number. The pair <index>:<value> gives a feature (attribute) value: <index> is an integer starting from 1 and <value> is a real number. The only exception is the precomputed kernel, where <index> starts from 0; see the section of precomputed kernels. Indices must be in ASCENDING order. Labels in the testing file are only used to calculate accuracy or errors. If they are unknown, just fill the first column with any numbers.

I have the following questions:

What is the use of the <index>? What purpose does it serve?
Is there a correspondence between the same index values of different data instances?
What if I miss/skip an index in between?

I ask because in the data file *heart_scale*, which is included in the libsvm package, the indices on line 12 start from 2. Is the <value> for index 1 taken as unknown/missing? Note: the tools/checkdata.py tool provided with the package says that the *heart_scale* file is correct.

score 25 · Answer 1 · answered Jul 27 '13 at 20:45

This link should help: http://www.csie.ntu.edu.tw/~cjlin/libsvm/faq.html#/Q3:_Data_preparation

The FAQ mentions that the data is stored in a sparse array/matrix form. Essentially, only the non-zero values are stored, and any missing value is taken to be zero. For your questions:

a) The index merely distinguishes between the features/parameters. In terms of a hyperspace, it designates each component. E.g., in 3-D (3 features), indices 1, 2, 3 would correspond to the x, y, z coordinates.

b) The correspondence is merely mathematical: the same index refers to the same coordinate in every instance, and these coordinates are used when constructing the hyperplane.

c) If you skip one in between, it should be assigned a default value of zero.

In short, +1 1:0.7 2:1 3:1 translates to:

Assign to class +1, the point (0.7,1,1).
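To make the sparse-to-dense reading concrete, here is a minimal Python sketch. The function name parse_libsvm_line and the n_features argument are illustrative assumptions, not part of libsvm; the sketch only shows how a skipped index ends up as zero.

```python
def parse_libsvm_line(line, n_features):
    """Expand one sparse libsvm line into (label, dense feature list).

    Hypothetical helper for illustration; not a libsvm function.
    """
    tokens = line.split()
    label = float(tokens[0])
    dense = [0.0] * n_features  # any index not listed stays at zero
    for token in tokens[1:]:
        index_text, value_text = token.split(":")
        index = int(index_text)  # indices in the file start from 1
        dense[index - 1] = float(value_text)
    return label, dense


# All three indices present: the point (0.7, 1, 1) in class +1.
print(parse_libsvm_line("+1 1:0.7 2:1 3:1", 3))

# Index 1 skipped, as on line 12 of heart_scale: its value is read as 0.
print(parse_libsvm_line("-1 2:1 3:0.5", 3))
```

The second call returns a first coordinate of 0.0, which is why a file whose line starts at index 2 is still valid.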

score 5 · Answer 2 · answered Dec 30 '14 at 16:28

Just a small, quick guide:

The LibSVM format requires that your data is already pre-processed. You need to know how many classes will be used (most likely 2) and what the feature space is.

A class is something like true/false or 0, 1, ... You need to transform it into an integer (e.g. 0, 1).

The feature space is the space of your multidimensional data. Each feature (a dimension of the vector) should have its own ID (index) and a value. E.g. 1:23.2 means that feature/dimension 1 has the value 23.2.

<label> <index1>:<value1> <index2>:<value2> ... <indexN>:<valueN>
...
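As a rough sketch of the pre-processing step, the following Python snippet turns an integer label and a dense list of feature values into one line of this format. The helper name to_libsvm_line is made up for illustration, and dropping zero-valued features is an assumption based on the format being sparse.

```python
def to_libsvm_line(label, values):
    """Format one instance as '<label> <index>:<value> ...'.

    Hypothetical helper for illustration; not a libsvm function.
    """
    pairs = [
        f"{index}:{float(value)!r}"
        for index, value in enumerate(values, start=1)  # indices start from 1
        if value != 0  # zero-valued features are simply omitted
    ]
    return " ".join([str(label)] + pairs)


# Feature 1 has value 23.2, feature 2 is zero (omitted), feature 3 is 4.0.
print(to_libsvm_line(1, [23.2, 0.0, 4.0]))  # prints "1 1:23.2 3:4.0"
```

Because enumerate walks the values in order, the indices come out in ascending order, as the format requires.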

In the case of regression, you can replace <label> with the target value to model. - daruma, Nov 09 '21 at 07:21
