I'm using the libsvm (http://www.csie.ntu.edu.tw/~cjlin/libsvm/) tool for support vector classification. However, I'm confused about the format of the input data.
From the README:
The format of training and testing data file is:
<label> <index1>:<value1> <index2>:<value2> ... . . .
Each line contains an instance and is ended by a '\n' character. For classification, <label> is an integer indicating the class label (multi-class is supported). For regression, <label> is the target value which can be any real number. For one-class SVM, it's not used so can be any number. The pair <index>:<value> gives a feature (attribute) value: <index> is an integer starting from 1 and <value> is a real number. The only exception is the precomputed kernel, where <index> starts from 0; see the section of precomputed kernels. Indices must be in ASCENDING order. Labels in the testing file are only used to calculate accuracy or errors. If they are unknown, just fill the first column with any numbers.
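To make my reading of the format concrete, here is a minimal Python sketch of how I understand a single line would be split up. `parse_line` is a hypothetical helper of my own, not part of libsvm, and the sample line is made up rather than copied from *heart_scale*. It only records which indices are present on the line; it makes no claim about how libsvm itself treats the absent ones.

```python
def parse_line(line):
    """Split one line of the format into (label, {index: value})."""
    tokens = line.split()
    label = float(tokens[0])
    features = {}
    previous_index = None
    for token in tokens[1:]:
        index_text, value_text = token.split(":", 1)
        index = int(index_text)
        # The README says indices must be in ascending order.
        if previous_index is not None and index <= previous_index:
            raise ValueError("indices must be in ascending order")
        features[index] = float(value_text)
        previous_index = index
    return label, features


# Made-up line whose first index is 2, like the case I am asking about.
label, features = parse_line("+1 2:1 3:0.5 6:-0.25\n")
print(label, features)
```

With this reading, index 1 simply never appears in the parsed result for such a line, which is exactly the situation my questions are about.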
I have the following questions:
- What is the use of the <index>? What purpose does it serve?
- Is there a correspondence between the same index values of different data instances?
- What if I miss/skip an index in between?
I ask because in the data file *heart_scale*, which is included in the libsvm package, the indices on line 12 start from 2. Is the <value> for index 1 taken as unknown/missing?
Note: the tools/checkdata.py tool provided with the package says that the *heart_scale* file is correct.