I'm doing text categorization in R with the SVM from the e1071 package. I have around 30,000 text files for training and 10,000 for testing. The goal is to categorize these files hierarchically. For example, there are 13 categories at level 1, such as sports, literature, politics, etc., and more than 300 categories at level 2. For instance, under the sports category there are sub-categories like football, basketball, rugby, etc.
There are two strategies for reaching the level-2 categorization. The first is to classify each file at level 1 (13 categories) and then, recursively, classify it among the subcategories of its predicted level-1 category. The second strategy is more direct: assign a distinct label to each of the 300+ level-2 categories and train a single SVM model on all of them.
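To make the first (top-down) strategy concrete, here is a minimal sketch in R with e1071. All the data here is a tiny synthetic placeholder (the variable names `x`, `y1`, `y2`, `predict_hier` and the category names are my own, not from the question); the real setup would substitute the actual feature matrix and labels.

```r
library(e1071)

# Tiny synthetic stand-in for the real data (placeholder names):
# 40 documents, 5 features, 2 level-1 categories, 2 subcategories each.
set.seed(1)
x  <- matrix(rnorm(200), nrow = 40)
y1 <- factor(rep(c("sports", "politics"), each = 20))
y2 <- factor(rep(c("football", "rugby", "elections", "policy"), each = 10))

# Top-down strategy: one level-1 model, then one level-2 model per
# level-1 category, each trained only on that category's documents.
top_model  <- svm(x, y1)
sub_models <- lapply(levels(y1), function(cat) {
  idx <- y1 == cat
  svm(x[idx, , drop = FALSE], droplevels(y2[idx]))
})
names(sub_models) <- levels(y1)

# Prediction: route each document through the sub-model of its
# predicted level-1 category.
predict_hier <- function(x_new) {
  lvl1 <- as.character(predict(top_model, x_new))
  vapply(seq_len(nrow(x_new)), function(i) {
    as.character(predict(sub_models[[lvl1[i]]],
                         x_new[i, , drop = FALSE]))
  }, character(1))
}

preds <- predict_hier(x)
```

One practical consequence of this design: each sub-model only ever sees (and trains on) the documents of one level-1 category, so no single `svm` call has to handle all 300+ classes at once.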
For the second strategy, even though I applied SVD to the document-term matrix to reduce it to 30,000 * 10, the svm function in e1071 still breaks down with the error: cannot allocate vector of size 12.4 Gb.
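For reference, the SVD reduction step I mean looks roughly like this. This sketch uses base R's `svd()` on a toy matrix; the matrix, `k`, and all variable names are placeholders (for a real 30,000-document sparse matrix, a truncated-SVD package such as irlba, which computes only the top-k singular vectors, would be the usual choice).

```r
# Toy stand-in for a document-term matrix: 40 docs x 8 terms.
set.seed(1)
dtm <- matrix(rpois(40 * 8, lambda = 1), nrow = 40)

k <- 3                                   # latent dimensions to keep
s <- svd(dtm, nu = k, nv = k)            # keep only top-k vectors
docs_reduced <- s$u %*% diag(s$d[1:k])   # 40 x k document representation
```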
So I'd like to ask you gurus: is a large number of categories a real problem for SVM? Specifically, in my case, which strategy will produce better results and is more feasible in practice?