I'm doing text categorization in R with the SVM from the e1071 package. I have around 30,000 text files for training and 10,000 for testing. The goal is to categorize these files hierarchically. For example, there are 13 categories at level 1, such as sports, literature, politics, etc., and more than 300 categories at level 2. For instance, under the sports category there are sub-categories like football, basketball, rugby, etc.
There are two strategies for reaching the level-2 categorization. The first is to classify each file at level 1 (13 categories) and then, recursively, classify it among the subcategories of its predicted level-1 category. The second strategy is more direct: assign a distinct label to each of the 300+ level-2 categories and train a single SVM model on all of them.
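To make the first (top-down) strategy concrete, here is a minimal sketch in R with e1071. All the data here is a tiny synthetic placeholder (the variable names `x`, `y1`, `y2`, `predict_hier` and the category names are my own, not from the question); the real setup would substitute the actual feature matrix and labels.

```r
library(e1071)

# Tiny synthetic stand-in for the real data (placeholder names):
# 40 documents, 5 features, 2 level-1 categories, 2 subcategories each.
set.seed(1)
x  <- matrix(rnorm(200), nrow = 40)
y1 <- factor(rep(c("sports", "politics"), each = 20))
y2 <- factor(rep(c("football", "rugby", "elections", "policy"), each = 10))

# Top-down strategy: one level-1 model, then one level-2 model per
# level-1 category, each trained only on that category's documents.
top_model  <- svm(x, y1)
sub_models <- lapply(levels(y1), function(cat) {
  idx <- y1 == cat
  svm(x[idx, , drop = FALSE], droplevels(y2[idx]))
})
names(sub_models) <- levels(y1)

# Prediction: route each document through the sub-model of its
# predicted level-1 category.
predict_hier <- function(x_new) {
  lvl1 <- as.character(predict(top_model, x_new))
  vapply(seq_len(nrow(x_new)), function(i) {
    as.character(predict(sub_models[[lvl1[i]]],
                         x_new[i, , drop = FALSE]))
  }, character(1))
}

preds <- predict_hier(x)
```

One practical consequence of this design: each sub-model only ever sees (and trains on) the documents of one level-1 category, so no single `svm` call has to handle all 300+ classes at once.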
For the second strategy, even though I applied SVD to the document-term matrix to reduce it to 30,000 * 10, the svm function in e1071 still breaks down with the error: cannot allocate vector of size 12.4 Gb.
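For reference, the SVD reduction step I mean looks roughly like this. This sketch uses base R's `svd()` on a toy matrix; the matrix, `k`, and all variable names are placeholders (for a real 30,000-document sparse matrix, a truncated-SVD package such as irlba, which computes only the top-k singular vectors, would be the usual choice).

```r
# Toy stand-in for a document-term matrix: 40 docs x 8 terms.
set.seed(1)
dtm <- matrix(rpois(40 * 8, lambda = 1), nrow = 40)

k <- 3                                   # latent dimensions to keep
s <- svd(dtm, nu = k, nv = k)            # keep only top-k vectors
docs_reduced <- s$u %*% diag(s$d[1:k])   # 40 x k document representation
```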
So I'd like to ask you gurus: is a large number of categories a real problem for SVM? Specifically, in my case, which strategy will produce better results and is more feasible in practice?