
I am trying to get up to speed with R. I eventually want to use R libraries for text classification. I was just wondering what people's experiences are with regard to R's scalability when it comes to text classification.

I am likely to run into high dimensional data (~300k dimensions). I am looking at using SVM and Random Forest in particular as classification algorithms.

Would R libraries scale to my problem size?

Thanks.

EDIT 1: Just to clarify, my data set is likely to have 1000-3000 rows (perhaps a bit more) and 10 classes.

EDIT 2: Since I am very new to R, I would ask posters to be as specific as possible. For example, if you are suggesting a workflow/pipeline, please be sure to mention the R libraries involved in each step where possible. Some additional pointers (to examples, sample code, etc.) would be icing on the cake.

EDIT 3: First off, thanks everyone for your comments. And secondly, I apologize; perhaps I should have given more context for the problem. I am new to R but not so much to text classification. I have already done pre-processing (stemming, stopword removal, tf-idf conversion, etc.) on part of my data using the tm package, just to get a feel for things. tm was so slow even on about 200 documents that I got concerned about scalability. Then I started playing with FSelector, and even that was really slow. And that's the point at which I wrote my original post.

EDIT 4: It just occurred to me that I have 10 classes and roughly 300 training documents per class, and I am in fact building the term-by-document matrix out of the entire training set, resulting in very high dimensionality. But how about reducing every 1-out-of-k classification problem to a series of binary classification problems? That would considerably reduce the number of training documents (and hence dimensionality) at each of the k-1 steps, wouldn't it? So is this approach a good one? How does it compare in terms of accuracy to the usual multi-class implementation?

Andy
  • 300k dimensions with how many rows? Unfortunately, R objects must be in memory (at least unless you consider major tweaking, basically requiring you to rewrite these algorithms yourself). That means that with, say, 8 gigs of ram, I don't think you can store more than a few hundred rows with 300k columns. – crayola Aug 14 '11 at 08:06
  • @crayola : Number of rows can vary from 1000-3000. – Andy Aug 14 '11 at 13:37
  • R objects *do not* need to be in memory. Memory mapping is very easy. 300k dimensions is not a problem. I also presume your data is sparse, as that is the case with nearly all text problems. – Iterator Aug 14 '11 at 16:24
  • I just noticed the comment above: only 1000-3000 rows? That is very small. Can you explain what your corpus is? A batch of emails? Descriptions of packages in CRAN? You may have more statistical issues with P >> N than with any storage issues. – Iterator Aug 14 '11 at 16:36
  • @Iterator To be precise, R objects must be in memory; memory mapping can at best be emulated, and such a setup will not be fully compatible. –  Aug 14 '11 at 17:30
  • @Iterator : We have some educational resources (term papers, essays etc.) that we want to classify. – Andy Aug 14 '11 at 17:33
  • @mbq: Fair point. If the object is a pointer to memory mapped files, then the object will be just a few bytes. Andy: that's a good use and it's well within undergrad capabilities - I've taught undergrads to do document clustering and classification with R, using a 20K document corpus (each represented as a bag of words vector). Just use sparse matrices. I can't speak for whether particular packages play nicely with sparse matrices, but I suspect they should. – Iterator Aug 14 '11 at 18:22
  • @Iterator : Can you give me the list of packages you used for classification for that particular task? And perhaps a link to some sample code if you have it still? – Andy Aug 14 '11 at 18:29
  • @Andy: Sorry, I don't believe I have the sample code: I taught the students the methods (statistical & computational), they wrote the code. :) From the steps outlined below, it should be easy to proceed. I did preprocessing of tokens fully outside of R, but packages like `openNLP` should do the trick now. After tokenization, processing should be less than 100-200 lines of R code, if not much less. – Iterator Aug 14 '11 at 19:52
  • Regarding the 4th edit: it's up to you. You may discover a nesting of topics, and the 1 versus all will not give much insight on that. Moreover, if you allow for a non-binary cost matrix, you will find it more interesting to not do 1 versus all. – Iterator Aug 15 '11 at 00:05
  • Regarding 3rd edit: I'm not clear on why this processing should be slow. Honestly, I've only used non-R code for text preprocessing. In any case, preprocessing and modeling are two different steps; in the past, it never occurred to me to do it from within R. The `tm` and `openNLP` packages should be fine, but I haven't any first-hand experience with them. You might swap preprocessors and then go back to R with the data. – Iterator Aug 15 '11 at 14:54
  • @Iterator : I think what I am going to do first is to sample from my data and see what I get. If the accuracy is low I will have to develop a solution along the lines discussed here. Thank you for your responses, and I will post back very soon with my experiences (and more questions :-) ). – Andy Aug 15 '11 at 17:33
  • Good luck and have fun with this! Btw, if you want advice on preprocessing, esp. in Python, Perl or Java, SO is a good place to start. Once you have numerical token vectors, R offers the most sophisticated playground for modeling and EDA. – Iterator Aug 15 '11 at 18:20

3 Answers


As requested in a comment, here are some pointers for processing steps. A number of tools may be found at the CRAN Task View for Natural Language Processing. You may also want to look at this paper on the tm (text mining) package for R.

  1. Prior to processing, consider normalization of the word tokens. openNLP (for which there is an R package) is one route.
  2. For text processing, a common pre-processing step is to normalize the data via tf.idf (term frequency * inverse document frequency); see the Wikipedia entry for more details. There are other, more recent normalizations, but this is a bread-and-butter method, so it's important to know it. You can easily implement it in R: just store (docID, wordID, freq1, freq2) where freq1 is the number of times the word indexed by wordID appears in the given document and freq2 is the number of documents in which it appears. There is no need to store this vector for words that don't appear in a given document. Then take freq1 * log(N / freq2), where N is the total number of documents, and you have your tf.idf value (a small sketch in R follows this list).
  3. After calculating the tf.idf values, you can work with the full dimensionality of your data or filter out those words that are essentially uninformative. For instance, any word that appears in only 1 document is not going to give much insight. This may reduce your dimensionality substantially. Given the small # of documents being examined, you may find that reducing to just 1K dimensions is appropriate.
  4. I wouldn't bother recentering the data (e.g. for PCA), but you can now store the data in a term matrix (where entries are tf.idf values) with ease, using sparse matrices as supported by the Matrix package.
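
To make steps 2-4 concrete, here is a minimal sketch in R. The tiny `triples` data frame is a stand-in for the output of whatever tokenizer you use, and the weighting shown is the standard tf * log(N/df) form:

```r
library(Matrix)

# One row per (document, word) pair with a raw count, e.g. from your tokenizer
triples <- data.frame(docID  = c(1, 1, 2, 2, 3),
                      wordID = c(1, 2, 1, 3, 3),
                      count  = c(2, 1, 1, 4, 2))

# Sparse document x term matrix of raw counts (step 4's storage format)
dtm <- sparseMatrix(i = triples$docID, j = triples$wordID, x = triples$count)

# tf.idf: term frequency scaled by log(N / document frequency)
df    <- colSums(dtm > 0)                 # number of documents containing each term
tfidf <- Diagonal(x = 1 / rowSums(dtm)) %*% dtm %*% Diagonal(x = log(nrow(dtm) / df))

# Step 3: drop terms that appear in only one document
tfidf <- tfidf[, df > 1, drop = FALSE]
```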

At this point, you have a nicely pre-processed dataset. I would recommend proceeding with the tools cited in the CRAN task view or the text mining package. Clustering the data, for instance by projecting it onto the first 4 or 6 principal components, could be very interesting to your group once the data is plotted.

One other thing: you may find that dimensionality reduction along the lines of PCA (*) can be helpful when using various classification methods, as you are essentially aggregating the related words. The first 10-50 principal components may be all that you need for document classification, given your sample size.

(*) Note: PCA is just a first step. It can be very interesting for someone just starting out with text mining and PCA, but you may eventually find that it is a bit of a nuisance for sparse data sets. As a first step, though, take a look at it, especially via the `prcomp` and `princomp` functions.

Update: I didn't state a preference in this answer; I recommend `prcomp` rather than `princomp`.
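
For the PCA step itself, a hedged sketch, assuming `tfidf` has already been filtered down to a modest number of columns (running `prcomp` directly on 300K columns is not practical, as discussed in the comments below):

```r
# Densify only after filtering to ~1-2K informative columns
tfidf_dense <- as.matrix(tfidf)

# No recentering, per point 4 above
pca    <- prcomp(tfidf_dense, center = FALSE)
scores <- pca$x[, 1:min(50, ncol(pca$x))]   # projections to feed a classifier

# Quick look at structure: documents on the first two principal components
plot(pca$x[, 1], pca$x[, 2], xlab = "PC1", ylab = "PC2")
```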

Iterator
  • +1 Nice answer; I'm only curious why you say that a small number of docs implies a lower number of important variables -- doesn't it seem a bit overfitty? –  Aug 14 '11 at 20:21
  • I'm not sure I understand what you mean. To keep them is overfitting, certainly, so these variables would be eliminated in any reasonable regularization. Moreover, vocabulary (P) grows with # of documents or samples (N), so the first time a term appears is not indicative of much. Keep adding docs and then recurrence of a term across docs will become informative. – Iterator Aug 14 '11 at 20:56
  • @Iterator : Thanks for your answer. So `prcomp` and/or `princomp` will scale to this kind of data you reckon? Also I just edited my question and added some additional information. – Andy Aug 14 '11 at 22:40
  • No, these likely won't scale when you hit 300K columns. :) (Just to point out: X'X in that case will have 90B entries - a storage problem.) Instead, filter by tf.idf first. If there are only, say, 10 distinct classes, then a small multiple of 10 should be an adequate dimensionality for separating the classes, so 1000 dimensions should be more than enough. Both PCA methods (btw, I recommend `prcomp`) will be fine. – Iterator Aug 15 '11 at 00:00
  • Once you limit to 1000 dimensions or perhaps a few more (e.g. 2K), and do PCA, you may take the projections onto say 100 dimensions (which may be overkill, but there's little harm in this) and then do classification. At this point, there's nothing too exciting going on. – Iterator Aug 15 '11 at 00:00

First, welcome! Text processing is lots of fun, and doing it in R is getting easier all the time.

The short answer: yes - the tools in R are now quite good for dealing with this kind of data. In fact, there's nothing special about R, C++, Groovy, Scala, or any other language when it comes to data storage in RAM: every language stores an 8 byte double float in...wait for it...wait for it... 8 bytes!

The algorithms and their implementation do matter, especially if implemented very poorly with regard to data structures and computational complexity. If you are implementing your own algorithms, just take care. If using other code, caveat emptor applies, as it does in any environment.

For R, you will need to consider:

  1. Your data representation (look at sparse matrices, especially in the Matrix package; a rough sizing sketch follows this list)
  2. Data storage (perhaps memory mapped, using bigmemory or ff; or distributed, using Hadoop)
  3. Your partitioning of the data (how much you can fit in RAM depends on how much RAM you have)
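
On the first point, a rough sizing sketch (the scale mirrors the question: 3,000 documents by 300,000 terms; the ~150 non-zero terms per document is an assumed placeholder figure):

```r
library(Matrix)

# Random sparse matrix at the question's scale; density 0.0005 = ~150 terms per document
m <- rsparsematrix(nrow = 3000, ncol = 300000, density = 0.0005)

format(object.size(m), units = "MB")   # a handful of MB for the sparse representation
3000 * 300000 * 8 / 1024^3             # ~6.7 GiB for the equivalent dense double matrix
```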

The last point is really under your control.

Dimensionality of this size is not particularly big anymore. The number of observations will have more of an impact, but you can partition your data to manage RAM usage, so there's not really much to worry about.

Iterator

I agree with crayola that the number of rows is crucial here. For RF you will need at least 3x more RAM than your dataset weighs, and probably a lot of time (such a number of attributes usually requires a lot of trees in the forest -- and note that there is no parallel implementation of RF in R).

As for SVM, I doubt it is a good idea to fight with 300k dimensions when you can probably develop a kernel function that is equivalent to your text descriptors.
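
One hedged way to act on that suggestion is kernlab's built-in string kernels, which avoid building the explicit 300k-column matrix at all; here `texts` (a list of document strings) and `labels` (a factor of the 10 classes) are placeholders for your data:

```r
library(kernlab)

# SVM over a spectrum (k-gram) string kernel computed directly on the raw documents
model <- ksvm(texts, labels, kernel = "stringdot",
              kpar = list(type = "spectrum", length = 4),
              C = 10, cross = 3)
```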

EDIT: A 3k x 300k (real-valued) matrix would occupy something like 7 GB (3,000 * 300,000 * 8 bytes ≈ 7.2 GB), so all you need to do RF (using randomForest) on this data is a computer with 16 GB of RAM, some luck, and quite a bit of time, or just a computer with 24 GB of RAM and quite a bit of time.
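
On the parallelism point: as crayola notes in the comments below, the `foreach` vignette shows how to grow a forest in pieces and merge them. A hedged sketch, with the feature matrix `x` and class factor `y` as placeholders for your data:

```r
library(randomForest)
library(foreach)
library(doParallel)

registerDoParallel(cores = 4)

# Grow 500 trees as four independent forests of 125 trees each, then combine them
rf <- foreach(ntree = rep(125, 4), .combine = randomForest::combine,
              .multicombine = TRUE, .packages = "randomForest") %dopar% {
  randomForest(x, y, ntree = ntree)
}
```

As noted further down in the comments, the merging step and the lack of shared-memory parallelism are the caveats here: this buys time, not RAM.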

  • Well I was certainly going to do feature selection (chi squared, entropy based) but again I could not find any R library that would scale for this task either. Taking all this into account, is it correct then to say that perhaps I should start looking at non-R solutions? – Andy Aug 14 '11 at 13:40
  • "note that there is no parallel implementation of RF in R". That is only partially correct, as the `foreach` package plays nicely with the `randomForest` package. I think there is one such example in the vignette for `foreach`. (Or maybe `doMC`.) – crayola Aug 14 '11 at 14:13
  • @Andy The thing is, short of rewriting the algorithms in a low level programming language, I'm not sure what software will be able to apply these algorithms to your data. If I were in your situation, I guess I would stick to R and rewrite parts of `randomForest` such that it would query the randomly chosen columns from, for example, an SQL database at each iteration (such that the whole 300k dimensions wouldn't ever have to be in ram). But that's probably mainly because I know more about R than about the other possible options. – crayola Aug 14 '11 at 14:22
  • What do you mean, precisely, by claiming you couldn't find a library that would scale for that? Filters like this are basic algebra; they should work without problems (provided you have enough RAM). –  Aug 14 '11 at 14:39
  • @crayola True, but the merging part is awful. Moreover, it is not shared-memory parallelism, so it would probably be painful (if not impossible) in this setting. –  Aug 14 '11 at 14:43
  • @crayola ... or look for [alternative implementation](http://stats.stackexchange.com/questions/10001/implementations-of-the-random-forest-algorithm). –  Aug 14 '11 at 14:46
  • @mbq - Well, as regards scalability, I started working with FSelector (http://cran.r-project.org/web/packages/FSelector/index.html); even for relatively small data sets (1k x 3k) it seemed to take forever, and that's why I was very concerned. I have used WEKA before for feature selection (among other things) and it was much faster. – Andy Aug 14 '11 at 15:58
  • @Andy Yeah, FSelector is not too well suited for larger jobs :-( Summing up: R itself can handle that data, randomForest can model it, and for other packages it depends on the package. I think the best option is not to reinvent the wheel and to go with ready-made tools for the job, as Iterator suggests. –  Aug 14 '11 at 20:18
  • @mbq : You mean random forest can model my data as is, without parallelization? And how about feature selection then? Which package should I use for that? And what if I insist on using chi-square or entropy-based ones :-)? And are SVM-based classifiers going to scale? – Andy Aug 14 '11 at 22:34
  • @mbq : Actually I just had another thought. Please check EDIT 4 of my question. – Andy Aug 14 '11 at 23:23