
I am trying to apply simple Naive Bayes or SVM (libSVM) algorithms to a large data set, which I've constructed as an .arff file.

The number of features in my set is ~180k and there are ~6k examples; there are also 8 classification classes. The data is ~3.2 GB in size.

I am working with Weka's Java API in Eclipse. I am increasing the JVM's memory to the maximum, but I always get a heap space error.

I am on a MacBook Pro, 2.3 GHz Intel Core i5, 4GB 1333 MHz DDR3.

Do I need to find another machine to work with, or is it possible that I have a memory leak somewhere in my code?

Michael
  • This question appears to be off-topic because it is about diagnosing the likely cause of a heap space error in software. – Glen_b Jun 02 '14 at 00:56

1 Answer


Nothing good is going to come from this.

Your raw data just about fits into memory, assuming you can fit Eclipse, the entire OS, and anything else that's running into 800 MB or so, which seems unlikely. However, you now need to operate on this data. Plain-vanilla quadratic programming needs approximately $O(n^2)$ space, which is going to be a lot in your case. SMO can dramatically improve on this, as can various SVM approximations (reduced SVM), but you're still going to need some memory, and things are going to get ugly very quickly if you start swapping. In theory, you could buy (a lot) more RAM, or switch to something with a lot less overhead ($k$-NN only really needs $k$ doubles and $k$ indices into your data set).
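
As a rough back-of-the-envelope figure: a cached kernel (Gram) matrix for $n \approx 6000$ examples is already around $6000^2 \times 8\ \text{bytes} \approx 290$ MB of doubles, which eats most of whatever headroom is left once the data itself is loaded.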

However, I think the utterly immense sparsity of your data set is likely to be an even bigger issue. Assuming your features are all binary, you're sampling from a very, very small fraction of the input space (roughly $6000/2^{180000}$, i.e., one part in a number with about fifty-four thousand digits). You'd be better off with a more compact representation of your data, both practically and theoretically.
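
One concrete option: Weka's ARFF format has a sparse variant where each row lists only its non-zero index/value pairs in curly braces, and the Java API has a matching `SparseInstance` class. Here's a minimal sketch of building sparse rows programmatically (the attribute names, sizes, and Weka 3.7-style constructors are illustrative, so check them against your Weka version):

```java
import java.util.ArrayList;
import weka.core.Attribute;
import weka.core.Instances;
import weka.core.SparseInstance;

public class SparseDemo {
    public static void main(String[] args) {
        // Declare the (numeric, here effectively binary) attributes once.
        ArrayList<Attribute> attrs = new ArrayList<Attribute>();
        for (int i = 0; i < 180000; i++) {
            attrs.add(new Attribute("f" + i));
        }
        Instances data = new Instances("sparse-demo", attrs, 0);

        // A sparse row stores only the listed index/value pairs;
        // every other attribute is implicitly 0, so memory scales with
        // the number of non-zero features per example, not with 180k.
        int[] nonZeroIndices = {42, 90210};       // indices must be ascending
        double[] nonZeroValues = {1.0, 1.0};
        SparseInstance row =
            new SparseInstance(1.0, nonZeroValues, nonZeroIndices, data.numAttributes());
        data.add(row);

        System.out.println("Stored values in this row: " + row.numValues());
    }
}
```

The same idea applies to the .arff file itself: writing it in the sparse format should shrink a mostly-zero 3.2 GB file considerably.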

I'd start by eliminating any features that have constant values across the whole data set. This should be pretty simple, and I suspect it will remove many of your features. Next, you might consider something like PCA to re-represent the remaining features in a lower-dimensional space.
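
For the constant-feature step, Weka ships a `RemoveUseless` filter that drops attributes which never vary. A minimal sketch (the file name is a placeholder, and this still requires loading the data once, so it may need a bigger machine or a subsample):

```java
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.RemoveUseless;

public class DropConstantAttributes {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("yourdata.arff");   // placeholder path
        data.setClassIndex(data.numAttributes() - 1);

        // RemoveUseless deletes attributes that are constant across the data set
        // (and, by default, nominal attributes that vary too much).
        RemoveUseless filter = new RemoveUseless();
        filter.setInputFormat(data);
        Instances cleaned = Filter.useFilter(data, filter);

        System.out.println(data.numAttributes() + " -> "
                + cleaned.numAttributes() + " attributes");
    }
}
```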

Matt Krause
  • Thanks Matt. I am trying to ditch some columns, so I am thinking of leaving out those whose values sum to less than a certain number, say 20. I know this is very crude feature selection, but Weka doesn't seem to be able to load my file. I may later try to compute the information gain of my features and drop some accordingly (see the sketch after these comments). Does that sound good? – Michael Jun 02 '14 at 21:35
  • It's a start, at least. Unless you're absolutely locked into this feature representation, I'd think long and hard about other ways to represent it. You might also want to look into "online" or "streaming" methods that don't require keeping the whole data set in memory, like this: http://www.icml2010.org/papers/238.pdf Unfortunately, weka doesn't do much, if any, of this, so you'll have to revamp your toolchain. – Matt Krause Jun 07 '14 at 01:43
  • By the way, welcome to Cross Validated! I think the hardware-centric nature of your question led to it being placed on hold. That said, your problem might have a statistical-ish solution, so if you go down the dimensionality reduction or online classification route, feel free to ask more questions! Good luck! – Matt Krause Jun 07 '14 at 01:57
  • Thanks, it's nice to be here! Yes, I found a 32GB RAM machine and I will apply some feature selection to solve the problem. – Michael Jun 11 '14 at 22:43
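
Following up on the information-gain idea from the comments, here is a minimal sketch using Weka's supervised `AttributeSelection` filter with `InfoGainAttributeEval` and a `Ranker`. The file name and the number of attributes to keep are placeholders, and like everything above it assumes the data can be loaded at least once (e.g. on the 32GB machine):

```java
import weka.attributeSelection.InfoGainAttributeEval;
import weka.attributeSelection.Ranker;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.supervised.attribute.AttributeSelection;

public class InfoGainSelection {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("yourdata.arff");   // placeholder path
        data.setClassIndex(data.numAttributes() - 1);

        // Rank attributes by information gain with respect to the class
        // and keep only the top-ranked ones.
        Ranker ranker = new Ranker();
        ranker.setNumToSelect(5000);                          // placeholder cut-off

        AttributeSelection filter = new AttributeSelection();
        filter.setEvaluator(new InfoGainAttributeEval());
        filter.setSearch(ranker);
        filter.setInputFormat(data);

        Instances reduced = Filter.useFilter(data, filter);
        System.out.println("Kept " + reduced.numAttributes() + " attributes");
    }
}
```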