One thing to consider is the mDA, the marginalized Denoising Autoencoder. It trains orders of magnitude faster than an SdA and may be the solution you're looking for. [1][2] I'm personally more interested in the latter paper, since learning non-linear representations matters in domains with highly non-linear structure -- in my case, face images.
The "online" setting for an algorithm trained with stochastic gradient descent simply means setting the mini-batch size to 1: you compute the gradient and take a step after every new sample. Note that in the online setting, standard "batch" gradient descent would never update its weights at all, since it only updates after seeing *every* example in the dataset -- and an online stream never ends. In practice, the best minibatch size depends on the hardware you're running on. You may even find that, since data bandwidth is most likely the overriding concern, CPUs outperform GPUs -- at least until NVLink comes out.
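To make the minibatch-size-1 idea concrete, here's a toy Python sketch of online SGD on a one-parameter regression problem (the problem setup, names, and learning rate are mine, purely for illustration):

```python
import random

random.seed(42)

# Synthetic 1-D regression: y = 3*x + noise, streamed one sample at a time.
data = [(x, 3.0 * x + random.gauss(0.0, 0.1))
        for x in (random.uniform(-1.0, 1.0) for _ in range(2000))]

w = 0.0    # single weight, no bias, for brevity
lr = 0.1   # learning rate

# "Online" SGD: minibatch size 1 -- update the weight after every sample,
# and never revisit a sample.
for x, y in data:
    err = w * x - y     # prediction error on this one sample
    grad = err * x      # gradient of (w*x - y)^2 / 2 w.r.t. w
    w -= lr * grad

# After one pass over the stream, w should sit close to the true slope 3.0.
```

A batch method, by contrast, would sum that gradient over the whole dataset before taking a single step -- which is exactly what an endless stream never lets it do.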
Yoshua Bengio has some interesting thoughts on what it means to train a network in an online setting (where each new training sample $x_t$ arrives at timestep $t$, is seen once, and is never seen again). He suggests that, in the simplified case of independent and identically distributed (i.i.d.) data, *"an online learner is performing stochastic gradient descent on its generalization error."*[3] Hopefully it's clear that most online datasets are *not* i.i.d.; they usually exhibit temporal correlation. (I certainly would not want to watch an i.i.d. stream of images for too long. ^^)
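You can see why temporal correlation hurts with a tiny Python experiment (entirely my own toy setup, not from the paper): an online learner with a constant step size effectively remembers only the recent past, so a correlated stream drags its estimate toward whatever came last.

```python
import random

random.seed(0)
data = [random.gauss(0.0, 1.0) for _ in range(10000)]
true_mean = sum(data) / len(data)

def online_mean(stream, lr=0.05):
    # One online-SGD pass, minibatch size 1, minimizing (w - x)^2 / 2,
    # so the per-sample gradient is simply (w - x).
    w = 0.0
    for x in stream:
        w -= lr * (w - x)
    return w

w_iid = online_mean(data)           # samples arrive in i.i.d. order
w_corr = online_mean(sorted(data))  # temporally correlated stream

# The correlated (sorted) stream ends on the largest samples, so w_corr
# lands far above the true mean, while w_iid stays close to it.
```

With a constant learning rate the update is an exponential moving average over roughly the last 1/lr samples, which is why the ordering of the stream matters so much.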
[1] http://arxiv.org/abs/1206.4683
[2] http://arxiv.org/abs/1206.4683
[3] http://arxiv.org/abs/1206.5533