One thing to consider is the mDA, the marginalized Denoising Autoencoder. It trains orders of magnitude faster than an SdA and may be the solution you're looking for. [1][2] I'm personally more interested in the latter paper, since learning non-linear representations matters in domains with highly non-linear structure -- in my case, face images.
The "online" setting for an algorithm trained with stochastic gradient descent simply means setting the mini-batch size to 1: you compute the gradient and take a step after every new sample. Note that in the online setting, standard "batch" gradient descent would never update its weights at all, since it only updates after seeing *every* example in the dataset -- and an online stream never ends. In practice, the best minibatch size depends on the hardware you're running on. You may even find that, since data bandwidth is most likely the overriding concern, CPUs outperform GPUs -- at least until NVLink comes out.
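To make the minibatch-size-1 idea concrete, here's a toy Python sketch of online SGD on a one-parameter regression problem (the problem setup, names, and learning rate are mine, purely for illustration):

```python
import random

random.seed(42)

# Synthetic 1-D regression: y = 3*x + noise, streamed one sample at a time.
data = [(x, 3.0 * x + random.gauss(0.0, 0.1))
        for x in (random.uniform(-1.0, 1.0) for _ in range(2000))]

w = 0.0    # single weight, no bias, for brevity
lr = 0.1   # learning rate

# "Online" SGD: minibatch size 1 -- update the weight after every sample,
# and never revisit a sample.
for x, y in data:
    err = w * x - y     # prediction error on this one sample
    grad = err * x      # gradient of (w*x - y)^2 / 2 w.r.t. w
    w -= lr * grad

# After one pass over the stream, w should sit close to the true slope 3.0.
```

A batch method, by contrast, would sum that gradient over the whole dataset before taking a single step -- which is exactly what an endless stream never lets it do.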
Yoshua Bengio has some interesting thoughts on what it means to train a network in an online setting (where each new training sample $x_t$ arrives at timestep $t$, is seen once, and is never seen again). He suggests that, in the simplified case of independent and identically distributed (i.i.d.) data, *"an online learner is performing stochastic gradient descent on its generalization error."*[3] Hopefully it's clear that most online datasets are *not* i.i.d.; they usually exhibit temporal correlation. (I certainly would not want to watch an i.i.d. stream of images for too long. ^^)
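You can see why temporal correlation hurts with a tiny Python experiment (entirely my own toy setup, not from the paper): an online learner with a constant step size effectively remembers only the recent past, so a correlated stream drags its estimate toward whatever came last.

```python
import random

random.seed(0)
data = [random.gauss(0.0, 1.0) for _ in range(10000)]
true_mean = sum(data) / len(data)

def online_mean(stream, lr=0.05):
    # One online-SGD pass, minibatch size 1, minimizing (w - x)^2 / 2,
    # so the per-sample gradient is simply (w - x).
    w = 0.0
    for x in stream:
        w -= lr * (w - x)
    return w

w_iid = online_mean(data)           # samples arrive in i.i.d. order
w_corr = online_mean(sorted(data))  # temporally correlated stream

# The correlated (sorted) stream ends on the largest samples, so w_corr
# lands far above the true mean, while w_iid stays close to it.
```

With a constant learning rate the update is an exponential moving average over roughly the last 1/lr samples, which is why the ordering of the stream matters so much.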
[1] http://arxiv.org/abs/1206.4683
[2] http://arxiv.org/abs/1206.4683
[3] http://arxiv.org/abs/1206.5533