The vanishing gradient problem forces us to use small learning rates with gradient descent, which then needs many small steps to converge. That is a problem on slow hardware, where each step takes a long time; with a fast GPU that can perform many more steps in a day, it matters less.
There are several ways to tackle the vanishing gradient problem. I would guess that the largest effect for CNNs came from switching from sigmoid nonlinearities to rectified linear units (ReLUs). Consider a simple neural network whose error $E$ depends on the weight $w_{ij}$ only through the unit output $y_j$, where
$$y_j = f\left( \sum_iw_{ij}x_i \right),$$
then the gradient with respect to $w_{ij}$ is
\begin{align}
\frac{\partial}{\partial w_{ij}} E
&= \frac{\partial E}{\partial y_j} \cdot \frac{\partial y_j}{\partial w_{ij}} \\
&= \frac{\partial E}{\partial y_j} \cdot f'\left(\sum_{i'} w_{i'j} x_{i'}\right) x_i.
\end{align}
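To make the chain rule concrete, here is a minimal sketch that checks the analytic gradient against a finite-difference estimate. It assumes a single sigmoid unit and a squared-error loss; both are illustrative choices, since the derivation above leaves $E$ and $f$ generic.

```python
import numpy as np

# Minimal sketch: one unit y = f(sum_i w_i x_i) with squared error
# E = 0.5 * (y - t)^2. The sigmoid f and this particular E are
# assumptions for illustration only.

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def sigmoid_prime(u):
    s = sigmoid(u)
    return s * (1.0 - s)

rng = np.random.default_rng(0)
x = rng.normal(size=5)   # inputs x_i
w = rng.normal(size=5)   # weights w_i (the j index is dropped: one unit)
t = 0.3                  # target

u = w @ x                # pre-activation sum_i w_i x_i
y = sigmoid(u)

# Analytic gradient from the formula above: dE/dw_i = (dE/dy) * f'(u) * x_i
grad_analytic = (y - t) * sigmoid_prime(u) * x

# Finite-difference check of the same gradient
eps = 1e-6
grad_numeric = np.zeros_like(w)
for i in range(len(w)):
    w_plus, w_minus = w.copy(), w.copy()
    w_plus[i] += eps
    w_minus[i] -= eps
    E_plus = 0.5 * (sigmoid(w_plus @ x) - t) ** 2
    E_minus = 0.5 * (sigmoid(w_minus @ x) - t) ** 2
    grad_numeric[i] = (E_plus - E_minus) / (2 * eps)

print(np.allclose(grad_analytic, grad_numeric))  # True
```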
If $f$ is the logistic sigmoid function, $f'$ is close to zero for large positive inputs as well as large negative inputs, so the gradient vanishes whenever the unit saturates. If $f$ is a rectified linear unit,
\begin{align}
f(u) = \max\left(0, u\right),
\end{align}
the derivative is zero only for negative inputs and 1 for positive inputs, so active units pass the gradient through undiminished. Another important contribution comes from properly initializing the weights. This paper looks like a good source for understanding the challenges in more detail (although I haven't read it yet):
http://jmlr.org/proceedings/papers/v9/glorot10a/glorot10a.pdf
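To make the contrast between the two derivatives concrete, here is a minimal NumPy sketch (my own illustration, not taken from the linked paper) that evaluates $f'$ for both nonlinearities and shows how the per-layer factor compounds with depth:

```python
import numpy as np

def sigmoid_prime(u):
    s = 1.0 / (1.0 + np.exp(-u))
    return s * (1.0 - s)

def relu_prime(u):
    # Derivative of max(0, u): 0 for negative inputs, 1 for positive inputs
    return (u > 0).astype(float)

# The sigmoid derivative saturates on both sides; the ReLU derivative
# is exactly 1 for any positive pre-activation.
us = np.array([-10.0, -2.0, 0.5, 2.0, 10.0])
print("pre-activation u :", us)
print("sigmoid'(u)      :", np.round(sigmoid_prime(us), 5))
print("relu'(u)         :", relu_prime(us))

# Backpropagation multiplies one f'(u) factor per layer. Even at its
# maximum (f'(0) = 0.25), the sigmoid shrinks the gradient geometrically
# with depth, while the ReLU factor stays 1 for active units.
L = 20
print("sigmoid, 20 layers (best case):", 0.25 ** L)   # ~9e-13
print("relu, 20 layers (active units):", 1.0 ** L)    # 1.0
```

The last two lines are the vanishing-gradient effect in miniature: with sigmoids, the gradient reaching early layers is attenuated by a factor of at most $0.25$ per layer, while ReLUs leave it unchanged along paths through active units.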