An enlightening example is its use in Stochastic Neighbor Embedding (SNE), devised by Hinton and Roweis.
Essentially, the authors are trying to represent high-dimensional data in a two- or three-dimensional space so that the data can be visually represented (similar in aim to PCA, for instance). The difference is that rather than preserving the total variance in the data (as in PCA), SNE attempts to preserve the local structure of the data -- if that is unclear, the KL divergence may help to illuminate what it means. To do this, they use a Gaussian kernel to estimate the probability that point $i$ would pick point $j$ as its neighbour:
$$P_i = \{p_{i,j}\}_j\qquad \text{ where }\qquad p_{i,j} = \frac{\exp(-\|x_i-x_j\|^2\;/\; 2\sigma_i^2)}{\sum_{k\neq i} \exp(-\| x_i - x_k \|^2\;/\;2\sigma_i^2)}$$
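As a concrete illustration, here is a minimal NumPy sketch of how these high-dimensional affinities might be computed for a single point $i$. The function name is mine, and $\sigma_i$ is simply passed in; the authors instead choose each $\sigma_i$ (by hand or by searching for a target perplexity) to give each point a sensible effective neighbourhood size.

```python
import numpy as np

def high_dim_affinities(X, i, sigma_i):
    """p_{i,j} for a single point i: Gaussian kernel of bandwidth sigma_i, normalised over j != i."""
    d2 = np.sum((X - X[i]) ** 2, axis=1)      # squared distances ||x_i - x_j||^2 to every point
    w = np.exp(-d2 / (2.0 * sigma_i ** 2))    # unnormalised Gaussian affinities
    w[i] = 0.0                                # a point is not its own neighbour
    return w / w.sum()                        # normalise so the p_{i,j} sum to 1
```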
They then use a Gaussian kernel (this time with a fixed bandwidth) to define a corresponding probability distribution over neighbours for the mapped points $y_i$ in the low-dimensional space:
$$Q_i = \{q_{i,j}\}_j\qquad \text{ where }\qquad q_{i,j} = \frac{\exp(-\|y_i-y_j\|^2)}{\sum_{k\neq i} \exp(-\| y_i - y_k \|^2)}$$
and they use a cost function $C=\sum_i D_{KL}(P_i||Q_i)$ to measure how well the low dimensional data represents the original data.
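Continuing the sketch above, the low-dimensional affinities and the overall cost could be computed along the following lines (the function names are again my own, and a small epsilon is included only to avoid taking the log of zero):

```python
import numpy as np

def low_dim_affinities(Y):
    """q_{i,j} for all pairs: Gaussian kernel with a fixed bandwidth in the embedding space."""
    d2 = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)  # pairwise squared distances
    w = np.exp(-d2)
    np.fill_diagonal(w, 0.0)                                    # a point is not its own neighbour
    return w / w.sum(axis=1, keepdims=True)                     # row i is the distribution Q_i

def sne_cost(P, Q, eps=1e-12):
    """C = sum_i KL(P_i || Q_i), with P and Q as (n, n) row-stochastic affinity matrices."""
    return float(np.sum(P * np.log((P + eps) / (Q + eps))))
```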
If we fix the index $i$ for a moment and look at the contribution of a single point, then expanding out the notation we get:
$$D_{KL}(P_i\|Q_i) = \sum_j p_{i,j} \log\!\left(\frac{p_{i,j}}{q_{i,j}}\right)$$
This is brilliant! For each point $i$, the KL divergence will be high if points that are close in high-dimensional space (large $p_{i,j}$) are placed far apart in low-dimensional space (small $q_{i,j}$). But it puts a much smaller penalty on points that are far apart in high-dimensional space yet end up close together in low-dimensional space. In this way the asymmetry of the KL divergence is actually beneficial!
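To make the asymmetry concrete, compare the two mismatched cases with some made-up affinities (natural logarithms; the numbers are purely illustrative):
$$p_{i,j}=0.5,\;q_{i,j}=0.01:\qquad p_{i,j}\log\frac{p_{i,j}}{q_{i,j}} = 0.5\log 50 \approx 1.96$$
$$p_{i,j}=0.01,\;q_{i,j}=0.5:\qquad p_{i,j}\log\frac{p_{i,j}}{q_{i,j}} = 0.01\log 0.02 \approx -0.04$$
so placing genuine neighbours far apart contributes roughly fifty times more to the cost than the reverse mistake.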
If we minimise this cost, we obtain a method that preserves the local structure of the data well, just as the authors set out to do, and the KL divergence plays a pivotal role.