Robert Tibshirani et al. proposed the Gap Statistic to estimate the number of clusters in a dataset. It involves calculating two quantities: $D_r$, the sum of the pairwise distances $d$ (under some distance metric; squared Euclidean is common) over all points in cluster $C_r$, $r \in \{1,\dots,k\}$, computed for each cluster; and $W_k$, the pooled within-cluster dispersion over all clusters for the fit using $k$ clusters:
$$D_r = \sum_{i,i' \in C_r} d_{i,i'}$$
$$\hat W_k = \sum_{r=1}^k \frac{1}{2n_r} D_r $$
Where $n_r$ is the number of points in cluster $C_r$.
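For concreteness, here is a minimal sketch of computing the observed $\hat W_k$ from data and cluster labels, assuming squared Euclidean distance and that NumPy/SciPy are available (the helper name `pooled_within_dispersion` is my own, not from the paper):

```python
import numpy as np
from scipy.spatial.distance import pdist

def pooled_within_dispersion(X, labels):
    """Observed W_k-hat = sum_r D_r / (2 * n_r), with d = squared Euclidean."""
    labels = np.asarray(labels)
    W = 0.0
    for r in np.unique(labels):
        pts = X[labels == r]
        n_r = len(pts)
        if n_r > 1:
            # pdist sums each unordered pair once; D_r in the paper sums over
            # ordered pairs (i, i'), so double it.
            D_r = 2.0 * pdist(pts, metric="sqeuclidean").sum()
            W += D_r / (2.0 * n_r)
    return W
```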
The gap statistic is:
$$\textrm{Gap}_n(k) = E_n[\log(W_k)] - \log(\hat W_k)$$
Where the expectation is taken with respect to samples of size $n=\sum_{r=1}^{k} n_r$ drawn from some null reference distribution. Hastie, Tibshirani, and Friedman use a uniform distribution over the rectangle containing the data [see The Elements of Statistical Learning, 2nd ed., p. 519].
Note:
I added a hat to the observed $W_k$ to differentiate it from the theoretical value $E_n[\log(W_k)]$. You won't see this in the linked paper, as it is implicitly understood...but it can be confusing to someone new to this. Keep in mind that $E_n[\log(W_k)]$ is the expected value of $\log(W_k)$ when it is computed from $n$ points drawn from the null reference distribution; it has nothing to do with your actual observed sample, which is what $\hat W_k$ captures.
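As a concrete illustration of "drawn from the null reference distribution": one reference dataset can be sampled uniformly over the axis-aligned bounding rectangle of the observed data. A rough sketch (the helper name is my own):

```python
import numpy as np

def sample_reference(X, rng=None):
    """Draw len(X) points uniformly over the bounding rectangle of X."""
    rng = np.random.default_rng(rng)
    lo, hi = X.min(axis=0), X.max(axis=0)
    return rng.uniform(lo, hi, size=X.shape)
```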
Even with this simple reference distribution, you'll likely need to simulate $W_k$ under it to estimate $E_n[\log(W_k)]$. Your goal is to calculate the gap statistic for various values of $k$ (the number of clusters) and see which value of $k$ maximizes the gap statistic; a sketch of the whole procedure follows below.
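Putting the pieces together, here is a rough end-to-end sketch, not the authors' reference implementation. It assumes squared Euclidean distance, in which case $\hat W_k$ equals the total within-cluster sum of squares that scikit-learn's `KMeans` reports as `inertia_`, and it estimates $E_n[\log(W_k)]$ by averaging over `B` uniform reference draws:

```python
import numpy as np
from sklearn.cluster import KMeans

def gap_statistic(X, k_max=10, B=20, seed=0):
    """Gap(k) for k = 1..k_max; pick the k with the largest gap."""
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)
    gaps = np.empty(k_max)
    for k in range(1, k_max + 1):
        # log(W_k-hat) on the observed data (inertia_ = within-cluster SS).
        log_W_obs = np.log(KMeans(n_clusters=k, n_init=10, random_state=seed)
                           .fit(X).inertia_)
        # Monte Carlo estimate of E_n[log(W_k)] under the uniform reference.
        log_W_ref = np.empty(B)
        for b in range(B):
            X_ref = rng.uniform(lo, hi, size=X.shape)
            log_W_ref[b] = np.log(KMeans(n_clusters=k, n_init=10, random_state=seed)
                                  .fit(X_ref).inertia_)
        gaps[k - 1] = log_W_ref.mean() - log_W_obs
    return gaps

# e.g. k_hat = 1 + np.argmax(gap_statistic(X))
```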