I'm working on a small (200M) corpus of text, which I want to explore with some cluster analysis. What books or articles on that subject would you recommend?
6 Answers
It may be worth looking at M.W. Berry's books:
- Survey of Text Mining I: Clustering, Classification, and Retrieval (2003)
- Survey of Text Mining II: Clustering, Classification, and Retrieval (2008)
They consist of a series of applied and review papers. The latter seems to be available as a PDF at the following address: http://bit.ly/deNeiy.
Here are a few links related to cluster analysis (CA) as applied to text mining:
- Document Topic Generation in Text Mining by Using Cluster Analysis with EROCK
- An Approach to Text Mining using Information Extraction
You can also look at Latent Semantic Analysis, but see my response there: Working through a clustering problem.
Finding Groups in Data: An Introduction to Cluster Analysis, by professors Leonard Kaufman and Peter J. Rousseeuw.
I am reading the book and finding it very useful because:
- As stated by the authors in the preface:
Our purpose was to write an applied book for the general user. We wanted to make cluster analysis available to people who do not necessarily have a strong mathematical or statistical background.
- It provides theoretical content to understand the functions available in the R package cluster.
- Chapters can be read individually according to the cluster method of interest. The exception is chapter 3, which is built on chapter 2.
The book's chapters are:
- Introduction
- Partitioning Around Medoids (Program PAM).
- Clustering Large Applications (Program CLARA).
- Fuzzy Analysis (Program FUNNY).
- Agglomerative Nesting (Program AGNES).
- Divisive Analysis (Program DIANA).
- Monothetic Analysis (Program MONA).
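To give a feel for the medoid idea behind the PAM chapter, here is a minimal sketch in Python. It is not the book's PAM program and not the `pam` function of the R package: it is a simplified greedy swap search that starts from the first k points rather than using a proper initialisation step. The function names `total_cost` and `k_medoids` and the toy data are made up for illustration.

```python
from itertools import product


def total_cost(dist, medoids):
    """Sum of each point's distance to its nearest medoid."""
    return sum(min(dist[i][m] for m in medoids) for i in range(len(dist)))


def k_medoids(dist, k):
    """Greedy swap search over a precomputed distance matrix.

    Simplified illustration only: the initial medoids are just the first
    k points, and swaps are accepted as soon as they lower the cost.
    """
    n = len(dist)
    medoids = list(range(k))
    best = total_cost(dist, medoids)
    improved = True
    while improved:
        improved = False
        # Try replacing each medoid with each non-medoid point.
        for pos, candidate in product(range(k), range(n)):
            if candidate in medoids:
                continue
            trial = medoids[:pos] + [candidate] + medoids[pos + 1:]
            cost = total_cost(dist, trial)
            if cost < best:
                medoids, best, improved = trial, cost, True
    # Label each point with the index of its nearest medoid.
    labels = [min(medoids, key=lambda m: dist[i][m]) for i in range(n)]
    return sorted(medoids), labels, best


# Toy one-dimensional data with two obvious groups (hypothetical values).
points = [10, 12, 8, 50, 53, 49]
dist = [[abs(a - b) for b in points] for a in points]
medoids, labels, cost = k_medoids(dist, k=2)
```

Because the search only needs a distance matrix, the same sketch would work with any dissimilarity between documents, which is one reason medoid-based methods are often mentioned for text.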
References:
Kaufman, L., & Rousseeuw, P. J. (2005). Finding Groups in Data. An Introduction to Cluster Analysis (p. 342). John Wiley & Sons Inc.
Maechler, M. (2013). Cluster Analysis Extended Rousseeuw et al. CRAN.

This book indeed provides a nice overview of the field. It focuses on a few algorithms/methods (e.g. the well-known silhouette, which happens to have been designed by one of the book's authors) and covers them extensively. It also comes with some code, but 1990 style. FYI: [full table of contents](https://www.quora.com/Data-Analysis/What-are-the-best-books-on-cluster-analysis/answer/Franck-Dernoncourt). – Franck Dernoncourt Nov 26 '13 at 14:03
This chapter of Introduction to Data Mining is available online and gives a nice overview.

And [here](https://www-users.cs.umn.edu/~kumar001/dmbook/ch7_clustering.pdf) is the link to the 2nd edition (2018). – Richard Hardy Jan 10 '19 at 10:28
Cluster Analysis by Brian S. Everitt is a nice book-length applied treatment of cluster analysis.

Not specifically about text-mining, but I quite liked "Exploratory Data Analysis with MATLAB" by Martinez and Martinez.

Another in-depth book worth looking at: Handbook of Cluster Analysis by Hennig et al. (2015)
