I have never used it directly, so I can only share some papers I have come across and some general thoughts about the technique (which mainly address your questions 1 and 3).
My general understanding of biclustering mainly comes from genetic studies (references 2-6 in the list at the end), where we seek to account for clusters of genes and groupings of individuals at the same time: in short, we are looking to group together samples that share a similar gene expression profile (this might be related to disease state, for instance) and to identify the genes that contribute to this expression pattern. A survey of the state of the art for biological "massive" datasets is available in Pardalos's slides, Biclustering. Note that there is an R package, biclust, with applications to microarray data.
In fact, my initial idea was to apply this methodology to clinical diagnosis, because it allows features or variables to be placed in more than one cluster. This is interesting from a semeiological perspective: symptoms that cluster together allow us to define a syndrome, but some symptoms can overlap across different diseases. A good discussion may be found in Cramer et al., Comorbidity: A network perspective (Behavioral and Brain Sciences 2010, 33, 137-193).
A somewhat related technique is collaborative filtering. A good review was made available by Su and Khoshgoftaar (Advances in Artificial Intelligence, 2009): A Survey of Collaborative Filtering Techniques. Other references are listed at the end. The analysis of frequent itemsets, as exemplified by the market-basket problem, may also be linked to it, but I never investigated this. Another example of co-clustering is when we want to simultaneously cluster words and documents, as in text mining, e.g. Dhillon (2001). Co-clustering documents and words using bipartite spectral graph partitioning. Proc. KDD, pp. 269–274.
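To make the word/document case a bit more concrete, here is a minimal sketch in the spirit of the spectral bipartitioning idea in Dhillon (2001): scale the document-by-word count matrix by its row and column sums, take the second pair of singular vectors, and split documents and words jointly. This is only an illustration under several assumptions of mine: it handles two co-clusters only, it splits on the sign of the vector entries rather than running k-means on them, it uses a plain power iteration instead of a proper SVD routine, and it assumes every row and column has a positive sum. The function name `spectral_bipartition` and the toy matrix are hypothetical.

```python
import math
import random


def spectral_bipartition(A, n_iter=200, seed=0):
    """Split rows and columns of a nonnegative matrix into two co-clusters.

    Simplified sketch: sign split of the second singular vectors of the
    normalised matrix An = D1^{-1/2} A D2^{-1/2} (assumes positive row/column sums).
    """
    n_rows, n_cols = len(A), len(A[0])
    # Row and column sums (the degrees in the bipartite document-word graph).
    d1 = [sum(row) for row in A]
    d2 = [sum(A[i][j] for i in range(n_rows)) for j in range(n_cols)]
    An = [[A[i][j] / math.sqrt(d1[i] * d2[j]) for j in range(n_cols)]
          for i in range(n_rows)]

    # The leading right singular vector of An is proportional to sqrt(d2);
    # it carries no cluster information, so we project it out.
    total = sum(d2)
    v1 = [math.sqrt(x / total) for x in d2]

    def times_An(v):
        return [sum(An[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]

    def times_An_T(u):
        return [sum(An[i][j] * u[i] for i in range(n_rows)) for j in range(n_cols)]

    rng = random.Random(seed)
    v = [rng.uniform(-1.0, 1.0) for _ in range(n_cols)]
    for _ in range(n_iter):
        # Remove the component along v1, then apply An^T An (power iteration).
        dot = sum(a * b for a, b in zip(v, v1))
        v = [a - dot * b for a, b in zip(v, v1)]
        v = times_An_T(times_An(v))
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
    u = times_An(v)

    # Rescale by D^{-1/2} and split on the sign (a stand-in for 1-D k-means).
    row_labels = [1 if u[i] / math.sqrt(d1[i]) > 0 else 0 for i in range(n_rows)]
    col_labels = [1 if v[j] / math.sqrt(d2[j]) > 0 else 0 for j in range(n_cols)]
    return row_labels, col_labels


# Toy document-by-word counts: two blocks with a little overlap (made-up data).
A = [
    [3, 2, 1, 0, 0, 0],
    [2, 3, 2, 0, 0, 1],
    [1, 2, 3, 0, 0, 0],
    [0, 0, 0, 3, 2, 1],
    [1, 0, 0, 2, 3, 2],
    [0, 0, 0, 1, 2, 3],
]
row_labels, col_labels = spectral_bipartition(A)
print("documents:", row_labels)
print("words:    ", col_labels)
```

Documents and words that receive the same label form one co-cluster. Which of the two groups ends up labelled 0 or 1 is arbitrary, since it depends on the sign of the singular vector.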
As for general references, here is a far-from-exhaustive list that I hope you may find useful:
- Jain, A.K. (2010). Data clustering: 50 years beyond K-means. Pattern Recognition Letters, 31, 651–666.
- Carmona-Saez et al. (2006). Biclustering of gene expression data by non-smooth non-negative matrix factorization. BMC Bioinformatics, 7, 78.
- Prelic et al. (2006). A systematic comparison and evaluation of biclustering methods for gene expression data. Bioinformatics, 22(9), 1122-1129. www.tik.ee.ethz.ch/sop/bimax
- DiMaggio et al. (2008). Biclustering via optimal re-ordering of data matrices in systems biology: rigorous methods and comparative studies. BMC Bioinformatics, 9, 458.
- Santamaria et al. (2008). BicOverlapper: A tool for bicluster visualization. Bioinformatics, 24(9), 1212-1213.
- Madeira, S.C. and Oliveira, A.L. (2004). Biclustering algorithms for biological data analysis: a survey. IEEE Trans. Comput. Biol. Bioinform., 1, 24–45.
- Badea, L. (2009). Generalized Clustergrams for Overlapping Biclusters. IJCAI.
- Symeonidis, P. (2006). Nearest-Biclusters Collaborative Filtering. WEBKDD.