
For the task of churn modelling I was considering:

  1. Compute k clusters for the data
  2. Build k models, one for each cluster.

The rationale is that there is nothing to suggest that the population of subscribers is homogeneous, so it's reasonable to assume that the data-generating process may be different for different "groups".
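A minimal sketch of the two-step idea above (my own illustration, not from any particular package): k-means for step 1 and a small hand-rolled logistic model per cluster for step 2, on synthetic "subscriber" data. At prediction time a new case is routed to its nearest cluster's model.

```python
import numpy as np

rng = np.random.default_rng(0)

# toy "subscriber" data: two latent groups with different churn drivers
X = np.vstack([rng.normal(0, 1, (100, 2)),      # group A around the origin
               rng.normal(5, 1, (100, 2))])     # group B around (5, 5)
y = np.concatenate([(X[:100, 0] > 0).astype(int),   # A churns on feature 0
                    (X[100:, 1] > 5).astype(int)])  # B churns on feature 1

def kmeans(X, k, iters=20, seed=0):
    """Step 1: plain k-means (illustrative implementation)."""
    r = np.random.default_rng(seed)
    centers = X[r.choice(len(X), k, replace=False)]
    for _ in range(iters):
        labels = ((X[:, None] - centers) ** 2).sum(-1).argmin(1)
        centers = np.array([X[labels == j].mean(0) if np.any(labels == j)
                            else centers[j] for j in range(k)])
    return centers, labels

def fit_logistic(X, y, steps=500, lr=0.1):
    """Step 2: one small logistic model per cluster, via gradient descent."""
    Xb = np.hstack([X, np.ones((len(X), 1))])   # add a bias column
    w = np.zeros(Xb.shape[1])
    for _ in range(steps):
        p = 1 / (1 + np.exp(-Xb @ w))
        w -= lr * Xb.T @ (p - y) / len(y)
    return w

k = 2
centers, labels = kmeans(X, k)
models = {j: fit_logistic(X[labels == j], y[labels == j]) for j in range(k)}

def predict(x):
    """Route a new case to its nearest cluster, then apply that model."""
    j = ((x - centers) ** 2).sum(-1).argmin()
    p = 1 / (1 + np.exp(-(np.append(x, 1.0) @ models[j])))
    return int(p > 0.5)
```

On this toy data the clustering recovers the two groups, so each per-cluster model only has to learn its own group's churn rule.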

My question is, is it an appropriate method? Does it violate anything, or is it considered bad for some reason? If so, why?

If not, would you share some best practices on that issue? And another question: is it generally better or worse to do pre-clustering than to use a model tree (as defined in Witten & Frank: a classification/regression tree with models at the leaves)? Intuitively, the decision-tree stage seems to be just another form of clustering, but I don't know whether it has any advantages over "normal" clustering.

Tsundoku
Ziel

5 Answers


There is a method called clusterwise regression that solves a similar problem (it first clusters the data and then builds predictive models). See for example this.
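For illustration only (this is a hand-rolled sketch, not the algorithm from the linked paper): the clusterwise idea can be approximated by alternating between fitting one regression per cluster and reassigning each point to the line that fits it best.

```python
import numpy as np

rng = np.random.default_rng(1)
# two latent regimes: y = 5x and y = x (compare the discussion below)
x = rng.uniform(-1, 1, 200)
g = rng.integers(0, 2, 200)
y = np.where(g == 0, 5 * x, x) + rng.normal(0, 0.1, 200)

def clusterwise_regression(x, y, k=2, iters=30, seed=0):
    """Alternate between fitting one slope per cluster and
    reassigning each point to the best-fitting line."""
    r = np.random.default_rng(seed)
    assign = r.integers(0, k, len(x))
    betas = np.zeros(k)
    for _ in range(iters):
        for j in range(k):
            m = assign == j
            if m.any():
                betas[j] = (x[m] @ y[m]) / (x[m] @ x[m])  # LS slope
        resid = np.abs(y[:, None] - x[:, None] * betas[None, :])
        assign = resid.argmin(axis=1)
    return betas, assign

betas, assign = clusterwise_regression(x, y)
# the recovered slopes approximate the two latent regimes (about 1 and 5)
```

Note this alternating scheme can overfit (as the quoted abstract warns), since cluster membership itself is chosen to minimize the residuals.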

sitems
  • I looked it up here: http://www.tandfonline.com/doi/abs/10.1080/00273170701836653 and found the following in the abstract: "In some cases, most of the variation in the response variable is explained by clustering the objects, with little additional benefit provided by the within-cluster regression models. Accordingly, there is tremendous potential for overfitting with clusterwise regression". Doesn't really seem promising. – Ziel Oct 12 '12 at 13:26
  • Ok, but they do not say that it always fails. I have never used the method; I only know that it may be a combination of supervised and unsupervised approaches, but there is only a small number of papers that use it. – sitems Oct 12 '12 at 13:30
  • In addition, most applications that I found are about marketing and finance, so maybe it is especially suitable for this kind of data. – sitems Oct 12 '12 at 13:39
  • It does seem very intuitive for the field of marketing: churn, cross-/upselling. – Ziel Oct 12 '12 at 13:50

Some points that are too long for a comment:

  • pure clusters (i.e. containing cases of only one class) are no problem per se: so-called one-class classifiers model each class independently of all others, and they can deal with this perfectly.

  • However, if the data cluster in a way that separates the classes quite well, i.e. the clusters are rather pure, this means that a very strong structure exists, one that cluster analysis is able to find without guidance from the class labels. This means that certain types of classifiers, such as nearest-neighbour methods based on the same distance measure used by the cluster analysis, are appropriate for the data.

  • The other possibility, where the clusters are not pure but a combination of clustering and classification can still do well, is suited to trees. The tree does the clustering part (and pure nodes are not considered a problem). Here's an artificial example, a 2-cluster version of the XOR problem:
    [figure: scatter plot of the 2-cluster XOR example]

  • another way to include the cluster information without running the risk of having pure clusters would be to use the clustering as a feature-generation step: add the outcome of the cluster analysis as new variables to the data matrix.

  • You ask whether it is bad for some reason: one pitfall is that this approach leads to models with many degrees of freedom, so you'll have to be particularly careful not to overfit.

  • Have a look at model-based trees, e.g. mbq's answer here; I think they implement a concept that is very close to what you are looking for. They can be implemented as forests as well, e.g. the R package mobForest.
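The feature-generation point in the list above can be sketched as follows (Python with an illustrative hand-rolled k-means; the helper names are mine):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 4))           # stand-in data matrix

def kmeans_labels(X, k=3, iters=15, seed=0):
    """Plain k-means, returning only the cluster assignments."""
    r = np.random.default_rng(seed)
    centers = X[r.choice(len(X), k, replace=False)]
    for _ in range(iters):
        labels = ((X[:, None] - centers) ** 2).sum(-1).argmin(1)
        centers = np.array([X[labels == j].mean(0) if np.any(labels == j)
                            else centers[j] for j in range(k)])
    return labels

labels = kmeans_labels(X, k=3)
onehot = np.eye(3)[labels]              # one-hot encode the cluster id
X_aug = np.hstack([X, onehot])          # append as new columns
```

A single downstream classifier trained on `X_aug` can then use the cluster membership, or ignore it, without ever being restricted to a pure subset of the data.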

cbeleites unhappy with SX

I'm dealing with a similar problem these days: I have hundreds of features to build a classifier from. After trying different models (e.g. random forests, gradient boosting, etc.), I still got low precision/recall. So I'm trying to do some clustering and then build classifiers for the different groups. My concern is, just as Anony-Mousse says: how can I gain more information from the classifier if I use all the information in the clustering? So here's what I'm going to do next:

  1. Use a few features (chosen according to prior knowledge) to do the clustering.
  2. Use the other (more numerous) features to train the classifiers.

I think it may also help to reduce complexity. I hope this helps.
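A sketch of the two steps above, under strong simplifying assumptions (the "clustering" on the prior-knowledge features is stood in for by a simple threshold, and the classifier is a minimal hand-rolled logistic regression; all data here is synthetic):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
# features reserved for clustering (few, chosen by prior knowledge) ...
X_cluster = np.vstack([rng.normal(-3, 1, (n // 2, 2)),
                       rng.normal(3, 1, (n // 2, 2))])
# ... and a disjoint, larger set reserved for the classifiers
X_clf = rng.normal(size=(n, 3))
y = (X_clf[:, 0] + X_clf[:, 1] > 0).astype(int)

# step 1: "cluster" on the small feature set (a simple threshold stands in
# for a real clustering algorithm here)
labels = (X_cluster[:, 0] > 0).astype(int)

def fit_logistic(X, y, steps=500, lr=0.1):
    """Minimal logistic regression via gradient descent."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    w = np.zeros(Xb.shape[1])
    for _ in range(steps):
        p = 1 / (1 + np.exp(-Xb @ w))
        w -= lr * Xb.T @ (p - y) / len(y)
    return w

# step 2: one classifier per group, trained only on the other features
models = {j: fit_logistic(X_clf[labels == j], y[labels == j]) for j in (0, 1)}
```

Because the two feature sets are disjoint, the clustering cannot "use up" the discriminative information the classifiers rely on, which is exactly the concern raised above.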


Building $k$ clusters and then $k$ corresponding models is absolutely feasible. The pathologic case noted in the comments, wherein clusters that perfectly separate the outcome variable would pose difficulties for classifiers, is a theoretical problem, but one which I think is unlikely (especially in the high-dimensional case). Furthermore, if you could build such clusters, you could just use those clusters for prediction!

In addition, if the process begins with $N$ samples, the classifiers can only use roughly $N/k$ samples each. Thus, a more powerful approach would be to use the clusters in building a single classifier that incorporates the heterogeneity in the clusters, using a mixture of regressions. In model-based clustering, one assumes the data are generated from a mixture distribution $Y_i \sim N(\mu_i, \sigma_i^2)$ where $i=1$ with probability $\pi$ and $i=2$ with probability $1-\pi$, and $\mu_1 \neq \mu_2$ and $\sigma_1^2 \neq \sigma_2^2$. A mixture regression is an extension that allows one to model the data as being dependent on covariates: $\mu_i$ is replaced with $\beta_i X_i$, where the $\beta_i$ have to be estimated. While this example is for the univariate Gaussian case, the framework accommodates many data types (a multinomial logit would be appropriate for categorical variables). The flexmix package for R provides a more detailed description and a relatively easy and extensible way to implement this approach.

Alternatively, in a discriminative setting, one could try incorporating the cluster assignments (hard or soft) as a feature for training the classification algorithm of choice (e.g. NB, ANN, SVM, RF, etc.).
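flexmix itself is an R package; purely as an illustration of the mixture-of-regressions idea described above, here is a minimal EM sketch in Python for two univariate regression components (the data, initial values, and function name are all made up for the example):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 300
x = rng.uniform(-2, 2, n)
comp = rng.random(n) < 0.5                       # latent component
y = np.where(comp, 2.0 * x, -1.0 * x) + rng.normal(0, 0.3, n)

def em_mixture_regression(x, y, iters=50):
    """EM for a two-component mixture of simple linear regressions."""
    beta = np.array([1.0, -0.5])    # initial slopes (must differ)
    sigma = np.array([1.0, 1.0])
    pi = np.array([0.5, 0.5])
    for _ in range(iters):
        # E-step: responsibility of each regression line for each point
        dens = pi * np.exp(-(y[:, None] - x[:, None] * beta) ** 2
                           / (2 * sigma ** 2)) / sigma
        r = dens / dens.sum(axis=1, keepdims=True)
        # M-step: weighted least squares and variance per component
        for j in (0, 1):
            w = r[:, j]
            beta[j] = (w * x * y).sum() / (w * x * x).sum()
            sigma[j] = np.sqrt((w * (y - beta[j] * x) ** 2).sum() / w.sum())
        pi = r.mean(axis=0)
    return beta, pi

beta, pi = em_mixture_regression(x, y)
```

Unlike the cluster-then-model pipeline, the soft responsibilities let every sample contribute to every component's fit, so no classifier is restricted to $N/k$ samples.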

Sameer

Well, if your clusters are really good, your classifiers will be crap, because they won't have enough diversity in their training data.

Say your clusters are perfect, i.e. pure. Then you can't properly train a classifier on them anymore. Classifiers need positive and negative examples!

Random Forests are very successful at doing the exact opposite: they take a random sample of the data, train a classifier on it, and then use all of the trained classifiers together.

What might work is to use clustering and then train a classifier on every pair of clusters, at least if they disagree enough (if a class is split across two clusters, you still cannot train a classifier there!).
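The point about pure clusters, and the pairwise idea, can be made concrete (illustrative sketch; `trainable` is my own helper name):

```python
import numpy as np

# a "pure" cluster: every subscriber in it churned
y_cluster = np.ones(50, dtype=int)
# only one class present: nothing for a discriminative classifier to learn,
# so any model degenerates to a constant prediction
assert len(np.unique(y_cluster)) == 1

# the pairwise idea: only train on a pair of clusters whose labels disagree
def trainable(y_a, y_b):
    """True if the combined labels contain more than one class."""
    return len(np.unique(np.concatenate([y_a, y_b]))) > 1

print(trainable(np.ones(10), np.ones(10)))    # False: both pure, same class
print(trainable(np.ones(10), np.zeros(10)))   # True: classes disagree
```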

Has QUIT--Anony-Mousse
  • The purpose of the clustering is not to find "pure" clusters, i.e. ones that are awesome at discriminating my target variable. The purpose of the clustering is finding groups that are homogeneous in the "other" area. To give an example: I think that in churn there are "quality-only" customers and "cost-optimizing" customers. I don't think I should assume that the relevant features for classification are the same in both groups, so I want to build a separate model for each group. Of course I don't have explicit "quality" and "cost" groups, hence the idea of clustering to derive such groups from the data first. – Ziel Oct 12 '12 at 14:37
  • Any kind of extra imbalance and correlation in the data can harm. See, a classifier may *want* to discern "quality only" and "cost optimizing". If it only gets one group, it cannot make use of this distinction. – Has QUIT--Anony-Mousse Oct 12 '12 at 15:12
  • Maybe I explained it poorly. "Quality only" and "cost optimizing" are latent, unobservable. Maybe they don't even exist; it's a hypothesis. For the sake of explanation let's say they do. Say that for the QO group the relevant variables for discrimination are X1-X5, and for the CO group they are X6-X10. It's no use throwing both groups into one classifier, because you don't have an observable dummy "QO vs. CO". You will get some average betas for X1-X10, not suitable for either group. So the idea is to do clustering and find relevant groups that may be governed by different data-generating processes. – Ziel Oct 12 '12 at 15:31
  • That doesn't change what I'm saying. The more information you use for splitting your data set, the less the classifiers can train on. If the split is good, the classifiers will likely become bad; if the split is *worthless*, they work well. Again, let me reiterate that Random Forests work because they *are* random. – Has QUIT--Anony-Mousse Oct 12 '12 at 16:12
  • "The more information you use for splitting your data set (..)" I have a hard time grasping that. Certainly, if I wanted to check cluster validity using some ANOVA etc., it would be wise to do the clustering on one set of variables and the validity testing on another, disjoint set (to prevent overfitting). However, here I'm concerned only with the predictive power of a model, and I don't get how exactly clustering takes anything away from it. A predictive model uses variables discriminatively due to a label; clustering is unsupervised, so they should be complementary, not take away from one another. Correct? – Ziel Oct 12 '12 at 20:47
  • And the point really was that one classifier wouldn't work well, because the two groups are governed by different explanatory variables. Imagine 100 observations with Y=5X+rnorm and a hundred with Y=X+rnorm. Applying one model to the whole set is nonsense; an average Y=3X doesn't say anything about the data. That was the rationale for using clustering first and then two classifiers. I'm just extending this reasoning to the high-dimensional case. – Ziel Oct 13 '12 at 08:27
  • Which classifier uses the average? Now you are talking regression. The point is: if your clusters are good, they *could* become next to pure, at which point your classifiers will be crap, because they have too little training data for the other class. In the worst case, none. Just assume your clusters are pure, how are you going to train your classifiers? But if you still don't trust my experience, *go ahead and try it*, and share your findings if clustering helps in your case more than say random forests. It *can* work *sometimes*. In particular if your clusters are bad. – Has QUIT--Anony-Mousse Oct 13 '12 at 11:23
  • @Anony-Mousse (+1) One can either cluster to get pure clusters to build the classifier, which is just the same as building a decision tree (whose splits are some sort of clustering with labels), or one can cluster the data to find general structures, but these may or may not support the classification task. In the first case the classifier will perform as well as **a** decision tree; in the second case it gets worse. => go go random forests. – mlwida Oct 15 '12 at 09:53
  • @Anony-Mousse I guess you are right, based on my experience + intuition, but I also think it is hard to grasp without some sort of "intuitive illustration". – mlwida Oct 15 '12 at 09:59
  • I'd say random forests are very good not at doing the opposite but at doing exactly a kind of clustering that hopefully soon yields pure leaves... Also, **good clustering does not imply problems for the classifier training** (sorry: -1 for now). There are lots of classification techniques that would have no problem whatsoever. E.g., you may set up the cluster classifier so that a constant label is returned if the training cluster was pure (compare to tree-based classifiers)... Nearest neighbours work as well. SIMCA-like approaches come to my mind, too. Please see my answer. – cbeleites unhappy with SX Oct 15 '12 at 16:35
  • But only if you do a two-level approach: first classify by the clusters, then evaluate the cluster's classifier. Otherwise the constant classifier is useless. Then you are putting all the burden on the clustering. – Has QUIT--Anony-Mousse Oct 15 '12 at 16:57
  • Well, that's how I understood the OP. – cbeleites unhappy with SX Oct 15 '12 at 22:32
  • You can of course do this, but chances are that your clusters aren't that good and that you are better off with a proper ensemble of "overlapping" classifiers, just like Random Forests. – Has QUIT--Anony-Mousse Oct 15 '12 at 23:35