If I want to build a predictive GLM, should I run the cluster analysis on 100% of the observations or only on the training sample (80%)? Thanks
-
How does cluster analysis relate to the GLM in your analysis? – Tim Jan 30 '15 at 10:16
-
Tim, I have to build a GLM, and one of my explanatory variables will be a clustered one (for example, a zone derived from zip code). My problem is this: since I split my dataset into training and validation samples, do I have to run the clustering on 100% of my data (before the split) or only on the training sample? – Smith Feb 02 '15 at 15:20
-
Please register & merge your accounts. Then you will be able to comment on your own question. You can find out how in our [help]. – gung - Reinstate Monica Feb 02 '15 at 15:23
-
Eric, first of all, thanks for your answer. I'm in the first situation (I'm certain about how I'm going to map zip codes to zones), so I should do the clustering before splitting. Have you read any papers or books showing that it is theoretically correct to cluster before splitting the dataset? If so, could you give me the title or send me the link? Thanks again – Smith Feb 03 '15 at 13:07
1 Answer
The question is, "what are you trying to test with your holdout data?"
If you are certain about how you are going to map zips to zones, then you should do the clustering before you split your data into training and validation sets.
If you are testing whether a particular clustering is effective in your modeling (which I believe is the case here), then you should fit your clusters on your training data and then test how well those clusters hold up on your holdout data.
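A minimal sketch of that second case, in plain Python. Everything here is an assumption for illustration: the data are simulated, `fit_zones` and `apply_zones` are hypothetical helper names, and ranking zips by their mean training response and cutting the ranking into equal groups is only a stand-in for whatever clustering method you actually use. The point is that the zip-to-zone mapping is learned from the training rows alone and then applied unchanged to the holdout rows.

```python
import random
from collections import defaultdict

def fit_zones(train_rows, n_zones=3):
    """Learn a zip -> zone mapping from training rows only.

    train_rows: list of (zip_code, response) pairs.
    Zips are ranked by mean training response and cut into n_zones
    groups of roughly equal size (a stand-in for a real clustering).
    """
    totals = defaultdict(lambda: [0.0, 0])
    for zip_code, y in train_rows:
        totals[zip_code][0] += y
        totals[zip_code][1] += 1
    means = {z: s / n for z, (s, n) in totals.items()}
    ranked = sorted(means, key=means.get)
    return {z: i * n_zones // len(ranked) for i, z in enumerate(ranked)}

def apply_zones(rows, zone_of, unseen_zone=-1):
    """Look up the learned zone; zips never seen in training get unseen_zone."""
    return [zone_of.get(zip_code, unseen_zone) for zip_code, _ in rows]

# Simulated data: 20 made-up zips, 10 observations each.
random.seed(0)
rows = [(f"Z{k:02d}", random.gauss(k % 4, 1.0))
        for k in range(20) for _ in range(10)]
random.shuffle(rows)

# 80/20 split first, then fit the zones on the training part only.
cut = int(0.8 * len(rows))
train, holdout = rows[:cut], rows[cut:]

zone_of = fit_zones(train)
train_zones = apply_zones(train, zone_of)      # zone feature for fitting the GLM
holdout_zones = apply_zones(holdout, zone_of)  # same mapping, no refitting
```

The holdout responses never influence `zone_of`, so the holdout error of a model that uses the zone feature also reflects how well the zones themselves generalize. Decide in advance how to handle zips that appear only in the holdout data; the sketch assigns them a separate zone.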
You don't have to use the same training and holdout datasets for fitting the clusters and for fitting the model, and you don't have to test both at the same time. It might be beneficial to fit your other model variables first and then go back and fit your clusters, or to fit your clusters first and then go back and fit the other model variables* (as opposed to trying to fit both at once and having to untangle how changes to the model and changes to the clusters interact).
*If you are fitting the clusters first, you will need to know in advance which characteristic(s) of the zips you are clustering on and how you will define a "good" fit. If you don't have well-defined characteristics for judging whether the clusters are a good grouping, then you are probably better off fitting all of your other variables first and coming back to the clustering.
Obligatory warning about binning data, which could be what your clusters are doing. What is the benefit of breaking up a continuous predictor variable?
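To make the binning concern concrete, here is a small simulation. It assumes a response that is linear in a continuous predictor plus noise, which may not match your data. All numbers are simulated and the variable names are made up for the example. It compares the in-sample mean squared error of a straight-line fit against predicting the mean of each of three equal-width bins.

```python
import random

random.seed(1)
n = 1000
x = [random.random() for _ in range(n)]          # continuous predictor in [0, 1)
y = [xi + random.gauss(0.0, 0.1) for xi in x]    # assumed linear signal plus noise

def mse(pred):
    """Mean squared error of predictions against y."""
    return sum((yi - pi) ** 2 for yi, pi in zip(y, pred)) / n

# Least-squares line on the continuous predictor.
mx, my = sum(x) / n, sum(y) / n
slope = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
         / sum((xi - mx) ** 2 for xi in x))
linear_pred = [my + slope * (xi - mx) for xi in x]

# Same predictor cut into three equal-width bins, predicting each bin's mean.
bins = [min(int(xi * 3), 2) for xi in x]
bin_mean = {b: sum(yi for yi, bi in zip(y, bins) if bi == b) / bins.count(b)
            for b in set(bins)}
binned_pred = [bin_mean[b] for b in bins]

mse_linear, mse_binned = mse(linear_pred), mse(binned_pred)
print(f"linear fit MSE: {mse_linear:.4f}")
print(f"binned fit MSE: {mse_binned:.4f}")
```

Under this assumed setup the binned fit has the larger error, because all variation of the predictor within a bin is thrown away. If the true relationship is strongly non-linear or step-like, the comparison can go the other way.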