How can I or should I cluster the data before regression?

Question

I have data for thousands of basins (rain, modeled streamflow, other indicators). But obviously, these basins can be clustered, as there is something called catchment similarity. Similarity between basins does not depend on geographical proximity, but on a wide range of factors. So, to develop a post-processor (e.g. http://www.adv-geosci.net/29/51/2011/), I can:

1. Develop a single multivariate model for the entire dataset (would this waste information?), or
2. Cluster the data and develop an individual regression model for each cluster.

What is the state of the art for this kind of combined clustering and regression? Any ideas or other approaches are welcome.

Please note that I will have to cluster/classify on my own first.
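
To make the two options concrete, here is a minimal sketch that compares them on held-out data. Everything in it is an assumption for illustration: the data are synthetic, the two clustering attributes stand in for real basin descriptors (for example aridity and slope), and the function names (make_synthetic_basins, kmeans, compare) are hypothetical. It uses a small hand-written k-means and ordinary least squares with NumPy only, not the method from the linked paper.

```python
import numpy as np

def make_synthetic_basins(n=600, seed=0):
    """Synthetic stand-in for basin data: 3 regimes with different flow biases."""
    rng = np.random.default_rng(seed)
    centers = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])  # regime attribute means
    slopes = np.array([0.5, 1.5, -1.0])                       # regime-specific response
    regime = rng.integers(0, 3, size=n)
    attrs = centers[regime] + rng.normal(scale=0.5, size=(n, 2))
    model_flow = rng.uniform(0, 10, size=n)
    observed = slopes[regime] * model_flow + rng.normal(scale=0.3, size=n)
    return np.column_stack([attrs, model_flow]), observed

def fit_ols(X, y):
    """Least-squares fit with an intercept column."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_ols(coef, X):
    return np.column_stack([np.ones(len(X)), X]) @ coef

def assign(Z, centroids):
    """Index of the nearest centroid for each row of Z."""
    d = ((Z[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1)

def kmeans(Z, k, n_iter=50):
    """Plain k-means with a deterministic farthest-first initialisation."""
    centroids = [Z[0]]
    for _ in range(1, k):
        d = np.min([((Z - c) ** 2).sum(axis=1) for c in centroids], axis=0)
        centroids.append(Z[d.argmax()])
    centroids = np.array(centroids, dtype=float)
    for _ in range(n_iter):
        labels = assign(Z, centroids)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = Z[labels == j].mean(axis=0)
    return centroids

def compare(k=3, seed=0):
    """Return held-out RMSE of (single global model, per-cluster models)."""
    X, y = make_synthetic_basins(seed=seed)
    n_train = int(0.7 * len(X))
    Xtr, ytr, Xte, yte = X[:n_train], y[:n_train], X[n_train:], y[n_train:]

    # Option 1: one regression for all basins.
    pred_global = predict_ols(fit_ols(Xtr, ytr), Xte)

    # Option 2: cluster on the basin attributes only, then fit one model per cluster.
    centroids = kmeans(Xtr[:, :2], k)
    lab_tr, lab_te = assign(Xtr[:, :2], centroids), assign(Xte[:, :2], centroids)
    pred_clustered = pred_global.copy()  # fall back to the global model
    for j in range(k):
        tr, te = lab_tr == j, lab_te == j
        if tr.sum() >= 10 and te.any():  # skip clusters too small to fit
            pred_clustered[te] = predict_ols(fit_ols(Xtr[tr], ytr[tr]), Xte[te])

    rmse = lambda p: float(np.sqrt(np.mean((p - yte) ** 2)))
    return rmse(pred_global), rmse(pred_clustered)

rmse_global, rmse_clustered = compare()
print(f"global RMSE: {rmse_global:.3f}, per-cluster RMSE: {rmse_clustered:.3f}")
```

The synthetic data are built so that the clusters carry information about the response, which favours option 2 by construction. On real basins the comparison could go either way, so the held-out error of both options is what should decide.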

It would probably be best to try both approaches. You say that it "obviously" can be clustered, but it is not actually clear how well the data clusters, nor whether the information it clusters on is related to what you are trying to predict. You should analyse these questions, but in the end, it's best to try both approaches. — user3494047, Mar 01 '17 at 02:41
@user3494047 The differences are well known in the literature. For example, basins in mountainous or arid regions behave differently. I want a way to take this information into account in the model... — maximusdooku, Mar 01 '17 at 02:43
Perhaps it is as you say and the data clusters perfectly. For example, suppose you have data about humans and you are able to perfectly cluster the data into people from the east coast and people from the west coast. If you are trying to use regression to predict how many children a person will have, the clustering might not help you at all if being from the east or west coast provides no information on how many children you are likely to have. Although it might. — user3494047, Mar 01 '17 at 02:47
Thank you. So the clustering exists in a continuum. So, clear separation would be difficult. But I can definitely create 10 bins (for example) for the basins... — maximusdooku, Mar 01 '17 at 02:49
What are you trying to predict? I do not know the kind of data you're working with, but it is true that clustering as pre-processing can sometimes be useful. — user3494047, Mar 01 '17 at 02:52
I am trying to reduce bias in model predictions of basin streamflow. For each basin I have a large number of variables such as soil type, rainfall amount, how much water is in the soil, etc. The basins are located all over the US. — maximusdooku, Mar 01 '17 at 02:58
You can do both clustering and regression in a single model; see http://stats.stackexchange.com/q/245902/35989 — Tim, Mar 03 '17 at 16:53

0 Answers