We are conducting research on neighborhood mail response behavior, i.e. what percentage of people in a neighborhood reply to a piece of mail.

Based on regression analysis, we know which factors (% black, % poor, etc.) influence mail response rates. I’m toying with the idea of using the significant variables from the regression model to construct clusters that could inform outreach/advertising in these different neighborhoods. In other words, clusters would help us identify what combination of reasons leads to different response rates in different areas.

How can this be done? I want the clustering to be informed by the mail response rates. Should I just include response rates as one of the clustering variables? Or is there a way to include response rates as a dependent variable? The clustering techniques I am familiar with are unsupervised, without a dependent variable.

Mihai Chelaru

1 Answer

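Concluding "which factors influence mail response" from the regression has the same problem as inferring causation from correlation: it ignores confounders. A significant coefficient may only reflect an unmeasured variable that is correlated with both the predictor and the response rate.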
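
To see how this can go wrong, here is a minimal simulation sketch. Everything in it is made up for illustration: `z` stands for some unmeasured neighborhood trait, `x` for an observed covariate such as % poor, and `y` for the response rate. The coefficients are arbitrary assumptions, not estimates from any real data. By construction `x` has no effect on `y`, yet the naive regression still reports a clearly nonzero slope.

```python
import numpy as np


def simulate_confounding(n=5000, seed=0):
    """Return (naive_slope, adjusted_slope) for x in a toy confounded setup."""
    rng = np.random.default_rng(seed)

    # Hypothetical unmeasured confounder (e.g. some neighborhood trait).
    z = rng.normal(size=n)
    # Observed covariate: partly driven by z (assumed coefficient 0.8).
    x = 0.8 * z + rng.normal(scale=0.6, size=n)
    # Response depends on z only; the true effect of x is exactly zero.
    y = 1.0 * z + rng.normal(scale=0.5, size=n)

    # Naive regression y ~ 1 + x, which omits the confounder.
    X_naive = np.column_stack([np.ones(n), x])
    naive_slope = np.linalg.lstsq(X_naive, y, rcond=None)[0][1]

    # Regression y ~ 1 + x + z, possible only if z were observed.
    X_adj = np.column_stack([np.ones(n), x, z])
    adjusted_slope = np.linalg.lstsq(X_adj, y, rcond=None)[0][1]

    return naive_slope, adjusted_slope


naive_slope, adjusted_slope = simulate_confounding()
print(f"slope of x, confounder omitted:  {naive_slope:.3f}")
print(f"slope of x, confounder included: {adjusted_slope:.3f}")
```

In this setup the naive slope is cov(x, y)/var(x) = 0.8/1.0 = 0.8 in expectation, while the adjusted slope is close to zero. In the real problem `z` is not observed, so the second regression is not available, which is why the significant variables alone cannot tell you what drives response.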

A much better approach to this exact kind of problem is described in a recent paper, "The Blessings of Multiple Causes" by Wang and Blei.

Neil G