
I want to classify U.S. states into two groups, rich and poor, and then compare their socioeconomic factors using hypothesis tests. In addition to median income, I'd like to use other factors, such as the unemployment rate, for the initial classification.

For this initial classification, before conducting any hypothesis tests, which of the following would you recommend?

  1. Classify them manually (on manual inspection, some states fall between the two groups and are quite difficult to assign)
  2. Use an unsupervised machine learning technique to create two clusters
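To make option 2 concrete, here is a minimal sketch of what I have in mind: standardize the features, then run k-means with k = 2. The six entries and their values are hypothetical placeholders, not real state data, and the helper names `standardize` and `two_means` are my own. Only the Python standard library is used.

```python
import statistics

# Hypothetical illustrative values only, NOT real state data:
# (median income in $1000s, unemployment rate in %)
data = {
    "A": (78.0, 3.1),
    "B": (74.0, 3.6),
    "C": (81.0, 2.9),
    "D": (52.0, 6.2),
    "E": (49.0, 6.8),
    "F": (55.0, 5.9),
}


def standardize(rows):
    """Rescale each column to mean 0 and standard deviation 1."""
    cols = list(zip(*rows))
    means = [statistics.mean(c) for c in cols]
    sds = [statistics.pstdev(c) for c in cols]
    return [tuple((x - m) / s for x, m, s in zip(r, means, sds)) for r in rows]


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def two_means(points, iters=100):
    """Plain k-means with k = 2, seeded with the first point and the point farthest from it."""
    centers = [points[0], max(points, key=lambda p: sq_dist(p, points[0]))]
    labels = None
    for _ in range(iters):
        # Assign every point to its nearest center.
        new_labels = [min((0, 1), key=lambda k: sq_dist(p, centers[k])) for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Move each center to the mean of its members.
        for k in (0, 1):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centers[k] = tuple(statistics.mean(c) for c in zip(*members))
    return labels


labels = two_means(standardize(list(data.values())))
groups = dict(zip(data, labels))
print(groups)
```

The cluster labels 0 and 1 carry no meaning by themselves; which one counts as "rich" still has to be decided from the cluster centers. Also, comparing the two clusters on the very variables used to form them would be circular, so any hypothesis test would need to use factors that were left out of the clustering.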

Thank you for your suggestions.

Adrian Keister
golden
  • 4
    How would you apply an unsupervised technique without first specifying, quantitatively, what "rich" and "poor" mean? This sounds circular. – whuber Nov 02 '20 at 19:06
  • 4
    [Don't bin your continuous data](https://stats.stackexchange.com/q/68834/1352). Feed them into your algorithm as-is; potentially transform them using (e.g.) restricted cubic splines (see, e.g., Frank Harrell's *Regression Modeling Strategies*) to capture any nonlinearity. – Stephan Kolassa Nov 02 '20 at 19:28
  • Let's say you use an algorithm to classify the states as poor/rich. Would you then test hypotheses with similar techniques on the same data? The labels should come from outside the algorithm... – quester Nov 02 '20 at 19:53

0 Answers