
I want to run a TwoStep cluster analysis, since the database I am analysing contains important metric as well as nominal variables.

=> Question #1: Should the numbers of binary and metric variables be roughly equal? I use 3 binary variables but many more metric ones. Will one binary variable (of only a few) influence the shape of the clusters more than one metric variable (of many)?

=> Question #2: Does it "confuse" the algorithm if some binary variables are coded 0/1 and others 1/2? Or does it merely assess the distance between cases and not care about the coding at all?
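To make the intuition behind Question #2 concrete, here is a minimal Python sketch. It assumes the distance is built from raw differences between coded values, which is my assumption and not something I have confirmed for SPSS TwoStep or Stata; the helper name `pairwise_abs_diff` is made up for the illustration.

```python
def pairwise_abs_diff(values):
    """Matrix of absolute differences between all pairs of cases."""
    return [[abs(a - b) for b in values] for a in values]

# The same binary variable for four cases, coded two different ways.
coded_01 = [0, 1, 1, 0]
coded_12 = [v + 1 for v in coded_01]  # identical information, coded 1/2

# Adding a constant to every value cancels out in each difference,
# so both codings give the same difference matrix.
same_matrix = pairwise_abs_diff(coded_01) == pairwise_abs_diff(coded_12)
print(same_matrix)
```

Under that assumption the coding only matters if the two possible values are further apart in one variable than in another (e.g. 0/1 versus 0/5), not if they are merely shifted.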

Also, I know that with "normal" cluster analysis you can choose among different coefficients for comparing cases. Some count shared absences as similarities (e.g. the Simple Matching coefficient), while others count only shared presences (Tanimoto / Jaccard). To my knowledge, the latter is useful when dummies are used: suppose two people are both NOT members of the Republicans, NOT members of the Democrats, NOT members of the Green Party, but are both members of the Libertarian Party. If only positive values are considered, they have one thing in common; if both negative and positive values are considered, they have four things in common (although it is really just one).
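The party example can be written out as a small Python sketch. The function names and the extra "Republican" case are my own additions for illustration; this shows only the textbook definitions of the two coefficients, not what SPSS or Stata does internally.

```python
def match_counts(a, b):
    """Return (shared presences, shared absences, mismatches) for two 0/1 vectors."""
    both_present = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    both_absent = sum(1 for x, y in zip(a, b) if x == 0 and y == 0)
    mismatches = sum(1 for x, y in zip(a, b) if x != y)
    return both_present, both_absent, mismatches

def simple_matching(a, b):
    """Simple Matching: shared presences AND shared absences count as agreement."""
    p, n, m = match_counts(a, b)
    return (p + n) / (p + n + m)

def jaccard(a, b):
    """Jaccard / Tanimoto: only shared presences count; shared absences are ignored."""
    p, _, m = match_counts(a, b)
    if p + m == 0:
        return float("nan")  # undefined when neither case has any presence
    return p / (p + m)

# Dummy order: Republican, Democrat, Green, Libertarian
libertarian_1 = [0, 0, 0, 1]
libertarian_2 = [0, 0, 0, 1]
republican = [1, 0, 0, 0]  # hypothetical third person, added for contrast

print(match_counts(libertarian_1, libertarian_2))  # (1, 3, 0)
print(simple_matching(libertarian_1, republican))  # 0.5
print(jaccard(libertarian_1, republican))          # 0.0
```

The two Libertarians share one presence and three absences, which is the "one versus four things in common" from the example. The Libertarian and the Republican belong to different parties, yet Simple Matching still gives them a similarity of 0.5 purely from the two parties neither belongs to, while Jaccard gives 0.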

Since I was going to use dummies to code employment status, I also have the following questions:

=> Question #3: Can I choose the coefficient used for binary variables when I do a TwoStep cluster analysis? (I was going to use SPSS, but Stata is also an option.)

=> Question #4: If not, which coefficient does that analysis use? Are mutual non-values considered a similarity?

=> Question #5: If mutual non-values are considered a similarity: Is there a way to reduce the kind of double counting shown in the example above? Transforming the binary variables into metric ones is not feasible; is there anything else?

I would be VERY happy if any of you could help me with these questions! I have already searched the literature on them but, sadly, have found no answers yet.

ttnphns
Alex R.
  • What do you mean by "coefficient"? And what is the clustering algorithm? There are a lot of clustering algorithms out there, and I'm not personally familiar with any two-step ones but I'd appreciate a name I could look up so maybe I could help. – shadowtalker Oct 02 '14 at 01:35
  • By coefficient I mean either the Simple Matching coefficient, or R&R, or Tanimoto (from what I've gathered, those were the most typical ones). – Alex R. Oct 02 '14 at 02:26
  • (Sorry, pressed Enter instead of shift-enter) That is, whether shared non-values are considered to signify similarity. As for the algorithm, I was talking about the one that compares two cases in terms of their similarity on a variable. From my understanding, these are assessed with a difference matrix, so it should make little difference how the variables are encoded, as long as a distance of 1 between two cases implies the same thing. So having one variable run from 0 to 1 and another from 1 to 2 should not change the accuracy of the analysis. But it still seemed off to me to have differing ranges. – Alex R. Oct 02 '14 at 02:36
  • Check [this](http://stats.stackexchange.com/a/116859/3277) relevant answer with further links in it. – ttnphns Oct 02 '14 at 05:02

0 Answers