How to find similarity based on only one column value

Question

I am not sure if it is possible and that is why I am asking the question here.

My data look like this:

Column A     Column B
  A           0.098
  B           0.076
  C           0.871
  D           0.837
  E           1.981
  F           0.736
  G           0.983
  H           0.019
  I           0.836
  J           0.936
  K           0.197
  L           0.986
  M           0.084
  N           0.048
  O           0.471

The values in column B are linked to the letters in column A. Is it possible to decide which letters are similar and which are not, based only on the values in column B? If so, how?

Generally, this question sounds like you are looking for a one-dimensional [clustering solution.](http://stats.stackexchange.com/questions/tagged/clustering) The existence of good answers, such as at http://stats.stackexchange.com/questions/2717, indicates the answer is "yes," but *how* you formulate that answer depends on how you quantify "similar." — whuber, Jul 04 '16 at 16:01
@whuber thank you very much, would it be possible you give me an example answer ? — nik, Jul 04 '16 at 18:14
Can't really see the issue here. Most clustering algorithms would work I guess. — Firebug, Jul 04 '16 at 18:30
The thread I linked to includes two worked examples plus many other general answers. — whuber, Jul 04 '16 at 19:11
@whuber the problem is that i don't know how to calculate the matrix in that example you showed — nik, Jul 04 '16 at 20:01
Neither do we: **you** have to specify how the distance between any two rows of your data will be computed. The obvious method (use the absolute value of the differences in $B$) might or might not be of any use. — whuber, Jul 04 '16 at 20:52
@whuber can you just give me an example? then you can remove your answer if you want. I just want to know how to do that matrix then everything is OK — nik, Jul 04 '16 at 21:03
Sure: the distance between A and B could be taken as $|0.098 - 0.076| = 0.022$. — whuber, Jul 04 '16 at 21:06
@whuber Thanks, that much I know! I want to know whether it is statistically correct. Sorry for the misunderstanding, that is all I want to know. — nik, Jul 04 '16 at 21:12
"Statistically correct" has no meaning in this context. Your problem itself determines the "distances" between the observations. No amount of theory can do that. — whuber, Jul 04 '16 at 21:17
@whuber thank you for your clarification, but imagine I draw a conclusion based on the distance matrix between all my elements and then perform one-way clustering. How statistically significant can my result be? I probably need replicates, is that right? — nik, Jul 04 '16 at 21:22
If this is what your data set looks like, then you don't need any algorithm. It's a simple mapping of B to A. — Aksakal, Jul 05 '16 at 17:02
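
The distance suggested in the comments above (the absolute difference of the values in column B) can be sketched in R as follows. This is only an illustration under that assumption; the vector name `b` and the matrix name `d_mat` are made up here, and whether this distance is meaningful depends on the problem.

```r
# Data from the question: one value (Column B) per letter (Column A)
b <- c(A = 0.098, B = 0.076, C = 0.871, D = 0.837, E = 1.981,
       F = 0.736, G = 0.983, H = 0.019, I = 0.836, J = 0.936,
       K = 0.197, L = 0.986, M = 0.084, N = 0.048, O = 0.471)

# Pairwise absolute differences |b_i - b_j| as a full 15 x 15 matrix
d_mat <- abs(outer(b, b, "-"))

# dist() returns the same numbers in lower-triangle form; with a single
# column, "manhattan" and "euclidean" both reduce to |b_i - b_j|
d <- dist(b, method = "manhattan")

# Distance between A and B, the example given in the comments
round(d_mat["A", "B"], 3)
```

The `d` object is what `hclust` expects as input, while `d_mat` is easier to inspect by hand.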

EngrStudent · Accepted Answer · 2016-07-05T16:52:09.770

I used JMP for this, but many other tools would work too.

Here is how to do it in JMP:

This dialog shows several families of how to make a 1d clustering.

Choosing the defaults, and coloring by value of "Column B", then selecting only the constellation plot gives the following:

This uses only the 2-norm (Euclidean distance) between all pairs of points. Other norms, and other transformations of the data, are possible.

When the distribution is plotted, the "clusters" (vertical groupings) are also somewhat visible.

To my untrained eye there are at least 3 general groups, and one of those has what looks like 3 sub-groups.

Here is how to do something similar (not identical) in R:

Here is the code:
#load data
mydata <- read.csv("data.csv")

#compute distance matrix
d <- dist(as.matrix(mydata$Column.B),  #the data
          method = "euclidean")        #the distance measure

#compute cluster membership
hc_0 <- hclust(d,                      #the distance matrix
               method = "ward.D2"       #the cluster method
               ) 

#plot of dendrogram
plot(hc_0)

Here is the result:

It is not a constellation plot, but a dendrogram. Other functions to look at include "rect.hclust" and the "rpuHclust". The "rpuHclust" comes from the "rpuHclust" package.
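
To turn the dendrogram into explicit group labels, one option is `cutree`. The sketch below re-enters the question's data so that it runs on its own; the choice of `k = 3` is an assumption taken from the "at least 3 general groups" impression above, not something the data dictate.

```r
# Data from the question: one value (Column B) per letter (Column A)
b <- c(A = 0.098, B = 0.076, C = 0.871, D = 0.837, E = 1.981,
       F = 0.736, G = 0.983, H = 0.019, I = 0.836, J = 0.936,
       K = 0.197, L = 0.986, M = 0.084, N = 0.048, O = 0.471)

# Same distance and linkage as in the code above
hc <- hclust(dist(b, method = "euclidean"), method = "ward.D2")

# Cut the tree into k groups (k = 3 is an assumed value, try others)
groups <- cutree(hc, k = 3)

# Letters that fall into each group
split(names(groups), groups)

# Draw the dendrogram with boxes around the k groups
if (interactive()) {
  plot(hc)
  rect.hclust(hc, k = 3)
}
```

Changing `k` and re-running shows how sensitive the grouping is to the number of clusters requested.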

There are at least 3 other things to consider.

1. The "method" in "dist" can be several things, including "manhattan" (absolute distance), "maximum", or "minkowski". With a single column these all reduce to the same absolute difference, so the choice matters mainly once the data are transformed or more columns are added.
2. The "method" in "hclust" can be many other things, including "ward.D", "average", "centroid", "median" and others. Each has strengths and weaknesses.
3. "hclust" is one type of clustering, but there are others. K-means or Gaussian mixture models come to mind.
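
As a rough illustration of the alternatives just listed, the sketch below runs k-means and a second `hclust` linkage on the question's data and cross-tabulates the labels. The number of groups (3), the seed, and `nstart = 25` are arbitrary assumptions chosen for the example.

```r
# Data from the question: one value (Column B) per letter (Column A)
b <- c(A = 0.098, B = 0.076, C = 0.871, D = 0.837, E = 1.981,
       F = 0.736, G = 0.983, H = 0.019, I = 0.836, J = 0.936,
       K = 0.197, L = 0.986, M = 0.084, N = 0.048, O = 0.471)

# K-means with 3 centers; the seed makes the random starts reproducible
set.seed(1)
km <- kmeans(b, centers = 3, nstart = 25)

# Hierarchical clustering with a different linkage ("average")
hc_avg <- hclust(dist(b), method = "average")
groups_avg <- cutree(hc_avg, k = 3)

# Cross-tabulate the two labelings to see where they agree
table(kmeans = km$cluster, hclust_average = groups_avg)
```

Cluster numbers are arbitrary labels, so agreement shows up as each row of the table having its counts concentrated in a single column.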

All of these approaches can be appropriate in some cases and inappropriate in others. What question are you trying to answer with the cluster membership? If you want robust classification, you need many more samples. If the underlying physics says your data are multivariate and you are trying to work around that with statistics alone, then getting the correct binning might be a problem. If you are just trying to get a sense of how many buckets would be a good starting point given the data you have, then this is not a bad way to begin.

thank you very much for your explanation. would it be possible to generate the same results using R or Matlab? i don't have this JMP — nik, Jul 05 '16 at 13:01
Absolutely it would. Would you like me to? I am surprised that you have MatLab but not JMP. If you are in academia they are both essentially free. If you are in industry, I think MatLab costs about 4x more than JMP. They have a trial version to download. — EngrStudent, Jul 05 '16 at 13:35
yes please do, so that I can accept your answer. Actually, I did not like JMP because I personally don't like black boxes. In my opinion JMP is another piece of software that only deserves to be in the bin, because people click and get a figure or metric but don't care how it is generated, etc. However, this is just my opinion :-p — nik, Jul 05 '16 at 13:39
It may take me a day. I also agree with you about the black-box being a dangerous thing for real understanding. JMP is compiled vs. interpreted so it runs very fast, and for use in a company with quality control wanting to have consistent (canned) analysis and reporting it can have good value. — EngrStudent, Jul 05 '16 at 14:41
