I'm creating a dataset of wine grape varieties and their associated flavors/aromas. Here's a schematic of the data:

            Flavor1       Flavor2       Flavor3       Flavor4    ...
   Grape1      1             1             1             0
   Grape2      0             0             0             1
   Grape3      0             0             1             0
   Grape4      1             1             1             1
   ...

1 = grape has the flavor

0 = grape doesn't have the flavor

I plan to plot histograms for each grape variety and do a visual check, but I imagine there's some similarity matrix I could construct for these data. I'm not the most advanced statistics user, so something readily implementable in a statistical package would be great, if at all possible.

Thank you!

  • Best for what? There are dozens of measures for such data. – ttnphns May 21 '18 at 05:46
  • @ttnphns I don't have the knowledge to answer that question unfortunately. I was hoping that the structure of my dataset might suggest something, but maybe not. Basically, I'm looking for something easily implementable and easily interpretable. – mrt May 21 '18 at 06:02

1 Answer

I suggest trying one of the following distance measures.

If you are using Python, the following package gives you a list of algorithms to experiment with out of the box.

Just check out the section "Metrics intended for boolean-valued vector spaces".

Here you can get a short recipe for doing so
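
To make concrete what such a measure computes, here is a minimal sketch that builds a Jaccard similarity matrix for the four example grapes from the question. Jaccard is only one of the many measures for binary data and is chosen here purely for illustration; it is not necessarily the one recommended above. The names `grapes` and `jaccard_similarity` are hypothetical, and treating two grapes with no flavors at all as identical is an assumed convention.

```python
# Presence/absence rows copied from the schematic in the question.
grapes = {
    "Grape1": [1, 1, 1, 0],
    "Grape2": [0, 0, 0, 1],
    "Grape3": [0, 0, 1, 0],
    "Grape4": [1, 1, 1, 1],
}


def jaccard_similarity(a, b):
    """Shared flavors divided by flavors present in at least one of the two grapes."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    # Assumed convention: two grapes with no flavors at all count as identical.
    return both / either if either else 1.0


names = list(grapes)

# Square matrix: similarity[i][j] compares names[i] with names[j].
similarity = [
    [jaccard_similarity(grapes[r], grapes[c]) for c in names]
    for r in names
]

for name, row in zip(names, similarity):
    print(name, [round(value, 2) for value in row])
```

Values run from 0 (no shared flavors) to 1 (identical flavor sets). For instance, Grape1 and Grape4 share 3 of the 4 flavors that either one has, so their similarity is 0.75. In SciPy, `scipy.spatial.distance.pdist` with `metric="jaccard"` returns the corresponding distance, which is 1 minus this similarity.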

This thread might be good further reading.