
I have assembled binary vectors (every element is 0 or 1, equally weighted, and arranged in time order) and separated them into cohorts, each defined by the occurrence of a unique event of interest. From every vector I have removed the event-of-interest element itself and the elements from the preceding 3 months. Now I take a new test vector and calculate the average pairwise Jaccard similarity between it and each cohort individually.
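To make the computation concrete, here is a minimal sketch of what I mean by "average pairwise Jaccard similarity". The cohort names, the toy vectors, and the helper functions `jaccard` and `mean_jaccard_to_cohort` are all made up for illustration, and treating two all-zero vectors as having similarity 0 is just a convention I assumed here.

```python
def jaccard(a, b):
    """Jaccard similarity of two equal-length 0/1 vectors:
    (positions where both are 1) / (positions where at least one is 1)."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    # Assumed convention: two all-zero vectors get similarity 0.0.
    return both / either if either else 0.0


def mean_jaccard_to_cohort(test_vec, cohort):
    """Average of the Jaccard similarities between test_vec and each cohort member."""
    return sum(jaccard(test_vec, member) for member in cohort) / len(cohort)


# Hypothetical toy cohorts; in practice each row is one subject's history
# with the event of interest and the preceding 3 months already removed.
cohorts = {
    "cohort_a": [[1, 1, 0, 0], [1, 0, 1, 0]],
    "cohort_b": [[0, 0, 1, 1], [0, 1, 1, 1]],
}
test_vec = [1, 1, 0, 0]

# One score per cohort for the new test vector.
scores = {name: mean_jaccard_to_cohort(test_vec, rows) for name, rows in cohorts.items()}
print(scores)
```

Each score is a mean of ratios, so it always lies between 0 and 1; the test vector above scores higher against `cohort_a` than against `cohort_b`.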

My questions center on interpretation:

What is the statistical interpretation of an average pairwise Jaccard similarity score in this example? Can this be seen as a probability or not?

If the number of samples in these cohorts increases, can it be interpreted that this would improve the prediction?

If this is valid, what would be the best performance metrics for evaluating this (precision/recall, F-score, cross-validation)?

Any advice would be sincerely appreciated. I'm just curious if this idea might be useful as an alternative to traditional survival analysis/time-to-event in my use case.

Glen_b
Pylander
  • What does "statistical interpretation" mean? Can you be specific about what kind of thing you seek? I've been a statistician for quite a long while but have *no clue* what you mean by that. – Glen_b Aug 21 '16 at 04:07
  • @Glen_b Thank you kindly for your clarification inquiry. I guess the root of my inquiry is really centered on whether a Jaccard similarity coefficient in the use case I detailed would be considered a truly valid probability measure. – Pylander Aug 21 '16 at 07:27
  • @Glen_b Say, for example, that my test vector, when compared to a very large cohort, gave an average pairwise Jaccard similarity of 0.85. Should I interpret this to mean that this test vector is simply 85 percent similar to the average in the cohort? Or could I potentially go further with a sufficiently large sample in the cohort and say that there is a probability of 85% that the event of interest will occur? This is the crux of it, I believe. – Pylander Aug 21 '16 at 07:28
  • Perhaps -- if you get a set with similarity of 0.85, it does have 85% of the union in the intersection -- you might say that's "85% overlap" and you could reasonably define that as "percentage similarity". You could then argue that the average similarity was 85%. – Glen_b Aug 21 '16 at 08:55
  • @Glen_b That definitely sounds reasonable as the basic interpretation. I guess I had just been hoping that, since these cohorts are defined by the occurrence of an "event of interest", I might be able to say "85% similar OR probability of event x occurring"? – Pylander Aug 21 '16 at 17:48
  • Uh, could you clarify? (I'm trying to make sure I don't say yes if I am misunderstanding.) Jaccard similarity would generally be something like "number of times two things are the case together" divided by "number of times either of those two things occur" ... are you calling the events counted in the numerator - where those two things both happen - "the event of interest"? ... or are you talking about probability of something *given* an event of interest (i.e. conditional probability)? – Glen_b Aug 21 '16 at 23:38
  • @Glen_b Absolutely, glad to add more detail. Let me try to give an example with more tangible details. Let's say these different cohorts are defined as time-ordered vectors of medical diagnosis codes that culminate in a diagnosis of interest (Diabetes Mellitus, Hypertension, etc.). We then eliminate the diagnosis of interest and, say, the 3 previous months of any other diagnoses that occur (we are really interested in predicting onset of the disease, not just the occurrence of the code). – Pylander Aug 22 '16 at 07:41
  • @Glen_b If we then take a new test vector of diagnoses and calculate the average pairwise Jaccard similarity, we are really defining the similarity of the test vector to a large cohort of examples that THEN went on to experience the event of interest, correct? – Pylander Aug 22 '16 at 07:41
  • @Glen_b So in the end, should this strictly and simply be considered a naive similarity score, or might it be considered an actual probability of the event of interest occurring if there are a sufficiently large number of vectors in a cohort? Or to put it another way, does a high similarity = high probability of the event of interest? – Pylander Aug 22 '16 at 07:41
  • @Glen_b Just curious if my clarification was helpful or if you ended up having any final opinions on this? – Pylander Sep 04 '16 at 02:18
  • Hi Pylander -- sorry -- I realized it would take me a while to respond with anything more substantive so I thought I'd see if someone else could offer an answer -- hence the bounty. – Glen_b Sep 04 '16 at 21:58
  • @Glen_b Oh, now I see the bounty. Thanks for that. We will see if someone chimes in for us. – Pylander Sep 06 '16 at 20:11
  • I wanted to answer but I would need to invest some time into answering properly. Rather than leave you hanging again I thought I'd see if a bounty would draw an answer from someone in a better position. If it doesn't get an answer I'll try to expand on my comments. – Glen_b Sep 06 '16 at 21:48
  • @Glen_b Do you accept donations :) ? – Pylander Sep 13 '16 at 20:51
  • No. But I will try to find time for an answer. – Glen_b Sep 13 '16 at 23:04

0 Answers