Variance of set of subsets

Question

First of all, sorry for the sloppy terminology; I am looking for the right name for a statistical concept.

I was asked to calculate the "turnover" of the Facebook friends commenting on my posts, so I am looking for an indicator that has a high value if it is always the same, let's say, 10 friends commenting on my posts, and a low value if it is always different friends commenting.

Obviously, the set of friends commenting on a given post forms a subset of my friends, so I am looking for a kind of "standard deviation" or "variance" of these subsets over all my posts.

What is the proper name of this statistical concept? How do you calculate it?

What you describe is something different than what would generally be understood by "turnover". I think turnover is more like: how many friends comment on a post as compared to the total number of friends. — StijnDeVuyst, Oct 17 '17 at 11:22

score 3 · Answer 1 · answered Oct 17 '17 at 11:34

What about this: suppose the set of friends you have is $\mathcal{A}$ and $\mathcal{A}_i\subset\mathcal{A}$ is the set of friends commenting on post $i$, $i=1,\ldots,n$. Then $$ \text{KPI} = \frac{\big|\bigcup_{i=1}^n \mathcal{A}_i \big|}{\sum_{i=1}^n |\mathcal{A}_i|} $$ would be high if always different friends comment and low if always the same friends comment (so that's the opposite of what you propose), while being more or less independent of how many friends you have.
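To make the ratio concrete, here is a minimal Python sketch. The function name `union_ratio` and the friend names are made up for illustration; each post is assumed to be represented as a set of commenter names.

```python
def union_ratio(comment_sets):
    """Number of distinct commenters divided by the summed sizes of all per-post sets."""
    total = sum(len(s) for s in comment_sets)
    if total == 0:
        raise ValueError("no comments on any post")
    distinct = set().union(*comment_sets)
    return len(distinct) / total


# Hypothetical data: three posts, two commenters each.
same = [{"ann", "bob"}, {"ann", "bob"}, {"ann", "bob"}]
different = [{"ann", "bob"}, {"cat", "dan"}, {"eve", "fay"}]

print(union_ratio(same))       # 2 distinct / 6 total
print(union_ratio(different))  # 6 distinct / 6 total
```

Note that when exactly the same friends comment on all $n$ posts the ratio is $1/n$, so the lower end of the scale depends on how many posts you include.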

score 2 · Answer 2 (accepted) · igrinis · edited Oct 18 '17 at 18:35

You are looking for a "measure of similarity" between the sets of commenting friends. One of the most popular measures is the Jaccard index:

$$ J=\frac{|\bigcap_{i=1}^{n} A_i|}{|\bigcup_{i=1}^{n} A_i|} $$

Here $A_i$ is the set of friends who commented on post $i$.

If most of the friends who commented are the same on every post, the intersection count will be close to the union count, and the Jaccard index will be close to 1. If only a few friends commented on every post, it will be small.
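As a rough illustration, the formula can be computed like this in Python. The name `jaccard_all` and the sample data are hypothetical, and each post is assumed to be a set of commenter names.

```python
def jaccard_all(comment_sets):
    """Size of the intersection of all sets divided by the size of their union."""
    sets = [set(s) for s in comment_sets]
    if not sets:
        raise ValueError("need at least one post")
    union = set().union(*sets)
    if not union:
        raise ValueError("no comments on any post")
    common = set.intersection(*sets)
    return len(common) / len(union)


# Hypothetical data: ann and bob comment on every post, cat and dan only once.
posts = [{"ann", "bob", "cat"}, {"ann", "bob"}, {"ann", "bob", "dan"}]
print(jaccard_all(posts))  # 2 common friends / 4 friends overall
```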

edited Oct 18 '17 at 18:35

answered Oct 18 '17 at 18:25

igrinis


Thank you. I intuitively came up with a very similar formula, but I tried to compute this measure for each pair of comments, and take the average of these on all comment pairs. Obviously this needs much more computation. Do you think that these two approaches result in similar results? – MrTJ Oct 23 '17 at 13:55

The Jaccard index is very general and, used as is, will give you a lower estimate of similarity. For example, a single post that was "liked" by a completely different subset of friends will "pull down" hundreds of otherwise similar subsets. So obviously the result of a pairwise comparison would be different. There are many things you can try. For example, pool posts together by time (say, in groups of 10 posts) and use some kind of sliding window, because the set of friends commenting on a post today (the union) would be different from the set of friends you had 3 years ago. – igrinis Oct 24 '17 at 07:57
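The difference between the two approaches can be seen in a small Python sketch. The helper names are made up, the data is hypothetical, and treating two empty sets as having similarity 1.0 is an assumption of this sketch, not part of the definition.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index of two sets; two empty sets are treated as identical (assumption)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def mean_pairwise_jaccard(comment_sets):
    """Average Jaccard index over all unordered pairs of posts."""
    pairs = list(combinations(comment_sets, 2))
    if not pairs:
        raise ValueError("need at least two posts")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


def jaccard_all(comment_sets):
    """Intersection of all sets divided by their union."""
    union = set().union(*comment_sets)
    return len(set.intersection(*comment_sets)) / len(union)


# Hypothetical data: three identical posts plus one outlier with a different commenter.
posts = [{"ann", "bob"}, {"ann", "bob"}, {"ann", "bob"}, {"cat"}]
print(jaccard_all(posts))            # the outlier empties the intersection
print(mean_pairwise_jaccard(posts))  # 3 of the 6 pairs are identical
```

With this data the all-sets index drops to 0 while the pairwise average stays at 0.5, at the cost of comparing every pair of posts.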

score 2 · Answer 3 · edited Jun 11 '20 at 14:32

StijnDeVuyst, echoed by igrinis, provides a good measure and concept for the 'variance among subsets' you were looking for.

However, your task is to find the 'turnover', not just the 'similarity', of friends replying to your posts, so I believe you may wish to extend this concept.

Instead of taking all the subsets together, you may want to look at adjacent subsets only. The difference is whether you want a relative measure of the friends who consistently reply to all of your posts, or a relative measure of the changes in replying friends (how many come and go) from post to post, or from period to period.
Instead of creating subsets based on the friends replying to a specific post, you might want to create subsets by taking multiple posts together. You have to determine whether Facebook friends who answer irregularly (for instance, answering posts x and x+2 but not post x+1) should be considered as contributing to turnover.
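A minimal Python sketch of both ideas, under some assumptions of mine: posts are sets of commenter names in chronological order, "change" between adjacent subsets is measured as one minus the Jaccard index, and pooling uses non-overlapping groups. The function names are hypothetical.

```python
def adjacent_turnover(comment_sets):
    """Jaccard distance (1 - similarity) between each consecutive pair of sets."""
    result = []
    for prev, cur in zip(comment_sets, comment_sets[1:]):
        union = prev | cur
        if union:
            result.append(1 - len(prev & cur) / len(union))
        else:
            result.append(0.0)  # two empty sets: no change (assumption)
    return result


def pooled(comment_sets, size):
    """Union of commenters over consecutive, non-overlapping groups of `size` posts."""
    return [
        set().union(*comment_sets[i:i + size])
        for i in range(0, len(comment_sets), size)
    ]


# Hypothetical data: ann skips the second post, cat appears only on the last one.
posts = [{"ann", "bob"}, {"bob"}, {"ann", "bob"}, {"cat"}]
print(adjacent_turnover(posts))             # post-to-post change
print(adjacent_turnover(pooled(posts, 2)))  # change between groups of two posts
```

Post to post, ann's skipped post shows up as change in two places; after pooling into groups of two, it no longer does. Which of the two is "right" depends on the decision described above.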

(An example: some countries count short-term leaves as emigration and immigration, so their statistics show high values relative to the population. You may wonder whether this is correct. Several media outlets like to report these high turnover numbers without this nuance, suggesting that an extreme number of people are leaving the country.)

These statistics then become multidimensional. Turnover is for instance not constant in time.

Possibly you would like to provide multiple types of definitions of turnover in your graphs. For instance you could graph time series of 1) the number of leavers, 2) the number of arrivers and 3) the total number of commenting friends.

Then consider how you wish to express turnover. For instance, is it about replacement (the number of leavers who are replaced by arrivers) or about changes? The latter, changes, also reflects growth and decline.
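The series of leavers, arrivers and totals, and the replacement-versus-change distinction, could be sketched in Python as follows. The function name `flows` is made up, each period is assumed to be a set of commenter names, and reading "replacement" as the smaller of leavers and arrivers is my own interpretation.

```python
def flows(period_sets):
    """Leavers, arrivers and totals for each period relative to the previous one."""
    rows = []
    for prev, cur in zip(period_sets, period_sets[1:]):
        leavers = len(prev - cur)   # commented before, not now
        arrivers = len(cur - prev)  # comment now, did not before
        rows.append({
            "leavers": leavers,
            "arrivers": arrivers,
            "total": len(cur),
            "replaced": min(leavers, arrivers),  # assumed reading of "replacement"
            "net_change": arrivers - leavers,    # growth (+) or decline (-)
        })
    return rows


# Hypothetical data: three consecutive periods.
periods = [{"ann", "bob", "cat"}, {"bob", "cat", "dan"}, {"dan"}]
for row in flows(periods):
    print(row)
```

In the first transition one friend leaves and one arrives (pure replacement, no net change); in the second, two leave and nobody arrives (pure decline).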

As an alternative, you could create a measure of how long, on average, your Facebook friends remain inside the commenting subset (residence time reflects turnover rate).
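One possible way to compute such a residence time in Python is to average the lengths of uninterrupted commenting streaks. This is only a sketch: the name `mean_residence_time` is hypothetical, time is measured in posts, and streaks still running at the last post are counted at their current length (an assumption that biases the average downward).

```python
def mean_residence_time(comment_sets):
    """Average length, in posts, of uninterrupted commenting streaks."""
    finished = []  # lengths of streaks that have ended
    running = {}   # friend -> length of the streak currently in progress
    for commenters in comment_sets:
        # Close the streak of anyone who did not comment on this post.
        for friend in list(running):
            if friend not in commenters:
                finished.append(running.pop(friend))
        # Extend or start the streak of everyone who did.
        for friend in commenters:
            running[friend] = running.get(friend, 0) + 1
    finished.extend(running.values())  # streaks still open at the end
    if not finished:
        raise ValueError("no comments on any post")
    return sum(finished) / len(finished)


# Hypothetical data: ann stays for 3 posts, bob for 1, cat for 2.
posts = [{"ann", "bob"}, {"ann"}, {"ann", "cat"}, {"cat"}]
print(mean_residence_time(posts))  # (3 + 1 + 2) / 3
```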

You could also break this up into activity-level changes (to loosen the concept that a Facebook friend is either active or not active in commenting on your posts). For instance, you could define the number of replies per 10 posts as a level of activity and then determine the change in activity level for all your friends. That way, a friend who goes from 8 replies in 10 posts to 2 replies in 10 posts can be included in your measure of turnover.
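A small Python sketch of the activity-level idea, assuming posts are sets of commenter names in chronological order and that blocks are non-overlapping. The function names are made up for illustration.

```python
from collections import Counter


def activity_levels(comment_sets, block=10):
    """For each block of `block` posts, count the posts each friend commented on."""
    return [
        Counter(f for commenters in comment_sets[i:i + block] for f in commenters)
        for i in range(0, len(comment_sets), block)
    ]


def activity_changes(levels):
    """Per-friend change in activity level between consecutive blocks."""
    changes = []
    for prev, cur in zip(levels, levels[1:]):
        friends = set(prev) | set(cur)
        # A Counter returns 0 for friends absent from a block.
        changes.append({f: cur[f] - prev[f] for f in friends})
    return changes


# Hypothetical data: ann drops from 8 to 2 replies per 10 posts, bob stays at 10.
posts = (
    [{"ann", "bob"}] * 8 + [{"bob"}] * 2      # first block of 10 posts
    + [{"ann", "bob"}] * 2 + [{"bob"}] * 8    # second block of 10 posts
)
levels = activity_levels(posts, block=10)
print(activity_changes(levels))
```

How to aggregate the per-friend changes into one number (sum of absolute changes, mean, and so on) is a further choice that this sketch leaves open.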

Thank you very much for so many ideas how to refine this measure! — MrTJ, Oct 23 '17 at 13:49

score 0 · Answer 4 · answered Oct 18 '17 at 03:58

This is simplistic, but I always start there...

Suppose you have 100 friends and 5 posts. And suppose each post gets 20 comments.

At one extreme, the same 20 people commented on every post. At the other extreme, a different 20 people commented on each post.

In the first case, the average number of comments per post was 1.0 for those 20 people and zero for the other eighty. In the second case, everyone (all 100) made 0.2 comments per post.

My thought is to drill down from there: for each person, take the number of posts they commented on divided by the number of posts. Some people will score 1.0 (they comment on every post), and some will score lower (they comment only once or not at all).

If plotted from highest score to lowest, you will get something akin to a 'scree' plot.

Wouldn't that possibly get you pretty far?
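For what it's worth, here is a minimal Python sketch of that per-person score using the numbers from the example (100 friends, 5 posts, 20 commenters per post). The function name `comment_rates` and the friend labels are hypothetical.

```python
from collections import Counter


def comment_rates(comment_sets, friends):
    """Fraction of posts each friend commented on, sorted from highest to lowest."""
    n_posts = len(comment_sets)
    counts = Counter(f for commenters in comment_sets for f in commenters)
    rates = [(f, counts[f] / n_posts) for f in friends]
    return sorted(rates, key=lambda pair: pair[1], reverse=True)


friends = [f"friend{i}" for i in range(100)]

# Extreme 1: the same 20 friends comment on all 5 posts.
same = [set(friends[:20])] * 5
# Extreme 2: a different group of 20 friends comments on each post.
different = [set(friends[20 * i:20 * (i + 1)]) for i in range(5)]

rates_same = [rate for _, rate in comment_rates(same, friends)]
rates_different = [rate for _, rate in comment_rates(different, friends)]
print(rates_same[:3], rates_same[-3:])
print(rates_different[:3], rates_different[-3:])
```

Plotting the sorted rates gives a step from 1.0 down to 0 in the first case and a flat line at 0.2 in the second; real data would fall somewhere in between.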
