23

I have a dataset. Say $10$ observations and $3$ variables:

obs  A   B   C
1    0   0   1
2    0   1   0
3    1   0   1
4    1   1   0
5    1   0   1
6    1   0   0
7    1   1   0
8    0   0   1
9    0   1   1
10   0   1   1

Say these are $10$ customers who have bought (1) or not bought (0) in each of the categories A, B and C. There are $16$ ones in the table, so these $10$ customers buy into $1.6$ product categories on average.

Note customers can buy into more than one of A, B and C.

If I look at only those who buy A, there are $5$ customers who have bought into $9$ product categories, so that's $1.8$ on average.

B is $9/5$ again, or $1.8$.

C is $10/6 \approx 1.67$.

All of them are above $1.6$, which seems strange. I understand it, but I need to explain this to marketing next week, so I need help!
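For anyone who wants to check the arithmetic, here is a minimal sketch in plain Python that reproduces these figures from the table above. The `rows` layout and the helper name `mean_categories` are illustrative choices of mine, not something taken from the reports:

```python
# Purchase indicators (A, B, C) for the 10 customers in the table above.
rows = [
    (0, 0, 1), (0, 1, 0), (1, 0, 1), (1, 1, 0), (1, 0, 1),
    (1, 0, 0), (1, 1, 0), (0, 0, 1), (0, 1, 1), (0, 1, 1),
]

def mean_categories(customers):
    """Average number of categories bought per customer in the given list."""
    return sum(sum(r) for r in customers) / len(customers)

# Overall average across all 10 customers.
overall = mean_categories(rows)

# Average restricted to the customers who bought each category in turn.
by_category = {
    name: mean_categories([r for r in rows if r[idx] == 1])
    for idx, name in enumerate("ABC")
}

print(overall, by_category)
```

Each conditional average is taken over a different, overlapping subset of the same 10 customers.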

What is this thing called?

I know it's not Simpson's paradox. To me it feels similar in logic to the Monty Hall problem and conditional probability.

James Adams
  • 2
    Personally, I have no idea what you're talking about. Why not create a contingency table of the As, Bs and Cs to examine the cross-purchase patterns? – Mike Hunter Apr 04 '17 at 13:15
  • 3
    We have reports that say "Customers who buy C are worth more than average: 1.67 vs 1.6", which is true, but A and B are worth more than average too. To which the inevitable question will arise: "How can all customers be worth more than average?" – James Adams Apr 04 '17 at 13:21
  • 1
    Ugh. So many vendor reports are such crap. You simply need to clarify what the comparisons are that are being made. E.g., what is going into the denominator of the averages that is creating such a deep paradox? – Mike Hunter Apr 04 '17 at 13:26
  • 3
    I think his puzzle is that it superficially looks like [Lake Wobegon](https://en.wikipedia.org/wiki/Lake_Wobegon) where everyone is above average :P Let $X$ be the number of categories/item a customer purchased. Let $A$, $B$, and $C$ be indicators for purchasing in category A, B, and C respectively. $\operatorname{E}[X\mid A] = 1.8$, $\operatorname{E}[X\mid B] = 1.8$, and $\operatorname{E}[X\mid C] = 1.67$ while $\operatorname{E}[X] = 1.6$ – Matthew Gunn Apr 04 '17 at 13:34
  • 12
    You might want to think in terms of [complementary sets](https://en.wikipedia.org/wiki/Complement_(set_theory)) and Venn diagrams. The sets "customers who buy A" and "customers who do *not* buy A" are non-overlapping. But the sets you list in your question overlap. You can compute the overall average as a (weighted) average of subset averages **only** if the subsets form a [partition](https://en.wikipedia.org/wiki/Partition_of_a_set). – GeoMatt22 Apr 04 '17 at 13:36
  • Yeah, I did think of that, but that is more of a psychological bias. I have a bunch of retail data which says people who buy into any of 10 categories are worth more than average. Those expected value statements are great; I can see the problem more clearly. I just want to know whether this little quirk has a name or an intuitive explanation like Monty Hall does (where there are 100 doors, you pick one and then the show host opens 98 doors). – James Adams Apr 04 '17 at 13:45
  • Ahh yes, venns! I think that might work. Will see if I can draw an explanation that way. Thanks. – James Adams Apr 04 '17 at 13:48
  • 4
    Is this loosely similar to the [majority-illusion](https://techxplore.com/news/2015-07-social-network-illusion-popular.html) paradox? In the same way that any individual is likely to be connected to a super networker, any purchase category is likely to contain a super purchaser? (I'm calling a super networker someone who connects with many people and a super purchaser someone who purchases many different items) – Matthew Gunn Apr 04 '17 at 13:50
  • This is a lot of heavy breathing for what is most likely an instance of fallacious reporting. The biggest problem is that the OP doesn't have the respondent level, raw data against which to rigorously check these results. – Mike Hunter Apr 04 '17 at 14:35
  • The table I posted in the OP was intended as an example of raw data. It was generated pretty randomly (as far as one can throw down numbers). I think any dataset will produce results like this; there do not need to be any extreme values in there. I just wondered if there was a name for this. – James Adams Apr 04 '17 at 14:44
  • You can't have a mutually exclusive and complete set of A, B and C categories where the group averages are *all* higher than the average of all 3 without there being a violation of some fundamental assumption of data analysis. In your case, it's most likely that the denominator for the overall average differs (e.g., contains more respondents) from the ones used for the estimation of the means for A, B and C. – Mike Hunter Apr 04 '17 at 15:06
  • 1
    @DJohnson look carefully. The sets are NOT mutually exclusive. The sets have elements in common. The issue here is duplicate values being incorporated into the average. – user64742 Apr 07 '17 at 16:56
  • @TheGreatDuck Right you are. – Mike Hunter Apr 07 '17 at 17:52

6 Answers

28

The average of every subcategory can be above the overall average if the subcategories overlap on the larger customers.

Simple example to gain intuition:

  • Let $A$ be an indicator whether an individual purchased an item in category A.
  • Let $B$ be an indicator whether an individual purchased an item in category B.
  • Let $X = A + B$ be the number of items purchased.

\begin{array}{ccc} \text{Person} & A & B \\ i & 1 & 0 \\ ii & 0 & 1 \\ iii & 1 & 1 \end{array}

The set of individuals where $A$ is true overlaps the set of individuals where $B$ is true. They are NOT disjoint sets.

Then $\operatorname{E}[X] \approx 1.33$ while $\operatorname{E}[X \mid A] = 1.5$ and $\operatorname{E}[X \mid B] = 1.5$

The statement that would be true is:

$$ P(A)\operatorname{E}[X\mid A] + P(B)\operatorname{E}[X\mid B] - P(AB)\operatorname{E}[X\mid AB] = \operatorname{E}[X]$$

$$ \frac{2}{3}\cdot 1.5 + \frac{2}{3}\cdot 1.5 - \frac{1}{3}\cdot 2 = \frac{4}{3} \approx 1.33$$

You can't simply compute $P(A)\operatorname{E}[X\mid A] + P(B)\operatorname{E}[X\mid B]$ because sets $A$ and $B$ overlap: that expression double counts the person who purchases in both category A and category B!
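A minimal sketch that checks this arithmetic exactly on the three-person example, using Python's `fractions` module. The helper `prob_and_mean` is a made-up name for illustration:

```python
from fractions import Fraction

# The three-person example: (A, B) indicators for persons i, ii, iii.
people = [(1, 0), (0, 1), (1, 1)]
n = len(people)

def prob_and_mean(select):
    """Return (P(event), E[X | event]) with X = A + B."""
    chosen = [a + b for a, b in people if select(a, b)]
    return Fraction(len(chosen), n), Fraction(sum(chosen), len(chosen))

p_a, e_a = prob_and_mean(lambda a, b: a == 1)
p_b, e_b = prob_and_mean(lambda a, b: b == 1)
p_ab, e_ab = prob_and_mean(lambda a, b: a == 1 and b == 1)

overall = Fraction(sum(a + b for a, b in people), n)
naive = p_a * e_a + p_b * e_b      # double counts person iii
corrected = naive - p_ab * e_ab    # subtract the overlap once

print(overall, naive, corrected)
```

The naive sum comes out at 2, while the corrected expression matches the overall mean of 4/3. Note that the identity works here because every person buys in at least one of A or B, so nobody falls outside the union of the two sets.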

Name for illusion/paradox?

I'd argue it's related to the majority illusion paradox in social networks.

You may have a single dude who networks/friends everyone. That person may be one out of a million overall, but he'll be one of each person's $k$ friends.

Similarly, you have 1 out of 3 here purchasing both categories A and B. But within either category A or B, 1 out of the 2 purchasers is the super purchaser.

Extreme case:

Let's create $n$ sets of lotto tickets. Every set $S_i$ includes two tickets: a losing ticket $i$ and the jackpot winning ticket.

The average winnings in every set $S_i$ are then $\frac{J}{2}$, where $J$ is the jackpot. The average of each set is WAY above the overall average winnings per ticket, $\frac{J}{n+1}$.

It's the same conceptual dynamic as the sales case. Every set $S_i$ includes the jackpot ticket in the same way that every category A, B, or C includes the heavy purchasers.
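To make the lotto case concrete, here is a small sketch. The function name `lotto_averages` and the values `n=9` and `jackpot=1000` are arbitrary choices for illustration:

```python
from fractions import Fraction

def lotto_averages(n, jackpot):
    """Per-set and per-ticket average winnings for n overlapping two-ticket sets."""
    # Ticket 0 is the single jackpot ticket; tickets 1..n are losers.
    winnings = {0: Fraction(jackpot)}
    winnings.update({i: Fraction(0) for i in range(1, n + 1)})

    # Every set S_i pairs losing ticket i with the shared jackpot ticket.
    sets = [(i, 0) for i in range(1, n + 1)]

    per_set = [sum(winnings[t] for t in s) / len(s) for s in sets]
    per_ticket = sum(winnings.values()) / len(winnings)
    return per_set, per_ticket

per_set, per_ticket = lotto_averages(n=9, jackpot=1000)
print(per_set[0], per_ticket)
```

With these values every set averages 500 while the average over the 10 distinct tickets is 100, because the one jackpot ticket is counted in all nine sets.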

My bottom-line point is that intuition based on disjoint sets (a full partition of the sample space) does not carry over to a series of overlapping sets. If you condition on overlapping categories, every category can be above average.

If you partition the sample space and condition on disjoint sets, then categories have to average out to the overall mean, but that's not true for overlapping sets.
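As a worked check of the partition case on the question's 10-customer table, take $X$ to be the number of categories bought and split the customers into those who bought A and those who did not (the complement is my own computation from the table, not a figure from the question). The 5 customers who bought A account for 9 category purchases and the other 5 account for the remaining 7, so

$$ P(A)\operatorname{E}[X\mid A] + P(A^c)\operatorname{E}[X\mid A^c] = \frac{5}{10}\cdot\frac{9}{5} + \frac{5}{10}\cdot\frac{7}{5} = \frac{16}{10} = 1.6 = \operatorname{E}[X]$$

Here one part of the partition sits above the overall mean (1.8) and the other below it (1.4), which is exactly what cannot be avoided with disjoint sets.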

Matthew Gunn
  • 3
    Thanks! I think the double counting is the key to explaining. I don't think this is necessarily the result of a few extreme values though. My example dataset above is fairly mundane and the "all groups above average" effect still happens. My guess is it will happen in most cases. Just wondered if it had a name or a previous example. – James Adams Apr 04 '17 at 14:48
  • This explanation would not hold if the data @JamesAdams is analyzing is flawed. I am contending that it is. You can't have a mutually exclusive and complete set of A, B and C categories where the group averages are all higher than the average of all 3 taken together without there being a violation of some fundamental assumption of data analysis. In your case, it's most likely that the denominator for the overall average differs (e.g., contains more respondents) from the ones used for the estimation of the means for A, B and C. – Mike Hunter Apr 04 '17 at 16:35
  • 2
    @DJohnson Of course you're right if sets A, B, and C partition the sample space. My reading of the question and the supplied "data" (whatever it is) is that A, B, and C are *overlapping* sets. If A, B, and C overlap, then the group averages can all be higher than the overall average (which is the point of my answer; the sets overlap on the biggest customers!). Nothing the OP has said is internally inconsistent. Your "we're getting passed BS data" detector might be better than mine though, and I agree it's always important to ask critical questions about the validity of the data/numbers. – Matthew Gunn Apr 04 '17 at 17:02
  • Yes, they are overlapping sets. My dataset is millions of customers and 12 categories. When I saw my averages were all higher than the overall average I thought it looked odd but explainable. I put together the example set of 10 obs and 3 categories to see it. I just scattered 1s and 0s here and it came out the same. I suspect this happens with most datasets where this type of average is calculated. @Djohnson my example above uses 10 as the denominator for the overall average, 5 for the As, 5 for the Bs, 6 for the Cs. Can you tell me what I am violating in this example? – James Adams Apr 04 '17 at 19:41
  • What does '10' represent? The net of respondents across the 3 categories? What happens to the averages if you use the same denominator for all? It should return averages that fluctuate around the grand mean. – Mike Hunter Apr 04 '17 at 21:57
  • 10 is the total number of unique customers. I want to say things like "The average customer buys into 1.5 categories" and "The average customer who buys A buys into 1.75 categories" etc. I thought this quirk/paradox/oddity or whatever might have a name. Like Simpson's or regression to the mean. Perhaps it is too mundane for that. That probably won't help me explain this to marketing though! – James Adams Apr 05 '17 at 06:51
10

I would call this the family size paradox or something similar

Suppose, for a simple example, everybody had one partner and a Poisson-distributed number of children with parameter $2$:

  • The average number of children per person would be $2$
  • The average number of children per person with children would be $\frac{2}{1-e^{-2}} \approx 2.313$
  • The average sibling group size for each individual (counting their brothers and sisters and themselves) would be $3$
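These three figures can be checked numerically. A minimal sketch, assuming it is acceptable to truncate the Poisson distribution at 60 terms (the tail beyond that is negligible for parameter 2); the variable names are illustrative:

```python
import math

lam = 2.0  # Poisson parameter from the example above

# Truncated Poisson pmf: P(K = k) for k = 0..59.
pmf = [math.exp(-lam) * lam**k / math.factorial(k) for k in range(60)]

# Average number of children per person.
mean_children = sum(k * p for k, p in enumerate(pmf))

# Average number of children among those with at least one child.
mean_given_children = mean_children / (1 - pmf[0])

# A randomly chosen child belongs to a family of size k with probability
# proportional to k * P(K = k), so the average sibling group size is E[K^2] / E[K].
mean_sibling_group = sum(k * k * p for k, p in enumerate(pmf)) / mean_children

print(mean_children, mean_given_children, mean_sibling_group)
```

The third figure is larger because it is an average over children rather than over parents, so a family with $k$ children is counted $k$ times.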

Real demographic and survey numbers produce different numbers but similar patterns

The apparent paradox is that the average size of individuals' sibling groups is larger than the average number of children per family; with stable population dynamics, people tend to have fewer children on average than their parents did

The explanation is whether the average is being taken over parents and families or over siblings: there are different weightings being applied to large families. In your example there is a difference between weighting by individuals or by purchases; your conditional averages are pushed up by the fact that you condition on a particular purchase being made.
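The two weightings can be compared directly on the 10-customer table from the question. A minimal sketch, with variable names chosen for illustration:

```python
# Purchase indicators (A, B, C) for the 10 customers in the question.
rows = [
    (0, 0, 1), (0, 1, 0), (1, 0, 1), (1, 1, 0), (1, 0, 1),
    (1, 0, 0), (1, 1, 0), (0, 0, 1), (0, 1, 1), (0, 1, 1),
]
totals = [sum(r) for r in rows]  # categories bought by each customer

# Weighting by individuals: each customer counts once.
per_customer = sum(totals) / len(totals)

# Weighting by purchases: each customer counts once per category bought.
per_purchase = sum(t * t for t in totals) / sum(totals)

# Pooling the "bought A", "bought B" and "bought C" subsets together
# lists each customer once per category bought, i.e. the same weighting.
pooled = [sum(r) for idx in range(3) for r in rows if r[idx] == 1]
pooled_mean = sum(pooled) / len(pooled)

print(per_customer, per_purchase, pooled_mean)
```

On this table the per-customer average is 1.6 and the per-purchase average is 1.75, and the pooled category subsets reproduce the per-purchase figure exactly, since a customer with $k$ purchases appears in $k$ of the subsets.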

Henry
8

The other answers are overthinking what's going on. Suppose there is one product and two customers. One bought the product (once) and one didn't. The average number of products bought is 0.5, but if you look only at the customer who bought the product, the average rises to 1.

This doesn't seem like a paradox or counterintuitive to me; conditioning on buying a product will generally raise the average number of products bought.
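One way to write this down for the question's table, as a sketch: take $A$, $B$, $C$ as 0/1 purchase indicators and $X = A + B + C$ (notation borrowed from the comments under the question, not defined in this answer). Conditioning on $A = 1$ pins the first term at 1 and leaves the other two free:

$$ \operatorname{E}[X] = \operatorname{E}[A] + \operatorname{E}[B + C] = 0.5 + 1.1 = 1.6, \qquad \operatorname{E}[X \mid A = 1] = 1 + \operatorname{E}[B + C \mid A = 1] = 1 + 0.8 = 1.8$$

So the conditional average rises because the guaranteed purchase of A (worth 1 instead of 0.5) outweighs the somewhat lower rate of B and C purchases among A buyers (0.8 vs 1.1).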

  • Exactly. Assuming the purchases in each of the 3 categories are not heavily correlated, what you are doing is calculating the averages after increasing the purchase rate to 100% in one of the categories. It would probably be more informative to compare e.g. the average purchase rate in categories B and C: a) among all customers (11/20), b) among those who bought A (4/10). Depends on what you're trying to show/find, I guess. – konrad Apr 05 '17 at 22:14
2

Is this not merely the "average of averages" confusion (e.g. previous stackexchange question) in disguise? Your temptation appears to be that the subsample averages should end up averaging to the population average, but this will rarely happen.

In the classical "average of averages", someone finds the average of N mutually exclusive subsets, and then is flabbergasted that these values do not average to the population average. The only way this average of averages works out is if your non-overlapping subsets have the same size. Otherwise, you need to take a weighted average.
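A minimal sketch of the classical case on the question's table, splitting the customers into two non-overlapping subsets (bought a category vs. did not). The choice of splits and the helper names are mine, for illustration only:

```python
from fractions import Fraction

# Purchase indicators (A, B, C) for the 10 customers in the question.
rows = [
    (0, 0, 1), (0, 1, 0), (1, 0, 1), (1, 1, 0), (1, 0, 1),
    (1, 0, 0), (1, 1, 0), (0, 0, 1), (0, 1, 1), (0, 1, 1),
]

def subset_mean(customers):
    """Exact average number of categories bought per customer."""
    return Fraction(sum(sum(r) for r in customers), len(customers))

def average_of_averages(idx):
    """Split on category idx; return (unweighted, weighted) mean of the two subset means."""
    groups = [[r for r in rows if r[idx] == flag] for flag in (1, 0)]
    means = [subset_mean(g) for g in groups]
    sizes = [len(g) for g in groups]
    unweighted = sum(means) / len(means)
    weighted = sum(m * s for m, s in zip(means, sizes)) / sum(sizes)
    return unweighted, weighted

overall = subset_mean(rows)
split_on_a = average_of_averages(0)  # subsets of size 5 and 5
split_on_c = average_of_averages(2)  # subsets of size 6 and 4

print(overall, split_on_a, split_on_c)
```

Splitting on A gives two subsets of equal size, so the plain average of the two subset means already equals the overall 8/5. Splitting on C gives subsets of size 6 and 4, and the plain average drifts to 19/12 while the size-weighted one still recovers 8/5.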

Your problem is made more complex than this traditional average of averages confusion by having overlapping subsets, but it appears to me to just be this classic mistake with a twist. With overlapping subsets, it is even harder to end up with subsample averages that average to the population average.

In your example, users who appear in multiple subsamples (and therefore have bought many things) will increase these averages. Basically you're counting each big spender multiple times, while the frugal people who only buy one item are only encountered once, so you're biased toward larger values. This is why your particular subsets have above-average values, but I think this is still just the "average of averages" problem.

You can also construct all kinds of other subsets from your data where the subsample averages take on different values. For example, let's take subsets somewhat similar to your subsets. If you take the subset of people who did not buy A, you get 7/5=1.4 items on average. With the subset that did not buy B, you also get 1.4 items on average. Those who did not buy C, bought 1.5 items on average. These are all below the population average of 1.6 items/customer. Given the right dataset and the right collection of subsets, you could end up with overlapping subsets whose averages average to the population average; however, this would be uncommon in normal applications.

Is it just me, or does the word average now seem weird after so many repetitions... Hope my answer was helpful, and sorry if I ruined the word average for you!

tbell
  • Thanks! The comment about non-overlapping same size partitions clarified it in my mind. I was hoping when I come to present these figures I could say something like "All the category averages are higher than the overall average, but that's the Blahblah paradox". Like when you say "Simpson's Paradox!, Ivy League Sexism!" and then run out of the room. (You all do that sometimes don't you?) Would love to say to them "It's because these are overlapping subsets of different sizes" but don't think that will land! – James Adams Apr 05 '17 at 07:00
  • 1
    Haha, fair enough. I didn't totally get the context before - I'm an astrophysics grad student, so I'm not very familiar with the context. You could say something brief, to the effect of "All the subset averages are higher than the overall average because the way we made the subsets biases us towards larger values." I wouldn't mention the average of averages name since it's not all that well known, and your case is like a generalization of it. I'd also try to find a synonym to replace the word categories - generally I see the word as implying mutually exclusive subsets. – tbell Apr 05 '17 at 12:07
  • [Semantic Satiation](https://en.wikipedia.org/wiki/Semantic_satiation) is a psychological phenomenon in which repetition causes a word or phrase to temporarily lose meaning for the listener, who then perceives the speech as repeated meaningless sounds. – Patrick Apr 05 '17 at 22:59
1

Since the issue is "I understand it but need to explain this to marketing", OP seems concerned with how a layman will interpret these facts - (not whether the facts are true, or how to show that they are). The question references 10 product categories, (A-J), so how about this example:

[in meeting with marketing group]
OP: So, as you can see here, customers who buy A, B, and C, are all more valuable than average.
Layman: Wait?! How can everyone be higher than average?
OP: Good question. This slide focuses on customers of A, B, and C, but there are other, low performing, groups not shown. For example, customers of categories D and G are each worth about half of average.

This should quell everyone's internal bs-alarm about 'everything is above average'.

Patrick
  • This is not the way to answer a question. – Michael R. Chernick Apr 05 '17 at 22:16
  • His question had been answered, but no one addressed his problem. – Patrick Apr 05 '17 at 22:59
  • My comment only had to do with Patrick's answer. – Michael R. Chernick Apr 05 '17 at 23:01
  • I don't see any rule against different styles of answering. Reporting (real or imagined) discussions and conversations is a time-honoured way of thinking through issues from Socrates onwards (and before him for all I know). – Nick Cox Apr 09 '17 at 10:54
  • But that explanation is factually wrong. Even in the absence of further categories (D-J), the observation remains true: the averages of overlapping subsets can all be higher than the average of the whole set, even if the subsets cover the whole set. – isarandi Apr 17 '17 at 13:29
0

Ignore the other answers here. This actually is not a paradox at all. The actual issue at hand here, which everyone seems to be ignoring, is that you are mistaking which probability you are actually looking at. There are in fact two completely different averages and statistics at play here, which both have their own uses and interpretations in your proposed example (marketing)!

First off, there is the average number of products bought per customer. So on average, one customer buys 1.6 items. Of course, a customer cannot buy 0.6 of a product (assuming it isn't something like rice or grain that has a continuous measurement associated with it).

Secondly, there is the average number of customers that buy a particular product. Sounds weird right? So on average a product has 5.33333333... customers buying it. This is different however. What we're describing here is not the number of products bought (there's only three of them!) but rather the number of people actually purchasing said product.

Think of the two values this way: What would these two values represent if there was only one customer or only one product? After all, the average of a single data point is just that given data point.

Or better yet, think of the chart as if it were giving you dollar amounts spent to buy the product. Obviously the average amount spent by an individual customer will be far less than the amount of money made on average by a product supplied by a major corporation (or even just a small business). I'm sure you can think of good ways to use both values when discussing the well-being of the company.

When you go to explain this to the marketing staff, explain it to them just like I have said. It isn't a paradox. It's just a completely different statistic. The only issue here was noticing that there were, in fact, two different ways to read the chart (i.e. number of people buying per product vs. number of products bought per person).

tl;dr the first thing you described is the average amount an individual customer is willing to spend buying your products. The second is the average demand for a given product by the public. I'm sure you can see now why both are most certainly not the same thing. Comparing them as such will just give you garbage information.


EDIT

It would appear the question is actually asking about the average money spent by customers who buy some product A, B, or C. Alright. This is actually just an error in calculations. I wouldn't call this a paradox. It's really just a subtle flub.

Look at your columns. There are people that are shared between columns. Let's assume you did a proper weighted average. You are still adding up people twice. This means that the average will contain extra people with a value greater than or equal to 2. Now what was your average? It was 1.6! In essence your average looks like this:

$\frac{\sum_{i = 0}^{n} \text{valueOfPerson}_i \cdot \text{valueOfPerson}_i}{n}$

That is definitely not the right formula. It is a weighted average, though. Assuming mutual exclusiveness, this is how you would adjust to get a true average in your situation:

$\frac{\sum_{i = 0}^{n} \text{numberOfPeopleBuying}_i \cdot \text{averageSpentByPersonBuying}_i}{n}$

Either way you'll get a messed-up average. One mistake was ignoring the need for a weighted average, as one category has a greater "weight" in terms of the average. It's like density: one value is denser in the people it represents. The other issue is duplicate adding, which will distort the average. I don't call either of these "paradoxes" though. Once I saw what you were doing it seemed obvious to me why that wouldn't work. The need for the weighted average is somewhat self-explanatory, and I think you can see now that you added values multiple times... that cannot work. You basically took the average of the squares of their values.

user64742
  • I don't think this is the case. I'm not interested here in how many people buy a particular product. I am interested in how many total products a customer has bought given that they have bought A. – James Adams Apr 06 '17 at 12:57
  • @JamesAdams Alright fair enough. In that case the issue is even more trivial. You're just taking an average of a subset of your sample. In theory if you did the same with B and C the final average wouldn't be the actual average. However, this is just due to the samples being unequal. That's all. In fact, I see no reason why that would be obvious to a person. There is actually a solution to fixing the averages to get you the proper average. It's called a weighted average and basically you would "weight" each subaverage with the number of people in that group. Make sense? – user64742 Apr 06 '17 at 18:37
  • @JamesAdams and I know you are not interested in it. Your math, which you claimed formed a paradox, used that average to try to compute the average number of products per person. That's why in this answer I emphasize that there is a second average for a different statistic and your "mistake" was in trying to shoehorn it into being a completely different average. – user64742 Apr 06 '17 at 18:39