The Guardian today published these survey results without saying how many responses the graphic summarizes:

[Image: The Guardian's graphic of the survey percentages]

Source: https://www.theguardian.com/film/2018/mar/01/what-is-the-best-oscar-winning-film-of-all-time-casablanca-the-godfather-moonlight

The 10 percentages add up to 100, so the reported percentages may or may not reflect rounding.

Assuming no rounding, the finest resolution is 1%, so the chart reflects exactly 100 responses or some multiple of 100.
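As a small illustration of the no-rounding case: if every reported percentage were exact, each count (total times percentage, divided by 100) would have to be a whole number, which forces the total to be a multiple of 100 divided by the greatest common divisor of 100 and all the percentages. The sketch below is only an illustration; the function name `min_exact_n` is made up here, and the percentages are the ten values read off the graphic.

```python
from functools import reduce
from math import gcd


def min_exact_n(percents):
    """Smallest total for which every percent/100 * total is a whole number."""
    # gcd of 100 and all the reported percentages
    g = reduce(gcd, percents, 100)
    return 100 // g


# Percentages read off the graphic (they sum to 100).
graphic = [3, 16, 4, 4, 32, 15, 3, 12, 6, 5]
print(min_exact_n(graphic))       # 100, because gcd(3, 100) = 1
print(min_exact_n([20, 30, 50]))  # 10, a case where fewer than 100 would do
```

Any valid total in the no-rounding case is a multiple of the value returned.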

Assuming rounding, since the finest resolution is ~2% (6.49% rounded down to 6% for No Country for Old Men, and 4.5% rounded up to 5% for Moonlight), at least 50 responses underlie the chart. However, I suspect more sophisticated math could reverse-engineer additional insight. So ...

What is the strongest statement that can be made about the number of responses?

whuber
Micky
  • Perhaps worth mentioning is that about 360 people commented on this news article. With some assumption about the "vote:comment" ratio, one may be able to guess. The advertising department of this newspaper is probably more than willing to provide some readership statistics, which may help you gauge the maximum. – Penguin_Knight Mar 01 '18 at 15:03
  • There are now over 500 comments, but the graphic hasn't changed. – Micky Mar 01 '18 at 15:22
  • Annie Hall garnered 3%. Since 3 and 100 are relatively prime, the graphic must reflect at least 100 responses assuming no rounding. – Micky Mar 01 '18 at 16:40
  • Over 1000 comments now, graphic unchanged. – Micky Mar 01 '18 at 19:58

1 Answer

Analysis

The question is this: given a set of whole numbers $x_1, x_2, \ldots, x_m$ (as given in the graphic) that have been obtained from some process of converting corresponding whole numbers $y_1, y_2, \ldots, y_m$ (the raw response counts) to percents and rounding, find the range of possible values of $n = y_1+y_2+\cdots + y_m$ (the total of the responses).

Observe that

  1. The smallest possible value of $n$ is less than or equal to $s(x)=x_1+x_2+\cdots+x_m$ because $y_i=x_i$ is a solution with $n=s(x).$

  2. $n$ can be arbitrarily large, because all values of the form $n=ks(x)$ obviously work for $k=1, 2, 3, \ldots,$ since $y_i=kx_i$ is such a solution.

In time proportional to $m$ we can check whether any particular candidate for $n$ works by trying to recover $y_i$ as the rounded version of $n x_i/100$, applying the rounding procedure to the resulting array of $y$'s, and checking whether it produces the $x$'s we have been given.

There are some subtleties to this formulation. One is that the sum of the rounded versions of the $n x_i/100$ might not equal $n$ anymore. For example, with $x=(20,30,50)$ and checking $n=19$, the fractions $19(20,30,50)/100=(3.8,5.7,9.5)$ round to $(4,6,10)$ which sums to $20\ne 19.$ There are various procedures to avoid such problems related to inconsistencies between rounded and unrounded versions of proportions. We have no way of knowing what procedure The Guardian may have used.
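The naive check and the inconsistency just described can be sketched as follows. This is a minimal illustration, not the code behind this answer: the names `round_half_even` and `naive_check` are made up, and rounding ties to the nearest even integer is an assumption (it matches the $9.5 \to 10$ step in the example).

```python
def round_half_even(num, den):
    """Round num/den to the nearest integer, ties to even (den > 0)."""
    q, r = divmod(num, den)
    if 2 * r > den or (2 * r == den and q % 2 == 1):
        q += 1
    return q


def naive_check(x, n):
    """Recover counts from percents x for a candidate total n, then re-round.

    Returns (y, ok): y are the recovered counts, and ok is True only if
    the counts sum to n and round back to the given percents.
    """
    y = [round_half_even(n * xi, 100) for xi in x]
    back = [round_half_even(100 * yi, n) for yi in y]
    ok = sum(y) == n and back == list(x)
    return y, ok


y, ok = naive_check([20, 30, 50], 19)
print(y, sum(y), ok)  # [4, 6, 10] 20 False
```

The check costs one pass over the $m$ values for each candidate $n$, and it rejects $n=19$ here because the recovered counts sum to $20$.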

I have explored rounding procedures of the following form. They consist of computing the cumulative sums of the $y_i$: $$s_0=0, s_1=y_1, s_2=y_1+y_2,\ \ldots,\ s_m=y_1+y_2+\cdots+y_m.$$

These are individually and independently rounded to percents of the total, yielding corresponding whole numbers $0=t_0 \le t_1\le t_2\le \cdots \le t_m=100.$ The rounded version of $y_i$ is computed as the difference $t_i-t_{i-1}$. This guarantees the rounded versions sum to $t_m=100.$

Consider the raw counts $y=(5,9,10).$ The preceding procedure yields $x=(21,37,42).$ If, however, we were to reorder the values of $y$, we could also obtain $(20,38,42)$ and $(21,38,41)$ for the reported values $x$. This demonstrates that the result of rounding a sequence of values is not necessarily unique.
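Here is a minimal sketch of the cumulative-sum procedure and of its dependence on ordering. The function names are made up for this illustration, and rounding ties to even is an assumption about how the individual cumulative percents are rounded.

```python
from itertools import accumulate, permutations


def round_half_even(num, den):
    """Round num/den to the nearest integer, ties to even (den > 0)."""
    q, r = divmod(num, den)
    if 2 * r > den or (2 * r == den and q % 2 == 1):
        q += 1
    return q


def round_to_percents(y):
    """Round cumulative sums of y to percents of the total, then difference."""
    n = sum(y)
    t = [0] + [round_half_even(100 * s, n) for s in accumulate(y)]
    return [b - a for a, b in zip(t, t[1:])]


y = [5, 9, 10]
print(round_to_percents(y))  # [21, 37, 42]

# Apply the procedure to every ordering of y, then put each percent
# back in the position of the count it came from.
results = set()
for order in permutations(range(len(y))):
    pct = round_to_percents([y[i] for i in order])
    restored = [0] * len(y)
    for pos, i in enumerate(order):
        restored[i] = pct[pos]
    results.add(tuple(restored))

print(sorted(results))  # [(20, 38, 42), (21, 37, 42), (21, 38, 41)]
```

Every result sums to $100$ by construction, yet three different sets of percents arise from the same counts.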

Solutions

The data in the question are $x= (3,16,4,4,32,15,3,12,6,5).$ By checking all $n$ from $1$ through $100$, I found that the smallest $n$ allowing $x$ to be perfectly reconstructed with this procedure is $n=73$, for which

$$y = (2, 12, 3, 3, 23, 11, 2, 9, 4 , 4).$$

By sorting $y$ (either ascending or descending), which I believe is often done, the solution is $n=63,$ for which

$$y = (2, 2, 2 , 3, 3, 4, 7, 10, 10, 20).$$
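The search over $n$ can be sketched like this. It is an illustration under stated assumptions rather than the code actually used for this answer: the names `reconstruct` and `smallest_n` are made up, ties are rounded to even, and candidate counts are recovered by rounding $n$ times the cumulative percents and differencing.

```python
from itertools import accumulate


def round_half_even(num, den):
    """Round num/den to the nearest integer, ties to even (den > 0)."""
    q, r = divmod(num, den)
    if 2 * r > den or (2 * r == den and q % 2 == 1):
        q += 1
    return q


def round_to_percents(y):
    """Cumulative-sum rounding of counts y to whole percents."""
    n = sum(y)
    t = [0] + [round_half_even(100 * s, n) for s in accumulate(y)]
    return [b - a for a, b in zip(t, t[1:])]


def reconstruct(x, n):
    """Return counts summing to n that round back to x, or None.

    Assumes the percents x sum to 100.
    """
    s = [0] + [round_half_even(n * t, 100) for t in accumulate(x)]
    y = [b - a for a, b in zip(s, s[1:])]
    return y if round_to_percents(y) == list(x) else None


def smallest_n(x, n_max=100):
    """First n in 1..n_max for which reconstruct succeeds, with its counts."""
    for n in range(1, n_max + 1):
        y = reconstruct(x, n)
        if y is not None:
            return n, y
    return None


x = [3, 16, 4, 4, 32, 15, 3, 12, 6, 5]
print(reconstruct(x, 73))          # [2, 12, 3, 3, 23, 11, 2, 9, 4, 4]
print(reconstruct(sorted(x), 63))  # [2, 2, 2, 3, 3, 4, 7, 10, 10, 20]
print(smallest_n(x))
```

The two `reconstruct` calls reproduce the counts quoted above for $n=73$ and $n=63$; what `smallest_n` reports depends on the tie-breaking and recovery rules assumed here.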

There are $10!$ (over three million) permutations of $x$ to explore if one wishes to find the absolute smallest possible $n$. (In this case, the two repeated values in $x$ reduce that count by a factor of four, but it's still a lot of computation, and it reflects an approach that scales very badly with the length of $x$.) I instead looked at over 50,000 random permutations. Several of them yield $n=40$ (and most others yield $n=53$ or larger). Here's one obtained by reordering $x$ as $(32, 3, 5, 15, 3, 4, 16, 4, 6, 12)$:

$$y = (13, 1, 2, 6, 1, 2, 6, 2, 2, 5).$$

This allows us to conclude that the total number of responses may have been as low as $40$, depending on how the rounding was done.
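A random search over orderings might look like the sketch below. The names, the seed, and the small default trial count are all choices made for this illustration (the search described above used over 50,000 permutations), and rounding ties to even is an assumption.

```python
import random
from itertools import accumulate


def round_half_even(num, den):
    """Round num/den to the nearest integer, ties to even (den > 0)."""
    q, r = divmod(num, den)
    if 2 * r > den or (2 * r == den and q % 2 == 1):
        q += 1
    return q


def round_to_percents(y):
    """Cumulative-sum rounding of counts y to whole percents."""
    n = sum(y)
    t = [0] + [round_half_even(100 * s, n) for s in accumulate(y)]
    return [b - a for a, b in zip(t, t[1:])]


def reconstruct(x, n):
    """Return counts summing to n that round back to x, or None."""
    s = [0] + [round_half_even(n * t, 100) for t in accumulate(x)]
    y = [b - a for a, b in zip(s, s[1:])]
    return y if round_to_percents(y) == list(x) else None


def smallest_n(x):
    """Smallest n in 1..100 that works; n = 100 always does if sum(x) == 100."""
    for n in range(1, 101):
        y = reconstruct(x, n)
        if y is not None:
            return n, y


def search(x, trials=200, seed=0):
    """Best (n, counts, ordering) over the given order and random shuffles."""
    rng = random.Random(seed)
    n, y = smallest_n(x)
    best = (n, y, tuple(x))
    perm = list(x)
    for _ in range(trials):
        rng.shuffle(perm)
        n, y = smallest_n(perm)
        if n < best[0]:
            best = (n, y, tuple(perm))
    return best


x = [3, 16, 4, 4, 32, 15, 3, 12, 6, 5]
# The ordering quoted above does work with n = 40:
print(reconstruct([32, 3, 5, 15, 3, 4, 16, 4, 6, 12], 40))
# [13, 1, 2, 6, 1, 2, 6, 2, 2, 5]
print(search(x))
```

With this tie rule, the $n=40$ solution relies on $32.5$ rounding down to $32$; a round-half-up rule would reject it, which is one more way the answer depends on the unknown rounding procedure.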

The figure is a histogram of the smallest values of $n$ found for the last 10,000 permutations I looked at. The lowest value is $40$ (whose bar is too short to be seen at this resolution).

[Figure: histogram of the smallest $n$ found for each of the last 10,000 permutations]

You can see from the groups of tallest bars that most permutations require somewhere around 67, 75, 80, or more raw responses (that is, two-thirds, three-quarters, or four-fifths of 100). If we were to consider all permutations of the ten results to be equally likely, and also if we were to suppose it was not a large survey, then numbers in these ranges would be good a posteriori bets concerning how many responses were obtained.

whuber
  • Never come across that rounding method. Interesting. In sciences, we're just taught to not round off intermediate calculations, just wait until the end to determine number of significant digits, then round to that. I suppose rounding to 100 could be useful in a situation where subsequent analysis required a total of 100. Otherwise, considering your example $ y = \{30, 30, 30, 10\} $, it seems prone to being less representative. What do you think? – Micky Mar 01 '18 at 23:50
  • I don't understand what you mean by "less representative." Rounding in the sciences is subject to the same difficulties whenever data have to sum to a constant: you need more sophisticated procedures to assure that the sum of rounded values equals the sum of the unrounded values. There's literature on this, but it has been so long that I read it that I cannot readily identify a reference for you. – whuber Mar 02 '18 at 15:01
  • Let me back up a moment. In your example, is it $x$ or $y$ that is $\{30,30,30,10\}$? Initially you have it as $x$, but then as $y$ later on. In the OP case, you refer to the percentages as $x_i$, and the raw data as $y_i$. Can you verify that your usage of $x$ and $y$ is consistent? – Micky Mar 02 '18 at 16:17
  • In your paragraph in which you discuss a problem encountered in recovering the raw counts from the percentages, you write: *"There are various procedures to avoid that problem--and we have no way of knowing what procedure The Guardian may have used."* But I don't see how The Guardian could have encountered *that* problem because they carried out the survey and thus had the true raw counts, then derived the percentages. The problem The Guardian may have encountered is that their resulting percentages did not sum to 100 after rounding to whole numbers. – Micky Mar 02 '18 at 16:26
  • Right: but the percentages they report sum to 100. We have no way to tell whether it just happened that way or if they applied a procedure that guarantees a sum to 100. If it's the latter, we don't know what procedure they used. Therefore, if we are to be thorough in addressing your question, we need to consider all possible (plausible) procedures they might have applied. – whuber Mar 02 '18 at 16:57
  • How about the inconsistent use of $x$ and $y$? – Micky Mar 02 '18 at 17:03
  • I don't understand what you mean by "inconsistent." $x$ and $y$ give the same proportions to the nearest percent, within rounding error: that's the whole point. – whuber Mar 02 '18 at 17:04
  • See my comment above, *"Let me back up a moment. "* – Micky Mar 02 '18 at 17:05
  • Thank you: now I see what you are asking. I did indeed mix up $x$ and $y$ in that example. I'll fix it right away. – whuber Mar 02 '18 at 17:08
  • Might I suggest using other var names, such as $p_i$ for percentages, and $c_i$ for vote counts? – Micky Mar 02 '18 at 17:11