14

First of all I'm not sure where this question should be posted. I'm asking if a statistics problem is NP-Complete and if not to solve it programmatically. I'm posting it here because the statistics problem is the center point.

I'm trying to find a better formula for solving a problem. The problem is: if I have 4d6 (4 ordinary 6 sided dice) and roll them all at once, remove a die with the lowest number (called "dropping"), then sum the remaining 3, what is the probability of each possible result? I know the answer is this:

Sum (Frequency): Probability
3   (1):         0.0007716049
4   (4):         0.0030864198
5   (10):        0.0077160494
6   (21):        0.0162037037
7   (38):        0.0293209877
8   (62):        0.0478395062
9   (91):        0.0702160494
10  (122):       0.0941358025
11  (148):       0.1141975309
12  (167):       0.1288580247
13  (172):       0.1327160494
14  (160):       0.1234567901
15  (131):       0.1010802469
16  (94):        0.0725308642
17  (54):        0.0416666667
18  (21):        0.0162037037

The average is 12.24 and standard deviation is 2.847.

I found the above answer by brute force and don't know how or if there is a formula for it. I suspect this problem is NP-Complete and therefore can only be solved by brute force. It might be possible to get all probabilities of 3d6 (3 normal 6 sided dice) then skew each of them upward. This would be faster than brute force because I have a fast formula when all dice are kept.

I programmed the formula for keeping all dice in college. I had asked my statistics professor about it and he found this page, which he then explained to me. There is a big performance difference between this formula and brute force: 50d6 took 20 seconds but 8d6 drop lowest crashes after 40 seconds (chrome runs out of memory).

Is this problem NP-Complete? If yes please provide a proof, if no please provide a non-brute force formula to solve it.

Note that I don't know much about NP-Complete so I might be thinking of NP, NP-Hard, or something else. The proof for NP-Completeness is useless to me the only reason why I ask for it is to prevent people from guessing. And please bare with me as it's been a long time since I worked on this: I don't remember statistics as well as I might need to solve this.

Ideally I'm looking for a more generic formula for X number of dice with Y sides when N of them are dropped but am starting with something much more simple.

Edit:

I would also prefer the formula to output frequencies but it is acceptable to only output probabilities.

For those interested I have programmed whuber's answer in JavaScript on my GitHub (in this commit only the tests actually use the functions defined).

kjetil b halvorsen
  • 63,378
  • 26
  • 142
  • 467
SkySpiral7
  • 253
  • 2
  • 8
  • 1
    This is an interesting question. I think it should be on-topic here. Thank you for your consideration. – gung - Reinstate Monica Dec 22 '14 at 16:40
  • 1
    Although the setting is interesting, you haven't yet asked an answerable question: the idea of NP-completeness depends on having a *class* of problems, while you have described only one. Exactly how do you want it to generalize? Although you hint that the numbers of dice could vary, various additional options are possible and they might yield different answers: you could change the numbers of faces, the values on the faces, the numbers of dice, and the numbers of dropped dice, all in various ways with various relationships among them. – whuber Dec 22 '14 at 16:49
  • 1
    @whuber She doesn't know any complexity theory but I think it's clear that she's asking after the family of problems generated by changing the number of dice. I also think I have an efficient algorithm for it. – Andy Jones Dec 22 '14 at 17:04
  • 2
    @Andy I do see at the end she is asking for "a more generic formula for X number of dice with Y sides when N of them are dropped". – whuber Dec 22 '14 at 17:34
  • @whuber Hah! Apparently not as clear as I thought then. Sorry, my bad. – Andy Jones Dec 22 '14 at 18:09
  • @whuber I don't know much about NP. My goal of posting this was to improve my program. I didn't want someone to guess that it can't be done because that doesn't help me. But if there was a proof against polynomial time then my program is probably as fast as it can be so I can move on. Also I'm male – SkySpiral7 Dec 26 '14 at 16:11
  • What is your estimate of the complexity of brute force solution? What complexity you hope to get? – Aksakal Dec 26 '14 at 19:17
  • I haven't had to calculate Big O Notation before and my code is complicated because it does everything. In this case it finds all possible die value combinations before summation. Then removes the smallest die. Then sums them. Then counts the frequency per sum. I would guess this is about O(k^n) where n is the number of dice and k is the number of sides. I don't have a complexity goal but I was hoping to make the program fast enough to handle normal use case without running out of memory. Sorry that my question is in an area that I don't know much about. – SkySpiral7 Dec 26 '14 at 20:39
  • I do have a function developed 1-2 years ago for a VB.net personal project that calculates the the mean sum of X Y-sided dice dropping the lowest 1 dice (I validate correctness with rumkin and anydice). It doesn't work for dropping more than the 1 lowest. It doesn't use any brute force. Frankly, I can't remember why it works, and I don't know enough about statistics to guess how to compute dropping the *n*th lowest. But, there may be something helpful in what I made, if I could figure out put it in an equation form. –  Apr 22 '15 at 05:18
  • I guess I can try to parse the equation to English. The mean of *x*D*y* drop lowest is: [(Sum of all permutations of *x* dice) - (Sum of (Sum of all permutations where 1...*y* is the lowest die)]/(# of permutations of *x*D*y*). Sum of all possible 4d6 is 18144 (calculate your way). The (Sum of (Sum of...)) is easy to loop. `For(i=1, i <= y, ++i) {total += i * ((y+1-i)^(x-1)-(s-i)^(x-1))}`. Which for 4d6 is 2275. # permutations is obviously `y^x`, 1296 in this case. So, `(18144-2275)/1296` ~ 12.2446, which is the mean you provided. Works for all x & y, if your `total` doesn't overflow. –  Apr 22 '15 at 06:57
  • @CreationEdge Thank you for your support but I don't think your formula can be made generic for any number of sides etc. (Also it'd probably be better posted as an unaccepted answer instead of a comment). The accepted answer by Andy I think is enough I just haven't got around to programming it. – SkySpiral7 May 25 '15 at 00:04

3 Answers3

7

Edit: @SkySpiral has had trouble getting the below formula to work. I currently don't have time to work out what the issue is, so if you're reading this it's best to proceed under the assumption it's incorrect.


I'm not sure about the general problem with varying numbers of dice, sides, and drops, but I think I can see an efficient algorithm for the drop-1 case. The qualifier is that I'm not completely sure that it's correct, but right now I can't see any flaws.

Let's start by not dropping any dice. Suppose $X_n$ represents the $n$th die, and suppose $Y_n$ represents the sum of $n$ dice. Then

$$p(Y_n = a) = \sum_k p(Y_{n-1} = a - k)p(X_n=k)$$

Now suppose $Z_n$ is the sum of $n$ dice when one die is dropped. Then

$$p(Z_n = a) = p(\text{$n$th die is the smallest})p(Y_{n-1} = a) + \\ p(\text{$n$th die is not the smallest})\sum_k p(Z_{n-1} = a - k)p(X_n=k)$$

If we define $M_n$ to be distribution of the minimum of $n$ dies, then

$$p(Z_n = a) = p(X_n \leq M_{n-1})p(Y_{n-1} = a | X_n \leq M_{n-1}) + \\ p(X_n > M_{n-1})\sum_k p(Z_{n-1} = a - k)p(X_n=k | X_n > M_{n-1})$$

and we can calculate $M_n$ using

$$p(M_n = a) = p(X_n \leq M_{n-1})p(X_n = a |X_n \leq M_{n-1}) + p(X_n > M_{n-1})p(M_{n-1} = a|X_n > M_{n-1})$$

Anyway, together this all suggests a dynamic programming algorithm based on $Y_n, Z_n$ and $M_n$. Should be quadratic in $n$.

edit: A comment has been raised on how to calculate $p(X_n \leq M_{n-1})$. Since $X_n, M_{n-1}$ can each only take on one of six values, we can just sum over all possibilities:

$$p(X_n \leq M_{n-1}) = \sum_{a,b} p(X_n = a, M_{n-1} = b, a \leq b)$$

Similarly, $p(X_n = k | X_n > M_{n-1})$ can be calculated by applying Bayes rule then summing over the possible values of $X_n, M_{n-1}$.

Andy Jones
  • 2,146
  • 9
  • 10
  • 1
    +1 This looks correct and you said that's it's quadratic. But it's been a few years since I took statistics (I'm primarily a programmer). So I'd like to fully understand this before marking it as the answer. Also I see you have p(nth is the smallest die) does this include if nth is tied with the smallest? Such as rolling all 3s. – SkySpiral7 Dec 26 '14 at 16:21
  • Good catch. If the $n$th die rolled is the same as the current minimum, we can regard that die as the one to be dropped. In which case the distribution is $Y_{n-1}$. I've swapped some $( – Andy Jones Dec 26 '14 at 18:58
  • Thank you. If I understand this correctly I think your formulas are the answer. However I don't know how to calculate p(X(n) > M(n-1)) (or the negation of it) or p(X(n)=k|X(n) > M(n-1)) so I can't use this answer yet. I'll mark this as the answer but I'd like more information. Can you edit your answer to explain these or should I post it as another question? – SkySpiral7 Dec 27 '14 at 02:56
  • Edited my answer. – Andy Jones Dec 27 '14 at 10:42
  • 1
    Sorry I know it's been a year and a half but I've finally gotten around to implementing this formula into code. However the p(Z(n)=a) formula appears incorrect. Suppose 2 dice with 2 sides (drop lowest), what are the chances of the result being 1? The chance of X(n) being the smallest or tied is 3/4 and p(Y(n-1)=1) is 1/2 so that Z(n) returns at least 3/8 even though the correct answer is 1/4. The Z formula looks correct to me and I don't know how to fix it. So if it's not too much to ask: what do you think? – SkySpiral7 Jul 10 '16 at 20:36
  • Aha, spotted my mistake. In the first term of $p(Z_n = a)$, I treated $Y_n$ as independent of $X_n \leq M_{n-1}$, when really it should be conditional. In the specific two-die-two-side case, there are four possible outcomes, and in 3/4 of those outcomes $X_2 \leq M_1$. In those three outcomes where $X_2 \leq M_1$, in 1/3 of those outcomes does $Y_1 = 1$. So $p(Z_2 = 1) = 3/4 * 1/3 = 1/4$, hooray. – Andy Jones Jul 11 '16 at 09:03
  • Doing this Z(n) formula [by hand](https://github.com/SkySpiral7/Dice/blob/d85d45b4678e115274be7818507e714003db39f1/src/library/StackExchange.js#L102) for "3d2 drop lowest" (3d2 is 3 dice with 2 sides) I got 17/32 but the correct answer is 1/2. Also p(X(3)=2|X(3)>M(2)) should be 1 but using Baye's Theorem [I get 1/3](https://github.com/SkySpiral7/Dice/blob/d85d45b4678e115274be7818507e714003db39f1/tests/StackExchange.js#L145), does Baye's Theorem have edge cases or am I doing something wrong? I don't have any way of PMing you but my email is rworcest@g.emporia.edu. – SkySpiral7 Jul 17 '16 at 23:34
  • I haven't heard from you (I also see you've had no activity since your last comment). I know it may not be fun to keep revisiting this question but I don't know how to fix it myself. I know I've had this as the accepted answer for a long time but if the formula can't at least account for "4d6 drop lowest" for all sums then it isn't correct and I'll have to un-mark it as correct (I'll give you some more time). – SkySpiral7 Aug 07 '16 at 21:59
  • @SkySpiral7 I'm afraid I don't have time right now to work through it fully. You should unmark it and I'll edit in a warning saying it might be incorrect. – Andy Jones Aug 08 '16 at 08:38
  • (To be clear, since my last comment I'd been hoping to find some time to work on this, but it doesn't look like it's going to happen any time soon. Sorry!) – Andy Jones Aug 08 '16 at 08:56
7

Solution

Let there be $n=4$ dice each giving equal chances to the outcomes $1, 2, \ldots, d=6$. Let $K$ be the minimum of the values when all $n$ dice are independently thrown.

Consider the distribution of the sum of all $n$ values conditional on $K$. Let $X$ be this sum. The generating function for the number of ways to form any given value of $X$, given that the minimum is at least $k$, is

$$f_{(n,d,k)}(x) = x^k+x^{k+1} + \cdots + x^d = x^k\frac{1-x^{d-k+1}}{1-x}.\tag{1}$$

Since the dice are independent, the generating function for the number of ways to form values of $X$ where all $n$ dice show values of $k$ or greater is

$$f_{(n,d,k)}(x)^n = x^{kn}\left(\frac{1-x^{d-k+1}}{1-x}\right)^n.\tag{2}$$

This generating function includes terms for the events where $K$ exceeds $k$, so we need to subtract them off. Therefore the generating function for the number of ways to form values of $X$, given $K=k$, is

$$f_{(n,d,k)}(x)^n - f_{(n,d,k+1)}(x)^n.\tag{3}$$

Noting that the sum of the $n-1$ highest values is the sum of all values minus the smallest, equal to $X-K$. The generating function therefore needs to be divided by $k$. It becomes a probability generating function upon multiplying by the common chance of any combination of dice, $(1/d)^n$:

$$d^{-n}\sum_{k=1}^dx^{-k}\left(f_{(n,d,k)}(x)^n - f_{(n,d,k+1)}(x)^n\right).\tag{4}$$

Since all the polynomial products and powers can be computed in $O(n\log n)$ operations (they are convolutions and therefore can be carried out with the discrete Fast Fourier Transform), the total computational effort is $O(k\,n\log n)$. In particular, it is a polynomial time algorithm.


Example

Let's work through the example in the question with $n=4$ and $d=6$.

Formula $(1)$ for the PGF of $X$ conditional on $K\ge k$ gives

$$\eqalign{ f_{(4,6,1)}(x) &= x+x^2+x^3+x^4+x^5+x^6 \\ f_{(4,6,2)}(x) &= x^2+x^3+x^4+x^5+x^6 \\ \ldots \\ f_{(4,6,5)}(x) &= x^5+x^6 \\ f_{(4,6,6)}(x) &= x^6 \\ f_{(4,6,7)}(x) &= 0. }$$

Raising them to the $n=4$ power as in formula $(2)$ produces

$$\eqalign{ f_{(4,6,1)}(x)^4 &= x^4 + 4x^5 + 10 x^6 + \cdots + 4x^{23} + x^{24} \\ f_{(4,6,2)}(x)^4 &= x^8 + 4x^9 + 10x^{10}+ \cdots + 4x^{23} + x^{24} \\ \ldots \\ f_{(4,6,5)}(x)^4 &=x^{20} + 4 x^{21} + 6 x^{22} + 4x^{23} +x^{24}\\ f_{(4,6,6)}(x)^4 &= x^{24}\\ f_{(4,6,7)}(x)^4 &= 0 }$$

Their successive differences in formula $(3)$ are

$$\eqalign{ f_{(4,6,1)}(x)^4 - f_{(4,6,2)}(x)^4 &= x^4 + 4x^5 + 10 x^6 + \cdots + 12 x^{18} + 4x^{19} \\ f_{(4,6,2)}(x)^4 - f_{(4,6,3)}(x)^4 &= x^8 + 4x^9 + 10x^{10} + \cdots + 4 x^{20} \\ \ldots \\ f_{(4,6,5)}(x)^4 - f_{(4,6,6)}(x)^4 &=x^{20} + 4 x^{21} + 6 x^{22} + 4x^{23} \\ f_{(4,6,6)}(x)^4 - f_{(4,6,7)}(x)^4 &= x^{24}. }$$

The resulting sum in formula $(4)$ is

$$6^{-4}\left(x^3 + 4x^4 + 10x^5 + 21x^6 + 38x^7 + 62x^8 + 91x^9 + 122x^{10} + 148x^{11} + \\167x^{12} + 172x^{13} + 160x^{14} + 131x^{15} + 94x^{16} + 54x^{17} + 21x^{18}\right).$$

For example, the chance that the top three dice sum to $14$ is the coefficient of $x^{14}$, equal to

$$6^{-4}\times 160 = 10/81 = 0.123\,456\,790\,123\,456\,\ldots.$$

It is in perfect agreement with the probabilities quoted in the question.

By the way, the mean (as calculated from this result) is $15869/1296 \approx 12.244598765\ldots$ and the standard deviation is $\sqrt{13\,612\,487/1\,679\,616}\approx 2.8468444$.

A similar (unoptimized) calculation for $n=400$ dice instead of $n=4$ took less than a half a second, supporting the contention that this is not a computationally demanding algorithm. Here is a plot of the main part of the distribution:

Figure

Since the minimum $K$ is highly likely to equal $1$ and the sum $X$ will be extremely close to having a Normal$(400\times 7/2, 400\times 35/12)$ distribution (whose mean is $1400$ and standard deviation is approximately $34.1565$), the mean must be extremely close to $1400-1=1399$ and the standard deviation extremely close to $34.16$. This nicely describes the plot, indicating it is likely correct. In fact, the exact calculation gives a mean of around $2.13\times 10^{-32}$ greater than $1399$ and a standard deviation around $1.24\times 10^{-31}$ less than $\sqrt{400\times 35/12}$.

whuber
  • 281,159
  • 54
  • 637
  • 1,101
  • 1
    Your answer is fast and is correct so I've marked it as the answer. Also in an edit I said it would also be nice to have frequencies if possible. For that you don't need to edit your answer since I can see that the ``6^-4`` multiplier is used to convert from frequency to probability. – SkySpiral7 Dec 05 '16 at 01:00
  • 1
    This is a really good answer. What modifications would have to be made to take only the highest 2 dice (or, the highest $h$ dice in general), instead of the highest 3? I assume it would modify equation 4 but I can't work it out or find a reference. – markovchain Sep 04 '20 at 05:35
1

I have a reasonably efficient algorithm for this that, on testing, seems to match results of pure brute force while relying less heavily on enumerating all possibilities. It's actually more generalized than the above problem of 4d6, drop 1.

Some notation first: Let $X_NdY$ indicate that you are rolling $X$ dice with $Y$ faces (integer values $1$ to $Y$), and considering only the highest $N$ dice rolled. The output is a sequence of dice values, e.g. $4_3d6$ yields $3, 4, 5$ if you rolled $1, 3, 4, 5$ on the four dice. (Note that I'm calling it a "sequence," but the order is not important here, particularly since all we care about in the end is the sum of the sequence.)

The probability $P(X_NdY = S)$ (or more specifically, $P(4_3d6 = S)$) is a simplified version of the original problem, where we are only considering a specific set of dice, and not all possible sets that add up to a given sum.

Suppose $S$ has $k$ distinct values, $s_0, s_1, ..., s_k$, such that $s_i > s_{i+1}$, and each $s_i$ has a count of $c_i$. For example, if $S = 3, 4, 4, 5$, then $(s_0,c_0) = (5,1)$, $(s_1,c_1) = (4,2)$, and $(s_2,c_2) = (3,1)$.

You can calculate $P(X_NdY = S)$ in the following way:

$$ P(X_NdY = S) = \frac{ \left( \prod_{i=0}^{k-1} {X - \sum_{h=0}^{i-1} c_h \choose c_i} \right) \left( \sum_{j=0}^{X-N} { c_k+X-N \choose c_k+X-N-j} (s_k-1)^j \right)}{ Y^X }$$

That's pretty messy, I know.

The product expression $\prod_{i=0}^{k-1}$ is iterating through all but the lowest of the values in $S$, and calculating all the ways those values may be distributed among the dice. For $s_0$, that's just $X \choose c_i$, but for $s_1$, we have to remove the $c_0$ dice that have already been set aside for $s_0$, and likewise for $s_i$ you must remove $\sum_{h=0}^{i-1}c_h$.

The sum expression $\sum_{j=0}^{X-N}$ is iterating through all the possibilities of how many of the dropped dice were equal to $s_k$, since that affects the possible combinations for the un-dropped dice with $s_k$ as their value.

By example, let's consider $P[4_3d6=(5,4,4)]$:

$$ (s_1, c_1) = (5, 1) $$ $$ (s_2, c_2) = (4, 2) $$

So using the formula above:

$$ P[4_3d6=(5,4,4)] \\ = \frac{ {4 \choose 1} \left( {3 \choose 3} \cdot 3^0 + {3 \choose 2} \cdot 3^1 \right) }{ 6^4 } \\ = \frac{5}{162} = 0.0\overline{308641975}$$

The formula breaks down on a domain issue when $s_k=1$ and $j=0$ in the summation, leading to a first term of $0^0$, which is indeterminate and needs to be treated as $1$. In such a case, a summation is not actually necessary at all, and can be omitted, since all the dropped dice will also have a value of $s_k = 1$.

Now here's where I do need to rely on some brute force. The original problem was to calculate the probability of the sum being some value, and $X_NdY$ represents the individual dice left after dropping. This means you must add up the probabilities for all possible sequences $S$ (ignoring ordering) whose sum is the given value. Perhaps there is a formula to calculate this across all such values of $S$ at once, but I haven't even tried broaching that yet.

I've implemented this in Python first, and the above is an attempt to express it mathematically. My Python algorithm is accurate and reasonably efficient. There are some optimizations that could be made for the case of calculating the entire distribution of $\sum X_NdY$, and maybe I'll do that later.

  • As a programmer it might be easier for me to understand your Python code (although I've never used Python so it might be the same). Posting the code here is off topic but you could post a link to github etc. – SkySpiral7 Oct 31 '16 at 01:47
  • 1
    Your answer may be correct and it seems to reduce the complexity from `O(Y^X)` to `O((Y+X-1)!/(X!*(Y-1)!))` but it still isn't as efficient as whuber's answer of `O(c*X*log(X))`. Thanks for your answer though +1. – SkySpiral7 Nov 29 '16 at 04:21