
I was trying to help someone here, but then I discovered that I actually only almost understand entropy (not fully).

Questions:

  • Q1: how is my definition of information different from that of Claude Shannon?
  • Q2: following my thread of thought (below), where did I go wrong?

My thread of thought

I think there are many different ways to look at entropy, but among the small subset of views I can think of, I prefer this one:

  • The entropy of a thing is a measure of information that the thing contains.
  • The information of a thing is measured by the total number of perfect questions that, once answered, resolve all ambiguity about the thing (i.e. the thing becomes fully known).

Let's see: you have a single 6-faced die that you have tossed, and you have observed the outcome of this trusty fair die toss. I, on the other hand, don't know the outcome of your toss. All you have told me is this:

  • This single die toss is independent of any other die toss you might have done in the past.
  • Each die outcome $i$ has probability $\Pr(I=i)$.

Now, suppose that you tell me the outcome of the trusty die toss. Say, the outcome is 5. How much information did you give me by telling me "outcome is 5"?

The answer (in my view): the quantity of information there is the number of perfect $n$-ary questions that I would have had to ask in order to learn that the outcome is 5.

One could ask a single question, "what is the number?", which is one hell of an $n$-ary question where $n$ is the size of the space of answers (here $n = 6$).

Alternatively, one can ask several questions of the form "is the number greater than $x$?", each of which is a 2-ary (binary) question. Of course we will need to ask more of those.

A set of $n$-ary questions is said to be "perfect" if its size is minimal. It turns out that such questions are the ones that split the space of remaining answers into (roughly) equal parts at each step (in half, for binary questions).

Also, comparing information based on counts of questions with differing values of $n$ is not very meaningful (answers to questions with higher $n$ tell more about a thing than those with lower $n$). So we have to fix $n$. Let's say that $n=2$.

Let's say the outcome was $1$; then these are the minimum/perfect questions that I need to ask to know the outcome:

  • Is outcome greater than 3.5? Answer will be "no".
  • Is outcome greater than 1? Answer will be "no".

Boom, it must be 1. I found it in 2 questions.

Now let's say that the outcome was 3:

  • Is outcome greater than 3.5? Answer will be "no".
  • Is outcome greater than 1? Answer will be "yes".
  • Is outcome greater than 2? Answer will be "yes".

Boom, it must be 3. I found it in 3 questions.

It seems that outcomes 1 and 4 need two questions, and the others (2, 3, 5, 6) need three questions.

Since all outcomes are equally likely to happen, the expected number of questions is $\frac{2+2+3+3+3+3}{6} = \frac{16}{6} \approx 2.67$.
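Here is a quick sanity check of those counts (a minimal Python sketch of the bisection strategy described above; the helper name is just made up for illustration):

```python
def questions_needed(outcome):
    """Count the binary questions asked under the strategy above."""
    asked = 1                       # first question: "greater than 3.5?"
    if outcome > 3.5:
        candidates = {4, 5, 6}
        followups = (4, 5)          # then "greater than 4?", "greater than 5?"
    else:
        candidates = {1, 2, 3}
        followups = (1, 2)          # then "greater than 1?", "greater than 2?"
    for t in followups:
        if len(candidates) == 1:    # outcome already pinned down
            break
        asked += 1
        answer = outcome > t
        candidates = {c for c in candidates if (c > t) == answer}
    return asked

counts = {o: questions_needed(o) for o in range(1, 7)}
print(counts)                       # {1: 2, 2: 3, 3: 3, 4: 2, 5: 3, 6: 3}
print(sum(counts.values()) / 6)     # 2.666...
```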

Now, what if we ask a different kind of perfect binary question? Say the outcome is actually 1 and we ask:

  • Is outcome greater than 3.5? Answer "no".
  • Is outcome greater than 2? Answer is "no".
  • Is outcome equal to 2? Answer is "no".

Boom, must be 1 then.

Let's say that actual outcome is 3:

  • Is outcome greater than 3.5? Answer "no".
  • Is outcome greater than 2? Answer is "yes".

Boom, it must be 3 then. In two questions.

I think with this kind of question, outcomes 3 and 6 need two questions, while the others (1, 2, 4, 5) need three. So the average number of questions here is $\frac{2+2+3+3+3+3}{6} = \frac{16}{6} \approx 2.67$ as well.

But what about the grand average, regardless of which kind of perfect questions we ask? $\frac{(2+2+3+3+3+3)+(2+2+3+3+3+3)}{6+6} = \frac{32}{12} \approx 2.67$, obviously the same.
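Again, just checking the arithmetic (the per-outcome counts below are simply the ones tallied above, hard-coded):

```python
# Question counts per outcome, as tallied in the text.
first_strategy  = {1: 2, 2: 3, 3: 3, 4: 2, 5: 3, 6: 3}  # ">3.5?", then ">1?"/">4?", ...
second_strategy = {1: 3, 2: 3, 3: 2, 4: 3, 5: 3, 6: 2}  # ">3.5?", then ">2?"/">5?", then "=2?"/"=5?"

avg1 = sum(first_strategy.values()) / 6
avg2 = sum(second_strategy.values()) / 6
grand = (sum(first_strategy.values()) + sum(second_strategy.values())) / 12
print(avg1, avg2, grand)   # 2.666..., 2.666..., 2.666...
```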

So I think that the information contained in a single toss of a fair 6-faced die is $\frac{16}{6} \approx 2.67$ binary questions.

That is, if we define information as the total number of perfect $n$-ary questions that we need to ask (here with $n = 2$).

But apparently Claude Shannon disagrees. He says: \begin{equation}\begin{split} \text{entropy of that fair die} &= \sum_{c=1}^6 \frac{1}{6}\log_2\left(\frac{6}{1}\right)\\ &= \frac{1}{6} \times 6 \times \log_2(6)\\ &= \log_2(6)\\ &\approx 2.58496 \end{split}\end{equation}
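Both numbers are easy to reproduce (a minimal Python sketch, nothing beyond the arithmetic above):

```python
from math import log2

# Shannon entropy of one toss of a fair 6-faced die, in bits.
probs = [1 / 6] * 6
H = -sum(p * log2(p) for p in probs)
print(H)        # 2.584962500721156  (= log2(6))

# Expected number of perfect binary questions from the strategies above.
print(16 / 6)   # 2.6666666666666665
```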

caveman
  • As you discovered in your previous answer, your definition *is* different. So what are you really trying to ask here? – whuber Apr 08 '16 at 17:39
  • @whuber Because I have other follow-up questions and I think they won't fit nicely there (better to make it a question of its own). Primarily, I am currently disappointed by what entropy is measuring. It seems to me that it is measuring the information of a different scenario (where many dice are tossed at the same time) than the actual one (a single die is tossed). (Thank you for enlightening me in that post.) – caveman Apr 08 '16 at 19:16
  • Here's a thought. Changing the base of a logarithm is just a rescaling of its value. For instance, using logarithms base $6$ is tantamount to multiplying all binary logarithms by $\log_6(2)\approx 0.387$. For any definition to have intrinsic meaning its values should not depend on the base used and should scale in exactly the same way. But if you could ask questions that have six, rather than two, options, then you would need *exactly* one question to determine the face of a die. Thus the log-6 "entropy" *by your definition* is $1$. Converted to binary that's $\log_2(6)$. – whuber Apr 08 '16 at 19:27
  • The point of that previous thought is that your analysis of "entropy" *depends fundamentally on which logarithmic base you choose to use.* That arbitrariness shows that although you might be capturing something of interest about the *combination* of the nature of the questions (that is, binary instead of 6-ary) and the distribution, you are *not* expressing some intrinsic property of the distribution itself. – whuber Apr 08 '16 at 19:30
  • (Race condition; I'm reading your 2nd post.) Sorry, not sure I understand. Are you saying that Shannon's entropy has no intrinsic meaning (because its value too depends on the base of the logarithm it uses)? And yes, in my case, if I ask a single 6-ary question, all ambiguity will be resolved by a single question. Therefore I think that the unit of information is "$n$-ary", and in order to make measured information comparable, we need to unify the units so that we compare information of the same unit, such as (say) $2$-ary (information measured by using perfect binary questions). – caveman Apr 08 '16 at 19:41
  • No, I'm saying that Shannon's definition *does* have intrinsic meaning and yours obviously does *not*. His definition does *not* depend on the base, because the answer transforms appropriately. Consider this analogy: if you invent a thermometer that shows both degrees F and degrees C, but it gives a different conversion between F and C *depending on what it's measuring*, then it's a pretty weird thermometer, isn't it? Shannon's entropy converts consistently between different bases, no matter what the distribution may be, *but yours does not*. – whuber Apr 08 '16 at 19:44
  • What does "intrinsic meaning" mean? Similarly what is "some intrinsic property of the distribution itself"? And what is "the distribution itself"? As for the temperature conversion, I think my case is rather like this "getting different temperature readings depending on the temperature unit chosen; if we choose a 6-nary unit we get different answer than 2-nary". – caveman Apr 08 '16 at 19:55
  • "Intrinsic" means it's a property of the distribution. We have no problems using multiple temperature measurement systems because there's a universal conversion between them; thus, what a thermometer measures is a property of the physical system it is measuring. We have no problems with Shannon's entropy for exactly the same reasons. Your calculation is problematic because it does not behave like this. Your calculation is describing the relationship between some system of asking questions and a distribution, but it is not giving clear information about the distribution itself. – whuber Apr 08 '16 at 20:00
  • Sorry what is the "distribution itself"? Is this the distribution of 6-faced die outcomes? I.e. if $D$ is a random variable that takes values in set of outcomes $\{1,2,3,4,5,6\}$, then this distribution is the PMF $p_D$? – caveman Apr 08 '16 at 20:15
  • @whuber Great insight into the notion of intrinsic meaning of the definition. I too have an issue with two definitions of information in statistics. Please do have a [look if convenient](https://stats.stackexchange.com/questions/457672/conflicting-definition-of-information-in-statistics-fisher-vs-shannon). – GENIVI-LEARNER Mar 31 '20 at 14:01

1 Answer


Your definition is different in that you are asking questions about a single die throw, as opposed to the limit over many throws: you can ask more informative questions about multiple throws at once. I believe your definition is what Huffman coding finds a code for - the minimal expected number of binary questions (code length) when each outcome is compressed separately - so you might want to look at examples of the inefficiency of that approach.
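For example, here is a minimal Python sketch (not from the linked references; the helper is a toy Huffman implementation) showing that coding each throw separately costs $16/6 \approx 2.67$ bits per throw, while coding pairs of throws already gets closer to $\log_2 6 \approx 2.585$:

```python
import heapq
from itertools import count, product
from math import log2

def huffman_code_lengths(probs):
    """Return {symbol: Huffman code length in bits} for a probability dict."""
    tiebreak = count()  # prevents heapq from ever comparing the dicts
    heap = [(p, next(tiebreak), {sym: 0}) for sym, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, depths1 = heapq.heappop(heap)
        p2, _, depths2 = heapq.heappop(heap)
        merged = {s: d + 1 for s, d in {**depths1, **depths2}.items()}
        heapq.heappush(heap, (p1 + p2, next(tiebreak), merged))
    return heap[0][2]

# One throw at a time: code lengths are 2 or 3 bits, expected 16/6 ≈ 2.667 bits.
single = {face: 1 / 6 for face in range(1, 7)}
lens = huffman_code_lengths(single)
print(sum(single[s] * lens[s] for s in single))              # ≈ 2.667

# Pairs of throws coded jointly: ≈ 2.611 bits per throw, closer to log2(6).
pairs = {pq: 1 / 36 for pq in product(range(1, 7), repeat=2)}
lens2 = huffman_code_lengths(pairs)
print(sum(pairs[s] * lens2[s] for s in pairs) / 2, log2(6))  # ≈ 2.611 vs 2.585
```

Longer blocks (triples and beyond) keep shrinking the gap, which is the sense in which Shannon's $\log_2 6$ is the limiting value rather than the per-throw question count.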

DaVinci
  • Can you give such examples? – caveman Apr 08 '16 at 20:17
  • Wikipedia on arithmetic coding has an example where it beats the Huffman code: https://en.wikipedia.org/wiki/Arithmetic_coding#Huffman_coding – DaVinci Apr 08 '16 at 20:25
  • Also http://videolectures.net/mlss09uk_mackay_it/ part 2 starting at min 58 (ignoring the part about context since we are talking about independent samples) – DaVinci Apr 08 '16 at 20:28