# Entropy

# Just a little Bit

One of the first lessons in many programming courses is about variables and their types. In `C`

programming language (on a typical machine) a variable of type `unsigned char`

consists of 8 bits and can store any integer number from 0 up to 255. A `signed char`

variable takes the same amount of space and represents integers between −128 and 127. In both cases there are 256 distinct numbers.
You can play with the demo below to see how the numbers are represented. The `signed`

numbers use two's complement representation.

# Binary Representation

Possible states: **256**.

A 16-bits variable type allows representing one of 2^{16} = 65536 distinct values. The number of distinct values equals the number of distinct binary sequences of a given length. If we have *m* unique sequences then we can double the number by prepending a single bit to each of them. For example, from the 3 sequences 111, 0 and 10 we would get the 6 sequences: 0111, 00, 010 and 1111, 10, 110.

In general, an *n*-bit variable can store one of 2^{n} distinct values. By convention, an unsigned integer variable can store integers from 0 up to 2^{n} − 1. A signed integer variable can store any integer between −2^{n − 1} and 2^{n − 1} − 1.
If we take it to the (almost) extreme, what is the largest value one can store in a bit? We could consider an `unsigned bit`

variable with two possible values: 0 and 1. Similarly a `signed bit`

could store −1 and 0. But the value ranges described above are just a convention to make arithmetic operations more efficient. Nothing stops us from using a bit variable to represent any set of two values we want, for example −1728 and 65536. Or we could define a single bit floating point type that represents either ∞ or −∞. Of course we're not restricted to numbers, we can use a single bit to represent a choice between any two items, f.e., a T-Rex icon 🦖 and a string `potrzebie`

. What we cannot do is represent with a single bit a choice between more than two options.

## Number of Bits

Let's reverse the question: how many bits do we need to store one of *m* distinct values?
If *m* is an integer power of two, we need exactly log_{2}(*m*) bits: 1 bit to store one of 2 values, 3 bits to store one of 8 values, 4 bits to store one of 16 values and so on.

What if *m* is not an integer power of two? Let's say we want to store a single decimal digit. For *m* = 10 the formula suggests log_{2}(10) ≈ 3.32 bits. Is it justified to use the formula in this case? What does 0.32 bits mean? Perhaps there's some averaging in play. Let's say we represent the first 8 digits {0, 1, … , 7} using all distinct 3 bits sequences from 000 to 111, and the remaining two digits {8, 9} using two 4-bit sequences 1000 and 1001. On average we would need 3 bits in 8 out of 10 cases and 4 bits in 2 out of 10 cases, so $108 ⋅3+102 ⋅4=3.2$ bits. That's even better than log_{2}(10). The issue with this variable-length encoding is that we need a way to mark boundaries of each sequence, otherwise we won't be able to uniquely decode it.
For example, the bit sequence 100110001001 could be either split into 100, 110, 001, 001 and decoded as (4, 6, 1, 1) or it could be split into 1001, 1000, 1001 and decoded as (9, 8, 9).

In fact, if marking the boundaries wasn't a problem, we could come up with an even better encoding by taking the ten shortest binary sequences

*ε*, 0, 1, 00, 01, 10, 11, 000, 001, 010}

to get $101 ⋅0+102 ⋅1+104 ⋅2+103 ⋅3=1.9$ bits on average. By including *ε*, the empty sequence of length 0, we get infinitely many ways a given sequence of bits can be decoded into decimal digits.

Another interpretation of 3.32 bits is simply that 3 bits are not enough and we need to use 4 bits to store a single decimal digit.
Using the ceiling function ⌈*x*⌉, denoting the smallest integer that is greater than or equal to *x*, we can write that we need ⌈log_{2}(10)⌉ bits. This way to represent two decimal digits we need 8 bits. But there are exactly 100 distinct two-digit numbers (from 00 to 99) and to represent 100 numbers we need

_{2}(100)⌉ = ⌈6.64 … ⌉ = 7 bits

or 3.5 bits per digit. For three digits it's

In general, we can represent a sequence of *k* decimal digits using

For any real number *x* there's *x* ≤ ⌈*x*⌉ < *x* + 1, hence
⌈*k*log_{2}(10)⌉ is less than *k*log_{2}(10) + 1,
so we need at most $log_{2}(10)+k1 $ bits per digit. The larger the sequence length *k*, the closer the upper bound gets to the initial log_{2}(10) ≈ 3.32 bits.

In the same manner, if we want to encode one of *m* distinct values we need close to log_{2}(*m*) bits. Same as in the case of a 1-bit variable, the set of values doesn't have to consist of consecutive numbers or numbers at all. A better way to think about it is that we select one item from a set of *m* items and we want to store the information of what we selected.

## Decimal Digits, Revisited

As noted above the incorrect encoding of decimal digits was not only incorrect but also inefficient. Let's propose a new encoding: to fix the inefficiency we encode the first 8 digits {0, 1, … , 7} using all distinct 3 bits sequences from 000 to 111 as before, but the remaining two digits {8, 9} will be represented by a single bit, either 0 or 1, respectively. To make the encoding reversible, we precede each sequence with a single bit *l* denoting if the length is 3 (*l* = 0) or 1 (*l* = 1). We can test the encoding using the widget below.

# Average Code Length

Average code length: **3.6 bits**.

Uniquely decodable: **yes**.

On average we need

That's a saving from 4 bits for sure, but we're still far from log_{2}(10) (you can experiment with the demo above and try to get a better encoding).
Where are we loosing almost 0.28 bits? Everything seems to be fine: we use 3 bits to represent one of 8 digits, 1 bit to represent one of 2 digits, and 1 bit to represent one of 2 possible lengths.

The issue is that all the calculations from the beginning made one assumption: when selecting one item out of *m*, each item has the same chance of being selected. Why does it matter? If there are 10 items but only 3 of them can be selected, then two bits are clearly enough to represent which one was it. This is a degenerate example and one might say we had only 3 items in the first place. However, once we are allowed to assign unequal probabilities of sampling we can get arbitrarily close to the degenerate example by setting the probability of the 7 items to a sufficiently small positive number.

In case of our encoding, if we encode all 10 digits then 8 out of 10 times the value of length bit *l* is 0 and 2 out of 10 times it is 1. In other words, if we encode a single decimal digit sampled uniformly at random then the length bit *l* is 0 with probability $P(l=0)=54 $ and it is 1 with probability $P(l=1)=51 $, so it is definitely non-uniform. Clearly, whether it's uniform or not, as long as both probabilities are non-zero we cannot encode a single bit with less than one bit. However, in the same manner that we grouped decimal digits into pairs, if we are able to group non-uniform bits before encoding, then on average we can use less than one bit per bit. Let's consider a pair of independent bits *l*_{1} and *l*_{2} with the same distribution as *l*. With probability $P(l_{1}=0,l_{2}=0)=2516 $ both bits are 0 and it is the most probable configuration. We encode the pair as follows: if both bits are 0, then we output a single bit 0. Otherwise, we output 1 followed by the values of *l*_{1} and *l*_{2}. So it's either 0, 110, 101 or 111. On average we output

*l*

_{1}= 0,

*l*

_{2}= 0) ⋅ 1 + P (

*l*

_{1}= 1 or

*l*

_{2}= 1) ⋅ 3 = 1.72 bits

to encode a pair of bits or 0.86 bits per bit.

This example nicely shows how non-uniformity enables compression. The final improvement is to notice that we can output 10 instead of 101 and still get reversible encoding.
Now on average we use 1.56 bits to represent a pair of length bits (*l*_{1}, *l*_{2}). With this encoding we need 0.78 bits on average to encode a single length bit, thus we can encode a single decimal digit with 3.38 bits, much closer to the lower bound of 3.32.

Now let's try to answer how many bits are necessary in the general non-uniform case.

# Entropy

In 1948 paper titled A Mathematical Theory of Communication Claude Shannon introduced a measure of uncertainty called entropy. The paper considers what are reasonable properties that a measure of information produced by a random process should have:

Suppose we have a set of possible events whose probabilities of occurrence are

p_{1},p_{2}, … ,p_{n}. These probabilities are known but that is all we know concerning which event will occur. Can we find a measure of how much “choice” is involved in the selection of the event or of how uncertain we are of the outcome? If there is such a measure, say H(p_{1},p_{2}, … ,p_{n}), it is reasonable to require of it the following properties:

- H should be continuous in the
p_{i}.- If all the
p_{i}are equal, $p_{i}=n1 $, then H should be a monotonic increasing function ofn. With equally likely events there is more choice, or uncertainty, when there are more possible events.- If a choice be broken down into two successive choices, the original H should be the weighted sum of the individual values of H.

Shannon then proceeds by showing that these properties uniquely determines H up to a constant multiplicative factor. The first property allows to consistently extend H from rational to irrational numbers. Without the second property H could be negative or equal to 0 everywhere.

The last property says that we can decompose a choice into multiple choices. Using a slightly modified example from the paper, if we have $P(🦄)=61 $, $P(🦒)=31 $ and $P(🌽)=21 $, then the amount of information required in selecting one of the three objects is equal to the amount of information required to first select between plants and animals, and then optionally (half the time) selecting the animal out of the two:

or less formally:

H(🦄, 🦒, 🌽) = H({🦄, 🦒}, 🌽) + P (🦄, 🦒) ⋅ H(🦄, 🦒) + P (🌽) ⋅ H(🌽).

In case of our decimal digit encoding with the length bit *l*, we get that entropy of choosing a decimal digit can be decomposed into first deciding if we want a value less than 8 (*l* = 0) or not (*l* = 1), and then selecting one of the values:

To draw a playing card out of full deck of 52 cards we can first pick up a suit and then a rank. Informally

The last four terms are equal so we get

## Definition

Shannon proved that the only function H satisfying the three above assumptions is of the form:

where *K* is a positive constant. Any term with *p*_{i} = 0 is ignored, i.e., it is assumed that 0 ⋅ log (0) = 0. The choice of *K* (or equivalently the base of the logarithm) corresponds to the choice of a unit. With *K* = 1 and the base 2 logarithm the resulting units are called bits – binary digits.

For a discrete random variable *X* with a set of possible values {*x*_{1}, … , *x*_{n}} and probabilities P (*X* = *x*_{i}) = *p*_{i} the entropy is equal to

The first equality shows that the entropy does not depend on the values {*x*_{1}, … , *x*_{n}}, just their probabilities. If *p*_{X}(*x*) = P (*X* = *x*) is a probability mass function of *X*, then the formula simplifies (complicates?) to

Let's have a look at some examples.

## Binary Entropy

The entropy of a 0–1 random variable with P (*X* = 1) = *p* = 1 − P (*X* = 0) equals

H_{b}(*p*) is called the binary entropy function.

*p*

H_{b}(*p*)

$21 $

1

$21 $

1

The plot is symmetric around $p=21 $ where it attains its maximum, $H_{b}(21 )=1$.

For the length bit from the previous section we have

## Uniform Random Variable

In case of a uniform random variable *U*_{m} over *m* values we get

Out of all random variables over *m* values *U*_{m} has the maximum entropy.

## Random Permutation

Let *S*_{n} be a process that outputs a random permutation of set {1, … , *n*} with all permutations being equally probable, that is, each of *n*! = 1 ⋅ 2 ⋅ 3 ⋅ … ⋅ *n* permutations has the same probability $n!1 $. The entropy H(*S*_{n}) is log_{2}(*n*!) which by Stirling's approximation is *n*log_{2}(*n*) − *n*log_{2}(*e*) + *O*(log (*n*)).

## Name of Entropy

To end this post it is worth mentioning how the entropy got its name. In Shannon's own words:

My greatest concern was what to call it. I thought of calling it "information", but the word was overly used, so I decided to call it "uncertainty". When I discussed it with John von Neumann, he had a better idea. Von Neumann told me, "You should call it entropy, for two reasons. In the first place your uncertainty function has been used in statistical mechanics under that name, so it already has a name. In the second place, and more important, nobody knows what entropy really is, so in a debate you will always have the advantage."

Source: Energy and Information.