An Intuition on Shannon Entropy


This post skips over an in-depth explanation of Shannon entropy and only addresses the intuition surrounding it, and how that intuition can be misleading. Googling for “information theory” will turn up plenty of options for a more thorough treatment.

Entropy comes up in information theory as a way to measure “uncertainty”, or alternatively as a measure of “information”. Information theory itself was first formalized by Claude Shannon, whose name comes up often in the literature.

Conceptually, information and uncertainty are opposing forces: more of one and you have less of the other.

The balance between Information and Uncertainty. (Generated with Gemini.)

So, lack of information leads to uncertainty. In the absence of any information we are in total uncertainty.

In practice, any situation where we are concerned with information begins with a well-defined foundation that lays out a set of possible outcomes. Each piece of information we receive should narrow down that set until we reach a state where we know what we need to know.

If only one possible outcome remains, then we are certain of that outcome. A measure of uncertainty can be thought of as quantifying how far we have to go to reach this certainty.

Shannon entropy is a measure of uncertainty

A good example of this is the classic game “Guess Who”, where you have to guess a person from a set of pictures based on a few pieces of information.

The original cast of characters of the game “Guess Who”

We want to pinpoint a person of interest, which in this case translates to narrowing down the list of possibilities to 1. Each person has an equal probability of being the person of interest. At the beginning, that probability is \frac{1}{24} for every candidate, because there are 24 possibilities and we don’t know which one it is.

Let’s look at two possible pieces of information:

  1. wears glasses – matches 5 persons
  2. presents as a man – matches 18 persons

We started with 24 persons. Once we know that the person of interest is wearing glasses, we can narrow down the field to 5. On the other hand, knowing that the person presents as a man only narrows down the field to 18. The first piece of information is clearly more valuable than the second. By “valuable” we mean “reduces uncertainty”.

So, with “wears glasses”, the probability of pinpointing the person of interest jumps from \frac{1}{24} to \frac{1}{5}. Whereas “presents as a man” only moves it from \frac{1}{24} to \frac{1}{18}. Either way, our uncertainty shrinks; the first clue just shrinks it by much more.
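To make the narrowing concrete, here is a tiny Python sketch (my own illustration; the names and attribute flags are hypothetical, only the counts 24, 5 and 18 come from the clues above):

```python
# Hypothetical candidate pool: 24 people, 5 of whom wear glasses
# and 18 of whom present as men (matching the counts above).
candidates = [
    {"name": f"person_{i}", "glasses": i <= 5, "man": i <= 18}
    for i in range(1, 25)
]

after_glasses = [c for c in candidates if c["glasses"]]
after_man = [c for c in candidates if c["man"]]

print(len(candidates), len(after_glasses), len(after_man))  # 24 5 18

# Chance of pinpointing the person of interest after each clue:
print(1 / len(candidates), 1 / len(after_glasses), 1 / len(after_man))
# 0.0416... 0.2 0.0555...
```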

Entropy is a way to quantify, or measure, this difference.

Choice of measure for a single outcome

For the sake of discussion, let’s say the probability of a single outcome (call it A) is \Pr(A). And the measure of information learned from knowing this outcome is M(A), where M is some function which we will figure out here.

Adding independent pieces of information should correspond to adding entropy

Channeling some high school math, we know that if the probability of event A happening is p, and the probability of event B happening is q, and if these two events are independent, then the probability of both happening is going to be pq.

For example, the event person is in the first row (A) has a probability of p = \frac{1}{3}, while the event person is in the third column (B) has a probability of q = \frac{1}{8}. Meanwhile, both of these happening at the same time (A \wedge B; where you can read the \wedge as “and”) has a probability of pq = \frac{1}{3} \cdot \frac{1}{8} = \frac{1}{24}. I.e.

A \text{ and } B \text{ are independent events} \implies \Pr(A \wedge B) = \Pr(A) \cdot \Pr(B) = pq

A good measure should be intuitive. Which is to say, it should behave in ways that we expect. So for example, if we were to take two independent signals, combining them should yield a measure equal to the sum of each measured separately.

\begin{equation} A \text{ and } B \text{ are independent events } \implies M(A \wedge B) = M(A) + M(B) \end{equation}

The simplest way to convert a multiplication like pq to a sum is to … drumroll … take the logarithm. So we could start with M(p) being something like \log(p).
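As a quick sanity check, here is a minimal Python sketch (not from the original post) showing the logarithm turning the product of the two probabilities above into a sum:

```python
import math

# Probabilities from the example above: first row (1/3) and third column (1/8),
# assumed independent on a uniform 3-by-8 board.
p = 1 / 3
q = 1 / 8

# The logarithm turns the product of probabilities into a sum.
print(math.log2(p * q))             # approximately -4.585
print(math.log2(p) + math.log2(q))  # approximately -4.585
```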

More uncertainty should correspond to a larger entropy

To capture the further intuition that an outcome with a lower probability is more surprising, and hence strictly more informative, than an outcome with a higher probability, we need:

\begin{equation} \Pr(A) \lt \Pr(B) \implies M(A) \gt M(B) \end{equation}

No uncertainty should correspond to zero entropy

In other words,

\begin{equation} \Pr(A) = 1 \implies M(A) = 0 \end{equation}

because the outcome A is going to happen with certainty.

Putting it all together

The simplest way to make entropy larger for outcomes with lower probability is to negate the logarithm. Doing so has the additional desirable effect that probabilities, which are in the range [0,1], will yield non-negative values.

So finally, we have the measure:

M(A) = - \log\left( \Pr(A) \right)

Since Claude Shannon was interested in measuring the information content of digital signals at the time, it made sense to use a more familiar base for the logarithm, like 2 instead of e.

So the final equation is:

M \left( p \right) = - \log_2\left( p \right)

The choice of 2 as the base has another nice feature: we can now interpret the information quantity as a number of bits. For example, if we have 5 bits of information where each of the 32 values is equally likely, then any specific outcome has a probability of \frac{1}{2^5} of occurring.

So knowing the outcome gets us M\left(\frac{1}{2^5}\right) amount of information, which happens to be 5, the number of bits we started with.

This measure of information for a single event is called the pointwise entropy, or self-information. “Pointwise” because it is measuring a single outcome in the probability space.
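Here is a minimal Python sketch of this measure (my own illustration, not part of the original post), applied to the five-bit example and to the two “Guess Who” clues. Note that these are the information contents of the events “wears glasses” and “presents as a man”, not of pinpointing a single person:

```python
import math

def self_information(p: float) -> float:
    """Pointwise entropy (self-information) of an outcome with probability p, in bits."""
    return -math.log2(p)

print(self_information(1 / 2**5))  # 5.0 -> the five bits from the example above
print(self_information(5 / 24))    # ~2.26 bits ("wears glasses")
print(self_information(18 / 24))   # ~0.42 bits ("presents as a man")
```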

Choice of measure for a probability space

Now that we have a measure of the information content of a single outcome, how do we measure the information content of a probability space?

For this, we take another simple formula we already know: the expectation. Hence we define the information content of a probability space to be the expected pointwise entropy.

This entropy of a probability space is often denoted H(P), where P is the probability space (defined formally below). For simplicity, we can think of P as a function from an outcome to the probability of that outcome.

The expected pointwise entropy is therefore:

\begin{equation*} \begin{split} H(P) &= \sum_{\omega \in \Omega} P(\omega) \cdot M(\omega) \\ &= - \sum_{\omega \in \Omega} P(\omega) \cdot \log_2\left( P(\omega) \right) \end{split} \end{equation*}

Where \Omega is the set of all possible outcomes.
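As a companion to the formula, here is a small Python sketch (again my own illustration) of the expected pointwise entropy over a discrete distribution; the coin examples are mine, chosen only to show how the number behaves:

```python
import math

def shannon_entropy(probs) -> float:
    """Shannon entropy of a discrete distribution, in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(shannon_entropy([0.5, 0.5]))    # 1.0   -> a fair coin holds exactly one bit of uncertainty
print(shannon_entropy([0.99, 0.01]))  # ~0.08 -> a heavily biased coin holds far less
```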

Formalities

A common critique is that Shannon entropy can be misleading. To see why, we first need to lay out the formal setup.

Let’s start with a probability space. It consists of:

  1. A set of possible outcomes. Usually designated as \Omega.

    For example, the set of possible persons of interest consists of the 24 individuals pictured above.

    Note that here we are dealing with a discrete probability space in that we have a countable set of possible outcomes. This isn’t always the case.

  2. A set of subsets of possible outcomes. Usually designated as \mathcal{F}.

    In our example, it could be people with moustaches, people wearing hats, people with glasses, etc. The subsets don’t need to have any semantic meaning, so any group of people from the 24 possible people is a valid subset.

    Typically there would be one element in \mathcal{F} which contains all the possible outcomes. I.e. we say \Omega \in \mathcal{F}.

  3. A probability function. Usually designated as P, which when applied to an element of \mathcal{F} yields the probability of that set of outcomes occurring.

    For example:

    P(\text{wears glasses}) = \frac{5}{24} \qquad P(\text{presents as a man}) = \frac{18}{24}

    Also, when applied to \Omega (i.e. the set of all possible outcomes), it yields 1, as one would expect, since otherwise it would not be the set of all possible outcomes. I.e.:

    P(\Omega) = 1

Let (\Omega, \mathcal{F}, P) be a probability space.

So, \mathcal{F} is the set of all possible subsets of \Omega, which is also written as 2^{\Omega}.

\mathcal{F} = 2^{\Omega}

… and …

P is a function that maps from \mathcal{F} to the real numbers in the range 0 through 1, inclusive.

P : \mathcal{F} \to [0,1]
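To make the triple (\Omega, \mathcal{F}, P) concrete, here is a Python sketch under the assumption of a uniform distribution over the 24 candidates; the person labels and the two events are hypothetical stand-ins for the clues used earlier:

```python
from fractions import Fraction

# Omega: the 24 possible persons of interest (hypothetical labels).
omega = frozenset(f"person_{i}" for i in range(1, 25))

# F is conceptually the power set of omega. We don't materialize all 2**24
# subsets; we simply accept any subset of omega as a valid event.
def P(event: frozenset) -> Fraction:
    """Probability of an event under the uniform distribution over omega."""
    assert event <= omega, "an event must be a subset of omega"
    return Fraction(len(event), len(omega))

# Hypothetical events standing in for "wears glasses" (5 people)
# and "presents as a man" (18 people).
wears_glasses = frozenset(f"person_{i}" for i in range(1, 6))
presents_as_man = frozenset(f"person_{i}" for i in range(1, 19))

print(P(wears_glasses))    # 5/24
print(P(presents_as_man))  # 3/4 (i.e. 18/24)
print(P(omega))            # 1
```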

The Shannon entropy of the probability function P is given by:

H(P) = - \sum_{\omega \in \Omega} P(\omega) \cdot \log_2\left( P(\omega) \right)
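Plugging the uniform “Guess Who” distribution into this formula gives about 4.58 bits, which is \log_2(24), the number of bits needed to pin down one of 24 equally likely candidates (a quick check of my own):

```python
import math

probs = [1 / 24] * 24  # uniform distribution over the 24 candidates
H = -sum(p * math.log2(p) for p in probs)

print(H)              # ~4.585
print(math.log2(24))  # ~4.585
```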