## Ch1: What is Information?

Information is quantified using bits, not to be confused with binary digit. A binary digit contains at most one bit of information, but may contain less (if it’s not equally likely to be 0 and 1).

## Ch2: Entropy of Discrete Variables

Definition of **entropy** H(x) for discrete random variables. Entropy can be estimated from a sample, and is maximized with a uniform distribution. You can think of it as the average surprise of a random variable, where surprise is measured in bits.

The surprisal of an outcome is log 1/p(x) so the entropy, which is a weighted average, is defined as sum p(x) log 1/p(x).

## Ch3: Source Coding Theorem

Shannon’s **first fundamental theorem** (source coding theorem) describes how much information can be transmitted in a noiseless channel. If source has entropy H bits/symbol and channel has capacity C bits/second, then it is possible to transmit C/H – epsilon symbols per second, for any epsilon > 0. Moreover it’s not possible to transmit more than C/H symbols per second.

This only applies when you take a limit of arbitrarily long messages. Intuitively, if the source entropy is H bits/symbol, then you can construct a codebook of 2^H entries and each entry takes H binary digits to transmit, so your transmission rate is C/H symbols/second.

In English, looking at one character at a time, the entropy is about 4.1 bits/letter, but in the long run with infinite context, the entropy is about 1.8 bits/letter. Huffman coding gives an optimal way to encode a discrete random variable into binary codewords, by repeatedly merging the two least probable symbols into a composite symbol, which gives you a binary tree.

## Ch4: Noisy Channel Coding Theorem

Entropy of a joint discrete distribution is calculated as normally, and if X and Y are independent, then the joint distribution has entropy H(X, Y) = H(X) + H(Y). **Mutual information** is defined as I(X, Y) = H(X) + H(Y) – H(X,Y), and is 0 if and only if they’re independent. Can think of mutual information as average reduction in uncertainty in Y if you know the value for X.

**Conditional entropy** related to mutual information by I(X,Y) = H(X) – H(X|Y). To help remember, can visualize a Venn diagram where H(X) and H(Y) are the two circles, mutual information I(X,Y) is the overlapping area, joint entropy H(X,Y) is the union area, H(X|Y) and H(Y|X) are the difference areas — this recovers the difference formulas.

Shannon’s **second fundamental theorem** (noisy channel coding theorem) says that with channel capacity C bits/second and source entropy H: if H < C then the error at the receiver can be made arbitrarily small. If H > C then the error at the receiver is at least H-C but can be made less than H-C + \varepsilon. Practically, this means we can make a noisy channel into a noiseless one by reducing the entropy of our source (eg: by reducing the number of symbols and increasing redundancy).

**Q**: How accurately can mutual information and conditional entropy be inferred from a limited sample of data? What’s Shannon’s formal proof of these theorems?

## Ch5: Entropy of Continuous Variables

For entropy, continuous variables isn’t a simple extension of the discrete case because using the naive method, the entropy is larger as the histogram is more fine-grained, so the entropy is potentially infinite. Instead, define **differential entropy** as taking out the constant infinity term, so Hd(X) = E[1 / log p(x)].

Intuitively, this makes sense because a sample from continuous distribution is a real number that contains an infinite amount of information, but of course, in practice the measurement accuracy is finite. Differential entropy can be estimated by taking a histogram, but the MLE estimator is biased to be lower than the true value.

Unlike discrete variables, multiplying by a constant c increases the entropy by \log |c|. Entropy can be sometimes negative, eg, for a uniform distribution with range less than 1. Adding a constant does not change the entropy.

In discrete case, uniform distribution always maximizes entropy. In the continuous case, the distribution that maximizes entropy can be:

- If upper and lower bounds are fixed, then uniform distribution.
- If mean is fixed and all values positive, then exponential distribution.
- If variance is fixed, then Gaussian distribution. This occurs often when you want to maximize information transmitted with limited power.

## Ch6: Channel Capacity

A **Gaussian channel** is where each input X gets noise eta that has a Gaussian distribution to produce Y. The channel capacity is 1/2 log(1+P/N), where P (power) is the variance in the input, and N (noise) is the variance in the noise. Best to make the input a Gaussian distribution, so that the output is also Gaussian and has maximum entropy.

Having channel capacity C roughly means you can use it to transmit the same amount of information as 2^C discrete symbols. This is possible with long messages: as the length goes to infinity, you can transmit source of entropy R < C with probability of error that tends to zero. The book doesn’t say how this can actually be accomplished, though.

In **fixed range channel**, range is always between a range, so the output should be uniform distribution to maximize entropy. To do this, transform the input (assumed to be some continuous distribution) to be a uniform by encoding using the CDF.

## Ch7: Thermodynamic Entropy

Definition of thermodynamic entropy is quite similar to information theory entropy: S = k log W where W is the number of microstates that correspond to a macrostate and k is the Boltzmann constant.

The Landauer limit states a lower bound for the amount of energy required to process 1 bit of information (the argument is not that convincing though). It explains why Maxwell’s demon is impossible: the energy required to gather the information cancels out the energy accumulated by the demon.

Last chapter talks about examples of information theory in biology, and evidence (some of which is the author’s own research) that biological systems try to maximize communicative efficiency.

## Further notes

Overall a good intuitive book on information theory, covers enough to understand the major concepts and gives lots of examples, but skips some of the more technical details. Some questions that remain unanswered:

- Statistical inference on entropy, conditional entropy, mutual information from data? If MLE is biased, what should we use?
- Formal proof of second fundamental theorem, which is apparently clever.
- Mutual information is supposedly invariant to any transformation, but how accurately can this be measured in practice?
- How to transmit discrete message in Gaussian channel with efficiency approaching Shannon limit? Does discrete coding theory apply?
- Further justification of Landauer limit.