
17.1 BASIC INEQUALITIES OF INFORMATION THEORY
Definition The mutual information between two random variables X
and Y is defined by
\[
I(X;Y) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log \frac{p(x, y)}{p(x)p(y)} = D(p(x, y) \,\|\, p(x)p(y)). \tag{17.9}
\]
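To make the definition concrete, here is a minimal numerical sketch (not from the text; the function name mutual_information and the example pmf are illustrative choices) that evaluates the double sum in (17.9) for a small joint pmf:

```python
import numpy as np

def mutual_information(pxy):
    """I(X;Y) in bits for a joint pmf given as a 2-D array."""
    px = pxy.sum(axis=1, keepdims=True)   # marginal p(x), column vector
    py = pxy.sum(axis=0, keepdims=True)   # marginal p(y), row vector
    pxpy = px * py                        # product distribution p(x)p(y)
    mask = pxy > 0                        # 0 log 0 = 0 by convention
    return float((pxy[mask] * np.log2(pxy[mask] / pxpy[mask])).sum())

pxy = np.array([[1/4, 1/4],
                [0.0, 1/2]])
print(mutual_information(pxy))            # ~ 0.3113 bits for this joint pmf
```

When X and Y are independent, p(x, y) = p(x)p(y), every logarithm in the sum is zero, and the function returns 0, consistent with the corollary below.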
The following basic information inequality can be used to prove many
of the other inequalities in this chapter.
Theorem 17.1.7 (Theorem 2.6.3: Information inequality) For any
two probability mass functions p and q,
D(p||q) ≥ 0 (17.10)
with equality iff p(x) = q(x) for all x ∈ X.
Corollary For any two random variables X and Y,
I(X;Y) = D(p(x, y)||p(x)p(y)) ≥ 0 (17.11)
with equality iff p(x, y) = p(x)p(y) (i.e., X and Y are independent).
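The inequality and its equality condition are easy to check numerically. The following sketch (our own; it samples random pmfs on a five-letter alphabet, an arbitrary choice) verifies D(p||q) ≥ 0 and that D(p||p) = 0:

```python
import numpy as np

def kl(p, q):
    """Relative entropy D(p||q) in bits; assumes q > 0 wherever p > 0."""
    mask = p > 0
    return float((p[mask] * np.log2(p[mask] / q[mask])).sum())

rng = np.random.default_rng(0)
for _ in range(1000):
    p = rng.random(5); p /= p.sum()       # random pmf p
    q = rng.random(5); q /= q.sum()       # random pmf q
    assert kl(p, q) >= 0.0                # D(p||q) >= 0   (17.10)
    assert abs(kl(p, p)) < 1e-12          # equality when p = q
```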
Theorem 17.1.8 (Theorem 2.7.2: Convexity of relative entropy)
D(p||q) is convex in the pair (p, q).
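Convexity in the pair means that for 0 ≤ λ ≤ 1, D(λp1 + (1 − λ)p2 || λq1 + (1 − λ)q2) ≤ λD(p1||q1) + (1 − λ)D(p2||q2). The sketch below (an illustration on randomly sampled pmfs, not a proof) checks this mixture inequality:

```python
import numpy as np

def kl(p, q):
    mask = p > 0
    return float((p[mask] * np.log2(p[mask] / q[mask])).sum())

rng = np.random.default_rng(1)

def random_pmf(k=4):
    v = rng.random(k)
    return v / v.sum()

for _ in range(1000):
    p1, p2, q1, q2 = (random_pmf() for _ in range(4))
    lam = rng.random()
    lhs = kl(lam * p1 + (1 - lam) * p2, lam * q1 + (1 - lam) * q2)
    rhs = lam * kl(p1, q1) + (1 - lam) * kl(p2, q2)
    assert lhs <= rhs + 1e-12             # joint convexity of D(.||.)
```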
Theorem 17.1.9 (Theorem 2.4.1)
I(X;Y) = H(X) − H(X|Y). (17.12)
I(X;Y) = H(Y) − H(Y|X). (17.13)
I(X;Y) = H(X) + H(Y) − H(X,Y). (17.14)
I(X;X) = H(X). (17.15)
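These identities follow from the chain rule H(X,Y) = H(Y) + H(X|Y); the sketch below (our own check on a random joint pmf) verifies all four numerically:

```python
import numpy as np

def H(p):
    """Entropy in bits of a pmf (array of any shape)."""
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

rng = np.random.default_rng(2)
pxy = rng.random((3, 4)); pxy /= pxy.sum()      # random joint pmf p(x, y)
px, py = pxy.sum(axis=1), pxy.sum(axis=0)       # marginals

I = H(px) + H(py) - H(pxy)                      # (17.14)
assert np.isclose(I, H(px) - (H(pxy) - H(py)))  # (17.12): H(X|Y) = H(X,Y) - H(Y)
assert np.isclose(I, H(py) - (H(pxy) - H(px)))  # (17.13): H(Y|X) = H(X,Y) - H(X)

pxx = np.diag(px)                               # joint pmf of the pair (X, X)
assert np.isclose(H(px) + H(px) - H(pxx), H(px))  # (17.15): I(X;X) = H(X)
```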
Theorem 17.1.10 (Section 4.4) For a Markov chain:
1. Relative entropy $D(\mu_n \,\|\, \mu_n')$ between two distributions on the states at time $n$ decreases with time.
2. Relative entropy $D(\mu_n \,\|\, \mu)$ between a distribution and the stationary distribution decreases with time.
3. Entropy $H(X_n)$ increases if the stationary distribution is uniform.
4. The conditional entropy $H(X_n \mid X_1)$ increases with time for a stationary Markov chain.
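Part 2 can be watched directly in a small simulation. In the sketch below (the two-state transition matrix P is an arbitrary example of ours), the relative entropy D(µ_n || µ) to the stationary distribution is nonincreasing as the chain evolves:

```python
import numpy as np

def kl(p, q):
    mask = p > 0
    return float((p[mask] * np.log2(p[mask] / q[mask])).sum())

# Row-stochastic transition matrix (an arbitrary two-state example).
P = np.array([[0.9, 0.1],
              [0.4, 0.6]])

# Stationary distribution: left eigenvector of P for eigenvalue 1.
evals, evecs = np.linalg.eig(P.T)
mu = np.real(evecs[:, np.argmin(np.abs(evals - 1))])
mu /= mu.sum()                               # here mu = (0.8, 0.2)

mu_n = np.array([1.0, 0.0])                  # arbitrary initial distribution
prev = np.inf
for n in range(30):
    d = kl(mu_n, mu)                         # D(mu_n || mu)
    assert d <= prev + 1e-12                 # nonincreasing in n (part 2)
    prev = d
    mu_n = mu_n @ P                          # mu_{n+1} = mu_n P
```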