(17.84) goes to 0 at both limits and the theorem is proved. In the proof, we
have exchanged integration and differentiation in (17.74), (17.76), (17.78),
and (17.82). Strict justification of these exchanges requires the application
of the bounded convergence and mean value theorems; the details may
be found in Barron [30].
This theorem can be used to prove the entropy power inequality, which
gives a lower bound on the entropy of a sum of independent random
variables.
Theorem 17.7.3 (Entropy power inequality) If X and Y are independent random n-vectors with densities, then
\[
2^{\frac{2}{n} h(X+Y)} \;\ge\; 2^{\frac{2}{n} h(X)} + 2^{\frac{2}{n} h(Y)}. \tag{17.86}
\]
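As a concrete illustration (not from the text; the Uniform(0, 1) example and the grid spacing below are assumptions made purely for this sketch), the following minimal Python script checks (17.86) numerically for n = 1 by discretizing the densities and estimating differential entropies in bits:

```python
import numpy as np

def diff_entropy_bits(p, dx):
    """Differential entropy in bits of a density sampled on a uniform grid."""
    p = p / (p.sum() * dx)           # renormalize to absorb discretization error
    mask = p > 0
    return -np.sum(p[mask] * np.log2(p[mask])) * dx

# Illustrative n = 1 example: X, Y independent Uniform(0, 1), so h(X) = h(Y) = 0 bits.
dx = 1e-3
x = np.arange(-1.0, 3.0, dx)
fX = ((x >= 0.0) & (x <= 1.0)).astype(float)   # density of X on the grid
fY = fX.copy()                                  # density of Y

hX = diff_entropy_bits(fX, dx)
hY = diff_entropy_bits(fY, dx)

# Density of X + Y via numerical convolution (triangular on [0, 2]).
fXY = np.convolve(fX, fY, mode="full") * dx
hXY = diff_entropy_bits(fXY, dx)

lhs = 2 ** (2 * hXY)                  # entropy power of the sum
rhs = 2 ** (2 * hX) + 2 ** (2 * hY)   # sum of the individual entropy powers
print(f"2^(2 h(X+Y)) = {lhs:.3f} >= {rhs:.3f} = 2^(2 h(X)) + 2^(2 h(Y))")
# Expected output: roughly 2.718 >= 2.00, consistent with (17.86).
```

In this example h(X) = h(Y) = 0 bits and X + Y has the triangular density on [0, 2], so the entropy power of the sum is about e ≈ 2.718, comfortably above 2.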
We outline the basic steps in the proof due to Stam [505] and Blachman
[61]. A different proof is given in Section 17.8.
Stam’s proof of the entropy power inequality is based on a perturbation
argument. Let n = 1. Let X_t = X + √f(t) Z_1 and Y_t = Y + √g(t) Z_2, where Z_1 and Z_2 are independent N(0, 1) random variables. Then the entropy power inequality for n = 1 reduces to showing that s(0) ≤ 1, where we define
\[
s(t) = \frac{2^{2h(X_t)} + 2^{2h(Y_t)}}{2^{2h(X_t + Y_t)}}. \tag{17.87}
\]
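As an illustrative special case (not part of the proof outline above), if X ∼ N(0, σ_X²) and Y ∼ N(0, σ_Y²) and we take f(t) = g(t) = t, every variable in (17.87) is Gaussian and 2^{2h(N(0, σ²))} = 2πeσ², so
\[
s(t) = \frac{2\pi e(\sigma_X^2 + t) + 2\pi e(\sigma_Y^2 + t)}{2\pi e(\sigma_X^2 + \sigma_Y^2 + 2t)} = 1 \quad \text{for all } t \ge 0,
\]
reflecting the fact that the entropy power inequality holds with equality for independent Gaussians.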
If f(t) → ∞ and g(t) → ∞ as t → ∞, it is easy to show that s(∞) = 1. If, in addition, s′(t) ≥ 0 for t ≥ 0, then s is nondecreasing and hence s(0) ≤ s(∞) = 1. The proof of the fact that s′(t) ≥ 0 involves a clever choice of the functions f(t) and g(t), an application of Theorem 17.7.2, and the use of a convolution inequality for Fisher information,
\[
\frac{1}{J(X+Y)} \;\ge\; \frac{1}{J(X)} + \frac{1}{J(Y)}. \tag{17.88}
\]
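As a quick sanity check (an illustration, not from the text), the convolution inequality (17.88) also holds with equality for independent Gaussians: for X ∼ N(0, σ_X²) one has J(X) = 1/σ_X², and X + Y ∼ N(0, σ_X² + σ_Y²), so
\[
\frac{1}{J(X+Y)} = \sigma_X^2 + \sigma_Y^2 = \frac{1}{J(X)} + \frac{1}{J(Y)}.
\]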
The entropy power inequality can be extended to the vector case by
induction. The details may be found in the papers by Stam [505] and
Blachman [61].
17.8 ENTROPY POWER INEQUALITY AND
BRUNN–MINKOWSKI INEQUALITY
The entropy power inequality provides a lower bound on the differential
entropy of a sum of two independent random vectors in terms of their
individual differential entropies. In this section we restate and outline an