Over the past two decades, very fruitful results have been obtained in information theory in the study of the Shannon entropy. This study has led to the discovery of a new class of constraints on the Shannon entropy called non-Shannon-type inequalities. Intimate connections between the Shannon entropy and different branches of mathematics, including group theory, combinatorics, Kolmogorov complexity, probability, and matrix theory, have been established. All these discoveries were based on a formalism introduced for constraints on the Shannon entropy, which suggested the possible existence of constraints that were not previously known. We assert that the same formalism can be applied to inequalities beyond information theory. To illustrate the ideas, we revisit through the lens of this formalism three fundamental inequalities in mathematics: the AM-GM inequality in algebra, Markov’s inequality in probability theory, and the Cauchy-Schwarz inequality for inner product spaces. Applications of this formalism have the potential of leading to the discovery of new inequalities and constraints in different branches of mathematics.
Corresponding author: Raymond W. Yeung, whyeung@ie.cuhk.edu.hk
An inequality defined on the ordered field of real numbers is one of the most fundamental concepts in mathematics. In particular, a universally quantified inequality is one that holds for all members of the domain of discourse satisfying certain conditions. Examples are
The latter may simply be written as “If x≥5, then x≥4”. Essentially, any inequality in mathematics that bears a name is a universally quantified inequality, for example, the AM-GM inequality, the Cauchy-Schwarz inequality, Jensen’s inequality, and Minkowski’s inequality. The list goes on and on.
The main contribution of this paper is to introduce a framework that gives a geometrical interpretation of such inequalities. This framework has its origin in information theory [1], specifically in the study of inequalities on the Shannon entropy (or simply entropy when there is no ambiguity) in the late 1990s [2]. This study has led to the discovery of so-called non-Shannon-type entropy inequalities, namely inequalities on the Shannon entropy beyond what were known before then (collectively called Shannon-type entropy inequalities). Subsequently, intimate relations between the Shannon entropy and different branches of mathematics, including group theory, combinatorics, Kolmogorov complexity, probability, and matrix theory, have been established. Inspired by this development, there has also been a wave of pursuit of new inequalities on the von Neumann entropy, a generalization of the Shannon entropy to the quantum case.
We assert that this geometrical framework, which has led to very fruitful results in the study of entropy inequalities, can also be applied to general universally quantified inequalities. In this paper, we will develop the concepts starting with a very simple example, and then apply the concepts to increasingly elaborate examples. The results are presented in a logical order rather than the chronological order in which they were obtained.
The rest of the paper is organized as follows. In Section 2, we discuss the well-known inequality of arithmetic and geometric means (AM-GM inequality) in algebra, and show that the AM-GM inequality completely characterizes the relation between the AM and GM of a finite collection of nonnegative numbers. In Section 3, we discuss Markov’s inequality in probability theory. We show that for a nonnegative random variable T and a nonnegative value c, Markov’s inequality essentially completely characterizes the relation between the two quantities E[T] (the expectation of T) and Pr{T≥c}. In Section 4, we discuss the Cauchy-Schwarz inequality for real inner product spaces. We show that for two vectors u,v∈V, where V is a real inner product space, the Cauchy-Schwarz inequality completely characterizes the relation among the three quantities ⟨u,u⟩, ⟨v,v⟩, and ⟨u,v⟩ if and only if dim(V) (the dimension of V) is at least 2. Nevertheless, there exists no inequality on the quantities ⟨u,u⟩, ⟨v,v⟩, and ⟨u,v⟩ that holds for all inner product spaces V (regardless of the value of dim(V)) which is not implied by the Cauchy-Schwarz inequality. The results in Sections 2 to 4 are new to our knowledge. In Section 5, we give an exposition of the study of entropy inequalities in information theory since the late 1990s. We also briefly discuss the relations between the Shannon entropy and network coding, conditional independence of random variables, finite groups, positive semi-definite matrices, Kolmogorov complexity, and quantum mechanics. The paper is concluded in Section 6.
The inequality of arithmetic and geometric means, or the AM-GM inequality in brief, is elementary in algebra. In this section, we use this very simple inequality as an example to illustrate the concepts we will develop in this work.
The arithmetic mean and geometric mean of a finite list of nonnegative numbers x1,x2,…,xn are
AM = (1/n)(x1 + … + xn)   and   GM = (x1 ⋅ … ⋅ xn)^(1/n),

respectively. The AM-GM inequality says that

AM ≥ GM.

For a collection of proofs of the AM-GM inequality, we refer the reader to [3]. Throughout this section, we will use AM and GM to denote the arithmetic mean and the geometric mean of some finite list of nonnegative numbers, respectively.
Traditionally, the AM-GM inequality is interpreted as either a lower bound on the AM or an upper bound on the GM. Here, we take a somewhat different view of the AM-GM inequality, which will be elaborated in the rest of the section.
For a finite list of nonnegative numbers, the AM and GM are quantities of interest, and we are interested in the relation between these two quantities. The AM-GM inequality is a characterization of this relation. Since xi≥0 for all i, we immediately have AM≥0 and GM≥0. These inequalities come directly from the setup of the problem. In fact, from GM≥0 and AM≥GM, we can obtain AM≥0. Therefore, AM≥0 is redundant.
We now introduce a geometrical framework for understanding the relation between the quantities AM and GM. Let a and g be the coordinates of R2, the 2-dimensional Euclidean space, where a and g correspond to AM and GM, respectively. Define the region
Υ = {(a,g) ∈ R2 : g ≥ 0 and a ≥ g},

where g ≥ 0 and a ≥ g correspond to GM ≥ 0 and AM ≥ GM, respectively. See Figure 1 for an illustration of Υ.
We now ask a very basic question: Are there constraints on AM and GM other than GM≥0 and AM≥GM? As we will see, the answer to this question hinges on the next proposition.
Proposition 1. For any (a,g)∈Υ, there exist x,y≥0 such that
a = (x + y)/2   and   g = √(xy),

i.e., a and g are the AM and GM of the list of nonnegative numbers x, y, respectively.
Proof. Consider a fixed ordered pair (a,g)∈Υ. Since (a,g)∈Υ, we have g≥0 and a≥g. Let
a = (x + y)/2   and   g = √(xy),    (2)

where x, y ∈ R are unknowns. Here, x and y can be obtained by solving the above simultaneous equations, which is elementary. Note that if g < 0, then g = √(xy) above cannot be satisfied, but this is not the case since (a,g) ∈ Υ. The solution to (2) is
x = a ± √(a² − g²)   and   y = a ∓ √(a² − g²),

where a² − g² ≥ 0 since a ≥ g ≥ 0. Therefore, x and y are always real. Finally, it follows from √(a² − g²) ≤ a that x and y are always nonnegative. The proposition is proved. ◻
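The construction in the proof is explicit enough to check numerically. The following sketch (the function name `achieve_am_gm` is ours, not from the paper) solves the simultaneous equations for a sample point of Υ and verifies that the resulting x, y are nonnegative with the prescribed AM and GM:

```python
import math

def achieve_am_gm(a, g):
    """Given (a, g) in Υ (i.e. g >= 0 and a >= g), return nonnegative
    x, y with AM (x+y)/2 = a and GM sqrt(x*y) = g, via the explicit
    solution x, y = a ± sqrt(a² − g²) from the proof of Proposition 1."""
    assert g >= 0 and a >= g, "(a, g) must lie in Υ"
    d = math.sqrt(a * a - g * g)  # real since a >= g >= 0
    return a + d, a - d

x, y = achieve_am_gm(5.0, 3.0)  # a point in the interior of Υ
assert x >= 0 and y >= 0
assert abs((x + y) / 2 - 5.0) < 1e-9        # AM equals a
assert abs(math.sqrt(x * y) - 3.0) < 1e-9   # GM equals g
```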
An ordered pair (a,g) is an achievable pair, or simply achievable,1 if a and g are respectively the AM and GM of some finite list of nonnegative numbers. The next theorem is a consequence of Proposition 1.
Theorem 1. An ordered pair (a,g)∈R2 is achievable if and only if (a,g)∈Υ.
Proof. Proposition 1 implies that every ordered pair (a,g)∈Υ is achievable. On the other hand, if an ordered pair (a,g) is achievable, then (a,g)=(AM,GM) where AM and GM are respectively the arithmetic mean and geometric mean of some finite list of nonnegative numbers. Therefore, g=GM≥0 and a=AM≥GM=g, i.e., g≥0 and a≥g, implying that (a,g)∈Υ. Hence, an ordered pair (a,g) is achievable if and only if (a,g)∈Υ. The theorem is proved. ◻
Theorem 1 says that Υ is precisely the set of all achievable pairs, thus completely characterizing the relation between the arithmetic and geometric means of a finite list of nonnegative numbers. Since Υ is defined by g≥0 (corresponding to GM≥0) and a≥g (corresponding to the AM-GM inequality), where the former comes directly from the setup of the problem and can be regarded as given, we say that the AM-GM inequality completely characterizes the relation between the AM and GM of a finite list of nonnegative numbers.
An inequality in the quantities AM and GM has the general form
f(AM, GM) ≥ 0,    (3)

where f : R2 → R.2 For example, if f(a,g) = a − g, then (3) becomes the AM-GM inequality. Let
Rf = {(a,g) ∈ R2 : f(a,g) ≥ 0}

be the region in R2 induced by f ≥ 0. If (3) is satisfied for all finite lists of nonnegative numbers, we say that the inequality is valid.
Theorem 2. The inequality (3) is valid if and only if
Υ ⊂ Rf.    (4)

Proof. We first prove the “if” part. Assume that (4) holds. Consider any finite list of nonnegative numbers and let AM and GM be its arithmetic mean and geometric mean, respectively. Then (AM, GM) is achievable, and so by Theorem 1, (AM, GM) ∈ Υ. Then by (4), (AM, GM) ∈ Rf, implying that f(AM, GM) ≥ 0. This shows that the inequality (3) is valid.
Next, we prove the “only if” part by contradiction. Assume that the inequality (3) is valid, i.e., f(AM,GM)≥0 is satisfied by all finite lists of nonnegative numbers, but
Υ ⊄ {(a,g) ∈ R2 : f(a,g) ≥ 0}.

Then there exists an ordered pair (a0, g0) ∈ Υ such that f(a0, g0) < 0. Since (a0, g0) ∈ Υ, by Theorem 1, (a0, g0) is achievable, which means that (a0, g0) = (AM*, GM*), where AM* and GM* are respectively the arithmetic mean and geometric mean of some finite list of nonnegative numbers. In other words, we have f(AM*, GM*) = f(a0, g0) < 0, a contradiction to the assumption that f(AM, GM) ≥ 0 is satisfied by all finite lists of nonnegative numbers. The theorem is proved. ◻
Theorem 2 gives a complete characterization of all valid inequalities in AM and GM. Specifically, f(AM, GM) ≥ 0 is a valid inequality in AM and GM if and only if Υ ⊂ Rf, i.e., the region induced by f ≥ 0 is an outer bound on Υ. This is illustrated in Figure 1.
In Theorem 2, the set inclusion in (4) is equivalent to
g ≥ 0 and a ≥ g   ⇒   f(a,g) ≥ 0.

Upon replacing the dummy variables a and g by AM and GM, respectively, the above becomes
GM ≥ 0 and AM ≥ GM   ⇒   f(AM, GM) ≥ 0,    (5)

meaning that any valid inequality in AM and GM is implied by the inequalities GM ≥ 0 and AM ≥ GM. Hence, we conclude that there exists no inequality in AM and GM other than these two inequalities. As discussed, since the inequality GM ≥ 0 comes directly from the setup of the problem and can be regarded as given, in view of (5), we say that the AM-GM inequality is sharp.
We end this section with a remark on the tightness of a valid inequality. Let f(AM, GM) ≥ 0 be a valid inequality. By Theorem 2, the set inclusion in (4) holds. If f(AM, GM) ≥ 0 is tight, then f(AM*, GM*) = 0 for some achievable pair (AM*, GM*), which by Theorem 1 is in Υ. If the boundary of the region Rf is equal to

{(a,g) ∈ R2 : f(a,g) = 0},

then (AM*, GM*) is in Υ as well as on the boundary of Rf. This implies that the boundary of Rf touches the region Υ at the point where f ≥ 0 is tight, providing a geometrical interpretation of the tightness of a valid inequality.
As an example, the inequality 2AM ≥ GM is equivalent to f(AM, GM) ≥ 0 with f(a,g) = 2a − g. It is a valid inequality because Υ ⊂ Rf. Moreover, 2AM ≥ GM is tight for the list consisting of the single nonnegative number 0, with (AM*, GM*) = (0, 0). Accordingly, the boundary of Rf, namely the set
{(a,g) ∈ R2 : 2a = g},

touches the region Υ at the origin. This is illustrated in Figure 2.
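Theorem 2 turns the validity of an inequality f(AM, GM) ≥ 0 into the set inclusion Υ ⊂ Rf, which can be probed numerically. Below is a minimal sketch for the example f(a,g) = 2a − g; the grid sampling scheme and function names are ours, purely for illustration:

```python
def in_upsilon(a, g):
    # membership in Υ = {(a, g) : g >= 0 and a >= g}
    return g >= 0 and a >= g

def f(a, g):
    # induces Rf = {(a, g) : f(a, g) >= 0}, i.e. the inequality 2*AM >= GM
    return 2 * a - g

# Probe the inclusion Υ ⊂ Rf on a grid: every sampled point of Υ
# must satisfy f >= 0 if the inequality is valid (Theorem 2).
for i in range(101):
    for j in range(101):
        a, g = i / 10, j / 10
        if in_upsilon(a, g):
            assert f(a, g) >= 0

# Tightness: the boundary {2a = g} of Rf touches Υ only at the origin.
assert f(0.0, 0.0) == 0.0
```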
In probability theory, Markov’s inequality asserts that for a nonnegative random variable T and any fixed c>0,
Pr{T ≥ c} ≤ E[T]/c,    (6)

where E[T] denotes the expectation of T. Here, Pr{T ≥ c} and E[T] are two quantities of interest for any nonnegative random variable T, and we are interested in the relation between them. Markov’s inequality gives a characterization of this relation.
Let c in (6) be fixed, and let FT denote the probability distribution of T, i.e., FT(t)=Pr{T≤t}. The following is a proof of (6):
E[T] = ∫_{t≥0} t dFT(t)
= ∫_{0≤t<c} t dFT(t) + ∫_{t≥c} t dFT(t)
≥ ∫_{t≥c} t dFT(t)    (i)
≥ c ∫_{t≥c} dFT(t)    (ii)
= c Pr{T ≥ c}.

Then (6) is obtained by dividing both sides by c. In the above, the inequality (i) is tight if and only if
∫_{0≤t<c} t dFT(t) = 0,    (7)

or

Pr{0 < T < c} = 0.

Note that although Pr{T = 0} can be strictly positive, it would not in any case make any contribution to (7). On the other hand, the inequality (ii) is tight if and only if
Pr{T > c} = 0,

or equivalently,

Pr{T ≥ c} = Pr{T = c}.

If (6) is tight, i.e., (i) and (ii) are tight simultaneously, then FT can only have two point masses, one at 0 and the other at c, with
Pr{T = 0} + Pr{T = c} = 1.    (8)

As a sanity check, for this distribution, from (8), we have

E[T] = c ⋅ Pr{T = c} = c ⋅ Pr{T ≥ c},

or

Pr{T ≥ c} = E[T]/c.

Thus (6) indeed holds with equality.
Note that the above discussion holds regardless of the value of Pr{T ≥ c} (which can be any number between 0 and 1). Thus Markov’s inequality can hold with equality for all values of Pr{T ≥ c}.
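The two-point distribution derived above can be checked directly. A small sketch (the values of c and the sampled values of p are arbitrary choices for illustration):

```python
# Two-point distribution Pr{T = 0} = 1 - p, Pr{T = c} = p: Markov's
# inequality holds with equality, since E[T] = c*p = c*Pr{T >= c}.
c = 4.0
for p in [0.0, 0.25, 0.5, 1.0]:
    expectation = (1 - p) * 0.0 + p * c   # E[T]
    tail = p                              # Pr{T >= c}
    assert abs(tail - expectation / c) < 1e-12
```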
We can obtain further insight on Markov’s inequality by means of a geometrical framework similar to the one discussed in Section 2 for the AM-GM inequality. Continue to assume that c>0 is fixed. Let p and m be real numbers such that p=Pr{T≥c} and m=E[T] for some nonnegative random variable T. Then from (6), we have m≥cp. Since p is a probability, we also have 0≤p≤1. Now regard p and m as the two coordinates in R2, and define the region
Ψc = {(p,m) ∈ R2 : 0 ≤ p ≤ 1 and m ≥ cp}.

See Figure 3.
For real numbers p and m, if there exists a nonnegative random variable T such that p=Pr{T≥c} and m=E[T], we say that the ordered pair (p,m)∈R2 is achievable (by the random variable T). Note that an ordered pair in R2 may be achievable by more than one nonnegative random variable, i.e.,
(p,m) = (Pr{T ≥ c}, E[T]) = (Pr{T′ ≥ c}, E[T′]),

where T and T′ have different probability distributions.
We will show that almost all ordered pairs in Ψc are achievable. We first define the achievable region
Ψ∗c = {(p,m) ∈ R2 : (p,m) is achievable}.

Since for any nonnegative random variable T, 0 ≤ Pr{T ≥ c} ≤ 1 and Markov’s inequality is satisfied, if (p,m) is achievable, then (p,m) ∈ Ψc. Consequently, Ψ∗c ⊂ Ψc.
Theorem 3. Ψ∗c=Ψc∖{(0,m):m≥c}.
Proof. First, we prove that Ψ∗c⊂Ψc∖{(0,m):m≥c}. Since Ψ∗c⊂Ψc, we only need to show that (0,m) is not achievable for all m≥c. When p=0, we have Pr{T≥c}=p=0, or Pr{0≤T<c}=1. This implies that m=E[T]<c. In other words, every ordered pair (0,m) where m≥c is not achievable.
Now, we prove that Ψc∖{(0,m):m≥c}⊂Ψ∗c. For any 0<p≤1, consider any (p,m)∈Ψc and the following probability mass function for a random variable T:
Pr{T = 0} = 1 − p   and   Pr{T = m/p} = p.

Since m ≥ cp, we have m/p ≥ c, and so

Pr{T ≥ c} = Pr{T = m/p} = p,

and E[T] = p(m/p) = m. Therefore, (p,m) is achieved by the random variable T as constructed. Note that unless m = cp, (p,m) can always be achieved by more than one probability distribution.
It remains to prove that every ordered pair (0,m) with 0≤m<c is achievable. This can be done by noting that such an ordered pair can be achieved by any random variable T with Pr{T=m}=1. The theorem is proved. ◻
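The achieving distribution constructed in the proof of Theorem 3 can be coded directly. In the sketch below, `achieve_pair` (our name) returns the two-point pmf from the proof, and we verify that it realizes a given (p,m) ∈ Ψc with p > 0:

```python
def achieve_pair(p, m, c):
    """Two-point pmf {value: probability} from the proof of Theorem 3
    achieving (p, m) in Ψc when p > 0: Pr{T=0} = 1-p, Pr{T=m/p} = p."""
    assert 0 < p <= 1 and m >= c * p, "(p, m) must lie in Ψc with p > 0"
    return {0.0: 1 - p, m / p: p}

c = 2.0
p, m = 0.4, 1.5                      # in Ψc since m = 1.5 >= c*p = 0.8
pmf = achieve_pair(p, m, c)
tail = sum(q for t, q in pmf.items() if t >= c)  # Pr{T >= c}
mean = sum(t * q for t, q in pmf.items())        # E[T]
assert abs(tail - p) < 1e-12
assert abs(mean - m) < 1e-12
```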
The region Ψc is defined by Markov’s inequality together with the constraint 0≤p≤1 which comes from the setup of the problem, and we have shown that for any fixed c>0, every ordered pair in Ψc except for a region with Lebesgue measure 0 (namely the region {(0,m):m≥c}) is achievable by some random variable T. Specifically:
See Figure 3 for an illustration. Since every ordered pair in Ψc except for a region with Lebesgue measure 0 is achievable, or equivalently, the closure of Ψ∗c is equal to Ψc, we say that Markov’s inequality almost completely characterizes the relation between Pr{T ≥ c} and E[T].
Now consider an inequality in the quantities Pr{T≥c} and E[T]:
f(Pr{T ≥ c}, E[T]) ≥ 0,    (12)

where f : R2 → R. For example, if f(p,m) = m − cp, then the inequality in (12) becomes (6), namely Markov’s inequality. Let
Rf = {(p,m) ∈ R2 : f(p,m) ≥ 0}

be the region in R2 induced by f ≥ 0.
If (12) holds for every nonnegative random variable T, we say that the inequality is valid. Markov’s inequality is such an example. The fundamental importance of the achievable region Ψ∗c is explained in the next theorem, which asserts that an inequality f(Pr{T ≥ c}, E[T]) ≥ 0 is valid if and only if Ψ∗c ⊂ Rf. This implies that the region Ψ∗c completely characterizes all valid inequalities of the form (12).
Theorem 4. The inequality (12) is valid if and only if Ψ∗c⊂Rf.
Proof. We first show that if Ψ∗c ⊂ Rf, then (12) holds for every nonnegative random variable T. Assume that Ψ∗c ⊂ Rf. Consider any nonnegative random variable T and the ordered pair (Pr{T ≥ c}, E[T]). By the definition of Ψ∗c, we have (Pr{T ≥ c}, E[T]) ∈ Ψ∗c ⊂ Rf, so that (Pr{T ≥ c}, E[T]) ∈ Rf. It then follows from the definition of Rf that f(Pr{T ≥ c}, E[T]) ≥ 0.
Next, we show that if (12) holds for every nonnegative random variable T, then Ψ∗c ⊂ Rf. Consider any (p,m) ∈ Ψ∗c. Then (p,m) = (Pr{T ≥ c}, E[T]) for some nonnegative random variable T. Since by our assumption (12) holds for every nonnegative random variable T, we see that (p,m) ∈ Rf, and hence Ψ∗c ⊂ Rf.
Combining the above, we conclude that Ψ∗c ⊂ Rf is a necessary and sufficient condition for the inequality (12) to be valid. The theorem is proved. ◻
We now consider a finite set of inequalities on Pr{T≥c} and E[T],
F = {fi(Pr{T ≥ c}, E[T]) ≥ 0, 1 ≤ i ≤ k},    (13)

where fi : R2 → R. We say that F is valid if fi(Pr{T ≥ c}, E[T]) ≥ 0 is valid for all i. Let
RF = {(p,m) ∈ R2 : fi(p,m) ≥ 0, 1 ≤ i ≤ k}

be the region in R2 induced by F. The following is a corollary of Theorem 4.
Corollary 1. The set of inequalities (13) is valid if and only if Ψ∗c⊂RF.
Proof. Let
˜f(p,m) = 1 if fi(p,m) ≥ 0 for all i, and ˜f(p,m) = −1 otherwise.

Then ˜f(p,m) ≥ 0 if and only if fi(p,m) ≥ 0 for all i. In other words, the set of inequalities (13) is valid if and only if ˜f(p,m) ≥ 0 is valid, and hence RF = R˜f. Then the corollary is proved by applying Theorem 4. ◻
We end this section with an example for Corollary 1.
Example 1. Let
f1 = p,   f2 = 1 − p,   f3 = m − cp,

and F = {fi(Pr{T ≥ c}, E[T]) ≥ 0, i = 1, 2, 3}. Then RF becomes Ψc. Since fi(Pr{T ≥ c}, E[T]) ≥ 0 is valid for i = 1, 2, 3, so is the inequality set F. Then by Corollary 1, Ψ∗c ⊂ RF, which is indeed the case.
The Cauchy-Schwarz inequality, which applies to a general inner product space, is among the most important inequalities in mathematics. For the purpose of this work, it suffices to confine our discussion to real inner product spaces.
Definition 1. A real inner product space is a vector space V over R together with an inner product ⟨⋅,⋅⟩:V×V→R that satisfies the following for any vectors u,v,w∈V and any scalars a,b∈R:
The Cauchy-Schwarz inequality asserts that for any u,v∈V,
⟨u,v⟩² ≤ ⟨u,u⟩⟨v,v⟩,    (14)

with equality if and only if u and v are linearly dependent. Here, for any pair of vectors u and v, the three quantities of interest are ⟨u,u⟩, ⟨v,v⟩, and ⟨u,v⟩, and we are interested in the relation among them. The Cauchy-Schwarz inequality gives a characterization of this relation. In this section, we discuss this inequality in the spirit of the discussion in the previous sections.
In view of the three quantities involved in the Cauchy-Schwarz inequality, namely ⟨u,u⟩, ⟨v,v⟩, and ⟨u,v⟩, we are motivated to consider the ordered triple (⟨u,u⟩,⟨v,v⟩,⟨u,v⟩) for any u,v∈V. For (x,y,z)∈R3, if (x,y,z)=(⟨u,u⟩,⟨v,v⟩,⟨u,v⟩) for some u,v∈V, then we say that (x,y,z) is achievable. We then define the achievable region
Φ∗ = {(x,y,z) ∈ R3 : (x,y,z) is achievable}.

Also note that an ordered triple in R3 may be achievable by more than one pair of vectors, i.e.,
(x,y,z) = (⟨u,u⟩, ⟨v,v⟩, ⟨u,v⟩) = (⟨u′,u′⟩, ⟨v′,v′⟩, ⟨u′,v′⟩),

where (u,v) ≠ (u′,v′).
Let V be a fixed inner product space. Consider a finite set of inequalities on the quantities ⟨u,u⟩, ⟨v,v⟩, and ⟨u,v⟩:
F = {fi(⟨u,u⟩, ⟨v,v⟩, ⟨u,v⟩) ≥ 0 : 1 ≤ i ≤ m},    (15)

where fi : R3 → R. For example, if fi(x,y,z) = xy − z², then the ith inequality in (15) becomes (14), the Cauchy-Schwarz inequality. Let
RF = {(x,y,z) ∈ R3 : fi(x,y,z) ≥ 0, 1 ≤ i ≤ m}

be the region in R3 induced by F.
If an inequality in (15) holds for all u,v∈V, we say that the inequality is valid. The Cauchy-Schwarz inequality is such an example. If the inequality in (15) is valid for all i, we say that the inequality set F is valid. The next theorem on the fundamental importance of the achievable region Φ∗ follows directly from the discussion in Section 3, so the proof is omitted here.
Theorem 5. For any inner product space V, the inequality set (15) is valid if and only if Φ∗⊂RF.
Consider
f1(x,y,z) = x,    (16)
f2(x,y,z) = y,    (17)
f3(x,y,z) = xy − z².    (18)

From (16) to (18), we obtain
f1(⟨u,u⟩, ⟨v,v⟩, ⟨u,v⟩) = ⟨u,u⟩ ≥ 0,    (19)
f2(⟨u,u⟩, ⟨v,v⟩, ⟨u,v⟩) = ⟨v,v⟩ ≥ 0,    (20)
f3(⟨u,u⟩, ⟨v,v⟩, ⟨u,v⟩) = ⟨u,u⟩⟨v,v⟩ − ⟨u,v⟩² ≥ 0,    (21)

respectively, which hold for all u,v ∈ V. Note that (19) and (20) are implied by the positive-definiteness of the inner product. From (19) to (21), we see that
fi(⟨u,u⟩, ⟨v,v⟩, ⟨u,v⟩) ≥ 0

is valid for i = 1, 2, 3. Now define the region
Φ = {(x,y,z) ∈ R3 : x,y ≥ 0 and z² ≤ xy},

which in fact is the region RF with F = {fi ≥ 0, i = 1, 2, 3}. Thus by Theorem 5, we have Φ∗ ⊂ Φ.
Ultimately, we are interested in obtaining a complete characterization of the achievable region Φ∗ instead of just an outer bound on it. The question is whether Φ is indeed equal to Φ∗. If so, we say that the Cauchy-Schwarz inequality is tight. Otherwise, additional inequalities on the quantities ⟨u,u⟩, ⟨v,v⟩, and ⟨u,v⟩ are needed to completely characterize Φ∗. Such an inequality, if it exists, would play the same fundamental role as the Cauchy-Schwarz inequality. The next theorem asserts that there exists no constraint on ⟨u,u⟩, ⟨v,v⟩, and ⟨u,v⟩ other than the Cauchy-Schwarz inequality and positive-definiteness if dim(V) ≥ 2, but that this is not so if dim(V) = 0 or 1. We also say that the Cauchy-Schwarz inequality is sharp if and only if dim(V) ≥ 2 (by regarding positive-definiteness as given).3
Theorem 6. For an inner product space V, Φ∗=Φ if and only if dim(V)≥2. For dim(V)=0, Φ∗={0}, and for dim(V)=1,
Φ∗ = {(x,y,z) ∈ R3 : x,y ≥ 0 and z² = xy}.    (22)

The following proposition is instrumental in proving Theorem 6.
Proposition 2. Let V be an inner product space. If dim(V)≥2, then for any non-zero vector u∈V, there exists a unit vector w∈V such that ⟨u,w⟩=0.
Proof. Consider an inner product space V with dim(V)≥2 and let u be a non-zero vector in V. Since dim(V)≥2, there exists another vector t∈V such that u and t are linearly independent. Let
v = t − (⟨u,t⟩/⟨u,u⟩) u.

Note that ⟨u,u⟩ > 0 since u ≠ 0. Then
⟨u,v⟩ = ⟨u, t − (⟨u,t⟩/⟨u,u⟩) u⟩ = ⟨u,t⟩ − (⟨u,t⟩/⟨u,u⟩)⟨u,u⟩ = 0,

where the last step is justified because ⟨u,u⟩ ≠ 0. Also, we have
⟨v,v⟩ = ⟨t − (⟨u,t⟩/⟨u,u⟩) u, t − (⟨u,t⟩/⟨u,u⟩) u⟩ = ⟨t,t⟩ − 2⟨u,t⟩²/⟨u,u⟩ + (⟨u,t⟩²/⟨u,u⟩²)⟨u,u⟩ = ⟨t,t⟩ − ⟨u,t⟩²/⟨u,u⟩ ≥ 0,

where the inequality above follows from the Cauchy-Schwarz inequality (14), which is tight if and only if u and t are linearly dependent. Since the latter does not hold by our assumption, we conclude that ⟨v,v⟩ > 0, i.e., v is not the zero vector. Finally, the proposition is proved by letting
w = v / √⟨v,v⟩,

so that

⟨u,w⟩ = ⟨u,v⟩ / √⟨v,v⟩ = 0

and

√⟨w,w⟩ = √(⟨v,v⟩/⟨v,v⟩) = 1. ◻
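The proof of Proposition 2 is a single Gram-Schmidt step, which can be checked numerically. In the sketch below (`orthonormal_complement` is our name; the vectors live in R3 with the standard inner product for concreteness), we verify that the constructed w is a unit vector orthogonal to u:

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def orthonormal_complement(u, t):
    """Gram-Schmidt step from the proof of Proposition 2: given linearly
    independent u, t, return a unit vector w with <u, w> = 0."""
    v = [ti - (dot(u, t) / dot(u, u)) * ui for ui, ti in zip(u, t)]
    norm = math.sqrt(dot(v, v))  # positive since u, t are independent
    return [vi / norm for vi in v]

u, t = [3.0, 0.0, 4.0], [1.0, 2.0, 2.0]
w = orthonormal_complement(u, t)
assert abs(dot(u, w)) < 1e-9          # <u, w> = 0
assert abs(dot(w, w) - 1.0) < 1e-9    # w is a unit vector
```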
Proof of Theorem 6. For the case dim(V)=0, since V={0}, we have Φ∗={0}⊊Φ.
For the case dim(V)=1, since for any two vectors u,v∈V, one of them is always a scalar multiple of the other, the inequality (14) is always satisfied with equality. This implies that
Φ∗ ⊂ {(x,y,z) ∈ R3 : x,y ≥ 0 and z² = xy},

which is a proper subset of Φ. To prove that
{(x,y,z) ∈ R3 : x,y ≥ 0 and z² = xy} ⊂ Φ∗,    (23)

it suffices to show that for every (x,y,z) satisfying x,y ≥ 0 and z² = xy, there exist u,v ∈ V such that
(x,y,z) = (⟨u,u⟩, ⟨v,v⟩, ⟨u,v⟩).    (24)

We first consider the case that either x = 0 or y = 0. If x = 0, then z² = xy becomes z² = 0, which implies z = 0. Then (24) is satisfied by letting u = 0 and v ∈ V such that ⟨v,v⟩ = y. Likewise if y = 0.
Now consider the case that x,y>0. Then z2=xy>0, which implies that z≠0. Let u∈V such that
⟨u,u⟩ = x.    (25)

We need to consider two cases for z, namely z > 0 and z < 0. First consider the case z > 0. Together with z² = xy, we have z = √(xy). Let b = √(y/x), so that b²x = y. Let v = bu. Then
⟨v,v⟩ = ⟨bu, bu⟩ = b²⟨u,u⟩ = b²x = y,    (26)

and
⟨u,v⟩ = ⟨u, bu⟩ = b⟨u,u⟩ = bx = √(xy) = z.    (27)

Then we see from (25) to (27) that (24) is satisfied. For the case z < 0, we have z = −√(xy). Then we let b = −√(y/x) and repeat the above steps to show that (24) is again satisfied. Therefore, we have proved (23) and hence (22).
Now consider the case dim(V)≥2. It suffices to show that for any (x,y,z)∈Φ, there exist u,v∈V such that (24) is satisfied. Then (x,y,z)∈Φ∗, showing that Φ⊂Φ∗.
Consider any (x,y,z)∈Φ. Let u∈V such that ⟨u,u⟩=x. We seek v∈V such that
⟨v,v⟩ = y    (28)

and

⟨u,v⟩ = z.    (29)

We first consider the case that either x = 0 or y = 0, which can be handled in exactly the same way as in the case dim(V) = 1. Now consider the case that x,y > 0, and choose any u ∈ V such that ⟨u,u⟩ = x. By Proposition 2, there exists a unit vector w ∈ V such that ⟨u,w⟩ = 0. Let
v = (z/x) u + √(y − z²/x) w.

Note that the quantity inside the square root is nonnegative because z² ≤ xy. Since ⟨u,w⟩ = 0, we have
⟨v,v⟩ = (z²/x²)⟨u,u⟩ + (y − z²/x) = (z²/x²)x + (y − z²/x) = z²/x + (y − z²/x) = y,

and
⟨u,v⟩ = (z/x)⟨u,u⟩ + √(y − z²/x)⟨u,w⟩ = (z/x)x = z.

Thus (28) and (29) are satisfied. The theorem is proved. ◻
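The construction in the proof can be carried out concretely in R2 with the standard inner product, which suffices since dim(V) ≥ 2. A sketch (`achieve_triple` is our name; it handles the main case x > 0 of the proof):

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def achieve_triple(x, y, z):
    """Realize (<u,u>, <v,v>, <u,v>) = (x, y, z) for (x, y, z) in Φ with
    x > 0, following the proof of Theorem 6."""
    assert x > 0 and y >= 0 and z * z <= x * y, "(x, y, z) must lie in Φ"
    u = [math.sqrt(x), 0.0]          # <u, u> = x
    w = [0.0, 1.0]                   # unit vector with <u, w> = 0
    s = math.sqrt(y - z * z / x)     # nonnegative since z² <= x*y
    v = [(z / x) * u[0] + s * w[0], (z / x) * u[1] + s * w[1]]
    return u, v

u, v = achieve_triple(4.0, 9.0, -5.0)    # z² = 25 <= 36 = x*y
assert abs(dot(u, u) - 4.0) < 1e-9
assert abs(dot(v, v) - 9.0) < 1e-9
assert abs(dot(u, v) - (-5.0)) < 1e-9
```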
Remarks
To end this section, we argue that the Cauchy-Schwarz inequality can be regarded as sharp even when dim(V) = 0 or 1 if we take the dimension of the vector space explicitly into consideration. Specifically,
In the last section, we used the Cauchy-Schwarz inequality to illustrate how a geometrical formulation can potentially lead to interesting results. In this section, we apply the same formalism to inequalities on the Shannon entropy, which has led to very fruitful and unexpected results over the past two decades. This section is a brief exposition of this subject. The reader is referred to [4, Chs. 13-15] and [5], [6] for more in-depth discussions.4
In this section, all random variables are discrete. The Shannon entropy (or simply entropy when there is no ambiguity) for a random variable X with probability mass function p(x) is defined as
H(X) = −∑_{x∈SX} p(x) log p(x),

where SX denotes the support of X. For a pair of jointly distributed random variables X and Y with probability mass function p(x,y), the entropy is defined as
H(X,Y) = −∑_{(x,y)∈SXY} p(x,y) log p(x,y),

where SXY denotes the support of p(x,y). The entropy for a finite number of random variables is defined likewise. The entropy for two or more random variables is often called a joint entropy, although the distinction between entropy and joint entropy is unnecessary.
In information theory (see [7][8][4]), entropy is the fundamental measure of information. In addition to entropy, the following quantities are defined:
Mutual Information | I(X;Y)=H(X)+H(Y)−H(X,Y) |
Conditional Entropy | H(X|Y)=H(X,Y)−H(Y) |
Conditional Mutual Information | I(X;Y|Z)=H(X,Z)+H(Y,Z)−H(X,Y,Z)−H(Z). |
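These definitions can be checked on a small example. The sketch below computes H(X), H(Y), H(X,Y), I(X;Y), and H(X|Y) for a pair of independent fair bits (the function names and the particular pmf are our choices, for illustration):

```python
import math

def H(probs):
    """Shannon entropy in bits of a pmf given as a list of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Joint pmf of (X, Y): two independent fair bits.
pxy = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}

Hxy = H(list(pxy.values()))                                    # H(X, Y)
Hx = H([sum(q for (x, _), q in pxy.items() if x == a) for a in (0, 1)])
Hy = H([sum(q for (_, y), q in pxy.items() if y == b) for b in (0, 1)])

I_xy = Hx + Hy - Hxy        # mutual information I(X;Y)
Hx_given_y = Hxy - Hy       # conditional entropy H(X|Y)

assert abs(Hxy - 2.0) < 1e-9         # uniform on four outcomes
assert abs(I_xy) < 1e-9              # independence gives I(X;Y) = 0
assert abs(Hx_given_y - 1.0) < 1e-9
```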
These quantities, collectively called Shannon’s information measures, are used extensively in coding theorems in information theory problems.
In this section, inequalities on Shannon’s information measures are discussed. These inequalities are the main tool for proving converse coding theorems, which establish that for a particular communication problem, no coding scheme exists if certain conditions are not satisfied. In other words, these inequalities establish the “impossibilities” in information theory, and they are sometimes referred to as the “laws of information theory” [9].
As we see from the above, all Shannon’s information measures can be expressed as linear combinations of entropies. Therefore, inequalities on Shannon’s information measures can be written as inequalities on entropies. For this reason, they are referred to as entropy inequalities.
In this section, we will not focus on the application of entropy inequalities in proving converse coding theorems. Rather, we will focus on these inequalities themselves. Like what we have done in the previous sections, we first introduce a geometrical framework for entropy inequalities.
Let [n] = {1, …, n}, N = 2^[n] (the collection of all subsets of [n]), and ¯N = N∖{∅}. Let Θ = {Xi, i ∈ [n]} be a collection of n discrete random variables. Associated with any collection of n random variables are k := 2^n − 1 joint entropies. For α ∈ N, write Xα = (Xi, i ∈ α), with the convention that X∅ is a constant. For example, X{1,2,3}, or simply X123, denotes (X1, X2, X3). For a collection Θ of n random variables, define the set function HΘ : N → R by
HΘ(α) = H(Xα),   α ∈ N,

with HΘ(∅) = 0 because X∅ is a constant. HΘ is called the entropy function of Θ.
Let Hn denote Rk, the k-dimensional Euclidean space, with the coordinates labeled by hα,α∈¯N. We call Hn the entropy space for n random variables. As an example, for n=3, the coordinates of H3 are labelled by
h1, h2, h3, h12, h13, h23, h123,

where h123 denotes h{1,2,3}, etc. Then for each collection Θ of n random variables, HΘ can be represented by a column vector hΘ ∈ Hn, called the entropy vector of Θ, whose component corresponding to α is equal to HΘ(α) for all α ∈ ¯N. On the other hand, a column vector h ∈ Hn is called entropic5 if it is equal to the entropy vector hΘ of some collection Θ of n random variables.
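As an illustration of an entropic vector, the sketch below computes the entropy vector (h1, h2, h12) of a correlated binary pair for n = 2, and checks two of the basic inequalities discussed later in this section (the particular pmf is an arbitrary choice of ours):

```python
import math

def H(probs):
    # Shannon entropy in bits of a pmf given as a list of probabilities
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Joint pmf of a correlated binary pair (X1, X2).
p12 = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

h1 = H([sum(q for (x1, _), q in p12.items() if x1 == a) for a in (0, 1)])
h2 = H([sum(q for (_, x2), q in p12.items() if x2 == b) for b in (0, 1)])
h12 = H(list(p12.values()))

# The entropy vector (h1, h2, h12) is entropic by construction; it must
# satisfy the basic inequalities, e.g. h1 + h2 >= h12 (I(X1;X2) >= 0)
# and h12 >= max(h1, h2) (conditional entropies are nonnegative).
assert h1 + h2 >= h12 - 1e-12
assert h12 >= max(h1, h2) - 1e-12
```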
Like what we have done for the Markov inequality and the Cauchy-Schwarz inequality, we are motivated to define the region
Γ∗n = {h ∈ Rk : h is entropic}.

The region Γ∗n is referred to as the region of entropy vectors.
An entropy inequality f(hΘ) ≥ 0, where f : Rk → R, is valid if it holds for all collections Θ of n random variables. For example, the inequality
H(X1) + H(X2) ≥ H(X1, X2),

or
I(X1; X2) ≥ 0,

is valid because it holds for any random variables X1 and X2; this will be further discussed in Section 5.2. In the sequel, let n be fixed. The following proposition is analogous to Theorem 5. Its proof is omitted.
Proposition 3. A set of entropy inequalities {fi(hΘ)≥0,1≤i≤m} is valid if and only if
Γ∗n ⊆ {h ∈ Hn : fi(h) ≥ 0, 1 ≤ i ≤ m}.

In information theory, we very often deal with entropy inequalities with certain constraints on the joint distribution of the random variables involved. These are called constrained entropy inequalities, and the constraints on the joint distribution can usually be expressed as linear constraints on the entropies. In the sequel, we always assume that the constraints on the entropies are of this form. The following are some examples:
It is not difficult to show that (30) is equivalent to
I(X1; X2) = 0,
I(X2; X3 | X1) = 0,
I(X1; X3 | X2) = 0.

Suppose there are q constraints on the entropies given by
Qh = 0,

where Q is a q × k matrix. Without loss of generality, we can assume that these q constraints are linearly independent, so that Q has full row rank. Let
Φ = {h ∈ Hn : Qh = 0}.    (31)

In other words, the q constraints confine h to a linear subspace Φ of the entropy space. The following is the constrained version of Proposition 3.
Proposition 4. A set of entropy inequalities {fi(hΘ)≥0,1≤i≤m} is valid under the constraint Φ if and only if
(Γ∗n ∩ Φ) ⊂ {h ∈ Hn : fi(h) ≥ 0, 1 ≤ i ≤ m}.

We will refer to the inequalities in Proposition 3 as (unconstrained) entropy inequalities, and to the inequalities in Proposition 4 as constrained entropy inequalities. Note that we can let Φ = Hn when there is no constraint on the entropies. In this sense, an unconstrained entropy inequality is a special case of a constrained entropy inequality.
It is well known in information theory that all Shannon’s information measures are nonnegative. This set of inequalities is collectively called the basic inequalities in information theory. Specifically, these are inequalities of the form
where α, β, and γ are disjoint subsets of [n]. On the other hand, the entropy function satisfies the polymatroidal axioms [10]: For any δ, σ ⊂ [n],
It can be shown that the basic inequalities and the polymatroidal axioms are equivalent (see [4, Appendix 14.A]).
The basic inequalities, expressed in terms of the entropies, are linear inequalities in Hn. Denote this set of inequalities by Gh≥0, where G is an m×k matrix, and define
Γn = {h ∈ Hn : Gh ≥ 0}.

Since the basic inequalities always hold, we see from Proposition 3 that Γ∗n ⊂ Γn.
Shannon-type inequalities are entropy inequalities that are implied by the basic inequalities. Specifically, an entropy inequality f(h)≥0 is a Shannon-type inequality if and only if
Γn ⊂ {h ∈ Hn : f(h) ≥ 0}.

More generally, under the linear constraint Φ (cf. (31)), f(h) ≥ 0 is a Shannon-type inequality if and only if
(Γn ∩ Φ) ⊂ {h ∈ Hn : f(h) ≥ 0}.

Since Γ∗n ⊂ Γn, it follows that
(Γ∗n ∩ Φ) ⊂ {h ∈ Hn : f(h) ≥ 0},

which implies that a Shannon-type inequality is valid.
As mentioned earlier in this section, entropy inequalities are the main tool for proving converse coding theorems in information theory. In fact, until the discovery of non-Shannon-type inequalities in the late 1990s, Shannon-type inequalities were the only entropy inequalities known.
Since Γn is a polyhedral cone, verification of a linear Shannon-type inequality can be formulated as a linear programming problem [2]. ITIP [11] was the first software developed for this purpose, which runs on MATLAB. Subsequently, variants of ITIP with different additional features have been developed. AITIP [12] can produce a human-readable proof and suggest counterexamples when the inequality to be verified is not Shannon-type. PSITIP [13] can render proofs for converse coding theorems in network information theory [14]. See [15] for a list of related software. Recently, a symbolic approach to the problem that can drastically speed up the computation has been developed [16][17]. For a general discussion on machine-proving of entropy inequalities, we refer the reader to the tutorial paper [18].
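The LP idea can be illustrated without a solver for n = 2: a linear function is nonnegative on a polyhedral cone if and only if it is nonnegative on the cone's extreme rays, so a Shannon-type check reduces to finitely many evaluations (the listed rays are for Γ2 only, computed from its three defining inequalities; for general n one would solve a linear program, as ITIP does):

```python
# Γ2 = {h : h12 >= h1, h12 >= h2, h1 + h2 >= h12} is a polyhedral cone
# in (h1, h2, h12) with extreme rays (1,1,1), (1,0,1), (0,1,1).
# A linear inequality b·h >= 0 is Shannon-type iff b·r >= 0 on every ray.
RAYS = [(1, 1, 1), (1, 0, 1), (0, 1, 1)]

def is_shannon_type(b):
    return all(sum(bi * ri for bi, ri in zip(b, r)) >= 0 for r in RAYS)

print(is_shannon_type((1, 1, -1)))  # h1 + h2 - h12 >= 0: True
print(is_shannon_type((1, -1, 0)))  # h1 - h2 >= 0: False
```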
To our knowledge, [9] was the first work in the literature that explicitly asked whether there exists any constraint on the entropy function other than the polymatroidal axioms. The same question was raised in [19] in a somewhat different form. With the geometrical formulation for entropy inequalities described at the beginning of this section, it became reasonable to conjecture the existence of constraints on the entropy function beyond Shannon-type inequalities, because it is not readily provable that Γ∗n=Γn.
Before diving deeper into this subject, we first consider Γ∗n for small n. While it is straightforward to show that Γ∗2=Γ2, the problem is already nontrivial for n=3. Specifically, along an extreme direction of Γ3, only certain discrete points are entropic, making Γ∗3 not closed. However, upon taking the closure of Γ∗3, we obtain Γ3.
For n≥4, the set Γ∗n is very complex and characterization of Γ∗n remains an open problem. Nevertheless, the following general properties of Γ∗n are known: i) ¯Γ∗n, the closure of Γ∗n, is a convex cone; ii) int(¯Γ∗n)⊂Γ∗n.
Here, int(¯Γ∗n)⊂Γ∗n means that the difference between Γ∗n and ¯Γ∗n can only be on the boundary. As discussed, Γ∗3 and ¯Γ∗3 differ on an extreme direction of ¯Γ∗3 (= Γ3), which is on the boundary of ¯Γ∗3.
For 4 random variables, the following constrained entropy inequality was proved [21]: If
I(X1;X2) = I(X1;X2|X3) = 0,  (32)

then
I(X3;X4) ≤ I(X3;X4|X1) + I(X3;X4|X2).

This inequality, referred to in the literature as ZY97, cannot be proved by ITIP and hence is a non-Shannon-type inequality.
It was discussed earlier that along an extreme direction of Γ3, only certain discrete points are entropic, while the rest are non-entropic. In the above, the constraints in (32) together with Γ4 define a 13-dimensional face6 of Γ4, and ZY97 asserts that a region on this face is not entropic. However, it is still unclear whether ¯Γ∗4 is equal to Γ4. Shortly after, the following unconstrained non-Shannon-type entropy inequality was discovered [22]: For any random variables X1, X2, X3, and X4,
2I(X3;X4) ≤ I(X1;X2) + I(X1;X3,X4) + 3I(X3;X4|X1) + I(X3;X4|X2).

This inequality, referred to in the literature as ZY98 or the Zhang-Yeung inequality, shows that ¯Γ∗4 is a proper subset of Γ4. See Fig. 4 for an illustration.
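ZY98 can be sanity-checked numerically: for random joint distributions of four binary random variables, the inequality should hold with all mutual informations computed from joint entropies (a demonstration only, not a proof; the 0-based variable indexing is our convention):

```python
# Numerical spot-check of ZY98 on random joint pmfs of four binary
# random variables, with entropies computed by direct marginalization.
import itertools, math, random

def entropy(p):
    return -sum(x * math.log2(x) for x in p.values() if x > 0)

def marginal(p, keep):
    m = {}
    for x, px in p.items():
        key = tuple(x[i] for i in keep)
        m[key] = m.get(key, 0.0) + px
    return m

def H(p, alpha):
    return entropy(marginal(p, sorted(alpha)))

def I(p, a, b, c=()):  # I(Xa; Xb | Xc) expressed via joint entropies
    a, b, c = set(a), set(b), set(c)
    return H(p, a | c) + H(p, b | c) - H(p, a | b | c) - H(p, c)

random.seed(0)
for _ in range(1000):
    w = [random.random() for _ in range(16)]
    s = sum(w)
    p = {x: wi / s for x, wi in zip(itertools.product((0, 1), repeat=4), w)}
    lhs = 2 * I(p, {2}, {3})
    rhs = (I(p, {0}, {1}) + I(p, {0}, {2, 3})
           + 3 * I(p, {2}, {3}, {0}) + I(p, {2}, {3}, {1}))
    assert lhs <= rhs + 1e-9
print("ZY98 holds on all sampled distributions")
```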
Subsequently, many non-Shannon-type inequalities for four or more random variables were discovered [23][24][20][25][26]. In particular, the existence of an infinite class of unconstrained non-Shannon-type inequalities for four random variables was proved, implying that ¯Γ∗4 (and more generally ¯Γ∗n for n≥4) is not polyhedral [20]. See [27] for a unifying discussion.
The above efforts concern the characterization of ¯Γ∗n, in particular ¯Γ∗4, which remains an open problem. Throughout the years, there have also been efforts on characterizing Γ∗4 itself, which is even more difficult. Notable works along this line include [28][29][30][31][32]. The connection of Γ∗n with conditional independence of random variables will be discussed in Section 5.4.2.
The study of entropy inequalities, more specifically the characterization of the region Γ∗n, was shown to be intimately related to a distributed coding problem inspired by satellite communication [33]. This line of research was subsequently developed into the theory of network coding [34][35]. In the meantime, intimate relations between this subject and different branches of mathematics and physics were established. In particular, the non-Shannon-type inequalities for entropy induce corresponding inequalities for finite groups, Kolmogorov complexity, and positive semi-definite matrices. In this section, we give a high-level introduction to these developments. For a comprehensive treatment of the topic, we refer the reader to [5][6].
In network communication, to send information from a source node s to a destination node t, the predominant existing method is routing, namely that data packets are routed from node s to node t through the intermediate nodes in their original form. Network coding theory [34] refutes the folklore that routing alone can achieve the network capacity. Rather, coding at the network nodes, referred to as “network coding”, is in general required. As routing a data packet from an input to an output of a network node can be regarded as applying the identity map to the data packet, routing is a special case of network coding.
The advantage of network coding can be illustrated by a simple example called the butterfly network, represented by the directed graph in Figure 5. Here, a directed edge (i,j) represents a communication channel from node i to node j. A bit bi is generated at source node si, i=1,2, and the bits b1 and b2 are to be multicast7 to two destination nodes t1 and t2. Figure 5(a) shows a routing solution, in which both b1 and b2 need to be transmitted on channel (1,2).
If only one bit can be transmitted on each channel, then there exists no routing solution for this multicast problem. However, if an intermediate node can apply computation to the incoming bits instead of just routing them through, then a solution can be obtained as in Figure 5(b). Here at node 1, the received bits b1 and b2 are combined into a new bit b1⊕b2, where ‘⊕’ denotes binary addition or exclusive-or (XOR). This operation is referred to as network coding. The bit b1⊕b2 is then transmitted to node 2, where two copies of the bit are sent to the destination nodes t1 and t2, respectively. At node t1, a copy of the bit b1 is received directly from node s1, while the bit b2 can be decoded by adding the two received bits:
b1 ⊕ (b1⊕b2) = (b1⊕b1) ⊕ b2 = 0 ⊕ b2 = b2.

Similarly, the bits b1 and b2 can be decoded at node t2.
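The decoding argument above can be sketched in a few lines of code (the node names follow Figure 5; this illustrates the XOR scheme only, not a general network-coding implementation):

```python
# Butterfly network: node 1 XORs the two source bits, and each
# destination recovers the missing bit by XORing its direct copy
# with the coded bit received via node 2.
import itertools

def butterfly(b1, b2):
    coded = b1 ^ b2          # network coding at node 1
    b2_at_t1 = b1 ^ coded    # t1 has b1 directly, decodes b2
    b1_at_t2 = b2 ^ coded    # t2 has b2 directly, decodes b1
    return (b1, b2_at_t1), (b1_at_t2, b2)

for b1, b2 in itertools.product((0, 1), repeat=2):
    at_t1, at_t2 = butterfly(b1, b2)
    assert at_t1 == (b1, b2) and at_t2 == (b1, b2)
print("both destinations decode (b1, b2) in all four cases")
```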
Among the vast literature of network coding, [33], [36], and [35] are directly related to the discussions in this section. In a nutshell, these works give a complete characterization of the capacity region8 of the general network coding problem on an acyclic network in terms of Γ∗, the region of entropy vectors. Here, we omit the subscript n in Γ∗ because the exact number of random variables involved depends on the setup of the specific network coding problem. Evidently, this characterization is implicit because the complete characterization of Γ∗ is still open. Nevertheless, an outer (inner) bound on Γ∗ directly induces an outer (inner) bound on the capacity region of the network coding problem. For a comprehensive discussion on this topic, we refer the reader to [[4], Ch. 21][35].
We use Xα⊥Xβ∣Xγ to denote the conditional independency (CI)
Xα and Xβ are conditionally independent given Xγ,

where α, β, and γ are assumed to be disjoint subsets of [n]. When γ=∅, Xα⊥Xβ∣Xγ becomes an unconditional independency which we regard as a special case of a conditional independency.
In probability theory, we are often given a set of CI’s and we need to determine whether another given CI is logically implied. We refer to this problem as the implication problem, which is one of the most basic problems in probability theory. For example, we want to know whether
{X1⊥X3∣X2, X1⊥X2} ⇒ X1⊥X3.

This is not difficult to prove. However, the general implication problem is extremely difficult, and it has been solved only up to four random variables [37].
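The implication can also be demonstrated numerically: any joint distribution satisfying both premises factors (on the support of X2) as p(x1,x2,x3) = p(x1)p(x2)p(x3|x2), and for random instances of this form I(X1;X3) vanishes (a demonstration under this factorization, not a proof):

```python
# Sample distributions of the form p(x1)p(x2)p(x3|x2), which satisfy
# X1 ⊥ X3 | X2 and X1 ⊥ X2, and check that I(X1;X3) = 0.
import itertools, math, random

def normalize(w):
    s = sum(w)
    return [x / s for x in w]

def H(p, keep):
    m = {}
    for x, px in p.items():
        k = tuple(x[i] for i in keep)
        m[k] = m.get(k, 0.0) + px
    return -sum(v * math.log2(v) for v in m.values() if v > 0)

random.seed(1)
for _ in range(100):
    p1 = normalize([random.random() for _ in range(2)])
    p2 = normalize([random.random() for _ in range(2)])
    p3g2 = [normalize([random.random() for _ in range(2)]) for _ in range(2)]
    p = {(a, b, c): p1[a] * p2[b] * p3g2[b][c]
         for a, b, c in itertools.product(range(2), repeat=3)}
    # I(X1;X3) = H(X1) + H(X3) - H(X1,X3)
    i13 = H(p, [0]) + H(p, [2]) - H(p, [0, 2])
    assert abs(i13) < 1e-9
print("I(X1;X3) = 0 in every sampled instance")
```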
We now explain the relation between the implication problem and the region Γ∗n. A CI involving random variables X1,X2,⋯,Xn has the form
Xα⊥Xβ∣Xγ,

where α, β, and γ are disjoint subsets of [n]. Denote this generic CI by K. From the discussion in Section 5.1, K is equivalent to I(Xα;Xβ∣Xγ)=0, i.e., setting the basic inequality I(Xα;Xβ∣Xγ)≥0 to equality. Furthermore, since I(Xα;Xβ∣Xγ)=0 is equivalent to

H(Xα∪γ) + H(Xβ∪γ) − H(Xα∪β∪γ) − H(Xγ) = 0,

the CI K corresponds to the following hyperplane in Hn:

E(K) := {h∈Hn : hα∪γ + hβ∪γ − hα∪β∪γ − hγ = 0}.

Since the region Γn is defined by the basic inequalities (with I(Xα;Xβ∣Xγ)≥0 being one of them), E(K)∩Γn is a face of Γn.
Let Π={Kl} be a collection of CI’s, and we want to determine whether Π implies a given CI K. This would be the case if and only if the following is true:
For all h∈Γ∗n, if h∈∩lE(Kl), then h∈E(K).

Equivalently, Π implies K if and only if

[∩lE(Kl)]∩Γ∗n ⊆ E(K).

Therefore, the implication problem can be solved if Γ∗n can be characterized. Since Γ∗n⊆Γn, in the above, [∩lE(Kl)]∩Γ∗n can be rewritten as

[∩lE(Kl)]∩Γ∗n∩Γn = ∩l[(E(Kl)∩Γn)∩Γ∗n].

As discussed, E(Kl)∩Γn is a face of Γn. In other words, the implication problem can be solved by characterizing Γ∗n, in particular characterizing Γ∗n on the faces of Γn. Therefore, to tackle the implication problem, it is not sufficient just to characterize ¯Γ∗n.
Hence, the region \Gamma_n^* is of fundamental importance not only in information theory but also in probability theory. For a more general discussion of this topic, we refer the reader to the series of papers [38][39][37].
Let X_1 and X_2 be any two random variables. Then
H(X_1) + H(X_2) \geq H(X_1, X_2), \tag{38}

which is equivalent to the basic inequality
I(X_1; X_2) \geq 0. \tag{39}

Let G be any finite group and G_1 and G_2 be subgroups of G. It is well known in group theory9 that
|G|\,|G_1 \cap G_2| \geq |G_1|\,|G_2|, \tag{40}

where |G| denotes the order of G and G_1 \cap G_2 denotes the intersection of G_1 and G_2 (G_1 \cap G_2 is also a subgroup of G). By rearranging the terms, the above inequality can be written as
\log \frac{|G|}{|G_1|} + \log \frac{|G|}{|G_2|} \geq \log \frac{|G|}{|G_1 \cap G_2|}. \tag{41}

By comparing (38) and (41), one can easily identify the one-to-one correspondence between the forms of these two inequalities, namely that X_i corresponds to G_i, i=1,2, and (X_1, X_2) corresponds to G_1 \cap G_2. While (38) is true for any pair of random variables X_1 and X_2, (41) is true for any finite group G and subgroups G_1 and G_2. As a further example, from the entropy inequality
H(X_1, X_3) + H(X_2, X_3) \geq H(X_1, X_2, X_3) + H(X_3), \tag{42}

which is equivalent to I(X_1;X_2|X_3) \geq 0, we can obtain the group inequality
\log \frac{|G|}{|G_1 \cap G_3|} + \log \frac{|G|}{|G_2 \cap G_3|} \geq \log \frac{|G|}{|G_1 \cap G_2 \cap G_3|} + \log \frac{|G|}{|G_3|}, \tag{43}

which holds for any finite group G and subgroups G_1, G_2, and G_3.
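As a quick sanity check, (40) and (43) can be verified exhaustively for the cyclic group Z_24, whose subgroups are exactly the sets of multiples of a fixed element (a small illustration; any finite group would do, but cyclic groups keep the code short):

```python
# Exhaustive check of (40) and (43) over all subgroups of Z_24.
# With logs cleared, (43) reads |G1∩G2∩G3| |G3| >= |G1∩G3| |G2∩G3|.
n = 24
subs = {d: frozenset((d * k) % n for k in range(n)) for d in range(1, n + 1)}
G = subs[1]  # the whole group

for d1 in subs:
    for d2 in subs:
        G1, G2 = subs[d1], subs[d2]
        # (40): |G| |G1 ∩ G2| >= |G1| |G2|
        assert len(G) * len(G1 & G2) >= len(G1) * len(G2)
        for d3 in subs:
            G3 = subs[d3]
            # (43) in product form
            assert len(G1 & G2 & G3) * len(G3) >= len(G1 & G3) * len(G2 & G3)
print("(40) and (43) hold for all subgroup triples of Z_24")
```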
This one-to-one correspondence can be extended to any random variables X_1, X_2, \cdots, X_n and any finite group G and its subgroups G_1, G_2, \cdots, G_n. For example, consider the non-Shannon-type inequality ZY98 which can be written in terms of joint entropies as follows:
\left.\begin{array}{r} H\left(X_1\right)+H\left(X_1, X_2\right)+2 H\left(X_3\right) \\ +2 H\left(X_4\right)+4 H\left(X_1, X_3, X_4\right) \\ +H\left(X_2, X_3, X_4\right) \end{array}\right\} \leq\left\{\begin{array}{l} 3 H\left(X_1, X_3\right)+3 H\left(X_1, X_4\right) \\ +3 H\left(X_3, X_4\right)+H\left(X_2, X_3\right) \\ +H\left(X_2, X_4\right) \end{array}\right.

This entropy inequality, which holds for all random variables X_1, X_2, X_3 and X_4, corresponds to the group inequality
\left.\begin{array}{r} \left|G_1 \cap G_3\right|^3\left|G_1 \cap G_4\right|^3 \\ \cdot\left|G_3 \cap G_4\right|^3\left|G_2 \cap G_3\right| \\ \cdot\left|G_2 \cap G_4\right| \end{array}\right\} \leq\left\{\begin{array}{l} \left|G_1\right|\left|G_1 \cap G_2\right|\left|G_3\right|^2 \\ \cdot\left|G_4\right|^2\left|G_1 \cap G_3 \cap G_4\right|^4 \\ \cdot\left|G_2 \cap G_3 \cap G_4\right| \end{array}\right.

which holds for any finite group G and subgroups G_1, G_2, G_3, and G_4. We call such an inequality a “non-Shannon-type” group inequality. Curiously, there has not been a proof of this inequality based on group theory alone (without going through the entropy function); such a proof could shed light on the group-theoretic meaning of the inequality. Likewise, from any other non-Shannon-type entropy inequality, one can obtain the corresponding group inequality.
In the above, we have discussed how to obtain a group inequality from an entropy inequality. Conversely, if a group inequality of the form (41) or (43) holds, then the corresponding entropy inequality of the form (38) or (42) also holds.
This one-to-one correspondence between entropy inequalities and group inequalities is intimately related to a combinatorial structure known as the quasi-uniform array [40]. This combinatorial structure, inspired by the fundamental notion of strong typicality in information theory, is exhibited by any finite group and its subgroups. We refer the reader to [41][[4], Ch. 16] for the details.
Let X be a continuous random variable with probability density function (pdf) f(x). The differential entropy of X is defined as
h(X) = -\int f(x)\log f(x)\,dx.

Likewise, the joint differential entropy of a random vector \mathbf{X} with joint pdf f(\mathbf{x}) is defined as
h(\mathbf{X}) = -\int f(\mathbf{x})\log f(\mathbf{x})\,d\mathbf{x}. \tag{44}

The integrals in the above definitions are assumed to be taken over the support of the underlying pdf.
A linear differential entropy inequality
\sum_{\alpha \in \overline{\mathsf{N}}} c_\alpha h(X_\alpha) \geq 0

is said to be balanced if for all i \in [n], we have \sum_{\alpha \in \overline{\mathsf{N}}: i \in \alpha} c_\alpha = 0. (The same can be defined for an entropy inequality.) It was proved in [42] that the above differential entropy inequality is valid if and only if it is balanced and its discrete analog is valid. For example,
h(X|Y) = h(X,Y) - h(Y) \geq 0

is not valid because it is not balanced. On the other hand,
I(X;Y) = h(X) + h(Y) - h(X,Y) \geq 0

is valid because it is balanced and its discrete analog
H(X) + H(Y) - H(X,Y) \geq 0

is valid. Thus if \Gamma_n^* can be determined, then in principle all valid differential entropy inequalities can be determined.
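Checking balance is mechanical: for each variable i, sum the coefficients of all terms whose index set contains i (a minimal sketch; keying terms by frozensets of variable indices is our own representation):

```python
# An inequality sum_α c_α h(X_α) >= 0 is balanced if, for every
# variable i, the coefficients of all terms containing i sum to zero.
def is_balanced(coeffs):
    vars_ = set().union(*coeffs)
    return all(
        abs(sum(c for a, c in coeffs.items() if i in a)) < 1e-12
        for i in vars_
    )

# h(X,Y) - h(Y) >= 0, i.e., h(X|Y) >= 0: unbalanced in X
print(is_balanced({frozenset({1, 2}): 1, frozenset({2}): -1}))   # False
# h(X) + h(Y) - h(X,Y) >= 0, i.e., I(X;Y) >= 0: balanced
print(is_balanced({frozenset({1}): 1, frozenset({2}): 1,
                   frozenset({1, 2}): -1}))                      # True
```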
Any n \times n symmetric positive semi-definite matrix K = [k_{ij}] defines a Gaussian vector \mathbf{X} = [X_1 X_2 \cdots X_n] with covariance matrix K. Substituting the corresponding Gaussian distribution into (44), we obtain
h(\mathbf{X}) = \frac{1}{2}\log [(2\pi e)^n |K|],

where |\cdot| denotes the determinant of a matrix. For \alpha \in \overline{\mathsf{N}}, let K_\alpha be the submatrix of K at the intersection of the rows and the columns of K indexed by \alpha, whose determinant |K_\alpha| is called a principal minor of K. Note that K_\alpha is the covariance matrix of the subvector \mathbf{X}_\alpha = [X_i: i \in \alpha]. Since \mathbf{X}_\alpha is also Gaussian, it follows that
h(\mathbf{X}_\alpha) = \frac{1}{2}\log [(2\pi e)^{|\alpha|} |K_\alpha|]. \tag{45}

Now consider the independence bound for differential entropy,
h(X_1, X_2, \cdots, X_n) \leq \sum_i h(X_i),

which is tight if and only if X_i, i \in [n] are mutually independent. Substituting (45) into the above, we have
\frac{1}{2}\log [(2\pi e)^n |K|] \leq \sum_i \frac{1}{2}\log [(2\pi e) |K_i|],

or
\frac{n}{2}\log (2\pi e) + \frac{1}{2}\log |K| \leq \frac{n}{2}\log (2\pi e) + \frac{1}{2}\log \prod_i |K_i|.

Note that the terms involving \frac{1}{2}\log(2\pi e) cancel out, because the independence bound is a valid differential entropy inequality and so it is balanced. After simplification, we obtain
|K| \leq \prod_i |K_i|,

namely Hadamard’s inequality, which is tight if and only if X_i, i \in [n] are mutually independent, i.e., k_{ij}=0 for all i \neq j.
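Hadamard's inequality is easy to confirm numerically for random positive semi-definite matrices (a sanity check, not a proof):

```python
# Check det(K) <= prod_i K_ii for random PSD matrices K = A A^T.
import numpy as np

rng = np.random.default_rng(0)
for _ in range(1000):
    A = rng.standard_normal((5, 5))
    K = A @ A.T                      # symmetric PSD by construction
    assert np.linalg.det(K) <= np.prod(np.diag(K)) * (1 + 1e-9)
print("Hadamard's inequality holds on all samples")
```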
This and similar techniques can be applied to obtain various inequalities on the principal minors of symmetric positive semi-definite matrices [[8], Section 16.8]. These include a generalization of Hadamard’s inequality due to Szász [43] and the Minkowski inequality [44].
For every valid differential entropy inequality, a corresponding inequality involving the principal minors of a symmetric positive semi-definite matrix can be obtained in this fashion. It turns out that all non-Shannon-type inequalities for discrete random variables discovered so far are balanced, and so they are also valid for differential entropy. For example, from ZY98 we can obtain
|K_1||K_{12}||K_3|^2|K_4|^2|K_{134}|^4|K_{234}| \leq |K_{13}|^3|K_{14}|^3|K_{34}|^3|K_{23}||K_{24}|,

which can be called a “non-Shannon-type” inequality for 4 \times 4 positive semi-definite matrices K. It was proved in [45] that for 3 \times 3 positive semi-definite matrices, all inequalities involving the principal minors can be obtained through the Gaussian distribution as explained.
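Like Hadamard's inequality, this principal-minor inequality can be spot-checked on random 4 × 4 positive semi-definite matrices (the 1-based index labels follow the text):

```python
# Spot-check of the "non-Shannon-type" minor inequality from ZY98
# on random 4x4 PSD matrices K = A A^T.
import numpy as np

def minor(K, idx):
    idx = [i - 1 for i in idx]       # 1-based labels as in the text
    return np.linalg.det(K[np.ix_(idx, idx)])

rng = np.random.default_rng(0)
for _ in range(1000):
    A = rng.standard_normal((4, 4))
    K = A @ A.T
    lhs = (minor(K, [1]) * minor(K, [1, 2]) * minor(K, [3]) ** 2
           * minor(K, [4]) ** 2 * minor(K, [1, 3, 4]) ** 4
           * minor(K, [2, 3, 4]))
    rhs = (minor(K, [1, 3]) ** 3 * minor(K, [1, 4]) ** 3
           * minor(K, [3, 4]) ** 3 * minor(K, [2, 3]) * minor(K, [2, 4]))
    assert lhs <= rhs * (1 + 1e-8)
print("the ZY98 minor inequality holds on all samples")
```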
Kolmogorov complexity, also known as Kolmogorov–Chaitin complexity, is a subfield of computer science. The Kolmogorov complexity of a sequence x, denoted by K(x), is the length of the shortest description of x with respect to a universal description language. Without getting into the details, such a universal description language can be based on a computer programming language. Likewise, the Kolmogorov complexity of a pair of sequences x and y is denoted by K(x,y). We refer the reader to [46] for a comprehensive treatment of the subject.
Hammer et al. [47] established that all linear inequalities that are valid for Kolmogorov complexity are also valid for entropy, and vice versa. For example, the inequality
H(X_1) + H(X_2) \geq H(X_1, X_2)

for any X_1, X_2 corresponds to the inequality
K(x_1) + K(x_2) \geq K(x_1, x_2)

for any two sequences x_1 and x_2. This establishes a one-to-one correspondence between entropy and Kolmogorov complexity. Due to this one-to-one correspondence, “non-Shannon-type” inequalities for Kolmogorov complexity can be obtained accordingly.
The von Neumann entropy [48] is a generalization of the classical entropy (Shannon entropy) to the field of quantum mechanics.10 For any quantum state described by a Hermitian positive semi-definite matrix \rho, the von Neumann entropy of \rho is defined as
S(\rho) = -\mathrm{Tr}(\rho \log \rho).

Consider distinct quantum systems A and B. The joint system is described by a Hermitian positive semi-definite matrix \rho_{AB}. The individual systems are described by \rho_A and \rho_B, which are obtained from \rho_{AB} by taking partial traces. Consider a fixed \rho_{AB}. We simply use S(A) to denote the entropy of System A, i.e., S(\rho_A). In the following, the same convention applies to other joint or individual systems. It is well known that
|S(A) - S(B)| \leq S(AB) \leq S(A) + S(B).

The second inequality above is called the subadditivity of the von Neumann entropy. The first inequality, called the triangle inequality (also known as the Araki-Lieb inequality [49]), is regarded as the quantum analog of the inequality
H(X) \leq H(X, Y) \tag{46}

for the Shannon entropy. It is important to note that although the Shannon entropy of a joint system is never less than the Shannon entropy of an individual system, as shown in (46), this may not be true in quantum systems. It is possible that S(AB) = 0 but S(A) > 0 and S(B) > 0, for example, when AB is a pure entangled state [50]. From this fact, we can see that the quantum world can be quite different from the classical world.
The strong subadditivity of the von Neumann entropy [51][52] plays the same role as the basic inequalities for the classical entropy. For distinct quantum systems A, B, and C, strong subadditivity can be represented by the following two equivalent forms:
\begin{align} S(A) + S(B) & \leq S(AC) + S(BC) \\ S(ABC) + S(B) & \leq S(AB) + S(BC). \end{align}

These inequalities can be used to show many other interesting inequalities involving conditional entropy and mutual information. Similar to classical information theory, quantum conditional entropy and quantum mutual information are defined as S(A|B) = S(A,B) - S(B) and S(A:B) = S(A) + S(B) - S(A,B), respectively. For distinct quantum systems A, B, C and D, we have [50]
i) Conditioning reduces conditional entropy:

S(A|B,C) \leq S(A|B).
ii) Discarding quantum systems never increases mutual information:
S(A:B) \leq S(A:B,C).

iii) Subadditivity of conditional entropy [53]:
\begin{align} S(A,B|C,D) & \leq S(A|C) + S(B|D) \\ S(A,B|C) & \leq S(A|C) + S(B|C) \\ S(A|B,C) & \leq S(A|B) + S(A|C). \end{align}

Following the discovery of non-Shannon-type inequalities for the classical entropy, it became natural to ask whether there exist constraints on the von Neumann entropy beyond strong subadditivity. It was proved a few years later that for a three-party system, no such constraints exist [54]. Subsequently, a constrained inequality for the von Neumann entropy of a four-party system that is independent of strong subadditivity was discovered [55], and a family of countably infinitely many constrained inequalities that are independent of each other and of strong subadditivity was proved [56].
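Strong subadditivity can be spot-checked numerically on random three-qubit states (the partial-trace and entropy helpers below are our own minimal implementations, not from any quantum library):

```python
# Check both forms of strong subadditivity on random density matrices
# of a three-qubit system ABC.
import numpy as np

def random_density(dim, rng):
    B = rng.standard_normal((dim, dim)) + 1j * rng.standard_normal((dim, dim))
    rho = B @ B.conj().T
    return rho / np.trace(rho).real

def partial_trace(rho, keep, dims):
    labels = list(range(len(dims)))
    rho = rho.reshape(list(dims) + list(dims))
    while len(labels) > len(keep):
        p = next(i for i, l in enumerate(labels) if l not in keep)
        m = len(labels)
        rho = np.trace(rho, axis1=p, axis2=p + m)  # trace out subsystem p
        labels.pop(p)
    d = int(np.prod([dims[l] for l in labels]))
    return rho.reshape(d, d)

def S(rho):
    lam = np.linalg.eigvalsh(rho)
    lam = lam[lam > 1e-12]
    return float(-(lam * np.log2(lam)).sum())

rng = np.random.default_rng(0)
dims = (2, 2, 2)                     # subsystems A, B, C
for _ in range(200):
    rho = random_density(8, rng)
    s = {k: S(partial_trace(rho, k, dims))
         for k in [(0,), (1,), (0, 1), (0, 2), (1, 2), (0, 1, 2)]}
    assert s[(0,)] + s[(1,)] <= s[(0, 2)] + s[(1, 2)] + 1e-7
    assert s[(0, 1, 2)] + s[(1,)] <= s[(0, 1)] + s[(1, 2)] + 1e-7
print("strong subadditivity holds on all sampled states")
```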
In this paper, we have developed a framework for universally quantified inequalities. With its root in information theory, this framework provides a geometrical interpretation that captures the very meaning of such inequalities. With this formality, we have revisited three celebrated inequalities in mathematics, namely the AM-GM inequality, Markov’s inequality, and the Cauchy-Schwarz inequality, and clarified the related issues. To demonstrate the power of this formality, we have discussed its application to the study of entropy inequalities, which has yielded very fruitful results in a number of subjects related to the Shannon entropy. Application of this formality to different branches of mathematics can identify situations in which new fundamental inequalities on quantities of interest may exist, and potentially lead to the discovery of such inequalities.
1 The discussion in this section is built upon the concept of “achievability”, which will continue to play a central role in the rest of the paper. We note that achievability is a fundamental concept in information theory for formulating coding theorems. For example, C. E. Shannon’s celebrated source coding theorem [1] states that the coding rate for lossless data compression must be at least equal to the entropy rate of the information source. This means that any coding rate above the entropy rate of the information source is asymptotically achievable by some encoding-decoding schemes.
2 For any A \subset \mathbb{R}^2, if we let
f(a, g) = \left\{ \begin{array}{ll} 1 & \text{if~}(a, g) \in A \\ -1 & \text{if~}(a, g) \notin A, \end{array} \right.

then (a, g) \in A if and only if f(a, g) \geq 0. Thus an inequality of the form (3) can constrain (AM, GM) to any subset of \mathbb{R}^2.
3 In the literature there have been works on sharpening (or refining) the Cauchy-Schwarz inequality, for example [57][58]. The bounds on \langle \mathbf{u}, \mathbf{v}\rangle obtained in these works are tighter than the Cauchy-Schwarz inequality. However, these bounds depend on quantities other than \langle \mathbf{u}, \mathbf{u}\rangle and \langle \mathbf{v}, \mathbf{v}\rangle, and so they are beyond the scope of the current work. Nevertheless, for such a bound, a geometrical framework similar to the one discussed in this section can be introduced, but it would be more complicated because more than three quantities are involved. Generally speaking, if more information about \mathbf{u} and \mathbf{v} is available, then tighter bounds on \langle \mathbf{u}, \mathbf{v}\rangle can be obtained. In particular, if full information about \mathbf{u} and \mathbf{v} is available, i.e., both \mathbf{u} and \mathbf{v} are known, then \langle \mathbf{u}, \mathbf{v}\rangle can be determined exactly.
4 A general discussion on the Shannon entropy and related inequalities can be found in a blog by Terence Tao [59].
5 Equivalently, \mathbf{h} \in \mathcal{H}_n is entropic if it is achievable by some collection of n random variables.
6 For a convex polytope P \subseteq \mathcal{H}_n, a face is any set of the form F = P \cap \{\mathbf{h} \in \mathcal{H}_n : \mathbf{b}^\top \mathbf{h} = c\}, where \mathbf{b}^\top \mathbf{h} \leq c for all \mathbf{h} \in P.
7 In network communication, multicast refers to transmitting a message to a specified subset of destination nodes in the network.
8 The capacity region contains all the achievable information rate tuples of the network coding problem.
9 See for example [[4], Theorem 16.27] for a proof.
10 We refer the reader to [50] for a comprehensive treatment of quantum information theory.