Information Theory and Coding: Information Measures

Posted 2020-11-26 hitgxz

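Information Measures

1. Independence and Markov Chains

Independence

Two random variables \(X\) and \(Y\) are independent, written \(X \perp Y\), if for all \((x, y) \in \mathcal{X} \times \mathcal{Y}\)
\[ p(x, y) = p(x)p(y). \]

Here \(p(x)\), \(p(y)\), and \(p(x, y)\) are shorthand for \(\text{Pr}(X=x)\), \(\text{Pr}(Y=y)\), and \(\text{Pr}(X=x, Y=y)\).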

Mutual Independence

Random variables \(X_{1}, \cdots, X_{n}\) are mutually independent if for all \((x_1, \cdots, x_{n}) \in \mathcal{X}_{1} \times \cdots \times \mathcal{X}_{n}\)
\[ p(x_{1}, \cdots, x_{n}) = p(x_{1})\cdots p(x_{n}). \]

Pairwise Independence

Random variables \(X_{1}, \cdots, X_{n}\) are pairwise independent if \(X_{i}\) and \(X_{j}\) are independent for all \(1 \le i \lt j \le n\).

Mutual independence implies pairwise independence; the converse does not hold in general.
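The gap between the two notions can be checked numerically. The sketch below uses a standard illustrative example that is not part of the original notes: two fair independent bits and their XOR. The helper names `marginal`, `pairwise_independent`, and `mutually_independent` are hypothetical.

```python
from itertools import product

# Assumed example: X1, X2 are fair independent bits and X3 = X1 XOR X2.
joint = {(a, b, a ^ b): 0.25 for a, b in product((0, 1), repeat=2)}

def marginal(pmf, idx):
    """Marginal pmf over the coordinates listed in idx."""
    out = {}
    for outcome, prob in pmf.items():
        key = tuple(outcome[i] for i in idx)
        out[key] = out.get(key, 0.0) + prob
    return out

def pairwise_independent(pmf, n, tol=1e-12):
    """Check p(xi, xj) = p(xi) p(xj) for every pair i < j."""
    for i in range(n):
        for j in range(i + 1, n):
            pij = marginal(pmf, (i, j))
            pi, pj = marginal(pmf, (i,)), marginal(pmf, (j,))
            for (a,), pa in pi.items():
                for (b,), pb in pj.items():
                    if abs(pij.get((a, b), 0.0) - pa * pb) > tol:
                        return False
    return True

def mutually_independent(pmf, n, tol=1e-12):
    """Check p(x1, ..., xn) = p(x1) ... p(xn) on the full product alphabet."""
    singles = [marginal(pmf, (i,)) for i in range(n)]
    supports = [[key[0] for key in s] for s in singles]
    for outcome in product(*supports):
        prod = 1.0
        for i, value in enumerate(outcome):
            prod *= singles[i][(value,)]
        if abs(pmf.get(outcome, 0.0) - prod) > tol:
            return False
    return True

print(pairwise_independent(joint, 3))  # True
print(mutually_independent(joint, 3))  # False
```

Every pair of the three bits is uniform on its four outcomes, yet the triple \((0, 0, 1)\) has probability 0 rather than \(1/8\), so the full product condition fails.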

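Conditional Independence

For random variables \(X, Y, Z\), if
\[ p(x,y,z)p(y) = p(x,y)p(y,z) \]
for all \((x, y, z)\), then \(X\) and \(Z\) are said to be independent conditioned on \(Y\), written \(X \perp Z \mid Y\) or \(X \rightarrow Y \rightarrow Z\).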

Markov Chain

For random variables \(X_{1}, \cdots, X_{n}\) with \(n \ge 3\), \(X_{1} \rightarrow \cdots \rightarrow X_{n}\) forms a Markov chain if
\[ p(x_{1},\cdots,x_{n})p(x_{2})\cdots p(x_{n-1}) = p(x_{1},x_{2})\cdots p(x_{n-1},x_{n}). \]
Equivalent definitions of a Markov chain:

\(p(x_{1},\cdots,x_{n})=\begin{cases}p(x_{1})p(x_{2}|x_{1})\cdots p(x_{n}|x_{n-1}) & \text{if } p(x_{1})\cdots p(x_{n-1}) > 0\\ 0 & \text{otherwise}\end{cases}\),
\(p(x_{t}|x_{1},\cdots,x_{t-1})=p(x_{t}|x_{t-1})\) for \(2 \le t \le n\), whenever the conditioning event has positive probability.

Property:

If \(X_{1} \rightarrow \cdots \rightarrow X_{n}\) is a Markov chain, then \(X_{n} \rightarrow \cdots \rightarrow X_{1}\) is also a Markov chain.
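The product-form definition and the reversal property can be checked directly on a small joint distribution. This is a minimal sketch; the two example distributions and the helper names (`marginal`, `is_markov_chain`, `flip`) are assumptions made for illustration.

```python
from itertools import product

def marginal(pmf, idx):
    """Marginal pmf over the coordinates listed in idx."""
    out = {}
    for outcome, prob in pmf.items():
        key = tuple(outcome[i] for i in idx)
        out[key] = out.get(key, 0.0) + prob
    return out

def is_markov_chain(pmf, tol=1e-12):
    """Check p(x1..xn) p(x2)...p(x_{n-1}) == p(x1,x2)...p(x_{n-1},x_n)."""
    n = len(next(iter(pmf)))
    singles = [marginal(pmf, (i,)) for i in range(n)]
    pairs = [marginal(pmf, (i, i + 1)) for i in range(n - 1)]
    supports = [[key[0] for key in s] for s in singles]
    for x in product(*supports):
        lhs = pmf.get(x, 0.0)
        for i in range(1, n - 1):
            lhs *= singles[i][(x[i],)]
        rhs = 1.0
        for i in range(n - 1):
            rhs *= pairs[i].get((x[i], x[i + 1]), 0.0)
        if abs(lhs - rhs) > tol:
            return False
    return True

def flip(a, b, eps=0.1):
    """Transition probability of a binary channel that flips with prob. eps."""
    return eps if a != b else 1 - eps

# Assumed chain: X1 ~ Bernoulli(0.3), each later symbol is a noisy copy.
p1 = {0: 0.7, 1: 0.3}
chain = {(a, b, c): p1[a] * flip(a, b) * flip(b, c)
         for a, b, c in product((0, 1), repeat=3)}
reversed_chain = {x[::-1]: p for x, p in chain.items()}

# Counterexample: X3 = X1 XOR X2 with fair independent X1, X2.
xor = {(a, b, a ^ b): 0.25 for a, b in product((0, 1), repeat=2)}

print(is_markov_chain(chain))           # True
print(is_markov_chain(reversed_chain))  # True
print(is_markov_chain(xor))             # False
```

The XOR triple fails because \(X_3\) depends on \(X_1\) even after \(X_2\) is known.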

Property: Markov Subchains

Let \(X_{1} \rightarrow \cdots \rightarrow X_{n}\) be a Markov chain and \(\mathcal{N}_{n} = \left\{1, 2, \cdots, n\right\}\). For a subset \(\alpha\) of \(\mathcal{N}_{n}\), let \(X_{\alpha}\) denote \(\left\{X_{i}: i \in \alpha\right\}\). Given disjoint subsets \(\alpha_{1}, \cdots, \alpha_{m}\) of \(\mathcal{N}_{n}\) such that \(k_{1} \lt \cdots \lt k_{m}\) for every choice of \(k_{j} \in \alpha_{j}\), \(j = 1, \cdots, m\), the sequence \(X_{\alpha_{1}} \rightarrow \cdots \rightarrow X_{\alpha_{m}}\) forms a Markov chain.

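2. Shannon's Information Measures

Entropy:

The entropy of a random variable \(X\) is defined as \(\displaystyle H(X) = -\sum_{x\in \mathcal{X}}p(x)\log p(x) = \sum_{x \in \mathcal{X}}p(x)\log \frac{1}{p(x)}\)

Calling \(\displaystyle \log \frac{1}{p(X)}\) the information content of \(X\), the entropy is the expected information content: \(H(X) = E \log\frac{1}{p(X)}\)

Example: entropy of a binary random variable

If \(X \sim \text{Bernoulli}(p)\), then \(H(p) = p \times \log \frac{1}{p} + (1-p) \times \log \frac{1}{1-p}\). Viewed as a function of \(p\), \(H(p)\) attains its maximum value \(\log 2\) at \(p = 0.5\).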
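A short sketch of the binary entropy function follows. It assumes base-2 logarithms (the notes do not fix a base) and the usual convention \(0 \log \frac{1}{0} = 0\); the name `binary_entropy` is hypothetical.

```python
import math

def binary_entropy(p):
    """p*log2(1/p) + (1-p)*log2(1/(1-p)), with 0*log2(1/0) taken as 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# Scan a grid of p values and locate the maximizer.
grid = [i / 100 for i in range(101)]
best = max(grid, key=binary_entropy)

print(best, binary_entropy(best))  # 0.5 1.0
```

The function is symmetric about \(p = 0.5\), so `binary_entropy(0.1)` and `binary_entropy(0.9)` agree.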

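Joint Entropy:

The joint entropy of random variables \(X, Y\) is defined as \(\displaystyle H(X, Y) = -\sum_{x,y}p(x,y)\log p(x,y) = \sum_{x,y}p(x,y)\log \frac{1}{p(x,y)}\)

\(\log \frac{1}{p(X,Y)}\) is the information content of the pair \((X, Y)\).

Conditional Entropy:

For random variables \(X, Y\), the conditional entropy of \(Y\) given \(X\) is defined as
\[ \begin{aligned} H(Y|X) &= \sum_{x}p(x)H(Y|X=x)\\ &= \sum_{x}p(x)\sum_{y}p(y|x)\log \frac{1}{p(y|x)}\\ &= \sum_{x,y}p(x,y)\log \frac{1}{p(y|x)}\\ &= E\log \frac{1}{p(Y|X)} \end{aligned} \]

Relation between joint entropy and conditional entropy: \(H(X,Y)=H(X)+H(Y|X) = H(Y) + H(X|Y)\)

Conditioning on specific values and on random variables can be mixed; only the variables that are not fixed to a value are summed over. For example:

\(\displaystyle H(X,Y|Z,W=w,S=s,U) = \sum_{x,y,z,u}p(x,y,z,u|w,s)\log \frac{1}{p(x,y|z,w,s,u)}\)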

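Mutual Information:

The mutual information between random variables \(X, Y\) is defined as \(\displaystyle I(X;Y) = \sum_{x,y}p(x,y)\log \frac{p(x,y)}{p(x)p(y)} = E \log \frac{p(X,Y)}{p(X)p(Y)}\)

Relation between mutual information and conditional entropy:

\(H(X) = H(X|Y) + I(X;Y)\)

\(H(Y) = H(Y|X) + I(X;Y)\)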
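The identities above can be verified on a concrete joint distribution. The sketch below computes each quantity from its own definition and then compares both sides; the joint pmf `pxy` is an arbitrary assumed example, and base-2 logarithms are assumed.

```python
import math

# Assumed joint pmf p(x, y) on {0, 1} x {0, 1}.
pxy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

def entropy(pmf):
    """Sum of p * log2(1/p) over outcomes with positive probability."""
    return sum(p * math.log2(1 / p) for p in pmf.values() if p > 0)

def marginal(pmf, axis):
    """Marginal pmf of coordinate `axis` (0 for X, 1 for Y)."""
    out = {}
    for xy, p in pmf.items():
        out[xy[axis]] = out.get(xy[axis], 0.0) + p
    return out

px, py = marginal(pxy, 0), marginal(pxy, 1)
H_X, H_Y, H_XY = entropy(px), entropy(py), entropy(pxy)

# Conditional entropies from the definition: sum p(x,y) log 1/p(y|x).
H_Y_given_X = sum(p * math.log2(px[x] / p) for (x, y), p in pxy.items() if p > 0)
H_X_given_Y = sum(p * math.log2(py[y] / p) for (x, y), p in pxy.items() if p > 0)

# Mutual information from the definition: sum p(x,y) log p(x,y)/(p(x)p(y)).
I_XY = sum(p * math.log2(p / (px[x] * py[y])) for (x, y), p in pxy.items() if p > 0)

print(abs(H_XY - (H_X + H_Y_given_X)) < 1e-12)  # True
print(abs(H_X - (H_X_given_Y + I_XY)) < 1e-12)  # True
print(abs(H_Y - (H_Y_given_X + I_XY)) < 1e-12)  # True
```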

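Conditional Mutual Information:

For random variables \(X, Y, Z\), the conditional mutual information between \(X\) and \(Y\) given \(Z\) is defined as
\[ \begin{aligned} I(X;Y|Z) &= \sum_{z}p(z)\sum_{x,y}p(x,y|z)\log\frac{p(x,y|z)}{p(x|z)p(y|z)}\\ &= \sum_{x,y,z}p(x,y,z)\log\frac{p(x,y|z)}{p(x|z)p(y|z)}\\ &= E\log\frac{p(X,Y|Z)}{p(X|Z)p(Y|Z)} \end{aligned} \]
As with entropy, conditioning on a value and on a random variable can be mixed: \(\displaystyle I(X;Y|Z=z,V)=\sum_{x,y,v}p(x,y,v|z)\log\frac{p(x,y|z,v)}{p(x|z,v)p(y|z,v)}\)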

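3. Chain Rules

Chain rule for entropy: \(\displaystyle H(X_{1}, \dots, X_{n})=\sum_{i=1}^{n}H(X_{i} \mid X_{1}, \dots, X_{i-1})\)

Chain rule for conditional entropy: \(\displaystyle H(X_{1}, \dots, X_{n} \mid Y)=\sum_{i=1}^{n}H(X_{i} \mid X_{1}, \dots, X_{i-1},Y)\)

Chain rule for mutual information: \(\displaystyle I(X_{1}, \dots, X_{n};Y) = \sum_{i=1}^{n}I(X_{i};Y \mid X_{1}, \dots, X_{i-1})\)

Chain rule for conditional mutual information: \(\displaystyle I(X_{1}, \dots, X_{n};Y \mid Z) = \sum_{i=1}^{n}I(X_{i};Y \mid X_{1}, \dots, X_{i-1}, Z)\)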
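As a sanity check, the chain rules for entropy and for mutual information can be evaluated term by term on a small example. In this sketch every term is computed from the defining sums with conditional probabilities, not from entropy differences, so the comparison is not circular. The joint pmf over \((X_1, X_2, Y)\) and the helper names `marginal`, `cond_entropy`, and `cond_mi` are assumptions for illustration.

```python
import math
from itertools import product

def marginal(pmf, idx):
    """Marginal pmf over the coordinates listed in idx (idx may be empty)."""
    out = {}
    for outcome, p in pmf.items():
        key = tuple(outcome[i] for i in idx)
        out[key] = out.get(key, 0.0) + p
    return out

def cond_entropy(pmf, target, given=()):
    """H(target | given) = sum p(t, g) * log2(p(g) / p(t, g))."""
    p_tg = marginal(pmf, target + given)
    p_g = marginal(pmf, given)
    return sum(p * math.log2(p_g[key[len(target):]] / p)
               for key, p in p_tg.items() if p > 0)

def cond_mi(pmf, a, b, c=()):
    """I(A;B|C) = sum p(a,b,c) * log2(p(a,b,c) p(c) / (p(a,c) p(b,c)))."""
    p_abc = marginal(pmf, a + b + c)
    p_ac, p_bc, p_c = marginal(pmf, a + c), marginal(pmf, b + c), marginal(pmf, c)
    total = 0.0
    for key, p in p_abc.items():
        if p > 0:
            ka = key[:len(a)]
            kb = key[len(a):len(a) + len(b)]
            kc = key[len(a) + len(b):]
            total += p * math.log2(p * p_c[kc] / (p_ac[ka + kc] * p_bc[kb + kc]))
    return total

# Assumed joint pmf over (X1, X2, Y): weights 1..8 normalized by 36.
outcomes = list(product((0, 1), repeat=3))
joint = {o: w / 36 for o, w in zip(outcomes, range(1, 9))}

# H(X1, X2, Y) versus H(X1) + H(X2|X1) + H(Y|X1, X2)
lhs_H = cond_entropy(joint, (0, 1, 2))
rhs_H = (cond_entropy(joint, (0,)) + cond_entropy(joint, (1,), (0,))
         + cond_entropy(joint, (2,), (0, 1)))

# I(X1, X2; Y) versus I(X1; Y) + I(X2; Y | X1)
lhs_I = cond_mi(joint, (0, 1), (2,))
rhs_I = cond_mi(joint, (0,), (2,)) + cond_mi(joint, (1,), (2,), (0,))

print(abs(lhs_H - rhs_H) < 1e-9, abs(lhs_I - rhs_I) < 1e-9)  # True True
```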

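4. Information Divergence

Information divergence / Kullback-Leibler distance / relative entropy:

The informational divergence between two distributions \(p\) and \(q\) on a common alphabet \(\mathcal{X}\) is defined as
\[ D(p \parallel q) = \sum_{x \in \mathcal{X}}p(x) \log \frac{p(x)}{q(x)} = E_{p}\log \frac{p(X)}{q(X)} \]

Mutual information is the divergence between the joint distribution and the product of the marginals: \(\displaystyle I(X;Y) = D(p(x,y)\parallel p(x)p(y))\)

Property:

For two distributions \(p\) and \(q\) on a common alphabet \(\mathcal{X}\), applying \(\ln a \ge 1 - \frac{1}{a}\) with \(a = \frac{p(x)}{q(x)}\) gives
\[ \begin{aligned} D(p\parallel q) &= \sum_{x \in \mathcal{X}}p(x) \log \frac{p(x)}{q(x)}\\ &= \log e \sum_{x \in \mathcal{X}}p(x) \ln \frac{p(x)}{q(x)}\\ &\ge \log e \sum_{x \in \mathcal{X}} p(x) \left(1 - \frac{q(x)}{p(x)}\right)\\ &= \log e\sum_{x \in \mathcal{X}}(p(x) - q(x))\\ &= 0 \end{aligned} \]
Equality holds if and only if \(p = q\).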
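A small numerical sketch of the divergence follows. The two distributions are assumed examples, base-2 logarithms are assumed, and `kl_divergence` is a hypothetical helper that requires \(q(x) > 0\) wherever \(p(x) > 0\).

```python
import math

def kl_divergence(p, q):
    """D(p || q) in bits; assumes q[x] > 0 wherever p[x] > 0."""
    return sum(px * math.log2(px / q[x]) for x, px in p.items() if px > 0)

p = {"a": 0.5, "b": 0.25, "c": 0.25}
q = {"a": 1 / 3, "b": 1 / 3, "c": 1 / 3}  # uniform on three symbols

d_pq = kl_divergence(p, q)
d_qp = kl_divergence(q, p)

print(d_pq > 0, d_qp > 0)           # True True
print(kl_divergence(p, p))          # 0.0
print(abs(d_pq - d_qp) > 1e-6)      # True: the divergence is not symmetric
```

Because `q` is uniform here, `d_pq` equals \(\log_2 3 - H(p)\), which is non-negative precisely because the uniform distribution maximizes entropy. The asymmetry is one reason the divergence is not a metric despite the name "distance".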

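Metric

A function \(\rho(x, y)\) is a metric if, for all \(x, y, z\):

1. \(\rho(x, y) \ge 0\)
2. \(\rho(x, y) = \rho(y, x)\)
3. \(\rho(x, y) = 0\) if and only if \(x = y\)
4. \(\rho(x, y) + \rho(y, z) \ge \rho(x, z)\)

Example:

\(\rho(X, Y) = H(X|Y) + H(Y|X)\) satisfies conditions 1, 2, and 4. If \(X = Y\) is taken to mean that there is a one-to-one mapping from \(X\) to \(Y\), condition 3 holds as well.

Condition 4 follows because mutual information is bounded by entropy and conditioning does not increase entropy:
\[ \begin{aligned} \rho(X,Z) &= H(X|Z) + H(Z|X)\\ &= I(X;Y|Z) + H(X|Y,Z) + I(Y;Z|X) + H(Z|X,Y)\\ &\le H(Y|Z) + H(X|Y) + H(Y|X)+H(Z|Y)\\ &= H(X|Y) + H(Y|X) + H(Y|Z) + H(Z|Y)\\ &= \rho(X,Y) + \rho(Y,Z) \end{aligned} \]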
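The symmetry and triangle inequality of \(\rho\) can be spot-checked on randomly generated joint distributions. This sketch uses the identity \(H(X|Y) + H(Y|X) = 2H(X,Y) - H(X) - H(Y)\), which follows from \(H(X,Y) = H(Y) + H(X|Y) = H(X) + H(Y|X)\). The alphabet sizes, the seed, and the helper names are arbitrary assumptions.

```python
import math
import random
from itertools import product

def entropy(pmf, idx):
    """Entropy (bits) of the marginal on the coordinates listed in idx."""
    marg = {}
    for outcome, p in pmf.items():
        key = tuple(outcome[i] for i in idx)
        marg[key] = marg.get(key, 0.0) + p
    return sum(p * math.log2(1 / p) for p in marg.values() if p > 0)

def rho(pmf, i, j):
    """rho(Xi, Xj) = H(Xi|Xj) + H(Xj|Xi) = 2 H(Xi,Xj) - H(Xi) - H(Xj)."""
    return 2 * entropy(pmf, (i, j)) - entropy(pmf, (i,)) - entropy(pmf, (j,))

def random_joint(rng, sizes=(2, 3, 2)):
    """Random joint pmf of (X, Y, Z) with the given alphabet sizes."""
    outcomes = list(product(*(range(s) for s in sizes)))
    weights = [rng.random() for _ in outcomes]
    total = sum(weights)
    return {o: w / total for o, w in zip(outcomes, weights)}

rng = random.Random(0)
violations = 0
for _ in range(200):
    pmf = random_joint(rng)
    symmetric = abs(rho(pmf, 0, 1) - rho(pmf, 1, 0)) < 1e-9
    triangle = rho(pmf, 0, 1) + rho(pmf, 1, 2) >= rho(pmf, 0, 2) - 1e-9
    if not (symmetric and triangle):
        violations += 1

print(violations)  # 0
```

A count of zero is what the proof above guarantees; the random trials only illustrate it and are not a substitute for the derivation.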

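Basic Inequalities

Logarithm Inequality: for \(x > 0\), \(\displaystyle \ln x \le x - 1 \Leftrightarrow \ln x \ge 1 - \frac{1}{x}\) (the second form is the first applied to \(\frac{1}{x}\))

Jensen's Inequality: if \(f\) is a convex function, \(\lambda_i \ge 0\), and \(\sum \lambda_i = 1\), then \(\displaystyle f\left(\sum \lambda_i x_i\right) \le \sum \lambda_i f(x_i)\)

Relative Entropy Inequality: for probability distributions \(\{p_i\}\) and \(\{q_i\}\), \(\displaystyle \sum_i p_i \log \frac{p_i}{q_i} \ge 0\), with equality if and only if \(p_i = q_i\) for all \(i\)

Log-Sum Inequality: for non-negative \(u_i\) and positive \(v_i\), \(\displaystyle \sum u_{i} \log \frac{u_i}{v_i} \ge \left(\sum u_{i}\right) \log \frac{\sum u_{i}}{\sum v_{i}}\), with equality if and only if \(\displaystyle \frac{u_{i}}{v_{i}}\) is constant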
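The log-sum inequality and its equality condition can be illustrated numerically. The vectors below are arbitrary assumed values, and the helper names are hypothetical.

```python
import math

def log_sum_lhs(u, v):
    """sum_i u_i * log2(u_i / v_i), skipping terms with u_i = 0."""
    return sum(ui * math.log2(ui / vi) for ui, vi in zip(u, v) if ui > 0)

def log_sum_rhs(u, v):
    """(sum_i u_i) * log2(sum_i u_i / sum_i v_i)."""
    su, sv = sum(u), sum(v)
    return su * math.log2(su / sv)

u = [1.0, 2.0, 3.0]
v = [2.0, 1.0, 4.0]           # ratios u_i / v_i differ: strict inequality
v_prop = [2 * ui for ui in u]  # constant ratio u_i / v_i = 1/2: equality

print(log_sum_lhs(u, v) > log_sum_rhs(u, v))                        # True
print(abs(log_sum_lhs(u, v_prop) - log_sum_rhs(u, v_prop)) < 1e-12)  # True
```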

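Some Inequalities for Information Measures

\(H(X) \ge 0\), with equality if and only if \(X\) is deterministic. Proof: \(H(X) = I(X;X) = D(p(x,x)\parallel p(x) p(x)) \ge 0\)
\(H(Y|X) \ge 0\), with equality if and only if \(Y\) is a function of \(X\). Proof: \(H(Y|X) = I(Y;Y|X) = D(p(y,y|x)\parallel p(y|x)p(y|x))\ge 0\)
\(I(X;Y) \ge 0\), with equality if and only if \(X\) and \(Y\) are independent
\(I(X;Y|Z) \ge 0\), with equality if and only if \(X\) and \(Y\) are conditionally independent given \(Z\)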

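Theorem:

\(H(Y|X) \le H(Y)\), with equality if and only if \(X\) and \(Y\) are independent. Proof: \(H(Y) = H(Y|X) + I(X;Y) \ge H(Y|X)\)

Theorem:

\(\displaystyle H(X_1, X_2, \dots, X_n) \le \sum_{i=1}^{n} H(X_i)\), with equality if and only if the \(X_i\) are mutually independent. Proof: by the chain rule and the previous theorem, \(\displaystyle H(X_1, \dots, X_n) = \sum_{i=1}^{n}H(X_{i}|X_{1}, \dots, X_{i-1}) \le \sum_{i=1}^{n}H(X_i)\)

Theorem:

\(I(X;Y,Z) \ge I(X;Y)\), with equality if and only if \(X \rightarrow Y \rightarrow Z\) forms a Markov chain. Proof: \(I(X;Y,Z) = I(X;Y) + I(X;Z|Y) \ge I(X;Y)\)

Theorem:

If \(U \rightarrow X \rightarrow Y \rightarrow V\) forms a Markov chain, then \(I(X;Y) \ge I(U;V)\). Proof: since \(U \rightarrow X \rightarrow Y\) is a Markov chain, \(I(U;Y|X) = 0\), so \(I(X;Y) = I(U,X;Y)=I(U;Y)+I(X;Y|U) \ge I(U;Y)\). Applying the same argument to the Markov chain \(U \rightarrow Y \rightarrow V\) gives \(I(U;Y) \ge I(U;V)\).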
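The last theorem (a form of the data processing inequality) can be illustrated with a cascade of noisy channels. In this sketch the chain \(U \rightarrow X \rightarrow Y \rightarrow V\) is an assumed example in which \(U\) is a fair bit and each arrow flips the bit with a fixed probability; the crossover values and helper names are hypothetical.

```python
import math
from itertools import product

def bsc(a, b, eps):
    """Transition probability of a binary symmetric channel with crossover eps."""
    return eps if a != b else 1 - eps

# Assumed chain U -> X -> Y -> V with crossover probabilities 0.1, 0.2, 0.1.
joint = {(u, x, y, v): 0.5 * bsc(u, x, 0.1) * bsc(x, y, 0.2) * bsc(y, v, 0.1)
         for u, x, y, v in product((0, 1), repeat=4)}

def marginal(pmf, idx):
    """Marginal pmf over the coordinates listed in idx."""
    out = {}
    for outcome, p in pmf.items():
        key = tuple(outcome[i] for i in idx)
        out[key] = out.get(key, 0.0) + p
    return out

def mutual_information(pmf, i, j):
    """I(Xi; Xj) in bits, computed from the definition."""
    pij = marginal(pmf, (i, j))
    pi, pj = marginal(pmf, (i,)), marginal(pmf, (j,))
    return sum(p * math.log2(p / (pi[(a,)] * pj[(b,)]))
               for (a, b), p in pij.items() if p > 0)

I_XY = mutual_information(joint, 1, 2)
I_UY = mutual_information(joint, 0, 2)
I_UV = mutual_information(joint, 0, 3)

print(I_XY >= I_UY >= I_UV)  # True
```

Each extra processing stage can only lose information about the input, which is why the three values are ordered this way.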

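Theorem:

For a random variable \(X\), the entropy is maximized when \(X\) is uniformly distributed, that is, \(H(X) \le \log \left|\mathcal{X}\right|\). Proof: let \(u(x)\) be the uniform distribution on \(\mathcal{X}\); then \(D(p(x)\parallel u(x)) = \log \left|\mathcal{X}\right| - H(X) \ge 0\).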

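Fano's Inequality:

Let \(X\) be a random variable and \(\hat{X}\) an estimate of \(X\) (with \(X, \hat{X} \in \mathcal{X}\)), and let the error probability be \(P_e = \text{Pr}(X \neq \hat{X})\). Then
\[ H(X\mid \hat{X}) \le h_b(P_e) + P_e \log (\left|\mathcal{X}\right|-1), \]
where \(h_b\) is the binary entropy function.

Proof: define the error indicator \(Y = \mathbf{1}\left\{X \neq \hat{X}\right\}\), so that \(\text{Pr}(Y=1) = P_e\), \(\text{Pr}(Y=0) = 1 - P_e\), and \(H(Y) = h_{b}(P_e)\). Since \(Y\) is a function of \((X, \hat{X})\), \(H(Y|X,\hat{X}) = 0\); and since \(Y = 0\) means \(X = \hat{X}\), \(H(X|Y=0,\hat{X}) = 0\). Given \(Y = 1\) and \(\hat{X} = \hat{x}\), \(X\) takes at most \(\left|\mathcal{X}\right| - 1\) values. Therefore
\[ \begin{aligned} H(X|\hat{X}) &= H(X|\hat{X}) + H(Y|X,\hat{X})\\ &= H(X,Y|\hat{X})\\ &= H(Y|\hat{X})+H(X|Y,\hat{X})\\ &= H(Y|\hat{X}) + \text{Pr}(Y=1)H(X|Y=1,\hat{X})\\ &\le H(Y) + \text{Pr}(Y=1)\sum_{\hat{x} \in \mathcal{X}} \text{Pr}(\hat{X}=\hat{x} \mid Y=1)H(X|Y=1,\hat{X}=\hat{x})\\ &\le H(Y) + \text{Pr}(Y=1)\sum_{\hat{x} \in \mathcal{X}} \text{Pr}(\hat{X}=\hat{x} \mid Y=1)\log (\left|\mathcal{X}\right|-1)\\ &= h_b(P_e) + P_e\log (\left|\mathcal{X}\right|-1) \end{aligned} \]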
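Fano's inequality can be checked on a concrete estimator. The joint pmf of \((X, \hat{X})\) below is an assumed example on a three-symbol alphabet, with base-2 logarithms; the names `h_b`, `joint`, and `fano_bound` are hypothetical.

```python
import math

def h_b(p):
    """Binary entropy function in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# Assumed joint pmf of (X, X_hat); diagonal entries are correct decisions.
joint = {
    (0, 0): 0.30, (0, 1): 0.02, (0, 2): 0.03,
    (1, 0): 0.05, (1, 1): 0.25, (1, 2): 0.00,
    (2, 0): 0.01, (2, 1): 0.04, (2, 2): 0.30,
}
alphabet_size = 3

# Marginal of the estimate X_hat.
p_hat = {}
for (x, x_hat), p in joint.items():
    p_hat[x_hat] = p_hat.get(x_hat, 0.0) + p

# H(X | X_hat) = sum p(x, x_hat) * log2(p(x_hat) / p(x, x_hat))
H_X_given_Xhat = sum(p * math.log2(p_hat[x_hat] / p)
                     for (x, x_hat), p in joint.items() if p > 0)

# Error probability and the Fano bound.
P_e = sum(p for (x, x_hat), p in joint.items() if x != x_hat)
fano_bound = h_b(P_e) + P_e * math.log2(alphabet_size - 1)

print(H_X_given_Xhat <= fano_bound)  # True
```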

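Entropy Rate of a Stationary Source

Discrete-time information source: \(\left\{X_{k}: k \ge 1\right\}\)

Entropy rate: the entropy rate of \(\left\{X_{k}\right\}\) is defined as \(H_X=\displaystyle \lim_{n \rightarrow \infty}\frac{1}{n}H(X_1, X_2, \cdots, X_{n})\), provided the limit exists.

Example:

Let \(\left\{X_{k}\right\}\) be an \(\text{i.i.d.}\) source, and let \(X\) denote the random variable at any time step. Then
\[ \lim_{n \rightarrow \infty}\frac{1}{n}H(X_1, \cdots, X_{n}) = \lim_{n \rightarrow \infty}\frac{n\cdot H(X)}{n} = H(X) \]
The entropy rate exists and equals \(H(X)\).

Example:

Let \(\left\{X_{k}\right\}\) be a source in which the \(X_k\) are mutually independent and \(H(X_{k}) = k\). Then
\[ \lim_{n \rightarrow \infty}\frac{1}{n}H(X_1, \cdots, X_{n}) = \lim_{n \rightarrow \infty}\frac{1}{n}\sum_{k=1}^{n}k = \lim_{n \rightarrow \infty}\frac{n+1}{2} = \infty \]
The limit diverges, so the entropy rate does not exist.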

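Stationary information source: a source \(\left\{X_{k}\right\}\) is stationary if, for all \(m, l \ge 1\), \(X_1, X_2, \cdots, X_m\) and \(X_{1+l}, X_{2+l}, \cdots, X_{m+l}\) have the same joint distribution.

Definition: \(\displaystyle H_X^{'} = \lim_{n \rightarrow \infty}H(X_n|X_1, X_2, \dots, X_{n-1})\)

Theorem: for a stationary source \(\left\{X_k\right\}\), the entropy rate \(H_X\) exists and \(H_X = H_{X}^{'}\)

Proof: conditioning does not increase entropy, and by stationarity the index can be shifted by one, so \(H(X_n|X_1, X_2, \dots, X_{n-1}) \le H(X_n|X_2, \dots, X_{n-1})=H(X_{n-1}|X_1, X_2, \dots, X_{n-2})\). Let \(a_n = H(X_n|X_1, X_2, \dots, X_{n-1})\). The sequence \(a_n\) is non-increasing and bounded below by 0, so its limit exists.

Since the running averages (Cesàro means) of a convergent sequence converge to the same limit, the chain rule gives \(\displaystyle H_{X}^{'} = \lim_{n \rightarrow \infty}a_{n} = \lim_{n \rightarrow \infty}\frac{\sum_{i=1}^{n}a_i}{n}=\lim_{n \rightarrow \infty} \frac{1}{n}\sum_{i=1}^{n}H(X_i|X_1, X_2, \dots, X_{i-1}) = \lim_{n \rightarrow \infty}\frac{1}{n}H(X_1, \dots, X_n) = H_{X}\)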
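The theorem can be illustrated on a stationary two-state Markov source, an example chosen here for convenience and not taken from the notes. The transition matrix `P` and its stationary distribution `pi` are assumed values. The sketch computes \(H(X_1, \dots, X_n)\) by brute-force enumeration, then derives \(a_n\) as successive differences (chain rule) and the per-symbol averages.

```python
import math
from itertools import product

# Assumed transition matrix and its stationary distribution (pi P = pi).
P = [[0.9, 0.1], [0.4, 0.6]]
pi = [0.8, 0.2]

def joint_entropy(n):
    """H(X_1, ..., X_n) in bits, enumerating all 2^n sequences."""
    total = 0.0
    for seq in product((0, 1), repeat=n):
        p = pi[seq[0]]
        for a, b in zip(seq, seq[1:]):
            p *= P[a][b]
        if p > 0:
            total += p * math.log2(1 / p)
    return total

N = 10
H = [0.0] + [joint_entropy(n) for n in range(1, N + 1)]  # H[n] = H(X_1..X_n)
a = [H[n] - H[n - 1] for n in range(1, N + 1)]           # a_n = H(X_n | X_1..X_{n-1})
avg = [H[n] / n for n in range(1, N + 1)]                # (1/n) H(X_1..X_n)

# For a stationary Markov source the limit of a_n is H(X_2 | X_1).
limit = sum(pi[i] * P[i][j] * math.log2(1 / P[i][j])
            for i in range(2) for j in range(2))

for n in (1, 2, 5, 10):
    print(n, round(a[n - 1], 4), round(avg[n - 1], 4))
```

In this example \(a_n\) already equals its limit from \(n = 2\) onward, while the average \(\frac{1}{n}H(X_1, \dots, X_n)\) decreases toward the same value more slowly, since it still carries the larger first term \(H(X_1)\).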
