The previous two sections used the Forward Algorithm and the Backward Algorithm, respectively, to solve the Evaluation problem. This section introduces the Learning problem and solves it with the EM algorithm.
The Learning problem is nothing new to us: in essence, given a data set $\mathcal X$, we use $\mathcal X$ to infer the model's parameter information $\theta$:
Among the models introduced previously:
The Hidden Markov Model introduced in this section can likewise be solved with the standard (narrow-sense) EM algorithm.
For the Hidden Markov Model, we first introduce the setting and the relevant mathematical notation:
The parameters of the Hidden Markov Model are denoted $\lambda$; they consist of three parts:
As introduced in the section on the Forward Algorithm, if we compute $P(\mathcal O \mid \lambda)$ directly, the resulting expression is:
$$P(\mathcal O \mid \lambda) = \sum_{\mathcal I} P(\mathcal O,\mathcal I \mid \lambda) = \sum_{i_1} \cdots \sum_{i_T} \left[\pi_{i_1} \cdot \prod_{t=2}^{T} a_{i_{t-1},i_{t}} \cdot \prod_{t=1}^{T} b_{i_t}(o_t)\right]$$
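To make the cost of this direct expansion concrete, here is a minimal brute-force sketch in Python. The model size, the parameter values (`pi`, `A`, `B`) and the observation sequence `O` are all made up for illustration; the point is only that the enumeration runs over all $K^T$ state sequences.

```python
import itertools
import numpy as np

# Hypothetical 2-state, 2-symbol HMM; all numbers are made up for illustration.
pi = np.array([0.6, 0.4])                  # initial distribution pi
A  = np.array([[0.7, 0.3],                 # A[i, j] = a_{ij}: transition i -> j
               [0.4, 0.6]])
B  = np.array([[0.9, 0.1],                 # B[i, o] = b_i(o): emission probability
               [0.2, 0.8]])
O  = [0, 1, 1, 0]                          # an example observation sequence
T, K = len(O), len(pi)

def joint_prob(I):
    """P(O, I | lambda) = pi_{i_1} * prod_{t=2}^T a_{i_{t-1}, i_t} * prod_{t=1}^T b_{i_t}(o_t)."""
    p = pi[I[0]]
    for t in range(1, T):
        p *= A[I[t - 1], I[t]]
    for t in range(T):
        p *= B[I[t], O[t]]
    return p

# P(O | lambda): sum over all K^T possible state sequences -- exponential in T.
p_O = sum(joint_prob(I) for I in itertools.product(range(K), repeat=T))
print(p_O)
```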
If we apply Maximum Likelihood Estimation to $P(\mathcal O \mid \lambda)$ directly, i.e.:
$$\hat {\lambda}_{MLE} = \mathop{\arg\max}\limits_{\lambda} P(\mathcal O \mid \lambda)$$
Solving it directly runs into many problems:
The EM algorithm is written as follows:
$$\theta^{(t+1)} = \mathop{\arg\max}\limits_{\theta} \int_{\mathcal Z} \log P(\mathcal X,\mathcal Z \mid \theta)\cdot P(\mathcal Z \mid \mathcal X,\theta^{(t)})\, d\mathcal Z$$
Map the variables appearing in the EM algorithm to the corresponding concepts in the Hidden Markov Model:
After this mapping, the EM algorithm for the Hidden Markov Model is written as follows:
The 'state sequence' $\mathcal I$ is discrete, so the integral is replaced by a summation $\sum$;
$$\lambda^{(t+1)} = \mathop{\arg\max}\limits_{\lambda} \sum_{\mathcal I} \left[\log P(\mathcal O,\mathcal I \mid \lambda) \cdot P(\mathcal I \mid \mathcal O, \lambda^{(t)})\right]$$
To simplify the computation, we transform the above formula:
Conditional probability formula~
$$\lambda^{(t+1)} = \mathop{\arg\max}\limits_{\lambda} \sum_{\mathcal I} \left[\log P(\mathcal O,\mathcal I \mid \lambda) \cdot \frac{P(\mathcal I,\mathcal O \mid \lambda^{(t)})}{P(\mathcal O \mid \lambda^{(t)})}\right]$$
Look at the denominator of the second factor, $P(\mathcal O \mid \lambda^{(t)})$: since $\lambda^{(t)}$ is already known from the previous iteration, this term is a constant with respect to $\lambda$ and does not affect the $\arg\max$.
The EM algorithm can therefore be simplified to the following form:
Note: after this simplification the result is no longer in the form of an expectation~
$$\lambda^{(t+1)} = \mathop{\arg\max}\limits_{\lambda} \sum_{\mathcal I} \left[\log P(\mathcal I,\mathcal O \mid \lambda) \cdot P(\mathcal I,\mathcal O \mid \lambda^{(t)})\right]$$
Moreover, $\lambda^{(t)}$ and the updated $\lambda^{(t+1)}$ are each not a single parameter but a collection of three parameters:
$$\lambda^{(t)} = (\pi^{(t)}, \mathcal A^{(t)},\mathcal B^{(t)}); \quad \lambda^{(t+1)} = (\pi^{(t+1)},\mathcal A^{(t+1)},\mathcal B^{(t+1)})$$
Write the formula part of the EM algorithm as a function of $\lambda$ and $\lambda^{(t)}$:
$$\mathcal Q(\lambda,\lambda^{(t)}) = \sum_{\mathcal I} \left[\log P(\mathcal I,\mathcal O\mid \lambda) \cdot P(\mathcal I,\mathcal O \mid \lambda^{(t)})\right]$$
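As a sanity check of what $\mathcal Q(\lambda,\lambda^{(t)})$ actually computes, here is a brute-force sketch that evaluates it by enumerating every state sequence. `lam` and `lam_t` are hypothetical tuples `(pi, A, B)`; a real implementation would never enumerate like this, since the sum again contains $K^T$ terms.

```python
import itertools
import numpy as np

def joint_prob(lam, I, O):
    """P(O, I | lambda) for a fixed state sequence I, with lam = (pi, A, B)."""
    pi, A, B = lam
    p = pi[I[0]] * B[I[0], O[0]]
    for t in range(1, len(O)):
        p *= A[I[t - 1], I[t]] * B[I[t], O[t]]
    return p

def Q(lam, lam_t, O, K):
    """Brute-force Q(lambda, lambda^{(t)}) = sum_I log P(I,O|lambda) * P(I,O|lambda^{(t)}).

    Assumes strictly positive parameters so the log stays finite.
    """
    total = 0.0
    for I in itertools.product(range(K), repeat=len(O)):
        p_new = joint_prob(lam, I, O)      # log term: depends on the new parameters lambda
        p_old = joint_prob(lam_t, I, O)    # weight: computed under the old lambda^{(t)}
        total += np.log(p_new) * p_old
    return total
```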
Substituting the expression obtained from the direct expansion, $P(\mathcal O,\mathcal I \mid \lambda) = \pi_{i_1} \cdot \prod_{t=2}^{T} a_{i_{t-1},i_{t}} \cdot \prod_{t=1}^{T} b_{i_t}(o_t)$, into $\mathcal Q(\lambda,\lambda^{(t)})$:
$$\begin{aligned}\mathcal Q(\lambda,\lambda^{(t)}) &= \sum_{\mathcal I} \left[\log \left(\pi_{i_1} \cdot \prod_{t=2}^{T} a_{i_{t-1},i_{t}} \cdot \prod_{t=1}^{T} b_{i_t}(o_t)\right) \cdot P(\mathcal I,\mathcal O \mid \lambda^{(t)})\right]\\ &= \sum_{\mathcal I} \left[\left(\log \pi_{i_1} + \log \prod_{t=2}^{T} a_{i_{t-1},i_{t}} + \log \prod_{t=1}^{T} b_{i_t}(o_t)\right) \cdot P(\mathcal I,\mathcal O \mid \lambda^{(t)})\right]\\ &= \sum_{\mathcal I} \left[\left(\log \pi_{i_1} + \sum_{t=2}^{T}\log a_{i_{t-1},i_{t}} + \sum_{t=1}^{T} \log b_{i_t}(o_t)\right) \cdot P(\mathcal I,\mathcal O \mid \lambda^{(t)})\right]\end{aligned}$$
Here we take the solution of $\pi^{(t+1)}$ as an example.
Looking at the parenthesized part of the expression above, only $\log \pi_{i_1}$ depends on $\pi$; the remaining two terms are unrelated to $\pi$ and can be treated as constants. Therefore, the $\mathcal Q(\lambda,\lambda^{(t)})$ relevant to solving for $\pi^{(t+1)}$ is:
$$\begin{aligned}\pi^{(t+1)} &= \mathop{\arg\max}\limits_{\pi} \mathcal Q(\lambda,\lambda^{(t)})\\ &= \mathop{\arg\max}\limits_{\pi} \sum_{\mathcal I} \left[\log \pi_{i_1} \cdot P(\mathcal O,\mathcal I \mid \lambda^{(t)})\right]\\ &= \mathop{\arg\max}\limits_{\pi} \sum_{i_1} \cdots \sum_{i_T} \left[\log \pi_{i_1} \cdot P(\mathcal O,i_1,\cdots,i_T \mid \lambda^{(t)})\right]\end{aligned}$$
Observing this expansion: by definition, the initial probability distribution $\pi$ of the state sequence $\mathcal I$ depends only on the first state variable $i_1$ and is unrelated to the other state variables. Summing over $i_2,\cdots,i_T$ therefore marginalizes them out, $\sum_{i_2}\cdots\sum_{i_T} P(\mathcal O,i_1,\cdots,i_T \mid \lambda^{(t)}) = P(\mathcal O,i_1 \mid \lambda^{(t)})$, and the expression simplifies to the following form:
Since the value of $i_1$ is discrete and is an element of the state set $\mathcal Q$, we rewrite $\sum_{i_1}$ as $\sum_{k=1}^{\mathcal K}$.
$$\begin{aligned}\pi^{(t+1)} &= \mathop{\arg\max}\limits_{\pi} \sum_{i_1} \left[\log \pi_{i_1} \cdot P(\mathcal O,i_1 \mid \lambda^{(t)})\right]\\ &= \mathop{\arg\max}\limits_{\pi} \sum_{k=1}^{\mathcal K} \left[\log P(i_1 = q_k) \cdot P(\mathcal O,i_1 = q_k \mid \lambda^{(t)})\right]\end{aligned}$$
Note that writing $\pi$ as a vector of probabilities comes with a constraint, namely:
$$\sum_{k=1}^{\mathcal K} P(i_1 = q_k) = 1$$
At this point, solving for $\pi^{(t+1)}$ has been transformed into an optimization problem with a single constraint:
$$\begin{cases}\mathop{\arg\max}\limits_{\pi} \sum_{k=1}^{\mathcal K} \left[\log P(i_1 = q_k) \cdot P(\mathcal O,i_1 = q_k \mid \lambda^{(t)})\right]\\ \text{s.t.} \quad \sum_{k=1}^{\mathcal K} P(i_1 = q_k) = 1\end{cases}$$
We solve this problem with the method of Lagrange multipliers:
Define the Lagrangian $\mathcal L(\pi,\eta)$ as follows:
Which of $\sum_{k=1}^{\mathcal K} P(i_1 = q_k)$ and $1$ is subtracted from the other depends on the situation~
$$\mathcal L(\pi,\eta) = \sum_{k=1}^{\mathcal K} \left[\log P(i_1 = q_k) \cdot P(\mathcal O,i_1=q_k \mid \lambda^{(t)})\right] + \eta \left(\sum_{k=1}^{\mathcal K} P(i_1 = q_k) - 1\right)$$
Let $\pi_k = P(i_1 = q_k)$, i.e. one component of the vector $\pi$. Taking the partial derivative of $\mathcal L(\pi,\eta)$ with respect to $\pi_k$:
$$\frac{\partial \mathcal L}{\partial \pi_k} = \frac{1}{\pi_k} P(\mathcal O,i_1 = q_k \mid \lambda^{(t)}) + \eta(1 - 0)$$
Setting $\frac{\partial \mathcal L}{\partial \pi_k} \triangleq 0$ and multiplying both sides by $\pi_k$, we have:
$$P(\mathcal O,i_1 = q_k\mid \lambda^{(t)}) + \pi_k \cdot\eta = 0$$
Summing the above equation over all $k$ then gives:
$$\sum_{k=1}^{\mathcal K} \left[P(\mathcal O,i_1 = q_k\mid \lambda^{(t)}) + \pi_k \cdot \eta \right] = 0$$
This equation can be interpreted from two angles:
Expanding $\sum_{k=1}^{\mathcal K} \left[P(\mathcal O,i_1 = q_k\mid \lambda^{(t)}) + \pi_k \cdot \eta \right] = 0$:
Marginal probability distribution: $\sum_{k=1}^{\mathcal K} P(\mathcal O,i_1 = q_k \mid \lambda^{(t)}) = P(\mathcal O \mid \lambda^{(t)})$~
Constraint: $\sum_{k=1}^{\mathcal K} \pi_k \cdot \eta = \eta \cdot \sum_{k=1}^{\mathcal K} P(i_1 = q_k) = \eta$~
The final result is:
$$P(\mathcal O \mid \lambda^{(t)}) + \eta = 0 \implies \eta =-P(\mathcal O \mid \lambda^{(t)})$$
Substituting $\eta =-P(\mathcal O \mid \lambda^{(t)})$ back into $P(\mathcal O,i_1 = q_k\mid \lambda^{(t)}) + \pi_k \cdot\eta = 0$, the final result for $\pi_k$ is:
$$\pi_k^{(t+1)} = \hat P(i_1 = q_k)=\frac{P(\mathcal O,i_1 = q_k \mid \lambda^{(t)})}{P(\mathcal O \mid \lambda^{(t)})}$$
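As a quick numeric sanity check that this closed form really is the constrained maximizer, the sketch below compares it with a general-purpose constrained optimizer. The weights `w` stand in for $P(\mathcal O,i_1 = q_k \mid \lambda^{(t)})$ and are made-up values.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical weights w_k playing the role of P(O, i_1 = q_k | lambda^{(t)}).
w = np.array([0.03, 0.01, 0.06])
K = len(w)

# Maximize sum_k log(pi_k) * w_k  subject to  sum_k pi_k = 1,  pi_k > 0.
res = minimize(
    lambda p: -np.sum(np.log(p) * w),
    x0=np.full(K, 1.0 / K),
    bounds=[(1e-9, 1.0)] * K,
    constraints=[{"type": "eq", "fun": lambda p: p.sum() - 1.0}],
)
print(res.x)          # numeric optimum
print(w / w.sum())    # closed form pi_k = w_k / sum_j w_j from the Lagrange derivation
```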
In the same way, the updates for the other components can be solved, yielding the updated initial probability distribution $\pi^{(t+1)}$:
$$\pi^{(t+1)} = \left(\pi_1^{(t+1)},\pi_2^{(t+1)},\cdots,\pi_{\mathcal K}^{(t+1)}\right)^{T}$$
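A minimal sketch of computing $\pi^{(t+1)}$ on a made-up two-state model. Here $P(\mathcal O,i_1 = q_k \mid \lambda^{(t)})$ is obtained by brute-force summation over the remaining states, purely to mirror the derivation; in practice it would be read off from the forward/backward variables of the previous sections.

```python
import itertools
import numpy as np

# Hypothetical current parameters lambda^{(t)}; all numbers are made up.
pi_t = np.array([0.5, 0.5])
A_t  = np.array([[0.7, 0.3],
                 [0.4, 0.6]])
B_t  = np.array([[0.9, 0.1],
                 [0.2, 0.8]])
O    = [0, 1, 0]
T, K = len(O), len(pi_t)

def joint_prob(I):
    """P(O, I | lambda^{(t)}) for a fixed state sequence I."""
    p = pi_t[I[0]] * B_t[I[0], O[0]]
    for t in range(1, T):
        p *= A_t[I[t - 1], I[t]] * B_t[I[t], O[t]]
    return p

# P(O, i_1 = q_k | lambda^{(t)}): fix i_1 = q_k and sum out the remaining states.
joint_i1 = np.array([
    sum(joint_prob((k,) + rest)
        for rest in itertools.product(range(K), repeat=T - 1))
    for k in range(K)
])

# pi_k^{(t+1)} = P(O, i_1 = q_k | lambda^{(t)}) / P(O | lambda^{(t)})
pi_next = joint_i1 / joint_i1.sum()
print(pi_next)                             # a valid probability vector summing to 1
```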
At this point we have found how to update $\pi^{(t+1)}$ from $\lambda^{(t)}$; $\mathcal A^{(t+1)}$ and $\mathcal B^{(t+1)}$ can be obtained in a similar way, as sketched below.
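For completeness, below is a hedged sketch of one full Baum-Welch (EM) iteration built on the forward/backward variables from the earlier sections. Only the $\pi$ update was derived in this section; the $\mathcal A$ and $\mathcal B$ updates shown here follow the standard Baum-Welch formulas and should be read as an assumed, illustrative completion rather than as part of this derivation.

```python
import numpy as np

def baum_welch_step(pi, A, B, O):
    """One EM iteration lambda^{(t)} -> lambda^{(t+1)} for a discrete HMM (illustrative sketch)."""
    O = np.asarray(O)
    T, K = len(O), len(pi)

    # Forward variables: alpha[t, i] = P(o_1..o_t, i_t = q_i | lambda)
    alpha = np.zeros((T, K))
    alpha[0] = pi * B[:, O[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, O[t]]

    # Backward variables: beta[t, i] = P(o_{t+1}..o_T | i_t = q_i, lambda)
    beta = np.zeros((T, K))
    beta[T - 1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, O[t + 1]] * beta[t + 1])

    p_O = alpha[T - 1].sum()                         # P(O | lambda)

    # gamma[t, i] = P(i_t = q_i | O, lambda)
    gamma = alpha * beta / p_O
    # xi[t, i, j] = P(i_t = q_i, i_{t+1} = q_j | O, lambda), for t = 1..T-1
    xi = (alpha[:-1, :, None] * A[None, :, :]
          * (B[:, O[1:]].T * beta[1:])[:, None, :]) / p_O

    # M-step: the pi update is exactly the formula derived above; the A, B updates
    # are the standard Baum-Welch formulas (assumed here, not derived in this section).
    pi_new = gamma[0]
    A_new = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    B_new = np.zeros_like(B)
    for o in range(B.shape[1]):
        B_new[:, o] = gamma[O == o].sum(axis=0)
    B_new /= gamma.sum(axis=0)[:, None]
    return pi_new, A_new, B_new
```

Repeatedly calling `baum_welch_step` until the parameters stop changing would give one (locally optimal) estimate of $\lambda$.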
The next section will introduce the Decoding problem.
References:
机器学习-隐马尔可夫模型5-Learning问题-Baum Welch算法(EM)