Def.1 给定数据集 D = { x 1 , x 2 , ⋯ , x n } , ( x i ∈ R d ) \mathbf{D}=\{ \mathbf{x}_1,\mathbf{x}_2,\cdots,\mathbf{x}_n\},(\mathbf{x}_i\in \mathbb{R}^d) D={x1,x2,⋯,xn},(xi∈Rd), D \mathbf{D} D 的一个聚类是指 D \mathbf{D} D 的划分 C = { C 1 , C 2 , ⋯ , C k } \mathcal{C}=\{C_1,C_2,\cdots,C_k \} C={C1,C2,⋯,Ck} s.t. C i ⊆ D , C i ∩ C j = ∅ , ∪ i = 1 k C i = D C_i\subseteq \mathbf{D},C_i \cap C_j=\emptyset, \cup_{i=1}^k C_i=\mathbf{D} Ci⊆D,Ci∩Cj=∅,∪i=1kCi=D;
称聚类 A = { A 1 , ⋯ , A r } \mathcal{A}=\{A_1,\cdots,A_r\} A={A1,⋯,Ar} 是聚类 B = { B 1 , ⋯ , B s } \mathcal{B}=\{B_1,\cdots,B_s\} B={B1,⋯,Bs} 的嵌套,如果 r > s r>s r>s,且对于 ∀ A i ∈ A \forall A_i \in \mathcal{A} ∀Ai∈A ,存在 B j ∈ B B_j \in \mathcal{B} Bj∈B 使得 A i ⊆ B j A_i \subseteq B_j Ai⊆Bj
D \mathbf{D} D 的分层聚类是指一个嵌套聚类序列 C 1 , ⋯ , C n \mathcal{C}_1,\cdots,\mathcal{C}_n C1,⋯,Cn,其中 C 1 = { { x 1 } , { x 2 } , ⋯ , { x n } } , ⋯ , C n = { { x 1 , x 2 , ⋯ , x n } } \mathcal{C}_1=\{ \{\mathbf{x}_1\},\{\mathbf{x}_2\},\cdots,\{\mathbf{x}_n\}\},\cdots,\mathcal{C}_n=\{\{ \mathbf{x}_1,\mathbf{x}_2,\cdots,\mathbf{x}_n\} \} C1={{x1},{x2},⋯,{xn}},⋯,Cn={{x1,x2,⋯,xn}},且 C t \mathcal{C}_t Ct 是 C t + 1 \mathcal{C}_{t+1} Ct+1 的嵌套。
Def.2 分层聚类示图的顶点集是指所有在 C 1 , ⋯ , C n \mathcal{C}_1,\cdots,\mathcal{C}_n C1,⋯,Cn 中出现的元,如果 C i ∈ C t C_i \in \mathcal{C}_t Ci∈Ct 且 C j ∈ C t + 1 C_j \in \mathcal{C}_{t+1} Cj∈Ct+1 满足,则 C i C_i Ci 与 C j C_j Cj 之间有一条边。

事实:
算法14.1 :
输入: D , k \mathbf{D}, k D,k
输出: C \mathcal{C} C
问题:如何定义/计算 C i , C j C_i,C_j Ci,Cj 的距离,即 δ ( C i , C j ) \delta(C_i,C_j) δ(Ci,Cj) ?
δ ( C i , C j ) \delta(C_i,C_j) δ(Ci,Cj) 有以下五种不同方式:
简单连接: δ ( C i , C j ) : = min { δ ( x , y ) ∣ x ∈ C i , y ∈ C j } \delta(C_i,C_j):= \min \{\delta(\mathbf{x},\mathbf{y}) | \mathbf{x} \in C_i, \mathbf{y} \in C_j\} δ(Ci,Cj):=min{δ(x,y)∣x∈Ci,y∈Cj}
完全连接: δ ( C i , C j ) : = max { δ ( x , y ) ∣ x ∈ C i , y ∈ C j } \delta(C_i,C_j):= \max \{\delta(\mathbf{x},\mathbf{y}) | \mathbf{x} \in C_i, \mathbf{y} \in C_j\} δ(Ci,Cj):=max{δ(x,y)∣x∈Ci,y∈Cj}
组群平均: δ ( C i , C j ) : = ∑ x ∈ C i ∑ y ∈ C j δ ( x , y ) n i ⋅ n j , n i = ∣ C i ∣ , n j = ∣ C j ∣ \delta(C_i,C_j):= \frac{\sum\limits_{\mathbf{x} \in C_i}\sum\limits_{\mathbf{y} \in C_j}\delta(\mathbf{x},\mathbf{y})}{n_i \cdot n_j}, n_i=|C_i|,n_j=|C_j| δ(Ci,Cj):=ni⋅njx∈Ci∑y∈Cj∑δ(x,y),ni=∣Ci∣,nj=∣Cj∣
均值距离: δ ( C i , C j ) : = ∣ ∣ μ i − μ j ∣ ∣ 2 , μ i = 1 n ∑ x ∈ C i x , μ j = 1 n ∑ y ∈ C j y \delta(C_i,C_j):= ||\boldsymbol{\mu}_i-\boldsymbol{\mu}_j|| ^2,\boldsymbol{\mu}_i=\frac{1}{n}\sum\limits_{\mathbf{x} \in C_i}\mathbf{x},\boldsymbol{\mu}_j=\frac{1}{n}\sum\limits_{\mathbf{y} \in C_j}\mathbf{y} δ(Ci,Cj):=∣∣μi−μj∣∣2,μi=n1x∈Ci∑x,μj=n1y∈Cj∑y
极小方差:对任意 C i C_i Ci,定义平方误差和 S S E i = ∑ x ∈ C i ∣ ∣ x − μ i ∣ ∣ 2 SSE_i= \sum\limits_{\mathbf{x} \in C_i} ||\mathbf{x}-\boldsymbol{\mu}_i|| ^2 SSEi=x∈Ci∑∣∣x−μi∣∣2
对 C i , C j , S S E i j : = ∑ x ∈ C i ∪ C j ∣ ∣ x − μ i j ∣ ∣ 2 C_i,C_j,SSE_{ij}:=\sum\limits_{\mathbf{x} \in C_i\cup C_j} ||\mathbf{x}-\boldsymbol{\mu}_{ij}|| ^2 Ci,Cj,SSEij:=x∈Ci∪Cj∑∣∣x−μij∣∣2,其中 μ i j : = 1 n i + n j ∑ x ∈ C i ∪ C j x \boldsymbol{\mu}_{ij}:=\frac{1}{n_i+n_j}\sum\limits_{\mathbf{x} \in C_i\cup C_j}\mathbf{x} μij:=ni+nj1x∈Ci∪Cj∑x
δ ( C i , C j ) : = S S E i j − S S E i − S S E j \delta(C_i,C_j):=SSE_{ij}-SSE_i-SSE_j δ(Ci,Cj):=SSEij−SSEi−SSEj
证明: δ ( C i , C j ) = n i n j n i + n j ∣ ∣ μ i − μ j ∣ ∣ 2 \delta(C_i,C_j)=\frac{n_in_j}{n_i+n_j}||\boldsymbol{\mu}_i-\boldsymbol{\mu}_j|| ^2 δ(Ci,Cj)=ni+njninj∣∣μi−μj∣∣2
简记: C i j : = C i ∪ C j , n i j : = n i + n j C_{ij}:=C_i\cup C_j,n_{ij}:=n_i+n_j Cij:=Ci∪Cj,nij:=ni+nj
注意:
C
i
∩
C
j
=
∅
C_i \cap C_j=\emptyset
Ci∩Cj=∅,故
∣
C
i
j
∣
=
n
i
+
n
j
|C_{ij}|=n_i+n_j
∣Cij∣=ni+nj
δ
(
C
i
,
C
j
)
=
∑
z
∈
C
i
j
∥
z
−
μ
i
j
∥
2
−
∑
x
∈
C
i
∥
x
−
μ
i
∥
2
−
∑
y
∈
C
j
∥
y
−
μ
j
∥
2
=
∑
z
∈
C
i
j
z
T
z
−
n
i
j
μ
i
j
T
μ
i
j
−
∑
x
∈
C
i
x
T
x
+
n
i
μ
i
T
μ
i
−
∑
y
∈
C
j
y
T
y
+
n
j
μ
j
T
μ
j
=
n
i
μ
i
T
μ
i
+
n
j
μ
j
T
μ
j
−
(
n
i
+
n
j
)
μ
i
j
T
μ
i
j
注意到:
μ
i
j
=
1
n
i
j
∑
z
∈
C
i
j
z
=
1
n
i
+
n
j
(
∑
x
∈
C
i
x
+
∑
y
∈
C
j
y
)
=
1
n
i
+
n
j
(
n
i
μ
i
+
n
j
μ
j
)
\boldsymbol{\mu}_{i j}=\frac{1}{n_{ij}}\sum\limits_{\mathbf{z} \in C_{ij}} \mathbf{z}=\frac{1}{n_i+n_j}(\sum\limits_{\mathbf{x} \in C_{i}} \mathbf{x}+\sum\limits_{\mathbf{y} \in C_{j}} \mathbf{y})=\frac{1}{n_i+n_j}(n_i\boldsymbol{\mu}_{i}+n_j\boldsymbol{\mu}_{j})
μij=nij1z∈Cij∑z=ni+nj1(x∈Ci∑x+y∈Cj∑y)=ni+nj1(niμi+njμj)
故:
μ
i
j
T
μ
i
j
=
1
(
n
i
+
n
j
)
2
(
n
i
2
μ
i
T
μ
i
+
2
n
i
n
j
μ
i
T
μ
j
+
n
j
2
μ
j
T
μ
j
)
\boldsymbol{\mu}_{i j}^{T} \boldsymbol{\mu}_{i j}=\frac{1}{\left(n_{i}+n_{j}\right)^{2}}\left(n_{i}^{2} \boldsymbol{\mu}_{i}^{T} \boldsymbol{\mu}_{i}+2 n_{i} n_{j} \boldsymbol{\mu}_{i}^{T} \boldsymbol{\mu}_{j}+n_{j}^{2} \boldsymbol{\mu}_{j}^{T} \boldsymbol{\mu}_{j}\right)
μijTμij=(ni+nj)21(ni2μiTμi+2ninjμiTμj+nj2μjTμj)
δ
(
C
i
,
C
j
)
=
n
i
μ
i
T
μ
i
+
n
j
μ
j
T
μ
j
−
1
(
n
i
+
n
j
)
(
n
i
2
μ
i
T
μ
i
+
2
n
i
n
j
μ
i
T
μ
j
+
n
j
2
μ
j
T
μ
j
)
=
n
i
(
n
i
+
n
j
)
μ
i
T
μ
i
+
n
j
(
n
i
+
n
j
)
μ
j
T
μ
j
−
n
i
2
μ
i
T
μ
i
−
2
n
i
n
j
μ
i
T
μ
j
−
n
j
2
μ
j
T
μ
j
n
i
+
n
j
=
n
i
n
j
(
μ
i
T
μ
i
−
2
μ
i
T
μ
j
+
μ
j
T
μ
j
)
n
i
+
n
j
=
(
n
i
n
j
n
i
+
n
j
)
∥
μ
i
−
μ
j
∥
2
问题:如何快速计算算法14.1 第7步:更新矩阵?
☆ Lance–Williams formula
δ
(
C
i
j
,
C
r
)
=
α
i
⋅
δ
(
C
i
,
C
r
)
+
α
j
⋅
δ
(
C
j
,
C
r
)
+
β
⋅
δ
(
C
i
,
C
j
)
+
γ
⋅
∣
δ
(
C
i
,
C
r
)
−
δ
(
C
j
,
C
r
)
∣
| Measure | α i \alpha_i αi | α j \alpha_j αj | β \beta β | γ \gamma γ |
|---|---|---|---|---|
| 简单连接 | 1 2 1\over2 21 | 1 2 1\over2 21 | 0 0 0 | − 1 2 -{1\over2} −21 |
| 完全连接 | 1 2 1\over2 21 | 1 2 1\over2 21 | 0 0 0 | 1 2 1\over2 21 |
| 组群平均 | n i n i + n j \frac{n_i}{n_i+n_j} ni+njni | n j n i + n j \frac{n_j}{n_i+n_j} ni+njnj | 0 0 0 | 0 0 0 |
| 均值距离 | n i n i + n j \frac{n_i}{n_i+n_j} ni+njni | n j n i + n j \frac{n_j}{n_i+n_j} ni+njnj | − n i n j ( n i + n j ) 2 \frac{-n_in_j}{(n_i+n_j)^2} (ni+nj)2−ninj | 0 0 0 |
| 极小方差 | n i + n r n i + n j + n r \frac{n_i+n_r}{n_i+n_j+n_r} ni+nj+nrni+nr | n j + n r n i + n j + n r \frac{n_j+n_r}{n_i+n_j+n_r} ni+nj+nrnj+nr | − n r n i + n j + n r \frac{-n_r}{n_i+n_j+n_r} ni+nj+nr−nr | 0 0 0 |
Proof:
简单连接
δ
(
C
i
j
,
C
r
)
=
min
{
δ
(
x
,
y
)
∣
x
∈
C
i
j
,
y
∈
C
r
}
=
min
{
δ
(
C
i
,
C
r
)
,
δ
(
C
j
,
C
r
)
}
a
=
a
+
b
−
∣
a
−
b
∣
2
,
b
=
a
+
b
+
∣
a
−
b
∣
2

完全连接
见上图
组群平均
δ
(
C
i
j
,
C
r
)
=
∑
x
∈
C
i
∪
C
j
∑
y
∈
C
r
δ
(
x
,
y
)
(
n
i
+
n
j
)
⋅
n
r
=
∑
x
∈
C
i
∑
y
∈
C
r
δ
(
x
,
y
)
+
∑
x
∈
C
j
∑
y
∈
C
r
δ
(
x
,
y
)
(
n
i
+
n
j
)
⋅
n
r
=
n
i
n
r
δ
(
C
i
,
C
r
)
+
n
j
n
r
δ
(
C
j
,
C
r
)
(
n
i
+
n
j
)
⋅
n
r
=
n
i
δ
(
C
i
,
C
r
)
+
n
j
δ
(
C
j
,
C
r
)
(
n
i
+
n
j
)
均值距离:作业
极小方差
基于均值距离的结论再代入 δ ( C i , C j ) = n i n j n i + n j ∣ ∣ μ i − μ j ∣ ∣ 2 \delta(C_i,C_j)=\frac{n_in_j}{n_i+n_j}||\boldsymbol{\mu}_i-\boldsymbol{\mu}_j|| ^2 δ(Ci,Cj)=ni+njninj∣∣μi−μj∣∣2