• 《Bayes-Optimal Hierarchical Multi-label Classification》 - TKDE


    This paper systematically summarizes the classical loss functions for hierarchical multi-label classification (HMC), and extends the Hamming loss and the ranking loss to support class hierarchies.
    Reading Difficulty: ⋆⋆
    Creativity: ⋆⋆
    Comprehensiveness: ⋆⋆⋆⋆⋆

    Symbol System:

    $y_i \in \{0,1\}$ : the label for class $i$
    $\uparrow(i), \downarrow(i), \Uparrow(i), \Downarrow(i), \Leftrightarrow(i)$ : the parent, children, ancestors, descendants, and siblings of node $i$
    $\mathbf{y}_{\mathbf{i}} \in \{0,1\}^{|\mathbf{i}|}$ : the label vector for the classes in $\mathbf{i}$
    $\mathcal{H} = \{0, \dots, N-1\}$ : the class hierarchy, where $N$ is the number of nodes
    $I(x)$ : the indicator function, 1 when $x$ is true and 0 otherwise
    $\mathcal{R}$ : the conditional risk

    Hierarchy Constraints
    In HMC, if the label structure is a tree, we have:
    $y_i = 1 \Rightarrow y_{\uparrow(i)} = 1.$

    For DAG-type HMC, there are two interpretations:

    1. AND-interpretation: $y_i = 1 \Rightarrow \mathbf{y}_{\uparrow(i)} = \mathbf{1}$, i.e., all parents of a positive node must be positive.

    2. OR-interpretation: $y_i = 1 \Rightarrow \exists\, j \in \uparrow(i) \text{ such that } y_j = 1$, i.e., at least one parent of a positive node must be positive.
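    As a concrete illustration, here is a minimal Python sketch (my own encoding, not from the paper) that checks whether a 0/1 label vector respects the hierarchy constraint for a tree or under the AND/OR interpretations for a DAG; `parents[i]` listing the parents of node $i$ is an assumed representation.

    ```python
    # Minimal sketch (not from the paper): check the hierarchy constraint on a
    # 0/1 label vector. `parents[i]` lists the parents of node i (empty for the root).

    def satisfies_hierarchy(y, parents, interpretation="AND"):
        """AND: every positive node needs all parents positive (tree / DAG-AND).
        OR:  every positive node needs at least one positive parent (DAG-OR)."""
        for i, label in enumerate(y):
            if label == 1 and parents[i]:
                parent_labels = [y[p] for p in parents[i]]
                if interpretation == "AND" and not all(parent_labels):
                    return False
                if interpretation == "OR" and not any(parent_labels):
                    return False
        return True

    # Tree: 0 is the root, 1 and 2 are its children, 3 is a child of 1.
    parents = [[], [0], [0], [1]]
    print(satisfies_hierarchy([1, 1, 0, 1], parents))  # True
    print(satisfies_hierarchy([1, 0, 0, 1], parents))  # False: node 3 positive, its parent 1 negative
    ```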

    Loss functions for Flat and Hierarchical Classification
    It is a review.

    Zero-one loss:
    $\ell_{0/1}(\hat{\mathbf{y}}, \mathbf{y}) = I(\hat{\mathbf{y}} \neq \mathbf{y})$

    Hamming loss:
    $\ell_{\text{hamming}}(\hat{\mathbf{y}}, \mathbf{y}) = \sum_{i \in \mathcal{H}} I(\hat{y}_i \neq y_i)$

    Top-$k$ precision:
    It considers the $k$ most-confident predicted labels for each sample:
    $\text{top-}k\text{-precision}(\hat{\mathbf{y}}, \mathbf{y}) = \dfrac{\text{number of true positives among the top-}k\text{ labels of } \hat{\mathbf{y}}}{k}$
    So the loss is
    $\ell_{\text{top-}k} = 1 - \text{top-}k\text{-precision}$

    Ranking loss:
    $\ell_{\text{rank}} = \sum_{(i,j): y_i > y_j} \left( I(\hat{y}_i < \hat{y}_j) + \tfrac{1}{2} I(\hat{y}_i = \hat{y}_j) \right)$
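    To make the flat losses above concrete, here is a small Python sketch with my own helper names; `scores` holds real-valued confidences and `y` the ground-truth 0/1 labels.

    ```python
    # Sketch of the flat losses reviewed above (variable names are my own).

    def hamming_loss(y_hat, y):
        # number of label positions where prediction and ground truth disagree
        return sum(int(a != b) for a, b in zip(y_hat, y))

    def top_k_loss(scores, y, k):
        # 1 - (true positives among the k highest-scored labels) / k
        top_k = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
        return 1.0 - sum(y[i] for i in top_k) / k

    def ranking_loss(scores, y):
        # pairs (i, j) with y_i > y_j: penalty 1 if mis-ordered, 1/2 on ties
        loss = 0.0
        for i in range(len(y)):
            for j in range(len(y)):
                if y[i] > y[j]:
                    if scores[i] < scores[j]:
                        loss += 1.0
                    elif scores[i] == scores[j]:
                        loss += 0.5
        return loss

    print(hamming_loss([1, 0, 1], [1, 1, 1]))         # 1
    print(top_k_loss([0.9, 0.2, 0.7], [1, 0, 1], 2))  # 0.0
    print(ranking_loss([0.9, 0.2, 0.2], [1, 0, 1]))   # 0.5 (one tie)
    ```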

    Hierarchical Multi-class Classification
    A review.
    Note: Only a single path can be predicted positive.

    Cai and Hofmann:
    $\ell = \sum_{i \in \mathcal{H}} c_i I(\hat{y}_i \neq y_i)$
    where $c_i$ is the cost for node $i$.

    Dekel et al.:
    This loss seems more complicated, but the paper treats it as essentially similar to the loss above.

    Hierarchical multi-label classification

    H-Loss:
    $\ell_H = \alpha \sum_{i: y_i = 1, \hat{y}_i = 0} c_i I(\hat{\mathbf{y}}_{\Uparrow(i)} = \mathbf{y}_{\Uparrow(i)}) + \beta \sum_{i: y_i = 0, \hat{y}_i = 1} c_i I(\hat{\mathbf{y}}_{\Uparrow(i)} = \mathbf{y}_{\Uparrow(i)})$
    where $\alpha$ and $\beta$ are the weights for false negatives (FN) and false positives (FP); the indicator means an error at node $i$ is only counted when all of its ancestors are predicted correctly.

    Often, misclassifications at upper levels of the hierarchy are considered more expensive than those at lower levels.
    Thus, a common cost-assignment approach is
    $c_i = \begin{cases} 1, & i = 0, \\ \dfrac{c_{\uparrow(i)}}{n_{\Leftrightarrow(i)}}, & i > 0, \end{cases}$
    where $n_{\Leftrightarrow(i)}$ is the number of siblings of $i$ (including $i$ itself).
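    A short sketch of this cost assignment on a tree (the `children` dictionary is my own representation): the root gets cost 1, and each node inherits its parent's cost divided by the number of siblings, itself included.

    ```python
    # Sketch of the tree cost assignment: c_0 = 1, c_i = c_parent(i) / n_siblings(i).
    def tree_costs(children, root=0):
        c = {root: 1.0}
        stack = [root]
        while stack:
            node = stack.pop()
            kids = children.get(node, [])
            for kid in kids:
                c[kid] = c[node] / len(kids)  # each kid has len(kids) siblings incl. itself
                stack.append(kid)
        return c

    # Root 0 with children {1, 2}; node 1 with children {3, 4, 5}.
    print(tree_costs({0: [1, 2], 1: [3, 4, 5]}))
    # {0: 1.0, 1: 0.5, 2: 0.5, 3: 0.166..., 4: 0.166..., 5: 0.166...}
    ```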

    Matching Loss:
    $\ell_{\text{match}} = \alpha \sum_{i: y_i = 1} \phi(i, \hat{\mathbf{y}}) + \beta \sum_{i: \hat{y}_i = 1} \phi(i, \mathbf{y})$
    where
    $\phi(i, \mathbf{y}) = \min_{j: y_j = 1} \text{cost}(j \rightarrow i)$
    and $\text{cost}(j \rightarrow i)$ is the cost of traversing from node $j$ to node $i$ in the hierarchy, e.g., the path length or a weighted path length.
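    Below is a hedged sketch of $\phi(i, \mathbf{y})$ under the simplest reading: unit edge costs and undirected traversal of the hierarchy, so the cost is just the hop distance to the nearest positive node (the adjacency encoding is my own).

    ```python
    # Sketch of phi(i, y) = min_{j: y_j = 1} cost(j -> i), assuming unit edge costs
    # and undirected traversal, i.e. BFS hop distance to the nearest positive node.
    from collections import deque

    def phi(i, y, adjacency):
        dist = {i: 0}
        queue = deque([i])
        while queue:
            u = queue.popleft()
            if y[u] == 1:
                return dist[u]          # nearest positive node reached
            for v in adjacency[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        return float("inf")             # no positive node at all

    # Chain 0 - 1 - 2 with only the root positive: phi(2, y) = 2.
    adjacency = {0: [1], 1: [0, 2], 2: [1]}
    print(phi(2, [1, 0, 0], adjacency))  # 2
    ```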

    Verspoor et al.: hierarchical versions of precision, recall and F-score, but these are more expensive to compute.

    Condensing Sort and Selection Algorithm for HMC
    It is a review.
    It can be used on both tree and DAG hierarchies.

    It solves the following optimization objective via a greedy algorithm called the condensing sort and selection algorithm:
    $\max_{\{\psi_i\}_{i \in \mathcal{H}}} \sum_{i \in \mathcal{H}} \psi_i \tilde{y}_i$
    $\text{s.t.} \quad \psi_i \leq \psi_{\uparrow(i)}, \ \forall i \in \mathcal{H} \setminus \{0\}; \quad \psi_0 = 1; \quad \psi_i \in \{0,1\}; \quad \sum_{i=0}^{N-1} \psi_i = L,$

    where $\psi_i = 1$ indicates that node $i$ is predicted positive in $\hat{\mathbf{y}}$, and 0 otherwise.

    When the label hierarchy is a DAG, the first constraint of the above objective has to be replaced by
    $\psi_i \leq \psi_j, \ \forall i \in \mathcal{H} \setminus \{0\}, \ \forall j \in \Uparrow(i).$
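    This is not the CSSA greedy procedure itself, but a brute-force sketch that enumerates all feasible $\psi$ for a tiny tree, just to make the objective and its constraints concrete; `parents` and `y_tilde` are my own names.

    ```python
    # Brute-force sketch of the objective above (only for tiny hierarchies):
    # maximize sum_i psi_i * y_tilde_i subject to psi_0 = 1, psi_i <= psi of each
    # parent, and exactly L positive nodes. Not the greedy CSSA algorithm.
    from itertools import product

    def best_support_of_size(y_tilde, parents, L):
        n = len(y_tilde)
        best, best_val = None, float("-inf")
        for psi in product([0, 1], repeat=n):
            if psi[0] != 1 or sum(psi) != L:
                continue
            if any(psi[i] > psi[j] for i in range(1, n) for j in parents[i]):
                continue  # violates psi_i <= psi_parent
            val = sum(p * v for p, v in zip(psi, y_tilde))
            if val > best_val:
                best, best_val = psi, val
        return best, best_val

    parents = [[], [0], [0], [1]]            # tree: 0 -> {1, 2}, 1 -> {3}
    print(best_support_of_size([0.5, 0.25, 0.1, 0.25], parents, L=3))
    # ((1, 1, 0, 1), 1.0)
    ```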

    Extending Flat Losses
    This paper extends the Hamming loss and the ranking loss to support the class hierarchy.

    For the hierarchical Hamming loss:
    $\ell_{\text{H-hamming}} = \alpha \sum_{i: y_i = 1 \wedge \hat{y}_i = 0} c_i + \beta \sum_{i: y_i = 0 \wedge \hat{y}_i = 1} c_i$

    For a DAG class hierarchy, the costs are assigned as
    $c_i = \begin{cases} 1, & i = 0, \\ \sum_{j \in \Uparrow(i)} \dfrac{c_j}{n_{\downarrow(j)}}, & i > 0, \end{cases}$
    where $n_{\downarrow(j)}$ is the number of children of node $j$.

    There are some special cases in the original paper, but they are straightforward and not discussed here.
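    Here is a hedged sketch of the DAG cost assignment, under the assumption that the sum effectively runs over the immediate parents of $i$ (so that a tree reduces to the earlier rule); nodes are visited in topological order so that parent costs are available first.

    ```python
    # Sketch of the DAG cost assignment (assuming the sum is over immediate parents):
    # c_0 = 1, c_i = sum over parents j of c_j / (number of children of j).
    def dag_costs(parents, children, topo_order, root=0):
        c = {root: 1.0}
        for i in topo_order:
            if i == root:
                continue
            c[i] = sum(c[j] / len(children[j]) for j in parents[i])
        return c

    # Diamond DAG: 0 -> {1, 2}, and node 3 has both 1 and 2 as parents.
    parents  = {0: [], 1: [0], 2: [0], 3: [1, 2]}
    children = {0: [1, 2], 1: [3], 2: [3], 3: []}
    print(dag_costs(parents, children, topo_order=[0, 1, 2, 3]))
    # {0: 1.0, 1: 0.5, 2: 0.5, 3: 1.0}
    ```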

    For the hierarchical ranking loss:
    $\ell_{\text{H-rank}} = \sum_{(i,j): y_i > y_j} c_{ij} \left( I(\hat{y}_i < \hat{y}_j) + \tfrac{1}{2} I(\hat{y}_i = \hat{y}_j) \right),$

    where $c_{ij} = c_i c_j$ ensures a high penalty when an upper-level positive label is ranked after a lower-level negative label.
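    A quick sketch of this weighted ranking loss (variable names are my own), showing how $c_{ij} = c_i c_j$ scales the penalty for mis-ranked pairs:

    ```python
    # Sketch of the hierarchical ranking loss with pairwise weights c_ij = c_i * c_j.
    def h_ranking_loss(scores, y, c):
        loss = 0.0
        for i in range(len(y)):
            for j in range(len(y)):
                if y[i] > y[j]:                    # positive i vs. negative j
                    w = c[i] * c[j]
                    if scores[i] < scores[j]:
                        loss += w                  # mis-ranked pair
                    elif scores[i] == scores[j]:
                        loss += 0.5 * w            # tie
        return loss

    # Upper-level positive node 0 (c = 1.0) is scored below lower-level negative node 2 (c = 0.25).
    print(h_ranking_loss([0.3, 0.9, 0.6], [1, 1, 0], c=[1.0, 0.5, 0.25]))  # 0.25
    ```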

    Minimizing the risk
    The conditional risk (or simply the risk) $\mathcal{R}(\hat{\mathbf{y}})$ of predicting the multilabel $\hat{\mathbf{y}}$ is the expectation of $\ell(\hat{\mathbf{y}}, \mathbf{y})$ over all possible ground truths $\mathbf{y}$, i.e., $\mathcal{R}(\hat{\mathbf{y}}) = \sum_{\mathbf{y}} \ell(\hat{\mathbf{y}}, \mathbf{y}) P(\mathbf{y} \mid \mathbf{x})$.
    The Bayes-optimal prediction minimizes this risk (expected risk minimization): $\hat{\mathbf{y}}^\star = \argmin_{\hat{\mathbf{y}} \in \Omega} \mathcal{R}(\hat{\mathbf{y}}) = \argmin_{\hat{\mathbf{y}} \in \Omega} \sum_{\mathbf{y}} \ell(\hat{\mathbf{y}}, \mathbf{y}) P(\mathbf{y} \mid \mathbf{x}).$

    There are three issues to be addressed:
    (1) Estimating $P(\mathbf{y} \mid \mathbf{x})$.
    (2) Computing $\mathcal{R}(\hat{\mathbf{y}})$ without exhaustive search.
    (3) Minimizing $\mathcal{R}(\hat{\mathbf{y}})$.

    This paper computes $p_i = P(y_i = 1 \mid \mathbf{x})$ via the chain rule over the hierarchy, and the risk is rewritten into different forms for the different losses.
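    Before the per-loss risks, a hedged sketch of one way the marginals $p_i$ could be obtained via the chain rule on a tree, assuming local conditional estimates $P(y_i = 1 \mid y_{\uparrow(i)} = 1, \mathbf{x})$ are available (here simply given as numbers; the helper names are my own):

    ```python
    # Sketch: chain rule along the path to the root, p_i = product of local conditionals.
    def marginals_from_conditionals(cond, parent):
        """cond[i] = P(y_i = 1 | y_parent(i) = 1, x); parent[i] is None for the root."""
        p = {}
        def marginal(i):
            if i not in p:
                p[i] = cond[i] if parent[i] is None else cond[i] * marginal(parent[i])
            return p[i]
        for i in cond:
            marginal(i)
        return p

    cond = {0: 1.0, 1: 0.8, 2: 0.5, 3: 0.5}   # root taken as always positive
    parent = {0: None, 1: 0, 2: 0, 3: 1}
    print(marginals_from_conditionals(cond, parent))
    # {0: 1.0, 1: 0.8, 2: 0.5, 3: 0.4}
    ```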
    The risk for the matching loss:
    $\mathcal{R}_{\text{match}}(\hat{\mathbf{y}}) = \sum_{i: \hat{y}_i = 0} \phi(i, \hat{\mathbf{y}}) + \sum_{i: \hat{y}_i = 1} q_i$

    where $q_i = \sum_{j=0}^{d(i)-1} \sum_{l=j+1}^{d(i)} c_{\Uparrow_l(i)} P(\mathbf{y}_{\Uparrow_{0:j}(i)} = \mathbf{1}, y_{\Uparrow_{j+1}(i)} = 0 \mid \mathbf{x})$, $d(i)$ is the depth of node $i$, $\Uparrow_j(i)$ is $i$'s ancestor at depth $j$, and $\Uparrow_{0:j}(i) = \{\Uparrow_0(i), \dots, \Uparrow_j(i)\}$ is the set of $i$'s ancestors at depths 0 to $j$.

    The risk for the hierarchical Hamming loss:
    $\mathcal{R}_{\text{H-hamming}}(\hat{\mathbf{y}}) = \alpha \sum_{i: \hat{y}_i = 0} c_i p_i + \beta \sum_{i: \hat{y}_i = 1} c_i (1 - p_i)$
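    Given the marginals $p_i$, this risk decomposes over nodes; a tiny sketch with my own example values:

    ```python
    # Sketch of the H-Hamming risk: expected FN cost for predicted negatives plus
    # expected FP cost for predicted positives.
    def h_hamming_risk(y_hat, p, c, alpha=1.0, beta=1.0):
        risk = 0.0
        for i, pred in enumerate(y_hat):
            if pred == 0:
                risk += alpha * c[i] * p[i]           # false negative in expectation
            else:
                risk += beta * c[i] * (1.0 - p[i])    # false positive in expectation
        return risk

    p = [0.75, 0.5, 0.25]      # marginals P(y_i = 1 | x)
    c = [1.0, 0.5, 0.5]        # node costs
    print(h_hamming_risk([1, 1, 0], p, c))  # 1*0.25 + 0.5*0.5 + 0.5*0.25 = 0.625
    ```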

    The risk for the hierarchical ranking loss:
    $\mathcal{R}_{\text{H-rank}}(\hat{\mathbf{y}}) = \sum_{0 \leq i < j \leq N-1} c_{ij} \left( p_i I(\hat{y}_i \leq \hat{y}_j) + p_j I(\hat{y}_i \geq \hat{y}_j) + \tfrac{p_i + p_j}{2} I(\hat{y}_i = \hat{y}_j) \right) - C$

    Efficiently minimizing the risk:
    $\hat{\mathbf{y}} = \argmin_{L = 1, \dots, N} \mathcal{R}(\hat{\mathbf{y}}^\star_{(L)}),$
    where
    $\hat{\mathbf{y}}^\star_{(L)} = \argmin_{\hat{\mathbf{y}} \in \Omega, \ |\text{supp}(\hat{\mathbf{y}})| = L} \mathcal{R}(\hat{\mathbf{y}})$
    and $\text{supp}(f) := \{x \in X \mid f(x) \neq 0\}$ is the support of $f$ (here, the set of labels predicted positive).

    This is actually a rather simple and intuitive optimization strategy: optimize separately for each possible number of positive labels, i.e., for each different $L$, and then take the best over all $L$.

    This paper adopts the CSSAG (condensing sort and selection algorithm, proposed by Bi et al.) for the tree label hierarchy, which is a greedy strategy.
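    For intuition only, here is a brute-force sketch (clearly not the greedy CSSAG) that enumerates hierarchy-consistent predictions of every support size $L$, scores them with the H-Hamming risk, and keeps the overall minimizer; all inputs are my own toy example.

    ```python
    # Brute-force sketch of min over L of R(y_hat*_(L)), using the H-Hamming risk.
    # Exhaustive, so only viable for tiny hierarchies; the paper uses a greedy
    # CSSA-style algorithm instead.
    from itertools import product

    def h_hamming_risk(y_hat, p, c, alpha=1.0, beta=1.0):
        return sum(alpha * c[i] * p[i] if v == 0 else beta * c[i] * (1.0 - p[i])
                   for i, v in enumerate(y_hat))

    def minimise_risk(p, c, parents):
        n = len(p)
        best = {}  # best (prediction, risk) for each support size L
        for y_hat in product([0, 1], repeat=n):
            if y_hat[0] != 1:
                continue  # root is always predicted positive (L >= 1)
            if any(y_hat[i] > y_hat[j] for i in range(1, n) for j in parents[i]):
                continue  # violates the hierarchy constraint
            L, risk = sum(y_hat), h_hamming_risk(y_hat, p, c)
            if L not in best or risk < best[L][1]:
                best[L] = (y_hat, risk)
        return min(best.values(), key=lambda pair: pair[1])  # outer minimization over L

    parents = [[], [0], [0], [1]]          # tree: 0 -> {1, 2}, 1 -> {3}
    p = [0.75, 0.5, 0.25, 0.5]             # marginals P(y_i = 1 | x)
    c = [1.0, 0.5, 0.5, 0.5]               # node costs
    print(minimise_risk(p, c, parents))    # one of the minimum-risk predictions and its risk
    ```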

    Conclusions

    This paper extends matching loss, hamming loss and ranking loss to support tree-type as well as DAG-type class hierarchies.
    This paper seems easy to understand and does not contain much innovation, but it is well organized and highly comprehensive, so it was published in TKDE.

  • Original article: https://blog.csdn.net/wuyanxue/article/details/126797398