• 《Bayes-Optimal Hierarchical Multi-label Classification》 - TKDE


    This paper systematically summarizes the classical loss functions for hierarchical multi-label classification (HMC), and extends the Hamming loss and the ranking loss to support class hierarchies.
    Reading Difficulty: ⋆⋆
    Creativity: ⋆⋆
    Comprehensiveness: ⋆⋆⋆⋆⋆

    Symbol System:

    $y_i \in \{0,1\}$ : the label for class $i$
    $\uparrow(i), \downarrow(i), \Uparrow(i), \Downarrow(i), \Leftrightarrow(i)$ : the parent, children, ancestors, descendants, and siblings of node $i$
    $\mathbf{y}_{\mathbf{i}} \in \{0,1\}^{|\mathbf{i}|}$ : the label vector for the classes in $\mathbf{i}$
    $\mathcal{H} = \{0, \dots, N-1\}$ : the class hierarchy, where $N$ is the number of nodes
    $I(x)$ : the indicator function, 1 when $x$ is true and 0 otherwise
    $\mathcal{R}$ : the conditional risk

    Hierarchy Constraints
    In HMC, if the label structure is a tree, we have:
    $y_i = 1 \Rightarrow y_{\uparrow(i)} = 1.$

    For DAG-type HMC, there are two interpretations:

    1. AND-interpretation: $y_i = 1 \Rightarrow \mathbf{y}_{\uparrow(i)} = \mathbf{1}$, i.e., all parents of a positive node must be positive.

    2. OR-interpretation: $y_i = 1 \Rightarrow \exists\, j \in \uparrow(i) \text{ such that } y_j = 1$, i.e., at least one parent of a positive node must be positive.
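    As a concrete illustration, here is a minimal Python sketch (my own encoding, not from the paper) that checks whether a 0/1 label vector respects the hierarchy constraint for a tree or under the AND/OR interpretations for a DAG; `parents[i]` listing the parents of node $i$ is an assumed representation.

    ```python
    # Minimal sketch (not from the paper): check the hierarchy constraint on a
    # 0/1 label vector. `parents[i]` lists the parents of node i (empty for the root).

    def satisfies_hierarchy(y, parents, interpretation="AND"):
        """AND: every positive node needs all parents positive (tree / DAG-AND).
        OR:  every positive node needs at least one positive parent (DAG-OR)."""
        for i, label in enumerate(y):
            if label == 1 and parents[i]:
                parent_labels = [y[p] for p in parents[i]]
                if interpretation == "AND" and not all(parent_labels):
                    return False
                if interpretation == "OR" and not any(parent_labels):
                    return False
        return True

    # Tree: 0 is the root, 1 and 2 are its children, 3 is a child of 1.
    parents = [[], [0], [0], [1]]
    print(satisfies_hierarchy([1, 1, 0, 1], parents))  # True
    print(satisfies_hierarchy([1, 0, 0, 1], parents))  # False: node 3 positive, its parent 1 negative
    ```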

    Loss functions for Flat and Hierarchical Classification
    It is a review.

    Zero-one loss:
    $\ell_{0/1}(\hat{\mathbf{y}}, \mathbf{y}) = I(\hat{\mathbf{y}} \neq \mathbf{y})$

    Hamming loss:
    $\ell_{\text{hamming}}(\hat{\mathbf{y}}, \mathbf{y}) = \sum_{i \in \mathcal{H}} I(\hat{y}_i \neq y_i)$

    Top-$k$ precision:
    It considers the $k$ most-confident predicted labels for each sample:
    $\text{top-}k\text{-precision}(\hat{\mathbf{y}}, \mathbf{y}) = \dfrac{\text{number of true positives among the top-}k\text{ labels of } \hat{\mathbf{y}}}{k}$
    So the loss is
    $\ell_{\text{top-}k} = 1 - \text{top-}k\text{-precision}$

    Ranking loss:
    $\ell_{\text{rank}} = \sum_{(i,j): y_i > y_j} \left( I(\hat{y}_i < \hat{y}_j) + \tfrac{1}{2} I(\hat{y}_i = \hat{y}_j) \right)$
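    To make the flat losses above concrete, here is a small Python sketch with my own helper names; `scores` holds real-valued confidences and `y` the ground-truth 0/1 labels.

    ```python
    # Sketch of the flat losses reviewed above (variable names are my own).

    def hamming_loss(y_hat, y):
        # number of label positions where prediction and ground truth disagree
        return sum(int(a != b) for a, b in zip(y_hat, y))

    def top_k_loss(scores, y, k):
        # 1 - (true positives among the k highest-scored labels) / k
        top_k = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
        return 1.0 - sum(y[i] for i in top_k) / k

    def ranking_loss(scores, y):
        # pairs (i, j) with y_i > y_j: penalty 1 if mis-ordered, 1/2 on ties
        loss = 0.0
        for i in range(len(y)):
            for j in range(len(y)):
                if y[i] > y[j]:
                    if scores[i] < scores[j]:
                        loss += 1.0
                    elif scores[i] == scores[j]:
                        loss += 0.5
        return loss

    print(hamming_loss([1, 0, 1], [1, 1, 1]))         # 1
    print(top_k_loss([0.9, 0.2, 0.7], [1, 0, 1], 2))  # 0.0
    print(ranking_loss([0.9, 0.2, 0.2], [1, 0, 1]))   # 0.5 (one tie)
    ```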

    Hierarchical Multi-class Classification
    A review.
    Note: Only a single path can be predicted positive.

    Cai and Hofmann:
    $\ell = \sum_{i \in \mathcal{H}} c_i I(\hat{y}_i \neq y_i)$
    where $c_i$ is the cost for node $i$.

    Dekel et al.:
    This loss seems more complicated, but the paper treats it as essentially similar to the loss above.

    Hierarchical multi-label classification

    H-Loss:
    $\ell_H = \alpha \sum_{i: y_i = 1, \hat{y}_i = 0} c_i I(\hat{\mathbf{y}}_{\Uparrow(i)} = \mathbf{y}_{\Uparrow(i)}) + \beta \sum_{i: y_i = 0, \hat{y}_i = 1} c_i I(\hat{\mathbf{y}}_{\Uparrow(i)} = \mathbf{y}_{\Uparrow(i)})$
    where $\alpha$ and $\beta$ are the weights for false negatives (FN) and false positives (FP); the indicator means an error at node $i$ is only counted when all of its ancestors are predicted correctly.

    Often, misclassifications at upper levels of the hierarchy are considered more expensive than those at lower levels.
    Thus, a common cost-assignment approach is
    $c_i = \begin{cases} 1, & i = 0, \\ \dfrac{c_{\uparrow(i)}}{n_{\Leftrightarrow(i)}}, & i > 0, \end{cases}$
    where $n_{\Leftrightarrow(i)}$ is the number of siblings of $i$ (including $i$ itself).
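    A short sketch of this cost assignment on a tree (the `children` dictionary is my own representation): the root gets cost 1, and each node inherits its parent's cost divided by the number of siblings, itself included.

    ```python
    # Sketch of the tree cost assignment: c_0 = 1, c_i = c_parent(i) / n_siblings(i).
    def tree_costs(children, root=0):
        c = {root: 1.0}
        stack = [root]
        while stack:
            node = stack.pop()
            kids = children.get(node, [])
            for kid in kids:
                c[kid] = c[node] / len(kids)  # each kid has len(kids) siblings incl. itself
                stack.append(kid)
        return c

    # Root 0 with children {1, 2}; node 1 with children {3, 4, 5}.
    print(tree_costs({0: [1, 2], 1: [3, 4, 5]}))
    # {0: 1.0, 1: 0.5, 2: 0.5, 3: 0.166..., 4: 0.166..., 5: 0.166...}
    ```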

    Matching Loss:
    $\ell_{\text{match}} = \alpha \sum_{i: y_i = 1} \phi(i, \hat{\mathbf{y}}) + \beta \sum_{i: \hat{y}_i = 1} \phi(i, \mathbf{y})$
    where
    $\phi(i, \mathbf{y}) = \min_{j: y_j = 1} \text{cost}(j \rightarrow i)$
    and $\text{cost}(j \rightarrow i)$ is the cost of traversing from node $j$ to node $i$ in the hierarchy, e.g., the path length or a weighted path length.
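    Below is a hedged sketch of $\phi(i, \mathbf{y})$ under the simplest reading: unit edge costs and undirected traversal of the hierarchy, so the cost is just the hop distance to the nearest positive node (the adjacency encoding is my own).

    ```python
    # Sketch of phi(i, y) = min_{j: y_j = 1} cost(j -> i), assuming unit edge costs
    # and undirected traversal, i.e. BFS hop distance to the nearest positive node.
    from collections import deque

    def phi(i, y, adjacency):
        dist = {i: 0}
        queue = deque([i])
        while queue:
            u = queue.popleft()
            if y[u] == 1:
                return dist[u]          # nearest positive node reached
            for v in adjacency[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        return float("inf")             # no positive node at all

    # Chain 0 - 1 - 2 with only the root positive: phi(2, y) = 2.
    adjacency = {0: [1], 1: [0, 2], 2: [1]}
    print(phi(2, [1, 0, 0], adjacency))  # 2
    ```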

    Verspoor et al.: hierarchical versions of precision, recall and F-score, but these are more expensive to compute.

    Condensing Sort and Selection Algorithm for HMC
    It is a review.
    It can be used on both tree and DAG hierarchies.

    It solves the following optimization objective via a greedy algorithm called the condensing sort and selection algorithm:
    $\max_{\{\psi_i\}_{i \in \mathcal{H}}} \sum_{i \in \mathcal{H}} \psi_i \tilde{y}_i$
    $\text{s.t.} \quad \psi_i \leq \psi_{\uparrow(i)}, \ \forall i \in \mathcal{H} \setminus \{0\}; \quad \psi_0 = 1; \quad \psi_i \in \{0,1\}; \quad \sum_{i=0}^{N-1} \psi_i = L,$

    where $\psi_i = 1$ indicates that node $i$ is predicted positive in $\hat{\mathbf{y}}$, and 0 otherwise.

    When the label hierarchy is a DAG, the first constraint of the above objective has to be replaced by
    $\psi_i \leq \psi_j, \ \forall i \in \mathcal{H} \setminus \{0\}, \ \forall j \in \Uparrow(i).$
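    This is not the CSSA greedy procedure itself, but a brute-force sketch that enumerates all feasible $\psi$ for a tiny tree, just to make the objective and its constraints concrete; `parents` and `y_tilde` are my own names.

    ```python
    # Brute-force sketch of the objective above (only for tiny hierarchies):
    # maximize sum_i psi_i * y_tilde_i subject to psi_0 = 1, psi_i <= psi of each
    # parent, and exactly L positive nodes. Not the greedy CSSA algorithm.
    from itertools import product

    def best_support_of_size(y_tilde, parents, L):
        n = len(y_tilde)
        best, best_val = None, float("-inf")
        for psi in product([0, 1], repeat=n):
            if psi[0] != 1 or sum(psi) != L:
                continue
            if any(psi[i] > psi[j] for i in range(1, n) for j in parents[i]):
                continue  # violates psi_i <= psi_parent
            val = sum(p * v for p, v in zip(psi, y_tilde))
            if val > best_val:
                best, best_val = psi, val
        return best, best_val

    parents = [[], [0], [0], [1]]            # tree: 0 -> {1, 2}, 1 -> {3}
    print(best_support_of_size([0.5, 0.25, 0.1, 0.25], parents, L=3))
    # ((1, 1, 0, 1), 1.0)
    ```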

    Extending Flat Losses
    This paper extends the Hamming loss and the ranking loss to support the class hierarchy.

    For the hierarchical Hamming loss:
    $\ell_{\text{H-hamming}} = \alpha \sum_{i: y_i = 1 \wedge \hat{y}_i = 0} c_i + \beta \sum_{i: y_i = 0 \wedge \hat{y}_i = 1} c_i$

    For a DAG class hierarchy, the costs are assigned as
    $c_i = \begin{cases} 1, & i = 0, \\ \sum_{j \in \Uparrow(i)} \dfrac{c_j}{n_{\downarrow(j)}}, & i > 0, \end{cases}$
    where $n_{\downarrow(j)}$ is the number of children of node $j$.

    There are some special cases in the original paper, but they are straightforward and not discussed here.
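    Here is a hedged sketch of the DAG cost assignment, under the assumption that the sum effectively runs over the immediate parents of $i$ (so that a tree reduces to the earlier rule); nodes are visited in topological order so that parent costs are available first.

    ```python
    # Sketch of the DAG cost assignment (assuming the sum is over immediate parents):
    # c_0 = 1, c_i = sum over parents j of c_j / (number of children of j).
    def dag_costs(parents, children, topo_order, root=0):
        c = {root: 1.0}
        for i in topo_order:
            if i == root:
                continue
            c[i] = sum(c[j] / len(children[j]) for j in parents[i])
        return c

    # Diamond DAG: 0 -> {1, 2}, and node 3 has both 1 and 2 as parents.
    parents  = {0: [], 1: [0], 2: [0], 3: [1, 2]}
    children = {0: [1, 2], 1: [3], 2: [3], 3: []}
    print(dag_costs(parents, children, topo_order=[0, 1, 2, 3]))
    # {0: 1.0, 1: 0.5, 2: 0.5, 3: 1.0}
    ```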

    For the hierarchical ranking loss:
    $\ell_{\text{H-rank}} = \sum_{(i,j): y_i > y_j} c_{ij} \left( I(\hat{y}_i < \hat{y}_j) + \tfrac{1}{2} I(\hat{y}_i = \hat{y}_j) \right),$

    where $c_{ij} = c_i c_j$ ensures a high penalty when an upper-level positive label is ranked after a lower-level negative label.
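    A quick sketch of this weighted ranking loss (variable names are my own), showing how $c_{ij} = c_i c_j$ scales the penalty for mis-ranked pairs:

    ```python
    # Sketch of the hierarchical ranking loss with pairwise weights c_ij = c_i * c_j.
    def h_ranking_loss(scores, y, c):
        loss = 0.0
        for i in range(len(y)):
            for j in range(len(y)):
                if y[i] > y[j]:                    # positive i vs. negative j
                    w = c[i] * c[j]
                    if scores[i] < scores[j]:
                        loss += w                  # mis-ranked pair
                    elif scores[i] == scores[j]:
                        loss += 0.5 * w            # tie
        return loss

    # Upper-level positive node 0 (c = 1.0) is scored below lower-level negative node 2 (c = 0.25).
    print(h_ranking_loss([0.3, 0.9, 0.6], [1, 1, 0], c=[1.0, 0.5, 0.25]))  # 0.25
    ```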

    Minimizing the risk
    The conditional risk (or simply the risk) $\mathcal{R}(\hat{\mathbf{y}})$ of predicting the multilabel $\hat{\mathbf{y}}$ is the expectation of $\ell(\hat{\mathbf{y}}, \mathbf{y})$ over all possible ground truths $\mathbf{y}$, i.e., $\mathcal{R}(\hat{\mathbf{y}}) = \sum_{\mathbf{y}} \ell(\hat{\mathbf{y}}, \mathbf{y}) P(\mathbf{y} \mid \mathbf{x})$.
    The Bayes-optimal prediction minimizes this risk (expected risk minimization): $\hat{\mathbf{y}}^\star = \argmin_{\hat{\mathbf{y}} \in \Omega} \mathcal{R}(\hat{\mathbf{y}}) = \argmin_{\hat{\mathbf{y}} \in \Omega} \sum_{\mathbf{y}} \ell(\hat{\mathbf{y}}, \mathbf{y}) P(\mathbf{y} \mid \mathbf{x}).$

    There are three issues to be addressed:
    (1) Estimating $P(\mathbf{y} \mid \mathbf{x})$.
    (2) Computing $\mathcal{R}(\hat{\mathbf{y}})$ without exhaustive search.
    (3) Minimizing $\mathcal{R}(\hat{\mathbf{y}})$.

    This paper computes $p_i = P(y_i = 1 \mid \mathbf{x})$ via the chain rule over the hierarchy, and the risk is rewritten into different forms for the different losses.
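    Before the per-loss risks, a hedged sketch of one way the marginals $p_i$ could be obtained via the chain rule on a tree, assuming local conditional estimates $P(y_i = 1 \mid y_{\uparrow(i)} = 1, \mathbf{x})$ are available (here simply given as numbers; the helper names are my own):

    ```python
    # Sketch: chain rule along the path to the root, p_i = product of local conditionals.
    def marginals_from_conditionals(cond, parent):
        """cond[i] = P(y_i = 1 | y_parent(i) = 1, x); parent[i] is None for the root."""
        p = {}
        def marginal(i):
            if i not in p:
                p[i] = cond[i] if parent[i] is None else cond[i] * marginal(parent[i])
            return p[i]
        for i in cond:
            marginal(i)
        return p

    cond = {0: 1.0, 1: 0.8, 2: 0.5, 3: 0.5}   # root taken as always positive
    parent = {0: None, 1: 0, 2: 0, 3: 1}
    print(marginals_from_conditionals(cond, parent))
    # {0: 1.0, 1: 0.8, 2: 0.5, 3: 0.4}
    ```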
    The risk for the matching loss:
    $\mathcal{R}_{\text{match}}(\hat{\mathbf{y}}) = \sum_{i: \hat{y}_i = 0} \phi(i, \hat{\mathbf{y}}) + \sum_{i: \hat{y}_i = 1} q_i$

    where $q_i = \sum_{j=0}^{d(i)-1} \sum_{l=j+1}^{d(i)} c_{\Uparrow_l(i)} P(\mathbf{y}_{\Uparrow_{0:j}(i)} = \mathbf{1}, y_{\Uparrow_{j+1}(i)} = 0 \mid \mathbf{x})$, $d(i)$ is the depth of node $i$, $\Uparrow_j(i)$ is $i$'s ancestor at depth $j$, and $\Uparrow_{0:j}(i) = \{\Uparrow_0(i), \dots, \Uparrow_j(i)\}$ is the set of $i$'s ancestors at depths 0 to $j$.

    The risk for the hierarchical Hamming loss:
    $\mathcal{R}_{\text{H-hamming}}(\hat{\mathbf{y}}) = \alpha \sum_{i: \hat{y}_i = 0} c_i p_i + \beta \sum_{i: \hat{y}_i = 1} c_i (1 - p_i)$
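    Given the marginals $p_i$, this risk decomposes over nodes; a tiny sketch with my own example values:

    ```python
    # Sketch of the H-Hamming risk: expected FN cost for predicted negatives plus
    # expected FP cost for predicted positives.
    def h_hamming_risk(y_hat, p, c, alpha=1.0, beta=1.0):
        risk = 0.0
        for i, pred in enumerate(y_hat):
            if pred == 0:
                risk += alpha * c[i] * p[i]           # false negative in expectation
            else:
                risk += beta * c[i] * (1.0 - p[i])    # false positive in expectation
        return risk

    p = [0.75, 0.5, 0.25]      # marginals P(y_i = 1 | x)
    c = [1.0, 0.5, 0.5]        # node costs
    print(h_hamming_risk([1, 1, 0], p, c))  # 1*0.25 + 0.5*0.5 + 0.5*0.25 = 0.625
    ```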

    The risk for the hierarchical ranking loss:
    $\mathcal{R}_{\text{H-rank}}(\hat{\mathbf{y}}) = \sum_{0 \leq i < j \leq N-1} c_{ij} \left( p_i I(\hat{y}_i \leq \hat{y}_j) + p_j I(\hat{y}_i \geq \hat{y}_j) + \tfrac{p_i + p_j}{2} I(\hat{y}_i = \hat{y}_j) \right) - C$

    Efficiently minimizing the risk:
    $\hat{\mathbf{y}} = \argmin_{L = 1, \dots, N} \mathcal{R}(\hat{\mathbf{y}}^\star_{(L)}),$
    where
    $\hat{\mathbf{y}}^\star_{(L)} = \argmin_{\hat{\mathbf{y}} \in \Omega, \ |\text{supp}(\hat{\mathbf{y}})| = L} \mathcal{R}(\hat{\mathbf{y}})$
    and $\text{supp}(f) := \{x \in X \mid f(x) \neq 0\}$ is the support of $f$ (here, the set of labels predicted positive).

    This is actually a rather simple and intuitive optimization strategy: optimize separately for each possible number of positive labels, i.e., for each different $L$, and then take the best over all $L$.

    This paper adopts the CSSAG (condensing sort and selection algorithm, proposed by Bi et al.) for the tree label hierarchy, which is a greedy strategy.
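    For intuition only, here is a brute-force sketch (clearly not the greedy CSSAG) that enumerates hierarchy-consistent predictions of every support size $L$, scores them with the H-Hamming risk, and keeps the overall minimizer; all inputs are my own toy example.

    ```python
    # Brute-force sketch of min over L of R(y_hat*_(L)), using the H-Hamming risk.
    # Exhaustive, so only viable for tiny hierarchies; the paper uses a greedy
    # CSSA-style algorithm instead.
    from itertools import product

    def h_hamming_risk(y_hat, p, c, alpha=1.0, beta=1.0):
        return sum(alpha * c[i] * p[i] if v == 0 else beta * c[i] * (1.0 - p[i])
                   for i, v in enumerate(y_hat))

    def minimise_risk(p, c, parents):
        n = len(p)
        best = {}  # best (prediction, risk) for each support size L
        for y_hat in product([0, 1], repeat=n):
            if y_hat[0] != 1:
                continue  # root is always predicted positive (L >= 1)
            if any(y_hat[i] > y_hat[j] for i in range(1, n) for j in parents[i]):
                continue  # violates the hierarchy constraint
            L, risk = sum(y_hat), h_hamming_risk(y_hat, p, c)
            if L not in best or risk < best[L][1]:
                best[L] = (y_hat, risk)
        return min(best.values(), key=lambda pair: pair[1])  # outer minimization over L

    parents = [[], [0], [0], [1]]          # tree: 0 -> {1, 2}, 1 -> {3}
    p = [0.75, 0.5, 0.25, 0.5]             # marginals P(y_i = 1 | x)
    c = [1.0, 0.5, 0.5, 0.5]               # node costs
    print(minimise_risk(p, c, parents))    # one of the minimum-risk predictions and its risk
    ```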

    Conclusions

    This paper extends matching loss, hamming loss and ranking loss to support tree-type as well as DAG-type class hierarchies.
    This paper seems easy to understand and does not contain much innovation, but it is well organized and highly comprehensive, so it was published in TKDE.

  • Original article: https://blog.csdn.net/wuyanxue/article/details/126797398