Tree Edit Distance in OCR Table Recognition

The authors provide a script, but it is a Jupyter notebook.

I copied it over, as follows:


# pred and true are HTML strings in the notebook (teds.evaluate compares HTML);
# only their cell text is shown here.
pred = '''Name of algori Notablefeatures
MACS [23] Uses both a control library and local statistics to minimize bias
SICER [15] Designed for detecting diffusely enriched regions; for example, histone modification
PeakSEQ [24] Corrects for reference genome mappability and local statistics
SISSRs [25] High resolution, precise identification of binding-site location
F-seq [26] Uses kernel density estimation
'''
true = '''Name of algorithm Notable features
MACS [23] Uses both a control library and local statistics to minimize bias
SICER [14] Designed for detecting diffusely enriched regions; for example, histone modification
PeakSeq [24] Corrects for reference genome mappability and local statistics
SISSRs [25] High resolution, precise identification of binding-site location
F-seq [26] Uses kernel density estimation
'''
from metric import TEDS
# Initialize TEDS object
teds = TEDS()
# Evaluate
score = teds.evaluate(pred, true)
print('TEDS score:', score)

Name of algori	Notablefeatures
MACS [23]	Uses both a control library and local statistics to minimize bias
SICER [15]	Designed for detecting diffusely enriched regions; for example, histone modification
PeakSEQ [24]	Corrects for reference genome mappability and local statistics
SISSRs [25]	High resolution, precise identification of binding-site location
F-seq [26]	Uses kernel density estimation

Name of algorithm	Notable features
MACS [23]	Uses both a control library and local statistics to minimize bias
SICER [14]	Designed for detecting diffusely enriched regions; for example, histone modification
PeakSeq [24]	Corrects for reference genome mappability and local statistics
SISSRs [25]	High resolution, precise identification of binding-site location
F-seq [26]	Uses kernel density estimation

For these two strings the authors provide only the labels, not the images. We can paste the contents of a label into an HTML file and open it in a browser.

If the markup looks messy, run it through a formatter first.
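A minimal sketch of that step, using only the standard library. The function name save_html, the file name, and the short label_html string are my own stand-ins for illustration; they are not part of the authors' code.

```python
import tempfile
from pathlib import Path


def save_html(html_string: str, filename: str = "table_preview.html") -> Path:
    """Write an HTML string to a file that a browser can open."""
    out = Path(tempfile.gettempdir()) / filename
    out.write_text(html_string, encoding="utf-8")
    return out


# Hypothetical stand-in for a label string; a real one would be much longer.
label_html = (
    "<html><body><table>"
    "<thead><tr><td><b>Name of algorithm</b></td><td><b>Notable features</b></td></tr></thead>"
    "<tbody><tr><td>F-seq [26]</td><td>Uses kernel density estimation</td></tr></tbody>"
    "</table></body></html>"
)
saved = save_html(label_html)
print("Open this in a browser:", saved.resolve().as_uri())
```

Opening the printed file URI in any browser renders the table, which makes it easy to compare a prediction with its ground truth by eye.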

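We only care about what is inside body. An opening tag <tag> and its closing tag </tag> form a pair, like brackets. thead is the table header, the part rendered in bold. tbody is the table body. tr is one row, and td is one cell.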

Next, cells that span several columns.

whole country spans four columns:


<tr>
     <td colspan="4">
           whole country
     </td>
 </tr>

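Next, cells that span several rows.

DHS WI opens a new row; CDR–RS does not. CDR and RS each open a new row. In other words, whether a cell opens a new row depends only on the row in which the previous cell starts: CDR–RS starts in the same row as DHS WI, so it stays in that tr, while CDR starts one row lower, so it opens a new tr.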


            <tr>
                <td rowspan="3">
                    DHS WI
                </td>
                <td>
                    CDR–RS
                </td>
                <td>
                    0.76
                </td>
                <td>
                    0.394
                </td>
            </tr>
            <tr>
                <td>
                    CDR
                </td>
                <td>
                    0.64
                </td>
                <td>
                    0.483
                </td>
            </tr>
            <tr>
                <td>
                    RS
                </td>
                <td>
                    0.74
                </td>
                <td>
                    0.413
                </td>
            </tr>
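To check how rowspan and colspan map onto the visible grid, here is a small sketch that lays the cells out on a 2-D grid with the standard-library HTML parser. GridBuilder and to_rows are names I made up for this illustration; the authors' code does not work this way (it compares trees, not grids).

```python
from html.parser import HTMLParser


class GridBuilder(HTMLParser):
    """Lay out <td> cells on a 2-D grid, honouring rowspan and colspan."""

    def __init__(self):
        super().__init__()
        self.grid = {}      # (row, col) -> cell text
        self.row = -1       # index of the current <tr>
        self.col = 0        # next column to try in the current row
        self.span = None    # (rowspan, colspan) of the open <td>, if any
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.row += 1
            self.col = 0
        elif tag == "td":
            a = dict(attrs)
            self.span = (int(a.get("rowspan", 1)), int(a.get("colspan", 1)))
            self.text = []

    def handle_data(self, data):
        if self.span is not None:
            self.text.append(data)

    def handle_endtag(self, tag):
        if tag != "td" or self.span is None:
            return
        # Skip slots already taken by a rowspan cell from an earlier row.
        while (self.row, self.col) in self.grid:
            self.col += 1
        label = "".join(self.text).strip()
        rows, cols = self.span
        for r in range(self.row, self.row + rows):
            for c in range(self.col, self.col + cols):
                self.grid[(r, c)] = label
        self.col += cols
        self.span = None


def to_rows(html):
    """Return the table as a list of rows with spanned cells repeated."""
    parser = GridBuilder()
    parser.feed(html)
    parser.close()
    n_rows = max(r for r, _ in parser.grid) + 1
    n_cols = max(c for _, c in parser.grid) + 1
    return [[parser.grid.get((r, c), "") for c in range(n_cols)]
            for r in range(n_rows)]


html = (
    "<table>"
    "<tr><td rowspan='3'>DHS WI</td><td>CDR–RS</td><td>0.76</td><td>0.394</td></tr>"
    "<tr><td>CDR</td><td>0.64</td><td>0.483</td></tr>"
    "<tr><td>RS</td><td>0.74</td><td>0.413</td></tr>"
    "</table>"
)
for row in to_rows(html):
    print(row)
```

DHS WI is written once in the markup but shows up in the first column of all three rows, which is why the second and third tr contain only three td elements.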

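The image above comes from the authors' example folder, which contains the labels as well as the images. It also contains a script that converts the JSON into an HTML-formatted string.

I could not run the script as it was, so I changed it a little: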


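# format_html comes from the authors' script; it turns one annotation
# record into an HTML string.
if __name__ == '__main__':
    import json
    f = "PubTabNet_Examples.jsonl"
    # Each line of the .jsonl file is one JSON record.
    with open(f, 'r', encoding='utf-8') as file:
        for line in file:
            annotations = json.loads(line)
            # Only print the sample whose imgid is 32.
            if annotations["imgid"] == 32:
                html_string = format_html(annotations)
                print(html_string)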

The core code that computes the tree edit distance is here:

PubTabNet/metric.py at master · ibm-aur-nlp/PubTabNet · GitHub

distance = APTED(tree_pred, tree_true, CustomConfig()).compute_edit_distance()

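I do not fully understand this core code either. Roughly, the authors define their own tree class, load the HTML string recursively, level by level, to build the whole tree, and then call the API above to compute the distance between the two trees.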
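To make the scoring a little more concrete: as far as I can tell from metric.py, two td nodes with the same tag and the same spans are compared by the normalized Levenshtein distance of their contents, and the final score is 1 minus the tree edit distance divided by the node count of the larger tree. The sketch below shows only those two pieces. The helper names are mine, and the pure-Python Levenshtein stands in for whatever library the authors use, so treat it as an illustration rather than their implementation.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        cur = [i]
        for j, y in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,               # delete x
                           cur[j - 1] + 1,            # insert y
                           prev[j - 1] + (x != y)))   # substitute x by y
        prev = cur
    return prev[-1]


def cell_rename_cost(pred_text, true_text):
    """Normalized edit distance in [0, 1] between two cell contents."""
    longest = max(len(pred_text), len(true_text))
    return levenshtein(pred_text, true_text) / longest if longest else 0.0


def teds_from_distance(distance, n_nodes_pred, n_nodes_true):
    """TEDS = 1 - distance / max(|T_pred|, |T_true|)."""
    return 1.0 - distance / max(n_nodes_pred, n_nodes_true)


# One wrong digit out of ten characters.
print(cell_rename_cost("SICER [15]", "SICER [14]"))
# Two wrong letters (E/e and Q/q) out of twelve characters.
print(cell_rename_cost("PeakSEQ [24]", "PeakSeq [24]"))
```

A cell that is almost right therefore costs only a small fraction of a full edit, whereas a missing or extra node costs a whole one.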

Below is the tree that the authors' code builds, i.e. the result of tree_true = self.load_html_tree(true):


{
    "tag": table{
        "tag": thead{
            "tag": tr{
                "tag": td,
                "colspan": 1,
                "rowspan": 1,
                "text": ['<b>', 'N', 'a', 'm', 'e', ' ', 'o', 'f', ' ', 'a', 'l', 'g', 'o', 'r', 'i', 't', 'h', 'm', '</b>'
                ]
            }{
                "tag": td,
                "colspan": 1,
                "rowspan": 1,
                "text": ['<b>', 'N', 'o', 't', 'a', 'b', 'l', 'e', ' ', 'f', 'e', 'a', 't', 'u', 'r', 'e', 's', '</b>'
                ]
            }
        }
    }{
        "tag": tbody{
            "tag": tr{
                "tag": td,
                "colspan": 1,
                "rowspan": 1,
                "text": ['M', 'A', 'C', 'S', ' ', '[', '2', '3', ']'
                ]
            }{
                "tag": td,
                "colspan": 1,
                "rowspan": 1,
                "text": ['U', 's', 'e', 's', ' ', 'b', 'o', 't', 'h', ' ', 'a', ' ', 'c', 'o', 'n', 't', 'r', 'o', 'l', ' ', 'l', 'i', 'b', 'r', 'a', 'r', 'y', ' ', 'a', 'n', 'd', ' ', 'l', 'o', 'c', 'a', 'l', ' ', 's', 't', 'a', 't', 'i', 's', 't', 'i', 'c', 's', ' ', 't', 'o', ' ', 'm', 'i', 'n', 'i', 'm', 'i', 'z', 'e', ' ', 'b', 'i', 'a', 's'
                ]
            }
        }{
            "tag": tr{
                "tag": td,
                "colspan": 1,
                "rowspan": 1,
                "text": ['S', 'I', 'C', 'E', 'R', ' ', '[', '1', '4', ']'
                ]
            }{
                "tag": td,
                "colspan": 1,
                "rowspan": 1,
                "text": ['D', 'e', 's', 'i', 'g', 'n', 'e', 'd', ' ', 'f', 'o', 'r', ' ', 'd', 'e', 't', 'e', 'c', 't', 'i', 'n', 'g', ' ', 'd', 'i', 'f', 'f', 'u', 's', 'e', 'l', 'y', ' ', 'e', 'n', 'r', 'i', 'c', 'h', 'e', 'd', ' ', 'r', 'e', 'g', 'i', 'o', 'n', 's', ';', ' ', 'f', 'o', 'r', ' ', 'e', 'x', 'a', 'm', 'p', 'l', 'e', ',', ' ', 'h', 'i', 's', 't', 'o', 'n', 'e', ' ', 'm', 'o', 'd', 'i', 'f', 'i', 'c', 'a', 't', 'i', 'o', 'n'
                ]
            }
        }{
            "tag": tr{
                "tag": td,
                "colspan": 1,
                "rowspan": 1,
                "text": ['P', 'e', 'a', 'k', 'S', 'e', 'q', ' ', '[', '2', '4', ']'
                ]
            }{
                "tag": td,
                "colspan": 1,
                "rowspan": 1,
                "text": ['C', 'o', 'r', 'r', 'e', 'c', 't', 's', ' ', 'f', 'o', 'r', ' ', 'r', 'e', 'f', 'e', 'r', 'e', 'n', 'c', 'e', ' ', 'g', 'e', 'n', 'o', 'm', 'e', ' ', 'm', 'a', 'p', 'p', 'a', 'b', 'i', 'l', 'i', 't', 'y', ' ', 'a', 'n', 'd', ' ', 'l', 'o', 'c', 'a', 'l', ' ', 's', 't', 'a', 't', 'i', 's', 't', 'i', 'c', 's'
                ]
            }
        }{
            "tag": tr{
                "tag": td,
                "colspan": 1,
                "rowspan": 1,
                "text": ['S', 'I', 'S', 'S', 'R', 's', ' ', '[', '2', '5', ']'
                ]
            }{
                "tag": td,
                "colspan": 1,
                "rowspan": 1,
                "text": ['H', 'i', 'g', 'h', ' ', 'r', 'e', 's', 'o', 'l', 'u', 't', 'i', 'o', 'n', ',', ' ', 'p', 'r', 'e', 'c', 'i', 's', 'e', ' ', 'i', 'd', 'e', 'n', 't', 'i', 'f', 'i', 'c', 'a', 't', 'i', 'o', 'n', ' ', 'o', 'f', ' ', 'b', 'i', 'n', 'd', 'i', 'n', 'g', '-', 's', 'i', 't', 'e', ' ', 'l', 'o', 'c', 'a', 't', 'i', 'o', 'n'
                ]
            }
        }{
            "tag": tr{
                "tag": td,
                "colspan": 1,
                "rowspan": 1,
                "text": ['F', '-', 's', 'e', 'q', ' ', '[', '2', '6', ']'
                ]
            }{
                "tag": td,
                "colspan": 1,
                "rowspan": 1,
                "text": ['U', 's', 'e', 's', ' ', 'k', 'e', 'r', 'n', 'e', 'l', ' ', 'd', 'e', 'n', 's', 'i', 't', 'y', ' ', 'e', 's', 't', 'i', 'm', 'a', 't', 'i', 'o', 'n'
                ]
            }
        }
    }
}
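For readers who want to reproduce a tree of this shape without the authors' code, here is a rough sketch built on the standard-library HTML parser. It is not how metric.py does it: the class name TableTreeBuilder, the nested-dict layout, and count_nodes are my own assumptions, chosen only to mirror the dump above (structural tags become nodes, and the content of each td becomes a list of tokens, one per character, with inline tags such as <b> kept whole).

```python
from html.parser import HTMLParser

STRUCTURAL = ("table", "thead", "tbody", "tr", "td")


class TableTreeBuilder(HTMLParser):
    """Build a nested-dict tree from table HTML."""

    def __init__(self):
        super().__init__()
        self.root = None
        self.stack = []   # structural nodes that are currently open

    def _in_cell(self):
        return bool(self.stack) and self.stack[-1]["tag"] == "td"

    def handle_starttag(self, tag, attrs):
        if self._in_cell():
            # Inline markup inside a cell (e.g. <b>) is kept as one token.
            self.stack[-1]["text"].append("<%s>" % tag)
            return
        if tag not in STRUCTURAL:
            return  # ignore wrappers such as <html> and <body>
        node = {"tag": tag, "children": []}
        if tag == "td":
            a = dict(attrs)
            node.update(colspan=int(a.get("colspan", 1)),
                        rowspan=int(a.get("rowspan", 1)),
                        text=[])
        if self.stack:
            self.stack[-1]["children"].append(node)
        else:
            self.root = node
        self.stack.append(node)

    def handle_data(self, data):
        if self._in_cell():
            self.stack[-1]["text"].extend(data)   # one token per character

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1]["tag"] == tag:
            self.stack.pop()
        elif self._in_cell():
            self.stack[-1]["text"].append("</%s>" % tag)


def count_nodes(node):
    """Number of nodes in the tree rooted at node."""
    return 1 + sum(count_nodes(child) for child in node["children"])


html = (
    "<html><body><table><thead><tr>"
    "<td><b>Name of algorithm</b></td><td><b>Notable features</b></td>"
    "</tr></thead><tbody><tr>"
    "<td>MACS [23]</td>"
    "<td>Uses both a control library and local statistics to minimize bias</td>"
    "</tr></tbody></table></body></html>"
)
builder = TableTreeBuilder()
builder.feed(html)
builder.close()
tree = builder.root
first_cell = tree["children"][0]["children"][0]["children"][0]
print(first_cell["text"])
print("nodes:", count_nodes(tree))
```

The node count matters because the score is normalized by the size of the larger tree, so the same number of mistakes hurts a small table more than a large one.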

Original post: https://blog.csdn.net/qq_40709711/article/details/126472962