A summary of the tagging schemes commonly used in sequence labeling tasks.

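Sequence labeling: sequence labeling is a fundamental NLP task in which every element of a sequence is assigned a label. Typical sequence labeling tasks include Chinese word segmentation and named entity recognition (NER).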

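Every element is assigned exactly one tag: one tag marks the beginning of an entity, and another marks its middle or end. In NER, for example, the BIO tagging scheme is the most commonly used.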

The common tagging schemes are listed below:

1. BIO tagging scheme:

B-begin: the first token of an entity

I-inside: a middle or final token of an entity

O-outside: a token that is not part of any entity
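
To make the BIO definitions concrete, here is a minimal sketch that turns entity spans into BIO tags. The function name `spans_to_bio`, the (start, end, type) span format with an exclusive end, and the example sentence are all assumptions made for illustration; they are not taken from any particular dataset or library.

```python
def spans_to_bio(tokens, spans):
    """Return one BIO tag per token.

    `spans` is a list of (start, end, type) tuples with `end` exclusive.
    Spans are assumed not to overlap.
    """
    tags = ["O"] * len(tokens)
    for start, end, ent_type in spans:
        tags[start] = "B-" + ent_type      # first token of the entity
        for i in range(start + 1, end):
            tags[i] = "I-" + ent_type      # middle and final tokens
    return tags


# Made-up example: a person name and a location, one tag per character.
tokens = list("小明在北京工作")
spans = [(0, 2, "PER"), (3, 5, "LOC")]
for token, tag in zip(tokens, spans_to_bio(tokens, spans)):
    print(token, tag)
```

Tokens that fall outside every span keep the default O tag, so the function only has to touch the entity positions.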

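2. BIOES tagging scheme:

B-begin: the first token of an entity

I-inside: a middle token of an entity

O-outside: a token that is not part of any entity

E-end: the last token of an entity

S-single: a single token that forms an entity by itself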
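
Since BIOES only adds explicit end and single tags on top of BIO, one scheme can be derived from the other by looking one tag ahead. The sketch below shows one way to do that; `bio_to_bioes` is a hypothetical helper name, and the code assumes the input is a well-formed BIO sequence.

```python
def bio_to_bioes(tags):
    """Convert a BIO tag sequence to BIOES (assumes well-formed BIO input)."""
    bioes = []
    for i, tag in enumerate(tags):
        if tag == "O":
            bioes.append(tag)
            continue
        prefix, ent_type = tag.split("-", 1)
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        # The entity continues only if the next tag is I- with the same type.
        continues = next_tag == "I-" + ent_type
        if prefix == "B":
            bioes.append(("B-" if continues else "S-") + ent_type)
        else:  # prefix == "I"
            bioes.append(("I-" if continues else "E-") + ent_type)
    return bioes


print(bio_to_bioes(["B-PER", "I-PER", "O", "B-LOC", "O"]))
```

A B tag that is not followed by a matching I tag becomes S, and the last I tag of an entity becomes E; everything else is left alone.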

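3. BMES tagging scheme:

B-begin: the first token of an entity

M-middle: a middle token of an entity

O-outside: a token that is not part of any entity (needed when BMES is used for NER; in word segmentation every character belongs to some word, so O does not occur)

E-end: the last token of an entity

S-single: a single token that forms an entity by itself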
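
For Chinese word segmentation, BMES tags can be read directly off a segmented sentence. The following is a small sketch under that assumption; `words_to_bmes` is a hypothetical helper and the example words are made up.

```python
def words_to_bmes(words):
    """Return one BMES tag per character for a list of segmented words."""
    tags = []
    for word in words:
        if not word:
            continue                       # ignore empty strings
        if len(word) == 1:
            tags.append("S")               # single-character word
        else:
            # first char B, last char E, everything in between M
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


words = ["我", "喜欢", "自然语言"]
print(list(zip("".join(words), words_to_bmes(words))))
```

Every character receives exactly one of the four tags here, which is why the O tag is not needed for segmentation.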

Overall, on many tasks the different tagging schemes perform about the same.

Below is a code implementation that converts BMES labels to the BIO scheme:


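import json
import logging
import os
from os.path import join

from tqdm import trange

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


def load_lines(path, encoding='utf8'):
    """Read a text file and return its lines with surrounding whitespace stripped."""
    with open(path, 'r', encoding=encoding) as f:
        return [line.strip() for line in f]


def write_lines(lines, path, encoding='utf8'):
    """Write each item of `lines` to `path`, one item per line."""
    with open(path, 'w', encoding=encoding) as f:
        for line in lines:
            f.write('{}\n'.format(line))


def bmes_to_json(bmes_file, json_file):
    """
    Convert a BMES-format file into a JSON-lines file. Each output line holds
    one sample with `text` and `label` fields, and labels are mapped to BIO.

    Args:
        bmes_file: input path; one "token label" pair per line, sentences
            separated by a blank line.
        json_file: output path; one JSON object per line.
    """
    texts = []
    with open(bmes_file, 'r', encoding='utf8') as f:
        lines = f.readlines()
    # A trailing empty line makes sure the last sentence is flushed even when
    # the file does not end with a blank line.
    lines.append('')
    words = []
    labels = []
    for idx in trange(len(lines)):
        line = lines[idx].strip()

        if not line:
            # A blank line ends the current sentence.
            if words:
                assert len(words) == len(labels), (len(words), len(labels))
                sample = {'text': words, 'label': labels}
                texts.append(json.dumps(sample, ensure_ascii=False))
            words = []
            labels = []
        else:
            # Each non-blank line is assumed to be "<token> <label>".
            word, label = line.split()
            # BMES -> BIO: M- and E- become I-, S- becomes B-.
            if label.startswith(('M-', 'E-')):
                label = 'I-' + label[2:]
            elif label.startswith('S-'):
                label = 'B-' + label[2:]
            words.append(word)
            labels.append(label)

    with open(json_file, 'w', encoding='utf-8') as f:
        for text in texts:
            f.write("{}\n".format(text))


if __name__ == '__main__':
    # Generate the JSON files
    data_names = ['msra']
    path = '../datasets'
    for data_name in data_names:
        logger.info('processing dataset:{}'.format(data_name))
        files = os.listdir(join(path, data_name))
        for file in files:
            if file.endswith('.json'):
                continue  # skip JSON files written by an earlier run
            file = join(path, data_name, file)
            data_type = os.path.basename(file).split('.')[0]
            out_path = join(path, data_name, data_type+'.json')
            # Convert this file (assumed to be the final step of the loop).
            bmes_to_json(file, out_path)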
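
The label mapping at the heart of the conversion can also be checked on its own, without any files. This is only a sketch: `bmes_tag_to_bio` and `BMES_TO_BIO_PREFIX` are hypothetical names, and the tags are assumed to look like `B-LOC`, `M-LOC`, `E-LOC`, `S-LOC`, or `O`.

```python
# Prefix mapping from BMES to BIO: M and E collapse into I, S becomes B.
BMES_TO_BIO_PREFIX = {"B": "B", "M": "I", "E": "I", "S": "B"}


def bmes_tag_to_bio(tag):
    """Map one BMES-style NER tag (e.g. 'M-LOC') to its BIO equivalent."""
    if tag == "O":
        return tag
    prefix, ent_type = tag.split("-", 1)
    return BMES_TO_BIO_PREFIX[prefix] + "-" + ent_type


bmes = ["B-LOC", "M-LOC", "E-LOC", "O", "S-PER"]
print([bmes_tag_to_bio(t) for t in bmes])
```

Mapping on the prefix rather than with a plain string replace avoids accidentally rewriting an entity type that happens to contain `M-`, `E-`, or `S-`.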


Original article: https://blog.csdn.net/weixin_48592695/article/details/126004598