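Sequence labeling: sequence labeling is a fundamental NLP task in which we assign a label to every element of a sequence. Typical sequence labeling tasks include Chinese word segmentation and named entity recognition (NER).
Every element receives exactly one label. One label marks the beginning of an entity, and another marks its middle or end. In NER, for example, the most commonly used scheme is BIO.
The common tagging schemes are listed below: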
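1. BIO tagging scheme:
B-begin: the first element of an entity
I-inside: a middle or final element of an entity
O-outside: an element that is not part of any entity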
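To make the BIO scheme concrete, here is a minimal sketch that turns entity spans into per-character BIO tags. The function name `spans_to_bio`, the `(start, end, entity_type)` span format with an exclusive `end`, and the example sentence are all assumptions made for illustration. They do not come from any particular dataset or library.

```python
def spans_to_bio(tokens, spans):
    """Tag `tokens` with BIO labels.

    `spans` is a list of (start, end, entity_type) tuples with `end` exclusive.
    This span format is a hypothetical one chosen for the illustration.
    """
    tags = ['O'] * len(tokens)                 # everything starts as non-entity
    for start, end, entity_type in spans:
        tags[start] = 'B-' + entity_type       # first element of the entity
        for i in range(start + 1, end):
            tags[i] = 'I-' + entity_type       # middle and final elements
    return tags


tokens = list('小明在北京工作')
# 小明 is a person (PER) and 北京 is a location (LOC)
tags = spans_to_bio(tokens, [(0, 2, 'PER'), (3, 5, 'LOC')])
for token, tag in zip(tokens, tags):
    print(token, tag)
```

Note that BIO does not distinguish the last element of an entity from a middle one: both get the I- prefix.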
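2. BIOES tagging scheme:
B-begin: the first element of an entity
I-inside: a middle element of an entity
O-outside: an element that is not part of any entity
E-end: the last element of an entity
S-single: a single character that is an entity by itself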
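Since BIOES only refines BIO with explicit end and single markers, one scheme can be derived from the other. The sketch below converts BIO tags to BIOES by looking one tag ahead. The function name `bio_to_bioes` is hypothetical, and the code assumes well-formed BIO input in which every I- tag follows a B- or I- tag of the same type.

```python
def bio_to_bioes(tags):
    """Convert a list of well-formed BIO tags to BIOES tags."""
    bioes = []
    for i, tag in enumerate(tags):
        if tag == 'O':
            bioes.append(tag)
            continue
        prefix, entity_type = tag.split('-', 1)
        next_tag = tags[i + 1] if i + 1 < len(tags) else 'O'
        # the entity continues only if the next tag is I- with the same type
        continues = next_tag == 'I-' + entity_type
        if prefix == 'B':
            # a B- tag with no continuation is a single-element entity
            bioes.append(tag if continues else 'S-' + entity_type)
        else:
            # an I- tag with no continuation is the end of the entity
            bioes.append(tag if continues else 'E-' + entity_type)
    return bioes


bio = ['B-PER', 'I-PER', 'O', 'B-LOC', 'O', 'B-ORG', 'I-ORG', 'I-ORG']
print(bio_to_bioes(bio))
```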
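3. BMES tagging scheme:
B-begin: the first element of an entity
M-middle: a middle element of an entity
O-outside: an element that is not part of any entity
E-end: the last element of an entity
S-single: a single character that is an entity by itself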
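BMES is also commonly used for the Chinese word segmentation task mentioned at the start. As a rough sketch, the hypothetical helper `words_to_bmes` below turns an already segmented sentence into per-character tags. In segmentation every character belongs to some word, so O never appears in this example, and the tags carry no entity type.

```python
def words_to_bmes(words):
    """Return one BMES tag per character for a list of segmented words."""
    tags = []
    for word in words:
        if len(word) == 1:
            tags.append('S')                   # single-character word
        else:
            # first character, zero or more middle characters, last character
            tags.extend(['B'] + ['M'] * (len(word) - 2) + ['E'])
    return tags


words = ['我', '喜欢', '自然语言']
tags = words_to_bmes(words)
for char, tag in zip(''.join(words), tags):
    print(char, tag)
```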
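Overall, on many tasks the different tagging schemes perform about the same.
Below is a code implementation that converts BMES tags to the BIO scheme: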
import json
import logging
import os
from os.path import join

# trange is assumed to come from tqdm; the original listing does not show its imports
from tqdm import trange

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


def load_lines(path, encoding='utf8'):
    """Read a text file and return its lines with surrounding whitespace stripped."""
    with open(path, 'r', encoding=encoding) as f:
        lines = [line.strip() for line in f]
    return lines


def write_lines(lines, path, encoding='utf8'):
    """Write each item of `lines` to `path`, one item per line."""
    with open(path, 'w', encoding=encoding) as f:
        for line in lines:
            f.write('{}\n'.format(line))


def bmes_to_json(bmes_file, json_file):
    """
    Convert a BMES-format file to a JSON file with one sample per line.
    Each sample contains `text` and `label`, and the labels are converted to BIO.
    Args:
        bmes_file: input path, one "token label" pair per line and a blank line
            between sentences (assumed layout)
        json_file: output path
    """
    texts = []
    with open(bmes_file, 'r', encoding='utf8') as f:
        lines = f.readlines()
    # sentinel blank line so the last sentence is kept even without a trailing blank line
    lines.append('')
    words = []
    labels = []
    for idx in trange(len(lines)):
        line = lines[idx].strip()

        if not line:
            # a blank line ends the current sentence; skip empty sentences
            if words:
                assert len(words) == len(labels), (len(words), len(labels))
                sample = {}
                sample['text'] = words
                sample['label'] = labels
                texts.append(json.dumps(sample, ensure_ascii=False))

            words = []
            labels = []
        else:
            parts = line.split()
            word, label = parts[0], parts[-1]
            # BMES -> BIO: M- and E- become I-, and S- becomes B-
            if label[:2] in ('M-', 'E-'):
                label = 'I-' + label[2:]
            elif label.startswith('S-'):
                label = 'B-' + label[2:]
            words.append(word)
            labels.append(label)

    with open(json_file, 'w', encoding='utf-8') as f:
        for text in texts:
            f.write("{}\n".format(text))


if __name__ == '__main__':
    # generate the JSON files
    data_names = ['msra']
    path = '../datasets'
    for data_name in data_names:
        logger.info('processing dataset:{}'.format(data_name))
        files = os.listdir(join(path, data_name))
        for file in files:
            file = join(path, data_name, file)
            data_type = os.path.basename(file).split('.')[0]
            out_path = join(path, data_name, data_type+'.json')