Fine-tuning chinese-roberta-wwm-ext for Chinese Named Entity Recognition


    1. Model and Dataset Introduction

    1.1 Pretrained Model

    chinese-roberta-wwm-ext is built on the RoBERTa architecture, where wwm stands for Whole Word Masking: during pretraining, whole words are masked rather than individual characters (for example, a word such as 语言模型 is masked as a unit instead of character by character). This helps the model capture context and semantic relationships better and improves the accuracy of Chinese text processing.

    Compared with the original BERT model, chinese-roberta-wwm-ext adjusts the amount of training data and the number of training steps to further improve performance and robustness. It is pretrained on a large-scale unlabeled corpus, which gives it strong language understanding and generation capabilities, and it can be applied to a wide range of natural language processing tasks such as text classification, named entity recognition, and sentiment analysis. We can take this model as a base and fine-tune it or apply transfer learning on different tasks to get more accurate and efficient Chinese text processing.

    Hugging Face page: https://huggingface.co/hfl/chinese-roberta-wwm-ext

    Download the pretrained model from Hugging Face:

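    A minimal loading sketch (assuming the transformers library is installed; you can pass either the Hub id hfl/chinese-roberta-wwm-ext or the path of a locally downloaded copy):

    from transformers import AutoTokenizer, AutoModel

    # Hub id, or the local directory the model was downloaded to
    model_name = "hfl/chinese-roberta-wwm-ext"

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)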

    1.2 Dataset

    The dataset used is CLUENER 2020, from the CLUE benchmark (Chinese Language Understanding Evaluation).

    Download the dataset from the link below:

    https://www.cluebenchmarks.com/introduce.html


    The data contains 10 label categories: 地址 (address), 书名 (book), 公司 (company), 游戏 (game), 政府 (government), 电影 (movie), 姓名 (name), 组织机构 (organization), 职位 (position), and 景点 (scene).

    Sample data:

    {"text": "浙商银行企业信贷部叶老桂博士则从另一个角度对五道门槛进行了解读。叶老桂认为,对目前国内商业银行而言,", "label": {"name": {"叶老桂": [[9, 11]]}, "company": {"浙商银行": [[0, 3]]}}}
    {"text": "生生不息CSOL生化狂潮让你填弹狂扫", "label": {"game": {"CSOL": [[4, 7]]}}}
    {"text": "那不勒斯vs锡耶纳以及桑普vs热那亚之上呢?", "label": {"organization": {"那不勒斯": [[0, 3]], "锡耶纳": [[6, 8]], "桑普": [[11, 12]], "热那亚": [[15, 17]]}}}
    {"text": "加勒比海盗3:世界尽头》的去年同期成绩死死甩在身后,后者则即将赶超《变形金刚》,", "label": {"movie": {"加勒比海盗3:世界尽头》": [[0, 11]], "《变形金刚》": [[33, 38]]}}}
    {"text": "布鲁京斯研究所桑顿中国中心研究部主任李成说,东亚的和平与安全,是美国的“核心利益”之一。", "label": {"address": {"美国": [[32, 33]]}, "organization": {"布鲁京斯研究所桑顿中国中心": [[0, 12]]}, "name": {"李成": [[18, 19]]}, "position": {"研究部主任": [[13, 17]]}}}
    {"text": "目前主赞助商暂时空缺,他们的球衣上印的是“unicef”(联合国儿童基金会),是公益性质的广告;", "label": {"organization": {"unicef": [[21, 26]], "联合国儿童基金会": [[29, 36]]}}}
    {"text": "此数据换算成亚洲盘罗马客场可让平半低水。", "label": {"organization": {"罗马": [[9, 10]]}}}
    {"text": "你们是最棒的!#英雄联盟d学sanchez创作的原声王", "label": {"game": {"英雄联盟": [[8, 11]]}}}
    {"text": "除了吴湖帆时现精彩,吴待秋、吴子深、冯超然已然归入二三流了,", "label": {"name": {"吴湖帆": [[2, 4]], "吴待秋": [[10, 12]], "吴子深": [[14, 16]], "冯超然": [[18, 20]]}}}
    {"text": "在豪门被多线作战拖累时,正是他们悄悄追赶上来的大好时机。重新找回全队的凝聚力是拉科赢球的资本。", "label": {"organization": {"拉科": [[39, 40]]}}}
    

    First, convert the dataset into BIO tagging format:

    import json
    
    
    # Convert the data into BIO tagging format
    def dimension_label(path, save_path, labels_path=None):
        label_dict = ['O']
        with open(save_path, "a", encoding="utf-8") as w:
            with open(path, "r", encoding="utf-8") as r:
                for line in r:
                    line = json.loads(line)
                    text = line['text']
                    label = line['label']
                    text_label = ['O'] * len(text)
                    for label_key in label:  # iterate over entity types
                        B_label = "B-" + label_key
                        I_label = "I-" + label_key
                        if B_label not in label_dict:
                            label_dict.append(B_label)
                        if I_label not in label_dict:
                            label_dict.append(I_label)
                        label_item = label[label_key]
                        for entity in label_item:  # iterate over the entities of this type
                            positions = label_item[entity]
                            # an entity may occur at more than one position in the text
                            for start, end in positions:
                                text_label[start] = B_label
                                for i in range(start + 1, end + 1):
                                    text_label[i] = I_label
                    line = {
                        "text": text,
                        "label": text_label
                    }
                    line = json.dumps(line, ensure_ascii=False)
                    w.write(line + "\n")
                    w.flush()
    
        if labels_path:  # save the label map for later use in training and prediction
            label_map = {}
            for i, label in enumerate(label_dict):
                label_map[label] = i
            with open(labels_path, "w", encoding="utf-8") as w:
                labels = json.dumps(label_map, ensure_ascii=False)
                w.write(labels + "\n")
                w.flush()
    
    
    if __name__ == '__main__':
        path = "./cluener_public/dev.json"
        save_path = "./data/dev.json"
        dimension_label(path, save_path)
    
        path = "./cluener_public/train.json"
        save_path = "./data/train.json"
        labels_path = "./data/labels.json"
        dimension_label(path, save_path, labels_path)
    
    

    The converted format is shown below. Note that the script opens its output files in append mode, so the ./data output directory must already exist, and re-running the script will append duplicate lines:

    {"text": "浙商银行企业信贷部叶老桂博士则从另一个角度对五道门槛进行了解读。叶老桂认为,对目前国内商业银行而言,", "label": ["B-company", "I-company", "I-company", "I-company", "O", "O", "O", "O", "O", "B-name", "I-name", "I-name", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O"]}
    {"text": "生生不息CSOL生化狂潮让你填弹狂扫", "label": ["O", "O", "O", "O", "B-game", "I-game", "I-game", "I-game", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O"]}
    {"text": "那不勒斯vs锡耶纳以及桑普vs热那亚之上呢?", "label": ["B-organization", "I-organization", "I-organization", "I-organization", "O", "O", "B-organization", "I-organization", "I-organization", "O", "O", "B-organization", "I-organization", "O", "O", "B-organization", "I-organization", "I-organization", "O", "O", "O", "O"]}
    {"text": "加勒比海盗3:世界尽头》的去年同期成绩死死甩在身后,后者则即将赶超《变形金刚》,", "label": ["B-movie", "I-movie", "I-movie", "I-movie", "I-movie", "I-movie", "I-movie", "I-movie", "I-movie", "I-movie", "I-movie", "I-movie", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "B-movie", "I-movie", "I-movie", "I-movie", "I-movie", "I-movie", "O"]}
    {"text": "布鲁京斯研究所桑顿中国中心研究部主任李成说,东亚的和平与安全,是美国的“核心利益”之一。", "label": ["B-organization", "I-organization", "I-organization", "I-organization", "I-organization", "I-organization", "I-organization", "I-organization", "I-organization", "I-organization", "I-organization", "I-organization", "I-organization", "B-position", "I-position", "I-position", "I-position", "I-position", "B-name", "I-name", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "B-address", "I-address", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O"]}
    {"text": "目前主赞助商暂时空缺,他们的球衣上印的是“unicef”(联合国儿童基金会),是公益性质的广告;", "label": ["O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "B-organization", "I-organization", "I-organization", "I-organization", "I-organization", "I-organization", "O", "O", "B-organization", "I-organization", "I-organization", "I-organization", "I-organization", "I-organization", "I-organization", "I-organization", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O"]}
    {"text": "此数据换算成亚洲盘罗马客场可让平半低水。", "label": ["O", "O", "O", "O", "O", "O", "O", "O", "O", "B-organization", "I-organization", "O", "O", "O", "O", "O", "O", "O", "O", "O"]}
    {"text": "你们是最棒的!#英雄联盟d学sanchez创作的原声王", "label": ["O", "O", "O", "O", "O", "O", "O", "O", "B-game", "I-game", "I-game", "I-game", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O"]}
    {"text": "除了吴湖帆时现精彩,吴待秋、吴子深、冯超然已然归入二三流了,", "label": ["O", "O", "B-name", "I-name", "I-name", "O", "O", "O", "O", "O", "B-name", "I-name", "I-name", "O", "B-name", "I-name", "I-name", "O", "B-name", "I-name", "I-name", "O", "O", "O", "O", "O", "O", "O", "O", "O"]}
    {"text": "在豪门被多线作战拖累时,正是他们悄悄追赶上来的大好时机。重新找回全队的凝聚力是拉科赢球的资本。", "label": ["O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "B-organization", "I-organization", "O", "O", "O", "O", "O", "O"]}
    

    The labels in labels.json are as follows:

    {"O": 0, "B-name": 1, "I-name": 2, "B-company": 3, "I-company": 4, "B-game": 5, "I-game": 6, "B-organization": 7, "I-organization": 8, "B-movie": 9, "I-movie": 10, "B-address": 11, "I-address": 12, "B-position": 13, "I-position": 14, "B-government": 15, "I-government": 16, "B-scene": 17, "I-scene": 18, "B-book": 19, "I-book": 20}
    

    2. Fine-Tuning the Model

    2.1 Processing the Data and Building the Dataset

    ner_datasets.py

    from torch.utils.data import Dataset, DataLoader
    import torch
    import json
    
    
    class NERDataset(Dataset):
        def __init__(self, tokenizer, file_path, labels_map, max_length=300):
            self.tokenizer = tokenizer
            self.max_length = max_length
            self.labels_map = labels_map
    
            self.text_data = []
            self.label_data = []
            with open(file_path, "r", encoding="utf-8") as r:
                for line in r:
                    line = json.loads(line)
                    text = line['text']
                    label = line['label']
                    self.text_data.append(text)
                    self.label_data.append(label)
    
        def __len__(self):
            return len(self.text_data)
    
        def __getitem__(self, idx):
            text = self.text_data[idx]
            labels = self.label_data[idx]
    
            # Tokenize the sentence
            inputs = self.tokenizer.encode_plus(
                text,
                None,
                add_special_tokens=True,
                padding='max_length',
                truncation=True,
                max_length=self.max_length,
                return_tensors='pt'
            )
    
            input_ids = inputs['input_ids'].squeeze()
            attention_mask = inputs['attention_mask'].squeeze()
    
            # Convert labels to integer ids
            label_ids = [self.labels_map[l] for l in labels]
    
            if len(label_ids) > self.max_length:
                label_ids = label_ids[0:self.max_length]
    
            if len(label_ids) < self.max_length:
                # Pad labels up to max_length
                label_ids.extend([0] * (self.max_length - len(label_ids)))
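            # Note: labels are aligned by raw character index; the [CLS]/[SEP] tokens added by
            # the tokenizer are not offset here, and the inference code later indexes predictions
            # by character position in the same way, so training and prediction stay consistent.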
    
            return {
                'input_ids': input_ids,
                'attention_mask': attention_mask,
                'labels': torch.LongTensor(label_ids)
            }
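
    A quick sanity check (a minimal sketch, assuming the ./data files produced above and either the Hub id or a local copy of the pretrained model) wraps the dataset in a DataLoader and prints the tensor shapes:

    import json
    from torch.utils.data import DataLoader
    from transformers import AutoTokenizer
    from ner_datasets import NERDataset

    with open("./data/labels.json", "r", encoding="utf-8") as r:
        labels_map = json.loads(r.read())

    tokenizer = AutoTokenizer.from_pretrained("hfl/chinese-roberta-wwm-ext")
    dataset = NERDataset(tokenizer, "./data/train.json", labels_map, max_length=300)
    loader = DataLoader(dataset, batch_size=2, shuffle=True)

    batch = next(iter(loader))
    # Expect input_ids / attention_mask / labels, each of shape [2, 300]
    print({k: v.shape for k, v in batch.items()})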
    
    

    2.2 Model Training

    import torch
    from torch.utils.data import Dataset, DataLoader
    from transformers import AutoTokenizer, AutoModelForTokenClassification
    from ner_datasets import NERDataset
    from tqdm import tqdm
    import json
    import time, sys
    import numpy as np
    from sklearn.metrics import f1_score
    
    
    def train(epoch, model, device, loader, optimizer, gradient_accumulation_steps):
        model.train()
        time1 = time.time()
        for index, data in enumerate(tqdm(loader, file=sys.stdout, desc="Train Epoch: " + str(epoch))):
            input_ids = data['input_ids'].to(device)
            attention_mask = data['attention_mask'].to(device)
            labels = data['labels'].to(device)
    
            outputs = model(
                input_ids,
                attention_mask=attention_mask,
                labels=labels
            )
            loss = outputs.loss
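            # Note: the raw loss is accumulated here; it is also common to divide it by
            # gradient_accumulation_steps so the effective update size stays comparable.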
            # Backpropagate to compute gradients
            loss.backward()
            # Gradient accumulation
            if (index % gradient_accumulation_steps == 0 and index != 0) or index == len(loader) - 1:
                # Update model parameters
                optimizer.step()
                # Reset accumulated gradients
                optimizer.zero_grad()
    
            # Print the loss every 100 steps
            if index % 100 == 0 or index == len(loader) - 1:
                time2 = time.time()
                tqdm.write(
                    f"{index}, epoch: {epoch}, loss: {loss.item()}, each step's time spent: {(time2 - time1) / float(index + 0.0001)}")
    
    
    def validate(model, device, loader):
        model.eval()
        acc = 0
        f1 = 0
        with torch.no_grad():
            for _, data in enumerate(tqdm(loader, file=sys.stdout, desc="Validation Data")):
                input_ids = data['input_ids'].to(device)
                attention_mask = data['attention_mask'].to(device)
                labels = data['labels']
    
                outputs = model(input_ids, attention_mask=attention_mask)
                _, predicted_labels = torch.max(outputs.logits, dim=2)
                predicted_labels = predicted_labels.detach().cpu().numpy().tolist()
                true_labels = labels.detach().cpu().numpy().tolist()
    
                predicted_labels_flat = [label for sublist in predicted_labels for label in sublist]
                true_labels_flat = [label for sublist in true_labels for label in sublist]
    
                accuracy = (np.array(predicted_labels_flat) == np.array(true_labels_flat)).mean()
                acc = acc + accuracy
                f1score = f1_score(true_labels_flat, predicted_labels_flat, average='macro')
                f1 = f1 + f1score
    
        return acc / len(loader), f1 / len(loader)
    
    
    def main():
        labels_path = "./data/labels.json"
        model_name = 'D:\\AIGC\\model\\chinese-roberta-wwm-ext'
        train_json_path = "./data/train.json"
        val_json_path = "./data/dev.json"
        max_length = 300
        epochs = 5
        batch_size = 1
        lr = 1e-4
        gradient_accumulation_steps = 16
        model_output_dir = "output"
        device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    
        # Load the label map
        with open(labels_path, "r", encoding="utf-8") as r:
            labels_map = json.loads(r.read())
    
        # Load the tokenizer and model
        tokenizer = AutoTokenizer.from_pretrained(model_name)
        model = AutoModelForTokenClassification.from_pretrained(model_name, num_labels=len(labels_map))
        model.to(device)
    
        # Load the data
        print("Start Load Train Data...")
        train_dataset = NERDataset(tokenizer, train_json_path, labels_map, max_length)
        train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
        print("Start Load Validation Data...")
        val_dataset = NERDataset(tokenizer, val_json_path, labels_map, max_length)
        val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)
    
        # Define the optimizer (the loss is computed inside the model)
        optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
        print("Start Training...")
        best_acc = 0.0
        for epoch in range(epochs):
            train(epoch, model, device, train_loader, optimizer, gradient_accumulation_steps)
            print("Start Validation...")
            acc, f1 = validate(model, device, val_loader)
            print(f"Validation : acc: {acc} , f1: {f1}")
    
            if best_acc < acc:  # save the model with the highest accuracy
                print("Save Model To ", model_output_dir)
                model.save_pretrained(model_output_dir)
                tokenizer.save_pretrained(model_output_dir)
                best_acc = acc
    
    if __name__ == '__main__':
        main()
    
    

    After running it, you can see the training progress:


    During training you can see the accuracy and F1 score on the validation set, and the model with the highest accuracy is saved.

    After training finishes, the last epoch reports acc: 0.988304790270546 and f1: 0.7567446800115455. (The token-level accuracy is much higher than the macro F1 because it is computed over every position, including the padded 'O' labels.) The saved model files can be found under output:


    3. Testing the Model

    from transformers import AutoTokenizer, AutoModelForTokenClassification
    import torch
    import json
    
    
    # Parse entities from the model outputs
    def post_processing(outputs, text, labels_map):
        _, predicted_labels = torch.max(outputs.logits, dim=2)
    
        predicted_labels = predicted_labels.detach().cpu().numpy()
    
        predicted_tags = [labels_map[label_id] for label_id in predicted_labels[0]]
    
        result = {}
        entity = ""
        type = ""
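        # Note: text is indexed character by character below, while predictions are indexed by
        # token position; for plain Chinese text these mostly line up, but tokens spanning
        # several characters (e.g. Latin words or digits) can shift the alignment.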
        for index, word_token in enumerate(text):
            tag = predicted_tags[index]
            if tag.startswith("B-"):
                # Close the previous entity under its own type before starting a new one
                if entity:
                    if type not in result:
                        result[type] = []
                    result[type].append(entity)
                type = tag.split("-")[1]
                entity = word_token
            elif tag.startswith("I-"):
                type = tag.split("-")[1]
                if entity:
                    entity += word_token
            else:
                if entity:
                    if type not in result:
                        result[type] = []
                    result[type].append(entity)
                entity = ""
        # Append a trailing entity that reaches the end of the text
        if entity:
            if type not in result:
                result[type] = []
            result[type].append(entity)
        return result
    
    def main():
        labels_path = "./data/labels.json"
        model_name = './output'
        max_length = 300
        device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    
        # Load the label map (id -> label name)
        labels_map = {}
        with open(labels_path, "r", encoding="utf-8") as r:
            labels = json.loads(r.read())
            for label in labels:
                label_id = labels[label]
                labels_map[label_id] = label
    
        # Load the tokenizer and the fine-tuned model
        tokenizer = AutoTokenizer.from_pretrained(model_name)
        model = AutoModelForTokenClassification.from_pretrained(model_name, num_labels=len(labels_map))
        model.to(device)
    
        while True:
            text = input("请输入:")
            if not text or text == '':
                continue
            if text == 'q':
                break
    
            encoded_input = tokenizer(text, padding="max_length", truncation=True, max_length=max_length)
            input_ids = torch.tensor([encoded_input['input_ids']]).to(device)
            attention_mask = torch.tensor([encoded_input['attention_mask']]).to(device)
    
            outputs = model(input_ids, attention_mask=attention_mask)
            result = post_processing(outputs, text, labels_map)
            print(result)
    
    if __name__ == '__main__':
        main()
    
    

    Input: 根据北京市住房和城乡建设委员会总体工作部署,市建委调配给东城区118套房源,99户家庭全部来到现场

    Recognition result:

    {'government': ['北京市住房和城乡建设委员会'], 'address': ['东城区']}
    

    Input: 为星际争霸2冠军颁奖的嘉宾是来自上海新闻出版局副局长陈丽女士。最后,为魔兽争霸3项目冠军—

    Recognition result:

    {'game': ['星际争霸2', '魔兽争霸3'], 'position': ['上海新闻出版局'], 'name': ['副局长', '陈丽']}
    

    Input: 作出对成钢违纪辞退处理决定,并开具了退工单。今年8月,公安机关以不应当追究刑事责任为由

    Recognition result:

    {'name': ['成钢'], 'government': ['公安机关']}
    
  • Original article: https://blog.csdn.net/qq_43692950/article/details/133768324