• 李宏毅 HW7: Using BERT for QA


    I. Filling the Gaps — Practice Makes Perfect:

            Only by grinding through the clumsy early phase, turning things over again and again, do you reach the stage of fluency where it all feels effortless.

    1. A quick introduction to using the tokenizer from transformers:
    from transformers import BertTokenizerFast

    # Load the pretrained BERT tokenizer
    tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')

    # Input text
    text = "This is an example sentence."

    # Tokenize and encode the text
    tokens = tokenizer(text, truncation=True, padding=True)

    # Token IDs after tokenization
    input_ids = tokens['input_ids']
    # Token type IDs
    token_type_ids = tokens['token_type_ids']
    # Attention mask
    attention_mask = tokens['attention_mask']

    print("Token IDs:", input_ids)
    print("Token Type IDs:", token_type_ids)
    print("Attention Mask:", attention_mask)

    The output is:

    Token IDs: [101, 2023, 2003, 2019, 2742, 6251, 102]

    Token Type IDs: [0, 0, 0, 0, 0, 0, 0]

    Attention Mask: [1, 1, 1, 1, 1, 1, 1]
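
    As a small follow-up sketch (my own addition, not from the original notes): IDs 101 and 102 at the two ends are the [CLS] and [SEP] special tokens that tokenizer() inserts by default; decode() makes this visible.

    # Decode the IDs back to text to reveal the special tokens
    print(tokenizer.decode(input_ids))
    # [CLS] this is an example sentence. [SEP]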

    2. What char_to_token does and how to use it:
    from transformers import BertTokenizer

    # Initialize the BERT tokenizer
    tokenizer = BertTokenizer.from_pretrained('bert-base-chinese')

    # Raw text
    text = "我爱自然语言处理"

    # Tokenize
    tokens = tokenizer.tokenize(text)
    print("Tokens:", tokens)

    # Character-level index
    char_index = 3

    # Convert the tokens back into a string and locate the character in it
    sentence = tokenizer.convert_tokens_to_string(tokens)
    char_position = sentence.index(text[char_index])
    print("Character index {} corresponds to position {} in the detokenized string".format(char_index, char_position))

    # Convert the tokens to an ID sequence
    input_ids = tokenizer.convert_tokens_to_ids(tokens)
    print("ID sequence:", input_ids)

    Output:

    Tokens: ['我', '爱', '自', '然', '语', '言', '处', '理']
    Character index 3 corresponds to position 6 in the detokenized string
    ID sequence: [2769, 4263, 5632, 4197, 6427, 6241, 1905, 4415]

    Note: the example above never actually calls char_to_token — it locates the character by searching the detokenized string manually. An accurate demonstration of the real API follows.
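
    A hedged sketch (my own addition, assuming bert-base-chinese): char_to_token is a method on the BatchEncoding returned by a *fast* tokenizer, and it maps a character index in the raw string directly to the index of the token containing that character — exactly what the homework code relies on.

    from transformers import BertTokenizerFast

    # char_to_token requires a fast tokenizer (its BatchEncoding carries offsets)
    tokenizer = BertTokenizerFast.from_pretrained('bert-base-chinese')
    encoding = tokenizer("我爱自然语言处理", add_special_tokens=False)
    # Character index 3 ("然") maps to token index 3, since each Chinese
    # character becomes its own token here
    print(encoding.char_to_token(3))  # 3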

    3. List slicing: if the list is shorter than max_len, the slice simply returns the whole list, as in
    tokenized_question.ids[:self.max_question_len]

    # Test:
    test = [1, 2, 3, 4]
    print(test[:100])

    Output: [1, 2, 3, 4]


    II. Reading the Code:
    version_1: the TA's baseline code:
    # Download the data
    # Download link 1
    !gdown --id '1AVgZvy3VFeg0fX-6WQJMHPVrx3A-M1kb' --output hw7_data.zip
    # Download link 2 (if the above link fails)
    # !gdown --id '1qwjbRjq481lHsnTrrF4OjKQnxzgoLEFR' --output hw7_data.zip
    # Download link 3 (if the above link fails)
    # !gdown --id '1QXuWjNRZH6DscSd6QcRER0cnxmpZvijn' --output hw7_data.zip
    !unzip -o hw7_data.zip
    # For this HW, K80 < P4 < T4 < P100 <= T4(fp16) < V100
    !nvidia-smi
    # Install transformers
    # You are allowed to change the version of transformers or use other toolkits
    !pip install transformers
    # Imports; fix the random seed
    import json
    import numpy as np
    import random
    import torch
    from torch.utils.data import DataLoader, Dataset
    from transformers import AdamW, BertForQuestionAnswering, BertTokenizerFast
    from tqdm.auto import tqdm

    device = "cuda" if torch.cuda.is_available() else "cpu"

    # Fix random seed for reproducibility
    def same_seeds(seed):
        torch.manual_seed(seed)
        if torch.cuda.is_available():
            torch.cuda.manual_seed(seed)
            torch.cuda.manual_seed_all(seed)
        np.random.seed(seed)
        random.seed(seed)
        torch.backends.cudnn.benchmark = False
        torch.backends.cudnn.deterministic = True

    same_seeds(0)
    # Change "fp16_training" to True to support automatic mixed precision training (fp16)
    # While reading the baseline we can ignore this fp16 part for now
    fp16_training = False

    if fp16_training:
        !pip install accelerate==0.2.0
        from accelerate import Accelerator
        accelerator = Accelerator(fp16=True)
        device = accelerator.device

    # Documentation for the toolkit: https://huggingface.co/docs/accelerate/
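
    One caveat (my own note, an assumption about newer library versions): Accelerator(fp16=True) matches the accelerate==0.2.0 pin above; on recent accelerate releases the argument was renamed, so the equivalent sketch would be:

    from accelerate import Accelerator

    # Newer accelerate API (assumption: a recent accelerate release)
    accelerator = Accelerator(mixed_precision="fp16")
    device = accelerator.device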
    # Load the BERT model and tokenizer
    model = BertForQuestionAnswering.from_pretrained("bert-base-chinese").to(device)  # pull the pretrained Chinese BERT (with a QA prediction head) from transformers
    tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")  # the object used for tokenizing
    # You can safely ignore the warning message (it pops up because new prediction heads for QA are initialized randomly)
    # Read the data from files into arrays
    # This part defines a single read function, then calls it on the 3 json files
    # to load the corresponding paragraphs and questions
    # (1) First look at the data layout in each file:
    # "questions": [
    #   {
    #     "id": 0,
    #     "paragraph_id": 660,
    #     "question_text": "福岡市的兩大中心地區指的是博多地區,還有哪一地區?",
    #     "answer_text": "天神地區",
    #     "answer_start": 343,
    #     "answer_end": 346
    # "questions" maps to an array of entries like the one above; note every question id is unique
    # "paragraphs" maps to an array of strings, one article per string and nothing else,
    # so paragraph_id above is what indexes into it
    def read_data(file):
        with open(file, 'r', encoding="utf-8") as reader:
            data = json.load(reader)
        return data["questions"], data["paragraphs"]

    train_questions, train_paragraphs = read_data("hw7_train.json")
    dev_questions, dev_paragraphs = read_data("hw7_dev.json")
    test_questions, test_paragraphs = read_data("hw7_test.json")
    # Tokenize the raw data
    # Tokenize questions and paragraphs separately
    # "add_special_tokens" is set to False since special tokens will be added when tokenized
    # questions and paragraphs are combined in dataset __getitem__
    # This step uses the bert tokenizer loaded above on the Chinese text in the data
    train_questions_tokenized = tokenizer([train_question["question_text"] for train_question in train_questions], add_special_tokens=False)
    dev_questions_tokenized = tokenizer([dev_question["question_text"] for dev_question in dev_questions], add_special_tokens=False)
    test_questions_tokenized = tokenizer([test_question["question_text"] for test_question in test_questions], add_special_tokens=False)
    # The three lines above simply run the tokenizer over the "question_text" of every entry
    # and store the resulting tokens in *_questions_tokenized

    train_paragraphs_tokenized = tokenizer(train_paragraphs, add_special_tokens=False)
    dev_paragraphs_tokenized = tokenizer(dev_paragraphs, add_special_tokens=False)
    test_paragraphs_tokenized = tokenizer(test_paragraphs, add_special_tokens=False)
    # Same idea here: convert the paragraph text to tokens and keep them in the matching arrays

    # You can safely ignore the warning message as tokenized sequences will be further
    # processed in dataset __getitem__ before passing to model
    # Build train_loader, dev_loader, test_loader:
    class QA_Dataset(Dataset):
        # Parameters: the split name, the raw question dicts, and the token arrays for questions and paragraphs
        def __init__(self, split, questions, tokenized_questions, tokenized_paragraphs):
            self.split = split  # split is 'train', 'dev' or 'test'
            self.questions = questions
            self.tokenized_questions = tokenized_questions
            self.tokenized_paragraphs = tokenized_paragraphs
            self.max_question_len = 40
            self.max_paragraph_len = 150

            ##### TODO: Change value of doc_stride #####
            self.doc_stride = 150

            # Input sequence length = [CLS] + question + [SEP] + paragraph + [SEP]
            self.max_seq_len = 1 + self.max_question_len + 1 + self.max_paragraph_len + 1

        def __len__(self):
            return len(self.questions)  # the number of questions to answer

        def __getitem__(self, idx):
            question = self.questions[idx]
            tokenized_question = self.tokenized_questions[idx]
            tokenized_paragraph = self.tokenized_paragraphs[question["paragraph_id"]]  # so `questions` holds the raw structured entries, and paragraph_id picks the article
            # The basics of __getitem__ end here: we could already return the question tokens
            # and the matching paragraph tokens. What follows is preprocessing that shapes
            # them into nicer training data.

            ##### TODO: Preprocessing #####
            # Hint: How to prevent model from learning something it should not learn
            if self.split == "train":
                # Convert answer's start/end positions in paragraph_text to start/end positions in tokenized_paragraph
                # question["answer_start"] is just a number: a character position inside the article.
                # These two lines map it to the token index after tokenization.
                answer_start_token = tokenized_paragraph.char_to_token(question["answer_start"])
                answer_end_token = tokenized_paragraph.char_to_token(question["answer_end"])

                # A single window is obtained by slicing the portion of paragraph containing the answer
                # This is the part the TA says can be modified: the slicing here throws information away,
                # so improving this preprocessing should raise accuracy
                mid = (answer_start_token + answer_end_token) // 2
                paragraph_start = max(0, min(mid - self.max_paragraph_len // 2, len(tokenized_paragraph) - self.max_paragraph_len))  # because of the length cap, start at 0 or at the latest position that still fits
                paragraph_end = paragraph_start + self.max_paragraph_len  # end is simply start + max_len

                # Slice question/paragraph and add special tokens (101: CLS, 102: SEP)
                # Concatenating input_ids_question and input_ids_paragraph yields the BERT input window
                input_ids_question = [101] + tokenized_question.ids[:self.max_question_len] + [102]  # ids are the numeric stand-ins for the characters
                input_ids_paragraph = tokenized_paragraph.ids[paragraph_start : paragraph_end] + [102]

                # Convert answer's start/end positions in tokenized_paragraph to start/end positions in the window
                answer_start_token += len(input_ids_question) - paragraph_start  # i.e. the position inside the concatenated [CLS] + question + [SEP] + paragraph + [SEP] sequence
                answer_end_token += len(input_ids_question) - paragraph_start

                # Pad sequence and obtain inputs to model (see the padding helper below)
                input_ids, token_type_ids, attention_mask = self.padding(input_ids_question, input_ids_paragraph)
                return torch.tensor(input_ids), torch.tensor(token_type_ids), torch.tensor(attention_mask), answer_start_token, answer_end_token

            # Validation/Testing
            else:
                input_ids_list, token_type_ids_list, attention_mask_list = [], [], []

                # Paragraph is split into several windows, each with start positions separated by step "doc_stride"
                # Several lists are returned here: consecutive windows start doc_stride tokens apart
                for i in range(0, len(tokenized_paragraph), self.doc_stride):
                    # Slice question/paragraph and add special tokens (101: CLS, 102: SEP)
                    input_ids_question = [101] + tokenized_question.ids[:self.max_question_len] + [102]
                    input_ids_paragraph = tokenized_paragraph.ids[i : i + self.max_paragraph_len] + [102]

                    # Pad sequence and obtain inputs to model
                    input_ids, token_type_ids, attention_mask = self.padding(input_ids_question, input_ids_paragraph)
                    input_ids_list.append(input_ids)
                    token_type_ids_list.append(token_type_ids)
                    attention_mask_list.append(attention_mask)

                return torch.tensor(input_ids_list), torch.tensor(token_type_ids_list), torch.tensor(attention_mask_list)

        # padding fills up sequences that are shorter than the full input window;
        # it is easiest to understand against the [CLS] + question + [SEP] + paragraph + [SEP] layout
        def padding(self, input_ids_question, input_ids_paragraph):
            # Pad zeros if sequence length is shorter than max_seq_len
            padding_len = self.max_seq_len - len(input_ids_question) - len(input_ids_paragraph)
            # Indices of input sequence tokens in the vocabulary
            input_ids = input_ids_question + input_ids_paragraph + [0] * padding_len
            # Segment token indices to indicate first and second portions of the inputs. Indices are selected in [0, 1]
            # token_type_ids marks where the paragraph sits (its original purpose is distinguishing sentence pairs)
            token_type_ids = [0] * len(input_ids_question) + [1] * len(input_ids_paragraph) + [0] * padding_len
            # Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]
            # padding positions get 0, real tokens get 1
            attention_mask = [1] * (len(input_ids_question) + len(input_ids_paragraph)) + [0] * padding_len
            return input_ids, token_type_ids, attention_mask

    train_set = QA_Dataset("train", train_questions, train_questions_tokenized, train_paragraphs_tokenized)
    dev_set = QA_Dataset("dev", dev_questions, dev_questions_tokenized, dev_paragraphs_tokenized)
    test_set = QA_Dataset("test", test_questions, test_questions_tokenized, test_paragraphs_tokenized)

    train_batch_size = 32

    # Note: Do NOT change batch size of dev_loader / test_loader !
    # Although batch size=1, it is actually a batch consisting of several windows from the same QA pair
    train_loader = DataLoader(train_set, batch_size=train_batch_size, shuffle=True, pin_memory=True)
    dev_loader = DataLoader(dev_set, batch_size=1, shuffle=False, pin_memory=True)
    test_loader = DataLoader(test_set, batch_size=1, shuffle=False, pin_memory=True)
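
    A quick sanity check (my own addition, not part of the TA code): one dev item already bundles several windows of the same QA pair, which is why the dev/test batch size must stay 1.

    # Each dev item stacks all windows of one QA pair
    input_ids, token_type_ids, attention_mask = dev_set[0]
    print(input_ids.shape)  # (num_windows, 193) -- 193 = 1 + 40 + 1 + 150 + 1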
    # Define the evaluation helper:
    def evaluate(data, output):
        # data is one element of dev_loader (the three tensors); output is the model's output for it
        ##### TODO: Postprocessing #####
        # There is a bug and room for improvement in postprocessing
        # Hint: Open your prediction file to see what is wrong
        answer = ''
        max_prob = float('-inf')
        num_of_windows = data[0].shape[1]

        for k in range(num_of_windows):
            # Obtain answer by choosing the most probable start position / end position
            start_prob, start_index = torch.max(output.start_logits[k], dim=0)  # most likely start index and its probability
            end_prob, end_index = torch.max(output.end_logits[k], dim=0)  # most likely end index and its probability

            # Probability of answer is calculated as sum of start_prob and end_prob
            prob = start_prob + end_prob

            # Replace answer if calculated probability is larger than previous windows
            if prob > max_prob:
                max_prob = prob
                # Convert tokens to chars (e.g. [1920, 7032] --> "大 金")
                answer = tokenizer.decode(data[0][0][k][start_index : end_index + 1])

        # Remove spaces in answer (e.g. "大 金" --> "大金")
        return answer.replace(' ', '')
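
    One commonly reported issue hides in this postprocessing (the bug the TA's hint points at): within a window, the argmax end position can land *before* the start position, which then decodes an empty or garbled answer. A hedged sketch of a fix, keeping everything else identical:

    # Variant of evaluate() that skips ill-formed spans instead of decoding them
    # (the [UNK]-token issue would additionally need offset_mapping back to raw text)
    def evaluate_fixed(data, output):
        answer = ''
        max_prob = float('-inf')
        num_of_windows = data[0].shape[1]
        for k in range(num_of_windows):
            start_prob, start_index = torch.max(output.start_logits[k], dim=0)
            end_prob, end_index = torch.max(output.end_logits[k], dim=0)
            # Skip windows whose predicted end precedes the predicted start
            if start_index > end_index:
                continue
            prob = start_prob + end_prob
            if prob > max_prob:
                max_prob = prob
                answer = tokenizer.decode(data[0][0][k][start_index : end_index + 1])
        return answer.replace(' ', '')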
    # Training code
    num_epoch = 1
    validation = True
    logging_step = 100  # print once every logging_step steps
    learning_rate = 1e-4
    optimizer = AdamW(model.parameters(), lr=learning_rate)

    if fp16_training:
        model, optimizer, train_loader = accelerator.prepare(model, optimizer, train_loader)

    model.train()

    print("Start Training ...")

    for epoch in range(num_epoch):
        step = 1
        train_loss = train_acc = 0

        for data in tqdm(train_loader):
            # Load all data into GPU
            data = [i.to(device) for i in data]

            # Model inputs: input_ids, token_type_ids, attention_mask, start_positions, end_positions (Note: only "input_ids" is mandatory)
            # Model outputs: start_logits, end_logits, loss (returned when start_positions/end_positions are provided)
            # The BERT model's output is already packaged for us, so the loss comes with it
            output = model(input_ids=data[0], token_type_ids=data[1], attention_mask=data[2], start_positions=data[3], end_positions=data[4])

            # Choose the most probable start position / end position
            start_index = torch.argmax(output.start_logits, dim=1)
            end_index = torch.argmax(output.end_logits, dim=1)

            # Prediction is correct only if both start_index and end_index are correct
            train_acc += ((start_index == data[3]) & (end_index == data[4])).float().mean()
            train_loss += output.loss

            if fp16_training:
                accelerator.backward(output.loss)
            else:
                output.loss.backward()

            optimizer.step()
            optimizer.zero_grad()
            step += 1

            ##### TODO: Apply linear learning rate decay #####

            # Print training loss and accuracy over past logging step
            if step % logging_step == 0:
                print(f"Epoch {epoch + 1} | Step {step} | loss = {train_loss.item() / logging_step:.3f}, acc = {train_acc / logging_step:.3f}")
                train_loss = train_acc = 0

        if validation:
            print("Evaluating Dev Set ...")
            model.eval()
            with torch.no_grad():
                dev_acc = 0
                for i, data in enumerate(tqdm(dev_loader)):
                    output = model(input_ids=data[0].squeeze(dim=0).to(device), token_type_ids=data[1].squeeze(dim=0).to(device),
                                   attention_mask=data[2].squeeze(dim=0).to(device))
                    # prediction is correct only if answer text exactly matches
                    dev_acc += evaluate(data, output) == dev_questions[i]["answer_text"]  # add 1 to dev_acc whenever the returned answer matches
                print(f"Validation | Epoch {epoch + 1} | acc = {dev_acc / len(dev_loader):.3f}")
            model.train()

    # Save the model and its configuration file to the directory "saved_model"
    # i.e. there are two files under the directory "saved_model": "pytorch_model.bin" and "config.json"
    # Saved model can be re-loaded using "model = BertForQuestionAnswering.from_pretrained("saved_model")"
    print("Saving Model ...")
    model_save_dir = "saved_model"
    model.save_pretrained(model_save_dir)
    # Test-set inference:
    print("Evaluating Test Set ...")
    result = []
    model.eval()
    with torch.no_grad():
        for data in tqdm(test_loader):
            output = model(input_ids=data[0].squeeze(dim=0).to(device), token_type_ids=data[1].squeeze(dim=0).to(device),
                           attention_mask=data[2].squeeze(dim=0).to(device))
            result.append(evaluate(data, output))

    result_file = "result.csv"
    with open(result_file, 'w') as f:
        f.write("ID,Answer\n")
        for i, test_question in enumerate(test_questions):
            # Replace commas in answers with empty strings (since csv is separated by comma)
            # Answers in kaggle are processed in the same way
            f.write(f"{test_question['id']},{result[i].replace(',','')}\n")

    print(f"Completed! Result is in {result_file}")

    version2: the improved version

    1. Improve the learning rate decay part:

    torch.optim — PyTorch 2.0 documentation

    optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
    scheduler = ExponentialLR(optimizer, gamma=0.9)

    for epoch in range(20):
        for input, target in dataset:
            optimizer.zero_grad()
            output = model(input)
            loss = loss_fn(output, target)
            loss.backward()
            optimizer.step()
        scheduler.step()  # called once per epoch (but what if I only have 1 epoch? see below)
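
    To answer the question in that comment: with num_epoch = 1, an epoch-level scheduler.step() fires only once at the very end, so effectively nothing decays. The usual fix is to step the scheduler once per batch instead. A hedged sketch using the linear-decay helper that ships with transformers (which would also satisfy the baseline's linear-decay TODO):

    from transformers import get_linear_schedule_with_warmup

    # One scheduler step per optimizer step over the whole run
    total_steps = num_epoch * len(train_loader)
    scheduler = get_linear_schedule_with_warmup(
        optimizer, num_warmup_steps=0, num_training_steps=total_steps)

    # ... then inside the batch loop of the training code above:
    #     optimizer.step()
    #     scheduler.step()   # decay once per batch, not once per epoch
    #     optimizer.zero_grad()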

    Another variant with two chained decays:

    optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
    scheduler1 = ExponentialLR(optimizer, gamma=0.9)
    scheduler2 = MultiStepLR(optimizer, milestones=[30, 80], gamma=0.1)

    for epoch in range(20):
        for input, target in dataset:
            optimizer.zero_grad()
            output = model(input)
            loss = loss_fn(output, target)
            loss.backward()
            optimizer.step()
        scheduler1.step()
        scheduler2.step()

    2. Reduce the value of doc_stride:

    First attempt: doc_stride = 75 plus decay gamma = 0.99. [Kaggle score screenshot omitted.]

    Colab: [training log screenshot omitted.]

    The results are really nice!!! Hehe!! A clear improvement on the Dev set.

    Note: on kaggle, download the data like this instead:

    !pip install gdown
    import gdown

    url = 'https://drive.google.com/uc?id=1AVgZvy3VFeg0fX-6WQJMHPVrx3A-M1kb'
    output = 'hw7_data.zip'
    gdown.download(url, output, quiet=False)

     

    version3: use a pretrained_model from Hugging Face

    First change: adjust the start/end of the paragraph window in the dataloader so that ans_start and ans_end are not always in the middle of the window (if the answer always sits in the middle, the model gets the wrong idea that answers are always mid-window). But how exactly should this be changed?

    Answer: mainly by varying the distance between para_start and ans_start, so that ans_start can land at different positions inside the window.

    (... I haven't actually made this change yet; a probabilistic approach should be fairly easy to implement — see the sketch below. Leaving it for later ...)
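
    A hedged sketch of that probabilistic idea (my own variant, not the TA's code): in the training branch of __getitem__, replace the "mid"-centering lines with a random offset so the answer can land anywhere in the window:

    # Inside the train branch of __getitem__ (random is already imported above):
    answer_len = answer_end_token - answer_start_token + 1
    # Random left margin: anywhere from 0 up to the free space left in the window
    offset = random.randint(0, max(0, self.max_paragraph_len - answer_len))
    paragraph_start = max(0, min(answer_start_token - offset,
                                 len(tokenized_paragraph) - self.max_paragraph_len))
    paragraph_end = paragraph_start + self.max_paragraph_len
    # The rest (slicing, special tokens, position remapping) stays unchanged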

    Second change: switch to another model:

    (1)

    hfl/chinese-pert-base-mrc · Hugging Face

    # Load model directly
    from transformers import AutoTokenizer, AutoModelForQuestionAnswering

    tokenizer = AutoTokenizer.from_pretrained("hfl/chinese-pert-base-mrc")
    model = AutoModelForQuestionAnswering.from_pretrained("hfl/chinese-pert-base-mrc").to(device)
    # Very often, errors come from forgetting .to(device)

    Wow, the pretrained models on Hugging Face really are that good!!!! Unbeatable, bro.

    It simply crushes everything else.

    (2)

    This model is decent too, though it lags a bit behind the one above.

    (At the time I had forgotten the TA said to use BertTokenizerFast.)

    ckiplab/bert-base-chinese-qa · Hugging Face

    # Load model directly
    from transformers import AutoTokenizer, AutoModelForQuestionAnswering

    tokenizer = AutoTokenizer.from_pretrained("ckiplab/bert-base-chinese-qa")
    model = AutoModelForQuestionAnswering.from_pretrained("ckiplab/bert-base-chinese-qa").to(device)

  • Original post: https://blog.csdn.net/xiao_ZHEDA/article/details/133041383