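Contents
1.1 Two PDF parsers: nougat vs. ScienceBeam
Part 3 Further processing of the review data: standardizing the review format and aggregating many reviews into one
3.2 Making reviews easier for the model to learn: distilling 4 key points and aggregating many into one
3.2.1 Designing a better prompt template so an LLM can extract the 4 content points from the review corpus
3.2.3 Processing the review data with the final prompt: ChatGPT vs. open-source models
3.2.4 Final cleanup of the review data: a variant of the JSON text, with long-tail data removed
3.3 (Optional reading) Related work, AcademicGPT: continued pretraining of LLaMA2-70B, including a paper-review feature
3.3.1 AcademicGPT: Empowering Academic Research
3.3.2 Paper review: borrowing from ReviewAdvisor to extract 8 key points from reviews (similar to how we borrowed from the Stanford work to distill reviews into 4 key points)
3.3.3 Why the 70B AcademicGPT performs poorly at paper review
Part 4 Model selection: from Mistral and Mistral-YaRN to LongLoRA LLaMA
4.1 Background: Mistral 7B, YaRN, LongLoRA/LongQLoRA
4.1.3 LongLoRA LLaMA and LongQLoRA LLaMA
4.2 Choosing a model, a three-way comparison: Yarn-Mistral-7b-64k, Mistral-instruct, LLaMA-LongLoRA/LLaMA-LongQLoRA
4.2.2 Fine-tuning Mistral-instruct directly with llama factory
4.2.3 LLaMA2 7B chat-LongLoRA: success
4.2.3.1 不染's work: a first run with heavily modified LongLoRA source
4.2.3.2 阿李's work: a second run with lightly modified LongLoRA source
4.2.4 Fine-tuning LLaMA2 7B chat with LongQLoRA on 10,000+ paper-review records: success
Part 5 Model training: how to fine-tune LLaMA2 and Yarn-Mistral
5.1 LLaMA2 7b chat + LongQLoRA training
5.2 LLaMA2 7b chat + LongLoRA training
5.2.1 [不染] Llama2-7b-chat + LongLoRA source + training/inference
5.2.2 [阿李] LLaMA-2-7b-chat + LongLoRA training
6.1.2 Workflow for measuring LLM review quality with the overlap-based hit-rate metric
6.2 Evaluating LLaMA2 7B chat-LongQLoRA: stronger than GPT3.5 and GPT4
6.2.1 Having our review model and GPT3.5 each produce reviews for the test-set papers
6.2.2 Processing the reviews: format conversion and labeling review items with opinion indices
6.2.3 Matching the opinion items from the human, llama2, and GPT3.5 reviews
6.2.4 Our review model vs. GPT: computing hit rate and hit count to decide the win rate
6.3 Evaluating LLaMA2 7B chat-LongLoRA: still stronger than GPT3.5 and GPT4
6.3.1 Evaluation of "5.2.1 [不染] Llama2-7b-chat + LongLoRA source + training/inference"
6.3.2 Evaluation of "5.2.2 [阿李] LLaMA-2-7b-chat + LongLoRA training"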
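As described in Part 3 of my earlier article "Source-Code Walkthrough and Fine-Tuning of Academic-Paper GPTs: From ChatPaper to July Paper-Review GPT v1", the academic-paper GPTs on the market are not great at summarization, chat, translation, and grammar checking, but they are at least passable. For paper revision and review, however, their quality drops sharply
Why? The root cause is that nearly all of their features are built on API calls, and an API is not a cure-all: it handles translation, summarization, and chat reasonably well, but it falls short when asked to produce review comments on a paper. To get better reviews, a model has to be fine-tuned on a purpose-built alignment dataset so that it acquires strong review ability
So in the first version we did the following three things
Going into Q4, our paper-review GPT team started on the second version, focusing on optimization in the following three areas. (Since 2023 Q3, when I set up and led an LLM project team alongside our education team, we have kept iterating on various LLM projects, and each project later got its own group. The second project group owns the paper-review GPT: in 2023 Q3 it was 阿荀 and me, in 2023 Q4 不染, 雪狼, and 朝阳 joined, and in 2024 Q1 文弱 and 阿李 joined.)
nougat is Meta's open-source parser for academic PDF documents (see its home page and code repository). It is built around OCR, and its most notable advantage over earlier parsers is that it accurately recognizes formulas and tables and converts them into Markdown-compatible text. Its drawbacks are slow conversion and the possibility that parsed content comes out in the wrong order
Comparing it with another parser, ScienceBeam, we find that
We also have to consider the granularity of the parser's output: how the body is split into parts, and whether we will later need to pull out specific parts of the body for separate processing. If the granularity is poor, extracting those parts can be difficult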
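- # Create a new virtual environment
- conda create -n nougat-ocr python=3.10
- # Activate the virtual environment
- conda activate nougat-ocr
- # Install the required packages with pip (mirror indexes may cause version conflicts, so it is better to enable a proxy and install from the official Python index)
- pip install nougat-ocr -i https://pypi.org/simple
- # On first use, the latest weights are downloaded automatically
- # For a single PDF file
- nougat {path to PDF file} -o {output directory}
- # For a directory containing multiple PDFs
- nougat {path to PDF directory} -o {output directory}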
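ScienceBeam is a derivative project of the classic PDF parser GROBID and is the text-extraction method used in the paper "Can large language models provide useful feedback on research papers? A large-scale empirical analysis". Like other earlier parsers, it cannot parse formulas down to LaTeX, and it only runs on x86 Linux systems
// To be updated
In the end, for the 26,000 papers that have reviews (v1: 30,000 papers in total, 25,000 with reviews; v2: 32,000 papers in total, 26,000 with reviews)
ScienceBeam returns its parse result as a dictionary, which contains the following keys
The parts we actually use are title, abstract, figure_and_table_captions, and main_content
The special markers [TITLE], [ABSTRACT], [CAPTIONS], and [CONTENT] are added to separate the parts of the paper. Because [CONTENT] may refer to what is in [CAPTIONS], [CAPTIONS] is placed before [CONTENT]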
- [TITLE]
- Title
-
- [ABSTRACT]
- Abstract
-
- [CAPTIONS]
- Captions of each figure and table
-
- [CONTENT]
- Rest of the body text
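The layout above can be sketched as a small helper. This is only an illustration under stated assumptions: the four keys come from the text, but the function name `format_paper_input` is hypothetical, and treating the captions as either a string or a list of strings is an assumption about the parser output.

```python
def format_paper_input(parsed: dict) -> str:
    """Join the parsed paper fields under the four section markers."""
    captions = parsed.get("figure_and_table_captions", "")
    # Assumption: captions may arrive as a list of strings; join them line by line.
    if isinstance(captions, (list, tuple)):
        captions = "\n".join(captions)
    sections = [
        ("[TITLE]", parsed.get("title", "")),
        ("[ABSTRACT]", parsed.get("abstract", "")),
        ("[CAPTIONS]", captions),  # placed before the body on purpose
        ("[CONTENT]", parsed.get("main_content", "")),
    ]
    return "\n\n".join(f"{tag}\n{str(text).strip()}" for tag, text in sections)
```

Keeping the marker order in one list makes it easy to check that [CAPTIONS] always precedes [CONTENT].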
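// To be updated
In the first version, we processed the review data as follows
In short
The "b_forum" field is the foreign key that links to the paper data; "b_forum" is the unique identifier (id) of the corresponding paper
We apply the following 4 processing steps to the raw data
For now, the code for this part of the data processing is available in July Online's "LLM Project Development Online Camp"
// To be updated
Recently, researchers from Stanford University and other institutions fed thousands of papers from Nature-family journals, ICLR, and other top venues to GPT-4, had it generate review comments and revision suggestions, and compared them with the comments given by human reviewers
So how do you get an LLM to review your paper? Specifically, as shown in the figure below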
- Your task now is to draft a high-quality review outline for a top-tier Machine Learning (ML) conference for a submission titled “{PaperTitle}”:
-
- ```
- {PaperContent}
- ```
-
- ======
- Your task:
- Compose a high-quality peer review of a paper submitted to a Nature family journal.
-
- Start by "Review outline:".
- And then:
- "1. Significance and novelty"
- "2. Potential reasons for acceptance"
- "3. Potential reasons for rejection", List multiple key reasons. For each key reason, use **>=2 sub bullet points** to further clarify and support your arguments in painstaking details. Be as specific and detailed as possible.
- "4. Suggestions for improvement", List multiple key suggestions. Be as specific and detailed as possible.
-
- Be thoughtful and constructive. Write Outlines only.
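The Stanford work from the previous section, which has GPT-4 act as a reviewer, was quite instructive for our own paper-review GPT
For the "third major point: organizing the review corpus" above, we (mainly 阿荀, and then me) came up with an idea: use a prompt template to have an LLM organize the review corpus we crawled, distilling it into common review comments along the 4 aspects mentioned above
How should this prompt template be designed? Building on the Stanford work from the previous section, the template can be further refined from the Stanford one as follows
// For now, see the "LLM Project Development Online Camp"; this article will be updated later
A paper has multiple reviews, and there are three modes for learning from review data
With the many-to-one approach, the reviews of the roughly 24,000 papers left after final cleaning can each be handled in a single call to the 16K-context GPT-3.5 (16K is long enough to pass all of a paper's reviews at once) or to an open-source model, which extracts the 4 key points directly from all of that paper's reviews. That comes to a little over 24,000 calls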
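A minimal sketch of the many-to-one grouping, assuming each raw review is a dict that carries the "b_forum" paper id described earlier. The "content" field name and both function names are hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def group_reviews_by_paper(reviews):
    """Collect all review texts that share the same paper id (b_forum)."""
    grouped = defaultdict(list)
    for review in reviews:
        # "content" is a hypothetical field name for the review text.
        grouped[review["b_forum"]].append(review["content"])
    return dict(grouped)


def build_aggregation_input(review_texts):
    """Concatenate one paper's reviews so that a single LLM call sees all of them."""
    return "\n\n".join(
        f"Review {i}:\n{text.strip()}" for i, text in enumerate(review_texts, 1)
    )
```

One call per paper id then yields roughly one LLM request per paper, which matches the "a little over 24,000 calls" estimate.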
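In summary, adopting the many-to-one strategy for the review data mainly places higher demands on prompting:
So we have to design the prompt around the requirements above (the final prompt is covered, for now, in July Online's "LLM Project Development Online Camp"; this part of the article is scheduled to be updated in Q1 next year)
Once the final prompt was designed, we could have an LLM process the review data with it. Which LLM, though: ChatGPT or an open-source model? To decide, we compared the following three models
The comparison showed that
// To be updated. How the comparison was done and which model performed better are covered in the online camp for now; this article will be updated later
In practice, however, we found that OpenAI limits API access in several ways, and strictly (users face multiple tiers of limits, see https://platform.openai.com/docs/guides/rate-limits/usage-tiers?context=tier-one, including per-minute request limits, daily request limits, per-minute token limits, and daily token limits). Requests often hung with neither a response nor an error, so a lot of time went into being told "rate limit exceeded", waiting, retrying, and being told again. As a result, when we first used OpenAI's official API, from November 24 to November 30, 2023, about 7 days produced only a little over 2,600 records, and the limits kicked in more and more often afterwards. A real headache..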
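One common way to cope with rate limits like these is to retry with exponential backoff. The sketch below is a generic pattern, not the code we actually ran: `request_fn` is a hypothetical zero-argument callable that wraps one API request, and a real client would catch its specific rate-limit exception rather than a bare `Exception`. For requests that hang without an error, a client-side timeout inside `request_fn` would be needed as well.

```python
import random
import time


def call_with_backoff(request_fn, max_retries=6, base_delay=1.0,
                      max_delay=60.0, sleep=time.sleep):
    """Call request_fn, waiting exponentially longer after each failure."""
    for attempt in range(max_retries):
        try:
            return request_fn()
        except Exception:  # assumption: narrow this to the client's rate-limit error
            if attempt == max_retries - 1:
                raise  # out of retries: surface the last error
            delay = min(max_delay, base_delay * (2 ** attempt))
            # A little random jitter keeps parallel workers from retrying in lockstep.
            sleep(delay + random.uniform(0, 0.1 * delay))
```

The `sleep` parameter exists only so the waiting can be replaced in tests or dry runs.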
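The "many-to-one" review-side data was originally returned in JSON mode, so all of it is in JSON format (a dictionary), roughly of the following form
- {
- "Significance and novelty": {
- brief description: detailed description,
- brief description: detailed description,
- ...
- },
- "Potential reasons for acceptance": {
- brief description: detailed description,
- brief description: detailed description,
- ...
- },
- "Potential reasons for rejection": {
- brief description: detailed description,
- brief description: detailed description,
- ...
- },
- "Suggestions for improvement": {
- brief description: detailed description,
- brief description: detailed description,
- ...
- }
- }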
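However, the open-source model we fine-tune later may not attend well to JSON syntax and may have difficulty learning JSON text, so we finally converted the JSON content above into the following format (which can be seen as a variant of the JSON text)
- [Significance and novelty]
- <brief description> detailed description
- <brief description> detailed description
- ...
-
- [Potential reasons for acceptance]
- <brief description> detailed description
- <brief description> detailed description
- ...
-
- [Potential reasons for rejection]
- <brief description> detailed description
- <brief description> detailed description
- ...
-
- [Suggestions for improvement]
- <brief description> detailed description
- <brief description> detailed description
- ...
That is, as shown in the figure below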
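The JSON-to-text conversion described above can be sketched in a few lines. The four section names and the `[section]` / `<brief> detail` layout come from the text; the function name `json_review_to_text` and the handling of an empty section (header only) are assumptions for illustration.

```python
import json

SECTION_ORDER = [
    "Significance and novelty",
    "Potential reasons for acceptance",
    "Potential reasons for rejection",
    "Suggestions for improvement",
]


def json_review_to_text(review):
    """Turn the JSON-mode review dict into the bracketed plain-text variant."""
    if isinstance(review, str):
        review = json.loads(review)
    blocks = []
    for section in SECTION_ORDER:
        items = review.get(section) or {}  # tolerate a missing or empty section
        lines = [f"<{brief}> {detail}" for brief, detail in items.items()]
        blocks.append("\n".join([f"[{section}]"] + lines))
    return "\n\n".join(blocks)
```

Fixing the section order in one constant keeps every training target in the same layout regardless of the key order the LLM returned.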
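In addition, according to our internal document "[Final Plan] Long-Tail Data Cleaning and Next Steps", reviews that are too short may contain only unimportant information, so we can also consider removing them (and, of course, the corresponding long-tail data on the paper side has to be removed too)
After this series of operations, the data shrank from 22,319 paper-review pairs to 15,566
Next, by designing a suitable instruction and combining it with the processed papers and reviews (one review per paper), we obtained an Alpaca-style dataset (instruction-input-output triples), as shown below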
- [
- {
- "instruction": "You are a professional machine learning conference reviewer who reviews a given paper and considers 4 criteria: ** importance and novelty **, ** potential reasons for acceptance **, ** potential reasons for rejection **, and ** suggestions for improvement **. \nThe given paper is as follows: \n\n\n",
- "input": "[TITLE]\nImage Quality Assessment Techniques Improve Training and Evaluation of Energy-Based Generative Adversarial Networks\n\n[ABSTRACT]\nWe propose a new, multi-component energy function for energy-based Generative Adversarial Networks (GANs) based on methods from the image quality assessment literature. Our approach expands on the Boundary Equilibrium Generative Adversarial Network (BEGAN) by outlining some of the short-comings of the original energy and loss functions. We address these short-comings by incorporating an l1 score, the Gradient Magnitude Similarity score, and a chrominance score into the new energy function. We then provide a set of systematic experiments that explore its hyper-parameters. We show that each of the energy function's components is able to represent a slightly different set of features, which require their own evaluation criteria to assess whether they have been adequately learned. We show that models using the new energy function are able to produce better image representations than the BEGAN model in predicted ways.\n\n[CAPTIONS]\nFigure 1: From left to right, the images are the original image, a contrast stretched image, an image with impulsive noise contamination, and a Gaussian smoothed image. Although these images differ greatly in quality, they all have the same MSE from the original image (about 400), suggesting that MSE is a limited technique for measuring image quality.\nFigure 2: Comparison of the gradient (edges in the image) for models 11 (BEGAN) and 12 (scaled BEGAN+GMSM), where O is the original image, A is the autoencoded image, OG is the gradient of the original image, AG is the gradient of the autoencoded image, and S is the gradient magnitude similarity score for the discriminator (D) and generator (G). 
White equals greater similarity (better performance) and black equals lower similarity for the final column.\nFigure 3: Comparison of the chrominance for models 9 (BEGAN+GMSM+Chrom), 11 (BEGAN) and 12 (scaled BEGAN+GMSM), where O is the original image, OC is the original image in the corresponding color space, A is the autoencoded image in the color space, and S is the chrominance similarity score. I and Q indicate the (blue-red) and (green-purple) color dimensions, respectively. All images were normalized relative to their maximum value to increase luminance. Note that pink and purple approximate a similarity of 1, and green and blue approximate a similarity of 0 for I and Q dimensions, respectively. The increased gradient 'speckling' of model 12Q suggests an inverse relationship between the GMSM and chrominance distance functions.\nTable 1: Models and their corresponding model distance function parameters. The l 1 , GMSM, and Chrom parameters are their respective β d values from Equation 8.\nTable 2: Lists the models, their discriminator mean error scores, and their standard deviations for the l 1 , GMSM, and chrominance distance functions over all training epochs. Bold values show the best scores for similar models. Double lines separate sets of similar models. Values that are both bold and italic indicate the best scores overall, excluding models that suffered from modal collapse. These results suggest that model training should be customized to emphasize the relevant components.\n\n[CONTENT]\nSection Title: INTRODUCTION\n INTRODUCTION\n\nSection Title: IMPROVING LEARNED REPRESENTATIONS FOR GENERATIVE MODELING\n IMPROVING LEARNED REPRESENTATIONS FOR GENERATIVE MODELING Radford et al. (2015) demonstrated that Generative Adversarial Networks (GANs) are a good unsu- pervised technique for learning representations of images for the generative modeling of 2D images. Since then, a number of improvements have been made. First, Zhao et al. 
(2016) modified the error signal of the deep neural network from the original, single parameter criterion to a multi-parameter criterion using auto-encoder reconstruction loss. Berthelot et al. (2017) then further modified the loss function from a hinge loss to the Wasserstein distance between loss distributions. For each modification, the proposed changes improved the resulting output to
- # A large middle portion is omitted..
- The first row is model 1, which corresponds with the original BEGAN model. Rows 2-12 represent our experiments. Each cell represents the output of a random sample. \n",
- "output": "[Significance and novelty]\n
Reviewers highlighted that the paper's proposed approach was somewhat incremental, and its significance and novelty were questioned. The modifications and additions to the existing BEGAN framework were not perceived as significantly contributing to the field.\n The paper's novelty was deemed to be restricted, particularly in the choice and justification of the energy function components. Reviewers pointed out that the selected components were borrowed from existing image quality assessment techniques, and their application was not seen as a significant contribution to GAN theories or algorithm designs.\n\n[Potential reasons for acceptance]\n -\n\n[Potential reasons for rejection]\n Reviewers noted that the experimental section was weak, with reliance on visual evaluations and a limited range of datasets (primarily CelebA). This led to concerns about the thoroughness and robustness of the paper's findings.\n The paper was criticized for its lack of clarity in motivating its contributions and presenting results. Reviewers found it difficult to estimate the significance of the proposed model and understand how its results differed from baseline models.\n\n[Suggestions for improvement]\n To strengthen the paper, reviewers suggested broadening the experimental testing to include different datasets involving natural images, beyond the single CelebA dataset. This would provide a more comprehensive evaluation of the proposed techniques.\n Improvements in the paper's clarity and justification were recommended, specifically in clarifying the design choices made for the energy function components. Providing clear justifications for the modifications and additions to the BEGAN framework would enhance the paper's credibility and significance.\n\n" - },
- {
- "instruction": "You are a professional machine learning conference reviewer who reviews a given paper and considers 4 criteria: ** importance and novelty **, ** potential reasons for acceptance **, ** potential reasons for rejection **, and ** suggestions for improvement **. \nThe given paper is as follows: \n\n\n",
- "input": "[TITLE]\nLearning Independent Features with Adversarial Nets for Non-linear ICA\n\n[ABSTRACT]\nReliable measures of statistical dependence could potentially be useful tools for learning independent features and performing tasks like source separation using Independent Component Analysis (ICA). Unfortunately, many of such measures, like the mutual information, are hard to estimate and optimize directly. We propose to learn independent features with adversarial objectives (Goodfellow et al. 2014, Arjovsky et al. 2017) which optimize such measures implicitly. These objectives compare samples from the joint distribution and the product of the marginals without the need to compute any probability densities. We also propose two methods for obtaining samples from the product of the marginals using either a simple resampling trick or a separate parametric distribution. Our experiments show that this strategy can easily be applied to different types of model architectures and solve both linear and non-linear ICA problems.\n\n\n[CAPTIONS]\nFigure 1: Schematic representation of the entire system for learning non-linear ICA. Specific func- tional shapes can be enforced by choosing a suitable decoder architecture.\nFigure 2: The decoder architecture used for the PNL experiments. It can only learn transformations in which a linear transformation is followed by the application of non-linear scalar functions to each of the dimensions.\nFigure 3: Source signals used in the experiments.\nFigure 4: Convergence plots for the linear synthetic source task.\nTable 1: Maximum correlation results on all tasks for the synthetic data. A 'g' in the suffix of the model name indicates that a separate generator network was used instead of the resampling method. Two scores separated by a '/' indicate that the first score was obtained using the model selection described in Section 6.1 while the second score is simply the best score a posteriori. 
Parentheses refer contain the standard deviation of the scores multiplied with 10 −4 .\nTable 2: Maximum correlation results on all tasks for the audio data. A 'g' in the suffix of the model name indicates that a separate generator network was used instead of the resampling method. Two scores separated by a '/' indicate that the first score was obtained using the model selection described in Section 6.1 while the second score is simply the best score a posteriori. Parentheses refer contain the standard deviation of the scores multiplied with 10 −4 .\n\n[CONTENT]\nSection Title: INTRODUCTION\n INTRODUCTION When unsupervised models are developed to learn interesting features of data, this often implies that each feature represents some interesting unique property of the data that cannot easily be derived from the other features. A sensible way to learn such features is by ensuring that they are statistically independent.
- # A large middle portion is omitted..
- Future work should be devoted to a more thorough theoretical analysis of of the proposed methods for minimizing and measuring dependence and how to evaluate them.\n See the appendix for more details about the synthetic signals.\n",
- "output": "[Significance and novelty]\n
The use of adversarial objectives for computing independent features sheds new light on the problem of independent component analysis. It presents an interesting and potentially innovative approach to addressing the issue.\n The paper's comparison with existing methods for ICA, such as linear and post-nonlinear ICA, highlights the attempt to address a wide range of ICA problems using the proposed GAN variant. This comprehensive approach could contribute to the literature on ICA methods.\n\n[Potential reasons for acceptance]\n The paper presents a conceptually thought-provoking approach to independent component analysis using adversarial training, which could contribute to the advancement of ICA methods.\n The coverage of both linear and non-linear ICA problems demonstrates the broad applicability of the proposed GAN-based approach, potentially adding value to the field of independent component analysis.\n\n[Potential reasons for rejection]\n Reviewers have expressed concerns about the lack of clarity, focus, and thorough analysis in the presentation of the proposed GAN variant for ICA, leading to a marginal rating below the acceptance threshold.\n Reviewers have noted that the comparison with existing methods, such as linear and post-nonlinear ICA, is inadequate, and the paper lacks comprehensive analysis and evaluation, resulting in a rating marginally below the acceptance threshold.\n\n[Suggestions for improvement]\n The authors should focus their discussion on addressing specific ICA problems, streamlining the presentation, and providing a more focused and in-depth analysis of the proposed GAN variant for ICA. 
Emphasizing the novelty and significance of the approach could strengthen the paper.\n Enhancing the comparison with prior work, especially in the context of linear and non-linear ICA, and providing a more thorough evaluation of the proposed method would address concerns raised by the reviewers and potentially improve the paper's acceptance prospects.\n\n" - },
- # 15,566 records in total..
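The complete dataset is available in our LLM Project Development Online Camp
Treating each record as the concatenation "instruction+input+output", we tokenized every record with the Mistral tokenizer and counted the tokens per record
The figure above shows that a single record has roughly 6,000 to 12,000 tokens, with many records around 8,500, so 11264 or 12288 is a suitable sequence cutoff length when training the model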
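The length statistics can be sketched as follows. This is an illustration under assumptions: `tokenize` is a parameter standing in for the Mistral tokenizer (for example the `encode` method of a Hugging Face tokenizer), and the helper names are hypothetical. Both candidate cutoffs are multiples of 1024 (11 × 1024 = 11264 and 12 × 1024 = 12288), which is what `round_up_to_multiple` reproduces.

```python
import math


def token_lengths(records, tokenize):
    """Token count of instruction + input + output for each record."""
    return [
        len(tokenize(r["instruction"] + r["input"] + r["output"]))
        for r in records
    ]


def coverage(lengths, cutoff):
    """Fraction of records that fit entirely within the cutoff."""
    return sum(n <= cutoff for n in lengths) / len(lengths)


def round_up_to_multiple(n, multiple=1024):
    """Smallest multiple of `multiple` that is >= n."""
    return math.ceil(n / multiple) * multiple
```

Comparing `coverage(lengths, 11264)` with `coverage(lengths, 12288)` shows how many records each cutoff would truncate.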
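In late November, 阿荀 from our second project group found that
They differ from us in two notable ways. First, they did continued pretraining of LLaMA (AcademicGPT is a continual pretraining on LLaMA2). Second, our paper-review GPT currently reviews English papers only (July's customers mostly publish English EI or SCI papers, with Chinese journals second), while they also cover Chinese. Given LLaMA2-70B's limited Chinese ability and academic-domain knowledge, they collected Chinese data and academic English data to improve both
On the resulting 120B of data, they ran continued pretraining on 192 A100 GPUs with 40 GB of memory each (I don't envy anything else about their work, but the 192 A100s did make me envious; I'd love for some generous backer to solve July's GPU shortage, ^_^). After 37 days of training, LLaMA2-70B gained a better grasp of Chinese and academic content. More training details follow
Like us, they collected their data from the same paper-review website, 29,119 papers and about 79,000 reviews, and then processed it as follows
Finally, the paper data and the distilled review data produced by the steps above were used to fine-tune the 70B model
To make this easier to follow, here is some background on the paper "Can We Automate Scientific Reviewing?"
That paper treats a review as a summary of the paper plus an evaluation of its content, as a way to keep reviews factually correct. It therefore models paper reviewing as a summarization task and trains BART, a relatively advanced model at the time (2021), to obtain the ReviewAdvisor model
With the evaluation system they designed, they made the following observations:
- The model tends to generate non-factual statements
- The model has not learned high-level understanding; for example, it cannot really tell high-quality papers from low-quality ones
- The model tends to imitate the language style of the training data (low-level patterns), for example by generating sentences that occur frequently in the training samples
- It can summarize the core idea of a paper fairly well
The final conclusion: "Model-based review cannot yet replace human review, but it can assist human reviewers"
Two aspects of this work deserve attention: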
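- Augmenting the review data (extractively tagging the review data with 8 aspects using BERT, then correcting by hand)
Reviews are fairly messy, and the team only wanted to keep the useful "structured" content. They therefore start by defining "structured aspects" and extract the corresponding content from each review, which augments the review-side data
1 Define the structured aspects
The team discussed and agreed on the aspects that a "good review" should cover, the following 8 points:
Summary (SUM): a summary of the paper
Motivation/Impact (MOT): motivation and impact
Originality (ORI): originality
Soundness/Correctness (SOU): soundness and correctness
Substance (SUB): substance
Replicability (REP): replicability
Meaningful Comparison (CMP): meaningful comparison
Clarity (CLA): clarity
2 Manual annotation
The team invited 6 students with a machine learning background to annotate the original reviews. The annotation resembles "extractive summarization": marking which spans of the original text belong to which category, "which are Summary (SUM), Motivation/Impact (MOT), Originality (ORI), Soundness/Correctness (SOU), Substance (SUB), Replicability (REP), Meaningful Comparison (CMP) and Clarity (CLA)"
For example: "... ... The results are new[Positive Originality] and important to this field[Positive Motivation] ... ..."
3 Train a tagger
Since annotating all the data by hand is unrealistic, the reviews annotated in step 2 are used to train a BERT extraction model as a tagger that automatically labels the aspects in the raw reviews. Given review text as input, BERT classifies it token by token, predicting which parts of the review belong to which aspect
4 Post-processing
The output of the BERT tagger on the remaining data is not fully reliable (BERT is less capable than GPT3.5), so the tagger's predictions have to be corrected with rules or by hand
5 Human check
People with a machine learning background are invited to check the annotation results
- Generating reviews (fine-tuning BART on papers and the BERT-extracted, manually corrected review corpus)
Reviews are generated from a given paper with BART, whose maximum input length at the time was 1024. Because papers are long, review generation is designed as a two-stage pipeline: first select salient spans from the paper (compressing the input context), then generate the review summary from those spans
Selecting salient spans
Salient spans are identified using keywords such as "demonstrate" and "state-of-the-art" together with a number of sentence-level rules
Training the aspect-aware summarization model
A basic Seq2Seq model predicts the output sequence (review) from the input sequence (paper). On top of this, the team introduced "aspect awareness" to help the model, emphasizing the aspects in its output: two multilayer perceptrons are added, one per task, so the model generates the review token by token and also predicts the aspect of each token. The model therefore learns with two loss functions at once
This also means that a single inference pass outputs 2 sequences: the predicted review content and the predicted aspects, each trained with its own loss term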
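Judging from the review examples shown in the original paper, the results are not good
The following figures show two review examples from the paper
- Review example 1 from the paper is shown below. It lists these weaknesses of the reviewed paper: "The writing needs polish. There are too many typos and grammatical errors. The experimental setup is not convincing enough. First, no baselines are provided. Second, the authors only ran experiments on a single dataset. Third, the authors did not report the variance of the results."
Comments like these may be of limited value to the paper's authors. If a review says there are too many typos and grammatical errors, it should point out where in the paper they occur
- Review example 2 from the paper is shown below. The 5th weakness, "5. The writing of the paper could be improved. For example, the authors should explain what xt,i means in Eq. (1)", asks the paper to explain the meaning of $x_{t,i}$ in Eq. (1), but Eq. (1) of the original paper does not contain it
The poor results have several causes; below we mainly contrast where their approach differs from ours
Of course, until it is actually released for people to use, it is hard to draw firm conclusions; let's wait for them to open it up (and once they see this article, I expect they will improve quickly)
In mid-December 2023, the project finally reached the model selection stage. Before that we had put a great deal of effort into data processing and data quality, using a range of strategies as well as the latest 16K-context GPT3.5 to distill the review information. The whole process was a textbook LLM project workflow
When selecting a model for v2 of the paper-review GPT, we initially considered three candidates:
Below we go through the three models one by one, along with their training details and final results
See Part 1 of "A Full Analysis from Mistral 7B to the MoE Model Mixtral 8x7B: From Principles to Code"
Because the project uses YaRN, I wrote a separate article explaining what YaRN is; see "YaRN for LLM Context Extension: From Direct Extrapolation with ALiBi, Position Interpolation, and NTK-aware Interpolation to YaRN"
For example, that article covers: "3.1 Where YaRN comes from: modifying attention on top of 'NTK-by-parts' interpolation"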
Beyond the interpolation techniques above, they also observed that introducing a temperature $t$ on the logits before the softmax has a uniform effect on perplexity, regardless of the data sample and the token position within the extended context window. More precisely, the attention weights are computed as $\text{softmax}\left(\frac{q_m^T k_n}{t\sqrt{|D|}}\right)$
Reparametrizing RoPE as a set of 2D matrices has a clear benefit for implementing this attention scaling:
- We can instead use a "length scaling" trick which scales both $q_m$ and $k_n$ by a constant factor $\sqrt{1/t}$ by simply scaling the complex RoPE embeddings by the same amount. With this, YaRN can effectively alter the attention mechanism without modifying its code.
- Furthermore, it has zero overhead during both inference and training, as RoPE embeddings are generated in advance and are reused for all forward passes. Combining it with the "NTK-by-parts" interpolation, we have the YaRN method
For LLaMA and LLaMA 2 models, they recommend the following value: $\sqrt{1/t} = 0.1\ln(s) + 1$
The equation above is found by fitting $\sqrt{1/t}$ at the lowest perplexity against the scale extension by various factors $s$, using the "NTK-by-parts" method on LLaMA 7b, 13b, 33b, and 65b models without fine-tuning
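The fit the YaRN paper recommends for LLaMA and LLaMA 2, $\sqrt{1/t} = 0.1\ln(s) + 1$, can be evaluated directly. A small sketch, with hypothetical function names; returning 1.0 when there is no extension (s ≤ 1) is a convenience assumption of this sketch, not part of the fitted formula.

```python
import math


def yarn_attention_factor(scale: float) -> float:
    """sqrt(1/t) = 0.1 * ln(s) + 1, the factor applied to the RoPE embeddings."""
    if scale <= 1.0:
        return 1.0  # assumption: no context extension means no rescaling
    return 0.1 * math.log(scale) + 1.0


def yarn_temperature(scale: float) -> float:
    """Recover the temperature t itself from the factor sqrt(1/t)."""
    return 1.0 / yarn_attention_factor(scale) ** 2
```

For instance, extending a 4K context to 64K corresponds to s = 16, for which the factor is a little under 1.28.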
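Yarn-Mistral-7b-64k effectively ships its own modeling code: it changes Mistral's sliding window attention by raising the window size straight to 65536, i.e. 64K (a 65536-token sliding window is in effect global attention)
As the opening of "Understanding FlashAttention and FlashAttention2: One of the Techniques That Pushed LLM Context Length Past 32K" notes, LLaMA2's context length is only 4K, but with LongLoRA it can be extended to 32K (LLaMA2 7B can be extended to 100K and LLaMA2 70B to 32K)
Model | Context length |
LLaMA | 2048 |
LLaMA2 | 4096 |
LLaMA2-long (paper published September 27, 2023) | 32K |
LongAlpaca-7B/13B/70B, based on LongLoRA | 32K and above |
LongQLoRA is essentially LongLoRA + QLoRA
For what LongLoRA and LongQLoRA are, see "Extending LLM Context Length: From LongLoRA to LongQLoRA (with Source-Code Analysis)"
Picking up from "3.2.4 Final cleanup of the review data: a variant of the JSON text, with long-tail data removed" above, it was finally time to choose a model to fine-tune. While working on the fine-tuning itself, we also came across the fine-tuning library llama factory
So we had the following 4 ways to fine-tune
The sections below go through these 4 ways one by one
At first, 阿荀 ran a few hundred records through "yarn-mistral + qlora + s2attn + llama factory" and found that the initial loss reached 6 (the loss was high early on but was decreasing, so the model was learning; a high initial loss just indicates that our data differs substantially from what the model had seen)
Three interesting things came up here
Frankly, the third problem above was quite troublesome: the model's output had no real substance and simply repeated part of the user's input
On top of that, nobody had publicly done SFT on a YaRN-extended model before, so there was no empirical reference
What did we do in the end? See the LLM Project Development Online Camp
If we were to fine-tune Mistral 7B-Instruct, our first question at the time was how to extend its context length (Mistral 7B-Instruct's context length is only 8K)
When we first tried fine-tuning LLaMA-LongLoRA with the llama factory library
On January 31, 2024, after 不染 ran 4 epochs with LongLoRA, the model produced the following review for a new paper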
- <Potential reasons for acceptance>
- <Technical solidity> The paper presents a technically solid approach to neural architecture search, offering a novel perspective on the problem and providing a theoretically sound optimization method.
- <Empirical evidence> The experimental results demonstrate competitive or better performance compared to existing methods, along with improved efficiency and scalability.
- <Clear presentation> The paper is well written and easy to understand, making it accessible to a wide audience.
-
- [Potential reasons for rejection]
- <No related terms> The paper lacks related terms, which may impact its relevance and positioning in the field.
- <Insufficient comparison with prior art> The paper does not sufficiently compare its approach with prior art, particularly in relation to existing methods for neural architecture search.
- <Unclear motivation> The motivation behind the proposed method is not clearly explained, leading to uncertainty about its significance and novelty.
-
- [Suggestions for improvement]
- <Include related terms> The paper should include relevant related terms to enhance its positioning and relevance in the field.
- <Comprehensive comparison with prior art> A thorough comparison with existing methods for neural architecture search, particularly those addressing similar issues, would strengthen the paper's contribution.
Providing a clearer explanation of the motivation behind the proposed method would help establish its significance and novelty.
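After that, I asked 阿李, another colleague in the second project group, to run another round. The hyperparameters matched LongQLoRA's (the embedding and norm weights are trained by default). LongLoRA's rank does not have to be 64, but since 64 ran fine, we kept 64 for now
In the end, GPU usage was acceptable, only about 5 GB more than with LongQLoRA, so we did not use quantization
For how we actually trained it and how the model performed, see the model training and model evaluation sections below
LLaMA2 7B chat's own context length is only 4096. With LongQLoRA added, its context length was indeed extended from 4096 to 12288 (for why 12288, see the end of Section 3.2.4 above)
On January 17, 2024 (two days before the 9th anniversary of my founding the company), after 80 hours of training, we finally finished fine-tuning LLaMA2 7B chat with LongQLoRA on the 15,565-record paper-review dataset (one anomalous record was removed from the 15,566 mentioned at the end of Section 3.2.4; what was anomalous about it is explained in the online camp). This is a milestone for our second project group (阿荀 as the main contributor, plus 朝阳, 雪狼, 不染, and me), the result of a full six months of work across v1 and v2 of the paper-review project (after further iteration, we plan to publish this work as an SCI paper this year)
Specifically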
- You are a professional machine learning conference reviewer who reviews a given paper and considers 4 criteria: ** importance and novelty **, ** potential reasons for acceptance **, ** potential reasons for rejection **, and ** suggestions for improvement **.
- The given paper is as follows.:
-
- [TITLE]
- YaRN: Efficient Context Window Extension of Large Language Models
-
- [ABSTRACT]
- Rotary Position Embeddings (RoPE) have been shown to effectively encode posi- tional information in transformer-based language models. However..
-
- # A long CONTENT section follows, omitted..
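Prompted this way, the fine-tuned model LLaMA2 7B chat-LongQLoRA reviewed the YaRN paper mentioned above (which is not in our paper-review training set). Its comments point out that the experiments are limited and that the method is not explained clearly enough, as shown below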
- You are a professional machine learning conference reviewer who reviews a given paper and considers 4 criteria: ** importance and novelty **, ** potential reasons for acceptance **, ** potential reasons for rejection **, and ** suggestions for improvement **.
- You just need to use the following JSON format for output, but don't output opinions that don't exist in the original reviews. if you're not sure, return an empty dict:
- {
- 'Significance and novelty': List multiple items by using Dict, The key is a brief description of the item, and the value is a detailed description of the item.
- 'Potential reasons for acceptance': List multiple items by using Dict, The key is a brief description of the item, and the value is a detailed description of the item.
- "Potential reasons for rejection": List multiple items by using Dict, The key is a brief description of the item, and the value is a detailed description of the item.
- 'Suggestions for improvement': List multiple items by using Dict, The key is a brief description of the item, and the value is a detailed description of the item.
- }
-
- The given paper is as follows.:
-
- [TITLE]
- YaRN: Efficient Context Window Extension of Large Language Models
-
- [ABSTRACT]
- Rotary Position Embeddings (RoPE) have been shown to effectively encode posi- tional information in transformer-based language models. However, ...
-
- # A long CONTENT section follows, omitted..
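Next, the outputs are shown in the left part of the figure below (which mentions that the experiments are not extensive enough and then notes that the implementation example of YaRN is not clear enough) and in the right part (which points out that the experiments are not extensive and are limited in scope)
Three key pieces of work follow
From the section "4.2.4 Fine-tuning LLaMA2 7B chat with LongQLoRA on 10,000+ paper-review records: success" above, we know the fine-tuning finally succeeded. But how exactly do you fine-tune LLaMA 2 on the 10,000+ paper-review dataset?
First, as mentioned, our fine-tuning code is adapted from the LongQLoRA source (we did not use llama factory). Specifically
More details are shared in the online camp, as shown in the figure below
But files 1, 3, and 5 don't seem to appear in the figure above. The reason is that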
- # Load the model
- logger.info(f'Loading model from: {args.model_name_or_path}')
- model = AutoModelForCausalLM.from_pretrained(
- args.model_name_or_path,
- config=config,
- device_map=device_map,
- load_in_4bit=True,
- # torch_dtype=torch.float16,
- torch_dtype=torch.bfloat16,
- trust_remote_code=True,
- quantization_config=BitsAndBytesConfig(
- load_in_4bit=True,
- # bnb_4bit_compute_dtype=torch.float16,
- bnb_4bit_compute_dtype=torch.bfloat16,
- bnb_4bit_use_double_quant=True,
- bnb_4bit_quant_type="nf4",
- llm_int8_threshold=6.0,
- llm_int8_has_fp16_weight=False,
- ),
- )
The resource requirements are as follows
Next, set up the environment as follows
- cd /path/to/LongQLoRA
-
- # Create the virtual environment
- conda create -n longqlora python=3.9 pip
- # Activate it so that the packages below are installed into it
- conda activate longqlora
-
- # Set up the virtual environment
- ## Install pytorch separately
- pip install torch==1.13.0+cu117 torchvision==0.14.0+cu117 torchaudio==0.13.0 --extra-index-url https://download.pytorch.org/whl/cu117 -i https://pypi.org/simple
- ## Install flash attention separately
- pip install flash_attn -i https://pypi.org/simple
- ## Install the requirements
- pip install -r requirements.txt -i https://pypi.org/simple
The requirements file is as follows
- accelerate==0.21.0
- transformers==4.31.0
- peft==0.4.0
- bitsandbytes==0.39.0
- loguru
- numpy
- pandas
- tqdm
- deepspeed==0.9.5
- tensorboard
- sentencepiece
- transformers_stream_generator
- tiktoken
- einops
- # torch==1.13.0
- openpyxl
- httpx
- # flash_attn==2.3.3
- joblib==1.2.0
- scikit_learn==0.24.2
- # torch and flash_attn are commented out above because they were already installed separately
- # Install git-lfs
- curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | sudo bash
- sudo apt-get install git-lfs
-
- # Enable git-lfs
- git lfs install
Get the Llama-2-7b-chat-hf model files
- # Enter the directory used to store model files
- cd /path/to/models_dir
-
- # Get Llama-2-7b-chat-hf (with git-lfs enabled, a plain git clone also fetches the LFS files; "git lfs clone" is deprecated)
- git clone https://huggingface.co/NousResearch/Llama-2-7b-chat-hf
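Main parameters | |
Parameter | Meaning |
output_dir | Directory for training outputs (logs, weights, and so on), i.e. the output directory you created plus a custom name |
model_name_or_path | Directory of the model used for training, i.e. the path of the downloaded model files |
train_file | Path to the training data, i.e. where the dataset is stored |
deepspeed | Path to the deepspeed config, i.e. "train_args/deepspeed/deepspeed_config_s2_bf16.json" under the LongQLoRA directory |
sft | Whether to train in SFT mode |
use_flash_attn | Whether to use flash attention |
num_train_epochs | Number of training epochs |
per_device_train_batch_size | Batch size per device |
gradient_accumulation_steps | Number of gradient accumulation steps |
max_seq_length | Truncation length for the data |
model_max_length | Maximum length supported by the model, i.e. the target length this training run extends to |
learning_rate | Learning rate |
logging_steps | Logging frequency: log once every logging_steps steps |
save_steps | Checkpoint frequency: save once every save_steps steps |
save_total_limit | Maximum number of saved checkpoints; older checkpoints are deleted automatically once the limit is exceeded |
lr_scheduler_type | Learning-rate schedule |
warmup_steps | Number of warmup steps |
lora_rank | LoRA rank |
lora_alpha | LoRA scaling factor |
lora_dropout | LoRA dropout probability |
gradient_checkpointing | Whether to enable gradient checkpointing |
optim | Optimizer to use |
bf16 | Whether to enable bf16 training |
report_to | Where logs are reported |
dataloader_num_workers | Number of worker subprocesses for data loading; 0 loads data in the main process |
save_strategy | Save strategy: steps saves by step count, epoch saves at the end of each epoch |
weight_decay | Weight decay |
max_grad_norm | Gradient clipping threshold |
remove_unused_columns | Whether to remove unused columns from the dataset |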
- export CUDA_LAUNCH_BLOCKING=1
- deepspeed train.py --train_args_file /path/to/LongQLoRA/train_args/llama2-7b-chat-sft-bf16.yaml
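Here --train_args_file is the path to the YAML file used for training
- # Enter the LongQLoRA source directory
- cd /path/to/LongQLoRA
-
- # Launch training with the bash script
- bash run_train_sft_bf16.sh
// More in the online camp
The long QA data is fine-tuned with the following prompt:
不染 made quite a few changes to the LongLoRA source; a few are listed below (see the corresponding July Online course for the full set)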
- prompt_input, prompt_no_input = PROMPT_DICT["prompt_input_llama2"], PROMPT_DICT["prompt_llama2"]
- sources = [
- prompt_input.format_map(example) if example.get("input", "") != "" else prompt_no_input.format_map(example)
- for example in list_data_dict
- ]
Changed to
- # On line 75 of the code, "prompt_llama2": "[INST]{instruction}[/INST]".
- # If the concatenated text is truncated at 12288, overlong samples lose [/INST], which produces two different prompt patterns in the training data.
- prompt_no_input = PROMPT_DICT["prompt_llama2"]
- sources = [
- prompt_no_input.format_map({"instruction": example['instruction'][:12270]}) # truncate the instruction first
- for example in list_data_dict
- ]
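To see why the truncation has to happen before the template is applied, here is a toy comparison. It is a sketch only: the template string comes from the comment above, while the two function names and the small character budget are illustrative assumptions. Like the snippet above, it slices characters, so a character budget stands in for the token budget.

```python
PROMPT = "[INST]{instruction}[/INST]"


def truncate_after_format(instruction: str, max_chars: int) -> str:
    """Format first, then cut: an overlong sample loses its closing [/INST]."""
    return PROMPT.format(instruction=instruction)[:max_chars]


def truncate_before_format(instruction: str, max_chars: int) -> str:
    """Cut the instruction first so the closing tag always survives."""
    budget = max_chars - len(PROMPT.format(instruction=""))
    return PROMPT.format(instruction=instruction[:budget])
```

With the second variant every sample ends in `[/INST]`, so the model sees a single prompt pattern during training.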
- if training_args.low_rank_training:
- if model_args.model_type == "gpt-neox":
- # added `dense` to match with llama as the basic LoRA would only target 'query_key_value'
- targets = ["query_key_value", "dense"]
- else:
- targets=["q_proj", "k_proj", "v_proj", "o_proj"]
-
- config = LoraConfig(
- r=8,
- lora_alpha=16,
- target_modules=targets,
- lora_dropout=0,
Changed to
- if training_args.low_rank_training:
- if model_args.model_type == "gpt-neox":
- targets = ["query_key_value", "dense"]
- else:
- targets=["q_proj", "v_proj"] # not enough GPU memory, so train fewer parameters
-
- config = LoraConfig( # raise the rank to learn more features; a larger rank uses more GPU memory
- r=32,
- lora_alpha=16,
- target_modules=targets,
- lora_dropout=0.05,
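This change may be exactly why the results fall short of "5.1 LLaMA2 7b chat + LongQLoRA training" above; see the model evaluation below
阿李 barely changed the LongLoRA source, on the assumption that LongLoRA's own implementation is sound, and instead aligned the training parameters largely with LongQLoRA's (不染's are on the left of the figure below, 阿李's on the right)
That said, not everything matches LongQLoRA exactly, for example the learning rate
When fine-tuning on a small amount of data, 阿李 set learning_rate to 0.0001, the same as LongQLoRA, which may still have been a bit high, so the full-data run used LongLoRA's default of $10^{-5}$
To sum up the differences between 阿李's and 不染's training runs
- At the parameter level, unlike 不染, 阿李 kept the embedding and norm layers trainable during fine-tuning. In the attention blocks, 阿李 fine-tuned "q_proj", "k_proj", "v_proj", "o_proj", while 不染 fine-tuned "q_proj", "v_proj".
- At the hyperparameter level, 阿李 trained for 2 epochs and 不染 for 5 epochs; 阿李's learning_rate was 0.00001 and 不染's was 0.0001; 阿李's gradient_accumulation_steps was 16 and 不染's was 8
Overall, as 阿荀 put it
- 阿李 used more LoRA layers, a smaller update step, and fewer epochs, so the LoRA parameters may not have learned enough
- 不染 used fewer LoRA layers, a larger update step, and more epochs, so there may have been too few LoRA parameters that learned too much; but given that 不染's outputs struggle even to follow the format, he probably selected too few LoRA layers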
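To put rough numbers on "more LoRA layers" versus "fewer LoRA layers", the trainable LoRA parameters of the two setups can be counted. This is back-of-the-envelope arithmetic, not a measurement: it assumes LLaMA2 7B's published shape (hidden size 4096, 32 layers, square 4096×4096 attention projections), uses rank 32 on q/v for 不染 and rank 64 on q/k/v/o for 阿李 as stated in the text, and ignores the embedding and norm weights that 阿李 also trains.

```python
def lora_param_count(hidden_size, num_layers, num_target_modules, rank):
    """LoRA adds A (rank x d_in) and B (d_out x rank) to each targeted projection."""
    per_module = rank * (hidden_size + hidden_size)
    return num_layers * num_target_modules * per_module


HIDDEN, LAYERS = 4096, 32  # LLaMA2 7B configuration

qv_r32 = lora_param_count(HIDDEN, LAYERS, num_target_modules=2, rank=32)    # q_proj, v_proj
qkvo_r64 = lora_param_count(HIDDEN, LAYERS, num_target_modules=4, rank=64)  # q, k, v, o

print(f"q/v, rank 32:     {qv_r32:,}")
print(f"q/k/v/o, rank 64: {qkvo_r64:,}")
```

Under these assumptions the second setup trains four times as many LoRA parameters, before counting the embedding and norm layers.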
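In the Stanford paper that had GPT-4 act as a reviewer (see Section 3.1 above), they evaluated the pairwise overlap of GPT-4 vs. Human and Human vs. Human in terms of hit rate. Hit rate is defined as the proportion of comments in set A that match comments in set B, i.e. $\text{hit rate} = \frac{|A \cap B|}{|A|}$
As shown in the figure below
In short
- The LLM's review and the human review are each summarized by GPT-4 using a dedicated prompt (see the online camp for the prompt). The task given to the LLM asks it to focus on potential reasons for rejection in the review and to report the key issues the review raises in a specific JSON format. The researchers explain that they focus on key issues because "criticisms in a review directly help guide authors to improve the paper"
- The outputs of the previous step for the LLM review under evaluation and for the human review are fed to GPT-4 together, and a dedicated prompt (see the online camp for the prompt) instructs GPT-4 to output new JSON that identifies the matching items between the two inputs and rates how well each pair matches (5 to 10)
The authors found that matches scored 5 or 6 were not reliable, so only a score of 7 or above counts as a "match". Overlap is then computed as $\frac{|A \cap B|}{|A|}$, where $|A|$ is the number of criticisms raised by the LLM and $|A \cap B|$ is the number of matched criticisms between the LLM and the human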
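The metric itself is simple arithmetic. A minimal sketch with hypothetical function names: similarity scores of 7 or above count as matches, and the hit rate divides the number of matched criticisms by the number of criticisms the LLM raised. Returning 0.0 when the LLM raised no criticisms is an assumption of this sketch.

```python
def count_matches(similarity_scores, threshold=7):
    """Number of matched pairs whose similarity reaches the threshold."""
    return sum(score >= threshold for score in similarity_scores)


def hit_rate(num_matched: int, num_llm_comments: int) -> float:
    """|A ∩ B| / |A|: share of the LLM's criticisms that a human also raised."""
    if num_llm_comments == 0:
        return 0.0  # assumption: define the rate as 0 when there is nothing to match
    return num_matched / num_llm_comments
```

For example, if the LLM raises 8 criticisms and the judge scores four candidate pairs 5, 6, 7, and 9, two of them count as matches and the hit rate is 2 / 8.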
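Note: in the comparison tests in this article
- gpt4 is gpt-4-turbo-preview, i.e. GPT_MODEL = "gpt-4-1106-preview"
Note: the figure below was added later, on February 15, 2024; in 2023 Q4 the GPT4-0125 version had not been released yet
- gpt3.5 is gpt-3.5-turbo-1106
Both are versions that support JSON format and long inputs
For the validation data, we use 57 papers outside the training set, each with a "many-to-one" human review. Because LLM output can be unstable, each of the 57 records is copied 5 times, giving 285 test records, so the LLM generates 5 times for each input
The figure below shows the paper data of the test set
The figure below shows the review data of the test set (human/golden)
Running LLaMA2-paperreview (abbreviated llama2 below, to avoid confusion with the actual paper data and review data) on the test set gives results like the following
Running gpt-3.5-turbo on the test set gives results like the following (for the prompt template used, see the online camp)
First, the gpt-3.5-turbo output, originally in JSON, is converted to the format shown in the figure below (the same as the human review and the llama2 review)
Next, label each review item with an opinion index
In the human Review, the “
In the llama2 Review, the “
In the gpt-3.5-turbo Review, the “
Starting from the "node matching prompt" proposed in the paper "Can large language models provide useful feedback on research papers? A large-scale empirical analysis" mentioned above and making small modifications, we obtained a dedicated prompt that instructs gpt4 to match review items (see the online camp for the full prompt)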
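- The prompt template covers 4 key points:
- - Instruct gpt4 to analyze Review A and Review B and match the opinions the two reviews share.
- - Give an output example, which should be a nested JSON like
- “{
- matched item 1:
- {rationale: reason for the match, similarity: match score},
- matched item 2:
- {rationale: reason for the match, similarity: match score},
- ...
- }”
- - State the scoring rule: the match score ranges from 5 to 10, and a larger number means a closer match.
- - State that an empty JSON should be returned if there are no matching items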
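A judge's reply in this shape can be parsed defensively before counting hits. The sketch below is an illustration, not our production code: the function name `parse_matches` is hypothetical, and the score key name (`similarity` by default) is an assumption that should be set to whatever key the actual prompt asks for.

```python
import json


def parse_matches(raw: str, threshold: int = 7, score_key: str = "similarity"):
    """Return {matched item: score} for items scored at or above the threshold."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return {}  # malformed reply: treat as no matches
    if not isinstance(data, dict):
        return {}
    kept = {}
    for item, info in data.items():
        try:
            score = int(info[score_key])  # scores may arrive as "8" or 8
        except (KeyError, TypeError, ValueError):
            continue  # skip entries without a usable score
        if score >= threshold:
            kept[item] = score
    return kept
```

An empty JSON object, which the prompt requests when nothing matches, simply yields an empty dict.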
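First, gpt4 matches opinions between the llama2 Review (Review A) and the human Review (Review B) using this prompt template, producing output like the figure below
Then, gpt4 matches opinions between the gpt-3.5-turbo Review (Review A) and the human Review (Review B) using the same prompt template, producing output like the figure below
文弱 from the second project group did a lot of the early evaluation. Since we will be running evaluations often, I summarized a "model evaluation principle" for the future: only compare work from the same quarter, and judge it with the strongest judge available in that quarter
Under this principle
- Our 7B work from 2023 Q4 only needs to be compared against gpt4-1106 when benchmarking against GPT4, with gpt4-1106 as the judge
It does not need to be compared against any gpt4-0125 generations, and the 7B model does not use any gpt4-0125 judgments
The later v3 13B model, however, can compete against gpt4-0125 generations, with gpt4-0125 as the judge
This is fair to both us and OpenAI: everyone upgrades, so we compare products from the same period
- One more point about the judge
Prefer the strongest judge available at the time, so that we don't have to re-judge all past work every time a new judge appears; otherwise judging would never end
Using the evaluation method from the previous section and considering only the hit count (in later versions' evaluations, unless stated otherwise, only the hit count is considered by default), here are the results (some readers may have guessed already: both models are fine-tuned from LLaMA2 7B chat, with LongQLoRA half a month earlier and LongLoRA this time, so the results should not differ much)
With the same evaluation method and the 285-record dataset built from 57 papers with human reviews, again measuring hit count, the results are shown in the figure below and again exceed GPT3.5 and GPT4
At this point, this article has disclosed many engineering details of v2 of our paper-review GPT. Such details are rare online, since this is a commercial project
For the v2.5 optimizations, see "July Paper-Review GPT v2.5: Fine-Tuning GPT3.5 turbo 16K and llama2 13B to Widen the Lead over GPT4"
And of course, even more is available in the "LLM Project Development Online Camp"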