GPT: Generative Pre-Training
Two key points to GPT's success are (I) training decoder-only Transformer language models that can accurately predict the next word and (II) scaling up the size of the language models.
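A minimal sketch of point (I), next-word prediction with a decoder-only Transformer, assuming PyTorch; the toy sizes and model setup below are illustrative, not OpenAI's actual configuration.

# Sketch: next-token prediction with a causal (decoder-only) Transformer in PyTorch.
# All sizes are toy values for illustration.
import torch
import torch.nn as nn

vocab_size, d_model, n_heads, n_layers, seq_len = 1000, 128, 4, 2, 16

embed = nn.Embedding(vocab_size, d_model)
block = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
decoder = nn.TransformerEncoder(block, n_layers)        # used decoder-only: no cross-attention
lm_head = nn.Linear(d_model, vocab_size)

tokens = torch.randint(0, vocab_size, (1, seq_len))     # toy input ids
causal_mask = nn.Transformer.generate_square_subsequent_mask(seq_len)

hidden = decoder(embed(tokens), mask=causal_mask)       # each position sees only earlier tokens
logits = lm_head(hidden)

# Shift by one so that position t predicts token t+1 (the "next word").
loss = nn.functional.cross_entropy(
    logits[:, :-1].reshape(-1, vocab_size),
    tokens[:, 1:].reshape(-1),
)
loss.backward()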

ICT
1. training on code data
Codex: a GPT model fine-tuned on a large corpus of GitHub code
2. alignment with human preference
reinforcement learning from human feedback (RLHF) algorithm
Note that the term "instruction tuning" has seldom been used in OpenAI's papers and documentation; instead, it is referred to as supervised fine-tuning on human demonstrations (i.e., the first step of the RLHF algorithm).
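One ingredient of RLHF is the reward model, trained on human preference pairs with a pairwise (Bradley-Terry style) loss: the human-preferred response should receive a higher score. Below is a minimal sketch assuming PyTorch; ToyRewardModel is a hypothetical stand-in for a real scalar-reward model, not OpenAI's implementation.

# Sketch: pairwise preference loss for reward-model training (PyTorch).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyRewardModel(nn.Module):
    """Hypothetical stand-in: embeds token ids and pools them to one scalar per sequence."""
    def __init__(self, vocab_size=1000, d_model=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.score = nn.Linear(d_model, 1)

    def forward(self, ids):                       # ids: (batch, seq_len)
        return self.score(self.embed(ids).mean(dim=1)).squeeze(-1)

def preference_loss(reward_model, chosen_ids, rejected_ids):
    """-log sigmoid(r_chosen - r_rejected): pushes preferred responses above rejected ones."""
    return -F.logsigmoid(reward_model(chosen_ids) - reward_model(rejected_ids)).mean()

rm = ToyRewardModel()
chosen = torch.randint(0, 1000, (4, 16))          # toy "preferred" responses
rejected = torch.randint(0, 1000, (4, 16))        # toy "rejected" responses
loss = preference_loss(rm, chosen, rejected)
loss.backward()

The trained reward model then provides the reward signal for the policy-optimization step of RLHF.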
ChatGPT (based on GPT-3.5 and GPT-4) and GPT-4 (multimodal)

Stanford Alpaca is the first open instruction-following model fine-tuned based on LLaMA (7B).
Alpaca LoRA (a reproduction of Stanford Alpaca using LoRA; a minimal LoRA sketch follows below)
model, data, library
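Since Alpaca LoRA relies on LoRA, here is a minimal sketch of the LoRA idea itself, assuming PyTorch: the pretrained weight is frozen and only a low-rank update B·A is trained. The layer sizes and hyperparameters are illustrative.

# Sketch: a LoRA-adapted linear layer (PyTorch); sizes are illustrative.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, in_features, out_features, r=8, alpha=16):
        super().__init__()
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.base.weight.requires_grad = False           # pretrained weight stays frozen
        self.lora_A = nn.Parameter(torch.randn(r, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, r))
        self.scaling = alpha / r

    def forward(self, x):
        # y = x W^T + scaling * x A^T B^T  (only A and B receive gradients)
        return self.base(x) + self.scaling * (x @ self.lora_A.T @ self.lora_B.T)

layer = LoRALinear(128, 128)
y = layer(torch.randn(2, 128))

Initializing lora_B to zeros makes the adapted layer start out identical to the frozen base layer, so fine-tuning begins from the pretrained behavior.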

General Text Data: webpages, books, and conversational text
Specialized Text Data: multilingual text, scientific text, and code
Existing work has found that duplicate data in a corpus reduces data diversity, which may cause the training process to become unstable and thus hurt model performance.
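A simple way to mitigate this at the document level is exact deduplication by hashing normalized text; this is only a minimal sketch, and real pipelines typically also apply fuzzy deduplication (e.g., MinHash/LSH) to catch near-duplicates.

# Sketch: exact document-level deduplication by hashing whitespace- and case-normalized text.
import hashlib

def dedupe(documents):
    """Keep the first occurrence of each document, comparing normalized text."""
    seen, unique = set(), []
    for doc in documents:
        key = hashlib.sha256(" ".join(doc.lower().split()).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique

corpus = ["Hello world.", "hello   world.", "Another page."]
print(dedupe(corpus))   # the near-identical duplicate is dropped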