LLM Survey Paper Notes (6-15)


    Contents

    • Keywords
    • Background for LLMs
      • Technical Evolution of GPT-series Models
        • OpenAI's research on LLMs can be roughly divided into the following stages
          • Early Explorations
          • Capacity Leap
          • Capacity Enhancement
          • The Milestones of Language Models
    • Resources
    • Pre-training
      • Data Collection
      • Data Preprocessing
        • Quality Filtering
        • De-duplication

    Keywords

    GPT: Generative Pre-Training

    Background for LLMs

    Technical Evolution of GPT-series Models

    Two key points to GPT’s success are (I) training decoder-only Transformer language models that can accurately predict the next word and (II) scaling up the size of language models.

    OpenAI's research on LLMs can be roughly divided into the following stages

    Early Explorations


    Capacity Leap

    ICL (in-context learning)
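    The capacity leap here refers to GPT-3's in-context learning. As a rough illustration (mine, not the survey's; the sentiment task and the reviews are made up), an ICL prompt simply concatenates a few input/output demonstrations before the new query, with no gradient updates:

```python
# Minimal sketch of a few-shot (in-context learning) prompt.
# The demonstrations and the sentiment task are hypothetical; nothing
# here calls a model -- it only shows the prompt format.

demonstrations = [
    ("great movie, loved it", "positive"),
    ("boring and far too long", "negative"),
]

def build_few_shot_prompt(demos, query):
    """Concatenate labeled demonstrations, then the unanswered query."""
    blocks = [f"Review: {text}\nSentiment: {label}" for text, label in demos]
    blocks.append(f"Review: {query}\nSentiment:")
    return "\n\n".join(blocks)

prompt = build_few_shot_prompt(demonstrations, "an instant classic")
```

    The model is expected to continue the pattern and emit a label for the final review; the "learning" happens purely in context.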

    Capacity Enhancement

    1. Training on code data
       Codex: a GPT model fine-tuned on a large corpus of GitHub code
    2. Alignment with human preference
       reinforcement learning from human feedback (RLHF) algorithm

    Note that it seems that the wording of “instruction tuning” has seldom
    been used in OpenAI’s paper and documentation, which is substituted by
    supervised fine-tuning on human demonstrations (i.e., the first step
    of the RLHF algorithm).

    The Milestones of Language Models

    ChatGPT (based on GPT-3.5 and GPT-4) and GPT-4 (multimodal)

    Resources

    Stanford Alpaca is the first open instruction-following model fine-tuned based on LLaMA (7B).
    Alpaca LoRA (a reproduction of Stanford Alpaca using LoRA)

    Models, data, libraries

    Pre-training


    Data Collection

    General Text Data: webpages, books, and conversational text
    Specialized Text Data: multilingual text, scientific text, code

    Data Preprocessing

    Quality Filtering

    1. Classifier-based filtering: train a selection classifier on high-quality texts and use it to identify and filter out low-quality data.
    2. Heuristic-based filtering: eliminate low-quality texts through a set of well-designed rules: language-based filtering, metric-based filtering, statistic-based filtering, keyword-based filtering.
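    The heuristic rules can be pictured as a stack of cheap predicate filters. A toy sketch (my own, not from the survey; the thresholds, field names, and banned keywords are made-up placeholders):

```python
# Toy heuristic-based quality filtering: each rule flags documents
# that look low quality; a document is kept only if every rule passes.

def language_filter(doc, allowed=("en",)):
    # Placeholder: a real pipeline would run a language identifier here.
    return doc.get("lang") in allowed

def metric_filter(doc, max_symbol_ratio=0.3):
    # Drop documents dominated by non-alphanumeric symbols.
    text = doc["text"]
    if not text:
        return False
    symbols = sum(1 for c in text if not (c.isalnum() or c.isspace()))
    return symbols / len(text) <= max_symbol_ratio

def statistic_filter(doc, min_words=5, max_words=100_000):
    # Reject suspiciously short or long documents.
    n = len(doc["text"].split())
    return min_words <= n <= max_words

def keyword_filter(doc, banned=("lorem ipsum", "click here")):
    # Reject documents containing known spam phrases.
    text = doc["text"].lower()
    return not any(k in text for k in banned)

def keep(doc):
    return all(f(doc) for f in (language_filter, metric_filter,
                                statistic_filter, keyword_filter))

docs = [
    {"lang": "en", "text": "A clean paragraph about language models and data."},
    {"lang": "en", "text": "click here !!! $$$ %%% @@@ ###"},
]
kept = [d for d in docs if keep(d)]
```

    In practice each rule is tuned per corpus; the point is that all four rule families from the note map to simple, fast predicates.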

    De-duplication

    Existing work has found that duplicate data in a corpus would reduce the diversity of language models, which may cause the training process to become unstable and thus affect the model performance.
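    As a loose illustration (my own sketch; real pipelines also use fuzzy matching such as MinHash over n-grams, at sentence, document, and dataset level), exact-match de-duplication can be done by hashing normalized documents:

```python
# Minimal exact-match de-duplication: hash a normalized copy of each
# document and keep only the first occurrence of each hash.

import hashlib

def normalize(text):
    # Lowercase and collapse whitespace so trivially different copies match.
    return " ".join(text.lower().split())

def deduplicate(docs):
    seen = set()
    unique = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

corpus = ["Hello world.", "hello   WORLD.", "Something else entirely."]
deduped = deduplicate(corpus)
```

    This catches only exact (post-normalization) duplicates; near-duplicates need the fuzzy methods mentioned above.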

    1. Privacy Redaction: remove personally identifiable information (PII)
    2. Tokenization: segment raw text into sequences of individual tokens, which are subsequently used as the inputs of LLMs. Common approaches: Byte-Pair Encoding (BPE) tokenization; WordPiece tokenization; Unigram tokenization
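    Of these, BPE is the easiest to sketch: start from characters and repeatedly merge the most frequent adjacent symbol pair. A toy version (mine, not from the survey; the tiny word-frequency table is made up, and production tokenizers additionally handle pre-tokenization, byte fallback, etc.):

```python
# Toy sketch of Byte-Pair Encoding (BPE) training: repeatedly merge the
# most frequent adjacent pair of symbols across the corpus.

from collections import Counter

def pair_counts(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    counts = Counter()
    for symbols, freq in words.items():
        for pair in zip(symbols, symbols[1:]):
            counts[pair] += freq
    return counts

def merge_pair(words, pair):
    """Rewrite every word, fusing each occurrence of `pair` into one symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Word frequencies, each word represented as a tuple of character symbols.
words = {tuple("lower"): 5, tuple("lowest"): 2, tuple("low"): 7}
merges = []
for _ in range(3):
    counts = pair_counts(words)
    best = max(counts, key=counts.get)
    merges.append(best)
    words = merge_pair(words, best)
# merges records the learned merge rules in order, e.g. ("l","o") first.
```

    The ordered merge list is the learned vocabulary rule set; at inference time the same merges are replayed greedily on new text.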
  • Original post: https://blog.csdn.net/Ives_WangShen/article/details/132612122