The tokenizers processors module

Module overview

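The processors module applies additional transformations to tokenized text, mainly by adding special tokens. For example, given the text "this is a text, this is another text.", post-processing turns it into "[CLS] this is a text [SEP] this is another text [SEP]".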

Every class implemented in the processors module is a subclass of PostProcessor. The official documentation describes post-processing as follows:

Post-Processing: Provides advanced construction features to be compatible with some of the Transformers-based SoTA models. For instance, for BERT it would wrap the tokenized sentence around [CLS] and [SEP] tokens.

The processors module implements four PostProcessors: BertProcessing, ByteLevel, RobertaProcessing, and TemplateProcessing.
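Besides calling process() directly, a post-processor can be attached to a tokenizer through its post_processor attribute, so that encode() applies it automatically. The sketch below assumes a tiny hand-written WordLevel vocabulary (hypothetical, chosen only so that no training is needed) and uses BertProcessing as the example.

```python
from tokenizers import Tokenizer, models, pre_tokenizers, processors

# Hypothetical toy vocabulary, for illustration only.
vocab = {
    "[UNK]": 0, "[CLS]": 1, "[SEP]": 2,
    "this": 3, "is": 4, "a": 5, "text": 6, "another": 7,
}

tokenizer = Tokenizer(models.WordLevel(vocab=vocab, unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

# Attach the post-processor; encode() now adds [CLS] and [SEP] on its own.
tokenizer.post_processor = processors.BertProcessing(
    sep=("[SEP]", vocab["[SEP]"]),
    cls=("[CLS]", vocab["[CLS]"]),
)

single = tokenizer.encode("this is a text")
pair = tokenizer.encode("this is a text", "this is another text")

print(single.tokens)
print(pair.ids)
print(pair.type_ids)
```

Passing add_special_tokens=False to encode() skips the special tokens for a single call without removing the post-processor.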

Using the module

1. BertProcessing

tokenizers.processors.BertProcessing(sep, cls)

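The BertProcessing post-processor adds BERT's special tokens to the encoded text. The sep and cls arguments are each a (str, int) tuple: the first item is the special token and the second is that token's id.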

>>> from datasets import load_dataset
>>> from tokenizers import Tokenizer, models, pre_tokenizers, processors, trainers

>>> dataset = load_dataset("wikitext", "wikitext-103-raw-v1", split="validation")
>>> def batch_iterator():
...     # Yield the corpus in batches of 1000 texts.
...     for i in range(0, len(dataset), 1000):
...         yield dataset[i: i + 1000]["text"]

>>> tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))
>>> tokenizer.pre_tokenizer = pre_tokenizers.BertPreTokenizer()

>>> special_tokens = ["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"]
>>> trainer = trainers.WordPieceTrainer(special_tokens=special_tokens)
>>> tokenizer.train_from_iterator(batch_iterator(), trainer)

>>> # Without a post-processor, no special tokens are added.
>>> tokenizer.encode("this is a text!!!").ids
[758, 560, 64, 4413, 5, 5, 5]

>>> processor = processors.BertProcessing(sep=("[SEP]", tokenizer.token_to_id("[SEP]")),
...                                       cls=("[CLS]", tokenizer.token_to_id("[CLS]")))
>>> processor.process(tokenizer.encode("this is a text!!!")).ids
[2, 758, 560, 64, 4413, 5, 5, 5, 3]
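The effect of BERT-style post-processing on token ids and type ids can be sketched in plain Python. The helper name bert_wrap is hypothetical and this is not the library's implementation; it only mirrors the layout [CLS] A [SEP] for one sequence and [CLS] A [SEP] B [SEP] for a pair, using the ids from the example above.

```python
from typing import List, Optional, Tuple


def bert_wrap(
    ids_a: List[int],
    cls_id: int,
    sep_id: int,
    ids_b: Optional[List[int]] = None,
) -> Tuple[List[int], List[int]]:
    """Hypothetical helper: wrap one or two id sequences BERT-style."""
    # First segment: [CLS] A [SEP], all with type id 0.
    ids = [cls_id] + ids_a + [sep_id]
    type_ids = [0] * len(ids)
    if ids_b is not None:
        # Second segment: B [SEP], all with type id 1.
        ids += ids_b + [sep_id]
        type_ids += [1] * (len(ids_b) + 1)
    return ids, type_ids


ids, type_ids = bert_wrap([758, 560, 64, 4413, 5, 5, 5], cls_id=2, sep_id=3)
print(ids)       # [2, 758, 560, 64, 4413, 5, 5, 5, 3]
print(type_ids)  # [0, 0, 0, 0, 0, 0, 0, 0, 0]
```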

2. ByteLevel

tokenizers.processors.ByteLevel(trim_offsets=True)

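The ByteLevel post-processor is used after a byte-level BPE model. Tokens produced by byte-level BPE often carry the whitespace that precedes them, and that whitespace is counted in the token offsets. If the offsets should exclude it, set trim_offsets=True, which is also the default.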

>>> from datasets import load_dataset
>>> from tokenizers import Tokenizer, models, pre_tokenizers, processors, trainers

>>> dataset = load_dataset("wikitext", "wikitext-103-raw-v1", split="validation")
>>> def batch_iterator():
...     for i in range(0, len(dataset), 1000):
...         yield dataset[i: i + 1000]["text"]

>>> tokenizer = Tokenizer(models.BPE())
>>> tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel()
>>> trainer = trainers.BpeTrainer(special_tokens=["<|endoftext|>"])
>>> tokenizer.train_from_iterator(batch_iterator(), trainer)

>>> # trim_offsets=False: offsets include the whitespace before each token.
>>> processor = processors.ByteLevel(trim_offsets=False)
>>> processor.process(tokenizer.encode("this is a text!!!")).offsets
[(0, 4), (4, 7), (7, 9), (9, 14), (14, 15), (15, 16), (16, 17)]
>>> # Default (trim_offsets=True): the whitespace is excluded.
>>> processor = processors.ByteLevel()
>>> processor.process(tokenizer.encode("this is a text!!!")).offsets
[(0, 4), (5, 7), (8, 9), (10, 14), (14, 15), (15, 16), (16, 17)]
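To make the difference between the two offset lists concrete, here is a minimal plain-Python sketch of offset trimming. The function trim_offsets is a hypothetical illustration, not the library's code: it simply shrinks each (start, end) span until it no longer begins or ends on whitespace.

```python
from typing import List, Tuple

Offset = Tuple[int, int]


def trim_offsets(text: str, offsets: List[Offset]) -> List[Offset]:
    """Hypothetical helper: drop surrounding whitespace from each span."""
    trimmed = []
    for start, end in offsets:
        # Move the start forward past leading whitespace.
        while start < end and text[start].isspace():
            start += 1
        # Move the end backward past trailing whitespace.
        while end > start and text[end - 1].isspace():
            end -= 1
        trimmed.append((start, end))
    return trimmed


text = "this is a text!!!"
untrimmed = [(0, 4), (4, 7), (7, 9), (9, 14), (14, 15), (15, 16), (16, 17)]
print(trim_offsets(text, untrimmed))
# [(0, 4), (5, 7), (8, 9), (10, 14), (14, 15), (15, 16), (16, 17)]
```

With trimmed offsets, text[start:end] yields the bare word ("is" rather than " is"), which is usually what downstream span extraction expects.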

3. RobertaProcessing

tokenizers.processors.RobertaProcessing(sep, cls, trim_offsets=True, add_prefix_space=True)

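The RobertaProcessing post-processor prepares encodings for RoBERTa models. RoBERTa also uses a byte-level BPE model, so its tokens likewise carry leading whitespace. The add_prefix_space argument should match the add_prefix_space value used in the pre-tokenizer, because a space added in front of the first token also affects the offsets.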

>>> from datasets import load_dataset
>>> from tokenizers import Tokenizer, models, pre_tokenizers, processors, trainers

>>> dataset = load_dataset("wikitext", "wikitext-103-raw-v1", split="validation")
>>> def batch_iterator():
...     for i in range(0, len(dataset), 1000):
...         yield dataset[i: i + 1000]["text"]

>>> # The special-token strings are missing from the original listing. RoBERTa's
>>> # conventional tokens are assumed here, ordered so that <s> is 2 and </s> is 3.
>>> tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))
>>> tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=True)
>>> special_tokens = ["<unk>", "<pad>", "<s>", "</s>", "<mask>"]
>>> trainer = trainers.BpeTrainer(special_tokens=special_tokens)
>>> tokenizer.train_from_iterator(batch_iterator(), trainer)

>>> processor = processors.RobertaProcessing(sep=("</s>", tokenizer.token_to_id("</s>")),
...                                          cls=("<s>", tokenizer.token_to_id("<s>")),
...                                          trim_offsets=True,
...                                          add_prefix_space=True)
>>> processor.process(tokenizer.encode("this is a text!!!")).ids
[2, 522, 305, 176, 4452, 5, 5, 5, 3]
>>> processor.process(tokenizer.encode("this is a text!!!")).offsets
[(0, 0), (0, 4), (5, 7), (8, 9), (10, 14), (14, 15), (15, 16), (16, 17), (0, 0)]
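The output above shows that the added special tokens receive the offset (0, 0), since they do not correspond to any span of the input text. A minimal plain-Python sketch of that bookkeeping for a single sequence follows; roberta_wrap_single is a hypothetical helper, not the library's implementation.

```python
from typing import List, Tuple

Offset = Tuple[int, int]


def roberta_wrap_single(
    ids: List[int],
    offsets: List[Offset],
    cls_id: int,
    sep_id: int,
) -> Tuple[List[int], List[Offset]]:
    """Hypothetical helper: add cls/sep ids and give them (0, 0) offsets."""
    wrapped_ids = [cls_id] + ids + [sep_id]
    # Special tokens map to no characters in the input, hence (0, 0).
    wrapped_offsets = [(0, 0)] + offsets + [(0, 0)]
    return wrapped_ids, wrapped_offsets


ids, offsets = roberta_wrap_single(
    ids=[522, 305, 176, 4452, 5, 5, 5],
    offsets=[(0, 4), (5, 7), (8, 9), (10, 14), (14, 15), (15, 16), (16, 17)],
    cls_id=2,
    sep_id=3,
)
print(ids)
print(offsets)
```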

4. TemplateProcessing

tokenizers.processors.TemplateProcessing(single, pair, special_tokens)

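The TemplateProcessing post-processor provides a template-based way to add special tokens to the input sequences. The single argument is the template used for a single sequence, the pair argument is the template used for a pair of sequences, and special_tokens lists the special tokens used in the templates as (str, int) tuples of token and id.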

Sequence placeholders in the single and pair templates can be written in three forms:

# Sequence only; the type id defaults to 0
$A or $B
# Type id only; the sequence defaults to A
$0, $1, ...
# Both the sequence and the type id
$A:0, $B:1

>>> from datasets import load_dataset
>>> from tokenizers import Tokenizer, models, normalizers, pre_tokenizers, processors, trainers

>>> dataset = load_dataset("wikitext", "wikitext-103-raw-v1", split="validation")
>>> def batch_iterator():
...     for i in range(0, len(dataset), 1000):
...         yield dataset[i: i + 1000]["text"]

>>> tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))
>>> tokenizer.normalizer = normalizers.BertNormalizer()
>>> tokenizer.pre_tokenizer = pre_tokenizers.BertPreTokenizer()
>>> special_tokens = ["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"]
>>> trainer = trainers.WordPieceTrainer(special_tokens=special_tokens)
>>> tokenizer.train_from_iterator(batch_iterator(), trainer)

>>> tokenizer.encode("this is a text!!!").ids
[482, 359, 38, 3350, 5, 5, 5]

>>> cls_token_id = tokenizer.token_to_id("[CLS]")
>>> sep_token_id = tokenizer.token_to_id("[SEP]")
>>> post_processor = processors.TemplateProcessing(
...     single="[CLS]:0 $A:0 [SEP]:0",
...     pair="[CLS]:0 $A:0 [SEP]:0 $B:1 [SEP]:1",
...     special_tokens=[
...         ("[CLS]", cls_token_id),
...         ("[SEP]", sep_token_id),
...     ],
... )

>>> # Single sequence
>>> post_processor.process(tokenizer.encode("this is a text!!!")).ids
[2, 482, 359, 38, 3350, 5, 5, 5, 3]
>>> post_processor.process(tokenizer.encode("this is a text!!!")).type_ids
[0, 0, 0, 0, 0, 0, 0, 0, 0]

>>> # Pair of sequences
>>> post_processor.process(tokenizer.encode("this is a text!!!"), tokenizer.encode("this is another text!!!")).ids
[2, 482, 359, 38, 3350, 5, 5, 5, 3, 482, 359, 1061, 3350, 5, 5, 5, 3]
>>> post_processor.process(tokenizer.encode("this is a text!!!"), tokenizer.encode("this is another text!!!")).type_ids
[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]
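How a template string turns into ids and type ids can be sketched in plain Python. The functions parse_piece and apply_template below are hypothetical and cover only the three placeholder forms described above plus special tokens with an optional ":type_id" suffix; the library's real parser may accept more. The ids are taken from the example output.

```python
from typing import Dict, List, Optional, Tuple


def parse_piece(piece: str) -> Tuple[str, str, int]:
    """Hypothetical parser: return (kind, name, type_id) for one piece."""
    name, _, type_part = piece.partition(":")
    type_id = int(type_part) if type_part else 0
    if name.startswith("$"):
        ref = name[1:]
        if ref.isdigit():
            # "$1" means sequence A with type id 1.
            return ("sequence", "A", int(ref))
        # "$A", "$B", or "$A:0" style placeholders.
        return ("sequence", ref or "A", type_id)
    # Anything else is treated as a special token such as "[CLS]".
    return ("special", name, type_id)


def apply_template(
    template: str,
    special_ids: Dict[str, int],
    seq_a: List[int],
    seq_b: Optional[List[int]] = None,
) -> Tuple[List[int], List[int]]:
    """Hypothetical helper: build ids and type ids from a template."""
    sequences = {"A": seq_a, "B": seq_b or []}
    ids: List[int] = []
    type_ids: List[int] = []
    for piece in template.split():
        kind, name, type_id = parse_piece(piece)
        new_ids = sequences[name] if kind == "sequence" else [special_ids[name]]
        ids.extend(new_ids)
        type_ids.extend([type_id] * len(new_ids))
    return ids, type_ids


special = {"[CLS]": 2, "[SEP]": 3}
a = [482, 359, 38, 3350, 5, 5, 5]
b = [482, 359, 1061, 3350, 5, 5, 5]
pair_ids, pair_type_ids = apply_template(
    "[CLS]:0 $A:0 [SEP]:0 $B:1 [SEP]:1", special, a, b
)
print(pair_ids)
print(pair_type_ids)
```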


Original article: https://blog.csdn.net/weixin_49346755/article/details/126499720
