transformers PreTrainedTokenizer类

基类概述

PreTrainedTokenizer类是所有分词器类Tokenizer的基类，该类不能被实例化，所有的分词器类（比如BertTokenizer、DebertaTokenizer等）都继承自PreTrainedTokenizer类，并实现了基类的方法。

基类方法

1、_ _ call _ _函数

__call__(
        text: Union[TextInput, PreTokenizedInput, List[TextInput], List[PreTokenizedInput]],
        text_pair: Optional[Union[TextInput, PreTokenizedInput, List[TextInput], List[PreTokenizedInput]]] = None,
        add_special_tokens: bool = True,
        padding: Union[bool, str, PaddingStrategy] = False,
        truncation: Union[bool, str, TruncationStrategy] = False,
        max_length: Optional[int] = None,
        stride: int = 0,
        is_split_into_words: bool = False,
        pad_to_multiple_of: Optional[int] = None,
        return_tensors: Optional[Union[str, TensorType]] = None,
        return_token_type_ids: Optional[bool] = None,
        return_attention_mask: Optional[bool] = None,
        return_overflowing_tokens: bool = False,
        return_special_tokens_mask: bool = False,
        return_offsets_mapping: bool = False,
        return_length: bool = False,
        verbose: bool = True,
        **kwargs
    )
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20

该函数返回一个BatchEncoding，该类继承自python字典类型，可以像使用字典一样使用BatchEncoding。除此之外，该类还有一些将单词、字符转换成分词的方法。下面以BertTokenizer为例，说明一下各个参数的含义。

参数text表示要编码的序列或序列批次。参数text_pair表示分句的序列或序列批次。可以为一个str或者一个列表、一个str组成的列表，如果为一个str组成的列表，则参数is_split_into_words应为True。

# text为一个string
>>> tokenizer(text="The sailors rode the breeze clear of the rocks.")
{'input_ids': [101, 1996, 11279, 8469, 1996, 9478, 3154, 1997, 1996, 5749, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

# text为一个列表
>>> tokenizer(text=["The sailors rode the breeze clear of the rocks."])
{'input_ids': [[101, 1996, 11279, 8469, 1996, 9478, 3154, 1997, 1996, 5749, 1012, 102]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}

# text为一个string组成的列表，此时设置 参数is_split_into_words=True
>>> tokenizer(text=[["The", "sailors", "rode", "the", "breeze", "clear", "of", "the", "rocks"]], is_split_into_words=True)
{'input_ids': [[101, 1996, 11279, 8469, 1996, 9478, 3154, 1997, 1996, 5749, 102]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}

# text_pair和text的格式应一样，text为列表则text_pair也应为列表
>>> tokenizer(text="The sailors rode the breeze clear of the rocks.",
                      text_pair="I demand that the more John eat, the more he pays.")
{'input_ids': [101, 1996, 11279, 8469, 1996, 9478, 3154, 1997, 1996, 5749, 1012, 102, 1045, 5157, 2008, 1996, 2062, 2198, 4521, 1010, 1996, 2062, 2002, 12778, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16

参数add_special_tokens表示是否根据模型向其中添加特殊的标记，比如[CLS]、[SEQ]、[PAD]等标记，默认为True。

# add_special_tokens=False，不添加特殊标记
>>> tokenizer(text="The sailors rode the breeze clear of the rocks.",add_special_tokens=False)
{'input_ids': [1996, 11279, 8469, 1996, 9478, 3154, 1997, 1996, 5749, 1012], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
# 默认为True，添加特殊标记
>>> tokenizer(text="The sailors rode the breeze clear of the rocks.")
{'input_ids': [101, 1996, 11279, 8469, 1996, 9478, 3154, 1997, 1996, 5749, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

>>> encodings = tokenizer(text=["The sailors rode the breeze clear of the rocks."],add_special_tokens=True)
>>> tokenizer.batch_decode(encodings["input_ids"])
['[CLS] the sailors rode the breeze clear of the rocks. [SEP]']
1
2
3
4
5
6
7
8
9
10

参数padding表示是否进行填充，参数truncation表示是否进行截取。padding可以为一个bool数据类型，也可以为一个str表示填充的策略，分别是longest、max_length、do_not_pad。truncation既可以为bool数据类型，也可以是str表示截断的策略，分别是longest_first、only_first、only_second、do_not_truncate。这两个参数默认都是False。

参数max_length表示填充或者截断的最大长度，通常与padding、truncation参数一同使用来设置句子的长度。参数pad_to_multiple_of如果设置，则将序列填充为所提供值的倍数。

# padding为True，则设置填充，填充后的长度与句子最长的长度相等
>>> encodings = tokenizer(text=["The sailors rode the breeze clear of the rocks.","I demand that the more John eat, the more he pays."],padding=True)
>>> encodings 
{'input_ids': [[101, 1996, 11279, 8469, 1996, 9478, 3154, 1997, 1996, 5749, 1012, 102, 0, 0, 0], [101, 1045, 5157, 2008, 1996, 2062, 2198, 4521, 1010, 1996, 2062, 2002, 12778, 1012, 102]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}
>>> tokenizer.batch_decode(encodings["input_ids"])
['[CLS] the sailors rode the breeze clear of the rocks. [SEP] [PAD] [PAD] [PAD]', '[CLS] i demand that the more john eat, the more he pays. [SEP]']

# truncation为True，设置截断，默认情况下，截断后的长度与句子最长长度相同，故不会截断任何句子
# 此时，可以通过设置max_length的值来指定截断长度。
>>> encodings = tokenizer(text=["The sailors rode the breeze clear of the rocks.","I demand that the more John eat, the more he pays."],truncation=True)
>>> encodings
{'input_ids': [[101, 1996, 11279, 8469, 1996, 9478, 3154, 1997, 1996, 5749, 1012, 102], [101, 1045, 5157, 2008, 1996, 2062, 2198, 4521, 1010, 1996, 2062, 2002, 12778, 1012, 102]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}

# truncation为True，同时设置max_length=10。
>>> encodings = tokenizer(text=["The sailors rode the breeze clear of the rocks.","I demand that the more John eat, the more he pays."],truncation=True, max_length=10)
>>> encodings
{'input_ids': [[101, 1996, 11279, 8469, 1996, 9478, 3154, 1997, 1996, 102], [101, 1045, 5157, 2008, 1996, 2062, 2198, 4521, 1010, 102]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17

参数stride表示如果设置truncation=True，return_overflowing_tokens=True将包含来自返回的截断序列末尾的一些标记，以在截断序列和溢出序列之间提供一些重叠。则这个值表示重叠标记的数量。

>>> encodings = tokenizer(text=["The sailors rode the breeze clear of the rocks.","I demand that the more John eat, the more he pays."],truncation=True, max_length=10, stride=2, return_overflowing_tokens=True)
>>> encodings
# 此处overflowing_tokens中的[1997，1996]就是重叠的标记，个数为2，与参数stride=2设置的一致
{'overflowing_tokens': [[1997, 1996, 5749, 1012], [4521, 1010, 1996, 2062, 2002, 12778, 1012]], 'num_truncated_tokens': [2, 5], 'input_ids': [[101, 1996, 11279, 8469, 1996, 9478, 3154, 1997, 1996, 102], [101, 1045, 5157, 2008, 1996, 2062, 2198, 4521, 1010, 102]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}
1
2
3
4

参数return_tensors表示返回的数据类型，可以是tf（tensorflow张量）、pt（pytorch张量）、np（numpy数组）。

>>> encodings = tokenizer(text="The sailors rode the breeze clear of the rocks.",return_tensors="tf")
>>> type(encodings["input_ids"])
<class 'tensorflow.python.framework.ops.EagerTensor'>

>>> encodings = tokenizer(text="The sailors rode the breeze clear of the rocks.",return_tensors="np")
>>> type(encodings["input_ids"])
<class 'numpy.ndarray'>
1
2
3
4
5
6
7

参数return_token_type_ids表示是否返回分句的id，参数return_attention_mask表示是否返回注意力的掩码，参数return_overflowing_tokens表示是否返回截断的单词id，return_special_tokens_mask表示是否返回特殊标记信息，参数return_length表示是否返回句子的长度。

# 只有一个分句，token_type_ids全为0
>>> tokenizer(text="The sailors rode the breeze clear of the rocks.",return_token_type_ids=True)
{'input_ids': [101, 1996, 11279, 8469, 1996, 9478, 3154, 1997, 1996, 5749, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
# 有两个分句，第二个分句token_type_ids为1
>>> tokenizer(text="The sailors rode the breeze clear of the rocks.",text_pair="I demand that the more John eat, the more he pays.",return_token_type_ids=True)
{'input_ids': [101, 1996, 11279, 8469, 1996, 9478, 3154, 1997, 1996, 5749, 1012, 102, 1045, 5157, 2008, 1996, 2062, 2198, 4521, 1010, 1996, 2062, 2002, 12778, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

# overflowing_tokens表示截断的单词
>>> tokenizer(text="The sailors rode the breeze clear of the rocks.",truncation=True,max_length=10,return_overflowing_tokens=True)
{'overflowing_tokens': [5749, 1012], 'num_truncated_tokens': 2, 'input_ids': [101, 1996, 11279, 8469, 1996, 9478, 3154, 1997, 1996, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
# special_tokens_mask使用特殊标记字符[CLS]、[SEQ]等的地方标记为1，其他为0
>>> tokenizer(text="The sailors rode the breeze clear of the rocks.",return_special_tokens_mask=True)
{'input_ids': [101, 1996, 11279, 8469, 1996, 9478, 3154, 1997, 1996, 5749, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'special_tokens_mask': [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

# length表示各个句子的长度
>>> tokenizer(text=["The sailors rode the breeze clear of the rocks.", "I demand that the more John eat, the more he pays."],return_length=True)
{'input_ids': [[101, 1996, 11279, 8469, 1996, 9478, 3154, 1997, 1996, 5749, 1012, 102], [101, 1045, 5157, 2008, 1996, 2062, 2198, 4521, 1010, 1996, 2062, 2002, 12778, 1012, 102]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 'length': [12, 15], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17

参数return_offsets_mapping表示是否返回句子中每个单词的起始位置，仅适用于Fast分词器，比如BertTokenizerFast。

>>> tokenizer("The sailors rode the breeze clear of the rocks.",return_offsets_mapping=True)
{'input_ids': [101, 1996, 11279, 8469, 1996, 9478, 3154, 1997, 1996, 5749, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'offset_mapping': [(0, 0), (0, 3), (4, 11), (12, 16), (17, 20), (21, 27), (28, 33), (34, 36), (37, 40), (41, 46), (46, 47), (0, 0)]}
1
2

参数verbose表示是否打印出冗余的警告信息。

2、encode函数

encode(
        text: Union[TextInput, PreTokenizedInput, EncodedInput],
        text_pair: Optional[Union[TextInput, PreTokenizedInput, EncodedInput]] = None,
        add_special_tokens: bool = True,
        padding: Union[bool, str, PaddingStrategy] = False,
        truncation: Union[bool, str, TruncationStrategy] = False,
        max_length: Optional[int] = None,
        stride: int = 0,
        return_tensors: Optional[Union[str, TensorType]] = None,
        **kwargs
    )
1
2
3
4
5
6
7
8
9
10
11

该函数使用分词器将字符串编码成一个int列表。所有的参数与__call__函数中的参数含义一致。在进行批处理时，此函数不常使用，通常使用__call__函数来处理。

# text可以为一个str
>>> encoding = tokenizer.encode(text="The sailors rode the breeze clear of the rocks.")
>>> encoding
[101, 1996, 11279, 8469, 1996, 9478, 3154, 1997, 1996, 5749, 1012, 102]
# text也可以为一个List[str]，这个列表中的每一个str称为一个token
>>> encoding = tokenizer.encode(text=["The", "sailors", "rode", "the", "breeze", "clear", "of", "the", "rocks", "."])
>>> encoding
[101, 100, 11279, 8469, 1996, 9478, 3154, 1997, 1996, 5749, 1012, 102]
# text还可以为一个List[int]，每个int为token所对应的id
>>> encodings = tokenizer.encode(text=[100, 11279, 8469, 1996, 9478, 3154, 1997, 1996, 5749, 1012])
>>> encodings
[101, 100, 11279, 8469, 1996, 9478, 3154, 1997, 1996, 5749, 1012, 102]
1
2
3
4
5
6
7
8
9
10
11
12

3、decode函数

decode(
        token_ids: Union[int, List[int], "np.ndarray", "torch.Tensor", "tf.Tensor"],
        skip_special_tokens: bool = False,
        clean_up_tokenization_spaces: bool = True,
        **kwargs
    )
1
2
3
4
5
6

函数使用分词器将一个int列表转换成一个str。
参数clean_up_tokenization_spaces表示是否清理标记化空间，如果设值为False，则将id转换成str后标点符号与单词之间的空格不会被清除；如果为True，则会被清除。默认为True。

>>> decodings = tokenizer.decode([101, 1996, 11279, 8469, 1996, 9478, 3154, 1997, 1996, 5749, 1012, 102])
>>> decodings
'[CLS] [UNK] sailors rode the breeze clear of the rocks. [SEP]'# 标点符号与单词间没有空格
>>> decodings = tokenizer.decode([101, 1996, 11279, 8469, 1996, 9478, 3154, 1997, 1996, 5749, 1012, 102], clean_up_tokenization_spaces=False)
>>> decodings
'[CLS] [UNK] sailors rode the breeze clear of the rocks . [SEP]' # 标点符号与单词间有空格
1
2
3
4
5
6

4、batch_decode函数

batch_decode(
        sequences: Union[List[int], List[List[int]], "np.ndarray", "torch.Tensor", "tf.Tensor"],
        skip_special_tokens: bool = False,
        clean_up_tokenization_spaces: bool = True,
        **kwargs
    )
1
2
3
4
5
6

一次处理多个List[int]，也就是一个List[List[int]]，结果返回一个List[str]

# sequences为一个List[List[int]]
>>> decodings = tokenizer.batch_decode([[101, 1996, 11279, 8469, 1996, 9478, 3154, 1997, 1996, 5749, 1012, 102]])
>>> decodings
['[CLS] the sailors rode the breeze clear of the rocks. [SEP]']
1
2
3
4

5、convert_ids_to_tokens函数

convert_ids_to_tokens(
        ids: Union[int, List[int]], skip_special_tokens: bool = False
    )
1
2
3

将一个id列表转换成一个str列表。

>>> tokens = tokenizer.convert_ids_to_tokens([101, 1996, 11279, 8469, 1996, 9478, 3154, 1997, 1996, 5749, 1012, 102])
>>> tokens
['[CLS]', 'the', 'sailors', 'rode', 'the', 'breeze', 'clear', 'of', 'the', 'rocks', '.', '[SEP]']
1
2
3

6、convert_tokens_to_ids函数

convert_tokens_to_ids(tokens: Union[str, List[str]])
1

将tokens（List[str])转换成一个id列表（List[int])

>>> ids = tokenizer.convert_tokens_to_ids(['[CLS]', 'the', 'sailors', 'rode', 'the', 'breeze', 'clear', 'of', 'the', 'rocks', '.', '[SEP]'])
>>> ids
[101, 1996, 11279, 8469, 1996, 9478, 3154, 1997, 1996, 5749, 1012, 102]
1
2
3

7、tokenize函数

tokenize(text: TextInput, **kwargs)
1

将str转换成一个List[str]。

>>> tokenizer.tokenize("The sailors rode the breeze clear of the rocks.")
['the', 'sailors', 'rode', 'the', 'breeze', 'clear', 'of', 'the', 'rocks', '.']
1
2

相关阅读:
计算机毕业设计（附源码）python在线共享笔记系统
 C++ STL库的介绍和使用
 机器视觉技术的演进：YOLO系列与Halcon的深度对比
 Python实验三
 Excel VBA Win窗口开发散点图(窗体应用项目,完整开发分享)
VMware16虚拟机添加硬盘(磁盘)和挂载硬盘(磁盘)
开源软件 FFmpeg 生成模型使用图片数据集
 Java学习笔记（二十五）
学习Golang时遇到的似懂非懂的概念
 Spring底层原理学习笔记--第十一讲--(aop之proxy增强-jdk及aop之proxy增强-cglib)
原文地址：https://blog.csdn.net/weixin_49346755/article/details/125407176