BertTokenizer 使用方法

python 导入与初始化 BertTokenizer

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained(pretrained_model_name_or_path='bert-base-chinese')
1
2
3

首先定义一些数据

sents = [
    '选择珠江花园的原因就是方便。',
    '笔记本的键盘确实爽。',
    '房间太小。其他的都一般。',
    '今天才知道这书还有第6卷,真有点郁闷.',
    '机器背面似乎被撕了张什么标签，残胶还在。',
]
1
2
3
4
5
6
7

Tokenizer的几种使用方法（目前遇到的比较常用的、有新的后续补充）

附上hugging face 中，BertTokenizer 的说明文档

tokenizer.tokenize

只是将句子拆分为token，并不映射为对应的id

tokenizer.tokenize(sents[0])
############################################
['选', '择', '珠', '江', '花', '园', '的', '原', '因', '就', '是', '方', '便', '。']
1
2
3

tokenizer.convert_tokens_to_ids

将token映射为其对应的id

tokenizer.convert_tokens_to_ids(['选', '择', '珠', '江', '花', '园', '的', '原', '因', '就', '是', '方', '便', '。'])
############################################
[6848, 2885, 4403, 3736, 5709, 1736, 4638, 1333, 1728, 2218, 3221, 3175, 912, 511]
1
2
3

tokenizer.encode

这个函数只返回编码的结果（input_ids）

tokenizer.encode(sents[0])
############################################
[101, 6848, 2885, 4403, 3736, 5709, 1736, 4638, 1333, 1728, 2218, 3221, 3175, 912, 511, 102]
1
2
3

tokenizer.encode_plus

这种方法能够返回更多的编码信息，（更多编码信息后面说明）

tokenizer.encode_plus(sents[0])
#############################################
{
	'input_ids': [101, 6848, 2885, 4403, 3736, 5709, 1736, 4638, 1333, 1728, 2218, 3221, 3175, 912, 511, 102], 	
	'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 
	'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
}
1
2
3
4
5
6
7

tokenizer.batch_encode_plus

以 batch 的形式去编码句子，返回的信息基本和 encode_plus 是一样的。

tokenizer.batch_encode_plus([sents[0], sents[1]])
###########################################################
{
	'input_ids': [
		[101, 6848, 2885, 4403, 3736, 5709, 1736, 4638, 1333, 1728, 2218, 3221, 3175, 912, 511, 102], 
		[101, 5011, 6381, 3315, 4638, 7241, 4669, 4802, 2141, 4272, 511, 102]
	], 
	'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 
	'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]
}

1
2
3
4
5
6
7
8
9
10
11

编码过程中一些主要参数说明

多的不说直接上例子

out = tokenizer.encode_plus(
    text=sents[0],
    text_pair=sents[1],

    #当句子长度大于max_length时,截断
    truncation=True,

    #一律补零到max_length长度
    padding='max_length',
    max_length=30,
    add_special_tokens=True,

    #可取值tf,pt,np,默认为返回list
    return_tensors=None,

    #返回token_type_ids
    return_token_type_ids=True,

    #返回attention_mask
    return_attention_mask=True,   

    #返回special_tokens_mask 特殊符号标识
    return_special_tokens_mask=True,

    #返回offset_mapping 标识每个词的起止位置,这个参数只能BertTokenizerFast使用
    #return_offsets_mapping=True,

    #返回length 标识长度
    return_length=True,
)

for k, v in out.items():
    print(k, ':', v)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33

运行结果如下：

input_ids : [101, 6848, 2885, 4403, 3736, 5709, 1736, 4638, 1333, 1728, 2218, 3221, 3175, 912, 511, 102, 5011, 6381, 3315, 4638, 7241, 4669, 4802, 2141, 4272, 511, 102, 0, 0, 0]
token_type_ids : [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
special_tokens_mask : [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
attention_mask : [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
length : 30
1
2
3
4
5

相关阅读:
【JavaSE】集合专项练习篇（附源码）
网络代理技术：隐私保护与安全加固的利器
SpringBoot项目如何优雅的实现操作日志记录
Java技术栈 —— 模版引擎 Freemarker or Thymeleaf？
[python 刷题] 217 Contains Duplicate
华为云从入门到实战 | 云服务概述与华为云搭建Web应用
Java版分布式微服务云开发架构 Spring Cloud+Spring Boot+Mybatis 电子招标采购系统功能清单
为什么选择WordPress作为企业CMS？
TFIDF与BM25
MQ的概述及优缺点及使用场景

原文地址：https://blog.csdn.net/Defiler_Lee/article/details/126490287