Training is performed by passing the `spm_train` parameters to the `SentencePieceTrainer.train()` function.
```python
import sentencepiece as spm

# train a SentencePiece model from `botchan.txt`, producing `m.model` and `m.vocab`
# `m.vocab` is only a reference file; it is not used during segmentation.
spm.SentencePieceTrainer.train('--input=examples/hl_data/botchan.txt --model_prefix=m --vocab_size=500')
```
Parameters that can be passed for training (a keyword-argument sketch follows the list):
- `--input`: one-sentence-per-line raw corpus file. No need to run a tokenizer, normalizer, or preprocessor. By default, SentencePiece normalizes the input with Unicode NFKC. You can pass a comma-separated list of files.
- `--model_prefix`: output model name prefix. `.model` and `.vocab` files are generated.
- `--vocab_size`: vocabulary size, e.g., 8000, 16000, or 32000.
- `--character_coverage`: amount of characters covered by the model. Good defaults are 0.9995 for languages with a rich character set like Japanese or Chinese, and 1.0 for other languages with a small character set.
- `--model_type`: model type. Choose from `unigram` (default), `bpe`, `char`, or `word`. The input sentence must be pretokenized when using the `word` type.
```python
# create a segmenter instance and load the trained model file (m.model)
sp = spm.SentencePieceProcessor()
sp.load('m.model')

text = """
If you don’t have write permission to the global site-packages directory or don’t want to install into it, please try:
"""
# encode: text => pieces / ids
pieces = sp.encode_as_pieces(text)
ids = sp.encode_as_ids(text)
print(pieces)
print(ids)

# decode: pieces / ids => text
print(sp.decode_pieces(pieces))
print(sp.decode_ids(ids))
```
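Beyond encoding and decoding sentences, the processor also exposes the learned vocabulary directly. This is a minimal sketch assuming the same `m.model` trained above; the piece `'▁the'` is only an example and may not exist in a 500-entry vocabulary.

```python
# inspect the learned vocabulary (assumes m.model loaded above)
print(sp.get_piece_size())       # total vocabulary size (500 here)
print(sp.id_to_piece(10))        # piece string for id 10
print(sp.piece_to_id('▁the'))    # id for a piece; returns the unk id if the piece is not in the vocab
print(sp.unk_id(), sp.bos_id(), sp.eos_id())  # ids of the special symbols
```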
Official reference link:
sentencepiece/README.md at master · google/sentencepiece · GitHub