• Implementing different decoding methods for text generation

    Greedy Search

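    Greedy search selects, at every time step t, the single most probable next word given the words generated so far: w_t = argmax_w P(w | w_{1:t-1}). In other words, the t-th word is the word to which the model assigns the highest probability, conditioned on the previous t-1 words.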
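To make the argmax rule concrete, here is a minimal sketch on a toy model. `TOY_MODEL` is a hypothetical hand-written bigram table, not GPT-2: it conditions only on the previous word, whereas a real language model conditions on the whole prefix. The function name `greedy_decode` is likewise made up for illustration.

```python
# Hypothetical next-word table: previous word -> P(next word | previous word).
TOY_MODEL = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.3, "</s>": 0.2},
    "a": {"dog": 0.7, "cat": 0.3},
    "cat": {"sat": 0.6, "</s>": 0.4},
    "dog": {"sat": 0.2, "</s>": 0.8},
    "sat": {"</s>": 1.0},
}


def greedy_decode(model, start="<s>", end="</s>", max_length=10):
    """Append the most probable next word until `end` or `max_length` is reached."""
    words = [start]
    while len(words) < max_length:
        next_probs = model[words[-1]]
        best = max(next_probs, key=next_probs.get)  # argmax over candidate words
        words.append(best)
        if best == end:
            break
    return words


print(greedy_decode(TOY_MODEL))  # ['<s>', 'the', 'cat', 'sat', '</s>']
```

Each step commits to the locally best word and never revisits the choice, which is why greedy search can miss a sequence whose overall probability is higher.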

    """
    Greedy Search
    """
    import tensorflow as tf
    from transformers import TFGPT2LMHeadModel, GPT2Tokenizer

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    # add the EOS token as PAD token to avoid warnings
    model = TFGPT2LMHeadModel.from_pretrained("gpt2", pad_token_id=tokenizer.eos_token_id)
    # encode the prompt into the input format the model expects
    input_ids = tokenizer.encode('the japanese is so ', return_tensors='tf')
    # limit the total length (prompt + generated text) to 50 tokens;
    # early_stopping only affects beam search, so it is not passed here
    greedy_output = model.generate(input_ids, max_length=50)
    print(tokenizer.decode(greedy_output[0], skip_special_tokens=True))
    """
    Output:
    the japanese is so - that it's not even -
    that it's -that it's - that it's - that it's - that it's
    """

    Beam Search


    # search with 5 beams
    beam_output = model.generate(input_ids, max_length=50, early_stopping=True, num_beams=5)
    print(tokenizer.decode(beam_output[0], skip_special_tokens=True))
    """
    the japanese is so iced up that I don't even know how to pronounce it.
    I'm not sure how to pronounce it.
    I'm not sure how to pronounce it.
    """
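Beam search keeps the `num_beams` most probable partial sequences at every step instead of only the single best one. The sketch below illustrates the idea on a toy model. `TOY_MODEL`, `next_word_probs`, and `beam_search` are hypothetical names written for this example, and the code is not how `model.generate` is implemented internally.

```python
import math

# Hypothetical next-word table; words missing from it end the sequence.
TOY_MODEL = {
    "<s>": {"nice": 0.5, "dog": 0.4, "car": 0.1},
    "nice": {"woman": 0.4, "house": 0.3, "guy": 0.3},
    "dog": {"has": 0.9, "and": 0.05, "runs": 0.05},
}
END = "</s>"


def next_word_probs(word):
    return TOY_MODEL.get(word, {END: 1.0})


def beam_search(num_beams, start="<s>", max_length=5):
    """Return the surviving beams as (words, log probability), best first."""
    beams = [([start], 0.0)]
    for _ in range(max_length):
        candidates = []
        for words, score in beams:
            if words[-1] == END:  # finished beams are carried over unchanged
                candidates.append((words, score))
                continue
            for word, prob in next_word_probs(words[-1]).items():
                candidates.append((words + [word], score + math.log(prob)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:num_beams]  # keep only the best num_beams sequences
        if all(words[-1] == END for words, _ in beams):
            break
    return beams


print(beam_search(num_beams=1)[0][0])  # one beam behaves like greedy search
print(beam_search(num_beams=2)[0][0])  # two beams find a more probable sequence
```

With one beam the search commits to "nice" (0.5) and ends with probability 0.5 × 0.4 = 0.2; with two beams it also keeps "dog" (0.4) alive and finds "dog has" with probability 0.4 × 0.9 = 0.36.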

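    The results above contain repeated text. To avoid this, we apply an n-gram penalty, which ensures that no n-gram appears twice in the generated text. This has a drawback, however: even where repetition is actually needed, an n-gram can still appear only once.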
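The rule behind the n-gram penalty can be sketched in a few lines: before choosing the next token, find every token that would complete an n-gram already present in the text and exclude it. The helper `banned_next_tokens` below is a hypothetical illustration of that rule on word-level tokens, not the library's implementation.

```python
def banned_next_tokens(tokens, ngram_size):
    """Return the tokens that would complete an n-gram already present in `tokens`."""
    if ngram_size < 1 or len(tokens) < ngram_size:
        return set()
    # The last n-1 tokens form the prefix of the n-gram the next token would complete.
    prefix = tuple(tokens[len(tokens) - (ngram_size - 1):])
    banned = set()
    for i in range(len(tokens) - ngram_size + 1):
        ngram = tuple(tokens[i:i + ngram_size])
        if ngram[:-1] == prefix:
            banned.add(ngram[-1])
    return banned


generated = ["I'm", "not", "sure", "how", "to", "pronounce", "it", ".", "I'm", "not"]
# With ngram_size=2, "not sure" already occurred, so "sure" may not follow "not" again.
print(banned_next_tokens(generated, ngram_size=2))
```

A decoder would set the probability of the banned tokens to zero before picking the next word, which is also why a phrase that legitimately needs to recur gets blocked.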

    # no_repeat_ngram_size=2 forbids any 2-gram from appearing twice
    ngram_beam_output = model.generate(input_ids, max_length=50, early_stopping=True, num_beams=5, no_repeat_ngram_size=2)
    print(tokenizer.decode(ngram_beam_output[0], skip_special_tokens=True))
    """
    Output:
    the japanese is so iced up that I don't even know how to pronounce it.
    I'm not sure if it's because I'm lazy, or if I just want to be able to say it in Japanese, but I
    """

    Top-k Answers (Returning Multiple Beams)


    # return the 5 highest-scoring beams; num_return_sequences must not exceed num_beams
    ngram_beam_topk_output = model.generate(input_ids, max_length=50, early_stopping=True, num_beams=5, no_repeat_ngram_size=2, num_return_sequences=5)
    for i, beam in enumerate(ngram_beam_topk_output):
        print("{}: {}".format(i, tokenizer.decode(beam, skip_special_tokens=True)))
    """
    0: the japanese is so iced up that I don't even know how to pronounce it.
    I'm not sure if it's because I'm lazy, or if I just want to be able to say it in Japanese, but I
    1: the japanese is so iced up that I don't even know how to pronounce it.
    I'm not sure if it's because I'm lazy, or if I just want to be able to say it in Japanese, but it
    2: the japanese is so iced up that I don't even know how to pronounce it.
    I'm not sure if it's because I'm lazy, or if I just want to be able to read Japanese, but I think it
    3: the japanese is so iced up that I don't even know how to pronounce it.
    I'm not sure if it's because I'm lazy, or if I just want to be able to say it in Japanese. But I
    4: the japanese is so iced up that I don't even know how to pronounce it.
    I'm not sure if it's because I'm lazy, or if I just want to be able to read Japanese, but I think I
    """



    Sampling

    # fix the random seed so the sampled text is reproducible
    tf.random.set_seed(0)
    # sample each next word from the 50 most likely candidates
    sample_output = model.generate(input_ids, do_sample=True, max_length=200, top_k=50)
    print(tokenizer.decode(sample_output[0], skip_special_tokens=True))
    """
    Output:
    the japanese is so icky, that even the "Japan's food is the finest" is not really an excuse.
    My point is what it is that does the Japanese taste better than others in this country. The reason is that
    """


    Sampling with Temperature

    # top_k=0 disables top-k filtering; temperature < 1 sharpens the distribution
    sample_output = model.generate(
        input_ids,
        do_sample=True,
        max_length=50,
        top_k=0,
        temperature=0.9
    )
    print(tokenizer.decode(sample_output[0], skip_special_tokens=True))
    """
    Output:
    the japanese is so familiar with the Japanese language that it's hard to imagine it being used in a Japanese context.
    The Japanese language is a very complex language, and it's hard to imagine a Japanese person using it in
    """
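The effect of `temperature` can be seen on a small example: the scores (logits) are divided by the temperature before the softmax, so values below 1 concentrate probability on the likeliest words and values above 1 flatten the distribution. The function `softmax_with_temperature` and the logits below are made up for illustration.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities after dividing them by `temperature`."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]  # hypothetical scores for three candidate words
for t in (2.0, 1.0, 0.5):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

As the temperature approaches 0, the distribution collapses onto the top word and sampling behaves more and more like greedy search.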


    Top-K Sampling

    # sample each next word from the 50 most likely candidates only
    sample_output = model.generate(
        input_ids,
        do_sample=True,
        max_length=50,
        top_k=50
    )
    print(tokenizer.decode(sample_output[0], skip_special_tokens=True))
    """
    the japanese is so iaa.
    I mean, that's true that people are very hard at finding words in japanese. In fact it might
    be more accurate to say that Japanese means "Japanese", which is what you
    """
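Top-k sampling amounts to two steps: keep the k most probable words, renormalize their probabilities, and sample from what remains. A minimal sketch follows, with a made-up distribution and the hypothetical helpers `top_k_filter` and `sample_word`.

```python
import random


def top_k_filter(probs, k):
    """Keep the k most probable words and renormalize so they sum to 1."""
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in top)
    return {word: p / total for word, p in top}


def sample_word(probs, rng):
    """Draw one word according to its probability."""
    words = list(probs)
    return rng.choices(words, weights=[probs[w] for w in words], k=1)[0]


next_word_probs = {"sushi": 0.5, "ramen": 0.3, "tea": 0.15, "xylophone": 0.05}
filtered = top_k_filter(next_word_probs, k=2)
print(filtered)  # only "sushi" and "ramen" remain
print(sample_word(filtered, random.Random(0)))
```

Because k is fixed, the filter keeps the same number of candidates whether the distribution is sharply peaked or nearly flat.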


    Top-p (Nucleus) Sampling

    # deactivate top_k sampling and sample only from the smallest set of words
    # whose cumulative probability reaches 0.92
    sample_output = model.generate(
        input_ids,
        do_sample=True,
        max_length=50,
        top_p=0.92,
        top_k=0
    )
    print(tokenizer.decode(sample_output[0], skip_special_tokens=True))
    """
    the japanese is so!!!! Cuz of that, you forgot to put caps properly!!!!
    I'll make this shit up for those who can't usually talk with one who can PLEASE READ THE INPICIOUS and SATURN INVINC
    """
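Top-p sampling differs from top-k in that the number of candidates is not fixed: words are added in order of probability until their cumulative probability reaches `top_p`. The sketch below shows this on a made-up distribution; `top_p_filter` is a hypothetical helper, not the library's code.

```python
def top_p_filter(probs, top_p):
    """Keep the smallest set of most probable words whose total mass reaches top_p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for word, p in ranked:
        kept.append((word, p))
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    return {word: p / total for word, p in kept}  # renormalize the survivors


next_word_probs = {"sushi": 0.5, "ramen": 0.3, "tea": 0.15, "xylophone": 0.05}
print(sorted(top_p_filter(next_word_probs, top_p=0.92)))  # three words survive
print(sorted(top_p_filter(next_word_probs, top_p=0.5)))   # a single word is enough
```

The candidate set therefore shrinks when the model is confident and grows when the distribution is flat.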


    Top-K and Top-p Sampling Combined

    # combine top_k=50 with top_p=0.95 and return 3 independently sampled sequences
    sample_outputs = model.generate(
        input_ids,
        do_sample=True,
        max_length=50,
        top_k=50,
        top_p=0.95,
        num_return_sequences=3
    )
    for i, sample_output in enumerate(sample_outputs):
        print("{}: {}".format(i, tokenizer.decode(sample_output, skip_special_tokens=True)))
    """
    the japanese is so icky in mine. it's a lot like someone's salivating over peas and chocolate.
    rxtfffffff.......ooh they need Japanese drinking?? Just a thought....... Join us and talk to us about all kinds
    """

    Reference: How to generate text: using different decoding methods for language generation with Transformers (huggingface.co)

