When we discussed language models, we showed how to generate text. Building a chatbot is similar, except that we are modeling an exchange. This can make our task more complex, or actually simpler, depending on how we want to approach the problem.
In this chapter, we will discuss some ways this can be modeled, and then we will build a program that uses a generative model to take input and generate responses. First, let's talk about what discourse is.
Morphology and syntax tell us how morphemes are combined into words, and how words are combined into phrases and sentences. Combining sentences into larger acts of language is not as easy to model. There is a notion of inappropriate combinations of sentences. Let's look at some examples:
I went to the doctor, yesterday. It is just a sprained ankle.
I went to the doctor, yesterday. Mosquitoes have 47 teeth.
In the first example, the second sentence is clearly related to the first. From these two sentences, combined with common sense, we can infer that the speaker went to the doctor because of a problem with their ankle, which turned out to be a sprain. The second example does not make sense. From a linguistic point of view, sentences are generated from concepts and then encoded into words and phrases. The concepts expressed by a sequence of sentences are connected to each other, so a sequence of sentences should be tied together by related concepts. This is true whether there is one speaker or many in a conversation.
The pragmatics of a discourse is important to understanding how to model it. If we are modeling a customer service exchange, the range of responses may be limited. These limited kinds of responses are often called intents. When building a customer service chatbot, this greatly reduces the potential complexity. If we are modeling general conversation, it can become much more difficult. Language models learn what is likely to occur in a sequence, but they cannot learn to generate concepts. So our options are either to build some model of the possible sequences or to find a way to cheat.
We can cheat by building canned responses to unrecognized intents. For example, if the user makes a statement that our simple model does not expect, we can have it respond, "Sorry, I don't understand." If we are logging the conversations, we can use the exchanges that fell back to a canned response to expand the intents we cover.
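To make this fallback idea concrete, here is a minimal sketch; the intent names and the logging helper are illustrative assumptions, not code from this chapter.
- # Sketch of canned responses with a fallback for unrecognized intents.
- INTENT_RESPONSES = {
-     'greeting': 'Hello! How can I help you?',
-     'order_status': 'Let me look up your order.',
- }
- FALLBACK = "Sorry, I don't understand."
- unrecognized_log = []
-
- def respond(intent):
-     # Use the canned response if we know the intent; otherwise fall back
-     # and log the miss so new intents can be added later.
-     response = INTENT_RESPONSES.get(intent, FALLBACK)
-     if response == FALLBACK:
-         unrecognized_log.append(intent)
-     return response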
In the example we cover here, we will build a program that models a discourse outright. In essence, it is a kind of language model; the difference is in how we use it.
This chapter differs from previous chapters in that it does not use Spark. Spark is great for processing large amounts of data in batch; it is not as good in interactive applications. Also, recurrent neural networks can take a very long time to train on large amounts of data. So, in this chapter, we are working with a small data set. If you have the right hardware, you can change the NLTK processing to use Spark NLP.
We will build a story-building tool. The idea is to help someone write an original story in the style of the Grimm brothers' fairy tales. This model will be more complex than the previous language models, in the sense that it contains more parameters. The program will be a script that asks for an input sentence and generates a new one. The user then takes that sentence, modifies and corrects it, and enters it.
What is the problem we are trying to solve?
We need a system that recommends the next sentence of a story. We must also recognize the limitations of text-generation techniques, so we need to keep a user in the loop. That means we need a model that can generate relevant text, and a system that lets us review the output.
What are the constraints?
First, we need a model with two notions of context: the previous sentence and the current sentence. We do not need to worry much about performance, since this will be an interaction with a person. That may seem counterintuitive, because most interactive systems require quite low latency. However, if you consider what this program is producing, waiting one to three seconds for a response is not unreasonable.
How do we solve the problem within the constraints?
We will build a neural network for generating text, specifically an RNN, as described in Chapters 4 and 8. We could learn word embeddings in this model, but instead we will use prebuilt embeddings. This will help us train the model more quickly.
Most of the work in this project will be in developing the model. Once we have the model, we will build a simple script that we can use to write our own Grimm-style fairy tale. Once the script has been developed, the model could potentially be used to power a Twitter bot or a Slackbot.
In a real production setting for text generation, we would want to monitor the quality of the generated text. That would let us improve the output by developing more targeted training data.
Previously, we fed in a fixed-size window of characters and predicted the next character. Now we need a way to take larger pieces of text into account. There are a few options.
Many RNN architectures include a layer for learning the word embeddings. That would simply require us to learn more parameters, so we will use a pretrained GloVe model instead. Also, we will build our model at the token level, instead of at the character level as before.
We could make the window size much larger than the average sentence. This has the benefit of keeping the same model architecture. The downside is that our LSTM layer would have to maintain information over long distances. Alternatively, we can use an architecture of the kind used for machine translation.
Let's consider the concatenation approach.
The current input will be a window over the sentence, so for every window of a given sentence we use the same context vector. The benefit of this approach is the ability to extend to multiple sentences. The drawback is that the model must learn to balance near and far information. A sketch of this alternative follows.
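For illustration only, here is a minimal sketch of what the concatenation approach might look like, in the same Keras style as the model we actually build later; it assumes the imports and hyperparameters defined below and is not the architecture we use.
- # Sketch: summarize the context sentence with one LSTM, the current
- # window with another, and concatenate the two summary vectors.
- context_in = Input(shape=(max_len, dim))
- context_vec = LSTM(units)(context_in)
-
- window_in = Input(shape=(w, dim))
- window_vec = LSTM(units)(window_in)
-
- merged = Concatenate()([context_vec, window_vec])
- out = Dense(vocab_size, activation='softmax')(merged)
- alt_model = Model(inputs=[context_in, window_in], outputs=[out])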
Let's consider the stateful approach.
In this approach, we run an LSTM over the context sentence and use its final state to initialize the LSTM that reads the current window. This helps make training easier by reducing the influence of the previous sentence. That is a double-edged sword, however, because the context then gives us less information. This is the approach we will use.
- from collections import Counter
- import pickle as pkl
-
- import nltk
- import numpy as np
- import pandas as pd
-
- from keras.models import Model
- from keras.layers import Input, Embedding, LSTM, Dense, CuDNNLSTM
- from keras.layers.merge import Concatenate
- import keras.utils as ku
- import keras.preprocessing as kp
- import tensorflow as tf
- np.random.seed(1)
- tf.set_random_seed(2)
Let's also define some special tokens for the beginnings and ends of sentences, as well as for unknown tokens. Note that these strings must also exist as tokens in the GloVe file we load below, because the code keeps only tokens that have embeddings.
- START = '>'
- END = '###'
- UNK = '???'
Now we can load the data. We need to replace some special characters.
- with open('grimms_fairytales.txt', encoding='UTF-8') as fp:
-     text = fp.read()
-
- text = text\
-     .replace('\t', ' ')\
-     .replace('“', '"')\
-     .replace('”', '"')\
-     .replace('‘', "'")\
-     .replace('’', "'")
Now we can process our text into tokenized sentences.
- sentences = nltk.tokenize.sent_tokenize(text)
- sentences = [s.strip() for s in sentences]
- sentences = [[t.lower() for t in nltk.tokenize.wordpunct_tokenize(s)] for s in sentences]
- word_counts = Counter([t for s in sentences for t in s])
- word_counts = pd.Series(word_counts)
- vocab = [START, END, UNK] + list(sorted(word_counts.index))
We need to define some hyperparameters for our model:

dim is the size of the token embeddings.
w is the size of the window we will use.
max_len is the length of sentences we will use.
units is the size of the state vectors we will use for the LSTMs.

- dim = 50
- w = 10
- max_len = int(np.quantile([len(s) for s in sentences], 0.95))
- units = 200
Now let's load the GloVe embeddings.
- glove = {}
- with open('glove.6B/glove.6B.50d.txt', encoding='utf-8') as fp:
-     for line in fp:
-         token, embedding = line.split(maxsplit=1)
-         if token in vocab:
-             embedding = np.fromstring(embedding, 'f', sep=' ')
-             glove[token] = embedding
-
- vocab = list(sorted(glove.keys()))
- vocab_size = len(vocab)
We also need lookups for the one-hot-encoded outputs.
- i2t = dict(enumerate(vocab))
- t2i = {t: i for i, t in i2t.items()}
-
- token_oh = ku.to_categorical(np.arange(vocab_size))
- token_oh = {t: token_oh[i,:] for t, i in t2i.items()}
Now we can define some utility functions.
We need to pad the ends of sentences; otherwise, we will not be able to learn from the last words of a sentence.
- def pad_sentence(sentence, length):
-     sentence = sentence[:length]
-     if len(sentence) < length:
-         sentence += [END] * (length - len(sentence))
-     return sentence
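For example, padding a short sentence to length 6 appends END tokens (an illustrative call, not from the original):
- pad_sentence(['the', 'king', 'went', 'home'], 6)
- # ['the', 'king', 'went', 'home', '###', '###']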
We also need to convert sentences into matrices.
- def sent2mat(sentence, embedding):
-     mat = [embedding.get(t, embedding[UNK]) for t in sentence]
-     return np.array(mat)
We need a function that converts a sequence into a collection of sliding windows.
- def slide_seq(seq, w):
-     window = []
-     target = []
-     # each window of w tokens predicts the token that follows it
-     for i in range(len(seq) - w):
-         window.append(seq[i:i+w])
-         target.append(seq[i+w])
-     return window, target
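A quick illustrative call shows how the windows pair with their targets:
- slide_seq(['a', 'b', 'c', 'd'], 2)
- # ([['a', 'b'], ['b', 'c']], ['c', 'd'])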
Now we can build our input matrices. We will have two: one for the context and one for the current sentence.
- Xc = []
- Xi = []
- Y = []
-
- for i in range(len(sentences)-1):
-
-     context_sentence = pad_sentence(sentences[i], max_len)
-     xc = sent2mat(context_sentence, glove)
-
-     input_sentence = [START]*(w-1) + sentences[i+1] + [END]*(w-1)
-     for window, target in zip(*slide_seq(input_sentence, w)):
-         xi = sent2mat(window, glove)
-         y = token_oh.get(target, token_oh[UNK])
-
-         Xc.append(np.copy(xc))
-         Xi.append(xi)
-         Y.append(y)
-
- Xc = np.array(Xc)
- Xi = np.array(Xi)
- Y = np.array(Y)
- print('context sentence: ', xc.shape)
- print('input sentence: ', xi.shape)
- print('target sentence: ', y.shape)
- context sentence:  (42, 50)
- input sentence:  (10, 50)
- target sentence:  (4407,)
Let's build our model.
- input_c = Input(shape=(max_len, dim), dtype='float32')
- lstm_c, h, c = LSTM(units, return_state=True)(input_c)
-
- input_i = Input(shape=(w, dim), dtype='float32')
- lstm_i = LSTM(units)(input_i, initial_state=[h, c])
-
- out = Dense(vocab_size, activation='softmax')(lstm_i)
- model = Model(inputs=[input_c, input_i], outputs=[out])
- print(model.summary())
- Model: "model_1"
- __________________________________________________________________________
- Layer (type) Output Shape Param # Connected to
- ==========================================================================
- input_1 (InputLayer) (None, 42, 50) 0
- __________________________________________________________________________
- input_2 (InputLayer) (None, 10, 50) 0
- __________________________________________________________________________
- lstm_1 (LSTM) [(None, 200), (None, 200800 input_1[0][0]
- __________________________________________________________________________
- lstm_2 (LSTM) (None, 200) 200800 input_2[0][0]
- lstm_1[0][1]
- lstm_1[0][2]
- __________________________________________________________________________
- dense_1 (Dense) (None, 4407) 885807 lstm_2[0][0]
- ==========================================================================
- Total params: 1,287,407
- Trainable params: 1,287,407
- Non-trainable params: 0
- __________________________________________________________________________
- None
- model.compile(
-     loss='categorical_crossentropy', optimizer='adam',
-     metrics=['accuracy'])
Now we can train our model. Depending on your hardware, this can take about four minutes per epoch on a CPU. This is our most complex model yet, with almost 1.3 million parameters.
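The fit call itself is not shown in the text; a minimal sketch of it follows, where the batch size is an assumption (only the epoch count is implied by the output below).
- # Assumed training call; epochs=10 matches the output, batch_size is a guess.
- model.fit([Xc, Xi], Y, epochs=10, batch_size=128)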
- Epoch 1/10
- 145061/145061 [==============================] - 241s 2ms/step - loss: 3.7840 - accuracy: 0.3894
- ...
- Epoch 10/10
- 145061/145061 [==============================] - 244s 2ms/step - loss: 1.8933 - accuracy: 0.5645
Once we have trained the model, we can try generating some sentences. This function takes a context sentence and an input sentence; to start, we can simply supply a single word. The function appends tokens to the input sentence until the END token is generated or the maximum allowed length is reached.
- def generate_sentence(context_sentence, input_sentence, max_len=100):
-     context_sentence = [t.lower() for t in nltk.tokenize.wordpunct_tokenize(context_sentence)]
-     context_sentence = pad_sentence(context_sentence, max_len)
-     context_vector = sent2mat(context_sentence, glove)
-     input_sentence = [t.lower() for t in nltk.tokenize.wordpunct_tokenize(input_sentence)]
-     input_sentence = [START] * (w-1) + input_sentence
-     input_sentence = input_sentence[:w]
-     output_sentence = input_sentence
-
-     input_vector = sent2mat(input_sentence, glove)
-     predicted_vector = model.predict([[context_vector], [input_vector]])
-     predicted_token = i2t[np.argmax(predicted_vector)]
-     output_sentence.append(predicted_token)
-     i = 0
-     while predicted_token != END and i < max_len:
-         input_sentence = input_sentence[1:w] + [predicted_token]
-         input_vector = sent2mat(input_sentence, glove)
-         predicted_vector = model.predict([[context_vector], [input_vector]])
-         predicted_token = i2t[np.argmax(predicted_vector)]
-         output_sentence.append(predicted_token)
-         i += 1
-     return output_sentence
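A quick way to try it out (the context and starting word here are arbitrary, and the output will vary with training):
- print(' '.join(generate_sentence('The king had a daughter.', 'she', max_len)))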
Because we need to supply the first word of each new sentence, we can simply sample from the first words found in the corpus. Let's save this distribution of first words as JSON.
- first_words = Counter([s[0] for s in sentences])
- first_words = pd.Series(first_words)
- first_words = first_words / first_words.sum()
- first_words.to_json('grimm-first-words.json')
- with open('glove-dict.pkl', 'wb') as out:
- pkl.dump(glove, out)
- with open('vocab.pkl', 'wb') as out:
- pkl.dump(i2t, out)
Let's see what gets generated without human intervention.
- context_sentence = '''
- In old times, when wishing was having, there lived a King whose
- daughters were all beautiful, but the youngest was so beautiful that
- the sun itself, which has seen so much, was astonished whenever it
- shone in her face.
- '''.strip().replace('\n', ' ')
-
- input_sentence = np.random.choice(first_words.index, p=first_words)
-
- print(context_sentence, END)
- for _ in range(10):
-     output_sentence = generate_sentence(context_sentence, input_sentence, max_len)
-     output_sentence = ' '.join(output_sentence[w-1:-1])
-     context_sentence = output_sentence
-     input_sentence = np.random.choice(first_words.index, p=first_words)
-     print(output_sentence, END)
- In old times, when wishing was having, there lived a King whose daughters
- were all beautiful, but the youngest was so beautiful that the sun
- itself, which has seen so much, was astonished whenever it shone in her
- face. ###
- " what do you desire ??? ###
- the king ' s son , however , was still beautiful , and a little chair
- there ' s blood and so that she is alive ??? ###
- the king ' s son , however , was still beautiful , and the king ' s
- daughter was only of silver , and the king ' s son came to the forest ,
- and the king ' s son seated himself on the leg , and said , " i will go
- to church , and you shall be have lost my life ??? ###
- " what are you saying ??? ###
- cannon - maiden , and the king ' s daughter was only a looker - boy . ###
- but the king ' s daughter was humble , and said , " you are not afraid
- ??? ###
- then the king said , " i will go with you ??? ###
- " i will go with you ??? ###
- he was now to go with a long time , and the bird threw in the path , and
- the strong of them were on their of candles and bale - plants . ###
- then the king said , " i will go with you ??? ###
This model will not be passing the Turing test any time soon. That is why we need a human in the loop. Let's build our script. First, let's save our model.
- model.save('grimm-model')
Our script will need access to some of our utility functions, as well as the hyperparameters, such as dim and w.
- %%writefile fairywriter.py
- """
- This script helps you generate a fairy tale.
- """
-
- import pickle as pkl
-
- import nltk
- import numpy as np
- import pandas as pd
-
- from keras.models import load_model
- import keras.utils as ku
- import keras.preprocessing as kp
- import tensorflow as tf
-
-
- START = '>'
- END = '###'
- UNK = '???'
-
-
- FINISH_CMDS = ['finish', 'f']
- BACK_CMDS = ['back', 'b']
- QUIT_CMDS = ['quit', 'q']
- CMD_PROMPT = ' | '.join(','.join(c) for c in [FINISH_CMDS, BACK_CMDS, QUIT_CMDS])
- QUIT_PROMPT = '"{}" to quit'.format('" or "'.join(QUIT_CMDS))
- ENDING = ['THE END']
-
-
- def pad_sentence(sentence, length):
-     sentence = sentence[:length]
-     if len(sentence) < length:
-         sentence += [END] * (length - len(sentence))
-     return sentence
-
-
- def sent2mat(sentence, embedding):
-     mat = [embedding.get(t, embedding[UNK]) for t in sentence]
-     return np.array(mat)
-
-
- def generate_sentence(context_sentence, input_sentence, vocab, max_len=100, hparams=(42, 50, 10)):
-     max_len, dim, w = hparams
-     context_sentence = [t.lower() for t in nltk.tokenize.wordpunct_tokenize(context_sentence)]
-     context_sentence = pad_sentence(context_sentence, max_len)
-     context_vector = sent2mat(context_sentence, glove)
-     input_sentence = [t.lower() for t in nltk.tokenize.wordpunct_tokenize(input_sentence)]
-     input_sentence = [START] * (w-1) + input_sentence
-     input_sentence = input_sentence[:w]
-     output_sentence = input_sentence
-
-     input_vector = sent2mat(input_sentence, glove)
-     predicted_vector = model.predict([[context_vector], [input_vector]])
-     predicted_token = vocab[np.argmax(predicted_vector)]
-     output_sentence.append(predicted_token)
-     i = 0
-     while predicted_token != END and i < max_len:
-         input_sentence = input_sentence[1:w] + [predicted_token]
-         input_vector = sent2mat(input_sentence, glove)
-         predicted_vector = model.predict([[context_vector], [input_vector]])
-         predicted_token = vocab[np.argmax(predicted_vector)]
-         output_sentence.append(predicted_token)
-         i += 1
-     return output_sentence
-
-
- if __name__ == '__main__':
-     model = load_model('grimm-model')
-     (_, max_len, dim), (_, w, _) = model.get_input_shape_at(0)
-     hparams = (max_len, dim, w)
-     first_words = pd.read_json('grimm-first-words.json', typ='series')
-     with open('glove-dict.pkl', 'rb') as fp:
-         glove = pkl.load(fp)
-     with open('vocab.pkl', 'rb') as fp:
-         vocab = pkl.load(fp)
-
-     print("Let's write a story!")
-     title = input('Give me a title ({}) '.format(QUIT_PROMPT))
-     story = [title]
-     context_sentence = title
-     input_sentence = np.random.choice(first_words.index, p=first_words)
-     if title.lower() in QUIT_CMDS:
-         exit()
-
-     print(CMD_PROMPT)
-     while True:
-         input_sentence = np.random.choice(first_words.index, p=first_words)
-         generated = generate_sentence(context_sentence, input_sentence, vocab, hparams=hparams)
-         generated = ' '.join(generated)
-         ### The model makes a suggested sentence
-         print('Suggestion:', generated)
-         ### The user responds with the sentence they want to add.
-         ### They can modify the suggested sentence or write their own.
-         ### This is what will be used to make the next suggestion.
-         sentence = input('Sentence: ')
-         if sentence.lower() in QUIT_CMDS:
-             story = []
-             break
-         elif sentence.lower() in FINISH_CMDS:
-             story.append(np.random.choice(ENDING))
-             break
-         elif sentence.lower() in BACK_CMDS:
-             if len(story) == 1:
-                 print('You are at the beginning')
-                 continue
-             story = story[:-1]
-             context_sentence = story[-1]
-             continue
-         else:
-             story.append(sentence)
-             context_sentence = sentence
-
-     print('\n'.join(story))
-     print('exiting...')
Let's give our script a run. I will use it by reading the suggestions and adding elements from them to the next line. A more sophisticated model might be able to generate sentences that can be edited and added directly, but this model is not quite there.
- %run fairywriter.py
- Let's write a story!
- Give me a title ("quit" or "q" to quit) The Wolf Goes Home
- finish,f | back,b | quit,q
- Suggestion: > > > > > > > > > and when they had walked for the time , and
- the king ' s son seated himself on the leg , and said , " i will go to
- church , and you shall be have lost my life ??? ###
- Sentence: There was once a prince who got lost in the woods on the way
- to a church.
- Suggestion: > > > > > > > > > she was called hans , and as the king ' s
- daughter , who was so beautiful than the children , who was called clever
- elsie . ###
- Sentence: The prince was called Hans, and he was more handsome than the
- boys.
- Suggestion: > > > > > > > > > no one will do not know what to say , but i
- have been compelled to you ??? ###
- Sentence: The Wolf came along and asked, "does no one know where are?"
- Suggestion: > > > > > > > > > there was once a man who had a daughter who
- had three daughters , and he had a child and went , the king ' s daughter
- , and said , " you are growing and thou now , i will go and fetch
- Sentence: The Wolf had three daughters, and he said to the prince, "I
- will help you return home if you take one of my daughters as your
- betrothed."
- Suggestion: > > > > > > > > > but the king ' s daughter was humble , and
- said , " you are not afraid ??? ###
- Sentence: The prince asked, "are you not afraid that she will be killed
- as soon as we return home?"
- Suggestion: > > > > > > > > > i will go and fetch the golden horse ???
- ###
- Sentence: The Wolf said, "I will go and fetch a golden horse as dowry."
- Suggestion: > > > > > > > > > one day , the king ' s daughter , who was
- a witch , and lived in a great forest , and the clouds of earth , and in
- the evening , came to the glass mountain , and the king ' s son
- Sentence: The Wolf went to find the forest witch that she might conjure
- a golden horse.
- Suggestion: > > > > > > > > > when the king ' s daughter , however , was
- sitting on a chair , and sang and reproached , and said , " you are not
- to be my wife , and i will take you to take care of your ??? ###
- Sentence: The witch reproached the wolf saying, "you come and ask me such
- a favor with no gift yourself?"
- Suggestion: > > > > > > > > > then the king said , " i will go with you
- ??? ###
- Sentence: So the wolf said, "if you grant me this favor, I will be your
- servant."
- Suggestion: > > > > > > > > > he was now to go with a long time , and
- the other will be polluted , and we will leave you ??? ###
- Sentence: f
- The Wolf Goes Home
- There was once a prince who got lost in the woods on the way to a church.
- The prince was called Hans, and he was more handsome than the boys.
- The Wolf came along and asked, "does no one know where are?"
- The Wolf had three daughters, and he said to the prince, "I will help
- you return home if you take one of my daughters as your betrothed."
- The prince asked, "are you not afraid that she will be killed as soon as
- we return home?"
- The Wolf said, "I will go and fetch a golden horse as dowry."
- The Wolf went to find the forest witch that she might conjure a golden
- horse.
- The witch reproached the wolf saying, "you come and ask me such a favor
- with no gift yourself?"
- So the wolf said, "if you grant me this favor, I will be your servant."
- THE END
- exiting...
You can train for additional epochs to get better suggestions, but beware of overfitting. If you overfit this model, it will produce much worse results when you give it context and input that it does not recognize.
Now that we have a model we can interact with, the next step would be to integrate it with a chatbot system. Most systems require some kind of server that serves the model; the specifics depend on your chatbot platform.
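For illustration only, here is a minimal sketch of serving suggestions over HTTP with Flask. The endpoint name and payload fields are assumptions, not tied to any particular chatbot platform, and it presumes generate_sentence and its model, glove, and vocab globals have been refactored out of the __main__ block into an importable module.
- # Hypothetical serving sketch; the endpoint and field names are assumptions.
- import numpy as np
- import pandas as pd
- from flask import Flask, request, jsonify
-
- # Assumes fairywriter has been refactored so the model, glove, and vocab
- # lookups are loaded at import time rather than under __main__.
- from fairywriter import generate_sentence, vocab, hparams
-
- app = Flask(__name__)
- first_words = pd.read_json('grimm-first-words.json', typ='series')
-
- @app.route('/suggest', methods=['POST'])
- def suggest():
-     # The client sends the previous sentence; we sample a first word and
-     # return a suggested next sentence.
-     context = request.json['context']
-     first = np.random.choice(first_words.index, p=first_words)
-     sentence = generate_sentence(context, first, vocab, hparams=hparams)
-     return jsonify({'suggestion': ' '.join(sentence)})
-
- if __name__ == '__main__':
-     app.run(port=5000)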