• Deep Learning [Implementing a Chatbot with seq2seq]


    Building a Chit-chat Bot with Seq2Seq

    1. Preparing the Training Data

    Single-turn chat data is hard to come by, so we use some open datasets from GitHub to train the chit-chat model.

    Data: https://github.com/codemayq/chaotbot_corpus_Chinese

    There are two main datasets:

    1. The Xiaohuangji chat corpus: very noisy

    2. Weibo post titles and comments: relatively higher quality

    2. Processing and Saving the Data

    Since the data contains a lot of noise, we apply some basic cleaning and then save the inputs and targets in two separate files, so that line N of input is the question and line N of target is the answer.

    Later we may use either single characters as features (stored in input_word.txt) or whole words as features (input.txt).
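
    For example, one cleaned Q/A pair might be stored like this (a made-up example; tokens are space-separated):

    input.txt       (word-level):  今天 天气 真好
    output.txt      (word-level):  是 啊 出去 走走 吧
    input_word.txt  (char-level):  今 天 天 气 真 好
    output_word.txt (char-level):  是 啊 出 去 走 走 吧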

    2.1 Processing the Xiaohuangji Corpus

    import jieba
    from tqdm import tqdm

    def format_xiaohuangji_corpus(word=False):
        """Process the Xiaohuangji corpus."""
        if word:
            corpus_path = "./chatbot/corpus/xiaohuangji50w_nofenci.conv"
            input_path = "./chatbot/corpus/input_word.txt"
            output_path = "./chatbot/corpus/output_word.txt"
        else:
            corpus_path = "./chatbot/corpus/xiaohuangji50w_nofenci.conv"
            input_path = "./chatbot/corpus/input.txt"
            output_path = "./chatbot/corpus/output.txt"

        f_input = open(input_path, "a")
        f_output = open(output_path, "a")
        pair = []
        for line in tqdm(open(corpus_path), ascii=True):
            if line.strip() == "E":  # "E" separates conversations
                if not pair:
                    continue
                else:
                    assert len(pair) == 2, "each pair must have exactly 2 utterances"
                    if len(pair[0].strip()) >= 1 and len(pair[1].strip()) >= 1:
                        f_input.write(pair[0] + "\n")
                        f_output.write(pair[1] + "\n")
                    pair = []
            elif line.startswith("M"):  # "M" lines carry the utterances
                line = line[1:]
                if word:
                    pair.append(" ".join(list(line.strip())))       # character-level
                else:
                    pair.append(" ".join(jieba.cut(line.strip())))  # word-level
        f_input.close()
        f_output.close()
    
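    To produce both feature granularities, the function can simply be run twice (this assumes the corpus file from the repository above has been downloaded into ./chatbot/corpus/):

    if __name__ == '__main__':
        format_xiaohuangji_corpus(word=False)  # word-level -> input.txt / output.txt
        format_xiaohuangji_corpus(word=True)   # char-level -> input_word.txt / output_word.txt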

    2.2 Processing the Weibo Corpus

    import re
    from tqdm import tqdm

    def format_weibo(word=False):
        """
        The Weibo data contains some noise that is only partially cleaned here.
        :return:
        """
        if word:
            origin_input = "./chatbot/corpus/stc_weibo_train_post"
            input_path = "./chatbot/corpus/input_word.txt"

            origin_output = "./chatbot/corpus/stc_weibo_train_response"
            output_path = "./chatbot/corpus/output_word.txt"

        else:
            origin_input = "./chatbot/corpus/stc_weibo_train_post"
            input_path = "./chatbot/corpus/input.txt"

            origin_output = "./chatbot/corpus/stc_weibo_train_response"
            output_path = "./chatbot/corpus/output.txt"

        f_input = open(input_path, "a")
        f_output = open(output_path, "a")
        with open(origin_input) as in_o, open(origin_output) as out_o:
            for _in, _out in tqdm(zip(in_o, out_o), ascii=True):
                _in = _in.strip()
                _out = _out.strip()

                # strip trailing bracketed asides such as （...）, 「...」 or (...)
                if _in.endswith("）") or _in.endswith("」") or _in.endswith(")"):
                    _in = re.sub(r"（.*）|「.*?」|\(.*?\)", " ", _in)
                # strip location shares, links, image-size markers, hashtags and "via ..." credits
                _in = re.sub(r"我在.*?alink|alink|（.*?\d+x\d+.*?）|#|】|【|-+|_+|via.*?:*.*", " ", _in)

                _in = re.sub(r"\s+", " ", _in)
                if len(_in) < 1 or len(_out) < 1:
                    continue

                if word:
                    _in = re.sub(r"\s+", "", _in)  # collapse to a single line without spaces
                    _out = re.sub(r"\s+", "", _out)
                    if len(_in) >= 1 and len(_out) >= 1:
                        f_input.write(" ".join(list(_in)) + "\n")
                        f_output.write(" ".join(list(_out)) + "\n")
                else:
                    if len(_in) >= 1 and len(_out) >= 1:
                        f_input.write(_in.strip() + "\n")
                        f_output.write(_out.strip() + "\n")

        f_input.close()
        f_output.close()
    
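    As with the Xiaohuangji corpus, running the function once per granularity appends the Weibo pairs to the same four files (again assuming the stc_weibo_train_post/stc_weibo_train_response files are in ./chatbot/corpus/):

    if __name__ == '__main__':
        format_weibo(word=False)
        format_weibo(word=True)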

    2.3 Results After Processing

    [Figures: samples of the processed input and output files]

    3. Text Serialization and Deserialization

    As before, we need to convert text into numbers, and we also need the inverse method that converts numbers back into text.

    # word_sequence.py
    import config
    import pickle

    class Word2Sequence():
        UNK_TAG = "UNK"
        PAD_TAG = "PAD"
        SOS_TAG = "SOS"
        EOS_TAG = "EOS"

        UNK = 0
        PAD = 1
        SOS = 2
        EOS = 3

        def __init__(self):
            self.dict = {
                self.UNK_TAG: self.UNK,
                self.PAD_TAG: self.PAD,
                self.SOS_TAG: self.SOS,
                self.EOS_TAG: self.EOS
            }
            self.count = {}
            self.fited = False

        def to_index(self, word):
            """word -> index"""
            assert self.fited == True, "fit must be called first"
            return self.dict.get(word, self.UNK)

        def to_word(self, index):
            """index -> word"""
            assert self.fited, "fit must be called first"
            if index in self.inversed_dict:
                return self.inversed_dict[index]
            return self.UNK_TAG

        def __len__(self):
            return len(self.dict)

        def fit(self, sentence):
            """
            Count word occurrences in one sentence.
            :param sentence: [word1, word2, word3, ...]
            """
            for a in sentence:
                if a not in self.count:
                    self.count[a] = 0
                self.count[a] += 1

            self.fited = True

        def build_vocab(self, min_count=1, max_count=None, max_feature=None):
            """
            Build the vocabulary from the collected counts.
            :param min_count: minimum occurrence count
            :param max_count: maximum occurrence count
            :param max_feature: maximum vocabulary size
            """
            # keep only words whose counts lie between min_count and max_count
            if min_count is not None:
                self.count = {k: v for k, v in self.count.items() if v >= min_count}
            if max_count is not None:
                self.count = {k: v for k, v in self.count.items() if v <= max_count}

            # cap the total vocabulary size, keeping the most frequent words
            if isinstance(max_feature, int):
                count = sorted(list(self.count.items()), key=lambda x: x[1])
                if max_feature is not None and len(count) > max_feature:
                    count = count[-int(max_feature):]
                for w, _ in count:
                    self.dict[w] = len(self.dict)
            else:
                for w in sorted(self.count.keys()):
                    self.dict[w] = len(self.dict)

            # build the index -> word mapping
            self.inversed_dict = dict(zip(self.dict.values(), self.dict.keys()))

        def transform(self, sentence, max_len=None, add_eos=False):
            """
            Convert a sentence into a list of indices, padding/truncating to max_len.
            :param sentence:
            :param max_len:
            :return:
            """
            assert self.fited, "fit must be called first"

            r = [self.to_index(i) for i in sentence]
            if max_len is not None:
                if max_len > len(sentence):
                    if add_eos:
                        r += [self.EOS] + [self.PAD for _ in range(max_len - len(sentence) - 1)]
                    else:
                        r += [self.PAD for _ in range(max_len - len(sentence))]
                else:
                    if add_eos:
                        r = r[:max_len - 1]
                        r += [self.EOS]
                    else:
                        r = r[:max_len]
            else:
                if add_eos:
                    r += [self.EOS]
            # print(len(r), r)
            return r

        def inverse_transform(self, indices):
            """
            Convert a list of indices back into words.
            :param indices: [1, 2, 3, ...]
            :return: [word1, word2, ...]
            """
            sentence = []
            for i in indices:
                word = self.to_word(i)
                sentence.append(word)
            return sentence

    # Other modules import this fitted instance; run this file once first (see below) to build the pickle.
    try:
        word_sequence = pickle.load(open("./pkl/ws.pkl", "rb")) if not config.use_word else pickle.load(open("./pkl/ws_word.pkl", "rb"))
    except FileNotFoundError:
        word_sequence = None  # pickle not built yet



    if __name__ == '__main__':
        from tqdm import tqdm

        word_sequence = Word2Sequence()
        # word-level corpus
        input_path = "../corpus/input.txt"
        target_path = "../corpus/output.txt"
        for line in tqdm(open(input_path).readlines()):
            word_sequence.fit(line.strip().split())
        for line in tqdm(open(target_path).readlines()):
            word_sequence.fit(line.strip().split())

        # keep at most max_feature=5000 tokens
        word_sequence.build_vocab(min_count=5, max_count=None, max_feature=5000)
        print(len(word_sequence))
        pickle.dump(word_sequence, open("./pkl/ws.pkl", "wb"))
    
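    A quick round-trip sanity check of the class (a sketch with made-up toy sentences; the exact indices depend on sort order):

    ws = Word2Sequence()
    ws.fit(["你好", "吗"])
    ws.fit(["我", "很", "好"])
    ws.build_vocab(min_count=1)
    indices = ws.transform(["你好", "吗"], max_len=5, add_eos=True)
    print(indices)                        # e.g. [4, 5, 3, 1, 1]  (EOS=3, PAD=1)
    print(ws.inverse_transform(indices))  # e.g. ['你好', '吗', 'EOS', 'PAD', 'PAD']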

    4. Building the Dataset and DataLoader

    Create a dataset.py file to prepare the dataset.

    import torch
    import config
    from torch.utils.data import Dataset,DataLoader
    from word_sequence import word_sequence


    class ChatDataset(Dataset):
        def __init__(self):
            super(ChatDataset, self).__init__()

            input_path = "../corpus/input.txt"
            target_path = "../corpus/output.txt"
            if config.use_word:
                input_path = "../corpus/input_word.txt"
                target_path = "../corpus/output_word.txt"

            self.input_lines = open(input_path).readlines()
            self.target_lines = open(target_path).readlines()
            assert len(self.input_lines) == len(self.target_lines), "input and target must contain the same number of lines"

        def __getitem__(self, index):
            input = self.input_lines[index].strip().split()
            target = self.target_lines[index].strip().split()
            if len(input) == 0 or len(target) == 0:
                # fall back to the next pair if either side is empty (wrap around at the end)
                index = (index + 1) % len(self)
                input = self.input_lines[index].strip().split()
                target = self.target_lines[index].strip().split()
            # if a sentence is longer than max_len, report max_len as its length
            return input, target, min(len(input), config.max_len), min(len(target), config.max_len)

        def __len__(self):
            return len(self.input_lines)

    def collate_fn(batch):
        # 1. sort by input length, descending (required by pack_padded_sequence)
        batch = sorted(batch, key=lambda x: x[2], reverse=True)
        input, target, input_length, target_length = zip(*batch)

        # 2. pad every sequence to max_len
        input = torch.LongTensor([word_sequence.transform(i, max_len=config.max_len) for i in input])
        target = torch.LongTensor([word_sequence.transform(i, max_len=config.max_len, add_eos=True) for i in target])
        input_length = torch.LongTensor(input_length)
        target_length = torch.LongTensor(target_length)

        return input, target, input_length, target_length

    data_loader = DataLoader(dataset=ChatDataset(), batch_size=config.batch_size, shuffle=True, collate_fn=collate_fn, drop_last=True)

    if __name__ == '__main__':
        for idx, (input, target, input_length, target_length) in enumerate(data_loader):
            print(idx)
            print(input)
            print(target)
            print(input_length)
            print(target_length)
    
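    All of the modules above import a config module that is not shown in the post. A minimal sketch consistent with the fields used (the values are illustrative assumptions, not the author's settings):

    # config.py (sketch; values are assumptions)
    import torch

    use_word = False   # False: word-level corpus, True: character-level
    max_len = 20       # pad/truncate every sequence to this length
    batch_size = 256
    embedding_dim = 300
    hidden_size = 128
    dropout = 0        # GRU dropout has no effect with num_layers=1 anyway
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")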

    5. Implementing the Encoder

    import torch.nn as nn
    from word_sequence import word_sequence
    import config


    class Encoder(nn.Module):
        def __init__(self):
            super(Encoder, self).__init__()
            self.vocab_size = len(word_sequence)
            self.dropout = config.dropout
            self.embedding_dim = config.embedding_dim
            self.embedding = nn.Embedding(num_embeddings=self.vocab_size, embedding_dim=self.embedding_dim, padding_idx=word_sequence.PAD)
            # note: with num_layers=1 the GRU dropout argument has no effect
            self.gru = nn.GRU(input_size=self.embedding_dim,
                              hidden_size=config.hidden_size,
                              num_layers=1,
                              batch_first=True,
                              dropout=config.dropout)

        def forward(self, input, input_length):
            embeded = self.embedding(input)  # [batch_size, seq_len, embedding_dim]
            # lengths must live on the CPU for pack_padded_sequence
            embeded = nn.utils.rnn.pack_padded_sequence(embeded, lengths=input_length.cpu(), batch_first=True)

            out, hidden = self.gru(embeded)
            out, outputs_length = nn.utils.rnn.pad_packed_sequence(out, batch_first=True, padding_value=word_sequence.PAD)
            # out: [batch_size, seq_len, hidden_size], hidden: [1, batch_size, hidden_size]
            return out, hidden
    
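    A quick shape check on a random batch (a sketch; assumes the config values above and a fitted word_sequence):

    import torch

    encoder = Encoder()
    input = torch.randint(4, len(word_sequence), (config.batch_size, config.max_len))
    input_length = torch.LongTensor([config.max_len] * config.batch_size)  # already sorted (all equal)
    out, hidden = encoder(input, input_length)
    print(out.size())     # torch.Size([batch_size, max_len, hidden_size])
    print(hidden.size())  # torch.Size([1, batch_size, hidden_size])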

    6. Implementing the Decoder

    import torch
    import torch.nn as nn
    import config
    import random
    import torch.nn.functional as F
    from word_sequence import word_sequence

    class Decoder(nn.Module):
        def __init__(self):
            super(Decoder, self).__init__()
            self.max_seq_len = config.max_len
            self.vocab_size = len(word_sequence)
            self.embedding_dim = config.embedding_dim
            self.dropout = config.dropout

            self.embedding = nn.Embedding(num_embeddings=self.vocab_size, embedding_dim=self.embedding_dim, padding_idx=word_sequence.PAD)
            self.gru = nn.GRU(input_size=self.embedding_dim,
                              hidden_size=config.hidden_size,
                              num_layers=1,
                              batch_first=True,
                              dropout=self.dropout)

            self.fc = nn.Linear(config.hidden_size, self.vocab_size)

        def forward(self, encoder_hidden, target, target_length):
            # encoder_hidden: [1, batch_size, hidden_size]
            # target: [batch_size, seq_len]

            # start every sequence with the SOS token
            decoder_input = torch.LongTensor([[word_sequence.SOS]] * config.batch_size).to(config.device)
            decoder_outputs = torch.zeros(config.batch_size, config.max_len, self.vocab_size).to(config.device)  # [batch_size, max_len, vocab_size]

            decoder_hidden = encoder_hidden  # [1, batch_size, hidden_size]

            for t in range(config.max_len):
                decoder_output_t, decoder_hidden = self.forward_step(decoder_input, decoder_hidden)
                decoder_outputs[:, t, :] = decoder_output_t
                # greedy decoding: feed the most likely token back in as the next input
                value, index = torch.topk(decoder_output_t, 1)  # index: [batch_size, 1]
                decoder_input = index
            return decoder_outputs, decoder_hidden

        def forward_step(self, decoder_input, decoder_hidden):
            """
            :param decoder_input: [batch_size, 1]
            :param decoder_hidden: [1, batch_size, hidden_size]
            :return: out: [batch_size, vocab_size], decoder_hidden: [1, batch_size, hidden_size]
            """
            embeded = self.embedding(decoder_input)  # embeded: [batch_size, 1, embedding_dim]
            out, decoder_hidden = self.gru(embeded, decoder_hidden)  # out: [batch_size, 1, hidden_size]
            out = out.squeeze(1)  # [batch_size, hidden_size]
            out = F.log_softmax(self.fc(out), dim=-1)  # [batch_size, vocab_size]
            # print("out size:", out.size(), decoder_hidden.size())
            return out, decoder_hidden
    
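    The loop above always feeds the model's own greedy prediction back in as the next input. A common refinement is teacher forcing: with some probability, feed in the ground-truth token instead, which usually stabilizes and speeds up convergence (presumably why random is imported above). A sketch of the modified loop body; the 0.5 ratio is a hypothetical hyperparameter, not part of the original code:

    use_teacher_forcing = random.random() < 0.5  # teacher_forcing_ratio = 0.5 (assumed)
    for t in range(config.max_len):
        decoder_output_t, decoder_hidden = self.forward_step(decoder_input, decoder_hidden)
        decoder_outputs[:, t, :] = decoder_output_t
        if use_teacher_forcing:
            decoder_input = target[:, t].unsqueeze(1)  # ground-truth token: [batch_size, 1]
        else:
            value, index = torch.topk(decoder_output_t, 1)
            decoder_input = index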

    7. Assembling the seq2seq Model

    import torch
    import torch.nn as nn
    
    class Seq2Seq(nn.Module):
        def __init__(self,encoder,decoder):
            super(Seq2Seq,self).__init__()
            self.encoder = encoder
            self.decoder = decoder
    
        def forward(self, input,target,input_length,target_length):
            encoder_outputs,encoder_hidden = self.encoder(input,input_length)
            decoder_outputs,decoder_hidden = self.decoder(encoder_hidden,target,target_length)
            return decoder_outputs,decoder_hidden
    
        def evaluation(self,inputs,input_length):
            encoder_outputs,encoder_hidden = self.encoder(inputs,input_length)
            decoded_sentence = self.decoder.evaluation(encoder_hidden)
            return decoded_sentence
    
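    Seq2Seq.evaluation calls a Decoder.evaluation method that is not shown above. A minimal greedy-decoding sketch of what it could look like (an assumption, not the author's code):

    import numpy as np

    # inside Decoder (sketch): greedy decoding without a target sequence
    def evaluation(self, encoder_hidden):
        batch_size = encoder_hidden.size(1)
        decoder_input = torch.LongTensor([[word_sequence.SOS]] * batch_size).to(config.device)
        decoder_hidden = encoder_hidden
        decoded_indices = []
        for t in range(config.max_len):
            decoder_output_t, decoder_hidden = self.forward_step(decoder_input, decoder_hidden)
            value, index = torch.topk(decoder_output_t, 1)  # [batch_size, 1]
            decoder_input = index
            decoded_indices.append(index.squeeze(1).cpu().numpy())
        decoded_indices = np.array(decoded_indices).T  # [batch_size, max_len]
        return [word_sequence.inverse_transform(line) for line in decoded_indices]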

    8. The Training Loop

    To speed up training, it can run on a GPU, in which case both the tensors and the model need to be moved to a CUDA device.

    With roughly 5 million pairs, one epoch takes about 90 minutes on a GTX 1070 (8 GB), so be patient.

    import os
    import torch
    import config
    from torch import optim
    import torch.nn as nn
    from encoder import Encoder
    from decoder import Decoder
    from seq2seq import Seq2Seq
    from dataset import data_loader as train_dataloader
    from word_sequence import word_sequence

    encoder = Encoder()
    decoder = Decoder()
    model = Seq2Seq(encoder, decoder)

    # device is defined in the config module
    model.to(config.device)

    print(model)

    optimizer = optim.Adam(model.parameters())
    # resume from a checkpoint if one exists
    if os.path.exists("model/seq2seq_model.pkl"):
        model.load_state_dict(torch.load("model/seq2seq_model.pkl"))
        optimizer.load_state_dict(torch.load("model/seq2seq_optimizer.pkl"))
    criterion = nn.NLLLoss(ignore_index=word_sequence.PAD, reduction="mean")

    def get_loss(decoder_outputs, target):
        target = target.view(-1)  # [batch_size*max_len]
        decoder_outputs = decoder_outputs.view(config.batch_size * config.max_len, -1)
        return criterion(decoder_outputs, target)


    def train(epoch):
        for idx, (input, target, input_length, target_len) in enumerate(train_dataloader):
            input = input.to(config.device)
            target = target.to(config.device)
            input_length = input_length.to(config.device)
            target_len = target_len.to(config.device)

            optimizer.zero_grad()
            # decoder_outputs: [batch_size, max_len, vocab_size], target: [batch_size, max_len]
            decoder_outputs, decoder_hidden = model(input, target, input_length, target_len)
            loss = get_loss(decoder_outputs, target)
            loss.backward()
            optimizer.step()

            print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
                epoch, idx * len(input), len(train_dataloader.dataset),
                       100. * idx / len(train_dataloader), loss.item()))

            torch.save(model.state_dict(), "model/seq2seq_model.pkl")
            torch.save(optimizer.state_dict(), 'model/seq2seq_optimizer.pkl')

    if __name__ == '__main__':
        for i in range(10):
            train(i)
    

    After training for 10 epochs the loss is still quite high, as the log below shows:

    Train Epoch: 9 [2444544/4889919 (50%)]	Loss: 4.923604
    Train Epoch: 9 [2444800/4889919 (50%)]	Loss: 4.364594
    Train Epoch: 9 [2445056/4889919 (50%)]	Loss: 4.613254
    Train Epoch: 9 [2445312/4889919 (50%)]	Loss: 4.143538
    Train Epoch: 9 [2445568/4889919 (50%)]	Loss: 4.412729
    Train Epoch: 9 [2445824/4889919 (50%)]	Loss: 4.516526
    Train Epoch: 9 [2446080/4889919 (50%)]	Loss: 4.124945
    Train Epoch: 9 [2446336/4889919 (50%)]	Loss: 4.777015
    Train Epoch: 9 [2446592/4889919 (50%)]	Loss: 4.358538
    Train Epoch: 9 [2446848/4889919 (50%)]	Loss: 4.513412
    Train Epoch: 9 [2447104/4889919 (50%)]	Loss: 4.202757
    Train Epoch: 9 [2447360/4889919 (50%)]	Loss: 4.589584
    

    The results are still poor.
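
    For a quick qualitative check, a minimal interactive loop can be wrapped around Seq2Seq.evaluation (a sketch that assumes the evaluation method sketched above and word-level input via jieba):

    import torch
    import jieba
    import config
    from word_sequence import word_sequence

    def chat(model):
        model.eval()
        while True:
            sentence = input(">>> ").strip()
            if not sentence:
                break
            tokens = list(jieba.cut(sentence))
            indices = word_sequence.transform(tokens, max_len=config.max_len)
            inputs = torch.LongTensor([indices]).to(config.device)
            input_length = torch.LongTensor([min(len(tokens), config.max_len)])
            with torch.no_grad():
                decoded = model.evaluation(inputs, input_length)
            # drop special tags from the decoded words
            print("".join(w for w in decoded[0] if w not in ("SOS", "EOS", "PAD", "UNK")))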

  • Original article: https://blog.csdn.net/weixin_43923463/article/details/126642263