ESIM in Practice: Text Matching


    Introduction

    Today we are going to implement ESIM (Enhanced Sequential Inference Model) for text matching. It is a classic interaction-based matching model, and also the first model in this series to push test-set accuracy above 80%.

    Let's see how to implement it.

    Model Architecture

    (Figure: the model architecture diagram from the ESIM paper; the sequential ESIM network is shown on the left.)

    We mainly implement the ESIM network on the left.

    From bottom to top, the layers are:

    • Input Encoding
      Encodes the premise and the hypothesis.
      Converts the words in each sentence into word embeddings, producing a sequence of vectors.
      Feeds the two embedding sequences into their own Bi-LSTMs to extract semantic features.
    • Local Inference Modeling
      This is essentially an attention layer.
      It captures local interactions between the LSTM output vectors through attention,
      then constructs additional features element-wise (see the equations after this list).
    • Inference Composition
      Like the input encoding layer, this is also a Bi-LSTM.
      After the attention features between the two texts have been captured, it further fuses them and extracts deeper semantic features.
    • Prediction
      Concatenates the average-pooled and max-pooled vectors,
      then applies Softmax for classification.
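
    For reference, the local inference layer computes the following (these are the equations from the ESIM paper, where \bar{a}_i and \bar{b}_j denote the Bi-LSTM outputs of the two sentences):

    $$e_{ij} = \bar{a}_i^\top \bar{b}_j$$

    $$\tilde{a}_i = \sum_{j=1}^{\ell_b} \frac{\exp(e_{ij})}{\sum_{k=1}^{\ell_b} \exp(e_{ik})}\, \bar{b}_j \qquad \tilde{b}_j = \sum_{i=1}^{\ell_a} \frac{\exp(e_{ij})}{\sum_{k=1}^{\ell_a} \exp(e_{kj})}\, \bar{a}_i$$

    $$m_a = [\bar{a};\, \tilde{a};\, \bar{a} - \tilde{a};\, \bar{a} \odot \tilde{a}] \qquad m_b = [\bar{b};\, \tilde{b};\, \bar{b} - \tilde{b};\, \bar{b} \odot \tilde{b}]$$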

    Model Implementation

    import torch
    from torch import nn


    class ESIM(nn.Module):
        def __init__(
            self,
            vocab_size: int,
            embedding_size: int,
            hidden_size: int,
            num_classes: int,
            lstm_dropout: float = 0.1,
            dropout: float = 0.5,
        ) -> None:
            """
            Args:
                vocab_size (int): the size of the Vocabulary
                embedding_size (int): the size of each embedding vector
                hidden_size (int): the size of the hidden layer
                num_classes (int): the output size
                lstm_dropout (float, optional): dropout ratio in lstm layer. Defaults to 0.1.
                dropout (float, optional): dropout ratio in linear layer. Defaults to 0.5.
            """
            super().__init__()
            self.embedding = nn.Embedding(vocab_size, embedding_size)
            # lstm for the input embeddings; note the input size is embedding_size,
            # not hidden_size (the two happen to be equal in our configuration)
            self.lstm_a = nn.LSTM(
                embedding_size,
                hidden_size,
                batch_first=True,
                bidirectional=True,
                dropout=lstm_dropout,
            )
            self.lstm_b = nn.LSTM(
                embedding_size,
                hidden_size,
                batch_first=True,
                bidirectional=True,
                dropout=lstm_dropout,
            )
    

    First there is an embedding layer; then each of the two input sentences has its own Bi-LSTM.
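
    As a quick standalone sanity check (not part of the model code): a bidirectional LSTM with batch_first=True outputs a tensor of shape (batch_size, seq_len, 2 * hidden_size).

    import torch
    from torch import nn

    lstm = nn.LSTM(300, 300, batch_first=True, bidirectional=True)
    x = torch.randn(2, 7, 300)  # (batch_size, seq_len, embedding_size)
    out, _ = lstm(x)
    print(out.shape)  # torch.Size([2, 7, 600])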

    Next we define the inference composition layer:

        # lstm for the augmented inference vector
            self.lstm_v_a = nn.LSTM(
                8 * hidden_size,
                hidden_size,
                batch_first=True,
                bidirectional=True,
                dropout=lstm_dropout,
            )
            self.lstm_v_b = nn.LSTM(
                8 * hidden_size,
                hidden_size,
                batch_first=True,
                bidirectional=True,
                dropout=lstm_dropout,
            )
    
    

    Finally comes the prediction layer, which is a multi-layer feed-forward network:

        self.predict = nn.Sequential(
                nn.Linear(8 * hidden_size, 2 * hidden_size),
                nn.ReLU(),
                nn.Dropout(dropout),
                nn.Linear(2 * hidden_size, hidden_size),
                nn.ReLU(),
                nn.Dropout(dropout),
                nn.Linear(hidden_size, num_classes),
            )
    

    The complete implementation of the initializer is:

    class ESIM(nn.Module):
        def __init__(
            self,
            vocab_size: int,
            embedding_size: int,
            hidden_size: int,
            num_classes: int,
            lstm_dropout: float = 0.1,
            dropout: float = 0.5,
        ) -> None:
            """
            Args:
                vocab_size (int): the size of the Vocabulary
                embedding_size (int): the size of each embedding vector
                hidden_size (int): the size of the hidden layer
                num_classes (int): the output size
                lstm_dropout (float, optional): dropout ratio in lstm layer. Defaults to 0.1.
                dropout (float, optional): dropout ratio in linear layer. Defaults to 0.5.
            """
            super().__init__()
            self.embedding = nn.Embedding(vocab_size, embedding_size)
            # lstm for the input embeddings; the input size is embedding_size
            # (embedding_size and hidden_size happen to be equal in our configuration)
            self.lstm_a = nn.LSTM(
                embedding_size,
                hidden_size,
                batch_first=True,
                bidirectional=True,
                dropout=lstm_dropout,
            )
            self.lstm_b = nn.LSTM(
                embedding_size,
                hidden_size,
                batch_first=True,
                bidirectional=True,
                dropout=lstm_dropout,
            )
            # lstm for the augmented inference vector, whose last dimension is 8 * hidden_size
            self.lstm_v_a = nn.LSTM(
                8 * hidden_size,
                hidden_size,
                batch_first=True,
                bidirectional=True,
                dropout=lstm_dropout,
            )
            self.lstm_v_b = nn.LSTM(
                8 * hidden_size,
                hidden_size,
                batch_first=True,
                bidirectional=True,
                dropout=lstm_dropout,
            )

            self.predict = nn.Sequential(
                nn.Linear(8 * hidden_size, 2 * hidden_size),
                nn.ReLU(),
                nn.Dropout(dropout),
                nn.Linear(2 * hidden_size, hidden_size),
                nn.ReLU(),
                nn.Dropout(dropout),
                nn.Linear(hidden_size, num_classes),
            )
    

    ReLU is used as the activation function, and each activation is followed by a Dropout.

    The key part is the forward method:

    def forward(self, a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        """

        Args:
            a (torch.Tensor): input sequence a with shape (batch_size, a_seq_len)
            b (torch.Tensor): input sequence b with shape (batch_size, b_seq_len)

        Returns:
            torch.Tensor: the logits with shape (batch_size, num_classes)
        """
            # a (batch_size, a_seq_len, embedding_size)
            a_embed = self.embedding(a)
            # b (batch_size, b_seq_len, embedding_size)
            b_embed = self.embedding(b)
    
            # a_bar (batch_size, a_seq_len, 2 * hidden_size)
            a_bar, _ = self.lstm_a(a_embed)
            # b_bar (batch_size, b_seq_len, 2 * hidden_size)
            b_bar, _ = self.lstm_b(b_embed)
    
            # score (batch_size, a_seq_len, b_seq_len)
            score = torch.matmul(a_bar, b_bar.permute(0, 2, 1))
    
            # softmax (batch_size, a_seq_len, b_seq_len) x (batch_size, b_seq_len, 2 * hidden_size)
            # a_tilde (batch_size, a_seq_len, 2 * hidden_size)
            a_tilde = torch.matmul(torch.softmax(score, dim=2), b_bar)
            # permute (batch_size, b_seq_len, a_seq_len) x (batch_size, a_seq_len, 2 * hidden_size)
            # b_tilde (batch_size, b_seq_len, 2 * hidden_size)
            b_tilde = torch.matmul(torch.softmax(score, dim=1).permute(0, 2, 1), a_bar)
    
            # m_a (batch_size, a_seq_len, 8 * hidden_size)
            m_a = torch.cat([a_bar, a_tilde, a_bar - a_tilde, a_bar * a_tilde], dim=-1)
            # m_b (batch_size, b_seq_len, 8 * hidden_size)
            m_b = torch.cat([b_bar, b_tilde, b_bar - b_tilde, b_bar * b_tilde], dim=-1)
    
            # v_a (batch_size, a_seq_len, 2 * hidden_size)
            v_a, _ = self.lstm_v_a(m_a)
            # v_b (batch_size, b_seq_len, 2 * hidden_size)
            v_b, _ = self.lstm_v_b(m_b)
    
            # (batch_size, 2 * hidden_size)
            avg_a = torch.mean(v_a, dim=1)
            avg_b = torch.mean(v_b, dim=1)
    
            max_a, _ = torch.max(v_a, dim=1)
            max_b, _ = torch.max(v_b, dim=1)
            # (batch_size, 8 * hidden_size)
            v = torch.cat([avg_a, max_a, avg_b, max_b], dim=-1)
    
            return self.predict(v)
    
    

    With the shape of each tensor annotated, the code is fairly easy to follow.

    First, we compute the embeddings for the two inputs separately;

    then we feed them into their respective Bi-LSTM layers to get the output at every time step. Note that because the LSTMs are bidirectional, the last dimension is 2 * hidden_size;

    next, we compute the attention scores between them: score (batch_size, a_seq_len, b_seq_len);

    then we compute a's attention over b and b's attention over a. Pay attention to the dim argument of softmax here: the attended representation of a is a weighted sum over b's positions, so softmax(score, dim=2) normalizes over the b_seq_len dimension;

    then comes the enhanced local inference: concatenating the vectors with their element-wise difference and product yields a last dimension of 8 * hidden_size, which is why the input size of lstm_v_a is 8 * hidden_size;

    next, we apply average pooling and max pooling to the resulting representations of a and b and concatenate them, again giving a last dimension of 8 * hidden_size;

    finally, the prediction layer, a multi-layer feed-forward network, applies a few non-linear transformations down to the number of classes; the class with the largest value in the last dimension is taken as the prediction.
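
    As a quick smoke test (a minimal sketch assuming the ESIM class above is in scope; the sizes below are made up), we can push random token ids through the model and check the output shape:

    import torch

    model = ESIM(vocab_size=1000, embedding_size=64, hidden_size=64, num_classes=2)
    a = torch.randint(0, 1000, (4, 10))  # (batch_size, a_seq_len)
    b = torch.randint(0, 1000, (4, 12))  # (batch_size, b_seq_len)
    logits = model(a, b)
    print(logits.shape)  # torch.Size([4, 2])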

    Data Preparation

    This part is almost the same as in the previous posts of this series, so here is the code:

    from collections import defaultdict
    from tqdm import tqdm
    import numpy as np
    import json
    from torch.utils.data import Dataset
    import pandas as pd
    from typing import Tuple
    
    UNK_TOKEN = ""
    PAD_TOKEN = ""
    
    
    class Vocabulary:
        """Class to process text and extract vocabulary for mapping"""
    
        def __init__(self, token_to_idx: dict = None, tokens: list[str] = None) -> None:
            """
            Args:
                token_to_idx (dict, optional): a pre-existing map of tokens to indices. Defaults to None.
                tokens (list[str], optional): a list of unique tokens with no duplicates. Defaults to None.
            """
    
            assert any(
                [tokens, token_to_idx]
            ), "At least one of these parameters should be set as not None."
            if token_to_idx:
                self._token_to_idx = token_to_idx
            else:
                self._token_to_idx = {}
                if PAD_TOKEN not in tokens:
                    tokens = [PAD_TOKEN] + tokens
    
                for idx, token in enumerate(tokens):
                    self._token_to_idx[token] = idx
    
            self._idx_to_token = {idx: token for token, idx in self._token_to_idx.items()}
    
            self.unk_index = self._token_to_idx[UNK_TOKEN]
            self.pad_index = self._token_to_idx[PAD_TOKEN]
    
        @classmethod
        def build(
            cls,
            sentences: list[list[str]],
            min_freq: int = 2,
            reserved_tokens: list[str] = None,
        ) -> "Vocabulary":
            """Construct the Vocabulary from sentences
    
            Args:
                sentences (list[list[str]]): a list of tokenized sequences
                min_freq (int, optional): the minimum word frequency to be saved. Defaults to 2.
                reserved_tokens (list[str], optional): the reserved tokens to add into the Vocabulary. Defaults to None.
    
            Returns:
                Vocabulary: a Vocabulary instance
            """
    
            token_freqs = defaultdict(int)
            for sentence in tqdm(sentences):
                for token in sentence:
                    token_freqs[token] += 1
    
            unique_tokens = (reserved_tokens if reserved_tokens else []) + [UNK_TOKEN]
            unique_tokens += [
                token
                for token, freq in token_freqs.items()
                if freq >= min_freq and token != UNK_TOKEN
            ]
            return cls(tokens=unique_tokens)
    
        def __len__(self) -> int:
            return len(self._idx_to_token)
    
        def __getitem__(self, tokens: list[str] | str) -> list[int] | int:
            """Retrieve the indices associated with the tokens or the index with the single token
    
            Args:
                tokens (list[str] | str): a list of tokens or single token
    
            Returns:
                list[int] | int: the indices or the single index
            """
            if not isinstance(tokens, (list, tuple)):
                return self._token_to_idx.get(tokens, self.unk_index)
            return [self.__getitem__(token) for token in tokens]
    
    def lookup_token(self, indices: list[int] | int) -> list[str] | str:
        """Retrieve the tokens associated with the indices or the token associated with a single index

        Args:
            indices (list[int] | int): a list of indices or a single index
    
            Returns:
                list[str] | str: the corresponding tokens (or token)
            """
    
            if not isinstance(indices, (list, tuple)):
                return self._idx_to_token[indices]
    
            return [self._idx_to_token[index] for index in indices]
    
        def to_serializable(self) -> dict:
            """Returns a dictionary that can be serialized"""
            return {"token_to_idx": self._token_to_idx}
    
        @classmethod
        def from_serializable(cls, contents: dict) -> "Vocabulary":
            """Instantiates the Vocabulary from a serialized dictionary
    
    
            Args:
                contents (dict): a dictionary generated by `to_serializable`
    
            Returns:
                Vocabulary: the Vocabulary instance
            """
            return cls(**contents)
    
        def __repr__(self):
            return f"{len(self)})>"
    
    
    class TMVectorizer:
        """The Vectorizer which vectorizes the Vocabulary"""
    
        def __init__(self, vocab: Vocabulary, max_len: int) -> None:
            """
            Args:
                vocab (Vocabulary): maps characters to integers
                max_len (int): the max length of the sequence in the dataset
            """
            self.vocab = vocab
            self.max_len = max_len
    
        def _vectorize(
            self, indices: list[int], vector_length: int = -1, padding_index: int = 0
        ) -> np.ndarray:
            """Vectorize the provided indices
    
            Args:
                indices (list[int]): a list of integers that represent a sequence
            vector_length (int, optional): an argument for forcing the length of the index vector. Defaults to -1.
                padding_index (int, optional): the padding index to use. Defaults to 0.
    
            Returns:
                np.ndarray: the vectorized index array
            """
    
            if vector_length <= 0:
                vector_length = len(indices)
    
            vector = np.zeros(vector_length, dtype=np.int64)
            if len(indices) > vector_length:
                vector[:] = indices[:vector_length]
            else:
                vector[: len(indices)] = indices
                vector[len(indices) :] = padding_index
    
            return vector
    
        def _get_indices(self, sentence: list[str]) -> list[int]:
            """Return the vectorized sentence
    
            Args:
                sentence (list[str]): list of tokens
            Returns:
                indices (list[int]): list of integers representing the sentence
            """
            return [self.vocab[token] for token in sentence]
    
        def vectorize(
            self, sentence: list[str], use_dataset_max_length: bool = True
        ) -> np.ndarray:
            """
            Return the vectorized sequence
    
            Args:
                sentence (list[str]): raw sentence from the dataset
                use_dataset_max_length (bool): whether to use the global max vector length
            Returns:
                the vectorized sequence with padding
            """
            vector_length = -1
            if use_dataset_max_length:
                vector_length = self.max_len
    
            indices = self._get_indices(sentence)
            vector = self._vectorize(
                indices, vector_length=vector_length, padding_index=self.vocab.pad_index
            )
    
            return vector
    
        @classmethod
        def from_serializable(cls, contents: dict) -> "TMVectorizer":
            """Instantiates the TMVectorizer from a serialized dictionary
    
            Args:
                contents (dict): a dictionary generated by `to_serializable`
    
            Returns:
                TMVectorizer:
            """
            vocab = Vocabulary.from_serializable(contents["vocab"])
            max_len = contents["max_len"]
            return cls(vocab=vocab, max_len=max_len)
    
        def to_serializable(self) -> dict:
            """Returns a dictionary that can be serialized
    
            Returns:
                dict: a dict contains Vocabulary instance and max_len attribute
            """
            return {"vocab": self.vocab.to_serializable(), "max_len": self.max_len}
    
        def save_vectorizer(self, filepath: str) -> None:
            """Dump this TMVectorizer instance to file
    
            Args:
                filepath (str): the path to store the file
            """
            with open(filepath, "w") as f:
                json.dump(self.to_serializable(), f)
    
        @classmethod
        def load_vectorizer(cls, filepath: str) -> "TMVectorizer":
            """Load TMVectorizer from a file
    
            Args:
                filepath (str): the path stored the file
    
            Returns:
                TMVectorizer:
            """
            with open(filepath) as f:
                return TMVectorizer.from_serializable(json.load(f))
    
    
    class TMDataset(Dataset):
        """Dataset for text matching"""
    
        def __init__(self, text_df: pd.DataFrame, vectorizer: TMVectorizer) -> None:
            """
    
            Args:
                text_df (pd.DataFrame): a DataFrame which contains the processed data examples
                vectorizer (TMVectorizer): a TMVectorizer instance
            """
    
            self.text_df = text_df
            self._vectorizer = vectorizer
    
        def __getitem__(self, index: int) -> Tuple[np.ndarray, np.ndarray, int]:
            row = self.text_df.iloc[index]
            return (
                self._vectorizer.vectorize(row.sentence1),
                self._vectorizer.vectorize(row.sentence2),
                row.label,
            )
    
        def get_vectorizer(self) -> TMVectorizer:
            return self._vectorizer
    
        def __len__(self) -> int:
            return len(self.text_df)
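
    As a small usage sketch (the sentences below are made up for illustration), this is how the pieces fit together:

    # toy tokenized sentences
    sentences = [
        ["我", "喜欢", "打", "篮球"],
        ["我", "喜欢", "打", "羽毛球"],
        ["我", "喜欢", "打", "篮球"],
    ]
    vocab = Vocabulary.build(sentences, min_freq=2)
    vectorizer = TMVectorizer(vocab, max_len=6)
    # unseen or low-frequency tokens map to the UNK index;
    # the result is padded to max_len with the PAD index
    print(vectorizer.vectorize(["我", "喜欢", "游泳"]))  # e.g. [2 3 1 0 0 0]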
    
    

    Model Training

    Define the helper functions and metrics:

    import os

    import jieba
    import torch
    # (pandas, numpy, Tuple, tqdm, etc. are imported in the data preparation section above)
    from torch import nn
    from torch.utils.data import DataLoader


    def make_dirs(dirpath):
        """Create the directory if it does not exist."""
        if not os.path.exists(dirpath):
            os.makedirs(dirpath)


    def tokenize(sentence: str):
        """Tokenize a Chinese sentence with jieba."""
        return list(jieba.cut(sentence))
    
    
    def build_dataframe_from_csv(dataset_csv: str) -> pd.DataFrame:
        df = pd.read_csv(
            dataset_csv,
            sep="\t",
            header=None,
            names=["sentence1", "sentence2", "label"],
        )
    
        df.sentence1 = df.sentence1.apply(tokenize)
        df.sentence2 = df.sentence2.apply(tokenize)
    
        return df
    
    
    def metrics(y: torch.Tensor, y_pred: torch.Tensor) -> Tuple[float, float, float, float]:
        TP = ((y_pred == 1) & (y == 1)).sum().float()  # True Positive
        TN = ((y_pred == 0) & (y == 0)).sum().float()  # True Negative
        FN = ((y_pred == 0) & (y == 1)).sum().float()  # False Negative
        FP = ((y_pred == 1) & (y == 0)).sum().float()  # False Positive
        p = TP / (TP + FP).clamp(min=1e-8)  # Precision
        r = TP / (TP + FN).clamp(min=1e-8)  # Recall
        F1 = 2 * r * p / (r + p).clamp(min=1e-8)  # F1 score
        acc = (TP + TN) / (TP + TN + FP + FN).clamp(min=1e-8)  # Accuracy
        return acc, p, r, F1
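
    A tiny worked example for the metrics function (the labels are made up):

    y = torch.tensor([1, 0, 1, 1])
    y_pred = torch.tensor([1, 0, 0, 1])
    acc, p, r, f1 = metrics(y, y_pred)
    # TP=2, TN=1, FN=1, FP=0 -> accuracy=0.75, precision=1.0, recall=2/3, F1=0.8
    print(acc, p, r, f1)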
    
    

    Define the evaluation and training functions:

    def evaluate(
        data_iter: DataLoader, model: nn.Module
    ) -> Tuple[float, float, float, float]:
        y_list, y_pred_list = [], []
        model.eval()
        for x1, x2, y in tqdm(data_iter):
            x1 = x1.to(device).long()
            x2 = x2.to(device).long()
            y = torch.LongTensor(y).to(device)
    
            output = model(x1, x2)
    
            pred = torch.argmax(output, dim=1).long()
    
            y_pred_list.append(pred)
            y_list.append(y)
    
        y_pred = torch.cat(y_pred_list, 0)
        y = torch.cat(y_list, 0)
        acc, p, r, f1 = metrics(y, y_pred)
        return acc, p, r, f1
    
    
    def train(
        data_iter: DataLoader,
        model: nn.Module,
        criterion: nn.CrossEntropyLoss,
        optimizer: torch.optim.Optimizer,
        print_every: int = 500,
        verbose=True,
    ) -> None:
        model.train()
    
        for step, (x1, x2, y) in enumerate(tqdm(data_iter)):
            x1 = x1.to(device).long()
            x2 = x2.to(device).long()
            y = torch.LongTensor(y).to(device)
    
            output = model(x1, x2)
    
            loss = criterion(output, y)
    
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    
            if verbose and (step + 1) % print_every == 0:
                pred = torch.argmax(output, dim=1).long()
                acc, p, r, f1 = metrics(y, pred)
    
                print(
                    f" TRAIN iter={step+1} loss={loss.item():.6f} accuracy={acc:.3f} precision={p:.3f} recal={r:.3f} f1 score={f1:.4f}"
                )
    
    

    Argument definitions:

    from argparse import Namespace

    args = Namespace(
            dataset_csv="text_matching/data/lcqmc/{}.txt",
            vectorizer_file="vectorizer.json",
            model_state_file="model.pth",
            save_dir=f"{os.path.dirname(__file__)}/model_storage",
            reload_model=False,
            cuda=True,
            learning_rate=4e-4,
            batch_size=128,
            num_epochs=10,
            max_len=50,
            embedding_dim=300,
            hidden_size=300,
            num_classes=2,
            lstm_dropout=0.8,
            dropout=0.5,
            min_freq=2,
            print_every=500,
            verbose=True,
        )
    
    

    The model itself is quite simple, yet it works well and makes a strong baseline. During training, accuracy on the training set reached 100%, which suggests overfitting, so the LSTM dropout was eventually raised to 0.8. Note, however, the PyTorch warning in the log below: nn.LSTM's dropout argument only applies between stacked recurrent layers, so with num_layers=1 it effectively does nothing; one way to actually apply dropout in that case is sketched right after this paragraph.
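
    A minimal sketch of that alternative (the DropoutBiLSTM helper below is hypothetical and not part of the original training script): wrap a single-layer Bi-LSTM and apply nn.Dropout to its outputs explicitly.

    import torch
    from torch import nn


    class DropoutBiLSTM(nn.Module):
        """Hypothetical helper: a single-layer Bi-LSTM followed by explicit dropout,
        since nn.LSTM's `dropout` argument is a no-op when num_layers == 1."""

        def __init__(self, input_size: int, hidden_size: int, dropout: float) -> None:
            super().__init__()
            self.lstm = nn.LSTM(
                input_size, hidden_size, batch_first=True, bidirectional=True
            )
            self.dropout = nn.Dropout(dropout)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            output, _ = self.lstm(x)
            return self.dropout(output)

    With that caveat noted, here is the main training script, followed by the training log: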

    make_dirs(args.save_dir)
    
        if args.cuda:
            device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
        else:
            device = torch.device("cpu")
    
        print(f"Using device: {device}.")
    
        vectorizer_path = os.path.join(args.save_dir, args.vectorizer_file)
    
        train_df = build_dataframe_from_csv(args.dataset_csv.format("train"))
        test_df = build_dataframe_from_csv(args.dataset_csv.format("test"))
        dev_df = build_dataframe_from_csv(args.dataset_csv.format("dev"))
    
        if os.path.exists(vectorizer_path):
            print("Loading vectorizer file.")
            vectorizer = TMVectorizer.load_vectorizer(vectorizer_path)
            args.vocab_size = len(vectorizer.vocab)
        else:
            print("Creating a new Vectorizer.")
    
            train_sentences = train_df.sentence1.to_list() + train_df.sentence2.to_list()
    
            vocab = Vocabulary.build(train_sentences, args.min_freq)
    
            args.vocab_size = len(vocab)
    
            print(f"Builds vocabulary : {vocab}")
    
            vectorizer = TMVectorizer(vocab, args.max_len)
    
            vectorizer.save_vectorizer(vectorizer_path)
    
        train_dataset = TMDataset(train_df, vectorizer)
        test_dataset = TMDataset(test_df, vectorizer)
        dev_dataset = TMDataset(dev_df, vectorizer)
    
        train_data_loader = DataLoader(
            train_dataset, batch_size=args.batch_size, shuffle=True
        )
        dev_data_loader = DataLoader(dev_dataset, batch_size=args.batch_size)
        test_data_loader = DataLoader(test_dataset, batch_size=args.batch_size)
    
        print(f"Arguments : {args}")
        model = ESIM(
            args.vocab_size,
            args.embedding_dim,
            args.hidden_size,
            args.num_classes,
            args.lstm_dropout,
            args.dropout,
        )
    
        print(f"Model: {model}")
    
        model_saved_path = os.path.join(args.save_dir, args.model_state_file)
        if args.reload_model and os.path.exists(model_saved_path):
            model.load_state_dict(torch.load(model_saved_path))
            print("Reloaded model")
        else:
            print("New model")
    
        model = model.to(device)
    
        optimizer = torch.optim.Adam(model.parameters(), lr=args.learning_rate)
        criterion = nn.CrossEntropyLoss()
    
        for epoch in range(args.num_epochs):
            train(
                train_data_loader,
                model,
                criterion,
                optimizer,
                print_every=args.print_every,
                verbose=args.verbose,
            )
            print("Begin evalute on dev set.")
            with torch.no_grad():
                acc, p, r, f1 = evaluate(dev_data_loader, model)
    
                print(
                    f"EVALUATE [{epoch+1}/{args.num_epochs}]  accuracy={acc:.3f} precision={p:.3f} recal={r:.3f} f1 score={f1:.4f}"
                )
    
        model.eval()
    
        acc, p, r, f1 = evaluate(test_data_loader, model)
        print(f"TEST accuracy={acc:.3f} precision={p:.3f} recal={r:.3f} f1 score={f1:.4f}")
    
    
    python .\text_matching\esim\train.py
    Using device: cuda:0.
    Building prefix dict from the default dictionary ...
    Loading model from cache C:\Users\ADMINI~1\AppData\Local\Temp\jieba.cache
    Loading model cost 0.563 seconds.
    Prefix dict has been built successfully.
    Loading vectorizer file.
    Arguments : Namespace(dataset_csv='text_matching/data/lcqmc/{}.txt', vectorizer_file='vectorizer.json', model_state_file='model.pth', save_dir='D:\\workspace\\nlp-in-action\\text_matching\\esim/model_storage', reload_model=False, cuda=True, learning_rate=0.0004, batch_size=128, num_epochs=10, max_len=50, embedding_dim=300, hidden_size=300, num_classes=2, lstm_dropout=0.8, dropout=0.5, min_freq=2, print_every=500, verbose=True, vocab_size=35925)      
    D:\workspace\nlp-in-action\.venv\lib\site-packages\torch\nn\modules\rnn.py:67: UserWarning: dropout option adds dropout after all but last recurrent layer, so non-zero dropout expects num_layers greater than 1, but got dropout=0.8 and num_layers=1
      warnings.warn("dropout option adds dropout after all but last "
    Model: ESIM(
      (embedding): Embedding(35925, 300)
      (lstm_a): LSTM(300, 300, batch_first=True, dropout=0.8, bidirectional=True)
      (lstm_b): LSTM(300, 300, batch_first=True, dropout=0.8, bidirectional=True)
      (lstm_v_a): LSTM(2400, 300, batch_first=True, dropout=0.8, bidirectional=True)
      (lstm_v_b): LSTM(2400, 300, batch_first=True, dropout=0.8, bidirectional=True)
      (predict): Sequential(
        (0): Linear(in_features=2400, out_features=600, bias=True)
        (1): ReLU()
        (2): Dropout(p=0.5, inplace=False)
        (3): Linear(in_features=600, out_features=300, bias=True)
        (4): ReLU()
        (5): Dropout(p=0.5, inplace=False)
        (6): Linear(in_features=300, out_features=2, bias=True)
      )
    )
    New model
     27%|█████████████████████████████████████████████████▋                                                                                                                                        | 499/1866 [01:33<04:13,  5.40it/s] 
    TRAIN iter=500 loss=0.500601 accuracy=0.750 precision=0.714 recall=0.846 f1 score=0.7746
     54%|███████████████████████████████████████████████████████████████████████████████████████████████████▌                                                                                      | 999/1866 [03:04<02:40,  5.41it/s] 
    TRAIN iter=1000 loss=0.250350 accuracy=0.883 precision=0.907 recall=0.895 f1 score=0.9007
     80%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▌                                    | 1499/1866 [04:35<01:07,  5.43it/s] 
    TRAIN iter=1500 loss=0.311494 accuracy=0.844 precision=0.868 recall=0.868 f1 score=0.8684
    100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1866/1866 [05:42<00:00,  5.45it/s] 
    Begin evaluating on the dev set.
    100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 69/69 [00:04<00:00, 17.10it/s] 
    EVALUATE [1/10]  accuracy=0.771 precision=0.788 recall=0.741 f1 score=0.7637
    ...
    TRAIN iter=500 loss=0.005086 accuracy=1.000 precision=1.000 recall=1.000 f1 score=1.0000
    ...
    EVALUATE [10/10]  accuracy=0.815 precision=0.807 recall=0.829 f1 score=0.8178
    100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 98/98 [00:05<00:00, 17.38it/s] 
    TEST accuracy=0.817 precision=0.771 recall=0.903 f1 score=0.8318
    

    As we can see, the training loss dropped to the 0.005 level for the first time in this series, and training accuracy even reached 100%. The final test-set accuracy, at 81.7%, is still noticeably lower than on the training set, but it is a clear improvement over the previous models.

    Complete Code

    https://github.com/nlp-greyfoss/nlp-in-action-public/tree/master/text_matching/esim
