Dynamic topic modeling with BERTopic for Chinese long texts: whole-text SentenceTransformer (BERT) features, mean-pooled chunk embeddings, and keywords from word segmentation

Topic models and BERTopic

The most widely used topic-model algorithm is LDA (Latent Dirichlet Allocation), but LDA has a number of drawbacks, for example:

1. LDA requires the number of topics as an input and is highly sensitive to this value;
2. LDA suffers from the long-tail problem and performs poorly on datasets with many low-frequency words;
3. LDA considers only word frequencies and ignores the relationships between words;
4. LDA does not model time, which makes it hard to apply to dynamic topic modeling tasks.

To address these problems, methods such as DTM, ETM, DETM, and BERTopic have been proposed. BERTopic is a recent method that has attracted a lot of attention. Its main idea is to compute a BERT feature vector for each document as a whole, cluster the document vectors in embedding space to obtain topics, extract each topic's keywords with a TF-IDF-style model, and finally derive each topic's keyword representation for every time slice (a minimal sketch of this pipeline is shown below).
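For orientation, here is a minimal sketch of that standard pipeline with the unmodified libraries. The document list, timestamps, and model name are illustrative placeholders, and the exact topics_over_time() signature can differ between BERTopic versions.

    # Minimal sketch of the standard (unmodified) BERTopic dynamic-topic pipeline.
    # docs, timestamps, and the model name are illustrative placeholders.
    from bertopic import BERTopic
    from sentence_transformers import SentenceTransformer

    docs = ["first document ...", "second document ..."]   # one string per document
    timestamps = ["2021-01", "2021-02"]                     # one timestamp per document

    # 1. Whole-document BERT features via SentenceTransformer
    embedding_model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

    # 2. Cluster the document embeddings and extract per-topic keywords (c-TF-IDF)
    topic_model = BERTopic(embedding_model=embedding_model)
    topics, probs = topic_model.fit_transform(docs)

    # 3. Keyword representation of each topic per time slice (dynamic topics);
    #    the argument order of topics_over_time() varies across BERTopic versions.
    topics_over_time = topic_model.topics_over_time(docs, timestamps)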
However, BERTopic itself also has a few problems:

1. BERTopic was designed for English and does not directly fit Chinese. English needs no word segmentation, since words are naturally separated by spaces: BERTopic extracts BERT features from the raw English text and then finds each topic's keywords over the space-separated words, which is convenient. Chinese, however, must be segmented first, so if features are extracted from the whole Chinese text, each topic's keywords have to be extracted from the Chinese word-segmentation results instead;
2. Because the features are BERT features, and BERT limits the input length to 512 tokens, longer text is truncated. BERTopic simply truncates the input, which is not a very good solution and is unfriendly to long documents.

To address these two problems, this post makes two changes, one for each:

Extract features from the whole text, extract keywords from the segmentation results

The change is simple: when calling topic_model.fit_transform(), pass in both the raw texts and the segmented texts (with stop words removed), and modify the source in _bertopic.py, mainly the fit_transform() function. A sketch of the same idea, without patching the library, is given below.
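The post patches _bertopic.py directly; purely as an illustration of the same intent within BERTopic's public API, the sketch below precomputes embeddings from the raw texts and hands the space-joined segmentation results to fit_transform() as the documents. The jieba tokenizer, the stop-word set, and the model name are assumptions made for the example, not part of the original post.

    # Sketch only: same idea as the patch, expressed through the public API.
    # Embeddings come from the raw texts; keyword extraction sees the
    # space-joined, stop-word-filtered segmentation results.
    # jieba, the stop-word set, and the model name are illustrative assumptions.
    import jieba
    from bertopic import BERTopic
    from sentence_transformers import SentenceTransformer

    raw_docs = ["第一篇长文档……", "第二篇长文档……"]    # raw Chinese documents
    stopwords = {"的", "了", "是"}                      # example stop-word list

    # Segment and join with spaces so BERTopic's vectorizer can split on
    # whitespace, just as it does for English.
    seg_docs = [
        " ".join(w for w in jieba.lcut(doc) if w.strip() and w not in stopwords)
        for doc in raw_docs
    ]

    # Whole-text features come from the raw documents ...
    embedding_model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
    embeddings = embedding_model.encode(raw_docs, convert_to_numpy=True)

    # ... while topic keywords are extracted from the segmented documents.
    topic_model = BERTopic()
    topics, probs = topic_model.fit_transform(seg_docs, embeddings=embeddings)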

Extract BERT features for every 512-character chunk of the text, then average them as the document feature

This change is also simple: reading the source shows that feature extraction happens in the encode() function in SentenceTransformer.py of the sentence_transformers package, so that function is modified as follows:

        def encode(self, sentences: Union[str, List[str]],
                   batch_size: int = 1,
                   show_progress_bar: bool = None,
                   output_value: str = 'sentence_embedding',
                   convert_to_numpy: bool = True,
                   convert_to_tensor: bool = False,
                   device: str = None,
                   normalize_embeddings: bool = False) -> Union[List[Tensor], ndarray, Tensor]:
            """
            Computes sentence embeddings
            :param sentences: the sentences to embed
            :param batch_size: the batch size used for the computation
            :param show_progress_bar: Output a progress bar when encode sentences
            :param output_value:  Default sentence_embedding, to get sentence embeddings. Can be set to token_embeddings to get wordpiece token embeddings. Set to None, to get all output values
            :param convert_to_numpy: If true, the output is a list of numpy vectors. Else, it is a list of pytorch tensors.
            :param convert_to_tensor: If true, you get one large tensor as return. Overwrites any setting from convert_to_numpy
            :param device: Which torch.device to use for the computation
            :param normalize_embeddings: If set to true, returned vectors will have length 1. In that case, the faster dot-product (util.dot_score) instead of cosine similarity can be used.
            :return:
               By default, a list of tensors is returned. If convert_to_tensor, a stacked tensor is returned. If convert_to_numpy, a numpy matrix is returned.
            """
            self.eval()
            if show_progress_bar is None:
                show_progress_bar = (logger.getEffectiveLevel()==logging.INFO or logger.getEffectiveLevel()==logging.DEBUG)
    
            if convert_to_tensor:
                convert_to_numpy = False
    
            if output_value != 'sentence_embedding':
                convert_to_tensor = False
                convert_to_numpy = False
    
            input_was_string = False
            if isinstance(sentences, str) or not hasattr(sentences, '__len__'): #Cast an individual sentence to a list with length 1
                sentences = [sentences]
                input_was_string = True
    
            if device is None:
                device = self._target_device
    
            self.to(device)
    
            all_embeddings = []
            length_sorted_idx = np.argsort([-self._text_length(sen) for sen in sentences])
            sentences_sorted = [sentences[idx] for idx in length_sorted_idx]
            maxworklength = 512 # extract features for at most maxworklength characters at a time
            for start_index in trange(0, len(sentences), batch_size, desc="Batches", disable=False):
                # sentences_batch = sentences_sorted[start_index:start_index+batch_size] # original code: sentences_batch held batch_size texts
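                # Modification: each iteration handles a single document (the default
                # batch_size is set to 1 above), splits it into chunks of at most
                # maxworklength characters, encodes all chunks in one forward pass,
                # and finally averages the chunk embeddings into one document vector.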
    
                tempsentence = sentences_sorted[start_index]
                sentence_length = len(tempsentence)
                if sentence_length%maxworklength:
                    numofclip = sentence_length//maxworklength+1
                else:
                    numofclip = sentence_length//maxworklength
                if sentence_length:
                    features = self.tokenize([tempsentence[clipi*maxworklength:(clipi+1)*maxworklength] for clipi in range(numofclip)])
                else:
                    features = self.tokenize([''])
                features = batch_to_device(features, device)
    
                with torch.no_grad():
                    out_features = self.forward(features)
    
                    if output_value == 'token_embeddings':
                        embeddings = []
                        for token_emb, attention in zip(out_features[output_value], out_features['attention_mask']):
                            last_mask_id = len(attention)-1
                            while last_mask_id > 0 and attention[last_mask_id].item() == 0:
                                last_mask_id -= 1
    
                            embeddings.append(token_emb[0:last_mask_id+1])
                    elif output_value is None:  #Return all outputs
                        embeddings = []
                        for sent_idx in range(len(out_features['sentence_embedding'])):
                            row =  {name: out_features[name][sent_idx] for name in out_features}
                            embeddings.append(row)
                    else:   #Sentence embeddings
                        embeddings = out_features[output_value]
                        embeddings = embeddings.detach()
                        if normalize_embeddings:
                            embeddings = torch.nn.functional.normalize(embeddings, p=2, dim=1)
    
                        # fixes for #522 and #487 to avoid oom problems on gpu with large datasets
                        if convert_to_numpy:
                            embeddings = embeddings.cpu() # shape: [num_chunks, hidden_size]
                    # all_embeddings.extend(np.average(embeddings, axis=0))
                    all_embeddings.append(np.average(embeddings, axis=0).tolist())
    
            all_embeddings = [all_embeddings[idx] for idx in np.argsort(length_sorted_idx)]
    
            # if convert_to_tensor:
            #     all_embeddings = torch.stack(all_embeddings)
            # elif convert_to_numpy:
            #     all_embeddings = np.asarray([emb.numpy() for emb in all_embeddings])
    
            # if input_was_string:
            #     all_embeddings = all_embeddings[0]
    
            # ans = np.mean(np.array(all_embeddings), axis=0).tolist()
    
            return np.array(all_embeddings)
    

The tokenization used for feature extraction here is carried out by the tokenize() function in sentence_transformers/models/Transformer.py.
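For comparison, the same chunk-and-average idea can also be sketched without patching the library at all: split each document into 512-character chunks, embed the chunks with the stock encode(), and mean-pool the results before handing them to BERTopic as precomputed embeddings. This is only an illustration under the same assumptions as above (character-based chunking, example model name), not the patch itself.

    # Illustration only: chunk-and-average with the unmodified encode().
    # Chunking is by characters (512), mirroring the patch above.
    import numpy as np
    from sentence_transformers import SentenceTransformer

    def embed_long_docs(docs, model, max_chars=512):
        doc_vectors = []
        for doc in docs:
            # split into chunks of at most max_chars characters (at least one chunk)
            chunks = [doc[i:i + max_chars] for i in range(0, len(doc), max_chars)] or [""]
            chunk_embs = model.encode(chunks, convert_to_numpy=True)  # [num_chunks, dim]
            doc_vectors.append(chunk_embs.mean(axis=0))               # mean over chunks
        return np.vstack(doc_vectors)

    model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
    embeddings = embed_long_docs(raw_docs, model)  # raw_docs as in the earlier sketch
    # These vectors can then be passed to BERTopic:
    # topic_model.fit_transform(seg_docs, embeddings=embeddings)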
Done.

Original post: https://blog.csdn.net/qq_30565883/article/details/128126611