Dynamic topic modeling with BERTopic for Chinese long texts: whole-text SentenceTransformer (BERT) features, mean-pooled chunk embeddings, and keywords from word segmentation

Topic models and BERTopic

The most widely used topic-model algorithm is LDA (Latent Dirichlet Allocation), but LDA has a number of drawbacks, for example:

1. LDA requires the number of topics as an input and is highly sensitive to this value;
2. LDA suffers from the long-tail problem and performs poorly on datasets with many low-frequency words;
3. LDA considers only word frequencies and ignores the relationships between words;
4. LDA does not model time, which makes it hard to apply to dynamic topic modeling tasks.

To address these problems, methods such as DTM, ETM, DETM, and BERTopic have been proposed. BERTopic is a recent method that has attracted a lot of attention. Its main idea is to compute a BERT feature vector for each document as a whole, cluster the document vectors in embedding space to obtain topics, extract each topic's keywords with a TF-IDF-style model, and finally derive each topic's keyword representation for every time slice (a minimal sketch of this pipeline is shown below).
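For orientation, here is a minimal sketch of that standard pipeline with the unmodified libraries. The document list, timestamps, and model name are illustrative placeholders, and the exact topics_over_time() signature can differ between BERTopic versions.

    # Minimal sketch of the standard (unmodified) BERTopic dynamic-topic pipeline.
    # docs, timestamps, and the model name are illustrative placeholders.
    from bertopic import BERTopic
    from sentence_transformers import SentenceTransformer

    docs = ["first document ...", "second document ..."]   # one string per document
    timestamps = ["2021-01", "2021-02"]                     # one timestamp per document

    # 1. Whole-document BERT features via SentenceTransformer
    embedding_model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

    # 2. Cluster the document embeddings and extract per-topic keywords (c-TF-IDF)
    topic_model = BERTopic(embedding_model=embedding_model)
    topics, probs = topic_model.fit_transform(docs)

    # 3. Keyword representation of each topic per time slice (dynamic topics);
    #    the argument order of topics_over_time() varies across BERTopic versions.
    topics_over_time = topic_model.topics_over_time(docs, timestamps)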
However, BERTopic itself also has a few problems:

1. BERTopic was designed for English and does not directly fit Chinese. English needs no word segmentation, since words are naturally separated by spaces: BERTopic extracts BERT features from the raw English text and then finds each topic's keywords over the space-separated words, which is convenient. Chinese, however, must be segmented first, so if features are extracted from the whole Chinese text, each topic's keywords have to be extracted from the Chinese word-segmentation results instead;
2. Because the features are BERT features, and BERT limits the input length to 512 tokens, longer text is truncated. BERTopic simply truncates the input, which is not a very good solution and is unfriendly to long documents.

To address these two problems, this post makes two changes, one for each:

Extract features from the whole text, extract keywords from the segmentation results

The change is simple: when calling topic_model.fit_transform(), pass in both the raw texts and the segmented texts (with stop words removed), and modify the source in _bertopic.py, mainly the fit_transform() function. A sketch of the same idea, without patching the library, is given below.
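The post patches _bertopic.py directly; purely as an illustration of the same intent within BERTopic's public API, the sketch below precomputes embeddings from the raw texts and hands the space-joined segmentation results to fit_transform() as the documents. The jieba tokenizer, the stop-word set, and the model name are assumptions made for the example, not part of the original post.

    # Sketch only: same idea as the patch, expressed through the public API.
    # Embeddings come from the raw texts; keyword extraction sees the
    # space-joined, stop-word-filtered segmentation results.
    # jieba, the stop-word set, and the model name are illustrative assumptions.
    import jieba
    from bertopic import BERTopic
    from sentence_transformers import SentenceTransformer

    raw_docs = ["第一篇长文档……", "第二篇长文档……"]    # raw Chinese documents
    stopwords = {"的", "了", "是"}                      # example stop-word list

    # Segment and join with spaces so BERTopic's vectorizer can split on
    # whitespace, just as it does for English.
    seg_docs = [
        " ".join(w for w in jieba.lcut(doc) if w.strip() and w not in stopwords)
        for doc in raw_docs
    ]

    # Whole-text features come from the raw documents ...
    embedding_model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
    embeddings = embedding_model.encode(raw_docs, convert_to_numpy=True)

    # ... while topic keywords are extracted from the segmented documents.
    topic_model = BERTopic()
    topics, probs = topic_model.fit_transform(seg_docs, embeddings=embeddings)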

Extract BERT features for every 512-character chunk of the text, then average them as the document feature

This change is also simple: reading the source shows that feature extraction happens in the encode() function in SentenceTransformer.py of the sentence_transformers package, so that function is modified as follows:

        def encode(self, sentences: Union[str, List[str]],
                   batch_size: int = 1,
                   show_progress_bar: bool = None,
                   output_value: str = 'sentence_embedding',
                   convert_to_numpy: bool = True,
                   convert_to_tensor: bool = False,
                   device: str = None,
                   normalize_embeddings: bool = False) -> Union[List[Tensor], ndarray, Tensor]:
            """
            Computes sentence embeddings
            :param sentences: the sentences to embed
            :param batch_size: the batch size used for the computation
            :param show_progress_bar: Output a progress bar when encode sentences
            :param output_value:  Default sentence_embedding, to get sentence embeddings. Can be set to token_embeddings to get wordpiece token embeddings. Set to None, to get all output values
            :param convert_to_numpy: If true, the output is a list of numpy vectors. Else, it is a list of pytorch tensors.
            :param convert_to_tensor: If true, you get one large tensor as return. Overwrites any setting from convert_to_numpy
            :param device: Which torch.device to use for the computation
            :param normalize_embeddings: If set to true, returned vectors will have length 1. In that case, the faster dot-product (util.dot_score) instead of cosine similarity can be used.
            :return:
               By default, a list of tensors is returned. If convert_to_tensor, a stacked tensor is returned. If convert_to_numpy, a numpy matrix is returned.
            """
            self.eval()
            if show_progress_bar is None:
                show_progress_bar = (logger.getEffectiveLevel()==logging.INFO or logger.getEffectiveLevel()==logging.DEBUG)
    
            if convert_to_tensor:
                convert_to_numpy = False
    
            if output_value != 'sentence_embedding':
                convert_to_tensor = False
                convert_to_numpy = False
    
            input_was_string = False
            if isinstance(sentences, str) or not hasattr(sentences, '__len__'): #Cast an individual sentence to a list with length 1
                sentences = [sentences]
                input_was_string = True
    
            if device is None:
                device = self._target_device
    
            self.to(device)
    
            all_embeddings = []
            length_sorted_idx = np.argsort([-self._text_length(sen) for sen in sentences])
            sentences_sorted = [sentences[idx] for idx in length_sorted_idx]
            maxworklength = 512 # extract features for at most maxworklength characters at a time
            for start_index in trange(0, len(sentences), batch_size, desc="Batches", disable=False):
                # sentences_batch = sentences_sorted[start_index:start_index+batch_size] # original code: sentences_batch held batch_size texts
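                # Modification: each iteration handles a single document (the default
                # batch_size is set to 1 above), splits it into chunks of at most
                # maxworklength characters, encodes all chunks in one forward pass,
                # and finally averages the chunk embeddings into one document vector.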
    
                tempsentence = sentences_sorted[start_index]
                sentence_length = len(tempsentence)
                if sentence_length%maxworklength:
                    numofclip = sentence_length//maxworklength+1
                else:
                    numofclip = sentence_length//maxworklength
                if sentence_length:
                    features = self.tokenize([tempsentence[clipi*maxworklength:(clipi+1)*maxworklength] for clipi in range(numofclip)])
                else:
                    features = self.tokenize([''])
                features = batch_to_device(features, device)
    
                with torch.no_grad():
                    out_features = self.forward(features)
    
                    if output_value == 'token_embeddings':
                        embeddings = []
                        for token_emb, attention in zip(out_features[output_value], out_features['attention_mask']):
                            last_mask_id = len(attention)-1
                            while last_mask_id > 0 and attention[last_mask_id].item() == 0:
                                last_mask_id -= 1
    
                            embeddings.append(token_emb[0:last_mask_id+1])
                    elif output_value is None:  #Return all outputs
                        embeddings = []
                        for sent_idx in range(len(out_features['sentence_embedding'])):
                            row =  {name: out_features[name][sent_idx] for name in out_features}
                            embeddings.append(row)
                    else:   #Sentence embeddings
                        embeddings = out_features[output_value]
                        embeddings = embeddings.detach()
                        if normalize_embeddings:
                            embeddings = torch.nn.functional.normalize(embeddings, p=2, dim=1)
    
                        # fixes for #522 and #487 to avoid oom problems on gpu with large datasets
                        if convert_to_numpy:
                            embeddings = embeddings.cpu() # shape: [num_chunks, hidden_size]
                    # all_embeddings.extend(np.average(embeddings, axis=0))
                    all_embeddings.append(np.average(embeddings, axis=0).tolist())
    
            all_embeddings = [all_embeddings[idx] for idx in np.argsort(length_sorted_idx)]
    
            # if convert_to_tensor:
            #     all_embeddings = torch.stack(all_embeddings)
            # elif convert_to_numpy:
            #     all_embeddings = np.asarray([emb.numpy() for emb in all_embeddings])
    
            # if input_was_string:
            #     all_embeddings = all_embeddings[0]
    
            # ans = np.mean(np.array(all_embeddings), axis=0).tolist()
    
            return np.array(all_embeddings)
    

The tokenization used for feature extraction here is carried out by the tokenize() function in sentence_transformers/models/Transformer.py.
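For comparison, the same chunk-and-average idea can also be sketched without patching the library at all: split each document into 512-character chunks, embed the chunks with the stock encode(), and mean-pool the results before handing them to BERTopic as precomputed embeddings. This is only an illustration under the same assumptions as above (character-based chunking, example model name), not the patch itself.

    # Illustration only: chunk-and-average with the unmodified encode().
    # Chunking is by characters (512), mirroring the patch above.
    import numpy as np
    from sentence_transformers import SentenceTransformer

    def embed_long_docs(docs, model, max_chars=512):
        doc_vectors = []
        for doc in docs:
            # split into chunks of at most max_chars characters (at least one chunk)
            chunks = [doc[i:i + max_chars] for i in range(0, len(doc), max_chars)] or [""]
            chunk_embs = model.encode(chunks, convert_to_numpy=True)  # [num_chunks, dim]
            doc_vectors.append(chunk_embs.mean(axis=0))               # mean over chunks
        return np.vstack(doc_vectors)

    model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
    embeddings = embed_long_docs(raw_docs, model)  # raw_docs as in the earlier sketch
    # These vectors can then be passed to BERTopic:
    # topic_model.fit_transform(seg_docs, embeddings=embeddings)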
Done.

Original post: https://blog.csdn.net/qq_30565883/article/details/128126611