• Kaggle Patent Phrase Matching Competition: Gold Medal Solution Writeup


    Competition Overview

    In the patent phrase matching dataset, competitors are given two short phrases, an anchor and a target, and must output their similarity under a given semantic context, on a scale from 0 to 1. Our team id is xlyhq; we finished rank 13 on the public (A) leaderboard and rank 12 on the private (B) leaderboard. Many thanks to my four teammates @heng zheng, @pythonlan, @leolu1998 and @syzong for their effort; in the end we were lucky enough to grab a gold medal.

    Our core ideas are similar to those of the other top teams, so here we mainly share our competition journey, the concrete results of the related experiments, and some of the more interesting things we tried.

    Text Processing

    The dataset mainly contains the anchor, target and context fields, plus extra text that can be concatenated. During the competition we mainly experimented with the following concatenation schemes:

    • v1: test['anchor'] + '[SEP]' + test['target'] + '[SEP]' + test['context_text']
    • v2: test['anchor'] + '[SEP]' + test['target'] + '[SEP]' + test['context'] + '[SEP]' + test['context_text'], i.e. additionally appending the raw code such as A47
    • v3: test['text'] = test['anchor'] + '[SEP]' + test['target'] + '[SEP]' + test['context'] + '[SEP]' + test['context_text'], pulling in more text to concatenate, i.e. also appending the subclasses under A47 such as A47B and A47C; the context_text here is built from the CPC titles as in the code below
    import re

    import pandas as pd
    from tqdm import tqdm

    # Titles of the top-level CPC sections
    context_mapping = {
        "A": "Human Necessities",
        "B": "Operations and Transport",
        "C": "Chemistry and Metallurgy",
        "D": "Textiles",
        "E": "Fixed Constructions",
        "F": "Mechanical Engineering",
        "G": "Physics",
        "H": "Electricity",
        "Y": "Emerging Cross-Sectional Technologies",
    }

    titles = pd.read_csv('./input/cpc-codes/titles.csv')


    def process(text):
        # strip bracketed remarks such as "(...)", "{...}" and "[...]" from the CPC titles
        return re.sub(u"\\(.*?\\)|\\{.*?}|\\[.*?]", "", text)


    def get_context(cpc_code):
        # collect the titles of all codes of length <= 4 that contain this context code,
        # e.g. for A47 this also picks up the subclasses A47B, A47C, ...
        cpc_data = titles[(titles['code'].map(len) <= 4) & (titles['code'].str.contains(cpc_code))]
        texts = cpc_data['title'].values.tolist()
        texts = [process(text) for text in texts]
        return ";".join([context_mapping[cpc_code[0]]] + texts)


    def get_cpc_texts():
        cpc_texts = dict()
        for code in tqdm(train['context'].unique()):
            cpc_texts[code] = get_context(code)
        return cpc_texts


    cpc_texts = get_cpc_texts()
    
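    With cpc_texts built, the context_text column used in v1 to v3 can be derived by mapping each context code; the sketch below shows roughly how, assuming the test contexts are all covered by the codes seen in train (the exact place this happens in our pipeline may differ):

    train['context_text'] = train['context'].map(cpc_texts)
    test['context_text'] = test['context'].map(cpc_texts)

    # v3-style concatenation
    train['text'] = train['anchor'] + '[SEP]' + train['target'] + '[SEP]' + train['context'] + '[SEP]' + train['context_text']
    test['text'] = test['anchor'] + '[SEP]' + test['target'] + '[SEP]' + test['context'] + '[SEP]' + test['context_text']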

    This concatenation brings a sizeable improvement, but the text becomes much longer (we raised the maximum sequence length to 300), which makes training slower.
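    For reference, a minimal sketch of how the longer concatenated text can be tokenized with the larger maximum length; the backbone name and max_len=300 mirror the setting described above, everything else is illustrative:

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained('microsoft/deberta-v3-large')

    def prepare_input(text, max_len=300):
        # truncate/pad the concatenated text to the (now larger) maximum length
        return tokenizer(text,
                         max_length=max_len,
                         padding='max_length',
                         truncation=True,
                         return_tensors='pt')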

    • v4: the key concatenation: test['text'] = test['text'] + '[SEP]' + test['target_info'], where target_info collects all targets that share the same anchor and context (see the code below)
    # concatenate target info: all targets that share the same (anchor, context) pair
    test['text'] = test['anchor'] + '[SEP]' + test['target'] + '[SEP]' + test['context_text']
    target_info = test.groupby(['anchor', 'context'])['target'].agg(list).reset_index()
    target_info['target'] = target_info['target'].apply(lambda x: list(set(x)))
    target_info['target_info'] = target_info['target'].apply(lambda x: ', '.join(x))
    # sanity check: distribution of how many targets each (anchor, context) group contains
    target_info['target_info'].apply(lambda x: len(x.split(', '))).describe()

    del target_info['target']
    test = test.merge(target_info, on=['anchor', 'context'], how='left')
    test['text'] = test['text'] + '[SEP]' + test['target_info']
    test.head()
    

    This concatenation gave a large boost to both CV and LB. Comparing v3 with v4, we found that concatenating higher-quality text helps the model more: v3 adds a lot of redundant information, while v4 adds many entity-level key pieces of information.

    Data Splitting

    During the competition we tried several data splitting strategies, including:

    • StratifiedGroupKFold: with this split the gap between CV and LB is small, and the score is slightly better (see the sketch after this list)
    • StratifiedKFold: higher offline CV
    • other KFold and GroupKFold variants did not work well
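    A minimal sketch of the StratifiedGroupKFold split, assuming the usual setup for this competition (stratify on the score, group by anchor); our actual fold assignment may have differed in details:

    from sklearn.model_selection import StratifiedGroupKFold

    sgkf = StratifiedGroupKFold(n_splits=5, shuffle=True, random_state=42)
    train['fold'] = -1
    for fold, (_, val_idx) in enumerate(sgkf.split(train,
                                                   y=train['score'],
                                                   groups=train['anchor'])):
        # keep all rows of an anchor in the same fold while balancing the score distribution
        train.loc[val_idx, 'fold'] = fold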

    Loss Functions

    The main loss functions we considered were:

    • BCE: nn.BCEWithLogitsLoss(reduction="mean")
    • MSE: nn.MSELoss()
    • Mixture loss: MSECorrLoss (MSE plus a Pearson-correlation term), shown below
    import torch
    import torch.nn as nn


    class CorrLoss(nn.Module):
        """
        Uses 1 - Pearson correlation coefficient between the network output and the target as the loss.
        input (o, t):
            o: tensor of size (batch_size, 1), output of the network
            t: tensor of size (batch_size, 1), target value
        output (corr):
            corr: tensor of size (1)
        """
        def __init__(self):
            super(CorrLoss, self).__init__()

        def forward(self, o, t):
            assert o.size() == t.size()
            # compute z-scores for o and t
            o_m = o.mean(dim=0)
            o_s = o.std(dim=0)
            o_z = (o - o_m) / o_s

            t_m = t.mean(dim=0)
            t_s = t.std(dim=0)
            t_z = (t - t_m) / t_s

            # the correlation between o and t is the mean of the product of their z-scores
            tmp = o_z * t_z
            corr = tmp.mean(dim=0)
            return 1 - corr


    class MSECorrLoss(nn.Module):
        def __init__(self, p=1.5):
            super(MSECorrLoss, self).__init__()
            self.p = p
            self.mseLoss = nn.MSELoss()
            self.corrLoss = CorrLoss()

        def forward(self, o, t):
            mse = self.mseLoss(o, t)
            corr = self.corrLoss(o, t)
            loss = mse + self.p * corr
            return loss
    
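    A hypothetical usage sketch, assuming the model outputs raw logits that are squashed with a sigmoid before the loss (a common choice when the labels live in [0, 1]); the exact place we applied the sigmoid in our pipeline may differ:

    criterion = MSECorrLoss(p=1.5)

    logits = model(inputs)                      # shape (batch_size, 1)
    preds = torch.sigmoid(logits)               # map to [0, 1] to match the labels
    loss = criterion(preds, labels.view(-1, 1))
    loss.backward()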

    In our experiments this loss performed slightly better than BCE.

    Model Design

    To increase model diversity we chose variants of different architectures, namely the following five models:

    • Deberta-v3-large
    • Bert-for-patents
    • Roberta-large
    • Ernie-en-2.0-Large
    • Electra-large-discriminator

    The per-fold CV scores were as follows:

    
    deberta-v3-large:  [0.8494, 0.8455, 0.8523, 0.8458, 0.8658]  cv 0.85176
    bert-for-patents:  [0.8393, 0.8403, 0.8457, 0.8402, 0.8564]  cv 0.8444
    roberta-large:     [0.8183, 0.8172, 0.8203, 0.8193, 0.8398]  cv 0.8233
    ernie-large:       [0.8276, 0.8277, 0.8251, 0.8296, 0.8466]  cv 0.8310
    electra-large:     [0.8429, 0.8309, 0.8259, 0.8416, 0.8460]  cv 0.8376
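    All five backbones share the same kind of regression head; the sketch below is a simplified version (mean pooling over the last hidden state plus a linear layer), so the class name and the pooling choice are assumptions rather than our exact CustomModel:

    import torch
    import torch.nn as nn
    from transformers import AutoConfig, AutoModel

    class PhraseSimilarityModel(nn.Module):
        def __init__(self, model_name):
            super().__init__()
            self.config = AutoConfig.from_pretrained(model_name)
            self.backbone = AutoModel.from_pretrained(model_name)
            self.fc = nn.Linear(self.config.hidden_size, 1)

        def forward(self, input_ids, attention_mask):
            out = self.backbone(input_ids=input_ids, attention_mask=attention_mask)
            hidden = out.last_hidden_state                   # (batch, seq_len, hidden)
            mask = attention_mask.unsqueeze(-1).float()
            pooled = (hidden * mask).sum(1) / mask.sum(1)    # mean pooling over real tokens
            return self.fc(pooled)                           # one raw score per phrase pair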
    

    Training Optimizations

    Based on experience from previous competitions, we mainly used the following training tricks:

    • Adversarial training: we tried FGM, which improved model training (a usage sketch follows the class definition below)
    import torch


    class FGM():
        def __init__(self, model):
            self.model = model
            self.backup = {}

        def attack(self, epsilon=1., emb_name='word_embeddings'):
            # emb_name should be (part of) the name of the embedding parameter in your model
            for name, param in self.model.named_parameters():
                if param.requires_grad and emb_name in name:
                    self.backup[name] = param.data.clone()
                    norm = torch.norm(param.grad)
                    if norm != 0 and not torch.isnan(norm):
                        r_at = epsilon * param.grad / norm
                        param.data.add_(r_at)

        def restore(self, emb_name='word_embeddings'):
            # emb_name must match the one used in attack() so the perturbed weights are restored
            for name, param in self.model.named_parameters():
                if param.requires_grad and emb_name in name:
                    assert name in self.backup
                    param.data = self.backup[name]
            self.backup = {}
    
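    A sketch of how FGM typically plugs into the training loop (batch and criterion are illustrative names, not our exact training code):

    fgm = FGM(model)

    for batch in train_loader:
        loss = criterion(model(**batch['inputs']), batch['labels'])
        loss.backward()                     # gradients on the clean input

        fgm.attack()                        # perturb the word embeddings along the gradient
        loss_adv = criterion(model(**batch['inputs']), batch['labels'])
        loss_adv.backward()                 # accumulate gradients on the adversarial input
        fgm.restore()                       # restore the original embeddings

        optimizer.step()
        optimizer.zero_grad()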
    • Model generalization: we added multi-sample dropout (a sketch follows the EMA code below)
    • EMA (exponential moving average of the weights) also improved model training
    class EMA():
        def __init__(self, model, decay):
            self.model = model
            self.decay = decay
            self.shadow = {}
            self.backup = {}

        def register(self):
            for name, param in self.model.named_parameters():
                if param.requires_grad:
                    self.shadow[name] = param.data.clone()

        def update(self):
            for name, param in self.model.named_parameters():
                if param.requires_grad:
                    assert name in self.shadow
                    new_average = (1.0 - self.decay) * param.data + self.decay * self.shadow[name]
                    self.shadow[name] = new_average.clone()

        def apply_shadow(self):
            for name, param in self.model.named_parameters():
                if param.requires_grad:
                    assert name in self.shadow
                    self.backup[name] = param.data
                    param.data = self.shadow[name]

        def restore(self):
            for name, param in self.model.named_parameters():
                if param.requires_grad:
                    assert name in self.backup
                    param.data = self.backup[name]
            self.backup = {}

    # initialization
    ema = EMA(model, 0.999)
    ema.register()

    # during training: after each optimizer step, update the shadow weights
    def train():
        optimizer.step()
        ema.update()

    # before evaluation: apply the shadow weights; after evaluation: restore the original weights
    def evaluate():
        ema.apply_shadow()
        # evaluate
        ema.restore()
    
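    For the multi-sample dropout mentioned above, here is a minimal sketch of the idea: average the head output over several independent dropout masks. The dropout rate and the number of samples are assumptions, not necessarily our exact settings:

    import torch
    import torch.nn as nn

    class MultiSampleDropoutHead(nn.Module):
        def __init__(self, hidden_size, num_samples=5, p=0.2):
            super().__init__()
            self.dropouts = nn.ModuleList([nn.Dropout(p) for _ in range(num_samples)])
            self.fc = nn.Linear(hidden_size, 1)

        def forward(self, pooled):
            # average the prediction over several independent dropout masks
            outs = [self.fc(dropout(pooled)) for dropout in self.dropouts]
            return torch.stack(outs, dim=0).mean(dim=0)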

    Things that did not help:

    • AWP
    • PGD

    Model Ensembling

    Based on the offline cross-validation scores and the online leaderboard feedback, we blended the models by weighted averaging:

    from sklearn.preprocessing import MinMaxScaler

    # rescale each model's predictions to [0, 1] before blending
    MMscaler = MinMaxScaler()
    predictions1 = MMscaler.fit_transform(submission['predictions1'].values.reshape(-1, 1)).reshape(-1)
    predictions2 = MMscaler.fit_transform(submission['predictions2'].values.reshape(-1, 1)).reshape(-1)
    predictions3 = MMscaler.fit_transform(submission['predictions3'].values.reshape(-1, 1)).reshape(-1)
    predictions4 = MMscaler.fit_transform(submission['predictions4'].values.reshape(-1, 1)).reshape(-1)
    predictions5 = MMscaler.fit_transform(submission['predictions5'].values.reshape(-1, 1)).reshape(-1)


    # final_predictions = (predictions1 + predictions2) / 2
    # final_predictions = (predictions1 + predictions2 + predictions3 + predictions4 + predictions5) / 5
    # weights 5:2:1:1:1
    final_predictions = 0.5*predictions1 + 0.2*predictions2 + 0.1*predictions3 + 0.1*predictions4 + 0.1*predictions5
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    • 13

    Other Attempts

    • two-stage stacking
      In the early stage we had fine-tuned a number of different pretrained models, so we had a relatively large number of prediction features. We tried stacking tree models on top of text statistical features plus the model predictions, which gave a fairly good ensemble effect at the time; part of the code is shown below.
    # ====================================================
    # Collect per-fold test predictions from each fine-tuned model
    # ====================================================
    # Note: partial code. CFG, TestDataset, CustomModel, inference_fn and device
    # come from the fine-tuning notebooks and are not repeated here.
    import gc

    import pandas as pd
    import torch
    from torch.utils.data import DataLoader
    from transformers import AutoTokenizer


    def get_fold_pred(CFG, path, model):
        CFG.path = path
        CFG.model = model
        CFG.config_path = CFG.path + "config.pth"
        CFG.tokenizer = AutoTokenizer.from_pretrained(CFG.path)
        test_dataset = TestDataset(CFG, test)

        test_loader = DataLoader(test_dataset,
                                 batch_size=CFG.batch_size,
                                 shuffle=False,
                                 num_workers=CFG.num_workers, pin_memory=True, drop_last=False)
        predictions = []
        for fold in CFG.trn_fold:
            model = CustomModel(CFG, config_path=CFG.config_path, pretrained=False)
            state = torch.load(CFG.path + f"{CFG.model.split('/')[-1]}_fold{fold}_best.pth",
                               map_location=torch.device('cpu'))
            model.load_state_dict(state['model'])
            prediction = inference_fn(test_loader, model, device)
            predictions.append(prediction.flatten())
            del model, state, prediction
            gc.collect()
            torch.cuda.empty_cache()
        # return one prediction array per fold (not averaged), so they can be used as stacking features
        return predictions


    model_paths = [
        "../input/albert-xxlarge-v2/albert-xxlarge-v2/",
        "../input/bert-large-cased-cv5/bert-large-cased/",
        "../input/deberta-base-cv5/deberta-base/",
        "../input/deberta-v3-base-cv5/deberta-v3-base/",
        "../input/deberta-v3-small/deberta-v3-small/",
        "../input/distilroberta-base/distilroberta-base/",
        "../input/roberta-large/roberta-large/",
        "../input/xlm-roberta-base/xlm-roberta-base/",
        "../input/xlmrobertalarge-cv5/xlm-roberta-large/",
    ]

    print("train.shape, test.shape", train.shape, test.shape)
    print("titles.shape", titles.shape)

    # out-of-fold predictions of the first-stage models, used as training features for stacking
    oof_res = pd.read_csv('../input/train-res/train_oof.csv')

    train = train.merge(oof_res, how='left', on='id')

    model_infos = {
        'albert-xxlarge-v2': ['../input/albert-xxlarge-v2/albert-xxlarge-v2/', "albert-xxlarge-v2"],
        'bert-large-cased': ['../input/bert-large-cased-cv5/bert-large-cased/', "bert-large-cased"],
        'deberta-base': ['../input/deberta-base-cv5/deberta-base/', "deberta-base"],
        'deberta-v3-base': ['../input/deberta-v3-base-cv5/deberta-v3-base/', "deberta-v3-base"],
        'deberta-v3-small': ['../input/deberta-v3-small/deberta-v3-small/', "deberta-v3-small"],
        'distilroberta-base': ['../input/distilroberta-base/distilroberta-base/', "distilroberta-base"],
        'roberta-large': ['../input/roberta-large/roberta-large/', "roberta-large"],
        'xlm-roberta-base': ['../input/xlm-roberta-base/xlm-roberta-base/', "xlm-roberta-base"],
        'xlm-roberta-large': ['../input/xlmrobertalarge-cv5/xlm-roberta-large/', "xlm-roberta-large"],
    }

    for model, path_info in model_infos.items():
        print(model)
        model_path, model_name = path_info[0], path_info[1]
        fea_df = get_fold_pred(CFG, model_path, model_name)
        model_infos[model].append(fea_df)
        del model_path, model_name

    del oof_res
    

    Training code for the second stage:

    # Note: partial code. params, xgb_params, train_features, importances and the
    # oof_reg_preds / merge_pred / sub_preds arrays are initialized earlier in the notebook.
    for fold_ in range(5):
        print("Fold:", fold_)

        trn_ = train[train['fold'] != fold_].index
        val_ = train[train['fold'] == fold_].index
        trn_x, trn_y = train[train_features].iloc[trn_], train['score'].iloc[trn_]
        val_x, val_y = train[train_features].iloc[val_], train['score'].iloc[val_]

        reg = lgb.LGBMRegressor(**params, n_estimators=1100)
        xgb = XGBRegressor(**xgb_params, n_estimators=1000)
        cat = CatBoostRegressor(iterations=1000, learning_rate=0.03,
                                depth=10,
                                eval_metric='RMSE',
                                random_seed=42,
                                bagging_temperature=0.2,
                                od_type='Iter',
                                metric_period=50,
                                od_wait=20)
        print("-" * 20 + "LightGBM Training" + "-" * 20)
        reg.fit(trn_x, np.log1p(trn_y), eval_set=[(val_x, np.log1p(val_y))], early_stopping_rounds=50, verbose=100, eval_metric='rmse')
        print("-" * 20 + "XGBoost Training" + "-" * 20)
        xgb.fit(trn_x, np.log1p(trn_y), eval_set=[(val_x, np.log1p(val_y))], early_stopping_rounds=50, eval_metric='rmse', verbose=100)
        print("-" * 20 + "CatBoost Training" + "-" * 20)
        cat.fit(trn_x, np.log1p(trn_y), eval_set=[(val_x, np.log1p(val_y))], early_stopping_rounds=50, use_best_model=True, verbose=100)

        # LightGBM feature importances for this fold
        imp_df = pd.DataFrame()
        imp_df['feature'] = train_features
        imp_df['gain_reg'] = reg.booster_.feature_importance(importance_type='gain')
        imp_df['fold'] = fold_ + 1
        importances = pd.concat([importances, imp_df], axis=0, sort=False)

        # attach this fold's test-set predictions of every first-stage model as features
        for model, values in model_infos.items():
            test[model] = values[2][fold_]

        for model, values in uspppm_model_infos.items():
            test[f"uspppm_{model}"] = values[2][fold_]

        # (optional group-wise aggregation features such as median/mean/max/min/std were also tried here)

        # LightGBM
        oof_reg_preds[val_] = reg.predict(val_x, num_iteration=reg.best_iteration_)
        lgb_preds = reg.predict(test[train_features], num_iteration=reg.best_iteration_)

        # XGBoost
        oof_reg_preds1[val_] = xgb.predict(val_x)
        oof_reg_preds1[oof_reg_preds1 < 0] = 0
        xgb_preds = xgb.predict(test[train_features])

        # CatBoost
        oof_reg_preds2[val_] = cat.predict(val_x)
        oof_reg_preds2[oof_reg_preds2 < 0] = 0
        cat_preds = cat.predict(test[train_features])
        cat_preds[cat_preds < 0] = 0

        # merge the three out-of-fold predictions
        merge_pred[val_] = oof_reg_preds[val_] * 0.4 + oof_reg_preds1[val_] * 0.3 + oof_reg_preds2[val_] * 0.3

        # test-set predictions of the three models, averaged over the 5 folds
        sub_preds += (lgb_preds / 5) * 0.6 + (xgb_preds / 5) * 0.2 + (cat_preds / 5) * 0.2

        # LightGBM-only test-set predictions, averaged over the 5 folds
        sub_reg_preds += lgb_preds / 5

    print("lgb", pearsonr(train['score'], np.expm1(oof_reg_preds))[0])
    print("xgb", pearsonr(train['score'], np.expm1(oof_reg_preds1))[0])
    print("cat", pearsonr(train['score'], np.expm1(oof_reg_preds2))[0])
    print("xgb lgb cat", pearsonr(train['score'], np.expm1(merge_pred))[0])
    
  • Original article: https://blog.csdn.net/yanqianglifei/article/details/125414355