Kaggle Patent Phrase Matching Competition: Post-Competition Summary


    Competition Overview

    In the patent phrase matching dataset, competitors have to score the similarity of two short phrases, an anchor and a target, and output their similarity within a given semantic context, on a scale of 0 to 1. Our team id is xlyhq; we finished rank 13 on the public leaderboard and rank 12 on the private leaderboard. Many thanks to my four teammates @heng zheng, @pythonlan, @leolu1998 and @syzong for their effort and dedication; in the end we were lucky enough to grab a gold medal.

    Our core ideas are similar to those of the other top teams, so here we mainly share our competition journey, the concrete results of our experiments, and a few interesting attempts.

    Text Processing

    The dataset mainly contains the anchor, target and context fields, plus additional text that can be concatenated. During the competition we mainly tried the following concatenation schemes:

    • v1: test['anchor'] + '[SEP]' + test['target'] + '[SEP]' + test['context_text']
    • v2: test['anchor'] + '[SEP]' + test['target'] + '[SEP]' + test['context'] + '[SEP]' + test['context_text'], i.e. additionally appending the raw code such as A47
    • v3: test['text'] = test['anchor'] + '[SEP]' + test['target'] + '[SEP]' + test['context'] + '[SEP]' + test['context_text'], grabbing more text to concatenate by building context_text from the CPC titles (code below), so the subclasses under A47, such as A47B and A47C, are appended as well
    import re
    import pandas as pd
    from tqdm import tqdm

    # CPC section letter -> section title
    context_mapping = {
        "A": "Human Necessities",
        "B": "Operations and Transport",
        "C": "Chemistry and Metallurgy",
        "D": "Textiles",
        "E": "Fixed Constructions",
        "F": "Mechanical Engineering",
        "G": "Physics",
        "H": "Electricity",
        "Y": "Emerging Cross-Sectional Technologies",
    }

    titles = pd.read_csv('./input/cpc-codes/titles.csv')

    def process(text):
        # drop bracketed remarks from the CPC titles
        return re.sub(u"\\(.*?\\)|\\{.*?}|\\[.*?]", "", text)

    def get_context(cpc_code):
        # titles of the classes/subclasses whose code contains this context code, e.g. A47, A47B, A47C
        cpc_data = titles[(titles['code'].map(len) <= 4) & (titles['code'].str.contains(cpc_code))]
        texts = cpc_data['title'].values.tolist()
        texts = [process(text) for text in texts]
        return ";".join([context_mapping[cpc_code[0]]] + texts)

    def get_cpc_texts():
        cpc_texts = dict()
        for code in tqdm(train['context'].unique()):
            cpc_texts[code] = get_context(code)
        return cpc_texts

    cpc_texts = get_cpc_texts()

    This concatenation scheme gave a decent boost, but the text becomes much longer (we set the maximum length to 300), which slows training down.

    • v4: the core concatenation scheme: test['text'] = test['text'] + '[SEP]' + test['target_info']
    # append target_info: all targets that share the same anchor and context
    test['text'] = test['anchor'] + '[SEP]' + test['target'] + '[SEP]' + test['context_text']
    target_info = test.groupby(['anchor', 'context'])['target'].agg(list).reset_index()
    target_info['target'] = target_info['target'].apply(lambda x: list(set(x)))
    target_info['target_info'] = target_info['target'].apply(lambda x: ', '.join(x))
    target_info['target_info'].apply(lambda x: len(x.split(', '))).describe()
    del target_info['target']
    test = test.merge(target_info, on=['anchor', 'context'], how='left')
    test['text'] = test['text'] + '[SEP]' + test['target_info']
    test.head()

    This concatenation scheme raised both CV and LB scores substantially. Comparing v3 and v4, we found that concatenating higher-quality text helps the model more: v3 adds a lot of redundant information, while v4 adds a lot of entity-level key information.
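
    For reference, here is a minimal sketch of how such a concatenated text could be tokenized and truncated; the checkpoint, helper name and sample strings are illustrative assumptions, only the max length of 300 comes from the note above:

    from transformers import AutoTokenizer

    # Illustrative only: checkpoint and sample strings are assumptions, not our exact setup.
    tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-large")

    def prepare_input(text, max_len=300):
        # tokenize, pad and truncate one concatenated example
        return tokenizer(text,
                         max_length=max_len,
                         padding="max_length",
                         truncation=True,
                         return_tensors="pt")

    sample_text = "anchor phrase" + "[SEP]" + "target phrase" + "[SEP]" + "A47" + "[SEP]" + "Human Necessities;FURNITURE; DOMESTIC ARTICLES"
    enc = prepare_input(sample_text)
    print(enc["input_ids"].shape)  # torch.Size([1, 300])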

    Data Splitting

    During the competition we tried several ways to split the data, including:

    • StratifiedGroupKFold: with this split the gap between CV and LB was small and the score was slightly better (see the sketch after this list)
    • StratifiedKFold: higher offline CV
    • other KFold and GroupKFold variants did not work well
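
    A minimal sketch of the StratifiedGroupKFold assignment described above, assuming the score is binned into its five discrete levels for stratification and folds are grouped by anchor; the bin trick, column names and seed are illustrative, not our exact settings:

    import pandas as pd
    from sklearn.model_selection import StratifiedGroupKFold

    def assign_folds(train: pd.DataFrame, n_splits: int = 5, seed: int = 42) -> pd.DataFrame:
        train = train.reset_index(drop=True).copy()
        # scores are 0, 0.25, 0.5, 0.75, 1.0 -> five stratification bins
        train["score_bin"] = (train["score"] * 4).astype(int)
        sgkf = StratifiedGroupKFold(n_splits=n_splits, shuffle=True, random_state=seed)
        # group by anchor so the same anchor never appears in both train and validation
        for fold, (_, val_idx) in enumerate(sgkf.split(train, train["score_bin"], groups=train["anchor"])):
            train.loc[val_idx, "fold"] = fold
        train["fold"] = train["fold"].astype(int)
        return train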

    Loss Functions

    The main candidate loss functions were:

    • BCE: nn.BCEWithLogitsLoss(reduction="mean")
    • MSE: nn.MSELoss()
    • Mixture loss: MSECorrLoss (MSE plus a (1 - Pearson correlation) term), defined below
    import torch.nn as nn

    class CorrLoss(nn.Module):
        """
        Use 1 - the correlation coefficient between the network output and the target as the loss.
        input (o, t):
            o: tensor of size (batch_size, 1), output of the network
            t: tensor of size (batch_size, 1), target value
        output (corr):
            corr: tensor of size (1)
        """
        def __init__(self):
            super(CorrLoss, self).__init__()

        def forward(self, o, t):
            assert o.size() == t.size()
            # z-scores of o and t
            o_m = o.mean(dim=0)
            o_s = o.std(dim=0)
            o_z = (o - o_m) / o_s
            t_m = t.mean(dim=0)
            t_s = t.std(dim=0)
            t_z = (t - t_m) / t_s
            # correlation between o and t
            tmp = o_z * t_z
            corr = tmp.mean(dim=0)
            return 1 - corr

    class MSECorrLoss(nn.Module):
        def __init__(self, p=1.5):
            super(MSECorrLoss, self).__init__()
            self.p = p
            self.mseLoss = nn.MSELoss()
            self.corrLoss = CorrLoss()

        def forward(self, o, t):
            mse = self.mseLoss(o, t)
            corr = self.corrLoss(o, t)
            loss = mse + self.p * corr
            return loss

    We used this loss in our experiments; it performed slightly better than BCE.

    Model Design

    To increase model diversity, we chose variants of different architectures, the following five models in particular:

    • Deberta-v3-large
    • Bert-for-patents
    • Roberta-large
    • Ernie-en-2.0-Large
    • Electra-large-discriminator

    The per-fold CV scores were as follows:

    1. deberta-v3-large: [0.8494, 0.8455, 0.8523, 0.8458, 0.8658], CV 0.85176
    2. bert-for-patents: [0.8393, 0.8403, 0.8457, 0.8402, 0.8564], CV 0.8444
    3. roberta-large: [0.8183, 0.8172, 0.8203, 0.8193, 0.8398], CV 0.8233
    4. ernie-large: [0.8276, 0.8277, 0.8251, 0.8296, 0.8466], CV 0.8310
    5. electra-large: [0.8429, 0.8309, 0.8259, 0.8416, 0.8460], CV 0.8376

    Training Optimization

    Based on experience from previous competitions, we mainly used the following training optimizations:

    • Adversarial training: we tried FGM, which improved training (a hedged usage sketch follows the class below)
    import torch

    class FGM():
        def __init__(self, model):
            self.model = model
            self.backup = {}

        def attack(self, epsilon=1., emb_name='word_embeddings'):
            # emb_name must match the name of the embedding parameters in your model
            for name, param in self.model.named_parameters():
                if param.requires_grad and emb_name in name:
                    self.backup[name] = param.data.clone()
                    norm = torch.norm(param.grad)
                    if norm != 0 and not torch.isnan(norm):
                        r_at = epsilon * param.grad / norm
                        param.data.add_(r_at)

        def restore(self, emb_name='word_embeddings'):
            # emb_name must match the name used in attack()
            for name, param in self.model.named_parameters():
                if param.requires_grad and emb_name in name:
                    assert name in self.backup
                    param.data = self.backup[name]
            self.backup = {}
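
    A hedged sketch of how FGM is typically wired into a training loop (the criterion, loader and optimizer names are illustrative, not our exact code):

    fgm = FGM(model)
    for inputs, labels in train_loader:
        loss = criterion(model(inputs), labels)
        loss.backward()                        # gradients on the clean batch
        fgm.attack()                           # perturb the word embeddings
        loss_adv = criterion(model(inputs), labels)
        loss_adv.backward()                    # accumulate adversarial gradients
        fgm.restore()                          # restore the original embeddings
        optimizer.step()
        optimizer.zero_grad()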
    • Model generalization: we added multi-sample dropout (see the sketch after this list)
    • EMA also improved training
    class EMA():
        def __init__(self, model, decay):
            self.model = model
            self.decay = decay
            self.shadow = {}
            self.backup = {}

        def register(self):
            for name, param in self.model.named_parameters():
                if param.requires_grad:
                    self.shadow[name] = param.data.clone()

        def update(self):
            for name, param in self.model.named_parameters():
                if param.requires_grad:
                    assert name in self.shadow
                    new_average = (1.0 - self.decay) * param.data + self.decay * self.shadow[name]
                    self.shadow[name] = new_average.clone()

        def apply_shadow(self):
            for name, param in self.model.named_parameters():
                if param.requires_grad:
                    assert name in self.shadow
                    self.backup[name] = param.data
                    param.data = self.shadow[name]

        def restore(self):
            for name, param in self.model.named_parameters():
                if param.requires_grad:
                    assert name in self.backup
                    param.data = self.backup[name]
            self.backup = {}

    # initialization
    ema = EMA(model, 0.999)
    ema.register()

    # during training, update the shadow weights right after the optimizer step
    def train():
        optimizer.step()
        ema.update()

    # apply the shadow weights before evaluation; restore the original weights afterwards
    def evaluate():
        ema.apply_shadow()
        # evaluate
        ema.restore()
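
    For the multi-sample dropout mentioned above, here is a minimal sketch of a regression head, assuming a pooled transformer output as input; the hidden size, dropout rate and number of samples are illustrative, not our exact settings:

    import torch
    import torch.nn as nn

    class MultiSampleDropoutHead(nn.Module):
        # Averages the predictions of one linear head under several independent
        # dropout masks; a common regularization trick for fine-tuning.
        def __init__(self, hidden_size=1024, num_samples=5, p=0.2):
            super().__init__()
            self.dropouts = nn.ModuleList([nn.Dropout(p) for _ in range(num_samples)])
            self.fc = nn.Linear(hidden_size, 1)

        def forward(self, pooled):  # pooled: (batch_size, hidden_size)
            outputs = [self.fc(dropout(pooled)) for dropout in self.dropouts]
            return torch.stack(outputs, dim=0).mean(dim=0)  # (batch_size, 1)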

    Approaches that did not help:

    • AWP
    • PGD

    Model Ensembling

    Based on the offline cross-validation scores and the online leaderboard feedback, we blended the models with a weighted average:

    from sklearn.preprocessing import MinMaxScaler

    MMscaler = MinMaxScaler()
    predictions1 = MMscaler.fit_transform(submission['predictions1'].values.reshape(-1, 1)).reshape(-1)
    predictions2 = MMscaler.fit_transform(submission['predictions2'].values.reshape(-1, 1)).reshape(-1)
    predictions3 = MMscaler.fit_transform(submission['predictions3'].values.reshape(-1, 1)).reshape(-1)
    predictions4 = MMscaler.fit_transform(submission['predictions4'].values.reshape(-1, 1)).reshape(-1)
    predictions5 = MMscaler.fit_transform(submission['predictions5'].values.reshape(-1, 1)).reshape(-1)
    # final_predictions = (predictions1 + predictions2) / 2
    # final_predictions = (predictions1 + predictions2 + predictions3 + predictions4 + predictions5) / 5
    # weights 5:2:1:1:1
    final_predictions = 0.5 * predictions1 + 0.2 * predictions2 + 0.1 * predictions3 + 0.1 * predictions4 + 0.1 * predictions5

    Other Attempts

    • Two-stage stacking
      Earlier in the competition we had fine-tuned a number of different pretrained models, so we had a fairly large set of prediction features. We tried stacking tree models on top of text statistics features and the model predictions, and at the time this blended quite well. Part of the code is shown below.
    import gc
    import torch
    import pandas as pd
    from torch.utils.data import DataLoader
    from transformers import AutoTokenizer

    # ====================================================
    # predictions1
    # ====================================================
    def get_fold_pred(CFG, path, model):
        CFG.path = path
        CFG.model = model
        CFG.config_path = CFG.path + "config.pth"
        CFG.tokenizer = AutoTokenizer.from_pretrained(CFG.path)
        test_dataset = TestDataset(CFG, test)
        test_loader = DataLoader(test_dataset,
                                 batch_size=CFG.batch_size,
                                 shuffle=False,
                                 num_workers=CFG.num_workers, pin_memory=True, drop_last=False)
        predictions = []
        for fold in CFG.trn_fold:
            model = CustomModel(CFG, config_path=CFG.config_path, pretrained=False)
            state = torch.load(CFG.path + f"{CFG.model.split('/')[-1]}_fold{fold}_best.pth",
                               map_location=torch.device('cpu'))
            model.load_state_dict(state['model'])
            prediction = inference_fn(test_loader, model, device)
            predictions.append(prediction.flatten())
            del model, state, prediction
            gc.collect()
            torch.cuda.empty_cache()
        # predictions1 = np.mean(predictions, axis=0)
        # fea_df = pd.DataFrame(predictions).T
        # fea_df.columns = [f"{CFG.model.split('/')[-1]}_fold{fold}" for fold in CFG.trn_fold]
        # del test_dataset, test_loader
        return predictions

    model_paths = [
        "../input/albert-xxlarge-v2/albert-xxlarge-v2/",
        "../input/bert-large-cased-cv5/bert-large-cased/",
        "../input/deberta-base-cv5/deberta-base/",
        "../input/deberta-v3-base-cv5/deberta-v3-base/",
        "../input/deberta-v3-small/deberta-v3-small/",
        "../input/distilroberta-base/distilroberta-base/",
        "../input/roberta-large/roberta-large/",
        "../input/xlm-roberta-base/xlm-roberta-base/",
        "../input/xlmrobertalarge-cv5/xlm-roberta-large/",
    ]

    print("train.shape, test.shape", train.shape, test.shape)
    print("titles.shape", titles.shape)

    # for model_path in model_paths:
    #     with open(f'{model_path}/oof_df.pkl', "rb") as fh:
    #         oof = pickle.load(fh)[['id', 'fold', 'pred']]
    #     oof[f"{model_path.split('/')[1]}"] = oof['pred']
    #     train = train.merge(oof[['id', f"{model_path.split('/')[1]}"]], how='left', on='id')

    oof_res = pd.read_csv('../input/train-res/train_oof.csv')
    train = train.merge(oof_res, how='left', on='id')

    model_infos = {
        'albert-xxlarge-v2': ['../input/albert-xxlarge-v2/albert-xxlarge-v2/', "albert-xxlarge-v2"],
        'bert-large-cased': ['../input/bert-large-cased-cv5/bert-large-cased/', "bert-large-cased"],
        'deberta-base': ['../input/deberta-base-cv5/deberta-base/', "deberta-base"],
        'deberta-v3-base': ['../input/deberta-v3-base-cv5/deberta-v3-base/', "deberta-v3-base"],
        'deberta-v3-small': ['../input/deberta-v3-small/deberta-v3-small/', "deberta-v3-small"],
        'distilroberta-base': ['../input/distilroberta-base/distilroberta-base/', "distilroberta-base"],
        'roberta-large': ['../input/roberta-large/roberta-large/', "roberta-large"],
        'xlm-roberta-base': ['../input/xlm-roberta-base/xlm-roberta-base/', "xlm-roberta-base"],
        'xlm-roberta-large': ['../input/xlmrobertalarge-cv5/xlm-roberta-large/', "xlm-roberta-large"],
    }

    for model, path_info in model_infos.items():
        print(model)
        model_path, model_name = path_info[0], path_info[1]
        fea_df = get_fold_pred(CFG, model_path, model_name)
        model_infos[model].append(fea_df)
        del model_path, model_name

    del oof_res

    Training code:

    import numpy as np
    import pandas as pd
    import lightgbm as lgb
    from xgboost import XGBRegressor
    from catboost import CatBoostRegressor
    from scipy.stats import pearsonr

    for fold_ in range(5):
        print("Fold:", fold_)
        trn_ = train[train['fold'] != fold_].index
        val_ = train[train['fold'] == fold_].index
        # print(train.iloc[val_].sort_values('id'))
        trn_x, trn_y = train[train_features].iloc[trn_], train['score'].iloc[trn_]
        val_x, val_y = train[train_features].iloc[val_], train['score'].iloc[val_]
        reg = lgb.LGBMRegressor(**params, n_estimators=1100)
        xgb = XGBRegressor(**xgb_params, n_estimators=1000)
        cat = CatBoostRegressor(iterations=1000, learning_rate=0.03,
                                depth=10,
                                eval_metric='RMSE',
                                random_seed=42,
                                bagging_temperature=0.2,
                                od_type='Iter',
                                metric_period=50,
                                od_wait=20)
        print("-" * 20 + "LightGBM Training" + "-" * 20)
        reg.fit(trn_x, np.log1p(trn_y), eval_set=[(val_x, np.log1p(val_y))], early_stopping_rounds=50, verbose=100, eval_metric='rmse')
        print("-" * 20 + "XGBoost Training" + "-" * 20)
        xgb.fit(trn_x, np.log1p(trn_y), eval_set=[(val_x, np.log1p(val_y))], early_stopping_rounds=50, eval_metric='rmse', verbose=100)
        print("-" * 20 + "CatBoost Training" + "-" * 20)
        cat.fit(trn_x, np.log1p(trn_y), eval_set=[(val_x, np.log1p(val_y))], early_stopping_rounds=50, use_best_model=True, verbose=100)
        imp_df = pd.DataFrame()
        imp_df['feature'] = train_features
        imp_df['gain_reg'] = reg.booster_.feature_importance(importance_type='gain')
        imp_df['fold'] = fold_ + 1
        importances = pd.concat([importances, imp_df], axis=0, sort=False)
        # per-fold prediction features for the test set
        for model, values in model_infos.items():
            test[model] = values[2][fold_]
        for model, values in uspppm_model_infos.items():
            test[f"uspppm_{model}"] = values[2][fold_]
        # for f in tqdm(amount_feas, desc="basic aggregation features"):
        #     for cate in category_fea:
        #         if f != cate:
        #             test['{}_{}_medi'.format(cate, f)] = test.groupby(cate)[f].transform('median')
        #             test['{}_{}_mean'.format(cate, f)] = test.groupby(cate)[f].transform('mean')
        #             test['{}_{}_max'.format(cate, f)] = test.groupby(cate)[f].transform('max')
        #             test['{}_{}_min'.format(cate, f)] = test.groupby(cate)[f].transform('min')
        #             test['{}_{}_std'.format(cate, f)] = test.groupby(cate)[f].transform('std')
        # LightGBM
        oof_reg_preds[val_] = reg.predict(val_x, num_iteration=reg.best_iteration_)
        lgb_preds = reg.predict(test[train_features], num_iteration=reg.best_iteration_)
        # XGBoost
        oof_reg_preds1[val_] = xgb.predict(val_x)
        oof_reg_preds1[oof_reg_preds1 < 0] = 0
        xgb_preds = xgb.predict(test[train_features])
        # CatBoost
        oof_reg_preds2[val_] = cat.predict(val_x)
        oof_reg_preds2[oof_reg_preds2 < 0] = 0
        cat_preds = cat.predict(test[train_features])
        cat_preds[cat_preds < 0] = 0
        # merge all out-of-fold predictions
        merge_pred[val_] = oof_reg_preds[val_] * 0.4 + oof_reg_preds1[val_] * 0.3 + oof_reg_preds2[val_] * 0.3
        sub_preds += (lgb_preds / 5) * 0.6 + (xgb_preds / 5) * 0.2 + (cat_preds / 5) * 0.2  # 5-fold test predictions blended across the three models
        sub_reg_preds += lgb_preds / 5  # 5-fold test predictions from LightGBM alone

    print("lgb", pearsonr(train['score'], np.expm1(oof_reg_preds))[0])       # lgb
    print("xgb", pearsonr(train['score'], np.expm1(oof_reg_preds1))[0])      # xgb
    print("cat", pearsonr(train['score'], np.expm1(oof_reg_preds2))[0])      # cat
    print("xgb lgb cat", pearsonr(train['score'], np.expm1(merge_pred))[0])  # xgb lgb cat
  • Original post: https://blog.csdn.net/yanqianglifei/article/details/125424006