    DingTalk Cup Preliminary Round, Problem A: A Multi-Model Ensemble for Bank Card Fraud Prediction (full code and explanations)

    Preface:

    The DingTalk Cup preliminary round (Problem A) wrapped up on August 10. The problem was fairly straightforward overall. The complete modeling code is shared below; the core idea is fusing several models with a voting ensemble.

    Data + full code:

    链接:https://pan.baidu.com/s/1SZtLsuPHSmlaOy111uW_YA
    提取码:xx78


    1. Standardize the first two feature columns.

    2. Apply over-sampling and under-sampling to deal with the extreme class imbalance.

    3. Train and evaluate the six first-stage candidate models on the over-sampled and the under-sampled data separately.

    4. Model selection: use the ROC-AUC score to pick out the best models (four in the end) and save them for fusion in the final stage.

    5. Load the four saved models and fuse them into a single model.

    6. Train and evaluate the fused model.

    7. Inspect the fused model's performance with a confusion matrix.

    Reading and inspecting the data

    Read the data

    import pandas as pd
    import numpy as np
    df = pd.read_csv("数据集/card_transdata.csv", encoding='utf-8')  # adjust the path to wherever the data set lives on your machine
    df    
    
            distance_from_home  distance_from_last_transaction  ratio_to_median_purchase_price  repeat_retailer  used_chip  used_pin_number  online_order  fraud
    0                57.877857                        0.311140                        1.945940              1.0        1.0              0.0           0.0    0.0
    1                10.829943                        0.175592                        1.294219              1.0        0.0              0.0           0.0    0.0
    2                 5.091079                        0.805153                        0.427715              1.0        0.0              0.0           1.0    0.0
    3                 2.247564                        5.600044                        0.362663              1.0        1.0              0.0           1.0    0.0
    4                44.190936                        0.566486                        2.222767              1.0        1.0              0.0           1.0    0.0
    ...                    ...                             ...                             ...              ...        ...              ...           ...    ...
    999995            2.207101                        0.112651                        1.626798              1.0        1.0              0.0           0.0    0.0
    999996           19.872726                        2.683904                        2.778303              1.0        1.0              0.0           0.0    0.0
    999997            2.914857                        1.472687                        0.218075              1.0        1.0              0.0           1.0    0.0
    999998            4.258729                        0.242023                        0.475822              1.0        0.0              0.0           1.0    0.0
    999999           58.108125                        0.318110                        0.386920              1.0        1.0              0.0           1.0    0.0

    1000000 rows × 8 columns

    Inspect the data

    
    df.info()
    
    
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 1000000 entries, 0 to 999999
    Data columns (total 8 columns):
     #   Column                          Non-Null Count    Dtype  
    ---  ------                          --------------    -----  
     0   distance_from_home              1000000 non-null  float64
     1   distance_from_last_transaction  1000000 non-null  float64
     2   ratio_to_median_purchase_price  1000000 non-null  float64
     3   repeat_retailer                 1000000 non-null  float64
     4   used_chip                       1000000 non-null  float64
     5   used_pin_number                 1000000 non-null  float64
     6   online_order                    1000000 non-null  float64
     7   fraud                           1000000 non-null  float64
    dtypes: float64(8)
    memory usage: 61.0 MB
    
    • Every column is float64, i.e., the data is purely numeric.
    • There are no missing values. (A quick check is sketched below.)
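
    A minimal way to verify the "no missing values" claim on df (a sketch):

    # Count missing values per column; every count should be 0 for this data set.
    print(df.isnull().sum())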
    # summary statistics for each column
    df.describe() 
    
           distance_from_home  distance_from_last_transaction  ratio_to_median_purchase_price  repeat_retailer       used_chip  used_pin_number    online_order           fraud
    count      1000000.000000                  1000000.000000                  1000000.000000   1000000.000000  1000000.000000   1000000.000000  1000000.000000  1000000.000000
    mean            26.628792                        5.036519                        1.824182         0.881536        0.350399         0.100608        0.650552        0.087403
    std             65.390784                       25.843093                        2.799589         0.323157        0.477095         0.300809        0.476796        0.282425
    min              0.004874                        0.000118                        0.004399         0.000000        0.000000         0.000000        0.000000        0.000000
    25%              3.878008                        0.296671                        0.475673         1.000000        0.000000         0.000000        0.000000        0.000000
    50%              9.967760                        0.998650                        0.997717         1.000000        0.000000         0.000000        1.000000        0.000000
    75%             25.743985                        3.355748                        2.096370         1.000000        1.000000         0.000000        1.000000        0.000000
    max          10632.723672                    11851.104565                      267.802942         1.000000        1.000000         1.000000        1.000000        1.000000
    • Note that distance_from_home (distance between the transaction location and the cardholder's home) and distance_from_last_transaction (distance from the previous transaction) have far larger variances than the other features.

    Exploratory analysis and visualization

    # class counts per feature column (every column has the same length,
    # so all three pairs are identical: they are just the non-fraud/fraud totals)
    print('distance_from_home, non-fraud count: ' + str(len(df.loc[(df['fraud'] == 0), 'distance_from_home'])))
    print('distance_from_home, fraud count: ' + str(len(df.loc[(df['fraud'] == 1), 'distance_from_home'])))

    print('distance_from_last_transaction, non-fraud count: ' + str(len(df.loc[(df['fraud'] == 0), 'distance_from_last_transaction'])))
    print('distance_from_last_transaction, fraud count: ' + str(len(df.loc[(df['fraud'] == 1), 'distance_from_last_transaction'])))

    print('ratio_to_median_purchase_price, non-fraud count: ' + str(len(df.loc[(df['fraud'] == 0), 'ratio_to_median_purchase_price'])))
    print('ratio_to_median_purchase_price, fraud count: ' + str(len(df.loc[(df['fraud'] == 1), 'ratio_to_median_purchase_price'])))
    
    
    distance_from_home, non-fraud count: 912597
    distance_from_home, fraud count: 87403
    distance_from_last_transaction, non-fraud count: 912597
    distance_from_last_transaction, fraud count: 87403
    ratio_to_median_purchase_price, non-fraud count: 912597
    ratio_to_median_purchase_price, fraud count: 87403
    

    Check the class balance:

    from pyecharts.charts import Pie
    from pyecharts import options as opts
    L1 = ['fraud', 'Not fraud']
    num = [87403, 912597]
    c = Pie()
    c.add("", [list(z) for z in zip(L1, num)])
    c.set_global_opts(title_opts=opts.TitleOpts(title="Class distribution"))
    c.set_series_opts(label_opts=opts.LabelOpts(formatter="{b}:{c}"))
    c.render_notebook()
    
         
    

    [Figure: pie chart of the class distribution — fraud: 87,403 vs. not fraud: 912,597]

    • Non-fraud samples far outnumber fraud samples: the classes are heavily imbalanced.

    • Most classifiers assign labels by thresholding a score (for example, below 0.5 means negative, above means positive). On imbalanced data, the default threshold pushes the model toward predicting the majority class.

    • Here we balance the data by resampling, starting with under-sampling (see the sketch after this list).

      Under-sampling: randomly pick a subset of the majority class so that both classes end up at the (small) minority count.
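
    A minimal pandas sketch of the under-sampling idea (illustration only; the modeling pipelines below use imblearn's RandomUnderSampler):

    # Randomly down-sample the majority (non-fraud) class to the minority size,
    # then shuffle. The resulting counts (87,403 per class) match the pie chart above.
    fraud_df = df[df['fraud'] == 1]
    nonfraud_df = df[df['fraud'] == 0].sample(n=len(fraud_df), random_state=0)
    balanced_df = pd.concat([fraud_df, nonfraud_df]).sample(frac=1, random_state=0)
    print(balanced_df['fraud'].value_counts())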

    Relationship between distance_from_home and fraud

    #### relationship between distance_from_home and fraud
    import matplotlib.pyplot as plt

    # two stacked subplots sharing the x axis
    f, (ax1, ax2) = plt.subplots(2, 1, sharex=True, figsize=(16, 4))

    # number of histogram bins
    bins = 30

    # distance_from_home for fraud cases
    ax1.hist(df["distance_from_home"][df["fraud"] == 1], bins=bins)
    ax1.set_title('Fraud')

    # distance_from_home for normal cases
    ax2.hist(df["distance_from_home"][df["fraud"] == 0], bins=bins)
    ax2.set_title('Not Fraud')

    # axis labels, with counts on a log scale
    plt.xlabel('distance')
    plt.ylabel('Number of Transactions')
    plt.yscale('log')

    plt.show()    # display the figure
    


    [Figure: distance_from_home histograms for Fraud vs. Not Fraud, log-scale counts]

    • Fraud is concentrated at distances below roughly 500, suggesting that most cases are committed close to home (same-city fraud).

    Relationship between distance_from_last_transaction and fraud

    #### relationship between distance_from_last_transaction and fraud
    import matplotlib.pyplot as plt

    # two stacked subplots sharing the x axis
    f, (ax1, ax2) = plt.subplots(2, 1, sharex=True, figsize=(16, 4))

    # number of histogram bins
    bins = 30

    # distance_from_last_transaction for fraud cases
    ax1.hist(df["distance_from_last_transaction"][df["fraud"] == 1], bins=bins)
    ax1.set_title('Fraud')

    # distance_from_last_transaction for normal cases
    ax2.hist(df["distance_from_last_transaction"][df["fraud"] == 0], bins=bins)
    ax2.set_title('Not Fraud')

    # axis labels, with counts on a log scale
    plt.xlabel('distance')
    plt.ylabel('Number of Transactions')
    plt.yscale('log')

    plt.show()    # display the figure
    


    [Figure: distance_from_last_transaction histograms for Fraud vs. Not Fraud, log-scale counts]

    Again consistent with same-city fraud: small distances are less likely to arouse suspicion.

    Correlations between the features

    import seaborn as sns

    # figure with two heatmap axes plus a shared colorbar axis
    grid_kws = {"width_ratios": (.9, .9, .05), "wspace": 0.2}
    f, (ax1, ax2, cbar_ax) = plt.subplots(1, 3, gridspec_kw=grid_kws, figsize=(18, 9))

    # diverging palette
    cmap = sns.diverging_palette(220, 8, as_cmap=True)

    # feature correlations within normal cases
    correlation_NonFraud = df[df["fraud"] == 0].loc[:, df.columns != 'fraud'].corr()
    # feature correlations within fraud cases
    correlation_Fraud = df[df["fraud"] == 1].loc[:, df.columns != 'fraud'].corr()
    # mask for the upper triangle
    mask = np.zeros_like(correlation_NonFraud)
    indices = np.triu_indices_from(correlation_NonFraud)
    mask[indices] = True


    # correlation heatmap for normal cases
    ax1 = sns.heatmap(correlation_NonFraud, ax=ax1, vmin=-1, vmax=1, cmap=cmap, square=False, \
                      linewidths=0.5, mask=mask, cbar=False)
    ax1.set_xticklabels(ax1.get_xticklabels(), size=16)
    ax1.set_yticklabels(ax1.get_yticklabels(), size=16)
    ax1.set_title('Normal', size=20)

    # correlation heatmap for fraud cases
    ax2 = sns.heatmap(correlation_Fraud, vmin=-1, vmax=1, cmap=cmap, ax=ax2, square=False, \
                      linewidths=0.5, mask=mask, yticklabels=False, \
                      cbar_ax=cbar_ax, cbar_kws={'orientation': 'vertical', 'ticks': [-1, -0.5, 0, 0.5, 1]})

    ax2.set_xticklabels(ax2.get_xticklabels(), size=16)
    ax2.set_title('Fraud', size=20)

    cbar_ax.set_yticklabels(cbar_ax.get_yticklabels(), size=14)

    plt.show()    # display the figure
    


    [Figure: feature-correlation heatmaps, Normal vs. Fraud]

    df.corr()
    
                                    distance_from_home  distance_from_last_transaction  ratio_to_median_purchase_price  repeat_retailer  used_chip  used_pin_number  online_order     fraud
    distance_from_home                        1.000000                        0.000193                       -0.001374         0.143124  -0.000697        -0.001622     -0.001301  0.187571
    distance_from_last_transaction            0.000193                        1.000000                        0.001013        -0.000928   0.002055        -0.000899      0.000141  0.091917
    ratio_to_median_purchase_price           -0.001374                        0.001013                        1.000000         0.001374   0.000587         0.000942     -0.000330  0.462305
    repeat_retailer                           0.143124                       -0.000928                        0.001374         1.000000  -0.001345        -0.000417     -0.000532 -0.001357
    used_chip                                -0.000697                        0.002055                        0.000587        -0.001345   1.000000        -0.001393     -0.000219 -0.060975
    used_pin_number                          -0.001622                       -0.000899                        0.000942        -0.000417  -0.001393         1.000000     -0.000291 -0.100293
    online_order                             -0.001301                        0.000141                       -0.000330        -0.000532  -0.000219        -0.000291      1.000000  0.191973
    fraud                                     0.187571                        0.091917                        0.462305        -0.001357  -0.060975        -0.100293      0.191973  1.000000

    From the table above:

    • The features most strongly correlated with fraud are ratio_to_median_purchase_price (0.46), online_order (0.19), and distance_from_home (0.19); repeat_retailer is essentially uncorrelated with fraud.

    Feature distributions

    import matplotlib.pyplot as plt
    import matplotlib.gridspec as gridspec
    import seaborn as sns

    plt.rcParams['font.sans-serif'] = ['SimHei']  # set the font before plotting so the titles render

    # feature names
    feature_num = len(df.columns)
    v_feat = list(df.columns)

    # one subplot per feature
    plt.figure(figsize=(16, feature_num * 4))
    gs = gridspec.GridSpec(feature_num, 1)

    for i, cn in enumerate(df[v_feat]):
        ax = plt.subplot(gs[i])
        # note: distplot is deprecated in recent seaborn; histplot/kdeplot are the replacements
        sns.distplot(df[cn][df["fraud"] == 1], bins=50)
        sns.distplot(df[cn][df["fraud"] == 0], bins=100)
        ax.set_xlabel('')
        ax.set_title('Feature histogram: ' + str(cn))
    plt.show()    # display the figure
    

    [Figure: per-feature distributions, fraud vs. non-fraud]

    Feature engineering

    Standardization

    # import all the tooling in one place
    import numpy as np
    import pandas as pd
    import os
    
    from sklearn.preprocessing import StandardScaler
    from sklearn.compose import ColumnTransformer
    from sklearn.model_selection import train_test_split, GridSearchCV
    from sklearn.ensemble import RandomForestClassifier
    # note: plot_confusion_matrix requires scikit-learn < 1.2 (see the confusion-matrix section below)
    from sklearn.metrics import roc_auc_score, classification_report, roc_curve, auc, plot_confusion_matrix, precision_score, recall_score, f1_score 
    
    from imblearn.over_sampling import SMOTE
    from imblearn.under_sampling import RandomUnderSampler
    from imblearn.pipeline import Pipeline
    from joblib import dump, load
    
    import matplotlib.pyplot as plt
    import matplotlib.gridspec as gridspec
    import seaborn as sns
    
    # variance of each column (df.var() returns variances, not standard deviations)
    df.var()  
    
    
    distance_from_home                4275.954684
    distance_from_last_transaction     667.865469
    ratio_to_median_purchase_price       7.837698
    repeat_retailer                      0.104430
    used_chip                            0.227620
    used_pin_number                      0.090486
    online_order                         0.227334
    fraud                                0.079764
    dtype: float64
    
    • The first two features have far larger variances than the rest, so we standardize them: z = (x − mean) / std. Otherwise a model can mistake a feature's large scale for high importance, so we keep the feature scales as comparable as possible.
    ### standardize with StandardScaler directly
    from sklearn.preprocessing import StandardScaler
    old_dfh = df['distance_from_home'].values.reshape(-1, 1)
    print('distance_from_home std before standardization:', old_dfh.std())

    norm_dfh = StandardScaler().fit_transform(df['distance_from_home'].values.reshape(-1, 1))

    print('distance_from_home std after standardization:', norm_dfh.std())

    ### the same for distance_from_last_transaction
    old_dflt = df['distance_from_last_transaction'].values.reshape(-1, 1)
    print('distance_from_last_transaction std before standardization:', old_dflt.std())

    norm_dflt = StandardScaler().fit_transform(df['distance_from_last_transaction'].values.reshape(-1, 1))

    print('distance_from_last_transaction std after standardization:', norm_dflt.std())
    
    distance_from_home std before standardization: 65.39075170364431
    distance_from_home std after standardization: 1.0000000000000004
    distance_from_last_transaction std before standardization: 25.843080339696936
    distance_from_last_transaction std after standardization: 1.0
    
    # wrap the scaler in a ColumnTransformer so the standardization of the two distance columns can be reused inside every pipeline below
    column_trans = Pipeline([('scaler', StandardScaler())])
    
    preprocessing = ColumnTransformer(
        transformers=[
            ('column_trans', column_trans, ['distance_from_home','distance_from_last_transaction'])
        ], remainder='passthrough'
    )
    

    Over-sampling and under-sampling

    • SMOTE over-samples by synthesizing new minority-class samples: for each minority sample a, pick a random sample b from among its nearest minority-class neighbors, then create a new sample at a random point on the line segment between a and b (see the sketch below).

    • Under-sampling instead randomly keeps only a subset of the majority class so that the two class counts match, which throws away most of the data.
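
    A minimal numpy sketch of the interpolation step SMOTE performs (illustration only; the pipelines below use imblearn.over_sampling.SMOTE):

    # Synthesize minority samples by interpolating between each sample and a
    # randomly chosen one of its k nearest minority-class neighbors.
    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def smote_like(X_minority, n_new, k=5, seed=0):
        rng = np.random.default_rng(seed)
        nn = NearestNeighbors(n_neighbors=k + 1).fit(X_minority)
        _, idx = nn.kneighbors(X_minority)      # idx[:, 0] is each point itself
        new_samples = []
        for _ in range(n_new):
            i = rng.integers(len(X_minority))   # random minority sample a
            j = idx[i, rng.integers(1, k + 1)]  # random neighbor b of a
            lam = rng.random()                  # random point on the segment a-b
            new_samples.append(X_minority[i] + lam * (X_minority[j] - X_minority[i]))
        return np.array(new_samples)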

    Splitting the data

    x = df.drop('fraud', axis=1)
    y = df['fraud']

    X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

    # check the shapes
    print('x_train.shape:', X_train.shape)
    print('y_train.shape:', y_train.shape)
    print('x_test.shape:', X_test.shape)
    print('y_test.shape:', y_test.shape)
    
    x_train.shape: (800000, 7)
    y_train.shape: (800000,)
    x_test.shape: (200000, 7)
    y_test.shape: (200000,)
    

    SMOTE over-sampling

    ### SMOTE on its own
    print('Before over-sampling, count of label 1:', len(y_train[y_train == 1]))
    print('Before over-sampling, count of label 0:', len(y_train[y_train == 0]))
    over_sampler = SMOTE(random_state=0)
    X_os_train, y_os_train = over_sampler.fit_resample(X_train, y_train)
    print('After over-sampling, count of label 1:', len(y_os_train[y_os_train == 1]))
    print('After over-sampling, count of label 0:', len(y_os_train[y_os_train == 0]))
    
    Before over-sampling, count of label 1: 69960
    Before over-sampling, count of label 0: 730040
    After over-sampling, count of label 1: 730040
    After over-sampling, count of label 0: 730040
    

    Random under-sampling

    ### RandomUnderSampler on its own
    print('Before under-sampling, count of label 1:', len(y_train[y_train == 1]))
    print('Before under-sampling, count of label 0:', len(y_train[y_train == 0]))
    under_sampler = RandomUnderSampler(random_state=0)
    X_us_train, y_us_train = under_sampler.fit_resample(X_train, y_train)
    print('After under-sampling, count of label 1:', len(y_us_train[y_us_train == 1]))
    print('After under-sampling, count of label 0:', len(y_us_train[y_us_train == 0]))
    
    Before under-sampling, count of label 1: 69960
    Before under-sampling, count of label 0: 730040
    After under-sampling, count of label 1: 69960
    After under-sampling, count of label 0: 69960
    

    Building the models with pipelines

    Before a model is fit, the data must be transformed (standardization and so on), and the very same transforms have to be reapplied every time the model predicts or is evaluated. A Pipeline bundles the preprocessing and the model fitting together, which cuts down repeated code and keeps the transforms consistent.

    The intermediate steps of a Pipeline are sklearn-compatible transformers (each implements fit and transform, so all the preprocessing can be folded in); the final step is an estimator (the model), which only needs to implement fit.
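
    One detail worth calling out: the Pipeline used here comes from imblearn, which additionally accepts samplers such as SMOTE as intermediate steps and applies them only during fit, never at predict time. A minimal sketch, reusing the preprocessing transformer and the split defined above:

    # Preprocessing -> sampler -> classifier. SMOTE resamples the training data
    # inside fit(); predict() and score() see the original, unresampled data.
    from imblearn.pipeline import Pipeline
    from imblearn.over_sampling import SMOTE
    from sklearn.tree import DecisionTreeClassifier

    demo_pipe = Pipeline([
        ('preprocessing', preprocessing),
        ('sampler', SMOTE(random_state=0)),
        ('classifier', DecisionTreeClassifier()),
    ])
    demo_pipe.fit(X_train, y_train)          # SMOTE is applied here only
    print(demo_pipe.score(X_test, y_test))   # evaluated on the untouched test data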

    from sklearn.linear_model import SGDClassifier      # stochastic gradient descent
    from sklearn.neighbors import KNeighborsClassifier  # k-nearest neighbors
    from sklearn.tree import DecisionTreeClassifier     # decision tree
    from sklearn.ensemble import RandomForestClassifier # random forest
    from sklearn.model_selection import cross_val_score # cross-validated accuracy
    from sklearn.model_selection import GridSearchCV    # grid search for the best hyperparameters
    from sklearn.model_selection import StratifiedKFold # stratified cross-validation
    from collections import Counter
    from xgboost import XGBClassifier
    # evaluation metrics
    from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score, roc_auc_score, accuracy_score, classification_report

    from sklearn.ensemble import BaggingClassifier # bagging ensemble
       
    

    Training scores of the over-sampling pipelines

    classifiers = {
        "KNN": KNeighborsClassifier(),     # k-nearest neighbors
        'DT': DecisionTreeClassifier(),    # decision tree
        'RFC': RandomForestClassifier(),   # random forest
        'Bagging': BaggingClassifier(),    # bagging ensemble
        'SGD': SGDClassifier(),            # stochastic gradient descent
        'XGB': XGBClassifier()             # XGBoost
    }
    
    def accuracy_scores(x_train, y_train):
        for key, classifier in classifiers.items(): # train and score each classifier in turn
            over_pipe = Pipeline([
                ('preprocessing', preprocessing),
                ('sampler', SMOTE()),  # the data is highly imbalanced, so over-sample the minority class with SMOTE
                ('classifier', classifier)
            ])
            # cross_val_score clones and refits the pipeline itself, so no separate fit is needed
            training_score = cross_val_score(over_pipe, x_train, y_train, cv=5)  # 5-fold cross-validation
            print("Classifier Name : ", classifier.__class__.__name__, "  Training Score :", round(training_score.mean(), 4) * 100, '%')
    
    print("过采样的各个分类模型的训练分数:")
    accuracy_scores(X_train,y_train)
    
    Training scores of each classifier with SMOTE over-sampling:
    Classifier Name :  KNeighborsClassifier   Training Score : 99.8 %
    Classifier Name :  DecisionTreeClassifier   Training Score : 100.0 %
    Classifier Name :  RandomForestClassifier   Training Score : 100.0 %
    Classifier Name :  BaggingClassifier   Training Score : 100.0 %
    Classifier Name :  SGDClassifier   Training Score : 92.77 %
    Classifier Name :  XGBClassifier   Training Score : 100.0 %
    

    Training scores of the under-sampling pipelines

    def under_accuracy_scores(x_train, y_train):
        for key, classifier in classifiers.items(): # train and score each classifier in turn
            # under-sampling
            under_pipe = Pipeline([
                ('preprocessing', preprocessing),
                ('sampler', RandomUnderSampler()),  # under-sample the majority class
                ('classifier', classifier)
            ])
            # cross_val_score clones and refits the pipeline itself, so no separate fit is needed
            training_score = cross_val_score(under_pipe, x_train, y_train, cv=5)  # 5-fold cross-validation
            print("Classifier Name : ", classifier.__class__.__name__, "  Training Score :", round(training_score.mean(), 4) * 100, '%')
    
    print("欠采样的各个分类模型的训练分数:")
    under_accuracy_scores(X_train,y_train)
    
    Training scores of each classifier with under-sampling:
    Classifier Name :  KNeighborsClassifier   Training Score : 99.26 %
    Classifier Name :  DecisionTreeClassifier   Training Score : 99.99 %
    Classifier Name :  RandomForestClassifier   Training Score : 99.99 %
    Classifier Name :  BaggingClassifier   Training Score : 99.99 %
    Classifier Name :  SGDClassifier   Training Score : 92.46 %
    Classifier Name :  XGBClassifier   Training Score : 99.99 %
    

    Comparing the over-sampling and under-sampling training scores:

    • The over-sampled models score slightly higher than the under-sampled ones.

    • Among the over-sampled models, the decision tree, random forest, Bagging, and XGBoost models reach a 100% training score, so we keep these four as our classifiers.

    • Next we grid-search each of the four for its best hyperparameters and then fuse them.

    Grid search for each model's best hyperparameters

    • cross_val_score:

    returns the per-fold cross-validation scores; to select hyperparameters with it, you normally write the loop over candidate values yourself.

    • GridSearchCV:

    runs the cross-validation itself and additionally returns the best hyperparameters and the corresponding refitted model.
    GridSearchCV is therefore more convenient than cross_val_score, though hand-rolling the loop with cross_val_score is instructive for understanding the details.
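
    For intuition, a minimal sketch of the manual loop that GridSearchCV automates (a hypothetical depth grid, reusing the split defined above):

    # Hand-rolled model selection: loop a parameter grid over cross_val_score
    # and keep the best-scoring value. GridSearchCV does exactly this, plus a
    # final refit on the full training set.
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    best_depth, best_score = None, -1.0
    for depth in [2, 3, 4]:
        clf = DecisionTreeClassifier(max_depth=depth)
        score = cross_val_score(clf, X_train, y_train, cv=4, scoring='roc_auc').mean()
        if score > best_score:
            best_depth, best_score = depth, score
    print('best max_depth:', best_depth, ' roc_auc:', round(best_score, 4))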

    # 1. Decision tree: search for the best hyperparameters
    def DT_gs(x_train, y_train):
        DT_param = {
            'classifier__criterion': ['gini', 'entropy'],         # split criterion
            'classifier__max_depth': list(range(2, 5, 1)),        # tree depth
            'classifier__min_samples_leaf': list(range(2, 7, 1))  # minimum samples per leaf
        }
        DT_pipe = Pipeline([
            ('preprocessing', preprocessing),
            ('sampler', SMOTE()),  # over-sample the minority class with SMOTE
            ('classifier', DecisionTreeClassifier())
        ])
        dt_gs = GridSearchCV(estimator=DT_pipe, param_grid=DT_param, n_jobs=-1, verbose=50, cv=4, scoring='roc_auc')
        dt_gs.fit(x_train, y_train)

        dt_best_estimators = dt_gs.best_estimator_  # pipeline refit with the best hyperparameters

        return dt_best_estimators
    
    
    
    # 2. Random forest: search for the best hyperparameters
    def RFC_gs(x_train, y_train):
        grid_search_models = {
            'classifier__n_estimators': [25, 50, 75, 100, 200]  # example grid; other parameters could be searched too
        }
        RFC_over_pipe = Pipeline([
            ('preprocessing', preprocessing),
            ('sampler', SMOTE()),  # over-sample the minority class with SMOTE
            ('classifier', RandomForestClassifier())
        ])
        pipe = GridSearchCV(RFC_over_pipe, grid_search_models, verbose=50, cv=5, scoring='roc_auc')
        pipe.fit(x_train, y_train)
        bst = pipe.best_estimator_  # pipeline refit with the best hyperparameters
        return bst
    
       
    
    # 3. Bagging: search for the best hyperparameters
    def bag_gs(x_train, y_train):
        BAG_param = {
            'classifier__n_estimators': [10, 15, 20]  # number of base estimators in the ensemble
        }
        bag_over_pipe = Pipeline([
            ('preprocessing', preprocessing),
            ('sampler', SMOTE()),  # over-sample the minority class with SMOTE
            ('classifier', BaggingClassifier())
        ])
        bag_over_pipe = GridSearchCV(bag_over_pipe, BAG_param, verbose=50, cv=5, scoring='roc_auc')
        bag_over_pipe.fit(x_train, y_train)
        bag_bst = bag_over_pipe.best_estimator_  # pipeline refit with the best hyperparameters
        return bag_bst
    
    
    
    # 4. XGBoost: search for the best hyperparameters
    def xgb_gs(x_train, y_train):
        XGB_param = {
            'classifier__max_depth': [3, 4, 5, 6]
        }
        xgb_over_pipe = Pipeline([
            ('preprocessing', preprocessing),
            ('sampler', SMOTE()),  # over-sample the minority class with SMOTE
            ('classifier', XGBClassifier())
        ])
        xgb_over_pipe = GridSearchCV(xgb_over_pipe, XGB_param, verbose=50, cv=5, scoring='roc_auc')
        xgb_over_pipe.fit(x_train, y_train)
        xgb_bst = xgb_over_pipe.best_estimator_  # pipeline refit with the best hyperparameters
        return xgb_bst
    
    
    # run the four grid searches
    DT_best_estimator = DT_gs(X_train, y_train)
    
    RFC_best_estimator = RFC_gs(X_train, y_train)
    
    BAG_best_estimator = bag_gs(X_train, y_train)
    
    XGB_best_estimator = xgb_gs(X_train,y_train)
    
    
    print('Best estimators of the 4 models:')
    
    print('DT_best_estimator:',DT_best_estimator)
    print('RFC_best_estimator:',RFC_best_estimator)
    print('BAG_best_estimator:',BAG_best_estimator)
    print('XGB_best_estimator:',XGB_best_estimator)
    
    Best estimators of the 4 models:
    DT_best_estimator: Pipeline(steps=[('preprocessing',
                     ColumnTransformer(remainder='passthrough',
                                       transformers=[('column_trans',
                                                      Pipeline(steps=[('scaler',
                                                                       StandardScaler())]),
                                                      ['distance_from_home',
                                                       'distance_from_last_transaction'])])),
                    ('sampler', SMOTE()),
                    ('classifier',
                     DecisionTreeClassifier(max_depth=4, min_samples_leaf=4))])
    RFC_best_estimator: Pipeline(steps=[('preprocessing',
                     ColumnTransformer(remainder='passthrough',
                                       transformers=[('column_trans',
                                                      Pipeline(steps=[('scaler',
                                                                       StandardScaler())]),
                                                      ['distance_from_home',
                                                       'distance_from_last_transaction'])])),
                    ('sampler', SMOTE()),
                    ('classifier', RandomForestClassifier(n_estimators=75))])
    BAG_best_estimator: Pipeline(steps=[('preprocessing',
                     ColumnTransformer(remainder='passthrough',
                                       transformers=[('column_trans',
                                                      Pipeline(steps=[('scaler',
                                                                       StandardScaler())]),
                                                      ['distance_from_home',
                                                       'distance_from_last_transaction'])])),
                    ('sampler', SMOTE()), ('classifier', BaggingClassifier())])
    XGB_best_estimator: Pipeline(steps=[('preprocessing',
                     ColumnTransformer(remainder='passthrough',
                                       transformers=[('column_trans',
                                                      Pipeline(steps=[('scaler',
                                                                       StandardScaler())]),
                                                      ['distance_from_home',
                                                       'distance_from_last_transaction'])])),
                    ('sampler', SMOTE()),
                    ('classifier',
                     XGBClassifier(base_score=0.5, booster='gbtree', callbacks=None,
                                   colsample_bylevel=1, colsample_bynode=1,
                                   colsa...
                                   gamma=0, gpu_id=-1, grow_policy='depthwise',
                                   importance_type=None, interaction_constraints='',
                                   learning_rate=0.300000012, max_bin=256,
                                   max_cat_to_onehot=4, max_delta_step=0,
                                   max_depth=3, max_leaves=0, min_child_weight=1,
                                   missing=nan, monotone_constraints='()',
                                   n_estimators=100, n_jobs=0, num_parallel_tree=1,
                                   predictor='auto', random_state=0, reg_alpha=0,
                                   reg_lambda=1, ...))])
    

    Saving the four tuned models

    import joblib as jl
    # decision tree
    jl.dump(DT_best_estimator, './模型保存/dt.pkl')
    # random forest
    jl.dump(RFC_best_estimator, './模型保存/rfc.pkl')
    # bagging
    jl.dump(BAG_best_estimator, './模型保存/bag.pkl')
    # xgboost
    jl.dump(XGB_best_estimator, './模型保存/xgb.pkl')
    

    Evaluating the four models

    Evaluation metrics

    precision_recall_fscore_support computes the precision, recall, F-score, and support of the predictions against the true labels, one value per class; the support is the number of occurrences of each class among the true labels.
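
    A tiny illustration (with hypothetical labels) of what the function returns, and why the .mean() calls in the helper below amount to an unweighted macro average:

    # precision_recall_fscore_support returns one value per class.
    from sklearn.metrics import precision_recall_fscore_support

    p, r, f, s = precision_recall_fscore_support([0, 0, 1, 1], [0, 1, 1, 1])
    print(p)  # per-class precision: [1.0, 0.6667]
    print(r)  # per-class recall:    [0.5, 1.0]
    print(s)  # support (true occurrences per class): [2, 2]
    print(p.mean())  # unweighted (macro) average precision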

    
    from sklearn.metrics import precision_recall_fscore_support
    from sklearn.metrics import accuracy_score


    def calculate(models, x_test, y_test):
        # compute every metric for each model
        accuracy_results = []
        F1_score_results = []
        Recall_results = []
        Precision_results = []
        AUC_ROC_results = []

        for model in models:
            y_pred = model.predict(x_test)
            accuracy = accuracy_score(y_test, y_pred)  # accuracy
            precision, recall, f1_score, _ = precision_recall_fscore_support(y_test, y_pred)  # per-class precision, recall, F1
            AUC_ROC = roc_auc_score(y_test, y_pred)  # AUC computed from the hard 0/1 predictions

            # store the (macro-averaged) values
            accuracy_results.append(round(accuracy, 4))
            F1_score_results.append(round(f1_score.mean(), 4))
            Recall_results.append(round(recall.mean(), 4))
            AUC_ROC_results.append(AUC_ROC)
            Precision_results.append(round(precision.mean(), 4))

        return accuracy_results, F1_score_results, Recall_results, AUC_ROC_results, Precision_results
    
    
    # put the four tuned models together
    best_models = [DT_best_estimator, RFC_best_estimator, BAG_best_estimator, XGB_best_estimator]

    # compute the metrics for each model
    accuracy_results, F1_score_results, Recall_results, AUC_ROC_results, Precision_results = calculate(best_models, X_test, y_test)

    # collect the values in a DataFrame
    result_df = pd.DataFrame(columns=['Accuracy', 'F1-score', 'Recall', 'Precision', 'AUC_ROC'],
                             index=['DT', 'RFC', 'Bagging', 'XGBoost'])
    result_df['Accuracy'] = accuracy_results
    result_df['F1-score'] = F1_score_results
    result_df['Recall'] = Recall_results
    result_df['Precision'] = Precision_results
    result_df['AUC_ROC'] = AUC_ROC_results

    result_df
    
             Accuracy  F1-score  Recall  Precision   AUC_ROC
    DT         0.9934    0.9799  0.9927     0.9680  0.992728
    RFC        1.0000    1.0000  0.9999     1.0000  0.999940
    Bagging    1.0000    0.9999  0.9999     0.9999  0.999929
    XGBoost    1.0000    0.9999  1.0000     0.9999  0.999986

    Visualizing the AUC_ROC scores.

    # bar chart of the AUC scores (newer seaborn versions require keyword arguments here)
    g = sns.barplot(x='AUC_ROC', y=result_df.index, data=result_df, palette='hsv', orient='h')
    

    [Figure: horizontal bar chart of each model's AUC_ROC]

    Computing each model's AUC

    # hard 0/1 predictions from each tuned model
    DT_pred = DT_best_estimator.predict(X_test)
    RFC_pred = RFC_best_estimator.predict(X_test)
    BAG_pred = BAG_best_estimator.predict(X_test)
    XGB_pred = XGB_best_estimator.predict(X_test)
    print(DT_pred)
    print(RFC_pred)
    print(BAG_pred)
    print(XGB_pred)
    
    [0. 0. 0. ... 1. 0. 0.]
    [0. 0. 0. ... 1. 0. 0.]
    [0. 0. 0. ... 1. 0. 0.]
    [0 0 0 ... 1 0 0]
    
    # AUC scores from the hard predictions
    print('Decision tree AUC:', round(roc_auc_score(y_test, DT_pred), 2))
    print('Random forest AUC:', round(roc_auc_score(y_test, RFC_pred), 2))
    print('Bagging AUC:', round(roc_auc_score(y_test, BAG_pred), 2))
    print('XGBoost AUC:', round(roc_auc_score(y_test, XGB_pred), 2))
    
    Decision tree AUC: 0.99
    Random forest AUC: 1.0
    Bagging AUC: 1.0
    XGBoost AUC: 1.0
    
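    One caveat: computing roc_auc_score on hard 0/1 predictions collapses the ROC curve to a single operating point. Passing predicted probabilities gives the conventional AUC; a sketch for one of the models:

    # AUC from class probabilities rather than hard labels. The tuned pipelines
    # expose predict_proba because their final estimators do.
    DT_proba = DT_best_estimator.predict_proba(X_test)[:, 1]
    print('Decision tree AUC (from probabilities):', round(roc_auc_score(y_test, DT_proba), 4))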

    Plotting each model's ROC curve:

    
    
    
    DT_fpr, DT_tpr, DT_threshold = roc_curve(y_test, DT_pred)

    RFC_fpr, RFC_tpr, RFC_threshold = roc_curve(y_test, RFC_pred)

    BAG_fpr, BAG_tpr, BAG_threshold = roc_curve(y_test, BAG_pred)

    XGB_fpr, XGB_tpr, XGB_threshold = roc_curve(y_test, XGB_pred)
    
    # plot one ROC curve
    def graph_roc(fpr, tpr, name, score):
        plt.figure(figsize=(8, 4))  # canvas size
        plt.title("ROC Curve", fontsize=14)
        plt.plot(fpr, tpr, 'b', label=name + "  AUC: " + str(round(score, 2)))
        plt.plot([0, 1], [0, 1], color='r', linestyle='--')
        plt.axis([-0.01, 1, 0, 1])  # axis limits
        plt.xlabel("False Positive Rate (FPR)", fontsize=14)
        plt.ylabel("True Positive Rate (TPR)", fontsize=14)
        plt.legend()
        plt.show()
    
    
    # decision tree
    graph_roc(DT_fpr, DT_tpr, 'DT', roc_auc_score(y_test, DT_pred))
    # random forest
    graph_roc(RFC_fpr, RFC_tpr, 'RFC', roc_auc_score(y_test, RFC_pred))
    # bagging
    graph_roc(BAG_fpr, BAG_tpr, 'BAG', roc_auc_score(y_test, BAG_pred))
    # XGBoost
    graph_roc(XGB_fpr, XGB_tpr, 'XGB', roc_auc_score(y_test, XGB_pred))
    


    [Figures: ROC curves for DT, RFC, BAG, and XGB]

    Evaluation summary

    All four selected models score very high on accuracy, precision, recall, F1, the ROC curve, and AUC, indicating strong, trustworthy classification performance.

    Loading and fusing the models

    import joblib
    model1 = joblib.load(filename="./模型保存/dt.pkl")
    model2 = joblib.load('./模型保存/rfc.pkl')
    model3 = joblib.load('./模型保存/bag.pkl')
    model4 = joblib.load('./模型保存/xgb.pkl')
    
    # combine the 4 strong models into a single voting ensemble
    from sklearn.ensemble import VotingClassifier

    voting_clf = VotingClassifier(estimators=[('DT', model1), ('RFC', model2), ('BAG', model3),
                                              ('XGB', model4)], n_jobs=-1, voting='soft')
    voting_clf
    
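    With voting='soft' the ensemble averages the four models' predicted class probabilities (all four pipelines expose predict_proba); voting='hard' would instead take a majority of the predicted labels. A toy illustration of the difference, with made-up numbers:

    # Three models' class-probability rows: [P(class 0), P(class 1)].
    import numpy as np

    probas = np.array([
        [0.40, 0.60],
        [0.55, 0.45],
        [0.52, 0.48],
    ])
    print(probas.mean(axis=0).argmax())                 # soft vote -> class 1
    print(np.bincount(probas.argmax(axis=1)).argmax())  # hard vote -> class 0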

    Training the fused model

    import pandas as pd
    import numpy as np
    df = pd.read_csv("数据集/card_transdata.csv", encoding='utf-8')  # adjust the path to your machine

    x = df.drop('fraud', axis=1)
    y = df['fraud']
    from sklearn.model_selection import train_test_split, GridSearchCV
    X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

    # check the shapes
    print('x_train.shape:', X_train.shape)
    print('y_train.shape:', y_train.shape)
    print('x_test.shape:', X_test.shape)
    print('y_test.shape:', y_test.shape)
    
    x_train.shape: (800000, 7)
    y_train.shape: (800000,)
    x_test.shape: (200000, 7)
    y_test.shape: (200000,)
    
    # train the ensemble
    voting_clf.fit(X_train, y_train)
    
    # training score
    from sklearn.model_selection import cross_val_score  # cross-validated accuracy
    training_score = cross_val_score(voting_clf, X_train, y_train, cv=5)  # 5-fold cross-validation
    print("Fused model training score:", round(training_score.mean(), 4) * 100, '%')
    
    Fused model training score: 100.0 %
    
    # predict on the test set
    y_final_pred = voting_clf.predict(X_test)
    y_final_pred
    
    array([0., 0., 0., ..., 1., 0., 0.])
    

    Evaluating the fused model

    from sklearn.metrics import roc_auc_score, classification_report, roc_curve, auc, plot_confusion_matrix, precision_score, recall_score, f1_score 
    

    Confusion matrix

    import matplotlib.pyplot as plt
    plot_confusion_matrix(voting_clf, X_test, y_test)  # requires scikit-learn < 1.2
    plt.show()    # display the figure
    

    [Figure: confusion matrix of the fused model on the test set]
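
    Note: plot_confusion_matrix was deprecated in scikit-learn 1.0 and removed in 1.2. On newer versions, the equivalent is:

    # Drop-in replacement available since scikit-learn 1.0.
    from sklearn.metrics import ConfusionMatrixDisplay
    ConfusionMatrixDisplay.from_estimator(voting_clf, X_test, y_test)
    plt.show()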

    Precision, recall, and the F1 score that combines them

    y_preds = voting_clf.predict(X_test)
    p = precision_score(y_test, y_preds)
    r = recall_score(y_test, y_preds)
    f1 = f1_score(y_test, y_preds)

    print("precision: ", p)
    print("recall: ", r)
    print("F1: ", f1)
    
    precision:  0.9998853408243995
    recall:  0.9998853408243995
    F1:  0.9998853408243995
    
    print(classification_report(y_test, y_preds))  # full per-class report
    
                  precision    recall  f1-score   support
    
             0.0       1.00      1.00      1.00    182557
             1.0       1.00      1.00      1.00     17443
    
        accuracy                           1.00    200000
       macro avg       1.00      1.00      1.00    200000
    weighted avg       1.00      1.00      1.00    200000
    

    Saving the fused model

    joblib.dump(voting_clf,'./模型保存/best_model.pkl')
    

    Closing note:

    Writing this up took real effort. If you found it useful, please consider following before you go. Thanks!

  • Original post: https://blog.csdn.net/shine_Lee_/article/details/126324091