    hands-on-data-analysis Unit 3: Model Building and Evaluation

    1. Model Building

    1.1. Importing the Required Libraries

    import pandas as pd
    import numpy as np
    # matplotlib.pyplot and seaborn are plotting libraries
    import matplotlib.pyplot as plt
    import seaborn as sns
    from IPython.display import Image
    
    # display figures inline
    %matplotlib inline
    
    plt.rcParams['font.sans-serif'] = ['SimHei']  # render Chinese labels correctly
    plt.rcParams['axes.unicode_minus'] = False  # render minus signs correctly
    plt.rcParams['figure.figsize'] = (10, 6)  # set the default figure size
    

    1.2. Loading the Dataset

    # read the raw dataset
    train = pd.read_csv('train.csv')
    train.shape
    

    Output:

    (891, 12)
    

    1.3. Exploring the Dataset

    train.head()
    

    Output:

       PassengerId  Survived  Pclass                                              Name     Sex   Age  SibSp  Parch            Ticket     Fare Cabin Embarked
    0            1         0       3                           Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500   NaN        S
    1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th…  female  38.0      1      0          PC 17599  71.2833   C85        C
    2            3         1       3                            Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250   NaN        S
    3            4         1       1      Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0            113803  53.1000  C123        S
    4            5         0       3                          Allen, Mr. William Henry    male  35.0      0      0            373450   8.0500   NaN        S
    
    train.info()
    
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 891 entries, 0 to 890
    Data columns (total 12 columns):
     #   Column       Non-Null Count  Dtype  
    ---  ------       --------------  -----  
     0   PassengerId  891 non-null    int64  
     1   Survived     891 non-null    int64  
     2   Pclass       891 non-null    int64  
     3   Name         891 non-null    object 
     4   Sex          891 non-null    object 
     5   Age          714 non-null    float64
     6   SibSp        891 non-null    int64  
     7   Parch        891 non-null    int64  
     8   Ticket       891 non-null    object 
     9   Fare         891 non-null    float64
     10  Cabin        204 non-null    object 
     11  Embarked     889 non-null    object 
    dtypes: float64(2), int64(5), object(5)
    memory usage: 83.7+ KB
    

    As you can see, the raw data still needs cleaning. The cleaned dataset looks like this:

    # read the cleaned dataset
    data = pd.read_csv('clear_data.csv')
    
    data.head()
    
       PassengerId  Pclass   Age  SibSp  Parch     Fare  Sex_female  Sex_male  Embarked_C  Embarked_Q  Embarked_S
    0            0       3  22.0      1      0   7.2500           0         1           0           0           1
    1            1       1  38.0      1      0  71.2833           1         0           1           0           0
    2            2       3  26.0      0      0   7.9250           1         0           0           0           1
    3            3       1  35.0      1      0  53.1000           1         0           0           0           1
    4            4       3  35.0      0      0   8.0500           0         1           0           0           1
    
    data.info()
    

    Output:

    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 891 entries, 0 to 890
    Data columns (total 11 columns):
     #   Column       Non-Null Count  Dtype  
    ---  ------       --------------  -----  
     0   PassengerId  891 non-null    int64  
     1   Pclass       891 non-null    int64  
     2   Age          891 non-null    float64
     3   SibSp        891 non-null    int64  
     4   Parch        891 non-null    int64  
     5   Fare         891 non-null    float64
     6   Sex_female   891 non-null    int64  
     7   Sex_male     891 non-null    int64  
     8   Embarked_C   891 non-null    int64  
     9   Embarked_Q   891 non-null    int64  
     10  Embarked_S   891 non-null    int64  
    dtypes: float64(2), int64(9)
    memory usage: 76.7 KB
    
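    clear_data.csv ships pre-cleaned with the course materials. As a rough sketch of the kind of transformation involved (the imputation choice and the dropped columns are my guesses, not necessarily the course's exact recipe), something like this produces the same column layout:

    # drop the label and the hard-to-encode text columns, impute Age, one-hot encode the rest
    tmp = train.drop(columns=['Survived', 'Name', 'Ticket', 'Cabin'])
    tmp['Age'] = tmp['Age'].fillna(tmp['Age'].median())
    tmp = pd.get_dummies(tmp, columns=['Sex', 'Embarked'])
    tmp.columns.tolist()  # same columns as clear_data.csv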

    1.4. Building a Model

    scikit-learn's algorithm selection path (the "Choosing the right estimator" flowchart in the scikit-learn documentation)

    Splitting the dataset

    # train_test_split is the function used to split the dataset
    from sklearn.model_selection import train_test_split
    
    # Usually we extract X and y first and then split; occasionally the unsplit data is needed too.
    # X is the cleaned feature table; y is the 'Survived' column we want to predict.
    X = data
    y = train['Survived']
    
    # split the dataset; stratify=y keeps the class proportions equal in both parts
    X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)
    
    # check the shapes
    X_train.shape, X_test.shape
    

    Output:

    ((668, 11), (223, 11))
    
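    Because of stratify=y, the survival ratio should be nearly identical in the two splits; a quick check (my addition, not in the original notebook):

    # class proportions in each split
    print(y_train.value_counts(normalize=True).round(2))
    print(y_test.value_counts(normalize=True).round(2))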
    X_train.info()
    

    Output:

    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 668 entries, 671 to 80
    Data columns (total 11 columns):
     #   Column       Non-Null Count  Dtype  
    ---  ------       --------------  -----  
     0   PassengerId  668 non-null    int64  
     1   Pclass       668 non-null    int64  
     2   Age          668 non-null    float64
     3   SibSp        668 non-null    int64  
     4   Parch        668 non-null    int64  
     5   Fare         668 non-null    float64
     6   Sex_female   668 non-null    int64  
     7   Sex_male     668 non-null    int64  
     8   Embarked_C   668 non-null    int64  
     9   Embarked_Q   668 non-null    int64  
     10  Embarked_S   668 non-null    int64  
    dtypes: float64(2), int64(9)
    memory usage: 82.6 KB
    
    X_test.info()
    

    Output:

    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 223 entries, 288 to 633
    Data columns (total 11 columns):
     #   Column       Non-Null Count  Dtype  
    ---  ------       --------------  -----  
     0   PassengerId  223 non-null    int64  
     1   Pclass       223 non-null    int64  
     2   Age          223 non-null    float64
     3   SibSp        223 non-null    int64  
     4   Parch        223 non-null    int64  
     5   Fare         223 non-null    float64
     6   Sex_female   223 non-null    int64  
     7   Sex_male     223 non-null    int64  
     8   Embarked_C   223 non-null    int64  
     9   Embarked_Q   223 non-null    int64  
     10  Embarked_S   223 non-null    int64  
    dtypes: float64(2), int64(9)
    memory usage: 30.9 KB
    

    1.5. Importing Models

    1.5.1. Logistic regression with default parameters

    from sklearn.linear_model import LogisticRegression
    from sklearn.ensemble import RandomForestClassifier
    
    lr = LogisticRegression()
    lr.fit(X_train, y_train)
    

    Output:

    LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                       intercept_scaling=1, l1_ratio=None, max_iter=100,
                       multi_class='auto', n_jobs=None, penalty='l2',
                       random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                       warm_start=False)
    
    # check the score on the training and test sets
    print("Training set score: {:.2f}".format(lr.score(X_train, y_train)))
    print("Testing set score: {:.2f}".format(lr.score(X_test, y_test)))
    
    Training set score: 0.80
    Testing set score: 0.79
    

    1.5.2. Logistic regression with adjusted parameters

    lr2 = LogisticRegression(C=100)
    lr2.fit(X_train, y_train)
    

    Output:

    LogisticRegression(C=100, class_weight=None, dual=False, fit_intercept=True,
                       intercept_scaling=1, l1_ratio=None, max_iter=100,
                       multi_class='auto', n_jobs=None, penalty='l2',
                       random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                       warm_start=False)
    
    print("Training set score: {:.2f}".format(lr2.score(X_train, y_train)))
    print("Testing set score: {:.2f}".format(lr2.score(X_test, y_test)))
    

    Output:

    Training set score: 0.79
    Testing set score: 0.78
    
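    In LogisticRegression, C is the inverse of the regularization strength, so a larger C means weaker regularization. A minimal sketch for comparing a few settings (the value grid is an arbitrary illustrative choice of mine):

    # sweep the inverse regularization strength C
    for C in [0.01, 0.1, 1, 10, 100]:
        model = LogisticRegression(C=C, max_iter=1000)  # max_iter raised so lbfgs converges
        model.fit(X_train, y_train)
        print("C={:>6}: train {:.2f}, test {:.2f}".format(
            C, model.score(X_train, y_train), model.score(X_test, y_test)))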

    1.5.3. Random forest classifier with default parameters

    rfc = RandomForestClassifier()
    rfc.fit(X_train, y_train)
    

    Output:

    RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                           criterion='gini', max_depth=None, max_features='auto',
                           max_leaf_nodes=None, max_samples=None,
                           min_impurity_decrease=0.0, min_impurity_split=None,
                           min_samples_leaf=1, min_samples_split=2,
                           min_weight_fraction_leaf=0.0, n_estimators=100,
                           n_jobs=None, oob_score=False, random_state=None,
                           verbose=0, warm_start=False)
    
    print("Training set score: {:.2f}".format(rfc.score(X_train, y_train)))
    print("Testing set score: {:.2f}".format(rfc.score(X_test, y_test)))
    

    Output:

    Training set score: 1.00
    Testing set score: 0.82
    
    A perfect training score paired with a noticeably lower test score suggests the default random forest is overfitting; constraining the tree depth in the next section narrows the gap.

    1.5.4. Random forest classifier with adjusted parameters

    rfc2 = RandomForestClassifier(n_estimators=100, max_depth=5)
    rfc2.fit(X_train, y_train)
    

    Output:

    RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                           criterion='gini', max_depth=5, max_features='auto',
                           max_leaf_nodes=None, max_samples=None,
                           min_impurity_decrease=0.0, min_impurity_split=None,
                           min_samples_leaf=1, min_samples_split=2,
                           min_weight_fraction_leaf=0.0, n_estimators=100,
                           n_jobs=None, oob_score=False, random_state=None,
                           verbose=0, warm_start=False)
    
    print("Training set score: {:.2f}".format(rfc2.score(X_train, y_train)))
    print("Testing set score: {:.2f}".format(rfc2.score(X_test, y_test)))
    

    Output:

    Training set score: 0.87
    Testing set score: 0.81
    
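    Instead of adjusting n_estimators and max_depth by hand, a small grid can be searched automatically with GridSearchCV; a sketch (the grid values and random_state are illustrative choices of mine):

    from sklearn.model_selection import GridSearchCV

    param_grid = {'n_estimators': [50, 100, 200], 'max_depth': [3, 5, 7]}
    grid = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5)
    grid.fit(X_train, y_train)
    print(grid.best_params_, "mean CV score: {:.2f}".format(grid.best_score_))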

    1.6. Model Prediction

    Supervised models in sklearn generally provide a predict method that outputs the predicted labels, and a predict_proba method that outputs the class probabilities.

    # predict labels
    pred = lr.predict(X_train)
    
    # this gives an array of 0s and 1s
    pred[:10]
    

    Output:

    array([0, 1, 1, 1, 0, 0, 1, 0, 1, 1])
    
    # predict class probabilities
    pred_proba = lr.predict_proba(X_train)
    
    pred_proba[:10]
    

    Output:

    array([[0.60884602, 0.39115398],
           [0.17563455, 0.82436545],
           [0.40454114, 0.59545886],
           [0.1884778 , 0.8115222 ],
           [0.88013064, 0.11986936],
           [0.91411123, 0.08588877],
           [0.13260197, 0.86739803],
           [0.90571178, 0.09428822],
           [0.05273217, 0.94726783],
           [0.10924951, 0.89075049]])
    
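    Each row of pred_proba sums to 1, and the column order follows lr.classes_ (here [0, 1]). As a sanity check (a small sketch of my own), predict is equivalent to picking the higher-probability column:

    # predict() returns the class with the highest predicted probability
    manual_pred = lr.classes_[pred_proba.argmax(axis=1)]
    print((manual_pred == pred).all())  # expected: True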

    2. Model Evaluation

    2.1. Cross-Validation

    There are many forms of cross-validation. The first is the simplest and the most obvious one: split the dataset into two parts, a training set and a test set.
    
    [Figure: splitting the data into a training set and a test set]

    However, this simple approach has two drawbacks.

    1. The chosen model and parameters end up depending heavily on how the data happened to be split into training and test sets.

    2. Only part of the data is used for training, so the dataset is not fully exploited.

    To address these problems, several refinements were developed. The one discussed next is K-fold cross-validation:

    In K-fold cross-validation the test set in each round contains multiple samples rather than a single one; how many depends on the choice of K. For example, with K = 7 the procedure is (a minimal manual sketch follows the list):
    
    1. Split the dataset into 7 parts.
    
    2. In turn, take one part as the test set, train the model on the other 6 parts, and compute the model's error (e.g., MSE for a regression task) on the held-out part.
    
    3. Average the 7 error estimates to obtain the final score.
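    Here is that manual procedure sketched with scikit-learn's KFold splitter, using accuracy since our task is classification (the fold count, shuffle, and random_state are illustrative choices of mine):

    from sklearn.model_selection import KFold

    kf = KFold(n_splits=7, shuffle=True, random_state=0)
    fold_scores = []
    for train_idx, test_idx in kf.split(X_train):
        # train on 6 folds, evaluate on the held-out fold
        fold_model = LogisticRegression(max_iter=1000)  # max_iter raised so lbfgs converges
        fold_model.fit(X_train.iloc[train_idx], y_train.iloc[train_idx])
        fold_scores.append(fold_model.score(X_train.iloc[test_idx], y_train.iloc[test_idx]))
    print("Average score over 7 folds: {:.2f}".format(np.mean(fold_scores)))

    In practice, the cross_val_score helper below does exactly this bookkeeping for us.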

    from sklearn.model_selection import cross_val_score
    
    lr = LogisticRegression(C=100)
    scores = cross_val_score(lr, X_train, y_train, cv=10)
    
    # the k-fold cross-validation scores
    scores
    

    Output:

    array([0.82089552, 0.74626866, 0.74626866, 0.7761194 , 0.88059701,
           0.8358209 , 0.76119403, 0.8358209 , 0.74242424, 0.75757576])
    
    # average cross-validation score
    print("Average cross-validation score: {:.2f}".format(scores.mean()))
    

    Output:

    Average cross-validation score: 0.79
    

    2.2. Confusion Matrix

    A confusion matrix summarizes a classifier's results. For k-class classification it is simply a k x k table that tallies the classifier's predictions against the true labels.
    
    The confusion matrix lives in sklearn's sklearn.metrics module.
    
    It takes the true labels and the predicted labels as input.
    
    Precision, recall, and the F-score can be obtained from the classification_report function.
    
    As a quick read on model quality, look at the main diagonal of the confusion matrix: the larger its counts are relative to the off-diagonal cells, the better.

    from sklearn.metrics import confusion_matrix
    
    • 1
    # train the model
    lr = LogisticRegression(C=100)
    lr.fit(X_train, y_train)
    
    LogisticRegression(C=100, class_weight=None, dual=False, fit_intercept=True,
                       intercept_scaling=1, l1_ratio=None, max_iter=100,
                       multi_class='auto', n_jobs=None, penalty='l2',
                       random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                       warm_start=False)
    
    # model predictions
    pred = lr.predict(X_train)
    
    # confusion matrix
    confusion_matrix(y_train, pred)
    
    array([[354,  58],
           [ 83, 173]])
    
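    For a binary problem the four cells can be unpacked with ravel(); this small sketch (my own addition, with hypothetical variable names) recomputes the class-1 metrics that the classification report below will show:

    # order for labels [0, 1]: true negatives, false positives, false negatives, true positives
    tn, fp, fn, tp = confusion_matrix(y_train, pred).ravel()
    precision = tp / (tp + fp)  # of predicted survivors, the fraction that actually survived
    recall = tp / (tp + fn)     # of actual survivors, the fraction we caught
    f1 = 2 * precision * recall / (precision + recall)
    print("precision {:.2f}, recall {:.2f}, f1 {:.2f}".format(precision, recall, f1))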
    # classification report
    from sklearn.metrics import classification_report
    
    # precision, recall, and f1-score
    print(classification_report(y_train, pred))
    
                  precision    recall  f1-score   support
    
               0       0.81      0.86      0.83       412
               1       0.75      0.68      0.71       256
    
        accuracy                           0.79       668
       macro avg       0.78      0.77      0.77       668
    weighted avg       0.79      0.79      0.79       668
    
    

    2.3. ROC Curve

    The ROC curve originated with radar operators in World War II. Each operator's job was to interpret the signals on the radar screen, but radar technology at the time was noisy, so every blip had to be judged. Cautious operators tended to read any signal as an enemy bomber, while more casual ones tended to read it as a flock of birds. A set of metrics was needed to summarize each operator's judgments and to assess the reliability of the radar itself, and so the earliest ROC analysis was born. It has since been widely used in medicine and machine learning.

    ROC stands for Receiver Operating Characteristic Curve (in Chinese, 受试者工作特征曲线).

    The ROC utilities live in sklearn's sklearn.metrics module.

    The larger the area enclosed under the ROC curve, the better the classifier.

    from sklearn.metrics import roc_curve
    
    fpr, tpr, thresholds = roc_curve(y_test, lr.decision_function(X_test))
    plt.plot(fpr, tpr, label="ROC Curve")
    plt.xlabel("FPR")
    plt.ylabel("TPR (recall)")
    # find the threshold closest to 0, i.e. the default decision boundary of decision_function
    close_zero = np.argmin(np.abs(thresholds))
    plt.plot(fpr[close_zero], tpr[close_zero], 'o', markersize=10, label="threshold zero", fillstyle="none", c='k', mew=2)
    plt.legend(loc=4)
    

    [Figure: ROC curve for the logistic regression model, with the zero-threshold point marked]
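    To quantify "the larger the area, the better", the area under the curve can be computed directly with roc_auc_score; a minimal sketch (my addition, not part of the original notebook):

    from sklearn.metrics import roc_auc_score

    # AUC: 0.5 is chance level, 1.0 is a perfect ranking
    print("LR AUC: {:.2f}".format(roc_auc_score(y_test, lr.decision_function(X_test))))
    # random forests have no decision_function; use the positive-class probability instead
    print("RF AUC: {:.2f}".format(roc_auc_score(y_test, rfc2.predict_proba(X_test)[:, 1])))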

    3. References

    【机器学习】Cross-Validation(交叉验证)详解 - 知乎 (zhihu.com)

    https://www.jianshu.com/p/2ca96fce7e81

  • Original article: https://blog.csdn.net/CANGYE0504/article/details/125436816