• Hands-on ensemble learning with sklearn


Ensemble learning: voting and bagging

Voting

sklearn provides two voting estimators, VotingRegressor and VotingClassifier. To use them you pass a list of models, where each model is given as a tuple: the first element is a name and the second is the estimator itself, and every name must be unique. See the example below:

    from sklearn.linear_model import LogisticRegression
    from sklearn.svm import SVC
    from sklearn.ensemble import VotingClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    
models = [('lr', LogisticRegression()), ('svm', SVC())]
ensemble = VotingClassifier(estimators=models)  # hard voting

models = [('lr', LogisticRegression()), ('svm', make_pipeline(StandardScaler(), SVC()))]
ensemble = VotingClassifier(estimators=models, voting='soft')  # soft voting

We can use an example to gauge how much the ensemble improves on the individual models.

First we create a random dataset with 1000 samples and 20 features:

    from sklearn.datasets import make_classification
    def get_dataset():
    X, y = make_classification(n_samples = 1000, # 1000 samples
                              n_features = 20,  # 20 features in total
                              n_informative = 15,  # 15 informative features
                              n_redundant = 5, # 5 redundant features
                                  random_state = 2)
        return X, y
    

A quick note on the parameters of make_classification:

• n_samples: number of samples, default 100
• n_features: total number of features, default 20
• n_informative: number of informative features
• n_redundant: number of redundant features, generated as random linear combinations of the informative features
• n_repeated: number of duplicated features, drawn at random from the informative and redundant features
• n_classes: number of classes
• n_clusters_per_class: number of clusters per class
• random_state: random seed

We use KNN models as the base models:

from sklearn.neighbors import KNeighborsClassifier  # needed by the KNN base models below

def get_voting():
    models = list()
    models.append(('knn1', KNeighborsClassifier(n_neighbors=1)))
    models.append(('knn3', KNeighborsClassifier(n_neighbors=3)))
    models.append(('knn5', KNeighborsClassifier(n_neighbors=5)))
    models.append(('knn7', KNeighborsClassifier(n_neighbors=7)))
    models.append(('knn9', KNeighborsClassifier(n_neighbors=9)))
    ensemble = VotingClassifier(estimators=models, voting='hard')
    return ensemble

To show how each model compares, add the following function:

    def get_models():
        models = dict()
        models['knn1'] = KNeighborsClassifier(n_neighbors=1)
        models['knn3'] = KNeighborsClassifier(n_neighbors=3)
        models['knn5'] = KNeighborsClassifier(n_neighbors=5)
        models['knn7'] = KNeighborsClassifier(n_neighbors=7)
        models['knn9'] = KNeighborsClassifier(n_neighbors=9)
        models['hard_voting'] = get_voting()
        return models
    

Next, define the following function, which returns the scores of 3 repeats of stratified 10-fold cross-validation as a list:

    from sklearn.model_selection import cross_val_score
    from sklearn.model_selection import RepeatedStratifiedKFold
    def evaluate_model(model, X, y):
        cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state = 1)
    # repeated, stratified K-fold cross-validation with a different shuffle on each repeat
        scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1,
                                error_score='raise')
        return scores
    

Now put it all together:

    from sklearn.neighbors import KNeighborsClassifier
    import matplotlib.pyplot as plt
    X, y = get_dataset()
    models = get_models()
    results, names = list(), list()
    
    for name, model in models.items():
        score = evaluate_model(model,X, y)
        results.append(score)
        names.append(name)
        print("%s %.3f  (%.3f)" % (name, score.mean(), score.std()))
        
    plt.boxplot(results, labels = names, showmeans = True)
    plt.show()
    
    knn1 0.873  (0.030)
    knn3 0.889  (0.038)
    knn5 0.895  (0.031)
    knn7 0.899  (0.035)
    knn9 0.900  (0.033)
    hard_voting 0.902  (0.034)
    

[Figure: box plots of the KNN base models and the hard-voting ensemble]

We can see the score keeps improving as k grows, and the hard-voting ensemble does best of all.
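As a complementary experiment, hard voting can be swapped for soft voting, which averages the predicted class probabilities instead of counting label votes. A minimal sketch, assuming the helper functions above have already been defined (KNN supports predict_proba, so voting='soft' works directly):

def get_soft_voting():
    # same KNN base models as get_voting(), but average their predicted probabilities
    models = [('knn%d' % k, KNeighborsClassifier(n_neighbors=k)) for k in (1, 3, 5, 7, 9)]
    return VotingClassifier(estimators=models, voting='soft')

soft_scores = evaluate_model(get_soft_voting(), X, y)
print('soft_voting %.3f  (%.3f)' % (soft_scores.mean(), soft_scores.std()))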

    bagging

Likewise, we generate a dataset and use a simple example to show how the corresponding bagging estimator works:

    from numpy import mean
    from numpy import std
    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score
    from sklearn.model_selection import RepeatedStratifiedKFold
    from sklearn.ensemble import BaggingClassifier
    
    X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=5)
    model = BaggingClassifier()
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
    print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))
    
    Accuracy: 0.861 (0.042)
    
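BaggingClassifier uses decision trees as base estimators by default, but both the base model and the sampling behaviour can be changed. A minimal hedged sketch of the main knobs, reusing X, y and cv from the block above (the values are illustrative, not tuned; the base model is passed positionally because its keyword was renamed from base_estimator to estimator in newer sklearn versions):

from sklearn.ensemble import BaggingClassifier
from sklearn.neighbors import KNeighborsClassifier

# bag 50 KNN models, each trained on 80% of the rows and 80% of the columns
bag = BaggingClassifier(KNeighborsClassifier(),
                        n_estimators=50,
                        max_samples=0.8,
                        max_features=0.8,
                        random_state=1)
bag_scores = cross_val_score(bag, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
print('Bagged KNN accuracy: %.3f (%.3f)' % (mean(bag_scores), std(bag_scores)))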

    Boosting

For an introduction to the theory behind boosting, see my other blog post.

Here we keep the focus on how to use the corresponding code.

    Adaboost

First import the packages:

    import numpy as np
    import pandas as pd 
    import matplotlib.pyplot as plt
    plt.style.use("ggplot")
    %matplotlib inline
    import seaborn as sns
    
# Load the training data:
    wine = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data",header=None)
    wine.columns = ['Class label', 'Alcohol', 'Malic acid', 'Ash', 'Alcalinity of ash','Magnesium', 'Total phenols','Flavanoids', 'Nonflavanoid phenols', 
                    'Proanthocyanins','Color intensity', 'Hue','OD280/OD315 of diluted wines','Proline']
    
# Inspect the data:
    print("Class labels",np.unique(wine["Class label"]))
    wine.head()
    
    Class labels [1 2 3]
    

[Figure: the first rows of the wine dataset (wine.head())]

Next we need to preprocess the data:

# Data preprocessing
# Keep only class 2 and class 3 wines, drop class 1
wine = wine[wine['Class label'] != 1]
y = wine['Class label'].values
X = wine[['Alcohol','OD280/OD315 of diluted wines']].values

# Encode the class labels as integers:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y = le.fit_transform(y)

# Split into training and test sets (80:20)
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=1,stratify=y)  # stratify=y keeps the class proportions equal in both splits

A quick note on LabelEncoder. The most concise official description is:

Encode labels with value between 0 and n_classes-1.

In other words, it maps the original labels onto [0, n_classes-1], one integer per class. Its main methods are:

• fit and transform:

  le = LabelEncoder()
  le = le.fit(['A','B','C'])  # fit le on all the labels
  data = le.transform(data)   # convert the original labels to codes

• fit_transform:

  le = LabelEncoder()
  data = le.fit_transform(data)

• inverse_transform: recover the original class labels from the codes

  le.inverse_transform([0,1,2])
  # output: ['A','B','C']

Back to AdaBoost.

Next we compare the performance of a single decision tree against the ensemble.

# Model with a single decision tree
    from sklearn.tree import DecisionTreeClassifier
    tree = DecisionTreeClassifier(criterion='entropy',random_state=1,max_depth=1)
    from sklearn.metrics import accuracy_score
    tree = tree.fit(X_train,y_train)
    y_train_pred = tree.predict(X_train)
    y_test_pred = tree.predict(X_test)
    tree_train = accuracy_score(y_train,y_train_pred)
    tree_test = accuracy_score(y_test,y_test_pred)
    print('Decision tree train/test accuracies %.3f/%.3f' % (tree_train,tree_test))
    
    Decision tree train/test accuracies 0.916/0.875
    

Now build the AdaBoost model:

    from sklearn.ensemble import AdaBoostClassifier
ada = AdaBoostClassifier(base_estimator=tree,  # base classifier (newer sklearn versions name this parameter estimator)
                         n_estimators=500, # maximum number of boosting iterations
                            learning_rate=0.1,
                            random_state= 1)
    ada.fit(X_train, y_train)
    y_train_pred = ada.predict(X_train)
    y_test_pred = ada.predict(X_test)
    ada_train = accuracy_score(y_train,y_train_pred)
    ada_test = accuracy_score(y_test,y_test_pred)
    print('Adaboost train/test accuracies %.3f/%.3f' % (ada_train,ada_test))
    
    Adaboost train/test accuracies 1.000/0.917
    

The accuracy improves quite a bit. Next we can look at how their decision boundaries differ:

    x_min = X_train[:, 0].min() - 1
    x_max = X_train[:, 0].max() + 1
    y_min = X_train[:, 1].min() - 1
    y_max = X_train[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.1),np.arange(y_min, y_max, 0.1))
    f, axarr = plt.subplots(nrows=1, ncols=2,sharex='col',sharey='row',figsize=(12, 6))
    for idx, clf, tt in zip([0, 1],[tree, ada],['Decision tree', 'Adaboost']):
        clf.fit(X_train, y_train)
        Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
        Z = Z.reshape(xx.shape)
        axarr[idx].contourf(xx, yy, Z, alpha=0.3)
        axarr[idx].scatter(X_train[y_train==0, 0],X_train[y_train==0, 1],c='blue', marker='^')
        axarr[idx].scatter(X_train[y_train==1, 0],X_train[y_train==1, 1],c='red', marker='o')
        axarr[idx].set_title(tt)
    axarr[0].set_ylabel('Alcohol', fontsize=12)
    plt.tight_layout()
    plt.text(0, -0.2,s='OD280/OD315 of diluted wines',ha='center',va='center',fontsize=12,transform=axarr[1].transAxes)
    plt.show()
    

[Figure: decision boundaries of the single decision tree vs. AdaBoost]

We can see that AdaBoost's decision boundary is far more complex than the single decision tree's.

Gradient boosting (GBDT)

For the theory I will reuse my study notes on the watermelon book:

GBDT is an iterative decision-tree algorithm built from many decision trees, whose outputs are summed to give the final answer. It is introduced below from three angles: the regression decision tree, gradient boosting, and shrinkage.

    DT

First, a note on the two kinds of decision trees:

• Classification decision trees predict categorical labels, e.g. weather type or a user's gender
• Regression decision trees predict continuous real values, e.g. temperature or a user's age

The important difference is that the outputs of regression trees can be added together, while the outputs of classification trees cannot. If three regression trees output 10 years, 5 years and 2 years, their sum of 17 years is a usable result; but if three classification trees output male, male and female, there is no sensible way to add those up. This is why all the trees in GBDT are regression trees; the core of GBDT is that the final result is the accumulated output of all the trees.

A regression tree is simply a tree model used for regression: each leaf outputs one prediction, usually the mean of the training targets that fall into that leaf, i.e. $c_m = \mathrm{ave}(y_i \mid x_i \in \mathrm{leaf}_m)$.

The most common regression decision tree is CART, so we first look at how CART is applied to regression.

For regression, CART mainly uses mean squared error (MSE) or mean absolute error (MAE) to choose the splitting feature and split point. The goal is to build a function $f(x)$ that fits the elements of a dataset $D$ (with $n$ samples) so that the chosen metric is minimised; taking MSE, for example:
$$\min \frac{1}{n} \sum_{i=1}^{n}\big(f(x_i)-y_i\big)^2$$
For a fitted CART regression tree with $M$ leaves, the MSE objective can be written as:
$$\min \frac{1}{n}\sum_{m=1}^{M}\sum_{x_i \in R_m}(c_m - y_i)^2$$
where $R_m$ is the $m$-th leaf and $c_m$ is its predicted value.

To minimise the overall MSE it suffices to minimise the MSE of every leaf, and basic statistics tells us this is achieved by setting each leaf's prediction to the mean of the training targets it contains, i.e. $c_m = \mathrm{ave}(y_i \mid x_i \in \mathrm{leaf}_m)$.

CART therefore learns by scanning every variable and every split point within each variable, looking for the splitting variable $j$ and split point $s$ that minimise the MSE, i.e. minimising:
$$\min_{j,s}\Big[\min_{c_1}\sum_{x_i\in R_{1}(j,s)}(y_i-c_1)^2+\min_{c_2}\sum_{x_i\in R_{2}(j,s)}(y_i-c_2)^2\Big]$$
Another point worth understanding is why CART must be a binary tree: allowing more branches per split would multiply the number of candidate regions and make the search much harder. If finer regions are needed, it is enough to make the CART regression tree deeper, which keeps the search far cheaper.
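To make the $(j,s)$ search above concrete, here is a small self-contained sketch of an exhaustive search for a single split using plain NumPy. It is purely illustrative and ignores the efficiency tricks real CART implementations use:

import numpy as np

def best_split(X, y):
    """Return (feature j, threshold s, error) minimising the two-sided squared error."""
    n, d = X.shape
    best = (None, None, np.inf)
    for j in range(d):                        # loop over variables
        for s in np.unique(X[:, j]):          # loop over candidate split points
            left, right = y[X[:, j] <= s], y[X[:, j] > s]
            if len(left) == 0 or len(right) == 0:
                continue
            # each side predicts its own mean, c_1 and c_2
            err = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if err < best[2]:
                best = (j, s, err)
    return best

rng = np.random.RandomState(0)
X_demo = rng.rand(100, 3)
y_demo = 2 * X_demo[:, 1] + 0.1 * rng.randn(100)   # feature 1 matters most
print(best_split(X_demo, y_demo))                   # should pick j = 1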

    Gradient Boosting

The gradient boosting procedure is somewhat similar to AdaBoost's. The difference is that AdaBoost adjusts the weights of correctly and incorrectly classified samples at each round, whereas every round of gradient boosting learns to reduce the previous round's error. The algorithm is as follows:

Step 1. Given a training set T and a loss function L; here take L to be the squared error (MSE).

Step 2. Initialise:
$$f_0(x)=\arg\min_c\sum_{i=1}^{N}L(y_i,c)=\arg\min_c\sum_{i=1}^{N}(y_i - c)^2$$
which is solved by:
$$f_0(x)=\frac{1}{N}\sum_{i=1}^{N} y_i = c$$
Step 3. With a target of $M$ base learners (regression trees), for $m = 1, 2, \dots, M$:

(1) For $i = 1, 2, \dots, N$, compute
$$r_{mi}=-\left[\frac{\partial L(y_i,f(x_i))}{\partial f(x_i)}\right]_{f(x)=f_{m-1}(x)}$$
(2) Fit a regression tree to the $r_{mi}$, giving leaf regions $R_{mj}$, $j = 1, 2, \dots, J$

(3) For $j = 1, 2, \dots, J$, compute the output value of leaf $R_{mj}$:
$$c_{mj}=\arg\min_c\sum_{x_i \in R_{mj}}L\big(y_i,f_{m-1}(x_i)+c\big)$$
(4) Update $f_m(x)=f_{m-1}(x)+\sum_{j=1}^{J}c_{mj}I(x \in R_{mj})$

Step 4. Obtain the final regression model:
$$\hat{f}(x)=f_M(x)=\sum_{m=1}^{M} \sum_{j=1}^{J}c_{mj}I(x\in R_{mj})$$
Why can the negative gradient of the loss be used as an approximation of the residual in regression? Because for the MSE loss, the derivative is exactly the difference between the predicted and true values, so the negative gradient is the residual $(y_i - f(x_i))$. Fitting the next regression tree to the negative gradient and summing the trees therefore brings the model ever closer to the true values.
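To tie Steps 2 to 4 together, here is a tiny from-scratch sketch of gradient boosting with the squared-error loss, where fitting the negative gradient is simply fitting the residuals. It uses DecisionTreeRegressor as the base learner and only mirrors the algorithm above; it is not a replacement for sklearn's implementation:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gbdt_fit(X, y, M=50, max_depth=2):
    f0 = y.mean()                      # Step 2: constant initial model
    trees, pred = [], np.full_like(y, f0, dtype=float)
    for m in range(M):                 # Step 3
        residual = y - pred            # negative gradient of the squared error
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residual)
        trees.append(tree)
        pred += tree.predict(X)        # Step 3(4): update f_m
    return f0, trees

def gbdt_predict(model, X):
    f0, trees = model
    return f0 + sum(t.predict(X) for t in trees)

rng = np.random.RandomState(0)
X_toy = rng.rand(200, 1)
y_toy = np.sin(4 * X_toy[:, 0]) + 0.1 * rng.randn(200)
model = gbdt_fit(X_toy, y_toy)
print(np.mean((gbdt_predict(model, X_toy) - y_toy) ** 2))  # training MSE shrinks as M grows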

    Shrinkage

Shrinkage is a method for regularising GBDT against overfitting. The idea is to reduce how far each iteration moves toward the residual: taking many small steps over more iterations tends to work better than taking a few large steps over fewer iterations. Concretely, each tree's contribution is multiplied by a learning rate before being added:
$$f_m(x)=f_{m-1}(x)+\gamma\sum_{j=1}^{J}c_{mj}I(x\in R_{mj})$$
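In sklearn's gradient boosting estimators this $\gamma$ is the learning_rate parameter. A small hedged sketch comparing a large and a small learning rate on a toy dataset (the values are illustrative; staged_predict yields the test prediction after each added tree):

from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_friedman1(n_samples=1000, noise=1.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for lr in (1.0, 0.1):   # no shrinkage vs. strong shrinkage
    gbr = GradientBoostingRegressor(n_estimators=200, learning_rate=lr,
                                    max_depth=2, random_state=0).fit(X_tr, y_tr)
    # test error after every boosting stage; shrinkage usually wins once there are enough trees
    staged_mse = [mean_squared_error(y_te, p) for p in gbr.staged_predict(X_te)]
    print('learning_rate=%.1f  best test MSE %.3f' % (lr, min(staged_mse)))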


Implementation

Below we explain the GBDT model's parameters and build a model.

Import the relevant libraries:

    from sklearn.metrics import mean_squared_error
    from sklearn.datasets import make_friedman1
    from sklearn.ensemble import GradientBoostingRegressor
    

GBDT also has a classification counterpart, GradientBoostingClassifier.

The model's parameters are summarised below:

• loss: {'ls', 'lad', 'huber', 'quantile'}, default='ls'. 'ls' is least-squares regression; 'lad' (least absolute deviation) is a highly robust loss based only on the rank order of the inputs; 'huber' combines the two; 'quantile' enables quantile regression (alpha specifies the quantile)
• learning_rate: shrinks the contribution of each tree by learning_rate; there is a trade-off between learning_rate and n_estimators
• n_estimators: the number of boosting stages to run, i.e. the number of base learners
• subsample: the fraction of samples used to fit each base learner; values below 1.0 give stochastic gradient boosting
• criterion: {'friedman_mse', 'mse', 'mae'}, default='friedman_mse'. 'mse' is mean squared error and 'mae' mean absolute error; the default 'friedman_mse' is usually best because it can give a better approximation in some cases
• min_samples_split: the minimum number of samples required to split an internal node
• min_samples_leaf: the minimum number of samples required at a leaf node
• min_weight_fraction_leaf: the minimum weighted fraction of the total sample weights required at a leaf; samples have equal weight if sample_weight is not provided
• max_depth: the maximum depth of each regression tree, which limits the number of nodes; tune it for best performance, as the best value depends on how the input variables interact
• min_impurity_decrease: a node is split if the split decreases the impurity (loss) by at least this value
• min_impurity_split: threshold for early stopping of tree growth; a node splits only if its impurity is above this threshold
• max_features: {'auto', 'sqrt', 'log2'}, int or float: the number of features to consider when looking for the best split

The parameters may be hard to take in at first, but they become familiar with use.

    X, y = make_friedman1(n_samples=1200, random_state=0, noise=1.0)
    X_train, X_test = X[:200], X[200:]
    y_train, y_test = y[:200], y[200:]
    est = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1,
        max_depth=1, random_state=0, loss='ls').fit(X_train, y_train)
    mean_squared_error(y_test, est.predict(X_test))
    
    5.009154859960321
    
    from sklearn.datasets import make_regression
    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.model_selection import train_test_split
    X, y = make_regression(random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, random_state=0)
    reg = GradientBoostingRegressor(random_state=0)
    reg.fit(X_train, y_train)
    reg.score(X_test, y_test)
    

    XGBoost

As for XGBoost's theory, it is essentially the same as GBDT except that the Taylor expansion keeps the second-order term, and the library adds many engineering optimisations on top. Setting its parameters is also fairly involved, so here we simply cover basic usage.
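Besides the native DMatrix interface used below, xgboost also provides a scikit-learn style wrapper, which is often the fastest way to try it out. A minimal hedged sketch (the parameter values are illustrative and exact defaults depend on the installed xgboost version):

import xgboost as xgb
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

clf = xgb.XGBClassifier(n_estimators=200,      # number of boosting rounds
                        max_depth=4,
                        learning_rate=0.1,
                        subsample=0.8,
                        colsample_bytree=0.8)
clf.fit(X_tr, y_tr)
print('test accuracy: %.3f' % accuracy_score(y_te, clf.predict(X_te)))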

Installing XGBoost
pip install xgboost

Data interface

xgboost uses its own DMatrix data type, so reading data in looks a little different:

# 1. LibSVM text files
dtrain = xgb.DMatrix('train.svm.txt')
dtest = xgb.DMatrix('test.svm.buffer')
# 2. CSV files (must not contain categorical text columns; one-hot encode them first)
dtrain = xgb.DMatrix('train.csv?format=csv&label_column=0')
dtest = xgb.DMatrix('test.csv?format=csv&label_column=0')
# 3. NumPy arrays
data = np.random.rand(5, 10)  # 5 entities, each contains 10 features
label = np.random.randint(2, size=5)  # binary target
dtrain = xgb.DMatrix(data, label=label)
# 4. scipy.sparse arrays
csr = scipy.sparse.csr_matrix((dat, (row, col)))
dtrain = xgb.DMatrix(csr)
# 5. pandas DataFrame
data = pandas.DataFrame(np.arange(12).reshape((4,3)), columns=['a', 'b', 'c'])
label = pandas.DataFrame(np.random.randint(2, size=4))
dtrain = xgb.DMatrix(data, label=label)

After the first read you can save the DMatrix to a binary file to make later reads faster:

# 1. Save the DMatrix to an XGBoost binary file
dtrain = xgb.DMatrix('train.svm.txt')
dtrain.save_binary('train.buffer')
# 2. Missing values can be replaced by a default value given to the DMatrix constructor:
dtrain = xgb.DMatrix(data, label=label, missing=-999.0)
# 3. Weights can be set when needed:
w = np.random.rand(5, 1)
dtrain = xgb.DMatrix(data, label=label, missing=-999.0, weight=w)
Parameter settings
import pandas as pd
df_wine = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data',header=None)
df_wine.columns = ['Class label', 'Alcohol','Malic acid', 'Ash','Alcalinity of ash','Magnesium', 'Total phenols',
                   'Flavanoids', 'Nonflavanoid phenols','Proanthocyanins','Color intensity', 'Hue','OD280/OD315 of diluted wines','Proline'] 
df_wine = df_wine[df_wine['Class label'] != 1]  # drop class 1      
y = df_wine['Class label'].values
X = df_wine[['Alcohol','OD280/OD315 of diluted wines']].values
from sklearn.model_selection import train_test_split  # train/test split
from sklearn.preprocessing import LabelEncoder   # encode the class labels
import xgboost as xgb
le = LabelEncoder()
y = le.fit_transform(y)
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,
                                                 random_state=1,stratify=y)
# Build the DMatrix objects that xgboost expects
dtrain = xgb.DMatrix(X_train, label = y_train)
dtest = xgb.DMatrix(X_test)

# booster parameters
params = {
    "booster": 'gbtree',  # base learner
    "objective":  "multi:softmax",  # softmax objective
    "num_class": 10,  # number of classes
    "gamma": 0.1,  # prune a split unless it reduces the loss by more than gamma
    "max_depth": 12,  # tree depth; larger values overfit more easily
    "lambda":  2,  # L2 regularisation strength
    "subsample": 0.7,  # randomly sample 70% of the rows for each tree
    "colsample_bytree": 0.7,  # column subsampling per tree, similar to random subspaces
    "min_child_weight": 3,
    'verbosity': 0,  # 0 means silent; the silent parameter used in the original article is no longer supported in newer versions
    "eta": 0.007,  # effectively the learning rate
    "seed": 1000,  # random seed
    "nthread": 4,   # number of threads
    "eval_metric": "auc"
}
plst = list(params.items())  # must wrap in list() so the later copy works

If you see the error:

Parameters: { "silent" } are not used.

that is because newer versions have removed the silent parameter; change it to verbosity.

If you later see:

'dict_items' object has no attribute 'copy'

that is because the return value of items() was not wrapped in list(); change it as I did above.

Training and prediction
num_round = 10
bst = xgb.train(plst, dtrain, num_round)
# early_stopping_rounds=10 can be added to enable early stopping
# to use a validation set, pass e.g.
# evallist = [(deval,'eval'),(dtrain, 'train')]
# bst = xgb.train(plst, dtrain, num_round, evallist)
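As a concrete illustration of the evallist and early-stopping comments above, here is a hedged sketch. deval is built here from a held-out slice of the training data purely for demonstration, and eval_metric is switched to mlogloss because it pairs safely with the multi:softmax objective (with xgb.train, the last entry of evals is the one monitored for early stopping):

from sklearn.model_selection import train_test_split
X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=1)
dtr = xgb.DMatrix(X_tr, label=y_tr)
deval = xgb.DMatrix(X_val, label=y_val)
evallist = [(dtr, 'train'), (deval, 'eval')]   # 'eval' is last, so it drives early stopping

es_params = dict(params, eval_metric='mlogloss')   # copy of params with the metric swapped
bst_es = xgb.train(list(es_params.items()), dtr, num_boost_round=200,
                   evals=evallist, early_stopping_rounds=10)
print(bst_es.best_iteration)   # the boosting round with the best validation score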
    
ypred = bst.predict(dtest)  # predict, just like an sklearn model
xgb.plot_importance(bst)  # plot the feature importances
    
    <AxesSubplot:title={'center':'Feature importance'}, xlabel='F score', ylabel='Features'>
    

[Figure: feature importance plot]

Saving and loading models
bst.save_model("model_new_1121.model")
bst.dump_model("dump.raw.txt")
bst_new = xgb.Booster({'nthread':4})  # initialise the booster parameters first
bst_new.load_model("model_new_1121.model")
    
Simple examples
Classification example
from sklearn.datasets import load_iris
import xgboost as xgb
from xgboost import plot_importance
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score   # accuracy
# Load the sample dataset
iris = load_iris()
X,y = iris.data,iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, 
                                                    random_state=1234565) # split the dataset
# Algorithm parameters
params = {
    'booster': 'gbtree',
    'objective': 'multi:softmax',
    'num_class': 3,
    'gamma': 0.1,
    'max_depth': 6,
    'lambda': 2,
    'subsample': 0.7,
    'colsample_bytree': 0.75,
    'min_child_weight': 3,
    'verbosity': 0,
    'eta': 0.1,
    'seed': 1,
    'nthread': 4,
}
plst = list(params.items())
dtrain = xgb.DMatrix(X_train, y_train) # build the DMatrix
num_rounds = 500
model = xgb.train(plst, dtrain, num_rounds) # train the xgboost model
# Predict on the test set
dtest = xgb.DMatrix(X_test)
y_pred = model.predict(dtest)

# Compute the accuracy
accuracy = accuracy_score(y_test,y_pred)
print("accuarcy: %.2f%%" % (accuracy*100.0))

# Plot the important features
plot_importance(model)
plt.show()
    accuarcy: 96.67%
    

[Figure: feature importance plot for the iris classifier]

Regression example
    import xgboost as xgb
    from xgboost import plot_importance
    from matplotlib import pyplot as plt
    from sklearn.model_selection import train_test_split
    from sklearn.datasets import load_boston
    from sklearn.metrics import mean_squared_error
    
# Load the dataset (note: load_boston was removed in scikit-learn 1.2, so this example needs an older version)
    boston = load_boston()
    X,y = boston.data,boston.target
    
# XGBoost training
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    
    params = {
        'booster': 'gbtree',
    'objective': 'reg:squarederror',  # regression objective with squared error
        'gamma': 0.1,
        'max_depth': 5,
        'lambda': 3,
        'subsample': 0.7,
        'colsample_bytree': 0.7,
        'min_child_weight': 3,
        'verbosity': 1,
        'eta': 0.1,
        'seed': 1000,
        'nthread': 4,
    }
    dtrain = xgb.DMatrix(X_train, y_train)
    num_rounds = 300
    plst = list(params.items())
    model = xgb.train(plst, dtrain, num_rounds)
# Predict on the test set
    dtest = xgb.DMatrix(X_test)
    ans = model.predict(dtest)
    
# Plot the important features
    plot_importance(model)
    plt.show()
    

[Figure: feature importance plot for the Boston housing regressor]

Tuning XGBoost

The usual steps for tuning this model are:

1. Pick a learning rate and initial values for the boosting parameters
2. Tune max_depth and min_child_weight (see the sketch after this list)
3. Tune gamma
4. Tune subsample and colsample_bytree
5. Tune the regularisation parameter alpha
6. Lower the learning rate and use more trees
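For instance, step 2 can be tuned on its own small grid while everything else is held fixed. A hedged sketch using the sklearn wrapper on the iris data (the grid values are illustrative):

import xgboost as xgb
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)
base = xgb.XGBClassifier(learning_rate=0.1, n_estimators=200)   # step 1: fixed starting point
step2_grid = {'max_depth': [3, 5, 7],
              'min_child_weight': [1, 3, 5]}
gs = GridSearchCV(base, step2_grid, scoring='accuracy', cv=3)
gs.fit(X, y)
print(gs.best_params_)   # freeze these values before moving on to gamma (step 3)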

Grid search can be used for the tuning:

import xgboost as xgb
import pandas as pd
from sklearn.datasets import load_iris   # used below but missing from the original imports
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import roc_auc_score

iris = load_iris()
X,y = iris.data,iris.target
col = iris.target_names 
train_x, valid_x, train_y, valid_y = train_test_split(X, y, test_size=0.3, random_state=1)   # split into training and validation sets
    parameters = {
                  'max_depth': [5, 10, 15, 20, 25],
                  'learning_rate': [0.01, 0.02, 0.05, 0.1, 0.15],
                  'n_estimators': [500, 1000, 2000, 3000, 5000],
                  'min_child_weight': [0, 2, 5, 10, 20],
                  'max_delta_step': [0, 0.2, 0.6, 1, 2],
                  'subsample': [0.6, 0.7, 0.8, 0.85, 0.95],
                  'colsample_bytree': [0.5, 0.6, 0.7, 0.8, 0.9],
                  'reg_alpha': [0, 0.25, 0.5, 0.75, 1],
                  'reg_lambda': [0.2, 0.4, 0.6, 0.8, 1],
                  'scale_pos_weight': [0.2, 0.4, 0.6, 0.8, 1]
    
    }
    
    xlf = xgb.XGBClassifier(max_depth=10,
                learning_rate=0.01,
                n_estimators=2000,
                silent=True,
                objective='multi:softmax',
                num_class=3 ,          
                nthread=-1,
                gamma=0,
                min_child_weight=1,
                max_delta_step=0,
                subsample=0.85,
                colsample_bytree=0.7,
                colsample_bylevel=1,
                reg_alpha=0,
                reg_lambda=1,
                scale_pos_weight=1,
                seed=0,
                missing=None)
# the model needs reasonable initial values before the search
    gs = GridSearchCV(xlf, param_grid=parameters, scoring='accuracy', cv=3)
    gs.fit(train_x, train_y)
    
    print("Best score: %0.3f" % gs.best_score_)
    print("Best parameters set: %s" % gs.best_params_ )
    
    Best score: 0.933
    Best parameters set: {'max_depth': 5}
    

LightGBM

LightGBM, like XGBoost, is an ensemble algorithm. On the whole it is the same as XGBoost, with no essential difference in the algorithm itself; it simply adds further optimisations on top of XGBoost.
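Before the tuning script, here is a minimal hedged sketch of plain usage through LightGBM's scikit-learn wrapper (the parameter values are illustrative):

import lightgbm as lgb
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

clf = lgb.LGBMClassifier(n_estimators=200, learning_rate=0.1, num_leaves=31)
clf.fit(X_tr, y_tr)
print('test accuracy: %.3f' % accuracy_score(y_te, clf.predict(X_te)))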

Tuning it is a complicated subject in its own right, so here is the tuning code from the course:

import lightgbm as lgb
import pandas as pd   # pd.Series is used below but was missing from the original imports
from sklearn import metrics
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
     
    canceData=load_breast_cancer()
    X=canceData.data
    y=canceData.target
    X_train,X_test,y_train,y_test=train_test_split(X,y,random_state=0,test_size=0.2)
     
### Convert the data
print('Converting the data')
    lgb_train = lgb.Dataset(X_train, y_train, free_raw_data=False)
    lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train,free_raw_data=False)
     
### Initial parameters (excluding the cross-validation ones)
print('Setting the parameters')
    params = {
              'boosting_type': 'gbdt',
              'objective': 'binary',
              'metric': 'auc',
              'nthread':4,
              'learning_rate':0.1
              }
     
### Cross-validation (tuning)
print('Cross-validating')
    max_auc = float('0')
    best_params = {}
     
# Accuracy
print("Tuning step 1: improve accuracy")
    for num_leaves in range(5,100,5):
        for max_depth in range(3,8,1):
            params['num_leaves'] = num_leaves
            params['max_depth'] = max_depth
     
            cv_results = lgb.cv(
                                params,
                                lgb_train,
                                seed=1,
                                nfold=5,
                                metrics=['auc'],
                                early_stopping_rounds=10,
                                verbose_eval=True
                                )
                
            mean_auc = pd.Series(cv_results['auc-mean']).max()
            boost_rounds = pd.Series(cv_results['auc-mean']).idxmax()
                
            if mean_auc >= max_auc:
                max_auc = mean_auc
                best_params['num_leaves'] = num_leaves
                best_params['max_depth'] = max_depth
if 'num_leaves' in best_params.keys() and 'max_depth' in best_params.keys():  # the original "'a' and 'b' in d" only checked the second key
        params['num_leaves'] = best_params['num_leaves']
        params['max_depth'] = best_params['max_depth']
     
# Overfitting
print("Tuning step 2: reduce overfitting")
    for max_bin in range(5,256,10):
        for min_data_in_leaf in range(1,102,10):
                params['max_bin'] = max_bin
                params['min_data_in_leaf'] = min_data_in_leaf
                
                cv_results = lgb.cv(
                                    params,
                                    lgb_train,
                                    seed=1,
                                    nfold=5,
                                    metrics=['auc'],
                                    early_stopping_rounds=10,
                                    verbose_eval=True
                                    )
                        
                mean_auc = pd.Series(cv_results['auc-mean']).max()
                boost_rounds = pd.Series(cv_results['auc-mean']).idxmax()
     
                if mean_auc >= max_auc:
                    max_auc = mean_auc
                    best_params['max_bin']= max_bin
                    best_params['min_data_in_leaf'] = min_data_in_leaf
if 'max_bin' in best_params.keys() and 'min_data_in_leaf' in best_params.keys():
        params['min_data_in_leaf'] = best_params['min_data_in_leaf']
        params['max_bin'] = best_params['max_bin']
     
print("Tuning step 3: reduce overfitting")
    for feature_fraction in [0.6,0.7,0.8,0.9,1.0]:
        for bagging_fraction in [0.6,0.7,0.8,0.9,1.0]:
            for bagging_freq in range(0,50,5):
                params['feature_fraction'] = feature_fraction
                params['bagging_fraction'] = bagging_fraction
                params['bagging_freq'] = bagging_freq
                
                cv_results = lgb.cv(
                                    params,
                                    lgb_train,
                                    seed=1,
                                    nfold=5,
                                    metrics=['auc'],
                                    early_stopping_rounds=10,
                                    verbose_eval=True
                                    )
                        
                mean_auc = pd.Series(cv_results['auc-mean']).max()
                boost_rounds = pd.Series(cv_results['auc-mean']).idxmax()
     
                if mean_auc >= max_auc:
                    max_auc=mean_auc
                    best_params['feature_fraction'] = feature_fraction
                    best_params['bagging_fraction'] = bagging_fraction
                    best_params['bagging_freq'] = bagging_freq
     
if 'feature_fraction' in best_params.keys() and 'bagging_fraction' in best_params.keys() and 'bagging_freq' in best_params.keys():
        params['feature_fraction'] = best_params['feature_fraction']
        params['bagging_fraction'] = best_params['bagging_fraction']
        params['bagging_freq'] = best_params['bagging_freq']
     
     
print("Tuning step 4: reduce overfitting")
    for lambda_l1 in [1e-5,1e-3,1e-1,0.0,0.1,0.3,0.5,0.7,0.9,1.0]:
        for lambda_l2 in [1e-5,1e-3,1e-1,0.0,0.1,0.4,0.6,0.7,0.9,1.0]:
            params['lambda_l1'] = lambda_l1
            params['lambda_l2'] = lambda_l2
            cv_results = lgb.cv(
                                params,
                                lgb_train,
                                seed=1,
                                nfold=5,
                                metrics=['auc'],
                                early_stopping_rounds=10,
                                verbose_eval=True
                                )
                    
            mean_auc = pd.Series(cv_results['auc-mean']).max()
            boost_rounds = pd.Series(cv_results['auc-mean']).idxmax()
     
            if mean_auc >= max_auc:
                max_auc=mean_auc
                best_params['lambda_l1'] = lambda_l1
                best_params['lambda_l2'] = lambda_l2
if 'lambda_l1' in best_params.keys() and 'lambda_l2' in best_params.keys():
        params['lambda_l1'] = best_params['lambda_l1']
        params['lambda_l2'] = best_params['lambda_l2']
     
print("Tuning step 5: reduce overfitting (2)")
    for min_split_gain in [0.0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1.0]:
        params['min_split_gain'] = min_split_gain
        
        cv_results = lgb.cv(
                            params,
                            lgb_train,
                            seed=1,
                            nfold=5,
                            metrics=['auc'],
                            early_stopping_rounds=10,
                            verbose_eval=True
                            )
                
        mean_auc = pd.Series(cv_results['auc-mean']).max()
        boost_rounds = pd.Series(cv_results['auc-mean']).idxmax()
     
        if mean_auc >= max_auc:
            max_auc=mean_auc
            
            best_params['min_split_gain'] = min_split_gain
    if 'min_split_gain' in best_params.keys():
        params['min_split_gain'] = best_params['min_split_gain']
     
    print(best_params)
    
    {'bagging_fraction': 0.7,
    'bagging_freq': 30,
    'feature_fraction': 0.8,
    'lambda_l1': 0.1,
    'lambda_l2': 0.0,
    'max_bin': 255,
    'max_depth': 4,
    'min_data_in_leaf': 81,
    'min_split_gain': 0.1,
    'num_leaves': 10}
    

Blending and Stacking

Stacking is sometimes called the "lazy" algorithm in competitions, because it can give a decent model without spending much time on tuning.

Stacking can be understood as a two-layer ensemble: the first layer contains several base classifiers, which pass their predictions (meta-features) to the second layer; the second-layer classifier, usually logistic regression, fits the first-layer outputs as features and produces the final prediction.

Blending is a simplified version of Stacking, so we introduce Blending first.

Blending

Algorithm

The Blending procedure is:

• Split the data into a training set and a test set, then split the training set again into a training set and a validation set.
• Create several first-layer models (homogeneous or heterogeneous) and fit them on the training set.
• Once trained, the first-layer models predict on the validation set and the test set. With K models this gives $A_1,\dots,A_K$ and $B_1,\dots,B_K$, where each matrix holds one base classifier's outputs on the validation set or the test set.
• Create the second-layer classifier and train it on $A_1,\dots,A_K$: the number of samples is the validation-set size, the number of features is K, and the true validation labels are the targets.
• To predict on the test set, feed $B_1,\dots,B_K$ as features to the second-layer classifier.
Implementation

The concrete implementation is as follows:

    import numpy as np
    import pandas as pd 
    import matplotlib.pyplot as plt
    plt.style.use("ggplot")
    %matplotlib inline
    import seaborn as sns
# Create the data
from sklearn import datasets 
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
data, target = make_blobs(n_samples=10000, centers=2, random_state=1, cluster_std=1.0 )
## Create the training and test sets
X_train1,X_test,y_train1,y_test = train_test_split(data, target, test_size=0.2, random_state=1)
## Split the training set again into training and validation sets
X_train,X_val,y_train,y_val = train_test_split(X_train1, y_train1, test_size=0.3, random_state=1)
    print("The shape of training X:",X_train.shape)
    print("The shape of training y:",y_train.shape)
    print("The shape of test X:",X_test.shape)
    print("The shape of test y:",y_test.shape)
    print("The shape of validation X:",X_val.shape)
    print("The shape of validation y:",y_val.shape)
    
    The shape of training X: (5600, 2)
    The shape of training y: (5600,)
    The shape of test X: (2000, 2)
    The shape of test y: (2000,)
    The shape of validation X: (2400, 2)
    The shape of validation y: (2400,)
    
#  First-layer classifiers
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

clfs = [SVC(probability = True),RandomForestClassifier(n_estimators=5, n_jobs=-1, criterion='gini'),KNeighborsClassifier()]

# Second-layer classifier
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
# First-layer outputs on the validation and test sets
val_features = np.zeros((X_val.shape[0],len(clfs)))  # meta-features for the validation set
test_features = np.zeros((X_test.shape[0],len(clfs)))  # meta-features for the test set

for i,clf in enumerate(clfs):
    clf.fit(X_train,y_train)
    # predict_proba returns a score per class; column 1 is the probability of class 1
    val_feature = clf.predict_proba(X_val)[:, 1]
    test_feature = clf.predict_proba(X_test)[:,1]
    val_features[:,i] = val_feature
    test_features[:,i] = test_feature
# Train the second-layer model on the first-layer validation outputs
lr.fit(val_features,y_val)
# Evaluate the final predictions
from sklearn.model_selection import cross_val_score
cross_val_score(lr,test_features,y_test,cv=5)
    array([1., 1., 1., 1., 1.])
    

The cross-validation scores look very good.

I still have some questions about the small exercise, though: the iris data has 4 features and 3 classes. With 4 features, more than we can plot, how should the decision boundary be drawn; do we just drop some dimensions? And with 3 classes, taking only [:, 1], the probability of class 1, as the next layer's feature seems insufficient; completely discarding the probabilities of classes 0 and 2 doesn't seem right.

    Stacking

Algorithm

The following figure makes the procedure easy to follow:

[Figure: stacking workflow diagram]

• First split all the data into a test set and a training set. Suppose the training set has 10000 rows and the test set 2500 rows, and we run 5-fold cross-validation for the first-layer classifiers, so each fold has 2000 validation rows and 8000 training rows.
• Each fold trains a model on the 8000 rows, validates on the 2000 rows, and also predicts the 2500 test rows. After 5 folds, each base classifier therefore has five validation-prediction chunks ($5\times 2000$ values in total) and five sets of test predictions ($5\times 2500$ values in total).
• The five validation chunks are stacked into a 10000-row column, denoted $A_1$ (for the first base classifier), while the $5\times 2500$ test predictions are averaged into a 2500-row, 1-column matrix, denoted $B_1$.
• Assuming three base classifiers, there are six matrices $A_1,A_2,A_3,B_1,B_2,B_3$. Concatenate $A_1,A_2,A_3$ into a 10000-row, 3-column matrix as the training data, with the true validation labels as the targets; concatenate $B_1,B_2,B_3$ into a 2500-row, 3-column matrix as the test data.
• Then train the second-layer learner on this data (a from-scratch sketch of the procedure follows below).
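Before the mlxtend version in the next block, here is a minimal hedged sketch of the out-of-fold procedure just described, using sklearn's cross_val_predict to build the first-layer meta-features (class-probability columns) and logistic regression as the second layer. For simplicity the test-set meta-features come from models refit on the whole training set rather than from fold-averaged predictions:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_predict
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

base_models = [KNeighborsClassifier(1), RandomForestClassifier(random_state=42), GaussianNB()]

# out-of-fold probabilities on the training set (the A matrices),
# and full-training-set refits to build the test-set meta-features (the B matrices)
A = np.hstack([cross_val_predict(m, X_tr, y_tr, cv=5, method='predict_proba') for m in base_models])
B = np.hstack([m.fit(X_tr, y_tr).predict_proba(X_te) for m in base_models])

meta = LogisticRegression(max_iter=1000).fit(A, y_tr)
print('stacked test accuracy: %.3f' % meta.score(B, y_te))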
Code
    from sklearn import datasets
    iris = datasets.load_iris()
X, y = iris.data[:, 1:3], iris.target  # keep only two of the features
    from sklearn.model_selection import cross_val_score
    from sklearn.linear_model import LogisticRegression
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.naive_bayes import GaussianNB 
    from sklearn.ensemble import RandomForestClassifier
    from mlxtend.classifier import StackingCVClassifier
    
    RANDOM_SEED = 42
    
    clf1 = KNeighborsClassifier(n_neighbors = 1)
    clf2 = RandomForestClassifier(random_state = RANDOM_SEED)
    clf3 = GaussianNB()
    lr = LogisticRegression()
    
sclf = StackingCVClassifier(classifiers = [clf1, clf2, clf3],  # first-layer classifiers
                           meta_classifier = lr,  # second-layer classifier
                           random_state = RANDOM_SEED)
print("3-fold cross-validation:\n")
    for clf, label in zip([clf1, clf2, clf3, sclf], ['KNN','RandomForest','Naive Bayes',
                                                    'Stack']):
        scores = cross_val_score(clf, X, y, cv = 3, scoring = 'accuracy')
        print("Accuracy: %0.2f (+/- %0.2f) [%s]" % (scores.mean(), scores.std(), label))
    
3-fold cross-validation:
    
    Accuracy: 0.91 (+/- 0.01) [KNN]
    Accuracy: 0.95 (+/- 0.01) [RandomForest]
    Accuracy: 0.91 (+/- 0.02) [Naive Bayes]
    Accuracy: 0.93 (+/- 0.02) [Stack]
    

Next we try to draw the decision boundaries:

    from mlxtend.plotting import plot_decision_regions
    import matplotlib.gridspec as gridspec
    import itertools
    
    
    gs = gridspec.GridSpec(2,2)
# this can be read as splitting the figure into a 2x2 grid of subplot regions
    fig = plt.figure(figsize = (10,8))
    
    for clf, lab, grd in zip([clf1, clf2, clf3, sclf], ['KNN',
                                                        'RandomForest',
                                                        'Naive Bayes',
                                                        'Stack'],
                            itertools.product([0,1],repeat=2)):
        clf.fit(X, y)
        ax = plt.subplot(gs[grd[0],grd[1]])
    # grd takes the values (0,0),(0,1),(1,0),(1,1) in turn; indexing gs with it selects the corresponding region
        fig = plot_decision_regions(X = X, y = y, clf = clf)
        plt.title(lab)
        
    plt.show()
    

[Figure: decision regions of KNN, random forest, naive Bayes and the stacked model]

Two functions that may be unfamiliar deserve a note:

• itertools.product([0,1], repeat=2): product normally takes several iterables, e.g. product([0,1],[1,2]) yields (0,1),(0,2),(1,1),(1,2). Passing one iterable with repeat=2 is equivalent to product([0,1],[0,1]) and yields (0,0),(0,1),(1,0),(1,1); with repeat=3 it yields:

      (0, 0, 0)
      (0, 0, 1)
      (0, 1, 0)
      (0, 1, 1)
      (1, 0, 0)
      (1, 0, 1)
      (1, 1, 0)
      (1, 1, 1)
      
• gs = gridspec.GridSpec(2,2): this splits the figure into a 2x2 grid of regions; subplot can then be given e.g. gs[0,1] to draw into region (0,1). The example below may make it clearer:

      plt.figure()
      
gs=gridspec.GridSpec(3,3)  # split into 3 rows and 3 columns
      ax1=plt.subplot(gs[0,:])
      ax1=plt.subplot(gs[1,:2])
      ax1=plt.subplot(gs[1:,2])
      ax1=plt.subplot(gs[-1,0])
      ax1=plt.subplot(gs[-1,-2])
      

[Figure: the resulting GridSpec subplot layout]


Back to Stacking.

So far the first-layer classifiers' predicted labels were used as the second layer's input. If you instead want the class probabilities produced by all the first-layer base classifiers to be the second layer's features, add the parameter use_probas=True to the stacking classifier. There is also average_probas=True, which averages the base classifiers' probabilities column by column; otherwise the probabilities are concatenated.

[Figure: illustration of use_probas with and without average_probas]

Let's try it:

    clf1 = KNeighborsClassifier(n_neighbors=1)
    clf2 = RandomForestClassifier(random_state=1)
    clf3 = GaussianNB()
    lr = LogisticRegression()
    
    sclf = StackingCVClassifier(classifiers=[clf1, clf2, clf3],
                            use_probas=True,  # use class probabilities
                                meta_classifier=lr,
                                random_state=42)
    
print('3-fold cross-validation:')
    
    for clf, label in zip([clf1, clf2, clf3, sclf], 
                          ['KNN', 
                           'Random Forest', 
                           'Naive Bayes',
                           'StackingClassifier']):
    
        scores = cross_val_score(clf, X, y, cv=3, scoring='accuracy')
        print("Accuracy: %0.2f (+/- %0.2f) [%s]" 
              % (scores.mean(), scores.std(), label))
    
3-fold cross-validation:
    Accuracy: 0.91 (+/- 0.01) [KNN]
    Accuracy: 0.95 (+/- 0.01) [Random Forest]
    Accuracy: 0.91 (+/- 0.02) [Naive Bayes]
    Accuracy: 0.95 (+/- 0.02) [StackingClassifier]
    

In addition, stacking can be combined with grid search:

    from sklearn.linear_model import LogisticRegression
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.naive_bayes import GaussianNB 
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV
    from mlxtend.classifier import StackingCVClassifier
    
    
    clf1 = KNeighborsClassifier(n_neighbors=1)
    clf2 = RandomForestClassifier(random_state=RANDOM_SEED)
    clf3 = GaussianNB()
    lr = LogisticRegression()
    
    sclf = StackingCVClassifier(classifiers=[clf1, clf2, clf3], 
                                meta_classifier=lr,
                                random_state=42)
    
    params = {'kneighborsclassifier__n_neighbors': [1, 5],
              'randomforestclassifier__n_estimators': [10, 50],
              'meta_classifier__C': [0.1, 10.0]}# 
grid = GridSearchCV(estimator=sclf,  # the stacking classifier
                    param_grid=params, # the parameter grid
                    cv=5,
                    refit=True)  
# refit=True refits the best parameter combination on the whole dataset once the search is finished
    
    grid.fit(X, y)
    
    cv_keys = ('mean_test_score', 'std_test_score', 'params')
    
    for r,_ in enumerate(grid.cv_results_['mean_test_score']):
        print("%0.3f +/- %0.2f %r"
              % (grid.cv_results_[cv_keys[0]][r],
                 grid.cv_results_[cv_keys[1]][r] / 2.0,
                 grid.cv_results_[cv_keys[2]][r]))
    print('Best parameters: %s' % grid.best_params_)
    print('Accuracy: %.2f' % grid.best_score_)
    
    0.947 +/- 0.03 {'kneighborsclassifier__n_neighbors': 1, 'meta_classifier__C': 0.1, 'randomforestclassifier__n_estimators': 10}
    0.933 +/- 0.02 {'kneighborsclassifier__n_neighbors': 1, 'meta_classifier__C': 0.1, 'randomforestclassifier__n_estimators': 50}
    0.940 +/- 0.02 {'kneighborsclassifier__n_neighbors': 1, 'meta_classifier__C': 10.0, 'randomforestclassifier__n_estimators': 10}
    0.940 +/- 0.02 {'kneighborsclassifier__n_neighbors': 1, 'meta_classifier__C': 10.0, 'randomforestclassifier__n_estimators': 50}
    0.953 +/- 0.02 {'kneighborsclassifier__n_neighbors': 5, 'meta_classifier__C': 0.1, 'randomforestclassifier__n_estimators': 10}
    0.953 +/- 0.02 {'kneighborsclassifier__n_neighbors': 5, 'meta_classifier__C': 0.1, 'randomforestclassifier__n_estimators': 50}
    0.953 +/- 0.02 {'kneighborsclassifier__n_neighbors': 5, 'meta_classifier__C': 10.0, 'randomforestclassifier__n_estimators': 10}
    0.953 +/- 0.02 {'kneighborsclassifier__n_neighbors': 5, 'meta_classifier__C': 10.0, 'randomforestclassifier__n_estimators': 50}
    Best parameters: {'kneighborsclassifier__n_neighbors': 5, 'meta_classifier__C': 0.1, 'randomforestclassifier__n_estimators': 10}
    Accuracy: 0.95
    

If you want to use the same type of model several times in the first layer, add a numeric suffix in the parameter grid:

    params = {'kneighborsclassifier-1__n_neighbors': [1, 5],
              'kneighborsclassifier-2__n_neighbors': [1, 5],
              'randomforestclassifier__n_estimators': [10, 50],
              'meta_classifier__C': [0.1, 10.0]}
    

We can also borrow the random-subspace idea and give different feature subsets to the different first-layer sub-models:

    from sklearn.datasets import load_iris
    from mlxtend.classifier import StackingCVClassifier
    from mlxtend.feature_selection import ColumnSelector
    from sklearn.pipeline import make_pipeline
    from sklearn.linear_model import LogisticRegression
    
    iris = load_iris()
    X = iris.data
    y = iris.target
    
pipe1 = make_pipeline(ColumnSelector(cols=(0, 2)),  # select columns 0 and 2
                      LogisticRegression())  # i.e. pick features first, then use logistic regression as the base classifier
pipe2 = make_pipeline(ColumnSelector(cols=(1, 2, 3)),  # select columns 1, 2 and 3
                      LogisticRegression())  # both base classifiers are logistic regression
sclf = StackingCVClassifier(classifiers=[pipe1, pipe2], 
                            # the two base classifiers differ only in the features they use
                            meta_classifier=LogisticRegression(),
                            random_state=42)
    
    sclf.fit(X, y)
    

[Figure: repr of the fitted StackingCVClassifier]

Finally, we can draw an ROC curve:

    from sklearn import model_selection
    from sklearn.linear_model import LogisticRegression
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.svm import SVC
    from sklearn.ensemble import RandomForestClassifier
    from mlxtend.classifier import StackingCVClassifier
    from sklearn.metrics import roc_curve, auc
    from sklearn.model_selection import train_test_split
    from sklearn import datasets
    from sklearn.preprocessing import label_binarize
    from sklearn.multiclass import OneVsRestClassifier
    
    iris = datasets.load_iris()
X, y = iris.data[:, [0, 1]], iris.target  # use only the first two features
    
    y = label_binarize(y, classes = [0,1,2])
# y contains three classes labelled 0,1,2; label_binarize turns it into an n x 3 matrix,
# one column per class: class 0 becomes 100, class 1 becomes 010 and class 2 becomes 001.
# The classes argument fixes the column order: with [0,2,1], class 2 maps to 010 and class 1 to 001.
    n_classes = y.shape[1]
    
    RANDOM_SEED = 42
    
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, 
                                                        random_state=RANDOM_SEED)
    
    clf1 =  LogisticRegression()
    clf2 = RandomForestClassifier(random_state=RANDOM_SEED)
    clf3 = SVC(random_state=RANDOM_SEED)
    lr = LogisticRegression()
    
    sclf = StackingCVClassifier(classifiers=[clf1, clf2, clf3],
                                meta_classifier=lr)
    
    classifier = OneVsRestClassifier(sclf)  
# OneVsRestClassifier fits one binary classifier per class: that class against all the others
    y_score = classifier.fit(X_train, y_train).decision_function(X_test)
# decision_function gives which side of the decision boundary X_test falls on and how far away it is, used here as the score
    # Compute ROC curve and ROC area for each class
    fpr = dict()
    tpr = dict()
    roc_auc = dict()
    for i in range(n_classes):
        fpr[i], tpr[i], _ = roc_curve(y_test[:, i], y_score[:, i])
        roc_auc[i] = auc(fpr[i], tpr[i])
    
    # Compute micro-average ROC curve and ROC area
    fpr["micro"], tpr["micro"], _ = roc_curve(y_test.ravel(), y_score.ravel())
    roc_auc["micro"] = auc(fpr["micro"], tpr["micro"])
    
    plt.figure()
    lw = 2
    plt.plot(fpr[2], tpr[2], color='darkorange',
             lw=lw, label='ROC curve (area = %0.2f)' % roc_auc[2])
    plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver operating characteristic example')
    plt.legend(loc="lower right")
    plt.show()
    

[Figure: ROC curve for class 2 with its AUC]

Ensemble learning case study 1: predicting happiness

Background

This case is the baseline for a data-mining competition on predicting happiness. The data come from the survey results in the official Chinese General Social Survey (CGSS) files and contain 139 feature dimensions, including individual variables (gender, age, region, occupation, health, marriage, political affiliation, etc.), family variables (parents, spouse, children, family capital, etc.) and social attitudes (fairness, trust, public services). The task is to use these 139 features and roughly 8000 rows of data to predict individual happiness (the target takes the values 1, 2, 3, 4, 5, where 1 is the least happy and 5 the happiest).

Code

Import the required packages
    import os
    import time 
    import pandas as pd
    import numpy as np
    import seaborn as sns
    from sklearn.linear_model import LogisticRegression
    from sklearn.svm import SVC, LinearSVC
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.naive_bayes import GaussianNB
    from sklearn.linear_model import Perceptron
    from sklearn.linear_model import SGDClassifier
    from sklearn.tree import DecisionTreeClassifier
    from sklearn import metrics
    from datetime import datetime
    import matplotlib.pyplot as plt
    from sklearn.metrics import roc_auc_score, roc_curve, mean_squared_error,mean_absolute_error, f1_score
    import lightgbm as lgb
    import xgboost as xgb
    from sklearn.ensemble import RandomForestRegressor as rfr
    from sklearn.ensemble import ExtraTreesRegressor as etr
    from sklearn.linear_model import BayesianRidge as br
    from sklearn.ensemble import GradientBoostingRegressor as gbr
    from sklearn.linear_model import Ridge
    from sklearn.linear_model import Lasso
    from sklearn.linear_model import LinearRegression as lr
    from sklearn.linear_model import ElasticNet as en
    from sklearn.kernel_ridge import KernelRidge as kr
    from sklearn.model_selection import  KFold, StratifiedKFold,GroupKFold, RepeatedKFold
    from sklearn.model_selection import train_test_split
    from sklearn.model_selection import GridSearchCV
    from sklearn import preprocessing
    import logging
    import warnings
    
warnings.filterwarnings('ignore')  # suppress warnings
    
Load the dataset
train = pd.read_csv("train.csv", parse_dates=['survey_time'],encoding='latin-1')
# parse_dates converts that column to a datetime type
test = pd.read_csv("test.csv", parse_dates=['survey_time'],encoding='latin-1') 
# latin-1 is backwards compatible with ASCII

train = train[train["happiness"] != -8].reset_index(drop=True)
# -8 means the respondent did not answer the happiness question, so drop those rows; reset_index because the row count changed

train_data_copy = train.copy()  # a copy with the -8 rows fully removed
target_col = "happiness"
target = train_data_copy[target_col]
del train_data_copy[target_col]

data = pd.concat([train_data_copy, test], axis = 0, ignore_index = True)
# stack train and test row-wise
    
Data preprocessing

The main negative values appearing in the data are -1, -2, -3 and -8, so we define a separate function to handle each of them.

# make feature +5
# The csv contains negative values -1, -2, -3, -8; treat them as problematic features but do not drop them
def getres1(row):
    return len([x for x in row.values if type(x)==int and x<0])

def getres2(row):
    return len([x for x in row.values if type(x)==int and x==-8])

def getres3(row):
    return len([x for x in row.values if type(x)==int and x==-1])

def getres4(row):
    return len([x for x in row.values if type(x)==int and x==-2])

def getres5(row):
    return len([x for x in row.values if type(x)==int and x==-3])

# Check the data
# apply with axis=1 applies each function to every row, so the new feature has the same
# number of rows as the data and counts how many negative values that row contains
data['neg1'] = data[data.columns].apply(lambda row:getres1(row),axis=1)
data.loc[data['neg1']>20,'neg1'] = 20  # clip the count as a smoothing step
data['neg2'] = data[data.columns].apply(lambda row:getres2(row),axis=1)
data['neg3'] = data[data.columns].apply(lambda row:getres3(row),axis=1)
data['neg4'] = data[data.columns].apply(lambda row:getres4(row),axis=1)
data['neg5'] = data[data.columns].apply(lambda row:getres5(row),axis=1)

Missing values should be filled with some understanding of each feature: a missing number of family members, say, can be filled with 1, and a missing family income can be filled with the mean of that column, and so on. Use your own judgement here; fillna does the filling.

First check which columns have missing values that need filling:

    data.isnull().sum()[data.isnull().sum() > 0]
    
    edu_other          10950
    edu_status          1569
    edu_yr              2754
    join_party          9831
    property_other     10867
    hukou_loc              4
    social_neighbor     1096
    social_friend       1096
    work_status         6932
    work_yr             6932
    work_type           6931
    work_manage         6931
    family_income          1
    invest_other       10911
    minor_child         1447
    marital_1st         1128
    s_birth             2365
    marital_now         2445
    s_edu               2365
    s_political         2365
    s_hukou             2365
    s_income            2365
    s_work_exper        2365
    s_work_status       7437
    s_work_type         7437
    dtype: int64
    

    那么对以上这些特征进行处理:

    data['work_status'] = data['work_status'].fillna(0)
    data['work_yr'] = data['work_yr'].fillna(0)
    data['work_manage'] = data['work_manage'].fillna(0)
    data['work_type'] = data['work_type'].fillna(0)
    
    data['edu_yr'] = data['edu_yr'].fillna(0)
    data['edu_status'] = data['edu_status'].fillna(0)
    
    data['s_work_type'] = data['s_work_type'].fillna(0)
    data['s_work_status'] = data['s_work_status'].fillna(0)
    data['s_political'] = data['s_political'].fillna(0)
    data['s_hukou'] = data['s_hukou'].fillna(0)
    data['s_income'] = data['s_income'].fillna(0)
    data['s_birth'] = data['s_birth'].fillna(0)
    data['s_edu'] = data['s_edu'].fillna(0)
    data['s_work_exper'] = data['s_work_exper'].fillna(0)
    
    data['minor_child'] = data['minor_child'].fillna(0)
    data['marital_now'] = data['marital_now'].fillna(0)
    data['marital_1st'] = data['marital_1st'].fillna(0)
    data['social_neighbor']=data['social_neighbor'].fillna(0)
    data['social_friend']=data['social_friend'].fillna(0)
    data['hukou_loc']=data['hukou_loc'].fillna(1) #缺失值填为1(该特征的最小取值)
    data['family_income']=data['family_income'].fillna(66365) #删除问题值后的平均值
    

    特殊格式的信息也需要处理,例如跟时间有关的信息,可以化成对应方便处理的格式,以及对年龄进行分段处理:

    data['survey_time'] = pd.to_datetime(data['survey_time'], format='%Y-%m-%d',
                                         errors='coerce')
    # errors='coerce' 防止时间格式不一致时报错,无法解析的值会被赋为NaT
    data['survey_time'] = data['survey_time'].dt.year #仅仅是year,方便计算年龄
    data['age'] = data['survey_time']-data['birth']
    
    bins = [0,17,26,34,50,63,100]
    data['age_bin'] = pd.cut(data['age'], bins, labels=[0,1,2,3,4,5]) 
    
    

    还有就是对一些异常值的处理,例如在某些问题中不应该出现负数但是出现了负数,那么就可以根据我们的直观理解来进行处理:

    # 对宗教进行处理
    data.loc[data['religion'] < 0, 'religion'] = 1  # 1代表不信仰宗教
    data.loc[data['religion_freq'] < 0, "religion_freq"] = 0  # 代表不参加
    # 教育
    data.loc[data['edu'] < 0, 'edu'] = 1  # 负数说明没有接受过任何教育
    data.loc[data['edu_status'] < 0 , 'edu_status'] = 0
    data.loc[data['edu_yr']<0,'edu_yr'] = 0
    
    # 收入
    data.loc[data['income']<0,'income'] = 0 #认为无收入
    #对‘政治面貌’处理
    data.loc[data['political']<0,'political'] = 1 #认为是群众
    #对体重处理
    data.loc[(data['weight_jin']<=80)&(data['height_cm']>=160),'weight_jin']= data['weight_jin']*2
    data.loc[data['weight_jin']<=60,'weight_jin']= data['weight_jin']*2  #没有60斤的成年人吧
    #对身高处理
    data.loc[data['height_cm']<130,'height_cm'] = 150 #成年人的实际情况
    #对‘健康’处理
    data.loc[data['health']<0,'health'] = 4 #认为是比较健康
    data.loc[data['health_problem']<0,'health_problem'] = 4  # 4代表很少
    #对‘沮丧’处理
    data.loc[data['depression']<0,'depression'] = 4 #很少
    #对‘媒体’处理
    data.loc[data['media_1']<0,'media_1'] = 1 #都是从不
    data.loc[data['media_2']<0,'media_2'] = 1
    data.loc[data['media_3']<0,'media_3'] = 1
    data.loc[data['media_4']<0,'media_4'] = 1
    data.loc[data['media_5']<0,'media_5'] = 1
    data.loc[data['media_6']<0,'media_6'] = 1
    
    #对‘空闲活动’处理
    data.loc[data['leisure_1']<0,'leisure_1'] = data['leisure_1'].mode()[0]  # 取众数填补;mode()返回Series,取[0]得到标量才能正确广播赋值
    data.loc[data['leisure_2']<0,'leisure_2'] = data['leisure_2'].mode()[0]
    data.loc[data['leisure_3']<0,'leisure_3'] = data['leisure_3'].mode()[0]
    data.loc[data['leisure_4']<0,'leisure_4'] = data['leisure_4'].mode()[0]
    data.loc[data['leisure_5']<0,'leisure_5'] = data['leisure_5'].mode()[0]
    data.loc[data['leisure_6']<0,'leisure_6'] = data['leisure_6'].mode()[0]
    data.loc[data['leisure_7']<0,'leisure_7'] = data['leisure_7'].mode()[0]
    data.loc[data['leisure_8']<0,'leisure_8'] = data['leisure_8'].mode()[0]
    data.loc[data['leisure_9']<0,'leisure_9'] = data['leisure_9'].mode()[0]
    data.loc[data['leisure_10']<0,'leisure_10'] = data['leisure_10'].mode()[0]
    data.loc[data['leisure_11']<0,'leisure_11'] = data['leisure_11'].mode()[0]
    data.loc[data['leisure_12']<0,'leisure_12'] = data['leisure_12'].mode()[0]
    data.loc[data['socialize']<0, 'socialize'] = 2  # 社交,很少
    data.loc[data['relax']<0, 'relax'] = 2  # 放松,很少
    data.loc[data['learn']<0, 'learn'] = 2  # 学习,很少
    data.loc[data['social_neighbor']<0, 'social_neighbor'] = 6  # 一年1次或更少社交
    data.loc[data['social_friend']<0, 'social_friend'] = 6 
    data.loc[data['socia_outing']<0, 'socia_outing'] = 1
    data.loc[data['neighbor_familiarity']<0,'neighbor_familiarity'] = 2  # 不太熟悉
    data.loc[data['equity']<0,'equity'] = 3  # 不知道公不公平
    #对‘社会等级’处理
    data.loc[data['class_10_before']<0,'class_10_before'] = 3
    data.loc[data['class']<0,'class'] = 5
    data.loc[data['class_10_after']<0,'class_10_after'] = 5
    data.loc[data['class_14']<0,'class_14'] = 2
    #对‘工作情况’处理
    data.loc[data['work_status']<0,'work_status'] = 0
    data.loc[data['work_yr']<0,'work_yr'] = 0
    data.loc[data['work_manage']<0,'work_manage'] = 0
    data.loc[data['work_type']<0,'work_type'] = 0
    #对‘社会保障’处理
    data.loc[data['insur_1']<0,'insur_1'] = 1
    data.loc[data['insur_2']<0,'insur_2'] = 1
    data.loc[data['insur_3']<0,'insur_3'] = 1
    data.loc[data['insur_4']<0,'insur_4'] = 1
    data.loc[data['insur_1']==0,'insur_1'] = 0
    data.loc[data['insur_2']==0,'insur_2'] = 0
    data.loc[data['insur_3']==0,'insur_3'] = 0
    data.loc[data['insur_4']==0,'insur_4'] = 0
    #对家庭情况处理
    family_income_mean = data['family_income'].mean()  # 计算均值
    data.loc[data['family_income']<0,'family_income'] = family_income_mean
    data.loc[data['family_m']<0,'family_m'] = 2
    data.loc[data['family_status']<0,'family_status'] = 3
    data.loc[data['house']<0,'house'] = 1
    data.loc[data['car']<0,'car'] = 0
    data.loc[data['car']==2,'car'] = 0 #变为0和1
    data.loc[data['son']<0,'son'] = 0
    data.loc[data['daughter']<0,'daughter'] = 0
    data.loc[data['minor_child']<0,'minor_child'] = 0
    #对‘婚姻’处理
    data.loc[data['marital_1st']<0,'marital_1st'] = 0
    data.loc[data['marital_now']<0,'marital_now'] = 0
    #对‘配偶’处理
    data.loc[data['s_birth']<0,'s_birth'] = 0
    data.loc[data['s_edu']<0,'s_edu'] = 0
    data.loc[data['s_political']<0,'s_political'] = 0
    data.loc[data['s_hukou']<0,'s_hukou'] = 0
    data.loc[data['s_income']<0,'s_income'] = 0
    data.loc[data['s_work_type']<0,'s_work_type'] = 0
    data.loc[data['s_work_status']<0,'s_work_status'] = 0
    data.loc[data['s_work_exper']<0,'s_work_exper'] = 0
    #对‘父母情况’处理
    data.loc[data['f_birth']<0,'f_birth'] = 1945
    data.loc[data['f_edu']<0,'f_edu'] = 1
    data.loc[data['f_political']<0,'f_political'] = 1
    data.loc[data['f_work_14']<0,'f_work_14'] = 2
    data.loc[data['m_birth']<0,'m_birth'] = 1940
    data.loc[data['m_edu']<0,'m_edu'] = 1
    data.loc[data['m_political']<0,'m_political'] = 1
    data.loc[data['m_work_14']<0,'m_work_14'] = 2
    #和同龄人相比社会经济地位
    data.loc[data['status_peer']<0,'status_peer'] = 2
    #和3年前比社会经济地位
    data.loc[data['status_3_before']<0,'status_3_before'] = 2
    #对‘观点’处理
    data.loc[data['view']<0,'view'] = 4
    #对期望年收入处理
    data.loc[data['inc_ability']<=0,'inc_ability']= 2
    inc_exp_mean = data['inc_exp'].mean()
    data.loc[data['inc_exp']<=0,'inc_exp']= inc_exp_mean #取均值
    
    #部分特征处理,取众数(首先去除缺失值的数据)
    for i in range(1,9+1):
        data.loc[data['public_service_'+str(i)]<0,'public_service_'+str(i)] = int(data['public_service_'+str(i)].dropna().mode().values)
    for i in range(1,13+1):
        data.loc[data['trust_'+str(i)]<0,'trust_'+str(i)] = int(data['trust_'+str(i)].dropna().mode().values)
    #上面的循环要加上int,因为values返回一个array是1*1的,那么跟右边的长度匹配不上
    # 转换成int就可以广播赋值
    
    数据增广

    上述只是对原始的特征进行处理,下面我们需要进一步分析特征的关系,引入更多有可能具有较大重要性的特征来协助判断。

    #第一次结婚年龄 147
    data['marital_1stbir'] = data['marital_1st'] - data['birth'] 
    #最近结婚年龄 148
    data['marital_nowtbir'] = data['marital_now'] - data['birth'] 
    #是否再婚 149
    data['mar'] = data['marital_nowtbir'] - data['marital_1stbir']
    #配偶年龄 150
    data['marital_sbir'] = data['marital_now']-data['s_birth']
    #配偶年龄差 151
    data['age_'] = data['marital_nowtbir'] - data['marital_sbir'] 
    
    #收入比 151+7 =158
    data['income/s_income'] = data['income']/(data['s_income']+1) #同居伴侣
    data['income+s_income'] = data['income']+(data['s_income']+1)
    data['income/family_income'] = data['income']/(data['family_income']+1)
    data['all_income/family_income'] = (data['income']+data['s_income'])/(data['family_income']+1)
    data['income/inc_exp'] = data['income']/(data['inc_exp']+1)
    data['family_income/m'] = data['family_income']/(data['family_m']+0.01)
    data['income/m'] = data['income']/(data['family_m']+0.01)
    
    #收入/面积比 158+4=162
    data['income/floor_area'] = data['income']/(data['floor_area']+0.01)
    data['all_income/floor_area'] = (data['income']+data['s_income'])/(data['floor_area']+0.01)
    data['family_income/floor_area'] = data['family_income']/(data['floor_area']+0.01)
    data['floor_area/m'] = data['floor_area']/(data['family_m']+0.01)
    
    #class 162+3=165
    data['class_10_diff'] = (data['class_10_after'] - data['class'])
    data['class_diff'] = data['class'] - data['class_10_before']
    data['class_14_diff'] = data['class'] - data['class_14']
    #悠闲指数 166
    leisure_fea_lis = ['leisure_'+str(i) for i in range(1,13)]
    data['leisure_sum'] = data[leisure_fea_lis].sum(axis=1) #skew
    #满意指数 167
    public_service_fea_lis = ['public_service_'+str(i) for i in range(1,10)]
    data['public_service_sum'] = data[public_service_fea_lis].sum(axis=1) #skew
    
    #信任指数 168
    trust_fea_lis = ['trust_'+str(i) for i in range(1,14)]
    data['trust_sum'] = data[trust_fea_lis].sum(axis=1) #skew
    
    #province mean 168+13=181
    data['province_income_mean'] = data.groupby(['province'])['income'].transform('mean').values
    data['province_family_income_mean'] = data.groupby(['province'])['family_income'].transform('mean').values
    data['province_equity_mean'] = data.groupby(['province'])['equity'].transform('mean').values
    data['province_depression_mean'] = data.groupby(['province'])['depression'].transform('mean').values
    data['province_floor_area_mean'] = data.groupby(['province'])['floor_area'].transform('mean').values
    data['province_health_mean'] = data.groupby(['province'])['health'].transform('mean').values
    data['province_class_10_diff_mean'] = data.groupby(['province'])['class_10_diff'].transform('mean').values
    data['province_class_mean'] = data.groupby(['province'])['class'].transform('mean').values
    data['province_health_problem_mean'] = data.groupby(['province'])['health_problem'].transform('mean').values
    data['province_family_status_mean'] = data.groupby(['province'])['family_status'].transform('mean').values
    data['province_leisure_sum_mean'] = data.groupby(['province'])['leisure_sum'].transform('mean').values
    data['province_public_service_sum_mean'] = data.groupby(['province'])['public_service_sum'].transform('mean').values
    data['province_trust_sum_mean'] = data.groupby(['province'])['trust_sum'].transform('mean').values
    
    #city   mean 181+13=194
    data['city_income_mean'] = data.groupby(['city'])['income'].transform('mean').values #按照city分组
    data['city_family_income_mean'] = data.groupby(['city'])['family_income'].transform('mean').values
    data['city_equity_mean'] = data.groupby(['city'])['equity'].transform('mean').values
    data['city_depression_mean'] = data.groupby(['city'])['depression'].transform('mean').values
    data['city_floor_area_mean'] = data.groupby(['city'])['floor_area'].transform('mean').values
    data['city_health_mean'] = data.groupby(['city'])['health'].transform('mean').values
    data['city_class_10_diff_mean'] = data.groupby(['city'])['class_10_diff'].transform('mean').values
    data['city_class_mean'] = data.groupby(['city'])['class'].transform('mean').values
    data['city_health_problem_mean'] = data.groupby(['city'])['health_problem'].transform('mean').values
    data['city_family_status_mean'] = data.groupby(['city'])['family_status'].transform('mean').values
    data['city_leisure_sum_mean'] = data.groupby(['city'])['leisure_sum'].transform('mean').values
    data['city_public_service_sum_mean'] = data.groupby(['city'])['public_service_sum'].transform('mean').values
    data['city_trust_sum_mean'] = data.groupby(['city'])['trust_sum'].transform('mean').values
    
    #county  mean 194 + 13 = 207
    data['county_income_mean'] = data.groupby(['county'])['income'].transform('mean').values
    data['county_family_income_mean'] = data.groupby(['county'])['family_income'].transform('mean').values
    data['county_equity_mean'] = data.groupby(['county'])['equity'].transform('mean').values
    data['county_depression_mean'] = data.groupby(['county'])['depression'].transform('mean').values
    data['county_floor_area_mean'] = data.groupby(['county'])['floor_area'].transform('mean').values
    data['county_health_mean'] = data.groupby(['county'])['health'].transform('mean').values
    data['county_class_10_diff_mean'] = data.groupby(['county'])['class_10_diff'].transform('mean').values
    data['county_class_mean'] = data.groupby(['county'])['class'].transform('mean').values
    data['county_health_problem_mean'] = data.groupby(['county'])['health_problem'].transform('mean').values
    data['county_family_status_mean'] = data.groupby(['county'])['family_status'].transform('mean').values
    data['county_leisure_sum_mean'] = data.groupby(['county'])['leisure_sum'].transform('mean').values
    data['county_public_service_sum_mean'] = data.groupby(['county'])['public_service_sum'].transform('mean').values
    data['county_trust_sum_mean'] = data.groupby(['county'])['trust_sum'].transform('mean').values
    
    #ratio 相比同省 207 + 13 =220
    data['income/province'] = data['income']/(data['province_income_mean'])                                      
    data['family_income/province'] = data['family_income']/(data['province_family_income_mean'])   
    data['equity/province'] = data['equity']/(data['province_equity_mean'])       
    data['depression/province'] = data['depression']/(data['province_depression_mean'])                                                
    data['floor_area/province'] = data['floor_area']/(data['province_floor_area_mean'])
    data['health/province'] = data['health']/(data['province_health_mean'])
    data['class_10_diff/province'] = data['class_10_diff']/(data['province_class_10_diff_mean'])
    data['class/province'] = data['class']/(data['province_class_mean'])
    data['health_problem/province'] = data['health_problem']/(data['province_health_problem_mean'])
    data['family_status/province'] = data['family_status']/(data['province_family_status_mean'])
    data['leisure_sum/province'] = data['leisure_sum']/(data['province_leisure_sum_mean'])
    data['public_service_sum/province'] = data['public_service_sum']/(data['province_public_service_sum_mean'])
    data['trust_sum/province'] = data['trust_sum']/(data['province_trust_sum_mean']+1)
    
    #ratio 相比同市 220 + 13 =233
    data['income/city'] = data['income']/(data['city_income_mean'])                                      
    data['family_income/city'] = data['family_income']/(data['city_family_income_mean'])   
    data['equity/city'] = data['equity']/(data['city_equity_mean'])       
    data['depression/city'] = data['depression']/(data['city_depression_mean'])                                                
    data['floor_area/city'] = data['floor_area']/(data['city_floor_area_mean'])
    data['health/city'] = data['health']/(data['city_health_mean'])
    data['class_10_diff/city'] = data['class_10_diff']/(data['city_class_10_diff_mean'])
    data['class/city'] = data['class']/(data['city_class_mean'])
    data['health_problem/city'] = data['health_problem']/(data['city_health_problem_mean'])
    data['family_status/city'] = data['family_status']/(data['city_family_status_mean'])
    data['leisure_sum/city'] = data['leisure_sum']/(data['city_leisure_sum_mean'])
    data['public_service_sum/city'] = data['public_service_sum']/(data['city_public_service_sum_mean'])
    data['trust_sum/city'] = data['trust_sum']/(data['city_trust_sum_mean'])
    
    #ratio 相比同个地区 233 + 13 =246
    data['income/county'] = data['income']/(data['county_income_mean'])                                      
    data['family_income/county'] = data['family_income']/(data['county_family_income_mean'])   
    data['equity/county'] = data['equity']/(data['county_equity_mean'])       
    data['depression/county'] = data['depression']/(data['county_depression_mean'])                                                
    data['floor_area/county'] = data['floor_area']/(data['county_floor_area_mean'])
    data['health/county'] = data['health']/(data['county_health_mean'])
    data['class_10_diff/county'] = data['class_10_diff']/(data['county_class_10_diff_mean'])
    data['class/county'] = data['class']/(data['county_class_mean'])
    data['health_problem/county'] = data['health_problem']/(data['county_health_problem_mean'])
    data['family_status/county'] = data['family_status']/(data['county_family_status_mean'])
    data['leisure_sum/county'] = data['leisure_sum']/(data['county_leisure_sum_mean'])
    data['public_service_sum/county'] = data['public_service_sum']/(data['county_public_service_sum_mean'])
    data['trust_sum/county'] = data['trust_sum']/(data['county_trust_sum_mean'])
    
    #age   mean 246+ 13 =259
    data['age_income_mean'] = data.groupby(['age'])['income'].transform('mean').values
    data['age_family_income_mean'] = data.groupby(['age'])['family_income'].transform('mean').values
    data['age_equity_mean'] = data.groupby(['age'])['equity'].transform('mean').values
    data['age_depression_mean'] = data.groupby(['age'])['depression'].transform('mean').values
    data['age_floor_area_mean'] = data.groupby(['age'])['floor_area'].transform('mean').values
    data['age_health_mean'] = data.groupby(['age'])['health'].transform('mean').values
    data['age_class_10_diff_mean'] = data.groupby(['age'])['class_10_diff'].transform('mean').values
    data['age_class_mean'] = data.groupby(['age'])['class'].transform('mean').values
    data['age_health_problem_mean'] = data.groupby(['age'])['health_problem'].transform('mean').values
    data['age_family_status_mean'] = data.groupby(['age'])['family_status'].transform('mean').values
    data['age_leisure_sum_mean'] = data.groupby(['age'])['leisure_sum'].transform('mean').values
    data['age_public_service_sum_mean'] = data.groupby(['age'])['public_service_sum'].transform('mean').values
    data['age_trust_sum_mean'] = data.groupby(['age'])['trust_sum'].transform('mean').values
    
    # 和同龄人相比259 + 13 =272
    data['income/age'] = data['income']/(data['age_income_mean'])                                      
    data['family_income/age'] = data['family_income']/(data['age_family_income_mean'])   
    data['equity/age'] = data['equity']/(data['age_equity_mean'])       
    data['depression/age'] = data['depression']/(data['age_depression_mean'])                                                
    data['floor_area/age'] = data['floor_area']/(data['age_floor_area_mean'])
    data['health/age'] = data['health']/(data['age_health_mean'])
    data['class_10_diff/age'] = data['class_10_diff']/(data['age_class_10_diff_mean'])
    data['class/age'] = data['class']/(data['age_class_mean'])
    data['health_problem/age'] = data['health_problem']/(data['age_health_problem_mean'])
    data['family_status/age'] = data['family_status']/(data['age_family_status_mean'])
    data['leisure_sum/age'] = data['leisure_sum']/(data['age_leisure_sum_mean'])
    data['public_service_sum/age'] = data['public_service_sum']/(data['age_public_service_sum_mean'])
    data['trust_sum/age'] = data['trust_sum']/(data['age_trust_sum_mean'])
    

    经过上述的处理,特征扩充为272维。接下来删除有效样本数特别少的特征以及之前处理中已经用过的特征:

    #删除数值特别少的和之前用过的特征
    del_list=['id','survey_time','edu_other','invest_other','property_other','join_party','province','city','county']
    use_feature = [clo for clo in data.columns if clo not in del_list]
    data.fillna(0,inplace=True) #还是补0
    train_shape = train.shape[0] #一共的数据量,训练集
    features = data[use_feature].columns #删除后所有的特征
    X_train_263 = data[:train_shape][use_feature].values
    y_train = target
    X_test_263 = data[train_shape:][use_feature].values
    X_train_263.shape #最终一共263个特征
    

    再结合个人经验,挑选出最重要的49个特征:

    imp_fea_49 = ['equity','depression','health','class','family_status','health_problem','class_10_after',
               'equity/province','equity/city','equity/county',
               'depression/province','depression/city','depression/county',
               'health/province','health/city','health/county',
               'class/province','class/city','class/county',
               'family_status/province','family_status/city','family_status/county',
               'family_income/province','family_income/city','family_income/county',
               'floor_area/province','floor_area/city','floor_area/county',
               'leisure_sum/province','leisure_sum/city','leisure_sum/county',
               'public_service_sum/province','public_service_sum/city','public_service_sum/county',
               'trust_sum/province','trust_sum/city','trust_sum/county',
               'income/m','public_service_sum','class_diff','status_3_before','age_income_mean','age_floor_area_mean',
               'weight_jin','height_cm',
               'health/age','depression/age','equity/age','leisure_sum/age'
              ]
    train_shape = train.shape[0]
    X_train_49 = data[:train_shape][imp_fea_49].values
    X_test_49 = data[train_shape:][imp_fea_49].values
    # X_train_49.shape #最重要的49个特征
    

    再将其中一些特征转换为one-hot特征:

    from sklearn import preprocessing
    cat_fea = ['survey_type','gender','nationality','edu_status','political','hukou','hukou_loc','work_exper','work_status','work_type',
               'work_manage','marital','s_political','s_hukou','s_work_exper','s_work_status','s_work_type','f_political','f_work_14',
               'm_political','m_work_14'] # 需要做onehot编码的类别型特征;本身已是0/1取值的特征则不需要
    noc_fea = [clo for clo in use_feature if clo not in cat_fea]
    
    onehot_data = data[cat_fea].values
    enc = preprocessing.OneHotEncoder(categories = 'auto')
    oh_data=enc.fit_transform(onehot_data).toarray()
    oh_data.shape #变为onehot编码格式
    
    X_train_oh = oh_data[:train_shape,:]
    X_test_oh = oh_data[train_shape:,:]
    X_train_oh.shape #其中的训练集
    
    X_train_383 = np.column_stack([data[:train_shape][noc_fea].values,X_train_oh])#先是noc,再是cat_fea
    X_test_383 = np.column_stack([data[train_shape:][noc_fea].values,X_test_oh])
    # X_train_383.shape
    

    最终拼接得到383维特征。

    模型构建

    此处模型构建的主要流程为:

    • 构建多个集成回归模型进行训练,并对验证集进行预测
    • 将各个基础模型对验证集的预测作为训练数据,验证集的标签作为训练标签,由此再训练一个回归模型对各个基础模型进行融合,得到更好的结果

    具体流程是:对每个模型采用五折交叉验证的方式训练,在每一折上分别对验证集和测试集进行预测;多个模型的验证集(out-of-fold)预测堆叠起来,作为下一层模型的训练数据,而各模型对测试集的预测则按折取平均,供训练好的下一层模型给出最终的测试集预测。
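
    下面是该流程的一个极简示意(仅为说明思路的草图:假设X_train、y_train、X_test均为numpy数组,模型为任意sklearn回归器;这里用KFold示意,正文中部分模型实际使用的是StratifiedKFold):

    import numpy as np
    from sklearn.model_selection import KFold
    from sklearn.linear_model import Ridge

    def get_oof(model, X_train, y_train, X_test, n_splits=5):
        folds = KFold(n_splits=n_splits, shuffle=True, random_state=0)
        oof = np.zeros(len(X_train))       # 验证集(out-of-fold)预测,作为下一层的训练特征
        pred_test = np.zeros(len(X_test))  # 测试集预测,各折取平均
        for trn_idx, val_idx in folds.split(X_train):
            model.fit(X_train[trn_idx], y_train[trn_idx])
            oof[val_idx] = model.predict(X_train[val_idx])
            pred_test += model.predict(X_test) / n_splits
        return oof, pred_test

    # 用法示意:oof_ridge, pred_ridge = get_oof(Ridge(), X_train_263, np.asarray(y_train), X_test_263)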

    对原始263维特征进行处理

    LightGBM

    lgb_263_param = {  # 这是训练的参数列表
        "num_leaves":7,
        "min_data_in_leaf": 20,  # 一个叶子上最小分配到的数量,用来处理过拟合
        "objective": "regression",  # 设置类型为回归
        "max_depth": -1,  # 限制树的最大深度,-1代表没有限制
        "learning_rate": 0.003,
        "boosting": "gbdt",  # 用gbdt算法
        "feature_fraction": 0.18,  # 每次迭代时使用18%的特征参与建树,引入特征子空间的多样性
        "bagging_freq": 1,  # 每一次迭代都执行bagging
        "bagging_fraction": 0.55,  # 每次bagging在不进行重采样的情况下随机选择55%数据训练
        "bagging_seed": 14,
        "metric": 'mse',
        "lambda_l1": 0.1,
        "lambda_l2": 0.2,
        "verbosity": -1  # 打印消息的详细程度
    }
    
    folds = StratifiedKFold(n_splits=5, shuffle=True, random_state = 4)
    # 生成一个切分器,可以对打乱后的数据集做5次分层切分,以此来进行五折交叉验证
    oof_lgb_263 = np.zeros(len(X_train_263))
    predictions_lgb_263 = np.zeros(len(X_test_263))
    
    for fold_, (train_idx, valid_idx) in enumerate(folds.split(X_train_263, y_train)):
        # 切分后返回的训练集和验证集的索引
        print("fold n{}".format(fold_+1))  # 当前第几折
        train_data_now = lgb.Dataset(X_train_263[train_idx], y_train[train_idx])
        valid_data_now = lgb.Dataset(X_train_263[valid_idx], y_train[valid_idx])
        # 取出数据并转换为lgb的数据
        num_round = 10000
        lgb_263 = lgb.train(lgb_263_param, train_data_now, num_round, 
                            valid_sets=[train_data_now, valid_data_now], verbose_eval=500,
                           early_stopping_rounds = 800)
        # 第一个参数是原始参数,第二个是用来训练的数据,第三个参数代表迭代次数
        # 第四个参数是训练后评估所用的数据,第五个参数是每多少次迭代就打印下当前的评估信息
        # 第六个参数是超过多少次迭代没有变化就停止
        oof_lgb_263[valid_idx] = lgb_263.predict(X_train_263[valid_idx],
                                                 num_iteration=lgb_263.best_iteration)
        predictions_lgb_263 += lgb_263.predict(X_test_263, num_iteration=
                                               lgb_263.best_iteration) / folds.n_splits
        # 将各折对测试集的预测值取平均
    print("CV score: {:<8.8f}".format(mean_squared_error(oof_lgb_263, target)))
    # 它是回归类型因此不能用正确率来衡量
    

    由于这次打印出来的信息过多,因此我只补充最终的分数:

    CV score: 0.45153757
    

    而我们可以利用lightGBM中的特征重要性来直观展现各个特征的重要性程度:

    pd.set_option("display.max_columns", None)  # 设置可以显示的最大行和最大列
    pd.set_option('display.max_rows', None)  # 如果超过就显示省略号,none表示不省略
    #设置value的显示长度为100,默认为50
    pd.set_option('max_colwidth',100)
    # 创建,然后只有一列就是刚才所使用的的特征
    df = pd.DataFrame(data[use_feature].columns.tolist(), columns=['feature'])
    df['importance'] = list(lgb_263.feature_importance())
    df = df.sort_values(by='importance', ascending=False)  # 降序排列
    plt.figure(figsize = (14,28))
    sns.barplot(x='importance', y='feature', data = df.head(50))# 取出前五十个画图
    plt.title('Features importance (averaged/folds)')
    plt.tight_layout()  # 自动调整适应范围
    

    (图:LightGBM特征重要性排名前50的条形图)

    可以看到最重要的特征为同龄人的健康程度,这方面还是很符合直观感觉的。


    XGBoost

    xgb_263_params = {
        "eta":0.02,  # 学习率
        "max_depth": 6,  # 树的最大深度
        "min_child_weight":3,  # 划分为叶结点所含样本的最小权重和
        "gamma":0, # 在分裂时损失最小应该减少gamma,越大则越减少过拟合
        "subsample":0.7, # 控制对于每棵树随机采样的比例
        "colsample_bytree":0.3,  # 用来控制每棵树对于特征的采样占比
        "lambda":2,
        "objective":"reg:linear",
        "eval_metric":"rmse",
        "verbosity":1,  # 打印消息的详细程度
        "nthread":-1  # 并行线程数,使用最大
    }
    
    folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=2019)
    oof_xgb_263 = np.zeros(len(X_train_263))
    predictions_xgb_263 = np.zeros(len(X_test_263))
    
    
    for fold_, (trn_idx, val_idx) in enumerate(folds.split(X_train_263, y_train)):
        print("fold n°{}".format(fold_+1))
        trn_data = xgb.DMatrix(X_train_263[trn_idx], y_train[trn_idx])
        val_data = xgb.DMatrix(X_train_263[val_idx], y_train[val_idx])
    
        watchlist = [(trn_data, 'train'), (val_data, 'valid_data')]
        xgb_263 = xgb.train(dtrain=trn_data, num_boost_round=3000, evals=watchlist, 
                            early_stopping_rounds=600, verbose_eval=500, 
                            params=xgb_263_params)
        oof_xgb_263[val_idx] = xgb_263.predict(xgb.DMatrix(X_train_263[val_idx]), 
                                               ntree_limit=xgb_263.best_ntree_limit)
        predictions_xgb_263 += xgb_263.predict(xgb.DMatrix(X_test_263), ntree_limit=
                                               xgb_263.best_ntree_limit) / folds.n_splits
    
    print("CV score: {:<8.8f}".format(mean_squared_error(oof_xgb_263, target)))
    
    CV score: 0.45402538
    

    随机森林

    #RandomForestRegressor随机森林
    folds = KFold(n_splits=5, shuffle=True, random_state=2019)
    oof_rfr_263 = np.zeros(len(X_train_263))
    predictions_rfr_263 = np.zeros(len(X_test_263))
    
    for fold_, (trn_idx, val_idx) in enumerate(folds.split(X_train_263, y_train)):
        print("fold n°{}".format(fold_+1))
        tr_x = X_train_263[trn_idx]
        tr_y = y_train[trn_idx]
        rfr_263 = rfr(n_estimators=1600,max_depth=9, min_samples_leaf=9, 
                      min_weight_fraction_leaf=0.0,max_features=0.25,
                      verbose=1,n_jobs=-1) #并行化
        # verbose 控制训练日志的详细程度:0 不输出,1 输出简要进度信息,数值越大越详细
        rfr_263.fit(tr_x,tr_y)
        oof_rfr_263[val_idx] = rfr_263.predict(X_train_263[val_idx])
        
        predictions_rfr_263 += rfr_263.predict(X_test_263) / folds.n_splits
    
    print("CV score: {:<8.8f}".format(mean_squared_error(oof_rfr_263, target)))
    
    CV score: 0.47810816
    

    梯度提升GBDT

    #GradientBoostingRegressor梯度提升决策树
    folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=2018)
    oof_gbr_263 = np.zeros(train_shape)
    predictions_gbr_263 = np.zeros(len(X_test_263))
    
    for fold_, (trn_idx, val_idx) in enumerate(folds.split(X_train_263, y_train)):
        print("fold n°{}".format(fold_+1))
        tr_x = X_train_263[trn_idx]
        tr_y = y_train[trn_idx]
        gbr_263 = gbr(n_estimators=400, learning_rate=0.01,subsample=0.65,max_depth=7, min_samples_leaf=20,
                max_features=0.22,verbose=1)
        gbr_263.fit(tr_x,tr_y)
        oof_gbr_263[val_idx] = gbr_263.predict(X_train_263[val_idx])
        
        predictions_gbr_263 += gbr_263.predict(X_test_263) / folds.n_splits
    
    print("CV score: {:<8.8f}".format(mean_squared_error(oof_gbr_263, target)))
    
    CV score: 0.45781356
    

    极端随机森林回归

    #ExtraTreesRegressor 极端随机森林回归
    folds = KFold(n_splits=5, shuffle=True, random_state=13)
    oof_etr_263 = np.zeros(train_shape)
    predictions_etr_263 = np.zeros(len(X_test_263))
    
    for fold_, (trn_idx, val_idx) in enumerate(folds.split(X_train_263, y_train)):
        print("fold n°{}".format(fold_+1))
        tr_x = X_train_263[trn_idx]
        tr_y = y_train[trn_idx]
        etr_263 = etr(n_estimators=1000,max_depth=8, min_samples_leaf=12, min_weight_fraction_leaf=0.0,
                max_features=0.4,verbose=1,n_jobs=-1)# max_feature:划分时考虑的最大特征数
        etr_263.fit(tr_x,tr_y)
        oof_etr_263[val_idx] = etr_263.predict(X_train_263[val_idx])
        
        predictions_etr_263 += etr_263.predict(X_test_263) / folds.n_splits
    
    print("CV score: {:<8.8f}".format(mean_squared_error(oof_etr_263, target)))
    
    CV score: 0.48598827
    

    综上,我们得到了五个模型的预测结果、模型架构以及参数。接下来需要对这五个模型进行融合,这里采用Kernel Ridge Regression(核岭回归),它相当于在核空间中做岭回归;融合时同样采用五折交叉验证并重复两次:

    train_stack2 = np.vstack([oof_lgb_263,oof_xgb_263,oof_gbr_263,oof_rfr_263,oof_etr_263]).transpose()
    # vstack后得到形状为(5, 训练样本数)的二维数组,transpose()将其转置为(训练样本数, 5)
    # 即每一行是单个训练样本在5个模型上的验证集预测结果
    test_stack2 = np.vstack([predictions_lgb_263, predictions_xgb_263,predictions_gbr_263,predictions_rfr_263,predictions_etr_263]).transpose()
    # 同理,转置后每一行是单个测试集样本在5个模型上的预测结果
    #交叉验证:5折,重复2次
    folds_stack = RepeatedKFold(n_splits=5, n_repeats=2, random_state=7)
    oof_stack2 = np.zeros(train_stack2.shape[0])
    predictions_lr2 = np.zeros(test_stack2.shape[0])
    
    for fold_, (trn_idx, val_idx) in enumerate(folds_stack.split(train_stack2,target)):
        print("fold {}".format(fold_))
        trn_data, trn_y = train_stack2[trn_idx], target.iloc[trn_idx].values
        val_data, val_y = train_stack2[val_idx], target.iloc[val_idx].values
        #Kernel Ridge Regression
        lr2 = kr()
        lr2.fit(trn_data, trn_y)
        
        oof_stack2[val_idx] = lr2.predict(val_data)
        predictions_lr2 += lr2.predict(test_stack2) / 10  # 5折交叉验证重复2次共10次,取平均
        
    mean_squared_error(target.values, oof_stack2) 
    
    0.44815130114230267
    
    对49维的特征进行处理

    同样我们可以对49维度的特征进行类似处理,我将代码总结如下:

    ##### lgb_49
    lgb_49_param = {
        'num_leaves': 9,
        'min_data_in_leaf': 23,    # 一个叶子上数据的最小数量
        'objective': 'regression',
        'max_depth': -1,
        'learning_rate': 0.002,
        "boosting": "gbdt",
        "feature_fraction": 0.45,  # 每棵树训练之前随机选择45%的特征,可加速训练并缓解过拟合
        "bagging_freq": 1,
        "bagging_fraction": 0.65,  # 不进行重采样的情况下随机选择65%的数据,可加速训练并缓解过拟合
        "bagging_seed": 15,
        "metric": 'mse',
        "lambda_l2": 0.2,
        "verbosity": -1}
    folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=9)   
    oof_lgb_49 = np.zeros(len(X_train_49))
    predictions_lgb_49 = np.zeros(len(X_test_49))
    
    for fold_, (trn_idx, val_idx) in enumerate(folds.split(X_train_49, y_train)):
        print("fold n°{}".format(fold_+1))
        trn_data = lgb.Dataset(X_train_49[trn_idx], y_train[trn_idx])
        val_data = lgb.Dataset(X_train_49[val_idx], y_train[val_idx])
    
        num_round = 12000
        lgb_49 = lgb.train(lgb_49_param, trn_data, num_round, valid_sets = [trn_data, val_data], verbose_eval=1000, early_stopping_rounds = 1000)
        oof_lgb_49[val_idx] = lgb_49.predict(X_train_49[val_idx], num_iteration=lgb_49.best_iteration)
        predictions_lgb_49 += lgb_49.predict(X_test_49, num_iteration=lgb_49.best_iteration) / folds.n_splits
    
    print("CV score: {:<8.8f}".format(mean_squared_error(oof_lgb_49, target)))
    
    ##### xgb_49
    xgb_49_params = {'eta': 0.02, 
                  'max_depth': 5, 
                  'min_child_weight':3,
                  'gamma':0,
                  'subsample': 0.7, 
                  'colsample_bytree': 0.35, 
                  'lambda':2,
                  'objective': 'reg:linear', 
                  'eval_metric': 'rmse', 
                  'silent': True, 
                  'nthread': -1}
    
    
    folds = KFold(n_splits=5, shuffle=True, random_state=2019)
    oof_xgb_49 = np.zeros(len(X_train_49))
    predictions_xgb_49 = np.zeros(len(X_test_49))
    
    for fold_, (trn_idx, val_idx) in enumerate(folds.split(X_train_49, y_train)):
        print("fold n°{}".format(fold_+1))
        trn_data = xgb.DMatrix(X_train_49[trn_idx], y_train[trn_idx])
        val_data = xgb.DMatrix(X_train_49[val_idx], y_train[val_idx])
    
        watchlist = [(trn_data, 'train'), (val_data, 'valid_data')]
        xgb_49 = xgb.train(dtrain=trn_data, num_boost_round=3000, evals=watchlist, early_stopping_rounds=600, verbose_eval=500, params=xgb_49_params)
        oof_xgb_49[val_idx] = xgb_49.predict(xgb.DMatrix(X_train_49[val_idx]), ntree_limit=xgb_49.best_ntree_limit)
        predictions_xgb_49 += xgb_49.predict(xgb.DMatrix(X_test_49), ntree_limit=xgb_49.best_ntree_limit) / folds.n_splits
    
    print("CV score: {:<8.8f}".format(mean_squared_error(oof_xgb_49, target)))
    
    folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=2018)
    oof_gbr_49 = np.zeros(train_shape)
    predictions_gbr_49 = np.zeros(len(X_test_49))
    #GradientBoostingRegressor梯度提升决策树
    for fold_, (trn_idx, val_idx) in enumerate(folds.split(X_train_49, y_train)):
        print("fold n°{}".format(fold_+1))
        tr_x = X_train_49[trn_idx]
        tr_y = y_train[trn_idx]
        gbr_49 = gbr(n_estimators=600, learning_rate=0.01,subsample=0.65,max_depth=6, min_samples_leaf=20,
                max_features=0.35,verbose=1)
        gbr_49.fit(tr_x,tr_y)
        oof_gbr_49[val_idx] = gbr_49.predict(X_train_49[val_idx])
        
        predictions_gbr_49 += gbr_49.predict(X_test_49) / folds.n_splits
    
    print("CV score: {:<8.8f}".format(mean_squared_error(oof_gbr_49, target)))
    

    这里由于特征数目较少,只考虑使用3个基础模型,接下来便是进行融合:

    train_stack3 = np.vstack([oof_lgb_49,oof_xgb_49,oof_gbr_49]).transpose()
    test_stack3 = np.vstack([predictions_lgb_49, predictions_xgb_49,predictions_gbr_49]).transpose()
    #
    folds_stack = RepeatedKFold(n_splits=5, n_repeats=2, random_state=7)
    oof_stack3 = np.zeros(train_stack3.shape[0])
    predictions_lr3 = np.zeros(test_stack3.shape[0])
    
    for fold_, (trn_idx, val_idx) in enumerate(folds_stack.split(train_stack3,target)):
        print("fold {}".format(fold_))
        trn_data, trn_y = train_stack3[trn_idx], target.iloc[trn_idx].values
        val_data, val_y = train_stack3[val_idx], target.iloc[val_idx].values
            #Kernel Ridge Regression
        lr3 = kr()
        lr3.fit(trn_data, trn_y)
        
        oof_stack3[val_idx] = lr3.predict(val_data)
        predictions_lr3 += lr3.predict(test_stack3) / 10
        
    mean_squared_error(target.values, oof_stack3) 
    
    
    0.4662728551415085
    
    对383维的特征进行处理

    这里对383维度的特征采用刚才类似的方法是可以的,但是也可以采用其他的方法,例如我们将基模型换成普通的回归模型:

    Kernel Ridge Regression 基于核的岭回归

    folds = KFold(n_splits=5, shuffle=True, random_state=13)
    oof_kr_383 = np.zeros(train_shape)
    predictions_kr_383 = np.zeros(len(X_test_383))
    
    for fold_, (trn_idx, val_idx) in enumerate(folds.split(X_train_383, y_train)):
        print("fold n°{}".format(fold_+1))
        tr_x = X_train_383[trn_idx]
        tr_y = y_train[trn_idx]
        #Kernel Ridge Regression 岭回归
        kr_383 = kr()
        kr_383.fit(tr_x,tr_y)
        oof_kr_383[val_idx] = kr_383.predict(X_train_383[val_idx])
        
        predictions_kr_383 += kr_383.predict(X_test_383) / folds.n_splits
    
    print("CV score: {:<8.8f}".format(mean_squared_error(oof_kr_383, target)))
    
    CV score: 0.51412085
    

    可以看到普通回归的误差相比于集成算法会高一些。


    普通岭回归

    folds = KFold(n_splits=5, shuffle=True, random_state=13)
    oof_ridge_383 = np.zeros(train_shape)
    predictions_ridge_383 = np.zeros(len(X_test_383))
    
    for fold_, (trn_idx, val_idx) in enumerate(folds.split(X_train_383, y_train)):
        print("fold n°{}".format(fold_+1))
        tr_x = X_train_383[trn_idx]
        tr_y = y_train[trn_idx]
        #使用岭回归
        ridge_383 = Ridge(alpha=1200)
        ridge_383.fit(tr_x,tr_y)
        oof_ridge_383[val_idx] = ridge_383.predict(X_train_383[val_idx])
        
        predictions_ridge_383 += ridge_383.predict(X_test_383) / folds.n_splits
    
    print("CV score: {:<8.8f}".format(mean_squared_error(oof_ridge_383, target)))
    
    CV score: 0.48687670
    

    ElasticNet 弹性网络

    folds = KFold(n_splits=5, shuffle=True, random_state=13)
    oof_en_383 = np.zeros(train_shape)
    predictions_en_383 = np.zeros(len(X_test_383))
    
    for fold_, (trn_idx, val_idx) in enumerate(folds.split(X_train_383, y_train)):
        print("fold n°{}".format(fold_+1))
        tr_x = X_train_383[trn_idx]
        tr_y = y_train[trn_idx]
        #ElasticNet 弹性网络
        en_383 = en(alpha=1.0,l1_ratio=0.06)
        en_383.fit(tr_x,tr_y)
        oof_en_383[val_idx] = en_383.predict(X_train_383[val_idx])
        
        predictions_en_383 += en_383.predict(X_test_383) / folds.n_splits
    
    print("CV score: {:<8.8f}".format(mean_squared_error(oof_en_383, target)))
    
    CV score: 0.53296555
    

    BayesianRidge 贝叶斯岭回归

    folds = KFold(n_splits=5, shuffle=True, random_state=13)
    oof_br_383 = np.zeros(train_shape)
    predictions_br_383 = np.zeros(len(X_test_383))
    
    for fold_, (trn_idx, val_idx) in enumerate(folds.split(X_train_383, y_train)):
        print("fold n°{}".format(fold_+1))
        tr_x = X_train_383[trn_idx]
        tr_y = y_train[trn_idx]
        #BayesianRidge 贝叶斯回归
        br_383 = br()
        br_383.fit(tr_x,tr_y)
        oof_br_383[val_idx] = br_383.predict(X_train_383[val_idx])
        
        predictions_br_383 += br_383.predict(X_test_383) / folds.n_splits
    
    print("CV score: {:<8.8f}".format(mean_squared_error(oof_br_383, target)))
    
    CV score: 0.48717310
    

    那么对383维特征上的这四个回归模型也进行融合:

    train_stack1 = np.vstack([oof_br_383,oof_kr_383,oof_en_383,oof_ridge_383]).transpose()
    test_stack1 = np.vstack([predictions_br_383, predictions_kr_383,predictions_en_383,predictions_ridge_383]).transpose()
    
    folds_stack = RepeatedKFold(n_splits=5, n_repeats=2, random_state=7)
    oof_stack1 = np.zeros(train_stack1.shape[0])
    predictions_lr1 = np.zeros(test_stack1.shape[0])
    
    for fold_, (trn_idx, val_idx) in enumerate(folds_stack.split(train_stack1,target)):
        print("fold {}".format(fold_))
        trn_data, trn_y = train_stack1[trn_idx], target.iloc[trn_idx].values
        val_data, val_y = train_stack1[val_idx], target.iloc[val_idx].values
        # LinearRegression简单的线性回归
        lr1 = lr()
        lr1.fit(trn_data, trn_y)
        
        oof_stack1[val_idx] = lr1.predict(val_data)
        predictions_lr1 += lr1.predict(test_stack1) / 10
        
    mean_squared_error(target.values, oof_stack1) 
    
    
    0.4878202780283125
    
    对49维特征的继续处理

    由于这49维是最重要的特征,所以这里考虑在49维特征上引入更多的模型,加入与383维特征相同的几种回归模型,具体如下:

    folds = KFold(n_splits=5, shuffle=True, random_state=13)
    oof_kr_49 = np.zeros(train_shape)
    predictions_kr_49 = np.zeros(len(X_test_49))
    
    for fold_, (trn_idx, val_idx) in enumerate(folds.split(X_train_49, y_train)):
        print("fold n°{}".format(fold_+1))
        tr_x = X_train_49[trn_idx]
        tr_y = y_train[trn_idx]
        kr_49 = kr()
        kr_49.fit(tr_x,tr_y)
        oof_kr_49[val_idx] = kr_49.predict(X_train_49[val_idx])
        
        predictions_kr_49 += kr_49.predict(X_test_49) / folds.n_splits
    
    print("CV score: {:<8.8f}".format(mean_squared_error(oof_kr_49, target)))
    
    folds = KFold(n_splits=5, shuffle=True, random_state=13)
    oof_ridge_49 = np.zeros(train_shape)
    predictions_ridge_49 = np.zeros(len(X_test_49))
    
    for fold_, (trn_idx, val_idx) in enumerate(folds.split(X_train_49, y_train)):
        print("fold n°{}".format(fold_+1))
        tr_x = X_train_49[trn_idx]
        tr_y = y_train[trn_idx]
        ridge_49 = Ridge(alpha=6)
        ridge_49.fit(tr_x,tr_y)
        oof_ridge_49[val_idx] = ridge_49.predict(X_train_49[val_idx])
        
        predictions_ridge_49 += ridge_49.predict(X_test_49) / folds.n_splits
    
    print("CV score: {:<8.8f}".format(mean_squared_error(oof_ridge_49, target)))
    
    folds = KFold(n_splits=5, shuffle=True, random_state=13)
    oof_br_49 = np.zeros(train_shape)
    predictions_br_49 = np.zeros(len(X_test_49))
    
    for fold_, (trn_idx, val_idx) in enumerate(folds.split(X_train_49, y_train)):
        print("fold n°{}".format(fold_+1))
        tr_x = X_train_49[trn_idx]
        tr_y = y_train[trn_idx]
        br_49 = br()
        br_49.fit(tr_x,tr_y)
        oof_br_49[val_idx] = br_49.predict(X_train_49[val_idx])
        
        predictions_br_49 += br_49.predict(X_test_49) / folds.n_splits
    
    print("CV score: {:<8.8f}".format(mean_squared_error(oof_br_49, target)))
    
    folds = KFold(n_splits=5, shuffle=True, random_state=13)
    oof_en_49 = np.zeros(train_shape)
    predictions_en_49 = np.zeros(len(X_test_49))
    #
    for fold_, (trn_idx, val_idx) in enumerate(folds.split(X_train_49, y_train)):
        print("fold n°{}".format(fold_+1))
        tr_x = X_train_49[trn_idx]
        tr_y = y_train[trn_idx]
        en_49 = en(alpha=1.0,l1_ratio=0.05)
        en_49.fit(tr_x,tr_y)
        oof_en_49[val_idx] = en_49.predict(X_train_49[val_idx])
        
        predictions_en_49 += en_49.predict(X_test_49) / folds.n_splits
    
    print("CV score: {:<8.8f}".format(mean_squared_error(oof_en_49, target)))
    

    再进行融合:

    train_stack4 = np.vstack([oof_br_49,oof_kr_49,oof_en_49,oof_ridge_49]).transpose()
    test_stack4 = np.vstack([predictions_br_49, predictions_kr_49,predictions_en_49,predictions_ridge_49]).transpose()
    
    folds_stack = RepeatedKFold(n_splits=5, n_repeats=2, random_state=7)
    oof_stack4 = np.zeros(train_stack4.shape[0])
    predictions_lr4 = np.zeros(test_stack4.shape[0])
    
    for fold_, (trn_idx, val_idx) in enumerate(folds_stack.split(train_stack4,target)):
        print("fold {}".format(fold_))
        trn_data, trn_y = train_stack4[trn_idx], target.iloc[trn_idx].values
        val_data, val_y = train_stack4[val_idx], target.iloc[val_idx].values
        #LinearRegression
        lr4 = lr()
        lr4.fit(trn_data, trn_y)
        
        oof_stack4[val_idx] = lr4.predict(val_data)
        predictions_lr4 += lr4.predict(test_stack4) / 10
        
    mean_squared_error(target.values, oof_stack4) 
    
    
    0.49491439094008133
    
    模型融合

    我们基于263维、383维、49维特征的数据总共得到了4组融合结果(oof_stack1~oof_stack4),现在需要对这些结果再做一次融合,同样可以通过学习一个回归模型来完成:

    train_stack5 = np.vstack([oof_stack1,oof_stack2,oof_stack3,oof_stack4]).transpose()
    test_stack5 = np.vstack([predictions_lr1, predictions_lr2,predictions_lr3,predictions_lr4]).transpose()
    
    folds_stack = RepeatedKFold(n_splits=5, n_repeats=2, random_state=7)
    oof_stack5 = np.zeros(train_stack5.shape[0])
    predictions_lr5= np.zeros(test_stack5.shape[0])
    
    for fold_, (trn_idx, val_idx) in enumerate(folds_stack.split(train_stack5,target)):
        print("fold {}".format(fold_))
        trn_data, trn_y = train_stack5[trn_idx], target.iloc[trn_idx].values
        val_data, val_y = train_stack5[val_idx], target.iloc[val_idx].values
        #LinearRegression
        lr5 = lr()
        lr5.fit(trn_data, trn_y)
        
        oof_stack5[val_idx] = lr5.predict(val_data)
        predictions_lr5 += lr5.predict(test_stack5) / 10
        
    mean_squared_error(target.values, oof_stack5) 
    
    
    0.4480223491250565
    

    可以看到结果改进了不少。

    结果保存
    submit_example = pd.read_csv('submit_example.csv',sep=',',encoding='latin-1')
    
    submit_example['happiness'] = predictions_lr5
    submit_example.loc[submit_example['happiness']>4.96,'happiness']= 5
    submit_example.loc[submit_example['happiness']<=1.04,'happiness']= 1
    submit_example.loc[(submit_example['happiness']>1.96)&(submit_example['happiness']<2.04),'happiness']= 2
    # 进行整数解的近似
    submit_example.to_csv("submision.csv",index=False)
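
    作为一个可选的补充思路(并非上文方案的一部分,仅为示意):若不做上面的整数近似,也可以直接用np.clip把预测值裁剪到合法区间[1, 5]:

    import numpy as np
    clipped = np.clip(predictions_lr5, 1, 5)  # 把越界的预测值裁剪回[1, 5]区间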
    

    集成学习案例2——蒸汽量预测

    背景介绍

    锅炉的燃烧效率的影响因素很多,包括锅炉的可调参数,如燃烧给量,一二次风,引风,返料风,给水水量;以及锅炉的工况,比如锅炉床温、床压,炉膛温度、压力,过热器的温度等。我们如何使用以上的信息,根据锅炉的工况,预测产生的蒸汽量呢?

    数据分成训练数据(train.txt)和测试数据(test.txt),其中字段”V0”-“V37”,这38个字段是作为特征变量,”target”作为目标变量。我们需要利用训练数据训练出模型,预测测试数据的目标变量。

    最终的评价指标为均方误差MSE
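
    MSE可以直接用sklearn中的mean_squared_error计算,下面是一个简单示例(数值仅为演示):

    import numpy as np
    from sklearn.metrics import mean_squared_error

    y_true = np.array([1.0, 2.0, 3.0])
    y_pred = np.array([1.1, 1.9, 3.2])
    print(mean_squared_error(y_true, y_pred))  # 等价于 np.mean((y_true - y_pred) ** 2)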

    具体代码

    导入需要的包
    import warnings
    warnings.filterwarnings("ignore")  # 忽略各种不影响正常运行的警告
    import matplotlib.pyplot as plt
    import seaborn as sns
    
    # 模型
    import pandas as pd
    import numpy as np
    from scipy import stats
    from sklearn.model_selection import train_test_split
    from sklearn.model_selection import GridSearchCV, RepeatedKFold, cross_val_score,cross_val_predict,KFold
    from sklearn.metrics import make_scorer,mean_squared_error
    from sklearn.linear_model import LinearRegression, Lasso, Ridge, ElasticNet
    from sklearn.svm import LinearSVR, SVR
    from sklearn.neighbors import KNeighborsRegressor
    from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor,AdaBoostRegressor
    from xgboost import XGBRegressor
    from sklearn.preprocessing import PolynomialFeatures,MinMaxScaler,StandardScaler
    
    加载数据
    data_train = pd.read_csv("train.txt", sep='\t')
    data_test = pd.read_csv("test.txt", sep = '\t')
    # 将训练集和测试集拼接在一起,做特征处理时比较方便
    data_train['oringin'] = 'train'
    data_test['oringin'] = 'test'
    data_all = pd.concat([data_train, data_test], axis = 0, ignore_index = True)
    # 按照上下的方式叠在一起,并且忽略掉原来的索引,叠后重新建立索引
    data_all.head()
    data_all.tail()  # target列自动填充NaN
    

    (data_all.head()与data_all.tail()的输出结果)

    探索数据分布

    这里因为是传感器的数据,即连续变量,所以使用 kdeplot(核密度估计图) 进行数据的初步分析,即EDA:

    for column in data_all.columns[0:-2]:
        # 为每个特征画一个核密度估计图
        g = sns.kdeplot(data_all[column][(data_all['oringin'] == "train")], color='Red',
                       shade=True)  # 在曲线下面绘制阴影
        g = sns.kdeplot(data_all[column][(data_all["oringin"] == "test")], color='Blue',
                       ax=g, shade=True)  # ax=g还是在原来的画柄上
        g.set_xlabel(column)
        g.set_ylabel("Frequency")
        g = g.legend(["train", "test"])
        plt.show()
    

    这里为每个特征都画出了一个核密度的图,太多了我就放一个吧。

    (图:某一特征在训练集与测试集上的核密度估计曲线示例)

    那么我们观察所有的图片,可以发现某些特征的训练数据和测试数据的分布存在很大的差异,例如下图:

    (图:训练集与测试集分布差异较大的特征的核密度估计曲线)

    那我们的选择是删除这些特征:

    # 从以上的图中可以看出特征"V5","V9","V11","V17","V22","V28"中
    # 训练集数据分布和测试集数据分布不均,所以我们删除这些特征数据
    for column in ["V5","V9","V11","V17","V22","V28"]:
        g = sns.kdeplot(data_all[column][(data_all["oringin"] == "train")], color="Red", shade = True)
        g = sns.kdeplot(data_all[column][(data_all["oringin"] == "test")], ax =g, color="Blue", shade= True)
        g.set_xlabel(column)
        g.set_ylabel("Frequency")
        g = g.legend(["train","test"])
        plt.show()
    
    data_all.drop(["V5","V9","V11","V17","V22","V28"],axis=1,inplace=True)
    # inplace 代表在原来的数据上做删除操作
    

    剩下的特征,我们可以查看彼此之间的相关性:

    # 查看特征之之间的相关性
    data_train1 = data_all[data_all["oringin"] == "train"].drop("oringin", axis=1)
    plt.figure(figsize=(25,20))
    colnm = data_train1.columns.tolist()  # 得到特征名称构成的列表
    mcorr = data_train1[colnm].corr(method="spearman")  # 计算特征之间的相关系数矩阵
    mask = np.zeros_like(mcorr, dtype=bool)  # 构造与mcorr同维度的布尔矩阵(新版numpy已移除np.bool,直接用内置bool)
    mask[np.triu_indices_from(mask)] = True  # 里面那个是返回方阵的上三角的索引
    cmap = sns.diverging_palette(220, 10, as_cmap = True)  
    # 从220和10之间建立一个逐渐变化色板对象
    g = sns.heatmap(mcorr, mask = mask, cmap = cmap, square = True, 
                    annot = True, fmt='0.2f')
    # 第一个为数据集,cmap为颜色条的名称或者对象或者列表,annot代表是否写入数值,
    # square将每个单元格设置为方形
    plt.show()
    

    [figure: Spearman correlation heatmap of the remaining features]
    We can see that some features are strongly correlated with each other, while others have very little correlation with the target column. The contribution of the latter can be considered negligible, so we drop them:

    # drop the features whose correlation with target is too small
    threshold = 0.1
    corr_matrix = data_train1.corr().abs()  # absolute values of the correlation matrix
    drop_col = corr_matrix[corr_matrix["target"] < threshold].index
    print(drop_col)
    data_all.drop(drop_col, axis=1, inplace=True)
    
    Index(['V14', 'V21', 'V25', 'V26', 'V32', 'V33', 'V34'], dtype='object')
    

    Next the data need to be min-max normalised:

    # min-max normalisation
    cols_numeric = list(data_all.columns)  # current list of columns
    cols_numeric.remove("oringin")  # drop the train/test flag column
    def scale_minmax(col):
        return (col - col.min()) / (col.max() - col.min())
    
    scale_cols = [col for col in cols_numeric if col != "target"]  # every column except the target
    data_all[scale_cols] = data_all[scale_cols].apply(scale_minmax, axis=0)
    data_all[scale_cols].describe()
    

    [table: describe() output of the scaled features]

    Feature engineering

    Here we introduce the Box-Cox transform. Many modelling assumptions place requirements on the data, so we would like the features to be as close to normally distributed as possible. The Box-Cox transform has two goals. First, it can reduce, to some extent, the correlation between the unobservable error and the predictors: the dependent variable is transformed so that the transformed variable has a linear relationship with the regressors, the errors follow a normal distribution, and the error components have equal variance and are mutually independent. Second, the transform can give the dependent variable useful properties, such as stationarity in time-series analysis or an approximately normal distribution.
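    For reference, the one-parameter Box-Cox transform of a strictly positive variable $y$ with parameter $\lambda$ is

    $$y^{(\lambda)} = \begin{cases} \dfrac{y^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\ \ln y, & \lambda = 0 \end{cases}$$

    scipy.stats.boxcox estimates $\lambda$ by maximum likelihood and returns both the transformed data and the fitted $\lambda$, which is why the calls below unpack two return values.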

    Let us first compare each feature before and after the transform:

    fcols = 6
    frows = len(cols_numeric) - 1  # target is not plotted
    plt.figure(figsize=(4*fcols, 4*frows))
    i = 0
    
    for var in cols_numeric:
        if var != "target":
            data_tem = data_all[[var, "target"]].dropna()  # take the current column and target, drop missing values
            # histogram before the transform
            i += 1
            plt.subplot(frows, fcols, i)
            sns.distplot(data_tem[var], fit=stats.norm)
            # draw the histogram and fit a normal distribution (stats.norm) to it
            plt.title(var + " original")
            plt.xlabel("")
            # QQ plot to check whether the data follow a normal distribution
            i += 1
            plt.subplot(frows, fcols, i)
            _ = stats.probplot(data_tem[var], plot=plt)
            plt.title("skew=" + "{:.4f}".format(stats.skew(data_tem[var])))
            # stats.skew computes the skewness
            plt.xlabel("")
            plt.ylabel("")
            # scatter plot of the feature against the target
            i += 1
            plt.subplot(frows, fcols, i)
            plt.plot(data_tem[var], data_tem["target"], ".", alpha=0.5)
            # x axis: the feature, y axis: target, drawn as dots with alpha 0.5
            plt.title("corr=" + "{:.2f}".format(np.corrcoef(data_tem[var],
                                                            data_tem["target"])[0][1]))
            # histogram after the Box-Cox transform
            i += 1
            plt.subplot(frows, fcols, i)
            trans_var, lambda_var = stats.boxcox(data_tem[var].dropna() + 1)
            trans_var = scale_minmax(trans_var)
            sns.distplot(trans_var, fit=stats.norm)
            plt.title(var + " transformed")
            plt.xlabel("")
            # QQ plot after the transform
            i += 1
            plt.subplot(frows, fcols, i)
            _ = stats.probplot(trans_var, plot=plt)
            plt.title('skew=' + '{:.4f}'.format(stats.skew(trans_var)))
            plt.xlabel('')
            plt.ylabel('')
            # scatter plot after the transform
            i += 1
            plt.subplot(frows, fcols, i)
            plt.plot(trans_var, data_tem['target'], '.', alpha=0.5)
            plt.title('corr=' + '{:.2f}'.format(np.corrcoef(trans_var,
                                                            data_tem['target'])[0][1]))
    
    

    [figure: distribution, QQ and scatter plots for one feature, before and after the Box-Cox transform]

    The QQ plots show that normality improves considerably after the transform.

    We therefore apply the transform to all features:

    # apply the Box-Cox transform
    cols_transform = data_all.columns[0:-2]  # the feature columns to transform
    for col in cols_transform:
        data_all.loc[:, col], _ = stats.boxcox(data_all.loc[:, col] + 1)
        # the features were min-max scaled to [0, 1]; boxcox requires strictly
        # positive input, so +1 shifts them to [1, 2]
    print(data_all.target.describe())
    
    count    2888.000000
    mean        0.126353
    std         0.983966
    min        -3.044000
    25%        -0.350250
    50%         0.313000
    75%         0.793250
    max         2.538000
    Name: target, dtype: float64
    

    The same idea applies to the target column: we can transform it as well. Here the code maps the target through a power transform (1.5 ** target) to improve its normality:

    # before the transform
    plt.figure(figsize=(12, 4))
    plt.subplot(1, 2, 1)
    sns.distplot(data_all.target.dropna(), fit=stats.norm)
    plt.subplot(1, 2, 2)
    _ = stats.probplot(data_all.target.dropna(), plot=plt)
    # transform the target to improve its normality
    sp = data_train.target
    data_train['target1'] = np.power(1.5, sp)  # create a proper column rather than an attribute
    plt.figure(figsize=(12, 4))
    plt.subplot(1, 2, 1)
    sns.distplot(data_train.target1.dropna(), fit=stats.norm)
    plt.subplot(1, 2, 2)
    _ = stats.probplot(data_train.target1.dropna(), plot=plt)
    

    [figures: target distribution and QQ plots before and after the transform]

    Model building and ensemble learning
    Building the datasets
    # build the training and test sets
    from sklearn.model_selection import train_test_split
    def get_training_data():
        df_train = data_all[data_all["oringin"] == "train"].copy()  # .copy() avoids a SettingWithCopyWarning below
        df_train["label"] = data_train.target1  # the power-transformed label
        y = df_train.target
        X = df_train.drop(["oringin", "target", "label"], axis=1)
        X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.3,
                                                              random_state=100)
        return X_train, X_valid, y_train, y_valid
    def get_test_data():
        df_test = data_all[data_all["oringin"] == "test"].reset_index(drop=True)
        # the test rows come after the training rows, so reset the index to start from zero
        return df_test.drop(["oringin", "target"], axis=1)
    
    Building the evaluation functions
    # 自己实现评估函数
    from sklearn.metrics import make_scorer
    def rmse(y_true, y_pred):
        diff = y_pred - y_true
        sum_sq = sum(diff * 2)
        n = len(y_pred)
        return np.sqrt(sum_sq / n)
    
    def mse(y_true, y_pred):
        return mean_squared_error(y_true, y_pred)
    
    rmse_scorer = make_scorer(rmse, greater_is_better = False)
    # 构建损失函数对象,第二个参数为True代表值越大越好,False代表值越小越好
    mse_scorer = make_scorer(mse, greater_is_better = False)
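
    A quick way to sanity-check the sign convention (a minimal sketch; the synthetic X_demo/y_demo data are made up purely for illustration):

    # Minimal sketch: scorers built with greater_is_better=False return negated
    # errors, so flip the sign back when reporting.
    rng = np.random.RandomState(0)
    X_demo = rng.rand(100, 5)
    y_demo = X_demo @ rng.rand(5) + 0.1 * rng.randn(100)
    
    demo_scores = cross_val_score(Ridge(), X_demo, y_demo, scoring=mse_scorer, cv=5)
    print(abs(demo_scores.mean()))  # mean cross-validated MSE (sign flipped back)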
    
    Building a function to find outliers
    # this function finds outliers based on the residuals of a fitted model
    def find_outliers(model, X, y, sigma=3):
        model.fit(X, y)
        y_pred = pd.Series(model.predict(X), index=y.index)
        
        resid = y - y_pred
        mean_resid = resid.mean()
        std_resid = resid.std()
        
        z = (resid - mean_resid) / std_resid  # standardise the residuals
        outliers = z[abs(z) > sigma].index  # samples more than sigma standard deviations from the mean residual
        
        print("R2 = ", model.score(X, y))
        print("rmse = ", rmse(y, y_pred))
        print("mse = ", mean_squared_error(y, y_pred))
        print("--" * 20)
        
        print("mean of residuals:", mean_resid)
        print("std of residuals:", std_resid)
        print("--" * 20)
        
        print(len(outliers), "outliers:")
        print(outliers.tolist())
        
        plt.figure(figsize=(15, 5))
        ax_131 = plt.subplot(1, 3, 1)
        plt.plot(y, y_pred, '.')
        plt.plot(y.loc[outliers], y_pred.loc[outliers], "ro")
        plt.legend(["Accepted", "Outlier"])
        plt.xlabel("y")
        plt.ylabel("y_pred")
        
        ax_132 = plt.subplot(1, 3, 2)
        plt.plot(y, y - y_pred, ".")
        plt.plot(y.loc[outliers], y.loc[outliers] - y_pred.loc[outliers], "ro")
        plt.legend(['Accepted', 'Outlier'])
        plt.xlabel('y')
        plt.ylabel('y - y_pred')
        
        ax_133 = plt.subplot(1, 3, 3)
        z.plot.hist(bins=50, ax=ax_133)  # histogram of the standardised residuals
        z.loc[outliers].plot.hist(color='r', bins=50, ax=ax_133)
        plt.legend(['Accepted', 'Outlier'])
        plt.xlabel('z')
        
        return outliers
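
    In symbols, the rule above flags sample $i$ as an outlier when its standardised residual exceeds the chosen threshold (here $\sigma = 3$):

    $$z_i = \frac{r_i - \bar{r}}{s_r}, \qquad |z_i| > 3 \;\Rightarrow\; \text{outlier},$$

    where $r_i = y_i - \hat{y}_i$ and $\bar{r}$, $s_r$ are the mean and standard deviation of the residuals.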
    

    We try ridge regression to look for outliers:

    X_train, X_valid, y_train, y_valid = get_training_data()
    test = get_test_data()
    
    outliers = find_outliers(Ridge(), X_train, y_train)
    # start with a simple ridge regression fit
    X_outliers = X_train.loc[outliers]
    y_outliers = y_train.loc[outliers]
    X_t = X_train.drop(outliers)
    y_t = y_train.drop(outliers)
    
    R2 =  0.8766692297367809
    rmse =  nan
    mse =  0.12180705697820839
    ----------------------------------------
    mean of residuals: 1.8875439448847292e-16
    std of residuals: 0.3490950551088699
    ----------------------------------------
    22 outliers:
    [2655, 2159, 1164, 2863, 1145, 2697, 2528, 2645, 691, 1085, 1874, 2647, 884, 2696, 2668, 1310, 1901, 1458, 2769, 2002, 2669, 1972]
    

    [figure: outlier diagnostic plots from find_outliers]
    As we can see, the fit is quite good.

    Model training
    def get_trainning_data_omitoutliers():
        # return the training data with the outliers removed
        y = y_t.copy()
        X = X_t.copy()
        return X, y
    
    def train_model(model, param_grid=[], X=[], y=[], splits=5, repeats=5):
        # a simple training routine
        if len(y) == 0:
            X, y = get_trainning_data_omitoutliers()
            
        # repeated K-fold cross-validation
        rkfold = RepeatedKFold(n_splits=splits, n_repeats=repeats)
        
        # grid search if a parameter grid is supplied
        if len(param_grid) > 0:
            gsearch = GridSearchCV(model, param_grid, cv=rkfold,
                                   scoring="neg_mean_squared_error",
                                   verbose=1, return_train_score=True)
            gsearch.fit(X, y)
            model = gsearch.best_estimator_
            best_idx = gsearch.best_index_  # index of the best parameter combination in param_grid
            
            # collect the cross-validation metrics
            grid_results = pd.DataFrame(gsearch.cv_results_)
            cv_mean = abs(grid_results.loc[best_idx, "mean_test_score"])
            cv_std = grid_results.loc[best_idx, "std_test_score"]
            
        else:  # no grid search: cross-validate and fit the model as given
            grid_results = []
            cv_results = cross_val_score(model, X, y, scoring="neg_mean_squared_error",
                                         cv=rkfold)
            cv_mean = abs(np.mean(cv_results))
            cv_std = np.std(cv_results)
            model.fit(X, y)  # fit so the predictions below also work on this branch
            
        # collect the scores
        cv_score = pd.Series({"mean": cv_mean, "std": cv_std})
        # predictions on the training data
        y_pred = model.predict(X)
        # performance statistics
        print('----------------------')
        print(model)
        print('----------------------')
        print('score=', model.score(X, y))
        print('rmse=', rmse(y, y_pred))
        print('mse=', mse(y, y_pred))
        print('cross_val: mean=', cv_mean, ', std=', cv_std)
        
        # residual analysis
        y_pred = pd.Series(y_pred, index=y.index)
        resid = y - y_pred
        mean_resid = resid.mean()
        std_resid = resid.std()
        z = (resid - mean_resid) / std_resid
        n_outliers = sum(abs(z) > 3)
        outliers = z[abs(z) > 3].index
        
        return model, cv_score, grid_results
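
    The function can also be called without a parameter grid, in which case it simply cross-validates and fits the model that is passed in. A minimal usage sketch (the choice of LinearRegression here is only illustrative):

    baseline_model, baseline_cv, _ = train_model(LinearRegression(), splits=5, repeats=2)
    print(baseline_cv)  # mean and std of the repeated K-fold MSE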
    

    Fit with ridge regression:

    # containers for the trained models and their scores
    opt_models = dict()
    score_models = pd.DataFrame(columns=['mean', 'std'])
    splits = 5
    repeats = 5
    model = 'Ridge'  # can be swapped for any of the models from case study 1
    opt_models[model] = Ridge()  # likewise replaceable
    alph_range = np.arange(0.25, 6, 0.25)
    param_grid = {'alpha': alph_range}
    
    opt_models[model], cv_score, grid_results = train_model(opt_models[model],
                                                            param_grid=param_grid,
                                                            splits=splits, repeats=repeats)
    
    cv_score.name = model
    score_models = pd.concat([score_models, cv_score.to_frame().T])
    # (DataFrame.append was removed in pandas 2.0, so pd.concat is used instead)
    
    plt.figure()
    plt.errorbar(alph_range, abs(grid_results['mean_test_score']),
                 abs(grid_results['std_test_score']) / np.sqrt(splits * repeats))
    plt.xlabel('alpha')
    plt.ylabel('score')
    
    Ridge(alpha=0.25)
    ----------------------
    score= 0.8926884445023296
    rmse= 5.109572960503552e-08
    mse= 0.10540676395670681
    cross_val: mean= 0.10910070303768438 , std= 0.005867873719733897
    

    [figure: mean cross-validation score vs. alpha with error bars]

    Switching the model

    Ridge regression felt rather simple, so I tried switching to a random forest regressor. The full train_model pipeline was too slow for it, so I wrote a simpler training procedure of my own.

    First, do a coarse randomized search over a rough parameter range:

    from sklearn.model_selection import RandomizedSearchCV
    n_estimators_range=[int(x) for x in np.linspace(start=50,stop=500,num=50)]
    max_depth_range=[5,10,15]
    max_depth_range.append(None)
    min_samples_split_range=[2,5,10]
    min_samples_leaf_range=[4,8]
    random_forest_seed = 1
    random_forest_hp_range={'n_estimators':n_estimators_range,
                            'max_depth':max_depth_range,
                            'min_samples_split':min_samples_split_range,
                            'min_samples_leaf':min_samples_leaf_range
                            }
    X, y = get_trainning_data_omitoutliers()
    random_forest_model_test_base=RandomForestRegressor()
    random_forest_model_test_random=RandomizedSearchCV(estimator=random_forest_model_test_base,
                                                       param_distributions=random_forest_hp_range,
                                                       n_iter=200,
                                                       n_jobs=-1,
                                                       cv=5,
                                                       verbose=1,
                                                       random_state=random_forest_seed
                                                       )
    random_forest_model_test_random.fit(X,y)
    best_hp_now=random_forest_model_test_random.best_params_
    print(best_hp_now)
    
    {'n_estimators': 343, 'min_samples_split': 2, 'min_samples_leaf': 4, 'max_depth': None}
    

    Then, based on these parameters, narrow the ranges and run a grid search:

    n_estimators_range = [int(x) for x in np.linspace(start=320, stop=370, num=10)]
    max_depth_range = [5, 10, 15]
    max_depth_range.append(None)
    min_samples_split_range = [2, 3, 4, 5]  # min_samples_split must be at least 2
    min_samples_leaf_range = [2, 3, 4, 5, 6]
    random_forest_seed = 1
    random_forest_hp_range_2 = {'n_estimators': n_estimators_range,
                                'max_depth': max_depth_range,
                                'min_samples_split': min_samples_split_range,
                                'min_samples_leaf': min_samples_leaf_range
                                }
    random_forest_model_test_2_base = RandomForestRegressor()
    random_forest_model_test_2_random = GridSearchCV(estimator=random_forest_model_test_2_base,
                                                     param_grid=random_forest_hp_range_2,
                                                     cv=5,
                                                     verbose=1,
                                                     n_jobs=-1)
    random_forest_model_test_2_random.fit(X, y)
    best_hp_now_2 = random_forest_model_test_2_random.best_params_
    print(best_hp_now_2)
    
    {'max_depth': None, 'min_samples_leaf': 2, 'min_samples_split': 3, 'n_estimators': 364}
    

    Finally, make the predictions and save them:

    random_forest_model_final=random_forest_model_test_2_random.best_estimator_
    y_predict = pd.DataFrame(random_forest_model_final.predict(test))
    y_predict.to_csv("predict.txt", header = None,index = False)
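
    As a sanity check before submitting (a minimal sketch reusing the 30% hold-out split returned by get_training_data() earlier), the tuned model can be scored with the competition metric on the validation data:

    valid_pred = random_forest_model_final.predict(X_valid)
    print("hold-out MSE:", mean_squared_error(y_valid, valid_pred))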
    

    Done!

  • Original article: https://blog.csdn.net/StarandTiAmo/article/details/128039559