• [AI in Practice] Building an Air Quality Prediction Model with xgboost.XGBRegressor (Part 1)


    1. xgboost.XGBRegressor in Detail

    The full parameter list for xgboost.XGBRegressor is documented at https://xgboost.readthedocs.io/en/latest/python/python_api.html?highlight=XGBRegressor#xgboost.XGBRegressor

    The XGBRegressor class:

    class xgboost.XGBRegressor(*, objective='reg:squarederror', **kwargs)
    

    The core parameters include:

    n_estimators (int) – Number of gradient boosted trees. Equivalent to number of boosting rounds.
    
    max_depth (Optional[int]) – Maximum tree depth for base learners.
    
    learning_rate (Optional[float]) – Boosting learning rate (xgb’s “eta”)
    
    verbosity (Optional[int]) – The degree of verbosity. Valid values are 0 (silent) - 3 (debug).
    
    tree_method (Optional[str]) – Specify which tree method to use. Default to auto. If this parameter is set to default, XGBoost will choose the most conservative option available. It’s recommended to study this option from the parameters document tree method
    
    n_jobs (Optional[int]) – Number of parallel threads used to run xgboost. When used with other Scikit-Learn algorithms like grid search, you may choose which algorithm to parallelize and balance the threads. Creating thread contention will significantly slow down both algorithms.
    
    gamma (Optional[float]) – (min_split_loss) Minimum loss reduction required to make a further partition on a leaf node of the tree.
    
    min_child_weight (Optional[float]) – Minimum sum of instance weight(hessian) needed in a child.
    
    subsample (Optional[float]) – Subsample ratio of the training instance.
    
    colsample_bytree (Optional[float]) – Subsample ratio of columns when constructing each tree.
    
    scale_pos_weight (Optional[float]) – Balancing of positive and negative weights.
    

    2. Building the Air Quality Prediction Model with xgboost.XGBRegressor

    2.1 Dependencies

    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.model_selection import KFold  # k-fold cross-validation
    from sklearn.model_selection import GridSearchCV  # grid search
    from sklearn.metrics import make_scorer
    import os
    import sys
    import time
    import math
    from sklearn.metrics import r2_score
    from sklearn.ensemble import GradientBoostingRegressor
    import numpy as np
    import warnings
    warnings.filterwarnings("ignore", category=FutureWarning, module="sklearn", lineno=193)
    from sklearn.multioutput import MultiOutputRegressor
    import xgboost as xgb
    import joblib
    from sklearn.preprocessing import MinMaxScaler
    

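    2.2 Building the Air Quality Prediction Model

    • Model
      xgboost.XGBRegressor serves as the base model. Wrapping it in MultiOutputRegressor, which fits one regressor per target, produces a multi-step time output (multi-target regression).

      The core model code is as follows: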

        def fit_model(self, x, y, learning_rate=0.05,
                             n_estimators=500,
                             max_depth=7,
                             min_child_weight=1,
                             gamma=0.0,
                             subsample=0.8,
                             colsample_bytree=0.8,
                             scale_pos_weight=0.8):
            
            model = xgb.XGBRegressor(learning_rate=learning_rate,
                                     n_estimators=n_estimators,
                                     max_depth=max_depth,
                                     min_child_weight=min_child_weight,
                                     gamma=gamma,
                                     subsample=subsample,
                                     colsample_bytree=colsample_bytree,
                                     scale_pos_weight=scale_pos_weight,
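                                     seed=42,
                                     # 'gpu_hist' and gpu_id train on GPU 2; both are deprecated since
                                     # XGBoost 2.0 in favor of tree_method='hist' with device='cuda:2'
                                     tree_method='gpu_hist',
                                     gpu_id=2)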
            
            multioutput = MultiOutputRegressor(model).fit(x, y)
            
            return multioutput
    
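    • Input x
      Shape is (N, W, 24),
      where N is the number of days of data, W is the number of features, and 24 is the number of input hours.
    • Output y
      Shape is (N, 24),
      where N is the number of days of data and 24 is the number of output hours.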
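The article does not list the data-loading code, so the following is only a sketch of how arrays with these shapes might be built from an hourly series. The helper name make_windows, the step of 24 hours (one sample per day), and the synthetic data are all assumptions. XGBRegressor works on a 2-D feature matrix of shape (n_samples, n_features), so the (N, W, 24) array would presumably be flattened to (N, W * 24) before fitting.

```python
import numpy as np

def make_windows(features, target, n_input=24, n_output=24, step=24):
    """Hypothetical helper: cut hourly data into input/output windows.

    features: (T, W) hourly feature matrix
    target:   (T,)   hourly target series (e.g. O3)
    Returns x with shape (N, W, n_input) and y with shape (N, n_output).
    """
    xs, ys = [], []
    last_start = len(target) - n_input - n_output
    for start in range(0, last_start + 1, step):
        mid = start + n_input
        xs.append(features[start:mid].T)        # (W, n_input): past hours
        ys.append(target[mid:mid + n_output])   # (n_output,): following hours
    return np.stack(xs), np.stack(ys)

# Synthetic example: 10 days of hourly data with 5 features.
T, W = 24 * 10, 5
features = np.arange(T * W, dtype=float).reshape(T, W)
target = features[:, 0]

x, y = make_windows(features, target)
x_flat = x.reshape(len(x), -1)  # (N, W * 24), the 2-D layout XGBoost expects
print(x.shape, y.shape, x_flat.shape)
```

With step=24 the windows do not overlap on the input side; a smaller step would give more (overlapping) training samples from the same data.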

    2.3 Core Code

    
    # Air quality model based on XGBRegressor
    class AQXGB():
    
        def __init__(self, factor, n_input, n_output, version):
            
            self.n_input = n_input
            self.n_output = n_output
            self.version = version
            self.factor = factor  # air quality factor (pollutant) to predict, e.g. O3
            
            # directory for machine learning artifacts (the trained model is saved here)
            os.makedirs('./ml_data/', exist_ok=True)
            
    
        def train(self, train_data_path, test_data_path):
            
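            # load_data is not listed in this article; judging by its use here, it returns the inputs x and targets y
            x,y = self.load_data(self.version, 'train', train_data_path, self.n_input, self.n_output)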
            train_x,test_x,train_y,test_y = train_test_split(x,y,test_size=0.2,random_state=2022)
            
            model = self.fit_model(train_x, train_y)
            
            pre_y = model.predict(test_x)
            
            # compute the coefficient of determination (R^2)
            r2 = self.performance_metric(test_y, pre_y)  
            print('test_r2 = ', r2)
                
            x,y = self.load_data(self.version, 'test', test_data_path, self.n_input, self.n_output)
            pre_y = model.predict(x)
            r2 = self.performance_metric(y, pre_y)
            print('val_r2 = ', r2)
            
            # save the model
            joblib.dump(model, './ml_data/xgb_%s_%d_%d_%s.model' %(self.factor, self.n_input, self.n_output, self.version)) 
    
        
    
        def performance_metric(self, y_true, y_predict):
            # choose evaluation metrics as needed
            # R^2 (returned as the score)
            score = r2_score(y_true, y_predict)
            
            # RMSE (square root of the mean squared error)
            mse = np.mean((y_predict - y_true) ** 2)
            print('RMSE: ', mse ** 0.5)
            
            # MAE
            mae = np.mean(np.abs(y_predict - y_true))
            print('MAE: ', mae)
            
            # SMAPE
            smape = self.smape(y_true, y_predict)
            print('SMAPE: ', smape)
    
            return score
            
        def smape(self, A, F):
            A = A.reshape(-1)
            F = F.reshape(-1)
            return 1.0/len(A) * np.sum(2 * np.abs(F - A) / (np.abs(A) + np.abs(F)))
    
        def fit_model(self, x, y, learning_rate=0.05,
                             n_estimators=500,
                             max_depth=7,
                             min_child_weight=1,
                             gamma=0.0,
                             subsample=0.8,
                             colsample_bytree=0.8,
                             scale_pos_weight=0.8):
            
            model = xgb.XGBRegressor(learning_rate=learning_rate,
                                     n_estimators=n_estimators,
                                     max_depth=max_depth,
                                     min_child_weight=min_child_weight,
                                     gamma=gamma,
                                     subsample=subsample,
                                     colsample_bytree=colsample_bytree,
                                     scale_pos_weight=scale_pos_weight,
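                                     seed=42,
                                     # 'gpu_hist' and gpu_id train on GPU 2; both are deprecated since
                                     # XGBoost 2.0 in favor of tree_method='hist' with device='cuda:2'
                                     tree_method='gpu_hist',
                                     gpu_id=2)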
            
            multioutput = MultiOutputRegressor(model).fit(x, y)
            
            return multioutput
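One detail of the metrics is worth a small worked example: the smape method divides by |A| + |F|, which is 0 whenever an actual value and its forecast are both 0, and the result then becomes NaN. The sketch below shows one possible guard. The helper name smape_safe and the convention of counting such terms as zero error are assumptions, not part of the original class.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error over all elements."""
    diff = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

def mae(y_true, y_pred):
    """Mean absolute error over all elements."""
    diff = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    return float(np.mean(np.abs(diff)))

def smape_safe(y_true, y_pred):
    """SMAPE as a fraction in [0, 2]; terms with a zero denominator count as 0."""
    a = np.asarray(y_true, dtype=float).reshape(-1)
    f = np.asarray(y_pred, dtype=float).reshape(-1)
    denom = np.abs(a) + np.abs(f)
    ratio = np.zeros_like(denom)
    nonzero = denom != 0
    ratio[nonzero] = 2.0 * np.abs(f - a)[nonzero] / denom[nonzero]
    return float(np.mean(ratio))

# Two samples with two output hours each; one pair is (0, 0).
y_true = np.array([[100.0, 200.0], [0.0, 50.0]])
y_pred = np.array([[110.0, 180.0], [0.0, 40.0]])

print('RMSE: ', rmse(y_true, y_pred))
print('MAE: ', mae(y_true, y_pred))
print('SMAPE: ', smape_safe(y_true, y_pred))
```

Whether a (0, 0) pair should count as zero error or be excluded from the mean is a modelling choice; the version here keeps the denominator of the mean equal to the number of elements.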
    

    2.4 Model Training

    • Training code

      if __name__ == "__main__":
      
          if len(sys.argv) == 7:
              # train the model
              # python3 src/train_xgb_model.py data/train_data.csv data/test_data.csv O3 24 24 v2
              aq_model = AQXGB(sys.argv[3], int(sys.argv[4]), int(sys.argv[5]), sys.argv[6])
              aq_model.train(sys.argv[1], sys.argv[2])
      
    • Training script
      Takes the feature data of the past 24 hours as input and outputs the O3 predictions for the next 24 hours:

      python3 src/train_xgb_model.py data/train_data.csv data/test_data.csv O3 24 24 v2
      

    2.5 Data Format

    • Data format
      CSV file
    • Example
    air_pressure,CO,humidity,AQI,monitoring_time,NO2,O3,PM10,PM25,SO2,station_number,air_temperature,wind_direction,wind_speed,longitude,latitude,station_type_name
    1013.0,0.3,59.0,69.0,2019-02-01 00:00:00,15.0,80.0,88.0,26.0,8.0,xxx监测站,-0.4,205.8,1.1,116.97810856433719,36.61655020673796,shik
    1013.0,0.3,58.0,68.0,2019-02-01 01:00:00,15.0,80.0,86.0,26.0,8.0,xxx监测站,-0.5,179.4,1.0,116.97810856433719,36.61655020673796,shik
    1012.0,0.3,62.0,72.0,2019-02-01 02:00:00,15.0,80.0,94.0,26.0,8.0,xxx监测站,-0.9,175.7,0.8,116.97810856433719,36.61655020673796,shik
    1011.0,0.3,64.0,76.0,2019-02-01 03:00:00,15.0,80.0,102.0,26.0,8.0,xxx监测站,-1.0,166.9,0.9,116.97810856433719,36.61655020673796,shik
    1011.0,0.3,65.0,80.0,2019-02-01 04:00:00,15.0,80.0,110.0,26.0,8.0,xxx监测站,-0.8,191.1,0.9,116.97810856433719,36.61655020673796,shik
    1011.0,0.3,66.0,84.0,2019-02-01 05:00:00,15.0,80.0,117.0,26.0,8.0,xxx监测站,-1.1,211.4,1.0,116.97810856433719,36.61655020673796,shik
    1011.0,0.3,68.0,85.0,2019-02-01 06:00:00,15.0,80.0,119.0,26.0,8.0,xxx监测站,-1.4,137.3,1.3,116.97810856433719,36.61655020673796,shik
    1011.0,0.3,68.0,65.75,2019-02-01 07:00:00,15.0,80.0,130.6,26.0,8.0,xxx监测站,-1.3,147.0,1.5,116.97810856433719,36.61655020673796,shik
    1011.0,0.3,58.0,46.5,2019-02-01 08:00:00,15.0,80.0,142.2,26.0,8.0,xxx监测站,0.7,157.0,1.4,116.97810856433719,36.61655020673796,shik
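A file in this format can be read with pandas. The sketch below parses the first three sample rows, sorts them by monitoring_time, and separates the numeric measurements from the station metadata. Which columns the article actually uses as features is not stated, so the feature_cols selection is an assumption.

```python
import io

import pandas as pd

# First three rows of the sample above, embedded so the sketch is self-contained.
csv_text = """air_pressure,CO,humidity,AQI,monitoring_time,NO2,O3,PM10,PM25,SO2,station_number,air_temperature,wind_direction,wind_speed,longitude,latitude,station_type_name
1013.0,0.3,59.0,69.0,2019-02-01 00:00:00,15.0,80.0,88.0,26.0,8.0,xxx监测站,-0.4,205.8,1.1,116.97810856433719,36.61655020673796,shik
1013.0,0.3,58.0,68.0,2019-02-01 01:00:00,15.0,80.0,86.0,26.0,8.0,xxx监测站,-0.5,179.4,1.0,116.97810856433719,36.61655020673796,shik
1012.0,0.3,62.0,72.0,2019-02-01 02:00:00,15.0,80.0,94.0,26.0,8.0,xxx监测站,-0.9,175.7,0.8,116.97810856433719,36.61655020673796,shik
"""

# Parse the timestamp column and order the rows chronologically.
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["monitoring_time"])
df = df.sort_values("monitoring_time").set_index("monitoring_time")

# Assumed feature set: numeric measurements only, without station metadata
# (station_number, longitude, latitude, station_type_name).
feature_cols = ["air_pressure", "CO", "humidity", "AQI", "NO2", "O3", "PM10",
                "PM25", "SO2", "air_temperature", "wind_direction", "wind_speed"]
features = df[feature_cols].to_numpy(dtype=float)  # (T, W) hourly feature matrix
target = df["O3"].to_numpy(dtype=float)            # (T,) hourly O3 series
print(features.shape, target.shape)
```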
    
    

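    3. Further Reading

    [AI in Practice] Accelerating XGBRegressor training: training XGBRegressor in seconds on a GPU

    [AI in Practice] Tuning xgb.XGBRegressor for multi-output regression with MultiOutputRegressor, part 1

    [AI in Practice] Tuning xgb.XGBRegressor for multi-output regression with MultiOutputRegressor, part 2 (training on GPU)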

  • Original article: https://blog.csdn.net/zengNLP/article/details/125498953