• Essential LightGBM operations you should know


    LightGBM is a tree-based boosting framework that trains quickly and in parallel. It integrates several ensemble-learning ideas and, compared with XGBoost, improves the node-splitting implementation, so it uses less memory and trains faster.

    LightGBM documentation: https://lightgbm.readthedocs.io/en/latest/

    Parameter reference: https://lightgbm.readthedocs.io/en/latest/Parameters.html


    1 Installation

    Installing LightGBM is straightforward, and on Linux it is easy to enable GPU training. Prefer installing from pip; if that fails, build from source.

    • Installation method: build from source
    git clone --recursive https://github.com/microsoft/LightGBM
    cd LightGBM
    mkdir build ; cd build
    cmake ..
    
    # enable the MPI communication mechanism for faster training
    # cmake -DUSE_MPI=ON ..
    
    # GPU version, faster training
    # cmake -DUSE_GPU=1 ..
    make -j4
    
    • Installation method: pip install
    # default version
    pip install lightgbm
    
    # MPI version
    # (note: --install-option was removed in pip 23.1; on recent versions of pip
    # and LightGBM, check the official installation guide for the current flags)
    pip install lightgbm --install-option=--mpi
    
    # GPU version
    pip install lightgbm --install-option=--gpu
    

    2 Usage

    In Python, LightGBM provides two interfaces: the native API and the Scikit-learn API. Both support training and validation; the native API is more flexible, so choose whichever fits your workflow. The rest of this article uses the native API; a brief sketch of the Scikit-learn style is shown below for comparison.
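
    A minimal, hedged sketch of the Scikit-learn interface (an illustrative addition, not part of the original walk-through), assuming X_train, y_train, X_test, y_test are prepared as in section 2.1:

    import lightgbm as lgb

    # Scikit-learn style wrapper: fit / predict_proba instead of lgb.train
    clf = lgb.LGBMClassifier(num_leaves=31, learning_rate=0.05, n_estimators=100)
    clf.fit(X_train, y_train,
            eval_set=[(X_test, y_test)],
            eval_metric='binary_logloss')
    y_prob = clf.predict_proba(X_test)[:, 1]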

    2.1 Defining the dataset

    import json

    import numpy as np
    import pandas as pd
    import lightgbm as lgb

    df_train = pd.read_csv('https://cdn.coggle.club/LightGBM/examples/binary_classification/binary.train', header=None, sep='\t')
    df_test = pd.read_csv('https://cdn.coggle.club/LightGBM/examples/binary_classification/binary.test', header=None, sep='\t')
    W_train = pd.read_csv('https://cdn.coggle.club/LightGBM/examples/binary_classification/binary.train.weight', header=None)[0]
    W_test = pd.read_csv('https://cdn.coggle.club/LightGBM/examples/binary_classification/binary.test.weight', header=None)[0]
    
    y_train = df_train[0]
    y_test = df_test[0]
    X_train = df_train.drop(0, axis=1)
    X_test = df_test.drop(0, axis=1)
    num_train, num_feature = X_train.shape
    
    # create dataset for lightgbm
    # if you want to re-use data, remember to set free_raw_data=False
    
    lgb_train = lgb.Dataset(X_train, y_train,
                            weight=W_train, free_raw_data=False)
    
    lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train,
                           weight=W_test, free_raw_data=False)
    

    2.2 Model training

    params = {
        'boosting_type': 'gbdt',
        'objective': 'binary',
        'metric': 'binary_logloss',
        'num_leaves': 31,
        'learning_rate': 0.05,
        'feature_fraction': 0.9,
        'bagging_fraction': 0.8,
        'bagging_freq': 5,
        'verbose': 0
    }
    
    # generate feature names
    feature_name = ['feature_' + str(col) for col in range(num_feature)]
    gbm = lgb.train(params,
                    lgb_train,
                    num_boost_round=10,
                    valid_sets=lgb_train,  # eval training data
                    feature_name=feature_name,
                    categorical_feature=[21])
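
    After training, predictions come from the booster's predict method. The short sketch below (an illustrative addition; the AUC check via scikit-learn is not part of the original example) evaluates the model on the test split:

    from sklearn.metrics import roc_auc_score

    # for objective='binary', predict() returns probabilities of the positive class
    y_pred = gbm.predict(X_test)
    print('AUC on the test set:', roc_auc_score(y_test, y_pred))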
    

    2.3 Saving and loading the model

    # save model to file
    gbm.save_model('model.txt')
    
    print('Dumping model to JSON...')
    model_json = gbm.dump_model()
    
    with open('model.json', 'w+') as f:
        json.dump(model_json, f, indent=4)
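
    The original snippet only saves the model; as a minimal sketch of the loading side, a text model written by save_model can be read back with lgb.Booster:

    # load the model saved above and reuse it for prediction
    bst = lgb.Booster(model_file='model.txt')
    y_pred_loaded = bst.predict(X_test)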
    

    2.4 Inspecting feature importance

    # feature names
    print('Feature names:', gbm.feature_name())
    
    # feature importances
    print('Feature importances:', list(gbm.feature_importance()))
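
    By default feature_importance() counts how many times each feature is used in a split; passing importance_type='gain' reports the total gain instead, which is often more informative. A short sketch:

    # importance measured by total split gain rather than split count
    print('Gain importances:', list(gbm.feature_importance(importance_type='gain')))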
    

    2.5 Continuing training

    # continue training
    # init_model accepts:
    # 1. model file name
    # 2. Booster()
    gbm = lgb.train(params,
                    lgb_train,
                    num_boost_round=10,
                    init_model='model.txt',
                    valid_sets=lgb_eval)
    print('Finished 10 - 20 rounds with model file...')
    

    2.6 Adjusting hyperparameters during training

    # decay learning rates
    # learning_rates accepts:
    # 1. list/tuple with length = num_boost_round
    # 2. function(curr_iter)
    # NOTE: in LightGBM >= 4.0 the learning_rates argument was removed;
    # use callbacks=[lgb.reset_parameter(learning_rate=...)] instead
    gbm = lgb.train(params,
                    lgb_train,
                    num_boost_round=10,
                    init_model=gbm,
                    learning_rates=lambda iter: 0.05 * (0.99 ** iter),
                    valid_sets=lgb_eval)
    print('Finished 20 - 30 rounds with decay learning rates...')
    
    # change other parameters during training
    gbm = lgb.train(params,
                    lgb_train,
                    num_boost_round=10,
                    init_model=gbm,
                    valid_sets=lgb_eval,
                    callbacks=[lgb.reset_parameter(bagging_fraction=[0.7] * 5 + [0.6] * 5)])
    print('Finished 30 - 40 rounds with changing bagging_fraction...')
    

    2.7 Custom loss functions

    # self-defined objective function
    # f(preds: array, train_data: Dataset) -> grad: array, hess: array
    # log likelihood loss
    def loglikelihood(preds, train_data):
        labels = train_data.get_label()
        preds = 1. / (1. + np.exp(-preds))
        grad = preds - labels
        hess = preds * (1. - preds)
        return grad, hess
    
    # self-defined eval metric
    # f(preds: array, train_data: Dataset) -> name: string, eval_result: float, is_higher_better: bool
    # binary error
    # NOTE: when you use a customized objective, the default prediction value is the raw margin
    # This may make the built-in evaluation metrics compute wrong results
    # For example, with log likelihood loss the prediction is the score before the logistic transformation
    # Keep this in mind when you use the customization
    def binary_error(preds, train_data):
        labels = train_data.get_label()
        preds = 1. / (1. + np.exp(-preds))
        return 'error', np.mean(labels != (preds > 0.5)), False
    
    # NOTE: in LightGBM >= 4.0 the fobj argument was removed;
    # pass the custom objective via params['objective'] instead
    gbm = lgb.train(params,
                    lgb_train,
                    num_boost_round=10,
                    init_model=gbm,
                    fobj=loglikelihood,
                    feval=binary_error,
                    valid_sets=lgb_eval)
    print('Finished 40 - 50 rounds with self-defined objective function and eval metric...')
    

    2.8 Tuning methods

    Manual tuning

    For Faster Speed
    • Use bagging by setting bagging_fraction and bagging_freq

    • Use feature sub-sampling by setting feature_fraction

    • Use small max_bin

    • Use save_binary to speed up data loading in future learning

    • Use parallel learning; refer to the Parallel Learning Guide in the official documentation

    For Better Accuracy
    • Use large max_bin (may be slower)

    • Use small learning_rate with large num_iterations

    • Use large num_leaves (may cause over-fitting)

    • Use bigger training data

    • Try dart

    Deal with Over-fitting (a hedged parameter sketch follows this list)
    • Use small max_bin

    • Use small num_leaves

    • Use min_data_in_leaf and min_sum_hessian_in_leaf

    • Use bagging by setting bagging_fraction and bagging_freq

    • Use feature sub-sampling by setting feature_fraction

    • Use bigger training data

    • Try lambda_l1, lambda_l2 and min_gain_to_split for regularization

    • Try max_depth to avoid growing deep tree

    • Try extra_trees

    • Try increasing path_smooth
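
    As a hedged illustration of how several of the over-fitting controls above map onto the params dict (the values are placeholders, not recommendations):

    params_regularized = {
        'objective': 'binary',
        'num_leaves': 15,            # smaller trees
        'max_depth': 6,              # cap tree depth
        'min_data_in_leaf': 50,      # require more samples per leaf
        'feature_fraction': 0.8,     # feature sub-sampling
        'bagging_fraction': 0.8,     # bagging
        'bagging_freq': 5,
        'lambda_l1': 0.1,            # L1 regularization
        'lambda_l2': 0.1,            # L2 regularization
        'min_gain_to_split': 0.01,   # minimum gain required to make a split
    }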

    Grid search

    from sklearn.model_selection import GridSearchCV

    lg = lgb.LGBMClassifier()
    param_dist = {"max_depth": [4, 5, 7],
                  "learning_rate": [0.01, 0.05, 0.1],
                  "num_leaves": [300, 900, 1200],
                  "n_estimators": [50, 100, 150]
                 }

    grid_search = GridSearchCV(lg, n_jobs=-1, param_grid=param_dist, cv=5, scoring="roc_auc", verbose=5)
    grid_search.fit(X_train, y_train)
    grid_search.best_estimator_, grid_search.best_score_
    

    Bayesian optimization

    import warnings
    import time
    warnings.filterwarnings("ignore")
    from bayes_opt import BayesianOptimization

    def lgb_eval(max_depth, learning_rate, num_leaves, n_estimators):
        params = {
                 "metric" : 'auc'
            }
        params['max_depth'] = int(max(max_depth, 1))
        params['learning_rate'] = np.clip(learning_rate, 0, 1)  # fixed argument order: clip the rate into [0, 1]
        params['num_leaves'] = int(max(num_leaves, 1))
        params['n_estimators'] = int(max(n_estimators, 1))
        # lgb_train is the Dataset defined in section 2.1
        # (on LightGBM >= 4.0, replace verbose_eval with callbacks=[lgb.log_evaluation(200)])
        cv_result = lgb.cv(params, lgb_train, nfold=5, seed=0, verbose_eval=200, stratified=False)
        return 1.0 * np.array(cv_result['auc-mean']).max()

    lgbBO = BayesianOptimization(lgb_eval, {'max_depth': (4, 8),
                                            'learning_rate': (0.05, 0.2),
                                            'num_leaves': (20, 1500),
                                            'n_estimators': (5, 200)}, random_state=0)

    lgbBO.maximize(init_points=5, n_iter=50, acq='ei')
    print(lgbBO.max)
    
  • Original article: https://blog.csdn.net/m0_59596937/article/details/127660876