• Otto Product Classification with a Random Forest


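    Dataset overview

    The Otto Group dataset comes from the Otto Group Product Classification Challenge. The Otto Group is one of the world's largest e-commerce companies, with subsidiaries in more than 20 countries. The organizers describe the task in their own words: we sell millions of products worldwide every day, and several thousand products are added to our product line.

    A consistent analysis of the performance of our products is crucial. However, because our global infrastructure is so diverse, many identical products get classified differently. The quality of our product analysis therefore depends heavily on the ability to classify similar products accurately. The better the classification, the more insight we gain into our product range.

    For this competition, we provide a dataset with 93 features for more than 200,000 products. The objective is to build a predictive model that can distinguish between our main product categories. The winning models will be open sourced.

    Otto Group product classification dataset

    • Target: 9 product categories
    • Features: 93 features, all integer-valued
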
    import pandas as pd
    import numpy as np
    import os
    import matplotlib.pyplot as plt
    import seaborn as sns
    from sklearn.preprocessing import LabelEncoder, OneHotEncoder
    from sklearn.metrics import log_loss
    from sklearn.model_selection import GridSearchCV
    %matplotlib inline
    

    Loading the data

    Check the current working directory

    os.path.abspath('.')
    

    Read the training set

    data = pd.read_csv("./otto-group-product-classification-challenge/train.csv")
    data.head()
    
       id  feat_1  feat_2  feat_3  feat_4  feat_5  feat_6  feat_7  feat_8  feat_9  ...  feat_85  feat_86  feat_87  feat_88  feat_89  feat_90  feat_91  feat_92  feat_93   target
    0   1       1       0       0       0       0       0       0       0       0  ...        1        0        0        0        0        0        0        0        0  Class_1
    1   2       0       0       0       0       0       0       0       1       0  ...        0        0        0        0        0        0        0        0        0  Class_1
    2   3       0       0       0       0       0       0       0       1       0  ...        0        0        0        0        0        0        0        0        0  Class_1
    3   4       1       0       0       1       6       1       5       0       0  ...        0        1        2        0        0        0        0        0        0  Class_1
    4   5       0       0       0       0       0       0       0       0       0  ...        1        0        0        0        0        1        0        0        0  Class_1

    5 rows × 95 columns

    # Dimensions of the data (rows, columns)
    data.shape
    
    (61878, 95)
    

    Feature analysis

    # Descriptive statistics
    data.describe()
    
                     id        feat_1        feat_2        feat_3        feat_4        feat_5        feat_6        feat_7        feat_8        feat_9
    count  61878.000000   61878.00000  61878.000000  61878.000000  61878.000000  61878.000000  61878.000000  61878.000000  61878.000000  61878.000000
    mean   30939.500000       0.38668      0.263066      0.901467      0.779081      0.071043      0.025696      0.193704      0.662433      1.011296
    std    17862.784315       1.52533      1.252073      2.934818      2.788005      0.438902      0.215333      1.030102      2.255770      3.474822
    min        1.000000       0.00000      0.000000      0.000000      0.000000      0.000000      0.000000      0.000000      0.000000      0.000000
    25%    15470.250000       0.00000      0.000000      0.000000      0.000000      0.000000      0.000000      0.000000      0.000000      0.000000
    50%    30939.500000       0.00000      0.000000      0.000000      0.000000      0.000000      0.000000      0.000000      0.000000      0.000000
    75%    46408.750000       0.00000      0.000000      0.000000      0.000000      0.000000      0.000000      0.000000      1.000000      0.000000
    max    61878.000000      61.00000     51.000000     64.000000     70.000000     19.000000     10.000000     38.000000     76.000000     43.000000

    ... (columns feat_10 to feat_83 omitted)

                feat_84       feat_85       feat_86       feat_87       feat_88       feat_89       feat_90       feat_91       feat_92       feat_93
    count  61878.000000  61878.000000  61878.000000  61878.000000  61878.000000  61878.000000  61878.000000  61878.000000  61878.000000  61878.000000
    mean       0.070752      0.532306      1.128576      0.393549      0.874915      0.457772      0.812421      0.264941      0.380119      0.126135
    std        1.151460      1.900438      2.681554      1.575455      2.115466      1.527385      4.597804      2.045646      0.982385      1.201720
    min        0.000000      0.000000      0.000000      0.000000      0.000000      0.000000      0.000000      0.000000      0.000000      0.000000
    25%        0.000000      0.000000      0.000000      0.000000      0.000000      0.000000      0.000000      0.000000      0.000000      0.000000
    50%        0.000000      0.000000      0.000000      0.000000      0.000000      0.000000      0.000000      0.000000      0.000000      0.000000
    75%        0.000000      0.000000      1.000000      0.000000      1.000000      0.000000      0.000000      0.000000      0.000000      0.000000
    max       76.000000     55.000000     65.000000     67.000000     30.000000     61.000000    130.000000     52.000000     19.000000     87.000000

    8 rows × 94 columns

    # Plot how many samples fall into each target class
    sns.countplot(x=data.target)
    

    [Figure: count plot of the target classes]

    The plot shows that the classes are imbalanced.
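To put a number on the imbalance rather than reading it off a plot, the share of each class can be computed directly. The sketch below uses a small made-up label list (`toy_labels` is hypothetical, not the Otto data); on the real DataFrame, `data["target"].value_counts(normalize=True)` should give the same kind of summary.

```python
from collections import Counter

def class_shares(labels):
    """Return {class: fraction of samples}, ordered by class name."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {cls: counts[cls] / total for cls in sorted(counts)}

# Hypothetical toy labels for illustration only, not the Otto data
toy_labels = ["Class_1"] * 2 + ["Class_2"] * 6 + ["Class_3"] * 2
shares = class_shares(toy_labels)
print(shares)
```

A large spread between the smallest and largest share is a hint that plain accuracy may be dominated by the frequent classes.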

    Data preprocessing

    # Features: drop the row id and the label column
    x = data.drop(["id","target"], axis=1)
    # Target labels
    y = data["target"]
    
    
    x.head()
    
       feat_1  feat_2  feat_3  feat_4  feat_5  feat_6  feat_7  feat_8  feat_9  feat_10  ...  feat_84  feat_85  feat_86  feat_87  feat_88  feat_89  feat_90  feat_91  feat_92  feat_93
    0       1       0       0       0       0       0       0       0       0        0  ...        0        1        0        0        0        0        0        0        0        0
    1       0       0       0       0       0       0       0       1       0        0  ...        0        0        0        0        0        0        0        0        0        0
    2       0       0       0       0       0       0       0       1       0        0  ...        0        0        0        0        0        0        0        0        0        0
    3       1       0       0       1       6       1       5       0       0        1  ...       22        0        1        2        0        0        0        0        0        0
    4       0       0       0       0       0       0       0       0       0        0  ...        0        1        0        0        0        0        1        0        0        0

    5 rows × 93 columns

    y.value_counts().sort_index()

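    # The dataset is fairly large and its classes are imbalanced, so random undersampling
    # could be used to shrink it. The import is left commented out: the full dataset is used below.
    # from imblearn.under_sampling import RandomUnderSampler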
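The commented-out import points at random undersampling, which this walkthrough does not actually apply. As a rough illustration of the idea, the sketch below keeps the same number of rows from every class, namely the size of the rarest class. It is a hand-written NumPy version on made-up data, not the imbalanced-learn implementation; `random_undersample`, `X_toy` and `y_toy` are names invented for this example.

```python
import numpy as np

def random_undersample(X, y, seed=0):
    """Keep an equal number of rows per class (the size of the rarest class)."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_keep = counts.min()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_keep, replace=False)
        for c in classes
    ])
    keep.sort()  # preserve the original row order
    return X[keep], y[keep]

# Hypothetical toy data: three classes with 50, 30 and 5 rows
y_toy = np.repeat([0, 1, 2], [50, 30, 5])
X_toy = np.arange(len(y_toy) * 2).reshape(-1, 2)

X_bal, y_bal = random_undersample(X_toy, y_toy)
print(np.bincount(y_bal))
```

The trade-off is that rows from the frequent classes are thrown away, so this is usually worth comparing against training on the full data.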
    

    Encode the class labels as integers

    y = LabelEncoder().fit_transform(y)
    y
    
    array([0, 0, 0, ..., 8, 8, 8])
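The encoder above is created and discarded in one expression, so the mapping from integers back to class names is lost. A minimal sketch of keeping the fitted `LabelEncoder` around, shown on a few made-up labels (the label list here is illustrative, not taken from the dataset):

```python
from sklearn.preprocessing import LabelEncoder

# Keep a reference to the fitted encoder so codes can be mapped back later
label_encoder = LabelEncoder()
codes = label_encoder.fit_transform(["Class_2", "Class_9", "Class_1", "Class_2"])

print(codes)                                   # integer codes, assigned in sorted class order
print(list(label_encoder.classes_))            # position i holds the name for code i
print(list(label_encoder.inverse_transform([0, 2])))
```

Because classes are sorted before codes are assigned, `Class_1` through `Class_9` map to 0 through 8, which matches the array shown above.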
    

    Split the data into training and test sets

    from sklearn.model_selection import train_test_split
    
    x_train, x_test, y_train, y_test = train_test_split(x,y, test_size=0.2)
    x_train.shape, y_train.shape, y_test.shape, x_test.shape
    
    ((49502, 93), (49502,), (12376,), (12376, 93))
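The split above is purely random, so with imbalanced classes the class proportions in the two parts can drift, and the result changes on every run. One possible refinement, sketched here on made-up data, is to pass `stratify` and `random_state` to `train_test_split`. The toy arrays and the seed value 42 are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced toy data: 80 rows of class 0, 20 rows of class 1
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([0] * 80 + [1] * 20)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy,
    test_size=0.2,
    stratify=y_toy,    # keep the class proportions the same in both parts
    random_state=42,   # fixed seed so the split is reproducible
)
print(np.bincount(y_tr), np.bincount(y_te))
```

With stratification, both parts keep the 80/20 class ratio of the toy data.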
    

    Model training

    from sklearn.ensemble import RandomForestClassifier
    rf_model = RandomForestClassifier(oob_score=True)
    rf_model.fit(x_train, y_train)
    
    RandomForestClassifier(oob_score=True)
    
    y_pred = rf_model.predict(x_test)
    

    Model evaluation

    # Accuracy on the training set
    rf_model.score(x_train, y_train)
    
    0.9999797987960083
    
    # Accuracy on the test set
    rf_model.score(x_test, y_test)
    
    0.8089043309631545
    
    # Out-of-bag (OOB) accuracy estimate
    rf_model.oob_score_
    
    0.7993818431578522
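The OOB score works because each tree is trained on a bootstrap sample: with the default settings, n rows are drawn with replacement from the n training rows, so a given row is missed by a tree with probability (1 - 1/n)^n. Each row can then be scored by the trees that never saw it. A quick calculation with the 49502 training rows reported above:

```python
import math

n_train = 49502  # number of training rows reported by the split above

# Probability that one particular row is never drawn in n_train draws with replacement
p_left_out = (1 - 1 / n_train) ** n_train
print(round(p_left_out, 4), round(math.exp(-1), 4))
```

So roughly 37% of the rows are out-of-bag for any one tree, which is why the OOB score (about 0.799) lands close to the test accuracy (about 0.809) without needing a separate hold-out set.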
    
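    # One-hot encode true and predicted labels for log_loss
    # (scikit-learn 1.2 renamed `sparse` to `sparse_output`; use sparse_output=False on newer versions)
    encoder = OneHotEncoder(sparse=False)
    y_test = encoder.fit_transform(y_test.reshape(-1,1))
    # Reuse the fitted encoder (transform, not fit_transform) so both arrays share the same columns
    y_pred = encoder.transform(y_pred.reshape(-1,1))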
    y_test,
    
    (array([[0., 0., 1., ..., 0., 0., 0.],
            [0., 1., 0., ..., 0., 0., 0.],
            [0., 0., 0., ..., 1., 0., 0.],
            ...,
            [0., 0., 0., ..., 0., 0., 1.],
            [0., 0., 1., ..., 0., 0., 0.],
            [1., 0., 0., ..., 0., 0., 0.]]),)
    
    y_pred
    
    array([[0., 0., 1., ..., 0., 0., 0.],
           [0., 1., 0., ..., 0., 0., 0.],
           [0., 0., 0., ..., 0., 1., 0.],
           ...,
           [0., 0., 0., ..., 0., 0., 1.],
           [0., 1., 0., ..., 0., 0., 0.],
           [0., 0., 0., ..., 0., 0., 0.]])
    
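    # Log loss on hard one-hot predictions: each misclassified row is clipped to eps and costs -log(eps)
    # (the `eps` argument was deprecated in scikit-learn 1.3; drop it on newer versions)
    log_loss(y_test, y_pred, eps=1e-15, normalize=True)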
    
    6.600210582899472
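The value of about 6.60 is not a sign of a terrible model. It follows almost entirely from feeding hard 0/1 predictions into log loss: a correct row costs essentially nothing, while a wrong row has its zero probability clipped to `eps` and costs -log(eps). The arithmetic below reproduces the figure from the test accuracy reported above, assuming the same `eps=1e-15`.

```python
import math

eps = 1e-15                          # clipping value passed to log_loss above
test_accuracy = 0.8089043309631545   # test-set accuracy reported above

penalty_per_error = -math.log(eps)   # loss of one confidently wrong row
approx_log_loss = (1 - test_accuracy) * penalty_per_error
print(round(penalty_per_error, 2), round(approx_log_loss, 4))
```

The product of the error rate and the per-error penalty matches the reported log loss, which is why log loss is normally computed on predicted probabilities rather than on hard labels.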
    
    # Predict class probabilities instead of hard labels
    y_pred_proba = rf_model.predict_proba(x_test)
    y_pred_proba
    
    array([[0.  , 0.2 , 0.77, ..., 0.  , 0.02, 0.  ],
           [0.02, 0.48, 0.16, ..., 0.06, 0.  , 0.  ],
           [0.03, 0.02, 0.03, ..., 0.3 , 0.32, 0.02],
           ...,
           [0.12, 0.01, 0.05, ..., 0.08, 0.11, 0.53],
           [0.01, 0.56, 0.32, ..., 0.01, 0.02, 0.  ],
           [0.18, 0.09, 0.01, ..., 0.1 , 0.2 , 0.14]])
    
    rf_model.oob_score_
    
    0.7993818431578522
    
    log_loss(y_test, y_pred_proba, eps=1e-15, normalize=True)
    
    0.6232249914857839
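`GridSearchCV` is imported at the top but never used, and the gap between training accuracy (about 1.00) and test accuracy (about 0.81) suggests the default forest fits the training data very closely. One possible next step is a small grid search scored by log loss. The sketch below runs on a synthetic stand-in dataset so that it finishes quickly; the grid values, `cv=3` and the `make_classification` settings are assumptions for illustration, not tuned choices for the Otto data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Hypothetical stand-in for the Otto features: a small synthetic 3-class problem
X_toy, y_toy = make_classification(
    n_samples=300, n_features=20, n_informative=8,
    n_classes=3, random_state=0,
)

# Illustrative grid only; these values are assumptions, not recommended settings
param_grid = {"n_estimators": [20, 50], "min_samples_leaf": [1, 5]}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring="neg_log_loss",  # scorers are maximized, so log loss is negated
    cv=3,
)
search.fit(X_toy, y_toy)
print(search.best_params_, -search.best_score_)
```

On the real data the same pattern would take `x_train` and the integer-encoded `y_train`, with a grid sized to the available compute.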
    
  • Original post: https://blog.csdn.net/qq_43276566/article/details/132576111