• 【Python机器学习】回归模型:推土机售价预测


    文章目录

    使用机器学习预测推土机的售价

    1. 定义问题

    考虑到推土机的特性,利用过去的数据,我们能多大程度上预测它未来的价格?

    2. 数据来源

    kaggle:https://www.kaggle.com/competitions/bluebook-for-bulldozers/overview

    The data for this competition is split into three parts:

    • Train.csv is the training set, which contains data through the end of 2011.
    • Valid.csv is the validation set, which contains data from January 1, 2012 - April 30, 2012 You make predictions on this set throughout the majority of the competition. Your score on this set is used to create the public leaderboard.
    • Test.csv is the test set, which won’t be released until the last week of the competition. It contains data from May 1, 2012 - November 2012. Your score on the test set determines your final rank for the competition.

    3. 评价标准

    The evaluation metric for this competition is the RMSLE (root mean squared log error) between the actual and predicted auction prices.

    更多信息:https://www.kaggle.com/competitions/bluebook-for-bulldozers/overview/evaluation

    注意:多数回归模型的评价标准都是减小误差。比如这次的目标就是最小化RMSLE。

    4. 使用的特征

    特征过多,请自行进入kaggle项目主页查看。或点击如下谷歌表格链接:https://docs.google.com/spreadsheets/d/1EIbdGa4S_46USXgg0OHX5jgTc8ld9fTHwPyi_VOV1as/edit#gid=0

    零、导入模块

    # EDA
    import warnings
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns
    from scipy import stats
    sns.set()
    plt.rcParams['font.sans-serif'] = ['SimHei']
    plt.rcParams['axes.unicode_minus'] = False
    %config InlineBackend.figure_config = 'svg'
    warnings.filterwarnings("ignore")
    
    # 数据预处理
    from sklearn.preprocessing import LabelEncoder
    
    # sklearn模型
    from sklearn.ensemble import RandomForestRegressor
    from xgboost import XGBRegressor
    
    # 模型评估
    from sklearn.model_selection import train_test_split
    from sklearn.model_selection import GridSearchCV
    from sklearn.metrics import mean_squared_log_error, mean_absolute_error, mean_squared_error, r2_score
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    • 13
    • 14
    • 15
    • 16
    • 17
    • 18
    • 19
    • 20
    • 21
    • 22
    • 23
    • 24

    一、EDA

    bulldozer_df = pd.read_csv('bluebook-for-bulldozers/TrainAndValid.csv',
                               low_memory = False)
    appendix_df = pd.read_csv('bluebook-for-bulldozers/Machine_Appendix.csv',
                               low_memory = False)
    
    • 1
    • 2
    • 3
    • 4

    1.1 查看基本信息

    # 查看各字段类型
    bulldozer_df.info()
    
    • 1
    • 2
    
    RangeIndex: 412698 entries, 0 to 412697
    Data columns (total 53 columns):
     #   Column                    Non-Null Count   Dtype  
    ---  ------                    --------------   -----  
     0   SalesID                   412698 non-null  int64  
     1   SalePrice                 412698 non-null  float64
     2   MachineID                 412698 non-null  int64  
     3   ModelID                   412698 non-null  int64  
     4   datasource                412698 non-null  int64  
     5   auctioneerID              392562 non-null  float64
     6   YearMade                  412698 non-null  int64  
     7   MachineHoursCurrentMeter  147504 non-null  float64
     8   UsageBand                 73670 non-null   object 
     9   saledate                  412698 non-null  object 
     10  fiModelDesc               412698 non-null  object 
     11  fiBaseModel               412698 non-null  object 
     12  fiSecondaryDesc           271971 non-null  object 
     13  fiModelSeries             58667 non-null   object 
     14  fiModelDescriptor         74816 non-null   object 
     15  ProductSize               196093 non-null  object 
     16  fiProductClassDesc        412698 non-null  object 
     17  state                     412698 non-null  object 
     18  ProductGroup              412698 non-null  object 
     19  ProductGroupDesc          412698 non-null  object 
     20  Drive_System              107087 non-null  object 
     21  Enclosure                 412364 non-null  object 
     22  Forks                     197715 non-null  object 
     23  Pad_Type                  81096 non-null   object 
     24  Ride_Control              152728 non-null  object 
     25  Stick                     81096 non-null   object 
     26  Transmission              188007 non-null  object 
     27  Turbocharged              81096 non-null   object 
     28  Blade_Extension           25983 non-null   object 
     29  Blade_Width               25983 non-null   object 
     30  Enclosure_Type            25983 non-null   object 
     31  Engine_Horsepower         25983 non-null   object 
     32  Hydraulics                330133 non-null  object 
     33  Pushblock                 25983 non-null   object 
     34  Ripper                    106945 non-null  object 
     35  Scarifier                 25994 non-null   object 
     36  Tip_Control               25983 non-null   object 
     37  Tire_Size                 97638 non-null   object 
     38  Coupler                   220679 non-null  object 
     39  Coupler_System            44974 non-null   object 
     40  Grouser_Tracks            44875 non-null   object 
     41  Hydraulics_Flow           44875 non-null   object 
     42  Track_Type                102193 non-null  object 
     43  Undercarriage_Pad_Width   102916 non-null  object 
     44  Stick_Length              102261 non-null  object 
     45  Thumb                     102332 non-null  object 
     46  Pattern_Changer           102261 non-null  object 
     47  Grouser_Type              102193 non-null  object 
     48  Backhoe_Mounting          80712 non-null   object 
     49  Blade_Type                81875 non-null   object 
     50  Travel_Controls           81877 non-null   object 
     51  Differential_Type         71564 non-null   object 
     52  Steering_Controls         71522 non-null   object 
    dtypes: float64(3), int64(5), object(45)
    memory usage: 166.9+ MB
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    • 13
    • 14
    • 15
    • 16
    • 17
    • 18
    • 19
    • 20
    • 21
    • 22
    • 23
    • 24
    • 25
    • 26
    • 27
    • 28
    • 29
    • 30
    • 31
    • 32
    • 33
    • 34
    • 35
    • 36
    • 37
    • 38
    • 39
    • 40
    • 41
    • 42
    • 43
    • 44
    • 45
    • 46
    • 47
    • 48
    • 49
    • 50
    • 51
    • 52
    • 53
    • 54
    • 55
    • 56
    • 57
    • 58
    • 59
    • 60
    # 查看缺失值
    bulldozer_df.isna().sum()
    
    • 1
    • 2
    SalesID                          0
    SalePrice                        0
    MachineID                        0
    ModelID                          0
    datasource                       0
    auctioneerID                 20136
    YearMade                         0
    MachineHoursCurrentMeter    265194
    UsageBand                   339028
    saledate                         0
    fiModelDesc                      0
    fiBaseModel                      0
    fiSecondaryDesc             140727
    fiModelSeries               354031
    fiModelDescriptor           337882
    ProductSize                 216605
    fiProductClassDesc               0
    state                            0
    ProductGroup                     0
    ProductGroupDesc                 0
    Drive_System                305611
    Enclosure                      334
    Forks                       214983
    Pad_Type                    331602
    Ride_Control                259970
    Stick                       331602
    Transmission                224691
    Turbocharged                331602
    Blade_Extension             386715
    Blade_Width                 386715
    Enclosure_Type              386715
    Engine_Horsepower           386715
    Hydraulics                   82565
    Pushblock                   386715
    Ripper                      305753
    Scarifier                   386704
    Tip_Control                 386715
    Tire_Size                   315060
    Coupler                     192019
    Coupler_System              367724
    Grouser_Tracks              367823
    Hydraulics_Flow             367823
    Track_Type                  310505
    Undercarriage_Pad_Width     309782
    Stick_Length                310437
    Thumb                       310366
    Pattern_Changer             310437
    Grouser_Type                310505
    Backhoe_Mounting            331986
    Blade_Type                  330823
    Travel_Controls             330821
    Differential_Type           341134
    Steering_Controls           341176
    dtype: int64
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    • 13
    • 14
    • 15
    • 16
    • 17
    • 18
    • 19
    • 20
    • 21
    • 22
    • 23
    • 24
    • 25
    • 26
    • 27
    • 28
    • 29
    • 30
    • 31
    • 32
    • 33
    • 34
    • 35
    • 36
    • 37
    • 38
    • 39
    • 40
    • 41
    • 42
    • 43
    • 44
    • 45
    • 46
    • 47
    • 48
    • 49
    • 50
    • 51
    • 52
    • 53
    • 54
    # 查看标签分布
    bulldozer_df['SalePrice'].hist()
    
    • 1
    • 2
    • 1

    请添加图片描述

    1.2 特征类型转换

    # 改为帕斯卡命名
    bulldozer_df.rename(columns={'saledate': 'SaleDate'}, inplace=True)
    
    • 1
    • 2
    bulldozer_df['SaleDate'] = pd.to_datetime(bulldozer_df['SaleDate'])
    
    • 1
    # 按时间查看售价,只看最近几年的,没有明显的上升趋势,只有上下波动
    plt.figure(figsize=(20,15))
    pd.pivot_table(bulldozer_df[200000::1000], index='SaleDate', values='SalePrice').plot()
    plt.show()
    
    • 1
    • 2
    • 3
    • 4

    请添加图片描述

    2005年前后半年和2008年整个一年销量都比较差,除此之外看不出太多信息

    # 将数据集按时间排序
    bulldozer_df.sort_values(by='SaleDate', inplace=True)
    bulldozer_df['SaleDate'].head(20)
    
    • 1
    • 2
    • 3
    205615   1989-01-17
    274835   1989-01-31
    141296   1989-01-31
    212552   1989-01-31
    62755    1989-01-31
    54653    1989-01-31
    81383    1989-01-31
    204924   1989-01-31
    135376   1989-01-31
    113390   1989-01-31
    113394   1989-01-31
    116419   1989-01-31
    32138    1989-01-31
    127610   1989-01-31
    76171    1989-01-31
    127000   1989-01-31
    128130   1989-01-31
    127626   1989-01-31
    55455    1989-01-31
    55454    1989-01-31
    Name: SaleDate, dtype: datetime64[ns]
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    • 13
    • 14
    • 15
    • 16
    • 17
    • 18
    • 19
    • 20
    • 21

    1.3 联表+特征初筛

    这个数据集比较特殊,还有个appendix表,里面是一些推土机的配件信息,而且这个信息和train表有重复特征,重复特征里面还有匹配不上的情况,现在先联表上来看看

    # 制作数据集副本,这是为了方便对数据集做了什么操作后,仍然可以获取原始数据,而不用从头读数据
    bd_df = bulldozer_df.copy()
    app_df = appendix_df.copy()
    
    • 1
    • 2
    • 3
    # SalesID列丢弃
    bd_df.drop('SalesID', axis=1, inplace=True)
    
    • 1
    • 2
    # 定义一个查看出入的函数
    def check_difference(df1, df2, target_col, on_col):
        temp_df = pd.merge(df1[[on_col, target_col]], df2[[on_col, target_col]], how='left', on=on_col)
        return temp_df[(temp_df[target_col+'_x'] != temp_df[target_col+'_y'])]
    
    • 1
    • 2
    • 3
    • 4
    # 定义一个合并时互补的函数,冲突时保留df1的数据
    def combine(df1, df2, target_col, on_col):
        temp_df0 = pd.merge(df1[[on_col, target_col]], df2[[on_col, target_col]], how='left', on=on_col)
        temp_df0.fillna('', inplace=True)
        temp_df1=temp_df0[(temp_df0[target_col+'_x']== '') & (temp_df0[target_col+'_y']!='')]
        temp_df0[target_col+'_x'].loc[temp_df1.index] = temp_df1[target_col+'_y']
        df1[target_col] = temp_df0[target_col+'_x']
        df2.drop(columns=[target_col], inplace=True)
        return df1
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9

    1.3.1 删除包含重复信息的特征

    ProductGroup是ProductGroupDesc的首字母缩写版,选择保留后者

    fiManufacturerID和fiManufacturerDesc包含的信息一样,选择保留前者

    bd_df.drop(columns=['ProductGroup', 'fiProductClassDesc'], inplace=True)
    app_df.drop(columns=['ModelID', 'fiModelDesc', 'ProductGroup', 'fiManufacturerDesc'], inplace=True)
    
    • 1
    • 2

    1.3.2 fiBaseModel

    # 查看枚举值,后续需要做分箱合并处理
    bd_df['fiBaseModel'].value_counts()
    
    • 1
    • 2
    580      20179
    310      17886
    D6       13527
    416      12900
    D5        9636
             ...  
    56           1
    B4230        1
    IS30         1
    MM555        1
    WLK15        1
    Name: fiBaseModel, Length: 1961, dtype: int64
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    # 查看出入部分有无空值
    check_difference(bd_df, app_df, 'fiBaseModel', 'MachineID').isna().sum()
    
    • 1
    • 2
    MachineID        0
    fiBaseModel_x    0
    fiBaseModel_y    0
    dtype: int64
    
    • 1
    • 2
    • 3
    • 4
    # 查看出入部分
    check_difference(bd_df, app_df, 'fiBaseModel', 'MachineID')
    
    • 1
    • 2
    MachineIDfiBaseModel_xfiBaseModel_y
    711523610WA150HD465
    981059447WA300PC100
    1231303779415862
    1281340389MS240MS120
    1791208516D31PC100
    ............
    4124742308891184575
    4124842287735T135T133
    4125092292146450465
    4126551846321TB135TB125
    4126951918416337530

    12452 rows × 3 columns

    # 选择保留bd_df表的数据
    app_df.drop(columns=['fiBaseModel'], inplace=True)
    
    • 1
    • 2

    1.3.3 fiSecondaryDesc

    # 查看枚举值,后续需要做分箱合并处理
    bd_df['fiSecondaryDesc'].value_counts()
    
    • 1
    • 2
    C         44431
    B         40165
    G         37915
    H         24729
    E         21532
              ...  
    BLGPPS        1
    MSR           1
    LC7A          1
    CL            1
    BH            1
    Name: fiSecondaryDesc, Length: 177, dtype: int64
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    # 查看不一致部分
    check_difference(bd_df, app_df, 'fiSecondaryDesc', 'MachineID')
    
    • 1
    • 2
    MachineIDfiSecondaryDesc_xfiSecondaryDesc_y
    01126363NaNNaN
    11194089NaNNaN
    31327630NaNNaN
    61082797NaNNaN
    71527216NaNNaN
    ............
    4126901823846NaNNaN
    4126911278794NaNNaN
    4126921792049NaNNaN
    4126941919104NaNNaN
    4126951918416GNaN

    147009 rows × 3 columns

    # 合并互补
    bd_df = combine(bd_df, app_df, 'fiSecondaryDesc', 'MachineID')
    
    • 1
    • 2

    1.3.4 fiModelSeries

    # 查看枚举值,后续需要做分箱合并处理
    bd_df['fiModelSeries'].value_counts()
    
    • 1
    • 2
    II      13770
    LC       9175
    III      5351
    -1       4646
    -2       4033
            ...  
    LL          1
    6F          1
    -2LC        1
    -5A         1
    VII         1
    Name: fiModelSeries, Length: 123, dtype: int64
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    # 查看不一致部分
    check_difference(bd_df, app_df, 'fiModelSeries', 'MachineID')
    
    • 1
    • 2
    MachineIDfiModelSeries_xfiModelSeries_y
    01126363NaNNaN
    11194089NaNNaN
    21473654NaNNaN
    31327630NaNNaN
    41336053NaNNaN
    ............
    4126931915521NaNNaN
    4126941919104NaNNaN
    4126951918416NaNNaN
    412696509560NaNNaN
    4126971869284NaNNaN

    362366 rows × 3 columns

    # 合并互补
    bd_df = combine(bd_df, app_df, 'fiModelSeries', 'MachineID')
    
    • 1
    • 2

    1.3.5 fiModelDescriptor

    # 查看枚举值
    bd_df['fiModelDescriptor'].value_counts()
    
    • 1
    • 2
    L            16464
    LGP          16143
    LC           13295
    XL            6700
    6             2944
                 ...  
    K5               1
    HighLift         1
    High Lift        1
    III              1
    SL               1
    Name: fiModelDescriptor, Length: 140, dtype: int64
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    # 查看不一致部分
    check_difference(bd_df, app_df, 'fiModelDescriptor', 'MachineID')
    
    • 1
    • 2
    MachineIDfiModelDescriptor_xfiModelDescriptor_y
    01126363NaNNaN
    11194089NaNNaN
    21473654NaNNaN
    31327630NaNNaN
    41336053NaNNaN
    ............
    4126931915521NaNNaN
    4126941919104NaNNaN
    4126951918416NaNNaN
    412696509560NaNNaN
    4126971869284NaNNaN

    342069 rows × 3 columns

    # 合并互补
    bd_df = combine(bd_df, app_df, 'fiModelDescriptor', 'MachineID')
    
    • 1
    • 2

    1.3.6 ProductGroupDesc

    # 查看枚举值
    bd_df['ProductGroupDesc'].value_counts()
    
    • 1
    • 2
    Track Excavators       104230
    Track Type Tractors     82582
    Backhoe Loaders         81401
    Wheel Loader            73216
    Skid Steer Loaders      45011
    Motor Graders           26258
    Name: ProductGroupDesc, dtype: int64
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    # 枚举值太多了,而且bd_df里的数据并无空值,这里就选择不互补了,直接保留bd_df的数据
    app_df['ProductGroupDesc'].value_counts()
    
    • 1
    • 2
    Track Excavators                 89094
    Backhoe Loaders                  74074
    Track Type Tractors              67362
    Wheel Loader                     62426
    Skid Steer Loaders               42121
    Motor Graders                    22602
    Track Loaders                      158
    Articulated Trucks                 109
    Ag Tractors                        103
    Wheel Tractor Scraper              102
    Off Highway Trucks                  81
    Multi Terrain Loaders               66
    Wheel Excavator                     66
    Forklift                            53
    Skidders                            46
    Wheel Feller Buncher                31
    Forestry Log Loaders                25
    Pipelayers                          11
    Telehandler                          9
    Wheel Dozer                          7
    Knuckleboom Loaders                  6
    Track Feller Bunchers                6
    Vibratory Double Drum Asphalt        4
    Vibratory Single Drum Asphalt        4
    Work Tool                            3
    Pneumatic Tired Compactor            3
    Compactors                           3
    Vibratory Single Drum Pad            3
    Tandem Roller Static                 2
    Harvesters                           2
    Forwarders                           2
    Engine, Industrial OEM               2
    Track Harvesters                     1
    Delimber Forestry                    1
    Asphalt/Concrete Pavers              1
    Vibratory Single Drum Smooth         1
    Crane/Dragline                       1
    Forestry Excavators                  1
    Cold Planers                         1
    Name: ProductGroupDesc, dtype: int64
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    • 13
    • 14
    • 15
    • 16
    • 17
    • 18
    • 19
    • 20
    • 21
    • 22
    • 23
    • 24
    • 25
    • 26
    • 27
    • 28
    • 29
    • 30
    • 31
    • 32
    • 33
    • 34
    • 35
    • 36
    • 37
    • 38
    • 39
    • 40
    app_df.drop(columns=['ProductGroupDesc'], inplace=True)
    
    • 1

    1.3.7 MfgYear

    # 两边同一个特征名字不一样,先改名
    app_df.rename(columns={'MfgYear': 'YearMade'}, inplace=True)
    
    • 1
    • 2
    # 查看枚举值,发现异常值1000
    bd_df['YearMade'].value_counts()
    
    • 1
    • 2
    1000    39391
    2005    22096
    1998    21751
    2004    20914
    1999    19274
            ...  
    2012        1
    1949        1
    1942        1
    2013        1
    1937        1
    Name: YearMade, Length: 73, dtype: int64
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    # 查看不一致部分
    check_difference(bd_df, app_df, 'YearMade', 'MachineID')
    
    • 1
    • 2
    MachineIDYearMade_xYearMade_y
    52153165619751987.0
    58107813219671966.0
    71152361019862007.0
    98105944719841996.0
    122152518019841985.0
    ............
    412651163172910001990.0
    412654191210610001996.0
    412655184632110002005.0
    412667189753519981999.0
    41269650956019931990.0

    36406 rows × 3 columns

    这种不一致通常是因为bd_df里面有重复的MachineID,即同一种推土机被卖了多次,而app_df里只有唯一MachineID所导致,决定保留bd_df的数据,不动它

    app_df.drop('YearMade', axis=1, inplace=True)
    
    • 1

    1.3.8 fiManufacturerID、PrimarySizeBasis、PrimaryLower、PrimaryUpper

    这三个特征原训练集上都没有,直接联过去

    # total_df = bd_df.copy()
    
    • 1
    total_df = pd.merge(bd_df, app_df, how='left', on='MachineID')
    
    • 1
    # 动力上限和下限留一个就足够区分了,取下限
    total_df.drop(columns=['PrimaryUpper'], inplace=True)
    
    • 1
    • 2
    total_df.info()
    
    • 1
    
    Int64Index: 412698 entries, 0 to 412697
    Data columns (total 54 columns):
     #   Column                    Non-Null Count   Dtype         
    ---  ------                    --------------   -----         
     0   SalePrice                 412698 non-null  float64       
     1   MachineID                 412698 non-null  int64         
     2   ModelID                   412698 non-null  int64         
     3   datasource                412698 non-null  int64         
     4   auctioneerID              392562 non-null  float64       
     5   YearMade                  412698 non-null  int64         
     6   MachineHoursCurrentMeter  147504 non-null  float64       
     7   UsageBand                 73670 non-null   object        
     8   SaleDate                  412698 non-null  datetime64[ns]
     9   fiModelDesc               412698 non-null  object        
     10  fiBaseModel               412698 non-null  object        
     11  fiSecondaryDesc           412698 non-null  object        
     12  fiModelSeries             412698 non-null  object        
     13  fiModelDescriptor         412698 non-null  object        
     14  ProductSize               196093 non-null  object        
     15  state                     412698 non-null  object        
     16  ProductGroupDesc          412698 non-null  object        
     17  Drive_System              107087 non-null  object        
     18  Enclosure                 412364 non-null  object        
     19  Forks                     197715 non-null  object        
     20  Pad_Type                  81096 non-null   object        
     21  Ride_Control              152728 non-null  object        
     22  Stick                     81096 non-null   object        
     23  Transmission              188007 non-null  object        
     24  Turbocharged              81096 non-null   object        
     25  Blade_Extension           25983 non-null   object        
     26  Blade_Width               25983 non-null   object        
     27  Enclosure_Type            25983 non-null   object        
     28  Engine_Horsepower         25983 non-null   object        
     29  Hydraulics                330133 non-null  object        
     30  Pushblock                 25983 non-null   object        
     31  Ripper                    106945 non-null  object        
     32  Scarifier                 25994 non-null   object        
     33  Tip_Control               25983 non-null   object        
     34  Tire_Size                 97638 non-null   object        
     35  Coupler                   220679 non-null  object        
     36  Coupler_System            44974 non-null   object        
     37  Grouser_Tracks            44875 non-null   object        
     38  Hydraulics_Flow           44875 non-null   object        
     39  Track_Type                102193 non-null  object        
     40  Undercarriage_Pad_Width   102916 non-null  object        
     41  Stick_Length              102261 non-null  object        
     42  Thumb                     102332 non-null  object        
     43  Pattern_Changer           102261 non-null  object        
     44  Grouser_Type              102193 non-null  object        
     45  Backhoe_Mounting          80712 non-null   object        
     46  Blade_Type                81875 non-null   object        
     47  Travel_Controls           81877 non-null   object        
     48  Differential_Type         71564 non-null   object        
     49  Steering_Controls         71522 non-null   object        
     50  fiProductClassDesc        412698 non-null  object        
     51  fiManufacturerID          412698 non-null  int64         
     52  PrimarySizeBasis          407439 non-null  object        
     53  PrimaryLower              407439 non-null  float64       
    dtypes: datetime64[ns](1), float64(4), int64(5), object(44)
    memory usage: 173.2+ MB
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    • 13
    • 14
    • 15
    • 16
    • 17
    • 18
    • 19
    • 20
    • 21
    • 22
    • 23
    • 24
    • 25
    • 26
    • 27
    • 28
    • 29
    • 30
    • 31
    • 32
    • 33
    • 34
    • 35
    • 36
    • 37
    • 38
    • 39
    • 40
    • 41
    • 42
    • 43
    • 44
    • 45
    • 46
    • 47
    • 48
    • 49
    • 50
    • 51
    • 52
    • 53
    • 54
    • 55
    • 56
    • 57
    • 58
    • 59
    • 60
    • 61
    raise KeyError
    
    • 1
    ---------------------------------------------------------------------------
    
    KeyError                                  Traceback (most recent call last)
    
    Input In [903], in ()
    ----> 1 raise KeyError
    
    
    KeyError: 
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9

    1.4 逐个查看特征

    1.4.1 datasource

    # 173怀疑是172写错成173
    total_df['datasource'].value_counts()
    
    • 1
    • 2
    132    260776
    136     75491
    149     33325
    121     25191
    172     17914
    173         1
    Name: datasource, dtype: int64
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7

    1.4.2 auctioneerID

    # 后续要合并
    total_df['auctioneerID'].value_counts()
    
    • 1
    • 2
    1.0     192773
    2.0      57441
    3.0      30288
    4.0      20877
    99.0     12042
    6.0      11950
    7.0       7847
    8.0       7419
    5.0       7002
    10.0      5876
    9.0       4764
    11.0      3823
    12.0      3610
    13.0      3068
    18.0      2359
    14.0      2277
    20.0      2238
    19.0      2074
    16.0      1807
    15.0      1742
    21.0      1601
    22.0      1429
    24.0      1357
    23.0      1322
    17.0      1275
    27.0      1150
    25.0       959
    28.0       860
    26.0       796
    0.0        536
    Name: auctioneerID, dtype: int64
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    • 13
    • 14
    • 15
    • 16
    • 17
    • 18
    • 19
    • 20
    • 21
    • 22
    • 23
    • 24
    • 25
    • 26
    • 27
    • 28
    • 29
    • 30
    • 31
    # 查看auctioneerID和价格的关系
    temp = pd.pivot_table(total_df, index='auctioneerID', values='SalePrice', aggfunc=['count', 'mean', 'median'])
    temp.sort_values(by=('median', 'SalePrice'))
    
    • 1
    • 2
    • 3
    countmeanmedian
    SalePriceSalePriceSalePrice
    auctioneerID
    22.0142918230.02099414000.0
    18.0235918839.79949114500.0
    21.0160119645.31542815000.0
    25.095922478.51929116000.0
    9.0476422515.00314916775.0
    14.0227720804.79798017000.0
    17.0127522974.94117618000.0
    12.0361024762.93628818250.0
    28.086028312.67441918750.0
    20.0223824737.81143919250.0
    99.01204226958.80611219500.0
    13.0306827017.44133020500.0
    16.0180726261.42778120500.0
    27.0115027606.73913022500.0
    4.02087729825.72687623000.0
    23.0132230613.72919823000.0
    2.05744129023.05167023000.0
    10.0587629561.54697123000.0
    0.053629979.85074623000.0
    15.0174229986.33754323500.0
    5.0700229150.21779523500.0
    24.0135733041.15696424500.0
    1.019277332684.87007525000.0
    8.0741932477.97883825000.0
    11.0382332707.08867425500.0
    3.03028833596.96259226000.0
    6.01195034708.07062827000.0
    26.079636157.41206028000.0
    7.0784736288.45023628500.0
    19.0207442715.76663534000.0
    # 画图查看
    temp.sort_values(by=('median', 'SalePrice'))[('median', 'SalePrice')].plot(kind='bar')
    plt.show()
    
    • 1
    • 2
    • 3

    请添加图片描述

    1.4.3 YearMade

    total_df['YearMade'] = total_df['YearMade'].astype(int)
    
    • 1
    total_df['YearMade'].value_counts()
    
    • 1
    1000    39391
    2005    22096
    1998    21751
    2004    20914
    1999    19274
            ...  
    2012        1
    1949        1
    1942        1
    2013        1
    1937        1
    Name: YearMade, Length: 73, dtype: int64
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    plt.figure(figsize=(14,10))
    sns.countplot(total_df['YearMade'])
    plt.xticks(rotation=90)
    plt.show()
    
    • 1
    • 2
    • 3
    • 4

    请添加图片描述

    第一台拖拉机1904年才发明出来,1904年前的属于异常数据。同时该数据集截至年份是2012年,大于2012的属于异常。异常值后续增加一个新的衍生变量YearMade_is_error区分。

    生产时间大于销售时间的,暂时认为是提前订货,不处理。

    1.4.4 MachineHoursCurrentMeter

    # 查看枚举值
    total_df['MachineHoursCurrentMeter'].value_counts()
    
    • 1
    • 2
    0.0        73834
    2000.0       124
    1000.0       117
    24.0         115
    1500.0       101
               ...  
    10834.0        1
    3499.0         1
    26270.0        1
    26901.0        1
    17920.0        1
    Name: MachineHoursCurrentMeter, Length: 15633, dtype: int64
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12

    1.4.5 UsageBand

    # 查看枚举值
    total_df['UsageBand'].value_counts()
    
    • 1
    • 2
    Medium    35832
    Low       25311
    High      12527
    Name: UsageBand, dtype: int64
    
    • 1
    • 2
    • 3
    • 4

    1.4.6 fiBaseModel

    total_df['fiBaseModel'].value_counts()
    
    • 1
    580      20179
    310      17886
    D6       13527
    416      12900
    D5        9636
             ...  
    56           1
    B4230        1
    IS30         1
    MM555        1
    WLK15        1
    Name: fiBaseModel, Length: 1961, dtype: int64
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12

    1.4.7 fiSecondaryDesc

    total_df['fiSecondaryDesc'].value_counts().head(10)
    
    • 1
         136420
    C     44658
    B     40446
    G     38139
    H     24759
    E     21944
    D     20132
    F      9454
    K      8089
    A      5968
    Name: fiSecondaryDesc, dtype: int64
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    temp = pd.pivot_table(total_df, index='fiSecondaryDesc', values='SalePrice', aggfunc=['count', 'mean', 'median'])
    temp.sort_values(by=('count', 'SalePrice')).tail(20)
    
    • 1
    • 2
    countmeanmedian
    SalePriceSalePriceSalePrice
    fiSecondaryDesc
    XT217729736.81672023000.0
    LE251031586.06573723000.0
    R357731164.43584024000.0
    N384431028.67455823500.0
    SUPER L392031237.16836724000.0
    P447532841.13407826000.0
    J465531069.47089224000.0
    LC486231265.12793124000.0
    M539832225.97628824500.0
    L566931634.05362524500.0
    A596831266.45777524000.0
    K808930815.39139623000.0
    F945431958.08038925000.0
    D2013230766.16764424000.0
    E2194431089.64035724000.0
    H2475931279.81105924000.0
    G3813930902.04020023500.0
    B4044631238.50853024000.0
    C4465831024.10369924000.0
    13642031426.03821324000.0

    1.4.8 fiModelSeries

    total_df['fiModelSeries'].value_counts().head(10)
    
    • 1
           350453
    II      14039
    LC       9609
    III      5392
    -1       5142
    -2       4350
    -6       3538
    -3       2783
    -5       2664
    -12      1447
    Name: fiModelSeries, dtype: int64
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11

    1.4.9 fiModelDescriptor

    total_df['fiModelDescriptor'].value_counts().head(10)
    
    • 1
           333040
    L       16676
    LGP     16541
    LC      15666
    XL       6704
    6        3238
    LT       2681
    5        2455
    3        2149
    CR       1798
    Name: fiModelDescriptor, dtype: int64
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    temp = pd.pivot_table(total_df, index='fiModelDescriptor', values='SalePrice', aggfunc=['count', 'mean', 'median'])
    temp.sort_values(by=('count', 'SalePrice')).tail(20)
    
    • 1
    • 2
    countmeanmedian
    SalePriceSalePriceSalePrice
    fiModelDescriptor
    K34929030.37249322000.0
    835133369.80057027000.0
    E35934427.57660225500.0
    SR38328176.82767620000.0
    ZTS44228579.75113121250.0
    Z52527814.19047621000.0
    262732085.04784726000.0
    SSR68433503.50877227000.0
    H109230992.97161224000.0
    7111530983.37399124000.0
    CR179828777.15795322000.0
    3214931477.17543024000.0
    5245530897.66802424000.0
    LT268128216.72920621000.0
    6323830573.30512724000.0
    XL670430564.90170023000.0
    LC1566630387.23298923000.0
    LGP1654131327.26703324000.0
    L1667630739.47103624000.0
    33304031340.13789124000.0

    1.4.10 ProductSize

    total_df['ProductSize'].fillna('').value_counts()
    
    • 1
                      216605
    Medium             64342
    Large / Medium     51297
    Small              27057
    Mini               25721
    Large              21396
    Compact             6280
    Name: ProductSize, dtype: int64
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    temp = pd.pivot_table(total_df.fillna(''), index='ProductSize', values='SalePrice', aggfunc=['count', 'mean', 'median'])
    temp.sort_values(by=('count', 'SalePrice')).tail(20)
    
    • 1
    • 2
    countmeanmedian
    SalePriceSalePriceSalePrice
    ProductSize
    Compact628017498.57882215000.0
    Large2139642023.43331534000.0
    Mini2572115194.64772013500.0
    Small2705732511.16406129000.0
    Large / Medium5129747828.92323144000.0
    Medium6434245703.87347341000.0
    21660524047.38340819000.0

    1.4.11 state

    # 查看数据分布
    total_df['state'].fillna('').value_counts()
    
    • 1
    • 2
    Florida           67320
    Texas             53110
    California        29761
    Washington        16222
    Georgia           14633
    Maryland          13322
    Mississippi       13240
    Ohio              12369
    Illinois          11540
    Colorado          11529
    New Jersey        11156
    North Carolina    10636
    Tennessee         10298
    Alabama           10292
    Pennsylvania      10234
    South Carolina     9951
    Arizona            9364
    New York           8639
    Connecticut        8276
    Minnesota          7885
    Missouri           7178
    Nevada             6932
    Louisiana          6627
    Kentucky           5351
    Maine              5096
    Indiana            4124
    Arkansas           3933
    New Mexico         3631
    Utah               3046
    Unspecified        2801
    Wisconsin          2745
    New Hampshire      2738
    Virginia           2353
    Idaho              2025
    Oregon             1911
    Michigan           1831
    Wyoming            1672
    Montana            1336
    Iowa               1336
    Oklahoma           1326
    Nebraska            866
    West Virginia       840
    Kansas              667
    Delaware            510
    North Dakota        480
    Alaska              430
    Massachusetts       347
    Vermont             300
    South Dakota        244
    Hawaii              118
    Rhode Island         83
    Puerto Rico          42
    Washington DC         2
    Name: state, dtype: int64
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    • 13
    • 14
    • 15
    • 16
    • 17
    • 18
    • 19
    • 20
    • 21
    • 22
    • 23
    • 24
    • 25
    • 26
    • 27
    • 28
    • 29
    • 30
    • 31
    • 32
    • 33
    • 34
    • 35
    • 36
    • 37
    • 38
    • 39
    • 40
    • 41
    • 42
    • 43
    • 44
    • 45
    • 46
    • 47
    • 48
    • 49
    • 50
    • 51
    • 52
    • 53
    • 54
    # 查看州和价格的关系,有关系
    temp = pd.pivot_table(total_df, index='state', values='SalePrice', aggfunc=['count', 'mean', 'median'])
    temp.sort_values(by=('count', 'SalePrice')).tail(20)
    
    • 1
    • 2
    • 3
    countmeanmedian
    SalePriceSalePriceSalePrice
    state
    Minnesota788527199.05516819500.0
    Connecticut827629008.80860323500.0
    New York863925582.23752720000.0
    Arizona936431562.85775323000.0
    South Carolina995129848.89458322000.0
    Pennsylvania1023425463.00273619000.0
    Alabama1029235438.54148928000.0
    Tennessee1029831845.35735124000.0
    North Carolina1063632161.77369325000.0
    New Jersey1115630982.18886724500.0
    Colorado1152931777.74828725000.0
    Illinois1154029091.58145622500.0
    Ohio1236928228.10049322000.0
    Mississippi1324032574.59214525000.0
    Maryland1332228621.68218022000.0
    Georgia1463332265.55046824000.0
    Washington1622227690.73172221500.0
    California2976129815.20271522500.0
    Texas5311032977.19034125000.0
    Florida6732034387.51277527000.0
    # 进一步按价格排序查看
    temp.sort_values(by=[('median', 'SalePrice'), ('mean', 'SalePrice')])
    
    • 1
    • 2
    countmeanmedian
    SalePriceSalePriceSalePrice
    state
    Indiana412424400.35766219000.0
    Pennsylvania1023425463.00273619000.0
    Maine509626176.94270019500.0
    Minnesota788527199.05516819500.0
    New York863925582.23752720000.0
    Kansas66728093.40329820000.0
    Puerto Rico4226011.90476220250.0
    Wisconsin274527656.24772321000.0
    Idaho202529263.23456821000.0
    Washington1622227690.73172221500.0
    Virginia235328798.64598421500.0
    Ohio1236928228.10049322000.0
    Maryland1332228621.68218022000.0
    Michigan183129003.92364822000.0
    Missouri717829046.00069722000.0
    Kentucky535129815.38030322000.0
    South Carolina995129848.89458322000.0
    Vermont30027285.33333322250.0
    Arkansas393328906.89397422500.0
    Illinois1154029091.58145622500.0
    California2976129815.20271522500.0
    Washington DC222750.00000022750.0
    Massachusetts34728382.42074923000.0
    New Hampshire273828928.74360823000.0
    Oregon191130277.59026723000.0
    Delaware51031160.09803923000.0
    Arizona936431562.85775323000.0
    Connecticut827629008.80860323500.0
    Louisiana662730201.89376823500.0
    Tennessee1029831845.35735124000.0
    Iowa133631927.54491024000.0
    Oklahoma132632258.71040724000.0
    Georgia1463332265.55046824000.0
    Montana133632616.65419224000.0
    Nebraska86632612.06697524250.0
    New Jersey1115630982.18886724500.0
    Colorado1152931777.74828725000.0
    North Carolina1063632161.77369325000.0
    Mississippi1324032574.59214525000.0
    Wyoming167232604.42583725000.0
    Texas5311032977.19034125000.0
    Hawaii11828879.23728826000.0
    Alaska43033281.97674426000.0
    New Mexico363133632.64940826000.0
    Utah304634190.54793227000.0
    Florida6732034387.51277527000.0
    Nevada693236332.09751927000.0
    Unspecified280134857.71153228000.0
    Alabama1029235438.54148928000.0
    North Dakota48039083.75000029750.0
    West Virginia84040258.75000033000.0
    Rhode Island8337622.28915734000.0
    South Dakota24443907.37704935000.0

    1.4.12 ProductGroupDesc

    # 查看数据分布
    total_df['ProductGroupDesc'].fillna('').value_counts()
    
    • 1
    • 2
    Track Excavators       104230
    Track Type Tractors     82582
    Backhoe Loaders         81401
    Wheel Loader            73216
    Skid Steer Loaders      45011
    Motor Graders           26258
    Name: ProductGroupDesc, dtype: int64
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    # 查看和价格的关系
    temp = pd.pivot_table(total_df, index='ProductGroupDesc', values='SalePrice', aggfunc=['count', 'mean', 'median'])
    temp.sort_values(by=('count', 'SalePrice')).tail(20)
    
    • 1
    • 2
    • 3
    countmeanmedian
    SalePriceSalePriceSalePrice
    ProductGroupDesc
    Motor Graders2625847561.95742240000.0
    Skid Steer Loaders4501110583.94401410000.0
    Wheel Loader7321637259.29242232000.0
    Backhoe Loaders8140120951.58232720500.0
    Track Type Tractors8258236280.13278029500.0
    Track Excavators10423035763.45701829000.0

    1.4.13 Drive_System

    total_df['Drive_System'].fillna('').value_counts()
    
    • 1
                        305611
    Two Wheel Drive      47546
    Four Wheel Drive     33551
    No                   25166
    All Wheel Drive        824
    Name: Drive_System, dtype: int64
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    temp = pd.pivot_table(total_df.fillna(''), index='Drive_System', values='SalePrice', aggfunc=['count', 'mean', 'median'])
    temp.sort_values(by=('count', 'SalePrice')).tail(20)
    
    • 1
    • 2
    countmeanmedian
    SalePriceSalePriceSalePrice
    Drive_System
    All Wheel Drive82460455.27912659000.0
    No2516647342.72145040000.0
    Four Wheel Drive3355124534.14941424000.0
    Two Wheel Drive4754618418.02687917500.0
    30561132532.70369325000.0

    1.4.14 Enclosure

    total_df['Enclosure'].fillna('').value_counts()
    
    • 1
    OROPS                  177971
    EROPS                  141769
    EROPS w AC              92601
                              334
    EROPS AC                   18
    NO ROPS                     3
    None or Unspecified         2
    Name: Enclosure, dtype: int64
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    temp = pd.pivot_table(total_df.fillna(''), index='Enclosure', values='SalePrice', aggfunc=['count', 'mean', 'median'])
    temp.sort_values(by=('count', 'SalePrice')).tail(20)
    
    • 1
    • 2
    countmeanmedian
    SalePriceSalePriceSalePrice
    Enclosure
    None or Unspecified216500.00000016500.0
    NO ROPS344333.33333342500.0
    EROPS AC1823500.00000020500.0
    33427689.44610825000.0
    EROPS w AC9260151671.16972847000.0
    EROPS14176928687.99327823000.0
    OROPS17797122592.08273918000.0

    1.4.15 Forks

    total_df['Forks'].fillna('').value_counts()
    
    • 1
                           214983
    None or Unspecified    183061
    Yes                     14654
    Name: Forks, dtype: int64
    
    • 1
    • 2
    • 3
    • 4
    temp = pd.pivot_table(total_df.fillna(''), index='Forks', values='SalePrice', aggfunc=['count', 'mean', 'median'])
    temp.sort_values(by=('count', 'SalePrice')).tail(20)
    
    • 1
    • 2
    countmeanmedian
    SalePriceSalePriceSalePrice
    Forks
    Yes1465436761.35471531500.0
    None or Unspecified18306123593.41418418500.0
    21498337327.17495430000.0

    1.4.16 Pad_Type

    total_df['Pad_Type'].fillna('').value_counts()
    
    • 1
                           331602
    None or Unspecified     72395
    Reversible               5950
    Street                   2725
    Grouser                    26
    Name: Pad_Type, dtype: int64
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    temp = pd.pivot_table(total_df.fillna(''), index='Pad_Type', values='SalePrice', aggfunc=['count', 'mean', 'median'])
    temp.sort_values(by=('count', 'SalePrice')).tail(20)
    
    • 1
    • 2
    countmeanmedian
    SalePriceSalePriceSalePrice
    Pad_Type
    Grouser2630289.42307730500.0
    Street272524995.06422025000.0
    Reversible595027344.03361327500.0
    None or Unspecified7239520267.12035420000.0
    33160233725.99889726000.0

    1.4.17 Ride_Control

    total_df['Ride_Control'].fillna('').value_counts()
    
    • 1
                           259970
    No                      79389
    None or Unspecified     64693
    Yes                      8646
    Name: Ride_Control, dtype: int64
    
    • 1
    • 2
    • 3
    • 4
    • 5
    temp = pd.pivot_table(total_df.fillna(''), index='Ride_Control', values='SalePrice', aggfunc=['count', 'mean', 'median'])
    temp.sort_values(by=('count', 'SalePrice')).tail(20)
    
    • 1
    • 2
    countmeanmedian
    SalePriceSalePriceSalePrice
    Ride_Control
    Yes864655278.26914253000.0
    None or Unspecified6469334834.65002429000.0
    No7938920765.72010020000.0
    25997032705.23236225000.0

    1.4.18 Stick

    total_df['Stick'].fillna('').value_counts()
    
    • 1
                331602
    Standard     49854
    Extended     31242
    Name: Stick, dtype: int64
    
    • 1
    • 2
    • 3
    • 4
    temp = pd.pivot_table(total_df.fillna(''), index='Stick', values='SalePrice', aggfunc=['count', 'mean', 'median'])
    temp.sort_values(by=('count', 'SalePrice')).tail(20)
    
    • 1
    • 2
    countmeanmedian
    SalePriceSalePriceSalePrice
    Stick
    Extended3124223257.32091423000.0
    Standard4985419501.52511319000.0
    33160233725.99889726000.0

    1.4.19 Transmission

    total_df['Transmission'].fillna('').value_counts()
    
    • 1
                           224691
    Standard               143915
    None or Unspecified     23889
    Powershift              11991
    Powershuttle             4286
    Hydrostatic              3342
    Direct Drive              422
    Autoshift                 118
    AutoShift                  44
    Name: Transmission, dtype: int64
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    temp = pd.pivot_table(total_df.fillna(''), index='Transmission', values='SalePrice', aggfunc=['count', 'mean', 'median'])
    temp.sort_values(by=('count', 'SalePrice')).tail(20)
    
    • 1
    • 2
    countmeanmedian
    SalePriceSalePriceSalePrice
    Transmission
    AutoShift44102261.36363695000.0
    Autoshift11833141.94915333250.0
    Direct Drive42218675.82938411550.0
    Hydrostatic334234705.81418332000.0
    Powershuttle428619264.93093818000.0
    Powershift1199138574.35293129500.0
    None or Unspecified2388947354.81296840000.0
    Standard14391528364.13221823000.0
    22469131117.25384223000.0

    1.4.20 Turbocharged

    total_df['Turbocharged'].fillna('').value_counts()
    
    • 1
                           331602
    None or Unspecified     77111
    Yes                      3985
    Name: Turbocharged, dtype: int64
    
    • 1
    • 2
    • 3
    • 4
    temp = pd.pivot_table(total_df.fillna(''), index='Turbocharged', values='SalePrice', aggfunc=['count', 'mean', 'median'])
    temp.sort_values(by=('count', 'SalePrice')).tail(20)
    
    • 1
    • 2
    countmeanmedian
    SalePriceSalePriceSalePrice
    Turbocharged
    Yes398524144.30037624000.0
    None or Unspecified7711120783.27626420000.0
    33160233725.99889726000.0

    1.4.21 Blade_Extension

    total_df['Blade_Extension'].fillna('').value_counts()
    
    • 1
                           386715
    None or Unspecified     25406
    Yes                       577
    Name: Blade_Extension, dtype: int64
    
    • 1
    • 2
    • 3
    • 4
    temp = pd.pivot_table(total_df.fillna(''), index='Blade_Extension', values='SalePrice', aggfunc=['count', 'mean', 'median'])
    temp.sort_values(by=('count', 'SalePrice')).tail(20)
    
    • 1
    • 2
    countmeanmedian
    SalePriceSalePriceSalePrice
    Blade_Extension
    Yes57766587.69497465000.0
    None or Unspecified2540647337.07502240000.0
    38671530103.24427923500.0

    1.4.22 Blade_Width

    total_df['Blade_Width'].fillna('').value_counts()
    
    • 1
                           386715
    14'                      9867
    None or Unspecified      9521
    12'                      5201
    16'                       960
    13'                       335
    <12'                       99
    Name: Blade_Width, dtype: int64
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    temp = pd.pivot_table(total_df.fillna(''), index='Blade_Width', values='SalePrice', aggfunc=['count', 'mean', 'median'])
    temp.sort_values(by=('count', 'SalePrice')).tail(20)
    
    • 1
    • 2
    countmeanmedian
    SalePriceSalePriceSalePrice
    Blade_Width
    <12'9927764.64646521000.0
    13'33527125.52238818500.0
    16'96059882.11979252750.0
    12'520136000.85214429000.0
    None or Unspecified952141899.46014132000.0
    14'986759347.22316855000.0
    38671530103.24427923500.0

    1.4.23 Enclosure_Type

    total_df['Enclosure_Type'].fillna('').value_counts()
    
    • 1
                           386715
    None or Unspecified     22469
    Low Profile              2675
    High Profile              839
    Name: Enclosure_Type, dtype: int64
    
    • 1
    • 2
    • 3
    • 4
    • 5
    temp = pd.pivot_table(total_df.fillna(''), index='Enclosure_Type', values='SalePrice', aggfunc=['count', 'mean', 'median'])
    temp.sort_values(by=('count', 'SalePrice')).tail(20)
    
    • 1
    • 2
    countmeanmedian
    SalePriceSalePriceSalePrice
    Enclosure_Type
    High Profile83981897.43742680000.0
    Low Profile267581444.65831880000.0
    None or Unspecified2246942480.32475935000.0
    38671530103.24427923500.0

    1.4.24 Engine_Horsepower

    total_df['Engine_Horsepower'].fillna('').value_counts()
    
    • 1
                386715
    No           24642
    Variable      1341
    Name: Engine_Horsepower, dtype: int64
    
    • 1
    • 2
    • 3
    • 4
    temp = pd.pivot_table(total_df.fillna(''), index='Engine_Horsepower', values='SalePrice', aggfunc=['count', 'mean', 'median'])
    temp.sort_values(by=('count', 'SalePrice')).tail(20)
    
    • 1
    • 2
    countmeanmedian
    SalePriceSalePriceSalePrice
    Engine_Horsepower
    Variable134190245.15287188000.0
    No2464245452.80732138000.0
    38671530103.24427923500.0

    1.4.24 Hydraulics

    total_df['Hydraulics'].fillna('').value_counts()
    
    • 1
    2 Valve                145317
    Standard               106515
                            82565
    Auxiliary               43224
    Base + 1 Function       25511
    3 Valve                  5807
    4 Valve                  3077
    Base + 3 Function         311
    Base + 2 Function         132
    Base + 5 Function          94
    Base + 4 Function          81
    Base + 6 Function          54
    None or Unspecified        10
    Name: Hydraulics, dtype: int64
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    • 13
    • 14
    temp = pd.pivot_table(total_df.fillna(''), index='Hydraulics', values='SalePrice', aggfunc=['count', 'mean', 'median'])
    temp.sort_values(by=('count', 'SalePrice')).tail(20)
    
    • 1
    • 2
    countmeanmedian
    SalePriceSalePriceSalePrice
    Hydraulics
    None or Unspecified1027400.00000026250.0
    Base + 6 Function5468333.33333369500.0
    Base + 4 Function8195253.08642092500.0
    Base + 5 Function9475601.06383071000.0
    Base + 2 Function13289424.24242489500.0
    Base + 3 Function31184141.80064379000.0
    4 Valve307756652.55118651000.0
    3 Valve580744479.30945440000.0
    Base + 1 Function2551146637.43397039000.0
    Auxiliary4322425076.74678416000.0
    8256521028.90029720500.0
    Standard10651529391.41013920500.0
    2 Valve14531736145.19639330000.0

    1.4.25 Pushblock

    total_df['Pushblock'].fillna('').value_counts()
    
    • 1
                           386715
    None or Unspecified     20017
    Yes                      5966
    Name: Pushblock, dtype: int64
    
    • 1
    • 2
    • 3
    • 4
    temp = pd.pivot_table(total_df.fillna(''), index='Pushblock', values='SalePrice', aggfunc=['count', 'mean', 'median'])
    temp.sort_values(by=('count', 'SalePrice')).tail(20)
    
    • 1
    • 2
    countmeanmedian
    SalePriceSalePriceSalePrice
    Pushblock
    Yes596668305.12772465000.0
    None or Unspecified2001741642.52565333000.0
    38671530103.24427923500.0

    1.4.26 Ripper

    total_df['Ripper'].fillna('').value_counts()
    
    • 1
                           305753
    None or Unspecified     85405
    Yes                      8185
    Multi Shank              8071
    Single Shank             5284
    Name: Ripper, dtype: int64
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    temp = pd.pivot_table(total_df.fillna(''), index='Ripper', values='SalePrice', aggfunc=['count', 'mean', 'median'])
    temp.sort_values(by=('count', 'SalePrice')).tail(20)
    
    • 1
    • 2
    countmeanmedian
    SalePriceSalePriceSalePrice
    Ripper
    Single Shank528450139.36638942000.0
    Multi Shank807148952.39747241000.0
    Yes818563677.68307960000.0
    None or Unspecified8540535253.46648628500.0
    30575328422.90210122000.0

    1.4.27 Scarifier

    total_df['Scarifier'].fillna('').value_counts()
    
    • 1
                           386704
    None or Unspecified     13033
    Yes                     12961
    Name: Scarifier, dtype: int64
    
    • 1
    • 2
    • 3
    • 4
    temp = pd.pivot_table(total_df.fillna(''), index='Scarifier', values='SalePrice', aggfunc=['count', 'mean', 'median'])
    temp.sort_values(by=('count', 'SalePrice')).tail(20)
    
    • 1
    • 2
    countmeanmedian
    SalePriceSalePriceSalePrice
    Scarifier
    Yes1296149523.26641540000.0
    None or Unspecified1303345996.79828140000.0
    38670430103.37522023500.0

    1.4.28 Tip_Control

    total_df['Tip_Control'].fillna('').value_counts()
    
    • 1
                           386715
    None or Unspecified     16832
    Sideshift & Tip          7164
    Tip                      1987
    Name: Tip_Control, dtype: int64
    
    • 1
    • 2
    • 3
    • 4
    • 5
    temp = pd.pivot_table(total_df.fillna(''), index='Tip_Control', values='SalePrice', aggfunc=['count', 'mean', 'median'])
    temp.sort_values(by=('count', 'SalePrice')).tail(20)
    
    • 1
    • 2
    countmeanmedian
    SalePriceSalePriceSalePrice
    Tip_Control
    Tip198762269.26572757000.0
    Sideshift & Tip716446739.03908441000.0
    None or Unspecified1683246488.79045937500.0
    38671530103.24427923500.0

    1.4.29 Tire_Size

    total_df['Tire_Size'].fillna('').value_counts()
    
    • 1
                           315060
    None or Unspecified     47823
    20.5                    15773
    14"                      9111
    23.5                     8760
    26.5                     4635
    17.5                     3971
    29.5                     2767
    17.5"                    1815
    13"                       776
    20.5"                     737
    15.5                      610
    15.5"                     463
    23.5"                     309
    7.0"                       56
    23.1"                      20
    10"                         9
    10 inch                     3
    Name: Tire_Size, dtype: int64
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    • 13
    • 14
    • 15
    • 16
    • 17
    • 18
    • 19
    temp = pd.pivot_table(total_df.fillna(''), index='Tire_Size', values='SalePrice', aggfunc=['count', 'mean', 'median'])
    temp.sort_values(by=('count', 'SalePrice')).tail(20)
    
    • 1
    • 2
    countmeanmedian
    SalePriceSalePriceSalePrice
    Tire_Size
    10 inch316333.33333315500.0
    10"931166.66666725000.0
    23.1"2014837.50000012125.0
    7.0"5627625.00000025500.0
    23.5"30956487.94498451000.0
    15.5"46347401.72786240000.0
    15.561017153.44262315000.0
    20.5"73774012.63229368000.0
    13"77621085.37371116000.0
    17.5"181567478.65013865000.0
    29.5276746650.81315543000.0
    17.5397125344.85645922500.0
    26.5463550903.01186650000.0
    23.5876047236.34703244000.0
    14"911150506.22686944000.0
    20.51577341099.19799738000.0
    None or Unspecified4782335357.00821827000.0
    31506028433.53593722000.0

    1.4.30 Coupler

    total_df['Coupler'].fillna('').value_counts()
    
    • 1
                           192019
    None or Unspecified    190449
    Manual                  23918
    Hydraulic                6312
    Name: Coupler, dtype: int64
    
    • 1
    • 2
    • 3
    • 4
    • 5
    temp = pd.pivot_table(total_df.fillna(''), index='Coupler', values='SalePrice', aggfunc=['count', 'mean', 'median'])
    temp.sort_values(by=('count', 'SalePrice')).tail(20)
    
    • 1
    • 2
    countmeanmedian
    SalePriceSalePriceSalePrice
    Coupler
    Hydraulic631246839.72908743000.0
    Manual2391833493.82987727000.0
    None or Unspecified19044930392.94884722500.0
    19201931233.25520524000.0

    1.4.31 Coupler_System

    total_df['Coupler_System'].fillna('').value_counts()
    
    • 1
                           367724
    None or Unspecified     41727
    Yes                      3247
    Name: Coupler_System, dtype: int64
    
    • 1
    • 2
    • 3
    • 4
    temp = pd.pivot_table(total_df.fillna(''), index='Coupler_System', values='SalePrice', aggfunc=['count', 'mean', 'median'])
    temp.sort_values(by=('count', 'SalePrice')).tail(20)
    
    • 1
    • 2
    countmeanmedian
    SalePriceSalePriceSalePrice
    Coupler_System
    Yes324711593.70187911500.0
    None or Unspecified4172710504.80993110000.0
    36772433738.52124226000.0

    1.4.32 Grouser_Tracks

    total_df['Grouser_Tracks'].fillna('').value_counts()
    
    • 1
                           367823
    None or Unspecified     41820
    Yes                      3055
    Name: Grouser_Tracks, dtype: int64
    
    • 1
    • 2
    • 3
    • 4
    temp = pd.pivot_table(total_df.fillna(''), index='Grouser_Tracks', values='SalePrice', aggfunc=['count', 'mean', 'median'])
    temp.sort_values(by=('count', 'SalePrice')).tail(20)
    
    • 1
    • 2
    countmeanmedian
    SalePriceSalePriceSalePrice
    Grouser_Tracks
    Yes305512871.26841213000.0
    None or Unspecified4182010410.9511969750.0
    36782333732.89662526000.0

    1.4.33 Hydraulics_Flow

    total_df['Hydraulics_Flow'].fillna('').value_counts()
    
    • 1
                           367823
    Standard                44251
    High Flow                 597
    None or Unspecified        27
    Name: Hydraulics_Flow, dtype: int64
    
    • 1
    • 2
    • 3
    • 4
    • 5
    temp = pd.pivot_table(total_df.fillna(''), index='Hydraulics_Flow', values='SalePrice', aggfunc=['count', 'mean', 'median'])
    temp.sort_values(by=('count', 'SalePrice')).tail(20)
    
    • 1
    • 2
    countmeanmedian
    SalePriceSalePriceSalePrice
    Hydraulics_Flow
    None or Unspecified2715053.70370415000.0
    High Flow59713183.16582913000.0
    Standard4425110540.57318510000.0
    36782333732.89662526000.0

    1.4.34 Track_Type

    total_df['Track_Type'].fillna('').value_counts()
    
    • 1
              310505
    Steel      87463
    Rubber     14730
    Name: Track_Type, dtype: int64
    
    • 1
    • 2
    • 3
    • 4
    temp = pd.pivot_table(total_df.fillna(''), index='Track_Type', values='SalePrice', aggfunc=['count', 'mean', 'median'])
    temp.sort_values(by=('count', 'SalePrice')).tail(20)
    
    • 1
    • 2
    countmeanmedian
    SalePriceSalePriceSalePrice
    Track_Type
    Rubber1473015823.90590613500.0
    Steel8746339245.89473334000.0
    31050529683.23574222500.0

    1.4.35 Undercarriage_Pad_Width

    total_df['Undercarriage_Pad_Width'].fillna('').value_counts()
    
    • 1
                           309782
    None or Unspecified     82444
    32 inch                  5287
    28 inch                  3152
    24 inch                  2998
    20 inch                  2664
    30 inch                  1602
    36 inch                  1544
    18 inch                  1439
    34 inch                   540
    16 inch                   481
    31 inch                   191
    27 inch                   144
    22 inch                   135
    26 inch                    98
    33 inch                    94
    14 inch                    51
    15 inch                    33
    25 inch                    17
    31.5 inch                   2
    Name: Undercarriage_Pad_Width, dtype: int64
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    • 13
    • 14
    • 15
    • 16
    • 17
    • 18
    • 19
    • 20
    • 21
    temp = pd.pivot_table(total_df.fillna(''), index='Undercarriage_Pad_Width', values='SalePrice', aggfunc=['count', 'mean', 'median'])
    temp.sort_values(by=('count', 'SalePrice')).tail(20)
    
    • 1
    • 2
    countmeanmedian
    SalePriceSalePriceSalePrice
    Undercarriage_Pad_Width
    31.5 inch2108000.000000108000.0
    25 inch1722591.17647120000.0
    15 inch3316803.03030315000.0
    14 inch5116869.60784314500.0
    33 inch9451848.93617047500.0
    26 inch9830831.63265328750.0
    22 inch13533398.51851926500.0
    27 inch14433826.04166728625.0
    31 inch19148378.79581244500.0
    16 inch48120531.70270317500.0
    34 inch54061289.72222260000.0
    18 inch143920302.71021518000.0
    36 inch154446214.29728041000.0
    30 inch160236943.00873930000.0
    20 inch266429552.34609627500.0
    24 inch299837054.41961331500.0
    28 inch315237732.44765233500.0
    32 inch528748901.32400245500.0
    None or Unspecified8244435017.39151428000.0
    30978229688.37073922500.0

    1.4.36 Stick_Length

    total_df['Stick_Length'].fillna('').value_counts()
    
    • 1
                           310437
    None or Unspecified     81539
    9' 6"                    5832
    10' 6"                   3519
    11' 0"                   1601
    9' 10"                   1463
    9' 8"                    1462
    9' 7"                    1423
    12' 10"                  1087
    10' 2"                   1004
    8' 6"                     908
    8' 2"                     614
    10' 10"                   414
    12' 8"                    322
    11' 10"                   307
    8' 4"                     274
    8' 10"                    104
    12' 4"                    103
    9' 5"                     101
    15' 9"                     87
    6' 3"                      51
    13' 7"                     11
    14' 1"                      7
    13' 10"                     7
    13' 9"                      7
    19' 8"                      5
    7' 10"                      3
    15' 4"                      3
    24' 3"                      2
    9' 2"                       1
    Name: Stick_Length, dtype: int64
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    • 13
    • 14
    • 15
    • 16
    • 17
    • 18
    • 19
    • 20
    • 21
    • 22
    • 23
    • 24
    • 25
    • 26
    • 27
    • 28
    • 29
    • 30
    • 31
    temp = pd.pivot_table(total_df.fillna(''), index='Stick_Length', values='SalePrice', aggfunc=['count', 'mean', 'median'])
    temp.sort_values(by=('count', 'SalePrice')).tail(20)
    
    • 1
    • 2
    countmeanmedian
    SalePriceSalePriceSalePrice
    Stick_Length
    15' 9"8757068.96551755000.0
    9' 5"10146074.25742646000.0
    12' 4"10347505.33980640000.0
    8' 10"10443937.50000042750.0
    8' 4"27433362.22627728000.0
    11' 10"30748237.78501642500.0
    12' 8"32258167.70186356000.0
    10' 10"41460149.27536257000.0
    8' 2"61433933.30618931000.0
    8' 6"90836311.94933931000.0
    10' 2"100445855.22908444000.0
    12' 10"108766730.17479366000.0
    9' 7"142356526.52846155000.0
    9' 8"146249643.29685447000.0
    9' 10"146339891.14832536500.0
    11' 0"160148882.15178045000.0
    10' 6"351953677.64279650000.0
    9' 6"583246655.84705142500.0
    None or Unspecified8153932556.65969726000.0
    31043729683.17038322500.0

    1.4.37 Thumb

    total_df['Thumb'].fillna('').value_counts()
    
    • 1
                           310366
    None or Unspecified     85074
    Manual                   9678
    Hydraulic                7580
    Name: Thumb, dtype: int64
    
    • 1
    • 2
    • 3
    • 4
    • 5
    temp = pd.pivot_table(total_df.fillna(''), index='Thumb', values='SalePrice', aggfunc=['count', 'mean', 'median'])
    temp.sort_values(by=('count', 'SalePrice')).tail(20)
    
    • 1
    • 2
    countmeanmedian
    SalePriceSalePriceSalePrice
    Thumb
    Hydraulic758038316.01464433000.0
    Manual967839451.08131834000.0
    None or Unspecified8507435234.45904728000.0
    31036629683.22436822500.0

    1.4.38 Pattern_Changer

    total_df['Pattern_Changer'].fillna('').value_counts()
    
    • 1
                           310437
    None or Unspecified     92924
    Yes                      9269
    No                         68
    Name: Pattern_Changer, dtype: int64
    
    • 1
    • 2
    • 3
    • 4
    • 5
    temp = pd.pivot_table(total_df.fillna(''), index='Pattern_Changer', values='SalePrice', aggfunc=['count', 'mean', 'median'])
    temp.sort_values(by=('count', 'SalePrice')).tail(20)
    
    • 1
    • 2
    countmeanmedian
    SalePriceSalePriceSalePrice
    Pattern_Changer
    No6829981.61764727500.0
    Yes926952301.26108552000.0
    None or Unspecified9292434230.87077628000.0
    31043729683.17038322500.0

    1.4.39 Grouser_Type

    total_df['Grouser_Type'].fillna('').value_counts()
    
    • 1
              310505
    Double     86998
    Triple     15193
    Single         2
    Name: Grouser_Type, dtype: int64
    
    • 1
    • 2
    • 3
    • 4
    • 5
    temp = pd.pivot_table(total_df.fillna(''), index='Grouser_Type', values='SalePrice', aggfunc=['count', 'mean', 'median'])
    temp.sort_values(by=('count', 'SalePrice')).tail(20)
    
    • 1
    • 2
    countmeanmedian
    SalePriceSalePriceSalePrice
    Grouser_Type
    Single252000.00000052000.0
    Triple1519342755.05594737500.0
    Double8699834667.09878428000.0
    31050529683.23574222500.0

    1.4.40 Backhoe_Mounting

    total_df['Backhoe_Mounting'].fillna('').value_counts()
    
    • 1
                           331986
    None or Unspecified     80692
    Yes                        20
    Name: Backhoe_Mounting, dtype: int64
    
    • 1
    • 2
    • 3
    • 4
    temp = pd.pivot_table(total_df.fillna(''), index='Backhoe_Mounting', values='SalePrice', aggfunc=['count', 'mean', 'median'])
    temp.sort_values(by=('count', 'SalePrice')).tail(20)
    
    • 1
    • 2
    countmeanmedian
    SalePriceSalePriceSalePrice
    Backhoe_Mounting
    Yes2016462.50000015500.0
    None or Unspecified8069236486.29015530000.0
    33198629934.88268822500.0

    1.4.41 Blade_Type

    total_df['Blade_Type'].fillna('').value_counts()
    
    • 1
                           330823
    PAT                     39633
    Straight                13461
    None or Unspecified     11841
    Semi U                   8907
    VPAT                     3681
    U                        1888
    Angle                    1684
    No                        743
    Landfill                   26
    Coal                       11
    Name: Blade_Type, dtype: int64
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    temp = pd.pivot_table(total_df.fillna(''), index='Blade_Type', values='SalePrice', aggfunc=['count', 'mean', 'median'])
    temp.sort_values(by=('count', 'SalePrice')).tail(20)
    
    • 1
    • 2
    countmeanmedian
    SalePriceSalePriceSalePrice
    Blade_Type
    Coal1150227.27272748000.0
    Landfill2669942.30769263750.0
    No74325569.07806221000.0
    Angle168432742.92399025500.0
    U188844536.49364435000.0
    VPAT368162566.14371159000.0
    Semi U890758495.60738755000.0
    None or Unspecified1184131077.75214925000.0
    Straight1346135548.99012028000.0
    PAT3963330679.54411727000.0
    33082329949.80635922500.0

    1.4.42 Travel_Controls

    total_df['Travel_Controls'].fillna('').value_counts()
    
    • 1
                           330821
    None or Unspecified     71447
    Differential Steer       5257
    Finger Tip               2693
    2 Pedal                  1144
    Lever                     902
    Pedal                     423
    1 Speed                    11
    Name: Travel_Controls, dtype: int64
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    temp = pd.pivot_table(total_df.fillna(''), index='Travel_Controls', values='SalePrice', aggfunc=['count', 'mean', 'median'])
    temp.sort_values(by=('count', 'SalePrice')).tail(20)
    
    • 1
    • 2
    countmeanmedian
    SalePriceSalePriceSalePrice
    Travel_Controls
    1 Speed1115672.72727311500.0
    Pedal42324667.96690322500.0
    Lever90233187.90022226000.0
    2 Pedal114425588.00262221000.0
    Finger Tip269357974.05310155000.0
    Differential Steer525768752.64884967500.0
    None or Unspecified7144733407.34410427500.0
    33082129950.38559822500.0

    1.4.43 Differential_Type

    total_df['Differential_Type'].fillna('').value_counts()
    
    • 1
                    341134
    Standard         70169
    Limited Slip      1181
    No Spin            212
    Locking              2
    Name: Differential_Type, dtype: int64
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    temp = pd.pivot_table(total_df.fillna(''), index='Differential_Type', values='SalePrice', aggfunc=['count', 'mean', 'median'])
    temp.sort_values(by=('count', 'SalePrice')).tail(20)
    
    • 1
    • 2
    countmeanmedian
    SalePriceSalePriceSalePrice
    Differential_Type
    Locking266000.00000066000.0
    No Spin21252889.38679251475.0
    Limited Slip118157566.55376857000.0
    Standard7016937061.72995232000.0
    34113429907.68366722500.0

    1.4.44 Steering_Controls

    total_df['Steering_Controls'].fillna('').value_counts()
    
    • 1
                           341176
    Conventional            70774
    Command Control           594
    Four Wheel Standard       139
    Wheel                      14
    No                          1
    Name: Steering_Controls, dtype: int64
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    temp = pd.pivot_table(total_df.fillna(''), index='Steering_Controls', values='SalePrice', aggfunc=['count', 'mean', 'median'])
    temp.sort_values(by=('count', 'SalePrice')).tail(20)
    
    • 1
    • 2
    countmeanmedian
    SalePriceSalePriceSalePrice
    Steering_Controls
    No117500.00000017500.0
    Wheel1421517.85714317500.0
    Four Wheel Standard13924658.27338122000.0
    Command Control59474774.41077475000.0
    Conventional7077437168.65909832000.0
    34117629907.45541922500.0

    1.4.45 fiManufacturerDesc

    total_df['fiManufacturerID'].fillna('').value_counts()
    
    • 1
    26      169003
    43       74527
    25       42142
    103      38928
    121      25033
             ...  
    5            2
    923          1
    1518         1
    112          1
    525          1
    Name: fiManufacturerID, Length: 104, dtype: int64
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    temp = pd.pivot_table(total_df.fillna(''), index='fiManufacturerID', values='SalePrice', aggfunc=['count', 'mean', 'median'])
    temp.sort_values(by=('count', 'SalePrice')).tail(20)
    
    • 1
    • 2
    countmeanmedian
    SalePriceSalePriceSalePrice
    fiManufacturerID
    405107010707.6915899500.0
    129111714373.20949010000.0
    86117834583.19185130000.0
    46134417102.77157715000.0
    158137830334.21625526000.0
    95162834679.65294831000.0
    166190623662.64428120000.0
    135255318356.46259314500.0
    54258518324.77640215000.0
    750330512966.82299511500.0
    176422443841.87263340000.0
    99508733323.88441128000.0
    92526021678.64543718500.0
    55590213283.74703511750.0
    741129133328.46293527000.0
    1212503310637.49506710000.0
    1033892833710.58798827500.0
    254214221210.95258918000.0
    437452727908.81508723000.0
    2616900340321.71042533000.0

    1.4.46 PrimarySizeBasis

    total_df['PrimarySizeBasis'].fillna('').value_counts()
    
    • 1
    Horsepower                     180262
    Weight - Metric Tons           105019
    Standard Digging Depth - Ft     77848
    Operating Capacity - Lbs        44226
                                     5259
    Model                              78
    Weight - Metric                     4
    Weight - Lbs                        1
    Cutting Width - Inches              1
    Name: PrimarySizeBasis, dtype: int64
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    temp = pd.pivot_table(total_df.fillna(''), index='PrimarySizeBasis', values='SalePrice', aggfunc=['count', 'mean', 'median'])
    temp.sort_values(by=('count', 'SalePrice')).tail(20)
    
    • 1
    • 2
    countmeanmedian
    SalePriceSalePriceSalePrice
    PrimarySizeBasis
    Cutting Width - Inches123000.00000023000.0
    Weight - Lbs127500.00000027500.0
    Weight - Metric422562.50000022750.0
    Model7825535.25641017750.0
    525915941.91861614000.0
    Operating Capacity - Lbs4422610612.49387210000.0
    Standard Digging Depth - Ft7784821202.52707821000.0
    Weight - Metric Tons10501935654.97314829000.0
    Horsepower18026238455.69106232000.0

    1.4.47 PrimaryLower

    total_df['PrimaryLower'].fillna('').value_counts()
    
    • 1
    14.0       62037
    20.0       17943
    130.0      17113
    150.0      15322
    85.0       15054
               ...  
    1.8            3
    4.5            2
    25000.0        1
    2.7            1
    300.0          1
    Name: PrimaryLower, Length: 76, dtype: int64
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12

    **总结:**发现的规律如下:

    1. 从EXCEL表中可以明显看出部分特征是要么一起出现,要么一起空,这些可以在创建衍生特征时做成交叉特征。

    2. 部分特征虽然是object字段,但是能排出顺序,比如“Low、Medium、High”等,和‘xx inch’等,后面处理成定序变量。

    3. 空值都单独用一列标记。

    4. 没有特殊规律的类别特征统一用Label Encoder编码。

    5. 与价格的透视表可以为分箱提供依据。

    EDA到此堂堂完结!!!!!!!!!!!!!!!!!!!!!!!

    2. 数据清洗+数据预处理

    2.1 创建衍生变量

    2.1.1 SaleDate

    def to_sin(n, i):
        return round(np.sin((2*np.pi/n)*(i-1)+(2*np.pi/7)) + 1, 2)
    
    • 1
    • 2
    total_df['SaleYear'] = total_df['SaleDate'].dt.year
    total_df['SaleMonth'] = total_df['SaleDate'].dt.month
    total_df['SaleDay'] = total_df['SaleDate'].dt.day
    total_df['SaleDayOfWeek'] = total_df['SaleDate'].dt.dayofweek
    total_df['SaleDayOfYear'] = total_df['SaleDate'].dt.dayofyear
    # 删除原来的SaleDate特征
    total_df.drop('SaleDate', axis=1, inplace=True)
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    # 尝试余弦化拉近1月和12月的距离
    total_df['SaleDayOfWeek_sin'] = total_df['SaleDayOfWeek'].map(lambda x: to_sin(7, x)).value_counts()
    total_df['SaleMonth_sin'] = total_df['SaleMonth'].map(lambda x: to_sin(12, x)).value_counts()
    
    • 1
    • 2
    • 3

    2.1.2 Stick、Turbocharged

    def combine_features(df, col_list):
        temp = df[col_list[0]].astype(str)
        for col in col_list[1:]:
            temp += df[col].astype(str)
        return temp
    
    • 1
    • 2
    • 3
    • 4
    • 5
    total_df['Stick__Turbocharged'] = combine_features(total_df, ['Stick', 'Turbocharged'])
    total_df['Stick__Turbocharged'].value_counts()
    
    • 1
    • 2
    nannan                         331602
    StandardNone or Unspecified     47981
    ExtendedNone or Unspecified     29130
    ExtendedYes                      2112
    StandardYes                      1873
    Name: Stick__Turbocharged, dtype: int64
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6

    2.1.3 Blade_Extension、Blade_Width、Enclosure_Type、Engine_Horsepower

    total_df['Blade__Blade__Enclosure__Engine'] = combine_features(total_df, ['Blade_Extension', 'Blade_Width', 'Enclosure_Type', 'Engine_Horsepower'])
    total_df['Blade__Blade__Enclosure__Engine'].value_counts()
    
    • 1
    • 2
    nannannannan                                                         386715
    None or UnspecifiedNone or UnspecifiedNone or UnspecifiedNo            8572
    None or Unspecified14'None or UnspecifiedNo                            7307
    None or Unspecified12'None or UnspecifiedNo                            4466
    None or Unspecified14'Low ProfileNo                                    1088
    None or Unspecified16'None or UnspecifiedNo                             818
    None or UnspecifiedNone or UnspecifiedLow ProfileNo                     435
    None or Unspecified12'Low ProfileNo                                     416
    None or Unspecified14'Low ProfileVariable                               377
    None or Unspecified14'High ProfileNo                                    348
    None or Unspecified14'None or UnspecifiedVariable                       343
    None or Unspecified13'None or UnspecifiedNo                             313
    None or UnspecifiedNone or UnspecifiedNone or UnspecifiedVariable       190
    Yes14'None or UnspecifiedNo                                             126
    None or Unspecified14'High ProfileVariable                              122
    None or UnspecifiedNone or UnspecifiedHigh ProfileNo                    112
    None or Unspecified<12'None or UnspecifiedNo                             99
    None or Unspecified12'High ProfileNo                                     96
    YesNone or UnspecifiedNone or UnspecifiedNo                              86
    Yes12'None or UnspecifiedNo                                              84
    None or UnspecifiedNone or UnspecifiedLow ProfileVariable                74
    None or Unspecified16'Low ProfileNo                                      58
    Yes14'Low ProfileNo                                                      55
    Yes14'Low ProfileVariable                                                50
    Yes12'Low ProfileNo                                                      47
    None or Unspecified12'High ProfileVariable                               43
    None or Unspecified16'High ProfileNo                                     34
    None or UnspecifiedNone or UnspecifiedHigh ProfileVariable               27
    None or Unspecified12'Low ProfileVariable                                21
    Yes14'High ProfileVariable                                               21
    Yes14'None or UnspecifiedVariable                                        18
    None or Unspecified12'None or UnspecifiedVariable                        17
    None or Unspecified13'Low ProfileNo                                      16
    Yes16'Low ProfileNo                                                      15
    Yes14'High ProfileNo                                                     12
    Yes16'None or UnspecifiedNo                                              10
    YesNone or UnspecifiedLow ProfileNo                                       9
    YesNone or UnspecifiedNone or UnspecifiedVariable                         9
    Yes16'High ProfileNo                                                      6
    Yes12'High ProfileNo                                                      6
    YesNone or UnspecifiedLow ProfileVariable                                 4
    None or Unspecified16'High ProfileVariable                                4
    None or Unspecified16'None or UnspecifiedVariable                         4
    None or Unspecified16'Low ProfileVariable                                 4
    Yes13'None or UnspecifiedNo                                               4
    Yes16'Low ProfileVariable                                                 3
    Yes12'Low ProfileVariable                                                 3
    YesNone or UnspecifiedHigh ProfileNo                                      3
    Yes12'High ProfileVariable                                                2
    Yes16'High ProfileVariable                                                2
    Yes16'None or UnspecifiedVariable                                         2
    None or Unspecified13'None or UnspecifiedVariable                         1
    None or Unspecified13'High ProfileNo                                      1
    Name: Blade__Blade__Enclosure__Engine, dtype: int64
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    • 13
    • 14
    • 15
    • 16
    • 17
    • 18
    • 19
    • 20
    • 21
    • 22
    • 23
    • 24
    • 25
    • 26
    • 27
    • 28
    • 29
    • 30
    • 31
    • 32
    • 33
    • 34
    • 35
    • 36
    • 37
    • 38
    • 39
    • 40
    • 41
    • 42
    • 43
    • 44
    • 45
    • 46
    • 47
    • 48
    • 49
    • 50
    • 51
    • 52
    • 53
    • 54

    2.1.4 Pushblock、Scarifier、Tip_Control

    total_df['Pushblock__Scarifier__TipControl'] = combine_features(total_df, ['Pushblock', 'Scarifier', 'Tip_Control'])
    total_df['Pushblock__Scarifier__TipControl'].value_counts()
    
    • 1
    • 2
    nannannan                                                    386704
    None or UnspecifiedYesNone or Unspecified                      6742
    None or UnspecifiedNone or UnspecifiedNone or Unspecified      6228
    None or UnspecifiedYesSideshift & Tip                          3088
    YesNone or UnspecifiedNone or Unspecified                      2560
    None or UnspecifiedNone or UnspecifiedSideshift & Tip          2401
    YesYesNone or Unspecified                                      1302
    YesNone or UnspecifiedSideshift & Tip                          1182
    None or UnspecifiedYesTip                                      1103
    YesYesSideshift & Tip                                           493
    None or UnspecifiedNone or UnspecifiedTip                       455
    YesYesTip                                                       226
    YesNone or UnspecifiedTip                                       203
    nanYesnan                                                         7
    nanNone or Unspecifiednan                                         4
    Name: Pushblock__Scarifier__TipControl, dtype: int64
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    • 13
    • 14
    • 15
    • 16

    2.1.5 Coupler_System、Grouser_Tracks、Hydraulics_Flow

    total_df['CouplerSystem__GrouserTracks__HydraulicsFlow'] = combine_features(total_df, ['Coupler_System', 'Grouser_Tracks', 'Hydraulics_Flow'])
    total_df['CouplerSystem__GrouserTracks__HydraulicsFlow'].value_counts()
    
    • 1
    • 2
    nannannan                                                    367724
    None or UnspecifiedNone or UnspecifiedStandard                39040
    YesNone or UnspecifiedStandard                                 2289
    None or UnspecifiedYesStandard                                 2149
    YesYesStandard                                                  773
    None or UnspecifiedNone or UnspecifiedHigh Flow                 364
    YesNone or UnspecifiedHigh Flow                                 122
    None or Unspecifiednannan                                        91
    None or UnspecifiedYesHigh Flow                                  57
    YesYesHigh Flow                                                  54
    None or UnspecifiedYesNone or Unspecified                        22
    Yesnannan                                                         8
    None or UnspecifiedNone or UnspecifiedNone or Unspecified         4
    YesNone or UnspecifiedNone or Unspecified                         1
    Name: CouplerSystem__GrouserTracks__HydraulicsFlow, dtype: int64
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    • 13
    • 14
    • 15

    2.1.6 Track_Type Undercarriage_Pad_Width Stick_Length Thumb Pattern_Changer Grouser_Type

    col_list = ('Track_Type	Undercarriage_Pad_Width	Stick_Length	Thumb	Pattern_Changer	Grouser_Type').split('\t')
    total_df['TUSTPG'] = combine_features(total_df, col_list)
    total_df['TUSTPG'].value_counts()
    
    • 1
    • 2
    • 3
    nannannannannannan                                                                          309643
    SteelNone or UnspecifiedNone or UnspecifiedNone or UnspecifiedNone or UnspecifiedDouble      38336
    RubberNone or UnspecifiedNone or UnspecifiedNone or UnspecifiedNone or UnspecifiedDouble     10031
    SteelNone or UnspecifiedNone or UnspecifiedManualNone or UnspecifiedDouble                    3900
    SteelNone or UnspecifiedNone or UnspecifiedNone or UnspecifiedNone or UnspecifiedTriple       3214
                                                                                                 ...  
    Steel32 inch8' 2"ManualNone or UnspecifiedDouble                                                 1
    Steel30 inch12' 10"ManualNone or UnspecifiedTriple                                               1
    Steel20 inch8' 4"ManualNone or UnspecifiedTriple                                                 1
    Steel32 inch8' 4"None or UnspecifiedNone or UnspecifiedTriple                                    1
    Rubber30 inchNone or UnspecifiedHydraulicNone or UnspecifiedDouble                               1
    Name: TUSTPG, Length: 1065, dtype: int64
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12

    2.1.7 Backhoe_Mounting Blade_Type Travel_Controls

    col_list = ('Backhoe_Mounting	Blade_Type	Travel_Controls').split('\t')
    total_df['BBT'] = combine_features(total_df, col_list)
    total_df['BBT'].value_counts()
    
    • 1
    • 2
    • 3
    nannannan                                                    330802
    None or UnspecifiedPATNone or Unspecified                     37249
    None or UnspecifiedStraightNone or Unspecified                11990
    None or UnspecifiedNone or UnspecifiedNone or Unspecified     11106
    None or UnspecifiedSemi UNone or Unspecified                   6582
    None or UnspecifiedSemi UDifferential Steer                    2122
    None or UnspecifiedVPATFinger Tip                              1857
    None or UnspecifiedUNone or Unspecified                        1816
    None or UnspecifiedAngleNone or Unspecified                    1560
    None or UnspecifiedVPATNone or Unspecified                     1092
    None or UnspecifiedPATDifferential Steer                       1011
    None or UnspecifiedStraightDifferential Steer                   913
    nanNo2 Pedal                                                    728
    None or UnspecifiedVPATDifferential Steer                       708
    None or UnspecifiedPATFinger Tip                                584
    nanStraight2 Pedal                                              416
    None or UnspecifiedPATLever                                     412
    None or UnspecifiedPATPedal                                     370
    None or UnspecifiedNone or UnspecifiedDifferential Steer        327
    None or UnspecifiedNone or UnspecifiedLever                     285
    None or UnspecifiedAngleDifferential Steer                      109
    None or UnspecifiedSemi UFinger Tip                              97
    None or UnspecifiedSemi ULever                                   96
    None or UnspecifiedNone or UnspecifiedFinger Tip                 77
    None or UnspecifiedStraightFinger Tip                            76
    None or UnspecifiedStraightLever                                 56
    None or UnspecifiedUDifferential Steer                           54
    None or UnspecifiedNone or UnspecifiedPedal                      33
    None or UnspecifiedVPATLever                                     24
    nanNonan                                                         15
    None or UnspecifiedLandfillNone or Unspecified                   14
    None or UnspecifiedULever                                        14
    YesNone or UnspecifiedNone or Unspecified                        13
    None or UnspecifiedAngleLever                                    13
    None or UnspecifiedLandfillDifferential Steer                    12
    nannan1 Speed                                                    11
    nannanNone or Unspecified                                        10
    None or UnspecifiedSemi UPedal                                   10
    None or UnspecifiedCoalNone or Unspecified                        8
    YesPATNone or Unspecified                                         7
    None or UnspecifiedStraightPedal                                  7
    nanStraightnan                                                    3
    None or UnspecifiedUPedal                                         2
    None or UnspecifiedCoalLever                                      2
    None or UnspecifiedAnglePedal                                     1
    nanUnan                                                           1
    None or UnspecifiedUFinger Tip                                    1
    None or UnspecifiedAngleFinger Tip                                1
    None or UnspecifiedCoalDifferential Steer                         1
    Name: BBT, dtype: int64
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    • 13
    • 14
    • 15
    • 16
    • 17
    • 18
    • 19
    • 20
    • 21
    • 22
    • 23
    • 24
    • 25
    • 26
    • 27
    • 28
    • 29
    • 30
    • 31
    • 32
    • 33
    • 34
    • 35
    • 36
    • 37
    • 38
    • 39
    • 40
    • 41
    • 42
    • 43
    • 44
    • 45
    • 46
    • 47
    • 48
    • 49
    • 50

    2.1.8 Differential_Type Steering_Controls

    col_list = ('Differential_Type	Steering_Controls').split('\t')
    total_df['DS'] = combine_features(total_df, col_list)
    total_df['DS'].value_counts()
    
    • 1
    • 2
    • 3
    nannan                         341120
    StandardConventional            69411
    Limited SlipConventional         1154
    StandardCommand Control           564
    No SpinConventional               209
    StandardFour Wheel Standard       139
    Standardnan                        54
    Limited SlipCommand Control        27
    nanWheel                           14
    No SpinCommand Control              3
    Lockingnan                          2
    StandardNo                          1
    Name: DS, dtype: int64
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    • 13

    2.1.9 PrimarySizeBasis PrimaryLower

    col_list = ('PrimarySizeBasis	PrimaryLower').split('\t')
    total_df['PP'] = combine_features(total_df, col_list)
    total_df['PP'].value_counts()
    
    • 1
    • 2
    • 3
    Standard Digging Depth - Ft14.0    57395
    Horsepower20.0                     17941
    Horsepower130.0                    17113
    Horsepower150.0                    15299
    Horsepower85.0                     15054
                                       ...  
    Weight - Metric Tons20.0               2
    Weight - Lbs25000.0                    1
    Cutting Width - Inches0.0              1
    Weight - Metric Tons2.7                1
    Weight - Metric Tons300.0              1
    Name: PP, Length: 90, dtype: int64
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12

    2.1.10 fiBaseModel

    def is_hybrid(x):
        try:
            return type(eval(x)) == str
        except:
            return True
    
    • 1
    • 2
    • 3
    • 4
    • 5
    total_df['fiBaseModel_is_hybrid'] = total_df['fiBaseModel'].map(is_hybrid)
    
    • 1

    2.1.11 ProductGroupDesc

    result = total_df['ProductGroupDesc'].str.split(' ')
    
    • 1
    total_df['ProductGroupDesc_father'] = [i[-1] for i in result]    
    
    • 1
    total_df['ProductGroupDesc_father'].value_counts()
    
    • 1
    Loaders       126412
    Excavators    104230
    Tractors       82582
    Loader         73216
    Graders        26258
    Name: ProductGroupDesc_father, dtype: int64
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6

    2.2 独热编码

    def to_dummies(df, col_name):
        name_dict = {value: col_name + '_is_' + str(value) for value in df[col_name].unique()}
        return pd.get_dummies(df[col_name]).rename(columns=name_dict)
    
    • 1
    • 2
    • 3
    # col_names = ['CouplerSystem__GrouserTracks__HydraulicsFlow']
    
    # for col in col_names:
    #     total_df = pd.concat([total_df, to_dummies(total_df, col)], axis=1)
    
    # total_df.drop(columns=col_names, inplace=True)
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6

    感觉没有明显提升,暂时不做了。

    2.3 标签编码

    2.3.1 UsageBand

    usageband_dict = {
        'Low': 1,
        'Medium': 2,
        'High': 3
    }
    total_df['UsageBand'] = total_df['UsageBand'].map(usageband_dict).fillna(0)
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    total_df['UsageBand'].value_counts()
    
    • 1
    0.0    339028
    2.0     35832
    1.0     25311
    3.0     12527
    Name: UsageBand, dtype: int64
    
    • 1
    • 2
    • 3
    • 4
    • 5

    2.3.2 ProductSize

    total_df['ProductSize_is_missing'] = total_df['ProductSize'].isna()
    
    • 1
    total_df['ProductSize'].value_counts()
    
    • 1
    Medium            64342
    Large / Medium    51297
    Small             27057
    Mini              25721
    Large             21396
    Compact            6280
    Name: ProductSize, dtype: int64
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    ProductSize_dict = {
        'Compact': 1,
        'Mini': 2,
        'Small': 3,
        'Medium': 4,
        'Large / Medium': 5,
        'Large': 6,
    }
    total_df['ProductSize'] = total_df['ProductSize'].map(ProductSize_dict).fillna(0)
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    total_df['ProductSize'].value_counts()
    
    • 1
    0.0    216605
    4.0     64342
    5.0     51297
    3.0     27057
    2.0     25721
    6.0     21396
    1.0      6280
    Name: ProductSize, dtype: int64
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8

    2.3.3 Undercarriage_Pad_Width

    total_df['Undercarriage_Pad_Width_is_missing'] = total_df['Undercarriage_Pad_Width'].isna()
    
    • 1
    total_df['Undercarriage_Pad_Width'].value_counts()
    
    • 1
    None or Unspecified    82444
    32 inch                 5287
    28 inch                 3152
    24 inch                 2998
    20 inch                 2664
    30 inch                 1602
    36 inch                 1544
    18 inch                 1439
    34 inch                  540
    16 inch                  481
    31 inch                  191
    27 inch                  144
    22 inch                  135
    26 inch                   98
    33 inch                   94
    14 inch                   51
    15 inch                   33
    25 inch                   17
    31.5 inch                  2
    Name: Undercarriage_Pad_Width, dtype: int64
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    • 13
    • 14
    • 15
    • 16
    • 17
    • 18
    • 19
    • 20
    total_df['Undercarriage_Pad_Width'] = total_df['Undercarriage_Pad_Width'].fillna('0 inch').map(lambda x: 1 if x == 'None or Unspecified' else eval(x[:-4]))
    
    • 1
    total_df['Undercarriage_Pad_Width'].value_counts()
    
    • 1
    0.0     309782
    1.0      82444
    32.0      5287
    28.0      3152
    24.0      2998
    20.0      2664
    30.0      1602
    36.0      1544
    18.0      1439
    34.0       540
    16.0       481
    31.0       191
    27.0       144
    22.0       135
    26.0        98
    33.0        94
    14.0        51
    15.0        33
    25.0        17
    31.5         2
    Name: Undercarriage_Pad_Width, dtype: int64
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    • 13
    • 14
    • 15
    • 16
    • 17
    • 18
    • 19
    • 20
    • 21

    2.3.4 Stick_Length

    total_df['Stick_Length_is_missing'] = total_df['Stick_Length'].isna()
    
    • 1
    total_df['Stick_Length'].value_counts()
    
    • 1
    None or Unspecified    81539
    9' 6"                   5832
    10' 6"                  3519
    11' 0"                  1601
    9' 10"                  1463
    9' 8"                   1462
    9' 7"                   1423
    12' 10"                 1087
    10' 2"                  1004
    8' 6"                    908
    8' 2"                    614
    10' 10"                  414
    12' 8"                   322
    11' 10"                  307
    8' 4"                    274
    8' 10"                   104
    12' 4"                   103
    9' 5"                    101
    15' 9"                    87
    6' 3"                     51
    13' 7"                    11
    14' 1"                     7
    13' 10"                    7
    13' 9"                     7
    19' 8"                     5
    7' 10"                     3
    15' 4"                     3
    24' 3"                     2
    9' 2"                      1
    Name: Stick_Length, dtype: int64
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    • 13
    • 14
    • 15
    • 16
    • 17
    • 18
    • 19
    • 20
    • 21
    • 22
    • 23
    • 24
    • 25
    • 26
    • 27
    • 28
    • 29
    • 30
    result = []
    
    for i in total_df['Stick_Length'].fillna(0):
        if  i == 0:
            pass
        elif i == 'None or Unspecified':
            i = 1
        else:
            a = i.find("'")
            b = i.find('"')
            i = round(int(i[:a]) + int(i[a+2:b])/12, 2)
        result.append(i)
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    total_df['Stick_Length'] = result
    
    • 1
    total_df['Stick_Length'].value_counts()
    
    • 1
    0.00     310437
    1.00      81539
    9.50       5832
    10.50      3519
    11.00      1601
    9.83       1463
    9.67       1462
    9.58       1423
    12.83      1087
    10.17      1004
    8.50        908
    8.17        614
    10.83       414
    12.67       322
    11.83       307
    8.33        274
    8.83        104
    12.33       103
    9.42        101
    15.75        87
    6.25         51
    13.58        11
    14.08         7
    13.83         7
    13.75         7
    19.67         5
    7.83          3
    15.33         3
    24.25         2
    9.17          1
    Name: Stick_Length, dtype: int64
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    • 13
    • 14
    • 15
    • 16
    • 17
    • 18
    • 19
    • 20
    • 21
    • 22
    • 23
    • 24
    • 25
    • 26
    • 27
    • 28
    • 29
    • 30
    • 31

    2.3.2 其余随机标签编码

    le = LabelEncoder()
    
    • 1
    for col in total_df.columns:
        # 标记空值(事实证明用处不大。。。,只有一列标记派上了用场)
        if total_df[col].isna().sum() > 1000:
            total_df[col + '_is_missing'] = total_df[col].isna()
        # 编码
        if total_df[col].dtype == 'object':
            total_df[col] = le.fit_transform(total_df[col])
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7

    2.4 空值和异常值

    # 异常值173
    total_df['datasource'].map(lambda x: 172 if x == 173 else x)
    
    • 1
    • 2
    0         132
    1         132
    2         132
    3         132
    4         132
             ... 
    412693    149
    412694    149
    412695    149
    412696    149
    412697    149
    Name: datasource, Length: 412698, dtype: int64
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    # 这一列的空值用众数填充
    total_df['auctioneerID'].fillna(total_df['auctioneerID'].mode()[0], inplace=True)
    
    • 1
    • 2
    total_df['MachineHoursCurrentMeter'].fillna(0, inplace=True)
    total_df['PrimaryLower'].fillna(0, inplace=True)
    
    • 1
    • 2
    total_df.info()
    
    • 1
    
    Int64Index: 412698 entries, 0 to 412697
    Columns: 109 entries, SalePrice to SaleMonth_sin_is_missing
    dtypes: bool(40), float64(10), int32(50), int64(9)
    memory usage: 157.4 MB
    
    • 1
    • 2
    • 3
    • 4
    • 5

    2.5 删除重复列

    def drop_dup_cols(df):
        cols = df.columns[60:]
        n = len(cols)
        for i in range(n-1):
            if len(df[df[cols[i]] != df[cols[i+1]]]) < 1000:
                df.drop(columns=[cols[i]], inplace=True)
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    drop_dup_cols(total_df)
    
    • 1

    3. 建模

    # 以DataFrame的形式展现特征重要性
    def show_feature_imp(model, df):
        return pd.DataFrame({df.columns[i]: model.feature_importances_[i] for i in range(len(df.columns))}, index=['imp']).T.sort_values(by='imp', ascending=False)
    
    • 1
    • 2
    • 3
    # 展示两个特征重要性表的共同倒数n%的特征
    def show_low_features(df1, df2, ratio):
        n = int(len(df1) * ratio)
        df1_features = df1.tail(n).index
        df2_features = df2.tail(n).index
        return n, {feature: [df1.loc[feature].values[0], df2.loc[feature].values[0]] for feature in df1_features if feature in df2_features}
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    # 以字典的形式显示分数,可显示的分数有:R平方、MAE、MSE、RMSLE
    def show_scores(y_test, y_pred):
        scores_dict = {}
        scores_dict['R2'] = r2_score(y_test, y_pred)
        scores_dict['MAE'] = mean_absolute_error(y_test, y_pred)
        scores_dict['MSE'] = mean_squared_error(y_test, y_pred)
        scores_dict['RMSLE'] = np.sqrt(mean_squared_log_error(y_test, y_pred))
        return scores_dict
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    # 被后续的逐步回归筛出来的特征
    drop_list = ['SalePrice', 'MachineID', 'PrimaryLower_is_missing',  
                 'Drive_System_is_missing', 'Forks_is_missing', 'ProductSize_is_missing',
                 'Ride_Control_is_missing', 'Stick_is_missing', 'Stick_Length_is_missing', 'Transmission_is_missing', 
                 'Turbocharged_is_missing', 'Engine_Horsepower_is_missing', 'Hydraulics_is_missing', 
                 'Pad_Type_is_missing', 'Ripper_is_missing', 'Tip_Control_is_missing', 'Tire_Size_is_missing', 
                 'Coupler_is_missing', 'Hydraulics_Flow_is_missing', 'Grouser_Type_is_missing', 
                 'Backhoe_Mounting_is_missing', 'Travel_Controls_is_missing', 'Steering_Controls_is_missing', 
                 'Pushblock_is_missing', 'SaleDayOfWeek', 'SaleMonth', 'Differential_Type', 
                 'Stick__Turbocharged', 'SaleDayOfYear', 'MachineHoursCurrentMeter_is_missing', 
                 'Scarifier', 'Backhoe_Mounting', 'Forks', 'Track_Type', 'Engine_Horsepower', 
                 'Blade_Extension', 'Coupler_System', 'SaleMonth_sin', 'SaleDayOfWeek_sin', 'SaleMonth_sin_is_missing',
                 'fiBaseModel_is_hybrid', 'Hydraulics_Flow', 'fiModelDescriptor', 'SaleDay', 'Hydraulics', 'Pushblock', 'Transmission',
                 'PrimarySizeBasis', 'Stick','DS', 'Tip_Control', 'Grouser_Tracks', 'fiSecondaryDesc', 'fiModelSeries'
                ]
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    • 13
    • 14
    • 15
    # 划分训练集、验证集
    X = total_df.drop(drop_list, axis=1)
    y = total_df['SalePrice']
    
    X_train = X[X['SaleYear']<2012]
    X_test = X[X['SaleYear']>=2012]
    y_train = y[X['SaleYear']<2012]
    y_test = y[X['SaleYear']>=2012]
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    raise KeyError
    
    • 1
    ---------------------------------------------------------------------------
    
    KeyError                                  Traceback (most recent call last)
    
    Input In [1044], in ()
    ----> 1 raise KeyError
    
    
    KeyError: 
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9

    3.1 RF

    # rfr = RandomForestRegressor(
    #     n_estimators = 100,
    #     max_depth = 23,
    #     min_samples_leaf = 2,
    #     min_samples_split = 4,
    #     n_jobs = -1,
    #     random_state = 10
    # )
    # rfr.fit(X_train, y_train)
    # show_scores(y_test, rfr.predict(X_test))
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    # rfr_imp = show_feature_imp(rfr, X_train)
    
    • 1
    # rfr_imp.head(20)
    
    • 1
    # rfr_grid = {
    #     'n_estimators': [100, 150, 200],
    #     'max_depth': [15, 20, 25],
    #     'max_features': [0.5, 0.6, 0.7],
    #     'min_samples_leaf': [2],
    #     'min_samples_split': [4],
    #     'max_samples': [2000],
    #     'n_jobs': [-1]  
    # }
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    # gscv = GridSearchCV(RandomForestRegressor(), param_grid=rfr_grid, cv=5, verbose=True)
    
    • 1
    # gscv.fit(X_train, y_train)
    
    • 1

    3.2 XGB

    xgb_grid = {
        'n_estimators': [200],
        'learning_rate': [0.0835],
        'reg_lambda': [0.48, 0.5, 0.52]
    }
    
    • 1
    • 2
    • 3
    • 4
    • 5
    gscv = GridSearchCV(XGBRegressor(), param_grid=xgb_grid)
    
    • 1
    gscv.fit(X_train, y_train)
    
    • 1
    gscv.best_params_
    
    • 1
    xgb = XGBRegressor(
        n_estimators = 250, # 这个很重要
        max_depth = 11, # 这个很重要
        learning_rate = 0.11, # 这个很重要
        reg_lambda = 0.2, # 这个比较重要
        n_jobs=-1,
        random_state = 13
    )
    xgb.fit(X_train, y_train)
    show_scores(y_test, xgb.predict(X_test))
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    {'R2': 0.884087856079633,
     'MAE': 5745.138324664967,
     'MSE': 79616854.10269143,
     'RMSLE': 0.23743323282056006}
    
    • 1
    • 2
    • 3
    • 4

    总结:

    调参的时候自己要先清楚哪些参数对模型影响大,最好用gscv来两次,第一次先摸清大概范围,第二次再精确到小数点后1-3位。

    4. 特征筛选

    逐步回归筛选特征(注意,后续提到的逐步回归都以本文自定义的“青春版”逐步回归函数为准)

    1. 后向逐步回归:按照给定的特征列表(按照之前fit一遍的特征重要性排升序)遍历。每丢一个特征计算一次RMSLE,如果分降了就保持丢弃这个特征。可以设定模型、遍历次数、丢弃特征的RMSLE下降阈值。

    2. 前向逐步回归:按照给定的特征列表(按照之前fit一遍的特征重要性排序)遍历。每丢一个特征计算一次RMSLE,如果分降了就保持丢弃这个特征。可以设定模型、遍历次数、丢弃特征的RMSLE下降阈值。

    3. 如果希望更稳重,不要如此轻易就丢弃一个特征,可选的函数改进方式是改为二重循环:每一轮都将每个特征遍历一遍,计算前后的得分(或者误差)差值gain,剔除gain最小/大的特征,然后再进入下一轮,直到达到最大迭代次数或gain达到阈值。

    def step_wise_reg(model, X_train, X_test, y_train, y_test, suspicion_list, method='backward', tol=0, show_info=True, dilution_ratio=1):
        X_train = X_train[::dilution_ratio]
        y_train = y_train[::dilution_ratio]
        
        if method == 'backward':
            drop_list = [] 
        elif method == 'forward':
            drop_list = list(suspicion_list)
        else:
            raise KeyError
        
        
        model.fit(X_train.drop(columns=drop_list), y_train)
        old_score = np.sqrt(mean_squared_log_error(y_test, pd.Series(model.predict(X_test.drop(columns=drop_list))).map(lambda x: 0 if x < 0 else x)))
    
        for last_feature in suspicion_list:
            
            if method == 'backward':
                drop_list.append(last_feature)
            else:
                drop_list.remove(last_feature)
    
            model.fit(X_train.drop(columns=drop_list), y_train)
            new_score = np.sqrt(mean_squared_log_error(y_test, pd.Series(model.predict(X_test.drop(columns=drop_list))).map(lambda x: 0 if x < 0 else x)))
    
            if show_info:
                print(old_score, new_score, drop_list)
            
            
            if old_score - new_score < tol:
                if method == 'backward':
                    drop_list.remove(last_feature)   
                else:
                    drop_list.append(last_feature)
            else:
                old_score = new_score
    
        return old_score, new_score, drop_list
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    • 13
    • 14
    • 15
    • 16
    • 17
    • 18
    • 19
    • 20
    • 21
    • 22
    • 23
    • 24
    • 25
    • 26
    • 27
    • 28
    • 29
    • 30
    • 31
    • 32
    • 33
    • 34
    • 35
    • 36
    • 37
    • 38
    old_score, new_score, add_drop_list1 = step_wise_reg(xgb, X_train, X_test, y_train, y_test, dilution_ratio=1, method='backward', suspicion_list=show_feature_imp(xgb, X_train).index[::-1])
    
    • 1
    add_drop_list1
    
    • 1
    old_score, new_score, add_drop_list2 = step_wise_reg(xgb, X_train, X_test, y_train, y_test, dilution_ratio=1, method='forward', tol=0.0001,suspicion_list=show_feature_imp(xgb, X_train).index[1:])
    
    • 1
    add_drop_list2
    
    • 1

    总结:

    逐步回归十分好用,过一遍特征,误差就降下去了,推测特征被筛出去的原因主要是:

    1. 和其他特征的信息有一定相关性,即存在冗余信息。简单计算一下,就算全部是二分类特征,2的20次方也能表示出100万个不同的样本了,因此对于此次只有40万条数据的数据集来说,50+的特征绝对是存在冗余的,会干扰模型判断。

    2. 本身是个好特征,但没有经过恰当的分箱/编码处理,导致蕴含的信息不能传达给模型,反而起到了负提升。这就只有根据业务逻辑反复尝试了,一边尝试分箱,一边要回来调参。

    3. 本身是好特征,预处理也到位了,但模型进行逐步回归时没有同步调参,导致误差反而上升。尤其是前向逐步回归,从1个特征增加到50个特征,跨度如此之大,模型需要的参数必然是不同的。

    以上2、3两点就造成了误筛。所以使用逐步回归时要注意:被筛出的特征也不一定就是没用的特征,还得自己判断。

    5. 最终成果

    show_scores(y_test, xgb.predict(X_test))
    
    • 1
    {'R2': 0.884087856079633,
     'MAE': 5745.138324664967,
     'MSE': 79616854.10269143,
     'RMSLE': 0.23743323282056006}
    
    • 1
    • 2
    • 3
    • 4
    xgb_imp = show_feature_imp(xgb, X_train)
    xgb_imp
    
    • 1
    • 2
    imp
    CouplerSystem__GrouserTracks__HydraulicsFlow0.397497
    ProductSize0.249447
    PP0.079305
    Ride_Control0.042499
    fiManufacturerID0.037125
    YearMade0.031770
    PrimaryLower0.027179
    ProductGroupDesc_father0.015217
    SaleYear0.012649
    fiModelDesc0.012630
    Blade_Width0.010005
    Pushblock__Scarifier__TipControl0.009107
    Drive_System0.006723
    Ripper0.006685
    Enclosure0.005481
    Blade__Blade__Enclosure__Engine0.004808
    Pad_Type0.004427
    fiProductClassDesc0.004194
    ProductGroupDesc0.003965
    Travel_Controls0.003888
    Tire_Size0.003268
    Pattern_Changer0.003167
    ModelID0.002938
    UsageBand0.002606
    BBT0.002432
    Steering_Controls0.002027
    Turbocharged0.001827
    fiBaseModel0.001778
    Blade_Type0.001742
    Undercarriage_Pad_Width0.001496
    Stick_Length0.001339
    TUSTPG0.001334
    Thumb0.001193
    Coupler0.001181
    Grouser_Type0.001138
    MachineHoursCurrentMeter0.001100
    state0.001093
    auctioneerID_is_missing0.001045
    Enclosure_Type0.001007
    auctioneerID0.000875
    datasource0.000812

    在这里插入图片描述

  • 相关阅读:
    期货举例(期货举例说明盈利)
    基于SVD的图像压缩,PCA特征降维
    el-table中点击跳转到详情页的两种方法
    【LeetCode】1870. 准时到达的列车最小时速
    【wp】2023鹏城杯初赛 Web web1(反序列化漏洞)
    项目中用的网关Gateway及SpringCloud
    有望引领下轮牛市的5大加密主题
    考生都难哭了,用 Python 分析了一下,这里才是高考地狱级难度
    什么是MVCC?
    【java学习—十】异常(1)
  • 原文地址:https://blog.csdn.net/SpriteNym/article/details/126787559