• 定量数据和定性数据


    定量数据本质上是数值,应该是衡量某样东西的数量。
    定性数据本质上是类别,应该是描述某样东西的性质。

    全部的数据列如下,其中既有定性列也有定量列;

    import pandas as pd
    
    pd.options.display.max_columns = None
    pd.set_option('expand_frame_repr', False)
    salary_ranges = pd.read_csv('./data/Salary_Ranges_by_Job_Classification.csv')
    print(salary_ranges.head())
    #    SetID JobCode                Eff Date              SalEndDate SalarySetID SalPlan Grade  Step BiweeklyHighRate BiweeklyLowRate  UnionCode  ExtendedStep PayType
    # 0  COMMN     109  07/01/2009 12:00:00 AM  06/30/2010 12:00:00 AM       COMMN     SFM     0     1            $0.00           $0.00        330             0       C
    # 1  COMMN     110  07/01/2009 12:00:00 AM  06/30/2010 12:00:00 AM       COMMN     SFM     0     1           $15.00          $15.00        323             0       D
    # 2  COMMN     111  07/01/2009 12:00:00 AM  06/30/2010 12:00:00 AM       COMMN     SFM     0     1           $25.00          $25.00        323             0       D
    # 3  COMMN     112  07/01/2009 12:00:00 AM  06/30/2010 12:00:00 AM       COMMN     SFM     0     1           $50.00          $50.00        323             0       D
    # 4  COMMN     114  07/01/2009 12:00:00 AM  06/30/2010 12:00:00 AM       COMMN     SFM     0     1          $100.00         $100.00        323             0       M
    
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    • 13

    .info()可以了解数据的列信息以及每列非null的行数;

    print(salary_ranges.info())
    
    # 
    # RangeIndex: 1356 entries, 0 to 1355
    # Data columns (total 13 columns):
    #  #   Column              Non-Null Count  Dtype
    # ---  ------              --------------  -----
    #  0   SetID               1356 non-null   object
    #  1   Job Code            1356 non-null   object
    #  2   Eff Date            1356 non-null   object
    #  3   Sal End Date        1356 non-null   object
    #  4   Salary SetID        1356 non-null   object
    #  5   Sal Plan            1356 non-null   object
    #  6   Grade               1356 non-null   object
    #  7   Step                1356 non-null   int64
    #  8   Biweekly High Rate  1356 non-null   object
    #  9   Biweekly Low Rate   1356 non-null   object
    #  10  Union Code          1356 non-null   int64
    #  11  Extended Step       1356 non-null   int64
    #  12  Pay Type            1356 non-null   object
    # dtypes: int64(3), object(10)
    # memory usage: 137.8+ KB
    # None
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    • 13
    • 14
    • 15
    • 16
    • 17
    • 18
    • 19
    • 20
    • 21
    • 22
    • 23

    也可以使用以下方法更快速的计算缺失值的信息;

    print(salary_ranges.isnull().sum())
    # SetID                 0
    # Job Code              0
    # Eff Date              0
    # Sal End Date          0
    # Salary SetID          0
    # Sal Plan              0
    # Grade                 0
    # Step                  0
    # Biweekly High Rate    0
    # Biweekly Low Rate     0
    # Union Code            0
    # Extended Step         0
    # Pay Type              0
    # dtype: int64
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    • 13
    • 14
    • 15

    describe方法查看定量数据的描述性统计;Pandas认为,数据只有3个定量列:Step、Union Code和Extended Step(步进、工会代码和增强步进)。先不说步进和增强步进,很明显工会代码不是定量的。虽然这一列是数,但这些数不代表数量,只代表某个工会的代码

    print( salary_ranges.describe())
    
    #               Step   Union Code  Extended Step
    # count  1356.000000  1356.000000    1356.000000
    # mean      1.294985   392.676991       0.150442
    # std       1.045816   338.100562       1.006734
    # min       1.000000     1.000000       0.000000
    # 25%       1.000000    21.000000       0.000000
    # 50%       1.000000   351.000000       0.000000
    # 75%       1.000000   790.000000       0.000000
    # max       5.000000   990.000000      11.000000
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11

    最值得注意的特征是一个定量列Biweekly High Rate(双周最高工资)和一个定性列Grade(工作种类);

    salary_ranges = salary_ranges[['BiweeklyHighRate', 'Grade']]
    print(salary_ranges.head())
    
    #   BiweeklyHighRate Grade
    # 0            $0.00     0
    # 1           $15.00     0
    # 2           $25.00     0
    # 3           $50.00     0
    # 4          $100.00     0
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9

    查看两个字段的类型;

    print(salary_ranges.info())
    
    # 
    # RangeIndex: 1356 entries, 0 to 1355
    # Data columns (total 2 columns):
    #  #   Column            Non-Null Count  Dtype
    # ---  ------            --------------  -----
    #  0   BiweeklyHighRate  1356 non-null   object
    #  1   Grade             1356 non-null   object
    # dtypes: object(2)
    # memory usage: 21.3+ KB
    # None
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12

    我们清理一下数据,移除工资前面的美元符号,保证数据类型正确。当处理定量数据时,一般使用整数或浮点数作为类型(最好使用浮点数);定性数据则一般使用字符串或Unicode对象。

    salary_ranges['BiweeklyHighRate'] = salary_ranges['BiweeklyHighRate'].map(lambda value:value.replace('$',''))
    print(salary_ranges.head())
    
    #   BiweeklyHighRate Grade
    # 0             0.00     0
    # 1            15.00     0
    # 2            25.00     0
    # 3            50.00     0
    # 4           100.00     0
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9

    数据类型并没有变

    print(salary_ranges.info())
    # 
    # RangeIndex: 1356 entries, 0 to 1355
    # Data columns (total 2 columns):
    #  #   Column            Non-Null Count  Dtype 
    # ---  ------            --------------  ----- 
    #  0   BiweeklyHighRate  1356 non-null   object
    #  1   Grade             1356 non-null   object
    # dtypes: object(2)
    # memory usage: 21.3+ KB
    # None
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11

    将BiweeklyHighRate和Grade列中的数据分别转换为浮点数、字符串;

    salary_ranges['BiweeklyHighRate'] = salary_ranges['BiweeklyHighRate'].astype(float)
    salary_ranges['Grade'] = salary_ranges['Grade'].astype(str)
    print(salary_ranges.info())
    
    # 
    # RangeIndex: 1356 entries, 0 to 1355
    # Data columns (total 2 columns):
    #  #   Column            Non-Null Count  Dtype
    # ---  ------            --------------  -----
    #  0   BiweeklyHighRate  1356 non-null   float64
    #  1   Grade             1356 non-null   object
    # dtypes: float64(1), object(1)
    # memory usage: 21.3+ KB
    # None
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    • 13
    • 14
  • 相关阅读:
    【Kafka】分区与复制机制:解锁高性能与容错的密钥
    函数式编程:Lambda 表达式
    linux shell 编程之变量总结
    计算机视觉:基于Numpy的图像处理技术(二):图像主成分分析(PCA)
    一款免费无限制的AI生成视频工具(不输pika、runway)
    嵌入式硬件中常见的100种硬件选型方式
    Library ‘iconv2.4.0‘ not found 问题及解决方法
    文件上传漏洞利用与防御
    【字符串】重复的DNA序列
    JDBC编程
  • 原文地址:https://blog.csdn.net/hou478410969/article/details/134493494