• sklearn Quick-Start Tutorial: Missing Values


    I. Setup and Data Overview

    import pandas as pd
    data = pd.read_csv(r"D:\本科\kaggle数据挖掘\titanic\train.csv", index_col = 0)
    data.head()

    data.info()

     

    The second line of the output shows that there are 891 rows in total. The table in the output shows how many non-null values each column has.
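
    Comparing each column's non-null count with the total row count is how missing values are spotted here. A minimal sketch of the same check done directly, using a small made-up table (the `toy` frame and its values are assumptions, not the Titanic data):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Titanic table; all values are made up for illustration.
toy = pd.DataFrame(
    {
        "Age": [22.0, np.nan, 26.0, 35.0],
        "Embarked": ["S", "C", None, "S"],
    },
    index=pd.Index([1, 2, 3, 4], name="PassengerId"),
)

# isnull() marks missing cells as True; sum() counts them per column.
missing_per_column = toy.isnull().sum()
print(missing_per_column)
```

    On the real table the same call would be `data.isnull().sum()`.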

    data.loc[:,"Age"]

     

    data.loc[:,"Age"].values

            

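    As you can see:

    1. Only with ".values" do you get the Age array itself (without it, the result is a Series that still carries the PassengerId index), so do not forget to add .values!

    2. Age is a one-dimensional array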
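
    The difference between the two results can be checked directly. A small sketch on made-up data (the `toy` frame is an assumption for illustration):

```python
import numpy as np
import pandas as pd

# Toy data with a PassengerId index, mimicking index_col=0 in the tutorial.
toy = pd.DataFrame(
    {"Age": [22.0, np.nan, 26.0]},
    index=pd.Index([1, 2, 3], name="PassengerId"),
)

age_series = toy.loc[:, "Age"]        # pandas Series, keeps the index
age_array = toy.loc[:, "Age"].values  # plain NumPy array, index dropped

print(type(age_series).__name__, list(age_series.index))
print(type(age_array).__name__, age_array.shape, age_array.ndim)
```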

      II. Handling Missing Values

    from sklearn.impute import SimpleImputer

    1.Age

    1) Reshape from 1-D to 2-D

    Age = data.loc[:,"Age"].values.reshape(-1, 1)  # turn the 1-D Age array above into a 2-D column
    Age[:20]  # the first 20 entries of Age
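
    The reshape is needed because scikit-learn estimators, including SimpleImputer, expect a 2-D input of shape (n_samples, n_features). A minimal sketch with made-up ages showing both the shape change and the rejection of 1-D input:

```python
import numpy as np
from sklearn.impute import SimpleImputer

age_1d = np.array([22.0, np.nan, 26.0, 35.0])  # made-up ages
age_2d = age_1d.reshape(-1, 1)  # -1 lets NumPy infer the number of rows

print(age_1d.shape)  # (4,)
print(age_2d.shape)  # (4, 1)

# The 1-D array is rejected with a ValueError; the 2-D column is accepted.
try:
    SimpleImputer().fit_transform(age_1d)
    rejected_1d = False
except ValueError:
    rejected_1d = True

filled = SimpleImputer().fit_transform(age_2d)
print(rejected_1d, filled.shape)
```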

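    2) Instantiate the imputer and fill the missing values

    Three ways to instantiate the imputer are shown below (method c, median filling, is the one finally used for this step):

    a. Fill with the mean

    imp_mean = SimpleImputer()  # no arguments: the default strategy fills with the mean
    imp_mean = imp_mean.fit_transform(Age)  # imp_mean now holds the filled array, not the imputer
    imp_mean[:20]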

     

    array([[22.        ],
           [38.        ],
           [26.        ],
           [35.        ],
           [35.        ],
           [29.69911765],
           [54.        ],
           [ 2.        ],
           [27.        ],
           [14.        ],
           [ 4.        ],
           [58.        ],
           [20.        ],
           [39.        ],
           [14.        ],
           [55.        ],
           [ 2.        ],
           [29.69911765],
           [31.        ],
           [29.69911765]])
    

    b. Fill with 0

    imp_0 = SimpleImputer(strategy = "constant", fill_value = 0)  # fill with the constant 0
    imp_0 = imp_0.fit_transform(Age)
    imp_0[:20]

     

    array([[22.],
           [38.],
           [26.],
           [35.],
           [35.],
           [ 0.],
           [54.],
           [ 2.],
           [27.],
           [14.],
           [ 4.],
           [58.],
           [20.],
           [39.],
           [14.],
           [55.],
           [ 2.],
           [ 0.],
           [31.],
           [ 0.]])

    c. Fill with the median

    imp_median = SimpleImputer(strategy = "median")  # fill with the median
    imp_median = imp_median.fit_transform(Age)
    imp_median[:20]

     

    array([[22.],
           [38.],
           [26.],
           [35.],
           [35.],
           [28.],
           [54.],
           [ 2.],
           [27.],
           [14.],
           [ 4.],
           [58.],
           [20.],
           [39.],
           [14.],
           [55.],
           [ 2.],
           [28.],
           [31.],
           [28.]])
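
    The three strategies above can be compared side by side on a tiny array where the expected fill values are easy to verify by hand. This is a sketch on made-up numbers; unlike the tutorial code, it keeps the imputer objects and the filled arrays under separate names so the fitted imputers stay available:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up column with one missing value; observed values are 1, 2 and 9.
toy_age = np.array([[1.0], [2.0], [np.nan], [9.0]])

imputers = {
    "mean": SimpleImputer(),  # default strategy is "mean"
    "constant": SimpleImputer(strategy="constant", fill_value=0),
    "median": SimpleImputer(strategy="median"),
}

# Fit each imputer on the column and keep the filled copies.
filled = {name: imp.fit_transform(toy_age) for name, imp in imputers.items()}

for name, arr in filled.items():
    print(name, arr.ravel())
```

    Here the mean of the observed values is 4.0 and the median is 2.0, so the missing entry becomes 4.0, 0.0 and 2.0 respectively. After fitting, the learned fill value is also stored on the imputer as `statistics_`.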

    Next, the median-filled array from step 2) replaces the original column.

    3) Replace the column that contains missing values

    data.loc[:,"Age"] = imp_median
    data.info()
    
    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 891 entries, 1 to 891
    Data columns (total 11 columns):
     #   Column    Non-Null Count  Dtype  
    ---  ------    --------------  -----  
     0   Survived  891 non-null    int64  
     1   Pclass    891 non-null    int64  
     2   Name      891 non-null    object 
     3   Sex       891 non-null    object 
     4   Age       891 non-null    float64
     5   SibSp     891 non-null    int64  
     6   Parch     891 non-null    int64  
     7   Ticket    891 non-null    object 
     8   Fare      891 non-null    float64
     9   Cabin     204 non-null    object 
     10  Embarked  889 non-null    object 
    dtypes: float64(2), int64(4), object(5)
    memory usage: 83.5+ KB

    Age now has 891 non-null values, i.e. no missing values remain.

    However, Embarked has only 889 non-null values, so 2 are still missing.
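
    The write-back step for Age assigns a 2-D (n, 1) array to a single column. A minimal sketch on made-up data that flattens the array with `.ravel()` first so its shape matches the column; whether the un-flattened assignment is accepted may depend on the pandas version, which is an assumption not verified here:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

toy = pd.DataFrame({"Age": [22.0, np.nan, 26.0, 35.0]})  # made-up ages

# Same reshape as in the tutorial: 1-D values to a 2-D column.
age_2d = toy.loc[:, "Age"].values.reshape(-1, 1)

median_imputer = SimpleImputer(strategy="median")
age_filled = median_imputer.fit_transform(age_2d)  # shape (4, 1)

# Flatten back to 1-D before writing into the column.
toy["Age"] = age_filled.ravel()

print(toy["Age"].isnull().sum())
print(toy)
```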

    2.Embarked 

    1) Reshape from 1-D to 2-D

    Embarked = data.loc[:, "Embarked"].values.reshape(-1,1)

    2) Instantiate the imputer and fill the missing values & 3) Replace the column that contains missing values

    imp_mode = SimpleImputer(strategy = "most_frequent")  # most_frequent == the mode
    data.loc[:,"Embarked"] = imp_mode.fit_transform(Embarked)
    data.info()

     

    
    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 891 entries, 1 to 891
    Data columns (total 11 columns):
     #   Column    Non-Null Count  Dtype  
    ---  ------    --------------  -----  
     0   Survived  891 non-null    int64  
     1   Pclass    891 non-null    int64  
     2   Name      891 non-null    object 
     3   Sex       891 non-null    object 
     4   Age       891 non-null    float64
     5   SibSp     891 non-null    int64  
     6   Parch     891 non-null    int64  
     7   Ticket    891 non-null    object 
     8   Fare      891 non-null    float64
     9   Cabin     204 non-null    object 
     10  Embarked  891 non-null    object 
    dtypes: float64(2), int64(4), object(5)
    memory usage: 83.5+ KB
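
    Mean and median are undefined for text columns such as Embarked, which is why the mode is used there. A small sketch on made-up port codes; the column is built with an explicit object dtype (an assumption made so that `.values` returns a plain NumPy array regardless of the pandas version):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up port codes with one missing entry; "S" is the most frequent value.
toy = pd.DataFrame(
    {"Embarked": pd.Series(["S", "C", np.nan, "S", "Q"], dtype=object)}
)

embarked_2d = toy.loc[:, "Embarked"].values.reshape(-1, 1)

mode_imputer = SimpleImputer(strategy="most_frequent")
embarked_filled = mode_imputer.fit_transform(embarked_2d)

# Flatten to 1-D and write the filled values back.
toy["Embarked"] = embarked_filled.ravel()
print(toy["Embarked"].tolist())
```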

    Missing values can also be filled with pandas methods

    import pandas as pd
    data = pd.read_csv(r"D:\本科\kaggle数据挖掘\titanic\train.csv", index_col = 0)  # reload the raw data
    data.head()

    Fill with the median

    data.loc[:,"Age"] = data.loc[:,"Age"].fillna(data.loc[:,"Age"].median())
    data.loc[:,"Age"]
    PassengerId
    1      22.0
    2      38.0
    3      26.0
    4      35.0
    5      35.0
           ... 
    887    27.0
    888    19.0
    889    28.0
    890    26.0
    891    32.0
    Name: Age, Length: 891, dtype: float64
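
    The same `fillna` pattern can be checked on a tiny made-up table. The sketch below also fills a text column with its mode, which the tutorial does only with SimpleImputer; doing it through pandas is an illustrative extension, not the author's code:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "Age": [22.0, np.nan, 26.0, 35.0],  # made-up ages
        "Embarked": ["S", None, "S", "C"],  # made-up port codes
    }
)

# Numeric column: fill with the median of the observed values.
toy["Age"] = toy["Age"].fillna(toy["Age"].median())

# Text column: mode() returns a Series, so take its first entry.
toy["Embarked"] = toy["Embarked"].fillna(toy["Embarked"].mode()[0])

print(toy)
```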

     

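    Drop rows with missing values

    data.dropna(axis = 0, inplace = True)
    data.head()

    axis=0 drops every row that contains a missing value; axis=1 drops every column that contains a missing value.
    inplace: True modifies the original dataset in place; False returns a modified copy and leaves the original untouched. (The default is False.)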
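
    The effect of `axis` and `inplace` is easiest to see on a table with exactly one missing cell. A minimal sketch on made-up data (the `toy` frame is an assumption for illustration):

```python
import numpy as np
import pandas as pd

# Made-up table with a single missing cell (Age in the second row).
toy = pd.DataFrame(
    {
        "Age": [22.0, np.nan, 26.0],
        "Fare": [7.25, 71.28, 7.92],
        "Cabin": ["C1", "C85", "E46"],
    }
)

rows_kept = toy.dropna(axis=0)  # drops the one row that has a NaN
cols_kept = toy.dropna(axis=1)  # drops the one column that has a NaN
shape_after_default = toy.shape  # inplace defaults to False: toy is unchanged

toy.dropna(axis=0, inplace=True)  # now toy itself is modified

print(rows_kept.shape, list(cols_kept.columns), shape_after_default, toy.shape)
```

    On the Titanic table, Cabin has only 204 non-null values, so dropping rows across all columns discards most passengers. Passing `subset=["Age"]` to `dropna` restricts the check to the listed columns.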

  • Original article: https://blog.csdn.net/ykrsgs/article/details/126259016