- import pandas as pd
- data = pd.read_csv(r"D:\本科\kaggle数据挖掘\titanic\train.csv", index_col = 0)
-
- data.head()
data.info()
The second line of the output shows that there are 891 rows in total. The table in the output shows how many non-null values each column has.
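As a complement to reading the counts off `data.info()`, the number of missing values per column can also be computed directly. The sketch below uses a small made-up table (`toy` and its values are assumptions for illustration, not the Titanic data):

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the real table, with one gap in each column.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0],
    "Embarked": ["S", None, "C"],
})

total_rows = len(toy)            # total number of rows
missing = toy.isna().sum()       # missing values per column
non_null = toy.notna().sum()     # what info() reports as "Non-Null Count"

print(total_rows)
print(missing)
print(non_null)
```

On the real data the same calls would be `data.isna().sum()` and `data.notna().sum()`.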
data.loc[:,"Age"]
data.loc[:,"Age"].values
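As shown:
1. Only with ".values" appended do you get the Age array (without it the result is a Series that keeps the PassengerId index), so never forget to add .values!
2. Age is a one-dimensional array.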
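The difference between the two expressions above can be checked on a tiny example. This is only a sketch; the table `toy` and its values are made up for illustration:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Titanic table, indexed by PassengerId.
toy = pd.DataFrame(
    {"Age": [22.0, np.nan, 26.0]},
    index=pd.Index([1, 2, 3], name="PassengerId"),
)

age_series = toy.loc[:, "Age"]          # pandas Series, keeps the PassengerId index
age_array = toy.loc[:, "Age"].values    # plain NumPy array, index dropped

print(type(age_series).__name__, list(age_series.index))
print(type(age_array).__name__, age_array.ndim, age_array.shape)
```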
from sklearn.impute import SimpleImputer
- Age = data.loc[:,"Age"].values.reshape(-1, 1) # reshape the 1-D Age array above into a 2-D array (SimpleImputer expects 2-D input)
- Age[:20] # the first 20 entries of Age
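Why the reshape is needed can be seen in a minimal sketch with made-up values (the names `ages_1d`, `ages_2d` and `rejected_1d` are illustrative assumptions): scikit-learn estimators expect a 2-D array of shape (n_samples, n_features), and a 1-D array is rejected.

```python
import numpy as np
from sklearn.impute import SimpleImputer

ages_1d = np.array([22.0, np.nan, 26.0])  # made-up values, not the real data
ages_2d = ages_1d.reshape(-1, 1)          # -1 lets NumPy infer the row count

print(ages_1d.shape)  # (3,)
print(ages_2d.shape)  # (3, 1)

# The 1-D array is rejected with a ValueError; the 2-D column is accepted.
try:
    SimpleImputer().fit_transform(ages_1d)
    rejected_1d = False
except ValueError:
    rejected_1d = True

filled = SimpleImputer().fit_transform(ages_2d)
print(rejected_1d, filled.shape)
```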
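Below are 3 ways to instantiate the imputer (the third one, c, is what is finally used for this step):
a. Fill with the mean
- imp_mean = SimpleImputer() # empty parentheses: the default strategy fills with the mean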
- imp_mean = imp_mean.fit_transform(Age)
- imp_mean[:20]
array([[22. ], [38. ], [26. ], [35. ], [35. ], [29.69911765], [54. ], [ 2. ], [27. ], [14. ], [ 4. ], [58. ], [20. ], [39. ], [14. ], [55. ], [ 2. ], [29.69911765], [31. ], [29.69911765]])
b. Fill with 0
- imp_0 = SimpleImputer(strategy = "constant", fill_value = 0) # fill with 0
- imp_0 = imp_0.fit_transform(Age)
- imp_0[:20]
array([[22.], [38.], [26.], [35.], [35.], [ 0.], [54.], [ 2.], [27.], [14.], [ 4.], [58.], [20.], [39.], [14.], [55.], [ 2.], [ 0.], [31.], [ 0.]])
c. Fill with the median
- imp_median = SimpleImputer(strategy = "median") # fill with the median
- imp_median = imp_median.fit_transform(Age)
- imp_median[:20]
array([[22.], [38.], [26.], [35.], [35.], [28.], [54.], [ 2.], [27.], [14.], [ 4.], [58.], [20.], [39.], [14.], [55.], [ 2.], [28.], [31.], [28.]])
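The three strategies above can be compared side by side on a small made-up column. This is a sketch only; `ages`, `strategies` and `filled` are illustrative names, and unlike the cells above it keeps each imputer object separate from its output array:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up ages with one missing entry; not taken from the Titanic data.
ages = np.array([[10.0], [20.0], [np.nan], [60.0]])

strategies = {
    "mean": SimpleImputer(strategy="mean"),
    "zero": SimpleImputer(strategy="constant", fill_value=0),
    "median": SimpleImputer(strategy="median"),
}

# Fit each imputer on the column and keep the filled arrays by name.
filled = {name: imp.fit_transform(ages) for name, imp in strategies.items()}

for name, arr in filled.items():
    print(name, arr.ravel())
```

Here the missing entry becomes 30 under the mean, 0 under the constant strategy, and 20 under the median, which shows how much the choice can matter when the column contains large values.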
Next, use the data filled in c (the median) to replace the Age column of the original table:
- data.loc[:,"Age"] = imp_median
- data.info()
Int64Index: 891 entries, 1 to 891
Data columns (total 11 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   Survived  891 non-null    int64
 1   Pclass    891 non-null    int64
 2   Name      891 non-null    object
 3   Sex       891 non-null    object
 4   Age       891 non-null    float64
 5   SibSp     891 non-null    int64
 6   Parch     891 non-null    int64
 7   Ticket    891 non-null    object
 8   Fare      891 non-null    float64
 9   Cabin     204 non-null    object
 10  Embarked  889 non-null    object
dtypes: float64(2), int64(4), object(5)
memory usage: 83.5+ KB
Age now has 891 non-null values, i.e. it has no missing values left.
However, Embarked has only 889 non-null values, so 2 are still missing.
Embarked = data.loc[:, "Embarked"].values.reshape(-1,1)
- imp_mode = SimpleImputer(strategy = "most_frequent") # most_frequent == the mode
- data.loc[:,"Embarked"] = imp_mode.fit_transform(Embarked)
- data.info()
Int64Index: 891 entries, 1 to 891
Data columns (total 11 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   Survived  891 non-null    int64
 1   Pclass    891 non-null    int64
 2   Name      891 non-null    object
 3   Sex       891 non-null    object
 4   Age       891 non-null    float64
 5   SibSp     891 non-null    int64
 6   Parch     891 non-null    int64
 7   Ticket    891 non-null    object
 8   Fare      891 non-null    float64
 9   Cabin     204 non-null    object
 10  Embarked  891 non-null    object
dtypes: float64(2), int64(4), object(5)
memory usage: 83.5+ KB
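The mode-based fill used for Embarked also works on text columns, which a mean or median cannot handle. A minimal sketch on made-up port codes follows (`toy` and `col` are illustrative names; the `to_numpy(dtype=object)` and `.ravel()` calls are choices made here to keep the shapes and dtypes explicit, not something the steps above require):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up port codes with one missing entry; "S" is the most frequent value.
toy = pd.DataFrame({"Embarked": ["S", "C", np.nan, "S", "Q"]})

# 2-D object array, as SimpleImputer expects (n_samples, n_features).
col = toy["Embarked"].to_numpy(dtype=object).reshape(-1, 1)

imp_mode = SimpleImputer(strategy="most_frequent")
# ravel() flattens the (5, 1) result back to 1-D before writing the column.
toy["Embarked"] = imp_mode.fit_transform(col).ravel()

print(toy["Embarked"].tolist())
print(imp_mode.statistics_)  # the value learned for the column
```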
- import pandas as pd
- data = pd.read_csv(r"D:\本科\kaggle数据挖掘\titanic\train.csv", index_col = 0)
- data.head()
- data.loc[:,"Age"] = data.loc[:,"Age"].fillna(data.loc[:,"Age"].median())
- data.loc[:,"Age"]
PassengerId
1      22.0
2      38.0
3      26.0
4      35.0
5      35.0
       ...
887    27.0
888    19.0
889    28.0
890    26.0
891    32.0
Name: Age, Length: 891, dtype: float64
- data.dropna(axis = 0, inplace = True)
- data.head()
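axis: axis=0 drops every row that contains a missing value; axis=1 drops every column that contains a missing value.
inplace: =True modifies the original dataset directly; =False creates a copy and applies the change to the copy, leaving the original untouched. (The default is False.)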
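The effect of `axis` and `inplace` can be verified on a small made-up table. This is a sketch only; `toy` and the variable names are assumptions for illustration:

```python
import numpy as np
import pandas as pd

# Made-up table: only the Age column has a missing value (in the second row).
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0],
    "Fare": [7.25, 71.28, 7.92],
})

rows_dropped = toy.dropna(axis=0)   # removes the row that has a NaN
cols_dropped = toy.dropna(axis=1)   # removes the column that has a NaN

print(rows_dropped.shape)  # (2, 2)
print(cols_dropped.shape)  # (3, 1)
print(toy.shape)           # (3, 2): inplace defaults to False, original unchanged

# With inplace=True the original is modified and the call returns None.
result = toy.dropna(axis=0, inplace=True)
print(result)      # None
print(toy.shape)   # (2, 2)
```

Note that on the Titanic table, where Cabin has only 204 non-null values, dropping rows with `axis = 0` removes most of the data, so it is worth checking the remaining row count afterwards.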