【Recap & Introduction】 The previous chapter was mainly a review of the basics, walking through common data-analysis operations and looking at the data from several angles. In this chapter we study the data-analysis workflow itself: data cleaning, feature processing, data restructuring and data visualization. These steps lay the groundwork for the final modelling and model-evaluation stages.
# Load the required libraries
import numpy as np
import pandas as pd
# Load the dataset train.csv
data = pd.read_csv("train.csv")
The data we receive is usually not clean: it contains missing values, outliers and so on, and needs some processing before we can continue with analysis or modelling. So the first step after getting the data is data cleaning. In this chapter we will work through missing values, duplicates, string handling and data conversion, turning the data into a form we can analyse or model.
Our data often has many missing values; for example, the Cabin column contains NaN. Do the other columns have missing values as well, and how should we handle them?
(1) Check the number of missing values in each feature
Method 1:
df.info() shows each column's name, non-null count and dtype.
# Code
data.info()
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 891 non-null int64
1 Survived 891 non-null int64
2 Pclass 891 non-null int64
3 Name 891 non-null object
4 Sex 891 non-null object
5 Age 714 non-null float64
6 SibSp 891 non-null int64
7 Parch 891 non-null int64
8 Ticket 891 non-null object
9 Fare 891 non-null float64
10 Cabin 204 non-null object
11 Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
The row index runs from 0 to 890, i.e. there are 891 rows. So in the non-null counts, a column with 891 non-null values has no missing entries, and anything less means the feature has missing values.
∴ In the loaded data, the Age, Cabin and Embarked columns contain nulls, i.e. have missing values; the other columns do not.
There are other ways to count missing values as well.
Method 2:
df.isnull() checks every value in the DataFrame and returns True where the value is missing.
Combined with sum(), it counts the nulls in each column.
data.isnull().head(5)
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | False | False | False | False | False | False | False | False | False | False | True | False |
| 1 | False | False | False | False | False | False | False | False | False | False | False | False |
| 2 | False | False | False | False | False | False | False | False | False | False | True | False |
| 3 | False | False | False | False | False | False | False | False | False | False | False | False |
| 4 | False | False | False | False | False | False | False | False | False | False | True | False |
data.isnull().sum()
PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 177
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 687
Embarked 2
dtype: int64
(2) View the data in the Age, Cabin and Embarked columns
# Code
data[["Age", "Cabin", "Embarked"]].head(5)
| | Age | Cabin | Embarked |
|---|---|---|---|
| 0 | 22.0 | NaN | S |
| 1 | 38.0 | C85 | C |
| 2 | 26.0 | NaN | S |
| 3 | 35.0 | C123 | S |
| 4 | 35.0 | NaN | S |
(1) What are the common approaches to handling missing values?
Answer: drop the rows (or columns) that contain missing values, or fill them in, e.g. with a fixed value, a statistic such as the mean, or a neighbouring value. Both approaches are demonstrated below.
【Note】The inplace parameter controls whether the original DataFrame is updated: True modifies it in place, False leaves it unchanged and returns a new object.
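A minimal sketch of the inplace behaviour, using the Age column from the data above (the variable names filled and tmp are just for illustration):
# inplace=False (the default): fillna returns a new DataFrame, data itself is untouched
filled = data.fillna(value={"Age": 0})
# inplace=True: the DataFrame is modified in place and the call returns None;
# working on a copy here so the original data stays intact
tmp = data.copy()
tmp.fillna(value={"Age": 0}, inplace=True)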
(2) Try handling the missing values in the Age column
# Code
data["Age"].head(10)
0 22.0
1 38.0
2 26.0
3 35.0
4 35.0
5 NaN
6 54.0
7 2.0
8 27.0
9 14.0
Name: Age, dtype: float64
# Drop the rows where Age is null; note that one row (index 5) has disappeared
data.dropna(subset = ["Age"])["Age"].head(10)
0 22.0
1 38.0
2 26.0
3 35.0
4 35.0
6 54.0
7 2.0
8 27.0
9 14.0
10 4.0
Name: Age, dtype: float64
# Fill with a fixed value; the nulls are replaced with the specified value 10
data.fillna(10)["Age"].head(10)
0 22.0
1 38.0
2 26.0
3 35.0
4 35.0
5 10.0
6 54.0
7 2.0
8 27.0
9 14.0
Name: Age, dtype: float64
# Fill with the mean
data["Age"].describe()
count 714.000000
mean 29.699118
std 14.526497
min 0.420000
25% 20.125000
50% 28.000000
75% 38.000000
max 80.000000
Name: Age, dtype: float64
data.fillna(value = {"Age":round(data["Age"].mean(),1)})["Age"].head(10)
0 22.0
1 38.0
2 26.0
3 35.0
4 35.0
5 29.7
6 54.0
7 2.0
8 27.0
9 14.0
Name: Age, dtype: float64
# Fill with the previous value (forward fill; newer pandas versions prefer data.ffill())
data.fillna(method = "ffill")["Age"].head(10)
0 22.0
1 38.0
2 26.0
3 35.0
4 35.0
5 35.0
6 54.0
7 2.0
8 27.0
9 14.0
Name: Age, dtype: float64
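Filling with the median is another common choice (not part of the original notebook); a minimal sketch:
# Fill Age with the column median (28.0 according to describe() above)
data.fillna(value={"Age": data["Age"].median()})["Age"].head(10)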
(3) Try different methods for handling the missing values of the whole table at once
# Code
data.head(10)
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
| 5 | 6 | 0 | 3 | Moran, Mr. James | male | NaN | 0 | 0 | 330877 | 8.4583 | NaN | Q |
| 6 | 7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | S |
| 7 | 8 | 0 | 3 | Palsson, Master. Gosta Leonard | male | 2.0 | 3 | 1 | 349909 | 21.0750 | NaN | S |
| 8 | 9 | 1 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | female | 27.0 | 0 | 2 | 347742 | 11.1333 | NaN | S |
| 9 | 10 | 1 | 2 | Nasser, Mrs. Nicholas (Adele Achem) | female | 14.0 | 1 | 0 | 237736 | 30.0708 | NaN | C |
# Drop every row that contains a null value
data.dropna().head(10)
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 6 | 7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | S |
| 10 | 11 | 1 | 3 | Sandstrom, Miss. Marguerite Rut | female | 4.0 | 1 | 1 | PP 9549 | 16.7000 | G6 | S |
| 11 | 12 | 1 | 1 | Bonnell, Miss. Elizabeth | female | 58.0 | 0 | 0 | 113783 | 26.5500 | C103 | S |
| 21 | 22 | 1 | 2 | Beesley, Mr. Lawrence | male | 34.0 | 0 | 0 | 248698 | 13.0000 | D56 | S |
| 23 | 24 | 1 | 1 | Sloper, Mr. William Thompson | male | 28.0 | 0 | 0 | 113788 | 35.5000 | A6 | S |
| 27 | 28 | 0 | 1 | Fortune, Mr. Charles Alexander | male | 19.0 | 3 | 2 | 19950 | 263.0000 | C23 C25 C27 | S |
| 52 | 53 | 1 | 1 | Harper, Mrs. Henry Sleeper (Myna Haxtun) | female | 49.0 | 1 | 0 | PC 17572 | 76.7292 | D33 | C |
| 54 | 55 | 0 | 1 | Ostby, Mr. Engelhart Cornelius | male | 65.0 | 0 | 1 | 113509 | 61.9792 | B30 | C |
# Drop rows with at least 2 nulls: thresh=11 keeps only rows with at least 11 non-null values out of the 12 columns
data.dropna(thresh = 11).head(10)
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
| 6 | 7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | S |
| 7 | 8 | 0 | 3 | Palsson, Master. Gosta Leonard | male | 2.0 | 3 | 1 | 349909 | 21.0750 | NaN | S |
| 8 | 9 | 1 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | female | 27.0 | 0 | 2 | 347742 | 11.1333 | NaN | S |
| 9 | 10 | 1 | 2 | Nasser, Mrs. Nicholas (Adele Achem) | female | 14.0 | 1 | 0 | 237736 | 30.0708 | NaN | C |
| 10 | 11 | 1 | 3 | Sandstrom, Miss. Marguerite Rut | female | 4.0 | 1 | 1 | PP 9549 | 16.7000 | G6 | S |
# Drop only the rows where Age is null
data.dropna(subset = ["Age"]).head(10)
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
| 6 | 7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | S |
| 7 | 8 | 0 | 3 | Palsson, Master. Gosta Leonard | male | 2.0 | 3 | 1 | 349909 | 21.0750 | NaN | S |
| 8 | 9 | 1 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | female | 27.0 | 0 | 2 | 347742 | 11.1333 | NaN | S |
| 9 | 10 | 1 | 2 | Nasser, Mrs. Nicholas (Adele Achem) | female | 14.0 | 1 | 0 | 237736 | 30.0708 | NaN | C |
| 10 | 11 | 1 | 3 | Sandstrom, Miss. Marguerite Rut | female | 4.0 | 1 | 1 | PP 9549 | 16.7000 | G6 | S |
# The columns with nulls have different dtypes, so specify a separate fill value for each
data.fillna(value = {"Age":0, "Cabin":"C85", "Embarked": "S"}).head(10)
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | C85 | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | C85 | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | C85 | S |
| 5 | 6 | 0 | 3 | Moran, Mr. James | male | 0.0 | 0 | 0 | 330877 | 8.4583 | C85 | Q |
| 6 | 7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | S |
| 7 | 8 | 0 | 3 | Palsson, Master. Gosta Leonard | male | 2.0 | 3 | 1 | 349909 | 21.0750 | C85 | S |
| 8 | 9 | 1 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | female | 27.0 | 0 | 2 | 347742 | 11.1333 | C85 | S |
| 9 | 10 | 1 | 2 | Nasser, Mrs. Nicholas (Adele Achem) | female | 14.0 | 1 | 0 | 237736 | 30.0708 | C85 | C |
data["Cabin"].describe()
count 204
unique 147
top B96 B98
freq 4
Name: Cabin, dtype: object
【VC's note】
For a non-numeric column, describe() reports the non-null count, the number of distinct values, the most frequent value and its frequency.
【Question 1】What parameters do dropna and fillna take, and how are they used?
【Reference】https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html
【Reference】https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html
【Question 2】For detecting missing values, which of np.nan, None and .isnull() works best, and why? If one of them fails to find the missing values, what is the reason?
# Answer
data[data["Age"]==np.nan]
Empty DataFrame
Columns: [PassengerId, Survived, Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin, Embarked]
Index: []
data[data["Age"]==None]
Empty DataFrame
Columns: [PassengerId, Survived, Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin, Embarked]
Index: []
data[data["Age"].isnull()].head(5)
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5 | 6 | 0 | 3 | Moran, Mr. James | male | NaN | 0 | 0 | 330877 | 8.4583 | NaN | Q |
| 17 | 18 | 1 | 2 | Williams, Mr. Charles Eugene | male | NaN | 0 | 0 | 244373 | 13.0000 | NaN | S |
| 19 | 20 | 1 | 3 | Masselmani, Mrs. Fatima | female | NaN | 0 | 0 | 2649 | 7.2250 | NaN | C |
| 26 | 27 | 0 | 3 | Emir, Mr. Farred Chehab | male | NaN | 0 | 0 | 2631 | 7.2250 | NaN | C |
| 28 | 29 | 1 | 3 | O'Dwyer, Miss. Ellen "Nellie" | female | NaN | 0 | 0 | 330959 | 7.8792 | NaN | Q |
Only the .isnull() method picks out the rows where Age is missing. Comparing with == np.nan finds nothing because NaN never compares equal to anything, not even to itself, and == None finds nothing because pandas stores missing values in float columns as NaN rather than None.
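A quick standalone check of why the equality comparisons fail (a minimal sketch):
print(np.nan == np.nan)   # False: NaN is never equal to anything, including itself
print(pd.isna(np.nan))    # True: pd.isna / .isnull() detect NaN correctly
print(pd.isna(None))      # True: they handle None as well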
For one reason or another, the data may also contain duplicate values. Do they exist here, and if so how should we handle them?
df.duplicated() finds and flags duplicate rows in the table; combined with sum() it gives the number of duplicates.
【Note】By default, duplicated() marks every occurrence after the first as a duplicate (keep='first').
# Code
data.duplicated().head(5)
0 False
1 False
2 False
3 False
4 False
dtype: bool
data["Age"].duplicated().head(5)
0 False
1 False
2 False
3 False
4 True
Name: Age, dtype: bool
# Count how many duplicate values there are
data["Age"].duplicated().sum()
802
(1) What are the ways of handling duplicate values?
Answer: drop the duplicated rows.
df.drop_duplicates() removes duplicate rows.
(2) Handle the duplicates in our data
# Drop duplicate rows; since no two rows are completely identical, nothing is removed
data.drop_duplicates().shape
(891, 12)
# Drop duplicates based on the specified columns
data.drop_duplicates(subset = ["Age"], keep = "first").shape
(89, 12)
data.drop_duplicates(subset = ["Age", "Cabin"]).shape
(284, 12)
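The keep parameter decides which occurrence survives; a minimal sketch (not part of the original notebook):
# keep="last" keeps the last occurrence of each duplicated Age instead of the first
data.drop_duplicates(subset=["Age"], keep="last").shape
# keep=False drops every row whose Age value occurs more than once
data.drop_duplicates(subset=["Age"], keep=False).shape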
# Code: save the cleaned data to CSV
data.to_csv("train_clear.csv")
Looking at the features, they fall roughly into two groups:
Numeric features: Survived, Pclass, Age, SibSp, Parch, Fare. Survived and Pclass are discrete numeric features; Age, SibSp, Parch and Fare are continuous numeric features.
Text features: Name, Sex, Cabin, Embarked, Ticket. Sex, Cabin, Embarked and Ticket are categorical text features.
Numeric features can usually be fed to a model directly, but continuous variables are sometimes discretised to make the model more stable and robust.
Text features generally have to be converted into numeric features before they can be used for modelling.
(1) What is binning?
Answer:
Before building a model, feature variables are often discretised. Discretisation helps with outliers and with values that have few samples, makes the model more stable and reduces the risk of overfitting. In particular, when building a scorecard with logistic regression, continuous variables must be discretised.
Discretisation is usually done by binning. Data binning (also called discrete binning or bucketing) is a data preprocessing technique that reduces the effect of minor observation errors and improves generalisation.
After binning, the value space of a numeric variable shrinks: there are fewer distinct values, and each value has more samples.
Drawback: the variance decreases but the bias increases, i.e. the data carries less precise information than before binning.
【VC's note】You can think of it as grouping the data according to some rule and then assigning discrete labels to the groups.
Common binning methods
Depending on whether a target variable is used, binning methods fall into two kinds:
Unsupervised binning
pd.cut(x, bins, right=True, labels=None, retbins=False, precision=3, include_lowest=False, duplicates='raise')
x: the data to bin, a one-dimensional array
bins: an integer, a sequence of scalars, or an IntervalIndex; defines how the data is grouped
- an integer n splits the values in x into n equal-width bins (each bin spans roughly the same range of values);
- a sequence of scalars gives the bin edges explicitly;
- an IntervalIndex must consist of non-overlapping intervals.
include_lowest: bool; whether the first interval should be left-inclusive. Default False, i.e. the left edge is open.
right: bool; whether the right edge is closed. Default True
- with right=True (the default), bins=[1, 2, 3, 4] means the intervals (1, 2], (2, 3], (3, 4]
- this parameter is ignored when bins is an IntervalIndex.
labels: array or bool, optional; the labels for the bins
- if an array, its length must equal the number of bins, e.g. bins=[1, 2, 3, 4] gives the 3 intervals (1, 2], (2, 3], (3, 4], so labels must also have length 3
- if False, only integer indicators are returned, i.e. which bin each value of x falls into
- ignored when bins is an IntervalIndex.
retbins: whether to also return the bin edges. Default False; when bins is an integer, setting retbins=True returns the computed edges of the resulting intervals.
precision: int, default 3; the precision used to store and display the bin labels.
duplicates: if the bin edges are not unique, either raise a ValueError or drop the non-unique edges.
pd.qcut(x, q, labels=None, retbins=False, precision=3, duplicates='raise')
x: the data to bin, a one-dimensional array or Series
q: an integer number of quantiles, or an array of quantiles
- an integer, e.g. 10 for deciles, 4 for quartiles
- an array of quantiles, e.g. [0, 0.25, 0.5, 0.75, 1] for quartiles
labels: array or bool, default None; the label for each bin
- if an array, its length must equal the number of bins, e.g. quartile binning needs four labels
- if False, only integer indicators are returned, i.e. which bin each value falls into
retbins: bool, optional; whether to also return the bin edges (since the bins follow quantiles, this is useful for seeing the actual quantile values when you do not know them in advance)
precision: int, default 3; the precision used to store and display the bin labels.
duplicates: if the bin edges are not unique, either raise a ValueError or drop the non-unique edges.
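A minimal standalone sketch (not part of the original notebook) contrasting pd.cut (equal-width) and pd.qcut (equal-frequency) on a small Series, with retbins=True to expose the bin edges:
s = pd.Series([1, 1, 2, 2, 3, 3, 4, 5, 20, 30])
# Equal-width: 3 bins of equal width over the full range [1, 30],
# so the two large values end up alone in the upper bins
wide, wide_edges = pd.cut(s, bins=3, labels=[1, 2, 3], retbins=True)
# Equal-frequency: 3 bins each holding roughly the same number of samples,
# so the bin edges follow the quantiles of the data
freq, freq_edges = pd.qcut(s, q=3, labels=[1, 2, 3], retbins=True)
print(wide_edges)  # roughly [0.971, 10.667, 20.333, 30.0]
print(freq_edges)  # roughly [1.0, 2.0, 4.0, 30.0]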
Supervised binning
Binning a feature can be broken down into two steps:
【Reference】
https://blog.csdn.net/qq_22172133/article/details/118883524
https://zhuanlan.zhihu.com/p/486766553
(2) Bin the continuous variable Age into 5 equal-width age bands and label them with the categorical values 1-5
# Equal-width binning; the data gains a new AgeBin column holding the labels obtained from binning Age
# All bins have the same width, so it is enough to pass a single integer to the bins parameter
data["AgeBin"] = pd.cut(data["Age"], bins = 5, labels = [1,2,3,4,5])
data.to_csv("train_agebin_1.csv")
data.head(5)
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | AgeBin |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S | 2 |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | 3 |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S | 2 |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S | 3 |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S | 3 |
(3) Split the continuous variable Age into the five age bands [0,5), [5,15), [15,30), [30,50), [50,80) and label them with the categorical values 1-5
# Still binning by width, but the bins have different widths, so pass the sequence of edges to the bins parameter
# Whether the interval edges are open or closed is controlled by include_lowest and right; by default the left edge is open (include_lowest=False) and the right edge is closed (right=True)
data["AgeBin"] = pd.cut(data["Age"], bins = [0,5,15,30,50,80], labels = [1,2,3,4,5], include_lowest = True, right = False)
data.to_csv("train_agebin_2.csv")
data.head(5)
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | AgeBin |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S | 3 |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | 4 |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S | 3 |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S | 4 |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S | 4 |
(4) Split the continuous variable Age at the 10%, 30%, 50%, 70% and 90% quantiles and label the bands with the categorical values 1-5
# Equal-frequency binning: pass the array of quantiles to q
data["AgeBin"] = pd.qcut(data["Age"], q = [0,0.1,0.3,0.5,0.7,0.9], labels = [1,2,3,4,5])
data.to_csv("train_agebin_3.csv")
data.head(5)
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | AgeBin |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S | 2 |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | 5 |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S | 3 |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S | 4 |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S | 4 |
(5) Save each of the datasets obtained above in CSV format (this was done with to_csv after each binning step above)
【Reference】https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.cut.html
【Reference】https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.qcut.html
(1) View the text variables and their categories
From info() below, the text (object) columns are: Name, Sex, Ticket, Cabin and Embarked.
data.info()
RangeIndex: 891 entries, 0 to 890
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 891 non-null int64
1 Survived 891 non-null int64
2 Pclass 891 non-null int64
3 Name 891 non-null object
4 Sex 891 non-null object
5 Age 714 non-null float64
6 SibSp 891 non-null int64
7 Parch 891 non-null int64
8 Ticket 891 non-null object
9 Fare 891 non-null float64
10 Cabin 204 non-null object
11 Embarked 889 non-null object
12 AgeBin 650 non-null category
dtypes: category(1), float64(2), int64(5), object(5)
memory usage: 84.7+ KB
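A programmatic alternative to reading the dtypes off info() (a minimal sketch):
# Object-dtype (text) columns
text_cols = data.select_dtypes(include="object").columns.tolist()
print(text_cols)  # ['Name', 'Sex', 'Ticket', 'Cabin', 'Embarked']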
【Summary】Ways of inspecting the values
Method 1: df.describe() shows, for a text column, the non-null count, the number of distinct values, the most frequent value and its frequency, but the information is limited.
Method 2: Series.value_counts(normalize=False, sort=True, ascending=False, bins=None, dropna=True) counts how many distinct values a column has and how often each one occurs.
Method 3: Series.unique() returns the distinct values of a one-dimensional array or Series (as an array, in order of appearance) but gives no frequencies.
In addition, Series.nunique() returns the number of distinct values.
# Method 1
data["Sex"].describe()
count 891
unique 2
top male
freq 577
Name: Sex, dtype: object
# Method 2
data["Sex"].value_counts()
male 577
female 314
Name: Sex, dtype: int64
# Method 3
data["Sex"].unique()
array(['male', 'female'], dtype=object)
data["Sex"].nunique()
2
(2) Represent the text variables Sex, Cabin and Embarked with the numeric values 1, 2, 3, 4, 5
【Summary】Ways of replacing text
Method 1: replace. Python's str.replace(old, new[, max]) replaces old with new in a string, at most max times if max is given; the code below uses the analogous pandas Series.replace, which maps a list of values to a list of replacements.
Method 2: map. Python's map(function, iterable, ...) applies a function to every element of a sequence and returns the results; the pandas Series.map used below does the same with a function or a dict of mappings.
Method 3: sklearn.preprocessing.LabelEncoder encodes n categories as the integers 0 to n-1 (inclusive).
# Method 1: replace
data["SexNum"] = data["Sex"].replace(["male","female"], [1,2])
data.head(5)
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | AgeBin | SexNum |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S | 2 | 1 |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | 5 | 2 |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S | 3 | 2 |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S | 4 | 2 |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S | 4 | 1 |
# Method 2: map
data['SexNum'] = data['Sex'].map({'male': 1, 'female': 2})
data.head(5)
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | AgeBin | SexNum |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S | 2 | 1 |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | 5 | 2 |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S | 3 | 2 |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S | 4 | 2 |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S | 4 | 1 |
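Method 3 (sklearn's LabelEncoder) is not demonstrated above; a minimal sketch, assuming scikit-learn is installed (the column name SexLabel is just for illustration):
from sklearn.preprocessing import LabelEncoder
# LabelEncoder assigns the integers 0..n-1 in sorted order of the classes,
# so here female -> 0 and male -> 1 (unlike the 1/2 coding used above)
le = LabelEncoder()
data["SexLabel"] = le.fit_transform(data["Sex"])
print(le.classes_)  # ['female' 'male']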
(3) Represent the text variables Sex, Cabin and Embarked with one-hot encoding
pd.get_dummies(data, prefix=None, prefix_sep="_", dummy_na=False, columns=None, sparse=False, drop_first=False, dtype=None) performs one-hot encoding.
Its parameters are described in the pandas documentation; the loop below one-hot encodes Embarked (see the extension to Sex and Cabin after the output).
for feat in ["Embarked"]:
# x = pd.get_dummies(df["Age"] // 6)
# x = pd.get_dummies(pd.cut(df['Age'],5))
x = pd.get_dummies(data[feat], prefix=feat)
data = pd.concat([data, x], axis=1)
#df[feat] = pd.get_dummies(df[feat], prefix=feat)
data.head(5)
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Embarked_C | Embarked_Q | Embarked_S |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S | 0 | 0 | 1 |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | 1 | 0 | 0 |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S | 0 | 0 | 1 |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S | 0 | 0 | 1 |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S | 0 | 0 | 1 |
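The loop above only encodes Embarked; the same pattern extends to Sex and Cabin as the prompt asks (a sketch; note that Cabin's 147 distinct values produce a very wide block of dummy columns, and its NaN rows get all-zero dummies because dummy_na defaults to False):
for feat in ["Sex", "Cabin"]:
    x = pd.get_dummies(data[feat], prefix=feat)
    data = pd.concat([data, x], axis=1)
data.head(5)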
str.extract(pat, flags=0, expand=True) extracts the capture groups of a regular expression into DataFrame columns.
Its parameters are described in the pandas documentation.
【Supplement】Regular expressions
A regular expression describes a string-matching pattern. It can be used to test whether a string contains a certain substring, to replace the matching substrings, or to pull out the substrings that satisfy some condition.
See https://www.runoob.com/regexp/regexp-syntax.html for the syntax.
# Code
data['Title'] = data.Name.str.extract(r'([A-Za-z]+)\.', expand=False)
data.head(5)
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Embarked_C | Embarked_Q | Embarked_S | Title |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S | 0 | 0 | 1 | Mr |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | 1 | 0 | 0 | Mrs |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S | 0 | 0 | 1 | Miss |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S | 0 | 0 | 1 | Mrs |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S | 0 | 0 | 1 | Mr |
# Save the final cleaned data
data.to_csv("train_day2.csv")