• Kaggle Titanic: Decision Tree (Top 3%), a Line-by-Line Code Walkthrough for Beginners


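            Titanic Disaster is a classic entry-level Kaggle competition. I was studying decision trees, so I picked this worked example to learn from. I am a complete beginner, which is why almost every line is annotated.

            Original notebook: Titanic: Simple Decision Tree model score(Top 3%) | Kaggle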

    Contents

    1. Preprocessing and EDA

    1.1. Missing Values

    1.2. Age column

    1.3. Fare column

    1.4. Embarked column

    1.5. Cabin column

    2. Feature Extraction

    2.1. SibSp and Parch column

    2.2. Ticket column

    2.3. Name Column

    2.4. Woman or Child column

    2.5. Family Survival Rate column

    3. Modeling

    4. Conclusions

    5. References


    Titanic Disaster

    Improve your score to 82.78% (Top 3%)

    In this work I use some basic techniques to process the Titanic dataset in a simple way.

    1. Preprocessing and EDA

    Here I review the variables, impute missing values, look for patterns, and examine the relationships between columns.

    1.1. Missing Values

    First, read the dataset and merge Train and Test to get better results (every preprocessing step is then applied to both).

    # Libraries used
    import numpy as np  # fast numerical library, mainly used for array computation
    import pandas as pd  # tools for analysing structured data, built on NumPy
    import seaborn as sns  # visualization library, a higher-level wrapper around matplotlib
    import matplotlib.pyplot as plt  # visualization library
    # Machine learning library
    from sklearn.preprocessing import LabelEncoder, OneHotEncoder
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.preprocessing import StandardScaler
    from numpy.random import seed
    # Fixing the random seed yields the same random numbers on every run, so the results are reproducible
    seed(11111)

    # Reading the training and test sets
    train = pd.read_csv("../input/titanic/train.csv")
    test = pd.read_csv("../input/titanic/test.csv")
    # Use PassengerId as the index of each dataset, so they can be split apart again later
    train = train.set_index("PassengerId")
    test = test.set_index("PassengerId")
    # Stack the two DataFrames vertically (axis=0); sort=False keeps the column order as it is
    df = pd.concat([train, test], axis=0, sort=False)
    # Display df
    df
    PassengerId  Survived  Pclass  Name                                               Sex     Age   SibSp  Parch  Ticket              Fare      Cabin  Embarked
    1            0.0       3       Braund, Mr. Owen Harris                            male    22.0  1      0      A/5 21171           7.2500    NaN    S
    2            1.0       1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1      0      PC 17599            71.2833   C85    C
    3            1.0       3       Heikkinen, Miss. Laina                             female  26.0  0      0      STON/O2. 3101282    7.9250    NaN    S
    4            1.0       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.0  1      0      113803              53.1000   C123   S
    5            0.0       3       Allen, Mr. William Henry                           male    35.0  0      0      373450              8.0500    NaN    S
    ...          ...       ...     ...                                                ...     ...   ...    ...    ...                 ...       ...    ...
    1305         NaN       3       Spector, Mr. Woolf                                 male    NaN   0      0      A.5. 3236           8.0500    NaN    S
    1306         NaN       1       Oliva y Ocana, Dona. Fermina                       female  39.0  0      0      PC 17758            108.9000  C105   C
    1307         NaN       3       Saether, Mr. Simon Sivertsen                       male    38.5  0      0      SOTON/O.Q. 3101262  7.2500    NaN    S
    1308         NaN       3       Ware, Mr. Frederick                                male    NaN   0      0      359309              8.0500    NaN    S
    1309         NaN       3       Peter, Master. Michael J                           male    NaN   1      1      2668                22.3583   NaN    C

    1309 rows × 11 columns
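    The merge-then-split pattern used here can be sketched on a toy frame. The two small DataFrames and their values below are made up for illustration; only the `pd.concat` and `.loc[...index]` steps mirror the code in this post.

```python
import pandas as pd

# Toy stand-ins for train.csv and test.csv (values are made up for illustration).
toy_train = pd.DataFrame(
    {"PassengerId": [1, 2, 3], "Survived": [0, 1, 1], "Fare": [7.25, 71.28, None]}
).set_index("PassengerId")
toy_test = pd.DataFrame(
    {"PassengerId": [4, 5], "Fare": [8.05, 53.10]}
).set_index("PassengerId")

# Stack the two frames; columns missing from one side are filled with NaN.
toy_df = pd.concat([toy_train, toy_test], axis=0, sort=False)

# Any cleaning done on toy_df now applies to both parts at once.
toy_df["Fare"] = toy_df["Fare"].fillna(toy_df["Fare"].median())

# Recover the two parts by selecting rows with the original indexes.
train_part = toy_df.loc[toy_train.index]
test_part = toy_df.loc[toy_test.index]
print(train_part)
print(test_part)
```

    Because the test rows have no Survived value, that column is NaN for them after the concat, which is exactly what the table above shows for passengers 1305 to 1309.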

    As you can see, the Name, Sex, Ticket, Cabin and Embarked columns are objects. Before processing each column, we should check whether there are NAs or missing values.

    # .info() prints a concise summary of the DataFrame
    df.info()
    
    Int64Index: 1309 entries, 1 to 1309
    Data columns (total 11 columns):
     #   Column    Non-Null Count  Dtype  
    ---  ------    --------------  -----  
     0   Survived  891 non-null    float64
     1   Pclass    1309 non-null   int64  
     2   Name      1309 non-null   object 
     3   Sex       1309 non-null   object 
     4   Age       1046 non-null   float64
     5   SibSp     1309 non-null   int64  
     6   Parch     1309 non-null   int64  
     7   Ticket    1309 non-null   object 
     8   Fare      1308 non-null   float64
     9   Cabin     295 non-null    object 
     10  Embarked  1307 non-null   object 
    dtypes: float64(3), int64(3), object(5)
    memory usage: 122.7+ KB
    
    

    Four columns have missing values (Age, Fare, Cabin and Embarked), and the Survived column has NaNs because the Test dataset doesn't have that information.

    # .isna() flags missing values and .sum() adds them up, giving the number of missing values per column
    df.isna().sum()
    Survived     418
    Pclass         0
    Name           0
    Sex            0
    Age          263
    SibSp          0
    Parch          0
    Ticket         0
    Fare           1
    Cabin       1014
    Embarked       2
    dtype: int64
    

    To visualize the columns better, we transform the Sex and Embarked columns to numeric. The Sex column has only two categories (female = 0, male = 1); the Embarked column has three labels (S = 0, C = 1, Q = 2).

    # Sex
    change = {'female':0,'male':1}
    # .map() replaces each value using the given dict (a function also works)
    df.Sex = df.Sex.map(change)
    # Embarked
    change = {'S':0,'C':1,'Q':2}
    df.Embarked = df.Embarked.map(change)

    The following figure shows count plots of the numeric columns, split by Sex, to understand their behavior. In the last plot (Survived) you can see that we are working with an unbalanced dataset.

    columns = ['Pclass', 'Sex','Embarked','SibSp', 'Parch','Survived']
    # figsize sets the figure size: 16 wide and 14 high (in inches)
    plt.figure(figsize=(16, 14))
    # font_scale enlarges the fonts to 1.2 times the base size
    sns.set(font_scale= 1.2)
    # set_style picks the theme; seaborn ships 5 built-in themes
    sns.set_style('ticks')
    for i, feature in enumerate(columns):
        # subplot(number of rows, number of columns, index of this plot)
        plt.subplot(3, 3, i+1)
        # countplot draws one bar per category with its number of observations;
        # hue splits each bar by a second column; palette selects the colors
        sns.countplot(data=df, x=feature, hue='Sex', palette='BrBG')
    # despine() removes the top and right borders of the axes
    sns.despine()

    1.2. Age column

    The easy way to impute the missing values is with the mean or median, chosen on the basis of the column's correlation with other columns. Below you can see the correlation between variables. Pclass has a good correlation with Age, but I also added the Sex column to impute missing values.

    # df.corr() computes pairwise correlations; numeric_only=True skips text columns such as Name and Ticket
    corr_df = df.corr(numeric_only=True)
    fig, axs = plt.subplots(figsize=(8, 6))
    # set_title sets the title text together with its font size and weight
    sns.heatmap(corr_df).set_title("Correlation Map",fontdict= { 'fontsize': 20, 'fontweight':'bold'});

    # Group by class, sex and survival, then look at the median age of each group
    df.groupby(['Pclass','Sex','Survived'])['Age'].median()
    Pclass  Sex  Survived
    1       0    0.0         25.0
                 1.0         35.0
            1    0.0         45.5
                 1.0         36.0
    2       0    0.0         32.5
                 1.0         28.0
            1    0.0         30.5
                 1.0          3.0
    3       0    0.0         22.0
                 1.0         19.0
            1    0.0         25.0
                 1.0         25.0
    Name: Age, dtype: float64
    
    # Filling the missing values with the mean Age of each Pclass and Sex group
    df["Age"] = df["Age"].fillna(df.groupby(['Pclass','Sex'])['Age'].transform("mean"))

    fig, axs = plt.subplots(figsize=(10, 5))
    sns.histplot(data=df, x='Age').set_title("Age distribution",fontdict= { 'fontsize': 20, 'fontweight':'bold'});
    sns.despine()
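    How the group-wise fill works is easier to see on a few rows. The following is a minimal sketch on a made-up frame (the six rows and their ages are assumptions for illustration); it uses the same `groupby(...).transform("mean")` plus `fillna` combination as the code above.

```python
import numpy as np
import pandas as pd

# Made-up rows: two (Pclass, Sex) groups, each with one missing Age.
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Sex":    [0, 0, 0, 1, 1, 1],
    "Age":    [30.0, 40.0, np.nan, 20.0, np.nan, 24.0],
})

# transform("mean") returns one value per row: the mean Age of that row's group.
group_mean = toy.groupby(["Pclass", "Sex"])["Age"].transform("mean")

# fillna with a Series fills each gap from the row with the same index label.
toy["Age"] = toy["Age"].fillna(group_mean)
print(toy)
```

    The missing first-class value becomes 35.0 (the mean of 30 and 40) and the missing third-class value becomes 22.0 (the mean of 20 and 24); rows that already had an age are left untouched.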

    Let's bin the column so that it is easier to process.

    # pd.cut splits Age into 4 intervals of equal width
    auxage = pd.cut(df['Age'], 4)
    fig, axs = plt.subplots(figsize=(15, 5))
    sns.countplot(x=auxage, hue='Survived', data=df).set_title("Age Bins",fontdict= { 'fontsize': 20, 'fontweight':'bold'});
    sns.despine()

     

    # converting to categorical
    # LabelEncoder().fit_transform maps each distinct bin to an integer code
    df['Age'] = LabelEncoder().fit_transform(auxage)

    # Build a cross-tabulation of the Age bin against Survived
    pd.crosstab(df['Age'], df['Survived'])
    Survived  0.0  1.0
    Age
    0          97   82
    1         341  200
    2          94   55
    3          17    5
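    Equal-width binning can be tried out on a handful of values. This is a minimal sketch with made-up ages; it uses `pd.cut(..., labels=False)` to get the integer bin index directly, which is a swapped-in shortcut and not the `LabelEncoder` step used in this post.

```python
import pandas as pd

ages = pd.Series([1.0, 15.0, 22.0, 38.0, 55.0, 80.0])

# Equal-width binning: the range from 1 to 80 is split into 4 intervals of width 19.75.
intervals = pd.cut(ages, 4)

# labels=False returns the integer index of each bin instead of the interval itself.
codes = pd.cut(ages, 4, labels=False)

print(pd.DataFrame({"age": ages, "interval": intervals, "code": codes}))
```

    Note that equal-width bins can hold very different numbers of passengers, which is visible in the cross-tabulation above, where bin 1 contains most of the training rows.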

    1.3. Fare column

    Fare has only one missing value, and I imputed it with the median of its Pclass and Sex group (the mode would be an alternative).

    df["Fare"] = df["Fare"].fillna(df.groupby(['Pclass', 'Sex'])['Fare'].transform("median"))

    auxfare = pd.cut(df['Fare'],5)
    fig, axs = plt.subplots(figsize=(15, 5))
    sns.countplot(x=auxfare, hue='Survived', data=df).set_title("Fare Bins",fontdict= { 'fontsize': 20, 'fontweight':'bold'});
    sns.despine()

    df['Fare'] = LabelEncoder().fit_transform(auxfare)
    pd.crosstab(df['Fare'], df['Survived'])
    Survived  0.0  1.0
    Fare
    0         535  303
    1           8   25
    2           6   11
    3           0    3

    1.4. Embarked column

    Embarked has two missing values, which I filled with the median of the encoded column.

    print("median of embarked",df.Embarked.median())
    df.Embarked = df.Embarked.fillna(df.Embarked.median())
    median of embarked 0.0
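    The Embarked codes are category labels, so the median only makes sense here because it coincides with the most frequent port (code 0, S). For a categorical column the mode is the more direct choice; the following is a minimal sketch of that alternative on a made-up Series, not the code used in this post.

```python
import numpy as np
import pandas as pd

# Made-up port labels with two gaps.
embarked = pd.Series(["S", "C", "S", np.nan, "Q", "S", np.nan])

# mode() returns the most frequent value(s), ignoring NaN; take the first one.
most_common = embarked.mode()[0]
filled = embarked.fillna(most_common)

print(most_common)
print(filled.tolist())
```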
    
    

    1.5. Cabin column

    This column has many missing values, and that's the reason I dropped it. The missing rate is close to 80%; once a column is missing more than about 70% of its values, dropping it is worth considering.

    print("Percentage of missing values in the Cabin column :" ,round(df.Cabin.isna().sum()/ len(df.Cabin)*100,2))
    Percentage of missing values in the Cabin column : 77.46

    # .drop removes the given labels from rows or columns; inplace=True modifies df directly (the default, False, returns a copy)
    df.drop(['Cabin'], axis = 1, inplace = True)

    2. Feature Extraction

    In this part I have used the Name column to extract the Title of each person. In these names a title precedes the given name, and it hints at the passenger's age, occupation and social class.

    # .str.extract returns the first match of the regular expression: a run of letters followed by a dot
    df['Title'] = df.Name.str.extract(r'([A-Za-z]+)\.', expand = False)
    df.Title.value_counts()
    Mr          757
    Miss        260
    Mrs         197
    Master       61
    Rev           8
    Dr            8
    Col           4
    Major         2
    Ms            2
    Mlle          2
    Don           1
    Dona          1
    Jonkheer      1
    Lady          1
    Capt          1
    Sir           1
    Mme           1
    Countess      1
    Name: Title, dtype: int64
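    The regular expression can be checked on a few names taken from the table at the top of this post. A minimal sketch:

```python
import pandas as pd

names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Peter, Master. Michael J",
    "Oliva y Ocana, Dona. Fermina",
])

# Capture one run of letters that is immediately followed by a literal dot.
# Only the first match in each name is returned.
titles = names.str.extract(r"([A-Za-z]+)\.", expand=False)
print(titles.tolist())
```

    The pattern relies on the title being the first word in the name that ends with a dot, which holds for the "Last, Title. First" layout of this dataset.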
    

    The four most frequent titles are Mr, Miss, Mrs and Master.

    least_occuring = ['Rev','Dr','Major', 'Col', 'Capt','Jonkheer','Countess']
    df.Title = df.Title.replace(['Ms', 'Mlle','Mme','Lady'], 'Miss')
    df.Title = df.Title.replace(['Countess','Dona'], 'Mrs')
    df.Title = df.Title.replace(['Don','Sir'], 'Mr')
    df.Title = df.Title.replace(least_occuring,'Rare')
    # unique() returns the distinct values that remain
    df.Title.unique()

    array(['Mr', 'Mrs', 'Miss', 'Master', 'Rare'], dtype=object) 

    pd.crosstab(df['Title'], df['Survived'])
    Survived  0.0  1.0
    Title
    Master     17   23
    Miss       55  132
    Mr        437   82
    Mrs        26  100
    Rare       14    5

    df['Title'] = LabelEncoder().fit_transform(df['Title'])

    2.1. SibSp and Parch column

    Feature engineering: a new feature, family size = siblings/spouses + parents/children + the passenger.

    # I got the total size of each family by adding SibSp and Parch; the +1 is the passenger themselves.
    df['FamilySize'] = df['SibSp'] + df['Parch']+1
    df.drop(['SibSp','Parch'], axis = 1, inplace = True)

    fig, axs = plt.subplots(figsize=(15, 5))
    sns.countplot(x='FamilySize', hue='Survived', data=df).set_title("Raw Column",fontdict= { 'fontsize': 20, 'fontweight':'bold'});
    sns.despine()

    # Binning FamilySize column into ranges: alone, small, medium and large family
    df.loc[ df['FamilySize'] == 1, 'FamilySize'] = 0 # Alone
    df.loc[(df['FamilySize'] > 1) & (df['FamilySize'] <= 4), 'FamilySize'] = 1 # Small Family
    df.loc[(df['FamilySize'] > 4) & (df['FamilySize'] <= 6), 'FamilySize'] = 2 # Medium Family
    df.loc[df['FamilySize'] > 6, 'FamilySize'] = 3 # Large Family
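    The four `.loc` assignments amount to binning with custom edges. As a sketch of an equivalent formulation (an alternative, not the code used in this post), the same edges can be passed to `pd.cut`; the sample sizes below are made up.

```python
import pandas as pd

family_size = pd.Series([1, 2, 4, 5, 6, 7, 11])

# Right-closed bins: (0, 1] alone, (1, 4] small, (4, 6] medium, (6, inf) large.
binned = pd.cut(
    family_size,
    bins=[0, 1, 4, 6, float("inf")],
    labels=[0, 1, 2, 3],
).astype(int)

print(binned.tolist())
```

    Writing the edges in one list makes it harder to leave a gap or an overlap between neighbouring ranges by accident.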

    fig, axs = plt.subplots(figsize=(15, 5))
    sns.countplot(x='FamilySize', hue='Survived', data=df).set_title("Binned Variable",fontdict= { 'fontsize': 20, 'fontweight':'bold'});
    sns.despine()

    2.2. Ticket column

    With the following lambda function I took the number of each ticket (its last token) and changed the LINE tickets to zero.

    df['Ticket'] = df.Ticket.str.split().apply(lambda x : 0 if x[-1] == 'LINE' else x[-1])
    # .astype converts the column to int64
    df.Ticket = df.Ticket.values.astype('int64')
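    The ticket parsing can be sketched as a small named function. The helper name `ticket_number` is hypothetical; the sample strings are tickets from the table at the top of this post plus the special value LINE.

```python
import pandas as pd

tickets = pd.Series(["A/5 21171", "PC 17599", "113803", "LINE", "SOTON/O.Q. 3101262"])

def ticket_number(raw: str) -> int:
    """Return the last whitespace-separated token as an int, or 0 for 'LINE'."""
    last_token = raw.split()[-1]
    return 0 if last_token == "LINE" else int(last_token)

numbers = tickets.apply(ticket_number)
print(numbers.tolist())
```

    Passengers travelling together often share a ticket number, which is why the number is later used as one of the keys for detecting families.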

    2.3. Name Column

    To get a better model, I extracted the Last Name of each passenger.

    # Non-greedy match from the start of the name up to the first comma
    df['LastName'] = df.Name.str.extract('^(.+?),', expand = False)

    2.4. Woman or Child column

    Here I created a new column that tells whether the passenger is a woman or a child. I used the Title column because most children under 16 years have the Master title.

    # Title == 0 is 'Master' after label encoding; Sex == 0 is female
    df['WomChi'] = ((df.Title == 0) | (df.Sex == 0))

    2.5. Family Survival Rate column

    In this part I created three new columns: FTotalCount, FSurvivedCount and FSurvivalRate (the F stands for Family). Families are detected by grouping on LastName, Pclass and Ticket. FTotalCount uses a lambda function to count the WomChi members of each family and then, if the passenger is a woman or child, subtracts one so that the passenger is not counted. FSurvivedCount uses a lambda function to sum the survivors among the WomChi members and then, with the mask function, subtracts the passenger's own survival state if the passenger is a woman or child. Finally, FSurvivalRate divides FSurvivedCount by FTotalCount.

    family = df.groupby([df.LastName, df.Pclass, df.Ticket]).Survived
    df['FTotalCount'] = family.transform(lambda s: s[df.WomChi].fillna(0).count())
    # df.mask replaces values where the condition holds: for WomChi rows, use FTotalCount - 1
    df['FTotalCount'] = df.mask(df.WomChi, (df.FTotalCount - 1), axis=0)
    df['FSurvivedCount'] = family.transform(lambda s: s[df.WomChi].fillna(0).sum())
    df['FSurvivedCount'] = df.mask(df.WomChi, df.FSurvivedCount - df.Survived.fillna(0), axis=0)
    df['FSurvivalRate'] = (df.FSurvivedCount / df.FTotalCount.replace(0, np.nan))

    # Count the missing values in each column
    df.isna().sum()
    Survived           418
    Pclass               0
    Name                 0
    Sex                  0
    Age                  0
    Ticket               0
    Fare                 0
    Embarked             0
    Title                0
    FamilySize           0
    LastName             0
    WomChi               0
    FTotalCount        245
    FSurvivedCount     245
    FSurvivalRate     1014
    dtype: int64
    
    # filling the missing values: every remaining NaN becomes 0
    df['FSurvivalRate'] = df['FSurvivalRate'].fillna(0)
    df['FTotalCount'] = df['FTotalCount'].fillna(0)
    df['FSurvivedCount'] = df['FSurvivedCount'].fillna(0)

    # You can review the resulting Family Survival Rate with these families: Heikkinen, Braund, Rice, Andersson,
    # Fortune, Asplund, Spector, Ryerson, Allison, Carter, Vander Planke
    # Here the Dean family is shown
    df[df['LastName'] == "Dean"]
    PassengerId  Survived  Pclass  Name                                      Sex  Age  Ticket  Fare  Embarked  Title  FamilySize  LastName  WomChi  FTotalCount  FSurvivedCount  FSurvivalRate
    94           0.0       3       Dean, Mr. Bertram Frank                   1    1    2315    0     0.0       2      1           Dean      False   0.0          0.0             0.0
    789          1.0       3       Dean, Master. Bertram Vere                1    0    2315    0     0.0       0      1           Dean      True    2.0          0.0             0.0
    924          NaN       3       Dean, Mrs. Bertram (Eva Georgetta Light)  0    1    2315    0     0.0       3      1           Dean      True    2.0          1.0             0.5
    1246         NaN       3       Dean, Miss. Elizabeth Gladys "Millvina"   0    0    2315    0     0.0       1      1           Dean      True    2.0          1.0             0.5
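    The leave-one-out idea behind FSurvivalRate can be sketched on a toy frame. This is a simplified variant under stated assumptions, not the implementation used in this post: it groups by LastName only (the post also uses Pclass and Ticket), it computes a rate for every row including adult men, and the column name `rate` and the extra "Solo" passenger are made up.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "LastName": ["Dean", "Dean", "Dean", "Dean", "Solo"],
    "Survived": [0.0, 1.0, np.nan, np.nan, 1.0],   # NaN marks a test-set row
    "WomChi":   [False, True, True, True, True],
})

# Only women and children contribute; unknown outcomes count as 0,
# mirroring the fillna(0) in the code above.
member = toy["WomChi"].astype(int)
survived = toy["Survived"].fillna(0) * member

family = toy["LastName"]
total = member.groupby(family).transform("sum")
alive = survived.groupby(family).transform("sum")

# Leave-one-out: remove each passenger's own contribution.
others_total = total - member
others_alive = alive - survived

# Survival rate among the other women and children; 0 when there are none.
toy["rate"] = (others_alive / others_total.replace(0, np.nan)).fillna(0)
print(toy)
```

    For the three women and children of the toy Dean family the sketch yields 0.0, 0.5 and 0.5, the same values as the FSurvivalRate column in the table above. Subtracting the passenger's own outcome matters because otherwise the feature would directly contain the label it is supposed to help predict.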

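    3. Modeling

    Part three: set up the model and train it.

    # The column listing below contains PassengerId, so the index is copied back into a column here
    # (this step is missing from the post and is inferred from that listing)
    df['PassengerId'] = df.index
    # One-hot encode the Sex, Fare and Pclass columns
    df = pd.get_dummies(df, columns=['Sex','Fare','Pclass'])
    df.drop(['Name','LastName','WomChi','FTotalCount','FSurvivedCount','Embarked','Title'], axis = 1, inplace = True)
    # List the remaining columns
    df.columns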
    Index(['Survived', 'Age', 'Ticket', 'FamilySize', 'FSurvivalRate',
           'PassengerId', 'Sex_0', 'Sex_1', 'Fare_0', 'Fare_1', 'Fare_2', 'Fare_3',
           'Pclass_1', 'Pclass_2', 'Pclass_3'],
          dtype='object') 

    # I split df back into train and test
    # df.loc[] selects rows by index label, here the original PassengerId indexes
    train, test = df.loc[train.index], df.loc[test.index]
    X_train = train.drop(['PassengerId','Survived'], axis = 1)
    Y_train = train["Survived"]
    train_names = X_train.columns
    X_test = test.drop(['PassengerId','Survived'], axis = 1)

    corr_train = X_train.corr()
    fig, axs = plt.subplots(figsize=(10, 8))
    sns.heatmap(corr_train).set_title("Correlation Map",fontdict= { 'fontsize': 20, 'fontweight':'bold'});
    plt.show()

    # Scaler: standardize the features
    X_train = StandardScaler().fit_transform(X_train)
    X_test = StandardScaler().fit_transform(X_test)
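    The block above fits one scaler on the training features and a second, separate scaler on the test features. The more common pattern is to learn the statistics from the training rows only and reuse them for the test rows. The following is a sketch of that alternative on synthetic data; it is not what this post does, and switching to it would change the predictions, so the leaderboard score quoted here would no longer be guaranteed.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for X_train and X_test (not the Titanic features).
rng = np.random.default_rng(0)
toy_train = rng.normal(loc=10.0, scale=2.0, size=(100, 3))
toy_test = rng.normal(loc=10.0, scale=2.0, size=(20, 3))

scaler = StandardScaler()
# Learn mean and standard deviation from the training rows only...
train_scaled = scaler.fit_transform(toy_train)
# ...and apply those same statistics to the test rows.
test_scaled = scaler.transform(toy_test)

print(train_scaled.mean(axis=0).round(3))
print(test_scaled.mean(axis=0).round(3))
```

    With a shared scaler, a given raw value maps to the same scaled value in both sets, so the split thresholds a tree learns on the training data mean the same thing at prediction time.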
    # Train the decision tree
    decision_tree = DecisionTreeClassifier()
    decision_tree.fit(X_train, Y_train)
    Y_predDT = decision_tree.predict(X_test)
    print("Accuracy of the model: ",round(decision_tree.score(X_train, Y_train) * 100, 2))
    Accuracy of the model:  99.89
    

    # An accuracy of 99.89%? Really? Note that it is measured on the same data the tree was trained on.
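    A score computed on the training data says little about unseen passengers, because an unpruned decision tree can memorize its training set. Cross-validation is one common way to get a more honest estimate. The following is a minimal sketch on synthetic data generated with `make_classification` (a stand-in, not the Titanic features); the 5-fold setting is an arbitrary choice.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for X_train / Y_train (not the Titanic data).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

tree = DecisionTreeClassifier(random_state=0)

# Score on the data the tree was fitted to: an unpruned tree memorizes it.
train_acc = tree.fit(X, y).score(X, y)

# 5-fold cross-validation: each fold is scored on rows the tree never saw.
cv_scores = cross_val_score(tree, X, y, cv=5)

print("training accuracy:", round(train_acc, 4))
print("cross-validated accuracy:", cv_scores.mean().round(4))
```

    In this post the honest estimate is the Kaggle leaderboard score of 82.78%, which is far below the 99.89% measured on the training set.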

    # Plot the six most important features of the fitted tree
    importances = pd.DataFrame(decision_tree.feature_importances_, index = train_names)
    importances.sort_values(by = 0, inplace=True, ascending = False)
    importances = importances.iloc[0:6,:]
    plt.figure(figsize=(8, 5))
    sns.barplot(x=0, y=importances.index, data=importances,palette="deep").set_title(
        "Feature Importances", fontdict= { 'fontsize': 20, 'fontweight':'bold'});
    sns.despine()

    # Save the predictions as the submission file
    submit = pd.DataFrame({"PassengerId":test.PassengerId, 'Survived':Y_predDT.astype(int).ravel()})
    submit.to_csv("submissionJavier_Vallejos.csv",index = False)

    4. Conclusions

    This report is part of a Data Science bootcamp, and as you can see I managed to reach the Top 3%. In the first part I did an analysis to visualize each column and impute its missing values. After that I applied feature engineering to extract the title and the last name from the Name column; Family Size is the sum of SibSp and Parch plus one, which stands for the passenger. The Age and Fare columns were binned to get better results. The Family Survival Rate is based on two rules:

    1. All males die except boys in families where all females and boys live.
    2. All females live except those in families where all females and boys die.
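    The two rules can be written down as a tiny rule-based predictor. This is a sketch under stated assumptions: the column names `is_male`, `is_boy` and `family_womchi_fate`, and the labels "all_live", "all_die" and "unknown", are hypothetical and do not appear in the notebook.

```python
import pandas as pd

toy = pd.DataFrame({
    "is_male": [True, True, True, False, False],
    "is_boy":  [False, True, True, False, False],
    # Known fate of the women and boys in the passenger's family
    # (hypothetical labels for this sketch).
    "family_womchi_fate": ["all_live", "all_live", "unknown", "all_die", "unknown"],
})

def predict(row) -> int:
    if row["is_male"]:
        # Rule 1: males die, except boys whose family's women and boys all live.
        return int(row["is_boy"] and row["family_womchi_fate"] == "all_live")
    # Rule 2: females live, except when the family's women and boys all die.
    return int(row["family_womchi_fate"] != "all_die")

toy["pred"] = toy.apply(predict, axis=1)
print(toy["pred"].tolist())
```

    When nothing is known about the family, the rules fall back to predicting by sex alone.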


    With the rules above you can get a score near 81%, but if you add the ticket number and the other changes that I made, you can improve it to 82.78% on the Kaggle leaderboard.

    For the model part I used only a Decision Tree because it is the easy way to get this score.

    Finally, if you want to increase your score, then I suggest you read this work. As Chris Deotte said in his post, this is the first step to improve your score.

    5. References

  • Original post: https://blog.csdn.net/StrawBerryTreea/article/details/128145200