🔎Hi everyone, I'm Sonhhxg_柒. I hope this post helps you; please point out anything lacking, and let's learn together!🔎
📝Personal homepage: Sonhhxg_柒's blog on CSDN 📃
Exploratory Data Analysis
First, we'll import the NumPy and Pandas libraries and set a seed for reproducibility. We'll also download the dataset we'll be using to disk.
```python
import numpy as np
import pandas as pd

# Set seed for reproducibility
np.random.seed(seed=1234)
```
We'll use the Titanic dataset, which contains data about the people who boarded the RMS Titanic in 1912 and whether they survived the voyage. It's a very common and rich dataset, making it perfect for exploratory data analysis with Pandas.

Let's load the data from the CSV file into a Pandas dataframe. `header=0` indicates that the first row (0th index) is the header row, which contains the name of each column in the dataset.
```python
# Read from CSV to Pandas DataFrame
url = "https://raw.githubusercontent.com/GokuMohandas/Made-With-ML/main/datasets/titanic.csv"
df = pd.read_csv(url, header=0)

# First few items
df.head(3)
```
| | pclass | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked | survived |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Allen, Miss. Elisabeth Walton | female | 29.0000 | 0 | 0 | 24160 | 211.3375 | B5 | S | 1 |
| 1 | 1 | Allison, Master. Hudson Trevor | male | 0.9167 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | 1 |
| 2 | 1 | Allison, Miss. Helen Loraine | female | 2.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | 0 |
These are the different features:

- `pclass`: travel class
- `name`: full name of the passenger
- `sex`: gender
- `age`: numerical age
- `sibsp`: number of siblings/spouses aboard
- `parch`: number of parents/children aboard
- `ticket`: ticket number
- `fare`: ticket fare
- `cabin`: cabin location
- `embarked`: port the passenger embarked from
- `survived`: survival indicator (0 - died, 1 - survived)

Now that we've loaded our data, we're ready to start exploring it for interesting information.
```python
import matplotlib.pyplot as plt
```
We can use `.describe()` to extract some standard statistics about our numerical features.
```python
# Describe features
df.describe()
```
| | pclass | age | sibsp | parch | fare | survived |
|---|---|---|---|---|---|---|
| count | 1309.000000 | 1046.000000 | 1309.000000 | 1309.000000 | 1308.000000 | 1309.000000 |
| mean | 2.294882 | 29.881135 | 0.498854 | 0.385027 | 33.295479 | 0.381971 |
| std | 0.837836 | 14.413500 | 1.041658 | 0.865560 | 51.758668 | 0.486055 |
| min | 1.000000 | 0.166700 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 2.000000 | 21.000000 | 0.000000 | 0.000000 | 7.895800 | 0.000000 |
| 50% | 3.000000 | 28.000000 | 0.000000 | 0.000000 | 14.454200 | 0.000000 |
| 75% | 3.000000 | 39.000000 | 1.000000 | 0.000000 | 31.275000 | 1.000000 |
| max | 3.000000 | 80.000000 | 8.000000 | 9.000000 | 512.329200 | 1.000000 |
```python
# Correlation matrix (numeric_only avoids errors on text columns in newer pandas)
plt.matshow(df.corr(numeric_only=True))
continuous_features = df.describe().columns
plt.xticks(range(len(continuous_features)), continuous_features, rotation="45")
plt.yticks(range(len(continuous_features)), continuous_features, rotation="45")
plt.colorbar()
plt.show()
```

We can also use `.hist()` to view histograms of each feature's values.
```python
# Histograms
df["age"].hist()
```

```python
# Unique values
df["embarked"].unique()
```
```
array(['S', 'C', nan, 'Q'], dtype=object)
```
We can select data by feature, and even filter by specific values (or ranges of values) within a specific feature.
```python
# Selecting data by feature
df["name"].head()
```
```
0                      Allen, Miss. Elisabeth Walton
1                     Allison, Master. Hudson Trevor
2                       Allison, Miss. Helen Loraine
3               Allison, Mr. Hudson Joshua Creighton
4    Allison, Mrs. Hudson J C (Bessie Waldo Daniels)
Name: name, dtype: object
```
```python
# Filtering
df[df["sex"] == "female"].head()  # only the female data appear
```
| | pclass | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked | survived |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Allen, Miss. Elisabeth Walton | female | 29.0 | 0 | 0 | 24160 | 211.3375 | B5 | S | 1 |
| 2 | 1 | Allison, Miss. Helen Loraine | female | 2.0 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | 0 |
| 4 | 1 | Allison, Mrs. Hudson J C (Bessie Waldo Daniels) | female | 25.0 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | 0 |
| 6 | 1 | Andrews, Miss. Kornelia Theodosia | female | 63.0 | 1 | 0 | 13502 | 77.9583 | D7 | S | 1 |
| 8 | 1 | Appleton, Mrs. Edward Dale (Charlotte Lamson) | female | 53.0 | 2 | 0 | 11769 | 51.4792 | C101 | S | 1 |
We can also sort features in ascending or descending order.
```python
# Sorting
df.sort_values("age", ascending=False).head()
```
| | pclass | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked | survived |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 14 | 1 | Barkworth, Mr. Algernon Henry Wilson | male | 80.0 | 0 | 0 | 27042 | 30.0000 | A23 | S | 1 |
| 61 | 1 | Cavendish, Mrs. Tyrell William (Julia Florence... | female | 76.0 | 1 | 0 | 19877 | 78.8500 | C46 | S | 1 |
| 1235 | 3 | Svensson, Mr. Johan | male | 74.0 | 0 | 0 | 347060 | 7.7750 | NaN | S | 0 |
| 135 | 1 | Goldschmidt, Mr. George B | male | 71.0 | 0 | 0 | PC 17754 | 34.6542 | A5 | C | 0 |
| 9 | 1 | Artagaveytia, Mr. Ramon | male | 71.0 | 0 | 0 | PC 17609 | 49.5042 | NaN | C | 0 |
We can also get statistics for groups of rows. Here, we want to see the mean of the continuous features grouped by whether the passenger survived.
```python
# Grouping
survived_group = df.groupby("survived")
survived_group.mean(numeric_only=True)  # numeric_only avoids errors on text columns in newer pandas
```
| survived | pclass | age | sibsp | parch | fare |
|---|---|---|---|---|---|
| 0 | 2.500618 | 30.545369 | 0.521632 | 0.328801 | 23.353831 |
| 1 | 1.962000 | 28.918228 | 0.462000 | 0.476000 | 49.361184 |
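Beyond the mean, `groupby` can compute several aggregations per group in one pass via `.agg`. A minimal sketch on a toy frame (the column names mirror the dataset, but the values here are made up for illustration):

```python
import pandas as pd

# Toy frame mirroring the dataset's column names (values are illustrative, not the real data)
df = pd.DataFrame({
    "survived": [0, 0, 1, 1],
    "age": [40.0, 20.0, 30.0, 10.0],
    "fare": [10.0, 30.0, 50.0, 70.0],
})

# Named aggregation: several statistics per group in a single call
stats = df.groupby("survived").agg(
    mean_age=("age", "mean"),
    mean_fare=("fare", "mean"),
    n=("age", "size"),
)
print(stats)
```

Named aggregation keeps the output columns clearly labeled, which is handy once you compute more than one statistic per group.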
We can use `iloc` to get rows or columns at specific positions in the dataframe.
```python
# Selecting row 0
df.iloc[0, :]
```
```
pclass                                  1
name        Allen, Miss. Elisabeth Walton
sex                                female
age                                    29
sibsp                                   0
parch                                   0
ticket                              24160
fare                              211.338
cabin                                  B5
embarked                                S
survived                                1
Name: 0, dtype: object
```
```python
# Selecting a specific value
df.iloc[0, 1]
```
```
'Allen, Miss. Elisabeth Walton'
```
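As a side note, `iloc` addresses cells purely by position, while its counterpart `loc` addresses them by label. A small sketch on a toy frame (illustrative values, not the full dataset):

```python
import pandas as pd

# Toy frame with a couple of the columns used above (values are illustrative)
df = pd.DataFrame({
    "pclass": [1, 1],
    "name": ["Allen, Miss. Elisabeth Walton", "Allison, Master. Hudson Trevor"],
    "age": [29.0, 0.9167],
})

# iloc is purely positional: row 0, column at position 1 ("name")
by_position = df.iloc[0, 1]

# loc selects by label instead: row label 0, column "name"
by_label = df.loc[0, "name"]

assert by_position == by_label  # same cell, addressed two different ways
```

Label-based selection with `loc` is usually more robust than positional indexing, since it keeps working even if columns are reordered.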
After exploring, we can clean and preprocess our dataset.
```python
# Rows with at least one NaN value
df[pd.isnull(df).any(axis=1)].head()
```
| | pclass | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked | survived |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 9 | 1 | Artagaveytia, Mr. Ramon | male | 71.0 | 0 | 0 | PC 17609 | 49.5042 | NaN | C | 0 |
| 13 | 1 | Barber, Miss. Ellen "Nellie" | female | 26.0 | 0 | 0 | 19877 | 78.8500 | NaN | S | 1 |
| 15 | 1 | Baumann, Mr. John D | male | NaN | 0 | 0 | PC 17318 | 25.9250 | NaN | S | 0 |
| 23 | 1 | Bidois, Miss. Rosalie | female | 42.0 | 0 | 0 | PC 17757 | 227.5250 | NaN | C | 1 |
| 25 | 1 | Birnbaum, Mr. Jakob | male | 25.0 | 0 | 0 | 13905 | 26.0000 | NaN | C | 0 |
```python
# Drop rows with NaN values
df = df.dropna()  # removes rows with any NaN values
df = df.reset_index()  # resets row indexes in case any rows were dropped
df.head()
```
| | index | pclass | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked | survived |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | Allen, Miss. Elisabeth Walton | female | 29.0000 | 0 | 0 | 24160 | 211.3375 | B5 | S | 1 |
| 1 | 1 | 1 | Allison, Master. Hudson Trevor | male | 0.9167 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | 1 |
| 2 | 2 | 1 | Allison, Miss. Helen Loraine | female | 2.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | 0 |
| 3 | 3 | 1 | Allison, Mr. Hudson Joshua Creighton | male | 30.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | 0 |
| 4 | 4 | 1 | Allison, Mrs. Hudson J C (Bessie Waldo Daniels) | female | 25.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | 0 |
```python
# Dropping multiple columns
df = df.drop(["name", "cabin", "ticket"], axis=1)  # we won't use text features for our initial basic models
df.head()
```
| | index | pclass | sex | age | sibsp | parch | fare | embarked | survived |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | female | 29.0000 | 0 | 0 | 211.3375 | S | 1 |
| 1 | 1 | 1 | male | 0.9167 | 1 | 2 | 151.5500 | S | 1 |
| 2 | 2 | 1 | female | 2.0000 | 1 | 2 | 151.5500 | S | 0 |
| 3 | 3 | 1 | male | 30.0000 | 1 | 2 | 151.5500 | S | 0 |
| 4 | 4 | 1 | female | 25.0000 | 1 | 2 | 151.5500 | S | 0 |
```python
# Map feature values
df["sex"] = df["sex"].map({"female": 0, "male": 1}).astype(int)
df["embarked"] = df["embarked"].dropna().map({"S": 0, "C": 1, "Q": 2}).astype(int)
df.head()
```
| | index | pclass | sex | age | sibsp | parch | fare | embarked | survived |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 29.0000 | 0 | 0 | 211.3375 | 0 | 1 |
| 1 | 1 | 1 | 1 | 0.9167 | 1 | 2 | 151.5500 | 0 | 1 |
| 2 | 2 | 1 | 0 | 2.0000 | 1 | 2 | 151.5500 | 0 | 0 |
| 3 | 3 | 1 | 1 | 30.0000 | 1 | 2 | 151.5500 | 0 | 0 |
| 4 | 4 | 1 | 0 | 25.0000 | 1 | 2 | 151.5500 | 0 | 0 |
We'll now use feature engineering to create a new feature called `family_size`. We'll first define a function called `get_family_size` that determines the family size from the number of parents/children and siblings/spouses.
```python
# Function to compute a new feature
def get_family_size(sibsp, parch):
    family_size = sibsp + parch
    return family_size
```
Once we've defined the function, we can use `apply` with a `lambda` expression on each row (using the number of siblings and parents in each row to determine that row's family size).
```python
df["family_size"] = df[["sibsp", "parch"]].apply(lambda x: get_family_size(x["sibsp"], x["parch"]), axis=1)
df.head()
```
| | index | pclass | sex | age | sibsp | parch | fare | embarked | survived | family_size |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 29.0000 | 0 | 0 | 211.3375 | 0 | 1 | 0 |
| 1 | 1 | 1 | 1 | 0.9167 | 1 | 2 | 151.5500 | 0 | 1 | 3 |
| 2 | 2 | 1 | 0 | 2.0000 | 1 | 2 | 151.5500 | 0 | 0 | 3 |
| 3 | 3 | 1 | 1 | 30.0000 | 1 | 2 | 151.5500 | 0 | 0 | 3 |
| 4 | 4 | 1 | 0 | 25.0000 | 1 | 2 | 151.5500 | 0 | 0 | 3 |
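As a design note, `apply` with a Python lambda runs once per row; for simple arithmetic like this, a vectorized column operation gives the same result and is much faster. A minimal sketch on a toy frame (illustrative values):

```python
import pandas as pd

# Toy frame with the two relevant columns (values are illustrative)
df = pd.DataFrame({"sibsp": [0, 1, 1], "parch": [0, 2, 2]})

# Vectorized: add whole columns at once instead of calling a Python function per row
df["family_size"] = df["sibsp"] + df["parch"]

print(df["family_size"].tolist())  # [0, 3, 3]
```

The per-row `apply` version is fine for small data and arbitrary logic, but column arithmetic delegates the loop to NumPy and scales far better.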
```python
# Reorganize headers
df = df[["pclass", "sex", "age", "sibsp", "parch", "family_size", "fare", "embarked", "survived"]]
df.head()
```
| | pclass | sex | age | sibsp | parch | family_size | fare | embarked | survived |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 29.0000 | 0 | 0 | 0 | 211.3375 | 0 | 1 |
| 1 | 1 | 1 | 0.9167 | 1 | 2 | 3 | 151.5500 | 0 | 1 |
| 2 | 1 | 0 | 2.0000 | 1 | 2 | 3 | 151.5500 | 0 | 0 |
| 3 | 1 | 1 | 30.0000 | 1 | 2 | 3 | 151.5500 | 0 | 0 |
| 4 | 1 | 0 | 25.0000 | 1 | 2 | 3 | 151.5500 | 0 | 0 |
Finally, let's save the preprocessed data to a new CSV file for later use.
```python
# Saving dataframe to CSV
df.to_csv("processed_titanic.csv", index=False)
```
```python
# See the saved file
!ls -l
```
```
total 96
-rw-r--r-- 1 root root  6975 Dec  3 17:36 processed_titanic.csv
drwxr-xr-x 1 root root  4096 Nov 21 16:30 sample_data
-rw-r--r-- 1 root root 85153 Dec  3 17:36 titanic.csv
```
When working with very large datasets, our Pandas DataFrames can become massive, and operating on them can be very slow or even impossible. This is where packages that can distribute the workload (such as Dask) or run on more efficient hardware (such as cuDF on GPUs) come in handy. And, of course, we can combine the two (Dask-cuDF) to operate on partitions of a dataframe on GPUs.
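As a minimal illustration of the underlying idea (using pandas' own chunked CSV reading rather than Dask itself), we can stream a large file in pieces and aggregate incrementally instead of loading everything into memory. The in-memory CSV below stands in for a file too large to load at once:

```python
import io
import pandas as pd

# An in-memory CSV standing in for a file too large to load at once
csv_data = io.StringIO("fare\n10.0\n30.0\n50.0\n70.0\n")

# Stream the file in chunks of 2 rows and keep running totals
total, count = 0.0, 0
for chunk in pd.read_csv(csv_data, chunksize=2):
    total += chunk["fare"].sum()
    count += len(chunk)

print(total / count)  # mean fare, computed without holding all rows in memory
```

Dask generalizes this pattern: it partitions the dataframe, schedules the per-partition work (possibly across machines or GPUs with Dask-cuDF), and combines the partial results for you.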