The two data structures in pandas: Series and DataFrame

Contents

Series
DataFrame
References

Series

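A Series is a one-dimensional labeled array that can hold any data type (integers, floats, strings, Python objects, and so on). The labels of a Series are collectively called its index.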

We start by importing pandas and NumPy:

import numpy as np
import pandas as pd

Creating a Series

The most basic way to create a Series is to call the Series constructor in pandas:

s = pd.Series(data, index=index)

Typically, data is an array, a dict, or a scalar value.

`data` is an array

If data is an array, index must have the same length as data. The index values may repeat:

s = pd.Series([4, 7, -5, 3], index=["a", "b", "c", "a"])
s
"""
a    4
b    7
c   -5
a    3
dtype: int64
"""
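
As a small sketch of what a repeated label means in practice (the names dup and single are ours, not from the original), selecting by a duplicated label returns every matching entry, while a unique label returns a single value:

```python
import pandas as pd

# Same Series as above, with the label "a" used twice.
s = pd.Series([4, 7, -5, 3], index=["a", "b", "c", "a"])

dup = s["a"]       # duplicated label: both matching entries come back as a Series
single = s["b"]    # unique label: a single scalar value

print(dup.tolist())        # [4, 3]
print(single)              # 7
print(s.index.is_unique)   # False
```

Checking s.index.is_unique first is one way to find out whether a label lookup may return more than one value.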

If index is not specified, it defaults to 0, 1, ..., len(data) - 1:

s = pd.Series([4, 7, -5, 3])
s
"""
0    4
1    7
2   -5
3    3
dtype: int64
"""

The values and index attributes of a Series give us the underlying data and the index object, respectively:

s.values
"""
array([ 4,  7, -5,  3], dtype=int64)
"""

s = pd.Series([4, 7, -5, 3], index=["a", "b", "c", "a"])  # back to the labeled Series from earlier
s.index
"""
Index(['a', 'b', 'c', 'a'], dtype='object')
"""

`data` is a dict

If data is a dict, its keys and values become the index and values of the Series:

d = {"I": 1, "II": 2, "III": 3}
pd.Series(d)
"""
I      1
II     2
III    3
dtype: int64
"""

If index is also passed, each label in index that is a key of the dict keeps its matching value, while a label that is not a key gets NaN (missing data):

d = {"I": 1, "II": 2, "III": 3}
pd.Series(d, index=["I", "III", "IV", "X"])
"""
I      1.0
III    3.0
IV     NaN
X      NaN
dtype: float64
"""
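
The following sketch (the name s_sub is our own) makes two side effects of this rule visible: keys that are not listed in index are dropped, and the NaN entries change the dtype from int64 to float64:

```python
import pandas as pd

d = {"I": 1, "II": 2, "III": 3}
s_sub = pd.Series(d, index=["I", "III", "IV", "X"])

# "II" is not in index, so it is dropped; "IV" and "X" are not keys, so they are NaN.
print(s_sub.index.tolist())           # ['I', 'III', 'IV', 'X']
print(s_sub.isna().tolist())          # [False, False, True, True]

# NaN is a float, so the integer values are stored as float64.
print(s_sub.dtype)                    # float64

# dropna() removes the missing entries again.
print(s_sub.dropna().index.tolist())  # ['I', 'III']
```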

`data` is a scalar

If data is a scalar, index must be provided, and the value is repeated to match the length of index:

pd.Series(5, index=["I", "II", "III", "IV", "V"])
"""
I      5
II     5
III    5
IV     5
V      5
dtype: int64
"""

Analogy with NumPy arrays

A Series behaves much like an ndarray and is a valid argument to many NumPy functions:

s = pd.Series(np.random.randn(5), index=["a", "b", "c", "d", "e"])

s[s > s.median()]
"""
a    0.402311
b    0.473590
dtype: float64
"""

s + s
"""
a    0.804623
b    0.947180
c    0.315507
d   -1.647874
e   -1.980954
dtype: float64
"""

s * 2
"""
a    0.804623
b    0.947180
c    0.315507
d   -1.647874
e   -1.980954
dtype: float64
"""

np.exp(s)
"""
a    1.495277
b    1.605749
c    1.170878
d    0.438701
e    0.371399
dtype: float64
"""

Indexing a single position returns the value at that position, whereas slicing a Series returns the values together with their index labels:

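s.iloc[0]  # positional access; plain s[0] on a string-labeled index is deprecated in recent pandas versions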
"""
0.4023112822709314
"""

s[1:5]
"""
b    0.473590
c    0.157754
d   -0.823937
e   -0.990477
dtype: float64
"""
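
Positional and label-based slicing differ in one detail that is easy to miss. The sketch below uses fixed values instead of random data (our assumption, chosen so the output is reproducible) and the explicit iloc and loc accessors:

```python
import pandas as pd

# Fixed values rather than np.random.randn, so the results are reproducible.
s = pd.Series([10, 20, 30, 40, 50], index=["a", "b", "c", "d", "e"])

by_position = s.iloc[1:3]   # positions 1 and 2; the end position is excluded
by_label = s.loc["b":"d"]   # labels b, c and d; the end label is included

print(by_position.index.tolist())  # ['b', 'c']
print(by_label.index.tolist())     # ['b', 'c', 'd']
print(s.iloc[0])                   # 10
```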

We can even use something like NumPy's fancy indexing (see 《学懂 Python NumPy》):

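s.iloc[[4, 2, 1]]  # positional access; s[[4, 2, 1]] on a string-labeled index is deprecated in recent pandas versions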
"""
e   -0.990477
c    0.157754
b    0.473590
dtype: float64
"""

We can call to_numpy to convert a Series to a NumPy array:

s.to_numpy()
"""
array([ 0.40231128,  0.47359015,  0.15775353, -0.82393693, -0.99047702])
"""

Operations between Series automatically align the data by index label. The result carries the union of the two indexes, and any label that is not present in both operands gets NaN:

s[1:] + s[:-1]
"""
a         NaN
b    0.947180
c    0.315507
d   -1.647874
e         NaN
dtype: float64
"""
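
Here is the same alignment behavior as a sketch with small fixed values (x, y, total and filled are names we made up). It also shows Series.add with fill_value, which treats a label missing on one side as the given value instead of producing NaN:

```python
import pandas as pd

x = pd.Series([1, 2, 3], index=["a", "b", "c"])
y = pd.Series([10, 20, 30], index=["b", "c", "d"])

total = x + y                   # aligned by label, not by position
print(total.index.tolist())     # ['a', 'b', 'c', 'd']
print(total.isna().tolist())    # [True, False, False, True]

# With fill_value=0, a label present on only one side is paired with 0.
filled = x.add(y, fill_value=0)
print(filled.tolist())          # [1.0, 12.0, 23.0, 30.0]
```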

Analogy with dicts

Just as a Series can be created from a dict, we can treat the index labels as keys when working with a Series:

s["a"]
"""
0.4023112822709314
"""

s.get("a")
"""
0.4023112822709314
"""

"e" in s
"""
True
"""
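
The dict analogy also covers missing keys. In this sketch (fixed values, our assumption), the in operator tests the index labels rather than the values, get returns a default for a missing label, and square brackets raise KeyError:

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=["a", "b", "c"])

print("z" in s)         # False
print(20 in s)          # False: membership checks labels, not values
print(s.get("z"))       # None
print(s.get("z", -1))   # -1

try:
    s["z"]
except KeyError:
    print("KeyError for a missing label")
```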

Additional notes

The pandas functions isnull and notnull help detect missing values (NaN); Series objects also provide them as methods:

(s[1:] + s[:-1]).isnull()
"""
a     True
b    False
c    False
d    False
e     True
dtype: bool
"""

Both a Series and its index object have a name attribute, which can be modified:

s.name = 'Example'
s.index.name = 'letter'
s

The index of a Series can be modified in place by assignment:

s.index = ["a", "a", "a", "a", "a"]
s
"""
a     0.40231128
a     0.47359015
a     0.15775353
a    -0.82393693
a    -0.99047702
Name: Example, dtype: float64
"""
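
Two details about replacing an index are worth a quick sketch (variable names are ours): rename returns a new Series and leaves the original untouched, and an in-place assignment must supply exactly as many labels as there are values:

```python
import pandas as pd

s = pd.Series([1, 2, 3], index=["a", "b", "c"])

s.index = ["x", "y", "z"]             # in-place replacement of all labels
renamed = s.rename({"x": "first"})    # new Series; s itself is unchanged

print(s.index.tolist())        # ['x', 'y', 'z']
print(renamed.index.tolist())  # ['first', 'y', 'z']

length_error = False
try:
    s.index = ["only", "two"]  # 2 labels for 3 values
except ValueError:
    length_error = True
print(length_error)            # True
```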

DataFrame

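A DataFrame represents a rectangular table with both a row index and a column index. Different columns can hold different data types. As with Series, there are several ways to create a DataFrame.

Creating a DataFrame

From an array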

df = pd.DataFrame(np.random.randn(4, 3), columns=list('bde'),
                  index=['Baidi', 'Wuyi', 'Changan', 'Changshan'])
df
"""
                  b         d         e
Baidi      1.671894  0.902370  0.008920
Wuyi      -0.368150  0.265574  1.538348
Changan   -0.491342 -0.073567  0.257416
Changshan  1.703614  1.007506 -1.837523
"""

From a dict

The dict values can be ndarrays or lists:

d = {"one": [1.0, 2.0, 3.0, 4.0], "two": [4.0, 3.0, 2.0, 1.0]}
pd.DataFrame(d)
"""
   one  two
0  1.0  4.0
1  2.0  3.0
2  3.0  2.0
3  4.0  1.0

"""

The dict values can also be Series:

d = {
    "one": pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"]),
    "two": pd.Series([1.0, 2.0, 3.0, 4.0], index=["a", "b", "c", "d"]),
}

df = pd.DataFrame(d)
df
"""
   one  two
a  1.0  1.0
b  2.0  2.0
c  3.0  3.0
d  NaN  4.0
"""

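The same rule as for Series applies: if we also pass index (or columns) when building from a dict, only the requested labels are kept, and any label without data is filled with NaN: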

pd.DataFrame(d, index=["d", "b", "a"])
"""
   one  two
d  NaN  4.0
b  2.0  2.0
a  1.0  1.0
"""

pd.DataFrame(d, index=["d", "b", "a"], columns=["two", "three"])
"""
   two three
d  4.0   NaN
b  2.0   NaN
a  1.0   NaN
"""

Note that the row and column labels are available through the index and columns attributes, respectively:

df.index
"""
Index(['a', 'b', 'c', 'd'], dtype='object')
"""

df.columns
"""
Index(['one', 'two'], dtype='object')
"""

We can also build a DataFrame from a dict with DataFrame.from_dict. By default, the dict keys become the column labels:

pd.DataFrame.from_dict(dict([("A", [1, 2, 3]), ("B", [4, 5, 6])]))
"""
   A  B
0  1  4
1  2  5
2  3  6
"""

Setting the orient parameter to "index" makes the dict keys the row labels instead:

pd.DataFrame.from_dict(
    dict([("A", [1, 2, 3]), ("B", [4, 5, 6])]),
    orient="index",
    columns=["one", "two", "three"],
)
"""
   one  two  three
A    1    2      3
B    4    5      6
"""
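
To compare the two orientations side by side, here is a short sketch (by_cols and by_rows are names we chose) that builds both frames from the same dict and inspects their shapes:

```python
import pandas as pd

data = {"A": [1, 2, 3], "B": [4, 5, 6]}

by_cols = pd.DataFrame.from_dict(data)                  # keys become column labels
by_rows = pd.DataFrame.from_dict(data, orient="index")  # keys become row labels

print(by_cols.shape)              # (3, 2)
print(by_rows.shape)              # (2, 3)
print(by_rows.index.tolist())     # ['A', 'B']
print(by_rows.loc["B"].tolist())  # [4, 5, 6]
```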

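Creating a DataFrame directly from an array or a dict is the most common and most intuitive approach. The other approaches are less intuitive, so we only briefly cover one more: creating from a list.

From a list

The list elements are dicts: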

data2 = [{"a": 1, "b": 2}, {"a": 5, "b": 10, "c": 20}]
pd.DataFrame(data2)
"""
   a   b     c
0  1   2   NaN
1  5  10  20.0
"""

pd.DataFrame(data2, index=["first", "second"])
"""
        a   b     c
first   1   2   NaN
second  5  10  20.0
"""

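Column operations

We can treat a DataFrame as a dict of Series keyed by column name, so column operations resemble dict operations.

Here is the df defined earlier again: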

df
"""
   one  two
a  1.0  1.0
b  2.0  2.0
c  3.0  3.0
d  NaN  4.0
"""

Adding columns:

df["three"] = df["one"] * df["two"]
df["flag"] = df["one"] > 2
df
"""
   one  two  three   flag
a  1.0  1.0    1.0  False
b  2.0  2.0    4.0  False
c  3.0  3.0    9.0   True
d  NaN  4.0    NaN  False
"""

Deleting columns:

del df["two"]

three = df.pop("three")

df
"""
   one   flag
a  1.0  False
b  2.0  False
c  3.0   True
d  NaN  False
"""
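
Both del and pop modify the DataFrame in place. When the original should stay intact, DataFrame.drop is an alternative that returns a new object. A minimal sketch with made-up values:

```python
import pandas as pd

df = pd.DataFrame({"one": [1.0, 2.0], "two": [3.0, 4.0], "three": [5.0, 6.0]})

trimmed = df.drop(columns=["two"])  # new DataFrame; df keeps all three columns
print(list(trimmed.columns))        # ['one', 'three']
print(list(df.columns))             # ['one', 'two', 'three']

popped = df.pop("three")            # removes the column in place and returns it as a Series
print(list(df.columns))             # ['one', 'two']
print(popped.tolist())              # [5.0, 6.0]
```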

A scalar value is "broadcast" to all rows:

df["foo"] = "bar"
df
"""
   one   flag  foo
a  1.0  False  bar
b  2.0  False  bar
c  3.0   True  bar
d  NaN  False  bar
"""

By default, a new column is appended at the end. The insert method lets us choose the position:

df.insert(1, "bar", df["one"])
df
"""
   one  bar   flag  foo
a  1.0  1.0  False  bar
b  2.0  2.0  False  bar
c  3.0  3.0   True  bar
d  NaN  NaN  False  bar
"""

The assign method derives new columns from existing ones:

from sklearn import datasets
iris = datasets.load_iris()

df = pd.DataFrame(iris.data, columns=["SepalLength", "SepalWidth", "PetalLength", "PetalWidth"])
df.head()
"""
   SepalLength  SepalWidth  PetalLength  PetalWidth
0          5.1         3.5          1.4         0.2
1          4.9         3.0          1.4         0.2
2          4.7         3.2          1.3         0.2
3          4.6         3.1          1.5         0.2 
4          5.0         3.6          1.4         0.2 
"""

df.assign(sepal_ratio=df["SepalWidth"] / df["SepalLength"]).head()
"""
   SepalLength  SepalWidth  PetalLength  PetalWidth  sepal_ratio
0          5.1         3.5          1.4         0.2     0.686275
1          4.9         3.0          1.4         0.2     0.612245
2          4.7         3.2          1.3         0.2     0.680851
3          4.6         3.1          1.5         0.2     0.673913
4          5.0         3.6          1.4         0.2     0.720000
"""

We can also add several columns at once, and use anonymous functions (lambda):

dfa = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
dfa.assign(C=lambda x: x["A"] + x["B"], D=lambda x: x["A"] + x["C"])
"""
   A  B  C   D
0  1  4  5   6
1  2  5  7   9
2  3  6  9  12
"""

Note the order of the arguments: D is computed on a new copy that already contains column C, so x["C"] can be indexed directly.
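
Because assign works on a copy, the original frame is not modified. The sketch below repeats the example and stores the result under a name of our choosing (result) to make that visible:

```python
import pandas as pd

dfa = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})

# C is created first, so the lambda for D can already read x["C"].
result = dfa.assign(C=lambda x: x["A"] + x["B"], D=lambda x: x["A"] + x["C"])

print(list(dfa.columns))     # ['A', 'B']
print(list(result.columns))  # ['A', 'B', 'C', 'D']
print(result["D"].tolist())  # [6, 9, 12]
```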

Indexing / Selection

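The basic indexing operations are listed in the table below. I will cover them in detail in a separate article (《pandas 中的索引与选择操作》).

Operation	Syntax	Result
Select a column	`df[col]`	Series
Select a row by label	`df.loc[label]`	Series
Select a row by integer position	`df.iloc[loc]`	Series
Slice rows	`df[5:10]`	DataFrame
Select rows by boolean vector	`df[bool_vec]`	DataFrame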
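
Each row of the table can be tried on a tiny frame. This is only a sketch; the frame and all variable names are our own:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]}, index=["x", "y", "z"])

col = df["A"]             # select a column              -> Series
row_label = df.loc["y"]   # select a row by label        -> Series
row_pos = df.iloc[0]      # select a row by position     -> Series
rows = df[0:2]            # slice rows                   -> DataFrame
mask = df[df["A"] > 1]    # select rows by boolean vector -> DataFrame

print(type(col).__name__, type(rows).__name__)  # Series DataFrame
print(row_label.tolist())                       # [2, 5]
print(row_pos.tolist())                         # [1, 4]
print(mask.index.tolist())                      # ['y', 'z']
```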

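Basic DataFrame operations

As with Series, an operation between two DataFrames is computed only at positions whose row label and column label exist in both operands; every other position is filled with NaN: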

df = pd.DataFrame(np.random.randn(10, 4), columns=["A", "B", "C", "D"])

df2 = pd.DataFrame(np.random.randn(7, 3), columns=["A", "B", "C"])

df + df2
"""
          A         B         C   D
0  0.045691 -0.014138  1.380871 NaN
1 -0.955398 -1.501007  0.037181 NaN
2 -0.662690  1.534833 -0.859691 NaN
3 -2.452949  1.237274 -0.133712 NaN
4  1.414490  1.951676 -2.320422 NaN
5 -0.494922 -1.649727 -1.084601 NaN
6 -1.047551 -0.748572 -0.805479 NaN
7       NaN       NaN       NaN NaN
8       NaN       NaN       NaN NaN
9       NaN       NaN       NaN NaN
"""

If we want a missing operand to be replaced by another value rather than producing NaN, we can use the add method and pass the fill_value parameter:

df.add(df2, fill_value=0)
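
Since the frames above are random, the effect of fill_value is easier to see on small fixed data. In this sketch (df_a, df_b, plain and filled are hypothetical names), df_b lacks column B and the second row:

```python
import pandas as pd

df_a = pd.DataFrame({"A": [1.0, 2.0], "B": [3.0, 4.0]})
df_b = pd.DataFrame({"A": [10.0]})

plain = df_a + df_b                     # only position (0, "A") exists in both frames
filled = df_a.add(df_b, fill_value=0)   # entries missing from df_b are treated as 0

print(int(plain.isna().sum().sum()))    # 3
print(filled["A"].tolist())             # [11.0, 2.0]
print(filled["B"].tolist())             # [3.0, 4.0]
```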

When a DataFrame is combined with a Series, for example, we can subtract the first row from every row of the DataFrame:

df - df.iloc[0]
"""
          A         B         C         D
0  0.000000  0.000000  0.000000  0.000000
1 -1.359261 -0.248717 -0.453372 -1.754659
2  0.253128  0.829678  0.010026 -1.991234
3 -1.311128  0.054325 -1.724913 -1.620544
4  0.573025  1.500742 -0.676070  1.367331
5 -1.741248  0.781993 -1.241620 -2.053136
6 -1.240774 -0.869551 -0.153282  0.000430
7 -0.743894  0.411013 -0.929563 -0.282386
8 -1.194921  1.320690  0.238224 -1.482644
9  2.293786  1.856228  0.773289 -1.446531
"""

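The index of the Series df.iloc[0] is aligned with the columns of df by default, so df.iloc[0] is broadcast row by row. To broadcast along the columns instead, we must use one of the arithmetic methods such as add, sub, or div and set the axis parameter to "index":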

series = df['B']
df.sub(series, axis='index')
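
The difference between the two broadcast directions can be checked on a small integer frame. The values and the names by_row and by_col are our own:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [10, 20, 30]})

# The labels of df.iloc[0] are the column names, so it is subtracted from every row.
by_row = df - df.iloc[0]

# The labels of df["B"] are the row labels, so axis="index" subtracts it from every column.
by_col = df.sub(df["B"], axis="index")

print(by_row["B"].tolist())  # [0, 10, 20]
print(by_col["A"].tolist())  # [-9, -18, -27]
print(by_col["B"].tolist())  # [0, 0, 0]
```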

Operations with a scalar are also element-wise:

df * 5 + 2
"""
           A         B         C          D
0   3.359299 -0.124862  4.835102   3.381160
1  -3.437003 -1.368449  2.568242  -5.392133
2   4.624938  4.023526  4.885230  -6.575010
3  -3.196342  0.146766 -3.789461  -4.721559
4   6.224426  7.378849  1.454750  10.217815
5  -5.346940  3.785103 -1.373001  -6.884519
6  -2.844569 -4.472618  4.068691   3.383309
7  -0.360173  1.930201  0.187285   1.969232
8  -2.615303  6.478587  6.026220  -4.032059
9  14.828230  9.156280  8.701544  -3.851494
"""

Like a Series, a DataFrame can be passed to most NumPy functions:

np.exp(df)
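
A brief sketch with fixed values (our assumption) shows that the result of such a call is still a DataFrame with the original labels, and that to_numpy gives the plain ndarray when the labels are not needed:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [0.0, 1.0], "B": [2.0, 3.0]}, index=["x", "y"])

result = np.exp(df)               # element-wise; row and column labels are preserved
print(type(result).__name__)      # DataFrame
print(result.index.tolist())      # ['x', 'y']
print(result.loc["x", "A"])       # 1.0, since exp(0) == 1

arr = df.to_numpy()               # plain ndarray without labels
print(arr.shape)                  # (2, 2)
```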

References

Pandas User Guide. https://pandas.pydata.org/docs/user_guide/dsintro.html#.

Original article: https://blog.csdn.net/myDarling_/article/details/127945179