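Pandas is an open-source library that provides high-performance, easy-to-use data structures and data analysis tools for the Python programming language. For data analysis work it is even more powerful than Excel.
First, import the libraries: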
import numpy as np
import pandas as pd
Then read the data file:
pd.read_csv(r'foo.csv')
Unnamed: 0 A B C D
0 ('2010-01-02',) 0.383275 0.694303 0.756312 0.325656
1 ('2010-01-03',) 0.789682 0.949437 0.018813 0.289073
2 ('2010-01-04',) 0.669891 0.575676 0.609218 0.371853
3 ('2010-01-05',) 0.335516 0.789114 0.201464 0.026223
4 ('2010-01-06',) 0.470739 0.080339 0.012590 0.636239
.. ... ... ... ... ...
995 ('2012-09-23',) 0.363259 0.533010 0.240098 0.171152
996 ('2012-09-24',) 0.238887 0.509116 0.847454 0.383001
997 ('2012-09-25',) 0.224259 0.645276 0.538157 0.539286
998 ('2012-09-26',) 0.556111 0.587099 0.599791 0.937187
999 ('2012-09-27',) 0.990479 0.073511 0.636951 0.485250
[1000 rows x 5 columns]
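The Unnamed: 0 column appears because the first header field in the file is empty, which is what happens when a DataFrame is saved together with its index and then read back without telling pandas about it. A minimal sketch, assuming a small in-memory stand-in for foo.csv (the three rows and the variable names are illustrative, not the real file), shows how index_col=0 turns that column into the row index instead:

```python
import io

import pandas as pd

# Hypothetical stand-in for foo.csv: the first header field is empty,
# as it is when a DataFrame is written with its index included.
csv_text = """,A,B
2010-01-02,0.383275,0.694303
2010-01-03,0.789682,0.949437
2010-01-04,0.669891,0.575676
"""

# Default read: the unnamed first column becomes a regular column.
raw = pd.read_csv(io.StringIO(csv_text))
print(raw.columns.tolist())      # ['Unnamed: 0', 'A', 'B']

# index_col=0 uses the first column as the row index instead.
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(indexed.columns.tolist())  # ['A', 'B']
print(indexed.index.tolist())
```

Whether this is appropriate for the real foo.csv depends on how that file was written, so treat it as one option rather than a required step.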
df = pd.read_csv(r'foo.csv')
df.head()
Unnamed: 0 A B C D
0 ('2010-01-02',) 0.383275 0.694303 0.756312 0.325656
1 ('2010-01-03',) 0.789682 0.949437 0.018813 0.289073
2 ('2010-01-04',) 0.669891 0.575676 0.609218 0.371853
3 ('2010-01-05',) 0.335516 0.789114 0.201464 0.026223
4 ('2010-01-06',) 0.470739 0.080339 0.012590 0.636239
df.tail()
Unnamed: 0 A B C D
995 ('2012-09-23',) 0.363259 0.533010 0.240098 0.171152
996 ('2012-09-24',) 0.238887 0.509116 0.847454 0.383001
997 ('2012-09-25',) 0.224259 0.645276 0.538157 0.539286
998 ('2012-09-26',) 0.556111 0.587099 0.599791 0.937187
999 ('2012-09-27',) 0.990479 0.073511 0.636951 0.485250
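Both head() and tail() return five rows by default and accept a row count. A small sketch, using a made-up ten-row frame named demo rather than the data from foo.csv:

```python
import pandas as pd

# Illustrative frame with ten rows; not the data from foo.csv.
demo = pd.DataFrame({"A": range(10)})

first_three = demo.head(3)  # rows at positions 0, 1, 2
last_two = demo.tail(2)     # rows at positions 8, 9

print(first_three.index.tolist())  # [0, 1, 2]
print(last_two.index.tolist())     # [8, 9]
print(len(demo.head()))            # 5
```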
df.index
RangeIndex(start=0, stop=1000, step=1) # the index runs from 0 up to but not including 1000 (a half-open interval) in steps of 1, i.e. 0, 1, 2, 3, ...
df.columns
Index(['Unnamed: 0', 'A', 'B', 'C', 'D'], dtype='object')
df.T
0 1 ... 998 999
Unnamed: 0 ('2010-01-02',) ('2010-01-03',) ... ('2012-09-26',) ('2012-09-27',)
A 0.383275 0.789682 ... 0.556111 0.990479
B 0.694303 0.949437 ... 0.587099 0.073511
C 0.756312 0.018813 ... 0.599791 0.636951
D 0.325656 0.289073 ... 0.937187 0.48525
[5 rows x 1000 columns]
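Transposing swaps the row index and the column labels. The sketch below uses two small made-up frames (demo and mixed are illustrative names) to show the swap, and to show that a frame mixing strings and floats, like the one read from foo.csv, comes out of a transpose with object dtype in every column:

```python
import pandas as pd

# Illustrative all-float frame with labelled rows.
demo = pd.DataFrame({"A": [1.0, 2.0], "B": [3.0, 4.0]}, index=["r0", "r1"])
t = demo.T

print(t.index.tolist())    # ['A', 'B']
print(t.columns.tolist())  # ['r0', 'r1']
print(t.loc["B", "r0"])    # 3.0

# Illustrative frame with one string column and one float column.
mixed = pd.DataFrame({"name": ["x", "y"], "val": [1.0, 2.0]})
print(mixed.T.dtypes.tolist())  # every transposed column holds mixed values
```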
Sort by column A (ascending by default)
df.sort_values(by='A')
Unnamed: 0 A B C D
747 ('2012-01-19',) 0.000409 0.054803 0.720072 0.544639
992 ('2012-09-20',) 0.001385 0.442260 0.138257 0.960748
388 ('2011-01-25',) 0.002953 0.942049 0.832475 0.161548
395 ('2011-02-01',) 0.005425 0.753291 0.708701 0.950734
541 ('2011-06-27',) 0.008852 0.484906 0.535997 0.321985
.. ... ... ... ... ...
583 ('2011-08-08',) 0.995928 0.128598 0.819095 0.691981
191 ('2010-07-12',) 0.997268 0.463426 0.436351 0.983158
873 ('2012-05-24',) 0.998636 0.772841 0.217930 0.373149
246 ('2010-09-05',) 0.998801 0.305137 0.083263 0.520577
576 ('2011-08-01',) 0.999404 0.630375 0.796221 0.376956
[1000 rows x 5 columns]
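Note that the sorted result keeps the original index labels, which is why the output starts at 747. A short sketch on a made-up three-row frame shows the descending variant and that sort_values returns a new frame rather than changing the original:

```python
import pandas as pd

# Illustrative data; not taken from foo.csv.
demo = pd.DataFrame({"A": [0.3, 0.9, 0.1], "B": [1, 2, 3]})

ascending = demo.sort_values(by="A")
descending = demo.sort_values(by="A", ascending=False)

print(ascending.index.tolist())   # [2, 0, 1]
print(descending.index.tolist())  # [1, 0, 2]
print(demo.index.tolist())        # [0, 1, 2]; the original is not modified
```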
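For two-dimensional data, axis 0 runs along the rows and axis 1 runs along the columns.
The call below sorts the column labels along axis 1, so the column order changes from Unnamed: 0, A, B, C, D to A, B, C, D, Unnamed: 0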
df.sort_index(axis=1)
A B C D Unnamed: 0
0 0.383275 0.694303 0.756312 0.325656 ('2010-01-02',)
1 0.789682 0.949437 0.018813 0.289073 ('2010-01-03',)
2 0.669891 0.575676 0.609218 0.371853 ('2010-01-04',)
3 0.335516 0.789114 0.201464 0.026223 ('2010-01-05',)
4 0.470739 0.080339 0.012590 0.636239 ('2010-01-06',)
.. ... ... ... ... ...
995 0.363259 0.533010 0.240098 0.171152 ('2012-09-23',)
996 0.238887 0.509116 0.847454 0.383001 ('2012-09-24',)
997 0.224259 0.645276 0.538157 0.539286 ('2012-09-25',)
998 0.556111 0.587099 0.599791 0.937187 ('2012-09-26',)
999 0.990479 0.073511 0.636951 0.485250 ('2012-09-27',)
[1000 rows x 5 columns]
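To make the two axes concrete, here is a small sketch on a made-up frame whose row and column labels are both out of order. It only reorders labels; the values travel with them:

```python
import pandas as pd

# Illustrative frame with unsorted row and column labels.
demo = pd.DataFrame(
    [[1, 2, 3], [4, 5, 6]],
    index=["y", "x"],
    columns=["C", "A", "B"],
)

by_rows = demo.sort_index(axis=0)  # reorder rows by index label
by_cols = demo.sort_index(axis=1)  # reorder columns by column label

print(by_rows.index.tolist())     # ['x', 'y']
print(by_cols.columns.tolist())   # ['A', 'B', 'C']
print(by_cols.loc["y"].tolist())  # [2, 3, 1]
```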
df.loc[0] returns the row whose index label is 0
df.loc[0]
Unnamed: 0 ('2010-01-02',)
A 0.383275
B 0.694303
C 0.756312
D 0.325656
Name: 0, dtype: object
Select multiple rows and columns
df.loc[:, ['A', 'D']] selects all rows of columns A and D:
df.loc[:, ['A', 'D']]
A D
0 0.383275 0.325656
1 0.789682 0.289073
2 0.669891 0.371853
3 0.335516 0.026223
4 0.470739 0.636239
.. ... ...
995 0.363259 0.171152
996 0.238887 0.383001
997 0.224259 0.539286
998 0.556111 0.937187
999 0.990479 0.485250
[1000 rows x 2 columns]
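loc always works with labels, and a label slice includes both endpoints. A minimal sketch, assuming a made-up frame with string row labels so that labels cannot be confused with positions:

```python
import pandas as pd

# Illustrative frame with string row labels.
demo = pd.DataFrame(
    {"A": [10, 20, 30, 40], "D": [1, 2, 3, 4]},
    index=["a", "b", "c", "d"],
)

# Label slice: both 'b' and 'c' are included.
subset = demo.loc["b":"c", ["A"]]
print(subset.index.tolist())    # ['b', 'c']
print(subset.columns.tolist())  # ['A']

# A single row label plus a single column label returns one value.
print(demo.loc["a", "D"])       # 1
```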
Select the third row (position 2)
df.iloc[2]
Unnamed: 0 ('2010-01-04',)
A 0.669891
B 0.575676
C 0.609218
D 0.371853
Name: 2, dtype: object
Select rows 4 and 5 (positions 3 and 4) and the first and second columns
df.iloc[3:5, 0:2]
Unnamed: 0 A
3 ('2010-01-05',) 0.335516
4 ('2010-01-06',) 0.470739
Select rows 2, 3, and 6 (positions 1, 2, and 5) and the second and fourth columns (positions 1 and 3)
df.iloc[[1,2,5], [1,3]]
A C
1 0.789682 0.018813
2 0.669891 0.609218
5 0.283397 0.085547
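With the default RangeIndex, labels and positions coincide, so loc and iloc can look interchangeable. The sketch below uses a made-up frame whose labels run in reverse to show the difference, and that an iloc slice excludes its stop position:

```python
import pandas as pd

# Illustrative frame whose index labels differ from the row positions.
demo = pd.DataFrame({"A": [10, 20, 30, 40]}, index=[3, 2, 1, 0])

print(demo.iloc[0]["A"])  # 10, the first row by position
print(demo.loc[0]["A"])   # 40, the row whose label is 0

# Positions 1 and 2 only; the stop position 3 is excluded.
print(demo.iloc[1:3].index.tolist())  # [2, 1]
```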
All rows where column A is greater than 0.7
df[df.A>0.7]
Unnamed: 0 A B C D
1 ('2010-01-03',) 0.789682 0.949437 0.018813 0.289073
6 ('2010-01-08',) 0.909243 0.089120 0.177339 0.580291
9 ('2010-01-11',) 0.866655 0.554461 0.498087 0.060520
13 ('2010-01-15',) 0.733417 0.372051 0.352662 0.134456
14 ('2010-01-16',) 0.973311 0.204662 0.786659 0.389995
.. ... ... ... ... ...
978 ('2012-09-06',) 0.855976 0.051711 0.289529 0.540159
981 ('2012-09-09',) 0.958141 0.729275 0.658488 0.079401
987 ('2012-09-15',) 0.939371 0.707939 0.068135 0.411151
994 ('2012-09-22',) 0.935422 0.334959 0.275858 0.652320
999 ('2012-09-27',) 0.990479 0.073511 0.636951 0.485250
[298 rows x 5 columns]
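The expression df.A > 0.7 produces a boolean Series, and indexing with it keeps the rows marked True. A short sketch on made-up data shows the mask itself and how two conditions can be combined with & (each condition needs its own parentheses):

```python
import pandas as pd

# Illustrative data; not taken from foo.csv.
demo = pd.DataFrame({"A": [0.8, 0.2, 0.9, 0.75], "B": [0.1, 0.6, 0.7, 0.4]})

mask = demo["A"] > 0.7
print(mask.tolist())        # [True, False, True, True]

# Combine conditions element-wise with &, wrapping each in parentheses.
both = demo[(demo["A"] > 0.7) & (demo["B"] < 0.5)]
print(both.index.tolist())  # [0, 3]
```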
Use isin to select the rows whose value in column E is aa or ee
df2
Unnamed: 0 A B C D E
0 ('2010-01-02',) 0.383275 0.694303 0.756312 0.325656 aa
1 ('2010-01-03',) 0.789682 0.949437 0.018813 0.289073 bb
2 ('2010-01-04',) 0.669891 0.575676 0.609218 0.371853 cc
3 ('2010-01-05',) 0.335516 0.789114 0.201464 0.026223 dd
4 ('2010-01-06',) 0.470739 0.080339 0.012590 0.636239 ee
5 ('2010-01-07',) 0.283397 0.927063 0.085547 0.608495 ff
df2[df2['E'].isin(['aa', 'ee'])]
Unnamed: 0 A B C D E
0 ('2010-01-02',) 0.383275 0.694303 0.756312 0.325656 aa
4 ('2010-01-06',) 0.470739 0.080339 0.012590 0.636239 ee
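The text never shows how df2 is built. A minimal sketch, assuming it is a copy of the first six rows of df with a string column E added (only column A is reproduced here, and df_head is a hypothetical stand-in name), which also shows how ~ inverts the isin mask:

```python
import pandas as pd

# Hypothetical stand-in for the first six rows of df (column A only).
df_head = pd.DataFrame(
    {"A": [0.383275, 0.789682, 0.669891, 0.335516, 0.470739, 0.283397]}
)

# Assumption: df2 is a copy with a new string column E.
df2 = df_head.copy()
df2["E"] = ["aa", "bb", "cc", "dd", "ee", "ff"]

# Keep the rows whose E value is in the given list.
selected = df2[df2["E"].isin(["aa", "ee"])]
print(selected.index.tolist())  # [0, 4]

# ~ negates the mask and keeps everything else.
excluded = df2[~df2["E"].isin(["aa", "ee"])]
print(excluded["E"].tolist())   # ['bb', 'cc', 'dd', 'ff']
```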
That is all for today. This article is only a brief introduction to using pandas; the series will continue, so stay tuned!