• Learn pandas in 10 Minutes, Part 1 (Viewing and Selecting Data)



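    I. What is pandas?

    pandas is an open-source library that provides high-performance, easy-to-use data structures and data analysis tools for the Python programming language. For scripted, repeatable data processing it can do considerably more than Excel.

    II. Quick start

    1. Import the libraries

    The code is as follows: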

    import numpy as np
    import pandas as pd
    

    2. Read data from a CSV file

    The code is as follows:

    pd.read_csv(r'foo.csv')
              Unnamed: 0         A         B         C         D
    0    ('2010-01-02',)  0.383275  0.694303  0.756312  0.325656
    1    ('2010-01-03',)  0.789682  0.949437  0.018813  0.289073
    2    ('2010-01-04',)  0.669891  0.575676  0.609218  0.371853
    3    ('2010-01-05',)  0.335516  0.789114  0.201464  0.026223
    4    ('2010-01-06',)  0.470739  0.080339  0.012590  0.636239
    ..               ...       ...       ...       ...       ...
    995  ('2012-09-23',)  0.363259  0.533010  0.240098  0.171152
    996  ('2012-09-24',)  0.238887  0.509116  0.847454  0.383001
    997  ('2012-09-25',)  0.224259  0.645276  0.538157  0.539286
    998  ('2012-09-26',)  0.556111  0.587099  0.599791  0.937187
    999  ('2012-09-27',)  0.990479  0.073511  0.636951  0.485250
    [1000 rows x 5 columns]
    
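The `Unnamed: 0` column above appears because the first header field of the file is empty. The sketch below uses a small in-memory stand-in for `foo.csv` (the sample text and the variable names are assumptions, not the original file) to show the default behaviour next to `index_col=0`, which turns that first column into the row index instead.

```python
import io

import pandas as pd

# Hypothetical stand-in for foo.csv: the first header field is empty,
# so pandas has no name for that column.
csv_text = """,A,B
2010-01-02,0.38,0.69
2010-01-03,0.79,0.95
2010-01-04,0.67,0.58
"""

# Default read: the nameless first column is kept as data
# and labelled "Unnamed: 0".
df_default = pd.read_csv(io.StringIO(csv_text))

# index_col=0 uses the first column as the row index instead.
df_indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(df_default.columns.tolist())  # → ['Unnamed: 0', 'A', 'B']
print(df_indexed.columns.tolist())  # → ['A', 'B']
```

Whether to use `index_col=0` depends on whether the first column is meant to be a row label; the rest of this article keeps the default read.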

    3. View the data

    head([n]) returns the first n rows; n defaults to 5.

    df = pd.read_csv(r'foo.csv')
    df.head()
            Unnamed: 0         A         B         C         D
    0  ('2010-01-02',)  0.383275  0.694303  0.756312  0.325656
    1  ('2010-01-03',)  0.789682  0.949437  0.018813  0.289073
    2  ('2010-01-04',)  0.669891  0.575676  0.609218  0.371853
    3  ('2010-01-05',)  0.335516  0.789114  0.201464  0.026223
    4  ('2010-01-06',)  0.470739  0.080339  0.012590  0.636239
    

    tail([n]) returns the last n rows; n also defaults to 5.

    df.tail()
              Unnamed: 0         A         B         C         D
    995  ('2012-09-23',)  0.363259  0.533010  0.240098  0.171152
    996  ('2012-09-24',)  0.238887  0.509116  0.847454  0.383001
    997  ('2012-09-25',)  0.224259  0.645276  0.538157  0.539286
    998  ('2012-09-26',)  0.556111  0.587099  0.599791  0.937187
    999  ('2012-09-27',)  0.990479  0.073511  0.636951  0.485250
    
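Both methods accept an explicit row count. A minimal sketch on a made-up ten-row frame (the data here is illustrative, not the article's `foo.csv`):

```python
import pandas as pd

# Illustrative data: ten rows with the default 0..9 index.
df = pd.DataFrame({"A": range(10), "B": range(10, 20)})

first_three = df.head(3)  # first 3 rows instead of the default 5
last_two = df.tail(2)     # last 2 rows instead of the default 5

print(first_three.index.tolist())  # → [0, 1, 2]
print(last_two.index.tolist())     # → [8, 9]
```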

    Show the index and the column names:

    df.index  
    RangeIndex(start=0, stop=1000, step=1)  # the index runs from 0 up to, but not including, 1000 (half-open interval) with step 1: 0, 1, 2, 3, ...
    
    df.columns
    Index(['Unnamed: 0', 'A', 'B', 'C', 'D'], dtype='object')
    
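To make the half-open `RangeIndex` concrete, the following sketch lists the labels of a small illustrative frame and reads its shape; the frame itself is an assumption made for the example.

```python
import pandas as pd

# Illustrative three-row frame with the default integer index.
df = pd.DataFrame({"A": [0.1, 0.2, 0.3], "B": [1.0, 2.0, 3.0]})

# RangeIndex(start=0, stop=3, step=1) contains 0, 1, 2 but not 3.
print(list(df.index))        # → [0, 1, 2]
print(df.columns.tolist())   # → ['A', 'B']
print(df.shape)              # → (3, 2), i.e. (rows, columns)
```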

    Transpose the data:

    df.T
                            0                1    ...              998              999
    Unnamed: 0  ('2010-01-02',)  ('2010-01-03',)  ...  ('2012-09-26',)  ('2012-09-27',)
    A                  0.383275         0.789682  ...         0.556111         0.990479
    B                  0.694303         0.949437  ...         0.587099         0.073511
    C                  0.756312         0.018813  ...         0.599791         0.636951
    D                  0.325656         0.289073  ...         0.937187          0.48525
    [5 rows x 1000 columns]
    
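Transposing swaps the row and column labels. When the columns have mixed types, as with the string column and the float columns above, each transposed column has to hold both kinds of value, so it becomes `object` dtype. A small sketch with made-up data:

```python
import pandas as pd

# Illustrative frame mixing a string column with a float column.
mixed = pd.DataFrame({"date": ["d0", "d1", "d2"], "A": [0.1, 0.2, 0.3]})
t = mixed.T

print(t.shape)             # → (2, 3)
print(t.index.tolist())    # → ['date', 'A']
print(t.columns.tolist())  # → [0, 1, 2]

# Every transposed column now mixes strings and floats.
print(all(dtype == object for dtype in t.dtypes))  # → True
```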

    Sort by values:

    Sort by column A; ascending order is the default.

    df.sort_values(by='A')
              Unnamed: 0         A         B         C         D
    747  ('2012-01-19',)  0.000409  0.054803  0.720072  0.544639
    992  ('2012-09-20',)  0.001385  0.442260  0.138257  0.960748
    388  ('2011-01-25',)  0.002953  0.942049  0.832475  0.161548
    395  ('2011-02-01',)  0.005425  0.753291  0.708701  0.950734
    541  ('2011-06-27',)  0.008852  0.484906  0.535997  0.321985
    ..               ...       ...       ...       ...       ...
    583  ('2011-08-08',)  0.995928  0.128598  0.819095  0.691981
    191  ('2010-07-12',)  0.997268  0.463426  0.436351  0.983158
    873  ('2012-05-24',)  0.998636  0.772841  0.217930  0.373149
    246  ('2010-09-05',)  0.998801  0.305137  0.083263  0.520577
    576  ('2011-08-01',)  0.999404  0.630375  0.796221  0.376956
    [1000 rows x 5 columns]
    
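`sort_values` returns a new DataFrame and leaves the original untouched. The sketch below, on illustrative data, also shows descending order via `ascending=False` and sorting by more than one column.

```python
import pandas as pd

# Illustrative data; the index labels let us see how rows move.
df = pd.DataFrame({"A": [0.3, 0.1, 0.2], "B": [1, 1, 0]})

asc = df.sort_values(by="A")                    # ascending (default)
desc = df.sort_values(by="A", ascending=False)  # descending
multi = df.sort_values(by=["B", "A"])           # by B first, then by A

print(asc.index.tolist())    # → [1, 2, 0]
print(desc.index.tolist())   # → [0, 2, 1]
print(multi.index.tolist())  # → [2, 1, 0]
print(df.index.tolist())     # → [0, 1, 2]  (original order is unchanged)
```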

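    Sort by axis

    For two-dimensional data, axis 0 runs along the rows and axis 1 runs along the columns.
    The call below sorts the column labels (axis=1), so the column order changes from 'Unnamed: 0', A, B, C, D to A, B, C, D, 'Unnamed: 0'.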

    df.sort_index(axis=1)
                A         B         C         D       Unnamed: 0
    0    0.383275  0.694303  0.756312  0.325656  ('2010-01-02',)
    1    0.789682  0.949437  0.018813  0.289073  ('2010-01-03',)
    2    0.669891  0.575676  0.609218  0.371853  ('2010-01-04',)
    3    0.335516  0.789114  0.201464  0.026223  ('2010-01-05',)
    4    0.470739  0.080339  0.012590  0.636239  ('2010-01-06',)
    ..        ...       ...       ...       ...              ...
    995  0.363259  0.533010  0.240098  0.171152  ('2012-09-23',)
    996  0.238887  0.509116  0.847454  0.383001  ('2012-09-24',)
    997  0.224259  0.645276  0.538157  0.539286  ('2012-09-25',)
    998  0.556111  0.587099  0.599791  0.937187  ('2012-09-26',)
    999  0.990479  0.073511  0.636951  0.485250  ('2012-09-27',)
    [1000 rows x 5 columns]
    
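`sort_index` sorts by labels rather than by cell values. A minimal sketch, using an illustrative frame whose row and column labels both start out unsorted:

```python
import pandas as pd

# Illustrative frame with unsorted column labels and row labels.
df = pd.DataFrame({"C": [1, 2], "A": [3, 4], "B": [5, 6]}, index=[10, 5])

by_cols = df.sort_index(axis=1)                        # sort column labels
by_cols_desc = df.sort_index(axis=1, ascending=False)  # reverse order
by_rows = df.sort_index(axis=0)                        # sort row labels

print(by_cols.columns.tolist())       # → ['A', 'B', 'C']
print(by_cols_desc.columns.tolist())  # → ['C', 'B', 'A']
print(by_rows.index.tolist())         # → [5, 10]
```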

    4. Select data

    Select by label

    df.loc[0] returns the row whose index label is 0.

    df.loc[0]
    Unnamed: 0    ('2010-01-02',)
    A                    0.383275
    B                    0.694303
    C                    0.756312
    D                    0.325656
    Name: 0, dtype: object
    
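Because `loc` works with labels, it behaves the same way when the index is not numeric. The sketch below uses made-up string labels to show that a single label yields a Series, that a label slice includes both endpoints, and that a row label plus a column label yields one value.

```python
import pandas as pd

# Illustrative frame with string row labels.
df = pd.DataFrame({"A": [0.1, 0.2, 0.3, 0.4]}, index=["w", "x", "y", "z"])

row = df.loc["x"]             # one label → a Series for that row
block = df.loc["x":"y"]       # label slice → both "x" and "y" are included
value = df.loc["x", "A"]      # row label + column label → a single value

print(type(row).__name__)     # → Series
print(block.index.tolist())   # → ['x', 'y']
print(value)                  # → 0.2
```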

    Select multiple rows and columns
    df.loc[:, ['A', 'D']] selects all rows of columns A and D:

    df.loc[:, ['A', 'D']]
                A         D
    0    0.383275  0.325656
    1    0.789682  0.289073
    2    0.669891  0.371853
    3    0.335516  0.026223
    4    0.470739  0.636239
    ..        ...       ...
    995  0.363259  0.171152
    996  0.238887  0.383001
    997  0.224259  0.539286
    998  0.556111  0.937187
    999  0.990479  0.485250
    [1000 rows x 2 columns]
    
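The row part of `loc` does not have to be `:`. As a sketch on illustrative data, the example below restricts rows and columns at the same time and shows the difference between passing a single column label and a one-element list.

```python
import pandas as pd

# Illustrative frame with the default 0..3 index.
df = pd.DataFrame({"A": [1, 2, 3, 4], "B": [5, 6, 7, 8], "D": [9, 10, 11, 12]})

sub = df.loc[1:2, ["A", "D"]]   # rows labelled 1 to 2 (inclusive), columns A and D
as_series = df.loc[:, "A"]      # single label → Series
as_frame = df.loc[:, ["A"]]     # list of labels → DataFrame

print(sub.values.tolist())         # → [[2, 10], [3, 11]]
print(type(as_series).__name__)    # → Series
print(type(as_frame).__name__)     # → DataFrame
```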

    Select by position

    Select the third row (position 2; positions start at 0):

    df.iloc[2]
    Unnamed: 0    ('2010-01-04',)
    A                    0.669891
    B                    0.575676
    C                    0.609218
    D                    0.371853
    Name: 2, dtype: object
    

    Select the fourth and fifth rows (positions 3 and 4) of the first and second columns:

    df.iloc[3:5, 0:2]
            Unnamed: 0         A
    3  ('2010-01-05',)  0.335516
    4  ('2010-01-06',)  0.470739
    
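Slices behave differently in `iloc` and `loc`: `iloc` slices are positional and exclude the end, while `loc` slices are label-based and include it. The sketch below uses an illustrative frame whose labels deliberately differ from its positions.

```python
import pandas as pd

# Illustrative frame: labels 100..103 sit at positions 0..3.
df = pd.DataFrame(
    {"A": [10, 20, 30, 40], "B": [1, 2, 3, 4]},
    index=[100, 101, 102, 103],
)

by_position = df.iloc[1:3]    # positions 1 and 2 (end excluded)
by_label = df.loc[101:103]    # labels 101, 102 and 103 (end included)

print(by_position.index.tolist())  # → [101, 102]
print(by_label.index.tolist())     # → [101, 102, 103]
```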

    Select rows 2, 3 and 6 (positions 1, 2 and 5) of the second and fourth columns (A and C):

    df.iloc[[1,2,5], [1,3]]
              A         C
    1  0.789682  0.018813
    2  0.669891  0.609218
    5  0.283397  0.085547
    
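Column positions in `iloc` count every column, including the first one. The following sketch rebuilds a frame with the same column layout as above (the values are made up) to confirm that positions `[1, 3]` pick out A and C.

```python
import pandas as pd

# Illustrative frame with five columns, mirroring the layout above:
# position 0 is the date-like column, positions 1..4 are A..D.
df = pd.DataFrame(
    {
        "date": [f"day{i}" for i in range(6)],
        "A": range(6),
        "B": range(10, 16),
        "C": range(20, 26),
        "D": range(30, 36),
    }
)

picked = df.iloc[[1, 2, 5], [1, 3]]  # rows at positions 1, 2, 5; columns at 1, 3

print(picked.columns.tolist())  # → ['A', 'C']
print(picked.index.tolist())    # → [1, 2, 5]
print(picked.values.tolist())   # → [[1, 21], [2, 22], [5, 25]]
```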

    Conditional selection

    All rows where column A is greater than 0.7:

    df[df.A>0.7]
              Unnamed: 0         A         B         C         D
    1    ('2010-01-03',)  0.789682  0.949437  0.018813  0.289073
    6    ('2010-01-08',)  0.909243  0.089120  0.177339  0.580291
    9    ('2010-01-11',)  0.866655  0.554461  0.498087  0.060520
    13   ('2010-01-15',)  0.733417  0.372051  0.352662  0.134456
    14   ('2010-01-16',)  0.973311  0.204662  0.786659  0.389995
    ..               ...       ...       ...       ...       ...
    978  ('2012-09-06',)  0.855976  0.051711  0.289529  0.540159
    981  ('2012-09-09',)  0.958141  0.729275  0.658488  0.079401
    987  ('2012-09-15',)  0.939371  0.707939  0.068135  0.411151
    994  ('2012-09-22',)  0.935422  0.334959  0.275858  0.652320
    999  ('2012-09-27',)  0.990479  0.073511  0.636951  0.485250
    [298 rows x 5 columns]
    
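The expression `df.A > 0.7` is itself a boolean Series, and the DataFrame keeps the rows where it is True. Several conditions can be combined with `&` (and) and `|` (or), with each condition wrapped in parentheses. A minimal sketch on illustrative data:

```python
import pandas as pd

# Illustrative data.
df = pd.DataFrame({"A": [0.9, 0.5, 0.8, 0.2], "B": [0.1, 0.6, 0.7, 0.9]})

mask = df["A"] > 0.7                           # boolean Series, one entry per row
both = df[(df["A"] > 0.7) & (df["B"] > 0.5)]   # both conditions must hold
either = df[(df["A"] > 0.7) | (df["B"] > 0.8)] # at least one condition holds

print(mask.tolist())          # → [True, False, True, False]
print(both.index.tolist())    # → [2]
print(either.index.tolist())  # → [0, 2, 3]
```
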
    Filter with isin()

    Use isin() to keep the rows whose value in column E is aa or ee:

    df2
            Unnamed: 0         A         B         C         D   E
    0  ('2010-01-02',)  0.383275  0.694303  0.756312  0.325656  aa
    1  ('2010-01-03',)  0.789682  0.949437  0.018813  0.289073  bb
    2  ('2010-01-04',)  0.669891  0.575676  0.609218  0.371853  cc
    3  ('2010-01-05',)  0.335516  0.789114  0.201464  0.026223  dd
    4  ('2010-01-06',)  0.470739  0.080339  0.012590  0.636239  ee
    5  ('2010-01-07',)  0.283397  0.927063  0.085547  0.608495  ff
    df2[df2['E'].isin(['aa', 'ee'])]
            Unnamed: 0         A         B         C         D   E
    0  ('2010-01-02',)  0.383275  0.694303  0.756312  0.325656  aa
    4  ('2010-01-06',)  0.470739  0.080339  0.012590  0.636239  ee
    
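The article does not show how `df2` was built. One plausible way, offered here only as an assumption, is to copy a few rows and attach a new column E. The sketch also shows `~`, which inverts the `isin()` mask to keep the rows that do not match.

```python
import pandas as pd

# Assumed construction of df2: six rows plus a new string column E.
# The A values are shortened stand-ins, not the article's exact data.
df = pd.DataFrame({"A": [0.38, 0.79, 0.67, 0.34, 0.47, 0.28]})
df2 = df.copy()
df2["E"] = ["aa", "bb", "cc", "dd", "ee", "ff"]

selected = df2[df2["E"].isin(["aa", "ee"])]   # rows where E is aa or ee
rest = df2[~df2["E"].isin(["aa", "ee"])]      # every other row

print(selected.index.tolist())  # → [0, 4]
print(rest.index.tolist())      # → [1, 2, 3, 5]
```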

    Summary

    That is all for today. This article gave only a brief introduction to using pandas; the series will continue, so stay tuned!

  • Original article: https://blog.csdn.net/weixin_41986750/article/details/128116864