• day03_pandas_demo


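    Introduction to pandas

    • pandas = panel + data + analysis, i.e. panel data analysis
    • "Panel data" is a term from econometrics for three-dimensional data
    • Built on NumPy, so it benefits from NumPy's fast numerical computation
    • Plotting is built on matplotlib, which makes simple charts easy to draw
    • Provides its own distinctive data structures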

    Why use pandas

    • Convenient data processing
    • Easy file reading
    • Wraps the plotting of matplotlib and the computation of NumPy

    DataFrame

    ## Structure: a two-dimensional array with both a row index and a column index
    import pandas as pd
    import numpy as np
    stock_change = np.random.normal(0, 1, (10, 5))
    stock_change
    
    array([[ 0.52652359, -0.42210135,  0.45506419, -0.1319933 , -0.85892243],
           [-2.80978824,  0.68502373, -0.72809275, -1.56716962,  0.24278934],
           [ 0.1423945 , -0.14913827, -0.30118759,  0.80841083,  0.56448585],
           [-1.11053808, -0.91833131, -0.82696531,  0.33592674, -1.81590623],
           [-0.7972349 , -0.38960542, -0.64822525, -1.67732846, -1.1320404 ],
           [-0.83075257, -0.96589613,  1.21458607, -0.54116531,  0.5416992 ],
           [ 0.2346827 ,  0.38728822,  0.5534352 ,  0.49615629,  0.03958449],
           [ 1.32743523,  0.8559906 , -0.35473279, -0.40734067,  0.23585156],
           [ 2.217162  ,  0.43897264,  1.39278121, -0.17076621,  1.25111371],
           [-1.84123059, -1.00666366,  2.07583716,  1.03959872,  1.20092384]])
    
    stock_change1 = pd.DataFrame(stock_change)  ## default integer row and column labels are added
    stock_change1
    
               0          1          2          3          4
    0  -0.230423  -0.108677   2.116127  -0.405135  -0.600457
    1   1.422377  -1.136674  -0.462335   0.795195  -0.013265
    2   0.708261  -0.197826  -0.177992  -1.078743   0.357987
    3  -0.325432   0.264337   0.856580  -1.035939  -0.228252
    4   0.016734   1.007554   0.454911   0.252380  -0.691905
    5  -0.471790   0.557541  -0.703171   0.344268  -0.083205
    6  -0.013339  -0.300371   1.424916   0.028338   1.101670
    7   0.061438  -0.802730  -0.746614  -0.919655  -1.336464
    8   0.369274   0.515427   0.661126  -0.550260  -1.560633
    9  -1.087217  -1.164305  -0.408748   1.198835  -0.389584
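    # Add a row index ('股票' means "stock")
    stock_code = ['股票{}'.format(i+1) for i in range(10)]
    stock_code
    pd.DataFrame(stock_change, index=stock_code)  ## note: the first argument must be the ndarray, not a DataFrame; a DataFrame would be realigned to the new labels and every value would become NaN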
    
                     0           1           2           3           4
    股票1    -1.796149    0.063469    0.922334   -0.338207    2.157024
    股票2    -0.064218    0.969453    0.223896   -0.795105   -2.020499
    股票3    -0.039286    0.046665   -0.408812   -0.284145    1.852426
    股票4    -1.811617    0.588799   -1.020581   -0.421300   -1.068160
    股票5    -0.867187    0.070269    0.362412    0.595810    0.005319
    股票6    -2.384285    0.185213   -0.094201    0.559706    1.156052
    股票7     1.231396    0.226930   -0.284544    1.056286   -0.765503
    股票8     1.451832   -0.518495    0.115510    0.578233    0.174324
    股票9     1.184461   -0.327693   -1.405433    1.480470    0.049133
    股票10    0.891309    0.780864   -0.858295   -1.154474    0.127319
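
    The NaN pitfall mentioned in the comment above can be reproduced on a tiny frame. This is only a sketch: the data and the names raw and labels are made up for illustration and are not part of the original demo.

```python
import numpy as np
import pandas as pd

raw = np.arange(6).reshape(3, 2)   # illustrative data: [[0, 1], [2, 3], [4, 5]]
labels = ['a', 'b', 'c']           # hypothetical row labels

# Passing the ndarray: the labels are simply attached to the existing rows
from_array = pd.DataFrame(raw, index=labels)

# Passing a DataFrame: its old labels 0..2 are matched against 'a'..'c',
# nothing matches, so every cell comes back as NaN
from_frame = pd.DataFrame(pd.DataFrame(raw), index=labels)

print(from_array)
print(from_frame)
```

    If the data already lives in a DataFrame, assigning to its .index attribute relabels the rows without this realignment.
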
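    • pd.date_range(start=None, end=None, periods=None, freq=None)

      start : start date
      end : end date
      periods : number of dates to generate
      freq : step between dates; defaults to 'D' (one calendar day); 'B' means business days, which skips weekends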

    date = pd.date_range(start='20231021', end=None, periods=5, freq='B')
    date
    
    DatetimeIndex(['2023-10-23', '2023-10-24', '2023-10-25', '2023-10-26',
                   '2023-10-27'],
                  dtype='datetime64[ns]', freq='B')
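
    The start date in the call above, 2023-10-21, is a Saturday, which is why the business-day range begins on the 23rd. A small sketch comparing the two frequencies (the variable names daily and business are illustrative):

```python
import pandas as pd

# 'D' steps through every calendar day, starting on the Saturday itself
daily = pd.date_range(start='2023-10-21', periods=5, freq='D')

# 'B' keeps only business days, so the weekend is skipped
business = pd.date_range(start='2023-10-21', periods=5, freq='B')

print(daily[0].date(), daily[-1].date())
print(business[0].date(), business[-1].date())
```
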
    
    stock_c = pd.DataFrame(stock_change, index=stock_code, columns=date)
    stock_c
    
            2023-10-23  2023-10-24  2023-10-25  2023-10-26  2023-10-27
    股票1    -1.796149    0.063469    0.922334   -0.338207    2.157024
    股票2    -0.064218    0.969453    0.223896   -0.795105   -2.020499
    股票3    -0.039286    0.046665   -0.408812   -0.284145    1.852426
    股票4    -1.811617    0.588799   -1.020581   -0.421300   -1.068160
    股票5    -0.867187    0.070269    0.362412    0.595810    0.005319
    股票6    -2.384285    0.185213   -0.094201    0.559706    1.156052
    股票7     1.231396    0.226930   -0.284544    1.056286   -0.765503
    股票8     1.451832   -0.518495    0.115510    0.578233    0.174324
    股票9     1.184461   -0.327693   -1.405433    1.480470    0.049133
    股票10    0.891309    0.780864   -0.858295   -1.154474    0.127319

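    DataFrame attributes

    • obj.shape : the shape as (rows, columns)
    • obj.index : the row index
    • obj.columns : the column index
    • obj.values : the underlying values as an ndarray
    • obj.T : the transpose (rows and columns swapped)
    • obj.head() : the first rows, 5 by default
    • obj.tail() : the last rows, 5 by default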
    stock_c.shape
    
    (10, 5)
    
    stock_c.index
    
    Index(['股票1', '股票2', '股票3', '股票4', '股票5', '股票6', '股票7', '股票8', '股票9', '股票10'], dtype='object')
    
    stock_c.columns
    
    DatetimeIndex(['2023-10-23', '2023-10-24', '2023-10-25', '2023-10-26',
                   '2023-10-27'],
                  dtype='datetime64[ns]', freq='B')
    
    stock_c.values
    
    array([[-1.7961491 ,  0.06346948,  0.92233413, -0.33820729,  2.15702396],
           [-0.06421753,  0.96945298,  0.22389647, -0.79510515, -2.02049945],
           [-0.03928641,  0.04666511, -0.40881248, -0.28414454,  1.85242648],
           [-1.81161734,  0.5887991 , -1.02058093, -0.42130023, -1.06816   ],
           [-0.86718681,  0.07026887,  0.36241195,  0.59581008,  0.00531913],
           [-2.38428482,  0.18521273, -0.09420118,  0.55970591,  1.15605167],
           [ 1.23139579,  0.22693018, -0.28454449,  1.05628637, -0.76550258],
           [ 1.45183169, -0.51849484,  0.11550995,  0.57823283,  0.17432416],
           [ 1.18446114, -0.3276933 , -1.40543347,  1.48046993,  0.04913251],
           [ 0.89130874,  0.78086438, -0.85829505, -1.15447368,  0.12731851]])
    
    stock_c.T
    
                      股票1        股票2        股票3        股票4        股票5        股票6        股票7        股票8        股票9       股票10
    2023-10-23  -1.796149  -0.064218  -0.039286  -1.811617  -0.867187  -2.384285   1.231396   1.451832   1.184461   0.891309
    2023-10-24   0.063469   0.969453   0.046665   0.588799   0.070269   0.185213   0.226930  -0.518495  -0.327693   0.780864
    2023-10-25   0.922334   0.223896  -0.408812  -1.020581   0.362412  -0.094201  -0.284544   0.115510  -1.405433  -0.858295
    2023-10-26  -0.338207  -0.795105  -0.284145  -0.421300   0.595810   0.559706   1.056286   0.578233   1.480470  -1.154474
    2023-10-27   2.157024  -2.020499   1.852426  -1.068160   0.005319   1.156052  -0.765503   0.174324   0.049133   0.127319
    stock_c.head()
    
            2023-10-23  2023-10-24  2023-10-25  2023-10-26  2023-10-27
    股票1    -1.796149    0.063469    0.922334   -0.338207    2.157024
    股票2    -0.064218    0.969453    0.223896   -0.795105   -2.020499
    股票3    -0.039286    0.046665   -0.408812   -0.284145    1.852426
    股票4    -1.811617    0.588799   -1.020581   -0.421300   -1.068160
    股票5    -0.867187    0.070269    0.362412    0.595810    0.005319
    stock_c.tail()
    
            2023-10-23  2023-10-24  2023-10-25  2023-10-26  2023-10-27
    股票6    -2.384285    0.185213   -0.094201    0.559706    1.156052
    股票7     1.231396    0.226930   -0.284544    1.056286   -0.765503
    股票8     1.451832   -0.518495    0.115510    0.578233    0.174324
    股票9     1.184461   -0.327693   -1.405433    1.480470    0.049133
    股票10    0.891309    0.780864   -0.858295   -1.154474    0.127319

    DataFrame indexes

    Modifying row and column labels

    stock_c.index = [f'股票_{i+1}' for i in range(10)]
    ## A single label cannot be changed in place
    ## stock_c.index[2] = '123'  ## raises TypeError because an Index is immutable
    stock_c
    
             2023-10-23  2023-10-24  2023-10-25  2023-10-26  2023-10-27
    股票_1    -1.796149    0.063469    0.922334   -0.338207    2.157024
    股票_2    -0.064218    0.969453    0.223896   -0.795105   -2.020499
    股票_3    -0.039286    0.046665   -0.408812   -0.284145    1.852426
    股票_4    -1.811617    0.588799   -1.020581   -0.421300   -1.068160
    股票_5    -0.867187    0.070269    0.362412    0.595810    0.005319
    股票_6    -2.384285    0.185213   -0.094201    0.559706    1.156052
    股票_7     1.231396    0.226930   -0.284544    1.056286   -0.765503
    股票_8     1.451832   -0.518495    0.115510    0.578233    0.174324
    股票_9     1.184461   -0.327693   -1.405433    1.480470    0.049133
    股票_10    0.891309    0.780864   -0.858295   -1.154474    0.127319
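
    When only one label needs to change, rename() is one way to get there without rebuilding the whole index. A minimal sketch on made-up data (frame and its labels are illustrative, not from the demo above):

```python
import pandas as pd

frame = pd.DataFrame({'x': [1, 2, 3]}, index=['a', 'b', 'c'])  # illustrative data

# frame.index[1] = 'z' would raise TypeError because Index objects are immutable.
# rename() maps old labels to new ones and returns a new DataFrame.
renamed = frame.rename(index={'b': 'z'})

print(list(renamed.index))
print(list(frame.index))   # the original frame keeps its labels
```
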

    Resetting the index

    ## reset_index(drop=False) moves the old index into a regular column; with drop=True the old index is discarded
    stock_c.reset_index()
    
         index  2023-10-23 00:00:00  2023-10-24 00:00:00  2023-10-25 00:00:00  2023-10-26 00:00:00  2023-10-27 00:00:00
    0   股票_1            -1.796149             0.063469             0.922334            -0.338207             2.157024
    1   股票_2            -0.064218             0.969453             0.223896            -0.795105            -2.020499
    2   股票_3            -0.039286             0.046665            -0.408812            -0.284145             1.852426
    3   股票_4            -1.811617             0.588799            -1.020581            -0.421300            -1.068160
    4   股票_5            -0.867187             0.070269             0.362412             0.595810             0.005319
    5   股票_6            -2.384285             0.185213            -0.094201             0.559706             1.156052
    6   股票_7             1.231396             0.226930            -0.284544             1.056286            -0.765503
    7   股票_8             1.451832            -0.518495             0.115510             0.578233             0.174324
    8   股票_9             1.184461            -0.327693            -1.405433             1.480470             0.049133
    9  股票_10             0.891309             0.780864            -0.858295            -1.154474             0.127319
    stock_c.reset_index(drop=True)
    
       2023-10-23  2023-10-24  2023-10-25  2023-10-26  2023-10-27
    0   -1.796149    0.063469    0.922334   -0.338207    2.157024
    1   -0.064218    0.969453    0.223896   -0.795105   -2.020499
    2   -0.039286    0.046665   -0.408812   -0.284145    1.852426
    3   -1.811617    0.588799   -1.020581   -0.421300   -1.068160
    4   -0.867187    0.070269    0.362412    0.595810    0.005319
    5   -2.384285    0.185213   -0.094201    0.559706    1.156052
    6    1.231396    0.226930   -0.284544    1.056286   -0.765503
    7    1.451832   -0.518495    0.115510    0.578233    0.174324
    8    1.184461   -0.327693   -1.405433    1.480470    0.049133
    9    0.891309    0.780864   -0.858295   -1.154474    0.127319

    Setting a column as the new index

    df = pd.DataFrame({'year':[2021, 2021, 2023, 2024],
                      'month':[1, 2, 3, 4],
                      'sale':[22, 100, 222, 113]})
    df
    
       year  month  sale
    0  2021      1    22
    1  2021      2   100
    2  2023      3   222
    3  2024      4   113
    df.index
    
    RangeIndex(start=0, stop=4, step=1)
    
    ## set_index(keys, drop=True): keys is a column name or a list of column names; drop controls whether those columns are removed from the data once they become the index
    df.set_index(keys=['year'])
    
          month  sale
    year
    2021      1    22
    2021      2   100
    2023      3   222
    2024      4   113
    new_df = df.set_index(keys=['year', 'month'], drop=False)
    new_df
    
                year  month  sale
    year month
    2021 1      2021      1    22
         2      2021      2   100
    2023 3      2023      3   222
    2024 4      2024      4   113
    new_df.index
    
    MultiIndex([(2021, 1),
                (2021, 2),
                (2023, 3),
                (2024, 4)],
               names=['year', 'month'])
    

    MultiIndex

    new_df.index.names
    
    FrozenList(['year', 'month'])
    
    tuples = [('bar', 'one'),
         ('bar', 'two'),
         ('baz', 'one'),
         ('baz', 'two'),
         ('foo', 'one'),
         ('foo', 'two'),
         ('qux', 'one'),
         ('qux', 'two')]
    index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
    index
    
    MultiIndex([('bar', 'one'),
                ('bar', 'two'),
                ('baz', 'one'),
                ('baz', 'two'),
                ('foo', 'one'),
                ('foo', 'two'),
                ('qux', 'one'),
                ('qux', 'two')],
               names=['first', 'second'])
    
    pd.Series(np.random.randn(8), index=index)
    
    first  second
    bar    one      -0.816907
           two       0.660782
    baz    one      -1.032361
           two      -0.595878
    foo    one      -0.658145
           two      -0.891936
    qux    one       0.385722
           two      -0.192622
    dtype: float64
    
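    arrays = [
        np.array(["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"]),
        np.array(["one", "two", "one", "two", "one", "two", "one", "two"]),
    ]
    ## a separate name keeps the year/month/sale DataFrame df available for the indexing examples further down
    multi_df = pd.DataFrame(np.random.randn(8, 4), index=arrays)
    multi_df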
    
                     0          1          2          3
    bar one  -0.162790   2.799107   1.070652   0.034360
        two  -0.283814  -0.551970  -1.270871  -0.813390
    baz one   0.422166   1.380131   0.593804   0.776062
        two   1.888835  -0.176970  -0.568067  -1.343601
    foo one  -0.532914   1.206831  -0.367705   0.912403
        two  -1.576118  -0.082882  -0.122176   1.521598
    qux one  -0.074543  -0.359237   0.309770   0.895598
        two   0.905186   0.670022  -1.549954  -0.539559
    pd.DataFrame(np.random.randn(8, 4), index=index)
    
                          0          1          2          3
    first second
    bar   one     -1.208274  -0.810972  -1.820593  -0.833156
          two     -1.501657   0.683875   0.923321  -0.710930
    baz   one     -0.008496  -3.645099   2.125764   1.406796
          two     -0.440605   0.645926  -1.640536   1.002207
    foo   one      0.264713   0.182264  -1.410930   0.837404
          two      0.683733  -0.300426   1.281374   0.440129
    qux   one     -0.179653  -0.331090  -0.817277   0.583263
          two     -0.305134  -0.934428  -0.479319  -0.179533
    • MultiIndex.from_arrays(): takes a list of arrays, one per level
    • MultiIndex.from_tuples(): takes a list of tuples
    • MultiIndex.from_product(): takes several iterables and builds their Cartesian product
    • MultiIndex.from_frame(): takes a DataFrame
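
    The constructors in the list above that the demo does not call directly can be sketched as follows. The level values reuse the bar/baz and one/two labels from the earlier cells; the variable names are illustrative.

```python
import pandas as pd

# from_product: every combination of the two lists, in order
idx_product = pd.MultiIndex.from_product(
    [['bar', 'baz'], ['one', 'two']], names=['first', 'second']
)

# from_arrays: one array per level, paired element by element
idx_arrays = pd.MultiIndex.from_arrays(
    [['bar', 'bar', 'baz', 'baz'], ['one', 'two', 'one', 'two']],
    names=['first', 'second'],
)

# from_frame: each column of the DataFrame becomes one level
idx_frame = pd.MultiIndex.from_frame(
    pd.DataFrame({'first': ['bar', 'baz'], 'second': ['one', 'two']})
)

print(idx_product.equals(idx_arrays))
print(list(idx_frame))
```

    from_product is usually the shortest form when every combination of the levels is wanted; from_arrays and from_tuples fit data that is already paired up.
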

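    Series

    • obj[flag1][flag2][flag3] : direct indexing, column first, then row
    • obj.loc[] : row first, then column, by label; slices are supported
    • obj.iloc[] : row first, then column, by integer position
    new_df['year'][2021][1]  ## direct indexing must go column first, then row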
    
    2021
    
    df.loc[0:4, 'sale']   ## row first, then column; label slices are allowed
    
    0     22
    1    100
    2    222
    3    113
    Name: sale, dtype: int64
    
    df.iloc[0:3, :5] ## first 3 rows and up to 5 columns (df only has 3); row first, then column, by integer position
    
       year  month  sale
    0  2021      1    22
    1  2021      2   100
    2  2023      3   222
    new_df.iloc[0:3, :5]
    
                year  month  sale
    year month
    2021 1      2021      1    22
         2      2021      2   100
    2023 3      2023      3   222
    sr = pd.Series(np.arange(2,10,2), index=['数值{}'.format(i+1) for i in range(4)])
    sr
    
    数值1    2
    数值2    4
    数值3    6
    数值4    8
    dtype: int32
    
    sr.values
    
    array([2, 4, 6, 8])
    
    sr.index
    
    Index(['数值1', '数值2', '数值3', '数值4'], dtype='object')
    
    Index operations

    import numpy as np
    import pandas as pd
    mydata = np.random.normal(0, 1, (5, 5))
    mydata_index = ['index{}'.format(i+1) for i in range(5)]
    mydata_col =  ['col{}'.format(i+1) for i in range(5)]
    data = pd.DataFrame(mydata, index=mydata_index, columns=mydata_col)
    data
    
                 col1       col2       col3       col4       col5
    index1   0.178961   0.849560  -0.077123  -0.550173  -0.821073
    index2  -0.479774  -0.986681  -0.934725   0.010318  -0.736170
    index3  -0.384807  -0.636485   0.056328  -1.383175  -0.451370
    index4  -0.770427  -1.009373  -0.283575  -0.923803  -1.502639
    index5   0.068687  -0.361269   1.827731   0.034858   1.239907

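    Direct indexing

    ## the scalar outputs in this section were printed in separate runs of the random data, so they differ from the table above
    data['col1']['index1']  ## column first, then row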
    
    -0.31201088599026405
    
    Indexing by label

    data.loc['index1']['col1']
    
    -0.31201088599026405
    
    data.loc['index1', 'col1']
    
    -0.31201088599026405
    
    data.loc[['index1', 'index2'], 'col1']
    
    index1    0.178961
    index2   -0.479774
    Name: col1, dtype: float64
    
    Indexing by position

    data.iloc[1, 0]
    
    -0.2269501796329433
    
    data.iloc[:4, :1]
    
                 col1
    index1   0.178961
    index2  -0.479774
    index3  -0.384807
    index4  -0.770427
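
    One difference between loc and iloc that is easy to miss is how slices treat their end point. A small sketch with fixed, made-up values so the results are reproducible (frame is an illustrative name):

```python
import pandas as pd

frame = pd.DataFrame(
    {'col1': [10, 20, 30], 'col2': [1, 2, 3]},
    index=['index1', 'index2', 'index3'],
)  # illustrative values instead of random data

by_label = frame.loc['index1':'index2', 'col1']   # label slice: the end label is included
by_position = frame.iloc[0:2, 0]                  # position slice: the end position is excluded

print(by_label.tolist())
print(by_position.tolist())
print(frame.loc['index3', 'col2'], frame.iloc[2, 1])  # same cell, addressed two ways
```
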

    Assignment

    data['col1'] = 0.01
    data
    
            col1       col2       col3       col4       col5
    index1  0.01   0.849560  -0.077123  -0.550173  -0.821073
    index2  0.01  -0.986681  -0.934725   0.010318  -0.736170
    index3  0.01  -0.636485   0.056328  -1.383175  -0.451370
    index4  0.01  -1.009373  -0.283575  -0.923803  -1.502639
    index5  0.01  -0.361269   1.827731   0.034858   1.239907
    data.col1 = 0.02
    data
    
            col1       col2       col3       col4       col5
    index1  0.02   0.849560  -0.077123  -0.550173  -0.821073
    index2  0.02  -0.986681  -0.934725   0.010318  -0.736170
    index3  0.02  -0.636485   0.056328  -1.383175  -0.451370
    index4  0.02  -1.009373  -0.283575  -0.923803  -1.502639
    index5  0.02  -0.361269   1.827731   0.034858   1.239907
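    data.loc['index1', 'col1'] = 0.1  ## sets one cell; the chained form data.col1.index1 = 0.1 may write to a temporary copy and is best avoided
    data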
    
            col1       col2       col3       col4       col5
    index1  0.10   0.849560  -0.077123  -0.550173  -0.821073
    index2  0.02  -0.986681  -0.934725   0.010318  -0.736170
    index3  0.02  -0.636485   0.056328  -1.383175  -0.451370
    index4  0.02  -1.009373  -0.283575  -0.923803  -1.502639
    index5  0.02  -0.361269   1.827731   0.034858   1.239907
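    data.loc['index2', 'col1'] = 0.3  ## preferred over data['col1']['index2'] = 0.3, which is chained assignment and may not update data
    data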
    
            col1       col2       col3       col4       col5
    index1  0.10   0.849560  -0.077123  -0.550173  -0.821073
    index2  0.30  -0.986681  -0.934725   0.010318  -0.736170
    index3  0.02  -0.636485   0.056328  -1.383175  -0.451370
    index4  0.02  -1.009373  -0.283575  -0.923803  -1.502639
    index5  0.02  -0.361269   1.827731   0.034858   1.239907
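
    Writing through a single .loc or .iloc call avoids chained assignment such as data['col1']['index2'] = 0.3, which pandas may apply to a temporary copy instead of the original frame. A minimal sketch with made-up values (frame is an illustrative name):

```python
import pandas as pd

frame = pd.DataFrame(
    {'col1': [0.02, 0.02, 0.02], 'col2': [1.0, 2.0, 3.0]},
    index=['index1', 'index2', 'index3'],
)  # illustrative values

frame.loc['index1', 'col1'] = 0.1              # one cell, addressed by label
frame.iloc[1, 0] = 0.3                         # one cell, addressed by position
frame.loc[frame['col2'] > 2.5, 'col2'] = 0.0   # every row that matches a condition

print(frame)
```
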

    Sorting

    Sorting by values

    • obj.sort_values(by=, key=, ascending=) : sort by one key or several keys; ascending=True (the default) sorts ascending, False sorts descending
    data.sort_values(by=['col1'], ascending=False)
    
            col1       col2       col3       col4       col5
    index2  0.30  -0.986681  -0.934725   0.010318  -0.736170
    index1  0.10   0.849560  -0.077123  -0.550173  -0.821073
    index3  0.02  -0.636485   0.056328  -1.383175  -0.451370
    index4  0.02  -1.009373  -0.283575  -0.923803  -1.502639
    index5  0.02  -0.361269   1.827731   0.034858   1.239907
    data.sort_values(by=['col1', 'col2'], ascending=False)
    
            col1       col2       col3       col4       col5
    index2  0.30  -0.986681  -0.934725   0.010318  -0.736170
    index1  0.10   0.849560  -0.077123  -0.550173  -0.821073
    index5  0.02  -0.361269   1.827731   0.034858   1.239907
    index3  0.02  -0.636485   0.056328  -1.383175  -0.451370
    index4  0.02  -1.009373  -0.283575  -0.923803  -1.502639
    sr = data['col1']  ## sorting a Series
    sr
    
    index1    0.10
    index2    0.30
    index3    0.02
    index4    0.02
    index5    0.02
    Name: col1, dtype: float64
    
    sr.sort_values()
    
    index3    0.02
    index4    0.02
    index5    0.02
    index1    0.10
    index2    0.30
    Name: col1, dtype: float64
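
    The by, ascending and key parameters listed above can be combined. The sketch below uses made-up values; frame, mixed and by_abs are illustrative names.

```python
import pandas as pd

frame = pd.DataFrame(
    {'col1': [0.1, 0.3, 0.02, 0.02], 'col2': [5, 6, 7, 8]},
    index=['a', 'b', 'c', 'd'],
)  # illustrative values

# One flag per key: col1 ascending, ties in col1 broken by col2 descending
mixed = frame.sort_values(by=['col1', 'col2'], ascending=[True, False])

# key receives the values being sorted as a Series and returns the values to sort by
by_abs = pd.Series([-3, 1, -2]).sort_values(key=lambda s: s.abs())

print(list(mixed.index))
print(by_abs.tolist())
```
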
    
    Sorting by index

    • obj.sort_index()
    data.sort_index()
    
            col1       col2       col3       col4       col5
    index1  0.10   0.849560  -0.077123  -0.550173  -0.821073
    index2  0.30  -0.986681  -0.934725   0.010318  -0.736170
    index3  0.02  -0.636485   0.056328  -1.383175  -0.451370
    index4  0.02  -1.009373  -0.283575  -0.923803  -1.502639
    index5  0.02  -0.361269   1.827731   0.034858   1.239907
    sr.sort_index()
    
    index1    0.10
    index2    0.30
    index3    0.02
    index4    0.02
    index5    0.02
    Name: col1, dtype: float64
    
    DataFrame operations

    Arithmetic operations

    data.col1 + 2
    
    index1    2.10
    index2    2.30
    index3    2.02
    index4    2.02
    index5    2.02
    Name: col1, dtype: float64
    
    data.col1.add(3)
    
    index1    3.10
    index2    3.30
    index3    3.02
    index4    3.02
    index5    3.02
    Name: col1, dtype: float64
    
    data.sub(10).head(2)  ## data - 10
    
            col1        col2        col3        col4        col5
    index1  -9.9   -9.150440  -10.077123  -10.550173  -10.821073
    index2  -9.7  -10.986681  -10.934725   -9.989682  -10.736170
    data.col1.sub(data.col2).head(3)
    
    index1   -0.749560
    index2    1.286681
    index3    0.656485
    dtype: float64
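
    Arithmetic between two Series or DataFrames is aligned on labels, not on position, and the method forms such as add() and sub() accept a fill_value that the operators do not. A short sketch with made-up data (left and right are illustrative names):

```python
import pandas as pd

left = pd.Series([1.0, 2.0, 3.0], index=['a', 'b', 'c'])
right = pd.Series([10.0, 20.0], index=['b', 'c'])

plain = left + right                     # 'a' has no partner in right, so the result there is NaN
filled = left.add(right, fill_value=0)   # a missing partner is treated as 0

print(plain)
print(filled)
```
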
    
    Logical operations

    Comparison and logical operators: < > | &
    ## Flag the rows where col1 is greater than 0.1
    data.col1 > 0.1
    
    index1    False
    index2     True
    index3    False
    index4    False
    index5    False
    Name: col1, dtype: bool
    
    data[data.col1 > 0.1]
    
            col1       col2       col3      col4     col5
    index2   0.3  -0.986681  -0.934725  0.010318  -0.73617
    data[(data.col1 < 0.1) & (data.col2 < 0.1)]
    
            col1       col2       col3       col4       col5
    index3  0.02  -0.636485   0.056328  -1.383175  -0.451370
    index4  0.02  -1.009373  -0.283575  -0.923803  -1.502639
    index5  0.02  -0.361269   1.827731   0.034858   1.239907
    Logical functions: query() and isin()
    ## obj.query(expr): expr is the query written as a string
    data.query('col1 < 0.1 & col2 < 0.1')
    
            col1       col2       col3       col4       col5
    index3  0.02  -0.636485   0.056328  -1.383175  -0.451370
    index4  0.02  -1.009373  -0.283575  -0.923803  -1.502639
    index5  0.02  -0.361269   1.827731   0.034858   1.239907
    ## obj.isin(values): values is a list; tests whether each element equals one of the listed values
    data[data.col1.isin([0.02, 0.01])]
    
            col1       col2       col3       col4       col5
    index3  0.02  -0.636485   0.056328  -1.383175  -0.451370
    index4  0.02  -1.009373  -0.283575  -0.923803  -1.502639
    index5  0.02  -0.361269   1.827731   0.034858   1.239907
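
    The three filtering styles shown above (boolean mask, query() and isin()) can be compared side by side on fixed data. The values below are made up so the result is reproducible; frame and the via_* names are illustrative.

```python
import pandas as pd

frame = pd.DataFrame(
    {'col1': [0.1, 0.3, 0.02], 'col2': [0.8, -0.9, -0.6]},
    index=['index1', 'index2', 'index3'],
)  # illustrative values

# Each comparison needs its own parentheses because & binds tighter than <
mask = (frame['col1'] < 0.2) & (frame['col2'] < 0.1)
via_mask = frame[mask]

# The same condition written as a query string
via_query = frame.query('col1 < 0.2 and col2 < 0.1')

# Membership test against a list of values
via_isin = frame[frame['col1'].isin([0.02, 0.3])]

print(list(via_mask.index), list(via_query.index), list(via_isin.index))
```
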

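    Statistical operations

    • Statistical functions: count, mean, std, min, max, var, prod, mode, abs, idxmax, idxmin
    • idxmax and idxmin return the index label of the maximum and minimum; they are similar to NumPy's argmax and argmin, which return integer positions
    • obj.describe() returns the count, mean, standard deviation, minimum, quartiles and maximum in one call
    • The cumulative functions cumsum, cummax, cummin and cumprod compute the running sum, maximum, minimum and product over the first n values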
    data.max()
    
    col1    0.300000
    col2    0.849560
    col3    1.827731
    col4    0.034858
    col5    1.239907
    dtype: float64
    
    data.describe()
    
               col1       col2       col3       col4       col5
    count  5.000000   5.000000   5.000000   5.000000   5.000000
    mean   0.092000  -0.428850   0.117727  -0.562395  -0.454269
    std    0.121326   0.763248   1.028901   0.610155   1.022660
    min    0.020000  -1.009373  -0.934725  -1.383175  -1.502639
    25%    0.020000  -0.986681  -0.283575  -0.923803  -0.821073
    50%    0.020000  -0.636485  -0.077123  -0.550173  -0.736170
    75%    0.100000  -0.361269   0.056328   0.010318  -0.451370
    max    0.300000   0.849560   1.827731   0.034858   1.239907
    data.col1.cumsum()
    
    index1    0.10
    index2    0.40
    index3    0.42
    index4    0.44
    index5    0.46
    Name: col1, dtype: float64
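
    The label-versus-position difference between pandas and NumPy mentioned in the list above is easy to check on a small Series. A sketch with made-up values that reuse the index1..index3 labels:

```python
import numpy as np
import pandas as pd

sr = pd.Series([0.1, 0.3, 0.02], index=['index1', 'index2', 'index3'])

label_of_max = sr.idxmax()                 # pandas returns the index label
position_of_max = np.argmax(sr.values)     # NumPy returns the integer position
label_of_min = sr.idxmin()
running_max = sr.cummax().tolist()         # largest value seen so far at each step

print(label_of_max, position_of_max, label_of_min)
print(running_max)
```
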
    
    data.col1.cumsum().plot()
    
    [Figure: line chart of data.col1.cumsum()]

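    Custom operations

    • apply(func, axis=0)

    func: a user-defined function
    axis=0: the default, func receives each column in turn; axis=1 passes each row instead

    # Range (maximum minus minimum) of the col1 and col2 columns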
    data[['col1', 'col2']].apply(lambda x: x.max() - x.min(), axis=0)
    
    col1    0.280000
    col2    1.858932
    dtype: float64
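
    The same function can be applied per row by switching axis. A small sketch on made-up integers so the results are easy to verify by hand (frame, per_column and per_row are illustrative names):

```python
import pandas as pd

frame = pd.DataFrame({'col1': [1, 5], 'col2': [4, 2]}, index=['index1', 'index2'])

value_range = lambda x: x.max() - x.min()

per_column = frame.apply(value_range, axis=0)   # one result per column
per_row = frame.apply(value_range, axis=1)      # one result per row

print(per_column)
print(per_row)
```
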
    
    Plotting with pandas

    • pandas.DataFrame.plot(x=None, y=None, kind='line')
      • kind: line (line chart), bar (vertical bars), barh (horizontal bars), hist (histogram), pie (pie chart), scatter (scatter plot)
    • pandas.Series.plot
    import pandas as pd
    import numpy as np
    mydata = np.random.normal(0, 1, (5, 5))
    mydata_index = ['index{}'.format(i+1) for i in range(5)]
    mydata_col =  ['col{}'.format(i+1) for i in range(5)]
    data = pd.DataFrame(mydata, index=mydata_index, columns=mydata_col)
    data
    
                 col1       col2       col3       col4       col5
    index1   0.706740   1.059931  -0.290975   0.480027   0.869103
    index2  -0.461089   2.278285   0.118369  -0.141536  -1.054914
    index3   0.871724  -1.184708  -0.729994  -0.291118   0.606099
    index4  -0.300855  -0.784571  -1.815973   0.791439   0.861675
    index5   1.380419   1.675737   0.400070   0.130281   0.501257
    data.plot()
    
    [Figure: line chart produced by data.plot(), one line per column]

    data.plot(x='col1', y='col2', kind='barh')
    
    [Figure: horizontal bar chart produced by data.plot(x='col1', y='col2', kind='barh')]

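    Reading and writing CSV files

    pandas.read_csv(filepath_or_buffer, sep=',', header='infer', names=None, usecols=None)

    • DataFrame.to_csv(path_or_buf=None, sep=',', na_rep='', index=True, header=True, mode='w', encoding=None)
      • path_or_buf: path or file object to write the CSV to
      • sep: column separator, a comma by default
      • na_rep: string written for missing values, empty by default
      • index: whether to write the row index, True by default
      • header: whether to write the column names, True by default
      • mode: write mode; 'w' (the default) overwrites the file, 'a' appends to it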
    read_data = pd.read_csv('E:/Project/PyCharm_Projects/pandas_test/read.csv', encoding='GBK')
    read_data
    
      Name  Age
    0   李白   21
    1   杜甫   32
    2  孟浩然   34
    data = {'Name': ['Alice', 'Bob', 'Carol'],
            'Age': [25, 30, 35]}
    df = pd.DataFrame(data)
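    ## append without a header; judging by the output below, output.csv already held a header line and the same three rows from an earlier run
    df.to_csv('E:/Project/PyCharm_Projects/pandas_test/output.csv', index=False, mode='a', header=False)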
    df_read = pd.read_csv('E:/Project/PyCharm_Projects/pandas_test/output.csv')
    df_read
    
        Name  Age
    0  Alice   25
    1    Bob   30
    2  Carol   35
    3  Alice   25
    4    Bob   30
    5  Carol   35
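
    The write-then-append pattern above depends on a file that already exists on the author's machine. A self-contained sketch of the same idea, using a temporary directory and a hypothetical file name so that it runs anywhere:

```python
import os
import tempfile

import pandas as pd

frame = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'output.csv')   # hypothetical file name

    frame.to_csv(path, index=False)                          # first write: header plus two rows
    frame.to_csv(path, index=False, mode='a', header=False)  # append: rows only, no second header

    combined = pd.read_csv(path)
    only_names = pd.read_csv(path, usecols=['Name'])         # read a subset of the columns

print(combined)
print(only_names)
```

    Leaving header=True on the appending call would write the column names into the middle of the file as if they were a data row.
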

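    Reading and writing HDF5 files

    • pandas.read_hdf(path_or_buf, key=None, **kwargs)
      • path_or_buf: file path
      • key: the key to read
      • mode: mode in which the file is opened
      • returns: the selected object
    • DataFrame.to_hdf(path_or_buf, key, **kwargs)
    • HDF5 stores data as key-value pairs and can also hold three-dimensional data
    • It is cross-platform, supports compression, and saves space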

    Reading and writing JSON files

    • pandas.read_json(path_or_buf=None, orient=None, typ='frame', lines=False)

      • Converts JSON data into a pandas DataFrame by default
      • orient: the JSON layout; 'records' is the usual choice
      • lines: whether to treat each line as a separate JSON object
    • DataFrame.to_json(path_or_buf=None, orient=None, lines=False)
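
    A round trip through the records layout with one JSON object per line can be sketched in memory, without touching the file system. The frame below is made up for illustration.

```python
import io

import pandas as pd

frame = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})

# With no path given, to_json returns the JSON text; lines=True requires orient='records'
text = frame.to_json(orient='records', lines=True)

# read_json accepts a file-like object, so the text is wrapped in StringIO
restored = pd.read_json(io.StringIO(text), orient='records', lines=True)

print(text)
print(restored)
```
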

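    Summary

    • pandas basic data processing
    • Introduction to pandas:
      • Panel data; a data processing tool with convenient data handling
      • Builds on NumPy and matplotlib; easy file reading
      • Series: one-dimensional data; DataFrame: two-dimensional data
      • Series attributes: index, values
      • DataFrame attributes: shape, index, columns, values, T
      • Common DataFrame methods: head(), tail()
      • MultiIndex: a way to store multi-dimensional data
    • Basic pandas operations
      • Indexing: direct indexing (column first, then row), by label with loc, by position with iloc
      • Assignment
      • Sorting: sort_values(), sort_index()
    • pandas computations:
      • Arithmetic operations
      • Logical operations: logical operators with boolean indexing, query(), isin()
      • Statistical operations: describe(), min, max, std, idxmax, idxmin, cumsum, cummax
      • Custom operations: apply()
    • Plotting with pandas:
      • df.plot()
      • sr.plot()
    • pandas I/O:
      • csv: pd.read_csv(path, names, usecols), df.to_csv(path, header, mode, index)
      • hdf5: pd.read_hdf(path, key), df.to_hdf(path, key)
      • json: pd.read_json(path, orient, lines), df.to_json(path, orient, lines)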
  • Original article: https://blog.csdn.net/weixin_45205160/article/details/133996861