• Pandas


    Series

    Intro to data structures — pandas 1.4.4 documentation: https://pandas.pydata.org/docs/user_guide/dsintro.html#series

    a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.).

    Creating a series

    s = pd.Series(data, index=index)
    • Data:

      • a Python dict, an ndarray, or a scalar value

    • Index

      • a list of axis labels
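
    A minimal sketch of creating a Series, using made-up values and labels:

    import pandas as pd

    data = [100, 200, 300]                  # example values (made up)
    index = ["alpha", "bravo", "charlie"]   # axis labels (made up)
    s = pd.Series(data, index=index)
    print(s)                                # each value is printed next to its label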

    Access elements

    1. s = pd.Series(data, index=index)
    2. s[0] # access by integer position, like an ndarray
    3. s["charlie"] # access by label, like a dict

    Extracting indices and values

    • series.index
      • returns the index (the axis labels)
    • series.values
      • returns the values only, as a NumPy array
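
    A quick illustration, reusing the hypothetical s from the sketch above:

    s.index    # Index(['alpha', 'bravo', 'charlie'], dtype='object')
    s.values   # array([100, 200, 300])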

    Operation on every value

    1. s = pd.Series(data, index=index)
    2. s / 100 # divide by 100 for each and every value in s

    Dataframe

    Intro to data structures — pandas 1.4.4 documentation: https://pandas.pydata.org/docs/user_guide/dsintro.html#dataframe

    A 2-dimensional labeled data structure with columns of potentially different types.

    Just like a SQL table

    Creating a dataframe

    df = pd.DataFrame(data, index=index, columns=columns)
    • Data:
      • Dict of 1D ndarrays, lists, dicts, or Series

      • 2-D numpy.ndarray

      • Structured or record ndarray

      • Series

      • Another DataFrame

    • Index

      • a list of row labels

      • Note: the index labels each row; it is shown alongside the columns but is not itself a regular column

    • Columns

      • a list of column labels
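
    A minimal sketch of building a dataframe from a dict of lists (the column names and values are made up for illustration):

    import pandas as pd

    data = {
        "country": ["UK", "France", "Germany"],
        "capital": ["London", "Paris", "Berlin"],
        "population": [67.0, 68.0, 83.0],   # in millions (approximate)
    }
    df = pd.DataFrame(data, index=["uk", "fr", "de"])
    print(df)   # index labels run down the left, column labels across the top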

    Projection (access column)

    • Return as a series
      • only one column at a time
      • good for list-like operations such as arithmetic on every value
    1. df = pd.DataFrame(data, index=index, columns=columns)
    2. df['col_name'] # OR
    3. df.col_name # attribute access works when the column name is a valid Python identifier
    • Return as a dataframe
      • multiple columns as a subset
      • good for further dataframe operations
    1. df = pd.DataFrame(data, index=index, columns=columns)
    2. df[['col_name_1', 'col_name_2', ...]]

    Logical operations on column

    1. df = pd.DataFrame(data, index=index, columns=columns)
    2. df.col_name == value # apply this comparison to the selected column for all rows, returning a boolean Series

    Extract indices and values

    • df.index
      • returns the row labels (index)
    • df.values
      • returns the values only, as a NumPy array

    Extract certain rows

    1. df = pd.DataFrame(data, index=index, columns=columns)
    2. df[df.capital == 'London'] # returns a new dataframe containing only the rows that satisfy this condition

    * df.head() # get the top 5 rows

    Add column

    1. df = pd.DataFrame(data, index=index, columns=columns)
    2. df['new_col'] = list # create a new column in the dataframe and assign the list's values row by row
    3. df['new_col'] = df.col_1 + df.col_2 # create a new column populated by combining values from existing columns

    Apply a function to a subset of a dataframe

    • use df.apply(), which is usually faster than an explicit Python for loop
      • it avoids Python-level loop overhead, although it still evaluates the function element by element rather than in parallel
      • series.map works only on a Series and provides the same element-wise functionality as apply
      • df.applymap works only on DataFrames and applies the function to every element of the frame (see the sketch after the code below)
    1. df = pd.DataFrame(data, index=index, columns=columns)
    2. df.capital.apply(lambda x: x.upper()) # uppercase each value in the capital column
    3. df['new_col'] = df.apply(lambda x: f(x['col_1']), axis=1) # axis=1 passes each row to the function
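
    A small sketch of map versus applymap, reusing the hypothetical df from the creation example above (the column names are assumptions, not from the source):

    df['capital'].map(lambda x: x.upper())          # element-wise on a single Series
    df[['country', 'capital']].applymap(str.lower)  # element-wise on every cell of the selected dataframe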

    Merge()

    1. population = pd.DataFrame(data, index=index, columns=columns)
    2. countries = pd.DataFrame(data, index=index, columns=columns)
    3. pd.merge(left=population, right=countries, left_on="col_1", right_on="col_2") # default, inner merge
    4. pd.merge(left=population, right=countries, left_on="col_1", right_on="col_2", how="left") # left merge
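
    A sketch of inner versus left merge on two small, made-up dataframes (all names and values are illustrative):

    import pandas as pd

    population = pd.DataFrame({"country": ["UK", "France", "Spain"],
                               "population": [67.0, 68.0, 47.0]})
    countries = pd.DataFrame({"name": ["UK", "France", "Germany"],
                              "capital": ["London", "Paris", "Berlin"]})

    pd.merge(left=population, right=countries, left_on="country", right_on="name")               # inner: only UK and France match
    pd.merge(left=population, right=countries, left_on="country", right_on="name", how="left")   # left: Spain kept, its capital is NaN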

    Groupby()

    • To split the dataframe into groups based on a column's values
    • Then apply an aggregation function to each group (see the sketch after the note below)
    1. population = pd.DataFrame(data, index=index, columns=columns)
    2. population.groupby('continent')[['area']].mean() # group by continent and compute the mean of the area column, returning a dataframe with only that column
    3. population.groupby('continent').mean()[['area']] # alternative to the above
    4. population.groupby('continent').mean()[['area']].reset_index(drop=False) # move the group key ('continent') from the index back into a regular column

    ** A GroupBy object cannot be converted to a dataframe before aggregating

            pop_countries.groupby('continent').to_frame() # won't work
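
    A small sketch of groupby with made-up data (the continent and area values are rough and illustrative):

    import pandas as pd

    population = pd.DataFrame({"country": ["UK", "France", "Japan", "China"],
                               "continent": ["Europe", "Europe", "Asia", "Asia"],
                               "area": [0.24, 0.64, 0.38, 9.60]})   # million km^2, approximate

    population.groupby('continent')[['area']].mean()
    #             area
    # continent
    # Asia        4.99
    # Europe      0.44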

    Replace NA/NaN values

    missing_df.fillna(0) # replace all NaN with 0

    Drop the rows or columns with NA/NaN values

    dropna(axis=axis, how=how)

    • axis argument determines whether rows or columns that contain missing values are removed.
    • axis = 0: Drop rows which contain missing values.
    • axis = 1: Drop columns which contain missing values.
    • how argument determines whether a row or column is dropped when it has at least one NA or only when all of its values are NA.
    • how = 'any': If any NA values are present, drop that row or column. (default)
    • how = 'all': If all values are NA, drop that row or column.
    1. missing_df.dropna(axis=0) # drop all the rows that have missing values
    2. missing_df.dropna(axis=1) # drop all the cols that have missing values

  • Original source: https://blog.csdn.net/DOITJT/article/details/126801841