A Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating-point numbers, Python objects, etc.).
Creating a Series
s = pd.Series(data, index=index)
Access elements
- s = pd.Series(data, index=index)
- s[0] # positional access (works with the default integer index)
- s["charlie"] # dict-like access by label
Extracting indices and values
Operation on every value
- s = pd.Series(data, index=index)
- s / 100 # divide every value in s by 100 (element-wise, returns a new Series)
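The Series operations above can be sketched end to end. The data and index labels below are made up for illustration:

```python
import pandas as pd

# Hypothetical data: populations (in millions) keyed by city name.
s = pd.Series([9.0, 2.1, 8.4], index=["london", "paris", "nyc"])

first = s.iloc[0]    # positional access (preferred over s[0] when the index is labeled)
paris = s["paris"]   # dict-like access by label
scaled = s / 100     # element-wise division returns a new Series; s is unchanged
```

Note that arithmetic like `s / 100` never modifies `s` in place; it always produces a new Series aligned on the same index.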
Intro to data structures — pandas 1.4.4 documentation
https://pandas.pydata.org/docs/user_guide/dsintro.html#dataframe
A DataFrame is a 2-dimensional labeled data structure with columns of potentially different types.
Just like a SQL table
Creating a dataframe
df = pd.DataFrame(data, index=index, columns=columns)
Dict of 1D ndarrays, lists, dicts, or Series
2-D numpy.ndarray
Structured or record ndarray
A Series
Another DataFrame
Index
a list of row labels
Note: the index labels each row and is stored alongside the row's data (it behaves like an extra label column)
Columns
a list of column labels
Projection (access column)
- df = pd.DataFrame(data, index=index, columns=columns)
- df['col_name'] # OR
- df.col_name # attribute access; only works when the column name is a valid Python identifier
- df = pd.DataFrame(data, index=index, columns=columns)
- df[['col_name_1', 'col_name_2', ...]]
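A minimal sketch of projection, using made-up country data (the column names and values are illustrative, not from the original notes):

```python
import pandas as pd

# Hypothetical data set.
df = pd.DataFrame(
    {"country": ["UK", "France"], "capital": ["London", "Paris"], "population": [67, 68]},
    index=["uk", "fr"],
)

capitals = df["capital"]              # a single column name returns a Series
subset = df[["country", "capital"]]   # a list of column names returns a DataFrame
```

The single-bracket form returns a Series; the double-bracket form returns a DataFrame even when the list contains only one column.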
Logical operations on column
- df = pd.DataFrame(data, index=index, columns=columns)
- df.col_name == value # apply the comparison to every row of the selected column; returns a boolean Series
Extract indices and values
Extract certain rows
- df = pd.DataFrame(data, index=index, columns=columns)
- df[df.capital == 'London'] # return a new DataFrame containing only the rows that satisfy this condition
- df.head() # get the top 5 rows
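Boolean filtering can be sketched with a small made-up DataFrame (the capitals and numbers are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"capital": ["London", "Paris", "Berlin"], "population": [9.0, 2.1, 3.6]})

mask = df.capital == "London"   # boolean Series, one True/False entry per row
london = df[mask]               # new DataFrame keeping only rows where mask is True
top = df.head(2)                # first 2 rows (head() defaults to 5)
```

Filtering never mutates `df`; each expression returns a new object.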
Add column
- df = pd.DataFrame(data, index=index, columns=columns)
- df['new_col'] = list # create new column in the dataframe and apply list values to each row
- df['new_col'] = df.col_1 + df.col_2 # create a new column and populate it by combining values from existing columns
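Both ways of adding a column can be sketched with hypothetical area/population data:

```python
import pandas as pd

df = pd.DataFrame({"area": [244, 551], "population": [67, 68]})

df["continent"] = ["Europe", "Europe"]    # assign a list: one value per row, in order
df["density"] = df.population / df.area   # derive a new column from existing ones
```

When assigning a list, its length must match the number of rows; deriving from existing columns is vectorized, so no loop is needed.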
Apply a function to a subset of a dataframe
Notes:
- series.map works only on a Series and has the same element-wise functionality as apply.
- df.applymap works only on DataFrames and applies a function to every element.
- df = pd.DataFrame(data, index=index, columns=columns)
- df.capital.apply(lambda x: x.upper()) # upper-case each value in the capital column
- df['new_col'] = df.apply(lambda x: f(x['col_1']), axis = 1) # axis=1 applies the function row-wise (one row at a time)
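A sketch of both apply styles, using a made-up DataFrame (column names and the label format are invented for the example):

```python
import pandas as pd

df = pd.DataFrame({"capital": ["london", "paris"], "population": [9.0, 2.1]})

# Element-wise on a single column (a Series): the lambda receives one value at a time.
upper = df.capital.apply(lambda x: x.upper())

# Row-wise on the whole DataFrame: with axis=1 the lambda receives one row (a Series).
df["label"] = df.apply(lambda row: f"{row['capital']} ({row['population']}M)", axis=1)
```

`Series.apply` passes scalar values; `DataFrame.apply(..., axis=1)` passes whole rows, so the function can combine several columns.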
Merge()
- population = pd.DataFrame(data, index=index, columns=columns)
- countries = pd.DataFrame(data, index=index, columns=columns)
-
- pd.merge(left=population, right=countries, left_on="col_1", right_on="col_2") # default, inner merge
- pd.merge(left=population, right=countries, left_on="col_1", right_on="col_2", how="left") # left merge
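The inner vs. left merge behavior can be sketched with two small made-up tables (the country names are illustrative):

```python
import pandas as pd

population = pd.DataFrame({"country": ["UK", "France", "Spain"], "population": [67, 68, 47]})
countries = pd.DataFrame({"name": ["UK", "France", "Germany"], "continent": ["Europe"] * 3})

# Inner merge (default): keep only keys present in BOTH tables -> UK, France.
inner = pd.merge(left=population, right=countries, left_on="country", right_on="name")

# Left merge: keep every row of the left table; unmatched rows (Spain) get NaN.
left = pd.merge(left=population, right=countries, left_on="country", right_on="name", how="left")
```

With `how="left"`, Spain survives but its `name`/`continent` fields are NaN, since it has no match in `countries`.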
Groupby()
- population = pd.DataFrame(data, index=index, columns=columns)
-
- population.groupby('continent')[['area']].mean() # group by continent and compute the mean of each column; return a DataFrame with only the area column
-
- population.groupby('continent').mean()[['area']] # alternative to the above function
-
- population.groupby('continent').mean()[['area']].reset_index(drop=False) # reformatting the index column of the grouped dataframe
** Cannot be converted to dataframe before aggregating
population.groupby('continent').to_frame() # won't work — aggregate first, then convert
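The groupby pattern above can be sketched with made-up continent/area data (the numbers are invented):

```python
import pandas as pd

population = pd.DataFrame({
    "continent": ["Europe", "Europe", "Asia"],
    "area": [244, 551, 9597],
})

# Group rows by continent, then average the area column per group.
means = population.groupby("continent")[["area"]].mean()   # group labels become the index

# reset_index moves the group labels back into an ordinary column.
flat = means.reset_index(drop=False)
```

Note that a GroupBy object itself is lazy: it only produces a DataFrame once an aggregation like `mean()` runs, which is why calling `to_frame()` on it directly fails.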
Replaces NA/NaN values
missing_df.fillna(0) # replace all NaN with 0
Drop the rows or columns with NA/NaN values
dropna(axis=axis, how=how)
The axis argument determines whether rows or columns containing missing values are removed:
- axis = 0: drop rows that contain missing values.
- axis = 1: drop columns that contain missing values.
The how argument determines when a row or column is removed:
- how = 'any': drop the row or column if any NA values are present (default).
- how = 'all': drop the row or column only if all of its values are NA.
- missing_df.dropna(axis=0) # drop all the rows that have missing values
- missing_df.dropna(axis=1) # drop all the cols that have missing values
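The fillna/dropna combinations can be sketched with a small made-up DataFrame containing deliberate gaps:

```python
import pandas as pd
import numpy as np

# Hypothetical data: column "b" is entirely missing, "a" is partially missing.
missing_df = pd.DataFrame({
    "a": [1.0, np.nan],
    "b": [np.nan, np.nan],
    "c": [3.0, 4.0],
})

filled = missing_df.fillna(0)                   # replace every NaN with 0
no_rows = missing_df.dropna(axis=0)             # drop rows with ANY NaN (empty here)
no_cols = missing_df.dropna(axis=1)             # drop columns with ANY NaN -> only "c" survives
all_na = missing_df.dropna(axis=1, how="all")   # drop only fully-NaN columns -> drops "b"
```

All four calls return new objects; `missing_df` itself is untouched unless `inplace=True` is passed.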