• pandas notes 30


    1. What is pandas?

    pandas main page

    pandas installation instructions

    Anaconda distribution of Python (includes pandas)

    How to use the IPython/Jupyter notebook

    2. How do I read a tabular data file into pandas?

    user_cols = ['user_id', 'age', 'gender', 'occupation', 'zip_code']

    users = pd.read_table('http://bit.ly/movieusers', sep='|', header=None, names=user_cols)

    3. How do I select a pandas Series from a DataFrame?

    # select the 'City' Series using bracket notation

    ufo['City']

    # or equivalently, use dot notation

    ufo.City

    ufo['Location'] = ufo.City + ', ' + ufo.State

    4. Why do some pandas commands end with parentheses

    movies.head()

    movies.columns

    movies.describe()

    movies.shape

    movies.dtypes

    # use an optional parameter to the describe method to summarize only 'object' columns

    movies.describe(include=['object'])

    5. How do I rename columns in a pandas DataFrame?

    ufo.rename(columns={'Colors Reported':'Colors_Reported', 'Shape Reported':'Shape_Reported'}, inplace=True)

    ufo_cols = ['city', 'colors reported', 'shape reported', 'state', 'time']

    ufo.columns = ufo_cols

    # replace the column names during the file reading process by using the 'names' parameter

    ufo = pd.read_csv('http://bit.ly/uforeports', header=0, names=ufo_cols)

    ufo.columns = ufo.columns.str.replace(' ', '_')

    6. How do I remove columns from a pandas DataFrame?

    # remove a single column (axis=1 refers to columns)

    ufo.drop('Colors Reported', axis=1, inplace=True)

    # remove multiple columns at once

    ufo.drop(['City', 'State'], axis=1, inplace=True)

    # remove multiple rows at once (axis=0 refers to rows)

    ufo.drop([0, 1], axis=0, inplace=True)

    7. How do I sort a pandas DataFrame or a Series?

    movies.title.sort_values().head()

    movies.title.sort_values(ascending=False).head()

    movies.sort_values('title').head()

    movies.sort_values('title', ascending=False).head()

    movies.sort_values(['content_rating', 'duration']).head()

    8. How do I filter rows of a pandas DataFrame by column value?

    # select the 'genre' Series from the filtered DataFrame

    movies[movies.duration >= 200].genre

    # or equivalently, use the 'loc' method

    movies.loc[movies.duration >= 200, 'genre']

    9. How do I apply multiple filter criteria to a pandas DataFrame?

    movies[(movies.duration >=200) & (movies.genre == 'Drama')]

    movies[(movies.genre == 'Crime') | (movies.genre == 'Drama') | (movies.genre == 'Action')]

    movies[movies.genre.isin(['Crime', 'Drama', 'Action'])]

    10. Your pandas questions answered!

    # specify which columns to include by name

    ufo = pd.read_csv('http://bit.ly/uforeports', usecols=['City', 'State'])

    # or equivalently, specify columns by position

    ufo = pd.read_csv('http://bit.ly/uforeports', usecols=[0, 4])

    ufo = pd.read_csv('http://bit.ly/uforeports', nrows=3)

    # various methods are available to iterate through a DataFrame

    for index, row in ufo.iterrows():

        print(index, row.City, row.State)

    # only include numeric columns in the DataFrame

    import numpy as np

    drinks.select_dtypes(include=[np.number]).dtypes

    drinks.describe(include='all')

    drinks.describe(include=['object', 'float64'])

    11. How do I use the "axis" parameter in pandas?

    When performing a mathematical operation with the axis parameter:

    axis 0 means the operation should "move down" the row axis

    axis 1 means the operation should "move across" the column axis

    # 'index' is an alias for axis 0

    drinks.mean(axis='index')

    # 'columns' is an alias for axis 1

    drinks.mean(axis='columns')

    12. How do I use string methods in pandas?

    # string methods for pandas Series are accessed via 'str'

    orders.item_name.str.upper()

    orders[orders.item_name.str.contains('Chicken')]

    orders.choice_description.str.replace('[', '').str.replace(']', '')

    # many pandas string methods support regular expressions (regex)

    orders.choice_description.str.replace('[

    " role="presentation">
    ]', '')

    13. How do I change the data type of a pandas Series?

    # change the data type of an existing Series

    drinks['beer_servings'] = drinks.beer_servings.astype(float)

    drinks = pd.read_csv('http://bit.ly/drinksbycountry', dtype={'beer_servings':float})

    14. When should I use a "groupby" in pandas?

    drinks.groupby('continent').beer_servings.mean()

    drinks.groupby('continent').beer_servings.agg(['count', 'mean', 'min', 'max'])

    drinks.groupby('continent').mean().plot(kind='bar')

    15. How do I explore a pandas Series?

    movies.genre.value_counts()

    # display percentages instead of raw counts

    movies.genre.value_counts(normalize=True)

    movies.genre.unique()

    # count the number of unique values in the Series

    movies.genre.nunique()

    # compute a cross-tabulation of two Series

    pd.crosstab(movies.genre, movies.content_rating)

    movies.genre.value_counts().plot(kind='bar')

    16. How do I handle missing values in pandas?

    What does "NaN" mean?

    "NaN" is not a string, rather it's a special value: numpy.nan.

    It stands for "Not a Number" and indicates a missing value.

    read_csv detects missing values (by default) when reading the file, and replaces them with this special value.

    ufo.isnull()

    ufo.notnull()

    # count the number of missing values in each Series

    ufo.isnull().sum()

    How to handle missing values depends on the dataset as well as the nature of your analysis. Here are some options:

    ufo.shape

    # if 'any' values are missing in a row, then drop that row

    ufo.dropna(how='any').shape

    # if 'all' values are missing in a row, then drop that row (none are dropped in this case)

    ufo.dropna(how='all').shape

    # if 'any' values are missing in a row (considering only 'City' and 'Shape Reported'), then drop that row

    ufo.dropna(subset=['City', 'Shape Reported'], how='any').shape

    # if 'all' values are missing in a row (considering only 'City' and 'Shape Reported'), then drop that row

    ufo.dropna(subset=['City', 'Shape Reported'], how='all').shape

    # 'value_counts' does not include missing values by default

    ufo['Shape Reported'].value_counts().head()

    # explicitly include missing values

    ufo['Shape Reported'].value_counts(dropna=False).head()

    # fill in missing values with a specified value

    ufo['Shape Reported'].fillna(value='VARIOUS', inplace=True)

    17. What do I need to know about the pandas index?

    # set an existing column as the index

    drinks.set_index('country', inplace=True)

    # country name can now be used for selection

    drinks.loc['Brazil', 'beer_servings']

    # index name is optional

    drinks.index.name = None

    # restore the index name, and move the index back to a column

    drinks.index.name = 'country'

    drinks.reset_index(inplace=True)

    18. What do I need to know about the pandas index?

    drinks.continent.value_counts().sort_index()

    drinks.continent.value_counts()['Africa']

    people = pd.Series([3000000, 85000], index=['Albania', 'Andorra'], name='population')

    # calculate the total annual beer servings for each country

    (drinks.beer_servings * people).head()

    The two Series were aligned by their indexes.

    If a value is missing in either Series, the result is marked as NaN.

    Alignment enables us to easily work with incomplete data.

    # concatenate the 'drinks' DataFrame with the 'population' Series (aligns by the index)

    pd.concat([drinks, people], axis=1).head()

    19. How do I select multiple rows and columns from a pandas DataFrame?

    The loc method is used to select rows and columns by label. You can pass it:

    A single label

    A list of labels

    A slice of labels

    A boolean Series

    A colon (which indicates "all labels")

    The iloc method is used to select rows and columns by integer position. You can pass it:

    A single integer position

    A list of integer positions

    A slice of integer positions

    A colon (which indicates "all integer positions")

    drinks = pd.read_csv('http://bit.ly/drinksbycountry', index_col='country')

    20. When should I use the "inplace" parameter in pandas?

    ufo.drop('City', axis=1, inplace=True)

    21. How do I make my pandas DataFrame smaller and faster?

    drinks.info()

    drinks.info(memory_usage='deep')

    drinks.memory_usage(deep=True)

    # use the 'category' data type (new in pandas 0.15) to store the 'continent' strings as integers

    drinks['continent'] = drinks.continent.astype('category')

    df['quality'] = df.quality.astype('category', categories=['good', 'very good', 'excellent'], ordered=True)

    # comparison operators work with ordered categories

    df.loc[df.quality > 'good', :]

    22. How do I use pandas with scikit-learn to create Kaggle submissions?

    23. More of your pandas questions answered!

    ufo.sample(n=3)

    ufo.sample(n=3, random_state=42)

    # sample 75% of the DataFrame's rows without replacement

    train = ufo.sample(frac=0.75, random_state=99)

    # store the remaining 25% of the rows in another DataFrame

    test = ufo.loc[~ufo.index.isin(train.index), :]

    24. How do I create dummy variables in pandas?

    # create the 'Sex_male' dummy variable using the 'map' method

    train['Sex_male'] = train.Sex.map({'female':0, 'male':1})

    # alternative: use 'get_dummies' to create one column for every possible value

    pd.get_dummies(train.Sex)

    Generally speaking:

    If you have "K" possible values for a categorical feature, you only need "K-1" dummy variables to capture all of the information about that feature.

    One convention is to drop the first dummy variable, which defines that level as the "baseline".

    # add a prefix to identify the source of the dummy variables

    pd.get_dummies(train.Sex, prefix='Sex').iloc[:, 1:].head()

    # use 'get_dummies' with a feature that has 3 possible values

    pd.get_dummies(train.Embarked, prefix='Embarked').head(10)

    # drop the first dummy variable ('C')

    pd.get_dummies(train.Embarked, prefix='Embarked').iloc[:, 1:].head(10)

    25.How do I work with dates and times in pandas? 

    ufo['Time'] = pd.to_datetime(ufo.Time)

    ufo.Time.dt.hour

    ufo.Time.dt.weekday_name

    ufo.Time.dt.dayofyear

    # convert a single string to datetime format (outputs a timestamp object)

    ts = pd.to_datetime('1/1/1999')

    # perform mathematical operations with timestamps (outputs a timedelta object)

    ufo.Time.max() - ufo.Time.min()

    26. How do I find and remove duplicate rows in pandas?

    # read a dataset of movie reviewers into a DataFrame

    user_cols = ['user_id', 'age', 'gender', 'occupation', 'zip_code']

    users = pd.read_table('http://bit.ly/movieusers', sep='|', header=None, names=user_cols, index_col='user_id')

    # detect duplicate zip codes: True if an item is identical to a previous item

    users.zip_code.duplicated()

    # count the duplicate rows

    users.duplicated().sum()

    27. How do I avoid a SettingWithCopyWarning in pandas?

    Solution: Any time you are attempting to create a DataFrame copy, use the copy method.

    28. How do I change display options in pandas?

    # check the current setting for the 'max_rows' option

    pd.get_option('display.max_rows')

    # overwrite the current setting so that all rows will be displayed

    pd.set_option('display.max_rows', None)

    # reset the 'max_rows' option to its default

    pd.reset_option('display.max_rows')

    # the 'max_columns' option is similar to 'max_rows'

    pd.get_option('display.max_columns')

    # overwrite the current setting so that more characters will be displayed

    pd.set_option('display.max_colwidth', 1000)

    # overwrite the 'precision' setting to display 2 digits after the decimal point of 'Fare'

    pd.set_option('display.precision', 2)

    # use a Python format string to specify a comma as the thousands separator

    pd.set_option('display.float_format', '{:,}'.format)

    29. How do I create a pandas DataFrame from another object?

    # optionally specify the order of columns and define the index

    df = pd.DataFrame({'id':[100, 101, 102], 'color':['red', 'blue', 'red']}, columns=['id', 'color'], index=['a', 'b', 'c'])

    # create a DataFrame from a list of lists (each inner list becomes a row)

    pd.DataFrame([[100, 'red'], [101, 'blue'], [102, 'red']], columns=['id', 'color'])

    arr = np.random.rand(4, 2)

    pd.DataFrame(arr, columns=['one', 'two'])

    # create a DataFrame of student IDs (100 through 109) and test scores (random integers between 60 and 100)

    pd.DataFrame({'student':np.arange(100, 110, 1), 'test':np.random.randint(60, 101, 10)})

    # 'set_index' can be chained with the DataFrame constructor to select an index

    pd.DataFrame({'student':np.arange(100, 110, 1), 'test':np.random.randint(60, 101, 10)}).set_index('student')

    # create a new Series using the Series constructor

    s = pd.Series(['round', 'square'], index=['c', 'b'], name='shape')

    30. How do I apply a function to a pandas Series or DataFrame?

    # map 'female' to 0 and 'male' to 1

    train['Sex_num'] = train.Sex.map({'female':0, 'male':1})

    # calculate the length of each string in the 'Name' Series

    train['Name_length'] = train.Name.apply(len)

    # round up each element in the 'Fare' Series to the next integer

    import numpy as np

    train['Fare_ceil'] = train.Fare.apply(np.ceil)

    # alternatively, use a lambda function

    train.Name.str.split(',').apply(lambda x: x[0]).head()

    # convert every DataFrame element into a float

    drinks.loc[:, 'beer_servings':'wine_servings'].applymap(float)

  • 相关阅读:
    如何自定义书写中间件?
    配配网:什么是流通盘,流通盘的相关知识内容是什么
    【SHELL】贪吃蛇
    基于ssm协同过滤推荐算法的电影院购票系统毕业设计源码131124
    【金融】探究财务报告业绩增速与股价变动之间的关系--研报复现及深入探究
    垃圾分类数据集+垃圾分类识别训练代码(支持googlenet, resnet, inception_v3, mobilenet_v2)
    【Kotlin】OkHttp框架实现网络下载
    【Java中23种面试常考的设计模式之桥接模式(Bridge)---结构型模式】
    [附源码]SSM计算机毕业设计学生互评的在线作业管理系统JAVA
    C语言之练习题
  • 原文地址:https://blog.csdn.net/henry_dx/article/details/132790629