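Python Data Analysis with pandas (Part 1)

Getting started with pandas

1. What is pandas

pandas: an open-source Python library for data analysis, data processing, and data visualization

numpy: for numerical computing

scikit-learn: for machine learning

2. Reading data with pandas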

import pandas as pd

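Reading plain-text files

Reading a CSV with the default header row and comma separator

import pandas as pd

csv_path='./result.csv'

# Read the data with pd.read_csv
contents=pd.read_csv(csv_path)

# Show the first few rows
print(contents.head())

# Show the shape of the data as (rows, columns)
print(contents.shape)

# Show the column names
print(contents.columns)

# Show the row index
print(contents.index)

# Show the data type of each column
print(contents.dtypes)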

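Reading a txt file with a custom separator and column names

import pandas as pd

file_path='./date.txt'

file_content=pd.read_csv(
    file_path,
    header=None, # the file has no header row
    sep='\t', # fields are separated by tabs
    names=['date','random1','random2'] # column names to assign
)

print(file_content)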
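
The read_csv examples above depend on files that may not exist on your machine. As a minimal sketch, the same call can be tried on in-memory text through io.StringIO. The sample rows and the variable names raw_text and sample are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical tab-separated text standing in for a file such as ./date.txt
raw_text = "2019-10-12\t10\t20\n2019-10-13\t30\t40\n"

sample = pd.read_csv(
    io.StringIO(raw_text),  # read_csv accepts file-like objects as well as paths
    header=None,            # the text has no header row
    sep='\t',               # fields are separated by tabs
    names=['date', 'random1', 'random2'],
)

print(sample)
print(sample.shape)
```

Because read_csv accepts any file-like object, this pattern is handy for quick experiments and for tests.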

Reading an Excel file

import pandas as pd

excel_path='./date.xlsx'
excel_content=pd.read_excel(excel_path)
print(excel_content)

Reading from a MySQL database

import pymysql
import pandas as pd

connect=pymysql.connect(
    host='127.0.0.1',
    user='root',
    password='123456',
    database='eat'
)

mysql_content=pd.read_sql('select * from user', con=connect)
print(mysql_content)
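
The MySQL example needs a running server and the credentials shown above. If you only want to see how pd.read_sql behaves, a rough equivalent can be sketched with Python's built-in sqlite3 module and an in-memory database. The table contents below are invented for illustration, and SQLite is a stand-in here, not the database used in this post.

```python
import sqlite3

import pandas as pd

# In-memory SQLite database as a stand-in for the MySQL server (assumption)
conn = sqlite3.connect(':memory:')
conn.execute('create table user (id integer, name text)')
conn.executemany('insert into user values (?, ?)', [(1, 'alice'), (2, 'bob')])
conn.commit()

# Same call shape as the MySQL example: a SQL string plus a connection
sqlite_content = pd.read_sql('select * from user', con=conn)
print(sqlite_content)

conn.close()
```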

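3. The pandas data structures DataFrame and Series

DataFrame: two-dimensional data, a whole table with multiple rows and columns

Series: one-dimensional data, a single row or a single column

Series

A Series is an object similar to a one-dimensional array (much like Python's built-in list). It consists of a sequence of values, which may have different types, and an associated sequence of data labels (the index).

import pandas as pd

# 1. A plain list of values is enough to create the simplest Series
series = pd.Series([1, '2', 3.3, True])
print(series)

# By default, the index of a Series is a range of integers
print(series.index)  # get the index
print(series.values)  # get the data

# np.nan represents a missing value

# 2. Create a Series with a label index
series = pd.Series([1, '2', 3.3, True], index=['int', 'str', 'float', 'bool'])
print(series.index)

# The index holds the row labels of the data
# Slicing
print(series['int':'float']) # label slices include both endpoints

print(series[::2])

# Assigning to the index
series.index.name='index'
print(series)

series.index=['a','b','c','d']
series.index.name='letter'
print(series)

# 3. Create a Series from a Python dictionary
dictionary = {'apple': '苹果', 'banana': '香蕉', 'orange': '橘子'}
series = pd.Series(dictionary)
print(series)

# 4. Look up data by label
# This works like a Python dictionary
print(series['apple'])
print(type(series['apple']))
print(series[['apple','orange']]) # access several labels of the Series at once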

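DataFrame

A DataFrame is a tabular data structure:

Each column can hold a different value type (numeric, string, boolean)
It has both a row index (index) and a column index (columns)
It can be viewed as a dictionary of Series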
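
To make the "dictionary of Series" view concrete, here is a small sketch with made-up data. The names s1, s2 and the column labels x and y are illustrative only. Each key becomes a column, selecting a column gives back a Series, and the Series are aligned on their index labels.

```python
import pandas as pd

# Two Series with partly different index labels (illustrative data)
s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([10.0, 20.0], index=['a', 'b'])

# Each dictionary key becomes a column; rows are aligned by index label
df = pd.DataFrame({'x': s1, 'y': s2})
print(df)

# Selecting one column returns a Series again
print(type(df['x']))

# Label 'c' is missing from s2, so that cell is filled with NaN
print(df.loc['c', 'y'])
```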

# Create a DataFrame from a two-dimensional array
import numpy as np
import pandas as pd

# Create the index
date=pd.date_range('20221025',periods=6)
print(date)

# Create the DataFrame
df=pd.DataFrame(np.random.randn(6,4),index=date,columns=['a','b','c','d'])
print(df)

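# Create a DataFrame from a Series

series = pd.Series([1, '2', 3.3, True], index=['int', 'str', 'float', 'bool'])

df=pd.DataFrame(series)
df.index=['a','b','c','d']
df.index.name='字母' # name of the row index ("letter")
df.columns=['index']
df.columns.name='索引' # name of the column index ("index")
print(df)

# Select the column 'index', then the row labeled 'd'
print(df['index'].loc['d'])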

Each dictionary key represents one column, and its value can be any object that can be converted to a Series

A DataFrame only requires that the values within each column share one data type

import numpy as np
import pandas as pd

df=pd.DataFrame({'a':1,'b':pd.Timestamp('20221025'),'c':pd.Series(1,index=list(range(4)),dtype=float),'d':np.array([3]*4,dtype=str),'e':pd.Categorical(['train','train','train','test']),'f':'abc'})
print(df)

# Create a DataFrame from a dictionary of lists
data = {
    'expression': ['happy', 'sad', 'angry', 'frustrate'],
    'year': [2000, 2001, 2002, 2003],
    'month': [1, 2, 3, 4]
}

frame = pd.DataFrame(data)
print(frame)

print(frame.columns)

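Querying a DataFrame:

Selecting a single row or a single column returns a pd.Series

Selecting multiple rows or multiple columns returns a pd.DataFrame

import numpy as np
import pandas as pd

df=pd.DataFrame({'a':1,'b':pd.Timestamp('20221025'),'c':pd.Series(1,index=list(range(4)),dtype=float),'d':np.array([3]*4,dtype=str),'e':pd.Categorical(['train','train','train','test']),'f':'abc'})
print(df)

# First two rows
print(df.head(2))

# Last two rows
print(df.tail(2))

# Data type of each column
print(df.dtypes)

# Underlying values as an array
print(df.values)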

import pandas as pd

data = {
    'expression': ['happy', 'sad', 'angry', 'frustrate'],
    'year': [2000, 2001, 2002, 2003],
    'month': [1, 2, 3, 4]
}

frame = pd.DataFrame(data)
print(frame)

print(frame.columns)

# Selecting one column returns a pd.Series
print(frame['year'])

# Selecting several columns returns a pd.DataFrame
print(frame[['year','expression']])

# Selecting one row returns a pd.Series whose index is the column names
print(frame.loc[1])

# Selecting several rows returns a pd.DataFrame (label slices include both endpoints)
print(frame.loc[1:3])

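4. Querying data with pandas

pandas offers several ways to query data:

loc: select by row and column labels
iloc: select by integer row and column positions
where: keep the values that satisfy a condition and mask the rest
query: filter rows with an expression string

loc can be used both to read data and to overwrite it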
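
Only loc is demonstrated in detail in this post, so here is a brief sketch of the other three methods plus a write through loc. The data is invented. The column names random1 and random2 mirror the ones used elsewhere in this post, and the variable names are illustrative.

```python
import pandas as pd

# Hypothetical data; the index holds date strings as row labels
df = pd.DataFrame(
    {'random1': [50, 150, 450], 'random2': [5, 15, 45]},
    index=['2019-10-12', '2020-10-12', '2022-10-12'],
)

# iloc: select by integer position (first row, second column)
first_cell = df.iloc[0, 1]
print(first_cell)

# where: keep the original shape, values failing the condition become NaN
masked = df.where(df > 100)
print(masked)

# query: filter rows with an expression string
big_rows = df.query('random1 > 100')
print(big_rows)

# loc can also write: overwrite one cell selected by its labels
df.loc['2019-10-12', 'random1'] = 60
print(df)
```

Note that where returns a frame of the same shape as the input, while query and boolean indexing with loc return only the matching rows.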

import pandas as pd

excel_path = './date.xlsx'
excel = pd.read_excel(excel_path)

print(excel)

print(excel['random1'])

# Use the date column ('日期') as the index
excel.set_index('日期', inplace=True)
print(excel.index)
print(excel)

# Selecting one row returns a pd.Series whose index is the column names
print(excel.loc['2019-10-12'])

# 1. Query with a single label
# A single value can be passed for the row or the column to get an exact match

# Returns a single value
print(excel.loc['2019-10-12', 'random1'])

# Returns a Series
print(excel.loc['2019-10-12', ['random1', 'random2']])

# 2. Query with lists of labels
# Returns a Series
print(excel.loc[['2019-10-12', '2022-10-12'], 'random1'])

# Returns a DataFrame
print(excel.loc[['2019-10-12', '2022-10-12'], ['random1', 'random2']])

# 3. Query with label ranges
# Note: the range includes both the start and the end
# Range over the row index
print(excel.loc['2019-10-12':'2022-10-12', 'random2'])

# Range over the column index
print(excel.loc['2019-10-12', 'random1':'random2'])

# Ranges over both rows and columns
print(excel.loc['2019-10-12':'2022-10-12', 'random1':'random2'])

# 4. Query with a conditional expression
# The boolean mask must be as long as the number of rows (or columns)
print(excel.loc[excel['random1'] > 100])

# 5. Query with a function
print(excel.loc[lambda excel: excel['random1'] > 400, :])


def query_my_data(pf):
    # Assumes the index holds date strings; .str is not available on a DatetimeIndex
    return pf.index.str.startswith('2019-10')


print(excel.loc[query_my_data(excel), :])

The essence of functional programming: a function itself can be passed around like a variable

5. Modifying data with pandas

Adding a row

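import pandas as pd

df=pd.read_excel('./date.xlsx')

# The dictionary keys are the column names
dictionary={'日期':'2022-10-25','random1':53,'random2':78}
s=pd.Series(dictionary)
print(s)
s.name=2 # row index label of the new row

# DataFrame.append was removed in pandas 2.0; use pd.concat instead
df=pd.concat([df,s.to_frame().T])
print(df)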

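Deleting a row

import pandas as pd

df=pd.read_excel('./date.xlsx')

dictionary={'日期':'2022-10-25','random1':53,'random2':78}
s=pd.Series(dictionary)

s.name=2 # row index label of the new row

# DataFrame.append was removed in pandas 2.0; use pd.concat instead
df=pd.concat([df,s.to_frame().T])

# Delete every row whose index label is 2
df=df.drop(2)
print(df)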

Adding a column

df=pd.read_excel('./date.xlsx')

# Add a column named '序号' (sequence number)
df['序号']=range(1,len(df)+1)
print(df)

Deleting a column

df=pd.read_excel('./date.xlsx')

# Add a column named '序号' (sequence number)
df['序号']=range(1,len(df)+1)
print(df)

# Delete the column
df=df.drop('序号',axis=1)
print(df)

6. Handling missing values and outliers

Ways to handle missing values

df=pd.read_excel('./date.xlsx')

# Check for missing values
print(df.isnull())
print(df['random1'].notnull())

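Filling missing values

import pandas as pd

df=pd.read_excel('./date.xlsx')

# Show the rows where random1 is missing
print(df[df['random1'].isnull()])
# Fill missing values with the column mean; assign the result back instead of using a chained inplace call
df['random1']=df['random1'].fillna(df['random1'].mean())
print(df)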
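
For readers without the date.xlsx file, the same detect-and-fill steps can be sketched on a small in-memory frame. The values below are made up, and filling with the column mean is just one possible strategy.

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing value
df = pd.DataFrame({'random1': [10.0, np.nan, 30.0]})

# Boolean Series marking the missing entries
print(df['random1'].isnull())

# mean() skips NaN by default, so the fill value is (10 + 30) / 2
df['random1'] = df['random1'].fillna(df['random1'].mean())
print(df)
```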

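Dropping missing values

Parameters of df.dropna():

how='all': drop only the rows or columns whose values are all missing (the default, how='any', drops those with at least one missing value)
inplace=True: modify the DataFrame in place instead of returning a new one
axis=0: drop rows (the default); axis=1 drops columns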
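
The effect of these parameters is easiest to see on a tiny frame. The following is a sketch with invented data; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': [1.0, np.nan, np.nan],
    'b': [4.0, 5.0, np.nan],
    'c': [np.nan, np.nan, np.nan],
})

# Default (how='any', axis=0): drop every row with at least one missing value
rows_any = df.dropna()

# how='all': drop only the rows in which every value is missing
rows_all = df.dropna(how='all')

# axis=1: apply the same rule to columns instead of rows
cols_all = df.dropna(how='all', axis=1)

# inplace=True: modify df itself; the call then returns None
df.dropna(how='all', inplace=True)

print(rows_any)
print(rows_all)
print(cols_all)
print(df)
```

Since column c is entirely missing here, the default call removes every row, which is a common surprise when a table contains an empty column.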

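Handling outliers

Outliers are unreasonable values in a dataset, such as an age of -1 or a laptop that weighs one ton.

Outliers are usually few in number. When removing them does not affect the overall distribution of the data, they can simply be deleted.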
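
One simple way to delete such rows is a boolean filter. The sketch below uses invented data, and the valid age range of 0 to 120 is an assumption chosen for illustration, not a rule from this post.

```python
import pandas as pd

# Hypothetical data containing one impossible age
people = pd.DataFrame({
    'name': ['a', 'b', 'c', 'd'],
    'age': [25, -1, 31, 40],
})

# Keep only rows whose age lies in the assumed valid range
valid = people[(people['age'] >= 0) & (people['age'] <= 120)]
print(valid)
```

The right bounds depend on the data, so it is worth checking how many rows a filter removes before discarding them.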

7. Saving data

df.to_excel('./date.xlsx')

pandas writes Excel files with the to_excel() method

to_excel() writes a DataFrame object to an Excel worksheet

import pandas as pd

df1 = pd.DataFrame({'One': [1, 2, 3]})
df1.to_excel('one.xlsx', sheet_name='One', index=False)  # index=False leaves the index out of the file

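To write several sheets, pass an ExcelWriter object as the first argument of to_excel instead of a file path. With a file path, each call overwrites the file written by the previous one.

ExcelWriter acts as a container: all to_excel calls are collected and the file is saved once at the end, which avoids the overwriting.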

import pandas as pd

df1 = pd.DataFrame({'One': [1, 2, 3]})
df2 = pd.DataFrame({'Two': [4, 5, 6]})

with pd.ExcelWriter('two.xlsx') as writer:
    df1.to_excel(writer, sheet_name='One', index=False)
    df2.to_excel(writer, sheet_name='Two', index=False)


Original article: https://blog.csdn.net/julac/article/details/127528071