Pandas在数据处理方面提供了几乎全部的类SQL查询操作API,例如drop_duplicates()
代表SQL中的union合并去重
但PandasAPI不如直接的SQL简洁易读,例如,Pandas还无法替代的操作之一是非等连接(查询连接条件包含非等号,如大于号、小于号等),需要多步实现,这在SQL中非常简单,PandaSQL可以很好的解决这个问题
PandaSQL是一个可以直接在Python中使用SQL语法查询Pandas数据框Dataframe的框架,PandaSQL底层调用PandasAPI
另外,Python虽然内置有SQLite数据库,但如果我们想使用SQL语句查询DataFrame就必须将原始数据先插入到SQLite
即使PandaSQL允许我们在Pandas数据帧上运行SQL(SQLite语法)查询,但它的性能却不如原生PandasAPI语法
SQLite官网(SQL语法):https://www.sqlite.org/index.html
安装:
pip install -U pandasql
PandaSQL API简介:
'''
sqldf(query, env, db_uri)
- query:使用DataFrame作为表的sql查询
- env:环境globals()或locals(),允许sqldf访问Python环境中的全局或局部变量
- db_uri:SQLAlchemy兼容的数据库URI,默认为sqlite:///:memory:
返回:返回查询结果DataFrame
'''
封装SQL查询:
from pandasql import sqldf
def query(q: str, env=None):
return sqldf(q, env=globals()) if env is None else sqldf(q, env=env)
1)数据准备
import pandas as pd
# 商品促销活动时期表
df_promotion = pd.DataFrame({
"pdt_id": ["p01", "p02", "p03"],
"start_dt": ["10-06-2023", "20-06-2023", "15-08-2023"],
"end_dt": ["12-06-2023", "25-06-2023", "20-08-2023"]
})
# 商品交易数据表
df_trading = pd.DataFrame({
"id": ["p01", "p01", "p02", "p02", "p02", "p03", "p03"],
"trade_dt": ["11-06-2023", "20-06-2023", "15-08-2023", "22-06-2023", "11-06-2023", "17-08-2023", "29-08-2023"],
"sales": [10, 20, 30, 22, 30, 20, 34]
})
print(df_promotion.to_string())
print(df_trading.to_string())
2)需求描述
查询促销期间商品的销售额
3)Pandas解决方案
# 合并
df_merge = pd.merge(df_promotion, df_trading, left_on="pdt_id", right_on="id")
# print(df_merge.to_string())
# 非等连接查询
df_query = df_merge[(df_merge["trade_dt"] >= df_merge["start_dt"]) & (df_merge["trade_dt"] <= df_merge["end_dt"])]
# 选择字段
df_res = df_query[["pdt_id", "start_dt", "end_dt", "trade_dt", "sales"]]
print(df_res.to_string())
'''
pdt_id start_dt end_dt trade_dt sales
0 p01 10-06-2023 12-06-2023 11-06-2023 10
1 p02 20-06-2023 25-06-2023 22-06-2023 22
2 p03 15-08-2023 20-08-2023 17-08-2023 20
'''
4)PandaSQL解决方案
sql = """
select pdt_id, start_dt, end_dt, trade_dt, sales from
df_promotion a join df_trading b
on a.pdt_id = b.id and b.trade_dt >= a.start_dt and b.trade_dt <= a.end_dt
"""
df = query(sql)
print(df.to_string())
'''
pdt_id start_dt end_dt trade_dt sales
0 p01 10-06-2023 12-06-2023 11-06-2023 10
1 p02 20-06-2023 25-06-2