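Before you start the analysis, you need to know
In this example, we walk through the steps on a simple dataset (sample_dataset.parquet) and then apply them to the M5 dataset saved at the end of the earlier EDA.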
import pandas as pd
import numpy as np
# downcast is a package that wraps the downcast function from the earlier EDA notebook; if it is not installed, run: pip install downcast
from downcast import reduce
pd.options.display.max_rows = 999
pd.options.display.max_columns = 999
calendar = pd.read_csv('../data/calendar.csv')
sales_train_evaluation = pd.read_csv('../data/sales_train_evaluation.csv')
sell_prices = pd.read_csv('../data/sell_prices.csv')
example_df = pd.read_parquet('../data/sample_dataset.parquet')
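For readers who do not have the earlier EDA notebook at hand, the idea behind downcasting can be sketched as follows. This is a minimal illustration and not the implementation of the `downcast` package: the helper `downcast_numeric` and the toy columns are hypothetical, and only `pd.to_numeric(..., downcast=...)` from pandas is relied on.

```python
import numpy as np
import pandas as pd


def downcast_numeric(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with numeric columns stored in the smallest safe dtype.

    Hypothetical helper for illustration; the real downcast.reduce may differ.
    """
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_integer_dtype(out[col]):
            out[col] = pd.to_numeric(out[col], downcast="integer")
        elif pd.api.types.is_float_dtype(out[col]):
            out[col] = pd.to_numeric(out[col], downcast="float")
    return out


# Toy data standing in for a sales table (values are made up).
toy = pd.DataFrame({
    "sold": np.array([0, 3, 12, 7], dtype=np.int64),
    "sell_price": np.array([1.5, 2.25, 3.0, 4.75], dtype=np.float64),
})

small = downcast_numeric(toy)
print(toy.dtypes)
print(small.dtypes)
```

On wide tables such as the M5 sales data, smaller dtypes can noticeably reduce memory use, which is presumably why the notebook imports `reduce`.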
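For time series problems, and especially for tabular sales-forecasting data, feature engineering mostly falls into the following categories:
For every feature you create, the key question is what it means for the final prediction: can a given ratio or statistic help the prediction, directly or indirectly?
As you can see, this step and EDA go hand in hand, so under normal circumstances there is no need to split them into two notebooks.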
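To make the point about ratios and statistics concrete, here is a small sketch on made-up data. The column names (`id`, `d`, `sold`, `sell_price`) mirror the M5 style but the values are hypothetical, and the three features are only examples of the kind one might try, not the notebook's own feature set.

```python
import pandas as pd

# Toy long-format sales table: two items, four days each (values are made up).
toy = pd.DataFrame({
    "id": ["A"] * 4 + ["B"] * 4,
    "d": [1, 2, 3, 4] * 2,
    "sold": [2, 0, 4, 6, 1, 1, 3, 5],
    "sell_price": [2.0, 2.0, 1.0, 1.0, 5.0, 5.0, 4.0, 4.0],
}).sort_values(["id", "d"]).reset_index(drop=True)

# Lag feature: the previous day's sales of the same item.
toy["sold_lag_1"] = toy.groupby("id")["sold"].shift(1)

# Rolling statistic computed on the shifted series, so the current day's
# target value is never part of its own feature.
toy["sold_roll_mean_2"] = toy.groupby("id")["sold"].transform(
    lambda s: s.shift(1).rolling(2).mean()
)

# Ratio feature: current price relative to the item's average price.
# Assumes prices are known in advance, as in this toy setup.
toy["price_ratio"] = toy["sell_price"] / toy.groupby("id")["sell_price"].transform("mean")

print(toy)
```

Each feature has a plausible link to the target: recent sales hint at the current level of demand, and a price below the item's usual price may coincide with a promotion.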
Handling NaN
# The event columns are NaN on days without an event; fill them with an explicit label.
cat = ['event_name_1','event_type_1','event_name_2','event_type_2']
for i in cat:
    calendar[i] = calendar[i].fillna('no_event')
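The effect of this fill can be checked on a tiny hand-made frame. The rows below are toy values in the shape of `calendar.csv`, not data read from the file, and `toy_calendar` is a hypothetical name.

```python
import numpy as np
import pandas as pd

# Toy calendar rows: only the middle day has an event.
toy_calendar = pd.DataFrame({
    "date": ["2011-02-05", "2011-02-06", "2011-02-07"],
    "event_name_1": [np.nan, "SuperBowl", np.nan],
    "event_type_1": [np.nan, "Sporting", np.nan],
})

event_cols = ["event_name_1", "event_type_1"]

# Assign the result back instead of using inplace=True on a column selection.
toy_calendar[event_cols] = toy_calendar[event_cols].fillna("no_event")

print(toy_calendar)
```

Using an explicit `'no_event'` label keeps "no event" as its own category when the columns are later encoded, rather than leaving missing values for the model or encoder to handle.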
Adding features. Here, too, we extract a number of time-related features (why?)
# In the M5 calendar, wday 1 is Saturday and wday 2 is Sunday.
calendar['is_weekend'] = (calendar['wday'] <= 2).astype(np.int8)
# Day of the month, taken from the 'YYYY-MM-DD' date string.
calendar['month_day'] = calendar['date'].str.split('-').str[2].astype(np.int8)
calendar['month_week_number'] = (calendar['month_day']-1) // 7 + 1
calendar['month_week_number'] = calendar['month_week_number'].astype(np.int8)
calendar['events_per_day'] = calendar['event_type_1'].map(lambda x: 0 if x=='no_event' else 1)
# Add 1 for days that also have a second event. Using .loc avoids the
# SettingWithCopyWarning raised by chained indexing (calendar['events_per_day'][i] += 1).
calendar.loc[calendar['event_type_2'] != 'no_event', 'events_per_day'] += 1
calendar['events_per_day'] = calendar['events_per_day'].astype(np.int8)
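The calendar features built in this step can be reproduced on a small hand-made frame, which makes it easy to verify each one by eye. One plausible answer to the "why" is that sales tend to follow weekly and monthly rhythms and react to events, so these columns give a model direct access to that structure. The rows below are toy values in the shape of `calendar.csv` (with `wday` 1 as Saturday), and `toy_calendar` is a hypothetical name.

```python
import numpy as np
import pandas as pd

# Toy calendar rows (illustrative values, not read from calendar.csv).
toy_calendar = pd.DataFrame({
    "date": ["2011-01-29", "2011-01-30", "2011-02-06", "2011-02-14", "2011-04-24"],
    "wday": [1, 2, 2, 3, 2],
    "event_type_1": ["no_event", "no_event", "Sporting", "Cultural", "Cultural"],
    "event_type_2": ["no_event", "no_event", "no_event", "no_event", "Religious"],
})

day = pd.to_datetime(toy_calendar["date"]).dt.day

# Weekend flag: wday 1 and 2 are Saturday and Sunday in this encoding.
toy_calendar["is_weekend"] = (toy_calendar["wday"] <= 2).astype(np.int8)

# Day of the month and the week of the month it falls in (days 1-7 are week 1).
toy_calendar["month_day"] = day.astype(np.int8)
toy_calendar["month_week_number"] = ((day - 1) // 7 + 1).astype(np.int8)

# Number of events on each day, counted across both event columns at once.
event_cols = ["event_type_1", "event_type_2"]
toy_calendar["events_per_day"] = (
    (toy_calendar[event_cols] != "no_event").sum(axis=1).astype(np.int8)
)

print(toy_calendar)
```

Counting events with a vectorized comparison gives the same result as incrementing row by row, without writing into a column through chained indexing.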