• Feature Derivation Engineering



    Higher-order polynomial feature derivation

    import pandas as pd
    import numpy as np
    # import warnings
    # warnings.filterwarnings('ignore')
    # Load the two datasets from local CSV files
    diabetes = pd.read_csv(r"D:\本科\kaggle数据挖掘\titanic\diabetes.csv")
    titanic = pd.read_csv(r"D:\本科\kaggle数据挖掘\titanic\train.csv")
    # Replace missing values with 0
    titanic.fillna(0, inplace = True)
    titanic.head()

    diabetes.head()

     Feature processing for diabetes

    from sklearn.preprocessing import PolynomialFeatures
    # Expand the first column into its powers 0 through 5
    tmp = PolynomialFeatures(degree=5).fit_transform(diabetes.iloc[:,0:1])
    tmp = pd.DataFrame(tmp)
    tmp.rename(columns = {0:'preg$^0$', 1:'preg$^1$', 2:'preg$^2$', 3:'preg$^3$', 4:'preg$^4$', 5:'preg$^5$'}, inplace = True)
    # Append the derived columns to the original table
    new_diabetes = pd.concat([diabetes, tmp], axis = 1, join = 'inner')
    new_diabetes.head()

     

     Feature processing for titanic

    # Expand the column at position 9 into its powers 0 through 5
    temp = PolynomialFeatures(degree = 5).fit_transform(titanic.iloc[:,9:10])
    temp = pd.DataFrame(temp)
    temp.rename(columns = {0:'fare$^0$', 1:'fare$^1$', 2:'fare$^2$', 3:'fare$^3$', 4:'fare$^4$', 5:'fare$^5$'}, inplace = True)
    new_titanic = pd.concat([titanic, temp], axis = 1, join = 'inner')
    new_titanic.iloc[0:5, 6:]

    Second-order polynomial derivation on diabetes

    # Take the first two rows and first two columns as a small example
    data = diabetes.iloc[[0,1], [0,1]]
    data

     

    PolynomialFeatures(degree = 2, include_bias = False).fit_transform(data)
    array([[6.0000e+00, 1.4800e+02, 3.6000e+01, 8.8800e+02, 2.1904e+04],
           [1.0000e+00, 8.5000e+01, 1.0000e+00, 8.5000e+01, 7.2250e+03]])

    6    148    36    888    21904
    1    85     1     85     7225

     Third-order polynomial derivation on diabetes

    PolynomialFeatures(degree = 3, include_bias = False).fit_transform(data)

    array([[6.000000e+00, 1.480000e+02, 3.600000e+01, 8.880000e+02,
            2.190400e+04, 2.160000e+02, 5.328000e+03, 1.314240e+05,
            3.241792e+06],
           [1.000000e+00, 8.500000e+01, 1.000000e+00, 8.500000e+01,
            7.225000e+03, 1.000000e+00, 8.500000e+01, 7.225000e+03,
            6.141250e+05]])
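The column order in the two arrays above follows the total degree of each term: first the original columns, then every product of two columns, then every product of three. A minimal sketch that rebuilds the same numbers with NumPy alone is shown below; `poly_terms` is a hypothetical helper written for this illustration, and the input values are copied from the printed output.

```python
from itertools import combinations_with_replacement
import numpy as np

# The two rows and two columns used above (values taken from the printed output)
data = np.array([[6.0, 148.0], [1.0, 85.0]])

def poly_terms(x, degree):
    """Return all column products of total degree 1..degree, without a bias column."""
    n_features = x.shape[1]
    cols = []
    for d in range(1, degree + 1):
        # Each index tuple such as (0, 0, 1) stands for x0 * x0 * x1
        for idx in combinations_with_replacement(range(n_features), d):
            cols.append(np.prod(x[:, list(idx)], axis=1))
    return np.column_stack(cols)

second = poly_terms(data, 2)  # x0, x1, x0^2, x0*x1, x1^2
third = poly_terms(data, 3)   # the five terms above plus x0^3, x0^2*x1, x0*x1^2, x1^3
print(second)
print(third)
```

With two input columns, degree 2 gives 5 derived columns and degree 3 gives 9, so the feature count grows quickly as the degree rises.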

     Cross-combination feature derivation

      

     Video: 5. Introduction to cross-combination feature derivation methods (哔哩哔哩 bilibili)
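This section only points to a video, so the following is a rough sketch of one common reading of cross-combination: pairing the levels of two discrete columns and one-hot encoding each observed pair. The column names and values are hypothetical, and the video may define the method differently.

```python
import pandas as pd

# Hypothetical categorical columns; names and values are for illustration only
df = pd.DataFrame({
    'Sex': ['male', 'female', 'female', 'male'],
    'Pclass': [3, 1, 3, 1],
})

# Build one combined category per row, e.g. 'male_3'
df['Sex_Pclass'] = df['Sex'] + '_' + df['Pclass'].astype(str)

# One-hot encode the combined category: one indicator column per observed pair
crossed = pd.get_dummies(df['Sex_Pclass'], prefix='Sex_Pclass', dtype=int)
print(crossed)
```

Each row has exactly one indicator set to 1, namely the one for its own pair of levels.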

    Time-series features: feature derivation on time data

    I. Converting time data

    1. Define the initial data (put it into a table)

    import pandas as pd
    t = pd.DataFrame()
    # The timestamps start out as plain strings
    t['time'] = ['2022-04-05;13:34:03',
                 '1949-10-01;14:01:06',
                 '1945-08-15;09:00:00']
    t

     

    2. Convert the data (into a real datetime type)

    t['time'] = pd.to_datetime(t['time'])
    t['time']

    0   2022-04-05 13:34:03
    1   1949-10-01 14:01:06
    2   1945-08-15 09:00:00
    Name: time, dtype: datetime64[ns]
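The strings above use a semicolon between the date and the time, which is not a standard separator. A minimal sketch of the same conversion with an explicit `format` argument is shown below, which removes any reliance on automatic format detection; the `raw` and `parsed` names are introduced only for this example.

```python
import pandas as pd

# Same style of input as above: date and time separated by a semicolon
raw = pd.Series(['2022-04-05;13:34:03', '1945-08-15;09:00:00'])

# Spell out the layout so that every string is parsed the same way
parsed = pd.to_datetime(raw, format='%Y-%m-%d;%H:%M:%S')
print(parsed)
```

With an explicit format, a string that does not match the layout raises an error by default.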

     Another example

    t1 = pd.DataFrame()
    t1['time'] = ['1997-07-01',
                  '1999-12-20']
    t1['time'] = pd.to_datetime(t1['time'])
    t1['time']
    0   1997-07-01
    1   1999-12-20
    Name: time, dtype: datetime64[ns]
    # astype returns a new NumPy array; the Series itself is not modified
    t1['time'].values.astype('datetime64[D]')
    t1['time']

    0   1997-07-01
    1   1999-12-20
    Name: time, dtype: datetime64[ns]
    # Same again with hour precision; the Series still keeps its own dtype
    t1['time'].values.astype('datetime64[h]')
    t1['time']
    0   1997-07-01
    1   1999-12-20
    Name: time, dtype: datetime64[ns]
    
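      Common time data types (NumPy dtype strings):
              datetime64[ns] (nanoseconds)
              datetime64[D]  (days)
              datetime64[h]  (hours)
              datetime64[s]  (seconds)
              datetime64[ms] (milliseconds)
         A DataFrame column supports only the [ns] type (this holds for pandas versions before 2.0)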
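The units listed above can be seen directly on the NumPy array behind a Series. The following is a small sketch using the two dates from this section; the names `s`, `days`, and `hours` are introduced only for the illustration.

```python
import numpy as np
import pandas as pd

s = pd.to_datetime(pd.Series(['1997-07-01', '1999-12-20']))

# .values exposes the underlying NumPy array, which can be cast to a coarser unit
days = s.values.astype('datetime64[D]')
hours = s.values.astype('datetime64[h]')

print(days.dtype, hours.dtype)  # units of the converted arrays
print(s.dtype)                  # the Series keeps its own dtype
```

The cast produces new arrays, which is why displaying the Series again after the conversion shows no change.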

    t1['time'].dt.year

    0    1997
    1    1999
    Name: time, dtype: int64
    t1['time'].dt.quarter

    0    3
    1    4
    Name: time, dtype: int64
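The `.dt` accessor exposes further fields in the same way as `year` and `quarter`. A short sketch that derives several columns at once is given below; the derived column names, including `days_since_first`, are choices made for this example rather than names from the original post.

```python
import pandas as pd

t1 = pd.DataFrame({'time': pd.to_datetime(['1997-07-01', '1999-12-20'])})

# Calendar components, one derived column each
t1['year'] = t1['time'].dt.year
t1['month'] = t1['time'].dt.month
t1['day'] = t1['time'].dt.day
t1['dayofweek'] = t1['time'].dt.dayofweek  # Monday is 0, Sunday is 6

# Elapsed time is another common derived feature: days since the earliest row
t1['days_since_first'] = (t1['time'] - t1['time'].min()).dt.days
print(t1)
```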

    Target encoding

    import numpy as np
    # Toy data: 10 rows, a discrete feature 'tenure' and a binary label 'Churn'
    a = np.array([[1,2] * 5, [0, 1, 1, 1, 1, 0, 0, 0, 1, 0]]).T
    train = pd.DataFrame(a, columns = ['tenure', 'Churn'])
    train

     

    from sklearn.model_selection import KFold
    kf = KFold(n_splits = 5)
    # Use separate index names so the DataFrame 'train' is not overwritten
    for train_idx, test_idx in kf.split(a):
        print('train: %s, test: %s' %(train_idx, test_idx))
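The split above is the first half of target encoding; the post stops before the encoding itself. The following is a minimal sketch of one common way to finish it, out-of-fold mean encoding, in which each row receives the mean `Churn` of its `tenure` group computed from the other folds only. The column name `tenure_te` and the fallback to the global mean are assumptions made for this sketch.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

a = np.array([[1, 2] * 5, [0, 1, 1, 1, 1, 0, 0, 0, 1, 0]]).T
train = pd.DataFrame(a, columns=['tenure', 'Churn'])

global_mean = train['Churn'].mean()  # fallback for categories unseen in a fold
train['tenure_te'] = np.nan

kf = KFold(n_splits=5)
for fit_idx, enc_idx in kf.split(train):
    # Per-category label mean, computed without the rows being encoded
    fold_means = train.iloc[fit_idx].groupby('tenure')['Churn'].mean()
    encoded = train.iloc[enc_idx]['tenure'].map(fold_means).fillna(global_mean)
    train.loc[train.index[enc_idx], 'tenure_te'] = encoded.values

print(train)
```

Computing the means out of fold keeps a row's own label out of its encoded value, which reduces label leakage into the feature.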

  • Original article: https://blog.csdn.net/ykrsgs/article/details/126437912