8 Mu Li's d2l (7): Kaggle House Price Prediction + Numerical Stability + Model Initialization and Activation Functions


    1. Kaggle House Price Prediction

    Getting the Data

    First, download the provided dataset from Kaggle, then read the files with pandas.

    import numpy as np
    import pandas as pd 
    import torch
    from torch import nn
    from d2l import torch as d2l
    
    train_data = pd.read_csv("8房价预测\\train.csv")
    test_data = pd.read_csv("8房价预测\\test.csv")
    print(train_data.shape) #(1460, 81)
    print(test_data.shape) #(1459, 80)
    print(train_data.iloc[0:4, [0,1,2,3,-3,-2,-1]])
    '''
       Id  MSSubClass MSZoning  LotFrontage SaleType SaleCondition  SalePrice
    0   1          60       RL         65.0       WD        Normal     208500
    1   2          20       RL         80.0       WD        Normal     181500
    2   3          60       RL         68.0       WD        Normal     223500
    3   4          70       RL         60.0       WD       Abnorml     140000
    '''
    

    Data Preprocessing

    From the sample printed above, we can see an Id column. It carries no predictive information, so we drop it. The last column, SalePrice, is the target we want to predict, so it is also removed from the feature set.

    all_features = pd.concat((train_data.iloc[:, 1:-1], test_data.iloc[:, 1:])) # 1:-1 drops Id and the SalePrice target; the test set has no SalePrice, so only Id is dropped
    print(all_features.iloc[:4, [0,1,2,3,-3,-2,-1]])
    '''
       MSSubClass MSZoning  LotFrontage  LotArea  YrSold SaleType SaleCondition
    0          60       RL         65.0     8450    2008       WD        Normal
    1          20       RL         80.0     9600    2007       WD        Normal
    2          60       RL         68.0    11250    2008       WD        Normal
    3          70       RL         60.0     9550    2006       WD       Abnorml
    '''
    

    When learning NumPy and pandas we already practiced cleaning dirty data. For this dataset, we first standardize every numeric feature by rescaling it to zero mean and unit variance; after standardization the mean of each feature is 0, so the missing (NaN) entries can simply be filled with 0, which is equivalent to filling them with the feature's mean.
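    Concretely, each numeric feature $x$ with mean $\mu$ and standard deviation $\sigma$ is rescaled as

$$x' = \frac{x - \mu}{\sigma}, \qquad \mathbb{E}[x'] = 0, \quad \operatorname{Var}[x'] = 1,$$

    which is why filling the remaining NaN entries with 0 amounts to imputing the mean.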

    # Handle missing/dirty data
    numeric_features = all_features.dtypes[all_features.dtypes != 'object'].index # select the numeric features
    # In this competition the test set is given up front, so train and test were concatenated above;
    # in practice, compute the mean/std on the training set only and apply them to the test set.
    all_features[numeric_features] = all_features[numeric_features].apply(
      lambda x: (x - x.mean()) / x.std() # standardize: zero mean, unit variance
    )
    all_features[numeric_features] = all_features[numeric_features].fillna(0) # after standardization the mean is 0, so filling NaN with 0 imputes the mean
    
    # Handle categorical features by replacing them with one-hot encodings (dummy_na=True treats NaN as its own category)
    all_features = pd.get_dummies(all_features, dummy_na=True)
    print(all_features.shape) # (2919, 331)
    

    Converting Formats

    # Extract NumPy arrays from the pandas DataFrame and convert them to tensors
    n_train = train_data.shape[0]
    train_features = torch.tensor(all_features[:n_train].values,
                                dtype=torch.float32)
    test_features = torch.tensor(all_features[n_train:].values,
                                dtype=torch.float32)
    train_labels = torch.tensor(train_data.SalePrice.values.reshape(-1,1),
                                dtype=torch.float32)
    

    Training

    # Training setup
    loss = nn.MSELoss()
    in_features = train_features.shape[1] # 331
    
    def get_net():
      net = nn.Sequential(nn.Linear(in_features, 1)) # linear regression as a simple baseline
      return net
    

    Previously we measured error as the raw difference between the true value and the prediction, but for house prices an absolute error is misleading: an error of 100k is small for a 1M house yet enormous for a 100k house.

    So here we care about the relative error (y - ŷ)/y instead.
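    Equivalently, working with logarithms turns relative error into absolute error in log space: $|\log \hat{y} - \log y| \le \delta$ is the same as $e^{-\delta} \le \hat{y}/y \le e^{\delta}$. The quantity computed by log_rmse below is

$$\sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\log y_i - \log \hat{y}_i\right)^2}.$$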

    def log_rmse(net, features, labels):
      clipped_preds = torch.clamp(net(features), 1, float('inf')) # clamp predictions to at least 1 so the log stays finite
      rmse = torch.sqrt(loss(torch.log(clipped_preds), torch.log(labels)))
      return rmse.item()
    

    Training Function

    def train(net, train_features, train_labels, test_features, test_labels,
              num_epochs, learning_rate, weight_decay, batch_size):
        train_ls, test_ls = [], []
        train_iter = d2l.load_array((train_features, train_labels), batch_size)
        optimizer = torch.optim.Adam(net.parameters(), lr=learning_rate,
                                     weight_decay=weight_decay) # Adam can be seen as a smoother variant of SGD, less sensitive to the learning rate
        for epoch in range(num_epochs):
            for X, y in train_iter:
                optimizer.zero_grad()
                l = loss(net(X), y)
                l.backward()
                optimizer.step()
            train_ls.append(log_rmse(net, train_features, train_labels))
            if test_labels is not None:
                test_ls.append(log_rmse(net, test_features, test_labels))
        return train_ls, test_ls
    

    K-Fold Cross-Validation

    def get_k_fold_data(k, i, X, y):
        assert k > 1
        fold_size = X.shape[0] // k # each fold holds n // k samples (any remainder is dropped)
        X_train, y_train = None, None
        for j in range(k):
            idx = slice(j * fold_size, (j + 1) * fold_size) 
            X_part, y_part = X[idx, :], y[idx]
            if j == i: # fold i becomes the validation set
                X_valid, y_valid = X_part, y_part
            elif X_train is None: 
                X_train, y_train = X_part, y_part
            else:
                X_train = torch.cat([X_train, X_part], 0)
                y_train = torch.cat([y_train, y_part], 0)
        return X_train, y_train, X_valid, y_valid
    
    # Return the average training and validation errors over the k folds
    def k_fold(k, X_train, y_train, num_epochs, learning_rate, weight_decay,
               batch_size):
        train_l_sum, valid_l_sum = 0, 0
        for i in range(k):
            data = get_k_fold_data(k, i, X_train, y_train)
            net = get_net()
            train_ls, valid_ls = train(net, *data, num_epochs, learning_rate,
                                       weight_decay, batch_size)
            train_l_sum += train_ls[-1]
            valid_l_sum += valid_ls[-1]
            if i == 0:
                d2l.plot(list(range(1, num_epochs + 1)), [train_ls, valid_ls],
                         xlabel='epoch', ylabel='rmse', xlim=[1, num_epochs],
                         legend=['train', 'valid'], yscale='log')
            print(f'fold {i + 1}, train log rmse {float(train_ls[-1]):f}, '
                  f'valid log rmse {float(valid_ls[-1]):f}')
        return train_l_sum / k, valid_l_sum / k
    

    Model Selection

    k, num_epochs, lr, weight_decay, batch_size = 5, 100, 5, 0, 64
    train_l, valid_l = k_fold(k, train_features, train_labels, num_epochs, lr,
                              weight_decay, batch_size)
    print(f'{k}-fold validation: avg train log rmse: {float(train_l):f}, '
          f'avg valid log rmse: {float(valid_l):f}')
    '''
    fold 1, train log rmse 0.168286, valid log rmse 0.170249
    fold 2, train log rmse 0.165790, valid log rmse 0.182056
    fold 3, train log rmse 0.177228, valid log rmse 0.176420
    fold 4, train log rmse 0.194456, valid log rmse 0.189484
    fold 5, train log rmse 0.184219, valid log rmse 0.207091
    5-fold validation: avg train log rmse: 0.177996, avg valid log rmse: 0.185060
    '''
    

    [Figure: train and valid log rmse curves for fold 1, plotted over the 100 epochs]

    Submitting Results on Kaggle

    def train_and_pred(train_features, test_features, train_labels, test_data,
                       num_epochs, lr, weight_decay, batch_size):
        net = get_net()
        train_ls, _ = train(net, train_features, train_labels, None, None,
                            num_epochs, lr, weight_decay, batch_size)
        d2l.plot(np.arange(1, num_epochs + 1), [train_ls], xlabel='epoch',
                 ylabel='log rmse', xlim=[1, num_epochs], yscale='log')
        print(f'train log rmse {float(train_ls[-1]):f}')
        preds = net(test_features).detach().numpy()
        test_data['SalePrice'] = pd.Series(preds.reshape(1, -1)[0])
        submission = pd.concat([test_data['Id'], test_data['SalePrice']], axis=1)
        submission.to_csv('submission.csv', index=False)
    
    train_and_pred(train_features, test_features, train_labels, test_data,
                   num_epochs, lr, weight_decay, batch_size)
    
    '''
    train log rmse 0.200159
    '''
    

    2. Numerical Stability

    Consider a neural network with d layers, where t indexes the layer.

    [Figure: the d-layer network and the chain-rule expansion of its gradient]
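    A reconstruction of the formulas from the lecture (not the original slide): with hidden states $h^t = f_t(h^{t-1})$ and output $y = \ell \circ f_d \circ \cdots \circ f_1(x)$, the gradient with respect to the weights of layer t is

$$\frac{\partial \ell}{\partial W^t} = \frac{\partial \ell}{\partial h^d} \cdot \frac{\partial h^d}{\partial h^{d-1}} \cdots \frac{\partial h^{t+1}}{\partial h^t} \cdot \frac{\partial h^t}{\partial W^t},$$

    a product of d - t matrices, which is exactly where the numerical trouble comes from.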

    Two common numerical-stability problems

    Gradient explosion

    • Values exceed the representable range of the data type
    • Training becomes very sensitive to the learning rate

    Gradient vanishing

    • Gradient values shrink to 0
    • Training makes no progress
    • Especially severe for the bottom (early) layers

    Numerical problems arise when values grow too large or too small. They are common in deep models because the gradient is a product over many layers, as the sketch below illustrates.
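    A minimal sketch (my own, not from the original post) of why long products explode or vanish: repeatedly multiply random matrices, as a stand-in for the d - t Jacobians in the gradient.

    import torch
    
    torch.manual_seed(0)
    
    # product of 100 random 4x4 Gaussian matrices: entries blow up
    P = torch.eye(4)
    for _ in range(100):
        P = P @ torch.normal(0, 1, size=(4, 4))
    print(P.abs().max())  # astronomically large -> gradient explosion
    
    # scale every factor down and the product collapses instead
    P = torch.eye(4)
    for _ in range(100):
        P = P @ (0.1 * torch.normal(0, 1, size=(4, 4)))
    print(P.abs().max())  # ~0 (underflows) -> gradient vanishing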

    3. Model Initialization and Activation Functions

    Keep the variance of each layer constant

    • Treat each layer's outputs and gradients as random variables
    • Keep their means and variances the same across layers

    Weight initialization

    • Randomly initialize the parameters within a reasonable range
    • Numerical instability is most likely at the start of training
    • Initializing from N(0, 0.01) may be fine for a small network, but there is no such guarantee for deep ones.

    Forward variance:

    [Figure: derivation of the forward-variance condition]
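    A reconstruction of the lecture's derivation (assuming i.i.d. weights $w^t_{i,j}$ with zero mean and variance $\gamma_t$, independent of the inputs, and ignoring the activation):

$$\mathbb{E}[h_i^t] = 0, \qquad \operatorname{Var}[h_i^t] = n_{t-1}\,\gamma_t\,\operatorname{Var}[h_j^{t-1}],$$

    so keeping the forward variance constant requires $n_{t-1}\gamma_t = 1$.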

    Backward mean and variance:

    [Figure: derivation of the backward mean/variance condition and the Xavier compromise]
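    The symmetric backward derivation (again a reconstruction) gives the condition $n_t\gamma_t = 1$. Both conditions can only hold at once when $n_{t-1} = n_t$, so Xavier initialization takes the compromise

$$\gamma_t\,\frac{n_{t-1} + n_t}{2} = 1 \quad\Longrightarrow\quad \gamma_t = \frac{2}{n_{t-1} + n_t}.$$

    A short sketch of what this looks like in PyTorch (the layer sizes here are just placeholders):

    import torch
    from torch import nn
    
    def init_xavier(m):
        # apply Xavier (Glorot) initialization to every linear layer
        if isinstance(m, nn.Linear):
            nn.init.xavier_uniform_(m.weight)
            nn.init.zeros_(m.bias)
    
    net = nn.Sequential(nn.Linear(331, 256), nn.ReLU(), nn.Linear(256, 1))
    net.apply(init_xavier)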

  • Original post: https://blog.csdn.net/qq_18824403/article/details/126064061