• umicv cv-summary1 - Modular Implementation of a Fully-Connected Neural Network


    This post summarizes what we have covered so far, using the fully-connected network exercise from Assignment 3.

    In the earlier assignments, the way we built neural networks was fairly simple and not modular. Assignment 3 guides us toward modular implementations of the pieces covered earlier, such as the linear layer, the ReLU layer, the loss layers, and the dropout layer (not covered in the lectures, but it appears in cs231n), as well as the different gradient-descent update rules (SGD, SGD+Momentum, RMSprop, Adam).

    Linear and ReLU Layer Implementation

    import torch


    class Linear(object):
    
      @staticmethod
      def forward(x, w, b):
        """
        Computes the forward pass for a linear (fully-connected) layer.
        The input x has shape (N, d_1, ..., d_k) and contains a minibatch of N
        examples, where each example x[i] has shape (d_1, ..., d_k). We will
        reshape each input into a vector of dimension D = d_1 * ... * d_k, and
        then transform it to an output vector of dimension M.
        Inputs:
        - x: A tensor containing input data, of shape (N, d_1, ..., d_k)
        - w: A tensor of weights, of shape (D, M)
        - b: A tensor of biases, of shape (M,)
        Returns a tuple of:
        - out: output, of shape (N, M)
        - cache: (x, w, b)
        """
        out = None
        out = x.view(x.shape[0],-1).mm(w)+b
        cache = (x, w, b)
        return out, cache
    
      @staticmethod
      def backward(dout, cache):
        """
        Computes the backward pass for a linear layer.
        Inputs:
        - dout: Upstream derivative, of shape (N, M)
        - cache: Tuple of:
          - x: Input data, of shape (N, d_1, ... d_k)
          - w: Weights, of shape (D, M)
          - b: Biases, of shape (M,)
        Returns a tuple of:
        - dx: Gradient with respect to x, of shape (N, d_1, ..., d_k)
        - dw: Gradient with respect to w, of shape (D, M)
        - db: Gradient with respect to b, of shape (M,)
        """
        x, w, b = cache
        dx, dw, db = None, None, None
        db = dout.sum(dim = 0)
        dx = dout.mm(w.t()).view(x.shape)
        dw = x.view(x.shape[0],-1).t().mm(dout)
        return dx, dw, db
    
    
    class ReLU(object):
    
      @staticmethod
      def forward(x):
        """
        Computes the forward pass for a layer of rectified linear units (ReLUs).
        Input:
        - x: Input; a tensor of any shape
        Returns a tuple of:
        - out: Output, a tensor of the same shape as x
        - cache: x
        """
        out = None
        out = x.clone()
        out[out<0] = 0
        cache = x
        return out, cache
    
      @staticmethod
      def backward(dout, cache):
        """
        Computes the backward pass for a layer of rectified linear units (ReLUs).
        Input:
        - dout: Upstream derivatives, of any shape
        - cache: Input x, of same shape as dout
        Returns:
        - dx: Gradient with respect to x
        """
        dx, x = None, cache
        dx = dout.clone()
        dx[x<0] = 0
        return dx
    
    
    class Linear_ReLU(object):
    
      @staticmethod
      def forward(x, w, b):
        """
        Convenience layer that performs a linear transform followed by a ReLU.
    
        Inputs:
        - x: Input to the linear layer
        - w, b: Weights for the linear layer
        Returns a tuple of:
        - out: Output from the ReLU
        - cache: Object to give to the backward pass
        """
        a, fc_cache = Linear.forward(x, w, b)
        out, relu_cache = ReLU.forward(a)
        cache = (fc_cache, relu_cache)
        return out, cache
    
      @staticmethod
      def backward(dout, cache):
        """
        Backward pass for the linear-relu convenience layer
        """
        fc_cache, relu_cache = cache
        da = ReLU.backward(dout, relu_cache)
        dx, dw, db = Linear.backward(da, fc_cache)
        return dx, dw, db
    
    

    From the code above, the forward and backward passes of the linear and ReLU layers can each be implemented separately; the details were discussed in my previous post: https://www.cnblogs.com/dyccyber/p/17764347.html
    The only new wrinkle is that x has to be reshaped into an N x D matrix before it can be matrix-multiplied with w.
    Once Linear and ReLU exist on their own, and because the architecture almost always follows a linear layer immediately with a ReLU, we also build a Linear_ReLU class that fuses the forward and backward passes of the two layers.
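
    To convince ourselves the fused layer is correct, here is a minimal sanity check of my own (not part of the assignment) that compares the analytic gradients from Linear_ReLU.backward against torch.autograd on small double-precision tensors:

    # Sanity check: analytic gradients vs. autograd (my own sketch).
    x = torch.randn(4, 3, 5, dtype=torch.float64)
    w = torch.randn(15, 7, dtype=torch.float64)
    b = torch.randn(7, dtype=torch.float64)
    dout = torch.randn(4, 7, dtype=torch.float64)

    out, cache = Linear_ReLU.forward(x, w, b)
    dx, dw, db = Linear_ReLU.backward(dout, cache)

    # Reference gradients from autograd on the same computation.
    xr, wr, br = x.clone().requires_grad_(), w.clone().requires_grad_(), b.clone().requires_grad_()
    ref = (xr.view(4, -1).mm(wr) + br).clamp(min=0)
    ref.backward(dout)
    print(torch.allclose(dx, xr.grad), torch.allclose(dw, wr.grad), torch.allclose(db, br.grad))
    # Expect: True True True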

    Loss Layer Implementation

    def svm_loss(x, y):
      """
      Computes the loss and gradient for multiclass SVM classification.
      Inputs:
      - x: Input data, of shape (N, C) where x[i, j] is the score for the jth
        class for the ith input.
      - y: Vector of labels, of shape (N,) where y[i] is the label for x[i] and
        0 <= y[i] < C
      Returns a tuple of:
      - loss: Scalar giving the loss
      - dx: Gradient of the loss with respect to x
      """
      N = x.shape[0]
      correct_class_scores = x[torch.arange(N), y]
      margins = (x - correct_class_scores[:, None] + 1.0).clamp(min=0.)
      margins[torch.arange(N), y] = 0.
      loss = margins.sum() / N
      num_pos = (margins > 0).sum(dim=1)
      dx = torch.zeros_like(x)
      dx[margins > 0] = 1.
      dx[torch.arange(N), y] -= num_pos.to(dx.dtype)
      dx /= N
      return loss, dx
    
    
    def softmax_loss(x, y):
      """
      Computes the loss and gradient for softmax classification.
      Inputs:
      - x: Input data, of shape (N, C) where x[i, j] is the score for the jth
        class for the ith input.
      - y: Vector of labels, of shape (N,) where y[i] is the label for x[i] and
        0 <= y[i] < C
      Returns a tuple of:
      - loss: Scalar giving the loss
      - dx: Gradient of the loss with respect to x
      """
      shifted_logits = x - x.max(dim=1, keepdim=True).values
      Z = shifted_logits.exp().sum(dim=1, keepdim=True)
      log_probs = shifted_logits - Z.log()
      probs = log_probs.exp()
      N = x.shape[0]
      loss = (-1.0/ N) * log_probs[torch.arange(N), y].sum()
      dx = probs.clone()
      dx[torch.arange(N), y] -= 1
      dx /= N
      return loss, dx
    

    We already implemented these loss-function layers earlier; the derivations need a bit of matrix calculus, which is covered in these two posts:
    http://giantpandacv.com/academic/算法科普/深度学习基础/SVM Loss以及梯度推导/
    https://blog.csdn.net/qq_27261889/article/details/82915598
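
    For reference, the per-example gradients that the two functions above compute are (written in LaTeX, with $p_{ij}$ the softmax probabilities and $y_i$ the correct class of example $i$):

        L_i^{\mathrm{SVM}} = \sum_{j \ne y_i} \max(0,\; x_{ij} - x_{iy_i} + 1), \qquad
        \frac{\partial L_i^{\mathrm{SVM}}}{\partial x_{ij}} =
        \begin{cases}
          \mathbb{1}[\,x_{ij} - x_{iy_i} + 1 > 0\,] & j \ne y_i \\
          -\sum_{k \ne y_i} \mathbb{1}[\,x_{ik} - x_{iy_i} + 1 > 0\,] & j = y_i
        \end{cases}

        L_i^{\mathrm{softmax}} = -\log p_{iy_i}, \quad
        p_{ij} = \frac{e^{x_{ij}}}{\sum_k e^{x_{ik}}}, \qquad
        \frac{\partial L_i^{\mathrm{softmax}}}{\partial x_{ij}} = p_{ij} - \mathbb{1}[\,j = y_i\,]

    Averaging over the N examples in the minibatch gives the final division by N in both functions.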

    Multi-Layer Neural Network

    For the multi-layer network, we start with the class initialization. Looking at the architecture {linear - relu - [dropout]} x (L - 1) - linear - softmax, there are L - 1 blocks consisting of a linear layer, a ReLU layer, and optionally a dropout layer, and the network ends with a final linear layer whose scores go into a softmax. To initialize, we loop over the hidden layers, creating a weight matrix and bias vector for each one, and then initialize the last linear layer separately, taking care to get the matrix dimensions right (a quick shape check follows the constructor code below).

    class FullyConnectedNet(object):
      """
      A fully-connected neural network with an arbitrary number of hidden layers,
      ReLU nonlinearities, and a softmax loss function.
      For a network with L layers, the architecture will be:
    
      {linear - relu - [dropout]} x (L - 1) - linear - softmax
    
      where dropout is optional, and the {...} block is repeated L - 1 times.
    
      Similar to the TwoLayerNet above, learnable parameters are stored in the
      self.params dictionary and will be learned using the Solver class.
      """
    
      def __init__(self, hidden_dims, input_dim=3*32*32, num_classes=10,
                   dropout=0.0, reg=0.0, weight_scale=1e-2, seed=None,
                   dtype=torch.float, device='cpu'):
        """
        Initialize a new FullyConnectedNet.
    
        Inputs:
        - hidden_dims: A list of integers giving the size of each hidden layer.
        - input_dim: An integer giving the size of the input.
        - num_classes: An integer giving the number of classes to classify.
        - dropout: Scalar between 0 and 1 giving the drop probability for networks
          with dropout. If dropout=0 then the network should not use dropout.
        - reg: Scalar giving L2 regularization strength.
        - weight_scale: Scalar giving the standard deviation for random
          initialization of the weights.
        - seed: If not None, then pass this random seed to the dropout layers. This
          will make the dropout layers deterministic so we can gradient check the
          model.
        - dtype: A torch data type object; all computations will be performed using
          this datatype. float is faster but less accurate, so you should use
          double for numeric gradient checking.
        - device: device to use for computation. 'cpu' or 'cuda'
        """
        self.use_dropout = dropout != 0
        self.reg = reg
        self.num_layers = 1 + len(hidden_dims)
        self.dtype = dtype
        self.params = {}
    
        ############################################################################
        # TODO: Initialize the parameters of the network, storing all values in    #
        # the self.params dictionary. Store weights and biases for the first layer #
        # in W1 and b1; for the second layer use W2 and b2, etc. Weights should be #
        # initialized from a normal distribution centered at 0 with standard       #
        # deviation equal to weight_scale. Biases should be initialized to zero.   #
        ############################################################################
        # Replace "pass" statement with your code
        # Hidden layers: W_i has shape (prev_dim, hidden_dim) and b_i has shape (hidden_dim,).
        last_dim = input_dim
        for n, hidden_dim in enumerate(hidden_dims):
          i = n + 1
          self.params['W{}'.format(i)] = weight_scale * torch.randn(last_dim, hidden_dim, dtype=dtype, device=device)
          self.params['b{}'.format(i)] = torch.zeros(hidden_dim, dtype=dtype, device=device)
          last_dim = hidden_dim
        # Final linear layer mapping the last hidden dimension to the class scores.
        i += 1
        self.params['W{}'.format(i)] = weight_scale * torch.randn(last_dim, num_classes, dtype=dtype, device=device)
        self.params['b{}'.format(i)] = torch.zeros(num_classes, dtype=dtype, device=device)
       
    
        # When using dropout we need to pass a dropout_param dictionary to each
        # dropout layer so that the layer knows the dropout probability and the mode
        # (train / test). You can pass the same dropout_param to each dropout layer.
        self.dropout_param = {}
        if self.use_dropout:
          self.dropout_param = {'mode': 'train', 'p': dropout}
          if seed is not None:
            self.dropout_param['seed'] = seed
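
    As a quick check of the initialization (my own sketch, once the full class is assembled), the parameter shapes for hidden_dims=[100, 50] with the default CIFAR-sized input come out as follows:

    net = FullyConnectedNet(hidden_dims=[100, 50])
    for name in sorted(net.params):
      print(name, tuple(net.params[name].shape))
    # W1 (3072, 100)  b1 (100,)
    # W2 (100, 50)    b2 (50,)
    # W3 (50, 10)     b3 (10,)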
    
    
    

    Next, we define save and load functions to store and restore the model parameters and related settings:

      def save(self, path):
        checkpoint = {
          'reg': self.reg,
          'dtype': self.dtype,
          'params': self.params,
          'num_layers': self.num_layers,
          'use_dropout': self.use_dropout,
          'dropout_param': self.dropout_param,
        }
          
        torch.save(checkpoint, path)
        print("Saved in {}".format(path))
    
    
      def load(self, path, dtype, device):
        checkpoint = torch.load(path, map_location='cpu')
        self.params = checkpoint['params']
        self.dtype = dtype
        self.reg = checkpoint['reg']
        self.num_layers = checkpoint['num_layers']
        self.use_dropout = checkpoint['use_dropout']
        self.dropout_param = checkpoint['dropout_param']
    
        for p in self.params:
          self.params[p] = self.params[p].type(dtype).to(device)
    
        print("load checkpoint file: {}".format(path))
    

    Finally, the forward and backward passes. These simply chain the basic Linear and ReLU layers implemented above; the only thing to watch is the order of the layers in the architecture.

      def loss(self, X, y=None):
        """
        Compute loss and gradient for the fully-connected net.
        Input / output: Same as TwoLayerNet above.
        """
        X = X.to(self.dtype)
        mode = 'test' if y is None else 'train'
    
        # Set train/test mode for batchnorm params and dropout param since they
        # behave differently during training and testing.
        if self.use_dropout:
          self.dropout_param['mode'] = mode
        scores = None
        ############################################################################
        # TODO: Implement the forward pass for the fully-connected net, computing  #
        # the class scores for X and storing them in the scores variable.          #
        #                                                                          #
        # When using dropout, you'll need to pass self.dropout_param to each       #
        # dropout forward pass.                                                    #
        ############################################################################
        # Replace "pass" statement with your code
        # Forward pass: (L - 1) Linear_ReLU (+ optional Dropout) blocks, then a final Linear.
        cache_dict = {}
        last_out = X
        for n in range(self.num_layers - 1):
          i = n + 1
          last_out, cache_dict['cache_LR{}'.format(i)] = Linear_ReLU.forward(
              last_out, self.params['W{}'.format(i)], self.params['b{}'.format(i)])
          if self.use_dropout:
            last_out, cache_dict['cache_Dropout{}'.format(i)] = Dropout.forward(last_out, self.dropout_param)
        i += 1
        last_out, cache_dict['cache_L{}'.format(i)] = Linear.forward(
            last_out, self.params['W{}'.format(i)], self.params['b{}'.format(i)])
        scores = last_out
    
        # If test mode return early
        if mode == 'test':
          return scores
    
        loss, grads = 0.0, {}
        ############################################################################
        # TODO: Implement the backward pass for the fully-connected net. Store the #
        # loss in the loss variable and gradients in the grads dictionary. Compute #
        # data loss using softmax, and make sure that grads[k] holds the gradients #
        # for self.params[k]. Don't forget to add L2 regularization!               #
        # NOTE: To ensure that your implementation matches ours and you pass the   #
        # automated tests, make sure that your L2 regularization includes a factor #
        # of 0.5 to simplify the expression for the gradient.                      #
        ############################################################################
        # Replace "pass" statement with your code
        # Backward pass: softmax loss, then walk the layers in reverse, adding the
        # L2 regularization term reg * ||W||^2 (and its gradient 2 * reg * W) per layer.
        loss, dout = softmax_loss(scores, y)
        loss += (self.params['W{}'.format(i)] * self.params['W{}'.format(i)]).sum() * self.reg
        last_dout, dw, db = Linear.backward(dout, cache_dict['cache_L{}'.format(i)])
        grads['W{}'.format(i)] = dw + 2 * self.params['W{}'.format(i)] * self.reg
        grads['b{}'.format(i)] = db
        for n in reversed(range(self.num_layers - 1)):
          i = n + 1
          if self.use_dropout:
            last_dout = Dropout.backward(last_dout, cache_dict['cache_Dropout{}'.format(i)])
          last_dout, dw, db = Linear_ReLU.backward(last_dout, cache_dict['cache_LR{}'.format(i)])
          grads['W{}'.format(i)] = dw + 2 * self.params['W{}'.format(i)] * self.reg
          grads['b{}'.format(i)] = db
          loss += (self.params['W{}'.format(i)] * self.params['W{}'.format(i)]).sum() * self.reg
        return loss, grads
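
    Putting the pieces together, the loss and weight gradients computed above are

        L = L_{\mathrm{softmax}} + \mathrm{reg} \cdot \sum_{l} \lVert W_l \rVert_2^2, \qquad
        \frac{\partial L}{\partial W_l} = \frac{\partial L_{\mathrm{softmax}}}{\partial W_l} + 2\,\mathrm{reg}\cdot W_l

    with reg = self.reg; the bias terms are not regularized, and since the 0.5 convention mentioned in the template comment is not used here, the gradient carries the explicit factor of 2.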
    

    Gradient-Descent Update Rules

    Implementations of SGD, SGD+Momentum, RMSprop, and Adam (momentum + RMSprop + bias correction).
    The underlying ideas are covered in an earlier post: https://www.cnblogs.com/dyccyber/p/17759697.html
    One point worth highlighting: Adam includes bias-correction terms; without them, the zero-initialized moment estimates would make the effective step size too large during the first few iterations.
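
    For reference, the update rules the functions below implement, written in their standard form (learning rate $\eta$), are:

        \text{SGD+Momentum:}\quad v \leftarrow \mu v - \eta\, dw, \qquad w \leftarrow w + v

        \text{RMSprop:}\quad c \leftarrow \rho\, c + (1-\rho)\, dw^{2}, \qquad
        w \leftarrow w - \frac{\eta\, dw}{\sqrt{c} + \epsilon}

        \text{Adam:}\quad m \leftarrow \beta_1 m + (1-\beta_1)\, dw, \quad
        v \leftarrow \beta_2 v + (1-\beta_2)\, dw^{2}, \quad
        \hat m = \frac{m}{1-\beta_1^{t}}, \quad \hat v = \frac{v}{1-\beta_2^{t}}, \quad
        w \leftarrow w - \frac{\eta\, \hat m}{\sqrt{\hat v} + \epsilon}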

    def sgd(w, dw, config=None):
      """
      Performs vanilla stochastic gradient descent.
      config format:
      - learning_rate: Scalar learning rate.
      """
      if config is None: config = {}
      config.setdefault('learning_rate', 1e-2)

      w -= config['learning_rate'] * dw
      return w, config
    
    def sgd_momentum(w, dw, config=None):
      """
      Performs stochastic gradient descent with momentum.
      config format:
      - learning_rate: Scalar learning rate.
      - momentum: Scalar between 0 and 1 giving the momentum value.
        Setting momentum = 0 reduces to sgd.
      - velocity: A tensor of the same shape as w and dw used to store a
        moving average of the gradients.
      """
      if config is None: config = {}
      config.setdefault('learning_rate', 1e-2)
      config.setdefault('momentum', 0.9)
      v = config.get('velocity', torch.zeros_like(w))
    
      next_w = None
      #############################################################################
      # TODO: Implement the momentum update formula. Store the updated value in   #
      # the next_w variable. You should also use and update the velocity v.       #
      #############################################################################
      # Replace "pass" statement with your code
      v = config['momentum']*v - config['learning_rate'] * dw
      next_w = w + v
      #############################################################################
      #                              END OF YOUR CODE                             #
      #############################################################################
      config['velocity'] = v
    
      return next_w, config
    
    def rmsprop(w, dw, config=None):
      """
      Uses the RMSProp update rule, which uses a moving average of squared
      gradient values to set adaptive per-parameter learning rates.
      config format:
      - learning_rate: Scalar learning rate.
      - decay_rate: Scalar between 0 and 1 giving the decay rate for the squared
        gradient cache.
      - epsilon: Small scalar used for smoothing to avoid dividing by zero.
      - cache: Moving average of second moments of gradients.
      """
      if config is None: config = {}
      config.setdefault('learning_rate', 1e-2)
      config.setdefault('decay_rate', 0.99)
      config.setdefault('epsilon', 1e-8)
      config.setdefault('cache', torch.zeros_like(w))
    
      next_w = None
      ###########################################################################
      # TODO: Implement the RMSprop update formula, storing the next value of w #
      # in the next_w variable. Don't forget to update cache value stored in    #
      # config['cache'].                                                        #
      ###########################################################################
      # Replace "pass" statement with your code
      config['cache'] = config['decay_rate'] * config['cache'] + (1 - config['decay_rate']) * dw**2
      w -= config['learning_rate'] * dw / (torch.sqrt(config['cache']) + config['epsilon'])
      next_w = w
      ###########################################################################
      #                             END OF YOUR CODE                            #
      ###########################################################################
    
      return next_w, config
    
    def adam(w, dw, config=None):
      """
      Uses the Adam update rule, which incorporates moving averages of both the
      gradient and its square and a bias correction term.
      config format:
      - learning_rate: Scalar learning rate.
      - beta1: Decay rate for moving average of first moment of gradient.
      - beta2: Decay rate for moving average of second moment of gradient.
      - epsilon: Small scalar used for smoothing to avoid dividing by zero.
      - m: Moving average of gradient.
      - v: Moving average of squared gradient.
      - t: Iteration number.
      """
      if config is None: config = {}
      config.setdefault('learning_rate', 1e-3)
      config.setdefault('beta1', 0.9)
      config.setdefault('beta2', 0.999)
      config.setdefault('epsilon', 1e-8)
      config.setdefault('m', torch.zeros_like(w))
      config.setdefault('v', torch.zeros_like(w))
      config.setdefault('t', 0)
    
      next_w = None
      #############################################################################
      # TODO: Implement the Adam update formula, storing the next value of w in   #
      # the next_w variable. Don't forget to update the m, v, and t variables     #
      # stored in config.                                                         #
      #                                                                           #
      # NOTE: In order to match the reference output, please modify t _before_    #
      # using it in any calculations.                                             #
      #############################################################################
      # Replace "pass" statement with your code
      config['t'] += 1
      config['m'] = config['beta1']*config['m'] + (1-config['beta1'])*dw
      mt = config['m'] / (1-config['beta1']**config['t'])
      config['v'] = config['beta2']*config['v'] + (1-config['beta2'])*(dw*dw)
      vc = config['v'] / (1-(config['beta2']**config['t']))
      w = w - (config['learning_rate'] * mt)/ (torch.sqrt(vc) + config['epsilon'])
      next_w = w
      #############################################################################
      #                              END OF YOUR CODE                             #
      #############################################################################
    
      return next_w, config
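
    The Solver keeps the optimizer state (velocity, cache, m, v, t) alive by threading the same config dict through successive calls. A tiny sketch of that pattern, on a toy objective of my own choosing:

    w = torch.tensor([5.0, -3.0])
    config = {'learning_rate': 0.1}            # remaining Adam hyperparameters use the defaults
    for step in range(200):
      dw = 2 * w                               # gradient of the toy objective ||w||^2
      w, config = adam(w, dw, config)
    print(w, config['t'])                      # config['t'] == 200; w has moved toward 0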
    
    

    Dropout Layer

    Note that in the fully-connected network above, dropout is only applied during training; it is skipped at test time.
    Dropout is a simple and very effective regularization method. Concretely, during training each neuron output is kept only with probability 1 - p (i.e. dropped with probability p): for every activation we draw a uniform random number and zero the activation whenever that number falls below the drop probability, as illustrated in the figure below:
    [figure: dropout randomly zeroing a subset of neuron outputs during training]
    Viewed another way, dropout amounts to sampling a sub-network from the full network at every step, which discourages complex co-adaptations between neurons.
    The original paper: https://www.cs.toronto.edu/~rsalakhu/papers/srivastava14a.pdf
    Code implementation:

    class Dropout(object):
    
      @staticmethod
      def forward(x, dropout_param):
        """
        Performs the forward pass for (inverted) dropout.
        Inputs:
        - x: Input data: tensor of any shape
        - dropout_param: A dictionary with the following keys:
          - p: Dropout parameter. We *drop* each neuron output with probability p.
          - mode: 'test' or 'train'. If the mode is train, then perform dropout;
          if the mode is test, then just return the input.
          - seed: Seed for the random number generator. Passing seed makes this
          function deterministic, which is needed for gradient checking but not
          in real networks.
        Outputs:
        - out: Tensor of the same shape as x.
        - cache: tuple (dropout_param, mask). In training mode, mask is the dropout
          mask that was used to multiply the input; in test mode, mask is None.
        NOTE: Please implement **inverted** dropout, not the vanilla version of dropout.
        See http://cs231n.github.io/neural-networks-2/#reg for more details.
        NOTE 2: Keep in mind that p is the probability of **dropping** a neuron
        output; this might be contrary to some sources, where it is referred to
        as the probability of keeping a neuron output.
        """
        p, mode = dropout_param['p'], dropout_param['mode']
        if 'seed' in dropout_param:
          torch.manual_seed(dropout_param['seed'])
    
        mask = None
        out = None
    
        if mode == 'train':
          ###########################################################################
          # TODO: Implement training phase forward pass for inverted dropout.       #
          # Store the dropout mask in the mask variable.                            #
          ###########################################################################
          # Replace "pass" statement with your code
          # Drop each activation with probability p (mask == True means "drop"),
          # then rescale by 1 / (1 - p) so the expected value matches test time
          # (inverted dropout).
          mask = torch.rand(x.shape, device=x.device) < p
          out = x.clone()
          out[mask] = 0
          out /= (1 - p)
          ###########################################################################
          #                             END OF YOUR CODE                            #
          ###########################################################################
        elif mode == 'test':
          ###########################################################################
          # TODO: Implement the test phase forward pass for inverted dropout.       #
          ###########################################################################
          # Replace "pass" statement with your code
          out = x
        cache = (dropout_param, mask)
    
        return out, cache
    
      @staticmethod
      def backward(dout, cache):
        """
        Perform the backward pass for (inverted) dropout.
        Inputs:
        - dout: Upstream derivatives, of any shape
        - cache: (dropout_param, mask) from Dropout.forward.
        """
        dropout_param, mask = cache
        mode = dropout_param['mode']
    
        dx = None
        if mode == 'train':
          ###########################################################################
          # TODO: Implement training phase backward pass for inverted dropout       #
          ###########################################################################
          # Replace "pass" statement with your code
          # Zero the gradient of dropped activations and apply the same
          # 1 / (1 - p) rescaling as the forward pass.
          dx = dout.clone()
          dx[mask] = 0
          dx /= (1 - dropout_param['p'])
        elif mode == 'test':
          dx = dout
        return dx
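
    A quick sanity check of the inverted-dropout scaling (my own sketch): because kept activations are divided by 1 - p at train time, the train-mode and test-mode outputs should have roughly the same mean:

    x = torch.randn(500, 500) + 10.0
    out_train, _ = Dropout.forward(x, {'mode': 'train', 'p': 0.3, 'seed': 0})
    out_test, _ = Dropout.forward(x, {'mode': 'test', 'p': 0.3})
    print(x.mean().item(), out_train.mean().item(), out_test.mean().item())
    # All three means are close to 10, and roughly 30% of out_train's entries are zero.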
    
    
  • Original post: https://www.cnblogs.com/dyccyber/p/17778041.html