Hands-On Speech Recognition Tutorial (Part 3)


    Friends, part three of the hands-on speech recognition series is here. This part starts covering how the network is built, again fully step by step, so don't miss it.

    1. Reading the Data

    I won't expand on this part, since data processing was covered in an earlier article. This time we start from the format of the data being read in and go all the way to feeding it into the network. The network is large, with many layers, so we will walk through it layer by layer. In this installment we look at the embed layer inside the encoder, i.e. the embedding layer. You can think of an embedding as a lookup, a table that maps id -> embedding. Given a token's id, vocab.txt tells you which character or word it is (a, b, c, d, 你, 我, and so on), and the table gives you that token's embedding vector, which you can then use to compute relations between tokens such as distance or semantic similarity.
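    To make the lookup idea concrete, here is a minimal sketch (my own illustration, not the PaddleSpeech code). It assumes a vocab.txt with one token per line, where the line number is the token id; the 512-dim vectors mirror the Embedding(5537, 512, ...) that appears in the decoder summary further below.

    import paddle

    # Hypothetical vocabulary layout: one token per line, line number = token id.
    with open("vocab.txt", encoding="utf-8") as f:
        id2token = [line.strip() for line in f]

    # An embedding layer is just a trainable lookup table: token id -> 512-dim vector.
    embed = paddle.nn.Embedding(num_embeddings=len(id2token), embedding_dim=512)

    ids = paddle.to_tensor([3, 7])   # two token ids
    vectors = embed(ids)             # shape [2, 512]
    print(id2token[3], id2token[7], vectors.shape)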

    import soundfile
    # Read 16-bit PCM samples; always_2d=True keeps a (samples, channels) shape
    audio, audio_sample_rate = soundfile.read(r"C:\Users\Desktop\asr16.wav", dtype="int16", always_2d=True)
    

    [screenshot omitted]

    import numpy as np
    # average the channels down to mono, keeping int16
    audio = audio.mean(axis=1, dtype=np.int16)
    

    [screenshot omitted]

    def pcm16to32(audio):
        """Convert int16 PCM samples to float32 in [-1.0, 1.0)."""
        assert audio.dtype == np.int16
        audio = audio.astype("float32")
        bits = np.iinfo(np.int16).bits
        audio = audio / (2**(bits - 1))
        return audio

    def pcm32to16(audio):
        """Convert float32 samples in [-1.0, 1.0) back to int16 PCM."""
        assert audio.dtype == np.float32
        bits = np.iinfo(np.int16).bits
        audio = audio * (2**(bits - 1))
        audio = np.round(audio).astype("int16")
        return audio
    
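    As a quick sanity check (my own example, not from the original post), converting a few sample values shows what the scaling does:

    x = np.array([0, 16384, -32768], dtype=np.int16)
    print(pcm16to32(x))              # [ 0.   0.5 -1. ]
    print(pcm32to16(pcm16to32(x)))   # back to [     0  16384 -32768]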
    audio = pcm16to32(audio)   # int16 -> float32 in [-1.0, 1.0)
    audio = pcm32to16(audio)   # and back to int16
    

    [screenshot omitted]

    import pickle
    with open("transform.pkl", "rb") as tf:
        preprocessing = pickle.load(tf)
    
    audio = preprocessing(audio, **{"train": False})
    
    import paddle
    audio = paddle.to_tensor(audio, dtype='float32').unsqueeze(axis=0)  # add a batch dim -> (1, T, 80)
    

    [screenshot omitted]

    2. Into the Model

    The data reading above is complete and the transform, i.e. the preprocessing, has been applied, so the features are now in a usable form; exactly which operations are involved was covered in an earlier article, so flip back if you have forgotten. Next we feed the features into the network. Before that, note that audio now has shape (1, 363, 80). Earlier articles stressed that this 363 (the number of frames) matters, but from this point on you can let go of it, because after many reshapes the 363 will be gone.
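    As a rough cross-check (my own back-of-the-envelope arithmetic, not from the original post), the two kernel-3, stride-2, unpadded convolutions in the Conv2dSubsampling4 module below shrink both the time axis and the feature axis, which is exactly why the 363 eventually disappears:

    def conv_out(n, kernel=3, stride=2):
        # output length of an unpadded convolution along one axis
        return (n - kernel) // stride + 1

    t, f = 363, 80
    t, f = conv_out(t), conv_out(f)   # after the first Conv2D: 181 frames, 39 bins
    t, f = conv_out(t), conv_out(f)   # after the second Conv2D: 90 frames, 19 bins
    print(t, f, 512 * f)              # 90 19 9728 -> matches Linear(in_features=9728)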

    (1) Inspecting the Network

    Before building the network we first need to see what it looks like, so let's display the network layers. Loading the network was covered earlier; here we just show its summary.
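    Assuming the pretrained U2 model from that earlier loading article is sitting in a variable, say u2_model (the name is my assumption), the dump below is simply what you get by printing it, since printing a paddle.nn.Layer recursively lists its sub-layers:

    print(u2_model)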

    U2Model(
      (encoder): ConformerEncoder(
        (embed): Conv2dSubsampling4(
          (pos_enc): RelPositionalEncoding(
            (dropout): Dropout(p=0.1, axis=None, mode=upscale_in_train)
          )
          (conv): Sequential(
            (0): Conv2D(1, 512, kernel_size=[3, 3], stride=[2, 2], data_format=NCHW)
            (1): ReLU()
            (2): Conv2D(512, 512, kernel_size=[3, 3], stride=[2, 2], data_format=NCHW)
            (3): ReLU()
          )
          (out): Sequential(
            (0): Linear(in_features=9728, out_features=512, dtype=float32)
          )
        )
        (after_norm): LayerNorm(normalized_shape=[512], epsilon=1e-12)
        (encoders): LayerList(
          (0): ConformerEncoderLayer(
            (self_attn): RelPositionMultiHeadedAttention(
              (linear_q): Linear(in_features=512, out_features=512, dtype=float32)
              (linear_k): Linear(in_features=512, out_features=512, dtype=float32)
              (linear_v): Linear(in_features=512, out_features=512, dtype=float32)
              (linear_out): Linear(in_features=512, out_features=512, dtype=float32)
              (dropout): Dropout(p=0.0, axis=None, mode=upscale_in_train)
              (linear_pos): Linear(in_features=512, out_features=512, dtype=float32)
            )
            (feed_forward): PositionwiseFeedForward(
              (w_1): Linear(in_features=512, out_features=2048, dtype=float32)
              (activation): Swish()
              (dropout): Dropout(p=0.1, axis=None, mode=upscale_in_train)
              (w_2): Linear(in_features=2048, out_features=512, dtype=float32)
            )
            (feed_forward_macaron): PositionwiseFeedForward(
              (w_1): Linear(in_features=512, out_features=2048, dtype=float32)
              (activation): Swish()
              (dropout): Dropout(p=0.1, axis=None, mode=upscale_in_train)
              (w_2): Linear(in_features=2048, out_features=512, dtype=float32)
            )
            (conv_module): ConvolutionModule(
              (pointwise_conv1): Conv1D(512, 1024, kernel_size=[1], data_format=NCL)
              (depthwise_conv): Conv1D(512, 512, kernel_size=[15], padding=7, groups=512, data_format=NCL)
              (norm): LayerNorm(normalized_shape=[512], epsilon=1e-05)
              (pointwise_conv2): Conv1D(512, 512, kernel_size=[1], data_format=NCL)
              (activation): Swish()
            )
            (norm_ff): LayerNorm(normalized_shape=[512], epsilon=1e-12)
            (norm_mha): LayerNorm(normalized_shape=[512], epsilon=1e-12)
            (norm_ff_macaron): LayerNorm(normalized_shape=[512], epsilon=1e-12)
            (norm_conv): LayerNorm(normalized_shape=[512], epsilon=1e-12)
            (norm_final): LayerNorm(normalized_shape=[512], epsilon=1e-12)
            (dropout): Dropout(p=0.1, axis=None, mode=upscale_in_train)
            (concat_linear): Linear(in_features=1024, out_features=512, dtype=float32)
          )
          (1): ConformerEncoderLayer(
            (self_attn): RelPositionMultiHeadedAttention(
              (linear_q): Linear(in_features=512, out_features=512, dtype=float32)
              (linear_k): Linear(in_features=512, out_features=512, dtype=float32)
              (linear_v): Linear(in_features=512, out_features=512, dtype=float32)
              (linear_out): Linear(in_features=512, out_features=512, dtype=float32)
              (dropout): Dropout(p=0.0, axis=None, mode=upscale_in_train)
              (linear_pos): Linear(in_features=512, out_features=512, dtype=float32)
            )
            (feed_forward): PositionwiseFeedForward(
              (w_1): Linear(in_features=512, out_features=2048, dtype=float32)
              (activation): Swish()
              (dropout): Dropout(p=0.1, axis=None, mode=upscale_in_train)
              (w_2): Linear(in_features=2048, out_features=512, dtype=float32)
            )
            (feed_forward_macaron): PositionwiseFeedForward(
              (w_1): Linear(in_features=512, out_features=2048, dtype=float32)
              (activation): Swish()
              (dropout): Dropout(p=0.1, axis=None, mode=upscale_in_train)
              (w_2): Linear(in_features=2048, out_features=512, dtype=float32)
            )
            (conv_module): ConvolutionModule(
              (pointwise_conv1): Conv1D(512, 1024, kernel_size=[1], data_format=NCL)
              (depthwise_conv): Conv1D(512, 512, kernel_size=[15], padding=7, groups=512, data_format=NCL)
              (norm): LayerNorm(normalized_shape=[512], epsilon=1e-05)
              (pointwise_conv2): Conv1D(512, 512, kernel_size=[1], data_format=NCL)
              (activation): Swish()
            )
            (norm_ff): LayerNorm(normalized_shape=[512], epsilon=1e-12)
            (norm_mha): LayerNorm(normalized_shape=[512], epsilon=1e-12)
            (norm_ff_macaron): LayerNorm(normalized_shape=[512], epsilon=1e-12)
            (norm_conv): LayerNorm(normalized_shape=[512], epsilon=1e-12)
            (norm_final): LayerNorm(normalized_shape=[512], epsilon=1e-12)
            (dropout): Dropout(p=0.1, axis=None, mode=upscale_in_train)
            (concat_linear): Linear(in_features=1024, out_features=512, dtype=float32)
          )
          (2): ConformerEncoderLayer(
            (self_attn): RelPositionMultiHeadedAttention(
              (linear_q): Linear(in_features=512, out_features=512, dtype=float32)
              (linear_k): Linear(in_features=512, out_features=512, dtype=float32)
              (linear_v): Linear(in_features=512, out_features=512, dtype=float32)
              (linear_out): Linear(in_features=512, out_features=512, dtype=float32)
              (dropout): Dropout(p=0.0, axis=None, mode=upscale_in_train)
              (linear_pos): Linear(in_features=512, out_features=512, dtype=float32)
            )
            (feed_forward): PositionwiseFeedForward(
              (w_1): Linear(in_features=512, out_features=2048, dtype=float32)
              (activation): Swish()
              (dropout): Dropout(p=0.1, axis=None, mode=upscale_in_train)
              (w_2): Linear(in_features=2048, out_features=512, dtype=float32)
            )
            (feed_forward_macaron): PositionwiseFeedForward(
              (w_1): Linear(in_features=512, out_features=2048, dtype=float32)
              (activation): Swish()
              (dropout): Dropout(p=0.1, axis=None, mode=upscale_in_train)
              (w_2): Linear(in_features=2048, out_features=512, dtype=float32)
            )
            (conv_module): ConvolutionModule(
              (pointwise_conv1): Conv1D(512, 1024, kernel_size=[1], data_format=NCL)
              (depthwise_conv): Conv1D(512, 512, kernel_size=[15], padding=7, groups=512, data_format=NCL)
              (norm): LayerNorm(normalized_shape=[512], epsilon=1e-05)
              (pointwise_conv2): Conv1D(512, 512, kernel_size=[1], data_format=NCL)
              (activation): Swish()
            )
            (norm_ff): LayerNorm(normalized_shape=[512], epsilon=1e-12)
            (norm_mha): LayerNorm(normalized_shape=[512], epsilon=1e-12)
            (norm_ff_macaron): LayerNorm(normalized_shape=[512], epsilon=1e-12)
            (norm_conv): LayerNorm(normalized_shape=[512], epsilon=1e-12)
            (norm_final): LayerNorm(normalized_shape=[512], epsilon=1e-12)
            (dropout): Dropout(p=0.1, axis=None, mode=upscale_in_train)
            (concat_linear): Linear(in_features=1024, out_features=512, dtype=float32)
          )
          (3): ConformerEncoderLayer(
            (self_attn): RelPositionMultiHeadedAttention(
              (linear_q): Linear(in_features=512, out_features=512, dtype=float32)
              (linear_k): Linear(in_features=512, out_features=512, dtype=float32)
              (linear_v): Linear(in_features=512, out_features=512, dtype=float32)
              (linear_out): Linear(in_features=512, out_features=512, dtype=float32)
              (dropout): Dropout(p=0.0, axis=None, mode=upscale_in_train)
              (linear_pos): Linear(in_features=512, out_features=512, dtype=float32)
            )
            (feed_forward): PositionwiseFeedForward(
              (w_1): Linear(in_features=512, out_features=2048, dtype=float32)
              (activation): Swish()
              (dropout): Dropout(p=0.1, axis=None, mode=upscale_in_train)
              (w_2): Linear(in_features=2048, out_features=512, dtype=float32)
            )
            (feed_forward_macaron): PositionwiseFeedForward(
              (w_1): Linear(in_features=512, out_features=2048, dtype=float32)
              (activation): Swish()
              (dropout): Dropout(p=0.1, axis=None, mode=upscale_in_train)
              (w_2): Linear(in_features=2048, out_features=512, dtype=float32)
            )
            (conv_module): ConvolutionModule(
              (pointwise_conv1): Conv1D(512, 1024, kernel_size=[1], data_format=NCL)
              (depthwise_conv): Conv1D(512, 512, kernel_size=[15], padding=7, groups=512, data_format=NCL)
              (norm): LayerNorm(normalized_shape=[512], epsilon=1e-05)
              (pointwise_conv2): Conv1D(512, 512, kernel_size=[1], data_format=NCL)
              (activation): Swish()
            )
            (norm_ff): LayerNorm(normalized_shape=[512], epsilon=1e-12)
            (norm_mha): LayerNorm(normalized_shape=[512], epsilon=1e-12)
            (norm_ff_macaron): LayerNorm(normalized_shape=[512], epsilon=1e-12)
            (norm_conv): LayerNorm(normalized_shape=[512], epsilon=1e-12)
            (norm_final): LayerNorm(normalized_shape=[512], epsilon=1e-12)
            (dropout): Dropout(p=0.1, axis=None, mode=upscale_in_train)
            (concat_linear): Linear(in_features=1024, out_features=512, dtype=float32)
          )
          (4): ConformerEncoderLayer(
            (self_attn): RelPositionMultiHeadedAttention(
              (linear_q): Linear(in_features=512, out_features=512, dtype=float32)
              (linear_k): Linear(in_features=512, out_features=512, dtype=float32)
              (linear_v): Linear(in_features=512, out_features=512, dtype=float32)
              (linear_out): Linear(in_features=512, out_features=512, dtype=float32)
              (dropout): Dropout(p=0.0, axis=None, mode=upscale_in_train)
              (linear_pos): Linear(in_features=512, out_features=512, dtype=float32)
            )
            (feed_forward): PositionwiseFeedForward(
              (w_1): Linear(in_features=512, out_features=2048, dtype=float32)
              (activation): Swish()
              (dropout): Dropout(p=0.1, axis=None, mode=upscale_in_train)
              (w_2): Linear(in_features=2048, out_features=512, dtype=float32)
            )
            (feed_forward_macaron): PositionwiseFeedForward(
              (w_1): Linear(in_features=512, out_features=2048, dtype=float32)
              (activation): Swish()
              (dropout): Dropout(p=0.1, axis=None, mode=upscale_in_train)
              (w_2): Linear(in_features=2048, out_features=512, dtype=float32)
            )
            (conv_module): ConvolutionModule(
              (pointwise_conv1): Conv1D(512, 1024, kernel_size=[1], data_format=NCL)
              (depthwise_conv): Conv1D(512, 512, kernel_size=[15], padding=7, groups=512, data_format=NCL)
              (norm): LayerNorm(normalized_shape=[512], epsilon=1e-05)
              (pointwise_conv2): Conv1D(512, 512, kernel_size=[1], data_format=NCL)
              (activation): Swish()
            )
            (norm_ff): LayerNorm(normalized_shape=[512], epsilon=1e-12)
            (norm_mha): LayerNorm(normalized_shape=[512], epsilon=1e-12)
            (norm_ff_macaron): LayerNorm(normalized_shape=[512], epsilon=1e-12)
            (norm_conv): LayerNorm(normalized_shape=[512], epsilon=1e-12)
            (norm_final): LayerNorm(normalized_shape=[512], epsilon=1e-12)
            (dropout): Dropout(p=0.1, axis=None, mode=upscale_in_train)
            (concat_linear): Linear(in_features=1024, out_features=512, dtype=float32)
          )
          (5): ConformerEncoderLayer(
            (self_attn): RelPositionMultiHeadedAttention(
              (linear_q): Linear(in_features=512, out_features=512, dtype=float32)
              (linear_k): Linear(in_features=512, out_features=512, dtype=float32)
              (linear_v): Linear(in_features=512, out_features=512, dtype=float32)
              (linear_out): Linear(in_features=512, out_features=512, dtype=float32)
              (dropout): Dropout(p=0.0, axis=None, mode=upscale_in_train)
              (linear_pos): Linear(in_features=512, out_features=512, dtype=float32)
            )
            (feed_forward): PositionwiseFeedForward(
              (w_1): Linear(in_features=512, out_features=2048, dtype=float32)
              (activation): Swish()
              (dropout): Dropout(p=0.1, axis=None, mode=upscale_in_train)
              (w_2): Linear(in_features=2048, out_features=512, dtype=float32)
            )
            (feed_forward_macaron): PositionwiseFeedForward(
              (w_1): Linear(in_features=512, out_features=2048, dtype=float32)
              (activation): Swish()
              (dropout): Dropout(p=0.1, axis=None, mode=upscale_in_train)
              (w_2): Linear(in_features=2048, out_features=512, dtype=float32)
            )
            (conv_module): ConvolutionModule(
              (pointwise_conv1): Conv1D(512, 1024, kernel_size=[1], data_format=NCL)
              (depthwise_conv): Conv1D(512, 512, kernel_size=[15], padding=7, groups=512, data_format=NCL)
              (norm): LayerNorm(normalized_shape=[512], epsilon=1e-05)
              (pointwise_conv2): Conv1D(512, 512, kernel_size=[1], data_format=NCL)
              (activation): Swish()
            )
            (norm_ff): LayerNorm(normalized_shape=[512], epsilon=1e-12)
            (norm_mha): LayerNorm(normalized_shape=[512], epsilon=1e-12)
            (norm_ff_macaron): LayerNorm(normalized_shape=[512], epsilon=1e-12)
            (norm_conv): LayerNorm(normalized_shape=[512], epsilon=1e-12)
            (norm_final): LayerNorm(normalized_shape=[512], epsilon=1e-12)
            (dropout): Dropout(p=0.1, axis=None, mode=upscale_in_train)
            (concat_linear): Linear(in_features=1024, out_features=512, dtype=float32)
          )
          (6): ConformerEncoderLayer(
            (self_attn): RelPositionMultiHeadedAttention(
              (linear_q): Linear(in_features=512, out_features=512, dtype=float32)
              (linear_k): Linear(in_features=512, out_features=512, dtype=float32)
              (linear_v): Linear(in_features=512, out_features=512, dtype=float32)
              (linear_out): Linear(in_features=512, out_features=512, dtype=float32)
              (dropout): Dropout(p=0.0, axis=None, mode=upscale_in_train)
              (linear_pos): Linear(in_features=512, out_features=512, dtype=float32)
            )
            (feed_forward): PositionwiseFeedForward(
              (w_1): Linear(in_features=512, out_features=2048, dtype=float32)
              (activation): Swish()
              (dropout): Dropout(p=0.1, axis=None, mode=upscale_in_train)
              (w_2): Linear(in_features=2048, out_features=512, dtype=float32)
            )
            (feed_forward_macaron): PositionwiseFeedForward(
              (w_1): Linear(in_features=512, out_features=2048, dtype=float32)
              (activation): Swish()
              (dropout): Dropout(p=0.1, axis=None, mode=upscale_in_train)
              (w_2): Linear(in_features=2048, out_features=512, dtype=float32)
            )
            (conv_module): ConvolutionModule(
              (pointwise_conv1): Conv1D(512, 1024, kernel_size=[1], data_format=NCL)
              (depthwise_conv): Conv1D(512, 512, kernel_size=[15], padding=7, groups=512, data_format=NCL)
              (norm): LayerNorm(normalized_shape=[512], epsilon=1e-05)
              (pointwise_conv2): Conv1D(512, 512, kernel_size=[1], data_format=NCL)
              (activation): Swish()
            )
            (norm_ff): LayerNorm(normalized_shape=[512], epsilon=1e-12)
            (norm_mha): LayerNorm(normalized_shape=[512], epsilon=1e-12)
            (norm_ff_macaron): LayerNorm(normalized_shape=[512], epsilon=1e-12)
            (norm_conv): LayerNorm(normalized_shape=[512], epsilon=1e-12)
            (norm_final): LayerNorm(normalized_shape=[512], epsilon=1e-12)
            (dropout): Dropout(p=0.1, axis=None, mode=upscale_in_train)
            (concat_linear): Linear(in_features=1024, out_features=512, dtype=float32)
          )
          (7): ConformerEncoderLayer(
            (self_attn): RelPositionMultiHeadedAttention(
              (linear_q): Linear(in_features=512, out_features=512, dtype=float32)
              (linear_k): Linear(in_features=512, out_features=512, dtype=float32)
              (linear_v): Linear(in_features=512, out_features=512, dtype=float32)
              (linear_out): Linear(in_features=512, out_features=512, dtype=float32)
              (dropout): Dropout(p=0.0, axis=None, mode=upscale_in_train)
              (linear_pos): Linear(in_features=512, out_features=512, dtype=float32)
            )
            (feed_forward): PositionwiseFeedForward(
              (w_1): Linear(in_features=512, out_features=2048, dtype=float32)
              (activation): Swish()
              (dropout): Dropout(p=0.1, axis=None, mode=upscale_in_train)
              (w_2): Linear(in_features=2048, out_features=512, dtype=float32)
            )
            (feed_forward_macaron): PositionwiseFeedForward(
              (w_1): Linear(in_features=512, out_features=2048, dtype=float32)
              (activation): Swish()
              (dropout): Dropout(p=0.1, axis=None, mode=upscale_in_train)
              (w_2): Linear(in_features=2048, out_features=512, dtype=float32)
            )
            (conv_module): ConvolutionModule(
              (pointwise_conv1): Conv1D(512, 1024, kernel_size=[1], data_format=NCL)
              (depthwise_conv): Conv1D(512, 512, kernel_size=[15], padding=7, groups=512, data_format=NCL)
              (norm): LayerNorm(normalized_shape=[512], epsilon=1e-05)
              (pointwise_conv2): Conv1D(512, 512, kernel_size=[1], data_format=NCL)
              (activation): Swish()
            )
            (norm_ff): LayerNorm(normalized_shape=[512], epsilon=1e-12)
            (norm_mha): LayerNorm(normalized_shape=[512], epsilon=1e-12)
            (norm_ff_macaron): LayerNorm(normalized_shape=[512], epsilon=1e-12)
            (norm_conv): LayerNorm(normalized_shape=[512], epsilon=1e-12)
            (norm_final): LayerNorm(normalized_shape=[512], epsilon=1e-12)
            (dropout): Dropout(p=0.1, axis=None, mode=upscale_in_train)
            (concat_linear): Linear(in_features=1024, out_features=512, dtype=float32)
          )
          (8): ConformerEncoderLayer(
            (self_attn): RelPositionMultiHeadedAttention(
              (linear_q): Linear(in_features=512, out_features=512, dtype=float32)
              (linear_k): Linear(in_features=512, out_features=512, dtype=float32)
              (linear_v): Linear(in_features=512, out_features=512, dtype=float32)
              (linear_out): Linear(in_features=512, out_features=512, dtype=float32)
              (dropout): Dropout(p=0.0, axis=None, mode=upscale_in_train)
              (linear_pos): Linear(in_features=512, out_features=512, dtype=float32)
            )
            (feed_forward): PositionwiseFeedForward(
              (w_1): Linear(in_features=512, out_features=2048, dtype=float32)
              (activation): Swish()
              (dropout): Dropout(p=0.1, axis=None, mode=upscale_in_train)
              (w_2): Linear(in_features=2048, out_features=512, dtype=float32)
            )
            (feed_forward_macaron): PositionwiseFeedForward(
              (w_1): Linear(in_features=512, out_features=2048, dtype=float32)
              (activation): Swish()
              (dropout): Dropout(p=0.1, axis=None, mode=upscale_in_train)
              (w_2): Linear(in_features=2048, out_features=512, dtype=float32)
            )
            (conv_module): ConvolutionModule(
              (pointwise_conv1): Conv1D(512, 1024, kernel_size=[1], data_format=NCL)
              (depthwise_conv): Conv1D(512, 512, kernel_size=[15], padding=7, groups=512, data_format=NCL)
              (norm): LayerNorm(normalized_shape=[512], epsilon=1e-05)
              (pointwise_conv2): Conv1D(512, 512, kernel_size=[1], data_format=NCL)
              (activation): Swish()
            )
            (norm_ff): LayerNorm(normalized_shape=[512], epsilon=1e-12)
            (norm_mha): LayerNorm(normalized_shape=[512], epsilon=1e-12)
            (norm_ff_macaron): LayerNorm(normalized_shape=[512], epsilon=1e-12)
            (norm_conv): LayerNorm(normalized_shape=[512], epsilon=1e-12)
            (norm_final): LayerNorm(normalized_shape=[512], epsilon=1e-12)
            (dropout): Dropout(p=0.1, axis=None, mode=upscale_in_train)
            (concat_linear): Linear(in_features=1024, out_features=512, dtype=float32)
          )
          (9): ConformerEncoderLayer(
            (self_attn): RelPositionMultiHeadedAttention(
              (linear_q): Linear(in_features=512, out_features=512, dtype=float32)
              (linear_k): Linear(in_features=512, out_features=512, dtype=float32)
              (linear_v): Linear(in_features=512, out_features=512, dtype=float32)
              (linear_out): Linear(in_features=512, out_features=512, dtype=float32)
              (dropout): Dropout(p=0.0, axis=None, mode=upscale_in_train)
              (linear_pos): Linear(in_features=512, out_features=512, dtype=float32)
            )
            (feed_forward): PositionwiseFeedForward(
              (w_1): Linear(in_features=512, out_features=2048, dtype=float32)
              (activation): Swish()
              (dropout): Dropout(p=0.1, axis=None, mode=upscale_in_train)
              (w_2): Linear(in_features=2048, out_features=512, dtype=float32)
            )
            (feed_forward_macaron): PositionwiseFeedForward(
              (w_1): Linear(in_features=512, out_features=2048, dtype=float32)
              (activation): Swish()
              (dropout): Dropout(p=0.1, axis=None, mode=upscale_in_train)
              (w_2): Linear(in_features=2048, out_features=512, dtype=float32)
            )
            (conv_module): ConvolutionModule(
              (pointwise_conv1): Conv1D(512, 1024, kernel_size=[1], data_format=NCL)
              (depthwise_conv): Conv1D(512, 512, kernel_size=[15], padding=7, groups=512, data_format=NCL)
              (norm): LayerNorm(normalized_shape=[512], epsilon=1e-05)
              (pointwise_conv2): Conv1D(512, 512, kernel_size=[1], data_format=NCL)
              (activation): Swish()
            )
            (norm_ff): LayerNorm(normalized_shape=[512], epsilon=1e-12)
            (norm_mha): LayerNorm(normalized_shape=[512], epsilon=1e-12)
            (norm_ff_macaron): LayerNorm(normalized_shape=[512], epsilon=1e-12)
            (norm_conv): LayerNorm(normalized_shape=[512], epsilon=1e-12)
            (norm_final): LayerNorm(normalized_shape=[512], epsilon=1e-12)
            (dropout): Dropout(p=0.1, axis=None, mode=upscale_in_train)
            (concat_linear): Linear(in_features=1024, out_features=512, dtype=float32)
          )
          (10): ConformerEncoderLayer(
            (self_attn): RelPositionMultiHeadedAttention(
              (linear_q): Linear(in_features=512, out_features=512, dtype=float32)
              (linear_k): Linear(in_features=512, out_features=512, dtype=float32)
              (linear_v): Linear(in_features=512, out_features=512, dtype=float32)
              (linear_out): Linear(in_features=512, out_features=512, dtype=float32)
              (dropout): Dropout(p=0.0, axis=None, mode=upscale_in_train)
              (linear_pos): Linear(in_features=512, out_features=512, dtype=float32)
            )
            (feed_forward): PositionwiseFeedForward(
              (w_1): Linear(in_features=512, out_features=2048, dtype=float32)
              (activation): Swish()
              (dropout): Dropout(p=0.1, axis=None, mode=upscale_in_train)
              (w_2): Linear(in_features=2048, out_features=512, dtype=float32)
            )
            (feed_forward_macaron): PositionwiseFeedForward(
              (w_1): Linear(in_features=512, out_features=2048, dtype=float32)
              (activation): Swish()
              (dropout): Dropout(p=0.1, axis=None, mode=upscale_in_train)
              (w_2): Linear(in_features=2048, out_features=512, dtype=float32)
            )
            (conv_module): ConvolutionModule(
              (pointwise_conv1): Conv1D(512, 1024, kernel_size=[1], data_format=NCL)
              (depthwise_conv): Conv1D(512, 512, kernel_size=[15], padding=7, groups=512, data_format=NCL)
              (norm): LayerNorm(normalized_shape=[512], epsilon=1e-05)
              (pointwise_conv2): Conv1D(512, 512, kernel_size=[1], data_format=NCL)
              (activation): Swish()
            )
            (norm_ff): LayerNorm(normalized_shape=[512], epsilon=1e-12)
            (norm_mha): LayerNorm(normalized_shape=[512], epsilon=1e-12)
            (norm_ff_macaron): LayerNorm(normalized_shape=[512], epsilon=1e-12)
            (norm_conv): LayerNorm(normalized_shape=[512], epsilon=1e-12)
            (norm_final): LayerNorm(normalized_shape=[512], epsilon=1e-12)
            (dropout): Dropout(p=0.1, axis=None, mode=upscale_in_train)
            (concat_linear): Linear(in_features=1024, out_features=512, dtype=float32)
          )
          (11): ConformerEncoderLayer(
            (self_attn): RelPositionMultiHeadedAttention(
              (linear_q): Linear(in_features=512, out_features=512, dtype=float32)
              (linear_k): Linear(in_features=512, out_features=512, dtype=float32)
              (linear_v): Linear(in_features=512, out_features=512, dtype=float32)
              (linear_out): Linear(in_features=512, out_features=512, dtype=float32)
              (dropout): Dropout(p=0.0, axis=None, mode=upscale_in_train)
              (linear_pos): Linear(in_features=512, out_features=512, dtype=float32)
            )
            (feed_forward): PositionwiseFeedForward(
              (w_1): Linear(in_features=512, out_features=2048, dtype=float32)
              (activation): Swish()
              (dropout): Dropout(p=0.1, axis=None, mode=upscale_in_train)
              (w_2): Linear(in_features=2048, out_features=512, dtype=float32)
            )
            (feed_forward_macaron): PositionwiseFeedForward(
              (w_1): Linear(in_features=512, out_features=2048, dtype=float32)
              (activation): Swish()
              (dropout): Dropout(p=0.1, axis=None, mode=upscale_in_train)
              (w_2): Linear(in_features=2048, out_features=512, dtype=float32)
            )
            (conv_module): ConvolutionModule(
              (pointwise_conv1): Conv1D(512, 1024, kernel_size=[1], data_format=NCL)
              (depthwise_conv): Conv1D(512, 512, kernel_size=[15], padding=7, groups=512, data_format=NCL)
              (norm): LayerNorm(normalized_shape=[512], epsilon=1e-05)
              (pointwise_conv2): Conv1D(512, 512, kernel_size=[1], data_format=NCL)
              (activation): Swish()
            )
            (norm_ff): LayerNorm(normalized_shape=[512], epsilon=1e-12)
            (norm_mha): LayerNorm(normalized_shape=[512], epsilon=1e-12)
            (norm_ff_macaron): LayerNorm(normalized_shape=[512], epsilon=1e-12)
            (norm_conv): LayerNorm(normalized_shape=[512], epsilon=1e-12)
            (norm_final): LayerNorm(normalized_shape=[512], epsilon=1e-12)
            (dropout): Dropout(p=0.1, axis=None, mode=upscale_in_train)
            (concat_linear): Linear(in_features=1024, out_features=512, dtype=float32)
          )
        )
      )
      (decoder): TransformerDecoder(
        (embed): Sequential(
          (0): Embedding(5537, 512, sparse=False)
          (1): PositionalEncoding(
            (dropout): Dropout(p=0.1, axis=None, mode=upscale_in_train)
          )
        )
        (after_norm): LayerNorm(normalized_shape=[512], epsilon=1e-12)
        (output_layer): Linear(in_features=512, out_features=5537, dtype=float32)
        (decoders): LayerList(
          (0): DecoderLayer(
            (self_attn): MultiHeadedAttention(
              (linear_q): Linear(in_features=512, out_features=512, dtype=float32)
              (linear_k): Linear(in_features=512, out_features=512, dtype=float32)
              (linear_v): Linear(in_features=512, out_features=512, dtype=float32)
              (linear_out): Linear(in_features=512, out_features=512, dtype=float32)
              (dropout): Dropout(p=0.0, axis=None, mode=upscale_in_train)
            )
            (src_attn): MultiHeadedAttention(
              (linear_q): Linear(in_features=512, out_features=512, dtype=float32)
              (linear_k): Linear(in_features=512, out_features=512, dtype=float32)
              (linear_v): Linear(in_features=512, out_features=512, dtype=float32)
              (linear_out): Linear(in_features=512, out_features=512, dtype=float32)
              (dropout): Dropout(p=0.0, axis=None, mode=upscale_in_train)
            )
            (feed_forward): PositionwiseFeedForward(
              (w_1): Linear(in_features=512, out_features=2048, dtype=float32)
              (activation): ReLU()
              (dropout): Dropout(p=0.1, axis=None, mode=upscale_in_train)
              (w_2): Linear(in_features=2048, out_features=512, dtype=float32)
            )
            (norm1): LayerNorm(normalized_shape=[512], epsilon=1e-12)
            (norm2): LayerNorm(normalized_shape=[512], epsilon=1e-12)
            (norm3): LayerNorm(normalized_shape=[512], epsilon=1e-12)
            (dropout): Dropout(p=0.1, axis=None, mode=upscale_in_train)
            (concat_linear1): Linear(in_features=1024, out_features=512, dtype=float32)
            (concat_linear2): Linear(in_features=1024, out_features=512, dtype=float32)
          )
          (1): DecoderLayer(
            (self_attn): MultiHeadedAttention(
              (linear_q): Linear(in_features=512, out_features=512, dtype=float32)
              (linear_k): Linear(in_features=512, out_features=512, dtype=float32)
              (linear_v): Linear(in_features=512, out_features=512, dtype=float32)
              (linear_out): Linear(in_features=512, out_features=512, dtype=float32)
              (dropout): Dropout(p=0.0, axis=None, mode=upscale_in_train)
            )
            (src_attn): MultiHeadedAttention(
              (linear_q): Linear(in_features=512, out_features=512, dtype=float32)
              (linear_k): Linear(in_features=512, out_features=512, dtype=float32)
              (linear_v): Linear(in_features=512, out_features=512, dtype=float32)
              (linear_out): Linear(in_features=512, out_features=512, dtype=float32)
              (dropout): Dropout(p=0.0, axis=None, mode=upscale_in_train)
            )
            (feed_forward): PositionwiseFeedForward(
              (w_1): Linear(in_features=512, out_features=2048, dtype=float32)
              (activation): ReLU()
              (dropout): Dropout(p=0.1, axis=None, mode=upscale_in_train)
              (w_2): Linear(in_features=2048, out_features=512, dtype=float32)
            )
            (norm1): LayerNorm(normalized_shape=[512], epsilon=1e-12)
            (norm2): LayerNorm(normalized_shape=[512], epsilon=1e-12)
            (norm3): LayerNorm(normalized_shape=[512], epsilon=1e-12)
            (dropout): Dropout(p=0.1, axis=None, mode=upscale_in_train)
            (concat_linear1): Linear(in_features=1024, out_features=512, dtype=float32)
            (concat_linear2): Linear(in_features=1024, out_features=512, dtype=float32)
          )
          (2): DecoderLayer(
            (self_attn): MultiHeadedAttention(
              (linear_q): Linear(in_features=512, out_features=512, dtype=float32)
              (linear_k): Linear(in_features=512, out_features=512, dtype=float32)
              (linear_v): Linear(in_features=512, out_features=512, dtype=float32)
              (linear_out): Linear(in_features=512, out_features=512, dtype=float32)
              (dropout): Dropout(p=0.0, axis=None, mode=upscale_in_train)
            )
            (src_attn): MultiHeadedAttention(
              (linear_q): Linear(in_features=512, out_features=512, dtype=float32)
              (linear_k): Linear(in_features=512, out_features=512, dtype=float32)
              (linear_v): Linear(in_features=512, out_features=512, dtype=float32)
              (linear_out): Linear(in_features=512, out_features=512, dtype=float32)
              (dropout): Dropout(p=0.0, axis=None, mode=upscale_in_train)
            )
            (feed_forward): PositionwiseFeedForward(
              (w_1): Linear(in_features=512, out_features=2048, dtype=float32)
              (activation): ReLU()
              (dropout): Dropout(p=0.1, axis=None, mode=upscale_in_train)
              (w_2): Linear(in_features=2048, out_features=512, dtype=float32)
            )
            (norm1): LayerNorm(normalized_shape=[512], epsilon=1e-12)
            (norm2): LayerNorm(normalized_shape=[512], epsilon=1e-12)
            (norm3): LayerNorm(normalized_shape=[512], epsilon=1e-12)
            (dropout): Dropout(p=0.1, axis=None, mode=upscale_in_train)
            (concat_linear1): Linear(in_features=1024, out_features=512, dtype=float32)
            (concat_linear2): Linear(in_features=1024, out_features=512, dtype=float32)
          )
          (3): DecoderLayer(
            (self_attn): MultiHeadedAttention(
              (linear_q): Linear(in_features=512, out_features=512, dtype=float32)
              (linear_k): Linear(in_features=512, out_features=512, dtype=float32)
              (linear_v): Linear(in_features=512, out_features=512, dtype=float32)
              (linear_out): Linear(in_features=512, out_features=512, dtype=float32)
              (dropout): Dropout(p=0.0, axis=None, mode=upscale_in_train)
            )
            (src_attn): MultiHeadedAttention(
              (linear_q): Linear(in_features=512, out_features=512, dtype=float32)
              (linear_k): Linear(in_features=512, out_features=512, dtype=float32)
              (linear_v): Linear(in_features=512, out_features=512, dtype=float32)
              (linear_out): Linear(in_features=512, out_features=512, dtype=float32)
              (dropout): Dropout(p=0.0, axis=None, mode=upscale_in_train)
            )
            (feed_forward): PositionwiseFeedForward(
              (w_1): Linear(in_features=512, out_features=2048, dtype=float32)
              (activation): ReLU()
              (dropout): Dropout(p=0.1, axis=None, mode=upscale_in_train)
              (w_2): Linear(in_features=2048, out_features=512, dtype=float32)
            )
            (norm1): LayerNorm(normalized_shape=[512], epsilon=1e-12)
            (norm2): LayerNorm(normalized_shape=[512], epsilon=1e-12)
            (norm3): LayerNorm(normalized_shape=[512], epsilon=1e-12)
            (dropout): Dropout(p=0.1, axis=None, mode=upscale_in_train)
            (concat_linear1): Linear(in_features=1024, out_features=512, dtype=float32)
            (concat_linear2): Linear(in_features=1024, out_features=512, dtype=float32)
          )
          (4): DecoderLayer(
            (self_attn): MultiHeadedAttention(
              (linear_q): Linear(in_features=512, out_features=512, dtype=float32)
              (linear_k): Linear(in_features=512, out_features=512, dtype=float32)
              (linear_v): Linear(in_features=512, out_features=512, dtype=float32)
              (linear_out): Linear(in_features=512, out_features=512, dtype=float32)
              (dropout): Dropout(p=0.0, axis=None, mode=upscale_in_train)
            )
            (src_attn): MultiHeadedAttention(
              (linear_q): Linear(in_features=512, out_features=512, dtype=float32)
              (linear_k): Linear(in_features=512, out_features=512, dtype=float32)
              (linear_v): Linear(in_features=512, out_features=512, dtype=float32)
              (linear_out): Linear(in_features=512, out_features=512, dtype=float32)
              (dropout): Dropout(p=0.0, axis=None, mode=upscale_in_train)
            )
            (feed_forward): PositionwiseFeedForward(
              (w_1): Linear(in_features=512, out_features=2048, dtype=float32)
              (activation): ReLU()
              (dropout): Dropout(p=0.1, axis=None, mode=upscale_in_train)
              (w_2): Linear(in_features=2048, out_features=512, dtype=float32)
            )
            (norm1): LayerNorm(normalized_shape=[512], epsilon=1e-12)
            (norm2): LayerNorm(normalized_shape=[512], epsilon=1e-12)
            (norm3): LayerNorm(normalized_shape=[512], epsilon=1e-12)
            (dropout): Dropout(p=0.1, axis=None, mode=upscale_in_train)
            (concat_linear1): Linear(in_features=1024, out_features=512, dtype=float32)
            (concat_linear2): Linear(in_features=1024, out_features=512, dtype=float32)
          )
          (5): DecoderLayer(
            (self_attn): MultiHeadedAttention(
              (linear_q): Linear(in_features=512, out_features=512, dtype=float32)
              (linear_k): Linear(in_features=512, out_features=512, dtype=float32)
              (linear_v): Linear(in_features=512, out_features=512, dtype=float32)
              (linear_out): Linear(in_features=512, out_features=512, dtype=float32)
              (dropout): Dropout(p=0.0, axis=None, mode=upscale_in_train)
            )
            (src_attn): MultiHeadedAttention(
              (linear_q): Linear(in_features=512, out_features=512, dtype=float32)
              (linear_k): Linear(in_features=512, out_features=512, dtype=float32)
              (linear_v): Linear(in_features=512, out_features=512, dtype=float32)
              (linear_out): Linear(in_features=512, out_features=512, dtype=float32)
              (dropout): Dropout(p=0.0, axis=None, mode=upscale_in_train)
            )
            (feed_forward): PositionwiseFeedForward(
              (w_1): Linear(in_features=512, out_features=2048, dtype=float32)
              (activation): ReLU()
              (dropout): Dropout(p=0.1, axis=None, mode=upscale_in_train)
              (w_2): Linear(in_features=2048, out_features=512, dtype=float32)
            )
            (norm1): LayerNorm(normalized_shape=[512], epsilon=1e-12)
            (norm2): LayerNorm(normalized_shape=[512], epsilon=1e-12)
            (norm3): LayerNorm(normalized_shape=[512], epsilon=1e-12)
            (dropout): Dropout(p=0.1, axis=None, mode=upscale_in_train)
            (concat_linear1): Linear(in_features=1024, out_features=512, dtype=float32)
            (concat_linear2): Linear(in_features=1024, out_features=512, dtype=float32)
          )
        )
      )
      (ctc): CTCDecoderBase(
        (dropout): Dropout(p=0.0, axis=None, mode=upscale_in_train)
        (ctc_lo): Linear(in_features=512, out_features=5537, dtype=float32)
        (criterion): CTCLoss(
          (loss): CTCLoss()
        )
      )
      (criterion_att): LabelSmoothingLoss(
        (criterion): KLDivLoss()
      )
    )
    

    This network is very long, so we have to take it apart and go through it piece by piece. Below we start with the first layer, the Conv2dSubsampling4 module shown again here. It has two main parts, the convolutions and a Linear layer, and we will then implement it by hand.

    
    Conv2dSubsampling4(
      (pos_enc): RelPositionalEncoding(
        (dropout): Dropout(p=0.1, axis=None, mode=upscale_in_train)
      )
      (conv): Sequential(
        (0): Conv2D(1, 512, kernel_size=[3, 3], stride=[2, 2], data_format=NCHW)
        (1): ReLU()
        (2): Conv2D(512, 512, kernel_size=[3, 3], stride=[2, 2], data_format=NCHW)
        (3): ReLU()
      )
      (out): Sequential(
        (0): Linear(in_features=9728, out_features=512, dtype=float32)
      )
    )
    

    (2) Building the Network

    import paddle

    class Model(paddle.nn.Layer):
        def __init__(self):
            super(Model, self).__init__()
            # two 3x3 convolutions with stride 2, matching Conv2dSubsampling4
            self.conv2d_1 = paddle.nn.Conv2D(1, 512, kernel_size=[3, 3], stride=[2, 2], data_format="NCHW")
            self.conv2d_2 = paddle.nn.Conv2D(512, 512, kernel_size=[3, 3], stride=[2, 2], data_format="NCHW")
            # 9728 = 512 channels * 19 remaining feature bins
            self.lineone = paddle.nn.Linear(9728, 512)

        def forward(self, inputs):
            y = self.conv2d_1(inputs)
            y = paddle.nn.functional.relu(y)
            y = self.conv2d_2(y)
            y = paddle.nn.functional.relu(y)
            # (b, c, t, f) -> (b, t, c*f): each time step becomes one 9728-dim vector
            b, c, t, f = paddle.shape(y)
            y = y.transpose([0, 2, 1, 3]).reshape([b, t, c * f])
            y = self.lineone(y)
            print(y.shape)
            print(y)
            return y
    

    The code above uses the Paddle framework to rebuild the network according to the final summary. For why it is built exactly this way, see the figure below.

    [screenshot omitted]

    You can clearly see that in the forward pass the input first goes through the convolutions, and only after a transpose and a reshape is it fed into the Linear layer, so our hand-written version has to do the same.

    [screenshot omitted]

    model = Model()
    
    audio1 = audio.unsqueeze(1) # (b, c=1, t, f)
    

    [screenshot omitted]

    x = model(audio1)
    

    [screenshot omitted]

    As you can see, at this point the network we wrote ourselves lines up with the one in the source code, and the output shape is the same as well.
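    Note that matching shapes does not mean matching values: our freshly built layers start with random weights. To reproduce the original numbers you would first copy the pretrained parameters across. A hedged sketch, assuming the loaded checkpoint is available as u2_model and its subsampling module as u2_model.encoder.embed (both names are my assumptions):

    embed = u2_model.encoder.embed   # the pretrained Conv2dSubsampling4 (assumed accessible)
    model.conv2d_1.weight.set_value(embed.conv[0].weight)
    model.conv2d_1.bias.set_value(embed.conv[0].bias)
    model.conv2d_2.weight.set_value(embed.conv[2].weight)
    model.conv2d_2.bias.set_value(embed.conv[2].bias)
    model.lineone.weight.set_value(embed.out[0].weight)
    model.lineone.bias.set_value(embed.out[0].bias)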

    (3) Embedding

    This part really is just a lookup table, so I won't write code for it; look at the source directly. The main thing here is the position embedding (position_embed), and in the end several results are returned together.
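    For reference, here is a minimal sketch of the standard sinusoidal position-embedding formula (my own illustration; the actual RelPositionalEncoding in the source additionally scales the input features and returns both the scaled features and the position tensor, which is the "several results" mentioned above):

    import numpy as np

    def sinusoid_pos_emb(max_len, d_model=512):
        # pe[t, 2i] = sin(t / 10000^(2i/d)), pe[t, 2i+1] = cos(t / 10000^(2i/d))
        pos = np.arange(max_len, dtype="float32")[:, None]
        div = np.exp(np.arange(0, d_model, 2, dtype="float32") * -(np.log(10000.0) / d_model))
        pe = np.zeros((max_len, d_model), dtype="float32")
        pe[:, 0::2] = np.sin(pos * div)
        pe[:, 1::2] = np.cos(pos * div)
        return pe

    print(sinusoid_pos_emb(90).shape)   # (90, 512): one vector per subsampled frame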

    [screenshot omitted]

    After the position embedding is done, there is actually still quite a lot of processing afterwards; we'll cover that next time.

    3. Summary

    Alright, after all of this we have actually only finished the first part of the network. There is still a lot more to come, which we'll continue in the next installment.

  • Original article: https://blog.csdn.net/qq_23953717/article/details/126103648