Deep Learning-Based Speech Denoising in MATLAB


    Earlier, we looked at a simple approach to denoising speech signals with a deep autoencoder:

    基于自编码器的语音信号降噪 (Speech Signal Denoising Based on Autoencoders) - 哥廷根数学学派 - 知乎

    This article covers somewhat more involved deep learning speech denoising methods and compares two networks applied to the same task: a fully connected network and a convolutional network.

    The complete code and dataset are available via the link in the original post.


    Consider the following speech signal sampled at 8 kHz:

    [cleanAudio,fs] = audioread("SpeechDFT.wav");
    sound(cleanAudio,fs)

    Add washing machine noise to the speech signal, scaling the noise power so that the signal-to-noise ratio (SNR) is 0 dB:

    noise = audioread("WashingMachine.mp3");

    Next, extract a noise segment from a random location in the noise file:

    ind = randi(numel(noise) - numel(cleanAudio) + 1, 1, 1);
    noiseSegment = noise(ind:ind + numel(cleanAudio) - 1);
    speechPower = sum(cleanAudio.^2);
    noisePower = sum(noiseSegment.^2);
    noisyAudio = cleanAudio + sqrt(speechPower/noisePower) * noiseSegment;
    % play the signal
    sound(noisyAudio,fs)
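
    As an optional sanity check (not in the original code), you can verify that the mixture really sits at 0 dB SNR. The residual noisyAudio - cleanAudio is the scaled noise, so its power should equal the speech power:

    % should print a value very close to 0 dB
    achievedSNR = 10*log10(sum(cleanAudio.^2)/sum((noisyAudio - cleanAudio).^2))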

    Visualize the clean and noisy signals:

    t = (1/fs) * (0:numel(cleanAudio)-1);
    subplot(2,1,1)
    plot(t,cleanAudio)
    title("Clean Audio")
    grid on
    subplot(2,1,2)
    plot(t,noisyAudio)
    title("Noisy Audio")
    xlabel("Time (s)")
    grid on

    The goal of speech denoising is to remove the washing machine noise from the speech signal while minimizing undesirable artifacts in the output speech.

    Examining the dataset

    The dataset used in this example contains short sentence recordings sampled at 48 kHz.

    The figures in the original post show the training, test, and validation set files and a sample of the training data.

    Use audioDatastore to create a datastore for the training set:

    adsTrain = audioDatastore(fullfile(dataFolder,'train'),'IncludeSubfolders',true);

    Read the contents of the first file in the datastore:

    [audio,adsTrainInfo] = read(adsTrain);

    Play the speech signal:

    sound(audio,adsTrainInfo.SampleRate)

    Plot the speech signal:

    figure
    t = (1/adsTrainInfo.SampleRate) * (0:numel(audio)-1);
    plot(t,audio)
    title("Example Speech Signal")
    xlabel("Time (s)")
    grid on

    Overview of the deep learning system

    The basic deep learning training scheme is illustrated in the figure below.

    Readers familiar with speech signal processing will find this straightforward. Note that because most speech energy lies below 4 kHz, the clean and noisy speech signals are first downsampled to 8 kHz to reduce the computational load on the network. The network predicts the magnitude spectrum of the denoised signal; the denoised audio is then converted back to the time domain from the predicted magnitude spectrum combined with the phase of the noisy signal [1].

    The audio signals are transformed to the frequency domain with the short-time Fourier transform (STFT), using a periodic Hamming window of length 256 with 75% overlap. The predictor input consists of 8 consecutive noisy STFT vectors, so each STFT output estimate is computed from the current noisy STFT vector and the 7 preceding ones.

    How are the targets and predictors generated from a single training file? First, define the system parameters:

    windowLength = 256;
    win = hamming(windowLength,"periodic");
    overlap = round(0.75 * windowLength);
    ffTLength = windowLength;
    inputFs = 48e3;
    fs = 8e3;
    numFeatures = ffTLength/2 + 1;
    numSegments = 8;
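
    As a quick side calculation (not part of the pipeline), these parameters imply a 64-sample hop, so one STFT frame is produced every 8 ms at 8 kHz, with 129 frequency bins per frame:

    hop = windowLength - overlap     % 256 - 192 = 64 samples
    hopMs = 1000*hop/fs              % 8 ms per frame
    numFeatures                      % 129 frequency bins per frame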

    Create a dsp.SampleRateConverter object to convert the 48 kHz audio to 8 kHz:

    src = dsp.SampleRateConverter("InputSampleRate",inputFs, ...
        "OutputSampleRate",fs, ...
        "Bandwidth",7920);

    Read the contents of an audio file from the datastore:

    audio = read(adsTrain);

    Note: make sure the audio length is a multiple of the sample rate converter's decimation factor:

    decimationFactor = inputFs/fs;
    L = floor(numel(audio)/decimationFactor);
    audio = audio(1:decimationFactor*L);

    Convert the audio signal to 8 kHz:

    audio = src(audio);
    reset(src)

    Extract a random noise segment from the washing machine noise vector:

    randind = randi(numel(noise) - numel(audio),[1 1]);
    noiseSegment = noise(randind : randind + numel(audio) - 1);

    Add the noise to the speech signal so that the SNR is 0 dB:

    noisePower = sum(noiseSegment.^2);
    cleanPower = sum(audio.^2);
    noiseSegment = noiseSegment .* sqrt(cleanPower/noisePower);
    noisyAudio = audio + noiseSegment;

    Use the STFT to generate magnitude STFT vectors from the clean and noisy audio signals:

    cleanSTFT = stft(audio,'Window',win,'OverlapLength',overlap,'FFTLength',ffTLength);
    cleanSTFT = abs(cleanSTFT(numFeatures-1:end,:));
    noisySTFT = stft(noisyAudio,'Window',win,'OverlapLength',overlap,'FFTLength',ffTLength);
    noisySTFT = abs(noisySTFT(numFeatures-1:end,:));

    Generate the 8-segment training predictor signals from the noisy STFT, matching the training scheme diagram. The first numSegments - 1 columns are repeated at the front so that every clean frame gets a full 8 frames of noisy context:

    noisySTFT = [noisySTFT(:,1:numSegments - 1), noisySTFT];
    stftSegments = zeros(numFeatures, numSegments, size(noisySTFT,2) - numSegments + 1);
    for index = 1:size(noisySTFT,2) - numSegments + 1
        stftSegments(:,:,index) = noisySTFT(:,index:index + numSegments - 1);
    end

    Set the targets and predictors. Each predictor is 129-by-8 and each target is 129-by-1; these dimensions can be adjusted to suit your own signals:

    targets = cleanSTFT;
    size(targets)
    predictors = stftSegments;
    size(predictors)

    To speed up processing, use tall arrays to extract the feature sequences from the speech segments of all audio files in the datastore, assuming you have Parallel Computing Toolbox (a GPU helps further).

    First, convert the datastore to a tall array; this step can take a while:

    reset(adsTrain)
    T = tall(adsTrain)

    Extract the target and predictor magnitude STFTs from the tall array. This step matters: GenerateSpeechDenoisingFeatures is the function that extracts the target and predictor magnitude STFTs (it could still be optimized a little), and gather triggers the deferred tall-array computation:

    [targets,predictors] = cellfun(@(x)GenerateSpeechDenoisingFeatures(x,noise,src),T,"UniformOutput",false);
    [targets,predictors] = gather(targets,predictors);
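
    GenerateSpeechDenoisingFeatures itself is not listed in this post. Below is a minimal sketch of what it plausibly contains, simply packaging the per-file steps shown above into one function; the actual function in the linked code may differ:

    function [targets,predictors] = GenerateSpeechDenoisingFeatures(audio,noise,src)
    % Sketch (assumption): resample one 48 kHz file to 8 kHz, mix in washing
    % machine noise at 0 dB SNR, compute magnitude STFTs, and stack 8
    % consecutive noisy frames per predictor.
    windowLength = 256;
    win = hamming(windowLength,"periodic");
    overlap = round(0.75 * windowLength);
    ffTLength = windowLength;
    inputFs = 48e3;
    fs = 8e3;
    numFeatures = ffTLength/2 + 1;
    numSegments = 8;

    % Trim to a multiple of the decimation factor, then resample to 8 kHz
    decimationFactor = inputFs/fs;
    L = floor(numel(audio)/decimationFactor);
    audio = audio(1:decimationFactor*L);
    audio = src(audio);
    reset(src)

    % Mix in a random washing machine noise segment at 0 dB SNR
    randind = randi(numel(noise) - numel(audio),[1 1]);
    noiseSegment = noise(randind : randind + numel(audio) - 1);
    noiseSegment = noiseSegment .* sqrt(sum(audio.^2)/sum(noiseSegment.^2));
    noisyAudio = audio + noiseSegment;

    % Magnitude STFTs: the clean frames are the targets
    cleanSTFT = stft(audio,'Window',win,'OverlapLength',overlap,'FFTLength',ffTLength);
    targets = abs(cleanSTFT(numFeatures-1:end,:));
    noisySTFT = stft(noisyAudio,'Window',win,'OverlapLength',overlap,'FFTLength',ffTLength);
    noisySTFT = abs(noisySTFT(numFeatures-1:end,:));

    % Stack 8 consecutive noisy frames per predictor
    noisySTFT = [noisySTFT(:,1:numSegments - 1), noisySTFT];
    predictors = zeros(numFeatures, numSegments, size(noisySTFT,2) - numSegments + 1);
    for index = 1:size(noisySTFT,2) - numSegments + 1
        predictors(:,:,index) = noisySTFT(:,index:index + numSegments - 1);
    end
    end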

    Normalize all the features: compute the mean and standard deviation of the targets and predictors separately, and use them to normalize the data:

    predictors = cat(3,predictors{:});
    noisyMean = mean(predictors(:));
    noisyStd = std(predictors(:));
    predictors(:) = (predictors(:) - noisyMean)/noisyStd;
    targets = cat(2,targets{:});
    cleanMean = mean(targets(:));
    cleanStd = std(targets(:));
    targets(:) = (targets(:) - cleanMean)/cleanStd;
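
    Optionally, a quick check (not in the original code) that the normalization behaved as expected; the printed values should be very close to 0 and 1:

    fprintf("predictors: mean %.3g, std %.3g\n",mean(predictors(:)),std(predictors(:)))
    fprintf("targets:    mean %.3g, std %.3g\n",mean(targets(:)),std(targets(:)))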

    Reshape the targets and predictors to the dimensions expected by the deep learning network: the predictors become numFeatures-by-numSegments-by-1-by-N images and the targets 1-by-1-by-numFeatures-by-N responses:

    predictors = reshape(predictors,size(predictors,1),size(predictors,2),1,size(predictors,3));
    targets = reshape(targets,1,1,size(targets,1),size(targets,2));

    Randomly split the data into training and validation sets:

    inds = randperm(size(predictors,4));
    L = round(0.99 * size(predictors,4));
    trainPredictors = predictors(:,:,:,inds(1:L));
    trainTargets = targets(:,:,:,inds(1:L));
    validatePredictors = predictors(:,:,:,inds(L+1:end));
    validateTargets = targets(:,:,:,inds(L+1:end));

    Now to the main event: speech denoising with the first network, a deep fully connected network.

    We will not explain what a fully connected network is here; see the figure.

    Define the network layers. Specify the input as images of size NumFeatures-by-NumSegments (129-by-8 in this example). Define two hidden fully connected layers, each with 1024 neurons. Because fully connected layers on their own are purely linear, each hidden fully connected layer is followed by a rectified linear unit (ReLU) layer. Batch normalization layers normalize the means and standard deviations of the outputs. Add a final fully connected layer with 129 neurons, followed by a regression layer.

    layers = [
        imageInputLayer([numFeatures,numSegments])
        fullyConnectedLayer(1024)
        batchNormalizationLayer
        reluLayer
        fullyConnectedLayer(1024)
        batchNormalizationLayer
        reluLayer
        fullyConnectedLayer(numFeatures)
        regressionLayer
        ];
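
    Before training, you can optionally inspect the layer shapes and learnable parameter counts with analyzeNetwork from Deep Learning Toolbox:

    analyzeNetwork(layers)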

    Next, set the training options; these are fairly self-explanatory:

    miniBatchSize = 128;
    options = trainingOptions("adam", ...
        "MaxEpochs",3, ...
        "InitialLearnRate",1e-5,...
        "MiniBatchSize",miniBatchSize, ...
        "Shuffle","every-epoch", ...
        "Plots","training-progress", ...
        "Verbose",false, ...
        "ValidationFrequency",floor(size(trainPredictors,4)/miniBatchSize), ...
        "LearnRateSchedule","piecewise", ...
        "LearnRateDropFactor",0.9, ...
        "LearnRateDropPeriod",1, ...
        "ValidationData",{validatePredictors,validateTargets});

    Use trainNetwork to train the deep network with the specified training options and layers. Because the training set is large, training takes a while:

        denoiseNetFullyConnected = trainNetwork(trainPredictors,trainTargets,layers,options);

    Count the number of weights in the network's fully connected layers:

    numWeights = 0;
    for index = 1:numel(denoiseNetFullyConnected.Layers)
        if isa(denoiseNetFullyConnected.Layers(index),"nnet.cnn.layer.FullyConnectedLayer")
            numWeights = numWeights + numel(denoiseNetFullyConnected.Layers(index).Weights);
        end
    end
    fprintf("The number of weights is %d.\n",numWeights);

    Next, speech denoising with a convolutional neural network.

    Convolutional layers typically use far fewer parameters than fully connected layers. Following the fully convolutional network described in [2], the network contains 16 convolutional layers. The first 15 form groups of 3 layers, repeated 5 times, with filter widths of 9, 5, and 9 and 18, 30, and 8 filters respectively. The final convolutional layer has a filter width of 129. In this network convolution is performed in one direction only (along the frequency dimension), and for all layers except the first, the filter width along the time dimension is set to 1. As in the fully connected network, each convolutional layer is followed by batch normalization and ReLU layers.

    layers = [imageInputLayer([numFeatures,numSegments])
        convolution2dLayer([9 8],18,"Stride",[1 100],"Padding","same")
        batchNormalizationLayer
        reluLayer
        repmat( ...
            [convolution2dLayer([5 1],30,"Stride",[1 100],"Padding","same")
            batchNormalizationLayer
            reluLayer
            convolution2dLayer([9 1],8,"Stride",[1 100],"Padding","same")
            batchNormalizationLayer
            reluLayer
            convolution2dLayer([9 1],18,"Stride",[1 100],"Padding","same")
            batchNormalizationLayer
            reluLayer],4,1)
        convolution2dLayer([5 1],30,"Stride",[1 100],"Padding","same")
        batchNormalizationLayer
        reluLayer
        convolution2dLayer([9 1],8,"Stride",[1 100],"Padding","same")
        batchNormalizationLayer
        reluLayer
        convolution2dLayer([129 1],1,"Stride",[1 100],"Padding","same")
        regressionLayer
        ];
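
    A note on the otherwise puzzling "Stride",[1 100] (my reading, not stated in the original post): with "Padding","same" the output width of a layer is ceil(inputWidth/strideWidth), so the 8-frame time axis collapses to a single column after the first layer, and every subsequent convolution effectively runs along frequency only:

    ceil(numSegments/100)   % = 1: the time dimension is gone after layer 1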

    The training options are similar to those of the fully connected network, except that the validation targets are permuted from 1-by-1-by-129-by-N to 129-by-1-by-1-by-N so that they match the convolutional network's output shape:

    options = trainingOptions("adam", ...
        "MaxEpochs",3, ...
        "InitialLearnRate",1e-5, ...
        "MiniBatchSize",miniBatchSize, ...
        "Shuffle","every-epoch", ...
        "Plots","training-progress", ...
        "Verbose",false, ...
        "ValidationFrequency",floor(size(trainPredictors,4)/miniBatchSize), ...
        "LearnRateSchedule","piecewise", ...
        "LearnRateDropFactor",0.9, ...
        "LearnRateDropPeriod",1, ...
        "ValidationData",{validatePredictors,permute(validateTargets,[3 1 2 4])});

    Use trainNetwork to train the network with the specified training options and layer architecture:

        denoiseNetFullyConvolutional = trainNetwork(trainPredictors,permute(trainTargets,[3 1 2 4]),layers,options);

    Count the number of weights in the network's convolutional layers:

    numWeights = 0;
    for index = 1:numel(denoiseNetFullyConvolutional.Layers)
        if isa(denoiseNetFullyConvolutional.Layers(index),"nnet.cnn.layer.Convolution2DLayer")
            numWeights = numWeights + numel(denoiseNetFullyConvolutional.Layers(index).Weights);
        end
    end
    fprintf("The number of weights in convolutional layers is %d\n",numWeights);

    Testing the denoising networks

    Read in the test set:

    adsTest = audioDatastore(fullfile(dataFolder,'test'),'IncludeSubfolders',true);

    Read a file from the datastore:

    [cleanAudio,adsTestInfo] = read(adsTest);

    Make sure the audio length is a multiple of the sample rate converter's decimation factor:

    L = floor(numel(cleanAudio)/decimationFactor);
    cleanAudio = cleanAudio(1:decimationFactor*L);

    Convert the audio signal to 8 kHz:

    cleanAudio = src(cleanAudio);
    reset(src)

    In the test phase, corrupt the speech signal with washing machine noise that was not used during training:

    noise = audioread("WashingMachine-16-8-mono-200secs.mp3");

    Extract a random noise segment from the washing machine noise vector:

    randind = randi(numel(noise) - numel(cleanAudio), [1 1]);
    noiseSegment = noise(randind : randind + numel(cleanAudio) - 1);

    Add the noise to the speech signal so that the SNR is 0 dB:

    noisePower = sum(noiseSegment.^2);
    cleanPower = sum(cleanAudio.^2);
    noiseSegment = noiseSegment .* sqrt(cleanPower/noisePower);
    noisyAudio = cleanAudio + noiseSegment;

    Again use the STFT to generate magnitude STFT vectors from the noisy speech signal, this time also keeping the phase for later reconstruction:

    noisySTFT = stft(noisyAudio,'Window',win,'OverlapLength',overlap,'FFTLength',ffTLength);
    noisyPhase = angle(noisySTFT(numFeatures-1:end,:));
    noisySTFT = abs(noisySTFT(numFeatures-1:end,:));

    Again generate the 8-segment predictor signals from the noisy STFT:

    noisySTFT = [noisySTFT(:,1:numSegments-1) noisySTFT];
    predictors = zeros(numFeatures, numSegments, size(noisySTFT,2) - numSegments + 1);
    for index = 1:(size(noisySTFT,2) - numSegments + 1)
        predictors(:,:,index) = noisySTFT(:,index:index + numSegments - 1);
    end

    Normalize the predictors with the mean and standard deviation computed during training:

    predictors(:) = (predictors(:) - noisyMean) / noisyStd;

    Compute the denoised magnitude STFTs with the two trained networks:

    predictors = reshape(predictors, [numFeatures,numSegments,1,size(predictors,3)]);
    STFTFullyConnected = predict(denoiseNetFullyConnected, predictors);
    STFTFullyConvolutional = predict(denoiseNetFullyConvolutional, predictors);

    Undo the normalization using the mean and standard deviation from the training phase:

    STFTFullyConnected(:) = cleanStd * STFTFullyConnected(:) + cleanMean;
    STFTFullyConvolutional(:) = cleanStd * STFTFullyConvolutional(:) + cleanMean;

    Convert the one-sided STFTs to centered, conjugate-symmetric STFTs, standard practice in signal processing. Note that predict returns the fully connected network's output as an N-by-129 matrix (hence the transpose) and the convolutional network's as a 129-by-1-by-1-by-N array (hence the squeeze):

    STFTFullyConnected = STFTFullyConnected.' .* exp(1j*noisyPhase);
    STFTFullyConnected = [conj(STFTFullyConnected(end-1:-1:2,:)); STFTFullyConnected];
    STFTFullyConvolutional = squeeze(STFTFullyConvolutional) .* exp(1j*noisyPhase);
    STFTFullyConvolutional = [conj(STFTFullyConvolutional(end-1:-1:2,:)) ; STFTFullyConvolutional];

    Compute the denoised speech signals. istft performs the inverse STFT, reconstructing the time-domain signal using the phase of the noisy STFT:

    denoisedAudioFullyConnected = istft(STFTFullyConnected, ...
        'Window',win,'OverlapLength',overlap, ...
        'FFTLength',ffTLength,'ConjugateSymmetric',true);
    denoisedAudioFullyConvolutional = istft(STFTFullyConvolutional, ...
        'Window',win,'OverlapLength',overlap, ...
        'FFTLength',ffTLength,'ConjugateSymmetric',true);

    Plot the clean, noisy, and denoised audio signals:

    t = (1/fs) * (0:numel(denoisedAudioFullyConnected)-1);
    figure
    subplot(4,1,1)
    plot(t,cleanAudio(1:numel(denoisedAudioFullyConnected)))
    title("Clean Speech")
    grid on
    subplot(4,1,2)
    plot(t,noisyAudio(1:numel(denoisedAudioFullyConnected)))
    title("Noisy Speech")
    grid on
    subplot(4,1,3)
    plot(t,denoisedAudioFullyConnected)
    title("Denoised Speech (Fully Connected Layers)")
    grid on
    subplot(4,1,4)
    plot(t,denoisedAudioFullyConvolutional)
    title("Denoised Speech (Convolutional Layers)")
    grid on
    xlabel("Time (s)")

    Plot the spectrograms of the clean, noisy, and denoised signals:

    h = figure;
    subplot(4,1,1)
    spectrogram(cleanAudio,win,overlap,ffTLength,fs);
    title("Clean Speech")
    grid on
    subplot(4,1,2)
    spectrogram(noisyAudio,win,overlap,ffTLength,fs);
    title("Noisy Speech")
    grid on
    subplot(4,1,3)
    spectrogram(denoisedAudioFullyConnected,win,overlap,ffTLength,fs);
    title("Denoised Speech (Fully Connected Layers)")
    grid on
    subplot(4,1,4)
    spectrogram(denoisedAudioFullyConvolutional,win,overlap,ffTLength,fs);
    title("Denoised Speech (Convolutional Layers)")
    grid on
    p = get(h,'Position');
    set(h,'Position',[p(1) 65 p(3) 800]);

    Play the noisy speech signal:

    sound(noisyAudio,fs)

    Play the speech signal denoised by the fully connected network:

    sound(denoisedAudioFullyConnected,fs)

    Play the speech signal denoised by the convolutional network:

    sound(denoisedAudioFullyConvolutional,fs)

    Play the clean speech signal:

    sound(cleanAudio,fs)

    Test more files, generating time-domain and frequency-domain plots and returning the clean, noisy, and denoised speech signals:

    [cleanAudio,noisyAudio,denoisedAudioFullyConnected,denoisedAudioFullyConvolutional] = testDenoisingNets(adsTest,denoiseNetFullyConnected,denoiseNetFullyConvolutional,noisyMean,noisyStd,cleanMean,cleanStd);

    Training is very time-consuming; you really need Parallel Computing Toolbox and a GPU.

    This method also transfers to other one-dimensional signals, such as microseismic, mechanical vibration, and ECG signals, but pay particular attention to the phase of the added noise signal.

    References

    [1] "Experiments on Deep Learning for Speech Denoising", Ding Liu, Paris Smaragdis, Minje Kim, INTERSPEECH, 2014.

    [2] "A Fully Convolutional Neural Network for Speech Enhancement", Se Rim Park, Jin Won Lee, INTERSPEECH, 2017.

    Original article: https://blog.csdn.net/weixin_39402231/article/details/127104174