• 【20220121】Voice conversion


    Speech Representation Disentanglement with Adversarial Mutual Information Learning for One-shot Voice Conversion: intermediate results, github

    1. autovc

    AUTOVC: Zero-Shot Voice Style Transfer with Only Autoencoder Loss

    1. For a translation of the paper, see this CSDN post and this CSDN post
    2. For background on zero-shot learning (ZSL), see this Zhihu article
    3. Official AutoVC github

    Three problems the paper addresses:

    1. training on non-parallel data;
    2. many-to-many conversion;
    3. zero-shot conversion

    The goal is to match distributions like a GAN while remaining as easy to train as a CVAE.

    zero-shot

    Zero-shot learning, conceptually:
    [Figure: zero-shot learning concept illustration]
    Train a model on the training set so that it can classify objects from the test set, even though the training classes and the test classes are disjoint; class descriptions are needed to bridge the two sets and make the model work. The first problem is obtaining a suitable class description A; the second is building a suitable classification model.

    Known problems in zero-shot learning:

    1. Domain shift: handled with an autoencoding process
    2. Hubness: build a mapping from semantic space to feature space; use a generative model
    3. Semantic gap: align the two manifolds

    Code reproduction

    Runtime errors

    1. RuntimeError: CUDA error: out of memory. Not enough CUDA memory.
      Fix: wait until the other job on GPU 0 finishes (1725 MiB needed)
    2. module 'librosa' has no attribute 'output'
      librosa removed the output module in version 0.8.0
      Fix:
    # librosa.output.write_wav no longer exists in librosa >= 0.8.0; write with soundfile instead
    import soundfile as sf
    sf.write(name + '.wav', waveform, 16000)
    

    Pipeline:

    1. Generate spectrogram data from the wav files: python make_spect.py
      Using 201.wav and 202.wav from liuchanhg_angry, wangzhe_happy, and zhaoquanyin_sad in the CASIA database as examples, the generated spectrograms S have shapes (112, 80), (103, 80); (92, 80), (70, 80); (189, 80), (87, 80)

    2. Generate training metadata, including the GE2E speaker embedding (please use one-hot embeddings if you are not doing zero-shot conversion): python make_metadata.py
      If the current utterance is too short, another utterance is chosen.
      The final format (a reading sketch follows below) is:
      [['liuchanhg_angry', array(speaker embedding), 'liuchanhg_angry/201.npy', 'liuchanhg_angry/202.npy']]
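
    A minimal sketch of reading this metadata back, assuming the AutoVC repo layout (the spmel directory and the pickle name train.pkl are assumptions):

    import pickle
    import numpy as np

    # hypothetical path: make_metadata.py in the AutoVC repo writes spmel/train.pkl
    metadata = pickle.load(open("spmel/train.pkl", "rb"))

    for entry in metadata:
        speaker, emb = entry[0], np.asarray(entry[1])  # speaker name + GE2E (or one-hot) embedding
        utterances = entry[2:]                         # relative paths to the mel-spectrogram .npy files
        print(speaker, emb.shape, utterances[:2])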

    3. Run the main training script: python main.py
      Converges when the reconstruction loss is around 0.0001.

    ...
    Elapsed [1 day, 3:41:14], Iteration [304060/1000000], G/loss_id: 0.0180, G/loss_id_psnt: 0.0179, G/loss_cd: 0.0001
    Elapsed [1 day, 3:41:17], Iteration [304070/1000000], G/loss_id: 0.0110, G/loss_id_psnt: 0.0109, G/loss_cd: 0.0001
    ...
    

    100k

    Elapsed [2:25:56], Iteration [99990/100000], G/loss_id: 0.0294, G/loss_id_psnt: 0.0294, G/loss_cd: 0.0000
    Elapsed [2:25:57], Iteration [100000/100000], G/loss_id: 0.0240, G/loss_id_psnt: 0.0240, G/loss_cd: 0.0000
    

    1000k

    Elapsed [17:26:39], Iteration [698500/1000000], G/loss_id: 0.0289, G/loss_id_psnt: 0.0289, G/loss_cd: 0.0000
    
    

    model_2: dim_neck=32, freq=32, batch_size=2

    Elapsed [14:21:06], Iteration [461400/1000000], G/loss_id: 0.0139, G/loss_id_psnt: 0.0139, G/loss_cd: 0.0001
    ...
    Elapsed [23:37:37], Iteration [732500/1000000], G/loss_id: 0.0163, G/loss_id_psnt: 0.0162, G/loss_cd: 0.0001
    ...
    Elapsed [1 day, 8:23:07], Iteration [999900/1000000], G/loss_id: 0.0190, G/loss_id_psnt: 0.0189, G/loss_cd: 0.0001
    Elapsed [1 day, 8:23:18], Iteration [1000000/1000000], G/loss_id: 0.0143, G/loss_id_psnt: 0.0143, G/loss_cd: 0.0001
    

    model_3: len_crop=128*3

    Elapsed [14:10:26], Iteration [181400/1000000], G/loss_id: 0.0146, G/loss_id_psnt: 0.0145, G/loss_cd: 0.0002
    Elapsed [14:10:54], Iteration [181500/1000000], G/loss_id: 0.0160, G/loss_id_psnt: 0.0159, G/loss_cd: 0.0002
    ...
    Elapsed [23:26:40], Iteration [290100/1000000], G/loss_id: 0.0163, G/loss_id_psnt: 0.0163, G/loss_cd: 0.0001
    ...
    Elapsed [1 day, 9:20:03], Iteration [426600/1000000], G/loss_id: 0.0125, G/loss_id_psnt: 0.0124, G/loss_cd: 0.0001
    Elapsed [1 day, 9:20:17], Iteration [426700/1000000], G/loss_id: 0.0186, G/loss_id_psnt: 0.0182, G/loss_cd: 0.0001
    ...
    Elapsed [1 day, 12:16:26], Iteration [501300/1000000], G/loss_id: 0.0137, G/loss_id_psnt: 0.0131, G/loss_cd: 0.0001
    Elapsed [1 day, 12:16:40], Iteration [501400/1000000], G/loss_id: 0.0152, G/loss_id_psnt: 0.0149, G/loss_cd: 0.0001
    

    Next I noticed that the mel-spectrogram length does not match the vocoder: a ~2 s clip yields a mel spectrogram of roughly (400-500, 80), so I tried generating mel spectrograms directly from the wavenet code instead; see the official github. (A quick frame-count sanity check is sketched below.)
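
    The expected frame count follows directly from the STFT hop size (the hop values here are illustrative assumptions; a vocoder trained with one hop cannot consume mels extracted with another):

    def n_mel_frames(n_samples: int, hop_length: int) -> int:
        # frame count of a centered STFT: one frame per hop, plus one
        return n_samples // hop_length + 1

    two_seconds = 2 * 16000
    print(n_mel_frames(two_seconds, hop_length=256))  # ~126 frames
    print(n_mel_frames(two_seconds, hop_length=64))   # ~501 frames: a small hop explains (400-500, 80)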

    model_4: following this reference, add a squeeze:

    # x_identic carries an extra singleton dimension; squeeze it so the shapes match x_real
    g_loss_id = F.mse_loss(x_real, x_identic.squeeze())
    g_loss_id_psnt = F.mse_loss(x_real, x_identic_psnt.squeeze())
    

    But the author says:

    I trimmed the silence off by hand.

    And also:

    small batch size leads to better generalization

    Is it necessary to use other vocoders?

    The author:

    300k steps, 10 hours

    The author:

    The only use a subset.

    The author:

    final loss is around 1e-2

    AutoVC's results are quite poor; my guesses at the possible causes:

    1. The author says "I trimmed the silence off by hand.", so it is unclear whether too much silence hurts training: with the provided extraction code, p225_001.wav comes out as (385, 80); with the wavenet mel extraction I get (177, 80); and the vocoder's copy of this utterance is (90, 80). This matters for generating results (especially the metadata), and many people hit the same issue (https://github.com/auspicious3000/autovc/issues/84, https://github.com/auspicious3000/autovc/issues/17). Also, the original code uses len_crop=128, which may be too short to cover the speaker's voice, so I increased it to 376 (still training); the speaker embedding is still computed with 128, and if training still fails I will change that last. (PS: the shape is now (166, 80); I also see (129, 80) and (385, 80) for wav48.)
    2. The author says "small batch size leads to better generalization", so I changed the batch size back to 2 and retrained;
    3. On training steps, the paper says "100k", but the author also says "300k steps";
    4. The author also says "The only use a subset.", i.e. VCTK was not split 9:1 as described in the paper.

    retrain

    Elapsed [1 day, 3:08:04], Iteration [1000000/1000000], G/loss_id: 0.0077, G/loss_id_psnt: 0.0077, G/loss_cd: 0.0005
    Elapsed [1 day, 10:11:06], Iteration [1000000/1000000], G/loss_id: 0.0062, G/loss_id_psnt: 0.0062, G/loss_cd: 0.0002
    Elapsed [1 day, 10:10:44], Iteration [1000000/1000000], G/loss_id: 0.0037, G/loss_id_psnt: 0.0036, G/loss_cd: 0.0002
    Elapsed [2 days, 6:16:53], Iteration [1000000/1000000], G/loss_id: 0.0034, G/loss_id_psnt: 0.0033, G/loss_cd: 0.0002
    Elapsed [2 days, 5:12:11], Iteration [1000000/1000000], G/loss_id: 0.0033, G/loss_id_psnt: 0.0032, G/loss_cd: 0.0002

    None of these give good results.

    Vocoder

    Following the lab wiki's toolkit github, mel-spectrogram extraction and vocoder synthesis can both be handled with pretrained models.

    wavenet CUDA issue
    Modify line 112 of /ceph/home/yangsc21/anaconda3/envs/autovc/lib/python3.8/site-packages/wavenet_vocoder/mixture.py

    Dataset split

    For zero-shot conversion, 100 speakers are randomly selected for training, and 9 speakers never appear in the training set (a split sketch follows below).

    Note: exclude the file p376_295.raw.

    Note that the ‘p315’ text was lost due to a hard disk error.
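
    A minimal sketch of this split (the directory layout, seed, and p315 handling are assumptions, not the exact script used):

    import os
    import random

    wav_root = "/ceph/datasets/VCTK-Corpus/wav16"  # assumed: one sub-folder per speaker
    speakers = sorted(os.listdir(wav_root))
    if "p315" in speakers:
        speakers.remove("p315")                    # its transcript was lost to a hard disk error

    random.seed(0)                                 # assumed seed
    random.shuffle(speakers)
    train_speakers = speakers[:100]
    unseen_speakers = speakers[100:109]            # 9 speakers held out for zero-shot tests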

    Elapsed [0:00:12], Iteration [100/1000000], G/loss_id: 0.7917, G/loss_id_psnt: 1.3611, G/loss_cd: 0.0848
    Elapsed [0:00:22], Iteration [200/1000000], G/loss_id: 0.7815, G/loss_id_psnt: 0.6783, G/loss_cd: 0.0517
    ...
    Elapsed [9:17:16], Iteration [325900/1000000], G/loss_id: 0.0984, G/loss_id_psnt: 0.0974, G/loss_cd: 0.0041
    Elapsed [9:17:26], Iteration [326000/1000000], G/loss_id: 0.0822, G/loss_id_psnt: 0.0815, G/loss_cd: 0.0030
    ...
    Elapsed [17:51:58], Iteration [630900/1000000], G/loss_id: 0.0621, G/loss_id_psnt: 0.0614, G/loss_cd: 0.0020
    Elapsed [17:52:08], Iteration [631000/1000000], G/loss_id: 0.0534, G/loss_id_psnt: 0.0531, G/loss_cd: 0.0018
    

    Bottleneck dimension analysis

    • “too narrow” model: dimensions from 32 to 16, downsampling factor from 32 to 128
    • “too wide” model: dimensions 256, sampling factor to 8, λ is set to 0

    The “too narrow” model should have low classification accuracy (good disentanglement) but high reconstruction error (poor reconstruction)

    The “too wide” model should have low reconstruction error (good reconstruction) but high classification accuracy (poor disentanglement).

    freq=64, dim_neck=24

    Elapsed [15:18:01], Iteration [148500/1000000], G/loss_id: 0.1358, G/loss_id_psnt: 0.1339, G/loss_cd: 0.0075
    Elapsed [15:18:41], Iteration [148600/1000000], G/loss_id: 0.2336, G/loss_id_psnt: 0.2340, G/loss_cd: 0.0100
    ...
    Elapsed [1 day, 1:08:15], Iteration [236000/1000000], G/loss_id: 0.1385, G/loss_id_psnt: 0.1368, G/loss_cd: 0.0061
    Elapsed [1 day, 1:08:55], Iteration [236100/1000000], G/loss_id: 0.1462, G/loss_id_psnt: 0.1443, G/loss_cd: 0.0069
    ...
    Elapsed [1 day, 14:13:03], Iteration [352300/1000000], G/loss_id: 0.1201, G/loss_id_psnt: 0.1191, G/loss_cd: 0.0060
    Elapsed [1 day, 14:13:43], Iteration [352400/1000000], G/loss_id: 0.1177, G/loss_id_psnt: 0.1168, G/loss_cd: 0.0066
    ...
    Elapsed [4 days, 14:41:52], Iteration [1000000/1000000], G/loss_id: 0.0673, G/loss_id_psnt: 0.0659, G/loss_cd: 0.0027
    

    freq=64, dim_neck=16

    Elapsed [15:18:14], Iteration [151500/1000000], G/loss_id: 0.1823, G/loss_id_psnt: 0.1794, G/loss_cd: 0.0086
    Elapsed [15:18:54], Iteration [151600/1000000], G/loss_id: 0.1745, G/loss_id_psnt: 0.1731, G/loss_cd: 0.0073
    ...
    Elapsed [1 day, 1:09:25], Iteration [240900/1000000], G/loss_id: 0.1490, G/loss_id_psnt: 0.1475, G/loss_cd: 0.0058
    Elapsed [1 day, 1:10:05], Iteration [241000/1000000], G/loss_id: 0.1617, G/loss_id_psnt: 0.1594, G/loss_cd: 0.0057
    ...
    Elapsed [1 day, 14:13:05], Iteration [359300/1000000], G/loss_id: 0.0810, G/loss_id_psnt: 0.0803, G/loss_cd: 0.0033
    Elapsed [1 day, 14:13:45], Iteration [359400/1000000], G/loss_id: 0.0782, G/loss_id_psnt: 0.0785, G/loss_cd: 0.0048
    ...
    Elapsed [4 days, 13:00:51], Iteration [999900/1000000], G/loss_id: 0.1269, G/loss_id_psnt: 0.1248, G/loss_cd: 0.0037
    Elapsed [4 days, 13:01:30], Iteration [1000000/1000000], G/loss_id: 0.1593, G/loss_id_psnt: 0.1570, G/loss_cd: 0.0044
    

    freq=64, dim_neck=8

    Elapsed [15:14:48], Iteration [146200/1000000], G/loss_id: 0.1795, G/loss_id_psnt: 0.1785, G/loss_cd: 0.0053
    Elapsed [15:15:29], Iteration [146300/1000000], G/loss_id: 0.2842, G/loss_id_psnt: 0.2802, G/loss_cd: 0.0078
    ...
    Elapsed [1 day, 1:06:04], Iteration [233200/1000000], G/loss_id: 0.1539, G/loss_id_psnt: 0.1560, G/loss_cd: 0.0054
    Elapsed [1 day, 1:06:45], Iteration [233300/1000000], G/loss_id: 0.2401, G/loss_id_psnt: 0.2394, G/loss_cd: 0.0074
    ...
    Elapsed [1 day, 14:09:50], Iteration [348400/1000000], G/loss_id: 0.3423, G/loss_id_psnt: 0.3405, G/loss_cd: 0.0071
    Elapsed [1 day, 14:10:31], Iteration [348500/1000000], G/loss_id: 0.3143, G/loss_id_psnt: 0.3121, G/loss_cd: 0.0091
    ...
    Elapsed [4 days, 15:07:50], Iteration [1000000/1000000], G/loss_id: 0.1669, G/loss_id_psnt: 0.1649, G/loss_cd: 0.0045
    

    freq=64, dim_neck=8, lambda_cd=10

    Elapsed [11:10:59], Iteration [98700/1000000], G/loss_id: 0.3038, G/loss_id_psnt: 0.3034, G/loss_cd: 0.0017
    Elapsed [11:11:40], Iteration [98800/1000000], G/loss_id: 0.3581, G/loss_id_psnt: 0.3517, G/loss_cd: 0.0022
    ...
    Elapsed [21:02:44], Iteration [185700/1000000], G/loss_id: 0.3073, G/loss_id_psnt: 0.3097, G/loss_cd: 0.0017
    Elapsed [21:03:25], Iteration [185800/1000000], G/loss_id: 0.2681, G/loss_id_psnt: 0.2694, G/loss_cd: 0.0016
    ...
    Elapsed [1 day, 10:06:21], Iteration [300900/1000000], G/loss_id: 0.2351, G/loss_id_psnt: 0.2319, G/loss_cd: 0.0016
    Elapsed [1 day, 10:07:02], Iteration [301000/1000000], G/loss_id: 0.2504, G/loss_id_psnt: 0.2511, G/loss_cd: 0.0016
    ...
    Elapsed [4 days, 12:21:31], Iteration [1000000/1000000], G/loss_id: 0.2125, G/loss_id_psnt: 0.2103, G/loss_cd: 0.0011
    

    2. CLSVC

    Code reproduction

    Elapsed [11:38:12], Iteration [110100/200000], G/loss_id: 0.0020, G/loss_id_psnt: 0.0020, spk_loss: 0.0001, content_advloss: 4.6966, code_loss: 0.0006
    Elapsed [11:38:50], Iteration [110200/200000], G/loss_id: 0.0019, G/loss_id_psnt: 0.0019, spk_loss: 0.0124, content_advloss: 4.6481, code_loss: 0.0007
    ...
    Elapsed [21:15:26], Iteration [200000/200000], G/loss_id: 0.0016, G/loss_id_psnt: 0.0016, spk_loss: 0.0001, content_advloss: 4.6103, code_loss: 0.0003
    
    Elapsed [11:30:53], Iteration [108800/200000], G/loss_id: 0.0018, G/loss_id_psnt: 0.0019, spk_loss: 0.0002, content_advloss: 4.6069, code_loss: 0.0005
    Elapsed [11:31:31], Iteration [108900/200000], G/loss_id: 0.0018, G/loss_id_psnt: 0.0019, spk_loss: 0.0001, content_advloss: 4.6133, code_loss: 0.0004
    ...
    Elapsed [21:16:12], Iteration [200000/200000], G/loss_id: 0.0020, G/loss_id_psnt: 0.0020, spk_loss: 0.0010, content_advloss: 4.6662, code_loss: 0.0006
    

    I read the Clsvc code carefully. Compared with AutoVC, Clsvc simply removes the downsampling and upsampling and adds a very simple adversarial classifier (with gradient reversal on the content branch). The so-called "flexible hidden feature dimensions" is just the output dimension of the AutoVC encoder, which is user-defined anyway; the contribution does not seem very large…

    Results are poor; retraining.

    Changed z in the training loop to content_embedding_source.

    Elapsed [0:02:48], Iteration [1000/810000], G/loss_id: 0.2151, G/loss_id_psnt: 0.2155, spk_loss: 0.3096, content_advloss: 4.6123, code_loss: 0.0139
    ...
    Elapsed [10 days, 2:53:56], Iteration [810000/810000], G/loss_id: 0.0008, G/loss_id_psnt: 0.0008, spk_loss: 0.0000, content_advloss: 4.6048, code_loss: 0.0000
    

    Then changed the coefficient on g_loss_cd from 1 to 0.5.

    Elapsed [0:00:35], Iteration [100/810000], G/loss_id: 0.2475, G/loss_id_psnt: 0.2654, spk_loss: 4.1111, content_advloss: 4.5997, code_loss: 0.0165
    ...
    Elapsed [10 days, 2:23:01], Iteration [810000/810000], G/loss_id: 0.0009, G/loss_id_psnt: 0.0008, spk_loss: 0.0000, content_advloss: 4.6057, code_loss: 0.0001
    

    3. SpeechFlow

    Disentangles linguistic content, timbre, pitch, and rhythm.

    Code reproduction

    spk2gen, with len(spk2gen)=109, maps each speaker to a gender:

    {'p250': 'F', 'p285': 'M', 'p277': 'F',...
    

    Re-ran the experiments with VCTK's wav16 and found that some files cannot be read with sf.read; opening them directly in a player also reports a corrupted format (a scan sketch follows the listing):

    ...
    p329
    /ceph/datasets/VCTK-Corpus/wav16/p329/p329_037.wav
    p330
    /ceph/datasets/VCTK-Corpus/wav16/p330/p330_101.wav
    ...
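
    A minimal scan for the unreadable files (assuming the wav16 layout above):

    import os
    import soundfile as sf

    wav_root = "/ceph/datasets/VCTK-Corpus/wav16"
    for speaker in sorted(os.listdir(wav_root)):
        spk_dir = os.path.join(wav_root, speaker)
        for name in sorted(os.listdir(spk_dir)):
            path = os.path.join(spk_dir, name)
            try:
                sf.read(path)
            except RuntimeError:                   # soundfile raises on corrupt/truncated files
                print(speaker)
                print(path)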
    

    On preprocessing, the author says:

    All preprocessing steps are in the code, except trimming silence. But I don’t think they will make any fundamental difference. Your loss value looks fine.

    On how to train P, see this discussion.

    The demo.pkl format:

    import pickle
    
    metadata = pickle.load(open("/ceph/home/yangsc21/Python/autovc/SpeechSplit/assets/demo.pkl", "rb"))
    print(metadata[0][1].shape)
    print(metadata[0][2][0].shape)
    print(metadata[0][2][1].shape)
    
    print(metadata[1][1].shape)     # (1, 82)
    print(metadata[1][2][0].shape)      # (105, 80)
    print(metadata[1][2][1].shape)      # (105,)
    '''
    [['p226', array([[0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.    # (1, 82)
    ,(array([[0.43534297        # (135, 80)
    ,array([-1.0000000e+10,     # (135,)
    135, '003002')]
    '''
    

    Mel spectrograms extracted his way cannot be converted back to audio via mel2wav_GriffinLim, so each mel extraction recipe has its own matching vocoder (an inversion sketch follows).
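
    For reference, a minimal Griffin-Lim inversion with librosa; it only works when sr, n_fft, hop_length, the mel filterbank, and any log-scaling/normalization applied at extraction time all match (the parameter values below are assumptions):

    import numpy as np
    import librosa
    import soundfile as sf

    mel = np.load("utterance.npy")   # hypothetical (T, 80) mel; de-normalize first if it was log-scaled

    # librosa expects (n_mels, T); the STFT parameters must match the extraction exactly
    wav = librosa.feature.inverse.mel_to_audio(
        mel.T, sr=16000, n_fft=1024, hop_length=256, power=1.0)
    sf.write("reconstructed.wav", wav, 16000)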

    min_len_seq=128, max_len_seq=128 * 2, max_len_pad=128 * 3,

    Elapsed [1 day, 10:29:04], Iteration [311800/1000000], G/loss_id: 0.00094931
    ...
    Elapsed [1 day, 22:15:11], Iteration [393000/1000000], G/loss_id: 0.00089064
    Validation loss: 25.521947860717773
    ...
    Elapsed [6 days, 12:43:18], Iteration [1000000/1000000], G/loss_id: 0.00062485
    Saved model checkpoints into run/models...
    Validation loss: 24.621514320373535
    

    min_len_seq=64, max_len_seq=128, max_len_pad=192,

    Elapsed [1 day, 6:19:23], Iteration [392500/1000000], G/loss_id: 0.00089022
    ...
    Elapsed [1 day, 17:51:54], Iteration [510900/1000000], G/loss_id: 0.00070071
    ...
    Elapsed [2 days, 2:45:19], Iteration [601200/1000000], G/loss_id: 0.00064877
    Elapsed [2 days, 2:46:17], Iteration [601300/1000000], G/loss_id: 0.00086512
    ...
    Elapsed [4 days, 13:16:14], Iteration [1000000/1000000], G/loss_id: 0.00063725
    Saved model checkpoints into run_192/models...
    Validation loss: 25.529197692871094
    

    'R' - Rhythm, 'F' - Pitch, 'U' - Timbre

    4. VQMIVC

    github

    In preprocess.py, 88 speakers are randomly selected for training and 20 for testing; 10% of the training data is randomly held out as the validation set. Train: 31877, valid: 3496, test: 8474.

    Now p225_001's mel spectrogram has shape (206, 80)… that makes three different mel extraction recipes (the first was Kaizhi Qian's, the second the lab's); the lf0 has shape (206,).

    Contrastive Predictive Coding (CPC) comes from:

    Representation Learning with Contrastive Predictive Coding
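
    The CPC objective is an InfoNCE loss: the prediction of a future latent should score higher against its true future than against negatives from other positions. A minimal sketch (not VQMIVC's exact implementation):

    import torch
    import torch.nn.functional as F

    def info_nce(pred, future):
        """pred, future: (N, D) predicted vs. true future codes.
        Row i of `future` is the positive for row i of `pred`;
        the other rows serve as negatives."""
        logits = pred @ future.t()               # (N, N) similarity scores
        labels = torch.arange(pred.size(0))      # positives sit on the diagonal
        return F.cross_entropy(logits, labels)

    print(info_nce(torch.randn(8, 64), torch.randn(8, 64)))  # ~log(8) for random vectors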

    1. VQMIVC's encoder comes from Vector-Quantized Contrastive Predictive Coding (the VQ-VAE model)
    2. The speaker embedding comes from One-shot Voice Conversion by Separating Speaker and Content Representations with Instance Normalization
    3. The MI term comes from CLUB: A Contrastive Log-ratio Upper Bound of Mutual Information, which not only provides a reliable MI upper-bound estimate but also effectively minimizes correlation in deep models as a learning critic
    4. The decoder comes from AutoVC: Zero-Shot Voice Style Transfer with Only Autoencoder Loss

    Code reproduction

    Training with mutual information minimization (MIM):

    epoch:1, global step:19, recon loss:4.916, cpc loss:2.398, vq loss:0.004, perpexlity:36.475, lld cs loss:-22.828, mi cs loss:1.334E-03, lld ps loss:0.072, mi ps loss:0.000, lld cp loss:-47.886, mi cp loss:0.005, used time:1.841s
    [14.59624186 14.74744059 17.68857762 16.60425648  8.98168102 10.80852639]
    ...
    Eval | epoch:150, recon loss:0.597, cpc loss:1.171, vq loss:0.452, perpexlity:331.668, lld cs loss:109.938, mi cs loss:2.622E-03, lld ps loss:0.053, mi ps loss:0.000, lld cp loss:1085.477, mi cp loss:0.027, used time:11.050s
    ...
    epoch:500, global step:62500, recon loss:0.446, cpc loss:1.058, vq loss:0.481, perpexlity:382.976, lld cs loss:133.278, mi cs loss:-3.201E-11, lld ps loss:0.043, mi ps loss:0.001, lld cp loss:1430.699, mi cp loss:0.019, used time:59.947s
    [81.88058467 74.5465714  66.79555879 59.56658187 53.12879977 48.21143758]
    Saved checkpoint: model.ckpt-500
    
    python convert_example.py -s test_wavs_/p225_001.wav -r test_wavs_/p232_001.wav -c converted_ -m /ceph/home/yangsc21/Python/autovc/VQMIVC/checkpoints/useCSMITrue_useCPMITrue_usePSMITrue_useAmpTrue/model.ckpt-500.pt 
    

    Training without MIM:

    epoch:1, global step:11, recon loss:5.052, cpc loss:2.398, vq loss:0.007, perpexlity:28.734, lld cs loss:0.000, mi cs loss:0.000E+00, lld ps loss:0.000, mi ps loss:0.000, lld cp loss:0.000, mi cp loss:0.000, used time:0.932s
    [15.68673657 15.79730163 17.24362392 18.36610983  9.6393453  11.61267515]
    ...
    eval epoch:500, global step:62500, recon loss:0.428, cpc loss:1.084, vq loss:0.434, perpexlity:236.974, lld cs loss:0.000, mi cs loss:0.000E+00, lld ps loss:0.000, mi ps loss:0.000, lld cp loss:0.000, mi cp loss:0.000, used time:2.310s
    [80.40287876 73.21134896 65.82538815 58.78130851 52.62680888 48.37559807]
    Saved checkpoint: model.ckpt-500
    

    The first training run hit the following problem:

    Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 0.0
    epoch:22, global step:2738, recon loss:nan, cpc loss:2.190, vq loss:0.020, perpexlity:2.943, lld cs loss:0.000, mi cs loss:0.000E+00, lld ps loss:0.000, mi ps loss:0.000, lld cp loss:0.000, mi cp loss:0.000, used time:0.928s
    [69.31384211 62.62814037 61.00833479 59.76804704 60.72292783 60.05764839]
    File "train.py", line 132, in mi_second_forward
    scaled_loss.backward()
    ZeroDivisionError: float division by zero
    

    5. AutoPST

    github

    spk2emb_82.pkl format: one-hot speaker embeddings (a construction sketch follows the links below). I do not know how to train the SEA model here, so I train directly on the 82 speakers in spk2emb_82.pkl, copying over all of their audio. Note, however, that p248 and p251 are 'indian' speakers that are missing from the 'wav16' folder, so in the end 80 speakers are used for training.

    How to train the SEA model: issues
    How to make 'mfcc_stats.pkl' and 'spk2emb_82.pkl': issues
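
    A minimal sketch of building such one-hot embeddings (the speaker list and its ordering are assumptions):

    import pickle
    import numpy as np

    speakers = sorted(["p225", "p226", "p227"])  # in practice, the 82 VCTK speakers
    spk2emb = {spk: np.eye(len(speakers), dtype=np.float32)[i]
               for i, spk in enumerate(speakers)}
    pickle.dump(spk2emb, open("spk2emb_82.pkl", "wb"))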

    A:

    Elapsed [0:00:31], Iteration [100/1000000], P/loss_tx2sp: 1.56564283, P/loss_stop_sp: 0.41424835
    ...
    Elapsed [1 day, 22:56:36], Iteration [739100/1000000], P/loss_tx2sp: 0.04529002, P/loss_stop_sp: 0.00000134
    ...
    Elapsed [3 days, 2:02:54], Iteration [1000000/1000000], P/loss_tx2sp: 0.04210990, P/loss_stop_sp: 0.00000138
    Saved model checkpoints into assets ...
    

    B:

    Elapsed [0:01:07], Iteration [100/1000000], P/loss_tx2sp: 0.16594851, P/loss_stop_sp: 0.01642439
    ...
    Elapsed [1 day, 3:50:30], Iteration [308000/1000000], P/loss_tx2sp: 0.05139246, P/loss_stop_sp: 0.00002161
    129 torch.Size([4, 1635])
    (RuntimeError: CUDA out of memory. )
    ...	# retrain on A100
    Elapsed [1 day, 4:01:57], Iteration [612100/1000000], P/loss_tx2sp: 0.06876539, P/loss_stop_sp: 0.00025042
    
    

    SpeechSplit performs better only when it has the ground truth rhythm.

    6. MAP

    MCD
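
    For reference, MCD over D mel-cepstral coefficients is usually computed per aligned frame between the target (t) and converted (c) mel-cepstra, then averaged over frames:

    $$\mathrm{MCD} = \frac{10}{\ln 10}\sqrt{2\sum_{d=1}^{D}\left(mc_d^{(t)} - mc_d^{(c)}\right)^2}\ \mathrm{dB}$$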

    896 p299_010_p269_010

    9.590664489988024 3.480195005888315

    MCD = 6.729270791070735 dB, calculated over a total of 658567 frames, total 896 pairs
    

    933

    8.460124809689486 2.212626641284654

    MCD = 5.365016399179563 dB, calculated over a total of 845206 frames, total 982 pairs
    

    7. VQVC+ and AdaIN-VC

    AdaIN-VC:
    https://github.com/jjery2243542/adaptive_voice_conversion
    Unofficial: https://github.com/cyhuang-tw/AdaIN-VC
    VQVC+:
    https://github.com/ericwudayi/SkipVQVC

    8. My_Model

    run: each iteration executes G1 five times…, which is very slow (about 100 iters/min). Something must be written wrong: what gets passed out are the detached r, p and c, so I just stopped this run.

    Elapsed [0:00:59], Iteration [100/1000000], G/loss_id: 0.07169086, G/loss_id_psnt: 0.69216526, spk_loss: 4.61872101, content_adv_loss: 4.61275053, mi_cp_loss: 0.01635285, mi_rc_loss: 0.00026382, mi_rp_loss: 0.00063036, lld_cp_loss: -61.53382492, lld_rc_loss: -15.68565655, lld_rp_loss: -58.85915375
    ...
    Elapsed [10:16:29], Iteration [65200/1000000], G/loss_id: 0.00442289, G/loss_id_psnt: 0.00442220, spk_loss: 0.25282666, content_adv_loss: 4.30261326, mi_cp_loss: 0.01877159, mi_rc_loss: 0.00023279, mi_rp_loss: 0.00061867, lld_cp_loss: -62.52816391, lld_rc_loss: -15.75810432, lld_rp_loss: -59.59102631
    ...
    Elapsed [22:28:41], Iteration [139300/1000000], G/loss_id: 0.00412945, G/loss_id_psnt: 0.00413380, spk_loss: 0.00963319, content_adv_loss: 3.85407591, mi_cp_loss: 0.02865396, mi_rc_loss: 0.00045456, mi_rp_loss: 0.00082575, lld_cp_loss: -62.08999634, lld_rc_loss: -15.69443417, lld_rp_loss: -58.60738373
    

    run_: refined the above; each iteration runs G1 only once, followed by five loglikeli updates of the MI networks; added eval and sample plots; ~100 iters/40 s.

    Elapsed [0:03:57], Iteration [600/1000000], G/loss_id: 0.30983770, G/loss_id_psnt: 0.19387859, spk_loss: 4.63300323, content_adv_loss: 4.61494732, mi_cp_loss: 0.00945068, mi_rc_loss: 0.00023235, mi_rp_loss: 0.00065887, lld_cp_loss: -58.04447556, lld_rc_loss: -15.88657951, lld_rp_loss: -56.80010986
    ...
    Validation loss: 47.09280776977539
    Elapsed [3 days, 2:54:44], Iteration [633100/1000000], G/loss_id: 0.00095636, G/loss_id_psnt: 0.00094097, spk_loss: 0.00062935, content_adv_loss: 4.60769081, mi_cp_loss: -0.00006776, mi_rc_loss: 0.00005649, mi_rp_loss: -0.00000760, lld_cp_loss: -63.45062256, lld_rc_loss: -15.89001274, lld_rp_loss: -63.43792725
    ...
    Elapsed [4 days, 22:57:11], Iteration [1000000/1000000], G/loss_id: 0.00098727, G/loss_id_psnt: 0.00097293, spk_loss: 0.00006716, content_adv_loss: 4.60707092, mi_cp_loss: -0.00000000, mi_rc_loss: 0.00003540, mi_rp_loss: -0.00001546, lld_cp_loss: -63.21662521, lld_rc_loss: -15.81787777, lld_rp_loss: -63.23201752
    Saved model checkpoints into run_/models...
    Validation loss: 32.39069652557373
    

    run_VQ: further refined the above by adding VQ and CPC.

    Elapsed [0:04:19], Iteration [600/1000000], G/loss_id: 0.21996836, G/loss_id_psnt: 0.25334343, spk_loss: 4.64389563, content_adv_loss: 4.59541082, mi_cp_loss: 0.00000005, mi_rc_loss: 0.00000001, mi_rp_loss: 0.00000003, lld_cp_loss: -63.99991989, lld_rc_loss: -15.99912262, lld_rp_loss: -63.99991989, vq_loss: 0.21500304, cpc_loss: 2.32923222
    [25.67204237 24.83198941 23.82392436 24.96639788 25.70564449 24.59677458]
    ...
    Validation loss: 44.1766471862793
    Elapsed [2 days, 21:25:56], Iteration [518300/1000000], G/loss_id: 0.00244417, G/loss_id_psnt: 0.00243542, spk_loss: 0.00113923, content_adv_loss: 4.59757185, mi_cp_loss: 0.00000000, mi_rc_loss: 0.00000069, mi_rp_loss: 0.00000080, lld_cp_loss: -63.96213531, lld_rc_loss: -15.99189568, lld_rp_loss: -63.96203232, vq_loss: 816.06317139, cpc_loss: 1.40695262
    [58.23252797 57.29166865 56.85483813 55.94757795 53.49462628 53.93145084]
    ...
    Elapsed [5 days, 12:36:53], Iteration [1000000/1000000], G/loss_id: 0.00149921, G/loss_id_psnt: 0.00148138, spk_loss: 0.00003648, content_adv_loss: 4.60524797, mi_cp_loss: -0.00000074, mi_rc_loss: 0.00000015, mi_rp_loss: 0.00000297, lld_cp_loss: -63.96846771, lld_rc_loss: -15.98954391, lld_rp_loss: -63.96820068, vq_loss: 3281.53906250, cpc_loss: 1.39505339
    [65.49059153 62.63440847 62.02957034 60.18145084 58.77016187 58.36693645]
    Saved model checkpoints into run_VQ/models...
    Validation loss: 44.55451202392578
    

    run_VQ_1: lambda_cd = 0.1 -> 1

    Elapsed [0:11:13], Iteration [1500/1000000], G/loss_id: 0.20114005, G/loss_id_psnt: 0.20369399, spk_loss: 4.10042000, content_adv_loss: 4.61024761, mi_cp_loss: 0.00000031, mi_rc_loss: -0.00000001, mi_rp_loss: 0.00000008, lld_cp_loss: -63.99993896, lld_rc_loss: -15.99916458, lld_rp_loss: -63.99987411, vq_loss: 0.26906750, cpc_loss: 2.26129460
    [33.16532373 33.36693645 33.8373661  31.72042966 31.01478517 31.85483813]
    ...
    Validation loss: 49.884931564331055
    Elapsed [2 days, 19:55:09], Iteration [500100/1000000], G/loss_id: 0.00222024, G/loss_id_psnt: 0.00221478, spk_loss: 0.00017085, content_adv_loss: 4.60785437, mi_cp_loss: 0.00000000, mi_rc_loss: 0.00000000, mi_rp_loss: -0.00000000, lld_cp_loss: -64.00000000, lld_rc_loss: -15.99412155, lld_rp_loss: -63.99996185, vq_loss: 885.47814941, cpc_loss: 1.49846244
    [48.7231195  47.81585932 46.77419364 45.93414068 47.27822542 45.69892585]
    
    

    run_pitch: add a pitch decoder; lambda_cd = 1 -> 0.1; the pitch decoder's loss weight is set to 1.

    Elapsed [0:01:50], Iteration [200/1000000], G/loss_id: 0.03694459, G/loss_id_psnt: 0.88864940, spk_loss: 4.62199402, content_adv_loss: 4.66859436, mi_cp_loss: 0.00000000, mi_rc_loss: 0.00000005, mi_rp_loss: 0.00000007, lld_cp_loss: -63.99984741, lld_rc_loss: -15.99911499, lld_rp_loss: -63.99986267, vq_loss: 0.18332298, cpc_loss: 2.39174533, pitch_loss: 84147137801697099776.00000000
    [16.83467776 16.70026928 17.37231165 17.13709682 17.40591377 19.52284873]
    ...
    Elapsed [7:49:14], Iteration [53200/1000000], G/loss_id: 0.00311949, G/loss_id_psnt: 0.00310999, spk_loss: 0.22137184, content_adv_loss: 4.62207985, mi_cp_loss: -0.00000042, mi_rc_loss: 0.00000002, mi_rp_loss: -0.00000121, lld_cp_loss: -63.99859238, lld_rc_loss: -15.99206543, lld_rp_loss: -63.66162872, vq_loss: 6.28716087, cpc_loss: 1.54115570, pitch_loss: 82519535137146798080.00000000
    [49.05914068 49.96639788 48.01747203 47.47983813 45.69892585 45.0268805 ]
    ...
    
    

    After fixing the oversized loss:

    Elapsed [0:00:56], Iteration [100/1000000], G/loss_id: 0.05138026, G/loss_id_psnt: 0.77242196, spk_loss: 0.68211454, content_adv_loss: 3.45159459, mi_cp_loss: 0.00000000, mi_rc_loss: -0.00000000, mi_rp_loss: -0.00000000, lld_cp_loss: -63.99995041, lld_rc_loss: -15.99927044, lld_rp_loss: -63.99993896, vq_loss: 0.17577030, cpc_loss: 2.39789486, pitch_loss: 0.03641313
    [14.41532224 13.70967776 11.79435477 14.11290318 15.15457034 11.35752723]
    ...
    Elapsed [1 day, 23:38:22], Iteration [353300/1000000], G/loss_id: 0.00200523, G/loss_id_psnt: 0.00200193, spk_loss: 0.00335806, content_adv_loss: 4.60966444, mi_cp_loss: 0.00000000, mi_rc_loss: 0.00000065, mi_rp_loss: -0.00271915, lld_cp_loss: -61.08781052, lld_rc_loss: -15.98555946, lld_rp_loss: -61.11469650, vq_loss: 361.41998291, cpc_loss: 1.45688534, pitch_loss: 0.00808701
    [55.07392287 53.76344323 52.08333135 51.74731016 51.00806355 52.55376101]
    
    

    run_pitch_: the pitch_loss above is absurdly large, even with my tanh() applied. It turns out the raw f0 values themselves lie between -10000000000.0 and 1.0, which is strange: how can they be so small? But when I checked with the following function:

    import numpy as np
    
    def speaker_normalization(f0, index_nonzero, mean_f0, std_f0):
        # f0 is logf0
        f0 = f0.astype(float).copy()
        #index_nonzero = f0 != 0
        f0[index_nonzero] = (f0[index_nonzero] - mean_f0) / std_f0 / 4.0
        f0[index_nonzero] = np.clip(f0[index_nonzero], -1, 1)
        f0[index_nonzero] = (f0[index_nonzero] + 1) / 2.0
        return f0
    
    path = "/ceph/home/yangsc21/Python/autovc/SpeechSplit/assets/raptf0/p226/p226_025_cat.npy"
    # path = "/ceph/home/yangsc21/Python/VCTK/wav16/raptf0_100_crop_cat/p225/p225_cat.npy"
    f0_rapt = np.load(path)
    index_nonzero = (f0_rapt != -1e10)   # -1e10 marks unvoiced frames in the RAPT f0
    mean_f0, std_f0 = np.mean(f0_rapt[index_nonzero]), np.std(f0_rapt[index_nonzero])
    f0_norm = speaker_normalization(f0_rapt, index_nonzero, mean_f0, std_f0)
    f0_norm[f0_norm==-1e10] = 0          # zero out the unvoiced frames left untouched above
    print(f0_rapt, np.max(f0_rapt), np.min(f0_rapt), index_nonzero, mean_f0, std_f0, f0_norm, np.max(f0_norm), np.min(f0_norm), np.mean(f0_norm))
    

    The result:

    [-1.e+10 -1.e+10 -1.e+10 ... -1.e+10 -1.e+10 -1.e+10] 1.0 -10000000000.0 [False False False ... False False False] 0.500022 0.12482828 [0. 0. 0. ... 0. 0. 0.] 1.0 0.0 0.278805128421691
    

    So apart from the -1e10 entries, the values already lie in [0, 1]; I therefore pass the pitch decoder output through a Sigmoid() and set the loss weight to 0.1.

    Elapsed [0:00:23], Iteration [40/1000000], G/loss_id: 0.05261087, G/loss_id_psnt: 0.72422123, spk_loss: 0.69748354, content_adv_loss: 4.07323647, mi_cp_loss: 0.00000000, mi_rc_loss: -0.00000001, mi_rp_loss: 0.00000005, lld_cp_loss: -63.99515915, lld_rc_loss: -15.99876404, lld_rp_loss: -63.99857712, vq_loss: 0.12072643, cpc_loss: 2.39789486, pitch_loss: 0.05633526
    [15.35618305 15.42338729 14.34811801 15.42338729 14.71774131 10.95430106]
    ...
    Elapsed [1 day, 23:40:28], Iteration [317600/1000000], G/loss_id: 0.00184345, G/loss_id_psnt: 0.00184246, spk_loss: 0.12309902, content_adv_loss: 4.59243107, mi_cp_loss: 0.00000003, mi_rc_loss: 0.00000001, mi_rp_loss: 0.00000004, lld_cp_loss: -63.99990082, lld_rc_loss: -15.98676586, lld_rp_loss: -63.99086761, vq_loss: 322.91177368, cpc_loss: 1.37124848, pitch_loss: 0.00887575
    [63.23924661 62.29838729 60.61828136 59.40859914 58.90457034 58.63575339]
    
    

    run_pitch_2: drop VQ+CPC, use only MI + 0.1 * pitch decoder

    Elapsed [0:04:26], Iteration [700/1000000], G/loss_id: 0.16231853, G/loss_id_psnt: 0.32284465, spk_loss: 4.62042952, content_adv_loss: 4.61721945, mi_cp_loss: 0.00780822, mi_rc_loss: 0.00016762, mi_rp_loss: 0.00188451, lld_cp_loss: -61.02302551, lld_rc_loss: -15.70524502, lld_rp_loss: -59.94433594, pitch_loss: 0.01591282
    ...
    Elapsed [4 days, 23:25:32], Iteration [1000000/1000000], G/loss_id: 0.00075769, G/loss_id_psnt: 0.00075213, spk_loss: 0.00020621, content_adv_loss: 4.60287952, mi_cp_loss: 0.00002108, mi_rc_loss: 0.00001613, mi_rp_loss: 0.00000331, lld_cp_loss: -63.69352341, lld_rc_loss: -15.81671143, lld_rp_loss: -63.69023132, pitch_loss: 0.00293102
    Saved model checkpoints into run_pitch_2/models...
    Validation loss: 25.982803344726562
    

    run_pitch_3: VQCPC after the Mel Encoder (rhythm)

    Elapsed [0:00:47], Iteration [100/1000000], G/loss_id: 0.03398866, G/loss_id_psnt: 0.81710476, spk_loss: 4.63411522, content_adv_loss: 4.56968164, mi_cp_loss: 0.00585751, mi_rc_loss: 0.00000011, mi_rp_loss: -0.00000144, lld_cp_loss: -61.00984192, lld_rc_loss: -15.42654037, lld_rp_loss: -59.25158691, vq_loss: 0.05246538, cpc_loss: 2.39724064, pitch_loss: 0.01885412
    [40.2777791  45.13888955 50.69444776 56.25       63.19444776 67.36111045]
    ...
    Elapsed [5 days, 0:38:57], Iteration [1000000/1000000], G/loss_id: 0.00155650, G/loss_id_psnt: 0.00154984, spk_loss: 0.00189495, content_adv_loss: 4.60772943, mi_cp_loss: 0.00018108, mi_rc_loss: 0.00001328, mi_rp_loss: 0.00000790, lld_cp_loss: -63.04440308, lld_rc_loss: -12.41486168, lld_rp_loss: -63.04750824, vq_loss: 0.12090041, cpc_loss: 1.45722890, pitch_loss: 0.01511321
    [56.25       61.8055582  63.54166865 67.01388955 69.79166865 75.3472209 ]
    Saved model checkpoints into run_pitch_3/models...
    Validation loss: 1484.6753540039062
    

    Ablation studies

    w/o adv

    Saved model checkpoints into run_pitch_wo_adv/models...
    Validation loss: 57.68343734741211
    Elapsed [4 days, 14:00:24], Iteration [800100/1000000], G/loss_id: 0.00046955, G/loss_id_psnt: 0.00046547, spk_loss: 0.00000000, content_adv_loss: 0.00000000, mi_cp_loss: 0.00000706, mi_rc_loss: 0.00001590, mi_rp_loss: -0.00017925, lld_cp_loss: -59.52406311, lld_rc_loss: -15.50933647, lld_rp_loss: -59.52644348, pitch_loss: 0.00289221
    

    w/o MI

    Elapsed [3 days, 20:09:18], Iteration [861400/1000000], G/loss_id: 0.00078604, G/loss_id_psnt: 0.00077555, spk_loss: 0.01067678, content_adv_loss: 4.61410141, mi_cp_loss: 0.00000000, mi_rc_loss: 0.00000000, mi_rp_loss: 0.00000000, lld_cp_loss: 0.00000000, lld_rc_loss: 0.00000000, lld_rp_loss: 0.00000000, pitch_loss: 0.00468707
    Elapsed [3 days, 20:10:00], Iteration [861500/1000000], G/loss_id: 0.00068698, G/loss_id_psnt: 0.00067951, spk_loss: 0.21367706, content_adv_loss: 4.60780621, mi_cp_loss: 0.00000000, mi_rc_loss: 0.00000000, mi_rp_loss: 0.00000000, lld_cp_loss: 0.00000000, lld_rc_loss: 0.00000000, lld_rp_loss: 0.00000000, pitch_loss: 0.00338293
    

    w/o pitch

    Elapsed [3 days, 21:44:47], Iteration [807800/1000000], G/loss_id: 0.00086264, G/loss_id_psnt: 0.00085352, spk_loss: 0.00084184, content_adv_loss: 4.60911512, mi_cp_loss: 0.00000000, mi_rc_loss: 0.00008330, mi_rp_loss: 0.00001434, lld_cp_loss: -62.65066147, lld_rc_loss: -14.44594097, lld_rp_loss: -62.67908859
    Elapsed [3 days, 21:45:36], Iteration [807900/1000000], G/loss_id: 0.00084089, G/loss_id_psnt: 0.00082641, spk_loss: 0.00110251, content_adv_loss: 4.60943460, mi_cp_loss: 0.00000000, mi_rc_loss: 0.00016344, mi_rp_loss: 0.00001484, lld_cp_loss: -62.94990158, lld_rc_loss: -14.53773308, lld_rp_loss: -62.95674896
    

    w/o adv + MI

    Elapsed [3 days, 17:36:39], Iteration [886500/1000000], G/loss_id: 0.00060688, G/loss_id_psnt: 0.00059975, spk_loss: 0.00000000, content_adv_loss: 0.00000000, mi_cp_loss: 0.00000000, mi_rc_loss: 0.00000000, mi_rp_loss: 0.00000000, lld_cp_loss: 0.00000000, lld_rc_loss: 0.00000000, lld_rp_loss: 0.00000000, pitch_loss: 0.00623717
    

    w/o adv + pitch

    Elapsed [5 days, 4:59:55], Iteration [831100/1000000], G/loss_id: 0.00050043, G/loss_id_psnt: 0.00049772, spk_loss: 0.00000000, content_adv_loss: 0.00000000, mi_cp_loss: 0.00000000, mi_rc_loss: 0.00000898, mi_rp_loss: -0.00004322, lld_cp_loss: -63.60225677, lld_rc_loss: -15.57159233, lld_rp_loss: -63.61416626
    

    w/o pitch + MI

    Elapsed [4 days, 2:38:30], Iteration [818800/1000000], G/loss_id: 0.00079214, G/loss_id_psnt: 0.00078017, spk_loss: 0.00368636, content_adv_loss: 4.61357307, mi_cp_loss: 0.00000000, mi_rc_loss: 0.00000000, mi_rp_loss: 0.00000000, lld_cp_loss: 0.00000000, lld_rc_loss: 0.00000000, lld_rp_loss: 0.00000000
    

    w/o pitch + adv + MI

    Elapsed [2 days, 19:02:29], Iteration [827100/1000000], G/loss_id: 0.00056297, G/loss_id_psnt: 0.00055678, spk_loss: 0.00000000, content_adv_loss: 0.00000000, mi_cp_loss: 0.00000000, mi_rc_loss: 0.00000000, mi_rp_loss: 0.00000000, lld_cp_loss: 0.00000000, lld_rc_loss: 0.00000000, lld_rp_loss: 0.00000000
    

    Dimensions doubled

    Elapsed [6 days, 15:18:42], Iteration [1000000/1000000], G/loss_id: 0.00051162, G/loss_id_psnt: 0.00050539, spk_loss: 0.00400598, content_adv_loss: 4.59954834, mi_cp_loss: 0.00000000, mi_rc_loss: -0.00137981, mi_rp_loss: 0.00045807, lld_cp_loss: -117.74221039, lld_rc_loss: -8.09668541, lld_rp_loss: -117.96425629, pitch_loss: 0.00297939
    Saved model checkpoints into run_dim2_pitch/models...
    Validation loss: 24.739643096923828
    

    w/o adv

    Elapsed [6 days, 17:55:16], Iteration [1000000/1000000], G/loss_id: 0.00193528, G/loss_id_psnt: 0.00193324, spk_loss: 0.00000000, content_adv_loss: 0.00000000, mi_cp_loss: -0.00000000, mi_rc_loss: 0.00027657, mi_rp_loss: 0.00118168, lld_cp_loss: -36.59268570, lld_rc_loss: -31.09467697, lld_rp_loss: -41.84099579, pitch_loss: 0.00602757
    Saved model checkpoints into run_dim2_pitch_wo_adv/models...
    Validation loss: 61.93159866333008
    

    w/o MI

    Elapsed [4 days, 14:43:55], Iteration [874800/1000000], G/loss_id: 0.00076253, G/loss_id_psnt: 0.00075406, spk_loss: 0.00023900, content_adv_loss: 4.60489750, mi_cp_loss: 0.00000000, mi_rc_loss: 0.00000000, mi_rp_loss: 0.00000000, lld_cp_loss: 0.00000000, lld_rc_loss: 0.00000000, lld_rp_loss: 0.00000000, pitch_loss: 0.00438170
    

    w/o pitch

    Elapsed [7 days, 12:10:08], Iteration [1000000/1000000], G/loss_id: 0.00056876, G/loss_id_psnt: 0.00055979, spk_loss: 0.00775459, content_adv_loss: 4.60920477, mi_cp_loss: -0.00001970, mi_rc_loss: -0.00001161, mi_rp_loss: 0.00000567, lld_cp_loss: -126.23571777, lld_rc_loss: -28.51069260, lld_rp_loss: -126.23281097
    Saved model checkpoints into run_dim2_pitch_wo_pitch/models...
    Validation loss: 23.904769897460938
    

    w/o adv + MI

    Elapsed [7 days, 2:04:21], Iteration [1000000/1000000], G/loss_id: 0.00036989, G/loss_id_psnt: 0.00036615, spk_loss: 0.00000000, content_adv_loss: 0.00000000, mi_cp_loss: 0.00000000, mi_rc_loss: 0.00000000, mi_rp_loss: 0.00000000, lld_cp_loss: 0.00000000, lld_rc_loss: 0.00000000, lld_rp_loss: 0.00000000, pitch_loss: 0.00280974
    Saved model checkpoints into run_dim2_pitch_wo_adv_mi/models...
    Validation loss: 17.292850494384766
    

    w/o adv + pitch

    Elapsed [4 days, 14:56:20], Iteration [819400/1000000], G/loss_id: 0.00043808, G/loss_id_psnt: 0.00043352, spk_loss: 0.00000000, content_adv_loss: 0.00000000, mi_cp_loss: 0.00000000, mi_rc_loss: 0.00003838, mi_rp_loss: -0.00066438, lld_cp_loss: -116.34136963, lld_rc_loss: -31.60812950, lld_rp_loss: -116.60922241
    

    w/o pitch + MI

    Elapsed [4 days, 1:50:09], Iteration [837000/1000000], G/loss_id: 0.00066237, G/loss_id_psnt: 0.00065387, spk_loss: 0.16599277, content_adv_loss: 4.60649395, mi_cp_loss: 0.00000000, mi_rc_loss: 0.00000000, mi_rp_loss: 0.00000000, lld_cp_loss: 0.00000000, lld_rc_loss: 0.00000000, lld_rp_loss: 0.00000000
    

    w/o pitch + adv + MI

    Elapsed [3 days, 16:58:49], Iteration [817700/1000000], G/loss_id: 0.00039425, G/loss_id_psnt: 0.00039077, spk_loss: 0.00000000, content_adv_loss: 0.00000000, mi_cp_loss: 0.00000000, mi_rc_loss: 0.00000000, mi_rp_loss: 0.00000000, lld_cp_loss: 0.00000000, lld_rc_loss: 0.00000000, lld_rp_loss: 0.00000000
    

    9. Model

    run: the models above convert timbre poorly, so here the speaker embedding is replaced with a one-hot vector.

    Elapsed [0:00:42], Iteration [100/1000000], G/loss_id: 0.03971826, G/loss_id_psnt: 0.77077729, content_adv_loss: 4.60181856, mi_cp_loss: 0.01571883, mi_rc_loss: 0.00013684, mi_rp_loss: 0.00233091, lld_cp_loss: -61.88400269, lld_rc_loss: -15.71234703, lld_rp_loss: -58.84709930
    ...
    Elapsed [2 days, 23:36:30], Iteration [664400/1000000], G/loss_id: 0.00082198, G/loss_id_psnt: 0.00081689, content_adv_loss: 4.60013008, mi_cp_loss: 0.00000000, mi_rc_loss: 0.00009694, mi_rp_loss: -0.00047144, lld_cp_loss: -55.83151245, lld_rc_loss: -15.27785683, lld_rp_loss: -55.92704010
    ...
    Elapsed [4 days, 12:08:59], Iteration [1000000/1000000], G/loss_id: 0.00…, G/loss_id_psnt: 0.00059563, content_adv_loss: 4.61593437, mi_cp_loss: 0.00000000, mi_rc_loss: 0.00007099, mi_rp_loss: -0.00033119, lld_cp_loss: -50.61760712, lld_rc_loss: -15.26475143, lld_rp_loss: -51.30073166
    Saved model checkpoints into run/models...
    Validation loss: 37.75163459777832
    

    run_pitch:

    Elapsed [0:00:51], Iteration [100/1000000], G/loss_id: 0.04715083, G/loss_id_psnt: 0.75423688, content_adv_loss: 4.61723232, mi_cp_loss: 0.01895704, mi_rc_loss: 0.00019252, mi_rp_loss: -0.00003131, lld_cp_loss: -60.86524582, lld_rc_loss: -15.71782684, lld_rp_loss: -57.30221176, pitch_loss: 0.03021768
    ...
    Elapsed [2 days, 23:46:05], Iteration [598000/1000000], G/loss_id: 0.00082708, G/loss_id_psnt: 0.00082202, content_adv_loss: 4.60971642, mi_cp_loss: 0.00000000, mi_rc_loss: -0.00007787, mi_rp_loss: 0.00014183, lld_cp_loss: -62.80820847, lld_rc_loss: -15.05598736, lld_rp_loss: -62.85067749, pitch_loss: 0.00637796
    

    run_use_VQCPC: reached iteration 580000, then I was bumped off machine 13.

    Elapsed [3 days, 5:22:11], Iteration [580000/1000000], G/loss_id: 0.0027…, G/loss_id_psnt: 0.00274249, content_adv_loss: 4.59387255, mi_cp_loss: 0.00000014, mi_rc_loss: 0.00000099, mi_rp_loss: 0.00000026, lld_cp_loss: -63.99961853, lld_rc_loss: -15.95115471, lld_rp_loss: -63.99967575, vq_loss: 1067.20275879, cpc_loss: 1.43544137, pitch_loss: 0.01039015
    [58.77016187 58.33333135 56.95564747 55.94757795 54.30107713 54.33467627]
    Saved model checkpoints into run_use_VQCPC/models...
    Validation loss: 81.5089225769043
    

    run_use_VQCPC_2

    Elapsed [4 days, 11:03:03], Iteration [812400/1000000], G/loss_id: 0.00247453, G/loss_id_psnt: 0.00247133, content_adv_loss: 4.60783482, mi_cp_loss: 0.00007319, mi_rc_loss: -0.00000709, mi_rp_loss: 0.00000088, lld_cp_loss: -63.97900009, lld_rc_loss: -15.18019390, lld_rp_loss: -63.95431519, vq_loss: 0.13141513, cpc_loss: 1.21822941, pitch_loss: 0.01545527
    [64.23611045 63.88888955 64.23611045 63.54166865 63.54166865 65.2777791 ]
    

    new: I changed it so the mel spectrogram first goes through G1 five times, and then through G1 and G2, which is slower.

    MI:

    Elapsed [5 days, 20:36:16], Iteration [676100/1000000], G/loss_id: 0.00094016, G/loss_id_psnt: 0.00093268, content_adv_loss: 4.60860682, mi_cp_loss: -0.00006625, mi_rc_loss: 0.00011470, mi_rp_loss: 0.00001495, lld_cp_loss: -62.70265961, lld_rc_loss: -14.90209007, lld_rp_loss: -62.72354126
    
    

    MI + pitch:

    Elapsed [5 days, 20:37:14], Iteration [647300/1000000], G/loss_id: 0.00082176, G/loss_id_psnt: 0.00081603, content_adv_loss: 4.60966206, mi_cp_loss: 0.00002162, mi_rc_loss: 0.00000000, mi_rp_loss: 0.00004071, lld_cp_loss: -63.53780365, lld_rc_loss: -15.10877800, lld_rp_loss: -63.54629135, pitch_loss: 0.00451364
    
    

    References for visualizing results

    1. Cross-Lingual Voice Conversion with Disentangled Universal Linguistic Representations

    Objective evaluation: MCD, WER
    Subjective evaluation: MOS

    For English the metric should be WER. First run recognition with the ESPnet end-to-end speech processing toolkit (official github); the WER computation can follow this github.

    The recognition output below is 'SS'; to be fixed:

    import json
    import torch
    import argparse
    from espnet.bin.asr_recog import get_parser
    from espnet.nets.pytorch_backend.e2e_asr_transformer import E2E
    import os
    import scipy.io.wavfile as wav
    from python_speech_features import fbank
    
    filename = "/ceph/home/yangsc21/Python/autovc/SpeechSplit/assets/test_wav/p225/p225_001.wav"
    sample_rate, waveform = wav.read(filename)
    fbank1, _ = fbank(waveform,samplerate=16000,winlen=0.025,winstep=0.01,
          nfilt=86,nfft=512,lowfreq=0,highfreq=None,preemph=0.97)
    
    # print(fbank1[0].shape, fbank1[1].shape)     # (204, 86), (204, )
    
    
    root = "/ceph/home/yangsc21/Python/autovc/espnet/egs/tedlium3/asr1/"
    model_dir = "/ceph/home/yangsc21/Python/autovc/espnet/egs/tedlium3/exp/train_trim_sp_pytorch_nbpe500_ngpu8_train_pytorch_transformer.v2_specaug/results/"
    
    # load model
    with open(model_dir + "/model.json", "r") as f:
      idim, odim, conf = json.load(f)
    model = E2E.build(idim, odim, **conf)
    model.load_state_dict(torch.load(model_dir + "/model.last10.avg.best"))
    model.cpu().eval()
    
    # load token_list
    token_list = conf['char_list']
    print(token_list)
    # recognize speech
    parser = get_parser()
    args = parser.parse_args(["--beam-size", "1", "--ctc-weight", "1.0", "--result-label", "out.json", "--model", ""])
    
    x = torch.as_tensor(fbank1).to(torch.float32)
    result = model.recognize(x, args, token_list)
    
    print(result)
    
    s = "".join(conf["char_list"][y] for y in result[0]["yseq"]).replace("", "").replace("", " ").replace("", "")
    
    print("prediction: ", s)
    
    

    Gave up on espnet and switched to espnet_model_zoo (official github); corpus and model names can be looked up in its table. Here we run ASR with espnet2 on a librispeech model (a minimal sketch below).
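
    A minimal sketch of espnet2 inference through espnet_model_zoo (the model tag is an assumption; any librispeech ASR entry from the zoo's table should work):

    import soundfile as sf
    from espnet_model_zoo.downloader import ModelDownloader
    from espnet2.bin.asr_inference import Speech2Text

    d = ModelDownloader()
    # hypothetical model tag taken from the espnet_model_zoo table
    speech2text = Speech2Text(**d.download_and_unpack(
        "Shinji Watanabe/librispeech_asr_train_asr_transformer_e18_raw_bpe_sp_valid.acc.best"))

    speech, rate = sf.read("converted.wav")      # 16 kHz mono expected
    text, tokens, token_ids, hyp = speech2text(speech)[0]
    print("prediction:", text)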

    2. Many-to-Many Voice Conversion Based Feature Disentanglement Using Variational Autoencoder

    tSNE Visualization of speaker embedding space

    Fig. 3 illustrates speaker embedding visualized by tSNE method, there are 30 utterances sampled for every speaker to calculate the speaker representation. According to the empirical results, we found that a chunk of 2 seconds is adequate to extract the speaker representation. As shown in Fig. 3, speaker embeddings are separable for different speakers. In contrast, the speaker embeddings of utterances of the same speaker are close to each other. As a result, our method is able to extract speaker-dependence information by using the encoder network.

    p335 p264 p247 p278 p272 p262 (F, F, M, M, M, F): these speakers did not appear in training.
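
    A minimal sketch of such a tSNE plot over speaker embeddings (file names and shapes are assumptions):

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.manifold import TSNE

    embs = np.load("speaker_embeddings.npy")     # hypothetical (n_speakers * 30, D) array
    labels = np.load("speaker_labels.npy")       # hypothetical integer speaker id per row

    points = TSNE(n_components=2, perplexity=30).fit_transform(embs)
    for spk in np.unique(labels):
        sel = labels == spk
        plt.scatter(points[sel, 0], points[sel, 1], s=8, label=str(spk))
    plt.legend(fontsize=6)
    plt.savefig("tsne_speaker_emb.png", dpi=200)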

    3. Non-Parallel Many-To-Many Voice Conversion by Knowledge Transfer from a Text-To-Speech Model
    Maybe add text as an extra input?

    4. Non-Parallel Many-To-Many Voice Conversion Using Local Linguistic Tokens
    VQ-VAE

    5. Fragmentvc: Any-To-Any Voice Conversion by End-To-End Extracting and Fusing Fine-Grained Voice Fragments with Attention
    Ablation studies

    6. Zero-Shot Voice Conversion with Adjusted Speaker Embeddings and Simple Acoustic Features
    F0 distributions, subjective evaluations

    7. Non-Autoregressive Sequence-To-Sequence Voice Conversion

    root mean square error (RMSE) of log F0, and character error rate (CER)
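
    For reference, the log-F0 RMSE is usually computed over the N aligned voiced frames as

    $$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{n=1}^{N}\left(\log F0_n^{(t)} - \log F0_n^{(c)}\right)^2}$$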

    8. fake speech detection

    Towards fine-grained prosody control for voice conversion
