Speech Representation Disentanglement with Adversarial Mutual Information Learning for One-shot Voice Conversion: intermediate results (GitHub)
AUTOVC: Zero-Shot Voice Style Transfer with Only Autoencoder Loss
Three problems:
It aims to match distributions like a GAN while being as easy to train as a CVAE.
Illustration of the zero-shot learning concept
Train a model on training-set data so that it can classify objects from the test set, where the training-set and test-set classes are disjoint; class descriptions are needed to build the link between the training and test sets so that the model works. The first problem is obtaining suitable class descriptions; the second problem is building a suitable classification model.
Problems encountered:
Program errors:
RuntimeError: CUDA error: out of memory
CUDA ran out of memory. Another error: module 'librosa' has no attribute 'output'. librosa.output was removed in librosa 0.8, so write audio with soundfile instead:
import soundfile as sf
sf.write(name + '.wav', waveform, 16000)
Workflow:
Generate spectrogram data from the wav files: python make_spect.py
Taking 201.wav and 202.wav from liuchanhg_angry, wangzhe_happy, and zhaoquanyin_sad in the CASIA database as examples, the generated spectrograms S have shapes (112, 80), (103, 80); (92, 80), (70, 80); (189, 80), (87, 80) respectively.
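For reference, a minimal librosa-based sketch of what the extraction in make_spect.py roughly does, assuming AUTOVC-style constants (fmin=90, fmax=7600, dB floor at -100); this is not the repo's exact SciPy implementation:

import numpy as np
import librosa

def wav_to_mel(path, sr=16000, n_fft=1024, hop_length=256, n_mels=80):
    y, _ = librosa.load(path, sr=sr)
    # Magnitude STFT with frames as rows: (T, n_fft // 2 + 1)
    S = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop_length)).T
    mel_basis = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels,
                                    fmin=90, fmax=7600)
    mel_spec = np.dot(S, mel_basis.T)                      # (T, 80)
    # Log-compress, then squash into [0, 1], as AUTOVC's spectrograms are.
    mel_db = 20 * np.log10(np.maximum(1e-5, mel_spec)) - 16
    return np.clip((mel_db + 100) / 100, 0, 1).astype(np.float32)

# wav_to_mel('liuchanhg_angry/201.wav').shape -> (T, 80), e.g. (112, 80)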
Generate training metadata, including the GE2E speaker embedding (please use one-hot embeddings if you are not doing zero-shot conversion): python make_metadata.py
If the current utterance is too short, another utterance is chosen.
The final generated format:
[['liuchanhg_angry', array(speaker embedding), 'liuchanhg_angry/201.npy', 'liuchanhg_angry/202.npy']]
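A hypothetical sketch of assembling train.pkl in this format (the 'spmel' directory name and the random stand-in for the GE2E embedding are placeholders):

import os
import pickle
import numpy as np

speakers = []
for spk in sorted(os.listdir('spmel')):                    # e.g. 'liuchanhg_angry'
    utts = sorted(os.listdir(os.path.join('spmel', spk)))
    emb = np.random.randn(256).astype(np.float32)          # stand-in for a GE2E speaker embedding
    speakers.append([spk, emb] + [os.path.join(spk, u) for u in utts])

with open('train.pkl', 'wb') as f:
    pickle.dump(speakers, f)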
Run the main training script: python main.py
Converges when the reconstruction loss is around 0.0001.
...
Elapsed [1 day, 3:41:14], Iteration [304060/1000000], G/loss_id: 0.0180, G/loss_id_psnt: 0.0179, G/loss_cd: 0.0001
Elapsed [1 day, 3:41:17], Iteration [304070/1000000], G/loss_id: 0.0110, G/loss_id_psnt: 0.0109, G/loss_cd: 0.0001
...
100k
Elapsed [2:25:56], Iteration [99990/100000], G/loss_id: 0.0294, G/loss_id_psnt: 0.0294, G/loss_cd: 0.0000
Elapsed [2:25:57], Iteration [100000/100000], G/loss_id: 0.0240, G/loss_id_psnt: 0.0240, G/loss_cd: 0.0000
1000k
Elapsed [17:26:39], Iteration [698500/1000000], G/loss_id: 0.0289, G/loss_id_psnt: 0.0289, G/loss_cd: 0.0000
model_2: changed dim_neck=32, freq=32, batch_size=2
Elapsed [14:21:06], Iteration [461400/1000000], G/loss_id: 0.0139, G/loss_id_psnt: 0.0139, G/loss_cd: 0.0001
...
Elapsed [23:37:37], Iteration [732500/1000000], G/loss_id: 0.0163, G/loss_id_psnt: 0.0162, G/loss_cd: 0.0001
...
Elapsed [1 day, 8:23:07], Iteration [999900/1000000], G/loss_id: 0.0190, G/loss_id_psnt: 0.0189, G/loss_cd: 0.0001
Elapsed [1 day, 8:23:18], Iteration [1000000/1000000], G/loss_id: 0.0143, G/loss_id_psnt: 0.0143, G/loss_cd: 0.0001
model_3: changed len_crop=128*3
Elapsed [14:10:26], Iteration [181400/1000000], G/loss_id: 0.0146, G/loss_id_psnt: 0.0145, G/loss_cd: 0.0002
Elapsed [14:10:54], Iteration [181500/1000000], G/loss_id: 0.0160, G/loss_id_psnt: 0.0159, G/loss_cd: 0.0002
...
Elapsed [23:26:40], Iteration [290100/1000000], G/loss_id: 0.0163, G/loss_id_psnt: 0.0163, G/loss_cd: 0.0001
...
Elapsed [1 day, 9:20:03], Iteration [426600/1000000], G/loss_id: 0.0125, G/loss_id_psnt: 0.0124, G/loss_cd: 0.0001
Elapsed [1 day, 9:20:17], Iteration [426700/1000000], G/loss_id: 0.0186, G/loss_id_psnt: 0.0182, G/loss_cd: 0.0001
...
Elapsed [1 day, 12:16:26], Iteration [501300/1000000], G/loss_id: 0.0137, G/loss_id_psnt: 0.0131, G/loss_cd: 0.0001
Elapsed [1 day, 12:16:40], Iteration [501400/1000000], G/loss_id: 0.0152, G/loss_id_psnt: 0.0149, G/loss_cd: 0.0001
It then turned out that the mel-spectrogram length does not match the vocoder: a roughly 2 s clip yields a mel spectrogram of about (400-500, 80), so try generating mel spectrograms directly from the wavenet code instead; see the official GitHub.
model_4: following this, add squeeze so the reconstruction shapes match:
# squeeze so the decoder output shape matches x_real
g_loss_id = F.mse_loss(x_real, x_identic.squeeze())
g_loss_id_psnt = F.mse_loss(x_real, x_identic_psnt.squeeze())
I trimmed the silence off by hand.
Small batch size leads to better generalization.
Do we need to use other vocoders?
300k steps, 10 hours.
They only use a subset.
The final loss is around 1e-2.
AUTOVC's results are very poor; I suspect the possible causes are as follows:
Elapsed [1 day, 3:08:04], Iteration [1000000/1000000], G/loss_id: 0.0077, G/loss_id_psnt: 0.0077, G/loss_cd: 0.0005
Elapsed [1 day, 10:11:06], Iteration [1000000/1000000], G/loss_id: 0.0062, G/loss_id_psnt: 0.0062, G/loss_cd: 0.0002
Elapsed [1 day, 10:10:44], Iteration [1000000/1000000], G/loss_id: 0.0037, G/loss_id_psnt: 0.0036, G/loss_cd: 0.0002
Elapsed [2 days, 6:16:53], Iteration [1000000/1000000], G/loss_id: 0.0034, G/loss_id_psnt: 0.0033, G/loss_cd: 0.0002
Elapsed [2 days, 5:12:11], Iteration [1000000/1000000], G/loss_id: 0.0033, G/loss_id_psnt: 0.0032, G/loss_cd: 0.0002
None of these work well.
Referring to the lab wiki's tool library and GitHub, pretrained models can be used to solve mel-spectrogram extraction and vocoder synthesis.
wavenet CUDA issue: modify line 112 of mixture.py under /ceph/home/yangsc21/anaconda3/envs/autovc/lib/python3.8/site-packages/wavenet_vocoder/
For zero-shot conversion, 100 speakers were randomly selected for training; 9 speakers do not appear in the training set.
Note: exclude the file p376_295.raw.
Note that the ‘p315’ text was lost due to a hard disk error.
Elapsed [0:00:12], Iteration [100/1000000], G/loss_id: 0.7917, G/loss_id_psnt: 1.3611, G/loss_cd: 0.0848
Elapsed [0:00:22], Iteration [200/1000000], G/loss_id: 0.7815, G/loss_id_psnt: 0.6783, G/loss_cd: 0.0517
...
Elapsed [9:17:16], Iteration [325900/1000000], G/loss_id: 0.0984, G/loss_id_psnt: 0.0974, G/loss_cd: 0.0041
Elapsed [9:17:26], Iteration [326000/1000000], G/loss_id: 0.0822, G/loss_id_psnt: 0.0815, G/loss_cd: 0.0030
...
Elapsed [17:51:58], Iteration [630900/1000000], G/loss_id: 0.0621, G/loss_id_psnt: 0.0614, G/loss_cd: 0.0020
Elapsed [17:52:08], Iteration [631000/1000000], G/loss_id: 0.0534, G/loss_id_psnt: 0.0531, G/loss_cd: 0.0018
The “too narrow” model should have low classification accuracy (good disentanglement) but high reconstruction error (poor reconstruction)
The “too wide” model should have low reconstruction error (good reconstruction) but high classification accuracy (poor disentanglement).
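One way to measure the "high classification accuracy" side of this trade-off is a linear speaker probe on frozen content codes; a minimal sketch, assuming codes (N, D) and integer spk_ids (N,) are already extracted:

import torch
import torch.nn as nn
import torch.nn.functional as F

def probe_accuracy(codes, spk_ids, n_spk, epochs=100, lr=1e-3):
    # Train a linear classifier on frozen content codes; low accuracy
    # suggests the bottleneck has squeezed out speaker identity.
    clf = nn.Linear(codes.size(1), n_spk)
    opt = torch.optim.Adam(clf.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        F.cross_entropy(clf(codes), spk_ids).backward()
        opt.step()
    with torch.no_grad():
        return (clf(codes).argmax(1) == spk_ids).float().mean().item()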
freq=64,dim_neck=24
Elapsed [15:18:01], Iteration [148500/1000000], G/loss_id: 0.1358, G/loss_id_psnt: 0.1339, G/loss_cd: 0.0075
Elapsed [15:18:41], Iteration [148600/1000000], G/loss_id: 0.2336, G/loss_id_psnt: 0.2340, G/loss_cd: 0.0100
...
Elapsed [1 day, 1:08:15], Iteration [236000/1000000], G/loss_id: 0.1385, G/loss_id_psnt: 0.1368, G/loss_cd: 0.0061
Elapsed [1 day, 1:08:55], Iteration [236100/1000000], G/loss_id: 0.1462, G/loss_id_psnt: 0.1443, G/loss_cd: 0.0069
...
Elapsed [1 day, 14:13:03], Iteration [352300/1000000], G/loss_id: 0.1201, G/loss_id_psnt: 0.1191, G/loss_cd: 0.0060
Elapsed [1 day, 14:13:43], Iteration [352400/1000000], G/loss_id: 0.1177, G/loss_id_psnt: 0.1168, G/loss_cd: 0.0066
...
Elapsed [4 days, 14:41:52], Iteration [1000000/1000000], G/loss_id: 0.0673, G/loss_id_psnt: 0.0659, G/loss_cd: 0.0027
freq=64,dim_neck=16
Elapsed [15:18:14], Iteration [151500/1000000], G/loss_id: 0.1823, G/loss_id_psnt: 0.1794, G/loss_cd: 0.0086
Elapsed [15:18:54], Iteration [151600/1000000], G/loss_id: 0.1745, G/loss_id_psnt: 0.1731, G/loss_cd: 0.0073
...
Elapsed [1 day, 1:09:25], Iteration [240900/1000000], G/loss_id: 0.1490, G/loss_id_psnt: 0.1475, G/loss_cd: 0.0058
Elapsed [1 day, 1:10:05], Iteration [241000/1000000], G/loss_id: 0.1617, G/loss_id_psnt: 0.1594, G/loss_cd: 0.0057
...
Elapsed [1 day, 14:13:05], Iteration [359300/1000000], G/loss_id: 0.0810, G/loss_id_psnt: 0.0803, G/loss_cd: 0.0033
Elapsed [1 day, 14:13:45], Iteration [359400/1000000], G/loss_id: 0.0782, G/loss_id_psnt: 0.0785, G/loss_cd: 0.0048
...
Elapsed [4 days, 13:00:51], Iteration [999900/1000000], G/loss_id: 0.1269, G/loss_id_psnt: 0.1248, G/loss_cd: 0.0037
Elapsed [4 days, 13:01:30], Iteration [1000000/1000000], G/loss_id: 0.1593, G/loss_id_psnt: 0.1570, G/loss_cd: 0.0044
freq=64,dim_neck=8
Elapsed [15:14:48], Iteration [146200/1000000], G/loss_id: 0.1795, G/loss_id_psnt: 0.1785, G/loss_cd: 0.0053
Elapsed [15:15:29], Iteration [146300/1000000], G/loss_id: 0.2842, G/loss_id_psnt: 0.2802, G/loss_cd: 0.0078
...
Elapsed [1 day, 1:06:04], Iteration [233200/1000000], G/loss_id: 0.1539, G/loss_id_psnt: 0.1560, G/loss_cd: 0.0054
Elapsed [1 day, 1:06:45], Iteration [233300/1000000], G/loss_id: 0.2401, G/loss_id_psnt: 0.2394, G/loss_cd: 0.0074
...
Elapsed [1 day, 14:09:50], Iteration [348400/1000000], G/loss_id: 0.3423, G/loss_id_psnt: 0.3405, G/loss_cd: 0.0071
Elapsed [1 day, 14:10:31], Iteration [348500/1000000], G/loss_id: 0.3143, G/loss_id_psnt: 0.3121, G/loss_cd: 0.0091
...
Elapsed [4 days, 15:07:50], Iteration [1000000/1000000], G/loss_id: 0.1669, G/loss_id_psnt: 0.1649, G/loss_cd: 0.0045
freq=64,dim_neck=8,lambda_cd=10
Elapsed [11:10:59], Iteration [98700/1000000], G/loss_id: 0.3038, G/loss_id_psnt: 0.3034, G/loss_cd: 0.0017
Elapsed [11:11:40], Iteration [98800/1000000], G/loss_id: 0.3581, G/loss_id_psnt: 0.3517, G/loss_cd: 0.0022
...
Elapsed [21:02:44], Iteration [185700/1000000], G/loss_id: 0.3073, G/loss_id_psnt: 0.3097, G/loss_cd: 0.0017
Elapsed [21:03:25], Iteration [185800/1000000], G/loss_id: 0.2681, G/loss_id_psnt: 0.2694, G/loss_cd: 0.0016
...
Elapsed [1 day, 10:06:21], Iteration [300900/1000000], G/loss_id: 0.2351, G/loss_id_psnt: 0.2319, G/loss_cd: 0.0016
Elapsed [1 day, 10:07:02], Iteration [301000/1000000], G/loss_id: 0.2504, G/loss_id_psnt: 0.2511, G/loss_cd: 0.0016
...
Elapsed [4 days, 12:21:31], Iteration [1000000/1000000], G/loss_id: 0.2125, G/loss_id_psnt: 0.2103, G/loss_cd: 0.0011
Elapsed [11:38:12], Iteration [110100/200000], G/loss_id: 0.0020, G/loss_id_psnt: 0.0020, spk_loss: 0.0001, content_advloss: 4.6966, code_loss: 0.0006
Elapsed [11:38:50], Iteration [110200/200000], G/loss_id: 0.0019, G/loss_id_psnt: 0.0019, spk_loss: 0.0124, content_advloss: 4.6481, code_loss: 0.0007
...
Elapsed [21:15:26], Iteration [200000/200000], G/loss_id: 0.0016, G/loss_id_psnt: 0.0016, spk_loss: 0.0001, content_advloss: 4.6103, code_loss: 0.0003
Elapsed [11:30:53], Iteration [108800/200000], G/loss_id: 0.0018, G/loss_id_psnt: 0.0019, spk_loss: 0.0002, content_advloss: 4.6069, code_loss: 0.0005
Elapsed [11:31:31], Iteration [108900/200000], G/loss_id: 0.0018, G/loss_id_psnt: 0.0019, spk_loss: 0.0001, content_advloss: 4.6133, code_loss: 0.0004
...
Elapsed [21:16:12], Iteration [200000/200000], G/loss_id: 0.0020, G/loss_id_psnt: 0.0020, spk_loss: 0.0010, content_advloss: 4.6662, code_loss: 0.0006
I went through the ClsVC code carefully. Compared with AutoVC, ClsVC just removes the down-/up-sampling parts and adds a very simple adversarial classifier (with gradient reversal on the content branch; sketch below). The so-called "flexible hidden feature dimensions" are simply the output dimensions of the AutoVC encoder, which are user-defined anyway, so the contribution does not seem very large...
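A minimal sketch of that kind of gradient-reversal classifier (dimensions and architecture are placeholders, not ClsVC's exact code):

import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lamb=1.0):
        ctx.lamb = lamb
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad):
        # Flip the gradient so the encoder learns to *remove* speaker identity.
        return -ctx.lamb * grad, None

class ContentAdvClassifier(nn.Module):
    def __init__(self, dim_content=64, n_spk=109):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim_content, 256), nn.ReLU(),
                                 nn.Linear(256, n_spk))

    def forward(self, content):          # content: (B, T, dim_content)
        return self.net(GradReverse.apply(content.mean(dim=1)))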
The results are poor; retraining.
Changed z in the training code to content_embedding_source:
Elapsed [0:02:48], Iteration [1000/810000], G/loss_id: 0.2151, G/loss_id_psnt: 0.2155, spk_loss: 0.3096, content_advloss: 4.6123, code_loss: 0.0139
...
Elapsed [10 days, 2:53:56], Iteration [810000/810000], G/loss_id: 0.0008, G/loss_id_psnt: 0.0008, spk_loss: 0.0000, content_advloss: 4.6048, code_loss: 0.0000
Then changed the coefficient in front of g_loss_cd from 1 to 0.5:
Elapsed [0:00:35], Iteration [100/810000], G/loss_id: 0.2475, G/loss_id_psnt: 0.2654, spk_loss: 4.1111, content_advloss: 4.5997, code_loss: 0.0165
...
Elapsed [10 days, 2:23:01], Iteration [810000/810000], G/loss_id: 0.0009, G/loss_id_psnt: 0.0008, spk_loss: 0.0000, content_advloss: 4.6057, code_loss: 0.0001
Linguistic content, timbre, pitch, and rhythm.
spk2gen (len(spk2gen) = 109) maps each speaker to their gender:
{'p250': 'F', 'p285': 'M', 'p277': 'F',...
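A hedged sketch of building spk2gen from VCTK's speaker-info.txt (the column layout ID AGE GENDER ACCENTS ... and the path are assumptions; some corpus versions already prefix IDs with 'p'):

spk2gen = {}
with open('/ceph/datasets/VCTK-Corpus/speaker-info.txt') as f:
    next(f)                                   # skip the header row
    for line in f:
        fields = line.split()
        spk = fields[0] if fields[0].startswith('p') else 'p' + fields[0]
        spk2gen[spk] = fields[2]              # 'F' or 'M'
print(len(spk2gen))                           # expect 109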
Re-running with VCTK's wav16, some audio files cannot be opened with sf.read; opening them in an audio player also reports a corrupted format:
...
p329
/ceph/datasets/VCTK-Corpus/wav16/p329/p329_037.wav
p330
/ceph/datasets/VCTK-Corpus/wav16/p330/p330_101.wav
...
On preprocessing, the author says:
All preprocessing steps are in the code, except trimming silence. But I don’t think they will make any fundamental difference. Your loss value looks fine.
On how to train P, see this discussion.
demo.pkl format:
import pickle
metadata = pickle.load(open("/ceph/home/yangsc21/Python/autovc/SpeechSplit/assets/demo.pkl", "rb"))
print(metadata[0][1].shape)
print(metadata[0][2][0].shape)
print(metadata[0][2][1].shape)
print(metadata[1][1].shape) # (1, 82)
print(metadata[1][2][0].shape) # (105, 80)
print(metadata[1][2][1].shape) # (105,)
'''
[['p226',
  array([[0., 1., 0., 0., 0., 0., 0., 0., ...]]),   # (1, 82) one-hot speaker embedding
  (array([[0.43534297, ...]]),                      # (135, 80) mel spectrogram
   array([-1.0000000e+10, ...]),                    # (135,) f0 (with -1e10 unvoiced markers)
   135, '003002')]
'''
The mel spectrograms he extracts cannot be converted to audio via mel2wav_GriffinLim, so each mel-spectrogram extraction recipe has a specific vocoder that matches it (sketch below).
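A rough Griffin-Lim inverse matched to the librosa sketch earlier; it only sounds right when every constant matches the extraction, which is exactly the one-recipe-one-vocoder coupling noted above (all constants are assumptions):

import numpy as np
import librosa

def mel2wav_griffinlim(mel01, sr=16000, n_fft=1024, hop_length=256):
    # Undo the [0, 1] normalization and dB compression used at extraction.
    mel_db = mel01 * 100 - 100 + 16
    mel_amp = np.power(10.0, mel_db / 20).T                # (80, T)
    S = librosa.feature.inverse.mel_to_stft(mel_amp, sr=sr, n_fft=n_fft,
                                            power=1.0, fmin=90, fmax=7600)
    return librosa.griffinlim(S, hop_length=hop_length)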
min_len_seq=128, max_len_seq=128 * 2, max_len_pad=128 * 3,
Elapsed [1 day, 10:29:04], Iteration [311800/1000000], G/loss_id: 0.00094931
...
Elapsed [1 day, 22:15:11], Iteration [393000/1000000], G/loss_id: 0.00089064
Validation loss: 25.521947860717773
...
Elapsed [6 days, 12:43:18], Iteration [1000000/1000000], G/loss_id: 0.00062485
Saved model checkpoints into run/models...
Validation loss: 24.621514320373535
min_len_seq=64, max_len_seq=128, max_len_pad=192,
Elapsed [1 day, 6:19:23], Iteration [392500/1000000], G/loss_id: 0.00089022
...
Elapsed [1 day, 17:51:54], Iteration [510900/1000000], G/loss_id: 0.00070071
...
Elapsed [2 days, 2:45:19], Iteration [601200/1000000], G/loss_id: 0.00064877
Elapsed [2 days, 2:46:17], Iteration [601300/1000000], G/loss_id: 0.00086512
...
Elapsed [4 days, 13:16:14], Iteration [1000000/1000000], G/loss_id: 0.00063725
Saved model checkpoints into run_192/models...
Validation loss: 25.529197692871094
'R' - Rhythm, 'F' - Pitch, 'U' - Timbre
In preprocess.py, 88 speakers are randomly chosen as the training set and 20 as the test set; 10% of the training data is randomly held out as the validation set. Training set: 31877, validation set: 3496, test set: 8474.
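A hedged sketch of that split (seed and directory layout are assumptions):

import os
import random

speakers = sorted(os.listdir('wav16'))
random.seed(1)
random.shuffle(speakers)
train_spk, test_spk = speakers[:88], speakers[88:108]

train_utts = [os.path.join(s, u) for s in train_spk
              for u in os.listdir(os.path.join('wav16', s))]
random.shuffle(train_utts)
n_val = len(train_utts) // 10                 # hold out 10% for validation
val_utts, train_utts = train_utts[:n_val], train_utts[n_val:]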
Now the mel spectrogram of p225_001 has shape (206, 80)... this is the third mel extraction recipe (the first was Kaizhi Qian's, the second the lab's); the lf0 has shape (206,).
Contrastive Predictive Coding (CPC) comes from:
Representation Learning with Contrastive Predictive Coding
Training with mutual information minimization (MIM):
epoch:1, global step:19, recon loss:4.916, cpc loss:2.398, vq loss:0.004, perpexlity:36.475, lld cs loss:-22.828, mi cs loss:1.334E-03, lld ps loss:0.072, mi ps loss:0.000, lld cp loss:-47.886, mi cp loss:0.005, used time:1.841s
[14.59624186 14.74744059 17.68857762 16.60425648 8.98168102 10.80852639]
...
Eval | epoch:150, recon loss:0.597, cpc loss:1.171, vq loss:0.452, perpexlity:331.668, lld cs loss:109.938, mi cs loss:2.622E-03, lld ps loss:0.053, mi ps loss:0.000, lld cp loss:1085.477, mi cp loss:0.027, used time:11.050s
...
epoch:500, global step:62500, recon loss:0.446, cpc loss:1.058, vq loss:0.481, perpexlity:382.976, lld cs loss:133.278, mi cs loss:-3.201E-11, lld ps loss:0.043, mi ps loss:0.001, lld cp loss:1430.699, mi cp loss:0.019, used time:59.947s
[81.88058467 74.5465714 66.79555879 59.56658187 53.12879977 48.21143758]
Saved checkpoint: model.ckpt-500
python convert_example.py -s test_wavs_/p225_001.wav -r test_wavs_/p232_001.wav -c converted_ -m /ceph/home/yangsc21/Python/autovc/VQMIVC/checkpoints/useCSMITrue_useCPMITrue_usePSMITrue_useAmpTrue/model.ckpt-500.pt
Training without MIM:
epoch:1, global step:11, recon loss:5.052, cpc loss:2.398, vq loss:0.007, perpexlity:28.734, lld cs loss:0.000, mi cs loss:0.000E+00, lld ps loss:0.000, mi ps loss:0.000, lld cp loss:0.000, mi cp loss:0.000, used time:0.932s
[15.68673657 15.79730163 17.24362392 18.36610983 9.6393453 11.61267515]
...
eval epoch:500, global step:62500, recon loss:0.428, cpc loss:1.084, vq loss:0.434, perpexlity:236.974, lld cs loss:0.000, mi cs loss:0.000E+00, lld ps loss:0.000, mi ps loss:0.000, lld cp loss:0.000, mi cp loss:0.000, used time:2.310s
[80.40287876 73.21134896 65.82538815 58.78130851 52.62680888 48.37559807]
Saved checkpoint: model.ckpt-500
The first training run hit the following problem:
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.0
epoch:22, global step:2738, recon loss:nan, cpc loss:2.190, vq loss:0.020, perpexlity:2.943, lld cs loss:0.000, mi cs loss:0.000E+00, lld ps loss:0.000, mi ps loss:0.000, lld cp loss:0.000, mi cp loss:0.000, used time:0.928s
[69.31384211 62.62814037 61.00833479 59.76804704 60.72292783 60.05764839]
File "train.py", line 132, in mi_second_forward
scaled_loss.backward()
ZeroDivisionError: float division by zero
(The recon loss had gone NaN, so the dynamic loss scaler kept shrinking the scale until it reached 0.0, after which unscaling divides by zero.)
spk2emb_82.pkl format: one-hot speaker embeddings. Since it is unclear how to train the SEA model, we train directly on the 82 speakers in spk2emb_82.pkl and copy over all of their audio; note, however, that p248 and p251 are 'Indian'-accent speakers missing from the 'wav16' folder, so 80 speakers in total are used for training.
How to train SEA model (GitHub issue)
How to make 'mfcc_stats.pkl' and 'spk2emb_82.pkl' (GitHub issue)
A:
Elapsed [0:00:31], Iteration [100/1000000], P/loss_tx2sp: 1.56564283, P/loss_stop_sp: 0.41424835
...
Elapsed [1 day, 22:56:36], Iteration [739100/1000000], P/loss_tx2sp: 0.04529002, P/loss_stop_sp: 0.00000134
...
Elapsed [3 days, 2:02:54], Iteration [1000000/1000000], P/loss_tx2sp: 0.04210990, P/loss_stop_sp: 0.00000138
Saved model checkpoints into assets ...
B:
Elapsed [0:01:07], Iteration [100/1000000], P/loss_tx2sp: 0.16594851, P/loss_stop_sp: 0.01642439
...
Elapsed [1 day, 3:50:30], Iteration [308000/1000000], P/loss_tx2sp: 0.05139246, P/loss_stop_sp: 0.00002161
129 torch.Size([4, 1635])
(RuntimeError: CUDA out of memory.)
... # retrain on A100
Elapsed [1 day, 4:01:57], Iteration [612100/1000000], P/loss_tx2sp: 0.06876539, P/loss_stop_sp: 0.00025042
SpeechSplit performs better only when it has the ground truth rhythm.
896 p299_010_p269_010
9.590664489988024 3.480195005888315
MCD = 6.729270791070735 dB, calculated over a total of 658567 frames, total 896 pairs
933
8.460124809689486 2.212626641284654
MCD = 5.365016399179563 dB, calculated over a total of 845206 frames, total 982 pairs
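For reference, the numbers above come from a mel-cepstral distortion of the usual form; a minimal sketch, assuming the mel-cepstra are already DTW-aligned and the 0th (energy) coefficient is dropped:

import numpy as np

def mcd_db(mc_ref, mc_conv):
    # mc_*: (T, D) aligned mel-cepstral coefficients, excluding c0.
    diff = mc_ref - mc_conv
    return float(np.mean(10.0 / np.log(10) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))))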
AdaIN-VC:
https://github.com/jjery2243542/adaptive_voice_conversion
Unofficial: https://github.com/cyhuang-tw/AdaIN-VC
VQVC+:
https://github.com/ericwudayi/SkipVQVC
run: each iteration executes G1 five times..., which is very slow, about 100 iters/min; something must be written wrong, since what gets passed out is the detached r, p and c, so this run was simply stopped.
Elapsed [0:00:59], Iteration [100/1000000], G/loss_id: 0.07169086, G/loss_id_psnt: 0.69216526, spk_loss: 4.61872101, content_adv_loss: 4.61275053, mi_cp_loss: 0.01635285, mi_rc_loss: 0.00026382, mi_rp_loss: 0.00063036, lld_cp_loss: -61.53382492, lld_rc_loss: -15.68565655, lld_rp_loss: -58.85915375
...
Elapsed [10:16:29], Iteration [65200/1000000], G/loss_id: 0.00442289, G/loss_id_psnt: 0.00442220, spk_loss: 0.25282666, content_adv_loss: 4.30261326, mi_cp_loss: 0.01877159, mi_rc_loss: 0.00023279, mi_rp_loss: 0.00061867, lld_cp_loss: -62.52816391, lld_rc_loss: -15.75810432, lld_rp_loss: -59.59102631
...
Elapsed [22:28:41], Iteration [139300/1000000], G/loss_id: 0.00412945, G/loss_id_psnt: 0.00413380, spk_loss: 0.00963319, content_adv_loss: 3.85407591, mi_cp_loss: 0.02865396, mi_rc_loss: 0.00045456, mi_rp_loss: 0.00082575, lld_cp_loss: -62.08999634, lld_rc_loss: -15.69443417, lld_rp_loss: -58.60738373
run_: refines the above; each step runs G1 only once, then performs five log-likelihood updates of the MI estimator networks, and adds eval and sample plots; about 100 iters/40 s (CLUB sketch below).
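The lld_*/mi_* columns in these logs come from CLUB-style MI estimators (the family VQMIVC uses); a minimal sketch of one such estimator, with the architecture as an assumption:

import torch
import torch.nn as nn

class CLUBSample(nn.Module):
    """Sampled CLUB upper bound on I(x; y) via a variational net q(y|x)."""
    def __init__(self, x_dim, y_dim, hidden=256):
        super().__init__()
        self.p_mu = nn.Sequential(nn.Linear(x_dim, hidden), nn.ReLU(),
                                  nn.Linear(hidden, y_dim))
        self.p_logvar = nn.Sequential(nn.Linear(x_dim, hidden), nn.ReLU(),
                                      nn.Linear(hidden, y_dim), nn.Tanh())

    def loglikeli(self, x, y):
        # Maximized on detached codes: the lld_* numbers in the logs.
        mu, logvar = self.p_mu(x), self.p_logvar(x)
        return (-(mu - y) ** 2 / logvar.exp() - logvar).sum(1).mean()

    def mi_est(self, x, y):
        # MI upper bound minimized by the generator: the mi_* numbers.
        mu, logvar = self.p_mu(x), self.p_logvar(x)
        pos = (-(mu - y) ** 2 / logvar.exp()).sum(1)
        neg = (-(mu - y[torch.randperm(x.size(0))]) ** 2 / logvar.exp()).sum(1)
        return (pos - neg).mean() / 2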
Elapsed [0:03:57], Iteration [600/1000000], G/loss_id: 0.30983770, G/loss_id_psnt: 0.19387859, spk_loss: 4.63300323, content_adv_loss: 4.61494732, mi_cp_loss: 0.00945068, mi_rc_loss: 0.00023235, mi_rp_loss: 0.00065887, lld_cp_loss: -58.04447556, lld_rc_loss: -15.88657951, lld_rp_loss: -56.80010986
...
Validation loss: 47.09280776977539
Elapsed [3 days, 2:54:44], Iteration [633100/1000000], G/loss_id: 0.00095636, G/loss_id_psnt: 0.00094097, spk_loss: 0.00062935, content_adv_loss: 4.60769081, mi_cp_loss: -0.00006776, mi_rc_loss: 0.00005649, mi_rp_loss: -0.00000760, lld_cp_loss: -63.45062256, lld_rc_loss: -15.89001274, lld_rp_loss: -63.43792725
...
Elapsed [4 days, 22:57:11], Iteration [1000000/1000000], G/loss_id: 0.00098727, G/loss_id_psnt: 0.00097293, spk_loss: 0.00006716, content_adv_loss: 4.60707092, mi_cp_loss: -0.00000000, mi_rc_loss: 0.00003540, mi_rp_loss: -0.00001546, lld_cp_loss: -63.21662521, lld_rc_loss: -15.81787777, lld_rp_loss: -63.23201752
Saved model checkpoints into run_/models...
Validation loss: 32.39069652557373
run_VQ: further refines the above by adding VQ and CPC.
Elapsed [0:04:19], Iteration [600/1000000], G/loss_id: 0.21996836, G/loss_id_psnt: 0.25334343, spk_loss: 4.64389563, content_adv_loss: 4.59541082, mi_cp_loss: 0.00000005, mi_rc_loss: 0.00000001, mi_rp_loss: 0.00000003, lld_cp_loss: -63.99991989, lld_rc_loss: -15.99912262, lld_rp_loss: -63.99991989, vq_loss: 0.21500304, cpc_loss: 2.32923222
[25.67204237 24.83198941 23.82392436 24.96639788 25.70564449 24.59677458]
...
Validation loss: 44.1766471862793
Elapsed [2 days, 21:25:56], Iteration [518300/1000000], G/loss_id: 0.00244417, G/loss_id_psnt: 0.00243542, spk_loss: 0.00113923, content_adv_loss: 4.59757185, mi_cp_loss: 0.00000000, mi_rc_loss: 0.00000069, mi_rp_loss: 0.00000080, lld_cp_loss: -63.96213531, lld_rc_loss: -15.99189568, lld_rp_loss: -63.96203232, vq_loss: 816.06317139, cpc_loss: 1.40695262
[58.23252797 57.29166865 56.85483813 55.94757795 53.49462628 53.93145084]
...
Elapsed [5 days, 12:36:53], Iteration [1000000/1000000], G/loss_id: 0.00149921, G/loss_id_psnt: 0.00148138, spk_loss: 0.00003648, content_adv_loss: 4.60524797, mi_cp_loss: -0.00000074, mi_rc_loss: 0.00000015, mi_rp_loss: 0.00000297, lld_cp_loss: -63.96846771, lld_rc_loss: -15.98954391, lld_rp_loss: -63.96820068, vq_loss: 3281.53906250, cpc_loss: 1.39505339
[65.49059153 62.63440847 62.02957034 60.18145084 58.77016187 58.36693645]
Saved model checkpoints into run_VQ/models...
Validation loss: 44.55451202392578
run_VQ_1: lambda_cd = 0.1 -> 1
Elapsed [0:11:13], Iteration [1500/1000000], G/loss_id: 0.20114005, G/loss_id_psnt: 0.20369399, spk_loss: 4.10042000, content_adv_loss: 4.61024761, mi_cp_loss: 0.00000031, mi_rc_loss: -0.00000001, mi_rp_loss: 0.00000008, lld_cp_loss: -63.99993896, lld_rc_loss: -15.99916458, lld_rp_loss: -63.99987411, vq_loss: 0.26906750, cpc_loss: 2.26129460
[33.16532373 33.36693645 33.8373661 31.72042966 31.01478517 31.85483813]
...
Validation loss: 49.884931564331055
Elapsed [2 days, 19:55:09], Iteration [500100/1000000], G/loss_id: 0.00222024, G/loss_id_psnt: 0.00221478, spk_loss: 0.00017085, content_adv_loss: 4.60785437, mi_cp_loss: 0.00000000, mi_rc_loss: 0.00000000, mi_rp_loss: -0.00000000, lld_cp_loss: -64.00000000, lld_rc_loss: -15.99412155, lld_rp_loss: -63.99996185, vq_loss: 885.47814941, cpc_loss: 1.49846244
[48.7231195 47.81585932 46.77419364 45.93414068 47.27822542 45.69892585]
run_pitch: add a pitch decoder; lambda_cd = 1 -> 0.1; the pitch decoder's loss weight is set to 1.
Elapsed [0:01:50], Iteration [200/1000000], G/loss_id: 0.03694459, G/loss_id_psnt: 0.88864940, spk_loss: 4.62199402, content_adv_loss: 4.66859436, mi_cp_loss: 0.00000000, mi_rc_loss: 0.00000005, mi_rp_loss: 0.00000007, lld_cp_loss: -63.99984741, lld_rc_loss: -15.99911499, lld_rp_loss: -63.99986267, vq_loss: 0.18332298, cpc_loss: 2.39174533, pitch_loss: 84147137801697099776.00000000
[16.83467776 16.70026928 17.37231165 17.13709682 17.40591377 19.52284873]
...
Elapsed [7:49:14], Iteration [53200/1000000], G/loss_id: 0.00311949, G/loss_id_psnt: 0.00310999, spk_loss: 0.22137184, content_adv_loss: 4.62207985, mi_cp_loss: -0.00000042, mi_rc_loss: 0.00000002, mi_rp_loss: -0.00000121, lld_cp_loss: -63.99859238, lld_rc_loss: -15.99206543, lld_rp_loss: -63.66162872, vq_loss: 6.28716087, cpc_loss: 1.54115570, pitch_loss: 82519535137146798080.00000000
[49.05914068 49.96639788 48.01747203 47.47983813 45.69892585 45.0268805 ]
...
After fixing the oversized loss:
Elapsed [0:00:56], Iteration [100/1000000], G/loss_id: 0.05138026, G/loss_id_psnt: 0.77242196, spk_loss: 0.68211454, content_adv_loss: 3.45159459, mi_cp_loss: 0.00000000, mi_rc_loss: -0.00000000, mi_rp_loss: -0.00000000, lld_cp_loss: -63.99995041, lld_rc_loss: -15.99927044, lld_rp_loss: -63.99993896, vq_loss: 0.17577030, cpc_loss: 2.39789486, pitch_loss: 0.03641313
[14.41532224 13.70967776 11.79435477 14.11290318 15.15457034 11.35752723]
...
Elapsed [1 day, 23:38:22], Iteration [353300/1000000], G/loss_id: 0.00200523, G/loss_id_psnt: 0.00200193, spk_loss: 0.00335806, content_adv_loss: 4.60966444, mi_cp_loss: 0.00000000, mi_rc_loss: 0.00000065, mi_rp_loss: -0.00271915, lld_cp_loss: -61.08781052, lld_rc_loss: -15.98555946, lld_rp_loss: -61.11469650, vq_loss: 361.41998291, cpc_loss: 1.45688534, pitch_loss: 0.00808701
[55.07392287 53.76344323 52.08333135 51.74731016 51.00806355 52.55376101]
run_pitch_: the pitch_loss above is absurdly large, even with the tanh() I added. It turns out the raw f0 values lie between -10000000000.0 and 1.0, which seemed strange: why so small? But when I checked with the following function:
import numpy as np

def speaker_normalization(f0, index_nonzero, mean_f0, std_f0):
    # f0 is log f0; unvoiced frames are marked with -1e10 and left untouched.
    f0 = f0.astype(float).copy()
    # index_nonzero = f0 != 0
    f0[index_nonzero] = (f0[index_nonzero] - mean_f0) / std_f0 / 4.0
    f0[index_nonzero] = np.clip(f0[index_nonzero], -1, 1)
    f0[index_nonzero] = (f0[index_nonzero] + 1) / 2.0
    return f0
path = "/ceph/home/yangsc21/Python/autovc/SpeechSplit/assets/raptf0/p226/p226_025_cat.npy"
# path = "/ceph/home/yangsc21/Python/VCTK/wav16/raptf0_100_crop_cat/p225/p225_cat.npy"
f0_rapt = np.load(path)
index_nonzero = (f0_rapt != -1e10)
mean_f0, std_f0 = np.mean(f0_rapt[index_nonzero]), np.std(f0_rapt[index_nonzero])
f0_norm = speaker_normalization(f0_rapt, index_nonzero, mean_f0, std_f0)
f0_norm[f0_norm==-1e10] = 0
print(f0_rapt, np.max(f0_rapt), np.min(f0_rapt), index_nonzero, mean_f0, std_f0, f0_norm, np.max(f0_norm), np.min(f0_norm), np.mean(f0_norm))
The result:
[-1.e+10 -1.e+10 -1.e+10 ... -1.e+10 -1.e+10 -1.e+10] 1.0 -10000000000.0 [False False False ... False False False] 0.500022 0.12482828 [0. 0. 0. ... 0. 0. 0.] 1.0 0.0 0.278805128421691
Apart from the -1e10 entries, the remaining values all lie between 0 and 1, so here the prediction is passed through a Sigmoid() instead, with the weight set to 0.1 (sketch below).
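The wiring of that change, as a sketch (module and tensor names are placeholders):

import torch
import torch.nn.functional as F

def pitch_loss_term(pitch_decoder, content_codes, f0_norm_target, weight=0.1):
    # Sigmoid keeps predictions in (0, 1), matching the normalized f0 above.
    pitch_pred = torch.sigmoid(pitch_decoder(content_codes))
    return weight * F.mse_loss(pitch_pred, f0_norm_target)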
Elapsed [0:00:23], Iteration [40/1000000], G/loss_id: 0.05261087, G/loss_id_psnt: 0.72422123, spk_loss: 0.69748354, content_adv_loss: 4.07323647, mi_cp_loss: 0.00000000, mi_rc_loss: -0.00000001, mi_rp_loss: 0.00000005, lld_cp_loss: -63.99515915, lld_rc_loss: -15.99876404, lld_rp_loss: -63.99857712, vq_loss: 0.12072643, cpc_loss: 2.39789486, pitch_loss: 0.05633526
[15.35618305 15.42338729 14.34811801 15.42338729 14.71774131 10.95430106]
...
Elapsed [1 day, 23:40:28], Iteration [317600/1000000], G/loss_id: 0.00184345, G/loss_id_psnt: 0.00184246, spk_loss: 0.12309902, content_adv_loss: 4.59243107, mi_cp_loss: 0.00000003, mi_rc_loss: 0.00000001, mi_rp_loss: 0.00000004, lld_cp_loss: -63.99990082, lld_rc_loss: -15.98676586, lld_rp_loss: -63.99086761, vq_loss: 322.91177368, cpc_loss: 1.37124848, pitch_loss: 0.00887575
[63.23924661 62.29838729 60.61828136 59.40859914 58.90457034 58.63575339]
run_pitch_2: no VQ+CPC; only MI + 0.1 * pitch decoder.
Elapsed [0:04:26], Iteration [700/1000000], G/loss_id: 0.16231853, G/loss_id_psnt: 0.32284465, spk_loss: 4.62042952, content_adv_loss: 4.61721945, mi_cp_loss: 0.00780822, mi_rc_loss: 0.00016762, mi_rp_loss: 0.00188451, lld_cp_loss: -61.02302551, lld_rc_loss: -15.70524502, lld_rp_loss: -59.94433594, pitch_loss: 0.01591282
...
Elapsed [4 days, 23:25:32], Iteration [1000000/1000000], G/loss_id: 0.00075769, G/loss_id_psnt: 0.00075213, spk_loss: 0.00020621, content_adv_loss: 4.60287952, mi_cp_loss: 0.00002108, mi_rc_loss: 0.00001613, mi_rp_loss: 0.00000331, lld_cp_loss: -63.69352341, lld_rc_loss: -15.81671143, lld_rp_loss: -63.69023132, pitch_loss: 0.00293102
Saved model checkpoints into run_pitch_2/models...
Validation loss: 25.982803344726562
run_pitch_3: VQCPC after the Mel Encoder (rhythm)
Elapsed [0:00:47], Iteration [100/1000000], G/loss_id: 0.03398866, G/loss_id_psnt: 0.81710476, spk_loss: 4.63411522, content_adv_loss: 4.56968164, mi_cp_loss: 0.00585751, mi_rc_loss: 0.00000011, mi_rp_loss: -0.00000144, lld_cp_loss: -61.00984192, lld_rc_loss: -15.42654037, lld_rp_loss: -59.25158691, vq_loss: 0.05246538, cpc_loss: 2.39724064, pitch_loss: 0.01885412
[40.2777791 45.13888955 50.69444776 56.25 63.19444776 67.36111045]
...
Elapsed [5 days, 0:38:57], Iteration [1000000/1000000], G/loss_id: 0.00155650, G/loss_id_psnt: 0.00154984, spk_loss: 0.00189495, content_adv_loss: 4.60772943, mi_cp_loss: 0.00018108, mi_rc_loss: 0.00001328, mi_rp_loss: 0.00000790, lld_cp_loss: -63.04440308, lld_rc_loss: -12.41486168, lld_rp_loss: -63.04750824, vq_loss: 0.12090041, cpc_loss: 1.45722890, pitch_loss: 0.01511321
[56.25 61.8055582 63.54166865 67.01388955 69.79166865 75.3472209 ]
Saved model checkpoints into run_pitch_3/models...
Validation loss: 1484.6753540039062
Saved model checkpoints into run_pitch_wo_adv/models...
Validation loss: 57.68343734741211
Elapsed [4 days, 14:00:24], Iteration [800100/1000000], G/loss_id: 0.00046955, G/loss_id_psnt: 0.00046547, spk_loss: 0.00000000, content_adv_loss: 0.00000000, mi_cp_loss: 0.00000706, mi_rc_loss: 0.00001590, mi_rp_loss: -0.00017925, lld_cp_loss: -59.52406311, lld_rc_loss: -15.50933647, lld_rp_loss: -59.52644348, pitch_loss: 0.00289221
Elapsed [3 days, 20:09:18], Iteration [861400/1000000], G/loss_id: 0.00078604, G/loss_id_psnt: 0.00077555, spk_loss: 0.01067678, content_adv_loss: 4.61410141, mi_cp_loss: 0.00000000, mi_rc_loss: 0.00000000, mi_rp_loss: 0.00000000, lld_cp_loss: 0.00000000, lld_rc_loss: 0.00000000, lld_rp_loss: 0.00000000, pitch_loss: 0.00468707
Elapsed [3 days, 20:10:00], Iteration [861500/1000000], G/loss_id: 0.00068698, G/loss_id_psnt: 0.00067951, spk_loss: 0.21367706, content_adv_loss: 4.60780621, mi_cp_loss: 0.00000000, mi_rc_loss: 0.00000000, mi_rp_loss: 0.00000000, lld_cp_loss: 0.00000000, lld_rc_loss: 0.00000000, lld_rp_loss: 0.00000000, pitch_loss: 0.00338293
Elapsed [3 days, 21:44:47], Iteration [807800/1000000], G/loss_id: 0.00086264, G/loss_id_psnt: 0.00085352, spk_loss: 0.00084184, content_adv_loss: 4.60911512, mi_cp_loss: 0.00000000, mi_rc_loss: 0.00008330, mi_rp_loss: 0.00001434, lld_cp_loss: -62.65066147, lld_rc_loss: -14.44594097, lld_rp_loss: -62.67908859
Elapsed [3 days, 21:45:36], Iteration [807900/1000000], G/loss_id: 0.00084089, G/loss_id_psnt: 0.00082641, spk_loss: 0.00110251, content_adv_loss: 4.60943460, mi_cp_loss: 0.00000000, mi_rc_loss: 0.00016344, mi_rp_loss: 0.00001484, lld_cp_loss: -62.94990158, lld_rc_loss: -14.53773308, lld_rp_loss: -62.95674896
Elapsed [3 days, 17:36:39], Iteration [886500/1000000], G/loss_id: 0.00060688, G/loss_id_psnt: 0.00059975, spk_loss: 0.00000000, content_adv_loss: 0.00000000, mi_cp_loss: 0.00000000, mi_rc_loss: 0.00000000, mi_rp_loss: 0.00000000, lld_cp_loss: 0.00000000, lld_rc_loss: 0.00000000, lld_rp_loss: 0.00000000, pitch_loss: 0.00623717
Elapsed [5 days, 4:59:55], Iteration [831100/1000000], G/loss_id: 0.00050043, G/loss_id_psnt: 0.00049772, spk_loss: 0.00000000, content_adv_loss: 0.00000000, mi_cp_loss: 0.00000000, mi_rc_loss: 0.00000898, mi_rp_loss: -0.00004322, lld_cp_loss: -63.60225677, lld_rc_loss: -15.57159233, lld_rp_loss: -63.61416626
Elapsed [4 days, 2:38:30], Iteration [818800/1000000], G/loss_id: 0.00079214, G/loss_id_psnt: 0.00078017, spk_loss: 0.00368636, content_adv_loss: 4.61357307, mi_cp_loss: 0.00000000, mi_rc_loss: 0.00000000, mi_rp_loss: 0.00000000, lld_cp_loss: 0.00000000, lld_rc_loss: 0.00000000, lld_rp_loss: 0.00000000
Elapsed [2 days, 19:02:29], Iteration [827100/1000000], G/loss_id: 0.00056297, G/loss_id_psnt: 0.00055678, spk_loss: 0.00000000, content_adv_loss: 0.00000000, mi_cp_loss: 0.00000000, mi_rc_loss: 0.00000000, mi_rp_loss: 0.00000000, lld_cp_loss: 0.00000000, lld_rc_loss: 0.00000000, lld_rp_loss: 0.00000000
Elapsed [6 days, 15:18:42], Iteration [1000000/1000000], G/loss_id: 0.00051162, G/loss_id_psnt: 0.00050539, spk_loss: 0.00400598, content_adv_loss: 4.59954834, mi_cp_loss: 0.00000000, mi_rc_loss: -0.00137981, mi_rp_loss: 0.00045807, lld_cp_loss: -117.74221039, lld_rc_loss: -8.09668541, lld_rp_loss: -117.96425629, pitch_loss: 0.00297939
Saved model checkpoints into run_dim2_pitch/models...
Validation loss: 24.739643096923828
Elapsed [6 days, 17:55:16], Iteration [1000000/1000000], G/loss_id: 0.00193528, G/loss_id_psnt: 0.00193324, spk_loss: 0.00000000, content_adv_loss: 0.00000000, mi_cp_loss: -0.00000000, mi_rc_loss: 0.00027657, mi_rp_loss: 0.00118168, lld_cp_loss: -36.59268570, lld_rc_loss: -31.09467697, lld_rp_loss: -41.84099579, pitch_loss: 0.00602757
Saved model checkpoints into run_dim2_pitch_wo_adv/models...
Validation loss: 61.93159866333008
Elapsed [4 days, 14:43:55], Iteration [874800/1000000], G/loss_id: 0.00076253, G/loss_id_psnt: 0.00075406, spk_loss: 0.00023900, content_adv_loss: 4.60489750, mi_cp_loss: 0.00000000, mi_rc_loss: 0.00000000, mi_rp_loss: 0.00000000, lld_cp_loss: 0.00000000, lld_rc_loss: 0.00000000, lld_rp_loss: 0.00000000, pitch_loss: 0.00438170
Elapsed [7 days, 12:10:08], Iteration [1000000/1000000], G/loss_id: 0.00056876, G/loss_id_psnt: 0.00055979, spk_loss: 0.00775459, content_adv_loss: 4.60920477, mi_cp_loss: -0.00001970, mi_rc_loss: -0.00001161, mi_rp_loss: 0.00000567, lld_cp_loss: -126.23571777, lld_rc_loss: -28.51069260, lld_rp_loss: -126.23281097
Saved model checkpoints into run_dim2_pitch_wo_pitch/models...
Validation loss: 23.904769897460938
Elapsed [7 days, 2:04:21], Iteration [1000000/1000000], G/loss_id: 0.00036989, G/loss_id_psnt: 0.00036615, spk_loss: 0.00000000, content_adv_loss: 0.00000000, mi_cp_loss: 0.00000000, mi_rc_loss: 0.00000000, mi_rp_loss: 0.00000000, lld_cp_loss: 0.00000000, lld_rc_loss: 0.00000000, lld_rp_loss: 0.00000000, pitch_loss: 0.00280974
Saved model checkpoints into run_dim2_pitch_wo_adv_mi/models...
Validation loss: 17.292850494384766
Elapsed [4 days, 14:56:20], Iteration [819400/1000000], G/loss_id: 0.00043808, G/loss_id_psnt: 0.00043352, spk_loss: 0.00000000, content_adv_loss: 0.00000000, mi_cp_loss: 0.00000000, mi_rc_loss: 0.00003838, mi_rp_loss: -0.00066438, lld_cp_loss: -116.34136963, lld_rc_loss: -31.60812950, lld_rp_loss: -116.60922241
Elapsed [4 days, 1:50:09], Iteration [837000/1000000], G/loss_id: 0.00066237, G/loss_id_psnt: 0.00065387, spk_loss: 0.16599277, content_adv_loss: 4.60649395, mi_cp_loss: 0.00000000, mi_rc_loss: 0.00000000, mi_rp_loss: 0.00000000, lld_cp_loss: 0.00000000, lld_rc_loss: 0.00000000, lld_rp_loss: 0.00000000
Elapsed [3 days, 16:58:49], Iteration [817700/1000000], G/loss_id: 0.00039425, G/loss_id_psnt: 0.00039077, spk_loss: 0.00000000, content_adv_loss: 0.00000000, mi_cp_loss: 0.00000000, mi_rc_loss: 0.00000000, mi_rp_loss: 0.00000000, lld_cp_loss: 0.00000000, lld_rc_loss: 0.00000000, lld_rp_loss: 0.00000000
run: the models above perform poorly when converting timbre, so here the speaker embedding is replaced with a one-hot vector.
Elapsed [0:00:42], Iteration [100/1000000], G/loss_id: 0.03971826, G/loss_id_psnt: 0.77077729, content_adv_loss: 4.60181856, mi_cp_loss: 0.01571883, mi_rc_loss: 0.00013684, mi_rp_loss: 0.00233091, lld_cp_loss: -61.88400269, lld_rc_loss: -15.71234703, lld_rp_loss: -58.84709930
...
Elapsed [2 days, 23:36:30], Iteration [664400/1000000], G/loss_id: 0.00082198, G/loss_id_psnt: 0.00081689, content_adv_loss: 4.60013008, mi_cp_loss: 0.00000000, mi_rc_loss: 0.00009694, mi_rp_loss: -0.00047144, lld_cp_loss: -55.83151245, lld_rc_loss: -15.27785683, lld_rp_loss: -55.92704010
...
Elapsed [4 days, 12:08:59], Iteration [1000000/1000000], G/loss_id: 0.00..., G/loss_id_psnt: 0.00059563, content_adv_loss: 4.61593437, mi_cp_loss: 0.00000000, mi_rc_loss: 0.00007099, mi_rp_loss: -0.00033119, lld_cp_loss: -50.61760712, lld_rc_loss: -15.26475143, lld_rp_loss: -51.30073166
Saved model checkpoints into run/models...
Validation loss: 37.75163459777832
run_pitch:
Elapsed [0:00:51], Iteration [100/1000000], G/loss_id: 0.04715083, G/loss_id_psnt: 0.75423688, content_adv_loss: 4.61723232, mi_cp_loss: 0.01895704, mi_rc_loss: 0.00019252, mi_rp_loss: -0.00003131, lld_cp_loss: -60.86524582, lld_rc_loss: -15.71782684, lld_rp_loss: -57.30221176, pitch_loss: 0.03021768
...
Elapsed [2 days, 23:46:05], Iteration [598000/1000000], G/loss_id: 0.00082708, G/loss_id_psnt: 0.00082202, content_adv_loss: 4.60971642, mi_cp_loss: 0.00000000, mi_rc_loss: -0.00007787, mi_rp_loss: 0.00014183, lld_cp_loss: -62.80820847, lld_rc_loss: -15.05598736, lld_rp_loss: -62.85067749, pitch_loss: 0.00637796
run_use_VQCPC: reached iteration 580000, then had to vacate machine 13 because someone else was waiting for it.
Elapsed [3 days, 5:22:11], Iteration [580000/1000000], G/loss_id: 0.0027..., G/loss_id_psnt: 0.00274249, content_adv_loss: 4.59387255, mi_cp_loss: 0.00000014, mi_rc_loss: 0.00000099, mi_rp_loss: 0.00000026, lld_cp_loss: -63.99961853, lld_rc_loss: -15.95115471, lld_rp_loss: -63.99967575, vq_loss: 1067.20275879, cpc_loss: 1.43544137, pitch_loss: 0.01039015
[58.77016187 58.33333135 56.95564747 55.94757795 54.30107713 54.33467627]
Saved model checkpoints into run_use_VQCPC/models...
Validation loss: 81.5089225769043
run_use_VQCPC_2
Elapsed [4 days, 11:03:03], Iteration [812400/1000000], G/loss_id: 0.00247453, G/loss_id_psnt: 0.00247133, content_adv_loss: 4.60783482, mi_cp_loss: 0.00007319, mi_rc_loss: -0.00000709, mi_rp_loss: 0.00000088, lld_cp_loss: -63.97900009, lld_rc_loss: -15.18019390, lld_rp_loss: -63.95431519, vq_loss: 0.13141513, cpc_loss: 1.21822941, pitch_loss: 0.01545527
[64.23611045 63.88888955 64.23611045 63.54166865 63.54166865 65.2777791 ]
new: I changed it so that the mel spectrogram first goes through G1 five times, and then through G1 and G2, which makes it somewhat slower.
MI:
Elapsed [5 days, 20:36:16], Iteration [676100/1000000], G/loss_id: 0.00094016, G/loss_id_psnt: 0.00093268, content_adv_loss: 4.60860682, mi_cp_loss: -0.00006625, mi_rc_loss: 0.00011470, mi_rp_loss: 0.00001495, lld_cp_loss: -62.70265961, lld_rc_loss: -14.90209007, lld_rp_loss: -62.72354126
MI + pitch:
Elapsed [5 days, 20:37:14], Iteration [647300/1000000], G/loss_id: 0.00082176, G/loss_id_psnt: 0.00081603, content_adv_loss: 4.60966206, mi_cp_loss: 0.00002162, mi_rc_loss: 0.00000000, mi_rp_loss: 0.00004071, lld_cp_loss: -63.53780365, lld_rc_loss: -15.10877800, lld_rp_loss: -63.54629135, pitch_loss: 0.00451364
1. Cross-Lingual Voice Conversion with Disentangled Universal Linguistic Representations
Objective evaluation: MCD, WER
Subjective evaluation: MOS
For English the metric should be WER. First use the ESPnet end-to-end speech processing toolkit for recognition (official GitHub); for how WER is computed, see this GitHub (minimal sketch below).
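A minimal word-level edit-distance WER, for reference (the linked repo's exact text normalization is not reproduced here):

def wer(ref: str, hyp: str) -> float:
    r, h = ref.split(), hyp.split()
    # Standard Levenshtein DP over words: substitutions, insertions, deletions.
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                          d[i - 1][j - 1] + (r[i - 1] != h[j - 1]))
    return d[len(r)][len(h)] / len(r)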
The recognition result below comes out as 'SS' and still needs fixing:
import json
import torch
import argparse
from espnet.bin.asr_recog import get_parser
from espnet.nets.pytorch_backend.e2e_asr_transformer import E2E
import os
import scipy.io.wavfile as wav
from python_speech_features import fbank

filename = "/ceph/home/yangsc21/Python/autovc/SpeechSplit/assets/test_wav/p225/p225_001.wav"
sample_rate, waveform = wav.read(filename)
fbank1, _ = fbank(waveform, samplerate=16000, winlen=0.025, winstep=0.01,
                  nfilt=86, nfft=512, lowfreq=0, highfreq=None, preemph=0.97)
# print(fbank1[0].shape, fbank1[1].shape)  # (204, 86), (204,)

root = "/ceph/home/yangsc21/Python/autovc/espnet/egs/tedlium3/asr1/"
model_dir = "/ceph/home/yangsc21/Python/autovc/espnet/egs/tedlium3/exp/train_trim_sp_pytorch_nbpe500_ngpu8_train_pytorch_transformer.v2_specaug/results/"

# load model
with open(model_dir + "/model.json", "r") as f:
    idim, odim, conf = json.load(f)
model = E2E.build(idim, odim, **conf)
model.load_state_dict(torch.load(model_dir + "/model.last10.avg.best"))
model.cpu().eval()

# load token_list
token_list = conf['char_list']
print(token_list)

# recognize speech
parser = get_parser()
args = parser.parse_args(["--beam-size", "1", "--ctc-weight", "1.0", "--result-label", "out.json", "--model", ""])
x = torch.as_tensor(fbank1).to(torch.float32)
result = model.recognize(x, args, token_list)
print(result)

# join BPE pieces and strip the special tokens from the hypothesis
s = "".join(conf["char_list"][y] for y in result[0]["yseq"]).replace("<eos>", "").replace("<space>", " ").replace("<blank>", "")
print("prediction: ", s)
Gave up on espnet in favor of espnet_model_zoo (official GitHub); the corpora and model names are listed there. Here we run ASR with espnet2, based on librispeech (sketch below).
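A sketch of the espnet_model_zoo route; the model tag is an example from the zoo's table and may need to be swapped for the librispeech model actually used:

import soundfile
from espnet_model_zoo.downloader import ModelDownloader
from espnet2.bin.asr_inference import Speech2Text

d = ModelDownloader()
speech2text = Speech2Text(**d.download_and_unpack(
    "Shinji Watanabe/librispeech_asr_train_asr_transformer_e18_raw_bpe_sp_valid.acc.best"))

speech, rate = soundfile.read("p225_001.wav")
text, tokens, *_ = speech2text(speech)[0]     # best hypothesis
print(text)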
2. Many-to-Many Voice Conversion Based Feature Disentanglement Using Variational Autoencoder
t-SNE visualization of the speaker embedding space
Fig. 3 illustrates the speaker embeddings visualized by the t-SNE method; 30 utterances are sampled per speaker to calculate the speaker representation. According to the empirical results, we found that a chunk of 2 seconds is adequate to extract the speaker representation. As shown in Fig. 3, the embeddings of different speakers are separable, while the embeddings of utterances from the same speaker are close to each other. As a result, our method is able to extract speaker-dependent information using the encoder network.
p335 p264 p247 p278 p272 p262 (F, F, M, M, M, F): these speakers did not appear in training (t-SNE sketch below).
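A rough sklearn-based sketch of such a t-SNE plot, assuming embs (N, D) speaker embeddings and integer labels are precomputed (file names are hypothetical):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

embs = np.load('speaker_embs.npy')       # (N, D), e.g. 30 utterances per speaker
labels = np.load('speaker_labels.npy')   # (N,) integer speaker ids

pts = TSNE(n_components=2, perplexity=30).fit_transform(embs)
plt.scatter(pts[:, 0], pts[:, 1], c=labels, cmap='tab10', s=8)
plt.savefig('tsne_spk_emb.png')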
3. Non-Parallel Many-To-Many Voice Conversion by Knowledge Transfer from a Text-To-Speech Model
Add text as input?
4. Non-Parallel Many-To-Many Voice Conversion Using Local Linguistic Tokens
VQ-VAE
5. Fragmentvc: Any-To-Any Voice Conversion by End-To-End Extracting and Fusing Fine-Grained Voice Fragments with Attention
Ablation studies
6. Zero-Shot Voice Conversion with Adjusted Speaker Embeddings and Simple Acoustic Features
F0 distributions, subjective evaluations
7. Non-Autoregressive Sequence-To-Sequence Voice Conversion
root mean square error (RMSE) of log F0, and character error rate (CER)
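A minimal log-F0 RMSE over frames voiced in both signals (frame alignment is assumed handled elsewhere, e.g. by DTW):

import numpy as np

def logf0_rmse(f0_ref, f0_conv):
    voiced = (f0_ref > 0) & (f0_conv > 0)
    diff = np.log(f0_ref[voiced]) - np.log(f0_conv[voiced])
    return float(np.sqrt(np.mean(diff ** 2)))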
Towards fine-grained prosody control for voice conversion