• Speech | Lightweight Speech Synthesis: Paper Walkthrough and Project Implementation


    2023_LIGHTWEIGHT AND HIGH-FIDELITY END-TO-END TEXT-TO-SPEECH WITH MULTI-BAND GENERATION AND INVERSE SHORT-TIME FOURIER TRANSFORM

Paper: https://arxiv.org/pdf/2210.15975.pdf

Code: GitHub - misakiudon/MB-iSTFT-VITS-multilingual: Lightweight and High-Fidelity End-to-End Text-to-Speech with Multi-Band Generation and Inverse Short-Time Fourier Transform with Multilingual Cleaners

Contents

1. Paper Walkthrough

1.1. Introduction

1.2. The VITS Algorithm

1.3. Proposed Method

1.4. Experiments

1.5. Conclusion

2. Project Implementation

2.1. Data Preparation

2.2. Data Preprocessing

2.3. Text Processing

2.4. Training

2.5. Inference

【PS】

【PS1】ERROR: Could not build wheels for pyopenjtalk, which is required to install pyproject.toml-based projects

【PS2】AttributeError: 'HParams' object has no attribute 'seed'

【PS3】EOFError: Ran out of input

【PS4】No spec.pt files generated for the data

【PS5】TypeError: __init__() takes 1 positional argument but 2 were given

【PS6】AssertionError when importing soundcard (PulseAudio context not ready)

Extensions

Multilingual Processing Methods

Korean

Method 1: Korean to IPA (korean_to_ipa)

Method 2: Acoustic model

VS Code Shortcuts

1. Paper Walkthrough

1.1. Introduction

The paper first reviews the conventional two-stage TTS pipeline (an acoustic model followed by a vocoder). VITS is a high-quality end-to-end model, so the proposed model is a lightweight end-to-end variant built on VITS. The work focuses on the decoder, the part that converts latent acoustic features into a waveform: a portion of the decoder is replaced with a simple inverse short-time Fourier transform (iSTFT), which handles the frequency-to-time conversion efficiently. To speed up inference further, multi-band processing is applied, with each iSTFT output stream acting as a sub-band signal. At inference time, the proposed model is about 4.1x faster than the original VITS.

1.2. The VITS Algorithm

1.2.1. The paper briefly reviews how VITS works; it is not covered in depth here. For details, see the companion post 【VITS论文总结及项目复现】.

1.2.2. Inference speed of each model

Speed is measured with the real-time factor (RTF), i.e., the wall-clock synthesis time divided by the duration of the synthesized speech, as the objective metric.
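As a minimal, self-contained illustration of the metric (the synthesize callable and its output here are stand-ins for a real model):

import time

def real_time_factor(synthesize, sample_rate):
    # RTF = wall-clock synthesis time / duration of the produced audio;
    # values below 1.0 mean faster than real time.
    start = time.perf_counter()
    audio = synthesize()  # returns a 1-D sequence of samples
    elapsed = time.perf_counter() - start
    return elapsed / (len(audio) / sample_rate)

# Toy stand-in: "synthesize" one second of silence at 22050 Hz
print(real_time_factor(lambda: [0.0] * 22050, sample_rate=22050))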

1.3. Proposed Method

1.3.1. In the output layer, the HiFi-GAN-based decoder of the original VITS is replaced with an inverse short-time Fourier transform (iSTFT), following iSTFTNet.
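To make the frequency-to-time step concrete, here is a minimal sketch of what an iSTFT output stage computes, using torch.istft with the iSTFT-VITS settings reported in the paper (FFT size 16, hop length 4, window length 16); the magnitude and phase tensors are random stand-ins for what the decoder would actually predict:

import math
import torch

# Stand-ins for decoder outputs: magnitude and phase spectrograms of
# shape (batch, n_fft // 2 + 1, frames).
n_fft, hop_length, win_length, frames = 16, 4, 16, 100
mag = torch.rand(1, n_fft // 2 + 1, frames)
phase = torch.rand(1, n_fft // 2 + 1, frames) * 2 * math.pi

# Combine magnitude and phase into a complex spectrogram, then invert
# back to the time domain in a single, cheap transform.
spec = torch.polar(mag, phase)
wav = torch.istft(spec, n_fft=n_fft, hop_length=hop_length,
                  win_length=win_length, window=torch.hann_window(win_length))
print(wav.shape)  # (1, num_samples)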

1.3.2. Multi-band iSTFT VITS (MB-iSTFT-VITS)

The figure below shows the overall architecture of the model.

1.3.3. Multi-stream iSTFT VITS (MS-iSTFT-VITS)

Unlike MB-iSTFT-VITS, MS-iSTFT-VITS mixes its waveform streams with a fully trainable synthesis filter, so it is no longer constrained by fixed sub-band signals.

1.4. Experiments

Dataset: the LJ Speech dataset.

Five VITS-based models are compared:

1. VITS: the official VITS, with the same hyperparameters as the original.
2. Nix-TTS: the pretrained Nix-TTS model, in its optimized ONNX version [27]. Note that the dataset used in the experiments is exactly the same one Nix-TTS was trained on.
3. iSTFT-VITS: a model that incorporates iSTFTNet into the VITS decoder. The iSTFTNet architecture is V1-C8C8I, the best-balanced model described in [14]; it contains two residual blocks with upsampling ratios [8, 8], and the FFT size, hop length, and window length of the iSTFT are set to 16, 4, and 16, respectively.
4. MB-iSTFT-VITS: the proposed model described in Section 3.2. The number of sub-bands N is set to 4. The two residual blocks use upsampling ratios [4, 4] to match the resolution of each sub-band signal decomposed by the pseudo-QMF analysis filter; the FFT size, hop length, and window length of the iSTFT part match that sub-band resolution as well.
5. MS-iSTFT-VITS: the other proposed model, described in Section 3.3. Following [10], the kernel size of the convolution-based trainable synthesis filter is set to 63, with no bias. All other conditions are the same as for MB-iSTFT-VITS.

The table below shows the comparison results.

1.5. Conclusion

The paper proposes an end-to-end TTS system capable of high-speed, on-device speech synthesis. The method builds on the successful end-to-end model VITS but applies several techniques to speed up inference, such as removing redundancy in the decoder computation via iSTFT and adopting a multi-band parallel strategy. Because the proposed model is optimized in a fully end-to-end manner, it benefits from that powerful optimization process in a way conventional two-stage approaches cannot. Experimental results show that the method generates speech as natural as that synthesized by VITS while producing waveforms much faster. Future work includes extending the method to multi-speaker models.

2. Project Implementation

2.1. Data Preparation

The prepared audio must be 22050 Hz, mono, PCM 16-bit. (A conversion sketch follows the filelist examples below.)

Single speaker

dataset/001.wav|您好
dataset/001.wav|안녕하세요
dataset/001.wav|こんにちは。

Multi-speaker

dataset/001.wav|0|こんにちは。

The middle number is the speaker ID, starting from 0.
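A minimal conversion sketch using librosa and soundfile (the library choice is an assumption, any resampler works, and the input filename is hypothetical):

import librosa
import soundfile as sf

# Resample to 22050 Hz, downmix to mono, and write 16-bit PCM.
wav, sr = librosa.load('raw/001.wav', sr=22050, mono=True)
sf.write('dataset/001.wav', wav, 22050, subtype='PCM_16')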

# Cython-version Monotonic Alignment Search
cd monotonic_align
mkdir monotonic_align
python setup.py build_ext --inplace

The repo provides several training architectures; since MB-iSTFT-VITS performs best in the paper, this post uses that model.

Create a new config.json for your own dataset and change the paths at `training_files` and `validation_files`:

"training_files": "./filelists/history_train_filelist.txt.cleaned",
"validation_files": "./filelists/history_val_filelist.txt.cleaned",
"text_cleaners": ["cjke_cleaners2"],  # custom multilingual cleaner
# "text_cleaners": ["korean_cleaners"],  # single-language alternative

2.2. Data Preprocessing

# Single speaker
# python preprocess.py --text_index 1 --filelists path/to/filelist_train.txt path/to/filelist_val.txt --text_cleaners 'japanese_cleaners'
# This post:
# Method 1: acoustic-model cleaner (not run here)
python preprocess.py --text_index 1 --filelists ./filelists/history_train_filelist.txt ./filelists/history_val_filelist.txt --text_cleaners 'korean_cleaners'
# Method 2: G2P cleaner
python preprocess.py --text_index 1 --filelists ./filelists/history_train_filelist.txt ./filelists/history_val_filelist.txt --text_cleaners 'cjke_cleaners2'
# Multiple speakers
python preprocess.py --text_index 2 --filelists path/to/filelist_train.txt path/to/filelist_val.txt --text_cleaners 'japanese_cleaners'

Running this generates the corresponding *.cleaned filelists.

2.3. Text Processing

Chinese, English, Korean, and Japanese each require slightly different text handling.

This covers the text processing itself, the conversion method, text normalization, and symbol handling.

2.4. Training

# Single speaker
# python train_latest.py -c <config> -m <model_dir>
python train_latest.py -c configs/myvoice.json -m myvoice_model  # model_dir: any folder name
# Multi-speaker
# python train_latest_ms.py -c <config> -m <model_dir>

Checkpoints and logs are stored under the logs folder.

2.5. Inference

import warnings
warnings.filterwarnings(action='ignore')
import os
import time
import argparse

import torch

import commons
import utils
from models import SynthesizerTrn
from text.symbols import symbols
from text import cleaned_text_to_sequence
# Japanese: from g2p import pyopenjtalk_g2p_prosody
# Korean:
from g2pk2 import G2p
import soundcard as sc
import soundfile as sf


def get_text(text, hps):
    text_norm = cleaned_text_to_sequence(text)
    if hps.data.add_blank:
        text_norm = commons.intersperse(text_norm, 0)
    text_norm = torch.LongTensor(text_norm)
    return text_norm


def inference(args):
    config_path = args.config
    G_model_path = args.model_path

    # check device
    if torch.cuda.is_available():
        print("Enter the device number to use.")
        key = input("GPU:0, CPU:1 ===> ")
        if key == "0":
            device = "cuda:0"
        else:
            device = "cpu"
        print(f"Device : {device}")
    else:
        print("CUDA is not available. Device : cpu")
        device = "cpu"

    # load config.json
    hps = utils.get_hparams_from_file(config_path)

    # load checkpoint
    net_g = SynthesizerTrn(
        len(symbols),
        hps.data.filter_length // 2 + 1,
        hps.train.segment_size // hps.data.hop_length,
        **hps.model).to(device)
    _ = net_g.eval()
    _ = utils.load_checkpoint(G_model_path, net_g, None)

    # play audio on the system default output device
    speaker = sc.get_speaker(sc.default_speaker().name)

    # g2pk2's G2p is a class: instantiate once, then call it (see 【PS5】)
    g2p = G2p()

    # parameter settings
    noise_scale = torch.tensor(0.66)    # adjust z_p noise
    noise_scale_w = torch.tensor(0.8)   # adjust SDP noise
    length_scale = torch.tensor(1.0)    # adjust sound length scale (talk speed)

    if args.is_save is True:
        n_save = 0
        save_dir = os.path.join("./infer_logs/")
        os.makedirs(save_dir, exist_ok=True)

    ### Dummy input, to warm the model up before speed measurement ###
    with torch.inference_mode():
        # Japanese: stn_phn = pyopenjtalk_g2p_prosody("速度計測のためのダミーインプットです。")
        # Korean:
        stn_phn = g2p("소프트웨어 교육의 중요성이 날로 더해가는데 학생들은 소프트웨어 관련 교육을 쉽게 지루해해요")
        stn_tst = get_text(stn_phn, hps)

        # generate audio
        x_tst = stn_tst.to(device).unsqueeze(0)
        x_tst_lengths = torch.LongTensor([stn_tst.size(0)]).to(device)
        audio = net_g.infer(x_tst,
                            x_tst_lengths,
                            noise_scale=noise_scale,
                            noise_scale_w=noise_scale_w,
                            length_scale=length_scale)[0][0, 0].data.cpu().float().numpy()

    while True:
        # get text
        text = input("Enter text. ==> ")
        if text == "":
            print("Empty input is detected... Exit...")
            break

        # measure the execution time
        if device.startswith("cuda"):
            torch.cuda.synchronize()
        start = time.time()
        # requires_grad is False inside inference_mode
        with torch.inference_mode():
            # Japanese: stn_phn = pyopenjtalk_g2p_prosody(text)
            # Korean:
            stn_phn = g2p(text)
            stn_tst = get_text(stn_phn, hps)

            # generate audio
            x_tst = stn_tst.to(device).unsqueeze(0)
            x_tst_lengths = torch.LongTensor([stn_tst.size(0)]).to(device)
            audio = net_g.infer(x_tst,
                                x_tst_lengths,
                                noise_scale=noise_scale,
                                noise_scale_w=noise_scale_w,
                                length_scale=length_scale)[0][0, 0].data.cpu().float().numpy()
        if device.startswith("cuda"):
            torch.cuda.synchronize()
        elapsed_time = time.time() - start
        print(f"Gen Time : {elapsed_time:.4f}")

        # play audio
        speaker.play(audio, hps.data.sampling_rate)

        # save audio
        if args.is_save is True:
            n_save += 1
            data = audio
            try:
                save_path = os.path.join(save_dir, str(n_save).zfill(3) + f"_{text}.wav")
                sf.write(file=save_path,
                         data=data,
                         samplerate=hps.data.sampling_rate,
                         format="WAV")
            except Exception:
                # fall back to a truncated filename if the text is too long
                save_path = os.path.join(save_dir, str(n_save).zfill(3) + f"_{text[:10]}〜.wav")
                sf.write(file=save_path,
                         data=data,
                         samplerate=hps.data.sampling_rate,
                         format="WAV")
            print(f"Audio is saved at : {save_path}")

    return 0


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('--config',
                        type=str,
                        required=True,
                        # default="./logs/ITA_CORPUS/config.json",
                        help='Path to configuration file')
    parser.add_argument('--model_path',
                        type=str,
                        required=True,
                        # default="./logs/ITA_CORPUS/G_1200.pth",
                        help='Path to checkpoint')
    parser.add_argument('--is_save',
                        type=str,
                        default=True,
                        help='Whether to save output or not')
    args = parser.parse_args()
    inference(args)

If an error occurs here, see 【PS5】.

Related Projects

GitHub - jaywalnut310/vits: VITS: Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech
https://github.com/jaywalnut310/vits

GitHub - MasayaKawamura/MB-iSTFT-VITS: Lightweight and High-Fidelity End-to-End Text-to-Speech with Multi-Band Generation and Inverse Short-Time Fourier Transform
https://github.com/MasayaKawamura/MB-iSTFT-VITS

GitHub - FENRlR/MB-iSTFT-VITS2: Application of MB-iSTFT-VITS components to vits2_pytorch
https://github.com/FENRlR/MB-iSTFT-VITS2

GitHub - p0p4k/vits2_pytorch: unofficial vits2-TTS implementation in pytorch
https://github.com/p0p4k/vits2_pytorch

GitHub - daniilrobnikov/vits2: VITS2: Improving Quality and Efficiency of Single-Stage Text-to-Speech with Adversarial Learning and Architecture Design
https://github.com/daniilrobnikov/vits2

GitHub - fishaudio/Bert-VITS2: vits2 backbone with bert
https://github.com/fishaudio/Bert-VITS2

GitHub - PlayVoice/whisper-vits-svc: Core Engine of Singing Voice Conversion & Singing Voice Clone
https://github.com/PlayVoice/whisper-vits-svc

GitHub - tonnetonne814/SiFi-VITS2-44100-Ja: DDPM-based Pitch Generation and Pitch Controllable Voice Synthesis
https://github.com/tonnetonne814/SiFi-VITS2-44100-Ja

GitHub - Tsunamicloud/Emotion_Bert_VITS2
https://github.com/Tsunamicloud/Emotion_Bert_VITS2

【PS】

【PS1】ERROR: Could not build wheels for pyopenjtalk, which is required to install pyproject.toml-based projects

This error appears while running pip install -r requirements.txt.

One suggested workaround (see reference 1) is:

pip install pycocotools -i https://pypi.python.org/simple

but the error persists. Note that pyopenjtalk compiles a native extension, so a missing cmake or C/C++ toolchain is a common cause; installing those build tools before retrying may help.

【PS2】AttributeError: 'HParams' object has no attribute 'seed'

The config file is missing the seed entry.

Edit the config file to add it back.
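A minimal sketch of the fix, assuming the layout of the official VITS configs (which set the seed under the "train" block; 1234 is the value they use):

"train": {
    "seed": 1234,
    ...
}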

【PS3】EOFError: Ran out of input

The author's defaults assume training with 5 GPUs and correspondingly many data-loading workers.

Change num_workers to 0 (a sketch follows this subsection).

After that, another error appears, caused by symbols missing from the symbol table.

Change line 35 of /workspace/tts/MB-iSTFT-VITS-multilingual/text/__init__.py to the following, which skips unknown symbols instead of raising a KeyError:

sequence = [_symbol_to_id[symbol] for symbol in cleaned_text if symbol in _symbol_to_id.keys()]
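For context, a self-contained sketch of the num_workers setting (the TensorDataset here is a stand-in; train_latest.py passes the same parameter to its own DataLoader):

import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.arange(10).float())
# num_workers=0 loads batches in the main process, avoiding the worker
# subprocesses whose failed unpickling surfaces as "EOFError: Ran out of input".
loader = DataLoader(dataset, batch_size=2, num_workers=0)
for (batch,) in loader:
    print(batch)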

【PS4】No spec.pt files generated for the data

The *.spec.pt files are created lazily by the dataset loader (data_utils.py) the first time each wav is read during training. If they are missing, check that the wav paths in the filelists are valid and that the audio sample rate matches the config.

【PS5】TypeError: __init__() takes 1 positional argument but 2 were given

Different languages use different text-processing libraries. With g2pk2, G2p is a class, so calling G2p(text) directly passes the text to the constructor and raises this error. Instantiate first (g2p = G2p()), then call g2p(text), as in the inference script above.

【PS6】Traceback (most recent call last):
  File "sc_test.py", line 2, in <module>
    import soundcard
  File "/opt/miniconda3/envs/vits/lib/python3.8/site-packages/soundcard/__init__.py", line 4, in <module>
    from soundcard.pulseaudio import *
  File "/opt/miniconda3/envs/vits/lib/python3.8/site-packages/soundcard/pulseaudio.py", line 265, in <module>
    _pulse = _PulseAudio()
  File "/opt/miniconda3/envs/vits/lib/python3.8/site-packages/soundcard/pulseaudio.py", line 76, in __init__
    assert self._pa_context_get_state(self.context)==_pa.PA_CONTEXT_READY
AssertionError

Reference:

'import soundcard' throws an error when run through a systemd service · Issue #133 · bastibe/SoundCard (github.com)

Extensions

Multilingual Processing Methods

Chinese

Linguistic features

Grapheme-to-phoneme (G2P): convert the text into phonetic notation, for example "中国" becomes "zhong1 guo2".
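A minimal sketch using the pypinyin library (an assumption; the repo's Mandarin cleaner may use a different G2P backend):

from pypinyin import lazy_pinyin, Style

# Style.TONE3 appends the tone number to each syllable.
print(lazy_pinyin('中国', style=Style.TONE3))  # ['zhong1', 'guo2']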

Korean

Method 1: Korean to IPA (korean_to_ipa)

Use a fixed data format.

Modify the text handling in /workspace/tts/MB-iSTFT-VITS-multilingual/text/__init__.py and cleaners.py.

Update the config file at /workspace/tts/MB-iSTFT-VITS-multilingual/configs/<your dataset name>.json.

import re
from text.korean import latin_to_hangul, number_to_hangul, divide_hangul, korean_to_lazy_ipa, korean_to_ipa
from g2pk2 import G2p

def cjke_cleaners2(text):
    # Korean-only variant: convert [KO]-tagged spans to IPA
    text = re.sub(r'\[KO\](.*?)\[KO\]',
                  lambda x: korean_to_ipa(x.group(1)) + ' ', text)
    text = re.sub(r'\s+$', '', text)
    text = re.sub(r'([^\.,!\?\-…~])$', r'\1.', text)
    return text
    # Alternative: strip the 4-character [KO] markers directly and
    # append a final period when missing:
    # cleaned_text = korean_to_ipa(text[4:-4])
    # if re.match(r'[^\.,!\?\-…~]', text[-1]):
    #     text += '.'
Method 2: Acoustic model

From a generative-model perspective, an acoustic model (AM) characterizes the probability distribution of the audio features corresponding to a modeling unit (a character, a word, or a phoneme, any unit works), i.e., the probability of generating a sequence of acoustic features O. In its simplest form this distribution can be represented by a GMM, by an HMM plus GMM, or by a neural network.

Grapheme-to-phoneme (G2P)

scarletcho/KoG2P: Korean grapheme-to-phone conversion in Python
https://github.com/scarletcho/KoG2P

Text processing

For Korean, beyond the approach the author provides:

First add korean.py, which handles Hangul as well as punctuation and other symbols. The full code follows.

import re
from jamo import h2j, j2hcj
import ko_pron

# This is a list of Korean classifiers preceded by pure Korean numerals.
_korean_classifiers = '군데 권 개 그루 닢 대 두 마리 모 모금 뭇 발 발짝 방 번 벌 보루 살 수 술 시 쌈 움큼 정 짝 채 척 첩 축 켤레 톨 통'

# List of (hangul, hangul divided) pairs:
_hangul_divided = [(re.compile('%s' % x[0]), x[1]) for x in [
    ('ㄳ', 'ㄱㅅ'),
    ('ㄵ', 'ㄴㅈ'),
    ('ㄶ', 'ㄴㅎ'),
    ('ㄺ', 'ㄹㄱ'),
    ('ㄻ', 'ㄹㅁ'),
    ('ㄼ', 'ㄹㅂ'),
    ('ㄽ', 'ㄹㅅ'),
    ('ㄾ', 'ㄹㅌ'),
    ('ㄿ', 'ㄹㅍ'),
    ('ㅀ', 'ㄹㅎ'),
    ('ㅄ', 'ㅂㅅ'),
    ('ㅘ', 'ㅗㅏ'),
    ('ㅙ', 'ㅗㅐ'),
    ('ㅚ', 'ㅗㅣ'),
    ('ㅝ', 'ㅜㅓ'),
    ('ㅞ', 'ㅜㅔ'),
    ('ㅟ', 'ㅜㅣ'),
    ('ㅢ', 'ㅡㅣ'),
    ('ㅑ', 'ㅣㅏ'),
    ('ㅒ', 'ㅣㅐ'),
    ('ㅕ', 'ㅣㅓ'),
    ('ㅖ', 'ㅣㅔ'),
    ('ㅛ', 'ㅣㅗ'),
    ('ㅠ', 'ㅣㅜ')
]]

# List of (Latin alphabet, hangul) pairs:
_latin_to_hangul = [(re.compile('%s' % x[0], re.IGNORECASE), x[1]) for x in [
    ('a', '에이'),
    ('b', '비'),
    ('c', '시'),
    ('d', '디'),
    ('e', '이'),
    ('f', '에프'),
    ('g', '지'),
    ('h', '에이치'),
    ('i', '아이'),
    ('j', '제이'),
    ('k', '케이'),
    ('l', '엘'),
    ('m', '엠'),
    ('n', '엔'),
    ('o', '오'),
    ('p', '피'),
    ('q', '큐'),
    ('r', '아르'),
    ('s', '에스'),
    ('t', '티'),
    ('u', '유'),
    ('v', '브이'),
    ('w', '더블유'),
    ('x', '엑스'),
    ('y', '와이'),
    ('z', '제트')
]]

# List of (ipa, lazy ipa) pairs:
_ipa_to_lazy_ipa = [(re.compile('%s' % x[0], re.IGNORECASE), x[1]) for x in [
    ('t͡ɕ', 'ʧ'),
    ('d͡ʑ', 'ʥ'),
    ('ɲ', 'n^'),
    ('ɕ', 'ʃ'),
    ('ʷ', 'w'),
    ('ɭ', 'l`'),
    ('ʎ', 'ɾ'),
    ('ɣ', 'ŋ'),
    ('ɰ', 'ɯ'),
    ('ʝ', 'j'),
    ('ʌ', 'ə'),
    ('ɡ', 'g'),
    ('\u031a', '#'),
    ('\u0348', '='),
    ('\u031e', ''),
    ('\u0320', ''),
    ('\u0339', '')
]]


def latin_to_hangul(text):
    for regex, replacement in _latin_to_hangul:
        text = re.sub(regex, replacement, text)
    return text


def divide_hangul(text):
    text = j2hcj(h2j(text))
    for regex, replacement in _hangul_divided:
        text = re.sub(regex, replacement, text)
    return text


def hangul_number(num, sino=True):
    '''Reference https://github.com/Kyubyong/g2pK'''
    num = re.sub(',', '', num)

    if num == '0':
        return '영'
    if not sino and num == '20':
        return '스무'

    digits = '123456789'
    names = '일이삼사오육칠팔구'
    digit2name = {d: n for d, n in zip(digits, names)}

    modifiers = '한 두 세 네 다섯 여섯 일곱 여덟 아홉'
    decimals = '열 스물 서른 마흔 쉰 예순 일흔 여든 아흔'
    digit2mod = {d: mod for d, mod in zip(digits, modifiers.split())}
    digit2dec = {d: dec for d, dec in zip(digits, decimals.split())}

    spelledout = []
    for i, digit in enumerate(num):
        i = len(num) - i - 1
        if sino:
            if i == 0:
                name = digit2name.get(digit, '')
            elif i == 1:
                name = digit2name.get(digit, '') + '십'
                name = name.replace('일십', '십')
        else:
            if i == 0:
                name = digit2mod.get(digit, '')
            elif i == 1:
                name = digit2dec.get(digit, '')
        if digit == '0':
            if i % 4 == 0:
                last_three = spelledout[-min(3, len(spelledout)):]
                if ''.join(last_three) == '':
                    spelledout.append('')
                    continue
            else:
                spelledout.append('')
                continue
        if i == 2:
            name = digit2name.get(digit, '') + '백'
            name = name.replace('일백', '백')
        elif i == 3:
            name = digit2name.get(digit, '') + '천'
            name = name.replace('일천', '천')
        elif i == 4:
            name = digit2name.get(digit, '') + '만'
            name = name.replace('일만', '만')
        elif i == 5:
            name = digit2name.get(digit, '') + '십'
            name = name.replace('일십', '십')
        elif i == 6:
            name = digit2name.get(digit, '') + '백'
            name = name.replace('일백', '백')
        elif i == 7:
            name = digit2name.get(digit, '') + '천'
            name = name.replace('일천', '천')
        elif i == 8:
            name = digit2name.get(digit, '') + '억'
        elif i == 9:
            name = digit2name.get(digit, '') + '십'
        elif i == 10:
            name = digit2name.get(digit, '') + '백'
        elif i == 11:
            name = digit2name.get(digit, '') + '천'
        elif i == 12:
            name = digit2name.get(digit, '') + '조'
        elif i == 13:
            name = digit2name.get(digit, '') + '십'
        elif i == 14:
            name = digit2name.get(digit, '') + '백'
        elif i == 15:
            name = digit2name.get(digit, '') + '천'
        spelledout.append(name)
    return ''.join(elem for elem in spelledout)


def number_to_hangul(text):
    '''Reference https://github.com/Kyubyong/g2pK'''
    tokens = set(re.findall(r'(\d[\d,]*)([\uac00-\ud71f]+)', text))
    for token in tokens:
        num, classifier = token
        if classifier[:2] in _korean_classifiers or classifier[0] in _korean_classifiers:
            spelledout = hangul_number(num, sino=False)
        else:
            spelledout = hangul_number(num, sino=True)
        text = text.replace(f'{num}{classifier}', f'{spelledout}{classifier}')
    # digit by digit for remaining digits
    digits = '0123456789'
    names = '영일이삼사오육칠팔구'
    for d, n in zip(digits, names):
        text = text.replace(d, n)
    return text


def korean_to_lazy_ipa(text):
    text = latin_to_hangul(text)
    text = number_to_hangul(text)
    text = re.sub('[\uac00-\ud7af]+',
                  lambda x: ko_pron.romanise(x.group(0), 'ipa').split('] ~ [')[0], text)
    for regex, replacement in _ipa_to_lazy_ipa:
        text = re.sub(regex, replacement, text)
    return text


def korean_to_ipa(text):
    text = korean_to_lazy_ipa(text)
    return text.replace('ʧ', 'tʃ').replace('ʥ', 'dʑ')
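A hypothetical quick check of the module above, assuming it is saved as text/korean.py and that jamo and ko_pron are installed:

from text.korean import latin_to_hangul, number_to_hangul, korean_to_ipa

print(latin_to_hangul('tts'))        # Latin letters become their Hangul names
print(number_to_hangul('사과 3개'))   # digits become Korean numerals
print(korean_to_ipa('안녕하세요'))    # Hangul to IPA via ko_pron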

Then call it from /workspace/tts/MB-iSTFT-VITS-multilingual/text/cleaners.py:

import re
from text.korean import latin_to_hangul, number_to_hangul, divide_hangul, korean_to_lazy_ipa, korean_to_ipa
from g2pk2 import G2p


def korean_cleaners(text):
    '''Pipeline for Korean text'''
    text = latin_to_hangul(text)
    text = number_to_hangul(text)
    text = divide_hangul(text)
    text = re.sub(r'([\u3131-\u3163])$', r'\1.', text)
    return text


def korean_cleaners2(text):
    '''Pipeline for Korean text, with g2pk2 pronunciation rules applied first'''
    text = latin_to_hangul(text)
    g2p = G2p()  # G2p is a class: instantiate, then call (see 【PS5】)
    text = g2p(text)
    text = divide_hangul(text)
    text = re.sub(r'([\u3131-\u3163])$', r'\1.', text)
    return text

Multilingual

For Chinese, Japanese, Korean, and English together, cjke_cleaners / cjke_cleaners2 convert tagged text to IPA:

import re


def cjke_cleaners2(text):
    from text.mandarin import chinese_to_ipa
    from text.japanese import japanese_to_ipa2
    from text.korean import korean_to_ipa
    from text.english import english_to_ipa2
    text = re.sub(r'\[ZH\](.*?)\[ZH\]',
                  lambda x: chinese_to_ipa(x.group(1)) + ' ', text)
    text = re.sub(r'\[JA\](.*?)\[JA\]',
                  lambda x: japanese_to_ipa2(x.group(1)) + ' ', text)
    text = re.sub(r'\[KO\](.*?)\[KO\]',
                  lambda x: korean_to_ipa(x.group(1)) + ' ', text)
    text = re.sub(r'\[EN\](.*?)\[EN\]',
                  lambda x: english_to_ipa2(x.group(1)) + ' ', text)
    text = re.sub(r'\s+$', '', text)
    text = re.sub(r'([^\.,!\?\-…~])$', r'\1.', text)
    return text
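Hypothetical usage, assuming the repo's text modules are importable: each language span is wrapped in matching tags, and the cleaner replaces it with IPA.

print(cjke_cleaners2('[KO]안녕하세요[KO] [EN]hello[EN]'))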

VS Code Shortcuts

Ctrl+F: find a keyword

Click the arrow on the left of the search box to expand the replace field

Ctrl+Alt+Enter: replace all

References

【1】tonnetonne814/MB-iSTFT-VITS-44100-Ja: MB-iSTFT-VITS (Lightweight and High-Fidelity End-to-End Text-to-Speech with Multi-Band Generation and Inverse Short-Time Fourier Transform) adapted to 44100 Hz Japanese audio. https://github.com/tonnetonne814/MB-iSTFT-VITS-44100-Ja

【2】细读经典:Lightweight VITS - 知乎 (zhihu.com)
