AI Voice Cloning

Installation

Clone the GitHub repository

git clone https://github.com/Plachtaa/VITS-fast-fine-tuning.git

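Installation documentation
Chinese and Japanese language model site
Currently supported tasks:

Clone a character's voice from 10 or more short audio clips
Clone a character's voice from a long audio recording of 3 minutes or more (each recording must contain a single speaker)
Clone a character's voice from a video of 3 minutes or more (each video must contain a single speaker)
Clone a character's voice from a bilibili video link (each video must contain a single speaker)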

Running inference locally

python VC_inference.py --model_dir ./OUTPUT_MODEL/G_latest.pth --share True

Then open the following address in a browser on the same machine:

http://localhost:7860

This shows the TTS interface, but it is only visible on the local machine. To reach it from a remote computer, you can expose the port with cpolar:

cpolar http 7860

cpolar then prints a public URL that can be used to access the interface.
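
Before starting a tunnel, it can help to confirm that the interface is actually listening on port 7860. The following is a minimal sketch that uses only Python's standard library; the function name `is_port_open` and the one-second timeout are illustrative choices and are not part of the repository.

```python
import socket


def is_port_open(host: str = "localhost", port: int = 7860, timeout: float = 1.0) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    try:
        # create_connection raises OSError when the connection is refused or times out
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

If `is_port_open()` returns False, the inference script is probably not running yet, or it is bound to a different port.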

Local training

1. Create a conda environment

conda create -n tts python=3.8

2. Install the dependencies

pip install -r requirements.txt

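During this step some packages, such as OpenAI's whisper, may fail to install through pip because of network problems. In that case, download the package by other means and install it with pip from the local copy.

3. Install the GPU build of PyTorch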

# CUDA 11.6
pip install torch==1.13.1+cu116 torchvision==0.14.1+cu116 torchaudio==0.13.1 --extra-index-url https://download.pytorch.org/whl/cu116
# CUDA 11.7
pip install torch==1.13.1+cu117 torchvision==0.14.1+cu117 torchaudio==0.13.1 --extra-index-url https://download.pytorch.org/whl/cu117
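
After installing, a quick check can confirm that the GPU build was picked up. This is a small sketch, not part of the repository; `describe_torch` is a hypothetical helper name, and it only relies on `torch.__version__`, `torch.version.cuda`, and `torch.cuda.is_available()`.

```python
def describe_torch() -> str:
    """Summarize the installed PyTorch build, or report that it is missing."""
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed in this environment"
    # torch.version.cuda is None for CPU-only builds
    cuda_build = torch.version.cuda or "none (CPU-only build)"
    gpu_ok = torch.cuda.is_available()
    return f"torch {torch.__version__}, CUDA build: {cuda_build}, GPU available: {gpu_ok}"


print(describe_torch())
```

If the output reports a CPU-only build or no available GPU, the `+cu116` or `+cu117` wheel was likely not the one installed.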

4. Install the video processing packages

pip install imageio==2.4.1
pip install moviepy

5. Build the preprocessing module

cd monotonic_align
mkdir monotonic_align
python setup.py build_ext --inplace
cd ..

6. Download the auxiliary data

mkdir pretrained_models
# download data for fine-tuning
wget https://huggingface.co/datasets/Plachta/sampled_audio4ft/resolve/main/sampled_audio4ft_v2.zip
unzip sampled_audio4ft_v2.zip
# create necessary directories
mkdir video_data
mkdir raw_audio
mkdir denoised_audio
mkdir custom_character_voice
mkdir segmented_character_voice
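
On systems where `mkdir` fails because a directory already exists, the same layout can be created from Python. A minimal sketch, assuming it is run from the repository root; `create_work_dirs` is a hypothetical helper name, and the directory names are the ones used in the commands above.

```python
from pathlib import Path

# Directory names taken from the mkdir commands above
WORK_DIRS = [
    "pretrained_models",
    "video_data",
    "raw_audio",
    "denoised_audio",
    "custom_character_voice",
    "segmented_character_voice",
]


def create_work_dirs(root: str = ".") -> list:
    """Create each working directory under root; existing ones are left alone."""
    created = []
    for name in WORK_DIRS:
        path = Path(root) / name
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created
```

Because of `exist_ok=True`, calling `create_work_dirs()` a second time is harmless.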

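7. Download a pretrained model

Three pretrained models are available. Run only one of the three command groups below (listed in the order CJE, CJ, C); all of them write to the same destination files.

CJE: Trilingual (Chinese, Japanese, English)
CJ: Bilingual (Chinese, Japanese)
C: Chinese only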

wget https://huggingface.co/spaces/Plachta/VITS-Umamusume-voice-synthesizer/resolve/main/pretrained_models/D_trilingual.pth -O ./pretrained_models/D_0.pth
wget https://huggingface.co/spaces/Plachta/VITS-Umamusume-voice-synthesizer/resolve/main/pretrained_models/G_trilingual.pth -O ./pretrained_models/G_0.pth
wget https://huggingface.co/spaces/Plachta/VITS-Umamusume-voice-synthesizer/resolve/main/configs/uma_trilingual.json -O ./configs/finetune_speaker.json

wget https://huggingface.co/spaces/sayashi/vits-uma-genshin-honkai/resolve/main/model/D_0-p.pth -O ./pretrained_models/D_0.pth
wget https://huggingface.co/spaces/sayashi/vits-uma-genshin-honkai/resolve/main/model/G_0-p.pth -O ./pretrained_models/G_0.pth
wget https://huggingface.co/spaces/sayashi/vits-uma-genshin-honkai/resolve/main/model/config.json -O ./configs/finetune_speaker.json

wget https://huggingface.co/datasets/Plachta/sampled_audio4ft/resolve/main/VITS-Chinese/D_0.pth -O ./pretrained_models/D_0.pth
wget https://huggingface.co/datasets/Plachta/sampled_audio4ft/resolve/main/VITS-Chinese/G_0.pth -O ./pretrained_models/G_0.pth
wget https://huggingface.co/datasets/Plachta/sampled_audio4ft/resolve/main/VITS-Chinese/config.json -O ./configs/finetune_speaker.json
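
Since each group of commands overwrites the same three files, it is easy to run the wrong one by accident. The sketch below selects a single group by language code. It assumes the three groups above correspond to CJE, CJ, and C in that order; `wget_commands` is a hypothetical helper, and it only builds the command strings without downloading anything.

```python
# URLs and destination paths are copied from the wget commands above.
SPACES = "https://huggingface.co/spaces"
DATASETS = "https://huggingface.co/datasets"

# Assumption: the groups map to CJE, CJ, C in the order they appear above.
SOURCES = {
    "CJE": (
        f"{SPACES}/Plachta/VITS-Umamusume-voice-synthesizer/resolve/main/pretrained_models/D_trilingual.pth",
        f"{SPACES}/Plachta/VITS-Umamusume-voice-synthesizer/resolve/main/pretrained_models/G_trilingual.pth",
        f"{SPACES}/Plachta/VITS-Umamusume-voice-synthesizer/resolve/main/configs/uma_trilingual.json",
    ),
    "CJ": (
        f"{SPACES}/sayashi/vits-uma-genshin-honkai/resolve/main/model/D_0-p.pth",
        f"{SPACES}/sayashi/vits-uma-genshin-honkai/resolve/main/model/G_0-p.pth",
        f"{SPACES}/sayashi/vits-uma-genshin-honkai/resolve/main/model/config.json",
    ),
    "C": (
        f"{DATASETS}/Plachta/sampled_audio4ft/resolve/main/VITS-Chinese/D_0.pth",
        f"{DATASETS}/Plachta/sampled_audio4ft/resolve/main/VITS-Chinese/G_0.pth",
        f"{DATASETS}/Plachta/sampled_audio4ft/resolve/main/VITS-Chinese/config.json",
    ),
}

# Every group writes to the same three destinations, in this order.
DESTINATIONS = (
    "./pretrained_models/D_0.pth",
    "./pretrained_models/G_0.pth",
    "./configs/finetune_speaker.json",
)


def wget_commands(language: str) -> list:
    """Return the three wget commands for one language code."""
    if language not in SOURCES:
        raise ValueError(f"unknown language code: {language!r}")
    return [f"wget {url} -O {dest}" for url, dest in zip(SOURCES[language], DESTINATIONS)]
```

For example, `wget_commands("C")` returns the three commands for the Chinese-only model, which can then be pasted into a shell.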

8. Place the voice data in the corresponding directories

Short audio
Pack the audio clips into a zip file with the following structure:

Your-zip-file.zip
├───Character_name_1
│   ├───xxx.wav
│   ├───...
│   ├───yyy.mp3
│   └───zzz.wav
├───Character_name_2
│   ├───xxx.wav
│   ├───...
│   ├───yyy.mp3
│   └───zzz.wav
├───...
│
└───Character_name_n
    ├───xxx.wav
    ├───...
    ├───yyy.mp3
    └───zzz.wav

Place the zip file in ./custom_character_voice/ (the command below expects it to be named custom_character_voice.zip), then run:

unzip ./custom_character_voice/custom_character_voice.zip -d ./custom_character_voice/
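
The zip layout described above can also be produced with a short script. This is a sketch under the assumption that the source folder holds one subfolder per character; `build_voice_zip` is a hypothetical helper, and the accepted extensions are limited to the two that appear in the tree.

```python
import zipfile
from pathlib import Path

AUDIO_EXTS = {".wav", ".mp3"}  # extensions shown in the directory tree


def build_voice_zip(src_dir: str, zip_path: str) -> list:
    """Pack src_dir/<character>/<clip> into a zip with character folders at the top level."""
    src = Path(src_dir)
    names = []
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for character in sorted(p for p in src.iterdir() if p.is_dir()):
            for clip in sorted(character.iterdir()):
                # Skip anything that is not a recognized audio file
                if clip.is_file() and clip.suffix.lower() in AUDIO_EXTS:
                    arcname = f"{character.name}/{clip.name}"
                    zf.write(clip, arcname)
                    names.append(arcname)
    return names
```

The returned list shows the archive paths that were written, which makes it easy to check that every character folder ended up at the top level.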

Long audio
Name each wav file in the style of Diana_234135.wav and place it in ./raw_audio/
Video
Name each video in the style of Taffy_332452.mp4 and place it in ./video_data/
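
The two example names suggest a pattern of character name, underscore, number. The sketch below checks a file name against that pattern before it is copied into place. The pattern is an assumption inferred from the examples rather than a rule stated here, and `parse_media_name` is a hypothetical helper.

```python
import re
from pathlib import Path

# Assumed pattern: <character>_<digits>.<extension>, inferred from the two examples
NAME_RE = re.compile(r"^(?P<character>.+)_(?P<number>\d+)$")


def parse_media_name(filename: str):
    """Return (character, number), or None if the name does not fit the assumed pattern."""
    match = NAME_RE.match(Path(filename).stem)
    if match is None:
        return None
    return match.group("character"), match.group("number")
```

A result of None indicates that the file should probably be renamed before running the processing scripts.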

9. Process the audio

python scripts/video2audio.py
python scripts/denoise_audio.py
python scripts/long_audio_transcribe.py --languages "{PRETRAINED_MODEL}" --whisper_size large
python scripts/short_audio_transcribe.py --languages "{PRETRAINED_MODEL}" --whisper_size large
python scripts/resample.py

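Replace "{PRETRAINED_MODEL}" with "C" (or with the code of whichever model you downloaded in step 7). If your GPU has less than 12 GB of memory, set whisper_size to medium or small.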
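
To avoid editing the placeholders by hand, the five commands can be generated with the values filled in. This is a sketch only: `audio_commands` is a hypothetical helper, the 12 GB threshold comes from the note above, and falling back to medium rather than small is an arbitrary choice.

```python
def audio_commands(language: str = "C", gpu_memory_gb: float = 12.0) -> list:
    """Return the step 9 commands with the placeholders filled in."""
    # 12 GB is the threshold from the note above; choosing "medium" over
    # "small" below that threshold is an assumption of this sketch.
    whisper_size = "large" if gpu_memory_gb >= 12 else "medium"
    transcribe_args = f'--languages "{language}" --whisper_size {whisper_size}'
    return [
        "python scripts/video2audio.py",
        "python scripts/denoise_audio.py",
        f"python scripts/long_audio_transcribe.py {transcribe_args}",
        f"python scripts/short_audio_transcribe.py {transcribe_args}",
        "python scripts/resample.py",
    ]
```

For instance, `audio_commands("C", gpu_memory_gb=8)` yields the same five commands with `--whisper_size medium`.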

10. Process the text data
To include the auxiliary data, run

python preprocess_v2.py --add_auxiliary_data True --languages "C"

To skip the auxiliary data, run

python preprocess_v2.py --languages "{PRETRAINED_MODEL}"

11. Start training
Run the following command to start training:

python finetune_speaker_v2.py -m ./OUTPUT_MODEL --max_epochs "{Maximum_epochs}" --drop_speaker_embed True

To continue training from a previously trained model:

python finetune_speaker_v2.py -m ./OUTPUT_MODEL --max_epochs "{Maximum_epochs}" --drop_speaker_embed False --cont True
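
The two training commands differ only in `--drop_speaker_embed` and the extra `--cont True` flag. The sketch below makes that difference explicit; `train_command` is a hypothetical helper that only assembles the command string, using the flag values shown above.

```python
def train_command(max_epochs: int, resume: bool = False) -> str:
    """Build the fine-tuning command for a fresh run or a resumed one."""
    parts = [
        "python finetune_speaker_v2.py",
        "-m ./OUTPUT_MODEL",
        f'--max_epochs "{max_epochs}"',
        # Flag values mirror the two commands above
        f"--drop_speaker_embed {'False' if resume else 'True'}",
    ]
    if resume:
        parts.append("--cont True")
    return " ".join(parts)
```

Calling `train_command(100)` gives the fresh-run command, while `train_command(100, resume=True)` gives the continuation form.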

12. Clear the voice data

The first command below is for Linux and the second is for Windows.

rm -rf ./custom_character_voice/* ./video_data/* ./raw_audio/* ./denoised_audio/* ./segmented_character_voice/* ./separated/* long_character_anno.txt short_character_anno.txt

del /Q /S .\custom_character_voice\* .\video_data\* .\raw_audio\* .\denoised_audio\* .\segmented_character_voice\* .\separated\* long_character_anno.txt short_character_anno.txt
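
A platform-independent alternative is to do the same cleanup from Python. This is a minimal sketch, assuming it is run from the repository root; `clear_voice_data` is a hypothetical helper, and the directory and file names are taken from the commands above.

```python
import shutil
from pathlib import Path

DATA_DIRS = ["custom_character_voice", "video_data", "raw_audio",
             "denoised_audio", "segmented_character_voice", "separated"]
ANNOTATION_FILES = ["long_character_anno.txt", "short_character_anno.txt"]


def clear_voice_data(root: str = ".") -> None:
    """Empty the data directories (keeping the directories) and delete the annotation files."""
    base = Path(root)
    for name in DATA_DIRS:
        directory = base / name
        if not directory.is_dir():
            continue  # a directory such as ./separated may not exist yet
        # Unlike the shell glob, this also removes hidden entries
        for entry in directory.iterdir():
            if entry.is_dir() and not entry.is_symlink():
                shutil.rmtree(entry)
            else:
                entry.unlink()
    for name in ANNOTATION_FILES:
        (base / name).unlink(missing_ok=True)  # missing_ok requires Python 3.8+
```

The directories themselves are kept, so the next round of data can be dropped in without recreating them.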

Original article: https://blog.csdn.net/wanchaochaochao/article/details/134478528