The examples in PaddleSpeech are mainly classified by dataset. The TTS datasets we mainly use are:
The models in PaddleSpeech TTS have the following mapping relationship:
Let's take FastSpeech2 + Parallel WaveGAN with the CSMSC dataset (`examples/csmsc`) for instance.

First, train a vocoder. Go to the directory:
```bash
cd examples/csmsc/voc1
```
Source the environment:
```bash
source path.sh
```
You must do this before you start to do anything. It sets `MAIN_ROOT` as the project dir and uses the `parallel_wavegan` model as `MODEL`.

Main entry point:
```bash
bash run.sh
```
This is just a demo; please make sure the source data has been prepared well and that every step works before moving on to the next.
Then train an acoustic model. Go to the directory:
```bash
cd examples/csmsc/tts3
```
Source the environment:
```bash
source path.sh
```
This sets `MAIN_ROOT` as the project dir and uses the `fastspeech2` model as `MODEL`.

Main entry point:
```bash
bash run.sh
```
Again, make sure every step works well before moving on to the next. The steps in `run.sh` mainly include data preprocessing, model training, and waveform synthesis.
For more details, see the `README.md` in each example directory.
This section shows how to use the pretrained models provided by TTS and run inference with them.

Pretrained models in TTS are provided as archives. Extract one to get a folder like this:

Acoustic models:
```text
checkpoint_name
├── default.yaml
├── snapshot_iter_*.pdz
├── speech_stats.npy
├── phone_id_map.txt
├── spk_id_map.txt (optional)
└── tone_id_map.txt (optional)
```

Vocoders:
```text
checkpoint_name
├── default.yaml
├── snapshot_iter_*.pdz
└── stats.npy
```
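The concrete checkpoint file name (the `*` in `snapshot_iter_*.pdz`) varies per release, so it is convenient to locate it programmatically rather than hard-coding it. A minimal sketch, using a fake extracted folder built on the fly (the file names here are made up for illustration; a real folder would come from extracting a downloaded archive):

```python
import tempfile
from pathlib import Path

# Build a fake extracted checkpoint folder to demonstrate the lookup.
checkpoint_dir = Path(tempfile.mkdtemp()) / "checkpoint_name"
checkpoint_dir.mkdir()
for name in ("default.yaml", "snapshot_iter_76000.pdz", "stats.npy"):
    (checkpoint_dir / name).touch()

# Match the pattern instead of hard-coding the step count in the name.
ckpt = next(checkpoint_dir.glob("snapshot_iter_*.pdz"))
print(ckpt.name)  # snapshot_iter_76000.pdz
```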
- `default.yaml` stores the config used to train the model.
- `snapshot_iter_*.pdz` is the checkpoint file, where `*` is the number of steps it has been trained for.
- `*_stats.npy` is the stats file of a feature, if that feature was normalized before training.
- `phone_id_map.txt` is the map of phonemes to phoneme ids.
- `tone_id_map.txt` is the map of tones to tone ids, used when you split tones and phones before training acoustic models (for example in our csmsc/speedyspeech example).
- `spk_id_map.txt` is the map of speakers to speaker ids in multi-speaker acoustic models (for example in our aishell3/fastspeech2 example).

The example code below shows how to use the models for prediction.
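The stats file is loaded into a `ZScore` normalizer in the code below. What z-score normalization does can be sketched independently; this mimics the standard formula with NumPy, not the actual paddlespeech `ZScore` class:

```python
import numpy as np

# The stats file stores the per-dimension mean and standard deviation
# of the features; here we use tiny toy vectors for illustration.
mu = np.array([0.0, 1.0, -1.0, 2.0])
std = np.array([1.0, 2.0, 0.5, 4.0])

def zscore(x):
    """Normalize features to zero mean / unit variance per dimension."""
    return (x - mu) / std

def zscore_inverse(z):
    """Map normalized features back to the original feature scale."""
    return z * std + mu

x = np.array([1.0, 3.0, -2.0, 10.0])
z = zscore(x)
print(z)  # [ 1.  1. -2.  2.]
assert np.allclose(zscore_inverse(z), x)  # round-trips exactly
```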
The code below shows how to use a FastSpeech2 model. After loading the pretrained model, use it and the normalizer object to construct a prediction object, then use `fastspeech2_inference(phone_ids)` to generate spectrograms, which can be further used to synthesize raw audio with a vocoder.
```python
from pathlib import Path
import numpy as np
import paddle
import yaml
from yacs.config import CfgNode
from paddlespeech.t2s.models.fastspeech2 import FastSpeech2
from paddlespeech.t2s.models.fastspeech2 import FastSpeech2Inference
from paddlespeech.t2s.modules.normalizer import ZScore
# examples/fastspeech2/baker/frontend.py
from frontend import Frontend

# load the pretrained model
checkpoint_dir = Path("fastspeech2_nosil_baker_ckpt_0.4")
with open(checkpoint_dir / "phone_id_map.txt", "r") as f:
    phn_id = [line.strip().split() for line in f.readlines()]
vocab_size = len(phn_id)
with open(checkpoint_dir / "default.yaml") as f:
    fastspeech2_config = CfgNode(yaml.safe_load(f))
odim = fastspeech2_config.n_mels
model = FastSpeech2(
    idim=vocab_size, odim=odim, **fastspeech2_config["model"])
# locate the checkpoint file (snapshot_iter_*.pdz) in the extracted folder
checkpoint_path = next(checkpoint_dir.glob("snapshot_iter_*.pdz"))
model.set_state_dict(paddle.load(str(checkpoint_path))["main_params"])
model.eval()

# load stats file
stat = np.load(checkpoint_dir / "speech_stats.npy")
mu, std = stat
mu = paddle.to_tensor(mu)
std = paddle.to_tensor(std)
fastspeech2_normalizer = ZScore(mu, std)

# construct a prediction object
fastspeech2_inference = FastSpeech2Inference(fastspeech2_normalizer, model)

# load the Chinese frontend
frontend = Frontend(checkpoint_dir / "phone_id_map.txt")

# text to spectrogram
sentence = "你好吗?"
input_ids = frontend.get_input_ids(sentence, merge_sentences=True)
phone_ids = input_ids["phone_ids"]
flags = 0
# the output of the Chinese text frontend is segmented by sentence
for part_phone_ids in phone_ids:
    with paddle.no_grad():
        temp_mel = fastspeech2_inference(part_phone_ids)
    if flags == 0:
        mel = temp_mel
        flags = 1
    else:
        mel = paddle.concat([mel, temp_mel])
```
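The accumulation pattern in the loop above (keep the first per-sentence mel segment, then append each following one along the time axis) can be demonstrated standalone, with NumPy arrays standing in for paddle tensors:

```python
import numpy as np

# Fake per-sentence mel segments: (frames, n_mels) arrays of varying
# length, standing in for the tensors returned by the acoustic model.
segments = [np.zeros((5, 80)), np.zeros((3, 80)), np.zeros((7, 80))]

mel = None
for temp_mel in segments:
    # Same logic as the flags-based loop: start with the first segment,
    # then concatenate each following one along the time (frame) axis.
    mel = temp_mel if mel is None else np.concatenate([mel, temp_mel], axis=0)

print(mel.shape)  # (15, 80)
```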
The code below shows how to use a Parallel WaveGAN model. Like the example above, after loading the pretrained model, use it and the normalizer object to construct a prediction object, then use `pwg_inference(mel)` to generate raw audio (in wav format).
```python
from pathlib import Path
import numpy as np
import paddle
import soundfile as sf
import yaml
from yacs.config import CfgNode
from paddlespeech.t2s.models.parallel_wavegan import PWGGenerator
from paddlespeech.t2s.models.parallel_wavegan import PWGInference
from paddlespeech.t2s.modules.normalizer import ZScore

# load the pretrained model
checkpoint_dir = Path("parallel_wavegan_baker_ckpt_0.4")
with open(checkpoint_dir / "pwg_default.yaml") as f:
    pwg_config = CfgNode(yaml.safe_load(f))
vocoder = PWGGenerator(**pwg_config["generator_params"])
# locate the vocoder checkpoint (snapshot_iter_*.pdz per the layout above);
# its "generator_params" entry holds the generator weights
checkpoint_path = next(checkpoint_dir.glob("*.pdz"))
vocoder.set_state_dict(paddle.load(str(checkpoint_path))["generator_params"])
vocoder.remove_weight_norm()
vocoder.eval()

# load stats file
stat = np.load(checkpoint_dir / "pwg_stats.npy")
mu, std = stat
mu = paddle.to_tensor(mu)
std = paddle.to_tensor(std)
pwg_normalizer = ZScore(mu, std)

# construct a prediction object
pwg_inference = PWGInference(pwg_normalizer, vocoder)

# spectrogram to wave (`mel` comes from the acoustic model above)
audio_path = "output.wav"
wav = pwg_inference(mel)
sf.write(
    audio_path,
    wav.numpy(),
    samplerate=fastspeech2_config.fs)
```
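The final `sf.write` call depends on the soundfile package. If it is unavailable, mono output can also be written with Python's stdlib `wave` module. A minimal sketch, with two assumptions flagged: the 16-bit PCM conversion is a choice (soundfile writes floats by default), and the 24 kHz rate is illustrative; the real rate should come from `fastspeech2_config.fs`:

```python
import wave
import numpy as np

def write_wav_int16(path, samples, fs):
    """Write mono float samples in [-1, 1] as a 16-bit PCM WAV file."""
    pcm = (np.clip(samples, -1.0, 1.0) * 32767).astype("<i2")
    with wave.open(path, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        w.setframerate(fs)
        w.writeframes(pcm.tobytes())

# One second of silence as a stand-in for the vocoder output;
# in the real pipeline you would pass wav.numpy() and fastspeech2_config.fs.
write_wav_int16("output.wav", np.zeros(24000, dtype=np.float32), fs=24000)
```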