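The concept of speech synthesis
- Getting computers to understand what people say, and even to hold a spoken conversation with them, has long been one of humanity's dreams.
- Speech synthesis is an interdisciplinary subfield spanning acoustics, linguistics, digital signal processing, and computer science.
- Speech synthesis (text-to-speech, TTS) converts written text into speech and reads it aloud.
- With the steady progress of AI technologies in recent years, speech synthesis now plays an important role in applications such as navigation.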
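Overview of speech synthesis
- Text preprocessing turns the input text into feature vectors of linguistic features
- The feature vectors are fed to the encoder of the acoustic model, which encodes them into the hidden layers of the neural network
- The hidden-layer features are fed to the decoder, which produces a description of the speech features (a spectrogram)
- The spectrogram is fed to the vocoder, which converts it back into an audio file
- See the relevant papers for a more detailed treatment.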
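The first step above, text preprocessing, can be sketched as a character-to-ID mapping. This is a simplified illustration only: the small `LABELS` vocabulary and the `text_to_ids` helper are hypothetical, and placing the special tokens after the regular labels (BOS, EOS, PAD) is an assumption modeled on the commented-out IDs in the training config, not NeMo's actual parser.

```python
# Simplified character-level front end: map each character to an integer ID.
# LABELS is a small illustrative vocabulary, not the one NeMo ships.
LABELS = [" ", "!", ",", ".", "?"] + [chr(c) for c in range(ord("a"), ord("z") + 1)]
CHAR_TO_ID = {ch: i for i, ch in enumerate(LABELS)}

# Special tokens placed after the regular labels (assumed layout).
BOS_ID = len(LABELS)
EOS_ID = len(LABELS) + 1
PAD_ID = len(LABELS) + 2


def text_to_ids(text):
    """Lowercase the text, drop characters outside the vocabulary,
    and wrap the result in begin/end-of-sequence markers."""
    ids = [CHAR_TO_ID[ch] for ch in text.lower() if ch in CHAR_TO_ID]
    return [BOS_ID] + ids + [EOS_ID]


def pad_batch(sequences):
    """Right-pad every sequence with PAD_ID to the length of the longest one."""
    longest = max(len(seq) for seq in sequences)
    return [seq + [PAD_ID] * (longest - len(seq)) for seq in sequences]


batch = pad_batch([text_to_ids("Hi!"), text_to_ids("Hello, world.")])
for row in batch:
    print(row)
```

The padded integer rows are what an embedding layer in the encoder would consume; everything after that (encoder, decoder, vocoder) is learned rather than rule-based.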
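Generating natural speech with NeMo
- NVIDIA's NeMo toolkit makes it easy to carry out the steps involved in speech synthesis
- NeMo is built on CUDA and PyTorch and bundles toolkits for ASR, TTS, and NLP
- Pretrained models can be downloaded from NVIDIA NGC and loaded into NeMo for transfer learning, which greatly speeds up training
- A simple speech model can be trained with just a few lines of code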
Environment setup
A computer running Ubuntu
Run the following on the command line
Switch to the Tsinghua mirror and download Miniconda
export DL_SITE=https://mirrors.tuna.tsinghua.edu.cn/anaconda/miniconda
wget -c $DL_SITE/Miniconda3-py38_4.10.3-Linux-x86_64.sh
bash Miniconda3-py38_4.10.3-Linux-x86_64.sh
source ~/.bashrc
Install NeMo
apt-get update && apt-get install -y libsndfile1 ffmpeg
pip install Cython
pip install --user pytest-runner
pip install rosa numpy==1.19.4
pip install torchmetrics==0.6.0
pip install nemo_toolkit[all]==1.4.0
pip install ASR-metrics
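After the installation finishes, a quick check that the key packages are importable can save a failed training run later. The sketch below uses only the standard library; the `missing_packages` helper is hypothetical, and the module names in `required` are assumptions based on the imports used by the training script in this article.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level module names that cannot be found
    in the current Python environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Top-level modules imported by the training script (assumed list).
required = ["nemo", "pytorch_lightning", "torch"]
missing = missing_packages(required)
print("missing:", ", ".join(missing) if missing else "none")
```

If anything is reported missing, re-run the corresponding `pip install` line inside the same conda environment.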
Training the speech model
The tacotron2.py training script
import pytorch_lightning as pl
from nemo.collections.common.callbacks import LogEpochTimeCallback
from nemo.collections.tts.models import Tacotron2Model
from nemo.core.config import hydra_runner
from nemo.utils.exp_manager import exp_manager
# hydra_runner is a thin NeMo wrapper around Hydra
# It looks for a config named tacotron2.yaml inside the conf folder
# Hydra parses the yaml and returns it as an OmegaConf DictConfig
@hydra_runner(config_path="conf", config_name="tacotron2")
def main(cfg):
    # Define the Lightning trainer
    trainer = pl.Trainer(**cfg.trainer)
    # exp_manager is a NeMo construct that helps with logging and checkpointing
    exp_manager(trainer, cfg.get("exp_manager", None))
    # Define the Tacotron 2 model, this will construct the model as well as
    # define the training and validation dataloaders
    model = Tacotron2Model(cfg=cfg.model, trainer=trainer)
    # Let's add a few more callbacks
    lr_logger = pl.callbacks.LearningRateMonitor()
    epoch_time_logger = LogEpochTimeCallback()
    trainer.callbacks.extend([lr_logger, epoch_time_logger])
    # Call lightning trainer's fit() to train the model
    trainer.fit(model)


if __name__ == '__main__':
    main()  # noqa pylint: disable=no-value-for-parameter
Place the configuration file (tacotron2.yaml) in ./conf
name: Tacotron2
sample_rate: 22050
# <PAD>, <BOS>, <EOS> will be added by the tacotron2.py script
labels: [' ', '!', '"', "'", '(', ')', ',', '-', '.', ':', ';', '?', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H',
         'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', '[', ']',
         'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't',
         'u', 'v', 'w', 'x', 'y', 'z']
n_fft: 1024
n_mels: 80
fmax: 8000
n_stride: 256
pad_value: -11.52
train_dataset: ???
validation_datasets: ???

model:
  labels: ${labels}
  train_ds:
    dataset:
      _target_: "nemo.collections.asr.data.audio_to_text.AudioToCharDataset"
      manifest_filepath: ${train_dataset}
      max_duration: null
      min_duration: 0.1
      trim: false
      int_values: false
      normalize: true
      sample_rate: ${sample_rate}
      # bos_id: 66
      # eos_id: 67
      # pad_id: 68 These parameters are added automatically in Tacotron2
    dataloader_params:
      drop_last: false
      shuffle: true
      batch_size: 48
      num_workers: 4

  validation_ds:
    dataset:
      _target_: "nemo.collections.asr.data.audio_to_text.AudioToCharDataset"
      manifest_filepath: ${validation_datasets}
      max_duration: null
      min_duration: 0.1
      int_values: false
      normalize: true
      sample_rate: ${sample_rate}
      trim: false
      # bos_id: 66
      # eos_id: 67
      # pad_id: 68 These parameters are added automatically in Tacotron2
    dataloader_params:
      drop_last: false
      shuffle: false
      batch_size: 48
      num_workers: 8

  preprocessor:
    _target_: nemo.collections.asr.parts.preprocessing.features.FilterbankFeatures
    dither: 0.0
    nfilt: ${n_mels}
    frame_splicing: 1
    highfreq: ${fmax}
    log: true
    log_zero_guard_type: clamp
    log_zero_guard_value: 1e-05
    lowfreq: 0
    mag_power: 1.0
    n_fft: ${n_fft}
    n_window_size: 1024
    n_window_stride: ${n_stride}
    normalize: null
    pad_to: 16
    pad_value: ${pad_value}
    preemph: null
    sample_rate: ${sample_rate}
    window: hann

  encoder:
    _target_: nemo.collections.tts.modules.tacotron2.Encoder
    encoder_kernel_size: 5
    encoder_n_convolutions: 3
    encoder_embedding_dim: 512

  decoder:
    _target_: nemo.collections.tts.modules.tacotron2.Decoder
    decoder_rnn_dim: 1024
    encoder_embedding_dim: ${model.encoder.encoder_embedding_dim}
    gate_threshold: 0.3
    max_decoder_steps: 5000
    n_frames_per_step: 1 # currently only 1 is supported
    n_mel_channels: ${n_mels}
    p_attention_dropout: 0.1
    p_decoder_dropout: 0.1
    prenet_dim: 256
    prenet_p_dropout: 0.5
    # Attention parameters
    attention_dim: 128
    attention_rnn_dim: 1024
    # AttentionLocation Layer parameters
    attention_location_kernel_size: 31
    attention_location_n_filters: 32
    early_stopping: true

  postnet:
    _target_: nemo.collections.tts.modules.tacotron2.Postnet
    n_mel_channels: ${n_mels}
    p_dropout: 0.5
    postnet_embedding_dim: 512
    postnet_kernel_size: 5
    postnet_n_convolutions: 5

  optim:
    name: adam
    lr: 1e-3
    weight_decay: 1e-6
    # scheduler setup
    sched:
      name: CosineAnnealing
      min_lr: 1e-5

trainer:
  gpus: 1 # number of gpus
  max_epochs: ???
  num_nodes: 1
  accelerator: ddp
  accumulate_grad_batches: 1
  checkpoint_callback: False # Provided by exp_manager
  logger: False # Provided by exp_manager
  gradient_clip_val: 1.0
  flush_logs_every_n_steps: 1000
  log_every_n_steps: 200
  check_val_every_n_epoch: 25

exp_manager:
  exp_dir: null
  name: ${name}
  create_tensorboard_logger: True
  create_checkpoint_callback: True
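To get a feel for what the audio parameters in this config mean in time units, the arithmetic below converts them from samples to milliseconds and seconds. The values are copied from the config; the interpretation that each decoder step emits one mel frame follows from `n_frames_per_step: 1`, and treating `max_decoder_steps` as an upper bound on generated frames is an assumption about how the decoder uses it.

```python
# Values copied from the configuration file.
sample_rate = 22050        # samples per second
n_window_size = 1024       # STFT window length in samples
n_stride = 256             # hop between consecutive frames in samples
max_decoder_steps = 5000   # decoder steps, one mel frame each

# Duration covered by one analysis window and by one hop.
window_ms = 1000 * n_window_size / sample_rate
frame_ms = 1000 * n_stride / sample_rate

# Number of mel frames per second of audio.
frames_per_second = sample_rate / n_stride

# Longest utterance the decoder can emit before hitting the step limit.
max_seconds = max_decoder_steps * n_stride / sample_rate

print(f"window length : {window_ms:.2f} ms")
print(f"frame hop     : {frame_ms:.2f} ms")
print(f"frames/second : {frames_per_second:.1f}")
print(f"max utterance : {max_seconds:.1f} s")
```

Each spectrogram column therefore advances by roughly 11.6 ms, and 5000 decoder steps correspond to about 58 seconds of audio.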
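Collect the speech data and generate a JSON manifest file that lists it. Each entry in the manifest has the following format
{"audio_filepath":"path to the audio file", "duration":audio duration in seconds, "text":"transcript of the audio"}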
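A manifest in this format can be generated with a short script. The sketch below assumes the recordings are uncompressed WAV files and that the transcripts are already available as (path, text) pairs; the helper names `wav_duration`, `manifest_line`, and `write_manifest` are hypothetical. It writes one JSON object per line, which is the layout NeMo manifests use.

```python
import json
import wave


def wav_duration(path):
    """Return the duration of a WAV file in seconds."""
    with wave.open(path, "rb") as wf:
        return wf.getnframes() / wf.getframerate()


def manifest_line(audio_path, text):
    """Build one manifest entry as a JSON string."""
    entry = {
        "audio_filepath": audio_path,
        "duration": round(wav_duration(audio_path), 3),
        "text": text,
    }
    # ensure_ascii=False keeps non-ASCII transcripts readable in the file.
    return json.dumps(entry, ensure_ascii=False)


def write_manifest(pairs, manifest_path):
    """Write (audio_path, text) pairs to a manifest, one JSON object per line."""
    with open(manifest_path, "w", encoding="utf-8") as f:
        for audio_path, text in pairs:
            f.write(manifest_line(audio_path, text) + "\n")
```

A call such as `write_manifest([("clip_0001.wav", "hello world")], "train.json")` produces a file that can be passed as `train_dataset`; the file names here are placeholders.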
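The model can then be trained from Python (the cells below are meant for a Jupyter notebook)
# Import the required modules
import nemo
import nemo.collections.tts as nemo_tts
from nemo.collections.tts.models import Tacotron2Model
from matplotlib.pyplot import imshow
from matplotlib import pyplot as plt

# Launch training as a single shell command; the trailing backslashes continue the line.
# Replace the upper-case placeholders with the paths to your JSON manifests.
!HYDRA_FULL_ERROR=1 python tacotron2.py \
    train_dataset=PATH_TO_TRAIN_MANIFEST_JSON \
    validation_datasets=PATH_TO_TEST_MANIFEST_JSON \
    trainer.max_epochs=4000 \
    trainer.accelerator=null \
    trainer.check_val_every_n_epoch=1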
The trained model is saved to ./nemo_experiments/Tacotron2/<training timestamp>/checkpoints/Tacotron2.nemo.
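Because every training run gets its own timestamped directory, it can be convenient to locate the most recent .nemo file programmatically. This is a minimal sketch using only the standard library; the `latest_nemo_checkpoint` helper is hypothetical and assumes the directory layout described above.

```python
from pathlib import Path


def latest_nemo_checkpoint(root="nemo_experiments/Tacotron2"):
    """Return the most recently modified .nemo file under
    <root>/<run directory>/checkpoints/, or None if there is none."""
    root = Path(root)
    if not root.is_dir():
        return None
    candidates = sorted(
        root.glob("*/checkpoints/*.nemo"),
        key=lambda p: p.stat().st_mtime,
    )
    return candidates[-1] if candidates else None


print(latest_nemo_checkpoint())
```

The returned path can be converted with `str(...)` and passed to `Tacotron2Model.restore_from`.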
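Checking the training results
- Download the MelGAN vocoder model tts_melgan.nemo from NVIDIA NGC
- Run the following code to inspect and listen to the synthesized speech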
model = Tacotron2Model.restore_from("path to the trained model")
vocoder = nemo_tts.models.MelGanModel.restore_from("tts_melgan.nemo")
# Generate the spectrogram for the text
text = "The text to be spoken"
tokens = model.parse(text)
spectrogram = model.generate_spectrogram(tokens = tokens)
%matplotlib inline
imshow(spectrogram.cpu().detach().numpy()[0,...], origin="lower")
plt.show()
# Convert the spectrogram to audio with the vocoder and play it
audio = vocoder.convert_spectrogram_to_audio(spec=spectrogram)
import IPython
IPython.display.Audio(audio.to('cpu').detach().numpy(), rate=22050)
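Outside a notebook there is no inline audio player, so it can help to write the result to a WAV file instead. The sketch below is one way to do that with only the standard library; the `write_wav` helper is hypothetical, and it assumes the samples are mono floating-point values in the range [-1, 1] at the 22050 Hz sample rate used throughout this article.

```python
import struct
import wave


def write_wav(path, samples, rate=22050):
    """Write mono float samples in [-1, 1] to a 16-bit PCM WAV file."""
    frames = bytearray()
    for sample in samples:
        # Clip to the valid range, then scale to a signed 16-bit integer.
        clipped = max(-1.0, min(1.0, float(sample)))
        frames += struct.pack("<h", int(clipped * 32767))
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)      # mono
        wf.setsampwidth(2)      # 2 bytes = 16 bits per sample
        wf.setframerate(rate)
        wf.writeframes(bytes(frames))
```

With the variables from the code above, a call such as `write_wav("output.wav", audio.to('cpu').detach().numpy()[0])` would save the first (and only) utterance in the batch; the output file name is a placeholder.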