Training a Speech Synthesis Model with NeMo

2022-06-03

The Concept of Speech Synthesis

  • Making computers understand human speech, and even converse with people by voice, has long been one of humanity's dreams.
  • Speech synthesis is an interdisciplinary subfield of acoustics, linguistics, digital signal processing, and computer science.
  • Speech synthesis technology converts text into spoken audio.
  • With the continued progress of AI technologies in recent years, speech synthesis plays an important role in applications such as navigation.

An Overview of the Speech Synthesis Pipeline

  • Text preprocessing turns the input text into linguistic feature vectors.
  • The feature vectors are fed into the acoustic model's encoder and encoded into the network's hidden representation.
  • The decoder converts the hidden features into an acoustic description of the speech (a spectrogram).
  • The spectrogram is passed to a vocoder, which reconstructs it into an audio waveform (see the sketch after this list).
  • For more detail, consult the relevant papers.
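
To make the data flow concrete, here is a purely illustrative sketch of those four stages. Every function in it is a toy stand-in with made-up shapes, not a real NeMo API; NeMo's actual models appear later in this post.

Code language: python
import numpy as np

def preprocess_text(text):
    # Text front end: map each character to a toy integer ID.
    return np.array([ord(c) % 128 for c in text])

def encoder(tokens):
    # Acoustic-model encoder: token IDs -> hidden features (random toy values).
    return np.random.randn(len(tokens), 512)

def decoder(hidden):
    # Decoder: hidden features -> mel spectrogram frames (80 mel bins).
    return np.random.randn(hidden.shape[0] * 4, 80)

def vocoder(spectrogram):
    # Vocoder: spectrogram -> waveform (here, 256 samples per frame).
    return np.random.randn(spectrogram.shape[0] * 256)

waveform = vocoder(decoder(encoder(preprocess_text("hello"))))
print(waveform.shape)  # one long 1-D array of audio samples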

Generating Natural Speech with NeMo

  • NVIDIA's NeMo toolkit makes the steps of speech synthesis straightforward to carry out.
  • Under the hood, NeMo builds on CUDA and PyTorch and integrates toolkit collections for ASR, TTS, and NLP.
  • Pretrained models can be downloaded from NVIDIA NGC and loaded in NeMo for transfer learning, which greatly shortens training time (see the example below).
  • A simple speech model can be trained in just a few lines of code.
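
As a brief sketch (my own addition, not from the original post): NeMo models expose a from_pretrained() method that downloads a checkpoint from NVIDIA NGC and restores it, ready for fine-tuning. "tts_en_tacotron2" is assumed here to be the English Tacotron 2 model name published on NGC around the NeMo 1.4.0 release.

Code language: python
from nemo.collections.tts.models import Tacotron2Model

# Downloads (and caches) the checkpoint from NGC, then restores the model.
pretrained = Tacotron2Model.from_pretrained("tts_en_tacotron2")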

Environment Setup

A computer running Ubuntu.

Run the following commands in a terminal.

Switch to the Tsinghua mirror and download Miniconda

Code language: shell
export DL_SITE=https://mirrors.tuna.tsinghua.edu.cn/anaconda/miniconda
wget -c $DL_SITE/Miniconda3-py38_4.10.3-Linux-x86_64.sh
bash Miniconda3-py38_4.10.3-Linux-x86_64.sh
source ~/.bashrc

Install NeMo

Code language: shell
apt-get update && apt-get install -y libsndfile1 ffmpeg
pip install Cython
pip install --user pytest-runner
pip install librosa numpy==1.19.4
pip install torchmetrics==0.6.0
pip install nemo_toolkit[all]==1.4.0
pip install ASR-metrics
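
After installation, a quick sanity check (a suggestion of mine, not in the original post) confirms that NeMo imports cleanly and that PyTorch can see the GPU:

Code language: python
import nemo
import torch

print(nemo.__version__)           # expect 1.4.0
print(torch.cuda.is_available())  # should be True for GPU training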

Training the Speech Model

The tacotron2.py training script

Code language: python
import pytorch_lightning as pl
from nemo.collections.common.callbacks import LogEpochTimeCallback
from nemo.collections.tts.models import Tacotron2Model
from nemo.core.config import hydra_runner
from nemo.utils.exp_manager import exp_manager
# hydra_runner is a thin NeMo wrapper around Hydra
# It looks for a config named tacotron2.yaml inside the conf folder
# Hydra parses the yaml and returns it as a Omegaconf DictConfig
@hydra_runner(config_path="conf", config_name="tacotron2")
def main(cfg):
    # Define the Lightning trainer
    trainer = pl.Trainer(**cfg.trainer)
    # exp_manager is a NeMo construct that helps with logging and checkpointing
    exp_manager(trainer, cfg.get("exp_manager", None))
    # Define the Tacotron 2 model, this will construct the model as well as
    # define the training and validation dataloaders
    model = Tacotron2Model(cfg=cfg.model, trainer=trainer)
    # Let's add a few more callbacks
    lr_logger = pl.callbacks.LearningRateMonitor()
    epoch_time_logger = LogEpochTimeCallback()
    trainer.callbacks.extend([lr_logger, epoch_time_logger])
    # Call lightning trainer's fit() to train the model
    trainer.fit(model)
if __name__ == '__main__':
    main()  # noqa pylint: disable=no-value-for-parameter

Place the following configuration file in ./conf as tacotron2.yaml (the name that hydra_runner looks for):

Code language: yaml
name: Tacotron2
sample_rate: 22050
# <PAD>, <BOS>, <EOS> will be added by the tacotron2.py script
labels: [' ', '!', '"', "'", '(', ')', ',', '-', '.', ':', ';', '?', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H',
        'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', '[', ']',
        'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't',
        'u', 'v', 'w', 'x', 'y', 'z']
n_fft: 1024
n_mels: 80
fmax: 8000
n_stride: 256
pad_value: -11.52
train_dataset: ???
validation_datasets: ???

model:
  labels: ${labels}
  train_ds:
    dataset:
      _target_: "nemo.collections.asr.data.audio_to_text.AudioToCharDataset"
      manifest_filepath: ${train_dataset}
      max_duration: null
      min_duration: 0.1
      trim: false
      int_values: false
      normalize: true
      sample_rate: ${sample_rate}
      # bos_id: 66
      # eos_id: 67
      # pad_id: 68  These parameters are added automatically in Tacotron2
    dataloader_params:
      drop_last: false
      shuffle: true
      batch_size: 48
      num_workers: 4


  validation_ds:
    dataset:
      _target_: "nemo.collections.asr.data.audio_to_text.AudioToCharDataset"
      manifest_filepath: ${validation_datasets}
      max_duration: null
      min_duration: 0.1
      int_values: false
      normalize: true
      sample_rate: ${sample_rate}
      trim: false
      # bos_id: 66
      # eos_id: 67
      # pad_id: 68  These parameters are added automatically in Tacotron2
    dataloader_params:
      drop_last: false
      shuffle: false
      batch_size: 48
      num_workers: 8

  preprocessor:
    _target_: nemo.collections.asr.parts.preprocessing.features.FilterbankFeatures
    dither: 0.0
    nfilt: ${n_mels}
    frame_splicing: 1
    highfreq: ${fmax}
    log: true
    log_zero_guard_type: clamp
    log_zero_guard_value: 1e-05
    lowfreq: 0
    mag_power: 1.0
    n_fft: ${n_fft}
    n_window_size: 1024
    n_window_stride: ${n_stride}
    normalize: null
    pad_to: 16
    pad_value: ${pad_value}
    preemph: null
    sample_rate: ${sample_rate}
    window: hann

  encoder:
    _target_: nemo.collections.tts.modules.tacotron2.Encoder
    encoder_kernel_size: 5
    encoder_n_convolutions: 3
    encoder_embedding_dim: 512

  decoder:
    _target_: nemo.collections.tts.modules.tacotron2.Decoder
    decoder_rnn_dim: 1024
    encoder_embedding_dim: ${model.encoder.encoder_embedding_dim}
    gate_threshold: 0.3
    max_decoder_steps: 5000
    n_frames_per_step: 1  # currently only 1 is supported
    n_mel_channels: ${n_mels}
    p_attention_dropout: 0.1
    p_decoder_dropout: 0.1
    prenet_dim: 256
    prenet_p_dropout: 0.5
    # Attention parameters
    attention_dim: 128
    attention_rnn_dim: 1024
    # AttentionLocation Layer parameters
    attention_location_kernel_size: 31
    attention_location_n_filters: 32
    early_stopping: true

  postnet:
    _target_: nemo.collections.tts.modules.tacotron2.Postnet
    n_mel_channels: ${n_mels}
    p_dropout: 0.5
    postnet_embedding_dim: 512
    postnet_kernel_size: 5
    postnet_n_convolutions: 5

  optim:
    name: adam
    lr: 1e-3
    weight_decay: 1e-6

    # scheduler setup
    sched:
      name: CosineAnnealing
      min_lr: 1e-5


trainer:
  gpus: 1 # number of gpus
  max_epochs: ???
  num_nodes: 1
  accelerator: ddp
  accumulate_grad_batches: 1
  checkpoint_callback: False  # Provided by exp_manager
  logger: False  # Provided by exp_manager
  gradient_clip_val: 1.0
  flush_logs_every_n_steps: 1000
  log_every_n_steps: 200
  check_val_every_n_epoch: 25

exp_manager:
  exp_dir: null
  name: ${name}
  create_tensorboard_logger: True
  create_checkpoint_callback: True

Collect your speech data and generate a data manifest JSON file, with one JSON object per line in the following format:

Code language: json
{"audio_filepath":"语音文件位置", "duration":语音时长, "text":"语音表示的文本内容"}

Training can then be launched from Python (in a Jupyter notebook):

Code language: python
import nemo
import nemo.collections.tts as nemo_tts
from nemo.collections.tts.models import Tacotron2Model
from matplotlib.pyplot import imshow
from matplotlib import pyplot as plt
# Import the required modules (nemo_tts and matplotlib are used again below)

# Launch training via the Jupyter "!" shell escape; backslashes continue the line
! HYDRA_FULL_ERROR=1 \
python tacotron2.py train_dataset=<path to the training manifest JSON> \
validation_datasets=<path to the validation manifest JSON> \
trainer.max_epochs=4000 \
trainer.accelerator=null \
trainer.check_val_every_n_epoch=1

The trained model is saved as ./nemo_experiments/Tacotron2/<training timestamp>/checkpoints/Tacotron2.nemo.

Inspecting the Training Results

  • Download the MelGAN vocoder model tts_melgan.nemo from NVIDIA NGC.
  • Run the following code to view the spectrogram and listen to the result.
Code language: python
model = Tacotron2Model.restore_from("path to the trained .nemo model")
vocoder = nemo_tts.models.MelGanModel.restore_from("tts_melgan.nemo")
# Generate the spectrogram for the text
text = "the text you want to synthesize"
tokens = model.parse(text)
spectrogram = model.generate_spectrogram(tokens=tokens)
%matplotlib inline
imshow(spectrogram.cpu().detach().numpy()[0,...], origin="lower")
plt.show()

# Run the vocoder and play the synthesized speech
audio = vocoder.convert_spectrogram_to_audio(spec=spectrogram)
import IPython
IPython.display.Audio(audio.to('cpu').detach().numpy(), rate=22050)
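
To keep the result, the waveform can also be written to disk (my own addition; the soundfile package works on top of the libsndfile dependency installed earlier). The vocoder output here is assumed to be shaped (batch, samples), hence the [0]:

Code language: python
import soundfile as sf

# Write the first (and only) waveform in the batch at the 22050 Hz sample rate.
sf.write("synthesized.wav", audio.to('cpu').detach().numpy()[0], 22050)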
