Finance/Speech/Audio Processing Academic Digest [7.8]

q-fin (Quantitative Finance): 5 papers in total

cs.SD (Sound/Speech): 9 papers in total

eess.AS (Audio and Speech Processing): 9 papers in total

1. q-fin (Quantitative Finance):

【1】 Pseudo-Model-Free Hedging for Variable Annuities via Deep Reinforcement Learning

Authors: Wing Fung Chong, Haoen Cui, Yuxuan Li
Link: https://arxiv.org/abs/2107.03340
Abstract: This paper applies a deep reinforcement learning approach to revisit the hedging problem of variable annuities. Instead of assuming an actuarial and financial dual-market model a priori, the reinforcement learning agent learns how to hedge by collecting anchor-hedging reward signals through interactions with the market. Using the recently advanced proximal policy optimization, the pseudo-model-free reinforcement learning agent performs as well as the correct Delta, while outperforming the misspecified Deltas. The reinforcement learning agent is also integrated with online learning to demonstrate its full adaptive capability to the market.
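
To make the hedging setup concrete, here is a minimal sketch of an anchor-hedging environment with a reset/step interface, under assumed geometric Brownian motion dynamics; the put-like liability proxy, the reward shape, and every parameter are hypothetical illustrations rather than the paper's specification. An off-the-shelf PPO implementation could be trained against such an interface once it is wrapped in the standard Gym API.

```python
# A toy anchor-hedging environment (not the authors' implementation).
# Assumptions: GBM dynamics, a put-like liability proxy, reward = negative
# absolute step P&L deviation from a flat (fully hedged) anchor.
import numpy as np

class HedgingEnv:
    def __init__(self, s0=100.0, mu=0.02, sigma=0.2, T=1.0, steps=52, seed=0):
        self.s0, self.mu, self.sigma = s0, mu, sigma
        self.dt, self.steps = T / steps, steps
        self.rng = np.random.default_rng(seed)

    def reset(self):
        self.t, self.s = 0, self.s0
        return np.array([1.0, 1.0])           # (time left, normalized price)

    def step(self, action):                    # action: hedge position
        z = self.rng.standard_normal()
        s_next = self.s * np.exp((self.mu - 0.5 * self.sigma**2) * self.dt
                                 + self.sigma * np.sqrt(self.dt) * z)
        # Change of a put-like liability; a real VA liability adds
        # guarantees, mortality and fees.
        liab = -max(self.s0 - s_next, 0.0) + max(self.s0 - self.s, 0.0)
        reward = -abs(action * (s_next - self.s) + liab)
        self.s, self.t = s_next, self.t + 1
        done = self.t >= self.steps
        obs = np.array([1.0 - self.t / self.steps, self.s / self.s0])
        return obs, reward, done

env, total, done = HedgingEnv(), 0.0, False
obs = env.reset()
while not done:                                # random policy baseline
    obs, r, done = env.step(float(np.clip(env.rng.normal(), -1, 1)))
    total += r
print(f"episode reward under a random policy: {total:.2f}")
```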

【2】 Economic prospects of the Russian-Chinese partnership in the logistics projects of the Eurasian Economic Union and the Silk Road Economic Belt: a scientific literature review

Authors: Elena Rudakova, Alla Pavlova, Oleg Antonov, Kira Kuntsevich, Yue Yang
Affiliations: Law Institute of the Russian University of Transport, Moscow, Russia; RANEPA, Moscow, Russia; Hebi University, China
Notes: 13 pages. Key words: logistics, partnership, Eurasian Economic Union, Silk Road Economic Belt. JEL codes: F-01; F-02; F-15
Link: https://arxiv.org/abs/2107.03116
Abstract: The authors review the scientific literature on the development of Russian-Chinese cooperation in combining the economic and logistics projects of the Eurasian Economic Union and the Silk Road Economic Belt. The opinions of not only Russian but also Chinese experts on these projects are presented, which broadens the vision of the concept of the New Silk Road in both countries.

【3】 Estimating the economic value of ultrafine particles information: A contingent valuation method

Authors: Eunjung Cho, Youngsang Cho
Affiliations: Department of Industrial Engineering, College of Engineering, Yonsei University, Seoul, South Korea
Link: https://arxiv.org/abs/2107.03034
Abstract: Global concern regarding ultrafine particles (UFPs), particulate matter (PM) with a diameter of less than 100 nm, is increasing. These particles, whose health effects are more serious than those of PM smaller than 2.5 micrometers (PM2.5), are difficult to measure with current methods because their characteristics differ from those of other air pollutants. A new monitoring system is therefore required to obtain accurate UFP information, which will raise the financial burden on the government and the public. In this study, we estimated the economic value of UFP information by evaluating the willingness-to-pay (WTP) for a UFP monitoring and reporting system. We used the contingent valuation method (CVM) with a one-and-one-half-bounded dichotomous choice (OOHBDC) spike model, and analyzed how the respondents' socio-economic variables, as well as their level of awareness of PM, affected their WTP. We collected WTP data from 1,040 Korean respondents through an online survey. The estimated mean WTP for building a UFP monitoring and reporting system is KRW 6,958.55-7,222.55 (USD 6.22-6.45) per household per year. We found that people who are satisfied with current air-pollutant information, and who possess relatively greater knowledge of UFPs, have a higher WTP for a UFP monitoring and reporting system. The results can be used to establish new policies in response to PM, including UFPs.
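
The spike model named in the abstract has a compact closed form worth recording. Below is a sketch of the standard logistic spike specification (Kristrom, 1997); that the paper's OOHBDC variant keeps this backbone, changing only the bid design, is an assumption here.

```latex
% Logistic spike model of WTP: a point mass at zero plus a logistic tail.
F(t) = \left(1 + e^{\alpha - \beta t}\right)^{-1}, \quad t \ge 0, \qquad
\Pr(\mathrm{WTP} = 0) = F(0) = \left(1 + e^{\alpha}\right)^{-1}, \qquad
E[\mathrm{WTP}] = \frac{1}{\beta}\ln\!\left(1 + e^{\alpha}\right).
```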

【4】 Decreasing Incomes Increase Selfishness

Authors: Nickolas Gagnon, Riccardo D. Saulle, Henrik W. Zaunbrecher
Link: https://arxiv.org/abs/2107.02888
Abstract: We use a controlled laboratory experiment to study the causal impact of income decreases within a time period on redistribution decisions at the end of that period, in an environment where the sum of incomes over the period is held fixed. First, we investigate the effect of a negative income trend (intra-personal decrease), which means a decreasing income compared to one's recent past. Second, we investigate the effect of a negative income trend relative to the income trend of another person (inter-personal decrease). If intra-personal or inter-personal decreases create dissatisfaction for an individual, that person may become more selfish to obtain compensation. We formalize both effects in a multi-period model augmenting a standard model of inequality aversion. Overall, conditional on exhibiting sufficiently strong social preferences, we find that individuals indeed behave more selfishly when they experience decreasing incomes. While many studies examine the effect of income inequality on redistribution decisions, we delve into the history behind one's income to isolate the effect of income changes.
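
For reference, the canonical two-player inequality-aversion utility (Fehr and Schmidt, 1999) is the kind of "standard model" the abstract refers to; treating it as the paper's exact baseline is an assumption, and the authors' multi-period augmentation is not reproduced here.

```latex
% Fehr-Schmidt utility: own payoff minus disutility from disadvantageous
% (alpha_i) and advantageous (beta_i) inequality.
U_i(x_i, x_j) = x_i - \alpha_i \max(x_j - x_i, 0) - \beta_i \max(x_i - x_j, 0)
```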

【5】 Big Data Information and Nowcasting: Consumption and Investment from Bank Transactions in Turkey

Authors: Ali B. Barlas, Seda Guler Mert, Berk Orkun Isa, Alvaro Ortiz, Tomasa Rodrigo, Baris Soybilgen, Ege Yazgan
Affiliations: Baris Soybilgen (Bilgi University), Ege Yazgan (Bilgi University)
Notes: 31 pages, 7 figures, 9 tables
Link: https://arxiv.org/abs/2107.03299
Abstract: We use the aggregate information from individual-to-firm and firm-to-firm Garanti BBVA Bank transactions to mimic domestic private demand. In particular, we replicate the quarterly national accounts aggregates of consumption and investment (gross fixed capital formation) and its larger components (machinery and equipment, and construction) in real time for the case of Turkey. To validate the usefulness of the information derived from these indicators, we test the ability of both indicators to nowcast Turkish GDP using different nowcasting models. The results are successful and confirm the usefulness of consumption and investment banking transactions for nowcasting purposes. The value of the big data information is most relevant at the beginning of the nowcasting process, when traditional hard data are scarce. This makes the information especially relevant for countries where statistical release lags are longer, such as emerging markets.
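
As a toy illustration of the nowcasting idea, here is a minimal bridge-equation sketch in scikit-learn: regress GDP growth on transaction-based aggregates over history, then predict the current quarter from the already-observed aggregates. All series below are synthetic stand-ins; the Garanti BBVA aggregates are proprietary, and the paper's nowcasting models are richer.

```python
# A minimal bridge-equation nowcast sketch with synthetic data.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
q = 40                                              # quarters of history
bank_cons = rng.normal(size=q)                      # consumption aggregate
bank_inv = rng.normal(size=q)                       # investment aggregate
gdp = 0.6 * bank_cons + 0.3 * bank_inv + rng.normal(scale=0.2, size=q)

X = np.column_stack([bank_cons, bank_inv])
model = LinearRegression().fit(X[:-1], gdp[:-1])    # estimate on history
nowcast = model.predict(X[-1:])                     # current quarter: bank
print(f"GDP growth nowcast: {nowcast[0]:.2f}")      # data already observed
```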

2. cs.SD (Sound/Speech):

【1】 SoundStream: An End-to-End Neural Audio Codec

Authors: Neil Zeghidour, Alejandro Luebs, Ahmed Omran, Jan Skoglund, Marco Tagliasacchi
Link: https://arxiv.org/abs/2107.03312
Abstract: We present SoundStream, a novel neural audio codec that can efficiently compress speech, music and general audio at bitrates normally targeted by speech-tailored codecs. SoundStream relies on a model architecture composed of a fully convolutional encoder/decoder network and a residual vector quantizer, which are trained jointly end-to-end. Training leverages recent advances in text-to-speech and speech enhancement, which combine adversarial and reconstruction losses to allow the generation of high-quality audio content from quantized embeddings. By training with structured dropout applied to the quantizer layers, a single model can operate across variable bitrates from 3 kbps to 18 kbps, with negligible quality loss compared with models trained at fixed bitrates. In addition, the model is amenable to a low-latency implementation, which supports streamable inference and runs in real time on a smartphone CPU. In subjective evaluations using audio at a 24 kHz sampling rate, SoundStream at 3 kbps outperforms Opus at 12 kbps and approaches EVS at 9.6 kbps. Moreover, we are able to perform joint compression and enhancement either at the encoder or at the decoder side with no additional latency, which we demonstrate through background noise suppression for speech.
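
The residual vector quantizer is the piece of SoundStream that fits in a few lines: each stage quantizes the residual left by the previous stage against its own codebook, so reconstruction error shrinks as stages (and hence bitrate) are added. A minimal NumPy sketch with random codebooks in place of the learned ones:

```python
# Residual vector quantization sketch: codebooks here are random, whereas
# SoundStream learns them jointly with the encoder/decoder.
import numpy as np

def rvq(x, codebooks):
    """x: (dim,) embedding; codebooks: list of (K, dim) arrays."""
    residual, codes, quantized = x.copy(), [], np.zeros_like(x)
    for cb in codebooks:
        idx = int(np.argmin(((cb - residual) ** 2).sum(axis=1)))  # nearest code
        codes.append(idx)
        quantized += cb[idx]
        residual -= cb[idx]            # next stage quantizes what is left
    return codes, quantized

rng = np.random.default_rng(0)
dim, stages, K = 8, 4, 16
codebooks = [rng.normal(size=(K, dim)) for _ in range(stages)]
x = rng.normal(size=dim)
codes, xq = rvq(x, codebooks)
print(codes, float(np.linalg.norm(x - xq)))   # error shrinks with more stages
```

Dropping quantizer stages at training time (the structured dropout above) is what lets one model serve several bitrates: transmitting only the first k codes yields a coarser, lower-bitrate version of the same embedding.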

【2】 VAENAR-TTS: Variational Auto-Encoder based Non-AutoRegressive Text-to-Speech Synthesis

Authors: Hui Lu, Zhiyong Wu, Xixin Wu, Xu Li, Shiyin Kang, Xunying Liu, Helen Meng
Affiliations: Dept. of Systems Engineering & Engineering Management, Chinese University of Hong Kong; Centre for Perceptual and Interactive Intelligence, CUHK; Tsinghua-CUHK Joint Research Center for Media Sciences, Technologies and Systems
Link: https://arxiv.org/abs/2107.03298
Abstract: This paper describes a variational auto-encoder based non-autoregressive text-to-speech (VAENAR-TTS) model. Autoregressive TTS (AR-TTS) models based on the sequence-to-sequence architecture can generate high-quality speech, but their sequential decoding process can be time-consuming. Recently, non-autoregressive TTS (NAR-TTS) models have been shown to be more efficient thanks to their parallel decoding process. However, these NAR-TTS models rely on phoneme-level durations to generate a hard alignment between the text and the spectrogram. Obtaining duration labels, whether through forced alignment or knowledge distillation, is cumbersome. Furthermore, hard alignment based on phoneme expansion can degrade the naturalness of the synthesized speech. In contrast, the proposed VAENAR-TTS model is an end-to-end approach that does not require phoneme-level durations. It contains no recurrent structures and is completely non-autoregressive in both the training and inference phases. Based on the VAE architecture, the alignment information is encoded in the latent variable, and attention-based soft alignment between the text and the latent variable is used in the decoder to reconstruct the spectrogram. Experiments show that VAENAR-TTS achieves state-of-the-art synthesis quality, while its synthesis speed is comparable with other NAR-TTS models.
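
The attention-based soft alignment can be sketched as ordinary scaled dot-product attention in which latent frames act as queries over the text encodings; the dimensions below are illustrative, not the paper's.

```python
# Soft alignment sketch: every latent frame attends over all text tokens,
# so no phoneme durations or hard expansion are needed.
import numpy as np

def soft_align(latent, text_keys, text_values):
    """latent: (T_frames, d); text_keys/text_values: (T_text, d)."""
    scores = latent @ text_keys.T / np.sqrt(latent.shape[1])
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)           # softmax over text tokens
    return w @ text_values                       # per-frame text context

rng = np.random.default_rng(0)
ctx = soft_align(rng.normal(size=(50, 16)),      # 50 latent frames
                 rng.normal(size=(12, 16)),      # 12 text tokens (keys)
                 rng.normal(size=(12, 16)))      # 12 text tokens (values)
print(ctx.shape)   # (50, 16): one text-context vector per spectrogram frame
```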

【3】 MACCIF-TDNN: Multi aspect aggregation of channel and context interdependence features in TDNN-based speaker verification

Authors: Fangyuan Wang, Zhigang Song, Hongchen Jiang, Bo Xu
Affiliations: Institute of Automation, Chinese Academy of Sciences, Beijing, China; Beijing University of Technology, Beijing, China
Notes: 6 pages. arXiv admin note: text overlap with arXiv:2005.07143 by other authors
Link: https://arxiv.org/abs/2107.03104
Abstract: Most of the recent state-of-the-art results for speaker verification are achieved by X-vector and its subsequent variants. In this paper, we propose a new network architecture that aggregates channel and context interdependence features from multiple aspects, based on the Time Delay Neural Network (TDNN). Firstly, we use SE-Res2Blocks, as in ECAPA-TDNN, to explicitly model channel interdependence for adaptive calibration of channel features, and to process local context features in a multi-scale way at a more granular level than conventional TDNN-based methods. Secondly, we explore the use of the Transformer encoder structure to model global context interdependence features at the utterance level, which can capture better long-term temporal characteristics. Before the pooling layer, we aggregate the outputs of the SE-Res2Blocks and the Transformer encoder to leverage the complementary channel and context interdependence features each has learned. Finally, instead of performing single attentive statistics pooling, we find it beneficial to extend the pooling method in a multi-head way to discriminate features from multiple aspects. The proposed MACCIF-TDNN architecture outperforms most state-of-the-art TDNN-based systems on the VoxCeleb1 test sets.
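
The multi-head extension of attentive statistics pooling can be sketched in PyTorch as follows: each head computes its own attention weights over frames, and the per-head weighted means and standard deviations are concatenated. Head count and dimensions are illustrative assumptions, not the paper's configuration.

```python
# Multi-head attentive statistics pooling sketch.
import torch
import torch.nn as nn

class MultiHeadAttentiveStatsPool(nn.Module):
    def __init__(self, dim, heads=4):
        super().__init__()
        assert dim % heads == 0
        self.heads, self.hdim = heads, dim // heads
        self.att = nn.Conv1d(dim, heads, kernel_size=1)   # one score per head

    def forward(self, x):                       # x: (batch, dim, frames)
        b, _, t = x.shape
        w = torch.softmax(self.att(x), dim=2)             # (b, heads, t)
        xh = x.view(b, self.heads, self.hdim, t)
        mean = (xh * w.unsqueeze(2)).sum(dim=3)
        var = (xh ** 2 * w.unsqueeze(2)).sum(dim=3) - mean ** 2
        std = var.clamp(min=1e-8).sqrt()
        return torch.cat([mean, std], dim=2).flatten(1)   # (b, 2 * dim)

pool = MultiHeadAttentiveStatsPool(dim=64, heads=4)
print(pool(torch.randn(2, 64, 200)).shape)      # torch.Size([2, 128])
```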

【4】 Adversarial Auto-Encoding for Packet Loss Concealment

Authors: Santiago Pascual, Joan Serrà, Jordi Pons
Affiliations: Dolby Laboratories
Link: https://arxiv.org/abs/2107.03100
Abstract: Communication technologies like voice over IP operate under constrained real-time conditions, with voice packets subject to network delays and losses. In such cases, a packet loss concealment (PLC) algorithm reconstructs missing frames until a new real packet is received. Recently, autoregressive deep neural networks have been shown to surpass the quality of signal-processing methods for PLC, especially for long-term predictions beyond 60 ms. In this work, we propose a non-autoregressive adversarial auto-encoder, named PLAAE, to perform real-time PLC in the waveform domain. PLAAE has a causal convolutional structure, and it learns in an auto-encoder fashion to reconstruct signals with gaps, with the help of an adversarial loss. During inference, it is able to predict smooth and coherent continuations of such gaps in a single feed-forward step, as opposed to autoregressive models. Our evaluation highlights the superiority of PLAAE over two classic PLC methods and two deep autoregressive models in terms of spectral and intonation reconstruction, perceptual quality, and intelligibility.
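
The causal convolutional structure is what keeps PLAAE viable for real-time PLC: padding only on the left makes each output sample depend on past samples alone. A minimal PyTorch sketch with illustrative layer sizes:

```python
# Causal 1-D convolution sketch: all padding on the left, so the layer
# never looks into the future.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    def __init__(self, cin, cout, kernel, dilation=1):
        super().__init__()
        self.pad = (kernel - 1) * dilation
        self.conv = nn.Conv1d(cin, cout, kernel, dilation=dilation)

    def forward(self, x):                       # x: (batch, cin, time)
        return self.conv(F.pad(x, (self.pad, 0)))

layer = CausalConv1d(1, 16, kernel=3, dilation=2)
wave = torch.randn(1, 1, 160)                   # 10 ms of audio at 16 kHz
print(layer(wave).shape)                        # torch.Size([1, 16, 160])
```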

【5】 Efficient Transformer for Direct Speech Translation

Authors: Belen Alastruey, Gerard I. Gállego, Marta R. Costa-jussà
Affiliations: TALP Research Center, Universitat Politecnica de Catalunya, Barcelona
Notes: (c) 2021 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works
Link: https://arxiv.org/abs/2107.03069
Abstract: The advent of Transformer-based models has surpassed the barriers of text. When working with speech, we must face a problem: the sequence length of an audio input is not suitable for the Transformer. To bypass this problem, the usual approach is to add strided convolutional layers that reduce the sequence length before the Transformer. In this paper, we propose a new approach for direct Speech Translation in which, thanks to an efficient Transformer, we can work with a spectrogram without using convolutional layers before the Transformer. This allows the encoder to learn directly from the spectrogram, and no information is lost. We have created an encoder-decoder model in which the encoder is an efficient Transformer, the Longformer, and the decoder is a traditional Transformer decoder. Our results, which are close to those obtained with the standard approach, show that this is a promising research direction.
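
The efficiency of the Longformer encoder comes from restricting self-attention to a sliding window over the sequence (plus a few global tokens, omitted here). A minimal sketch of the banded attention mask over spectrogram frames; window size and length are illustrative:

```python
# Sliding-window attention mask sketch: cost grows linearly with
# sequence length instead of quadratically.
import numpy as np

def sliding_window_mask(seq_len, window):
    """True where attention is allowed: |i - j| <= window // 2."""
    idx = np.arange(seq_len)
    return np.abs(idx[:, None] - idx[None, :]) <= window // 2

mask = sliding_window_mask(seq_len=1000, window=64)   # 1000 spectrogram frames
print(f"fraction of full attention computed: {mask.mean():.3f}")
```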

【6】 Effective and Differentiated Use of Control Information for Multi-speaker Speech Synthesis

Authors: Qinghua Wu, Quanbo Shen, Jian Luan, YuJun Wang
Affiliations: Xiaomi Technology Co. Ltd., Beijing, China
Link: https://arxiv.org/abs/2107.03065
Abstract: In multi-speaker speech synthesis, data from a number of speakers usually has great diversity, since the speakers may differ largely in age, speaking style, speed, emotion, and so on. This diversity leads to the one-to-many mapping problem (Ren2020FastSpeech2F; Kumar2020FewSA). It is important but challenging to improve the modeling capability for multi-speaker speech synthesis. To address the issue, this paper researches the effective use of control information such as speaker and pitch, which is differentiated from text-content information in our encoder-decoder framework: 1) We design a representation of the harmonic structure of speech, called the excitation spectrogram, derived from pitch and energy. The excitation spectrogram is fed to the decoder along with the text content to guide the learning of the harmonics of the mel-spectrogram. 2) We propose a conditional gated LSTM (CGLSTM) whose input/output/forget gates are re-weighted by the speaker embedding to control the flow of text-content information in the network. The experiments show a significant reduction in the reconstruction errors of the mel-spectrogram when training the multi-speaker generative model, and a great improvement is observed in the subjective evaluation of the speaker-adapted model, e.g., the Mean Opinion Score (MOS) for intelligibility increases by 0.81 points.
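
One plausible reading of the CGLSTM, sketched in PyTorch below: an LSTM cell written out explicitly so that its input/forget/output gates can be multiplied by speaker-conditioned scales. This is an interpretation of the abstract, not the authors' code, and all dimensions are illustrative.

```python
# CGLSTM-style cell sketch: gates re-weighted by a speaker embedding.
import torch
import torch.nn as nn

class CGLSTMCell(nn.Module):
    def __init__(self, input_dim, hidden_dim, spk_dim):
        super().__init__()
        self.ih = nn.Linear(input_dim, 4 * hidden_dim)
        self.hh = nn.Linear(hidden_dim, 4 * hidden_dim)
        self.spk = nn.Linear(spk_dim, 3 * hidden_dim)   # scales for i, f, o

    def forward(self, x, h, c, spk):
        i, f, g, o = (self.ih(x) + self.hh(h)).chunk(4, dim=-1)
        si, sf, so = torch.sigmoid(self.spk(spk)).chunk(3, dim=-1)
        i, f, o = torch.sigmoid(i) * si, torch.sigmoid(f) * sf, torch.sigmoid(o) * so
        c = f * c + i * torch.tanh(g)                   # speaker-gated flow
        h = o * torch.tanh(c)
        return h, c

cell = CGLSTMCell(input_dim=80, hidden_dim=128, spk_dim=32)
h = c = torch.zeros(2, 128)
h, c = cell(torch.randn(2, 80), h, c, torch.randn(2, 32))
print(h.shape)   # torch.Size([2, 128])
```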

【7】 Improving Speech Recognition Accuracy of Local POI Using Geographical Models

Authors: Songjun Cao, Yike Zhang, Xiaobing Feng, Long Ma
Affiliations: Tencent Cloud Xiaowei, Beijing, China
Notes: Accepted by SLT 2021
Link: https://arxiv.org/abs/2107.03165
Abstract: Nowadays, voice search for points of interest (POI) is becoming increasingly popular. However, speech recognition for local POI remains a challenge due to multiple dialects and the massive number of POI. This paper improves speech recognition accuracy for local POI from two aspects. Firstly, a geographic acoustic model (Geo-AM) is proposed. The Geo-AM deals with the multi-dialect problem using dialect-specific input features and a dialect-specific top layer. Secondly, a group of geo-specific language models (Geo-LMs) is integrated into our speech recognition system to improve recognition accuracy for long-tail and homophone POI. During decoding, a specific language model is selected on demand according to the user's geographic location. Experiments show that the proposed Geo-AM achieves a 6.5% to 10.1% relative character error rate (CER) reduction on an accent test set, and the Geo-AM and Geo-LM together achieve over 18.7% relative CER reduction on the Tencent Map task.
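
The on-demand Geo-LM selection can be sketched as a lookup from the user's coordinates into region-specific language models, followed by rescoring of the recognizer's hypotheses. Everything below, the region table, the scores, and the fusion weight, is a hypothetical illustration:

```python
# Geo-LM selection and hypothesis rescoring sketch.
def select_geo_lm(lat, lon, geo_lms, default_lm):
    for (lat_min, lat_max, lon_min, lon_max), lm in geo_lms.items():
        if lat_min <= lat <= lat_max and lon_min <= lon <= lon_max:
            return lm
    return default_lm

def rescore(hypotheses, lm, weight=0.3):
    """hypotheses: list of (text, acoustic_score); lm: text -> log-prob."""
    return max(hypotheses, key=lambda h: h[1] + weight * lm(h[0]))

beijing_lm = lambda text: -1.0 if "Wudaokou" in text else -5.0
geo_lms = {(39.4, 41.1, 115.4, 117.5): beijing_lm}    # rough Beijing box
lm = select_geo_lm(39.99, 116.33, geo_lms, default_lm=lambda t: -3.0)
best = rescore([("Wudaokou station", -10.0), ("Wu Dao coast", -9.5)], lm)
print(best[0])   # the Geo-LM breaks the homophone tie toward the local POI
```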

【8】 Advancing CTC-CRF Based End-to-End Speech Recognition with Wordpieces and Conformers

Authors: Huahuan Zheng, Wenjie Peng, Zhijian Ou, Jinsong Zhang
Affiliations: Speech Processing and Machine Intelligence (SPMI) Lab, Tsinghua University, China; School of Information Science, Beijing Language and Culture University, China
Notes: Submitted to ASRU 2021
Link: https://arxiv.org/abs/2107.03007
Abstract: Automatic speech recognition systems have been largely improved in the past few decades, and current systems are mainly hybrid-based and end-to-end-based. The recently proposed CTC-CRF framework inherits the data efficiency of the hybrid approach and the simplicity of the end-to-end approach. In this paper, we further advance the CTC-CRF based ASR technique with explorations of modeling units and neural architectures. Specifically, we investigate techniques that enable the recently developed wordpiece modeling units and Conformer neural networks to be successfully applied in CTC-CRFs. Experiments are conducted on two English datasets (Switchboard, Librispeech) and a German dataset from CommonVoice. Experimental results suggest that (i) the Conformer can improve recognition performance significantly; (ii) wordpiece-based systems perform slightly worse than phone-based systems when the target language has a low degree of grapheme-phoneme correspondence (e.g., English), while the two systems perform equally strongly when this degree of correspondence is high (e.g., German).
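
Plain CTC over wordpiece targets, the numerator half of the CTC-CRF objective, can be sketched with torch.nn.CTCLoss; the CRF denominator with its n-gram transition model, which is what distinguishes CTC-CRF from plain CTC, is omitted here. Shapes and vocabulary size are illustrative.

```python
# CTC over wordpiece targets (numerator only; no CRF denominator).
import torch
import torch.nn as nn

vocab, T, batch, S = 500, 120, 2, 20        # 500 wordpieces, blank id 0
log_probs = torch.randn(T, batch, vocab + 1).log_softmax(dim=-1)
targets = torch.randint(1, vocab + 1, (batch, S))   # wordpiece ids, no blank
input_lengths = torch.full((batch,), T, dtype=torch.long)
target_lengths = torch.full((batch,), S, dtype=torch.long)

ctc = nn.CTCLoss(blank=0, zero_infinity=True)
print(float(ctc(log_probs, targets, input_lengths, target_lengths)))
```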

【9】 A Comparative Study of Modular and Joint Approaches for Speaker-Attributed ASR on Monaural Long-Form Audio

Authors: Naoyuki Kanda, Xiong Xiao, Jian Wu, Tianyan Zhou, Yashesh Gaur, Xiaofei Wang, Zhong Meng, Zhuo Chen, Takuya Yoshioka
Affiliations: Microsoft Corp., USA
Notes: Submitted to ASRU 2021
Link: https://arxiv.org/abs/2107.02852
Abstract: Speaker-attributed automatic speech recognition (SA-ASR) is the task of recognizing "who spoke what" from multi-talker recordings. An SA-ASR system usually consists of multiple modules such as speech separation, speaker diarization and ASR. On the other hand, with joint optimization in mind, an end-to-end (E2E) SA-ASR model has recently been proposed, with promising results on simulated data. In this paper, we present our recent study comparing such modular and joint approaches to SA-ASR on real monaural recordings. We develop state-of-the-art SA-ASR systems for both the modular and joint approaches by leveraging large-scale training data, including 75 thousand hours of ASR training data and the VoxCeleb corpus for speaker representation learning. We also propose a new pipeline that runs the E2E SA-ASR model after speaker clustering. Our evaluation on the AMI meeting corpus reveals that, after fine-tuning with a small amount of real data, the joint system performs 9.2% to 29.4% better in accuracy than the best modular system, while the modular system performs better before such fine-tuning. We also conduct various error analyses to show the remaining issues for monaural SA-ASR.
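
The modular approach can be sketched as a composition of three stages; every function below is a hypothetical stand-in for a real separation, diarization, or ASR model, kept only to show how the outputs chain together.

```python
# Modular SA-ASR pipeline sketch with placeholder components.
def separate(audio):                  # -> list of single-speaker streams
    return [audio]                    # placeholder: assume one stream

def diarize(stream):                  # -> [(start_s, end_s, speaker), ...]
    return [(0.0, 2.5, "spk1"), (2.5, 5.0, "spk2")]

def transcribe(stream, start, end):   # -> text for the segment
    return f"<words from {start:.1f}s to {end:.1f}s>"

def sa_asr(audio):
    results = []                      # "who spoke what"
    for stream in separate(audio):
        for start, end, spk in diarize(stream):
            results.append((spk, transcribe(stream, start, end)))
    return results

print(sa_asr(audio=b"..."))
```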

3. eess.AS (Audio and Speech Processing):

【1】 Improving Speech Recognition Accuracy of Local POI Using Geographical Models

Authors: Songjun Cao, Yike Zhang, Xiaobing Feng, Long Ma
Affiliations: Tencent Cloud Xiaowei, Beijing, China
Notes: Accepted by SLT 2021
Link: https://arxiv.org/abs/2107.03165
Abstract: Nowadays, voice search for points of interest (POI) is becoming increasingly popular. However, speech recognition for local POI remains a challenge due to multiple dialects and the massive number of POI. This paper improves speech recognition accuracy for local POI from two aspects. Firstly, a geographic acoustic model (Geo-AM) is proposed. The Geo-AM deals with the multi-dialect problem using dialect-specific input features and a dialect-specific top layer. Secondly, a group of geo-specific language models (Geo-LMs) is integrated into our speech recognition system to improve recognition accuracy for long-tail and homophone POI. During decoding, a specific language model is selected on demand according to the user's geographic location. Experiments show that the proposed Geo-AM achieves a 6.5% to 10.1% relative character error rate (CER) reduction on an accent test set, and the Geo-AM and Geo-LM together achieve over 18.7% relative CER reduction on the Tencent Map task.

【2】 Advancing CTC-CRF Based End-to-End Speech Recognition with Wordpieces and Conformers

Authors: Huahuan Zheng, Wenjie Peng, Zhijian Ou, Jinsong Zhang
Affiliations: Speech Processing and Machine Intelligence (SPMI) Lab, Tsinghua University, China; School of Information Science, Beijing Language and Culture University, China
Notes: Submitted to ASRU 2021
Link: https://arxiv.org/abs/2107.03007
Abstract: Automatic speech recognition systems have been largely improved in the past few decades, and current systems are mainly hybrid-based and end-to-end-based. The recently proposed CTC-CRF framework inherits the data efficiency of the hybrid approach and the simplicity of the end-to-end approach. In this paper, we further advance the CTC-CRF based ASR technique with explorations of modeling units and neural architectures. Specifically, we investigate techniques that enable the recently developed wordpiece modeling units and Conformer neural networks to be successfully applied in CTC-CRFs. Experiments are conducted on two English datasets (Switchboard, Librispeech) and a German dataset from CommonVoice. Experimental results suggest that (i) the Conformer can improve recognition performance significantly; (ii) wordpiece-based systems perform slightly worse than phone-based systems when the target language has a low degree of grapheme-phoneme correspondence (e.g., English), while the two systems perform equally strongly when this degree of correspondence is high (e.g., German).

【3】 A Comparative Study of Modular and Joint Approaches for Speaker-Attributed ASR on Monaural Long-Form Audio

Authors: Naoyuki Kanda, Xiong Xiao, Jian Wu, Tianyan Zhou, Yashesh Gaur, Xiaofei Wang, Zhong Meng, Zhuo Chen, Takuya Yoshioka
Affiliations: Microsoft Corp., USA
Notes: Submitted to ASRU 2021
Link: https://arxiv.org/abs/2107.02852
Abstract: Speaker-attributed automatic speech recognition (SA-ASR) is the task of recognizing "who spoke what" from multi-talker recordings. An SA-ASR system usually consists of multiple modules such as speech separation, speaker diarization and ASR. On the other hand, with joint optimization in mind, an end-to-end (E2E) SA-ASR model has recently been proposed, with promising results on simulated data. In this paper, we present our recent study comparing such modular and joint approaches to SA-ASR on real monaural recordings. We develop state-of-the-art SA-ASR systems for both the modular and joint approaches by leveraging large-scale training data, including 75 thousand hours of ASR training data and the VoxCeleb corpus for speaker representation learning. We also propose a new pipeline that runs the E2E SA-ASR model after speaker clustering. Our evaluation on the AMI meeting corpus reveals that, after fine-tuning with a small amount of real data, the joint system performs 9.2% to 29.4% better in accuracy than the best modular system, while the modular system performs better before such fine-tuning. We also conduct various error analyses to show the remaining issues for monaural SA-ASR.

【4】 SoundStream: An End-to-End Neural Audio Codec

Authors: Neil Zeghidour, Alejandro Luebs, Ahmed Omran, Jan Skoglund, Marco Tagliasacchi
Link: https://arxiv.org/abs/2107.03312
Abstract: We present SoundStream, a novel neural audio codec that can efficiently compress speech, music and general audio at bitrates normally targeted by speech-tailored codecs. SoundStream relies on a model architecture composed of a fully convolutional encoder/decoder network and a residual vector quantizer, which are trained jointly end-to-end. Training leverages recent advances in text-to-speech and speech enhancement, which combine adversarial and reconstruction losses to allow the generation of high-quality audio content from quantized embeddings. By training with structured dropout applied to the quantizer layers, a single model can operate across variable bitrates from 3 kbps to 18 kbps, with negligible quality loss compared with models trained at fixed bitrates. In addition, the model is amenable to a low-latency implementation, which supports streamable inference and runs in real time on a smartphone CPU. In subjective evaluations using audio at a 24 kHz sampling rate, SoundStream at 3 kbps outperforms Opus at 12 kbps and approaches EVS at 9.6 kbps. Moreover, we are able to perform joint compression and enhancement either at the encoder or at the decoder side with no additional latency, which we demonstrate through background noise suppression for speech.

【5】 VAENAR-TTS: Variational Auto-Encoder based Non-AutoRegressive Text-to-Speech Synthesis

Authors: Hui Lu, Zhiyong Wu, Xixin Wu, Xu Li, Shiyin Kang, Xunying Liu, Helen Meng
Affiliations: Dept. of Systems Engineering & Engineering Management, Chinese University of Hong Kong; Centre for Perceptual and Interactive Intelligence, CUHK; Tsinghua-CUHK Joint Research Center for Media Sciences, Technologies and Systems
Link: https://arxiv.org/abs/2107.03298
Abstract: This paper describes a variational auto-encoder based non-autoregressive text-to-speech (VAENAR-TTS) model. Autoregressive TTS (AR-TTS) models based on the sequence-to-sequence architecture can generate high-quality speech, but their sequential decoding process can be time-consuming. Recently, non-autoregressive TTS (NAR-TTS) models have been shown to be more efficient thanks to their parallel decoding process. However, these NAR-TTS models rely on phoneme-level durations to generate a hard alignment between the text and the spectrogram. Obtaining duration labels, whether through forced alignment or knowledge distillation, is cumbersome. Furthermore, hard alignment based on phoneme expansion can degrade the naturalness of the synthesized speech. In contrast, the proposed VAENAR-TTS model is an end-to-end approach that does not require phoneme-level durations. It contains no recurrent structures and is completely non-autoregressive in both the training and inference phases. Based on the VAE architecture, the alignment information is encoded in the latent variable, and attention-based soft alignment between the text and the latent variable is used in the decoder to reconstruct the spectrogram. Experiments show that VAENAR-TTS achieves state-of-the-art synthesis quality, while its synthesis speed is comparable with other NAR-TTS models.

【6】 MACCIF-TDNN: Multi aspect aggregation of channel and context interdependence features in TDNN-based speaker verification

Authors: Fangyuan Wang, Zhigang Song, Hongchen Jiang, Bo Xu
Affiliations: Institute of Automation, Chinese Academy of Sciences, Beijing, China; Beijing University of Technology, Beijing, China
Notes: 6 pages. arXiv admin note: text overlap with arXiv:2005.07143 by other authors
Link: https://arxiv.org/abs/2107.03104
Abstract: Most of the recent state-of-the-art results for speaker verification are achieved by X-vector and its subsequent variants. In this paper, we propose a new network architecture that aggregates channel and context interdependence features from multiple aspects, based on the Time Delay Neural Network (TDNN). Firstly, we use SE-Res2Blocks, as in ECAPA-TDNN, to explicitly model channel interdependence for adaptive calibration of channel features, and to process local context features in a multi-scale way at a more granular level than conventional TDNN-based methods. Secondly, we explore the use of the Transformer encoder structure to model global context interdependence features at the utterance level, which can capture better long-term temporal characteristics. Before the pooling layer, we aggregate the outputs of the SE-Res2Blocks and the Transformer encoder to leverage the complementary channel and context interdependence features each has learned. Finally, instead of performing single attentive statistics pooling, we find it beneficial to extend the pooling method in a multi-head way to discriminate features from multiple aspects. The proposed MACCIF-TDNN architecture outperforms most state-of-the-art TDNN-based systems on the VoxCeleb1 test sets.

【7】 Adversarial Auto-Encoding for Packet Loss Concealment

Authors: Santiago Pascual, Joan Serrà, Jordi Pons
Affiliations: Dolby Laboratories
Link: https://arxiv.org/abs/2107.03100
Abstract: Communication technologies like voice over IP operate under constrained real-time conditions, with voice packets subject to network delays and losses. In such cases, a packet loss concealment (PLC) algorithm reconstructs missing frames until a new real packet is received. Recently, autoregressive deep neural networks have been shown to surpass the quality of signal-processing methods for PLC, especially for long-term predictions beyond 60 ms. In this work, we propose a non-autoregressive adversarial auto-encoder, named PLAAE, to perform real-time PLC in the waveform domain. PLAAE has a causal convolutional structure, and it learns in an auto-encoder fashion to reconstruct signals with gaps, with the help of an adversarial loss. During inference, it is able to predict smooth and coherent continuations of such gaps in a single feed-forward step, as opposed to autoregressive models. Our evaluation highlights the superiority of PLAAE over two classic PLC methods and two deep autoregressive models in terms of spectral and intonation reconstruction, perceptual quality, and intelligibility.

【8】 Efficient Transformer for Direct Speech Translation

Authors: Belen Alastruey, Gerard I. Gállego, Marta R. Costa-jussà
Affiliations: TALP Research Center, Universitat Politecnica de Catalunya, Barcelona
Notes: (c) 2021 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works
Link: https://arxiv.org/abs/2107.03069
Abstract: The advent of Transformer-based models has surpassed the barriers of text. When working with speech, we must face a problem: the sequence length of an audio input is not suitable for the Transformer. To bypass this problem, the usual approach is to add strided convolutional layers that reduce the sequence length before the Transformer. In this paper, we propose a new approach for direct Speech Translation in which, thanks to an efficient Transformer, we can work with a spectrogram without using convolutional layers before the Transformer. This allows the encoder to learn directly from the spectrogram, and no information is lost. We have created an encoder-decoder model in which the encoder is an efficient Transformer, the Longformer, and the decoder is a traditional Transformer decoder. Our results, which are close to those obtained with the standard approach, show that this is a promising research direction.

【9】 Effective and Differentiated Use of Control Information for Multi-speaker Speech Synthesis

Authors: Qinghua Wu, Quanbo Shen, Jian Luan, YuJun Wang
Affiliations: Xiaomi Technology Co. Ltd., Beijing, China
Link: https://arxiv.org/abs/2107.03065
Abstract: In multi-speaker speech synthesis, data from a number of speakers usually has great diversity, since the speakers may differ largely in age, speaking style, speed, emotion, and so on. This diversity leads to the one-to-many mapping problem (Ren2020FastSpeech2F; Kumar2020FewSA). It is important but challenging to improve the modeling capability for multi-speaker speech synthesis. To address the issue, this paper researches the effective use of control information such as speaker and pitch, which is differentiated from text-content information in our encoder-decoder framework: 1) We design a representation of the harmonic structure of speech, called the excitation spectrogram, derived from pitch and energy. The excitation spectrogram is fed to the decoder along with the text content to guide the learning of the harmonics of the mel-spectrogram. 2) We propose a conditional gated LSTM (CGLSTM) whose input/output/forget gates are re-weighted by the speaker embedding to control the flow of text-content information in the network. The experiments show a significant reduction in the reconstruction errors of the mel-spectrogram when training the multi-speaker generative model, and a great improvement is observed in the subjective evaluation of the speaker-adapted model, e.g., the Mean Opinion Score (MOS) for intelligibility increases by 0.81 points.
