Academic digest: finance / speech / audio processing [7.6]

2021-07-27 10:24:29

q-fin (Quantitative Finance): 12 papers

cs.SD (Sound): 14 papers

eess.AS (Audio and Speech Processing): 14 papers

1. q-fin (Quantitative Finance):

【1】 Optimal transport for model calibration

Authors: Ivan Guo, Gregoire Loeper, Jan Obloj, Shiyi Wang
Affiliations: School of Mathematics, Monash University, Australia; University of Oxford
Notes: 15 pages, 9 figures
Link: https://arxiv.org/abs/2107.01978
Abstract: We provide a survey of recent results on model calibration by Optimal Transport. We present the general framework and then discuss the calibration of local, and local-stochastic, volatility models to European options, the joint VIX/SPX calibration problem, as well as calibration to some path-dependent options. We explain the numerical algorithms and present examples on both synthetic and market data.

【2】 Mobility decisions, economic dynamics and epidemic

Authors: Giorgio Fabbri, Salvatore Federico, Davide Fiaschi, Fausto Gozzi
Affiliations: Università degli Studi di Genova; Università degli Studi di Pisa
Link: https://arxiv.org/abs/2107.01746
Abstract: In this paper we propose a theoretical model embedding a susceptible-infected-recovered-dead (SIRD) model of epidemic into a dynamic macroeconomic general equilibrium framework with agents' mobility. Mobility affects both agents' income (and consumption) and their probability of infecting and of being infected. Strategic complementarities among individual mobility choices drive the evolution of aggregate economic activity, while infection externalities caused by individual mobility affect disease diffusion. Rational expectations of forward-looking agents on the dynamics of aggregate mobility and epidemic determine individual mobility decisions. The model allows us to evaluate alternative scenarios of mobility restrictions, especially policies dependent on the state of the epidemic. We prove the existence of an equilibrium and provide a recursive construction method for finding equilibria, which also guides our numerical investigations. We calibrate the model using the Italian experience of the COVID-19 epidemic in the period February 2020 - May 2021. We discuss how our economic SIRD (ESIRD) model produces substantially different dynamics of the economy and the epidemic with respect to a SIRD model with constant agents' mobility. Finally, by numerical explorations we illustrate how the model can be used to design an efficient policy of state-of-epidemic-dependent mobility restrictions, which mitigates the epidemic peaks stressing the health system and allows trading off the economic losses due to reduced mobility against the lower death rate due to the lower spread of the epidemic.
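The SIRD dynamics at the core of the ESIRD model can be sketched in a few lines. The sketch below is a minimal discrete-time SIRD simulator, not the paper's general-equilibrium model: the quadratic mobility scaling of the transmission rate and all parameter values are illustrative assumptions.

```python
def simulate_sird(beta, gamma, mu, s, i, r=0.0, d=0.0, days=200):
    """Discrete-time SIRD on population shares (s + i + r + d = 1).

    beta: transmission rate, gamma: recovery rate, mu: death rate.
    """
    for _ in range(days):
        new_inf = beta * s * i    # new infections
        new_rec = gamma * i       # recoveries
        new_dead = mu * i         # deaths
        s -= new_inf
        i += new_inf - new_rec - new_dead
        r += new_rec
        d += new_dead
    return s, i, r, d

# Mobility m scales contacts; here beta is scaled by m**2, reflecting that both
# parties must be mobile to meet (an illustrative assumption, not the paper's law).
full = simulate_sird(beta=0.4 * 1.0**2, gamma=0.1, mu=0.01, s=0.99, i=0.01)
restricted = simulate_sird(beta=0.4 * 0.5**2, gamma=0.1, mu=0.01, s=0.99, i=0.01)
```

With these toy parameters the mobility restriction pushes the effective reproduction number below one, so cumulative deaths (the fourth component) end up lower; the paper's contribution is to make the mobility choice endogenous and trade this off against the induced economic losses.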

【3】 Supporting decisions by unleashing multiple mindsets using pairwise comparisons method

Authors: Salvatore Greco, Sajid Siraj, Michele Lundy
Affiliations: Department of Economics and Quantitative Methods, University of Catania, Italy; Portsmouth Business School, University of Portsmouth, Portsmouth, UK; Centre for Decision Research, Leeds University Business School, Leeds, UK
Link: https://arxiv.org/abs/2107.01731
Abstract: Inconsistency in pairwise comparison judgements is often perceived as an unwanted phenomenon, and researchers have proposed a number of techniques to either reduce it or correct it. We take the viewpoint that this inconsistency unleashes different mindsets of the decision maker(s) that should be taken into account when generating recommendations as decision support. With this aim we consider spanning tree analysis, a recently emerging idea for the pairwise comparison approach that represents the plurality of mindsets (in terms of a plurality of vectors corresponding to different spanning trees). Until now, the multiplicity of vectors supplied by the spanning trees approach has been amalgamated into a single preference vector, losing the information about the plurality of mindsets. To preserve this information, we propose a novel methodology taking an approach similar to Stochastic Multi-criteria Acceptability Analysis. Considering all the rankings of alternatives corresponding to the different mindsets, our methodology gives the probability that an alternative attains a given ranking position as well as the probability that an alternative is preferred to another. Since the exponential number of spanning trees makes their enumeration prohibitive, we propose computing approximate probabilities using statistical sampling of the spanning trees. Our approach is also appealing because it can be applied to incomplete sets of pairwise comparisons. We demonstrate its usefulness with a didactic example as well as with an application to a real-life case of selecting a telecom backbone infrastructure for rural areas.
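For a toy three-alternative example the spanning trees of the comparison graph can be enumerated outright (the paper resorts to statistical sampling precisely because this enumeration grows exponentially). The comparison ratios below are hypothetical and deliberately inconsistent; each spanning tree yields one priority vector, one "mindset":

```python
import itertools

# Hypothetical inconsistent pairwise comparison ratios, a[(i, j)] ~ w_i / w_j
a = {(0, 1): 2.0, (0, 2): 4.0, (1, 2): 3.0}
for (i, j), v in list(a.items()):
    a[(j, i)] = 1.0 / v

n = 3
edges = [(i, j) for i in range(n) for j in range(i + 1, n)]

def spanning_trees(n, edges):
    """Enumerate spanning trees: subsets of n-1 edges with no cycle."""
    for subset in itertools.combinations(edges, n - 1):
        parent = list(range(n))          # union-find cycle check
        def find(x):
            while parent[x] != x:
                x = parent[x]
            return x
        ok = True
        for u, v in subset:
            ru, rv = find(u), find(v)
            if ru == rv:
                ok = False
                break
            parent[ru] = rv
        if ok:
            yield subset

def tree_weights(tree):
    """Chain ratios along the tree to get one normalized priority vector."""
    w = {0: 1.0}
    adj = {}
    for u, v in tree:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    stack = [0]
    while stack:
        u = stack.pop()
        for v in adj.get(u, []):
            if v not in w:
                w[v] = w[u] / a[(u, v)]   # a[(u, v)] ~ w_u / w_v
                stack.append(v)
    total = sum(w.values())
    return [w[i] / total for i in range(n)]

trees = list(spanning_trees(n, edges))
rank_first = [0] * n
for t in trees:
    w = tree_weights(t)
    rank_first[w.index(max(w))] += 1
prob_first = [c / len(trees) for c in rank_first]   # P(alternative ranks first)
```

Replacing the full enumeration with a random sample of trees gives the approximate probabilities the paper proposes for larger problems.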

【4】 Asymptotic Analysis of Risk Premia Induced by Law-Invariant Risk Measures

Authors: Thomas Knispel, Roger J. A. Laeven, Gregor Svindland
Affiliations: Department of Business and Economics, Berlin School of Economics and Law; Amsterdam School of Economics, University of Amsterdam, and EURANDOM; Institute of Probability and Statistics and House of Insurance, Leibniz University Hannover
Link: https://arxiv.org/abs/2107.01730
Abstract: We analyze the limiting behavior of the risk premium associated with the Pareto optimal risk sharing contract in an infinitely expanding pool of risks under a general class of law-invariant risk measures encompassing rank-dependent utility preferences. We show that the corresponding convergence rate is typically only $n^{1/2}$ instead of the conventional $n$, with $n$ the multiplicity of risks in the pool, depending upon the precise risk preferences.

【5】 Trading patterns within and between regions: a network analysis

Authors: Matthew Smith, Yasaman Sarabi
Affiliations: The Business School, Edinburgh Napier University, Edinburgh, UK; Edinburgh Business School, Heriot-Watt University, Edinburgh, UK
Notes: 20 pages, 3 figures
Link: https://arxiv.org/abs/2107.01696
Abstract: This study examines patterns of regionalisation in the International Trade Network (ITN). It uses Gould-Fernandez brokerage to examine the roles countries play in the ITN in linking different regional partitions. Three ITNs are examined, representing trade in goods with varying levels of technological content: high-tech, medium-tech, and low-tech goods. Simulated network data, based on an advanced network model controlling for degree centralisation and clustering patterns, are compared to the observed data to examine whether the roles countries play within and between regions are a result of centralisation and clustering patterns. The findings indicate that the roles countries play between and within regions are a result of centralisation and clustering patterns, indicating a need to examine the presence of hubs when investigating regionalisation and globalisation patterns in the modern global economy.

【6】 Deep calibration of the quadratic rough Heston model

Authors: Mathieu Rosenbaum, Jianfei Zhang
Affiliations: École Polytechnique, CMAP, Palaiseau Cedex, France; ExodusPoint Capital Management, Boulevard Haussmann, Paris, France
Link: https://arxiv.org/abs/2107.01611
Abstract: The quadratic rough Heston model provides a natural way to encode the Zumbach effect in the rough volatility paradigm. We apply a multi-factor approximation and use deep learning methods to build an efficient calibration procedure for this model. We show that the model is able to reproduce very well both SPX and VIX implied volatilities. We typically obtain VIX option prices within the bid-ask spread and an excellent fit of the SPX at-the-money skew. Moreover, we explain how to use the trained neural networks for hedging with instantaneous computation of hedging quantities.

【7】 A Note on Utility Maximization with Proportional Transaction Costs and Stability of Optimal Portfolios

Authors: Erhan Bayraktar, Christoph Czichowsky, Leonid Dolinskyi, Yan Dolinsky
Notes: 9 pages
Link: https://arxiv.org/abs/2107.01568
Abstract: The aim of this short note is to establish a limit theorem for the optimal trading strategies in the setup of the utility maximization problem with proportional transaction costs. This limit theorem resolves an open question from [4]. The main idea of our proof is to establish a uniqueness result for the optimal strategy. Surprisingly, to date, there are no results on the uniqueness of the optimal trading strategy. The proof of the uniqueness is heavily based on the dual approach developed recently in [6,7,8].

【8】 Decentralizing Centralized Matching Markets: Implications from Early Offers in University Admissions

Authors: Julien Grenet, YingHua He, Dorothea Kübler
Affiliations: Rice University and Toulouse School of Economics
Link: https://arxiv.org/abs/2107.01532
Abstract: The matching literature often recommends market centralization under the assumption that agents know their own preferences and that their preferences are fixed. We find counterevidence to this assumption in a quasi-experiment. In Germany's university admissions, a clearinghouse implements the early stages of the Gale-Shapley algorithm in real time. We show that early offers made in this decentralized phase, although not more desirable, are accepted more often than later ones. These results, together with survey evidence and a theoretical model, are consistent with students' costly learning about universities. We propose a hybrid mechanism to combine the advantages of decentralization and centralization.
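The clearinghouse runs the early stages of the Gale-Shapley (deferred acceptance) algorithm. A minimal one-seat-per-university sketch of student-proposing deferred acceptance, with hypothetical preference lists (the real mechanism handles capacities and runs in stages):

```python
def deferred_acceptance(proposer_prefs, receiver_prefs):
    """Student-proposing deferred acceptance, one seat per university."""
    rank = {r: {p: k for k, p in enumerate(prefs)}
            for r, prefs in receiver_prefs.items()}
    next_choice = {p: 0 for p in proposer_prefs}
    held = {}                      # university -> tentatively held student
    free = list(proposer_prefs)
    while free:
        p = free.pop()
        r = proposer_prefs[p][next_choice[p]]  # best not-yet-tried university
        next_choice[p] += 1
        if r not in held:
            held[r] = p                        # tentatively accept
        elif rank[r][p] < rank[r][held[r]]:
            free.append(held[r])               # displace the less-preferred student
            held[r] = p
        else:
            free.append(p)                     # rejected; tries next choice
    return {p: r for r, p in held.items()}

# Hypothetical preferences for three students and three universities
students = {"s1": ["u1", "u2", "u3"],
            "s2": ["u1", "u3", "u2"],
            "s3": ["u2", "u1", "u3"]}
universities = {"u1": ["s2", "s1", "s3"],
                "u2": ["s1", "s3", "s2"],
                "u3": ["s3", "s1", "s2"]}
match = deferred_acceptance(students, universities)
```

Tentative offers made early in this process are exactly the "early offers" the paper studies: they can later be withdrawn when a more-preferred student proposes.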

【9】 Risk aversion and uniqueness of equilibrium in economies with two goods and HARA preferences

Authors: Andrea Loi, Stefano Matta
Notes: 11 pages, 1 figure
Link: https://arxiv.org/abs/2107.01947
Abstract: We study the connection between risk aversion, the number of consumers, and uniqueness of equilibrium. We consider an economy with two goods and $c$ impatience types, where each type has additive separable preferences with a HARA Bernoulli utility function, $u_H(x):=\frac{\gamma}{1-\gamma}\left(b+\frac{a}{\gamma}x\right)^{1-\gamma}$. We show that if $\gamma\in (1, \frac{c}{c-1}]$, the equilibrium is unique. Moreover, the methods used, involving Newton's symmetric polynomials and Descartes' rule of signs, enable us to offer new sufficient conditions for uniqueness in a closed-form expression highlighting the role played by endowments, patience, and the specific HARA parameters. Finally, new necessary and sufficient conditions ensuring uniqueness are derived for the particular case of CRRA Bernoulli utility functions with $\gamma = 3$.

【10】 The Role of "Live" in Livestreaming Markets: Evidence Using Orthogonal Random Forest

Authors: Ziwei Cong, Jia Liu, Puneet Manchanda
Link: https://arxiv.org/abs/2107.01629
Abstract: The common belief about the growing medium of livestreaming is that its value lies in its "live" component. In this paper, we leverage data from a large livestreaming platform to examine this belief. We are able to do this because the platform also allows viewers to purchase the recorded version of the livestream. We summarize the value of livestreaming content by estimating how demand responds to price before, on the day of, and after the livestream. We do this by proposing a generalized Orthogonal Random Forest framework. This framework allows us to estimate heterogeneous treatment effects in the presence of high-dimensional confounders whose relationships with the treatment policy (i.e., price) are complex but partially known. We find significant dynamics in the price elasticity of demand over the temporal distance to the scheduled livestreaming day and after it. Specifically, demand gradually becomes less price sensitive over time up to the livestreaming day and is inelastic on the livestreaming day. Over the post-livestream period, demand is still sensitive to price, but much less so than in the pre-livestream period. This indicates that the value of livestreaming persists beyond the live component. Finally, we provide suggestive evidence for the likely mechanisms driving our results: quality uncertainty reduction for the pre- and post-livestream patterns, and the potential of real-time interaction with the creator on the day of the livestream.
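The orthogonalization idea behind Orthogonal Random Forests can be illustrated with a plain residual-on-residual (partialling-out) estimate with two-fold cross-fitting. The sketch below substitutes a simple polynomial regression for the paper's forests and uses synthetic data with a known price effect of 2.0; everything in it is an illustrative assumption, not the authors' estimator:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
x = rng.uniform(-1, 1, n)                    # confounder
t = x + rng.normal(0, 1, n)                  # "price" depends on the confounder
y = 2.0 * t + x**2 + rng.normal(0, 0.1, n)   # true effect of t on y is 2.0

def fit_predict(x_tr, z_tr, x_te):
    """Nuisance regression on a polynomial basis (stand-in for the forests)."""
    basis = lambda u: np.column_stack([np.ones_like(u), u, u**2])
    coef, *_ = np.linalg.lstsq(basis(x_tr), z_tr, rcond=None)
    return basis(x_te) @ coef

# Two-fold cross-fitting: residualize each half with nuisances fit on the other
half = n // 2
vt = np.empty(n)
vy = np.empty(n)
for tr, te in [(slice(0, half), slice(half, n)), (slice(half, n), slice(0, half))]:
    vt[te] = t[te] - fit_predict(x[tr], t[tr], x[te])
    vy[te] = y[te] - fit_predict(x[tr], y[tr], x[te])

theta_hat = (vt @ vy) / (vt @ vt)            # residual-on-residual slope
```

The orthogonal (Neyman-orthogonal) moment makes the slope estimate robust to small errors in the nuisance regressions; the paper's framework additionally localizes this estimate with forest weights to obtain heterogeneous effects.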

【11】 Cleaning large-dimensional covariance matrices for correlated samples

Authors: Zdzislaw Burda, Andrzej Jarosz
Affiliations: AGH University of Science and Technology, Applied Computer Science, al. Mickiewicza, Krakow, Poland
Notes: 5 pages, 3 figures
Link: https://arxiv.org/abs/2107.01352
Abstract: A non-linear shrinkage estimator of large-dimensional covariance matrices is derived in a setting of auto-correlated samples, thus generalizing the recent formula by Ledoit-Péché. The calculation is facilitated by random matrix theory. The result is turned into an efficient algorithm, and an associated Python library, shrinkage, with the help of the Ledoit-Wolf kernel estimation technique. An example of exponentially-decaying auto-correlations is presented.
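The paper derives a *non-linear* shrinkage estimator for auto-correlated samples; as a hedged point of reference, the simpler linear shrinkage it generalizes pulls the sample covariance toward a scaled identity, contracting the spread of sample eigenvalues:

```python
import numpy as np

def linear_shrinkage(X, delta):
    """Linear shrinkage of the sample covariance toward a scaled identity.

    delta in [0, 1] is the shrinkage intensity. The paper's estimator instead
    applies a non-linear transformation to each sample eigenvalue and corrects
    for auto-correlation across samples; this is only the simpler baseline.
    """
    S = np.cov(X, rowvar=False)
    mu = np.trace(S) / S.shape[0]            # average eigenvalue
    return (1 - delta) * S + delta * mu * np.eye(S.shape[0])

rng = np.random.default_rng(1)
X = rng.standard_normal((60, 20))            # n = 60 samples of dimension p = 20
S_shrunk = linear_shrinkage(X, delta=0.5)
```

Because the identity commutes with any matrix, the shrunk estimator keeps the sample eigenvectors and moves each eigenvalue toward the mean, while preserving the total variance (the trace).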

【12】 Visual Time Series Forecasting: An Image-driven Approach

Authors: Naftali Cohen, Srijan Sood, Zhen Zeng, Tucker Balch, Manuela Veloso
Affiliations: J.P. Morgan AI Research, New York, NY
Link: https://arxiv.org/abs/2107.01273
Abstract: In this work, we address time-series forecasting as a computer vision task. We capture input data as an image and train a model to produce the subsequent image. This approach results in predicting distributions as opposed to pointwise values. To assess the robustness and quality of our approach, we examine various datasets and multiple evaluation metrics. Our experiments show that our forecasting tool is effective for cyclic data but somewhat less so for irregular data such as stock prices. Importantly, when using image-based evaluation metrics, we find our method to outperform various baselines, including ARIMA, and a numerical variation of our deep learning approach.
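One simple way to encode a series as an image — the general idea, not necessarily the authors' exact rendering — is to mark one pixel per time step at the row given by the min-max-normalized value:

```python
import numpy as np

def series_to_image(y, height=32):
    """Rasterize a 1-D series into a binary image with one marked pixel per column."""
    y = np.asarray(y, dtype=float)
    lo, hi = y.min(), y.max()
    rows = np.round((y - lo) / max(hi - lo, 1e-12) * (height - 1)).astype(int)
    img = np.zeros((height, y.size), dtype=np.uint8)
    img[height - 1 - rows, np.arange(y.size)] = 1   # row 0 is the top of the image
    return img

img = series_to_image(np.sin(np.linspace(0, 4 * np.pi, 64)))
```

A model that outputs the next image then naturally spreads probability mass over several rows per column, which is the "predicting distributions as opposed to pointwise values" property the abstract highlights.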

2. cs.SD (Sound):

【1】 DeepRapper: Neural Rap Generation with Rhyme and Rhythm Modeling

Authors: Lanqing Xue, Kaitao Song, Duocai Wu, Xu Tan, Nevin L. Zhang, Tao Qin, Wei-Qiang Zhang, Tie-Yan Liu
Affiliations: The Hong Kong University of Science and Technology; Nanjing University of Science and Technology; Fudan University; Microsoft Research Asia; Tsinghua University
Notes: Accepted by ACL 2021 main conference
Link: https://arxiv.org/abs/2107.01875
Abstract: Rap generation, which aims to produce lyrics and corresponding singing beats, needs to model both rhymes and rhythms. Previous works on rap generation focused on rhyming lyrics but ignored rhythmic beats, which are important for rap performance. In this paper, we develop DeepRapper, a Transformer-based rap generation system that can model both rhymes and rhythms. Since there is no available rap dataset with rhythmic beats, we develop a data mining pipeline to collect a large-scale rap dataset, which includes a large number of rap songs with aligned lyrics and rhythmic beats. Second, we design a Transformer-based autoregressive language model which carefully models rhymes and rhythms. Specifically, we generate lyrics in reverse order with rhyme representation and constraints for rhyme enhancement, and insert a beat symbol into the lyrics for rhythm/beat modeling. To our knowledge, DeepRapper is the first system to generate rap with both rhymes and rhythms. Both objective and subjective evaluations demonstrate that DeepRapper generates creative and high-quality raps with rhymes and rhythms. Code will be released on GitHub.

【2】 Arabic Code-Switching Speech Recognition using Monolingual Data

Authors: Ahmed Ali, Shammur Chowdhury, Amir Hussein, Yasser Hifny
Affiliations: Qatar Computing Research Institute, HBKU, Doha, Qatar; Kanari AI, California, USA; University of Helwan, Egypt
Notes: Accepted at Interspeech 2021. Keywords: speech recognition, code-switching, ASR, transformer, WFST, graph approach
Link: https://arxiv.org/abs/2107.01573
Abstract: Code-switching in automatic speech recognition (ASR) is an important challenge due to globalization. Recent research in multilingual ASR shows potential improvements over monolingual systems. We study key issues related to multilingual modeling for ASR through a series of large-scale ASR experiments. Our innovative framework deploys a multi-graph approach within the weighted finite state transducer (WFST) framework. We compare our WFST decoding strategies with a Transformer sequence-to-sequence system trained on the same data. Given a code-switching scenario between the Arabic and English languages, our results show that the WFST decoding approaches are more suitable for intersentential code-switching datasets. In addition, the Transformer system performs better on the intrasentential code-switching task. With this study, we release artificially generated development and test sets, along with an ecological code-switching test set, to benchmark ASR performance.

【3】 Unified Autoregressive Modeling for Joint End-to-End Multi-Talker Overlapped Speech Recognition and Speaker Attribute Estimation

Authors: Ryo Masumura, Daiki Okamura, Naoki Makishima, Mana Ihori, Akihiko Takashima, Tomohiro Tanaka, Shota Orihashi
Affiliations: NTT Media Intelligence Laboratories, NTT Corporation, Japan; Nagaoka University of Technology, Japan
Notes: Accepted at Interspeech 2021
Link: https://arxiv.org/abs/2107.01549
Abstract: In this paper, we present a novel modeling method for single-channel multi-talker overlapped automatic speech recognition (ASR) systems. Fully neural network based end-to-end models have dramatically improved the performance of multi-talker overlapped ASR tasks. One promising approach for end-to-end modeling is autoregressive modeling with serialized output training, in which transcriptions of multiple speakers are recursively generated one after another. This enables us to naturally capture relationships between speakers. However, the conventional modeling method cannot explicitly take into account the speaker attributes of individual utterances, such as gender and age information. In fact, performance deteriorates when the speakers are the same gender or close in age. To address this problem, we propose unified autoregressive modeling for joint end-to-end multi-talker overlapped ASR and speaker attribute estimation. Our key idea is to handle gender and age estimation tasks within the unified autoregressive modeling. In the proposed method, a Transformer-based autoregressive model recursively generates not only textual tokens but also attribute tokens of each speaker. This enables us to effectively utilize speaker attributes for improving multi-talker overlapped ASR. Experiments on Japanese multi-talker overlapped ASR tasks demonstrate the effectiveness of the proposed method.

【4】 Development of a Conversation State Recognition System

Authors: Sujay Uday Rittikar
Affiliations: DKTE's Textile and Engineering Institute, India
Link: https://arxiv.org/abs/2107.01462
Abstract: With the evolution of speaker diarization using LSTMs, it is relatively easier to understand the speaker identities for specific segments of input audio stream data than to manually tag the data. Given such a concept, it is highly desirable to consider the possibility of using the identified speaker identities to aid in recognizing the speaker states in a conversation. In this study, Markov chains are used to identify and update the speaker states for subsequent conversations between the same set of speakers, to enable identification of their states in the most natural and long conversations. The model is based on several audio samples from natural conversations of three or more speakers in two datasets, with an overall error percentage for recognized states of at most 12%. The findings imply that the proposed extension to speaker diarization is effective in predicting the states of a conversation.
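A minimal sketch of the Markov-chain state update: given a row-stochastic transition matrix estimated from earlier conversations of the same speakers (the state labels and matrix values below are hypothetical), the next-state distribution is a single vector-matrix product, and repeated application gives the long-run state profile:

```python
import numpy as np

# Hypothetical conversation states and row-stochastic transition matrix
states = ["speaker_A_talks", "speaker_B_talks", "overlap"]
T = np.array([[0.6, 0.3, 0.1],
              [0.4, 0.5, 0.1],
              [0.5, 0.4, 0.1]])

p = np.array([1.0, 0.0, 0.0])    # conversation starts with speaker A talking
p_next = p @ T                   # predicted state distribution for the next turn

# Long-run (stationary) distribution by power iteration
p_inf = p.copy()
for _ in range(200):
    p_inf = p_inf @ T
```

In the paper's setting, `T` would be re-estimated (updated) after each conversation between the same set of speakers, so the predicted states track how that group actually converses.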

【5】 A Lottery Ticket Hypothesis Framework for Low-Complexity Device-Robust Neural Acoustic Scene Classification

Authors: Chao-Han Huck Yang, Hu Hu, Sabato Marco Siniscalchi, Qing Wang, Yuyang Wang, Xianjun Xia, Yuanjun Zhao, Yuzhong Wu, Yannan Wang, Jun Du, Chin-Hui Lee
Affiliations: School of Electrical and Computer Engineering, Georgia Institute of Technology, GA, USA; Computer Engineering School, University of Enna Kore, Italy; University of Science and Technology of China, Hefei, China; Tencent Media Lab, Tencent Corporation, China
Link: https://arxiv.org/abs/2107.01461
Abstract: We propose a novel neural model compression strategy combining data augmentation, knowledge transfer, pruning, and quantization for device-robust acoustic scene classification (ASC). Specifically, we tackle the ASC task in a low-resource environment leveraging a recently proposed advanced neural network pruning mechanism, namely the Lottery Ticket Hypothesis (LTH), to find a sub-network neural model associated with a small number of non-zero model parameters. The effectiveness of LTH for low-complexity acoustic modeling is assessed by investigating various data augmentation and compression schemes, and we report an efficient joint framework for low-complexity multi-device ASC, called Acoustic Lottery. Acoustic Lottery could compress an ASC model to less than $1/10^{4}$ of its size and attain superior performance (validation accuracy of 74.01% and log loss of 0.76) compared to its uncompressed seed model. All results reported in this work are based on a joint effort of four groups, namely GT-USTC-UKE-Tencent, aiming to address the "Low-Complexity Acoustic Scene Classification (ASC) with Multiple Devices" task in the DCASE 2021 Challenge Task 1a.
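LTH-style pruning keeps a sparse mask over the largest-magnitude weights. Below is a minimal one-shot magnitude-pruning sketch; the actual LTH procedure additionally rewinds the surviving weights to their initial values and retrains, usually over several iterative pruning rounds, which is omitted here:

```python
import numpy as np

def magnitude_prune(w, sparsity):
    """Zero out the smallest-magnitude fraction `sparsity` of weights."""
    flat = np.abs(w).ravel()
    k = int(sparsity * flat.size)            # number of weights to remove
    mask = np.ones(flat.size, dtype=bool)
    if k > 0:
        mask[np.argsort(flat)[:k]] = False   # drop the k smallest magnitudes
    mask = mask.reshape(w.shape)
    return w * mask, mask

w = np.array([[0.50, -0.10],
              [0.05,  2.00]])
pruned, mask = magnitude_prune(w, sparsity=0.5)
```

Combined with quantization of the surviving weights, this is how a model's non-zero parameter count can be driven down by orders of magnitude, as in the Acoustic Lottery framework.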

【6】 The HCCL Speaker Verification System for Far-Field Speaker Verification Challenge

Authors: Zhuo Li, Ce Fang, Runqiu Xiao, Zhigao Chen, Wenchao Wang, Yonghong Yan
Affiliations: Key Laboratory of Speech Acoustics and Content Understanding, Institute of Acoustics, Chinese Academy of Sciences; University of Chinese Academy of Sciences, Beijing, China; Xinjiang Key Laboratory of Minority Speech and Language Information Processing, Xinjiang
Link: https://arxiv.org/abs/2107.01329
Abstract: This paper describes the systems submitted by team HCCL to the Far-Field Speaker Verification Challenge. Our previous work in the AIShell Speaker Verification Challenge 2019 showed that the powerful modeling abilities of neural network architectures can provide exceptional performance for this kind of task. Therefore, in this challenge, we focus on constructing deep neural network architectures based on TDNN, ResNet and Res2Net blocks. Most of the developed systems consist of neural network embeddings with a PLDA backend. First, the speed perturbation method is applied to augment data, and significant performance improvements are achieved. Then, we explore the use of the AM-softmax loss function and propose joining a CE-loss branch when training the model with AM-softmax loss. In addition, the impact of score normalization on performance is also investigated. The final system, a fusion of four systems, achieves minDCF 0.5342 and EER 5.05% on the task 1 eval set, and minDCF 0.5193 and EER 5.47% on the task 3 eval set.
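Speed perturbation resamples the waveform so it plays slightly faster or slower, yielding extra training copies per utterance. A minimal sketch using linear interpolation (production toolkits typically use band-limited resampling, and factors such as 0.9/1.0/1.1 are a common convention rather than this paper's stated choice):

```python
import numpy as np

def speed_perturb(x, factor):
    """Resample a waveform so it plays `factor` times faster."""
    x = np.asarray(x, dtype=float)
    n_out = int(round(len(x) / factor))
    t = np.linspace(0, len(x) - 1, n_out)    # new sample positions
    return np.interp(t, np.arange(len(x)), x)

x = np.sin(2 * np.pi * 5 * np.linspace(0, 1, 16000))  # 1 s of a 5 Hz tone at 16 kHz
fast = speed_perturb(x, 1.1)   # shorter, higher-pitched
slow = speed_perturb(x, 0.9)   # longer, lower-pitched
```

Because the perturbed copies shift both duration and pitch, they act as new pseudo-speakers, which is why the augmentation helps speaker verification training.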

【7】 Speech Synthesis from Text and Ultrasound Tongue Image-based Articulatory Input

Authors: Tamás Gábor Csapó, László Tóth, Gábor Gosztolya, Alexandra Markó
Affiliations: Department of Telecommunications and Media Informatics, Budapest University of Technology and Economics, Hungary; Institute of Informatics, University of Szeged, Hungary; Department of Applied Linguistics and Phonetics, Eötvös Loránd University
Notes: Accepted at SSW11 (11th Speech Synthesis Workshop)
Link: https://arxiv.org/abs/2107.02003
Abstract: Articulatory information has been shown to be effective in improving the performance of HMM-based and DNN-based text-to-speech synthesis. Speech synthesis research traditionally focuses on text-to-speech conversion, where the input is text or an estimated linguistic representation, and the target is synthesized speech. However, a research field that has risen in the last decade is articulation-to-speech synthesis (with a target application of a Silent Speech Interface, SSI), where the goal is to synthesize speech from some representation of the movement of the articulatory organs. In this paper, we extend traditional (vocoder-based) DNN-TTS with articulatory input estimated from ultrasound tongue images. We compare text-only, ultrasound-only, and combined inputs. Using data from eight speakers, we show that the combined text and articulatory input can have advantages in limited-data scenarios; namely, it may increase the naturalness of synthesized speech compared to text-only input. Besides, we analyze the ultrasound tongue recordings of several speakers and show that misalignments in the ultrasound transducer positioning can have a negative effect on the final synthesis performance.

【8】 Investigation of Practical Aspects of Single Channel Speech Separation for ASR

Authors: Jian Wu, Zhuo Chen, Sanyuan Chen, Yu Wu, Takuya Yoshioka, Naoyuki Kanda, Shujie Liu, Jinyu Li
Affiliations: Microsoft Corporation; Harbin Institute of Technology
Notes: Accepted by Interspeech 2021
Link: https://arxiv.org/abs/2107.01922
Abstract: Speech separation has been successfully applied as a frontend processing module of conversation transcription systems thanks to its ability to handle overlapped speech and its flexibility to combine with downstream tasks such as automatic speech recognition (ASR). However, a speech separation model often introduces target speech distortion, resulting in a sub-optimal word error rate (WER). In this paper, we describe our efforts to improve the performance of a single-channel speech separation system. Specifically, we investigate a two-stage training scheme that first applies a feature-level optimization criterion for pretraining, followed by an ASR-oriented optimization criterion using an end-to-end (E2E) speech recognition model. Meanwhile, to keep the model lightweight, we introduce a modified teacher-student learning technique for model compression. By combining these approaches, we achieve absolute average WER improvements of 2.70% and 0.77% using models with fewer than 10M parameters, compared with the previous state-of-the-art results on the LibriCSS dataset for utterance-wise evaluation and continuous evaluation, respectively.
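WER, the metric behind the reported 2.70% and 0.77% absolute improvements, is the word-level edit distance (substitutions, insertions, deletions) divided by the reference length. A minimal sketch:

```python
def wer(reference, hypothesis):
    """Word error rate: word-level edit distance over the reference word count."""
    r, h = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between r[:i] and h[:j]
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i                       # deletions
    for j in range(len(h) + 1):
        dp[0][j] = j                       # insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = dp[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(r)][len(h)] / len(r)
```

An "absolute improvement of 2.70%" means, for example, moving from a WER of 10.00% to 7.30%, as opposed to a relative reduction.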

【9】 A comparative study of eight human auditory models of monaural processing 标题:8种人类单耳加工听觉模型的比较研究

作者:Alejandro Osses Vecchi,Léo Varnet,Laurel H. Carney,Torsten Dau,Ian C. Bruce,Sarah Verhulst,Piotr Majdak 机构:Laboratoire des systèmes perceptifs, Département d'études cognitives, École Normale Supérieure, PSL University, CNRS, Paris, France; Departments of Biomedical Engineering and Neuroscience, University of Rochester, Rochester, NY, USA 链接:https://arxiv.org/abs/2107.01753 摘要:许多听觉模型采用不同的方法(生理的或知觉的)开发而来,但由于受到听觉系统相同组成部分的启发,它们具有相近的信号处理阶段。我们比较了在听觉建模工具箱(Auditory Modelling Toolbox)中公开可用的八个单耳模型。我们讨论了使各模型输出可相互比较所需的考虑因素,以及以下模型处理阶段或其等效阶段的结果:外耳和中耳、耳蜗滤波器组、内毛细胞、听神经突触、耳蜗核和下丘。讨论还包括在双耳框架中使用这些单耳处理阶段的一些实际考虑。 摘要:A number of auditory models have been developed using diverging approaches, either physiological or perceptual, but they share comparable stages of signal processing, as they are inspired by the same constitutive parts of the auditory system. We compare eight monaural models that are openly accessible in the Auditory Modelling Toolbox. We discuss the considerations required to make the model outputs comparable to each other, as well as the results for the following model processing stages or their equivalents: outer and middle ear, cochlear filter bank, inner hair cell, auditory nerve synapse, cochlear nucleus, and inferior colliculus. The discussion includes some practical considerations related to the use of monaural stages in binaural frameworks.

【10】 EditSpeech: A Text Based Speech Editing System Using Partial Inference and Bidirectional Fusion 标题:EditSpeech:基于部分推理和双向融合的文本语音编辑系统

作者:Daxin Tan,Liqun Deng,Yu Ting Yeung,Xin Jiang,Xiao Chen,Tan Lee 机构:Department of Electronic Engineering, The Chinese University of Hong Kong, Hong Kong, Huawei Noah’s Ark Lab, Shenzhen, China 链接:https://arxiv.org/abs/2107.01554 摘要:本文介绍了语音编辑系统EditSpeech的设计、实现和评价。它允许用户对给定语音话语中的词进行删除、插入和替换,而不会造成语音质量和自然度的可闻下降。EditSpeech系统基于神经文本到语音(NTTS)合成框架开发。系统提出部分推理和双向融合,以有效融合与编辑区域相关的上下文信息,并在左右边界实现平滑过渡,从而减轻对话语未修改部分引入的失真。EditSpeech系统在多说话人场景下针对英文和中文进行了开发与评估。客观和主观评价表明,EditSpeech在低频谱失真和更受偏好的语音质量方面优于若干基线系统。音频样本可在线试听:https://daxintan-cuhk.github.io/EditSpeech/。 摘要:This paper presents the design, implementation and evaluation of a speech editing system, named EditSpeech, which allows a user to perform deletion, insertion and replacement of words in a given speech utterance, without causing audible degradation in speech quality and naturalness. The EditSpeech system is developed upon a neural text-to-speech (NTTS) synthesis framework. Partial inference and bidirectional fusion are proposed to effectively incorporate the contextual information related to the edited region and achieve smooth transition at both left and right boundaries. Distortion introduced to the unmodified parts of the utterance is alleviated. The EditSpeech system is developed and evaluated on English and Chinese in multi-speaker scenarios. Objective and subjective evaluation demonstrate that EditSpeech outperforms a few baseline systems in terms of low spectral distortion and preferred speech quality. Audio samples are available online for demonstration https://daxintan-cuhk.github.io/EditSpeech/.
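论文用部分推理与双向融合来保证编辑区域左右边界的平滑过渡。下面的线性交叉淡化(cross-fade)只是实现"边界平滑"这一目标的一个常见替代手法的纯Python示意,并非论文的双向融合算法本身:

```python
def crossfade(left, right, overlap):
    """在 left 尾部与 right 头部之间做 overlap 个样本的线性交叉淡化,
    示意把编辑区域拼接回原话语时如何平滑边界(通用做法,非论文方法)。"""
    out = list(left[:-overlap])
    for i in range(overlap):
        w = (i + 1) / (overlap + 1)  # 权重从 left 渐变到 right
        out.append((1 - w) * left[len(left) - overlap + i] + w * right[i])
    out.extend(right[overlap:])
    return out
```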

【11】 Towards Neural Diarization for Unlimited Numbers of Speakers Using Global and Local Attractors 标题:利用全局和局部吸引子实现不限说话人数的神经说话人日志

作者:Shota Horiguchi,Shinji Watanabe,Paola Garcia,Yawen Xue,Yuki Takashima,Yohei Kawaguchi 机构:Hitachi, Ltd., Japan, Carnegie Mellon University, USA, Johns Hopkins University, USA 链接:https://arxiv.org/abs/2107.01545 摘要:基于吸引子的端到端说话人日志(diarization)在具有挑战性的数据集上取得了与精心调优的传统聚类方法相当的精度。然而,其主要缺点是无法处理说话人数多于训练中所见数目的情况,因为其说话人计数依赖于监督学习。在这项工作中,我们在基于吸引子的端到端说话人日志中嵌入了一个无监督聚类过程。我们首先将逐帧嵌入序列切分为较短的子序列,然后对每个子序列执行基于吸引子的说话人日志。在得到各子序列的结果后,通过对由所有子序列的吸引子计算出的向量进行无监督聚类,得到子序列间的说话人对应关系。这样,即使每个子序列的输出说话人数有限,也能为整段录音生成包含大量说话人的日志结果。实验结果表明,该方法能对未见过数量的说话人产生准确的结果,在CALLHOME、DIHARD II和DIHARD III数据集上分别达到11.84%、28.33%和19.49%,均优于传统的端到端方法。 摘要:Attractor-based end-to-end diarization is achieving comparable accuracy to the carefully tuned conventional clustering-based methods on challenging datasets. However, the main drawback is that it cannot deal with the case where the number of speakers is larger than the one observed during training. This is because its speaker counting relies on supervised learning. In this work, we introduce an unsupervised clustering process embedded in the attractor-based end-to-end diarization. We first split a sequence of frame-wise embeddings into short subsequences and then perform attractor-based diarization for each subsequence. Given subsequence-wise diarization results, inter-subsequence speaker correspondence is obtained by unsupervised clustering of the vectors computed from the attractors from all the subsequences. This makes it possible to produce diarization results of a large number of speakers for the whole recording even if the number of output speakers for each subsequence is limited. Experimental results showed that our method could produce accurate diarization results of an unseen number of speakers. Our method achieved 11.84 %, 28.33 %, and 19.49 % on the CALLHOME, DIHARD II, and DIHARD III datasets, respectively, each of which is better than the conventional end-to-end diarization methods.
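按摘要的描述,整段录音的逐帧嵌入先被切分为短子序列,再对各子序列的吸引子向量做无监督聚类以对齐说话人。下面用纯Python给出一个贪心的余弦相似度聚类示意(阈值与贪心策略为本文假设,论文实际使用的聚类算法可能不同):

```python
import math

def split_subsequences(embeddings, length):
    """把逐帧嵌入序列切分为固定长度的子序列。"""
    return [embeddings[i:i + length] for i in range(0, len(embeddings), length)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def cluster_attractors(attractors, threshold=0.8):
    """贪心聚类示意:吸引子并入第一个足够相似的簇
    (以簇内首个成员为代表向量),否则新建一个全局说话人簇。"""
    clusters = []
    for idx, a in enumerate(attractors):
        for c in clusters:
            if cosine(a, c["rep"]) >= threshold:
                c["members"].append(idx)
                break
        else:
            clusters.append({"rep": list(a), "members": [idx]})
    return clusters
```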

【12】 TENET: A Time-reversal Enhancement Network for Noise-robust ASR 标题:TENET:一种面向噪声鲁棒ASR的时间反转增强网络

作者:Fu-An Chao,Shao-Wei Fan Jiang,Bi-Cheng Yan,Jeih-weih Hung,Berlin Chen 机构:National Taiwan Normal University, Taipei, Taiwan, National Chi Nan University, Nantou, Taiwan 备注:Submitted to ASRU 2021 链接:https://arxiv.org/abs/2107.01531 摘要:由于深度学习带来的前所未有的突破,语音增强(SE)技术得到了迅速发展,并在声学建模之前发挥着减轻噪声对语音影响的重要作用。为了提高语音的感知质量,SE领域目前的最新技术采用对抗性训练,将客观度量与鉴别器相连接。然而,优化语音的感知质量并不一定能提高自动语音识别(ASR)的性能。在这项研究中,我们提出了TENET,一种新的时间反转增强网络,它利用输入含噪信号本身的变换,即时间反转版本,并结合孪生(siamese)网络和复数双路径Transformer,以提高面向噪声鲁棒ASR的SE性能。在Voicebank-DEMAND数据集上进行的大量实验表明,在SE和ASR评估指标方面,与一些顶级方法相比,TENET可以获得最先进的结果。为了验证模型的泛化能力,我们进一步在受未见噪声污染的场景测试集上评估TENET,结果也证实了该方法的优越性。 摘要:Due to the unprecedented breakthroughs brought about by deep learning, speech enhancement (SE) techniques have been developed rapidly and play an important role prior to acoustic modeling to mitigate noise effects on speech. To increase the perceptual quality of speech, current state-of-the-art in the SE field adopts adversarial training by connecting an objective metric to the discriminator. However, there is no guarantee that optimizing the perceptual quality of speech will necessarily lead to improved automatic speech recognition (ASR) performance. In this study, we present TENET, a novel Time-reversal Enhancement NETwork, which leverages the transformation of an input noisy signal itself, i.e., the time-reversed version, in conjunction with the siamese network and complex dual-path transformer to promote SE performance for noise-robust ASR. Extensive experiments conducted on the Voicebank-DEMAND dataset show that TENET can achieve state-of-the-art results compared to a few top-of-the-line methods in terms of both SE and ASR evaluation metrics. To demonstrate the model generalization ability, we further evaluate TENET on the test set of scenarios contaminated with unseen noise, and the results also confirm the superiority of this promising method.
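TENET 的核心输入变换是信号的时间反转,并借助孪生(siamese)网络约束两路输出的一致性。下面的纯Python示意只演示这两个概念,其中一致性损失取均方差,为本文假设,并非论文原式:

```python
def time_reverse(signal):
    """返回波形(样本列表)的时间反转版本。"""
    return signal[::-1]

def siamese_consistency_loss(enhanced, enhanced_rev):
    """孪生一致性约束的示意:原始输入的增强结果,
    与时间反转输入的增强结果再反转回来之后,应当彼此接近。"""
    re_aligned = time_reverse(enhanced_rev)
    return sum((a - b) ** 2 for a, b in zip(enhanced, re_aligned)) / len(enhanced)
```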

【13】 Relaxed Attention: A Simple Method to Boost Performance of End-to-End Automatic Speech Recognition 标题:放松注意力:提高端到端自动语音识别性能的一种简单方法

作者:Timo Lohrenz,Patrick Schwarz,Zhengyang Li,Tim Fingscheidt 机构:Technische Universität Braunschweig, Institute for Communications Technology, Schleinitzstr. , Braunschweig, Germany 备注:submitted to ASRU 2021 链接:https://arxiv.org/abs/2107.01275 摘要:最近,基于注意力的编解码器(AED)模型在多个任务中展现了很高的端到端自动语音识别(ASR)性能。针对这类模型中的过度自信问题,本文引入了放松注意的概念,即在训练过程中向编码器-解码器注意权重逐步注入均匀分布,这只需两行代码即可实现。我们研究了放松注意在不同AED模型结构和两个重要ASR任务(华尔街日报(WSJ)和Librispeech)中的效果。我们发现,在使用外部语言模型解码时,用放松注意训练的Transformer一致优于标准基线模型。在WSJ上,我们为基于Transformer的端到端语音识别建立了新的基准:在只引入一个超参数的情况下,词错误率为3.65%,相对于现有最佳结果(4.20%)相对降低13.1%。论文录用后,模型将在GitHub上发布。 摘要:Recently, attention-based encoder-decoder (AED) models have shown high performance for end-to-end automatic speech recognition (ASR) across several tasks. Addressing overconfidence in such models, in this paper we introduce the concept of relaxed attention, which is a simple gradual injection of a uniform distribution to the encoder-decoder attention weights during training that is easily implemented with two lines of code. We investigate the effect of relaxed attention across different AED model architectures and two prominent ASR tasks, Wall Street Journal (WSJ) and Librispeech. We found that transformers trained with relaxed attention outperform the standard baseline models consistently during decoding with external language models. On WSJ, we set a new benchmark for transformer-based end-to-end speech recognition with a word error rate of 3.65%, outperforming state of the art (4.20%) by 13.1% relative, while introducing only a single hyperparameter. Upon acceptance, models will be published on github.
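摘要称放松注意只需"两行代码":训练时把均匀分布逐步注入编码器-解码器注意权重。下面用纯Python(代替张量库,仅作示意,非论文官方实现)演示这一插值:

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def relaxed_attention(scores, gamma):
    """放松注意:把标准注意权重与均匀分布按 gamma 插值。
    gamma 为论文引入的唯一超参数;gamma = 0 退化为标准 softmax 注意,
    训练中 gamma 按日程逐步调整(日程形式此处从略)。"""
    weights = softmax(scores)
    n = len(weights)
    # 即所谓"两行代码":向均匀分布 1/n 插值
    return [(1.0 - gamma) * w + gamma / n for w in weights]
```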

【14】 Dual Causal/Non-Causal Self-Attention for Streaming End-to-End Speech Recognition 标题:用于流式端到端语音识别的双重因果/非因果自我注意

作者:Niko Moritz,Takaaki Hori,Jonathan Le Roux 机构:Mitsubishi Electric Research Laboratories (MERL), Cambridge, MA, USA 备注:Accepted to Interspeech 2021 链接:https://arxiv.org/abs/2107.01269 摘要:基于注意力的端到端自动语音识别(ASR)系统最近在许多任务中取得了最先进的结果。然而,对于流式ASR而言,自注意力和基于注意力的编解码模型的应用仍然具有挑战性,因为每个单词在说出后不久就必须被识别。在这项工作中,我们提出了双重因果/非因果自注意力(DCN)架构,与受限自注意力相比,它能防止在深层架构中整体上下文前瞻超出单层的前瞻范围。我们在流式Transformer和Conformer架构上,将DCN与基于分块的和受限的自注意力进行了比较,结果表明,DCN的ASR性能优于受限自注意力,与基于分块的自注意力相比具有竞争力,同时提供了帧同步处理的优势。结合触发注意力,所提出的流式端到端ASR系统在LibriSpeech、HKUST和Switchboard ASR任务上取得了最新的成果。 摘要:Attention-based end-to-end automatic speech recognition (ASR) systems have recently demonstrated state-of-the-art results for numerous tasks. However, the application of self-attention and attention-based encoder-decoder models remains challenging for streaming ASR, where each word must be recognized shortly after it was spoken. In this work, we present the dual causal/non-causal self-attention (DCN) architecture, which in contrast to restricted self-attention prevents the overall context from growing beyond the look-ahead of a single layer when used in a deep architecture. DCN is compared to chunk-based and restricted self-attention using streaming transformer and conformer architectures, showing improved ASR performance over restricted self-attention and competitive ASR results compared to chunk-based self-attention, while providing the advantage of frame-synchronous processing. Combined with triggered attention, the proposed streaming end-to-end ASR systems obtained state-of-the-art results on the LibriSpeech, HKUST, and Switchboard ASR tasks.
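DCN 的要点是每层同时维护因果与非因果两路自注意力,从而整体前瞻不随层数增加。下面用纯Python构造这两路注意力的掩码(True 表示帧 i 可关注帧 j)作示意:

```python
def dcn_masks(n, look_ahead):
    """一层 DCN 的两路注意力掩码:因果流完全不看未来帧;
    非因果流最多前瞻 look_ahead 帧。深层叠加时,后续层消费
    因果流的输出,因此整体前瞻保持为单层的 look_ahead(示意性理解)。"""
    causal = [[j <= i for j in range(n)] for i in range(n)]
    non_causal = [[j <= i + look_ahead for j in range(n)] for i in range(n)]
    return causal, non_causal
```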

3.eess.AS音频处理:

【1】 Speech Synthesis from Text and Ultrasound Tongue Image-based Articulatory Input 标题:基于文本和超声舌图像发音输入的语音合成

作者:Tamás Gábor Csapó,László Tóth,Gábor Gosztolya,Alexandra Markó 机构:Department of Telecommunications and Media Informatics, Budapest University of Technology and Economics, Hungary; Institute of Informatics, University of Szeged, Hungary; Department of Applied Linguistics and Phonetics, Eötvös Loránd University 备注:accepted at SSW11 (11th Speech Synthesis Workshop) 链接:https://arxiv.org/abs/2107.02003 摘要:研究表明,发音信息可以有效地提高基于HMM和DNN的文本语音合成的性能。语音合成的研究传统上集中于文本到语音的转换,即输入为文本或估计的语言表示、目标是合成语音。然而,近十年来兴起的一个研究领域是发音到语音合成(其目标应用为无声语音接口,Silent Speech Interface,SSI),其目标是从发音器官运动的某种表征合成语音。在本文中,我们用从超声舌图像估计出的发音输入扩展了传统的(基于声码器的)DNN-TTS。我们比较了纯文本、纯超声以及二者组合的输入。利用八位说话人的数据,我们证明了在有限数据场景下,文本与发音输入相结合具有优势,即与仅用文本输入相比,它可以提高合成语音的自然度。此外,我们还分析了几位说话人的超声舌部录音,结果表明,超声换能器定位的偏差会对最终的合成性能产生负面影响。 摘要:Articulatory information has been shown to be effective in improving the performance of HMM-based and DNN-based text-to-speech synthesis. Speech synthesis research focuses traditionally on text-to-speech conversion, when the input is text or an estimated linguistic representation, and the target is synthesized speech. However, a research field that has risen in the last decade is articulation-to-speech synthesis (with a target application of a Silent Speech Interface, SSI), when the goal is to synthesize speech from some representation of the movement of the articulatory organs. In this paper, we extend traditional (vocoder-based) DNN-TTS with articulatory input, estimated from ultrasound tongue images. We compare text-only, ultrasound-only, and combined inputs. Using data from eight speakers, we show that the combined text and articulatory input can have advantages in limited-data scenarios, namely, it may increase the naturalness of synthesized speech compared to single text input. Besides, we analyze the ultrasound tongue recordings of several speakers, and show that misalignments in the ultrasound transducer positioning can have a negative effect on the final synthesis performance.

【2】 Investigation of Practical Aspects of Single Channel Speech Separation for ASR 标题:ASR中单通道语音分离的实用化研究

作者:Jian Wu,Zhuo Chen,Sanyuan Chen,Yu Wu,Takuya Yoshioka,Naoyuki Kanda,Shujie Liu,Jinyu Li 机构:Microsoft Corporation, Harbin Institute of Technology 备注:Accepted by Interspeech 2021 链接:https://arxiv.org/abs/2107.01922 摘要:语音分离由于其处理重叠语音的能力以及与自动语音识别(ASR)等下游任务相结合的灵活性,已经成功地应用于会话转录系统的前端处理模块。然而,语音分离模型往往会引入目标语音失真,导致次优的词错误率(WER)。在本文中,我们描述了我们为提高单通道语音分离系统的性能所做的努力。具体来说,我们研究了一个两阶段的训练方案:首先应用特征级优化准则进行预训练,然后使用端到端(E2E)语音识别模型进行面向ASR的优化。同时,为了保持模型的轻量级,我们引入了一种改进的师生学习模型压缩技术。通过这些方法的结合,我们在LibriCSS数据集上使用参数少于10M的模型,与之前的最新结果相比,在逐句评估和连续评估方面的绝对平均WER分别提高了2.70%和0.77%。 摘要:Speech separation has been successfully applied as a frontend processing module of conversation transcription systems thanks to its ability to handle overlapped speech and its flexibility to combine with downstream tasks such as automatic speech recognition (ASR). However, a speech separation model often introduces target speech distortion, resulting in a sub-optimum word error rate (WER). In this paper, we describe our efforts to improve the performance of a single channel speech separation system. Specifically, we investigate a two-stage training scheme that firstly applies a feature level optimization criterion for pretraining, followed by an ASR-oriented optimization criterion using an end-to-end (E2E) speech recognition model. Meanwhile, to keep the model light-weight, we introduce a modified teacher-student learning technique for model compression. By combining those approaches, we achieve an absolute average WER improvement of 2.70% and 0.77% using models with less than 10M parameters compared with the previous state-of-the-art results on the LibriCSS dataset for utterance-wise evaluation and continuous evaluation, respectively.

【3】 A comparative study of eight human auditory models of monaural processing 标题:8种人类单耳加工听觉模型的比较研究

作者:Alejandro Osses Vecchi,Léo Varnet,Laurel H. Carney,Torsten Dau,Ian C. Bruce,Sarah Verhulst,Piotr Majdak 机构:Laboratoire des systèmes perceptifs, Département d'études cognitives, École Normale Supérieure, PSL University, CNRS, Paris, France; Departments of Biomedical Engineering and Neuroscience, University of Rochester, Rochester, NY, USA 链接:https://arxiv.org/abs/2107.01753 摘要:许多听觉模型采用不同的方法(生理的或知觉的)开发而来,但由于受到听觉系统相同组成部分的启发,它们具有相近的信号处理阶段。我们比较了在听觉建模工具箱(Auditory Modelling Toolbox)中公开可用的八个单耳模型。我们讨论了使各模型输出可相互比较所需的考虑因素,以及以下模型处理阶段或其等效阶段的结果:外耳和中耳、耳蜗滤波器组、内毛细胞、听神经突触、耳蜗核和下丘。讨论还包括在双耳框架中使用这些单耳处理阶段的一些实际考虑。 摘要:A number of auditory models have been developed using diverging approaches, either physiological or perceptual, but they share comparable stages of signal processing, as they are inspired by the same constitutive parts of the auditory system. We compare eight monaural models that are openly accessible in the Auditory Modelling Toolbox. We discuss the considerations required to make the model outputs comparable to each other, as well as the results for the following model processing stages or their equivalents: outer and middle ear, cochlear filter bank, inner hair cell, auditory nerve synapse, cochlear nucleus, and inferior colliculus. The discussion includes some practical considerations related to the use of monaural stages in binaural frameworks.

【4】 EditSpeech: A Text Based Speech Editing System Using Partial Inference and Bidirectional Fusion 标题:EditSpeech:基于部分推理和双向融合的文本语音编辑系统

作者:Daxin Tan,Liqun Deng,Yu Ting Yeung,Xin Jiang,Xiao Chen,Tan Lee 机构:Department of Electronic Engineering, The Chinese University of Hong Kong, Hong Kong, Huawei Noah’s Ark Lab, Shenzhen, China 链接:https://arxiv.org/abs/2107.01554 摘要:本文介绍了语音编辑系统EditSpeech的设计、实现和评价。它允许用户对给定语音话语中的词进行删除、插入和替换,而不会造成语音质量和自然度的可闻下降。EditSpeech系统基于神经文本到语音(NTTS)合成框架开发。系统提出部分推理和双向融合,以有效融合与编辑区域相关的上下文信息,并在左右边界实现平滑过渡,从而减轻对话语未修改部分引入的失真。EditSpeech系统在多说话人场景下针对英文和中文进行了开发与评估。客观和主观评价表明,EditSpeech在低频谱失真和更受偏好的语音质量方面优于若干基线系统。音频样本可在线试听:https://daxintan-cuhk.github.io/EditSpeech/。 摘要:This paper presents the design, implementation and evaluation of a speech editing system, named EditSpeech, which allows a user to perform deletion, insertion and replacement of words in a given speech utterance, without causing audible degradation in speech quality and naturalness. The EditSpeech system is developed upon a neural text-to-speech (NTTS) synthesis framework. Partial inference and bidirectional fusion are proposed to effectively incorporate the contextual information related to the edited region and achieve smooth transition at both left and right boundaries. Distortion introduced to the unmodified parts of the utterance is alleviated. The EditSpeech system is developed and evaluated on English and Chinese in multi-speaker scenarios. Objective and subjective evaluation demonstrate that EditSpeech outperforms a few baseline systems in terms of low spectral distortion and preferred speech quality. Audio samples are available online for demonstration https://daxintan-cuhk.github.io/EditSpeech/.

【5】 Towards Neural Diarization for Unlimited Numbers of Speakers Using Global and Local Attractors 标题:利用全局和局部吸引子实现不限说话人数的神经说话人日志

作者:Shota Horiguchi,Shinji Watanabe,Paola Garcia,Yawen Xue,Yuki Takashima,Yohei Kawaguchi 机构:Hitachi, Ltd., Japan, Carnegie Mellon University, USA, Johns Hopkins University, USA 链接:https://arxiv.org/abs/2107.01545 摘要:基于吸引子的端到端说话人日志(diarization)在具有挑战性的数据集上取得了与精心调优的传统聚类方法相当的精度。然而,其主要缺点是无法处理说话人数多于训练中所见数目的情况,因为其说话人计数依赖于监督学习。在这项工作中,我们在基于吸引子的端到端说话人日志中嵌入了一个无监督聚类过程。我们首先将逐帧嵌入序列切分为较短的子序列,然后对每个子序列执行基于吸引子的说话人日志。在得到各子序列的结果后,通过对由所有子序列的吸引子计算出的向量进行无监督聚类,得到子序列间的说话人对应关系。这样,即使每个子序列的输出说话人数有限,也能为整段录音生成包含大量说话人的日志结果。实验结果表明,该方法能对未见过数量的说话人产生准确的结果,在CALLHOME、DIHARD II和DIHARD III数据集上分别达到11.84%、28.33%和19.49%,均优于传统的端到端方法。 摘要:Attractor-based end-to-end diarization is achieving comparable accuracy to the carefully tuned conventional clustering-based methods on challenging datasets. However, the main drawback is that it cannot deal with the case where the number of speakers is larger than the one observed during training. This is because its speaker counting relies on supervised learning. In this work, we introduce an unsupervised clustering process embedded in the attractor-based end-to-end diarization. We first split a sequence of frame-wise embeddings into short subsequences and then perform attractor-based diarization for each subsequence. Given subsequence-wise diarization results, inter-subsequence speaker correspondence is obtained by unsupervised clustering of the vectors computed from the attractors from all the subsequences. This makes it possible to produce diarization results of a large number of speakers for the whole recording even if the number of output speakers for each subsequence is limited. Experimental results showed that our method could produce accurate diarization results of an unseen number of speakers. Our method achieved 11.84 %, 28.33 %, and 19.49 % on the CALLHOME, DIHARD II, and DIHARD III datasets, respectively, each of which is better than the conventional end-to-end diarization methods.

【6】 TENET: A Time-reversal Enhancement Network for Noise-robust ASR 标题:TENET:一种面向噪声鲁棒ASR的时间反转增强网络

作者:Fu-An Chao,Shao-Wei Fan Jiang,Bi-Cheng Yan,Jeih-weih Hung,Berlin Chen 机构:National Taiwan Normal University, Taipei, Taiwan, National Chi Nan University, Nantou, Taiwan 备注:Submitted to ASRU 2021 链接:https://arxiv.org/abs/2107.01531 摘要:由于深度学习带来的前所未有的突破,语音增强(SE)技术得到了迅速发展,并在声学建模之前发挥着减轻噪声对语音影响的重要作用。为了提高语音的感知质量,SE领域目前的最新技术采用对抗性训练,将客观度量与鉴别器相连接。然而,优化语音的感知质量并不一定能提高自动语音识别(ASR)的性能。在这项研究中,我们提出了TENET,一种新的时间反转增强网络,它利用输入含噪信号本身的变换,即时间反转版本,并结合孪生(siamese)网络和复数双路径Transformer,以提高面向噪声鲁棒ASR的SE性能。在Voicebank-DEMAND数据集上进行的大量实验表明,在SE和ASR评估指标方面,与一些顶级方法相比,TENET可以获得最先进的结果。为了验证模型的泛化能力,我们进一步在受未见噪声污染的场景测试集上评估TENET,结果也证实了该方法的优越性。 摘要:Due to the unprecedented breakthroughs brought about by deep learning, speech enhancement (SE) techniques have been developed rapidly and play an important role prior to acoustic modeling to mitigate noise effects on speech. To increase the perceptual quality of speech, current state-of-the-art in the SE field adopts adversarial training by connecting an objective metric to the discriminator. However, there is no guarantee that optimizing the perceptual quality of speech will necessarily lead to improved automatic speech recognition (ASR) performance. In this study, we present TENET, a novel Time-reversal Enhancement NETwork, which leverages the transformation of an input noisy signal itself, i.e., the time-reversed version, in conjunction with the siamese network and complex dual-path transformer to promote SE performance for noise-robust ASR. Extensive experiments conducted on the Voicebank-DEMAND dataset show that TENET can achieve state-of-the-art results compared to a few top-of-the-line methods in terms of both SE and ASR evaluation metrics. To demonstrate the model generalization ability, we further evaluate TENET on the test set of scenarios contaminated with unseen noise, and the results also confirm the superiority of this promising method.

【7】 Relaxed Attention: A Simple Method to Boost Performance of End-to-End Automatic Speech Recognition 标题:放松注意力:提高端到端自动语音识别性能的一种简单方法

作者:Timo Lohrenz,Patrick Schwarz,Zhengyang Li,Tim Fingscheidt 机构:Technische Universität Braunschweig, Institute for Communications Technology, Schleinitzstr. , Braunschweig, Germany 备注:submitted to ASRU 2021 链接:https://arxiv.org/abs/2107.01275 摘要:最近,基于注意力的编解码器(AED)模型在多个任务中展现了很高的端到端自动语音识别(ASR)性能。针对这类模型中的过度自信问题,本文引入了放松注意的概念,即在训练过程中向编码器-解码器注意权重逐步注入均匀分布,这只需两行代码即可实现。我们研究了放松注意在不同AED模型结构和两个重要ASR任务(华尔街日报(WSJ)和Librispeech)中的效果。我们发现,在使用外部语言模型解码时,用放松注意训练的Transformer一致优于标准基线模型。在WSJ上,我们为基于Transformer的端到端语音识别建立了新的基准:在只引入一个超参数的情况下,词错误率为3.65%,相对于现有最佳结果(4.20%)相对降低13.1%。论文录用后,模型将在GitHub上发布。 摘要:Recently, attention-based encoder-decoder (AED) models have shown high performance for end-to-end automatic speech recognition (ASR) across several tasks. Addressing overconfidence in such models, in this paper we introduce the concept of relaxed attention, which is a simple gradual injection of a uniform distribution to the encoder-decoder attention weights during training that is easily implemented with two lines of code. We investigate the effect of relaxed attention across different AED model architectures and two prominent ASR tasks, Wall Street Journal (WSJ) and Librispeech. We found that transformers trained with relaxed attention outperform the standard baseline models consistently during decoding with external language models. On WSJ, we set a new benchmark for transformer-based end-to-end speech recognition with a word error rate of 3.65%, outperforming state of the art (4.20%) by 13.1% relative, while introducing only a single hyperparameter. Upon acceptance, models will be published on github.

【8】 Dual Causal/Non-Causal Self-Attention for Streaming End-to-End Speech Recognition 标题:用于流式端到端语音识别的双重因果/非因果自我注意

作者:Niko Moritz,Takaaki Hori,Jonathan Le Roux 机构:Mitsubishi Electric Research Laboratories (MERL), Cambridge, MA, USA 备注:Accepted to Interspeech 2021 链接:https://arxiv.org/abs/2107.01269 摘要:基于注意力的端到端自动语音识别(ASR)系统最近在许多任务中取得了最先进的结果。然而,对于流式ASR而言,自注意力和基于注意力的编解码模型的应用仍然具有挑战性,因为每个单词在说出后不久就必须被识别。在这项工作中,我们提出了双重因果/非因果自注意力(DCN)架构,与受限自注意力相比,它能防止在深层架构中整体上下文前瞻超出单层的前瞻范围。我们在流式Transformer和Conformer架构上,将DCN与基于分块的和受限的自注意力进行了比较,结果表明,DCN的ASR性能优于受限自注意力,与基于分块的自注意力相比具有竞争力,同时提供了帧同步处理的优势。结合触发注意力,所提出的流式端到端ASR系统在LibriSpeech、HKUST和Switchboard ASR任务上取得了最新的成果。 摘要:Attention-based end-to-end automatic speech recognition (ASR) systems have recently demonstrated state-of-the-art results for numerous tasks. However, the application of self-attention and attention-based encoder-decoder models remains challenging for streaming ASR, where each word must be recognized shortly after it was spoken. In this work, we present the dual causal/non-causal self-attention (DCN) architecture, which in contrast to restricted self-attention prevents the overall context from growing beyond the look-ahead of a single layer when used in a deep architecture. DCN is compared to chunk-based and restricted self-attention using streaming transformer and conformer architectures, showing improved ASR performance over restricted self-attention and competitive ASR results compared to chunk-based self-attention, while providing the advantage of frame-synchronous processing. Combined with triggered attention, the proposed streaming end-to-end ASR systems obtained state-of-the-art results on the LibriSpeech, HKUST, and Switchboard ASR tasks.

【9】 DeepRapper: Neural Rap Generation with Rhyme and Rhythm Modeling 标题:DeepRapper:基于押韵和节奏建模的神经说唱生成

作者:Lanqing Xue,Kaitao Song,Duocai Wu,Xu Tan,Nevin L. Zhang,Tao Qin,Wei-Qiang Zhang,Tie-Yan Liu 机构:The Hong Kong University of Science and Technology, Nanjing University of Science and Technology, Fudan University, Microsoft Research Asia, Tsinghua University 备注:Accepted by ACL 2021 main conference 链接:https://arxiv.org/abs/2107.01875 摘要:说唱生成的目的是产生歌词和相应的歌唱节拍,因而需要同时对押韵和节奏建模。以前的说唱生成工作主要集中在押韵的歌词上,而忽略了对说唱表演很重要的节奏。在本文中,我们开发了DeepRapper,一个基于Transformer的说唱生成系统,可以同时建模押韵和节奏。由于没有带节拍标注的说唱数据集,我们开发了一个数据挖掘管道来收集大规模说唱数据集,其中包括大量歌词与节拍对齐的说唱歌曲。其次,我们设计了一个基于Transformer的自回归语言模型,对押韵和节奏进行了细致的建模。具体地说,我们以相反的顺序生成歌词,用押韵表示和约束来增强押韵,并在歌词中插入节拍符号来进行节奏/节拍建模。据我们所知,DeepRapper是第一个能同时生成押韵和节奏的说唱系统。客观和主观评价都表明,DeepRapper能生成兼具押韵与节奏、富有创意且高质量的说唱。代码将在GitHub上发布。 摘要:Rap generation, which aims to produce lyrics and corresponding singing beats, needs to model both rhymes and rhythms. Previous works for rap generation focused on rhyming lyrics but ignored rhythmic beats, which are important for rap performance. In this paper, we develop DeepRapper, a Transformer-based rap generation system that can model both rhymes and rhythms. Since there is no available rap dataset with rhythmic beats, we develop a data mining pipeline to collect a large-scale rap dataset, which includes a large number of rap songs with aligned lyrics and rhythmic beats. Second, we design a Transformer-based autoregressive language model which carefully models rhymes and rhythms. Specifically, we generate lyrics in the reverse order with rhyme representation and constraint for rhyme enhancement and insert a beat symbol into lyrics for rhythm/beat modeling. To our knowledge, DeepRapper is the first system to generate rap with both rhymes and rhythms. Both objective and subjective evaluations demonstrate that DeepRapper generates creative and high-quality raps with rhymes and rhythms. Code will be released on GitHub.
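摘要描述了两个数据处理要点:以相反顺序生成歌词(使行尾押韵词先出现),以及在歌词中插入节拍符号。下面的纯Python示意演示这种目标序列的构造([BEAT] 符号名与插入约定为本文假设,并非论文的原始词表):

```python
def prepare_rap_line(words, beat_positions):
    """构造 DeepRapper 式的目标序列:先在节拍对齐的词前插入
    [BEAT] 符号,再整体反转,使押韵词(行尾词)在生成时先出现。"""
    with_beats = []
    for i, w in enumerate(words):
        if i in beat_positions:
            with_beats.append("[BEAT]")  # 节拍符号,名称为示意性假设
        with_beats.append(w)
    return list(reversed(with_beats))
```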

【10】 Arabic Code-Switching Speech Recognition using Monolingual Data 标题:使用单语数据的阿拉伯语码转换语音识别

作者:Ahmed Ali,Shammur Chowdhury,Amir Hussein,Yasser Hifny 机构:Qatar Computing Research Institute, HBKU, Doha, Qatar, Kanari AI, California, USA, University of Helwan, Egypt 备注:Accepted in Interspeech 2021, speech recognition, code-switching, ASR, transformer, WFST, graph approach 链接:https://arxiv.org/abs/2107.01573 摘要:在全球化的背景下,自动语音识别(ASR)中的语码转换是一个重要的挑战。最近对多语种ASR的研究表明,多语种系统相比单语系统有潜在的改进。我们通过一系列大规模ASR实验,研究了ASR多语言建模的关键问题。我们的创新框架在加权有限状态转换器(WFST)框架中部署了一种多图方法。我们将我们的WFST解码策略与在相同数据上训练的Transformer序列到序列系统进行了比较。在阿拉伯语与英语语码转换的场景下,我们的结果表明WFST解码方法更适合句子间语码转换数据集,而Transformer系统在句内语码转换任务中表现更好。在这项研究中,我们发布了人工生成的开发集和测试集,以及生态语码转换测试集,以对ASR性能进行基准测试。 摘要:Code-switching in automatic speech recognition (ASR) is an important challenge due to globalization. Recent research in multilingual ASR shows potential improvement over monolingual systems. We study key issues related to multilingual modeling for ASR through a series of large-scale ASR experiments. Our innovative framework deploys a multi-graph approach in the weighted finite state transducers (WFST) framework. We compare our WFST decoding strategies with a transformer sequence to sequence system trained on the same data. Given a code-switching scenario between Arabic and English languages, our results show that the WFST decoding approaches were more suitable for the intersentential code-switching datasets. In addition, the transformer system performed better for intrasentential code-switching task. With this study, we release artificially generated development and test sets, along with an ecological code-switching test set, to benchmark the ASR performance.

【11】 Unified Autoregressive Modeling for Joint End-to-End Multi-Talker Overlapped Speech Recognition and Speaker Attribute Estimation 标题:联合端到端多人重叠语音识别和说话人属性估计的统一自回归建模

作者:Ryo Masumura,Daiki Okamura,Naoki Makishima,Mana Ihori,Akihiko Takashima,Tomohiro Tanaka,Shota Orihashi 机构:† NTT Media Intelligence Laboratories, NTT Corporation, Japan, ‡ Nagaoka University of Technology, Japan 备注:Accepted at Interspeech 2021 链接:https://arxiv.org/abs/2107.01549 摘要:本文提出了一种新的单通道多说话人重叠自动语音识别(ASR)系统建模方法。完全基于神经网络的端到端模型极大地提高了多说话人重叠ASR任务的性能。端到端建模的一种很有前途的方法是带串行输出训练的自回归建模,其中多个说话人的转录一个接一个地递归生成,这使我们能够自然地捕捉说话人之间的关系。然而,传统的建模方法不能明确地考虑个体话语的说话人属性,如性别和年龄信息。事实上,当各说话人性别相同或年龄相近时,识别性能就会下降。为了解决这个问题,我们提出了联合端到端多说话人重叠ASR与说话人属性估计的统一自回归建模。我们的核心思想是在统一的自回归模型中处理性别和年龄估计任务。在该方法中,基于Transformer的自回归模型不仅递归地生成文本标记,而且递归地生成每个说话人的属性标记,这使我们能够有效地利用说话人属性来改进多说话人重叠ASR。在日语多说话人重叠ASR任务上的实验表明了该方法的有效性。 摘要:In this paper, we present a novel modeling method for single-channel multi-talker overlapped automatic speech recognition (ASR) systems. Fully neural network based end-to-end models have dramatically improved the performance of multi-talker overlapped ASR tasks. One promising approach for end-to-end modeling is autoregressive modeling with serialized output training in which transcriptions of multiple speakers are recursively generated one after another. This enables us to naturally capture relationships between speakers. However, the conventional modeling method cannot explicitly take into account the speaker attributes of individual utterances such as gender and age information. In fact, the performance deteriorates when each speaker is the same gender or is close in age. To address this problem, we propose unified autoregressive modeling for joint end-to-end multi-talker overlapped ASR and speaker attribute estimation. Our key idea is to handle gender and age estimation tasks within the unified autoregressive modeling. In the proposed method, transformer-based autoregressive model recursively generates not only textual tokens but also attribute tokens of each speaker. This enables us to effectively utilize speaker attributes for improving multi-talker overlapped ASR. Experiments on Japanese multi-talker overlapped ASR tasks demonstrate the effectiveness of the proposed method.
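统一自回归建模把每位说话人的属性标记与文本标记串行化到同一目标序列中。下面用纯Python示意这种序列的构造(<sc>、<gender:...>、<age:...> 等标记名均为本文假设,并非论文的原始词表):

```python
def serialize_with_attributes(utterances):
    """构造多说话人串行输出训练的目标序列:每位说话人的
    属性标记(性别、年龄段)置于其转录之前,说话人之间
    用 <sc>(说话人切换)分隔。"""
    tokens = []
    for i, utt in enumerate(utterances):
        if i > 0:
            tokens.append("<sc>")
        tokens.append(f"<gender:{utt['gender']}>")
        tokens.append(f"<age:{utt['age']}>")
        tokens.extend(utt["words"])
    return tokens
```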

【12】 Development of a Conversation State Recognition System 标题:一种会话状态识别系统的开发

作者:Sujay Uday Rittikar 机构:(DKTE's Textile and Engineering Institute, India) 链接:https://arxiv.org/abs/2107.01462 摘要:随着基于LSTM的说话人日志(speaker diarization)概念的发展,理解输入音频流中特定段的说话人身份相对而言比手动标注数据更容易。在此基础上,非常值得考虑利用已识别的说话人身份来辅助识别会话中的说话人状态。在这项研究中,我们使用马尔可夫链来识别和更新同一组说话人之间后续会话的说话人状态,以便在最自然、最长的会话中识别他们的状态。该模型基于两个数据集中三位及以上说话人自然对话的若干音频样本,识别状态的总错误百分比小于或等于12%。研究结果表明,所提出的对说话人日志的扩展能够有效地预测会话状态。 摘要:With the evolution of the concept of Speaker diarization using LSTM, it is relatively easier to understand the speaker identities for specific segments of input audio stream data than manually tagging the data. With such a concept, it is highly desirable to consider the possibility of using the identified speaker identities to aid in recognizing the Speaker States in a conversation. In this study, the Markov Chains are used to identify and update the Speaker States for the next conversations between the same set of speakers, to enable identification of their states in the most natural and long conversations. The model is based on several audio samples from natural conversations of three or more speakers in two datasets with overall total error percentages for recognized states being less than or equal to 12%. The findings imply that the proposed extension to the Speaker diarization is effective to predict the states for a conversation.
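该研究用马尔可夫链对同一组说话人后续会话的状态进行识别与更新。下面用纯Python示意其基本构件:从观测到的状态序列估计转移概率,并预测最可能的下一状态(仅为马尔可夫链的通用实现示意):

```python
from collections import defaultdict

def transition_matrix(state_sequence):
    """从观测到的说话人状态序列估计马尔可夫转移概率。"""
    counts = defaultdict(lambda: defaultdict(int))
    for prev, nxt in zip(state_sequence, state_sequence[1:]):
        counts[prev][nxt] += 1
    probs = {}
    for prev, nxts in counts.items():
        total = sum(nxts.values())
        probs[prev] = {s: c / total for s, c in nxts.items()}
    return probs

def most_likely_next(probs, current):
    """返回当前状态下转移概率最大的下一状态。"""
    return max(probs[current], key=probs[current].get)
```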

【13】 A Lottery Ticket Hypothesis Framework for Low-Complexity Device-Robust Neural Acoustic Scene Classification

Authors: Chao-Han Huck Yang, Hu Hu, Sabato Marco Siniscalchi, Qing Wang, Yuyang Wang, Xianjun Xia, Yuanjun Zhao, Yuzhong Wu, Yannan Wang, Jun Du, Chin-Hui Lee Affiliations: School of Electrical and Computer Engineering, Georgia Institute of Technology, GA, USA; Computer Engineering School, University of Enna Kore, Italy; University of Science and Technology of China, Hefei, China; Tencent Media Lab, Tencent Corporation, China Link: https://arxiv.org/abs/2107.01461 Abstract: We propose a novel neural model compression strategy combining data augmentation, knowledge transfer, pruning, and quantization for device-robust acoustic scene classification (ASC). Specifically, we tackle the ASC task in a low-resource environment by leveraging a recently proposed advanced neural network pruning mechanism, namely the Lottery Ticket Hypothesis (LTH), to find a sub-network neural model associated with a small number of non-zero model parameters. The effectiveness of LTH for low-complexity acoustic modeling is assessed by investigating various data augmentation and compression schemes, and we report an efficient joint framework for low-complexity multi-device ASC, called Acoustic Lottery. Acoustic Lottery could compress an ASC model by a factor of over $10^{4}$ and attain superior performance (validation accuracy of 74.01% and log loss of 0.76) compared to its uncompressed seed model. All results reported in this work are based on a joint effort of four groups, namely GT-USTC-UKE-Tencent, aiming to address the "Low-Complexity Acoustic Scene Classification (ASC) with Multiple Devices" task in the DCASE 2021 Challenge Task 1a.
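One round of the Lottery Ticket Hypothesis pruning used above can be sketched as follows; this is illustrative only (the Acoustic Lottery framework also uses data augmentation, knowledge transfer, and quantization, all omitted here), with plain lists standing in for flattened network parameters:

```python
# Hypothetical sketch of one LTH pruning round: mask out the
# smallest-magnitude trained weights, then rewind the survivors to
# their values at initialization (the "winning ticket").

def prune_by_magnitude(trained, init, sparsity):
    """Return a binary mask pruning the smallest-magnitude fraction of
    trained weights, and the rewound (masked initial) weights."""
    n_prune = int(len(trained) * sparsity)
    ranked = sorted(range(len(trained)), key=lambda i: abs(trained[i]))
    pruned = set(ranked[:n_prune])  # indices of the smallest weights
    mask = [0 if i in pruned else 1 for i in range(len(trained))]
    ticket = [w0 * m for w0, m in zip(init, mask)]
    return mask, ticket

init = [0.5, -1.2, 0.1, 0.9, -0.05, 2.0]      # weights at initialization
trained = [0.4, -1.5, 0.02, 1.1, -0.01, 2.3]  # weights after training
mask, ticket = prune_by_magnitude(trained, init, sparsity=0.5)
print(mask)  # [0, 1, 0, 1, 0, 1]
```

Repeating train-prune-rewind rounds drives sparsity up iteratively, which is how very high compression ratios are reached.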

【14】 The HCCL Speaker Verification System for Far-Field Speaker Verification Challenge

Authors: Zhuo Li, Ce Fang, Runqiu Xiao, Zhigao Chen, Wenchao Wang, Yonghong Yan Affiliations: Key Laboratory of Speech Acoustics and Content Understanding, Institute of Acoustics, Chinese Academy of Sciences; University of Chinese Academy of Sciences, Beijing, China; Xinjiang Key Laboratory of Minority Speech and Language Information Processing, Xinjiang Link: https://arxiv.org/abs/2107.01329 Abstract: This paper describes the systems submitted by team HCCL to the Far-Field Speaker Verification Challenge. Our previous work in the AIshell Speaker Verification Challenge 2019 showed that the powerful modeling abilities of neural network architectures can provide exceptional performance for this kind of task. Therefore, in this challenge, we focus on constructing deep neural network architectures based on TDNN, ResNet, and Res2Net blocks. Most of the developed systems consist of neural network embeddings combined with a PLDA backend. Firstly, the speed perturbation method is applied to augment the data, and significant performance improvements are achieved. Then, we explore the use of the AM-Softmax loss function and propose to add a CE-loss branch when training models with the AM-Softmax loss. In addition, the impact of score normalization on performance is also investigated. The final system, a fusion of four systems, achieves minDCF 0.5342 and EER 5.05% on the task 1 evaluation set, and minDCF 0.5193 and EER 5.47% on the task 3 evaluation set.
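One common form of the score normalization investigated above is symmetric normalization (S-norm), sketched below; this is an assumed textbook formulation, not the HCCL team's code, and the cohort scores are hypothetical:

```python
# Hypothetical sketch of symmetric score normalization (S-norm) for
# speaker verification: a raw trial score is normalized against cohorts
# of impostor scores on both the enrollment and test sides.
import statistics

def z_norm(score, cohort_scores):
    """Shift and scale a trial score by the cohort's mean and std."""
    mu = statistics.mean(cohort_scores)
    sigma = statistics.stdev(cohort_scores)
    return (score - mu) / sigma

def s_norm(score, enroll_cohort, test_cohort):
    """Average of the enrollment-side and test-side normalized scores."""
    return 0.5 * (z_norm(score, enroll_cohort) + z_norm(score, test_cohort))

# Hypothetical cohort scores for one verification trial
enroll_cohort = [0.10, 0.20, 0.30, 0.40]
test_cohort = [0.00, 0.20, 0.40, 0.60]
print(round(s_norm(0.8, enroll_cohort, test_cohort), 3))
```

Normalizing scores this way makes a single decision threshold more stable across enrollment conditions, which matters in far-field settings.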
