Academic Digest: Finance / Speech / Audio Processing [12.22]

2021-12-24 08:53:24

q-fin (Quantitative Finance): 4 papers in total

cs.SD (Sound/Speech): 7 papers in total

eess.AS (Audio and Speech Processing): 8 papers in total

1. q-fin (Quantitative Finance):

【1】 Role of Variable Renewable Energy Penetration on Electricity Price and its Volatility Across Independent System Operators in the United States. Link: https://arxiv.org/abs/2112.11338

Authors: Olukunle O. Owolabi, Toryn L. J. Schafer, Georgia E. Smits, Sanhita Sengupta, Sean E. Ryan, Lan Wang, David S. Matteson, Mila Getmansky Sherman, Deborah A. Sunter
Affiliations: Department of Mechanical Engineering, Tufts University, USA; Department of Statistics and Data Science, Cornell University, USA; School of Statistics, University of Minnesota, USA; Department of Management Science, University of Miami, USA
Abstract: The U.S. electrical grid has undergone substantial transformation with increased penetration of wind and solar -- forms of variable renewable energy (VRE). Despite the benefits of VRE for decarbonization, it has garnered some controversy for inducing unwanted effects in regional electricity markets. In this study, we examine the role of VRE penetration on the system electricity price and price volatility based on hourly, real-time, historical data from six Independent System Operators in the U.S. using quantile and skew t-distribution regressions. After correcting for temporal effects, we observe a decrease in price, with non-linear effects on price volatility, for an increase in VRE penetration. These results are consistent with the modern portfolio theory where diverse volatile assets may lead to more stable and less risky portfolios.
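
The study's central fit is a quantile regression of hourly price (and volatility) on VRE penetration with temporal controls. A minimal sketch of such a fit with statsmodels, assuming a DataFrame with hypothetical columns price, vre_share, and hour (column names are illustrative, not from the paper):

```python
# Quantile regression of electricity price on VRE penetration (illustrative).
# Assumes a DataFrame `df` with columns: price, vre_share, hour (hypothetical names).
import pandas as pd
import statsmodels.formula.api as smf

def fit_price_quantiles(df: pd.DataFrame, quantiles=(0.1, 0.5, 0.9)):
    results = {}
    for q in quantiles:
        # C(hour) adds hour-of-day dummies as a simple temporal control.
        model = smf.quantreg("price ~ vre_share + C(hour)", df)
        results[q] = model.fit(q=q)
    return results

# Each fitted result exposes the VRE coefficient at that quantile, e.g.:
# results[0.5].params["vre_share"]
```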

【2】 On the decomposition of an insurer's profits and losses. Link: https://arxiv.org/abs/2112.11265

Authors: Marcus C. Christiansen
Affiliations: Institut für Mathematik, Carl von Ossietzky Universität Oldenburg, Oldenburg, Germany
Abstract: Current reporting standards for insurers require a decomposition of observed profits and losses in such a way that changes in the insurer's balance sheet can be attributed to specified risk factors. Generating such a decomposition is a nontrivial task because balance sheets generally depend on the risk factors in a non-linear way. This paper starts from an axiomatic perspective on profit and loss decompositions and finds that the axioms necessarily lead to infinitesimal sequential updating (ISU) decompositions, provided that the latter exist and are stable, whereas the current practice is rather to use sequential updating (SU) decompositions. The generality of the axiomatic approach makes the results useful also beyond insurance applications wherever profits and losses shall be additively decomposed in a risk-oriented manner.
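
The sequential-updating (SU) scheme that the paper contrasts with its axiomatically derived ISU limit can be made concrete in a few lines: update one risk factor at a time and attribute each resulting balance-sheet change to that factor. A minimal sketch, assuming the caller supplies the balance-sheet function and factor vectors (the ISU decomposition is the limit of this scheme over ever finer time grids):

```python
# Sequential updating (SU) decomposition of a profit/loss (illustrative).
# balance_sheet: callable mapping a sequence of risk-factor values to a number.
from typing import Callable, Sequence

def su_decomposition(balance_sheet: Callable[[Sequence[float]], float],
                     old: Sequence[float], new: Sequence[float]) -> list:
    """Attribute balance_sheet(new) - balance_sheet(old) to each factor
    by updating factors one at a time, in a fixed order."""
    state = list(old)
    contributions = []
    for i in range(len(old)):
        before = balance_sheet(state)
        state[i] = new[i]
        contributions.append(balance_sheet(state) - before)
    return contributions

# The contributions telescope to the total change exactly, but they depend
# on the update order unless the balance sheet is additive across factors.
```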

【3】 Estimating economic severity of Air Traffic Flow Management regulations. Link: https://arxiv.org/abs/2112.11263

Authors: Luis Delgado, Gérald Gurtner, Tatjana Bolić, Lorenzo Castelli
Affiliations: University of Westminster, Architecture and Cities, Marylebone Road, London, UK; Università degli Studi di Trieste, Dipartimento di Ingegneria e Architettura, Via A. Valerio, Trieste, Italy
Abstract: The development of trajectory-based operations and the rolling network operations plan in the European air traffic management network implies a move towards more collaborative, strategic flight planning. This opens up the possibility for inclusion of additional information in the collaborative decision-making process. With that in mind, we define the indicator for the economic risk of network elements (e.g., sectors or airports) as the expected costs that the elements impose on airspace users due to Air Traffic Flow Management (ATFM) regulations. The definition of the indicator is based on the analysis of historical ATFM regulation data, which provides an indication of the risk of accruing delay. This risk of delay is translated into a monetary risk for the airspace users, creating the new metric of the economic risk of a given airspace element. We then use machine learning techniques to find the parameters leading to this economic risk. The metric is accompanied by an indication of the accuracy of the delay cost prediction model. Lastly, the economic risk is transformed into a qualitative economic severity classification. The economic risks and consequently economic severity can be estimated for different temporal horizons and time periods, providing an indicator which can be used by Air Navigation Service Providers to identify areas which might need the implementation of strategic measures (e.g., resectorisation or capacity provision change), and by Airspace Users to consider operation of routes which use specific airspace regions.
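
At its core, the indicator is an expectation: the monetary risk of a network element is its historical ATFM delay distribution weighted by a cost-of-delay model, then binned into severity classes. A minimal sketch of that aggregation, with a hypothetical linear cost rate and hypothetical severity bands (the paper's cost model and ML-based parameter analysis are more involved):

```python
# Economic risk of an airspace element from historical ATFM delays (illustrative).
import numpy as np

def economic_risk(delay_minutes: np.ndarray, cost_per_minute: float = 100.0) -> float:
    """Expected cost imposed on airspace users, here with a simple
    linear cost-of-delay assumption (real delay costs are non-linear)."""
    return float(np.mean(cost_per_minute * delay_minutes))

def severity_class(risk: float, thresholds=(1e3, 1e4, 1e5)) -> str:
    """Map monetary risk to a qualitative severity label (hypothetical bands)."""
    labels = ["low", "medium", "high", "severe"]
    return labels[int(np.searchsorted(thresholds, risk))]

delays = np.array([0.0, 0.0, 12.0, 35.0, 90.0])   # minutes, per regulated flight
print(severity_class(economic_risk(delays)))       # -> "medium"
```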

【4】 A level-set approach to the control of state-constrained McKean-Vlasov equations: application to renewable energy storage and portfolio selection. Link: https://arxiv.org/abs/2112.11059

Authors: Maximilien Germain, Huyên Pham, Xavier Warin
Abstract: We consider the control of McKean-Vlasov dynamics (or mean-field control) with probabilistic state constraints. We rely on a level-set approach which provides a representation of the constrained problem in terms of an unconstrained one with exact penalization and running maximum or integral cost. The method is then extended to the common noise setting. Our work extends (Bokanowski, Picarelli, and Zidani, SIAM J. Control Optim. 54.5 (2016), pp. 2568--2593) and (Bokanowski, Picarelli, and Zidani, Appl. Math. Optim. 71 (2015), pp. 125--163) to a mean-field setting. The reformulation as an unconstrained problem is particularly suitable for the numerical resolution of the problem, which is achieved via an extension of a machine learning algorithm from (Carmona, Laurière, arXiv:1908.01613, to appear in Ann. Appl. Prob., 2019). A first application concerns the storage of renewable electricity in the presence of mean-field price impact, and another one focuses on a mean-variance portfolio selection problem with probabilistic constraints on the wealth. We also illustrate our approach for a direct numerical resolution of the primal Markowitz continuous-time problem without relying on duality.
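
The level-set idea replaces the state constraint with an auxiliary running-maximum cost: a control is feasible exactly when the unconstrained value of that auxiliary cost vanishes. A minimal Monte Carlo sketch evaluating such a penalized objective for one fixed control (the toy dynamics, constraint g, and penalty form are placeholders, not the paper's McKean-Vlasov formulation):

```python
# Running-maximum penalization of a state constraint (illustrative).
import numpy as np

rng = np.random.default_rng(0)

def penalized_cost(control: float, n_paths=10_000, n_steps=50, dt=0.02):
    """E[ terminal cost + (max_t g(X_t))^+ ] for a toy controlled diffusion
    dX = control*dt + dW, with path constraint g(x) = x - 1 <= 0."""
    x = np.zeros(n_paths)
    running_max_g = np.full(n_paths, -np.inf)
    for _ in range(n_steps):
        x += control * dt + np.sqrt(dt) * rng.standard_normal(n_paths)
        running_max_g = np.maximum(running_max_g, x - 1.0)   # g(x) = x - 1
    terminal_cost = x**2                      # placeholder objective
    penalty = np.maximum(running_max_g, 0.0)  # exact penalty on violation
    return float(np.mean(terminal_cost + penalty))
```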

2. cs.SD (Sound/Speech):

【1】 Mixed Precision Low-bit Quantization of Neural Network Language Models for Speech Recognition. Link: https://arxiv.org/abs/2112.11438

Authors: Junhao Xu, Jianwei Yu, Shoukang Hu, Xunying Liu, Helen Meng
Affiliations: The Chinese University of Hong Kong
Abstract: State-of-the-art language models (LMs) represented by long-short term memory recurrent neural networks (LSTM-RNNs) and Transformers are becoming increasingly complex and expensive for practical applications. Low-bit neural network quantization provides a powerful solution to dramatically reduce their model size. Current quantization methods are based on uniform precision and fail to account for the varying performance sensitivity at different parts of LMs to quantization errors. To this end, novel mixed precision neural network LM quantization methods are proposed in this paper. The optimal local precision choices for LSTM-RNN and Transformer based neural LMs are automatically learned using three techniques. The first two approaches are based on quantization sensitivity metrics in the form of either the KL-divergence measured between full precision and quantized LMs, or Hessian trace weighted quantization perturbation that can be approximated efficiently using matrix free techniques. The third approach is based on mixed precision neural architecture search. In order to overcome the difficulty in using gradient descent methods to directly estimate discrete quantized weights, alternating direction methods of multipliers (ADMM) are used to efficiently train quantized LMs. Experiments were conducted on state-of-the-art LF-MMI CNN-TDNN systems featuring speed perturbation, i-Vector and learning hidden unit contribution (LHUC) based speaker adaptation on two tasks: Switchboard telephone speech and AMI meeting transcription. The proposed mixed precision quantization techniques achieved "lossless" quantization on both tasks, by producing model size compression ratios of up to approximately 16 times over the full precision LSTM and Transformer baseline LMs, while incurring no statistically significant word error rate increase.
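
The first of the three precision-selection techniques scores each part of the LM by the KL divergence between the full-precision model's outputs and the outputs with only that part quantized; sensitive parts then keep more bits. A minimal PyTorch sketch of that measurement (the uniform fake-quantizer and single-batch estimate are simplifications of the paper's method):

```python
# KL-divergence sensitivity for per-layer mixed-precision selection (illustrative).
import copy
import torch
import torch.nn.functional as F

def fake_quantize_(module: torch.nn.Module, n_bits: int) -> None:
    """Uniformly quantize a module's weights in place (simplified)."""
    for p in module.parameters():
        scale = p.abs().max().clamp(min=1e-8) / (2 ** (n_bits - 1) - 1)
        p.data = torch.round(p.data / scale) * scale

@torch.no_grad()
def layer_sensitivity(model, layer_name: str, batch, n_bits: int = 4) -> float:
    """KL(full-precision || layer-quantized) estimated on one batch."""
    ref = F.log_softmax(model(batch), dim=-1)
    quantized = copy.deepcopy(model)
    fake_quantize_(dict(quantized.named_modules())[layer_name], n_bits)
    hyp = F.log_softmax(quantized(batch), dim=-1)
    return F.kl_div(hyp, ref, log_target=True, reduction="batchmean").item()

# Layers with high sensitivity keep higher precision; the rest drop to low bits.
```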

【2】 Voice Quality and Pitch Features in Transformer-Based Speech Recognition. Link: https://arxiv.org/abs/2112.11391

Authors: Guillermo Cámbara, Jordi Luque, Mireia Farrús
Affiliations: TALN Research Group, Universitat Pompeu Fabra, Barcelona, Spain; Telefónica I+D, Research, Barcelona, Spain; Language and Computation Centre, Universitat de Barcelona, Spain
Note: 5 pages, 3 figures, submitted to Speech Prosody 2022
Abstract: Jitter and shimmer measurements have been shown to be carriers of voice quality and prosodic information which enhance the performance of tasks like speaker recognition, diarization or automatic speech recognition (ASR). However, such features have been seldom used in the context of neural-based ASR, where spectral features often prevail. In this work, we study the effects of incorporating voice quality and pitch features altogether and separately into a Transformer-based ASR model, with the intuition that the attention mechanisms might exploit latent prosodic traits. To do so, we propose separated convolutional front-ends for prosodic and spectral features, showing that this architectural choice yields better results than simple concatenation of such pitch and voice quality features to mel-spectrogram filterbanks. Furthermore, we find mean Word Error Rate relative reductions of up to 5.6% on the LibriSpeech benchmark. Such findings motivate further research on the application of prosody knowledge for increasing the robustness of Transformer-based ASR.
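
The architectural claim is that prosodic features (pitch, jitter, shimmer) and spectral features should pass through separate convolutional front-ends before fusion, rather than being concatenated onto the mel filterbanks. A minimal PyTorch sketch of such a two-branch front-end (layer sizes and concatenation-based fusion are illustrative choices, not the paper's exact configuration):

```python
# Separated convolutional front-ends for spectral and prosodic features (illustrative).
import torch
import torch.nn as nn

class TwoBranchFrontEnd(nn.Module):
    def __init__(self, n_mels=80, n_prosodic=4, d_model=256):
        super().__init__()
        self.spectral = nn.Sequential(   # subsamples mel-spectrogram frames
            nn.Conv1d(n_mels, d_model, kernel_size=3, stride=2), nn.ReLU(),
        )
        self.prosodic = nn.Sequential(   # pitch/jitter/shimmer tracks
            nn.Conv1d(n_prosodic, 32, kernel_size=3, stride=2), nn.ReLU(),
        )
        self.proj = nn.Linear(d_model + 32, d_model)  # fuse before the Transformer

    def forward(self, mels, prosody):
        # mels: (batch, n_mels, time); prosody: (batch, n_prosodic, time)
        h = torch.cat([self.spectral(mels), self.prosodic(prosody)], dim=1)
        return self.proj(h.transpose(1, 2))  # (batch, time', d_model)

out = TwoBranchFrontEnd()(torch.randn(2, 80, 100), torch.randn(2, 4, 100))
```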

【3】 Safeguarding test signals for acoustic measurement using arbitrary sounds. Link: https://arxiv.org/abs/2112.11373

Authors: Hideki Kawahara, Kohei Yatabe
Affiliations: Wakayama University, Sakaedani, Wakayama, Japan; Waseda University, Ookubo, Shinjuku-ku, Tokyo, Japan
Note: 4 pages, 10 figures, submitted to Acoustical Science and Technology
Abstract: We propose a simple method to measure acoustic responses using any sounds by converting them into a form suitable for measurement. This method enables us to use music pieces for measuring acoustic conditions. It is advantageous to measure such conditions without subjecting listeners to annoying test sounds. In addition, applying the underlying idea of simultaneous measurement of multiple paths provides practically valuable features. For example, it is possible to measure deviations (temporally stable, random, and time-varying) and the impulse response while reproducing slightly modified contents under target conditions. The key idea of the proposed method is to add relatively small deterministic signals that sound like noise to the original sounds. We call the converted sounds safeguarded test signals.
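
The key idea, adding a small deterministic noise-like signal to an arbitrary sound so that the mixture doubles as a measurement signal, fits in a few lines. A sketch under simplifying assumptions (a fixed-seed Gaussian probe at a hypothetical -30 dB relative level; the paper's probe construction and response-estimation procedure are more sophisticated):

```python
# Safeguarding an arbitrary sound for acoustic measurement (illustrative).
import numpy as np

def safeguard(signal: np.ndarray, probe_db: float = -30.0, seed: int = 42):
    """Add a deterministic noise-like probe at probe_db relative to the signal.
    Because the probe is reproducible from the seed, it can later be
    correlated out of a recording to estimate the acoustic response."""
    rng = np.random.default_rng(seed)          # deterministic: same seed, same probe
    probe = rng.standard_normal(len(signal))
    probe *= np.linalg.norm(signal) / np.linalg.norm(probe)  # match signal energy
    gain = 10.0 ** (probe_db / 20.0)
    return signal + gain * probe, probe
```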

【4】 Self-Supervised Learning based Monaural Speech Enhancement with Complex-Cycle-Consistent. Link: https://arxiv.org/abs/2112.11142

Authors: Yi Li, Yang Sun, Syed Mohsen Naqvi
Abstract: Recently, self-supervised learning (SSL) techniques have been introduced to solve the monaural speech enhancement problem. Because clean phase information is not used, enhancement performance is limited in most SSL methods. Therefore, in this paper, we propose a phase-aware self-supervised learning based monaural speech enhancement method. The latent representations of both amplitude and phase are studied independently in the two decoders of the foundation autoencoder (FAE), using only a limited set of clean speech signals. Then, the downstream autoencoder (DAE) learns a shared latent space between the clean speech and mixture representations with a large number of unseen mixtures. A complex-cycle-consistent (CCC) mechanism is proposed to minimize the reconstruction loss between the amplitude and phase domains. Moreover, if the speech features are extracted as multi-resolution spectra, the desired information distributed across spectra of different scales can be exploited to further boost performance. The NOISEX and DAPS corpora are used to generate mixtures with different interferences to evaluate the efficacy of the proposed method. It is highlighted that the clean speech and mixtures fed into the FAE and DAE are not paired. Both ablation and comparison experimental results show that the proposed method clearly outperforms the state-of-the-art approaches.
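
The complex-cycle-consistent (CCC) mechanism couples the amplitude and phase decoders through reconstruction losses in both domains. A minimal PyTorch sketch of a loss in that spirit (the specific pairing of terms is an illustrative reading, not the paper's exact objective):

```python
# Complex-cycle-consistent style reconstruction loss (illustrative).
import torch
import torch.nn.functional as F

def ccc_loss(amp_est, phase_est, amp_ref, phase_ref):
    """Reconstruction loss in both the amplitude and phase domains, plus a
    complex-domain term that cycles back to the full spectrogram."""
    complex_est = torch.polar(amp_est, phase_est)   # amp * exp(i * phase)
    complex_ref = torch.polar(amp_ref, phase_ref)
    amp_term = F.l1_loss(amp_est, amp_ref)
    # Cosine comparison sidesteps 2*pi phase wrapping:
    phase_term = F.l1_loss(torch.cos(phase_est), torch.cos(phase_ref))
    complex_term = (complex_est - complex_ref).abs().mean()
    return amp_term + phase_term + complex_term
```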

【5】 Melody Harmonization with Controllable Harmonic Rhythm. Link: https://arxiv.org/abs/2112.11122

Authors: Shangda Wu, Yue Yang, Zhaowen Wang, Xiaobing Li, Maosong Sun
Affiliations: Department of Music AI and Information Technology, Central Conservatory of Music; Department of Computer Science and Technology, Tsinghua University, Beijing, China
Note: 9 pages, 10 figures, 4 tables
Abstract: Melody harmonization, namely generating a chord progression for a user-given melody, remains a challenging task to this day. Although previous neural network-based systems can effectively generate an appropriate chord progression for a melody, few studies focus on controllable melody harmonization, and none of them can generate flexible harmonic rhythms. To achieve harmonic rhythm-controllable melody harmonization, we propose AutoHarmonizer, a neural network-based melody harmonization system that can generate denser or sparser chord progressions with the use of a new sampling method for controllable generation proposed in this paper. This system mainly consists of two parts: a harmonic rhythm model provides coarse-grained chord onset information, while a chord model generates specific pitches for chords based on the given melody and the corresponding harmonic rhythm sequence previously generated. To evaluate the performance of AutoHarmonizer, we use nine metrics to compare the chord progressions from humans, the system proposed in this paper and the baseline. Experimental results show that AutoHarmonizer not only generates harmonic rhythms comparable to the human level, but also generates chords of overall better quality than the baseline at different settings. In addition, we use AutoHarmonizer to harmonize the Session Dataset (which was originally chordless), yielding 40,925 traditional Irish folk songs with harmonies, named the Session Lead Sheet Dataset, the largest lead sheet dataset to date.
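
Controllability enters at the sampling step over the harmonic rhythm model's per-beat chord-onset probabilities: biasing that step yields denser or sparser progressions. A minimal sketch of one such biased sampler (the density knob below is a hypothetical stand-in for the paper's sampling method):

```python
# Density-controllable sampling of chord onsets (illustrative).
import numpy as np

def sample_onsets(onset_probs: np.ndarray, density: float = 1.0, seed: int = 0):
    """Sample a binary chord-onset sequence from per-beat probabilities.
    density > 1 biases toward denser progressions, < 1 toward sparser ones."""
    rng = np.random.default_rng(seed)
    biased = np.clip(onset_probs * density, 0.0, 1.0)
    return rng.random(len(biased)) < biased

# Halve the expected number of chord onsets for a sparser harmonization:
onsets = sample_onsets(np.array([0.9, 0.2, 0.5, 0.1]), density=0.5)
```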

【6】 Regularizing End-to-End Speech Translation with Triangular Decomposition Agreement. Link: https://arxiv.org/abs/2112.10991

Authors: Yichao Du, Zhirui Zhang, Weizhi Wang, Boxing Chen, Jun Xie, Tong Xu
Affiliations: University of Science and Technology of China, Hefei, China; Machine Intelligence Technology Lab, Alibaba DAMO Academy; Rutgers University, New Brunswick, USA
Note: AAAI 2022
Abstract: End-to-end speech-to-text translation (E2E-ST) is becoming increasingly popular due to its potential for less error propagation, lower latency, and fewer parameters. Given the triplet training corpus ⟨speech, transcription, translation⟩, the conventional high-quality E2E-ST system leverages the ⟨speech, transcription⟩ pair to pre-train the model and then utilizes the ⟨speech, translation⟩ pair to optimize it further. However, this process only involves two-tuple data at each stage, and this loose coupling fails to fully exploit the association between the triplet data. In this paper, we attempt to model the joint probability of transcription and translation based on the speech input to directly leverage such triplet data. Based on that, we propose a novel regularization method for model training to improve the agreement of the dual-path decomposition within triplet data, which should be equal in theory. To achieve this goal, we introduce two Kullback-Leibler divergence regularization terms into the model training objective to reduce the mismatch between the output probabilities of the two paths. Then the well-trained model can be naturally transformed into E2E-ST models by the pre-defined early-stop tag. Experiments on the MuST-C benchmark demonstrate that our proposed approach significantly outperforms state-of-the-art E2E-ST baselines on all 8 language pairs, while achieving better performance in the automatic speech recognition task. Our code is open-sourced at https://github.com/duyichao/E2E-ST-TDA.
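
The regularizer itself is simple to state: factor p(transcription, translation | speech) along its two chain-rule orders and penalize their disagreement with two KL terms. A minimal PyTorch sketch of that agreement loss over per-step output distributions (tensor shapes and the surrounding cross-entropy terms are assumptions for illustration):

```python
# Dual-path decomposition agreement via symmetric KL regularization (illustrative).
import torch
import torch.nn.functional as F

def agreement_loss(log_p_path1, log_p_path2, weight: float = 1.0):
    """log_p_path1/2: (batch, steps, vocab) log-probabilities of the same
    targets under the two factorizations of p(transcription, translation | speech).
    In theory the two paths coincide, so their mismatch is penalized both ways."""
    kl_12 = F.kl_div(log_p_path2, log_p_path1, log_target=True, reduction="batchmean")
    kl_21 = F.kl_div(log_p_path1, log_p_path2, log_target=True, reduction="batchmean")
    return weight * (kl_12 + kl_21)

# Total objective (sketch): L = L_ce(path1) + L_ce(path2) + agreement_loss(...)
```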

【7】 Augmented Contrastive Self-Supervised Learning for Audio Invariant Representations. Link: https://arxiv.org/abs/2112.10950

Authors: Melikasadat Emami, Dung Tran, Kazuhito Koishida
Affiliations: ECE Department, University of California, Los Angeles, Los Angeles, CA; Applied Sciences Group, Microsoft, Redmond, WA
Note: 4 pages, 4 figures
Abstract: Improving generalization is a major challenge in audio classification due to labeled data scarcity. Self-supervised learning (SSL) methods tackle this by leveraging unlabeled data to learn useful features for downstream classification tasks. In this work, we propose an augmented contrastive SSL framework to learn invariant representations from unlabeled data. Our method applies various perturbations to the unlabeled input data and utilizes contrastive learning to learn representations robust to such perturbations. Experimental results on the Audioset and DESED datasets show that our framework significantly outperforms state-of-the-art SSL and supervised learning methods on sound/event classification tasks.
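
The recipe is the standard two-view contrastive setup: perturb each unlabeled clip twice, then pull the two embeddings together against the rest of the batch. A minimal NT-Xent-style sketch in PyTorch (the temperature and the audio-specific perturbation set are placeholders):

```python
# Contrastive loss over two perturbed views of each audio clip (illustrative).
import torch
import torch.nn.functional as F

def nt_xent(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.1):
    """z1, z2: (batch, dim) embeddings of two augmented views of the same clips."""
    z = F.normalize(torch.cat([z1, z2]), dim=1)        # (2B, dim)
    sim = z @ z.t() / temperature                      # cosine similarities
    sim.fill_diagonal_(float("-inf"))                  # exclude self-pairs
    n = z1.size(0)
    # The positive for row i is the other view of the same clip:
    targets = torch.cat([torch.arange(n) + n, torch.arange(n)])
    return F.cross_entropy(sim, targets)
```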

3. eess.AS (Audio and Speech Processing):

【1】 Augmented Contrastive Self-Supervised Learning for Audio Invariant Representations. Link: https://arxiv.org/abs/2112.10950 (cross-listed; see cs.SD entry 【7】 above)

【2】 Deliberation of Streaming RNN-Transducer by Non-autoregressive Decoding. Link: https://arxiv.org/abs/2112.11442

Authors: Weiran Wang, Ke Hu, Tara N. Sainath
Affiliations: Google, Inc.
Abstract: We propose to deliberate the hypothesis alignment of a streaming RNN-T model with the previously proposed Align-Refine non-autoregressive decoding method and its improved versions. The method performs a few refinement steps, where each step shares a transformer decoder that attends to both text features (extracted from alignments) and audio features, and outputs complete updated alignments. The transformer decoder is trained with the CTC loss, which facilitates parallel greedy decoding, and performs full-context attention to capture label dependencies. We improve Align-Refine by introducing a cascaded encoder that captures more audio context before refinement, and alignment augmentation, which enforces learning label dependencies. We show that, conditioned on hypothesis alignments of a streaming RNN-T model, our method obtains significantly more accurate recognition results than the first-pass RNN-T, with only a small amount of additional model parameters.
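
Deliberation here amounts to a few non-autoregressive refinement passes: start from the streaming RNN-T's hypothesis alignment and repeatedly re-decode it with a shared decoder attending to both the alignment's text features and the audio. A minimal sketch of that loop (the decoder interface is hypothetical; the actual model uses a CTC-trained transformer decoder and a cascaded encoder):

```python
# Align-Refine style iterative refinement of an RNN-T alignment (illustrative).
import torch

@torch.no_grad()
def refine(alignment, audio_feats, decoder, n_steps: int = 2):
    """alignment: (batch, frames) token ids (with blanks) from the first-pass RNN-T.
    decoder: shared module mapping (alignment, audio) -> (batch, frames, vocab).
    Each step re-decodes the full alignment in parallel (CTC-style)."""
    for _ in range(n_steps):
        logits = decoder(alignment, audio_feats)   # full-context attention
        alignment = logits.argmax(dim=-1)          # parallel greedy update
    return alignment    # collapse blanks/repeats afterwards to get the hypothesis
```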

【3】 Mixed Precision Low-bit Quantization of Neural Network Language Models for Speech Recognition. Link: https://arxiv.org/abs/2112.11438 (cross-listed; see cs.SD entry 【1】 above)

【4】 Voice Quality and Pitch Features in Transformer-Based Speech Recognition. Link: https://arxiv.org/abs/2112.11391 (cross-listed; see cs.SD entry 【2】 above)

【5】 Safeguarding test signals for acoustic measurement using arbitrary sounds. Link: https://arxiv.org/abs/2112.11373 (cross-listed; see cs.SD entry 【3】 above)

【6】 Self-Supervised Learning based Monaural Speech Enhancement with Complex-Cycle-Consistent. Link: https://arxiv.org/abs/2112.11142 (cross-listed; see cs.SD entry 【4】 above)

【7】 Melody Harmonization with Controllable Harmonic Rhythm. Link: https://arxiv.org/abs/2112.11122 (cross-listed; see cs.SD entry 【5】 above)

【8】 Regularizing End-to-End Speech Translation with Triangular Decomposition Agreement. Link: https://arxiv.org/abs/2112.10991 (cross-listed; see cs.SD entry 【6】 above)

