Finance / Speech / Audio Processing arXiv Digest [12.10]

2021-12-10 16:59:22


q-fin (Finance): 5 papers

cs.SD (Sound/Speech): 11 papers

eess.AS (Audio Processing): 11 papers

1. q-fin (Finance):

【1】 Cross-ownership as a structural explanation for rising correlations in crisis times Link: https://arxiv.org/abs/2112.04824

Authors: Nils Bertschinger, Axel A. Araneda Affiliations: Frankfurt Institute for Advanced Studies, Frankfurt am Main, Germany; Institute of Financial Complex Systems, Masaryk University, Brno, Czech Republic Abstract: In this paper, we examine the interlinkages among firms through a financial network where cross-holdings of both equity and debt are allowed. We mathematically relate the correlation among equities to the unconditional correlation of the assets, the values of the firms' business assets and the sensitivity of the network, particularly the $\Delta$-Greek. We also note that this relation is independent of the equity level. Moreover, for the two-firm case, we demonstrate analytically that the equity correlation is always higher than the asset correlation, and we illustrate this numerically. Finally, we study the relation between equity correlations and asset prices, where the model yields an increase in the former when asset values fall.

【2】 High-Dimensional Stock Portfolio Trading with Deep Reinforcement Learning Link: https://arxiv.org/abs/2112.04755

Authors: Uta Pigorsch, Sebastian Schäfer Affiliations: Schumpeter School of Business and Economics, University of Wuppertal, Wuppertal, Germany Notes: 14 pages, 5 figures, 2 tables Abstract: This paper proposes a Deep Reinforcement Learning algorithm for financial portfolio trading based on Deep Q-learning. The algorithm is capable of trading high-dimensional portfolios from cross-sectional datasets of any size, which may include data gaps and non-unique history lengths in the assets. We sequentially set up environments by sampling one asset for each environment, rewarding investments with the resulting asset's return and cash reservation with the average return of the set of assets. This forces the agent to strategically assign capital to assets that it predicts will perform above average. We apply our methodology in an out-of-sample analysis to 48 US stock portfolio setups, varying in the number of stocks from ten up to 500, in the selection criteria and in the level of transaction costs. On average, the algorithm outperforms all considered passive and active benchmark investment strategies by a large margin using only one hyperparameter setup for all portfolios.

【3】 Adaptive calibration of Heston Model using PCRLB based switching Filter Link: https://arxiv.org/abs/2112.04576

Authors: Kumar Yashaswi Notes: 7 pages, 5 figures, 1 table. Keywords: Stochastic volatility; Heston Model; Normal MLE; Bayesian Filtering; Posterior Cramer-Rao Lower Bound (PCRLB). arXiv admin note: text overlap with arXiv:2112.03193 Abstract: Stochastic volatility models have existed in option pricing theory ever since the crash of 1987, which violated the Black-Scholes model assumption of constant volatility. The Heston model is one such stochastic volatility model that is widely used for volatility estimation and option pricing. In this paper, we design a novel method to estimate the parameters of the Heston model under a state-space representation using Bayesian filtering theory and the Posterior Cramer-Rao Lower Bound (PCRLB), integrating it with the Normal Maximum Likelihood Estimation (NMLE) proposed in [1]. Several Bayesian filters, such as the Extended Kalman Filter (EKF), the Unscented Kalman Filter (UKF) and the Particle Filter (PF), are used for latent state and parameter estimation. We employ a switching strategy proposed in [2] for adaptive state estimation of non-linear, discrete-time state-space models (SSMs) such as the Heston model. We use a performance measure based on the particle-filter-approximated PCRLB [3] to judge the best filter at each time step. We test our proposed framework on pricing data from the S&P 500 and NSE indices, estimating the underlying volatility and parameters from each index. Our proposed method is compared with the VIX measure and historical volatility for both indices. The results indicate an effective framework for estimating volatility adaptively with changing market dynamics.
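
For readers unfamiliar with the latent-variance structure that the filters in this abstract would have to track, the following is a minimal sketch of the standard Heston dynamics simulated with an Euler-type scheme (log-Euler for the price, full truncation for the variance). It is not the paper's estimation method; the function name, parameter names and example values are all assumptions chosen for illustration.

```python
import math
import random


def simulate_heston(s0, v0, mu, kappa, theta, xi, rho, dt, n_steps, seed=0):
    """Simulate one path of the Heston model (illustrative sketch only).

    dS = mu * S dt + sqrt(v) * S dW1
    dv = kappa * (theta - v) dt + xi * sqrt(v) dW2,  corr(dW1, dW2) = rho
    """
    rng = random.Random(seed)
    s, v = s0, v0
    prices, variances = [s], [v]
    for _ in range(n_steps):
        z1 = rng.gauss(0.0, 1.0)
        # Build a second normal draw with correlation rho to the first.
        z2 = rho * z1 + math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0)
        # Full truncation: use max(v, 0) wherever the variance is needed.
        v_pos = max(v, 0.0)
        # Log-Euler step keeps the simulated price strictly positive.
        s = s * math.exp((mu - 0.5 * v_pos) * dt + math.sqrt(v_pos * dt) * z1)
        v = v + kappa * (theta - v_pos) * dt + xi * math.sqrt(v_pos * dt) * z2
        prices.append(s)
        variances.append(v)
    return prices, variances


# Hypothetical daily-step example: one trading year of prices and variances.
prices, variances = simulate_heston(
    s0=100.0, v0=0.04, mu=0.05, kappa=2.0, theta=0.04,
    xi=0.3, rho=-0.7, dt=1.0 / 252, n_steps=252,
)
```

In a filtering setup, only `prices` would be observed, while `variances` plays the role of the hidden state to be estimated.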

【4】 Recent Advances in Reinforcement Learning in Finance Link: https://arxiv.org/abs/2112.04553

Authors: Ben Hambly, Renyuan Xu, Huining Yang Notes: 60 pages, 1 figure Abstract: The rapid changes in the finance industry due to the increasing amount of data have revolutionized the techniques for data processing and data analysis and brought new theoretical and computational challenges. In contrast to classical stochastic control theory and other analytical approaches for solving financial decision-making problems, which rely heavily on model assumptions, new developments from reinforcement learning (RL) are able to make full use of the large amount of financial data with fewer model assumptions and to improve decisions in complex financial environments. This survey paper aims to review the recent developments and use of RL approaches in finance. We give an introduction to Markov decision processes, which are the setting for many of the commonly used RL approaches. Various algorithms are then introduced, with a focus on value-based and policy-based methods that do not require any model assumptions. Connections are made with neural networks to extend the framework to encompass deep RL algorithms. Our survey concludes by discussing the application of these RL algorithms in a variety of decision-making problems in finance, including optimal execution, portfolio optimization, option pricing and hedging, market making, smart order routing, and robo-advising.
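
As a concrete instance of the model-free, value-based methods mentioned in this abstract, here is a minimal sketch of the textbook tabular Q-learning update on a toy Markov decision process. The toy environment, the function name `q_update` and the step-size and discount values are assumptions for illustration and are not taken from the survey.

```python
def q_update(q, state, action, reward, next_state, terminal,
             alpha=0.5, gamma=0.9):
    """Apply one tabular Q-learning update in place and return the new value."""
    # A terminal transition has no future value to bootstrap from.
    best_next = 0.0 if terminal else max(q[next_state])
    target = reward + gamma * best_next
    q[state][action] += alpha * (target - q[state][action])
    return q[state][action]


# Hypothetical toy MDP with a single non-terminal state 0:
#   action 0 = wait (reward 0, stay in state 0)
#   action 1 = exit (reward 1, episode ends)
q = [[0.0, 0.0]]
for _ in range(200):
    q_update(q, 0, 0, reward=0.0, next_state=0, terminal=False)
    q_update(q, 0, 1, reward=1.0, next_state=0, terminal=True)
```

The update needs only sampled transitions (state, action, reward, next state), never the transition probabilities themselves, which is what "no model assumptions" means in this setting.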

【5】 Theoretical Economics and the Second-Order Economic Theory. What is it? Link: https://arxiv.org/abs/2112.04566

Authors: Victor Olkhov Affiliations: Moscow, Russia Notes: 14 pages Abstract: We consider economic agents, agents' variables, agents' trades and deals with other agents, and agents' expectations as the basis for a theoretical description of economic and financial processes. Macroeconomic and financial variables are composed of agents' variables. In turn, sums of agents' trade values or volumes determine the evolution of agents' variables. Finally, agents' expectations govern agents' trade decisions. We consider this trinity (agents' variables, trades and expectations) as the simple bricks for a theoretical description of economics. We call models that describe variables determined by sums of market trades during a certain time interval $\Delta$ first-order economic theories. Most current economic models belong to the first-order economic theories. However, we show that these models are insufficient for an adequate economic description. Trade decisions substantially depend on market price forecasting. We show that reasonable predictions of market price volatility are equivalent to descriptions of sums of squares of trade values and volumes during $\Delta$. We call the modeling of variables composed of sums of squares of market trades second-order economic theories. If a forecast of price probability uses the third price statistical moment and price skewness, then it is equivalent to a description of sums of the third power of market trades: the third-order economic theory. Exact prediction of market price probability is equivalent to a description of sums of the n-th power of market trades for all n. That limits the accuracy of price probability forecasting and confines the forecast validity of economic theories.

2. cs.SD (Sound/Speech):

【1】 Domain Adaptation and Autoencoder Based Unsupervised Speech Enhancement Link: https://arxiv.org/abs/2112.05036

Authors: Yi Li, Yang Sun, Kirill Horoshenkov, Syed Mohsen Naqvi Affiliations: Sun is working with the Big Data Institute, University of Oxford Abstract: As a category of transfer learning, domain adaptation plays an important role in generalizing a model trained on one task and applying it to other similar tasks or settings. In speech enhancement, a well-trained acoustic model can be exploited to obtain the speech signal in the context of other languages, speakers, and environments. Recent domain adaptation research has developed more effectively with various neural networks and high-level abstract features. However, the related studies are more likely to transfer the well-trained model from a rich and more diverse domain to a limited and similar domain. Therefore, in this study, a domain adaptation method is proposed for unsupervised speech enhancement in the opposite case of transferring to a larger and richer domain. On the one hand, the importance-weighting (IW) approach is exploited with a variance-constrained autoencoder to reduce the shift of shared weights between the source and target domains. On the other hand, in order to train the classifier with the worst-case weights and minimize the risk, the minimax method is proposed. Both the proposed IW and minimax methods are evaluated from the VOICE BANK and IEEE datasets to the TIMIT dataset. The experimental results show that the proposed methods outperform the state-of-the-art approaches.

【2】 Personalized musically induced emotions of not-so-popular Colombian music Link: https://arxiv.org/abs/2112.04975

Authors: Juan Sebastián Gómez-Cañón, Perfecto Herrera, Estefanía Cano, Emilia Gómez Affiliations: MIRLab - Music Technology Group, Universitat Pompeu Fabra, Barcelona, Spain; Songquito UG, Erlangen, Germany; Joint Research Centre, European Commission, Seville, Spain Abstract: This work presents an initial proof of concept of how Music Emotion Recognition (MER) systems could be intentionally biased with respect to annotations of musically induced emotions in a political context. Specifically, we analyze traditional Colombian music containing politically charged lyrics of two types: (1) vallenatos and social songs from the "left-wing" guerrilla Fuerzas Armadas Revolucionarias de Colombia (FARC) and (2) corridos from the "right-wing" paramilitaries Autodefensas Unidas de Colombia (AUC). We train personalized machine learning models to predict induced emotions for three users with diverse political views - we aim at identifying the songs that may induce negative emotions for a particular user, such as anger and fear. To this extent, a user's emotion judgements could be interpreted as problematizing data - subjective emotional judgments could in turn be used to influence the user in a human-centered machine learning environment. In short, highly desired "emotion regulation" applications could potentially deviate to "emotion manipulation" - the recent discredit of emotion recognition technologies might transcend ethical issues of diversity and inclusion.

【3】 LipSound2: Self-Supervised Pre-Training for Lip-to-Speech Reconstruction and Lip Reading Link: https://arxiv.org/abs/2112.04748

Authors: Leyuan Qu, Cornelius Weber, Stefan Wermter Affiliations: Department of Informatics, University of Hamburg Notes: Submitted to IEEE Transactions on Neural Networks and Learning Systems Abstract: The aim of this work is to investigate the impact of crossmodal self-supervised pre-training for speech reconstruction (video-to-audio) by leveraging the natural co-occurrence of audio and visual streams in videos. We propose LipSound2, which consists of an encoder-decoder architecture and a location-aware attention mechanism to map face image sequences to mel-scale spectrograms directly without requiring any human annotations. The proposed LipSound2 model is first pre-trained on $\sim$2400 h of multi-lingual (e.g. English and German) audio-visual data (VoxCeleb2). To verify the generalizability of the proposed method, we then fine-tune the pre-trained model on domain-specific datasets (GRID, TCD-TIMIT) for English speech reconstruction and achieve a significant improvement in speech quality and intelligibility compared to previous approaches in speaker-dependent and speaker-independent settings. In addition to English, we conduct Chinese speech reconstruction on the CMLR dataset to verify the impact on transferability. Lastly, we train the cascaded lip reading (video-to-text) system by fine-tuning a pre-trained speech recognition system on the generated audio and achieve state-of-the-art performance on both English and Chinese benchmark datasets.

【4】 Noise-robust blind reverberation time estimation using noise-aware time-frequency masking Link: https://arxiv.org/abs/2112.04726

Authors: Kaitong Zheng, Chengshi Zheng, Jinqiu Sang, Yulong Zhang, Xiaodong Li Affiliations: Key Laboratory of Noise and Vibration Research, Institute of Acoustics, Chinese Academy of Sciences; University of Chinese Academy of Sciences, Beijing, China

【5】 CWS-PResUNet: Music Source Separation with Channel-wise Subband Phase-aware ResUNet Link: https://arxiv.org/abs/2112.04685

Authors: Haohe Liu, Qiuqiang Kong, Jiafeng Liu Affiliations: Sound, Audio, and Music Intelligence (SAMI) Group, ByteDance; The Ohio State University Notes: Published at MDX Workshop @ ISMIR 2021

【6】 NICE-Beam: Neural Integrated Covariance Estimators for Time-Varying Beamformers Link: https://arxiv.org/abs/2112.04613

Authors: Jonah Casebeer, Jacob Donley, Daniel Wong, Buye Xu, Anurag Kumar Affiliations: Reality Labs Research, USA

【7】 X-Vector based voice activity detection for multi-genre broadcast speech-to-text Link: https://arxiv.org/abs/2112.05016

Authors: Misa Ogura, Matt Haynes Affiliations: BBC Research & Development, UK Notes: 7 pages, 3 figures, 4 tables Abstract: Voice Activity Detection (VAD) is a fundamental preprocessing step in automatic speech recognition. This is especially true within the broadcast industry, where a wide variety of audio materials and recording conditions are encountered. Based on previous studies which indicate that x-vector embeddings can be applied to a diverse set of audio classification tasks, we investigate the suitability of x-vectors for discriminating speech from noise. We find that the proposed x-vector based VAD system achieves the best reported score in detecting clean speech on AVA-Speech, whilst retaining robust VAD performance in the presence of noise and music. Furthermore, we integrate the x-vector based VAD system into an existing STT pipeline and compare its performance on multiple broadcast datasets against a baseline system with WebRTC VAD. Crucially, our proposed x-vector based VAD improves the accuracy of STT transcription on real-world broadcast audio.

【8】 Harmonic and non-Harmonic Based Noisy Reverberant Speech Enhancement in Time Domain Link: https://arxiv.org/abs/2112.04949

Authors: G. Zucatelli, R. Coelho Affiliations: The authors are with the Laboratory of Acoustic Signal Processing, Military Institute of Engineering (IME) Notes: 9 pages Abstract: This paper introduces a single-step time-domain method named HnH-NRSE, which is designed for simultaneous speech intelligibility and quality improvement under noisy-reverberant conditions. In this solution, harmonic and non-harmonic elements of speech are separated by applying zero-crossing and energy criteria. An objective evaluation of its non-stationarity degree is further used for an adaptive gain to treat masking components. No prior knowledge of speech statistics or room information is required for this technique. Additionally, two combined solutions, IRMO and IRMN, are proposed as composite methods for improving noisy-reverberant speech signals. The proposed and baseline methods are evaluated using two intelligibility and three quality measures, applied for objective prediction. The results show that the proposed scheme leads to higher intelligibility and quality improvement than competing methods in most scenarios. Additionally, a perceptual intelligibility listening test is performed, which corroborates these results. Furthermore, the proposed HnH-NRSE solution attains SRMR quality scores similar to those of the composite IRMO and IRMN techniques.
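
To make the "zero-crossing and energy criteria" named in this abstract concrete, the following is a minimal sketch of per-frame zero-crossing rate and short-time energy, followed by a simple threshold rule. It does not reproduce HnH-NRSE: the frame sizes, the thresholds and the labeling rule in `label_frames` are hypothetical choices for illustration.

```python
import math
import random


def frame_features(signal, frame_len=400, hop=200):
    """Return a list of (zero_crossing_rate, mean_energy), one pair per frame."""
    feats = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        # Count sign changes between neighbouring samples.
        crossings = sum(
            1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
        )
        zcr = crossings / (frame_len - 1)
        energy = sum(x * x for x in frame) / frame_len
        feats.append((zcr, energy))
    return feats


def label_frames(feats, zcr_max=0.1, energy_min=0.01):
    """Hypothetical rule: low zero-crossing rate and high energy -> harmonic."""
    return [
        "harmonic" if zcr <= zcr_max and energy >= energy_min else "non-harmonic"
        for zcr, energy in feats
    ]


# Toy signals at an assumed 16 kHz sampling rate: a 200 Hz tone and white noise.
sample_rate = 16000
tone = [0.5 * math.sin(2 * math.pi * 200 * n / sample_rate) for n in range(1600)]
rng = random.Random(0)
noise = [rng.uniform(-0.5, 0.5) for _ in range(1600)]

tone_labels = label_frames(frame_features(tone))
noise_labels = label_frames(frame_features(noise))
```

Voiced, harmonic speech tends to show few zero crossings and high energy, whereas noise-like segments cross zero far more often, which is why the two features are commonly paired.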

【9】 A Training Framework for Stereo-Aware Speech Enhancement using Deep Neural Networks Link: https://arxiv.org/abs/2112.04939

Authors: Bahareh Tolooshams, Kazuhito Koishida Affiliations: School of Engineering and Applied Sciences, Harvard University, Cambridge, MA; Microsoft Corporation, One Microsoft Way, Redmond, WA Notes: Submitted to ICASSP 2022 Abstract: Deep learning-based speech enhancement has shown unprecedented performance in recent years. The most popular mono speech enhancement frameworks are end-to-end networks mapping the noisy mixture into an estimate of the clean speech. With growing computational power and the availability of multichannel microphone recordings, prior works have aimed to incorporate spatial statistics along with spectral information to boost performance. Despite an improvement in the enhancement performance of the mono output, spatial image preservation and subjective evaluations have not gained much attention in the literature. This paper proposes a novel stereo-aware framework for speech enhancement, i.e., a training loss for deep learning-based speech enhancement to preserve the spatial image while enhancing the stereo mixture. The proposed framework is model independent, hence it can be applied to any deep learning based architecture. We provide an extensive objective and subjective evaluation of the trained models through a listening test. We show that by regularizing for an image preservation loss, the overall performance is improved, and the stereo aspect of the speech is better preserved.

【10】 End-to-end Alexa Device Arbitration Link: https://arxiv.org/abs/2112.04914

Authors: Jarred Barber, Yifeng Fan, Tao Zhang Affiliations: Amazon Alexa AI, USA; University of Illinois, Urbana-Champaign, USA Notes: Submitted to ICASSP 2022 Abstract: We introduce a variant of the speaker localization problem, which we call device arbitration. In the device arbitration problem, a user utters a keyword that is detected by multiple distributed microphone arrays (smart home devices), and we want to determine which device was closest to the user. Rather than solving the full localization problem, we propose an end-to-end machine learning system. This system learns a feature embedding that is computed independently on each device. The embeddings from each device are then aggregated together to produce the final arbitration decision. We use a large-scale room simulation to generate training and evaluation data, and compare our system against a signal processing baseline.

【11】 On The Effect Of Coding Artifacts On Acoustic Scene Classification Link: https://arxiv.org/abs/2112.04841

Authors: Nagashree K. S. Rao, Nils Peters Affiliations: University of Erlangen-Nuremberg, Erlangen, Germany; International Audio Laboratories, Erlangen, Germany Notes: Paper presented at the 2021 Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE) Abstract: Previous DCASE challenges contributed to an increase in the performance of acoustic scene classification systems. State-of-the-art classifiers demand significant processing capabilities and memory, which is challenging for resource-constrained mobile or IoT edge devices. Thus, it is more likely that these models are deployed on more powerful hardware and classify audio recordings previously uploaded (or streamed) from low-power edge devices. In such a scenario, the edge device may apply perceptual audio coding to reduce the transmission data rate. This paper explores the effect of perceptual audio coding on the classification performance using a DCASE 2020 challenge contribution [1]. We found that classification accuracy can degrade by up to 57% compared to classifying original (uncompressed) audio. We further demonstrate how lossy audio compression techniques during model training can improve classification accuracy of compressed audio signals even for audio codecs and codec bitrates not included in the training process.

3. eess.AS (Audio Processing):

【1】 X-Vector based voice activity detection for multi-genre broadcast speech-to-text Link: https://arxiv.org/abs/2112.05016

Authors: Misa Ogura, Matt Haynes Affiliations: BBC Research & Development, UK Notes: 7 pages, 3 figures, 4 tables Abstract: same as entry 【7】 in the cs.SD section above.

【2】 Harmonic and non-Harmonic Based Noisy Reverberant Speech Enhancement in Time Domain Link: https://arxiv.org/abs/2112.04949

Authors: G. Zucatelli, R. Coelho Affiliations: The authors are with the Laboratory of Acoustic Signal Processing, Military Institute of Engineering (IME) Notes: 9 pages Abstract: same as entry 【8】 in the cs.SD section above.

【3】 A Training Framework for Stereo-Aware Speech Enhancement using Deep Neural Networks Link: https://arxiv.org/abs/2112.04939

Authors: Bahareh Tolooshams, Kazuhito Koishida Affiliations: School of Engineering and Applied Sciences, Harvard University, Cambridge, MA; Microsoft Corporation, One Microsoft Way, Redmond, WA Notes: Submitted to ICASSP 2022 Abstract: same as entry 【9】 in the cs.SD section above.

【4】 End-to-end Alexa Device Arbitration Link: https://arxiv.org/abs/2112.04914

Authors: Jarred Barber, Yifeng Fan, Tao Zhang Affiliations: Amazon Alexa AI, USA; University of Illinois, Urbana-Champaign, USA Notes: Submitted to ICASSP 2022 Abstract: same as entry 【10】 in the cs.SD section above.

【5】 On The Effect Of Coding Artifacts On Acoustic Scene Classification Link: https://arxiv.org/abs/2112.04841

Authors: Nagashree K. S. Rao, Nils Peters Affiliations: University of Erlangen-Nuremberg, Erlangen, Germany; International Audio Laboratories, Erlangen, Germany Notes: Paper presented at the 2021 Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE) Abstract: same as entry 【11】 in the cs.SD section above.

【6】 Domain Adaptation and Autoencoder Based Unsupervised Speech Enhancement Link: https://arxiv.org/abs/2112.05036

Authors: Yi Li, Yang Sun, Kirill Horoshenkov, Syed Mohsen Naqvi Affiliations: Sun is working with the Big Data Institute, University of Oxford Abstract: As a category of transfer learning, domain adaptation plays an important role in generalizing a model trained on one task and applying it to other similar tasks or settings. In speech enhancement, a well-trained acoustic model can be exploited to obtain the speech signal in the context of other languages, speakers, and environments. Recent domain adaptation research has made more effective use of various neural networks and high-level abstract features. However, related studies tend to transfer a well-trained model from a rich and more diverse domain to a limited and similar domain. Therefore, in this study, a domain adaptation method is proposed for unsupervised speech enhancement in the opposite case of transferring to a larger and richer domain. On the one hand, the importance-weighting (IW) approach is exploited with a variance constrained autoencoder to reduce the shift of shared weights between the source and target domains. On the other hand, in order to train the classifier with the worst-case weights and minimize the risk, the minimax method is proposed. Both the proposed IW and minimax methods are evaluated by transferring from the VOICE BANK and IEEE datasets to the TIMIT dataset. The experimental results show that the proposed methods outperform the state-of-the-art approaches.
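
The general idea behind importance weighting can be shown with a toy example: losses measured on source-domain samples are reweighted by the density ratio w(x) = p_target(x) / p_source(x), so that their average approximates the target-domain risk. The sketch below is not the paper's method (which combines IW with a variance constrained autoencoder); it assumes two known one-dimensional Gaussian domains and a stand-in loss, and all names in it are illustrative.

```python
import math
import random


def gaussian_pdf(x, mean, std=1.0):
    """Density of a normal distribution with the given mean and std."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))


def importance_weight(x, source_mean=0.0, target_mean=1.0):
    """w(x) = p_target(x) / p_source(x) for two assumed unit-variance Gaussians."""
    return gaussian_pdf(x, target_mean) / gaussian_pdf(x, source_mean)


def loss(x):
    """Stand-in per-sample loss; analytically its mean is 1 on the source, 2 on the target."""
    return x * x


rng = random.Random(0)
source_samples = [rng.gauss(0.0, 1.0) for _ in range(50000)]

# Plain average over source data estimates the source-domain risk only.
naive_risk = sum(loss(x) for x in source_samples) / len(source_samples)

# Reweighted average over the same source data estimates the target-domain risk.
weighted_risk = sum(importance_weight(x) * loss(x)
                    for x in source_samples) / len(source_samples)

print(f"unweighted: {naive_risk:.3f}  importance-weighted: {weighted_risk:.3f}")
```

In practice the density ratio is not known and has to be estimated, which is where most of the difficulty lies.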

【7】 Personalized musically induced emotions of not-so-popular Colombian music Link: https://arxiv.org/abs/2112.04975

Authors: Juan Sebastián Gómez-Cañón, Perfecto Herrera, Estefanía Cano, Emilia Gómez Affiliations: MIRLab - Music Technology Group, Universitat Pompeu Fabra, Barcelona, Spain; Songquito UG, Erlangen, Germany; Joint Research Centre, European Commission, Seville, Spain Abstract: This work presents an initial proof of concept of how Music Emotion Recognition (MER) systems could be intentionally biased with respect to annotations of musically induced emotions in a political context. Specifically, we analyze traditional Colombian music containing politically charged lyrics of two types: (1) vallenatos and social songs from the "left-wing" guerrilla Fuerzas Armadas Revolucionarias de Colombia (FARC) and (2) corridos from the "right-wing" paramilitaries Autodefensas Unidas de Colombia (AUC). We train personalized machine learning models to predict induced emotions for three users with diverse political views - we aim at identifying the songs that may induce negative emotions for a particular user, such as anger and fear. To this extent, a user's emotion judgments could be interpreted as problematizing data - subjective emotional judgments could in turn be used to influence the user in a human-centered machine learning environment. In short, highly desired "emotion regulation" applications could potentially deviate toward "emotion manipulation" - the recent discrediting of emotion recognition technologies might transcend ethical issues of diversity and inclusion.

【8】 LipSound2: Self-Supervised Pre-Training for Lip-to-Speech Reconstruction and Lip Reading Link: https://arxiv.org/abs/2112.04748

Authors: Leyuan Qu, Cornelius Weber, Stefan Wermter Affiliations: Department of Informatics, University of Hamburg Comments: Submitted to IEEE Transactions on Neural Networks and Learning Systems Abstract: The aim of this work is to investigate the impact of crossmodal self-supervised pre-training for speech reconstruction (video-to-audio) by leveraging the natural co-occurrence of audio and visual streams in videos. We propose LipSound2, which consists of an encoder-decoder architecture and a location-aware attention mechanism to map face image sequences to mel-scale spectrograms directly without requiring any human annotations. The proposed LipSound2 model is first pre-trained on $\sim$2400h of multi-lingual (e.g. English and German) audio-visual data (VoxCeleb2). To verify the generalizability of the proposed method, we then fine-tune the pre-trained model on domain-specific datasets (GRID, TCD-TIMIT) for English speech reconstruction and achieve a significant improvement in speech quality and intelligibility compared to previous approaches in speaker-dependent and -independent settings. In addition to English, we conduct Chinese speech reconstruction on the CMLR dataset to verify the impact on transferability. Lastly, we train the cascaded lip reading (video-to-text) system by fine-tuning the generated audio on a pre-trained speech recognition system and achieve state-of-the-art performance on both English and Chinese benchmark datasets.

【9】 Noise-robust blind reverberation time estimation using noise-aware time-frequency masking Link: https://arxiv.org/abs/2112.04726

Authors: Kaitong Zheng, Chengshi Zheng, Jinqiu Sang, Yulong Zhang, Xiaodong Li Affiliations: Key Laboratory of Noise and Vibration Research, Institute of Acoustics, Chinese Academy of Sciences; University of Chinese Academy of Sciences, Beijing, China Abstract: The reverberation time is one of the most important parameters used to characterize the acoustic properties of an enclosure. In real-world scenarios, it is much more convenient to estimate the reverberation time blindly from recorded speech than to use traditional acoustic measurement techniques with professional measurement instruments. However, the recorded speech is often corrupted by noise, which has a detrimental effect on the estimation accuracy of the reverberation time. To address this issue, this paper proposes a two-stage blind reverberation time estimation method based on noise-aware time-frequency masking. The proposed method distinguishes the reverberation tails from the noise well, thus improving the estimation accuracy of reverberation time in noisy scenarios. The simulated and real-world acoustic experimental results show that the proposed method significantly outperforms other methods in challenging scenarios.
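
For readers unfamiliar with the quantity being estimated: the reverberation time (T60) is the time the sound energy in an enclosure takes to decay by 60 dB. The sketch below is not the blind, speech-based method of the paper. It shows the classical non-blind estimate from a room impulse response using Schroeder backward integration, applied to a synthetic decaying noise burst. The function names, the -5 dB to -25 dB fitting range, and the synthetic impulse response are illustrative assumptions.

```python
import math
import random


def synthetic_rir(t60, sample_rate, duration, seed=0):
    """Hypothetical test signal: Gaussian noise whose energy falls 60 dB in t60 seconds."""
    rng = random.Random(seed)
    decay = 3.0 * math.log(10.0) / t60  # amplitude envelope exp(-decay * t)
    n_samples = int(duration * sample_rate)
    return [rng.gauss(0.0, 1.0) * math.exp(-decay * n / sample_rate)
            for n in range(n_samples)]


def energy_decay_curve_db(rir):
    """Schroeder backward integration: energy remaining after each sample, in dB."""
    remaining = 0.0
    tail = [0.0] * len(rir)
    for i in range(len(rir) - 1, -1, -1):
        remaining += rir[i] * rir[i]
        tail[i] = remaining
    return [10.0 * math.log10(e / tail[0]) for e in tail if e > 0.0]


def estimate_t60(rir, sample_rate, start_db=-5.0, end_db=-25.0):
    """Fit a least-squares line to the decay curve and extrapolate it to -60 dB."""
    edc = energy_decay_curve_db(rir)
    pts = [(i / sample_rate, d) for i, d in enumerate(edc) if end_db <= d <= start_db]
    mean_t = sum(t for t, _ in pts) / len(pts)
    mean_d = sum(d for _, d in pts) / len(pts)
    slope = (sum((t - mean_t) * (d - mean_d) for t, d in pts)
             / sum((t - mean_t) ** 2 for t, _ in pts))  # dB per second, negative
    return -60.0 / slope


rir = synthetic_rir(t60=0.5, sample_rate=16000, duration=1.0)
t60_estimate = estimate_t60(rir, sample_rate=16000)
print(f"estimated T60: {t60_estimate:.2f} s")  # should land near the 0.5 s used above
```

A blind estimator has no access to the impulse response and must infer the decay from speech offsets, which is why additive noise that masks the decaying tails is a problem.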

【10】 CWS-PResUNet: Music Source Separation with Channel-wise Subband Phase-aware ResUNet Link: https://arxiv.org/abs/2112.04685

Authors: Haohe Liu, Qiuqiang Kong, Jiafeng Liu Affiliations: Sound, Audio, and Music Intelligence (SAMI) Group, ByteDance; The Ohio State University Comments: Published at MDX Workshop @ ISMIR 2021

【11】 NICE-Beam: Neural Integrated Covariance Estimators for Time-Varying Beamformers Link: https://arxiv.org/abs/2112.04613

Authors: Jonah Casebeer, Jacob Donley, Daniel Wong, Buye Xu, Anurag Kumar Affiliations: Reality Labs Research, USA
