Daily arXiv Digest: Finance / Speech / Audio Processing [9.13]

2021-09-16 10:59:03


q-fin (Quantitative Finance): 4 papers

cs.SD (Sound/Speech): 4 papers

eess.AS (Audio and Speech Processing): 4 papers

1. q-fin (Quantitative Finance):

【1】 An Alternative Approach to Evaluate American Options Price Using HJM Approach  Link: https://arxiv.org/abs/2109.04920

Authors: Kushantha Fernando, Vajira Manathunga  Affiliation: Middle Tennessee State University  Abstract: Developments in the finance industry and in academic research have led to innovative financial products. This paper presents an alternative approach to pricing American options. Our approach utilizes the well-known \cite{heath1992bond} ("HJM") technique to price American options written on an asset. Originally, the HJM forward-modeling approach was introduced as an alternative to bond pricing in the fixed-income market. Since then, \cite{schweizer2008term} and \cite{carmona2008infinite} extended the HJM forward-modeling approach to the equity market by capturing the dynamic nature of volatility. They modeled the term structure of volatility, which is commonly observed in the marketplace, as opposed to the constant-volatility assumption of the Black-Scholes framework. Using this approach, we propose an alternative value function, a stopping criterion, and a stopping time. We give an example of how to price an American put option using the proposed methodology.
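The abstract does not give enough detail to reproduce the HJM-based stopping rule, but the underlying pricing problem can be illustrated with a standard Cox-Ross-Rubinstein binomial tree for an American put under the constant-volatility assumption the paper moves away from; the function name and all parameter values below are illustrative, not the paper's method:

```python
import math

def american_put_crr(S0, K, r, sigma, T, steps):
    """Price an American put on a Cox-Ross-Rubinstein binomial tree."""
    dt = T / steps
    u = math.exp(sigma * math.sqrt(dt))     # up move
    d = 1.0 / u                             # down move
    p = (math.exp(r * dt) - d) / (u - d)    # risk-neutral up probability
    disc = math.exp(-r * dt)

    # Payoffs at maturity, indexed by the number of up moves j.
    values = [max(K - S0 * u**j * d**(steps - j), 0.0) for j in range(steps + 1)]

    # Backward induction with an early-exercise check at every node.
    for i in range(steps - 1, -1, -1):
        for j in range(i + 1):
            cont = disc * (p * values[j + 1] + (1 - p) * values[j])
            values[j] = max(cont, K - S0 * u**j * d**(i - j))
    return values[0]

price = american_put_crr(S0=100, K=100, r=0.05, sigma=0.2, T=1.0, steps=500)
```

The early-exercise `max` at every node is exactly what distinguishes the American from the European contract; the paper replaces this backward-induction stopping rule with an HJM-style value function and stopping time.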

【2】 Adjoint Differentiation for generic matrix functions  Link: https://arxiv.org/abs/2109.04913

Authors: Andrei Goloubentsev, Dmitri Goloubentsev, Evgeny Lakshtanov  Affiliation: Department of Mathematics, University of Aveiro  Abstract: We derive a formula for the adjoint $\overline{A}$ of a square-matrix operation of the form $C=f(A)$, where $f$ is holomorphic in a neighborhood of each eigenvalue. We then apply the formula to derive closed-form expressions in particular cases of interest, such as the case when we have a spectral decomposition $A=UDU^{-1}$, the spectrum cut-off of $A$, and the Nearest Correlation Matrix routine. Finally, we explain how to simplify the computation of adjoints for regularized linear regression coefficients.
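For the spectral-decomposition case, the adjoint has a well-known divided-difference (Daleckii-Krein) form. The sketch below restricts to symmetric $A$ (so `numpy.linalg.eigh` applies and $U^{-1}=U^T$); it is not the paper's general holomorphic construction:

```python
import numpy as np

def matfun_adjoint(A, Cbar, f, fprime):
    """Reverse-mode adjoint of C = f(A) for symmetric A via eigendecomposition.

    Uses the divided-difference matrix F[i, j] = (f(d_i) - f(d_j)) / (d_i - d_j),
    with f'(d_i) where eigenvalues (nearly) coincide.
    """
    d, U = np.linalg.eigh(A)
    fd = f(d)
    denom = d[:, None] - d[None, :]
    near = np.abs(denom) < 1e-12
    F = np.where(near,
                 fprime((d[:, None] + d[None, :]) / 2.0),
                 (fd[:, None] - fd[None, :]) / np.where(near, 1.0, denom))
    # Abar = U ((U^T Cbar U) o F) U^T, with o the Hadamard product.
    return U @ ((U.T @ Cbar @ U) * F) @ U.T

# Sanity demo with f(x) = x^2, whose adjoint is known in closed form:
# for symmetric A, Abar = A Cbar + Cbar A.
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4)); A = (A + A.T) / 2.0   # symmetric test matrix
Cbar = rng.standard_normal((4, 4))                     # upstream gradient
Abar = matfun_adjoint(A, Cbar, lambda x: x**2, lambda x: 2.0 * x)
```

The `np.where` guard keeps the divided difference well defined when eigenvalues coincide, falling back to the derivative $f'$ on those entries.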

【3】 When Two Worlds Collide: Using Particle Physics Tools to Visualize the Limit Order Book  Link: https://arxiv.org/abs/2109.04812

Authors: Marjolein E. Verhulst, Philippe Debie, Stephan Hageboeck, Joost M. E. Pennings, Cornelis Gardebroek, Axel Naumann, Paul van Leeuwen, Andres A. Trujillo-Barrera, Lorenzo Moneta  Affiliations: Wageningen University & Research, Wageningen, the Netherlands; Wageningen Economic Research, Den Haag, the Netherlands; European Organization for Nuclear Research (CERN), Meyrin, Switzerland; Maastricht University, Maastricht, the Netherlands  Note: 51 pages, 9 figures  Abstract: We introduce a methodology to visualize the limit order book (LOB) through a particle-physics lens. The open-source data-analysis tool ROOT, developed at CERN, is used to reconstruct and visualize futures markets. Message-based data is used, rather than snapshots, as it offers numerous visualization advantages. The visualization method can include multiple variables and markets simultaneously and is not necessarily time-dependent. Stakeholders can use it to visualize high-velocity data to gain a better understanding of markets or to monitor markets effectively. In addition, the method is easily adjusted to user specifications to examine various LOB research topics, thereby complementing existing methods.
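The message-based reconstruction step can be sketched independently of ROOT: the book is rebuilt level by level from a stream of events rather than read off snapshots. The message schema below (side, action, price, size) is a hypothetical simplification of a real futures feed:

```python
from collections import defaultdict

class PriceLevelBook:
    """Minimal price-level limit order book rebuilt from a message stream."""

    def __init__(self):
        self.levels = {"bid": defaultdict(int), "ask": defaultdict(int)}

    def apply(self, side, action, price, size):
        """Apply one (hypothetical) message: action is 'add', 'cancel' or 'trade'."""
        book = self.levels[side]
        if action == "add":
            book[price] += size
        elif action in ("cancel", "trade"):
            book[price] -= size
            if book[price] <= 0:        # level exhausted -> remove it
                del book[price]

    def best_bid(self):
        return max(self.levels["bid"]) if self.levels["bid"] else None

    def best_ask(self):
        return min(self.levels["ask"]) if self.levels["ask"] else None

book = PriceLevelBook()
for msg in [("bid", "add", 99, 5), ("ask", "add", 101, 3),
            ("bid", "add", 100, 2), ("bid", "cancel", 100, 2)]:
    book.apply(*msg)
```

In the paper's pipeline, the per-message state of such a book would then be filled into ROOT histograms for visualization.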

【4】 Risk-Adjusted Valuation for Real Option Decisions  Link: https://arxiv.org/abs/2109.04793

Authors: Carol Alexander, Xi Chen, Charles Ward  Abstract: We model investor heterogeneity using different required returns on an investment and evaluate the impact on the valuation of the investment. By assuming no disagreement about the cash flows, we emphasize how risk preferences in particular, but also the costs of capital, influence a subjective evaluation of the decision to invest now or to retain the option to invest in the future. We propose a risk-adjusted valuation model to facilitate investors' subjective decision making, in response to the market valuation of an investment opportunity. The investor's subjective assessment arises from their perceived misvaluation of the investment by the market, so projected cash flows are discounted using two different rates representing the investor's and the market's views. This liberates our model from perfect or imperfect hedging assumptions; instead, we are able to illustrate the hedging effect on the real option value when perceptions of risk premia diverge. During crisis periods, delaying an investment becomes more valuable as the idiosyncratic risk of future cash flows increases, but the decision-maker may rush to invest too quickly when the risk level is exceptionally high. Our model verifies features established by classical real-option valuation models and provides many new insights about the importance of modelling divergences in decision-makers' risk premia, especially during crisis periods. It also has many practical advantages because it requires no more parameter inputs than basic discounted-cash-flow approaches, such as the marketed asset disclaimer method, but the outputs are much richer. They allow for complex interactions between cost and revenue uncertainties, as well as easy exploration of the effects of hedgeable and un-hedgeable risks on the real option value. Furthermore, we provide fully adjustable Python code in which all parameter values can be chosen by the user.
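The core two-rate idea, discounting the same projected cash flows at the investor's and the market's required returns, can be sketched in a few lines. The project numbers and rates below are made up for illustration; this is not the paper's released code:

```python
def npv(cashflows, rate):
    """Net present value of (time_in_years, amount) pairs at a flat rate."""
    return sum(cf / (1.0 + rate) ** t for t, cf in cashflows)

# Hypothetical project: pay 100 now, receive 40 per year for three years.
cashflows = [(0, -100.0), (1, 40.0), (2, 40.0), (3, 40.0)]

investor_view = npv(cashflows, rate=0.08)  # investor's required return
market_view = npv(cashflows, rate=0.12)    # market's cost of capital
# The same cash flows look attractive to the investor but not to the market;
# this kind of perceived misvaluation is what drives the subjective decision
# to invest now versus retaining the option to invest later.
```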

2. cs.SD (Sound/Speech):

【1】 Self-Attention Channel Combinator Frontend for End-to-End Multichannel Far-field Speech Recognition  Link: https://arxiv.org/abs/2109.04783

Authors: Rong Gong, Carl Quillen, Dushyant Sharma, Andrew Goderre, José Laínez, Ljubomir Milanović  Affiliations: Nuance Communications GmbH, Vienna, Austria; Nuance Communications Inc., Burlington, USA; Nuance Communications S.A., Madrid, Spain  Note: In Proceedings of Interspeech 2021  Abstract: When sufficiently large far-field training data is available, jointly optimizing a multichannel frontend and an end-to-end (E2E) Automatic Speech Recognition (ASR) backend shows promising results. Recent literature has shown that traditional beamformer designs, such as MVDR (Minimum Variance Distortionless Response) or fixed beamformers, can be successfully integrated as the frontend into an E2E ASR system with learnable parameters. In this work, we propose the self-attention channel combinator (SACC) ASR frontend, which leverages the self-attention mechanism to combine multichannel audio signals in the magnitude spectral domain. Experiments conducted on multichannel playback test data show that the SACC achieved a 9.3% WERR compared to a state-of-the-art fixed-beamformer-based frontend, both jointly optimized with a ContextNet-based ASR backend. We also demonstrate the connection between the SACC and traditional beamformers, and analyze the intermediate outputs of the SACC.
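As a rough sketch of the idea (not the paper's architecture), channel-combination weights can be derived from self-attention scores between per-channel magnitude-spectrogram embeddings; the projection matrices below stand in for learned parameters:

```python
import numpy as np

def sacc_combine(mag_specs, Wq, Wk):
    """Combine multichannel magnitude spectrograms with self-attention weights.

    mag_specs: (channels, frames, bins).  Wq, Wk: stand-ins for learned
    query/key projections of shape (frames * bins, dk).
    """
    C, T, F = mag_specs.shape
    X = mag_specs.reshape(C, T * F)              # one flat embedding per channel
    q, k = X @ Wq, X @ Wk
    scores = q @ k.T / np.sqrt(q.shape[1])       # (C, C) channel-pair scores
    e = np.exp(scores - scores.max(axis=1, keepdims=True))
    attn = e / e.sum(axis=1, keepdims=True)      # softmax over key channels
    w = attn.mean(axis=0)                        # one combination weight per channel
    return np.tensordot(w, mag_specs, axes=1)    # weighted sum over channels

rng = np.random.default_rng(1)
Wq = rng.standard_normal((10 * 16, 8))
Wk = rng.standard_normal((10 * 16, 8))
spec = np.abs(rng.standard_normal((10, 16)))
mag = np.stack([spec] * 4)                       # four identical channels
out = sacc_combine(mag, Wq, Wk)
```

Because the softmax weights sum to one per channel, combining identical channels returns the channel unchanged, which mirrors how a learned combinator degenerates to pass-through when the channels agree.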

【2】 Speech Enhancement by Noise Self-Supervised Rank-Constrained Spatial Covariance Matrix Estimation via Independent Deeply Learned Matrix Analysis  Link: https://arxiv.org/abs/2109.04658

Authors: Sota Misawa, Norihiro Takamune, Tomohiko Nakamura, Daichi Kitamura, Hiroshi Saruwatari, Masakazu Une, Shoji Makino  Affiliations: The University of Tokyo, Tokyo, Japan; National Institute of Technology, Kagawa College, Kagawa, Japan; University of Tsukuba, Ibaraki, Japan; Waseda University, Fukuoka, Japan  Note: accepted for APSIPA 2021  Abstract: Rank-constrained spatial covariance matrix estimation (RCSCME) is a method for the situation in which directional target speech and diffuse noise are mixed. In conventional RCSCME, independent low-rank matrix analysis (ILRMA) is used as the preprocessing method. We propose RCSCME using independent deeply learned matrix analysis (IDLMA), which is a supervised extension of ILRMA. In this method, IDLMA requires deep neural networks (DNNs) to separate the target speech and the noise. We use Denoiser, a single-channel speech-enhancement DNN, in IDLMA to estimate not only the target speech but also the noise. We also propose noise self-supervised RCSCME, in which we estimate the noise-only time intervals using the output of Denoiser and design the prior distribution of the noise spatial covariance matrix for RCSCME. We confirm that the proposed methods outperform the conventional methods under several noise conditions.
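The noise self-supervision step, detecting noise-only intervals from a denoiser's speech estimate, can be approximated with a simple frame-energy threshold; the frame length, hop, and threshold below are illustrative, not the paper's values:

```python
import numpy as np

def noise_only_frames(speech_est, frame_len=512, hop=256, thresh_db=-40.0):
    """Flag frames where the denoiser's speech estimate has negligible energy.

    Frames whose energy is more than |thresh_db| below the loudest frame are
    treated as noise-only; all parameters are illustrative.
    """
    n = 1 + max(0, len(speech_est) - frame_len) // hop
    energy = np.array([np.sum(speech_est[i * hop : i * hop + frame_len] ** 2)
                       for i in range(n)])
    rel_db = 10.0 * np.log10(energy / (energy.max() + 1e-12) + 1e-12)
    return rel_db < thresh_db

# Demo: "silence" (no estimated speech) followed by a tone at 16 kHz.
t = np.arange(2048) / 16000.0
sig = np.concatenate([np.zeros(2048), np.sin(2 * np.pi * 440.0 * t)])
mask = noise_only_frames(sig)
```

The flagged frames would then supply the noise observations used to build the prior on the noise spatial covariance matrix.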

【3】 Large-vocabulary Audio-visual Speech Recognition in Noisy Environments  Link: https://arxiv.org/abs/2109.04894

Authors: Wentao Yu, Steffen Zeiler, Dorothea Kolossa  Affiliation: Institute of Communication Acoustics, Ruhr University Bochum, Germany  Abstract: Audio-visual speech recognition (AVSR) can effectively and significantly improve the recognition rates of small-vocabulary systems, compared to their audio-only counterparts. For large-vocabulary systems, however, there are still many difficulties, such as unsatisfactory video recognition accuracies, that make it hard to improve over audio-only baselines. In this paper, we specifically consider such scenarios, focusing on the large-vocabulary task of the LRS2 database, where audio-only performance is far superior to video-only accuracies, making this an interesting and challenging setup for multi-modal integration. To address the inherent difficulties, we propose a new fusion strategy: a recurrent integration network is trained to fuse the state posteriors of multiple single-modality models, guided by a set of model-based and signal-based stream reliability measures. During decoding, this network is used for stream integration within a hybrid recognizer, where it can thus cope with the time-variant reliability and information content of its multiple feature inputs. We compare the results with end-to-end AVSR systems as well as with competitive hybrid baseline models, finding that the new fusion strategy shows superior results, on average even outperforming oracle dynamic stream weighting, which has so far marked the (realistically unachievable) upper bound for standard stream weighting. Even though the pure lipreading performance is low, audio-visual integration is helpful under all conditions: clean, noisy, and reverberant. On average, the new system achieves a relative word-error-rate reduction of 42.18% compared to the audio-only model, pointing to the high effectiveness of the proposed integration approach.
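A minimal stand-in for the fusion idea, a reliability-weighted log-linear combination of per-stream state posteriors rather than the paper's recurrent integration network, looks like this:

```python
import numpy as np

def fuse_posteriors(posteriors, reliabilities):
    """Log-linear fusion of per-stream state posteriors.

    posteriors: (streams, frames, states); reliabilities: (streams, frames).
    A static weighted combination -- a simplified stand-in for the trained
    recurrent integration network in the paper.
    """
    w = reliabilities / reliabilities.sum(axis=0, keepdims=True)  # per-frame weights
    log_p = np.log(posteriors + 1e-12)
    fused = np.exp(np.einsum('sf,sfk->fk', w, log_p))
    return fused / fused.sum(axis=1, keepdims=True)               # renormalize per frame

rng = np.random.default_rng(2)
p0 = rng.dirichlet(np.ones(5), size=3)        # (frames, states) posteriors
posteriors = np.stack([p0, p0])               # two streams that happen to agree
fused = fuse_posteriors(posteriors, np.ones((2, 3)))
```

Standard dynamic stream weighting uses exactly such per-frame weights; the paper's contribution is to let a recurrent network predict the combination from reliability measures instead of fixing it.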

【4】 Directional MCLP Analysis and Reconstruction for Spatial Speech Communication  Link: https://arxiv.org/abs/2109.04544

Authors: Srikanth Raj Chetupalli, Thippur V. Sreenivas  Affiliation: Dept. of Electrical Communication Engineering, Indian Institute of Science, Bengaluru  Note: The manuscript is submitted as a full paper to IEEE/ACM Transactions on Audio, Speech and Language Processing  Abstract: Spatial speech communication, i.e., the reconstruction of the spoken signal along with the relative speaker position in the enclosure (reverberation information), is considered in this paper. Directional and diffuse components and the source-position information are estimated at the transmitter, and perceptually effective reproduction is considered at the receiver. We consider spatially distributed microphone arrays for signal acquisition, and node-specific signal estimation along with direction-of-arrival (DoA) estimation. A short-time Fourier transform (STFT) domain multi-channel linear prediction (MCLP) approach is used to model the diffuse component, and a relative acoustic transfer function is used to model the direct signal component. A distortionless array-response constraint and a time-varying complex Gaussian source model are used in the joint estimation of the source DoA and the constituent signal components, separately at each node. The intersection between the DoA directions at the nodes is used to compute the source position. The signal components computed at the node nearest to the estimated source position are taken as the signals for transmission. At the receiver, a four-channel loudspeaker (LS) setup is used for spatial reproduction, in which the source spatial image is reproduced relative to a chosen virtual listener position in the transmitter enclosure. The vector base amplitude panning (VBAP) method is used for direct-component reproduction over the LS setup, and the diffuse component is reproduced equally from all loudspeakers after decorrelation. This scheme of spatial speech communication is shown to be effective and more natural for hands-free telecommunication, through either loudspeaker listening or binaural headphone listening with head-related transfer function (HRTF) based presentation.
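VBAP itself is a standard algorithm; a minimal 2-D version for a single loudspeaker pair, solving for the gains and power-normalizing them, can be sketched as:

```python
import numpy as np

def vbap_2d(source_az, spk_az_pair):
    """2-D vector base amplitude panning gains for one loudspeaker pair.

    Solves L^T g = p for the gains and power-normalizes them; angles are in
    radians.  The paper's four-channel setup would first select the pair
    of loudspeakers bracketing the source direction.
    """
    p = np.array([np.cos(source_az), np.sin(source_az)])        # source unit vector
    L = np.array([[np.cos(a), np.sin(a)] for a in spk_az_pair]) # rows: speaker vectors
    g = np.linalg.solve(L.T, p)
    return g / np.linalg.norm(g)

# A source half-way between speakers at +/- 45 degrees gets equal gains.
g_mid = vbap_2d(0.0, [np.pi / 4, -np.pi / 4])
```

A source lying exactly on one loudspeaker's direction receives gain 1 on that speaker and 0 on the other, so panning degenerates gracefully at the pair boundaries.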

3. eess.AS (Audio and Speech Processing):

【1】 Large-vocabulary Audio-visual Speech Recognition in Noisy Environments  Link: https://arxiv.org/abs/2109.04894  (cross-listing of cs.SD 【3】 above)

【2】 Directional MCLP Analysis and Reconstruction for Spatial Speech Communication  Link: https://arxiv.org/abs/2109.04544  (cross-listing of cs.SD 【4】 above)

【3】 Self-Attention Channel Combinator Frontend for End-to-End Multichannel Far-field Speech Recognition  Link: https://arxiv.org/abs/2109.04783  (cross-listing of cs.SD 【1】 above)

【4】 Speech Enhancement by Noise Self-Supervised Rank-Constrained Spatial Covariance Matrix Estimation via Independent Deeply Learned Matrix Analysis  Link: https://arxiv.org/abs/2109.04658  (cross-listing of cs.SD 【2】 above)
