q-fin (Quantitative Finance), 17 papers in total
cs.SD (Sound), 12 papers in total
eess.AS (Audio and Speech Processing), 15 papers in total
1. q-fin (Quantitative Finance):
【1】 Multi-Asset Spot and Option Market Simulation Link: https://arxiv.org/abs/2112.06823
Authors: Magnus Wiese, Ben Wood, Alexandre Pachoud, Ralf Korn, Hans Buehler, Phillip Murray, Lianjun Bai Affiliations: J.P. Morgan, Bank Street, London, United Kingdom; University of Kaiserslautern, Gottlieb-Daimler-Straße, Kaiserslautern, Germany Abstract: We construct realistic spot and equity option market simulators for a single underlying on the basis of normalizing flows. We address the high dimensionality of market-observed call prices through an arbitrage-free autoencoder that approximates efficient low-dimensional representations of the prices while maintaining no static arbitrage in the reconstructed surface. Given a multi-asset universe, we leverage the conditional invertibility property of normalizing flows and introduce a scalable method to calibrate the joint distribution of a set of independent simulators while preserving the dynamics of each simulator. Empirical results highlight the goodness of the calibrated simulators and their fidelity.
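To make the normalizing-flow building block concrete, the following is a minimal NumPy sketch of a single affine coupling layer, the invertible transform such flows are stacked from. All names, weights and dimensions are illustrative assumptions; this is not the authors' model, which also involves the arbitrage-free autoencoder.

```python
import numpy as np

rng = np.random.default_rng(0)

def coupling_forward(x, w, b):
    """One affine coupling layer: the first half of x parameterizes
    an invertible affine map applied to the second half."""
    d = x.shape[-1] // 2
    x1, x2 = x[..., :d], x[..., d:]
    s = np.tanh(x1 @ w + b)          # log-scale, kept bounded for stability
    y2 = x2 * np.exp(s) + (x1 @ w)   # shift reuses the same toy network
    return np.concatenate([x1, y2], axis=-1), s.sum(axis=-1)  # log|det J|

def coupling_inverse(y, w, b):
    d = y.shape[-1] // 2
    y1, y2 = y[..., :d], y[..., d:]
    s = np.tanh(y1 @ w + b)
    x2 = (y2 - (y1 @ w)) * np.exp(-s)
    return np.concatenate([y1, x2], axis=-1)

w = rng.normal(size=(2, 2)); b = rng.normal(size=2)
x = rng.normal(size=(5, 4))
y, logdet = coupling_forward(x, w, b)
assert np.allclose(coupling_inverse(y, w, b), x)  # exact invertibility
```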
【2】 Hedging Cryptocurrency Options Link: https://arxiv.org/abs/2112.06807
Authors: Jovanka Lili Matic, Natalie Packham, Wolfgang Karl Härdle Abstract: The cryptocurrency (CC) market is volatile, non-stationary and non-continuous. This poses unique challenges for pricing and hedging CC options. We study the hedge behaviour and effectiveness for a wide range of models. First, we calibrate market data to SVI-implied volatility surfaces to price options. To cover a wide range of market dynamics, we generate price paths using two types of Monte Carlo simulations. In the first approach, price paths follow an SVCJ model (stochastic volatility with correlated jumps). The second approach simulates paths from a GARCH-filtered kernel density estimation. In these two markets, options are hedged with models from the class of affine jump diffusions and infinite-activity Lévy processes. Including a wide range of market models allows us to understand the trade-off in hedge performance between complete, but overly parsimonious, models and more complex, but incomplete, models. Dynamic Delta, Delta-Gamma, Delta-Vega and minimum variance hedge strategies are applied. The calibration results reveal a strong indication of stochastic volatility, low jump intensity and evidence of infinite activity. With the exception of short-dated options, consistently good performance is achieved with Delta-Vega hedging in stochastic volatility models. Judging by the calibration and hedging results, the study provides evidence that stochastic volatility is the driving force in CC markets.
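As a baseline for the dynamic Delta strategy compared in the paper, here is a sketch of Black-Scholes delta hedging of a short call along one simulated path. The high-volatility parameters are illustrative stand-ins for a cryptocurrency underlying; the paper's SVCJ and GARCH-KDE simulations are not reproduced.

```python
import numpy as np
from scipy.stats import norm

def bs_call(S, K, T, r, sigma):
    d1 = (np.log(S / K) + (r + 0.5 * sigma**2) * T) / (sigma * np.sqrt(T))
    d2 = d1 - sigma * np.sqrt(T)
    return S * norm.cdf(d1) - K * np.exp(-r * T) * norm.cdf(d2)

def bs_delta(S, K, T, r, sigma):
    d1 = (np.log(S / K) + (r + 0.5 * sigma**2) * T) / (sigma * np.sqrt(T))
    return norm.cdf(d1)

rng = np.random.default_rng(1)
S0, K, T, r, sigma, n = 100.0, 100.0, 0.25, 0.0, 0.8, 63  # high vol, CC-like
dt = T / n
S = np.concatenate([[S0], S0 * np.exp(np.cumsum(
    (r - 0.5 * sigma**2) * dt + sigma * np.sqrt(dt) * rng.normal(size=n)))])

cash = bs_call(S0, K, T, r, sigma)   # premium received for the short call
shares = 0.0
for i in range(n):
    target = bs_delta(S[i], K, T - i * dt, r, sigma)
    cash -= (target - shares) * S[i]             # rebalance to the model delta
    shares = target
hedge_error = shares * S[-1] + cash - max(S[-1] - K, 0.0)
print(f"terminal hedging error: {hedge_error:.3f}")  # near zero if BS holds
```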
【3】 FinRL-Meta: A Universe of Near-Real Market Environments for Data-Driven Deep Reinforcement Learning in Quantitative Finance Link: https://arxiv.org/abs/2112.06753
Authors: Xiao-Yang Liu, Jingyang Rui, Jiechao Gao, Liuqing Yang, Hongyang Yang, Zhaoran Wang, Christina Dan Wang, Jian Guo Affiliations: Columbia University; The University of Hong Kong; University of Virginia; Northwestern University; New York University (Shanghai); IDEA Research Abstract: Deep reinforcement learning (DRL) has shown huge potential in building financial market simulators recently. However, due to the highly complex and dynamic nature of real-world markets, raw historical financial data often involve large noise and may not reflect the future of markets, degrading the fidelity of DRL-based market simulators. Moreover, the accuracy of DRL-based market simulators heavily relies on numerous and diverse DRL agents, which increases demand for a universe of market environments and imposes a challenge on simulation speed. In this paper, we present a FinRL-Meta framework that builds a universe of market environments for data-driven financial reinforcement learning. First, FinRL-Meta separates financial data processing from the design pipeline of DRL-based strategies and provides open-source data engineering tools for financial big data. Second, FinRL-Meta provides hundreds of market environments for various trading tasks. Third, FinRL-Meta enables multiprocessing simulation and training by exploiting thousands of GPU cores. Our code is available online at https://github.com/AI4Finance-Foundation/FinRL-Meta.
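Environments of this kind follow the standard gym-style interface. The class below is a deliberately toy sketch of that interface (hypothetical names and reward; for the real environments see the GitHub link above):

```python
import numpy as np

class ToyTradingEnv:
    """A minimal gym-style market environment (illustrative only)."""
    def __init__(self, prices):
        self.prices = np.asarray(prices, dtype=float)
        self.t, self.position = 0, 0.0

    def reset(self):
        self.t, self.position = 0, 0.0
        return self._obs()

    def _obs(self):
        return np.array([self.prices[self.t], self.position])

    def step(self, action):            # action in {-1, 0, +1}: short/flat/long
        self.position = float(action)
        self.t += 1
        reward = self.position * (self.prices[self.t] - self.prices[self.t - 1])
        done = self.t == len(self.prices) - 1
        return self._obs(), reward, done, {}

env = ToyTradingEnv(100 + np.cumsum(np.random.default_rng(2).normal(size=50)))
obs, total, done = env.reset(), 0.0, False
while not done:
    obs, r, done, _ = env.step(+1)     # a trivial always-long policy
    total += r
print(f"buy-and-hold reward: {total:.2f}")
```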
【4】 Proper solutions for Epstein-Zin Stochastic Differential Utility Link: https://arxiv.org/abs/2112.06708
Authors: Martin Herdegen, David Hobson, Joseph Jerome Abstract: In this article, we consider the optimal investment-consumption problem for an agent with preferences governed by Epstein-Zin stochastic differential utility (EZ-SDU) who invests in a constant-parameter Black-Scholes-Merton market over the infinite horizon. The parameter combinations that we consider in this paper are such that the risk aversion parameter $R$ and the elasticity of intertemporal complementarity $S$ satisfy $\theta=\frac{1-R}{1-S}>1$. In this sense, this paper is complementary to Herdegen, Hobson and Jerome [arXiv:2107.06593]. The main novelty of the case $\theta>1$ (as opposed to $\theta\in(0,1)$) is that there is an infinite family of utility processes associated to every nonzero consumption stream. To deal with this issue, we introduce the economically motivated notion of a proper utility process, where, roughly speaking, a utility process is proper if it is nonzero whenever future consumption is nonzero. We then proceed to show that for a very wide class of consumption streams $C$, there exists a proper utility process $V$ associated to $C$. Furthermore, for a wide class of consumption streams $C$, the proper utility process $V$ is unique. Finally, we solve the optimal investment-consumption problem in a constant-parameter financial market, where we optimise over the right-continuous attainable consumption streams that have a unique proper utility process associated to them.
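For concreteness, the parameter regime can be checked by simple arithmetic (the values of $R$ and $S$ here are illustrative choices, not taken from the paper): with $R=\tfrac{1}{2}$ and $S=\tfrac{3}{4}$, $\theta=\frac{1-R}{1-S}=\frac{1/2}{1/4}=2>1$, which falls in this paper's regime, whereas $R=2$ and $S=3$ give $\theta=\frac{-1}{-2}=\tfrac{1}{2}\in(0,1)$, the regime of the companion paper [arXiv:2107.06593].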
【5】 Optimal Expansion of Business Opportunity Link: https://arxiv.org/abs/2112.06706
Authors: Ling Wang, Kexin Chen, Mei Choi Chiu, Hoi Ying Wong Abstract: Any firm whose business strategy has an exposure constraint that limits its potential gain naturally considers expansion, as this can increase its exposure. We model business expansion as an enlargement of the opportunity set for business policies. However, expansion is irreversible and has an opportunity cost attached. We use expected-utility optimization to formulate this as a novel stochastic control problem combined with an optimal stopping time, and we derive an explicit solution for exponential utility. We apply the framework to an investment and a reinsurance scenario. In the investment problem, the cost of and incentives for increasing the trading exposure are analyzed, while the optimal timing for an insurer to launch its reinsurance business is investigated in the reinsurance problem. Our model predicts that the additional income gained through business expansion is the key incentive for a decision to expand. Interestingly, companies may have this incentive but are likely to wait for a period of time before expanding, although situations of zero opportunity cost or specific restrictive conditions on the model parameters are exceptions to waiting. The business policy remains on the boundary of the opportunity set before expansion during the waiting period. The length of the waiting period is related to the opportunity cost, return, and risk of the expanded business.
【6】 Burst Market -- the next leap of humanity Link: https://arxiv.org/abs/2112.06646
Authors: Vincent Zha Note: 11 pages, no figures Abstract: Humanity faces a significant challenge: how to effectively share knowledge or skills among individuals. In particular, people often need quick help with information such as advice on computer/car/plumbing problems or shopping/travel/cooking tips. Such needs arise ubiquitously and most can be answered through quick chats in a few minutes or seconds. However, the reality is that askers usually cannot quickly find helpers, end up having to spend much more time and/or money searching, and sometimes get no answer at all. This points to a big problem: the current job market, called in this paper the Conventional Market (CM), massively fails such Burst Jobs. This paper argues that the reason for the failure is the high transaction cost due to technical constraints, leading to huge job losses and underutilization of human intelligence. To solve the problem, this paper suggests establishing a virtual Burst Market (BM) allowing people to easily sell any services through audio or video conversations at their own rate over extremely short periods of time, such as a few minutes or seconds. The solution is feasible and lucrative thanks to technological progress. With unprecedented accessibility and attractiveness, the BM will create massive part-time or full-time jobs and make people's lives more productive and convenient. In particular, the BM will help with poverty lifting, as people will be able to earn income more easily; reduce AI-led unemployment, as BM jobs will be highly intelligent; maximize human value, as people will be able to acquire and sell their skills more easily; and increase prosperity and human achievement thanks to the collaboration breakthrough. The BM will also reshape industries and may cause a global power reshuffle or even the ultimate destruction of nations. In short, this paper creates the notion of the BM and argues that it will invoke the next leap of humanity.
【7】 Will enterprise digital transformation affect diversification strategy? Link: https://arxiv.org/abs/2112.06605
Authors: Ge-zhi Wu, Da-ming You Affiliations: Business School, Central South University, Changsha, China; Collaborative Innovation Center of Resource-conserving and Environmentally Friendly Society and Ecological Civilization, Central South University, Changsha, China Abstract: This paper empirically examines the impact of enterprise digital transformation on the level of enterprise diversification. We find that the digital transformation of enterprises significantly raises the level of enterprise diversification, and this conclusion passes a series of robustness and endogeneity tests. We find that the promotion effect of enterprise digital transformation on diversification is mainly realized by reducing enterprise organization costs. The impact of digital transformation on enterprise diversification is heterogeneous: it plays a more significant role in situations with higher organization costs, in situations with lower transaction costs, and in the service industry.
【8】 Time-consistent mean-variance reinsurance-investment problem with long-range dependent mortality rate Link: https://arxiv.org/abs/2112.06602
Authors: Ling Wang, Mei Choi Chiu, Hoi Ying Wong Abstract: This paper investigates the time-consistent mean-variance reinsurance-investment (RI) problem faced by life insurers. Inspired by recent findings that mortality rates exhibit long-range dependence (LRD), we examine the effect of LRD on RI strategies. We adopt the Volterra mortality model proposed in Wang et al. (2021) to incorporate LRD into the mortality rate process and describe insurance claims using a compound Poisson process with intensity represented by the stochastic mortality rate. Under the open-loop equilibrium mean-variance criterion, we derive explicit equilibrium RI controls and study the uniqueness of these controls in cases of constant and state-dependent risk aversion. We simultaneously resolve difficulties arising from unbounded non-Markovian parameters and sudden increases in the insurer's wealth process. We also use a numerical study to reveal the influence of LRD on equilibrium strategies.
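The claims component described above, a compound Poisson process with stochastic intensity, can be simulated by Lewis-Ogata thinning. A minimal sketch follows; the sinusoidal intensity is a stand-in, and the paper's Volterra mortality model is more involved:

```python
import numpy as np

rng = np.random.default_rng(3)
T, lam_max = 10.0, 5.0            # horizon and an upper bound on the intensity

def intensity(t):
    # a stand-in time-varying mortality intensity, bounded by lam_max
    return 2.5 + 1.5 * np.sin(0.8 * t)

# thinning: propose events at rate lam_max, accept with prob intensity/lam_max
t, claim_times = 0.0, []
while True:
    t += rng.exponential(1.0 / lam_max)
    if t > T:
        break
    if rng.uniform() < intensity(t) / lam_max:
        claim_times.append(t)

claim_sizes = rng.exponential(1.0, size=len(claim_times))   # i.i.d. claim sizes
print(len(claim_times), "claims, aggregate loss", claim_sizes.sum().round(2))
```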
【9】 Cryptocurrency Market Consolidation in 2020--2021 Link: https://arxiv.org/abs/2112.06552
Authors: Jarosław Kwapień, Marcin Wątorek, Stanisław Drożdż Affiliations: Institute of Nuclear Physics; Cracow University of Technology Abstract: Time series of price returns for 80 of the most liquid cryptocurrencies listed on Binance are investigated for the presence of detrended cross-correlations. A spectral analysis of the detrended correlation matrix and a topological analysis of the minimal spanning trees calculated based on this matrix are applied for different positions of a moving window. The cryptocurrencies have become more strongly cross-correlated among themselves than they used to be. The average cross-correlations increase with time on a specific time scale in a way that resembles the Epps effect amplification when going from past to present. The minimal spanning trees also change their topology: for short time scales, they become more centralized, with increasing maximum node degrees, while for long time scales they become more distributed, but also more correlated at the same time. Apart from the inter-market dependencies, the detrended cross-correlations between the cryptocurrency market and some traditional markets, like the stock markets, commodity markets, and Forex, are also analyzed. The cryptocurrency market shows higher levels of cross-correlations with the other markets during the same turbulent periods in which it is strongly cross-correlated itself.
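A minimal version of the minimal-spanning-tree construction used here: map correlations to the standard distance $d_{ij}=\sqrt{2(1-\rho_{ij})}$ and extract the MST. Synthetic returns stand in for the Binance data, and the detrending step is omitted:

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

rng = np.random.default_rng(4)
returns = rng.normal(size=(500, 10))          # stand-in for asset returns
rho = np.corrcoef(returns, rowvar=False)
dist = np.sqrt(2.0 * (1.0 - rho))             # standard correlation distance
np.fill_diagonal(dist, 0.0)

mst = minimum_spanning_tree(dist).toarray()   # 9 edges for 10 nodes
edges = np.argwhere(mst > 0)
degrees = np.bincount(edges.ravel(), minlength=10)
print("max node degree:", degrees.max())      # the centralization proxy above
```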
【10】 Mesoscopic Structure of the Stock Market and Portfolio Optimization Link: https://arxiv.org/abs/2112.06544
Authors: Sebastiano Michele Zema, Giorgio Fagiolo, Tiziano Squartini, Diego Garlaschelli Affiliations: IMT Institute for Advanced Studies, Italy; Lorentz Institute for Theoretical Physics, University of Leiden Note: 21 pages, 8 figures Abstract: The idiosyncratic (microscopic) and systemic (macroscopic) components of market structure have been shown to be responsible for the departure of the optimal mean-variance allocation from the heuristic 'equally-weighted' portfolio. In this paper, we exploit clustering techniques derived from Random Matrix Theory (RMT) to study a third, intermediate (mesoscopic) market structure that turns out to be the most stable over time and provides important practical insights from a portfolio management perspective. First, we illustrate the benefits, in terms of predicted and realized risk profiles, of constructing portfolios by filtering out both random and systemic co-movements from the correlation matrix. Second, we redefine the portfolio optimization problem in terms of stock clusters that emerge after filtering. Finally, we propose a new wealth allocation scheme that attaches equal importance to stocks belonging to the same community and show that it further increases the reliability of the constructed portfolios. Results are robust across different time spans, cross-sectional dimensions and sets of constraints defining the optimization problem.
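The "filtering out random co-movements" step can be sketched with a basic Marchenko-Pastur eigenvalue clip; the paper's RMT-based clustering is more refined than this:

```python
import numpy as np

rng = np.random.default_rng(5)
T_obs, N = 1000, 50
X = rng.normal(size=(T_obs, N))               # stand-in return matrix
C = np.corrcoef(X, rowvar=False)

q = T_obs / N
lam_plus = (1 + 1 / np.sqrt(q)) ** 2          # Marchenko-Pastur upper edge

vals, vecs = np.linalg.eigh(C)
noise = vals < lam_plus                       # eigenvalues inside the MP bulk
vals_f = vals.copy()
vals_f[noise] = vals[noise].mean()            # flatten the noise band (trace kept)
C_filtered = vecs @ np.diag(vals_f) @ vecs.T
np.fill_diagonal(C_filtered, 1.0)
print("kept", int((~noise).sum()), "signal eigenvalues of", N)
```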
【11】 Unification of different systemic risk measures and Aumann-Shapley allocations Link: https://arxiv.org/abs/2112.06534
Authors: Ludger Overbeck, Florian Schindler Affiliations: Institute of Mathematics, Justus-Liebig-Universität Gießen, Gießen, Germany Abstract: We study two different contributions to the theory of systemic risk measures. It turns out that crucial properties are shared by both types and that, in most relevant cases, both can be included in the axiomatic approach. Moreover, a capital allocation rule (CAR) in the spirit of Aumann-Shapley is introduced, which gives us the opportunity to compute systemic capital allocations regardless of the risk measurement approach. Additionally, this CAR yields an alternative approach to finding the corresponding counterpart in the axiomatic approach.
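Numerically, an Aumann-Shapley allocation integrates the marginal risk along the diagonal from the zero position to the full portfolio. A toy sketch with a standard-deviation risk measure (the paper's systemic risk measures are more general); note that the allocations sum to the total risk, the full-allocation property:

```python
import numpy as np

rng = np.random.default_rng(14)
X = rng.normal(size=(20000, 3)) @ np.diag([1.0, 2.0, 0.5])   # unit positions

def rho(w):
    """Toy risk measure: standard deviation of the aggregated position."""
    return (X @ w).std()

def aumann_shapley(n_steps=50, eps=1e-4):
    """AS allocation: average the risk gradient along t -> t * (1,1,1)."""
    alloc = np.zeros(3)
    for t in np.linspace(1e-3, 1.0, n_steps):
        w = np.full(3, t)
        grad = np.array([(rho(w + eps * e) - rho(w)) / eps for e in np.eye(3)])
        alloc += grad / n_steps
    return alloc

alloc = aumann_shapley()
print(alloc.round(4), "sum vs total risk:",
      alloc.sum().round(4), rho(np.ones(3)).round(4))
```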
【12】 A highly granular model of China's coal production, transport and consumption system shows how its decarbonization and energy security plans will affect coal imports Link: https://arxiv.org/abs/2112.06357
Authors: Jorrit Gosens, Alex Turnbull, Frank Jotzo Affiliations: Crawford School of Public Policy, Australian National University, Acton, Australian Capital Territory, Australia; Keshik Capital, Singapore Abstract: China aims for net-zero carbon emissions by 2060, and an emissions peak before 2030. This will reduce its consumption of coal for power generation and steel making. Simultaneously, China aims for improved energy security, primarily with expanded domestic coal production and transport infrastructure. Here, we analyze the effects of both these pressures on seaborne coal imports, with a purpose-built model of China's coal production, transport, and consumption system with installation-level geospatial and technical detail. This represents a 1000-fold increase in granularity versus earlier models, allowing representation of aspects that have previously been obscured. We find that reduced Chinese coal consumption affects seaborne imports much more strongly than domestic supply. Recent expansions of rail and port capacity, which reduce the costs of getting domestic coal to Southern coastal provinces, will further reduce demand for seaborne thermal coal and amplify the effect of decarbonisation on coal imports. Seaborne coking coal imports are also likely to fall, because of expanded supply of cheap and high-quality coking coal from neighbouring Mongolia.
【13】 U.S. Long-Term Earnings Outcomes by Sex, Race, Ethnicity, and Place of Birth Link: https://arxiv.org/abs/2112.05822
Authors: Kevin L. McKinney, John M. Abowd, Hubert P. Janicki Note: 77 pages, 42 figures Abstract: This paper is part of the Global Income Dynamics Project cross-country comparison of earnings inequality, volatility, and mobility. Using data from the U.S. Census Bureau's Longitudinal Employer-Household Dynamics (LEHD) infrastructure files, we produce a uniform set of earnings statistics for the U.S. From 1998 to 2019, we find U.S. earnings inequality has increased and volatility has decreased. The combination of increased inequality and reduced volatility suggests that earnings growth differs substantially across different demographic groups. We explore this further by estimating 12-year average earnings for a single cohort of age 25-54 eligible workers. Differences in labor supply (hours paid and quarters worked) are found to explain almost 90% of the variation in worker earnings, although even after controlling for labor supply, substantial earnings differences across demographic groups remain unexplained. Using a quantile regression approach, we estimate counterfactual earnings distributions for each demographic group. We find that at the bottom of the earnings distribution, differences in characteristics such as hours paid, geographic division, industry, and education explain almost all of the earnings gap; however, above the median, the contribution of differences in the returns to characteristics becomes the dominant component.
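A sketch of the quantile-regression step on synthetic data, using statsmodels' QuantReg; the covariates and coefficients below are illustrative, not the LEHD variables:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
n = 2000
hours = rng.uniform(500, 2500, n)                 # stand-in for hours paid
educ = rng.integers(10, 21, n)                    # years of education
earn = 2000 + 12 * hours + 1500 * educ + rng.normal(0, 8000, n)

X = sm.add_constant(np.column_stack([hours, educ]))
for q in (0.1, 0.5, 0.9):                         # bottom, median, top
    fit = sm.QuantReg(earn, X).fit(q=q)
    print(f"q={q}: hours coef {fit.params[1]:.1f}, educ coef {fit.params[2]:.0f}")
```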
【14】 COVID-19 Forecasts via Stock Market Indicators Link: https://arxiv.org/abs/2112.06393
Authors: Yi Liang, James Unwin Affiliations: PRIMES, Massachusetts Institute of Technology, Cambridge, MA, USA; Department of Physics, University of Illinois at Chicago, Chicago, IL, USA Note: 11 pages, 9 figures Abstract: Reliable short-term forecasting can provide potentially lifesaving insights into logistical planning, and in particular, into the optimal allocation of resources such as hospital staff and equipment. By reinterpreting COVID-19 daily cases in terms of candlesticks, we are able to apply some of the most popular stock market technical indicators to obtain predictive power over the course of the pandemic. By providing a quantitative assessment of MACD, RSI, and candlestick analyses, we show their statistical significance in making predictions for both stock market data and WHO COVID-19 data. In particular, we show the utility of this novel approach by considering the identification of the beginnings of subsequent waves of the pandemic. Finally, our new methods are used to assess whether current health policies are impacting the growth in new COVID-19 cases.
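The MACD indicator transfers directly to a daily-case series. A sketch on synthetic case counts, using the standard 12/26/9 exponential-moving-average parameters:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
t = np.arange(300)
cases = 1000 * np.exp(0.3 * np.sin(t / 30)) * (1 + 0.1 * rng.normal(size=300))
s = pd.Series(cases).clip(lower=0)                # stand-in daily case counts

ema12 = s.ewm(span=12, adjust=False).mean()       # standard MACD parameters
ema26 = s.ewm(span=26, adjust=False).mean()
macd = ema12 - ema26
signal = macd.ewm(span=9, adjust=False).mean()

# a bullish MACD crossover here would flag accelerating case growth
cross_up = (macd > signal) & (macd.shift(1) <= signal.shift(1))
print("crossover days:", list(np.flatnonzero(cross_up.to_numpy()))[:5])
```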
【15】 A q-spin Potts model of markets: Gain-loss asymmetry in stock indices as an emergent phenomenon Link: https://arxiv.org/abs/2112.06290
Authors: Stefan Bornholdt Affiliations: Institut für Theoretische Physik, Universität Bremen, Germany Abstract: Spin models of markets inspired by physics models of magnetism, such as the Ising model, allow for the study of the collective dynamics of interacting agents in a market. The number of possible states has mostly been limited to two (buy or sell) or three options. However, herding effects of competing stocks and the collective dynamics of a whole market may escape our reach in the simplest models. Here I study a q-spin Potts model version of a simple Ising market model to represent the dynamics of a stock market index in a spin model. As a result, a self-organized gain-loss asymmetry in the time series of an index variable composed of stocks in this market is observed.
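A q-state Potts-style Metropolis simulation conveys the basic mechanics of such agent models. The global coupling term below is an illustrative stand-in, not the paper's exact Hamiltonian:

```python
import numpy as np

rng = np.random.default_rng(8)
q, L, beta, alpha = 3, 32, 1.2, 4.0
spins = rng.integers(0, q, size=(L, L))     # q choices per agent

def local_energy(s, i, j, state):
    nn = [s[(i+1) % L, j], s[(i-1) % L, j], s[i, (j+1) % L], s[i, (j-1) % L]]
    e = -sum(int(state == n) for n in nn)              # align with neighbours
    frac = np.mean(s == state)
    return e + alpha * abs(frac - 1.0 / q)             # toy global coupling

for _ in range(20000):                                 # Metropolis updates
    i, j = rng.integers(L), rng.integers(L)
    new = rng.integers(q)
    dE = local_energy(spins, i, j, new) - local_energy(spins, i, j, spins[i, j])
    if dE <= 0 or rng.uniform() < np.exp(-beta * dE):
        spins[i, j] = new

occupation = np.bincount(spins.ravel(), minlength=q) / L**2
print("state occupation:", occupation.round(3))
```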
【16】 On the Stability, Economic Efficiency and Incentive Compatibility of Electricity Market Dynamics Link: https://arxiv.org/abs/2112.05811
Authors: Pengcheng You, Yan Jiang, Enoch Yeung, Dennice F. Gayme, Enrique Mallada Affiliations: Whiting School of Engineering, Johns Hopkins University Abstract: This paper focuses on the operation of an electricity market that accounts for participants that bid at a sub-minute timescale. To that end, we model the market-clearing process as a dynamical system, called market dynamics, which is temporally coupled with the grid frequency dynamics and is thus required to guarantee system-wide stability while meeting the system operational constraints. We characterize participants as price-takers who rationally update their bids to maximize their utility in response to real-time schedules of prices and dispatch. For two common bidding mechanisms, based on quantity and price, we identify a notion of alignment between participants' behavior and planners' goals that leads to a saddle-based design of the market that guarantees convergence to a point meeting all operational constraints. We further explore cases where this alignment property does not hold and observe that misaligned participants' bidding can destabilize the closed-loop system. We thus design a regularized version of the market dynamics that recovers all the desirable stability and steady-state performance guarantees. Numerical tests validate our results on the IEEE 39-bus system.
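The saddle-based design alluded to above corresponds to primal-dual (saddle-point) gradient dynamics on a Lagrangian. A toy economic-dispatch sketch with quadratic generation costs and a supply-demand balance constraint (all numbers illustrative):

```python
import numpy as np

# minimize sum_i a_i * g_i^2 / 2  subject to  sum_i g_i = demand,
# via gradient descent-ascent on the Lagrangian L(g, lam)
a = np.array([1.0, 2.0, 0.5])     # cost curvatures of three generators
demand = 10.0
g = np.zeros(3)                   # primal variables: generation
lam = 0.0                         # dual variable: the market price
eta = 0.05
for _ in range(5000):
    g -= eta * (a * g - lam)                 # primal descent on L
    lam += eta * (demand - g.sum())          # dual ascent on L
print("dispatch:", g.round(3), "price:", round(lam, 3))
# converges to a_i * g_i = lam with sum(g) = demand, the efficient dispatch
```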
【17】 Price Stability of Cryptocurrencies as a Medium of Exchange Link: https://arxiv.org/abs/2111.08390
Authors: Tatsuru Kikuchi, Toranosuke Onishi, Kenichi Ueda Affiliations: The University of Tokyo; Tokio Marine & Nichido Fire Insurance Co., Ltd. Note: 127 pages Abstract: We present positive evidence of price stability of cryptocurrencies as a medium of exchange. For the sample years from 2016 to 2020, the prices of major cryptocurrencies are found to be stable, relative to major financial assets. Specifically, after filtering out the less-than-one-month cycles, we investigate the daily returns in US dollars of the major cryptocurrencies (i.e., Bitcoin, Ethereum, and Ripple) as well as their comparators (i.e., major legal tenders, the Euro and Japanese yen, and the major stock indexes, S&P 500 and MSCI World Index). We examine the stability of the filtered daily returns using three different measures. First, the Pearson correlations increased in the later years of our sample. Second, based on the dynamic time-warping method that allows lags and leads in relations, the similarities in the daily returns of cryptocurrencies with their comparators have been present even since 2016. Third, we check whether the cumulative sum of errors in predicting cryptocurrency prices, assuming stable relations with the comparators' daily returns, does not exceed the bounds implied by the Black-Scholes model. This test, in other words, does not reject the efficient market hypothesis.
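The dynamic time-warping measure in the second test can be computed with the classic DP recursion. A sketch on synthetic return series (stand-ins for the cryptocurrency and index data):

```python
import numpy as np

def dtw(a, b):
    """Classic dynamic-time-warping distance, allowing lags and leads."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

rng = np.random.default_rng(9)
btc = rng.normal(0, 0.04, 250)                            # stand-in daily returns
spx = 0.5 * np.roll(btc, 3) + rng.normal(0, 0.01, 250)    # lagged co-movement
print("DTW:", round(dtw(btc, spx), 3),
      "vs shuffled:", round(dtw(btc, rng.permutation(spx)), 3))
```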
2. cs.SD (Sound):
【1】 Mean-square-error-based secondary source placement in sound field synthesis with prior information on desired field Link: https://arxiv.org/abs/2112.06774
Authors: Keisuke Kimura, Shoichi Koyama, Natsuki Ueno, Hiroshi Saruwatari Affiliations: The University of Tokyo, Hongo, Bunkyo-ku, Tokyo, Japan Note: Accepted to IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA) 2021 Abstract: A method of optimizing secondary source placement in sound field synthesis is proposed. Such an optimization method will be useful when the allowable placement region and the available number of loudspeakers are limited. We formulate a mean-square-error-based cost function, incorporating the statistical properties of possible desired sound fields, for general linear-least-squares-based sound field synthesis methods, including pressure matching and (weighted) mode matching, whereas most current methods are applicable only to the pressure-matching method. An efficient greedy algorithm for minimizing the proposed cost function is also derived. Numerical experiments indicated that a high reproduction accuracy can be achieved by the placement optimized by the proposed method compared with the empirically used regular placement.
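The greedy placement idea can be sketched as follows: at each step, add the candidate source whose inclusion most reduces the least-squares pressure-matching error at the control points. Random complex transfer functions stand in for actual Green's functions, and the prior over desired fields used in the paper's cost is omitted:

```python
import numpy as np

rng = np.random.default_rng(10)
n_cand, n_ctrl, k = 40, 60, 8
G = rng.normal(size=(n_ctrl, n_cand)) + 1j * rng.normal(size=(n_ctrl, n_cand))
d = rng.normal(size=n_ctrl) + 1j * rng.normal(size=n_ctrl)   # desired field

selected = []
for _ in range(k):                       # greedy: add the source that most
    best, best_err = None, np.inf        # reduces the pressure-matching MSE
    for c in range(n_cand):
        if c in selected:
            continue
        A = G[:, selected + [c]]
        drive, *_ = np.linalg.lstsq(A, d, rcond=None)
        err = np.linalg.norm(A @ drive - d) ** 2
        if err < best_err:
            best, best_err = c, err
    selected.append(best)
print("chosen sources:", sorted(selected), f"residual {best_err:.3f}")
```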
【2】 Computational bioacoustics with deep learning: a review and roadmap Link: https://arxiv.org/abs/2112.06725
Authors: Dan Stowell Affiliations: Department of Cognitive Science and Artificial Intelligence, Tilburg University, Tilburg; Naturalis Biodiversity Center, Leiden, The Netherlands Abstract: Animal vocalisations and natural soundscapes are fascinating objects of study, and contain valuable evidence about animal behaviours, populations and ecosystems. They are studied in bioacoustics and ecoacoustics, with signal processing and analysis an important component. Computational bioacoustics has accelerated in recent decades due to the growth of affordable digital sound recording devices, and to huge progress in informatics such as big data, signal processing and machine learning. Methods are inherited from the wider field of deep learning, including speech and image processing. However, the tasks, demands and data characteristics are often different from those addressed in speech or music analysis. There remain unsolved problems, and tasks for which evidence is surely present in many acoustic signals, but not yet realised. In this paper I perform a review of the state of the art in deep learning for computational bioacoustics, aiming to clarify key concepts and identify and analyse knowledge gaps. Based on this, I offer a subjective but principled roadmap for computational bioacoustics with deep learning: topics that the community should aim to address, in order to make the most of future developments in AI and informatics, and to use audio data in answering zoological and ecological questions.
【3】 PM-MMUT: Boosted Phone-mask Data Augmentation using Multi-modeling Unit Training for Robust Uyghur E2E Speech Recognition Link: https://arxiv.org/abs/2112.06721
Authors: Guodong Ma, Pengfei Hu, Nurmemet Yolwas, Shen Huang, Hao Huang Affiliations: School of Information Science and Engineering, Xinjiang University, Urumqi, China; Tencent Minority-Mandarin Translation, Beijing, China; Xinjiang Provincial Key Laboratory of Multi-lingual Information Technology, Urumqi, China Note: Submitted to ICASSP 2022 Abstract: Consonant and vowel reduction are often encountered in Uyghur speech, which might cause performance degradation in Uyghur automatic speech recognition (ASR). Our recently proposed learning strategy based on masking, Phone Masking Training (PMT), alleviates the impact of this phenomenon in Uyghur ASR. Although PMT achieves remarkable improvements, there still exists room for further gains due to the granularity mismatch between the masking unit of PMT (phoneme) and the modeling unit (word-piece). To boost the performance of PMT, we propose multi-modeling unit training (MMUT) architecture fusion with PMT (PM-MMUT). The idea of the MMUT framework is to split the encoder into two parts: acoustic feature sequences to phoneme-level representation (AF-to-PLR) and phoneme-level representation to word-piece-level representation (PLR-to-WPLR). It allows AF-to-PLR to be optimized by an intermediate phoneme-based CTC loss to learn the rich phoneme-level context information brought by PMT. Experimental results on Uyghur ASR show that the proposed approaches improve significantly, outperforming pure PMT (reducing WER from 24.0 to 23.7 on Read-Test and from 38.4 to 36.8 on Oral-Test, respectively). We also conduct experiments on the 960-hour Librispeech benchmark using ESPnet1, achieving about 10% relative WER reduction on all the test sets without LM fusion compared with the latest official ESPnet1 pre-trained model.
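The phone-masking idea underlying PMT can be sketched in a few lines: given a frame-level phone alignment, zero out all frames of randomly sampled phones. Names and the masking probability are illustrative; the exact PMT policy follows the cited paper:

```python
import numpy as np

rng = np.random.default_rng(11)
T_frames, n_mels = 200, 80
feats = rng.normal(size=(T_frames, n_mels))           # e.g. log-mel features
phone_ids = np.repeat(rng.integers(0, 30, 25), 8)     # toy frame-level alignment

def phone_mask(feats, phone_ids, p=0.1):
    """Zero out all frames of randomly selected phones (cf. PMT)."""
    out = feats.copy()
    for ph in np.unique(phone_ids):
        if rng.uniform() < p:
            out[phone_ids == ph] = 0.0
    return out

masked = phone_mask(feats, phone_ids)
print("masked frames:", int((masked == 0).all(axis=1).sum()), "of", T_frames)
```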
【4】 Detecting Emotion Carriers by Combining Acoustic and Lexical Representations Link: https://arxiv.org/abs/2112.06603
Authors: Sebastian P. Bayerl, Aniruddha Tammewar, Korbinian Riedhammer, Giuseppe Riccardi Affiliations: Technische Hochschule Nürnberg Georg Simon Ohm, Germany; Signals and Interactive Systems Lab, University of Trento Note: Accepted at ASRU 2021 Abstract: Personal narratives (PN) - spoken or written - are recollections of facts, people, events, and thoughts from one's own experience. Emotion recognition and sentiment analysis tasks are usually defined at the utterance or document level. However, in this work, we focus on Emotion Carriers (EC), defined as the segments (speech or text) that best explain the emotional state of the narrator ("loss of father", "made me choose"). Once extracted, such EC can provide a richer representation of the user state to improve natural language understanding and dialogue modeling. In previous work, it has been shown that EC can be identified using lexical features. However, spoken narratives should provide a richer description of the context and the user's emotional state. In this paper, we leverage word-based acoustic and textual embeddings as well as early and late fusion techniques for the detection of ECs in spoken narratives. For the acoustic word-level representations, we use Residual Neural Networks (ResNet) pretrained on separate speech emotion corpora and fine-tuned to detect EC. Experiments with different fusion and system combination strategies show that late fusion leads to significant improvements for this task.
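Late fusion in its simplest form combines per-token posteriors from the two models after each has been run independently. A minimal sketch with made-up probabilities:

```python
import numpy as np

def late_fusion(p_acoustic, p_lexical, w=0.5):
    """Fuse per-token EC posteriors from two independent models
    by weighted averaging (one simple late-fusion scheme)."""
    return w * p_acoustic + (1.0 - w) * p_lexical

p_ac = np.array([0.2, 0.7, 0.9, 0.1])     # acoustic model: P(token is EC)
p_lx = np.array([0.4, 0.8, 0.6, 0.2])     # lexical model
print((late_fusion(p_ac, p_lx) > 0.5).astype(int))   # fused EC tags
```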
【5】 Detecting Audio Adversarial Examples with Logit Noising Link: https://arxiv.org/abs/2112.06443
Authors: Namgyu Park, Sangwoo Ji, Jong Kim Affiliations: POSTECH, Pohang, South Korea Note: 10 pages, 12 figures, In Proceedings of the 37th Annual Computer Security Applications Conference (ACSAC) 2021 Abstract: Automatic speech recognition (ASR) systems are vulnerable to audio adversarial examples that attempt to deceive ASR systems by adding perturbations to benign speech signals. Although an adversarial example and the original benign wave are indistinguishable to humans, the former is transcribed as a malicious target sentence by ASR systems. Several methods have been proposed to generate audio adversarial examples and feed them directly into the ASR system (over-line). Furthermore, many researchers have demonstrated the feasibility of robust physical audio adversarial examples (over-air). To defend against the attacks, several studies have been proposed. However, deploying them in a real-world situation is difficult because of accuracy drops or time overhead. In this paper, we propose a novel method to detect audio adversarial examples by adding noise to the logits before feeding them into the decoder of the ASR. We show that carefully selected noise can significantly impact the transcription results of audio adversarial examples, whereas it has minimal impact on the transcription results of benign audio waves. Based on this characteristic, we detect audio adversarial examples by comparing the transcription altered by logit noising with its original transcription. The proposed method can be easily applied to ASR systems without any structural changes or additional training. The experimental results show that the proposed method is robust to over-line audio adversarial examples as well as over-air audio adversarial examples compared with state-of-the-art detection methods.
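The detection principle can be sketched as follows: add noise to the logits, decode again, and measure how often the greedy transcription changes. Benign inputs with confident logits are barely affected; adversarial inputs, which tend to sit near decision boundaries, flip far more often. All dimensions below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(12)
T_steps, vocab = 50, 30
logits = rng.normal(size=(T_steps, vocab))
logits[np.arange(T_steps), rng.integers(0, vocab, T_steps)] += 6.0  # confident

def greedy(lg):
    return lg.argmax(axis=-1)          # stand-in for the ASR decoder

def flip_rate(logits, sigma=1.0, n_trials=10):
    """Fraction of frames whose greedy symbol changes under logit noise."""
    base = greedy(logits)
    flips = [np.mean(greedy(logits + rng.normal(0, sigma, logits.shape)) != base)
             for _ in range(n_trials)]
    return float(np.mean(flips))

print(f"flip rate on benign-like logits: {flip_rate(logits):.3f}")
# an input would be flagged as adversarial when this rate is high
```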
【6】 Improving Speech Recognition on Noisy Speech via Speech Enhancement with Multi-Discriminators CycleGAN Link: https://arxiv.org/abs/2112.06309
Authors: Chia-Yu Li, Ngoc Thang Vu Affiliations: Institute of Natural Language Processing, University of Stuttgart, Germany Note: 6 pages, 9 figures, ASRU 2021 Abstract: This paper presents our latest investigations on improving automatic speech recognition for noisy speech via speech enhancement. We propose a novel method named Multi-discriminators CycleGAN to reduce the noise of input speech and therefore improve automatic speech recognition performance. Our proposed method leverages the CycleGAN framework for speech enhancement without any parallel data and improves it by introducing multiple discriminators that check different frequency areas. Furthermore, we show that training multiple generators on homogeneous subsets of the training data is better than training one generator on all the training data. We evaluate our method on the CHiME-3 data set and observe up to 10.03% relative WER improvement on the development set and up to 14.09% on the evaluation set.
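The multi-discriminator idea amounts to splitting the spectrogram into frequency bands and giving each band its own discriminator. A sketch of just that component (the CycleGAN generators and cycle-consistency losses are omitted, and all dimensions are illustrative):

```python
import torch
import torch.nn as nn

class BandDiscriminator(nn.Module):
    """One discriminator per frequency band (the multi-discriminator idea)."""
    def __init__(self, n_freq):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_freq, 64), nn.LeakyReLU(0.2),
                                 nn.Linear(64, 1))

    def forward(self, band):                  # band: (batch, frames, n_freq)
        return self.net(band).mean(dim=1)     # utterance-level real/fake score

spec = torch.rand(4, 100, 257)                # enhanced magnitude spectrogram
bands = torch.split(spec, [64, 64, 129], dim=-1)   # low / mid / high bands
discs = [BandDiscriminator(b.shape[-1]) for b in bands]
scores = [d(b) for d, b in zip(discs, bands)]
print([s.shape for s in scores])              # three (4, 1) score tensors
```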
【7】 Visualising and Explaining Deep Learning Models for Speech Quality Prediction Link: https://arxiv.org/abs/2112.06219
Authors: H. Tilkorn, G. Mittag, S. Möller Note: 4 pages, 6 figures, In Proceedings of the DAGA 2021 (the annual conference of the German Acoustical Society, DEGA) Abstract: Estimating the quality of transmitted speech is known to be a non-trivial task. Traditionally, test participants are asked to rate the quality of samples; nowadays, automated methods are available. These methods can be divided into: 1) intrusive models, which use both the original and the degraded signals, and 2) non-intrusive models, which only require the degraded signal. Recently, non-intrusive models based on neural networks have been shown to outperform signal-processing-based models. However, the advantages of deep-learning-based models come at the cost of being more challenging to interpret. To get more insight into the prediction models, the non-intrusive speech quality prediction model NISQA is analyzed in this paper. NISQA is composed of a convolutional neural network (CNN) and a recurrent neural network (RNN). The task of the CNN is to compute relevant features for the speech quality prediction on a frame level, while the RNN models time dependencies between the individual speech frames. Different explanation algorithms are used to understand the automatically learned features of the CNN. In this way, several interpretable features could be identified, such as sensitivity to noise or strong interruptions. On the other hand, it was found that multiple features carry redundant information.
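The CNN-plus-RNN layout described above can be sketched compactly: a small CNN produces one feature vector per frame, a GRU aggregates over time, and a linear head outputs the quality estimate. All dimensions are illustrative, not NISQA's:

```python
import torch
import torch.nn as nn

class QualityPredictor(nn.Module):
    """Per-frame CNN features + RNN over frames (NISQA-style layout)."""
    def __init__(self):
        super().__init__()
        self.cnn = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
                                 nn.AdaptiveAvgPool2d((1, 1)))
        self.rnn = nn.GRU(16, 32, batch_first=True)
        self.head = nn.Linear(32, 1)

    def forward(self, mel_frames):             # (batch, frames, mels, width)
        b, t = mel_frames.shape[:2]
        f = self.cnn(mel_frames.reshape(b * t, 1, *mel_frames.shape[2:]))
        h, _ = self.rnn(f.reshape(b, t, 16))
        return self.head(h[:, -1])             # one quality estimate per clip

print(QualityPredictor()(torch.rand(2, 50, 48, 15)).shape)  # (2, 1)
```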
【8】 Learning Nigerian accent embeddings from speech: preliminary results based on SautiDB-Naija corpus Link: https://arxiv.org/abs/2112.06199
Authors: Tejumade Afonja, Oladimeji Mudele, Iroro Orife, Kenechi Dukor, Lawrence Francis, Duru Goodness, Oluwafemi Azeez, Ademola Malomo, Clinton Mbataku Affiliations: Niger-Volta Language Technologies Institute Abstract: This paper describes foundational efforts with SautiDB-Naija, a novel corpus of non-native (L2) Nigerian English speech. We describe how the corpus was created and curated, as well as preliminary experiments with accent classification and learning Nigerian accent embeddings. The initial version of the corpus includes over 900 recordings from L2 English speakers of Nigerian languages, such as Yoruba, Igbo, Edo, Efik-Ibibio, and Igala. We further demonstrate how fine-tuning on a pre-trained model like wav2vec can yield representations suitable for related speech tasks such as accent classification. SautiDB-Naija has been published to Zenodo for general use under a flexible Creative Commons License.
【9】 Perceptual Loss with Recognition Model for Single-Channel Enhancement and Robust ASR Link: https://arxiv.org/abs/2112.06068
Authors: Peter Plantinga, Deblin Bagchi, Eric Fosler-Lussier Affiliations: The Ohio State University, Columbus, Ohio Abstract: Single-channel speech enhancement approaches do not always improve automatic recognition rates in the presence of noise, because they can introduce distortions unhelpful for recognition. Following a trend towards end-to-end training of sequential neural network models, several research groups have addressed this problem with joint training of a front-end enhancement module with a back-end recognition module. While this approach ensures enhancement outputs are helpful for recognition, the enhancement model can overfit to the training data, weakening the recognition model in the presence of unseen noise. To address this, we used a pre-trained acoustic model to generate a perceptual loss that makes speech enhancement more aware of the phonetic properties of the signal. This approach keeps some benefits of joint training, while alleviating the overfitting problem. Experiments on the Voicebank DEMAND dataset for enhancement show that this approach achieves a new state of the art for some objective enhancement scores. In combination with distortion-independent training, our approach gets a WER of 2.80% on the test set, which is more than 20% relatively better recognition performance than joint training, and 14% relatively better than distortion-independent mask training.
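A perceptual loss of this kind compares enhanced and clean signals in the feature space of a frozen pretrained model rather than in the signal domain. A PyTorch sketch with a toy frozen network standing in for the pretrained acoustic model:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# a frozen stand-in for the pretrained acoustic model; only its feature
# space matters for the loss (the paper uses a real ASR acoustic model)
acoustic = nn.Sequential(nn.Conv1d(1, 16, 9, padding=4), nn.ReLU(),
                         nn.Conv1d(16, 32, 9, padding=4))
for p in acoustic.parameters():
    p.requires_grad_(False)

enhancer = nn.Conv1d(1, 1, 9, padding=4)        # toy enhancement front-end

noisy = torch.randn(4, 1, 16000)                # batch of 1-second waveforms
clean = torch.randn(4, 1, 16000)

enhanced = enhancer(noisy)
perceptual_loss = nn.functional.l1_loss(acoustic(enhanced), acoustic(clean))
perceptual_loss.backward()                      # gradients reach only the enhancer
print(float(perceptual_loss))
```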
【10】 U-shaped Transformer with Frequency-Band Aware Attention for Speech Enhancement Link: https://arxiv.org/abs/2112.06052
Authors: Yi Li, Yang Sun, Syed Mohsen Naqvi Affiliations: Newcastle University; University of Oxford Abstract: State-of-the-art speech enhancement has limited performance in speech estimation accuracy. Recently, in deep learning, the Transformer has shown potential to exploit long-range dependencies in speech by self-attention. Therefore, it has been introduced into speech enhancement to improve the speech estimation accuracy from a noise mixture. However, to address the computational cost of the Transformer with self-attention, axial attention is one option, i.e., splitting a 2D attention into two 1D attentions. Inspired by axial attention, in the proposed method we calculate the attention map along both the time and frequency axes, generating time and frequency sub-attention maps. Moreover, different from axial attention, the proposed method provides two parallel multi-head attentions for the time and frequency axes. Furthermore, the literature shows that in a noise mixture, the lower frequency band of speech generally contains more of the desired information than the higher frequency band. Therefore, frequency-band aware attention is proposed, i.e., high frequency-band attention (HFA) and low frequency-band attention (LFA). A U-shaped Transformer is also introduced for the first time in the proposed method to further improve the speech estimation accuracy. Extensive evaluations over four public datasets confirm the efficacy of the proposed method.
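The parallel time- and frequency-axis attention can be sketched directly with standard multi-head attention modules: one treats each frequency bin as a sequence over time, the other treats each frame as a sequence over bins. Dimensions are illustrative, and the HFA/LFA band weighting and U-shaped structure are not reproduced:

```python
import torch
import torch.nn as nn

B, T, F, C = 2, 100, 64, 32                     # batch, time, freq, channels
x = torch.randn(B, T, F, C)

att_t = nn.MultiheadAttention(C, num_heads=4, batch_first=True)
att_f = nn.MultiheadAttention(C, num_heads=4, batch_first=True)

# time-axis attention: each frequency bin is an independent sequence
xt = x.permute(0, 2, 1, 3).reshape(B * F, T, C)
yt, _ = att_t(xt, xt, xt)
yt = yt.reshape(B, F, T, C).permute(0, 2, 1, 3)

# frequency-axis attention: each frame is a sequence over bins
xf = x.reshape(B * T, F, C)
yf, _ = att_f(xf, xf, xf)
yf = yf.reshape(B, T, F, C)

y = x + yt + yf                                  # parallel branches, residual sum
print(y.shape)                                   # torch.Size([2, 100, 64, 32])
```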
【11】 Hybrid Neural Networks for On-device Directional Hearing Link: https://arxiv.org/abs/2112.05893
Authors: Anran Wang, Maruchi Kim, Hao Zhang, Shyamnath Gollakota Affiliations: University of Washington; ETH Zürich Abstract: On-device directional hearing requires audio source separation from a given direction while achieving stringent human-imperceptible latency requirements. While neural nets can achieve significantly better performance than traditional beamformers, all existing models fall short of supporting low-latency causal inference on computationally constrained wearables. We present DeepBeam, a hybrid model that combines traditional beamformers with a custom lightweight neural net. The former reduces the computational burden of the latter and also improves its generalizability, while the latter is designed to further reduce the memory and computational overhead to enable real-time and low-latency operations. Our evaluation shows comparable performance to state-of-the-art causal inference models on synthetic data while achieving a 5x reduction of model size, 4x reduction of computation per second, 5x reduction in processing time and generalizing better to real hardware data. Further, our real-time hybrid model runs in 8 ms on mobile CPUs designed for low-power wearable devices and achieves an end-to-end latency of 17.5 ms.
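The traditional front-end in such a hybrid is a beamformer. A minimal delay-and-sum sketch for a three-microphone linear array (illustrative geometry; DeepBeam's actual beamformer and neural stage are not reproduced):

```python
import numpy as np

fs, c = 16000, 343.0
mic_x = np.array([-0.04, 0.0, 0.04])            # 3-mic linear array (metres)

def delay_and_sum(signals, angle_deg):
    """Steer a linear array toward angle_deg with integer-sample delays
    (the traditional front-end that precedes the neural stage)."""
    tau = mic_x * np.sin(np.deg2rad(angle_deg)) / c      # per-mic delay (s)
    shifts = np.round(tau * fs).astype(int)
    out = np.zeros(signals.shape[1])
    for sig, sh in zip(signals, shifts):
        out += np.roll(sig, -sh)                # align arrivals, then average
    return out / len(signals)

rng = np.random.default_rng(13)
mics = rng.normal(size=(3, 16000))              # stand-in multichannel audio
print(delay_and_sum(mics, angle_deg=30.0).shape)
```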
【12】 Directed Speech Separation for Automatic Speech Recognition of Long Form Conversational Speech Link: https://arxiv.org/abs/2112.05863
Authors: Rohit Paturi, Sundararajan Srinivasan, Katrin Kirchhoff Affiliations: Amazon AWS AI Abstract: Many of the recent advances in speech separation are primarily aimed at synthetic mixtures of short audio utterances with high degrees of overlap. These datasets differ significantly from real conversational data, and hence models trained and evaluated on these datasets do not generalize to real conversational scenarios. Another issue with using most of these models for long-form speech is the nondeterministic ordering of separated speech segments, due to either unsupervised clustering for time-frequency masks or Permutation Invariant Training (PIT) loss. This leads to difficulty in accurately stitching homogeneous speaker segments for downstream tasks like Automatic Speech Recognition (ASR). In this paper, we propose a speaker-conditioned separator trained on speaker embeddings extracted directly from the mixed signal. We train this model using a directed loss which regulates the order of the separated segments. With this model, we achieve significant improvements in word error rate (WER) for real conversational data without the need for an additional re-stitching step.
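A speaker-conditioned separator can be sketched as a mask estimator whose input concatenates the mixture spectrogram with a tiled speaker embedding. This is an illustrative architecture, not the paper's exact one, and the directed loss is omitted:

```python
import torch
import torch.nn as nn

class SpeakerConditionedSeparator(nn.Module):
    """Sketch: a mask estimator conditioned on a speaker embedding
    extracted from the mixture."""
    def __init__(self, n_freq=257, emb_dim=128, hidden=256):
        super().__init__()
        self.rnn = nn.GRU(n_freq + emb_dim, hidden, batch_first=True)
        self.mask = nn.Linear(hidden, n_freq)

    def forward(self, mix_spec, spk_emb):
        # tile the utterance-level speaker embedding over time
        emb = spk_emb.unsqueeze(1).expand(-1, mix_spec.size(1), -1)
        h, _ = self.rnn(torch.cat([mix_spec, emb], dim=-1))
        return torch.sigmoid(self.mask(h)) * mix_spec   # masked magnitudes

sep = SpeakerConditionedSeparator()
out = sep(torch.rand(2, 100, 257), torch.rand(2, 128))
print(out.shape)   # torch.Size([2, 100, 257])
```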
3. eess.AS (Audio and Speech Processing):
【1】 Directed Speech Separation for Automatic Speech Recognition of Long Form Conversational Speech Link: https://arxiv.org/abs/2112.05863
Authors: Rohit Paturi, Sundararajan Srinivasan, Katrin Kirchhoff Affiliations: Amazon AWS AI Abstract: Many of the recent advances in speech separation are primarily aimed at synthetic mixtures of short audio utterances with high degrees of overlap. These datasets differ significantly from real conversational data, and hence models trained and evaluated on these datasets do not generalize to real conversational scenarios. Another issue with using most of these models for long-form speech is the nondeterministic ordering of separated speech segments, due to either unsupervised clustering for time-frequency masks or Permutation Invariant Training (PIT) loss. This leads to difficulty in accurately stitching homogeneous speaker segments for downstream tasks like Automatic Speech Recognition (ASR). In this paper, we propose a speaker-conditioned separator trained on speaker embeddings extracted directly from the mixed signal. We train this model using a directed loss which regulates the order of the separated segments. With this model, we achieve significant improvements in word error rate (WER) for real conversational data without the need for an additional re-stitching step.
【2】 Mean-square-error-based secondary source placement in sound field synthesis with prior information on desired field Link: https://arxiv.org/abs/2112.06774
Authors: Keisuke Kimura, Shoichi Koyama, Natsuki Ueno, Hiroshi Saruwatari Affiliations: The University of Tokyo, Hongo, Bunkyo-ku, Tokyo, Japan Note: Accepted to IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA) 2021 Abstract: A method of optimizing secondary source placement in sound field synthesis is proposed. Such an optimization method will be useful when the allowable placement region and the available number of loudspeakers are limited. We formulate a mean-square-error-based cost function, incorporating the statistical properties of possible desired sound fields, for general linear-least-squares-based sound field synthesis methods, including pressure matching and (weighted) mode matching, whereas most current methods are applicable only to the pressure-matching method. An efficient greedy algorithm for minimizing the proposed cost function is also derived. Numerical experiments indicated that a high reproduction accuracy can be achieved by the placement optimized by the proposed method compared with the empirically used regular placement.
【3】 Computational bioacoustics with deep learning: a review and roadmap Link: https://arxiv.org/abs/2112.06725
Authors: Dan Stowell Affiliations: Department of Cognitive Science and Artificial Intelligence, Tilburg University, Tilburg; Naturalis Biodiversity Center, Leiden, The Netherlands Abstract: Animal vocalisations and natural soundscapes are fascinating objects of study, and contain valuable evidence about animal behaviours, populations and ecosystems. They are studied in bioacoustics and ecoacoustics, with signal processing and analysis an important component. Computational bioacoustics has accelerated in recent decades due to the growth of affordable digital sound recording devices, and to huge progress in informatics such as big data, signal processing and machine learning. Methods are inherited from the wider field of deep learning, including speech and image processing. However, the tasks, demands and data characteristics are often different from those addressed in speech or music analysis. There remain unsolved problems, and tasks for which evidence is surely present in many acoustic signals, but not yet realised. In this paper I perform a review of the state of the art in deep learning for computational bioacoustics, aiming to clarify key concepts and identify and analyse knowledge gaps. Based on this, I offer a subjective but principled roadmap for computational bioacoustics with deep learning: topics that the community should aim to address, in order to make the most of future developments in AI and informatics, and to use audio data in answering zoological and ecological questions.
【4】 PM-MMUT: Boosted Phone-mask Data Augmentation using Multi-modeing Unit Training for Robust Uyghur E2E Speech Recognition 标题:PM-MMUT:用于稳健维吾尔语E2E语音识别的多模型单元训练增强电话掩码数据 链接:https://arxiv.org/abs/2112.06721
作者:Guodong Ma,Pengfei Hu,Nurmemet Yolwas,Shen Huang,Hao Huang 机构:School of Information Science and Engineering, Xinjiang University, Urumqi, China, Tencent Minority-Mandarin Translation, Beijing, China, Xinjiang Provincial Key Laboratory of Multi-lingual Information Technology, Urumqi, China 备注:Submitted to ICASSP 2022 摘要:维吾尔语语音中经常出现辅音和元音弱化现象,这可能导致维吾尔语自动语音识别(ASR)的性能下降。我们最近提出的基于掩蔽的学习策略,即音素掩蔽训练(PMT),缓解了这种现象对维吾尔语ASR的影响。虽然PMT取得了显著的改进,但由于PMT的掩蔽单元(音素)与建模单元(词片)之间的粒度不匹配,仍有进一步提升的空间。为了提高PMT的性能,我们提出了多建模单元训练(MMUT)架构与PMT的融合(PM-MMUT)。MMUT框架的思想是将编码器分成两部分,即声学特征序列到音素级表示(AF-to-PLR)和音素级表示到词片级表示(PLR-to-WPLR)。它允许AF-to-PLR通过一个中间的基于音素的CTC损失进行优化,从而学习PMT带来的丰富的音素级上下文信息。在维吾尔语ASR上的实验结果表明,所提方法改进显著,优于纯PMT(Read-Test上WER从24.0降至23.7,Oral-Test上从38.4降至36.8)。我们还使用ESPnet1在960小时的Librispeech基准上进行了实验,与最新的官方ESPnet1预训练模型相比,在不使用LM融合的情况下,所有测试集上的相对WER降低了约10%。 摘要:Consonant and vowel reduction are often encountered in Uyghur speech, which might cause performance degradation in Uyghur automatic speech recognition (ASR). Our recently proposed learning strategy based on masking, Phone Masking Training (PMT), alleviates the impact of such phenomenon in Uyghur ASR. Although PMT achieves remarkable improvements, there still exists room for further gains due to the granularity mismatch between the masking unit of PMT (phoneme) and the modeling unit (word-piece). To boost the performance of PMT, we propose multi-modeling unit training (MMUT) architecture fusion with PMT (PM-MMUT). The idea of the MMUT framework is to split the Encoder into two parts, including acoustic feature sequences to phoneme-level representation (AF-to-PLR) and phoneme-level representation to word-piece-level representation (PLR-to-WPLR). It allows AF-to-PLR to be optimized by an intermediate phoneme-based CTC loss to learn the rich phoneme-level context information brought by PMT. Experimental results on Uyghur ASR show that the proposed approaches improve significantly, outperforming the pure PMT (reducing WER from 24.0 to 23.7 on Read-Test and from 38.4 to 36.8 on Oral-Test, respectively). We also conduct experiments on the 960-hour Librispeech benchmark using ESPnet1, which achieves about 10% relative WER reduction on all the test sets without LM fusion, compared with the latest official ESPnet1 pre-trained model.
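中间音素级CTC与最终词片级CTC相结合的损失可用如下PyTorch片段示意(假设性代码,并非作者实现;函数接口与权重 alpha 均为演示假设),与摘要中 AF-to-PLR / PLR-to-WPLR 的划分相对应:

```python
import torch.nn.functional as F

def pm_mmut_loss(mid_logits, final_logits, phone_targets, wp_targets,
                 input_lens, phone_lens, wp_lens, alpha=0.3):
    """中间音素级 CTC(监督 AF-to-PLR)+ 最终词片级 CTC 的加权和(示意)。
    mid_logits:   (T, B, V_phone) 编码器中间层输出
    final_logits: (T, B, V_wp)    编码器最终层输出
    alpha: 中间损失权重,假设值
    """
    phone_ctc = F.ctc_loss(mid_logits.log_softmax(-1), phone_targets,
                           input_lens, phone_lens, blank=0)
    wp_ctc = F.ctc_loss(final_logits.log_softmax(-1), wp_targets,
                        input_lens, wp_lens, blank=0)
    return alpha * phone_ctc + (1.0 - alpha) * wp_ctc
```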
【5】 Detecting Emotion Carriers by Combining Acoustic and Lexical Representations 标题:结合声学与词汇表示的情感载体检测 链接:https://arxiv.org/abs/2112.06603
作者:Sebastian P. Bayerl,Aniruddha Tammewar,Korbinian Riedhammer,Giuseppe Riccardi 机构:Technische Hochschule Nürnberg Georg Simon Ohm, Germany, Signals and Interactive Systems Lab, University of Trento 备注:Accepted at ASRU 2021 摘要:个人叙述(PN),无论口头还是书面,都是对来自个人经历的事实、人物、事件和想法的回忆。情绪识别和情感分析任务通常在话语或文档级别定义。然而,在这项工作中,我们关注情感载体(EC),即最能解释叙述者情感状态的片段(语音或文本)(“失去父亲”、“让我选择”)。一旦提取出来,这种EC可以提供更丰富的用户状态表示,以改进自然语言理解和对话建模。以前的工作已经证明可以使用词汇特征识别EC。然而,口语叙述应能对上下文和用户的情绪状态提供更丰富的描述。在本文中,我们利用基于词的声学和文本嵌入以及早期和晚期融合技术来检测口语叙述中的EC。对于声学词级表示,我们使用在单独的语音情感语料库上预训练并经微调以检测EC的残差神经网络(ResNet)。使用不同融合和系统组合策略的实验表明,晚期融合可为该任务带来显著改进。 摘要:Personal narratives (PN) - spoken or written - are recollections of facts, people, events, and thoughts from one's own experience. Emotion recognition and sentiment analysis tasks are usually defined at the utterance or document level. However, in this work, we focus on Emotion Carriers (EC) defined as the segments (speech or text) that best explain the emotional state of the narrator ("loss of father", "made me choose"). Once extracted, such EC can provide a richer representation of the user state to improve natural language understanding and dialogue modeling. In previous work, it has been shown that EC can be identified using lexical features. However, spoken narratives should provide a richer description of the context and the users' emotional state. In this paper, we leverage word-based acoustic and textual embeddings as well as early and late fusion techniques for the detection of ECs in spoken narratives. For the acoustic word-level representations, we use Residual Neural Networks (ResNet) pretrained on separate speech emotion corpora and fine-tuned to detect EC. Experiments with different fusion and system combination strategies show that late fusion leads to significant improvements for this task.
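晚期融合在决策层组合两个流的做法可用如下示意(假设性代码,非论文原始实现;网络结构、维度与融合权重 w 均为演示假设):声学流与词汇流各自输出词级标签概率,再做加权平均。

```python
import torch.nn as nn

class LateFusionECTagger(nn.Module):
    """声学/词汇双流晚期融合的词级序列标注示意(假设性实现)。"""
    def __init__(self, acoustic_dim, lexical_dim, hidden=128, num_tags=2, w=0.5):
        super().__init__()
        self.acoustic_head = nn.Sequential(nn.Linear(acoustic_dim, hidden),
                                           nn.ReLU(), nn.Linear(hidden, num_tags))
        self.lexical_head = nn.Sequential(nn.Linear(lexical_dim, hidden),
                                          nn.ReLU(), nn.Linear(hidden, num_tags))
        self.w = w  # 融合权重,假设值

    def forward(self, acoustic_emb, lexical_emb):
        # 两个流各自输出词级 logits,在概率层加权平均,即晚期融合
        pa = self.acoustic_head(acoustic_emb).softmax(-1)
        pl = self.lexical_head(lexical_emb).softmax(-1)
        return self.w * pa + (1.0 - self.w) * pl
```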
【6】 Detecting Audio Adversarial Examples with Logit Noising 标题:利用Logit加噪检测音频对抗样本 链接:https://arxiv.org/abs/2112.06443
作者:Namgyu Park,Sangwoo Ji,Jong Kim 机构:POSTECH, Pohang, South Korea 备注:10 pages, 12 figures, In Proceedings of the 37th Annual Computer Security Applications Conference (ACSAC) 2021 摘要:自动语音识别(ASR)系统容易受到音频对抗样本的攻击,这类样本试图通过向良性语音信号添加扰动来欺骗ASR系统。尽管人类无法区分对抗样本和原始良性波形,但ASR系统会将前者转写为恶意目标语句。已有多种方法被提出用于生成音频对抗样本并将其直接输入ASR系统(线上)。此外,许多研究人员已经证明了鲁棒的物理音频对抗样本(空中)的可行性。为了抵御这些攻击,已有若干防御研究被提出;然而,由于精度下降或时间开销,它们难以部署到现实场景中。在本文中,我们提出了一种新方法:在logits输入ASR解码器之前向其添加噪声,以检测音频对抗样本。我们表明,精心选择的噪声可以显著影响音频对抗样本的转写结果,而对良性音频波形的转写结果影响很小。基于这一特性,我们通过比较logit加噪后改变的转写与原始转写来检测音频对抗样本。该方法可以轻松应用于ASR系统,无需任何结构改动或额外训练。实验结果表明,与最先进的检测方法相比,该方法对线上和空中音频对抗样本均具有较强的鲁棒性。 摘要:Automatic speech recognition (ASR) systems are vulnerable to audio adversarial examples that attempt to deceive ASR systems by adding perturbations to benign speech signals. Although an adversarial example and the original benign wave are indistinguishable to humans, the former is transcribed as a malicious target sentence by ASR systems. Several methods have been proposed to generate audio adversarial examples and feed them directly into the ASR system (over-line). Furthermore, many researchers have demonstrated the feasibility of robust physical audio adversarial examples (over-air). To defend against the attacks, several studies have been proposed. However, deploying them in a real-world situation is difficult because of accuracy drop or time overhead. In this paper, we propose a novel method to detect audio adversarial examples by adding noise to the logits before feeding them into the decoder of the ASR. We show that carefully selected noise can significantly impact the transcription results of the audio adversarial examples, whereas it has minimal impact on the transcription results of benign audio waves. Based on this characteristic, we detect audio adversarial examples by comparing the transcription altered by logit noising with its original transcription. The proposed method can be easily applied to ASR systems without any structural changes or additional training. The experimental results show that the proposed method is robust to over-line audio adversarial examples as well as over-air audio adversarial examples compared with state-of-the-art detection methods.
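这一检测思路可用如下示意代码表达(假设性实现,非论文原始代码;此处用贪心解码与编辑距离近似,blank 的 id、噪声方差与阈值均为演示假设):对logits加噪前后各解码一次,若两次转写差异超过阈值则判定为对抗样本。

```python
import numpy as np

def greedy_decode(logits):
    """逐帧取 argmax 并去重、去 blank(假设 blank 的 id 为 0)。"""
    ids = logits.argmax(axis=-1)
    out, prev = [], -1
    for i in ids:
        if i != prev and i != 0:
            out.append(int(i))
        prev = i
    return out

def edit_distance(a, b):
    """标准 Levenshtein 距离(滚动数组实现)。"""
    dp = np.arange(len(b) + 1)
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return int(dp[-1])

def is_adversarial(logits, noise_std=0.5, threshold=0.2, rng=None):
    """logit 加噪前后转写差异超过阈值则判为对抗样本(阈值为假设值)。"""
    rng = rng or np.random.default_rng(0)
    base = greedy_decode(logits)
    noised = greedy_decode(logits + rng.normal(0.0, noise_std, logits.shape))
    return edit_distance(base, noised) > threshold * max(len(base), 1)
```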
【7】 Improving Speech Recognition on Noisy Speech via Speech Enhancement with Multi-Discriminators CycleGAN 标题:基于多鉴别器CycleGAN的语音增强改善含噪语音识别 链接:https://arxiv.org/abs/2112.06309
作者:Chia-Yu Li,Ngoc Thang Vu 机构:Institute of Natural Language Processing, University of Stuttgart, Germany 备注:6 pages, 9 figures, ASRU 2021 摘要:本文介绍了我们通过语音增强改进含噪语音自动识别的最新研究。我们提出了一种名为多鉴别器CycleGAN的新方法来降低输入语音的噪声,从而提高自动语音识别的性能。该方法利用CycleGAN框架在无需任何平行数据的情况下进行语音增强,并通过引入检查不同频率区域的多个鉴别器对其加以改进。此外,我们还证明了在训练数据的同质子集上训练多个生成器优于在全部训练数据上训练单个生成器。我们在CHiME-3数据集上评估了该方法,在开发集上观察到高达10.03%、在评估集上高达14.09%的相对WER改善。 摘要:This paper presents our latest investigations on improving automatic speech recognition for noisy speech via speech enhancement. We propose a novel method named Multi-discriminators CycleGAN to reduce noise of input speech and therefore improve the automatic speech recognition performance. Our proposed method leverages the CycleGAN framework for speech enhancement without any parallel data and improves it by introducing multiple discriminators that check different frequency areas. Furthermore, we show that training multiple generators on homogeneous subsets of the training data is better than training one generator on all the training data. We evaluate our method on the CHiME-3 data set and observe up to 10.03% relative WER improvement on the development set and up to 14.09% on the evaluation set.
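"多个鉴别器检查不同频率区域"的结构可简单示意如下(假设性代码,非论文原始实现;频带数与卷积结构均为演示假设):把幅度谱沿频率维切成若干段,每段交给一个独立鉴别器打分。

```python
import torch.nn as nn

class MultiBandDiscriminator(nn.Module):
    """把幅度谱沿频率维切成 num_bands 段,每段由独立鉴别器判别(示意)。"""
    def __init__(self, num_bands=3):
        super().__init__()
        self.num_bands = num_bands
        self.discs = nn.ModuleList(self._make_disc() for _ in range(num_bands))

    @staticmethod
    def _make_disc():
        # 小型卷积鉴别器,结构为演示假设
        return nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 1))

    def forward(self, spec):  # spec: (B, 1, F, T) 幅度谱
        bands = spec.chunk(self.num_bands, dim=2)  # 沿频率维切分
        return [d(b) for d, b in zip(self.discs, bands)]
```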
【8】 Visualising and Explaining Deep Learning Models for Speech Quality Prediction 标题:语音质量预测深度学习模型的可视化与解释 链接:https://arxiv.org/abs/2112.06219
作者:H. Tilkorn,G. Mittag,S. Möller 备注:4 pages, 6 figures, In Proceedings of the DAGA 2021 (the annual conference of the German Acoustical Society, DEGA) 摘要:众所周知,估计传输语音的质量并非易事。传统上由测试参与者对样本质量进行评分;如今已有自动化方法可用。这些方法可分为:1)侵入式模型,同时使用原始信号和退化信号;2)非侵入式模型,只需要退化信号。最近,基于神经网络的非侵入式模型已被证明优于基于信号处理的模型。然而,基于深度学习的模型的优势伴随着更难解释的代价。为了更深入地了解预测模型,本文分析了非侵入式语音质量预测模型NISQA。NISQA由卷积神经网络(CNN)和递归神经网络(RNN)组成。CNN的任务是在帧级别上计算与语音质量预测相关的特征,而RNN则建模各语音帧之间的时间依赖关系。我们使用不同的解释算法来理解CNN自动学习到的特征。通过这种方式,可以识别出若干可解释的特征,例如对噪声或强中断的敏感性。另一方面,我们发现多个特征携带冗余信息。 摘要:Estimating the quality of transmitted speech is known to be a non-trivial task. Traditionally, test participants are asked to rate the quality of samples; nowadays, automated methods are available. These methods can be divided into: 1) intrusive models, which use both the original and the degraded signals, and 2) non-intrusive models, which only require the degraded signal. Recently, non-intrusive models based on neural networks have been shown to outperform signal-processing-based models. However, the advantages of deep-learning-based models come with the cost of being more challenging to interpret. To get more insight into the prediction models, the non-intrusive speech quality prediction model NISQA is analyzed in this paper. NISQA is composed of a convolutional neural network (CNN) and a recurrent neural network (RNN). The task of the CNN is to compute relevant features for the speech quality prediction on a frame level, while the RNN models time-dependencies between the individual speech frames. Different explanation algorithms are used to understand the automatically learned features of the CNN. In this way, several interpretable features could be identified, such as the sensitivity to noise or strong interruptions. On the other hand, it was found that multiple features carry redundant information.
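"CNN逐帧提特征、RNN建模时间依赖"的总体结构可用如下极简示意(假设性代码,并非NISQA官方实现;各层维度均为演示假设):

```python
import torch.nn as nn

class CnnRnnQualityModel(nn.Module):
    """帧级 CNN 特征 + RNN 时序建模的质量预测示意(非 NISQA 官方实现)。"""
    def __init__(self, n_mels=48, cnn_dim=64, rnn_dim=128):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, None)))   # 压掉频率维,保留时间维
        self.proj = nn.Linear(32, cnn_dim)
        self.rnn = nn.LSTM(cnn_dim, rnn_dim, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * rnn_dim, 1)  # 回归整体质量(如 MOS)

    def forward(self, mel):                    # mel: (B, 1, n_mels, T)
        f = self.cnn(mel)                      # (B, 32, 1, T')
        f = self.proj(f.squeeze(2).transpose(1, 2))  # (B, T', cnn_dim)
        out, _ = self.rnn(f)                   # RNN 建模帧间时间依赖
        return self.head(out.mean(dim=1))      # 时间平均后回归
```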
【9】 Learning Nigerian accent embeddings from speech: preliminary results based on SautiDB-Naija corpus 标题:基于SautiDB-Naija语料库的尼日利亚口音嵌入学习初步结果 链接:https://arxiv.org/abs/2112.06199
作者:Tejumade Afonja,Oladimeji Mudele,Iroro Orife,Kenechi Dukor,Lawrence Francis,Duru Goodness,Oluwafemi Azeez,Ademola Malomo,Clinton Mbataku 机构:Niger-Volta Language Technologies Institute 摘要:本文描述了围绕SautiDB-Naija开展的基础性工作,这是一个新的非母语(L2)尼日利亚英语语音语料库。我们描述了语料库的创建与整理过程,以及口音分类和学习尼日利亚口音嵌入的初步实验。语料库的初始版本包括来自约鲁巴语、伊博语、埃多语、埃菲克-伊比比奥语和伊加拉语等尼日利亚语言的L2英语使用者的900多条录音。我们进一步演示了在wav2vec等预训练模型上进行微调如何产生适用于口音分类等相关语音任务的表示。SautiDB-Naija已发布在Zenodo上,以灵活的知识共享(Creative Commons)许可证供公众使用。 摘要:This paper describes foundational efforts with SautiDB-Naija, a novel corpus of non-native (L2) Nigerian English speech. We describe how the corpus was created and curated as well as preliminary experiments with accent classification and learning Nigerian accent embeddings. The initial version of the corpus includes over 900 recordings from L2 English speakers of Nigerian languages, such as Yoruba, Igbo, Edo, Efik-Ibibio, and Igala. We further demonstrate how fine-tuning on a pre-trained model like wav2vec can yield representations suitable for related speech tasks such as accent classification. SautiDB-Naija has been published to Zenodo for general use under a flexible Creative Commons License.
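在预训练wav2vec 2.0上接分类头做口音分类的做法可用如下极简示意(假设性代码,非论文原始实现;此处借助HuggingFace transformers,检查点名与类别数均为演示假设):

```python
import torch.nn as nn
from transformers import Wav2Vec2Model

class AccentClassifier(nn.Module):
    """wav2vec 2.0 表示 + 均值池化 + 线性分类头(示意)。"""
    def __init__(self, num_accents=5, ckpt="facebook/wav2vec2-base"):
        super().__init__()
        self.encoder = Wav2Vec2Model.from_pretrained(ckpt)
        self.head = nn.Linear(self.encoder.config.hidden_size, num_accents)

    def forward(self, waveform):  # waveform: (B, num_samples),16 kHz
        hidden = self.encoder(waveform).last_hidden_state  # (B, T, H)
        return self.head(hidden.mean(dim=1))  # 整段池化后输出口音 logits
```

微调时既可只训练分类头,也可连同编码器一起小学习率微调;具体策略与摘要所述实验设置未必一致,属演示性选择。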
【10】 Perceptual Loss with Recognition Model for Single-Channel Enhancement and Robust ASR 标题:利用识别模型感知损失的单通道增强与鲁棒ASR 链接:https://arxiv.org/abs/2112.06068
作者:Peter Plantinga,Deblin Bagchi,Eric Fosler-Lussier 机构:The Ohio State University, Columbus, Ohio 摘要:在存在噪声的情况下,单通道语音增强方法并不总能提高自动识别率,因为它们可能引入对识别无益的失真。随着序列神经网络模型端到端训练的趋势,一些研究小组已经通过前端增强模块与后端识别模块的联合训练来解决这个问题。虽然这种方法可以确保增强输出有助于识别,但增强模型可能会过拟合训练数据,从而在存在未知噪声的情况下削弱识别模型。为了解决这个问题,我们使用一个预训练的声学模型来产生感知损失,使语音增强更加了解信号的语音学特性。这种方法保留了联合训练的部分好处,同时缓解了过拟合问题。在Voicebank DEMAND增强数据集上的实验表明,该方法在若干客观增强指标上达到了新的最优水平。与失真无关训练相结合,我们的方法在测试集上获得了2.80%的WER,相对于联合训练,识别性能相对提升20%以上;相对于失真无关掩码训练,相对提升14%。 摘要:Single-channel speech enhancement approaches do not always improve automatic recognition rates in the presence of noise, because they can introduce distortions unhelpful for recognition. Following a trend towards end-to-end training of sequential neural network models, several research groups have addressed this problem with joint training of a front-end enhancement module with a back-end recognition module. While this approach ensures enhancement outputs are helpful for recognition, the enhancement model can overfit to the training data, weakening the recognition model in the presence of unseen noise. To address this, we used a pre-trained acoustic model to generate a perceptual loss that makes speech enhancement more aware of the phonetic properties of the signal. This approach keeps some benefits of joint training, while alleviating the overfitting problem. Experiments on the Voicebank DEMAND dataset for enhancement show that this approach achieves a new state of the art for some objective enhancement scores. In combination with distortion-independent training, our approach gets a WER of 2.80% on the test set, a more than 20% relative improvement in recognition performance over joint training and a 14% relative improvement over distortion-independent mask training.
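感知损失的核心是用冻结的预训练声学模型的激活来比较增强输出与干净语音,可示意如下(假设性代码,非论文原始实现;层的选取与L1距离均为演示假设):

```python
import torch
import torch.nn.functional as F

def perceptual_loss(acoustic_model, enhanced, clean):
    """用冻结的预训练声学模型输出计算感知损失(示意)。
    acoustic_model: 预训练并冻结的声学模型,输入特征、输出逐帧表示
    enhanced, clean: 增强输出与干净参考的特征,形状相同
    """
    with torch.no_grad():
        target = acoustic_model(clean)   # 干净语音的参考激活,不回传梯度
    pred = acoustic_model(enhanced)      # 对增强输出保留梯度,驱动增强网络学习语音学特性
    return F.l1_loss(pred, target)
```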
【11】 U-shaped Transformer with Frequency-Band Aware Attention for Speech Enhancement 标题:用于语音增强的具有频带感知注意力的U型Transformer 链接:https://arxiv.org/abs/2112.06052
作者:Yi Li,Yang Sun,Syed Mohsen Naqvi 机构:Newcastle University, University of Oxford 摘要:最先进的语音增强方法在语音估计精度方面的性能仍然有限。最近,在深度学习中,Transformer展示了通过自注意力利用语音中长距离依赖的潜力,因此被引入语音增强,以提高从含噪混合信号中估计语音的精度。然而,为了解决自注意力Transformer的计算开销问题,轴向注意力是一种选择,即把二维注意力拆分为两个一维注意力。受轴向注意力的启发,所提方法沿时间轴和频率轴计算注意力图,生成时间与频率子注意力图。此外,与轴向注意力不同,所提方法为时间轴和频率轴提供了两个并行的多头注意力。另外,文献表明,在含噪混合信号中,语音的低频段通常比高频段包含更多的期望信息。因此,我们提出了频带感知注意力,即高频段注意力(HFA)和低频段注意力(LFA)。所提方法还首次引入了U型Transformer,以进一步提高语音估计精度。在四个公共数据集上的大量评估证实了该方法的有效性。 摘要:The state-of-the-art speech enhancement has limited performance in speech estimation accuracy. Recently, in deep learning, the Transformer has shown the potential to exploit the long-range dependency in speech by self-attention. Therefore, it is introduced in speech enhancement to improve the speech estimation accuracy from a noise mixture. However, to address the computational cost issue of the Transformer with self-attention, axial attention is one option, i.e., splitting a 2D attention into two 1D attentions. Inspired by the axial attention, in the proposed method we calculate the attention map along both the time and frequency axes to generate time and frequency sub-attention maps. Moreover, different from the axial attention, the proposed method provides two parallel multi-head attentions for the time and frequency axes. Furthermore, it is shown in the literature that the lower frequency-band in speech generally contains more desired information than the higher frequency-band in a noise mixture. Therefore, frequency-band aware attention is proposed, i.e., high frequency-band attention (HFA) and low frequency-band attention (LFA). The U-shaped Transformer is also introduced for the first time in the proposed method to further improve the speech estimation accuracy. The extensive evaluations over four public datasets confirm the efficacy of the proposed method.
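时间轴与频率轴并行多头注意力可示意如下(假设性代码,非论文原始实现;头数等超参为演示假设,dim 需能被 heads 整除;HFA/LFA 的低/高频带拆分此处从略,可在此基础上对频带分别做注意力):

```python
import torch.nn as nn

class ParallelAxialAttention(nn.Module):
    """沿时间轴与频率轴的并行多头自注意力示意(参数为演示假设)。"""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.time_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.freq_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                 # x: (B, T, F, C)
        B, T, Fq, C = x.shape
        # 时间轴注意力:把频率折叠进 batch 维
        xt = x.permute(0, 2, 1, 3).reshape(B * Fq, T, C)
        t_out, _ = self.time_attn(xt, xt, xt)
        t_out = t_out.reshape(B, Fq, T, C).permute(0, 2, 1, 3)
        # 频率轴注意力:把时间折叠进 batch 维
        xf = x.reshape(B * T, Fq, C)
        f_out, _ = self.freq_attn(xf, xf, xf)
        f_out = f_out.reshape(B, T, Fq, C)
        return x + 0.5 * (t_out + f_out)  # 残差连接 + 两路平均
```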
【12】 Hybrid Neural Networks for On-device Directional Hearing 标题:用于设备端定向听觉的混合神经网络 链接:https://arxiv.org/abs/2112.05893
作者:Anran Wang,Maruchi Kim,Hao Zhang,Shyamnath Gollakota 机构:University of Washington,ETH Zürich 摘要:设备端定向听觉要求从给定方向分离音频源,同时满足人类无法察觉的严格延迟要求。虽然神经网络可以获得比传统波束形成器好得多的性能,但所有现有模型都无法在计算受限的可穿戴设备上支持低延迟因果推理。我们提出了DeepBeam,一种将传统波束形成器与定制轻量级神经网络相结合的混合模型。前者降低了后者的计算负担并提高了其泛化能力,而后者旨在进一步减少内存和计算开销,以实现实时、低延迟的运行。我们的评估显示,在合成数据上,其性能与最先进的因果推理模型相当,同时实现了模型大小减少5倍、每秒计算量减少4倍、处理时间减少5倍,并能更好地泛化到真实硬件数据。此外,我们的实时混合模型在为低功耗可穿戴设备设计的移动CPU上运行耗时8毫秒,端到端延迟为17.5毫秒。 摘要:On-device directional hearing requires audio source separation from a given direction while achieving stringent human-imperceptible latency requirements. While neural nets can achieve significantly better performance than traditional beamformers, all existing models fall short of supporting low-latency causal inference on computationally-constrained wearables. We present DeepBeam, a hybrid model that combines traditional beamformers with a custom lightweight neural net. The former reduces the computational burden of the latter and also improves its generalizability, while the latter is designed to further reduce the memory and computational overhead to enable real-time and low-latency operations. Our evaluation shows comparable performance to state-of-the-art causal inference models on synthetic data while achieving a 5x reduction of model size, 4x reduction of computation per second, 5x reduction in processing time and generalizing better to real hardware data. Further, our real-time hybrid model runs in 8 ms on mobile CPUs designed for low-power wearable devices and achieves an end-to-end latency of 17.5 ms.
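混合结构中"传统波束形成器先做方向性预分离"这一步可用延迟求和波束形成粗略示意(假设性实现,非论文原始代码;整数采样对齐与时延符号约定均为粗略近似),其输出再交给轻量神经网络做残余噪声抑制(网络部分此处从略):

```python
import numpy as np

def delay_and_sum(mics, mic_positions, direction, sr=16000, c=343.0):
    """延迟求和波束形成示意(假设性实现)。
    mics: (M, N) 多通道时域信号; mic_positions: (M, 3) 麦克风坐标(米)
    direction: (3,) 指向目标声源的单位向量; c: 声速(米/秒)
    """
    delays = mic_positions @ direction / c       # 各麦克风相对时延(秒)
    delays -= delays.min()                       # 平移为非负
    out = np.zeros(mics.shape[1])
    for sig, d in zip(mics, delays):
        shift = int(round(d * sr))               # 整数采样近似对齐
        out[: len(out) - shift] += sig[shift:]
    return out / len(mics)                       # 输出作为轻量网络的输入
```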
【13】 Revisiting the Boundary between ASR and NLU in the Age of Conversational Dialog Systems 标题:重新审视对话系统时代ASR与NLU的界限 链接:https://arxiv.org/abs/2112.05842
作者:Manaal Faruqui,Dilek Hakkani-Tür 机构:Google Assistant, Amazon Alexa AI 备注:Accepted to be published at Computational Linguistics Journal 2022 摘要:随着世界各地越来越多的用户在日常生活中与对话代理进行交互,人们需要更好的语音理解,这就要求重新关注自动语音识别(ASR)与自然语言理解(NLU)研究之间的动态关系。我们简要回顾了这两个研究领域,并阐述了它们之间的当前关系。根据本文中的观察,我们认为:(1)NLU应该认识到对话系统管道上游所使用的ASR模型的存在,(2)ASR应该能够从NLU中发现的错误中学习,(3)需要能够为语音输入提供语义标注的端到端数据集,(4)ASR和NLU研究社区之间应加强合作。 摘要:As more users across the world are interacting with dialog agents in their daily life, there is a need for better speech understanding that calls for renewed attention to the dynamics between research in automatic speech recognition (ASR) and natural language understanding (NLU). We briefly review these research areas and lay out the current relationship between them. In light of the observations we make in this paper, we argue that (1) NLU should be cognizant of the presence of ASR models being used upstream in a dialog system's pipeline, (2) ASR should be able to learn from errors found in NLU, (3) there is a need for end-to-end datasets that provide semantic annotations on spoken input, (4) there should be stronger collaboration between ASR and NLU research communities.
【14】 Sequence-level self-learning with multiple hypotheses 标题:具有多个假设的序列级自学习 链接:https://arxiv.org/abs/2112.05826
作者:Kenichi Kumatani,Dimitrios Dimitriadis,Yashesh Gaur,Robert Gmyr,Sefik Emre Eskimez,Jinyu Li,Michael Zeng 机构:Microsoft, WA, USA 备注:Published in Interspeech 2020 摘要:在这项工作中,我们为基于注意力的序列到序列(seq2seq)自动语音识别(ASR)模型开发了新的自学习技术。对于未转写的语音数据,必须将ASR系统给出的假设用作标签。然而,不完美的ASR结果使得无监督学习难以持续提高识别性能,尤其是在没有多个强大教师模型可用的情况下。与传统的无监督学习方法不同,我们采用多任务学习(MTL)框架,将第n优的ASR假设用作每个任务的标签。seq2seq网络通过MTL框架进行更新,以找到能够覆盖多个假设的共同表示。这样可以减轻硬判决(hard-decision)错误的影响。我们首先通过美式英语与英式英语之间口音自适应任务上的ASR实验来证明自学习方法的有效性。实验结果表明,与仅用美式英语数据训练的基线模型相比,我们的方法可以将英式英语语音数据上的WER从14.55%降低到10.36%。此外,我们还研究了所提方法在联邦学习场景中的效果。 摘要:In this work, we develop new self-learning techniques with an attention-based sequence-to-sequence (seq2seq) model for automatic speech recognition (ASR). For untranscribed speech data, the hypothesis from an ASR system must be used as a label. However, the imperfect ASR result makes unsupervised learning difficult to consistently improve recognition performance especially in the case that multiple powerful teacher models are unavailable. In contrast to conventional unsupervised learning approaches, we adopt the multi-task learning (MTL) framework where the $n$-th best ASR hypothesis is used as the label of each task. The seq2seq network is updated through the MTL framework so as to find the common representation that can cover multiple hypotheses. By doing so, the effect of the hard-decision errors can be alleviated. We first demonstrate the effectiveness of our self-learning methods through ASR experiments in an accent adaptation task between US and British English speech. Our experiment results show that our method can reduce the WER on the British speech data from 14.55% to 10.36% compared to the baseline model trained with the US English data only. Moreover, we investigate the effect of our proposed methods in a federated learning scenario.
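以n-best假设为多任务标签的自学习损失可示意如下(假设性代码,非论文原始实现;model 的接口与权重方案均为演示假设):对每条未转写语音,用每条假设分别计算序列级损失再加权求和。

```python
def multi_hypothesis_loss(model, speech, nbest_hyps, weights=None):
    """以 n-best ASR 假设为多任务标签的自学习损失示意。
    model(speech, labels): 返回该标签序列下的序列级损失(假设的接口)
    nbest_hyps: n 条假设标签序列; weights: 各任务权重,缺省为均匀
    """
    n = len(nbest_hyps)
    if weights is None:
        weights = [1.0 / n] * n           # 均匀权重;也可按假设得分加权(假设)
    losses = [model(speech, hyp) for hyp in nbest_hyps]
    return sum(w * l for w, l in zip(weights, losses))
```

这样,网络不会被单一(可能错误的)一优假设硬性拉偏,而是学习能覆盖多个假设的共同表示。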
【15】 Building a great multi-lingual teacher with sparsely-gated mixture of experts for speech recognition 标题:利用稀疏门控混合专家构建出色的多语言语音识别教师模型 链接:https://arxiv.org/abs/2112.05820
作者:Kenichi Kumatani,Robert Gmyr,Felipe Cruz Salinas,Linquan Liu,Wei Zuo,Devang Patel,Eric Sun,Yu Shi 机构:Microsoft 摘要:稀疏门控混合专家(MoE)可以在仅增加少量计算量的情况下扩大网络容量。在这项工作中,我们研究了如何通过简单的路由算法扩展多语言自动语音识别(ASR)网络以获得更高的准确率。更具体地说,我们将稀疏门控MoE技术应用于两类网络:序列到序列Transformer(S2S-T)和Transformer Transducer(T-T)。我们在多语言数据上的一组ASR实验表明,MoE网络在S2S-T和T-T上可分别将相对词错误率降低16.5%和4.7%。此外,我们还深入研究了MoE在各种条件下对T-T架构的影响:流式模式、非流式模式、语言ID的使用,以及带MoE的标签解码器。 摘要:The sparsely-gated Mixture of Experts (MoE) can magnify a network capacity with a little computational complexity. In this work, we investigate how multi-lingual Automatic Speech Recognition (ASR) networks can be scaled up with a simple routing algorithm in order to achieve better accuracy. More specifically, we apply the sparsely-gated MoE technique to two types of networks: Sequence-to-Sequence Transformer (S2S-T) and Transformer Transducer (T-T). We demonstrate through a set of ASR experiments on multiple language data that the MoE networks can reduce the relative word error rates by 16.5% and 4.7% with the S2S-T and T-T, respectively. Moreover, we thoroughly investigate the effect of the MoE on the T-T architecture in various conditions: streaming mode, non-streaming mode, the use of language ID and the label decoder with the MoE.
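稀疏门控MoE层的基本机制可用如下极简示意(假设性代码,非论文原始实现;专家数、top-k与隐层维度均为演示假设):路由器对每个token选出k个专家,只激活被选中的专家前馈网络,从而以少量额外计算扩大容量。

```python
import torch
import torch.nn as nn

class SparseMoE(nn.Module):
    """top-k 稀疏门控混合专家层示意(专家数、k 为演示假设)。"""
    def __init__(self, dim, num_experts=8, k=2, hidden=1024):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))
            for _ in range(num_experts))
        self.k = k

    def forward(self, x):                       # x: (N, dim),N 为 token 数
        gates = self.router(x).softmax(dim=-1)  # 路由概率
        topv, topi = gates.topk(self.k, dim=-1)
        topv = topv / topv.sum(dim=-1, keepdim=True)  # 被选中门控重新归一化
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = (topi == e)                  # (N, k) 中选中专家 e 的位置
            rows = mask.any(dim=-1)
            if rows.any():                      # 只对被路由到的 token 计算该专家
                w = (topv * mask).sum(dim=-1)[rows].unsqueeze(-1)
                out[rows] += w * expert(x[rows])
        return out
```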