Finance / Speech / Audio Processing Academic Digest [7.23]

2021-07-27 11:14:03

Visit www.arxivdaily.com for digests with abstracts, covering CS, physics, mathematics, economics, statistics, finance, biology, and electrical engineering, with search, bookmarking, and posting features.

q-fin (Quantitative Finance): 11 papers

cs.SD (Sound): 13 papers

eess.AS (Audio and Speech Processing): 13 papers

1. q-fin (Quantitative Finance):

【1】 Of Access and Inclusivity: Digital Divide in Online Education

Authors: Bheemeshwar Reddy A, Sunny Jose, Vaidehi R Link: https://arxiv.org/abs/2107.10723 Abstract: Can online education enable all students to participate in and benefit from it equally? Massive online education without addressing the huge access gap and disparities in digital infrastructure would not only exclude a vast majority of students from learning opportunities but also exacerbate the existing socio-economic disparities in educational opportunities.

【2】 Capital Requirements and Claims Recovery: A New Perspective on Solvency Regulation

Authors: Cosimo Munari, Lutz Wilhelmy, Stefan Weber Link: https://arxiv.org/abs/2107.10635 Abstract: Protection of creditors is a key objective of financial regulation. Where the protection needs are high, i.e., in banking and insurance, regulatory solvency requirements are an instrument to prevent creditors from incurring losses on their claims. The current regulatory requirements based on Value at Risk and Average Value at Risk limit the probability of default of financial institutions, but they fail to control the size of recovery on creditors' claims in the case of default. We resolve this failure by developing a novel risk measure, Recovery Value at Risk. Our conceptual approach can flexibly be extended and allows the construction of general recovery risk measures for various risk management purposes. By design, these risk measures control recovery on creditors' claims and integrate the protection needs of creditors into the incentive structure of the management. We provide detailed case studies and applications: we analyze how recovery risk measures react to the joint distributions of assets and liabilities on firms' balance sheets and compare the corresponding capital requirements with the current regulatory benchmarks based on Value at Risk and Average Value at Risk. We discuss how to calibrate recovery risk measures to historic regulatory standards. Finally, we show that recovery risk measures can be applied to performance-based management of business divisions of firms and that they allow for a tractable characterization of optimal tradeoffs between risk and return in the context of investment management.
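The regulatory benchmarks discussed here, Value at Risk and Average Value at Risk, can be estimated empirically from loss samples; a minimal sketch follows (the paper's Recovery Value at Risk is a new construction and is not reproduced here — this only illustrates the baseline measures):

```python
import numpy as np

def value_at_risk(losses, alpha=0.99):
    """Empirical VaR: the alpha-quantile of the loss distribution."""
    return np.quantile(losses, alpha)

def average_value_at_risk(losses, alpha=0.99):
    """Empirical AVaR (expected shortfall): mean loss beyond VaR."""
    var = value_at_risk(losses, alpha)
    return losses[losses >= var].mean()

rng = np.random.default_rng(0)
losses = rng.normal(0.0, 1.0, 100_000)   # toy loss distribution
var99 = value_at_risk(losses, 0.99)
avar99 = average_value_at_risk(losses, 0.99)
# AVaR always dominates VaR at the same confidence level
assert avar99 >= var99
```

As the assertion notes, AVaR is never below VaR at the same level, which is one reason it is the stricter of the two regulatory benchmarks.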

【3】 cCorrGAN: Conditional Correlation GAN for Learning Empirical Conditional Distributions in the Elliptope

Authors: Gautier Marti, Victor Goubet, Frank Nielsen Institutions: Independent researcher; ESILV - École Supérieure d'Ingénieurs Léonard de Vinci, Paris, France; Sony Computer Science Laboratories Inc, Tokyo, Japan Comments: None Link: https://arxiv.org/abs/2107.10606 Abstract: We propose a methodology to approximate conditional distributions in the elliptope of correlation matrices based on conditional generative adversarial networks. We illustrate the methodology with an application from quantitative finance: Monte Carlo simulations of correlated returns to compare risk-based portfolio construction methods. Finally, we discuss current limitations and advocate for further exploration of the elliptope geometry to improve results.
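The elliptope referred to here is the set of valid correlation matrices: symmetric, positive semidefinite, with unit diagonal. A small membership check (an illustration of the constraint set, not of the paper's GAN):

```python
import numpy as np

def in_elliptope(C, tol=1e-10):
    """Check whether C is a valid correlation matrix."""
    C = np.asarray(C, dtype=float)
    symmetric = np.allclose(C, C.T, atol=tol)
    unit_diag = np.allclose(np.diag(C), 1.0, atol=tol)
    # smallest eigenvalue of the symmetrized matrix must be >= 0
    psd = np.linalg.eigvalsh((C + C.T) / 2).min() >= -tol
    return symmetric and unit_diag and psd

good = np.array([[1.0, 0.5], [0.5, 1.0]])
bad = np.array([[1.0, 1.5], [1.5, 1.0]])   # |rho| > 1: not PSD
assert in_elliptope(good) and not in_elliptope(bad)
```

A generator whose raw outputs land outside this set must be projected back onto it, which is part of what makes sampling in the elliptope nontrivial.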

【4】 A Stochastic Control Approach to Public Debt Management

Authors: Matteo Brachetta, Claudia Ceci Institutions: Department of Mathematics; Department of Economics, University of Chieti-Pescara Comments: 26 pages Link: https://arxiv.org/abs/2107.10491 Abstract: We discuss a class of debt management problems in a stochastic environment model. We propose a model for the debt-to-GDP (Gross Domestic Product) ratio where the government interventions via fiscal policies affect the public debt and the GDP growth rate at the same time. We allow for stochastic interest rate and possible correlation with the GDP growth rate through the dependence of both the processes (interest rate and GDP growth rate) on a stochastic factor which may represent any relevant macroeconomic variable, such as the state of the economy. We tackle the problem of a government whose goal is to determine the fiscal policy in order to minimize a general functional cost. We prove that the value function is a viscosity solution to the Hamilton-Jacobi-Bellman equation and provide a Verification Theorem based on classical solutions. We investigate the form of the candidate optimal fiscal policy in many cases of interest, providing interesting policy insights. Finally, we discuss two applications to the debt reduction problem and debt smoothing, providing explicit expressions of the value function and the optimal policy in some special cases.

【5】 Time Varying Risk in U.S. Housing Sector and Real Estate Investment Trusts Equity Return

Authors: Masud Alam Comments: 53 pages Link: https://arxiv.org/abs/2107.10455 Abstract: This study examines how housing sector volatilities affect real estate investment trust (REIT) equity return in the United States. I argue that unexpected changes in housing variables can be a source of aggregate housing risk, and the first principal component extracted from the volatilities of U.S. housing variables can predict the expected REIT equity returns. I propose and construct a factor-based housing risk index as an additional factor in asset price models that uses the time-varying conditional volatility of housing variables within the U.S. housing sector. The findings show that the proposed housing risk index is economically and theoretically consistent with the risk-return relationship of the conditional Intertemporal Capital Asset Pricing Model (ICAPM) of Merton (1973), which predicts an average maximum of 5.6 percent of risk premium in REIT equity return. In subsample analyses, the positive relationship is not affected by the choice of sample period but shows higher housing risk beta values for the 2009-18 sample period. The relationship remains significant after controlling for VIX, Fama-French three factors, and a broad set of macroeconomic and financial variables. Moreover, the proposed housing beta also accurately forecasts U.S. macroeconomic and financial conditions.
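The "first principal component extracted from the volatilities" step can be sketched generically: given a panel of volatility series, the leading PC is the common factor. A minimal sketch on synthetic data (the factor structure and weights below are illustrative assumptions, not the paper's data):

```python
import numpy as np

def first_principal_component(X):
    """Return scores on the first principal component of X (T x N)."""
    Xc = X - X.mean(axis=0)             # center each volatility series
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[0]                   # projection on the leading PC

rng = np.random.default_rng(1)
common = rng.normal(size=200)           # latent aggregate housing factor
vols = np.column_stack([common * b + rng.normal(scale=0.1, size=200)
                        for b in (0.9, 0.8, 1.1, 1.0)])
index = first_principal_component(vols)
# the extracted index should track the common factor (up to sign)
corr = abs(np.corrcoef(index, common)[0, 1])
assert corr > 0.95
```

The sign indeterminacy of PCA is why the assertion takes an absolute correlation; in practice the index would be sign-normalized against an interpretable series.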

【6】 Everything You Always Wanted to Know About XVA Model Risk but Were Afraid to Ask

Authors: Lorenzo Silotto, Marco Scaringi, Marco Bianchetti Comments: 59 pages, 15 figures, 16 tables, 43 references Link: https://arxiv.org/abs/2107.10377 Abstract: Valuation adjustments, collectively named XVA, play an important role in modern derivatives pricing. XVA are an exotic pricing component since they require the forward simulation of multiple risk factors in order to compute the portfolio exposure including collateral, leading to a significant model risk and computational effort, even in case of plain vanilla trades. This work analyses the most critical model risk factors, meant as those to which XVA are most sensitive, finding an acceptable compromise between accuracy and performance. This task has been conducted in a complete context including a market standard multi-curve G2++ model calibrated on real market data, both Variation Margin and ISDA-SIMM dynamic Initial Margin, different collateralization schemes, and the most common linear and non-linear interest rates derivatives. Moreover, we considered an alternative analytical approach for XVA in case of uncollateralized Swaps. We show that a crucial element is the construction of a parsimonious time grid capable of capturing all periodical spikes arising in collateralized exposure during the Margin Period of Risk. To this end, we propose a workaround to efficiently capture all spikes. Moreover, we show that there exists a parameterization which allows one to obtain accurate results in a reasonable time, which is a very important feature for practical applications. In order to address the valuation uncertainty linked to the existence of a range of different parameterizations, we calculate the Model Risk AVA (Additional Valuation Adjustment) for XVA according to the provisions of the EU Prudent Valuation regulation. Finally, this work can serve as a handbook containing step-by-step instructions for the implementation of a complete, realistic and robust modelling framework of collateralized exposure and XVA.
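The time-grid point is concrete: exposure spikes appear around cash-flow dates shifted by the Margin Period of Risk (MPoR), so a parsimonious grid can merge a coarse base grid with those spike dates. A hypothetical sketch (dates in year fractions; the two-week MPoR and the edge choice are illustrative assumptions, not the paper's exact workaround):

```python
def build_grid(horizon, coarse_step, cashflow_dates, mpor):
    """Coarse simulation grid augmented with points at each cash-flow
    date and at the same date shifted back by the MPoR, so that the
    exposure spike between the two is not skipped."""
    grid = {round(coarse_step * k, 6)
            for k in range(int(horizon / coarse_step) + 1)}
    for t in cashflow_dates:
        for spike in (t - mpor, t):      # capture both spike edges
            if 0.0 <= spike <= horizon:
                grid.add(round(spike, 6))
    return sorted(grid)

mpor = 10 / 365                          # assumed two-week margin period
grid = build_grid(horizon=2.0, coarse_step=0.25,
                  cashflow_dates=[0.5, 1.0, 1.5, 2.0], mpor=mpor)
assert 0.5 in grid and round(0.5 - mpor, 6) in grid
```

The payoff of such a grid is that a coarse quarterly simulation still resolves the short-lived collateralized-exposure spikes that dominate XVA sensitivity.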

【7】 A Sparsity Algorithm with Applications to Corporate Credit Rating

Authors: Dan Wang, Zhi Chen, Ionut Florescu Comments: 16 pages, 11 tables, 3 figures Link: https://arxiv.org/abs/2107.10306 Abstract: In Artificial Intelligence, interpreting the results of a Machine Learning technique often termed a black box is a difficult task. A counterfactual explanation of a particular "black box" attempts to find the smallest change to the input values that modifies the prediction to a particular output, other than the original one. In this work we formulate the problem of finding a counterfactual explanation as an optimization problem. We propose a new "sparsity algorithm" which solves the optimization problem, while also maximizing the sparsity of the counterfactual explanation. We apply the sparsity algorithm to provide a simple suggestion to publicly traded companies in order to improve their credit ratings. We validate the sparsity algorithm with a synthetically generated dataset and we further apply it to quarterly financial statements from companies in the financial, healthcare and IT sectors of the US market. We provide evidence that the counterfactual explanation can capture the nature of the real statement features that changed between the current quarter and the following quarter when ratings improved. The empirical results show that the higher the rating of a company the greater the "effort" required to further improve credit rating.
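The generic shape of a sparse counterfactual search can be sketched for a linear classifier: flip the prediction while an L1 penalty keeps the perturbation concentrated on few features. This is a hedged illustration of the problem formulation, not the paper's sparsity algorithm:

```python
import numpy as np

def counterfactual(x, w, b, target=1.0, l1=0.1, steps=500, lr=0.05):
    """Find a sparse perturbation delta so that sign(w.(x+delta)+b)
    becomes `target`, via gradient steps on a hinge-style objective
    followed by soft-thresholding (the L1 proximal step)."""
    delta = np.zeros_like(x)
    for _ in range(steps):
        score = w @ (x + delta) + b
        if target * score > 0.1:        # small margin reached: stop
            break
        delta += lr * target * w        # move along the flip direction
        # soft-threshold: shrink every coordinate toward zero
        delta = np.sign(delta) * np.maximum(np.abs(delta) - lr * l1, 0.0)
    return delta

w = np.array([2.0, 0.1, 0.1])           # one dominant feature
x = np.array([-1.0, 0.0, 0.0])          # classified negative: w.x + b < 0
delta = counterfactual(x, w, b=0.0)
assert w @ (x + delta) > 0              # prediction flipped
assert np.argmax(np.abs(delta)) == 0    # change concentrates on feature 0
```

In the credit-rating setting, the perturbed features would read as the "simple suggestion": the few statement items a company should change to move its predicted rating.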

【8】 Factors determining maximum energy consumption of Bitcoin miners

Authors: Jesus M. Gonzalez-Barahona Comments: 24 pages, request for comments Link: https://arxiv.org/abs/2107.10634 Abstract: Background: During the last years, there has been a lot of discussion and estimations on the energy consumption of Bitcoin miners. However, most of the studies are focused on estimating energy consumption, not on exploring the factors that determine it. Goal: To explore the factors that determine maximum energy consumption of Bitcoin miners. In particular, analyze the limits of energy consumption, and to which extent variations of the factors could produce its reduction. Method: Estimate the overall profit of all Bitcoin miners during a certain period of time, and the costs (including energy) that they face during that time, because of the mining activity. The underlying assumption is that miners will only consume energy to mine Bitcoin if they have the expectation of profit, and at the same time they are competitive with respect to each other. Therefore, they will operate as a group at the point where profits balance expenditures. Results: We show a basic equation that determines energy consumption based on some specific factors: minting, transaction fees, exchange rate, energy price, and amortization cost. We also define the Amortization Factor, which can be computed for mining devices based on their cost and energy consumption, and which helps to understand how the cost of equipment influences total energy consumption. Conclusions: The factors driving energy consumption are identified, and from them, some ways in which Bitcoin energy consumption could be reduced are discussed. Some of these ways do not reduce the most important properties of Bitcoin, such as the chances of control of the aggregated hashpower, or the fundamentals of the proof of work mechanism. In general, the methods presented can help to predict energy consumption in different scenarios, based on factors that can be calculated from available data, or assumed in scenarios.
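The break-even logic described (profits balance expenditures) implies an energy bound of roughly the following shape; the variable names and the rearrangement below are illustrative assumptions, not the paper's exact equation:

```python
def max_energy_consumption(minting_btc, fees_btc, exchange_rate,
                           energy_price, amortization_factor):
    """Upper bound on energy (kWh) miners can buy per period if all
    revenue is spent on energy plus hardware amortization.
    amortization_factor: assumed fraction of spending on hardware."""
    revenue = (minting_btc + fees_btc) * exchange_rate      # USD / period
    energy_budget = revenue * (1.0 - amortization_factor)   # USD for power
    return energy_budget / energy_price                     # kWh

# Toy numbers, not estimates: 1000 BTC/period revenue at $30k,
# $0.05/kWh power, 40% of spending amortizing hardware.
kwh = max_energy_consumption(minting_btc=900, fees_btc=100,
                             exchange_rate=30_000, energy_price=0.05,
                             amortization_factor=0.4)
assert kwh == (1000 * 30_000) * 0.6 / 0.05
```

Read this way, the levers for reduction are visible directly: lower minting or fees, a lower exchange rate, a higher energy price, or a higher share of spending going to hardware all shrink the energy budget.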

【9】 Hodge theoretic reward allocation for generalized cooperative games on graphs

Authors: Tongseok Lim Institution: Krannert School of Management, Purdue University Link: https://arxiv.org/abs/2107.10510 Abstract: We define cooperative games on general graphs and generalize Lloyd S. Shapley's celebrated allocation formula for those games in terms of stochastic path integral driven by the associated Markov chain on each graph. We then show that the value allocation operator, one for each player defined by the stochastic path integral, coincides with the player's component game which is the solution to the least squares (or Poisson's) equation, in light of the combinatorial Hodge decomposition on general weighted graphs. Several motivational examples and applications are also presented.
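For orientation, the classical Shapley formula that this paper generalizes can be computed directly for a small coalitional game; a minimal sketch (the paper's path-integral and Hodge-theoretic machinery is not reproduced):

```python
from itertools import permutations

def shapley(players, v):
    """Classical Shapley value: average marginal contribution of each
    player over all orderings, for a characteristic function v(frozenset)."""
    phi = {p: 0.0 for p in players}
    perms = list(permutations(players))
    for order in perms:
        coalition = frozenset()
        for p in order:
            phi[p] += v(coalition | {p}) - v(coalition)
            coalition = coalition | {p}
    return {p: val / len(perms) for p, val in phi.items()}

# Two-player glove game: a pair is worth 1, singletons are worth 0.
v = lambda S: 1.0 if len(S) == 2 else 0.0
phi = shapley(["L", "R"], v)
assert phi == {"L": 0.5, "R": 0.5}
# Efficiency: allocations sum to the grand-coalition value
assert abs(sum(phi.values()) - v(frozenset(["L", "R"]))) < 1e-12
```

The factorial enumeration above is exactly what the stochastic-path-integral viewpoint reinterprets and extends to games defined on general graphs.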

【10】 Financial Network Games

Authors: Panagiotis Kanellopoulos, Maria Kyropoulou, Hao Zhou Institution: School of Computer Science and Electronic Engineering, University of Essex, UK Link: https://arxiv.org/abs/2107.06623 Abstract: We study financial systems from a game-theoretic standpoint. A financial system is represented by a network, where nodes correspond to firms, and directed labeled edges correspond to debt contracts between them. The existence of cycles in the network indicates that a payment of a firm to one of its lenders might result in some incoming payment. So, if a firm cannot fully repay its debt, then the exact (partial) payments it makes to each of its creditors can affect the cash inflow back to itself. We naturally assume that the firms are interested in their financial well-being (utility) which is aligned with the amount of incoming payments they receive from the network. This defines a game among the firms, that can be seen as utility-maximizing agents who can strategize over their payments. We are the first to study financial network games that arise under a natural set of payment strategies called priority-proportional payments. We investigate the existence and (in)efficiency of equilibrium strategies, under different assumptions on how the firms' utility is defined, on the types of debt contracts allowed between the firms, and on the presence of other financial features that commonly arise in practice. Surprisingly, even if all firms' strategies are fixed, the existence of a unique payment profile is not guaranteed. So, we also investigate the existence and computation of valid payment profiles for fixed payment strategies.
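Proportional payment profiles of the kind studied here are typically computed as a fixed point, in the spirit of the classical Eisenberg-Noe clearing model; a minimal sketch under assumed liabilities and external assets (illustrative, not the paper's priority-proportional strategies):

```python
import numpy as np

def clearing_payments(L, e, iters=200):
    """Fixed-point iteration for proportional clearing payments.
    L[i, j]: nominal debt of firm i to firm j; e[i]: external assets."""
    pbar = L.sum(axis=1)                    # total nominal obligations
    Pi = np.divide(L, pbar[:, None],
                   out=np.zeros_like(L), where=pbar[:, None] > 0)
    p = pbar.copy()
    for _ in range(iters):
        inflow = Pi.T @ p                   # payments received from others
        p = np.minimum(pbar, e + inflow)    # pay in full, or all you have
    return p

L = np.array([[0.0, 2.0], [1.0, 0.0]])      # firm 0 owes 2, firm 1 owes 1
e = np.array([0.5, 0.5])
p = clearing_payments(L, e)
assert np.all(p <= L.sum(axis=1) + 1e-9)    # no firm overpays
```

The cycle in this two-firm example (each owes the other) is exactly what makes incoming payments depend on outgoing ones, the feedback the abstract describes.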

【11】 A Time-Varying Network for Cryptocurrencies

Authors: Li Guo, Wolfgang Karl Härdle, Yubo Tao Institutions: Fudan University; Shanghai Institute of International Finance and Economics; Humboldt-Universität zu Berlin; Singapore Management University; Xiamen University; Charles University Comments: 43 pages, 6 figures, 7 tables Link: https://arxiv.org/abs/1802.03708 Abstract: Cryptocurrency return cross-predictability and technological similarity yield information on risk propagation and market segmentation. To investigate these effects, we build a time-varying network for cryptocurrencies, based on the evolution of return cross-predictability and technological similarities. We develop a dynamic covariate-assisted spectral clustering method to consistently estimate the latent community structure of the cryptocurrency network that accounts for both sets of information. We demonstrate that investors can achieve better risk diversification by investing in cryptocurrencies from different communities. A cross-sectional portfolio that implements an inter-crypto momentum trading strategy earns a 1.08% daily return. By dissecting the portfolio returns on behavioral factors, we confirm that our results are not driven by behavioral mechanisms.
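The dynamic covariate-assisted method is specific to the paper, but the plain spectral-clustering backbone it builds on can be sketched with numpy; the two-community toy below recovers the blocks of a weakly connected graph from the sign of the Fiedler vector:

```python
import numpy as np

def two_communities(A):
    """Partition a graph into two communities via the sign of the
    second-smallest eigenvector of the graph Laplacian (Fiedler vector)."""
    d = A.sum(axis=1)
    L = np.diag(d) - A
    _, vecs = np.linalg.eigh(L)        # eigenvalues in ascending order
    return (vecs[:, 1] > 0).astype(int)

# Two dense blocks of three nodes with a single weak link between them.
A = np.zeros((6, 6))
A[:3, :3] = 1.0
A[3:, 3:] = 1.0
np.fill_diagonal(A, 0.0)
A[2, 3] = A[3, 2] = 0.1
labels = two_communities(A)
assert len(set(labels[:3])) == 1 and len(set(labels[3:])) == 1
assert labels[0] != labels[5]
```

In the paper's setting the adjacency would encode cross-predictability and technological similarity, and the communities would evolve over time rather than being fixed.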

2. cs.SD (Sound):

【1】 StarGANv2-VC: A Diverse, Unsupervised, Non-parallel Framework for Natural-Sounding Voice Conversion

Authors: Yinghao Aaron Li, Ali Zare, Nima Mesgarani Institution: Department of Electrical Engineering, Columbia University, USA Comments: To be published in INTERSPEECH 2021 Proceedings Link: https://arxiv.org/abs/2107.10394 Abstract: We present an unsupervised non-parallel many-to-many voice conversion (VC) method using a generative adversarial network (GAN) called StarGAN v2. Using a combination of adversarial source classifier loss and perceptual loss, our model significantly outperforms previous VC models. Although our model is trained only with 20 English speakers, it generalizes to a variety of voice conversion tasks, such as any-to-many, cross-lingual, and singing conversion. Using a style encoder, our framework can also convert plain reading speech into stylistic speech, such as emotional and falsetto speech. Subjective and objective evaluation experiments on a non-parallel many-to-many voice conversion task revealed that our model produces natural sounding voices, close to the sound quality of state-of-the-art text-to-speech (TTS) based voice conversion methods without the need for text labels. Moreover, our model is completely convolutional and, with a faster-than-real-time vocoder such as Parallel WaveGAN, can perform real-time voice conversion.

【2】 JS Fake Chorales: a Synthetic Dataset of Polyphonic Music with Human Annotation

Authors: Omar Peracha Link: https://arxiv.org/abs/2107.10388 Abstract: High quality datasets for learning-based modelling of polyphonic symbolic music remain less readily-accessible at scale than in other domains, such as language modelling or image classification. In particular, datasets which contain information revealing insights about human responses to the given music samples are rare. The issue of scale persists as a general hindrance towards breakthroughs in the field, while the lack of listener evaluation is especially relevant to the generative modelling problem-space, where clear objective metrics correlating strongly with qualitative success remain elusive. We propose the JS Fake Chorales, a dataset of 500 pieces generated by a new learning-based algorithm, provided in MIDI form. We take consecutive outputs from the algorithm and avoid cherry-picking in order to validate the potential to further scale this dataset on-demand. We conduct an online experiment for human evaluation, designed to be as fair to the listener as possible, and find that respondents were on average only 7% better than random guessing at distinguishing JS Fake Chorales from real chorales composed by JS Bach. Furthermore, we make anonymised data collected from experiments available along with the MIDI samples, such as the respondents' musical experience and how long they took to submit their response for each sample. Finally, we conduct ablation studies to demonstrate the effectiveness of using the synthetic pieces for research in polyphonic music modelling, and find that we can improve on state-of-the-art validation set loss for the canonical JSB Chorales dataset, using a known algorithm, by simply augmenting the training set with the JS Fake Chorales.

【3】 HARP-Net: Hyper-Autoencoded Reconstruction Propagation for Scalable Neural Audio Coding

Authors: Darius Petermann, Seungkwon Beack, Minje Kim Institutions: Indiana University, Department of Intelligent Systems Engineering, Bloomington, IN, USA; Electronics and Telecommunications Research Institute, Daejeon, South Korea Comments: Accepted to the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA) 2021, Mohonk Mountain House, New Paltz, NY Link: https://arxiv.org/abs/2107.10843 Abstract: An autoencoder-based codec employs quantization to turn its bottleneck layer activation into bitstrings, a process that hinders information flow between the encoder and decoder parts. To circumvent this issue, we employ additional skip connections between the corresponding pair of encoder-decoder layers. The assumption is that, in a mirrored autoencoder topology, a decoder layer reconstructs the intermediate feature representation of its corresponding encoder layer. Hence, any additional information directly propagated from the corresponding encoder layer helps the reconstruction. We implement this kind of skip connections in the form of additional autoencoders, each of which is a small codec that compresses the massive data transfer between the paired encoder-decoder layers. We empirically verify that the proposed hyper-autoencoded architecture improves perceptual audio quality compared to an ordinary autoencoder baseline.

【4】 Multi-modal Residual Perceptron Network for Audio-Video Emotion Recognition

Authors: Xin Chang, Władysław Skarbek Link: https://arxiv.org/abs/2107.10742 Abstract: Emotion recognition is an important research field for Human-Computer Interaction (HCI). Audio-Video Emotion Recognition (AVER) is now attacked with Deep Neural Network (DNN) modeling tools. In published papers, as a rule, the authors show only cases of the superiority of multiple modalities over audio-only or video-only modalities. However, there are cases where a single modality is superior. In our research, we hypothesize that for fuzzy categories of emotional events, the higher noise of one modality can amplify the lower noise of the second modality represented indirectly in the parameters of the modeling neural network. To avoid such cross-modal information interference we define a Multi-modal Residual Perceptron Network (MRPN) which learns from multi-modal network branches creating deep feature representation with reduced noise. For the proposed MRPN model and the novel time augmentation for streamed digital movies, the state-of-the-art average recognition rate was improved to 91.4% for The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) dataset and to 83.15% for the Crowd-sourced Emotional Multimodal Actors Dataset (CREMA-D). Moreover, the MRPN concept shows its potential for multi-modal classifiers dealing with signal sources not only of optical and acoustical type.

【5】 Project Achoo: A Practical Model and Application for COVID-19 Detection from Recordings of Breath, Voice, and Cough

Authors: Alexander Ponomarchuk, Ilya Burenko, Elian Malkin, Ivan Nazarov, Vladimir Kokh, Manvel Avetisian, Leonid Zhukov Institution: Sber AI Lab Link: https://arxiv.org/abs/2107.10716 Abstract: The COVID-19 pandemic created a significant interest and demand for infection detection and monitoring solutions. In this paper we propose a machine learning method to quickly triage COVID-19 using recordings made on consumer devices. The approach combines signal processing methods with fine-tuned deep learning networks and provides methods for signal denoising, cough detection and classification. We have also developed and deployed a mobile application that uses a symptoms checker together with voice, breath and cough signals to detect COVID-19 infection. The application showed robust performance on both open sourced datasets and on the noisy data collected during beta testing by the end users.

【6】 CarneliNet: Neural Mixture Model for Automatic Speech Recognition

Authors: Aleksei Kalinov, Somshubra Majumdar, Jagadeesh Balam, Boris Ginsburg Institutions: NVIDIA, USA; Skolkovo Institute of Science and Technology, Russia Comments: Submitted to ASRU 2021 Link: https://arxiv.org/abs/2107.10708 Abstract: End-to-end automatic speech recognition systems have achieved great accuracy by using deeper and deeper models. However, the increased depth comes with a larger receptive field that can negatively impact model performance in streaming scenarios. We propose an alternative approach that we call Neural Mixture Model. The basic idea is to introduce a parallel mixture of shallow networks instead of a very deep network. To validate this idea we design CarneliNet -- a CTC-based neural network composed of three mega-blocks. Each mega-block consists of multiple parallel shallow sub-networks based on 1D depthwise-separable convolutions. We evaluate the model on LibriSpeech, MLS and AISHELL-2 datasets and achieve close to state-of-the-art results for CTC-based models. Finally, we demonstrate that one can dynamically reconfigure the number of parallel sub-networks to accommodate the computational requirements without retraining.

【7】 Multitask-Based Joint Learning Approach To Robust ASR For Radio Communication Speech

Authors: Duo Ma, Nana Hou, Van Tung Pham, Haihua Xu, Eng Siong Chng Institutions: Human Language Technology (HLT) Laboratory, Department of Electrical and Computer Engineering, National University of Singapore, Singapore; School of Computer Science and Engineering, Nanyang Technological University, Singapore Comments: 7 pages, 3 figures, submitted to APSIPA 2021 Link: https://arxiv.org/abs/2107.10701 Abstract: To realize robust end-to-end Automatic Speech Recognition (E2E ASR) under radio communication conditions, we propose a multitask-based method to jointly train a Speech Enhancement (SE) module as the front-end and an E2E ASR model as the back-end in this paper. One of the advantages of the proposed method is that the entire system can be trained from scratch. Different from prior works, neither component here needs to perform pre-training and fine-tuning processes separately. Through analysis, we found that the success of the proposed method lies in the following aspects. Firstly, multitask learning is essential; that is, the SE network is not only learning to produce more intelligible speech, it also aims to generate speech that is beneficial to recognition. Secondly, we also found that the speech phase preserved from noisy speech is critical for improving ASR performance. Thirdly, we propose a dual-channel data augmentation training method to obtain further improvement. Specifically, we combine the clean and enhanced speech to train the whole system. We evaluate the proposed method on the RATS English data set, achieving a relative WER reduction of 4.6% with the joint training method, and up to a relative WER reduction of 11.2% with the proposed data augmentation method.
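The relative WER reductions quoted here are computed against a baseline WER; a small sketch with a standard edit-distance WER makes the metric concrete (toy sentences, not the RATS data):

```python
def wer(ref, hyp):
    """Word error rate via Levenshtein distance over word tokens."""
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                     # deletions only
    for j in range(len(h) + 1):
        d[0][j] = j                     # insertions only
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)

baseline = wer("the cat sat on the mat", "the cat sat on a mat")
improved = wer("the cat sat on the mat", "the cat sat on the mat")
relative_reduction = (baseline - improved) / baseline
assert relative_reduction == 1.0        # all baseline errors removed
```

A "relative WER reduction of 11.2%" thus means the new WER is 88.8% of the baseline WER, not that WER dropped by 11.2 absolute points.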

【8】 CNN Classifier for Just-in-Time Woodpeckers Detection and Deterrent

Authors: Alexander Greysukh Comments: 6 pages, 4 figures Link: https://arxiv.org/abs/2107.10676 Abstract: Woodpeckers can cause significant damage to homes, especially in suburban areas. There are a number of preventing and repelling methods including passive decoys, though these may only provide temporary relief. Subsequently, it may be more efficient to implement a woodpecker deterrent, such as motion, light, sound, or ultrasound, that would be triggered by detection of woodpecker signature drumming. To detect the typical 25 Hz drumming frequency, sampling periods under 10 milliseconds with frequent FFTs are required, with considerable computational costs. An in-hardware spectrum analyzer may avoid these costs by trading off frequency resolution for time resolution. The trained model is converted to TF Lite Micro, ported to an MCU, and identifies a variety of prerecorded woodpecker drumming. The plan is to integrate the prototype with a deterrent device, making it a completely autonomous solution.
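The FFT-based detection step described here can be sketched on a synthetic drumming-like signal: look for a dominant spectral peak near 25 Hz. The pulse train below is an illustrative stand-in for real drumming audio:

```python
import numpy as np

def dominant_frequency(signal, sample_rate):
    """Return the frequency bin with the largest FFT magnitude."""
    spectrum = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    spectrum[0] = 0.0                        # ignore the DC component
    return freqs[np.argmax(spectrum)]

sr = 1000                                    # 1 kHz sampling, 1 s window
t = np.arange(sr) / sr
drum = np.sign(np.sin(2 * np.pi * 25 * t))   # 25 Hz pulse-like train
assert dominant_frequency(drum, sr) == 25.0
```

The one-second window used here is exactly the cost the abstract wants to avoid on an MCU: resolving 25 Hz cleanly needs a long window or frequent overlapping FFTs, which motivates the in-hardware analyzer plus CNN alternative.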

【9】 Digital Einstein Experience: Fast Text-to-Speech for Conversational AI 标题:数字爱因斯坦体验:对话式人工智能的快速文本到语音转换

作者:Joanna Rownicka,Kilian Sprenkamp,Antonio Tripiana,Volodymyr Gromoglasov,Timo P Kunz 机构:Aflorithmic Labs Ltd. 备注:accepted at Interspeech 2021 链接:https://arxiv.org/abs/2107.10658 摘要:我们描述了为会话人工智能用例创建和交付自定义语音的方法。更具体地说,我们为数字爱因斯坦角色提供了一个声音,以实现数字对话体验中的人机交互。为了生成与上下文契合的语音,我们首先设计一个语音角色,然后制作与所需语音属性相对应的录音。接着我们对该声音进行建模。我们的解决方案利用FastSpeech 2从音素中预测对数标度的mel谱图,并利用Parallel WaveGAN生成波形。该系统接受字符输入,并在输出端给出语音波形。我们为选定的单词使用自定义词典,以确保其正确发音。我们提出的云架构能够实现快速的语音交付,使我们能够与数字版本的Albert Einstein进行实时对话。 摘要:We describe our approach to create and deliver a custom voice for a conversational AI use-case. More specifically, we provide a voice for a Digital Einstein character, to enable human-computer interaction within the digital conversation experience. To create the voice which fits the context well, we first design a voice character and we produce the recordings which correspond to the desired speech attributes. We then model the voice. Our solution utilizes Fastspeech 2 for log-scaled mel-spectrogram prediction from phonemes and Parallel WaveGAN to generate the waveforms. The system supports a character input and gives a speech waveform at the output. We use a custom dictionary for selected words to ensure their proper pronunciation. Our proposed cloud architecture enables fast voice delivery, making it possible to talk to the digital version of Albert Einstein in real-time.
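摘要中提到"为选定的单词使用自定义词典以确保其正确发音"。下面是对这一步骤的假设性草图(词条和回退G2P函数均为示例,并非论文使用的词典或模型):在音素预测之前,先在手工词典中查词,查不到再回退到通用的字素到音素函数。

```python
# Hypothetical custom-pronunciation lexicon; the ARPABET entry below
# is illustrative, not taken from the paper.
LEXICON = {"Einstein": "AY1 N S T AY2 N".split()}

def to_phonemes(text, g2p, lexicon=LEXICON):
    """Map each word via the custom lexicon first, falling back to a
    general grapheme-to-phoneme function `g2p` for unknown words."""
    phonemes = []
    for word in text.split():
        key = word.strip(".,!?")
        phonemes.extend(lexicon.get(key, g2p(key)))
    return phonemes

# Toy fallback G2P: one pseudo-phoneme per letter, for demonstration.
naive_g2p = lambda w: list(w.upper())
```

这样,像"Einstein"这类发音易错的专有名词总能得到人工校对过的音素序列,而其余词汇走常规G2P路径。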

【10】 A baseline model for computationally inexpensive speech recognition for Kazakh using the Coqui STT framework 标题:基于Coqui STT框架的哈萨克语低成本语音识别基线模型

作者:Ilnar Salimzianov 机构:Taruen 备注:4 pages, 2 tables 链接:https://arxiv.org/abs/2107.10637 摘要:移动设备正在改变人们与计算机的交互方式,应用程序的语音接口也变得越来越重要。最近发布的自动语音识别系统非常精确,但通常需要强大的硬件(专门的图形处理单元)进行推理,这使得它们难以在普通商用设备上运行,尤其是在流式模式下。我们对哈萨克语ASR基线模型(Khassanov等人,2021年)的准确性印象深刻,但对其不使用GPU时的推理时间不满意,因此我们训练了一个新的基线声学模型(与上述论文使用同一数据集)和三个语言模型,以用于Coqui STT框架。结果看起来很有希望,但需要进一步的训练轮次和参数扫描,或者限制ASR系统必须支持的词汇量,才能达到生产级的精度。 摘要:Mobile devices are transforming the way people interact with computers, and speech interfaces to applications are ever more important. Automatic Speech Recognition systems recently published are very accurate, but often require powerful machinery (specialised Graphical Processing Units) for inference, which makes them impractical to run on commodity devices, especially in streaming mode. Impressed by the accuracy of, but dissatisfied with the inference times of the baseline Kazakh ASR model of (Khassanov et al., 2021) when not using a GPU, we trained a new baseline acoustic model (on the same dataset as the aforementioned paper) and three language models for use with the Coqui STT framework. Results look promising, but further epochs of training and parameter sweeping or, alternatively, limiting the vocabulary that the ASR system must support, is needed to reach a production-level accuracy.

【11】 Controlling the Perceived Sound Quality for Dialogue Enhancement with Deep Learning 标题:基于深度学习的对话增强感知音质控制

作者:Christian Uhle,Matteo Torcoli,Jouni Paulus 机构:⋆ Fraunhofer Institute for Integrated Circuits IIS, Am Wolfsmantel , Erlangen, Germany, † International Audio Laboratories Erlangen∗, Am Wolfsmantel , Erlangen, Germany 备注:None 链接:https://arxiv.org/abs/2107.10562 摘要:语音增强可衰减语音信号中的干扰声音,但可能会引入在感知上劣化输出信号的伪影。我们提出了一种方法,用于控制干扰背景信号的衰减量与音质损失之间的权衡。深度神经网络估计分离出的背景信号的衰减量,使得以伪影相关感知分数(Artifact-related Perceptual Score)量化的音质满足可调整的目标。主观评价表明,在各种输入信号上均获得了一致的音质。实验结果表明,该方法能够控制这一权衡,其精度足以满足实际的对话增强应用。 摘要:Speech enhancement attenuates interfering sounds in speech signals but may introduce artifacts that perceivably deteriorate the output signal. We propose a method for controlling the trade-off between the attenuation of the interfering background signal and the loss of sound quality. A deep neural network estimates the attenuation of the separated background signal such that the sound quality, quantified using the Artifact-related Perceptual Score, meets an adjustable target. Subjective evaluations indicate that consistent sound quality is obtained across various input signals. Our experiments show that the proposed method is able to control the trade-off with an accuracy that is adequate for real-world dialogue enhancement applications.
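为说明这种"衰减量由质量目标决定"的思路,下面给出一个假设性的最小草图:遍历候选背景增益,选出仍满足质量目标的最强衰减。注意论文中衰减量由DNN直接估计,这里的`quality_fn`只是伪影相关感知分数预测器的玩具替身,候选增益网格也是示例假设。

```python
import numpy as np

def remix_to_quality_target(speech, background, quality_fn, target):
    """Choose the strongest background attenuation whose predicted
    quality still meets `target`. Toy stand-in for the paper's DNN;
    `quality_fn` plays the role of an Artifact-related Perceptual
    Score predictor (an assumption, not the paper's model)."""
    for g in np.linspace(0.0, 1.0, 21):  # 0 = background fully removed
        if quality_fn(speech, g * background) >= target:
            return speech + g * background, g
    return speech + background, 1.0

# Toy quality model: stronger attenuation means more artifacts, so
# quality grows with the retained background gain.
speech = np.ones(8)
background = 2.0 * np.ones(8)
toy_quality = lambda s, b: np.abs(b).mean() / np.abs(background).mean()
mix, gain = remix_to_quality_target(speech, background, toy_quality, target=0.5)
```

把质量目标作为可调参数暴露给用户,正对应摘要中"可调整的目标"这一设计。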

【12】 Improving Polyphonic Sound Event Detection on Multichannel Recordings with the Sørensen-Dice Coefficient Loss and Transfer Learning 标题:利用Sørensen-Dice系数损失和迁移学习改进多通道录音的复音事件检测

作者:Karn N. Watcharasupat,Thi Ngoc Tho Nguyen,Ngoc Khanh Nguyen,Zhen Jian Lee,Douglas L. Jones,Woon Seng Gan 机构:School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore., Dept. of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign, IL, USA. 备注:Under review for the 6th Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE), 2021 链接:https://arxiv.org/abs/2107.10471 摘要:最近,Sørensen–Dice系数作为一种损失函数(也称为Dice损失)越来越受欢迎,因为它在负样本数显著超过正样本数的任务中具有鲁棒性,如语义分割、自然语言处理和声音事件检测。使用二元交叉熵损失对复调声音事件检测系统进行的传统训练往往导致次优的检测性能,因为训练常常被来自负样本的更新所淹没。在本文中,我们研究了Dice损失、模态内和模态间迁移学习、数据增强以及录音格式对多通道输入复调声音事件检测系统性能的影响。我们的分析表明,在不同的训练设置和录音格式下,以Dice损失训练的复调声音事件检测系统在F1分数和错误率方面始终优于以交叉熵损失训练的系统。通过使用迁移学习和适当组合不同的数据增强技术,我们进一步提高了性能。 摘要:The Sørensen–Dice coefficient has recently seen rising popularity as a loss function (also known as Dice loss) due to its robustness in tasks where the number of negative samples significantly exceeds that of positive samples, such as semantic segmentation, natural language processing, and sound event detection. Conventional training of polyphonic sound event detection systems with binary cross-entropy loss often results in suboptimal detection performance as the training is often overwhelmed by updates from negative samples. In this paper, we investigated the effect of the Dice loss, intra- and inter-modal transfer learning, data augmentation, and recording formats, on the performance of polyphonic sound event detection systems with multichannel inputs. Our analysis showed that polyphonic sound event detection systems trained with Dice loss consistently outperformed those trained with cross-entropy loss across different training settings and recording formats in terms of F1 score and error rate. We achieved further performance gains via the use of transfer learning and an appropriate combination of different data augmentation techniques.
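软Dice损失的标准形式可以用几行代码写出(平滑项`eps`的取法是常见约定,论文中的具体实现细节可能不同):

```python
import numpy as np

def dice_loss(pred, target, eps=1e-7):
    """Soft Sørensen–Dice loss for multi-label sound event detection.
    `pred` holds per-frame, per-class probabilities in [0, 1];
    `target` holds binary labels of the same shape."""
    intersection = (pred * target).sum()
    denom = pred.sum() + target.sum()
    return 1.0 - (2.0 * intersection + eps) / (denom + eps)

# Perfect prediction gives a loss near 0; a fully wrong prediction
# gives a loss near 1.
y = np.array([[1.0, 0.0], [0.0, 1.0]])
```

与逐元素求和的二元交叉熵不同,Dice损失只由正样本的重叠程度驱动,因此在负样本占绝大多数时不会被负样本的梯度淹没,这正是摘要所述其鲁棒性的来源。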

【13】 What Makes Sound Event Localization and Detection Difficult? Insights from Error Analysis 标题:是什么让声音事件定位和检测变得困难?从错误分析中得到的启示

作者:Thi Ngoc Tho Nguyen,Karn N. Watcharasupat,Zhen Jian Lee,Ngoc Khanh Nguyen,Douglas L. Jones,Woon Seng Gan 机构: School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore., Dept. of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign, IL, USA. 备注:Under review for the 6th Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE), 2021 链接:https://arxiv.org/abs/2107.10469 摘要:声事件定位与检测(SELD)是一个新兴的研究课题,旨在将声事件检测与波达方向估计的任务统一起来。因此,SELD继承了这两项任务共同的挑战,如噪声、混响、干扰、复调以及声源的非平稳性。此外,SELD常常面临一个额外的挑战,即为多个重叠的声音事件正确分配检测到的声音类别与到达方向之间的对应关系。以往的研究表明,混响环境中的未知干扰往往会导致SELD系统性能的严重下降。为了进一步了解SELD任务的挑战,我们对我们的两个SELD系统进行了详细的错误分析,这两个系统均在DCASE SELD挑战赛的团队类别中排名第二,一个在2020年,一个在2021年。实验结果表明,复调是SELD的主要挑战,因为很难检测出所有感兴趣的声音事件。此外,SELD系统对于训练集中占主导地位的复调场景往往产生较少的错误。 摘要:Sound event localization and detection (SELD) is an emerging research topic that aims to unify the tasks of sound event detection and direction-of-arrival estimation. As a result, SELD inherits the challenges of both tasks, such as noise, reverberation, interference, polyphony, and non-stationarity of sound sources. Furthermore, SELD often faces an additional challenge of assigning correct correspondences between the detected sound classes and directions of arrival to multiple overlapping sound events. Previous studies have shown that unknown interferences in reverberant environments often cause major degradation in the performance of SELD systems. To further understand the challenges of the SELD task, we performed a detailed error analysis on two of our SELD systems, which both ranked second in the team category of DCASE SELD Challenge, one in 2020 and one in 2021. Experimental results indicate polyphony as the main challenge in SELD, due to the difficulty in detecting all sound events of interest. In addition, the SELD systems tend to make fewer errors for the polyphonic scenario that is dominant in the training set.

3.eess.AS音频处理:

【1】 HARP-Net: Hyper-Autoencoded Reconstruction Propagation for Scalable Neural Audio Coding 标题:HARP-NET:适用于可伸缩神经音频编码的超自动编码重建传播

作者:Darius Petermann,Seungkwon Beack,Minje Kim 机构: Indiana University, Department of Intelligent Systems Engineering, Bloomington, IN , USA, Electronics and Telecommunications Research Institute, Daejeon , South Korea 备注:Accepted to the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA) 2021, Mohonk Mountain House, New Paltz, NY 链接:https://arxiv.org/abs/2107.10843 摘要:基于自动编码器的编解码器采用量化将其瓶颈层激活转换为位串,这一过程阻碍了编码器和解码器部分之间的信息流。为了避免这个问题,我们在相应的一对编码器-解码器层之间使用额外的跳过连接。假设在镜像自动编码器拓扑中,解码器层重构其相应编码器层的中间特征表示。因此,从相应的编码器层直接传播的任何附加信息都有助于重建。我们以附加自动编码器的形式实现这种跳过连接,每个自动编码器是一个小型编解码器,用于压缩成对编解码器层之间的大量数据传输。我们的经验验证,提出的超自动编码架构改善感知音频质量相比,一个普通的自动编码器基线。 摘要:An autoencoder-based codec employs quantization to turn its bottleneck layer activation into bitstrings, a process that hinders information flow between the encoder and decoder parts. To circumvent this issue, we employ additional skip connections between the corresponding pair of encoder-decoder layers. The assumption is that, in a mirrored autoencoder topology, a decoder layer reconstructs the intermediate feature representation of its corresponding encoder layer. Hence, any additional information directly propagated from the corresponding encoder layer helps the reconstruction. We implement this kind of skip connections in the form of additional autoencoders, each of which is a small codec that compresses the massive data transfer between the paired encoder-decoder layers. We empirically verify that the proposed hyper-autoencoded architecture improves perceptual audio quality compared to an ordinary autoencoder baseline.
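摘要所述的结构(镜像解码器层除接收上一层激活外,还接收经由各自小型跳跃自动编码器压缩后的对应编码器特征)可以用一个假设性的玩具草图表示;这里的层函数和跳跃编解码器都是示例替身,并非论文的网络:

```python
import numpy as np

def harp_decode(code, enc_feats, dec_layers, skip_codecs):
    """Toy sketch of the decoding path described above: each decoder
    layer adds back a copy of its paired encoder feature that has
    travelled through its own small skip autoencoder (`skip_codecs`).
    All layer/codec functions here are illustrative stand-ins."""
    h = code
    for layer, feat, codec in zip(dec_layers, reversed(enc_feats), skip_codecs):
        h = layer(h) + codec(feat)  # mirrored layer + compressed skip
    return h

# Tiny numeric example: one decoder layer doubling its input, one
# skip codec halving the skipped encoder feature.
code = np.zeros(3)
enc_feats = [np.ones(3)]
out = harp_decode(code, enc_feats, [lambda h: 2 * h], [lambda f: 0.5 * f])
```

关键点在于跳跃路径本身也是一个需要量化/压缩的小编解码器,否则编码器到解码器的直接信息传输会绕过瓶颈层的码率约束。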

【2】 Multi-modal Residual Perceptron Network for Audio-Video Emotion Recognition 标题:用于音视频情感识别的多模态残差感知器网络

作者:Xin Chang,Władysław Skarbek 链接:https://arxiv.org/abs/2107.10742 摘要:情感识别是人机交互(HCI)的一个重要研究领域。音视频情感识别(AVER)目前主要借助深度神经网络(DNN)建模工具来解决。在已发表的论文中,作者通常只展示多模态优于纯音频或纯视频模态的案例。然而,也存在单一模态更优的情况。在我们的研究中,我们假设对于模糊类别的情感事件,一种模态的较高噪声会放大第二种模态的较低噪声,这种噪声间接体现在建模神经网络的参数中。为了避免这种跨模态信息干扰,我们定义了一个多模态残差感知器网络(MRPN),它从多模态网络分支中学习,产生噪声更低的深度特征表示。对于提出的MRPN模型和新的流式数字电影时间增强方法,Ryerson情感语音和歌曲视听数据库(RAVDESS)的平均识别率提高到91.4%,众包情感多模态演员数据集(Crema-d)的平均识别率提高到83.15%。此外,MRPN概念显示了其在处理不限于光学和声学信号源的多模态分类器中的潜力。 摘要:Emotion recognition is an important research field for Human-Computer Interaction (HCI). Audio-Video Emotion Recognition (AVER) is now attacked with Deep Neural Network (DNN) modeling tools. In published papers, as a rule, the authors show only cases of the superiority of multi modalities over audio-only or video-only modalities. However, there are cases where the superiority of a single modality can be found. In our research, we hypothesize that for fuzzy categories of emotional events, the higher noise of one modality can amplify the lower noise of the second modality represented indirectly in the parameters of the modeling neural network. To avoid such cross-modal information interference we define a multi-modal Residual Perceptron Network (MRPN) which learns from multi-modal network branches creating deep feature representation with reduced noise. For the proposed MRPN model and the novel time augmentation for streamed digital movies, the state-of-the-art average recognition rate was improved to 91.4% for The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) dataset and to 83.15% for Crowd-sourced Emotional multi-modal Actors Dataset (Crema-d). Moreover, the MRPN concept shows its potential for multi-modal classifiers dealing with signal sources not only of optical and acoustical type.

【3】 Project Achoo: A Practical Model and Application for COVID-19 Detection from Recordings of Breath, Voice, and Cough 标题:Achoo项目:从呼吸、声音和咳嗽记录中检测冠状病毒的实用模型和应用

作者:Alexander Ponomarchuk,Ilya Burenko,Elian Malkin,Ivan Nazarov,Vladimir Kokh,Manvel Avetisian,Leonid Zhukov 机构:∗ Sber AI Lab 链接:https://arxiv.org/abs/2107.10716 摘要:COVID-19大流行引起了人们对感染检测和监测解决方案的极大兴趣和需求。在本文中,我们提出了一种机器学习方法,利用消费级设备上的录音对COVID-19进行快速分诊。该方法将信号处理方法与微调的深度学习网络相结合,提供了信号去噪、咳嗽检测和分类的手段。我们还开发并部署了一个移动应用程序,它结合症状核查器以及语音、呼吸和咳嗽信号来检测COVID-19感染。该应用程序在开源数据集和最终用户在beta测试期间收集的含噪数据上都表现出了稳健的性能。 摘要:The COVID-19 pandemic created a significant interest and demand for infection detection and monitoring solutions. In this paper we propose a machine learning method to quickly triage COVID-19 using recordings made on consumer devices. The approach combines signal processing methods with fine-tuned deep learning networks and provides methods for signal denoising, cough detection and classification. We have also developed and deployed a mobile application that uses symptoms checker together with voice, breath and cough signals to detect COVID-19 infection. The application showed robust performance on both open sourced datasets and on the noisy data collected during beta testing by the end users.

【4】 CarneliNet: Neural Mixture Model for Automatic Speech Recognition 标题:CarneliNet:用于自动语音识别的神经混合模型

作者:Aleksei Kalinov,Somshubra Majumdar,Jagadeesh Balam,Boris Ginsburg 机构:NVIDIA, USA, Skolkovo Institute of Science and Technology, Russia 备注:Submitted to ASRU 2021 链接:https://arxiv.org/abs/2107.10708 摘要:端到端的自动语音识别系统通过使用越来越深的模型已经达到了很高的识别精度。然而,深度的增加带来了更大的感受野,这可能会对流式场景中的模型性能产生负面影响。我们提出了另一种方法,称之为神经混合模型。其基本思想是引入浅层网络的并行混合,而不是非常深的网络。为了验证这个想法,我们设计了CarneliNet——一个由三个巨型模块组成的基于CTC的神经网络。每个巨型模块由多个基于一维深度可分离卷积的并行浅层子网络组成。我们在LibriSpeech、MLS和AISHELL-2数据集上对模型进行了评估,获得了接近基于CTC的模型最新水平的结果。最后,我们证明了可以动态地重新配置并行子网络的数量以适应计算需求,而无需重新训练。 摘要:End-to-end automatic speech recognition systems have achieved great accuracy by using deeper and deeper models. However, the increased depth comes with a larger receptive field that can negatively impact model performance in streaming scenarios. We propose an alternative approach that we call Neural Mixture Model. The basic idea is to introduce a parallel mixture of shallow networks instead of a very deep network. To validate this idea we design CarneliNet -- a CTC-based neural network composed of three mega-blocks. Each mega-block consists of multiple parallel shallow sub-networks based on 1D depthwise-separable convolutions. We evaluate the model on LibriSpeech, MLS and AISHELL-2 datasets and achieved close to state-of-the-art results for CTC-based models. Finally, we demonstrate that one can dynamically reconfigure the number of parallel sub-networks to accommodate the computational requirements without retraining.
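"并行浅层子网络的混合 + 推理时动态裁剪分支数"这一思路可以用一个假设性的玩具草图说明(真实的CarneliNet分支是深度可分离卷积堆叠,这里用缩放恒等映射代替,仅作示意):

```python
import numpy as np

def mixture_block(x, branches, active=None):
    """Average the outputs of parallel shallow branches. Passing a
    subset via `active` mimics the dynamic reconfiguration the
    abstract mentions: fewer branches at inference, no retraining.
    (Real CarneliNet branches are depthwise-separable conv stacks;
    the scaled identities below are illustrative stand-ins.)"""
    if active is None:
        active = range(len(branches))
    return np.mean([branches[i](x) for i in active], axis=0)

# Toy branches: scaled identity maps standing in for sub-networks.
branches = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0)]
x = np.ones(4)
full = mixture_block(x, branches)            # all three branches
light = mixture_block(x, branches, [0, 1])   # drop the third branch
```

由于各分支都很浅,整体感受野不随分支数增长,这正是摘要中相对"更深的模型"在流式场景下的优势所在。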

【5】 Multitask-Based Joint Learning Approach To Robust ASR For Radio Communication Speech 标题:基于多任务的无线通信语音鲁棒ASR联合学习方法

作者:Duo Ma,Nana Hou,Van Tung Pham,Haihua Xu,Eng Siong Chng 机构: Human Language Technology (HLT) Laboratory, Department of Electrical and Computer Engineering, National University of Singapore, Singapore, School of Computer Science and Engineering, Nanyang Technological University, Singapore 备注:7 pages, 3 figures, Submitted to APSIPA2021 链接:https://arxiv.org/abs/2107.10701 摘要:为了在无线电通信条件下实现鲁棒的端到端自动语音识别(E2E ASR),本文提出了一种基于多任务的方法,将语音增强(SE)模块作为前端、E2E ASR模型作为后端进行联合训练。该方法的优点之一是整个系统可以从头开始训练。与以前的工作不同,这里的两个组件都不需要分别执行预训练和微调过程。通过分析,我们发现该方法的成功之处在于以下几个方面。首先,多任务学习是必不可少的,即SE网络不仅要学习产生更清晰易懂的语音,还要产生有利于识别的语音。其次,我们还发现从含噪语音中保留的语音相位对于提高ASR性能至关重要。第三,我们提出了一种双通道数据增强训练方法以获得进一步的改进,具体而言,我们将干净语音和增强语音结合起来训练整个系统。我们在RATS英语数据集上对所提出的方法进行了评估,联合训练方法带来了4.6%的相对WER降低,而所提出的数据增强方法最多带来11.2%的相对WER降低。 摘要:To realize robust end-to-end Automatic Speech Recognition (E2E ASR) under radio communication conditions, we propose a multitask-based method to jointly train a Speech Enhancement (SE) module as the front-end and an E2E ASR model as the back-end in this paper. One of the advantages of the proposed method is that the entire system can be trained from scratch. Different from prior works, neither component needs to perform pre-training and fine-tuning separately. Through analysis, we found that the success of the proposed method lies in the following aspects. Firstly, multitask learning is essential; that is, the SE network is not only learning to produce more intelligible speech, but also to generate speech that is beneficial to recognition. Secondly, we also found that the speech phase preserved from noisy speech is critical for improving ASR performance. Thirdly, we propose a dual-channel data augmentation training method to obtain further improvement. Specifically, we combine the clean and enhanced speech to train the whole system. 
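摘要中的两个要点(保留含噪语音的相位、多任务联合目标)可以用如下假设性草图表达;掩码形式、损失权重`alpha`均为示例假设,并非论文给出的数值:

```python
import numpy as np

def enhance_keep_noisy_phase(noisy_stft, mask):
    """Apply a magnitude mask but reuse the noisy phase, matching the
    abstract's finding that preserving the phase from noisy speech
    helps ASR. `mask` is whatever the SE network predicts (assumed)."""
    return np.abs(noisy_stft) * mask * np.exp(1j * np.angle(noisy_stft))

def joint_loss(se_loss, asr_loss, alpha=0.5):
    """Multitask objective: the SE front-end is optimized jointly for
    enhancement and recognition, so recognition gradients shape the
    enhancement. `alpha` is an assumed weight, not from the paper."""
    return alpha * se_loss + (1.0 - alpha) * asr_loss

bin_ = np.array([1.0 + 1.0j])   # one STFT bin at phase pi/4
out = enhance_keep_noisy_phase(bin_, np.array([0.5]))
```

这样整个 SE+ASR 系统可对`joint_loss`做端到端反向传播,从头训练而无需分阶段的预训练与微调。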
We evaluate the proposed method on the RATS English data set, achieving a relative WER reduction of 4.6% with the joint training method, and up to a relative WER reduction of 11.2% with the proposed data augmentation method.

【6】 CNN Classifier for Just-in-Time Woodpeckers Detection and Deterrent 标题:用于实时检测和威慑啄木鸟的CNN分类器

作者:Alexander Greysukh 备注:6 pages, 4 figures 链接:https://arxiv.org/abs/2107.10676 摘要:啄木鸟会对房屋造成严重破坏,尤其是在郊区。有许多预防和驱赶的方法,包括被动诱饵,但这些可能只能提供暂时的缓解。因此,更有效的做法可能是部署由啄木鸟标志性鼓声检测触发的威慑手段,如运动、光、声或超声波。为了检测典型的25 Hz鼓声频率,需要10毫秒以下的采样周期和频繁的FFT,计算成本相当高。硬件频谱分析仪可以通过在频率分辨率与时间分辨率之间折中来避免这些开销。训练后的模型被转换为TF Lite Micro并移植到MCU上,能够识别各种预先录制的啄木鸟鼓声。该计划是将原型机与威慑装置集成起来,使之成为一个完全自主的解决方案。 摘要:Woodpeckers can cause significant damage to homes, especially in suburban areas. There are a number of preventing and repelling methods including passive decoys, though these may only provide temporary relief. Subsequently, it may be more efficient to implement a woodpecker deterrent, such as motion, light, sound, or ultrasound, that would be triggered by detection of woodpecker signature drumming. To detect the typical 25 Hz drumming frequency, sampling periods under 10 milliseconds with frequent FFTs are required, at considerable computational cost. An in-hardware spectrum analyzer may avoid these costs by trading off frequency for time resolution. The trained model, converted to TF Lite Micro and ported to an MCU, identifies a variety of prerecorded woodpecker drumming. The plan is to integrate the prototype with a deterrent device, making it a completely autonomous solution.

【7】 Digital Einstein Experience: Fast Text-to-Speech for Conversational AI 标题:数字爱因斯坦体验:对话式人工智能的快速文本到语音转换

作者:Joanna Rownicka,Kilian Sprenkamp,Antonio Tripiana,Volodymyr Gromoglasov,Timo P Kunz 机构:Aflorithmic Labs Ltd. 备注:accepted at Interspeech 2021 链接:https://arxiv.org/abs/2107.10658 摘要:我们描述了为会话人工智能用例创建和交付自定义语音的方法。更具体地说,我们为数字爱因斯坦角色提供了一个声音,以实现数字对话体验中的人机交互。为了生成与上下文契合的语音,我们首先设计一个语音角色,然后制作与所需语音属性相对应的录音。接着我们对该声音进行建模。我们的解决方案利用FastSpeech 2从音素中预测对数标度的mel谱图,并利用Parallel WaveGAN生成波形。该系统接受字符输入,并在输出端给出语音波形。我们为选定的单词使用自定义词典,以确保其正确发音。我们提出的云架构能够实现快速的语音交付,使我们能够与数字版本的Albert Einstein进行实时对话。 摘要:We describe our approach to create and deliver a custom voice for a conversational AI use-case. More specifically, we provide a voice for a Digital Einstein character, to enable human-computer interaction within the digital conversation experience. To create the voice which fits the context well, we first design a voice character and we produce the recordings which correspond to the desired speech attributes. We then model the voice. Our solution utilizes Fastspeech 2 for log-scaled mel-spectrogram prediction from phonemes and Parallel WaveGAN to generate the waveforms. The system supports a character input and gives a speech waveform at the output. We use a custom dictionary for selected words to ensure their proper pronunciation. Our proposed cloud architecture enables fast voice delivery, making it possible to talk to the digital version of Albert Einstein in real-time.

【8】 A baseline model for computationally inexpensive speech recognition for Kazakh using the Coqui STT framework 标题:基于Coqui STT框架的哈萨克语低成本语音识别基线模型

作者:Ilnar Salimzianov 机构:Taruen 备注:4 pages, 2 tables 链接:https://arxiv.org/abs/2107.10637 摘要:移动设备正在改变人们与计算机的交互方式,应用程序的语音接口也变得越来越重要。最近发布的自动语音识别系统非常精确,但通常需要强大的硬件(专门的图形处理单元)进行推理,这使得它们难以在普通商用设备上运行,尤其是在流式模式下。我们对哈萨克语ASR基线模型(Khassanov等人,2021年)的准确性印象深刻,但对其不使用GPU时的推理时间不满意,因此我们训练了一个新的基线声学模型(与上述论文使用同一数据集)和三个语言模型,以用于Coqui STT框架。结果看起来很有希望,但需要进一步的训练轮次和参数扫描,或者限制ASR系统必须支持的词汇量,才能达到生产级的精度。 摘要:Mobile devices are transforming the way people interact with computers, and speech interfaces to applications are ever more important. Automatic Speech Recognition systems recently published are very accurate, but often require powerful machinery (specialised Graphical Processing Units) for inference, which makes them impractical to run on commodity devices, especially in streaming mode. Impressed by the accuracy of, but dissatisfied with the inference times of the baseline Kazakh ASR model of (Khassanov et al., 2021) when not using a GPU, we trained a new baseline acoustic model (on the same dataset as the aforementioned paper) and three language models for use with the Coqui STT framework. Results look promising, but further epochs of training and parameter sweeping or, alternatively, limiting the vocabulary that the ASR system must support, is needed to reach a production-level accuracy.

【9】 Controlling the Perceived Sound Quality for Dialogue Enhancement with Deep Learning 标题:基于深度学习的对话增强感知音质控制

作者:Christian Uhle,Matteo Torcoli,Jouni Paulus 机构:⋆ Fraunhofer Institute for Integrated Circuits IIS, Am Wolfsmantel , Erlangen, Germany, † International Audio Laboratories Erlangen∗, Am Wolfsmantel , Erlangen, Germany 备注:None 链接:https://arxiv.org/abs/2107.10562 摘要:语音增强可衰减语音信号中的干扰声音,但可能会引入在感知上劣化输出信号的伪影。我们提出了一种方法,用于控制干扰背景信号的衰减量与音质损失之间的权衡。深度神经网络估计分离出的背景信号的衰减量,使得以伪影相关感知分数(Artifact-related Perceptual Score)量化的音质满足可调整的目标。主观评价表明,在各种输入信号上均获得了一致的音质。实验结果表明,该方法能够控制这一权衡,其精度足以满足实际的对话增强应用。 摘要:Speech enhancement attenuates interfering sounds in speech signals but may introduce artifacts that perceivably deteriorate the output signal. We propose a method for controlling the trade-off between the attenuation of the interfering background signal and the loss of sound quality. A deep neural network estimates the attenuation of the separated background signal such that the sound quality, quantified using the Artifact-related Perceptual Score, meets an adjustable target. Subjective evaluations indicate that consistent sound quality is obtained across various input signals. Our experiments show that the proposed method is able to control the trade-off with an accuracy that is adequate for real-world dialogue enhancement applications.

【10】 Improving Polyphonic Sound Event Detection on Multichannel Recordings with the Sørensen-Dice Coefficient Loss and Transfer Learning 标题:利用Sørensen-Dice系数损失和迁移学习改进多通道录音的复音事件检测

作者:Karn N. Watcharasupat,Thi Ngoc Tho Nguyen,Ngoc Khanh Nguyen,Zhen Jian Lee,Douglas L. Jones,Woon Seng Gan 机构:School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore., Dept. of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign, IL, USA. 备注:Under review for the 6th Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE), 2021 链接:https://arxiv.org/abs/2107.10471 摘要:最近,Sørensen–Dice系数作为一种损失函数(也称为Dice损失)越来越受欢迎,因为它在负样本数显著超过正样本数的任务中具有鲁棒性,如语义分割、自然语言处理和声音事件检测。使用二元交叉熵损失对复调声音事件检测系统进行的传统训练往往导致次优的检测性能,因为训练常常被来自负样本的更新所淹没。在本文中,我们研究了Dice损失、模态内和模态间迁移学习、数据增强以及录音格式对多通道输入复调声音事件检测系统性能的影响。我们的分析表明,在不同的训练设置和录音格式下,以Dice损失训练的复调声音事件检测系统在F1分数和错误率方面始终优于以交叉熵损失训练的系统。通过使用迁移学习和适当组合不同的数据增强技术,我们进一步提高了性能。 摘要:The Sørensen–Dice coefficient has recently seen rising popularity as a loss function (also known as Dice loss) due to its robustness in tasks where the number of negative samples significantly exceeds that of positive samples, such as semantic segmentation, natural language processing, and sound event detection. Conventional training of polyphonic sound event detection systems with binary cross-entropy loss often results in suboptimal detection performance as the training is often overwhelmed by updates from negative samples. In this paper, we investigated the effect of the Dice loss, intra- and inter-modal transfer learning, data augmentation, and recording formats, on the performance of polyphonic sound event detection systems with multichannel inputs. Our analysis showed that polyphonic sound event detection systems trained with Dice loss consistently outperformed those trained with cross-entropy loss across different training settings and recording formats in terms of F1 score and error rate. We achieved further performance gains via the use of transfer learning and an appropriate combination of different data augmentation techniques.

【11】 What Makes Sound Event Localization and Detection Difficult? Insights from Error Analysis 标题:是什么让声音事件定位和检测变得困难?从错误分析中得到的启示

作者:Thi Ngoc Tho Nguyen,Karn N. Watcharasupat,Zhen Jian Lee,Ngoc Khanh Nguyen,Douglas L. Jones,Woon Seng Gan 机构: School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore., Dept. of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign, IL, USA. 备注:Under review for the 6th Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE), 2021 链接:https://arxiv.org/abs/2107.10469 摘要:声事件定位与检测(SELD)是一个新兴的研究课题,旨在将声事件检测与波达方向估计的任务统一起来。因此,SELD继承了这两项任务共同的挑战,如噪声、混响、干扰、复调以及声源的非平稳性。此外,SELD常常面临一个额外的挑战,即为多个重叠的声音事件正确分配检测到的声音类别与到达方向之间的对应关系。以往的研究表明,混响环境中的未知干扰往往会导致SELD系统性能的严重下降。为了进一步了解SELD任务的挑战,我们对我们的两个SELD系统进行了详细的错误分析,这两个系统均在DCASE SELD挑战赛的团队类别中排名第二,一个在2020年,一个在2021年。实验结果表明,复调是SELD的主要挑战,因为很难检测出所有感兴趣的声音事件。此外,SELD系统对于训练集中占主导地位的复调场景往往产生较少的错误。 摘要:Sound event localization and detection (SELD) is an emerging research topic that aims to unify the tasks of sound event detection and direction-of-arrival estimation. As a result, SELD inherits the challenges of both tasks, such as noise, reverberation, interference, polyphony, and non-stationarity of sound sources. Furthermore, SELD often faces an additional challenge of assigning correct correspondences between the detected sound classes and directions of arrival to multiple overlapping sound events. Previous studies have shown that unknown interferences in reverberant environments often cause major degradation in the performance of SELD systems. To further understand the challenges of the SELD task, we performed a detailed error analysis on two of our SELD systems, which both ranked second in the team category of DCASE SELD Challenge, one in 2020 and one in 2021. Experimental results indicate polyphony as the main challenge in SELD, due to the difficulty in detecting all sound events of interest. In addition, the SELD systems tend to make fewer errors for the polyphonic scenario that is dominant in the training set.

【12】 StarGANv2-VC: A Diverse, Unsupervised, Non-parallel Framework for Natural-Sounding Voice Conversion 标题:StarGANv2-VC:一种多样化、无监督、非并行的自然发音转换框架

作者:Yinghao Aaron Li,Ali Zare,Nima Mesgarani 机构:Department of Electrical Engineering, Columbia University, USA 备注:To be published in INTERSPEECH 2021 Proceedings 链接:https://arxiv.org/abs/2107.10394 摘要:我们提出了一种无监督的非并行多对多语音转换(VC)方法,该方法使用一种称为StarGAN v2的生成对抗网络(GAN)。通过结合使用对抗性源分类器损失和感知损失,我们的模型明显优于以前的VC模型。虽然我们的模型只用20个说英语的人训练,但它可以推广到各种语音转换任务,如任意对多、跨语言和歌唱转换。通过使用一个风格编码器,我们的框架还可以将普通朗读语音转换成风格化语音,如情感语音和假声语音。在一个非并行多对多语音转换任务上进行的主客观评价实验表明,我们的模型能够产生自然的语音,接近最先进的基于文本到语音(TTS)的语音转换方法的音质,而不需要文本标签。此外,我们的模型是完全卷积的,配合Parallel WaveGAN等快于实时的声码器,可以执行实时语音转换。 摘要:We present an unsupervised non-parallel many-to-many voice conversion (VC) method using a generative adversarial network (GAN) called StarGAN v2. Using a combination of adversarial source classifier loss and perceptual loss, our model significantly outperforms previous VC models. Although our model is trained only with 20 English speakers, it generalizes to a variety of voice conversion tasks, such as any-to-many, cross-lingual, and singing conversion. Using a style encoder, our framework can also convert plain reading speech into stylistic speech, such as emotional and falsetto speech. Subjective and objective evaluation experiments on a non-parallel many-to-many voice conversion task revealed that our model produces natural sounding voices, close to the sound quality of state-of-the-art text-to-speech (TTS) based voice conversion methods without the need for text labels. Moreover, our model is completely convolutional and with a faster-than-real-time vocoder such as Parallel WaveGAN can perform real-time voice conversion.

【13】 JS Fake Chorales: a Synthetic Dataset of Polyphonic Music with Human Annotation 标题:JS假合唱:一个人工注释的复调音乐合成数据集

作者:Omar Peracha 链接:https://arxiv.org/abs/2107.10388 摘要:与其他领域(如语言建模或图像分类)相比,用于基于学习的复调符号音乐建模的高质量数据集在规模上仍不那么容易获取。尤其是,包含能揭示人类对给定音乐样本反应的信息的数据集非常罕见。规模问题仍然是该领域取得突破的一个普遍障碍,而缺乏听众评价与生成式建模问题空间尤其相关,因为在该问题空间中,与质量成功高度相关的明确客观指标仍然难以捉摸。我们提出了JS Fake Chorales,一个由新的基于学习的算法生成的500个乐曲的数据集,以MIDI形式提供。我们从算法中获取连续的输出并避免人为挑选(cherry-picking),以验证进一步按需扩展该数据集的潜力。我们进行了一项人工评估在线实验,其设计尽可能对听众公平,结果发现,受访者在区分JS Fake Chorales与巴赫创作的真实合唱曲时,平均只比随机猜测好7%。此外,我们将从实验中收集的匿名数据与MIDI样本一起公开,例如受访者的音乐经验以及他们为每个样本提交回复所花的时间。最后,我们进行了消融研究,以证明在复调音乐建模研究中使用合成乐曲的有效性,并发现仅通过用JS Fake Chorales扩充训练集,就可以用已知的算法改进经典JSB Chorales数据集上最先进的验证集损失。 摘要:High quality datasets for learning-based modelling of polyphonic symbolic music remain less readily-accessible at scale than in other domains, such as language modelling or image classification. In particular, datasets which contain information revealing insights about human responses to the given music samples are rare. The issue of scale persists as a general hindrance towards breakthroughs in the field, while the lack of listener evaluation is especially relevant to the generative modelling problem-space, where clear objective metrics correlating strongly with qualitative success remain elusive. We propose the JS Fake Chorales, a dataset of 500 pieces generated by a new learning-based algorithm, provided in MIDI form. We take consecutive outputs from the algorithm and avoid cherry-picking in order to validate the potential to further scale this dataset on-demand. We conduct an online experiment for human evaluation, designed to be as fair to the listener as possible, and find that respondents were on average only 7% better than random guessing at distinguishing JS Fake Chorales from real chorales composed by JS Bach. Furthermore, we make anonymised data collected from experiments available along with the MIDI samples, such as the respondents' musical experience and how long they took to submit their response for each sample. 
Finally, we conduct ablation studies to demonstrate the effectiveness of using the synthetic pieces for research in polyphonic music modelling, and find that we can improve on state-of-the-art validation set loss for the canonical JSB Chorales dataset, using a known algorithm, by simply augmenting the training set with the JS Fake Chorales.
