Finance / Speech / Audio Processing arXiv Digest [7.14]

2021-07-27 10:53:51

Visit www.arxivdaily.com for the full digest with abstracts, covering CS, physics, mathematics, economics, statistics, finance, biology, and electrical engineering, along with search, bookmarking, and posting features.

q-fin (Quantitative Finance): 8 papers

cs.SD (Sound): 12 papers

eess.AS (Audio and Speech Processing): 12 papers

1. q-fin (Quantitative Finance):

【1】 Sissy That Walk: Transportation to Work by Sexual Orientation

Authors: Sonia Oreffice, Dario Sansone Link: https://arxiv.org/abs/2107.06210 Abstract: We analyze differences in mode of transportation to work by sexual orientation, using the American Community Survey 2008-2019. Individuals in same-sex couples are significantly less likely to drive to work than men and women in different-sex couples. This gap is particularly stark among men: on average, an almost 12-percentage-point (or 13%) lower likelihood of driving to work for men in same-sex couples. Individuals in same-sex couples are also more likely to use public transport, walk, or bike to work: on average, men and women are 7 and 3 percentage points more likely, respectively, to take public transportation to work than those in different-sex couples. These differences persist after controlling for demographic characteristics, partner's characteristics, location, fertility, and marital status. Additional evidence from the General Social Survey 2008-2018 suggests that these disparities by sexual orientation may be due to lesbian, gay, and bisexual individuals caring more for the environment than straight individuals.

【2】 Geometric insights into robust portfolio construction with gearing

Authors: Lara Dalmeyer, Tim Gebbie Affiliation: Department of Statistical Sciences, University of Cape Town, Rondebosch, South Africa Comments: 17 pages, 5 figures, 3 tables Link: https://arxiv.org/abs/2107.06194 Abstract: We investigate and extend the results of Golts and Jones (2009) that an alpha-weight angle resulting from unconstrained quadratic portfolio optimisations has an upper bound dependent on the condition number of the covariance matrix. This implies that better-conditioned covariance matrices produce weights from unconstrained mean-variance optimisations that are better aligned with each asset's expected return. We provide further clarity on the mathematical insights that relate the inequality between the alpha-weight angle and the condition number, and extend the result to include portfolio optimisations with gearing constraints. We provide an extended family of robust optimisations that include the gearing constraints, and discuss their interpretation.
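The geometric claim in this abstract can be sketched numerically. The snippet below is an illustrative sketch, not the authors' code (the helper name `alpha_weight_angle` is our own): it computes the angle between an expected-return vector alpha and the unconstrained mean-variance weights w = cov^{-1} alpha, for a well-conditioned and an ill-conditioned covariance matrix.

```python
import numpy as np

rng = np.random.default_rng(0)

def alpha_weight_angle(cov, alpha):
    """Angle (degrees) between the expected returns alpha and the
    unconstrained mean-variance weights w = cov^{-1} alpha."""
    w = np.linalg.solve(cov, alpha)  # optimal weights, up to scale
    cos = alpha @ w / (np.linalg.norm(alpha) * np.linalg.norm(w))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

n = 10
alpha = rng.normal(size=n)

# Well-conditioned covariance: the identity (condition number 1)
well = np.eye(n)

# Ill-conditioned covariance: eigenvalues spread over four orders of magnitude
Q, _ = np.linalg.qr(rng.normal(size=(n, n)))
ill = Q @ np.diag(np.logspace(-3, 1, n)) @ Q.T

for name, cov in [("well", well), ("ill", ill)]:
    print(name, np.linalg.cond(cov), alpha_weight_angle(cov, alpha))
```

With the identity covariance the weights are exactly aligned with the expected returns (angle zero), while spreading the eigenvalues typically opens up the angle, in line with the condition-number bound the paper studies.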

【3】 The climate in climate economics

Authors: Doris Folini, Felix Kübler, Aleksandra Malova, Simon Scheidegger Link: https://arxiv.org/abs/2107.06162 Abstract: We develop a generic calibration strategy for climate models used in economics. The key idea is to choose the free model parameters to match the output of large-scale Earth System Models, which are run on pre-defined future emissions scenarios and collected in the Coupled Model Intercomparison Project (CMIP5). We propose to use four test cases that are considered pivotal in the climate science literature. Two of these tests are highly idealized to allow for the separate examination of the carbon cycle and the temperature response. Another two tests incorporate gradual changes in CO2 emissions, exogenous forcing, and the temperature response. We re-calibrate the free parameters of the climate part of the seminal DICE-2016 model for three different CMIP5 model responses: the multi-model mean as well as two other CMIP5 models that exhibit extreme equilibrium climate sensitivities. As an additional novelty, our calibrations of DICE-2016 explicitly allow for an arbitrary time step in the model. We show that i) both the temperature equations and the carbon cycle in DICE-2016 are miscalibrated and that ii) by re-calibrating its coefficients, we can match all three CMIP5 targets.
We apply the economic model from DICE-2016 in combination with the newly calibrated climate model to compute the social cost of carbon and the optimal warming. We find that in our updated model, the social cost of carbon is very similar to that in DICE-2016; however, the optimal long-run temperature lies almost one degree below that obtained by DICE-2016. This difference in climate behavior is reflected in the over-sensitivity of the social cost of carbon to the discount rate. Under the optimal mitigation scenario, the temperature predictions of DICE-2016 (in contrast to our proposed calibration) fall outside of the CMIP5 scenarios, suggesting that one might want to be skeptical about policy predictions derived from DICE-2016.

【4】 Economic development and the structure of cross-technology interactions

Authors: Anton Bondarev, Frank C. Krysiak Link: https://arxiv.org/abs/2107.06137 Abstract: Most explanations of economic growth are based on knowledge spillovers, where the development of some technologies facilitates the enhancement of others. Empirical studies show that these spillovers can have a heterogeneous and rather complex structure. But, so far, little attention has been paid to the consequences of different structures of such cross-technology interactions: Is economic development more easily fostered by homogeneous or heterogeneous interactions, by uni- or bidirectional spillovers? Using a detailed description of an R&D sector with cross-technology interactions embedded in a simple growth model, we analyse how the structure of spillovers influences growth prospects and growth patterns. We show that some types of interactions (e.g., one-way interactions) cannot induce exponential growth, whereas other structures can. Furthermore, depending on the structure of interactions, all or only some technologies will contribute to growth in the long run. Finally, some spillover structures can lead to complex growth patterns, such as technology transitions, where, over time, different technology clusters are the main engine of growth.

【5】 Micro-level dynamics in hidden action situations with limited information

Authors: Stephan Leitner, Friederike Wall Comments: 57 pages, 6 figures, 4 tables Link: https://arxiv.org/abs/2107.06002 Abstract: The hidden-action model provides an optimal sharing rule for situations in which a principal assigns a task to an agent who makes an effort to carry out the task assigned to him. However, the principal can only observe the task outcome but not the agent's actual action. The hidden-action model builds on somewhat idealized assumptions about the principal's and the agent's capabilities related to information access. We propose an agent-based model that relaxes some of these assumptions. Our analysis lays particular focus on the micro-level dynamics triggered by limited information access. For the principal's sphere, we identify the so-called Sisyphus effect, which explains why the optimal sharing rule can generally not be achieved if information is limited, and we identify factors that moderate this effect. In addition, we analyze the behavioral dynamics in the agent's sphere. We show that the agent might exert even more effort than is optimal under unlimited information, which we refer to as excess effort. Interestingly, the principal can control the probability of excess effort via the incentive mechanism. However, how much excess effort the agent finally exerts is out of the principal's direct control.

【6】 Multiplicative Error Models: 20 years on

Authors: Fabrizio Cipollini, Giampiero M. Gallo Comments: 29 pages Link: https://arxiv.org/abs/2107.05923 Abstract: Several phenomena representing market activity are available: volumes, number of trades, durations between trades or quotes, volatility - however measured - all share the feature of being represented as positive-valued time series. When modeled, the persistence in their behavior and their reaction to new information suggest an autoregressive-type framework. The Multiplicative Error Model (MEM) is born of an extension of the popular GARCH approach for modeling and forecasting the conditional volatility of asset returns. It is obtained by multiplicatively combining the conditional expectation of a process (deterministically dependent upon the information set at a previous time period) with a random disturbance representing unpredictable news: MEMs have proved to parsimoniously achieve their task of producing good-performing forecasts. In this paper we discuss various aspects of model specification and inference, both for the univariate and the multivariate case. The applications are illustrative examples of how the presence of a slow-moving low-frequency component can improve the properties of the estimated models.
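The multiplicative structure described in the abstract can be sketched in a few lines: the observed positive series is the product of a conditional mean following a GARCH-like recursion and a unit-mean positive innovation. This is a minimal illustrative simulation, not the authors' specification; the parameter values and the gamma innovation are arbitrary choices made for the sketch.

```python
import numpy as np

def simulate_mem(T, omega=0.1, a=0.2, b=0.7, seed=0):
    """Simulate a MEM(1,1): x_t = mu_t * eps_t, with
    mu_t = omega + a * x_{t-1} + b * mu_{t-1} and E[eps_t] = 1."""
    rng = np.random.default_rng(seed)
    mu = np.empty(T)
    x = np.empty(T)
    mu[0] = omega / (1 - a - b)  # start at the unconditional mean
    x[0] = mu[0]
    for t in range(1, T):
        mu[t] = omega + a * x[t - 1] + b * mu[t - 1]
        # gamma(k, 1/k) has mean 1, so E[x_t | past] = mu_t
        x[t] = mu[t] * rng.gamma(4.0, 1.0 / 4.0)
    return x, mu

x, mu = simulate_mem(5000)
print(round(x.mean(), 3))  # should be near omega / (1 - a - b) = 1.0
```

The simulated series is positive by construction and persistent (a + b = 0.9), mimicking the volume and duration series the paper has in mind.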

【7】 Dynamics of the market states in the space of correlation matrices with applications to financial markets

Authors: Hirdesh K. Pharasi, Suchetana Sadhukhan, Parisa Majari, Anirban Chakraborti, Thomas H. Seligman Affiliations: Fakultät für Physik, Universität Duisburg-Essen, Duisburg, Germany; Department of Physics, School of Advanced Sciences, VIT Bhopal, Kothri Kalan, Sehore, India Comments: 16 pages, 9 figures, to appear in the Proceedings Volume, Eds. Anirban Chakraborti, Emmanuel Haven, Sudip Patra and Naresh Singh, "Quantum Decision Theory and Complexity Modelling in Economics and Public Policy" (Springer-Cham, New Economic Windows) Link: https://arxiv.org/abs/2107.05663 Abstract: The concept of states of financial markets based on correlations has gained increasing attention during the last 10 years. We propose to retrace some important steps up to 2018, and then give a more detailed view of recent developments that attempt to make the use of this concept more practical. Finally, we try to give a glimpse of the future, proposing the analysis of trajectories in correlation matrix space, directly or in terms of symbolic dynamics, as well as attempts to analyze the clusters that make up the states in a random matrix context.

【8】 A Lucas Critique Compliant SVAR Model with Observation-driven Time-varying Parameters

Authors: Giacomo Bormetti, Fulvio Corsi Comments: 48 pages, 10 figures, 1 table Link: https://arxiv.org/abs/2107.05263 Abstract: We propose an observation-driven time-varying SVAR model where, in agreement with the Lucas Critique, structural shocks drive both the evolution of the macro variables and the dynamics of the VAR parameters. Contrary to existing approaches where parameters follow a stochastic process with random and exogenous shocks, our observation-driven specification allows the evolution of the parameters to be driven by realized past structural shocks, thus opening the possibility to gauge the impact of observed shocks and hypothetical policy interventions on the future evolution of the economic system.

2. cs.SD (Sound):

【1】 Dance2Music: Automatic Dance-driven Music Generation

Authors: Gunjan Aggarwal, Devi Parikh Affiliations: Adobe; Facebook AI Research & Georgia Tech Link: https://arxiv.org/abs/2107.06252

【2】 Timbre Classification of Musical Instruments with a Deep Learning Multi-Head Attention-Based Model

Authors: Carlos Hernandez-Olivan, Jose R. Beltran Affiliation: Universidad de Zaragoza Link: https://arxiv.org/abs/2107.06231 Abstract: The aim of this work is to define a model based on deep learning that is able to identify different instrument timbres with as few parameters as possible. For this purpose, we have worked with classical orchestral instruments played with different dynamics, which are part of a few instrument families and which play notes in the same pitch range. It has been possible to assess the ability to classify instruments by timbre even if the instruments are playing the same note with the same intensity. The network employed uses a multi-head attention mechanism, with 8 heads and a dense network at the output, taking as input the log-mel magnitude spectrograms of the sound samples. This network allows the identification of 20 instrument classes of the classical orchestra, achieving an overall F1 value of 0.62. An analysis of the weights of the attention layer has been performed and the confusion matrix of the model is presented, allowing us to assess the ability of the proposed architecture to distinguish timbre and to establish the aspects on which future work should focus.

【3】 The IWSLT 2021 BUT Speech Translation Systems

Authors: Hari Krishna Vydana, Martin Karafiát, Lukáš Burget, "Honza" Černocký Affiliation: Brno University of Technology Link: https://arxiv.org/abs/2107.06155 Abstract: The paper describes BUT's English-to-German offline speech translation (ST) systems developed for IWSLT 2021. They are based on jointly trained Automatic Speech Recognition-Machine Translation models. Their performance is evaluated on the MuST-C Common test set. In this work, we study their efficiency from the perspective of having a large amount of separate ASR training data and MT training data, and a smaller amount of speech-translation training data. Large amounts of ASR and MT training data are utilized for pre-training the ASR and MT models. Speech-translation data is used to jointly optimize the ASR-MT models by defining an end-to-end differentiable path from speech to translations. For this purpose, we use the internal continuous representations from the ASR decoder as the input to the MT module. We show that speech translation can be further improved by training the ASR decoder jointly with the MT module using a large amount of text-only MT training data. We also show significant improvements by training an ASR module capable of generating punctuated text, rather than leaving the punctuation task to the MT module.

【4】 DiCOVA-Net: Diagnosing COVID-19 using Acoustics based on Deep Residual Network for the DiCOVA Challenge 2021

Authors: Jiangeng Chang, Shaoze Cui, Mengling Feng Comments: 5 figures Link: https://arxiv.org/abs/2107.06126 Abstract: In this paper, we propose a deep residual network-based method, namely the DiCOVA-Net, to identify COVID-19 infected patients based on the acoustic recordings of their coughs. Since there are far more healthy people than infected patients, this classification problem faces the challenge of imbalanced data. To improve the model's ability to recognize the minority class (the infected patients), we introduce data augmentation and cost-sensitive methods into our model. Besides, considering the particularity of this task, we deploy some fine-tuning techniques to adjust the pre-trained ResNet50. Furthermore, to improve the model's generalizability, we use ensemble learning to integrate prediction results from multiple base classifiers generated using different random seeds. To evaluate the proposed DiCOVA-Net's performance, we conducted experiments with the DiCOVA challenge dataset. The results show that our method has achieved 85.43% in AUC, among the top of all competing teams.

【5】 The Piano Inpainting Application

Authors: Gaëtan Hadjeres, Léopold Crestel Affiliation: Sony Computer Science Laboratories, Paris, France Link: https://arxiv.org/abs/2107.05944 Abstract: Autoregressive models are now capable of generating high-quality, minute-long, expressive MIDI piano performances. Even though this progress suggests new tools to assist music composition, we observe that generative algorithms are still not widely used by artists due to the limited control they offer, prohibitive inference times, or the lack of integration within musicians' workflows. In this work, we present the Piano Inpainting Application (PIA), a generative model focused on inpainting piano performances, as we believe that this elementary operation (restoring missing parts of a piano performance) encourages human-machine interaction and opens up new ways to approach music composition. Our approach relies on an encoder-decoder Linear Transformer architecture trained on a novel representation for MIDI piano performances termed Structured MIDI Encoding. By uncovering an interesting synergy between Linear Transformers and our inpainting task, we are able to efficiently inpaint contiguous regions of a piano performance, which makes our model suitable for interactive and responsive A.I.-assisted composition. Finally, we introduce our freely available Ableton Live PIA plugin, which allows musicians to smoothly generate or modify any MIDI clip using PIA within a widely used professional Digital Audio Workstation.

【6】 Towards Automatic Instrumentation by Learning to Separate Parts in Symbolic Multitrack Music

Authors: Hao-Wen Dong, Chris Donahue, Taylor Berg-Kirkpatrick, Julian McAuley Affiliations: University of California San Diego; Stanford University Comments: Accepted to ISMIR 2021 Link: https://arxiv.org/abs/2107.05916 Abstract: Modern keyboards allow a musician to play multiple instruments at the same time by assigning zones -- fixed pitch ranges of the keyboard -- to different instruments. In this paper, we aim to further extend this idea and examine the feasibility of automatic instrumentation -- dynamically assigning instruments to notes in solo music during performance. In addition to the online, real-time-capable setting for performative use cases, automatic instrumentation can also find applications in assistive composing tools in an offline setting. Due to the lack of paired data of original solo music and their full arrangements, we approach automatic instrumentation by learning to separate parts (e.g., voices, instruments and tracks) from their mixture in symbolic multitrack music, assuming that the mixture is to be played on a keyboard. We frame the task of part separation as a sequential multi-class classification problem and adopt machine learning to map sequences of notes into sequences of part labels. To examine the effectiveness of our proposed models, we conduct a comprehensive empirical evaluation over four diverse datasets of different genres and ensembles -- Bach chorales, string quartets, game music and pop music. Our experiments show that the proposed models outperform various baselines.
We also demonstrate the potential for our proposed models to produce alternative, convincing instrumentations for an existing arrangement by separating its mixture into parts. All source code and audio samples can be found at https://salu133445.github.io/arranger/.

【7】 Conformer-based End-to-end Speech Recognition With Rotary Position Embedding

Authors: Shengqiang Li, Menglong Xu, Xiao-Lei Zhang Affiliation: CIAIC, School of Marine Science and Technology, Northwestern Polytechnical University, China Comments: 5 pages, 3 figures Link: https://arxiv.org/abs/2107.05907 Abstract: Transformer-based end-to-end speech recognition models have received considerable attention in recent years due to their high training speed and ability to model a long-range global context. Position embedding in the transformer architecture is indispensable because it provides supervision for dependency modeling between elements at different positions in the input sequence. To make use of the time order of the input sequence, many works inject information about the relative or absolute position of each element into the input sequence. In this work, we investigate various position embedding methods in the convolution-augmented transformer (conformer) and adopt a novel implementation named rotary position embedding (RoPE). RoPE encodes absolute positional information into the input sequence by a rotation matrix, and then naturally incorporates explicit relative position information into a self-attention module. To evaluate the effectiveness of the RoPE method, we conducted experiments on the AISHELL-1 and LibriSpeech corpora. Results show that the conformer enhanced with RoPE achieves superior performance in the speech recognition task. Specifically, our model achieves relative word error rate reductions of 8.70% and 7.27% over the conformer on the test-clean and test-other sets of the LibriSpeech corpus, respectively.
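The rotary embedding itself is easy to sketch. In the toy example below (an illustrative NumPy sketch, not the paper's implementation), pairs of feature dimensions are rotated by position-dependent angles; because the rotations are orthogonal, the inner product of a rotated query and key depends only on their relative offset, which is exactly the property RoPE contributes to self-attention.

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Apply rotary position embedding to a vector x (even length) at
    position pos: each pair of dimensions is rotated by pos * freq."""
    d = x.shape[-1]
    inv_freq = base ** (-np.arange(0, d, 2) / d)  # one frequency per pair
    theta = pos * inv_freq
    cos, sin = np.cos(theta), np.sin(theta)
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin  # 2-D rotation of each pair
    out[1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)

# Same relative offset (3) at different absolute positions
# gives the same attention score:
s1 = rope(q, 5) @ rope(k, 2)
s2 = rope(q, 10) @ rope(k, 7)
print(s1, s2)
```

Since rotation matrices are norm-preserving, no positional "magnitude" is injected; only the angle between query and key changes with their relative offset.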

【8】 Speech Representation Learning Combining Conformer CPC with Deep Cluster for the ZeroSpeech Challenge 2021

Authors: Takashi Maekaku, Xuankai Chang, Yuya Fujita, Li-Wei Chen, Shinji Watanabe, Alexander Rudnicky Affiliations: Yahoo Japan Corporation, Tokyo, Japan; Carnegie Mellon University, PA, USA Link: https://arxiv.org/abs/2107.05899 Abstract: We present a system for the Zero Resource Speech Challenge 2021, which combines Contrastive Predictive Coding (CPC) with deep cluster. In deep cluster, we first prepare pseudo-labels obtained by clustering the outputs of a CPC network with k-means. Then, we train an additional autoregressive model to classify the previously obtained pseudo-labels in a supervised manner. Phoneme-discriminative representation is achieved by executing a second round of clustering with the outputs of the final layer of the autoregressive model. We show that replacing a Transformer layer with a Conformer layer leads to a further gain on a lexical metric. Experimental results show that relative improvements of 35% on a phonetic metric, 1.5% on the lexical metric, and 2.3% on a syntactic metric are achieved compared to a baseline CPC-small method trained on LibriSpeech 460h data. We achieve top results in this challenge on the syntactic metric.
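The deep-cluster step described above (k-means pseudo-labels on network outputs, later classified in a supervised manner) can be sketched as follows. The plain k-means implementation and the synthetic "CPC frame outputs" below are stand-ins for illustration only; the actual system clusters real CPC embeddings.

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Plain Lloyd k-means returning a pseudo-label per frame embedding."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assign each frame to its nearest center
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        # Recompute centers from the assignments
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return labels

# Stand-in for CPC frame outputs: 210 "frames" drawn from 3 latent classes
rng = np.random.default_rng(1)
means = rng.normal(scale=5.0, size=(3, 16))
frames = np.concatenate([means[i] + rng.normal(size=(70, 16)) for i in range(3)])

pseudo = kmeans(frames, k=3)  # pseudo-labels for the supervised second stage
print("cluster sizes:", np.bincount(pseudo, minlength=3))
```

In the paper's pipeline, these pseudo-labels would then supervise an autoregressive classifier, whose final-layer outputs are clustered again to obtain phoneme-discriminative units.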

【9】 Improving Speech Translation by Understanding and Learning from the Auxiliary Text Translation Task

Authors: Yun Tang, Juan Pino, Xian Li, Changhan Wang, Dmitriy Genzel Affiliation: Facebook AI Comments: Accepted by ACL 2021 Link: https://arxiv.org/abs/2107.05782 Abstract: Pretraining and multitask learning are widely used to improve speech-to-text translation performance. In this study, we are interested in training a speech-to-text translation model along with an auxiliary text-to-text translation task. We conduct a detailed analysis to understand the impact of the auxiliary task on the primary task within the multitask learning framework. Our analysis confirms that multitask learning tends to generate similar decoder representations from different modalities and preserve more information from the pretrained text translation modules. We observe minimal negative transfer effects between the two tasks, and sharing more parameters helps transfer knowledge from the text task to the speech task. The analysis also reveals that the modality representation difference at the top decoder layers is still not negligible, and that those layers are critical for translation quality. Inspired by these findings, we propose three methods to improve translation quality. First, a parameter-sharing and initialization strategy is proposed to enhance information sharing between the tasks. Second, a novel attention-based regularization is proposed for the encoders, pulling the representations from different modalities closer. Third, online knowledge distillation is proposed to enhance knowledge transfer from the text to the speech task.
Our experiments show that the proposed approach improves translation performance by more than 2 BLEU over a strong baseline and achieves state-of-the-art results on the MuST-C English-German, English-French and English-Spanish language pairs.

【10】 Codified audio language modeling learns useful representations for music information retrieval

Authors: Rodrigo Castellon, Chris Donahue, Percy Liang Affiliation: Stanford University Comments: To appear in the proceedings of ISMIR 2021 Link: https://arxiv.org/abs/2107.05677 Abstract: We demonstrate that language models pre-trained on codified (discretely-encoded) music audio learn representations that are useful for downstream MIR tasks. Specifically, we explore representations from Jukebox (Dhariwal et al. 2020): a music generation system containing a language model trained on codified audio from 1M songs. To determine if Jukebox's representations contain useful information for MIR, we use them as input features to train shallow models on several MIR tasks. Relative to representations from conventional MIR models which are pre-trained on tagging, we find that using representations from Jukebox as input features yields 30% stronger performance on average across four MIR tasks: tagging, genre classification, emotion recognition, and key detection. For key detection, we observe that representations from Jukebox are considerably stronger than those from models pre-trained on tagging, suggesting that pre-training via codified audio language modeling may address blind spots in conventional approaches. We interpret the strength of Jukebox's representations as evidence that modeling audio instead of tags provides richer representations for MIR.

【11】 A Configurable Multilingual Model is All You Need to Recognize All Languages

Authors: Long Zhou, Jinyu Li, Eric Sun, Shujie Liu Affiliations: Microsoft Research Asia; Microsoft Speech and Language Group Link: https://arxiv.org/abs/2107.05876 Abstract: Multilingual automatic speech recognition (ASR) models have shown great promise in recent years because of the simplified model training and deployment process. Conventional methods either train a universal multilingual model without any language information or use a one-hot language ID (LID) vector to guide the recognition of the target language. In practice, the user can be prompted to pre-select several languages he/she can speak. The multilingual model without LID cannot make good use of the language information set by the user, while the multilingual model with LID can only handle one pre-selected language. In this paper, we propose a novel configurable multilingual model (CMM) which is trained only once but can be configured as different models based on users' choices by extracting language-specific modules together with a universal model from the trained CMM. In particular, a single CMM can be deployed to any user scenario where the users can pre-select any combination of languages. Trained with 75K hours of transcribed, anonymized Microsoft multilingual data and evaluated with 10-language test sets, the proposed CMM improves on the universal multilingual model by 26.0%, 16.9%, and 10.4% relative word error reduction when the user selects 1, 2, or 3 languages, respectively. CMM also performs significantly better on code-switching test sets.

【12】 AUC Optimization for Robust Small-footprint Keyword Spotting with Limited Training Data 标题:有限训练数据下鲁棒小型化关键词检测的AUC优化

作者:Menglong Xu,Shengqiang Li,Chengdong Liang,Xiao-Lei Zhang 机构:CIAIC, School of Marine Science and Technology, Northwestern Polytechnical University, China 备注:submitted to ASRU2021 链接:https://arxiv.org/abs/2107.05859 摘要:深度神经网络为小型化(small-footprint)关键词检测(KWS)提供了有效的解决方案。然而,如果训练数据有限,那么在经常遇到训练数据之外未见声音的真实场景中,实现鲁棒且高度准确的KWS仍然是一个挑战。大多数传统方法只在训练集上最大化分类精度,而不考虑未见声音。为了提高基于深度神经网络的KWS的鲁棒性,本文引入了一种新的损失函数,即最大化接收机工作特性曲线下面积(AUC)。该方法不仅在封闭训练集上最大化关键词的分类精度,还通过最大化AUC分数来优化非关键词片段检测的性能。在Google Speech Commands数据集v1和v2上的实验结果表明,我们的方法在大多数评价指标上都达到了新的最优(state-of-the-art)性能。 摘要:Deep neural networks provide effective solutions to small-footprint keyword spotting (KWS). However, if training data is limited, it remains challenging to achieve robust and highly accurate KWS in real-world scenarios where unseen sounds that are out of the training data are frequently encountered. Most conventional methods aim to maximize the classification accuracy on the training set, without taking the unseen sounds into account. To enhance the robustness of the deep neural networks based KWS, in this paper, we introduce a new loss function, named the maximization of the area under the receiver-operating-characteristic curve (AUC). The proposed method not only maximizes the classification accuracy of keywords on the closed training set, but also maximizes the AUC score for optimizing the performance of non-keyword segments detection. Experimental results on the Google Speech Commands dataset v1 and v2 show that our method achieves new state-of-the-art performance in terms of most evaluation metrics.
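摘要未给出AUC最大化损失的具体形式;下面是一个常见的成对可微AUC替代损失的最小示意(纯Python,假设性实现,并非论文原始公式),它对每个(正样本, 负样本)得分对施加logistic惩罚,损失越小,经验AUC越高:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def pairwise_auc_surrogate_loss(pos_scores, neg_scores):
    """可微的AUC替代损失:对每个(正样本, 负样本)对,
    用 -log sigmoid(sp - sn) 惩罚正样本得分未明显超过负样本的情况。"""
    total, n = 0.0, 0
    for sp in pos_scores:
        for sn in neg_scores:
            # 当 sp >> sn 时该项趋近 0;sp < sn 时惩罚增大
            total += -math.log(sigmoid(sp - sn))
            n += 1
    return total / n

def empirical_auc(pos_scores, neg_scores):
    """经验AUC:随机抽取的正样本得分高于负样本的比例(并列记0.5)。"""
    wins = sum((sp > sn) + 0.5 * (sp == sn)
               for sp in pos_scores for sn in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))
```

得分区分度越高,替代损失越低,与经验AUC的变化方向一致;训练时把该替代损失与常规分类损失联合优化,即可兼顾关键词分类精度与非关键词检测性能。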

3.eess.AS音频处理:

【1】 A Configurable Multilingual Model is All You Need to Recognize All Languages 标题:可配置的多语言模型是识别所有语言所需的全部

作者:Long Zhou,Jinyu Li,Eric Sun,Shujie Liu 机构:†Microsoft Research Asia, ‡Microsoft Speech and Language Group 链接:https://arxiv.org/abs/2107.05876 摘要:多语种自动语音识别(ASR)模型由于其简化的模型训练和部署过程,近年来显示出巨大的应用前景。传统的方法要么训练一个不利用任何语言信息的通用多语言模型,要么用一个1-hot语言ID(LID)向量来指导目标语言的识别。实际上,系统可以提示用户预先选择几种他/她会说的语言。没有LID的多语言模型不能很好地利用用户设置的语言信息,而具有LID的多语言模型只能处理一种预先选定的语言。本文提出了一种新的可配置多语言模型(CMM),该模型只需训练一次,通过从训练后的CMM中提取特定于语言的模块和通用模型,就可以根据用户的选择配置为不同的模型。特别是,单个CMM可以部署到任何用户场景中,用户可以预先选择任何语言组合。使用75K小时的转录匿名微软多语种数据进行训练,并使用10个语言测试集进行评估,当用户选择1、2或3种语言时,该模型的词错误率相对通用多语言模型分别降低26.0%、16.9%和10.4%。CMM在语码转换(code-switching)测试集上的表现也明显更好。 摘要:Multilingual automatic speech recognition (ASR) models have shown great promise in recent years because of the simplified model training and deployment process. Conventional methods either train a universal multilingual model without taking any language information or with a 1-hot language ID (LID) vector to guide the recognition of the target language. In practice, the user can be prompted to pre-select several languages he/she can speak. The multilingual model without LID cannot well utilize the language information set by the user while the multilingual model with LID can only handle one pre-selected language. In this paper, we propose a novel configurable multilingual model (CMM) which is trained only once but can be configured as different models based on users' choices by extracting language-specific modules together with a universal model from the trained CMM. Particularly, a single CMM can be deployed to any user scenario where the users can pre-select any combination of languages. Trained with 75K hours of transcribed anonymized Microsoft multilingual data and evaluated with 10-language test sets, the proposed CMM improves from the universal multilingual model by 26.0%, 16.9%, and 10.4% relative word error reduction when the user selects 1, 2, or 3 languages, respectively. CMM also performs significantly better on code-switching test sets.

【2】 AUC Optimization for Robust Small-footprint Keyword Spotting with Limited Training Data 标题:有限训练数据下鲁棒小型化关键词检测的AUC优化

作者:Menglong Xu,Shengqiang Li,Chengdong Liang,Xiao-Lei Zhang 机构:CIAIC, School of Marine Science and Technology, Northwestern Polytechnical University, China 备注:submitted to ASRU2021 链接:https://arxiv.org/abs/2107.05859 摘要:深度神经网络为小型化(small-footprint)关键词检测(KWS)提供了有效的解决方案。然而,如果训练数据有限,那么在经常遇到训练数据之外未见声音的真实场景中,实现鲁棒且高度准确的KWS仍然是一个挑战。大多数传统方法只在训练集上最大化分类精度,而不考虑未见声音。为了提高基于深度神经网络的KWS的鲁棒性,本文引入了一种新的损失函数,即最大化接收机工作特性曲线下面积(AUC)。该方法不仅在封闭训练集上最大化关键词的分类精度,还通过最大化AUC分数来优化非关键词片段检测的性能。在Google Speech Commands数据集v1和v2上的实验结果表明,我们的方法在大多数评价指标上都达到了新的最优(state-of-the-art)性能。 摘要:Deep neural networks provide effective solutions to small-footprint keyword spotting (KWS). However, if training data is limited, it remains challenging to achieve robust and highly accurate KWS in real-world scenarios where unseen sounds that are out of the training data are frequently encountered. Most conventional methods aim to maximize the classification accuracy on the training set, without taking the unseen sounds into account. To enhance the robustness of the deep neural networks based KWS, in this paper, we introduce a new loss function, named the maximization of the area under the receiver-operating-characteristic curve (AUC). The proposed method not only maximizes the classification accuracy of keywords on the closed training set, but also maximizes the AUC score for optimizing the performance of non-keyword segments detection. Experimental results on the Google Speech Commands dataset v1 and v2 show that our method achieves new state-of-the-art performance in terms of most evaluation metrics.

【3】 Dance2Music: Automatic Dance-driven Music Generation 标题:Dance2Music:舞蹈驱动的自动音乐生成

作者:Gunjan Aggarwal,Devi Parikh 机构:Adobe, Facebook AI Research & Georgia Tech 链接:https://arxiv.org/abs/2107.06252 摘要:舞蹈和音乐通常是并行不悖的。舞蹈、音乐和它们的同步性的复杂性使得从计算创造力的角度来研究它们非常有趣。虽然有几部作品研究了为给定的音乐生成舞蹈,但为给定的舞蹈自动生成音乐的研究仍处于探索阶段。这种能力可以有几个创造性的表达和娱乐应用。我们介绍了这方面的一些早期探索。我们提出了一种基于搜索的离线方法,在处理整个舞蹈视频后生成音乐,以及一种在线方法,该方法使用深度神经网络在视频进行时动态生成音乐。我们通过人体研究将这些方法与一个强大的启发式基线进行比较,并给出我们的发现。我们已经将我们的在线方法集成到了一个现场演示中!演示视频可在此处找到:https://sites.google.com/view/dance2music/live-demo. 摘要:Dance and music typically go hand in hand. The complexities in dance, music, and their synchronisation make them fascinating to study from a computational creativity perspective. While several works have looked at generating dance for a given music, automatically generating music for a given dance remains under-explored. This capability could have several creative expression and entertainment applications. We present some early explorations in this direction. We present a search-based offline approach that generates music after processing the entire dance video and an online approach that uses a deep neural network to generate music on-the-fly as the video proceeds. We compare these approaches to a strong heuristic baseline via human studies and present our findings. We have integrated our online approach in a live demo! A video of the demo can be found here: https://sites.google.com/view/dance2music/live-demo.

【4】 Timbre Classification of Musical Instruments with a Deep Learning Multi-Head Attention-Based Model 标题:基于深度学习多头注意力模型的乐器音色分类

作者:Carlos Hernandez-Olivan,Jose R. Beltran 机构:Universidad de Zaragoza 链接:https://arxiv.org/abs/2107.06231 摘要:这项工作的目的是定义一个基于深度学习的模型,使其能够用尽可能少的参数识别不同的乐器音色。为此,我们研究了以不同力度演奏的古典管弦乐器,它们属于若干乐器族,并在相同的音高范围内演奏音符。结果表明,即使乐器以相同强度演奏同一个音符,也能够凭音色对乐器进行分类。所采用的网络使用8个头的多头注意力机制,输出端接一个全连接网络,并以声音样本的对数梅尔幅度谱图作为输入。该网络能够识别古典管弦乐队的20个乐器类别,总体F$_1$值达到0.62。我们对注意力层的权重进行了分析,并给出了模型的混淆矩阵,从而评估所提出架构区分音色的能力,并确定未来工作应关注的方面。 摘要:The aim of this work is to define a model based on deep learning that is able to identify different instrument timbres with as few parameters as possible. For this purpose, we have worked with classical orchestral instruments played with different dynamics, which are part of a few instrument families and which play notes in the same pitch range. It has been possible to assess the ability to classify instruments by timbre even if the instruments are playing the same note with the same intensity. The network employed uses a multi-head attention mechanism, with 8 heads and a dense network at the output taking as input the log-mel magnitude spectrograms of the sound samples. This network allows the identification of 20 instrument classes of the classical orchestra, achieving an overall F$_1$ value of 0.62. An analysis of the weights of the attention layer has been performed and the confusion matrix of the model is presented, allowing us to assess the ability of the proposed architecture to distinguish timbre and to establish the aspects on which future work should focus.
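论文模型的结构细节未在摘要中给出;下面用纯Python给出一个极简的多头自注意力示意(无可学习投影,属假设性简化,仅演示"按头切分特征 + 缩放点积注意力 + 拼接"这一机制本身,并非论文实现):

```python
import math

def softmax(xs):
    # 数值稳定的softmax:先减去最大值再取指数
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def multi_head_attention(seq, num_heads):
    """极简多头自注意力:把每个时间步的特征向量切成 num_heads 份,
    各头独立做缩放点积注意力,再按原位置拼接回输出。
    seq: 形如 [帧数][特征维] 的列表(例如对数梅尔谱图的帧序列)。"""
    d = len(seq[0])
    assert d % num_heads == 0
    hd = d // num_heads  # 每个头的维度
    out = [[0.0] * d for _ in seq]
    for h in range(num_heads):
        lo, hi = h * hd, (h + 1) * hd
        heads = [x[lo:hi] for x in seq]  # 该头的查询/键/值(未投影)
        for i, q in enumerate(heads):
            scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(hd)
                      for k in heads]
            w = softmax(scores)  # 对所有帧的注意力权重
            for j, k in enumerate(heads):
                for t in range(hd):
                    out[i][lo + t] += w[j] * k[t]
    return out
```

实际模型中查询、键、值各有可学习的线性投影,且注意力输出后还接全连接分类网络;这里省略这些部分以突出注意力计算本身。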

【5】 The IWSLT 2021 BUT Speech Translation Systems 标题:IWSLT 2021 BUT语音翻译系统

作者:Hari Krishna Vydana,Martin Karafiát,Lukáš Burget,"Honza" Černocký 机构:Brno University of Technology 链接:https://arxiv.org/abs/2107.06155 摘要:本文介绍了布尔诺理工大学(BUT)为IWSLT2021开发的英语到德语离线语音翻译(ST)系统。这些系统基于联合训练的自动语音识别-机器翻译(ASR-MT)模型,并在MuST-C Common测试集上评估了其性能。在这项工作中,我们从拥有大量独立的ASR训练数据和MT训练数据、以及较少量语音翻译训练数据的角度研究它们的效率。大量的ASR和MT训练数据被用于预训练ASR和MT模型。语音翻译数据通过定义从语音到翻译的端到端可微路径来联合优化ASR-MT模型。为此,我们使用ASR解码器的内部连续表示作为MT模块的输入。我们证明,通过使用大量纯文本机器翻译训练数据将ASR解码器与机器翻译模块联合训练,可以进一步改善语音翻译。我们还通过训练一个能够生成带标点文本的ASR模块(而不是将标点任务留给机器翻译模块)获得了显著改进。 摘要:The paper describes BUT's English to German offline speech translation (ST) systems developed for IWSLT2021. They are based on jointly trained Automatic Speech Recognition-Machine Translation models. Their performance is evaluated on the MuST-C Common test set. In this work, we study their efficiency from the perspective of having a large amount of separate ASR training data and MT training data, and a smaller amount of speech-translation training data. Large amounts of ASR and MT training data are utilized for pre-training the ASR and MT models. Speech-translation data is used to jointly optimize ASR-MT models by defining an end-to-end differentiable path from speech to translations. For this purpose, we use the internal continuous representations from the ASR-decoder as the input to MT module. We show that speech translation can be further improved by training the ASR-decoder jointly with the MT-module using large amount of text-only MT training data. We also show significant improvements by training an ASR module capable of generating punctuated text, rather than leaving the punctuation task to the MT module.

【6】 DiCOVA-Net: Diagnosing COVID-19 using Acoustics based on Deep Residual Network for the DiCOVA Challenge 2021 标题:DiCOVA-Net:面向2021年DiCOVA挑战赛的基于深度残差网络的声学COVID-19诊断

作者:Jiangeng Chang,Shaoze Cui,Mengling Feng 备注:5 figures 链接:https://arxiv.org/abs/2107.06126 摘要:在本文中,我们提出了一种基于深度残差网络的方法,即DiCOVA-Net,根据咳嗽声录音识别COVID-19感染患者。由于健康人群远多于感染者,这一分类问题面临数据不平衡的挑战。为了提高模型对少数类(感染患者)的识别能力,我们在模型中引入了数据增强和代价敏感方法。此外,考虑到该任务的特殊性,我们采用了一些微调技术来调整预训练的ResNet50。为了提高模型的泛化能力,我们还使用集成学习来整合由不同随机种子生成的多个基分类器的预测结果。为了评估所提出的DiCOVA-Net的性能,我们在DiCOVA挑战赛数据集上进行了实验。结果表明,我们的方法的AUC达到85.43%,在所有参赛队伍中名列前茅。 摘要:In this paper, we propose a deep residual network-based method, namely the DiCOVA-Net, to identify COVID-19 infected patients based on the acoustic recording of their coughs. Since there are far more healthy people than infected patients, this classification problem faces the challenge of imbalanced data. To improve the model's ability to recognize minority class (the infected patients), we introduce data augmentation and cost-sensitive methods into our model. Besides, considering the particularity of this task, we deploy some fine-tuning techniques to adjust the pre-training ResNet50. Furthermore, to improve the model's generalizability, we use ensemble learning to integrate prediction results from multiple base classifiers generated using different random seeds. To evaluate the proposed DiCOVA-Net's performance, we conducted experiments with the DiCOVA challenge dataset. The results show that our method has achieved 85.43% in AUC, among the top of all competing teams.
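摘要中的"代价敏感方法"未给出细节;下面按"类别频率反比加权交叉熵"这一处理类别不平衡的常见做法给出示意(纯Python,假设性实现,并非论文原始方案):

```python
import math
from collections import Counter

def inverse_frequency_weights(labels):
    """按频率反比为每个类别计算权重:样本越少的类(如感染患者)权重越大;
    权重取 n / (k * count),使按样本加权后的平均权重为 1。"""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * cnt) for c, cnt in counts.items()}

def weighted_cross_entropy(probs, labels, weights):
    """代价敏感交叉熵:少数类样本的损失被相应放大。
    probs[i] 是模型给第 i 个样本真实类别的预测概率。"""
    return sum(-weights[y] * math.log(p)
               for p, y in zip(probs, labels)) / len(labels)
```

训练时把该加权损失替换普通交叉熵,即可让误判少数类(感染者)付出更高代价;与摘要中提到的数据增强、集成学习可叠加使用。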

【7】 The Piano Inpainting Application 标题:钢琴修复应用(PIA)

作者:Gaëtan Hadjeres,Léopold Crestel 机构:Sony Computer Science Laboratories, Paris, France 链接:https://arxiv.org/abs/2107.05944 摘要:自回归模型现在能够生成长达数分钟的高品质、富有表现力的MIDI钢琴演奏。尽管这一进展带来了辅助音乐创作的新工具,但我们观察到,由于生成算法提供的控制有限、推理时间过长,或缺乏与音乐家工作流程的集成,它们仍然没有被艺术家广泛使用。在这项工作中,我们提出了钢琴修复应用(PIA),一个专注于修复钢琴演奏的生成模型,因为我们相信这种基本操作(恢复钢琴演奏中缺失的部分)鼓励了人机交互,并为音乐创作开辟了新的途径。我们的方法依赖于一个编码器-解码器线性Transformer架构,它在一种称为结构化MIDI编码(Structured MIDI Encoding)的新的MIDI钢琴演奏表示上训练。通过揭示线性Transformer与我们的修复任务之间有趣的协同作用,我们能够高效地修复钢琴演奏的连续区域,这使我们的模型适合交互式、响应迅速的AI辅助创作。最后,我们介绍免费提供的Ableton Live PIA插件,它允许音乐家在广泛使用的专业数字音频工作站中使用PIA流畅地生成或修改任何MIDI片段。 摘要:Autoregressive models are now capable of generating high-quality minute-long expressive MIDI piano performances. Even though this progress suggests new tools to assist music composition, we observe that generative algorithms are still not widely used by artists due to the limited control they offer, prohibitive inference times or the lack of integration within musicians' workflows. In this work, we present the Piano Inpainting Application (PIA), a generative model focused on inpainting piano performances, as we believe that this elementary operation (restoring missing parts of a piano performance) encourages human-machine interaction and opens up new ways to approach music composition. Our approach relies on an encoder-decoder Linear Transformer architecture trained on a novel representation for MIDI piano performances termed Structured MIDI Encoding. By uncovering an interesting synergy between Linear Transformers and our inpainting task, we are able to efficiently inpaint contiguous regions of a piano performance, which makes our model suitable for interactive and responsive A.I.-assisted composition. Finally, we introduce our freely-available Ableton Live PIA plugin, which allows musicians to smoothly generate or modify any MIDI clip using PIA within a widely-used professional Digital Audio Workstation.

【8】 Towards Automatic Instrumentation by Learning to Separate Parts in Symbolic Multitrack Music 标题:通过学习分离符号多音轨音乐中的声部实现自动配器

作者:Hao-Wen Dong,Chris Donahue,Taylor Berg-Kirkpatrick,Julian McAuley 机构: University of California San Diego, Stanford University 备注:Accepted to ISMIR 2021 链接:https://arxiv.org/abs/2107.05916 摘要:现代键盘允许音乐家同时演奏多种乐器,方法是给不同的乐器指定区域(键盘上固定的音高范围)。在本文中,我们旨在进一步扩展这一思想,并探讨自动配器的可行性,即在独奏音乐演奏过程中为音符动态分配乐器。除了面向演奏用例的在线实时设置外,自动配器还可以在离线设置的辅助作曲工具中找到应用。由于缺乏原始独奏音乐及其完整编曲的配对数据,我们假设混合音乐在键盘上演奏,通过学习从符号多音轨音乐的混合中分离声部(如人声、乐器和音轨)来实现自动配器。我们将声部分离任务定义为一个序列多类分类问题,并采用机器学习将音符序列映射为声部标签序列。为了检验所提出模型的有效性,我们在四个不同流派和编制的数据集(巴赫众赞歌、弦乐四重奏、游戏音乐和流行音乐)上进行了全面的实证评估。我们的实验表明,所提出的模型优于各种基线。我们还展示了所提出的模型通过把现有编曲的混合分离成声部,为其生成令人信服的替代配器的潜力。所有源代码和音频样本可以在 https://salu133445.github.io/arranger/ 找到。 摘要:Modern keyboards allow a musician to play multiple instruments at the same time by assigning zones -- fixed pitch ranges of the keyboard -- to different instruments. In this paper, we aim to further extend this idea and examine the feasibility of automatic instrumentation -- dynamically assigning instruments to notes in solo music during performance. In addition to the online, real-time-capable setting for performative use cases, automatic instrumentation can also find applications in assistive composing tools in an offline setting. Due to the lack of paired data of original solo music and their full arrangements, we approach automatic instrumentation by learning to separate parts (e.g., voices, instruments and tracks) from their mixture in symbolic multitrack music, assuming that the mixture is to be played on a keyboard. We frame the task of part separation as a sequential multi-class classification problem and adopt machine learning to map sequences of notes into sequences of part labels. To examine the effectiveness of our proposed models, we conduct a comprehensive empirical evaluation over four diverse datasets of different genres and ensembles -- Bach chorales, string quartets, game music and pop music. Our experiments show that the proposed models outperform various baselines. We also demonstrate the potential for our proposed models to produce alternative convincing instrumentations for an existing arrangement by separating its mixture into parts. All source code and audio samples can be found at https://salu133445.github.io/arranger/ .

【9】 Conformer-based End-to-end Speech Recognition With Rotary Position Embedding 标题:采用旋转位置嵌入的基于Conformer的端到端语音识别

作者:Shengqiang Li,Menglong Xu,Xiao-Lei Zhang 机构:CIAIC, School of Marine Science and Technology, Northwestern Polytechnical University, China 备注:5 pages, 3 figures 链接:https://arxiv.org/abs/2107.05907 摘要:基于Transformer的端到端语音识别模型由于其训练速度快、能够建模长距离的全局上下文,近年来受到了广泛关注。Transformer架构中的位置嵌入必不可少,因为它为输入序列中不同位置元素之间的依赖关系建模提供了监督。为了利用输入序列的时间顺序,许多工作向输入序列注入有关元素相对或绝对位置的信息。本文研究了卷积增强Transformer(conformer)中的各种位置嵌入方法,并采用了一种名为旋转位置嵌入(RoPE)的新实现。RoPE通过旋转矩阵将绝对位置信息编码到输入序列中,进而自然地将显式的相对位置信息融入自注意力模块。为了评价RoPE方法的有效性,我们在AISHELL-1和LibriSpeech语料库上进行了实验。结果表明,采用RoPE增强的conformer在语音识别任务中取得了更优的性能。具体来说,在LibriSpeech语料库的test-clean和test-other测试集上,我们的模型相对conformer分别取得8.70%和7.27%的相对词错误率降低。 摘要:Transformer-based end-to-end speech recognition models have received considerable attention in recent years due to their high training speed and ability to model a long-range global context. Position embedding in the transformer architecture is indispensable because it provides supervision for dependency modeling between elements at different positions in the input sequence. To make use of the time order of the input sequence, many works inject some information about the relative or absolute position of the element into the input sequence. In this work, we investigate various position embedding methods in the convolution-augmented transformer (conformer) and adopt a novel implementation named rotary position embedding (RoPE). RoPE encodes absolute positional information into the input sequence by a rotation matrix, and then naturally incorporates explicit relative position information into a self-attention module. To evaluate the effectiveness of the RoPE method, we conducted experiments on AISHELL-1 and LibriSpeech corpora. Results show that the conformer enhanced with RoPE achieves superior performance in the speech recognition task. Specifically, our model achieves a relative word error rate reduction of 8.70% and 7.27% over the conformer on test-clean and test-other sets of the LibriSpeech corpus respectively.
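RoPE的核心运算可以用如下纯Python示意(仅演示RoPE机制本身,并非论文模型):把偶数维向量按维度成对旋转,旋转角与绝对位置成正比;两个经RoPE编码的向量的内积只依赖二者的相对位置,这正是"用旋转矩阵编码绝对位置、在自注意力中自然引入相对位置"的含义。

```python
import math

def rope(x, pos, base=10000.0):
    """对偶数维向量 x 施加旋转位置嵌入:
    第 i 对维度 (x[2i], x[2i+1]) 按角度 pos * base**(-2i/d) 做平面旋转。"""
    d = len(x)
    out = []
    for i in range(d // 2):
        theta = pos * base ** (-2.0 * i / d)
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[2 * i], x[2 * i + 1]
        out += [a * c - b * s, a * s + b * c]
    return out

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))
```

可以验证 dot(rope(q, m), rope(k, n)) 只取决于 m - n:把查询和键的位置同时平移任意步数,注意力得分不变。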

【10】 Speech Representation Learning Combining Conformer CPC with Deep Cluster for the ZeroSpeech Challenge 2021 标题:面向2021年ZeroSpeech挑战赛的结合Conformer CPC与深度聚类的语音表征学习

作者:Takashi Maekaku,Xuankai Chang,Yuya Fujita,Li-Wei Chen,Shinji Watanabe,Alexander Rudnicky 机构:Yahoo Japan Corporation, Tokyo, JAPAN, Carnegie Mellon University, PA, USA 链接:https://arxiv.org/abs/2107.05899 摘要:我们提出了一个结合对比预测编码(CPC)与深度聚类的2021年零资源语音挑战赛系统。在深度聚类中,我们首先用k-均值对CPC网络的输出进行聚类,得到伪标签。然后,我们训练一个额外的自回归模型,以监督方式对先前获得的伪标签进行分类。通过对自回归模型最后一层的输出执行第二轮聚类,得到音素判别性表示。我们证明了用Conformer层替换Transformer层能在词汇度量上带来进一步增益。实验结果表明,与在LibriSpeech 460h数据上训练的CPC-small基线方法相比,该系统在音素度量、词汇度量和句法度量上分别取得35%、1.5%和2.3%的相对提升。在句法度量上,我们取得了本次挑战赛的最佳成绩。 摘要:We present a system for the Zero Resource Speech Challenge 2021, which combines a Contrastive Predictive Coding (CPC) with deep cluster. In deep cluster, we first prepare pseudo-labels obtained by clustering the outputs of a CPC network with k-means. Then, we train an additional autoregressive model to classify the previously obtained pseudo-labels in a supervised manner. Phoneme discriminative representation is achieved by executing the second-round clustering with the outputs of the final layer of the autoregressive model. We show that replacing a Transformer layer with a Conformer layer leads to a further gain in a lexical metric. Experimental results show that a relative improvement of 35% in a phonetic metric, 1.5% in the lexical metric, and 2.3% in a syntactic metric are achieved compared to a baseline method of CPC-small which is trained on LibriSpeech 460h data. We achieve top results in this challenge with the syntactic metric.
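"深度聚类"中用k-均值生成伪标签的步骤可以示意如下(纯Python极简k-means,假设性实现;实际系统是对CPC网络的帧级输出向量做聚类,再用伪标签监督训练自回归模型):

```python
import random

def kmeans_pseudo_labels(feats, k, iters=20, seed=0):
    """对特征向量做 k-means,返回每个特征的聚类编号作为伪标签;
    这些伪标签随后可作为分类目标,监督训练额外的模型。"""
    rng = random.Random(seed)
    centers = [list(f) for f in rng.sample(feats, k)]
    labels = [0] * len(feats)
    for _ in range(iters):
        # 分配步:每个特征归入欧氏距离最近的聚类中心
        for idx, f in enumerate(feats):
            labels[idx] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(f, centers[c])))
        # 更新步:每个中心移动到其所属特征的均值
        for c in range(k):
            members = [f for f, l in zip(feats, labels) if l == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels
```

第二轮聚类即把同样的过程再作用于自回归模型最后一层的输出,以得到更具音素判别性的离散单元。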

【11】 Improving Speech Translation by Understanding and Learning from the Auxiliary Text Translation Task 标题:通过理解并学习辅助文本翻译任务来改进语音翻译

作者:Yun Tang,Juan Pino,Xian Li,Changhan Wang,Dmitriy Genzel 机构:Facebook AI 备注:Accepted by ACL 2021 链接:https://arxiv.org/abs/2107.05782 摘要:预训练和多任务学习被广泛用于提高语音到文本翻译的性能。在这项研究中,我们感兴趣的是将语音到文本翻译模型与一个辅助的文本到文本翻译任务一起训练。我们进行了详细分析,以了解在多任务学习框架下辅助任务对主任务的影响。我们的分析证实,多任务学习倾向于从不同模态生成相似的解码器表示,并保留来自预训练文本翻译模块的更多信息。我们观察到两个任务之间的负迁移效应很小,并且共享更多参数有助于将知识从文本任务迁移到语音任务。分析还表明,解码器顶层的模态表示差异仍然不可忽略,而这些层对翻译质量至关重要。受这些发现的启发,我们提出了三种提高翻译质量的方法。首先,提出了一种参数共享和初始化策略,以增强任务间的信息共享。其次,针对编码器提出了一种新颖的基于注意力的正则化方法,使不同模态的表示更加接近。第三,提出了一种在线知识蒸馏方法,以增强从文本任务到语音任务的知识迁移。我们的实验表明,所提出的方法在一个强基线上将翻译性能提高了2个BLEU以上,并在MuST-C英语-德语、英语-法语和英语-西班牙语语言对上取得了最新的结果。 摘要:Pretraining and multitask learning are widely used to improve the speech to text translation performance. In this study, we are interested in training a speech to text translation model along with an auxiliary text to text translation task. We conduct a detailed analysis to understand the impact of the auxiliary task on the primary task within the multitask learning framework. Our analysis confirms that multitask learning tends to generate similar decoder representations from different modalities and preserve more information from the pretrained text translation modules. We observe minimal negative transfer effect between the two tasks and sharing more parameters is helpful to transfer knowledge from the text task to the speech task. The analysis also reveals that the modality representation difference at the top decoder layers is still not negligible, and those layers are critical for the translation quality. Inspired by these findings, we propose three methods to improve translation quality. First, a parameter sharing and initialization strategy is proposed to enhance information sharing between the tasks. Second, a novel attention-based regularization is proposed for the encoders and pulls the representations from different modalities closer. Third, an online knowledge distillation is proposed to enhance the knowledge transfer from the text to the speech task. Our experiments show that the proposed approach improves translation performance by more than 2 BLEU over a strong baseline and achieves state-of-the-art results on the MuST-C English-German, English-French and English-Spanish language pairs.

【12】 Codified audio language modeling learns useful representations for music information retrieval 标题:编码音频语言建模学习用于音乐信息检索的有用表示

作者:Rodrigo Castellon,Chris Donahue,Percy Liang 机构:Stanford University 备注:To appear in the proceedings of ISMIR 2021 链接:https://arxiv.org/abs/2107.05677 摘要:我们证明,在编码化(离散编码)音乐音频上预训练的语言模型能够学习到对下游音乐信息检索(MIR)任务有用的表示。具体来说,我们探索了Jukebox(Dhariwal et al. 2020)的表示:这是一个音乐生成系统,包含一个在来自100万首歌曲的编码音频上训练的语言模型。为了确定Jukebox的表示是否包含对MIR有用的信息,我们将它们作为输入特征,在若干MIR任务上训练浅层模型。相对于在标注(tagging)任务上预训练的传统MIR模型的表示,我们发现使用Jukebox的表示作为输入特征,在标注、流派分类、情感识别和调性检测四个MIR任务上平均带来30%的性能提升。对于调性检测,我们观察到Jukebox的表示比在标注上预训练的模型的表示强得多,这表明通过编码音频语言建模进行预训练可以弥补传统方法的盲点。我们将Jukebox表示的强大性能解释为如下证据:建模音频而非标签能为MIR提供更丰富的表示。 摘要:We demonstrate that language models pre-trained on codified (discretely-encoded) music audio learn representations that are useful for downstream MIR tasks. Specifically, we explore representations from Jukebox (Dhariwal et al. 2020): a music generation system containing a language model trained on codified audio from 1M songs. To determine if Jukebox's representations contain useful information for MIR, we use them as input features to train shallow models on several MIR tasks. Relative to representations from conventional MIR models which are pre-trained on tagging, we find that using representations from Jukebox as input features yields 30% stronger performance on average across four MIR tasks: tagging, genre classification, emotion recognition, and key detection. For key detection, we observe that representations from Jukebox are considerably stronger than those from models pre-trained on tagging, suggesting that pre-training via codified audio language modeling may address blind spots in conventional approaches. We interpret the strength of Jukebox's representations as evidence that modeling audio instead of tags provides richer representations for MIR.
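"把冻结的预训练表示作为输入特征训练浅层模型"这一探测(probing)范式可示意如下(假设性示例,用最近类中心分类器代替论文中的浅层模型,仅演示范式本身):

```python
def fit_centroids(features, labels):
    """在冻结的预训练特征上训练最简单的浅层探针:
    每个类别的中心取该类所有特征向量的均值。"""
    groups = {}
    for f, y in zip(features, labels):
        groups.setdefault(y, []).append(f)
    return {y: [sum(col) / len(fs) for col in zip(*fs)]
            for y, fs in groups.items()}

def predict(centroids, f):
    """把样本归入欧氏距离最近的类中心。"""
    return min(centroids,
               key=lambda y: sum((a - b) ** 2
                                 for a, b in zip(f, centroids[y])))
```

探针越简单,下游任务表现就越能反映预训练表示本身的质量;论文正是用这种"冻结特征 + 浅层模型"的设置来比较Jukebox与传统MIR模型的表示。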
