Finance / Speech / Audio Processing arXiv Digest [7.20]

2021-07-27 11:07:36


q-fin (Quantitative Finance): 11 papers

cs.SD (Sound): 10 papers

eess.AS (Audio and Speech Processing): 10 papers

1. q-fin (Quantitative Finance):

【1】 A New Attempt to Identify Long-term Precursors for Financial Crisis in the Market Correlation Structures

Authors: Anton J. Heckens, Thomas Guhr
Affiliations: Fakultät für Physik, Universität Duisburg–Essen, Duisburg, Germany
Comments: 22 pages, 9 figures
Link: https://arxiv.org/abs/2107.09048
Abstract: Prediction of events in financial markets is every investor's dream and, usually, wishful thinking. From a more general, economic and societal viewpoint, the identification of indicators for large events is highly desirable to assess systemic risks. Unfortunately, the very nature of financial markets, particularly their predominantly non-Markovian character and their non-stationarity, makes this challenge a formidable one, leaving little hope for fully fledged answers. Nevertheless, it is worthwhile to collect pieces of evidence in a variety of observables and assemble them like the pieces of a puzzle that might eventually help to catch a glimpse of long-term indicators or precursors for large events, if at all in a statistical sense. Here, we present a new piece for this puzzle. We use the quasi-stationary market states which exist in the time evolution of the correlation structure in financial markets. Recently, we identified such market states relative to the collective motion of the market as a whole. We study their precursor properties in the US stock markets over 16 years, including two crises, the dot-com bubble burst and the pre-phase of the Lehman Brothers crash. We identify certain interesting features and critically discuss their suitability as indicators.

【2】 Optimal sports betting strategies in practice: an experimental review

Authors: Matej Uhrín, Gustav Šourek, Ondřej Hubáček, Filip Železný
Affiliations: Department of Computer Science, Czech Technical University in Prague
Link: https://arxiv.org/abs/2107.08827
Abstract: We investigate the most popular approaches to the problem of sports betting investment based on modern portfolio theory and the Kelly criterion. We define the problem setting, the formal investment strategies, and review their common modifications used in practice. The underlying purpose of the reviewed modifications is to mitigate the additional risk stemming from the unrealistic mathematical assumptions of the formal strategies. We test the resulting methods using a unified evaluation protocol for three sports: horse racing, basketball and soccer. The results show the practical necessity of the additional risk-control methods and demonstrate their individual benefits. In particular, we show that an adaptive variant of the popular "fractional Kelly" method is a very suitable choice across a wide range of settings.
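
For readers unfamiliar with the terms in this abstract, the following is a minimal sketch of the textbook Kelly stake for a single binary bet and of a fixed "fractional Kelly" scaling. It is not the adaptive variant the paper evaluates; the function names, the decimal-odds convention and the default fraction of 0.5 are assumptions made for illustration.

```python
def kelly_fraction(p_win: float, decimal_odds: float) -> float:
    """Textbook Kelly stake (fraction of bankroll) for one binary bet.

    p_win:        estimated probability that the bet wins
    decimal_odds: bookmaker decimal odds (payout per unit staked, stake included)
    """
    net_odds = decimal_odds - 1.0                      # profit per unit staked on a win
    stake = (p_win * net_odds - (1.0 - p_win)) / net_odds
    return max(stake, 0.0)                             # never bet when the edge is negative


def fractional_kelly(p_win: float, decimal_odds: float, fraction: float = 0.5) -> float:
    """Scale the Kelly stake by a fixed fraction (illustrative default 0.5)."""
    return fraction * kelly_fraction(p_win, decimal_odds)


if __name__ == "__main__":
    print(kelly_fraction(0.6, 2.0))     # full Kelly stake
    print(fractional_kelly(0.6, 2.0))   # half Kelly stake
```

Scaling the stake down trades expected growth for lower variance, which matters when `p_win` is only an estimate.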

【3】 A Data-driven Explainable Case-based Reasoning Approach for Financial Risk Detection

Authors: Wei Li, Florentina Paraschiv, Georgios Sermpinis
Affiliations: NTNU Business School, Norwegian University of Science and Technology, Trondheim, Norway; Institute for Operations Research and Computational Finance, University of St. Gallen, Bodanstrasse, St. Gallen, Switzerland
Link: https://arxiv.org/abs/2107.08808
Abstract: The rapid development of artificial intelligence methods has led to their wide application in forecasting various financial risks in recent years. This study introduces a novel explainable case-based reasoning (CBR) approach that does not require rich expertise in financial risk. Compared with other black-box algorithms, the explainable CBR system allows a natural economic interpretation of results. Indeed, the empirical results emphasize the interpretability of the CBR system in predicting financial risk, which is essential for both financial companies and their customers. In addition, our results show that the proposed automatic design CBR system has a good prediction performance compared to other artificial intelligence methods, overcoming the main drawback of a standard CBR system, namely its strong dependence on prior domain knowledge.

【4】 Stock Movement Prediction with Financial News using Contextualized Embedding from BERT

Authors: Qinkai Chen
Affiliations: Ecole Polytechnique, Route de Saclay, Palaiseau, France; Exoduspoint Capital Management France, Boulevard Haussmann, Paris, France
Comments: 22 pages, 6 figures, 7 tables
Link: https://arxiv.org/abs/2107.08721
Abstract: News events can greatly influence equity markets. In this paper, we are interested in predicting the short-term movement of stock prices after financial news events using only the headlines of the news. To achieve this goal, we introduce a new text mining method called Fine-Tuned Contextualized-Embedding Recurrent Neural Network (FT-CE-RNN). Compared with previous approaches which use static vector representations of the news (static embedding), our model uses contextualized vector representations of the headlines (contextualized embeddings) generated from Bidirectional Encoder Representations from Transformers (BERT). Our model obtains the state-of-the-art result on this stock movement prediction task. It shows significant improvement compared with other baseline models, in both accuracy and trading simulations. Through various trading simulations based on millions of headlines from Bloomberg News, we demonstrate the ability of this model in real scenarios.

【5】 Empirical evidence on the Euler equation for investment in the US

Authors: Guido Ascari, Qazi Haque, Leandro M. Magnusson, Sophocles Mavroeidis
Link: https://arxiv.org/abs/2107.08713
Abstract: The Euler equation model for investment with adjustment costs and variable capital utilization is estimated using aggregate US post-war data with econometric methods that are robust to weak instruments and exploit information in possible structural changes. Various alternative identification assumptions are considered, including external instruments, and instruments obtained from Dynamic Stochastic General Equilibrium models. Results show that the elasticity of capital utilization and investment adjustment cost parameters are very weakly identified. This is because investment appears to be unresponsive to changes in capital utilization and the real interest rate.

【6】 A characterisation of cross-impact kernels

Authors: Mathieu Rosenbaum, Mehdi Tomas
Link: https://arxiv.org/abs/2107.08684
Abstract: Trading a financial asset pushes its price as well as the prices of other assets, a phenomenon known as cross-impact. We consider a general class of kernel-based cross-impact models and investigate suitable parameterisations for trading purposes. We focus on kernels that guarantee that prices are martingales and anticipate future order flow (martingale-admissible kernels) and those that ensure there is no possible price manipulation (no-statistical-arbitrage-admissible kernels). We determine the overlap between these two classes and provide formulas for calibration of cross-impact kernels on data. We illustrate our results using SP500 futures data.

【7】 Monitoring multidimensional phenomena with a multicriteria composite performance interval approach

Authors: Ana Garcia-Bernabeu, Adolfo Hilario-Caballero
Affiliations: Department of Economy and Social Science, Universitat Politècnica de València, Campus d'Alcoi, Spain; Institute of Control Systems and Industrial Computing, Universitat Politècnica de València
Comments: 13 pages, 1 figure
Link: https://arxiv.org/abs/2107.08393
Abstract: In the last two decades, the construction of composite indicators to measure and compare multidimensional phenomena in a broad spectrum of domains has increased considerably. Different methodological approaches are used to summarize huge data sets of information in a single figure. This paper proposes a new approach that consists of computing a multicriteria composite performance interval based on different aggregation rules. The suggested approach provides an additional layer of information, as the performance interval displays a lower bound from a non-compensability perspective and an upper bound allowing for full compensability. The outstanding features of this proposal are: (i) a distance-based multicriteria technique is taken as the baseline to construct the multicriteria performance interval; (ii) the aggregation of distances/separation measures is made using particular cases of Minkowski's $L_p$ metrics; (iii) the span of the multicriteria performance interval can be considered as a sign of the balance among dimensions or indicators.
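
To make the idea of a performance interval built from two Minkowski $L_p$ aggregations concrete, here is a rough sketch. It assumes indicators already normalised to [0, 1] with 1 as the ideal value, equal weights, and distances averaged over the number of indicators; none of these choices is taken from the paper, and the function name is hypothetical.

```python
import numpy as np


def performance_interval(indicators):
    """Return (lower, upper) composite performance for one unit.

    Assumption: each indicator lies in [0, 1] and 1 is the ideal value.
    lower uses the L_infinity distance to the ideal (worst indicator only,
    so good scores cannot compensate for a bad one);
    upper uses the averaged L_1 distance (full compensation).
    """
    values = np.asarray(indicators, dtype=float)
    distance = 1.0 - values                 # per-indicator distance to the ideal
    lower = 1.0 - distance.max()            # p = infinity: non-compensatory
    upper = 1.0 - distance.mean()           # p = 1: fully compensatory
    return float(lower), float(upper)


if __name__ == "__main__":
    low, high = performance_interval([0.9, 0.5, 0.7])
    print(low, high)   # a wide gap signals unbalanced indicators
```

Under these assumptions the interval collapses to a single point exactly when all indicators are equal, which matches the abstract's reading of the span as a sign of balance.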

【8】 A Probit Estimation of Urban Bases of Environmental Awareness: Evidence from Sylhet City, Bangladesh

Authors: Mohammad Masud Alam
Link: https://arxiv.org/abs/2107.08342
Abstract: This paper evaluates the significant factors contributing to environmental awareness among individuals living in the urban area of Sylhet, Bangladesh. Ordered probit (OPM) estimation is applied to the values of ten measures of individual environmental concern. The estimated results of OPM reveal the dominance of higher education, higher income, and full-employment status on environmental concern and environmentally responsible behavior. Younger and more educated respondents tended to be more knowledgeable and concerned than older and less educated respondents. The marginal effects of household size, middle-level income, and part-time employment status of the survey respondents played a less significant role in the degree of environmental awareness. Findings also validate the "age hypothesis" proposed by Van Liere and Dunlap (1980), and the gender effect plays an insignificant role in determining the degree of environmental concern. Environmental awareness among urban individuals with higher income increased linearly with environmental awareness programs, which may have significant policy importance, such as environmental awareness programs for old-aged and less-educated individuals, and may lead to increased taxation on higher income groups to mitigate city areas' pollution problems.

【9】 Automatic Fatou Property of Law-invariant Risk Measures

Authors: Shengzhong Chen, Niushan Gao, Denny Leung, Lei Li
Link: https://arxiv.org/abs/2107.08109
Abstract: In this paper, we show that, on classical model spaces including Orlicz spaces, every real-valued, law-invariant, coherent risk measure automatically has the Fatou property at every point whose negative part has a thin tail.

【10】 Competitive equilibrium always exists for combinatorial auctions with graphical pricing schemes

Authors: Marie-Charlotte Brandenburg, Christian Haase, Ngoc Mai Tran
Comments: 24 pages, 4 figures
Link: https://arxiv.org/abs/2107.08813
Abstract: We show that a competitive equilibrium always exists in combinatorial auctions with anonymous graphical valuations and pricing, using discrete geometry. This is an intuitive and easy-to-construct class of valuations that can model both complementarity and substitutes, and to our knowledge, it is the first class besides gross substitutes that has a guaranteed competitive equilibrium. We prove through counter-examples that our result is tight, and we give explicit algorithms for constructing competitive pricing vectors. We also give extensions to multi-unit combinatorial auctions (also known as product-mix auctions). Combined with theorems on graphical valuations and pricing equilibrium of Candogan, Ozdaglar and Parrilo, our results indicate that quadratic pricing is a highly practical method to run combinatorial auctions.

【11】 Global Gridded Daily CO2 Emissions

Authors: Xinyu Dou, Yilong Wang, Philippe Ciais, Frédéric Chevallier, Steven J. Davis, Monica Crippa, Greet Janssens-Maenhout, Diego Guizzardi, Efisio Solazzo, Feifan Yan, Da Huo, Zheng Bo, Zhu Deng, Biqing Zhu, Hengqi Wang, Qiang Zhang, Pierre Gentine, Zhu Liu
Affiliations: Department of Earth System Science, Tsinghua University, Beijing, China; Key Laboratory of Land Surface Pattern and Simulation, Institute of Geographical Sciences and Natural Resources Research, Chinese Academy of Sciences, Beijing
Link: https://arxiv.org/abs/2107.08586
Abstract: Precise and high-resolution carbon dioxide (CO2) emission data is of great importance for achieving carbon neutrality around the world. Here we present for the first time the near-real-time Global Gridded Daily CO2 Emission Datasets (called GRACED) from fossil fuel and cement production with a global spatial resolution of 0.1° by 0.1° and a temporal resolution of 1 day. Gridded fossil emissions are computed for different sectors based on the daily national CO2 emissions from a near-real-time dataset (Carbon Monitor), the spatial patterns of the point source emission dataset Global Carbon Grid (GID) and the Emission Database for Global Atmospheric Research (EDGAR), and the spatiotemporal patterns of satellite nitrogen dioxide (NO2) retrievals. Our study on the global CO2 emissions responds to the growing and urgent need for high-quality, fine-grained near-real-time CO2 emissions estimates to support global emissions monitoring across various spatial scales. We show the spatial patterns of emission changes for power, industry, residential consumption, ground transportation, domestic and international aviation, and international shipping sectors between 2019 and 2020. This helps us gain insight into the relative contributions of various sectors and provides a faster and more fine-grained overview of where and when fossil CO2 emissions have decreased and rebounded in response to emergencies (e.g. COVID-19) and other disturbances of human activities than any previously published dataset. As the world recovers from the pandemic and decarbonizes its energy systems, regular updates of this dataset will allow policymakers to more closely monitor the effectiveness of climate and energy policies and quickly adapt.
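
One simple way to picture how a national daily total can be turned into a gridded field is proportional allocation against a spatial proxy map. The sketch below illustrates that generic idea only; the abstract does not state that GRACED uses exactly this rule, and the function name and toy numbers are hypothetical.

```python
import numpy as np


def downscale_to_grid(national_total: float, proxy) -> np.ndarray:
    """Distribute a national total over grid cells in proportion to a proxy.

    national_total: one country's emissions for one day and one sector
    proxy:          2-D array of non-negative weights per grid cell
                    (for example, a prior emission map); hypothetical input
    """
    weights = np.asarray(proxy, dtype=float)
    total_weight = weights.sum()
    if total_weight <= 0:
        raise ValueError("proxy must contain at least one positive cell")
    return national_total * weights / total_weight


if __name__ == "__main__":
    grid = downscale_to_grid(100.0, [[1, 1], [2, 0]])
    print(grid)          # cell values sum back to the national total
```

Because the weights are normalised, the gridded values always add up to the national figure, so the spatial detail comes entirely from the proxy.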

2. cs.SD (Sound):

【1】 Over-Parameterization and Generalization in Audio Classification

Authors: Khaled Koutini, Hamid Eghbal-zadeh, Florian Henkel, Jan Schlüter, Gerhard Widmer
Affiliations: Institute of Computational Perception
Comments: Presented at the ICML 2021 Workshop on Overparameterization: Pitfalls & Opportunities
Link: https://arxiv.org/abs/2107.08933
Abstract: Convolutional Neural Networks (CNNs) have been dominating classification tasks in various domains, such as machine vision, machine listening, and natural language processing. In machine listening, while generally exhibiting very good generalization capabilities, CNNs are sensitive to the specific audio recording device used, which has been recognized as a substantial problem in the acoustic scene classification (DCASE) community. In this study, we investigate the relationship between over-parameterization of acoustic scene classification models and their resulting generalization abilities. Specifically, we test scaling CNNs in width and depth, under different conditions. Our results indicate that increasing width improves generalization to unseen devices, even without an increase in the number of parameters.

【2】 Measuring a Six-hole Recorder Flute's Response to Breath Pressure Variations and Fitting a Model

Authors: Daniel Chin, Gus Xia
Affiliations: NYU Shanghai
Link: https://arxiv.org/abs/2107.08727
Abstract: We propose the Siamese-flute method that measures the breath pressure and the acoustic sound in parallel. We fit a 6-DoF model to describe how the breath pressure affects the octave and the microtonal pitch bend, revealing the octave hysteresis. We release both our model parameters and our data analysis tools.

【3】 Translatotron 2: Robust direct speech-to-speech translation

Authors: Ye Jia, Michelle Tadmor Ramanovich, Tal Remez, Roi Pomerantz
Affiliations: Google Research
Link: https://arxiv.org/abs/2107.08661
Abstract: We present Translatotron 2, a neural direct speech-to-speech translation model that can be trained end-to-end. Translatotron 2 consists of a speech encoder, a phoneme decoder, a mel-spectrogram synthesizer, and an attention module that connects the previous three components. Experimental results suggest that Translatotron 2 outperforms the original Translatotron by a large margin in terms of translation quality and predicted speech naturalness, and drastically improves the robustness of the predicted speech by mitigating over-generation, such as babbling or long pauses. We also propose a new method for retaining the source speaker's voice in the translated speech. The trained model is restricted to retain the source speaker's voice, and unlike the original Translatotron, it is not able to generate speech in a different speaker's voice, making the model more robust for production deployment by mitigating potential misuse for creating spoofing audio artifacts. When the new method is used together with a simple concatenation-based data augmentation, the trained Translatotron 2 model is able to retain each speaker's voice for input with speaker turns.

【4】 An Improved StarGAN for Emotional Voice Conversion: Enhancing Voice Quality and Data Augmentation

Authors: Xiangheng He, Junjie Chen, Georgios Rizos, Björn W. Schuller
Affiliations: GLAM – Group on Language, Audio, & Music, Imperial College London, UK; Department of Computer Science, The University of Tokyo, Japan; Chair of Embedded Intelligence for Health Care and Wellbeing, University of Augsburg, Germany
Comments: Accepted by Interspeech 2021
Link: https://arxiv.org/abs/2107.08361
Abstract: Emotional Voice Conversion (EVC) aims to convert the emotional style of a source speech signal to a target style while preserving its content and speaker identity information. Previous emotional conversion studies do not disentangle emotional information from emotion-independent information that should be preserved, thus transforming it all in a monolithic manner and generating audio of low quality, with linguistic distortions. To address this distortion problem, we propose a novel StarGAN framework along with a two-stage training process that separates emotional features from those independent of emotion by using an autoencoder with two encoders as the generator of the Generative Adversarial Network (GAN). The proposed model achieves favourable results in both the objective evaluation and the subjective evaluation in terms of distortion, which reveals that the proposed model can effectively reduce distortion. Furthermore, in data augmentation experiments for end-to-end speech emotion recognition, the proposed StarGAN model achieves an increase of 2% in Micro-F1 and 5% in Macro-F1 compared to the baseline StarGAN model, which indicates that the proposed model is more valuable for data augmentation.
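
The abstract reports gains in both Micro-F1 and Macro-F1, which can move differently on imbalanced emotion classes. The sketch below computes both from scratch using their standard definitions; the function name and the toy labels are illustrative and are not taken from the paper's evaluation code.

```python
def micro_macro_f1(y_true, y_pred):
    """Return (micro_f1, macro_f1) for single-label multi-class predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp = {c: 0 for c in labels}
    fp = {c: 0 for c in labels}
    fn = {c: 0 for c in labels}
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1      # predicted class gains a false positive
            fn[truth] += 1     # true class gains a false negative

    def f1(t, p, n):
        denom = 2 * t + p + n
        return 2 * t / denom if denom else 0.0

    # Macro: average the per-class F1 scores, so rare classes count equally.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    # Micro: pool the counts over all classes before computing F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return micro, macro


if __name__ == "__main__":
    truth = ["happy", "happy", "happy", "sad"]
    pred = ["happy", "happy", "happy", "happy"]
    print(micro_macro_f1(truth, pred))
```

In the toy example the classifier ignores the rare class entirely, so the macro score falls well below the micro score.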

【5】 Exploring the Potential of Lexical Paraphrases for Mitigating Noise-Induced Comprehension Errors

Authors: Anupama Chingacham, Vera Demberg, Dietrich Klakow
Affiliations: Saarland Informatics Campus, Saarland University, Germany
Comments: Accepted in Interspeech 2021
Link: https://arxiv.org/abs/2107.08337
Abstract: Listening in noisy environments can be difficult even for individuals with normal hearing thresholds. The speech signal can be masked by noise, which may lead to word misperceptions on the side of the listener, and overall difficulty in understanding the message. To mitigate hearing difficulties for listeners, a co-operative speaker utilizes voice modulation strategies like Lombard speech to generate noise-robust utterances, and similar solutions have been developed for speech synthesis systems. In this work, we propose an alternate solution of choosing noise-robust lexical paraphrases to represent an intended meaning. Our results show that lexical paraphrases differ in their intelligibility in noise. We evaluate the intelligibility of synonyms in context and find that choosing a lexical unit that is less likely to be misheard than its synonym introduced an average gain in comprehension of 37% at SNR -5 dB and 21% at SNR 0 dB for babble noise.

【6】 Learning De-identified Representations of Prosody from Raw Audio

Authors: Jack Weston, Raphael Lenain, Udeepa Meepegama, Emil Fristed
Link: https://arxiv.org/abs/2107.08248
Abstract: We propose a method for learning de-identified prosody representations from raw audio using a contrastive self-supervised signal. Whereas prior work has relied on conditioning models on bottlenecks, we introduce a set of inductive biases that exploit the natural structure of prosody to minimize timbral information and decouple prosody from speaker representations. Despite aggressive downsampling of the input and having no access to linguistic information, our model performs comparably to state-of-the-art speech representations on DAMMP, a new benchmark we introduce for spoken language understanding. We use minimum description length probing to show that our representations have selectively learned the subcomponents of non-timbral prosody, and that the product quantizer naturally disentangles them without using bottlenecks. We derive an information-theoretic definition of speech de-identifiability and use it to demonstrate that our prosody representations are less identifiable than other speech representations.

【7】 A Comparison of Methods for OOV-word Recognition on a New Public Dataset

Authors: Rudolf A. Braun, Srikanth Madikeri, Petr Motlicek
Affiliations: Idiap Research Institute, Martigny, Switzerland
Link: https://arxiv.org/abs/2107.08091
Abstract: A common problem for automatic speech recognition systems is how to recognize words that they did not see during training. Currently there is no established method of evaluating different techniques for tackling this problem. We propose using the CommonVoice dataset to create test sets for multiple languages which have a high out-of-vocabulary (OOV) ratio relative to a training set and release a new tool for calculating relevant performance metrics. We then evaluate, within the context of a hybrid ASR system, how much better subword models are at recognizing OOVs, and how much benefit one can get from incorporating OOV-word information into an existing system by modifying WFSTs. Additionally, we propose a new method for modifying a subword-based language model so as to better recognize OOV-words. We showcase very large improvements in OOV-word recognition and make both the data and code available.
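
The out-of-vocabulary (OOV) ratio mentioned in this abstract can be illustrated with a few lines of code. The sketch assumes a token-level definition with whitespace tokenisation and lowercasing; the paper's released tool may define and compute the ratio differently, and the function name is hypothetical.

```python
def oov_ratio(train_sentences, test_sentences) -> float:
    """Fraction of test tokens that never occur in the training text.

    Assumption: tokens are lowercased, whitespace-separated words.
    """
    vocabulary = {word for line in train_sentences for word in line.lower().split()}
    test_words = [word for line in test_sentences for word in line.lower().split()]
    if not test_words:
        return 0.0
    unseen = sum(1 for word in test_words if word not in vocabulary)
    return unseen / len(test_words)


if __name__ == "__main__":
    train = ["the cat sat"]
    test = ["the dog sat down"]
    print(oov_ratio(train, test))   # "dog" and "down" are unseen
```

A type-level variant (counting distinct unseen words rather than tokens) would only require replacing the test list with a set.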

【8】 On the Veracity of Local, Model-agnostic Explanations in Audio Classification: Targeted Investigations with Adversarial Examples

Authors: Verena Praher, Katharina Prinz, Arthur Flexer, Gerhard Widmer
Affiliations: Johannes Kepler University Linz, Austria; LIT AI Lab, Linz Institute of Technology, Austria
Comments: 8 pages, 4 figures, to be published in Proceedings of the International Society for Music Information Retrieval Conference 2021 (ISMIR 2021)
Link: https://arxiv.org/abs/2107.09045
Abstract: Local explanation methods such as LIME have become popular in MIR as tools for generating post-hoc, model-agnostic explanations of a model's classification decisions. The basic idea is to identify a small set of human-understandable features of the classified example that are most influential on the classifier's prediction. These are then presented as an explanation. Evaluation of such explanations in publications often resorts to accepting what matches the expectation of a human without actually being able to verify if what the explanation shows is what really caused the model's prediction. This paper reports on targeted investigations where we try to get more insight into the actual veracity of LIME's explanations in an audio classification task. We deliberately design adversarial examples for the classifier, in a way that gives us knowledge about which parts of the input are potentially responsible for the model's (wrong) prediction. Asking LIME to explain the predictions for these adversaries permits us to study whether local explanations do indeed detect these regions of interest. We also look at whether LIME is more successful in finding perturbations that are more prominent and easily noticeable for a human. Our results suggest that LIME does not necessarily manage to identify the most relevant input features and hence it remains unclear whether explanations are useful or even misleading.

【9】 Channel-wise Gated Res2Net: Towards Robust Detection of Synthetic Speech Attacks

Authors: Xu Li, Xixin Wu, Hui Lu, Xunying Liu, Helen Meng
Affiliations: Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong; Department of Engineering, University of Cambridge
Comments: Accepted to INTERSPEECH 2021
Link: https://arxiv.org/abs/2107.08803
Abstract: Existing approaches for anti-spoofing in automatic speaker verification (ASV) still lack generalizability to unseen attacks. The Res2Net approach designs a residual-like connection between feature groups within one block, which increases the possible receptive fields and improves the system's detection generalizability. However, such a residual-like connection is performed by a direct addition between feature groups without channel-wise priority. We argue that the information across channels may not contribute to spoofing cues equally, and the less relevant channels should be suppressed before being added onto the next feature group, so that the system can generalize better to unseen attacks. This argument motivates the current work, which presents a novel channel-wise gated Res2Net (CG-Res2Net) that modifies Res2Net to enable a channel-wise gating mechanism in the connection between feature groups. This gating mechanism dynamically selects channel-wise features based on the input, to suppress the less relevant channels and enhance the detection generalizability. Three gating mechanisms with different structures are proposed and integrated into Res2Net. Experimental results on ASVspoof 2019 logical access (LA) demonstrate that the proposed CG-Res2Net significantly outperforms Res2Net on both the overall LA evaluation set and individual difficult unseen attacks, and also outperforms other state-of-the-art single systems, demonstrating the effectiveness of our method.
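
The contrast between a plain addition of feature groups and a channel-wise gated addition can be sketched in a few lines of NumPy. This is a generic squeeze-and-gate illustration, not one of the three gating structures proposed in the paper; the function names, tensor layout and the single linear layer with `weight` and `bias` are assumptions.

```python
import numpy as np


def channel_gate(features, weight, bias):
    """Return one gate value in (0, 1) per channel.

    features: array of shape (channels, time, freq)
    weight:   hypothetical learned matrix of shape (channels, channels)
    bias:     hypothetical learned vector of shape (channels,)
    """
    pooled = features.mean(axis=(1, 2))        # global average pool per channel
    logits = weight @ pooled + bias
    return 1.0 / (1.0 + np.exp(-logits))       # sigmoid keeps gates in (0, 1)


def gated_add(prev_output, next_group, weight, bias):
    """Scale each channel of the previous group before adding it to the next."""
    gate = channel_gate(prev_output, weight, bias)
    return next_group + gate[:, None, None] * prev_output


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    prev = rng.standard_normal((4, 3, 2))
    nxt = rng.standard_normal((4, 3, 2))
    out = gated_add(prev, nxt, rng.standard_normal((4, 4)), np.zeros(4))
    print(out.shape)
```

Setting every gate to 1 recovers the direct addition described for the original Res2Net connection, so the gate only adds the ability to down-weight channels.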

【10】 Residual Attention Based Network for Automatic Classification of Phonation Modes

Authors: Xiaoheng Sun, Yiliang Jiang, Wei Li Affiliations: School of Computer Science and Technology, Fudan University, Shanghai, China; Shanghai Key Laboratory of Intelligent Information Processing, Fudan University, Shanghai, China Link: https://arxiv.org/abs/2107.08425 Abstract: Phonation mode is an essential characteristic of singing style as well as an important expression of performance. It can be classified into four categories, called neutral, breathy, pressed and flow. Previous studies used voice quality features and feature engineering for classification. While deep learning has achieved significant progress in other fields of music information retrieval (MIR), there are few attempts in the classification of phonation modes. In this study, a Residual Attention based network is proposed for automatic classification of phonation modes. The network consists of a convolutional network performing feature processing and a soft mask branch enabling the network to focus on a specific area. In comparison experiments, models with the proposed network achieve better results than previous works on three of the four datasets, among which the highest classification accuracy is 94.58%, 2.29% higher than the baseline.

3. eess.AS Audio Processing:

【1】 On the Veracity of Local, Model-agnostic Explanations in Audio Classification: Targeted Investigations with Adversarial Examples

Authors: Verena Praher, Katharina Prinz, Arthur Flexer, Gerhard Widmer Affiliations: Johannes Kepler University Linz, Austria; LIT AI Lab, Linz Institute of Technology, Austria Comments: 8 pages, 4 figures, to be published in Proceedings of the International Society for Music Information Retrieval Conference 2021 (ISMIR 2021) Link: https://arxiv.org/abs/2107.09045 Abstract: Local explanation methods such as LIME have become popular in MIR as tools for generating post-hoc, model-agnostic explanations of a model's classification decisions. The basic idea is to identify a small set of human-understandable features of the classified example that are most influential on the classifier's prediction. These are then presented as an explanation. Evaluation of such explanations in publications often resorts to accepting what matches the expectation of a human without actually being able to verify if what the explanation shows is what really caused the model's prediction. This paper reports on targeted investigations where we try to get more insight into the actual veracity of LIME's explanations in an audio classification task. We deliberately design adversarial examples for the classifier, in a way that gives us knowledge about which parts of the input are potentially responsible for the model's (wrong) prediction. Asking LIME to explain the predictions for these adversaries permits us to study whether local explanations do indeed detect these regions of interest. We also look at whether LIME is more successful in finding perturbations that are more prominent and easily noticeable for a human. Our results suggest that LIME does not necessarily manage to identify the most relevant input features and hence it remains unclear whether explanations are useful or even misleading.

【2】 Channel-wise Gated Res2Net: Towards Robust Detection of Synthetic Speech Attacks

Authors: Xu Li, Xixin Wu, Hui Lu, Xunying Liu, Helen Meng Affiliations: Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong; Department of Engineering, University of Cambridge Comments: Accepted to INTERSPEECH 2021 Link: https://arxiv.org/abs/2107.08803 Abstract: Existing approaches for anti-spoofing in automatic speaker verification (ASV) still lack generalizability to unseen attacks. The Res2Net approach designs a residual-like connection between feature groups within one block, which increases the possible receptive fields and improves the system's detection generalizability. However, such a residual-like connection is performed by a direct addition between feature groups without channel-wise priority. We argue that the information across channels may not contribute to spoofing cues equally, and the less relevant channels are expected to be suppressed before adding onto the next feature group, so that the system can generalize better to unseen attacks. This argument motivates the current work that presents a novel, channel-wise gated Res2Net (CG-Res2Net), which modifies Res2Net to enable a channel-wise gating mechanism in the connection between feature groups. This gating mechanism dynamically selects channel-wise features based on the input, to suppress the less relevant channels and enhance the detection generalizability. Three gating mechanisms with different structures are proposed and integrated into Res2Net. Experiments conducted on ASVspoof 2019 logical access (LA) demonstrate that the proposed CG-Res2Net significantly outperforms Res2Net on both the overall LA evaluation set and individual difficult unseen attacks, and also outperforms other state-of-the-art single systems, demonstrating the effectiveness of the method.

【3】 Residual Attention Based Network for Automatic Classification of Phonation Modes

Authors: Xiaoheng Sun, Yiliang Jiang, Wei Li Affiliations: School of Computer Science and Technology, Fudan University, Shanghai, China; Shanghai Key Laboratory of Intelligent Information Processing, Fudan University, Shanghai, China Link: https://arxiv.org/abs/2107.08425 Abstract: Phonation mode is an essential characteristic of singing style as well as an important expression of performance. It can be classified into four categories, called neutral, breathy, pressed and flow. Previous studies used voice quality features and feature engineering for classification. While deep learning has achieved significant progress in other fields of music information retrieval (MIR), there are few attempts in the classification of phonation modes. In this study, a Residual Attention based network is proposed for automatic classification of phonation modes. The network consists of a convolutional network performing feature processing and a soft mask branch enabling the network to focus on a specific area. In comparison experiments, models with the proposed network achieve better results than previous works on three of the four datasets, among which the highest classification accuracy is 94.58%, 2.29% higher than the baseline.

【4】 Over-Parameterization and Generalization in Audio Classification

Authors: Khaled Koutini, Hamid Eghbal-zadeh, Florian Henkel, Jan Schlüter, Gerhard Widmer Affiliations: Institute of Computational Perception Comments: Presented at the ICML 2021 Workshop on Overparameterization: Pitfalls & Opportunities Link: https://arxiv.org/abs/2107.08933 Abstract: Convolutional Neural Networks (CNNs) have been dominating classification tasks in various domains, such as machine vision, machine listening, and natural language processing. In machine listening, while generally exhibiting very good generalization capabilities, CNNs are sensitive to the specific audio recording device used, which has been recognized as a substantial problem in the acoustic scene classification (DCASE) community. In this study, we investigate the relationship between over-parameterization of acoustic scene classification models and their resulting generalization abilities. Specifically, we test scaling CNNs in width and depth, under different conditions. Our results indicate that increasing width improves generalization to unseen devices, even without an increase in the number of parameters.

【5】 Measuring a Six-hole Recorder Flute's Response to Breath Pressure Variations and Fitting a Model

Authors: Daniel Chin, Gus Xia Affiliations: NYU Shanghai Link: https://arxiv.org/abs/2107.08727 Abstract: We propose the Siamese-flute method that measures the breath pressure and the acoustic sound in parallel. We fit a 6-DoF model to describe how the breath pressure affects the octave and the microtonal pitch bend, revealing the octave hysteresis. We release both our model parameters and our data analysis tools.

【6】 Translatotron 2: Robust direct speech-to-speech translation

Authors: Ye Jia, Michelle Tadmor Ramanovich, Tal Remez, Roi Pomerantz Affiliations: Google Research Link: https://arxiv.org/abs/2107.08661 Abstract: We present Translatotron 2, a neural direct speech-to-speech translation model that can be trained end-to-end. Translatotron 2 consists of a speech encoder, a phoneme decoder, a mel-spectrogram synthesizer, and an attention module that connects the previous three components. Experimental results suggest that Translatotron 2 outperforms the original Translatotron by a large margin in terms of translation quality and predicted speech naturalness, and drastically improves the robustness of the predicted speech by mitigating over-generation, such as babbling or long pauses. We also propose a new method for retaining the source speaker's voice in the translated speech. The trained model is restricted to retain the source speaker's voice, and unlike the original Translatotron, it is not able to generate speech in a different speaker's voice, making the model more robust for production deployment, by mitigating potential misuse for creating spoofing audio artifacts. When the new method is used together with a simple concatenation-based data augmentation, the trained Translatotron 2 model is able to retain each speaker's voice for input with speaker turns.

【7】 An Improved StarGAN for Emotional Voice Conversion: Enhancing Voice Quality and Data Augmentation

Authors: Xiangheng He, Junjie Chen, Georgios Rizos, Björn W. Schuller Affiliations: GLAM – Group on Language, Audio, & Music, Imperial College London, UK; Department of Computer Science, The University of Tokyo, Japan; Chair of Embedded Intelligence for Health Care and Wellbeing, University of Augsburg, Germany Comments: Accepted by Interspeech 2021 Link: https://arxiv.org/abs/2107.08361 Abstract: Emotional Voice Conversion (EVC) aims to convert the emotional style of a source speech signal to a target style while preserving its content and speaker identity information. Previous emotional conversion studies do not disentangle emotional information from emotion-independent information that should be preserved, thus transforming it all in a monolithic manner and generating audio of low quality, with linguistic distortions. To address this distortion problem, we propose a novel StarGAN framework along with a two-stage training process that separates emotional features from those independent of emotion by using an autoencoder with two encoders as the generator of the Generative Adversarial Network (GAN). The proposed model achieves favourable results in both the objective evaluation and the subjective evaluation in terms of distortion, which reveals that the proposed model can effectively reduce distortion. Furthermore, in data augmentation experiments for end-to-end speech emotion recognition, the proposed StarGAN model achieves an increase of 2% in Micro-F1 and 5% in Macro-F1 compared to the baseline StarGAN model, which indicates that the proposed model is more valuable for data augmentation.
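The two metrics named in this abstract can diverge on imbalanced emotion classes, which a short calculation makes concrete. The sketch below uses the standard definitions of Micro-F1 and Macro-F1; the labels and predictions are invented for illustration and have nothing to do with the paper's data.

```python
def f1_scores(y_true, y_pred):
    """Return (micro_f1, macro_f1) for single-label multi-class predictions."""
    classes = sorted(set(y_true) | set(y_pred))
    tp_sum = fp_sum = fn_sum = 0
    per_class = []
    for cls in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
        denom = 2 * tp + fp + fn
        per_class.append(2 * tp / denom if denom else 0.0)
        tp_sum, fp_sum, fn_sum = tp_sum + tp, fp_sum + fp, fn_sum + fn
    # Micro-F1 pools the counts over all classes before computing F1.
    micro = 2 * tp_sum / (2 * tp_sum + fp_sum + fn_sum)
    # Macro-F1 averages the per-class F1 scores, so rare classes weigh as much as common ones.
    macro = sum(per_class) / len(per_class)
    return micro, macro

# Invented, imbalanced toy labels: the single "angry" sample is misclassified.
y_true = ["neutral", "neutral", "neutral", "neutral", "angry", "sad"]
y_pred = ["neutral", "neutral", "neutral", "neutral", "neutral", "sad"]
micro, macro = f1_scores(y_true, y_pred)
```

In this toy case Micro-F1 is 5/6 while Macro-F1 is lower, because the missed minority class pulls the per-class average down. That is why a gain in Macro-F1 is often read as better handling of rare classes.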

【8】 Exploring the Potential of Lexical Paraphrases for Mitigating Noise-Induced Comprehension Errors

Authors: Anupama Chingacham, Vera Demberg, Dietrich Klakow Affiliations: Saarland Informatics Campus, Saarland University, Germany Comments: Accepted in Interspeech 2021 Link: https://arxiv.org/abs/2107.08337 Abstract: Listening in noisy environments can be difficult even for individuals with normal hearing thresholds. The speech signal can be masked by noise, which may lead to word misperceptions on the side of the listener, and overall difficulty understanding the message. To mitigate hearing difficulties for listeners, a co-operative speaker utilizes voice modulation strategies like Lombard speech to generate noise-robust utterances, and similar solutions have been developed for speech synthesis systems. In this work, we propose an alternative solution of choosing noise-robust lexical paraphrases to represent an intended meaning. Our results show that lexical paraphrases differ in their intelligibility in noise. We evaluate the intelligibility of synonyms in context and find that choosing a lexical unit that is less risky to be misheard than its synonym introduces an average gain in comprehension of 37% at SNR -5 dB and 21% at SNR 0 dB for babble noise.
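For readers unfamiliar with the SNR conditions quoted here, the following sketch shows one common way to mix a signal with noise at a chosen SNR. It is not taken from the paper: the sine "speech" and the deterministic stand-in for babble noise are placeholders, and `mix_at_snr` is a hypothetical helper.

```python
import math

def power(signal):
    """Mean squared amplitude of a signal."""
    return sum(x * x for x in signal) / len(signal)

def mix_at_snr(speech, noise, snr_db):
    """Scale the noise so that 10*log10(P_speech / P_noise) equals snr_db, then add it."""
    target_noise_power = power(speech) / (10 ** (snr_db / 10))
    scale = math.sqrt(target_noise_power / power(noise))
    scaled_noise = [scale * n for n in noise]
    mixture = [s + n for s, n in zip(speech, scaled_noise)]
    return mixture, scaled_noise

# Placeholder signals: a sine wave for speech and a deterministic sequence for noise.
speech = [math.sin(2 * math.pi * 5 * t / 100) for t in range(100)]
noise = [((37 * t) % 11 - 5) / 5 for t in range(100)]

mixture, scaled_noise = mix_at_snr(speech, noise, snr_db=-5)
achieved_snr_db = 10 * math.log10(power(speech) / power(scaled_noise))
```

At -5 dB the noise carries roughly three times the power of the speech, so the -5 dB condition is the harder of the two listening conditions reported.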

【9】 Learning De-identified Representations of Prosody from Raw Audio

Authors: Jack Weston, Raphael Lenain, Udeepa Meepegama, Emil Fristed Link: https://arxiv.org/abs/2107.08248 Abstract: We propose a method for learning de-identified prosody representations from raw audio using a contrastive self-supervised signal. Whereas prior work has relied on conditioning models on bottlenecks, we introduce a set of inductive biases that exploit the natural structure of prosody to minimize timbral information and decouple prosody from speaker representations. Despite aggressive downsampling of the input and having no access to linguistic information, our model performs comparably to state-of-the-art speech representations on DAMMP, a new benchmark we introduce for spoken language understanding. We use minimum description length probing to show that our representations have selectively learned the subcomponents of non-timbral prosody, and that the product quantizer naturally disentangles them without using bottlenecks. We derive an information-theoretic definition of speech de-identifiability and use it to demonstrate that our prosody representations are less identifiable than other speech representations.

【10】 A Comparison of Methods for OOV-word Recognition on a New Public Dataset

Authors: Rudolf A. Braun, Srikanth Madikeri, Petr Motlicek Affiliations: Idiap Research Institute, Martigny, Switzerland Link: https://arxiv.org/abs/2107.08091 Abstract: A common problem for automatic speech recognition systems is how to recognize words that they did not see during training. Currently there is no established method of evaluating different techniques for tackling this problem. We propose using the CommonVoice dataset to create test sets for multiple languages which have a high out-of-vocabulary (OOV) ratio relative to a training set and release a new tool for calculating relevant performance metrics. We then evaluate, within the context of a hybrid ASR system, how much better subword models are at recognizing OOVs, and how much benefit one can get from incorporating OOV-word information into an existing system by modifying WFSTs. Additionally, we propose a new method for modifying a subword-based language model so as to better recognize OOV-words. We showcase very large improvements in OOV-word recognition and make both the data and code available.
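The OOV ratio of a test set relative to a training set can be sketched as below. This is only one plausible token-level definition with whitespace tokenization and lowercasing; the paper's released tool may define and compute it differently, and `oov_ratio` is a hypothetical helper name.

```python
def oov_ratio(train_sentences, test_sentences):
    """Return the fraction of test tokens absent from the training vocabulary, plus the OOV word types."""
    vocab = {w for s in train_sentences for w in s.lower().split()}
    test_tokens = [w for s in test_sentences for w in s.lower().split()]
    if not test_tokens:
        raise ValueError("test set contains no tokens")
    oov_tokens = [w for w in test_tokens if w not in vocab]
    return len(oov_tokens) / len(test_tokens), sorted(set(oov_tokens))

# Invented toy corpora for illustration only.
train = ["the cat sat on the mat", "a dog barked"]
test = ["the cat barked at a zebra"]
ratio, oov_words = oov_ratio(train, test)
```

Here two of the six test tokens ("at" and "zebra") never occur in the training sentences, giving a ratio of one third. A test set built to stress OOV recognition would be selected so that this ratio is high.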
