Visit www.arxivdaily.com for the daily digest with abstracts, covering CS, Physics, Math, Economics, Statistics, Finance, Biology, and Electrical Engineering, plus search, bookmarking, and posting features.
q-fin (Quantitative Finance): 7 papers
cs.SD (Sound): 11 papers
eess.AS (Audio and Speech Processing): 10 papers
1. q-fin (Quantitative Finance):
【1】 Robust Decisions for Heterogeneous Agents via Certainty Equivalents
Authors: Anne G. Balter, Nikolaus Schweizer
Link: https://arxiv.org/abs/2106.13059
Abstract: We study the problem of a planner who resolves risk-return trade-offs - like financial investment decisions - on behalf of a collective of agents with heterogeneous risk preferences. The planner's objective is a two-stage utility functional where an outer utility function is applied to the distribution of the agents' certainty equivalents from a given decision. Assuming lognormal risks and heterogeneous power utility preferences for the agents, we characterize optimal behavior in a setting where the planner can let each agent choose between different options from a fixed menu of possible decisions, leading to a grouping of the agents by risk preferences. These optimal decision menus are derived first for the case where the planner knows the distribution of preferences exactly and then for a case where he faces uncertainty about this distribution, only having access to upper and lower bounds on agents' relative risk aversion. Finally, we provide tight bounds on the welfare loss from offering a finite menu of choices rather than fully personalized decisions.
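For readers new to the setting, the certainty equivalent in the title has a well-known closed form under exactly the lognormal/power-utility assumptions the abstract names. The following is a standard textbook derivation, not taken from the paper:

```latex
% Power (CRRA) utility with relative risk aversion \gamma \neq 1:
%   u(w) = \frac{w^{1-\gamma}}{1-\gamma}.
% For lognormal wealth \ln W \sim \mathcal{N}(\mu, \sigma^2):
%   \mathbb{E}[u(W)] = \frac{1}{1-\gamma}
%       \exp\big( (1-\gamma)\mu + \tfrac{1}{2}(1-\gamma)^2 \sigma^2 \big),
% hence the certainty equivalent
\mathrm{CE}_{\gamma}(W)
  = u^{-1}\big(\mathbb{E}[u(W)]\big)
  = \exp\!\Big( \mu + \frac{1-\gamma}{2}\,\sigma^2 \Big).
% CE decreases in \gamma: more risk-averse agents accept a smaller sure amount,
% which is what lets the planner group agents by their relative risk aversion.
```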
【2】 Fund2Vec: Mutual Funds Similarity using Graph Learning
Authors: Vipul Satone, Dhruv Desai, Dhagash Mehta
Affiliations: The Vanguard Group, Inc.
Comments: 2-column format, 8 pages, 8 figures, 5 tables
Link: https://arxiv.org/abs/2106.12987
Abstract: Identifying similar mutual funds with respect to their underlying portfolios has many applications in financial services, ranging from fund recommender systems and competitor analysis to portfolio analytics, marketing, and sales. Traditional methods are either qualitative, and hence prone to bias and often not reproducible, or are known not to capture all the nuances (non-linearities) among the portfolios from the raw data. We propose a radically new approach to identify similar funds based on a weighted bipartite network representation of funds and their underlying assets data, using a sophisticated machine learning method called Node2Vec which learns an embedded low-dimensional representation of the network. We call the embedding Fund2Vec. Ours is the first study of the weighted bipartite network representation of the funds-assets network in its original form that identifies structural similarity among portfolios, as opposed to merely portfolio overlaps.
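As a rough illustration of the recipe (not the authors' code: it uses plain weight-biased first-order random walks instead of Node2Vec's biased second-order walks, and the tickers and weights below are made up):

```python
import random
import networkx as nx
from gensim.models import Word2Vec

# Toy weighted bipartite graph: funds linked to assets by portfolio weight.
edges = [("fundA", "AAPL", 0.6), ("fundA", "MSFT", 0.4),
         ("fundB", "AAPL", 0.5), ("fundB", "GOOG", 0.5),
         ("fundC", "MSFT", 0.7), ("fundC", "TSLA", 0.3)]
G = nx.Graph()
G.add_weighted_edges_from(edges)

def weighted_walk(graph, start, length):
    """One random walk, stepping to neighbours proportionally to edge weight."""
    walk = [start]
    for _ in range(length - 1):
        nbrs = list(graph[walk[-1]])
        weights = [graph[walk[-1]][n]["weight"] for n in nbrs]
        walk.append(random.choices(nbrs, weights=weights, k=1)[0])
    return walk

walks = [weighted_walk(G, node, 10) for node in G.nodes for _ in range(50)]

# Skip-gram over the walks gives a low-dimensional vector per node; fund-fund
# cosine similarity in this space is the "Fund2Vec" notion of similarity.
model = Word2Vec(walks, vector_size=32, window=3, min_count=1, sg=1, epochs=5)
print(model.wv.most_similar("fundA"))
```

Funds that share assets through a similar weight structure end up close in the embedding space, which is the structural similarity the abstract contrasts with raw portfolio overlap.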
【3】 Stock Market Analysis with Text Data: A Review
Authors: Kamaladdin Fataliyev, Aneesh Chivukula, Mukesh Prasad, Wei Liu
Affiliations: School of Computer Science, University of Technology Sydney, Ultimo NSW, Sydney, Australia; Australian Artificial Intelligence Institute, University of Technology Sydney, Ultimo NSW, Sydney, Australia
Link: https://arxiv.org/abs/2106.12985
Abstract: Stock market movements are influenced by public and private information shared through news articles, company reports, and social media discussions. Analyzing these vast sources of data can give market participants an edge in making a profit. However, the majority of studies in the literature are based on traditional approaches that fall short in analyzing unstructured, vast textual data. In this study, we review the immense amount of existing literature on text-based stock market analysis. We present input data types and cover the main textual data sources and variations, followed by feature representation techniques. We then cover the analysis techniques and create a taxonomy of the main stock market forecast models, discussing representative work in each category of the taxonomy and analyzing their respective contributions. Finally, the paper presents findings on unaddressed open problems and gives suggestions for future work. The aim of this study is to survey the main stock market analysis models and text representation techniques for financial market prediction, identify the shortcomings of existing techniques, and propose promising directions for future research.
【4】 The Pricing of Vanilla Options with Cash Dividends as a Classic Vanilla Basket Option Problem
Authors: Jherek Healy
Link: https://arxiv.org/abs/2106.12971
Abstract: In the standard Black-Scholes-Merton framework, dividends are represented as a continuous dividend yield, and the pricing of Vanilla options on a stock is achieved through the well-known Black-Scholes formula. In reality however, stocks pay a discrete fixed cash dividend at each dividend ex-date. This leads to the so-called piecewise lognormal model, where the asset price jumps down by a fixed known amount at each dividend date. There is however no exact closed-form formula for the pricing of Vanilla options under this model, and approximations must be used. While many approximations tailored to this specific problem exist in the literature, this paper explores the use of existing well-known basket option formulas for the pricing of European options on a single asset with cash dividends in the piecewise lognormal model.
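For intuition about the model itself, the piecewise lognormal dynamics are straightforward to simulate; a brute-force Monte Carlo reference like the sketch below (parameters made up, and not the paper's basket-formula approach) is what such closed-form approximations are typically benchmarked against:

```python
import numpy as np

def mc_call_piecewise_lognormal(s0, k, r, sigma, t, div_times, div_amounts,
                                n_paths=200_000, seed=0):
    """European call under GBM with a fixed cash dividend at each ex-date."""
    rng = np.random.default_rng(seed)
    s = np.full(n_paths, float(s0))
    t_prev = 0.0
    for t_div, amount in zip(div_times, div_amounts):
        dt = t_div - t_prev
        z = rng.standard_normal(n_paths)
        s *= np.exp((r - 0.5 * sigma**2) * dt + sigma * np.sqrt(dt) * z)
        s = np.maximum(s - amount, 0.0)   # price drops by the cash dividend
        t_prev = t_div
    dt = t - t_prev
    z = rng.standard_normal(n_paths)
    s *= np.exp((r - 0.5 * sigma**2) * dt + sigma * np.sqrt(dt) * z)
    return np.exp(-r * t) * np.maximum(s - k, 0.0).mean()

# Example: 1-year at-the-money call, one 2.0 cash dividend after 6 months.
print(mc_call_piecewise_lognormal(100.0, 100.0, 0.02, 0.25, 1.0, [0.5], [2.0]))
```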
【5】 Next-Day Bitcoin Price Forecast Based on Artificial intelligence Methods
Authors: Liping Yang
Affiliations: School of Economics and Management, Beijing University of Chemical Technology
Link: https://arxiv.org/abs/2106.12961
Abstract: In recent years, Bitcoin price prediction has attracted the interest of researchers and investors. However, the accuracy of previous studies has been unsatisfactory. Machine learning and deep learning methods have been shown to have strong predictive ability in this area. This paper proposes a method combining Ensemble Empirical Mode Decomposition (EEMD) with a deep learning method called long short-term memory (LSTM) to study the problem of next-day Bitcoin price forecasting.
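A schematic of the decompose-then-forecast pipeline the abstract names, as a sketch only: it assumes the PyEMD package for EEMD, uses a synthetic random walk in place of Bitcoin closes, and substitutes a trivial last-value predictor for the per-IMF LSTMs:

```python
import numpy as np
from PyEMD import EEMD  # assumption: the PyEMD ("EMD-signal") package is installed

# Stand-in for a series of daily Bitcoin closing prices.
prices = 100.0 + np.cumsum(np.random.default_rng(0).normal(0.0, 1.0, 500))

eemd = EEMD(trials=50)
imfs = eemd.eemd(prices)  # rows: IMFs from high to low frequency (plus residue);
                          # their sum approximately reconstructs the series

# In an EEMD-LSTM scheme, one LSTM is trained per IMF on sliding windows and the
# next-day forecast is the sum of the per-IMF predictions. A naive last-value
# "forecast" stands in for each LSTM so the sketch stays self-contained.
next_day = sum(imf[-1] for imf in imfs)
print(f"recombined next-day value: {next_day:.2f} (last close {prices[-1]:.2f})")
```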
【6】 The gig economy in Poland: evidence based on mobile big data
Authors: Maciej Beręsewicz, Dagmara Nikulin, Marcin Szymkowiak, Kamil Wilak
Affiliations: Poznań University of Economics and Business
Comments: 44 pages, 20 figures
Link: https://arxiv.org/abs/2106.12827
Abstract: In this article we address the question of how to measure the size and characteristics of the platform economy. We propose an approach different from sample surveys, based on smartphone data passively collected through programmatic systems as part of online marketing. In particular, our study focuses on two types of services: food delivery (Bolt Courier, Takeaway, Glover, Wolt) and transport services (Bolt Driver, Free Now, iTaxi and Uber). Our results show that the platform economy in Poland is growing. In particular, for food delivery and transport services performed by means of applications, we observed a growing trend between January 2018 and December 2020. Taking into account the demographic structure of app users, our results confirm findings from past studies: the majority of platform workers are young men, but the age structure of app users differs between the two categories of services. Another surprising finding is that foreigners do not account for the majority of gig workers in Poland. When the number of platform workers is compared with the corresponding working populations, the estimated share of active app users accounts for about 0.5-2% of the working population in the 9 largest Polish cities.
【7】 Learning Multiple Stock Trading Patterns with Temporal Routing Adaptor and Optimal Transport
Authors: Hengxu Lin, Dong Zhou, Weiqing Liu, Jiang Bian
Affiliations: Sun Yat-sen University, Guangdong, China; Microsoft Research, Beijing, China
Comments: Accepted by KDD 2021 (research track)
Link: https://arxiv.org/abs/2106.12950
Abstract: Successful quantitative investment usually relies on precise predictions of the future movement of the stock price. Recently, machine learning based solutions have shown their capacity to give more accurate stock predictions and have become indispensable components in modern quantitative investment systems. However, the i.i.d. assumption behind existing methods is inconsistent with the existence of diverse trading patterns in the stock market, which inevitably limits their ability to achieve better stock prediction performance. In this paper, we propose a novel architecture, Temporal Routing Adaptor (TRA), to empower existing stock prediction models with the ability to model multiple stock trading patterns. Essentially, TRA is a lightweight module that consists of a set of independent predictors for learning multiple patterns as well as a router to dispatch samples to different predictors. Nevertheless, the lack of explicit pattern identifiers makes it quite challenging to train an effective TRA-based model. To tackle this challenge, we further design a learning algorithm based on Optimal Transport (OT) to obtain the optimal sample-to-predictor assignment and effectively optimize the router with such assignment through an auxiliary loss term. Experiments on a real-world stock ranking task show that, compared to state-of-the-art baselines such as Attention LSTM and Transformer, the proposed method improves the information coefficient (IC) from 0.053 to 0.059 and from 0.051 to 0.056, respectively. The dataset and code used in this work are publicly available: https://github.com/microsoft/qlib.
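To make the OT step concrete, here is a minimal numpy sketch of entropy-regularized optimal transport (Sinkhorn iterations) producing a soft sample-to-predictor assignment from a loss matrix. Shapes and marginals are illustrative assumptions; this is not the qlib implementation:

```python
import numpy as np

def sinkhorn_assignment(loss, eps=0.1, n_iters=200):
    """Soft sample-to-predictor assignment via entropy-regularized OT.

    loss[i, j] = loss of sample i under predictor j. Row marginals are uniform
    over samples and column marginals uniform over predictors, so samples are
    spread across the available "patterns".
    """
    n, m = loss.shape
    k = np.exp(-loss / eps)                  # Gibbs kernel
    r, c = np.full(n, 1.0 / n), np.full(m, 1.0 / m)
    v = np.ones(m)
    for _ in range(n_iters):
        u = r / (k @ v)
        v = c / (k.T @ u)
    return u[:, None] * k * v[None, :]       # transport plan diag(u) K diag(v)

rng = np.random.default_rng(0)
losses = rng.random((8, 3))                  # 8 samples, 3 pattern predictors
plan = sinkhorn_assignment(losses)
print(plan.argmax(axis=1))                   # hard pattern label per sample
print(plan.sum(axis=0))                      # ~[1/3, 1/3, 1/3]: balanced usage
```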
2. cs.SD (Sound):
【1】 Where are we in semantic concept extraction for Spoken Language Understanding?
Authors: Sahar Ghannay, Antoine Caubrière, Salima Mdhaffar, Gaëlle Laperrière, Bassam Jabaian, Yannick Estève
Affiliations: Université Paris-Saclay, CNRS, LISN, Orsay, France; LIA - Avignon Université, France
Comments: Submitted to the SPECOM 2021 conference
Link: https://arxiv.org/abs/2106.13045
Abstract: Spoken language understanding (SLU) has seen a lot of progress over the last three years, with the emergence of end-to-end neural approaches. Spoken language understanding refers to natural language processing tasks related to semantic extraction from the speech signal, such as named entity recognition from speech or slot filling in the context of human-machine dialogue. Classically, SLU tasks were processed through a cascade approach: first an automatic speech recognition process, followed by a natural language processing module applied to the automatic transcriptions. Over the same period, end-to-end neural approaches based on deep neural networks have been proposed to extract the semantics directly from the speech signal using a single neural model. More recent work on self-supervised training with unlabeled data opens new perspectives in terms of performance for automatic speech recognition and natural language processing. In this paper, we present a brief overview of recent advances on the French MEDIA benchmark dataset for SLU, with and without the use of additional data. We also present our latest results, which significantly outperform the current state of the art with a Concept Error Rate (CER) of 11.2%, compared to 13.6% for the best system presented this year.
【2】 AudioCLIP: Extending CLIP to Image, Text and Audio
Authors: Andrey Guzhov, Federico Raue, Jörn Hees, Andreas Dengel
Affiliations: DFKI GmbH, Trippstadter Str., Kaiserslautern, Germany; TU Kaiserslautern, Kaiserslautern, Germany
Comments: Submitted to GCPR 2021
Link: https://arxiv.org/abs/2106.13043
Abstract: In the past, the rapidly evolving field of sound classification greatly benefited from the application of methods from other domains. Today, we observe the trend to fuse domain-specific tasks and approaches together, which provides the community with new outstanding models. In this work, we present an extension of the CLIP model that handles audio in addition to text and images. Our proposed model incorporates the ESResNeXt audio model into the CLIP framework using the AudioSet dataset. Such a combination enables the proposed model to perform bimodal and unimodal classification and querying, while keeping CLIP's ability to generalize to unseen datasets in a zero-shot inference fashion. AudioCLIP achieves new state-of-the-art results in the Environmental Sound Classification (ESC) task, outperforming other approaches by reaching accuracies of 90.07% on the UrbanSound8K and 97.15% on the ESC-50 datasets. Further, it sets new baselines in the zero-shot ESC task on the same datasets (68.78% and 69.40%, respectively). Finally, we also assess the cross-modal querying performance of the proposed model as well as the influence of full and partial training on the results. For the sake of reproducibility, our code is published.
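To make the zero-shot inference pattern concrete, a schematic of shared-embedding-space classification (placeholder random encoders stand in for the trained ESResNeXt audio head and CLIP text head; this is not the released AudioCLIP code):

```python
import numpy as np

def l2norm(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

rng = np.random.default_rng(0)
def audio_encoder(wav):   return rng.normal(size=512)   # hypothetical stand-in
def text_encoder(prompt): return rng.normal(size=512)   # hypothetical stand-in

labels = ["dog bark", "siren", "rain"]
text_emb = l2norm(np.stack([text_encoder(f"a sound of {l}") for l in labels]))
audio_emb = l2norm(audio_encoder(np.zeros(16000)))      # 1 s of dummy audio

scores = text_emb @ audio_emb                           # cosine similarities
print(labels[int(scores.argmax())])                     # zero-shot prediction
```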
【3】 QASR: QCRI Aljazeera Speech Resource -- A Large Scale Annotated Arabic Speech Corpus
Authors: Hamdy Mubarak, Amir Hussein, Shammur Absar Chowdhury, Ahmed Ali
Affiliations: Qatar Computing Research Institute, HBKU, Doha, Qatar; Kanari AI, California, USA
Comments: Speech Corpus, Spoken Conversation, ASR, Dialect Identification, Punctuation Restoration, Speaker Verification, NER, Named Entity, Arabic, Speaker gender, Turn-taking. Accepted at ACL 2021
Link: https://arxiv.org/abs/2106.13000
Abstract: We introduce the largest transcribed Arabic speech corpus, QASR, collected from the broadcast domain. This multi-dialect speech dataset contains 2,000 hours of speech sampled at 16 kHz, crawled from the Aljazeera news channel. The dataset is released with lightly supervised transcriptions, aligned with the audio segments. Unlike previous datasets, QASR contains linguistically motivated segmentation, punctuation, and speaker information, among others. QASR is suitable for training and evaluating speech recognition systems, acoustics- and/or linguistics-based Arabic dialect identification, punctuation restoration, speaker identification, speaker linking, and potentially other NLP modules for spoken data. In addition to the QASR transcriptions, we release a dataset of 130M words to aid in designing and training a better language model. We show that end-to-end automatic speech recognition trained on QASR reports a competitive word error rate compared to the previous MGB-2 corpus. We report baseline results for downstream natural language processing tasks such as named entity recognition using speech transcripts. We also report the first baseline for Arabic punctuation restoration. We make the corpus available to the research community.
【4】 SofaMyRoom: a fast and multiplatform "shoebox" room simulator for binaural room impulse response dataset generation
Authors: Roberto Barumerli, Daniele Bianchi, Michele Geronazzo, Federico Avanzini
Affiliations: Acoustic Research Institute, Austrian Academy of Sciences, Vienna, Austria; Dept. of Computer Science, University of Milan, Milan, Italy; Dyson School of Design Engineering, Imperial College London, London, United Kingdom
Comments: 18 pages, 4 figures; accompanying paper for an acoustic simulator
Link: https://arxiv.org/abs/2106.12992
Abstract: This paper introduces a shoebox room simulator able to systematically generate synthetic datasets of binaural room impulse responses (BRIRs) given an arbitrary set of head-related transfer functions (HRTFs). The evaluation of machine hearing algorithms frequently requires BRIR datasets in order to simulate the acoustics of any environment. However, currently available solutions typically consider only HRTFs measured on dummy heads, which poorly characterize the high variability in spatial sound perception. Our solution makes it possible to integrate a room impulse response (RIR) simulator with different HRTF sets represented in the Spatially Oriented Format for Acoustics (SOFA). The source code and compiled binaries for different operating systems allow both advanced and non-expert users to benefit from our toolbox; see https://github.com/spatialaudiotools/sofamyroom/ .
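As a hedged usage sketch of what such a BRIR dataset is for (generic scipy code, not SofaMyRoom's own API): convolving a dry mono signal with the left/right impulse responses of one source position yields the binaural rendering.

```python
import numpy as np
from scipy.signal import fftconvolve

fs = 48000
dry = np.random.default_rng(0).normal(size=fs)       # 1 s of dry test signal

# Stand-ins for one left/right BRIR pair from a generated dataset: a direct
# path plus a single late reflection per ear, with an interaural delay.
brir_left = np.zeros(2048);  brir_left[0] = 1.0;   brir_left[900] = 0.3
brir_right = np.zeros(2048); brir_right[40] = 0.8; brir_right[1100] = 0.25

binaural = np.stack([fftconvolve(dry, brir_left),
                     fftconvolve(dry, brir_right)], axis=1)  # (samples, 2)
print(binaural.shape)
```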
【5】 Speech is Silver, Silence is Golden: What do ASVspoof-trained Models Really Learn?
Authors: Nicolas M. Müller, Franziska Dieckmann, Pavel Czempin, Roman Canals, Konstantin Böttinger
Affiliations: Department of Informatics, Technical University of Munich; Cognitive Security Technologies, Fraunhofer AISEC, Garching near Munich, Germany
Link: https://arxiv.org/abs/2106.12914
Abstract: The ASVspoof dataset is one of the most established datasets for training and benchmarking systems designed for the detection of spoofed audio and audio deepfakes. However, we observe an uneven distribution of leading silence in the dataset's training and test data, which hints at the target label: bona-fide instances tend to have significantly longer leading silences than spoofed instances. This could be problematic, since a model may learn to only, or at least partially, base its decision on the length of the leading silence (similar to the issue with the Pascal VOC 2007 dataset, where all images of horses also contained a specific watermark). In this paper, we explore this phenomenon in depth. We train a number of networks a) on only the length of the leading silence and b) with and without leading silence. Results show that models trained on only the length of the leading silence perform suspiciously well: they achieve up to 85% accuracy and an equal error rate (EER) of 0.15 on the 'eval' split of the data. Conversely, when training strong models on the full audio files, we observe that removing leading silence during preprocessing dramatically worsens performance (the EER increases from 0.05 to 0.2). This could indicate that previous work may, in part, have learned only to classify targets based on leading silence. We hope that by sharing these results, the ASV community can further evaluate this phenomenon.
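The shortcut feature the paper warns about is trivially computable, which is part of the point; a sketch (thresholds illustrative) of measuring leading-silence length by finding the first frame whose energy exceeds a threshold:

```python
import numpy as np

def leading_silence_sec(wav, sr, frame=400, hop=160, thresh_db=-35.0):
    """Seconds before the first frame whose RMS exceeds thresh_db (dB re peak)."""
    wav = wav / (np.max(np.abs(wav)) + 1e-9)
    for start in range(0, len(wav) - frame, hop):
        rms = np.sqrt(np.mean(wav[start:start + frame] ** 2))
        if 20.0 * np.log10(rms + 1e-12) > thresh_db:
            return start / sr
    return len(wav) / sr

sr = 16000
sig = np.concatenate([np.zeros(8000),                           # 0.5 s of silence
                      0.1 * np.random.default_rng(0).normal(size=16000)])
print(leading_silence_sec(sig, sr))   # ~0.5; the paper reports up to 85%
                                      # accuracy from this one feature alone
```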
【6】 Non-Autoregressive TTS with Explicit Duration Modelling for Low-Resource Highly Expressive Speech
Authors: Raahil Shah, Kamil Pokora, Abdelhamid Ezzerg, Viacheslav Klimkov, Goeric Huybrechts, Bartosz Putrycz, Daniel Korzekwa, Thomas Merritt
Affiliations: Amazon Text-to-Speech Research
Comments: 6 pages, 5 figures. Accepted to the Speech Synthesis Workshop (SSW) 2021
Link: https://arxiv.org/abs/2106.12896
Abstract: Whilst recent neural text-to-speech (TTS) approaches produce high-quality speech, they typically require a large amount of recordings from the target speaker. In previous work, a 3-step method was proposed to generate high-quality TTS while greatly reducing the amount of data required for training. However, we have observed a ceiling effect in the level of naturalness achievable for highly expressive voices when using this approach. In this paper, we present a method for building highly expressive TTS voices with as little as 15 minutes of speech data from the target speaker. Compared to the current state-of-the-art approach, our proposed improvements close the gap to recordings by 23.3% for naturalness of speech and by 16.3% for speaker similarity. Further, we match the naturalness and speaker similarity of a Tacotron2-based full-data (~10 hours) model using only 15 minutes of target speaker data, whereas with 30 minutes or more we significantly outperform it. The following improvements are proposed: 1) changing from an autoregressive, attention-based TTS model to a non-autoregressive model, replacing attention with an external duration model, and 2) an additional Conditional Generative Adversarial Network (cGAN) based fine-tuning step.
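A small sketch of the "external duration model" mechanism in improvement 1) (tensors illustrative): instead of learned attention, each phoneme encoding is simply repeated for its predicted number of frames before being fed to the decoder.

```python
import torch

# Phoneme encoder output: (num_phonemes, hidden); per-phoneme durations in
# frames, which would come from the external duration model at synthesis time.
phoneme_enc = torch.randn(4, 8)
durations = torch.tensor([3, 5, 2, 6])

# Length regulation: repeat each phoneme vector `duration` times along time.
frame_level = torch.repeat_interleave(phoneme_enc, durations, dim=0)
print(frame_level.shape)   # torch.Size([16, 8]): frame-level decoder input
```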
【7】 Additive Phoneme-aware Margin Softmax Loss for Language Recognition
Authors: Zheng Li, Yan Liu, Lin Li, Qingyang Hong
Affiliations: School of Electronic Science and Engineering, Xiamen University, China; School of Informatics, Xiamen University, China
Comments: Accepted by Interspeech 2021
Link: https://arxiv.org/abs/2106.12851
Abstract: This paper proposes an additive phoneme-aware margin softmax (APM-Softmax) loss to train a multi-task learning network with phonetic information for language recognition. In additive margin softmax (AM-Softmax) loss, the margin is set as a constant during the entire training for all training samples, which is suboptimal since recognition difficulty varies across training samples. In additive angular margin softmax (AAM-Softmax) loss, the additional angular margin is set as a constant as well. In this paper, we propose an APM-Softmax loss for language recognition with phonetic multi-task learning, in which the additive phoneme-aware margin is automatically tuned for different training samples. More specifically, the margin for language recognition is adjusted according to the results of phoneme recognition. Experiments are reported on Oriental Language Recognition (OLR) datasets, and the proposed method improves on AM-Softmax loss and AAM-Softmax loss under different language recognition testing conditions.
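A hedged PyTorch sketch of the loss family (the per-sample margin rule below is a plausible stand-in, not the paper's exact schedule): AM-Softmax scales logits and subtracts a margin from the target-class cosine; the phoneme-aware variant makes that margin a per-sample quantity, e.g. driven by phoneme-recognition confidence.

```python
import torch
import torch.nn.functional as F

def am_softmax_loss(cos_theta, labels, margin, scale=30.0):
    """AM-Softmax: subtract a margin from the target-class cosine, then scale.

    cos_theta: (batch, classes) cosines from normalized embeddings/weights;
    margin: a scalar, or a per-sample tensor of shape (batch,) for APM-style.
    """
    margin = torch.as_tensor(margin, dtype=cos_theta.dtype)
    logits = cos_theta.clone()
    rows = torch.arange(cos_theta.size(0))
    logits[rows, labels] = logits[rows, labels] - margin
    return F.cross_entropy(scale * logits, labels)

cos = F.normalize(torch.randn(8, 64)) @ F.normalize(torch.randn(5, 64)).t()
y = torch.randint(0, 5, (8,))

fixed = am_softmax_loss(cos, y, margin=0.2)        # plain AM-Softmax
phone_conf = torch.rand(8)                         # hypothetical phoneme confidence
adaptive = am_softmax_loss(cos, y, margin=0.1 + 0.2 * phone_conf)  # APM-style
print(fixed.item(), adaptive.item())
```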
【8】 Multilingual transfer of acoustic word embeddings improves when training on languages related to the target zero-resource language
Authors: Christiaan Jacobs, Herman Kamper
Affiliations: E&E Engineering, Stellenbosch University
Comments: Accepted to Interspeech 2021
Link: https://arxiv.org/abs/2106.12834
Abstract: Acoustic word embedding models map variable-duration speech segments to fixed-dimensional vectors, enabling efficient speech search and discovery. Previous work explored how embeddings can be obtained in zero-resource settings where no labelled data is available in the target language. The current best approach uses transfer learning: a single supervised multilingual model is trained using labelled data from multiple well-resourced languages and then applied to a target zero-resource language (without fine-tuning). However, it is still unclear how the specific choice of training languages affects downstream performance. Concretely, here we ask whether it is beneficial to use training languages related to the target. Using data from eleven languages spoken in Southern Africa, we experiment with adding data from different language families while controlling for the amount of data per language. In word discrimination and query-by-example search evaluations, we show that training on languages from the same family gives large improvements. Through finer-grained analysis, we show that training on even just a single related language gives the largest gain. We also find that adding data from unrelated languages generally doesn't hurt performance.
【9】 A Simultaneous Denoising and Dereverberation Framework with Target Decoupling
Authors: Andong Li, Wenzhe Liu, Xiaoxue Luo, Guochen Yu, Chengshi Zheng, Xiaodong Li
Affiliations: Key Laboratory of Noise and Vibration Research, Institute of Acoustics, Chinese Academy of Sciences; University of Chinese Academy of Sciences, Beijing, China; Communication University of China, Beijing, China
Comments: Accepted at Interspeech 2021
Link: https://arxiv.org/abs/2106.12743
Abstract: Background noise and room reverberation are regarded as two major factors that degrade subjective speech quality. In this paper, we propose an integrated framework to address simultaneous denoising and dereverberation in complicated acoustic environments. It adopts a chain optimization strategy and accordingly designs four sub-stages. In the first two stages, we decouple the multi-task learning w.r.t. the complex spectrum into magnitude and phase, and implement noise and reverberation removal only in the magnitude domain. Based on the priors estimated above, we further polish the spectrum in the third stage, where both magnitude and phase information are explicitly repaired with residual learning. Due to data mismatch and the nonlinear effect of DNNs, residual noise often exists in the DNN-processed spectrum. To resolve this problem, we adopt a lightweight algorithm as a post-processing module to capture and suppress the residual noise in non-active regions. In the Interspeech 2021 Deep Noise Suppression (DNS) Challenge, our submitted system ranked top-1 in the real-time track in terms of Mean Opinion Score (MOS) under the ITU-T P.835 framework.
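A minimal sketch of the magnitude/phase decoupling in the first two stages (generic STFT masking, not the paper's four-stage network): the magnitude is enhanced with a predicted mask while the noisy phase is reused, leaving phase repair to a later stage.

```python
import numpy as np
from scipy.signal import stft, istft

fs = 16000
noisy = np.random.default_rng(0).normal(size=fs)   # stand-in for a noisy utterance

f, t, spec = stft(noisy, fs=fs, nperseg=512)
mag, phase = np.abs(spec), np.angle(spec)

# A DNN would predict this mask from `mag`; a crude spectral-floor heuristic
# stands in here so the sketch stays self-contained.
mask = np.clip(1.0 - mag.mean(axis=1, keepdims=True) / (mag + 1e-8), 0.0, 1.0)
enhanced_mag = mask * mag

# Stages 1-2 stop here (noisy phase reused); stage 3 would repair phase too.
_, enhanced = istft(enhanced_mag * np.exp(1j * phase), fs=fs, nperseg=512)
print(enhanced.shape)
```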
【10】 Dealing with training and test segmentation mismatch: FBK@IWSLT2021
Authors: Sara Papi, Marco Gaido, Matteo Negri, Marco Turchi
Affiliations: Fondazione Bruno Kessler, Trento, Italy; University of Trento, Italy
Comments: Accepted at IWSLT 2021
Link: https://arxiv.org/abs/2106.12607
Abstract: This paper describes FBK's system submission to the IWSLT 2021 Offline Speech Translation task. We participated with a direct model, a Transformer-based architecture trained to translate English speech audio data into German texts. The training pipeline is characterized by knowledge distillation and a two-step fine-tuning procedure. Both knowledge distillation and the first fine-tuning step are carried out on manually segmented real and synthetic data, the latter generated with an MT system trained on the available corpora. In contrast, the second fine-tuning step is carried out on a random segmentation of the MuST-C v2 En-De dataset. Its main goal is to reduce the performance drops that occur when a speech translation model trained on manually segmented data (i.e. an ideal, sentence-like segmentation) is evaluated on automatically segmented audio (i.e. actual, more realistic testing conditions). For the same purpose, a custom hybrid segmentation procedure that accounts both for audio content (pauses) and for the length of the produced segments is applied to the test data before passing them to the system. At inference time, we compared this procedure with a baseline segmentation method based on Voice Activity Detection (VAD). Our results indicate the effectiveness of the proposed hybrid approach, shown by a reduction of the gap with manual segmentation from 8.3 to 1.4 BLEU points.
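A sketch of the hybrid idea (length limits illustrative, not FBK's implementation): cut at detected pauses, but force a cut when a segment grows past the maximum length and skip pauses that would create segments that are too short.

```python
def hybrid_segment(pause_times, total_sec, min_len=5.0, max_len=20.0):
    """Pick cut points from candidate pauses subject to segment-length limits."""
    cuts, last = [], 0.0
    for p in sorted(pause_times):
        if p - last >= max_len:          # overdue: cut even without a pause
            last = last + max_len
            cuts.append(last)
        if min_len <= p - last <= max_len:
            cuts.append(p)               # pause falls in the acceptable window
            last = p
    return cuts + [total_sec]

pauses = [3.0, 7.5, 9.0, 26.0, 31.0]     # pause timestamps from a pause detector
print(hybrid_segment(pauses, 40.0))      # [7.5, 26.0, 31.0, 40.0]
```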
【11】 SRIB-LEAP submission to Far-field Multi-Channel Speech Enhancement Challenge for Video Conferencing
Authors: R G Prithvi Raj, Rohit Kumar, M K Jayesh, Anurenjan Purushothaman, Sriram Ganapathy, M A Basha Shaik
Affiliations: Samsung Research & Development Institute India, Bangalore; LEAP Lab, Electrical Engineering, Indian Institute of Science, Bangalore, India
Link: https://arxiv.org/abs/2106.12763
Abstract: This paper presents the details of the SRIB-LEAP submission to the ConferencingSpeech Challenge 2021. The challenge involved the task of multi-channel speech enhancement to improve the quality of far-field speech from microphone arrays in a video conferencing room. We propose a two-stage method involving a beamformer followed by single-channel enhancement. For the beamformer, we incorporated a self-attention mechanism as an inter-channel processing layer in the filter-and-sum network (FaSNet), an end-to-end time-domain beamforming system. The single-channel speech enhancement is done in the log-spectral domain using a convolutional neural network (CNN)-long short-term memory (LSTM) based architecture. We achieved an improvement of 0.5 in an objective quality metric, the perceptual evaluation of speech quality (PESQ), on the noisy data. In subjective quality evaluation, the proposed approach improved the mean opinion score (MOS) by an absolute 0.9 over the noisy audio.
3. eess.AS (Audio and Speech Processing):
【1】 SRIB-LEAP submission to Far-field Multi-Channel Speech Enhancement Challenge for Video Conferencing
Authors: R G Prithvi Raj, Rohit Kumar, M K Jayesh, Anurenjan Purushothaman, Sriram Ganapathy, M A Basha Shaik
Affiliations: Samsung Research & Development Institute India, Bangalore; LEAP Lab, Electrical Engineering, Indian Institute of Science, Bangalore, India
Link: https://arxiv.org/abs/2106.12763
Abstract: This paper presents the details of the SRIB-LEAP submission to the ConferencingSpeech Challenge 2021. The challenge involved the task of multi-channel speech enhancement to improve the quality of far-field speech from microphone arrays in a video conferencing room. We propose a two-stage method involving a beamformer followed by single-channel enhancement. For the beamformer, we incorporated a self-attention mechanism as an inter-channel processing layer in the filter-and-sum network (FaSNet), an end-to-end time-domain beamforming system. The single-channel speech enhancement is done in the log-spectral domain using a convolutional neural network (CNN)-long short-term memory (LSTM) based architecture. We achieved an improvement of 0.5 in an objective quality metric, the perceptual evaluation of speech quality (PESQ), on the noisy data. In subjective quality evaluation, the proposed approach improved the mean opinion score (MOS) by an absolute 0.9 over the noisy audio.
【2】 Where are we in semantic concept extraction for Spoken Language Understanding?
Authors: Sahar Ghannay, Antoine Caubrière, Salima Mdhaffar, Gaëlle Laperrière, Bassam Jabaian, Yannick Estève
Affiliations: Université Paris-Saclay, CNRS, LISN, Orsay, France; LIA - Avignon Université, France
Comments: Submitted to the SPECOM 2021 conference
Link: https://arxiv.org/abs/2106.13045
Abstract: Spoken language understanding (SLU) has seen a lot of progress over the last three years, with the emergence of end-to-end neural approaches. Spoken language understanding refers to natural language processing tasks related to semantic extraction from the speech signal, such as named entity recognition from speech or slot filling in the context of human-machine dialogue. Classically, SLU tasks were processed through a cascade approach: first an automatic speech recognition process, followed by a natural language processing module applied to the automatic transcriptions. Over the same period, end-to-end neural approaches based on deep neural networks have been proposed to extract the semantics directly from the speech signal using a single neural model. More recent work on self-supervised training with unlabeled data opens new perspectives in terms of performance for automatic speech recognition and natural language processing. In this paper, we present a brief overview of recent advances on the French MEDIA benchmark dataset for SLU, with and without the use of additional data. We also present our latest results, which significantly outperform the current state of the art with a Concept Error Rate (CER) of 11.2%, compared to 13.6% for the best system presented this year.
【3】 AudioCLIP: Extending CLIP to Image, Text and Audio
Authors: Andrey Guzhov, Federico Raue, Jörn Hees, Andreas Dengel
Affiliations: DFKI GmbH, Trippstadter Str., Kaiserslautern, Germany; TU Kaiserslautern, Kaiserslautern, Germany
Comments: Submitted to GCPR 2021
Link: https://arxiv.org/abs/2106.13043
Abstract: In the past, the rapidly evolving field of sound classification greatly benefited from the application of methods from other domains. Today, we observe the trend to fuse domain-specific tasks and approaches together, which provides the community with new outstanding models. In this work, we present an extension of the CLIP model that handles audio in addition to text and images. Our proposed model incorporates the ESResNeXt audio model into the CLIP framework using the AudioSet dataset. Such a combination enables the proposed model to perform bimodal and unimodal classification and querying, while keeping CLIP's ability to generalize to unseen datasets in a zero-shot inference fashion. AudioCLIP achieves new state-of-the-art results in the Environmental Sound Classification (ESC) task, outperforming other approaches by reaching accuracies of 90.07% on the UrbanSound8K and 97.15% on the ESC-50 datasets. Further, it sets new baselines in the zero-shot ESC task on the same datasets (68.78% and 69.40%, respectively). Finally, we also assess the cross-modal querying performance of the proposed model as well as the influence of full and partial training on the results. For the sake of reproducibility, our code is published.
【4】 QASR: QCRI Aljazeera Speech Resource -- A Large Scale Annotated Arabic Speech Corpus
Authors: Hamdy Mubarak, Amir Hussein, Shammur Absar Chowdhury, Ahmed Ali
Affiliations: Qatar Computing Research Institute, HBKU, Doha, Qatar; Kanari AI, California, USA
Comments: Speech Corpus, Spoken Conversation, ASR, Dialect Identification, Punctuation Restoration, Speaker Verification, NER, Named Entity, Arabic, Speaker gender, Turn-taking. Accepted at ACL 2021
Link: https://arxiv.org/abs/2106.13000
Abstract: We introduce the largest transcribed Arabic speech corpus, QASR, collected from the broadcast domain. This multi-dialect speech dataset contains 2,000 hours of speech sampled at 16 kHz, crawled from the Aljazeera news channel. The dataset is released with lightly supervised transcriptions, aligned with the audio segments. Unlike previous datasets, QASR contains linguistically motivated segmentation, punctuation, and speaker information, among others. QASR is suitable for training and evaluating speech recognition systems, acoustics- and/or linguistics-based Arabic dialect identification, punctuation restoration, speaker identification, speaker linking, and potentially other NLP modules for spoken data. In addition to the QASR transcriptions, we release a dataset of 130M words to aid in designing and training a better language model. We show that end-to-end automatic speech recognition trained on QASR reports a competitive word error rate compared to the previous MGB-2 corpus. We report baseline results for downstream natural language processing tasks such as named entity recognition using speech transcripts. We also report the first baseline for Arabic punctuation restoration. We make the corpus available to the research community.
【5】 SofaMyRoom: a fast and multiplatform "shoebox" room simulator for binaural room impulse response dataset generation
Authors: Roberto Barumerli, Daniele Bianchi, Michele Geronazzo, Federico Avanzini
Affiliations: Acoustic Research Institute, Austrian Academy of Sciences, Vienna, Austria; Dept. of Computer Science, University of Milan, Milan, Italy; Dyson School of Design Engineering, Imperial College London, London, United Kingdom
Comments: 18 pages, 4 figures; accompanying paper for an acoustic simulator
Link: https://arxiv.org/abs/2106.12992
Abstract: This paper introduces a shoebox room simulator able to systematically generate synthetic datasets of binaural room impulse responses (BRIRs) given an arbitrary set of head-related transfer functions (HRTFs). The evaluation of machine hearing algorithms frequently requires BRIR datasets in order to simulate the acoustics of any environment. However, currently available solutions typically consider only HRTFs measured on dummy heads, which poorly characterize the high variability in spatial sound perception. Our solution makes it possible to integrate a room impulse response (RIR) simulator with different HRTF sets represented in the Spatially Oriented Format for Acoustics (SOFA). The source code and compiled binaries for different operating systems allow both advanced and non-expert users to benefit from our toolbox; see https://github.com/spatialaudiotools/sofamyroom/ .
【6】 Speech is Silver, Silence is Golden: What do ASVspoof-trained Models Really Learn?
Authors: Nicolas M. Müller, Franziska Dieckmann, Pavel Czempin, Roman Canals, Konstantin Böttinger
Affiliations: Department of Informatics, Technical University of Munich; Cognitive Security Technologies, Fraunhofer AISEC, Garching near Munich, Germany
Link: https://arxiv.org/abs/2106.12914
Abstract: The ASVspoof dataset is one of the most established datasets for training and benchmarking systems designed for the detection of spoofed audio and audio deepfakes. However, we observe an uneven distribution of leading silence in the dataset's training and test data, which hints at the target label: bona-fide instances tend to have significantly longer leading silences than spoofed instances. This could be problematic, since a model may learn to only, or at least partially, base its decision on the length of the leading silence (similar to the issue with the Pascal VOC 2007 dataset, where all images of horses also contained a specific watermark). In this paper, we explore this phenomenon in depth. We train a number of networks a) on only the length of the leading silence and b) with and without leading silence. Results show that models trained on only the length of the leading silence perform suspiciously well: they achieve up to 85% accuracy and an equal error rate (EER) of 0.15 on the 'eval' split of the data. Conversely, when training strong models on the full audio files, we observe that removing leading silence during preprocessing dramatically worsens performance (the EER increases from 0.05 to 0.2). This could indicate that previous work may, in part, have learned only to classify targets based on leading silence. We hope that by sharing these results, the ASV community can further evaluate this phenomenon.
【7】 Additive Phoneme-aware Margin Softmax Loss for Language Recognition
Authors: Zheng Li, Yan Liu, Lin Li, Qingyang Hong
Affiliations: School of Electronic Science and Engineering, Xiamen University, China; School of Informatics, Xiamen University, China
Comments: Accepted by Interspeech 2021
Link: https://arxiv.org/abs/2106.12851
Abstract: This paper proposes an additive phoneme-aware margin softmax (APM-Softmax) loss to train a multi-task learning network with phonetic information for language recognition. In additive margin softmax (AM-Softmax) loss, the margin is set as a constant during the entire training for all training samples, which is suboptimal since recognition difficulty varies across training samples. In additive angular margin softmax (AAM-Softmax) loss, the additional angular margin is set as a constant as well. In this paper, we propose an APM-Softmax loss for language recognition with phonetic multi-task learning, in which the additive phoneme-aware margin is automatically tuned for different training samples. More specifically, the margin for language recognition is adjusted according to the results of phoneme recognition. Experiments are reported on Oriental Language Recognition (OLR) datasets, and the proposed method improves on AM-Softmax loss and AAM-Softmax loss under different language recognition testing conditions.
【8】 Multilingual transfer of acoustic word embeddings improves when training on languages related to the target zero-resource language
Authors: Christiaan Jacobs, Herman Kamper
Affiliations: E&E Engineering, Stellenbosch University
Comments: Accepted to Interspeech 2021
Link: https://arxiv.org/abs/2106.12834
Abstract: Acoustic word embedding models map variable-duration speech segments to fixed-dimensional vectors, enabling efficient speech search and discovery. Previous work explored how embeddings can be obtained in zero-resource settings where no labelled data is available in the target language. The current best approach uses transfer learning: a single supervised multilingual model is trained using labelled data from multiple well-resourced languages and then applied to a target zero-resource language (without fine-tuning). However, it is still unclear how the specific choice of training languages affects downstream performance. Concretely, here we ask whether it is beneficial to use training languages related to the target. Using data from eleven languages spoken in Southern Africa, we experiment with adding data from different language families while controlling for the amount of data per language. In word discrimination and query-by-example search evaluations, we show that training on languages from the same family gives large improvements. Through finer-grained analysis, we show that training on even just a single related language gives the largest gain. We also find that adding data from unrelated languages generally doesn't hurt performance.
【9】 A Simultaneous Denoising and Dereverberation Framework with Target Decoupling
Authors: Andong Li, Wenzhe Liu, Xiaoxue Luo, Guochen Yu, Chengshi Zheng, Xiaodong Li
Affiliations: Key Laboratory of Noise and Vibration Research, Institute of Acoustics, Chinese Academy of Sciences; University of Chinese Academy of Sciences, Beijing, China; Communication University of China, Beijing, China
Comments: Accepted at Interspeech 2021
Link: https://arxiv.org/abs/2106.12743
Abstract: Background noise and room reverberation are regarded as two major factors that degrade subjective speech quality. In this paper, we propose an integrated framework to address simultaneous denoising and dereverberation in complicated acoustic environments. It adopts a chain optimization strategy and accordingly designs four sub-stages. In the first two stages, we decouple the multi-task learning w.r.t. the complex spectrum into magnitude and phase, and implement noise and reverberation removal only in the magnitude domain. Based on the priors estimated above, we further polish the spectrum in the third stage, where both magnitude and phase information are explicitly repaired with residual learning. Due to data mismatch and the nonlinear effect of DNNs, residual noise often exists in the DNN-processed spectrum. To resolve this problem, we adopt a lightweight algorithm as a post-processing module to capture and suppress the residual noise in non-active regions. In the Interspeech 2021 Deep Noise Suppression (DNS) Challenge, our submitted system ranked top-1 in the real-time track in terms of Mean Opinion Score (MOS) under the ITU-T P.835 framework.
【10】 Dealing with training and test segmentation mismatch: FBK@IWSLT2021
Authors: Sara Papi, Marco Gaido, Matteo Negri, Marco Turchi
Affiliations: Fondazione Bruno Kessler, Trento, Italy; University of Trento, Italy
Comments: Accepted at IWSLT 2021
Link: https://arxiv.org/abs/2106.12607
Abstract: This paper describes FBK's system submission to the IWSLT 2021 Offline Speech Translation task. We participated with a direct model, a Transformer-based architecture trained to translate English speech audio data into German texts. The training pipeline is characterized by knowledge distillation and a two-step fine-tuning procedure. Both knowledge distillation and the first fine-tuning step are carried out on manually segmented real and synthetic data, the latter generated with an MT system trained on the available corpora. In contrast, the second fine-tuning step is carried out on a random segmentation of the MuST-C v2 En-De dataset. Its main goal is to reduce the performance drops that occur when a speech translation model trained on manually segmented data (i.e. an ideal, sentence-like segmentation) is evaluated on automatically segmented audio (i.e. actual, more realistic testing conditions). For the same purpose, a custom hybrid segmentation procedure that accounts both for audio content (pauses) and for the length of the produced segments is applied to the test data before passing them to the system. At inference time, we compared this procedure with a baseline segmentation method based on Voice Activity Detection (VAD). Our results indicate the effectiveness of the proposed hybrid approach, shown by a reduction of the gap with manual segmentation from 8.3 to 1.4 BLEU points.