Finance / Speech / Audio Processing Academic Digest [9.9]

2021-09-16 16:49:39


q-fin (Quantitative Finance): 3 papers

cs.SD (Sound): 10 papers

eess.AS (Audio and Speech Processing): 9 papers

1. q-fin (Quantitative Finance):

【1】 Behavioral Bias Benefits: Beating Benchmarks By Bundling Bouncy Baskets
Link: https://arxiv.org/abs/2109.03740

Authors: Ravi Kashyap
Affiliations: Estonian Business School, Tallinn, Estonia; Formation Fi, Hong Kong; City University of Hong Kong, Hong Kong
Abstract: We consider in detail an investment strategy, titled "The Bounce Basket", designed for someone to express a bullish view on the market by allowing them to take long positions on securities that would benefit the most from a rally in the markets. We demonstrate the use of quantitative metrics and large amounts of historical data towards decision-making goals. This investment concept combines macroeconomic views with characteristics of individual securities to beat the market returns. The central idea of this theme is to identify securities, from a regional perspective, that are heavily shorted and yet are fundamentally sound, with at least a minimum buy rating from a consensus of the stock analysts covering them. We discuss the components of creating such a strategy, including the mechanics of constructing the portfolio. Using simulations, in which securities lending data is modeled as geometric Brownian motions, we provide a few flavors of creating a ranking of securities to identify the ones that are heavily shorted. An investment strategy of this kind will be ideal in market scenarios when a downturn happens due to unexpected extreme events and the markets are anticipated to bounce back thereafter. This situation is especially applicable to the incidents observed, and related proceedings, during the Coronavirus pandemic in 2020-2021. This strategy is one particular way to overcome a potential behavioral bias related to investing, which we term the "rebound effect".
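
As a rough illustration of the simulation component mentioned in the abstract, the Python sketch below models a per-security lending metric as a geometric Brownian motion, S_t = S_0 exp((mu - sigma^2/2)t + sigma W_t), and ranks a toy universe by the simulated terminal level. The tickers, drifts, volatilities, and ranking rule are illustrative assumptions, not the paper's calibrated procedure.

```python
import numpy as np

def simulate_gbm(s0, mu, sigma, T=1.0, steps=252, rng=None):
    # Geometric Brownian motion path: S_t = S_0 * exp((mu - sigma^2/2) t + sigma W_t)
    if rng is None:
        rng = np.random.default_rng(0)
    dt = T / steps
    log_increments = (mu - 0.5 * sigma**2) * dt + sigma * np.sqrt(dt) * rng.standard_normal(steps)
    return s0 * np.exp(np.cumsum(log_increments))

# Toy universe: (initial short-interest level, drift, volatility) per security.
params = {"AAA": (0.12, 0.05, 0.4), "BBB": (0.30, 0.02, 0.6), "CCC": (0.08, 0.10, 0.2)}
rng = np.random.default_rng(7)
terminal = {name: simulate_gbm(*p, rng=rng)[-1] for name, p in params.items()}
ranking = sorted(terminal, key=terminal.get, reverse=True)  # most heavily shorted first
print(ranking)
```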

【2】 Eurasian Economic Union: Current Concept and Prospects
Link: https://arxiv.org/abs/2109.03644

Authors: Larisa Kargina, Mattia Masolletti
Affiliations: Russian University of Transport, Moscow, Russia; NUST University, Rome, Italy
Notes: 8 pages. Keywords: integration, Eurasian Economic Union, economic integration. JEL codes: F-02; F-15.
Abstract: The authors analyze the content of Eurasian integration, from the initial initiative to the modern Eurasian Economic Union, paying attention to the factors that led to the transition from the Customs Union and the Single Economic Space to a stronger integration association. The main method of research is historical and legal analysis.

【3】 Three fundamental problems in risk modeling on big data: an information theory view
Link: https://arxiv.org/abs/2109.03541

Authors: Jiamin Yu
Notes: 6 pages, 7 figures
Abstract: Since Claude Shannon founded information theory, it has widely fostered other scientific fields, such as statistics, artificial intelligence, biology, behavioral science, neuroscience, economics, and finance. Unfortunately, actuarial science has hardly benefited from information theory: so far, only one actuarial paper on information theory can be found through academic search engines. Undoubtedly, information and risk, both forms of uncertainty, are constrained by entropy laws. Today's era of insurance big data means more data and more information, and it is unacceptable for risk management and actuarial science to ignore information theory. This paper therefore aims to exploit information theory to discover the performance limits of insurance big data systems and to seek guidance for risk modeling and the development of actuarial pricing systems.
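
To make the entropy argument concrete, here is a minimal sketch computing the Shannon entropy of an assumed toy claim-size distribution; the entropy lower-bounds the average number of bits needed to encode a claim record, which is the kind of information-theoretic performance limit the paper has in view.

```python
import numpy as np

def shannon_entropy(probs):
    # H(p) = -sum_i p_i * log2(p_i), in bits; zero-probability buckets contribute nothing.
    p = np.asarray(probs, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Assumed toy distribution over four claim-size buckets.
claim_bucket_probs = [0.70, 0.20, 0.08, 0.02]
print(f"{shannon_entropy(claim_bucket_probs):.3f} bits per claim")
```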

2. cs.SD (Sound):

【1】 Beijing ZKJ-NPU Speaker Verification System for VoxCeleb Speaker Recognition Challenge 2021
Link: https://arxiv.org/abs/2109.03568

Authors: Li Zhang, Huan Zhao, Qinling Meng, Yanli Chen, Min Liu, Lei Xie
Affiliations: Northwestern Polytechnical University (NPU), Xi'an, China; Beijing ZKJ Technology Co., Ltd.
Abstract: In this report, we describe the Beijing ZKJ-NPU team submission to the VoxCeleb Speaker Recognition Challenge 2021 (VoxSRC-21). We participated in the fully supervised speaker verification track 1 and track 2. In the challenge, we explored various advanced neural network structures with different pooling layers and objective loss functions. In addition, we introduced the ResNet-DTCF, CoAtNet, and PyConv networks to advance the performance of our CNN-based speaker embedding model. Moreover, we applied embedding normalization and score normalization at the evaluation stage. By fusing 11 and 14 systems, our final best performances (minDCF/EER) on the evaluation trials are 0.1205/2.8160% and 0.1175/2.8400% for track 1 and track 2, respectively. With this submission, we placed second in the challenge on both tracks.
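
The abstract does not say which score normalization was used; the sketch below shows symmetric s-norm, a common choice in speaker verification, with made-up cohort scores standing in for real imposter trials.

```python
import numpy as np

def s_norm(score, enroll_cohort, test_cohort):
    # Symmetric normalization: z-normalize the raw trial score against both
    # the enrollment-side and test-side cohort score distributions.
    ze = (score - np.mean(enroll_cohort)) / np.std(enroll_cohort)
    zt = (score - np.mean(test_cohort)) / np.std(test_cohort)
    return 0.5 * (ze + zt)

rng = np.random.default_rng(0)
# Made-up cohort scores (each trial embedding scored against an imposter cohort).
enroll_cohort = rng.normal(0.10, 0.05, size=200)
test_cohort = rng.normal(0.12, 0.06, size=200)
print(s_norm(0.62, enroll_cohort, test_cohort))
```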

【2】 Time Alignment using Lip Images for Frame-based Electrolaryngeal Voice Conversion
Link: https://arxiv.org/abs/2109.03551

Authors: Yi-Syuan Liou, Wen-Chin Huang, Ming-Chi Yen, Shu-Wei Tsai, Yu-Huai Peng, Tomoki Toda, Yu Tsao, Hsin-Min Wang
Affiliations: Academia Sinica, Taiwan; Nagoya University, Japan; National Cheng Kung University Hospital, Taiwan
Notes: Accepted to APSIPA ASC 2021
Abstract: Voice conversion (VC) is an effective approach to electrolaryngeal (EL) speech enhancement, a task that aims to improve the quality of the artificial voice from an electrolarynx device. In frame-based VC methods, time alignment needs to be performed prior to model training, and the dynamic time warping (DTW) algorithm is widely adopted to compute the best time alignment between each utterance pair. Its validity rests on the assumption that the same phonemes from different speakers have similar features and can be mapped by measuring a pre-defined distance between speech frames of the source and the target. However, the special characteristics of EL speech can break this assumption, resulting in a sub-optimal DTW alignment. In this work, we propose to use lip images for time alignment, based on the assumption that the lip movements of laryngectomees remain normal compared to those of healthy people. We investigate two naive lip representations and distance metrics, and experimental results demonstrate that the proposed method significantly outperforms audio-only alignment in both objective and subjective evaluations.
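
For reference, a minimal DTW implementation is sketched below. The Euclidean frame distance is an assumption; the paper's contribution is to feed lip-image features rather than audio features into such an alignment.

```python
import numpy as np

def dtw_cost(x, y):
    # Classic dynamic programming DTW over two sequences of feature frames.
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(x[i - 1] - y[j - 1])   # frame-level distance
            D[i, j] = d + min(D[i - 1, j],            # insertion
                              D[i, j - 1],            # deletion
                              D[i - 1, j - 1])        # match
    return D[n, m]

# Toy usage: two random "utterances" of 40-dimensional frames.
rng = np.random.default_rng(0)
print(dtw_cost(rng.normal(size=(50, 40)), rng.normal(size=(60, 40))))
```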

【3】 A Review of Sound Source Localization with Deep Learning Methods
Link: https://arxiv.org/abs/2109.03465

Authors: Pierre-Amaury Grumiaux, Srđan Kitić, Laurent Girin, Alexandre Guérin
Notes: Submitted to IEEE Transactions on Audio, Speech, and Language Processing
Abstract: This article reviews deep learning methods for single and multiple sound source localization. We are particularly interested in sound source localization in indoor/domestic environments, where reverberation and diffuse noise are present. We provide an exhaustive topography of the neural-based localization literature in this context, organized according to several aspects: the neural network architecture, the type of input features, the output strategy (classification or regression), the types of data used for model training and evaluation, and the model training strategy. This way, an interested reader can easily comprehend the vast panorama of deep learning-based sound source localization methods. Tables summarizing the literature review are provided at the end for a quick search of methods with a given set of target characteristics.

【4】 Signal-domain representation of symbolic music for learning embedding spaces
Link: https://arxiv.org/abs/2109.03454

Authors: Mathieu Prang, Philippe Esling
Affiliations: IRCAM
Abstract: A key aspect of machine learning models lies in their ability to learn efficient intermediate features. However, the input representation plays a crucial role in this process, and polyphonic musical scores remain a particularly complex type of information. In this paper, we introduce a novel representation of symbolic music data, which transforms a polyphonic score into a continuous signal. We evaluate the ability to learn meaningful features from this representation from a musical point of view, introducing an evaluation method that relies on principled generation of synthetic data. Finally, to test our proposed representation, we conduct an extensive benchmark against recent polyphonic symbolic representations. We show that our signal-like representation leads to better reconstruction and disentangled features. This improvement is reflected in the metric properties and in the generation ability of the space learned from our signal-like representation, according to music theory properties.

【5】 Self-supervised Contrastive Cross-Modality Representation Learning for Spoken Question Answering
Link: https://arxiv.org/abs/2109.03381

Authors: Chenyu You, Nuo Chen, Yuexian Zou
Affiliations: Department of Electrical Engineering, Yale University, New Haven, CT, USA; ADSPLAB, School of ECE, Peking University, Shenzhen, China; Peng Cheng Laboratory, Shenzhen, China
Abstract: Spoken question answering (SQA) requires fine-grained understanding of both spoken documents and questions for optimal answer prediction. In this paper, we propose novel training schemes for spoken question answering with a self-supervised training stage and a contrastive representation learning stage. In the self-supervised stage, we propose three auxiliary self-supervised tasks, namely utterance restoration, utterance insertion, and question discrimination, and jointly train the model to capture consistency and coherence among speech documents without any additional data or annotations. We then propose to learn noise-invariant utterance representations with a contrastive objective by adopting multiple augmentation strategies, including span deletion and span substitution (see the sketch below). Besides, we design a Temporal-Alignment attention to semantically align the speech-text clues in the learned common space, benefiting the SQA tasks. By this means, the training schemes can more effectively guide the generation model to predict more proper answers. Experimental results show that our model achieves state-of-the-art results on three SQA benchmarks.
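
A minimal sketch of the two named augmentations, treating a spoken document as a list of utterance IDs; the span lengths and the substitution pool are assumptions.

```python
import random

random.seed(0)

def span_deletion(utts, max_span=3):
    # Remove a random contiguous span of utterances from the document.
    start = random.randrange(len(utts))
    end = min(len(utts), start + random.randint(1, max_span))
    return utts[:start] + utts[end:]

def span_substitution(utts, pool, max_span=3):
    # Replace a random contiguous span with utterances drawn from another document.
    start = random.randrange(len(utts))
    end = min(len(utts), start + random.randint(1, max_span))
    return utts[:start] + random.choices(pool, k=end - start) + utts[end:]

doc = ["u1", "u2", "u3", "u4", "u5"]
print(span_deletion(doc))
print(span_substitution(doc, pool=["r1", "r2", "r3"]))
```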

【6】 A Dual-Decoder Conformer for Multilingual Speech Recognition
Link: https://arxiv.org/abs/2109.03277

Authors: Krishna D N
Affiliations: Freshworks Inc.
Notes: 5 pages
Abstract: Transformer-based models have recently become very popular for sequence-to-sequence applications such as machine translation and speech recognition. This work proposes a dual-decoder transformer model for low-resource multilingual speech recognition for Indian languages. Our proposed model consists of a Conformer [1] encoder, two parallel transformer decoders, and a language classifier. We use a phoneme decoder (PHN-DEC) for the phoneme recognition task and a grapheme decoder (GRP-DEC) to predict the grapheme sequence along with language information. We consider phoneme recognition and language identification as auxiliary tasks in a multi-task learning framework and jointly optimize the network for the phoneme recognition, grapheme recognition, and language identification tasks with joint CTC-Attention [2] training. Our experiments show a significant reduction in WER over the baseline approaches, and that the dual-decoder approach obtains a significant improvement over the single-decoder approach.
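
A schematic of the joint multi-task objective implied by the abstract is sketched below; the interpolation weights and the averaging of the two decoder losses are assumptions, since the abstract does not specify them.

```python
def joint_loss(loss_ctc, loss_att_phn, loss_att_grp, loss_lid,
               lambda_ctc=0.3, alpha_lid=0.1):
    # Joint CTC-attention interpolation, with the phoneme- and grapheme-decoder
    # losses averaged and the language-ID loss added as an auxiliary term;
    # all weights here are illustrative assumptions.
    loss_att = 0.5 * (loss_att_phn + loss_att_grp)
    return lambda_ctc * loss_ctc + (1.0 - lambda_ctc) * loss_att + alpha_lid * loss_lid

# Toy per-batch loss values.
print(joint_loss(loss_ctc=2.1, loss_att_phn=1.4, loss_att_grp=1.6, loss_lid=0.7))
```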

【7】 Text-Free Prosody-Aware Generative Spoken Language Modeling
Link: https://arxiv.org/abs/2109.03264

Authors: Eugene Kharitonov, Ann Lee, Adam Polyak, Yossi Adi, Jade Copet, Kushal Lakhotia, Tu-Anh Nguyen, Morgane Rivière, Abdelrahman Mohamed, Emmanuel Dupoux, Wei-Ning Hsu
Affiliations: Facebook AI Research
Abstract: Speech pre-training has primarily demonstrated efficacy on classification tasks, while its capability of generating novel speech, similar to how GPT-2 can generate coherent paragraphs, has barely been explored. Generative Spoken Language Modeling (GSLM) (Lakhotia et al., 2021) is the only prior work addressing the generative aspects of speech pre-training; it replaces text with discovered phone-like units for language modeling and shows the ability to generate meaningful novel sentences. Unfortunately, despite eliminating the need for text, the units used in GSLM discard most of the prosodic information. Hence, GSLM fails to leverage prosody for better comprehension and does not generate expressive speech. In this work, we present a prosody-aware generative spoken language model (pGSLM). It is composed of a multi-stream transformer language model (MS-TLM) of speech, represented as discovered-unit and prosodic-feature streams, and an adapted HiFi-GAN model converting MS-TLM outputs to waveforms. We devise a series of metrics for prosody modeling and generation, and re-use metrics from GSLM for content modeling. Experimental results show that pGSLM can utilize prosody to improve both prosody and content modeling, and can also generate natural, meaningful, and coherent speech given a spoken prompt. Audio samples can be found at https://speechbot.github.io/pgslm.

【8】 Referee: Towards reference-free cross-speaker style transfer with low-quality data for expressive speech synthesis
Link: https://arxiv.org/abs/2109.03439

Authors: Songxiang Liu, Shan Yang, Dan Su, Dong Yu
Affiliations: Tencent AI Lab
Notes: 7 pages, preprint
Abstract: Cross-speaker style transfer (CSST) in text-to-speech (TTS) synthesis aims at transferring a speaking style to the synthesised speech in a target speaker's voice. Most previous CSST approaches rely on expensive high-quality data carrying the desired speaking style during training and require a reference utterance to obtain speaking style descriptors for conditioning the generation of a new sentence. This work presents Referee, a robust reference-free CSST approach for expressive TTS, which fully leverages low-quality data to learn speaking styles from text. Referee is built by cascading a text-to-style (T2S) model with a style-to-wave (S2W) model. Phonetic PosteriorGrams (PPGs) and phoneme-level pitch and energy contours are adopted as fine-grained speaking style descriptors, which are predicted from text using the T2S model. A novel pretrain-refinement method is adopted to learn a robust T2S model using only readily accessible low-quality data. The S2W model is trained with high-quality target data and effectively aggregates the style descriptors to generate high-fidelity speech in the target speaker's voice. Experimental results show that Referee outperforms a global-style-token (GST)-based baseline approach in CSST.

【9】 A New Non-Negative Matrix Co-Factorisation Approach for Noisy Neonatal Chest Sound Separation
Link: https://arxiv.org/abs/2109.03275

Authors: Ethan Grooby, Jinyuan He, Davood Fattahi, Lindsay Zhou, Arrabella King, Ashwin Ramanathan, Atul Malhotra, Guy A. Dumont, Faezeh Marzbanrad
Affiliations: Monash Children's Hospital and Department of Paediatrics, Monash University
Notes: 6 pages, 2 figures. To appear as a conference paper at the 43rd Annual International Conference of the IEEE Engineering in Medicine and Biology Society, 1-5 November 2021
Abstract: Obtaining high-quality heart and lung sounds enables clinicians to accurately assess a newborn's cardio-respiratory health and provide timely care. However, noisy chest sound recordings are common, hindering timely and accurate assessment. To address this problem, a new Non-negative Matrix Co-Factorisation-based approach is proposed to separate noisy chest sound recordings into heart, lung, and noise components. The method trains on 20 high-quality heart and lung sounds in parallel with separating the sounds of the noisy recording. It was tested on 68 ten-second noisy recordings containing both heart and lung sounds and compared to state-of-the-art Non-negative Matrix Factorisation methods. Results show significant improvements in heart and lung sound quality scores, and improved accuracy of 3.6 bpm and 1.2 bpm in heart rate and breathing rate estimation respectively, compared to existing methods.
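
For context, the sketch below implements plain NMF with Lee-Seung multiplicative updates, the family of methods the proposed co-factorisation extends, on a random stand-in for a magnitude spectrogram; the rank and iteration count are illustrative.

```python
import numpy as np

def nmf(V, rank=8, n_iter=200, eps=1e-9):
    # Lee-Seung multiplicative updates minimizing ||V - W @ H||_F,
    # keeping both factors non-negative throughout.
    rng = np.random.default_rng(0)
    n, m = V.shape
    W = rng.random((n, rank)) + eps
    H = rng.random((rank, m)) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

V = np.abs(np.random.default_rng(1).normal(size=(64, 100)))  # stand-in magnitude spectrogram
W, H = nmf(V)
print(np.linalg.norm(V - W @ H) / np.linalg.norm(V))  # relative reconstruction error
```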

【10】 Efficient conformer: Progressive downsampling and grouped attention for automatic speech recognition
Link: https://arxiv.org/abs/2109.01163

Authors: Maxime Burchi, Valentin Vielzeuf
Affiliations: Orange Labs, Cesson-Sévigné, France
Abstract: The recently proposed Conformer architecture has shown state-of-the-art performance in automatic speech recognition by combining convolution with attention to model both local and global dependencies. In this paper, we study how to reduce the Conformer architecture's complexity within a limited computing budget, leading to a more efficient architecture design that we call the Efficient Conformer. We introduce progressive downsampling in the Conformer encoder and propose a novel attention mechanism named grouped attention, which reduces attention complexity from $O(n^{2}d)$ to $O(n^{2}d/g)$ for sequence length $n$, hidden dimension $d$, and group size parameter $g$. We also experiment with strided multi-head self-attention as a global downsampling operation. Our experiments are performed on the LibriSpeech dataset with CTC and RNN-Transducer losses. We show that within the same computing budget, the proposed architecture achieves better performance with faster training and decoding than the Conformer. Our 13M-parameter CTC model achieves competitive WERs of 3.6%/9.0% without a language model and 2.7%/6.7% with an external n-gram language model on the test-clean/test-other sets, while being 29% faster than our CTC Conformer baseline at inference and 36% faster to train.
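
A minimal single-head sketch of grouped attention as described in the abstract: concatenating g consecutive frames into one token before scaled dot-product attention shrinks the score matrix from n x n to (n/g) x (n/g), giving the stated O(n^2 d / g) cost. Projections, masking, and multi-head splitting are omitted.

```python
import numpy as np

def grouped_attention(X, g):
    # X: (n, d) frame features; group g consecutive frames into one token.
    n, d = X.shape
    assert n % g == 0, "sequence length must be divisible by the group size"
    Xg = X.reshape(n // g, g * d)                    # (n/g, g*d) grouped tokens
    scores = Xg @ Xg.T / np.sqrt(g * d)              # (n/g, n/g) score matrix
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over keys
    out = weights @ Xg                               # attend over grouped tokens
    return out.reshape(n, d)                         # ungroup back to n frames

X = np.random.default_rng(0).normal(size=(16, 8))
print(grouped_attention(X, g=4).shape)               # (16, 8)
```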

3. eess.AS (Audio and Speech Processing): all nine papers in this section are cross-listed from the cs.SD section above, so each entry below gives only the title, link, and a pointer to the full entry.

【1】 Referee: Towards reference-free cross-speaker style transfer with low-quality data for expressive speech synthesis
Link: https://arxiv.org/abs/2109.03439 (see cs.SD entry 【8】 above)

【2】 A New Non-Negative Matrix Co-Factorisation Approach for Noisy Neonatal Chest Sound Separation
Link: https://arxiv.org/abs/2109.03275 (see cs.SD entry 【9】 above)

【3】 Beijing ZKJ-NPU Speaker Verification System for VoxCeleb Speaker Recognition Challenge 2021
Link: https://arxiv.org/abs/2109.03568 (see cs.SD entry 【1】 above)

【4】 Time Alignment using Lip Images for Frame-based Electrolaryngeal Voice Conversion
Link: https://arxiv.org/abs/2109.03551 (see cs.SD entry 【2】 above)

【5】 A Review of Sound Source Localization with Deep Learning Methods
Link: https://arxiv.org/abs/2109.03465 (see cs.SD entry 【3】 above)

【6】 Signal-domain representation of symbolic music for learning embedding spaces
Link: https://arxiv.org/abs/2109.03454 (see cs.SD entry 【4】 above)

【7】 Self-supervised Contrastive Cross-Modality Representation Learning for Spoken Question Answering
Link: https://arxiv.org/abs/2109.03381 (see cs.SD entry 【5】 above)

【8】 A Dual-Decoder Conformer for Multilingual Speech Recognition
Link: https://arxiv.org/abs/2109.03277 (see cs.SD entry 【6】 above)

【9】 Text-Free Prosody-Aware Generative Spoken Language Modeling
Link: https://arxiv.org/abs/2109.03264 (see cs.SD entry 【7】 above)

