Finance / Speech / Audio Processing arXiv Digest [11.24]

2021-11-25


q-fin (Quantitative Finance): 2 papers

cs.SD (Sound): 10 papers

eess.AS (Audio and Speech Processing): 10 papers (all cross-listed from cs.SD)

1. q-fin (Quantitative Finance):

【1】 Pricing cryptocurrencies: Modelling the ETHBTC spot-quotient variation as a diffusion process
Link: https://arxiv.org/abs/2111.11609

Authors: Sidharth Mallik
Notes: 6 tables, submitted to Journal of Computational Finance
Abstract: This research proposes a model for the intraday variation between the ETHBTC spot and the quotient of ETHUSDT and BTCUSDT traded on Binance. Under conditions of no arbitrage, perfect accuracy and no microstructure effects, the variation must equal its theoretically computed value of 0. We conduct our research on 4 years of data and find that the variation is not constantly 0: it fluctuates on either side of 0, and the deviations tend to be larger in the first year than in the remaining years. Testing the sample for the nature of its diffusion, we find evidence of mean reversion, and we model the variation as an Ornstein-Uhlenbeck process fitted by maximum likelihood. From the accuracy of the sampling distribution of the estimated parameters, we conclude that the variation can be accurately modelled as an Ornstein-Uhlenbeck process. The estimated long-term mean is negative and differs from the theoretical value of 0 at 1e-05 precision. We note these results in light of the efficiency of markets in pricing publicly known information.
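
As a reading aid, here is a minimal, hypothetical Python sketch (not the paper's code) of the estimation step described above: an Ornstein-Uhlenbeck process fitted by maximum likelihood via its exact AR(1) discretization, demonstrated on a simulated variation series with placeholder parameters.

```python
import numpy as np

def fit_ou_mle(x, dt):
    """ML fit of dX = theta*(mu - X) dt + sigma dW from a sampled path.
    A sampled OU path is an AR(1): X[t+1] = a + b*X[t] + eps with
    b = exp(-theta*dt), so OLS on consecutive samples gives the MLE."""
    x0, x1 = x[:-1], x[1:]
    b, a = np.polyfit(x0, x1, 1)                  # slope, intercept
    resid = x1 - (a + b * x0)
    theta = -np.log(b) / dt
    mu = a / (1.0 - b)
    sigma = np.sqrt(resid.var() * 2.0 * theta / (1.0 - b**2))
    return theta, mu, sigma

# Simulate a path with known (placeholder) parameters, then recover them.
rng = np.random.default_rng(0)
theta, mu, sigma = 5.0, -2e-5, 1e-3               # hypothetical values
dt, n = 1.0 / (24 * 60), 200_000                  # one-minute sampling
b = np.exp(-theta * dt)
sd = sigma * np.sqrt((1.0 - b**2) / (2.0 * theta))  # exact transition std
x = np.empty(n); x[0] = mu
for t in range(1, n):
    x[t] = mu + b * (x[t - 1] - mu) + sd * rng.standard_normal()
# theta and sigma are recovered closely; the tiny mu is noisier.
print(fit_ou_mle(x, dt))
```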

【2】 Semi-nonparametric Estimation of Operational Risk Capital with Extreme Loss Events
Link: https://arxiv.org/abs/2111.11459

Authors: Heng Z. Chen, Stephen R. Cosslett
Notes: 30 pages, including tables, figures, appendix and references. Presented at the MATLAB Annual Computational Finance Conference, September 27-30, 2021
Abstract: Operational risk modeling using parametric models can lead to a counter-intuitive estimate of the 99.9% value at risk used as economic capital, due to extreme events. To address this issue, a flexible semi-nonparametric (SNP) model is introduced using the change-of-variables technique to enrich the family of distributions that can be used for modeling extreme events. The SNP models are proved to have the same maximum domain of attraction (MDA) as their parametric kernels, and it follows that the SNP models are consistent with the extreme value theory peaks-over-threshold method, but with different shape and scale parameters. On simulated datasets generated from a mixture of distributions with varying body-tail thresholds, the SNP models in the Fréchet and Gumbel MDAs fit the datasets satisfactorily as the number of model parameters increases, yielding similar quantile estimates at 99.9%. When applied to an actual operational risk loss dataset from a major international bank, the SNP models yield a sensible capital estimate that is around 2 to 2.5 times as large as the single largest loss event.
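
The SNP construction itself is not reproduced in the abstract; as context, the following hypothetical Python sketch shows the extreme value theory peaks-over-threshold baseline it is compared against: fitting a generalized Pareto distribution to tail exceedances of simulated losses and reading off a 99.9% severity quantile. All numbers are placeholders.

```python
import numpy as np
from scipy.stats import genpareto, lognorm

# Simulated operational-loss severities (heavy-tailed lognormal stand-in).
rng = np.random.default_rng(1)
losses = lognorm(s=2.0, scale=5e4).rvs(20_000, random_state=rng)

# Peaks-over-threshold: fit a GPD to exceedances above a high threshold.
u = np.quantile(losses, 0.95)
exceed = losses[losses > u] - u
xi, _, beta = genpareto.fit(exceed, floc=0.0)     # shape xi, scale beta

# 99.9% severity quantile via the standard POT formula (assumes xi != 0):
# VaR_p = u + (beta/xi) * (((1-p)/zeta_u)**(-xi) - 1), zeta_u = P(X > u)
p, zeta_u = 0.999, (losses > u).mean()
var_999 = u + (beta / xi) * (((1 - p) / zeta_u) ** (-xi) - 1.0)
print(f"threshold={u:,.0f}  xi={xi:.3f}  VaR 99.9% ~ {var_999:,.0f}")
```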

2. cs.SD (Sound):

【1】 Romanian Speech Recognition Experiments from the ROBIN Project
Link: https://arxiv.org/abs/2111.12028

Authors: Andrei-Marius Avram, Vasile Păiş, Dan Tufiş
Affiliations: Research Institute for Artificial Intelligence, Romanian Academy
Notes: 12 pages, 3 figures, ConsILR 2020
Abstract: One of the fundamental functionalities for accepting a socially assistive robot is its ability to communicate with other agents in the environment. In the context of the ROBIN project, situational dialogue through voice interaction with a robot was investigated. This paper presents several speech recognition experiments with deep neural networks, focusing on producing fast models (under 100 ms latency from the network itself) that are still reliable. Even though low latency is one of the key desired characteristics, the final deep neural network model achieves state-of-the-art results in recognizing Romanian, obtaining a 9.91% word error rate (WER) when combined with a language model, thus improving over previous results while offering improved runtime performance. Additionally, we explore two modules for correcting the ASR output (hyphen and capitalization restoration, and unknown-word correction), targeting the ROBIN project's goal of dialogue in closed micro-worlds. We design a modular architecture based on APIs that allows an integration engine (either in the robot or external) to chain the available modules together as needed. Finally, we test the proposed design by integrating it into the RELATE platform and making the ASR service available to web users, who can either upload a file or record new speech.
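
The modular, API-based design lends itself to a simple sketch. The following hypothetical Python code (module names and behaviour are illustrative stand-ins, not the ROBIN implementation) shows how an integration engine might chain ASR post-processing modules such as capitalization restoration and unknown-word correction.

```python
from typing import Callable, List

Module = Callable[[str], str]   # each module maps text -> corrected text

def make_pipeline(modules: List[Module]) -> Module:
    """Chain post-processing modules in order, as an integration engine would."""
    def run(text: str) -> str:
        for module in modules:
            text = module(text)
        return text
    return run

def restore_capitalization(text: str) -> str:
    # Placeholder: a real module would call a trained model via its API.
    return ". ".join(s.strip().capitalize() for s in text.split("."))

def correct_unknown_words(text: str) -> str:
    lexicon = {"bucuresti": "București"}   # toy correction table
    return " ".join(lexicon.get(w.lower(), w) for w in text.split())

pipeline = make_pipeline([restore_capitalization, correct_unknown_words])
print(pipeline("salut am ajuns in bucuresti"))   # Salut am ajuns in București
```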

【2】 Longitudinal Speech Biomarkers for Automated Alzheimer's Detection
Link: https://arxiv.org/abs/2111.11859

Authors: Jordi Laguarta Soler, Brian Subirana
Notes: Article in Frontiers in Computer Science, Human-Media Interaction section
Abstract: We introduce a novel audio processing architecture, the Open Voice Brain Model (OVBM), improving detection accuracy for longitudinal Alzheimer's disease (AD) discrimination from spontaneous speech. We also outline the OVBM design methodology that led us to this architecture, which in general can incorporate multimodal biomarkers and simultaneously target several diseases and other AI tasks. Key to our methodology is the use of multiple biomarkers that complement each other: when two of them uniquely identify different subjects in a target disease, we say they are orthogonal. We illustrate the methodology by introducing 16 biomarkers, three of which are orthogonal, demonstrating simultaneous above state-of-the-art discrimination for apparently unrelated diseases such as AD and COVID-19. Inspired by research conducted at the MIT Center for Brains, Minds and Machines, OVBM combines biomarker implementations of the four modules of intelligence: the brain OS chunks and overlaps audio samples and aggregates biomarker features from the sensory stream and cognitive core, creating a multimodal graph neural network of symbolic compositional models for the target task. We apply it to AD, achieving above state-of-the-art accuracy of 93.8% on raw audio, while extracting a subject saliency map that longitudinally tracks relative disease progression using multiple biomarkers (16 in the reported AD task). The ultimate aim is to help medical practice by detecting onset and treatment impact so that intervention options can be tested longitudinally. Using the OVBM design methodology, we introduce a novel lung and respiratory tract biomarker created using over 200,000 cough samples to pre-train a model discriminating the cultural origin of coughs. This cough dataset sets a new benchmark as the largest audio health dataset, with 30,000 subjects participating as of April 2020, and demonstrates cough cultural bias for the first time.
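
The paper's notion of orthogonal biomarkers (two biomarkers each uniquely identifying different subjects) can be illustrated with a toy example. The sketch below uses hypothetical subject sets and idealizes orthogonality as disjointness of the correctly-identified subject sets, so combining orthogonal biomarkers strictly increases coverage.

```python
# Toy illustration of the "orthogonal biomarkers" notion defined above.
# Each biomarker is scored by the set of subjects it identifies correctly
# (hypothetical data); orthogonal biomarkers identify different subjects.
subjects = set(range(10))
identified_by = {
    "biomarker_A": {0, 1, 2, 3},
    "biomarker_B": {4, 5, 6},     # disjoint from A -> orthogonal to A
    "biomarker_C": {2, 3, 4},     # overlaps A and B -> not orthogonal
}

def orthogonal(a: str, b: str) -> bool:
    return identified_by[a].isdisjoint(identified_by[b])

print(orthogonal("biomarker_A", "biomarker_B"))        # True
covered = identified_by["biomarker_A"] | identified_by["biomarker_B"]
print(f"coverage with A+B: {len(covered)}/{len(subjects)} subjects")
```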

【3】 Upsampling layers for music source separation
Link: https://arxiv.org/abs/2111.11773

Authors: Jordi Pons, Joan Serrà, Santiago Pascual, Giulio Cengarle, Daniel Arteaga, Davide Scaini
Affiliations: Dolby Laboratories
Notes: Demo page: this http URL
Abstract: Upsampling artifacts are caused by problematic upsampling layers and by spectral replicas that emerge while upsampling. Depending on the upsampling layer used, such artifacts can be either tonal artifacts (additive high-frequency noise) or filtering artifacts (subtractive, attenuating some bands). In this work we investigate the practical implications of upsampling artifacts in the resulting audio by studying how different artifacts interact and assessing their impact on the models' performance. To that end, we benchmark a large set of upsampling layers for music source separation: different transposed and subpixel convolution setups, different interpolation upsamplers (including two novel layers based on stretch and sinc interpolation), and different wavelet-based upsamplers (including a novel learnable wavelet layer). Our results show that filtering artifacts, associated with interpolation upsamplers, are perceptually preferable, even though they tend to achieve worse objective scores.
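
To make the two artifact families concrete, here is a minimal PyTorch sketch of the two upsampling layer types discussed above; channel counts and kernel sizes are illustrative assumptions, not the paper's benchmarked configurations.

```python
import torch
import torch.nn as nn

# 1) Transposed convolution: learnable stride-2 upsampling. Depending on the
#    kernel/stride configuration it can introduce tonal (additive
#    high-frequency) artifacts.
transposed = nn.ConvTranspose1d(in_channels=32, out_channels=32,
                                kernel_size=3, stride=2, padding=1,
                                output_padding=1)

# 2) Interpolation upsampler: resample first, then convolve; this family tends
#    to produce filtering (band-attenuating) artifacts instead of tonal ones.
interp = nn.Sequential(
    nn.Upsample(scale_factor=2, mode="linear", align_corners=False),
    nn.Conv1d(32, 32, kernel_size=3, padding=1),
)

x = torch.randn(1, 32, 1024)                 # (batch, channels, time)
print(transposed(x).shape, interp(x).shape)  # both -> torch.Size([1, 32, 2048])
```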

【4】 Guided-TTS: Text-to-Speech with Untranscribed Speech
Link: https://arxiv.org/abs/2111.11755

Authors: Heeseung Kim, Sungwon Kim, Sungroh Yoon
Affiliations: Data Science and AI Lab., Seoul National University; Interdisciplinary Program in Artificial Intelligence, Seoul National University
Abstract: Most neural text-to-speech (TTS) models require paired data from the desired speaker for high-quality speech synthesis, which limits the use of large amounts of untranscribed data for training. In this work, we present Guided-TTS, a high-quality TTS model that learns to generate speech from untranscribed speech data. Guided-TTS combines an unconditional diffusion probabilistic model with a separately trained phoneme classifier for text-to-speech. By modeling the unconditional distribution of speech, our model can utilize untranscribed data for training. For text-to-speech synthesis, we guide the generative process of the unconditional DDPM via phoneme classification to produce mel-spectrograms from the conditional distribution given a transcript. We show that Guided-TTS achieves performance comparable to existing methods without any transcript for LJSpeech. Our results further show that a single speaker-dependent phoneme classifier trained on multi-speaker large-scale data can guide unconditional DDPMs for various speakers to perform TTS.
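
The guidance step admits a compact sketch. The code below shows generic classifier guidance for a DDPM in PyTorch: the unconditional score is shifted by the gradient of the phoneme classifier's log-likelihood of the target transcript. Shapes, models, and the gradient scale are hypothetical assumptions; Guided-TTS's exact scaling scheme is not specified in the abstract.

```python
import torch

def guided_score(x_t, t, score_model, phoneme_clf, target_phonemes, scale=1.0):
    """One guided reverse-diffusion score evaluation (generic classifier
    guidance, not the authors' implementation).
    x_t: noisy mel-spectrogram (B, T, n_mels); target_phonemes: (B, T) long."""
    x_t = x_t.detach().requires_grad_(True)
    log_probs = phoneme_clf(x_t, t).log_softmax(dim=-1)    # (B, T, n_phonemes)
    # Log-likelihood of the target phoneme at each frame, summed over batch.
    ll = log_probs.gather(-1, target_phonemes.unsqueeze(-1)).sum()
    grad = torch.autograd.grad(ll, x_t)[0]                 # d log p(y|x_t) / dx_t
    # Unconditional score nudged toward the transcript's conditional distribution.
    return score_model(x_t, t) + scale * grad
```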

【5】 ADTOF: A large dataset of non-synthetic music for automatic drum transcription
Link: https://arxiv.org/abs/2111.11737

Authors: Mickael Zehren, Marco Alunno, Paolo Bientinesi
Affiliations: Universidad EAFIT Medellín, Umeå Universitet
Notes: Proceedings of the 22nd International Society for Music Information Retrieval Conference (ISMIR), Online, pp. 818-824
Abstract: The state-of-the-art methods for drum transcription in the presence of melodic instruments (DTM) are machine learning models trained in a supervised manner, which means that they rely on labeled datasets. The problem is that the available public datasets are limited either in size or in realism, and are thus suboptimal for training purposes. Indeed, the best results are currently obtained via a rather convoluted multi-step training process that involves both real and synthetic datasets. To address this issue, starting from the observation that communities of rhythm-game players provide a large amount of annotated data, we curated a new dataset of crowdsourced drum transcriptions. This dataset contains real-world music, is manually annotated, and is about two orders of magnitude larger than any other non-synthetic dataset, making it a prime candidate for training purposes. However, due to crowdsourcing, the initial annotations contain mistakes. We discuss how the quality of the dataset can be improved by automatically correcting different types of mistakes. When used to train a popular DTM model, the dataset yields performance that matches the state of the art for DTM, thus demonstrating the quality of the annotations.

【6】 A Contextual Latent Space Model: Subsequence Modulation in Melodic Sequence
Link: https://arxiv.org/abs/2111.11703

Authors: Taketo Akama
Affiliations: Sony Computer Science Laboratories, Tokyo, Japan
Notes: 22nd International Society for Music Information Retrieval Conference (ISMIR), 2021; 8 pages
Abstract: Some generative models for sequences such as music and text allow us to edit only subsequences, given surrounding context sequences, which plays an important part in steering generation interactively. However, editing subsequences mainly involves randomly resampling subsequences from a possible generation space. We propose a contextual latent space model (CLSM) so that users can explore subsequence generation with a sense of direction in the generation space, e.g., interpolation, as well as exploring variations: semantically similar possible subsequences. A context-informed prior and decoder constitute the generative model of CLSM, and a context position-informed encoder is the inference model. In experiments, we use a monophonic symbolic music dataset, demonstrating that our contextual latent space is smoother in interpolation than baselines and that the quality of generated samples is superior to baseline models. The generation examples are available online.
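
Directed exploration via interpolation, as mentioned above, reduces to decoding points along a path between two latent codes. A minimal hypothetical sketch follows; the CLSM encoder/decoder and the latent dimensionality are placeholders.

```python
import numpy as np

def lerp(z0, z1, steps):
    """Linear interpolation between two latent codes z0 and z1."""
    return [z0 + a * (z1 - z0) for a in np.linspace(0.0, 1.0, steps)]

z0, z1 = np.random.randn(64), np.random.randn(64)   # two subsequence codes
path = lerp(z0, z1, steps=5)
# In the full model, each interpolant would be decoded given the context:
# for z in path: subsequence = decoder(z, context)
print(len(path), path[0].shape)                     # 5 (64,)
```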

【7】 Music Classification: Beyond Supervised Learning, Towards Real-world Applications
Link: https://arxiv.org/abs/2111.11636

Authors: Minz Won, Janne Spijkervet, Keunwoo Choi
Notes: This is a web book written for a tutorial session of the 22nd International Society for Music Information Retrieval Conference, Nov 8-12, 2021. Please visit this https URL for the original web book format.
Abstract: Music classification is a music information retrieval (MIR) task to classify music items into labels such as genre, mood, and instruments. It is also closely related to other concepts such as music similarity and musical preference. In this tutorial, we focus on two directions: recent training schemes beyond supervised learning, and the successful application of music classification models. The target audience for this web book is researchers and practitioners who are interested in state-of-the-art music classification research and in building real-world applications. We assume the audience is familiar with basic machine learning concepts. In this book, we present three lectures: 1. Music classification overview: task definition, applications, existing approaches, datasets; 2. Beyond supervised learning: semi- and self-supervised learning for music classification; 3. Towards real-world applications: less-discussed yet practically important research issues.

【8】 Dataset of Spatial Room Impulse Responses in a Variable Acoustics Room for Six Degrees-of-Freedom Rendering and Analysis
Link: https://arxiv.org/abs/2111.11882

Authors: Thomas McKenzie, Leo McCormack, Christoph Hold
Affiliations: Department of Signal Processing and Acoustics, Aalto University, Espoo, Finland
Notes: 3 pages, 3 figures, 2 tables
Abstract: Room acoustics measurements are used in many areas of audio research, from physical acoustics modelling and speech enhancement to virtual reality applications. This paper documents the technical specifications and choices made in the measurement of a dataset of spatial room impulse responses (SRIRs) in a variable acoustics room. Two spherical microphone arrays are used: the mh Acoustics Eigenmike em32 and the Zylia ZM-1, capable of up to fourth- and third-order Ambisonic capture, respectively. The dataset consists of three source and seven receiver positions, repeated with five configurations of the room's acoustics with varying levels of reverberation. Possible applications of the dataset include six degrees-of-freedom (6DoF) analysis and rendering, SRIR interpolation methods, and spatial dereverberation techniques.

【9】 SpeechMoE2: Mixture-of-Experts Model with Improved Routing
Link: https://arxiv.org/abs/2111.11831

Authors: Zhao You, Shulin Feng, Dan Su, Dong Yu
Affiliations: Tencent AI Lab, Shenzhen, China; Tencent AI Lab, Bellevue, WA, USA
Notes: 5 pages, 1 figure. Submitted to ICASSP 2022
Abstract: Mixture-of-experts based acoustic models with dynamic routing mechanisms have shown promising results for speech recognition. The design of the router architecture is important for achieving large model capacity and high computational efficiency. Our previous work, SpeechMoE, only uses local grapheme embeddings to help routers make routing decisions. To further improve speech recognition performance across varying domains and accents, we propose a new router architecture that integrates additional global domain and accent embeddings into the router input to promote adaptability. Experimental results show that the proposed SpeechMoE2 achieves lower character error rate (CER) than SpeechMoE with comparable parameters on both multi-domain and multi-accent tasks. Specifically, the proposed method provides up to 1.6%-4.8% relative CER improvement on the multi-domain task and 1.9%-17.7% relative CER improvement on the multi-accent task. Besides, increasing the number of experts achieves consistent performance improvement while keeping the computational cost constant.
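
A minimal PyTorch sketch of the routing idea described above: local frame-level embeddings are concatenated with global domain and accent embeddings before the gating layer. Dimensions and top-1 gating are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class GlobalInfoRouter(nn.Module):
    """Router whose input combines local grapheme embeddings with global
    domain and accent embeddings (hypothetical sizes)."""
    def __init__(self, d_local=512, d_domain=64, d_accent=64, n_experts=8):
        super().__init__()
        self.gate = nn.Linear(d_local + d_domain + d_accent, n_experts)

    def forward(self, local_emb, domain_emb, accent_emb):
        # local_emb: (B, T, d_local); domain/accent: (B, d_*) global vectors
        B, T, _ = local_emb.shape
        glob = torch.cat([domain_emb, accent_emb], dim=-1)
        glob = glob.unsqueeze(1).expand(B, T, -1)          # broadcast over time
        logits = self.gate(torch.cat([local_emb, glob], dim=-1))
        weights = logits.softmax(dim=-1)                   # (B, T, n_experts)
        expert = weights.argmax(dim=-1)                    # top-1 routing choice
        return weights, expert

router = GlobalInfoRouter()
w, e = router(torch.randn(2, 50, 512), torch.randn(2, 64), torch.randn(2, 64))
print(w.shape, e.shape)   # torch.Size([2, 50, 8]) torch.Size([2, 50])
```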

【10】 Effect of noise suppression losses on speech distortion and ASR performance
Link: https://arxiv.org/abs/2111.11606

Authors: Sebastian Braun, Hannes Gamper
Affiliations: Microsoft Research, Redmond, WA, USA
Notes: Submitted to ICASSP 2022
Abstract: Deep learning based speech enhancement has developed rapidly towards improving quality, while models are becoming more compact and usable for real-time on-the-edge inference. However, speech quality scales directly with model size, and small models are often still unable to achieve sufficient quality. Furthermore, the introduced speech distortion and artifacts greatly harm speech quality and intelligibility, and often significantly degrade automatic speech recognition (ASR) rates. In this work, we shed light on the success of the spectral complex compressed mean squared error (MSE) loss and on how its magnitude and phase-aware terms relate to the trade-off between speech distortion and noise reduction. We further investigate integrating pre-trained reference-less predictors of mean opinion score (MOS) and word error rate (WER), as well as embeddings pre-trained on ASR and sound event detection. Our analyses reveal that none of the pre-trained networks added significant performance over the strong spectral loss.
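
The loss under analysis has a compact form. Below is a hypothetical PyTorch sketch of a spectral complex compressed MSE loss of the kind discussed: a magnitude-compressed complex (phase-aware) term blended with a magnitude-only term. The compression exponent, blend weight, and exact weighting convention are assumptions and may differ from the paper's settings.

```python
import torch

def compressed_spectral_mse(S, S_hat, c=0.3, alpha=0.3):
    """Spectral complex compressed MSE (sketch of the loss family above).
    S, S_hat: complex STFTs of clean and enhanced speech, shape (B, F, T)."""
    mag, mag_hat = S.abs() + 1e-12, S_hat.abs() + 1e-12
    # Magnitude-compressed complex spectra: |S|^c * e^{j*phase(S)}.
    S_c = mag**c * (S / mag)
    S_hat_c = mag_hat**c * (S_hat / mag_hat)
    phase_aware = (S_c - S_hat_c).abs().pow(2).mean()      # complex (phase) term
    magnitude = (mag**c - mag_hat**c).pow(2).mean()        # magnitude-only term
    return alpha * phase_aware + (1.0 - alpha) * magnitude

x = torch.randn(2, 257, 100, dtype=torch.complex64)
y = torch.randn(2, 257, 100, dtype=torch.complex64)
print(compressed_spectral_mse(x, y))
```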

3. eess.AS (Audio and Speech Processing):

All ten entries in this category are cross-listed from cs.SD above and identical to those entries, so only titles and links are repeated here:

【1】 Dataset of Spatial Room Impulse Responses in a Variable Acoustics Room for Six Degrees-of-Freedom Rendering and Analysis. Link: https://arxiv.org/abs/2111.11882 (see cs.SD 【8】)

【2】 SpeechMoE2: Mixture-of-Experts Model with Improved Routing. Link: https://arxiv.org/abs/2111.11831 (see cs.SD 【9】)

【3】 Effect of noise suppression losses on speech distortion and ASR performance. Link: https://arxiv.org/abs/2111.11606 (see cs.SD 【10】)

【4】 Romanian Speech Recognition Experiments from the ROBIN Project. Link: https://arxiv.org/abs/2111.12028 (see cs.SD 【1】)

【5】 Longitudinal Speech Biomarkers for Automated Alzheimer's Detection. Link: https://arxiv.org/abs/2111.11859 (see cs.SD 【2】)

【6】 Upsampling layers for music source separation. Link: https://arxiv.org/abs/2111.11773 (see cs.SD 【3】)

【7】 Guided-TTS: Text-to-Speech with Untranscribed Speech. Link: https://arxiv.org/abs/2111.11755 (see cs.SD 【4】)

【8】 ADTOF: A large dataset of non-synthetic music for automatic drum transcription. Link: https://arxiv.org/abs/2111.11737 (see cs.SD 【5】)

【9】 A Contextual Latent Space Model: Subsequence Modulation in Melodic Sequence. Link: https://arxiv.org/abs/2111.11703 (see cs.SD 【6】)

【10】 Music Classification: Beyond Supervised Learning, Towards Real-world Applications. Link: https://arxiv.org/abs/2111.11636 (see cs.SD 【7】)

